Scale Data Science by Pandas API on Spark, a talk by Xinrong Meng

Saturday 16 September 11:10 (30 minutes)

__floor__

As Python has become the go-to language for data science, pandas has quickly evolved into a standard library in the field. However, one key drawback of pandas is its inability to linearly scale with increasing data volumes, primarily due to its reliance on single-machine processing. Pandas API on Spark addresses this issue, empowering users to handle vast datasets by leveraging Apache Spark while preserving the pandas APIs.

In this talk, I will introduce the Pandas API on Spark, explain how it enables the scaling of data science workloads, and explore the reasons behind its highly optimized performance. By the end of the session, you will have the knowledge to scale your existing data science workloads seamlessly using this powerful tool.
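As a rough illustration of what "preserving the pandas APIs" can look like in practice, here is a minimal sketch; it is not taken from the talk itself. The column names, the sample data, and the helpers `revenue_by_region` and `run_on_spark` are hypothetical. The Spark path assumes PySpark 3.2 or later, where the `pyspark.pandas` module is available, plus a working Java runtime.

```python
import pandas as pd


def revenue_by_region(df):
    """Sum revenue per region.

    Written against the pandas API only, so it should accept either a
    pandas DataFrame or a pandas-on-Spark DataFrame (hypothetical helper).
    """
    return df.groupby("region")["revenue"].sum().sort_index()


# Small, made-up dataset that fits comfortably on a single machine.
pdf = pd.DataFrame(
    {
        "region": ["eu", "us", "eu", "us"],
        "revenue": [10, 20, 30, 40],
    }
)

# Plain pandas: runs locally, no Spark required.
local_result = revenue_by_region(pdf)


def run_on_spark(pandas_df):
    """Run the same aggregation through pandas API on Spark.

    The import is lazy so that the pandas-only path above works on a
    machine without PySpark or a JVM installed.
    """
    import pyspark.pandas as ps

    psdf = ps.from_pandas(pandas_df)  # distribute the data via Spark
    # Same function, same pandas-style calls; convert back for display.
    return revenue_by_region(psdf).to_pandas()
```

The aggregation is written once and reused unchanged; only the kind of DataFrame passed in differs. Calling `run_on_spark(pdf)` starts a local Spark session, so it is left out of the default path here.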

What do you need to know to enjoy this talk?

Python level

You can write basic scripts.

About the topic

No previous knowledge of the topic is required; basic concepts will be explained.

Xinrong Meng

I am an Apache Spark PMC (Project Management Committee) Member and Committer, with deep technical expertise in PySpark. I am one of the main contributors to the Pandas API on Spark. I work as a software engineer at Databricks.

I speak at industry-leading conferences such as PyData Global and Data+AI Summit. I also created and actively maintain a Knowledge Sharing GitHub repository.

I dance and surf.