As Python has become the go-to language for data science, pandas has become a standard tool in the field. A key drawback of pandas, however, is that it does not scale with growing data volumes, because it processes data on a single machine. Pandas API on Spark addresses this issue by letting users handle very large datasets with Apache Spark while keeping the pandas API.
In this talk, I will introduce the Pandas API on Spark, explain how it enables data science workloads to scale, and explore the reasons behind its performance. By the end of the session, you will know how to scale your existing data science workloads with this tool.
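To give a flavor of what "preserving the pandas APIs" means in practice, here is a minimal sketch rather than material from the talk itself. The column names and the helper `mean_price_by_category` are hypothetical, chosen only for illustration. The Spark part assumes PySpark 3.2 or later and a Java runtime are installed, so it is guarded and skipped if they are not available.

```python
import pandas as pd


def mean_price_by_category(df):
    """Hypothetical helper: uses only the pandas-style groupby API."""
    return df.groupby("category")["price"].mean()


# Toy data, made up for this example.
data = {"category": ["a", "b", "a", "b"], "price": [1.0, 2.0, 3.0, 4.0]}

# Plain pandas: runs in memory on a single machine.
pdf = pd.DataFrame(data)
pandas_result = mean_price_by_category(pdf).sort_index()

# Pandas API on Spark: the same helper, but the work is executed by Spark.
# Guarded with a broad except because it needs pyspark and a Java runtime.
try:
    import pyspark.pandas as ps

    psdf = ps.from_pandas(pdf)  # distribute the pandas DataFrame
    spark_result = mean_price_by_category(psdf).to_pandas().sort_index()
except Exception:
    spark_result = None  # Spark not available in this environment

print(pandas_result)
print(spark_result)
```

The point of the sketch is that the helper does not need to know which library created the DataFrame it receives. For real workloads the data would typically be read directly by Spark instead of being converted from an in-memory pandas DataFrame.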
What do you need to know to enjoy this talk
Python level
You can write basic scripts.
About the topic
No previous knowledge of the topic is required; basic concepts will be explained.