DataFrames: scaling up and out
a talk by Ondřej Kokeš
DataFrames have become ubiquitous when it comes to fast analyses of complex data. They go beyond SQL by not adhering to a strict schema and offer a rich API, where you chain methods, which fosters exploratory analytics.
While newcomers to Python usually learn about pandas early on, they sometimes struggle as their underlying data grow in size. Given the in-memory nature of pandas' storage system, one can usually only scale up.
I'd like to outline several workflows for adapting to the ever-increasing size of datasets:
- Changing application logic to handle streams rather than loading the whole dataset into memory.
- Actually scaling up – locally by buying more memory and/or faster disk drives, or by deploying servers in the cloud and SSH tunneling to remote Jupyter instances.
- Scaling your data source and utilizing pandas' SQL connector. This will help in other areas as well (e.g. direct connections in BI).
- Using a distributed DataFrame engine – Dask or PySpark. These scale from laptops to large clusters, using the very same API the whole way through.
I will cover the various differences between these approaches and will outline their set of upsides (e.g. scaling and performance) and downsides (DevOps difficulties, cost).
This talk is suitable for both beginner and advanced Pythonistas.
Economist by training, analyst and engineer by trade.
I have been working with DataFrames over the past five years and faced issues scaling my workflows when dealing with data on the economy, energy, marketing, transactions, user behaviour, and a number of other areas, ranging from megabytes to terabytes in size.
I will try and outline the various ways one can tackle the issue of scale while keeping their usage patterns largely unchanged.