PyCon CZ

PyCon CZ 23
15–17 September
Prague

Introduction to Data Analysis Using Pandas: a workshop with Stefanie Molin

Sunday 17 September 10:00 (3 hours)
Room 347

Working with data can be challenging: it often doesn’t come in the best format for analysis, and understanding it well enough to extract insights requires both time and the skills to filter, aggregate, reshape, and visualize it. This session will equip you with the knowledge you need to effectively use pandas – a powerful library for data analysis in Python – to make this process easier.

Pandas makes it possible to work with tabular data and perform all parts of the analysis, from collection and manipulation through aggregation and visualization. While most of this session focuses on pandas, our discussion of visualization will also introduce two other libraries at a high level: matplotlib (the library that pandas uses for its visualization features, which, when used directly, makes it possible to create custom layouts, add annotations, etc.) and seaborn (another plotting library, which features additional plot types and the ability to visualize long-format data).

Section 1: Getting Started with Pandas

We will begin by introducing the Series, DataFrame, and Index classes, which are the basic building blocks of the pandas library, and showing how to work with them. By the end of this section, you will be able to create DataFrames and perform operations on them to inspect and filter the data.
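As a rough illustration of these building blocks (not taken from the workshop materials), the sketch below builds a small DataFrame, pulls out a Series and the Index objects, and then inspects and filters the data. The column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame(
    {
        "city": ["Prague", "Brno", "Ostrava", "Plzen"],
        "temp_c": [21.5, 19.0, 18.2, 20.1],
        "rainy": [False, True, True, False],
    }
)

# A single column is a Series; row and column labels are Index objects.
temps = df["temp_c"]
print(type(temps).__name__)
print(df.index)
print(df.columns)

# Inspect the data: first rows, column data types, summary statistics.
print(df.head(2))
print(df.dtypes)
print(df.describe())

# Filter rows with a Boolean mask: warm days without rain.
warm_and_dry = df[(df["temp_c"] > 20) & (~df["rainy"])]
print(warm_and_dry)
```

Each condition in the mask is wrapped in parentheses because `&` and `~` bind more tightly than comparison operators in Python.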

Section 2: Data Wrangling

To prepare our data for analysis, we need to perform data wrangling. In this section, we will learn how to clean and reformat data (e.g. renaming columns, fixing data type mismatches), restructure/reshape it, and enrich it (e.g. discretizing columns, calculating aggregations, combining data sources).
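The wrangling steps named above could look roughly like the following sketch. It is only one possible arrangement, and the station data, column names, and bin edges are hypothetical values chosen for the example.

```python
import pandas as pd

# Hypothetical raw data: awkward column names and numbers stored as strings.
raw = pd.DataFrame(
    {
        "Station ID": ["A", "A", "B", "B"],
        "Date": ["2023-09-15", "2023-09-16", "2023-09-15", "2023-09-16"],
        "TMIN": ["10.5", "11.0", "8.0", "9.5"],
        "TMAX": ["21.0", "23.5", "18.0", "19.5"],
    }
)

# Clean: rename columns and fix data type mismatches.
clean = raw.rename(
    columns={"Station ID": "station", "Date": "date", "TMIN": "tmin", "TMAX": "tmax"}
)
clean = clean.assign(
    date=pd.to_datetime(clean["date"]),
    tmin=clean["tmin"].astype(float),
    tmax=clean["tmax"].astype(float),
)

# Reshape: go from wide format (one column per measurement) to long format.
long_format = clean.melt(
    id_vars=["station", "date"],
    value_vars=["tmin", "tmax"],
    var_name="measurement",
    value_name="temp_c",
)

# Enrich: discretize a numeric column into labeled bins (bin edges are arbitrary).
clean["tmax_band"] = pd.cut(clean["tmax"], bins=[0, 20, 30], labels=["mild", "warm"])

# Enrich: calculate an aggregation per station.
mean_tmax = clean.groupby("station")["tmax"].mean()

# Enrich: combine with a second (hypothetical) data source.
stations = pd.DataFrame(
    {"station": ["A", "B"], "name": ["Hypothetical North", "Hypothetical South"]}
)
enriched = clean.merge(stations, on="station", how="left")

print(long_format)
print(mean_tmax)
print(enriched)
```

A left merge keeps every row of the measurements even if a station were missing from the lookup table, which makes unmatched keys easy to spot as missing values.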

Section 3: Data Visualization

The human brain excels at finding patterns in visual representations of data, so in this section we will learn how to visualize data using pandas, along with the matplotlib and seaborn libraries for additional features. We will create a variety of visualizations that will help us better understand our data.
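As a minimal sketch of how pandas and matplotlib fit together (seaborn is left out here to keep the dependencies small), the example below plots a made-up DataFrame through pandas and then uses matplotlib directly to label an axis and add an annotation. The data and labels are assumptions for illustration only.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside a Jupyter Notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: daily temperatures indexed by day of the month.
df = pd.DataFrame(
    {"tmin": [10.5, 11.0, 9.0], "tmax": [21.0, 23.5, 19.0]},
    index=pd.Index([15, 16, 17], name="day"),
)

# pandas draws onto a matplotlib Axes, so both libraries can share one figure.
fig, ax = plt.subplots(figsize=(6, 3))
df.plot(ax=ax, marker="o", title="Daily temperatures (made-up data)")

# Customize with matplotlib directly: axis label and an annotation.
ax.set_ylabel("Temperature in degrees Celsius")
hottest_day = df["tmax"].idxmax()
ax.annotate(
    "warmest day",
    xy=(hottest_day, df.loc[hottest_day, "tmax"]),
    xytext=(0, 10),
    textcoords="offset points",
    ha="center",
)
```

Passing `ax=ax` to `DataFrame.plot` is what allows the later matplotlib calls to modify the same plot.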

Requirements

Bring your laptop with the virtual environment for the session already installed (see the full setup instructions).

All code examples will be presented using Jupyter Notebooks.

Prerequisites

Attendees should have basic knowledge of Python and be comfortable working in Jupyter Notebooks.

What you need to know to enjoy this workshop

Python level

Medium knowledge: You use frameworks and third-party libraries.

About the topic

You have used or worked with it only a few times.

Stefanie Molin

I am a software engineer and data scientist at Bloomberg in New York City, where I tackle tough problems in information security, particularly those revolving around data wrangling/visualization, building tools for gathering data, and knowledge sharing. I am also the author of Hands-On Data Analysis with Pandas, which is currently in its second edition and has been translated into Korean. I hold a bachelor of science degree in operations research from Columbia University's Fu Foundation School of Engineering and Applied Science, as well as a master’s degree in computer science, with a specialization in machine learning, from Georgia Tech. In my free time, I enjoy traveling the world, inventing new recipes, and learning new languages spoken among both people and computers.