Grouping and Aggregating Data with pandas: A Complete Guide

YouTube video ID: txMdrV1Ut64

Source: YouTube video by Corey Schafer

Introduction

This article walks through the essential techniques for grouping and aggregating data with pandas. By the end you'll be able to answer questions such as what a typical developer salary is, which social-media platform is most popular in each country, and what percentage of developers know Python, all without needing to watch the original video.

Basic Aggregations

  • What is aggregation? Combining multiple values into a single result (e.g., mean, median, mode).
  • Median vs. mean – Median is robust to outliers; the median salary in the developer survey is about $57,000, while the mean (≈ $127,000) is skewed by a few very high salaries.
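  • DataFrame.median() – Returns the median of each numeric column (age, weekly work hours, etc.). In pandas 2.0 and later, pass numeric_only=True when the DataFrame also contains non-numeric columns; otherwise the call raises an error.
  • DataFrame.describe() – Provides count, mean, std, min, the 25th, 50th (median), and 75th percentiles, and max in one call.
  • Series.value_counts() – Counts occurrences of each unique value (useful for yes/no questions, hobbyist status, or social-media preferences). Adding normalize=True returns proportions (fractions of the total) instead of raw counts; multiply by 100 to get percentages.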
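The calls above can be sketched on a tiny made-up table. The column names (converted_comp, hobbyist) and all values are assumptions for illustration only; they are not the real survey data.

```python
import pandas as pd

# Hypothetical toy survey; names and values are invented for illustration.
df = pd.DataFrame({
    "country": ["United States", "India", "India", "Germany", "United States"],
    "converted_comp": [60_000.0, 10_000.0, None, 55_000.0, 2_000_000.0],
    "hobbyist": ["Yes", "No", "Yes", "Yes", "Yes"],
})

salary = df["converted_comp"]
median_salary = salary.median()  # missing value is skipped; the outlier barely matters
mean_salary = salary.mean()      # pulled upward by the single 2,000,000 entry

# numeric_only=True keeps text columns out of the calculation.
numeric_medians = df.median(numeric_only=True)

# count, mean, std, min, quartiles, and max for the numeric columns
summary = df.describe()

hobby_counts = df["hobbyist"].value_counts()               # raw counts
hobby_share = df["hobbyist"].value_counts(normalize=True)  # fractions that sum to 1

print(median_salary, mean_salary)
print(hobby_share * 100)  # percentages
```

With these invented numbers the mean lands far above the median, which is the same skew the article describes for the survey salaries.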

Grouping Data

  1. Creating groups – df.groupby('country') returns a GroupBy object that organizes the rows by country, one group per country.
  2. Inspecting a single group – country_group.get_group('United States') returns all rows where country == 'United States'.
  3. Applying functions to groups
     • Simple aggregation: country_group['converted_comp'].median() gives the median salary per country.
     • Multiple aggregations: country_group['converted_comp'].agg(['median', 'mean']) returns both median and mean salaries.
     • Using apply for custom logic – To count respondents who mention Python in a free-text column, use country_group['language_worked_with'].apply(lambda s: s.str.contains('Python').sum()). Calling .str directly on a GroupBy object raises an error, so the string test has to run on each group's Series inside apply.
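A minimal sketch of the grouping steps above, using an invented five-row table. The data is hypothetical, and na=False is an addition not mentioned in the text: it makes rows with a missing language answer count as "no match".

```python
import pandas as pd

# Hypothetical rows; the column names follow the ones used in this article.
df = pd.DataFrame({
    "country": ["United States", "United States", "India", "India", "India"],
    "converted_comp": [100_000.0, 120_000.0, 10_000.0, 14_000.0, None],
    "language_worked_with": ["Python;SQL", "Java", "Python;C", None, "Python"],
})

country_group = df.groupby("country")

# All rows belonging to one group
us_rows = country_group.get_group("United States")

# One aggregate per country, then several at once
median_by_country = country_group["converted_comp"].median()
salary_stats = country_group["converted_comp"].agg(["median", "mean"])

# apply() hands each country's Series to the lambda, where .str is available.
python_by_country = country_group["language_worked_with"].apply(
    lambda s: s.str.contains("Python", na=False).sum()
)

print(salary_stats)
print(python_by_country)
```

Both median_by_country and python_by_country come back as Series indexed by country, which is what makes them easy to line up side by side later.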

Combining Results

  • pd.concat([series1, series2], axis=1, sort=False) merges two Series (e.g., total respondents per country and Python‑knowing respondents) into a single DataFrame.
  • Renaming columns – df.rename(columns={'old_name': 'new_name'}, inplace=True) makes the table easier to read.
  • Calculating percentages – Create a new column: df['pct_python'] = df['num_python'] / df['num_respondents'] * 100
  • Sorting – df.sort_values('pct_python', ascending=False, inplace=True) puts countries with the highest Python adoption at the top.
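The combining steps can be sketched starting from two small per-country Series. The counts and the Series names ("count", "language_worked_with") are assumptions chosen for illustration; in a real run they would come from value_counts() and the grouped apply.

```python
import pandas as pd

# Hypothetical per-country totals, deliberately listed in different orders.
num_respondents = pd.Series(
    {"United States": 4, "India": 5, "Japan": 2}, name="count"
)
num_python = pd.Series(
    {"India": 2, "Japan": 1, "United States": 3}, name="language_worked_with"
)

# axis=1 places the Series side by side, aligned on their country labels.
python_df = pd.concat([num_respondents, num_python], axis=1, sort=False)

python_df = python_df.rename(
    columns={"count": "num_respondents", "language_worked_with": "num_python"}
)

python_df["pct_python"] = python_df["num_python"] / python_df["num_respondents"] * 100
python_df = python_df.sort_values("pct_python", ascending=False)

print(python_df)
```

This sketch reassigns the result instead of passing inplace=True; both styles give the same table, and reassignment keeps each step easy to chain or undo.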

Practical Example Workflow

  1. Load the developer survey CSV.
  2. Use value_counts() to see how many responses each country provided.
  3. Group by country.
  4. Compute median salary per country.
  5. Count Python users per country with apply + str.contains.
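  6. Concatenate the respondent counts and the Python counts, rename the columns, compute the percentage of Python users, and sort.
  7. Inspect any country directly with df.loc['Japan'], where df is the combined table indexed by country.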
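The whole workflow can be run end to end on an inline CSV, as a rough sketch. The CSV text, the respondent column, and every value are made up; a real run would pass the path of the survey file to pd.read_csv instead of a StringIO buffer.

```python
import io

import pandas as pd

# Hypothetical stand-in for the survey file.
CSV_TEXT = """respondent,country,converted_comp,language_worked_with
1,United States,100000,Python;SQL
2,United States,120000,Java
3,United States,90000,Python
4,India,10000,Python;C
5,India,14000,
6,India,12000,Java
7,Japan,50000,Python
8,Japan,70000,Python;Ruby
"""

# Step 1: load the data
df = pd.read_csv(io.StringIO(CSV_TEXT), index_col="respondent")

# Step 2: responses per country (named explicitly so the column label is predictable)
respondents = df["country"].value_counts().rename("num_respondents")

# Steps 3-4: group, then median salary per country
country_group = df.groupby("country")
median_salary = country_group["converted_comp"].median()

# Step 5: Python users per country; na=False treats blank answers as "no"
knows_python = (
    country_group["language_worked_with"]
    .apply(lambda s: s.str.contains("Python", na=False).sum())
    .rename("num_python")
)

# Step 6: combine, compute the percentage, and sort
python_df = pd.concat([respondents, knows_python], axis=1, sort=False)
python_df["pct_python"] = python_df["num_python"] / python_df["num_respondents"] * 100
python_df = python_df.sort_values("pct_python", ascending=False)

# Step 7: look up one country by its index label
print(python_df.loc["Japan"])
```

Naming the Series before concatenating is a design choice made here to avoid depending on the default column labels, which differ between pandas versions.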

Tips & Common Pitfalls

  • Missing values (NaN) are ignored by most aggregation functions; count reports only non‑missing entries.
  • Outliers heavily affect the mean; prefer median for skewed salary data.
  • GroupBy objects do not expose the .str accessor; use .apply() to run string operations on each group's Series.
  • Normalization (normalize=True) is handy when you care about relative frequencies rather than absolute counts.
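Two of these pitfalls can be checked directly on a three-row example. The data is invented, and size() is brought in only as a contrast to count(); it is not mentioned in the text above.

```python
import pandas as pd

# Hypothetical rows with one missing salary and one missing language answer.
df = pd.DataFrame({
    "country": ["India", "India", "India"],
    "converted_comp": [10_000.0, None, 14_000.0],
    "language_worked_with": ["Python", None, "Java"],
})
group = df.groupby("country")

non_missing = group["converted_comp"].count()  # ignores the missing salary
all_rows = group["converted_comp"].size()      # counts every row in the group

# Accessing .str on the grouped column fails, which is why apply() is needed.
try:
    group["language_worked_with"].str.contains("Python")
    str_on_groupby_failed = False
except AttributeError:
    str_on_groupby_failed = True

# On a plain Series, na=False makes a missing answer count as "no match".
python_mask = df["language_worked_with"].str.contains("Python", na=False)

print(non_missing["India"], all_rows["India"], str_on_groupby_failed)
```

Comparing count() with size() is a quick way to see how many answers are missing in each group before trusting an aggregate.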

Next Steps

The next video in the series will cover handling missing data and data cleaning – essential skills before any serious analysis.

Sponsor Note

The tutorial mentions Brilliant as a sponsor for its interactive courses on statistics and machine learning. While optional, such courses can deepen the concepts demonstrated here.

Grouping and aggregating with pandas lets you turn raw survey data into actionable insights, such as median salaries by country, popular social-media platforms, and Python adoption rates, without writing complex SQL or doing manual calculations.
