Grouping and Aggregating Data with pandas: A Complete Guide
Introduction
This article walks through the essential techniques for grouping and aggregating data with pandas, using a developer survey as the running example. By the end you’ll be able to answer questions such as what a typical developer salary is, which social-media platform is most popular in each country, and what percentage of developers know Python.
Basic Aggregations
- What is aggregation? Combining multiple values into a single result (e.g., mean, median, mode).
- Median vs. mean – Median is robust to outliers; the median salary in the developer survey is about $57,000, while the mean (≈ $127,000) is skewed by a few very high salaries.
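- `DataFrame.median()` – Returns the median for every numeric column (age, weekly work hours, etc.). In recent pandas versions you may need to pass `numeric_only=True` when the DataFrame also contains text columns.
- `DataFrame.describe()` – Provides count, mean, std, min, the 25th, 50th (median), and 75th percentiles, and max in one call.
- `Series.value_counts()` – Counts occurrences of each unique value (useful for yes/no questions, hobbyist status, or social-media preferences). Adding `normalize=True` returns proportions (fractions of the total) instead of raw counts; multiply by 100 for percentages.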
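The calls above can be tried on a toy table. This is a minimal sketch, assuming a tiny hand-made DataFrame in place of the real survey; the values are invented for illustration, and the column names `converted_comp`, `age`, and `hobbyist` are assumptions modeled on the names used in this article.

```python
import pandas as pd

# Hypothetical stand-in for the survey data (values invented for illustration).
df = pd.DataFrame({
    "converted_comp": [40_000, 57_000, 60_000, 1_000_000],
    "age": [25, 31, 40, 28],
    "hobbyist": ["Yes", "No", "Yes", "Yes"],
})

# Median of each numeric column; numeric_only=True skips the text column.
medians = df.median(numeric_only=True)

# A single very high salary pulls the mean far above the median.
mean_salary = df["converted_comp"].mean()
median_salary = df["converted_comp"].median()

# Summary statistics for the numeric columns in one call.
summary = df.describe()

# Raw counts per answer, then proportions (fractions that sum to 1).
counts = df["hobbyist"].value_counts()
shares = df["hobbyist"].value_counts(normalize=True)

print(medians, summary, counts, shares, sep="\n\n")
```

Comparing `mean_salary` with `median_salary` here shows the same effect described for the survey: one outlier is enough to make the mean unrepresentative.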
Grouping Data
- Creating groups – `country_group = df.groupby('country')` splits the DataFrame into sub-frames, one for each country.
- Inspecting a single group – `country_group.get_group('United States')` returns all rows where `country == 'United States'`.
- Applying functions to groups
  - Simple aggregation: `country_group['converted_comp'].median()` gives the median salary per country.
  - Multiple aggregations: `country_group['converted_comp'].agg(['median', 'mean'])` returns both median and mean salaries.
  - Using `apply` for custom logic – To count respondents who mention Python in a free-text column: `country_group['language_worked_with'].apply(lambda s: s.str.contains('Python').sum())`. This avoids the error that occurs when trying to use `.str` directly on a GroupBy object.
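The grouping steps above can be sketched on a small example. The rows below are invented, and passing `na=False` to `str.contains` is an assumption added here so that missing answers count as "does not mention Python"; it is not part of the original one-liner.

```python
import pandas as pd

# Hypothetical miniature survey (values invented for illustration).
df = pd.DataFrame({
    "country": ["United States", "United States", "India", "India", "Japan"],
    "converted_comp": [120_000, 90_000, 10_000, 14_000, 50_000],
    "language_worked_with": ["Python;SQL", "Java", "Python;C", None, "Python"],
})

# Split the rows into one group per country.
country_group = df.groupby("country")

# Look at the rows of a single group.
us_rows = country_group.get_group("United States")

# One aggregation, then several at once.
median_by_country = country_group["converted_comp"].median()
stats_by_country = country_group["converted_comp"].agg(["median", "mean"])

# Custom logic per group: the lambda receives each country's Series,
# where the .str accessor is available.
python_by_country = country_group["language_worked_with"].apply(
    lambda s: s.str.contains("Python", na=False).sum()
)

print(us_rows, stats_by_country, python_by_country, sep="\n\n")
```

Each result is indexed by country, which is what makes it possible to line the results up side by side later.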
Combining Results
- `pd.concat([series1, series2], axis=1, sort=False)` merges two Series (e.g., total respondents per country and Python-knowing respondents) into a single DataFrame.
- Renaming columns – `df.rename(columns={'old_name': 'new_name'}, inplace=True)` makes the table easier to read.
- Calculating percentages – Create a new column: `df['pct_python'] = df['num_python'] / df['num_respondents'] * 100`
- Sorting – `df.sort_values('pct_python', ascending=False, inplace=True)` puts countries with the highest Python adoption at the top.
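Put together, the four steps above might look like the following. This is a sketch with two hand-made Series standing in for the per-country totals and Python counts; the numbers and the Series names are assumptions, and reassignment is used in place of `inplace=True` purely as a style choice.

```python
import pandas as pd

# Hypothetical per-country results (numbers invented for illustration).
respondents = pd.Series(
    {"United States": 4, "India": 5, "Japan": 2}, name="country"
)
python_users = pd.Series(
    {"United States": 3, "India": 2, "Japan": 1}, name="language_worked_with"
)

# Place the two Series side by side, aligned on their shared index.
python_df = pd.concat([respondents, python_users], axis=1, sort=False)

# Give the columns descriptive names.
python_df = python_df.rename(
    columns={"country": "num_respondents", "language_worked_with": "num_python"}
)

# Share of respondents in each country who mention Python, in percent.
python_df["pct_python"] = python_df["num_python"] / python_df["num_respondents"] * 100

# Highest Python adoption first.
python_df = python_df.sort_values("pct_python", ascending=False)

print(python_df)
```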
Practical Example Workflow
- Load the developer survey CSV.
- Use `value_counts()` to see how many responses each country provided.
- Group by `country`.
- Compute median salary per country.
- Count Python users per country with `apply` + `str.contains`.
- Concatenate the two series, rename columns, compute `% Python`, and sort.
- Inspect any country directly with `df.loc['Japan']`.
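The whole workflow can be run end to end on a few lines of CSV. This is a rough sketch under stated assumptions: the inline CSV text is invented and stands in for the real survey file, and the two Series are named before concatenation (rather than renamed afterwards) because the default name produced by `value_counts()` differs between pandas versions.

```python
import io

import pandas as pd

# Hypothetical miniature CSV; the real survey file is far larger.
csv_text = """country,converted_comp,language_worked_with
United States,120000,Python;SQL
United States,90000,Java
India,10000,Python;C
India,14000,JavaScript
Japan,50000,Python
"""

# Load the data (a file path would normally go here).
df = pd.read_csv(io.StringIO(csv_text))

# Responses per country.
num_respondents = df["country"].value_counts()

# Group by country and compute the median salary.
country_group = df.groupby("country")
median_salary = country_group["converted_comp"].median()

# Python users per country; na=False treats missing answers as "no".
num_python = country_group["language_worked_with"].apply(
    lambda s: s.str.contains("Python", na=False).sum()
)

# Combine, compute the percentage, and sort.
result = pd.concat(
    [num_respondents.rename("num_respondents"), num_python.rename("num_python")],
    axis=1,
    sort=False,
)
result["pct_python"] = result["num_python"] / result["num_respondents"] * 100
result = result.sort_values("pct_python", ascending=False)

# Look up one country by its index label.
japan = result.loc["Japan"]

print(median_salary, result, japan, sep="\n\n")
```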
Tips & Common Pitfalls
- Missing values (`NaN`) are ignored by most aggregation functions; `count` reports only non-missing entries.
- Outliers heavily affect the mean; prefer median for skewed salary data.
- GroupBy objects do not expose string methods; always use `.apply()` for custom text operations.
- Normalization (`normalize=True`) is handy when you care about relative frequencies rather than absolute counts.
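The first two tips can be checked directly. The sketch below uses invented numbers; the comparison with `size()`, which counts every row including missing ones, is an addition here and is not discussed in the original material.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing salary and one extreme salary.
df = pd.DataFrame({
    "country": ["India", "India", "India", "Japan", "Japan", "Japan"],
    "converted_comp": [10_000, np.nan, 14_000, 50_000, 60_000, 2_000_000],
})

country_group = df.groupby("country")

# median() skips NaN, so India's median comes from the two known salaries.
india_median = country_group["converted_comp"].median()["India"]

# count() reports non-missing entries; size() reports all rows in the group.
answered = country_group["converted_comp"].count()
rows = country_group.size()

# The outlier drags Japan's mean far above its median.
japan = country_group["converted_comp"].agg(["mean", "median"]).loc["Japan"]

print(india_median, answered, rows, japan, sep="\n\n")
```

When `count()` and `size()` disagree for a group, some of its rows have missing values in that column, which is worth knowing before trusting the group's aggregate.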
Next Steps
The next video in the series will cover handling missing data and data cleaning – essential skills before any serious analysis.
Grouping and aggregating with pandas lets you turn raw survey data into actionable insights (median salaries by country, popular social-media platforms, and Python adoption rates) without writing complex SQL or manual calculations.
Frequently Asked Questions
Who is Corey Schafer on YouTube?
Corey Schafer is a YouTube channel that publishes videos on a range of topics.