Filtering Data in Pandas: A Complete Guide
Introduction
In this article we walk through the essential techniques for filtering rows and columns in pandas DataFrame and Series objects. Whether you need to isolate respondents who know Python, select a specific salary range, or limit results to certain countries, the methods described here cover the most common scenarios.
Understanding Boolean Masks
- A comparison such as df['last_name'] == 'Doe' returns a Series of True/False values.
- This Series acts as a mask: True marks rows that satisfy the condition, False marks those that do not.
- Example mask:
0    False
1     True
2     True
dtype: bool
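As a minimal sketch of how such a mask comes about, the snippet below builds a three-row DataFrame and evaluates the comparison. The names and email addresses are hypothetical sample data invented for illustration; only the column names follow the article.

```python
import pandas as pd

# Hypothetical sample data, invented purely for illustration.
df = pd.DataFrame({
    "first_name": ["Corey", "Jane", "John"],
    "last_name": ["Schaefer", "Doe", "Doe"],
    "email": ["corey@example.com", "jane@example.com", "john@example.com"],
})

# The comparison is evaluated element-wise and yields one boolean per row.
mask = df["last_name"] == "Doe"

print(mask.tolist())    # [False, True, True]
print(mask.dtype)       # bool
print(int(mask.sum()))  # 2, the number of rows that satisfy the condition
```

Because True counts as 1 when summed, mask.sum() is a quick way to check how many rows a filter will keep before applying it.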
Applying a Filter Directly
mask = df['last_name'] == 'Doe'
filtered_df = df[mask]
The result is a new DataFrame containing only the rows where the mask is True.
Using the .loc Indexer
- .loc accepts a boolean mask for the row selector and a list of column labels for the column selector.
- Syntax: df.loc[mask, ['email']] returns the email column for rows that match the mask.
- Benefits: you can filter rows and pick specific columns in a single, readable statement.
Combining Conditions
- AND: use & and wrap each condition in parentheses.
mask = (df['last_name'] == 'Doe') & (df['first_name'] == 'John')
- OR: use |.
mask = (df['last_name'] == 'Schaefer') | (df['first_name'] == 'John')
- NOT: prepend ~ to a mask to invert it.
mask = ~((df['last_name'] == 'Schaefer') & (df['first_name'] == 'John'))
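To see the three operators side by side, here is a small runnable sketch on hypothetical sample data (the rows are invented). The parentheses matter because & and | bind more tightly than == in Python, so without them the comparison would not be evaluated first.

```python
import pandas as pd

# Hypothetical sample data, invented purely for illustration.
df = pd.DataFrame({
    "first_name": ["Corey", "Jane", "John"],
    "last_name": ["Schaefer", "Doe", "Doe"],
})

# AND: both conditions must hold for a row to be kept.
both = (df["last_name"] == "Doe") & (df["first_name"] == "John")

# OR: a row is kept if at least one condition holds.
either = (df["last_name"] == "Schaefer") | (df["first_name"] == "John")

# NOT: flips every True to False and vice versa.
neither = ~either

print(both.tolist())                        # [False, False, True]
print(either.tolist())                      # [True, False, True]
print(df.loc[neither, "first_name"].tolist())  # ['Jane']
```

Python's own and, or, and not keywords do not work element-wise on a Series, which is why the symbolic operators are used here.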
Real‑World Example: Survey Data
1. Filtering by Salary
high_salary = df['ConvertedComp'] > 70000
result = df.loc[high_salary, ['Country', 'LanguageWorkedWith', 'ConvertedComp']]
Shows respondents earning more than $70k together with their country and known languages.
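The same pattern extends to a salary range by combining two comparisons. The sketch below uses a hypothetical four-row stand-in for the survey data (column names follow the article, values are invented) and an assumed upper bound of 100000 chosen only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the survey data; values are invented.
df = pd.DataFrame({
    "Country": ["United States", "India", "Germany", "Canada"],
    "LanguageWorkedWith": ["Python;SQL", "Java", None, "Python;JavaScript"],
    "ConvertedComp": [120000.0, 15000.0, 85000.0, None],
})

# Single threshold, as in the article.
high_salary = df["ConvertedComp"] > 70000

# Range: strictly between 70000 and 100000 (upper bound is an assumption).
in_range = (df["ConvertedComp"] > 70000) & (df["ConvertedComp"] < 100000)

result = df.loc[in_range, ["Country", "ConvertedComp"]]

print(high_salary.tolist())        # [True, False, True, False]
print(result["Country"].tolist())  # ['Germany']
```

Note that the respondent with a missing salary is excluded: comparisons against NaN evaluate to False, so rows without a ConvertedComp value never pass a numeric filter. Series.between() offers a more compact way to express a range, with inclusive bounds by default.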
2. Filtering by a List of Countries
countries = ['United States', 'India', 'United Kingdom', 'Germany', 'Canada']
mask = df['Country'].isin(countries)
result = df.loc[mask, 'Country']
Returns the Country column for only those rows whose Country value appears in the predefined list.
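As a rough illustration, isin() behaves like a chain of == comparisons joined with |, and its result can be inverted with ~ to exclude the listed countries instead. The rows below are hypothetical sample data, and the shortened country list is chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical sample data, invented purely for illustration.
df = pd.DataFrame({
    "Country": ["United States", "India", "Germany", "Canada", "India"],
})

countries = ["United States", "India"]

# Membership test: True where the value appears in the list.
mask = df["Country"].isin(countries)

# Equivalent, but more verbose, chain of comparisons.
chained = (df["Country"] == "United States") | (df["Country"] == "India")

# Inverting the mask keeps everything NOT in the list.
excluded = df.loc[~mask, "Country"]

print(mask.tolist())         # [True, True, False, False, True]
print(mask.equals(chained))  # True
print(excluded.tolist())     # ['Germany', 'Canada']
```

The list-based form scales better: adding a country means editing the list rather than appending another condition.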
3. Filtering with String Methods
When a column stores multiple values as a semicolon‑separated string (e.g., LanguageWorkedWith), use the str.contains method:
mask = df['LanguageWorkedWith'].str.contains('Python', na=False)
result = df.loc[mask, 'LanguageWorkedWith']
Selects all respondents who listed Python among their known languages.
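The sketch below runs the same call on hypothetical sample data (values invented) and highlights three details: na=False treats respondents with no answer as non-matches, a substring search for 'Java' also matches 'JavaScript', and a pattern containing regex metacharacters such as 'C++' needs regex=False to be read literally.

```python
import pandas as pd

# Hypothetical sample data; one respondent left the question blank.
df = pd.DataFrame({
    "LanguageWorkedWith": [
        "Python;SQL",
        "JavaScript;HTML/CSS",
        None,
        "C++;Python",
    ],
})

langs = df["LanguageWorkedWith"]

# na=False counts missing answers as "does not contain".
python_mask = langs.str.contains("Python", na=False)

# Caveat: this is a substring match, so 'Java' also hits 'JavaScript'.
java_mask = langs.str.contains("Java", na=False)

# The pattern is a regular expression by default; regex=False makes it literal.
cpp_mask = langs.str.contains("C++", na=False, regex=False)

print(python_mask.tolist())  # [True, False, False, True]
print(java_mask.tolist())    # [False, True, False, False]
print(cpp_mask.tolist())     # [False, False, False, True]
```

If exact language names matter, one option is to split the string on ';' first and test the resulting pieces, rather than relying on a substring search.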
Key Takeaways
- Boolean masks are the foundation of pandas filtering.
- .loc provides a clean way to apply masks and select columns simultaneously.
- Combine conditions with &, |, and ~ for complex queries.
- Use Series.isin() for membership tests and Series.str.contains() for substring searches.
- Filtering is usually the first step in any pandas workflow, allowing you to work only with the data that matters.
Filtering with boolean masks and the .loc indexer is a fundamental pandas skill: it lets you quickly isolate the exact rows and columns you need before moving on to further analysis.
Frequently Asked Questions
Who is Corey Schafer on YouTube?
Corey Schafer is a YouTube channel that publishes videos on a range of topics.