Topic 6.8: Filtering & Conditional Selection
Extracting precise subsets from datasets using logical conditions
This topic uses the file 6_7_sales.csv. Click here to download the dataset and place it in the same folder as your notebook before running the code examples.
Note: The dataset used in this reading material may differ slightly from the one used in the video. As a result, some outputs shown here may not exactly match what you see in the video. The logic and techniques are identical â only the data values may vary.
Data analysis begins with asking specific questions. Which transactions exceeded 100 EGP? Which products sold fewer than 10 units? Finding answers requires filtering the dataset to show only rows that match your criteria.
Pandas implements filtering through boolean masks. When you write a condition like Quantity > 20 on a DataFrame column, Pandas doesn't immediately filter anything. Instead, it creates a Series containing only True and False values - one for each row in the original DataFrame.
import pandas as pd # Load sales data sales = pd.read_csv('6_7_sales.csv') print("Original DataFrame (first 5 rows):") print(sales.head()) # Create a boolean mask high_revenue_mask = sales['Revenue'] > 3000 print("\nBoolean mask (first 10 values):") print(high_revenue_mask.head(10)) # Output: # 0 False # 1 True # 2 False # 3 True # 4 False # 5 True # 6 False # 7 False # 8 True # 9 False # Name: Revenue, dtype: bool
Each row that matches the condition receives True. Rows that don't match receive False. This Series of boolean values is called a mask because it functions like a filter - it hides the False rows and exposes only the True rows when applied.
The boolean mask is a logical selection tool. Pandas examines each element in the specified column against the condition, producing an array of boolean values with the same length as the original DataFrame. This mechanism allows you to identify rows with precision without modifying the source data.
# Apply the mask to filter the DataFrame high_revenue_sales = sales[high_revenue_mask] print("High revenue products (Revenue > 3000):") print(high_revenue_sales) print(f"\nFiltered results: {len(high_revenue_sales)} out of {len(sales)} rows")
You apply the mask by placing it inside square brackets after the DataFrame name. Pandas examines the mask and returns a new DataFrame containing only the rows marked True. The original data remains unchanged - you're extracting a filtered view.
Pandas supports all standard comparison operators: > (greater than), < (less than), == (equal to), != (not equal to), >= (greater than or equal), <= (less than or equal). Each operator generates a boolean mask when applied to a DataFrame column.
This approach provides flexibility. You can apply any comparison condition - greater than, less than, equal to, not equal to - and the result is always a boolean mask ready for filtering. The mask and filtered DataFrame are separate objects; the original data structure stays intact.
Real-world filtering demands combining multiple conditions. You might want transactions where the price exceeds 10 EGP and the quantity exceeds 15 units simultaneously. Simple single-condition masks cannot handle this scenario.
Pandas uses the & symbol for the AND operation. Each condition must be enclosed in parentheses due to Python's operator precedence rules. A row appears in the result only if it satisfies all conditions simultaneously.
# AND condition: Price > 10 AND Units_Sold > 200 and_mask = (sales['Price'] > 10) & (sales['Units_Sold'] > 200) filtered_sales = sales[and_mask] print("Products with Price > 10 AND Units_Sold > 200:") print(filtered_sales) print(f"Found {len(filtered_sales)} products matching both conditions")
The result contains only transactions that meet every specified criterion. This is valuable for finding rare cases - for example, expensive products with low inventory, or high-revenue transactions from specific regions. The AND operation represents logical intersection.
The AND operation logically means intersection. You're searching for rows present in both the first condition's set and the second condition's set. Adding more AND conditions narrows the results further because the criteria become more restrictive.
# OR condition: Price < 5 OR Revenue > 5000 or_mask = (sales['Price'] < 5) | (sales['Revenue'] > 5000) filtered_sales = sales[or_mask] print("Products with Price < 5 OR Revenue > 5000:") print(filtered_sales) print(f"Found {len(filtered_sales)} products matching either condition")
The OR operation uses the | symbol. A row appears if it satisfies at least one condition - any match is sufficient. For example, transactions where the price is below 5 EGP or the quantity exceeds 30 units will include all rows meeting either criterion.
The OR operation logically means union. You're combining rows from the first condition's set with rows from the second condition's set. Results are broader than AND because partial matches are accepted. This is useful when searching for items that meet any of several alternative criteria.
Always wrap each condition in parentheses when using &, |, or ~ operators. Without parentheses, Python's operator precedence causes incorrect evaluation. Writing df[df['A'] > 5 & df['B'] < 10] will fail with an error. The correct syntax is df[(df['A'] > 5) & (df['B'] < 10)].
# NOT condition: NOT (City == 'Cairo')
not_mask = ~(sales['City'] == 'Cairo')
filtered_sales = sales[not_mask]
print("Products NOT sold in Cairo:")
print(filtered_sales)
print(f"Found {len(filtered_sales)} products from other cities")The NOT operation uses the ~ symbol. This inverts boolean values - True becomes False and False becomes True. It's valuable for exclusion queries. If you want all transactions except those from Cairo, you write the condition City == 'Cairo' and apply ~ to reverse it.
You can combine NOT with AND and OR to build complex filters. For instance, transactions that are not from Cairo and not from Alexandria simultaneously. These compound conditions let you express sophisticated business logic in a single filtering statement.
# Complex compound condition
complex_mask = ((sales['Price'] > 20) & (sales['Units_Sold'] > 100)) | (sales['City'] == 'Alexandria')
filtered_sales = sales[complex_mask]
print("Products matching: (Price > 20 AND Units_Sold > 100) OR City == 'Alexandria':")
print(filtered_sales)
print(f"Found {len(filtered_sales)} products")Complex conditions combine multiple AND, OR, and NOT operations using parentheses to control evaluation order. The example above finds products that are either high-priced with high volume, or located in Alexandria regardless of price or volume. This demonstrates how boolean logic translates directly into data filtering.
Filtering by multiple specific values using OR operators becomes tedious quickly. If you want transactions for products Pen, Notebook, and Marker, writing (Product == 'Pen') | (Product == 'Notebook') | (Product == 'Marker') is verbose and error-prone.
The isin() method solves this problem. You pass a list of acceptable values, and the method checks each row to see if its value appears in that list. Rows with matching values receive True; all others receive False.
# Filter multiple cities using isin() selected_cities = ['Cairo', 'Alexandria', 'Giza'] city_mask = sales['City'].isin(selected_cities) filtered_sales = sales[city_mask] print("Products sold in Cairo, Alexandria, or Giza:") print(filtered_sales) print(f"Total: {len(filtered_sales)} products from selected cities")
This approach is cleaner and more maintainable. Instead of chaining multiple OR conditions, you define the list of values once and pass it to isin(). The code reads naturally: 'select rows where City is in the list [Cairo, Alexandria, Giza]'.
The isin() method is also valuable for dynamic filtering. You can construct the list programmatically from another data source - a file, a database query, or user input - and pass it directly to the method. This enables flexible, data-driven filtering logic.
# Filter multiple product categories selected_categories = ['Stationery', 'Electronics'] category_mask = sales['Category'].isin(selected_categories) filtered_sales = sales[category_mask] print("Products in Stationery or Electronics categories:") print(filtered_sales[['Product', 'Category', 'Price', 'Revenue']]) print(f"Total: {len(filtered_sales)} products")
When working with categorical data - columns containing a limited set of distinct values like product types, regions, or status codes - isin() is the standard filtering approach. It handles any number of values with the same syntax, making your code scalable.
You can invert isin() using the ~ operator to exclude specific values. For example, ~sales['City'].isin(['Cairo', 'Giza']) returns all rows where City is not Cairo or Giza. This provides a concise way to filter out multiple unwanted values simultaneously.
Text columns require specialized filtering methods. You might need to find all cities containing the letter 'a', or all products starting with 'P'. Standard comparison operators don't support substring matching or pattern detection.
Pandas provides the str accessor for string operations on text columns. The str.contains() method searches for a substring anywhere within each text value. It returns True if the substring is found, False otherwise.
# Find cities containing the letter 'a' cities_with_a = sales[sales['City'].str.contains('a', case=False)] print("Cities containing letter 'a':") print(cities_with_a['City'].unique()) print(f"\nTotal transactions: {len(cities_with_a)}")
The case parameter controls case sensitivity. Setting case=False makes the search ignore differences between uppercase and lowercase letters. This ensures 'Cairo', 'CAIRO', and 'cairo' are all treated identically.
Beyond contains(), Pandas offers additional string methods for specific patterns. The str.startswith() method checks if text begins with a specific substring. The str.endswith() method checks if text ends with a specific substring.
# Find products starting with 'P' products_p = sales[sales['Product'].str.startswith('P')] print("Products starting with 'P':") print(products_p[['Product', 'Units_Sold', 'Price', 'Revenue']]) print(f"Total: {len(products_p)} products")
These string methods return boolean masks just like comparison operators. You can combine them with AND, OR, and NOT operations to build sophisticated text filters. For example, you could find products that start with 'P' and are in the 'Stationery' category.
# Find products ending with specific suffix products_ending_er = sales[sales['Product'].str.endswith('er', case=False)] print("Products ending with 'er':") print(products_ending_er['Product'].unique())
String methods like str.contains() will fail if the column contains missing values (NaN). To handle this, add the parameter na=False, which treats missing values as non-matches: df['Column'].str.contains('pattern', na=False). This prevents errors when working with incomplete datasets.
String filtering methods make text analysis straightforward. Whether searching for patterns, extracting specific formats, or validating text content, the str accessor provides methods that integrate seamlessly with Pandas' boolean masking system.
Real data analysis combines all these filtering techniques. Consider a sales dataset where you need to identify high-value transactions from specific regions that meet quality criteria. This requires boolean masks, compound conditions, membership testing, and string matching working together.
import pandas as pd
# Load sales data
sales = pd.read_csv('6_7_sales.csv')
# Scenario 1: High-revenue electronics from major cities
major_cities = ['Cairo', 'Alexandria', 'Giza']
electronics_high_revenue = sales[
(sales['Category'] == 'Electronics') &
(sales['Revenue'] > 4000) &
(sales['City'].isin(major_cities))
]
print("High-revenue electronics from major cities:")
print(electronics_high_revenue[['Product', 'City', 'Revenue', 'Units_Sold']])
print(f"Found {len(electronics_high_revenue)} products\n")This filter combines three conditions using AND logic. The product must be in the Electronics category, the revenue must exceed 4000, and the city must be one of Cairo, Alexandria, or Giza. All three criteria must be satisfied simultaneously.
# Scenario 2: Products requiring attention (low sales OR high-volume low-priced products) needs_attention = sales[ (sales['Units_Sold'] < 50) | ((sales['Units_Sold'] > 300) & (sales['Price'] < 15)) ] print("Products requiring attention:") print(needs_attention[['Product', 'Units_Sold', 'Price', 'Revenue']]) print(f"Found {len(needs_attention)} products requiring attention\n")
This filter uses nested conditions with both OR and AND operators. It identifies products with either very low sales (under 50 units) or products that have high sales but low prices (over 300 units sold and price under 15 EGP). The parentheses control the evaluation order.
# Scenario 3: String pattern matching for product analysis writing_instruments = sales[ sales['Product'].str.contains('Pen|Pencil|Marker', case=False, na=False) ] print("Writing instruments (contains Pen, Pencil, or Marker):") print(writing_instruments[['Product', 'Category', 'Units_Sold', 'Revenue']]) print(f"Found {len(writing_instruments)} writing instruments\n")
The str.contains() method accepts regular expressions, allowing you to search for multiple patterns simultaneously. The pattern 'Pen|Pencil|Marker' matches any string containing Pen, Pencil, or Marker. The na=False parameter prevents errors if missing values exist.
When building complex filters: (1) Test each condition individually before combining them. (2) Use descriptive variable names for masks to improve readability. (3) Check the result count after each filter to verify expected behavior. (4) Remember that filtering creates views - the original DataFrame remains unchanged unless you explicitly reassign.
Filtering becomes your primary tool for exploratory data analysis. You ask questions, build the corresponding boolean logic, apply the filter, and examine the results. This iterative process - question, filter, analyze, refine - drives most data investigation workflows.
- Boolean masks are Series of True/False values created by applying comparison operators to DataFrame columns.
- Apply masks by placing them inside square brackets after the DataFrame name to filter rows.
- The & operator combines conditions with AND logic (all conditions must be True); requires parentheses around each condition.
- The | operator combines conditions with OR logic (at least one condition must be True); requires parentheses around each condition.
- The ~ operator inverts boolean values (True becomes False, False becomes True) for exclusion filtering.
- The isin() method filters rows where column values match any item in a provided list, eliminating verbose OR chains.
- String methods like str.contains(), str.startswith(), and str.endswith() enable pattern matching on text columns.
- Complex filters combine multiple boolean operations, membership tests, and string methods to express sophisticated selection criteria.
- â Pandas Documentation: Boolean Indexing
https://pandas.pydata.org/docs/user_guide/indexing.html#boolean-indexing - â Pandas Documentation: Working with Text Data
https://pandas.pydata.org/docs/user_guide/text.html - â Real Python: Pandas Boolean Indexing Guide
https://realpython.com/pandas-python-explore-dataset/ - â DataCamp: Filtering Data in Pandas
https://www.datacamp.com/tutorial/pandas-tutorial-dataframe-python