Topic 6.9: Structural Mutations - Schema Evolution
Reshaping DataFrames through column operations and transformations
Most datasets arrive incomplete. You have raw measurements but need derived metrics. Sales data contains Price and Units_Sold but lacks Revenue. Inventory records show Stock and Threshold but don't indicate status. Analysis requires computing these missing fields.
Pandas treats DataFrames as mutable structures. You can add new columns through direct assignment. Write the new column name in square brackets on the left side of the equals sign, then specify the calculation on the right side. The operation creates the column and computes the value for every row simultaneously.
import pandas as pd import numpy as np # Load inventory data inventory = pd.read_csv('6_5_inventory.csv') print("Original inventory data:") print(inventory.head()) print(f"Columns: {inventory.columns.tolist()}")
The inventory DataFrame contains six columns. To calculate revenue for each product, multiply Price by Units_Sold. This arithmetic operation needs to happen across all rows â a task that would require loops in traditional programming.
# Add calculated column for revenue inventory['Revenue'] = inventory['Price'] * inventory['Units_Sold'] print("After adding Revenue column:") print(inventory[['Product', 'Price', 'Units_Sold', 'Revenue']].head())
A single assignment statement adds the Revenue column. Pandas performs vectorized multiplication â the operation processes the entire column in one pass without explicit iteration. The engine broadcasts the calculation across all rows simultaneously, delivering performance that scales to millions of records.
Vectorized operations execute at compiled C speed, not interpreted Python speed. When you write df['A'] * df['B'], Pandas delegates the multiplication to NumPy, which uses optimized CPU instructions. This is 50â100 times faster than Python loops that process one row at a time.
You can use any arithmetic operation when creating calculated columns: addition, subtraction, multiplication, division, exponents, or combinations. The result is always a new column added to the DataFrame without modifying existing data.
Calculations produce numerical results. Classification produces categorical labels based on conditional logic. You need a column that marks products as In_Stock or Low_Stock depending on whether inventory exceeds a threshold. Simple arithmetic cannot express this branching logic.
NumPy provides the where() function for vectorized conditionals. This function accepts three arguments: a boolean condition to test, the value when True, and the value when False. It evaluates the condition for each element and selects the appropriate result.
# Add conditional column for stock status inventory['Stock_Status'] = np.where( inventory['Stock'] > 500, 'In_Stock', 'Low_Stock' ) print("After adding Stock_Status column:") print(inventory[['Product', 'Stock', 'Stock_Status']].head())
The condition inventory['Stock'] > 500 generates a boolean mask. NumPy examines each value: if True, it assigns 'In_Stock'; if False, it assigns 'Low_Stock'. The result is an array containing the classification for each row, which is then stored as a new column in the DataFrame.
np.where(condition, value_if_true, value_if_false) operates on the entire column at once without any loops. The condition tests each element, and the appropriate value is selected for each row automatically. This pattern handles binary classification tasks cleanly and efficiently.
Data cleaning often requires removing columns. You might drop temporary fields after completing calculations, eliminate redundant information, or remove columns irrelevant to your analysis. DataFrame schema should contain only the columns you need.
The drop() method removes columns by name. Pass the column name as a string, or pass a list of names to remove multiple columns. The axis parameter controls whether you're dropping rows or columns â axis=1 indicates column removal.
# Remove columns we don't need for analysis inventory_clean = inventory.drop(['Stock_Status'], axis=1) print("After dropping Stock_Status column:") print(inventory_clean.head())
By default, drop() returns a new DataFrame without modifying the original. The original inventory DataFrame still contains Stock_Status, while inventory_clean holds the version without it. This non-destructive behavior lets you experiment safely.
The inplace parameter controls mutation behavior. When inplace=True, drop modifies the original DataFrame directly. When inplace=False (the default), drop returns a new DataFrame and leaves the original unchanged. Choose based on whether you need to preserve the original structure.
The axis parameter controls operation direction: axis=0 operates on rows (vertical direction), axis=1 operates on columns (horizontal direction). Forgetting axis=1 when dropping columns causes errors because Pandas defaults to axis=0 and attempts row removal instead.
Column names matter for readability and maintainability. Datasets often arrive with unclear abbreviations, inconsistent capitalization, or confusing terminology. Renaming transforms cryptic headers into self-documenting labels that communicate meaning clearly.
The rename() method changes column names using a dictionary that maps old names to new names. Pass this dictionary to the columns parameter. The method searches for each key in the DataFrame headers and replaces it with the corresponding value.
# Rename columns for clarity inventory_renamed = inventory.rename( columns={ 'Units_Sold': 'Sales_Volume', 'Revenue': 'Total_Revenue' }, inplace=False ) print("After renaming columns:") print(inventory_renamed.columns.tolist()) print(inventory_renamed[['Product', 'Sales_Volume', 'Total_Revenue']].head())
The mapping dictionary specifies only the columns you want to rename â you don't need to include columns that should remain unchanged. Pandas renames the specified columns and leaves all others with their original names.
Like drop(), the rename() method supports the inplace parameter. Setting inplace=False (the default) returns a new DataFrame with updated names. Setting inplace=True modifies the original DataFrame directly. Renaming fixes unclear abbreviations, corrects spelling mistakes, and unifies naming conventions across datasets.
Column order affects how humans read data. Identifiers should appear first, followed by descriptive attributes, then metrics and calculations. Logical ordering reduces cognitive load when scanning tables or inspecting data.
Reorder columns by creating a list containing column names in your target sequence, then indexing the DataFrame with that list. This produces a new DataFrame with columns rearranged according to your specification.
# Reorder columns for better presentation column_order = ['Product', 'Category', 'Supplier', 'Price', 'Stock', 'Units_Sold', 'Revenue', 'Stock_Status'] inventory_reordered = inventory[column_order] print("After reordering columns:") print(inventory_reordered.head())
The reordered DataFrame groups related information together. Place identifier columns first, then descriptive attributes, then metrics and derived fields. This sequence helps readers scan the data efficiently.
Reordering is purely presentational â it doesn't modify data values or types. However, column sequence impacts usability when exporting to CSV files, displaying in reports, or integrating with external systems that expect specific column positions.
The DataFrame is not read-only â you can modify individual cell values directly. When you discover an error in data or need to correct an outdated value, the loc indexer provides precise, targeted cell updates without creating a new DataFrame.
Use loc with the row index and column name to pinpoint a specific cell, then assign the new value. The loc indexer uses label-based indexing: you reference rows by their index labels and columns by their names.
# Update a specific price value inventory.loc[0, 'Price'] = 6.00 print("After updating Pen price:") print(inventory[['Product', 'Price']].head(3))
When you discover errors in data during analysis or need to correct outliers, loc provides precise, surgical accuracy for targeted updates. This is useful for fixing data entry mistakes or updating a specific record when you receive corrected information.
- Add calculated columns through direct assignment with arithmetic operations; Pandas executes vectorized computation across all rows simultaneously.
- Create conditional columns using
np.where(condition, value_if_true, value_if_false)for binary classifications (two possible outcomes). - Remove columns with the
drop()method, specifyingaxis=1for column removal; useinplace=Trueto modify the original DataFrame directly. - Rename columns using the
rename()method with a dictionary mapping old names to new names via thecolumnsparameter. - Reorder columns by creating a list with the target sequence and indexing the DataFrame with that list.
- Update individual cell values using the
locindexer with the row index and column name to fix errors or correct outdated values.
- â Pandas Documentation: Adding and Removing Columns
https://pandas.pydata.org/docs/user_guide/dsintro.html#column-selection-addition-deletion - â NumPy Documentation: numpy.where()
https://numpy.org/doc/stable/reference/generated/numpy.where.html - â Pandas Documentation: DataFrame.rename()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html