Topic 6.9: Structural Mutations

➕ Adding Calculated Columns

▼

Most datasets arrive incomplete. You have raw measurements but need derived metrics. Sales data contains Price and Units_Sold but lacks Revenue. Inventory records show Stock and Threshold but don't indicate status. Analysis requires computing these missing fields.

Pandas treats DataFrames as mutable structures. You can add new columns through direct assignment. Write the new column name in square brackets on the left side of the equals sign, then specify the calculation on the right side. The operation creates the column and computes the value for every row simultaneously.

Python

import pandas as pd
import numpy as np

# Load inventory data
inventory = pd.read_csv('6_5_inventory.csv')
print("Original inventory data:")
print(inventory.head())
print(f"Columns: {inventory.columns.tolist()}")

▶ Output

Original inventory data: Product Category Supplier Price Stock Units_Sold 0 Pen Stationery Alpha Supply 5.50 450 320 1 Notebook Stationery Beta Goods 15.00 600 275 2 Marker Stationery Gamma Corp. 8.00 520 310 3 Ruler Stationery Alpha Supply 3.50 780 180 4 Eraser Stationery Beta Goods 2.00 890 420 Columns: ['Product', 'Category', 'Supplier', 'Price', 'Stock', 'Units_Sold']

The inventory DataFrame contains six columns. To calculate revenue for each product, multiply Price by Units_Sold. This arithmetic operation needs to happen across all rows — a task that would require loops in traditional programming.

Python

# Add calculated column for revenue
inventory['Revenue'] = inventory['Price'] * inventory['Units_Sold']
print("After adding Revenue column:")
print(inventory[['Product', 'Price', 'Units_Sold', 'Revenue']].head())

▶ Output

After adding Revenue column: Product Price Units_Sold Revenue 0 Pen 5.50 320 1760.0 1 Notebook 15.00 275 4125.0 2 Marker 8.00 310 2480.0 3 Ruler 3.50 180 630.0 4 Eraser 2.00 420 840.0

A single assignment statement adds the Revenue column. Pandas performs vectorized multiplication — the operation processes the entire column in one pass without explicit iteration. The engine broadcasts the calculation across all rows simultaneously, delivering performance that scales to millions of records.

ℹ️

Vectorization Performance

Vectorized operations execute at compiled C speed, not interpreted Python speed. When you write df['A'] * df['B'], Pandas delegates the multiplication to NumPy, which uses optimized CPU instructions. This is 50–100 times faster than Python loops that process one row at a time.

You can use any arithmetic operation when creating calculated columns: addition, subtraction, multiplication, division, exponents, or combinations. The result is always a new column added to the DataFrame without modifying existing data.

🎯 Adding Conditional Columns

▼

Calculations produce numerical results. Classification produces categorical labels based on conditional logic. You need a column that marks products as In_Stock or Low_Stock depending on whether inventory exceeds a threshold. Simple arithmetic cannot express this branching logic.

NumPy provides the where() function for vectorized conditionals. This function accepts three arguments: a boolean condition to test, the value when True, and the value when False. It evaluates the condition for each element and selects the appropriate result.

Python

# Add conditional column for stock status
inventory['Stock_Status'] = np.where(
    inventory['Stock'] > 500,
    'In_Stock',
    'Low_Stock'
)
print("After adding Stock_Status column:")
print(inventory[['Product', 'Stock', 'Stock_Status']].head())

▶ Output

After adding Stock_Status column: Product Stock Stock_Status 0 Pen 450 Low_Stock 1 Notebook 600 In_Stock 2 Marker 520 In_Stock 3 Ruler 780 In_Stock 4 Eraser 890 In_Stock

The condition inventory['Stock'] > 500 generates a boolean mask. NumPy examines each value: if True, it assigns 'In_Stock'; if False, it assigns 'Low_Stock'. The result is an array containing the classification for each row, which is then stored as a new column in the DataFrame.

ℹ️

np.where() Syntax

np.where(condition, value_if_true, value_if_false) operates on the entire column at once without any loops. The condition tests each element, and the appropriate value is selected for each row automatically. This pattern handles binary classification tasks cleanly and efficiently.

🗑️ Removing Columns with drop()

▼

Data cleaning often requires removing columns. You might drop temporary fields after completing calculations, eliminate redundant information, or remove columns irrelevant to your analysis. DataFrame schema should contain only the columns you need.

The drop() method removes columns by name. Pass the column name as a string, or pass a list of names to remove multiple columns. The axis parameter controls whether you're dropping rows or columns — axis=1 indicates column removal.

Python

# Remove columns we don't need for analysis
inventory_clean = inventory.drop(['Stock_Status'], axis=1)
print("After dropping Stock_Status column:")
print(inventory_clean.head())

▶ Output

After dropping Stock_Status column: Product Category Supplier Price Stock Units_Sold Revenue 0 Pen Stationery Alpha Supply 5.50 450 320 1760.0 1 Notebook Stationery Beta Goods 15.00 600 275 4125.0 2 Marker Stationery Gamma Corp. 8.00 520 310 2480.0 3 Ruler Stationery Alpha Supply 3.50 780 180 630.0 4 Eraser Stationery Beta Goods 2.00 890 420 840.0

By default, drop() returns a new DataFrame without modifying the original. The original inventory DataFrame still contains Stock_Status, while inventory_clean holds the version without it. This non-destructive behavior lets you experiment safely.

The inplace parameter controls mutation behavior. When inplace=True, drop modifies the original DataFrame directly. When inplace=False (the default), drop returns a new DataFrame and leaves the original unchanged. Choose based on whether you need to preserve the original structure.

⚠️

Understanding axis Parameter

The axis parameter controls operation direction: axis=0 operates on rows (vertical direction), axis=1 operates on columns (horizontal direction). Forgetting axis=1 when dropping columns causes errors because Pandas defaults to axis=0 and attempts row removal instead.

✏️ Renaming Columns

▼

Column names matter for readability and maintainability. Datasets often arrive with unclear abbreviations, inconsistent capitalization, or confusing terminology. Renaming transforms cryptic headers into self-documenting labels that communicate meaning clearly.

The rename() method changes column names using a dictionary that maps old names to new names. Pass this dictionary to the columns parameter. The method searches for each key in the DataFrame headers and replaces it with the corresponding value.

Python

# Rename columns for clarity
inventory_renamed = inventory.rename(
    columns={
        'Units_Sold': 'Sales_Volume',
        'Revenue': 'Total_Revenue'
    },
    inplace=False
)
print("After renaming columns:")
print(inventory_renamed.columns.tolist())
print(inventory_renamed[['Product', 'Sales_Volume', 'Total_Revenue']].head())

▶ Output

After renaming columns: ['Product', 'Category', 'Supplier', 'Price', 'Stock', 'Sales_Volume', 'Total_Revenue', 'Stock_Status'] Product Sales_Volume Total_Revenue 0 Pen 320 1760.0 1 Notebook 275 4125.0 2 Marker 310 2480.0 3 Ruler 180 630.0 4 Eraser 420 840.0

The mapping dictionary specifies only the columns you want to rename — you don't need to include columns that should remain unchanged. Pandas renames the specified columns and leaves all others with their original names.

Like drop(), the rename() method supports the inplace parameter. Setting inplace=False (the default) returns a new DataFrame with updated names. Setting inplace=True modifies the original DataFrame directly. Renaming fixes unclear abbreviations, corrects spelling mistakes, and unifies naming conventions across datasets.

🔄 Reordering Columns

▼

Column order affects how humans read data. Identifiers should appear first, followed by descriptive attributes, then metrics and calculations. Logical ordering reduces cognitive load when scanning tables or inspecting data.

Reorder columns by creating a list containing column names in your target sequence, then indexing the DataFrame with that list. This produces a new DataFrame with columns rearranged according to your specification.

Python

# Reorder columns for better presentation
column_order = ['Product', 'Category', 'Supplier', 'Price', 'Stock', 'Units_Sold', 'Revenue', 'Stock_Status']
inventory_reordered = inventory[column_order]
print("After reordering columns:")
print(inventory_reordered.head())

▶ Output

After reordering columns: Product Category Supplier Price Stock Units_Sold Revenue Stock_Status 0 Pen Stationery Alpha Supply 5.50 450 320 1760.0 Low_Stock 1 Notebook Stationery Beta Goods 15.00 600 275 4125.0 In_Stock 2 Marker Stationery Gamma Corp. 8.00 520 310 2480.0 In_Stock 3 Ruler Stationery Alpha Supply 3.50 780 180 630.0 In_Stock 4 Eraser Stationery Beta Goods 2.00 890 420 840.0 In_Stock

The reordered DataFrame groups related information together. Place identifier columns first, then descriptive attributes, then metrics and derived fields. This sequence helps readers scan the data efficiently.

Reordering is purely presentational — it doesn't modify data values or types. However, column sequence impacts usability when exporting to CSV files, displaying in reports, or integrating with external systems that expect specific column positions.

🔧 Updating Cell Values with loc

▼

The DataFrame is not read-only — you can modify individual cell values directly. When you discover an error in data or need to correct an outdated value, the loc indexer provides precise, targeted cell updates without creating a new DataFrame.

Use loc with the row index and column name to pinpoint a specific cell, then assign the new value. The loc indexer uses label-based indexing: you reference rows by their index labels and columns by their names.

Python

# Update a specific price value
inventory.loc[0, 'Price'] = 6.00
print("After updating Pen price:")
print(inventory[['Product', 'Price']].head(3))

▶ Output

After updating Pen price: Product Price 0 Pen 6.00 1 Notebook 15.00 2 Marker 8.00

When you discover errors in data during analysis or need to correct outliers, loc provides precise, surgical accuracy for targeted updates. This is useful for fixing data entry mistakes or updating a specific record when you receive corrected information.

Add calculated columns through direct assignment with arithmetic operations; Pandas executes vectorized computation across all rows simultaneously.
Create conditional columns using np.where(condition, value_if_true, value_if_false) for binary classifications (two possible outcomes).
Remove columns with the drop() method, specifying axis=1 for column removal; use inplace=True to modify the original DataFrame directly.
Rename columns using the rename() method with a dictionary mapping old names to new names via the columns parameter.
Reorder columns by creating a list with the target sequence and indexing the DataFrame with that list.
Update individual cell values using the loc indexer with the row index and column name to fix errors or correct outdated values.

📚External Resources

▼

↗
Pandas Documentation: Adding and Removing Columns
https://pandas.pydata.org/docs/user_guide/dsintro.html#column-selection-addition-deletion
↗
NumPy Documentation: numpy.where()
https://numpy.org/doc/stable/reference/generated/numpy.where.html
↗
Pandas Documentation: DataFrame.rename()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html

Topic 6.9: Structural Mutations - Schema Evolution