Vectorization Element-wise Operations NumPy Ufuncs Broadcasting Index Alignment

Topic 6.4: Arithmetic and Universal Functions in Pandas

Vectorized Operations and NumPy Ufuncs for High-Performance Data Analysis

Dataset Required

This topic uses the file 6_4_product_prices.csv. Download it from your course materials and place it in the same folder as your notebook before running the code examples.

Vectorized Operations on Series

Most data work requires arithmetic operations on entire columns. Loops are slow and error-prone. Pandas inherits Vectorization from NumPy, applying operations to entire Series at once in a single line of code.

Vectorization means treating a Series as a vector and applying an operation to all elements simultaneously. When you write series + 10, Pandas adds 10 to every value in the Series in one operation. This works for all basic arithmetic: addition, subtraction, multiplication, division, and exponentiation.

Python
import pandas as pd
import numpy as np

# Load product prices dataset
df = pd.read_csv('6_4_product_prices.csv')
jan_prices = df.set_index('Product')['January_Price']

print("Original prices:")
print(jan_prices)

# Vectorized addition - add 10 EGP
increased = jan_prices + 10
print("\nAfter adding 10 EGP:")
print(increased)

# Vectorized multiplication - add 15% tax
with_tax = jan_prices * 1.15
print("\nWith 15% tax:")
print(with_tax)
Output
Original prices:
Product
Pen           5.00
Notebook     12.00
Eraser        2.50
Ruler         8.00
Bag         120.00
Name: January_Price, dtype: float64

After adding 10 EGP:
Product
Pen          15.00
Notebook     22.00
Eraser       12.50
Ruler        18.00
Bag         130.00
Name: January_Price, dtype: float64

With 15% tax:
Product
Pen           5.75
Notebook     13.80
Eraser        2.88
Ruler         9.20
Bag         138.00
Name: January_Price, dtype: float64
Performance Advantage

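Vectorized operations are cleaner to write and often tens to hundreds of times faster on large datasets; the exact speedup depends on the data size and the machine. Pandas uses NumPy arrays underneath, so these operations execute in optimized C code instead of the Python interpreter.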

Python
import time

# Create large Series for performance test
large = pd.Series(range(100000))

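# Vectorized approach (perf_counter is the clock intended for short timings)
start = time.perf_counter()
result_vec = large * 2 + 10
vec_time = time.perf_counter() - start

# Loop approach: a Python-level loop over every element
start = time.perf_counter()
result_loop = [x * 2 + 10 for x in large]
loop_time = time.perf_counter() - start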

print(f"Vectorization: {vec_time:.6f}s")
print(f"Loop: {loop_time:.6f}s")
print(f"Speedup: {loop_time/vec_time:.1f}x")
Output
Vectorization: 0.000245s
Loop: 0.012389s
Speedup: 50.6x
Universal Functions (Ufuncs)

NumPy provides a rich library of mathematical functions called Universal Functions (ufuncs). These functions operate on arrays element-wise: they process each value individually across the entire array in a single operation.

Since Pandas is built on NumPy, all NumPy ufuncs work seamlessly with Pandas Series: square roots, logarithms, exponentials, trigonometric functions, rounding operations, and more, all optimized for performance.

Pattern 1
Mathematical Ufuncs
np.sqrt (square root), np.power (exponentiation), np.log (natural logarithm), np.exp (exponential)
Pattern 2
Rounding Ufuncs
np.round (round to decimals), np.ceil (round up), np.floor (round down)
Pattern 3
Absolute Values
np.abs (absolute value, useful for calculating deviations)
Pattern 4
Trigonometric Ufuncs
np.sin, np.cos, np.tan (for scientific and engineering calculations)
Python
# Mathematical ufuncs
print("Square root of prices:")
print(np.sqrt(jan_prices))

print("\nPrices squared:")
print(np.power(jan_prices, 2))

print("\nNatural log of prices:")
print(np.log(jan_prices))

# Rounding ufuncs
discounted = jan_prices * 0.87  # 13% discount

print("\nDiscounted prices (raw):")
print(discounted)

print("\nRounded to 1 decimal:")
print(np.round(discounted, 1))

print("\nCeiling (round up):")
print(np.ceil(discounted))

print("\nFloor (round down):")
print(np.floor(discounted))
Output
Square root of prices:
Product
Pen          2.236068
Notebook     3.464102
Eraser       1.581139
Ruler        2.828427
Bag         10.954451
Name: January_Price, dtype: float64

Prices squared:
Product
Pen            25.00
Notebook      144.00
Eraser          6.25
Ruler          64.00
Bag         14400.00
Name: January_Price, dtype: float64

Natural log of prices:
Product
Pen         1.609438
Notebook    2.484907
Eraser      0.916291
Ruler       2.079442
Bag         4.787492
Name: January_Price, dtype: float64

Discounted prices (raw):
Product
Pen           4.35
Notebook     10.44
Eraser        2.18
Ruler         6.96
Bag         104.40
Name: January_Price, dtype: float64

Rounded to 1 decimal:
Product
Pen           4.4
Notebook     10.4
Eraser        2.2
Ruler         7.0
Bag         104.4
Name: January_Price, dtype: float64

Ceiling (round up):
Product
Pen           5.0
Notebook     11.0
Eraser        3.0
Ruler         7.0
Bag         105.0
Name: January_Price, dtype: float64

Floor (round down):
Product
Pen           4.0
Notebook     10.0
Eraser        2.0
Ruler         6.0
Bag         104.0
Name: January_Price, dtype: float64
Practical Application

Rounding ufuncs are useful when working with prices. After applying discounts or taxes, prices often have many decimal places. np.round produces clean values at a chosen number of decimals, while np.ceil always rounds up to the next whole currency unit, so rounding never lowers the price.

All ufuncs preserve the original Series index. When you apply np.sqrt(jan_prices), the result is a new Series with the same product names as index labels, making it easy to track which value corresponds to which product.
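
The np.abs ufunc from Pattern 3 can be sketched in the same way. The example below is a minimal illustration, assuming the January prices shown in the earlier output are typed in by hand rather than loaded from the CSV; using the median as the reference point is an arbitrary choice made for this sketch.

```python
import numpy as np
import pandas as pd

# January prices copied from the output earlier in this topic (hand-typed here)
jan_prices = pd.Series(
    [5.00, 12.00, 2.50, 8.00, 120.00],
    index=["Pen", "Notebook", "Eraser", "Ruler", "Bag"],
    name="January_Price",
)

# Absolute deviation of each price from the median price
median_price = jan_prices.median()
abs_dev = np.abs(jan_prices - median_price)

print(abs_dev)

# The ufunc returns a Series that keeps the original product labels
print(abs_dev.index.equals(jan_prices.index))
```

Because the labels survive the ufunc call, abs_dev["Bag"] can be looked up directly without tracking positions.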

Operations Between Two Series

You can perform arithmetic operations not just between a Series and a scalar, but also between two Series. These are called Element-wise Operations. Pandas aligns the two Series by their indices and performs the operation on matching pairs.

Python
# Extract February prices
feb_prices = df.set_index('Product')['February_Price']

print("January prices:")
print(jan_prices)

print("\nFebruary prices:")
print(feb_prices)

# Calculate price increase
price_increase = feb_prices - jan_prices
print("\nPrice increase:")
print(price_increase)

# Calculate percentage change
percent_change = ((feb_prices - jan_prices) / jan_prices) * 100
print("\nPercentage change:")
print(percent_change)
Output
January prices:
Product
Pen           5.00
Notebook     12.00
Eraser        2.50
Ruler         8.00
Bag         120.00
Name: January_Price, dtype: float64

February prices:
Product
Pen           5.50
Notebook     12.50
Eraser        2.50
Ruler         8.50
Bag         125.00
Name: February_Price, dtype: float64

Price increase:
Product
Pen         0.50
Notebook    0.50
Eraser      0.00
Ruler       0.50
Bag         5.00
dtype: float64

Percentage change:
Product
Pen         10.00
Notebook     4.17
Eraser       0.00
Ruler        6.25
Bag          4.17
dtype: float64

In this example the two Series share the same labels in the same order, so the result looks as if the first value of one Series were paired with the first value of the other, the second with the second, and so on. In fact, Pandas does not pair values by position: it pairs them by index label.

Index Alignment

Pandas performs automatic index alignment during operations between Series. It matches elements based on their index labels, not their position. If a label exists in only one of the two Series, Pandas puts NaN (Not a Number) in the result for that label. This prevents silent errors from mismatched data.
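
To make the difference between label-based and position-based pairing concrete, here is a small sketch. It assumes two short hand-typed Series (not the course CSV) that hold the same products in a different order, and uses to_numpy() only to show what position-based arithmetic would have produced.

```python
import pandas as pd

jan = pd.Series([5.00, 12.00, 2.50], index=["Pen", "Notebook", "Eraser"])
# Same products as jan, deliberately listed in a different order
feb = pd.Series([2.50, 5.50, 12.50], index=["Eraser", "Pen", "Notebook"])

# Label-based: Pen is paired with Pen, whatever its position
by_label = feb - jan
print(by_label)

# Position-based, for comparison: stripping the labels pairs values by row order
by_position = feb.to_numpy() - jan.to_numpy()
print(by_position)
```

The label-based result gives the real price change per product, while the position-based one subtracts a Pen price from an Eraser price and so on, which is exactly the silent error that alignment avoids.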

Python
# Create a partial Series for March with some different products
march_prices = pd.Series(
    [6.50, 130.00, 22.00],
    index=['Pen', 'Bag', 'Highlighter']
)

print("March prices (subset):")
print(march_prices)

# Add with alignment
combined = jan_prices + march_prices
print("\nJan + March (with alignment):")
print(combined)
print("\nNote: NaN appears where indices don't match")
Output
March prices (subset):
Pen             6.50
Bag           130.00
Highlighter    22.00
dtype: float64

Jan + March (with alignment):
Bag            250.00
Eraser            NaN
Highlighter       NaN
Notebook          NaN
Pen             11.50
Ruler             NaN
dtype: float64

Note: NaN appears where indices don't match

This behavior is a safety feature. Instead of producing incorrect results by adding mismatched values, Pandas explicitly marks missing data as NaN. You can then handle these missing values appropriately using methods like fillna() or dropna().
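
As a rough sketch of those two options, the example below rebuilds the January and March Series by hand (values taken from the outputs above) and shows dropna() next to the fill_value argument of Series.add. Treating a missing price as 0 is an assumption made purely for illustration; whether it is sensible depends on what the sum is meant to represent.

```python
import pandas as pd

jan_prices = pd.Series(
    [5.00, 12.00, 2.50, 8.00, 120.00],
    index=["Pen", "Notebook", "Eraser", "Ruler", "Bag"],
)
march_prices = pd.Series(
    [6.50, 130.00, 22.00],
    index=["Pen", "Bag", "Highlighter"],
)

# Plain addition: labels present in only one Series become NaN
combined = jan_prices + march_prices

# Option 1: keep only products that have a price in both months
both_months = combined.dropna()
print(both_months)

# Option 2: substitute 0 for a missing price before adding
filled = jan_prices.add(march_prices, fill_value=0)
print(filled)
```

Calling combined.fillna(0) instead would replace the NaN results themselves with 0, which is different from filling the missing inputs as Series.add does here.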

Broadcasting and Revenue Calculation

A common analytical task is calculating derived metrics from multiple columns. For example, revenue is calculated by multiplying unit price by quantity sold. With Pandas, this is a simple element-wise multiplication between two Series.

Python
# Extract units sold
units = df.set_index('Product')['Units_Sold']

print("January prices:")
print(jan_prices)

print("\nUnits sold:")
print(units)

# Calculate revenue per product
revenue = jan_prices * units
print("\nRevenue per product:")
print(revenue)

# Calculate total revenue
print(f"\nTotal revenue: {revenue.sum():.2f} EGP")
Output
January prices:
Product
Pen           5.00
Notebook     12.00
Eraser        2.50
Ruler         8.00
Bag         120.00
Name: January_Price, dtype: float64

Units sold:
Product
Pen         150
Notebook    200
Eraser      300
Ruler       100
Bag          20
Name: Units_Sold, dtype: int64

Revenue per product:
Product
Pen          750.00
Notebook    2400.00
Eraser       750.00
Ruler        800.00
Bag         2400.00
dtype: float64

Total revenue: 7100.00 EGP

Multiplying two aligned Series and aggregating the result (with sum(), mean(), etc.) is a core data analysis pattern. You can compute business metrics in just a few lines of clear, readable code.
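
The multiply-then-aggregate pattern extends naturally to other metrics. The sketch below assumes the prices and units from the output above, typed in by hand, and derives two illustrative figures that are not part of the original example: the average price per unit sold and each product's share of total revenue.

```python
import pandas as pd

products = ["Pen", "Notebook", "Eraser", "Ruler", "Bag"]
jan_prices = pd.Series([5.00, 12.00, 2.50, 8.00, 120.00], index=products)
units = pd.Series([150, 200, 300, 100, 20], index=products)

# Element-wise multiplication, aligned on product labels
revenue = jan_prices * units
total_revenue = revenue.sum()

# Average price actually paid per unit (a units-weighted mean of the prices)
avg_price_per_unit = total_revenue / units.sum()

# Each product's share of total revenue, in percent
revenue_share = revenue / total_revenue * 100

print(f"Average price per unit: {avg_price_per_unit:.2f} EGP")
print(revenue_share.round(1))
```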

Broadcasting Explained

Broadcasting refers to how Pandas (and NumPy) handle operations between arrays of different shapes. When you add a scalar to a Series, the scalar is 'broadcast' across all elements. More complex broadcasting rules apply when combining Series with DataFrames, which you will explore in upcoming topics.
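
As a brief preview, the following sketch shows a scalar and a Series each being broadcast across a small DataFrame. The two-product DataFrame is made up for illustration; DataFrame.mul with axis=0 is used here as one way to match the Series labels against the row index.

```python
import pandas as pd

prices = pd.DataFrame(
    {"January_Price": [5.00, 12.00], "February_Price": [5.50, 12.50]},
    index=["Pen", "Notebook"],
)

# Scalar broadcast: 1.15 is applied to every cell
with_tax = prices * 1.15
print(with_tax)

# Series broadcast: with axis=0 the Series labels are matched to the row index,
# so each row is multiplied by that product's unit count
units = pd.Series([150, 200], index=["Pen", "Notebook"])
revenue = prices.mul(units, axis=0)
print(revenue)
```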

  • Vectorized operations apply arithmetic to entire Series at once, avoiding slow Python loops.
  • Vectorization is cleaner and often tens to hundreds of times faster because operations execute in optimized C code.
  • NumPy ufuncs (sqrt, log, round, ceil, floor, abs) work seamlessly on Pandas Series for element-wise transformations.
  • Operations between two Series are element-wise, with Pandas automatically aligning by index labels.
  • Index alignment prevents silent errors: mismatched indices produce NaN instead of incorrect results.
  • Broadcasting allows scalars to be applied to entire Series, and enables efficient multi-column calculations.