Topic 6.4: Arithmetic and Universal Functions in Pandas
Vectorized Operations and NumPy Ufuncs for High-Performance Data Analysis
This topic uses the file 6_4_product_prices.csv. Download it from your course materials and place it in the same folder as your notebook before running the code examples.
Most data work requires arithmetic on entire columns. Explicit Python loops are slow and error-prone. Pandas inherits vectorization from NumPy, so an operation can be applied to an entire Series at once in a single line of code.
Vectorization means treating a Series as a vector and applying an operation to all elements simultaneously. When you write series + 10, Pandas adds 10 to every value in the Series in one operation. This works for all basic arithmetic: addition, subtraction, multiplication, division, and exponentiation.
    import pandas as pd
    import numpy as np

    # Load product prices dataset
    df = pd.read_csv('6_4_product_prices.csv')
    jan_prices = df.set_index('Product')['January_Price']
    print("Original prices:")
    print(jan_prices)

    # Vectorized addition - add 10 EGP
    increased = jan_prices + 10
    print("\nAfter adding 10 EGP:")
    print(increased)

    # Vectorized multiplication - add 15% tax
    with_tax = jan_prices * 1.15
    print("\nWith 15% tax:")
    print(with_tax)
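The example above covers addition and multiplication. As a minimal, self-contained sketch of the remaining operators, the block below uses a small hand-made Series. The product names and prices are assumptions for illustration only and are not taken from the course CSV.

```python
import pandas as pd

# Hypothetical prices in EGP, used only for illustration
prices = pd.Series([5.0, 120.0, 20.0], index=["Pen", "Bag", "Notebook"])

added = prices + 10      # add 10 EGP to every price
reduced = prices - 2.5   # subtract a flat 2.5 EGP from every price
doubled = prices * 2     # multiply every price by 2
halved = prices / 2      # true division on every price
squared = prices ** 2    # raise every price to the power of 2

# Collect the results side by side; each column keeps the product labels
summary = pd.DataFrame({
    "original": prices,
    "added": added,
    "reduced": reduced,
    "doubled": doubled,
    "halved": halved,
    "squared": squared,
})
print(summary)
```

Each operator returns a new Series and leaves `prices` unchanged.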
Vectorized operations are cleaner to write and typically much faster on large datasets, often by one to two orders of magnitude. Pandas stores numeric data in NumPy arrays, so these operations execute in optimized, compiled C code rather than in the Python interpreter.
    import time

    # Create a large Series for the performance test
    large = pd.Series(range(100000))

    # Vectorized approach (perf_counter is a high-resolution timer,
    # which avoids a zero elapsed time for very fast operations)
    start = time.perf_counter()
    result_vec = large * 2 + 10
    vec_time = time.perf_counter() - start

    # Loop approach
    start = time.perf_counter()
    result_loop = [x * 2 + 10 for x in large]
    loop_time = time.perf_counter() - start

    print(f"Vectorization: {vec_time:.6f}s")
    print(f"Loop: {loop_time:.6f}s")
    print(f"Speedup: {loop_time/vec_time:.1f}x")
NumPy provides a rich library of mathematical functions called Universal Functions (ufuncs). These functions operate on arrays element-wise: they process each value individually across the entire array in a single operation.
Since Pandas is built on NumPy, all NumPy ufuncs work seamlessly with Pandas Series: square roots, logarithms, exponentials, trigonometric functions, rounding operations, and more, all optimized for performance.
    # Mathematical ufuncs
    print("Square root of prices:")
    print(np.sqrt(jan_prices))
    print("\nPrices squared:")
    print(np.power(jan_prices, 2))
    print("\nNatural log of prices:")
    print(np.log(jan_prices))

    # Rounding ufuncs
    discounted = jan_prices * 0.87  # 13% discount
    print("\nDiscounted prices (raw):")
    print(discounted)
    print("\nRounded to 1 decimal:")
    print(np.round(discounted, 1))
    print("\nCeiling (round up):")
    print(np.ceil(discounted))
    print("\nFloor (round down):")
    print(np.floor(discounted))
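Rounding is especially useful when working with prices. After discounts or taxes are applied, prices often have many decimal places. np.round trims them to a fixed number of decimals, while np.ceil rounds up to the next whole currency unit, so the rounded price never falls below the computed one. Strictly speaking, np.round is an ordinary NumPy function rather than a ufunc, but it works element-wise on a Series in the same way.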
All ufuncs preserve the original Series index. When you apply np.sqrt(jan_prices), the result is a new Series with the same product names as indices, making it easy to track which value corresponds to which product.
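To make the index-preservation point concrete, here is a minimal sketch with a hand-made Series (the labels and values are assumptions for illustration). It contrasts applying np.sqrt to the Series with applying it to the underlying NumPy array, where the labels are lost.

```python
import numpy as np
import pandas as pd

# Hypothetical prices chosen so the square roots are whole numbers
prices = pd.Series([4.0, 121.0, 25.0], index=["Pen", "Bag", "Notebook"])

# Applying the ufunc to the Series returns a Series with the same labels
roots = np.sqrt(prices)
print(roots)
print(roots.index.equals(prices.index))  # True

# Applying it to the raw array returns a plain ndarray without labels
raw_roots = np.sqrt(prices.to_numpy())
print(type(raw_roots).__name__)  # ndarray

# Label-based lookup still works on the Series result
print(roots["Bag"])  # 11.0
```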
You can perform arithmetic not only between a Series and a scalar, but also between two Series. These are called element-wise operations. Pandas aligns the two Series by their indices and performs the operation on matching pairs.
    # Extract February prices
    feb_prices = df.set_index('Product')['February_Price']
    print("January prices:")
    print(jan_prices)
    print("\nFebruary prices:")
    print(feb_prices)

    # Calculate price increase
    price_increase = feb_prices - jan_prices
    print("\nPrice increase:")
    print(price_increase)

    # Calculate percentage change
    percent_change = ((feb_prices - jan_prices) / jan_prices) * 100
    print("\nPercentage change:")
    print(percent_change)
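The operation is performed element by element. When both Series share the same index in the same order, as they do here, this looks like positional pairing: the first value of one Series is paired with the first value of the other, the second with the second, and so on. But Pandas is smarter than simple position-based pairing.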
Pandas performs automatic index alignment during operations between Series. It matches elements based on their index labels, not their position. If an index exists in one Series but not the other, Pandas inserts NaN (Not a Number) in the result for those missing pairs. This prevents silent errors from mismatched data.
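A minimal sketch of label-based matching, using two hand-made Series that hold the same products in a different order (all names and values are assumptions for illustration). Subtracting the Series pairs values by label, while subtracting the raw NumPy arrays pairs them by position and gives meaningless differences.

```python
import pandas as pd

# Same three hypothetical products, listed in a different order
jan = pd.Series([5.0, 120.0, 20.0], index=["Pen", "Bag", "Notebook"])
feb = pd.Series([22.0, 5.5, 125.0], index=["Notebook", "Pen", "Bag"])

# Series arithmetic matches values by index label
by_label = feb - jan
print("Aligned by label:")
print(by_label)

# Raw arrays have no labels, so values are paired by position
by_position = feb.to_numpy() - jan.to_numpy()
print("\nPaired by position (misleading here):")
print(by_position)
```

Looking up `by_label["Pen"]` gives the true change for pens (0.5), whereas the first positional result subtracts the January pen price from the February notebook price.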
    # Create a partial Series for March with some different products
    march_prices = pd.Series(
        [6.50, 130.00, 22.00],
        index=['Pen', 'Bag', 'Highlighter']
    )
    print("March prices (subset):")
    print(march_prices)

    # Add with alignment
    combined = jan_prices + march_prices
    print("\nJan + March (with alignment):")
    print(combined)
    print("\nNote: NaN appears where indices don't match")
This behavior is a safety feature. Instead of producing incorrect results by adding mismatched values, Pandas explicitly marks missing data as NaN. You can then handle these missing values appropriately using methods like fillna() or dropna().
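The options for handling these NaN values can be sketched as follows. The two Series are hand-made stand-ins (assumed values, not the course data), and the right choice depends on what a missing price should mean in your analysis.

```python
import pandas as pd

# Hypothetical prices; 'Notebook' and 'Highlighter' each appear in only one Series
jan = pd.Series([5.0, 120.0, 20.0], index=["Pen", "Bag", "Notebook"])
march = pd.Series([6.5, 130.0, 22.0], index=["Pen", "Bag", "Highlighter"])

# Plain addition: unmatched labels become NaN
combined = jan + march
print(combined)

# Option 1: keep only products present in both months
only_matched = combined.dropna()

# Option 2: replace NaN in the result with a fixed value
zero_filled = combined.fillna(0)

# Option 3: treat a missing price as 0 *before* adding
missing_as_zero = jan.add(march, fill_value=0)

print(only_matched)
print(zero_filled)
print(missing_as_zero)
```

Options 2 and 3 are not equivalent: fillna(0) discards the one known price for an unmatched product, while add(..., fill_value=0) keeps it.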
A common analytical task is calculating derived metrics from multiple columns. For example, revenue is calculated by multiplying unit price by quantity sold. With Pandas, this is a simple element-wise multiplication between two Series.
    # Extract units sold
    units = df.set_index('Product')['Units_Sold']
    print("January prices:")
    print(jan_prices)
    print("\nUnits sold:")
    print(units)

    # Calculate revenue per product
    revenue = jan_prices * units
    print("\nRevenue per product:")
    print(revenue)

    # Calculate total revenue
    print(f"\nTotal revenue: {revenue.sum():.2f} EGP")
Multiplying two aligned Series and aggregating the result (with sum(), mean(), etc.) is a core data analysis pattern. You can compute business metrics in just a few lines of clear, readable code.
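As a rough sketch of this multiply-then-aggregate pattern, the block below derives a few further metrics from the same kind of data. The prices and unit counts are made-up values for illustration, and the metric names are hypothetical rather than anything defined in the course dataset.

```python
import pandas as pd

# Hypothetical prices (EGP) and units sold, sharing the same index
prices = pd.Series([5.0, 120.0, 20.0], index=["Pen", "Bag", "Notebook"])
units = pd.Series([200, 10, 50], index=["Pen", "Bag", "Notebook"])

# Element-wise multiplication, then aggregation
revenue = prices * units
total_revenue = revenue.sum()
mean_revenue = revenue.mean()

# Each product's share of total revenue, in percent
revenue_share = revenue / total_revenue * 100

# Average price paid per unit across all products (a weighted average)
avg_price_per_unit = total_revenue / units.sum()

print(revenue)
print(f"Total revenue: {total_revenue:.2f} EGP")
print(f"Mean revenue per product: {mean_revenue:.2f} EGP")
print(revenue_share.round(1))
print(f"Average price per unit sold: {avg_price_per_unit:.2f} EGP")
```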
Broadcasting refers to how Pandas (and NumPy) handle operations between arrays of different shapes. When you add a scalar to a Series, the scalar is 'broadcast' across all elements. More complex broadcasting rules apply when combining Series with DataFrames, which you will explore in upcoming topics.
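As a small preview, and only a sketch with made-up numbers, the block below shows the two cases mentioned above: a scalar stretched across a Series, and a Series combined with a DataFrame. The column names mirror those used earlier in this topic, but the table and the `multipliers` Series are hypothetical.

```python
import pandas as pd

prices = pd.Series([5.0, 120.0, 20.0], index=["Pen", "Bag", "Notebook"])

# Scalar broadcasting: 2 is applied to every element
with_fee = prices + 2

# Conceptually the same as adding a Series filled with 2s
explicit = prices + pd.Series(2, index=prices.index)
print(with_fee.equals(explicit))  # True

# Series with DataFrame: by default the Series index is matched
# against the DataFrame's column labels and applied to every row
table = pd.DataFrame(
    {"January_Price": [5.0, 120.0], "February_Price": [5.5, 125.0]},
    index=["Pen", "Bag"],
)
multipliers = pd.Series({"January_Price": 1.0, "February_Price": 2.0})
scaled = table * multipliers
print(scaled)
```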
- Vectorized operations apply arithmetic to entire Series at once, avoiding slow Python loops.
- Vectorization is cleaner and often 100x faster because operations execute in optimized C code.
- NumPy ufuncs (sqrt, log, ceil, floor, abs) and related functions such as round work seamlessly on Pandas Series for element-wise transformations.
- Operations between two Series are element-wise, with Pandas automatically aligning by index labels.
- Index alignment prevents silent errors: mismatched indices produce NaN instead of incorrect results.
- Broadcasting allows scalars to be applied to entire Series, and enables efficient multi-column calculations.
- Pandas Documentation: Essential Basic Functionality
  https://pandas.pydata.org/docs/user_guide/basics.html
- NumPy Universal Functions (ufunc) Reference
  https://numpy.org/doc/stable/reference/ufuncs.html
- NumPy Broadcasting Rules
  https://numpy.org/doc/stable/user/basics.broadcasting.html
- Pandas Indexing and Selecting Data
  https://pandas.pydata.org/docs/user_guide/indexing.html