Topic 5.3: Structured Arrays

🤔 The Limitation We Haven't Discussed

So far, you've learned that NumPy arrays must be homogeneous: every element has the same type. An array holds all integers, or all floats, or all strings. You can't mix them.

This restriction is what makes NumPy fast. But it creates a problem when you need to work with real-world data. Imagine you're tracking student information:

Python

# Student data: name (string), age (int), grade (float)
# How do we store this in a NumPy array?

# Option 1: Three separate arrays
names = np.array(['Alice', 'Bob', 'Carol'])
ages = np.array([17, 16, 18])
grades = np.array([95.5, 88.0, 92.3])

# Problem: Easy to get out of sync!
# What if you sort one array but forget to sort the others?

Option 1 is fragile. The connection between Alice, 17, and 95.5 only exists in your mind—NumPy doesn't know they belong together. If you sort the grades array, the names and ages don't follow automatically.
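
The sketch below shows how the three arrays from Option 1 can drift apart, and the extra bookkeeping needed to keep them aligned by hand. The names `grades_only_sorted` and `order` are illustrative, not part of the lesson's code.

```python
import numpy as np

names = np.array(['Alice', 'Bob', 'Carol'])
ages = np.array([17, 16, 18])
grades = np.array([95.5, 88.0, 92.3])

# Sorting one array on its own breaks the pairing with the others.
grades_only_sorted = np.sort(grades)
print(names[0], grades_only_sorted[0])   # Alice now sits next to Bob's grade

# To stay in sync, compute the ordering once and apply it to every array.
order = np.argsort(grades)
names_sorted = names[order]
ages_sorted = ages[order]
grades_sorted = grades[order]
print(names_sorted, ages_sorted, grades_sorted)
```

This works, but every sort or filter has to be repeated on each array, and nothing stops you from forgetting one.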

Python

# Option 2: Convert everything to strings
student_data = np.array([
    ['Alice', '17', '95.5'],
    ['Bob', '16', '88.0'],
    ['Carol', '18', '92.3']
])

# Problem: Lost type information!
# Can't do math on the grades anymore - they're strings

Option 2 keeps the data together, but now you've lost type information. The grade '95.5' is a string, not a number, so you can't calculate the average grade without converting back to floats. The age '17' is a string too, so comparisons are made character by character rather than numerically, and you can't reliably determine who's oldest.
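
Here is a short sketch of what that conversion looks like in practice, using the same `student_data` array. The column positions (1 for age, 2 for grade) are something you would have to remember yourself.

```python
import numpy as np

student_data = np.array([
    ['Alice', '17', '95.5'],
    ['Bob', '16', '88.0'],
    ['Carol', '18', '92.3']
])

# Every element shares one Unicode string dtype, so numeric methods do not apply.
print(student_data.dtype)

# Math only works after slicing out a column and converting it by hand.
grades = student_data[:, 2].astype(float)
ages = student_data[:, 1].astype(int)
print(grades.mean(), ages.max())
```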

This is where structured arrays come in. They let you have the best of both worlds: different types in the same array, with each row staying together as a unit.

🏗️ Creating a Structured Array

A structured array is like a mini-database table. Each row is a record, and each column has a name and a specific data type. Here's how you create one:

Python

import numpy as np

# Step 1: Define the structure (like creating table columns)
student_dtype = np.dtype([
    ('name', 'U10'),      # U10 = Unicode string, max 10 characters
    ('age', 'i4'),        # i4 = 4-byte integer
    ('grade', 'f4')       # f4 = 4-byte float
])

# Step 2: Create the structured array with data
students = np.array([
    ('Alice', 17, 95.5),
    ('Bob', 16, 88.0),
    ('Carol', 18, 92.3)
], dtype=student_dtype)

print(students)
print(f"\nData type: {students.dtype}")

▶ Output

[('Alice', 17, 95.5) ('Bob', 16, 88. ) ('Carol', 18, 92.3)]

Data type: [('name', '<U10'), ('age', '<i4'), ('grade', '<f4')]

Notice how the output looks different from regular arrays. Each element is displayed as a tuple in parentheses, because each row is a complete record with three fields.

Understanding the dtype Definition

The np.dtype() call creates a data type specification. Here it takes a list of tuples, where each tuple describes one field:

('name', 'U10'): field named 'name', Unicode string, max 10 characters
('age', 'i4'): field named 'age', 4-byte integer (int32)
('grade', 'f4'): field named 'grade', 4-byte float (float32)

The type codes might look cryptic, but they follow a pattern: 'U' for Unicode strings (with a number for max length), 'i' for integers (with byte size), and 'f' for floats (with byte size). You can also use more readable names like 'int32' or 'float32' if you prefer.
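
As a quick check, the sketch below defines the same layout with short codes and with longer type names, then inspects the result. The variable names and the long sample name are made up for illustration.

```python
import numpy as np

# Short type codes, as used in the lesson.
short_dtype = np.dtype([('name', 'U10'), ('age', 'i4'), ('grade', 'f4')])

# The same layout spelled with longer type names.
readable_dtype = np.dtype([('name', 'U10'), ('age', 'int32'), ('grade', 'float32')])

print(short_dtype == readable_dtype)   # the two definitions compare equal
print(readable_dtype.names)            # field names, in order
print(readable_dtype.itemsize)         # bytes per record: 40 (name) + 4 + 4

# Strings longer than the declared width are silently truncated.
row = np.array([('Bartholomew-James', 17, 90.0)], dtype=short_dtype)
print(row['name'][0])
```

The truncation is worth remembering when you choose a width like 'U10': NumPy does not warn you when a value does not fit.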

🔍 Accessing Data by Field Name

The real power of structured arrays is that you can access entire columns by name, just like you would in a spreadsheet or database:

Python

# Access an entire column by field name
all_names = students['name']
print("All names:", all_names)

all_ages = students['age']
print("All ages:", all_ages)

all_grades = students['grade']
print("All grades:", all_grades)

# Now you can do math on the numeric columns!
average_grade = students['grade'].mean()
print(f"\nAverage grade: {average_grade:.1f}")

oldest_age = students['age'].max()
print(f"Oldest student: {oldest_age} years old")

▶ Output

All names: ['Alice' 'Bob' 'Carol']
All ages: [17 16 18]
All grades: [95.5 88. 92.3]

Average grade: 91.9
Oldest student: 18 years old

Notice that when you access a field, you get back a regular NumPy array of that field's type. students['grade'] returns a float32 array (a view onto the same data rather than a copy), so you can immediately use array methods like .mean() and .max() on it.
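
One detail worth seeing in code: the array you get from a field shares memory with the structured array. The sketch below rebuilds the `students` array so it runs on its own; `grades_copy` is an illustrative name.

```python
import numpy as np

student_dtype = np.dtype([('name', 'U10'), ('age', 'i4'), ('grade', 'f4')])
students = np.array([
    ('Alice', 17, 95.5),
    ('Bob', 16, 88.0),
    ('Carol', 18, 92.3)
], dtype=student_dtype)

grades = students['grade']
print(grades.dtype)                         # float32
print(np.shares_memory(grades, students))   # field access gives a view

# Writing through the view updates the original records.
grades[1] = 90.0
print(students[1])

# Take an explicit copy when the original must stay untouched.
grades_copy = students['grade'].copy()
grades_copy[0] = 0.0
print(students[0]['grade'])                 # unchanged by the edit to the copy
```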

Accessing Individual Records

You can also access individual records (rows) using regular indexing:

Python

# Get the first student's complete record
first_student = students[0]
print("First student:", first_student)
print(f"Type: {type(first_student)}")

# Access individual fields from a record
print(f"\nName: {first_student['name']}")
print(f"Age: {first_student['age']}")
print(f"Grade: {first_student['grade']}")

▶ Output

First student: ('Alice', 17, 95.5)
Type: <class 'numpy.void'>

Name: Alice
Age: 17
Grade: 95.5

When you extract a single record, it's a special NumPy type called 'void', but you can still access its fields by name just like you would with the whole array.
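
The sketch below explores a single record a little further, assuming the same `students` layout as above. It shows positional access, conversion to a plain Python tuple with `.item()`, and the fact that a record still refers to the array's memory.

```python
import numpy as np

student_dtype = np.dtype([('name', 'U10'), ('age', 'i4'), ('grade', 'f4')])
students = np.array([
    ('Alice', 17, 95.5),
    ('Bob', 16, 88.0),
    ('Carol', 18, 92.3)
], dtype=student_dtype)

first = students[0]
print(first.dtype.names)          # field names available on the record

# Fields can be read by name or by position.
print(first['name'], first[0])

# .item() converts the record to a plain Python tuple.
as_tuple = first.item()
name, age, grade = as_tuple
print(name, age, grade)

# The record refers to the array's memory, so assigning to a field
# changes the array as well.
first['age'] = 18
print(students[0])
```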

🔧 Practical Operations on Structured Arrays

Structured arrays support many of the same operations as regular arrays, but with the added benefit of keeping your related data together:

Sorting by a Specific Field

Python

# Sort students by grade (highest to lowest)
sorted_by_grade = np.sort(students, order='grade')[::-1]
print("Students sorted by grade:")
for student in sorted_by_grade:
    print(f"{student['name']}: {student['grade']}")

▶ Output

Students sorted by grade:
Alice: 95.5
Carol: 92.3
Bob: 88.0

The key insight: when you sort by one field, the entire record moves together. Alice, her age, and her grade all stay connected. This is exactly the problem we were trying to solve!
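
Two variations you may find useful, sketched under the assumption of a four-student array (David is added here so that two students share an age): sorting on more than one field, and sorting in descending order with `np.argsort`.

```python
import numpy as np

student_dtype = np.dtype([('name', 'U10'), ('age', 'i4'), ('grade', 'f4')])
students = np.array([
    ('Alice', 17, 95.5),
    ('Bob', 16, 88.0),
    ('Carol', 18, 92.3),
    ('David', 17, 89.5)
], dtype=student_dtype)

# Sort by age first; ties on age are broken by grade.
by_age_then_grade = np.sort(students, order=['age', 'grade'])
print(by_age_then_grade['name'])

# Descending order by grade: sort the field's indices, then reverse them.
descending = students[np.argsort(students['grade'])[::-1]]
print(descending['name'])

# np.sort returns a new array; the original order is unchanged.
print(students['name'])
```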

Filtering with Boolean Masks

Python

# Find all students with grades above 90
high_achievers = students[students['grade'] > 90]
print("Students with grade > 90:")
for student in high_achievers:
    print(f"{student['name']}: {student['grade']}")

# Find students who are 18 or older
adults = students[students['age'] >= 18]
print("\nStudents age 18+:")
for student in adults:
    print(f"{student['name']} ({student['age']} years old)")

▶ Output

Students with grade > 90:
Alice: 95.5
Carol: 92.3

Students age 18+:
Carol (18 years old)

Boolean masking works exactly like it does with regular arrays, but you can apply the condition to any field and get back complete records.
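
Conditions on different fields can also be combined. The sketch below, which rebuilds `students` so it runs on its own, joins two masks with `&` and then selects a subset of fields by passing a list of names.

```python
import numpy as np

student_dtype = np.dtype([('name', 'U10'), ('age', 'i4'), ('grade', 'f4')])
students = np.array([
    ('Alice', 17, 95.5),
    ('Bob', 16, 88.0),
    ('Carol', 18, 92.3)
], dtype=student_dtype)

# Combine conditions with & (and) or | (or); each side needs parentheses.
mask = (students['grade'] > 90) & (students['age'] < 18)
selected = students[mask]
print(selected['name'])

# Index with a list of field names to keep only some columns.
subset = students[['name', 'grade']]
print(subset.dtype.names)
print(subset)
```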

Adding New Records

Python

# Create a new student record
new_student = np.array([('David', 17, 89.5)], dtype=student_dtype)

# Append to existing array
students = np.append(students, new_student)
print("After adding David:")
print(students)

▶ Output

After adding David:
[('Alice', 17, 95.5) ('Bob', 16, 88. ) ('Carol', 18, 92.3)
 ('David', 17, 89.5)]
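
Keep in mind that np.append does not grow the array in place: it builds and returns a new one. The sketch below shows this, along with one possible way to add several records at once using np.concatenate. The extra students (Eve, Frank) are invented for the example.

```python
import numpy as np

student_dtype = np.dtype([('name', 'U10'), ('age', 'i4'), ('grade', 'f4')])
students = np.array([
    ('Alice', 17, 95.5),
    ('Bob', 16, 88.0),
    ('Carol', 18, 92.3)
], dtype=student_dtype)

# np.append returns a new array; the original is left as it was.
new_student = np.array([('David', 17, 89.5)], dtype=student_dtype)
bigger = np.append(students, new_student)
print(len(students), len(bigger))

# For several additions, collect the rows first and build one array from them.
rows = [('Eve', 16, 91.0), ('Frank', 18, 77.5)]
batch = np.array(rows, dtype=student_dtype)
combined = np.concatenate([bigger, batch])
print(combined['name'])
```

Because each append copies the whole array, adding records one at a time in a loop gets slow as the array grows.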

⚠️ When to Use Structured Arrays

Structured arrays are powerful, but they're not always the right tool. Here's when to use them—and when to use something else:

Structured Arrays: Use Cases

✅ Good Use Cases

Small to medium datasets with mixed types
Scientific data with related measurements
When you need NumPy's speed but have heterogeneous data
Interfacing with C structures or binary file formats

❌ When to Use Something Else

Large tabular data → Use Pandas (Week 6!)
Simple single-type calculations → Regular NumPy arrays
Complex data manipulations → Pandas is much easier
Need to add/remove columns frequently → Pandas is flexible

The truth is, structured arrays are a niche feature. They're most commonly used when reading binary data files or interfacing with other languages like C. For most data science work, you'll use Pandas DataFrames instead.
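
To make the binary-data use case concrete, here is a minimal sketch that writes a few records to raw bytes and reads them back. The layout (`record_dtype` with an `id` and a `value` field) is hypothetical and stands in for whatever a real file format or C struct would define.

```python
import numpy as np

# Hypothetical record layout: a 2-byte unsigned id followed by a 4-byte float,
# both little-endian, with no padding between fields.
record_dtype = np.dtype([('id', '<u2'), ('value', '<f4')])
records = np.array([(1, 0.5), (2, 1.25), (3, -3.0)], dtype=record_dtype)

# Serialize to raw bytes, as they might appear in a binary file.
raw = records.tobytes()
print(len(raw))                      # 3 records * 6 bytes each

# Reinterpret the bytes as records using the same dtype.
decoded = np.frombuffer(raw, dtype=record_dtype)
print(decoded['id'], decoded['value'])

# A C compiler would usually pad this struct; align=True mimics that.
aligned_dtype = np.dtype([('id', '<u2'), ('value', '<f4')], align=True)
print(aligned_dtype.itemsize)
```

The dtype acts as the description of the byte layout, which is why it must match the file or struct exactly, including padding and byte order.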

ℹ️

Preview of Next Week

Much of what you learned about structured arrays (named fields, accessing columns, filtering rows) carries over to Pandas DataFrames, which offer the same ideas with much more power and convenience. Think of this as a preview of what's coming in Week 6.

Structured arrays show you that NumPy can handle heterogeneous data when needed. But they also show you why the Python data science community created Pandas—because working with tables of mixed-type data is so common that we needed a better tool for it.

💡 Real-World Example: Sensor Data

Let's see a practical example that shows why structured arrays exist. Imagine you're collecting data from weather sensors across Egypt:

Python

# Define the sensor data structure
sensor_dtype = np.dtype([
    ('station_id', 'U10'),      # Station name
    ('timestamp', 'i8'),        # Unix timestamp (8-byte int)
    ('temperature', 'f4'),      # Celsius
    ('humidity', 'f4'),         # Percentage
    ('pressure', 'f4')          # Millibars
])

# Create sample sensor readings
readings = np.array([
    ('Cairo_01', 1704067200, 28.5, 45.2, 1013.2),
    ('Alex_01', 1704067200, 22.3, 68.5, 1015.8),
    ('Aswan_01', 1704067200, 35.1, 28.3, 1011.5),
    ('Cairo_01', 1704070800, 29.2, 43.8, 1013.0)
], dtype=sensor_dtype)

print("Sensor readings:")
print(readings)

▶ Output

Sensor readings:
[('Cairo_01', 1704067200, 28.5, 45.2, 1013.2)
 ('Alex_01', 1704067200, 22.3, 68.5, 1015.8)
 ('Aswan_01', 1704067200, 35.1, 28.3, 1011.5)
 ('Cairo_01', 1704070800, 29.2, 43.8, 1013. )]

Now you can perform analysis while keeping all the related information together:

Python

# Find the hottest reading
hottest_idx = readings['temperature'].argmax()
hottest = readings[hottest_idx]
print(f"Hottest reading: {hottest['temperature']}°C at {hottest['station_id']}")

# Calculate average temperature across all stations
avg_temp = readings['temperature'].mean()
print(f"Average temperature: {avg_temp:.1f}°C")

# Find all readings from Cairo
cairo_readings = readings[readings['station_id'] == 'Cairo_01']
print(f"\nCairo readings: {len(cairo_readings)} found")
for reading in cairo_readings:
    print(f"  Temp: {reading['temperature']}°C, Humidity: {reading['humidity']}%")

▶ Output

Hottest reading: 35.1°C at Aswan_01
Average temperature: 28.8°C

Cairo readings: 2 found
  Temp: 28.5°C, Humidity: 45.2%
  Temp: 29.2°C, Humidity: 43.8%

This example shows structured arrays at their best: scientific data with measurements of different types (strings, integers, floats) that need to stay grouped together for meaningful analysis.
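
As one further step, the sketch below computes an average temperature per station by combining `np.unique` with boolean masks. The `station_means` dictionary is an illustrative helper, not part of the lesson's code.

```python
import numpy as np

sensor_dtype = np.dtype([
    ('station_id', 'U10'),
    ('timestamp', 'i8'),
    ('temperature', 'f4'),
    ('humidity', 'f4'),
    ('pressure', 'f4')
])

readings = np.array([
    ('Cairo_01', 1704067200, 28.5, 45.2, 1013.2),
    ('Alex_01', 1704067200, 22.3, 68.5, 1015.8),
    ('Aswan_01', 1704067200, 35.1, 28.3, 1011.5),
    ('Cairo_01', 1704070800, 29.2, 43.8, 1013.0)
], dtype=sensor_dtype)

# np.unique returns the distinct station names in sorted order.
stations = np.unique(readings['station_id'])

# For each station, mask its rows and average the temperature field.
station_means = {}
for station in stations:
    temps = readings['temperature'][readings['station_id'] == station]
    station_means[str(station)] = float(temps.mean())

for station, mean_temp in station_means.items():
    print(f"{station}: {mean_temp:.2f}°C")
```

Grouping like this is manageable for a handful of stations, but it is also the kind of task that Pandas handles in a single call.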

Key Takeaways

Structured arrays allow you to store heterogeneous data (mixed types) in a single NumPy array while maintaining type safety.
Each field has a name and a specific data type, defined using np.dtype() with a list of (name, type) tuples.
Access entire columns by field name: students['grade'] returns all grades as a regular NumPy array.
Sorting and boolean filtering work on whole records, so rows stay together; statistical methods such as .mean() and .max() work on individual numeric fields.
Common type codes: 'U10' (Unicode string, 10 chars), 'i4' (4-byte int), 'f4' (4-byte float), 'i8' (8-byte int).
Use structured arrays for small-to-medium datasets with mixed types; use Pandas DataFrames for larger or more complex tabular data.
Structured arrays are a preview of Pandas—they show the need for better tools to work with heterogeneous tabular data.

📚 External Resources

NumPy Structured Arrays Documentation
https://numpy.org/doc/stable/user/basics.rec.html

Understanding NumPy Structured Arrays (Tutorial)
https://jakevdp.github.io/PythonDataScienceHandbook/02.09-structured-data-numpy.html

When to Use Structured Arrays vs Pandas
https://realpython.com/numpy-array-programming/#working-with-heterogeneous-data
