Topic 6.1: Introduction to Pandas
Building data structures that carry their own context and meaning
In Level 2 you worked with Python lists and basic data containers. At the start of this level you were introduced to NumPy arrays: fast, contiguous blocks of numbers designed for vectorized computation. NumPy is powerful for pure numerical work, but it carries a structural gap: it stores values without any names attached. When you look at a NumPy array, you see rows and columns of numbers, and you must remember externally what each row and column represents.
Pandas was built directly on top of NumPy to close that gap. It keeps NumPy's computational engine working internally while adding a layer of labeled structure. This topic introduces that labeled structure, the DataFrame, and explains why it changes the way you interact with data.
Imagine working with a spreadsheet containing student grades. Each row shows a student's name, and each column represents a subject. When someone asks 'What is Mona's grade in Science?', you can answer instantly because the data carries its own labels: the row headers show student names, and the column headers show subjects.
Now imagine the same data stored in a NumPy array. You see only numbers arranged in rows and columns. To answer the same question, you must remember that row 1 is Mona (because indexing starts at 0), and column 1 is Science. If you forget this mapping, or if someone else reads your code, the entire dataset becomes meaningless. This is the fundamental limitation that Pandas was designed to solve.
```python
import numpy as np

# Student grades in NumPy - just numbers
grades_array = np.array([
    [85, 92, 78, 88],  # Row 0: Ahmed
    [90, 88, 85, 92],  # Row 1: Mona
    [78, 85, 90, 80],  # Row 2: Youssef
    [92, 90, 88, 95],  # Row 3: Fatima
    [88, 78, 92, 85],  # Row 4: Hassan
])
# Columns: Math, Science, English, History (must remember this!)

# To get Mona's Science grade:
mona_science = grades_array[1, 1]  # 88

# Problem: No one can tell what this means without external documentation
print(mona_science)
```
This approach creates several real problems. First, you must constantly reference external documentation or comments to understand what each position represents. Second, if the data order changes, your entire codebase breaks. Third, the code becomes difficult to maintain as datasets grow larger. A hundred students and twenty subjects would be nearly impossible to work with using pure index numbers.
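The second problem, code breaking when the data order changes, can be illustrated with a minimal sketch. The re-sorting step is a hypothetical change made up for this example; the point is only that a hard-coded position keeps running without any error while quietly pointing at a different student.

```python
import numpy as np

# Same grades as above; rows are Ahmed, Mona, Youssef, Fatima, Hassan
grades_array = np.array([
    [85, 92, 78, 88],
    [90, 88, 85, 92],
    [78, 85, 90, 80],
    [92, 90, 88, 95],
    [88, 78, 92, 85],
])

# Hypothetical change: someone re-sorts the rows by Math grade (column 0), lowest first
order = np.argsort(grades_array[:, 0])
sorted_array = grades_array[order]

# The hard-coded "Mona is row 1" assumption no longer holds
before = grades_array[1, 1]   # Mona's Science grade in the original order
after = sorted_array[1, 1]    # now a different student's Science grade
print(before, after)
```

No exception is raised in this situation, which is what makes position-based bookkeeping risky.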
The name Pandas is derived from "panel data", a term from econometrics referring to multi-dimensional structured datasets. The library was created by Wes McKinney in 2008 while working at a financial firm, specifically to handle labeled, heterogeneous data that did not fit NumPy's homogeneous array model.
Pandas introduces the DataFrame, a two-dimensional data structure that combines the computational efficiency of NumPy with meaningful labels. Each row has a label (together, the row labels form the index), and each column has a name. The data becomes self-documenting: anyone reading your code can immediately understand what each number represents.
```python
import pandas as pd

# Create the same data as a DataFrame with labels
grades_df = pd.DataFrame(
    data=[
        [85, 92, 78, 88],
        [90, 88, 85, 92],
        [78, 85, 90, 80],
        [92, 90, 88, 95],
        [88, 78, 92, 85],
    ],
    index=['Ahmed', 'Mona', 'Youssef', 'Fatima', 'Hassan'],
    columns=['Math', 'Science', 'English', 'History'],
)

print(grades_df)
```
| | Math | Science | English | History |
|---|---|---|---|---|
| Ahmed | 85 | 92 | 78 | 88 |
| Mona | 90 | 88 | 85 | 92 |
| Youssef | 78 | 85 | 90 | 80 |
| Fatima | 92 | 90 | 88 | 95 |
| Hassan | 88 | 78 | 92 | 85 |
The DataFrame constructor accepts three key parameters. The data parameter contains the actual values, typically as a list of lists where each inner list represents one row. The index parameter provides labels for the rows. The columns parameter provides labels for the columns. Once created, the DataFrame displays its data in a clear, formatted table structure.
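To see how the three parameters map onto the finished object, here is a small sketch using a cut-down, two-student version of the table. The name `small_df` is made up for this illustration.

```python
import pandas as pd

# data: one inner list per row; index: row labels; columns: column labels
small_df = pd.DataFrame(
    data=[[85, 92], [90, 88]],
    index=['Ahmed', 'Mona'],
    columns=['Math', 'Science'],
)

print(small_df.shape)          # (number of rows, number of columns)
print(list(small_df.index))    # the row labels passed as index
print(list(small_df.columns))  # the column labels passed as columns
```

Each argument can be read back from the DataFrame afterwards, which is a quick way to check that the labels line up with the data as intended.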
Internally, Pandas by default still stores numeric data in NumPy arrays and uses them for computation. This means you keep the same performance benefits: compact typed storage, vectorized operations, and low-level optimization. The labels are stored separately as lightweight index objects. For typical numerical work you are not giving up NumPy's speed; you are adding a thin layer of semantic meaning that makes data analysis far more intuitive.
The row labels in a DataFrame are stored in a special Index object. This is not a simple list: it is an immutable array with fast lookup capabilities. Index objects support operations like intersection, union, and difference, making them powerful tools for data alignment and merging.
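One way to peek at the NumPy layer is to ask a column for its underlying array. The sketch below uses a shortened version of the grades table and assumes the default NumPy-backed storage that Pandas uses for plain integer data; `curved` is a name invented for the example.

```python
import numpy as np
import pandas as pd

grades_df = pd.DataFrame(
    data=[[85, 92], [90, 88], [78, 85]],
    index=['Ahmed', 'Mona', 'Youssef'],
    columns=['Math', 'Science'],
)

# Pull out the raw values behind one column
math_values = grades_df['Math'].to_numpy()
print(type(math_values))

# Vectorized arithmetic works on the whole table at once,
# and the row and column labels stay attached to the result
curved = grades_df + 5
print(curved)
```

The arithmetic is written exactly as it would be for a NumPy array; the difference is that the result is still labeled.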
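The behavior of an Index can be tried out on its own. The following sketch builds two hypothetical class rosters (`class_a` and `class_b` are made-up names) and shows the set operations mentioned above, along with what happens when code tries to change a label in place.

```python
import pandas as pd

# Two hypothetical class rosters stored as Index objects
class_a = pd.Index(['Ahmed', 'Mona', 'Youssef'])
class_b = pd.Index(['Mona', 'Fatima', 'Youssef'])

both = class_a.intersection(class_b)   # students in both classes
either = class_a.union(class_b)        # students in at least one class
only_a = class_a.difference(class_b)   # students only in class A
print(both, either, only_a, sep='\n')

# An Index is immutable: assigning to one position raises a TypeError
try:
    class_a[0] = 'Hassan'
    was_mutated = True
except TypeError:
    was_mutated = False
print(was_mutated)
```

Immutability is part of what lets Pandas share and align labels safely between objects.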
Once your data lives in a labeled DataFrame, many everyday questions become quick to answer. With the student-grades table above, Pandas lets you do things like:
- Get the average (mean) of all students' grades in a subject. For example, the class average in Math is 86.6.
- Get one specific value directly by its labels. For example, Mona's Science grade is 88.
- Find the student with the highest grade in a subject. For example, Fatima has the top Math grade of 92.
- Filter, sort, group, and combine data without losing track of which row and column each value belongs to.
Notice that each result comes back with its meaning intact, because the rows and columns keep their names. For now, focus on the idea that these questions are easy to ask with Pandas. You will learn exactly how to do each of them in the topics that follow this week.
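As a brief preview only, the first three questions might look something like the sketch below. The methods used here (`mean`, `loc`, `idxmax`, `max`) are standard Pandas calls that later topics cover properly, so there is no need to memorize them yet.

```python
import pandas as pd

grades_df = pd.DataFrame(
    data=[
        [85, 92, 78, 88],
        [90, 88, 85, 92],
        [78, 85, 90, 80],
        [92, 90, 88, 95],
        [88, 78, 92, 85],
    ],
    index=['Ahmed', 'Mona', 'Youssef', 'Fatima', 'Hassan'],
    columns=['Math', 'Science', 'English', 'History'],
)

math_mean = grades_df['Math'].mean()              # class average in Math
mona_science = grades_df.loc['Mona', 'Science']   # one value, looked up by labels
top_math_student = grades_df['Math'].idxmax()     # label of the row with the highest Math grade
top_math_grade = grades_df['Math'].max()          # the highest Math grade itself

print(math_mean, mona_science, top_math_student, top_math_grade)
```

Every lookup is phrased in terms of names such as 'Mona' and 'Math' rather than positions.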
Students new to Pandas often arrive with assumptions that lead to confusion. Two patterns come up repeatedly.
Pandas does not replace NumPy. By default, numeric DataFrame columns are backed by NumPy arrays. When Pandas performs arithmetic on them, it delegates the actual number-crunching to NumPy. Think of Pandas as a labeling and coordination layer that sits on top of NumPy, not a competitor to it.
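The two libraries are routinely used together. In this small sketch a NumPy function is applied directly to a DataFrame; the values and the name `scores_df` are invented for the illustration and were chosen only so the square roots come out as whole numbers.

```python
import numpy as np
import pandas as pd

scores_df = pd.DataFrame(
    data=[[81, 64], [100, 49]],
    index=['Ahmed', 'Mona'],
    columns=['Math', 'Science'],
)

# NumPy does the element-wise computation; Pandas keeps the labels on the result
roots = np.sqrt(scores_df)
print(roots)
```

The result is still a DataFrame with the same row and column labels, which shows the division of labor: NumPy computes, Pandas keeps track of what each number means.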
By default, if you do not pass an index argument, Pandas assigns integer positions (0, 1, 2, ...) as row labels. But these are labels, not positions in the NumPy sense. You can replace them with names, dates, or any other hashable values. Confusing the default integer index with positional access is a common source of bugs.
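The difference between a label and a position shows up as soon as the rows are reordered. The sketch below uses `.loc` (lookup by label) and `.iloc` (lookup by position), two standard Pandas accessors that are introduced properly in later topics; the tiny `scores` table is made up for the example.

```python
import pandas as pd

# No index argument, so Pandas assigns the default labels 0, 1, 2
scores = pd.DataFrame(data=[[85], [90], [78]], columns=['Math'])

# Sorting moves the rows, and each row keeps its original label
ranked = scores.sort_values('Math', ascending=False)
print(list(ranked.index))

by_label = ranked.loc[0, 'Math']   # the row labeled 0, wherever it now sits
by_position = ranked.iloc[0, 0]    # the row that is physically first
print(by_label, by_position)
```

After sorting, "label 0" and "first row" refer to different students, which is exactly the confusion described above.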
- NumPy arrays store only values; Pandas DataFrames store values with meaningful row and column labels.
- A DataFrame is created using pd.DataFrame() with three key parameters: data (values), index (row labels), and columns (column labels).
- Pandas uses NumPy arrays internally for storage and computation, maintaining high performance while adding semantic meaning.
- The row labels are stored in an immutable Index object that supports fast lookups and set operations.
- Labeled data makes code self-documenting: anyone reading it can understand what each value represents without external documentation.
- Pandas makes common questions easy to ask, such as the class average in a subject, one student's grade, or the highest grade, and you will learn exactly how to do each of these in the topics that follow this week.
- Pandas Official Documentation: Getting Started
  https://pandas.pydata.org/docs/getting_started/index.html
- Real Python: Pandas DataFrame Tutorial
  https://realpython.com/pandas-dataframe/
- Pandas User Guide: Intro to Data Structures
  https://pandas.pydata.org/docs/user_guide/dsintro.html