🗂️
Pandas DataFrame Data Structures Data Analysis

Topic 6.5: The DataFrame - The Relational Engine

Mastering Pandas' core data structure for organizing and analyzing tabular data.

🎯 From Series to DataFrame

📂 Dataset for this topic — 6_5_inventory.csv: ⬇ Download 6_5_inventory.csv

We learned about Series, the intelligent column with labels. But real data rarely comes as a single column. Usually it arrives as a complete table: students with multiple subjects, products with multiple attributes, inventory with multiple characteristics. A DataFrame is how Pandas handles this reality.

A DataFrame is a two-dimensional table with rows and columns, where each column is a Series. Think of it as a spreadsheet in Python, but with programming power built in. Excel has rows and columns, but a DataFrame adds analysis tools to filter, transform, and aggregate data with code instead of mouse clicks.

Imagine you have three separate Series: one contains product names (Koshari, Falafel, Mahshi), another contains prices (25, 15, 35), and a third contains stock quantities (120, 200, 80). Each Series alone is useful, but they are disconnected. You cannot see which price belongs to which product, or how much stock each item has. A DataFrame solves this by organizing all three Series into columns of a single table, where each row represents one complete product record.

DataFrames keep related data together. When you have product names, prices, stock levels, and categories in one DataFrame, you can ask complex questions: which products are low in stock? What is the average price by category? Which items should we reorder? The DataFrame structure makes these questions answerable with just a few lines of code.

🔨 Three Methods for Creating DataFrames

Pandas provides three primary methods for creating DataFrames, each suited to different data sources and situations. Understanding when to use each method makes data loading efficient and intuitive.

The first method uses a dictionary of lists. The keys become the column names, and the values are lists containing the data for each column. This method is straightforward and readable - you define the structure by naming each column, then provide all the values for that column as a list.

Python
import pandas as pd
import numpy as np

# Method 1: Dictionary of lists
df1 = pd.DataFrame({
    'Product': ['Koshari', 'Falafel', 'Mahshi', 'Molokhia'],
    'Price': [25, 15, 35, 30],
    'Stock': [120, 200, 80, 95],
    'Category': ['Main Dish', 'Appetizer', 'Main Dish', 'Main Dish']
})

print(df1)
# Output:
#     Product  Price  Stock    Category
# 0   Koshari     25    120  Main Dish
# 1   Falafel     15    200  Appetizer
# 2    Mahshi     35     80  Main Dish
# 3  Molokhia     30     95  Main Dish
▶ Output
Product Price Stock Category 0 Koshari 25 120 Main Dish 1 Falafel 15 200 Appetizer 2 Mahshi 35 80 Main Dish 3 Molokhia 30 95 Main Dish

Pandas automatically aligns the data so that the first item in each list forms the first row, the second item in each list forms the second row, and so on. All lists must have the same length, otherwise Pandas cannot create a proper rectangular table.

The second method uses a list of dictionaries. Each dictionary represents one complete row, where the keys are the column names and the values are the data points in that row.

Python
# Method 2: List of dictionaries
inventory = [
    {'Product': 'Koshari', 'Price': 25, 'Stock': 120, 'Category': 'Main Dish'},
    {'Product': 'Falafel', 'Price': 15, 'Stock': 200, 'Category': 'Appetizer'},
    {'Product': 'Mahshi', 'Price': 35, 'Stock': 80, 'Category': 'Main Dish'},
    {'Product': 'Molokhia', 'Price': 30, 'Stock': 95, 'Category': 'Main Dish'}
]

df2 = pd.DataFrame(inventory)

print(df2)
# Output:
#     Product  Price  Stock    Category
# 0   Koshari     25    120  Main Dish
# 1   Falafel     15    200  Appetizer
# 2    Mahshi     35     80  Main Dish
# 3  Molokhia     30     95  Main Dish
▶ Output
Product Price Stock Category 0 Koshari 25 120 Main Dish 1 Falafel 15 200 Appetizer 2 Mahshi 35 80 Main Dish 3 Molokhia 30 95 Main Dish

Pandas collects all the keys from all dictionaries to determine the column names, then fills in the values row by row. If some dictionaries have keys that others lack, Pandas fills those positions with NaN (Not a Number) to maintain the rectangular table structure.

The third method uses a NumPy array combined with explicit labels. This combines the computational speed of NumPy with the organizational clarity of Pandas. The array contains just raw numbers, without any labels or structure. By providing the index and columns parameters, you transform those numbers into a labeled DataFrame where each value has both a row name and a column name.

Python
# Method 3: NumPy array with labels
data = np.array([[25, 120], [15, 200], [35, 80], [30, 95]])

df3 = pd.DataFrame(
    data,
    index=['Koshari', 'Falafel', 'Mahshi', 'Molokhia'],
    columns=['Price', 'Stock']
)

print(df3)
# Output:
#           Price  Stock
# Koshari      25    120
# Falafel      15    200
# Mahshi       35     80
# Molokhia     30     95
▶ Output
Price Stock Koshari 25 120 Falafel 15 200 Mahshi 35 80 Molokhia 30 95

Use this method when you are working with numerical computations in NumPy and want to add meaningful labels for analysis. The array provides computational efficiency, while the DataFrame adds interpretability. This is common in machine learning workflows where you compute features with NumPy, then organize them in a DataFrame for exploration.

🔍 Understanding DataFrame Structure

A DataFrame has several components that work together to organize data. Understanding these components helps you navigate and manipulate datasets effectively.

Every DataFrame has an index, columns, and values. The index identifies rows - by default, these are sequential integers starting from zero, but you can set them to meaningful labels like product names or dates. The columns are the names of each column, typically strings that describe what data that column contains. The values are the actual data stored in the table.

Python
df = pd.DataFrame({
    'Product': ['Koshari', 'Falafel', 'Mahshi', 'Molokhia'],
    'Price': [25, 15, 35, 30],
    'Stock': [120, 200, 80, 95],
    'Category': ['Main Dish', 'Appetizer', 'Main Dish', 'Main Dish']
})

# Access the index
print(df.index)
# Output: RangeIndex(start=0, stop=4, step=1)

# Access the columns
print(df.columns)
# Output: Index(['Product', 'Price', 'Stock', 'Category'], dtype='object')

# Access the values as a NumPy array
print(df.values)
# Output:
# [['Koshari' 25 120 'Main Dish']
#  ['Falafel' 15 200 'Appetizer']
#  ['Mahshi' 35 80 'Main Dish']
#  ['Molokhia' 30 95 'Main Dish']]
▶ Output
RangeIndex(start=0, stop=4, step=1) Index(['Product', 'Price', 'Stock', 'Category'], dtype='object') [['Koshari' 25 120 'Main Dish'] ['Falafel' 15 200 'Appetizer'] ['Mahshi' 35 80 'Main Dish'] ['Molokhia' 30 95 'Main Dish']]

Each column in a DataFrame is a Series. When you extract a single column, you get a Series object with all the properties and methods we learned in the previous topic. This consistency makes working with DataFrames predictable - operations you learned on Series work the same way on DataFrame columns.

Python
# Extract a column as a Series
prices = df['Price']
print(type(prices))
# Output: <class 'pandas.core.series.Series'>

print(prices)
# Output:
# 0    25
# 1    15
# 2    35
# 3    30
# Name: Price, dtype: int64
▶ Output
<class 'pandas.core.series.Series'> 0 25 1 15 2 35 3 30 Name: Price, dtype: int64

The dtypes attribute shows the data type of each column. This is critical information when working with data. If you expect a Price column to be numeric but dtypes shows it as object, that signals a problem - perhaps there are text values mixed in with numbers, which would prevent mathematical operations.

Python
# Check data types for all columns
print(df.dtypes)
# Output:
# Product     object
# Price        int64
# Stock        int64
# Category    object
# dtype: object
▶ Output
Product object Price int64 Stock int64 Category object dtype: object

Here, Product and Category are object type, which means they contain strings. Price and Stock are int64, which means they are integers. This confirms the data is structured correctly for analysis - text columns contain text, numeric columns contain numbers.

👁️ Inspecting DataFrames: Quick Views

When you open a large DataFrame with thousands of rows, printing the entire thing overwhelms your screen and provides little insight. Pandas provides methods for quick inspection that show you just enough to understand the data without flooding your terminal.

The head method returns the first five rows by default. The tail method returns the last five rows. You can pass a different number as an argument if you want more or fewer rows. For example, head(3) shows the first three rows, and tail(10) shows the last ten rows.

Python
# Load a sample inventory dataset
df = pd.read_csv('6_5_inventory.csv')

# View first 3 rows
print(df.head(3))
# Output:
#    Product  Price  Stock    Category
# 0  Koshari     25    120  Main Dish
# 1  Falafel     15    200  Appetizer
# 2   Mahshi     35     80  Main Dish

# View last 2 rows
print(df.tail(2))
# Output:
#      Product  Price  Stock    Category
# 8   Ful Medames     10    180  Appetizer
# 9      Basbousa     20     70     Dessert
▶ Output
Product Price Stock Category 0 Koshari 25 120 Main Dish 1 Falafel 15 200 Appetizer 2 Mahshi 35 80 Main Dish Product Price Stock Category 8 Ful Medames 10 180 Appetizer 9 Basbousa 20 70 Dessert

Head shows what the data looks like at the beginning, revealing the structure and column names. This is typically the first command you run when loading a new dataset. Tail shows what the data looks like at the end, which can reveal incomplete or unusual entries at the bottom of the file. Sometimes data collection processes fail partway through, leaving the last rows incomplete or corrupted.

📊 Inspecting DataFrames: Detailed Summaries

Beyond quick views of the data itself, Pandas provides methods that summarize the DataFrame's structure and statistics. These methods answer questions like: how many rows are there? What data types are present? Are there missing values? What is the range of numerical values?

The info method gives you a comprehensive summary of the DataFrame: the number of rows, the number of columns, the data types of each column, and how many non-null values each column contains. This is one of the most important inspection methods because it reveals data quality issues at a glance.

Python
# Get DataFrame structure and metadata
print(df.info())
# Output:
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 10 entries, 0 to 9
# Data columns (total 4 columns):
#  #   Column    Non-Null Count  Dtype 
# ---  ------    --------------  ----- 
#  0   Product   10 non-null     object
#  1   Price     10 non-null     int64 
#  2   Stock     9 non-null      int64 
#  3   Category  10 non-null     object
# dtypes: int64(2), object(2)
# memory usage: 448.0+ bytes
▶ Output
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10 entries, 0 to 9 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Product 10 non-null object 1 Price 10 non-null int64 2 Stock 9 non-null int64 3 Category 10 non-null object dtypes: int64(2), object(2)

Run info first when you open a new dataset. The output tells you that this DataFrame has 10 rows and 4 columns. It shows the data types for each column, confirming whether columns are numeric or text. Most importantly, it shows the non-null count. If a column has fewer non-null entries than total rows, that means there are missing values. In this example, the Stock column has only 9 non-null values out of 10 rows, indicating one missing value that needs investigation.

The describe method provides statistical summaries for all numerical columns: count, mean, standard deviation, minimum, quartiles, and maximum. This shows you the range of values, the average, and whether there are any extreme outliers.

Python
# Get statistical summary
print(df.describe())
# Output:
#            Price       Stock
# count  10.000000    9.000000
# mean   23.500000  117.777778
# std     9.438834   58.871603
# min    10.000000   50.000000
# 25%    17.500000   75.000000
# 50%    22.500000  100.000000
# 75%    30.000000  150.000000
# max    40.000000  200.000000
▶ Output
Price Stock count 10.000000 9.000000 mean 23.500000 117.777778 std 9.438834 58.871603 min 10.000000 50.000000 25% 17.500000 75.000000 50% 22.500000 100.000000 75% 30.000000 150.000000 max 40.000000 200.000000

The describe output tells you that prices range from 10 to 40, with an average of 23.5. Stock levels range from 50 to 200, with an average of about 118. The standard deviation tells you how spread out the values are - a large standard deviation relative to the mean indicates high variability. The quartiles (25%, 50%, 75%) show the distribution of values, helping you identify whether data is concentrated or spread evenly.

The shape attribute returns a tuple containing the number of rows and columns. This is useful for quick size checks, especially when working with large datasets or when you need to verify that data operations produced the expected number of rows.

Python
# Get dimensions
print(df.shape)
# Output: (10, 4)
# This means 10 rows and 4 columns

# Access rows and columns separately
num_rows = df.shape[0]
num_cols = df.shape[1]

print(f"The dataset has {num_rows} rows and {num_cols} columns.")
# Output: The dataset has 10 rows and 4 columns.
▶ Output
(10, 4) The dataset has 10 rows and 4 columns.

The columns attribute returns the column names as an Index object, which you can convert to a list. This is useful when you need to iterate over columns or check whether a specific column exists.

Python
# Get column names
print(df.columns)
# Output: Index(['Product', 'Price', 'Stock', 'Category'], dtype='object')

# Convert to list
column_list = list(df.columns)
print(column_list)
# Output: ['Product', 'Price', 'Stock', 'Category']

# Check if a column exists
if 'Price' in df.columns:
    print("Price column exists")
# Output: Price column exists
▶ Output
Index(['Product', 'Price', 'Stock', 'Category'], dtype='object') ['Product', 'Price', 'Stock', 'Category'] Price column exists

The index attribute returns the row labels. For a default DataFrame, this is a RangeIndex starting at zero. For DataFrames with custom row labels, this shows those labels.

Python
# Get row index
print(df.index)
# Output: RangeIndex(start=0, stop=10, step=1)

# Get first 5 index values
print(list(df.index[:5]))
# Output: [0, 1, 2, 3, 4]
▶ Output
RangeIndex(start=0, stop=10, step=1) [0, 1, 2, 3, 4]
🎓 Why Inspection Matters

Inspecting your DataFrame before analysis is not optional. Skipping this step leads to errors, incorrect conclusions, and wasted time. Professional data scientists follow a consistent inspection workflow when loading new data.

First, run info to understand the structure. How many rows and columns are there? What are the data types? Are there missing values? This gives you the big picture.

Second, run head to see the actual data. Does it look reasonable? Are the column names clear? Are there any obvious errors in the first few rows?

Third, run describe to check the statistics. Are the ranges sensible? If you have a Price column and the maximum is 1 million, that might be an error. If the minimum is negative, that is definitely suspicious. Do the mean and median make sense for this domain?

Fourth, check the shape. Does the number of rows match your expectations? If you are loading a file that should have 1000 records but shape shows 950, something went wrong during the import.

This inspection workflow takes less than a minute but prevents hours of debugging later. It catches data quality issues early when they are easy to fix, rather than discovering them halfway through an analysis when they invalidate all your work.

  • A DataFrame is a two-dimensional table where each column is a Series, combining multiple related data columns into one structure.
  • Create DataFrames using dictionary of lists (column-oriented), list of dictionaries (row-oriented), or NumPy arrays with labels.
  • Every DataFrame has an index (row labels), columns (column names), and values (the actual data).
  • Use head() to view the first few rows and tail() to view the last few rows for quick data inspection.
  • Use info() to see DataFrame structure, data types, and missing values - run this first on new datasets.
  • Use describe() to get statistical summaries of all numerical columns, revealing ranges and distributions.
  • The shape attribute gives you (rows, columns), columns gives column names, and dtypes shows data types.
  • Proper inspection before analysis catches data quality issues early and prevents errors later.
Quick Check
?
You have three separate Series: one with product names, one with prices, and one with stock levels. What is the correct way to combine them into a single DataFrame?
?
You load a new CSV dataset and want to check whether any column has missing values. Which method do you run first?
?
What does df.shape return for a DataFrame with 50 rows and 6 columns?
📚External Resources