
Topic 6.6: Reading and Writing CSV Files

Mastering data persistence with Pandas: reading CSV files, exploring data, and saving results.

6_6_student_scores.csv
Student scores dataset used in all code examples in this topic. Download it and place it in the same folder as your notebook before running the code.
Loading Data from CSV Files

You've been creating DataFrames manually by typing data directly into code. In practice, data comes from external sources: files on your computer, databases, APIs, or shared drives. CSV files (Comma-Separated Values) are the standard format you'll encounter most often in data science.

A CSV file is a plain text file where each row represents a record, and values within each row are separated by commas. The first row typically contains column names. Pandas provides the read_csv() function, which reads a CSV file into a DataFrame.
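To make the format concrete, here is a minimal sketch that parses raw CSV text held in a string. The three records are the first rows of this topic's dataset; using io.StringIO in place of a real file is an assumption made only so the example runs without anything on disk.

```python
import io
import pandas as pd

# Raw CSV text: a header row, then one record per line, values separated by commas.
csv_text = """Student,Math,Science,English,History
Ahmed,85,90,78,82
Sara,92,88,95,87
Omar,76,82,80,74
"""

# read_csv accepts a file path or a file-like object; StringIO stands in for a file here.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.columns.tolist())  # column names taken from the header row
print(sample.shape)             # (3, 5): three records, five columns
```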

Python
import pandas as pd

# Read the student scores CSV file
df = pd.read_csv('6_6_student_scores.csv')

# Display the first few rows
print(df.head())
Output
  Student  Math  Science  English  History
0   Ahmed    85       90       78       82
1    Sara    92       88       95       87
2    Omar    76       82       80       74
3  Fatima    88       91       84       86
4  Khaled    70       75       72       68

The read_csv() function takes a file path as input. If the file is in the same directory as your notebook, the filename alone is enough. Otherwise, provide a relative or full path to the file. By default, the function takes column names from the first row, infers a data type for each column, and returns a DataFrame.
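The sketch below illustrates loading a file that is not next to the notebook. The "data" subfolder and the two-row file it writes are hypothetical, created in a temporary directory purely so the example is runnable; in your own project you would point the path at wherever the CSV actually lives.

```python
from pathlib import Path
import tempfile
import pandas as pd

# Hypothetical layout: the CSV sits in a "data" subfolder rather than beside the notebook.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    csv_path = data_dir / "6_6_student_scores.csv"
    csv_path.write_text("Student,Math\nAhmed,85\nSara,92\n", encoding="utf-8")

    # Pass the path to the file; read_csv accepts either a str or a Path object.
    scores = pd.read_csv(csv_path)

# The data types were inferred from the file contents.
print(scores.dtypes)
```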

Explore First, Analyze Later

After loading data from a CSV, always explore it. Use .head() to verify the data loaded correctly, .info() to check column names and data types, and .describe() to get statistical summaries. This catches problems early: a numeric column might load as text, or unexpected missing values might appear.
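As an illustration of the "numeric column loads as text" problem, the sketch below uses a made-up three-row file in which one score was typed as the word "absent". The data is hypothetical, and converting with pd.to_numeric(errors="coerce") is one possible repair, not the only one.

```python
import io
import pandas as pd

# Hypothetical messy file: one Math score was entered as "absent" instead of a number.
messy_csv = """Student,Math
Ahmed,85
Sara,absent
Omar,76
"""

messy = pd.read_csv(io.StringIO(messy_csv))

# Because of the stray text, Math is not read as a numeric column.
print(pd.api.types.is_numeric_dtype(messy["Math"]))  # False

# errors="coerce" turns values that cannot be parsed into NaN (missing).
messy["Math"] = pd.to_numeric(messy["Math"], errors="coerce")
print(messy["Math"].isna().sum())  # 1
```

Checking .info() right after loading would have flagged this column before any calculation was attempted on it.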

๐Ÿ” Exploring Loaded Data
โ–ผ

Once data is loaded, you need to understand its structure before analysis. Pandas provides several essential methods for this purpose.

Python
# Check the DataFrame structure
# info() prints its report directly and returns None, so it is not wrapped in print()
print("DataFrame information:")
df.info()

# Get statistical summary of numeric columns
print("\nStatistical summary:")
print(df.describe())
Output
DataFrame information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Student  8 non-null      object
 1   Math     8 non-null      int64
 2   Science  8 non-null      int64
 3   English  8 non-null      int64
 4   History  8 non-null      int64
dtypes: int64(4), object(1)
memory usage: 452.0+ bytes

Statistical summary:
         Math  Science  English  History
count   8.000    8.000    8.000    8.000
mean   83.000   86.000   83.125   80.875
std     8.498    5.809    7.748    7.641
min    70.000   75.000   72.000   68.000
25%    77.500   83.500   77.750   75.500
50%    82.500   86.500   82.000   82.500
75%    89.000   90.250   89.250   86.250
max    95.000   93.000   95.000   91.000

The .info() method shows column names, counts of non-null values, and data types. This reveals missing data and type mismatches immediately. The .describe() method generates statistical summaries (mean, standard deviation, minimum, maximum, quartiles) for all numeric columns, showing how your data is distributed.

  • Rule 1: .head() displays the first 5 rows (or a number you specify, such as .head(10)).
  • Rule 2: .info() shows column names, data types, and the count of non-null values.
  • Rule 3: .describe() generates a statistical summary of numeric columns.
  • Rule 4: .shape returns the dimensions (rows, columns) of the DataFrame.
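A short sketch of .shape and .head() in use. The DataFrame is typed in by hand from the five rows shown earlier in this topic (three columns only) so the example does not depend on the CSV file; the name df_demo is illustrative.

```python
import pandas as pd

# The five rows shown earlier in this topic, typed in so the example is self-contained.
df_demo = pd.DataFrame({
    "Student": ["Ahmed", "Sara", "Omar", "Fatima", "Khaled"],
    "Math": [85, 92, 76, 88, 70],
    "Science": [90, 88, 82, 91, 75],
})

rows, cols = df_demo.shape   # .shape is an attribute, so no parentheses
print(rows, cols)            # 5 3

print(df_demo.head(2))       # first two rows only
print(len(df_demo.head()))   # 5, the default number of rows
```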
Saving Data with to_csv()

After modifying a DataFrame (adding columns, filtering data, or cleaning missing values), you often need to save the results. The to_csv() method writes the entire DataFrame to a CSV file, creating it if it doesn't exist or overwriting it if it does.

Python
# Add calculated columns
df['Total'] = df['Math'] + df['Science'] + df['English'] + df['History']
df['Average'] = df['Total'] / 4

# Save to a new CSV file
df.to_csv('6_6_student_scores_updated.csv', index=False)
print("Data saved successfully")
Output
Data saved successfully

By default, to_csv() saves the DataFrame's index as the first column in the CSV file. If your index consists of default numeric row labels (0, 1, 2, 3...), you don't need to save it; it clutters the file without adding value. Prevent this by setting index=False when calling the method.
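To see the difference, the sketch below writes the same two-row DataFrame both ways. It relies on to_csv() returning the CSV text as a string when no file path is given, which keeps the example free of files; the variable names are illustrative.

```python
import io
import pandas as pd

small = pd.DataFrame({"Student": ["Ahmed", "Sara"], "Math": [85, 92]})

# With no path argument, to_csv returns the CSV text as a string.
with_index = small.to_csv()
without_index = small.to_csv(index=False)

print(with_index)      # header starts with an empty name, rows start with 0, 1
print(without_index)   # header is just Student,Math

# Reading the indexed version back produces an extra, unnamed first column.
reloaded = pd.read_csv(io.StringIO(with_index))
print(reloaded.columns.tolist())  # ['Unnamed: 0', 'Student', 'Math']
```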

Round-Trip Verification

After saving important data, verify the file by reading it back with read_csv(). This ensures the data was written correctly and can be recovered later. When you read a file without an explicit index, Pandas creates a new default numeric index.
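One way to carry out this check is sketched below. The file name round_trip_check.csv and the temporary directory are hypothetical choices made so the example cleans up after itself; pd.testing.assert_frame_equal is used here as one possible comparison, and it raises an error if the two DataFrames differ.

```python
import tempfile
from pathlib import Path
import pandas as pd

original = pd.DataFrame({
    "Student": ["Ahmed", "Sara", "Omar"],
    "Math": [85, 92, 76],
    "Science": [90, 88, 82],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "round_trip_check.csv"  # hypothetical filename
    original.to_csv(out_path, index=False)
    restored = pd.read_csv(out_path)

# assert_frame_equal raises an AssertionError if values, columns, or dtypes differ.
pd.testing.assert_frame_equal(original, restored)
print("Round trip OK:", restored.shape)
```

Columns holding floating-point results may need a tolerance-based comparison rather than an exact one, so treat this as a starting point.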

Topic Summary
  • CSV files are the standard format for storing and sharing tabular data in data science.
  • read_csv() converts CSV files into Pandas DataFrames, automatically detecting column names and data types.
  • Always explore loaded data with .head(), .info(), and .describe() before analysis.
  • to_csv() saves DataFrames to CSV files; use index=False to exclude the default row index.
  • Use .head() and .tail() to preview the beginning and end of a dataset.
  • Use .info() to check data types and identify missing values; use .describe() for statistical summaries.