Topic 6.6: Reading and Writing CSV Files
Mastering data persistence with Pandas: reading CSV files, exploring data, and saving results.
You've been creating DataFrames manually by typing data directly into code. In practice, data comes from external sources: files on your computer, databases, APIs, or shared drives. CSV (Comma-Separated Values) files are one of the formats you'll encounter most often in data science.
A CSV file is a plain text file where each row represents a record, and values within each row are separated by commas. The first row typically contains column names. Pandas provides the read_csv() function, which converts a CSV file into a DataFrame.
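To make the format concrete, here is a minimal sketch that parses CSV text held in a string. The names and scores are made up for illustration, and io.StringIO stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical sample data: the first line is the header row
csv_text = """Name,Math,Science
Asha,85,90
Ben,78,82
"""

# StringIO lets read_csv() treat the string like an open file
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.columns.tolist())  # column names come from the first row
print(sample.shape)             # (number of rows, number of columns)
```

Each line after the header becomes one row of the DataFrame, and each comma-separated value lands in the column named by the header.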
    import pandas as pd

    # Read the student scores CSV file
    df = pd.read_csv('6_6_student_scores.csv')

    # Display the first few rows
    print(df.head())
The read_csv() function takes a file path as input. If the file is in the same directory as your notebook, you only need the filename. Otherwise, provide a relative or complete path. The function reads column names from the first row, infers a data type for each column, and returns a DataFrame.
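The sketch below illustrates both points: reading from a complete path and letting Pandas infer data types. The file name scores_example.csv and its contents are hypothetical, and the file is written to a temporary folder only so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical file created in a temporary folder so the sketch is self-contained
with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "scores_example.csv"
    csv_path.write_text("Name,Math,GPA\nAsha,85,3.7\nBen,78,3.1\n")

    # A complete path works the same way as a bare filename
    scores = pd.read_csv(csv_path)

# Whole numbers are inferred as integers, decimals as floats, names as text
print(scores.dtypes)
```

Here the Math column is inferred as an integer type and GPA as a floating-point type, while Name is stored as text.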
After loading data from a CSV, always explore it. Use .head() to verify the data loaded correctly, .info() to check column names and data types, and .describe() to get statistical summaries. This catches problems early: a numeric column might load as text, or unexpected missing values might appear.
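As an illustration of the first problem, the following sketch shows a numeric column loading as text because of one stray entry. The data is invented, and pd.to_numeric() with errors="coerce" is just one possible way to repair the column, not the only one.

```python
import io
import pandas as pd

# Hypothetical data: "absent" is a text entry in an otherwise numeric column
csv_text = "Name,Math\nAsha,85\nBen,absent\nCara,92\n"
raw = pd.read_csv(io.StringIO(csv_text))

# The stray text value prevents Math from loading as a numeric column
print(pd.api.types.is_numeric_dtype(raw["Math"]))  # False

# errors="coerce" turns entries that cannot be parsed into NaN (missing)
raw["Math"] = pd.to_numeric(raw["Math"], errors="coerce")
print(raw["Math"].isna().sum())  # 1
```

After the conversion the column is numeric, and the unparseable entry shows up as a missing value that can be handled explicitly.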
Once data is loaded, you need to understand its structure before analysis. Pandas provides several essential methods for this purpose.
    # Check the DataFrame structure
    print("DataFrame information:")
    df.info()  # info() prints its report directly and returns None

    # Get statistical summary of numeric columns
    print("\nStatistical summary:")
    print(df.describe())
The .info() method shows column names, counts of non-null values, and data types. This reveals missing data and type mismatches immediately. By default, the .describe() method generates statistical summaries (count, mean, standard deviation, minimum, maximum, quartiles) for all numeric columns, showing how your data is distributed.
After modifying a DataFrame (adding columns, filtering data, or cleaning missing values), you often need to save the results. The to_csv() method writes the entire DataFrame to a CSV file, creating it if it doesn't exist or overwriting it if it does.
    # Add calculated columns
    df['Total'] = df['Math'] + df['Science'] + df['English'] + df['History']
    df['Average'] = df['Total'] / 4

    # Save to a new CSV file
    df.to_csv('6_6_student_scores_updated.csv', index=False)
    print("Data saved successfully")
By default, to_csv() saves the DataFrame's index as the first column in the CSV file. If your index consists of default numeric row labels (0, 1, 2, 3...), you don't need to save it, because it clutters the file without adding value. Prevent this by setting index=False when calling the method.
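The difference is easy to see by comparing the text that to_csv() produces in each case. This is a small sketch with made-up data; it relies on the fact that to_csv() returns the CSV text as a string when no file path is given.

```python
import pandas as pd

# Hypothetical two-row DataFrame with the default 0, 1 index
demo = pd.DataFrame({"Name": ["Asha", "Ben"], "Math": [85, 78]})

# With no file path, to_csv() returns the CSV text instead of writing a file
with_index = demo.to_csv()
without_index = demo.to_csv(index=False)

# Compare the header rows of the two versions
print(with_index.splitlines()[0])     # ,Name,Math  (extra unnamed index column)
print(without_index.splitlines()[0])  # Name,Math
```

The leading comma in the first header marks the unnamed index column, which would reappear as an extra column when the file is read back.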
After saving important data, verify the file by reading it back with read_csv(). This ensures the data was written correctly and can be recovered later. When you read a file without an explicit index, Pandas creates a new default numeric index.
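A round-trip check along these lines might look like the sketch below. The DataFrame contents and the file name roundtrip_example.csv are hypothetical, and a temporary folder is used so nothing is left behind on disk.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data to save and read back
original = pd.DataFrame({"Name": ["Asha", "Ben"], "Math": [85, 78]})

with tempfile.TemporaryDirectory() as folder:
    out_path = Path(folder) / "roundtrip_example.csv"
    original.to_csv(out_path, index=False)  # write without the index
    restored = pd.read_csv(out_path)        # read the file back

# Compare structure and values of the saved and reloaded data
print(restored.shape == original.shape)
print(restored["Math"].tolist() == original["Math"].tolist())
print(restored.index.tolist())  # a fresh default numeric index
```

Comparing the shape and a few columns is usually enough for a quick sanity check; a stricter comparison would also look at data types.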
- CSV files are the standard format for storing and sharing tabular data in data science.
- read_csv() converts CSV files into Pandas DataFrames, automatically detecting column names and data types.
- Always explore loaded data with .head(), .info(), and .describe() before analysis.
- to_csv() saves DataFrames to CSV files; use index=False to exclude the default row index.
- Use .head() and .tail() to preview the beginning and end of a dataset.
- Use .info() to check data types and identify missing values; use .describe() for statistical summaries.
- Pandas IO Tools: Reading and Writing CSV Files
  https://pandas.pydata.org/docs/user_guide/io.html#csv-text-files
- Pandas read_csv() Function Documentation
  https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
- Pandas to_csv() Function Documentation
  https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
- Real Python: Pandas Read and Write Files
  https://realpython.com/pandas-read-write-files/