Topic 8.1: Data Quality
The six dimensions that make data trustworthy, and why raw data almost never meets them
A large hospital network collected patient records across twenty clinics over five years. Each clinic used a slightly different intake form: one clinic recorded age as a number, another as a date of birth, and a third skipped it entirely for some patients. Several patients appear twice because they visited two different branches. Blood pressure was sometimes recorded as a single average number rather than as systolic/diastolic. Before any doctor could study health trends in this data, an entire team spent weeks just making the records usable.
This is not an unusual situation; it is the default. Real-world data is collected by humans, entered by hand, exported from systems that were never designed to work together, and accumulated over years without consistent standards. The result is almost never clean. It is raw: unverified, inconsistent, and full of gaps.
A widely cited industry estimate holds that data scientists spend approximately 80% of their working time preparing and cleaning data, and only 20% building models or drawing insights. This ratio surprises students who expect data science to be mostly about algorithms. The discipline of data preparation is where most of the professional value is created.
The hospital example maps onto almost every data science project. Before analysis can begin, someone must unify the age formats, remove duplicate patients, decide what to do about missing values, and verify that all recorded values fall within plausible ranges. That systematic, careful, and skilled process is what Week 8 is about.
Data quality is not a single property; it is a combination of six measurable dimensions. A dataset can excel on one dimension and fail on another. Understanding each dimension separately allows an analyst to diagnose precisely what is wrong and choose the correct remediation strategy.
These six dimensions are not interchangeable: passing one says nothing about the others. A value can be complete (present) but inaccurate (wrong). A dataset can be consistent (all the same format) but invalid (all the same wrong format). A full quality assessment checks all six dimensions.
| Dimension | Question It Answers | Example of Failure |
|---|---|---|
| Accuracy | Are the values correct? | A student's score recorded as 850 on a 100-point exam. |
| Completeness | Are all required values present? | The 'grade' column is empty for 40 out of 207 students. |
| Consistency | Is the same entity always written the same way? | 'Alex', 'Alexandria', and 'ALEX' all appearing in the city column. |
| Timeliness | Is the data current enough to answer today's questions? | Competition results from 2019 used to evaluate 2024 students. |
| Validity | Does each value conform to its expected type, format, and range? | The date column contains '2024-13-01': month 13 does not exist. |
| Uniqueness | Does each real-world entity appear only once? | The same student appears twice because a system export ran twice. |
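The failure examples in the table above can each be detected with a short check. The following is a minimal sketch in plain Python, not a definitive audit: the records, field names, and the 0-100 score range are assumptions made up for illustration, and timeliness is left out because it depends on the question being asked rather than on the values themselves.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"student_id": 1, "city": "Alex", "score": 85, "date": "2024-03-01"},
    {"student_id": 2, "city": "ALEX", "score": 850, "date": "2024-13-01"},
    {"student_id": 3, "city": "Alexandria", "score": None, "date": "2024-03-02"},
    {"student_id": 1, "city": "Alex", "score": 85, "date": "2024-03-01"},
]


def is_valid_date(text):
    """Validity: does the string parse as a real YYYY-MM-DD calendar date?"""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


# Completeness: how many required score values are absent?
missing_scores = sum(1 for r in records if r["score"] is None)

# Accuracy: a range check catches impossible values (850 on a 100-point exam),
# although it cannot catch a plausible value that is simply wrong.
out_of_range = [
    r["score"] for r in records
    if r["score"] is not None and not 0 <= r["score"] <= 100
]

# Validity: dates that do not exist on the calendar.
invalid_dates = [r["date"] for r in records if not is_valid_date(r["date"])]

# Consistency: distinct spellings currently present in the city column.
city_spellings = sorted({r["city"] for r in records})

# Uniqueness: count rows that are exact copies of an earlier row.
row_counts = Counter(tuple(sorted(r.items())) for r in records)
duplicate_rows = sum(count - 1 for count in row_counts.values())

print(missing_scores, out_of_range, invalid_dates, city_spellings, duplicate_rows)
```

Each check reports a problem without changing the data, so the same record can fail several dimensions at once (the second record fails both the range check and the date check).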
The most important distinction to establish at the start of Week 8 is the difference between raw data and analysis-ready data. These are not just different stages; they represent entirely different fitness levels for the task at hand.
A useful test: ask, 'What would happen if I ran an analysis on this data right now?' If the answer involves plausible errors (averages pulled up by a typo, categories that should be the same counted as different, or a model that silently ignores half the rows), then the data is not analysis-ready. The job of data preparation is to close every one of those gaps.
Analysis-ready data meets four groups of criteria.
Complete:
- No critical missing values remain
- Every row has enough information to be useful
- Gaps are filled with justified estimates or intentionally removed
Consistent:
- Every column has one data type throughout
- Text values follow a single format (e.g., Title Case for city names)
- Dates are all in the same structure (e.g., YYYY-MM-DD)
Non-redundant:
- No exact duplicate rows remain
- Each record represents exactly one real-world entity
- No columns that repeat the same information
Accurate and valid:
- Values fall within plausible ranges for their domain
- No obvious data-entry errors remain
- Outliers have been investigated and either corrected or documented
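Several of the checklist items above can be expressed as pass/fail flags. The sketch below is one possible way to do that in plain Python; the function name `readiness_report`, the sample rows, and the use of `None` for a missing cell are all assumptions for illustration. It covers only the items that can be tested mechanically, since items such as "outliers have been investigated" need human judgment.

```python
def readiness_report(rows, required_columns):
    """Return pass/fail flags for a few analysis-readiness checks.

    rows is a list of dicts; None stands in for a missing cell.
    """
    columns = list(rows[0]) if rows else []

    # Complete: no required value is missing.
    complete = all(
        row.get(col) is not None for row in rows for col in required_columns
    )

    # Consistent: each column holds a single Python type (missing cells ignored).
    types_per_column = {
        col: {type(row.get(col)) for row in rows if row.get(col) is not None}
        for col in columns
    }
    consistent_types = all(len(types) <= 1 for types in types_per_column.values())

    # Non-redundant: no exact duplicate rows.
    as_tuples = [tuple(row.get(col) for col in columns) for row in rows]
    no_duplicates = len(as_tuples) == len(set(as_tuples))

    return {
        "complete": complete,
        "consistent_types": consistent_types,
        "no_duplicates": no_duplicates,
    }


# Hypothetical raw rows: a text score, a missing score, and a repeated row.
raw = [
    {"name": "Sara", "score": 72, "city": "Cairo"},
    {"name": "Omar", "score": "88", "city": "Giza"},
    {"name": "Lina", "score": None, "city": "Cairo"},
    {"name": "Sara", "score": 72, "city": "Cairo"},
]
clean = [
    {"name": "Sara", "score": 72, "city": "Cairo"},
    {"name": "Omar", "score": 88, "city": "Giza"},
]

raw_report = readiness_report(raw, ["score"])
clean_report = readiness_report(clean, ["score"])
print(raw_report)
print(clean_report)
```

Returning named flags rather than a single true/false value keeps the diagnosis visible: the analyst can see which criterion failed before deciding how to fix it.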
It is tempting to view data preparation as a preliminary chore, something you get through before the 'real' analysis begins. This framing is inaccurate and dangerous. A model trained on dirty data does not produce dirty results; it produces confident results that are wrong, which is far more harmful than an obviously broken output.
Professional data preparation is a structured diagnostic and remediation process. It follows a consistent sequence: first inspect the data to understand what problems exist, then address each problem type in a planned order, then validate the result before passing the data to any downstream process.
| Phase | What Happens | Professional Purpose |
|---|---|---|
| Inspection | Examine shape, types, missing counts, and summary statistics without changing anything. | Diagnose before treating: understand the problem completely before applying any fix. |
| Structural Repair | Fix column types, resolve schema inconsistencies, standardise formats. | Ensure the data has the correct shape and type for analysis. |
| Content Cleaning | Remove duplicates, handle missing values, correct formatting errors. | Ensure the values inside each cell are valid and consistent. |
| Transformation | Encode categories, scale numeric features, derive new variables. | Convert cleaned data into the form required by the analytical method. |
| Validation | Re-inspect the cleaned dataset to confirm all issues were resolved. | Quality-assure the output before it is used in analysis or modelling. |
Each phase requires judgment, not just syntax. Knowing which method to use, and when, is the skill that separates a careful analyst from someone who merely runs code. The decisions made in data preparation directly determine the validity of every conclusion drawn from the data.
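The five phases in the table above can be sketched as five small functions applied in order. This is an illustrative outline in plain Python under stated assumptions, not a prescribed implementation: the column names, the pass mark of 50, and the choice to leave missing scores as `None` are all hypothetical.

```python
def inspect(rows):
    """Phase 1: report problems without changing anything."""
    return {
        "rows": len(rows),
        "missing_score": sum(1 for r in rows if r["score"] in (None, "")),
        "cities": sorted({r["city"] for r in rows}),
    }


def repair_structure(rows):
    """Phase 2: give the score column one type (float, or None when missing)."""
    return [
        {**r, "score": float(r["score"]) if r["score"] not in (None, "") else None}
        for r in rows
    ]


def clean_content(rows):
    """Phase 3: standardise city casing, then drop exact duplicates."""
    seen, cleaned = set(), []
    for r in rows:
        row = {**r, "city": r["city"].strip().title()}
        key = (row["name"], row["city"], row["score"])
        if key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned


def transform(rows):
    """Phase 4: derive a new variable (a hypothetical pass flag at 50 points)."""
    return [
        {**r, "passed": r["score"] is not None and r["score"] >= 50} for r in rows
    ]


def validate(rows):
    """Phase 5: re-inspect and confirm the earlier problems are gone."""
    keys = [(r["name"], r["city"], r["score"]) for r in rows]
    assert len(keys) == len(set(keys)), "duplicates remain"
    assert all(r["city"] == r["city"].title() for r in rows), "casing not standardised"
    return rows


# Hypothetical raw rows: text scores, one blank score, mixed city casing.
raw = [
    {"name": "Sara", "city": "cairo", "score": "72"},
    {"name": "Omar", "city": "GIZA", "score": ""},
    {"name": "Sara", "city": "Cairo", "score": "72"},
]

report_before = inspect(raw)
prepared = validate(transform(clean_content(repair_structure(raw))))
print(report_before)
print(prepared)
```

The order matters in this sketch: the two Sara rows only become recognisable as duplicates after the city casing has been standardised, which is one reason to plan the sequence rather than apply fixes ad hoc.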
In professional environments, data quality is an ongoing concern, not a one-time fix. Many organisations have dedicated data engineering teams whose job is to keep data pipelines clean, consistent, and validated before analysts ever see the data. Week 8 gives you the vocabulary and skills to take part in that kind of work.
The W8 competition dataset is a concrete illustration of quality problems across multiple dimensions. Looking at the raw dataset through the lens of all six quality dimensions reveals exactly what needs to be fixed before any analysis or modelling can begin.
| Column | Quality Dimension Violated | Observed Problem |
|---|---|---|
| city | Consistency | 'Cairo', 'cairo', 'CAIRO', 'Alex', 'Alexandria', 'Giza', 'giza': 7 spellings for 3 cities |
| score | Completeness | 12 missing values (NaN) out of 207 rows (5.8% incomplete) |
| date | Validity + Consistency | Four different date format conventions in the same column; stored as text instead of datetime |
| All columns | Uniqueness | 7 duplicate rows: the same student appears twice, inflating all counts |
| level | Completeness | 5 missing values (2.4% of level entries are absent) |
None of these problems causes a pandas error. The dataset loads, the columns are named correctly, and the data appears superficially reasonable. The problems only become visible when each quality dimension is explicitly audited, which is exactly what Topic 8.4 (Initial Inspection) covers.
The competition results dataset used throughout Week 8 was constructed with realistic data quality problems: duplicate entries (representing a system export that ran twice), missing scores (representing students who did not complete the submission form), casing inconsistencies in city names (representing data entered by different operators), and mixed date formats (representing data merged from two different registration systems). These are not artificial edge cases; they are the kinds of problems found in real event registration, school record, and survey datasets every day.
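A mixed-format date column is a good example of a problem that loads silently and only appears under audit. The sketch below counts how many values match each of several candidate formats. The candidate formats and sample strings are assumptions: the text does not say which four conventions the W8 data actually uses.

```python
from collections import Counter
from datetime import datetime

# Assumed candidate formats; the conventions actually present in the
# W8 dataset are not listed in the text.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y", "%Y/%m/%d"]


def detect_format(text):
    """Return the first candidate format that parses text, or None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return fmt
        except ValueError:
            continue
    return None


# Hypothetical date strings as they might appear in a text column.
sample_dates = ["2024-03-01", "01/03/2024", "02-03-2024", "2024/03/04", "2024-13-01"]

format_counts = Counter(detect_format(d) for d in sample_dates)
unparsed = [d for d in sample_dates if detect_format(d) is None]

print(format_counts)
print(unparsed)
```

More than one key in `format_counts` signals a consistency problem, and anything in `unparsed` signals a validity problem. Note that a string such as '01/03/2024' is ambiguous between day-first and month-first readings; the audit can flag the format but cannot decide which reading the original operator intended.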
The W8 dataset violates four of the six quality dimensions: completeness (missing score and level values), consistency (seven city spellings for three cities and four date format conventions), validity (dates stored as text rather than as datetime values), and uniqueness (duplicate rows that overcount students). Accuracy and timeliness happen to be intact for this dataset: the scores are within the valid range and the competition data is current. Few datasets violate all six dimensions at once; the art of data quality is knowing which dimensions to check and in what order.
Data quality problems are not abstract technical concerns; they translate directly into wrong decisions, wasted resources, and failed projects. Understanding what can go wrong when quality is ignored makes the case for rigorous preparation more concrete than any statistic.
| Quality Failure | Real-World Consequence | Root Dimension |
|---|---|---|
| Student competition scores contain 7 duplicate rows | Class average is distorted because the duplicated students are counted twice (inflated if they are top scorers) | Uniqueness |
| City column has 7 spellings for 3 cities | A 'students per city' report shows 7 rows instead of 3; Cairo data is split across three groups | Consistency |
| Date column stored as text strings | Time-series sorting fails; a chronological chart shows dates in alphabetical order instead of calendar order | Validity |
| Score column has 12 missing values that are ignored | Average score is calculated from only 195 students, not 207, so the reported class mean describes only part of the class | Completeness |
| A model is trained on dirty data | The model produces high accuracy on the training set but fails in production because it learned patterns from corrupted values | All dimensions |
Each row in the table above describes a consequence that looks like an analysis error but is actually a data quality error. The analyst runs the correct code, the code executes without errors, and the output is presented to decision-makers as factual. The damage is done before anyone realises the data was never trustworthy.
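Three of the consequences in the table above can be reproduced on a few rows. The following sketch uses small hypothetical values (not the W8 data) and assumes that 'Alex' is shorthand for 'Alexandria' and that the dates are written day-first. Every line runs without error, which is the point: the wrong answers look just as legitimate as the right ones.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Uniqueness: hypothetical scores where Sara's row was exported twice.
scores = [("Sara", 95), ("Omar", 60), ("Lina", 70), ("Sara", 95)]
mean_with_duplicates = mean(s for _, s in scores)
mean_deduplicated = mean(s for _, s in set(scores))

# Consistency: the seven spellings from the city column.
cities = ["Cairo", "cairo", "CAIRO", "Alex", "Alexandria", "Giza", "giza"]
ALIASES = {"alex": "alexandria"}  # assumed mapping, for illustration


def normalise(city):
    """Lower-case, resolve known aliases, then apply Title Case."""
    key = city.strip().lower()
    return ALIASES.get(key, key).title()


raw_groups = Counter(cities)
clean_groups = Counter(normalise(c) for c in cities)

# Validity: dates stored as text sort alphabetically, not chronologically.
dates = ["05/03/2024", "10/01/2024", "01/02/2024"]  # assumed DD/MM/YYYY
text_sorted = sorted(dates)
chrono_sorted = sorted(dates, key=lambda d: datetime.strptime(d, "%d/%m/%Y"))

print(mean_with_duplicates, mean_deduplicated)
print(len(raw_groups), dict(clean_groups))
print(text_sorted, chrono_sorted)
```

In this example the duplicate raises the mean from 75 to 80, the city report shrinks from seven groups to three, and the text sort places 1 February before 10 January.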
The most dangerous category of data quality failure produces confident-looking wrong answers. A model that silently ignores NaN rows and trains on only 195 out of 207 students reports its accuracy on 195 students, but is deployed to classify all 207. The 12 students it was never trained on represent exactly those whose data was problematic to begin with. Their predictions are the least reliable, but the performance metric never reveals this.
Quality awareness is not negativity; it is the professional default. Treating every new dataset as likely to have quality issues (because real-world data almost always does) is what separates an analyst who produces trustworthy results from one who produces fast results. Speed without quality is not an advantage in data science.
Students accustomed to tutorial datasets assume that real data is normally clean and only occasionally has a few errors. In practice, the opposite is true. Almost every real dataset collected over time, from any source, will have quality issues. Treating data quality problems as exceptional leads analysts to skip the inspection phase entirely.
Deletion is the simplest fix for any data problem, but it is often the most damaging one. Deleting rows with missing values, outliers, or formatting errors can remove large proportions of a dataset, and the rows that are deleted may not be random. If missing values are concentrated in a particular subgroup, deleting those rows introduces a systematic bias that invalidates any conclusion drawn from the remaining data.
In real projects, data preparation is iterative. After an initial cleaning pass, exploration often reveals new problems: a distribution that looks wrong, a category that splits unexpectedly, a date field that works in most rows but fails in edge cases. Preparation and exploration alternate until the analyst is confident the data is trustworthy.
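The bias described above can be checked before deleting anything by comparing missing rates across subgroups. The sketch below uses hypothetical rows in which the missing scores are concentrated in one level; the level names and values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (level, score) rows; None marks a missing score.
rows = [
    ("Advanced", 90), ("Advanced", 85), ("Advanced", 88),
    ("Beginner", 55), ("Beginner", None), ("Beginner", None),
]


def beginner_share(data):
    """Fraction of rows that belong to the Beginner level."""
    return sum(1 for level, _ in data if level == "Beginner") / len(data)


# Missing rate per subgroup: is the missingness random or concentrated?
totals, missing = defaultdict(int), defaultdict(int)
for level, score in rows:
    totals[level] += 1
    if score is None:
        missing[level] += 1
missing_rate = {level: missing[level] / totals[level] for level in totals}

# The effect of simply deleting every row with a missing score.
kept = [(level, score) for level, score in rows if score is not None]
share_before = beginner_share(rows)
share_after = beginner_share(kept)

print(missing_rate)
print(share_before, share_after)
```

Here deletion halves the Beginner share of the dataset, from 50% to 25%, so any average computed on the remaining rows leans toward the Advanced group. A large gap between subgroup missing rates is a signal to consider imputation or separate reporting instead of deletion.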
- Data quality has six measurable dimensions: Accuracy (values are correct), Completeness (values are present), Consistency (same entity is always written the same way), Uniqueness (each entity appears only once), Timeliness (data is current), and Validity (values conform to expected type, format, and range).
- Raw data is data in its original, unprocessed state, almost always containing problems across one or more quality dimensions that make it unfit for direct analysis.
- Analysis-ready data is complete, consistent, non-redundant, accurate, and valid: every row represents a valid record and every column has a uniform format and type.
- Data scientists spend approximately 80% of their working time preparing data, and this is where most analytical errors are either prevented or introduced.
- Data preparation follows a structured five-phase sequence: inspect → repair structure → clean content → transform → validate. Skipping the inspection phase produces confident-looking results built on flawed foundations.
- InfoWorld: The 80/20 Data Science Dilemma. https://www.infoworld.com/article/2257148/the-80-20-data-science-dilemma.html
- IBM: What Is Data Quality? https://www.ibm.com/think/topics/data-quality
- Collibra: The 6 Data Quality Dimensions with Examples. https://www.collibra.com/blog/the-6-dimensions-of-data-quality