Does Cleaned Data guarantee high Data Quality?

Sandhya Krishnan
Published in Geek Culture
7 min read · Oct 19, 2022


The direct answer to the question “Does cleaned data guarantee high data quality?” is a big NO. This article covers the data quality dimensions and shows how to check that data is of high quality using Python.

Photo by Kindel Media: https://www.pexels.com/photo/person-holding-black-pen-on-white-printer-paper-7054417/

Data cleaning only ensures that the data has no missing or duplicate values, fixes structural errors, filters unwanted outliers, formats the data, and removes irrelevant observations. If you want to know how to clean data using Python, you can check my article here.

When we say that data cleaning removes irrelevant observations or data, its main focus is removing irrelevant columns and rows that contain erroneous values or information. It does not dig deep into the data to check whether the data is truly relevant or not. Moreover, data cleaning can never guarantee that the data is fit to serve the business objective.

Hence, data quality checks must be done to ensure that the data is fit to serve the business objective and correctly represents the real-world construct to which it refers. The definition of data quality varies across the customer, business, and standards perspectives.

Thus data quality should ensure the data is fit for use, meets or exceeds customer expectations, meets the specification document, is free of defects, and meets the requirements of its intended use. Moreover, it should be accurate, correct, consistent, relevant, valid, and useful for business decision-making and its applications.

These expectations, specifications, and requirements are usually defined by one or more individuals or groups, standards organizations, laws and regulations, business policies, or software development policies.

Data is categorized as high quality or low quality by assessing it with data quality metrics. The different data quality metrics are:

Accuracy:

Accuracy is a key attribute of high-quality data and the toughest one to determine. If we have millions of records in our data, this dimension has to ensure that each record is accurate, and manually checking millions of records is practically impossible.

Depending on the data and its behavior, various checks for suspicious error patterns need to be run to ensure the data is of high quality.

To explain this, I am using the cleaned data, which is available here.

The dataset attributes and their expected values or datatypes are as below:

Category: Should be ‘phone’, ‘tab’, or ‘laptop’
Rank: Seller rank, an integer in the range 1 to 50
Unique_Identifier: Alphanumeric, 8 characters
Seller_State: One of two states of India, either Karnataka or Kerala
Seller_City: Any city of these two states
Postal_Code: PIN code of the city

For this dataset, the data quality checks below can be done to ensure the data is accurate; a sketch of the first four checks follows the list. Note that I have used Google Colab to run the code.

  • Accuracy Check for Category: The column is expected to contain only ‘phone’, ‘tab’, or ‘laptop’, so check that there is no value other than these.
  • Accuracy Check for Rank: Positive integers ranging from 1 to 50; values should not be decimals.
  • Accuracy Check for Unique Identifier: Alphanumeric with a length of 8.
  • Accuracy Check for Seller_State: The value should be either Karnataka or Kerala.
  • Accuracy Check for Seller_City: Seller_City should belong to Seller_State (checked further below, once the master list is loaded).
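A minimal sketch of the first four checks in pandas is shown below; the file name clean_data.csv is a placeholder for wherever the cleaned dataset is stored, and df is the DataFrame used throughout the rest of the article.

```python
import pandas as pd

# Load the cleaned dataset; 'clean_data.csv' is a placeholder file name.
df = pd.read_csv('clean_data.csv')

# Category: nothing other than 'phone', 'tab', or 'laptop'
invalid_category = df[~df['Category'].isin(['phone', 'tab', 'laptop'])]
print('Invalid Category rows:', len(invalid_category))

# Rank: positive integers from 1 to 50, no decimal values
rank_ok = df['Rank'].between(1, 50) & (df['Rank'] % 1 == 0)
print('Invalid Rank rows:', int((~rank_ok).sum()))

# Unique_Identifier: alphanumeric, exactly 8 characters
id_ok = df['Unique_Identifier'].astype(str).str.fullmatch(r'[A-Za-z0-9]{8}', na=False)
print('Invalid Unique_Identifier rows:', int((~id_ok).sum()))

# Seller_State: only Karnataka or Kerala
state_ok = df['Seller_State'].isin(['Karnataka', 'Kerala'])
print('Invalid Seller_State rows:', int((~state_ok).sum()))
```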

To check whether a city belongs to a particular state, I have created a master list of State, City, and Postal (the postal code is required for later checks) in a Google Sheet. To access Google Sheets, one has to authenticate with the Google account, which can be done using the code below.
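The typical Colab authentication flow looks roughly like this; it assumes the notebook runs in Google Colab and uses the gspread client.

```python
# Authenticate the Google account from inside Google Colab.
from google.colab import auth
auth.authenticate_user()

# Build an authorized gspread client from the default Colab credentials.
import gspread
from google.auth import default

creds, _ = default()
gc = gspread.authorize(creds)
```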

Once the Google account is authorized, the content of the Google Sheet can be read using the code below.
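A sketch of reading the master sheet into a DataFrame; the sheet name 'State_City_Postal' and the column headers State, City, and Postal are assumptions about how the master list is laid out.

```python
import pandas as pd

def get_input_file():
    # Open the master Google Sheet by name; 'State_City_Postal' is a
    # placeholder for the actual sheet name.
    worksheet = gc.open('State_City_Postal').sheet1
    rows = worksheet.get_all_values()
    # Treat the first row as the header (State, City, Postal).
    return pd.DataFrame(rows[1:], columns=rows[0])

city_info = get_input_file()
```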

Here city_info stores all the content of the Google Sheet returned by the function get_input_file().

  • Accuracy Check for Postal_Code: Postal_Code should belong to Seller_City, and an Indian PIN code is 6 digits. A sketch of this check, together with the city-to-state check, follows the sample image below.
Sample of an Indian Postal Index Number: https://upload.wikimedia.org/wikipedia/commons/f/f1/Example_of_Indian_Postal_Index_Number.svg
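A sketch of the city-to-state and PIN-code checks against the master list; it assumes city_info has columns named State, City, and Postal.

```python
# Seller_City should belong to Seller_State according to the master list.
valid_state_city = set(zip(city_info['State'], city_info['City']))
city_mismatch = ~df.apply(
    lambda row: (row['Seller_State'], row['Seller_City']) in valid_state_city,
    axis=1
)
print('City/State mismatches:', int(city_mismatch.sum()))

# Postal_Code should be a 6-digit Indian PIN code.
pin_format_ok = df['Postal_Code'].astype(str).str.fullmatch(r'\d{6}', na=False)
print('Badly formatted PIN codes:', int((~pin_format_ok).sum()))

# Postal_Code should belong to Seller_City according to the master list.
valid_city_pin = set(zip(city_info['City'], city_info['Postal'].astype(str)))
pin_mismatch = ~df.apply(
    lambda row: (row['Seller_City'], str(row['Postal_Code'])) in valid_city_pin,
    axis=1
)
print('PIN codes not matching the city:', int(pin_mismatch.sum()))
```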

If the data had age details, a check could be done that age has no negative values or unreasonably large positive values.
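A hypothetical example of such a check; this dataset has no age column, and the 120-year upper limit is an arbitrary assumption.

```python
# Hypothetical: flag negative or implausibly large ages if an Age column exists.
if 'Age' in df.columns:
    bad_age = ~df['Age'].between(0, 120)
    print('Unreasonable Age values:', int(bad_age.sum()))
```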

Completeness:

To ensure data is complete, the main check is that all the columns mentioned in the data specification document are present in the dataset. Accurate data is not necessarily complete. As mentioned earlier, if a column is expected to have only a few values like ‘phone’, ‘tab’, and ‘laptop’, the data is accurate as long as it contains no category other than these three, but it is not complete until all three are present.
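A minimal sketch of both completeness checks, assuming the specification lists exactly the six columns described earlier:

```python
# All columns from the specification document should be present.
expected_columns = {'Category', 'Rank', 'Unique_Identifier',
                    'Seller_State', 'Seller_City', 'Postal_Code'}
print('Missing columns:', expected_columns - set(df.columns))

# All three expected categories should actually appear in the data.
expected_categories = {'phone', 'tab', 'laptop'}
print('Categories not present:', expected_categories - set(df['Category'].unique()))
```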

Consistency:

Data is said to be consistent when it is stable, compatible, uniform, and in the same format. When a data file is submitted at regular intervals for analysis or to feed an ML model, the file size should be consistent, that is, it should fall within the mean plus or minus two standard deviations of past file sizes. For the upper boundary, outliers can be analyzed before rejecting the data, whereas files below the lower boundary can be directly rejected.
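A sketch of the file-size check; the historical sizes and the file name incoming.csv are made-up placeholders.

```python
import os
import pandas as pd

# File sizes (in bytes) of previously accepted files -- placeholder values.
history_sizes = pd.Series([10_450_000, 10_610_000, 10_380_000, 10_720_000])
lower = history_sizes.mean() - 2 * history_sizes.std()
upper = history_sizes.mean() + 2 * history_sizes.std()

new_size = os.path.getsize('incoming.csv')  # placeholder file name
if new_size < lower:
    print('File smaller than expected: reject it.')
elif new_size > upper:
    print('File larger than expected: analyse the outliers before accepting.')
else:
    print('File size is consistent with past submissions.')
```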

Data uniformity should be checked thoroughly. As discussed earlier, if we have a dataset of phones, tabs, and laptops, and for the first two we have 1000 records each while for laptops we have only a handful of records, then the data is not at all uniform or consistent.

Consistency can be checked by analyzing the difference between df['Category'].value_counts().nlargest(1) and df['Category'].value_counts().nsmallest(1), for example as below.
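In this sketch, the 20% tolerance between the largest and smallest category counts is an arbitrary assumption.

```python
counts = df['Category'].value_counts()
largest = counts.nlargest(1).iloc[0]
smallest = counts.nsmallest(1).iloc[0]

# If the smallest category has far fewer records than the largest,
# the data is not uniform.
if (largest - smallest) / largest > 0.2:
    print('Category counts are not uniform:', counts.to_dict())
else:
    print('Category counts look consistent.')
```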

Validity:

For research studies, we can check internal and external validity. Internal validity tells us how confident we can be in our data: it holds when the effect changes along with the cause, the result is truly an after-effect of the change, and there is no other plausible explanation for the observed correlation.

Researchers should also take care that there is no experimental bias and that the results of the studies are not compromised.

External validity comprises population validity, ecological validity, and temporal validity. Population validity ensures that the population of the study and the population of interest are the same. Ecological validity ensures that the result can be applied to the real world, and temporal validity checks how well the result remains accurate over time.

For the cleaned data, the accuracy checks on postal codes and on whether the city belongs to the mapped state can be considered the validity checks.

Timeliness:

Historical data should have clear chronological information. Data made available to customers after processing should arrive within the time frame the customer needs it. When data is expected for a given time range, it should contain no records before or after that period, and the stop time should not be before the start time. When a business gets the right data at the right time in the right format, correct and accurate business decisions can be made.
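A hypothetical sketch of the timeliness checks; the Start_Time and Stop_Time columns and the reporting window are assumptions, since the example dataset carries no timestamps.

```python
import pandas as pd

# Expected reporting window -- placeholder dates.
window_start = pd.Timestamp('2022-01-01')
window_end = pd.Timestamp('2022-12-31')

start = pd.to_datetime(df['Start_Time'])  # hypothetical column
stop = pd.to_datetime(df['Stop_Time'])    # hypothetical column

print('Records outside the expected window:',
      int(((start < window_start) | (stop > window_end)).sum()))
print('Records where stop time precedes start time:',
      int((stop < start).sum()))
```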

Uniqueness:

Data uniqueness is checked during the data cleaning process, that is, there should be no duplicate values in the file. But it is good practice to cross-verify this in the data quality check as well, so that any miss in data cleaning is caught here.
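A quick cross-check in pandas:

```python
# Duplicate rows and duplicate identifiers should both be zero after cleaning.
print('Duplicate rows:', int(df.duplicated().sum()))
print('Duplicate Unique_Identifier values:',
      int(df['Unique_Identifier'].duplicated().sum()))
```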

Relevance and Compliance:

The data should be relevant to solve the business problem and should comply with data regulations.

The complete code is available on GitHub and Kaggle.
