Conduct Data Cleansing¶
(Activity) for Tier: Data Management
PURPOSE¶
The purpose of Data Cleansing is to ensure that the subject data is correct, consistent, and usable. Cleaning data involves the actions of identifying, correcting, and processing errors or corruptions in data as needed. Discrete business rules established in the Data Management Strategy enable identification of errors or corruptions. Correcting errors or corruptions may involve the development of new processes, deletion of data, or manipulation of data to prevent future errors.
WHEN¶
Data Cleansing follows the Conduct Data Profiling and Conduct Data Quality Assessment. In addition to the referenced activities, Data Cleansing occurs as often as needed on key data stores or critical data elements to ensure compliance with the business rules documented in the Data Management Strategy.
PARTICIPATING ROLES¶
- ACCOUNTABLE
- RESPONSIBLE
ENTRY CRITERIA¶
- Completion of Conduct Data Profiling and Conduct Data Profiling.
SUB-ACTIVITIES¶
Remove Duplicate and Extraneous Data
- Ensure that key data stores provide unique observations to upstream systems.
- Leverage critical data elements to ensure that each combination of critical data elements provides a sub-population of valid observations.
- Each sub-population (combination of critical data elements), should add distinct information and context to the mission of the project.
Note
Extraneous observations do not fit into the mission of the project as defined in the Create Team Charter. By definition, extraneous observations distract from the stated goals of the project by adding information that is irrelevant. Guarding against extraneous observations will reduce cost and ensure viability of the data management effort.
Handle Missing Data
The project shall establish how it will handle missing observations in the Data Cleansing Plan.
In general, three broad strategies exist to deal with missing data in no particular order.
The project may choose any of the following remedies below to handle missing data:
- Delete observations that have missing values.
- Add the missing observations based on other observations. Many imputation methods exist such as linear interpolation, central tendency substitution, cluster analysis, etc. Include the selected imputation method(s) in the Data Cleansing Plan.
- Contact the publisher of the data source to gather the missing data.
Correct Structural Errors
Structural errors typically create cascading failures in upstream systems.
Structural errors cause data to be inconsistent across a dimension such as time, data source(s), observations, etc. Structural errors may include but not be limited to:
- Observations within a data element should all have the same data-type. When data element values have multiple data type, e.g., boolean, numeric, date, null, NA, blank, etc. then a data type mismatch has occurred. For example, a date time column containing boolean values is a data type mismatch.
- Numeric and date-time data elements typically have a range context that dictates valid or invalid observations. For example, naval ships lost in battle will range from zero to infinity. A negative number of naval ships lost in battle would constitute a range violation.
- Observations for data elements often come from a finite population. When observations fall outside of the established population a set-membership violation has occurred. For example, the US Navy has a finite inventory of bases with names. A dataset feature of Navy-base names that includes the titles of movies would constitute a set-membership violation since movie titles do not belong in the set of Navy-base names.
- This error most commonly occurs in relational databases, when a foreign key has a value that does not exist in the primary key data element.
- Regular-expression mismatches occur when a text data element has an established pattern but contains observations that do not adhere to the pattern. For example, a column of phone numbers that contains observations not matching the pattern (999) 999–9999 could be an example of a regular-expression-pattern mismatch.
- Logic violations occur when required conditions are violated.
Example: A data source on battles occurring during the Persian Gulf war cannot have dates prior to the year 1991.
Monitor Errors
- Data Cleansing is a continuous activity. Error trends are an indication of systemic issues with key data stores. The project shall publish data corrections for affected upstream systems and downstream sources as data errors get resolved. When data corrections follow a discernable trend the project shall have an established escalation protocol for pursuing resolution of the trend with internal and external data providers.
Exit Criteria¶
- Data conforms to established business rules contained in Data Management Strategy.
Next Activity¶
- None
Todo
Need DMM team to create next activity for Measurement and Analysis. Once activity has been created next activity will be populated.
See Also¶
Process Guidance Version: 10.4