Conduct Data Profiling

(Activity) for Tier: Data Management

PURPOSE

The purpose of Data Profiling is to develop an understanding of the content, quality, and rules of a specified set of data under management. It contains activities that help the organization assess the data against a set of quality objectives, which are defined in the Data Quality Strategy. Data profiling may be done on both the critical data elements defined by the stakeholders and additional data elements in the data store. Data Quality is limited to critical data elements, whereas Data Profiling can be more comprehensive if the project chooses. It is a discovery task that reveals what is stored in the data stores and how physical values differ from the expected values documented in the metadata repository. It examines basic characteristics such as distinct values in a column, null values, date ranges, string lengths, and nonstandard formats, as well as more advanced analyses such as frequency distribution, cardinality, and key integrity. Data profiling differs from Conduct Data Quality Assessment in that profiling results in a series of conclusions about the data set, whereas assessment evaluates how well the data meets specific quality requirements.
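
The basic checks named in this section (distinct values, null values, string length, frequency distribution) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed tool: the function name profile_column, the sample values, and the choice to treat empty strings as nulls are all hypothetical and would need to match the project's own definitions.

```python
from collections import Counter


def profile_column(values):
    """Return basic profile metrics for one column, given as a list of values.

    Assumption for this sketch: None and the empty string both count as null.
    """
    non_null = [v for v in values if v is not None and v != ""]
    frequencies = Counter(non_null)          # frequency distribution
    lengths = [len(str(v)) for v in non_null]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(frequencies),  # cardinality of the column
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "most_common": frequencies.most_common(3),
    }


# Hypothetical sample column that is expected to hold two-letter state codes.
states = ["VA", "MD", "VA", None, "Virginia", "", "DC", "VA"]
profile = profile_column(states)

for metric, value in profile.items():
    print(f"{metric}: {value}")
```

In this sample, a string length range of 2 to 8 and the distinct value "Virginia" would point to a nonstandard format in a column expected to hold two-letter codes.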

WHEN

Data profiling is the first step before Conduct Data Quality Assessment. Innova Systems Business Units may profile data prior to projects such as data conversion or migration, design of a data warehouse, or planning of a data store consolidation, or whenever a data store is modified because of new data feeds.

PARTICIPATING ROLES

ENTRY CRITERIA

  • Project with Data Business Objective

SUB-ACTIVITIES

  1. Select Data Stores to Profile

    • Select the data store(s) to profile. Depending on the project and business objective(s), a data store could be a file feed, a database, a data warehouse, a data lake, or a webpage data dump in a file format.
    • Document this within your Data Profile Plan
  2. Select Data Sets to Profile

    • Depending on the project and business objective(s), identify and select the data set(s) to profile. The selection may include a single data set or multiple data sets.
    • Priority and selection are driven by the project and by the task or objective that the profiling supports.
    • Document within your Data Profile Plan

    Note

    Data profiling is not done on “all” of the data a project has. It is scoped based on the project or task and the business objective(s).

  3. Identify Stakeholders

    • Create a list of stakeholders and ensure they are clearly identified. Examples of stakeholders include the Product Owner and the Project Manager. Some projects have additional stakeholders, such as a Data Science Project Manager, while others do not. The goal is to identify the key stakeholders with whom the Data Profile Plan and results will be shared.
    • Document within your Data Profile Plan
  4. Develop Data Quality Checks/Criteria Based on Objective(s)

    • Data quality criteria based on business objective(s) may include referential integrity checks, consistency of the data with the metadata documented in the Data Catalog, or more advanced criteria such as frequency distribution and cardinality of the data.
    • Document within your Data Profile Plan
  5. Develop Rules to be Applied During Profiling

    • Based on the criteria defined in the earlier steps, develop rules to perform those checks. These rules are often developed and coded by a Data Engineer. Depending on skill level and experience, they could instead be developed by the Data Steward, or by the Data Engineer and Data Steward in collaboration.
    • Document within your Data Profile Plan
  6. Determine Method or Tool for Data Profiling

    • The method or tool used for profiling needs to be discussed and agreed upon. Projects can choose to write a simple script in a language of their choice, such as SQL or Python, or to use an out-of-the-box solution.
    • Document within your Data Profile Plan

    Note

    The tool and method selected depend on the project, its resources, and the task at hand.

  7. Conduct Data Profiling and Document Findings

    • Once data profiling has been completed, determine how to present the results. Data profiling results can be shared in a basic report, such as an Excel workbook, or a mature project may choose to publish data profiling metrics and results through dashboards or scorecards.
    • Centralize these metrics where project stakeholders can access them, per team procedure.
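
Sub-activities 4, 5, and 7 can be illustrated together with a small Python sketch: criteria are expressed as named rules, the rules are applied to a data set, and the findings are written out as a basic CSV report that could be opened in Excel. Everything in it is an assumption made for illustration, including the orders and customers sample data, the rule names, the YYYY-MM-DD date convention, and the helper functions apply_rules and findings_to_csv. A project would substitute its own criteria and its agreed method or tool.

```python
import csv
import io
import re

# Hypothetical sample data: orders reference customers by customer_id.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "order_date": "2023-01-15"},
    {"order_id": 11, "customer_id": 3, "order_date": "2023-02-30x"},
    {"order_id": 12, "customer_id": 2, "order_date": None},
]

known_customers = {c["customer_id"] for c in customers}
DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")  # checks shape only, not calendar validity


def customer_exists(row):
    """Referential integrity: the order must point at a known customer."""
    return row["customer_id"] in known_customers


def date_populated(row):
    """Completeness: order_date must not be null."""
    return row["order_date"] is not None


def date_well_formed(row):
    """Format: a populated order_date must look like YYYY-MM-DD."""
    value = row["order_date"]
    return value is None or DATE_SHAPE.fullmatch(value) is not None


# Each rule pairs a readable name with a function that returns True on a pass.
rules = {
    "customer_id exists in customers": customer_exists,
    "order_date is populated": date_populated,
    "order_date matches YYYY-MM-DD": date_well_formed,
}


def apply_rules(rows, rules):
    """Run every rule over every row and summarize the failures per rule."""
    findings = []
    for name, check in rules.items():
        failed = [row["order_id"] for row in rows if not check(row)]
        findings.append({
            "rule": name,
            "rows_checked": len(rows),
            "rows_failed": len(failed),
            "failed_order_ids": failed,
        })
    return findings


def findings_to_csv(findings):
    """Render the findings as CSV text suitable for a basic report."""
    buffer = io.StringIO()
    fields = ["rule", "rows_checked", "rows_failed", "failed_order_ids"]
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for finding in findings:
        ids = ";".join(str(i) for i in finding["failed_order_ids"])
        writer.writerow({**finding, "failed_order_ids": ids})
    return buffer.getvalue()


findings = apply_rules(orders, rules)
report = findings_to_csv(findings)
print(report)
```

Keeping the rules in a named mapping means the same list can be copied into the Data Profile Plan, so the documented criteria and the executed checks stay aligned.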

EXIT CRITERIA

  • The Data Profile Plan is developed, data profiling is conducted, and the data profiling results are communicated to stakeholders and published.

SEE ALSO

Process Guidance Version: 10.4