Data Exploration¶

(Activity) for Tier: Data Analytics

PURPOSE¶

To obtain a fundamental understanding of a dataset. Data exploration reveals the structure of the data, the distribution of the values, the presence of extreme values, and the interrelationships between the attributes in the dataset. It also provides guidance on choosing the right kind of further statistical and data science treatment.

WHEN¶

Point in time or event trigger

PARTICIPATING ROLES¶

ACCOUNTABLE
- Data Scientist
RESPONSIBLE
- Data Scientist

INPUTS¶

A well-defined statement of the problem, the subject matter, the context and sensitivity, the business process generating the data, and the data itself.

Todo

Need to define work products for all inputs.

ENTRY CRITERIA¶

A new dataset that has not been investigated before.

SUB-ACTIVITIES¶

Organize the dataset: Structure the dataset with standard rows and columns. Organizing the dataset to have objects or instances in rows and dimensions or attributes in columns will be helpful for many data analysis tools. Identify the target or “class label” attribute, if applicable.
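The organizing step above can be sketched as follows. This is a minimal illustration using only the Python standard library; the inline CSV text, the column names, and the `load_rows` helper are hypothetical and stand in for whatever source and tooling the project actually uses.

```python
import csv
import io

# Hypothetical raw data: one row per instance, one column per attribute.
RAW = """length,width,species
5.1,3.5,a
4.9,3.0,a
6.3,3.3,b
5.8,2.7,b
"""

TARGET = "species"  # assumed name of the class-label column


def load_rows(text, target):
    """Parse CSV text into (feature_rows, labels) with numeric features."""
    features, labels = [], []
    for record in csv.DictReader(io.StringIO(text)):
        # Separate the class label from the remaining attributes.
        labels.append(record.pop(target))
        features.append({name: float(value) for name, value in record.items()})
    return features, labels


features, labels = load_rows(RAW, TARGET)
```

Keeping the class label apart from the attributes makes the later class-stratified steps straightforward.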

Find the central point for each attribute: Calculate mean, median, and mode for each attribute and the class label. If all three values are very different, it may indicate the presence of an outlier, or a multimodal or non-normal distribution for an attribute.
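As a rough sketch of this step, the three measures can be computed with Python's `statistics` module. The sample values and the `central_tendency` helper are made up for illustration; the single extreme value shows how the mean drifts away from the median and mode.

```python
import statistics


def central_tendency(values):
    """Return mean, median, and all modes of one attribute."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # multimode returns every most-frequent value, which helps
        # reveal multimodal attributes.
        "modes": statistics.multimode(values),
    }


# Hypothetical attribute with one extreme value (40).
values = [2, 3, 3, 4, 5, 40]
summary = central_tendency(values)
```

Here the mean lands well above the median and mode, which is the kind of disagreement the step asks you to look for.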

Understand and visualize the spread of each attribute: Calculate the standard deviation and range for each attribute. Compare the standard deviation with the mean, along with the max and min data points, to understand the spread of the data. Develop histograms and distribution plots for each attribute. Repeat the same with class-stratified histograms and distribution plots, where the plots are either repeated or color-coded for each class.
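The numeric part of this step might look like the sketch below. It uses the sample standard deviation, fixed-width bins, and a standard-deviation-to-mean ratio as one way of comparing the two; these choices, the helper names, and the data are assumptions for illustration. Drawing the histograms is left to whichever charting tool is in use.

```python
import statistics
from collections import Counter


def spread(values):
    """Summarize the spread of one attribute."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "stdev": stdev,
        "stdev_over_mean": stdev / mean,  # assumes a non-zero mean
    }


def histogram(values, bin_width):
    """Count values per bin; keys are the lower bin edges."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def stratified_histograms(values, labels, bin_width):
    """Build one histogram per class label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: histogram(vs, bin_width) for label, vs in groups.items()}


# Hypothetical attribute values and class labels.
values = [1, 2, 2, 3, 7, 9]
labels = ["a", "a", "a", "b", "b", "b"]

stats = spread(values)
overall = histogram(values, bin_width=5)
by_class = stratified_histograms(values, labels, bin_width=5)
```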

Pivot the data: Sometimes called dimensional slicing, a pivot is helpful to comprehend different values of the attributes. This technique can stratify by class and drill down to the details of any of the attributes. Microsoft Excel and Business Intelligence tools popularized this technique of data analysis for a wider audience.
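A pivot can be sketched in a few lines of plain Python. The `pivot` helper, the column names, and the rows below are hypothetical; spreadsheet and BI tools provide the same operation interactively.

```python
import statistics
from collections import defaultdict


def pivot(rows, index, columns, value, agg=statistics.mean):
    """Aggregate `value` for every (index, columns) combination."""
    cells = defaultdict(list)
    for row in rows:
        cells[(row[index], row[columns])].append(row[value])
    table = defaultdict(dict)
    for (row_key, col_key), vals in cells.items():
        table[row_key][col_key] = agg(vals)
    return {row_key: dict(cols) for row_key, cols in table.items()}


# Hypothetical records: slice the amount by region and class label.
rows = [
    {"region": "north", "label": "yes", "amount": 10},
    {"region": "north", "label": "yes", "amount": 20},
    {"region": "north", "label": "no", "amount": 5},
    {"region": "south", "label": "yes", "amount": 8},
]

table = pivot(rows, index="region", columns="label", value="amount")
```

Passing a different `agg` (for example `len` or `max`) drills into a different view of the same cells.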

Watch out for outliers: Use a scatterplot or quartiles to find outliers. The presence of outliers skews some measures like mean, variance, and range. Exclude outliers and rerun the analysis. Notice if the results change.
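One common quartile-based rule, used here as an assumption because the guidance does not prescribe a specific one, flags values beyond 1.5 times the interquartile range from the first and third quartiles. The sketch below applies that rule and then recomputes the mean without the flagged values; the data and the `iqr_outliers` helper are illustrative.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Split values into (kept, outliers) using quartile-based fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers


# Hypothetical attribute with one extreme value.
values = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
kept, outliers = iqr_outliers(values)

# Rerun a summary with and without the outliers to see how much it moves.
mean_with = statistics.mean(values)
mean_without = statistics.mean(kept)
```

A large gap between `mean_with` and `mean_without` signals that the outliers dominate the summary and deserve a closer look before modeling.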

Understand and visualize the relationship between attributes: Measure the correlation between attributes and develop a correlation matrix. Notice which attributes are dependent on each other and investigate why they are dependent. Plot a quick scatter matrix to discover the relationships between multiple attributes at once. Zoom in on the attribute pairs of interest with simple two-dimensional scatterplots stratified by class.
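A correlation matrix can be sketched as below. The guidance does not name a correlation measure; this example assumes the Pearson coefficient, and the column names, data, and helper functions are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Note: a constant column has zero deviation and would divide by zero.
    return cov / (dev_x * dev_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of attribute columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical attributes: y rises with x, z falls as x rises.
columns = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]}
matrix = correlation_matrix(columns)
```

Coefficients near +1 or -1 point to the attribute pairs worth examining in a scatterplot.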

Visualize high-dimensional datasets: Create parallel charts (parallel coordinates plots) and Andrews curves to observe the class differences exhibited by each attribute. Deviation charts provide a quick assessment of the spread of each class for each attribute.
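For orientation, an Andrews curve maps each instance (x1, x2, x3, ...) to the function f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ... over t in [-pi, pi], so that each row becomes one line on the chart. The sketch below only computes the curve points; the helper names and the sample row are illustrative, and plotting the points per class is left to the charting tool in use.

```python
import math


def andrews_curve(row, t):
    """Evaluate the Andrews function of one instance at angle t."""
    total = row[0] / math.sqrt(2)
    for i, x in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        # Odd positions use sine, even positions use cosine.
        wave = math.sin if i % 2 == 1 else math.cos
        total += x * wave(harmonic * t)
    return total


def sample_curve(row, points=50):
    """Sample the curve at evenly spaced angles from -pi to pi."""
    step = 2 * math.pi / (points - 1)
    angles = [-math.pi + k * step for k in range(points)]
    return [(t, andrews_curve(row, t)) for t in angles]


# Hypothetical instance with three numeric attributes.
curve = sample_curve([1.0, 2.0, 3.0])
```

Instances of the same class tend to produce curves of similar shape, which is what makes the chart useful for spotting class differences.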

OUTPUTS¶

Interpret and present the visual findings and insights to stakeholders. Move to data modeling only after getting input from stakeholders.

Todo

Need to define work products for all outputs.

EXIT CRITERIA¶

Develop intuition about the data and gather as many insights from it as possible. Extract important features, detect outliers and anomalies, and test underlying assumptions.

NEXT ACTIVITY¶

Model Data

Process Guidance Version: 10.4