Data Analytics Life Cycle and Role of Big Data

Data Analytics and Big Data are gaining importance and attention. They are expected to create customer value and competitive advantage for the business. We have described the Data Analytics Life Cycle in detail. Given the focus on big data, an analysis is undertaken to understand the impact of big data on the data analytics life cycle. Typical analytics projects have the effort and time distribution shown in the column chart below. Of course, various factors influence the time taken across the data analytics life cycle stages, such as the complexity of the business problem, the messiness of the data (quality, variety and volume), the experience of the data analyst or scientist, and the maturity of analytics in the organization and its analytical tools/systems. But data manipulation is one of the biggest drains on analyst time1.

Effort Distribution across Data Analytics Life Cycle

What is the impact of big data across the Data Analytics Life Cycle?

  • Understanding Business Objective

Big Data, or any other technology, plays little role in understanding the business objective and converting a business problem into an analytics problem. But the flexibility and versatility of the tools and technology guide what can and cannot be done. For example, a brick-and-mortar retailer may have to launch a survey to understand customer sensitivity toward prices, whereas an eCommerce retailer may carry out an analysis using customers' web visits – which other eCommerce websites customers visit before and after visiting the retailer's site.

  • Data Manipulation

Data manipulation requires significant effort from an analyst, and big data is expected to impact this stage the most. Big data helps an analyst get the result of a query more quickly (the Velocity of Big Data). It also facilitates accessing and using unstructured data (Variety), which was a challenge with traditional technology. Handling large data volumes (Volume) is expected to help by removing data-volume processing constraints or improving speed. Statistical scientists devised sampling techniques precisely to work around the constraint of processing high volumes of data. Big data can process high volumes directly, so sampling may not be required from that perspective, but sampling is still relevant and required.
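
To illustrate why sampling remains useful even when the full volume can be processed, the minimal sketch below uses simulated data standing in for a very large transaction table, draws a 1% random sample, and shows that summary estimates from the sample closely track the full data. The column name and sampling fraction are purely illustrative.

```python
import pandas as pd
import numpy as np

# Purely illustrative data standing in for a very large transaction table.
rng = np.random.default_rng(42)
transactions = pd.DataFrame({"amount": rng.exponential(scale=50.0, size=1_000_000)})

# A small simple random sample is often enough for exploratory estimates,
# even when the platform could process the full volume.
sample = transactions.sample(frac=0.01, random_state=42)

print("full-data mean:", round(transactions["amount"].mean(), 2))
print("1% sample mean:", round(sample["amount"].mean(), 2))
```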

Speech Analytics and Big Data Example: In one of my previous engagements, CallMiner Eureka3 was used to understand customers' needs and concerns along with monitoring agent performance. Due to the call volume and storage requirements, only the latest two weeks of data were available for analysis. This constrained which hypotheses could be validated. With big data technology, this constraint may no longer apply, and many more hypotheses could be validated to add value to the end customers and the business.

  • Data Analysis and Modeling

Most machine learning and statistical techniques are available on traditional technology platforms, so the value added by big data could be limited here. One of the arguments in favour of machine learning on big data is that "the more data is fed to a machine learning algorithm, the more it can learn and the higher the quality of insights"2. Many practitioners do not believe that volume alone leads to quality of insights.

Certainly, having different dimensions of data, such as customer web clicks and call data, will lead to better insights and improved accuracy of the predictive models.

  • Action on Insights or Deployment

Big Data has created a new wave in the industry, and there is a lot of pressure on organizations to think about big data. The technology is still maturing, but organizations are investing to tap big data for competitive advantage. A few organizations such as Facebook and Amazon have already adopted it and are using it. The real differentiator between successful and unsuccessful organizations will be the right insights and action on those insights.

Big Data technology is expected to enable quicker deployment of insights or predictive models, but more importantly, the speed of acting on analytics will be almost real time.

 Offer Recommendation on Web and Big Data

Generic offers are prevalent on the web, without much success. Customers expect personalized and relevant offers, and organizations are moving in this direction. One way to identify customer needs is to combine web-click behaviour and transactional behaviour in real time and provide a personalized offer to the customer. This may become a reality using big data and big data analytics.

  • Learning and Guiding

Due to Big Data and Big Data Analytics, data analytics cycle time and cost are expected to come down. The cost reduction and shorter cycle time will have a favourable impact on analytics adoption. Organizations will be more open to moving toward an experimentation and learning culture. Of course, this is not going to happen automatically.

 Summary

Big Data is an industry buzzword with a lot of focus, attention and investment. Big Data investment is going to add value to the customers and the business only if the right insights are developed and acted upon. Big data is going to impact each stage of the Data Analytics life cycle, but the main value add (till Big Data analytics tools mature) will be around data manipulation.

Reference

  1. http://www.sas.com/offices/NA/canada/downloads/presentations/Vancouver_fall2008/Data.pdf
  2. http://www.skytree.net/machine-learning/
  3. http://www.callminer.com/

 

Data Analysis Life Cycle

The aim is to illustrate the activities that typically happen in the data analysis life cycle. Of course, there will be examples in real life where some of the life cycle stages do not happen. But it is important to follow a structured approach for data analytics.

Data Analysis Life Cycle

  • Business Objective

An analysis starts with a business objective or problem statement. For example, the business problem could be that the average banking product holding of customers is very low (a retail banking scenario), or that a retailer wants to launch a promotion campaign on television. In a few cases, the analytics team can also proactively form a list of hypotheses and develop insights for the management to act on. Once the overall business problem is defined, it is converted into an analytical problem. Consider the retail promotion campaign: one of the important questions could be to find the target customer segments. Once the target segment is defined, the retailer can decide the TV channel and the timing of the campaign. So, assume that building the customer segmentation is the analytical problem.

  • Data Manipulation

Once the business problem is defined, the next stage is data manipulation. Data manipulation involves:

    • Extraction: Pulling data from different systems
    • Transformation: Aggregating transactions and activities at a particular level
    • Descriptive analysis and visualization: Understanding variable values and distributions is an important step in data analysis. The analyst looks at the minimum, maximum, average and variance of the continuous variables; a box plot can be used. A frequency plot for categorical variables may also be required.
    • Treatment: The input variables may have missing or outlier values. A scatter plot may be helpful for seeing the distribution of variable values.

For the above example of customer segmentation, we may need to pull transactions, payments, channel interactions and customer demographic data. Since the segmentation is built at a customer level, the transaction and channel data need to be aggregated at a customer level. Some customers may not have made any transactions, hence their aggregated variable values are missing. Such variables could be treated with zero or other values, as sketched below.
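
A minimal pandas sketch of the manipulation steps described above, assuming hypothetical transaction and customer tables with the file and column names shown; customers with no transactions end up with missing aggregates, which are treated with zero here.

```python
import pandas as pd

# Hypothetical input tables; file names and columns are assumptions for illustration.
transactions = pd.read_csv("transactions.csv")   # customer_id, txn_date, amount
customers = pd.read_csv("customers.csv")         # customer_id, age, city

# Transformation: aggregate transactions at the customer level.
txn_summary = (
    transactions.groupby("customer_id")
    .agg(txn_count=("amount", "size"), txn_value=("amount", "sum"))
    .reset_index()
)

# Join to one row per customer, keeping customers with no transactions.
base = customers.merge(txn_summary, on="customer_id", how="left")

# Descriptive analysis: minimum, maximum, mean and spread of continuous variables.
print(base[["txn_count", "txn_value"]].describe())

# Treatment: customers with no transactions have missing aggregates; treat with zero.
base[["txn_count", "txn_value"]] = base[["txn_count", "txn_value"]].fillna(0)
```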

  • Data Analysis and Modeling

Data Analysis involves multiple steps. Typically, the first analysis is Exploratory Data Analysis (EDA). EDA helps in understanding the data trends and patterns.

After EDA, a relevant statistical technique is selected based on the business problem. Using the selected technique, the statistical model is built or the analysis is completed. The model or analysis insights are then validated, typically on a hold-out (validation) data set. The exact steps and sequence may differ for different types of analysis and techniques used.

In summary, the data analysis/modeling steps are:

    • Exploratory Data Analysis (EDA)
    • Statistical Technique Selection
    • Model Building or Analysis
    • Validation of Results

For the customer segmentation, the following approach can be followed.

Customer Segmentation Approach
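
One common way to implement such a segmentation is k-means clustering on standardized customer-level variables. The sketch below is a minimal illustration of that approach, not the specific method used here; the file name, feature columns and the choice of four segments are assumptions, and in practice the number of segments would be chosen with EDA.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical customer-level features, continuing the earlier example.
base = pd.read_csv("customer_base.csv")          # customer_id, txn_count, txn_value, age
features = base[["txn_count", "txn_value", "age"]]

# Standardize so no single variable dominates the distance calculation.
scaled = StandardScaler().fit_transform(features)

# Fit k-means; 4 segments is just a placeholder.
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
base["segment"] = kmeans.fit_predict(scaled)

# Profile the segments to check that they make business sense.
print(base.groupby("segment")[["txn_count", "txn_value", "age"]].mean())
```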

  • Action on Insights or Deployment

The analysis output is typically used in two ways – informing decision makers or deploying it in a system. Sometimes an analysis is carried out for decision makers to understand and be aware of customer behaviours or business performance. For example, what is the profile of the customers who responded to a particular campaign? Or what is the risk profile of the customers acquired through online channels? In these cases, no mathematical or statistical model relating a target variable to input variables is deployed.

In another example, predicting customer churn, we may build a statistical model that can be deployed in the system, so that at regular intervals customers who are at risk of closing the relationship (attrition or churn) are identified.
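
A hedged sketch of what "deploying in the system" can look like for the churn example: a previously trained model is loaded on a schedule and customers above a risk threshold are flagged for action. The file names, the 0.7 threshold and the use of a pickled scikit-learn model are assumptions for illustration.

```python
import pandas as pd
import joblib

# Assumptions for illustration: a trained model saved earlier, and the
# latest customer-level features refreshed by the regular data pipeline.
model = joblib.load("churn_model.pkl")
latest = pd.read_csv("customer_features_latest.csv")

feature_cols = [c for c in latest.columns if c != "customer_id"]
latest["churn_score"] = model.predict_proba(latest[feature_cols])[:, 1]

# Flag customers above an illustrative risk threshold for retention action.
at_risk = latest.loc[latest["churn_score"] >= 0.7, ["customer_id", "churn_score"]]
at_risk.to_csv("at_risk_customers.csv", index=False)
```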

In the customer segmentation example, the relevant customer segment is identified. Based on the channel/serial the target segment is likely to watch, the appropriate advertisement campaign is designed and the promotion schedule is selected.

  • Learning and Guiding

Once a statistical model or analysis output is deployed, its performance is monitored and analyzed regularly to understand and improve business performance.

In the customer segmentation and promotion scenario, if the impact of the promotion is not significant, an alternate channel or slot could be considered instead of wasting the marketing budget.

3 challenges in getting value from analytics investments

There are a lot of success stories of analytics applications. Organizations across industries, from banks to sports, have used analytics to create competitive advantage or find winning ideas.

Tesco – one of the biggest retailers, Capital One – a leading credit card provider, Netflix – a movie rental organization, and Marriott International – a hotelier are some of the organizations that have employed analytics for sustainable competitive advantage.

Some of the common challenges or difficulties with applying analytics to business decisions are:

    • Poor quality of data
    • Limited data or poorly structured data sample
    • Poor design of analytics deployment and overfitting the analytics

These three hindrances limit the value added by analytics deployment for improved business decisions.

Poor quality of data

Data analytics and insights are based on input data, and if the data has issues, the insights will be inaccurate – garbage in, garbage out. The recommendation in such a scenario is not to use such analytics or insights; instead, organizations should focus on improving the quality of the data.

For one of our clients, at the end of customer calls the customer service representatives enter comments to capture the important points. When we started looking at this unstructured data, we realized that the comments did not really make sense from a business perspective, or merely restated the general category of the call, which was already available as a structured column. This is not an isolated example.

One of the other issues with data is a lot of missing values, but this is the lesser evil. There are multiple approaches available for missing value treatment and analysis. One of the important points to keep in mind is to review the variables and find the rationale behind the missing values. There may be a business reason for a missing value, and that reason can be helpful, e.g. in understanding customer behaviour. A few years back we were building a customer churn model for a telecom client and found that a variable had around 80% missing values. Typically, an analyst would exclude variables with over 30-40% missing. When we looked at the variable, we found that it was "Value of international calls". Of course, not all customers are expected to be international callers. We treated the variable and used it in the model, as sketched below.
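
A minimal sketch of how such a variable can be treated instead of being dropped: the missing values are interpreted as "no international calls", filled with zero, and a separate indicator keeps the information that the value was originally missing. The data and column names are hypothetical.

```python
import pandas as pd
import numpy as np

# Hypothetical telecom data; 'intl_call_value' is missing for customers
# who never made an international call.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "intl_call_value": [np.nan, 120.5, np.nan, np.nan, 40.0],
})

# Keep the fact that the value was missing as a separate indicator,
# then treat the missing values with zero (no international calls).
df["is_intl_caller"] = df["intl_call_value"].notna().astype(int)
df["intl_call_value"] = df["intl_call_value"].fillna(0)

print(df)
```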

Limited data or poorly structured data samples

In the age of big data, you might wonder why I am bringing up this point. There is a difference between the volume of data and the diversity of data. We may have a huge volume of customer transactions for the recent period, yet not have the call or web interaction data.

For developing a good statistical model, we may not require a high volume of data; volume will not necessarily improve model effectiveness or quality. But we have to be very careful in creating the data sample for statistical modeling and analysis.

Example: If one wants to develop a mortgage customer attrition model, the sample data points used to build the model play an important role. Customer attrition behaviour is influenced by economic conditions – whether interest rates are rising or falling. So a relevant sample of data points should be available and used in an appropriate way to bring out the right insights and patterns, for instance as sketched below.
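
A hedged sketch of one way to construct such a sample: observations are drawn from both rising-rate and falling-rate periods so the model sees attrition behaviour under both conditions. The file name, the 'rate_regime' label and the sample size are assumptions for illustration.

```python
import pandas as pd

# Hypothetical mortgage-customer observations labelled with the interest-rate
# regime in which each observation was taken.
history = pd.read_csv("mortgage_history.csv")    # customer_id, rate_regime, attrited, ...

# Draw an equal-sized random sample from each regime so neither rising-rate
# nor falling-rate behaviour dominates the modelling sample.
n_per_regime = 5000
sample = (
    history.groupby("rate_regime", group_keys=False)
    .apply(lambda g: g.sample(n=min(n_per_regime, len(g)), random_state=42))
)
```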

Poor design of analytics deployment and overfitting the analytics

One of the crucial aspects of Capital One's analytics success story is running thousands of business experiments and learning from them. The successful experiments are deployed on a larger scale. If analytics are not deployed properly, only limited learning and performance can be derived.

A lot of organizations use analytics in an ad-hoc way, and any unsuccessful result is taken as an excuse for not using analytics in the future. But the root cause is often a poor design of the analytics deployment.

What is measured can be managed, but not necessarily improved upon. To improve decisions, one has to synthesize and learn from historical decisions and their results – what works, what does not, and why. A proper design of the implementation plan before deployment will ensure that insights can be generated on what works and why it works. Analytics deployment and learning is a systematic, adaptive improvement mechanism, which is key to getting value from analytics investments and creating competitive advantage.

 Reference

Thomas H. Davenport and Jeanne G. Harris, Competing on Analytics: The New Science of Winning

Tactical and practical approach for treating outliers and missing values

Missing Data

There are multiple types of missing data, e.g. Missing at Random (MAR) and Missing Completely at Random (MCAR). For details on the classification and examples, readers are referred to Little and Rubin1. Under MCAR, the missingness is unrelated to any information; under MAR, it may depend on other observed information but not on the missing value itself. For example, when a housewife fills in the information, the income field may be missing, but for a reason.

There are multiple approaches available for missing value treatment and analysis. An important point is to review the variables and find the rationale behind the missing values; there may be a business reason for a missing value, and that reason can be helpful, e.g. in understanding customer behaviour. A few years back, while building a customer churn model for a telecom client, we found that a variable had around 80% missing values. Typically, an analyst would exclude variables with over 30-40% missing, but this variable was "Value of international calls", and of course not all customers are expected to be international callers.

Missing Data

A few approaches to missing value treatment:

  • Deletion of missing observations: This approach can be adopted under the assumption of Missing at Random (MAR) or Missing Completely at Random (MCAR); otherwise the remaining sample could be biased.
  • Replacing with zero, mean or median values: This approach can also cause bias in mean or variance estimation.
  • Using Multiple Imputation2,3 techniques

In the graph, it seems that some of the values are outliers, but actually they are missing values. Analysts have to be careful about such values. Sometimes missing values are denoted with codes such as 99999. In this case, for missing date of birth (DOB) and start date, a default date was populated; hence, when age and years with the organization are calculated, they show patterns with exceptionally high values.
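
A minimal sketch of the clean-up implied above: sentinel codes such as 99999 and default dates are converted to proper missing values before derived variables such as age are calculated. The sentinel code, the default date and the reference date are assumptions for illustration.

```python
import pandas as pd
import numpy as np

# Hypothetical data where 99999 marks a missing income and a default date
# was populated when date of birth was not captured.
df = pd.DataFrame({
    "income": [45000, 99999, 52000],
    "dob": ["1980-05-01", "1900-01-01", "1975-11-20"],
})
df["dob"] = pd.to_datetime(df["dob"])

# Convert the sentinel code and the default date to proper missing values
# so they do not create artificial outliers in derived variables.
df["income"] = df["income"].replace(99999, np.nan)
df.loc[df["dob"] == pd.Timestamp("1900-01-01"), "dob"] = pd.NaT

# Age computed after clean-up no longer shows the exceptionally high values.
df["age"] = (pd.Timestamp("2015-01-01") - df["dob"]).dt.days / 365.25
print(df)
```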

Outlier Data

Outlier data points are observations whose values lie significantly beyond the typical values of a variable. For example, the income of a successful businessman or CEO may be significantly higher than typical values. The inclusion of such observations may bias estimates, including the mean and variance. The impact can be more pronounced in a sample, depending on whether these observations are selected into the sample.

Outliers

In statistical or predictive modeling, outliers can be of two types: outlier values of the dependent variable, and outlier values of a predictor. Outliers in predictor variables are also called leverage points. Residual analysis for regression and graphical analysis are some of the ways to identify outliers.
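
A hedged sketch of two of the detection routes mentioned above: a simple interquartile-range rule for univariate outliers and, in a regression setting, standardized residuals and leverage (hat) values via statsmodels. The simulated data and the flagging thresholds are purely illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.normal(50, 10, 200)
y = 2 * x + rng.normal(0, 5, 200)
x[0], y[1] = 150, 400          # inject a leverage point and an outlier

# Univariate outliers: simple 1.5 * IQR rule on the predictor.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
univariate_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Regression view: standardized residuals flag outliers in y,
# leverage (hat values) flags unusual predictor values.
model = sm.OLS(y, sm.add_constant(x)).fit()
influence = model.get_influence()
std_resid = influence.resid_studentized_internal
leverage = influence.hat_matrix_diag

flagged = pd.DataFrame({
    "x": x, "y": y,
    "outlier_y": np.abs(std_resid) > 3,
    "high_leverage": leverage > 2 * 2 / len(x),   # rule of thumb: 2 * p / n
})
print(flagged[flagged["outlier_y"] | flagged["high_leverage"]].head())
```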

Why are outliers important? How are outliers different from influential points? How can outliers be detected? How can robust regression help?

WOE variable transformation for tackling missing and outlier observations

One of the practical approaches adopted by many practitioners while building a predictive model using binary logistic regression is transforming variables into Weight of Evidence (WOE) variables. WOE variable transformation is used for tackling both missing values and outliers: missing or outlier classes are grouped with other classes based on their Weight of Evidence, using fine and coarse classing.
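
A minimal sketch of the WOE idea, assuming a binary target and a coarse-classed predictor that keeps "Missing" as its own class; the bands, data and sign convention are hypothetical, and in practice fine classing followed by grouping of classes would precede this step.

```python
import pandas as pd
import numpy as np

# Hypothetical data: a binned predictor with an explicit "Missing" class
# and a binary target (1 = event, 0 = non-event).
df = pd.DataFrame({
    "years_band": ["0-2", "0-2", "0-2", "3-5", "3-5", "3-5", "6+", "6+", "Missing", "Missing"],
    "target":     [1, 0, 0, 0, 0, 1, 0, 1, 1, 0],
})

grouped = df.groupby("years_band")["target"].agg(["sum", "size"])
grouped.columns = ["events", "total"]
grouped["non_events"] = grouped["total"] - grouped["events"]

# One common convention: WOE = ln(% of events in class / % of non-events in class);
# the sign convention varies between practitioners.
grouped["woe"] = np.log(
    (grouped["events"] / grouped["events"].sum())
    / (grouped["non_events"] / grouped["non_events"].sum())
)

# Replace the original class with its WOE value for use in logistic regression.
df["years_band_woe"] = df["years_band"].map(grouped["woe"])
print(grouped)
```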

 Reference

1. Little, R.J.A. & Rubin, D.B. (1987). Statistical Analysis with Missing Data. New York: Wiley.

2. http://coedpages.uncc.edu/cpflower/wayman_multimp_aera2003.pdf

3. http://support.sas.com/rnd/app/da/new/dami.html

Approaches to Build a Binary Predictive Model

Summary

A predictive model is developed and used to predict an outcome. When the dependent variable (outcome) is dichotomous, i.e. has two levels (e.g. credit application accepted or rejected), it is called a binary predictive model.

Statistical techniques such as binary logistic regression, decision trees or neural networks can be used to build a model when the target or dependent variable is binary. Predictors can be of any type – categorical (e.g. marital status), ordinal (e.g. income level) or continuous (e.g. spend amount). Predictor variables are also called independent or explanatory variables.

Binary predictive models have many real-life applications across industries, and binary logistic regression is one of the commonly used techniques to build them. When binary logistic regression is used for developing a predictive model, the input predictor variables are transformed to improve the model fit. This is called variable transformation. We are limiting the discussion to predictor variable transformation.

To explain the approaches below, a customer attrition example is considered. If a customer attrites in a period, the target variable takes the value 1, otherwise 0. "Customer years with the organization" is taken as the predictor variable; a brief sketch of all three approaches follows the list below.

Variable Transformation

    • Approach 1: Continuous Transformation
    • Approach 2: Dummy Variable Creation
    • Approach 3: WOE Variable Creation
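
A minimal sketch contrasting the three approaches on the attrition example; the data, bin edges and resulting WOE values are purely illustrative, and in practice the bands would come from fine and coarse classing as described earlier.

```python
import pandas as pd
import numpy as np

# Hypothetical attrition data: target = 1 if the customer attrited in the period.
df = pd.DataFrame({
    "years_with_org": [0.5, 1.2, 1.8, 3.0, 4.5, 4.9, 7.0, 9.5, 12.0, np.nan, np.nan],
    "target":         [1,   1,   0,   1,   0,   0,   0,   1,   0,    1,      0],
})

# Approach 1: continuous transformation, e.g. a log transform of the raw value.
df["years_log"] = np.log1p(df["years_with_org"])

# Approach 2: dummy variables from coarse bands (missing kept as its own band).
df["years_band"] = pd.cut(df["years_with_org"], bins=[0, 2, 5, np.inf],
                          labels=["0-2", "2-5", "5+"])
df["years_band"] = df["years_band"].cat.add_categories("Missing").fillna("Missing")
dummies = pd.get_dummies(df["years_band"], prefix="years")

# Approach 3: WOE value per band, computed as in the earlier WOE sketch.
events = df.groupby("years_band", observed=False)["target"].sum()
totals = df.groupby("years_band", observed=False)["target"].size()
non_events = totals - events
woe = np.log((events / events.sum()) / (non_events / non_events.sum()))
woe_map = dict(zip(woe.index.astype(str), woe.values))
df["years_woe"] = df["years_band"].astype(str).map(woe_map)

print(pd.concat([df, dummies], axis=1))
```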