Classification, Regression and How They Can Save You Time & Money

Whom should I target with my next campaign? What are the characteristics of a person who will churn within the next few weeks? What is the best location for my upcoming event? Businesses usually ask questions about the future, but the data they can draw on is always historical.

Now, “predicting the future” admittedly sounds a bit like magic. However, many questions about the future can be answered using data from the past and the hypothesis that the same underlying mechanics that formed past data will be at play in the future. Regardless of the tool used, the foundation is not magic, but statistics.

This blog series will explain some of the most important statistical concepts, their fields of application and an example of how they help save time and money. Read on for an applied example of segmenting clients based on socio-demographic data for a marketing campaign using SAP InfiniteInsight®.

A Little Bit of Theory

Classification is a methodology used to assign an observation to one of a known, user-defined set of classes. A statistical model is a mathematical formula that expresses a target (output, dependent, explained) variable by means of some explanatory (input, independent, predictor) variables. The target variable is the class that an observation is associated with, while any other attribute of the observation can serve as an explanatory variable.

The classes that we’re associating our data with are chosen, not derived from the data. The reliability and usefulness of such assignments depend heavily on the actual, causal relationships between the variables, whether known or unknown. If the target variable is discrete (or categorical), i.e. one that can take on a limited and usually fixed number of values, we speak of classification. If the target variable is continuous (such as a temperature), we speak of regression. Target classes are often dichotomous, e.g. good/bad, dirty/clean, buyer/non-buyer. The term classification is sometimes replaced by the word prediction, especially in non-scientific literature.
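To make the distinction concrete, here is a minimal sketch in plain Python. The data, the threshold and the function names `fit_line` and `classify` are made up for illustration and are not taken from any particular tool: the first function fits a continuous target (regression), the second assigns one of two chosen classes (classification).

```python
def fit_line(xs, ys):
    """Ordinary least squares with one explanatory variable: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def classify(x, threshold):
    """Assign one of two user-defined classes (a dichotomous target)."""
    return "buyer" if x >= threshold else "non-buyer"


# Toy data, invented purely for illustration.
usage_hours = [1.0, 2.0, 3.0, 4.0]    # explanatory variable
amount_spent = [3.0, 5.0, 7.0, 9.0]   # continuous target -> regression

slope, intercept = fit_line(usage_hours, amount_spent)
predicted_spend = slope * 5.0 + intercept        # a number on a continuous scale

predicted_class = classify(5.0, threshold=2.5)   # one of the chosen classes
```

Note that the threshold of 2.5 is set by hand here; in a real model it would be estimated from historical data.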

Here’s an example: We are fairly safe classifying people into ‘rich’ and ‘poor’ based on their average income over the past 5 years; there is a direct, causal relationship between the two and we can therefore predict the characteristic ‘richness’ using the explanatory variable ‘average income’ with some confidence. However, we can have less confidence when predicting richness based on the age of a person. Of course, there is some correlation between the two, but no strict causality or even dependency, and other influencing factors come into play.

Fields of Application

While it may seem that our statistical weapons are quite weak, their power lies in their application. As we are free to define the classes that we want our observations to be related to, we can apply the same technique to many different fields. Here are some examples:

  • The propensity of a client to buy a certain product based on CRM data;
  • The likelihood that a client will churn based on usage statistics;
  • The optimal pricing for a product based on location and sociodemographics;
  • The probability that a client will default on credit payments based on payment history;
  • The lifetime of a machine under certain environment variables;
  • The number of defective supplies based on logistics data;
  • The success of an event based on its location;
  • The ‘fit’ of an employee based on past employments;
  • The failure of a delivery system based on meteorological data.

Keep in mind that the best model cannot predict anything useful if there are no (useful) correlations in the data or too much noise.

Practice: Using SAP InfiniteInsight®

As the workflow in InfiniteInsight usually starts by selecting a technique or a family of algorithms rather than a business problem, the tool requires the user to have some knowledge about the suitability of an approach for a given problem. Likewise, it is advantageous to understand the ways applicable algorithms work, the type of target variables one can obtain and the ways to influence the creation of the model. However, with automation in mind, the tool uses a proprietary algorithm which remains a ‘black box’ to a large degree.

What might look like a disadvantage at first sight is in practice a modest loss of flexibility in exchange for a huge gain in productivity. Steps such as cutting the training data into estimation, validation and testing sets, selection and binning (regrouping) of variables and parameterization are largely automated. The user can, but does not have to, intervene. As a result, it becomes possible to supply a large number of potential explanatory variables without risk and let the tool decide which ones to use, which translates into more accurate models at no extra cost. It is also possible to create a large number of models in parallel, e.g. scoring individuals for campaigns for hundreds of products and selecting only those with the highest level of confidence, translating into unprecedented efficiency.
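For readers who have never done these steps by hand, the following sketch shows what cutting a dataset and binning a variable could look like in plain Python. It does not reproduce InfiniteInsight's proprietary procedure; the 60/20/20 split, the bin edges and the function names `split_dataset` and `bin_value` are assumptions chosen for illustration.

```python
import random


def split_dataset(rows, percentages=(60, 20, 20), seed=42):
    """Shuffle rows and cut them into estimation, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n = len(shuffled)
    n_est = n * percentages[0] // 100
    n_val = n * percentages[1] // 100
    estimation = shuffled[:n_est]
    validation = shuffled[n_est:n_est + n_val]
    test = shuffled[n_est + n_val:]        # whatever remains
    return estimation, validation, test


def bin_value(value, edges):
    """Return the index of the bin a value falls into, given ascending bin edges."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)


rows = list(range(100))                    # stand-in for 100 observations
estimation, validation, test = split_dataset(rows)

# Regroup ages into three bins: under 30, 30 to 49, 50 and over.
age_bins = [bin_value(age, edges=[30, 50]) for age in (25, 30, 49, 50, 70)]
```

The model is fitted on the estimation set, tuned on the validation set and assessed on the test set, so that its quality is measured on data it has never seen.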

Classification/Regression Using SAP InfiniteInsight®

Here’s the scenario: In order to maximise the impact of our marketing campaign, we want to maximise the number of potential buyers among the targeted individuals. Therefore, we want to assign a score to each individual that describes their willingness to buy and then take the top-scoring individuals that fit into our campaign budget.
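The selection step itself is simple once scores exist. As a minimal sketch, assuming each individual already has a score and every contact costs the same amount (the function name `select_targets`, the scores and the costs are hypothetical):

```python
def select_targets(scores, budget, cost_per_contact):
    """Return the IDs of the highest-scoring individuals the budget can cover."""
    max_contacts = int(budget // cost_per_contact)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [person_id for person_id, _ in ranked[:max_contacts]]


# Invented scores: a higher value means a higher estimated willingness to buy.
scores = {"A": 0.91, "B": 0.15, "C": 0.64, "D": 0.38, "E": 0.77}

# A budget of 6.0 at 2.0 per contact covers three individuals.
targets = select_targets(scores, budget=6.0, cost_per_contact=2.0)
```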

In the opening screen, we select ‘Create a Regression/Classification Model’ and subsequently select a data source.

Click on ‘Analyze’ in step 3 to get a preview of the columns contained. Step 4 lets us select target and explanatory variables: In our case we select two variables for prediction and eliminate everything that we can’t apply to other datasets from the explanatory variables.

Click ‘Next’ and ‘Generate’ in the subsequent screens to create the model. A summary is displayed. Click ‘Next’ to access the following screen, which lets you inspect, apply and deploy the model.

Use the confusion matrix to assess the predictive power of the model. We can expect roughly 50 % of all recipients to buy our product when targeting just 10 % of the population. While this is not an extremely high value, it represents a huge increase in precision, as only about 19 % of the whole population are buyers. Click on ‘Contributions by Variable’ if you want to inspect which variables are the most powerful predictors.
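The size of that increase can be put into numbers. The sketch below uses the rounded figures quoted above (50 %, 10 %, 19 %), so the results are approximate; the function names are illustrative and not part of the tool.

```python
def lift(precision_in_target, base_rate):
    """How many times more buyers the targeted group contains than a random one."""
    return precision_in_target / base_rate


def share_of_buyers_reached(targeted_fraction, precision_in_target, base_rate):
    """Fraction of all buyers in the population that the campaign reaches."""
    return targeted_fraction * precision_in_target / base_rate


campaign_lift = lift(0.50, 0.19)
reached = share_of_buyers_reached(0.10, 0.50, 0.19)

print(f"lift: {campaign_lift:.2f}")             # about 2.63
print(f"share of buyers reached: {reached:.1%}")  # about 26.3%
```

In other words, contacting the top-scored 10 % of the population yields about 2.6 times as many buyers as contacting a random 10 %, and reaches roughly a quarter of all buyers.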

If you want to deploy your model and use it in an existing application such as a dashboard or report, export the model to SQL using ‘Generate Source Code’ and integrate it into your data warehouse (DWH).
