Zac Bodner Final Lab Assignment

In class this semester, we have already explored regression for explanatory purposes. For example, we previously built a regression model to explain the effects that certain independent variables (trustworthiness, intelligence, "like me-ness") have on a dependent variable ("I have a good opinion of this candidate") for political candidates. These models were helpful in determining how different candidates could employ certain tactics to gain favor (and hopefully votes) from voters. In this regard, regression models are effective tools for explaining the reasons behind certain phenomena, like why certain candidates do well with certain populations, or what factors make us love ice cream. But their real power comes from how capable these models are of predicting outcomes. For example, if we have an explanatory regression model that identifies certain characteristics (variables) shared by customers and non-customers in a particular data set, could we then turn around and use that model to identify additional, probable customers (and non-customers) in another, independent data set (and make tons of cash for our employers in the process)? We can and we will! The following paper presents a step-by-step explanation of how to do this, based on our final lab assignment of the semester.

The first thing we must do is divide the current data set (Customer, which we have been using all semester) in half. This will allow us to confirm our model's findings from one half of the data set on the other. In other words, it will help us show that our model works not only on the data set we are testing, but on other, unrelated (random) data sets as well. The key to dividing a data set into equal, random halves is to confirm that both sides are distributed evenly on a number of variables. We already divided the data set using SPSS in a previous assignment, and confirmed the randomness and equality of both halves by examining the distributions of some of these variables in a CROSSTABS setting. If the differences in the distributions of these variables are very small (fractions of a percent), then we are good to go. From the previous assignment, we can confirm that our data set is split into two randomized and equal halves.

From here, we must calculate some of the interactions between variables that we found in another previous assignment: the CHAID segmentation. CHAID stands for Chi-Square Automatic Interaction Detector. It produces a tree that shows which variables contain the largest segments of customers, and continues on by further dividing each segment. For example, the tree starts by segmenting customers from non-customers. Then it segments the customers further by, say, the market value of their home. Then it continues by separating this market-value segment by gender. By doing this, we can examine the percentage of customers in each segment and compare their concentration to the rest of the data set for targeting purposes.

By identifying these interactions, we can create new variables to add to our model, which is what we will do here. Even though CHAID is a great tool for segmenting the data set, we are more interested in seeing the total, combined interactions, and for that purpose a regression will always be the better option. To turn these segmentation interactions into variables, we simply multiply some of the segments that demonstrated high concentrations of customers. Here is a screenshot of a few that I used:
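For readers who prefer syntax to dialog boxes, here is a minimal sketch of these two steps: the random fifty-fifty split with its CROSSTABS sanity check, and the interaction variables built by multiplying CHAID segments. The variable names (HALF, GENDER, REGION, MARKETVAL) and the seed are placeholders for illustration, not the actual names in the Customer data set:

    * Randomly assign each case to Half 1 or Half 2 (HALF is assumed).
    SET SEED=54321.
    COMPUTE HALF = RV.BERNOULLI(0.5) + 1.
    EXECUTE.

    * Check that key distributions match across the two halves.
    CROSSTABS
      /TABLES=GENDER REGION BY HALF
      /CELLS=COUNT COLUMN.

    * Multiply high-concentration CHAID segments into new variables.
    COMPUTE INT1 = MARKETVAL * GENDER.
    COMPUTE INT2 = MARKETVAL * REGION.
    EXECUTE.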
These interactions are computed and added to the list of our data set's independent variables. From here, we can add them, along with all of our other independent variables, to a stepwise regression. First, we must select which half of the data set we want to test; we will test Half 1 by selecting it in SPSS. A stepwise regression is valuable for differentiating significant predictor (independent) variables from insignificant ones. All you do is throw in the kitchen sink (all of your variables), and SPSS will find the ones with the strongest beta coefficients (relationships to our outcome variable) and order them from highest to lowest in the model.

For purposes of orderliness and ease of use, we always want our models to be parsimonious, meaning they have as few variables as possible while still making good predictions. This being the case, I chose the first eleven variables the stepwise regression returned. You will know when to stop adding variables by how much the total R-square value increases. The R-square (and the adjusted R-square, for multiple variables) is a measure of how much of the variance in the outcome variable is explained by the independent variables in the model. If ten variables give you an adjusted R-square of .159, and fifteen variables give you .164, it is best to just use the ten variables, because each additional variable is not explaining much variance at that point.

We're cooking with gas now! We have eleven solid, significant variables that we can now throw into a regular regression, signified by switching the "Stepwise" option to the "Enter" option in SPSS. It is important to note that for this assignment, we need to check the option to replace missing values with the mean in SPSS. This means that if any single member of our data population has missing values, we replace those missing values with the mean for that variable instead of tossing the member altogether. This way, we neither add anything to nor subtract anything from the model, and we do not have to waste data. Luckily for us, among the eleven variables that made it into our model, there were no missing data.

Now we input the variables into the "Enter" regression and save the predicted output as a variable. We will call the variable PREDICT, because we will use it later to predict, based on our observations of customers and non-customers in this data set, the likelihood of finding additional customers in separate data sets.
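In syntax form, the whole sequence might look roughly like the sketch below. This is only an illustration: HALF, CUSTOMER, and the predictor names VAR1 through VAR11 (along with the longer candidate list VAR1 TO VAR30) are placeholders, not the actual variables in our model:

    * Select Half 1 (HALF is the assumed split indicator from above).
    COMPUTE filter_$ = (HALF = 1).
    FILTER BY filter_$.
    EXECUTE.

    * Stepwise pass over the full kitchen sink of candidate predictors,
    * replacing missing values with the mean.
    REGRESSION
      /MISSING MEANSUBSTITUTION
      /DEPENDENT CUSTOMER
      /METHOD=STEPWISE VAR1 TO VAR30.

    * Final pass with the eleven survivors entered together, saving
    * the predicted values as PREDICT.
    REGRESSION
      /MISSING MEANSUBSTITUTION
      /DEPENDENT CUSTOMER
      /METHOD=ENTER VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR8 VAR9 VAR10 VAR11
      /SAVE PRED(PREDICT).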
We will now divide this output into deciles, since we are concerned primarily with our model's ability to find prospects based on how much they resemble the observed customers of this data set. SPSS does this fairly easily through Transform > Rank Cases. We then save that output as DECILES, and using CROSSTABS, we can compare our observed customers with our ranked predictions. Here is a screenshot of this:
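The equivalent syntax, assuming the same placeholder names as above, might be:

    * Rank predicted values into ten groups, with the highest
    * predictions landing in decile 1, and save them as DECILES.
    RANK VARIABLES=PREDICT (D)
      /NTILES(10) INTO DECILES.

    * Compare observed customers against the predicted deciles.
    CROSSTABS
      /TABLES=CUSTOMER BY DECILES
      /CELLS=COUNT COLUMN.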
This is great; our prediction works. We would hope that the highest decile (1) has the highest concentration of customers, and, vice versa, that the non-customers come disproportionately from the lowest decile (10). As we can see, that is exactly what happens. In the first decile, we have an 8.9% higher concentration of customers than in the rest of the data set. We can take those odds to Vegas! Now we have to test this model on the other half, Half 2. To do this, we have to calculate a score from our output that we can apply to the second half. But before we do, we must confirm that the score we calculate correctly reproduces the regression output we already have.
This is pretty easy: we go into Transform > Compute and enter an equation based on the variables in the model:
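As a rough sketch of the form this equation takes (the coefficients and variable names below are invented for illustration; the real values come from the B column of the regression output), the score is the constant plus each variable weighted by its unstandardized coefficient:

    * Hypothetical score equation: constant + B1*X1 + B2*X2 + ...
    * These numbers are placeholders, not the actual coefficients.
    COMPUTE SCORE = 0.042 + 0.113*VAR1 + 0.087*VAR2 - 0.054*VAR3.
    EXECUTE.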
Our final score then looks like this, and we save it as another variable in the set, SCORE:
We then compare this variable SCORE to PREDICT, and luckily for us, they are almost identical. This means our score calculation is a correct interpretation of our regression model and can now be applied to an independent sample (data set). To get to our simulated independent data set, we now select Half 2. With Half 2 active, we run the score computation we just made, and then transform the output into deciles once again. We save these deciles as DECILES2 and run the same CROSSTABS as before. If we have done our job correctly, this crosstab will look almost identical to, and hopefully a little better than, the first one.
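In syntax, again with the placeholder names standing in for the real ones:

    * Switch the filter to Half 2.
    COMPUTE filter_$ = (HALF = 2).
    FILTER BY filter_$.
    EXECUTE.

    * Re-apply the score equation, rank it into deciles as DECILES2,
    * and rerun the same crosstab as before.
    COMPUTE SCORE = 0.042 + 0.113*VAR1 + 0.087*VAR2 - 0.054*VAR3.
    EXECUTE.
    RANK VARIABLES=SCORE (D)
      /NTILES(10) INTO DECILES2.
    CROSSTABS
      /TABLES=CUSTOMER BY DECILES2
      /CELLS=COUNT COLUMN.

Drumroll!!!!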
Hallelujah! This crosstab has a slightly higher percentage of customers in the first decile, but is practically the same. Check out the comparison on the next page. Our predictive model works!
This is great news. We built a regression model and ran it on one data set. If the model truly worked, we would expect similar results from running the same model on a different data set, and we got those similar results. With these techniques and a handy tool like SPSS, we can take a data set, build a regression model to find characteristics shared by customers and non-customers, and then validate this model on an independent data set. By doing this, we can significantly increase our chances of finding new customers anywhere, which is obviously an incredibly valuable skill to have. But remember, anyone can input numbers into SPSS. The real difference between a good and a great market researcher is being able to interpret those numbers by asking good questions and employing impeccable language!