Chapter 6  Model Assessment

6.1  Model Fit Statistics                                              6-3
     Demonstration: Comparing Models with Summary Statistics           6-6
6.2  Statistical Graphics                                              6-9
     Demonstration: Comparing Models with ROC Charts                   6-13
     Demonstration: Comparing Models with Score Rankings Plots         6-18
     Demonstration: Adjusting for Separate Sampling                    6-21
6.3  Adjusting for Separate Sampling                                   6-26
     Demonstration: Adjusting for Separate Sampling (continued)        6-29
     Demonstration: Creating a Profit Matrix                           6-32
6.4  Profit Matrices                                                   6-39
     Demonstration: Evaluating Model Profit                            6-42
     Demonstration: Viewing Additional Assessments                     6-44
     Demonstration: Optimizing with Profit (Self-Study)                6-46
     Exercises                                                         6-48
6.5  Chapter Summary                                                   6-49
6.6  Solutions                                                         6-50
     Solutions to Exercises                                            6-50
6.1 Model Fit Statistics
Summary Statistics

Prediction Type    Statistic
Decisions          Accuracy/Misclassification
                   Profit/Loss (inverse prior threshold)
As introduced in Chapter 3, summary statistics can be grouped by prediction type. For decision predictions, the Model Comparison tool rates model performance based on accuracy or misclassification, profit or loss, and the Kolmogorov-Smirnov (KS) statistic. Accuracy and misclassification tally the correct or incorrect prediction decisions. Profit is detailed later in this chapter. The Kolmogorov-Smirnov statistic describes the ability of the model to separate the primary and secondary outcomes.
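As a sketch of what the KS statistic measures (plain Python for illustration; not SAS Enterprise Miner's internal code, and the scores below are made up), it is the maximum vertical separation between the cumulative score distributions of the two outcome groups:

```python
def ks_statistic(scores_primary, scores_secondary):
    """Maximum difference between the empirical CDFs of model scores
    for the primary and secondary outcome groups."""
    thresholds = sorted(set(scores_primary) | set(scores_secondary))
    n1, n0 = len(scores_primary), len(scores_secondary)
    best = 0.0
    for t in thresholds:
        cdf1 = sum(s <= t for s in scores_primary) / n1
        cdf0 = sum(s <= t for s in scores_secondary) / n0
        best = max(best, abs(cdf1 - cdf0))
    return best

# A model whose scores fully separate the two groups has KS = 1.0.
print(ks_statistic([0.8, 0.9, 0.7], [0.1, 0.2, 0.3]))  # -> 1.0
```

A model with no separating power at all (identical score distributions in both groups) has KS = 0.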
Summary Statistics

Prediction Type    Statistic
Rankings           ROC Index (concordance)
                   Gini coefficient
For ranking predictions, the Model Comparison tool gives two closely related measures of model fit. The ROC index is similar to concordance (introduced in Chapter 3). The Gini coefficient (for binary prediction) equals 2 x (ROC Index - 0.5).

The ROC index equals the percent of concordant cases plus one-half times the percent of tied cases. Recall that a pair of cases, consisting of one primary outcome and one secondary outcome, is concordant if the primary outcome case has a higher rank than the secondary outcome case. By contrast, if the primary outcome case has a lower rank, that pair is discordant. If the two cases have the same rank, they are said to be tied.
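This pairwise definition translates directly into code. The sketch below is an illustration in Python with made-up scores, not SAS code; it counts concordant and tied pairs to form the ROC index and derives the Gini coefficient from it.

```python
from itertools import product

def roc_index(primary_scores, secondary_scores):
    """ROC index: proportion of concordant pairs plus half the tied pairs."""
    concordant = tied = 0
    for p, s in product(primary_scores, secondary_scores):
        if p > s:
            concordant += 1      # primary case ranked higher: concordant
        elif p == s:
            tied += 1            # equal ranks: tied
    n_pairs = len(primary_scores) * len(secondary_scores)
    return (concordant + 0.5 * tied) / n_pairs

def gini(primary_scores, secondary_scores):
    """Gini coefficient for a binary prediction: 2 * (ROC Index - 0.5)."""
    return 2 * (roc_index(primary_scores, secondary_scores) - 0.5)
```

With three primary scores {0.9, 0.8, 0.4} and three secondary scores {0.7, 0.3, 0.2}, eight of the nine pairs are concordant, so the ROC index is 8/9 and the Gini coefficient is 7/9.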
Summary Statistics

Prediction Type    Statistic
Estimates          Average squared error
                   SBC/Likelihood
For estimate predictions, the Model Comparison tool provides two performance statistics. Average squared error was used to tune many of the models fit in earlier chapters. Schwarz's Bayesian Criterion (SBC) is a penalized likelihood statistic. The likelihood statistic was used to estimate regression and neural network model parameters and can be thought of as a weighted average squared error.

SBC is provided only for regression and neural network models and is calculated only on training data.
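As a rough illustration of these estimate statistics (a Python sketch, not SAS Enterprise Miner's internal code), the computation below averages the squared error over both target levels, which is why the "Divisor for ASE" reported by the Model Comparison node is twice the case count:

```python
import math

def binary_fit_stats(actual, predicted, n_params):
    """ASE, log-likelihood, and SBC for a binary target.
    actual: 0/1 outcomes; predicted: estimated P(primary outcome);
    n_params: number of model parameters (the SBC penalty term)."""
    n = len(actual)
    # One squared-error term per target level, hence the 2n divisor.
    ase = sum((y - p) ** 2 + ((1 - y) - (1 - p)) ** 2
              for y, p in zip(actual, predicted)) / (2 * n)
    loglik = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                 for y, p in zip(actual, predicted))
    sbc = n_params * math.log(n) - 2 * loglik   # penalized likelihood
    return ase, loglik, sbc
```

Smaller ASE and SBC values indicate better fit; a larger (less negative) log-likelihood does.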
Comparing Models with Summary Statistics
After you build several models, it is desirable to compare their performance. The Model Comparison tool collects assessment information from attached modeling nodes and enables you to easily compare model performance measures.

1. Select the Assess tab.

2. Drag a Model Comparison tool into the diagram workspace.

3. Connect both Decision Trees, the Regression node, and the Neural Network node to the Model Comparison node as shown. (Self-study models are ignored here.)
4. Run the Model Comparison node and view the results. The Results window opens.
The Results window contains four sub-windows: ROC Chart, Score Rankings, Fit Statistics, and Output.

5. Maximize the Output window.
6. Go to line 90 in the Output window.

[Output listing, Data Role=Valid: fit statistics for the Reg, Neural, Tree, and Tree3 models, including the Kolmogorov-Smirnov statistic, average squared error, ROC index, Gini coefficient, cumulative lift, misclassification rate, and number of wrong classifications. The values are nearly identical across the models.]
The output shows various fit statistics for the selected models. It appears that the performance of each model, as gauged by fit statistics, is quite similar. As discussed above, the best choice of fit statistic depends on the predictions of interest.

Prediction Type    Validation Fit Statistic          Direction
Decisions          Misclassification                 smallest
                   Average Profit/Loss               largest/smallest
                   Kolmogorov-Smirnov Statistic      largest
Rankings           ROC Index (concordance)           largest
                   Gini Coefficient                  largest
Estimates          Average Squared Error             smallest
                   Schwarz's Bayesian Criterion      smallest
                   Log-Likelihood                    largest
6.2 Statistical Graphics
Statistical Graphics - ROC Chart

The ROC chart illustrates the tradeoff between the captured response fraction (sensitivity) and the false positive fraction (1 - specificity).

The Model Comparison tool features two charts to aid in model assessment: the ROC chart and the Score Rankings chart. Consider the ROC chart.
Statistical Graphics - ROC Chart

To create a ROC chart, predictions are generated for a set of validation data. For chart generation, the predictions must be rankings or estimates. The validation data is sorted from high to low (either scores or estimates). Each point on the ROC chart corresponds to a specific fraction of the sorted data.
Statistical Graphics - ROC Chart

For example, the red point on the ROC chart corresponds to the indicated selection of 40% of the validation data. That is, the points in the gray region on the scatter plot are in the highest 40% of predicted probabilities.

The vertical or y-coordinate of the red point indicates the fraction of primary outcome cases "captured" in the gray region (here about 45%).

The horizontal or x-coordinate of the red point indicates the fraction of secondary outcome cases "captured" in the gray region (here about 25%).

The ROC chart represents the union of similar calculations for all selection fractions.
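The construction just described (sort by score, then record the captured fractions at each selection depth) can be sketched as follows. This is an illustration in Python with made-up scores, not SAS Enterprise Miner code.

```python
def roc_points(cases):
    """cases: (predicted score, actual outcome 0/1) pairs.
    For each selection depth, record the fraction of secondary cases
    captured (x, i.e. 1 - specificity) and of primary cases captured
    (y, i.e. sensitivity)."""
    ranked = sorted(cases, key=lambda c: c[0], reverse=True)
    n1 = sum(actual for _, actual in ranked)
    n0 = len(ranked) - n1
    x, y = [0.0], [0.0]
    caught1 = caught0 = 0
    for _, actual in ranked:
        caught1 += actual
        caught0 += 1 - actual
        x.append(caught0 / n0)
        y.append(caught1 / n1)
    return x, y

def roc_area(x, y):
    """Trapezoidal area under the ROC curve: the ROC Index (c-statistic)."""
    return sum((x[i + 1] - x[i]) * (y[i + 1] + y[i]) / 2
               for i in range(len(x) - 1))

cases = [(0.9, 1), (0.8, 1), (0.6, 0), (0.4, 1), (0.2, 0)]
x, y = roc_points(cases)
```

For these five cases the area works out to 5/6, which equals the concordant-pair proportion: the area under the ROC chart and concordance are the same calculation viewed two ways.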
Statistical Graphics - ROC Chart

The ROC chart provides a nearly universal diagnostic for predictive models. Models that capture primary and secondary outcome cases in a proportion approximately equal to the selection fraction are weak models (left). Models that capture mostly primary outcome cases without capturing secondary outcome cases are strong models (right).

Statistical Graphics - ROC Index

A weak model has a ROC Index below about 0.6; a strong model has a ROC Index above about 0.7.

The tradeoff between primary and secondary case capture can be summarized by the area under the ROC curve. In SAS Enterprise Miner, this area is called the ROC Index. (In statistical literature, it is more commonly called the c-statistic.) Perhaps surprisingly, the ROC Index is closely related to concordance, the measure of correct case ordering.
Comparing Models with R O C Charts
Use the following steps to compare models using ROC charts.

1. Maximize the ROC chart. The ROC Chart window plots Sensitivity against 1 - Specificity for the Data Role = TRAIN and Data Role = VALIDATE data, with curves for the Baseline, Decision Tree, Neural Network, Probability Tree, and Regression models.
The ROC chart shows little difference between the non-tree models. This is consistent with the values of the ROC Index, which equals the area under the ROC curves.
Statistical Graphics - Response Chart

The response chart shows the expected response rate (cumulative percent response) for various selection percentages.

A second category of assessment charts examines response rate. It is the prototype of the so-called Score Rankings charts found in every model Results window.
Statistical Graphics - Response Chart

As with ROC charts, a model is applied to validation data to sort the cases from highest to lowest (again, by prediction rankings or estimates). Each point on the response chart corresponds to a selected fraction of cases.
Statistical Graphics - Response Chart

For example, the red point on the response chart corresponds to the indicated selection of 40% of the validation data: the 40% of cases with the highest predicted values. That is, the points in the gray region on the scatter plot are in the highest 40% of predicted probabilities.

The x-coordinate of the red point is simply the selection fraction (in this case, 40%).
Statistical Graphics - Response Chart

The vertical coordinate for a point on the response chart is the proportion of primary outcome cases in the selected fraction. It is called the cumulative percent response in the SAS Enterprise Miner interface and is more widely known as cumulative gain in the predictive modeling literature. Dividing cumulative gain by the primary outcome proportion yields a quantity named lift.
Statistical Graphics - Response Chart

Plotting cumulative gain for all selection fractions yields a gains chart. Notice that when all cases are selected, the cumulative gain equals the overall primary outcome proportion.
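Cumulative gain and lift at a single selection fraction can be sketched as follows (hypothetical data, Python for illustration only):

```python
def gain_and_lift(cases, fraction):
    """cases: (predicted score, actual 0/1) pairs. Returns (cumulative
    gain, cumulative lift) for the top `fraction` of ranked cases: the
    primary outcome proportion among the selected cases, and that
    proportion divided by the overall primary outcome proportion."""
    ranked = sorted(cases, key=lambda c: c[0], reverse=True)
    k = max(1, round(fraction * len(ranked)))
    selected = [actual for _, actual in ranked[:k]]
    gain = sum(selected) / k
    overall = sum(actual for _, actual in ranked) / len(ranked)
    return gain, gain / overall

cases = [(0.9, 1), (0.85, 1), (0.8, 0), (0.7, 1), (0.6, 0),
         (0.5, 0), (0.4, 1), (0.3, 0), (0.2, 0), (0.1, 0)]
gain, lift = gain_and_lift(cases, 0.4)       # top 40% of cases
full_gain, full_lift = gain_and_lift(cases, 1.0)
```

Selecting everyone returns the overall primary outcome proportion, so the lift at 100% is exactly 1.0, matching the slide's observation.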
Comparing Models with Score Rankings Plots
Use the following steps to compare models with Score Rankings plots:

1. Maximize the Score Rankings Overlay window. The window plots Cumulative Lift by Percentile for the Data Role = TRAIN and Data Role = VALIDATE data.
2. Double-click the Data Role = VALIDATE plot. The Score Rankings Overlay: TargetB plot expands, showing Cumulative Lift by Percentile for the Neural Network, Regression, Decision Tree, and Probability Tree models.

The Score Rankings Overlay plot presents what is commonly called a cumulative lift chart. Cases in the training and validation data are ranked based on decreasing predicted target values. A fraction of the ranked data is selected. This fraction, or decile, corresponds to the horizontal axis of the chart. The ratio of (proportion of cases with the primary outcome in the selected fraction) to (proportion of cases with the primary outcome overall) is defined as cumulative lift. Cumulative lift corresponds to the vertical axis. High values of cumulative lift suggest that the model is doing a good job separating the primary and secondary cases. As can be seen, the model with the highest cumulative lift depends on the decile; however, none of the models has a cumulative lift over 1.5. It is instructive to view the actual proportion of cases with the primary outcome (called gain or cumulative percent response) at each decile.
3. Select Chart ⇨ Cumulative % Response. The Score Rankings Overlay plot is updated to show Cumulative Percent Response on the vertical axis.

This plot should show the response rate for soliciting the indicated fraction of individuals. Unfortunately, the proportion of responders in the training data does not equal the true proportion of responders for the 97NK campaign. The training data under-represents non-responders by almost a factor of 20! This under-representation was not an accident. It is a rather standard predictive modeling practice known as separate sampling. (Oversampling, balanced sampling, choice-based sampling, case-control sampling, and other names are also used.)
Adjusting for Separate Sampling
If you do not adjust for separate sampling, the following occurs:
• Prediction estimates reflect target proportions in the training sample, not the population from which the sample was drawn.
• Score Rankings plots are inaccurate and misleading.
• Decision-based statistics related to misclassification or accuracy misrepresent the model performance on the population.

Fortunately, it is easy to adjust for separate sampling in SAS Enterprise Miner. However, you must rerun the models that you created.

Because this can take some time, it is recommended that you run this demonstration during the discussion about the benefits and consequences of separate sampling.
Follow these steps to integrate sampling information into your analysis.

1. Close the Results - Model Comparison window.

2. Select the Decisions ... property in the PVA97NK node's Properties panel.
The Decision Processing - PVA97NK dialog box opens. The Decision Processing dialog box enables you to inform SAS Enterprise Miner about the extent of separate sampling in the training data. This is done by defining prior probabilities.

3. Select Build in the Decision Processing dialog box. SAS Enterprise Miner scans the raw data to determine the proportion of primary and secondary outcomes in the raw data.
4. Select the Prior Probabilities tab. The tab shows a Count of 4843 cases and a Prior of 0.5 for each target level.
5. Select Yes. The dialog box is updated to show the Adjusted Prior column.

The Adjusted Prior column enables you to specify the proportion of primary and secondary outcomes in the original PVA97NK population. When the analysis problem was introduced in Chapter 3, the primary outcome (response) proportion was claimed to be 5%.
6. Type 0.05 as the Adjusted Prior value for the primary outcome, Level 1.

7. Type 0.95 as the Adjusted Prior value for the secondary outcome, Level 0.

8. Select OK to close the Decision Processing dialog box.
Decision Processing and the Neural Network Node

If your diagram was developed according to the instructions in Chapters 3 through 5, one modeling node (the Neural Network) needs a property change to correctly use the decision processing information. Follow these steps to adjust the Neural Network node settings.

1. Examine the Properties panel for the Neural Network node. The default model selection criterion is Profit/Loss.

2. Select Model Selection Criterion ⇨ Average Error.

When no decision data is defined, the neural network optimizes complexity using average squared error, even though the default says Profit/Loss. Now that decision data is defined, you must manually change Model Selection Criterion to Average Error.
Decision Processing and the AutoNeural Node

Decision processing data (prior probabilities and decision weights) is ignored by the AutoNeural node. Predictions from the node are adjusted for priors, but the actual model selection process is based strictly on misclassification (without prior adjustment). This fact can lead to unexpected prediction results when the primary and secondary outcome proportions are not equal. Fortunately, the PVA97NK data has an equal proportion of primary and secondary outcomes and gives reasonable results with the AutoNeural node.

Using the AutoNeural node with training data that does not have equal outcome proportions is reasonable only if your data is not separately sampled and your prediction goal is classification.

3. Run the diagram from the Model Comparison node. This reruns the entire analysis.
6.3 Adjusting for Separate Sampling
Outcome Overrepresentation

A common predictive modeling practice is to build models from a sample with a primary outcome proportion different from the true population proportion. This is typically done when the ratio of primary to secondary outcome cases is small.

Separate Sampling

Target-based samples are created by considering the primary outcome cases separately from the secondary outcome cases.
Separate Sampling

Secondary outcome: select some cases. Primary outcome: select all cases.

Separate sampling gets its name from the technique used to generate the modeling data; that is, samples are drawn separately based on the target outcome. In the case of a rare primary outcome, usually all primary outcome cases are selected. Then, each primary outcome case is matched by one or (optimally) more secondary outcome cases.
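The selection scheme just described (keep every primary case, draw a matched number of secondary cases) might be sketched like this; the data, ratio, and seed are hypothetical illustrations, not part of the course software:

```python
import random

def separate_sample(cases, secondaries_per_primary=1, seed=12345):
    """cases: (inputs, target 0/1) pairs. Keep every primary outcome
    case and draw a matched number of secondary outcome cases at random."""
    rng = random.Random(seed)
    primary = [c for c in cases if c[1] == 1]
    secondary = [c for c in cases if c[1] == 0]
    n_secondary = min(len(secondary), secondaries_per_primary * len(primary))
    return primary + rng.sample(secondary, n_secondary)

# A population of 1000 cases with a 5% primary outcome rate...
population = [(i, 1 if i < 50 else 0) for i in range(1000)]
sample = separate_sample(population)
# ...yields a balanced 50/50 modeling sample of 100 cases.
```

The modeling sample is much smaller than the population while retaining every rare-outcome case, which is exactly the tradeoff discussed next.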
The Modeling Sample

+ Similar predictive power with smaller case count
- Must adjust assessment statistics and graphics
- Must adjust prediction estimates for bias

The advantage of separate sampling is that you are able to obtain (on average) a model of similar predictive power with a smaller overall case count. This is in concordance with the idea that the amount of information in a data set with a categorical outcome is determined not by the total number of cases in the data set itself, but instead by the number of cases in the rarest outcome category. (For binary target data sets, this is usually the primary outcome.) (Harrell 2006)

This advantage might seem of minimal importance in the age of extremely fast computers. (A model might fit 10 times faster with a reduced data set, but a 10-second model fit versus a 1-second model fit is probably not relevant.) However, the model-fitting process occurs only after the completion of a long, tedious, and error-prone data preparation process. Smaller sample sizes for data preparation are usually welcome.

While it reduces analysis time, separate sampling also introduces some analysis complications.

• Most model fit statistics (especially those related to prediction decisions) and most of the assessment plots are closely tied to the outcome proportions in the training samples. If the outcome proportions in the training and validation samples do not match the outcome proportions in the scoring population, model performance can be greatly misestimated.

• If the outcome proportions in the training sample and scoring population do not match, model prediction estimates are biased.

Fortunately, SAS Enterprise Miner can adjust assessments and prediction estimates to match the scoring population if you specify prior probabilities, the scoring population outcome proportions. This is precisely what was done using the Decisions option in the demonstration.
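The rescaling behind this prior adjustment is a standard Bayes correction; the function below is an illustrative sketch (the formula is standard, but the function name and defaults are this document's example values, a 50/50 sample and a 5% population):

```python
def adjust_posterior(p, sample_prior=0.5, population_prior=0.05):
    """Rescale a primary-outcome probability estimated on a separately
    sampled training set so that it reflects the scoring population."""
    w1 = population_prior / sample_prior              # primary reweighting
    w0 = (1 - population_prior) / (1 - sample_prior)  # secondary reweighting
    return p * w1 / (p * w1 + (1 - p) * w0)

# A 0.5 estimate on the balanced sample maps back to the 5% population rate.
adjusted = adjust_posterior(0.5)
```

Note that a baseline case (predicted at the sample proportion) maps exactly to the population proportion, and probabilities of 0 and 1 are left unchanged.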
Adjusting for Separate Sampling (continued)
The consequences of incorporating prior probabilities in the analysis can be viewed in the Model Comparison node.

1. Select Results... in the Run Status dialog box. The Results - Model Comparison window opens, showing the ROC Chart, the Score Rankings (cumulative lift) chart, the Fit Statistics table (Valid Average Profit for the Neural, Reg, Tree3, and Tree models), and the Output window.
2. Maximize the Score Rankings Overlay window and focus on the Data Role = VALIDATE chart.
Before you adjusted for priors, none of the models had a cumulative lift over 1.5. Now most of the models have a cumulative lift in excess of 1.5 (at least for some deciles).

The exception is the Decision Tree model. Its lift is exactly 1 for all percentiles. The reason for this is the inclusion of disparate prior probabilities for a model tuned on misclassification. On average, cases have a primary outcome probability of 0.05. A low misclassification rate model might be built by simply calling everyone a non-responder.
3. Select Chart ⇨ Cumulative % Response.
The cumulative percent responses are now adjusted for separate sampling. You now have an accurate representation of response proportions by selection fraction. With this plot, a business analyst could rate the relative performance of each model for different selection fractions. The best selection fraction is usually determined by financial considerations. For example, the charity might have a budget that allows contact with 40% of the available population. Thus, it intuitively makes sense to contact the 40% of the population with the highest chances of responding (as predicted by one of the available models). Another financial consideration, however, is also important: the profit (and loss) associated with a response (and non-response) to a solicitation. To correctly rank the value of a case, response probability estimates must be combined with profit and loss information.
Creating a Profit Matrix
To determine reasonable values for profit and loss information, consider the outcomes and the actions you would take given knowledge of these outcomes. In this case, there are two outcomes (response and non-response) and two corresponding actions (solicit and ignore). Knowing that someone is a responder, you would naturally want to solicit that person; knowing that someone is a non-responder, you would naturally want to ignore that person. (Organizations really do not want to send junk mail.) On the other hand, knowledge of an individual's actual behavior is rarely perfect, so mistakes are made, for example, soliciting non-responders (false positives) and ignoring responders (false negatives). Taken together, there are four outcome/action combinations:

                 Solicit    Ignore
Response
Non-response

Each of these outcome/action combinations has a profit consequence (where a loss is called, somewhat euphemistically, a negative profit). Some of the profit consequences are obvious. For example, if you do not solicit, you do not make any profit. So for this analysis, the second column can be immediately set to zero.

                 Solicit    Ignore
Response                    0
Non-response                0
From the description of the analysis problem, you find that it costs about $0.68 to send a solicitation. Also, the variable TargetD gives the amount of response (when a donation occurs). The completed profit consequence matrix can be written as shown.

                 Solicit           Ignore
Response         TargetD - 0.68    0
Non-response     -0.68             0
From a statistical perspective, TargetD is a random variable. Individuals who are identical on every input measurement might give different donation amounts. To simplify the analysis, a summary statistic for TargetD is plugged into the profit consequence matrix. In general, this value can vary on a case-by-case basis. However, for this course, the overall average of TargetD is used. You can obtain the average of TargetD using the StatExplore node.

1. Select the Explore tab.

2. Drag the StatExplore tool into the diagram workspace.
3. Connect the StatExplore node to the Data Partition node.

4. Select Variables... from the Properties panel of the StatExplore node. The Variables - Stat window opens.

5. Select Use ⇨ Yes for the TargetD variable.

6. Select OK to close the Variables dialog box.
7. Run the StatExplore node and view the results. The Results - StatExplore window opens.
Scrolling the Output window shows the average of TargetD as $15.82.

8. Close the Results - Stat window.

9. Select the PVA97NK node.

10. Select the Decisions... property. The Decision Processing - PVA97NK dialog box opens.
11. Select the Decisions tab.
[Screen capture: the Decisions tab. "Do you want to use the decisions?" is set to Yes, and the default decision names DECISION1 and DECISION2 appear in the decision table.]
12. Type solicit (in place of the word DECISION1) in the first row of the Decision Name column.

13. Type ignore (in place of the word DECISION2) in the second row of the Decision Name column.
[Screen capture: the Decisions tab with the decision names changed to solicit and ignore.]
14. Select the Decision Weights tab.
[Screen capture: the Decision Weights tab. The decision function is set to Maximize, with default weights of 1.0 (solicit) and 0.0 (ignore) for level 1, and 0.0 (solicit) and 1.0 (ignore) for level 0.]
The completed profit consequence matrix for this example is shown below.

                 Solicit    Ignore
Response           15.14         0
Non-response       -0.68         0
15. Type the profit values into the corresponding cells of the profit weight matrix.
[Screen capture: the Decision Weights tab with Maximize selected. Level 1 has weights 15.14 (solicit) and 0.0 (ignore); level 0 has weights -0.68 (solicit) and 0.0 (ignore).]
16. Select OK to close the Decision Processing - PVA97NK dialog box.

17. Run the Model Comparison node. It will take some time to run the analysis.
6.4 Profit Matrices
[Slide: Profit Matrices. The profit distribution for the solicit decision and the primary outcome is replaced by the constant 15.14.]
Statistical decision theory is an aid to making optimal decisions from predictive models. Using decision theory, each target outcome is matched to a particular decision or course of action. A profit value is assigned to both correct and incorrect combinations of outcome and decision. The profit value can be random and vary between cases. A vast simplification, and common practice in prediction, is to assume that the profit associated with each case, outcome, and decision is a constant. This is the default behavior of SAS Enterprise Miner. This simplifying assumption can lead to biased model assessment and incorrect prediction decisions. For the demonstration, the overall donation average minus the solicitation cost is used as the (constant) profit associated with the primary outcome and the solicit decision.
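The constant profit just described is plain arithmetic on two figures from the demonstration: the $15.82 overall donation average (from the StatExplore output) and the $0.68 solicitation cost. The Python sketch below only illustrates the calculation; it is not part of the SAS Enterprise Miner flow.

```python
# Sketch: the constant profit assigned to the (primary outcome, solicit)
# cell. Figures from the demonstration: $15.82 average donation and a
# $0.68 per-piece solicitation cost.
avg_donation = 15.82
solicit_cost = 0.68

profit_primary_solicit = round(avg_donation - solicit_cost, 2)
print(profit_primary_solicit)  # 15.14
```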
[Slide: Profit Matrices. The profit distribution for the solicit decision and the secondary outcome is replaced by the constant -0.68.]
Similarly, the negative of the solicitation cost is taken as the profit associated with the secondary outcome and the solicit decision.
[Slide: Decision Expected Profits. With the profit matrix completed, compute the expected profit of each decision and choose the larger:

Expected Profit Solicit = 15.14 x p1 - 0.68 x p0
Expected Profit Ignore = 0

where p1 and p0 are the estimated probabilities of the primary and secondary outcomes.]
Making the reasonable assumption that there is no profit associated with the ignore decision, you can complete the profit matrix as shown. With the completed profit consequence matrix, you can calculate the expected profit associated with each decision. This is equal to the sum of the outcome/decision profits multiplied by the outcome probabilities. The best decision for a case is the one that maximizes the expected profit for that case.
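The expected-profit rule described above can be sketched as follows. The matrix values are the demonstration's; the function and the example probability (0.05) are illustrative and not SAS Enterprise Miner code.

```python
# Sketch: expected profit of each decision as the probability-weighted
# sum of the outcome/decision profits (matrix from the demonstration).
PROFIT = {"primary":   {"solicit": 15.14, "ignore": 0.0},
          "secondary": {"solicit": -0.68, "ignore": 0.0}}

def expected_profits(p_primary):
    """Expected profit of each decision for a case with estimated
    primary-outcome probability p_primary."""
    probs = {"primary": p_primary, "secondary": 1.0 - p_primary}
    return {d: sum(probs[o] * PROFIT[o][d] for o in probs)
            for d in ("solicit", "ignore")}

ep = expected_profits(0.05)   # solicit: about 0.111, ignore: 0.0
best = max(ep, key=ep.get)    # choose the larger: "solicit"
```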
[Slide: Decision Threshold. For this profit matrix, the decision rule reduces to a threshold on the estimated primary-outcome probability:

p1 > 0.68 / 15.82 => Solicit
p1 < 0.68 / 15.82 => Ignore]
When the elements of the profit consequence matrix are constants, prediction decisions depend solely on the estimated probability of response and a constant decision threshold, as shown above.
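The threshold follows from setting the two expected profits equal: solicit when 15.14 x p exceeds 0.68 x (1 - p), that is, when p > 0.68/15.82. A sketch (illustrative only):

```python
# Sketch: deriving the decision threshold from the constant profit matrix.
# Solicit when 15.14*p - 0.68*(1 - p) > 0, i.e. when p > 0.68 / 15.82.
profit_primary, profit_secondary = 15.14, -0.68  # solicit column of the matrix

threshold = -profit_secondary / (profit_primary - profit_secondary)  # about 0.043

def decide(p_primary):
    return "solicit" if p_primary > threshold else "ignore"
```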
[Slide: Average Profit]

Average profit = (15.14 x Nps - 0.68 x Nss) / N

Nps = # solicited primary outcome cases
Nss = # solicited secondary outcome cases
N = total number of assessment cases
A new fit statistic, average profit, can be used to summarize model performance. For the profit matrix shown, average profit is computed by multiplying the number of cases by the corresponding profit in each outcome/decision combination, adding across all outcome/decision combinations, and dividing by the total number of cases in the assessment data.
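The average profit formula can be sketched directly. Only the formula mirrors the slide; the counts in the example call are hypothetical.

```python
# Sketch: average profit for the demonstration's profit matrix
# (all ignore-cell profits are 0, so only solicited cases contribute).
def average_profit(n_primary_solicited, n_secondary_solicited, n_total,
                   profit_primary=15.14, profit_secondary=-0.68):
    """(15.14 * Nps - 0.68 * Nss) / N"""
    return (profit_primary * n_primary_solicited
            + profit_secondary * n_secondary_solicited) / n_total

# Hypothetical counts: 100 solicited responders, 900 solicited
# non-responders, 2000 assessment cases in total.
avg = average_profit(100, 900, 2000)  # (1514 - 612) / 2000 = 0.451
```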
Evaluating Model Profit

The consequences of incorporating a profit matrix in the analysis can be viewed in the Model Comparison node. Follow these steps to evaluate a model with average profit:

1. Select Results... in the Run Status dialog box. The Results - Node: Model Comparison window opens.
[Screen capture: the Results - Node: Model Comparison window, showing the ROC charts, score rankings, fit statistics, and training output for the Reg, Neural, Tree, and Tree3 models.]
2. Maximize the Output window and go to line 30.

Fit Statistics
Model Selection based on Valid: Average Profit (_VAPROF_)
Selected   Model    Model               Valid:     Train:     Train:      Valid:     Valid:
Model      Node     Description         Average    Average    Misclass.   Average    Misclass.
                                        Profit     Sq. Error  Rate        Sq. Error  Rate
Y          Reg      Regression          0.17130    0.24202    0.49990     0.24079    0.50010
           Neural   Neural Network      0.16975    0.24062    0.49990     0.23988    0.50010
           Tree     Decision Tree       0.15667    0.44677    0.49990     0.44689    0.50010
           Tree3    Probability Tree    0.14293    0.44700    0.49990     0.44720    0.50010
With a profit matrix defined, model selection in the Model Comparison node is performed on validation profit. Based on this criterion, the selected model is the Regression model. The Neural Network is a close second.
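The selection rule itself is simply "largest validation average profit." The sketch below uses the validation average profit figures as read from the Fit Statistics output above; the dictionary form is illustrative.

```python
# Sketch: model selection on validation average profit (_VAPROF_),
# using the figures reported in the Fit Statistics output above.
valid_avg_profit = {"Regression": 0.17130, "Neural Network": 0.16975,
                    "Decision Tree": 0.15667, "Probability Tree": 0.14293}

selected = max(valid_avg_profit, key=valid_avg_profit.get)
print(selected)  # Regression
```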
Viewing Additional Assessments
This demonstration shows several other assessments of possible interest.

1. Maximize the Score Rankings Overlay window.

2. Select Select Chart => Total Expected Profit.

[Screen capture: the Total Expected Profit chart, Data Role = VALIDATE, plotted against Percentile.]

The Total Expected Profit plot shows the amount of value in each demi-decile (5%) block of data. It turns out that all models select approximately 60% of the cases (although cases in one model's 60% might not be in another model's 60%).
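The quantity plotted can be sketched as follows: rank cases by predicted probability, accumulate per-case expected profit, and report the running total at each 5% depth. The function and example probabilities are illustrative, not the node's internal code.

```python
# Sketch: cumulative expected profit by demi-decile (5% blocks), using
# the demonstration's profit matrix. `probs` stands in for a model's
# scored probabilities.
def cumulative_expected_profit(probs, profit_p=15.14, profit_s=-0.68):
    ranked = sorted(probs, reverse=True)
    n = len(ranked)
    block = max(1, n // 20)              # cases per 5% block
    out, running = [], 0.0
    for i, p in enumerate(ranked, start=1):
        running += p * profit_p + (1 - p) * profit_s
        if i % block == 0:
            out.append((round(100 * i / n), round(running, 2)))
    return out  # list of (depth %, cumulative expected profit)
```

Profit stops accumulating (and starts declining) once the expected profit of the next block turns negative, which is why the models in the demonstration effectively select about 60% of the cases.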
3. Select Select Chart => Cumulative % Captured Response.

4. Double-click the Data Role = VALIDATE chart.

[Screen capture: the Score Rankings Overlay window, showing Cumulative % Captured Response versus Percentile for each model.]

This plot shows sensitivity (also known as Cumulative % Captured Response) versus selection fraction (decile). By selecting 60% of the data, for example, you "capture" about 75% of the primary-outcome cases.
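Sensitivity at a given depth can be sketched as follows; the scores and targets below are made-up illustrative values, chosen so that a 60% depth captures 75% of the responders.

```python
# Sketch: cumulative % captured response (sensitivity) at a given depth.
def captured_response(scores, targets, depth):
    """Fraction of all primary-outcome cases found in the top `depth`
    fraction of cases, ranked by model score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    k = int(round(depth * len(scores)))
    return sum(targets[i] for i in order[:k]) / sum(targets)

scores  = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
targets = [1,   0,   1,   0,   0,   1,   0,   1,   0,   0]
sens = captured_response(scores, targets, 0.6)  # 3 of 4 responders -> 0.75
```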
5. Close the Results window.
Optimizing with Profit (Self-Study)
The models fit in the previous demonstrations were optimized to minimize average error. Because it is the most general optimization criterion, the best model selected by this criterion can rank and decide cases. If the ultimate goal of your model is to create prediction decisions, it might make sense to optimize on that criterion. After you define a profit matrix, it is possible to optimize your model strictly on profit. Instead of seeking the model with the best prediction estimates, you find the model with the best prediction decisions (those that maximize expected profit).
The default model selection method in SAS Enterprise Miner is validation profit optimization, so these settings essentially restore the node defaults. Finding a meaningful profit matrix for most modeling scenarios, however, is difficult. Therefore, these notes recommend overriding the defaults and creating models with a general selection criterion such as validation average squared error.
Decision Tree Profit Optimization

One of the two tree models in the diagram is set up this way:

1. Select the original Decision Tree node.
2. Examine the Decision Tree node's Subtree property.

[Screen capture: Subtree properties. Method: Assessment, Number of Leaves: 1, Assessment Measure: Decision, Assessment Fraction: 0.25.]

The Decision Tree node is pruned using the default assessment measure, Decision. The goal of this tree is to make the best decisions rather than the best estimates.

3. Examine the Probability Tree node's Subtree property.
[Screen capture: Subtree properties. Method: Assessment, Assessment Measure: Average Square Error, Assessment Fraction: 0.25.]

The Probability Tree node is pruned using average squared error. The goal of this tree is to provide the best prediction estimates rather than the best decisions.
Regression Profit Optimization

Use the following settings to optimize a regression model using profit:

1. Select the Regression node.
2. Select Selection Criterion => Validation Profit/Loss.

[Screen capture: Model Selection properties. Selection Model: Stepwise, Selection Criterion: Validation Profit/Loss, Use Selection Defaults: No.]

3. Repeat steps 1 and 2 for the Polynomial Regression node.
Neural Network Profit Optimization

1. Select the Neural Network node.

2. Select Model Selection Criterion => Profit/Loss.

[Screen capture: Neural Network properties. Model Selection Criterion: Profit/Loss, Use Current Estimates: No.]

The AutoNeural node does not support profit matrices.

The next step refits the models using profit optimization. Be aware that the demonstrations in Chapters 8 and 9 assume that the models are fit using average squared error optimization.

3. Run the Model Comparison node and view the results.
4. Maximize the Output window and view lines 20-35.

Fit Statistics
Model Selection based on Valid: Average Profit (_VAPROF_)
Selected   Model    Model               Valid:     Train:     Train:      Valid:     Valid:
Model      Node     Description         Average    Average    Misclass.   Average    Misclass.
                                        Profit     Sq. Error  Rate        Sq. Error  Rate
Y          Neural   Neural Network      0.17660    0.23645    0.49990     0.24096    0.50010
           Reg      Regression          0.15934    0.23711    0.49990     0.24419    0.50010
           Tree     Decision Tree       0.15667    0.44677    0.49990     0.44639    0.50010
           Tree3    Probability Tree    0.14293    0.44700    0.49990     0.44720    0.50010
The reported validation profits are now slightly higher, and the relative ordering of the models changed.
Exercises

1. Assessing Models

a. Connect all models in the ORGANICS diagram to a Model Comparison node.
b. Run the Model Comparison node and view the results. Which model has the best ROC curve? What is the corresponding ROC Index?

c. What is the lift of each model at a selection depth of 40%?
6.5 Chapter Summary

The Model Comparison node enables you to compare statistics and statistical graphics summarizing model performance. To make assessments applicable to the scoring population, you must account for differences in response proportions between the training, validation, and scoring data. While you can choose to select a model based strictly on statistical measures, you can also consider profit as a measure of model performance. For each outcome, an appropriate decision must be defined. A profit matrix is constructed, characterizing the profit associated with each outcome and decision combination. The decision tools in SAS Enterprise Miner support only constant profit matrices.
Assessment Tools Review

- Compare model summary statistics and statistical graphics.
- Create decision data; add prior probabilities and profit matrices.
- Tune models with average squared error or an appropriate profit matrix.
- Obtain means and other statistics on data source variables.
6.6 Solutions

Solutions to Exercises

1. Assessing Models

a. Connect all models in the ORGANICS diagram to a Model Comparison node.
[Screen capture: the ORGANICS diagram with all models connected to a Model Comparison node.]
b. Run the Model Comparison node and view the results. Which model has the best ROC curve?

The Decision Tree seems to have the best ROC curve.

What is the corresponding ROC Index?

The ROC Index values are found in the Fit Statistics window.
Model Node   Model Description    Valid: ROC Index
Reg          Regression           0.805062
Reg2         Regression (2)       0.82007
Tree         Decision Tree        0.823782
Neural       Neural Network       0.823815
Tree2        Decision Tree (2)    0.824339
c. What is the lift of each model at a selection depth of 40%?

1) Select the Validation Score Rankings chart.
[Screen capture: the Score Rankings chart, Data Role = VALIDATE, showing lift versus Percentile for each model.]

2) Right-click the chart and select Data Options... from the Option menu.

3) Select the Where tab.
4) Add the text & (DECILE=40) to the selection string.

[Screen capture: the Where tab, with & (DECILE=40) appended to the selection string for the VALIDATE data role and the TargetBuy target.]

5) Select Apply.

6) Select OK. The Score Rankings plot shows the lift at the 40th percentile. You can read the values for each model from the plot.
[Screen capture: the Score Rankings plot, Data Role = VALIDATE, restricted to the 40th percentile, showing the lift for each model.]
Chapter 7 Model Implementation

7.1 Introduction                                              7-3
7.2 Internally Scored Data Sets                               7-4
    Demonstration: Creating a Score Data Source               7-5
    Demonstration: Scoring with the Score Tool                7-6
    Demonstration: Exporting a Scored Table (Self-Study)      7-9
7.3 Score Code Modules                                        7-15
    Demonstration: Creating a SAS Score Code Module           7-16
    Demonstration: Creating Other Score Code Modules          7-22
    Exercises                                                 7-24
7.4 Chapter Summary                                           7-25
7.5 Solutions to Exercises                                    7-26
7.1 Introduction
[Slide: Model Implementation]
Predictions might be added to a data source inside and outside of SAS Enterprise Miner.
After you train and compare predictive models, one model is selected to represent the association between the inputs and the target. After it is selected, this model must be put to use. The contribution of SAS Enterprise Miner to model implementation is a scoring recipe capable of adding predictions to any data set structured in a manner similar to the training data. SAS Enterprise Miner offers two options for model implementation.

• Internally scored data sets are created by combining the Score tool with a data set identified for scoring.
A copy of the scored data set is stored on the SAS Foundation server assigned to your project. If the data set to be scored is very large, you should consider scoring the data outside the SAS Enterprise Miner environment and use the second deployment option.
• Scoring code modules are used to generate predicted target values in environments outside of SAS Enterprise Miner. SAS Enterprise Miner can create scoring code in the SAS, C, and Java programming languages. The SAS language code can be embedded directly into a SAS Foundation application to generate predictions. The C and Java language code must be compiled. The C code should compile with any C compiler that supports the ISO/IEC 9899 International Standard for Programming Languages - C.
7.2 Internally Scored Data Sets

To create an internally scored data set, you need to define a Score data source, integrate the Score data source and Score tool into your process flow diagram, and (optionally) relocate the scored data set to a library of your choice.
Creating a Score Data Source

Creating a score data source is similar to creating a modeling data source.

1. Right-click Data Sources in the Project panel and select Create Data Source. The Data Source Wizard opens.
2. Select the table ScorePVA97NK in the AAEM library.

3. Select Next > until you reach Data Source Wizard - Step 6 of 7 Data Source Attributes.

4. Select Score as the role.

[Screen capture: Data Source Wizard - Step 6 of 7, with the Name SCOREPVA97NK and the Role set to Score.]

5. Select Next > to move to Step 7.

6. Select Finish. You now have a data source that is ready for scoring.
Scoring with the Score Tool
The Score tool attaches model predictions from a selected model to a score data set.

1. Select the Assess tab.

2. Drag a Score tool into the diagram workspace.

3. Connect the Model Comparison node to the Score node as shown.

[Screen capture: the Predictive Analysis diagram with the Model Comparison node connected to the Score node.]
The Score node creates predictions using the model deemed best by the Model Comparison node (in this case, the Regression model).

If you want to create predictions using a specific model, either delete the connections to the Model Comparison node from the models that you do not want to use, or connect the Score node directly to the desired model and continue as described below.

4. Drag the ScorePVA97NK data source into the diagram workspace.

5. Connect the ScorePVA97NK data source node to the Score node as shown.
6. Run the Score node and view the results. The Results - Score window opens.

[Screen capture: the Results - Score window, showing the table of output variables and the training output.]

7. Maximize the Output window. The item of greatest interest is a table of new variables added to the Score data set.
8. Go to line 150.

Score Output Variables

Variable Name          Function         Creator   Label
D_TARGETB              DECISION         Reg       Decision: TargetB
EM_CLASSIFICATION      CLASSIFICATION   Score     Prediction for TargetB
EM_EVENTPROBABILITY    PREDICT          Score     Probability for level 1 of TargetB
EM_PROBABILITY         PREDICT          Score     Probability of Classification
EM_SEGMENT             TRANSFORM        Score
EP_TARGETB             ASSESS           Reg       Expected Profit: TargetB
I_TargetB              CLASSIFICATION   Reg       Into: TargetB
LOG_GiftAvgAll         TRANSFORM        Trans
LOG_GiftCnt36          TRANSFORM        Trans
P_TargetB0             PREDICT          Reg       Predicted: TargetB=0
P_TargetB1             PREDICT          Reg       Predicted: TargetB=1
U_TargetB              CLASSIFICATION   Reg       Unnormalized Into: TargetB
_WARN_                 ASSESS           Reg       Warnings
b_TargetB              TRANSFORM        MdlComp

9. Close the Results window.
Exporting a Scored Table (Self-Study)
1. Select the ellipsis button for the Exported Data property in the Score node Properties panel. The Exported Data - Score dialog box opens.

[Screen capture: the Exported Data - Score dialog box lists the Train, Validate, Test, and Score ports with the tables EMWS.Score_TRAIN, EMWS.Score_VALIDATE, EMWS.Score_TEST, and EMWS.Score_SCORE. Data exists for every port except Test.]
2. Select the Score table.

3. Select Explore... The Explore window opens with a 20,000-row sample of the EMWS.Score_SCORE data set (the internal name in SAS Enterprise Miner for the scored ScorePVA97NK data).
[Screen capture: the Explore window showing a 20,000-row sample of the scored data set.]
This scored table is situated on the SAS Foundation server in the current project directory. You might want to place a copy of this table in another location. The easiest way to do this is with a SAS Code node.

4. Close the Explore window.
5. Select the Utility tab.

6. Drag a SAS Code node into the diagram workspace.

7. Connect the Score node to the SAS Code node as shown.

8. Select the ellipsis button for the Code Editor property in the SAS Code node's Properties panel.
[Screen capture: the SAS Code node Properties panel, with the Code Editor property selected.]
The Training Code window opens.

[Screen capture: the Training Code window, with macro and macro variable lists on the left and the Training Code editor on the right.]

The Training Code window enables you to add new functionality to SAS Enterprise Miner by accessing the scripting language of SAS.
9. Type the following program in the Training Code window:

data AAEM61.ExportedScoreData;
   set &EM_IMPORT_SCORE;
run;

This program creates a new table named ExportedScoreData in the AAEM61 library.

10. Close the Training Code window and save the changes.

11. Run the SAS Code node. You do not need to view the results.
12. Select View => Explorer... from the SAS Enterprise Miner menu bar. The Explorer window opens.

[Screen capture: the Explorer window listing the available SAS libraries, including Aaem61.]
13. Select the Aaem61 library.

[Screen capture: the Explorer window showing the contents of the Aaem61 library, including the Exportedscoredata table.]

The Aaem61 library contains the Exportedscoredata table created by your SAS Code node. You can modify the SAS Code node to place the scored data in any library that is visible to the SAS Foundation server.

14. Close the Explorer window.
7.3 Score Code Modules

Model deployment usually occurs outside of SAS Enterprise Miner and sometimes even outside of SAS. To accommodate this need, SAS Enterprise Miner is designed to provide score code modules to create predictions from properly prepared tables. In addition to the prediction code, the score code modules include all the transformations that are found in your modeling process flow. You can save the code as a SAS, C, or Java program.
Creating a SAS Score Code Module

The SAS score code module is opened by default when you open the Score node results.

1. Open the Score node Results window.

2. Maximize the SAS Code window.
The SAS Code window shows the SAS DATA step code that is necessary to append predictions from the selected model (in this case, the Regression model) to a score data set. Each node in the process flow can contribute to the DATA step code. The following list describes some highlights of the generated SAS code:

• Go to line 12. This code removes the spurious zero from the median income input.

   * TOOL: Extension Class;
   * TYPE: MODIFY;
   * NODE: Repl;
   * Variable: DemMedIncome;
   label REP_DemMedIncome = "Replacement: DemMedIncome";
   REP_DemMedIncome = DemMedIncome;
   if DemMedIncome ne . and DemMedIncome < 1 then REP_DemMedIncome = .;
• Go to line 36. This code takes the log transformation of selected inputs.

   * TRANSFORM: GiftAvg36 , log(GiftAvg36 + 1);
   label LOG_GiftAvg36 = 'Transformed: Gift Amount Average 36 Months';
   if GiftAvg36 + 1 > 0 then LOG_GiftAvg36 = log(GiftAvg36 + 1);
   else LOG_GiftAvg36 = .;

   * TRANSFORM: GiftAvgAll , log(GiftAvgAll + 1);
   label LOG_GiftAvgAll = 'Transformed: Gift Amount Average All Months';
   if GiftAvgAll + 1 > 0 then LOG_GiftAvgAll = log(GiftAvgAll + 1);
   else LOG_GiftAvgAll = .;
7.3 Score Code Modules • Go to line 84. This code replaces the levels of the StatusCat96NK input. *
*.
i * TOOL: Extension C l a s s ; * TYPE: MODIFY; * NODE: Repl2;
* Defining New V a r i a b l e s ; * •
i Length REP_StatusCat96NK $5; Label REP_StatusCat96NK = "Replace:Status Category 96NK"; REP_StatusCat96NK= StatusCat96NK; * Replace S p e c i f i c C l a s s Levels ; * • i length _UFormat200 $200; drop _UF0RMAT200; _UF0RMAT200 = " »; * .
* Variable: StatusCat96NK; * i• i f strip(StatusCat96NK) = "A" REP_StatusCat96NK="A"; i f strip(StatusCat96NK) = "S" REP_StatusCat96NK="A"; i f strip(StatusCat96NK) = *F" REP_StatusCat96NK="N"; i f strip(StatusCat96NK) = "N" REP_StatUSCat96NK="N"; i f strip(StatusCat96NK) = "E" REP_StatusCat96NK="L"; i f strip(StatusCat96NK) = "L" REP_StatusCat96NK="L";
then then then then then then
7-17
• Go to line 118. This code replaces missing values and creates missing value indicators.

   * TOOL: Imputation;
   * TYPE: MODIFY;
   * NODE: Impt;
   * MEAN-MEDIAN-MIDRANGE AND ROBUST ESTIMATES;
   label IMP_DemAge = 'Imputed: Age';
   IMP_DemAge = DemAge;
   if DemAge = . then IMP_DemAge = 59.262912087912;
   label IMP_LOG_GiftAvgCard36 = 'Imputed: Transformed: Gift Amount Average Card 36 Months';
   IMP_LOG_GiftAvgCard36 = LOG_GiftAvgCard36;
   if LOG_GiftAvgCard36 = . then IMP_LOG_GiftAvgCard36 = 2.5855317177381;
   label IMP_REP_DemMedIncome = 'Imputed: Replacement: DemMedIncome';
   IMP_REP_DemMedIncome = REP_DemMedIncome;
   if REP_DemMedIncome = . then IMP_REP_DemMedIncome = 53570.8504928806;
   * INDICATOR VARIABLES;
   label M_DemAge = "Imputation Indicator for DemAge";
   if DemAge = . then M_DemAge = 1;
   else M_DemAge = 0;
   label M_LOG_GiftAvgCard36 = "Imputation Indicator for LOG_GiftAvgCard36";
   if LOG_GiftAvgCard36 = . then M_LOG_GiftAvgCard36 = 1;
   else M_LOG_GiftAvgCard36 = 0;
   label M_REP_DemMedIncome = "Imputation Indicator for REP_DemMedIncome";
   if REP_DemMedIncome = . then M_REP_DemMedIncome = 1;
   else M_REP_DemMedIncome = 0;
• Go to line 160. This code comes from the Regression node. It is this code that actually adds the predictions to a Score data set.

   * TOOL: Regression;
   * TYPE: MODEL;
   * NODE: Reg;
   *** begin scoring code for regression;
   length _WARN_ $4;
   label _WARN_ = 'Warnings';
   length I_TargetB $12;
   label I_TargetB = 'Into: TargetB';
   *** Target Values;
   array REGDRF [2] $12 _temporary_ ('1' '0');
   label U_TargetB = 'Unnormalized Into: TargetB';
   *** Unnormalized target values;
   array REGDRD [2] _temporary_ (1 0);
   drop _DM_BAD;
   _DM_BAD = 0;
   *** Check DemMedHomeValue for missing values;
   if missing( DemMedHomeValue ) then do;
      substr(_warn_, 1, 1) = 'M';
      _DM_BAD = 1;
   end;
   *** Check GiftTimeLast for missing values;
   if missing( GiftTimeLast ) then do;
      substr(_warn_, 1, 1) = 'M';
      _DM_BAD = 1;
   end;
7-20
Chapter 7 Model Implementation
• Go to line 290. This block of code comes from the Model Comparison node. It adds demi-decile bin numbers to the scored output. For example, bin 1 corresponds to the top 5% of the data as scored by the Regression model, bin 2 corresponds to the next 5%, and so on.

* TOOL: Model Compare Class;
* TYPE: ASSESS;
* NODE: MdlComp;

if (P_TargetB1 ge 0.09198928166881) then do;
   b_TargetB = 1;
end;
else if (P_TargetB1 ge 0.08014055773524) then do;
   b_TargetB = 2;
end;
else if (P_TargetB1 ge 0.07027014765811) then do;
   b_TargetB = 3;
end;
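The cutoffs in this code are score quantiles estimated from the training data. Given only a column of scores, equivalent demi-decile bins can be derived directly; here is a Python sketch (illustrative, not code generated by SAS Enterprise Miner):

```python
def demi_decile_bins(scores):
    """Assign bins 1..20 so that bin 1 holds the top 5% of scores,
    bin 2 the next 5%, and so on (ties split by sort order)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: -scores[i])  # indices, best score first
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * 20 // n + 1                    # rank 0..n-1 -> bin 1..20
    return bins

scores = [i / 100 for i in range(100)]  # 100 evenly spaced scores
bins = demi_decile_bins(scores)
```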
• Go to line 380. This block of code comes from the Score node. It adds the following standardized variables to the scored data set:

EM_CLASSIFICATION     Prediction for TargetB
EM_DECISION           Recommended Decision for TargetB
EM_EVENTPROBABILITY   Probability for Level 1 of TargetB
EM_PROBABILITY        Probability of Classification
EM_PROFIT             Expected Profit for TargetB
EM_SEGMENT            Segment
* TOOL: Score Node;
* TYPE: ASSESS;
* NODE: Score;

*** Score: Creating Fixed Names;
LABEL EM_SEGMENT = 'Segment';
EM_SEGMENT = b_TargetB;
LABEL EM_EVENTPROBABILITY = 'Probability for level 1 of TargetB';
EM_EVENTPROBABILITY = P_TargetB1;
LABEL EM_PROBABILITY = 'Probability of Classification';
EM_PROBABILITY = max( P_TargetB1, P_TargetB0 );
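The fixed-name logic is simple enough to restate outside SAS. A Python sketch follows; the column names mirror the listing, but the classification rule shown is the usual argmax convention, which is an assumption here rather than a copy of the generated code:

```python
def fixed_names(case):
    """Derive standardized EM_ columns from the raw scored columns,
    following the Score node's fixed-name pattern."""
    case["EM_SEGMENT"] = case["b_TargetB"]
    case["EM_EVENTPROBABILITY"] = case["P_TargetB1"]
    # probability of the chosen class = the larger class probability
    case["EM_PROBABILITY"] = max(case["P_TargetB1"], case["P_TargetB0"])
    # assumed argmax rule for the predicted class
    case["EM_CLASSIFICATION"] = "1" if case["P_TargetB1"] >= case["P_TargetB0"] else "0"
    return case

case = fixed_names({"b_TargetB": 3, "P_TargetB1": 0.2, "P_TargetB0": 0.8})
```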
To use this code, you must embed it in a DATA step. The easiest way to do this is to save it as a SAS code file and include it in your DATA step with a %INCLUDE statement.
3. Select File ⇒ Save As... to save this code to a location of your choice.
Creating Other Score Code Modules
To access scoring code for languages other than SAS, use the following procedure:

1. Select View ⇒ Scoring ⇒ C Score Code. The C Score Code window opens, initially displaying the scoring function metadata as XML (the producer, the target list, and the input variables such as DEMAGE and DEMMEDHOMEVALUE).
There are four parts to the C Score Code window. The actual C score code part is accessible from the menu at the bottom of the C Score Code window. It begins with generated code similar to the following:

/* start generated code */
#include <string.h>
#include <memory.h>
#include <ctype.h>
#include <math.h>
#include "cscore.h"
#include "csparm.h"
#define csDEMAGE          indata[0].data.fnum
#define csDEMMEDHOMEVALUE indata[1].data.fnum
#define csDEMMEDINCOME    indata[2].data.fnum
2. Select View ⇒ Scoring ⇒ Java Score Code. The Java Score Code window opens.
As with the C score code, the Java Score Code window initially displays the scoring function metadata as XML (the producer, the target list, and the input variables).

There are five parts to the Java Score Code window. The actual Java score code part is accessible from the menu at the bottom of the Java Score Code window. It begins with generated code similar to the following:

package eminer.user.Score;
import java.util.Map;
import java.util.HashMap;
import com.sas.ds.*;
import com.sas.analytics.eminer.jscore.util.*;

/**
 * The Score class encapsulates data step scoring code generated
 * by the SAS Enterprise Miner Java Scoring Component.
 *
 * @since   1.0
 * @version J-Score 1.0
 * @author  SAS Enterprise Miner Java Scoring Component
 * @see     com.sas.analytics.eminer.jscore.util.Jscore
 */
public class Score ...
Exercises

1. Scoring Organics Data

   a. Create a Score data source for the ScoreOrganics data.
   b. Score the ScoreOrganics data using the model selected with the Model Comparison node.
7.4 Chapter Summary
The Score node is used to score new data inside SAS Enterprise Miner and to create scoring modules for use outside SAS Enterprise Miner. The Score node adds predictions to any data source with a role of Score. This data source must have the same inputs as the training data. A scored data source is stored within the project directory. You can use a SAS Code tool to relocate it to another library. The Score tool creates score code modules in the SAS, C, and Java languages. These score code modules can be saved and used outside of SAS Enterprise Miner.
Model Implementation Tools Review

• Score tool: add predictions to Score data sources, and create SAS, C, and Java score code modules.
• Input Data tool: create a Score data source.
• SAS Code tool: use the SAS scripting language to export scored data outside a SAS Enterprise Miner project.
7.5 Solutions to Exercises

1. Scoring Organics Data

   a. Create a Score data source for the ScoreOrganics data.

      1) Select File ⇒ New ⇒ Data Source.
      2) Proceed to Step 2 of the Data Source Wizard by selecting Next >.
      3) Select the AAEM61.SCOREORGANICS data set.
      4) Proceed to the final step of the Data Source Wizard.
      5) Select Score as the role. (In Step 6 of the Data Source Wizard, Data Source Attributes, you can change the name and the role, and can specify a population segment identifier for the data source to be created.)
      6) Select Finish.

   b. Score the ScoreOrganics data using the model selected with the Model Comparison node.

      1) Connect a Score tool to the Model Comparison node.
      2) Connect the ScoreOrganics data source to the Score node.
      3) Run the Score node.
      4) Browse the exported data from the Score node to confirm the scoring process.
         The exported data set includes, for each case, the segment, the probability for level 1 of TargetBuy, the probability of classification, and the prediction for TargetBuy. A successfully scored data set features predicted probabilities and prediction decisions in the last three columns.
Chapter 8 Introduction to Pattern Discovery
8.1  Introduction                                              8-3

8.2  Cluster Analysis                                          8-8
     Demonstration: Segmenting Census Data                     8-15
     Demonstration: Exploring and Filtering Analysis Data      8-24
     Demonstration: Setting Cluster Tool Options               8-37
     Demonstration: Creating Clusters with the Cluster Tool    8-41
     Demonstration: Specifying the Segment Count               8-44
     Demonstration: Exploring Segments                         8-46
     Demonstration: Profiling Segments                         8-56
     Exercises                                                 8-61

8.3  Market Basket Analysis (Self-Study)                       8-63
     Demonstration: Market Basket Analysis                     8-67
     Demonstration: Sequence Analysis                          8-80
     Exercises                                                 8-83

8.4  Chapter Summary                                           8-85

8.5  Solutions                                                 8-86
     Solutions to Exercises                                    8-86
8.1 Introduction
Pattern Discovery: The Essence of Data Mining?

"...the discovery of interesting, unexpected, or valuable structures in large data sets." - David Hand

There are a multitude of definitions for the field of data mining and knowledge discovery. Most center on the concept of pattern discovery. For example, David Hand, Professor of Statistics at Imperial College, London, and a noted data mining authority, defines the field as "...the discovery of interesting, unexpected, or valuable structures in large data sets." (Hand 2005) This is made possible by the ever-increasing data stores brought about by the era's information technology.
While Hand's pronouncement is grandly promising, experience has shown it to be overly optimistic. Herb Edelstein, President of Two Crows Corporation and an internationally recognized expert in data mining, data warehousing, and CRM, counters with the following (Beck (Editor) 1997): "If you've got terabytes of data, and you're relying on data mining to find interesting things in there for you, you've lost before you've even begun. You really need people who understand what it is they are looking for - and what they can do with it once they find it." Many people think data mining (in particular, pattern discovery) means magically finding hidden nuggets of information without having to formulate the problem and without regard to the structure or content of the data. This is an unfortunate misconception.
Pattern Discovery Caution
• Poor data quality
• Opportunity
• Interventions
• Separability
• Obviousness
• Non-stationarity
In his defense, David Hand is well aware of the limitations of pattern discovery and provides guidance on how these analyses can fail (Hand 2005). These failings often fall into one of six categories:

• Poor data quality assumes many guises: inaccuracies (measurement or recording errors); missing, incomplete, or outdated values; and inconsistencies (changes of definition). Patterns found in false data are fantasies.

• Opportunity transforms the possible to the perceived. Hand refers to this as the problem of multiplicity, or the law of truly large numbers. Examples of this abound. Hand notes that the odds of a person winning the lottery in the United States are extremely small, and the odds of that person winning it twice are fantastically so. However, the odds of someone in the United States winning it twice (in a given year) are actually better than even. As another example, you can search the digits of pi for "prophetic" strings such as your birthday or significant dates in history and usually find them, given enough digits (www.angio.net/pi/piquery).

• Intervention, that is, taking action on the process that generates a set of data, can destroy or distort detected patterns. For example, fraud detection techniques lead to preventative measures, but the fraudulent behavior often evolves in response to this intervention.

• Separability of the interesting from the mundane is not always possible, given the information found in a data set. Despite the many safeguards in place, it is estimated that credit card companies lose $0.18 to $0.24 per $100 in online transactions (Rubinkam 2006).

• Obviousness in discovered patterns reduces the perceived utility of an analysis. Among the patterns discovered through automatic detection algorithms, you find that there is an almost equal number of married men as married women, and you learn that ovarian cancer occurs primarily in women and that check fraud occurs most often for customers with checking accounts.

• Non-stationarity occurs when the process that generates a data set changes of its own accord. In such circumstances, patterns detected from historic data can simply cease. As Eric Hoffer states, "In times of change, learners inherit the Earth, while the learned find themselves beautifully equipped to deal with a world that no longer exists."
Pattern Discovery Applications
• Data reduction
• Novelty detection
• Profiling
• Market basket analysis
• Sequence analysis
Despite the potential pitfalls, there are many successful applications of pattern discovery: • Data reduction is the most ubiquitous application, that is, exploiting patterns in data to create a more compact representation of the original. Though vastly broader in scope, data reduction includes analytic methods such as cluster analysis. • Novelty detection methods seek unique or previously unobserved data patterns. The methods find application in business, science, and engineering. Business applications include fraud detection, warranty claims analysis, and general business process monitoring. • Profiling is a by-product of reduction methods such as cluster analysis. The idea is to create rules that isolate clusters or segments, often based on demographic or behavioral measurements. A marketing analyst might develop profiles of a customer database to describe the consumers of a company's products. • Market basket analysis, or association rule discovery, is used to analyze streams of transaction data (for example, market baskets) for combinations of items that occur (or do not occur) more (or less) commonly than expected. Retailers can use this as a way to identify interesting combinations of purchases or as predictors of customer segments. • Sequence analysis is an extension of market basket analysis to include a time dimension to the analysis. In this way, transactions data is examined for sequences of items that occur (or do not occur) more (or less) commonly than expected. A Webmaster might use sequence analysis to identify patterns or problems of navigation through a Web site.
Pattern Discovery Tools

The first three pattern discovery applications are primarily served (in no particular order) by three tools in SAS Enterprise Miner: Cluster, SOM/Kohonen, and Segment Profile. The next section features a demonstration of the Cluster and Segment Profile tools.
Market basket analysis and sequence analysis are performed by the Association tool. The Path Analysis tool can also be used to analyze sequence data. (An optional demonstration of the Association tool is presented at the end of this chapter.)
8.2 Cluster Analysis
Unsupervised Classification

Unsupervised classification: grouping of cases based on similarities in input values.

Unsupervised classification (also known as clustering and segmenting) attempts to group training data set cases based on similarities in input variables. It is a data reduction method because an entire training data set can be represented by a small number of clusters. The groupings are known as clusters or segments, and they can be applied to other data sets to classify new cases. It is distinguished from supervised classification (also known as predictive modeling), which is discussed in previous chapters.

The purpose of clustering is often description. For example, segmenting existing customers into groups and associating a distinct profile with each group might help future marketing strategies. However, there is no guarantee that the resulting clusters will be meaningful or useful.

Unsupervised classification is also useful as a step in predictive modeling. For example, customers can be clustered into homogenous groups based on sales of different items. Then a model can be built to predict the cluster membership based on more easily obtained input variables.
k-means Clustering Algorithm

1. Select inputs.
2. Select k cluster centers.
3. Assign cases to closest center.
4. Update cluster centers.
5. Reassign cases.
6. Repeat steps 4 and 5 until convergence.

One of the most commonly used methods for clustering is the k-means algorithm. It is a straightforward algorithm that scales well to large data sets and is, therefore, the primary tool for clustering in SAS Enterprise Miner.

While often overlooked as an important part of a clustering process, the first step in using the k-means algorithm is to choose a set of inputs. In general, you should seek inputs that:

• are meaningful to the analysis objective
• are relatively independent
• are limited in number
• have a measurement level of Interval
• have low kurtosis and skewness statistics (at least in the training data)

Choosing meaningful inputs is clearly important for interpretation and explanation of the generated clusters. Independence and limited input count make the resulting clusters more stable. (Small perturbations of training data usually do not result in large changes to the generated clusters.) An interval measurement level is recommended for k-means to produce non-trivial clusters. Low kurtosis and skewness statistics on the inputs avoid creating single-case outlier clusters.
The next step in the k-means algorithm is to choose a value for k, the number of cluster centers. SAS Enterprise Miner features an automatic way to do this, assuming that the data has k distinct concentrations of cases. If this is not the case, you should choose k to be consistent with your analytic objectives.

With k selected, the k-means algorithm chooses cases to represent the initial cluster centers (also named seeds).
8-11
/c-means Clustering Algorithm Training
Data
1. S e l e c t i n p u t s . ;
• -J
*
2. S e l e c t ft c l u s t e r c e n t e r s . 3. A s s i g n c a s e s t o c l o s e s t center. 4. U p d a t e c l u s t e r c e n t e r s .
• . • • s i x * :
•
•
5. R e a s s i g n
cases.
6. R e p e a t s t e p s 4 a n d 5 until convergence.
14)4
T h e E u c l i d e a n d i s t a n c e f r o m e a c h case i n t h e t r a i n i n g d a t a t o e a c h c l u s t e r c e n t e r is c a l c u l a t e d . C a s e s a r e assigned t o t h e closest cluster center. B e c a u s e t h e d i s t a n c e m e t r i c is E u c l i d e a n , i t is i m p o r t a n t f o r t h e i n p u t s t o h a v e c o m p a t i b l e
f
m e a s u r e m e n t scales. U n e x p e c t e d results can o c c u r i f one i n p u t ' s m e a s u r e m e n t scale d i f f e r s greatly from the others.
The cluster centers are updated to equal the average of the cases assigned to the cluster in the previous step.
Cases are reassigned to the closest cluster center.
The update and reassign steps are repeated until the process converges.
On convergence, final cluster assignments are made. Each case is assigned to a unique segment. The segment definitions can be stored and applied to new cases outside of the training data.
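The six steps can be sketched in a few lines of plain Python. This is an illustrative toy, not the SAS Enterprise Miner implementation, which adds refinements such as automatic segment-count selection and input standardization (remember that Euclidean distance makes inputs on very different scales problematic):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means: seed k centers from the cases, assign each case to
    the nearest center (squared Euclidean distance), update each center
    to the mean of its cases, and repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                    # step 2: initial seeds
    for _ in range(iters):                             # steps 4-6: iterate
        clusters = [[] for _ in range(k)]
        for p in points:                               # steps 3/5: assign cases
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        centers = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl
                   else centers[j]                     # keep empty clusters in place
                   for j, cl in enumerate(clusters)]
    return centers, clusters

# two obvious concentrations of cases
pts = [(0.1, 0.0), (0.0, 0.2), (0.2, 0.1),
       (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, clusters = kmeans(pts, k=2)
```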
While they are often used synonymously, a segmentation analysis is distinct from a traditional cluster analysis. A cluster analysis is geared toward identifying distinct concentrations of cases in a data set. When no distinct concentrations exist, the best you can do is a segmentation analysis, that is, algorithmically partitioning the input space into contiguous groups of cases.
Demographic Segmentation Demonstration

Analysis goal: Group geographic regions into segments based on income, household size, and population density.

Analysis plan:
• Select and transform segmentation inputs.
• Select the number of segments to create.
• Create segments with the Cluster tool.
• Interpret the segments.

The following demonstration illustrates the use of clustering tools. The goal of the analysis is to group people in the United States into distinct subsets based on urbanization, household size, and income factors. These factors are common to commercial lifestyle and life-stage segmentation products. (For examples, see www.claritas.com or www.spectramarketing.com.)
Segmenting Census Data

This demonstration introduces SAS Enterprise Miner tools and techniques for cluster and segmentation analysis. There are five parts:

• define the diagram and data source
• explore and filter the training data
• integrate the Cluster tool into the process flow and select the number of segments to create
• run a segmentation analysis
• use the Segment Profile tool to interpret the analysis results
Diagram Definition

Use the following steps to define the diagram for the segmentation analysis.

1. Right-click Diagrams in the Project panel and select Create Diagram. The Create New Diagram window opens and requests a diagram name.
2. Type Segmentation Analysis in the Diagram Name field and select OK. SAS Enterprise Miner creates an analysis workspace window named Segmentation Analysis.

You use the Segmentation Analysis window to create process flow diagrams.
Data Source Definition

Follow these steps to create the segmentation analysis data source.

1. Right-click Data Sources in the Project panel and select Create Data Source. The Data Source Wizard opens.

   The Data Source Wizard guides you through a seven-step process to create a SAS Enterprise Miner data source.

2. Select Next > to use a SAS table as the source for the metadata. (This is the usual choice.) The Data Source Wizard proceeds to Step 2. In this step, select the SAS table that you want to make available to SAS Enterprise Miner. You can either type the library name and SAS table name as libname.tablename or select the SAS table from a list.
3. Select Browse... to choose a SAS table from the libraries visible to the SAS Foundation server.
The Select a SAS Table dialog box opens, listing the available SAS libraries (Aaem61, Apfmtlib, Maps, Sampsio, Sasdata, Sashelp, Sasuser, Stpsamp, and Work).
One of the libraries listed is named AAEM61, which is the library name defined in the project start-up code.
4. Select the Aaem61 library and the Census2000 SAS table.

The Census2000 data is a postal code-level summary of the entire 2000 United States Census.
It features seven variables:

ID         postal code of the region
LOCX       region longitude
LOCY       region latitude
MEANHHSZ   average household size in the region
MEDHHINC   median household income in the region
REGDENS    region population density percentile (1=lowest density, 100=highest density)
REGPOP     number of people in the region
The data is suited for creation of life-stage, lifestyle segments using SAS Enterprise Miner's pattern discovery tools.

5. Select OK.
The Select a SAS Table dialog box closes and the selected table (AAEM61.CENSUS2000) is entered in the Table field.

6. Select Next >. The Data Source Wizard proceeds to Step 3.
Step 3 of the Data Source Wizard (Table Information) provides basic information about the selected table: table AAEM61.CENSUS2000, member type DATA, 7 variables, and 33178 observations.
7. Select Next >. The Data Source Wizard proceeds to Step 4.
Step 4 of the Data Source Wizard (Metadata Advisor Options) starts the metadata definition process. SAS Enterprise Miner assigns initial values to the metadata based on characteristics of the selected SAS table. The Basic setting assigns initial values to the metadata based on variable attributes such as the variable name, data type, and assigned SAS format. The Advanced setting also includes information about the distribution of the variable to assign the initial metadata values.
8. Select Next > to use the Basic setting. The Data Source Wizard proceeds to Step 5.

(Screen capture: Data Source Wizard - Step 5 of 7, Column Metadata. The variable ID has role ID and level Nominal; LocX, LocY, MeanHHSz, MedHHInc, RegDens, and RegPop have role Input and level Interval.)

Step 5 of the Data Source Wizard enables you to specify the role and level for each variable in the selected SAS table. A default role is assigned based on the name of a variable. For example, the variable ID was given the role ID based on its name. When a variable does not have a name corresponding to one of the possible variable roles, it is given, under the Basic setting, the default role of Input. An Input variable is used for various types of analysis to describe a characteristic, measurement, or attribute of a record, or case, in a SAS table. The metadata settings are correct for the upcoming analysis.
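Conceptually, the Basic setting's default-role assignment amounts to a name lookup with Input as the fallback. The Python sketch below is a hypothetical analogue of that rule, not SAS Enterprise Miner's actual metadata advisor (which recognizes more attributes than the few names assumed here):

```python
def default_role(name):
    """Assign a default metadata role from the variable name.

    Hypothetical analogue of the Data Source Wizard's Basic setting:
    a few recognized names (assumed here) map to special roles, and
    everything else falls back to Input.
    """
    special = {"id": "ID", "target": "Target", "freq": "Frequency"}
    return special.get(name.lower(), "Input")

# The CENSUS2000 variables as listed in Step 5 of the wizard.
roles = {v: default_role(v) for v in
         ["ID", "LocX", "LocY", "MeanHHSz", "MedHHInc", "RegDens", "RegPop"]}
```

Only ID matches a recognized name; the six remaining variables receive the Input role, matching the Column Metadata screen.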
9. Select Next >. The Data Source Wizard proceeds to Step 6.

(Screen capture: Data Source Wizard - Step 6 of 7, Data Source Attributes. You may change the name and the role, and can specify a population segment identifier for the data source to be created. Name: CENSUS2000, Role: Raw.)

Step 6 in the Data Source Wizard enables you to set a role for the selected SAS table and provide descriptive comments about the data source definition. For the impending analysis, a table role of Raw is acceptable.
Step 7 provides summary information about the created data set.

(Screen capture: Data Source Wizard - Step 7 of 7, Summary. Library: AAEM61, Table: CENSUS2000, Role: Raw.)

10. Select Finish to complete the data source definition. The CENSUS2000 table is added to the Data Sources entry in the Project panel.
Exploring and Filtering Analysis Data
A worthwhile next step in the process of defining a data source is to explore and validate its contents. By assaying the prepared data, you substantially reduce the chances of erroneous results in your analysis, and you can gain insights graphically into associations between variables.
Data Source Exploration

1. Right-click the CENSUS2000 data source and select Edit Variables... from the shortcut menu. The Variables - CENSUS2000 dialog box opens.
(Screen capture: the Variables - CENSUS2000 dialog box, listing each variable with its role and level.)
2. Examine histograms for the available variables.

3. Select all listed inputs by dragging the cursor across all of the input names or by holding down the CTRL key and typing A.
4. Select Explore.... The Explore window opens, and this time displays histograms for all of the variables in the CENSUS2000 data source.

(Screen capture: the Explore - AAEM61.CENSUS2000 window, showing the sample properties (33178 rows, 7 columns, random sample), a data table, and a histogram for each variable.)
5. Maximize the MeanHHSz histogram by double-clicking its title bar. The histogram now fills the Explore window.
(Screen capture: the Average Household Size histogram maximized in the Explore window.)

As before, increasing the number of histogram bins from the default of 10 increases your understanding of the data.
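The effect of the bin count can be sketched in a few lines of Python (a toy analogue of the Explore window's histogram, not SAS code; the data below is invented). With few wide bins, a narrow feature such as a cluster of exact-zero values shares a bar with its neighbors; with many narrow bins, it stands out on its own:

```python
def histogram(values, bins, lo, hi):
    """Count values into equal-width bins over [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        i = min(int((v - lo) / width), bins - 1)  # clamp hi into last bin
        counts[i] += 1
    return counts

# Toy data: 50 exact zeros, a few small values, and a bulk near 3.
values = [0.0] * 50 + [0.6] * 5 + [2.5 + 0.01 * i for i in range(100)]
coarse = histogram(values, 10, 0.0, 8.5)   # zeros share the wide first bin
fine = histogram(values, 100, 0.0, 8.5)    # zeros isolated in their own bin
```

With 10 bins the first bar mixes the zeros with the small values (55 cases); with 100 bins the zero spike is a bar of exactly 50.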
6. Right-click in the histogram window and select Graph Properties... from the shortcut menu. The Properties - Histogram dialog box opens.

(Screen capture: the Properties - Histogram dialog box, with a Number of X Bins field that defaults to 10.)

You can use the Properties - Histogram dialog box to change the appearance of the corresponding histogram.
7. Type 100 in the Number of X Bins field and select OK. The histogram is updated to have 100 bins.

(Screen capture: the Average Household Size histogram with 100 bins.)

There is a curious spike in the histogram at (or near) zero. A zero household size does not make sense in the context of census data.

8. Select the bar near zero in the histogram.
9. Restore the size of the window by double-clicking the title bar of the MeanHHSize window. The window returns to its original size.

(Screen capture: the restored Explore window, with the zero-household-size cases highlighted across the histograms.)

The zero average household size seems to be evenly distributed across the longitude, latitude, and density percentile variables. It seems concentrated on low incomes and populations, and also makes up the majority of the missing observations in the distribution of Region Density. It is worthwhile to look at the individual records of the explore sample.
10. Maximize the CENSUS2000 data table.

11. Scroll in the data table until you see the first selected row.

(Screen capture: the data table with the selected rows visible. Several records show an Average Household Size of 0.00, a Region Population of 0, and a Median Household Income of $0.)

Records 45 and 46 (among others) have the zero Average Household Size characteristic. Other fields in these records also have unusual values.
12. Select the Average Household Size column heading twice to sort the table by descending values in this field. Cases of interest are collected at the top of the data table.

(Screen capture: the data table sorted by descending Average Household Size. The top rows show zero household size, zero or missing incomes, and zero populations.)

Most of the cases with zero Average Household Size have zero or missing on the remaining nongeographic attributes. There are some exceptions, but it could be argued that cases such as this are not of interest for analyzing household demographics. The next part of this demonstration shows how to remove cases such as this from the subsequent analyses.

13. Close the Explore and Variables windows.
Case Filtering

The SAS Enterprise Miner Filter tool enables you to remove unwanted records from an analysis. Use these steps to build a diagram to read a data source and to filter records.

1. Drag the CENSUS2000 data source into the Segmentation Analysis workspace window.

2. Select the Sample tab to access the Sample tool group.

3. Drag the Filter tool (fourth from the left) from the tools palette into the Segmentation Analysis workspace window and connect it to the CENSUS2000 data source.

(Screen capture: the Segmentation Analysis diagram, with the CENSUS2000 node connected to the Filter node.)
4. Select the Filter node and examine the Properties panel.

(Screen capture: the Filter node Properties panel. The Default Filtering Method is Rare Values (Percentage) for class variables and Standard Deviations from the Mean for interval variables.)

Based on the values of the Properties panel, the node will, by default, filter cases in rare levels of any class input variable and cases exceeding three standard deviations from the mean on any interval input variable. Because the CENSUS2000 data source contains only interval inputs, only the Interval Variables criterion is considered.

5. Change the Default Filtering Method property for interval variables to User-Specified Limits.
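The default interval criterion (drop cases more than three standard deviations from the mean) can be sketched in Python. This is an illustrative analogue on invented data, not the Filter node's actual code:

```python
from statistics import mean, stdev

def three_sigma_filter(rows, column):
    """Keep rows whose value in `column` lies within 3 standard deviations
    of that column's mean (analogue of the Filter node's interval default)."""
    values = [row[column] for row in rows]
    m, s = mean(values), stdev(values)
    return [row for row in rows if abs(row[column] - m) <= 3 * s]

# Toy data: 29 plausible household sizes plus one extreme outlier.
rows = [{"MeanHHSz": 2.5 + 0.05 * i} for i in range(29)] + [{"MeanHHSz": 50.0}]
kept = three_sigma_filter(rows, "MeanHHSz")
```

Note that a single outlier also inflates the standard deviation it is judged against, so this rule only flags values that are extreme even after that inflation; that is one reason a user-specified limit can be preferable.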
6. Select the Interval Variables ellipsis (...). The Interactive Interval Filter window opens.

(Screen capture: the Interactive Interval Filter window, listing a Filtering Method, Filter Lower Limit, and Filter Upper Limit for each input. A message at the top reads "Train or raw data set does not exist.")

You are warned at the top of the dialog box that the Train or raw data set does not exist. This indicates that you are restricted from the interactive filtering elements of the node, which are available after a node is run. You can, nevertheless, enter filtering information.
7. Type 0.1 as the Filter Lower Limit value for the input variable MeanHHSz.

8. Select OK to close the Interactive Interval Filter dialog box. You are returned to the SAS Enterprise Miner interface window. All cases with an average household size less than 0.1 will be filtered from subsequent analysis steps.
9. Run the Filter node and view the results. The Results window opens.

(Screen capture: the Filter node Results window. The Limits for Interval Variables table shows a minimum of 0.1 for the input MeanHHSz.)

10. Go to line 38 in the Output window.

Number Of Observations

Data Role    Filtered    Excluded     DATA
TRAIN           32097        1081    33178

The Filter node removed 1081 cases with zero household size.

11. Close the Results window. The CENSUS2000 data is ready for segmentation.
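The user-specified lower limit behaves like a simple row filter. In Python, on toy data rather than the actual CENSUS2000 table, the same logic looks like this:

```python
def filter_lower_limit(rows, column, lower):
    """Split rows into (kept, excluded) by a user-specified lower limit,
    mirroring the Filter node's User-Specified Limits method."""
    kept = [r for r in rows if r[column] >= lower]
    excluded = [r for r in rows if r[column] < lower]
    return kept, excluded

# Toy sample: three zero-household-size cases among ordinary ones.
rows = [{"MeanHHSz": v} for v in [0.0, 0.0, 2.97, 3.11, 0.0, 2.55, 3.4]]
kept, excluded = filter_lower_limit(rows, "MeanHHSz", 0.1)
```

On the real table, the corresponding counts are the 32097 filtered and 1081 excluded cases reported in the node's output.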
Cluster Tool Options

The Cluster tool performs k-means cluster analyses, a widely used method for cluster and segmentation analysis. This demonstration shows you how to use the tool to segment the cases in the CENSUS2000 data set.

1. Select the Explore tab.

2. Locate and drag a Cluster tool into the diagram workspace.

3. Connect the Filter node to the Cluster node.

To create meaningful segments, you need to set the Cluster node to do the following:
• ignore irrelevant inputs
• standardize the inputs to have a similar range

4. Select the Variables... property for the Cluster node. The Variables window opens.
5. Set Use to No for LocX, LocY, and RegPop.

(Screen capture: the Variables window with Use set to No for LocX, LocY, and RegPop.)

The Cluster node creates segments using the inputs MedHHInc, MeanHHSz, and RegDens.

Segments are created based on the (Euclidean) distance between each case in the space of the selected inputs. If you want to use all the inputs to create clusters, these inputs should have similar measurement scales. Calculating distances using standardized measurements (subtracting the mean and dividing by the standard deviation of the input values) is one way to ensure this. You can standardize the input measurements using the Transform Variables node. However, it is easier to use the built-in property in the Cluster node.
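Why scale matters for Euclidean distance can be shown with a small Python sketch (illustrative only, with invented cases). Before standardization, an income measured in dollars dominates the distance; after z-scoring each input, both inputs contribute comparably:

```python
from math import sqrt
from statistics import mean, stdev

def zscore_columns(rows):
    """Standardize each column: subtract its mean, divide by its standard deviation."""
    cols = list(zip(*rows))
    stats = [(mean(c), stdev(c)) for c in cols]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy cases: (average household size, median income in dollars)
rows = [(2.5, 30000.0), (3.5, 31000.0), (2.6, 60000.0)]
raw_d = euclidean(rows[0], rows[1])   # ~1000: swamped by the income scale
std = zscore_columns(rows)
std_d = euclidean(std[0], std[1])     # order 1: both inputs now matter
```

On the raw scale, a full extra person per household changes the distance by about 1, while a $1,000 income difference changes it by 1,000; after standardization the two differences are weighed on comparable terms.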
6. Select the inputs MedHHInc, MeanHHSz, and RegDens and select Explore.... The Explore window opens.

(Screen capture: the Explore window showing histograms of Region Density Percentile, Median Household Income, and Average Household Size.)

The inputs selected for use in the cluster are on three entirely different measurement scales. They need to be standardized if you want a meaningful clustering.

7. Close the Explore window.

8. Select OK to close the Variables window.
9. Select Internal Standardization ⇨ Standardization. Distances between points are now calculated based on standardized measurements.

(Screen capture: the Cluster node Properties panel with Internal Standardization set to Standardization.)

Another way to standardize an input is by subtracting the input's minimum value and dividing by the input's range. This is called range standardization. Range standardization rescales the distribution of each input to the unit interval, [0,1]. The Cluster node is ready to run.
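Range standardization is equally simple to sketch in Python (an illustration, not the node's implementation):

```python
def range_standardize(values):
    """Rescale values to [0, 1]: subtract the minimum, divide by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Toy incomes: the minimum maps to 0, the maximum to 1.
scaled = range_standardize([30000.0, 31000.0, 60000.0])
```

Unlike z-scoring, this bounds every input to [0, 1], but a single extreme value compresses the rest of the distribution toward one end of the interval.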
Creating Clusters with the Cluster Tool

By default, the Cluster tool attempts to automatically determine the number of clusters in the data. A three-step process is used.

Step 1: A large number of cluster seeds are chosen (50 by default) and placed in the input space. Cases in the training data are assigned to the closest seed, and an initial clustering of the data is completed. The means of the input variables in each of these preliminary clusters are substituted for the original training data cases in the second step of the process.

Step 2: A hierarchical clustering algorithm (Ward's method) is used to sequentially consolidate the clusters that were formed in the first step. At each step of the consolidation, a statistic named the cubic clustering criterion (CCC) (Sarle 1983) is calculated. Then, the smallest number of clusters that meets both of the following criteria is selected:
• The number of clusters must be greater than or equal to the number that is specified as the Minimum value in the Selection Criterion properties.
• The number of clusters must have cubic clustering criterion statistic values that are greater than the CCC threshold that is specified in the Selection Criterion properties.

Step 3: The number of clusters determined by the second step provides the value for k in a k-means clustering of the original training data cases.
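The k-means pass in Step 3 can be sketched in plain Python. This is a minimal illustration of the algorithm on toy data, not SAS Enterprise Miner's implementation, which adds the seed-selection and CCC machinery described above:

```python
def kmeans(points, seeds, iters=20):
    """Basic k-means: assign each point to the nearest centroid, then move
    each centroid to the mean of its assigned points; repeat."""
    centroids = [list(s) for s in seeds]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            best = min(range(len(centroids)),
                       key=lambda i: sum((a - b) ** 2
                                         for a, b in zip(p, centroids[i])))
            clusters[best].append(p)
        for i, members in enumerate(clusters):
            if members:  # leave a centroid in place if its cluster empties
                centroids[i] = [sum(d) / len(members) for d in zip(*members)]
    return centroids, clusters

# Two obvious groups in two standardized inputs.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(points, seeds=[(0.0, 0.0), (9.0, 9.0)])
```

Each centroid migrates to the mean of its group, and the two groups of three points separate after the first iteration.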
1. Run the Cluster node and select Results.... The Results - Cluster window opens.

(Screen capture: the Results - Cluster window, showing the Segment Plot, Mean Statistics, Segment Size, and Output windows.)

The Results - Cluster window contains four embedded windows.
• The Segment Plot window attempts to show the distribution of each input variable by cluster.
• The Mean Statistics window lists various descriptive statistics by cluster.
• The Segment Size window shows a pie chart describing the size of each cluster formed.
• The Output window shows the output of various SAS procedures run by the Cluster node.

Apparently, the Cluster node found four clusters in the CENSUS2000 data. Because the number of clusters is based on the cubic clustering criterion, it might be interesting to examine the values of this statistic for various cluster counts.
2. Select View ⇨ Summary Statistics ⇨ CCC Plot. The CCC Plot window opens.

(Screen capture: the CCC Plot, plotting the cubic clustering criterion against the number of clusters.)

In theory, the number of clusters in a data set is revealed by the peak of the CCC versus Number of Clusters plot. However, when no distinct concentrations of data exist, the utility of the CCC statistic is somewhat suspect. SAS Enterprise Miner attempts to establish reasonable defaults for its analysis tools. The appropriateness of these defaults, however, strongly depends on the analysis objective and the nature of the data.
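The CCC formula itself is involved (see Sarle 1983), but the general idea of scoring candidate cluster counts can be illustrated with a simpler statistic: the total within-cluster sum of squares. The Python sketch below is a hypothetical stand-in for the CCC, not its formula; plotting such a statistic against the number of clusters and looking for where it stops improving plays a role similar to reading the peak of the CCC plot:

```python
def within_ss(clusters):
    """Total within-cluster sum of squares for a list of 1-D clusters."""
    total = 0.0
    for members in clusters:
        m = sum(members) / len(members)
        total += sum((v - m) ** 2 for v in members)
    return total

# Toy 1-D data with two obvious concentrations.
data = [1.0, 1.1, 0.9, 9.0, 9.1, 8.9]
sse_k1 = within_ss([data])                # one cluster: large spread
sse_k2 = within_ss([data[:3], data[3:]])  # the natural split: tiny spread
```

The sharp drop from one cluster to two, followed by only marginal gains for larger counts, is the kind of signal a cluster-count criterion is meant to detect.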
Specifying the Segment Count

You might want to increase the number of clusters created by the Cluster node. You can do this by changing the CCC cutoff property or by specifying the desired number of clusters.

1. In the Properties panel for the Cluster node, select Specification Method ⇨ User Specify.

(Screen capture: the Properties panel with Specification Method set to User Specify and Maximum Number of Clusters set to 10.)

The User Specify setting creates a number of segments indicated by the Maximum Number of Clusters property listed above it (in this case, 10).
2. Run the Cluster node and select Results.... The Results - Node: Cluster window opens and shows a total of 10 generated segments.

(Screen capture: the Results window for the 10-segment run.)

As seen in the Mean Statistics window, segment frequency counts vary from 10 cases to more than 9,000 cases.

Segment    Frequency     Root-Mean-Square      Maximum Distance
           of Cluster    Standard Deviation    from Cluster Seed
   1           4715          0.596385               4.311615
   2             93          1.19246                4.853893
   3            927          0.706122               3.923714
   4           9009          0.467548               2.807549
   5           1503          0.694551               3.478019
   6            515          0.987181               4.800237
   7           6443          0.439273               2.002035
   8           3714          0.543696               3.59743
   9             10          1.912147               5.300685
  10           5168          0.496586               6.081233
Exploring Segments

While the Results window shows a variety of data summarizing the analysis, it is difficult to understand the composition of the generated clusters. If the number of cluster inputs is small, the Graph wizard can aid in interpreting the cluster analysis.

1. Close the Results - Cluster window.

2. Select Exported Data from the Properties panel for the Cluster node. The Exported Data - Cluster window opens.

(Screen capture: the Exported Data - Cluster window, listing exported data sets such as EMWS5.Clus_TRAIN.)

This window shows the data sets that are generated and exported by the Cluster node.
8.2 Cluster Analysis
3. Select the Train data set and select Explore.... The Explore window opens.
[Explore window: sample properties and the EMWS5.Clus_TRAIN data table, showing the region variables along with the generated _SEGMENT_ and _DISTANCE_ columns.]
You can use the Graph Wizard to generate a three-dimensional plot of the CENSUS2000 data.
4. Select Actions ⇒ Plot. The Select a Chart Type window opens.
5. Select the icon for a three-dimensional scatter plot.
6. Select Next >. The Graph Wizard proceeds to the next step, Select Chart Roles.
7. Select roles of X, Y, and Z for MeanHHSz, MedHHInc, and RegDens, respectively.

8. Select the Color role for _SEGMENT_.
[Select Chart Roles dialog: _SEGMENT_ is assigned the Color role; MeanHHSz, MedHHInc, and RegDens are assigned the X, Y, and Z roles.]

9. Select Finish.
The Explore window opens with a three-dimensional plot of the CENSUS2000 data.
[Three-dimensional scatter plot: Average Household Size by Median Household Income by Region Density Percentile, colored by cluster segment.]
10. Rotate the plot by holding down the CTRL key and dragging the mouse. Each square in the plot represents a unique postal code. The squares are color-coded by cluster segment.
To further aid interpretability, add a distribution plot of the segment number.

1. Select Actions ⇒ Plot.... The Select a Chart Type window opens.
2. Select a Bar chart.

3. Select Next >.
4. Select Role ⇒ Category for the variable _SEGMENT_.
[Select Chart Roles dialog: _SEGMENT_ is assigned the Category role, with Frequency as the response statistic.]

5. Select Finish.
A histogram of _SEGMENT_ opens.
By itself, this plot is of limited use. However, when the plot is combined with the three-dimensional plot, you can easily interpret the generated segments.
6. Select the tallest segment bar in the histogram, segment 4.
7. Select the three-dimensional plot. Cases corresponding to segment 4 are highlighted.

8. Rotate the three-dimensional plot to get a better look at the highlighted cases.
Cases in this largest segment correspond to households averaging between two and three members, lower population density, and median household incomes between $20,000 and $50,000.

9. Close the Explore, Exported Data, and Results windows.
Profiling Segments
You can gain a great deal of insight by creating plots as in the previous demonstration. Unfortunately, if more than three variables are used to generate the segments, the interpretation of such plots becomes difficult. Fortunately, there is another useful tool in SAS Enterprise Miner for interpreting the composition of clusters: the Segment Profile tool. This tool enables you to compare the distribution of a variable in an individual segment to the distribution of the variable overall. As a bonus, the variables are sorted by how well they characterize the segment.

1. Drag a Segment Profile tool from the Assess tool palette into the diagram workspace.
2. Connect the Cluster node to the Segment Profile node.
To best describe the segments, you should pick a reasonable subset of the available input variables.
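The idea behind the Segment Profile tool can be sketched outside Enterprise Miner. The snippet below ranks variables by a simple worth score: how far the within-segment mean sits from the overall mean, measured in overall standard-deviation units. Both the mini data set and the scoring formula are illustrative assumptions; Enterprise Miner computes its own, more elaborate worth statistic.

```python
# Sketch of the Segment Profile idea: rank variables by how strongly
# they distinguish one segment from the overall population.
# Data and worth formula are invented for illustration.

rows = [
    # (segment, MeanHHSz, MedHHInc, RegDens)
    (4, 2.6, 35000, 20), (4, 2.8, 42000, 25), (4, 2.4, 28000, 15),
    (7, 1.9, 61000, 85), (7, 2.1, 70000, 90), (7, 2.0, 64000, 80),
]
variables = ["MeanHHSz", "MedHHInc", "RegDens"]

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

def worth(segment):
    """Score each variable by the distance of the within-segment mean
    from the overall mean, in overall standard-deviation units."""
    scores = {}
    for i, name in enumerate(variables):
        overall = [r[i + 1] for r in rows]
        within = [r[i + 1] for r in rows if r[0] == segment]
        scores[name] = abs(mean(within) - mean(overall)) / std(overall)
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(worth(4))  # variables ordered by how well they characterize segment 4
```

With this toy data, RegDens ranks first for segment 4, mirroring the demonstration's finding that segment 4 is largely characterized by region density.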
3. Select the Variables property for the Segment Profile node.
4. Select Use ⇒ No for ID, LocX, LocY, and RegPop.

5. Select OK to close the Variables dialog box.
6. Run the Segment Profile node and select Results.... The Results - Node: Segment Profile Diagram window opens.
[Results window: segment size pie chart, profile plots, variable worth plots, and training output.]
7. Maximize the Profile window.
[Profile window: for each segment (segment 4, 28.07%; segment 7, 20.07%; segment 10, 16.1%; segment 1, 14.69%; and so on), histograms of Region Density Percentile, Median Household Income, and Average Household Size, shown against the overall distributions.]
Features of each segment become apparent. For example, segment 4, when compared to the overall distributions, has a lower Region Density Percentile, more central Median Household Income, and slightly higher Average Household Size.
11. Maximize the Variable Worth: _SEGMENT_ window.
[Variable Worth: _SEGMENT_ window: for each segment, a bar chart of the worth of RegDens, MeanHHSz, and MedHHInc.]
The window shows the relative worth of each variable in characterizing each segment. For example, segment 4 is largely characterized by the RegDens input, but the other two inputs also play a role. Again, similar analyses can be used to describe the other segments. The advantage of the Segment Profile window (compared to direct viewing of the segmentation) is that the descriptions can be more than three-dimensional.
Exercises
1. Conducting Cluster Analysis

The DUNGAREE data set gives the number of pairs of four different types of dungarees sold at stores over a specific time period. Each row represents an individual store. There are six columns in the data set. One column is the store identification number, and the remaining columns contain the number of pairs of each type of jeans sold.

Name       Model Role   Measurement Level   Description
STOREID    ID           Nominal             Identification number of the store
FASHION    Input        Interval            Number of pairs of fashion jeans sold at the store
LEISURE    Input        Interval            Number of pairs of leisure jeans sold at the store
STRETCH    Input        Interval            Number of pairs of stretch jeans sold at the store
ORIGINAL   Input        Interval            Number of pairs of original jeans sold at the store
SALESTOT   Rejected     Interval            Total number of pairs of jeans sold (the sum of FASHION, LEISURE, STRETCH, and ORIGINAL)

a. Create a new diagram in your project. Name the diagram Jeans.
b. Define the data set DUNGAREE as a data source.

c. Determine whether the model roles and measurement levels assigned to the variables are appropriate. Examine the distribution of the variables.
   • Are there any unusual data values?
   • Are there missing values that should be replaced?

d. Assign the variable STOREID the model role ID and the variable SALESTOT the model role Rejected. Make sure that the remaining variables have the Input model role and the Interval measurement level. Why should the variable SALESTOT be rejected?

e. Add an Input Data Source node to the diagram workspace and select the DUNGAREE data table as the data source.
f. Add a Cluster node to the diagram workspace and connect it to the Input Data node.

g. Select the Cluster node and select Internal Standardization ⇒ Standardization. What would happen if you did not standardize your inputs?

h. Run the diagram from the Cluster node and examine the results. Does the number of clusters created seem reasonable?

i. Specify a maximum of six clusters and rerun the Cluster node. How does the number and quality of clusters compare to that obtained in part h?

j. Use the Segment Profile node to summarize the nature of the clusters.
8.3 Market Basket Analysis (Self-Study)
Market Basket Analysis

Rule          Support   Confidence
A => D        2/5       2/3
C => A        2/5       2/4
A => C        2/5       2/3
B & C => D    1/5       1/3
Market basket analysis (also known as association rule discovery or affinity analysis) is a popular data mining method. In the simplest situation, the data consists of two variables: a transaction and an item. For each transaction, there is a list of items. Typically, a transaction is a single customer purchase, and the items are the things that were bought. An association rule is a statement of the form (item set A) => (item set B).

The aim of the analysis is to determine the strength of all the association rules among a set of items. The strength of the association is measured by the support and confidence of the rule. The support for the rule A => B is the probability that the two item sets occur together. The support of the rule A => B is estimated by the following:

    (number of transactions that contain every item in A and B) / (number of transactions)

Notice that support is symmetric. That is, the support of the rule A => B is the same as the support of the rule B => A.

The confidence of an association rule A => B is the conditional probability of a transaction containing item set B given that it contains item set A. The confidence is estimated by the following:

    (number of transactions that contain every item in A and B) / (number of transactions that contain the items in A)
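The two estimates above are straightforward to sketch in code. In this minimal example, a transaction is a Python set of items and a rule A => B is a pair of item sets; the transactions themselves are invented for illustration.

```python
# Estimating support and confidence of A => B from raw transactions.
# Transactions below are invented for illustration.

transactions = [
    {"CKING", "SVG"},
    {"CKING", "CCRD", "CKCRD"},
    {"SVG", "CD"},
    {"CKING", "SVG", "ATM"},
    {"CKING", "CCRD"},
]

def support(a, b, trans=transactions):
    """P(A and B): fraction of transactions containing every item in A and B."""
    both = a | b
    return sum(both <= t for t in trans) / len(trans)

def confidence(a, b, trans=transactions):
    """P(B | A): among transactions containing A, the fraction also containing B."""
    has_a = [t for t in trans if a <= t]
    return sum(b <= t for t in has_a) / len(has_a)

# The rule {CKING} => {CCRD}
print(support({"CKING"}, {"CCRD"}))     # 2 of 5 transactions: 0.4
print(confidence({"CKING"}, {"CCRD"}))  # 2 of the 4 CKING transactions: 0.5
```

Swapping the arguments leaves `support` unchanged, which demonstrates the symmetry noted above.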
Implication?

                     Checking Account
Savings Account      No       Yes      Total
No                     500    3,500     4,000
Yes                  1,000    5,000     6,000
Total                1,500    8,500    10,000

Support(SVG => CK) = 50%
Confidence(SVG => CK) = 83%
Expected Confidence(SVG => CK) = 85%
Lift(SVG => CK) = 0.83/0.85 < 1
The interpretation of the implication (=>) in association rules is precarious. High confidence and support do not imply cause and effect. The rule is not necessarily interesting. The two items might not even be correlated. The term confidence is not related to the statistical usage; therefore, there is no repeated sampling interpretation.

Consider the association rule (saving account) => (checking account). This rule has 50% support (5,000/10,000) and 83% confidence (5,000/6,000). Based on these two measures, this might be considered a strong rule. On the contrary, those without a savings account are even more likely to have a checking account (87.5%). Saving and checking are, in fact, negatively correlated. If the two accounts were independent, then knowing that a person has a savings account does not help in knowing whether that person has a checking account. The expected confidence if the two accounts were independent is 85% (8,500/10,000). This is higher than the confidence of SVG => CK.

The lift of the rule A => B is the confidence of the rule divided by the expected confidence, assuming that the item sets are independent. The lift can be interpreted as a general measure of association between the two item sets. Values greater than 1 indicate positive correlation, values equal to 1 indicate zero correlation, and values less than 1 indicate negative correlation. Notice that lift is symmetric. That is, the lift of the rule A => B is the same as the lift of the rule B => A.
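The arithmetic of the saving/checking example can be checked directly. The cell counts below come from the two-by-two table of the 10,000 customers.

```python
# Reproducing the saving/checking example: support, confidence,
# expected confidence, and lift for the rule SVG => CK.

n_total = 10_000       # all customers
n_svg = 6_000          # have a savings account
n_ck = 8_500           # have a checking account
n_svg_and_ck = 5_000   # have both

support = n_svg_and_ck / n_total           # 0.50
confidence = n_svg_and_ck / n_svg          # ~0.833
expected_confidence = n_ck / n_total       # 0.85
lift = confidence / expected_confidence    # ~0.98, below 1: negative association

print(round(support, 2), round(confidence, 3), round(lift, 3))
```

Because the lift is below 1, owning a savings account actually lowers the chance of owning a checking account relative to a customer chosen at random, despite the rule's high confidence.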
Barbie Doll => Candy

1. Put them closer together in the store.
2. Put them far apart in the store.
3. Package candy bars with the dolls.
4. Package Barbie + candy + poorly selling item.
5. Raise the price on one, and lower it on the other.
6. Offer Barbie accessories for proofs of purchase.
7. Do not advertise candy and Barbie together.
8. Offer candies in the shape of a Barbie doll.
Forbes (Palmeri 1997) reported that a major retailer determined that customers who buy Barbie dolls have a 60% likelihood of buying one of three types of candy bars. The confidence of the rule Barbie => candy is 60%. The retailer was unsure what to do with this nugget. The online newsletter Knowledge Discovery Nuggets invited suggestions (Piatetsky-Shapiro 1998).
Data Capacity

In data mining, the data is not generated to meet the objectives of the analysis. It must be determined whether the data, as it exists, has the capacity to meet the objectives. For example, quantifying affinities among related items would be pointless if very few transactions involved multiple items. Therefore, it is important to do some initial examination of the data before attempting to do association analysis.
Association Tool Demonstration

Analysis goal: Explore associations between retail banking services used by customers.

Analysis plan:
• Create an association data source.
• Run an association analysis.
• Interpret the association rules.
• Run a sequence analysis.
• Interpret the sequence rules.
A bank's Marketing Department is interested in examining associations between various retail banking services used by customers. Marketing would like to determine both typical and atypical service combinations as well as the order in which the services were first used. These requirements suggest both a market basket analysis and a sequence analysis.
Market Basket Analysis
The BANK data set contains service information for nearly 8,000 customers. There are three variables in the data set, as shown in the table below.

Name     Model Role   Measurement Level   Description
ACCOUNT  ID           Nominal             Account Number
SERVICE  Target       Nominal             Type of Service
VISIT    Sequence     Ordinal             Order of Product Purchase
The BANK data set has over 32,000 rows. Each row of the data set represents a customer-service combination. Therefore, a single customer can have multiple rows in the data set, and each row represents one of the products he or she owns. The median number of products per customer is three. The 13 products are represented in the data set using the following abbreviations:

ATM      automated teller machine debit card
AUTO     automobile installment loan
CCRD     credit card
CD       certificate of deposit
CKCRD    check/debit card
CKING    checking account
HMEQLC   home equity line of credit
IRA      individual retirement account
MMDA     money market deposit account
MTG      mortgage
PLOAN    personal/consumer installment loan
SVG      saving account
TRUST    personal trust account
Your first task is to create a new analysis diagram and data source for the BANK data set.

1. Create a new diagram named Associations Analysis to contain this analysis.

2. Select Create Data Source from the Data Sources project property.

3. Proceed to Step 2 of the Data Source Wizard.
4. Select the BANK table in the AAEM61 library.

5. Proceed to Step 5 of the Data Source Wizard.

6. In Step 5, assign metadata to the table variables as shown below.
[Column Metadata: ACCOUNT is assigned the ID role, SERVICE the Target role, and VISIT the Sequence role.]
An association analysis requires exactly one target variable and at least one ID variable. Both should have a nominal measurement level. A sequence analysis also requires a sequence variable. It usually has an ordinal measurement scale.
7. Proceed to Step 7 of the Data Source Wizard. For an association analysis, the data source should have a role of Transaction.

8. Select Role ⇒ Transaction.

9. Select Finish to close the Data Source Wizard.
10. Drag a BANK data source into the diagram workspace.

11. Select the Explore tab and drag an Association tool into the diagram workspace.

12. Connect the BANK node to the Association node.
13. Select the Association node and examine its Properties panel.

[Association node Properties panel: Train properties include Maximum Number of Rules, Maximum Items, Minimum Confidence, Support Type, Support Count, Support Percentage, Number to Keep, Sort Criterion, Number to Transpose, and Export Rule by ID.]

The Export Rule by ID property determines whether the Rule-by-ID data is exported from the node and if the Rule Description table will be available for display in the Results window.

14. Set the value for Export Rule by ID to Yes.
Other options in the Properties panel include the following:
• The Minimum Confidence Level specifies the minimum confidence level to generate a rule. The default level is 10%.
• The Support Type specifies whether the analysis should use the support count or support percentage property. The default setting is Percent.
• The Support Count specifies a minimum level of support to claim that items are associated (that is, they occur together in the database). The default count is 2.
• The Support Percentage specifies a minimum level of support to claim that items are associated (that is, they occur together in the database). The default frequency is 5%. The support percentage figure that you specify refers to the proportion of the largest single item frequency, and not the end support.
• Maximum Items determines the maximum size of the item set to be considered. For example, the default of four items indicates that a maximum of four items will be included in a single association rule.
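The note about the support percentage is worth making concrete. A minimal sketch, assuming (as described above) that the percentage is applied to the largest single-item frequency rather than to the total transaction count; the item counts are invented for illustration.

```python
# Translating a support percentage into a minimum support count,
# taking the percentage of the LARGEST single-item frequency
# (not of the total transaction count). Counts are invented.

item_counts = {"CKING": 6000, "SVG": 4500, "ATM": 3000, "CCRD": 1500}

def min_support_count(support_percentage, counts):
    largest = max(counts.values())  # frequency of the most common item
    return largest * support_percentage / 100

print(min_support_count(5.0, item_counts))  # 5% of 6000 = 300.0
```

Under this reading, lowering the percentage (or the count) admits rules about rarer products, and raising it prunes the rule list, matching the tuning advice in the note below.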
If you are interested in associations that involve fairly rare products, you should consider reducing the support count or percentage when you run the Association node. If you obtain too many rules to be practically useful, you should consider raising the minimum support count or percentage as one possible solution. Because you first want to perform a market basket analysis, you do not need the sequence variable.
15. Open the Variables dialog box for the Association node.
16. Select Use ⇒ No for the VISIT variable.

17. Select OK to close the Variables dialog box.

18. Run the diagram from the Association node and view the results.
The Results - Node: Association Diagram window opens with the Statistics Plot, Statistics Line Plot, Rule Matrix, and Output windows visible.

[Results window: statistics plot, statistics line plot, rule matrix, and training output.]
19. Maximize the Statistics Line Plot window.
The statistics line plot graphs the lift, expected confidence, confidence, and support for each of the rules by rule index number. Consider the rule A => B. Recall the following:
• Support of A => B is the probability that a customer has both A and B.
• Confidence of A => B is the probability that a customer has B given that the customer has A.
• Expected Confidence of A => B is the probability that a customer has B.
• Lift of A => B is a measure of the strength of the association. If Lift=2 for the rule A => B, then a customer having A is twice as likely to have B as a customer chosen at random. Lift is the confidence divided by the expected confidence.

Notice that the rules are ordered in descending order of lift.
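The descending-lift ordering can be sketched as follows. The transactions and candidate rules are invented, and only single-item rule sides are used for brevity; the Association node evaluates multi-item sets as well.

```python
# Computing support, confidence, and lift for candidate rules,
# then ranking the rules by descending lift as the statistics
# line plot does. Transactions are invented for illustration.

transactions = [
    {"CKING", "CCRD", "CKCRD"}, {"CKING", "CCRD", "CKCRD"},
    {"CKING", "SVG"}, {"CKING", "ATM"}, {"SVG", "CD"}, {"CKING"},
]
n = len(transactions)

def stats(a, b):
    n_a = sum(a in t for t in transactions)
    n_b = sum(b in t for t in transactions)
    n_ab = sum(a in t and b in t for t in transactions)
    confidence = n_ab / n_a
    expected = n_b / n          # expected confidence: P(B)
    return {"rule": f"{a} => {b}", "support": n_ab / n,
            "confidence": confidence, "lift": confidence / expected}

rules = [stats("CCRD", "CKCRD"), stats("CKING", "SVG"), stats("CKING", "ATM")]
rules.sort(key=lambda r: r["lift"], reverse=True)
for r in rules:
    print(r["rule"], round(r["lift"], 2))
```

In this toy data, CCRD => CKCRD tops the list with a lift of 3, echoing the demonstration's finding that the credit card/check card rules carry the highest lift.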
20. To view the descriptions of the rules, select View ⇒ Rules ⇒ Rule description.

RULE1    CKING & CCRD ==> CKCRD
RULE2    CKCRD ==> CKING & CCRD
RULE3    CKCRD ==> CCRD
RULE4    CKING & CKCRD ==> CCRD
RULE5    CCRD ==> CKCRD
RULE6    CCRD ==> CKING & CKCRD
RULE7    HMEQLC ==> CKING & CCRD
RULE8    CKING & CCRD ==> HMEQLC
RULE9    HMEQLC ==> CCRD
RULE10   HMEQLC & CKING ==> CCRD
RULE11   CCRD ==> HMEQLC
RULE12   CCRD ==> HMEQLC & CKING
RULE13   SVG & HMEQLC ==> CKING & ATM
RULE14   CKING & ATM ==> SVG & HMEQLC
RULE15   HMEQLC ==> SVG & CKING & ATM
RULE16   SVG & CKING & ATM ==> HMEQLC
RULE17   SVG & ATM ==> HMEQLC
RULE18   SVG & ATM ==> HMEQLC & CKING
RULE19   HMEQLC ==> SVG & ATM
RULE20   HMEQLC & CKING ==> SVG & ATM
RULE21   HMEQLC ==> CKING & ATM
RULE22   CKING & ATM ==> HMEQLC
RULE23   SVG & HMEQLC ==> ATM
RULE24   SVG & HMEQLC & CKING ==> ATM
RULE25   ATM ==> SVG & HMEQLC
RULE26   ATM ==> SVG & HMEQLC & CKING
RULE27   CD & ATM ==> SVG & CKING
RULE28   HMEQLC ==> ATM
RULE29   HMEQLC & CKING ==> ATM
RULE30   ATM ==> HMEQLC
RULE31   ATM ==> HMEQLC & CKING
RULE32   CKING & AUTO ==> ATM
The highest lift rule is checking and credit card implies check card. This is not surprising given that many check cards include credit card logos. Notice the symmetry in rules 1 and 2. This is not accidental because, as noted earlier, lift is symmetric.
21. Examine the rule matrix.

[Rule matrix: rules plotted by the items on the left side of the rule against the items on the right side, colored by confidence.]

The rule matrix plots the rules based on the items on the left side of the rule and the items on the right side of the rule. The points are colored, based on the confidence of the rules. For example, the rules with the highest confidence fall in a single column of the matrix. Using the interactive feature of the graph, you discover that these rules all have checking on the right side of the rule.

Another way to explore the rules found in the analysis is by plotting the Rules table.
22. Select View ⇒ Rules ⇒ Rules Table. The Rules Table window opens.
[Rules Table: for each rule, the number of relations, expected confidence, confidence, support, and lift.]
23. Select the Plot Wizard icon.

24. Choose a three-dimensional scatter plot for the type of chart, and select Next >.
25. Select the roles X, Y, and Z for the variables SUPPORT, LIFT, and CONF, respectively.
[Select Chart Roles dialog: SUPPORT, LIFT, and CONF are assigned the X, Y, and Z roles.]

26. Select Finish to generate the plot.
27. Rearrange the windows to view the data and the plot simultaneously.
[Three-dimensional scatter plot of Support(%), Lift, and Confidence(%), shown beside the data table; the high-lift rules involve combinations of CKCRD, CCRD, CKING, HMEQLC, and SVG]
Expanding the Rule column in the data table and selecting points in the three-dimensional plot enable you to quickly uncover high-lift rules from the market basket analysis while judging their confidence and support. You can use WHERE clauses in the Data Options dialog box to subset cases in which you are interested.
28. Close the Results window.
Sequence Analysis
In addition to the products owned by its customers, the bank is interested in examining the order in which the products are purchased. The sequence variable in the data set enables you to conduct a sequence analysis.
1. Add an Association node to the diagram workspace and connect it to the BANK node.
2. Rename the new node Sequence Analysis.
3. Set Export Rule by ID to Yes.
4. Examine the Sequence panel in the Properties panel.
[Sequence panel showing the Chain Count, Consolidate Time, Maximum Transaction Duration, Support Type, Support Count, and Support Percentage properties]
The options in the Sequence panel enable you to specify the following properties:
• Chain Count is the maximum number of items that can be included in a sequence. The default value is 3 and the maximum value is 10.
• Consolidate Time enables you to specify whether consecutive visits to a location or consecutive purchases over a given interval can be consolidated into a single visit for analysis purposes. For example, two products purchased less than a day apart might be considered to be a single transaction.
• Maximum Transaction Duration enables you to specify the maximum length of time for a series of transactions to be considered a sequence. For example, you might want to specify that the purchase of two products more than three months apart does not constitute a sequence.
• Support Type specifies whether the sequence analysis should use the Support Count or Support Percentage property. The default setting is Percent.
• Support Count specifies the minimum frequency required to include a sequence in the sequence analysis when the Sequence Support Type is set to Count. If a sequence has a count less than the specified value, that sequence is excluded from the output. The default setting is 2.
• Support Percentage specifies the minimum level of support to include the sequence in the analysis when the Support Type is set to Percent. If a sequence has a frequency that is less than the specified percentage of the total number of transactions, then that sequence is excluded from the output. The default percentage is 2%. Permissible values are real numbers between 0 and 100.
5. Run the diagram from the Sequence Analysis node and view the results.
6. Maximize the Statistics Line Plot window.
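As a rough illustration of how the Consolidate Time and Maximum Transaction Duration properties shape a sequence, the following Python sketch merges nearby purchases and truncates overly long spans. It is a simplification with hypothetical names and event format, not SAS Enterprise Miner's actual algorithm:

```python
def build_sequence(events, consolidate_time=0, max_duration=None):
    """Turn one customer's (time, item) purchases into an ordered sequence.

    consolidate_time: purchases within this many time units of the
        previous purchase are merged into the same sequence step.
    max_duration: if set, events arriving after the sequence has spanned
        more than this many time units are dropped.
    """
    sequence = []
    start = prev = None
    for t, item in sorted(events):
        if start is None:
            start = t
        if max_duration is not None and t - start > max_duration:
            break  # too spread out to belong to this sequence
        if prev is not None and t - prev <= consolidate_time:
            sequence[-1].add(item)   # consolidate with the previous step
        else:
            sequence.append({item})  # a new step in the sequence
        prev = t
    return sequence

# CKING and SVG half a day apart collapse into one step; the CD
# purchase 200 days out falls outside the 100-day duration limit.
steps = build_sequence([(0, "CKING"), (0.5, "SVG"), (10, "ATM"), (200, "CD")],
                       consolidate_time=1, max_duration=100)
```

With those settings the customer's sequence has two steps, {CKING, SVG} followed by {ATM}.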
[Statistics Line Plot: Confidence(%) and Support(%) by Rule Index]
The statistics line plot graphs the confidence and support for each of the rules by rule index number. The percent support is the transaction count divided by the total number of customers, which would be the maximum transaction count. The percent confidence is the transaction count divided by the transaction count for the left side of the sequence.
7. Select View ⇒ Rules ⇒ Rule description to view the descriptions of the rules.
Rule descriptions:
RULE1: CKING ==> SVG
RULE2: CKING ==> ATM
RULE3: SVG ==> ATM
RULE4: CKING ==> SVG ==> ATM
RULE5: ATM ==> ATM
RULE6: CKING ==> CD
RULE7: CKING ==> ATM ==> ATM
RULE8: CKING ==> HMEQLC
RULE9: SVG ==> CD
RULE10: CKING ==> MMDA
RULE11: CKING ==> CCRD
RULE12: CKING ==> SVG ==> CD
RULE13: SVG ==> ATM ==> ATM
RULE14: SVG ==> SVG
RULE15: CKCRD ==> CKCRD
RULE16: CKING ==> CKCRD
RULE17: CKING ==> CKCRD ==> CKCRD
RULE18: SVG ==> HMEQLC
RULE19: CKING ==> SVG ==> HMEQLC
RULE20: SVG ==> CCRD
RULE21: CKING ==> SVG ==> CCRD
The confidence for many of the rules changes after the order of service acquisition is considered.
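The support and confidence definitions above can be mimicked in a few lines of Python. This sketch is illustrative only (the helper names and customer data are hypothetical), treating each customer's acquisitions as an ordered list:

```python
def sequence_metrics(sequences, lhs, rhs):
    """Percent support and confidence for the sequence rule lhs ==> rhs."""
    def occurs(seq, pattern):
        pos = 0
        for item in pattern:
            try:
                pos = seq.index(item, pos) + 1  # item must appear after pos
            except ValueError:
                return False
        return True

    n = len(sequences)
    lhs_count = sum(occurs(s, lhs) for s in sequences.values())
    rule_count = sum(occurs(s, lhs + rhs) for s in sequences.values())
    support = 100.0 * rule_count / n             # share of all customers
    confidence = 100.0 * rule_count / lhs_count  # share of left-side matches
    return support, confidence

# Four hypothetical customers and their ordered product acquisitions.
customers = {1: ["CKING", "SVG", "ATM"], 2: ["CKING", "ATM"],
             3: ["SVG", "CKING"], 4: ["CKING", "SVG"]}
support, confidence = sequence_metrics(customers, ["CKING"], ["SVG"])
```

Customer 3 owns both products but acquired them in the wrong order, so the rule CKING ==> SVG counts only customers 1 and 4. This is exactly why confidence changes once order is considered.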
2. Conducting an Association Analysis
A store is interested in determining the associations between items purchased from the Health and Beauty Aids department and the Stationery Department. The store chose to conduct a market basket analysis of specific items purchased from these two departments. The ASSOCIATIONS data set contains information about over 400,000 transactions made over the past three months. The following products are represented in the data set:
1. bar soap
2. bows
3. candy bars
4. deodorant
5. greeting cards
6. magazines
7. markers
8. pain relievers
9. pencils
10. pens
11. perfume
12. photo processing
13. prescription medications
14. shampoo
15. toothbrushes
16. toothpaste
17. wrapping paper
There are four variables in the data set:
STORE (Nominal): identification number of the store
TRANSACTION (Nominal): transaction identification number
PRODUCT (Nominal): product purchased
QUANTITY
a. Create a new diagram. Name the diagram Transactions.
b. Create a new data source using the data set AAEM61.TRANSACTIONS.
c. Assign the variables STORE and QUANTITY the model role Rejected. These variables will not be used in this analysis. Assign the ID model role to the variable TRANSACTION and the Target model role to the variable PRODUCT.
d. Add the node for the TRANSACTIONS data set and an Association node to the diagram.
e. Change the setting for Export Rule by ID to Yes.
f. Leave the remaining default settings for the Association node and run the analysis.
g. Examine the results of the association analysis. What is the highest lift value for the resulting rules? Which rule has this value?
8.4 Chapter Summary
Pattern discovery seems to embody the promise of data mining, but there are many ways for an analysis to fail. SAS Enterprise Miner provides tools to help with data reduction, novelty detection, profiling, market basket analysis, and sequence analysis.
Cluster and segmentation analyses are similar in intent but differ in execution. In cluster analysis, the goal is to identify distinct groupings of cases across a set of inputs. In segmentation analysis, the goal is to partition cases from a single cluster into contiguous groups.
SAS Enterprise Miner offers several tools for exploring the results of a cluster and segmentation analysis. For low dimension data, you can use capabilities provided by the Graph Wizard and the Explore window. For higher dimensional data, you can choose the Segment Profile tool to understand the generated partitions.
Market basket and sequence analyses are handled by the Association tool. This tool transforms transaction data sets into rules. The value of the generated rules is gauged by confidence, support, and lift. The Association tool features a variety of plots and tables to help you explore the analysis results.
Pattern Discovery Tools: Review
• Generate cluster models using automatic settings and segmentation models with user-defined settings.
• Compare within-segment distributions of selected inputs to overall distributions. This helps you understand segment definition.
• Conduct market basket and sequence analysis on transaction data. A data source must have one target, one ID, and (if desired) one sequence variable in the data source.
8.5 Solutions
Solutions to Exercises
1. Conducting Cluster Analysis
a. Create a new diagram in your project.
1) To open a diagram, select File ⇒ New ⇒ Diagram.
2) Type the name of the new diagram, Jeans, and select OK.
b. Define the data set DUNGAREE as a data source.
1) Select File ⇒ New ⇒ Data Source....
2) In the Data Source Wizard - Metadata Source window, make sure that SAS Table is selected as the source and select Next >.
3) To choose the desired data table, select Browse....
4) Double-click on the AAEM61 library to see the data tables in the library.
5) Select the DUNGAREE data set, and then select OK.
6) Select Next >.
[Table Information: AAEM61.DUNGAREE, 6 variables, 689 observations]
7) Select Next >.
8) Select Advanced to use the Advanced Advisor, and then select Next >.
c. Determine whether the model roles and measurement levels assigned to the variables are appropriate. Examine the distribution of the variables.
1) Hold down the CTRL key and click to select the variables of interest.
2) Select Explore.
3) Select Next > and then Finish to complete the data source creation.
8-88
C h a p t e r 8 Introduction t o Pattern Discovery
| - ! D | x
Explore - AAEMG1 DUNGAREE Fife
Vjew
fictions
! iOK/l
Window
Property
Value
699 8 AAEM61 DUNGAREE
ROWS Columns Library I.I em lior Type Sample Meltiod Fetch Size Fetched Rows
-
pATA
Random Max 689
_
Appty
1
107 117 110
U B R B B B
496 296
1744
419
267
1652
1736 1837 2484 267S 1765
79 73 64
100 126
10 11
2347
1528
162 129
B B
755
613 468 209 581
1789
37
180 41
1S28
2203 1890 2342 2119 1781 2138 1216 1504 1933 2405 1752
& e
1579.8 ORIGINAL
& c
5
= 100-
! ioo
1.0
21J
41.6
61.9
92.2
1 0 2 5 1221
143 1 163.4 183.7 204.0
FASHION
There do not appear to be any unusual or missing data values.
490.8 STRETCH
1958.2
d. The variable STOREID should have the ID model role and the variable SALESTOT should have the Rejected model role.
The variable SALESTOT should be rejected because it is the sum of the other input variables in the data set. Therefore, it should not be considered as an independent input value.
[Column Metadata: SALESTOT set to Rejected and STOREID set to ID; FASHION, LEISURE, ORIGINAL, and STRETCH remain Interval inputs]
e. To add an Input Data node to the diagram workspace and select the DUNGAREE data table as the data source, drag the DUNGAREE data source onto the diagram workspace.
f. Add a Cluster node to the diagram workspace. The workspace should appear as shown.
g. Select the Cluster node. In the property sheet, select Internal Standardization ⇒ Standardization.
If you do not standardize, the clustering will occur strictly on the inputs with the largest range (Original and Leisure).
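To see why, consider a rough Python sketch with made-up store counts: without standardization, the input with the widest range dominates the Euclidean distances that clustering relies on.

```python
def standardize(column):
    """Scale a column to mean 0 and standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((x - mean) ** 2 for x in column) / n) ** 0.5
    return [(x - mean) / sd for x in column]

# Hypothetical store rows (made-up counts, not the DUNGAREE data):
# ORIGINAL sales span thousands of units while FASHION spans tens,
# so raw Euclidean distances are driven almost entirely by ORIGINAL.
original = [1579, 1652, 1736, 2484, 2675]
fashion = [79, 73, 64, 100, 126]

z_original = standardize(original)
z_fashion = standardize(fashion)
# After standardization, both inputs vary on a comparable scale and
# contribute comparably to the distance calculation.
```

The raw ranges differ by a factor of roughly twenty; the standardized ranges are nearly identical, so each jean style gets a comparable say in the segment assignments.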
h. Run the diagram from the Cluster node and examine the results.
1) Run the Cluster node and view the results.
2) To view the results, right-click the Cluster node and select Results....
[Results: Segment Plot and Mean Statistics table showing the clustering criterion and frequency for each generated segment]
The Cluster node's Automatic specification method for the number of clusters seems to generate an excessive number of clusters.
i. Specify a maximum of six clusters.
1) Select Specification Method ⇒ User Specify.
2) Select Maximum Number of Clusters ⇒ 6.
3) Run the Cluster node and view the results.
[Results: Segment Plot and Mean Statistics table for the six-cluster solution]
Apparently, all but one of the segments is well populated. There are more details about the segment composition in the next step.
j. Connect a Segment Profile node to the Cluster node.
1) Run the Segment Profile node and view the results.
Segment 1 contains stores selling a higher-than-average number of original jeans.
Segment 2 contains stores selling a higher-than-average number of stretch jeans.
Segment 3 contains stores selling small numbers of all jean styles.
Segment 4 contains stores selling a higher-than-average number of leisure jeans.
Segment 5 contains stores selling a higher-than-average number of fashion jeans.
Segment 6 contains stores selling a higher-than-average number of original jeans, but lower-than-average numbers of stretch and fashion jeans.
2. Conducting an Association Analysis
a. Create the Transactions diagram.
1) To open a new diagram in the project, select File ⇒ New ⇒ Diagram....
2) Name the new diagram Transactions and select OK.
b. Create a new data source using the data set AAEM61.TRANSACTIONS.
1) Right-click Data Sources in the project tree and select Create Data Source.
2) In the Data Source Wizard - Metadata Source window, make sure that SAS Table is selected as the source and select Next >.
3) Select Browse... to choose a data set.
4) Double-click on the AAEM61 library and select the TRANSACTIONS data set.
5) Select OK.
6) Select Next >.
[Table Information for AAEM61.TRANSACTIONS]
7) Examine the data table properties, and then select Next >.
8) Select Advanced to use the Advanced Advisor, and then select Next >.
c. Assign appropriate model roles to the variables.
1) Hold down the CTRL key and select the rows for the variables STORE and QUANTITY. In the Role column of one of these rows, select Rejected.
2) Select the TRANSACTION row and select ID as the role.
3) Select the PRODUCT row and select Target as the role.
4) Select Next >.
5) To skip decision processing, select Next >.
6) Change the role to Transaction.
7) Select Finish.
d. Add the node for the TRANSACTIONS data set and an Association node to the diagram. The workspace should appear as shown.
e. Change the setting for Export Rule by ID to Yes.
[Association node Train properties, with Export Rule by ID set to Yes]
g. Examine the results of the association analysis. Examine the Statistics Line plot.
[Statistics Line Plot: Lift, Confidence(%), Expected Confidence(%), and Support(%) by Rule Index]
Rule 1 has the highest lift value, 3.60. Looking at the output reveals that Rule 1 is the rule Toothbrush ==> Perfume.
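Lift is just confidence divided by expected confidence. A small, hypothetical Python sketch (illustrative baskets, not the course data) shows how the four rule statistics relate:

```python
def rule_metrics(baskets, lhs, rhs):
    """Confidence, expected confidence, support, and lift for lhs ==> rhs."""
    n = len(baskets)
    lhs_n = sum(lhs <= b for b in baskets)           # baskets containing lhs
    rhs_n = sum(rhs <= b for b in baskets)           # baskets containing rhs
    both_n = sum((lhs | rhs) <= b for b in baskets)  # baskets containing both
    confidence = 100.0 * both_n / lhs_n
    expected = 100.0 * rhs_n / n  # confidence if lhs carried no information
    support = 100.0 * both_n / n
    return confidence, expected, support, confidence / expected

# Five hypothetical baskets (made up for illustration).
baskets = [{"toothbrush", "perfume"}, {"toothbrush"}, {"perfume"},
           {"pens"}, {"toothbrush", "perfume", "pens"}]
conf, exp_conf, sup, lift = rule_metrics(baskets, {"toothbrush"}, {"perfume"})
```

A lift above 1 means the left-hand item raises the chance of seeing the right-hand item beyond its base rate, which is why high-lift rules are the interesting ones.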
Chapter 9 Special Topics
9.1 Introduction 9-3
9.2 Ensemble Models 9-4
Demonstration: Creating Ensemble Models 9-5
9.3 Variable Selection 9-9
Demonstration: Using the Variable Selection Node 9-10
Demonstration: Using Partial Least Squares for Input Selection 9-14
Demonstration: Using the Decision Tree Node for Input Selection 9-19
9.4 Categorical Input Consolidation 9-22
Demonstration: Consolidating Categorical Inputs 9-23
9.5 Surrogate Models 9-31
Demonstration: Describing Decision Segments with Surrogate Models 9-32
9.1 Introduction
This chapter contains a selection of optional special topics related to predictive modeling. Unlike other chapters, each section is independent of the preceding section. However, demonstrations in this chapter depend on the demonstrations found in Chapters 2 through 8.
9.2 Ensemble Models
Ensemble models combine predictions from multiple models to create a single consensus prediction.
The Ensemble node creates a new model by combining the predictions from multiple models. For prediction estimates and rankings, this is usually done by averaging. When the predictions are decisions, this is done by voting. The commonly observed advantage of ensemble models is that the combined model is better than the individual models that compose it. It is important to note that the ensemble model can be more accurate than the individual models only if the individual models disagree with one another. You should always compare the model performance of the ensemble model with the individual models.
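A tiny, hypothetical Python sketch illustrates the point about disagreement: averaging two models that err on different cases can outperform either one. The numbers below are invented for illustration.

```python
# Two hypothetical models, each 80% accurate but wrong on different cases.
y_true = [1, 0, 1, 1, 0]
model_a = [0.8, 0.4, 0.3, 0.9, 0.2]   # misclassifies case 3
model_b = [0.7, 0.55, 0.8, 0.6, 0.3]  # misclassifies case 2

def accuracy(targets, probs, cutoff=0.5):
    """Fraction of cases where the decision at the cutoff is correct."""
    return sum((p >= cutoff) == bool(t)
               for t, p in zip(targets, probs)) / len(targets)

# Averaging the probability estimates repairs both models' mistakes here,
# precisely because the two models disagree on the cases they get wrong.
ensemble = [(a + b) / 2 for a, b in zip(model_a, model_b)]
```

Had the two models made identical mistakes, the average would inherit them, which is the sense in which an ensemble helps only when its members disagree.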
Creating Ensemble Models
This demonstration continues from the Predictive Analysis diagram. This demonstration proceeds in two parts. First, a control point is added to the process flow to reduce clutter. Then, the Ensemble tool is introduced as a potential prediction model.
Diagram Control Point
Follow these steps to add a control point to your analysis diagram:
1. Select the Utility tab.
2. Drag a Control Point tool into the diagram workspace.
3. Delete all connections to the Model Comparison node.
4. Connect model nodes of interest to the Control Point node.
5. Connect the Control Point node to the Model Comparison node.
The completed process flow should appear as shown.
The Control Point node serves as a junction in the diagram. A single connection out of the Control Point node is equivalent to all the connections into the node.
6. Select the Model tab.
7. Drag an Ensemble tool into the diagram.
8. Connect the Control Point node to the Ensemble node.
9. Connect the Ensemble node to the Model Comparison node.
The completed process should appear as shown.
Ensemble Tool
The Ensemble node is not itself a model. It merely combines model predictions. For a categorical target variable, you can choose to combine models using the following functions:
• Average takes the average of the prediction estimates from the different models as the prediction from the Ensemble node. This is the default method.
• Maximum takes the maximum of the prediction estimates from the different models as the prediction from the Ensemble node.
• Voting uses one of two methods for prediction. The average method averages the prediction estimates from the models that decide the primary outcome and ignores any model that decides the secondary outcome. The proportion method ignores the prediction estimates and instead returns the proportion of models deciding the primary outcome.
Run the Model Comparison node and view the results. This enables you to see how the ensemble model compares to the individual models.
[ROC Chart for TargetB, Data Role = VALIDATE: Baseline, Regression, Ensemble, Decision Tree, Neural Network, and Probability Tree curves]
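The combination functions described above are simple to state precisely. The sketch below is a plain-Python paraphrase for a single case's primary-outcome probabilities; the helper is hypothetical and is not the Ensemble node's actual code.

```python
def combine(probabilities, method="average", cutoff=0.5):
    """Combine per-model primary-outcome probabilities for a single case."""
    if method == "average":
        return sum(probabilities) / len(probabilities)
    if method == "maximum":
        return max(probabilities)
    if method == "voting-average":
        # average only the models that decide the primary outcome
        primary = [p for p in probabilities if p >= cutoff]
        return sum(primary) / len(primary) if primary else 0.0
    if method == "voting-proportion":
        # ignore the estimates; return the share of models voting primary
        return sum(p >= cutoff for p in probabilities) / len(probabilities)
    raise ValueError("unknown method: " + method)
```

For three models reporting 0.9, 0.6, and 0.2, the four methods give roughly 0.57, 0.9, 0.75, and 0.67 respectively, which shows how strongly the choice of combination function can shift the consensus score.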
[Score Rankings plot by percentile: Ensemble, Neural Network, Regression, Decision Tree, and Probability Tree]
The ROC chart and Score Rankings plots show that the ensemble model is similar to the other models.
9.3 Variable Selection
Input Selection Alternatives
• Regression: sequential selection
• Variable Selection: univariate + forward selection (R square); tree-like selection (chi-square)
• Partial Least Squares: Variable Importance in the Projection (VIP)
• Decision Tree: split search selection
Removing redundant or irrelevant inputs from a training data set often reduces overfitting and improves prediction performance. Some prediction models (for example, neural networks) do not include methods for selecting inputs. For these models, input selection is done with a separate SAS Enterprise Miner tool. Chapter 4 introduced sequential selection to find useful inputs for regression models. Later in Chapter 5, these inputs were used to compensate for the Neural Networks tool's lack of input selection capabilities. SAS Enterprise Miner features several additional methods to perform input selection for modeling tools inherently missing this capability. The following demonstrations illustrate the use of the Variable Selection, Partial Least Squares, and Decision Tree tools for selection of potentially useful inputs.
9-10
Chapter 9 Special Topics
Using the Variable Selection Node
The Variable Selection tool provides a selection based on one of two criteria. When you use the R-squared variable selection criterion, a two-step process is followed:
1. SAS Enterprise Miner computes the squared correlation for each variable and then assigns the Rejected role to those variables that have a value less than the squared correlation criterion. (The default is 0.005.)
2. SAS Enterprise Miner evaluates the remaining (not rejected) variables using a forward stepwise R-squared regression. Variables that have a stepwise R-squared improvement less than the threshold criterion (default = 0.0005) are assigned the Rejected role.
When you use the chi-squared selection criterion, variable selection is performed using binary splits for maximizing the chi-squared value of a 2x2 frequency table. The rows of the 2x2 table are defined by the (binary) target variable. The columns of the table are formed by a partition of the training data using a selected input. Several partitions are considered for each input. For an L-level class input (binary, ordinal, or nominal), partitions are formed by comparing each input level separately to the remaining L-1 input levels, creating a collection of L possible data partitions. The partition with the highest chi-squared value is chosen as the input's best partition. For interval inputs, partitions are formed by dividing the input range into (a maximum of) 50 equal-length bins and splitting the data into two subsets at 1 of the 49 bin boundaries. The partition with the highest chi-squared statistic is chosen as the interval input's best partition. The partition/variable combination with the highest chi-squared statistic is used to split the data and the process is repeated within both subsets. The partitioning stops when no input has a chi-squared statistic in excess of a user-specified threshold. All variables not used in at least one partition are rejected.
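The one-level-versus-rest search for a class input can be sketched in Python. This is a simplified illustration with hypothetical helper names and invented data; it omits the stopping threshold and the recursive repartitioning, and is not SAS Enterprise Miner's implementation.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a 2x2 frequency table."""
    n = sum(sum(row) for row in table)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = sum(table[i]) * (table[0][j] + table[1][j]) / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

def best_partition(levels, target):
    """Best one-level-versus-rest split of a class input for a binary target."""
    best_level, best_stat = None, -1.0
    for level in set(levels):
        # rows: target 0/1; columns: input == level versus the rest
        table = [[0, 0], [0, 0]]
        for x, y in zip(levels, target):
            table[y][0 if x == level else 1] += 1
        stat = chi_squared(table)
        if stat > best_stat:
            best_level, best_stat = level, stat
    return best_level, best_stat

# Invented 3-level input where level A is strongly tied to the target.
levels = ["A"] * 10 + ["B"] * 10 + ["C"] * 10
target = [1] * 9 + [0] + [1] * 2 + [0] * 8 + [1] * 2 + [0] * 8
split_level, split_stat = best_partition(levels, target)
```

Because level A carries almost all the primary outcomes, the A-versus-rest partition yields by far the largest chi-squared statistic and would be the split chosen for this input.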
Note: The Variable Selection node's chi-squared approach is quite similar to a decision tree algorithm with its ability to detect nonlinear and nonadditive relationships between the inputs and the target. However, the method for handling categorical inputs makes it sensitive to spurious input/target correlations. Instead of the Variable Selection node's chi-squared setting, you might want to try the Decision Tree node, properly configured for input selection.
The following steps show how to use the Variable Selection tool with the R-squared setting.
1. Select the Explore tab.
2. Drag a Variable Selection tool into the diagram workspace.
3. Connect the Impute node to the Variable Selection node.
The default Target Model (input selection method) property is set to Default. With this default setting, if the target is binary and the total parameter count for a regression model exceeds 400, the chi-squared method is used. Otherwise, the R-squared method of variable selection is used.
4. Select Target Model ⇒ R-Square.
[Variable Selection node Train properties with Target Model set to R-Square]
Recall that a two-step process is performed when you apply the R-squared variable selection criterion to a binary target. The Properties panel enables you to specify the maximum number of variables that can be selected, the cutoff minimum squared correlation measure for an individual variable to be selected, and the necessary R-squared improvement for a variable to remain as an input variable.
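Step 1 of that process, screening inputs by squared correlation with the target, can be sketched as follows. The helper name and data are hypothetical, and the forward stepwise refinement of step 2 is omitted.

```python
def r_squared_screen(inputs, target, min_r2=0.005):
    """Reject inputs whose squared correlation with the target is below min_r2."""
    def corr2(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return sxy * sxy / (sxx * syy)

    keep, reject = [], []
    for name, column in inputs.items():
        (keep if corr2(column, target) >= min_r2 else reject).append(name)
    return keep, reject

# Invented columns: one tracks the binary target, one is unrelated noise.
target = [1, 0, 1, 0, 1, 0]
keep, reject = r_squared_screen(
    {"gift_avg": [0.9, 0.1, 0.8, 0.2, 0.7, 0.3],
     "noise": [0.3, 0.3, 0.4, 0.4, 0.3, 0.3]},
    target)
```

Only the input that actually tracks the target survives the squared-correlation cutoff; the noise column is assigned the Rejected role.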
Additional available options include the following:
• Use AOV16 Variables - When selected, this option requests SAS Enterprise Miner to bin interval variables into 16 equally spaced groups (AOV16). The AOV16 variables are created to help identify nonlinear relationships with the target. Bins with zero observations are eliminated, which means that an AOV16 variable can have fewer than 16 bins.
• Use Group Variables - When set to Yes, this option enables the number of levels of a class variable to be reduced, based on the relationship of the levels to the target variable.
• Use Interactions - When this option is selected, SAS Enterprise Miner evaluates two-way interactions for categorical inputs.
5. Run the Variable Selection node and view the results. The Results window opens.
[Screenshot: Variable Selection Results window with the Variables table (Name, Role, Level, Type) and the Output pane.]
6. Maximize the Variable Selection window.
9.3 Variable Selection
7. Select the ROLE column heading to sort the variables by their assigned roles.
8. Select the Reasons for Rejection column heading.
[Screenshot: Variable Selection table sorted by Role, with the Reasons for Rejection column shown. Six variables remain inputs; the rest are rejected.]
The Variable Selection node finds that most inputs have insufficient target correlation to justify keeping them. You can try these inputs in a subsequent model or adjust the R-squared settings to be less severe. Notice the input G_DemCluster is a grouping of the original DemCluster input.
Note: Binary targets generate notoriously low R-squared statistics. A more appropriate association measure might be the likelihood chi-squared statistic found in the Regression node.
Using Partial Least Squares for Input Selection
Partial least squares (PLS) regression can be thought of as a merging of multiple and principal components regression. In multiple regression, the goal is to find linear combinations of the inputs that account for as much (linear) variation in the target as possible. In principal components regression, the goal is to find linear combinations of the inputs that account for as much (linear) variation in the input space as possible, and then use these linear combinations (called principal component vectors) as the inputs to a multiple regression model. In PLS regression, the goal is to have linear combinations of the inputs (called latent vectors) that account for variation in both the inputs and the target. The technique can extract a small number of latent vectors from a set of correlated inputs that correlate with the target.

A useful feature of the PLS procedure is the inclusion of an input importance metric named variable importance in the projection (VIP). VIP quantifies the relative importance of the original input variables to the latent vectors. A sufficiently small VIP for an input (less than 0.8, by default) plus a small parameter estimate (less than 0.1, by default) permits removal of the input from the model.

The following steps demonstrate the use of the PLS tool for variable selection:
1. Select the Model tab.
2. Drag a Partial Least Squares tool into the diagram workspace.
3. Connect the Impute node to the Partial Least Squares node.
4. Select Export Selected Variables ⇒ Yes.

[Screenshot: Partial Least Squares Properties panel with Export Selected Variables set to Yes.]
5. Run the Partial Least Squares node and view the results.
[Screenshot: Partial Least Squares Results window showing the Percent Variation, Variable Selection, and Fit Statistics tables.]
6. Maximize the Percent Variation window.
Percent Variation

Number of    X variation     Total X        Y variation     Total Y
Extracted    accounted for   variation      accounted for   variation
Factors      by this factor  accounted      by this factor  accounted
                             for so far                     for so far
 1            9.3263          9.3263         2.9889          2.9889
 2            3.1016         12.4279         1.5930          4.5818
 3            2.5687         14.9966         0.4026          4.9844
 4            2.2235         17.2201         0.0909          5.0754
 5            3.4493         20.6694         0.0223          5.0977
 6            1.7022         22.3716         0.0290          5.1267
 7            1.8648         24.2364         0.0134          5.1400
 8            2.5094         26.7458         0.0113          5.1513
 9            1.0899         27.8357         0.0238          5.1751
10            1.5489         29.3846         0.0096          5.1847
11            1.1964         30.5809         0.0086          5.1934
12            0.7255         31.3065         0.0158          5.2091
13            1.2303         32.5368         0.0087          5.2178
14            1.3137         33.8505         0.0065          5.2243
15            0.8789         34.7294         0.0089          5.2332
By default, the Partial Least Squares tool extracts 15 latent vectors, or factors, from the training data set. These factors account for 35% and 5.2% of the variation in the inputs and target, respectively.
7. Maximize the Variable Selection window.
8. Select the Role column heading to sort the table by variable role.

[Screenshot: Variable Selection table sorted by Role, showing each variable's standardized parameter estimate, VIP, and rejection flags. Eight variables remain inputs; the DemCluster level indicators are rejected.]
The majority of the selected inputs relate to donation count, promotion count, time since donation, and amount of donation.
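To make the VIP screening concrete, here is a minimal single-factor PLS sketch in pure Python. A real PLS fit extracts many factors (15 above), and the formula below is the one-factor special case of VIP, so treat this as an illustration rather than the procedure's implementation.

```python
import math

def one_factor_vip(X, y):
    """VIP scores from a single NIPALS-style factor. X is a list of rows,
    y the target. With exactly one factor, VIP_j reduces to
    sqrt(p) * |w_j| for the unit-length weight vector w = X'y / ||X'y||."""
    n, p = len(X), len(X[0])
    # Center the inputs and the target
    xbar = [sum(row[j] for row in X) / n for j in range(p)]
    ybar = sum(y) / n
    Xc = [[row[j] - xbar[j] for j in range(p)] for row in X]
    yc = [v - ybar for v in y]
    # NIPALS weight vector: w proportional to X'y, normalized to unit length
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    return [math.sqrt(p) * abs(v) for v in w]
```

Because the mean squared VIP is always 1, the default 0.8 cutoff flags inputs that carry clearly less than an average share of the projection.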
Using the Decision Tree Node for Input Selection
Decision trees can be used to select inputs for flexible predictive models. They have an advantage over using a standard regression model or the Variable Selection tool's R-squared method if the inputs' relationships to the target are nonlinear or nonadditive.

1. Connect a Decision Tree node to the Impute node. Rename the Decision Tree node Selection Tree.
You can use the Tree node with default settings to select inputs. However, this tends to select too few inputs for a subsequent model. Two changes to the Tree defaults result in more inputs being selected. Generally, when you use trees to select inputs for flexible models, it is better to err on the side of too many inputs rather than too few. The model's complexity optimization method can usually compensate for the extra inputs.

Note: The following changes to the defaults act independently. You can experiment to discover which method generalizes best with your data.
2. Type 1 as the Number of Surrogate Rules value.
3. Select Subtree ⇒ Method ⇒ Largest.

[Screenshot: Decision Tree Properties panel with Number of Surrogate Rules set to 1 and Subtree Method set to Largest.]
Changing the number of surrogates enables inclusion of surrogate splits in the variable selection process. By definition, surrogate inputs are typically correlated with the selected split input. While it is usually a bad practice to include redundant inputs in predictive models, many flexible models tolerate some degree of input redundancy. The advantage of including surrogates in the variable selection is to enable inclusion of inputs that do not appear in the tree explicitly but are still important predictors of the target.

Changing the Subtree method causes the tree algorithm to not prune the tree. As with adding surrogate splits to the variable selection process, it tends to add (possibly irrelevant) inputs to the selection list.

4. Run the Selection Tree node and view the results.
5. View lines 60 through 73 of the Output window.
[Output listing: the variable importance table (Obs, NAME, LABEL, NRULES, NSURROGATES, IMPORTANCE, VIMPORTANCE, RATIO) for 17 inputs, led by LOG_GiftCnt36 and LOG_GiftCntCard36.]
The IMPORTANCE column quantifies approximately how much of the overall variability in the target each input explains. The values are normalized by the amount of variability explained by the input with the highest importance. The variable importance definition not only considers inputs selected as split variables, but it also accounts for surrogate inputs (if a positive number of surrogate rules is selected in the Properties panel). For example, the second most important input (LOG_GiftCntCard36) accounts for almost the same variability as the most important input (LOG_GiftCnt36) even though it does not explicitly appear in the tree.
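The normalization step can be sketched as below. The exact SAS formula for crediting variability to primary and surrogate rules is more involved; the square-root scaling here is an assumption made for illustration.

```python
import math

# Normalized importance sketch: each input's credited reduction in target
# variability (primary and surrogate splits combined) is divided by the
# best input's, then square-rooted, so the top input scores exactly 1.0.
# The square-root scaling is assumed, not taken from SAS documentation.
def relative_importance(ss_reduction):
    top = max(ss_reduction.values())
    return {name: math.sqrt(v / top) for name, v in ss_reduction.items()}
```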
9.4 Categorical Input Consolidation
Categorical Input Consolidation

[Slide: the levels ABCDEFG of a categorical input are combined into two groups.]

Combine categorical input levels that have similar primary outcome proportions.
Categorical inputs pose a major problem for parametric predictive models such as regressions and neural networks. Because each categorical level must be coded by an indicator variable, a single input can account for more model parameters than all other inputs combined. Decision trees, on the other hand, thrive on categorical inputs. They can easily group the distinct levels of the categorical variable together and produce good predictions. It is possible to use a decision tree to consolidate the levels of a categorical input. You simply build a decision tree using the categorical variable of interest as the sole modeling input. The split search algorithm then groups input levels with similar primary outcome proportions. The IDs for each leaf replace the original levels of the input.
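In code, the grouping idea amounts to replacing each level with the ID of the leaf its outcome proportion falls into. A minimal sketch with a single supplied cut point follows; a real tree split search would find the best grouping itself.

```python
# Consolidate categorical levels by primary-outcome proportion: levels on
# the same side of the cut point share a leaf ID, which then replaces the
# original level. The cut point is supplied here; a tree would search for it.
def consolidate_levels(level_outcomes, cut):
    leaf_ids = {}
    for level, outcomes in level_outcomes.items():
        proportion = sum(outcomes) / len(outcomes)  # primary-outcome rate
        leaf_ids[level] = "LEAF_1" if proportion >= cut else "LEAF_2"
    return leaf_ids
```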
Consolidating Categorical Inputs
Follow these steps to use a tree model to group categorical input levels and create useful inputs for regression and neural network models.

1. Connect a Decision Tree node to the Impute node, and rename the Decision Tree node Consolidation Tree.
The Consolidation Tree node is used to group the levels of DemCluster, a categorical input with more than 50 levels. You use a tree model to group these levels based on their associations with TargetB. From this grouping, a new modeling input is created. You can use this input in place of DemCluster in a regression or other model. In this way, the predictive prowess of DemCluster is incorporated into a model without the plethora of parameters needed to encode the original. The grouping can be done autonomously by simply running the Decision Tree node, or interactively by using the node's interactive training features. You use the automatic method here.
9-24
Chapter 9 Special Topics
2. Select Variables... from the Consolidation Tree Properties panel.
3. Select DemCluster ⇒ Explore.... The Explore window opens.
[Screenshot: Explore window for EMWS.Impt_TRAIN showing sample properties, a histogram of Demographic Cluster, and the data table.]
4. Maximize the DemCluster histogram.

[Screenshot: histogram of Demographic Cluster.]

The DemCluster input has more than 50 levels, but you can see the distribution of only a few of these levels.

5. Click the button in the lower right corner of the DemCluster window.
The histogram expands to show the relative frequencies of each level of DemCluster.

[Screenshot: expanded histogram of Demographic Cluster.]

The histogram reveals input levels with low case counts. This can detrimentally affect the performance of most models.

6. Close the Explore window.
7. Select Use ⇒ No for all variables in the Variables window. Select Use ⇒ Yes for DemCluster and TargetB.
[Screenshot: Variables window with Use = Yes for DemCluster and TargetB and Use = No for all other variables.]

After sorting on the column Use, the Variables window should appear as shown.

9. Select OK to close the Variables window.
10. Make these changes in the Train property group.

a. Select Assessment Measure ⇒ Average Squared Error. This optimizes the tree for prediction estimates.

b. Select Bonferroni Adjustment ⇒ No. When you evaluate a potential split, the Decision Tree tool applies, by default, a Bonferroni adjustment to the split's logworth. The adjustment penalizes the logworth of potential DemCluster splits. The penalty is calculated as the log of the number of partitions of L DemCluster levels split into two groups, or log10(2^(L-1) - 1). With 54 distinct levels, the penalty is quite large. It is also, in this case, quite unnecessary. The penalty avoids favoring inputs with many possible splits. Here you are building a tree with only one input. It is impossible to favor this input over others because there are no other inputs.

11. Make these changes in the Score property group.

a. Select Variable Selection ⇒ No. This prevents the decision tree from rejecting inputs in subsequent nodes.

b. Select Leaf Role ⇒ Input. This adds a new input (_NODE_) to the training data.
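The size of the Bonferroni penalty described in step 10 is easy to check directly. A small sketch (the helper name is ours):

```python
import math

# Bonferroni logworth penalty for splitting a nominal input with the given
# number of levels into two groups: log10 of the number of such partitions,
# which is log10(2**(levels - 1) - 1).
def nominal_split_penalty(levels):
    return math.log10(2 ** (levels - 1) - 1)
```

For the 54 distinct DemCluster levels the penalty is about 16, which dwarfs most logworths and explains why the adjustment is switched off here.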
12. Now use the Interactive Tree tool to cluster DemCluster values into related groups.

a. Select Interactive... from the Decision Tree Properties panel.
The SAS Enterprise Miner Tree Desktop Application opens.

[Screenshot: Tree View showing the root node statistics for TargetB.]
b. Right-click on the root node and select Train Node from the option menu.
[Screenshot: trained tree. The root node splits on Demographic Cluster, and further splits partition the levels into four leaves.]

The levels of DemCluster are partitioned into four groups corresponding to the four leaves of the tree. An input named _NODE_ is added to the training data. You can use the Transform Variables tool to rename _NODE_ to a more descriptive value. You can use the Replacement tool to change the level names.
9.5 Surrogate Models
Surrogate Models
[Slide: a surrogate decision boundary from a tree approximating a neural network decision boundary.]

Approximate an inscrutable model's predictions with a decision tree.
The usual criticism of neural networks and similar flexible models is the difficulty in understanding the predictions. This criticism stems from the complex parameterizations found in the model. While it is true that little insight can be gained by analyzing the actual parameters of the model, much can be gained by analyzing the resulting prediction decisions. A profit matrix or other mechanism for generating a decision threshold defines regions in the input space corresponding to the primary decision. The idea behind a surrogate model is building an easy-to-understand model that describes this region. In this way, characteristics used by the neural network to make the primary decision can be understood, even if the neural network model itself cannot be.
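A toy version of the idea: given a black-box model's solicit/ignore decisions, fit the simplest possible surrogate (one threshold on one input) by maximizing agreement with those decisions. The function name is illustrative; a real surrogate tree repeats this search over many inputs and splits.

```python
# Surrogate "stump": pick the threshold on a single input that agrees most
# often with a black box's 0/1 decisions. Returns (threshold, agreement).
def surrogate_stump(x, decisions):
    n = len(x)
    best_threshold, best_agreement = None, 0.0
    for t in sorted(set(x)):
        # Agreement: predict 1 when the input is at or above the threshold
        agree = sum(1 for xi, d in zip(x, decisions) if (xi >= t) == (d == 1))
        if agree / n > best_agreement:
            best_threshold, best_agreement = t, agree / n
    return best_threshold, best_agreement
```

The agreement rate plays the same role as the misclassification rate in the description-tree demonstration that follows: it measures how faithfully the simple model reproduces the complex one's decisions.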
Describing Decision Segments with Surrogate Models
In this demonstration, a decision tree is used to isolate cases found solicitation-worthy by a neural network model. Using the decision tree, characteristics of solicit-decision donors can be understood even if the model making the decision is a mystery.
Setting Up the Diagram

You need to attach two nodes to the modeling node that you want to study.
1. Select the Utilities tab.
2. Drag a Metadata tool into the diagram workspace.
3. Connect the Neural Network node to the Metadata node.
4. Select the Model tab.
5. Drag a Decision Tree tool into the diagram workspace.
6. Connect the Decision Tree tool to the Metadata node. Rename the Decision Tree node Description Tree.
Changing the Metadata

You must change the focus of the analysis from TargetB to the variable describing the neural network decisions. SAS Enterprise Miner automatically creates this variable in the training and validation data sets exported from a modeling node. The variable's name is D_target, where target is the name of the original target variable.

Note: D_TARGETB is automatically created on the definition of decision data only.

The following steps change the target variable from TargetB to D_TARGETB:
1. Scroll down to the Variables section in the Metadata node properties. Select Train.... The Variables window opens.
2. Select New Role ⇒ Target for D_TARGETB.
3. Select New Role ⇒ Rejected for TargetB.
4. Select OK to close the Variables dialog box.
5. Run the Metadata node. Do not view the results.
Exploring the Description Tree

Follow these steps to explore the description tree.
1. Run the Description Tree node. View the results.
2. Select View ⇒ Model ⇒ Iteration Plot from the Description Tree results.
3. Change the basis of the plot to Misclassification Rate.

[Iteration plot: train and validation misclassification rate versus number of leaves.]
The Assessment plot shows the trade-off between tree complexity and agreement with the original neural network model. The autonomously fit description agrees with the neural network about 95% of the time, and as the tree becomes more complicated, the agreement with the neural network improves. You can scrutinize this tree for the input values resulting in a solicit decision, or you can simplify the description with some accuracy loss. Surprisingly, 85% of the decisions made by the neural network model can be summarized by a three-leaf tree! The following steps generate the three-leaf description tree:

4. Close the Description Tree results window.
5. Under the Subtree properties of the description tree, change the Method to N and the Number of Leaves to 3.

[Screenshot: Description Tree properties panel with Method set to N and Number of Leaves set to 3.]

6. Run the Description Tree node and open the Results window.
7. Select View ⇒ Model ⇒ Iteration Plot from the Description Tree results.
8. Change the basis of the plot to Misclassification Rate.

[Iteration plot: train and validation misclassification rate versus number of leaves.]
The accuracy dropped to 85%, but the tree has only two rules.
[Tree diagram: the three-leaf description tree. The root splits on Transformed: Gift Count 36 Months at 1.24245; cases below the cut point split again on Median Home Value Region at 118200.]
You can conclude from this tree that the neural network is soliciting donors who gave relatively many times in the past or donors who gave fewer times, but live in more expensive neighborhoods. You should experiment with accuracy versus tree size trade-offs (and other tree options) to achieve a description of the neural network model that is both understandable and accurate. It should be noted that the inputs used to describe the tree are the same as the inputs used to build the neural network. You can also consider other inputs to describe the tree. For example, some of the monetary inputs were log-transformed. To improve understandability, you can substitute the original inputs for these without loss of model accuracy. You can also consider inputs that were not used to build the model to describe the decisions. For example, if a charity's Marketing Department was interested in soliciting new individuals similar to those selected by the neural network, but without a previous donation history, it could attempt to describe the donors using demographic inputs only.
Note: Separate sampling distorts the results shown in this section. To accurately describe the solicit and ignore decisions, you must down-weight the primary cases and up-weight the secondary cases. This can be done by defining a frequency variable as shown.
if TargetB = 1 then WEIGHT = 0.05/0.50;
else WEIGHT = 0.95/0.50;

You can define this variable using a Transform Variables tool's Expression Builder. You then set the model role of this variable to Frequency. Another approach is to build the model on the Score data set. Because the Score data set is not oversampled, you get accurate agreement percentages. This is possible because the description tree does not depend on the value of the target variables, only on the model predictions.
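The same weights in a Python sketch, using the rates from the course data (5% population response, 50-50 sample):

```python
# Frequency weight that undoes separate sampling: primary (TargetB = 1)
# cases are down-weighted and secondary cases up-weighted so that weighted
# proportions match the population rates.
def frequency_weight(target_b, population_rate=0.05, sample_rate=0.50):
    if target_b == 1:
        return population_rate / sample_rate          # 0.05 / 0.50 = 0.1
    return (1 - population_rate) / (1 - sample_rate)  # 0.95 / 0.50 = 1.9
```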
Appendix A Case Studies

A.1  Banking Segmentation Case Study             A-3
A.2  Web Site Usage Associations Case Study      A-16
A.3  Credit Risk Case Study                      A-19
A.4  Enrollment Management Case Study            A-36
A.1 Banking Segmentation Case Study

Case Study Description

A consumer bank sought to segment its customers based on historic usage patterns. Segmentation was to be used for improving contact strategies in the Marketing Department.

A sample of 100,000 active consumer customers was selected. An active consumer customer was defined as an individual or household with at least one checking account and at least one transaction on the account during a three-month study period. All transactions during the three-month study period were recorded and classified into one of four activity categories:
• traditional banking methods (TBM)
• automatic teller machine (ATM)
• point of sale (POS)
• customer service (CSC)

A three-month activity profile for each customer was developed by combining historic activity averages with observed activity during the study period. Historically, for one CSC transaction, an average customer would conduct two POS transactions, three ATM transactions, and 10 TBM transactions. Each customer was assigned this initial profile at the beginning of the study period. The initial profile was updated by adding the total number of transactions in each activity category over the entire three-month study period. The PROFILE data set contains all 100,000 three-month activity profiles. This case study describes the creation of customer activity segments based on the PROFILE data set.

Note: The diagram containing this analysis is stored as an XML file on the course data disk. You can open this file by right-clicking Diagrams ⇒ Import Diagram from XML... in SAS Enterprise Miner. All nodes in the opened file, except the data node, contain the property settings outlined in this case study. If you want to run the diagram, you need to re-create the case study data set using the metadata settings indicated below.
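The profile construction just described (historic prior plus observed counts) can be sketched as follows; the helper name is ours.

```python
# Build a three-month activity profile: start every customer at the
# historic average mix (10 TBM, 3 ATM, 2 POS, 1 CSC) and add the observed
# study-period transaction counts.
def three_month_profile(observed):
    prior = {"TBM": 10, "ATM": 3, "POS": 2, "CSC": 1}
    return {cat: prior[cat] + observed.get(cat, 0) for cat in prior}
```

This prior also explains the minimums seen later in the summary statistics (for example, CNT_TBM is never below 10 and CNT_CSC never below 1).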
Case Study Data

Name      Model Role   Measurement Level   Description
ID        ID           Nominal             Customer ID
CNT_TBM   Input        Interval            Traditional bank method transaction count
CNT_ATM   Input        Interval            ATM transaction count
CNT_POS   Input        Interval            Point-of-sale transaction count
CNT_CSC   Input        Interval            Customer service transaction count
CNT_TOT   Input        Interval            Total transaction count
Accessing and Assaying the Data

A SAS Enterprise Miner data source was defined using the metadata settings indicated above. The StatExplore node was used to provide preliminary statistics on the input variables.

The Interval Variable Summary from the StatExplore node showed no missing values but did show a surprisingly large range on the transaction counts.

Interval Variable Summary Statistics
Variable   Role    Mean      Std. Deviation   Non Missing   Missing   Minimum   Median   Maximum
CNT_ATM    INPUT   19.500    20.856           100000        0         3         13       628
CNT_CSC    INPUT   6.684     12.129           100000        0         1         2        607
CNT_POS    INPUT   11.923    20.734           100000        0         2         2        345
CNT_TBM    INPUT   68.137    101.154          100000        0         10        52       14934
CNT_TOT    INPUT   106.244   113.370          100000        0         17        89       15225

A plot of the input distributions showed highly skewed distributions for all inputs.
[Explore window for AAEM61.PROFILE (100000 rows, 6 columns): histograms of CNT_TOT, CNT_CSC, CNT_TBM, CNT_ATM, and CNT_POS, all highly right-skewed]
It would be difficult to develop meaningful segments from such highly skewed inputs. Instead of focusing on the transaction counts, it was decided to develop segments based on the relative proportions of transactions across the four categories. This required a transformation of the raw data. A Transform Variables node was connected to the PROFILE node.
The Transform Variables node was used to create category logit scores for each transaction category:

category logit score = log( transaction count(category) / ( transaction count(total) − transaction count(category) ) )
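In Python terms, the transformation amounts to the following (a minimal sketch, not the course software; the counts shown are the initial profile from above):

```python
import math

def category_logit(counts):
    """Category logit scores: log(count / (total - count)) for each category."""
    total = sum(counts.values())
    return {cat: math.log(n / (total - n)) for cat, n in counts.items()}

scores = category_logit({"TBM": 10, "ATM": 3, "POS": 2, "CSC": 1})
# scores["TBM"] == log(10 / 6), i.e. the TBM count against all other counts
```

Because every profile starts from the nonzero initial counts, no category count can equal zero or the total, so the logit is always defined.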
The transformations were created using these steps:

1. Select Formulas… in the Transform Variables node's Properties panel. The Formulas window opens.

   [Formulas window: a histogram of CNT_ATM and a table listing CNT_TBM, CNT_ATM, CNT_CSC, and CNT_POS as interval inputs]

2. Select the Create icon.
The Add Transformation dialog box opens.

   [Add Transformation dialog: Name LGT_TBM, Type N, Length 8, Level INTERVAL, Role INPUT; Formula: log(CNT_TBM/(CNT_TOT-CNT_TBM))]
3. For each transaction category, type the name and formula as indicated.

4. Select OK to add the transformation. The Add Transformation dialog box closes and you return to the Formula Builder window.

5. Select Preview to see the distribution of the newly created input.
   [Formula Builder window: a preview histogram of LGT_ATM and an Outputs table listing LGT_TBM, LGT_ATM, LGT_POS, and LGT_CSC as numeric interval inputs with formulas log(CNT_TB…), log(CNT_A…), log(CNT_P…), and log(CNT_C…)]
6. Repeat Steps 1-5 for the other three transaction categories.

7. Select OK to close the Formula Builder window.

8. Run the Transform Variables node.
Segmentation was to be based on the newly created category logit scores. Before proceeding, it was deemed reasonable to examine the joint distribution of the cases using these derived inputs. A scatter plot using any three of the four derived inputs would represent the joint distribution without significant loss of information.
A three-dimensional scatter plot was produced using the following steps:

1. Select Exported Data… from the Properties panel of the Transform Variables node. The Exported Data window opens.

2. Select the TRAIN data and select Explore…. The Explore window opens.

3. Select Plot…. The Plot Wizard opens.

4. Select a three-dimensional scatter plot.

5. Select Role X, Y, and Z for LGT_ATM, LGT_CSC, and LGT_POS, respectively.

6. Select Finish to generate the scatter plot.
[Three-dimensional scatter plot of LGT_ATM, LGT_CSC, and LGT_POS]
The scatter plot showed a single clump of cases, making this analysis a segmentation (rather than a clustering) of the customers. There were a few outlying cases with apparently low proportions on the three plotted inputs. Given that the proportions in the four original categories must sum to 1, it followed that these outlying cases must have a high proportion of transactions in the non-plotted category, TBM.
Creating Segments

Transaction segments were created using the Cluster node.
[Process flow: PROFILE → StatExplore; PROFILE → Transform Variables → Cluster]
Two changes to the Cluster node default properties were made, as indicated below. Both were related to limiting the number of clusters created to 5.

   Cluster Variable Role: Segment
   Internal Standardization: None
   Specification Method: User Specify
   Maximum Number of Clusters: 5
Because the inputs were all on the same measurement scale (category logit score), it was decided to not standardize the inputs. Only the four LGT inputs defined in the Transform Variables node were set to Default in the Cluster node.
Additional cluster interpretations were made with the Segment Profile tool.
Interpreting Segments

A Segment Profile node attached to the Cluster node helped to interpret the contents of the generated segments.
[Process flow: PROFILE → StatExplore; PROFILE → Transform Variables → Cluster → Segment Profile]
Only the LGT inputs were set to Yes in the Segment Profile node.
The following profiles were created for the generated segments:
[Segment 1 profile: Count 13157, Percent 13.16]
Segment 1 customers had a significantly higher than average use of traditional banking methods and lower than average use of all other transaction categories. This segment was labeled Brick-and-Mortar.
[Segment 2 profile: Count 25536, Percent 25.54]
Segment 2 customers had a higher than average use of traditional banking methods but were close to the distribution centers on the other transaction categories. This segment was labeled Transitionals because they seem to be transitioning from brick-and-mortar to other usage patterns.
[Segment 3 profile: Count 17839, Percent 17.84]
Segment 3 customers eschewed traditional banking methods in favor of ATMs. This segment was labeled ATMs.
[Segment 4 profile: Count 14537, Percent 14.54]
Segment 4 was characterized by a high prevalence of point-of-sale transactions and few traditional bank methods. This segment was labeled Cashless.
Segment 5 had a higher than average rate of customer service contacts and point-of-sale transactions. This segment was labeled Service.
Segment Deployment

Deployment of the transaction segmentation was facilitated by the Score node.
The Score node was attached to the Cluster node and run. The SAS Code window inside the Results window provided SAS code that was capable of transforming raw transaction counts to cluster assignments. The complete SAS scoring code is shown below.

*------------------------------------------------------------*;
* Formula Code;
*------------------------------------------------------------*;
LGT_TBM = log(CNT_TBM/(CNT_TOT-CNT_TBM));
LGT_ATM = log(CNT_ATM/(CNT_TOT-CNT_ATM));
LGT_POS = log(CNT_POS/(CNT_TOT-CNT_POS));
LGT_CSC = log(CNT_CSC/(CNT_TOT-CNT_CSC));
*------------------------------------------------------------*;
* TOOL: Clustering;
* TYPE: EXPLORE;
* NODE: Clus;
*------------------------------------------------------------*;
*** Begin Scoring Code from PROC DMVQ ***;
*** Begin Class Look-up, Standardization, Replacement;
drop _dm_bad;
_dm_bad = 0;
*** Omitted Cases;
if _dm_bad then do;
   _SEGMENT_ = .;
   Distance = .;
   goto CLUSvlex;
end;
*** omitted;
*** Compute Distances and Cluster Membership;
label _SEGMENT_ = ' ';
label Distance = ' ';
array CLUSvads [5] _temporary_;
drop _vqclus _vqmvar _vqnvar;
_vqmvar = 0;
do _vqclus = 1 to 5;
   CLUSvads[_vqclus] = 0;
end;
if not missing( LGT_ATM ) then do;
   CLUSvads[1] + ( LGT_ATM - -3.54995114884544 )**2;
   CLUSvads[2] + ( LGT_ATM - -2.2003888516185 )**2;
   CLUSvads[3] + ( LGT_ATM - -0.23695023328541 )**2;
   CLUSvads[4] + ( LGT_ATM - -1.47814712774378 )**2;
   CLUSvads[5] + ( LGT_ATM - -1.49704375204905 )**2;
end;
else _vqmvar + 1.31533540479172;
if not missing( LGT_CSC ) then do;
   CLUSvads[1] + ( LGT_CSC - -4.16334022538952 )**2;
   CLUSvads[2] + ( LGT_CSC - -3.38356120535046 )**2;
   CLUSvads[3] + ( LGT_CSC - -3.55519058753 )**2;
   CLUSvads[4] + ( LGT_CSC - -3.96526745641346 )**2;
   CLUSvads[5] + ( LGT_CSC - -2.08727391873096 )**2;
end;
else _vqmvar + 1.20270093291082;
if not missing( LGT_POS ) then do;
   CLUSvads[1] + ( LGT_POS - -4.08779761080976 )**2;
   CLUSvads[2] + ( LGT_POS - -3.27644694006696 )**2;
   CLUSvads[3] + ( LGT_POS - -3.02915771770445 )**2;
   CLUSvads[4] + ( LGT_POS - -0.9841959454775 )**2;
   CLUSvads[5] + ( LGT_POS - -2.21538937073223 )**2;
end;
else _vqmvar + 1.30942457262733;
if not missing( LGT_TBM ) then do;
   CLUSvads[1] + ( LGT_TBM - 2.62509260779666 )**2;
   CLUSvads[2] + ( LGT_TBM - 1.40885156098964 )**2;
   CLUSvads[3] + ( LGT_TBM - -0.15878507901546 )**2;
   CLUSvads[4] + ( LGT_TBM - -0.11252803970828 )**2;
   CLUSvads[5] + ( LGT_TBM - 0.22075831354075 )**2;
end;
else _vqmvar + 1.17502484629096;
_vqnvar = 5.00248575662084 - _vqmvar;
if _vqnvar <= 2.2748671456706E-12 then do;
   _SEGMENT_ = .;
   Distance = .;
end;
else do;
   _SEGMENT_ = 1;
   Distance = CLUSvads[1];
   _vqfzdst = Distance * 0.99999999999988;
   drop _vqfzdst;
   do _vqclus = 2 to 5;
      if CLUSvads[_vqclus] < _vqfzdst then do;
         _SEGMENT_ = _vqclus;
         Distance = CLUSvads[_vqclus];
         _vqfzdst = Distance * 0.99999999999988;
      end;
   end;
   Distance = sqrt(Distance * (5.00248575662084 / _vqnvar));
end;
CLUSvlex:;
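The generated code is nearest-centroid assignment: squared distances to each cluster mean are accumulated input by input, and the closest segment wins. A minimal Python sketch of that logic, using two made-up two-dimensional centroids rather than the actual cluster means:

```python
def assign_segment(point, centroids):
    """Return (1-based segment index, squared distance) of the nearest centroid."""
    best_seg, best_dist = None, float("inf")
    for seg, center in enumerate(centroids, start=1):
        dist = sum((x - c) ** 2 for x, c in zip(point, center))
        if dist < best_dist:
            best_seg, best_dist = seg, dist
    return best_seg, best_dist

# Hypothetical centroids for two segments in (LGT_TBM, LGT_CSC) space.
centroids = [(2.6, -3.5), (0.2, -1.5)]
seg, _ = assign_segment((2.4, -3.3), centroids)  # falls nearest the first centroid
```

The SAS version additionally rescales the distance when some inputs are missing, which the sketch omits.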
A.2 Web Site Usage Associations Case Study

Case Study Description

A radio station developed a Web site to broaden its audience appeal and offerings. In addition to a simulcast of the station's primary broadcast, the Web site was designed to provide services to Web users, such as podcasts, news streams, music streams, archives, and live Web music performances. The station tracked usage of these services by URL.

Analysts at the station wanted to see whether any unusual patterns existed in the combinations of services selected by its Web users. The WEBSTATION data set contains services selected by more than 1.5 million unique Web users over a two-month period in 2006. For privacy reasons, the URLs are assigned anonymous ID numbers.

The diagram containing this analysis is stored as an XML file on the course data disk. You can open this file by right-clicking Diagrams and selecting Import Diagram from XML… in SAS Enterprise Miner. All nodes in the opened file, except the data node, contain the property settings outlined in this case study. If you want to run the diagram, you need to re-create the case study data set using the metadata settings indicated below.
Case Study Data

Name     Model Role   Measurement Level   Description
ID       ID           Nominal             URL (with anonymous ID numbers)
TARGET   Target       Nominal             Web service selected
The WEBSTATION data set should be assigned the role of Transaction. This role can be assigned either in the process of creating the data source or by changing the properties of the data source inside SAS Enterprise Miner.
Accessing and Assaying the Data

A SAS Enterprise Miner data source was defined for the WEBSTATION data set using the metadata settings indicated above. The StatExplore node was used to provide preliminary statistics on the target variable. The Variable Levels Summary showed that there are 1,586,124 unique URLs in the data set and 8 distinct services.

Variable Levels Summary

Variable   Role     Number of Levels
ID         ID       1586124
TARGET     TARGET   8
To match the results shown above, the Number of Observations should be set to ALL in the properties of the StatExplore node.
A plot of the target distribution (produced from the Explore window) identified the eight levels and displayed the relative frequency in a random sample of 6000 cases.

[Bar chart of TARGET frequencies: LIVE STREAM, PODCAST, SIMULCAST, ARCHIVE, WEBSITE, EXTREF, MUSICSTREAM, NEWS]

Generating Associations

An Association node was connected to the WEBSTATION node.
[Process flow: WEBSTATION → StatExplore; WEBSTATION → Association]
A preliminary run of the Association node yielded very few association rules. It was discovered that the default minimum Support Percentage setting was too large. (Many of the URLs selected only one service, diminishing the support of all association rules.) To obtain more association rules, the minimum Support Percentage was changed to 1.0. In addition, the number of items to process was increased to 3,000,000 to account for the large training data set.
[Association node properties: Maximum Items set to 3000000, Support Percentage set to 1.0]
Using these changes, the analysis was rerun and yielded substantially more association rules. The Rules Table was used to scrutinize the results.
[Rules Table, sorted by descending lift. The top rules: WEBSITE & EXTREF ==> ARCHIVE (confidence 98.32%, support 1.69%, lift 13.42, count 26744); ARCHIVE ==> WEBSITE & EXTREF (confidence 23.02%, lift 13.42); EXTREF ==> ARCHIVE (confidence 98.07%, support 1.92%, lift 13.39, count 30419); followed by rules relating SIMULCAST, PODCAST, MUSICSTREAM, NEWS, and WEBSITE]
The following were among the interesting findings from this analysis:

• Most external referrers to the Web site pointed to the programming archive (98% confidence).
• Selecting the simulcast service tripled the chances of selecting the news service.
• Users who streamed music, downloaded podcasts, used the news service, or listened to the simulcast were less likely to go to the Web site.
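The support, confidence, and lift columns in the Rules Table all follow from simple transaction counts. A small illustrative sketch with toy transactions (not the WEBSTATION data):

```python
def rule_metrics(transactions, lhs, rhs):
    """Support, confidence, and lift of the rule lhs ==> rhs."""
    n = len(transactions)
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for t in transactions if lhs <= t)    # transactions with lhs
    n_rhs = sum(1 for t in transactions if rhs <= t)    # transactions with rhs
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    support = n_both / n
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n)   # confidence relative to baseline rate of rhs
    return support, confidence, lift

trans = [{"ARCHIVE", "EXTREF"}, {"ARCHIVE", "EXTREF"}, {"EXTREF"},
         {"ARCHIVE"}, {"NEWS"}]
s, c, l = rule_metrics(trans, {"EXTREF"}, {"ARCHIVE"})
# support 0.4, confidence 2/3, lift (2/3)/(3/5) = 10/9
```

A lift above 1 means the left-hand services make the right-hand service more likely than its overall rate, which is why the table is sorted on that column.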
A.3 Credit Risk Case Study

A bank sought to use performance on an in-house subprime credit product to create an updated risk model. The risk model was to be combined with other factors to make future credit decisions.

A sample of applicants for the original credit product was selected. Credit bureau data describing these individuals (at the time of application) was recorded. The ultimate disposition of the loan was determined (paid off or bad debt). For loans rejected at the time of application, a disposition was inferred from credit bureau records on loans obtained in a similar time frame.

The credit scoring models pursued in this case study were required to conform to the standard industry practice of transparency and interpretability. This eliminated certain modeling tools from consideration (for example, neural networks) except for comparison purposes. If a neural network significantly outperformed a regression, for example, it could be interpreted as a sign of lack of fit for the regression. Measures could then be taken to improve the regression model.
The diagram containing this analysis is stored as an XML file on the course data disk. You can open this file by right-clicking Diagrams and selecting Import Diagram from XML… in SAS Enterprise Miner. All nodes in the opened file, except the data node, contain the property settings outlined in this case study. If you want to run the diagram, you need to re-create the case study data set using the metadata settings indicated below.
Case Study Training Data

Name              Role     Level      Label
TARGET            Target   Binary
BanruptcyInd      Input    Binary     Bankruptcy Indicator
TLBadDerogCnt     Input    Interval   Number Bad Debt plus Public Derogatories
CollectCnt        Input    Interval   Number Collections
InqFinanceCnt24   Input    Interval   Number Finance Inquiries 24 Months
InqCnt06          Input    Interval   Number Inquiries 6 Months
DerogCnt          Input    Interval   Number Public Derogatories
TLDel3060Cnt24    Input    Interval   Number Trade Lines 30 or 60 Days 24 Months
TL50UtilCnt       Input    Interval   Number Trade Lines 50 pct Utilized
TLDel60Cnt24      Input    Interval   Number Trade Lines 60 Days or Worse 24 Months
TLDel60CntAll     Input    Interval   Number Trade Lines 60 Days or Worse Ever
TL75UtilCnt       Input    Interval   Number Trade Lines 75 pct Utilized
TLDel90Cnt24      Input    Interval   Number Trade Lines 90+ 24 Months
TLBadCnt24        Input    Interval   Number Trade Lines Bad Debt 24 Months
TLDel60Cnt        Input    Interval   Number Trade Lines Currently 60 Days or Worse
TLSatCnt          Input    Interval   Number Trade Lines Currently Satisfactory
TLCnt12           Input    Interval   Number Trade Lines Opened 12 Months
TLCnt24           Input    Interval   Number Trade Lines Opened 24 Months
TLCnt03           Input    Interval   Number Trade Lines Opened 3 Months
TLSatPct          Input    Interval   Percent Satisfactory to Total Trade Lines
TLBalHCPct        Input    Interval   Percent Trade Line Balance to High Credit
TLOpenPct         Input    Interval   Percent Trade Lines Open
TLOpen24Pct       Input    Interval   Percent Trade Lines Open 24 Months
TLTimeFirst       Input    Interval   Time Since First Trade Line
InqTimeLast       Input    Interval   Time Since Last Inquiry
TLTimeLast        Input    Interval   Time Since Last Trade Line
TLSum             Input    Interval   Total Balance All Trade Lines
TLMaxSum          Input    Interval   Total High Credit All Trade Lines
TLCnt             Input    Interval   Total Open Trade Lines
ID                ID       Nominal
Accessing and Assaying the Data

A SAS Enterprise Miner data source was defined for the CREDIT data set using the metadata settings indicated above. The data source definition was expedited by customizing the Advanced Metadata Advisor in the Data Source Wizard as indicated.
   Missing Percentage Threshold: 50
   Reject Vars with Excessive Missing Values: Yes
   Class Levels Count Threshold:
   Detect Class Levels: Yes
   Reject Levels Count Threshold: 20
   Reject Vars with Excessive Class Values: Yes

Class Levels Count Threshold: if Detect Class Levels is Yes, interval variables with fewer levels than the number specified for this property will be marked as NOMINAL. The default value is 20.
With this change, all metadata was set correctly by default.
Decision processing was selected in Step 7 of the Data Source Wizard.

[Decision Configuration dialog, Decisions tab: Do you want to use the decisions? Yes. Two decisions, DECISION1 and DECISION2, each with cost variable <None> and constant cost 0.0, and a Default with Inverse Prior Weights option]
The Decisions option Default with Inverse Prior Weights was selected to provide the values in the Decision Weights tab.
[Decision Weights tab: decision function Maximize, with weights of 5.9938002… and 1.2000480… for the correct decisions and 0.0 otherwise]
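Inverse prior weights are simply the reciprocals of the target priors, so each correct decision earns 1/p(level) and every case contributes equally regardless of outcome rarity. A quick sketch, with priors approximated from the class variable summary (the exact priors used by the wizard differ slightly):

```python
# Approximate CREDIT target priors: about 16.7% primary (1), 83.3% secondary (0).
priors = {1: 0.1668, 0: 0.8333}

# Weight for the correct decision on each level is the reciprocal of its prior.
weights = {level: 1.0 / p for level, p in priors.items()}
# weights[0] is about 1.200048, matching the Decision Weights tab
```

The rarer primary outcome therefore receives the much larger weight, which is what makes profit under this matrix behave like a class-balanced accuracy measure.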
It can be shown that, theoretically, the so-called central decision rule optimizes model performance based on the KS statistic. The StatExplore node was used to provide preliminary statistics on the target variable.
BanruptcyInd and TARGET were the only two class variables in the CREDIT data set.

Class Variable Summary Statistics

Variable       Role     Number of Levels   Missing   Mode   Mode Percentage   Mode2   Mode2 Percentage
BanruptcyInd   INPUT    2                  0         0      84.67             1       15.33
TARGET         TARGET   2                  0         0      83.33             1       16.67
The Interval Variable Summary shows missing values on 11 of the 27 interval inputs.

Interval Variable Summary Statistics

Variable          Role    Mean       Std. Deviation   Non Missing   Missing   Minimum   Median     Maximum
CollectCnt        INPUT   0.86       2.16             3000          0         0         0.00       50.00
DerogCnt          INPUT   1.43       2.73             3000          0         0         0.00       51.00
InqCnt06          INPUT   3.11       3.48             3000          0         0         2.00       40.00
InqFinanceCnt24   INPUT   3.56       4.48             3000          0         0         2.00       48.00
InqTimeLast       INPUT   3.11       4.64             2812          188       0         1.00       24.00
TL50UtilCnt       INPUT   4.08       3.11             2901          99        0         3.00       23.00
TL75UtilCnt       INPUT   3.12       2.61             2901          99        0         3.00       20.00
TLBadCnt24        INPUT   0.57       1.32             3000          0         0         0.00       16.00
TLBadDerogCnt     INPUT   1.41       2.46             3000          0         0         0.00       47.00
TLBalHCPct        INPUT   0.65       0.27             2959          41        0         0.70       3.36
TLCnt             INPUT   7.88       5.42             2997          3         0         7.00       40.00
TLCnt03           INPUT   0.28       0.58             3000          0         0         0.00       7.00
TLCnt12           INPUT   1.82       1.93             3000          0         0         1.00       15.00
TLCnt24           INPUT   3.88       3.40             3000          0         0         3.00       28.00
TLDel3060Cnt24    INPUT   0.73       1.16             3000          0         0         0.00       8.00
TLDel60Cnt        INPUT   1.52       2.81             3000          0         0         0.00       38.00
TLDel60Cnt24      INPUT   1.07       1.81             3000          0         0         0.00       20.00
TLDel60CntAll     INPUT   2.52       3.41             3000          0         0         1.00       45.00
TLDel90Cnt24      INPUT   0.81       1.61             3000          0         0         0.00       19.00
TLMaxSum          INPUT   31205.90   29092.91         2960          40        0         24187.00   271036.00
TLOpen24Pct       INPUT   0.56       0.48             2997          3         0         0.50       6.00
TLOpenPct         INPUT   0.50       0.21             2997          3         0         0.50       1.00
TLSatCnt          INPUT   13.51      8.93             2996          4         0         12.00      57.00
TLSatPct          INPUT   0.52       0.23             2996          4         0         0.53       1.00
TLSum             INPUT   20151.10   19682.09         2960          40        0         15546.00   210612.00
TLTimeFirst       INPUT   170.11     92.81            3000          0         6         151.00     933.00
TLTimeLast        INPUT   11.87      16.32            3000          0         0         7.00       342.00
By creating plots using the Explore window, it was found that several of the interval inputs show somewhat skewed distributions. Transformation of the more severe cases was pursued in regression modeling.
[Explore window: histograms of all 27 interval inputs (the count, delinquency, inquiry, utilization, balance, and time-since variables), most showing right-skewed distributions]
Creating Prediction Models: Simple Stepwise Regression

Because it was the most likely model to be selected for deployment, a regression model was considered first.
• In the Data Partition node, 50% of the data was chosen for training and 50% for validation.
• The Impute node replaced missing values for the interval inputs with the input mean (the default for interval-valued input variables), and added unique imputation indicators for each input with missing values.
• The Regression node used the Stepwise method for input variable selection, and Validation Profit for complexity optimization.

The selected model included seven inputs.

Analysis of Maximum Likelihood Estimates

Parameter         DF   Estimate   Standard   Wald         Pr > ChiSq   Standardized   Exp(Est)
                                  Error      Chi-Square                Estimate
Intercept         1    -2.7602    0.4089     45.57        <.0001                      0.063
IMP_TLBalHCPct    1    1.8759     0.3295     32.42        <.0001       0.2772         6.527
IMP_TLSatPct      1    -2.6095    0.4515     33.40        <.0001       -0.3363        0.074
InqFinanceCnt24   1    0.0610     0.0149     16.86        <.0001       0.1527         1.063
TLDel3060Cnt24    1    0.3359     0.0623     29.11        <.0001       0.2108         1.399
TLDel60Cnt24      1    0.1126     0.0408     7.62         0.0058       0.1102         1.119
TLOpenPct         1    1.5684     0.4633     11.46        0.0007       0.1792         4.799
TLTimeFirst       1    -0.00253   0.000923   7.50         0.0062       -0.1309        0.997
The odds ratio estimates facilitated model interpretation. Increasing risk was associated with increasing values of IMP_TLBalHCPct, InqFinanceCnt24, TLDel3060Cnt24, TLDel60Cnt24, and TLOpenPct. Increasing risk was associated with decreasing values of IMP_TLSatPct and TLTimeFirst.

Odds Ratio Estimates

Effect            Point Estimate
IMP_TLBalHCPct    6.527
IMP_TLSatPct      0.074
InqFinanceCnt24   1.063
TLDel3060Cnt24    1.399
TLDel60Cnt24      1.119
TLOpenPct         4.799
TLTimeFirst       0.997
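Each point estimate is just the exponential of the corresponding coefficient from the maximum likelihood table, so the two reports can be checked against each other. A quick sketch with three of the coefficients:

```python
import math

# Coefficients from the maximum likelihood estimates table.
coefficients = {
    "IMP_TLBalHCPct": 1.8759,
    "IMP_TLSatPct": -2.6095,
    "InqFinanceCnt24": 0.0610,
}

# Odds ratio = exp(coefficient), rounded the way the report displays it.
odds_ratios = {name: round(math.exp(b), 3) for name, b in coefficients.items()}
# {"IMP_TLBalHCPct": 6.527, "IMP_TLSatPct": 0.074, "InqFinanceCnt24": 1.063}
```

An odds ratio above 1 (positive coefficient) means the input raises the predicted odds of the primary outcome; below 1 means it lowers them.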
The iteration plot (found by selecting View → Model → Iteration Plot in the Results window) can be set to show average profit versus iteration.
[Iteration plot: Train and Valid Average Profit for TARGET versus Model Selection Step Number]
In theory, the average profit for a model using the defined profit matrix equals 1 + KS statistic. Thus, the iteration plot (from the Regression node's Results window) showed how the profit (or, in turn, the KS statistic) varied with model complexity. From the plot, the maximum validation profit equaled 1.43, which implies that the maximum KS statistic equaled 0.43. The actual calculated value of KS (as found using the Model Comparison node) was found to differ slightly from this value (see below).
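The KS statistic itself is the maximum separation between the cumulative capture rates of the two outcomes across score thresholds. A small sketch with toy scores (not the CREDIT model):

```python
def ks_statistic(scores, labels):
    """Max over thresholds of |TPR - FPR| for binary labels (1 = primary outcome)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best = 0.0
    for t in sorted(set(scores), reverse=True):
        tpr = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t) / positives
        fpr = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t) / negatives
        best = max(best, abs(tpr - fpr))
    return best

ks = ks_statistic([0.9, 0.8, 0.7, 0.1], [1, 0, 1, 0])
# ks == 0.5; under the inverse-prior profit matrix the expected average
# profit of the corresponding decision rule is about 1 + ks
```

Perfectly separating scores give KS = 1, and a model no better than chance gives KS near 0, which matches the 1-to-2 range of average profit seen in the iteration plots.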
Creating Prediction Models: Neural Network

Although it could not be deployed as the final prediction model, a neural network was used to investigate regression lack of fit.
[Process flow: a Neural Network node attached after the Regression node]
The default settings of the Neural Network node were used in combination with inputs selected by the Stepwise Regression node.
The iteration plot showed slightly higher validation average profit compared to the stepwise regression model.
[Iteration plot: Train and Valid Average Profit for TARGET versus Training Iteration]
It was possible (although not likely) that transformations to the regression inputs could improve regression prediction.
Creating Prediction Models: Transformed Stepwise Regression

In assaying the data, it was noted that some of the inputs had rather skewed distributions. Such distributions create high leverage points that can distort an input's association with the target. The Transform Variables node was used to regularize the distributions of the model inputs before fitting the stepwise regression.
The Transform Variables node was set to maximize the normality of each interval input by selecting from one of several power and logarithmic transformations.

   Default Methods
   Interval Inputs: Maximum Normal
   Interval Targets: None
   Class Inputs: None
   Class Targets: None
The result of this setting was to apply a log transformation to each interval input. The Transformed Stepwise Regression node performed stepwise selection from the transformed inputs. The selected model had many of the same inputs as the original stepwise regression model, but on a transformed (and difficult to interpret) scale.

Odds Ratio Estimates

Effect                 Point Estimate
LOG_IMP_TL75UtilCnt             9.243
LOG_IMP_TLBalHCPct            515.899
LOG_IMP_TLSatPct                0.015
LOG_InqFinanceCnt24            15.799
LOG_TLDel3060Cnt24             17.508
LOG_TLDel60Cnt24                7.914
LOG_TLOpenPct                   6.616
LOG_TLTimeFirst                 0.045
The transformations would be justified (despite the increased difficulty in model interpretation) if they resulted in significant improvement in model fit. Based on the profit calculation, the transformed stepwise regression model showed only marginal performance improvement compared to the original stepwise regression model.

[Iteration plot: Train and Valid Average Profit for TARGET versus model selection step number]
Creating Prediction Models: Discretized Stepwise Regression

Partitioning input variables into discrete ranges was another common risk-modeling method that was investigated.
[Diagram: the CREDIT analysis extended with Bucket, Quantile, and Optimal Discretized input transformations, each feeding its own stepwise regression node]
Three discretization approaches were investigated. The Bucket Input Variables node partitioned each interval input into four bins with equal widths. The Bin Input Variables node partitioned each interval input into four bins with equal sizes. The Optimal Discrete Input Variables node found optimal partitions for each input variable using decision tree methods.

Bucket Transformation

The relatively small size of the CREDIT data set resulted in problems for the bucket stepwise regression model. Many of the bins had a small number of observations, which resulted in quasi-complete separation problems for the regression model, as dramatically illustrated by the selected model's odds ratio report.

Odds Ratio Estimates

Effect               Comparison                                   Point Estimate
BIN_IMP_TL75UtilCnt  01: low-5 vs 04: 15-high                            999.000
BIN_IMP_TL75UtilCnt  02: 5-10 vs 04: 15-high                             999.000
BIN_IMP_TL75UtilCnt  03: 10-15 vs 04: 15-high                            999.000
BIN_IMP_TLBalHCPct   01: low-0.840325 vs 04: 2.520975-high                <0.001
BIN_IMP_TLBalHCPct   02: 0.840325-1.68065 vs 04: 2.520975-high            <0.001
BIN_IMP_TLBalHCPct   03: 1.68065-2.520975 vs 04: 2.520975-high            <0.001
BIN_IMP_TLSatPct     01: low-0.25 vs 04: 0.75-high                         4.845
BIN_IMP_TLSatPct     02: 0.25-0.5 vs 04: 0.75-high                         1.819
BIN_IMP_TLSatPct     03: 0.5-0.75 vs 04: 0.75-high                         1.009
BIN_InqFinanceCnt24  01: low-9.75 vs 04: 29.25-high                        0.173
BIN_InqFinanceCnt24  02: 9.75-19.5 vs 04: 29.25-high                       0.381
BIN_InqFinanceCnt24  03: 19.5-29.25 vs 04: 29.25-high                      0.640
BIN_TLDel3060Cnt24   01: low-2 vs 04: 6-high                             999.000
BIN_TLDel3060Cnt24   02: 2-4 vs 04: 6-high                               999.000
BIN_TLDel60CntAll    01: low-4.75 vs 04: 14.25-high                        0.171
BIN_TLDel60CntAll    02: 4.75-9.5 vs 04: 14.25-high                        0.138
BIN_TLDel60CntAll    03: 9.5-14.25 vs 04: 14.25-high                       0.166
BIN_TLTimeFirst      01: low-198.75 vs 04: 584.25-high                   999.000
BIN_TLTimeFirst      02: 198.75-391.5 vs 04: 584.25-high                 999.000
BIN_TLTimeFirst      03: 391.5-584.25 vs 04: 584.25-high                 999.000
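The difference between the equal-width (bucket) and equal-size (quantile) schemes is easy to demonstrate. The Python sketch below is illustrative only (the data values are invented): on a skewed input, equal-width bins can end up nearly empty, which is exactly the condition that produces quasi-complete separation in a regression on the binned input.

```python
import numpy as np

def bucket_bins(x, k=4):
    """Equal-width binning (the Bucket transformation): k bins of equal width."""
    edges = np.linspace(x.min(), x.max(), k + 1)
    return np.clip(np.searchsorted(edges, x, side="right") - 1, 0, k - 1)

def quantile_bins(x, k=4):
    """Equal-size binning (the Bin/quantile transformation): k bins holding
    roughly equal numbers of observations."""
    edges = np.quantile(x, np.linspace(0, 1, k + 1))
    return np.clip(np.searchsorted(edges, x, side="right") - 1, 0, k - 1)

# A skewed hypothetical input: one extreme value stretches the equal-width bins.
x = np.array([1, 1, 2, 2, 3, 3, 4, 5, 6, 8, 12, 40], dtype=float)
print(np.bincount(bucket_bins(x), minlength=4))    # very unbalanced counts
print(np.bincount(quantile_bins(x), minlength=4))  # roughly balanced counts
```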
The iteration plot showed substantially worse performance compared to the other modeling efforts.
[Iteration plot: Train and Valid Average Profit for TARGET versus model selection step number]
Bin Transformation

Somewhat better results were seen with the binned stepwise regression model. By ensuring that each bin included a reasonable number of cases, more stable model parameter estimates could be made.

Odds Ratio Estimates

Effect                Comparison                              Point Estimate
PCTL_IMP_TLBalHCPct   01: low-0.513 vs 04: 0.8389-high                 0.272
PCTL_IMP_TLBalHCPct   02: 0.513-0.7041 vs 04: 0.8389-high              0.452
PCTL_IMP_TLBalHCPct   03: 0.7041-0.8389 vs 04: 0.8389-high             0.630
PCTL_IMP_TLSatPct     01: low-0.3529 vs 04: 0.6886-high                1.860
PCTL_IMP_TLSatPct     02: 0.3529-0.5333 vs 04: 0.6886-high             1.130
PCTL_IMP_TLSatPct     03: 0.5333-0.6886 vs 04: 0.6886-high             1.040
PCTL_InqFinanceCnt24  01: low-1 vs 04: 5-high                          0.599
PCTL_InqFinanceCnt24  02: 1-2 vs 04: 5-high                            0.404
PCTL_InqFinanceCnt24  03: 2-5 vs 04: 5-high                            0.807
PCTL_TLDel3060Cnt24   02: 0-1 vs 03: 1-high                            0.453
PCTL_TLDel60Cnt24     02: 0-1 vs 03: 1-high                            0.357
PCTL_TLTimeFirst      01: low-107 vs 04: 230-high                      1.688
PCTL_TLTimeFirst      02: 107-152 vs 04: 230-high                      1.477
PCTL_TLTimeFirst      03: 152-230 vs 04: 230-high                      0.837
The improved model fit was also seen in the iteration plot, although the average profit of the selected model was still not as large as the original stepwise regression model's.
[Iteration plot: Train and Valid Average Profit for TARGET versus model selection step number]
Optimal Transformation

A final attempt at discretization was made using the optimistically named Optimal Discrete transformation. The final 18-degree-of-freedom model included 10 separate inputs (more than any other model).

Odds Ratio Estimates

Effect               Comparison                                  Point Estimate
BanruptcyInd         0 vs 1                                               2.267
OPT_IMP_TL75UtilCnt  01: low-1.5 vs 03: 8.5-high                          0.270
OPT_IMP_TL75UtilCnt  02: 1.5-8.5, MISSING vs 03: 8.5-high                 0.409
OPT_IMP_TLBalHCPct   01: low-0.6706, MISSING vs 04: 1.0213-high           0.090
OPT_IMP_TLBalHCPct   02: 0.6706-0.86785 vs 04: 1.0213-high                0.155
OPT_IMP_TLBalHCPct   03: 0.86785-1.0213 vs 04: 1.0213-high                0.250
OPT_IMP_TLSatPct     01: low-0.2094 vs 03: 0.4655-high                    5.067
OPT_IMP_TLSatPct     02: 0.2094-0.4655 vs 03: 0.4655-high                 1.970
OPT_InqFinanceCnt24  01: low-2.5, MISSING vs 03: 7.5-high                 0.353
OPT_InqFinanceCnt24  02: 2.5-7.5 vs 03: 7.5-high                          0.657
OPT_TLDel3060Cnt24   01: low-1.5, MISSING vs 02: 1.5-high                 0.499
OPT_TLDel60Cnt       01: low-0.5, MISSING vs 03: 14.5-high                0.084
OPT_TLDel60Cnt       02: 0.5-14.5 vs 03: 14.5-high                        0.074
OPT_TLDel60Cnt24     01: low-0.5, MISSING vs 03: 5.5-high                 0.327
OPT_TLDel60Cnt24     02: 0.5-5.5 vs 03: 5.5-high                          0.882
OPT_TLTimeFirst      01: low-154.5, MISSING vs 02: 154.5-high             1.926
TLOpenPct                                                                 3.337
The validation average profit was still slightly smaller than the original model's. A substantial difference in profit between the training and validation data was also observed. Such a difference is suggestive of overfitting by the model.

[Iteration plot: Train and Valid Average Profit for TARGET versus model selection step number]
Assessing the Prediction Models

The collection of models was assessed using the Model Comparison node. The ROC chart shows a jumble of models with no clear winner.
[ROC chart (Data Role = VALIDATE): Baseline, Neural Network, Regression, Transformed Stepwise Regression, Bucket Stepwise Regression, Quantile Stepwise Regression, and Optimal Discretized Stepwise Regression]
The Fit Statistics table from the Output window is shown below.

Data Role = Valid

Statistic                                Reg      Neural   Reg5     Reg2     Reg4     Reg3
Kolmogorov-Smirnov Statistic             0.43     0.46     0.42     0.44     0.45     0.39
Average Profit for TARGET                1.43     1.42     1.42     1.42     1.41     1.38
Average Squared Error                    0.12     0.12     0.12     0.12     0.12     0.13
ROC Index                                0.77     0.77     0.76     0.78     0.77     0.73
Average Error Function                   0.38     0.39     0.40     0.38     0.39     0.43
Percent Capture Response                 14.40    12.00    11.60    14.40    12.64    9.60
Divisor for VASE                         3000.00  3000.00  3000.00  3000.00  3000.00  3000.00
Error Function                           1152.26  1168.64  1186.46  1131.42  1158.23  1282.59
Gain                                     180.00   152.00   148.00   192.00   144.89   124.00
Gini Coefficient                         0.54     0.54     0.53     0.56     0.54     0.47
Bin-Based Two-Way KS Statistic           0.43     0.44     0.41     0.44     0.45     0.39
Lift                                     2.88     2.40     2.32     2.88     2.53     1.92
Maximum Absolute Error                   0.97     0.99     1.00     0.98     0.99     1.00
Misclassification Rate                   0.17     0.17     0.17     0.17     0.17     0.17
Mean Square Error                        0.12     0.12     0.12     0.12     0.12     0.13
Sum of Frequencies                       1500.00  1500.00  1500.00  1500.00  1500.00  1500.00
Total Profit for TARGET                  2143.03  2131.02  2127.45  2127.44  2121.42  2072.25
Root Average Squared Error               0.35     0.35     0.35     0.34     0.35     0.36
Percent Response                         48.00    40.00    38.67    48.00    42.13    32.00
Root Mean Square Error                   0.35     0.35     0.35     0.34     0.35     0.36
Sum of Square Errors                     359.70   367.22   371.58   352.69   366.76   381.44
Sum of Case Weights Times Freq           3000.00  3000.00  3000.00  3000.00  3000.00  3000.00
The best model, as measured by average profit, was the original regression. The neural network had the highest KS statistic. If the purpose of a credit risk model is to rank cases, then Reg2, the log-transformed regression, would be preferred because it had the highest ROC index. In short, the best model for deployment was as much a matter of taste as of statistical performance. The relatively small validation data set used to compare the models did not produce a clear winner. In the end, the model selected for deployment was the original stepwise regression, because it offered consistently good performance across multiple assessment measures.
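The ROC index and Gini coefficient reported above can be computed directly from the pairwise-concordance definition given at the start of the chapter (ROC index = proportion of concordant pairs plus one-half the proportion of tied pairs; Gini = 2 × (ROC index − 0.5)). The Python sketch below uses invented data, not the CREDIT data.

```python
import numpy as np

def roc_index(y, p):
    """ROC index: proportion of concordant (primary, secondary) pairs
    plus one-half the proportion of tied pairs."""
    primary = p[y == 1]
    secondary = p[y == 0]
    concordant = tied = 0
    for score in primary:
        concordant += (score > secondary).sum()
        tied += (score == secondary).sum()
    n_pairs = len(primary) * len(secondary)
    return (concordant + 0.5 * tied) / n_pairs

# Hypothetical labels (1 = primary outcome) and predicted probabilities.
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])
p = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1])

auc = roc_index(y, p)
gini = 2 * (auc - 0.5)
print(auc, gini)
```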
A.4 Enrollment Management Case Study

Case Study Description

In the fall of 2004, the administration of a large private university requested that the Office of Enrollment Management and the Office of Institutional Research work together to identify prospective students who would most likely enroll as new freshmen in the Fall 2005 semester. The administration stated several goals for this project:
• increase new freshman enrollment
• increase diversity
• increase SAT scores of entering students.

Historically, inquiries numbered about 90,000 students, and the university enrolled from 2,400 to 2,800 new freshmen each Fall semester.
The diagram containing this analysis is stored as an XML file on the course data disk. You can open this file by right-clicking Diagrams ⇒ Import Diagram from XML... in SAS Enterprise Miner. All nodes in the opened file, except the data node, contain the property settings outlined in this case study. If you want to run the diagram, you need to re-create the case study data set using the metadata settings indicated below.
Case Study Training Data

Name                 Model Role  Measurement Level  Description
ACADEMIC_INTEREST_1  Rejected    Nominal            Primary academic interest code
ACADEMIC_INTEREST_2  Rejected    Nominal            Secondary academic interest code
CAMPUS_VISIT         Input       Nominal            Campus visit code
CONTACT_CODE1        Rejected    Nominal            First contact code
CONTACT_DATE1        Rejected    Nominal            First contact date
ETHNICITY            Rejected    Nominal            Ethnicity
ENROLL               Target      Binary             1=Enrolled F2004, 0=Not enrolled F2004
IRSCHOOL             Rejected    Nominal            High school code
INSTATE              Input       Binary             1=In state, 0=Out of state
LEVEL_YEAR           Rejected    Unary              Student academic level
REFERRAL_CNTCTS      Input       Ordinal            Referral contact count
SELF_INIT_CNTCTS     Input       Interval           Self-initiated contact count
SOLICITED_CNTCTS     Input       Ordinal            Solicited contact count
TERRITORY            Input       Nominal            Recruitment area
TOTAL_CONTACTS       Input       Interval           Total contact count
TRAVEL_INIT_CNTCTS   Input       Ordinal            Travel-initiated contact count
AVG_INCOME           Input       Interval           Commercial HH income estimate
DISTANCE             Input       Interval           Distance from university
HSCRAT               Input       Interval           5-year high school enrollment rate
INIT_SPAN            Input       Interval           Time from first contact to enrollment date
INT1RAT              Input       Interval           5-year primary interest code rate
INT2RAT              Input       Interval           5-year secondary interest code rate
INTEREST             Input       Ordinal            Number of indicated extracurricular interests
MAILQ                Input       Ordinal            Mail qualifying score (1=very interested)
PREMIERE             Input       Binary             1=Attended campus recruitment event, 0=Did not
SATSCORE             Rejected    Interval           SAT (original) score
SEX                  Rejected    Binary             Sex
STUEMAIL             Input       Binary             1=Have e-mail address, 0=Do not
TELECQ               Rejected    Ordinal            Telecounseling qualifying score (1=very interested)
The Office of Institutional Research assumed the task of building a predictive model, and the Office of Enrollment Management served as consultant to the project. The Office of Institutional Research built and maintained a data warehouse that contained information about enrollment for the past six years. It was decided that inquiries for Fall 2004 would be used to build the model to help shape the Fall 2005 freshman class.

The data set Inq2005 was built over a period of several months in consultation with Enrollment Management. The data set included variables that could be classified as demographic, financial, number of correspondences, student interests, and campus visits. Many variables were created using historical data and trends. For example, the high school code was replaced by the percentage of inquirers from that high school over the past five years who enrolled. The resulting data set included over 90,000 observations and over 50 variables. For this case study, the number of variables was reduced.

The data set Inq2005 is in the AAEM library, and the variables are described in the table above. Some of the variables were automatically rejected based on the number of missing values. The nominal variables ACADEMIC_INTEREST_1, ACADEMIC_INTEREST_2, and IRSCHOOL were rejected because they were replaced by the interval variables INT1RAT, INT2RAT, and HSCRAT, respectively. For example, academic interest codes 1 and 2 were replaced by the percentage of inquirers over the past five years who indicated those interest codes and then enrolled. The variable IRSCHOOL is the high school code of the student, and it was replaced by the percentage of inquirers from that high school over the last five years who enrolled. The variables ETHNICITY and SEX were rejected because they cannot be used in admission decisions. Several variables count the various types of contacts the university has with the students.
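The historical-rate recoding described above (for example, HSCRAT) can be sketched as a simple lookup built from past inquiries. The Python below is illustrative only: the school codes and outcomes are invented, and the fallback-to-overall-rate rule for unseen codes is an assumption, not something specified in the case study.

```python
from collections import defaultdict

# Hypothetical historical inquiries: (high_school_code, enrolled_flag).
history = [
    ("HS01", 1), ("HS01", 0), ("HS01", 1), ("HS02", 0),
    ("HS02", 0), ("HS03", 1), ("HS03", 1), ("HS03", 1),
]

totals = defaultdict(int)
enrolls = defaultdict(int)
for code, enrolled in history:
    totals[code] += 1
    enrolls[code] += enrolled

# Replace the nominal code with its historical enrollment rate,
# falling back to the overall rate for schools not seen in history.
overall = sum(enrolls.values()) / sum(totals.values())

def hscrat(code):
    return enrolls[code] / totals[code] if code in totals else overall

# HS01 enrolled 2 of 3, HS03 enrolled 3 of 3, HS99 is unseen.
print(hscrat("HS01"), hscrat("HS03"), hscrat("HS99"))
```

This kind of recoding turns a high-cardinality nominal input into a single interval input, which is why IRSCHOOL itself could be rejected.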
Accessing and Assaying the Data

A SAS Enterprise Miner data source was defined using the metadata settings indicated above. The StatExplore node was used to provide preliminary statistics on the input variables.
The following is extracted from the StatExplore node's Results window:

Class Variable Summary Statistics

Variable            Role    Number of Levels  Missing  Mode  Mode Pct  Mode2 Pct
CAMPUS_VISIT        INPUT   3                 0        0     96.61     3.31
INSTATE             INPUT   2                 0        Y     62.04     37.96
REFERRAL_CNTCTS     INPUT   6                 0        0     96.46     3.21
SOLICITED_CNTCTS    INPUT   8                 0        0     52.45     41.60
TERRITORY           INPUT   12                0        2     15.98     15.34
TRAVEL_INIT_CNTCTS  INPUT   7                 0        0     67.00     29.90
INTEREST            INPUT   4                 0        0     95.01     4.62
MAILQ               INPUT   5                 0        5     69.33     12.80
PREMIERE            INPUT   2                 0        0     97.11     2.89
STUEMAIL            INPUT   2                 0        0     51.01     48.99
ENROLL              TARGET  2                 0        0     96.86     3.14
The class input variables are listed first. Notice that most of the count variables have a high percentage of 0s.

Distribution of Class Target and Segment Variables

Variable  Role    Formatted Value  Frequency Count  Percent
Enroll    TARGET  0                88614            96.8650
Enroll    TARGET  1                2868             3.1350
Next is the target distribution. Only 3.1% of the target values are 1s, making a 1 a rare event. Standard practice in this situation is to sample the 0s and 1s separately. The Sample tool, used below, enables you to create a stratified sample in SAS Enterprise Miner.

Interval Variable Summary Statistics

Variable          Mean      Std. Dev.  Non Missing  Missing  Minimum  Median    Maximum
SELF_INIT_CNTCTS  1.21      1.67       91482        0        0.00     1.00      56.00
TOTAL_CONTACTS    2.17      1.85       91482        0        1.00     2.00      58.00
AVG_INCOME        47315.33  20608.89   70553        20929    4940.00  42324.00  200001.00
DISTANCE          380.43    397.98     72014        19468    0.42     183.55    4798.90
HSCRAT            0.04      0.06       91482        0        0.00     0.03      1.00
INIT_SPAN         19.69     8.72       91482        0        -216.00  19.00     228.00
INT1RAT           0.04      0.02       91482        0        0.00     0.04      1.00
INT2RAT           0.04      0.03       91482        0        0.00     0.06      1.00
Finally, interval variable summary statistics are presented. Notice that AVG_INCOME and DISTANCE have missing values. The Explore window was used to study the distribution of the interval variables.
[Explore window: histograms of SELF_INIT_CNTCTS, INIT_SPAN, TOTAL_CONTACTS, DISTANCE, INT1RAT, INT2RAT, HSCRAT, and AVG_INCOME]
The apparent skewness of all inputs suggests that some transformations might be needed for regression models.
Creating a Training Sample

Cases from each target level were sampled separately. All cases with the primary outcome were selected. For each primary outcome case, seven secondary outcome cases were selected. This created a training sample with a 12.5% overall enrollment rate. The Sample tool was used to create a training sample for subsequent modeling.
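A sketch of this separate-sampling scheme in plain Python (illustrative, not the Sample node's implementation), using the Inq2005 target counts of 2,868 primary and 88,614 secondary cases:

```python
import random

random.seed(12345)

# Hypothetical case list: 2,868 primary (1) and 88,614 secondary (0) cases.
cases = [1] * 2868 + [0] * 88614

primary = [c for c in cases if c == 1]
secondary = [c for c in cases if c == 0]

target_rate = 0.125
# Keep every primary case; draw enough secondary cases so that primaries
# make up 12.5% of the sample (a 1:7 ratio, so 7 secondaries per primary).
n_secondary = round(len(primary) * (1 - target_rate) / target_rate)
sample = primary + random.sample(secondary, n_secondary)

print(len(sample), len(primary) / len(sample))  # 22944 0.125
```

Note that n_secondary works out to 20,076, matching the secondary count reported in the Sample node results below.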
To create the sample as described, the following modifications were made to the Sample node's Properties panel:

1. Type 100 as the Percentage value (in the Size property group).

2. Select Criterion ⇒ Level Based (in the Stratified property group).

3. Type 12.5 as the Sample Proportion value (in the Level Based Options property group).
The Sample node's Results window shows that all primary outcome cases were selected, along with a sufficient number of secondary outcome cases to achieve the 12.5% primary outcome proportion.

Summary Statistics for Class Targets

Data=DATA
Variable  Numeric Value  Formatted Value  Frequency Count  Percent
Enroll    0              0                88614            96.8650
Enroll    1              1                2868             3.1350

Data=SAMPLE
Variable  Numeric Value  Formatted Value  Frequency Count  Percent
Enroll    0              0                20076            87.5
Enroll    1              1                2868             12.5
Configuring Decision Processing

The primary purpose of the predictions was decision optimization and, secondarily, ranking. An applicant was considered a good candidate if his or her probability of enrollment was higher than average. Because of the Sample node, decision information consistent with these objectives could not be entered in the data source node. To incorporate decision information, the Decisions tool was incorporated in the analysis.
[Diagram: INQ2005 data source, StatExplore, Sample, and Decisions nodes]
These steps were followed to configure the Decisions node:

1. Select Decisions ⇒ […] in the node's Properties panel. After the analysis path is updated, the Decision Processing window opens.

2. Select Build in the Decision Processing window.

3. Select the Decisions tab, and select Yes in answer to "Do you want to use the decisions?"

4. Select Default with Inverse Prior Weights.
5. Select the Decision Weights tab.
In the Decision Weights tab, Maximize was selected as the decision function, and the following weight values were entered for the decisions:

Level  DECISION1  DECISION2
1      8.0        0.0
0      0.0        1.1428571
The nonzero values used in the decision matrix are the inverses of the prior probabilities (1/0.125 = 8 and 1/0.875 = 1.142857). Such a decision matrix, sometimes referred to as the central decision rule, forces a primary decision when the estimated primary outcome probability for a case exceeds the primary outcome prior probability (0.125 in this case).
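The central decision rule can be verified with a few lines of arithmetic. The Python sketch below (illustrative, not SAS Enterprise Miner code) compares the expected profit of each decision under inverse-prior weights; the primary decision wins exactly when the estimated probability exceeds the prior.

```python
def decide(p_primary, prior=0.125):
    """Central decision rule under inverse-prior weights: a correct primary
    decision earns 1/prior, a correct secondary decision earns 1/(1-prior).
    Expanding the inequality shows the rule reduces to p_primary > prior."""
    expected_primary = p_primary * (1 / prior)                 # DECISION1 payoff
    expected_secondary = (1 - p_primary) * (1 / (1 - prior))   # DECISION2 payoff
    return "primary" if expected_primary > expected_secondary else "secondary"

# Just above and just below the 0.125 prior:
print(decide(0.13), decide(0.12))  # primary secondary
```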
Creating Prediction Models (All Cases)

Two rounds of predictive modeling were performed. In the first round, all cases were considered for model building. From the Decisions node, partitioning, imputation, modeling, and assessment were performed. The completed analysis appears as shown.
• The Data Partition node used 60% for training and 40% for validation.
• The Impute node used the Tree method for both class and interval variables. Unique missing indicator variables were also selected and used as inputs.
• The stepwise regression model was used as a variable selection method for the Neural Network and second Regression nodes.
• The Regression node labeled Instate Regression included the variables from the Stepwise Regression node and the variable INSTATE. It was felt that prospective students behave differently based on whether they are in state or out of state.

In this implementation of the case study, the Stepwise Regression node selected three inputs: the high school enrollment rate, the self-initiated contact count, and the student e-mail indicator. The model output is shown below.

Analysis of Maximum Likelihood Estimates

Parameter         DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq  Standardized Estimate  Exp(Est)
Intercept         1   -12.1422  18.9832         0.41             0.5224                             0.000
SELF_INIT_CNTCTS  1   0.6895    0.0203          1156.19          <.0001      0.8773                 1.993
HSCRAT            1   16.4261   0.8108          410.46           <.0001      0.7506                 999.000
STUEMAIL 0        1   -7.7776   18.9824         0.17             0.6820                             0.000

Odds Ratio Estimates

Effect                    Point Estimate
SELF_INIT_CNTCTS          1.993
HSCRAT                    999.000
STUEMAIL        0 vs 1    <0.001
The unusual odds ratio estimates for HSCRAT and STUEMAIL result from an extremely strong association with those inputs. For example, certain high schools had all applicants or no applicants enroll. Likewise, very few students enrolled who did not provide an e-mail address.

Adding the INSTATE input in the Instate Regression model changed the significance of the inputs selected by the stepwise regression model. The input STUEMAIL is no longer statistically significant after the INSTATE input is included.

Analysis of Maximum Likelihood Estimates

Parameter         DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq  Standardized Estimate  Exp(Est)
Intercept         1   -12.0541  16.7449         0.52             0.4716                             0.000
INSTATE N         1   -0.4145   0.0577          51.67            <.0001                             0.661
SELF_INIT_CNTCTS  1   0.6889    0.0196          1233.22          <.0001      0.8231                 1.992
HSCRAT            1   16.2327   0.7553          461.95           <.0001      0.7142                 999.000
STUEMAIL 0        1   -7.3528   16.7443         0.19             0.6606                             0.001

Odds Ratio Estimates

Effect                      Point Estimate
INSTATE          N vs Y     0.437
SELF_INIT_CNTCTS            1.992
HSCRAT                      999.000
STUEMAIL         0 vs 1     <0.001
A slight increase in validation profit (the criterion used to tune models) was found using the neural network model.

The tree model provides insight into the strength of model fit. The assessment plot shows the highest profit occurring with 17 leaves. Most of the predictive performance, however, is provided by the initial splits.

[Assessment plot: Train and Valid Average Profit for Enroll versus number of leaves]
A simpler tree was scrutinized to aid interpretation. The tree model was rerun with properties changed as follows to produce a tree with three leaves: Method=N, Number of Leaves=3.
[Three-leaf tree diagram: root split on SELF_INIT_CNTCTS at 3.5, with a secondary split on SELF_INIT_CNTCTS at 2.5]
Students with three or fewer self-initiated contacts rarely enrolled (as seen in the left leaf of the first split). Enrollment was even rarer for students with two or fewer self-initiated contacts (as seen in the left leaf of the second split). Notice that the primary target percentage is rounded down. Also notice that most of the secondary target cases can be found in the lower left leaf. The decision tree results shown in the rest of this case study are generated by the original, 17-leaf, tree.
Assessing the Prediction Models

Model performance was compared in the Model Comparison node.
[ROC chart (Data Role = VALIDATE): Baseline, Stepwise Regression, Decision Tree, Neural Network, and Instate Regression]
The validation ROC chart showed extremely good performance for all models. The neural model seemed to have a slight edge over the other models. This was mirrored in the Fit Statistics table (abstracted below to show only the validation performance).
Data Role = Valid

Statistic                            Neural    Tree      Reg       Reg2
Kolmogorov-Smirnov Statistic         0.89      0.88      0.87      0.86
Average Profit for Enroll            1.88      1.88      1.87      1.86
Average Squared Error                0.04      0.04      0.04      0.04
ROC Index                            0.98      0.96      0.98      0.98
Average Error Function               0.11      .         0.14      0.14
Percent Capture Response             30.94     30.95     29.72     29.55
Divisor for VASE                     18356.00  18356.00  18356.00  18356.00
Error Function                       2097.03   .         2521.91   2486.75
Gain                                 576.37    519.90    552.83    552.83
Gini Coefficient                     0.96      0.93      0.95      0.95
Bin-Based Two-Way KS Statistic       0.88      0.87      0.86      0.86
Lift                                 6.19      6.19      5.94      5.91
Maximum Absolute Error               1.00      1.00      1.00      1.00
Misclassification Rate               0.05      0.05      0.06      0.06
Mean Square Error                    0.04      .         0.04      0.04
Sum of Frequencies                   9178.00   9178.00   9178.00   9178.00
Total Profit for Enroll              17285.71  17256.00  17122.29  17099.43
Root Average Squared Error           0.19      0.20      0.20      0.20
Percent Response                     77.34     77.36     74.29     73.85
Root Mean Square Error               0.19      .         0.20      0.20
Sum of Square Errors                 657.78    735.99    752.78    754.37
Sum of Case Weights Times Freq       18356.00  18356.00  18356.00  18356.00
Number of Wrong Classifications      463.00    .         .         .
It should be noted that an ROC index of 0.98 needed careful consideration because it suggested a near-perfect separation of the primary and secondary outcomes. The decision tree model provides some insight into this apparently outstanding model fit. Self-initiated contacts are critical to enrollment. Fewer than three self-initiated contacts almost guarantees non-enrollment.
Creating Prediction Models (Instate-Only Cases)

A second round of analysis was performed on instate-only cases. The analysis sample was reduced using the Filter node. The Filter node was attached to the Decisions node, as shown below.
The following configuration steps were applied:

1. Select Default Filtering Method ⇒ None for both the class and interval variables.

2. Select Class Variables ⇒ […]. After the path is updated, the Interactive Class Filter window opens.

3. Select Generate Summary, and select Yes when asked whether to generate summary statistics.
4. Select Instate. The Interactive Class Filter window is updated to show the distribution of the INSTATE input.
5. Select the N bar, and then select Apply Filter.
6. Select OK to close the Interactive Class Filter window.

7. Run the Filter node and view the results.

Excluded Class Values
(maximum 500 observations printed)

Variable  Role   Level  Train Count  Train Percent  Filter Method
Instate   INPUT  N      8200         35.7392        MANUAL
All out-of-state cases were filtered from the analysis. After filtering, an analysis similar to the one above was conducted with stepwise regression, neural network, and decision tree models.
The partial diagram (after the Filter node) is shown below:
As for the models in this subset analysis, the Instate Stepwise Regression model selected two of the same inputs found in the first round of modeling: SELF_INIT_CNTCTS and STUEMAIL.

Analysis of Maximum Likelihood Estimates

Parameter         DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq  Standardized Estimate  Exp(Est)
Intercept         1   -10.1372  20.8246         0.24             0.6264                             0.000
SELF_INIT_CNTCTS  1   0.7188    0.0210          1174.00          <.0001      0.9297                 2.052
STUEMAIL 0        1   -6.8602   20.6245         0.11             0.7418                             0.001
The Instate decision tree showed a structure similar to the decision tree model from the first round. The tree with the highest validation profit possessed 20 leaves. The best five-leaf tree, whose validation profit was 97% of the selected tree's, is shown below.
[Five-leaf tree diagram: initial split on SELF_INIT_CNTCTS at 3.5, with subsequent splits on SELF_INIT_CNTCTS at 2.5 and on HSCRAT]
Again, much of the performance of the model was due to a low self-initiated contact count.
Assessing Prediction Models (Instate-Only Cases)

As before, model performance was gauged in the Model Comparison node. The ROC chart showed no clearly superior model, although all models had rather exceptional performance.

[ROC chart (Data Role = VALIDATE)]
The Fit Statistics table of the Output window showed the tree model with a slight edge in Misclassification Rate. The validation ROC Index and validation Average Profit favored the Stepwise Regression and Neural Network models. Again, it should be noted that these were unusually high model performance statistics.
A.4 Enrollment Management Case Study
A-55
Data Role = Valid

Statistic                                                  Neural2     Reg3        Tree2
Kolmogorov-Smirnov Statistic                                  0.80       0.80        0.80
Average Profit                                                2.02       2.02        2.00
Average Squared Error                                         0.06       0.06        0.06
ROC Index                                                     0.96       0.96        0.92
Average Error Function                                        0.19       0.20        0.21
Bin-Based Two-Way Kolmogorov-Smirnov Probability Cutoff        .          .           .
Cumulative Percent Captured Response                         50.69      50.69       46.38
Percent Captured Response                                    23.41      23.41       23.19
Frequency of Classified Cases                              5899.00    5899.00     5899.00
Divisor for ASE                                           11798.00   11798.00    11798.00
Error Function                                             2194.05    2330.78     2462.62
Gain                                                        406.85     406.85      363.74
Gini Coefficient                                              0.91       0.91        0.84
Bin-Based Two-Way Kolmogorov-Smirnov Statistic                0.00       0.00        0.00
Kolmogorov-Smirnov Probability Cutoff                         0.14       0.14        0.03
Cumulative Lift                                               5.07       5.07        4.64
Lift                                                          4.68       4.68        4.64
Maximum Absolute Error                                        0.97       1.00        0.98
Misclassification Rate                                        0.09       0.09        0.08
Sum of Frequencies                                         5899.00    5899.00     5899.00
Total Profit                                              11901.71   11901.71    11824.00
Root Average Squared Error                                    0.24       0.25        0.25
Cumulative Percent Response                                  80.08      80.08       73.27
Percent Response                                             73.97      73.97       73.27
Sum of Squared Errors                                       705.43     725.26      716.66
Number of Wrong Classifications                             510.00     527.00      462.00
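The validation ROC Index and Gini Coefficient reported here follow the definitions given earlier in the chapter: the ROC index is the percent of concordant (primary, secondary) case pairs plus one-half the percent of tied pairs, and the Gini coefficient equals 2 x (ROC index - 0.5). Average Profit is Total Profit divided by the Sum of Frequencies (for example, 11901.71 / 5899 is about 2.02). A minimal sketch, using small hypothetical score lists rather than the case-study data:

```python
def roc_index(primary, secondary):
    """Percent concordant pairs plus one-half the percent tied pairs,
    over all (primary, secondary) case pairs (the Chapter 6 definition)."""
    pairs = [(p, s) for p in primary for s in secondary]
    concordant = sum(1 for p, s in pairs if p > s)
    tied = sum(1 for p, s in pairs if p == s)
    return (concordant + 0.5 * tied) / len(pairs)

# Hypothetical predicted probabilities (not from the case-study data)
primary = [0.9, 0.8, 0.8, 0.6]    # cases with the primary outcome
secondary = [0.8, 0.5, 0.3, 0.2]  # cases with the secondary outcome

auc = roc_index(primary, secondary)
gini = 2 * (auc - 0.5)            # Gini coefficient for binary prediction

# Average profit is total profit over the number of scored cases,
# e.g. 11901.71 / 5899 for the Neural2 and Reg3 columns
avg_profit = 11901.71 / 5899
```

With these toy scores the ROC index is 0.875 (14 of 16 pair-comparisons won, counting ties as half), so the Gini coefficient is 0.75.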
Deploying the Prediction Model

The Score node facilitated deployment of the prediction model, as shown in the diagram's final form.
The best (instate) model was selected by the Instate Model Comparison node and passed on to the Score node. Another INQ2005 data source was assigned a role of Score and attached to the Score node. Columns from the scored INQ2005 were then passed into the Office of Enrollment Management's data management system by the final SAS Code node.
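Conceptually, the score code that the Score node exports reduces, for this regression model, to applying the logistic prediction formula to each new inquiry record. The sketch below is a hypothetical stand-in for that generated score code (the function and record layout are invented for illustration), using the coefficients from the instate stepwise regression table:

```python
import math

# Coefficients from the instate stepwise regression (see the MLE table).
# The scoring function itself is a hypothetical stand-in for the SAS
# score code that the Score node actually generates.
INTERCEPT = -10.1372
COEFFS = {"SELF_INIT_CNTCTS": 0.7188, "stuemail": -6.8602}

def score(record):
    """Return the predicted probability of enrollment for one inquiry."""
    logit = INTERCEPT + sum(b * record.get(name, 0.0)
                            for name, b in COEFFS.items())
    return 1.0 / (1.0 + math.exp(-logit))  # inverse logit link

# A hypothetical inquiry record with many self-initiated contacts
p = score({"SELF_INIT_CNTCTS": 16, "stuemail": 0})
```

As the model's dominant SELF_INIT_CNTCTS coefficient suggests, the predicted probability rises sharply with the count of self-initiated contacts.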
Appendix B Bibliography
B.1 References

Beck, A. 1997. "Herb Edelstein discusses the usefulness of data mining." DSStar, Vol. 1, No. 2. Available www.tgc.com/dsstar/.

Bishop, C. M. 1995. Neural Networks for Pattern Recognition. New York: Oxford University Press.

Breiman, L. et al. 1984. Classification and Regression Trees. Belmont, CA: Wadsworth International Group.

Hand, D. J. 1997. Construction and Assessment of Classification Rules. New York: John Wiley & Sons, Inc.
Hand, D. J. 2005. "What you get is what you want? Some dangers of black box data mining." M2005 Conference Proceedings. Cary, NC: SAS Institute Inc.

Hand, D. J. 2006. "Classifier technology and the illusion of progress." Statistical Science 21:1-14.

Hand, D. J. and W. E. Henley. 1997. "Statistical classification methods in consumer credit scoring: a review." Journal of the Royal Statistical Society A 160:523-541.

Hand, David, Heikki Mannila, and Padraic Smyth. 2001. Principles of Data Mining. Cambridge, Massachusetts: The MIT Press.

Harrell, F. E. 2006. Regression Modeling Strategies. New York: Springer-Verlag New York, Inc.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag New York, Inc.

Hoaglin, D. C., F. Mosteller, and J. W. Tukey. 1983. Understanding Robust and Exploratory Data Analysis. New York: John Wiley & Sons, Inc.
Kass, G. V. 1980. "An exploratory technique for investigating large quantities of categorical data." Applied Statistics 29:119-127.

Mosteller, F. and J. W. Tukey. 1977. Data Analysis and Regression. Reading, MA: Addison-Wesley.

Piatetsky-Shapiro, G. 1998. "What Wal-Mart might do with Barbie association rules." Knowledge Discovery Nuggets, 98:1. Available http://www.kdnuggets.com/.

Ripley, B. D. 1996. Pattern Recognition and Neural Networks. New York: Cambridge University Press.

Rubinkam, M. 2006. "Internet Merchants Fighting Costs of Credit Card Fraud." AP Worldstream. The Associated Press.

Rud, Olivia Parr. 2001. Data Mining Cookbook: Modeling Data for Marketing, Risk, and Customer Relationship Management. New York: John Wiley & Sons, Inc.
Sarle, W. S. 1983. Cubic Clustering Criterion. SAS Technical Report A-108. Cary, NC: SAS Institute Inc.

Sarle, W. S. 1994a. "Neural Networks and Statistical Models." Proceedings of the Nineteenth Annual SAS® Users Group International Conference. Cary, NC: SAS Institute Inc., 1538-1550.

Sarle, W. S. 1994b. "Neural Network Implementation in SAS® Software." Proceedings of the Nineteenth Annual SAS® Users Group International Conference. Cary, NC: SAS Institute Inc., 1550-1573.
Sarle, W. S. 1995. "Stopped Training and Other Remedies for Overfitting." Proceedings of the 27th Symposium on the Interface.

SAS Institute Inc. 2002. SAS® 9 Language: Reference, Volumes 1 and 2. Cary, NC: SAS Institute Inc.

SAS Institute Inc. 2002. SAS® 9 Procedures Guide. Cary, NC: SAS Institute Inc.

SAS Institute Inc. 2002. SAS/STAT® 9 User's Guide, Volumes 1, 2, and 3. Cary, NC: SAS Institute Inc.

Weiss, S. M. and C. A. Kulikowski. 1991. Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. San Mateo, CA: Morgan Kaufmann.
Appendix C Index

A
analytic workflow, 1-20
Assess tab, 1-17
assessing models, A-34-A-35, A-48, A-54
association rule, 8-63
Association tool, 1-10
AutoNeural node, 6-25
AutoNeural tool, 1-14, 5-39-5-45
average profit, 6-41
average squared error, 6-5

B
backward selection method, 4-31
bin transformations, A-31
bucket transformations, A-30

C
categorical inputs
  consolidating, 9-22-9-30
  regressions, 4-67-4-68
chi-square selection criterion, 9-10
cluster analysis, 8-8-8-60
cluster centers, 8-10
Cluster tool, 1-10, 8-37-8-43
clustering, 8-8-8-60
  k-means algorithm, 8-9-8-13
  seeds, 8-10
complete case analysis, 4-12
consolidating categorical inputs, 9-22-9-30
constructing decision trees, 3-43-3-62
Control Point tool, 1-18
creating SAS Enterprise Miner projects, 2-5-2-9
creating segments, A-10
Credit Exchange tool, 1-19
Credit Scoring tab, 1-19

D
Data Partition tool, 1-8
data reduction, 8-6
data splitting, 3-19
decision predictions, 3-9
decision processing
  configuring, A-42-A-44
Decision Tree node
  selecting inputs, 9-19-9-21
decision tree profit optimization, 6-46
Decision Tree tool, 1-14
decision trees, 3-30
  constructing, 3-43-3-62
  defaults, 3-106
  multiway splits, 3-107-3-108
  prediction rules, 3-30-3-31
  split search, 3-32-3-41
  stopping rules, 3-112-3-113
  variables, 3-102
Decisions tool, 1-17
diagram workspace, 1-5
  process flows, 1-6
discretized stepwise regression
  creating prediction models, A-29-A-33
Dmine regression, 5-48-5-49
Dmine Regression tool, 1-14
DMNeural tool, 1-14, 5-50-5-51
Drop tool, 1-12

E
ensemble models, 9-4
Ensemble node, 9-4, 9-7
Ensemble tool, 1-15
estimate predictions, 3-11
estimation methods
  missing values, 4-17
Explore tab, 1-10-1-11
exporting scored tables, 7-9-7-14

F
forward selection method, 4-30

G
gains charts, 6-17
Gini coefficient, 6-4

H
Help panel, 1-5
hidden layer
  neural networks, 5-7
hidden units
  neural networks, 5-5

I
implementing models, 7-3-7-21
Impute tool, 1-12
Input Data tool, 1-8
input layer
  neural networks, 5-7
input selection, 9-9-9-21
  neural networks, 5-20-5-23
  predictive modeling, 3-16
inputs
  categorical, 9-22-9-30
  consolidating, 9-22-9-30
  selecting, 3-14-3-16, 5-20-5-23, 9-9-9-21
  selecting for regressions, 4-29-4-37
  transforming, 4-52-4-65
Interactive Grouping tool, 1-19
internally scored data sets, 7-3-7-14

K
k-means clustering algorithm, 8-9-8-13
Kolmogorov-Smirnov (KS) statistic, 6-3

L
logistic function, 4-7, 5-8
logit link function, 4-7

M
market basket analysis, 8-6, 8-63-8-79
memory-based reasoning (MBR)
  predictive modeling, 5-55
Memory-Based Reasoning tool, 1-15
Merge tool, 1-8
Metadata tool, 1-18
missing values
  causes, 4-16
  estimation methods, 4-17
  neural networks, 5-11
  regressions, 4-17-4-24
  synthetic distribution, 4-17
Model Comparison tool, 1-17, 6-3-6-25
  statistical graphics, 6-10-6-20
Model tab, 1-14-1-16
models
  assessing, A-34-A-35, A-48, A-54
  comparing using statistical graphics, 6-18-6-20
  creating, A-44-A-47, A-49-A-53
  creating using neural networks, A-26
  creating using stepwise regression, A-25-A-26
  creating using transformed stepwise regression, A-27
  creating with discretized stepwise regressions, A-29-A-33
  deploying, A-55
  implementing, 7-3-7-21
  model complexity, 3-18
  surrogate, 9-31-9-37
  underfitting, 3-18
  viewing additional assessments, 6-44-6-45
Modify tab, 1-12
multi-layer perceptrons, 5-3
Multiplot tool, 1-10

N
network diagrams, 5-7
neural network models. See neural networks
Neural Network node, 6-24
neural network profit optimization, 6-47
Neural Network tool, 1-15, 5-12-5-19
neural networks, 5-3
  AutoNeural tool, 5-39-5-45
  binary prediction formula, 5-6
  creating prediction models, A-26
  diagrams, 5-7
  exploring networks with increasing hidden unit counts, 5-39-5-45
  extreme values, 5-11
  input selection, 5-20-5-23
  missing values, 5-11
  Neural Network tool, 5-12-5-19
  nonadditivity, 5-11
  nonlinearity, 5-11
  nonnumeric inputs, 5-11
  prediction formula, 5-5
  stopped training, 5-24-5-38
  unusual values, 5-11
nodes, 1-6
nonlinearity
  neural networks, 5-11
  regressions, 4-11
novelty detection methods, 8-6
O
odds ratio, 4-50
optimal transformations, A-32
optimizing model complexity, 3-17-3-22
output layer
  neural networks, 5-7
P
Path Analysis tool, 1-10
pattern discovery, 8-3-8-4
  applications, 8-6
  cautions, 8-5
  limitations, 8-5
  tools, 8-7
polynomial regressions, 4-76-4-81
prediction estimates
  regressions, 4-5-4-10
prediction formula
  neural networks, 5-4-5-6
  regressions, 4-5-4-8
prediction rules
  decision trees, 3-30-3-31
predictive modeling, 3-3
  applications, 3-4
  decision predictions, 3-9
  Dmine regression, 5-48-5-49
  DMNeural tool, 5-50-5-51
  essential tasks, 3-8-3-22
  estimate predictions, 3-11
  input selection, 3-14-3-16
  memory-based reasoning (MBR), 5-55
  neural networks, 5-3-5-45
  optimizing model complexity, 3-17-3-22
  ranking predictions, 3-10
  regressions, 4-17-4-87
  Rule Induction, 5-46-5-47
  tools, 3-28-3-29
  training data sets, 3-5
Principal Components tool, 1-12
process flows, 1-6
profiling, 8-6
profiling segments, 8-56-8-60
profit matrices, 6-39-6-41
  creating, 6-32-6-38
  evaluating, 6-42-6-43
profit value, 6-39
Project panel, 1-4
projects
  creating, 2-5-2-9
Properties panel, 1-4
R
ranking predictions, 3-10
regression models. See regressions
Regression node
  selection methods, 4-29-4-37
regression profit optimization, 6-47
Regression tool, 1-16, 4-25-4-28
regressions, 4-3
  missing values, 4-17-4-24
  nonadditivity, 4-11
  nonlinearity, 4-11
  nonnumeric inputs, 4-11
  optimal complexity, 4-38-4-48
  polynomial, 4-76-4-87
  prediction estimates, 4-5-4-10
  prediction formula, 4-5-4-8
  selecting inputs, 4-29-4-37
  transforming inputs, 4-52-4-65
  unusual values, 4-11
Reject Inference tool, 1-19
Replacement tool, 1-12
response rate charts, 6-15-6-17
ROC Index, 6-4
R-square variable selection criterion, 9-10
Rule Induction tool, 5-46-5-47
S
Sample tab, 1-8
Sample tool, 1-8
SAS Code tool, 1-18
SAS Enterprise Miner
  analytic workflow, 1-20
  creating projects, 2-5-2-9
  diagram workspace, 1-5
  Help panel, 1-5
  interface for 5.2, 1-3-1-19
  nodes, 1-6
  Project panel, 1-4
  Properties panel, 1-4
  Utility tools, 1-18
Schwarz Bayesian Criterion (SBC), 6-5
score code modules, 7-3, 7-15-7-23
  creating a SAS Score Code module, 7-16-7-21
score data source
  creating, 7-5
Score tool, 1-17
Scorecard tool, 1-19
scored tables
  exporting, 7-9-7-14
seeds, 8-10
Segment Profile tool, 1-17
segmentation analysis, 8-15-8-60
segmenting, 8-8-8-60
segments
  creating, A-10
  deploying, A-14-A-15
  interpreting, A-12-A-13
  profiling, 8-56-8-60
  specifying counts, 8-44
selecting inputs, 9-9-9-21
  regressions, 4-29-4-37
selecting neural network inputs, 5-20-5-23
selection methods
  Regression node, 4-29-4-37
SEMMA, 1-7
SEMMA tools palette, 1-7
separate sampling, 6-20-6-31
  adjusting for, 6-21-6-25, 6-29
  benefits of, 6-28
  consequences of, 6-28
sequence analysis, 8-6, 8-80-8-8
SOM/Kohonen tool, 1-11
specifying segment counts, 8-44
split search
  decision trees, 3-32-3-41
StatExplore tool, 1-11
statistical graphics
  comparing models, 6-18-6-20
  Model Comparison tool, 6-10-6-20
stepwise regression
  creating prediction models, A-25-A-26
  discretized, A-29-A-33
  transformed, A-27
stepwise selection method, 4-32-4-3
stopped training, 5-3
  neural networks, 5-24-5-38
surrogate models, 9-31-9-37
synthetic distribution
  missing values, 4-17
T
target layer
  neural networks, 5-7
test data sets, 1-8, 3-19
Time Series tool, 1-9
training data sets, 1-8, 3-5
  creating, 3-23-3-27
transaction count, 8-82
Transform Variables tool, 1-13
transformations
  bin, A-31
  bucket, A-30
  optimal, A-32
transformed stepwise regression
  creating prediction models, A-27
transforming inputs
  regressions, 4-52-4-65
TwoStage tool, 1-16

U
Utility tools
  SAS Enterprise Miner, 1-18
V
validation data sets, 1-8, 3-19, 3-21
  creating, 3-23-3-27
Variable Selection node, 9-9-9-13
Variable Selection tool, 1-11

W
weight estimates
  neural networks, 5-5