Applied Analytics Using SAS Enterprise Miner: Part 2

Chapter 6  Model Assessment

6.1  Model Fit Statistics ............................................... 6-3
     Demonstration: Comparing Models with Summary Statistics ............ 6-6
6.2  Statistical Graphics ............................................... 6-9
     Demonstration: Comparing Models with ROC Charts .................... 6-13
     Demonstration: Comparing Models with Score Rankings Plots .......... 6-18
     Demonstration: Adjusting for Separate Sampling ..................... 6-21
6.3  Adjusting for Separate Sampling .................................... 6-26
     Demonstration: Adjusting for Separate Sampling (continued) ......... 6-29
     Demonstration: Creating a Profit Matrix ............................ 6-32
6.4  Profit Matrices .................................................... 6-39
     Demonstration: Evaluating Model Profit ............................. 6-42
     Demonstration: Viewing Additional Assessments ...................... 6-44
     Demonstration: Optimizing with Profit (Self-Study) ................. 6-46
     Exercises .......................................................... 6-48
6.5  Chapter Summary .................................................... 6-49
6.6  Solutions .......................................................... 6-50
     Solutions to Exercises ............................................. 6-50


6.1 Model Fit Statistics

Summary Statistics

   Prediction Type    Summary Statistic
   Decisions          Accuracy/Misclassification
                      Profit/Loss
                      Inverse prior threshold

As introduced in Chapter 3, summary statistics can be grouped by prediction type. For decision prediction, the Model Comparison tool rates model performance based on accuracy or misclassification, profit or loss, and by the Kolmogorov-Smirnov (KS) statistic. Accuracy and misclassification tally the correct or incorrect prediction decisions. Profit is detailed later in this chapter. The Kolmogorov-Smirnov statistic describes the ability of the model to separate the primary and secondary outcomes.
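For reference, the two-sample KS statistic can also be computed directly on a scored validation table with Base SAS. The following is a minimal sketch, not part of the course flow; the table WORK.SCORED and the variables TargetB and P_TargetB1 are assumed names for a scored data set like the ones produced later in this chapter.

   /* Two-sample Kolmogorov-Smirnov statistic: how strongly the predicted */
   /* probability separates primary (TargetB=1) from secondary cases.     */
   proc npar1way data=work.scored edf;
      class TargetB;        /* primary versus secondary outcome  */
      var P_TargetB1;       /* predicted probability of response */
   run;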


Summary Statistics

   Prediction Type    Summary Statistic
   Rankings           ROC Index (concordance)
                      Gini coefficient

For ranking predictions, the Model Comparison tool gives two closely related measures of model fit. The ROC index is similar to concordance (introduced in Chapter 3). The Gini coefficient (for binary prediction) equals 2 x (ROC Index - 0.5).

The ROC index equals the percent of concordant cases plus one-half times the percent of tied cases. Recall that a pair of cases, consisting of one primary outcome and one secondary outcome, is concordant if the primary outcome case has a higher rank than the secondary outcome case. By contrast, if the primary outcome case has a lower rank, that pair is discordant. If the two cases have the same rank, they are said to be tied.
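To make the pairing definition concrete, the following brute-force sketch counts concordant, discordant, and tied pairs and forms the ROC index. It is illustrative only; the table WORK.SCORED and the variables TargetB and P_TargetB1 are assumed names, and the full cross join is practical only for small tables.

   /* ROC index = percent concordant + one-half the percent tied,         */
   /* computed over all (primary, secondary) pairs of cases.              */
   proc sql;
      select mean(p.P_TargetB1 > s.P_TargetB1) as pct_concordant,
             mean(p.P_TargetB1 < s.P_TargetB1) as pct_discordant,
             mean(p.P_TargetB1 = s.P_TargetB1) as pct_tied,
             calculated pct_concordant + 0.5*calculated pct_tied as roc_index,
             2*(calculated roc_index - 0.5) as gini
      from work.scored(where=(TargetB=1)) as p,
           work.scored(where=(TargetB=0)) as s;
   quit;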


Summary Statistics

   Prediction Type    Summary Statistic
   Estimates          Average squared error
                      SBC/Likelihood

For estimate predictions, the Model Comparison tool provides two performance statistics. Average squared error was used to tune many of the models fit in earlier chapters. Schwarz's Bayesian Criterion (SBC) is a penalized likelihood statistic. The likelihood statistic was used to estimate regression and neural network model parameters and can be thought of as a weighted average squared error.

Note: SBC is provided only for regression and neural network models and is calculated only on training data.



Comparing Models with Summary Statistics

After you build several models, it is desirable to compare their performance. The Model Comparison tool collects assessment information from attached modeling nodes and enables you to easily compare model performance measures.

1. Select the Assess tab.

2. Drag a Model Comparison tool into the diagram workspace.

3. Connect both Decision Tree nodes, the Regression node, and the Neural Network node to the Model Comparison node as shown. (Self-study models are ignored here.)


4. Run the Model Comparison node and view the results. The Results window opens.

   The Results window contains four sub-windows: ROC Chart, Score Rankings, Fit Statistics, and Output.

5. Maximize the Output window.

6. Go to line 90 in the Output window.

   The listing reports validation fit statistics (Data Role = Valid) for the Reg, Neural, Tree, and Tree3 models, including the Kolmogorov-Smirnov statistic, average squared error, ROC index, Gini coefficient, cumulative lift, misclassification rate, and related measures.

The output shows various fit statistics for the selected models. As gauged by these fit statistics, the performance of the models is quite similar. As discussed above, the best choice of fit statistic depends on the type of prediction of interest.

   Prediction Type    Validation Fit Statistic        Direction
   Decisions          Misclassification               smallest
                      Average Profit/Loss             largest/smallest
                      Kolmogorov-Smirnov Statistic    largest
   Rankings           ROC Index (concordance)         largest
                      Gini Coefficient                largest
   Estimates          Average Squared Error           smallest
                      Schwarz's Bayesian Criterion    smallest
                      Log-Likelihood                  largest


6.2 Statistical Graphics

Statistical Graphics - ROC Chart

The ROC chart plots the captured response fraction (sensitivity) on the vertical axis against the false positive fraction (1 - specificity) on the horizontal axis. It illustrates the tradeoff between the captured response fraction and the false positive fraction.

The Model Comparison tool features two charts to aid in model assessment: the ROC chart and the Score Rankings chart. Consider the ROC chart first.

To create a ROC chart, predictions are generated for a set of validation data. For chart generation, the predictions must be rankings or estimates. The validation data is sorted from high to low (either scores or estimates). Each point on the ROC chart corresponds to a specific fraction of the sorted data.


For example, the red point on the ROC chart corresponds to the indicated selection of 40% of the validation data. That is, the points in the gray region on the scatter plot are in the highest 40% of predicted probabilities.

The vertical or y-coordinate of the red point shows the fraction of primary outcome cases "captured" in the top 40% of all cases (here about 45%).


The horizontal or x-coordinate of the red point shows the fraction of secondary outcome cases "captured" in the top 40% of all cases (here about 25%).

The ROC chart represents the union of similar calculations for all selection fractions.


The ROC chart provides a nearly universal diagnostic for predictive models. Models that capture primary and secondary outcome cases in a proportion approximately equal to the selection fraction are weak models (left). Models that capture mostly primary outcome cases without capturing secondary outcome cases are strong models (right).

Statistical Graphics - ROC Index

The tradeoff between primary and secondary case capture can be summarized by the area under the ROC curve. In SAS Enterprise Miner, this area is called the ROC Index. (In the statistical literature, it is more commonly called the c-statistic.) As a rough guide from the slide, a weak model has a ROC Index below about 0.6 and a strong model has a ROC Index above about 0.7. Perhaps surprisingly, the ROC Index is closely related to concordance, the measure of correct case ordering.


Comparing Models with ROC Charts

Use the following steps to compare models using ROC charts.

1. Maximize the ROC chart.

   The chart plots sensitivity against 1 - Specificity for the Baseline, Decision Tree, Probability Tree, Neural Network, and Regression models, for both the TRAIN and VALIDATE data roles.


The ROC chart shows little difference between the non-tree models. This is consistent with the values of the ROC Index, which equals the area under the ROC curves.


Statistical Graphics - Response Chart

The response chart shows the expected response rate (cumulative percent response) for various selection percentages.

A second category of assessment charts examines response rate. It is the prototype of the so-called Score Rankings charts found in every model Results window.

As with ROC charts, a model is applied to validation data to sort the cases from highest to lowest (again, by prediction rankings or estimates). Each point on the response chart corresponds to a selected fraction of cases, ordered by their predicted values.


For example, the red point on the response chart corresponds to the indicated selection of 40% of the validation data, that is, the 40% of cases with the highest predicted values (the points in the gray region on the scatter plot).

The x-coordinate of the red point is simply the selection fraction (in this case, 40%).


The vertical coordinate for a point on the response chart is the proportion of primary outcome cases in the selected fraction. It is called the cumulative percent response in the SAS Enterprise Miner interface and is more widely known as cumulative gain in the predictive modeling literature. Dividing cumulative gain by the overall primary outcome proportion yields a quantity named lift.

Plotting cumulative gain for all selection fractions yields a gains chart. Notice that when all cases are selected, the cumulative gain equals the overall primary outcome proportion.
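A minimal sketch of these computations on a scored validation table follows; it is illustrative only, and the table WORK.SCORED with a numeric 0/1 target TargetB and predicted probability P_TargetB1 are assumed names (the Model Comparison node produces these plots for you).

   /* Overall primary outcome proportion (the denominator for lift). */
   proc sql noprint;
      select mean(TargetB) into :overall from work.scored;
   quit;

   /* Group cases into deciles by descending predicted probability.  */
   proc rank data=work.scored out=ranked groups=10 descending;
      var P_TargetB1;
      ranks decile;              /* decile 0 holds the highest predictions */
   run;
   proc sort data=ranked; by decile; run;

   /* Running totals give cumulative gain and cumulative lift.        */
   data gains(keep=decile cum_gain cum_lift);
      set ranked;
      by decile;
      cum_resp + TargetB;
      cum_n    + 1;
      if last.decile then do;
         cum_gain = cum_resp / cum_n;      /* cumulative response rate (gain) */
         cum_lift = cum_gain / &overall;   /* cumulative lift                 */
         output;
      end;
   run;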


Comparing Models with Score Rankings Plots

Use the following steps to compare models with Score Rankings plots:

1. Maximize the Score Rankings Overlay window.

   The window plots cumulative lift by percentile for both the TRAIN and VALIDATE data roles.


2. Double-click the Data Role = VALIDATE plot.

   The Score Rankings Overlay plot presents what is commonly called a cumulative lift chart. Cases in the training and validation data are ranked based on decreasing predicted target values, and a fraction of the ranked data is selected. This fraction, or decile, corresponds to the horizontal axis of the chart. The ratio of (proportion of cases with the primary outcome in the selected fraction) to (proportion of cases with the primary outcome overall) is defined as cumulative lift, which corresponds to the vertical axis. High values of cumulative lift suggest that the model is doing a good job of separating the primary and secondary cases. As can be seen, the model with the highest cumulative lift depends on the decile; however, none of the models has a cumulative lift over 1.5. It is instructive to view the actual proportion of cases with the primary outcome (called gain or cumulative percent response) at each decile.


3. Select Chart ⇨ Cumulative % Response. The Score Rankings Overlay plot is updated to show Cumulative Percent Response on the vertical axis.

   This plot should show the response rate for soliciting the indicated fraction of individuals. Unfortunately, the proportion of responders in the training data does not equal the true proportion of responders for the 97NK campaign. The training data under-represents non-responders by almost a factor of 20! This under-representation was not an accident. It is a rather standard predictive modeling practice known as separate sampling. (Oversampling, balanced sampling, choice-based sampling, case-control sampling, and other names are also used.)


Adjusting for Separate Sampling

If you do not adjust for separate sampling, the following occurs:
• Prediction estimates reflect target proportions in the training sample, not the population from which the sample was drawn.
• Score Rankings plots are inaccurate and misleading.
• Decision-based statistics related to misclassification or accuracy misrepresent the model performance on the population.

Fortunately, it is easy to adjust for separate sampling in SAS Enterprise Miner. However, you must rerun the models that you created.

Note: Because this can take some time, it is recommended that you run this demonstration during the discussion about the benefits and consequences of separate sampling.

Follow these steps to integrate sampling information into your analysis.

1. Close the Results - Model Comparison window.

2. Select Decisions ⇨ ... in the PVA97NK node's Properties panel.


The Decision Processing - PVA97NK dialog box opens. This dialog box enables you to inform SAS Enterprise Miner about the extent of separate sampling in the training data. This is done by defining prior probabilities.

3. Select Build in the Decision Processing dialog box. SAS Enterprise Miner scans the raw data to determine the proportion of primary and secondary outcomes in the raw data.


4. Select the Prior Probabilities tab.

   The tab asks whether you want to enter new prior probabilities and lists the Level, Count, and Prior columns. Both levels (1 and 0) have a count of 4843 and a prior of 0.5.

5. Select Yes. The dialog box is updated to show the Adjusted Prior column.

   The Adjusted Prior column enables you to specify the proportion of primary and secondary outcomes in the original PVA97NK population. When the analysis problem was introduced in Chapter 3, the primary outcome (response) proportion was claimed to be 5%.


6. Type 0.05 as the Adjusted Prior value for the primary outcome, Level 1.

7. Type 0.95 as the Adjusted Prior value for the secondary outcome, Level 0.

8. Select OK to close the Decision Processing dialog box.

Decision Processing and the Neural Network Node

If your diagram was developed according to the instructions in Chapters 3 through 5, one modeling node (the Neural Network) needs a property change to correctly use the decision processing information. Follow these steps to adjust the Neural Network node settings.

1. Examine the Properties panel for the Neural Network node. The default model selection criterion is Profit/Loss.

2. Select Model Selection Criterion ⇨ Average Error.

   When no decision data is defined, the neural network optimizes complexity using average squared error, even though the default says Profit/Loss. Now that decision data is defined, you must manually change Model Selection Criterion to Average Error.

Decision Processing and the AutoNeural Node

Decision processing data (prior probabilities and decision weights) is ignored by the AutoNeural node. Predictions from the node are adjusted for priors, but the actual model selection process is based strictly on misclassification (without prior adjustment). This fact can lead to unexpected prediction results when the primary and secondary outcome proportions are not equal. Fortunately, the PVA97NK training data has an equal proportion of primary and secondary outcomes and gives reasonable results with the AutoNeural node.

Note: Using the AutoNeural node with training data that does not have equal outcome proportions is reasonable only if your data is not separately sampled and your prediction goal is classification.

Run the diagram from the Model Comparison node. This reruns the entire analysis.


6.3 Adjusting for Separate Sampling

Outcome Overrepresentation

A common predictive modeling practice is to build models from a sample with a primary outcome proportion different from the true population proportion. This is typically done when the ratio of primary to secondary outcome cases is small.

Separate Sampling

Target-based samples are created by considering the primary outcome cases separately from the secondary outcome cases.


For the secondary outcome, some cases are selected; for the primary outcome, all cases are selected.

Separate sampling gets its name from the technique used to generate the modeling data; that is, samples are drawn separately based on the target outcome. In the case of a rare primary outcome, usually all primary outcome cases are selected. Then, each primary outcome case is matched by one or (optimally) more secondary outcome cases.
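The following DATA step is a minimal sketch of that sampling scheme; the input table WORK.RAW, the numeric 0/1 target TargetB, the 25% rate, and the seed are all illustrative assumptions (the Sample node performs this kind of sampling in SAS Enterprise Miner).

   /* Select all primary outcome cases and a random fraction of the      */
   /* secondary outcome cases (here 25%, chosen only for illustration).  */
   data work.modeling_sample;
      set work.raw;
      if TargetB = 1 then output;                   /* select all cases  */
      else if ranuni(27513) < 0.25 then output;     /* select some cases */
   run;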


The Modeling Sample
+ Similar predictive power with smaller case count
- Must adjust assessment statistics and graphics
- Must adjust prediction estimates for bias

The advantage of separate sampling is that you are able to obtain (on average) a model of similar predictive power with a smaller overall case count. This is in concordance with the idea that the amount of information in a data set with a categorical outcome is determined not by the total number of cases in the data set itself, but instead by the number of cases in the rarest outcome category. (For binary target data sets, this is usually the primary outcome.) (Harrell 2006)

This advantage might seem of minimal importance in the age of extremely fast computers. (A model might fit 10 times faster with a reduced data set, but a 10-second model fit versus a 1-second model fit is probably not relevant.) However, the model-fitting process occurs only after the completion of a long, tedious, and error-prone data preparation process. Smaller sample sizes for data preparation are usually welcome.

While it reduces analysis time, separate sampling also introduces some analysis complications:
• Most model fit statistics (especially those related to prediction decisions) and most of the assessment plots are closely tied to the outcome proportions in the training samples. If the outcome proportions in the training and validation samples do not match the outcome proportions in the scoring population, model performance can be greatly misestimated.
• If the outcome proportions in the training sample and scoring population do not match, model prediction estimates are biased.

Fortunately, SAS Enterprise Miner can adjust assessments and prediction estimates to match the scoring population if you specify prior probabilities, the scoring population outcome proportions. This is precisely what was done using the Decisions option in the demonstration.
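For reference, the prior adjustment of a posterior probability can be written out directly. The following sketch is illustrative only: it assumes the sample proportions are 0.5/0.5 and the population priors are 0.05/0.95 (as in this demonstration), and the table and variable names are assumptions.

   /* Adjust a posterior probability estimated on a separately sampled    */
   /* training set so that it reflects the population priors:             */
   /*   p_adj = p*pi1/rho1 / ( p*pi1/rho1 + (1-p)*pi0/rho0 )              */
   data work.adjusted;
      set work.scored;
      rho1 = 0.5;  rho0 = 0.5;     /* outcome proportions in the sample   */
      pi1  = 0.05; pi0  = 0.95;    /* priors in the scoring population    */
      num   = P_TargetB1 * pi1 / rho1;
      P_adj = num / (num + (1 - P_TargetB1) * pi0 / rho0);
   run;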


Adjusting for Separate Sampling (continued)

The consequences of incorporating prior probabilities in the analysis can be viewed in the Model Comparison node.

1. Select Results... in the Run Status dialog box. The Results - Model Comparison window opens.


2. Maximize the Score Rankings Overlay window and focus on the Data Role = VALIDATE chart.

   Before you adjusted for priors, none of the models had a cumulative lift over 1.5. Now most of the models have a cumulative lift in excess of 1.5 (at least for some deciles).

   Note: The exception is the Decision Tree model. Its lift is exactly 1 for all percentiles. The reason for this is the inclusion of disparate prior probabilities for a model tuned on misclassification. On average, cases have a primary outcome probability of 0.05, so a low misclassification rate model can be built by simply calling everyone a non-responder.


3. Select Chart ⇨ Cumulative % Response.

   The cumulative percent responses are now adjusted for separate sampling. You now have an accurate representation of response proportions by selection fraction. With this plot, a business analyst could rate the relative performance of each model for different selection fractions.

   The best selection fraction is usually determined by financial considerations. For example, the charity might have a budget that allows contact with 40% of the available population. Thus, it intuitively makes sense to contact the 40% of the population with the highest chances of responding (as predicted by one of the available models). Another financial consideration, however, is also important: the profit (and loss) associated with a response (and non-response) to a solicitation. To correctly rank the value of a case, response probability estimates must be combined with profit and loss information.


Creating a Profit Matrix

To determine reasonable values for profit and loss information, consider the outcomes and the actions you would take given knowledge of these outcomes. In this case, there are two outcomes (response and non-response) and two corresponding actions (solicit and ignore). Knowing that someone is a responder, you would naturally want to solicit that person; knowing that someone is a non-responder, you would naturally want to ignore that person. (Organizations really do not want to send junk mail.) On the other hand, knowledge of an individual's actual behavior is rarely perfect, so mistakes are made, for example, soliciting non-responders (false positives) and ignoring responders (false negatives). Taken together, there are four outcome/action combinations:

                   Solicit    Ignore
   Response
   Non-response

Each of these outcome/action combinations has a profit consequence (where a loss is called, somewhat euphemistically, a negative profit). Some of the profit consequences are obvious. For example, if you do not solicit, you do not make any profit. So for this analysis, the second column can be immediately set to zero.

                   Solicit    Ignore
   Response                   0
   Non-response               0

From the description of the analysis problem, you find that it costs about $0.68 to send a solicitation. Also, the variable TargetD gives the amount of the response (when a donation occurs). The completed profit consequence matrix can be written as shown.

                   Solicit           Ignore
   Response        TargetD - 0.68    0
   Non-response    -0.68             0

From a statistical perspective, TargetD is a random variable. Individuals who are identical on every input measurement might give different donation amounts. To simplify the analysis, a summary statistic for TargetD is plugged into the profit consequence matrix. In general, this value can vary on a case-by-case basis. However, for this course, the overall average of TargetD is used. You can obtain the average of TargetD using the StatExplore node, as the following steps show.
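As an aside, the same average can be obtained directly with Base SAS outside the flow. This is a minimal sketch; the library reference AAEM and the table name PVA97NK are assumptions based on this course's setup.

   /* Overall average donation amount (TargetD is recorded only when a    */
   /* donation occurs, so the mean is taken over donors).                 */
   proc means data=aaem.pva97nk mean maxdec=2;
      var TargetD;
   run;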

1. Select the Explore tab.

2. Drag the StatExplore tool into the diagram workspace.


3. Connect the StatExplore node to the Data Partition node.

4. Select Variables... from the Properties panel of the StatExplore node. The Variables - Stat window opens.

5. Select Use ⇨ Yes for the TargetD variable.

6. Select OK to close the Variables dialog box.

7. Run the StatExplore node and view the results. The Results - StatExplore window opens.

   Scrolling the Output window shows the average of TargetD as $15.82.

8. Close the Results - Stat window.

9. Select the PVA97NK node.

10. Select the Decisions... property. The Decision Processing - PVA97NK dialog box opens.

11. Select the Decisions tab.

12. Type solicit (in place of the word DECISION1) in the first row of the Decision Name column.

13. Type ignore (in place of the word DECISION2) in the second row of the Decision Name column.

14. Select the Decision Weights tab. The tab enables you to select a decision function (Maximize or Minimize) and to enter weight values for the decisions.

    The completed profit consequence matrix for this example is shown below.

                      Solicit    Ignore
       Response       15.14      0
       Non-response   -0.68      0


15. Type the profit values into the corresponding cells of the profit weight matrix: for Level 1, enter 15.14 under solicit; for Level 0, enter -0.68 under solicit; leave both ignore weights at 0.0.

16. Select OK to close the Decision Processing - PVA97NK dialog box.

17. Run the Model Comparison node.

    Note: It will take some time to run the analysis.


6.4 Profit Matrices

Profit Matrices

[Slide: the profit distribution for the solicit decision, with the primary outcome assigned the constant profit 15.14.]

Statistical decision theory is an aid to making optimal decisions from predictive models. Using decision theory, each target outcome is matched to a particular decision or course of action, and a profit value is assigned to each outcome and decision combination, both correct and incorrect. The profit value can be random and vary between cases. A vast simplification and common practice in prediction is to assume that the profit associated with each case, outcome, and decision is a constant. This is the default behavior of SAS Enterprise Miner. This simplifying assumption can lead to biased model assessment and incorrect prediction decisions. For the demonstration, the overall donation average minus the solicitation cost is used as the (constant) profit associated with the primary outcome and the solicit decision.


6-40

Chapter 6 Model Assessment

Profit Matrices solicit

secondary

•0.68

outcome profit distribution for solicit decision

48

Similarly, the solicitation cost is taken as the profit associated with the secondary outcome and the solicit decision.

Decision Expected Profits

   Expected Profit (solicit) = 15.14 x p1 - 0.68 x p0
   Expected Profit (ignore)  = 0
   Choose the decision with the larger expected profit.

Here p1 and p0 denote the estimated probabilities of the primary and secondary outcomes for a case. Making the reasonable assumption that there is no profit associated with the ignore decision, you can complete the profit matrix as shown. With the completed profit consequence matrix, you can calculate the expected profit associated with each decision. This equals the sum of the outcome/action profits multiplied by the outcome probabilities. The best decision for a case is the one that maximizes the expected profit for that case.
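A minimal sketch of this decision rule applied to a scored table follows; P_TargetB1 is assumed to hold the (prior-adjusted) probability of response, and the profit values come from the matrix above.

   /* Expected profit of each decision; choose the larger.                */
   data work.decisions;
      set work.scored;
      ev_solicit = 15.14*P_TargetB1 - 0.68*(1 - P_TargetB1);
      ev_ignore  = 0;
      length decision $ 8;
      if ev_solicit > ev_ignore then decision = 'solicit';
      else decision = 'ignore';
   run;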


Decision Threshold

                       Solicit    Ignore
   Primary outcome     15.14      0
   Secondary outcome   -0.68      0

   p1 > 0.68 / 15.82 => Solicit
   p1 < 0.68 / 15.82 => Ignore

When the elements of the profit consequence matrix are constants, prediction decisions depend solely on the estimated probability of response and a constant decision threshold, as shown above. Setting 15.14 x p1 - 0.68 x (1 - p1) > 0 and solving gives 15.82 x p1 > 0.68; that is, solicit whenever p1 exceeds 0.68/15.82 (about 0.043).

Average Profit

                       Solicit    Ignore
   Primary outcome     15.14      0
   Secondary outcome   -0.68      0

   Average profit = (15.14 x N_PS - 0.68 x N_SS) / N

   N_PS = number of solicited primary outcome cases
   N_SS = number of solicited secondary outcome cases
   N    = total number of assessment cases

A new fit statistic, average profit, can be used to summarize model performance. For the profit matrix shown, average profit is computed by multiplying the number of cases in each outcome/decision combination by the corresponding profit, adding across all outcome/decision combinations, and dividing by the total number of cases in the assessment data.
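As a small worked illustration with hypothetical counts (not taken from the course data): if the decision rule solicits 2,000 of N = 4,843 validation cases, of which N_PS = 150 are responders and N_SS = 1,850 are non-responders, then average profit = (15.14 x 150 - 0.68 x 1,850) / 4,843 = (2,271 - 1,258) / 4,843, or about 0.21 per case.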


6-42

Chapter 6 Model Assessment

Evaluating Model Profit

The consequences o f incorporating a profit matrix in the analysis can be v i e w e d i n the M o d e l C o m p a r i s o n node. F o l l o w these steps to evaluate a model w i t h average profit; 1.

Select R e s u l t s . . . in the Run Status d i a l o g box. The Results - N o d e : M o d e l C o m p a r i s o n w i n d o w opens. E

RemH*-NmI<- Model C w n i n m u n Di.irjt.urL Predictive Andlyu*

ojQiiiidi'ii

331

HOC Chart I TargetB •Mi Sola - THAN

lrari £nor

DB-

0

1

T i m Sun of S p a r t a Errors

075-

£

//

1 £

Avcrpje PlOII

1D0-

10-

ooDO

Rtg Neural TTM TraoS

Rag Neural Tit* TrenJ

RtoriKion Neural Not Di'tision IJ Piohsbiiiti .

ooo02

04 06 1 -Sp.tHietty

Bt

CD

o:

0*

OE

jS

10

1 • Specificity

Out RU. • TPJJH

Student J u l y 08, J009 L5i08:26

PtUI ITIUI

0«iBU..V>i1ClTE

1

Tt l i n i n g Output

V a r i a b l e Surte.iy Heasuieaent

0

- rijural rMiA^rv

Regreiiwi

10

40

SO

KJ

1C0

- Pi-- U 3111 it, T i w |

rcequsncy Count

Targets TajgelB T-jrjetQ Targeie

0.171304 0.169749 0.156675 0143923

14465*9 1434379 1*46382 14*7576

331* 21* 3330873 4327.395 433962*


2. Maximize the Output window and go to line 30.

   The Fit Statistics table reports "Model Selection based on Valid: Average Profit (_VAPROF_)". The validation average profit is 0.17130 for the Regression model, 0.16975 for the Neural Network, 0.15667 for the Decision Tree, and 0.14293 for the Probability Tree.

   With a profit matrix defined, model selection in the Model Comparison node is performed on validation profit. Based on this criterion, the selected model is the Regression model. The Neural Network is a close second.


Viewing Additional Assessments

This demonstration shows several other assessments of possible interest.

1. Maximize the Score Rankings Overlay window.

2. Select Chart ⇨ Total Expected Profit.

   The Total Expected Profit plot shows the amount of value in each demi-decile (5%) block of data. It turns out that all models select approximately 60% of the cases (although the cases in one model's 60% might not be in another model's 60%).


3. Select Chart ⇨ Cumulative % Captured Response.

4. Double-click the Data Role = VALIDATE chart.

   This plot shows sensitivity (also known as Cumulative % Captured Response) versus selection fraction (Decile). By selecting 60% of the data, for example, you "capture" about 75% of the primary outcome cases.

5. Close the Results window.


Optimizing with Profit (Self-Study)

The models fit in the previous demonstrations were optimized to minimize average error. Because it is the most general optimization criterion, the best model selected by this criterion can both rank and decide cases. If the ultimate goal of your model is to create prediction decisions, it might make sense to optimize on that criterion. After you define a profit matrix, it is possible to optimize your model strictly on profit. Instead of seeking the model with the best prediction estimates, you find the model with the best prediction decisions (those that maximize expected profit).

Note: The default model selection method in SAS Enterprise Miner is validation profit optimization, so these settings essentially restore the node defaults. Finding a meaningful profit matrix for most modeling scenarios, however, is difficult. Therefore, these notes recommend overriding the defaults and creating models with a general selection criterion such as validation average squared error.

Decision Tree Profit Optimization

One of the two tree models in the diagram is set up this way:

1. Select the original Decision Tree node.

2. Examine the Decision Tree node's Subtree property. The Method is Assessment and the Assessment Measure is Decision.

   The Decision Tree node is pruned using the default assessment measure, Decision. The goal of this tree is to make the best decisions rather than the best estimates.

3. Examine the Probability Tree node's Subtree property. Its Assessment Measure is Average Square Error.

   The Probability Tree node is pruned using average squared error. The goal of this tree is to provide the best prediction estimates rather than the best decisions.


Regression Profit Optimization

Use the following settings to optimize a regression model using profit:

1. Select the Regression node.

2. Select Selection Criterion ⇨ Validation Profit/Loss.

3. Repeat steps 1 and 2 for the Polynomial Regression node.

Neural Network Profit Optimization

1. Select the Neural Network node.

2. Select Model Selection Criterion ⇨ Profit/Loss.

Note: The AutoNeural node does not support profit matrices.

The next step refits the models using profit optimization. Be aware that the demonstrations in Chapters 8 and 9 assume that the models are fit using average squared error optimization.

3. Run the Model Comparison node and view the results.

4. Maximize the Output window and view lines 20-35.

   The Fit Statistics table again reports "Model Selection based on Valid: Average Profit (_VAPROF_)". The validation average profit is now 0.17660 for the Neural Network, 0.15934 for the Regression model, 0.15667 for the Decision Tree, and 0.14293 for the Probability Tree.

   The reported validation profits are now slightly higher, and the relative ordering of the models changed.


Exercises

1. Assessing Models

   a. Connect all models in the ORGANICS diagram to a Model Comparison node.

   b. Run the Model Comparison node and view the results. Which model has the best ROC curve? What is the corresponding ROC Index?

   c. What is the lift of each model at a selection depth of 40%?


6.5 Chapter Summary

The Model Comparison node enables you to compare statistics and statistical graphics summarizing model performance. To make assessments applicable to the scoring population, you must account for differences in response proportions between the training, validation, and scoring data. While you can choose to select a model based strictly on statistical measures, you can also consider profit as a measure of model performance. For each outcome, an appropriate decision must be defined. A profit matrix is constructed, characterizing the profit associated with each outcome and decision combination. The decision tools in SAS Enterprise Miner support only constant profit matrices.

Assessment Tools Review
• Compare model summary statistics and statistical graphics (Model Comparison node).
• Create decision data; add prior probabilities and profit matrices (Input Data node).
• Tune models with average squared error or an appropriate profit matrix.
• Obtain means and other statistics on data source variables (StatExplore node).


6.6 Solutions

Solutions to Exercises

1. Assessing Models

   a. Connect all models in the ORGANICS diagram to a Model Comparison node.


   b. Run the Model Comparison node and view the results. Which model has the best ROC curve?

      The Decision Tree seems to have the best ROC curve. What is the corresponding ROC Index?

      The ROC Index values are found in the Fit Statistics window: approximately 0.8051 for Regression, 0.8201 for Regression (2), 0.8238 for Decision Tree, 0.8238 for Neural Network, and 0.8243 for Decision Tree (2).


   c. What is the lift of each model at a selection depth of 40%?

      1) Select the Validation Score Rankings chart.

      2) Right-click the chart and select Data Options... from the Option menu.

      3) Select the Where tab.

      4) Add the text & (DECILE=40) to the end of the selection string.

      5) Select Apply.

      6) Select OK.

      The Score Rankings plot now shows the lift at the 40th percentile. You can read the values for each model from the plot.




Chapter 7  Model Implementation

7.1  Introduction ....................................................... 7-3
7.2  Internally Scored Data Sets ........................................ 7-4
     Demonstration: Creating a Score Data Source ........................ 7-5
     Demonstration: Scoring with the Score Tool ......................... 7-6
     Demonstration: Exporting a Scored Table (Self-Study) ............... 7-9
7.3  Score Code Modules ................................................. 7-15
     Demonstration: Creating a SAS Score Code Module .................... 7-16
     Demonstration: Creating Other Score Code Modules ................... 7-22
     Exercises .......................................................... 7-24
7.4  Chapter Summary .................................................... 7-25
7.5  Solutions to Exercises ............................................. 7-26




7.1 Introduction

Model Implementation

Predictions might be added to a data source inside and outside of SAS Enterprise Miner.

After you train and compare predictive models, one model is selected to represent the association between the inputs and the target. After it is selected, this model must be put to use. The contribution of SAS Enterprise Miner to model implementation is a scoring recipe capable of adding predictions to any data set structured in a manner similar to the training data. SAS Enterprise Miner offers two options for model implementation.

• Internally scored data sets are created by combining the Score tool with a data set identified for scoring.

  Note: A copy of the scored data set is stored on the SAS Foundation server assigned to your project. If the data set to be scored is very large, you should consider scoring the data outside the SAS Enterprise Miner environment and use the second deployment option.

• Scoring code modules are used to generate predicted target values in environments outside of SAS Enterprise Miner. SAS Enterprise Miner can create scoring code in the SAS, C, and Java programming languages. The SAS language code can be embedded directly into a SAS Foundation application to generate predictions. The C and Java language code must be compiled. The C code should compile with any C compiler that supports the ISO/IEC 9899 International Standard for Programming Languages - C.



7.2 Internally Scored Data Sets

To create an internally scored data set, you need to define a Score data source, integrate the Score data source and Score tool into your process flow diagram, and (optionally) relocate the scored data set to a library of your choice.



Creating a Score Data Source

Creating a score data source is similar to creating a modeling data source.

1. Right-click Data Sources in the Project panel and select Create Data Source. The Data Source Wizard opens.

2. Select the table ScorePVA97NK in the AAEM library.

3. Select Next > until you reach Data Source Wizard - Step 6 of 7 Data Source Attributes.

4. Select Score as the role.

5. Select Next > to move to Step 7.

6. Select Finish. You now have a data source that is ready for scoring.



Scoring with the Score Tool

The Score tool attaches model predictions from a selected model to a score data set.

1. Select the Assess tab.

2. Drag a Score tool into the diagram workspace.

3. Connect the Model Comparison node to the Score node as shown.

   The Score node creates predictions using the model deemed best by the Model Comparison node (in this case, the Regression model).

   Note: If you want to create predictions using a specific model, either delete the connections from the models that you do not want to use to the Model Comparison node, or connect the Score node directly to the desired model and continue as described below.

4. Drag the ScorePVA97NK data source into the diagram workspace.

5. Connect the ScorePVA97NK data source node to the Score node as shown.


6. Run the Score node and view the results. The Results - Score window opens.

7. Maximize the Output window. The item of greatest interest is a table of new variables added to the Score data set.

8. Go to line 150.

   Score Output Variables

   Variable Name          Function         Creator    Label
   D_TARGETB              DECISION         Reg        Decision: TargetB
   EM_CLASSIFICATION      CLASSIFICATION   Score      Prediction for TargetB
   EM_EVENTPROBABILITY    PREDICT          Score      Probability for level 1 of TargetB
   EM_PROBABILITY         PREDICT          Score      Probability of Classification
   EM_SEGMENT             TRANSFORM        Score
   EP_TARGETB             ASSESS           Reg        Expected Profit: TargetB
   I_TargetB              CLASSIFICATION   Reg        Into: TargetB
   LOG_GiftAvgAll         TRANSFORM        Trans
   LOG_GiftCnt36          TRANSFORM        Trans
   P_TargetB0             PREDICT          Reg        Predicted: TargetB=0
   P_TargetB1             PREDICT          Reg        Predicted: TargetB=1
   U_TargetB              CLASSIFICATION   Reg        Unnormalized Into: TargetB
   _WARN_                 ASSESS           Reg        Warnings
   b_TargetB              TRANSFORM        MdlComp

9. Close the Results window.


Exporting a Scored Table (Self-Study)

1. Select Exported Data ⇨ ... from the Score node Properties panel. The Exported Data - Score dialog box opens.

   The dialog box lists the Train, Validate, Test, and Score ports with their exported tables (EMWS.Score_TRAIN, EMWS.Score_VALIDATE, EMWS.Score_TEST, and EMWS.Score_SCORE). Data exists for every port except Test.

2. Select the Score table.


3. Select Explore... The Explore window opens with a 20,000-row sample of the EMWS2.Score_SCORE data set (the internal name in SAS Enterprise Miner for the scored ScorePVA97NK data).

   This scored table is situated on the SAS Foundation server in the current project directory. You might want to place a copy of this table in another location. The easiest way to do this is with a SAS Code node.

4. Close the Explore window.

5. Select the Utility tab.

6. Drag a SAS Code node into the diagram workspace.


7.2 Internally Scored Data Sets 7.

Connect the Score node to the SAS Code node as shown.

8.

Select Code Editor <=>

in the SAS Code node's Properties panel.

Property

General

Node ID Imported Data Exported Data Notes

Value EMCODE

J

...

Train

Variables Code Editor Tool Type Data Needed Rerun Use Priors

Utility No No Yes

Advisor Type Publish Code Code Format

Basic Publish Data step

Score

u

u

7-11


7-12

Chapter 7 Model Implementation

The Training Code window opens. 1 B T Training Code -Cod sNodt File

u

Edit

Run

View

o

La

I @

Train EM_REGISTER EM_REPORT

EM_DATA2CODE EM D E C D A T A ;EM_CHECKMACRO 'EM_CHECKSETINIT EM_ODSLISTON EM ODSLISTOFF Macros

Macro Variables

Variables \

Training Code

-

AT

Output | Log ]

2

Istudent a s student - My Project - Predictive Analysis - EMCODE - STATUS=NONE LASTSTATUS=NOrJE

The Training Code window enables you to add new functionality to SAS Enterprise Miner by accessing the scripting language of SAS.


7.2 Internally Scored Data Sets    7-13

9. Type the following program in the Training Code window:

   data AAEM61.ExportedScoreData;
      set &EM_IMPORT_SCORE;
   run;

This program creates a new table named ExportedScoreData in the AAEM61 library.

10. Close the Training Code window and save the changes.

11. Run the SAS Code node. You do not need to view the results.
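If the copy should go somewhere other than a predefined project library, you can assign your own libref inside the same SAS Code node. This is only a sketch: the libref MYOUT and its path are hypothetical placeholders, and &EM_IMPORT_SCORE is the same macro variable used above, which resolves to the scored table that the Score node passes to the SAS Code node.

   libname myout 'C:\exported';    /* hypothetical location - any directory the SAS Foundation server can write to */

   data myout.ExportedScoreData;   /* copy of the scored table in the MYOUT library */
      set &EM_IMPORT_SCORE;        /* resolves to the imported score data set       */
   run;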


7-14

Chapter 7 Model Implementation

12. Select View ⇨ Explorer... from the SAS Enterprise Miner menu bar. The Explorer window opens.

[Explorer window: the list of SAS libraries (Aaem61, Apfmtlib, Maps, Sampsio, Sasdata, Sashelp, Sasuser, Stpsamp, Work, Wrsdist, and Wrstemp) with their engines and paths.]

13. Select the Aaem61 library.

[Explorer window: the contents of the Aaem61 library, including the Exportedscoredata table alongside course tables such as Bank, Census2000, Credit, Dottrain, Dotvalid, Dungaree, Inq2005, Insurance, Organics, Pva97nk, Scoreorganics, Scorepva97nk, Transactions, Webseg, and Webstation.]

The Aaem61 library contains the Exportedscoredata table created by your SAS Code node. You can modify the SAS Code node to place the scored data in any library that is visible to the SAS Foundation server.

14. Close the Explorer window.


7.3 Score Code Modules

7-15

7.3 Score Code Modules

Model deployment usually occurs outside of SAS Enterprise Miner and sometimes even outside of SAS. To accommodate this need, SAS Enterprise Miner is designed to provide score code modules to create predictions from properly prepared tables. In addition to the prediction code, the score code modules include all the transformations that are found in your modeling process flow. You can save the code as a SAS, C, or Java program.


7-16

Chapter 7 Model Implementation

Creating a SAS Score Code Module

The SAS Score Code module is opened by default when you open the Score node.

1. Open the Score node Results window.

2. Maximize the SAS Code window.

The SAS Code window shows the SAS DATA step code that is necessary to append predictions from the selected model (in this case, the Regression model) to a score data set. Each node in the process flow can contribute to the DATA step code. The following list describes some highlights of the generated SAS code:

• Go to line 12. This code removes the spurious zero from the median income input.

   * TOOL: Extension Class;
   * TYPE: MODIFY;
   * NODE: Repl;
   * Variable: DemMedIncome;
   label REP_DemMedIncome = "Replacement: DemMedIncome";
   REP_DemMedIncome = DemMedIncome;
   if DemMedIncome ne . and DemMedIncome < 1 then REP_DemMedIncome = .;

• Go to line 36. This code takes the log transformation of selected inputs.

   * TRANSFORM: GiftAvg36, log(GiftAvg36 + 1);
   label LOG_GiftAvg36 = 'Transformed: Gift Amount Average 36 Months';
   if GiftAvg36 + 1 > 0 then LOG_GiftAvg36 = log(GiftAvg36 + 1);
   else LOG_GiftAvg36 = .;

   * TRANSFORM: GiftAvgAll, log(GiftAvgAll + 1);
   label LOG_GiftAvgAll = 'Transformed: Gift Amount Average All Months';
   if GiftAvgAll + 1 > 0 then LOG_GiftAvgAll = log(GiftAvgAll + 1);
   else LOG_GiftAvgAll = .;


7.3 Score Code Modules

• Go to line 84. This code replaces the levels of the StatusCat96NK input.

   * TOOL: Extension Class;
   * TYPE: MODIFY;
   * NODE: Repl2;
   * Defining New Variables;
   length REP_StatusCat96NK $5;
   label REP_StatusCat96NK = "Replace: Status Category 96NK";
   REP_StatusCat96NK = StatusCat96NK;
   * Replace Specific Class Levels;
   length _UFORMAT200 $200;
   drop _UFORMAT200;
   _UFORMAT200 = "";
   * Variable: StatusCat96NK;
   if strip(StatusCat96NK) = "A" then REP_StatusCat96NK = "A";
   if strip(StatusCat96NK) = "S" then REP_StatusCat96NK = "A";
   if strip(StatusCat96NK) = "F" then REP_StatusCat96NK = "N";
   if strip(StatusCat96NK) = "N" then REP_StatusCat96NK = "N";
   if strip(StatusCat96NK) = "E" then REP_StatusCat96NK = "L";
   if strip(StatusCat96NK) = "L" then REP_StatusCat96NK = "L";
7-17


7-18

Chapter 7 Model Implementation

• Go to line 118. This code replaces missing values and creates missing value indicators.

   * TOOL: Imputation;
   * TYPE: MODIFY;
   * NODE: Impt;
   * MEAN-MEDIAN-MIDRANGE AND ROBUST ESTIMATES;
   label IMP_DemAge = 'Imputed: Age';
   IMP_DemAge = DemAge;
   if DemAge = . then IMP_DemAge = 59.262912087912;
   label IMP_LOG_GiftAvgCard36 = 'Imputed: Transformed: Gift Amount Average Card 36 Months';
   IMP_LOG_GiftAvgCard36 = LOG_GiftAvgCard36;
   if LOG_GiftAvgCard36 = . then IMP_LOG_GiftAvgCard36 = 2.5855317177381;
   label IMP_REP_DemMedIncome = 'Imputed: Replacement: DemMedIncome';
   IMP_REP_DemMedIncome = REP_DemMedIncome;
   if REP_DemMedIncome = . then IMP_REP_DemMedIncome = 53570.8504928806;

   * INDICATOR VARIABLES;
   label M_DemAge = "Imputation Indicator for DemAge";
   if DemAge = . then M_DemAge = 1;
   else M_DemAge = 0;
   label M_LOG_GiftAvgCard36 = "Imputation Indicator for LOG_GiftAvgCard36";
   if LOG_GiftAvgCard36 = . then M_LOG_GiftAvgCard36 = 1;
   else M_LOG_GiftAvgCard36 = 0;
   label M_REP_DemMedIncome = "Imputation Indicator for REP_DemMedIncome";
   if REP_DemMedIncome = . then M_REP_DemMedIncome = 1;
   else M_REP_DemMedIncome = 0;


7.3 Score Code Modules

• Go to line 160. This code comes from the Regression node. It is this code that actually adds the predictions to a Score data set.

   * TOOL: Regression;
   * TYPE: MODEL;
   * NODE: Reg;
   *** begin scoring code for regression;
   length _WARN_ $4;
   label _WARN_ = 'Warnings';
   length I_TargetB $12;
   label I_TargetB = 'Into: TargetB';
   label U_TargetB = 'Unnormalized Into: TargetB';
   *** Target Values;
   array REGDRF [2] $12 _temporary_ ('1' '0');
   *** Unnormalized target values;
   array REGDRU [2] _temporary_ (1 0);
   drop _DM_BAD;
   _DM_BAD = 0;
   *** Check DemMedHomeValue for missing values;
   if missing( DemMedHomeValue ) then do;
      substr(_warn_,1,1) = 'M';
      _DM_BAD = 1;
   end;
   *** Check GiftTimeLast for missing values;
   if missing( GiftTimeLast ) then do;
      substr(_warn_,1,1) = 'M';
      _DM_BAD = 1;
   end;

7-19


7-20

Chapter 7 Model Implementation

• Go to line 290. This block of code comes from the Model Comparison node. It adds demi-decile bin numbers to the scored output. For example, bin 1 corresponds to the top 5% of the data as scored by the Regression model, bin 2 corresponds to the next 5%, and so on.

   * TOOL: Model Compare Class;
   * TYPE: ASSESS;
   * NODE: MdlComp;
   if (P_TargetB1 ge 0.09198928166881) then do;
      b_TargetB = 1;
   end;
   else if (P_TargetB1 ge 0.08014055773524) then do;
      b_TargetB = 2;
   end;
   else if (P_TargetB1 ge 0.07027014765811) then do;
      b_TargetB = 3;
   end;

7.3 Score Code Modules

7-21

• Go to line 380. This block of code comes from the Score node. It adds the following standardized variables to the scored data set:

   EM_CLASSIFICATION      Prediction for TargetB
   EM_DECISION            Recommended Decision for TargetB
   EM_EVENTPROBABILITY    Probability for Level 1 of TargetB
   EM_PROBABILITY         Probability of Classification
   EM_PROFIT              Expected Profit for TargetB
   EM_SEGMENT             Segment

   * TOOL: Score Node;
   * TYPE: ASSESS;
   * NODE: Score;
   * Score: Creating Fixed Names;
   label EM_SEGMENT = 'Segment';
   EM_SEGMENT = b_TargetB;
   label EM_EVENTPROBABILITY = 'Probability for level 1 of TargetB';
   EM_EVENTPROBABILITY = P_TargetB1;
   label EM_PROBABILITY = 'Probability of Classification';
   EM_PROBABILITY = max(P_TargetB1, P_TargetB0);
f

To use this code, you must embed it in a DATA step. The easiest way to do this is to save it as a SAS code file and include it in your DATA step.

3. Select File ⇨ Save As... to save this code to a location of your choice.
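As a sketch of that approach, suppose the module was saved as C:\scorecode\score.sas and the new cases to score are in a table named WORK.NEWDONORS; both names are hypothetical placeholders. A minimal wrapper DATA step would look like this:

   data work.scored;
      set work.newdonors;                  /* new cases with the same inputs as the training data */
      %include 'C:\scorecode\score.sas';   /* the score code saved from the Score node            */
   run;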


7-22

Chapter 7 Model Implementation

Creating Other Score Code Modules

To access scoring code for languages other than SAS, use the following procedure:

1. Select View ⇨ Scoring ⇨ C Score Code. The C Score Code window opens.

[C Score Code window, Scoring Function Metadata view: an XML header produced by SAS Enterprise Miner that lists the scoring inputs, for example DEMAGE and DEMMEDHOMEVALUE.]

There are four parts to the C Score Code window. The actual C score code part is accessible from the menu at the bottom of the C Score Code window.

[C Score Code window, Score Code view: generated C code with #include directives for standard headers and for "cscore.h" and "csparm.h", and #define statements that map inputs such as DEMAGE and DEMMEDHOMEVALUE to the indata array.]

7.3 Score Code Modules    7-23

2. Select View ⇨ Scoring ⇨ Java Score Code. The Java Score Code window opens.

[Java Score Code window, Scoring Function Metadata view: the same XML header listing the scoring inputs.]

There are five parts to the Java Score Code window. The actual Java score code part is accessible from the menu at the bottom of the Java Score Code window.

[Java Score Code window, Score Code view: a Score class in package eminer.user.Score that encapsulates the DATA step scoring code for the SAS Enterprise Miner Java Scoring Component (J-Score).]
7-23


7-24

Chapter 7 Model Implementation

Exercises

1. Scoring Organics Data

   a. Create a Score data source for the ScoreOrganics data.

   b. Score the ScoreOrganics data using the model selected with the Model Comparison node.


7.4 Chapter Summary

7-25

7.4  Chapter Summary

The Score node is used to score new data inside SAS Enterprise Miner and to create scoring modules for use outside SAS Enterprise Miner. The Score node adds predictions to any data source with a role of Score. This data source must have the same inputs as the training data. A scored data source is stored within the project directory. You can use a SAS Code tool to relocate it to another library. The Score tool creates score code modules in the SAS, C, and Java languages. These score code modules can be saved and used outside of SAS Enterprise Miner.

Model Implementation Tools Review

• Input Data: create a Score data source.
• Score: add predictions to Score data sources, and create SAS, C, and Java score code modules.
• SAS Code: use the SAS scripting language to export scored data outside a SAS Enterprise Miner project.


7-26

Chapter 7 Model Implementation

7.5  Solutions to Exercises

1. Scoring Organics Data

   a. Create a Score data source for the ScoreOrganics data.

      1) Select File ⇨ New ⇨ Data Source.
      2) Proceed to Step 2 of the Data Source Wizard by selecting Next >.
      3) Select the AAEM61.SCOREORGANICS data set.
      4) Proceed to the final step of the Data Source Wizard.

7.5 Solutions to Exercises

7-27

      5) Select Score as the role.

      [Data Source Wizard - Step 6 of 7: Data Source Attributes, with Name SCOREORGANICS and Role set to Score.]
      6) Select Finish.

   b. Score the ScoreOrganics data using the model selected with the Model Comparison node.

      1) Connect a Score tool to the Model Comparison node.
      2) Connect a ScoreOrganics data source to the Score node.

      [Process flow diagram: the SCOREORGANICS data source and the Model Comparison node both connected to the Score node.]

      3) Run the Score node.

7-28

Chapter 7 Model Implementation

      4) Browse the Exported data from the Score node to confirm the scoring process.

      [Exported data preview: the final columns of each scored row are Unnormalized Into: TargetBuy, Segment, Probability for level 1 of TargetBuy, Probability of Classification, and Prediction for TargetBuy.]

A successfully scored data set features predicted probabilities and prediction decisions in the last three columns.


Chapter 8 Introduction to Pattern Discovery

8.1  Introduction                                               8-3

8.2  Cluster Analysis                                           8-8
     Demonstration: Segmenting Census Data                      8-15
     Demonstration: Exploring and Filtering Analysis Data       8-24
     Demonstration: Setting Cluster Tool Options                8-37
     Demonstration: Creating Clusters with the Cluster Tool     8-41
     Demonstration: Specifying the Segment Count                8-44
     Demonstration: Exploring Segments                          8-46
     Demonstration: Profiling Segments                          8-56
     Exercises                                                  8-61

8.3  Market Basket Analysis (Self-Study)                        8-63
     Demonstration: Market Basket Analysis                      8-67
     Demonstration: Sequence Analysis                           8-80
     Exercises                                                  8-83

8.4  Chapter Summary                                            8-85

8.5  Solutions                                                  8-86
     Solutions to Exercises                                     8-86

8-2

Chapter 8 Introduction to Pattern Discovery


8.1 Introduction    8-3

8.1  Introduction

Pattern Discovery
The Essence of Data Mining?

"...the discovery of interesting, unexpected, or valuable structures in large data sets."
- David Hand

There are a multitude of definitions for the field of data mining and knowledge discovery. Most center on the concept of pattern discovery. For example, David Hand, Professor of Statistics at Imperial College, London and a noted data mining authority, defines the field as "...the discovery of interesting, unexpected, or valuable structures in large data sets." (Hand 2005) This is made possible by the ever-increasing data stores brought about by the era's information technology.

8-4

Chapter 8 Introduction to Pattern Discovery

Pattern Discovery
The Essence of Data Mining?

"...the discovery of interesting, unexpected, or valuable structures in large data sets."
- David Hand

"If you've got terabytes of data, and you're relying on data mining to find interesting things in there for you, you've lost before you've even begun."
- Herb Edelstein

While Hand's pronouncement is grandly promising, experience has shown it to be overly optimistic. Herb Edelstein, President of Two Crows Corporation and an internationally recognized expert in data mining, data warehousing, and CRM, counters with the following (Beck (Editor) 1997):

"If you've got terabytes of data, and you're relying on data mining to find interesting things in there for you, you've lost before you've even begun. You really need people who understand what it is they are looking for - and what they can do with it once they find it."

Many people think data mining (in particular, pattern discovery) means magically finding hidden nuggets of information without having to formulate the problem and without regard to the structure or content of the data. This is an unfortunate misconception.

8.1 Introduction

8-5

Pattern Discovery Caution

• Poor data quality
• Opportunity
• Interventions
• Separability
• Obviousness
• Non-stationarity

In his defense, David Hand is well aware of the limitations of pattern discovery and provides guidance on how these analyses can fail (Hand 2005). These failings often fall into one of six categories:

• Poor data quality assumes many guises: inaccuracies (measurement or recording errors), missing, incomplete, or outdated values, and inconsistencies (changes of definition). Patterns found in false data are fantasies.

• Opportunity transforms the possible to the perceived. Hand refers to this as the problem of multiplicity, or the law of truly large numbers. Examples of this abound. Hand notes the odds of a person winning the lottery in the United States are extremely small, and the odds of that person winning it twice are fantastically so. However, the odds of someone in the United States winning it twice (in a given year) are actually better than even. As another example, you can search the digits of pi for "prophetic" strings such as your birthday or significant dates in history and usually find them, given enough digits (www.angio.net/pi/piquery).

• Intervention, that is, taking action on the process that generates a set of data, can destroy or distort detected patterns. For example, fraud detection techniques lead to preventative measures, but the fraudulent behavior often evolves in response to this intervention.

• Separability of the interesting from the mundane is not always possible, given the information found in a data set. Despite the many safeguards in place, it is estimated that credit card companies lose $0.18 to $0.24 per $100 in online transactions (Rubinkam 2006).

• Obviousness in discovered patterns reduces the perceived utility of an analysis. Among the patterns discovered through automatic detection algorithms, you find that there is an almost equal number of married men as married women, and you learn that ovarian cancer occurs primarily in women and that check fraud occurs most often for customers with checking accounts.

• Non-stationarity occurs when the process that generates a data set changes of its own accord. In such circumstances, patterns detected from historic data can simply cease. As Eric Hoffer states, "In times of change, learners inherit the Earth, while the learned find themselves beautifully equipped to deal with a world that no longer exists."

8-6

Chapter 8 Introduction to Pattern Discovery

Pattern Discovery Applications

• Data reduction
• Novelty detection
• Profiling
• Market basket analysis
• Sequence analysis

Despite the potential pitfalls, there are many successful applications of pattern discovery:

• Data reduction is the most ubiquitous application, that is, exploiting patterns in data to create a more compact representation of the original. Though vastly broader in scope, data reduction includes analytic methods such as cluster analysis.

• Novelty detection methods seek unique or previously unobserved data patterns. The methods find application in business, science, and engineering. Business applications include fraud detection, warranty claims analysis, and general business process monitoring.

• Profiling is a by-product of reduction methods such as cluster analysis. The idea is to create rules that isolate clusters or segments, often based on demographic or behavioral measurements. A marketing analyst might develop profiles of a customer database to describe the consumers of a company's products.

• Market basket analysis, or association rule discovery, is used to analyze streams of transaction data (for example, market baskets) for combinations of items that occur (or do not occur) more (or less) commonly than expected. Retailers can use this as a way to identify interesting combinations of purchases or as predictors of customer segments.

• Sequence analysis is an extension of market basket analysis to include a time dimension to the analysis. In this way, transactions data is examined for sequences of items that occur (or do not occur) more (or less) commonly than expected. A Webmaster might use sequence analysis to identify patterns or problems of navigation through a Web site.


8.1 Introduction

8-7

Pattern Discovery Tools

[Slide: the Cluster, SOM/Kohonen, and Segment Profile tools address data reduction, novelty detection, and profiling.]

The first three pattern discovery applications are primarily served (in no particular order) by three tools in SAS Enterprise Miner: Cluster, SOM/Kohonen, and Segment Profile. The next section features a demonstration of the Cluster and Segment Profile tools.

Pattern Discovery Tools

[Slide: the Association and Path Analysis tools address market basket analysis and sequence analysis.]

Market basket analysis and sequence analysis are performed by the Association tool. The Path Analysis tool can also be used to analyze sequence data. (An optional demonstration of the Association tool is presented at the end of this chapter.)


8-8

Chapter 8 Introduction to Pattern Discovery

8.2  Cluster Analysis

Unsupervised Classification

Unsupervised classification: grouping of cases based on similarities in input values.

Unsupervised classification (also known as clustering and segmenting) attempts to group training data set cases based on similarities in input variables. It is a data reduction method because an entire training data set can be represented by a small number of clusters. The groupings are known as clusters or segments, and they can be applied to other data sets to classify new cases. It is distinguished from supervised classification (also known as predictive modeling), which is discussed in previous chapters.

The purpose of clustering is often description. For example, segmenting existing customers into groups and associating a distinct profile with each group might help future marketing strategies. However, there is no guarantee that the resulting clusters will be meaningful or useful.

Unsupervised classification is also useful as a step in predictive modeling. For example, customers can be clustered into homogenous groups based on sales of different items. Then a model can be built to predict the cluster membership based on more easily obtained input variables.

8.2 Cluster Analysis    8-9

k-means Clustering Algorithm

1. Select inputs.
2. Select k cluster centers.
3. Assign cases to closest center.
4. Update cluster centers.
5. Reassign cases.
6. Repeat steps 4 and 5 until convergence.

One of the most commonly used methods for clustering is the k-means algorithm. It is a straightforward algorithm that scales well to large data sets and is, therefore, the primary tool for clustering in SAS Enterprise Miner.

While often overlooked as an important part of a clustering process, the first step in using the k-means algorithm is to choose a set of inputs. In general, you should seek inputs that have the following attributes:

• are meaningful to the analysis objective
• are relatively independent
• are limited in number
• have a measurement level of Interval
• have low kurtosis and skewness statistics (at least in the training data)

Choosing meaningful inputs is clearly important for interpretation and explanation of the generated clusters. Independence and limited input count make the resulting clusters more stable. (Small perturbations of training data usually do not result in large changes to the generated clusters.) An interval measurement level is recommended for k-means to produce non-trivial clusters. Low kurtosis and skewness statistics on the inputs avoid creating single-case outlier clusters.
[Slide: k-means Clustering Algorithm, steps 1-6, illustrating step 2 (select k cluster centers).]

The next step in the k-means algorithm is to choose a value for k, the number of cluster centers. SAS Enterprise Miner features an automatic way to do this, assuming that the data has k distinct concentrations of cases. If this is not the case, you should choose k to be consistent with your analytic objectives.

With k selected, the k-means algorithm chooses cases to represent the initial cluster centers (also named seeds).
8.2 Cluster Analysis    8-11

[Slide: k-means Clustering Algorithm, steps 1-6, illustrating step 3 (assign cases to closest center).]

The Euclidean distance from each case in the training data to each cluster center is calculated. Cases are assigned to the closest cluster center.

Because the distance metric is Euclidean, it is important for the inputs to have compatible measurement scales. Unexpected results can occur if one input's measurement scale differs greatly from the others.
[Slide: k-means Clustering Algorithm, steps 1-6, illustrating step 4 (update cluster centers).]

The cluster centers are updated to equal the average of the cases assigned to the cluster in the previous step.

8-12

Chapter 8 Introduction to Pattern Discovery

[Slide: k-means Clustering Algorithm, steps 1-6, illustrating step 5 (reassign cases).]

C a s e s are r e a s s i g n e d t o t h e c l o s e s t c l u s t e r c e n t e r .

/c-means Clustering Algorithm Training

•d S p

Data

* '

H

1. Select inputs.

H

2. S e l e c t ft c l u s t e r c e n t e r s .

H

3. A s s i g n c a s e s t o c l o s e s t 4. U p d a t e c l u s t e r c e n t e r s .

^|

*

1717

5

R e a s s i g n cas'ps

6. R e p e a t s t e p s 4 a n d 5 until convergence.


8.2 Cluster Analysis

[Slide: k-means Clustering Algorithm, steps 1-6.]

The update and reassign steps are repeated until the process converges.

[Slide: k-means Clustering Algorithm, steps 1-6, final cluster assignments.]

On convergence, final cluster assignments are made. Each case is assigned to a unique segment. The segment definitions can be stored and applied to new cases outside of the training data.
8-13
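The Cluster tool performs these steps for you, but the same algorithm is also available directly in Base SAS through PROC FASTCLUS. The sketch below is illustrative only: the input list and the choice of five clusters are arbitrary assumptions, not the settings used in the demonstration that follows.

   proc fastclus data=aaem61.census2000   /* the CENSUS2000 table introduced in the next demonstration */
                 maxclusters=5            /* k: the number of cluster centers (seeds)                   */
                 maxiter=100              /* repeat the update/reassign steps until convergence         */
                 out=work.segments;       /* output data with a CLUSTER (segment) number for each case  */
      var MeanHHSz MedHHInc RegDens;      /* interval inputs chosen for this sketch                     */
   run;

Because assignments are based on Euclidean distance, inputs on very different scales should be standardized (for example, with PROC STDIZE) before clustering.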


8-14

Chapter 8 Introduction to Pattern Discovery

While they are often used synonymously, a segmentation analysis is distinct from a traditional cluster analysis. A cluster analysis is geared toward identifying distinct concentrations of cases in a data set. When no distinct concentrations exist, the best you can do is a segmentation analysis, that is, algorithmically partitioning the input space into contiguous groups of cases.

Demographic Segmentation Demonstration

Analysis goal: Group geographic regions into segments based on income, household size, and population density.

Analysis plan:
• Select and transform segmentation inputs.
• Select the number of segments to create.
• Create segments with the Cluster tool.
• Interpret the segments.

The following demonstration illustrates the use of clustering tools. The goal of the analysis is to group people in the United States into distinct subsets based on urbanization, household size, and income factors. These factors are common to commercial lifestyle and life-stage segmentation products. (For examples, see www.claritas.com or www.spectramarketing.com.)

8.2 Cluster Analysis    8-15

Segmenting Census Data

This demonstration introduces SAS Enterprise Miner tools and techniques for cluster and segmentation analysis. There are five parts:

• define the diagram and data source
• explore and filter the training data
• integrate the Cluster tool into the process flow and select the number of segments to create
• run a segmentation analysis
• use the Segment Profile tool to interpret the analysis results

Diagram Definition

Use the following steps to define the diagram for the segmentation analysis.

1. Right-click Diagrams in the Project panel and select Create Diagram. The Create New Diagram window opens and requests a diagram name.

[Create New Diagram dialog box with a Diagram Name field and OK and Cancel buttons.]

Chapter 8 Introduction to Pattern Discovery    8-16

2. Type Segmentation Analysis in the Diagram Name field and select OK. SAS Enterprise Miner creates an analysis workspace window named Segmentation Analysis.

[SAS Enterprise Miner interface with the new Segmentation Analysis diagram workspace open.]

You use the Segmentation Analysis window to create process flow diagrams.


8.2 Cluster Analysis    8-17

Data Source Definition

Follow these steps to create the segmentation analysis data source.

1. Right-click Data Sources in the Project panel and select Create Data Source. The Data Source Wizard opens.

[Data Source Wizard - Step 1 of 7: Metadata Source, with a Source list and Next > and Cancel buttons.]

The Data Source Wizard guides you through a seven-step process to create a SAS Enterprise Miner data source.

2. Select Next > to use a SAS table as the source for the metadata. (This is the usual choice.) The Data Source Wizard proceeds to Step 2. In this step, select the SAS table that you want to make available to SAS Enterprise Miner. You can either type the library name and SAS table name as libname.tablename or select the SAS table from a list.

8-18

Chapter 8 Introduction to Pattern Discovery

3. Select Browse... to choose a SAS table from the libraries visible to the SAS Foundation server. The Select a SAS Table dialog box opens.

[Select a SAS Table dialog box listing the SAS libraries: Aaem61, Apfmtlib, Maps, Sampsio, Sasdata, Sashelp, Sasuser, Stpsamp, and Work.]

One of the libraries listed is named AAEM61, which is the library name defined in the project start-up code.

4.

Select the

Aaem61

library and the

Census2000

8-19

S A S table.

g T Select a SAS Table Name

j SAS Libraries BaJAaemSl 1 J Apfmtlib - j j l Maps

i n

2004-01-19

2004-01-19

780KB

Census2000

DATA

2006-07-05

2006-07-05

1MB

DATA

2006-09-26

2006-09-26

372KB

DATA

2006-07-25

2006-07-25

32KB

DATA

2006-07-25

2006-07-25

32KB

DATA

2004-01-19

2004-01-19

40KB

DATA

2009-07-08

2009-07-08

41MB

DATA

2006-09-19

2006-09-19

9MB

DATA

2006-09-19

2006-09-19

5MB

Q Credit ""J Dottrain

. ) Sasdata

Dotvalid

Sasuser Stpsamp Work

—

H Dungaree ; = ] Exportedscoredata | |

Inq2005

i H Insurance

Refresh

The C e n s u s 2 0 0 0

Size

DATA

7) Sampsio V) Sashelp

Created Date Modified D.,

Type

Bank

Properties...

OK

A.

Cancel

d a t a is a postal code-level s u m m a r y o f the e n t i r e 2 0 0 0 U n i t e d States Census.

It f e a t u r e s s e v e n v a r i a b l e s : ID

postal code o f the region

LOCX

region longitude

LOCY

region latitude

MEANKHSZ

a v e r a g e h o u s e h o l d size i n the r e g i o n

jMEDHHINC

m e d i a n household income in the region

REGDENS

region population density percentile ( S l o w e s t density, 100=highest density)

REGPOP

n u m b e r o f people i n the region

T h e data is suited f o r creation o f life-stage, lifestyle segments u s i n g S A S Enterprise M i n e r ' s pattern discovery tools. 5.

Select OK.


8-20

C h a p t e r 8 I n t r o d u c t i o n to P a t t e r n D i s c o v e r y

T h e Select a S A S

T a b l e d i a l o g b o x closes a n d t h e s e l e c t e d t a b l e is e n t e r e d i n t h e T a b l e

field.

ST.Data Source Wizard - Step 2 of 7 Select a SAS Table

Select a SAS table

l a b l e : 55 EM61.CENSUS200Q

Browse.

< Back

6.

Select

Next

Cancel

Next >

> . T h e D a t a S o u r c e W i z a r d p r o c e e d s t o S t e p 3.

Data Source Wizard — 5 t e p 3 o f 7 Table Information Table Properties

Table Name Description Member Type Data Set Type Engine Number of Variables Nurnbet of Observations Created Date Modified Date

Property

AAEM61 .CENSU52000

Value

DATA DATA BASE 7 33178 July 5, 2006 5:53:42 PM EDT July 5, 2006 5:53:42 PM EDT

<8ack

T h i s step o f the Data Source W i z a r d p r o v i d e s basic i n f o r m a t i o n

Next>

Cancel

about the selected table.


8.2 Cluster A n a l y s i s

7.

Select

Next

8-21

> . T h e D a t a S o u r c e W i z a r d proceeds to Step 4.

1 Metadata Advisor Options Use the basic setting to set the initial measurement levels and roles based on the variable attributes. Use the advanced setting to set the Initial measurement levels and roles based on both the variable attributes and distributions.

(*" Basic

I " Advanced

< Back

I Next >

Step 4 o f the D a t a S o u r c e W i z a r d starts the metadata d e f i n i t i o n process. S A S

Enterprise M i n e r

assigns i n i t i a l values t o the m e t a d a t a based on characteristics o f the selected S A S

table. The Basic

s e t t i n g a s s i g n s i n i t i a l v a l u e s t o t h e m e t a d a t a b a s e d o n v a r i a b l e a t t r i b u t e s s u c h as t h e v a r i a b l e n a m e , data type, and assigned S A S

f o r m a t . T h e A d v a n c e d s e t t i n g also i n c l u d e s i n f o r m a t i o n a b o u t the

distribution o f the variable to assign the initial metadata values.


8-22

C h a p t e r s Introduction t o Pattern Discovery

Select

Next

> t o use t h e B a s i c s e t t i n g . T h e D a t a S o u r c e W i z a r d p r o c e e d s t o S t e p 5.

Sf,Data Source Wizard — Step 5 of 7 Column Metadata | f " nol jEqual to

(none) Columns:

m

f

Name ID LocX LocY MeanHHSz MedHHInc RegDens RegPop

r

Label Role ID Input Input Input Input Input Input

|

Mining [

Level

Nominal Interval Interval Interval Interval Interval interval

3

Report

Apply

'J P

f " Basic Order

Drop

Reset

Statistic

Lower Limit

Upper Lit

No No No Mo No No No

No ~No No No No

Uo No

1 Show code

Explore

Compute Summary

< Back

: Next >

Cancel

Step 5 o f the D a t a Source W i z a r d enables y o u t o specify the role and level f o r each variable i n t h e s e l e c t e d S A S t a b l e . A d e f a u l t r o l e i s a s s i g n e d b a s e d o n the n a m e o f a v a r i a b l e . F o r e x a m p l e , t h e v a r i a b l e I D w a s g i v e n t h e role I D based o n its n a m e . W h e n a v a r i a b l e does n o t have a n a m e c o r r e s p o n d i n g t o o n e o f t h e p o s s i b l e v a r i a b l e r o l e s , i t w i l l , u s i n g t h e B a s i c s e t t i n g , be g i v e n t h e d e f a u l t r o l e o f input. A n I n p u t v a r i a b l e i s u s e d f o r v a r i o t i s t y p e s o f a n a l y s i s t o d e s c r i b e a c h a r a c t e r i s t i c , m e a s u r e m e n t , o r a t t r i b u t e o f a r e c o r d , o r case, i n a S A S t a b l e . T h e m e t a d a t a seltings are correct f o r the u p c o m i n g analysis.


8.2 Cluster A n a l y s i s

9.

Select

Next

8-23

> . T h e D a t a S o u r c e W i z a r d p r o c e e d s to Step 6.

ST Data Source Wizard — S t e p6 of 7 Data Source Attributes You may change the name and the role, and can specify a population segment identifier for the data source to be created.

Name:

CENS162000

gole:

[Raw

Segment:

f

3]

Notes:

< Back

Cancel

I Next >

S t e p 6 i n t h e D a t a S o u r c e W i z a r d e n a b l e s y o u t o set a r o l e f o r t h e s e l e c t e d S A S

table and p r o v i d e

descriptive c o m m e n t s about the data source d e f i n i t i o n . For the i m p e n d i n g analysis, a table role o f Raw

is a c c e p t a b l e .

S t e p 7 p r o v i d e s s u m m a r y i n f o r m a t i o n a b o u t t h e c r e a t e d d a t a set. ST Data Source Wizard — Step 7 of 7 Summary I .,

Metadata Completed. Library: AAEM61 Table: CENSUS2000 Role: Raw Role ID Input

Count

Level Nominal Interval

1 6

<Back

0. Select

Finish

to c o m p l e t e the data source definition. The

Data Sources entry in the Project panel.

|

CENSUS2000

frmsh

|

Cancel

t a b l e is a d d e d t o t h e


8-24

Chapter 8 Introduction to Pattern Discovery

Exploring and Filtering Analysis Data

A worthwhile next step in the process of defining a data source is to explore and validate its contents. By assaying the prepared data, you substantially reduce the chances of erroneous results in your analysis, and you can gain insights graphically into associations between variables.
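If you prefer to check the table outside the SAS Enterprise Miner Explore window, a couple of Base SAS steps give a similar picture. This is only a sketch; it assumes the AAEM61 libref assigned in the project start-up code.

   proc means data=aaem61.census2000 n nmiss min mean max;
      var MeanHHSz MedHHInc RegDens RegPop;   /* the interval inputs of CENSUS2000            */
   run;

   proc sgplot data=aaem61.census2000;
      histogram MeanHHSz;                     /* shows the spike at zero household size
                                                 that the demonstration examines below        */
   run;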

Data Source Exploration 1.

Right-click the CENSUS2000 data source and select Edit Variables... from the shortcut menu. The Variables - CENSUS2000 dialog box opens. Variables - CENSUS2000

(none)

- t | foot Equal to

Name

Role D Input Input Input Input Input Input

Apply

TiHirtng

Oatuntns! f" Label ID LocX LocY MeanHHSz MedHHInc [ReoDens [ReaPop

3r

Level Nominal Interval Interval Interval Interval Interval Interval

Report

r Order

Mo Mo Mo Mo Mo Mo No

Drop

Lower Limit

Reset

Statistics Upper L&rit

Mo Mo Mo Mo Mo Mo Mo Explore.

Ed* Using SAS Code

OK

2.

Examine histograms for the available variables.

3.

Select all listed inputs by dragging the cursor across all of the input names or by holding down the C T R L key and typing A.


8.2 Cluster A n a l y s i s

4.

8-25

S e l e c t Explore.... T h e E x p l o r e w i n d o w o p e n s , a n d t h i s t i m e d i s p l a y s h i s t o g r a m s f o r a l l o f t h e variables i n the

CENSUS2000

data source. I - ;C|x

Explore - flAEM61.CEMSU520oa gie

View

fctcttt

Window

JOL5J

|, i Sample Properties Value

Property

-

33178

Rows Columns Ubrary Member

7

Type Sample Method f elch Sfce Fetched Rows

15000-

&

MEM61 CENSUS2000 DATA Random Max '33178

S 100005000 -

n

il

SO

.176.6368 -1432335-109.8302 -7 Region Longitude

160.000

1120.001 1180.001

Median Household Income

tl

jlAAEMSI.

Obs»

1

| Region ID [Region L — jRegion L..

100601 200602 300603 400604 500606 600610 700612 800616 900617 1000622

-66.74047 -67.18025 -67.13422 -67.137 -66.95881 -67.13605 -66.6980 -66.67660 -66.55576 -67.16746

18.1801 G j J 1B.3632B 18.44861 18.49898 18 18215 18.28831 18.44973 1B.42674 18.45549 18.00312 _ i

3000

=I

I £

17S622

333634

43 3646

659656

Region ID

0 00

1.70 3 40

5.09

Average Household Size

205

40.6 604 602

1000

Region Density Percentile

Region Lalilude

U 00603 00606 00S12 00517 00602 00604 00610 00616 0062

2000-

5000 -

6.79

43 207

86.414

Region Population

1 29,62


C h a p t e r 8 Introduction to Pattern

8-26

5.

Discovery

M a x i m i z e t h e M e a n H H S i z e h i s t o g r a m b y d o u b l e - c l i c k i n g i t s t i t l e b a r . T h e h i s t o g r a m n o w fills t h e Explore

window. E E S

Explore - AAEM61.CENSUS2000 Ed*

tie

View

Actons

window

a l H l o K /

I 15000-

10000-

0) cr S U5000-

0.00

0.85

1.70

2.55

3.40

4.24

5.09

5.94

6.79

Average Household Size A s b e f o r e , i n c r e a s i n g t h e n u m b e r o f h i s t o g r a m b i n s f r o m t h e d e f a u l t o f 10 i n c r e a s e s y o u r u n d e r s t a n d i n g o f t h e data.

7.64

8.49


8.2 Cluster A n a l y s i s

6.

R i g h t - c l i c k i n the h i s t o g r a m w i n d o w a n d select

Graph Properties...

8-27

from the shortcut menu.

T h e Properties - H i s t o g r a m d i a l o g b o x opens. Properties - Histogram Sraph Histogram Axes Horizontal Vertical Title/Footnote

Bins K

Show Missing Bin

W Bin Axis |v End Labels T

Bin to Data Range

Boundary:

Upper

Number of X Bins:

10

Number ot Y Bins:

1

Color W Show Outline ft

Gradient

Gradient

ThreeColorRamp

OK

Apply

Cancel

Y o u can use the Properties - H i s t o g r a m d i a l o g b o x t o change the appearance o f the c o r r e s p o n d i n g histogram.


8-28

7.

C h a p t e r 8 Introduction to P a t t e r n Discovery

T y p e 100

in the N u m b e r

o f

X

Bins

f i e l d and select OK.

T h e h i s t o g r a m is u p d a t e d t o h a v e

100 b i n s . Explore - AAEM61.tENSUS20aO Efe

Edit

View

Actions

Wndow

4000-

3000-

u c

CD

9> 2000

1000-

1Tm7t -t> ^ i

[ • • • •r

• • • •i

t

I

I I I I , I I ...

I

0.00 0.42 0.85 1.27 1.70 2.12 2.55 2.97 3.40 3.82 4.24 4.67 5.09 5.52 5.94 6.37 6.79 7.22 7.64 8.07 8.49 Average Household

Size

T h e r e is a c u r i o u s s p i k e i n t h e h i s t o g r a m at ( o r n e a r ) z e r o . A z e r o h o u s e h o l d s i z e d o e s n o t m a k e s e n s e in the context o f census data. 8.

Select the bar near zero i n the h i s t o g r a m .


8.2

9.

Cluster A n a l y s i s

Restore the size o f the w i n d o w b y double-clicking the title bar o f the M e a n H H S i z e

8-29

window.

T h e w i n d o w r e t u r n s t o its o r i g i n a l size. Explore - AAEM6I.CENSUS2000

Fie

E*

Se"

froons

Window

#|H!01i>| Pi op1" Ly

Value

Sample Method Felch Size Fetched Rows

33178 7 MEM 61 C E N S U S 20 DO DATA Random Max 33178

ROW5

Columns Library Member

15000

& £

-176.6358 -1432335-109.8302 -76.4270

Apply

I0OO0

SO

Region Longitude

160.000

$120,001 JI80.001

Median Household Income

r j l A A I M b l . l I sslISIflHHI

Obs#

Region ID |Region L...jRegion L „ -66.74947 18.18011 - l | -67.18025 18.36328¬ -67.13422 18.44861 -67.137 18.49898 -66.958B1 18.18215 -67.13605 18.28831 -66.6988 18.44973 -66.67669 1B.42674 -66.55576 18.45549 -67.16746 IS.003; 2 .

100601 200602 300603 400604 500606 600610 700612 800616 900617 1000622

3 1 0

179622

33.9634

49 9646

653658

Region Lalilude

ID

2QB

40.6 60.4 802

1000

Region Density Percentile

g. 3000 c §• 2000 $

Ik

501 00603 00608 00612 00617 00602 00604 00610 00616 0062 Region ID

J L

1000¬

T

0

1——I 1

I™

V"—'!™—T"—-1

0.00 0 93 1.87 2.80 3 74 4 67 5.60 654 7.47 8.4 Average Household Size

43.207 86.414 Region Population

1 29.62

T h e zero average household size seems t o be evenly distributed across the l o n g i t u d e , latitude, a n d density percentile variables. It seems concentrated o n l o w i n c o m e s and p o p u l a t i o n s , and also m a k e s up the m a j o r i t y o f the m i s s i n g observations

i n the d i s t r i b u t i o n o f R e g i o n D e n s i t y . It is w o r t h w h i l e

t o l o o k at t h e i n d i v i d u a l r e c o r d s o f t h e e x p l o r e

sample.


8-30

C h a p t e r 8 Introduction to Pattern Discovery

10. M a x i m i z e t h e

CENSUS2000

data table.

1 1 . S c r o l l i n t h e d a t a t a b l e u n t i l y o u see t h e first s e l e c t e d r o w . ' Explore - A AEM 61X ENS US200U Fie

View

fictions

LUnE

Window

•1 Ml 0141 i y AAEM01.CEN5US20U0

Qpstf

| Region ID |Region L...jRegion L...| Region ... | Region ... | Median ... [Average ...| 1600638 1700641 1000646 1900647 2000656 21G0652

2200653 2300656 2400659 2500660 2600662 2700664 2800687 2900669 3000670 3100674 3200676

3300677 3400678 3500680 3600682 3700683 3800685 3900687

4000688 4100690 4200692 4300693 4400698 45006HH 46006XX 4700703 4S00704

-66.49835

•66.70519 •66.27689 •66.93993

-66 56773 -66.61217 -6690098 -66 79169 -66.80039 -67.12086 -67.01974 •66.59244 -67.04227 -66.87504 -66.97604 -66 4B897 -67.08425

-67.23675 -66.93276 -67.12655 -67.15429 -67.04525 -66.98104 -66.41529 -6661348 -67.09867 -66.33186 -66.39211 -66.85588 •66.83453 -67.88803 •66.12827

4900705 5000707 5100714

-66.22291 -66.26542 •6591019 -6605553

5200715 5300716 5400717 5500718

-66.55869 -66.59966 -68 61375 -65.74294

19 308139 1B.268896 1B.442798 17.964529

18.363331 18.457453 17.992112 18.038866 19.432956 18.139108 18.478855 18.212565 18.017819 18.288418 18.241343 1B.426137 18 37956 18.336121 18.442334

18.205232 18 208402 18.092807 18.332595 19.31708 18.40415 19.495369 1B.419866 18.440667 1B.06547

18.473441 18.102537 18.246205 17.970112 18.12942 18.014505 17.997288

18 003492 17.999066 18 004303 18.22048 4 r.

70 71 84 73 77

75 78 76 80 84

80 73 74

77 80 81 79

81 81 84 90 78 76 79 75 85 81 83 78

19,648 35,095

34.114 5,352

12.207 2,903 16,536 23,072 38,925 16,614 42,178 17,318 26,637 32,246 11.174 45,874 39,897 14,767

27.271 75,090 25,138 35,165 44,649 29,965

13,501 5,249 35,223 64,054 46.384 0

$10,986

19,901 116.378 19.607 $11,544 (12,097

19,770 (11,361 112.378 (16,745 (11.753 (11,220 (11,426 (9,758 (9,556 (12,820 $11,271 $11,460 $12,005

(12,042 (11,121 (13,036 (11,014 $12,090 $11,729 $12,629 (13,691 $13,857 $11,924 $0 SO (12,463 (11,340

80 85

0 28,752 7,593

83

19,117

(12.725 (11,638 (11,484

2,268 38,859 23,083 23.753

(10,556 (15,610 (11,697 (11.461

80 77 63 89 08 74

26,493 12,741

*- -, 111

3.26 3.14

3.11 2.87

3.11 284 3.04 3.19 3.08

2.82 2.94 3.35 2.91 3.12 3.09 2.97

3.11 2.86 3.07 2.77 2.84 2.86 2.94 3.40

3.01 2.89 3.16 3.12 307 0.00 0.00 3.12 3.06 3.13 3.19 3.09

2.78 3.06 2.55 2.97

R e c o r d s 4 5 and 4 6 ( a m o n g others) h a v e the zero A v e r a g e H o u s e h o l d Size characteristic. O t h e r in these records also h a v e u n u s u a l

values.

fields


8.2 Cluster A n a l y s i s

12. Select the in this

Average Household Size

field.

c o l u m n heading t w i c e t o sort the table b y descending values

Cases o f interest are c o l l e c t e d at the t o p o f the data t a b l e . ME

Explore - AAEM6I.CEriSUS20GO Fie

Vjew

8-31

Actions Window

A A£M 61 .E EN5USZ0 OG

Obstt

Region ID Region L... Region L... Region ...

447020HH 500021HH 50302222 S04022HH 524 02356 531023HH 532023XX 578025HH 579025XX 613026HH

649027HH 7D4028HH 721029HH

755030HH 763031HH 815G32HH 818032XX 821 033HH 847034HH 854 035HH 865035XX 873 036HH

897037HH 959039HH 969039HH 1028040HH

1037041HH 1082042HH 1103043HH 1163044HH 1191045HH I253046HH 1292047HH 1312048HH 1373049HH 1424050HH 1472053HH 1473053XX

1516Q54HH 1550056HH

-70.76329 -70.99512 -71.06283 -71.05037 -70.66089 -70.66299 -70.69411 -70.56594 -70.65427 -70.12764

-71.0363 -71.42673 -71.40145 -71.47319

-71.47113 -71.59953 -71.52839 -71.5597 -71.93063 -71.24389 -71.29905 -72,43502 -72.1821 -70.91394 -70.69216

-70.27453 -70.2651 B -70.39034 -69.85778 -68.76274 -69.6258 -67.89638 -68.3812 -89.0871 -E9.71961 -72.44107 -72.89058 -72.99699 -73.0833 -72.71076

42 157445 42.333634

42.367797 42.352702 41.854063 41.95506 41 837895 41.573686 41.779736 41.756212 41.711052 41.651234

Region ... Median ...

85 67 3 1

0 0 55 0 13B 0 18 0 7 0 0 0 0

41.8141 4281 187

0 0 0 0 0 0 0 0 0 0

43.007183 43612667

43.976513 43.246016 42.988155 44 636085 44 510468 43.255688 43.832093 43.311793 43.143285 43 562653

0 0 0

4367039 44.327523 44.282949

45.279499 43.920971 44.69515 46.665748 44 104367 44.762422 43.741 081 42.858851 42.982934 44.36125 44.313574

0 0 0 0 0 0 0 0 0 0 0 0 0 0

$0 10

to so so JO JO

so JO 10 10 JO JO $0

so so so so so so so so so so $0 so $0 so so

so so J0 JO so so so so so so so

Average Household Size

0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00; 0.00 0.00 0.00' G.QQ000 0.00 0.00 0.00 0.00 0-00 000 0.00 0.00 0.00

0.00 0.00: 0.00 0.00 0.00

o.oo; 0.00 0.00 0.00 0-00 0.00 0.00 0.00 0.00 0.00 o-oo; 0.00

n

M o s t o f the cases w i t h z e r o A v e r a g e H o u s e h o l d Size h a v e z e r o o r m i s s i n g o n t h e r e m a i n i n g n o n g e o g r a p h i c a t t r i b u t e s . T h e r e a r e s o m e e x c e p t i o n s , b u t i t c o u l d b e a r g u e d t h a t c a s e s s u c h as t h i s a r e n o t o f interest f o r a n a l y z i n g h o u s e h o l d d e m o g r a p h i c s . T h e next part o f this d e m o n s t r a t i o n s h o w s h o w t o r e m o v e c a s e s s u c h as t h i s f r o m t h e s u b s e q u e n t 13. C l o s e t h e E x p l o r e a n d Variables

windows.

analyses.


C h a p t e r s Introduction to Pattern Discovery

8-32

Case Filtering The SAS

E n t e r p r i s e M i n e r F i l t e r t o o l enables y o u t o r e m o v e u n w a n t e d r e c o r d s f r o m an a n a l y s i s .

Use these steps t o b u i l d a d i a g r a m t o read a data source and t o f i l t e r r e c o r d s .

1. Drag the CENSUS2000 data source to the Segmentation Analysis workspace window.

2. Select the Sample tab to access the Sample tool group.

3. Drag the Filter tool (fourth from the left) from the tools palette into the Segmentation Analysis workspace window and connect it to the CENSUS2000 data source.


4. Select the Filter node and examine the Properties panel.

Based on the values in the Properties panel, the node will, by default, filter cases in rare levels of any class input variable and cases exceeding three standard deviations from the mean on any interval input variable. Because the CENSUS2000 data source contains only interval inputs, only the Interval Variables criterion is considered.

5. Change the Default Filtering Method property to User-Specified Limits.


6. Select the Interval Variables ellipsis (...). The Interactive Interval Filter window opens.

You are warned at the top of the dialog box that the Train or raw data set does not exist. This indicates that you are restricted from the interactive filtering elements of the node, which are available after a node is run. You can, nevertheless, enter filtering information.


7. Type 0.1 as the Filter Lower Limit value for the input variable MeanHHSz.

8. Select OK to close the Interactive Interval Filter dialog box. You are returned to the SAS Enterprise Miner interface window. All cases with an average household size less than 0.1 will be filtered from subsequent analysis steps.
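The Filter node applies this rule for you inside SAS Enterprise Miner. Purely as an illustration of the same logic outside the GUI (a minimal Python sketch, not part of the course software; the file name census2000.csv and the column name MeanHHSz are assumptions), the equivalent row filter looks like this:

import pandas as pd

# Load the CENSUS2000 table (file name assumed for illustration).
census = pd.read_csv("census2000.csv")

# Keep only cases with an average household size of at least 0.1,
# mirroring the Filter Lower Limit entered in the Interactive Interval Filter.
# (Missing values would need separate handling; the node keeps them by default.)
filtered = census[census["MeanHHSz"] >= 0.1]

print(f"Excluded {len(census) - len(filtered)} of {len(census)} cases")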


9. Run the Filter node and view the results. The Results window opens.

10. Go to line 38 in the Output window.

   Number of Observations
   Data Role    Filtered    Excluded    DATA
   TRAIN        32097       1081        33178

The Filter node removed 1081 cases with zero household size.

11. Close the Results window. The CENSUS2000 data is ready for segmentation.


Cluster Tool Options

The Cluster tool performs k-means cluster analyses, a widely used method for cluster and segmentation analysis. This demonstration shows you how to use the tool to segment the cases in the CENSUS2000 data set.
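The demonstration below drives k-means entirely through the Cluster node's properties. For readers who want a feel for what the algorithm itself does, here is a minimal Python sketch using scikit-learn (an outside illustration with made-up values, not the node's implementation):

import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical table holding the three segmentation inputs.
inputs = pd.DataFrame({
    "MedHHInc": [25000, 40000, 90000, 31000, 62000, 18000],
    "MeanHHSz": [2.1, 2.8, 3.4, 2.5, 3.0, 1.9],
    "RegDens":  [15, 40, 85, 22, 60, 5],
})

# Assign each case to one of k=3 segments.
km = KMeans(n_clusters=3, n_init=10, random_state=42)
segments = km.fit_predict(inputs)
print(segments)

Note that the inputs above are not standardized; the demonstration returns to why that matters before the node is run.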

1. Select the Explore tab.

2. Locate and drag a Cluster tool into the diagram workspace.

3. Connect the Filter node to the Cluster node.

To create meaningful segments, you need to set the Cluster node to do the following:
• ignore irrelevant inputs
• standardize the inputs to have a similar range

4. Select the Variables... property for the Cluster node. The Variables window opens.

5. Select Use ⇒ No for LocX, LocY, and RegPop.

The Cluster node creates segments using the inputs MedHHInc, MeanHHSz, and RegDens.

Segments are created based on the (Euclidean) distance between cases in the space of the selected inputs. If you want to use all of the inputs to create clusters, these inputs should have similar measurement scales. Calculating distances using standardized measurements (subtracting the mean and dividing by the standard deviation of the input values) is one way to ensure this. You can standardize the input measurements using the Transform Variables node. However, it is easier to use the built-in property in the Cluster node.
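As a quick numeric illustration of why this matters (a sketch with made-up means and standard deviations, not taken from the CENSUS2000 data), compare the Euclidean distance between two cases before and after z-score standardization:

import numpy as np

# Two hypothetical cases: (MedHHInc, MeanHHSz, RegDens)
a = np.array([30000.0, 2.5, 40.0])
b = np.array([60000.0, 3.0, 70.0])

# Raw Euclidean distance is dominated by the income scale.
print(np.linalg.norm(a - b))

# Standardize each input with (value - mean) / standard deviation,
# using illustrative means and standard deviations for the three inputs.
mean = np.array([45000.0, 2.75, 55.0])
std = np.array([20000.0, 0.5, 25.0])
print(np.linalg.norm((a - mean) / std - (b - mean) / std))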


6. Select the inputs MedHHInc, MeanHHSz, and RegDens and select Explore.... The Explore window opens.

The inputs selected for use in the cluster are on three entirely different measurement scales. They need to be standardized if you want a meaningful clustering.

7. Close the Explore window.

8. Select OK to close the Variables window.


9. Select Internal Standardization ⇒ Standardization. Distances between points are calculated based on standardized measurements.

Another way to standardize an input is by subtracting the input's minimum value and dividing by the input's range. This is called range standardization. Range standardization rescales the distribution of each input to the unit interval, [0, 1].

The Cluster node is ready to run.
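A minimal sketch of range standardization (illustrative values only):

import numpy as np

x = np.array([18000.0, 25000.0, 40000.0, 62000.0, 90000.0])  # e.g., MedHHInc values

# Range standardization: (value - minimum) / (maximum - minimum)
x_range = (x - x.min()) / (x.max() - x.min())
print(x_range)  # all values now fall in the unit interval [0, 1]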


Creating Clusters with the Cluster Tool

By default, the Cluster tool attempts to automatically determine the number of clusters in the data. A three-step process is used.

Step 1: A large number of cluster seeds are chosen (50 by default) and placed in the input space. Cases in the training data are assigned to the closest seed, and an initial clustering of the data is completed. The means of the input variables in each of these preliminary clusters are substituted for the original training data cases in the second step of the process.

Step 2: A hierarchical clustering algorithm (Ward's method) is used to sequentially consolidate the clusters that were formed in the first step. At each step of the consolidation, a statistic named the cubic clustering criterion (CCC) (Sarle 1983) is calculated. Then, the smallest number of clusters that meets both of the following criteria is selected:
• The number of clusters must be greater than or equal to the number that is specified as the Minimum value in the Selection Criterion properties.
• The number of clusters must have cubic clustering criterion statistic values that are greater than the CCC threshold that is specified in the Selection Criterion properties.

Step 3: The number of clusters determined by the second step provides the value for k in a k-means clustering of the original training data cases.
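The exact procedure, including the CCC computation, is internal to the Cluster node. The following Python sketch only illustrates the general shape of the three steps using scikit-learn; it substitutes a fixed candidate k for the CCC-based selection and uses simulated data, so treat it as a conceptual outline rather than the node's algorithm:

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Simulated, standardized inputs with a few natural groupings.
X, _ = make_blobs(n_samples=2000, centers=4, n_features=3, random_state=1)

# Step 1: place many seeds (50 by default) and form preliminary clusters.
seeds = KMeans(n_clusters=50, n_init=10, random_state=1).fit(X)
centers = seeds.cluster_centers_

# Step 2: consolidate the preliminary clusters with Ward's method.
# Enterprise Miner evaluates the CCC at each consolidation step to pick k;
# here k is simply fixed at 4 for illustration.
k = 4
ward_labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(centers)

# Step 3: run k-means on the original cases with the selected k.
final_labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
print(len(set(ward_labels)), np.bincount(final_labels))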


1. Run the Cluster node and select Results.... The Results - Cluster window opens.

The Results - Cluster window contains four embedded windows.
• The Segment Plot window attempts to show the distribution of each input variable by cluster.
• The Mean Statistics window lists various descriptive statistics by cluster.
• The Segment Size window shows a pie chart describing the size of each cluster formed.
• The Output window shows the output of various SAS procedures run by the Cluster node.

Apparently, the Cluster node found four clusters in the CENSUS2000 data. Because the number of clusters is based on the cubic clustering criterion, it might be interesting to examine the values of this statistic for various cluster counts.


2. Select View ⇒ Summary Statistics ⇒ CCC Plot. The CCC Plot window opens.

In theory, the number of clusters in a data set is revealed by the peak of the CCC versus Number of Clusters plot. However, when no distinct concentrations of data exist, the utility of the CCC statistic is somewhat suspect. SAS Enterprise Miner attempts to establish reasonable defaults for its analysis tools. The appropriateness of these defaults, however, strongly depends on the analysis objective and the nature of the data.


Specifying the Segment Count

You might want to increase the number of clusters created by the Cluster node. You can do this by changing the CCC cutoff property or by specifying the desired number of clusters.

1. In the Properties panel for the Cluster node, select Specification Method ⇒ User Specify.

The User Specify setting creates a number of segments indicated by the Maximum Number of Clusters property listed above it (in this case, 10).


2. Run the Cluster node and select Results.... The Results - Node: Cluster window opens and shows a total of 10 generated segments.

As seen in the Mean Statistics window, segment frequency counts vary from 10 cases to more than 9,000 cases.

Exploring Segments

While the Results window shows a variety of data summarizing the analysis, it is difficult to understand the composition of the generated clusters. If the number of cluster inputs is small, the Graph Wizard can aid in interpreting the cluster analysis.

1. Close the Results - Cluster window.

2. Select Exported Data from the Properties panel for the Cluster node. The Exported Data - Cluster window opens.

This window shows the data sets that are generated and exported by the Cluster node.


3. Select the Train data set and select Explore.... The Explore window opens.

You can use the Graph Wizard to generate a three-dimensional plot of the CENSUS2000 data.

4. Select Actions ⇒ Plot.... The Select a Chart Type window opens.

5. Select the icon for a three-dimensional scatter plot.

6. Select Next >. The Graph Wizard proceeds to the next step, Select Chart Roles.


7. Select roles of X, Y, and Z for MeanHHSz, MedHHInc, and RegDens, respectively.

8. Select Color Role for _SEGMENT_.

9. Select Finish.


The Explore window opens with a three-dimensional plot of the CENSUS2000 data.

10. Rotate the plot by holding down the CTRL key and dragging the mouse. Each square in the plot represents a unique postal code. The squares are color-coded by cluster segment.


To further aid interpretability, add a distribution plot of the segment number.

1. Select Actions ⇒ Plot.... The Select a Chart Type window opens.

2. Select a Bar chart.

3. Select Next >.


4. Select Role ⇒ Category for the variable _SEGMENT_.

5. Select Finish.


A histogram of _SEGMENT_ opens.

By itself, this plot is of limited use. However, when the plot is combined with the three-dimensional plot, you can easily interpret the generated segments.


6. Select the tallest segment bar in the histogram, segment 4.

7. Select the three-dimensional plot. Cases corresponding to segment 4 are highlighted.

8. Rotate the three-dimensional plot to get a better look at the highlighted cases.

Cases in this largest segment correspond to households averaging between two and three members, lower population density, and median household incomes between $20,000 and $50,000.

9. Close the Explore, Exported Data, and Results windows.


Profiling Segments

You can gain a great deal of insight by creating plots as in the previous demonstration. Unfortunately, if more than three variables are used to generate the segments, the interpretation of such plots becomes difficult. Fortunately, there is another useful tool in SAS Enterprise Miner for interpreting the composition of clusters: the Segment Profile tool. This tool enables you to compare the distribution of a variable in an individual segment to the distribution of the variable overall. As a bonus, the variables are sorted by how well they characterize the segment.

1. Drag a Segment Profile tool from the Assess tool palette into the diagram workspace.

2. Connect the Cluster node to the Segment Profile node.

To best describe the segments, you should pick a reasonable subset of the available input variables.


3. Select the Variables property for the Segment Profile node.

4. Select Use ⇒ No for ID, LocX, LocY, and RegPop.

5. Select OK to close the Variables dialog box.


6. Run the Segment Profile node and select Results.... The Results - Node: Segment Profile window opens.

7. Maximize the Profile window.

Features of each segment become apparent. For example, segment 4, when compared to the overall distributions, has a lower Region Density Percentile, more central Median Household Income, and slightly higher Average Household Size.


11. Maximize the Variable Worth: _SEGMENT_ window.
The window shows the relative worth of each variable in characterizing each segment. For example, segment 4 is largely characterized by the RegDens input, but the other two inputs also play a role. Again, similar analyses can be used to describe the other segments. The advantage of the Segment Profile window (compared to direct viewing of the segmentation) is that the descriptions can be more than three-dimensional.
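A bare-bones version of the same profiling idea can be sketched outside SAS Enterprise Miner by comparing each segment's input means with the overall means (a hypothetical scored table is assumed here; the Segment Profile node's actual worth calculation is more involved):

import pandas as pd

# Hypothetical scored data: one row per case, with the assigned segment.
scored = pd.DataFrame({
    "segment":  [1, 1, 2, 2, 3, 3, 3],
    "MedHHInc": [30000, 35000, 80000, 90000, 20000, 22000, 25000],
    "MeanHHSz": [2.4, 2.6, 3.2, 3.4, 1.8, 2.0, 2.1],
    "RegDens":  [20, 25, 80, 85, 10, 12, 15],
})

overall = scored.drop(columns="segment").mean()
by_segment = scored.groupby("segment").mean()

# Ratio of each segment's mean to the overall mean: values far from 1
# point to inputs that help characterize that segment.
print(by_segment / overall)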


Exercises

1. Conducting Cluster Analysis

The DUNGAREE data set gives the number of pairs of four different types of dungarees sold at stores over a specific time period. Each row represents an individual store. There are six columns in the data set. One column is the store identification number, and the remaining columns contain the number of pairs of each type of jeans sold.

   Name       Model Role   Measurement Level   Description
   STOREID    ID           Nominal             Identification number of the store
   FASHION    Input        Interval            Number of pairs of fashion jeans sold at the store
   LEISURE    Input        Interval            Number of pairs of leisure jeans sold at the store
   STRETCH    Input        Interval            Number of pairs of stretch jeans sold at the store
   ORIGINAL   Input        Interval            Number of pairs of original jeans sold at the store
   SALESTOT   Rejected     Interval            Total number of pairs of jeans sold (the sum of FASHION, LEISURE, STRETCH, and ORIGINAL)

a. Create a new diagram in your project. Name the diagram Jeans.

b. Define the data set DUNGAREE as a data source.
c. Determine whether the model roles and measurement levels assigned to the variables are appropriate. Examine the distribution of the variables.
   • Are there any unusual data values?
   • Are there missing values that should be replaced?
d. Assign the variable STOREID the model role ID and the variable SALESTOT the model role Rejected. Make sure that the remaining variables have the Input model role and the Interval measurement level. Why should the variable SALESTOT be rejected?
e. Add an Input Data Source node to the diagram workspace and select the DUNGAREE data table as the data source.
f. Add a Cluster node to the diagram workspace and connect it to the Input Data node.
g. Select the Cluster node and select Internal Standardization ⇒ Standardization. What would happen if you did not standardize your inputs?
h. Run the diagram from the Cluster node and examine the results. Does the number of clusters created seem reasonable?
i. Specify a maximum of six clusters and rerun the Cluster node. How does the number and quality of clusters compare to that obtained in part h?
j. Use the Segment Profile node to summarize the nature of the clusters.


8.3 Market Basket Analysis (Self-Study)

Market Basket Analysis

   Rule          Support   Confidence
   A => D        2/5       2/3
   C => A        2/5       2/4
   A => C        2/5       2/3
   B & C => D    1/5       1/3

Market basket analysis (also known as association rule discovery or affinity analysis) is a popular data mining method. In the simplest situation, the data consists of two variables: a transaction and an item. For each transaction, there is a list of items. Typically, a transaction is a single customer purchase, and the items are the things that were bought. An association rule is a statement of the form (item set A) => (item set B).

The aim of the analysis is to determine the strength of all the association rules among a set of items. The strength of the association is measured by the support and confidence of the rule. The support for the rule A => B is the probability that the two item sets occur together. The support of the rule A => B is estimated by the following:

   support(A => B) = (transactions that contain every item in A and B) / (all transactions)

Notice that support is symmetric. That is, the support of the rule A => B is the same as the support of the rule B => A.

The confidence of an association rule A => B is the conditional probability of a transaction containing item set B given that it contains item set A. The confidence is estimated by the following:

   confidence(A => B) = (transactions that contain every item in A and B) / (transactions that contain the items in A)
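The Association node computes these measures for you. As a plain illustration of the two formulas, the short Python sketch below estimates support and confidence from a small made-up transaction list that happens to reproduce the numbers in the slide above:

# Hypothetical transactions: each set holds the items bought together.
transactions = [
    {"A", "C", "D"},
    {"A", "B", "C"},
    {"A", "D"},
    {"B", "C", "D"},
    {"B", "C"},
]

def support(lhs, rhs):
    both = sum(1 for t in transactions if lhs <= t and rhs <= t)
    return both / len(transactions)

def confidence(lhs, rhs):
    both = sum(1 for t in transactions if lhs <= t and rhs <= t)
    return both / sum(1 for t in transactions if lhs <= t)

print(support({"A"}, {"C"}), confidence({"A"}, {"C"}))            # 2/5 and 2/3
print(support({"B", "C"}, {"D"}), confidence({"B", "C"}, {"D"}))  # 1/5 and 1/3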


Implication?

                        Checking Account
   Savings Account      No       Yes       Total
   No                   500      3,500     4,000
   Yes                  1,000    5,000     6,000
                                           10,000

Support(SVG => CK) = 50%
Confidence(SVG => CK) = 83%
Expected Confidence(SVG => CK) = 85%
Lift(SVG => CK) = 0.83/0.85 < 1
The interpretation of the implication (=>) in association rules is precarious. High confidence and support do not imply cause and effect. The rule is not necessarily interesting. The two items might not even be correlated. The term confidence is not related to the statistical usage; therefore, there is no repeated sampling interpretation.

Consider the association rule (saving account) => (checking account). This rule has 50% support (5,000/10,000) and 83% confidence (5,000/6,000). Based on these two measures, this might be considered a strong rule. On the contrary, those without a savings account are even more likely to have a checking account (87.5%). Saving and checking are, in fact, negatively correlated. If the two accounts were independent, then knowing that a person has a saving account does not help in knowing whether that person has a checking account. The expected confidence if the two accounts were independent is 85% (8,500/10,000). This is higher than the confidence of SVG => CK.

The lift of the rule A => B is the confidence of the rule divided by the expected confidence, assuming that the item sets are independent. The lift can be interpreted as a general measure of association between the two item sets. Values greater than 1 indicate positive correlation, values equal to 1 indicate zero correlation, and values less than 1 indicate negative correlation. Notice that lift is symmetric. That is, the lift of the rule A => B is the same as the lift of the rule B => A.
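The arithmetic is easy to verify. The fragment below recomputes support, confidence, expected confidence, and lift for SVG => CK directly from the cross-tabulation above:

# Counts from the savings/checking cross-tabulation above.
svg_yes_ck_yes = 5000
svg_yes = 6000          # customers with a savings account
ck_yes = 8500           # customers with a checking account (3,500 + 5,000)
total = 10000

support = svg_yes_ck_yes / total                 # 0.50
confidence = svg_yes_ck_yes / svg_yes            # about 0.83
expected_confidence = ck_yes / total             # 0.85
lift = confidence / expected_confidence          # less than 1: negative association

print(support, confidence, expected_confidence, lift)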


Barbie Doll => Candy
1. Put them closer together in the store.
2. Put them far apart in the store.
3. Package candy bars with the dolls.
4. Package Barbie + candy + poorly selling item.
5. Raise the price on one, and lower it on the other.
6. Offer Barbie accessories for proofs of purchase.
7. Do not advertise candy and Barbie together.
8. Offer candies in the shape of a Barbie doll.

Forbes (Palmeri 1997) reported that a major retailer determined that customers who buy Barbie dolls have a 60% likelihood of buying one of three types of candy bars. The confidence of the rule Barbie => candy is 60%. The retailer was unsure what to do with this nugget. The online newsletter Knowledge Discovery Nuggets invited suggestions (Piatetsky-Shapiro 1998).

Data Capacity

In data mining, the data is not generated to meet the objectives of the analysis. It must be determined whether the data, as it exists, has the capacity to meet the objectives. For example, quantifying affinities among related items would be pointless if very few transactions involved multiple items. Therefore, it is important to do some initial examination of the data before attempting to do association analysis.
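One simple capacity check is the number of distinct items per transaction. A small Python sketch (the column names mirror the transaction data sources used later in this section, but the values are made up):

import pandas as pd

# Hypothetical transaction data: one row per transaction-item combination.
rows = pd.DataFrame({
    "TRANSACTION": [1, 1, 2, 3, 3, 3, 4, 5],
    "PRODUCT": ["soap", "candy", "pens", "soap", "cards", "bows", "candy", "soap"],
})

items_per_transaction = rows.groupby("TRANSACTION")["PRODUCT"].nunique()
print(items_per_transaction.describe())
print((items_per_transaction >= 2).mean())  # share of multi-item transactions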


Association Tool Demonstration

Analysis goal: Explore associations between retail banking services used by customers.

Analysis plan:
• Create an association data source.
• Run an association analysis.
• Interpret the association rules.
• Run a sequence analysis.
• Interpret the sequence rules.

A bank's Marketing Department is interested in examining associations between various retail banking services used by customers. Marketing would like to determine both typical and atypical service combinations as well as the order in which the services were first used. These requirements suggest both a market basket analysis and a sequence analysis.


Market Basket Analysis

The BANK data set contains service information for nearly 8,000 customers. There are three variables in the data set, as shown in the table below.

   Name      Model Role   Measurement Level   Description
   ACCOUNT   ID           Nominal             Account Number
   SERVICE   Target       Nominal             Type of Service
   VISIT     Sequence     Ordinal             Order of Product Purchase

The BANK data set has over 32,000 rows. Each row of the data set represents a customer-service combination. Therefore, a single customer can have multiple rows in the data set, and each row represents one of the products he or she owns. The median number of products per customer is three. The 13 products are represented in the data set using the following abbreviations:

   ATM      automated teller machine debit card
   AUTO     automobile installment loan
   CCRD     credit card
   CD       certificate of deposit
   CKCRD    check/debit card
   CKING    checking account
   HMEQLC   home equity line of credit
   IRA      individual retirement account
   MMDA     money market deposit account
   MTG      mortgage
   PLOAN    personal/consumer installment loan
   SVG      saving account
   TRUST    personal trust account

Your first task is to create a new analysis diagram and data source for the BANK data set.

1. Create a new diagram named Associations Analysis to contain this analysis.

2. Select Create Data Source from the Data Sources project property.

3. Proceed to Step 2 of the Data Source Wizard.


4. Select the BANK table in the AAEM61 library.

5. Proceed to Step 5 of the Data Source Wizard.

6. In Step 5, assign metadata to the table variables: the role ID for ACCOUNT, Target for SERVICE, and Sequence for VISIT.

An association analysis requires exactly one target variable and at least one ID variable. Both should have a nominal measurement level. A sequence analysis also requires a sequence variable. It usually has an ordinal measurement scale.


7. Proceed to Step 7 of the Data Source Wizard. For an association analysis, the data source should have a role of Transaction.

8. Select Role ⇒ Transaction.

9. Select Finish to close the Data Source Wizard.

10. Drag a BANK data source into the diagram workspace.

11. Select the Explore tab and drag an Association tool into the diagram workspace.

12. Connect the BANK node to the Association node.
8-70

13.

Chapter 8 Introduction to Pattern Discovery

Select the

Association

n o d e a n d e x a m i n e its P r o p e r t i e s p a n e l . Value

Property

General N o d e ID

Assoc

Imported Data Exported Data

J J

Notes

Train

1 —1

Variables M a x i m u m N u m b e r o f 11 0 0 0 0 0 Rules

u

i Maximum Items M i n i m u m Confidence 10 Support Type

Percent

Support Count

1

Support Percentage E Chain Count

3

Consolidate Time

0.0

M a x i m u m T r a n s a c t i o n 0.0 Support Type

Percent

Support Count

1

^-Support P e r c e n t a g e

2.0

N u m b e r t o Keep

200

Sort C r i t e r i o n

Default

N u m b e r t o Transpose 200 Export R u l e by ID 14.

No

T h e E x p o r t R u l e b y ID p r o p e r t y d e t e r m i n e s w h e t h e r t h e R u l e - b y - l D node and i f the R u l e

Set t h e v a l u e f o r E x p o r t R u l e b y ID t o Yes.

-Numberto Keep -Sort C r i t e r i o n

200 Default

- N u m b e r t o T r a n s p o s e 200 •Export R u l e by ID

d a t a is e x p o r t e d f r o m t h e

D e s c r i p t i o n table w i l l be available f o r d i s p l a y i n t h e Results w i n d o w .

Yes


8.3 Market B a s k e t A n a l y s i s (Self-Study)

8-71

Other options in the Properties panel include the following:
• The Minimum Confidence Level specifies the minimum confidence level to generate a rule. The default level is 10%.
• The Support Type specifies whether the analysis should use the support count or support percentage property. The default setting is Percent.
• The Support Count specifies a minimum level of support to claim that items are associated (that is, they occur together in the database). The default count is 2.
• The Support Percentage specifies a minimum level of support to claim that items are associated (that is, they occur together in the database). The default frequency is 5%. The support percentage figure that you specify refers to the proportion of the largest single item frequency, and not the end support.
• The Maximum Items property determines the maximum size of the item set to be considered. For example, the default of four items indicates that a maximum of four items will be included in a single association rule.

If you are interested in associations that involve fairly rare products, you should consider reducing the support count or percentage when you run the Association node. If you obtain too many rules to be practically useful, you should consider raising the minimum support count or percentage as one possible solution. Because you first want to perform a market basket analysis, you do not need the sequence variable.
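To make the role of these thresholds concrete, the sketch below enumerates one-item rules from a small made-up transaction list and keeps only those meeting a minimum support and confidence. It is an illustration only; it is not how the Association node is implemented, and for simplicity the threshold is applied to plain rule support rather than to the proportion of the largest single-item frequency:

from itertools import permutations

transactions = [
    {"CKING", "SVG"}, {"CKING", "CCRD", "CKCRD"}, {"CKING", "SVG", "ATM"},
    {"SVG", "CD"}, {"CKING", "ATM"}, {"CKING", "SVG", "HMEQLC"},
]
items = set().union(*transactions)

MIN_SUPPORT = 0.30      # analogous to the Support Percentage property
MIN_CONFIDENCE = 0.60   # analogous to the Minimum Confidence Level property

for lhs, rhs in permutations(items, 2):
    both = sum(1 for t in transactions if lhs in t and rhs in t)
    lhs_count = sum(1 for t in transactions if lhs in t)
    support = both / len(transactions)
    confidence = both / lhs_count if lhs_count else 0.0
    if support >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE:
        print(f"{lhs} => {rhs}  support={support:.2f}  confidence={confidence:.2f}")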

15. Open the Variables dialog box for the Association node.


8-72

C h a p t e r 8 Introduction to Pattern Discovery

16. S e l e c t Use •=> No f o r t h e V i s i t v a r i a b l e . £ T Variables - Assoc ""•")

(none) Columns:

P

not | Equal to P

f " Mining

f " Label

Name

Apply

Use

Role

I

Basic

-

Reset

Statistics

Level

ACCOUNT

Yes

ID

Interval

SERVICE

Yes

Target

Nominal

VISIT

No

Sequence

Interval

Explore...

Update Path

17.

Select OK

to close the Variables d i a l o g b o x .

18.

R u n the d i a g r a m f r o m the A s s o c i a t i o n node and v i e w the results.

OK

Cancel


8.3

M a r k e t B a s k e t A n a l y s i s (Self-Study)

The Results - Node: Association Diagram window opens with the Statistics Plot, Statistics Line Plot, Rule Matrix, and Output windows visible.

lfPii^itilMMiitili

r/tf

1

[El

1

| _ Rule Matrix

£

-

0

-

cr*

IEI

I : 1 El S

:*

ft

l

ft

J

*

• ft ftft: *

8

._,

**ft*fti T - i — i

i

i

T

i

• i—i

i

r

i

i i—i—i—T

i

i

i

i — i — i — i — i — r

Riijlit Hand of Rule 3100.00

Output

User:

Dodotech

Dace:

24IIAY08

Time:

19:51

* Training

50

100

150

Variable

Output

Summary

Rule Index U Niimlter identifying a specific rule Lift Expected Coniidence(%]

Goiiflderice(%) Support(%)

Role i

Measurement

Frequency

Level

Count


8-74

Chapter 8 Introduction to Pattern Discovery

19. Maximize the Statistics Line Plot window.

The statistics line plot graphs the lift, expected confidence, confidence, and support for each of the rules by rule index number. Consider the rule A => B. Recall the following:
• Support of A => B is the probability that a customer has both A and B.
• Confidence of A => B is the probability that a customer has B given that the customer has A.
• Expected Confidence of A => B is the probability that a customer has B.
• Lift of A => B is a measure of the strength of the association. If Lift=2 for the rule A => B, then a customer having A is twice as likely to have B as a customer chosen at random. Lift is the confidence divided by the expected confidence.
Notice that the rules are ordered in descending order of lift.


20. To view the descriptions of the rules, select View ⇒ Rules ⇒ Rule description.

Rule d e s c r i p t i o n

MAP

RULE

RULE1

CKING & C C R D ==> CKCRD

RULE2

C K C R D = = > CKING & CCRD

RULE3

CKCRD = = > C C R D

RULE4

CKING & CKCRD ==> CCRD

RULES

CCRD==> CKCRD

RULE6

C C R D ==> CKING & CKCRD

RULE7

HMEQLC = = > CKING & CCRD

RULE8

CKING & C C R D ==> HMEQLC

RULE9

HMEQLC==>CCRD

RULE10

HMEQLC & CKING = > CCRD

RULE11

C C R D = = > HMEQLC

RULE12

C C R D = = > HMEQLC & CKING

RULE13

SVG & H M E Q L C = = > C K I N G & ATM

RULE14

C K I N G & ATM = = > SVG & H M E Q L C

RULE15

H M E Q L C = = > SVG & C K I N G & ATM

]RULE1 6

SVG & C K I N G & ATM = = > H M E Q L C

RULE17

SVG & A T M = = > H M E Q L C

RULE18

SVG & ATM = = > H M E Q L C & C K I N G

RULE19

H M E Q L C = = > SVG & ATM

RULE20

H M E Q L C & C K I N G = = > SVG & ATM

RULE21

H M E Q L C = " > C K I N G & ATM

IRULE22

CKING &ATM = > HMEQLC

RULE23

8-75

n^Ef

IE]

-A.

SVG & H M E Q L C = = > ATM

RULE24

SVG & H M E Q L C & C K I N G = = > ATM

RULE25

ATM = = > SVG & H M E Q L C

RULE26

ATM = = > SVG & H M E Q L C & C K I N G

RULE27

C D & ATM = = > SVG & C K I N G

RULE28

HMEQLC ==>ATM

RULE29

H M E Q L C & C K I N G = = > ATM

RULE30

ATM==> HMEQLC

RULE31

ATM = = > H M E Q L C & C K I N G

RULE32

CKING & AUTO ==>ATM

The highest lift rule is checking and credit card implies check card. This is not surprising given that many check cards include credit card logos. Notice the symmetry in rules 1 and 2. This is not accidental because, as noted earlier, lift is symmetric.


8-76

C h a p t e r 8 Introduction to Pattern Discovery

21. Examine the rule matrix.

0>

a t

»

4 —

o T3 C

-j

r

a

m * a

a

a

— i

1

1

1

Right Hand of Rule 10.231

r

^100.00

The rule matrix plots the rules based on the items on the left side of the rule and the items on the right side of the rule. The points are colored, based on the confidence of the rules. For example, the rules with the highest confidence are in the column in the picture above. Using the interactive feature of the graph, you discover that these rules all have checking on the right side of the rule. Another way to explore the rules found in the analysis is by plotting the Rules table.


8.3 M a r k e t B a s k e t A n a l y s i s (Self-Study)

22. Select View ⇒ Rules ⇒ Rules Table. The Rules Table window opens.

! H I Rules Table Relations

Expected

Confidence(

Confidence(

%)

Support(%)

Lift

%) 3

11.30

37.57

5.58

3

14.85

49.39

5.58

3.3: 3.3:

2

15.48

49.39

5.58

3.1 E

3

15.48

49.39

5.58

3.1 E

2

11.30

36.05

5.58

3.1 £

3

11.30

36.05

5.58

3.1 £

3

14.85

23.12

4.63

1.8£

3

16.47

31.17

4.63

1.8E

2

15.48

28.12

4.63

1.83

15.48

28.12

4.63

1.83

16.47

29.91

4.63

1.83

3

16.47

29.91

4.63

1.83

4

36.19

54.66

6.09

1.51

A A A

11.15

16.84

6.09

1.51

24.85

37.01

6.09

1.41

16.47

24.52

6.09

1.4E

3

16.47

23.72

6.09

1.44 —

A

1 R A7

no.

1 AA •r | >

I

:¥v:*:;:yx

U \ • ' wffi 4

A:

R

_

23. Select the Plot Wizard icon.

24. Choose a three-dimensional scatter plot for the type of chart, and select Next >.

8-77


8-78

C h a p t e r 8 Introduction t o Pattern Discovery

25. Select the roles

X, Y,

and

Z

for the variables

SUPPORT, L I F T ,

and

CONF,

respectively.

Use default assignments A Variable CONF COUNT

Rote

Type

Description

Format

z-

Mumeric

Conficlence(%)

F6.2

Numeric

Transaction Count

F6.2

EXP.CONF

Numeric

Expected Confidence... F6.2

INDEX

Numeric

Rule Index if Number i...

IT EMI

Character

Rule item 1

ITEMS

Character

Rule item 2

Character

Rule Item 3

ITEM4

Character

Rule Item 4

ITEM 5

Character

Rule Item 5

ITEMS

LIFT RULE

Y

SET SIZE SUPPORT

X

1

26. Select Finish to generate the plot.

F6.2

Numeric

Lift

Character

'Rule

Numeric

Relations

F6

Numeric

Support(%) Transnns.fi Rulft

F6.2

Mi

tmorir

—


8.3 M a r k e t B a s k e t A n a l y s i s (Self-Study)

8-79

27. Rearrange the windows to view the data and the plot simultaneously.

;

S

»11 <--«3i ili'li l^Btf :tH--t» j ^>1i r iT'j i ia t j ! - t-.j f i iji Htf ^Vi^ir-jH i t» J itf^i r= I

C^eT

3

3 3

3

3.33 3.33 3.19 3.19 3.19 3.19 1.89 1.89 1.82 1.82 1.82 1.82 1.51 1.51 1.49 1.49 1.44

4 4 6 . 0 0 C K I N G & C C R O = > CKCRD 446.00 CKCRD = > CKINO & CCRD 4 4 6 . 0 0 C K C R D = = > CCRD 446.00 CKfNG & CKCRD = > CCRD 4 4 6 . 0 0 C C R D = > CKCRD 4 4 6 . 0 0 C C R D ==> CKING & CKCRD 370.00 HMEQLC ==> CKING & CCRD 3 7 0 . 0 0 C K I N G & C C R D = > HMEQLC 370.00HMEQLC = > C C R D 370.00HMEQLC & CKING ==> CCRD 3 7 0 . 0 0 C C R D ==> HMEQLC 3 7 0 . 0 0 C C R D ==> HMEQLC & CKING 487.OOSVG & HMEQLC = > CKING &.. 487.00CKING &ATM ==> SVG & HME. 487.00 HMEQLC ==> SVG & CKING &.. 487.00SVG & CKING &ATM = > HME. 487.00 SVG & ATM === HMEQLC

-ISA

E

[userGraph 1

cR-

CH Ch-J

Confidence(%) 0.00

C[ | C< C(

m a

H» Hf

C( C(

60.00

s\

Cr

HH

S\

S\ —

S u p p o r t (%)

100.00 3.00

4.OD0

Expanding the Rule column in the data table and selecting points in the three-dimensional plot enable you to quickly uncover high lift rules from the market basket analysis while judging their confidence and support. You can use WHERE clauses in the Data Options dialog box to subset cases in which you are interested.

28. Close the Results window.


8-80

C h a p t e r 8 Introduction to Pattern Discovery

Sequence Analysis

In addition to the products owned by its customers, the bank is interested in examining the order in which the products are purchased. The sequence variable in the data set enables you to conduct a sequence analysis.

1.

A d d an A s s o c i a t i o n node t o the d i a g r a m workspace and connect it to the

2.

Rename the n e w node S e q u e n c e

r;'' 0

Set E x p o r t R u l e b y I D t o Y e s . El

• •

-Numberto Keep

200

-Sort C r i t e r i o n

befault

-Numberto Transpose 200 -Export R u l e by ID 4.

node.

Analysis.

Association Analysis

3.

BANK

Yes

E x a m i n e the Sequence panel in the Properties panel.

-Chain C o u n t

3

-Consolidate T i m e

0.0

- M a x i m u m T r a n s a c t i o n G.O •SupportType

Percent

•Support Count

1

-SupportPercentage

2.0

1HI


8.3 Market B a s k e t A n a l y s i s (Self-Study)

8-81

The options i n the Sequence panel enable y o u to specify the f o l l o w i n g properties:

• Chain Count

is t h e m a x i m u m n u m b e r o f i t e m s t h a t c a n be i n c l u d e d i n a s e q u e n c e . T h e d e f a u l t v a l u e

is 3 a n d t h e m a x i m u m v a l u e is 1 0 .

• Consolidate Time

enables y o u to specify whether consecutive visits to a location o r consecutive

purchases o v e r a g i v e n i n t e r v a l can be c o n s o l i d a t e d i n t o a s i n g l e v i s i t f o r analysis purposes. F o r e x a m p l e , t w o p r o d u c t s p u r c h a s e d less t h a n a d a y a p a r t m i g h t b e c o n s i d e r e d t o b e a s i n g l e t r a n s a c t i o n .

• Maximum Transaction Duration

enables y o u t o s p e c i f y the m a x i m u m l e n g t h o f t i m e f o r a series o f

transactions t o be c o n s i d e r e d a sequence. F o r e x a m p l e , y o u m i g h t w a n t t o s p e c i f y that t h e purchase o f t w o p r o d u c t s m o r e than three m o n t h s apart does n o t constitute a sequence. s p e c i f i e s w h e t h e r t h e s e q u e n c e a n a l y s i s s h o u l d use t h e S u p p o r t C o u n t o r S u p p o r t

• Support Type

P e r c e n t a g e p r o p e r t y . T h e d e f a u l t s e t t i n g is P e r c e n t .

• Support Count

specifies t h e m i n i m u m frequency required t o include a sequence i n the sequence

a n a l y s i s w h e n t h e S e q u e n c e S u p p o r t T y p e is set t o C o u n t . I f a s e q u e n c e h a s a c o u n t less t h a n t h e s p e c i f i e d v a l u e , t h a t s e q u e n c e is e x c l u d e d f r o m t h e o u t p u t . T h e d e f a u l t s e t t i n g i s 2 .

• Support Percentage

specifies the m i n i m u m level o f support t o include the sequence in the analysis

w h e n t h e S u p p o r t T y p e i s set t o P e r c e n t . I f a s e q u e n c e h a s a f r e q u e n c y t h a t i s less t h a n t h e s p e c i f i e d p e r c e n t a g e o f t h e t o t a l n u m b e r o f t r a n s a c t i o n s , t h e n t h a t s e q u e n c e is e x c l u d e d f r o m t h e o u t p u t . T h e d e f a u l t p e r c e n t a g e i s 2%. P e r m i s s i b l e v a l u e s a r e r e a l n u m b e r s b e t w e e n 0 a n d 1 0 0 . 5.

R u n the d i a g r a m f r o m the Sequence A n a l y s i s node and v i e w the results.

6.

M a x i m i z e the Statistics L i n e Plot w i n d o w .

File Edit View

Window

[H" IB H H IS Statistics Line Plot 100-

80-

00 -

40-

20-

050

100

R u l e Index U Number identifying a s p e c i f i c rule Canfidence(%)

150


8-82

C h a p t e r 8 Introduction t o Pattern

Discovery

The statistics line plot graphs the confidence and support for each of the rules by rule index number. The percent support is the transaction count divided by the total number of customers, which would be the maximum transaction count. The percent confidence is the transaction count divided by the transaction count for the left side of the sequence.
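To see that counting spelled out, the following Python sketch (made-up account histories, not the BANK data, and not the node's algorithm) estimates support and confidence for the sequence rule CKING ==> SVG, requiring CKING to be acquired before SVG:

import pandas as pd

# Hypothetical account-level history: one row per account-service-visit.
hist = pd.DataFrame({
    "ACCOUNT": [1, 1, 2, 2, 2, 3, 3, 4],
    "SERVICE": ["CKING", "SVG", "CKING", "ATM", "SVG", "SVG", "CKING", "CKING"],
    "VISIT":   [1, 2, 1, 2, 3, 1, 2, 1],
})

def acquired(df, service):
    rows = df[df["SERVICE"] == service]
    return rows["VISIT"].min() if not rows.empty else None

n_customers = hist["ACCOUNT"].nunique()
lhs_count = 0   # accounts that acquired CKING
both_count = 0  # accounts that acquired CKING and later SVG

for _, df in hist.groupby("ACCOUNT"):
    cking, svg = acquired(df, "CKING"), acquired(df, "SVG")
    if cking is not None:
        lhs_count += 1
        if svg is not None and svg > cking:
            both_count += 1

print("support:", both_count / n_customers)
print("confidence:", both_count / lhs_count)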

7. Select View ⇒ Rules ⇒ Rule description to view the descriptions of the rules.

jjjj

MAP

Rule

RULE1

C K I N G = = > SVG

RULE2

CKING ==> ATM

RULE3

SVG = = > ATM

$

RULE4

C K I N G = = > SVG = = > ATM

RULES

ATM = > ATM

RULE6

CKING ==> C D

RULE7

C K I N G = > ATM = = > A T M

RULES

CKING ==> HMEQLC

RULE9

SVG = > C D

RULE10

C K I N G = = > MMDA

RULE11

CKING ==> C C R D

RULE12

C K I N G = = > SVG = = > C D

RULE13

SVG = = > A T M = = > A T M

RULE14

SVG = = > S V G

RULE15

CKCRD ==>CKCRD

RULE16

CKING ==> C K C R D

RULE17

CKING ==> C K C R D = > C K C R D

RULE18

SVG = > H M E Q L C

RULE19

C K I N G = = > SVG = = > H M E Q L C

RULE20

SVG==>CCRD

RULE21

C K I N G = = > SVG = = > C C R D

di

ii

e

n

\a

rr\

The confidence for many of the rules changes after the order of service acquisition is considered.



2. Conducting an Association Analysis

A store is interested in determining the associations between items purchased from the Health and Beauty Aids Department and the Stationery Department. The store chose to conduct a market basket analysis of specific items purchased from these two departments. The ASSOCIATIONS data set contains information about over 400,000 transactions made over the past three months. The following products are represented in the data set:

    1. bar soap          7. markers                 13. prescription medications
    2. bows              8. pain relievers          14. shampoo
    3. candy bars        9. pencils                 15. toothbrushes
    4. deodorant        10. pens                    16. toothpaste
    5. greeting cards   11. perfume                 17. wrapping paper
    6. magazines        12. photo processing



There are four variables in the data set:

    Name           Model Role   Measurement Level   Description
    STORE          Rejected     Nominal             Identification number of the store
    TRANSACTION    ID           Nominal             Transaction identification number
    PRODUCT        Target       Nominal             Product purchased
    QUANTITY       Rejected     Interval            Quantity purchased

a.  Create a new diagram. Name the diagram Transactions.

b.  Create a new data source using the data set AAEM61.TRANSACTIONS.

c.  Assign the variables STORE and QUANTITY the model role Rejected. These variables will not be used in this analysis. Assign the ID model role to the variable TRANSACTION and the Target model role to the variable PRODUCT.

d.  Add the node for the TRANSACTIONS data set and an Association node to the diagram.

e.  Change the setting for Export Rule by ID to Yes.

f.  Leave the remaining default settings for the Association node and run the analysis.

g.  Examine the results of the association analysis. What is the highest lift value for the resulting rules? Which rule has this value?



8.4 Chapter Summary

Pattern discovery seems to embody the promise of data mining, but there are many ways for an analysis to fail. SAS Enterprise Miner provides tools to help with data reduction, novelty detection, profiling, market basket analysis, and sequence analysis.

Cluster and segmentation analyses are similar in intent but differ in execution. In cluster analysis, the goal is to identify distinct groupings of cases across a set of inputs. In segmentation analysis, the goal is to partition cases from a single cluster into contiguous groups.

SAS Enterprise Miner offers several tools for exploring the results of a cluster and segmentation analysis. For low-dimension data, you can use capabilities provided by the Graph Wizard and the Explore window. For higher-dimensional data, you can choose the Segment Profile tool to understand the generated partitions.

Market basket and sequence analyses are handled by the Association tool. This tool transforms transaction data sets into rules. The value of the generated rules is gauged by confidence, support, and lift. The Association tool features a variety of plots and tables to help you explore the analysis results.

Pattern Discovery Tools: Review

• Generate cluster models using automatic settings and segmentation models with user-defined settings.

• Compare within-segment distributions of selected inputs to overall distributions. This helps you understand segment definition.

• Conduct market basket and sequence analysis on transactions data. A data source must have one target, one ID, and (if desired) one sequence variable in the data source.



8.5 Solutions

Solutions to Exercises

1. Conducting Cluster Analysis

   a.  Create a new diagram in your project.
       1) To open a diagram, select File => New => Diagram.
       2) Type the name of the new diagram, Jeans, and select OK.

   b.  Define the data set DUNGAREE as a data source.
       1) Select File => New => Data Source....
       2) In the Data Source Wizard - Metadata Source window, make sure that SAS Table is selected as the source and select Next >.
       3) To choose the desired data table, select Browse....
       4) Double-click the AAEM61 library to see the data tables in the library.
       5) Select the DUNGAREE data set, and then select OK.
       6) Select Next >.
          (The Table Information window shows the AAEM61.DUNGAREE table properties, including 6 variables and 689 observations.)
       7) Select Next >.
       8) Select Advanced to use the Advanced Advisor, and then select Next >.


   c.  Determine whether the model roles and measurement levels assigned to the variables are appropriate. Examine the distribution of the variables.
       1) Hold down the CTRL key and click to select the variables of interest.
          (The Data Source Wizard - Column Metadata window lists FASHION, LEISURE, ORIGINAL, SALESTOT, STOREID, and STRETCH, each with the Input role and the Interval measurement level.)
       2) Select Explore.
       3) Select Next > and then Finish to complete the data source creation.



          (The Explore window for AAEM61.DUNGAREE shows a sample of the 689 rows and histograms of the inputs, including FASHION, ORIGINAL, and STRETCH.)

       There do not appear to be any unusual or missing data values.


   d.  The variable STOREID should have the ID model role and the variable SALESTOT should have the Rejected model role.

       The variable SALESTOT should be rejected because it is the sum of the other input variables in the data set. Therefore, it should not be considered as an independent input value.
       (The Column Metadata window now shows STOREID with the ID role, SALESTOT with the Rejected role, and the remaining variables as interval inputs.)

   e.  To add an Input Data node to the diagram workspace and select the DUNGAREE data table as the data source, drag the DUNGAREE data source onto the diagram workspace.

   f.  Add a Cluster node to the diagram workspace. The workspace should appear as shown.

   g.  Select the Cluster node. In the property sheet, select Internal Standardization => Standardization.

       If you do not standardize, the clustering will occur strictly on the inputs with the largest range (Original and Leisure).
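       The effect of this setting can be sketched outside SAS Enterprise Miner with PROC STDIZE and PROC FASTCLUS. This is an illustration only (not what the Cluster node runs internally); the variable names come from the DUNGAREE data source.

          /* Sketch only: rescale each input to mean 0 and standard       */
          /* deviation 1 so that no single input dominates the distance   */
          /* calculations, and then cluster the standardized values.      */
          proc stdize data=aaem61.dungaree out=dungaree_std method=std;
             var fashion leisure original stretch;
          run;

          proc fastclus data=dungaree_std maxclusters=6 out=clusters;
             var fashion leisure original stretch;
          run;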



   h.  Run the diagram from the Cluster node and examine the results.
       1) Run the Cluster node and view the results.
       2) To view the results, right-click the Cluster node and select Results....
          (The Results window includes a Segment Plot and a Mean Statistics table; with the default settings, roughly 20 segments are generated.)

       The Cluster node's Automatic number-of-clusters specification method seems to generate an excessive number of clusters.


   i.  Specify a maximum of six clusters.
       1) Select Specification Method => User Specify.
       2) Select Maximum Number of Clusters => 6.
          (In the Number of Clusters property group, Specification Method is now User Specify and Maximum Number of Clusters is 6.)



       3) Run the Cluster node and view the results.
          (The Segment Plot and Mean Statistics table now show six segments.)

       Apparently, all but one of the segments is well populated. There are more details about the segment composition in the next step.

   j.  Connect a Segment Profile node to the Cluster node.



       1) Run the Segment Profile node and view the results.

          Segment 1 contains stores selling a higher-than-average number of original jeans.

          Segment 2 contains stores selling a higher-than-average number of stretch jeans.

          Segment 3 contains stores selling small numbers of all jean styles.

          Segment 4 contains stores selling a higher-than-average number of leisure jeans.

          Segment 5 contains stores selling a higher-than-average number of fashion jeans.

          Segment 6 contains stores selling a higher-than-average number of original jeans, but a lower-than-average number of stretch and fashion jeans.

2. Conducting an Association Analysis

   a.  Create the Transactions diagram.
       1) To open a new diagram in the project, select File => New => Diagram....
       2) Name the new diagram Transactions and select OK.

   b.  Create a new data source using the data set AAEM61.TRANSACTIONS.
       1) Right-click Data Sources in the project tree and select Create Data Source.
       2) In the Data Source Wizard - Metadata Source window, make sure that SAS Table is selected as the source and select Next >.
       3) Select Browse... to choose a data set.
       4) Double-click the AAEM61 library and select the TRANSACTIONS data set.
       5) Select OK.
       6) Select Next >.
          (The Table Information window shows the AAEM61.TRANSACTIONS table properties, including 4 variables and more than 400,000 observations.)
       7) Examine the data table properties, and then select Next >.
       8) Select Advanced to use the Advanced Advisor, and then select Next >.



   c.  Assign appropriate model roles to the variables.
       1) Hold down the CTRL key and select the rows for the variables STORE and QUANTITY. In the Role column of one of these rows, select Rejected.
       2) Select the TRANSACTION row and select ID as the role.
       3) Select the PRODUCT row and select Target as the role.
          (The Column Metadata window now shows Product with the Target role, Transaction with the ID role, and Quantity and Store with the Rejected role.)
       4) Select Next >.
       5) To skip decision processing, select Next >.
       6) Change the role to Transaction.
          (The Data Source Attributes window lets you change the name and the role, and specify a population segment identifier for the data source to be created.)
       7) Select Finish.



   d.  Add the node for the TRANSACTIONS data set and an Association node to the diagram. The workspace should appear as shown.

   e.  Change the setting for Export Rule by ID to Yes.
       (In the Association node Properties panel, Export Rule by ID is now set to Yes; the remaining Train properties keep their default values.)



   g.  Examine the results of the association analysis. Examine the Statistics Line plot.
       (The Statistics Line plot graphs Lift, Confidence(%), Expected Confidence(%), and Support(%) by Rule Index.)

       Rule 1 has the highest lift value, 3.60. Looking at the output reveals that Rule 1 is the rule Toothbrush => Perfume.


Chapter 9 Special Topics

9.1   Introduction                                                             9-3

9.2   Ensemble Models                                                          9-4
      Demonstration: Creating Ensemble Models                                  9-5

9.3   Variable Selection                                                       9-9
      Demonstration: Using the Variable Selection Node                         9-10
      Demonstration: Using Partial Least Squares for Input Selection           9-14
      Demonstration: Using the Decision Tree Node for Input Selection          9-19

9.4   Categorical Input Consolidation                                          9-22
      Demonstration: Consolidating Categorical Inputs                          9-23

9.5   Surrogate Models                                                         9-31
      Demonstration: Describing Decision Segments with Surrogate Models        9-32



9.1 Introduction

This chapter contains a selection of optional special topics related to predictive modeling. Unlike other chapters, each section is independent of the preceding section. However, demonstrations in this chapter depend on the demonstrations found in Chapters 2 through 8.



9.2 Ensemble Models

Ensemble Models

Combine predictions from multiple models to create a single consensus prediction.

The Ensemble node creates a new model by combining the predictions from multiple models. For prediction estimates and rankings, this is usually done by averaging. When the predictions are decisions, this is done by voting. The commonly observed advantage of ensemble models is that the combined model is better than the individual models that compose it. It is important to note that the ensemble model can be more accurate than the individual models only if the individual models disagree with one another. You should always compare the model performance of the ensemble model with the individual models.
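As a sketch of the averaging idea (an illustration only, not the Ensemble node's implementation), the DATA step below averages the posterior probabilities from two previously scored data sets. SCORED_REG, SCORED_NN, and the P_TargetB1 column name are assumptions for this example.

   /* Sketch only: a consensus prediction from two models' scored output. */
   data ensemble;
      merge scored_reg(rename=(p_targetb1=p_reg))
            scored_nn (rename=(p_targetb1=p_nn));
      by id;
      p_ensemble = mean(p_reg, p_nn);    /* average of the posterior estimates */
      d_ensemble = (p_ensemble >= 0.5);  /* a simple decision threshold        */
   run;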



Creating Ensemble Models

This demonstration continues from the Predictive Analysis diagram. It proceeds in two parts. First, a control point is added to the process flow to reduce clutter. Then, the Ensemble tool is introduced as a potential prediction model.

Diagram Control Point

Follow these steps to add a control point to your analysis diagram:
1. Select the Utility tab.
2. Drag a Control Point tool into the diagram workspace.
3. Delete all connections to the Model Comparison node.
4. Connect model nodes of interest to the Control Point node.
5. Connect the Control Point node to the Model Comparison node.


The completed process flow should appear as shown.

The Control Point node serves as a junction in the diagram. A single connection out of the Control Point node is equivalent to all the connections into the node.

6. Select the Model tab.
7. Drag an Ensemble tool into the diagram.
8. Connect the Control Point node to the Ensemble node.
9. Connect the Ensemble node to the Model Comparison node.

The completed process flow should appear as shown.



Ensemble Tool

The Ensemble node is not itself a model. It merely combines model predictions. For a categorical target variable, you can choose to combine models using the following functions:

• Average takes the average of the prediction estimates from the different models as the prediction from the Ensemble node. This is the default method.

• Maximum takes the maximum of the prediction estimates from the different models as the prediction from the Ensemble node.

• Voting uses one of two methods for prediction. The average method averages the prediction estimates from the models that decide the primary outcome and ignores any model that decides the secondary outcome. The proportion method ignores the prediction estimates and instead returns the proportion of models deciding the primary outcome.

Run the Model Comparison node and view the results. This enables you to see how the ensemble model compares to the individual models.

(The Results window shows the ROC chart for TargetB with Data Role = VALIDATE, with curves for the Regression, Ensemble, Decision Tree, Neural Network, and Probability Tree models plus the baseline.)


(The Score Rankings plot compares the Ensemble, Neural Network, Regression, Decision Tree, and Probability Tree models by percentile.)

The ROC chart and Score Rankings plots show that the ensemble model is similar to the other models.



9.3 Variable Selection

Input Selection Alternatives

    Tool                     Selection method
    Regression               Sequential selection
    Variable Selection       Univariate + forward selection (R square); tree-like selection (Chi-square)
    Partial Least Squares    Variable Importance in the Projection (VIP)
    Decision Tree            Split search selection

Removing redundant or irrelevant inputs from a training data set often reduces overfitting and improves prediction performance. Some prediction models (for example, neural networks) do not include methods for selecting inputs. For these models, input selection is done with a separate SAS Enterprise Miner tool. Chapter 4 introduced sequential selection to find useful inputs for regression models. Later in Chapter 5, these inputs were used to compensate for the Neural Networks tool's lack of input selection capabilities. SAS Enterprise Miner features several additional methods to perform input selection for modeling tools inherently missing this capability. The following demonstrations illustrate the use of the Variable Selection, Partial Least Squares, and Decision Tree tools for selection of potentially useful inputs.



Using the Variable Selection Node

The Variable Selection tool provides a selection based on one of two criteria.

When you use the R-squared variable selection criterion, a two-step process is followed:
1. SAS Enterprise Miner computes the squared correlation for each variable and then assigns the Rejected role to those variables that have a value less than the squared correlation criterion. (The default is 0.005.)
2. SAS Enterprise Miner evaluates the remaining (not rejected) variables using a forward stepwise R-squared regression. Variables that have a stepwise R-squared improvement less than the threshold criterion (default = 0.0005) are assigned the Rejected role.

When you use the chi-squared selection criterion, variable selection is performed using binary splits for maximizing the chi-squared value of a 2x2 frequency table. The rows of the 2x2 table are defined by the (binary) target variable. The columns of the table are formed by a partition of the training data using a selected input.

Several partitions are considered for each input. For an L-level class input (binary, ordinal, or nominal), partitions are formed by comparing each input level separately to the remaining L-1 input levels, creating a collection of L possible data partitions. The partition with the highest chi-squared value is chosen as the input's best partition. For interval inputs, partitions are formed by dividing the input range into (a maximum of) 50 equal-length bins and splitting the data into two subsets at 1 of the 49 bin boundaries. The partition with the highest chi-squared statistic is chosen as the interval input's best partition.

The partition/variable combination with the highest chi-squared statistic is used to split the data, and the process is repeated within both subsets. The partitioning stops when no input has a chi-squared statistic in excess of a user-specified threshold. All variables not used in at least one partition are rejected.

The Variable Selection node's chi-squared approach is quite similar to a decision tree algorithm with its ability to detect nonlinear and nonadditive relationships between the inputs and the target. However, the method for handling categorical inputs makes it sensitive to spurious input/target correlations. Instead of the Variable Selection node's chi-squared setting, you might want to try the Decision Tree node, properly configured for input selection.

The following steps show how to use the Variable Selection tool with the R-squared setting.
1. Select the Explore tab.
2. Drag a Variable Selection tool into the diagram workspace.


3. Connect the Impute node to the Variable Selection node.

The default Target Model (input selection method) property is set to Default. With this default setting, if the target is binary and the total parameter count for a regression model exceeds 400, the chi-squared method is used. Otherwise, the R-squared method of variable selection is used.

4. Select Target Model => R-Square.

Recall that a two-step process is performed when you apply the R-squared variable selection criterion to a binary target. The Properties panel enables you to specify the maximum number of variables that can be selected, the cutoff minimum squared correlation measure for an individual variable to be selected, and the necessary R-squared improvement for a variable to remain as an input variable.
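The first of these two steps amounts to a simple correlation screen. The code below is a sketch of that idea only (it is not what the Variable Selection node executes), and the input names are illustrative.

   /* Sketch only: reject inputs whose squared correlation with the      */
   /* target falls below the 0.005 cutoff.                               */
   proc corr data=train outp=corrs noprint;
      var targetb;
      with giftcnt36 giftcntall giftavglast promcnt12;   /* example inputs */
   run;

   data screened;
      set corrs;
      where _type_ = 'CORR';
      r_square = targetb**2;            /* squared correlation with target */
      rejected = (r_square < 0.005);    /* step 1 rejection flag           */
   run;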



Additional available options include the following:

• Use AOV16 Variables - When selected, this option requests SAS Enterprise Miner to bin interval variables into 16 equally spaced groups (AOV16). The AOV16 variables are created to help identify nonlinear relationships with the target. Bins with zero observations are eliminated, which means that an AOV16 variable can have fewer than 16 bins.

• Use Group Variables - When set to Yes, this option enables the number of levels of a class variable to be reduced, based on the relationship of the levels to the target variable.

• Use Interactions - When this option is selected, SAS Enterprise Miner evaluates two-way interactions for categorical inputs.

5. Run the Variable Selection node and view the results. The Results window opens.

m u m File

Edit

View

Window

0

Variable Selection &

NAME DemCluster DernGender DernHorne... DemMedH... DemPctVet... G_DemClu... GiftTimeFirst GiftTimeLast IMP_DemA... IMP_LOG_... IMP„REP_... LOG_GiftAv... LOG_GiftAv... LOG_GiftAv„, LOGjGiftC...

ROLE REJECTED REJECTED REJECTED REJECTED REJECTED INPUT REJECTED INPUT REJECTED REJECTED REJECTED REJECTED INPUT REJECTED INPUT

LEVEL NOMINAL NOMINAL BINARY INTERVAL INTERVAL INTERVAL INTERVAL INTERVAL INTERVAL INTERVAL INTERVAL INTERVAL INTERVAL INTERVAL INTERVAL

TYPE C C C N N N N N N N N N N N N

LABI

Demk Gene!. Horn' Medi:'." Perci :

Time: Tj'mefl Impu Impu Impu Tran; Tran: Tran Tranj*

ij: •

1 >\ Output

n"er

Usee: Date: Time:

Dodotech

21HAYG8 14:16

* Training Output

Variable Summary Measurement Level

6.

M a x i m i z e the Variable Selection w i n d o w .

Frequency Count

IEI


9.3 Variable Selection

7.

Select the R O L E c o l u m n heading t o sort the variables by their assigned roles.

8.

Select the Reasons for Rejection c o l u m n heading.

9-13

EH Variable Selection Variable Name

ROLE

G_DemClu... INPUT GittTimeLast INPUT LOG_GiftAv... INPUT LOG_GiftC... INPUT LOG_GirtC... INPUT REP_Statu... INPUT DemGender REJECTED DemHome... REJECTED DemMedH... REJECTED DemPctVet... REJECTED GiftTimeFirst REJECTED IMP_DemA... REJECTED IMP_LOG_... REJECTED IMP_REP_... REJECTED LOG_GiftAv... REJECTED LOG_GiftAv... REJECTED LOG_GiftC... REJECTED LOG_GiftC... REJECTED M_DemAge REJECTED M_LOG_Gif... REJECTED M_REP_De... REJECTED PromCnt12 REJECTED PrornCnt36 REJECTED PromCntAII REJECTED PromCntCa.. REJECTED PromCntCa.. REJECTED PrornCntCa.. REJECTED StatusCatSt. REJECTED iDe m Cluster REJECTED i ,

LEVEL NOMINAL INTERVAL INTERVAL INTERVAL INTERVAL NOMINAL NOMINAL BINARY INTERVAL INTERVAL INTERVAL INTERVAL INTERVAL INTERVAL INTERVAL INTERVAL INTERVAL INTERVAL BINARY BINARY BINARY INTERVAL INTERVAL INTERVAL INTERVAL INTERVAL INTERVAL BINARY NOMINAL

TYPE N N N N N c c N N N N N N N N N N N N N N N N N N N

fi C

Variable Label

Reasons for Rejection A

1

Time Since... Transforme... Transforms... Transforms... Replace:St... Gender Varsel:Sma... Home OwnerVarsel:Sma... Median Ho... Varsel:Sma... PercentVet... Varse1:Sma... Time Since... Varsel:Sma... Imputed: Age Varsel:Sma... Imputed: Tr... Varsel:Sma... Imputed: R... Varsel:Sma... Transforme... Varsel:Sma... Transforme... Varsel:Sma... Transforme... Varsel:Sma... Transforme... Varsel:Sma... Imputation I... Varsel:Sma... Imputation I... Varsel:Sma... Imputation L.Varsel:Sma... Promotion... Varsel:Sma... Promotion... Varsel:Sma... Promotion... Varsel:Sma... Promotion... Varsel:Sma... Promotion... Varsel:Sma... Promotion... VarseI:Sma... Status Cate... Varsel:Sma... Demograp... Varsel:Sma...

The Variable Selection node finds that most inputs have insufficient target correlation to justify keeping them. You can try these inputs in a subsequent model or adjust the R-squared settings to be less severe. Notice the input G_DemCluster is a grouping of the original DemCluster input.

Binary targets generate notoriously low R-squared statistics. A more appropriate association measure might be the likelihood chi-squared statistic found in the Regression node.


9-14

Chapter 9 Special Topics

Using Partial Least Squares for Input Selection

Partial least squares (PLS) regression can be thought of as a merging of multiple regression and principal components regression. In multiple regression, the goal is to find linear combinations of the inputs that account for as much (linear) variation in the target as possible. In principal components regression, the goal is to find linear combinations of the inputs that account for as much (linear) variation in the input space as possible, and then use these linear combinations (called principal component vectors) as the inputs to a multiple regression model. In PLS regression, the goal is to have linear combinations of the inputs (called latent vectors) that account for variation in both the inputs and the target. The technique can extract a small number of latent vectors from a set of correlated inputs that correlate with the target.

A useful feature of the PLS procedure is the inclusion of an input importance metric named variable importance in the projection (VIP). VIP quantifies the relative importance of the original input variables to the latent vectors. A sufficiently small VIP for an input (less than 0.8, by default) plus a small parameter estimate (less than 0.1, by default) permits removal of the input from the model.

The following steps demonstrate the use of the PLS tool for variable selection:
1. Select the Model tab.
2. Drag a Partial Least Squares tool into the diagram workspace.
3. Connect the Impute node to the Partial Least Squares node.

HEE3|


Select Export Selected Variables => Yes. T r 3 in Variables •Modeling Techniques ^-Regression Model [-PLS Algorithm r Maximum Iteration - Epsilon ^ N u m b e r of Factors [•Default - N u m b e r of Factors a c r o s s Validation [•CV Method •-CVN Parameter itrlffffifflle j - N u m b e r o f Iterations |-Default No. o f T e s t O b s . |-No. ofTestObs. [•Default Random Seed - R a n d o m Seed Score :

j-Variable Selection Criterion [-Para. Est. Cutoff I-VIP Cutoff |- Export Selected Variables ••Hide Rejected Variables

m PLS NIPALS

200 1.0E-12 Yes

15 None

7 10 Yes

100 Yes

1234

Both

0.1 0.8 Yes No


9-16

5.

Chapter 9 Special Topics

R u n the Partial Least Squares node and v i e w the results. 1

Results - Node: Partial Least Squares Diagram; Predictive Analysis

£fc Ed*flewtfndow Store Biiikingi Overlay: Target. B

Variable I m p o r t a n c e For P r o j r i t

Stude July 16:07

Usee: Date: Tlae: * Training Output

Variable Suaaary fleajuicaint Level

AH B M B

I-I.UJJ.UIII X variation accoulcd lor by IMs laclor

Hunter or Extracted Factors

Total X v*wan accounted lor )o rar

occcurted (orrjflliis rati or

TOH¥ variation

StanWrcSie

Variable

d Parameter

•tipariaree

R e u s e d By Parameter

Ettrnate

1

8 3263

9.3263

2.9889

2.9B89

DamClusla

0 02660

0.154201 Y e s

Yes

Rejected

2

3.1016

12.4278

1.5930

4.5818

.DsmCluste

0 00978

0.106351 Y e s

Yas

Rejected

3

2 5687

14 3 9 6 6

0.4026

4.9B4*

DemCluste

•0 0 0 2 0 2

0 00969BYSS

Yes

Rejected

0.086102Yas

Yes

Rejected

0.128782Yes

Yes

Rejected

0.105993Yes

Yes

Rejected

0 236457Yes

Yes

Rejected

0.271819Yes

Yes

Rejected

0.137011 Y e s

Yes

Rejected

0.053855Yes

Yes

Rejected

0 266047Yes

Yes

Rejected

019543Y6S

Yes

Rejected

0.057593 Y e s

Yes

Rejected

4

22235

17.2201

0.0909

5.0754

DemCluste

0 00215

5

3.4493

20 6 5 9 *

0.0223

5.0977

:DemCluste

0 00821

6

1.7022

22.3718

00290

51267

DemCluste

7

1.BB4S

142364

0 0134

51*00

.DemClusle

8

2 509*

26 74 SB

0 0113

5.1513

DemCluste

9

1.0899

27.B357

0.0233

5.1751

DemClusle

10

1.5489

29 3B46

0 0096

518*7

DemClusle

11

1.1864

30 5 8 6 9

0 0086

5193*

DemClusle

12 13

0.7255

31 3 0 6 5

0.0158

5.2091

pemCluste

1.2303

32.5368

0.0087

5 2178

DemCluste

• Li. Target

•001016 •0.01654 0 02483 -0 00984 -I) 0 0 2 2 5 -001887 0 01489 0 00299

Ft Statistics

Statistics Label

TargatB

_ERR„

Error Fundi...

14273.51

14337,59

TargelB

„SSE_

SumofSqu

2549 322

2508.348

Targets

_MAX_

H a i i m u m A...

•.850114

0.87B124

TargetB

_DIV_

D M s n r f a r A..

9666

9BB6

Targets

_NQBS_

S u m of Fie...

4B43

Targets

_WRONO.

Number of...

2421

2(22

Targets

_DISF_

Frequency...

4843

4843

Targets

_MISC_

Mlsclassific.

0.499897

0.500103

TargalB

_PROF_

Total Profit

9 * 4 4704

799.05BB

4843


9,3 Variable Selection

6.

9-17

M a x i m i z e the Percent Variation w i n d o w .

| HI P e r c e n t V a r i a t i o n Number of

X variation

Total X

Y variation

Total Y

Extracted

accounted

Factors

for by this factor

variation accounted

accounted for by this

variation accounted

for so far

factor

for so far

1

9.3263

9.3263

2.9889

2.9889

2

3.1016

12.4279

1.5930

4.5813

3

2.5687

14.9966

0.4026

4.9844

2.2235

17.2201

0.0909

5.0754 5.0977

E 5

3.4493

20.6694

0.0223

6

1.7022

22.3716

0.0290

5.1267

7

1.8648

24.2364

0.0134

5.1400

8

2.5094

26.7458

0.0113

5.1513

9

1.0899

27.8357

0.0238

5.1751

10

1.5489

29.3846

0.0096

5.1847

11

1.1964

30.5809

0.0086

5.1934

12

0.7255

31.3065

0.0158

5.2091

13

1.2303

32.5368

0.0087

5.2178

14

1.3137

33.8505

0.0065

5.2243

15

0.8789

34.7294

0.0089

5.2332

By default, the Partial Least Squares tool extracts 15 latent vectors or factors f r o m the t r a i n i n g data set. These factors account for 3 5 % and 5 . 2 % o f the variation in the inputs and target, respectively.


9-18

Chapter 9 Special Topics

7.

M a x i m i z e the Variable Selection w i n d o w .

8.

Select the Role c o l u m n heading to sort the table by variable role. r j ^ R e s u l t s - Mode: P a r t i a l L e a s t S q u a r e s D i a g r a m : P r e d i c t i v e A n a l y s i s

File

Edit

View a

Window E3

*>

Variable Selection

Variable

GiftTimeLast LOG_GiftAv... LOG_GiftC... LOG_GiftC... LOG_GiftC... LOG_GiftC... PromCntAII PromCntCa... DemCluste... DemCluste... DemCluste... DemCluste... DemCluste... DemCluste... DemCluste... DemCluste... DemCluste... DemCluste... DemCluste... DemCluste... DemCluste... <

|

Standardize d Parameter Estimate -0.07806 -0.05216

0.06015 -0.09391 0.04138 0.01129 0.14428 -0.10896 0.02860

0.00978 -0.00202 0.00215 0.00821 -0.01016 -0.01654 0.02488 -0.00984 -0.00225 -0.01887 0.01489 0.00299

Variable Importance for Projection

Rejected by Parameter Estimate

0.842938Yes 0.824256Yes 0.948709Yes 0.871212Yes 0.851113Yes 0.863856Yes 0.717499IMO 0.756161 No 0.154201 Yes 0.106351 Yes 0.009698Yes 0.086102Yes 0.128782Yes 0.105993Yes 0.236457Yes 0.271 819Yes 0.137011 Yes 0.053855Yes 0.266047Yes 0.19543Yes 0.057599Yes

Rejected by VIP

NO No No No No No Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes

Role

±

Input Input Input Input Input Input Input Input Rejected Rejected Rejected Rejected Rejected Rejected Rejected Rejected Rejected Rejected Rejected Rejected Rejected \

w

*1

The majority of the selected inputs relate to donation count, promotion count, time since donation, and amount of donation.
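Outside SAS Enterprise Miner, a comparable factor extraction can be sketched with the PLS procedure. This is an illustration under assumptions (interval coding of the binary target and an illustrative input list); it is not the code generated by the Partial Least Squares node, and VIP values would still need to be derived from the factor weights and loadings.

   /* Sketch only: extract 15 PLS factors for TargetB from a few inputs. */
   proc pls data=train nfac=15;
      model targetb = giftcnt36 giftcntall giftavglast promcnt12;
      output out=pls_scores xscore=factor;   /* scores factor1-factor15  */
   run;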


9.3 Variable Selection

9-19

Using the Decision Tree Node for Input Selection

Decision trees can be used to select inputs for flexible predictive models. They have an advantage over using a standard regression model or the Variable Selection tool's R-squared method if the inputs' relationships to the target are nonlinear or nonadditive.

1. Connect a Decision Tree node to the Impute node. Rename the Decision Tree node Selection Tree.

Q

J

©ioo%|s*aj

You can use the Tree node with default settings to select inputs. However, this tends to select too few inputs for a subsequent model. Two changes to the Tree defaults result in more inputs being selected. Generally, when you use trees to select inputs for flexible models, it is better to err on the side of too many inputs rather than too few. The model's complexity optimization method can usually compensate for the extra inputs. f

The following changes to the defaults act independently. You can experiment to discover which method generalizes best with your data.


9-20

Chapter 9 Special Topics

2.

Type 1 as the N u m b e r o f Surrogate Rules value.

3.

Select Subtree •=> Method <=> Largest.

Train Variables Interactive

s Criterion •Significance Level -Missing Values -Use Input Once -Maximum Branch -Maximum Depth -Minimum Categorical Size

Default 0.2 Use in search No

-Leaf Size -Number ofRules -Number of Surrogate Rules •Split Size

• Exhaustive '-Node Sample El \- Method hNumber of Leaves \-Assessment Measure '-Assessment Fraction

5000 20000 Largest 1 Decision 0.25

C h a n g i n g the number o f surrogates enables inclusion o f surrogate splits i n the variable selection process. B y d e f i n i t i o n , surrogate inputs are typically correlated w i t h the selected split input. W h i l e it is usually a bad practice to include redundant inputs in predictive models, many flexible models tolerate some degree o f input redundancy. The advantage o f i n c l u d i n g surrogates i n the variable selection is to enable inclusion o f inputs that do not appear in the tree e x p l i c i t l y but are still important predictors o f the target. C h a n g i n g the Subtree method causes the tree algorithm to not prune the tree. A s w i t h a d d i n g surrogate splits to the variable selection process, it tends to add (possibly irrelevant) inputs to the selection list. 4.

Run the Selection Tree node and v i e w the results.


9.3 Variable Selection

9-21

5. View lines 60 through 73 of the Output window. Cbs mjE

1 uDGj&ftortae 2 UDGJ&ftCnffianfi6 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

RxrrCrrt12 DenCIister Dent^tecHmeValue UDGJ3LftA\g36 LJDGJiftAvrJjast GiftTiieLast UDGJ^ftAvgoil RtnQTt36 PixnCWCard12 DGrrH3TE0Arer StahcCartStarAll I^JB'JDerrrVMtrcaiB U^J&fttrrtCanWl Danftettfeterans

um.

MILES NELFFDJATES IvFCRTANCE

TransfbniBd: Gift Count 35 Mirths Transformed: Gift Cant Card 35 Mirths Piuiulicn Cant 12 Mnths DenEojctphic Cluster Msclian hfcme Value Region TrarBfornEd: Gift Arant Average 35 Mxrths Trarefbrmsd: Gift Anxnt Last Tine Sinoe Last Gift IrrpjtEd: TrensfoniEd: Gift Arairt Average Card 35 Months TransfDrTred: Gift Aiant Average ALL Mxrths RuiuLicn Cant 35 Mirths PraiDtion Cant Card 12 Mxrths Hare Qvner Status Category Star All Mxrths Inpjted: Replacement: DEntteCnrxme Transfonnad: Gift Cant Card All Mirths Feroent \feterans Region

1

2

0 1 1 2 1

0

3

1 1 2 1

0 0 0 0 1

0

2 3

0 0

1 1 0 0 0

1

0 0 1 1 1

1.00000 0.89&5 0.66372 0.66908 0.64347 0.62733 0.53413 0.48092 0.40032 0.33350 0.37819 0.35544 0.30767 0.29455 0.25345 0.25255 0.02873

VBJPCKIMEE ROTO 1.00000 0.89945 0.35177 0.09777 0.10851 0.70944 0.67183 0.44530 0.32098 O.OOOOO 0.00000 0.00000 0.00000 0.06599 0.00000 0.06882 O.OOOOO

1.00000 1.00000 0.54506 0.14834 0.16379 1.13095 1.2B189 0.92535 0.80240 0.00000 O.OOOOO O.OOOOO O.OOOOO 0.22403 0.00000 0.22408 O.OOOOO

The IMPORTANCE column quantifies approximately how much of the overall variability in the target each input explains. The values are normalized by the amount of variability explained by the input with the highest importance. The variable importance definition not only considers inputs selected as split variables, but it also accounts for surrogate inputs (if a positive number of surrogate rules is selected in the Properties panel). For example, the second most important input (LOG_Gif tCntCard3 6) accounts for almost the same variability as the most important input (LOG_Gif tent36) even though it does not explicitly appear in the tree.


9-22

Chapter 9 Special Topics

9.4 Categorical Input Consolidation

Categorical Input Consolidation ABCDJ

[ABCD

Hir\EFG

x

2

Combine categorical input levels that have similar primary outcome proportions.

12

Categorical inputs pose a major problem for parametric predictive models such as regressions and neural networks. Because each categorical level must be coded by an indicator variable, a single input can account for more model parameters than all other inputs combined. Decision trees, on the other hand, thrive on categorical inputs. They can easily group the distinct levels of the categorical variable together and produce good predictions. It is possible to use a decision tree to consolidate the levels of a categorical input. You simply build a decision tree using the categorical variable of interest as the sole modeling input. The split search algorithm then groups input levels with similar primary outcome proportions. The IDs for each leaf replace the original levels of the input.
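Once the tree assigns each DemCluster level to a leaf, the replacement itself is just a table lookup. The PROC SQL step below is a sketch of that final step only; LEAF_MAP is a hypothetical table of (DemCluster, leaf ID) pairs read off the tree rules, not an object created by SAS Enterprise Miner.

   /* Sketch only: replace the 50+ DemCluster levels with a handful of   */
   /* tree-leaf IDs by joining against a lookup table.                   */
   proc sql;
      create table train_grouped as
      select t.*,
             m.leaf_id as DemCluster_Grp   /* consolidated categorical input */
      from train as t
           left join leaf_map as m
             on t.DemCluster = m.DemCluster;
   quit;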


9.4

Categorical Input Consolidation

9-23

Consolidating Categorical Inputs

Follow these steps to use a tree model to group categorical input levels and create useful inputs for regression and neural network models. 1.

Connect a Decision Tree node to the Impute node, and rename the Decision Tree node

Consolidation Tree. - Predictive Analysis :

Polynomial

~ \l •

The Consolidation Tree node is used to group the levels of DemCluster, a categorical input with more than 50 levels. You use a tree model to group these levels based on their associations with TargetB. From this grouping, a new modeling input is created. You can use this input in place of DemCluster in a regression or other model. In this way, the predictive prowess of DemCluster is incorporated into a model without the plethora of parameters needed to encode the original. The grouping can be done autonomously by simply running the Decision Tree node, or interactively by using the node's interactive training features. You use the automatic method here.


9-24

Chapter 9 Special Topics

2.

Select Variables... f r o m the Consolidation Tree Properties panel.

3.

Select DemCluster 3> Explore.... The Explore w i n d o w opens. / • Explore - EMWS.Impt_TRAIN File

View

Actions

Window

oR/l

Do

.jDjx]

Rows Columns Library Member Type Sample Method Fetch Size Fetched Rows Random Seed

250

Value

Property

Unknown 45 EMWS IMPT TRAIN VIEW Random Max 4343 12345

200 | 150 £ « 100 u_ 50 0

Apply

Plot..,

—i 35

1

1

1

1

1

1

1

15

36

16

17

24

39

45

r~ 18

Demographic Cluster ^Jnjxjj

Obs#

Target G... Demogr... 000 1 00 I 035 1 035 115 136 016 017 124 035 $ 11 039 12 045 - • • -

'

*I

T


9-4 Categorical Input Consolidation

4.

9-25

M a x i m i z e the D e m C l u s t e r histogram. L l DemCluster 250-

200-

> 150 o

c

u cr 100-

50

i

- 1 —

15

24

36

33

45

18

Demographic Cluster The D e m C l u s t e r input has more than 50 levels, but y o u can see the d i s t r i b u t i o n o f o n l y a f e w o f these levels. 5.

Click

!

i n the l o w e r right corner o f the DemCluster w i n d o w .


9-26

Chapter 9 Special Topics

The histogram expands to show the relative frequencies o f each level o f D e m C l u s t e r . L _ DemCluster

E Demographic Cluster The histogram reveals input levels w i t h l o w case counts. This can detrimentally affect the performance o f most models. 6.

Close the Explore w i n d o w .


9.4 Categorical Input Consolidation

7.

9-27

Select Use *=> No f o r a l l variables in the Variables w i n d o w . Select Use O Yes f o r D e m C l u s t e r and T a r g e t B . ©"Variables-Tree4 j(none) lolumns: Name

^ j f~~ nol jEqualto f " Label Report No No No No No No No No No No No No No No No No No No No No No No No No No No No No

No No

Role nput Taiqet nput nput nput Rejected nput nput nput nput nput nput nput nput nput nput Input nput nput nput nput nput nput Input Input Input Input Input Input Reiected

Apply

J |7 Basic"

f [fining

Use 7

DemCluster iYes TarqetB Yes DemGender No DernHorneOwNo DemMedHomNo DemMedlncortlo DemPctVeteraNo GiftTimeFirst No GlftTimeLast No IMP DemAge No IMP LOG GiONo IMP REP DetNo LOG GiftAvg3Ho LOG GiftAvqAHo LOG GinAvciLJJo LOG GiftCnt3No LOG GiftCntAINo LOG GiftCntCNo LOG GiftCntCNo M DemAae No Ml LOG GiftAvNo M REP DemtNo PromCnt12 No PromCnt36 No PromCntAII No PromCntCard-No PromCntCardWo PromCntCard.'No REP StatusCfJo StatusCatQSNNo

dl Level Nominal Binary Nominal Binary nterval Interval nterval Interval Interval Interval nterval nterval nterval nterval Interval Interval Interval Interval Interval Binary Binary Binary Interval Interval Interval Interval interval Internal Nominal Nominal

Type

r

Statistics

Informal:

Format

Character 'Numeric Character Character Numeric Numeric Numeric jNumeric Numeric Numeric 'Numeric Numeric 'Numeric {Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric •Numeric Character Character

Reset

Length 2 8 3 3 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 _ 8 5

DOLLAR! 1.0 DOLLAR 11.0

< |

Explore..,

Update Path

A f t e r s o r t i n g o n the c o l u m n U s e . the Variables w i n d o w should appear as s h o w n . 9.

Select O K to close the Variables w i n d o w .

OK

Cancel


9-28

Chapter 9 Special Topics

10. M a k e these changes i n the Train property group. a.

Select Assessment Measure O Average Squared Error. T h i s optimizes the tree f o r prediction estimates.

b.

Select Bonferroni Adjustment *=> No. W h e n y o u evaluate a potential split, the Decision Tree tool applies, b y default, a B o n f e r r o n i adjustment to the splits l o g w o r t h . The adjustment penalizes the l o g w o r t h o f potential D e m C l u s t e r splits. The penalty is calculated as the l o g o f the number o f partitions o f L ]

DemCluster levels split into t w o groups, or log]{)(2 ' - I ) . W i t h 54 distinct levels, the penalty is quite large. It is also, in this case, quite unnecessary. The penalty avoids f a v o r i n g inputs w i t h many possible splits. Here y o u are b u i l d i n g a tree w i t h only one input. It is impossible to favor this input over others because there are no other inputs. 11. M a k e these changes i n the Score property group. a.

Select Variable Selection •=> No. This prevents the decision tree f r o m rejecting inputs in subsequent nodes.

b.

Select Leaf Role <^> Input. This adds a new input ( _ N O D E _ ) to the t r a i n i n g data.

12. N o w use the Interactive Tree tool to cluster DemCluster values into related groups, a.

Select Interactive... from the Decision Tree Properties panel. Property Node ID Imported Data Exported Data Notes

Value Tree 4 •••

...

Train Variables Interactive Use Frozen Tree Use Multiple Taraets

...

No No


9.4 Categorical Input Consolidation

The S A S Enterprise M i n e r Tree Desktop A p p l i c a t i o n opens. Tree View N i n Node: 4343 T a r g e t : Tarc/etB 1 (%) : 5% 0 (%): 95*

9-29


Chapter 9 Special Topics

9-30

b.

R i g h t - c l i c k on the root node and select T r a i n Node f r o m the o p t i o n m e n u .

N i n Node: 4343 Target: TargetB 1 (%) : 5% 0 (%): 95%

Demographic Cluster

I 00,35,15,16,17,24, 39,1B, 53.27,4.

36.45, OS, 37, 25,43.51. 49, 34,32,4...

I N i n Mods: 3185 Target: TargetB 1 (%): £ 0 (%] : 94*

!

Demographic Cluster

i t 53 N i n Mode: Target: Targets 1 (%) : 4\ 0 . '. • : 96%

Demographic Cluster

I 0, 35,15,16,17,24, 39,18,53,27, 2.

40, 42,13,23,14,29,19,07, 04

IJ i n Node: 2433 Target: Targets 5% 1 (%) i • [«): 95%

N i n Node: 752 Target: TargetB 7% i (%) : 0 IV): 93%

The levels o f

DemCluster

3S. 05.37,25, 43, 34, 47,33, 26.20,0. H i n Node: S59 Target: TargetB 1 (%): 4% • .\j : 96%

45, 51,49.32, 41.30,10.06 N i n Node: 799 Target: TargetB 3% 1 (%) : • (%) : 97%

are partitioned into four groups corresponding to the four leaves

o f the tree. A n input named _ N O D E _ is added to the training data. You can use the Transform Variables t o o l to rename _ N O D E _ to a more descriptive value. Y o u can use the Replacement tool to change the level names.


9.5 Surrogate Models

9.5

S u r r o g a t e

9-31

M o d e l s

Surrogate Models

neural network decision boundary

decision

surrogate boundary

Approximate an inscrutable model's predictions with a decision tree.

15

The usual criticism of neural networks and similar flexible models is the difficulty in understanding the predictions. This criticism stems from the complex parameterizations found in the model. While it is true that little insight can be gained by analyzing the actual parameters of the model, much can be gained by analyzing the resulting prediction decisions. A profit matrix or other mechanism for generating a decision threshold defines regions in the input space corresponding to the primary decision. The idea behind a surrogate model is building an easy-to-understand model that describes this region. In this way, characteristics used by the neural network to make the primary decision can be understood, even if the neural network model itself cannot be.
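The idea can be sketched outside SAS Enterprise Miner by fitting a shallow tree to the model's decisions rather than to the original target. The PROC HPSPLIT call below is an illustration under assumptions (a scored data set named SCORED_NN and an illustrative input list); it is not the code used by the demonstration that follows.

   /* Sketch only: a two-level surrogate tree that describes which cases */
   /* the neural network decides to solicit (D_TargetB).                 */
   proc hpsplit data=scored_nn maxdepth=2;
      class d_targetb;
      model d_targetb = log_giftcnt36 demmedhomevalue;
   run;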


9-32

Chapter 9 Special Topics

Describing Decision Segments with Surrogate Models

In this demonstration, a decision tree is used to isolate cases found solicitation-worthy by a neural network model. Using the decision tree, characteristics of solicit-decision donors can be understood even if the model making the decision is a mystery.

Setting Up the Diagram You need to attach two nodes to the modeling node that you want to study. 1. Select the Utilities tab. 2. Drag a Metadata tool into the diagram workspace. 3. Connect the Neural Network node to the Metadata node. 4. Select the Model tab. 5. Drag a Decision Tree tool into the diagram workspace. 6. Connect the Decision Tree tool to the Metadata node. Rename the Decision Tree node D e s c r i p t i o n Tree. ;

Predictive Analysis

y ÂŁ Regression

ieladata

Description

km {Tree

zl


9.5 Surrogate Models

9-33

Changing the Metadata Y o u must change the focus o f the analysis from T a r g e t B to the variable describing the neural n e t w o r k decisions. S A S Enterprise M i n e r automatically creates this variable in the t r a i n i n g and validation data sets exported f r o m a m o d e l i n g node. T h e variable's name is D_target, where

target

is the name o f the

original target variable.

/r

D_TARGETB is automatically created on the d e f i n i t i o n o f decision data only.

The f o l l o w i n g steps change the target variable f r o m 1.

TargetB to D_TARGETB:

Scroll d o w n to the Variables section in the Metadata node properties. Select T r a i n . . . . T h e Variables w i n d o w opens. S V a r i a b l e s - Meta

Eiplore...

2

-

Select New Role ^ Target for D_TARGETB.

3.

Select New Role •=> Rejected for TargetB.

4.

Select O K to close the Variables d i a l o g box.

5.

Run the Metadata node. D o not v i e w the results.

Update Path

OK


9-34

Chapter 9 Special Topics

Exploring the Description Tree Follow these steps to explore the description tree. 1. Run the Description Tree node. View the results. 2. Select View «=> Model «=> Iteration Plot from the Description Tree results. 3. Change the basis of the plot to Misclassification Rate. I t e r a t i o n Plot Misclassification Rate

|j|§

"i

1

1

r

1

r

0

10

20

30

40

50

Number of Leaves Train: Misclassification Rate Valid: Misclassification Rate

The Assessment plot shows the trade-off between tree complexity and agreement with the original neural network model. The autonomously fit description agrees with the neural network about 95% of the time, and as the tree becomes more complicated, the agreement with the neural network improves. You can scrutinize this tree for the input values resulting in a solicit decision, or you can simplify the description with some accuracy loss. Surprisingly, 85% of the decisions made by the neural network model can be summarized by a three-leaf tree! The following steps generate the three-leaf description tree: 4. Close the Description Tree results window.


9.5 Surrogate Models

5.

Under the Subtree properties o f the description tree, change the M e t h o d to N and the N u m b e r o f Leaves to 3. Value

Property i i- Leaf Size 5 | \. Number of Rules 5 Number of Surrogate Ri 0 Split Size

B ! [- Exhaustive

Node Sample B i s ™ " i | Method Number of Leaves ; Assessment Measure ; : Assessment Fraction

5000 2 0 0 0 0 ^ ^ ^ ^ ^ ^ ! N 3 Decision 0.25

\ \ Perform Cross Validatio No Number of Subsets Number of Repeats

10 1 12345

j '• Seed (slobservation Based Impi ^Observation[Based ImpiNo •Number Single Var lmp(5 • P-Value Adjustment [•Bonferroni Adjustment Yes Time of Kass AdjustmerBefore No r Inputs r Number of Inputs '••Split Adjustment Yes uMMIllEHEIiia '- Leaf Variable 6.

Run the Description Tree node and open the Results w i n d o w .

9-35


9-36

Chapter 9 Special Topics

7.

Select View ^> Model •=> Iteration Plot from the Description Tree results.

8.

Change the basis o f the plot to Misclassification Rate. I t e r a t i o n Plot Misclassification Rate

0.3 -

0.2-

0.1 -

[

~~i

10

20

30

40

50

Number of L e a v e s Train: Misclassification Rate Valid: Misclassification Rate

The accuracy dropped to 8 5 % , but the tree has only t w o rules.

Statistic 0:

Ttain

Validation

35.2* £4.8* 4843

34. S t £5.1% 4843

1

Transformed: Gift Count 36 Month

< 1.24245 I Statistic 0: 1:

Ttaln £2.1%

>= 1.24245 I Validation

£3.8%

37.9% ::o7

::b3

11:

Statistic 0: 1:

Statistic

•:

1: II:

Train 18.4V ei.£% 7Q5

< 118200

Validation

22.3%

statistic Ol

71.n £SL

%i 11:

Ttain 81.e% i d . 4* 1S7G

as.7% 2S60

N:

Median Home Value Region

>= 118200

Train 11.3%

Validation 81.2% 18.8% 155£

Validation 10.£% 89.4% 2£3£


9.5 Surrogate Models

9-37

You can conclude from this tree that the neural network is soliciting donors who gave relatively many times in the past or donors who gave fewer times, but live in more expensive neighborhoods. You should experiment with accuracy versus tree size trade-offs (and other tree options) to achieve a description of the neural network model that is both understandable and accurate. It should be noted that the inputs used to describe the tree are the same as the inputs used to build the neural network. You can also consider other inputs to describe the tree. For example, some of the monetary inputs were log-transformed. To improve understandability, you can substitute the original inputs for these without loss of model accuracy. You can also consider inputs that were not used to build the model to describe the decisions. For example, i f a charity's Marketing Department was interested in soliciting new individuals similar to those selected by the neural network, but without a previous donation history, it could attempt to describe the donors using demographic inputs only. f

It should be noted that separate sampling distorts the results shown in this section. To accurately describe the solicit and ignore decisions, you must down-weight the primary cases and up-weight the secondary cases. This can be done by defining a frequency variable defined as shown.

if TargetB = 1 then WEIGHT = 0.05/0.50;   /* down-weight oversampled primary cases */
else               WEIGHT = 0.95/0.50;    /* up-weight secondary cases             */

You can define this variable using a Transform Variables tool's Expression Builder. You then set the model role of this variable to Frequency. Another approach is to build the model on the Score data set. Because the Score data set is not oversampled, you get accurate agreement percentages. This is possible because the description tree does not depend on the value of the target variables, only on the model predictions.




Appendix A Case Studies

A.1 Banking Segmentation Case Study A-3
A.2 Web Site Usage Associations Case Study A-16
A.3 Credit Risk Case Study A-19
A.4 Enrollment Management Case Study A-36



A.1 Banking Segmentation Case Study

Case Study Description

A consumer bank sought to segment its customers based on historic usage patterns. Segmentation was to be used for improving contact strategies in the Marketing Department.

A sample of 100,000 active consumer customers was selected. An active consumer customer was defined as an individual or household with at least one checking account and at least one transaction on the account during a three-month study period. All transactions during the three-month study period were recorded and classified into one of four activity categories:
• traditional banking methods (TBM)
• automatic teller machine (ATM)
• point of sale (POS)
• customer service (CSC)

A three-month activity profile for each customer was developed by combining historic activity averages with observed activity during the study period. Historically, for one CSC transaction, an average customer would conduct two POS transactions, three ATM transactions, and 10 TBM transactions. Each customer was assigned this initial profile at the beginning of the study period. The initial profile was updated by adding the total number of transactions in each activity category over the entire three-month study period.

The PROFILE data set contains all 100,000 three-month activity profiles. This case study describes the creation of customer activity segments based on the PROFILE data set.

The diagram containing this analysis is stored as an XML file on the course data disk. You can open this file by right-clicking Diagrams ⇨ Import Diagram from XML... in SAS Enterprise Miner. All nodes in the opened file, except the data node, contain the property settings outlined in this case study. If you want to run the diagram, you need to re-create the case study data set using the metadata settings indicated below.


Case Study Data

Name      Model Role   Measurement Level   Description
ID        ID           Nominal             Customer ID
CNT_TBM   Input        Interval            Traditional bank method transaction count
CNT_ATM   Input        Interval            ATM transaction count
CNT_POS   Input        Interval            Point-of-sale transaction count
CNT_CSC   Input        Interval            Customer service transaction count
CNT_TOT   Input        Interval            Total transaction count



Accessing and Assaying the Data

A SAS Enterprise Miner data source was defined using the metadata settings indicated above. The StatExplore node was used to provide preliminary statistics on the input variables.

The Interval Variable Summary from the StatExplore node showed no missing values but did show a surprisingly large range on the transaction counts.

Interval Variable Summary Statistics

Variable   ROLE    Mean      Std. Deviation   Non Missing   Missing   Minimum   Median   Maximum
CNT_ATM    INPUT   19.500    20.856           100000        0         3         13       628
CNT_CSC    INPUT   6.684     12.129           100000        0         1         2        607
CNT_POS    INPUT   11.923    20.734           100000        0         2         2        345
CNT_TBM    INPUT   68.137    101.154          100000        0         10        52       14934
CNT_TOT    INPUT   106.244   113.370          100000        0         17        89       15225

A plot of the input distributions showed highly skewed distributions for all inputs.



The Explore window for AAEM61.PROFILE (100,000 rows, 6 columns) displayed histograms of CNT_TOT, CNT_CSC, CNT_TBM, CNT_ATM, and CNT_POS, each with a highly skewed distribution.

It would be difficult to develop meaningful segments from such highly skewed inputs. Instead of focusing on the transaction counts, it was decided to develop segments based on the relative proportions of transactions across the four categories. This required a transformation of the raw data. A Transform Variables node was connected to the PROFILE node.

The Transform Variables node was used to create a category logit score for each transaction category:

    category logit score = log( transaction count(category) / ( transaction count(total) − transaction count(category) ) )
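Written as SAS expressions, the four category logit scores take the following form (the LGT_TBM formula is the one entered in the Expression Builder in the steps that follow, and all four reappear in the scoring code at the end of this case study):

LGT_TBM = log(CNT_TBM/(CNT_TOT - CNT_TBM));
LGT_ATM = log(CNT_ATM/(CNT_TOT - CNT_ATM));
LGT_POS = log(CNT_POS/(CNT_TOT - CNT_POS));
LGT_CSC = log(CNT_CSC/(CNT_TOT - CNT_CSC));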



The transformations were created using these steps:

1. Select Formulas... in the Transform Variables node's Properties panel. The Formulas window opens.

2. Select the Create icon. The Add Transformation dialog box opens.

In the Add Transformation dialog box, the new variable LGT_TBM was defined with Type N, Length 8, Level INTERVAL, Role INPUT, and the formula log(CNT_TBM/(CNT_TOT-CNT_TBM)).

3. For each transaction category, type the name and formula as indicated.

4. Select OK to add the transformation. The Add Transformation dialog box closes and you return to the Formula Builder window.

5. Select Preview to see the distribution of the newly created input.

When all four transformations were defined, the Formulas window listed the new outputs LGT_TBM, LGT_ATM, LGT_POS, and LGT_CSC, each numeric with an interval level and an Input role.

6. Repeat Steps 1-5 for the other three transaction categories.

7. Select OK to close the Formula Builder window.

8. Run the Transform Variables node.

Segmentation was to be based on the newly created category logit scores. Before proceeding, it was deemed reasonable to examine the joint distribution of the cases using these derived inputs. A scatter plot using any three of the four derived inputs would represent the joint distribution without significant loss of information. (Because the four category proportions sum to 1, the fourth logit score is determined by the other three.)



A three-dimensional scatter plot was produced using the following steps:

1. Select Exported Data... from the Properties panel of the Transform Variables node. The Exported Data window opens.

2. Select the TRAIN data and select Explore.... The Explore window opens.

3. Select Plot.... The Plot Wizard opens.

4. Select a three-dimensional scatter plot.

5. Select roles X, Y, and Z for LGT_ATM, LGT_CSC, and LGT_POS, respectively.

6. Select Finish to generate the scatter plot of LGT_ATM, LGT_CSC, and LGT_POS.

The scatter plot showed a single clump of cases, making this analysis a segmentation (rather than a clustering) of the customers. There were a few outlying cases with apparently low proportions on the three plotted inputs. Given that the proportions in the four original categories must sum to 1, it followed that these outlying cases must have a high proportion of transactions in the non-plotted category, TBM.


Creating Segments

Transaction segments were created using the Cluster node, which was attached to the Transform Variables node.

Two changes to the Cluster node default properties were made, both related to limiting the number of clusters created to five: the Specification Method was set to User Specify, and the Maximum Number of Clusters was set to 5.

Because the inputs were all on the same measurement scale (category logit score), it was decided not to standardize the inputs (Internal Standardization was set to None). In the Cluster node's Variables window, only the four LGT inputs defined in the Transform Variables node were set to Default; the remaining variables were set to No.



Additional cluster interpretations were made with the Segment Profile tool.


Interpreting Segments

A Segment Profile node attached to the Cluster node helped to interpret the contents of the generated segments.

In the Segment Profile node's Variables window, only the LGT inputs were set to Yes; the remaining variables were set to No.

The following profiles were created for the generated segments:

Segment 1 (13,157 customers, 13.16%) had a significantly higher than average use of traditional banking methods and lower than average use of all other transaction categories. This segment was labeled Brick-and-Mortar.


Segment 2 (25,536 customers, 25.54%) had a higher than average use of traditional banking methods but was close to the distribution centers on the other transaction categories. This segment was labeled Transitionals because these customers seem to be transitioning from brick-and-mortar to other usage patterns.

Segment 3 (17,839 customers, 17.84%) eschewed traditional banking methods in favor of ATMs. This segment was labeled ATMs.

Segment 4 (14,537 customers, 14.54%) was characterized by a high prevalence of point-of-sale transactions and few traditional banking methods. This segment was labeled Cashless.

Segment 5 had a higher than average rate of customer service contacts and point-of-sale transactions. This segment was labeled Service.



Segment Deployment

Deployment of the transaction segmentation was facilitated by the Score node.

The Score node was attached to the Cluster node and run. The SAS Code window inside the Results window provided SAS code capable of transforming raw transaction counts into cluster assignments. The complete SAS scoring code is shown below.

*------------------------------------------------------------*;
* Formula Code;
*------------------------------------------------------------*;
LGT_TBM = log(CNT_TBM/(CNT_TOT-CNT_TBM));
LGT_ATM = log(CNT_ATM/(CNT_TOT-CNT_ATM));
LGT_POS = log(CNT_POS/(CNT_TOT-CNT_POS));
LGT_CSC = log(CNT_CSC/(CNT_TOT-CNT_CSC));
*------------------------------------------------------------*;
* TOOL: Clustering;
* TYPE: EXPLORE;
* NODE: Clus;
*------------------------------------------------------------*;
*** Begin Scoring Code from PROC DMVQ ***;
*** Begin Class Look-up, Standardization, Replacement;
drop _dm_bad;
_dm_bad = 0;
*** Omitted Cases;
if _dm_bad then do;
   _SEGMENT_ = .;
   Distance = .;
   goto CLUSvlex;
end;
*** omitted;
*** Compute Distances and Cluster Membership;
label _SEGMENT_ = ' ';
label Distance = ' ';
array CLUSvads [5] _temporary_;
drop _vqclus _vqmvar _vqnvar;
_vqmvar = 0;
do _vqclus = 1 to 5;
   CLUSvads [_vqclus] = 0;
end;
if not missing( LGT_ATM ) then do;
   CLUSvads [1] + ( LGT_ATM - -3.54995114884544 )**2;
   CLUSvads [2] + ( LGT_ATM - -2.2003888516185 )**2;
   CLUSvads [3] + ( LGT_ATM - -0.23695023328541 )**2;
   CLUSvads [4] + ( LGT_ATM - -1.47814712774378 )**2;
   CLUSvads [5] + ( LGT_ATM - -1.49704375204905 )**2;
end;
else _vqmvar + 1.31533540479172;
if not missing( LGT_CSC ) then do;
   CLUSvads [1] + ( LGT_CSC - -4.16334022538952 )**2;
   CLUSvads [2] + ( LGT_CSC - -3.38356120535046 )**2;
   CLUSvads [3] + ( LGT_CSC - -3.55519058753 )**2;
   CLUSvads [4] + ( LGT_CSC - -3.96526745641346 )**2;
   CLUSvads [5] + ( LGT_CSC - -2.08727391873096 )**2;
end;
else _vqmvar + 1.20270093291082;
if not missing( LGT_POS ) then do;
   CLUSvads [1] + ( LGT_POS - -4.08779761080976 )**2;
   CLUSvads [2] + ( LGT_POS - -3.27644694006696 )**2;
   CLUSvads [3] + ( LGT_POS - -3.02915771770445 )**2;
   CLUSvads [4] + ( LGT_POS - -0.9841959454775 )**2;
   CLUSvads [5] + ( LGT_POS - -2.21538937073223 )**2;
end;
else _vqmvar + 1.30942457262733;
if not missing( LGT_TBM ) then do;
   CLUSvads [1] + ( LGT_TBM - 2.62509260779666 )**2;
   CLUSvads [2] + ( LGT_TBM - 1.40885156098964 )**2;
   CLUSvads [3] + ( LGT_TBM - -0.15878507901546 )**2;
   CLUSvads [4] + ( LGT_TBM - -0.11252803970828 )**2;
   CLUSvads [5] + ( LGT_TBM - 0.22075831354075 )**2;
end;
else _vqmvar + 1.17502484629096;
_vqnvar = 5.00248575662084 - _vqmvar;
if _vqnvar <= 2.2748671456706E-12 then do;
   _SEGMENT_ = .;
   Distance = .;
end;
else do;
   _SEGMENT_ = 1;
   Distance = CLUSvads [1];
   _vqfzdst = Distance * 0.99999999999988;
   drop _vqfzdst;
   do _vqclus = 2 to 5;
      if CLUSvads [_vqclus] < _vqfzdst then do;
         _SEGMENT_ = _vqclus;
         Distance = CLUSvads [_vqclus];
         _vqfzdst = Distance * 0.99999999999988;
      end;
   end;
   Distance = sqrt(Distance * (5.00248575662084 / _vqnvar));
end;
CLUSvlex :;
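One way to apply this code outside SAS Enterprise Miner is to include it in a DATA step that reads raw transaction counts. The sketch below assumes hypothetical data set and file names: work.profile_new holds new customer counts, and the scoring code above has been saved as cluster_score.sas.

data work.profile_scored;
   set work.profile_new;          /* table containing CNT_TBM, CNT_ATM, CNT_POS, CNT_CSC, CNT_TOT */
   %include 'cluster_score.sas';  /* the Score node code shown above                              */
run;

Each scored customer receives a _SEGMENT_ assignment (1 through 5) and a Distance to the assigned cluster seed.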


A.2 Web Site Usage Associations Case Study

Case Study Description

A radio station developed a Web site to broaden its audience appeal and offerings. In addition to a simulcast of the station's primary broadcast, the Web site was designed to provide services to Web users, such as podcasts, news streams, music streams, archives, and live Web music performances. The station tracked usage of these services by URL.

Analysts at the station wanted to see whether any unusual patterns existed in the combinations of services selected by its Web users. The WEBSTATION data set contains services selected by more than 1.5 million unique Web users over a two-month period in 2006. For privacy reasons, the URLs are assigned anonymous ID numbers.

The diagram containing this analysis is stored as an XML file on the course data disk. You can open this file by right-clicking Diagrams ⇨ Import Diagram from XML... in SAS Enterprise Miner. All nodes in the opened file, except the data node, contain the property settings outlined in this case study. If you want to run the diagram, you need to re-create the case study data set using the metadata settings indicated below.


Case Study Data

Name     Model Role   Measurement Level   Description
ID       ID           Nominal             URL (with anonymous ID numbers)
TARGET   Target       Nominal             Web service selected

The WEBSTATION data set should be assigned the role of Transaction. This role can be assigned either in the process of creating the data source or by changing the properties of the data source inside SAS Enterprise Miner.

Accessing and Assaying the Data

A SAS Enterprise Miner data source was defined for the WEBSTATION data set using the metadata settings indicated above. The StatExplore node was used to provide preliminary statistics on the target variable. The Variable Levels Summary showed that there are 1,586,124 unique IDs in the data set and 8 distinct services.

Variable Levels Summary

Variable   Role     Number of Levels
ID         ID       1586124
TARGET     TARGET   8

To match the results shown above, the Number of Observations should be set to ALL in the properties of the StatExplore node.



A plot of the target distribution (produced from the Explore window) identified the eight levels (LIVE STREAM, PODCAST, SIMULCAST, ARCHIVE, WEBSITE, EXTREF, MUSICSTREAM, and NEWS) and displayed their relative frequencies in a random sample of 6,000 cases.

Generating Associations

An Association node was connected to the WEBSTATION node.

A preliminary run of the Association node yielded very few association rules. It was discovered that the default minimum Support Percentage setting was too large. (Many of the URLs selected only one service, diminishing the support of all association rules.) To obtain more association rules, the minimum Support Percentage was changed to 1.0. In addition, the number of items to process was increased to 3,000,000 to account for the large training data set.
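For scale, with 1,586,124 unique IDs defining the transactions, a 1.0% minimum support means that an item set must be selected by roughly 15,861 users before any rule built on it is reported.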



The modified Association node property settings: Maximum Items = 3000000 and Support Percentage = 1.0.

Using these changes, the analysis was rerun and yielded substantially more association rules. The Rules Table was used to scrutinize the results. The top rules by lift were the following:

Rule                             Exp. Conf.(%)   Confidence(%)   Support(%)   Lift    Count
WEBSITE & EXTREF ==> ARCHIVE     7.32            98.32           1.69         13.42   26744
ARCHIVE ==> WEBSITE & EXTREF     1.71            23.02           1.69         13.42   26744
EXTREF ==> ARCHIVE               7.32            98.07           1.92         13.39   30419
ARCHIVE ==> EXTREF               1.96            26.19           1.92         13.39   30419
EXTREF ==> WEBSITE & ARCHIVE     7.05            86.22           1.69         12.22   26744
WEBSITE & ARCHIVE ==> EXTREF     1.95            23.90           1.69         12.22   26744

Additional rules, involving combinations of SIMULCAST, PODCAST, MUSICSTREAM, NEWS, and WEBSITE, followed with lifts ranging from roughly 9 down to 3.
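Recall that lift is the ratio of confidence to expected confidence. For the first rule above, for example, 98.32 / 7.32 ≈ 13.4, which is the value reported in the Lift column.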

The following were among the interesting findings from this analysis:
• Most external referrers to the Web site pointed to the programming archive (98% confidence).
• Selecting the simulcast service tripled the chances of selecting the news service.
• Users who streamed music, downloaded podcasts, used the news service, or listened to the simulcast were less likely to go to the Web site.



A.3 Credit Risk Case Study

A bank sought to use performance on an in-house subprime credit product to create an updated risk model. The risk model was to be combined with other factors to make future credit decisions.

A sample of applicants for the original credit product was selected. Credit bureau data describing these individuals (at the time of application) was recorded. The ultimate disposition of the loan was determined (paid off or bad debt). For loans rejected at the time of application, a disposition was inferred from credit bureau records on loans obtained in a similar time frame.

The credit scoring models pursued in this case study were required to conform to the standard industry practice of transparency and interpretability. This eliminated certain modeling tools from consideration (for example, neural networks) except for comparison purposes. If a neural network significantly outperformed a regression, for example, it could be interpreted as a sign of lack of fit for the regression. Measures could then be taken to improve the regression model.

The diagram containing this analysis is stored as an XML file on the course data disk. You can open this file by right-clicking Diagrams ⇨ Import Diagram from XML... in SAS Enterprise Miner. All nodes in the opened file, except the data node, contain the property settings outlined in this case study. If you want to run the diagram, you need to re-create the case study data set using the metadata settings indicated below.



Case Study Training Data

Name              Role     Level      Label
TARGET            Target   Binary
BanruptcyInd      Input    Binary     Bankruptcy Indicator
TLBadDerogCnt     Input    Interval   Number Bad Dept plus Public Derogatories
CollectCnt        Input    Interval   Number Collections
InqFinanceCnt24   Input    Interval   Number Finance Inquires 24 Months
InqCnt06          Input    Interval   Number Inquiries 6 Months
DerogCnt          Input    Interval   Number Public Derogatories
TLDel3060Cnt24    Input    Interval   Number Trade Lines 30 or 60 Days 24 Months
TL50UtilCnt       Input    Interval   Number Trade Lines 50 pct Utilized
TLDel60Cnt24      Input    Interval   Number Trade Lines 60 Days or Worse 24 Months
TLDel60CntAll     Input    Interval   Number Trade Lines 60 Days or Worse Ever
TL75UtilCnt       Input    Interval   Number Trade Lines 75 pct Utilized
TLDel90Cnt24      Input    Interval   Number Trade Lines 90+ 24 Months
TLBadCnt24        Input    Interval   Number Trade Lines Bad Debt 24 Months
TLDel60Cnt        Input    Interval   Number Trade Lines Currently 60 Days or Worse
TLSatCnt          Input    Interval   Number Trade Lines Currently Satisfactory
TLCnt12           Input    Interval   Number Trade Lines Opened 12 Months
TLCnt24           Input    Interval   Number Trade Lines Opened 24 Months
TLCnt03           Input    Interval   Number Trade Lines Opened 3 Months
TLSatPct          Input    Interval   Percent Satisfactory to Total Trade Lines
TLBalHCPct        Input    Interval   Percent Trade Line Balance to High Credit
TLOpenPct         Input    Interval   Percent Trade Lines Open
TLOpen24Pct       Input    Interval   Percent Trade Lines Open 24 Months
TLTimeFirst       Input    Interval   Time Since First Trade Line
InqTimeLast       Input    Interval   Time Since Last Inquiry
TLTimeLast        Input    Interval   Time Since Last Trade Line
TLSum             Input    Interval   Total Balance All Trade Lines
TLMaxSum          Input    Interval   Total High Credit All Trade Lines
TLCnt             Input    Interval   Total Open Trade Lines
ID                ID       Nominal



Accessing and Assaying the Data

A SAS Enterprise Miner data source was defined for the CREDIT data set using the metadata settings indicated above. The data source definition was expedited by customizing the Advanced Metadata Advisor in the Data Source Wizard, in particular the Class Levels Count Threshold property. (If Detect Class Levels is Yes, interval variables with fewer distinct values than this threshold are marked as NOMINAL; the default value is 20.)

With this change, all metadata was set correctly by default.



Decision processing was selected in Step 7 of the Data Source Wizard. On the Decisions tab, Yes was selected in answer to "Do you want to use the decisions?" for the two decisions, DECISION1 and DECISION2.

The Decisions option Default with Inverse Prior Weights was selected to provide the values in the Decision Weights tab.

On the Decision Weights tab, Maximize was selected as the decision function. The weight matrix paired target level 1 with DECISION1 (weight 5.9938002...) and target level 0 with DECISION2 (weight 1.2000480...), with weights of 0.0 in the remaining cells.
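These weights are the inverses of the target priors: with about 16.67% primary and 83.33% secondary outcomes in the CREDIT data (see the class variable summary below), 1/0.1667 ≈ 6.00 and 1/0.8333 ≈ 1.20, which match the displayed values up to rounding of the priors.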



It can be shown that, theoretically, the so-called central decision rule optimizes model performance based on the KS statistic.

The StatExplore node was used to provide preliminary statistics on the target variable.

BanruptcyInd and TARGET were the only two class variables in the CREDIT data set.

Class Variable Summary Statistics

Variable       Role     Number of Levels   Missing   Mode   Mode Percentage   Mode2   Mode2 Percentage
BanruptcyInd   INPUT    2                  0         0      84.67             1       15.33
TARGET         TARGET   2                  0         0      83.33             1       16.67

The Interval Variable Summary shows missing values on 11 of the 27 interval inputs.

Interval Variable Summary Statistics

Variable          ROLE    Mean       Std. Deviation   Non Missing   Missing   Minimum   Median     Maximum
CollectCnt        INPUT   0.86       2.16             3000          0         0         0.00       50.00
DerogCnt          INPUT   1.43       2.73             3000          0         0         0.00       51.00
InqCnt06          INPUT   3.11       3.48             3000          0         0         2.00       40.00
InqFinanceCnt24   INPUT   3.56       4.48             3000          0         0         2.00       48.00
InqTimeLast       INPUT   3.11       4.64             2812          188       0         1.00       24.00
TL50UtilCnt       INPUT   4.08       3.11             2901          99        0         3.00       23.00
TL75UtilCnt       INPUT   3.12       2.61             2901          99        0         3.00       20.00
TLBadCnt24        INPUT   0.57       1.32             3000          0         0         0.00       16.00
TLBadDerogCnt     INPUT   1.41       2.46             3000          0         0         0.00       47.00
TLBalHCPct        INPUT   0.65       0.27             2959          41        0         0.70       3.36
TLCnt             INPUT   7.88       5.42             2997          3         0         7.00       40.00
TLCnt03           INPUT   0.28       0.58             3000          0         0         0.00       7.00
TLCnt12           INPUT   1.82       1.93             3000          0         0         1.00       15.00
TLCnt24           INPUT   3.88       3.40             3000          0         0         3.00       28.00
TLDel3060Cnt24    INPUT   0.73       1.16             3000          0         0         0.00       8.00
TLDel60Cnt        INPUT   1.52       2.81             3000          0         0         0.00       38.00
TLDel60Cnt24      INPUT   1.07       1.81             3000          0         0         0.00       20.00
TLDel60CntAll     INPUT   2.52       3.41             3000          0         0         1.00       45.00
TLDel90Cnt24      INPUT   0.81       1.61             3000          0         0         0.00       19.00
TLMaxSum          INPUT   31205.90   29092.91         2960          40        0         24187.00   271036.00
TLOpen24Pct       INPUT   0.56       0.48             2997          3         0         0.50       6.00
TLOpenPct         INPUT   0.50       0.21             2997          3         0         0.50       1.00
TLSatCnt          INPUT   13.51      8.93             2996          4         0         12.00      57.00
TLSatPct          INPUT   0.52       0.23             2996          4         0         0.53       1.00
TLSum             INPUT   20151.10   19682.09         2960          40        0         15546.00   210612.00
TLTimeFirst       INPUT   170.11     92.81            3000          0         6         151.00     933.00
TLTimeLast        INPUT   11.87      16.32            3000          0         0         7.00       342.00



By creating plots using the Explore window, it was found that several of the interval inputs show somewhat skewed distributions. Transformation of the more severe cases was pursued in regression modeling.

The Explore window displayed a grid of histograms, one for each interval input, confirming the skewness noted above.


Creating Prediction Models: Simple Stepwise Regression

Because it was the most likely model to be selected for deployment, a regression model was considered first.

• In the Data Partition node, 50% of the data was chosen for training and 50% for validation.
• The Impute node replaced missing values for the interval inputs with the input mean (the default for interval-valued input variables) and added unique imputation indicators for each input with missing values.
• The Regression node used the Stepwise method for input variable selection and Validation Profit for complexity optimization.

The selected model included seven inputs.

Analysis of Maximum Likelihood Estimates

Parameter         DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq   Standardized Estimate   Exp(Est)
Intercept         1    -2.7602    0.4089           45.57             <.0001                               0.063
IMP_TLBalHCPct    1    1.8759     0.3295           32.42             <.0001       0.2772                  6.527
IMP_TLSatPct      1    -2.6095    0.4515           33.40             <.0001       -0.3363                 0.074
InqFinanceCnt24   1    0.0610     0.0149           16.86             <.0001       0.1527                  1.063
TLDel3060Cnt24    1    0.3359     0.0623           29.11             <.0001       0.2108                  1.399
TLDel60Cnt24      1    0.1126     0.0408           7.62              0.0058       0.1102                  1.119
TLOpenPct         1    1.5684     0.4633           11.46             0.0007       0.1792                  4.799
TLTimeFirst       1    -0.00253   0.000923         7.50              0.0062       -0.1309                 0.997
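Read as a model, the parameter estimates combine into the usual logistic form (a sketch using the rounded estimates above, where p is the predicted probability of the primary outcome):

logit(p) = -2.7602
           + 1.8759*IMP_TLBalHCPct  - 2.6095*IMP_TLSatPct
           + 0.0610*InqFinanceCnt24 + 0.3359*TLDel3060Cnt24
           + 0.1126*TLDel60Cnt24    + 1.5684*TLOpenPct
           - 0.00253*TLTimeFirst

The Exp(Est) column exponentiates each estimate; for example, exp(0.0610) ≈ 1.063, which reappears as the odds ratio for InqFinanceCnt24 below.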

The odds ratio estimates facilitated model interpretation. Increasing risk was associated with increasing values of IMP_TLBalHCPct, InqFinanceCnt24, TLDel3060Cnt24, TLDel60Cnt24, and TLOpenPct. Increasing risk was associated with decreasing values of IMP_TLSatPct and TLTimeFirst.

Odds Ratio Estimates

Effect            Point Estimate
IMP_TLBalHCPct    6.527
IMP_TLSatPct      0.074
InqFinanceCnt24   1.063
TLDel3060Cnt24    1.399
TLDel60Cnt24      1.119
TLOpenPct         4.799
TLTimeFirst       0.997



The iteration plot (found by selecting View ⇨ Model ⇨ Iteration Plot in the Results window) can be set to show average profit versus model selection step number for the training and validation data.

In theory, the average profit for a model using the defined profit matrix equals 1 + the KS statistic. Thus, the iteration plot (from the Regression node's Results window) showed how the profit (or, in turn, the KS statistic) varied with model complexity. From the plot, the maximum validation profit equaled 1.43, which implies that the maximum KS statistic equaled 0.43. The actual calculated value of KS (as found using the Model Comparison node) was found to differ slightly from this value (see below).
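A sketch of why this relationship holds with inverse prior weights: writing π1 and π0 for the prior proportions of primary and secondary outcomes, the average profit per case is (1/π1)·P(decide primary and target = 1) + (1/π0)·P(decide secondary and target = 0) = sensitivity + specificity = 1 + (sensitivity − (1 − specificity)). Maximizing over decision cutoffs gives 1 + KS, because the KS statistic is the maximum separation between the true positive and false positive rates.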


Creating Prediction Models: Neural Network

While it was not possible to deploy a neural network as the final prediction model, a neural network was used to investigate regression lack of fit.

The default settings of the Neural Network node were used in combination with inputs selected by the Stepwise Regression node.



The iteration plot showed slightly higher validation average profit compared to the stepwise regression model.


It was possible (although not likely) that transformations to the regression inputs could improve regression prediction.

Creating Prediction Models: Transformed Stepwise Regression

In assaying the data, it was noted that some of the inputs had rather skewed distributions. Such distributions create high leverage points that can distort an input's association with the target. The Transform Variables node was used to regularize the distributions of the model inputs before fitting the stepwise regression.



The Transform Variables node was set to maximize the normality of each interval input by selecting from one of several power and logarithmic transformations (the Interval Inputs method was set to Maximum Normal).

The result of this setting was to apply a log transformation to each interval input. The Transformed Stepwise Regression node performed stepwise selection from the transformed inputs. The selected model had many of the same inputs as the original stepwise regression model, but on a transformed (and difficult to interpret) scale.

Odds Ratio Estimates

Effect                Point Estimate
LOG_IMP_TL75UtilCnt   9.243
LOG_IMP_TLBalHCPct    515.899
LOG_IMP_TLSatPct      0.015
LOG_InqFinanceCnt24   15.799
LOG_TLDel3060Cnt24    17.508
LOG_TLDel60Cnt24      7.914
LOG_TLOpenPct         6.616
LOG_TLTimeFirst       0.045



The transformations would be justified (despite the increased difficulty in model interpretation) if they resulted in significant improvement in model fit. Based on the profit calculation, the transformed stepwise regression model showed only marginal performance improvement compared to the original stepwise regression model. The iteration plot shows training and validation average profit for TARGET by model selection step number.

Creating Prediction Models: Discretized Stepwise Regression

Partitioning input variables into discrete ranges was another common risk-modeling method that was investigated. Three additional branches (Bucket, Quantile, and Optimal Discretized stepwise regressions) were added to the diagram.


A-30

Appendix A Case Studies

Three discretization approaches were investigated. The Bucket Input Variables node partitioned each interval input into four bins with equal widths. The Bin Input Variables node partitioned each interval input into four bins with equal sizes. The Optimal Discrete Input Variables node found optimal partitions for each input variable using decision tree methods.

Bucket Transformation

The relatively small size of the CREDIT data set resulted in problems for the bucket stepwise regression model. Many of the bins had a small number of observations, which resulted in quasi-complete separation problems for the regression model, as dramatically illustrated by the selected model's odds ratio report.

Odds Ratio Estimates

Effect BIN_IMP_TL75UtilCnt BIN_IMP_TL75UtilCnt BIN_IMP_TL75UtilCnt BIN_IMP_TLBalHCPct BIN_IMP_TLBalHCPct BIN_IMP_TLBalHCPct BIN_IMP_TLSatPct BIN_IMP_TLSatPct BIN_IMP_TLSatPct BIN_InqFinanceCnt24 BIN_InqFinanceCnt24 BIN_InqFinanceCnt24 BIN_TLDel3060Cnt24 BIN_TLDel3060Cnt24 BIN_TLDel60CntAll BIN_TLDel60CntAll BIN_TLDel60CntAll BIN_TLTimeFirst BINJLTimeFirst BIN TLTimeFirst

01: low -5 vs 04:15-high 02: 5-10 vs 04:15-high 03: 10-15 vs 04:15-high 01: low -0.840325 vs 04:2.520975-high 02: 0.840325-1.68065 VS 04:2.520975-high 03: 1.68065-2.520975 VS 04:2.520975-high 01: low -0.25 vs 04:0.75-high 02: 0.25-0.5 vs 04:0.75-high 03: 0.5-0.75 vs 04:0.75-high 01; low -9.75 vs 04:29.25-high 02: 9.75-19.5 vs 04:29.25-high 03 19.5-29.25 vs 04:29.25-high 01 low -2 vs 04:6-high 02 2-4 vs 04:6-high 01 low -4.75 vs 04:14.25-high 02 4.75-9.5 vs 04:14.25-high 03 9.5-14.25 vs 04:14.25-high 01 low -198.75 vs 04:584.25-high 02 198.75-391.5 VS 04:584.25-high 03 :391.5-584.25 vs 04:584.25-high

Point Estimate 999.000 999.000 999.000 <0.001 <0.001 <0.001 4.845 1.819 1.009 0.173 0.381 0.640 999.000 999.000 0.171 0.138 0.166 999.000 999.000 999.000
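Quasi-complete separation occurs when some bin of an input contains (almost) exclusively one target outcome; the corresponding parameter estimate then drifts toward plus or minus infinity during fitting, which is why the report shows extreme point estimates such as 999.000 and <0.001 rather than usable odds ratios.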



The iteration plot showed substantially worse performance compared to the other modeling efforts.


Bin Transformation

Somewhat better results were seen with the binned stepwise regression model. By ensuring that each bin included a reasonable number of cases, more stable model parameter estimates could be made.

Odds Ratio Estimates

Effect PCTL_IMP_TLBalHCPct PCTL_IMP_TLBalHCPct PCTL_IMP_TLBalHCPct PCTL_IMP_TLSatPct PCTL_IMP_TLSatPct PCTL_IMP_TLSatPct PCTL_InqFinanceCnt24 PCTL_InqFinanceCnt24 PCTL_InqFinanceCnt24 PCTL_TLDel3060Cnt24 PCTL_TLDel60Cnt24 PCTL_TLTimeFirst PCTL_TLTimeFirst PCTLJTLTimeFirst

01 02 03 01 02 03 01 02 03 02 02 01 02 03

low -0.513 vs 04:0.8389-high 0.513-0.7041 VS 04:0.8389-high 0.7041-0.8389 vs 04:0.8389-high low -0.3529 vs 04:0.6886-high 0.3529-0.5333 VS 04:0.6886-high 0.5333-0.6886 VS 04:0.6886-high low -1 vs 04:5-high 1-2 vs 04:5-high 2-5 vs 04:5-high 0-1 vs 03:1-high 0-1 vs 03:1-high low -107 vs 04:230-high 107-152 vs 04:230-high 152-230 vs 04:230-high

Point Estimate 0.272 0.452 0.630 1 .860 1 .130 1 .040 0.599 0.404 0.807 0.453 0.357 1 .688 1 .477 0.837



The improved model fit was also seen in the iteration plot, although the average profit of the selected model was still not as large as that of the original stepwise regression model.

Optimal Transformation

A final attempt at discretization was made using the optimistically named Optimal Discrete transformation. The final 18-degree-of-freedom model included 10 separate inputs (more than any other model).

Odds Ratio Estimates

Effect Banruptcylnd 0PT_IMP_TL75UtilCnt 0PT_IMP_TL75UtilCnt 0PT_IMP_TLBalHCPct 0PT_IMP_TLBalHCPct 0PT_IMP_TLBalHCPct 0PT_IMP_TLSatPct 0PT_IMP_TLSatPct 0PT_InqFinanceCnt24 0PT_InqFinanceCnt24 0PT_TLDel3060Cnt24 0PT_TLDel60Cnt OPT_TLDel60Cnt 0PT_TLDel60Cnt24 OPT_TLDel60Cnt24 OPT_TLTimeFirst TLOpenPct

0 vs 1 01: low -1.5 vs 03:8.5-high 02 1.5-8.5, MISSING vs 03:8.5-high 01 low -0.6706, MISSING vs 04:1.0213-high 02 0.6706-0.86785 VS 04:1.0213-high 03 0.86785-1.0213 vs 04:1.0213-high 01 low -0.2094 vs 03:0.4655-high, 02 0.2094-0.4655 VS 03:0.4655-high, 01 low -2.5, MISSIN vs 03:7.5-high 02 2.5-7.5 vs 03:7.5-high 01 low -1.5, MISSIN vs 02:1.5-high 01 low -0.5, MISSIN vs 03:14.5-high 02 0.5-14.5 vs 03:14.5-high 01 low -0.5, MISSIN vs 03:5.5-high 02 0.5-5.5 vs 03:5.5-high 01 :low -154.5, MISSING vs 02:154.5-high

Point Estimate 2.267 0.270 0.409 0.090 0.155 0.250 5.067 1.970 0.353 0.657 0.499 0.084 0.074 0.327 0.882 1 .926 3.337



The validation average profit was still slightly smaller than that of the original model. A substantial difference in profit between the training and validation data was also observed. Such a difference was suggestive of overfitting by the model.



Assessing the Prediction Models

The collection of models was assessed using the Model Comparison node.

The ROC chart (Data Role = VALIDATE), which compares the Baseline, Neural Network, Regression, Transformed Stepwise Regression, Bucket Stepwise Regression, Quantile Stepwise Regression, and Optimal Discretized Stepwise Regression models, shows a jumble of models with no clear winner.



The Fit Statistics table from the Output window is shown below (Data Role = Valid).

Statistics                                              Reg       Neural    Reg5      Reg2      Reg4      Reg3
Valid Kolmogorov-Smirnov Statistic                      0.43      0.46      0.42      0.44      0.45      0.39
Valid Average Profit for TARGET                         1.43      1.42      1.42      1.42      1.41      1.38
Valid Average Squared Error                             0.12      0.12      0.12      0.12      0.12      0.13
Valid Roc Index                                         0.77      0.77      0.76      0.78      0.77      0.73
Valid Average Error Function                            0.38      0.39      0.40      0.38      0.39      0.43
Valid Percent Capture Response                          14.40     12.00     11.60     14.40     12.64     9.60
Valid Divisor for VASE                                  3000.00   3000.00   3000.00   3000.00   3000.00   3000.00
Valid Error Function                                    1152.26   1168.64   1186.46   1131.42   1158.23   1282.59
Valid Gain                                              180.00    152.00    148.00    192.00    144.89    124.00
Valid Gini Coefficient                                  0.54      0.54      0.53      0.56      0.54      0.47
Valid Bin-Based Two-Way Kolmogorov-Smirnov Statistic    0.43      0.44      0.41      0.44      0.45      0.39
Valid Lift                                              2.88      2.40      2.32      2.88      2.53      1.92
Valid Maximum Absolute Error                            0.97      0.99      1.00      0.98      0.99      1.00
Valid Misclassification Rate                            0.17      0.17      0.17      0.17      0.17      0.17
Valid Mean Square Error                                 0.12      0.12      0.12      0.12      0.12      0.13
Valid Sum of Frequencies                                1500.00   1500.00   1500.00   1500.00   1500.00   1500.00
Valid Total Profit for TARGET                           2143.03   2131.02   2127.45   2127.44   2121.42   2072.25
Valid Root Average Squared Error                        0.35      0.35      0.35      0.34      0.35      0.36
Valid Percent Response                                  48.00     40.00     38.67     48.00     42.13     32.00
Valid Root Mean Square Error                            0.35      0.35      0.35      0.34      0.35      0.36
Valid Sum of Square Errors                              359.70    367.22    371.58    352.69    366.76    381.44
Valid Sum of Case Weights Times Freq                    3000.00   3000.00   3000.00   3000.00   3000.00   3000.00

The best model, as measured by average profit, was the original regression. The neural network had the highest KS statistic. The log-transformed regression, Reg2, had the highest ROC index. If the purpose of a credit risk model is to order the cases, then Reg2, the transformed regression, had the highest rank decision statistic, the ROC index. In short, the best model for deployment was as much a matter of taste as of statistical performance. The relatively small validation data set used to compare the models did not produce a clear winner. In the end, the model selected for deployment was the original stepwise regression, because it offered consistently good performance across multiple assessment measures.



A.4 Enrollment Management Case Study

Case Study Description

In the fall of 2004, the administration of a large private university requested that the Office of Enrollment Management and the Office of Institutional Research work together to identify prospective students who would most likely enroll as new freshmen in the Fall 2005 semester. The administration stated several goals for this project:
• increase new freshman enrollment
• increase diversity
• increase SAT scores of entering students

Historically, inquiries numbered about 90,000+ students, and the university enrolled from 2,400 to 2,800 new freshmen each Fall semester.

The diagram containing this analysis is stored as an XML file on the course data disk. You can open this file by right-clicking Diagrams ⇨ Import Diagram from XML... in SAS Enterprise Miner. All nodes in the opened file, except the data node, contain the property settings outlined in this case study. If you want to run the diagram, you need to re-create the case study data set using the metadata settings indicated below.



Case Study Training Data

Name                  Model Role   Measurement Level   Description
ACADEMIC_INTEREST_1   Rejected     Nominal             Primary academic interest code
ACADEMIC_INTEREST_2   Rejected     Nominal             Secondary academic interest code
CAMPUS_VISIT          Input        Nominal             Campus visit code
CONTACT_CODE1         Rejected     Nominal             First contact code
CONTACT_DATE1         Rejected     Nominal             First contact date
ETHNICITY             Rejected     Nominal             Ethnicity
ENROLL                Target       Binary              1=Enrolled F2004, 0=Not enrolled F2004
IRSCHOOL              Rejected     Nominal             High school code
INSTATE               Input        Binary              1=In state, 0=Out of state
LEVEL_YEAR            Rejected     Unary               Student academic level
REFERRAL_CNTCTS       Input        Ordinal             Referral contact count
SELF_INIT_CNTCTS      Input        Interval            Self-initiated contact count
SOLICITED_CNTCTS      Input        Ordinal             Solicited contact count
TERRITORY             Input        Nominal             Recruitment area
TOTAL_CONTACTS        Input        Interval            Total contact count
TRAVEL_INIT_CNTCTS    Input        Ordinal             Travel-initiated contact count
AVG_INCOME            Input        Interval            Commercial HH income estimate
DISTANCE              Input        Interval            Distance from university
HSCRAT                Input        Interval            5-year high school enrollment rate
INIT_SPAN             Input        Interval            Time from first contact to enrollment date
INT1RAT               Input        Interval            5-year primary interest code rate
INT2RAT               Input        Interval            5-year secondary interest code rate
INTEREST              Input        Ordinal             Number of indicated extracurricular interests
MAILQ                 Input        Ordinal             Mail qualifying score (1=very interested)
PREMIERE              Input        Binary              1=Attended campus recruitment event, 0=Did not
SATSCORE              Rejected     Interval            SAT (original) score
SEX                   Rejected     Binary              Sex
STUEMAIL              Input        Binary              1=Have e-mail address, 0=Do not
TELECQ                Rejected     Ordinal             Telecounseling qualifying score (1=very interested)



The Office of Institutional Research assumed the task of building a predictive model, and the Office of Enrollment Management served as consultant to the project. The Office of Institutional Research built and maintained a data warehouse that contained information about enrollment for the past six years. It was decided that inquiries for Fall 2004 would be used to build the model to help shape the Fall 2005 freshman class.

The data set Inq2005 was built over a period of several months in consultation with Enrollment Management. The data set included variables that could be classified as demographic, financial, number of correspondences, student interests, and campus visits. Many variables were created using historical data and trends. For example, high school code was replaced by the percentage of inquirers from that high school over the past five years who enrolled. The resulting data set included over 90,000 observations and over 50 variables. For this case study, the number of variables was reduced. The data set Inq2005 is in the AAEM library, and the variables are described in the table above.

Some of the variables were automatically rejected based on the number of missing values. The nominal variables ACADEMIC_INTEREST_1, ACADEMIC_INTEREST_2, and IRSCHOOL were rejected because they were replaced by the interval variables INT1RAT, INT2RAT, and HSCRAT, respectively. For example, academic interest codes 1 and 2 were replaced by the percentage of inquirers over the past five years who indicated those interest codes and then enrolled. The variable IRSCHOOL is the high school code of the student, and it was replaced by the percentage of inquirers from that high school over the last five years who enrolled. The variables ETHNICITY and SEX were rejected because they cannot be used in admission decisions. Several variables count the various types of contacts the university has with the students.

Accessing and Assaying the Data

A SAS Enterprise Miner data source was defined using the metadata settings indicated above. The StatExplore node was used to provide preliminary statistics on the input variables.

The following is extracted from the StatExplore node's Results window:

Class Variable Summary Statistics

Variable             Role     Number of Levels   Missing   Mode   Mode Percentage   Mode2 Percentage
CAMPUS_VISIT         INPUT    3                  0         0      96.61             3.31
INSTATE              INPUT    2                  0         Y      62.04             37.96
REFERRAL_CNTCTS      INPUT    6                  0         0      96.46             3.21
SOLICITED_CNTCTS     INPUT    8                  0         0      52.45             41.60
TERRITORY            INPUT    12                 0         2      15.98             15.34
TRAVEL_INIT_CNTCTS   INPUT    7                  0         0      67.00             29.90
INTEREST             INPUT    4                  0         0      95.01             4.62
MAILQ                INPUT    5                  0         5      69.33             12.80
PREMIERE             INPUT    2                  0         0      97.11             2.89
STUEMAIL             INPUT    2                  0         0      51.01             48.99
ENROLL               TARGET   2                  0         0      96.86             3.14



The class input variables are listed first. Notice that most of the count variables have a high percentage of 0s.

Distribution of Class Target and Segment Variables

Variable   Role     Formatted Value   Frequency Count   Percent
Enroll     TARGET   0                 88614             96.8650
Enroll     TARGET   1                 2868              3.1350

Next is the target distribution. Only 3.1% of the target values are 1s, making a 1 a rare event. Standard practice in this situation is to separately sample the 0s and 1s. The Sample tool, used below, enables you to create a stratified sample in SAS Enterprise Miner.

Interval Variable Summary Statistics

Variable           ROLE    Mean       Std. Deviation   Non Missing   Missing   Minimum   Median     Maximum
SELF_INIT_CNTCTS   INPUT   1.21       1.67             91482         0         0.00      1.00       56.00
TOTAL_CONTACTS     INPUT   2.17       1.85             91482         0         1.00      2.00       58.00
AVG_INCOME         INPUT   47315.33   20608.89         70553         20929     4940.00   42324.00   200001.00
DISTANCE           INPUT   380.43     397.98           72014         19468     0.42      183.55     4798.90
HSCRAT             INPUT   0.04       0.06             91482         0         0.00      0.03       1.00
INIT_SPAN          INPUT   19.69      8.72             91482         0         -216.00   19.00      228.00
INT1RAT            INPUT   0.04       0.02             91482         0         0.00      0.04       1.00
INT2RAT            INPUT   0.04       0.03             91482         0         0.00      0.06       1.00



Finally, interval variable summary statistics are presented. Notice that AVG_INCOME and DISTANCE have missing values. The Explore window was used to study the distribution of the interval variables.

The Explore window displayed histograms of SELF_INIT_CNTCTS, INIT_SPAN, TOTAL_CONTACTS, DISTANCE, INT2RAT, HSCRAT, INT1RAT, and AVG_INCOME.

The apparent skewness of all inputs suggests that some transformations might be needed for regression models.


A.4 Enrollment Management Case Study

A-41

Creating a Training Sample

Cases from each target level were separately sampled. All cases with the primary outcome were selected. For each primary outcome case, seven secondary outcome cases were selected. This created a training sample with a 12.5% overall enrollment rate. The Sample tool was used to create a training sample for subsequent modeling.
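As a check on the proportions: selecting seven secondary cases per primary case gives 7 × 2,868 = 20,076 secondary cases, so the 2,868 primary cases make up 2,868 / (2,868 + 20,076) = 12.5% of the 22,944-case sample, which matches the Sample node results shown below.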

To create the sample as described, the following modifications were made to the Sample node's properties panel: 1.

Type 1 0 0 as the Percentage value (in the Size property group).

2.

Select Criterion

3.

Type 12 . 5 as the Sample Proportion value (in the Level Based Options property group). Train Variables Output Type Sample Method Random Seed -Type -Observations •Percentage Alpha •PValue Cluster Method

Level Based (in the Stratified property group).

Data Default 12345

Percentage 1 100.0 0.01 0.01 Random

•Criterion Level Based -Ignore Small Sirata No -Minimum Strata Size 5 ULevel Based Options -Level Selection Event -Level Proportion 1 00.0 -Sample Proportion 1 2.5 EMSffiffillMI -Adjust Frequency No -Based on Count No -Exclude Missing LeviNo

,


A-42

Appendix A Case Studies

The Sample node Results window shows all primary outcome cases that are selected and a Sufficient number o f secondary outcome cases that are selected to achieve the 12.5% primary outcome proportion. Summary S t a t i s t i c s

for

Class

Targets

Data-DATA Numeric

Formatted

Frequency

Variable

Value

Value

Count

Percent

Enroll Enroll

0

0

1

1

88614 2868

96.8650 3.1350

Data=SAMPLE Numeric

Formatted

Frequency

Variable

Value

Value

Count

Enroll

0

0

20076

87.5

Enroll

1

1

2868

12.5

Percent

Configuring Decision Processing

The primary purpose of the predictions was decision optimization and, secondarily, ranking. An applicant was considered a good candidate if his or her probability of enrollment was higher than average. Because of the Sample node, decision information consistent with the above objectives could not be entered in the data source node. To incorporate decision information, the Decisions tool was incorporated in the analysis.

JstatExplore

-6

Sample

INQ200S

Decisions o

These steps were followed to configure the Decisions node: 1.

Select Decisions ^ Property

Value

General Node ID Imported Data Exported Data Notes

Dec

• •

Train Variables Decisions Apply Decisions

... ...

No


A.4 Enrollment Management Case Study

After the analysis path is updated, the Decision window opens. Wt'lftd

OK

2.

Select Build in the Decision Processing window.

3.

Select the Decisions tab.

4.

Select Default w i t h Inverse Prior Weights.

Targets

Prior Probabilities \ Decisions

Cancel

Decision Weights

Do you want to use the decisions? 速Yes

Default with [nverse Prior Weights

U No

Decision Name 1 DECISION1 1 DECISION2 [0

Label

I CostVariable None > ... < None >

lo.o

Constant

Add

0.0

Deleie

Delete All Reset Default

OK

Cancel

A-43


A-44

5.

Appendix A Case Studies

Select the Decision Weights tab.

Targets

Prior Probabilities ; Decisions

Decision Weights

Select a decision function: 째 Maximize

O Minimize

Enter weight values for the decisions. DECISION1 DECISION2 Level 0.0 ... 18.0 1 .,.(0.0 0 It .1428571... I

OK

Cancel

The nonzero values used in the decision matrix are the inverses of the prior probabilities (1/0.125 = 8.0 and 1/0.875 = 1.142857...). Such a decision matrix, sometimes referred to as the central decision rule, forces a primary decision when the estimated primary outcome probability for a case exceeds the primary outcome prior probability (0.125 in this case).

Creating Prediction Models (All Cases)

Two rounds of predictive modeling were performed. In the first round, all cases were considered for model building. From the Decisions node, partitioning, imputation, modeling, and assessment were performed. The completed analysis appears as shown.


A.4 Enrollment Management Case Study

A-45

• The Data Partition node used 60% of the data for training and 40% for validation.
• The Impute node used the Tree method for both class and interval variables. Unique missing indicator variables were also selected and used as inputs.
• The stepwise regression model was used as a variable selection method for the Neural Network and second Regression nodes.
• The Regression node labeled Instate Regression included the variables from the Stepwise Regression node and the variable INSTATE. It was felt that prospective students behave differently based on whether they are in state or out of state.

In this implementation of the case study, the Stepwise Regression node selected three inputs: the 5-year high school enrollment rate (HSCRAT), the self-initiated contact count, and the student e-mail indicator. The model output is shown below.

Analysis of Maximum Likelihood Estimates

Analysis of Maximum Likelihood Estimates

Parameter          DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq   Standardized Estimate   Exp(Est)
INTERCEPT           1   -12.1422          18.9832              0.41       0.5224                              0.000
SELF_INIT_CNTCTS    1     0.6895           0.0203           1156.19       <.0001                  0.8773      1.993
HSCRAT              1    16.4261           0.8108            410.46       <.0001                  0.7506    999.000
STUEMAIL 0          1    -7.7776          18.9824              0.17       0.6820                              0.000

Odds Ratio Estimates

Effect                 Point Estimate
SELF_INIT_CNTCTS                1.993
HSCRAT                        999.000
STUEMAIL 0 vs 1                <0.001
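The Exp(Est) column is simply the exponentiated parameter estimate, and for an interval input it matches the reported odds ratio. A quick, purely illustrative check of the arithmetic:

/* Sketch: odds ratios recovered from the parameter estimates */
data _null_;
   or_self_init = exp(0.6895);    /* about 1.993, matching the reported odds ratio */
   or_hscrat    = exp(16.4261);   /* roughly 1.4E7, far beyond the 999.000 shown   */
   put or_self_init= or_hscrat=;
run;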

The unusual odds ratio estimates for HSCRAT and STUEMAIL result from an extremely strong association between those inputs and the target. For example, certain high schools had all applicants or no applicants enroll. Likewise, very few students enrolled who did not provide an e-mail address. Adding the INSTATE input in the Instate Regression model changed the significance of the inputs selected by the stepwise regression model: the input STUEMAIL is no longer statistically significant after the INSTATE input is included.

Analysis of Maximum Likelihood Estimates

Parameter          DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq   Standardized Estimate   Exp(Est)
INTERCEPT           1   -12.0541          16.7449              0.52       0.4716                              0.000
INSTATE N           1    -0.4145           0.0577             51.67       <.0001                              0.661
SELF_INIT_CNTCTS    1     0.6889           0.0196           1233.22       <.0001                  0.8231      1.992
HSCRAT              1    16.2327           0.7553            461.95       <.0001                  0.7142    999.000
STUEMAIL 0          1    -7.3528          16.7443              0.19       0.6606                              0.001

Odds Ratio Estimates

Effect                 Point Estimate
INSTATE N vs Y                  0.437
SELF_INIT_CNTCTS                1.992
HSCRAT                        999.000
STUEMAIL 0 vs 1                <0.001



A slight increase in validation profit (the criterion used to tune the models) was found for the neural network model. The tree provides insight into the strength of model fit. The iteration plot of average profit for Enroll versus number of leaves (training and validation) shows the highest validation profit at 17 leaves. Most of the predictive performance, however, is provided by the initial splits.



A simpler tree was scrutinized to aid interpretation. The tree model was rerun with properties changed as follows to produce a tree with three leaves: Method=N, Number of Leaves=3.

The resulting tree splits first on SELF_INIT_CNTCTS at 3.5 and then, within the lower branch, on SELF_INIT_CNTCTS again at 2.5, giving three leaves. (The node statistics in the diagram show the training and validation percentages of Enroll=0 and Enroll=1, and the case counts, in each node.)

Students with three or fewer self-initiated contacts rarely enrolled (as seen in the left leaf of the first split). Enrollment was even rarer for students with two or fewer self-initiated contacts (as seen in the left leaf of the second split). Notice that the primary target percentage is rounded down. Also notice that most of the secondary target cases can be found in the lower left leaf. The decision tree results shown in the rest of this case study are generated by the original 17-leaf tree.
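The three-leaf tree amounts to two nested rules on SELF_INIT_CNTCTS. The sketch below writes those rules as explicit scoring logic; the data set name and the leaf probabilities are placeholders only, because the exact leaf statistics are not legible in the reproduced diagram.

/* Sketch: the three-leaf tree as if/then rules (placeholder probabilities) */
data tree_scored;
   set applicants;                        /* hypothetical input table                */
   if self_init_cntcts < 2.5 then
      p_enroll1 = 0.01;                   /* placeholder: enrollment is very rare    */
   else if self_init_cntcts < 3.5 then
      p_enroll1 = 0.15;                   /* placeholder: enrollment is still rare   */
   else
      p_enroll1 = 0.60;                   /* placeholder: enrollment is far likelier */
run;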



Assessing the Prediction Models

Model performance was compared in the Model Comparison node.

ROC Chart (Data Role = VALIDATE): Sensitivity versus 1 - Specificity for the Baseline, Stepwise Regression, Decision Tree, Neural Network, and Instate Regression models.

The validation ROC chart showed extremely good performance for all models. The neural network model seemed to have a slight edge over the other models. This was mirrored in the Fit Statistics table (abstracted below to show only the validation performance).



Data Role=Valid

Statistics                                                 Neural       Tree        Reg       Reg2
Valid: Kolmogorov-Smirnov Statistic                          0.89       0.88       0.87       0.86
Valid: Average Profit for Enroll                             1.88       1.88       1.87       1.86
Valid: Average Squared Error                                 0.04       0.04       0.04       0.04
Valid: Roc Index                                             0.98       0.96       0.98       0.98
Valid: Average Error Function                                0.11          .       0.14       0.14
Valid: Percent Capture Response                             30.94      30.95      29.72      29.55
Valid: Divisor for VASE                                  18356.00   18356.00   18356.00   18356.00
Valid: Error Function                                     2097.03          .    2521.91    2486.75
Valid: Gain                                                576.37     519.90     552.83     552.83
Valid: Gini Coefficient                                      0.96       0.93       0.95       0.95
Valid: Bin-Based Two-Way Kolmogorov-Smirnov Statistic        0.88       0.87       0.86       0.86
Valid: Lift                                                  6.19       6.19       5.94       5.91
Valid: Maximum Absolute Error                                1.00       1.00       1.00       1.00
Valid: Misclassification Rate                                0.05       0.05       0.06       0.06
Valid: Mean Square Error                                     0.04          .       0.04       0.04
Valid: Sum of Frequencies                                 9178.00    9178.00    9178.00    9178.00
Valid: Total Profit for Enroll                           17285.71   17256.00   17122.29   17099.43
Valid: Root Average Squared Error                            0.19       0.20       0.20       0.20
Valid: Percent Response                                     77.34      77.36      74.29      73.85
Valid: Root Mean Square Error                                0.19          .       0.20       0.20
Valid: Sum of Square Errors                                657.78     735.99     752.78     754.37
Valid: Sum of Case Weights Times Freq                    18356.00   18356.00   18356.00   18356.00
Valid: Number of Wrong Classifications                     463.00          .
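As a quick consistency check on the table, total profit divided by the number of validation cases reproduces the reported average profit, and the count of wrong classifications reproduces the misclassification rate. The neural network column is used below; the step is illustrative only.

/* Sketch: consistency checks on the neural network column */
data _null_;
   avg_profit = 17285.71 / 9178;   /* about 1.88, the reported average profit         */
   misc_rate  = 463 / 9178;        /* about 0.05, the reported misclassification rate */
   put avg_profit= misc_rate=;
run;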

It should be noted that a ROC index of 0.98 needed careful consideration because it suggested a near-perfect separation of the primary and secondary outcomes. The decision tree model provides some insight into this apparently outstanding model fit. Self-initiated contacts are critical to enrollment. Fewer than three self-initiated contacts almost guarantees non-enrollment.
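One way to see how much of this separation a single input provides is to fit a one-input logistic model and read the c statistic (the ROC index) from its association table. This is a side check that is not part of the case study flow, and the data set name below is hypothetical.

/* Sketch: how much separation does SELF_INIT_CNTCTS provide by itself? */
proc logistic data=valid_data descending;
   model Enroll = SELF_INIT_CNTCTS;
run;
/* The c statistic in the Association of Predicted Probabilities and
   Observed Responses table is the ROC index for this one-input model. */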

Creating Prediction Models (Instate-Only Cases)

A second round of analysis was performed on instate-only cases. The analysis sample was reduced using the Filter node, which was attached to the Decisions node as shown below.



The following configuration steps were applied:

1. Select Default Filtering Method ⇒ None for both the class and interval variables.
2. Select Class Variables ⇒ ... (the ellipsis button). After the path is updated, the Interactive Class Filter window opens.
3. Select Generate Summary, and select Yes to generate summary statistics.


4. Select Instate. The Interactive Class Filter window is updated to show the distribution of the INSTATE input.



5. Select the N bar and select Apply Filter.


6. Select OK to close the Interactive Class Filter window.
7. Run the Filter node and view the results.

Excluded Class Values (maximum 500 observations printed)

Variable   Role    Level   Train Count   Train Percent   Label   Filter Method   Keep Missing Values
Instate    INPUT   N       8200          35.7392                 MANUAL

All out-of-state cases were filtered from the analysis. After filtering, an analysis similar to the one above was conducted with stepwise regression, neural network, and decision tree models.
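Outside Enterprise Miner, the same restriction amounts to a simple subset of the data. The data set names below are illustrative.

/* Sketch: keep only instate inquiries (the Filter node excluded Instate = 'N') */
data inq_instate;
   set inq2005;
   where Instate ne 'N';    /* drops the out-of-state cases, about 36% of the training data */
run;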



The partial diagram (after the Filter node) is shown below:

As for the models in this subset analysis, the Instate Stepwise Regression model selects two of the same inputs found in the first round of modeling, SELF_INIT_CNTCTS and STUEMAIL.

Analysis of Maximum Likelihood Estimates

Parameter          DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq   Standardized Estimate   Exp(Est)
Intercept           1   -10.1372          20.8246              0.24       0.6264                              0.000
SELF_INIT_CNTCTS    1     0.7188           0.0210           1174.00       <.0001                  0.9297      2.052
STUEMAIL 0          1    -6.8602          20.6245              0.11       0.7418                              0.001
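Written out, the fitted instate model is an ordinary logistic equation. The sketch below assumes deviation coding for the class input STUEMAIL (design value +1 for level 0 and -1 for level 1), consistent with how the class-input odds ratios in the earlier output relate to their estimates; the data set and variable names are illustrative.

/* Sketch: the Instate Stepwise Regression model as an explicit scoring equation */
data instate_scored;
   set inq_instate;                          /* hypothetical instate-only table     */
   if stuemail = 0 then d_stuemail =  1;     /* assumed design value for level 0    */
   else                 d_stuemail = -1;     /* assumed design value for level 1    */
   eta = -10.1372
         + 0.7188 * self_init_cntcts
         - 6.8602 * d_stuemail;
   p_enroll1 = 1 / (1 + exp(-eta));          /* predicted probability of enrollment */
run;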

The Instate decision tree showed a structure similar to the decision tree model from the first round. The tree with the highest validation profit possessed 20 leaves. The best five-leaf tree, whose validation profit is 97% of that of the selected tree, is shown below.

The five-leaf tree again splits primarily on SELF_INIT_CNTCTS, first at 3.5 and then at lower contact counts within the low-contact branch. (The node statistics in the diagram show the training and validation percentages of Enroll=1 and Enroll=0, and the case counts, in each node.)

Again, much of the performance of the model is due to a low self-initiated contacts count.



Assessing Prediction Models (Instate-Only Cases)

As before, model performance was gauged in the Model Comparison node. The ROC chart (Data Role = VALIDATE, Sensitivity versus 1 - Specificity) showed no clearly superior model, although all models had rather exceptional performance.

The Fit Statistics table of the Output window showed a slight edge for the tree model in misclassification rate. The validation ROC index and validation average profit favored the Stepwise Regression and Neural Network models. Again, it should be noted that these were unusually high model performance statistics.



Data Role=Valid

Statistics                                                        Neural2      Reg3     Tree2
Valid: Kolmogorov-Smirnov Statistic                                  0.80      0.80      0.80
Valid: Average Profit                                                2.02      2.02      2.00
Valid: Average Squared Error                                         0.06      0.06      0.06
Valid: Roc Index                                                     0.96      0.96      0.92
Valid: Average Error Function                                        0.19      0.20      0.21
Valid: Bin-Based Two-Way Kolmogorov-Smirnov Probability Cutoff          .         .         .
Valid: Cumulative Percent Captured Response                         50.69     50.69     46.38
Valid: Percent Captured Response                                    23.41     23.41     23.19
Valid: Frequency of Classified Cases                              5899.00   5899.00   5899.00
Valid: Divisor for ASE                                           11798.00  11798.00  11798.00
Valid: Error Function                                             2194.05   2330.78   2462.62
Valid: Gain                                                        406.85    406.85    363.74
Valid: Gini Coefficient                                              0.91      0.91      0.84
Valid: Bin-Based Two-Way Kolmogorov-Smirnov Statistic                0.00      0.00      0.00
Valid: Kolmogorov-Smirnov Probability Cutoff                         0.14      0.14      0.03
Valid: Cumulative Lift                                               5.07      5.07      4.64
Valid: Lift                                                          4.68      4.68      4.64
Valid: Maximum Absolute Error                                        0.97      1.00      0.98
Valid: Misclassification Rate                                        0.09      0.09      0.08
Valid: Sum of Frequencies                                         5899.00   5899.00   5899.00
Valid: Total Profit                                              11901.71  11901.71  11824.00
Valid: Root Average Squared Error                                    0.24      0.25      0.25
Valid: Cumulative Percent Response                                  80.08     80.08     73.27
Valid: Percent Response                                             73.97     73.97     73.27
Valid: Sum of Squared Errors                                       705.43    725.26    716.66
Valid: Number of Wrong Classifications                             510.00    527.00    462.00

Deploying the Prediction Model

The Score node facilitated deployment of the prediction model, as shown in the diagram's final form.

The best (instate) model was selected by the Instate Model Comparison node and passed on to the Score node. Another INQ2005 data source was assigned the role of Score and attached to the Score node. Columns from the scored INQ2005 table were then passed into the Office of Enrollment Management's data management system by the final SAS Code node.
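In practice, the exported score code is applied to new data in a DATA step, and only the needed columns are passed along. The file path and the output column names below are hypothetical; they are not the names generated in this project.

/* Sketch: applying exported score code to a new inquiry table */
data scored_inq2005;
   set inq2005_new;                     /* hypothetical table of new inquiries          */
   %include 'c:\project\score.sas';     /* hypothetical path to the exported score code */
run;

data em_feed;                           /* columns passed to the enrollment system      */
   set scored_inq2005;
   keep applicant_id p_enroll1;         /* hypothetical identifier and score columns    */
run;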






Appendix C Index

A analytic workflow, 1-20 Assess tab, 1-17 assessing models, A-34-A-35, A-48, A-54 association rule, 8-63 Association tool, 1-10 AutoNeural node, 6-25 AutoNeural tool, 1-14, 5-39-5-45 average profit, 6-41 average squared error, 6-5 B

backward selection method, 4-31 bin transformations, A-31 bucket transformations, A-30 C categorical inputs consolidating, 9-22-9-30 regressions, 4-67-4-68 chi-square selection criterion, 9-10 cluster analysis, 8-8-8-60 cluster centers, 8-10 Cluster tool, 1-10, 8-37-8-43 clustering, 8-8-8-60 k-means algorithm, 8-9-8-13 seeds, 8-10 complete case analysis, 4-12 consolidating categorical inputs, 9-22-9-30 constructing decision trees, 3-43-3-62 Control Point tool, 1-18 creating SAS Enterprise Miner projects, 2-5-2-9 creating segments, A-10 Credit Exchange tool, 1-19 Credit Scoring tab, 1-19 D Data Partition tool, 1-8 data reduction, 8-6 data splitting, 3-19 decision predictions, 3-9 decision processing

configuring, A-42-A-44 Decision Tree node selecting inputs, 9-19-9-21 decision tree profit optimization, 6-46 Decision Tree tool, 1-14 decision trees, 3-30 constructing, 3-43-3-62 defaults, 3-106 multiway splits, 3-107-3-108 prediction rules, 3-30-3-31 split search, 3-32-3-41 stopping rules, 3-112-3-113 variables, 3-102 Decisions tool, 1-17 diagram workspace, 1-5 process flows, 1-6 discretized stepwise regression creating prediction models, A-29-A-33 Dmine regression, 5-48-5-49 Dmine Regression tool, 1-14 DMNeural tool, 1-14, 5-50-5-51 Drop tool, 1-12 E ensemble models, 9-4 Ensemble node, 9-4, 9-7 Ensemble tool, 1-15 estimate predictions, 3-11 estimation methods missing values, 4-17 Explore tab, 1-10-1-11 exporting scored tables, 7-9-7-14 F

forward selection method, 4-30 G gains charts, 6-17 Gini coefficient, 6-4 H Help panel, 1-5 hidden layer neural networks, 5-7



hidden units neural networks, 5-5 I implementing models, 7-3-7-21 Impute tool, 1-12 Input Data tool, 1-8 input layer neural networks, 5-7 input selection, 9-9-9-21 neural networks, 5-20-5-23 predictive modeling, 3-16 inputs categorical, 9-22-9-30 consolidating, 9-22-9-30 selecting, 3-14-3-16, 5-20-5-23, 9-9-9-21 selecting for regressions, 4-29-4-37 transforming, 4-52-4-65 Interactive Grouping tool, 1-19 internally scored data sets, 7-3-7-14 K k-means clustering algorithm, 8-9-8-13 Kolmogorov-Smirnov (KS) statistic, 6-3 L logistic function, 4-7, 5-8 logit link function, 4-7 M market basket analysis, 8-6, 8-63-8-79 memory-based reasoning (MBR) predictive modeling, 5-55 Memory-Based Reasoning tool, 1-15 Merge tool, 1-8 Metadata tool, 1-18 missing values causes, 4-16 estimation methods, 4-17 neural networks, 5-11 regressions, 4-17-4-24 synthetic distribution, 4-17 Model Comparison tool, 1-17, 6-3-6-25 statistical graphics, 6-10-6-20 Model tab, 1-14-1-16 models assessing, A-34-A-35, A-48, A-54 comparing using statistical graphics, 6-18-6-20

creating, A-44-A-47, A-49-A-53 creating using neural networks, A-26 creating using stepwise regression, A-25-A-26 creating using transformed stepwise regression, A-27 creating with discretized stepwise regressions, A-29-A-33 deploying, A-55 implementing, 7-3-7-21 model complexity, 3-18 surrogate, 9-31-9-37 underfitting, 3-18 viewing additional assessments, 6-44-6-45 Modify tab, 1-12 multi-layer perceptrons, 5-3 Multiplot tool, 1-10 N network diagrams, 5-7 neural network models. See neural networks Neural Network node, 6-24 neural network profit optimization, 6-47 Neural Network tool, 1-15, 5-12-5-19 neural networks, 5-3 AutoNeural tool, 5-39-5-45 binary prediction formula, 5-6 creating prediction models, A-26 diagrams, 5-7 exploring networks with increasing hidden unit costs, 5-39-5-45 extreme values, 5-11 input selection, 5-20-5-23 missing values, 5-11 Neural Network tool, 5-12-5-19 nonadditivity, 5-11 nonlinearity, 5-11 nonnumeric inputs, 5-11 prediction formula, 5-5 stopped training, 5-24-5-38 unusual values, 5-11 nodes, 1-6 nonlinearity neural networks, 5-11 regressions, 4-11 novelty detection methods, 8-6


O

odds ratio, 4-50 optimal transformations, A-32 optimizing model complexity, 3-17-3-22 output layer neural networks, 5-7

Path Analysis tool, 1-10 pattern discovery, 8-3-8-4 applications, 8-6 cautions, 8-5 limitations, 8-5 tools, 8-7 polynomial regressions, 4-76-4-81 prediction estimates regressions, 4-5-4-10 prediction formula neural networks, 5-4-5-6 regressions, 4-5-4-8 prediction rules decision trees, 3-30-3-31 predictive modeling, 3-3 applications, 3-4 decision predictions, 3-9 Dmine regression, 5-48-5-49 DMNeural tool, 5-50-5-51 essential tasks, 3-8-3-22 estimate predictions, 3-11 input selection, 3-14-3-16 memory-based reasoning (MBR), 5-55 neural networks, 5-3-5-45 optimizing model complexity, 3-17-3-22 ranking predictions, 3-10 regressions, 4-17-4-87 Rule Induction, 5-46-5-47 tools, 3-28-3-29 training data sets, 3-5 Principal Components tool, 1-12 process flows, 1-6 profiling, 8-6 profiling segments, 8-56-8-60 profit matrices, 6-39-6-41 creating, 6-32-6-38 evaluating, 6-42-6-43 profit value, 6-39 Project panel, 1-4 projects creating, 2-5-2-9 Properties panel, 1-4


R ranking predictions, 3-10 regression models. See regressions Regression node selection methods, 4-29-4-37 regression profit optimization, 6-47 Regression tool, 1-16, 4-25-4-28 regressions, 4-3 missing values, 4-17-4-24 nonadditivity, 4-11 nonlinearity, 4-11 nonnumeric inputs, 4-11 optimal complexity, 4-38-4-48 polynomial, 4-76-4-87 prediction estimates, 4-5-4-10 prediction formula, 4-5-4-8 selecting inputs, 4-29-4-37 transforming inputs, 4-52-4-65 unusual values, 4-11 Reject Inference tool, 1-19 Replacement tool, 1-12 response rate charts, 6-15-6-17 ROC Index, 6-4 R-square variable selection criterion, 9-10 Rule Induction tool, 5-46-5-47 S Sample tab, 1-8 Sample tool, 1-8 SAS Code tool, 1-18 SAS Enterprise Miner analytic workflow, 1-20 creating projects, 2-5-2-9 diagram workspace, 1-5 Help panel, 1-5 interface for 5.2, 1-3-1-19 nodes, 1-6 Project panel, 1-4 Properties panel, 1-4 Utility tools, 1-18 Schwarz Bayesian Criterion (SBC), 6-5 score code modules, 7-3, 7-15-7-23 creating a SAS Score Code module, 7-16-7-21 score data source creating, 7-5 Score tool, 1-17 Scorecard tool, 1-19 scored tables exporting, 7-9-7-14



seeds, 8-10 Segment Profile tool, 1-17 segmentation analysis, 8-15-8-0 segmenting, 8-8-8-60 segments creating, A-10 deploying, A-14-A-15 interpreting, A-12-A-13 profiling, 8-56-8-60 specifying counts, 8-44 selecting inputs, 9-9-9-21 regressions, 4-29-4-37 selecting neural network inputs, 5-20-5-23 selection methods Regression node, 4-29-4-37 SEMMA, 1-7 SEMMA tools palette, 1-7 separate sampling, 6-20-6-31 adjusting for, 6-21-6-25, 6-29 benefits of, 6-28 consequences of, 6-28 sequence analysis, 8-6, 8-80-8-8 SOM/Kohonen tool, 1-11 specifying segment counts, 8-44 split search decision trees, 3-32-3-41 StatExplore tool, 1-11 statistical graphics comparing models, 6-18-6-20 Model Comparison tool, 6-10-6-1 stepwise regression creating prediction models, A-25-A-26 discretized, A-29-A-33 transformed, A-27 stepwise selection method, 4-32-4-3 stopped training, 5-3 neural networks, 5-24-5-38 surrogate models, 9-31-9-37 synthetic distribution

1

missing values, 4-17

target layer neural networks, 5-7 test data sets, 1-8, 3-19 Time Series tool, 1-9 training data sets, 1-8 3.5 creating, 3-23-3-27 ' transaction count, 8-82 Transform Variables tool, I-13 transformations bin, A-31 bucket, A-30 optimal, A-32 transformed stepwise regression creating prediction models, A-27 transforming inputs regressions, 4-52-4-65 TwoStage tool, 1-16 U

Utility tools SAS Enterprise Miner, 1-18

validation data sets, 1-8, 3-19, 3-21 creating, 3-23-3-27 Variable Selection node, 9-9-9-13 Variable Selection tool, 1-11 W

weight estimates neural networks, 5-5

