Introduction to SAS Enterprise Miner

SAS Enterprise Miner - Interface Tour
• Menu bar and shortcut buttons
• Project panel (data sources, diagrams, model packages, users)
• Properties panel
• Help panel
• Diagram workspace
• Process flow
• Node
• SEMMA tools palette

SEMMA - Sample Tab
• Append • Data Partition • File Import • Filter • Input Data • Merge • Sample • Time Series

SEMMA - Explore Tab
• Association • Cluster • DMDB • Graph Explore • Market Basket • MultiPlot • Path Analysis • SOM/Kohonen • StatExplore • Variable Clustering • Variable Selection

SEMMA - Modify Tab
• Drop • Impute • Interactive Binning • Principal Components • Replacement • Rules Builder • Transform Variables

SEMMA - Model Tab
• AutoNeural • Decision Tree • Dmine Regression • DMNeural • Ensemble • Gradient Boosting • Least Angle Regression • MBR • Model Import • Neural Network • Partial Least Squares • Regression • Rule Induction • Two Stage

SEMMA - Assess Tab
• Cutoff • Decisions • Model Comparison • Segment Profile • Score

Beyond SEMMA - Utility Tab
• Control Point • End Groups • Metadata • Reporter • SAS Code • Start Groups

Credit Scoring Tab (Optional)
• Credit Exchange • Interactive Grouping • Reject Inference • Scorecard

The Analytic Workflow

SAS Enterprise Miner Analytic Strengths
• Pattern Discovery
• Predictive Modeling
Accessing and Assaying Prepared Data

Analysis Element Organization
• Projects
• Libraries and Diagrams
• Process Flows
• Nodes

(Each diagram in a project has an associated workspace library, for example EMWS1.)
Creating a SAS Enterprise Miner Project
This demonstration illustrates creating a new SAS Enterprise Miner project.

This demonstration illustrates creating a new SAS library.

Creating a SAS Enterprise Miner Diagram
This demonstration illustrates creating a diagram in SAS Enterprise Miner.

Defining a Data Source
• Select table.
• Define variable roles.
• Define measurement levels.
• Define table role.
(Analysis data is drawn from tables stored in SAS Foundation server libraries.)

Defining a Data Source
This demonstration illustrates defining a SAS data source.

Exploring Source Data
This demonstration illustrates assaying and exploring a data source.
Unsupervised Classification

Unsupervised classification: grouping of cases (for example, into cluster 1 and cluster 2) based on similarities in input values.

k-means Clustering Algorithm (Training Data)
1. Select inputs.
2. Select k cluster centers.
3. Assign cases to the closest center.
4. Update cluster centers.
5. Reassign cases.
6. Repeat steps 4 and 5 until convergence.
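The assignment and update steps (3 and 4) have a compact standard form. As a sketch in generic k-means notation (the symbols below are not taken from the course materials): case i with input vector x_i is assigned to the cluster whose center is closest, and each center is then recomputed as the mean of its assigned cases:

c(i) = \arg\min_{j \in \{1,\dots,k\}} \lVert x_i - \mu_j \rVert^2, \qquad \mu_j = \frac{1}{\lvert\{i : c(i)=j\}\rvert} \sum_{i:\, c(i)=j} x_i

Alternating these two steps until the assignments stop changing is the convergence test in step 6.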
Segmentation Analysis (Training Data)
When no clusters exist, use the k-means algorithm to partition cases into contiguous groups.

Creating Clusters with the Cluster Tool
This demonstration illustrates how the Cluster tool determines the number of clusters in the data.

Exploring Segments
This demonstration illustrates how to use graphical aids to explore the segments.

Profiling Segments
This demonstration illustrates using the Segment Profile tool to interpret the composition of clusters.
Case Study I: Bank Usage Segmentation

A.1 Banking Segmentation Case Study

Case Study Description

A consumer bank sought to segment its customers based on historic usage patterns. Segmentation was to be used for improving contact strategies in the Marketing Department.

A sample of 100,000 active consumer customers was selected. An active consumer customer was defined as an individual or household with at least one checking account and at least one transaction on the account during a three-month study period. All transactions during the three-month study period were recorded and classified into one of four activity categories:
• traditional banking methods (TBM)
• automatic teller machine (ATM)
• point of sale (POS)
• customer service (CSC)

A three-month activity profile for each customer was developed by combining historic activity averages with observed activity during the study period. Historically, for one CSC transaction, an average customer would conduct two POS transactions, three ATM transactions, and 10 TBM transactions. Each customer was assigned this initial profile at the beginning of the study period. The initial profile was updated by adding the total number of transactions in each activity category over the entire three-month study period.

The PROFILE data set contains all 100,000 three-month activity profiles. This case study describes the creation of customer activity segments based on the PROFILE data set.

The diagram containing this analysis is stored as an XML file on the course data disk. You can open this file by right-clicking Diagrams and selecting Import Diagram from XML in SAS Enterprise Miner. All nodes in the opened file, except the data node, contain the property settings outlined in this case study. If you want to run the diagram, you need to re-create the case study data set using the metadata settings indicated below.
Case Study Data

Name      Model Role   Measurement Level   Description
ID        ID           Nominal             Customer ID
CNT_TBM   Input        Interval            Traditional bank method transaction count
CNT_ATM   Input        Interval            ATM transaction count
CNT_POS   Input        Interval            Point-of-sale transaction count
CNT_CSC   Input        Interval            Customer service transaction count
CNT_TOT   Input        Interval            Total transaction count
Accessing and Assaying the Data

A SAS Enterprise Miner data source was defined using the metadata settings indicated above. The StatExplore node was attached to the PROFILE data source and used to provide preliminary statistics on the input variables.

The Interval Variable Summary from the StatExplore node showed no missing values but did show a surprisingly large range on the transaction counts.

Interval Variable Summary Statistics (maximum 500 observations printed)
Data Role=TRAIN

Variable   Role    Mean       Std. Dev.   Non Missing   Missing   Skewness
CNT_ATM    INPUT   19.49971    20.8561    100000        0          2.357293
CNT_CSC    INPUT    6.63411    12.12856   100000        0          6.236494
CNT_POS    INPUT   11.9233     20.73384   100000        0          3.343805
CNT_TBM    INPUT   63.13696   101.1542    100000        0         53.05219
CNT_TOT    INPUT  106.2441    113.3704    100000        0         39.2061

A plot of the input distributions showed highly skewed distributions for all inputs.
(The Explore window for AAEM.PROFILE displayed histograms of CNT_ATM, CNT_TBM, CNT_CSC, CNT_POS, and CNT_TOT over all 100,000 rows.)

It would be difficult to develop meaningful segments from such highly skewed inputs. Instead of focusing on the transaction counts, it was decided to develop segments based on the relative proportions of transactions across the four categories. This required a transformation of the raw data. A Transform Variables node was connected to the PROFILE node.

The Transform Variables node was used to create category logit scores for each transaction category:

   category logit score = log(transaction count in category / transaction count not in category)
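The same transformation can also be written directly as a SAS DATA step. The sketch below is based on the formula above and on the scoring code shown later in this case study; the input and output data set names (work.profile and work.profile_lgt) are placeholders.

data work.profile_lgt;
   set work.profile;   /* the 100,000 three-month activity profiles */
   /* category logit score = log(count in category / count not in category) */
   LGT_TBM = log(CNT_TBM / (CNT_TOT - CNT_TBM));
   LGT_ATM = log(CNT_ATM / (CNT_TOT - CNT_ATM));
   LGT_POS = log(CNT_POS / (CNT_TOT - CNT_POS));
   LGT_CSC = log(CNT_CSC / (CNT_TOT - CNT_CSC));
run;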
The transformations were created using these steps:
1. Select Formulas in the Transform Variables node's Properties panel. The Formulas window appears.
2. Select the Create icon. The Add Transformation dialog box appears.
3. For each transaction category, type the name and formula (for example, name LGT_TBM with formula log(CNT_TBM/(CNT_TOT-CNT_TBM))).
4. Select OK to add the transformation. The Add Transformation dialog box closes and you return to the Formula Builder window.
5. Select Preview to see the distribution of the newly created input.
6. Repeat Steps 1-5 for the other three transaction categories.
7. Select OK to close the Formula Builder window.
8. Run the Transform Variables node.

Segmentation was to be based on the newly created category logit scores. Before proceeding, it was deemed reasonable to examine the joint distribution of the cases using these derived inputs. A scatter plot using any three of the four derived inputs would represent the joint distribution without significant loss of information.
A three-dimensional scatter plot was produced using the following steps:
1. Select Exported Data from the Properties panel of the Transform Variables node. The Exported Data window appears.
2. Select the TRAIN data and select Explore. The Explore window appears.
3. Select Actions => Plot... or click the Plot Wizard icon. The Plot Wizard appears.
4. Select a three-dimensional scatter plot.
5. Select Role X, Y, and Z for LGT_ATM, LGT_CSC, and LGT_POS, respectively.
6. Select Finish to generate the scatter plot.

The scatter plot showed a single clump of cases, making this analysis a segmentation (rather than a clustering) of the customers. There were a few outlying cases with apparently low proportions on the three plotted inputs. Given that the proportions in the four original categories must sum to 1, it followed that these outlying cases must have a high proportion of transactions in the non-plotted category, TBM.
Creating Segments

Transaction segments were created using the Cluster node, which was connected to the Transform Variables node.

Two changes to the Cluster node default properties were made. Both were related to limiting the number of clusters created to 5: the Specification Method under Number of Clusters was set to User Specify, and the Maximum Number of Clusters was set to 5.

Because the inputs were all on the same measurement scale (category logit score), it was decided to not standardize the inputs (Internal Standardization was None).
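For readers who want to reproduce a comparable clustering outside SAS Enterprise Miner, the sketch below uses PROC FASTCLUS, the SAS/STAT k-means procedure. It is only an approximation under assumptions: the Cluster node generates its own score code from PROC DMVQ (shown later in this case study), so the segment assignments will not match exactly, and the data set names are placeholders.

proc fastclus data=work.profile_lgt out=work.profile_seg maxclusters=5;
   /* the four category logit scores share a common scale, so no standardization step */
   var LGT_TBM LGT_ATM LGT_POS LGT_CSC;
run;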
Only the four LGT inputs defined in the Transform Variables node were set to Default in the Cluster node's Variables table; the Use property for the original count inputs was set to No.

Running the Cluster node and viewing the Results window confirmed the creation of five nearly equally sized clusters.
Interpreting Segments

A Segment Profile node attached to the Cluster node helped to interpret the contents of the generated segments.

Only the LGT inputs were set to Yes in the Segment Profile node's Variables table; the Use property for the other inputs was set to No.
The following profiles were created for the generated segments:

Segment 1 (Count: 13157, Percent: 13.16)
Segment 1 customers had a significantly higher than average use of traditional banking methods and lower than average use of all other transaction categories. This segment was labeled Brick-and-Mortar.

Segment 2 (Count: 25536, Percent: 25.54)
Segment 2 customers had a higher than average use of traditional banking methods but were close to the distribution centers on the other transaction categories. This segment was labeled Transitionals because they seem to be transitioning from brick-and-mortar to other usage patterns.

Segment 3 (Count: 17839, Percent: 17.84)
Segment 3 customers eschewed traditional banking methods in favor of ATMs. This segment was labeled ATMs.

Segment 4 (Count: 14537, Percent: 14.54)
Segment 4 was characterized by a high prevalence of point-of-sale transactions and few traditional bank methods. This segment was labeled Cashless.

Segment 5 (Count: 28931, Percent: 28.93)
Segment 5 had a higher than average rate of customer service contacts and point-of-sale transactions. This segment was labeled Service.
Segment Deployment

Deployment of the transaction segmentation was facilitated by the Score node.

The Score node was attached to the Cluster node and run. The SAS Code window inside the Results window provided SAS code that was capable of transforming raw transaction counts to cluster assignments. The complete SAS scoring code is shown below.
*------------------------------------------------------*;
* EM SCORE CODE;
* EM Version: 7.1;
* SAS Release: 9.03.01M0P060711;
* Host: SASBAP;
* Encoding: wlatin1;
* Locale: en_US;
* Project Path: D:\Workshop\winsas\EM_Projects;
* Project Name: apxa;
* Diagram Id: EMWS1;
* Diagram Name: case_study1;
* Generated by: sasdemo;
* Date: 09SEP2011:16:50:09;
*------------------------------------------------------*;
* TOOL: Input Data Source;
* TYPE: SAMPLE;
* NODE: Ids2;

* TOOL: Transform;
* TYPE: MODIFY;
* NODE: Trans;
* Formula Code;
LGT_TBM = log(CNT_TBM/(CNT_TOT-CNT_TBM));
LGT_ATM = log(CNT_ATM/(CNT_TOT-CNT_ATM));
LGT_POS = log(CNT_POS/(CNT_TOT-CNT_POS));
LGT_CSC = log(CNT_CSC/(CNT_TOT-CNT_CSC));

* TOOL: Clustering;
* TYPE: EXPLORE;
* NODE: Clus;
*** Begin Scoring Code from PROC DMVQ ***;
*** Begin Class Look-up, Standardization, Replacement;
drop _dm_bad;
_dm_bad = 0;
*** No transformation for LGT_ATM;
*** No transformation for LGT_CSC;
*** No transformation for LGT_POS;
*** No transformation for LGT_TBM;
*** End Class Look-up, Standardization, Replacement;
*** Omitted Cases;
if _dm_bad then do;
   _SEGMENT_ = .; Distance = .; goto CLUSvlex;
end;
*** Compute Distances and Cluster Membership;
label _SEGMENT_ = 'Segment Id';
label Distance = 'Distance';
array CLUSvads [5] _temporary_;
drop _vqclus _vqmvar _vqnvar;
_vqmvar = 0;
do _vqclus = 1 to 5;
   CLUSvads [_vqclus] = 0;
end;
if not missing( LGT_ATM ) then do;
   CLUSvads [1] + ( LGT_ATM - -3.54995114884545 )**2;
   CLUSvads [2] + ( LGT_ATM - -2.2003888516185 )**2;
   CLUSvads [3] + ( LGT_ATM - -0.23695023328541 )**2;
   CLUSvads [4] + ( LGT_ATM - -1.47814712774378 )**2;
   CLUSvads [5] + ( LGT_ATM - -1.49704375204907 )**2;
end;
else _vqmvar + 1.31533540479169;
if not missing( LGT_CSC ) then do;
   CLUSvads [1] + ( LGT_CSC - -4.16334022538952 )**2;
   CLUSvads [2] + ( LGT_CSC - -3.38356120535047 )**2;
   CLUSvads [3] + ( LGT_CSC - -3.55519058753002 )**2;
   CLUSvads [4] + ( LGT_CSC - -3.96526745641347 )**2;
   CLUSvads [5] + ( LGT_CSC - -2.08727391873096 )**2;
end;
else _vqmvar + 1.20270093291078;
if not missing( LGT_POS ) then do;
   CLUSvads [1] + ( LGT_POS - -4.08779761080977 )**2;
   CLUSvads [2] + ( LGT_POS - -3.27644694006697 )**2;
   CLUSvads [3] + ( LGT_POS - -3.02915771770446 )**2;
   CLUSvads [4] + ( LGT_POS - -0.9841959454775 )**2;
   CLUSvads [5] + ( LGT_POS - -2.21538937073223 )**2;
end;
else _vqmvar + 1.3094245726273;
if not missing( LGT_TBM ) then do;
   CLUSvads [1] + ( LGT_TBM - 2.62509260779666 )**2;
   CLUSvads [2] + ( LGT_TBM - 1.40885156098965 )**2;
   CLUSvads [3] + ( LGT_TBM - -0.15878507901546 )**2;
   CLUSvads [4] + ( LGT_TBM - 0.11252803970828 )**2;
   CLUSvads [5] + ( LGT_TBM - 0.22075831354075 )**2;
end;
else _vqmvar + 1.17502484629096;
_vqnvar = 5.00248575662075 - _vqmvar;
if _vqnvar <= 2.2748671456705E-12 then do;
   _SEGMENT_ = .; Distance = .;
end;
else do;
   _SEGMENT_ = 1;
   Distance = CLUSvads [1];
   _vqfzdst = Distance * 0.99999999999988;
   drop _vqfzdst;
   do _vqclus = 2 to 5;
      if CLUSvads [_vqclus] < _vqfzdst then do;
         _SEGMENT_ = _vqclus;
         Distance = CLUSvads [_vqclus];
         _vqfzdst = Distance * 0.99999999999988;
      end;
   end;
   Distance = sqrt(Distance * (5.00248575662075 / _vqnvar));
end;
CLUSvlex: ;
*** End Scoring Code from PROC DMVQ ***;

* Clus: Creating Segment Label;
length _SEGMENT_LABEL_ $80;
label _SEGMENT_LABEL_ = 'Segment Description';
if _SEGMENT_ = 1 then _SEGMENT_LABEL_ = "Cluster1";
else if _SEGMENT_ = 2 then _SEGMENT_LABEL_ = "Cluster2";
else if _SEGMENT_ = 3 then _SEGMENT_LABEL_ = "Cluster3";
else if _SEGMENT_ = 4 then _SEGMENT_LABEL_ = "Cluster4";
else if _SEGMENT_ = 5 then _SEGMENT_LABEL_ = "Cluster5";

* TOOL: Score Node;
* TYPE: ASSESS;
* NODE: Score;
* Score: Creating Fixed Names;
label EM_SEGMENT = 'Segment Variable';
EM_SEGMENT = _SEGMENT_;
Market Basket Analysis

Rule          Support   Confidence
A => D        2/5       2/3
C => A        2/5       2/4
A => C        2/5       2/3
B & C => D    1/5       1/3

Implication?

                        Checking Account
                        No       Yes
Saving Account   No     500      3,500     4,000
                 Yes    1,000    5,000     6,000
                                          10,000

Support(SVG => CK) = 50% = 5,000/10,000
Confidence(SVG => CK) = 83% = 5,000/6,000
Expected Confidence(SVG => CK) = 85% = 8,500/10,000
Lift(SVG => CK) = 0.83/0.85 < 1
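In general form (a restatement of the measures used above, not additional course material), for a rule A => B evaluated over N transactions:

\mathrm{Support}(A \Rightarrow B) = \frac{\#(A \text{ and } B)}{N}, \quad \mathrm{Confidence}(A \Rightarrow B) = \frac{\#(A \text{ and } B)}{\#(A)}, \quad \mathrm{Lift}(A \Rightarrow B) = \frac{\mathrm{Confidence}(A \Rightarrow B)}{\#(B)/N}

Plugging in the checking and saving account counts: support = 5,000/10,000 = 0.50, confidence = 5,000/6,000 ≈ 0.83, expected confidence = 8,500/10,000 = 0.85, and lift ≈ 0.83/0.85 ≈ 0.98 < 1, so holding a saving account is associated with a slightly lower than average rate of holding a checking account.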
Barbie Doll => Candy

1. Put them closer together in the store.
2. Put them far apart in the store.
3. Package candy bars with the dolls.
4. Package Barbie + candy + poorly selling item.
5. Raise the price on one, lower it on the other.
6. Offer Barbie accessories for proofs of purchase.
7. Do not advertise candy and Barbie together.
8. Offer candies in the shape of a Barbie doll.

Data Capacity

Association Tool Demonstration

Analysis goal: Explore associations between retail banking services used by customers.

Analysis plan:
• Create an association data source.
• Run an association analysis.
• Interpret the association rules.
• Run a sequence analysis.
• Interpret the sequence rules.

Market Basket Analysis
This demonstration illustrates how to conduct market basket analysis.
Name      Model Role   Measurement Level   Description
ACCOUNT   ID           Nominal             Account Number
SERVICE   Target       Nominal             Type of Service
VISIT     Sequence     Ordinal             Order of Product Purchase

ATM      automated teller machine debit card
AUTO     automobile installment loan
CCRD     credit card
CD       certificate of deposit
CKCRD    check debit card
CKING    checking account
HMEQLC   home equity line of credit
IRA      individual retirement account
MMDA     money market deposit account
MTG      mortgage
PLOAN    personal consumer installment loan
SVG      saving account
TRUST    personal trust account
(In the Data Source Wizard, the AAEM.BANK table was selected; the column metadata assigned the ID role to ACCOUNT, the Target role to SERVICE, and the Sequence role to VISIT, and the data source named bank was given the Transaction role. The Association node's Rules and Sequence properties were then reviewed in its Properties panel, with Export Rule by ID shown as Yes.)
(The Rules Table in the Results window listed the generated association rules with their expected confidence, confidence, support, lift, and transaction counts; the listed rules involved combinations of services such as CKING, CKCRD, CCRD, HMEQLC, SVG, ATM, and CD.)
This demonstration illustrates how to conduct a sequence analysis.

(The resulting sequence rules included chains such as CKING ==> SVG, CKING ==> ATM, SVG ==> ATM, CKING ==> CD, CKING ==> HMEQLC, CKING ==> MMDA, CKING ==> CCRD, and longer chains such as CKING ==> SVG ==> CCRD.)
Case Study II: Web Services Associations

A.2 Web Site Usage Associations Case Study

Case Study Description

A radio station developed a Web site to broaden its audience appeal and its offerings. In addition to a simulcast of the station's primary broadcast, the Web site was designed to provide services to Web users, such as podcasts, news streams, music streams, archives, and live Web music performances. The station tracked usage of these services by URL. Analysts at the station wanted to see whether any unusual patterns existed in the combinations of services selected by its Web users. The WEBSTATION data set contains services selected by more than 1.5 million unique Web users over a two-month period in 2006. For privacy reasons, the URLs are assigned anonymous ID numbers.

The diagram containing this analysis is stored as an XML file on the course data disk. You can open this file by right-clicking Diagrams => Import Diagram from XML in SAS Enterprise Miner. All nodes in the opened file, except the data node, contain the property settings outlined in this case study. If you want to run the diagram, you need to re-create the case study data set using the metadata settings indicated below.

Case Study Data

Name     Model Role   Measurement Level   Description
ID       ID           Nominal             URL (with anonymous ID numbers)
TARGET   Target       Nominal             Web service selected

The WEBSTATION data set should be assigned the role of Transaction. This role can be assigned either in the process of creating the data source or by changing the properties of the data source inside SAS Enterprise Miner.
Accessing and Assaying the Data

A SAS Enterprise Miner data source was defined for the WEBSTATION data set using the metadata settings indicated above. By right-clicking the Data Source node in the diagram and selecting Edit Variables, the TARGET variable can be explored by highlighting the variable and then selecting Explore. (The following results are obtained by specifying Random and Max for the Sample Method and Fetch Size.) The Sample Statistics window showed more than 128 unique URLs in the data set and 8 distinct services.

A plot of the target distribution (produced from the Explore window) identified the eight levels and displayed their relative frequency in a random sample of 100,000 cases: WEBSITE, MUSIC STREAM, PODCAST, NEWS, LIVESTREAM, ARCHIVE, SIMULCAST, and EXTREF.

Generating Associations

An Association node was connected to the WEBSTATION node.
A preliminary run of the Association node yielded very few association rules. It was discovered that the default minimum Support Percentage setting was too large. (Many of the URLs selected only one service, diminishing the support of all association rules.) To obtain more association rules, the minimum Support Percentage setting was changed to 1.0. In addition, the number of items to process was increased to 3,000,000 to account for the large training data set.

Using these changes, the analysis was rerun and yielded substantially more association rules. The Rules Table was used to scrutinize the results. (The table listed each rule's lift, confidence, expected confidence, support, and transaction count; the strongest rules involved combinations of ARCHIVE, WEBSITE, EXTREF, SIMULCAST, NEWS, MUSICSTREAM, and PODCAST.)

The following were among the interesting findings from this analysis:
• Most external referrers to the Web site pointed to the programming archive (98% confidence).
• Selecting the simulcast service tripled the chances of selecting the news service.
• Users who streamed music, downloaded podcasts, used the news service, or listened to the simulcast were less likely to go to the Web site.
Introduction to Predictive Modeling: Regressions

Model Essentials - Regressions
• Predict new cases: prediction formula
• Select useful inputs: sequential selection
• Optimize complexity: best model from sequence

Linear Regression Prediction Formula

\hat{y} = \hat{w}_0 + \hat{w}_1 x_1 + \hat{w}_2 x_2

where \hat{w}_0 is the intercept estimate, \hat{w}_1 and \hat{w}_2 are parameter estimates, and x_1 and x_2 are input measurements. Choose the intercept and parameter estimates to minimize the squared error function over the training data:

\sum_{\text{training data}} ( y_i - \hat{y}_i )^2
Logit Link Function

\log\!\left(\frac{\hat{p}}{1-\hat{p}}\right) = \mathrm{logit}(\hat{p})

The logit link function transforms probabilities (between 0 and 1) to logit scores (between −∞ and +∞).

\mathrm{logit}(\hat{p}) = \hat{w}_0 + \hat{w}_1 x_1 + \hat{w}_2 x_2

To obtain prediction estimates, the logit equation is solved for \hat{p}:

\hat{p} = \frac{1}{1 + e^{-\mathrm{logit}(\hat{p})}}
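A small DATA step can make the two formulas concrete. The sketch below scores one case using the example weights that appear later in this chapter (intercept -0.81 and slopes 0.92 and 1.11); the value of x2 and the data set name are arbitrary placeholders.

data work.score_one;
   /* example weights from the missing-values illustration later in this chapter */
   w0 = -0.81;  w1 = 0.92;  w2 = 1.11;
   x1 = 0.3;  x2 = 0.5;                  /* one hypothetical case               */
   logit_p = w0 + w1*x1 + w2*x2;         /* logit score                         */
   p_hat   = 1 / (1 + exp(-logit_p));    /* solve the logit equation for p-hat  */
run;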
Simple Prediction Illustration - Regressions

Predict dot color for each (x_1, x_2):

\mathrm{logit}(\hat{p}) = \hat{w}_0 + \hat{w}_1 x_1 + \hat{w}_2 x_2, \qquad \hat{p} = \frac{1}{1 + e^{-\mathrm{logit}(\hat{p})}}

This requires intercept and parameter estimates. Find the parameter estimates by maximizing the log-likelihood function

\sum_{\text{primary outcome training cases}} \log(\hat{p}_i) \;+\; \sum_{\text{secondary outcome training cases}} \log(1 - \hat{p}_i)

Regressions: Beyond the Prediction Formula
• Manage missing values
• Interpret the model
• Handle extreme or unusual values
• Use nonnumeric inputs
• Account for nonlinearities
Missing Values and Regression Modeling

Problem 1: Training data cases with missing values on inputs used by a regression model are ignored.
Consequence: Missing values can significantly reduce your amount of training data for regression modeling!

Missing Values and the Prediction Formula

\mathrm{logit}(\hat{p}) = -0.81 + 0.92\,x_1 + 1.11\,x_2
Predict: (x_1, x_2) = (0.3, ?)

Problem 2: Prediction formulas cannot score cases with missing values.

Missing Value Causes
• Non-applicable measurement
• No match on merge
• Non-disclosed measurement

Missing Value Remedies
• Synthetic distribution
• Estimation
This demonstration illustrates how to impute synthetic data values and create missing value indicators.
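A rough base SAS sketch of what this demonstration does in the Impute node: create a missing value indicator and then replace missing interval values with the input mean. The variable and data set names are placeholders, and PROC STDIZE is used here only as a stand-in for the node's own imputation.

data work.flagged;
   set work.raw;
   M_income = missing(income);   /* missing value indicator (hypothetical input name) */
run;

proc stdize data=work.flagged out=work.imputed method=mean reponly;
   var income;                   /* REPONLY replaces only the missing values with the mean */
run;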
Running the Regression Node
This demonstration illustrates using the Regression tool.

Model Essentials - Regressions: Select useful inputs
Sequential Selection - Forward
Forward selection starts with no inputs. At each step, the candidate input with the smallest p-value is added, as long as that p-value falls below the entry cutoff.

Sequential Selection - Backward
Backward selection starts with all inputs in the model. At each step, the input with the largest p-value is removed, as long as that p-value exceeds the stay cutoff.

Sequential Selection - Stepwise
Stepwise selection adds inputs as in forward selection, but after each addition any input whose p-value no longer meets the stay cutoff can be removed.

Selecting Inputs
This demonstration illustrates using stepwise selection to choose inputs for the model.
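Outside SAS Enterprise Miner, a comparable stepwise search can be requested from PROC LOGISTIC. The sketch below is an illustration under assumptions: the data set, target, and input names are placeholders, and the Regression node's validation-based complexity optimization is not reproduced here.

proc logistic data=work.train;
   model target(event='1') = x1 x2 x3 x4
         / selection=stepwise slentry=0.05 slstay=0.05;   /* entry and stay cutoffs */
run;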
Model Essentials - Regressions: Optimize complexity (best model from sequence)

Model Fit versus Complexity
A model fit statistic is evaluated at each step of the selection sequence.

Select Model with Optimal Validation Fit
Choose the simplest model with the optimal fit on the validation data.

Optimizing Complexity
This demonstration illustrates tuning a regression model to give optimal performance on the validation data.
Regressions: Beyond the Prediction Formula - Interpret the Model
Odds Ratios and Doubling Amounts

\log\!\left(\frac{\hat{p}}{1-\hat{p}}\right) = \hat{w}_0 + \hat{w}_1 x_1 + \hat{w}_2 x_2 \quad \text{(logit scores)}

Odds ratio: the amount the odds change with a unit change in an input.
Consequence: x_i + 1 \;\Rightarrow\; \text{odds} \times \exp(\hat{w}_i)

Doubling amount: the input change required to double the odds.
Consequence: x_i + 0.69/\hat{w}_i \;\Rightarrow\; \text{odds} \times 2
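A worked check using the example coefficients from the missing-values illustration earlier in this chapter (not new course material): for x_1 with \hat{w}_1 = 0.92,

\exp(0.92) \approx 2.51, \qquad \frac{0.69}{0.92} \approx 0.75

so a one-unit increase in x_1 multiplies the odds by about 2.5, and an increase of about 0.75 in x_1 doubles the odds. The constant 0.69 is \log 2 \approx 0.693, which is why adding 0.69/\hat{w}_i to x_i multiplies the odds by e^{\log 2} = 2.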
Interpreting a Regression Model
This demonstration illustrates interpreting a regression model using odds ratios.
Regressions: Beyond the Prediction Formula - Handle Extreme or Unusual Values

Regularizing Input Transformations
On the original input scale, extreme values pull the standard regression estimate away from the true association. On a regularized scale, the regularized estimate tracks the true association more closely.
This demonstration illustrates using the Transform Variables tool to apply standard transformations to a set of inputs.
Credit Risk Scoring

A.3 Credit Risk Case Study

A bank sought to use performance on an in-house subprime credit product to create an updated risk model. The risk model was to be combined with other factors to make future credit decisions.

A sample of applicants for the original credit product was selected. Credit bureau data describing these individuals (at the time of application) was recorded. The ultimate disposition of the loan was determined (paid off or bad debt). For loans rejected at the time of application, a disposition was inferred from credit bureau records on loans obtained in a similar time frame.

The credit scoring models pursued in this case study were required to conform to the standard industry practice of transparency and interpretability. This eliminated certain modeling tools from consideration (for example, neural networks) except for comparison purposes. If a neural network significantly outperformed a regression, for example, it could be interpreted as a sign of lack of fit for the regression. Measures could then be taken to improve the regression model.

The diagram containing this analysis is stored as an XML file on the course data disk. You can open this file by right-clicking Diagrams => Import Diagram from XML in SAS Enterprise Miner. All nodes in the opened file, except the data node, contain the property settings outlined in this case study. If you want to run the diagram, you need to re-create the case study data set using the metadata settings indicated below.
Case Study Training Data

Name              Role     Level      Label
TARGET            Target   Binary
BanruptcyInd      Input    Binary     Bankruptcy Indicator
TLBadDerogCnt     Input    Interval   Number Bad Debt plus Public Derogatories
CollectCnt        Input    Interval   Number Collections
InqFinanceCnt24   Input    Interval   Number Finance Inquiries 24 Months
InqCnt06          Input    Interval   Number Inquiries 6 Months
DerogCnt          Input    Interval   Number Public Derogatories
TLDel3060Cnt24    Input    Interval   Number Trade Lines 30 or 60 Days 24 Months
TL50UtilCnt       Input    Interval   Number Trade Lines 50 pct Utilized
TLDel60Cnt24      Input    Interval   Number Trade Lines 60 Days or Worse 24 Months
TLDel60CntAll     Input    Interval   Number Trade Lines 60 Days or Worse Ever
TL75UtilCnt       Input    Interval   Number Trade Lines 75 pct Utilized
TLDel90Cnt24      Input    Interval   Number Trade Lines 90+ 24 Months
TLBadCnt24        Input    Interval   Number Trade Lines Bad Debt 24 Months
TLDel60Cnt        Input    Interval   Number Trade Lines Currently 60 Days or Worse
TLSatCnt          Input    Interval   Number Trade Lines Currently Satisfactory
TLCnt12           Input    Interval   Number Trade Lines Opened 12 Months
TLCnt24           Input    Interval   Number Trade Lines Opened 24 Months
TLCnt03           Input    Interval   Number Trade Lines Opened 3 Months
TLSatPct          Input    Interval   Percent Satisfactory to Total Trade Lines
TLBalHCPct        Input    Interval   Percent Trade Line Balance to High Credit
TLOpenPct         Input    Interval   Percent Trade Lines Open
TLOpen24Pct       Input    Interval   Percent Trade Lines Open 24 Months
TLTimeFirst       Input    Interval   Time Since First Trade Line
InqTimeLast       Input    Interval   Time Since Last Inquiry
TLTimeLast        Input    Interval   Time Since Last Trade Line
TLSum             Input    Interval   Total Balance All Trade Lines
TLMaxSum          Input    Interval   Total High Credit All Trade Lines
TLCnt             Input    Interval   Total Open Trade Lines
ID                ID       Nominal
Accessing and Assaying the Data

A SAS Enterprise Miner data source was defined for the CREDIT data set using the metadata settings indicated above. The data source definition was expedited by customizing the Advanced Metadata Advisor in the Data Source Wizard as indicated:

Advanced Advisor Options
Property                                    Value
Missing Percentage Threshold                50
Reject Vars with Excessive Missing Values   Yes
Class Levels Count Threshold                2
Detect Class Levels                         Yes
Reject Levels Count Threshold               20
Reject Vars with Excessive Class Values     Yes

Class Levels Count Threshold: If "Detect class levels" = Yes, interval variables with less than the number specified for this property will be marked as NOMINAL. The default value is 20.

With this change, all metadata was set correctly by default.
Decision processing was selected in step 6 of the Data Source Wizard (the Decision Configuration step). On the Decisions tab, "Do you want to use the decisions?" was set to Yes, with two decisions, DECISION1 and DECISION2, and no cost variables.

The Decisions option Default with Inverse Prior Weights was selected to provide the values in the Decision Weights tab. With the decision function set to Maximize, the weights were:

Level   DECISION1        DECISION2
1       5.99880023...    0.0
0       0.0              1.20004800...
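The weights follow directly from the target proportions reported by StatExplore below (a worked check, not additional course material): with inverse prior weights, each correct decision is weighted by the reciprocal of its target level's prior,

\frac{1}{P(\mathrm{TARGET}=1)} = \frac{1}{0.1667} \approx 5.9988, \qquad \frac{1}{P(\mathrm{TARGET}=0)} = \frac{1}{0.8333} \approx 1.2000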
It can be shown that, theoretically, the so-called central decision rule optimizes model performance based on the KS statistic. The StatExplore node was used to provide preliminary statistics on the target variable.

BanruptcyInd and TARGET were the only two class variables in the CREDIT data set.

Class Variable Summary Statistics (maximum 500 observations printed)
Data Role=TRAIN

Data Role   Variable Name   Role     Number of Levels   Missing   Mode   Mode Pct   Mode 2   Mode 2 Pct
TRAIN       BanruptcyInd    INPUT    2                  0         0      84.67      1        15.33
TRAIN       TARGET          TARGET   2                  0         0      83.33      1        16.67
The Interval Variable Summary shows missing values on 11 of the 27 interval inputs.

Interval Variable Summary Statistics (maximum 500 observations printed)
Data Role=TRAIN

Variable          Mean       Std. Dev.   Non Missing   Missing   Median    Maximum   Skewness
CollectCnt        0.857      2.161352    3000          0         0         50         7.556541
DerogCnt          1.43       2.731469    3000          0         0         51         5.045122
InqCnt06          3.108333   3.479171    3000          0         2         40         2.580016
InqFinanceCnt24   3.555      4.477536    3000          0         2         40         2.806893
InqTimeLast       3.108108   4.637831    2812          188       1         24         2.366563
TL50UtilCnt       4.077904   3.108076    2901          99        3         23         1.443077
TL75UtilCnt       3.121682   2.605435    2901          99        3         20         1.50789
TLBadCnt24        0.567      1.324423    3000          0         0         16         4.376858
TLBadDerogCnt     1.409      2.460434    3000          0         0         47         4.580204
TLBalHCPct        0.648176   0.266486    2959          41        0.6955    3.3613    -0.18073
TLCnt             7.879546   5.421595    2997          3         7         40         1.235579
TLCnt03           0.275      0.502034    3000          0         0         7          2.805575
TLCnt12           1.821333   1.925265    3000          0         1         15         1.623636
TLCnt24           3.882333   3.396714    3000          0         3         28         1.60771
TLDel3060Cnt24    0.726      1.163633    3000          0         0         8          1.381942
TLDel60Cnt        1.522      2.809653    3000          0         0         38         3.30846
TLDel60Cnt24      1.068333   1.806124    3000          0         0         20         3.080191
TLDel60CntAll     2.522      3.407255    3000          0         1         45         2.564126
TLDel90Cnt24      0.814667   1.609508    3000          0         0         19         3.623972
TLMaxSum          31205.9    29092.91    2960          40        24187     271036     2.061138
TLOpen24Pct       0.564219   0.480105    2997          3         0.5       6          2.779055
TLOpenPct         0.496168   0.206722    2997          3         0.5       1          0.379339
TLSatCnt          13.51168   8.931769    2996          4         12        57         0.851193
TLSatPct          0.518331   0.234759    2996          4         0.5263    1         -0.12407
TLSum             20151.1    19682.09    2960          40        15546     210612     2.276832
TLTimeFirst       170.1137   92.8137     3000          0         151       933        1.031307
TLTimeLast        11.87367   16.32141    3000          0         7         342        6.447907
Edt
Vtaw
Action*
QIBK>LmiI IIA.4,l,M.!liJi.l.!JJ Property
WMw
H R C l l I ILBadLnt.M
BRE3II • IrmCntUti
tet 3000
EES! ObsfV
TARGET
1 2 3
ID
Number
0000066 0000116 0000124
-J >J
Obs U
|Vatiahle...| 11D 2ColleclCnt 3DerogCn1
Type
[I A
CLASS VAR VAR
-J
MR
mm
LTu
Explore - AAEM.CREDIT FJe
VKW Actions
WWw
HfilBII £ 2000¬
0000066 0000116 0000124 0000129 0000113
Number Trade Lines Currently 6...
I 1000
11D CLASS 2 TAR GET VAR 3TLDel3060... VAR 4TLDei6GCnl VAR 5TLDel60Cn . VAR
—
1
- 1
20%
100%
11000-
LL
500 0%
|l
r — — i — •i — — r — i — r -
o% 40% 60% eo% Percenl Satisfactory to Total Tra...
5 leoo-
&isoo 5
Typo
irTf^ I n
'
V 08.414 J2I5.829
1-
I | Variable... |
|:oo-
Ob. Toi.il Hiuli Credit All Trade Lines
180% 360%
| $40%
soo-
0
W
184,245 JI68.4&0
Total Balance All Trade Lhies
Percent Iiailc Line* Open 24 M -
Number Trails Lines 60 Days 01 _
Obsfl
| 400 -
W
7.0 152 229 704 33 0
& 2000-
Time Sir
II)
TARGET
0.0
| --Jni*J
H Q I I • Hsmi'ct
I 10Q0 I 500
5 1000-
AAEM Apply |
I TLMajrtum
ILfJd&ljr.ntAII &600-
2000 • 1000 -
£ 400 -
f 20000
jd
= 1000-
-Hl [ I rk-rn
500¬ 0 60
0% 20% 40% 60% 60% 100% Percent Ti.tile Line* Open
9 0 100 270 360 450
Number Trade Lines 60 Days or _
284.1
562.2
040.3
Time Since First Trade Line
3s/*l ^3000. ji 2000-
= 600 = 400 * 200
|ioco.
r-r-n 00
1 6
I 2000fiooo-
0 3.2 4 8
6.4
8
0
(lumber Trade Liires 30 or 60 D a -
00
30
76
111 152 190
Oil
11.4 22 8 342 4 5 6 57 0
Number Trade Lines Currently S_.
Number Tiaile Lines00> 24 Mo
76
00
102 6
2052
307 8
Time Since Last Traile Line
Creating Prediction Models: Simple Stepwise Regression

Because it was the most likely model to be selected for deployment, a regression model was considered first. The CREDIT data source was connected to a Data Partition node, an Impute node, and a Regression node.

• In the Data Partition node, 50% of the data was chosen for training and 50% for validation.
• The Impute node replaced missing values for the interval inputs with the input mean (the default for interval-valued input variables), and added unique imputation indicators for each input with missing values.
• The Regression node used the stepwise method for input variable selection, and validation profit for complexity optimization.

The selected model included seven inputs. See line 1197 of the Output window.

Analysis of Maximum Likelihood Estimates
Parameter          DF   Estimate   Std. Error   Wald Chi-Square   Pr > ChiSq   Standardized Estimate   Exp(Est)
Intercept          1    -2.7602    0.4089       45.57             <.0001                               0.063
IMP_TLBalHCPct     1     1.8759    0.3295       32.42             <.0001        0.2772                 6.527
IMP_TLSatPct       1    -2.6095    0.4515       33.40             <.0001       -0.3363                 0.074
InqFinanceCnt24    1     0.0610    0.0149       16.86             <.0001        0.1527                 1.063
TLDel3060Cnt24     1     0.3359    0.0623       29.11             <.0001        0.2108                 1.399
TLDel60Cnt24       1     0.1126    0.0408        7.62             0.0058        0.1102                 1.119
TLOpenPct          1     1.5684    0.4633       11.46             0.0007        0.1792                 4.799
TLTimeFirst        1    -0.00253   0.000923      7.50             0.0062       -0.1309                 0.997

The odds ratio estimates facilitated model interpretation. Increasing risk was associated with increasing values of IMP_TLBalHCPct, InqFinanceCnt24, TLDel3060Cnt24, TLDel60Cnt24, and TLOpenPct. Increasing risk was associated with decreasing values of IMP_TLSatPct and TLTimeFirst.
Odds Ratio Estimates

Effect              Point Estimate
IMP_TLBalHCPct      6.527
IMP_TLSatPct        0.074
InqFinanceCnt24     1.063
TLDel3060Cnt24      1.399
TLDel60Cnt24        1.119
TLOpenPct           4.799
TLTimeFirst         0.997
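Each point estimate is the exponential of the corresponding parameter estimate in the maximum likelihood table above; two worked examples:

\exp(1.8759) \approx 6.527 \;(\text{IMP\_TLBalHCPct}), \qquad \exp(-0.00253) \approx 0.997 \;(\text{TLTimeFirst})

so each additional unit of IMP_TLBalHCPct multiplies the odds of the target event by about 6.5, while each additional unit of TLTimeFirst multiplies them by about 0.997.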
The iteration plot (found by selecting View => Model => Iteration Plot in the Results window) can be set to show average profit versus iteration, with train and validation average profit for TARGET plotted by model selection step number.

In theory, the average profit for a model using the defined profit matrix equals 1 + KS statistic. Thus, the iteration plot (from the Regression node's Results window) showed how the profit (or, in turn, the KS statistic) varied with model complexity. From the plot, the maximum validation profit equaled 1.43, which implies that the maximum KS statistic equaled 0.43.

The actual calculated value of KS (as found using the Model Comparison node) was found to differ slightly from this value (see below).
Creating Prediction Models: Neural Network

While it was not possible to deploy it as the final prediction model, a neural network was used to investigate regression lack of fit. The default settings of the Neural Network node were used in combination with inputs selected by the Stepwise Regression node. The iteration plot showed slightly higher validation average profit compared to the stepwise regression model.

It was possible (although not likely) that transformations to the regression inputs could improve regression prediction.
Creating Prediction Models: Transformed Stepwise Regression

In assaying the data, it was noted that some of the inputs had rather skewed distributions. Such distributions create high leverage points that can distort an input's association with the target. The Transform Variables node was used to regularize the distributions of the model inputs before fitting the stepwise regression.
The Transform Variables node was set to maximize the normality of each interval input by selecting from one of several power and logarithmic transformations (Interval Inputs = Maximum Normal in the node's Train properties).
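The transformed inputs that the node produced (the LOG_ and SQRT_ prefixes in the table below) correspond to ordinary logarithm and square root re-expressions. The DATA step below is only a hand-coded approximation: the data set names are placeholders, the +1 offset is an assumption to keep the logarithm defined at zero counts, and the node's Maximum Normal search makes its own choice of transformation and offset for each input.

data work.credit_trans;
   set work.credit_imputed;
   /* logarithmic and square root re-expressions of skewed inputs */
   LOG_InqFinanceCnt24 = log(InqFinanceCnt24 + 1);
   LOG_TLDel60Cnt24    = log(TLDel60Cnt24 + 1);
   SQRT_TLDel3060Cnt24 = sqrt(TLDel3060Cnt24);
   SQRT_TLTimeFirst    = sqrt(TLTimeFirst);
run;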
The Transformed Stepwise Regression node performed stepwise selection from the transformed inputs. The selected model had many of the same inputs as the original stepwise regression model, but on a transformed (and difficult to interpret) scale.

Odds Ratio Estimates

Effect                  Point Estimate
IMP_TLBalHCPct          4.237
IMP_TLSatPct            0.090
LOG_InqFinanceCnt24     9.648
LOG_TLDel60Cnt24        9.478
SQRT_IMP_TL75UtilCnt    5.684
SQRT_TLDel3060Cnt24     4.708
SQRT_TLTimeFirst        0.050
The transformations would be justified (despite the increased difficulty in model interpretation) if they resulted in significant improvement in model fit. Based on the profit calculation, the transformed stepwise regression model showed only marginal performance improvement compared to the original stepwise regression model.

Creating Prediction Models: Discretized Stepwise Regression

Partitioning input variables into discrete ranges was another common risk-modeling method that was investigated.
Three discretization approaches were investigated. The Bucket Input Variables node partitioned each interval input into four bins with equal widths. The Bin Input Variables node partitioned each interval input into four bins with equal sizes. The Optimal Discrete Input Variables node found optimal partitions for each input variable using decision tree methods.

Bucket Transformation

The relatively small size of the CREDIT data set resulted in problems for the bucket stepwise regression model. Many of the bins had a small number of observations, which resulted in quasi-complete separation problems for the regression model, as dramatically illustrated by the selected model's odds ratio report. Go to line 1059 of the Output window.

Odds Ratio Estimates
Effect                                                            Point Estimate
BIN_IMP_TL75UtilCnt   01: low-5 vs 04: 15-high                    999.000
BIN_IMP_TL75UtilCnt   02: 5-10 vs 04: 15-high                     999.000
BIN_IMP_TL75UtilCnt   03: 10-15 vs 04: 15-high                    999.000
BIN_IMP_TLBalHCPct    01: low-0.840325 vs 04: 2.520975-high       <0.001
BIN_IMP_TLBalHCPct    02: 0.840325-1.68065 vs 04: 2.520975-high   <0.001
BIN_IMP_TLBalHCPct    03: 1.68065-2.520975 vs 04: 2.520975-high   <0.001
BIN_IMP_TLSatPct      01: low-0.25 vs 04: 0.75-high               4.845
BIN_IMP_TLSatPct      02: 0.25-0.5 vs 04: 0.75-high               1.819
BIN_IMP_TLSatPct      03: 0.5-0.75 vs 04: 0.75-high               1.009
BIN_InqFinanceCnt24   01: low-9.75 vs 04: 29.25-high              0.173
BIN_InqFinanceCnt24   02: 9.75-19.5 vs 04: 29.25-high             0.381
BIN_InqFinanceCnt24   03: 19.5-29.25 vs 04: 29.25-high            0.640
BIN_TLDel3060Cnt24    01: low-2 vs 04: 6-high                     999.000
BIN_TLDel3060Cnt24    02: 2-4 vs 04: 6-high                       999.000
BIN_TLDel60CntAll     01: low-4.75 vs 04: 14.25-high              0.171
BIN_TLDel60CntAll     02: 4.75-9.5 vs 04: 14.25-high              0.138
BIN_TLDel60CntAll     03: 9.5-14.25 vs 04: 14.25-high             0.166
BIN_TLTimeFirst       01: low-198.75 vs 04: 584.25-high           999.000
BIN_TLTimeFirst       02: 198.75-391.5 vs 04: 584.25-high         999.000
BIN_TLTimeFirst       03: 391.5-584.25 vs 04: 584.25-high         999.000

The iteration plot showed substantially worse performance compared to the other modeling efforts.
'^Iteration Plot Average Pi of it
i 0
MEM
1
1
1
2
4
6
'
r 8
Model Selection Step Number Train: Average Profit tor TARGET Valid: Average Profit for TARGET i
ii
in i
ii i
1
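As promised above, here is a rough Base SAS sketch of the two simpler binning ideas: equal-width (bucket) bins computed from the observed range, and equal-sized (quantile) bins computed with PROC RANK. The data set and input names are assumptions for illustration; the Transform Variables node performs these groupings internally.

/* Hedged sketch: four equal-width buckets for one input (assumed names). */
proc sql noprint;
   select min(TLBalHCPct), max(TLBalHCPct)
      into :xmin, :xmax
      from work.credit_train;
quit;

data work.bucketed;
   set work.credit_train;
   width = (&xmax - &xmin) / 4;                     /* equal bin width     */
   if not missing(TLBalHCPct) then
      BIN_TLBalHCPct = min(4, 1 + floor((TLBalHCPct - &xmin) / width));
run;

/* Four equal-sized (quantile) bins for the same input. */
proc rank data=work.credit_train groups=4 out=work.quantiled;
   var TLBalHCPct;
   ranks PCTL_TLBalHCPct;                           /* quantile group 0-3  */
run;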
Bin (or Quantile) Transformation

Somewhat better results were seen with the binned stepwise regression model. By ensuring that each bin included a reasonable number of cases, more stable model parameter estimates could be made. See line 1249 of the Output window.

Odds Ratio Estimates

                                                                      Point
Effect                                                             Estimate

PCTL_IMP_TLBalHCPct    01: low-0.513        vs 04: 0.8389-high        0.272
PCTL_IMP_TLBalHCPct    02: 0.513-0.7041     vs 04: 0.8389-high        0.452
PCTL_IMP_TLBalHCPct    03: 0.7041-0.8389    vs 04: 0.8389-high        0.630
PCTL_IMP_TLSatPct      01: low-0.3529       vs 04: 0.6886-high        1.860
PCTL_IMP_TLSatPct      02: 0.3529-0.5333    vs 04: 0.6886-high        1.130
PCTL_IMP_TLSatPct      03: 0.5333-0.6886    vs 04: 0.6886-high        1.040
PCTL_InqFinanceCnt24   01: low-1            vs 04: 5-high             0.599
PCTL_InqFinanceCnt24   02: 1-2              vs 04: 5-high             0.404
PCTL_InqFinanceCnt24   03: 2-5              vs 04: 5-high             0.807
PCTL_TLDel3060Cnt24    02: 0-1              vs 03: 1-high             0.453
PCTL_TLDel60Cnt24      02: 0-1              vs 03: 1-high             0.357
PCTL_TLTimeFirst       01: low-107          vs 04: 230-high           1.688
PCTL_TLTimeFirst       02: 107-152          vs 04: 230-high           1.477
PCTL_TLTimeFirst       03: 152-230          vs 04: 230-high           0.837
84
The improved model fit was also seen in the iteration plot, although the average profit of the selected model was still not as large as that of the original stepwise regression model.

[Iteration plot: average profit (train and validation) by model selection step number for the quantile stepwise regression.]
Optimal Transformation

A final attempt at discretization was made using the optimistically named Optimal Discrete transformation. The final 18-degree-of-freedom model included 10 separate inputs (more than any other model). Contents of the Output window starting at line 1698 are shown below.
Odds Ratio Estimates

                                                                          Point
Effect                                                                 Estimate

BanruptcyInd           0 vs 1                                             2.267
OPT_IMP_TL75UtilCnt    01: low-1.5                vs 03: 8.5-high         0.270
OPT_IMP_TL75UtilCnt    02: 1.5-8.5, MISSING       vs 03: 8.5-high         0.409
OPT_IMP_TLBalHCPct     01: low-0.6706, MISSING    vs 04: 1.0213-high      0.090
OPT_IMP_TLBalHCPct     02: 0.6706-0.86785         vs 04: 1.0213-high      0.155
OPT_IMP_TLBalHCPct     03: 0.86785-1.0213         vs 04: 1.0213-high      0.250
OPT_IMP_TLSatPct       01: low-0.2094             vs 03: 0.4655-high      5.067
OPT_IMP_TLSatPct       02: 0.2094-0.4655          vs 03: 0.4655-high      1.970
OPT_InqFinanceCnt24    01: low-2.5, MISSING       vs 03: 7.5-high         0.353
OPT_InqFinanceCnt24    02: 2.5-7.5                vs 03: 7.5-high         0.657
OPT_TLDel3060Cnt24     01: low-1.5, MISSING       vs 02: 1.5-high         0.499
OPT_TLDel60Cnt         01: low-0.5, MISSING       vs 03: 14.5-high        0.084
OPT_TLDel60Cnt         02: 0.5-14.5               vs 03: 14.5-high        0.074
OPT_TLDel60Cnt24       01: low-0.5, MISSING       vs 03: 5.5-high         0.327
OPT_TLDel60Cnt24       02: 0.5-5.5                vs 03: 5.5-high         0.882
OPT_TLTimeFirst        01: low-154.5, MISSING     vs 02: 154.5-high       1.926
TLOpenPct                                                                 3.337

The validation average profit was still slightly smaller than that of the original model. A substantial difference in profit between the training and validation data was also observed. Such a difference was suggestive of overfitting by the model.

[Iteration plot: average profit (train and validation) by model selection step number for the optimal discretized stepwise regression.]
86
Assessing the Prediction Models

The collection of models was assessed using the Model Comparison node. The ROC chart shows a jumble of models with no clear winner.

[ROC chart (Data Role = VALIDATE): sensitivity versus 1 - Specificity for the Baseline, Regression, Bucket Stepwise Regression, Optimal Discretized Stepwise Regression, Neural Network, Transformed Stepwise Regression, and Quantile Stepwise Regression models.]
The Fit Statistics table from the Output window is shown below.

Data Role=Valid

Statistics                                               Reg    Neural      Reg5      Reg2      Reg4      Reg3
Valid: Kolmogorov-Smirnov Statistic                     0.43      0.46      0.42      0.44      0.45      0.39
Valid: Average Profit for TARGET                        1.43      1.42      1.42      1.42      1.41      1.38
Valid: Average Squared Error                            0.12      0.12      0.12      0.12      0.12      0.13
Valid: Roc Index                                        0.77      0.77      0.76      0.78      0.77      0.73
Valid: Average Error Function                           0.38      0.39      0.40      0.38      0.39      0.43
Valid: Percent Capture Response                        14.40     12.00     11.60     14.40     12.64      9.60
Valid: Divisor for VASE                              3000.00   3000.00   3000.00   3000.00   3000.00   3000.00
Valid: Error Function                                1152.26   1168.64   1186.46   1131.42   1158.23   1282.59
Valid: Gain                                           180.00    152.00    148.00    192.00    144.89    124.00
Valid: Gini Coefficient                                 0.54      0.54      0.53      0.56      0.54      0.47
Valid: Bin-Based Two-Way Kolmogorov-Smirnov Statistic   0.43      0.44      0.41      0.44      0.45      0.39
Valid: Lift                                             2.88      2.40      2.32      2.88      2.53      1.92
Valid: Maximum Absolute Error                           0.97      0.99      1.00      0.98      0.99      1.00
Valid: Misclassification Rate                           0.17      0.17      0.17      0.17      0.17      0.17
Valid: Mean Square Error                                0.12      0.12      0.12      0.12      0.12      0.13
Valid: Sum of Frequencies                            1500.00   1500.00   1500.00   1500.00   1500.00   1500.00
Valid: Total Profit for TARGET                       2143.03   2131.02   2127.45   2127.44   2121.42   2072.25
Valid: Root Average Squared Error                       0.35      0.35      0.35      0.34      0.35      0.36
Valid: Percent Response                                48.00     40.00     38.67     48.00     42.13     32.00
Valid: Root Mean Square Error                           0.35      0.35      0.35      0.34      0.35      0.36
Valid: Sum of Square Errors                           359.70    367.22    371.58    352.69    366.76    381.44
Valid: Sum of Case Weights Times Freq                3000.00   3000.00   3000.00   3000.00   3000.00   3000.00
The best model, as measured by average profit, was the original regression. The neural network had the highest Kolmogorov-Smirnov (KS) statistic. The transformed regression, Reg2, had the highest ROC index, the most relevant rank-order statistic if the purpose of a credit risk model is to order the cases. In short, the best model for deployment was as much a matter of taste as of statistical performance, and the relatively small validation data set used to compare the models did not produce a clear winner. In the end, the original stepwise regression was selected for deployment because it offered consistently good performance across multiple assessment measures.
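For readers who want to reproduce the profit criterion outside SAS Enterprise Miner, the sketch below shows how a validation average profit could be computed from scored data and a decision/profit matrix. The data set, column names, and profit values here are assumptions for illustration; the actual profit definition for this analysis lives in the project's decision processing, not in this code.

/* Hedged sketch: average profit on validation data from a profit matrix.   */
/* Assumed columns: TARGET (0/1) and p_target1 (predicted P(TARGET=1)).     */
%let profit_d1_t1 = 3;     /* profit of decision 1 when target = 1 (assumed) */
%let profit_d1_t0 = -1;    /* profit of decision 1 when target = 0 (assumed) */
%let profit_d0_t1 = 0;
%let profit_d0_t0 = 0;

data work.profit;
   set work.valid_scored;                 /* assumed scored validation data */
   /* choose the decision with the larger expected profit */
   ep_d1 = p_target1*&profit_d1_t1 + (1-p_target1)*&profit_d1_t0;
   ep_d0 = p_target1*&profit_d0_t1 + (1-p_target1)*&profit_d0_t0;
   decision = (ep_d1 > ep_d0);
   /* realized profit of the chosen decision for this case */
   if decision then profit = ifn(TARGET=1, &profit_d1_t1, &profit_d1_t0);
   else             profit = ifn(TARGET=1, &profit_d0_t1, &profit_d0_t0);
run;

proc means data=work.profit mean;
   var profit;                            /* mean = validation average profit */
run;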
88
§sas

Introduction to Predictive Modeling: Decision Trees

THE POWER TO KNOW.

Predictive Modeling Applications
• Database marketing
• Financial risk management
• Fraud detection
• Process monitoring
• Pattern detection

Predictive Modeling: Training Data

training data case: categorical or numeric input and target measurements

Predictive Model

predictive model: a concise representation of the input and target association

predictions: output of the predictive model given a set of input measurements

Modeling Essentials
• Predict new cases
• Select useful inputs
• Optimize complexity

Three Prediction Types
• decisions
• rankings
• estimates

Decision Predictions

Predictive model uses input measurements to make the best decision for each case (for example, primary, secondary, or tertiary).

Ranking Predictions

Predictive model uses input measurements to optimally rank each case (for example, scores of 720, 630, 580, 520, 470).

Estimate Predictions

Predictive model uses input measurements to optimally estimate the target value.

Modeling Essentials - Predict Review

Predict new cases: decide, rank, or estimate.
94
Modeling Essentials

Select useful inputs

The Curse of Dimensionality

1-D, 2-D, 3-D

Input Reduction Strategies
• Redundancy
• Irrelevancy

Input Reduction - Redundancy

Input x2 has the same information as input x1.

Input Reduction - Irrelevancy

Predictions change with input x3 but much less with input x4.

Modeling Essentials - Select Review

Select useful inputs: eradicate redundancies and irrelevancies.

Modeling Essentials

Optimize complexity

Model Complexity
Data Partitioning

Partition available data into training and validation sets.

Predictive Model Sequence

Create a sequence of models with increasing complexity.

Model Performance Assessment

Rate model performance using validation data.

Model Selection

Select the simplest model with the highest validation assessment.

Modeling Essentials - Optimize Review

Optimize complexity: tune models with validation data.
Creating Training and Validation Data

This demonstration illustrates how to partition a data source into training and validation sets.
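As a point of reference, a comparable random split can be drawn outside the Data Partition node with PROC SURVEYSELECT. The data set name and the 60/40 proportions below are assumptions for illustration, and this sketch does not stratify on the target the way the node typically does for class targets.

/* Hedged sketch: random 60/40 training/validation split (assumed data).  */
proc surveyselect data=work.modeldata out=work.flagged
                  samprate=0.6 outall seed=12345;   /* Selected=1 marks ~60% */
run;

data work.train work.validate;
   set work.flagged;
   if selected then output work.train;
   else             output work.validate;
run;

Stratifying the split on the target helps keep the target proportions similar in both partitions, which is especially important for rare events.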
Model Essentials - Decision Trees
• Predict new cases: prediction rules
• Select useful inputs: split search
• Optimize complexity: pruning

Simple Prediction Illustration

Predict the dot color for each x1 and x2.

[Scatter plot of the training data: x2 versus x1 on the unit square, with cases shown in two dot colors.]

Decision Tree Prediction Rules

[Illustration: the x1-x2 square partitioned by the tree's prediction rules; each region carries a decision (the predicted color) and an estimate (for example, 0.70).]
Decision Tree Split Search

Calculate the logworth of every partition on input x1. [Illustration: a candidate split at x1 = 0.52 with its left/right confusion matrix.]

Calculate the logworth of every partition on input x2. [Illustration: a candidate split at x2 = 0.63, with bottom/top proportions of 54%/46% and 35%/65%; maximum logworth 4.92.]

Compare partition logworth ratings: the best x1 partition has logworth 0.95, and the best x2 partition has logworth 4.92.

Create a partition rule from the best partition across all inputs: x2 < 0.63 versus x2 >= 0.63.

Create a second partition rule: x1 < 0.52 versus x1 >= 0.52.

Repeat to form a maximal tree.
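The logworth used to rate candidate splits is -log10 of the chi-square p-value for the two-way table of branch membership by target. The sketch below computes it for one assumed candidate split (split_flag) on an assumed training data set; it is illustrative and not the tree algorithm's internal code.

/* Hedged sketch: logworth of a single candidate split.                    */
data work.flagged;
   set work.train;                       /* assumed training data            */
   split_flag = (x2 >= 0.63);            /* assumed candidate partition rule */
run;

proc freq data=work.flagged;
   tables split_flag*target / chisq;
   output out=work.chi pchi;             /* writes P_PCHI (chi-square p-value) */
run;

data work.logworth;
   set work.chi;
   logworth = -log10(p_pchi);            /* logworth = -log10(p-value) */
run;

proc print data=work.logworth;
   var logworth;
run;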
108
Model Essentials - Decision Trees
• Select useful inputs: split search
• Optimize complexity: pruning

The Maximal Tree

Create a sequence of models with increasing complexity. The maximal tree is the most complex model in the sequence.

Pruning One Split

The next model in the sequence is formed by pruning one split from the maximal tree.

Subsequent Pruning

Continue pruning until all subtrees are considered.

Validation Assessment

Choose the simplest model with the highest validation assessment. What are appropriate validation assessment ratings?

Assessment Statistics

Ratings depend on the target measurement and the prediction type.
Binary Targets

Predictions for a binary target can be decisions, rankings, or estimates.

Decision Optimization

Decision Optimization - Accuracy

Maximize accuracy: agreement between outcome and prediction (true positives and true negatives).

Decision Optimization - Misclassification

Minimize misclassification: disagreement between outcome and prediction (false positives and false negatives).

Ranking Optimization

Ranking Optimization - Concordance

Maximize concordance: proper ordering of primary and secondary outcomes (target=0 gets a low score, target=1 gets a high score).

Ranking Optimization - Discordance

Minimize discordance: improper ordering of primary and secondary outcomes (target=0 gets a high score, target=1 gets a low score).

Estimate Optimization

Estimate Optimization - Squared Error

Minimize squared error: the squared difference between target and prediction, (target - estimate)^2.

Complexity Optimization - Summary
• decisions: accuracy / misclassification
• rankings: concordance / discordance
• estimates: squared error
This demonstration illustrates how to assess a tree model.
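For concreteness, the sketch below computes two of the validation assessment ratings named above, misclassification rate and average squared error, from an assumed scored validation data set with a 0/1 target and a predicted probability column. The data set and column names are assumptions, not output produced by the demonstration.

/* Hedged sketch: misclassification rate and average squared error.       */
data work.assess;
   set work.valid_scored;                /* assumed scored validation data  */
   decision = (p_target1 >= 0.5);        /* simple accuracy-style cutoff    */
   misclassified = (decision ne target);
   sq_error = (target - p_target1)**2;   /* squared error of the estimate   */
run;

proc means data=work.assess mean;
   var misclassified sq_error;           /* means = misclassification rate, ASE */
run;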
120
Growing Trees Autonomously

This demonstration illustrates growing a decision tree model autonomously.

Autonomous Decision Tree Defaults
• Maximum Branches: 2
• Splitting Rule Criterion: Logworth
• Subtree Method: Average Profit
• Tree Size Options: Bonferroni Adjust, Split Adjust, Maximum Depth, Leaf Size
121
Tree Variations: Maximum Branches
• Trades height for width
• Complicates split search
• Uses heuristic shortcuts

[Properties panel callouts: maximum branches in split; exhaustive search size limit.]

Tree Variations: Splitting Rule Criterion
• Yields similar splits
• Grows enormous trees
• Favors many-level inputs

[Properties panel: split criterion options. Categorical criteria: Chi-square logworth, Entropy, Gini. Interval criteria: Variance, Prob-F logworth. Logworth adjustment settings.]

Tree Variations: Subtree Method
• Controls complexity

[Properties panel: subtree methods (Assessment, Do not prune, Prune to size) and pruning metrics (Decision, Average Square Error, Misclassification, Lift).]

Tree Variations: Tree Size Options
• Avoids orphan nodes
• Controls sensitivity
• Grows large trees

[Properties panel callouts: logworth threshold, maximum tree depth, minimum leaf size, threshold depth adjustment.]
125
§sas

Case Study IV: University Enrollment Prediction

THE POWER TO KNOW.

A.4 Enrollment Management Case Study

Case Study Description

In the fall of 2004, the administration of a large private university requested that the Office of Enrollment Management and the Office of Institutional Research work together to identify prospective students who would most likely enroll as new freshmen in the Fall 2005 semester. The administration stated several goals for this project:
• increase new freshman enrollment
• increase diversity
• increase SAT scores of entering students

Historically, inquiries numbered roughly 90,000 students, and the university enrolled from 2,400 to 2,800 new freshmen each Fall semester.

The diagram containing this analysis is stored as an XML file on the course data disk. You can open this file by right-clicking Diagrams and selecting Import Diagram from XML in SAS Enterprise Miner. All nodes in the opened file, except the data node, contain the property settings outlined in this case study. If you want to run the diagram, you need to re-create the case study data set using the metadata settings indicated below.
127
Case Study Training Data

Name                  Model Role   Measurement Level   Description
ACADEMIC_INTEREST_1   Rejected     Nominal             Primary academic interest code
ACADEMIC_INTEREST_2   Rejected     Nominal             Secondary academic interest code
CAMPUS_VISIT          Input        Nominal             Campus visit code
CONTACT_CODE1         Rejected     Nominal             First contact code
CONTACT_DATE1         Rejected     Nominal             First contact date
ETHNICITY             Rejected     Nominal             Ethnicity
ENROLL                Target       Binary              1=Enrolled F2004, 0=Not enrolled F2004
IRSCHOOL              Rejected     Nominal             High school code
INSTATE               Input        Binary              1=In state, 0=Out of state
LEVEL_YEAR            Rejected     Unary               Student academic level
REFERRAL_CNTCTS       Input        Ordinal             Referral contact count
SELF_INIT_CNTCTS      Input        Interval            Self-initiated contact count
SOLICITED_CNTCTS      Input        Ordinal             Solicited contact count
TERRITORY             Input        Nominal             Recruitment area
TOTAL_CONTACTS        Input        Interval            Total contact count
TRAVEL_INIT_CNTCTS    Input        Ordinal             Travel-initiated contact count
AVG_INCOME            Input        Interval            Commercial HH income estimate
DISTANCE              Input        Interval            Distance from university
HSCRAT                Input        Interval            5-year high school enrollment rate
INIT_SPAN             Input        Interval            Time from first contact to enrollment date
INT1RAT               Input        Interval            5-year primary interest code rate
INT2RAT               Input        Interval            5-year secondary interest code rate
INTEREST              Input        Ordinal             Number of indicated extracurricular interests
MAILQ                 Input        Ordinal             Mail qualifying score (1=very interested)
PREMIERE              Input        Binary
SATSCORE              Rejected     Interval
SEX                   Rejected     Binary              Sex
STUEMAIL              Input        Binary              1=Have e-mail address, 0=Do not
TELECQ                Rejected     Ordinal             Telecounseling qualifying score (1=very interested)
The Office of Institutional Research assumed the task of building a predictive model, and the Office of Enrollment Management served as consultant to the project. The Office of Institutional Research built and maintained a data warehouse that contained information about enrollment for the past six years. It was decided that inquiries for Fall 2004 would be used to build the model to help shape the Fall 2005 freshman class.

The data set Inq2005 was built over a period of several months in consultation with Enrollment Management. The data set included variables that could be classified as demographic, financial, number of correspondences, student interests, and campus visits. Many variables were created using historical data and trends. For example, high school code was replaced by the percentage of inquirers from that high school over the past five years who enrolled. The resulting data set included over 90,000 observations and over 50 variables. For this case study, the number of variables was reduced.

The data set Inq2005 is in the AAEM library, and the variables are described in the table above. Some of the variables were automatically rejected based on the number of missing values. The nominal variables ACADEMIC_INTEREST_1, ACADEMIC_INTEREST_2, and IRSCHOOL were rejected because they were replaced by the interval variables INT1RAT, INT2RAT, and HSCRAT, respectively. For example, academic interest codes 1 and 2 were replaced by the percentage of inquirers over the past five years who indicated those interest codes and then enrolled. The variable IRSCHOOL is the high school code of the student, and it was replaced by the percentage of inquirers from that high school over the last five years who enrolled. The variables ETHNICITY and SEX were rejected because they cannot be used in admission decisions. Several variables count the various types of contacts the university has with the students.

Accessing and Assaying the Data

A SAS Enterprise Miner data source was defined using the metadata settings indicated above. The StatExplore node was used to provide preliminary statistics on the input variables.
129
The following is extracted from the StatExplore node's Results window:

Class Variable Summary Statistics (maximum 500 observations printed)

Data Role=TRAIN

Data Role  Variable Name        Role    Levels  Missing  Mode  Mode %  Mode2  Mode2 %
TRAIN      CAMPUS_VISIT         INPUT      3       0      0     96.61    1      3.31
TRAIN      Instate              INPUT      2       0      Y     62.04    N     37.96
TRAIN      REFERRAL_CNTCTS      INPUT      6       0      0     96.46    1      3.21
TRAIN      SOLICITED_CNTCTS     INPUT      8       0      0     52.45    1     41.60
TRAIN      TERRITORY            INPUT     12       1      2     15.98    5     15.34
TRAIN      TRAVEL_INIT_CNTCTS   INPUT      7       0      0     67.00    1     29.90
TRAIN      Interest             INPUT      4       0      0     95.01    1      4.62
TRAIN      mailq                INPUT      5       0      5     69.33    2     12.80
TRAIN      premiere             INPUT      2       0      0     97.11    1      2.89
TRAIN      stuemail             INPUT      2       0      0     51.01    1     48.99
TRAIN      Enroll               TARGET     2       0      0     96.86    1      3.14
The class input variables are listed first. Notice that most of the count variables have a high percentage of 0s.

Distribution of Class Target and Segment Variables (maximum 500 observations printed)

Data Role=TRAIN

Data Role  Variable Name  Role     Level  Frequency Count  Percent
TRAIN      Enroll         TARGET   0      88614            96.8650
TRAIN      Enroll         TARGET   1      2868              3.1350

Next is the target distribution. Only 3.1% of the target values are 1s, making a 1 a rare event. Standard practice in this situation is to separately sample the 0s and 1s. The Sample tool, used below, enables you to create a stratified sample in SAS Enterprise Miner.

Interval Variable Summary Statistics (maximum 500 observations printed)

Data Role=TRAIN

Variable           Role    Mean       Std Dev    Non Missing  Missing  Minimum    Median     Maximum   Skewness  Kurtosis
SELF_INIT_CNTCTS   INPUT   1.214119   1.666529   91482        0        0          1          56        2.916263  21.50072
TOTAL_CONTACTS     INPUT   2.166098   1.852537   91482        0        1          2          58        3.062389  19.60427
avg_income         INPUT   47315.33   20608.89   70553        20929    4940       42324      200001    1.258231   1.874903
distance           INPUT   380.4276   397.9788   72014        19468    0.417124   183.5467   4798.899  2.276541   9.369703
hscrat             INPUT   0.037652   0.057399   91482        0        0          0.033333   1         7.021978  93.31547
init_span          INPUT   19.68616   8.722109   91482        0        -216       19         228       0.758461  10.43657
int1rat            INPUT   0.037091   0.024026   91482        0        0          0.042105   1         3.496845  74.08503
int2rat            INPUT   0.042896   0.025244   91482        0        0          0.05667    1         3.215683  56.32374

Finally, interval variable summary statistics are presented. Notice that avg_income and distance have missing values.
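The same kinds of summaries can be produced outside the StatExplore node with PROC FREQ and PROC MEANS. The library and data set name below match the case study (AAEM.INQ2005); the particular variables listed are an illustrative selection, not a required set.

/* Hedged sketch: class and interval summaries similar to StatExplore.    */
proc freq data=aaem.inq2005;
   tables Enroll Instate CAMPUS_VISIT stuemail / missing;   /* class variables */
run;

proc means data=aaem.inq2005 n nmiss mean std min median max skew kurt;
   var SELF_INIT_CNTCTS TOTAL_CONTACTS avg_income distance
       hscrat init_span int1rat int2rat;                    /* interval inputs */
run;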
130
The Explore window was used to study the distribution of the interval variables.

[Explore window: a sample data table and histograms of TOTAL_CONTACTS, SELF_INIT_CNTCTS, avg_income, distance, init_span, int1rat, int2rat, and hscrat.]

The apparent skewness of all inputs suggests that some transformations might be needed for regression models.
131
Creating a Training Sample

Cases from each target level were separately sampled. All cases with the primary outcome were selected. For each primary outcome case, seven secondary outcome cases were selected. This created a training sample with a 12.5% overall enrollment rate. The Sample tool was used to create a training sample for subsequent modeling.

To create the sample as described, the following modifications were made to the Sample node's properties panel:

1. Type 100 as the Percentage value (in the Size property group).

2. Select Criterion ⇒ Level Based (in the Stratified property group).

3. Type 12.5 as the Sample Proportion value (in the Level Based Options property group).

[Sample node properties panel: Sample Method = Random, Type = Percentage, Percentage = 100.0, Criterion = Level Based, Level Selection = Event, Level Proportion = 100.0, Sample Proportion = 12.5.]
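A comparable level-based sample can be drawn in Base SAS with PROC SURVEYSELECT, taking all 2,868 enrollees and 20,076 non-enrollees (seven per enrollee) so that the primary outcome makes up 12.5% of the sample. This is a sketch of the idea under those counts, not the Sample node's implementation.

/* Hedged sketch: stratified sample with all events and a 7:1 ratio of     */
/* non-events, yielding a 12.5% event rate.                                */
proc sort data=aaem.inq2005 out=work.inq2005;
   by Enroll;                               /* strata must be sorted        */
run;

proc surveyselect data=work.inq2005 out=work.inq_sample
                  method=srs sampsize=(20076 2868) seed=12345;
   strata Enroll;                           /* Enroll = 0, then Enroll = 1  */
run;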
The Sample node Results window shows that all primary outcome cases are selected, along with enough secondary outcome cases to achieve the 12.5% primary outcome proportion.

Summary Statistics for Class Targets (maximum 500 observations printed)

Data=DATA

Variable   Numeric Value   Formatted Value   Frequency Count   Percent
Enroll     0               0                 88614             96.8650
Enroll     1               1                 2868               3.1350

Data=SAMPLE

Variable   Numeric Value   Formatted Value   Frequency Count   Percent
Enroll     0               0                 20076             87.5
Enroll     1               1                 2868              12.5
Configuring Decision Processing

The primary purpose of the predictions was decision optimization and, secondarily, ranking. An applicant was considered a good candidate if his or her probability of enrollment was higher than average. Because of the Sample node, decision information consistent with these objectives could not be entered in the data source node. Instead, the Decisions tool was incorporated into the analysis.

[Process flow diagram: INQ2005 data source, StatExplore, Sample, and Decisions nodes.]
133
These steps were followed to configure the Decisions node:

1. In the Properties panel of the Decisions node, set Decisions to Custom. Then select the Custom Editor ellipsis (...) button. After the analysis path is updated, the Decision Processing window appears. (The Targets tab shows Name: Enroll, Measurement Level: Binary, Target level order: Descending, Event level: 1.)

2. Select the Decisions tab.

3. Select Default with Inverse Prior Weights. (The Decisions tab sets "Do you want to use the decisions?" to Yes and lists DECISION1 and DECISION2.)

4. Select the Decision Weights tab. Select Maximize as the decision function, and enter the weight values for the decisions: for target level 1, DECISION1 = 8.0 and DECISION2 = 0.0; for target level 0, DECISION1 = 0.0 and DECISION2 = 1.142857.
The nonzero values used in the decision matrix are the inverses of the prior probabilities (1/0.125 = 8 and 1/0.875 = 1.142857). Such a decision matrix, sometimes referred to as the central decision rule, forces a primary decision whenever the estimated primary outcome probability for a case exceeds the primary outcome prior probability (0.125 in this case).
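Applied to a scored case, the inverse-prior weights reduce to a simple threshold rule, as the short sketch below illustrates: deciding "enroll" pays 8 x p and deciding "not enroll" pays 1.142857 x (1 - p), so the enroll decision wins exactly when p exceeds 0.125. The data set and probability column names are assumptions for illustration.

/* Hedged sketch: the central decision rule implied by inverse-prior weights. */
data work.decided;
   set work.scored;                           /* assumed scored data            */
   ev_enroll    = 8        * p_enroll1;       /* expected profit, decision 1    */
   ev_notenroll = 1.142857 * (1 - p_enroll1); /* expected profit, decision 0    */
   decision = (ev_enroll > ev_notenroll);     /* equivalent to p_enroll1 > 0.125 */
run;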
135
Creating Prediction Models (All Cases)

Two rounds of predictive modeling were performed. In the first round, all cases were considered for model building. From the Decisions node, partitioning, imputation, modeling, and assessment were performed. The completed analysis appears as shown.

If the Stepwise Regression model is not connected to the Model Comparison node, you might have to first delete the connections from the Instate Regression and Neural Network nodes to the Model Comparison node. Then connect the Stepwise Regression, Neural Network, and Regression nodes, in that order, to the Model Comparison node.

[Process flow diagram: the completed all-cases analysis.]
• The Data Partition node used 60% of the data for training and 40% for validation.
• The Impute node used the Tree method for both class and interval variables. Unique missing indicator variables were also selected and used as inputs.
• The stepwise regression model was used as a variable selection method for the Neural Network and second Regression nodes.
• The Regression node labeled Instate Regression included the variables from the Stepwise Regression node plus the variable INSTATE. It was felt that prospective students behave differently based on whether they are in state or out of state.

In this implementation of the case study, the Stepwise Regression node selected three inputs: the high school enrollment rate (HSCRAT), the self-initiated contact count (SELF_INIT_CNTCTS), and the student e-mail indicator (STUEMAIL). The model output is shown below.

Analysis of Maximum Likelihood Estimates

                                 Standard        Wald                  Standardized
Parameter         DF   Estimate     Error    Chi-Square   Pr > ChiSq       Estimate   Exp(Est)
INTERCEPT          1   -12.1422   18.9832         0.41       0.5224                      0.000
SELF_INIT_CNTCTS   1     0.6895    0.0203      1156.19       <.0001         0.8773       1.993
HSCRAT             1    16.4261    0.8108       410.46       <.0001         0.7506     999.000
STUEMAIL 0         1    -7.7776   18.9824         0.17       0.6820                      0.000

Odds Ratio Estimates

                            Point
Effect                   Estimate
SELF_INIT_CNTCTS            1.993
HSCRAT                    999.000
STUEMAIL   0 vs 1          <0.001

The unusual odds ratio estimates for HSCRAT and STUEMAIL result from an extremely strong association in those inputs. For example, certain high schools had all applicants or no applicants enroll. Likewise, very few students enrolled who did not provide an e-mail address.
136
Adding the INSTATE input in the Instate Regression model changed the significance of the inputs selected by the stepwise regression model. The input STUEMAIL is no longer statistically significant after the INSTATE input is included.

Analysis of Maximum Likelihood Estimates

                                 Standard        Wald                  Standardized
Parameter         DF   Estimate     Error    Chi-Square   Pr > ChiSq       Estimate   Exp(Est)
INTERCEPT          1   -12.0541   16.7449         0.52       0.4716                      0.000
INSTATE N          1    -0.4145    0.0577        51.67       <.0001                      0.661
SELF_INIT_CNTCTS   1     0.6889    0.0196      1233.22       <.0001         0.8231       1.992
HSCRAT             1    16.2327    0.7553       461.95       <.0001         0.7142     999.000
STUEMAIL 0         1    -7.3528   16.7443         0.19       0.6606                      0.001

Odds Ratio Estimates

                            Point
Effect                   Estimate
INSTATE    N vs Y           0.437
SELF_INIT_CNTCTS            1.992
HSCRAT                    999.000
STUEMAIL   0 vs 1          <0.001
A slight increase in validation profit (the criterion used to tune models) was found using the neural network model. The decision tree provides insight into the strength of the model fit. The Subtree Assessment plot shows that the highest profit occurs with 17 leaves. Most of the predictive performance, however, is provided by the initial splits.

[Subtree Assessment plot: average profit (train and validation) for Enroll by number of leaves.]

A simpler tree was scrutinized to aid in interpretation. The tree model was rerun with properties changed as follows to produce a tree with three leaves: Method = N, Number of Leaves = 3.

[Three-leaf decision tree. Root node: 87.5% / 12.5% (train). The first split is on SELF_INIT_CNTCTS at 3.5: the < 3.5 branch is roughly 96.5% non-enrolled, and the >= 3.5 branch is roughly 65% enrolled. The < 3.5 branch splits again on SELF_INIT_CNTCTS at 2.5: the < 2.5 leaf is roughly 98% non-enrolled, and the >= 2.5 leaf is roughly 26% enrolled.]

Students with three or fewer self-initiated contacts rarely enrolled (as seen in the left leaf of the first split). Enrollment was even rarer for students with two or fewer self-initiated contacts (as seen in the left leaf of the second split). Notice that the primary target percentage is rounded down. Also notice that most of the secondary target cases can be found in the lower left leaf. The decision tree results shown in the rest of this case study are generated by the original, 17-leaf tree.
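A comparable small tree can be grown outside SAS Enterprise Miner with PROC HPSPLIT. The sketch below caps the depth at 2 to force a very small tree rather than reproducing the node's Method = N, Number of Leaves = 3 setting, and the data set name is an assumption.

/* Hedged sketch: a small decision tree outside SAS Enterprise Miner.      */
proc hpsplit data=work.inq_sample maxdepth=2 seed=12345;
   class Enroll stuemail;                   /* categorical target and input */
   model Enroll = SELF_INIT_CNTCTS hscrat stuemail;
   grow entropy;                            /* splitting criterion          */
   prune costcomplexity;                    /* prune the grown tree         */
run;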
138
Assessing the Prediction Models

Model performance was compared in the Model Comparison node.

If the Stepwise Regression model does not appear in the ROC chart, it might not be connected to the Model Comparison node. You might have to first delete the connections from the Instate Regression and Neural Network nodes to the Model Comparison node, then connect the Stepwise Regression, Neural Network, and Regression nodes, in that order, to the Model Comparison node and rerun it to make all models visible.

[ROC chart for Enroll (Data Role = VALIDATE): Baseline, Neural Network, Instate Regression, Stepwise Regression, and Decision Tree models.]

The validation ROC chart showed extremely good performance for all models. The neural model seemed to have a slight edge over the other models. This was mirrored in the Fit Statistics table (abstracted below to show only the validation performance).
139
Data Role=Valid

Statistics                                  Neural       Tree        Reg       Reg2
Valid: Kolmogorov-Smirnov Statistic           0.89       0.88       0.87       0.86
Valid: Average Profit for Enroll              1.88       1.88       1.87       1.86
Valid: Average Squared Error                  0.04       0.04       0.04       0.04
Valid: Roc Index                              0.98       0.96       0.98       0.98
Valid: Percent Capture Response              30.94      30.95      29.72      29.55
Valid: Divisor for VASE                   18356.00   18356.00   18356.00   18356.00
Valid: Gain                                 576.37     519.90     552.83     552.83
Valid: Sum of Frequencies                  9178.00    9178.00    9178.00    9178.00
Valid: Total Profit for Enroll            17285.71   17256.00   17122.29   17099.43
Valid: Root Average Squared Error             0.19       0.20       0.20       0.20
Valid: Percent Response                      77.34      77.36      74.29      73.85
It should be noted that an ROC index of 0.98 needed careful consideration because it suggested a near-perfect separation of the primary and secondary outcomes. The decision tree model provides some insight into this apparently outstanding model fit. Self-initiated contacts are critical to enrollment: fewer than three self-initiated contacts almost guarantees non-enrollment.

Creating Prediction Models (Instate-Only Cases)

A second round of analysis was performed on instate-only cases. The analysis sample was reduced using the Filter node. The Filter node was attached to the Decisions node, as shown below.
140
The following configuration steps were applied:

1. In the Filter Out of State node, select Default Filtering Method ⇒ None for both the class and interval variables.

[Filter node properties panel: Default Filtering Method = None for class and interval variables, Keep Missing Values = Yes.]

2. Select Class Variables ⇒ the ellipsis (...) button. After the path is updated, the Interactive Class Filter window appears.

3. Select Generate Summary and then select Yes to generate summary statistics.

4. Select Instate. The Interactive Class Filter window is updated to show the distribution of the Instate input.

[Interactive Class Filter window: a bar chart of the Instate input (values Y and N), with Apply Filter and Clear Filter buttons and a variable list.]

5. Select the N bar and select Apply Filter.

6. Select OK to close the Interactive Class Filter window.

7. Run the Filter node and view the results.

Excluded Class Values (maximum 500 observations printed)

Variable   Role    Level   Train Count   Train Percent   Filter Method
Instate    INPUT   N       8200          35.7392         MANUAL

All out-of-state cases were filtered from the analysis. After filtering, an analysis similar to the one above was conducted with stepwise regression, neural network, and decision tree models.
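Outside the Filter node, the same subsetting is a one-step WHERE clause; the input data set name below is an assumption.

/* Hedged sketch: keep only instate cases (Instate = 'Y').                 */
data work.inq_instate;
   set work.inq_sample;          /* assumed sampled training data           */
   where Instate = 'Y';
run;

proc freq data=work.inq_instate;
   tables Instate;               /* confirm that only Y remains             */
run;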
142
The partial diagram (after the Filter node) is shown below:

[Process flow diagram: the Filter Out of State node feeds Impute (2), Neural Network (2), Decision Tree, and Model Comparison (2) nodes in the instate-only modeling flow.]

As for the models in this subset analysis, the Instate Stepwise Regression model selected two of the same inputs found in the first round of modeling: SELF_INIT_CNTCTS and STUEMAIL.

Analysis of Maximum Likelihood Estimates

                                 Standard        Wald                  Standardized
Parameter         DF   Estimate     Error    Chi-Square   Pr > ChiSq       Estimate   Exp(Est)
Intercept          1   -10.1372   20.8246         0.24       0.6264                      0.000
SELF_INIT_CNTCTS   1     0.7185    0.0210      1174.00       <.0001         0.9297       2.052
stuemail 0         1    -6.8602   20.8245         0.11       0.7418                      0.001

The Instate decision tree showed a structure similar to the decision tree model from the first round. The tree with the highest validation profit had 20 leaves. The best five-leaf tree, whose validation profit is 97% of the selected tree's, is shown below.

[Five-leaf decision tree diagram: the root splits on SELF_INIT_CNTCTS at 3.5; subsequent splits use SELF_INIT_CNTCTS at 2.5 and hscrat at approximately 0.0171 and 0.0031.]

Again, much of the performance of the model is due to a low self-initiated contact count.
143
Assessing Prediction Models (Instate-Only Cases)

As before, model performance was gauged in the Model Comparison node. The ROC chart showed no clearly superior model, although all models had rather exceptional performance.

The Fit Statistics table in the Output window showed a slight edge for the tree model in misclassification rate, while the validation ROC index and validation average profit favored the Stepwise Regression and Neural Network models. Again, it should be noted that these were unusually high model performance statistics.
144
Data Role=Valid

Statistics                                                       Neural2       Reg3      Tree2
Valid: Kolmogorov-Smirnov Statistic                                 0.80       0.80       0.80
Valid: Average Profit                                               2.02       2.02       2.00
Valid: Average Squared Error                                        0.06       0.06       0.06
Valid: Roc Index                                                    0.96       0.96       0.92
Valid: Average Error Function                                       0.19       0.20       0.21
Valid: Cumulative Percent Captured Response                        50.69      50.69      46.38
Valid: Percent Captured Response                                   23.41      23.41      23.19
Valid: Frequency of Classified Cases                             5899.00    5899.00    5899.00
Valid: Divisor for ASE                                          11798.00   11798.00   11798.00
Valid: Error Function                                            2194.05    2330.78    2462.62
Valid: Gain                                                       406.85     406.85     363.74
Valid: Gini Coefficient                                             0.91       0.91       0.84
Valid: Bin-Based Two-Way Kolmogorov-Smirnov Probability Cutoff      0.00       0.00       0.00
Valid: Kolmogorov-Smirnov Probability Cutoff                        0.14       0.14       0.03
Valid: Cumulative Lift                                              5.07       5.07       4.64
Valid: Lift                                                         4.68       4.68       4.64
Valid: Maximum Absolute Error                                       0.97       1.00       0.98
Valid: Misclassification Rate                                       0.09       0.09       0.08
Valid: Sum of Frequencies                                        5899.00    5899.00    5899.00
Valid: Total Profit                                             11901.71   11901.71   11824.00
Valid: Root Average Squared Error                                   0.24       0.25       0.25
Valid: Cumulative Percent Response                                 80.08      80.08      73.27
Valid: Percent Response                                            73.97      73.97      73.27
Valid: Sum of Squared Errors                                      705.43     725.26     716.66
Valid: Number of Wrong Classifications                            510.00     527.00     462.00
Deploying the Prediction Model

The Score node facilitated deployment of the prediction model, as shown in the diagram's final form.

The best (instate) model was selected by the Instate Model Comparison node and passed on to the Score node. Another INQ2005 data source was assigned the role of Score and attached to the Score node. Columns from the scored INQ2005 data were then passed into the Office of Enrollment Management's data management system by the final SAS Code node.
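One common way to deploy such a model outside SAS Enterprise Miner is to export the selected model's SAS score code from the Score node and %INCLUDE it in a DATA step against new inquiries. The file path and data set names below are placeholders for illustration, not paths from the course, and the printed columns are the variables that exported score code typically creates.

/* Hedged sketch: applying exported SAS Enterprise Miner score code.       */
data work.inq_scored;
   set aaem.inq2005;                        /* new inquiries to score        */
   %include "/models/enroll_score.sas";     /* placeholder path to exported  */
                                            /* Score node code               */
run;

proc print data=work.inq_scored(obs=5);
   var EM_EVENTPROBABILITY EM_CLASSIFICATION;  /* typical score code outputs */
run;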
145