SAS EM part2


Introduction to SAS Enterprise Miner

The Power to Know.

SAS Enterprise Miner - Interface Tour

Menu bar and shortcut buttons


SAS Enterprise Miner - Interface Tour

Project panel (tree of Data Sources, Diagrams, Model Packages, and Users)

SAS Enterprise Miner - Interface Tour

Properties panel (for example, a modeling node's Significance Level, Missing Values, Use Input Once, Maximum Branch, and Maximum Depth properties)


SAS Enterprise Miner - Interface Tour

Help panel

SAS Enterprise Miner - Interface Tour

Diagram workspace


SAS Enterprise Miner - Interface Tour

Process flow

SAS Enterprise Miner - Interface Tour

Node

SAS Enterprise Miner - Interface Tour

SEMMA tools palette

SEMMA - Sample Tab

• Append
• Data Partition
• File Import
• Filter
• Input Data
• Merge
• Sample
• Time Series


SEMMA - Explore Tab

• Association
• Cluster
• DMDB
• Graph Explore
• Market Basket
• Multiplot
• Path Analysis
• SOM/Kohonen
• StatExplore
• Variable Clustering
• Variable Selection

SEMMA - Modify Tab

• Drop
• Impute
• Interactive Binning
• Principal Components
• Replacement
• Rules Builder
• Transform Variables

SEMMA - Model Tab

• AutoNeural
• Decision Tree
• Dmine Regression
• DMNeural
• Ensemble
• Gradient Boosting
• Least Angle Regression
• MBR
• Model Import
• Neural Network
• Partial Least Squares
• Regression
• Rule Induction
• Two Stage

SEMMA - Assess Tab

• Cutoff
• Decisions
• Model Comparison
• Segment Profile
• Score


Beyond SEMMA - Utility Tab

• Control Point
• End Groups
• Metadata
• Reporter
• SAS Code
• Start Groups

Credit Scoring Tab (Optional)

• Credit Exchange
• Interactive Grouping
• Reject Inference
• Scorecard


The Analytic Workflow

SAS Enterprise Miner Analytic Strengths

• Pattern Discovery
• Predictive Modeling


Accessing and Assaying Prepared Data

Analysis Element Organization

• Projects
• Libraries and Diagrams
• Process Flows
• Nodes


Analysis Element Organization

• Projects (Data Sources, Diagrams, Reports, System)
• Libraries and Diagrams (for example, My Library, em.dgraph)
• Process Flows
• Nodes
• Workspaces (for example, EMWS1)

Creating a SAS Enterprise Miner Project

This demonstration illustrates creating a new SAS Enterprise Miner project.


Creating a SAS Library

This demonstration illustrates creating a new SAS library.

Creating a SAS Enterprise Miner Diagram

This demonstration illustrates creating a diagram in SAS Enterprise Miner.


Defining a Data Source

• Select table.
• Define variable roles.
• Define measurement levels.
• Define table role.

A data source definition points to a table in a SAS Foundation Server library and yields analysis data.

Defining a Data Source

This demonstration illustrates defining a SAS data source.


Exploring Source Data

This demonstration illustrates assaying and exploring a data source.


Unsupervised Classification

Unsupervised classification: grouping of cases based on similarities in input values.

k-means Clustering Algorithm

1. Select inputs.
2. Select k cluster centers.
3. Assign cases to the closest center.
4. Update cluster centers.
5. Reassign cases.
6. Repeat steps 4 and 5 until convergence.
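The loop in steps 3 through 6 can be sketched in a few lines. The following is an illustrative Python sketch of the generic algorithm, not SAS Enterprise Miner's implementation; the function name and random initialization are assumptions.

```python
import random

def kmeans(cases, k, max_iter=100):
    """Minimal k-means: assign cases to the closest center, then update centers."""
    centers = random.sample(cases, k)              # step 2: pick k initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):                      # step 6: repeat until convergence
        # steps 3/5: assign each case to the closest center (squared Euclidean distance)
        clusters = [[] for _ in range(k)]
        for case in cases:
            d = [sum((x - c) ** 2 for x, c in zip(case, ctr)) for ctr in centers]
            clusters[d.index(min(d))].append(case)
        # step 4: move each center to the mean of its assigned cases
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else ctr
            for cl, ctr in zip(clusters, centers)
        ]
        if new_centers == centers:                 # assignments have stabilized
            break
        centers = new_centers
    return centers, clusters
```

Enterprise Miner's Cluster node layers refinements on top of this basic rule, including determining the number of clusters automatically.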


Segmentation Analysis

When no clusters exist, use the k-means algorithm to partition cases into contiguous groups.

Creating Clusters with the Cluster Tool

This demonstration illustrates how the Cluster tool determines the number of clusters in the data.


Exploring Segments

This demonstration illustrates how to use graphical aids to explore the segments.

Profiling Segments

This demonstration illustrates using the Segment Profile tool to interpret the composition of clusters.


Case Study I

Bank usage segmentation

A.1 Banking Segmentation Case Study

Case Study Description

A consumer bank sought to segment its customers based on historic usage patterns. Segmentation was to be used for improving contact strategies in the Marketing Department.

A sample of 100,000 active consumer customers was selected. An active consumer customer was defined as an individual or household with at least one checking account and at least one transaction on the account during a three-month study period. All transactions during the three-month study period were recorded and classified into one of four activity categories:

• traditional banking methods (TBM)
• automatic teller machine (ATM)
• point of sale (POS)
• customer service (CSC)

A three-month activity profile for each customer was developed by combining historic activity averages with observed activity during the study period. Historically, for one CSC transaction, an average customer would conduct two POS transactions, three ATM transactions, and 10 TBM transactions. Each customer was assigned this initial profile at the beginning of the study period. The initial profile was updated by adding the total number of transactions in each activity category over the entire three-month study period. The PROFILE data set contains all 100,000 three-month activity profiles. This case study describes the creation of customer activity segments based on the PROFILE data set.

The diagram containing this analysis is stored as an XML file on the course data disk. You can open this file by right-clicking Diagrams and selecting Import Diagram from XML in SAS Enterprise Miner. All nodes in the opened file, except the data node, contain the property settings outlined in this case study. If you want to run the diagram, you need to re-create the case study data set using the metadata settings indicated below.
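The profile construction described above (seed each customer with the historic averages, then add observed counts) can be sketched as follows. This Python fragment is illustrative only; the function name and input format are assumptions, not part of the course data.

```python
# Historic averages per CSC transaction, used as every customer's initial profile.
INITIAL_PROFILE = {"CNT_TBM": 10, "CNT_ATM": 3, "CNT_POS": 2, "CNT_CSC": 1}

def build_profile(observed):
    """Add a customer's observed three-month counts to the initial profile."""
    profile = {cat: INITIAL_PROFILE[cat] + observed.get(cat, 0)
               for cat in INITIAL_PROFILE}
    profile["CNT_TOT"] = sum(profile.values())   # total across the four categories
    return profile
```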

f

Case Study Data

Name     Model Role  Measurement Level  Description
ID       ID          Nominal            Customer ID
CNT_TBM  Input       Interval           Traditional bank method transaction count
CNT_ATM  Input       Interval           ATM transaction count
CNT_POS  Input       Interval           Point-of-sale transaction count
CNT_CSC  Input       Interval           Customer service transaction count
CNT_TOT  Input       Interval           Total transaction count


Accessing and Assaying the Data

A SAS Enterprise Miner data source was defined using the metadata settings indicated above. The StatExplore node was used to provide preliminary statistics on the input variables. (Process flow: PROFILE → StatExplore)

The Interval Variable Summary from the StatExplore node showed no missing values but did show a surprisingly large range on the transaction counts.

Interval Variable Summary Statistics (maximum 500 observations printed)
Data Role=TRAIN

Variable  Role   Mean       Standard   Non      Missing  Minimum  Median  Maximum  Skewness
                            Deviation  Missing
CNT_ATM   INPUT  19.49971   20.8561    100000   0        3        13      628      2.357293
CNT_CSC   INPUT  6.63411    12.12856   100000   0        1        2       607      6.236494
CNT_POS   INPUT  11.9233    20.73384   100000   0        2        2       345      3.343805
CNT_TBM   INPUT  63.13696   101.1542   100000   0        10       52      14934    53.05219
CNT_TOT   INPUT  106.2441   113.3704   100000   0        17       89      15225    39.2061

A plot of the input distributions showed highly skewed distributions for all inputs.


[Explore window for AAEM.PROFILE (100,000 rows): histograms of CNT_ATM, CNT_TBM, CNT_CSC, CNT_POS, and CNT_TOT, each concentrated near zero with a long right tail]

It would be difficult to develop meaningful segments from such highly skewed inputs. Instead of focusing on the transaction counts, it was decided to develop segments based on the relative proportions of transactions across the four categories. This required a transformation of the raw data. A Transform Variables node was connected to the PROFILE node.

(Process flow: PROFILE → StatExplore; PROFILE → Transform Variables)

The Transform Variables node was used to create category logit scores for each transaction category:

    category logit score = log(category transaction count / (total transaction count - category transaction count))
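Outside Enterprise Miner, the same transformation is easy to verify directly. A minimal Python sketch (the function name is assumed for illustration):

```python
import math

def category_logit_scores(counts):
    """counts: category -> transaction count.
    Returns log(n / (total - n)) for each category: the logit of its share."""
    total = sum(counts.values())
    return {cat: math.log(n / (total - n)) for cat, n in counts.items()}
```

A category holding more than half of all transactions gets a positive score; rarer categories get increasingly negative scores.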


The transformations were created using these steps:

1. Select Formulas in the Transform Variables node's Properties panel. The Formulas window appears.


2. Select the Create icon as indicated above. The Add Transformation dialog box appears. For example, the LGT_TBM transformation is defined with the formula log(CNT_TBM/(CNT_TOT-CNT_TBM)).

3. For each transaction category, type the name and formula as indicated.

4. Select OK to add the transformation. The Add Transformation dialog box closes and you return to the Formula Builder window.


5. Select Preview to see the distribution of the newly created input.

6. Repeat steps 1-5 for the other three transaction categories.

7. Select OK to close the Formula Builder window.

8. Run the Transform Variables node.

Segmentation was to be based on the newly created category logit scores. Before proceeding, it was deemed reasonable to examine the joint distribution of the cases using these derived inputs. A scatter plot using any three of the four derived inputs would represent the joint distribution without significant loss of information.


A three-dimensional scatter plot was produced using the following steps:

1. Select Exported Data from the Properties panel of the Transform Variables node. The Exported Data window appears.

2. Select the TRAIN data and select Explore. The Explore window appears.

3. Select Actions > Plot... or click the Plot Wizard icon. The Plot Wizard appears.

4. Select a three-dimensional scatter plot.

5. Select Role X, Y, and Z for LGT_ATM, LGT_CSC, and LGT_POS, respectively.

6. Select Finish to generate the scatter plot.

The scatter plot showed a single clump of cases, making this analysis a segmentation (rather than a clustering) of the customers. There were a few outlying cases with apparently low proportions on the three plotted inputs. Given that the proportions in the four original categories must sum to 1, it followed that these outlying cases must have a high proportion of transactions in the non-plotted category, TBM.


Creating Segments

Transaction segments were created using the Cluster node, which was connected to the Transform Variables node. (Process flow: PROFILE → Transform Variables → Cluster)

Two changes to the Cluster node default properties were made, as indicated below. Both were related to limiting the number of clusters created to 5.

Train properties:
• Internal Standardization: None
• Specification Method: User Specify
• Maximum Number of Clusters: 5

Because the inputs were all on the same measurement scale (category logit score), it was decided to not standardize the inputs.

Only the four LGT inputs defined in the Transform Variables node were set to Default in the Cluster node; the Use status of the original CNT inputs was set to No.


Running the Cluster node and viewing the Results window confirmed the creation of five nearly equally sized clusters.


Interpreting Segments

A Segment Profile node attached to the Cluster node helped to interpret the contents of the generated segments. (Process flow: Cluster → Segment Profile)

Only the LGT inputs were set to Yes in the Segment Profile node; the CNT inputs were set to No.

The following profiles were created for the generated segments:

Segment 1 (Count: 13157, Percent: 13.16)

Segment 1 customers had a significantly higher than average use of traditional banking methods and lower than average use of all other transaction categories. This segment was labeled Brick-and-Mortar.


Segment 2 (Count: 25536, Percent: 25.54)

Segment 2 customers had a higher than average use of traditional banking methods but were close to the distribution centers on the other transaction categories. This segment was labeled Transitionals because they seem to be transitioning from brick-and-mortar to other usage patterns.

Segment 3 (Count: 17839, Percent: 17.84)

Segment 3 customers eschewed traditional banking methods in favor of ATMs. This segment was labeled ATMs.

Segment 4 (Count: 14537, Percent: 14.54)

Segment 4 was characterized by a high prevalence of point-of-sale transactions and few traditional bank methods. This segment was labeled Cashless.

Segment 5 (Count: 28931, Percent: 28.93)

Segment 5 had a higher than average rate of customer service contacts and point-of-sale transactions. This segment was labeled Service.


Segment Deployment

Deployment of the transaction segmentation was facilitated by the Score node, which was attached to the Cluster node and run. (Process flow: PROFILE → Transform Variables → Cluster → Score) The SAS Code window inside the Results window provided SAS code capable of transforming raw transaction counts to cluster assignments. The complete SAS scoring code is shown below.

* EM SCORE CODE;
* EM Version: 7.1;
* SAS Release: 9.03.01M0P060711;
* Host: SASBAP;
* Encoding: wlatin1;
* Locale: en_US;
* Project Path: D:\Workshop\winsas\EM_Projects;
* Project Name: apxa;
* Diagram Id: EMWS1;
* Diagram Name: case_study1;
* Generated by: sasdemo;
* Date: 09SEP2011:16:50:09;

* TOOL: Input Data Source;
* TYPE: SAMPLE;
* NODE: Ids2;

* TOOL: Transform;
* TYPE: MODIFY;
* NODE: Trans;

LGT_ATM = log(CNT_ATM/(CNT_TOT-CNT_ATM));
LGT_CSC = log(CNT_CSC/(CNT_TOT-CNT_CSC));
LGT_POS = log(CNT_POS/(CNT_TOT-CNT_POS));
LGT_TBM = log(CNT_TBM/(CNT_TOT-CNT_TBM));

* TOOL: Clustering;
* TYPE: EXPLORE;
* NODE: Clus;

*** Begin Scoring Code from PROC DMVQ ***;

*** Begin Class Look-up, Standardization, Replacement;
drop _dm_bad;
_dm_bad = 0;
*** No transformation for LGT_ATM;
*** No transformation for LGT_CSC;
*** No transformation for LGT_POS;
*** No transformation for LGT_TBM;
*** End Class Look-up, Standardization, Replacement;

*** Omitted Cases;
if _dm_bad then do;
   _SEGMENT_ = .;
   Distance = .;
   goto CLUSvlex;
end;

*** Compute Distances and Cluster Membership;
label _SEGMENT_ = 'Segment Id';
label Distance = 'Distance';
array CLUSvads [5] _temporary_;
drop _vqclus _vqmvar _vqnvar;
_vqmvar = 0;
do _vqclus = 1 to 5;
   CLUSvads [_vqclus] = 0;
end;
if not missing( LGT_ATM ) then do;
   CLUSvads [1] + ( LGT_ATM - -3.54995114884545 )**2;
   CLUSvads [2] + ( LGT_ATM - -2.2003888516185 )**2;
   CLUSvads [3] + ( LGT_ATM - -0.23695023328541 )**2;
   CLUSvads [4] + ( LGT_ATM - -1.47814712774378 )**2;
   CLUSvads [5] + ( LGT_ATM - -1.49704375204907 )**2;
end;
else _vqmvar + 1.31533540479169;
if not missing( LGT_CSC ) then do;
   CLUSvads [1] + ( LGT_CSC - -4.16334022538952 )**2;
   CLUSvads [2] + ( LGT_CSC - -3.38356120535047 )**2;
   CLUSvads [3] + ( LGT_CSC - -3.55519058753002 )**2;
   CLUSvads [4] + ( LGT_CSC - -3.96526745641347 )**2;
   CLUSvads [5] + ( LGT_CSC - -2.08727391873096 )**2;
end;
else _vqmvar + 1.20270093291078;
if not missing( LGT_POS ) then do;
   CLUSvads [1] + ( LGT_POS - -4.08779761080977 )**2;
   CLUSvads [2] + ( LGT_POS - -3.27644694006697 )**2;
   CLUSvads [3] + ( LGT_POS - -3.02915771770446 )**2;
   CLUSvads [4] + ( LGT_POS - -0.9841959454775 )**2;
   CLUSvads [5] + ( LGT_POS - -2.21538937073223 )**2;
end;
else _vqmvar + 1.3094245726273;
if not missing( LGT_TBM ) then do;
   CLUSvads [1] + ( LGT_TBM - 2.62509260779666 )**2;
   CLUSvads [2] + ( LGT_TBM - 1.40885156098965 )**2;
   CLUSvads [3] + ( LGT_TBM - -0.15878507901546 )**2;
   CLUSvads [4] + ( LGT_TBM - 0.11252803970828 )**2;
   CLUSvads [5] + ( LGT_TBM - 0.22075831354075 )**2;
end;
else _vqmvar + 1.17502484629096;
_vqnvar = 5.00248575662075 - _vqmvar;
if _vqnvar <= 2.2748671456705E-12 then do;
   _SEGMENT_ = .;
   Distance = .;
end;
else do;
   _SEGMENT_ = 1;
   Distance = CLUSvads [1];
   _vqfzdst = Distance * 0.99999999999988;
   drop _vqfzdst;
   do _vqclus = 2 to 5;
      if CLUSvads [_vqclus] < _vqfzdst then do;
         _SEGMENT_ = _vqclus;
         Distance = CLUSvads [_vqclus];
         _vqfzdst = Distance * 0.99999999999988;
      end;
   end;
   Distance = sqrt(Distance * (5.00248575662075 / _vqnvar));
end;
CLUSvlex :;

*** End Scoring Code from PROC DMVQ ***;

* Clus: Creating Segment Label;
length _SEGMENT_LABEL_ $80;
label _SEGMENT_LABEL_ = 'Segment Description';
if _SEGMENT_ = 1 then _SEGMENT_LABEL_ = "Cluster1";
else if _SEGMENT_ = 2 then _SEGMENT_LABEL_ = "Cluster2";
else if _SEGMENT_ = 3 then _SEGMENT_LABEL_ = "Cluster3";
else if _SEGMENT_ = 4 then _SEGMENT_LABEL_ = "Cluster4";
else if _SEGMENT_ = 5 then _SEGMENT_LABEL_ = "Cluster5";

* TOOL: Score Node;
* TYPE: ASSESS;
* NODE: Score;

* Score: Creating Fixed Names;
label EM_SEGMENT = 'Segment Variable';
EM_SEGMENT = _SEGMENT_;
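Stripped of the SAS bookkeeping, the generated score code implements a nearest-center rule: accumulate squared differences from each segment's center over the non-missing inputs, then pick the segment with the smallest distance. An illustrative Python rendering with hypothetical centers (not the fitted values above):

```python
def assign_segment(case, centers):
    """case: dict of input -> value (inputs may be missing).
    centers: list of dicts, one per segment.
    Returns the 1-based index of the closest center."""
    best_seg, best_dist = None, float("inf")
    for seg, center in enumerate(centers, start=1):
        # sum squared differences over the inputs present in this case
        dist = sum((case[v] - center[v]) ** 2 for v in center if v in case)
        if dist < best_dist:
            best_seg, best_dist = seg, dist
    return best_seg
```

The real code additionally rescales the distance for missing inputs (the _vqmvar/_vqnvar adjustment) before taking the square root.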


Market Basket Analysis

Rule        Support  Confidence
A => D      2/5      2/3
C => A      2/5      2/4
A => C      2/5      2/3
B & C => D  1/5      1/3
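Counting over a transaction list reproduces columns like these. In the sketch below, the five example baskets are an assumption chosen to match the table; the function itself is generic.

```python
def rule_metrics(transactions, lhs, rhs):
    """Support and confidence of the rule lhs => rhs over a list of item sets."""
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for t in transactions if lhs <= t)           # baskets containing lhs
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)  # baskets with lhs and rhs
    return n_both / len(transactions), n_both / n_lhs          # (support, confidence)

# Assumed baskets that reproduce the support/confidence figures shown above.
baskets = [{"A", "C", "D"}, {"A", "B", "C"}, {"A", "D"},
           {"B", "C"}, {"B", "C", "D"}]
```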


Implication?

                     Checking Account
                     No       Yes
Saving Account  No   500      3,500    4,000
                Yes  1,000    5,000    6,000
                                       10,000

Support(SVG => CK) = 50% = 5,000/10,000
Confidence(SVG => CK) = 83% = 5,000/6,000
Expected Confidence(SVG => CK) = 85% = 8,500/10,000
Lift(SVG => CK) = 0.83/0.85 < 1
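The lift arithmetic above is just confidence divided by expected confidence; a one-function Python check of the table's numbers:

```python
def lift(n_both, n_lhs, n_rhs, n_total):
    """Lift of lhs => rhs: confidence P(rhs | lhs) over the baseline P(rhs)."""
    confidence = n_both / n_lhs        # 5,000 / 6,000  ~ 0.83
    expected = n_rhs / n_total         # 8,500 / 10,000 = 0.85
    return confidence / expected

# SVG => CK from the table: lift below 1, so saving-account holders are
# slightly LESS likely than average to hold a checking account.
print(round(lift(5000, 6000, 8500, 10000), 3))   # prints 0.98
```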

Barbie Doll => Candy

1. Put them closer together in the store.
2. Put them far apart in the store.
3. Package candy bars with the dolls.
4. Package Barbie + candy + poorly selling item.
5. Raise the price on one, lower it on the other.
6. Offer Barbie accessories for proofs of purchase.
7. Do not advertise candy and Barbie together.
8. Offer candies in the shape of a Barbie doll.


Data Capacity

Association Tool Demonstration

Analysis goal: Explore associations between retail banking services used by customers.

Analysis plan:
• Create an association data source.
• Run an association analysis.
• Interpret the association rules.
• Run a sequence analysis.
• Interpret the sequence rules.


Market Basket Analysis

This demonstration illustrates how to conduct market basket analysis.

Name     Model Role  Measurement Level  Description
ACCOUNT  ID          Nominal            Account Number
SERVICE  Target      Nominal            Type of Service
VISIT    Sequence    Ordinal            Order of Product Purchase

Service codes:

ATM     automated teller machine debit card
AUTO    automobile installment loan
CCRD    credit card
CD      certificate of deposit
CKCRD   check debit card
CKING   checking account
HMEQLC  home equity line of credit
IRA     individual retirement account
MMDA    money market deposit account
MTG     mortgage
PLOAN   personal consumer installment loan
SVG     saving account
TRUST   personal trust account


[Screenshots: the Data Source Wizard defines a data source on table AAEM.BANK, with ACCOUNT as ID, SERVICE as Target, and VISIT as Sequence, and the data source role set to Transaction; the Association node's Train properties include Maximum Number of Items to Process, Rules settings (Maximum Items, Minimum Confidence Level, Support Type, Support Count, Support Percentage), and Sequence settings (Chain Count, Consolidate Time, Maximum Transaction Duration)]


rJ

l i * * : C o r f u * ™Pks Eipecled

Riiaonj

U9

Conflderrce{

Trintxban

C O P I E S neat

3 1 3 :

:3 3 3 3

:

:

4 4 4 4

;4 3

4 3

; ' i

j 4 i

,

1130 1495

Kn is.o 11.39 14 8! 1947 1S.4S

tw iaj: 3B.I1 il.ii ».es 18.47 1847 18.47

xtt I S 89 38 I t 1847 3841 3941 lit! il.ii sti: 1941

Count

37.5 49J9 49 35 49.31 36 3' 39 0' 18.13 31 1

19.1;

19.11 39 9 19.9 S4.ec 19.94 37 0 14.91 33 r 33.71 37.9 3r.ni si . r 33 H 54 6! 94 98 IS* ISf 1

n.4. S1J3

5.SB 5.5S s.sa ssa 548 sst

'it 4.5: 4S3

im 4 63

4JJ3 809 909 809 (OS 60 809 809 899 9 5" 9.93 60 60 999 999 C

5 21 is

3.33 130 119 3.19 119 119 1 99 1 99 1.83 1.83 1.81 1.6! 1.91 1.S1 1.49

1,41

1 14 144 144 144 143 141 1.43 1,4: 14 1.43 1.36 111

Run

LI". Hind of R ••: Html Role item 1 Rule Item i RuM

Rule Bern 3 Ruts Hem 4 RnU B m !

BtPtM

44S 0DCK1NO AC CKWOAC CKCRD 446 0 0 C K C R D " CKCRD CMNOA C 44600CKCRO — CKCRD CCRD 446 00 CMNO AC CKtNOAC CCRD 446 0 0 C C R D « " . CCPCi CKCRD 446 0 0 C C R D — ' CCRD CMHGiC. 370 OOHMEQLCc HMEQLC CKINOAC 379 00 CMNO t 0 . CKINOAC. HUEOLC 370 DOHMEQLC i HMEQLC CCRO 370.00HUEOLCi. HMEQLC A CCRD 370 0 0 C C R O = ' CCRD HUEOLC 370 00 CCRD CCRD HUEOLCA 4 97.00 SVO A HME SV0AHME..CK1UGJA . 487 00CMNQSA CKJNOAA SVO A HUE 437 00HMEOLC = . HUEOLC SVOACK] l a r . o o s v o i C K i . . BV0ACK1 HUEOLC 497 DOQVQSATM SVO A ATM HUEOLC 497 Q08VOAATU SVO A ATM HMEQLCA l9r.0CHHEQLC=. HUEOLC SVOAATH 497 DOHMEOLC A . HUEOLCA BVOAATU E93 QOHMEQLC • HUEOLC 69IMCMNO J « CMNOAA HMEQLC 497 00 SVG S HUE SVO A HUE ATM 497 0CSVGSHME SVO A HUE ATM 497 0OATM==>BV MM SVO A HUE. 497 0OATM= = 'SV AIM SVO A HUE 4M00COAATO._ CO A ATM BVO 4CM_. ATM 583 OOHMEOLC = HUEOLC

15

42

cwno

CCRQ

CKCRD CKCRD

curio

CCRD CCRD HMEQLC CMNO HMEQLC HUEOLC CCRO CCRD SVG CMNO MMEOLC SVO SVO SVO HMEQLC HUEOLC HUEOLC CMNO svo SVO ATM ATM Co HUEOLC

OONO CCRO CKCRD

-

CCRO • m a i

CKCRD C KING CKJNO . . . . .

e

CCRD ;

"

;

HMEQLC ATM CMNO ATM ATM . . .

HMEQLC HUEOLC

!===:—> SVO ATM

-

SVO

CMNO CMNO ATM HMEQLC HUEOLC

" CMNG SVO SVO

ATM '

CKCRD CCRO HUEOLC

CCRD

CMNO

' ' j" '

i i - -.-.-.j i ipecrfc run 1 1 3 8 4 9 7

CKCRD CCRO CCRO

'

'

• ATM

CMNO CW7JG SVO CtONO

......->

ATM HUEOLC ATM HUEOLC

HUEOLC HMEOLC ATM SVO ATU HMEOLC ATM

ATM

. .. . .

ATM

HUEOLC HUEOLC SVO

Rult M f l 1 Twite-oil Run Number

CKING

9 10 11 ii 13 14 IS 18 19 19 17 10 11

21 14 38

n CMNO CMNO

IS 37 19

i 1 1 1 1 1 1 t 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 i 1 1 1


This demonstration illustrates how to conduct a sequence analysis.

[Rules table: sequence rules such as CKING ==> SVG, CKING ==> ATM, SVG ==> ATM, CKING ==> CD, CKING ==> HMEQLC, CKING ==> MMDA, and CKING ==> CCRD]

Case Study II

Web services associations


A.2 Web Site Usage Associations Case Study

Case Study Description

A radio station developed a Web site to broaden its audience appeal and its offerings. In addition to a simulcast of the station's primary broadcast, the Web site was designed to provide services to Web users, such as podcasts, news streams, music streams, archives, and live Web music performances. The station tracked usage of these services by URL. Analysts at the station wanted to see whether any unusual patterns existed in the combinations of services selected by its Web users.

The WEBSTATION data set contains services selected by more than 1.5 million unique Web users over a two-month period in 2006. For privacy reasons, the URLs are assigned anonymous ID numbers.

The diagram containing this analysis is stored as an XML file on the course data disk. You can open this file by right-clicking Diagrams and selecting Import Diagram from XML in SAS Enterprise Miner. All nodes in the opened file, except the data node, contain the property settings outlined in this case study. If you want to run the diagram, you need to re-create the case study data set using the metadata settings indicated below.

Case Study Data

Name    Model Role  Measurement Level  Description
ID      ID          Nominal            URL (with anonymous ID numbers)
TARGET  Target      Nominal            Web service selected

The WEBSTATION data set should be assigned the role of Transaction. This role can be assigned either in the process of creating the data source or by changing the properties of the data source inside SAS Enterprise Miner.

Accessing and Assaying the Data

A SAS Enterprise Miner data source was defined for the WEBSTATION data set using the metadata settings indicated above. By right-clicking the Data Source node in the diagram and selecting Edit Variables, the TARGET variable can be explored by highlighting the variable and then selecting Explore. (The following results are obtained by specifying Random and Max for the Sample Method and Fetch Size.) The Sample Statistics window shows that there are over 128 unique URLs in the data set and 8 distinct services.


A plot of the target distribution (produced from the Explore window) identified the eight levels and displayed their relative frequency in a random sample of 100,000 cases: WEBSITE, MUSICSTREAM, PODCAST, NEWS, LIVESTREAM, ARCHIVE, SIMULCAST, and EXTREF.

Generating Associations

An Association node was connected to the WEBSTATION node. (Diagram: case_study2; process flow: WEBSTATION → Association)

A preliminary run of the Association node yielded very few association rules. It was discovered that the default minimum Support Percentage setting was too large. (Many of the URLs selected only one service, diminishing the support of all association rules.) To obtain more association rules, the minimum Support Percentage setting was changed to 1.0. In addition, the number of items to process was increased to 3,000,000 to account for the large training data set.
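The effect of lowering the minimum-support threshold can be seen with a quick filter over candidate rules. This sketch is illustrative only; the rule list and support values are invented, not taken from the case study output.

```python
def filter_rules(rules, min_support_pct):
    """Keep only rules whose support (in percent) meets the threshold."""
    return [r for r in rules if r["support_pct"] >= min_support_pct]

# Hypothetical candidate rules: at a 5% threshold only one survives,
# while lowering the threshold to 1.0 keeps them all.
rules = [{"rule": "ARCHIVE ==> WEBSITE", "support_pct": 13.4},
         {"rule": "SIMULCAST ==> NEWS", "support_pct": 3.7},
         {"rule": "PODCAST ==> MUSICSTREAM", "support_pct": 2.3}]
```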

Using these changes, the analysis was rerun and yielded substantially more association rules. The Rules Table was used to scrutinize the results.

[Rules Table: for each rule, the left-hand and right-hand items, transaction count, confidence (%), support (%), and lift. The strongest rules relate ARCHIVE, WEBSITE, and EXTREF, with lifts up to about 13.4.]

The following were among the interesting findings from this analysis:
• Most external referrers to the Web site pointed to the programming archive (98% confidence).
• Selecting the simulcast service tripled the chances of selecting the news service.
• Users who streamed music, downloaded podcasts, used the news service, or listened to the simulcast were less likely to go to the Web site.

47



Introduction to Predictive Modeling: Regressions THE POWER TO KNOW.

Model Essentials - Regressions
• Predict new cases: prediction formula
• Select useful inputs: sequential selection
• Optimize complexity: best model from sequence

2

49


Model Essentials - Regressions

Linear Regression Prediction Formula

    ŷ = w0 + w1·x1 + w2·x2

(w0 is the intercept estimate; w1, w2 are parameter estimates; x1, x2 are input measurements)

Choose the intercept and parameter estimates to minimize the squared error function over the training data:

    Σi (yi − ŷi)²

50
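The least squares criterion above can be sketched with an ordinary least squares solve; the data below is simulated, not from the course, and is noise-free so the estimates recover the generating weights exactly.

```python
import numpy as np

# Simulated training data: y = 1 + 2*x1 - 3*x2 (no noise, for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Prepend a column of ones so the intercept w0 is estimated too.
A = np.column_stack([np.ones(len(X)), X])

# Least squares: choose w to minimize sum((y_i - yhat_i)^2).
w, *_ = np.linalg.lstsq(A, y, rcond=None)   # w = [w0, w1, w2]
```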


Logit Link Function

    log( p̂ / (1 − p̂) ) = logit( p̂ )

The logit link function transforms probabilities (between 0 and 1) to logit scores (between −∞ and +∞).

    w0 + w1·x1 + w2·x2 = logit( p̂ )

To obtain prediction estimates, the logit equation is solved for p̂:

    p̂ = 1 / (1 + e^−logit( p̂ ))

51
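A minimal sketch of the logit link and its inverse (the function names here are our own, not Enterprise Miner's):

```python
import math

def logit(p):
    """Map a probability in (0, 1) to a logit score in (-inf, +inf)."""
    return math.log(p / (1 - p))

def inv_logit(score):
    """Solve the logit equation for p: p = 1 / (1 + exp(-score))."""
    return 1 / (1 + math.exp(-score))

score = logit(0.8)     # probability 0.8 -> positive logit score
p = inv_logit(score)   # round trip back to 0.8
```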


Simple Prediction Illustration - Regressions

Predict dot color for each x1 and x2:

    logit( p̂ ) = w0 + w1·x1 + w2·x2
    p̂ = 1 / (1 + e^−logit( p̂ ))

Needed: intercept and parameter estimates.

[Scatter plot: training cases on the unit square of (x1, x2)]

Simple Prediction Illustration - Regressions

    logit( p̂ ) = w0 + w1·x1 + w2·x2
    p̂ = 1 / (1 + e^−logit( p̂ ))

Find parameter estimates by maximizing the log-likelihood function over the training cases:

    Σ log( p̂i )  +  Σ log( 1 − p̂i )

(first sum over primary-outcome training cases; second sum over secondary-outcome training cases)

[Scatter plots: fitted logistic decision boundary over the (x1, x2) unit square]

52

Regressions: Beyond the Prediction Formula
• Manage missing values
• Interpret the model
• Handle extreme or unusual values
• Use nonnumeric inputs
• Account for nonlinearities

10

53


Missing Values and Regression Modeling

[Illustration: training data table with scattered missing input values]

Problem 1: Training data cases with missing values on inputs used by a regression model are ignored.

Missing Values and Regression Modeling Training Data

Consequence: Missing values can significantly reduce your amount of training data for regression modeling!

12

54


Missing Values and the Prediction Formula

    logit( p̂ ) = −0.81 + 0.92·x1 + 1.11·x2

Predict: (x1, x2) = (0.3, ?)

    logit( p̂ ) = −0.81 + 0.92·0.3 + 1.11·?

Problem 2: Prediction formulas cannot score cases with missing values.

14

55


Missing Value Issues

Managing missing values Problem 1: Training data cases with missing values on inputs used by a regression model are ignored. Problem 2: Prediction formulas cannot score cases with missing values.

15

Missing Value Causes
• Non-applicable measurement
• No match on merge
• Non-disclosed measurement

16

56


Missing Value Remedies
• Non-applicable measurement
• No match on merge
• Non-disclosed measurement

Remedy: estimate a synthetic value from the input's distribution.

17

This demonstration illustrates how to impute synthetic data values and create missing value indicators.
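Outside of Enterprise Miner, the Impute node's behavior for interval inputs (mean replacement plus a unique missing value indicator per input) might be sketched with pandas. The column names below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical interval inputs with missing values.
df = pd.DataFrame({"TLSatPct": [0.5, np.nan, 0.7, np.nan],
                   "TLCnt":    [3.0, 8.0, np.nan, 5.0]})

imputed = df.copy()
for col in df.columns:
    # Unique missing value indicator per input (Problem 2 remedy).
    imputed["M_" + col] = df[col].isna().astype(int)
    # Replace missings with the input mean (default for interval inputs).
    imputed["IMP_" + col] = df[col].fillna(df[col].mean())
```

The indicator columns let the regression retain (and even exploit) the fact that a value was missing, instead of discarding the whole case.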

18

57


Running the Regression Node

This demonstration illustrates using the Regression tool.

19

Model Essentials - Regressions

Select useful inputs

58


Sequential Selection - Forward

[Animation, slides 21-23: starting from no inputs, the input with the smallest p-value below the entry cutoff is added at each step, until no remaining input qualifies]

59

Sequential Selection - Backward

[Animation, slides 24-27: starting from all inputs, the input with the largest p-value above the stay cutoff is removed at each step, until all remaining inputs qualify]

60-61

Sequential Selection - Stepwise

[Animation, slides 28-29: inputs are added as in forward selection, but after each addition any input whose p-value rises above the stay cutoff is removed]

62

Selecting Inputs

This demonstration illustrates using stepwise selection to choose inputs for the model.
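The flavor of sequential entry can be sketched as a greedy loop. Note the hedge: the real Regression node compares p-values against an entry cutoff; this stand-in uses training SSE reduction on simulated data, which gives the same "add the most useful input, stop when improvement is negligible" behavior.

```python
import numpy as np

# Simulated data: only inputs 0 and 3 actually drive the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=200)

def sse(cols):
    """Training sum of squared errors for a model using the given inputs."""
    A = np.column_stack([np.ones(len(X))] + [X[:, c] for c in cols])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ w
    return float(r @ r)

selected, remaining = [], list(range(5))
current = sse(selected)
while remaining:
    best = min(remaining, key=lambda c: sse(selected + [c]))
    new = sse(selected + [best])
    if current - new < 0.05 * current:   # crude stand-in for the entry cutoff
        break
    selected.append(best)
    remaining.remove(best)
    current = new
```

With this data the loop adds input 0 (the strongest), then input 3, then stops because the noise inputs barely improve the fit.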

30

63


Model Essentials - Regressions

Optimize complexity: best model from sequence

31

Model Fit versus Complexity

[Plot: model fit statistic evaluated at each step of the selection sequence]

32

64


Select Model with Optimal Validation Fit

[Plot: validation fit statistic by sequence step; choose the simplest model with optimal validation fit]

33

Optimizing Complexity

This demonstration illustrates tuning a regression model to give optimal performance on the validation data.

34

65


Beyond the Prediction Formula

Interpret the model

35

36

66


Odds Ratios and Doubling Amounts

    log( p̂ / (1 − p̂) ) = w0 + w1·x1 + w2·x2   (logit scores)

Consequences:
• Odds ratio: the amount the odds change with a unit change in an input. A unit increase in xi multiplies the odds by exp(wi).
• Doubling amount: the input change required to double the odds. An increase of 0.69/wi (that is, log 2 / wi) in xi doubles the odds.

37-38

67
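Both consequences follow directly from a parameter estimate. The value of w1 below is illustrative (it happens to match one of the case study estimates, but any positive weight works the same way):

```python
import math

w1 = 0.3359   # illustrative logistic regression parameter estimate

# Odds ratio: a unit increase in the input multiplies the odds by exp(w1).
odds_ratio = math.exp(w1)

# Doubling amount: log(2)/w1 (~0.69/w1) units of input change doubles the odds.
doubling_amount = math.log(2) / w1
```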


Interpreting a Regression Model

This demonstration illustrates interpreting a regression model using odds ratios.

Beyond the Prediction Formula

Handle extreme or unusual values

40

68


41

Regularizing Input Transformations

[Illustration: on the original input scale, extreme cases pull the standard regression estimate away from the true association; on the regularized scale, the regularized estimate recovers the true association]

42

69


This demonstration illustrates using the Transform Variables tool to apply standard transformations to a set of inputs.

43

Credit Risk Scoring

70


A.3

Credit Risk Case Study

A bank sought to use performance on an in-house subprime credit product to create an updated risk model. The risk model was to be combined with other factors to make future credit decisions. A sample of applicants for the original credit product was selected. Credit bureau data describing these individuals (at the time of application) was recorded. The ultimate disposition of the loan was determined (paid off or bad debt). For loans rejected at the time of application, a disposition was inferred from credit bureau records on loans obtained in a similar time frame. The credit scoring models pursued in this case study were required to conform to the standard industry practice of transparency and interpretability. This eliminated certain modeling tools from consideration (for example, neural networks) except for comparison purposes. If a neural network significantly outperformed a regression, for example, it could be interpreted as a sign of lack of fit for the regression. Measures could then be taken to improve the regression model.

The diagram containing this analysis is stored as an XML file on the course data disk. You can open this file by right-clicking Diagrams ⇒ Import Diagram from XML in SAS Enterprise Miner. All nodes in the opened file, except the data node, contain the property settings outlined in this case study. If you want to run the diagram, you need to re-create the case study data set using the metadata settings indicated below.

71


Case Study Training Data

Name             Role    Level     Label
TARGET           Target  Binary
BanruptcyInd     Input   Binary    Bankruptcy Indicator
TLBadDerogCnt    Input   Interval  Number Bad Dept plus Public Derogatories
CollectCnt       Input   Interval  Number Collections
InqFinanceCnt24  Input   Interval  Number Finance Inquires 24 Months
InqCnt06         Input   Interval  Number Inquiries 6 Months
DerogCnt         Input   Interval  Number Public Derogatories
TLDel3060Cnt24   Input   Interval  Number Trade Lines 30 or 60 Days 24 Months
TL50UtilCnt      Input   Interval  Number Trade Lines 50 pct Utilized
TLDel60Cnt24     Input   Interval  Number Trade Lines 60 Days or Worse 24 Months
TLDel60CntAll    Input   Interval  Number Trade Lines 60 Days or Worse Ever
TL75UtilCnt      Input   Interval  Number Trade Lines 75 pct Utilized
TLDel90Cnt24     Input   Interval  Number Trade Lines 90+ 24 Months
TLBadCnt24       Input   Interval  Number Trade Lines Bad Debt 24 Months
TLDel60Cnt       Input   Interval  Number Trade Lines Currently 60 Days or Worse
TLSatCnt         Input   Interval  Number Trade Lines Currently Satisfactory
TLCnt12          Input   Interval  Number Trade Lines Opened 12 Months
TLCnt24          Input   Interval  Number Trade Lines Opened 24 Months
TLCnt03          Input   Interval  Number Trade Lines Opened 3 Months
TLSatPct         Input   Interval  Percent Satisfactory to Total Trade Lines
TLBalHCPct       Input   Interval  Percent Trade Line Balance to High Credit
TLOpenPct        Input   Interval  Percent Trade Lines Open
TLOpen24Pct      Input   Interval  Percent Trade Lines Open 24 Months
TLTimeFirst      Input   Interval  Time Since First Trade Line
InqTimeLast      Input   Interval  Time Since Last Inquiry
TLTimeLast       Input   Interval  Time Since Last Trade Line
TLSum            Input   Interval  Total Balance All Trade Lines
TLMaxSum         Input   Interval  Total High Credit All Trade Lines
TLCnt            Input   Interval  Total Open Trade Lines
ID               ID      Nominal

72


Accessing and Assaying the Data

A SAS Enterprise Miner data source was defined for the CREDIT data set using the metadata settings indicated above. The data source definition was expedited by customizing the Advanced Metadata Advisor in the Data Source Wizard as indicated.

[Advanced Advisor Options]
Property                                   Value
Missing Percentage Threshold               50
Reject Vars with Excessive Missing Values  Yes
Class Levels Count Threshold               2
Detect Class Levels                        Yes
Reject Levels Count Threshold              20
Reject Vars with Excessive Class Values    Yes

Class Levels Count Threshold: if Detect Class Levels is Yes, interval variables with fewer than the specified number of distinct values are marked as NOMINAL. The default value is 20.

With this change, all metadata was set correctly by default.

73



Decision processing was selected in step 6 of the Data Source Wizard.

[Data Source Wizard - Step 7 of 10: Decision Configuration]
Do you want to use the decisions? Yes

Decision Name  Label  Cost Variable  Constant
DECISION1      1      <None>         0.0
DECISION2      0      <None>         0.0

The Decisions option Default with Inverse Prior Weights was selected to provide the values in the Decision Weights tab.

[Decision Weights tab - decision function: Maximize]
Level  DECISION1     DECISION2
1      5.99880023...  0.0
0      0.0            1.20004800...

74
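The "Default with Inverse Prior Weights" values are simply reciprocal priors. Using the target proportions reported by the StatExplore summary later in this case study (about 16.67% bad debt, 83.33% paid off), the weights reproduce the wizard's numbers:

```python
# Target priors from the case study's class variable summary.
prior_1 = 0.1667   # TARGET = 1 (bad debt)
prior_0 = 0.8333   # TARGET = 0 (paid off)

# Inverse prior weights: reward for a correct decision is 1 / prior.
weight_decision1 = 1 / prior_1   # applied when TARGET = 1 (~5.9988)
weight_decision2 = 1 / prior_0   # applied when TARGET = 0 (~1.2000)
```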


It can be shown that, theoretically, the so-called central decision rule optimizes model performance based on the KS statistic. The StatExplore node was used to provide preliminary statistics on the target variable.

BanruptcyInd and TARGET were the only two class variables in the CREDIT data set.

Class Variable Summary Statistics
(maximum 500 observations printed)
Data Role=TRAIN

Variable Name  Role    Levels  Missing  Mode  Mode Pct  Mode 2  Mode 2 Pct
BanruptcyInd   INPUT   2       0        0     84.67     1       15.33
TARGET         TARGET  2       0        0     83.33     1       16.67
The Interval Variable Summary shows missing values on 11 of the 27 interval inputs.

[Output listing: Interval Variable Summary - mean, standard deviation, non-missing and missing counts, minimum, median, maximum, skewness, and kurtosis for the 27 interval inputs. The inputs with missing values are InqTimeLast, TL50UtilCnt, TL75UtilCnt, TLBalHCPct, TLCnt, TLMaxSum, TLOpen24Pct, TLOpenPct, TLSatCnt, TLSatPct, and TLSum.]

75

By creating plots using the Explore window, it was found that several of the interval inputs show somewhat skewed distributions. Transformation of the more severe cases was pursued in regression modeling.

[Explore window: histograms of the interval inputs - for example, Total High Credit All Trade Lines, Total Balance All Trade Lines, Time Since First Trade Line, and the trade-line count inputs show long right tails]

76


Creating Prediction Models: Simple Stepwise Regression

Because it was the most likely model to be selected for deployment, a regression model was considered first.

[Diagram: CREDIT -> Data Partition -> Impute -> Regression, with StatExplore attached]

• In the Data Partition node, 50% of the data was chosen for training and 50% for validation.
• The Impute node replaced missing values for the interval inputs with the input mean (the default for interval-valued input variables), and added unique imputation indicators for each input with missing values.
• The Regression node used the stepwise method for input variable selection, and validation profit for complexity optimization.

The selected model included seven inputs. See line 1197 of the Output window.

Analysis of Maximum Likelihood Estimates

Parameter        DF  Estimate  Standard  Wald        Pr >    Standardized  Exp(Est)
                               Error     Chi-Square  ChiSq   Estimate
Intercept         1  -2.7602   0.4089    45.57       <.0001                0.063
IMP_TLBalHCPct    1   1.8759   0.3295    32.42       <.0001   0.2772       6.527
IMP_TLSatPct      1  -2.6095   0.4515    33.40       <.0001  -0.3363       0.074
InqFinanceCnt24   1   0.0610   0.0149    16.86       <.0001   0.1527       1.063
TLDel3060Cnt24    1   0.3359   0.0623    29.11       <.0001   0.2108       1.399
TLDel60Cnt24      1   0.1126   0.0408     7.62       0.0058   0.1102       1.119
TLOpenPct         1   1.5684   0.4633    11.46       0.0007   0.1792       4.799
TLTimeFirst       1  -0.00253  0.000923   7.50       0.0062  -0.1309       0.997

The odds ratio estimates facilitated model interpretation. Increasing risk was associated with increasing values of IMP_TLBalHCPct, InqFinanceCnt24, TLDel3060Cnt24, TLDel60Cnt24, and TLOpenPct. Increasing risk was associated with decreasing values of IMP_TLSatPct and TLTimeFirst.

77
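As a scoring sketch, the fitted prediction formula can be applied to a new case using the reported parameter estimates. The case's input values below are invented for illustration.

```python
import math

# Reported parameter estimates from the selected stepwise model.
w = {"Intercept": -2.7602, "IMP_TLBalHCPct": 1.8759, "IMP_TLSatPct": -2.6095,
     "InqFinanceCnt24": 0.0610, "TLDel3060Cnt24": 0.3359,
     "TLDel60Cnt24": 0.1126, "TLOpenPct": 1.5684, "TLTimeFirst": -0.00253}

# Hypothetical applicant (input values are made up).
case = {"IMP_TLBalHCPct": 0.7, "IMP_TLSatPct": 0.5, "InqFinanceCnt24": 2,
        "TLDel3060Cnt24": 1, "TLDel60Cnt24": 0, "TLOpenPct": 0.5,
        "TLTimeFirst": 170}

# Logit score, then inverse-logit to get the bad-debt probability.
score = w["Intercept"] + sum(w[k] * v for k, v in case.items())
p_bad = 1 / (1 + math.exp(-score))
```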

77


Odds Ratio Estimates

Effect           Point Estimate
IMP_TLBalHCPct   6.527
IMP_TLSatPct     0.074
InqFinanceCnt24  1.063
TLDel3060Cnt24   1.399
TLDel60Cnt24     1.119
TLOpenPct        4.799
TLTimeFirst      0.997

The iteration plot (found by selecting View ⇒ Model ⇒ Iteration Plot in the Results window) can be set to show average profit versus iteration.

[Iteration Plot: Train and Valid average profit for TARGET by model selection step number, steps 0 to 10]

In theory, the average profit for a model using the defined profit matrix equals 1 + the KS statistic. Thus, the iteration plot (from the Regression node's Results window) showed how the profit (or, in turn, the KS statistic) varied with model complexity. From the plot, the maximum validation profit equaled 1.43, which implies that the maximum KS statistic equaled 0.43.

The actual calculated value of KS (as found using the Model Comparison node) was found to differ slightly from this value (see below).

78
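The KS statistic itself is the maximum gap between the two target classes' score distributions. A sketch on simulated scores (not the CREDIT data):

```python
import numpy as np

# Simulated model scores for the two classes.
rng = np.random.default_rng(2)
scores_bad = rng.normal(loc=1.0, size=500)     # TARGET = 1 cases
scores_good = rng.normal(loc=0.0, size=2500)   # TARGET = 0 cases

# Empirical CDFs of both classes evaluated at every observed score.
thresholds = np.sort(np.concatenate([scores_bad, scores_good]))
cdf_bad = np.searchsorted(np.sort(scores_bad), thresholds, side="right") / len(scores_bad)
cdf_good = np.searchsorted(np.sort(scores_good), thresholds, side="right") / len(scores_good)

# KS = maximum vertical gap between the two empirical CDFs.
ks = float(np.max(np.abs(cdf_bad - cdf_good)))
```

Under the inverse-prior profit matrix described earlier, the average profit of the optimal cutoff decision rule works out to 1 + this gap, which is the relation the text uses.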

78


Creating Prediction Models: Neural Network

While it could not be deployed as the final prediction model, a neural network was used to investigate regression lack of fit.

[Diagram: a Neural Network node attached after the Regression node in the CREDIT flow]

The default settings of the Neural Network node were used in combination with inputs selected by the stepwise Regression node. The iteration plot showed slightly higher validation average profit compared to the stepwise regression model.

[Iteration Plot: Train and Valid average profit for TARGET over 30 training iterations - validation average profit levels off between 1.40 and 1.475]

It was possible (although not likely) that transformations to the regression inputs could improve regression prediction.

Creating Prediction Models: Transformed Stepwise Regression

In assaying the data, it was noted that some of the inputs had rather skewed distributions. Such distributions create high-leverage points that can distort an input's association with the target. The Transform Variables node was used to regularize the distributions of the model inputs before fitting the stepwise regression.
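The effect of a regularizing transformation on a skewed input can be sketched with simulated lognormal data (a log transformation maps it back to a symmetric normal):

```python
import numpy as np

# Simulated heavily right-skewed input.
rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

def skew(v):
    """Sample skewness: third standardized moment."""
    v = np.asarray(v, dtype=float)
    return float(np.mean((v - v.mean()) ** 3) / v.std() ** 3)

raw_skew = skew(x)           # large positive: long right tail
log_skew = skew(np.log(x))   # near zero: tail pulled in
```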

79


[Diagram: a Transform Variables node inserted after Impute, feeding a Transformed Stepwise Regression node]

The Transform Variables node was set to maximize the normality of each interval input by selecting from one of several power and logarithmic transformations (Interval Inputs method: Maximum Normal).
The Transformed Stepwise Regression node performed stepwise selection from the transformed inputs. The selected model had many of the same inputs as the original stepwise regression model, but on a transformed (and difficult to interpret) scale.

Odds Ratio Estimates

Effect                Point Estimate
IMP_TLBalHCPct        4.237
IMP_TLSatPct          0.090
LOG_InqFinanceCnt24   9.648
LOG_TLDel60Cnt24      9.478
SQRT_IMP_TL75UtilCnt  5.684
SQRT_TLDel3060Cnt24   4.708
SQRT_TLTimeFirst      0.050

80


The transformations would be justified (despite the increased difficulty in model interpretation) if they resulted in significant improvement in model fit. Based on the profit calculation, the transformed stepwise regression model showed only marginal performance improvement compared to the original stepwise regression model.

[Iteration Plot: Train and Valid average profit for TARGET by model selection step number]

Creating Prediction Models: Discretized Stepwise Regression

Partitioning input variables into discrete ranges was another common risk-modeling method that was investigated.

[Diagram: Bucket, Quantile, and Optimal discretization branches, each feeding its own stepwise regression node]

81

Three discretization approaches were investigated. The Bucket Input Variables node partitioned each interval input into four bins of equal width. The Bin Input Variables node partitioned each interval input into four bins of equal size. The Optimal Discrete Input Variables node found optimal partitions for each input variable using decision tree methods.

Bucket Transformation

The relatively small size of the CREDIT data set resulted in problems for the bucket stepwise regression model. Many of the bins had a small number of observations, which resulted in quasi-complete separation problems for the regression model, as dramatically illustrated by the selected model's odds ratio report. Go to line 1059 of the Output window.

Odds Ratio Estimates
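The bucket and quantile discretizations can be sketched with pandas on simulated data. On a skewed input, equal-width buckets leave sparse high-end bins, which is exactly what produces the quasi-complete separation noted for the bucket model, while quantile bins keep the counts balanced.

```python
import numpy as np
import pandas as pd

# Simulated right-skewed interval input.
rng = np.random.default_rng(4)
x = pd.Series(rng.exponential(scale=2.0, size=1000))

bucket = pd.cut(x, bins=4)     # equal-width bins: sparse high-end bins
quantile = pd.qcut(x, q=4)     # equal-size bins: ~250 cases per bin

bucket_counts = bucket.value_counts().sort_index()
quantile_counts = quantile.value_counts().sort_index()
```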

Effect               Comparison                                 Point Estimate
BIN_IMP_TL75UtilCnt  01: low-5 vs 04: 15-high                   999.000
BIN_IMP_TL75UtilCnt  02: 5-10 vs 04: 15-high                    999.000
BIN_IMP_TL75UtilCnt  03: 10-15 vs 04: 15-high                   999.000
BIN_IMP_TLBalHCPct   01: low-0.840325 vs 04: 2.520975-high      <0.001
BIN_IMP_TLBalHCPct   02: 0.840325-1.68065 vs 04: 2.520975-high  <0.001
BIN_IMP_TLBalHCPct   03: 1.68065-2.520975 vs 04: 2.520975-high  <0.001
BIN_IMP_TLSatPct     01: low-0.25 vs 04: 0.75-high              4.845
BIN_IMP_TLSatPct     02: 0.25-0.5 vs 04: 0.75-high              1.819
BIN_IMP_TLSatPct     03: 0.5-0.75 vs 04: 0.75-high              1.009
BIN_InqFinanceCnt24  01: low-9.75 vs 04: 29.25-high             0.173
BIN_InqFinanceCnt24  02: 9.75-19.5 vs 04: 29.25-high            0.381
BIN_InqFinanceCnt24  03: 19.5-29.25 vs 04: 29.25-high           0.640
BIN_TLDel3060Cnt24   01: low-2 vs 04: 6-high                    999.000
BIN_TLDel3060Cnt24   02: 2-4 vs 04: 6-high                      999.000

82


BIN_TLDel60CntAll    01: low-4.75 vs 04: 14.25-high             0.171
BIN_TLDel60CntAll    02: 4.75-9.5 vs 04: 14.25-high             0.138
BIN_TLDel60CntAll    03: 9.5-14.25 vs 04: 14.25-high            0.166
BIN_TLTimeFirst      01: low-198.75 vs 04: 584.25-high          999.000
BIN_TLTimeFirst      02: 198.75-391.5 vs 04: 584.25-high        999.000
BIN_TLTimeFirst      03: 391.5-584.25 vs 04: 584.25-high        999.000

The iteration plot showed substantially worse performance compared to the other modeling efforts.

[Iteration Plot: Train and Valid average profit for TARGET by model selection step number]

Bin (or Quantile) Transformation

Somewhat better results were seen with the binned stepwise regression model. By ensuring that each bin included a reasonable number of cases, more stable model parameter estimates could be made. See line 1249 of the Output window.

Odds Ratio Estimates

83
83


Effect                Comparison                            Point Estimate
PCTL_IMP_TLBalHCPct   01: low-0.513 vs 04: 0.8389-high      0.272
PCTL_IMP_TLBalHCPct   02: 0.513-0.7041 vs 04: 0.8389-high   0.452
PCTL_IMP_TLBalHCPct   03: 0.7041-0.8389 vs 04: 0.8389-high  0.630
PCTL_IMP_TLSatPct     01: low-0.3529 vs 04: 0.6886-high     1.860
PCTL_IMP_TLSatPct     02: 0.3529-0.5333 vs 04: 0.6886-high  1.130
PCTL_IMP_TLSatPct     03: 0.5333-0.6886 vs 04: 0.6886-high  1.040
PCTL_InqFinanceCnt24  01: low-1 vs 04: 5-high               0.599
PCTL_InqFinanceCnt24  02: 1-2 vs 04: 5-high                 0.404
PCTL_InqFinanceCnt24  03: 2-5 vs 04: 5-high                 0.807
PCTL_TLDel3060Cnt24   02: 0-1 vs 03: 1-high                 0.453
PCTL_TLDel60Cnt24     02: 0-1 vs 03: 1-high                 0.357
PCTL_TLTimeFirst      01: low-107 vs 04: 230-high           1.688
PCTL_TLTimeFirst      02: 107-152 vs 04: 230-high           1.477
PCTL_TLTimeFirst      03: 152-230 vs 04: 230-high           0.837

84


The improved model fit was also seen in the iteration plot, although the average profit of the selected model was still not as large as the original stepwise regression model's.

[Iteration Plot: Train and Valid average profit for TARGET by model selection step number]

Optimal Transformation

A final attempt at discretization was made using the optimistically named Optimal Discrete transformation. The final 18-degree-of-freedom model included 10 separate inputs (more than any other model). Contents of the Output window starting at line 1698 are shown below.

Odds Ratio Estimates

Effect               Comparison                                     Point Estimate
BanruptcyInd         0 vs 1                                         2.267
OPT_IMP_TL75UtilCnt  01: low-1.5 vs 03: 8.5-high                    0.270
OPT_IMP_TL75UtilCnt  02: 1.5-8.5, MISSING vs 03: 8.5-high           0.409
OPT_IMP_TLBalHCPct   01: low-0.6706, MISSING vs 04: 1.0213-high     0.090
OPT_IMP_TLBalHCPct   02: 0.6706-0.86785 vs 04: 1.0213-high          0.155
OPT_IMP_TLBalHCPct   03: 0.86785-1.0213 vs 04: 1.0213-high          0.250
OPT_IMP_TLSatPct     01: low-0.2094 vs 03: 0.4655-high, MISSING     5.067
OPT_IMP_TLSatPct     02: 0.2094-0.4655 vs 03: 0.4655-high, MISSING  1.970
OPT_InqFinanceCnt24  01: low-2.5, MISSING vs 03: 7.5-high           0.353
OPT_InqFinanceCnt24  02: 2.5-7.5 vs 03: 7.5-high                    0.657
OPT_TLDel3060Cnt24   01: low-1.5, MISSING vs 02: 1.5-high           0.499
OPT_TLDel60Cnt       01: low-0.5, MISSING vs 03: 14.5-high          0.084
OPT_TLDel60Cnt       02: 0.5-14.5 vs 03: 14.5-high                  0.074
OPT_TLDel60Cnt24     01: low-0.5, MISSING vs 03: 5.5-high           0.327
OPT_TLDel60Cnt24     02: 0.5-5.5 vs 03: 5.5-high                    0.882
OPT_TLTimeFirst      01: low-154.5, MISSING vs 02: 154.5-high       1.926
TLOpenPct                                                           3.337

The validation average profit was still slightly smaller than the original model's. A substantial difference in profit between the training and validation data was also observed. Such a difference was suggestive of overfitting by the model.

[Iteration Plot: Train and Valid average profit for TARGET by model selection step number, steps 0 to 15]

86


Assessing the Prediction Models

The collection of models was assessed using the Model Comparison node.

The ROC chart shows a jumble of models with no clear winner.

[ROC chart, Data Role=VALIDATE: sensitivity versus 1 - specificity for the Baseline, Regression, Neural Network, Transformed Stepwise Regression, Bucket Stepwise Regression, Quantile Stepwise Regression, and Optimal Discretized Stepwise Regression models]

87

The Fit Statistics table from the Output window is shown below.

Data Role=Valid

Statistics (Valid)                              Reg      Neural   Reg5     Reg2     Reg4     Reg3
Kolmogorov-Smirnov Statistic                    0.43     0.46     0.42     0.44     0.45     0.39
Average Profit for TARGET                       1.43     1.42     1.42     1.42     1.41     1.38
Average Squared Error                           0.12     0.12     0.12     0.12     0.12     0.13
Roc Index                                       0.77     0.77     0.76     0.78     0.77     0.73
Average Error Function                          0.38     0.39     0.40     0.38     0.39     0.43
Percent Capture Response                        14.40    12.00    11.60    14.40    12.64    9.60
Divisor for VASE                                3000.00  3000.00  3000.00  3000.00  3000.00  3000.00
Error Function                                  1152.26  1168.64  1186.46  1131.42  1158.23  1282.59
Gain                                            180.00   152.00   148.00   192.00   144.89   124.00
Gini Coefficient                                0.54     0.54     0.53     0.56     0.54     0.47
Bin-Based Two-Way Kolmogorov-Smirnov Statistic  0.43     0.44     0.41     0.44     0.45     0.39
Lift                                            2.88     2.40     2.32     2.88     2.53     1.92
Maximum Absolute Error                          0.97     0.99     1.00     0.98     0.99     1.00
Misclassification Rate                          0.17     0.17     0.17     0.17     0.17     0.17
Mean Square Error                               0.12     0.12     0.12     0.12     0.12     0.13
Sum of Frequencies                              1500.00  1500.00  1500.00  1500.00  1500.00  1500.00
Total Profit for TARGET                         2143.03  2131.02  2127.45  2127.44  2121.42  2072.25
Root Average Squared Error                      0.35     0.35     0.35     0.34     0.35     0.36
Percent Response                                48.00    40.00    38.67    48.00    42.13    32.00
Root Mean Square Error                          0.35     0.35     0.35     0.34     0.35     0.36
Sum of Square Errors                            359.70   367.22   371.58   352.69   366.76   381.44
Sum of Case Weights Times Freq                  3000.00  3000.00  3000.00  3000.00  3000.00  3000.00

The best model, as measured by average profit, was the original regression. The neural network had the highest KS statistic, and the log-transformed regression, Reg2, had the highest ROC index. If the purpose of a credit risk model is to order the cases, then Reg2 was preferred, because the ROC index measures ranking performance. In short, the best model for deployment was as much a matter of taste as of statistical performance: the relatively small validation data set used to compare the models did not produce a clear winner. In the end, the model selected for deployment was the original stepwise regression, because it offered consistently good performance across multiple assessment measures.

88



Introduction to Predictive Modeling: Decision Trees

Predictive Modeling Applications
• Database marketing
• Financial risk management
• Fraud detection
• Process monitoring
• Pattern detection

89


Predictive Modeling Training Data

[Training data grid: inputs and target columns]

training data case: categorical or numeric input and target measurements

Predictive Model

[Training data grid: inputs and target columns]

predictive model: a concise representation of the input and target association

90


Predictive Model

[Model applied to new input measurements]

predictions: output of the predictive model given a set of input measurements

Modeling Essentials
• Predict new cases
• Select useful inputs
• Optimize complexity

91


Three Prediction Types

[Inputs table feeding a prediction column]

• decisions
• rankings
• estimates

Decision Predictions Predictive model uses input measurements to make the best decision for each case.

92


Decision Predictions

[prediction column: primary, secondary, tertiary, primary, secondary]

Predictive model uses input measurements to make the best decision for each case.

Ranking Predictions

[prediction column of scores: 720, 520, 580, 470, 630]

Predictive model uses input measurements to optimally rank each case.

93


Estimate Predictions

[prediction column of numeric estimates]

Predictive model uses input measurements to optimally estimate the target value.

Modeling Essentials - Predict Review

Predict new cases: decide, rank, or estimate.

12

94


Modeling Essentials

Select useful inputs

13

The Curse of Dimensionality

1-D, 2-D, 3-D

14

95
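The combinatorial growth behind the curse of dimensionality can be sketched in a few lines of Python (an illustration added here, not part of the original course material; the numbers are arbitrary):

```python
# Tiling each input axis with 10 bins needs 10**d cells in d dimensions,
# so a fixed-size training sample spreads exponentially thinner as inputs
# are added -- the motivation for selecting useful inputs.
def cells(bins_per_axis: int, dims: int) -> int:
    """Grid cells needed to tile a d-dimensional input space."""
    return bins_per_axis ** dims

n_cases = 10_000
for d in (1, 2, 3):
    # average number of training cases available per cell
    print(d, cells(10, d), n_cases / cells(10, d))
```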


Input Reduction Strategies
• Redundancy
• Irrelevancy

15

Input Reduction - Redundancy

Input x2 has the same information as input x1.

16

96


Input Reduction Strategies: Redundancy, Irrelevancy

17

Input Reduction - Irrelevancy

Predictions change with input x4 but much less with input x3.

18

97
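Redundancy between inputs is often screened with a simple correlation check. A minimal pure-Python sketch (illustrative only; the data are invented and the course itself relies on SAS Enterprise Miner's built-in selection):

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 4.0, 6.2, 7.9, 10.1]   # nearly a rescaling of x1: redundant
x3 = [0.3, -1.2, 2.5, 0.0, 1.1]   # unrelated to x1
print(pearson(x1, x2))  # near 1 -> one of the pair is a removal candidate
print(pearson(x1, x3))
```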


Modeling Essentials - Select Review

Select useful inputs

19

Modeling Essentials

Optimize complexity

20

98

Eradicate redundancies and irrelevancies.


Model Complexity

1

Data Partitioning

[Training Data and Validation Data grids: inputs and target columns]

Partition available data into training and validation sets.

22

99
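The partitioning step can be sketched as a seeded random split (a Python illustration of the idea; SAS Enterprise Miner's Data Partition node performs this internally):

```python
import random

def partition(cases, train_fraction=0.6, seed=12345):
    """Randomly split cases into training and validation sets."""
    rng = random.Random(seed)          # fixed seed -> reproducible split
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, valid = partition(list(range(100)))
print(len(train), len(valid))  # 60 40
```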


Predictive Model Sequence

Create sequence of models with increasing complexity.

model complexity

23

Model Performance Assessment

[Training Data and Validation Data grids; model complexity axis; validation assessment]

Rate model performance using validation data.

24

100


Model Selection

[Training Data and Validation Data grids; model complexity axis; validation assessment]

Select simplest model with highest validation assessment.

25

26

101
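The selection rule — the simplest model among those tied for the best validation assessment — can be sketched as follows (hypothetical numbers, using leaf count as the complexity measure):

```python
# Hypothetical assessment values for a model sequence of growing complexity.
sequence = [
    {"leaves": 1, "valid_profit": 1.00},
    {"leaves": 2, "valid_profit": 1.60},
    {"leaves": 3, "valid_profit": 1.72},
    {"leaves": 5, "valid_profit": 1.72},   # tied, but more complex
    {"leaves": 9, "valid_profit": 1.68},   # overfit: validation drops
]

def select(models):
    """Simplest model (fewest leaves) among those with the best
    validation assessment."""
    best = max(m["valid_profit"] for m in models)
    return min((m for m in models if m["valid_profit"] == best),
               key=lambda m: m["leaves"])

print(select(sequence)["leaves"])  # 3
```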


Modeling Essentials - Optimize Review

Optimize complexity

Tune models with validation data

27

Creating Training and Validation Data This demonstration illustrates how to partition a data source into training and validation sets.

28

102


Model Essentials - Decision Trees
• Predict new cases: prediction rules
• Select useful inputs: split search
• Optimize complexity: pruning

29

Simple Prediction Illustration

[Scatter plot of training data on inputs x1 (horizontal) and x2 (vertical), both scaled 0.0 to 1.0]

Predict dot color for each x1 and x2.

30

103


Simple Prediction Illustration

[Scatter plot of training data on inputs x1 and x2]

Predict dot color for each x1 and x2.

31

104


Decision Tree Prediction Rules

[Partitioned scatter plot with a leaf's prediction: Decision = …, Estimate = 0.70]

33

Decision Tree Split Search

[Candidate split at 0.52 on x1 divides the cases into left and right groups, summarized in a 2x2 contingency table]

Calculate the logworth of every partition on input x1.

34

105
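A rough sketch of the split-search calculation for a binary target, using a Pearson chi-square logworth (an illustration with invented data; SAS Enterprise Miner additionally applies Kass and Bonferroni adjustments not shown here):

```python
import math

def logworth(n00, n01, n10, n11):
    """-log10 p-value of the Pearson chi-square test on a 2x2
    branch-by-target contingency table (1 degree of freedom)."""
    n = n00 + n01 + n10 + n11
    rows, cols = (n00 + n01, n10 + n11), (n00 + n10, n01 + n11)
    chi2 = sum((obs - r * c / n) ** 2 / (r * c / n)
               for obs, r, c in ((n00, rows[0], cols[0]), (n01, rows[0], cols[1]),
                                 (n10, rows[1], cols[0]), (n11, rows[1], cols[1])))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square(1 df) right tail
    return -math.log10(p_value)

def split_counts(cases, cutoff):
    """Left/right branch counts by target for a candidate cutoff."""
    left = [t for v, t in cases if v < cutoff]
    right = [t for v, t in cases if v >= cutoff]
    return (left.count(0), left.count(1), right.count(0), right.count(1))

cases = [(0.2, 0), (0.3, 0), (0.4, 1), (0.55, 0), (0.7, 1), (0.9, 1)]
best_cutoff = max((0.25, 0.5, 0.63, 0.8),
                  key=lambda s: logworth(*split_counts(cases, s)))
print(best_cutoff)  # 0.63
```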


Decision Tree Split Search

[Candidate split at 0.63 on x2 divides the cases into bottom (46%, 54%) and top (35%, 65%) groups; max logworth(x2) = 4.92]

36

106


Decision Tree Split Search

Compare partition logworth ratings.

left: 53% / 47%     right: 42% / 58%    max logworth(x1) = 0.95
bottom: 54% / 46%   top: 35% / 65%      max logworth(x2) = 4.92

37

Decision Tree Split Search

[Partition rule: x2 < 0.63 versus x2 >= 0.63]

Create a partition rule from the best partition across all inputs.

38

107


Decision Tree Split Search

[Tree: root split x2 < 0.63 / >= 0.63; the < 0.63 branch splits again at x1 < 0.52 / >= 0.52]

Create a second partition rule.

39

Decision Tree Split Search

[Maximal tree: recursive partitioning of the scatter plot]

Repeat to form a maximal tree.

40

108


Model Essentials - Decision Trees

Select useful inputs: split search

41

Model Essentials - Decision Trees

Optimize complexity: pruning

42

109


The Maximal Tree

[Training Data and Validation Data grids; maximal tree at the complex end of the model sequence]

Create sequence of models with increasing complexity. Maximal tree is most complex model in sequence.

43

The Maximal Tree

[Model sequence 1, 2, 3, ... along the model complexity axis]

Maximal tree is most complex model in sequence.

44

110


Pruning One Split

[Maximal tree with one split removed]

Next model in sequence formed by pruning one split from maximal tree.

45

Subsequent Pruning

[Successively smaller subtrees along the model complexity axis]

Continue pruning until all subtrees are considered.

46

111


Validation Assessment

[Training Data and Validation Data grids; model complexity axis; validation assessment]

Choose simplest model with highest validation assessment.

47

Validation Assessment

[Training Data and Validation Data grids]

What are appropriate validation assessment ratings?

48

112



Assessment Statistics

Ratings depend on...
• target measurement
• prediction type

50

113


Assessment Statistics

Ratings depend on target measurement and prediction type.

51

Binary Targets

[Predictions for a binary target]

• decisions
• rankings
• estimates

52

114


Decision Optimization

[Inputs, target, and decisions columns]

53

Decision Optimization - Accuracy

Maximize accuracy: agreement between outcome and prediction (true positives and true negatives).

54

115


Decision Optimization - Accuracy

[Inputs, target, and decision columns; matching primary/secondary cases marked]

Maximize accuracy: agreement between outcome and prediction (true positives and true negatives).

55

Decision Optimization - Misclassification

[Mismatched cases marked as false negatives and false positives]

Minimize misclassification: disagreement between outcome and prediction (false negatives and false positives).

56

116
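The accuracy and misclassification calculations can be sketched directly (a Python illustration with made-up decisions, not part of the course software):

```python
def accuracy(targets, decisions):
    """Fraction of cases where the decision matches the outcome
    (true positives plus true negatives over all cases)."""
    hits = sum(1 for t, d in zip(targets, decisions) if t == d)
    return hits / len(targets)

targets   = [1, 0, 0, 1, 0, 1, 0, 0]
decisions = [1, 0, 1, 1, 0, 0, 0, 0]   # one false positive, one false negative
acc = accuracy(targets, decisions)
print(acc, 1 - acc)  # accuracy and misclassification rate
```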


Ranking Optimization

[rankings column of scores, e.g. 520]

57

Ranking Optimization - Concordance

Maximize concordance: proper ordering of primary and secondary outcomes (target=0 → low score, target=1 → high score).

58

117
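Concordance and discordance count ordered pairs of primary and secondary cases; a small Python sketch using scores like those shown earlier (the data are invented):

```python
def pair_counts(targets, scores):
    """Concordant, discordant, and tied pairs over all
    (primary, secondary) outcome pairs."""
    ones = [s for t, s in zip(targets, scores) if t == 1]
    zeros = [s for t, s in zip(targets, scores) if t == 0]
    conc = sum(1 for s1 in ones for s0 in zeros if s1 > s0)  # proper order
    disc = sum(1 for s1 in ones for s0 in zeros if s1 < s0)  # improper order
    ties = len(ones) * len(zeros) - conc - disc
    return conc, disc, ties

targets = [0, 0, 1, 0, 1]
scores = [520, 580, 720, 470, 630]
print(pair_counts(targets, scores))  # (6, 0, 0): perfectly concordant
```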


Ranking Optimization - Discordance

[Inputs, target, and prediction columns with scores 720, 520]

Minimize discordance: improper ordering of primary and secondary outcomes (target=0 → high score, target=1 → low score).

59

Estimate Optimization

[estimates column]

60

118

61

Estimate Optimization - Squared Error

Minimize squared error: the squared difference between target and prediction, (target - estimate)^2.

62

119
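Average squared error is the usual summary of estimate quality; a minimal Python sketch (invented values):

```python
def average_squared_error(targets, estimates):
    """Mean of (target - estimate)^2 over all cases."""
    return sum((t - e) ** 2 for t, e in zip(targets, estimates)) / len(targets)

targets = [1, 0, 0, 1]
estimates = [0.9, 0.2, 0.1, 0.6]
print(average_squared_error(targets, estimates))  # 0.055
```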


Complexity Optimization - Summary
• decisions: accuracy / misclassification
• rankings: concordance / discordance
• estimates: squared error

63

This demonstration illustrates how to assess a tree model.

120


Growing Trees Autonomously This demonstration illustrates growing a decision tree model autonomously.

65

Autonomous Decision Tree Defaults

Default settings:
• Maximum Branches: 2
• Splitting Rule Criterion: Logworth
• Subtree Method: Average Profit
• Tree Size Options: Bonferroni Adjust, Split Adjust, Maximum Depth, Leaf Size

66

121


Tree Variations: Maximum Branches
• Complicates split search
• Trades height for width
• Uses heuristic shortcuts

67

Tree Variations: Maximum Branches

[Properties panel: Maximum Branch sets the maximum branches in a split; Exhaustive sets the exhaustive search size limit]

68

122


Tree Variations: Splitting Rule Criterion
• Yields similar splits
• Grows enormous trees
• Favors many-level inputs

69

Tree Variations: Splitting Rule Criterion

[Properties panel: Criterion selects the split criterion]

Categorical criteria: Chi-square logworth, Entropy, Gini
Interval criteria: Variance, Prob-F logworth
Logworth adjustments: Bonferroni Adjustment, Time of Kass Adjustment, Split Adjustment

70

123


Tree Variations: Subtree Method

Controls complexity

71

Tree Variations: Subtree Method

[Properties panel: Method selects the subtree]

Pruning options: Assessment, Do not prune, Prune to size
Pruning metrics: Decision, Average Square Error, Misclassification, Lift

72

124


Tree Variations: Tree Size Options
• Avoids orphan nodes
• Controls sensitivity
• Grows large trees

73

Tree Variations: Tree Size Options

[Properties panel: logworth threshold (Significance Level), maximum tree depth, minimum leaf size, and threshold depth adjustment (Split Adjustment)]

74

125


Case Study IV: University Enrollment Prediction

126


A.4 Enrollment Management Case Study

Case Study Description

In the fall of 2004, the administration of a large private university requested that the Office of Enrollment Management and the Office of Institutional Research work together to identify prospective students who would most likely enroll as new freshmen in the Fall 2005 semester. The administration stated several goals for this project:
• increase new freshman enrollment
• increase diversity
• increase SAT scores of entering students

Historically, inquiries numbered about 90,000 students, and the university enrolled from 2400 to 2800 new freshmen each Fall semester.

The diagram containing this analysis is stored as an XML file on the course data disk. You can open this file by right-clicking Diagrams and selecting Import Diagram from XML in SAS Enterprise Miner. All nodes in the opened file, except the data node, contain the property settings outlined in this case study. If you want to run the diagram, you need to re-create the case study data set using the metadata settings indicated below.

127


Case Study Training Data

Name                 Model Role  Measurement Level  Description
ACADEMIC_INTEREST_1  Rejected    Nominal            Primary academic interest code
ACADEMIC_INTEREST_2  Rejected    Nominal            Secondary academic interest code
CAMPUS_VISIT         Input       Nominal            Campus visit code
CONTACT_CODE1        Rejected    Nominal            First contact code
CONTACT_DATE1        Rejected    Nominal            First contact date
ENROLL               Target      Binary             1=Enrolled F2004, 0=Not enrolled F2004
ETHNICITY            Rejected    Nominal            Ethnicity
IRSCHOOL             Rejected    Nominal            High school code
INSTATE              Input       Binary             1=In state, 0=Out of state
LEVEL_YEAR           Rejected    Unary              Student academic level
REFERRAL_CNTCTS      Input       Ordinal            Referral contact count
SELF_INIT_CNTCTS     Input       Interval           Self-initiated contact count
SOLICITED_CNTCTS     Input       Ordinal            Solicited contact count
TERRITORY            Input       Nominal            Recruitment area
TOTAL_CONTACTS       Input       Interval           Total contact count
TRAVEL_INIT_CNTCTS   Input       Ordinal            Travel-initiated contact count
AVG_INCOME           Input       Interval           Commercial HH income estimate
DISTANCE             Input       Interval           Distance from university
HSCRAT               Input       Interval           5-year high school enrollment rate
INIT_SPAN            Input       Interval           Time from first contact to enrollment date
INT1RAT              Input       Interval           5-year primary interest code rate
INT2RAT              Input       Interval           5-year secondary interest code rate
INTEREST             Input       Ordinal            Number of indicated extracurricular interests
MAILQ                Input       Ordinal            Mail qualifying score (1=very interested)

(Continued on the next page.)

128


Name       Model Role  Measurement Level  Description
PREMIERE   Input       Binary
SATSCORE   Rejected    Interval
SEX        Rejected    Binary             Sex
STUEMAIL   Input       Binary             1=Have e-mail address, 0=Do not
TELECQ     Rejected    Ordinal            Telecounseling qualifying score (1=very interested)

The Office of Institutional Research assumed the task of building a predictive model, and the Office of Enrollment Management served as consultant to the project. The Office of Institutional Research built and maintained a data warehouse that contained information about enrollment for the past six years. It was decided that inquiries for Fall 2004 would be used to build the model to help shape the Fall 2005 freshman class. The data set Inq2005 was built over a period of several months in consultation with Enrollment Management. The data set included variables that could be classified as demographic, financial, number of correspondences, student interests, and campus visits. Many variables were created using historical data and trends. For example, high school code was replaced by the percentage of inquirers from that high school over the past five years who enrolled. The resulting data set included over 90,000 observations and over 50 variables. For this case study, the number of variables was reduced. The data set Inq2005 is in the AAEM library, and the variables are described in the table above. Some of the variables were automatically rejected based on the number of missing values. The nominal variables ACADEMIC_INTEREST_1, ACADEMIC_INTEREST_2, and IRSCHOOL were rejected because they were replaced by the interval variables INT1RAT, INT2RAT, and HSCRAT, respectively. For example, academic interest codes 1 and 2 were replaced by the percentage of inquirers over the past five years who indicated those interest codes and then enrolled. The variable IRSCHOOL is the high school code of the student, and it was replaced by the percentage of inquirers from that high school over the last five years who enrolled. The variables ETHNICITY and SEX were rejected because they cannot be used in admission decisions.
Several variables count the various types of contacts the university has with the students.

Accessing and Assaying the Data A SAS Enterprise Miner data source was defined using the metadata settings indicated above. The StatExplore node was used to provide preliminary statistics on the input variables.

129


The following is extracted from the StatExplore node's Results window:

Class Variable Summary Statistics (maximum 500 observations printed)
Data Role=TRAIN

Data   Variable            Role    Number     Missing  Mode  Mode        Mode2  Mode2
Role   Name                        of Levels                 Percentage         Percentage
TRAIN  CAMPUS_VISIT        INPUT   3          0        0     96.61       1      3.31
TRAIN  Instate             INPUT   2          0        Y     62.04       N      37.96
TRAIN  REFERRAL_CNTCTS     INPUT   6          0        0     96.46       1      3.21
TRAIN  SOLICITED_CNTCTS    INPUT   8          0        0     52.45       1      41.60
TRAIN  TERRITORY           INPUT   12         1        2     15.98       5      15.34
TRAIN  TRAVEL_INIT_CNTCTS  INPUT   7          0        0     67.00       1      29.90
TRAIN  Interest            INPUT   4          0        0     95.01       1      4.62
TRAIN  mailq               INPUT   5          0        5     69.33       2      12.80
TRAIN  premiere            INPUT   2          0        0     97.11       1      2.89
TRAIN  stuemail            INPUT   2          0        0     51.01       1      48.99
TRAIN  Enroll              TARGET  2          0        0     96.86       1      3.14
The class input variables are listed first. Notice that most of the count variables have a high percentage of 0s.

Distribution of Class Target and Segment Variables (maximum 500 observations printed)
Data Role=TRAIN

Data   Variable  Role    Level  Frequency  Percent
Role   Name                     Count
TRAIN  Enroll    TARGET  0      88614      96.8650
TRAIN  Enroll    TARGET  1      2868       3.1350

Next is the target distribution. Only 3.1% of the target values are 1s, making a 1 a rare event. Standard practice in this situation is to separately sample the 0s and 1s. The Sample tool, used below, enables you to create a stratified sample in SAS Enterprise Miner.

Interval Variable Summary Statistics (maximum 500 observations printed)
Data Role=TRAIN

Variable          Role   Mean      Standard   Non      Missing  Minimum   Median    Maximum   Skewness  Kurtosis
                                   Deviation  Missing
SELF_INIT_CNTCTS  INPUT  1.214119  1.666529   91482    0        0         1         56        2.916263  21.50072
TOTAL_CONTACTS    INPUT  2.166098  1.852537   91482    0        1         2         58        3.062389  19.60427
avg_income        INPUT  47315.33  20608.89   70553    20929    4940      42324     200001    1.258231  1.874903
distance          INPUT  380.4276  397.9788   72014    19468    0.417124  183.5467  4798.899  2.276541  9.369703
hscrat            INPUT  0.037652  0.057399   91482    0        0         0.033333  1         7.021978  93.31547
init_span         INPUT  19.68616  8.722109   91482    0        -216      19        228       0.758461  10.43657
int1rat           INPUT  0.037091  0.024026   91482    0        0         0.042105  1         3.496845  74.08503
int2rat           INPUT  0.042896  0.025244   91482    0        0         0.05667   1         3.215683  56.32374
Finally, interval variable summary statistics are presented. Notice that avgjtacome and distance have missing values.

130


The Explore window was used to study the distribution of the interval variables.

[Explore window for AAEM.INQ2005 (91482 rows): histograms of TOTAL_CONTACTS, init_span, avg_income, int1rat, int2rat, distance, SELF_INIT_CNTCTS, and hscrat]
The apparent skewness of all inputs suggests that some transformations might be needed for regression models.

131


Creating a Training Sample

Cases from each target level were separately sampled. All cases with the primary outcome were selected. For each primary outcome case, seven secondary outcome cases were selected. This created a training sample with a 12.5% overall enrollment rate. The Sample tool was used to create a training sample for subsequent modeling.

To create the sample as described, the following modifications were made to the Sample node's properties panel:

1. Type 100 as the Percentage value (in the Size property group).
2. Select Criterion ⇒ Level Based (in the Stratified property group).
3. Type 12.5 as the Sample Proportion value (in the Level Based Options property group).

[Sample node properties: Sample Method = Random, Random Seed = 12345, Type = Percentage, Percentage = 100.0, Criterion = Level Based, Level Selection = Event, Level Proportion = 100.0, Sample Proportion = 12.5]

132
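The effect of the level-based sample can be sketched in Python (an illustration of the logic, not the SAS implementation; the counts come from the tables above):

```python
import random

def level_based_sample(cases, target_proportion=0.125, seed=12345):
    """Keep every primary-outcome (target=1) case and randomly sample
    secondary-outcome cases until 1s make up target_proportion of the sample."""
    ones = [c for c in cases if c["target"] == 1]
    zeros = [c for c in cases if c["target"] == 0]
    n_zeros = round(len(ones) * (1 - target_proportion) / target_proportion)
    rng = random.Random(seed)
    return ones + rng.sample(zeros, n_zeros)

# 2868 enrollees and 88614 non-enrollees, as in the case study data
population = [{"target": 1}] * 2868 + [{"target": 0}] * 88614
sample = level_based_sample(population)
print(len(sample))  # 2868 + 20076 = 22944
```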


The Sample node Results window shows that all primary outcome cases are selected, along with a sufficient number of secondary outcome cases to achieve the 12.5% primary outcome proportion.

Summary Statistics for Class Targets (maximum 500 observations printed)

Data=DATA

Variable  Numeric  Formatted  Frequency  Percent
          Value    Value      Count
Enroll    0        0          88614      96.8650
Enroll    1        1          2868       3.1350

Data=SAMPLE

Variable  Numeric  Formatted  Frequency  Percent
          Value    Value      Count
Enroll    0        0          20076      87.5
Enroll    1        1          2868       12.5

Configuring Decision Processing

The primary purpose of the predictions was decision optimization and, secondarily, ranking. An applicant was considered a good candidate if his or her probability of enrollment was higher than average. Because of the Sample node, decision information consistent with the above objectives could not be entered in the data source node. To incorporate decision information, the Decisions tool was incorporated in the analysis.

[Process flow: INQ2005 → StatExplore; INQ2005 → Sample → Decisions]
133


These steps were followed to configure the Decisions node:

1. In the Properties panel of the Decisions node, set Decisions to Custom. Then select Custom Editor ⇒ ….

After the analysis path is updated, the Decision window appears.

[Decision Processing - Decisions window, Targets tab: Name = Enroll, Measurement Level = Binary, Target level order = Descending, Event level = 1, Format = 3.0]

2. Select the Decisions tab.
134



3. Select Default with Inverse Prior Weights.

[Decisions tab: "Do you want to use the decisions?" = Yes; DECISION1 and DECISION2 rows with Cost Variable = None and Constant = 0.0]

4. Select the Decision Weights tab.

[Decision Weights tab: decision function = Maximize; weights: level 1 — DECISION1 = 8.0, DECISION2 = 0.0; level 0 — DECISION1 = 0.0, DECISION2 = 1.14285714…]

The nonzero values used in the decision matrix are the inverses of the prior probabilities (1/0.125 = 8 and 1/0.875 = 1.142857). Such a decision matrix, sometimes referred to as the central decision rule, forces a primary decision when the estimated primary outcome probability for a case exceeds the primary outcome prior probability (0.125 in this case).
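The inverse-prior weights and the resulting central decision rule can be verified with a few lines of Python (illustrative; the Decisions node applies this logic internally):

```python
def inverse_prior_weights(prior_primary):
    """Decision weights from the inverse priors; with these weights,
    maximizing expected profit decides 'primary' exactly when the
    estimated primary probability exceeds the prior."""
    w_primary = 1.0 / prior_primary            # 1/0.125 = 8
    w_secondary = 1.0 / (1.0 - prior_primary)  # 1/0.875 = 1.142857...
    return w_primary, w_secondary

def decide(p_primary, prior_primary=0.125):
    """Profit-maximizing decision under the inverse-prior weights."""
    w1, w0 = inverse_prior_weights(prior_primary)
    return "primary" if p_primary * w1 > (1 - p_primary) * w0 else "secondary"

print(decide(0.13))  # just above the 0.125 prior -> primary
print(decide(0.10))  # below the prior -> secondary
```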

135


Creating Prediction Models (All Cases)

Two rounds of predictive modeling were performed. In the first round, all cases were considered for model building. From the Decisions node, partitioning, imputation, modeling, and assessment were performed. The completed analysis appears as shown.

If the Stepwise Regression model is not connected to the Model Comparison node, you might have to first delete the connections for the Instate Regression and Neural Network nodes to the Model Comparison node. Then connect the Stepwise Regression node, Neural Network node, and Regression nodes - in that order - to the Model Comparison node.

[Completed process flow diagram]
• The Data Partition node used 60% for training and 40% for validation.
• The Impute node used the Tree method for both class and interval variables. Unique missing indicator variables were also selected and used as inputs.
• The stepwise regression model was used as a variable selection method for the Neural Network and second Regression nodes.
• The Regression node labeled Instate Regression included the variables from the Stepwise Regression node and the variable Instate. It was felt that prospective students behave differently based on whether they are in state or out of state.

In this implementation of the case study, the Stepwise Regression node selected three inputs: high school, self-initiated contact count, and student e-mail indicator. The model output is shown below.

Analysis of Maximum Likelihood Estimates

                                Standard  Wald                    Standardized
Parameter         DF  Estimate  Error     Chi-Square  Pr > ChiSq  Estimate      Exp(Est)
INTERCEPT         1   -12.1422  18.9832   0.41        0.5224                    0.000
SELF_INIT_CNTCTS  1   0.6895    0.0203    1156.19     <.0001      0.8773        1.993
HSCRAT            1   16.4261   0.8108    410.46      <.0001      0.7506        999.000
STUEMAIL 0        1   -7.7776   18.9824   0.17        0.6820                    0.000

Odds Ratio Estimates

Effect            Point Estimate
SELF_INIT_CNTCTS  1.993
HSCRAT            999.000
STUEMAIL 0 vs 1   <0.001
The unusual odds ratio estimates for HSCRAT and STUEMAIL result from an extremely strong association in those inputs. For example, certain high schools had all applicants or no applicants enroll. Likewise, very few students enrolled who did not provide an e-mail address.
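The reported odds ratios are simply exponentiated logistic regression coefficients; the SELF_INIT_CNTCTS estimate above can be checked in Python:

```python
import math

# For binary logistic regression, a one-unit increase in an input
# multiplies the odds of the event by exp(coefficient).
coef_self_init = 0.6895           # estimate from the stepwise model above
odds_ratio = math.exp(coef_self_init)
print(round(odds_ratio, 3))       # matches the reported 1.993
```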

136


Adding the INSTATE input in the Instate Regression model changed the significance of inputs selected by the stepwise regression model. The input STUEMAIL is no longer statistically significant after including the INSTATE input.

Analysis of Maximum Likelihood Estimates

                                Standard  Wald                    Standardized
Parameter         DF  Estimate  Error     Chi-Square  Pr > ChiSq  Estimate      Exp(Est)
INTERCEPT         1   -12.0541  16.7449   0.52        0.4716                    0.000
INSTATE N         1   -0.4145   0.0577    51.67       <.0001                    0.661
SELF_INIT_CNTCTS  1   0.6889    0.0196    1233.22     <.0001      0.8231        1.992
HSCRAT            1   16.2327   0.7553    461.95      <.0001      0.7142        999.000
STUEMAIL 0        1   -7.3528   16.7443   0.19        0.6606                    0.001

Odds Ratio Estimates

Effect            Point Estimate
INSTATE N vs Y    0.437
SELF_INIT_CNTCTS  1.992
HSCRAT            999.000
STUEMAIL 0 vs 1   <0.001

A slight increase in validation profit (the criterion used to tune models) was found using the neural network model. The tree provides insight into the strength of the model fit. The Subtree Assessment plot shows the highest profit at 17 leaves. Most of the predictive performance, however, is provided by the initial splits.

[Subtree Assessment Plot: average profit for Enroll (training and validation) versus number of leaves]

137


A simpler tree is scrutinized to aid in interpretation. The tree model was rerun with properties changed as follows to produce a tree with three leaves: Method = N, Number of Leaves = 3.

Node Id: 1 (root)     Train    Validation
  0:                  87.5%    87.5%
  1:                  12.5%    12.5%
  Count:              13766    5173

Split on SELF_INIT_CNTCTS: < 3.5 or Missing → Node 2; >= 3.5 → Node 3.

Node Id: 2            Train    Validation
  0:                  96.5%    97.0%
  1:                  3.5%     3.0%
  Count:              11758    7786

Node Id: 3            Train    Validation
  0:                  34.8%    34.3%
  1:                  65.2%    65.7%
  Count:              2008     1352

Node 2 splits again on SELF_INIT_CNTCTS: < 2.5 or Missing → Node 4; >= 2.5 → Node 5.

Node Id: 4            Train    Validation
  0:                  98.2%    98.5%
  1:                  1.8%     1.5%
  Count:              10532    7301

Node Id: 5            Train    Validation
  0:                  73.8%    74.2%
  1:                  26.2%    25.8%
  Count:              826      485

Students with three or fewer self-initiated contacts rarely enrolled (as seen in the left leaf of the first split). Enrollment was even rarer for students with two or fewer self-initiated contacts (as seen in the left leaf of the second split). Notice that the primary target percentage is rounded down. Also notice that most of the secondary target cases can be found in the lower left leaf. The decision tree results shown in the rest of this case study are generated by the original, 17-leaf tree.

138


Assessing the Prediction Models

Model performance was compared in the Model Comparison node.

If the Stepwise Regression model does not appear in the ROC chart, it might not be connected to the Model Comparison node. You might have to first delete the connections for the Instate Regression and Neural Network nodes to the Model Comparison node. Connect the Stepwise Regression node, Neural Network node, and Regression nodes - in that order - to the Model Comparison node and rerun the Model Comparison node to make all models visible.

[ROC Chart: Enroll, Data Role = VALIDATE (sensitivity versus 1 - specificity): Baseline, Neural Network, Instate Regression, Stepwise Regression, and Decision Tree]

The validation ROC chart showed extremely good performance for all models. The neural model seemed to have a slight edge over the other models. This was mirrored in the Fit Statistics table (abstracted below to show only the validation performance).

139


Data Role=Valid

Statistics (Valid)              Neural    Tree      Reg       Reg2
Kolmogorov-Smirnov Statistic    0.89      0.88      0.87      0.86
Average Profit for Enroll       1.88      1.88      1.87      1.86
Average Squared Error           0.04      0.04      0.04      0.04
Roc Index                       0.98      0.96      0.98      0.98
Percent Capture Response        30.94     30.95     29.72     29.55
Divisor for VASE                18356.00  18356.00  18356.00  18356.00
Gain                            576.37    519.90    552.83    552.83
Sum of Frequencies              9178.00   9178.00   9178.00   9178.00
Total Profit for Enroll         17285.71  17256.00  17122.29  17099.43
Root Average Squared Error      0.19      0.20      0.20      0.20
Percent Response                77.34     77.36     74.29     73.85
Sum of Square Errors            657.78    735.99    754.37    752.78
Sum of Case Weights Times Freq  18356.00  18356.00  18356.00  18356.00
It should be noted that a ROC Index of 0.98 needed careful consideration because it suggested a near-perfect separation of the primary and secondary outcomes. The decision tree model provides some insight into this apparently outstanding model fit. Self-initiated contacts are critical to enrollment. Fewer than three self-initiated contacts almost guarantees non-enrollment.

Creating Prediction Models (Instate-Only Cases)

A second round of analysis was performed on instate-only cases. The analysis sample was reduced using the Filter node. The Filter node was attached to the Decisions node, as shown below.



The following configuration steps were applied:

1. In the Filter Out of State node, select Default Filtering Method ⇒ None for both the class and interval variables.

[Properties panel: under Class Variables and under Interval Variables, Default Filtering Method is set to None; Keep Missing Values remains Yes.]

2. Select the ellipsis (...) button next to Class Variables. After the path is updated, the Interactive Class Filter window appears.

3. Select Generate Summary and then select Yes to generate summary statistics.



4. Select Instate. The Interactive Class Filter window is updated to show the distribution of the Instate input.

[Interactive Class Filter window: "Select values to remove from the sample" above a bar chart of Instate values, with Apply Filter, Clear Filter, Refresh Summary, OK, and Cancel buttons; a variable table below lists the class inputs with their Filtering Method (Default for all inputs except Instate, which becomes User Specified) and Keep Missing Values settings.]

5. Select the N bar and select Apply Filter.

6. Select OK to close the Interactive Class Filter window.

7. Run the Filter node and view the results.

Excluded Class Values
(maximum 500 observations printed)

   Variable   Role    Level   Train Count   Train Percent   Filter Method
   Instate    INPUT   N       0200          35.7392         MANUAL

All out-of-state cases were filtered from the analysis. After filtering, an analysis similar to the one above was conducted with stepwise regression, neural network, and decision tree models.
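What the Filter node does here amounts to a row subset on the Instate flag. A plain-Python sketch with made-up rows (the real work happens inside the node; the Y/N coding mirrors the case study):

```python
# Keep only instate cases, as the Filter node does after the Instate = N
# level is excluded. Rows are hypothetical.
rows = [
    {"Instate": "Y", "Enroll": 1},
    {"Instate": "N", "Enroll": 0},
    {"Instate": "Y", "Enroll": 0},
]
instate_only = [row for row in rows if row["Instate"] == "Y"]
excluded = len(rows) - len(instate_only)
print(f"kept {len(instate_only)} rows, excluded {excluded}")  # → kept 2 rows, excluded 1
```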



The partial diagram (after the Filter node) is shown below:

[Diagram: the Filter Out of State node feeds an Impute (2) node, which connects to Neural Network (2), regression, and Decision Tree nodes, all joined to a Model Comparison (2) node.]

As for the models in this subset analysis, the Instate Stepwise Regression model selects two of the same inputs found in the first round of modeling, SELF_INIT_CNTCTS and STUEMAIL.

Analysis of Maximum Likelihood Estimates

[Table with columns DF, Estimate, Standard Error, Wald Chi-Square, Pr > ChiSq, Standardized Estimate, and Exp(Est). Recoverable values: Intercept estimate -10.1372 with standard error 0.6264; SELF_INIT_CNTCTS estimate 0.7188 with standard error 0.0210, Wald Chi-Square 1174.00, Pr < .0001, Exp(Est) 2.052; stuemail estimate -6.8602 with Exp(Est) 0.001.]
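The Exp(Est) column is simply the exponentiated coefficient, read as an odds ratio. Using the SELF_INIT_CNTCTS estimate from the (partly garbled) table above, each additional self-initiated contact roughly doubles the odds of enrollment:

```python
import math

coef = 0.7188               # SELF_INIT_CNTCTS estimate (approximate)
odds_ratio = math.exp(coef)
print(round(odds_ratio, 3))  # → 2.052, matching the Exp(Est) column
```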

The Instate decision tree showed a structure similar to the decision tree model from the first round. The tree with the highest validation profit possessed 20 leaves. The best five-leaf tree, whose validation profit is 97% of the selected tree's, is shown below.
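Validation profit, the criterion used to size the tree, is just the profit-matrix entry for each (decision, actual outcome) pair summed over the validation cases and divided by their number. A Python sketch with a hypothetical profit matrix (the case study's actual decision weights are not shown here):

```python
# Average profit over validation cases for a given profit matrix.
# The (decision, actual outcome) -> profit entries below are invented.
profit = {
    ("solicit", 1): 10.0,   # solicited and enrolled
    ("solicit", 0): -1.0,   # solicited, did not enroll
    ("ignore", 1): 0.0,
    ("ignore", 0): 0.0,
}

def average_profit(decisions, actuals):
    total = sum(profit[d, y] for d, y in zip(decisions, actuals))
    return total / len(actuals)

print(average_profit(["solicit", "solicit", "ignore"], [1, 0, 1]))  # → 3.0
```

A subtree whose average validation profit is 97% of the best tree's may be preferred for its simplicity.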

[Figure: the best five-leaf Instate decision tree. The root splits on SELF_INIT_CNTCTS at 3.5; cases below 3.5 (or missing) split again on SELF_INIT_CNTCTS at 2.5; the remaining branches split on ratio inputs (cutoffs near 0.0171 and 0.0031, one involving hscrat). Each node reports train and validation enrollment percentages and counts.]

Again, much of the performance of the model is due to a low self-initiated contacts count.



Assessing Prediction Models (Instate-Only Cases)

As before, model performance was gauged in the Model Comparison node. The ROC chart showed no clearly superior model, although all models had rather exceptional performance.

The Fit Statistics table of the Output window showed a slight edge for the tree model in misclassification rate. The validation ROC index and validation average profit favored the Stepwise Regression and Neural Network models. Again, it should be noted that these were unusually high model performance statistics.
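Different fit statistics can favor different models, which is exactly the situation described here. A small Python sketch of the comparison, using illustrative validation values consistent with the discussion (treat the numbers as approximate):

```python
# Pick a "best" model under two criteria; ties on ROC index fall to the
# first entry. Values are illustrative validation statistics.
stats = {
    "Neural2": {"roc": 0.96, "misclass": 0.09},
    "Reg3":    {"roc": 0.96, "misclass": 0.09},
    "Tree2":   {"roc": 0.92, "misclass": 0.08},
}
best_by_roc = max(stats, key=lambda m: stats[m]["roc"])
best_by_misclass = min(stats, key=lambda m: stats[m]["misclass"])
print(best_by_roc, best_by_misclass)  # different winners under different criteria
```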






Data Role = Valid (abridged)

Statistics                                    Neural2     Reg3        Tree2
Valid: Kolmogorov-Smirnov Statistic           0.80        0.80        0.80
Valid: Average Profit                         2.02        2.02        2.00
Valid: Average Squared Error                  0.06        0.06        0.06
Valid: Roc Index                              0.96        0.96        0.92
Valid: Average Error Function                 0.19        0.20        0.21
Valid: Cumulative Percent Captured Response   50.69       50.69       46.38
Valid: Percent Captured Response              23.41       23.41       23.19
Valid: Frequency of Classified Cases          5899.00     5899.00     5899.00
Valid: Divisor for ASE                        11798.00    11798.00    11798.00
Valid: Error Function                         2194.05     2330.78     2462.62
Valid: Gain                                   406.85      406.85      363.74
Valid: Gini Coefficient                       0.91        0.91        0.84
Valid: Cumulative Lift                        5.07        5.07        4.64
Valid: Lift                                   4.68        4.68        4.64
Valid: Maximum Absolute Error                 0.97        1.00        0.98
Valid: Misclassification Rate                 0.09        0.09        0.08
Valid: Sum of Frequencies                     5899.00     5899.00     5899.00
Valid: Total Profit                           11901.71    11901.71    11824.00
Valid: Root Average Squared Error             0.24        0.25        0.25
Valid: Cumulative Percent Response            80.08       80.08       73.27
Valid: Percent Response                       73.97       73.97       73.27
Valid: Sum of Squared Errors                  705.43      725.26      716.66
Valid: Number of Wrong Classifications        510.00      527.00      462.00

Deploying the Prediction Model

The Score node facilitated deployment of the prediction model, as shown in the diagram's final form.

[Diagram: the final process flow, with the Instate Model Comparison node feeding a Score node; an INQ2005 score data source and a SAS Code node complete the flow.]

The best (instate) model was selected by the Instate Model Comparison node and passed on to the Score node. Another INQ2005 data source was assigned a role of Score and attached to the Score node. Columns from the scored INQ2005 were then passed into the Office of Enrollment Management's data management system by the final SAS Code node.
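In spirit, the Score node applies the winning model's score code to new rows and appends predicted-probability columns for downstream use. A stand-in Python sketch (the coefficients echo the regression estimates above but are approximate, and the real deployment uses the exported SAS score code, not this function):

```python
import math

def score(row):
    # Hypothetical logistic score using only SELF_INIT_CNTCTS; the
    # intercept and slope approximate the regression estimates above.
    logit = -10.14 + 0.72 * row["SELF_INIT_CNTCTS"]
    return 1.0 / (1.0 + math.exp(-logit))

new_cases = [{"SELF_INIT_CNTCTS": 1}, {"SELF_INIT_CNTCTS": 15}]
scored = [{**row, "P_Enroll1": round(score(row), 3)} for row in new_cases]
for row in scored:
    print(row)
```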


