Applied Analytics Using SAS Enterprise Miner-part1

Page 1

Applied Analytics Using S A S 速 Enterprise Miner

Course Notes


Applied Analytics Using SAlf Enterprise Miner™Course Notes was developed by Jim Georges, Jeff Thompson, and Chip Wells. Additional contributions were made by Tom Bohannon, Mike Hardin, Dan Kelly, Bob Lucas, and Sue Walsh. Editing and production support was provided by the Curriculum Development and Support Department. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. Applied Analytics Using SAS®Enterprise Miner™Course Notes Copyright © 2010 SAS Institute Inc. Cary,NC, USA. All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc. Book code E1755, course code LWAAEM61/AAEM61, prepared date 01 Jul2010.

LWAAEM61_002

ISBN 978-1-60764-593-1


For Your Information

Table of Contents Course Description Prerequisites Chapter 1

1.1

x Introduction

Introduction to SAS Enterprise Miner

Chapter 2

ix

A c c e s s i n g and A s s a y i n g Prepared Data

1-1

1-3 2-1

2.1

Introduction

2-3

2.2

Creating a SAS Enterprise Miner Project, Library, and Diagram

2-5

Demonstration: Creating a SAS Enterprise Miner Project

2.3

2.4

2.5

3.1

Demonstration: Creating a SAS Library

2-10

Demonstration: Creating a SAS Enterprise Miner Diagram

2-13

Exercises

2-14

Defining a Data Source

2-15

Demonstration: Defining a Data Source

2-18

Exercises

2-35

Exploring a Data Source

2-36

Demonstration: Exploring Source Data

2-37

Demonstration: Changing the Explore Window Sampling Defaults

2-64

Exercises

2-66

Demonstration: Modifying and Correcting Source Data

2-67

Chapter Summary

Chapter 3

2-6

Introduction to Predictive Modeling: D e c i s i o n T r e e s

Introduction Demonstration: Creating Training and Validation Data

2-80 3-1

3-3 3-23

iii


iv

F o r Your Information

3.2

Cultivating Decision Trees Demonstration: Constructing a Decision Tree Predictive Model

3.3

Optimizing the Complexity of Decision Trees Demonstration: Assessing a Decision Tree

3.4

Understanding Additional Diagnostic Tools (Self-Study) Demonstration: Understanding Additional Plots and Tables (Optional)

3.5

Autonomous Tree Growth Options (Self-Study)

3-28 3-43 3-63 3-79 3-98 3-99 3-106

Exercises

3-114

3.6

Chapter Summary

3-116

3.7

Solutions

3-117

Chapter 4

4.1

4.2

Solutions to Exercises

3-117

Solutions to Student Activities (Polls/Quizzes)

3-133

Introduction to Predictive Modeling: R e g r e s s i o n s

Introduction

4-18

Demonstration: Running the Regression Node

4-25

Selecting Regression Inputs

Optimizing Regression Complexity Demonstration: Optimizing Complexity

4.4

Interpreting Regression Models Demonstration: Interpreting a Regression Model

4.5

Transforming Inputs Demonstration: Transforming Inputs

4.6

4-3

Demonstration: Managing Missing Values

Demonstration: Selecting Inputs 4.3

4-1

Categorical Inputs Demonstration: Recoding Categorical Inputs

4-29 4-33 4-38 4-40 4-49 4-51 4-52 4-56 4-66 4-69


For Your Information

4.7

Polynomial Regressions (Self-Study)

4-76

Demonstration: Adding Polynomial Regression Terms Selectively

4-78

Demonstration: Adding Polynomial Regression Terms Autonomously (Self-Study)

4-85

Exercises

4-88

4.8

Chapter Summary

4-89

4.9

Solutions

4-91

Solutions to Exercises Solutions to Student Activities (Polls/Quizzes) Chapter 5

5.1

Introduction to Predictive Modeling: Neural Networks a n d Other Modeling Tools

Introduction Demonstration: Training a Neural Network

5.2

Input Selection Demonstration: Selecting Neural Network Inputs

5.3

5.4

Stopped Training

4-91 4-102

5-1

5-3 5-12 5-20 5-21 5-24

Demonstration: Increasing Network Flexibility

5-35

Demonstration: Using the AutoNeural Tool (Self-Study)

5-39

Other Modeling Tools (Self-Study)

5-46

Exercises

5-58

5.5

Chapter Summary

5-59

5.6

Solutions

5-60

Solutions to Exercises Chapter 6

6.1

Model A s s e s s m e n t

Model Fit Statistics Demonstration: Comparing Models with Summary Statistics

6.2

Statistical Graphics

5-60 6-1

6-3 6-6 6-9

v


vi

F o r Your Information

6.3

6.4

Demonstration: Comparing Models with ROC Charts

6-13

Demonstration: Comparing Models with Score Rankings Plots

6-18

Demonstration: Adjusting for Separate Sampling

6-21

Adjusting for Separate Sampling

6-26

Demonstration: Adjusting for Separate Sampling (continued)

6-29

Demonstration: Creating a Profit Matrix

6-32

Profit Matrices

6-39

Demonstration: Evaluating Model Profit

6-42

Demonstration: Viewing Additional Assessments

6-44

Demonstration: Optimizing with Profit (Self-Study)

6-46

Exercises

6-48

6.5

Chapter Summary

6-49

6.6

Solutions

6-50

Solutions to Exercises Chapter 7

Model Implementation

6-50 7-1

7.1

Introduction

7-3

7.2

Internally Scored Data Sets

7-4

7.3

Demonstration: Creating a Score Data Source

7-5

Demonstration: Scoring with the Score Tool

7-6

Demonstration: Exporting a Scored Table (Self-Study)

7-9

Score Code Modules

7-15

Demonstration: Creating a SAS Score Code Module

7-16

Demonstration: Creating Other Score Code Modules

7-22

Exercises

7-24

7.4

Chapter Summary

7-25

7.5

Solutions to Exercises

7-26


For Your Information

Chapter 8

Introduction to Pattern Discovery

8-1

8.1

Introduction

8-3

8.2

Cluster Analysis

8-8

8.3

Demonstration: Segmenting Census Data

8-15

Demonstration: Exploring and Filtering Analysis Data

8-24

Demonstration: Setting Cluster Tool Options

8-37

Demonstration: Creating Clusters with the Cluster Tool

8-41

Demonstration: Specifying the Segment Count

8-44

Demonstration: Exploring Segments

8-46

Demonstration: Profiling Segments

8-56

Exercises

8-61

Market Basket Analysis (Self-Study)

8-63

Demonstration: Market Basket Analysis

8-67

Demonstration: Sequence Analysis

8-80

Exercises

8-83

8.4

Chapter Summary

8-85

8.5

Solutions

8-86

Solutions to Exercises Chapter 9

Special Topics

8-86 9-1

9.1

Introduction

9-3

9.2

Ensemble Models

9-4

Demonstration: Creating Ensemble Models 9.3

9.4

Variable Selection

9-5 9-9

Demonstration: Using the Variable Selection Node

9-10

Demonstration: Using Partial Least Squares for Input Selection

9-14

Demonstration: Using the Decision Tree Node for Input Selection

9-19

Categorical Input Consolidation

9-22

vii


viii

F o r Your Information

Demonstration: Consolidating Categorical Inputs 9.5

Surrogate Models Demonstration: Describing Decision Segments with Surrogate Models

Appendix A

C a s e Studies

A. 1 Banking Segmentation Case Study A.2

Web Site Usage Associations Case Study

9-23 9-31 9-32 A-1

A-3 A-16

A.3 Credit Risk Case Study

A-19

A. 4 Enrollment Management Case Study

A-36

Appendix B

B. l

Bibliography

References

Appendix C

Index

B-1

B-3 C-1


F o r Your Information

ix

Course Description This course covers the skills required to assemble analysis flow diagrams using the rich tool set of SAS Enterprise Miner for both pattern discovery (segmentation, association, and sequence analyses) and predictive modeling (decision tree, regression, and neural network models).

To learn more...

ยงsas Education

ยงsas Publishing

For information on other courses in the curriculum, contact the SAS Education Division at 1-800-333-7660, or send e-mail to training@sas.com. You can also find this information on the Web at support.sas.com/training/ as well as in the Training Course Catalog.

For a list of other SAS books that relate to the topics covered in this Course Notes, USA customers can contact our SAS Publishing Department at 1-800-727-3228 or send e-mail to sasbook@sas.com. Customers outside the USA, please contact your local SAS office. Also, see the Publications Catalog on the Web at support.sas.com/pubs for a complete list of books and a convenient order form.


x

F o r Your Information

Prerequisites Before attending this course, you should be acquainted with Microsoft Windows and Windows-based software. In addition, you should have at least an introductory-level familiarity with basic statistics and regression modeling. Previous SAS software experience is helpful but not required.


Chapter 1 Introduction 1.1

Introduction to S A S E n t e r p r i s e Miner

1 -3


1-2

C h a p t e r 1 Introduction


1.1 Introduction to SAS Enterprise Miner

1.1

Introduction to SAS Enterprise

1-3

Miner

SAS Enterprise Miner

z The SAS Enterprise Miner 6.1 interface simplifies many common tasks associated with applied analysis. It offers secure analysis management and provides a wide variety of tools with a consistent graphical interface. You can customize it by incorporating your choice of analysis methods and tools.

SAS Enterprise Miner - Interface Tour '•'" ,T

s

^

77. "o * - j 9

Menu bar and shortcut buttons

3 The interface window is divided into several functional components. The menu bar and corresponding shortcut buttons perform the usual windows tasks, in addition to starting, stopping, and reviewing analyses.


1-4

Chapter 1 Introduction

SAS Enterprise Miner - Interface Tour •

*• m Una

Project panel

I >*inw*« •

• — - * — -

1

s

— -

-cap-

-

-£U

_

U — 1

The Project panel manages and views data sources, diagrams, results, and project users.

SAS Enterprise Miner - Interface Tour pCO

V(r.«m

-

* a *

n...- - riii L/^ 13" Properties panel 1

:

r , i - •- Lr--

01

•H w i B m h y

•a—-r -at-.. - U — 1 ii—-

5 The Properties panel enables you to view and edit the settings o f data sources, diagrams, nodes, results, and users.


1.1 Introduction to SAS Enterprise Miner

SAS Enterprise Miner - Interface Tour

C — —1

MM "—••HI.

J

• —

-

-

-B

-i$r~

" -k— 1

tit—-

£ = —

Help panel

s The Help panel displays a short description o f the property that you select in the Properties panel. Extended help can be found in the Help Topics selection from the Help main menu.

SAS Enterprise Miner - Interface Tour

Diagram workspace

Q...„ :•— 1

1

.

-

g

-fjif-

- -y^—J

M

Ck—

In the Diagram workspace, process flow diagrams are built, edited, and run. The workspace is where you graphically sequence the tools that you use to analyze your data and generate reports.

1-5


1-6

Chapter 1 Introduction

SAS Enterprise Miner - Interface Tour y

E J — I- -Hto——,-> - g j — - f-

1--la.——]

Process flow -fir-:-J

8 The Diagram workspace contains one or more process flows. A process flow starts with a data source and sequentially applies SAS Enterprise Miner tools to complete your analytic objective.

SAS Enterprise Miner - Interface Tour

et=a .a — •n

Node

D~

- *

IB

1 -Si- - - I * — 1 •fir— i

.w*j-rw>

A process flow contains several nodes. Nodes are SAS Enterprise Miner tools connected by arrows to show the direction o f information flow in an analysis.


1.1

Introduction to S A S Enterprise Miner

1-7

SAS Enterprise Miner - Interface Tour

-

-a*

-

-CKJ—

•J

-

il , ;

-(E

- J -

s-

10

The SAS Enterprise Miner tools available to your analysis are contained in the tools palette. The tools palette is arranged according to a process for data mining, SEMMA. SEMMA is an acronym for the following: Sample

You sample the data by creating one or more data tables. The samples should be large enough to contain the significant information, but small enough to process.

Explore You explore the data by searching for anticipated relationships, unanticipated trends, and anomalies in order to gain understanding and ideas. Modify

You modify the data by creating, selecting, and transforming the variables to focus the model selection process.

Model

You model the data by using the analytical tools to search for a combination of the data that reliably predicts a desired outcome.

Assess

You assess competing predictive models (build charts to evaluate the usefulness and reliability of the findings from the data mining process).

Additional tools are available under the Utility group and, i f licensed, the Credit Scoring group.


1-8

C h a p t e r 1 Introduction

SEMMA-Sample Tab

•Append > Data Partition

5~

rati—iilr»

• File Import •Filter • Input Data •Merge ^ • Sample • Time Series

— ' - -Sir- > - k - — 1

-I Ji ,.) — J 3

f

11

The tools in each Tool tab are arranged alphabetically. The Append tool is used to append data sets that are exported by two different paths in a single process flow diagram. The Append node can also append train, validation, and test data sets into a new training data set. The Data Partition tool enables you to partition data sets into training, test, and validation data sets. The training data set is used for preliminary model fitting. The validation data set is used to monitor and tune the model during estimation and is also used for model assessment. The test data set is an additional holdout data set that you can use for model assessment. This tool uses simple random sampling, stratified random sampling, or cluster sampling to create partitioned data sets. The File Import tool enables you to convert selected external flat files, spreadsheets, and database tables into a format that SAS Enterprise Miner recognizes as a data source. The Filter tool creates and applies filters to your training data set, and optionally, to the validation and test data sets. You can use filters to exclude certain observations, such as extreme outliers and errant data that you do not want to include in your mining analysis. The Input Data tool represents the data source that you choose for your mining analysis and provides details (metadata) about the variables in the data source that you want to use. The Merge tool enables you to merge observations from two or more data sets into a single observation in a new data set. The Merge tool supports both one-to-one and match merging. The Sample tool enables you to take simple random samples, observation samples, stratified random samples, first-w samples, and cluster samples of data sets. For any type of sampling, you can specify either a number of observations or a percentage of the population to select for the sample. If you are working with rare events, the Sample tool can be configured for oversampling or stratified sampling. Sampling is recommended for extremely large databases because it can significantly decrease model training time. If the sample is sufficiently representative, relationships found in the sample can be expected to generalize to the complete data set. The Sample tool writes the sampled observations to an output data set and saves the seed values that are used to generate the random numbers for the samples so that you can replicate the samples.


1.1

Introduction to S A S Enterprise Miner

1-9

The Time Series tool converts transactional data to time series data. Transactional data is time-stamped data that is collected over time at no particular frequency. By contrast, time series data is time-stamped data that is summarized over time at a specific frequency. You might have many suppliers and many customers, as well as transaction data that is associated with both. The size of each set of transactions can be very large, which makes many traditional data mining tasks difficult. By condensing the information into a time series, you can discover trends and seasonal variations in customer and supplier habits that might not be visible in transactional data.


1-10

Chapter 1 Introduction

SEMMA-Explore Tab laaHai^jJaiajfl^Ju I •Association • Cluster • DMDB • Graph Explore • Market Basket - • Multiplot • Path Analysis • SOM/Kohonen • StatExplore •Variable Clustering •Variable Selection

at-

- U — 1

-fit--

12 The Association tool enables you to perform association discovery to identify items that tend to occur together within the data. For example, i f a customer buys a loaf o f bread, how likely is the customer to also buy a gallon o f milk? This type o f discovery is also known as market basket analysis. The tool also enables you to perform sequence discovery i f a timestamp variable (a sequence variable) is present in the data set. This enables you to take into account the ordering of the relationships among items. The Cluster tool enables you to segment your data; that is, it enables you to identify data observations that are similar in some way. Observations that are similar tend to be in the same cluster, and observations that are different tend to be in different clusters. The cluster identifier for each observation can be passed to subsequent tools in the diagram. The D M D B tool creates a data mining database that provides summary statistics and factor-level information for class and interval variables in the imported data set. The Graph Explore tool is an advanced visualization tool that enables you to explore large volumes of data graphically to uncover patterns and trends and to reveal extreme values in the database. The tool creates a run-time sample of the input data source. You use the Graph Explore node to interactively explore and analyze your data using graphs. Your exploratory' graphs are persisted when the Graph Explore Results window is closed. When you reopen the Graph Explore Results window, the persisted graphs are re-created. The experimental M a r k e t Basket tool performs association rule mining over transaction data in conjunction with item taxonomy. Transaction data contains sales transaction records with details about items bought by customers. Market basket analysis uses the information from the transaction data to give you insight about which products tend to be purchased together. The M u l t i P l o t tool is a visualization tool that enables you to explore large volumes o f data graphically. The MultiPlot tool automatically creates bar charts and scatter plots for the input and target. The code created by this tool can be used to create graphs in a batch environment. The Path Analysis tool enables you to analyze Web log data to determine the paths that visitors take as they navigate through a Web site. You can also use the tool to perform sequence analysis.


1.1

Introduction to S A S Enterprise Miner

1-11

The SOM/Kohonen tool performs unsupervised learning by using Kohonen vector quantization (VQ), Kohonen self-organizing maps (SOMs), or batch SOMs with Nadaraya-Watson or local-linear smoothing. Kohonen VQ is a clustering method, whereas SOMs are primarily dimension-reduction methods. For cluster analysis, the Clustering tool is recommended instead of Kohonen VQ or SOMs. The StatExplore tool is a multipurpose tool used to examine variable distributions and statistics in your data sets. The tool generates summarization statistics. You can use the StatExplore tool to do the following: • select variables for analysis, for profiling clusters, and for predictive models • compute standard univariate distribution statistics • compute standard bivariate statistics by class target and class segment • compute correlation statistics for interval variables by interval input and target The Variable Clustering tool is useful for data reduction, such as choosing the best variables or cluster components for analysis. Variable clustering removes collinearity, decreases variable redundancy, and helps to reveal the underlying structure of the input variables in a data set. The Variable Selection tool enables you to evaluate the importance of input variables in predicting or classifying the target variable. To select the important inputs, the tool uses either an R-squared or a Chi-squared selection criterion. The R-squared criterion enables you to remove variables in hierarchies, remove variables that have large percentages of missing values, and remove class variables that are based on the number of unique values. The variables that are not related to the target are set to a status of rejected. Although rejected variables are passed to subsequent tools in the process flow diagram, these variables are not used as model inputs by more detailed modeling tools, such as the Neural Network and Decision Tree tools. You can reassign the input model status to rejected variables.


1-12

Chapter 1

Introduction

SEMMA-Modify Tab

"I w> ! 1 l»* . . .,*o rr^ipw : .... .. 2 J. rwxn wSjii'' ""' ''' ^ ' ' ' ' " ' " J 1

1

1

1 1 1 1

1

•Drop • Impute • Interactive Binning • Principal Components • Replacement

Er-'»Rule§BuHder

-

-C?>r- - - k — )

•Transform Variables '

OM»nt

'• • • . .*'t

'•• -,\

1

ii-J-

13

The Drop tool is used to remove variables from scored data sets. You can remove all variables with the role type that you specify, or you can manually specify individual variables to drop. For example, you could remove all hidden, rejected, and residual variables from your exported data set, or you could remove only a few variables that you identify yourself. The Impute tool enables you to replace values for observations that have missing values. You can replace missing values for interval variables with the mean, median, midrange, mid-minimum spacing, or distribution-based replacement, or you can use a replacement M-estimator such as Tukey's biweight, Huber's, or Andrew's Wave. You can also estimate the replacement values for each interval input by using a tree-based imputation method. Missing values for class variables can be replaced with the most frequently occurring value, distribution-based replacement, tree-based imputation, or a constant. The Interactive Binning tool is an interactive grouping tool that you use to model nonlinear functions of multiple modes of continuous distributions. The Interactive tool computes initial bins by quantiles. Then you can interactively split and combine the initial bins. You use the Interactive Binning node to create bins or buckets or classes of all input variables, which include both class and interval input variables. You can create bins in order to reduce the number of unique levels as well as attempt to improve the predictive power of each input. The Principal Components tool calculates eigenvalues and eigenvectors from the uncorrected covariance matrix, corrected covariance matrix, or the correlation matrix of input variables. Principal components are calculated from the eigenvectors and are usually treated as the new set of input variables for successor modeling tools. A principal components analysis is useful for data interpretation and data dimension reduction. The Replacement tool enables you to reassign and consolidate levels of categorical inputs. This can improve the performance of predictive models. The Rules Builder tool opens the Rules Builder window so that you can create ad hoc sets of rules with user-definable outcomes. You can interactively define the values of the outcome variable and the paths to the outcome. This is useful in ad hoc rule creation such as applying logic for posterior probabilities and scorecard values.


1.1

Introduction to S A S Enterprise Miner

1-13

The Transform Variables tool enables you to create new variables that are transformations of existing variables in your data. Transformations are useful when you want to improve the fit of a model to the data. For example, transformations can be used to stabilize variances, remove nonlinearity, improve additivity, and correct nonnormality in variables. The Transform Variables tool supports various transformation methods. The available methods depend on the type and the role of a variable.


1-14

C h a p t e r 1 Introduction

SEMMA - Model Tab 4mmM^£&2$M\ OSS

• AutoNeural

f — — r itms .

• Decision Tree

"=r—y

. . .Fa

• Dmine Regression

j

• DMNeural • Ensemble

... .'owt^-v

• —

' '. . r*.

;f ..uuj.'iin.

t

j *

^

« G r a d i e n t Boosing -

-cpj— -

- k — 1

• Least Angle Regression

.».•• *TJ^.—

1

''^S^jfcj^'" '

,_ . •

•MBR • Model Import

" i

-fir--

• Neural Network • Partial Least Squares -a 1

•••Regression

» R u l e induction • Two Stage

I ^ . - . - J - . . - ? '

'

The AutoNeural tool can be used to automatically configure a neural network. It conducts limited searches for a better network configuration. The Decision Tree tool enables you to perform multiway splitting of your database based on nominal, ordinal, and continuous variables. The tool supports both automatic and interactive training. When you run the Decision Tree tool in automatic mode, it automatically ranks the input variables based on the strength of their contributions to the tree. This ranking can be used to select variables for use in subsequent modeling. In addition, dummy variables can be generated for use in subsequent modeling. You can override any automatic step with the option to define a splitting rule and prune explicit tools or subtrees. Interactive training enables you to explore and evaluate a large set of trees as you develop them. The Dmine Regression tool performs a regression analysis on data sets that have a binary or interval level target variable. The Dmine Regression tool computes a forward stepwise least squares regression. In each step, an independent variable is selected that contributes maximally to the model R-square value. The tool can compute all two-way interactions of classification variables, and it can also use AOV16 variables to identify nonlinear relationships between interval variables and the target variable. In addition, the tool can use group variables to reduce the number of levels of classification variables. f

If you want to create a regression model on data that contains a nominal or ordinal target, then you would use the Regression tool.

The DMNeural tool is another modeling tool that you can use tofita nonlinear model. The nonlinear model uses transformed principal components as inputs to predict a binary or an interval target variable.


1.1

Introduction to S A S Enterprise Miner

1-15

The Ensemble tool creates new models by combining the posterior probabilities (for class targets) or the predicted values (for interval targets) from multiple predecessor models. The new model is then used to score new data. One common ensemble approach is to use multiple modeling methods, such as a neural network and a decision tree, to obtain separate models from the same training data set. The component models from the two complementary modeling methods are integrated by the Ensemble tool to form the final model solution. It is important to note that the ensemble model can only be more accurate than the individual models if the individual models disagree with one another. You should always compare the model performance of the ensemble model with the individual models. You can compare models using the Model Comparison tool. The Gradient Boosting tool uses a partitioning algorithm described in "A Gradient Boosting Machine," and "Stochastic Gradient Boosting" by Jerome Friedman. A partitioning algorithm searches for an optimal partition of the data defined in terms of the values of a single variable. The optimality criterion depends on how another variable, the target, is distributed into the partition segments. When the target values are more similar within the segments, the worth of the partition is greater. Most partitioning algorithms further partition each segment in a process called recursive partitioning. The partitions are then combined to create a predictive model. The model is evaluated by goodness-of-fit statistics defined in terms of the target variable. These statistics are different than the measure of worth of an individual partition. A good model might result from many mediocre partitions. The Least Angle Regressions (LARS) tool can be used for both input variable selection and model fitting. When used for variable selection, the LAR algorithm chooses input variables in a continuous fashion that is similar to Forward selection. The basis of variable selection is the magnitude of the candidate inputs' estimated coefficients as they grow from zero to the least squares' estimate. Either a LARS or LASSO algorithm can be used for model fitting. See Efron et al (2004) and Hastie, Tibshirani, and Friedman (2001) for further details. The Memory-Based Reasoning (MBR) tool is a modeling tool that uses a ^-nearest neighbor algorithm to categorize or predict observations. The ^-nearest neighbor algorithm takes a data set and a probe, where each observation in the data set consists of a set of variables and the probe has one value for each variable. The distance between an observation and the probe is calculated. The k observations that have the smallest distances to the probe are the ^-nearest neighbors to that probe. In SAS Enterprise Miner, the ^-nearest neighbors are determined by the Euclidean distance between an observation and the probe. Based on the target values of the ^-nearest neighbors, each of thefc-nearestneighbors votes on the target value for a probe. The votes are the posterior probabilities for the class target variable. The Model Import tool imports and assesses a model that was not created by one of the SAS Enterprise Miner modeling nodes. You can then use the Assessment node to compare the user-defined model(s) with a model(s) that you developed with a SAS Enterprise Miner modeling node. This process is called integrated assessment. The Neural Network tool enables you to construct, train, and validate multilayer feed-forward neural networks. In general, each input is fully connected to the first hidden layer, each hidden layer is fully connected to the next hidden layer, and the last hidden layer is fully connected to the output. The Neural Network tool supports many variations of this general form. The Partial Least Squares tool models continuous and binary targets based on the SAS/STAT PLS procedure. The Partial Least Squares node produces DATA step score code and standard predictive model assessment results.


1-16

Chapter 1 Introduction

The Regression tool enables you to fit both linear and logistic regression models to your data. You can use continuous, ordinal, and binary target variables. You can use both continuous and discrete variables as inputs. The tool supports the stepwise, forward, and backward selection methods. The interface enables you to create higher-order modeling terms such as polynomial terms and interactions. The Rule Induction tool enables you to improve the classification of rare events in your modeling data. The Rule Induction tool creates a Rule Induction model that uses split techniques to remove the largest pure split tool from the data. Rule induction also creates binary models for each level of a target variable and ranks the levels from the rarest event to the most common. The TwoStage tool enables you to model a class target and an interval target. The interval target variable is usually the value that is associated with a level of the class target. For example, the binary variable PURCHASE is a class target that has two levels, Yes and No, and the interval variable AMOUNT can be the value target that represents the amount of money that a customer spends on the purchase.


1.1

Introduction to S A S Enterprise Miner

1-17

SEMMA - Assess Tab 3-..

, , r

"™ ' ,r

n , r n

OQSE3 •Cutoff |

;

~

<•»»»MWCMM H « I « .

|

<-n J

.

• Decisions

.

• Model Comparison

J

• Segment Profile

,|CSi, 8.! ,J:,J.^I^».SIJ.«M /^ ;

I

J

• Score -

•HH^.^, . H"i,^|. «l,0^lH J 1

lt

-

§

-i^f-

- -ja

j

l

•jCtr— n HI

15

The Cutoff tool provides tabular and graphical information to assist users in determining appropriate probability cutoff point(s) for decision making with binary target models. The establishment of a cutoff decision point entails the risk of generating false positives and false negatives, but an appropriate use of the Cutoff node can help minimize those risks. The Decisions tool enables you to define target profiles for a target that produces optimal decisions. The decisions are made using a user-specified decision matrix and output from a subsequent modeling procedure. The Model Comparison tool provides a common framework for comparing models and predictions from any of the modeling tools. The comparison is based on the expected and actual profits or losses that would result from implementing the model. The tool produces several charts that help to describe the usefulness of the model, such as lift charts and profit/loss charts. The Segment Profile tool enables you to examine segmented or clustered data and identify factors that differentiate data segments from the population. The tool generates various reports that aid in exploring and comparing the distribution of these factors within the segments and population. The Score tool enables you to manage, edit, export, and execute scoring code that is generated from a trained model. Scoring is the generation of predicted values for a data set that might not contain a target variable. The Score tool generates and manages scoring formulas in the form of a single SAS DATA step, which can be used in most SAS environments even without the presence of SAS Enterprise Miner. The Score tool can also generate C score code and Java score code.


1-18

Chapter 1 Introduction

Beyond SEMMA - Utility Tab

D — -

Control Point End Groups Metadata Reporter SAS Code Start Groups

-

a —

16 The Control Point tool enables you to establish a control point to reduce the number o f connections that are made in process flow diagrams. For example, suppose that three Input Data Source tools are to be connected to three modeling tools. If no Control Point tool is used, then nine connections are required to connect all o f the Input Data Source tools to all o f the modeling tools. However, i f a Control Point tool is used, only six connections are required. The E n d Groups tool is used only in conjunction with the Start Groups tool. The End Groups node acts as a boundary marker that defines the end-of-group processing operations in a process flow diagram. Group processing operations are performed on the portion o f the process flow diagram that exists between the Start Groups node and the End Groups node. The Metadata tool enables you to modify the columns metadata information at some point in your process flow diagram. You can modify attributes such as roles, measurement levels, and order. The Reporter tool uses SAS Output Delivery System (ODS) capability to create a single PDF or RTF file that contains information about the open process flow diagram. The PDF or RTF documents can be viewed and saved directly and are included in SAS Enterprise Miner report package files. The SAS Code tool enables you to incorporate new or existing SAS code into process flow diagrams. The ability to write SAS code enables you to include additional SAS procedures into your data mining analysis. You can also use a SAS DATA step to create customized scoring code, to conditionally process data, and to concatenate or merge existing data sets. The tool provides a macro facility to dynamically reference data sets used for training, validation, testing, or scoring variables, such as input, target, and predict variables. After you run the SAS Code tool, the results and the data sets can then be exported for use by subsequent tools in the diagram. The Start Groups tool is useful when your data can be segmented or grouped, and you want to process the grouped data in different ways. The Start Groups node uses BY-group processing as a method to process observations from one or more data sources that are grouped or ordered by values o f one or more common variables. BY variables identify the variable or variables by which the data source is indexed, and BY statements process data and order output according to the BY group values.


1.1

Introduction to S A S Enterprise Miner

1-19

Credit Scoring Tab (Optional)

-*

• Credit Exchange -,

f™=:—r =. " r y,t=_j vUT.:i'.^:...^...i.j. ST"* 3

jrrrk.

kllH

• Interactive Grouping • Reject Inference • Scorecard

^^ * - ^ t rrj—

J"*,ST*"S I . , • —. — i riter*™

- -ep,—

-

-a—>

.351— -

)

rn n

17

The optional Credit Scoring tab provides functionality related to credit scoring. f

The Credit Scoring for the SAS Enterprise Miner solution is not included with the base version of SAS Enterprise Miner. If your site did not license Credit Scoring for SAS Enterprise Miner, the Credit Scoring tab and its associated tools do not appear in your SAS Enterprise Miner software.

The Credit Exchange tool enables you to exchange the data that is created in SAS Enterprise Miner with the SAS Credit Risk Management solution. The Interactive Grouping tool creates groupings, or classes, of all input variables. (This includes both class and interval input variables.) You can create groupings in order to reduce the number of unique levels as well as attempt to improve the predictive power of each input. Along with creating group levels for each input, the Interactive Grouping tool creates Weight of Evidence (WOE) values. The Reject Inference tool uses the model that was built using the accepted applications to score the rejected applications in the retained data. The observations in the rejected data set are classified as inferred "goods" and inferred "bads." The inferred observations are added to the A c c e p t s data set that contains the actual "good" and "bad" records, forming an augmented data set. This augmented data set then serves as the input data set of a second credit-scoring modeling run. During the second modeling run, attribute classification is readjusted and the regression coefficients are recalculated to compensate for the data set augmentation. The Scorecard tool enables you to rescale the logit scores of binary prediction models to fall within a specified range.


c CD u> n Q . 5"

<; 3 C 5

O

—1

*—• CD CD

a —

CD

ffl f? O. — —'

o

o o3

3

-a

X

o

CD

hei

(SQ -i o

3

2'

5 3

M

r

10 1a. to cr c - | f & — i <-* S£ S

ucti iter

fS

S cr 3 fj 1

CD

•<

CD

CD (A

UIOJ

tenc esul en

O

to'

5' B, o-i

CD W

, -

3

S3 r*

5

- CD

Repair input data Transform input data

«

(B ^

AS

ta

Q

3

R

Gather results Assess observed results

O

n T 35 O CO

o „.

*< 3 n 3

3* CD O

GS a

IT 3 CD

3

CD

d'O O

"-»)

ft

W

- M 3.

CD

P n

|

£L o

oCD

7T

S— 3

[B

-

2 3

=2 EL

a. si. g Q y w CD £L * o' 5 2 CD cr O

3

O

CD

O -i Di CD &) _ w B 3 -tS

a. O

3*

TJ CD

03

*<

o"

o Q.

o —** o

Apply analysis

«< B .

CD

>

a to

O

5' . " n CD p o o .

3

-

Generate deployment methods o £ Integrate deployment 3 *0 M 3*

CD

1 Z3

rj

_ H e g » CD

•-*)

-6

a.

k3

C

Select cases "S Extract input data o O Validate input data

r+

i3

Define analytic objective i»

Q

S s - a.

o o ££, 3 p. — ~n • <

P> 3

w

1 "8 tsi

g>

I

CD"

n

o r. a 3 CD •<

|

CD

H o o o' 3

to o

CO

CD

o cr 3

i,«

H

CD

O

5T

Refine analytic objective


1.1 Introduction to SAS Enterprise Miner

1-21

SAS Enterprise Miner Analytic Strengths

Pattern Discovery

Predictive Modeling

19

The analytic strengths of SAS Enterprise Miner lie in the realm traditionally known as data mining. There are a great number of analytic methods identified with data mining, but they usually fall into two broad categories: pattern discovery and predictive modeling (Hand 2005). SAS Enterprise Miner provides a variety o f pattern discovery and predictive modeling tools. Chapters 2 through 7 demonstrate SAS Enterprise Miner's predictive modeling capabilities. In those chapters, you use a rich collection of modeling tools to create, evaluate, and improve predictions from prepared data. Special topics in predictive modeling are featured in Chapter 9. Chapter 8 introduces some of SAS Enterprise Miner's pattern discovery tools. In this chapter, you see how to use SAS Enterprise Miner to graphically evaluate and transform prepared data. You use SAS Enterprise Miner's tools to cluster and segment an analysis population. You can also choose to try SAS Enterprise Miner's market basket analysis and sequence analysis capabilities.


1-22

Chapter 1 Introduction

Applied Analytics Case Studies ^ —i

.....

fH

W O

Jfj Bank usage segmentation V Web services associations

:&52J

|||T|i—1

» iinmm

J

&

Credit risk scoring

?

li& University enrollment prediction

^ i — — ^

e — . w .

.

... .

20

Appendix A further illustrates SAS Enterprise Miner's analytic capabilities with case studies drawn from real-world business applications. • A consumer bank sought to segment its customers based on historic usage patterns. Segmentation was to be used for improving contact strategies in the Marketing Department. • A radio station developed a Web site to provide such services to its audience as podcasts, news streams, music streams, archives, and live Web music performances. The station tracked usage o f these services by U R L . Analysts at the station wanted to see whether any unusual patterns existed in the combinations of services selected. • A bank sought to use performance on an in-house subprime credit product to create an updated risk model. The risk model was to be combined with other factors to make future credit decisions. • The administration o f a large private university requested that the Office o f Enrollment Management and the Office o f Institutional Research work together to identify prospective students who would most likely enroll as freshmen.


Chapter 2 Accessing and Assaying Prepared Data 2.1

Introduction

2.2

C r e a t i n g a S A S E n t e r p r i s e M i n e r Project, Library, a n d D i a g r a m

Demonstration: Creating a SAS Enterprise Miner Project

2.3

2.4

2.5

2

-

3

2-5

2-6

Demonstration: Creating a SAS Library

2-10

Demonstration: Creating a SAS Enterprise Miner Diagram

2-13

Exercises

2-14

D e f i n i n g a Data S o u r c e

2-15

Demonstration: Defining a Data Source

2-18

Exercises

2-35

E x p l o r i n g a Data S o u r c e

2-36

Demonstration: Exploring Source Data

2-37

Demonstration: Changing the Explore Window Sampling Defaults

2-64

Exercises

2-66

Demonstration: Modifying and Correcting Source Data

2-67

Chapter Summary

2-80


2-2

Chapter 2 Accessing and Assaying Prepared Data


2.1 Introduction

2-3

2.1 Introduction Analysis Element Organization

Projects

Libraries and Diagrams

Process Flows

Nodes

3

Analyses in SAS Enterprise Miner start by defining a project. A project is a container for a set of related analyses. After a project is defined, you define libraries and diagrams. A library is a collection of data sources accessible by the SAS Foundation Server. A diagram holds analyses for one or more data sources, Processflowsare the specific steps you use in your analysis of a data source. Each step in a process flow is indicated by a node. The nodes represent the SAS Enterprise Miner tools that are available to the user. f

A core set of tools is available to all SAS Enterprise Miner users. Additional tools can be added by licensing additional SAS products or by creating SAS Enterprise Miner extensions.

Before you can create a process flow, you need to define the following items: • a SAS Enterprise Miner project to contain the data sources, diagrams, and model packages • one or more SAS libraries inside your project, linking SAS Enterprise Miner to analysis data • a diagram workspace to display the steps of your analysis A SAS administrator can define these for you in advance of using SAS Enterprise Miner or you can define them yourself. f

The demonstrations show you how to create the items in the list above.


2-4

Chapter 2 Accessing and Assaying Prepared Data

Analysis Element Organization

Projects

Libraries and Diagrams

Process Flows

Nodes

4 Behind the scenes (on the SAS Foundation Server where the project resides), the organization is more complicated. Fortunately, the SAS Enterprise Miner client shields you from this complexity. At the top of the hierarchy is the SAS Enterprise Miner project, corresponding to a directory on (or accessible by) the SAS Foundation. When a project is defined, four subdirectories are automatically created within the project directory: Datasources, Reports, System, and Workspaces. The SAS Enterprise Miner client via the SAS Foundation handles the writing, reading, and maintenance o f these directories. f

In both the personal workstation and SAS Enterprise Miner client configurations, all references to computing resources such as L I B N A M E statements and access to SAS Foundation technologies must be made from the selected SAS Foundation Server's perspective. This is important because data and directories that are visible to the client machine might not be visible to the server.

Projects are defined to contain diagrams, the next level of the SAS Enterprise Miner organizational hierarchy. Diagrams usually pertain to a single analysis theme or project. When a diagram is defined, a new subdirectory is created in the Workspaces directory of the corresponding project. Each diagram is independent, and no information can be passed from one diagram to another. Libraries are defined using a L I B N A M E statement. You can create a library using the SAS Enterprise Miner Library Wizard, using SAS Management Console, or by using the Start-Up Code window when you define a project. Any data source compatible with a L I B N A M E statement can provide data for SAS Enterprise Miner. Specific analyses in SAS Enterprise Miner occur in process flows. A process flow is a sequence o f tasks or nodes connected by arrows in the user interface, and it defines the order of analysis. The organization of a process flow is contained in a file, EM DGRAPH, which is stored in the diagram directory. Each node in the diagram corresponds to a separate subdirectory in the diagram directory. Information in one process flow can be sent to another by connecting the two process flows.


2.2 Creating a SAS Enterprise Miner Project, Library, and Diagram

2-5

2.2 Creating a SAS Enterprise Miner Project, Library, and Diagram Your first task when you start an analysis is creating a SAS Enterprise Miner project, data library, and diagram. Often these are set up by a SAS administrator (or reused from a previous analysis), but knowing how to set up these items yourself is fundamental to learning about SAS Enterprise Miner. Before attempting the processes described in the following demonstrations, you need to log on to your computer, and start and log on to the SAS Enterprise Miner client program.


2-6

Chapter 2 Accessing and Assaying Prepared Data

Creating a SAS Enterprise Miner Project A SAS Enterprise Miner project contains materials related to a particular analysis task. These materials include analysis process flows, intermediate analysis data sets, and analysis results. To define a project, you must specify a project name and the location o f the project on the SAS Foundation Server. Follow the steps below to create a new SAS Enterprise Miner project. • • ^ • I ^ H - i r | x |

|3tEnterprise Miner File

Edit

01

View

g.ctions

I I I

i

Options

Window

Help

f I 1 U

|. |.: 1 M

1 .1 1^1

Welcome to Enterprise Miner O Ac o A o

Hefp Topics

E n te

Mi ne

US'

New Project...

OpenProject...

-Si?

Recent Projects.,,

^?

Exit

|0 student as s t u d e n t ] * ^ No project open


2-2 Creating a SAS Enterprise Miner Project, Library, and Diagram 1.

2-7

Select File => New ^ Project from the main menu. The Create New Project wizard opens at Step 1 £ F Create New Project — Step 1 of 4 Select SAS Server

Select a SAS Server for this project. All processing will take place on this server,

SAS Server SASApp - Logical Workspace Server < Back

!Next>

Cancel

In this configuration o f SAS Enterprise Miner, the only server available for processing is the host server listed above. 2.

Select Next >.

3.

Name the project. 2 £ Create New Project — Step 2 of 4 Specify Project Name and Server Directory

Specify a project name and directory on the SAS Server for this project. All SAS data sets andfileswill be written to this location.

Project Name SAS Server Directory D:\SA5\Config\Lev 1 \AppData\EnterpriseMiner\6.1 < Back

Next >

Browse Cancel

Step 2 of the Create New Project wizard is used to specify the following information: •

the name o f the project you are creating

the location of the project


2-8 4.

Chapter 2 Accessing and Assaying Prepared Data Type a project name, for example, My

P r o j e c t , in the Name field.

2 £ Create New Project — Step 2 of 4 Specify Project Name and Server Directory

SAS'

Specify a project name and directory on the SAS Server for this project. SAS data sets and files will be written to this location.

Project Name My Project| SAS Server Directory D:\SAS\Config\Levl\AppData\EnterpriseMiner\6.t <Back The path specified by the SAS project folder w i l l be created. 5.

Cancel

D i r e c t o r y field is the physical location where the

Select N e x t > . I f you have an existing project directory with the same name and location as specified, this project w i l l be added to the list of available projects in SAS Enterprise Miner. This technique can be used to import a project created by another installation of SAS Enterprise Miner.

f

6.

Server

Next >

Browse

Select a location for the project's metadata.

EG

j5£ Create New Project — S t e p3 of 4 Register the Project

SA s

Select the SAS Folders location for this project. Use these folders to organize your projects and control user access.

En t e r p r i s e j ne r a e f f l l

m\

SAS Folder Location — /Users/student/My Folder ! < Back /f

Browse Next >

Cancel

The SAS folder, My Folder, is in a WebDAV directory. This is where the metadata associated with the project is stored. This folder can be accessed and modified using SAS Management Console.


2.2 Creating a SAS Enterprise Miner Project, Library, and Diagram 7.

2-9

Select Next >. Information about your project is summarized in Step 4.

8.

To finish defining the project, select Finish. 2 f Create New Project — Step 4 of 4 New Project Information

New Project Information Name SAS Metadata location SAS Application server Server Directory

My Project /Users/student/My Folder SASApp - Logical Workspace Server D:\SAS\Config\Lev1\AppData\Enterpri:

<Back

! Finish j

Cancel

The SAS Enterprise Miner client application opens the project that you created. HHE3

? . Enterprise Miner - My Project File

Edit

03 3 i

View

fictions

Option?

Window

Help

I I iSNMJQlQjsjIjijI

j ]

• I Data Sources m| Diagrams 5 I Model Packages Bl

Users

Value

Property Name

My Project

Project Start C o d e C re sled

6/17/09 2 40 PM

Server

SASApp • L o g i c a l W o r k s p a c

Grid Available

No

Path

D ASASIC onfiglLevt tApp Data

Max. Concurrent T a s k s

Default

| f t student as student [i3^ Connected to SASApp - Logical Workspace Server


2-10

Chapter 2 Accessing and Assaying Prepared Data

Creating a SAS Library A SAS library connects SAS Enterprise Miner with the raw data sources, which are the basis of your analysis. A library can link to a directory on the SAS Foundation server, a relational database, or even an Excel workbook. To define a library, you need to know the name and location of the data structure that you want to link with SAS Enterprise Miner, in addition to any associated options, such as user names and passwords. Follow the steps below to create a new SAS library. I.

Select File •=> New O L i b r a r y . . . from the main menu. The Library Wizard - Step l o f 3 Select Action window opens.

-Select an action (*• Create New Lihrary

Next>

Cancel


2.2 Creating a SAS Enterprise Miner Project, Library, and Diagram 2.

2-11

Select Next >. The Library Wizard is updated to show Step 2 of 3 Create or Modify.

Engine

Name

BASE

f Library infor mation Path Browse.,

Options

<Back

Next>

3.

Type AAEM61 in the Name field.

4.

Type the path C: \ w o r k s h o p \ w i n s a s \ a a e m 6 1 in the P a t h field. £

Cancel

_ .Library Wizard —Step 2 of 3 Create or Engine

Name

3

BASE

AAEM61

Library information -

r

Path C: \Workshop\Winsas\aaem61

Browse.

Options

< Back

Next >

Cancel

I f your data is installed in a different directory, specify this directory' in the P a t h field.


2-12 5.

Chapter 2 Accessing and Assaying Prepared Data Select Next >. The Library Wizard window is updated to show Step 3 of 3 Confirm Action. .Library Wizard ~ S t e p 3 o f 3 Confirm Action Property Action IName [Engine 'Path 'Options

Value Create New AAEMS1 BASE C:\WorkslioplWirisas\aaem61

r-Status Action Succeeded! The Library "AAEM61" was created.

<Back

Finish H

The Confirm Action window shows the name, type, and path of the created SAS library. 6.

Select Finish. A l l data available in the SAS library can now be used by SAS Enterprise Miner.


2.2 Creating a SAS Enterprise Miner Project, Library, and Diagram

2-13

Creating a SAS Enterprise Miner Diagram

A SAS Enterprise Miner diagram workspace contains and displays the steps involved in your analysis. To define a diagram, you need only specify its name. Follow the steps below to create a new SAS Enterprise Miner diagram workspace. 1.

Select File & New

D i a g r a m . . . from the main menu.

2.

Type the name P r e d i c t i v e A n a l y s i s in the D i a g r a m Name field and select OK. SAS Enterprise Miner creates an analysis workspace window labeled Predictive Analysis. ^ E n t e r p r i s e Miner - M y Project Fiie

E*

View

3

Actions

Options

Window

hjelp

:--|HNB|D1D|3JUJI

••!

:

I 1 i

I 1 I *\&\

My Project

>. OJ Data Sources -

Sarrple f i i r p b r e | Modify | Model | Assess j Utility j Credit Scoring |

JtJ Diagrams Predictive Analysis

Predictive Analysis

BB _ £ ]Model Packages •. n I Users

Property

Value

ID

EMWS

Name

Predictive Anatysis

Slr.lus

Open

Moles History

Diagram Precfctive Anatyss opened

.

...

|<? student as student [\)^ Connected to SASApp - Logical Workspace Server

You use the Predictive Analysis window to create process flow diagrams.


2-14

Chapter 2 Accessing and Assaying Prepared Data

1. Creating a Project, Defining a Library, and Creating an Analysis Diagram a. Use the steps demonstrated on pages 2-6 to 2-8 to create a project for your SAS Enterprise Miner analyses on your computer. b. Use the steps demonstrated on pages 2-10 to 2-12 to define a SAS library to access the course datafromyour project. f

Your instructor will provide the path to the data if it is different than that specified in these notes.

c. Use the steps demonstrated on page 2-13 to create an analysis diagram in your project.


2.3 Defining a Data Source

2.3

2-15

Defining a Data S o u r c e

Defining a Data Source

Select table. Analysis Data

Define variable roles. Define measurement levels. Define table role.

SAS Foundation Server Libraries 10

After you define a new project and diagram, your next analysis task in SAS Enterprise Miner is usually to create an analysis data source. A data source is a link between an existing SAS table and SAS Enterprise Miner. To define a data source, you need to select the analysis table and define metadata appropriate to your analysis task. The selected table must be visible to the SAS Foundation Server via a predefined SAS library. Any table found in a SAS library can be used, including those stored in formats external to SAS. such as tables from a relational database. (To define a SAS library to such tables might require SAS/ACCESS product licenses.) The metadata definition serves three primary purposes for a selected data set. It informs SAS Enterprise Miner of the following: • the analysis role of each variable • the measurement level o f each variable • the analysis role of the data set The analysis role of each variable tells SAS Enterprise Miner the purpose of the variable in the current analysis. The measurement level o f each variable distinguishes continuous numeric variables from categorical variables. The analysis role of the data set tells SAS Enterprise Miner how to use the selected data set in the analysis. A l l of this information must be defined in the context o f the analysis at hand. Other (optional) metadata definitions are specific to certain types o f analyses. These are discussed in the context o f these analyses. f

Any data source that is defined for use in SAS Enterprise Miner should be largely ready for analysis. Usually, most of the data preparation work is completed before a data source is defined. It is possible (and useful for documentation purposes) to write the completed data preparation process in a SAS Code node embedded in SAS Enterprise Miner. Those details are outside the scope of this discussion.


2-16

Chapter 2 Accessing and Assaying Prepared Data

Charity Direct Mail Demonstration ' Analysis goal: j A veterans' organization seeks continued | contributions from lapsing donors. Use lapsing-donor responses from an earlier campaign to predict future lapsing-donor responses.

11 To demonstrate defining a data source, and later, using SAS Enterprise Miner's predictive modeling tools, consider the following specific analysis example: A national veterans' organization seeks to better target its solicitations for donation. By only soliciting the most likely donors, less money will be spent on solicitation efforts and more money will be available for charitable concerns. Solicitations involve sending a small gift to an individual and include a request for a donation. Gifts to donors include mailing labels and greeting cards. The organization has more than 3.5 million individuals in its mailing database. These individuals are classified by their response behaviors to previous solicitation efforts. Of particular interest is the class of individuals identified as lapsing donors. These individuals made their most recent donation between 12 and 24 months ago. The organization seeks to rank its lapsing donors based on their responses to a greeting card mailing sent in June of 1997. (The charity calls this the 97NK Campaign.) With this ranking, a decision can be made to either solicit or ignore a lapsing individual in the June 1998 campaign.


2.3 Defining a Data Source

2-17

Charity Direct Mail Demonstration Analysis goal: A veterans' organization seeks continued contributions from lapsing donors. Use lapsing-donor responses from an earlier campaign to predict future lapsing-donor responses. Analysis data: •

E x t r a c t e d f r o m p r e v i o u s year's c a m p a i g n

S a m p l e b a l a n c e s r e s p o n s e / n o n - r e s p o n s e rate

• A c t u a l r e s p o n s e rate a p p r o x i m a t e l y 5%

12 The source of this data is the Association for Computing Machinery's (ACM) 1998 KDD-Cup competition. The data set and other details of the competition are publicly available at the UCI KDD Archive at kdd.ics.uci.edu. For model development, the data were sampled to balance the response and non-response rates. (The reason and consequences of this action are discussed in later chapters.) In the original campaign, the response rate was approximately 5%. The following demonstrations show how to access the 97NK campaign data in SAS Enterprise Miner. The process is divided into three parts: • specifying the source data • setting the columns metadata • finalizing the data source specification


2-18

Chapter 2 Accessing and Assaying Prepared Data

Defining a Data Source

Specifying Source Data A data source links SAS Enterprise Miner to an existing analysis table. To specify a data source, you need to define a SAS library and know the name of the table that you will link to SAS Enterprise Miner. f o l l o w these steps to specify a data source. I.

Select File •=> New •=> Data Source... from the main menu. The Data Source Wizard - Step l o f 7 Metadata Source opens.

Select a metadata source

/

Source:

Next

>

Cancel

The Data Source Wizard guides you through a seven-step process to create a SAS Enterprise Miner data source. Step I tells SAS Enterprise Miner where to look for initial metadata values. The default and typical choice is the SAS Table that you will link to in the next step.


2.3 Defining a Data Source 2.

Select Next > to use a SAS

2-19

table (the common choice) as the source for the metadata.

The Data Source Wizard continues to Step 2 of 7 Select a SAS Table. S f Data Source Wizard —Step 2 of 7 Select a SAS Table

Select a SAS table

Browse.

Table:

< Back

3.

Cancel

In this step, select the SAS table that you want to make available to SAS Enterprise Miner. You can either type the library name and SAS table name as lihname.

taJblename or select the SAS table

from a list. 4.

Select Browse... to choose a SAS table from the libraries that are visible to the SAS Foundation Server. The Select a SAS

Table window opens.

g T Select a SAS Table \%4 5 Libraries S A

-0 2J) 2P ~-j[p

Aaern61 Apfrntlib Maps Sampsio

-) )

_J Sasdata jjjj Sashelp • '• ^ Sasuser • • • • ^ Stpsarnp J ) " 0 Work jjj |) Wrsdist ) Wrstemp \

I Name Aaern61 Apfrntlib Maps Sampsio Sasdata Sashelp Sasuser Stpsarnp Work Wrsdist Wrstemp

BASE V9 V9 V9 BASE V9 V9 BASE V9 BASE BASE

Refresh

Path

Engine

C:\Workshop\Winsas\aaem61 D:\SA5\Config\Levl\5A5Ap,,. D:\Program Files\SA5\SASF... D:\Program Files\5AS\5A5F... D:\SAS\Config\Levl\5A5Ap... D:\Program Files\SAS\SASF... C:\Documents and 5ettings\... D:\Prograrn Files\5AS\SASF... C: \DOCUME~1 \Student\LO... D: \S AS\Conf ig\Lev 1 \SASAp.., D:\SAS\Config\Levl\5ASAp...

Properties.

Cancel


2-20

Chapter 2 Accessing and Assaying Prepared Data

One of the libraries listed is named A A E M 6 1 , which is the library name defined in the Library Wizard. 5.

Double-click the Aaem61 library. The panel on the right is updated to show the contents of the A A E M 6 1 library. Select a SAS Table SAS Libraries j f j Aaem61 j$ Apfrntlib -JP Maps Sampsio Sasdata -) Sashelp j j l Sasuser ) Stpsarnp •-Jp Work Wrsdist ^ Wrstemp r

El Name

Type

[=jBank DATA [ | ] Census2000 DATA C r e d i t DATA DATA f_Q Dottrain DATA [HQ Dotvalid : jH Dungaree DATA [ J Inq2005 DATA DATA i | j Insurance 1 ]Q Organics DATA DATA g§| Profile B | Pva97nk DATA ~2 Scoreorga... DATA ; ^]Scorepva9... DATA

Created Date 2004-01-19 2006-07-05 2006-09-26 2006-07-25 2006-07-25 2004-01-19 2006-09-19 2006-09-19 2008-04-10 2006-09-23 2008-03-03 2008-05-28 2008-03-03

H

Refresh

6.

Modified Date 2004-01-19 2006-07-05 2006-09-26 2006-07-25 2006-07-25 2004-01-19 2006-09-19 2006-09-19 2008-04-10 2006-09-23 2008-03-03 2008-05-28 2008-03-03

Properties

7 S0KB 1MB 372KB 32KB 32KB 40KB 9MB 5MB 2MB 5MB 1MB 408KB 18MB OK

Size

Cancel

Select the Pva97nk SAS table. f g S e l e c t a S A S Table z j SAS Libraries J Aaem61 -ffr Apfrntlib -2$ Maps Sampsio Sasdata ^ Sashelp

Jp Sasuser -jjj Stpsarnp •1j Work j i | Wrsdist

Name

DATA L^Bank fj~] Census2000 DATA DATA HO Credit DATA [Hj] Dottrain f"J Dotvalid DATA DATA ~J Dungaree DATA | j | Inq2005 DATA : ^2 Insurance DATA : |Q Organics DATA [|t] Profile DATA Pva97nk Scoreorga... DATA [HQ Scorepva9,,, DATA

-^J Wrstemp I P :

Type

Created Date

Modified Date

2004-01-19 2006-07-05 2006-09-26 2006-07-25 2006-07-25 2004-01-19 2006-09-19 2006-09-19 2008-04-10 2006-09-23

2004-01-19 2006-07-05 2006-09-26 2006-07-25 2006-07-25 2004-01-19 2006-09-19 2006-09-19 2008-04-10 2006-09-23

780KB 1MB 372KB 32KB 32KB 40KB 9MB 5MB 2MB 5MB

2008-03-03 2008-05-28 2008-03-03

2008-03-03 2008-05-28 2003-03-03

1MB 408KB 18MB

Refresh

Properties...

OK

Size A.

Cancel


2.3 Defining a Data Source

7.

2-21

Select OK. The Select a SAS Table window closes and the selected table appears in the T a b l e field. Sf Data Source Wizard - Step 2 of 7 Select a SAS Table

J Select a SAS table

Table:

Browse

M61 .PVA97NK

< Back

Next >

Cancel

Select Next >. The Data Source Wizard proceeds to Step 3 of 7 Table Information. |£f Data Source Wizard — Step 3 of 7 Table Information

Table Properties Property Table Name

Value AAEM6I.PVA97NK

Description Member Type

DATA

Data Set Type

DATA

En;:ne

BASE

Number of vanables

28

Number of Observations

96S6

C r e a t e d Date

M a r c h 3 , 2 0 0 8 1 0 : 0 6 : 0 0 AM E5T

Modified Date

March 3, 2008 1 0 : 0 6 : 0 0 AM EST

< Back

i' Next >

"}]

Cam s\

This step of the Data Source Wizard provides basic information about the selected table. The SAS table P V A 9 7 N K is used in this chapter and subsequent chapters to demonstrate the predictive modeling tools of SAS Enterprise Miner. As seen in the Data Source Wizard - Step 3 o f 7 'fable Information window, the table contains 9.686 cases and 28 variables.


2-22

Chapter 2 Accessing and Assaying Prepared Data

Defining Column Metadata With a data set specified, your next task is to set the column metadata. To do this, you need to know the modeling role and proper measurement level of each variable in the source data set. Follow these steps to define the column metadata: 1.

Select Next >. The Data Source Wizard proceeds to Step 4 of 7 Metadata Advisor Options. £ f Data Source Wizard — Step 4 of? Metadata Advisor Options

Metadata Advisor Options Use the basic setting to set the initial measurement levels and roles based on the variable attributes. Use the advanced setting to set the initial measurement levels and roles based on both the variable attributes and distributions.

F Basic

<~ Advanced

Customize

< Back

11

ftext>"""||

Cancel

This step of the Data Source Wizard starts the metadata definition process. SAS Enterprise Miner assigns initial values to the metadata based on characteristics of the selected SAS table. The Basic setting assigns initial values to the metadata based on variable attributes such as the variable name, data type, and assigned SAS format. The Advanced setting assigns initial values to the metadata in the same way as the Basic setting, but it also assesses the distribution o f each variable to better determine the appropriate measurement level. 2.

Select Next > to use the Basic setting.


2.3 Defining a Data Source

2-23

The Data Source Wizard proceeds to Step 5 of 7, Column Metadata. ST Data Source Wizard —Step 5 of 7 Column Metadata I

lolumns;

I

nnf

f Label

Explore

Compute Summary

f~ Basic

Role

Level

Input Input Input Input Input Input Input Input Input Input Input Input Input Input Input Input Input ID Input Input Input Input Input Input Input Input Tarqet

Interval Nominal Nominal Nominal Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Nominal Interval Interval Interval Interval Interval Interval Nominal Interval Interval

i Show code

Apply

tn

\~ Mining

Name DemAge 0 em Cluster DemGender DemHomeOwner DemMedHomeValue DemMedlncome iDernPctVeterans GiftAvg36 GiftAvqAII GiftAvqCard36 GiflAvgLast GiftCnt36 GittCntAII GmCntCard36 GiftCntCardAN GifiTimeFirst GiftTimeLast ID PromCntl 2 PromCnt36 PromCntAII PromCntCard12 PromCntCard36 PromCntCardAII StatusCat96NK StatusCatStarAll TaraetB

I F i l l IA!

Reset

f Statistics

Order

Report

Dr No No ^0 No No No No

Mo No Mo No MO No No No No No NO No No NO No No No No No No No No No No No NO No

ho No No No No No No No No

No No No No No

No No No No No No

„„

I n t n n n l

< Back

Next> ;

^

1 :

1

Cancel

The Data Source Wizard displays its best guess for the metadata assignments. This guess is based on the name and data type of each variable. The correct values for model role and measurement level are found in the PVA97NK metadata table on the next page. A comparison of the currently assigned metadata to that in the PVA97NK metadata table shows several discrepancies. While the assigned modeling roles are mostly correct, the assigned measurement levels for several variables are in error. It is possible to improve the default metadata assignments by using the Advanced option in the Metadata Advisor. 3.

Select < Back in the Data Source Wizard. This returns you to Step 4 o f 6 Metadata Advisor Options.

4.

Select the Advanced option.


2-24

Chapter 2 Accessing and Assaying Prepared Data

PVA97NK Metadata Table Model Role

Name

Measurement Level

Description

DemAge

Input

Interval

Age

DemCluster

Input

Nominal

Demographic Cluster

DeraGender

Input

Nominal

Gender

DemHomeOwner

Input

Binary

Home Owner

DemMedHomeValue

Input

Interval

Median Home Value Region

DemMedl n come

Input

Interval

Median Income Region

DemPctVeterans

Input

Interval

Percent Veterans Region

GiftAvg36

Input

Interval

Gift Amount Average 36 Months

GiftAvgAll

Input

Interval

Gift Amount Average A l l Months

GiftAvgCard36

Input

Interval

Gift Amount Average Card 36 Months

GiftAvgLast

Input

Interval

Gift Amount Last

Giftent36

Input

Interval

Gift Count 36 Months

GiftCntAll

Input

Interval

Gift Count A l l Months

GiftCntCard36

Input

Interval

Gift Count Card 36 Months

GiftCntCardAll

Input

Interval

Gift Count Card A l l Months

GiftTimeFirst

Input

Interval

Time Since First Gift

GiftTimeLast

Input

Interval

Time Since Last Gift

ID

Nominal

Control Number

PromCntl2

Input

Interval

Promotion Count 12 Months

PromCnt3 6

Input

Interval

Promotion Count 36 Months

PromCntAll

Input

Interval

Promotion Count A l l Months

PromCntCardl2

Input

Interval

Promotion Count Card 12 Months

PromCntCard3 6

Input

Interval

Promotion Count Card 36 Months

PromCntCardAII

Input

Interval

Promotion Count Card A l l Months

StatusCat96NK

Input

Nominal

Status Category 96NK

StatusCatStarAll

Input

Binary

Status Category Star A l l Months

Targets

Target

Binary

Target Gift Flag

TargetD

Rejected

1 nterva!

Target Gift Amount


2 3 Defining a Data Source

5.

2-25

Select Next > to use the Advanced setting. The Data Source Wizard again proceeds to Step 5 of 7 Column Metadata.

m

ST Data Source Wizard — Step 5 of 7 Column Metadata

1 I n n t IFI-II isl to lolumns:

f

P

Label Role

Name

DernArje Input Rejected DemCluster DemGender Input DetnHorneOwner Input DemMedHomeValue Input DemMedlncome Input DernPctVeterans Input GiftAvg36 Input GlftAvgAll Input GiflAvgCard36 Input GiftAvgLast Input Input GirtCnt36 GlftCntAll Input GifiCntCard36 Input GiflCntCardAII Input Input GiftTimeFirst Input GiftTimeLast ID ID PromCntl 2 Input PromCnt36 Input PromCntAII Input PromCntCard12 Input PromCntCard36 Input Input PromCntCardAII Input StatusCat96NK Input StatusCatStarAII Target Targets

Apply

Mining

C

Basic

r

Report

Order

Level

Interval Nominal Jominal linary Interval Interval Interval Interval Interval Interval Interval Nominal Interval Nominal

Reset Statistics

No

No No

Dr

No No No

interval Interval

Interval Nominal Interval Interval Interval Nominal Interval Interval Nominal Binary Binary

No

No NO

No

No

o No 0

No

«

Show code

Explore

Refresh Summary

<Back

! Next>

:

Cancel

While many o f the default metadata settings are correct, there are several items that need to be changed. For example, the D e m C l u s t e r variable is rejected (for having too many distinct values), and several numeric inputs have their measurement level set to N o m i n a l instead o f I n t e r v a l (for having too few distinct values). To avoid the time-consuming task of making metadata adjustments, go back to the previous Data Source Wizard step and customize the Metadata Advisor. 6. Select < Back. You return to the Metadata Advisor Options window.


2-26 7.

Chapter 2 Accessing and Assaying Prepared Data

Select Customize.... The Advanced Advisor Options dialog box opens. B £ Advanced Advisor Options

Property

Value

[Missing Percentage Threshold 50 Reject Vars with Excessive Missing Value Yes Class Levels Count Threshold 20 Detect Class Levels Yes Reject Levels Count Threshold 20 Reject Vars with Excessive Class Values Yes

Missing Percentage Tlireshold Specify a maximum percentage of missing values for variables to be rejected. The default value is 50.

OK

Cancel

Using the default Advanced options, the Metadata Advisor can do the following: •

reject variables with an excessive number of missing values (default=50%)

detect the number class levels of numeric variables and assign a role of N o m i n a l to those with class counts below the selected threshold (default=20)

detect the number class levels of character variables and assign a role of Re j e c t e d to those with class counts above the selected threshold (default=20)

f

In the PVA97NK table, there are several numeric variables with fewer than 20 distinct values that should not be treated as nominal. Similarly, there is one class variable with more than 20 levels that should not be rejected. To avoid changing many metadata values in the next step of the Data Source Wizard, you should alter these defaults.


2.3 Defining a Data Source

2-27

Type 2 as the Class Levels Count Threshold value so that only binary numeric variables are treated as categorical variables. Advanced Advisor Options

Property

Missing Percentage Threshold 50 Reject Vars with Excessive Missing ValueYes Class Levels Count Threshold 2 Detect Class Levels Yes Reject Levels Count Threshold 20 Reject Vars with Excessive Class Values Yes

Value

Class Levels Count Threshold I f "Detect class levels"=Yes, interval variables with less than the number

specified for this property will be marked as NOMINAL. The default value is 20.


2-28

9.

Chapter 2 Accessing and Assaying Prepared Data

Type 100 as the Reject Levels Count Threshold value, so that only character variables with more than 100 distinct values are rejected. f

Be sure to press ENTER after you type the number 100. Otherwise, the value might not be registered in the field. Advanced Advisor Options

Property

Missing Percentage Threshold 50 Reject Vars with Excessive Missing Value Yes Class Levels Count Threshold 2 Detect Class Levels Yes Reject Levels Count Threshold 100

Value

Reject Vars with Excessive Class Values Yes

Reject Levels Count Threshold Specify a rnaximum number of levels for a class variable before being marked REJECTED. The default value is 20.

OK

10. Select OK to close the Advanced Advisor Options dialog box.

Cancel


2.3 Defining a Data Source

2-29

11. Select Next > to proceed to Step 5 o f the Data Source Wizard. g £ Data Source Wizard — Step 5 o f ? Column Metadata 1 I

Irnnnf "! 1

Columns:

nnt

["" label

I F i n is!

Apply

to

r~ Mining

r

Role

Level

input Input DemCluster DemG-ender Input DemHomeQwner Input DemMedHomeValue Input Input DemMedlncorne DemPctVeterans Input GiftAvg36 Input Input jGiftAvgAII GiftAvgCard36 Input Input GiftAvgLast ln|)irt GiftCni36 GiftCntAII Input GiftCntCard36 Input Input GiftCntCardAII Input GiftTimeFirst Input GiftTimeLast ID ID Input PromCnt12 Input PromCnt36_ Input PromCntAII PromCntCardT2 Input PromCntCard36 Input PromCntCardAII Input StatusCat98NK Input StatusCatStarAII Input TargetB Target TarqetD Taruet

Interval Nominal Nominal Binary Interval Interval Interval Interval Interval Interval Interval Interval (interval interval

Name

DemAge

Interval Interval Interval Jo mi rial Interval Interval Interval Interval Interval interval Nominal Binary Binary Interval

Basic Report

r

Reset Statistics

Order No

No

Dr

*

No

No No NO NO

No NO No NO

NO

"So" No No

No No

i l Show code

Explore

Refresh Summary

<Back

Next >

Cancel

A comparison o f the Column Metadata table to the table at the beginning o f the demonstration shows that most o f the metadata is correctly defined. SAS Enterprise Miner correctly inferred the model roles for the non-input variables by their names. The measurement levels are correctly defined by using the Advanced Metadata Advisor.

The analysis o f the PVA97NK data in this course focuses on the T a r g e t B variable, so the T a r g e t D variable should be rejected.


2-30

Chapter 2 Accessing and Assaying Prepared Data

12. Select Role •=> Rejected for T a r c r e t D . £ f Data Source Wizard —Step 5 of 7 Column Metadata I

Columns:

f

Label

Name

T U I l i O I I I O M I U i z-

PromCntCard36 PromCntCardAII StatusCat_96NK Statu sCatStarAN TargetB

TargetD

Show code

Explore

nnt

Refresh Summary

If™ml

tn

2J

P Mining Role

Tripntr

Input Input Input Input Target Rejected

f" Basic

Level

Interval Interval luminal imary binary Interval

Report

Apply r

Reset Statistics

Order

Dr

Np_ No

2l <Back

Next>

Cancel

In summary, Step 5 of 7 Column Metadata is usually the most time-consuming of the Data Source Wizard steps. You can use the following tips to reduce the amount o f time required to define metadata for SAS Enterprise Miner predictive modeling data sets: • Only include variables that you intend to use in the modeling process in your raw data source. • For variables that are not inputs, use variable names that start with the intended role. For example, an ID variable should start with I D and a target variable should start with T a r g e t . • Inputs that are to have a nominal measurement level should have a character data type. • Inputs that are to be interval must have a numeric data type. • Customize the Metadata Advisor to have a Class Level Count set equal to 2 and a Reject Levels Count set equal to a number greater than the maximum cardinality (level count) of your nominal inputs.


2.3 Defining a Data Source

2-31

Finalizing the Data Source Specification Follow these steps to complete the data source specification process: I.

Select Next > to proceed to Decision Configuration. f

The Data Source Wizard gained an extra step due to the presence o f a categorical (binary, ordinal, or nominal) target variable. D a t a Source W i z a r d — Step 6 of 9 Decision Configuration

Decision Processing Do you want to build models based on the values of the decisions ? If you answer yes, you may enter information on the cost or profit of each possible decision, prior probability and cost function. The data will be scanned for the distributions of the target variables.l : mm mm mm

a-

no

<Back

C

Yes

Next >

Cancel

When you define a predictive modeling data set, it is important to properly configure decision processing. In fact, obtaining meaningful models often requires using these options. The PVA97NK table was structured so that reasonable models are produced without specifying decision processing. However, this might not be the case for data sources that you will encounter outside this course. Because you need to understand how to set these options, a detailed discussion of decision processing is provided in Chapter 6, "Model Assessment." f

Do n o t select Yes here because that changes the default settings for subsequent analysis steps and yields results that diverge from those in the course notes.


2-32

2.

Chapter 2 Accessing and Assaying Prepared Data

Select Next >. By skipping the decision processing step, you reach the next step of the Data Source Wizard. ] § [ D a t a Source W i z a r d — Step 7 of 8 D a t a Source A t t r i b u t e s

You may change the name and the role, and can specify a population segment identifier for the data source to be created.

Name:

|PVA97NK

Role:

|Raw

Segment:

[

<Back !

Next >

Cancel

This penultimate step enables you to set a role for the data source and add descriptive comments about the data source definition. For the upcoming analysis, a table role o f Raw is acceptable.


2.3 Defining a Data Source

3.

The final step in the Data Source Wizard provides summary details about the data table that you created. Select Finish. j£TData Source Wizard — Step 8 of 8 Summary Metadata Completed. Library. AAEM61 Table: PVA97NK Role: Raw Role ID Input Input Input Rejected Target

Count 1 2 20 3 1 1

Level Nominal Binary Interval Nominal Interval Binary

< Back

! Finish I

Cancel

2-33


2-34

Chapter 2 Accessing and Assaying Prepared Data

The PVA97NK data source is added to the Data Sources entry in the Project panel. |15f Enterprise Miner - My Project File

Ecfit

View

Actions

Options

-IDI x| Window

Help

••1 I | X | H m i 0 | D | Q | a j | : i i |

1 '1 [ I MI^I

I

My Project .-: O Data Sources

Sample f&Tplore | Modify j Model | Assess | Util.ty [ Credit Scoring ] Te>t Mining]"

!-

Diagrams Predictive Analysis m L2J Model Packages v ji] Users

Pr opertY Name Variables Decisions Role Notes Library [Table Type No. Obs Mo Cols No. Bytes Segment Created By Create Date Modifier! By Modify Date Scope

Vabe PVANK PVA97NK 1 Raw AAEM61 PVA97NK DATA 96B6 be 20J9D24

...

H7898\sasdemo 2/5/101:25 PW sasdemo 2/5/10 1:25 PM Local

_il

ID Di3gra.ii Predictive Analysis opened

4.

^ -

- J -

©i

[ 0 ll7898\sasdemo as sasdemo [ s ^ Connected to SASApo • Logical Workspace Server

Select the PVA97NK data source to obtain table properties in the SAS Enterprise Miner Properties panel.


2.3 Defining a Data Source

2-35

Exercises

2. Creating a SAS Enterprise Miner Data Source Use the steps demonstrated on pages 2-18 to 2-34 to create a SAS Enterprise Miner data source from the PVA97NK data.


2-36

Chapter 2 Accessing and Assaying Prepared Data

2.4 Exploring a Data Source As stated in Chapter 1 and noted in Section 2.3, the task of data assembly largely occurs outside of SAS Enterprise Miner. However, it is quite worthwhile to explore and validate your data's content. By assaying the prepared data, you substantially reduce the chances of erroneous results in your analysis, and you can gain insights graphically into associations between variables. In this exploration, you should look for sampling errors, unexpected or unusual data values, and interesting variable associations. The next demonstrations illustrate SAS Enterprise Miner tools that are useful for data validation.


2.4 Exploring a Data Source

2-37

Exploring Source Data SAS Enterprise Miner can construct interactive plots to help you explore your data. This demonstration shows the basic features o f the Explore window. These include the following: • opening the Explore window • changing the Explore window sample size • creating a histogram for a single variable • changing graph properties for a histogram • changing chart axes • adding a missing bin to a histogram • adding plots to the Explore window • exploring variable associations

Opening the Explore Window There are several ways to access the Explore window. Use these steps to open the Explore window through the Project panel. 1.

Open the Data Sources folder in the Project panel and right-click the data source o f interest. The Data Source Option menu appears.

rill

ST Enterprise Miner - My Project File Edit View Actions Options Window Help D-l

iXiSlHffialDlaji^l

;

I

_J_J_.I.

3 My Project - O Data Sources V

IN] i t f V

Sample ["Explore | Modify | Model | Assess j Utility | Credit Scoring

J

I Diaqrar Rename 3£>Pre Duplicate Delete v fijjModell • JTj Users Edit Variables... Refresh Metadata Edit Decisions.,. -

ID.

Explore...

Name Variables Decisions Role Notes Library Table No_._p_bs No. Cols No. Bytes Segment Created By Create Date MnHifiari Du

M]^J

i;:; Predictive Analysis

Value JK

Raw M61 'A97NK DATA 9686 28 2049024 student 6/17/09 3:40 Ptvl"

Diagram Predictive Analysis opened

Id

J [ft student as student

-SJ|i^O|Q - J -

j)ioo%|:-oj"

Connected to SASApp - Logical Workspace Server


2-38

2.

Chapter 2 Accessing and Assaying Prepared Data

Select E x p l o r e . . . from the Data Source Option menu. I f you are using an installation of SAS Enterprise Miner with factory default settings, the Large Data Constraint alert window opens.

Row limit of 2,000 observations reached.

5

OK

By default, a maximum of 2000 observations are transferred from the SAS Foundation Server to the SAS Enterprise Miner client. This represents about a fifth of the PVA97NK table. 3.

Select OK to close the Large Data Constraint alert window. The Explore - A A E M 6 1 .PVA97NK window opens. Explore - AAEM61.PVA97INK File

View

- n|x|

Actions Window

j • Sample Properties Value

Property

Rows Columns Library Member

9636

28 AAEM61 PVA97NK DATA Top Default 2000 12345

Type

Sample Method Fetch Size Fetched Rows Random Seed

Apply AAEM61.PVA97NK

Target G... Control 000014974 000006294 1 00046110 1 00185937 000029637 1 00112632 000123712 000045409 1 00019094 000178271 1 00102224

Plot...

m Target G... GiftCou... $4.00 $10.00 $11.00 $40.00 $6.00

Gift Cou... Gift Cou... 4 8 41 12 1 11 4 4 3 S 16

Gift Cou... Gift Ami $17*1 3 $2£ 3 $E 20 $1t 8 $2C 1 $11 g $ie 3 $1; 3 $se 1 $E 2 $E_L; 13 =l

The Explore window features a 2000-observation sample from the PVA97NK data source. Sample properties are shown in the top half of the window and a data table is shown in the bottom half.


2.4 Exploring a Data Source

2-39

Changing the Explore Sample Size The Sample Method property indicates that the sample is drawn from the top (first 2000 rows) of the data set. Use these steps to change the sampling properties in the Explore window. Although selecting a sample through this method is quick to execute, fetching the top rows o f a table might not produce a representative sample of the table.

f 1.

Left-click the Sample Method value field. The Option menu lists two choices: Top (the current setting) and Random.

2.

Select Random from the Option menu. - • Explore - AAEM61 .PVA97NK File

View

-ln|x

Actions Window

O St | i Sample Properties

Value

Property 96S6

ROWS

28 AAEM61 PVA97NK DATA Random Default 2000 12345

Columns Library Member Type Sample Method Fetch Size Fetched Rows iRandom Seed

Apply

Obs#

Target G. . Control ... Target G... Gift Cou... Gift Cou... Gift Cou... Gift Cou... Gift 000014974 2 4 1 3 1 000006294 1 8 0 3 1 00046110 $4.00 6 41 3 20 4 100185937 $10.00 3 8 M 12 5 000029637 1 1 1 6 100112632 $11.00 3 11 2 9 7 000123712 2 4 2 3 8 000045409 4 0 3 1 1 00019094 $40.00 3 0 Hi 10 000178271 S 1 2 H $6.00 3 16 2 13 11 1 00102224

H:

4 1

3.

Plot...

I

Amt 51 i d

$20—' SE $1t S2C (11 $1£

$1! $3<

$E

$H •j

Select Actions •=> A p p l y Sample Properties from the Explore window menu. A new, random sample o f 2000 observations is made. This 2000-row sample now has distributional properties that are similar to the original 9686 observation table. This gives you an idea about the general characteristics of the variables. I f your goal is to examine the data for potential problems, it is wise to examine the entire data set. SAS Enterprise Miner enables you to increase the sample transferred to the client (up to a maximum of 30.000 observations). See the SAS Enterprise Miner Help file to learn how to increase this maximum value.


2-40

Chapter 2 Accessing and Assaying Prepared Data

4.

Select the Fetch Size property and select Max from the Option menu.

5.

Select Actions >=> A p p l y Sample Properties. Because there are fewer than 30,000 observations, the entire PVA97NK table is transferred to the SAS Enterprise Miner client machine, as indicated by the Fetched

Rows field.

' Explore - AAEM61.PVA97NK

-|D|x|

File View Actions Window 1 1 Sample Properties

ROWS Columns Library Member Type .__ Sample Method iFetch Size Fetched Rows Random Seed

Property

_

9686 28 AAEM61 PVA97NK DATA Random Max 9686 |12345

Value

Apply

j

Plot...

AAEM61.PVA97N

Target G... Control... Target G... Gift Cou... Gift Cou... Gift Cou... Gift Cou... Gift 3 000014974 3 000006294 100046110 $4.00 41 20 12 8 100185937 $10.00 1 000029637 1 11 9 1 00112632 $11.00 4 000123712 3 4 000045409 3 3 1 00019094 $40.00 1 5 2 000178271 16 13 1 00102224 $6.00

Ami $17*1 $2C $£ $1t $2C $11 $1J $1f $3J $E $fjj :=1

±l


2.4 Exploring a Data Source

2-41

Creating a Histogram for a Single Variable While you can use the Explore window to browse a data set, its primary purpose is to create statistical analysis plots. Use these steps to create a histogram in the Explore window. I.

Select Actions •=> Plot from the Explore window menu. The Chart wizard opens to the Select a Chart Type step. 'i

bli Scatter |£S Line alb Histogram

Plots values of two variables against each other.

Efc Density

Si!

Box Tables

M Matrix rjfj Lattice IS] Parallel Axis ©

Constellation

Cancel

Next>

n«'.ti

The Chart wizard enables the construction of a multitude of analysis charts. This demonstration focuses on histograms.


2-42 2.

Chapter 2 Accessing and Assaying Prepared Data

Select Histogram.

1* Scatter 1^ Line ilfri Histogram v

Density

71 fltt Histogram provides statistical relations between X values which are grouped as bins or discrete.

&Box £ 3 Tables \M Matrix Lattice t§] Parallel Axis © Constellation Cancel

Histograms are useful for exploring the distribution o f values in a variable.

Next>


2.4 Exploring a Data Source

3.

2-43

Select Next >. The Chart wizard proceeds to the next step, Select Chart Roles. t

ij 1;.-

-3 Missing required roles: X. A Variable Role DEMAGE DEMCLUSTER DEMGENDER DEMHOMEOWNER DEMMEDHOMEVALUE DEMMEDINCOME DEMPCTVETERANS GIFTAVG36 G1FTAVGALL GFTAVGCARD36 GIFTAVGLAST rciFTTNTSfi Response statistic:

Use default assignments Type Numeric Character Character Character Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numprir

Description Age Demographic Cluster Gender Home Owner Median Home Value ... Median income Region Percent Veteran? Re... Gift Amount Average... Gift Amount Average... Gift Amount Average... Gift Amount Last

Frequency

Cancel

<Back

To draw a histogram, one variable must be selected to have the role X.

Format

DOLLAR11 DOLLAR! I DOLLARS .2 DOLLAR9 2 DOLLARS .2 DOLLARS .2

-


2-44

4.

Chapter 2 Accessing and Assaying Prepared Data

Select Role O X for the DEMAGE variable.

Use default assignments A Variable Role DEMAGE X DEMCLUSTER DEMGENDER DEMHOMEOVVWER DEMMEDHOMEVALUE DEMMEDINCOME DEMPCT VETERANS GIFTAVG36 GIFTAVGALL GIFTAVGCARD3fi GFTAVGLAST niFTCNnR Response statistic: Frequency

Type Numeric Character Character Character Numeric Numeric Numeric Numeric Numeric Numeric Numeric ^umerir.

Description 'Age Demographic Cluster bender jHome Owner Median Home Value... Median Income Region Percent Veterans ReGift Amount Average... |Gift Amount Average... [Gift Amount Average... !Gift Amount Last Gift Gnunl 3R MnntlT?

Format

DOLLAR11 DOLLAR11 DOLLAR9.2 DOLLARS .2 DOLLARS .2 DOLLARS .2

• Cancel

<Back

The Chart wizard is ready to make a histogram of the DEMAGE variable.

Next>

Finish


2.4 Exploring a Data Source

5.

2-45

Select Finish. The Explore window is filled with a histogram o f the DEMAGE variable. Variable descriptions, rather than variable names, are used to label the axes of plots in the Explore window.

f

Explore - AAEM61 .PVA97NK File

Edit

0

View

O

Actions Window

4

,1, Graph 1

1500 H

1000-

u

c 3

500-

0.0

17.4

26/

34.8

43.5

52.2

60.9

69.6

78.3

87.0

Age Axes in Explore window plots are chosen to range from the minimum to the maximum values o f the plotted variable. Here you can see that Age has a minimum value of 0 and a maximum value o f 87. The mode occurs in the ninth bin, which ranges between about 70 and 78. F r e q u e n c y tells you that there are about 1400 observations in this range.


2-46

Chapter 2 Accessing and Assaying Prepared Data

Changing the Graph Properties for a Histogram By default, a histogram in SAS Enterprise Miner has ]0 bins and is scaled to show the entire range of data. Use these steps to change the number of bins in a histogram and change the range of the axes. While the default bin size is sufficient to show the general shape of a variable's distribution, it is sometimes useful to increase the number of bins to improve the histogram's resolution. 1.

Right-click in the data area of the Age histogram and select Graph Properties... from the Option menu. The Properties-Histogram window opens.

Graph Histogram Axes Horizontal Vertical Title/Footnote

Bins LJ Show Missing Bin [E Bin Axis [•] End Labels LJ Bin to Data Range Boundary:

Upper

Number of X Bins:

10

Number of Y Bins:

10

V

Color 115 Show Outline I2j Gradient Gradient ThteeColorRamp

OK

Apply

Cancel

This window enables you to change the appearance of your charts. For histograms, the most important appearance property (at least in a statistical sense) is the number o f bins.


2.4 Exploring a Data Source 2.

Type 87 into the N u m b e r

Graph Histogram Axes Horizontal Vertical Title/Footnote

of

X

2-47

B i n s field.

Bins LJ Show Missing Bin 12i Bin Axis E

End Labels

Q Bin to Data Range Upper

Boundary: Number of X Bins:

37

Number of Y Bins:

10

•

Color IE Show Outline IE Gradient Gradient ThreeCotorRanip

OK

Apply

Cancel

Because Age is integer-valued and the original distribution plot had a maximum o f 87. there w i l l be one bin per possible Age value.


2-48

3.

Chapter 2 Accessing and Assaying Prepared Data

Select OK. The Explore window reopens and shows many more bins in the Age histogram. [

Explore - AAEM61.PVA97NK File

Edit

View

-Infxl

Actions Window

#|a|o|fr| Ji, Graph1 200-

150& C

01

S" too

50-

'i' 'T' ' -

0

3

i''

1

i

1

1

i

1

1

i

1

1

i

1

1

i

1

1

i

1

1

i

1

1

i "

t ' 1

i "

i''

i "

r

1

''"p

1 1

i

1

i i

1 1

i

r

1

i

1

1

1

1

' i'

1

1

1

1

"t

1

1

1

1

1

i

1

1

1

1

1

i'

1

i

6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81 84 87

Age With the increase in resolution, unusual features become apparent in the Age variable. For example, there are unexpected spikes in the histogram at 10-year intervals, starting at Age=7. Also, you must question the veracity of ages below 18 for donors to the charity.


2.4 Exploring a Data Source

2-49

Changing Chart Axes A very useful feature of the Explore window is the ability to zoom in on data of interest. Use the following steps to change chart axes in the Explore window: 1.

Position your cursor under the histogram until the cursor appears as a magnifying glass.

2.

Clicking with your mouse and dragging the cursor to the right magnifies the horizontal axis o f the histogram. '."-Explore - AAEM61.PVA97NK File

Edit

View

- l O l X|

Actions Window

Graph1

^JDJXJ

i

i

r

32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Age

:


2-50

3.

Chapter 2 Accessing and Assaying Prepared Data

Position your cursor below the histogram but above the Age axis label. A horizontal scroll bar appears. -Inlxf

'/.-Explore - AAEM61.PVA97NK File

Edit

View

Actions Window

y o#

Graph 1

33 34 35 36 37 38 33 40 41 42 43 44 45 46 47 48 49 50 51 52 53 5 ^ 5 56 57 58 59 60 61 62 63 64

Age


2.4 Exploring a Data Source

4.

2-51

Click and drag the scroll bar to the left to translate the horizontal axis to the left. Explore - AAEM61 .P VA97NK File

Edit

H

View

O

-|D|X|

Actions Window

«

.tic Graph 1

^jnjxj

200-

150-

> o c

a> •

g" 100 k.

u_

50-

0

2

4

6

8

1 l i ^ 12

14

16

18

Age

20

22 ~24

26

28

30

32

34

36


2-52

5.

Chapter 2 Accessing and Assaying Prepared Data

Position your cursor to the left of the histogram to make a similar adjustment to the vertical axis o f the histogram. • • Explore - AAEM61.PVA97NK Fi!e

Edit

View

Actions Window

__jnj_xj

triii Graph 1

40-

u

30-

c 3

W i l 20-

10-

i — — i — ' — i——r 1

1

0

2

4

6

-1

10

— i

12

1

1

14

1

I

16

i

|

18

i

1

20

1

[

22

.

[

24

.

1

26

1

[

28

1

1

30

,

1

32

,

34

1 —

Age At this resolution, you can see that there are approximately 40 observations with Age less than 8. A review o f the data preparation process might help you determine whether these values are valid.


2.4 Exploring a Data Source

6.

2-53

Select ! (the yellow triangle) in the lower right corner of the Explore window to reset the axes to their original ranges. • Explore - AAEMB1.PVA97NK File

Edit

View

Actions Window

m i >it, Graph 1

200-

150-

> o c

S" 100

50-

T

0

' I ' ' 1 '

3

1

I ' ' I '

r

I '

1

I ' ' I '

1

I ' ' I ' ' I "

I '

1

i "

I ' ' 1 '

1

I '

1

I "

I

1

' I

1

' I ' ' I ' ' I ' ' I ' ' I ' ' I ' ' I

1

' I ' ' I ' ' I ' ' 1

6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81 84 87

Age


2-54

Chapter 2 Accessing and Assaying Prepared Data

Adding a "Missing" Bin to a Histogram Not all observations appear in the histogram for Age. There are many observations with missing values for this variable. Follow these steps to add a missing value bin to the Age histogram: 1.

Right-click on the graph and select Graph Properties... from the Option menu.

2.

Check the Show Missing Bin option.

Graph

Histogram Axes Horizontal Vertical Title/Footnote

Bins E

Show Missing Bin

0

Bin Axis \Z\ End Labels

[ J Bin to Data Range Boundary:

Upper

Number of X Bins:

87

Number of Y Bins:

10

•w

Color E

Show Outline

E

Gradient

Gradient

OK

Apply

Cancel


2.4 Exploring a Data Source 3.

2-55

Select OK. The Age histogram is modified to show a missing value bin. • Explore - AAEM61.PVA97NK File

Edit

View

- •

Actions Window

X

,iu Graph 1

2500 H

2000-

> 1500 o C 3

cr o>

^ 1000-

500-

I ' ' I "

I ' ' I ' ' I ' ' I ' ' I "

I i

1

I

1

' I

1

' I ' ' ! "

I ' ' I I ' I "

I

1

' I ' • I ' ' 1

1

1

I i

1

I

1

1

I ' i I "

I '

1

I "

I "

IffiM

I ' •1 "

I '

1

I

0 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81 84 87

Age With the missing value bin added, it is easy to see that nearly a quarter o f the observations are missing an Age value.


2-56

Chapter 2 Accessing and Assaying Prepared Data

Adding Plots to the Explore Window You can add other plots to the Explore window. Follow these steps to add a pie chart of the target variable. 1.

Select Actions Type step.

Plot from the Explore window menu. The Chart wizard opens to the Select a Chart

2.

Scroll down in the chart list and select a Pie chart.

pjfj Lattice Parallel Axis ® Constellation

Shows each categories contribution to a sum.

jfL 3D Charts Q j Contour t i l Bar € • Pie L i Needle H

Vector

Cancel

Next>


2.4 Exploring a Data Source 3.

2-57

Select Next >. The Chart wizard continues to the Select Chart Roles step.

^ Missing required roles: Category. A Variable DEMAGE DEMCLUSTER DEMGENDER DEMHOMEOVVNER DEMMEDHOMEVALUE DEMMEDINCOME DEMPCTVETERANS G1FTAVG36 GIF T AVGALL GFTAVGCARD36 GIFTAVGLAST GIFTCNT36 GiFTCNTALL G1FTCNTCARD36

Role

Use default assignments Description Format Age Demographic Cluster Gender Home Owner Median Home Value ... DOLLAR11 Median Income Region DOLLAR"! 1 Percent Veterans Re... Gift Amount Average... DOLLAR9.2 Gift Amount Average... DOLLARS .2 Gift Amount Average... DOLLAR9.2 Gift Amount Last DOLLARS .2 Gift Count 38 Months Gift Count All Months Gift Count Card 36 M. .

Type Numeric Character Character iCharacter Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric

Cancel

<Back

The message at the top o f the Select Chart Roles window states that a variable must be assigned the Category role.


2-58

4.

Chapter 2 Accessing and Assaying Prepared Data

Scroll the variable list and select Role •=> Category for the TARGETB variable.

Use default assignments A Variable Role GIFTCNTCARDALL GIFTTIMEFIRST GIFTTIMELAST ID PROMCNT12 PROMCNT36 PROMCNTALL PROMCNTCARD12 PROMCNTCARD36 PROMCNTCARDALL STATUSCAT96NK STATUSCATSTARALL TARGETB Category TARGETD

Type INumeric [Numeric jNumeric Character Numeric Numeric Numeric Numeric Numeric Numeric Character Numeric Numeric Numeric

Description Format Girt Count Card All M... I Times Since First Gift Time Since Last Gift Control Number Promotion Count 12 M... Promotion Count 36 M... Promotion Count All M... Promotion Count Card... Promotion Count Card... Promotion Count Card... Status Category 96NK Status Category Star... Target Gift Flag Target Gift Amount DOLLAR9.2

Cancel

<Back

Next>

Finish


2.4 Exploring a Data Source

The chart shows an equal number of cases for TARGETB=0 (top) and TARGETB=1 (bottom).

2-59


2-60

6.

Chapter 2 Accessing and Assaying Prepared Data

Select Window >=> Tile to simultaneously view all sub-windows of the Explore window. Explore - AAEM61 .PVA97NK

File Edit View Actions Window 'jr Graph2

.Jn|x|

^JDJXJ Property Rows Columns Library Member Type Sample Method iFetch Size Fetched Rows Random Seed

9686 28 AAEM61 PVA97NK DATA Random Max 9686 12345

Value

Apply I,Graphl

^inJiii

2500 2000¬ 1500 £ 1000 LL

500¬ 0

Plot...

-r""r'"i 0

i

i

i

i

i

i

i

i i"

6 12 18 24 30 35 42 48 54 60 66 72 78 84

Age

Target G... Control 000014974 000006294 100046110 1 00185937 000029637 1 00112632 000123712 000045409 1 00019094 000178271 100102224

Target G... G

$4.00 $10.00 $11.00 $40.00 $6.00

'A


2.4 Exploring a Data Source

2-61

Exploring Variable Associations A l l elements of the Explore window are connected. By selecting a bar in one histogram, for example, corresponding observations in the data table and other plots are also selected. Follow these steps to use this feature to explore variable associations: 1.

Double-click the Age histogram title bar so that it fills the Explore window.

2.

Click and drag a rectangle in the Age histogram to select cases with Age in excess o f 70 years. Explore - AAEM61.PVA97NK File

Edit

0

View

Actions Window

*i

i k Graph1

400-

0 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81 84 87

Age The selected cases are cross-hatched. (The vertical axis is rescaled to show the selection better.)

*


2-62

3.

Chapter 2 Accessing and Assaying Prepared Data

Double-click the Age histogram title bar. The tile display is restored. Explore -AAEM61.PVA97NK File

Edit

View

Actions Window

\ \ Sample

^jnjxj Property

Rows Columns Li bran/ Member Type iSample Method Fetch Size Fetched Rows Random Seed

Value

9686 23 AAEM61 PVA97NK DATA Random Max 96S6 12345

Apply

^jD[x]

•di. Graph 1

^jDjxJ Obs#

l ""!" "!'""!" "!" "! 1

0

1

1

1

r ' 'P""l " l f

l l

,

T ,

I

I'

6 12 18 24 30 36 42 48 54 60 66 72 78 84

Aoe

'

Plot...

Target G... Control... Target G... G 000014974 000006294 1 00046110 1 00185937 000029637 1 00112632 000123712 000045409 1 00019094 000178271 1 00102224

$4.00 $10.00 $11.00 $40.00 $6.00

Notice that part o f the TargetB pie chart is selected. This selection shows the relative proportion of observations with Age greater than 70 that do and do not donate. Because the arc on the TARGETB=] segment is slightly thicker, it appears that there is a slightly higher number of donors than non-donors in this Age selection.


2.4 Exploring a Data Source

4.

Double-click the Tar get B pie chart title bar to confirm this observation.

5.

Close the Explore window to return to the SAS Enterprise Miner client interface screen.

2-63


2-64

Chapter 2 Accessing and Assaying Prepared Data

Changing the Explore Window Sampling Defaults

In the previous demonstration, you were alerted to the fact that only a sample of data was initially selected for use in the Explore window. -

»2f Row limit of 2,000 observations reached. OK

Follow these steps to change the preference settings of SAS Enterprise Miner to use a random sample or all o f the data source data in the Explore window: 1.

Select Options •=> Preferences... from the main menu. The Preferences window opens.

!

Property

v Look and feel • ' •Property Sheet Tooltips '••Tools Palette Tooltips S|interactive Sampling [•Sample Method h Fetch Size -Random Seed SModel Package Options [-Generate C Score Code [•Generate Java Score Code

Value Cross Platform Off Display tool name and description Top_ Default 1 2345

; : :;

No NO

OK

Cancel


2.4 Exploring a Data Source

2.

Select Sample Method O Random.

Muser Interface

Value

Property

[•Look and feel '••Property Sheet Tooltips -Tools Palette Tooltips ;

[•Sample Method [•Fetch Size -Random Seed ©ModelPackage Options [-Generate C Score Code [•Generate Java Score Code l

A.

Cross Platform Off Display tool name and description Random Default 12345 No No

OK

Cancel

The random sampling method improves on the default method (at the top o f the data set) by guaranteeing that the Explore window data is representative of the original data source. The only negative aspect is an increase in processing time for extremely large data sources. 3.

Select Fetch Size

Max.

Value

Property [•Look and feel [-Property Sheet Tooltips '-Tools Palette Tooltips Sjlnteractive Sampling [-Sample Method [ Fetch Size -Random Seed :

BlAMIHI^WWBBilBBgi [-Generate C Score Code [-Generate Java Score Code tJ i

r-

n-^i^*.^

A. %

Cross Platform Off Display tool name and description Random Max 12345

% i

_

No No

OK

Cancel

The Max fetch size enables a larger sample of data to be extracted for use in the Explore window. If you use these settings, the Explore window uses the entire data set or a random sample o f up to 30,000 observations (whichever is smaller).

2-65


2-66

Chapter 2 Accessing and Assaying Prepared Data

Exercises

3. Using the Explore Window to Study Variable Distribution Use the Explore window to study the distribution of the variable Medium Income Region (DemMedlncome). Answer the following questions: a. What is unusual about the distribution of this variable?

b. What could cause this anomaly to occur?

c. What do you think should be done to rectify the situation?


2.4 Exploring a Data Source

2-67

Modifying and Correcting Source Data

In the previous exercise, the DemMedlncome variable was seen to have an unusual spike at 0. This phenomenon often occurs in data extracted from a relational database table where 0 or another number is used as a substitute for the value m i s s i n g or u n k n o w n . Clearly, having zero income is considerably different from having an unknown income. I f you properly use the i n c o m e variable in a predictive model, this discrepancy can be addressed. This demonstration shows you how to replace a placeholder value for missing with a true missing value indicator. In this way, SAS Enterprise Miner tools can correctly respond to the true, but unknown, value. SAS Enterprise Miner includes several tools that you can use to modify the source data for your analysis. The following demonstrations show how to use the Replacement node to modify incorrect or improper values for a variable:

Process Flow Setup Use the following steps to set up the process flow that will modify the DemMedlncome variable: 1.

Drag the PVA97NK data source to the Predictive Analysis workspace window. ^ E n t e r p r i s e Mtner - My Project File Edit

View

Actions Options Window Help

ol \n\ iigi^iigiiaiBlajl^iUWlo 3 My Project - • ! Data Sources PVA97NK - ^il Diagrams Analysis j M &Predictive r

I'M

en ^ j ^ i o i m i M J d

Sample | Explore | Modify | Model j Assess j Utility | Credit Scoring 4« Predictive Analysis

EM

* Jj Model Packages v ffi Users

Property ID Name Status Notes History

Value

EMWS

Predictive Analysi

•VA97NK

±1

fD Diagram Predictive Analysis opened

ft

VI

3

y student as student

">

©

|ioo%

S

n

| _J |

Connected to SASApp - Logical Workspace Server


2-68

2.

Chapter 2 Accessing and Assaying Prepared Data

Select the M o d i f y tab to access the Modify tool group. 2 f Enterprise Miner - My Project File

Edit

View

EMEI

Actions Options Window Help M

3 My Project B B l Data Sources ^PVA97NK E 'JsJ Diagrams Predictive Analysis ± -'^ I Model Packages E [Qj Users

Gb

©

&

Sample ) Explore | Modify) Model) Assess | Utility] Credit Scoring" -Predictive

Analysis

Property

P Name Status Notes History

EMWS Predictive Analysis Open

JLi I jJ

ID (diagram Predictive Analysis opened

rJ :

©

|l00%

•J

O student a; student ~\ Connected to SASApp - Logical Workspace Server


2.4 Exploring a Data Source

3.

2-69

Drag the Replacement tool (third from the right) from the fools Palette into the Predictive Analysis workspace window. gf Enterprise Miner - My Project File

Edit

View

Actions Options Window Help

J |0|L3H J ^ | M

23 My Project H ffl Data Sources -

fflPVA97NK

H Diagrams Predictive Analysis i+j i^J Model Packages E [ f l j Users

[Node ID Im porterjData

* Predictive Analysis

E M

Value

Property

General

Sample | Explore | Modify | Model | Assess | Utility | Credit Scoring

Repl

>VA97NK

Exported Data Notes

Train

-Replacement Ed -Default Limits MeStandard Deviati J -Cutoff Values -Replacement Ed General Diagram Predictive Analysis opened

1\

9

I

fO student as student

©

100%

Connected to SASApp • Logical Workspace Server


2-70

4.

Chapter 2 Accessing and Assaying Prepared Data

Connect the P V A 9 7 N K data to the Replacement node by clicking near the right side o f the P V A 9 7 N K node and dragging an arrow to the left side of the Replacement node. ST Enterprise Miner - My Project File Edit View Actions Options Window Help 23 My Project - O Sources [^PVA97NK _HJ Diagrams D a t a

Sample | Explore | Modify | Model | Assess | Utility | Credit Scoring |

Predictive Analysis >. ^) Model Packages •+] ifjj Users

Property

EM

[•tj Predictive Analysis

Value

General Node ID 3epl Imported Data Exported Data Notes ... Train ariables Replacement Ed Default Limits MeStandard Deviati J CutoffValues E Class Variables r Replacement Ed General piagram Predictive Analysis opened

J^.-?gpUcemeiit

: : : 'VA97NK

J j - J JJ

<7

| — J — +j

\t) student as student

|ioo%

Connected to SASApp - Logical Workspace Server

You created a process flow, which is the method that SAS Enterprise Miner uses to carry out analyses. The process flow, at this point, reads the raw P V A 9 7 N K data and replaces the unwanted values of the observations. You must, however, specify which variables have unwanted values and what the correct values are. To do this, you must change the settings of the Replacement node.


2.4 Exploring a Data Source

2-71

Changing the Replacement Node Properties Use the following steps to modify the default settings of the Replacement node: 1.

Select the Replacement node and examine the Properties panel. Property

1

Value

I

General

Node ID Imported Data Exported Data iNotes

1

Repl

... ... ...

•Replacement Editoi -Default Limits MethiStandard Deviation: -.Cutoff Values I H -Replacement Editoi - Unknown Levels [ignore

Score

Replacement ValueComputed Hide No

|

Ri

Replacement RepoYes

The Properties panel displays the analysis methods used by the node when it is run. By default, the node replaces all interval variables whose values are more than three standard deviations from the variable mean. f

You can control the number of standard deviations by selecting the C u t o f f Values property.

In this demonstration, you only want to replace the value for D e m M e d l n c o m e when it equals zero. Thus, you need to change the default setting. 2.

Select the Default Limits Method property and select None from the Options menu. Value

Property

General

Node ID Imported Data Exported Data Notes

Repl

Train

(sllnterval Variables -Replacement Editoi •Default Limits MethiNone -Cutoff Values -Replacement Editoi Ignore -Unknown Levels

Score

H H

Replacement ValueComputed No Hide

Reoort

Replacement RepoYes

... ... ... •

Q


Chapter 2 Accessing and Assaying Prepared Data

2-72

You want to replace improper values with missing values. To do this, you need to change the Replacement Value property. 3.

Select the Replacement Value property and select Missing from the Options menu. Pro pert/

Value

General

Node ID Imported Data Exported Data Notes

Train • ill Wc m

Repl ... ...

...!

\mm

-Replacement Editoi •Default Limits MethiNone •CutoffValues E Iciass Variables - Replacement Edito •Unknown Levels Iqnore

...

• •

Score

Replacement ValueMissino. Hide No

HnnBHBHMHHHHH ReplacementRepoYes You are now ready to specify the variables that you want to replace. 4.

Select |7j (Interval Variables: Replacement Editor ellipsis) from the Replacement node properties panel. Pro perry Node ID Imported Data Exported Data iNotes

Value Repl

Train EjimBffiirafflTOM

Replacement Editoi Default Limits MethiNone CutoffValues E ]ciass Variables Replacement Edito Unknown Levels Ignore

• • U U

m

Score

Replacement Value Mis sing No Hide

Report

Replacement RepoYes f

Be careful to open the Replacement Editor for Interval Variables, not for Class Variables.


2.4 Exploring a Data Source

2-73

The Interactive Replacement Interval Filter window opens. Interactive Replacement Interval Filter Columns:

f " Mining

|~~ Label Name

DemAge DemMedHomeValue DemMedlncome DemPcTVeterans GiftAvg36 GifiAvqAII Gift.a.vciCarrJ36 GirtAvgLast GiftCnt36 GiftCntAII GiftCntCard36 GiftCntCardAII GiftTirneFirst GlftTimeLast PromCntl 2 PrornCnt36 PromCntAll PromCntCardl 2 PromCntCard36 PromCntCardAII farqetD

Use Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default DefauH

<l Generate Summary

Report

to

Mo 4o 4o . MO Mo Mo Mo Mo Mo Mo No No No Mo No No No NO MO Mo

F Statistics

P Basic Limit Method

Lower Limit

Upper Limit -

default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default

• • • •

D

b

£

b P P P P

b • ' • • • •

b P b P

D 0 D D

to

- b

_J OK

Cancel


2-74

5.

Chapter 2 Accessing and Assaying Prepared Data

Select User Specified as the Limit Method value for DemMedlncome. BE Interactive Replacement Interval Filter :olumns:

f~ Mining

P Label Name

Use

DemAge Default DemMedHomeValue Default DemMedlncome Default DemPctVeterans Default Default GiftAvq36 Default GiftAvgAll GiftAvgCard36 Default GfflAvqLast Default (Default GlftCnt36 Default GiftCntAII Default GiftCntCard36 GiftCntCardAII Default Default GiftTimeFirst Default GiftTimeLast PromCntl2 Default PromCnt36 Default PromCntAll Default PromCntCard12 Default jjDefault PromCntCard36 PromCntCardAII Default TargetD Default <l Generate Summary

W Basic

Report IMo No Mo Mo Mo Mo Mo Mo Mo Mo IMo No No No Mo Mo Mo Mo Mo Mo No

Limit Method

r Lower Limit

Statistics Upper Limit

Default Default User Specified Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default

D D D D D

D D

J OK

Cancel


2.4 Exploring a Data Source 6.

2-75

Type 1 as the Lower Limit value for DemMedlncome. ST Interactive Replacement Interval Filter Columns:

f Label Name

•I

Report

Use

DemAqe DemMedHomeValue DemMedlncome DemPcfVeterans GiftAvq36 GiftAvqAll GiftAvqCard36 GiftAvqLast GlftCnt36 GiftCntAII GiftCntCard36 foiftCntCardAll iGiftTime First GiftTimeLast PromCnt12 PromCnt36 PromCntAll PromCntCard12 ;PromCntCard36 PromCntCardAII FarqetD

|7 Basjc

\~ Mining

Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default DefauH DefauH DefauH Default Default

Limit Method

r Lower Limit

Default DefauH User Specified Default Default Default Default Default Default Default Default DefauH DefauH DefauH Default DefauH Default Default Default DefauH DefauH

Mo Mo Mo No MO Mo No No MO NO Mo No No No No NO NO No No NO No

Statistics Upper Limit •

1

b

b b D D D

• P • P P "

D D ET D c

b

~

" P * P b b

Jj OK

Generate Summary

- P

_ j

<l

b

* P

Cancel

If you use this specification, any DemMedlncome values that fall below the lower limit o f 1 are set to missing. A l l other values of this variable do not change. 7.

Select O K to close the Interactive Replacement Interval Filter window.

Running the Analysis and Viewing the Results Use these steps to run the process flow that you created. 1.

Right-click on the Replacement node and select Run from the Option menu. A Confirmation window appears, and requests that you verify the run action. Confirmation *^ .

Do you want to run this path? Diagram:Predictive Analysis Path:

Replacement 1 Ves 1 No

1


2-76

2.

Chapter 2 Accessing and Assaying Prepared Data

Select Yes to close the Confirmation window. A small animation in the lower right corner o f each node indicates analysis activity in the node. The Run Status window opens when the process flow run is complete. Run Status Run completed Diagram: Predictive Analysis Path;

Replacement Results...

OK

3.

Select Results... to review the analysis outcome. The Results - Node: Replacement Diagram: Predictive Analysis window appears. tP Results - Mode: Replacement Diagram: Predictive Analysis File

Edit

View

Window

HHEi|ESoutput

ÂŁf 1 otal Replacement Counts Variable Role Label DemMedln...

INPUT

Train 2357

M e d i a n Inc

rjaet:

Student

Date:

June 1 7 , 2009

Tine:

16:24:52

* T r a i n i n g Output

V a r i a b l e Summary Measurement

HI Interval Variables Variable Replace Variable

Lower Smit

DemMedln... REP„Dem...

Upper Limit i

Limls Method .MANUAL

Label

Replacement Method

Frequency

Lower Replacemert Value

Upper Replacemeri Value

Median Inc... MISSING

The Replacement Counts window shows that 2357 observations were modified by the Replacement node. The Interval Variables window summarizes the replacement that was condLicted. The CHitput window provides more or less the same information as the Total Replacement Counts window and the Interval Variables window (but it is presented as a static text file). 4.

Close the Results window.


2.4 Exploring a Data Source

2-77

Examining Exported Data In a SAS Enterprise Miner process flow diagram, each node takes in data, analyses it, creates a result, and exports a possibly modified version of the imported data. While the report gives the analysis results in abstract, it is good practice to see the actual effects of an analysis step on the exported data. This enables you to validate your expectations at each step of the analysis. Use these steps to examine the data exported from the Replacement node. 1.

Select the Replacement node in your process How diagram.

2.

Select Exported Data •=> _MJ.

r General

Pro pert/

Node ID

Imported Data Exported Data Notes

Train

Interval Variables Replacement Edito •Default Limits MethiNone -Cutoff Values -Replacement Editoi •Unknown Levels Ignore

Score

Replacement ValueMjssing Hide iNo Report Replacement RepoYes The Exported Data - Replacement window appears.

Port TRAIN VALIDATE TEST SCORE TRANSACTION DOCUMENT

Role Table EMWS.Repi_TRAIN Train EMWS.Repl_VALIDATE Validate EMWS.Repl_TEST Test EMWS.RepLSCORE IScore E MWS. R ep l_TRAN SACTI0 N Transaction EMWS.RepLDOCUMENT Document

1 (trotvtc-. j j

Yes MO NO No •Jo NO

CxptaiS-

[

Data Exists

Ptopeities-

|

OK

This window lists the types of data sets that can be exported from a SAS Enterprise Miner process flow node. As indicated, only a T R A I N data set exists at this stage of the analysis.


2-78

3.

Chapter 2 Accessing and Assaying Prepared Data

Select the Train table from the Exported Data - Replacement window.

Port TRAIN VALIDATE TEST SCORE TRANSACTION DOCUMENT

Table Role EUWS.RepI TRAIN Train EMWS.RepLVALIDATE Validate EMWS.RepLTEST Test EMWS.RepLSCORE Score IEMWS.Repl_TRANSACTION Transaction EMWS.RepLDOCUMENT Document Browse..

4.

Yes No NO No No No Explore.

Data Btists

Properties.

OK

Select Explore... to open the Explore window again.

Sample Properties Rows Columns Library Member Type Sample Method Fetch Size Fetched Rows Random Seed

Property

Unknown 29 EMWS REPL TRAIN VIEW Random Max 9686 12345

Value

Apply []3 EMWSJtepl_TRAIN Obs#

Plot... r/Ef

m

j Target Gilt-| Control Nu...| Target Gift ...| Gilt Count 3..l0ffl Count A.J Gift Count... I Gilt Count... I Gilt Amount... 000014974 4 3 1 $17.00 000006294 8 3 2 $20.00 100046110 $4.00 41 20 3 $6.00 100185937 $10.00 8 4 12 $10.00 000029637 1 1 $20.00 5 100112632 9 $11.00 $11.00 11 6 000123712 3 $15.00 4 7 000045409 3 $15.00 4 8 10O019094 1 $35.00 3 S $40.00 000178271 2 $6.00 5 10 1 00102224 $6.00 16 13 11 $6.00 i

'i

• i -i i i


2-4 Exploring a Data Source

5.

2-79

Scroll completely to the right in the data table. EMWS.Repl_TRA!N

1

Aqe

.F 67F .M .M 53M 47M

58M .U JF .U

.u

<

Gender

|Home Owner Median Ho...! ercentVet..J Median Income Reoiorj Replacement: DemMedlncome I A 0 U $0 $0 85 U SO $186,800 $87,600 36 $38,750 38750 u $139,200 27 $38,942 38942 u $168,100 37 $71,509 71509 u H $253,100 0 $92,514 92514 H $234,700 22 $72,868 72868 U $207,000 44 $0 U $1 37,300 32 $0 $180,700 37 $0 u $143,900 30 $0 u 3

1

1

"1<

inn.

c o

\

m

m

n>

A new column is added to the analysis data: Replacement: DemMedlncome. Notice that the values o f this variable match the Median Income Region variable, except when the original variables value equals zero. The replaced zero value is shown by a dot (.). which indicates a missing value. 6.

Close the Explore and Exported Data windows to complete this part o f the analysis.


2-80

Chapter 2 Accessing and Assaying Prepared Data

2.5 Chapter Summary Data Access Tools Review . | j|ft; AwflystoOota

1

• Link existing analysis data sets to SAS Enterprise Miner. • Set variable metadata. • Explore variable distribution characteristics. Remove unwanted cases from analysis data.

18

Analyses in SAS Enterprise Miner are organized hierarchically into projects, data libraries, diagrams, process flows, and analysis nodes. This chapter demonstrated the basics of creating each of these elements. Most process flows begin with a data source node. To access a data source, you must define a SAS library. After a library is defined, a data source is created by linking a SAS table and associated metadata to SAS Enterprise Miner. After a data source is defined, you can assay the underlying cases using SAS Enterprise Miner Explore tools. Care should be taken to ensure that the sample of the data source that you explore is representative of the original data source. You can use the Replacement node to modify variables that were incorrectly prepared. After all necessary modifications, the data is ready for subsequent analysis.


Chapter 3 Introduction to Predictive Modeling: Decision Trees 3.1

Introduction Demonstration: Creating Training and Validation Data

3.2

Cultivating Decision Trees Demonstration: Constructing a Decision Tree Predictive Model

3.3

Optimizing the Complexity of Decision Trees Demonstration: Assessing a Decision Tree

3.4

Understanding Additional Diagnostic Tools (Self-Study) Demonstration: Understanding Additional Plots and Tables (Optional)

3.5

Autonomous Tree Growth Options (Self-Study)

3-3 3-23

3-28 3-43

3-63 3-79

3-98 3-99

3-106

Exercises

3-114

3.6

Chapter Summary

3-116

3.7

Solutions

3-117

Solutions to Exercises

3-117

Solutions to Student Activities (Polls/Quizzes)

3-133


3-2

Chapter 3 Introduction to Predictive Modeling: Decision Trees


3.1 Introduction

3.1

3-3

Introduction

Predictive Modeling The Essence of Data Mining

7

rijH

"Mosf of the big payoff [in data mining] has been in predictive modeling." - Herb Edelstein

3

Predictive modeling has a long history of success in the field of data mining. Models are built from historical event records and are used to predict future occurrences of these events. T h e methods have great potential for monetary gain. Herb Edelstein states in a 1997 interview (Beck 1997): "I also want to stress that data mining is successfully applied to a wide range of problems. T h e beer-and-diapers type of problem is k n o w n as association discovery or market basket analysis, that is, describing something that has happened in existing data. This should be differentiated from predictive techniques. Most of the big payoff stuff has been in predictive modeling." Predicting the future has obvious financial benefit, but, for several reasons, the ability of predictive models to do so should not be overstated. First, the models depend on a property k n o w n statistically as stationarity, meaning that its statistical properties do not change over time. Unfortunately, m a n y processes that generate events of interest are not stationary enough to enable meaningful predictions. Second, predictive models are often directed at predicting events that occur rarely, and in the aggregate, often look somewhat random. Even the best predictive model can only extract rough trends from these inherently noisy processes.


3-4

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Predictive Modeling Applications Database marketing Financial risk m a n a g e m e n t Fraud detection Process monitoring

s4

Pattern detection

4 Applications of predictive modeling are only limited by your imagination. However, the models created by S A S Enterprise Miner typically apply to one of the following categories. • Database marketing applications include offer response, up-sell, cross-sell, and attrition models. • Financial risk m a n a g e m e n t models attempt to predict monetary events such as credit default, loan prepayment, and insurance claim. • Fraud detection methods attempt to detect or impede illegal activity involving financial transactions. • Process monitoring applications detect deviations from the norm in manufacturing, financial, and security processes. • Pattern detection models are used in applications ranging from handwriting analysis to medical diagnostics.


3.1 Introduction

3-5

Predictive Modeling Training Data Training Data

, IBM ' t lnptitt .': : H I &C9 K S

—

R 9

Bwl BW

M i H i

Training data case: categorical or numeric input and target measurements

—

T o begin using S A S Enterprise Miner for predictive modeling, you should be familiar with standard terminology. Predictive modeling (also k n o w n as supervised prediction ox supervised learning) starts with a training data set. T h e observations in a training data set are k n o w n as training cases (also k n o w n as examples, instances, or records). T h e variables are called inputs (also k n o w n as predictors, features, explanatory variables, or independent variables) and targets (also k n o w n as a response, outcome, or dependent variable), for a given case, the inputs reflect your state of knowledge before measuring the target. The measurement scale of the inputs and the target can be varied. T h e inputs and the target can be numeric variables, such as income. They can be nominal variables, such as occupation. They are often binary variables, such as a positive or negative response concerning h o m e ownership.


3-6

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Predictive Model Training Data

mmm

WKM

WMM

H

Predictive model: a concise representation of the input and target association W W

T h e purpose of the training data is to generate a predictive model. T h e predictive model is a concise representation of the association between the inputs and the target variables.

Predictive Model • M l

mm mmi mm mm mmm wmw mmm mm M M

m

m

m

Predictions: output of the predictive model given a set of input measurements

10

The outputs of the predictive model are called predictions. Predictions represent your best guess for the target given a set of input measurements. T h e predictions are based on the associations learned from the training data by the predictive model.


3.1 Introduction

3-7

Modeling Essentials

Predict n e w cases. Select useful inputs. Optimize complexity.

12

Predictive models are widely used and c o m e in m a n y varieties. A n y model, however, must perform three essential tasks: • provide a rule to transform a measurement into a prediction • have a m e a n s of choosing useful inputs from a potentially vast number of candidates • be able to adjust its complexity to compensate for noisy training data T h e following slides illustrate the general principles involved in each of these essentials. Later sections of the course detail the approach used by specific modeling tools.


3-8

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Modeling Essentials

Predict new cases.

13

T h e process of predicting n e w cases is examined first.

Three Prediction Types 1

inputs

mm mm mm

mmm mmm mmm mm mm mm mm mm mm

z

s

s

z

prediction

mmm mmm mmm

decisions rankings estimates

15

T h e training data is used to construct a model (rule) that relates the input variables to the original target variable. T h e predictions can be categorized into three distinct types: • decisions • rankings • estimates


3.1 Introduction

3-9

Decision Predictions

A predictive model uses input measurements to make the best decision for each case.

17

T h e simplest type of prediction is the decision. Decisions usually are associated with s o m e type of action (such as classifying a case as a donor or a non-donor). For this reason, decisions are also k n o w n as classifications. Decision prediction examples include handwriting recognition, fraud detection, and direct mail solicitation. Decision predictions usually relate to a categorical target variable. For this reason, they are identified as primary, secondary, and tertiary in correspondence with the levels of the target. B y default, model assessment in S A S Enterprise Miner assumes decision predictions w h e n the target variable has a categorical measurement level (binary, nominal, or ordinal).


3-10

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Ranking Predictions WMWf • MMbMNM i M M

mmm mmm mmm mmm mmm

mmm mm mm warn mmm mmm umm mmt BH

mm mm mm mm

720 520 470 630

A predictive model uses input measurements to optimally rank each case.

19

Ranking predictions order cases based on the input variables' relationships with the target variable. Using the training data, the prediction model attempts to rank high value cases higher than low value cases. It is assumed that a similar pattern exists in the scoring data so that high value cases have high scores. T h e actual, produced scores are inconsequential; only the relative order is important. T h e most c o m m o n example of a ranking prediction is a credit score. Ranking predictions can be transformed into decision predictions by taking the primary decision for cases above a certain threshold while making secondary and tertiary decisions for cases below the correspondingly lower thresholds. In credit scoring, cases with a credit score above 700 can be called good risks; those with a score between 600 and 700 can be intermediate risks; and those below 600 can be considered poor risks.


3.1 Introduction

3-11

Estimate Predictions

A predictive model uses input measurements to optimally estimate the target value.

21

Estimate predictions approximate the expected value of the target, conditioned on the input values. For cases with numeric targets, this number can be thought of as the average value of the target for all cases having the observed input measurements. For cases with categorical targets, this number might equal the probability of a particular target outcome. Prediction estimates are most c o m m o n l y used w h e n their values are integrated into a mathematical expression. A n example is two-stage modeling, where the probability of an event is combined with an estimate of profit or loss to form an estimate of unconditional expected profit or loss. Prediction estimates are also useful w h e n you are not certain of the ultimate application of the model. Estimate predictions can be transformed into both decision and ranking predictions. W h e n in doubt, use this option. Most S A S Enterprise Miner modeling tools can be configured to produce estimate predictions.


3-12

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Modeling Essentials - Predict Review

D e c i d e , rank,

Predict new cases.

and estimate.

23 T o summarize the first model essential, w h e n a model predicts a n e w case, the model generates a decision, a rank, or an estimate.

3.01 Quiz Match the Predictive Modeling Application to the Decision Type. Modeling Application Loss Reserving Risk Profiling Credit Scoring Fraud Detection Revenue Forecasting Voice Recognition

Decision Type A. Decision B. Ranking C. Estimate

Submit

Clear

25

T h e quiz shows s o m e predictive modeling examples and the decision types.


3.1 Introduction

Modeling Essentials

V> Select useful inputs.

27

Consider the task of selecting useful inputs.

3-13


3-14

Chapter 3 Introduction to Predictive Modeling: Decision Trees

The Curse of Dimensionality

1-D

2-D

3-D 28

T h e dimension of a problem refers to the number of input variables (more accurately, degrees of freedom) that are available for creating a prediction. Data mining problems are often massive in dimension. T h e curse of dimensionality refers to the exponential increase in data required to densely populate space as the dimension increases. For example, the eight pointsfillthe one-dimensional space but b e c o m e more separated as the dimension increases. In a 100-dimensional space, they would be similar to distant galaxies. T h e curse of dimensionality limits your practical ability tofita flexible model to noisy data (real data) w h e n there are a large number of input variables. A densely populated input space is required tofithighly complex models. W h e n you assess h o w m u c h data is available for data mining, you must consider the dimension of the problem.

3.02 Quiz How many inputs are in your modeling data? Mark your selection in the polling area.

30


3.1 Introduction

3-15

Input Reduction - Redundancy Redundancy

Irrelevancy •

*

. j a r .__

31

Input selection, that is, reducing the number of inputs, is the obvious w a y to thwart the curse of dimensionality. Unfortunately, reducing the dimension is also an easy w a y to disregard important information. T h e two principal reasons for eliminating a variable are redundancy and irrelevancy.

Input Reduction - Redundancy Redundancy • •

Input x has the same information as input X), 2

•d 32

A redundant input does not give any n e w information that w a s not already explained by other inputs. In the example above, know ing the value of input X| gives you a good idea of the value ofjfeFor decision tree models, the modeling algorithm makes input redundancy a relatively minor issue. For other modeling tools, input redundancy requires more elaborate methods to mitigate the problem.


3-16

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Input Reduction - Irrelevancy Irrelevancy

Predictions change with input x but much less with input x . 4

3

33

A n irrelevant input does not provide information about the target. In the example above, predictions change with input .v.h but not with input x 3 . For decision tree models, the modeling algorithm automatically ignores irrelevant inputs. Other modeling methods must be modified or rely on additional tools to properly deal with irrelevant inputs.

Modeling E s s e n t i a l s - Select Review

Eradicate

Select useful inputs.

redundancies and irrelevancies.

I

L

,|

34

To thwart the curse of dimensionality, a model must select useful inputs. You can do this by eradicating redundant and irrelevant inputs (or more positively, selecting an independent set of inputs that are correlated with the target).


3.1 Introduction

Modeling Essentials

i s Optimize complexity.

36

Finally, consider the task of optimizing model complexity.

3-17


3-18

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Model Complexity

37

Fitting a model to data requires searching through the space of possible models. Constructing a model with good generalization requires choosing the right complexity. Selecting model complexity involves a trade-off between bias and variance. A n insufficiently complex model might not be flexible enough, which can lead to underfitting, that is, systematically missing the signal (high bias). A naive modeler might assume that the most complex model should always outperform theothers, but this is not the case. A n overly complex model might be too flexible, which can lead to overfitting, that is, accommodating nuances of the random noise in the particular sample (high variance). A model with the right amount of flexibility gives the best generalization.


3.1 Introduction

3-19

Data Partitioning Trainina Data

Partition available data into training and validation sets.

38

A quote from the popular introductory data analysis textbook Data Analysis and Regression by Mosteller and Tukey (1977) captures the difficulty of properly estimating model complexity. "Testing the procedure on the data that gave it birth is almost certain to overestimate performance, for the optimizing process that chose it from a m o n g m a n y possible procedures will have m a d e the greatest use of any and all idiosyncrasies of those particular data...." In predictive modeling, the standard strategy for honest assessment of model performance is data splitting. A portion is used forfittingthe model, that is, the training data set. T h e remaining data is separated for empirical validation. T h e validation data set is used for monitoring and tuning the model to improve its generalization. T h e tuning process usually involves selecting a m o n g models of different types and complexities. T h e tuning process optimizes the selected model on the validation data. f

Because the validation data is used to select from a set of related models, reported performance will, on the average, be overstated. Consequently, a further holdout sample is needed for a final, unbiased assessment.The test data set has only one use, that is, to give afinalhonest estimate of generalization. Consequently, cases in the test set must be treated in the same w a y that n e w data would be treated. T h e cases cannot be involved in any w a y in the determination of the fitted prediction model. In practice, m a n y analysts see no need for a final honest assessment of generalization. A n optimal model is chosen using the validation data, and the model assessment measured on the validation data is reported as an upper bound on the performance expected w h e n the model is deployed.

With small or moderate data sets, data splitting is inefficient; the reduced sample size can severely degrade thefitof the model. Computer-intensive methods, such as the cross validation and bootstrap methods, were developed so that all the data can be used for bothfittingand honest assessment. However, data mining usually has the luxury of massive data sets.


3-20

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Predictive Model Sequence Trainin q Data

mmP<mm —

Validation Data

umm • ' mm

1

I

mm

{ w

mm

fopvB mm

mm

tmt mm

mm

Create a sequence of models with increasing complexity.

40

In S A S Enterprise Miner, the strategy for choosing model complexity is to build a sequence of similar models using the training component only. T h e models are usually ordered with increasing complexity.

Model Performance Assessment mm

mm

mm

mm

mgm mmm mmm

Rate model *A*Wt performance using validation data. F 4 r

42

5

Model Complexity

Validation Assessment

A s Mosteller and Tukey caution in the earlier quotation, using performance on the training data set usually leads to selecting a model that is too complex. (The classic example is selecting linear regression models based on R square.) T o avoid this problem, S A S Enterprise Miner selects the model from the sequence based on validation data performance.


3.1 Introduction

Model Selection Validation Data

TBI

• Select the simplest frvv \ model with the highest validation assessment.

Complexity

Validation Assessment

In keeping with Ockham's razor, the best model is the simplest model with the highest validation performance.

3-21


3-22

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Modeling Essentials - Optimize Review

Optimize complexity.

Tune models with ! validation data.

46

T o avoid overfitting (and underfitting) a predictive model, you should calibrate its performance with an independent validation sample. Y o u can perform the calibration by partitioning the data into training and validation samples. In Chapter 2, you assigned roles and measurement levels to the P V A 9 7 N K data set and created an analysis diagram and a simple process flow. In the following demonstration, you partition the raw data into training and validation components. This sets the stage for using the Decision Tree tool to generate a predictive model.


3.1 Introduction

3-23

Creating Training a n d Validation Data

A critical step in prediction is choosing a m o n g competing models. Given a set of training data, you can easily generate models that very accurately predict a target value from a set of input values. Unfortunately, these predictions might be accurate only for the training data itself. Attempts to generalize the predictions from the training data to an independent, but similarly distributed, sample can result in substantial accuracy reductions. To avoid this pitfall, S A S Enterprise Miner is designed to use a validation data set as a means of independently gauging model performance. Typically, the validation data set is created by partitioning the raw analysis data. Cases selected for training are used to build the model, and cases selected for validation are used to tune and compare models. f

S A S Enterprise Miner also enables the creation of a third partition, n a m e d the test data set. T h e test data set is meant for obtaining unbiased estimates of model performance from a single selected model.

This demonstration assumes that you completed the demonstrations in Chapter 2. Your S A S Enterprise Miner workspace should appear as s h o w n below: s i ts File

Edit View

53

Actions

Options

Window

HHHHHBIffi

•3 My Project o O Data Sources 9 [hi Diagrams 3£> Predictive Analysis O \9j Model Packages G-iflJ Users

1

Help

Property ID Name Status Notes History

173 Sample

• B Bfi Explore

k :

Modify . Model

Assess : Utility

Credit Scoring

so Predictive Analysis

Value EMWS Predictive Analysis Open

r

y • >VA97NK

ID Diagram Identifier. This identifier Diagram Predictive Analyst...

J

p^lrf? placement6

100::. jX Connected to SASMain - Logical Workspace Server


3-24

Chapter 3 Introduction to Predictive Modeling; Decision Trees

Follow these steps to partition the raw data into training and validation sets. 1. Select the S a m p l e tool tab. The Data Partition tool is the second from the left.

• Sample | Explore Modify |_^°del j Assess ) Utility" Credit Scoring 2. Drag a Data Partition tool into the diagram workspace, next to the Replacement node.

•ma

j; Predictive Analysis

>VA97NK

!f^yeplacemenl

• ata Partition -Q

2l\

W 9

\—

a

©100%


3.1 Introduction

3-25

3. Connect the Replacement node to the Data Partition node. • Predictive Analysis

ate merit

>VA97NK

-6

-0

jnjnr j^sasbata Partition

4. Select the Data Partition node and examine its Properties panel. Value

Property

General Node ID Imported Data Exported Data Notes

Part ...

Train Variables Data OutputType Partitioning Method Default 12345 R a n d o m Seed Ebata Set Allocations 40.0 r-Training 30.0 ^•Validation 30.0 'Test Interval Targets Class Targets

E u

Yes Yes

Use the Properties panel to select the fraction of data devoted to the Training, Validation, and Test partitions. B y default, less than half the available data is used for generating the predictive models.


3-26

Chapter 3 Introduction to Predictive Modeling: Decision Trees

f

There is a trade-off in various partitioning strategies. M o r e data devoted to training results in more stable predictive models, but less stable model assessments (and vice versa). Also, the Test partition is used only for calculating fit statistics after the modeling and model selection is complete. M a n y analysts regard this as wasteful of data. A typical partitioning compromise foregoes a Test partition and places an equal number of cases in Training and Validation.

5. Type 5 0 as the Training value in the Data Partition node. 6. Type 5 0 as the Validation value in the Data Partition node. 7. Type 0 as the Test value in the Data Partition node. §bataSet Allocations j-Training 50.0 50.0 r-Validation ••Test 0.0 f

With smaller raw data sets, model stability can become an important issue. In this case, increasing the number of cases devoted to the Training partition is a reasonable course of action.

8. Right-click the Data Partition node and select Run. 9. Select Yes in the Confirmation window. S A S Enterprise Miner runs the Data Partition node.

Do you want to run this path? Diagram: Predictive Analysis Path:

Data Partition Yes

f

No

Only the Data Partition node should run, because it is the only "new" element in the process flow. In general, S A S Enterprise Miner only runs elements of diagrams that are n e w or that are affected by changes earlier in the process flow.

W h e n the Data Partition process is complete, a R u n Status w i n d o w opens.

Run completed Diagram: Predictive Analysis Path: OK

Data Partition Results..


3.1 Introduction

3-27

10. Select Results... in the R u n Status w i n d o w to view the results. T h e Results - N o d e : Data Partition w i n d o w provides a basic metadata s u m m a r y of the raw data that feeds the node and, more importantly, a frequency table that shows the distribution of the target variable, T a r g e t B , in the raw, training, and validation data sets. r

File

1

Edit

1 (73

View

1

Window

•

output

Summary Statistics for Class Targets Data=DATA

Variable

Humeric Value

Formatted Value

TargetB TargetB

Frequency Count

Percent

4843

50 50

Frequency Count

Percent

2422

50.0103

2421

49.9897

Frequency Count

Percent

2421

49.9897

2422

50.0103

4843

Data=TRAIW

Variable

numeric Value

Formatted Value

TargetB TargetB

Data-VALIDATE

Variable TargetB TargetB

f

Humeric Value

Formatted Value

T h e Partition node attempts to maintain the proportion of zeros and ones in the training and validation partitions. T h e proportions are not exactly the same due to an odd number of zeros' and ones' cases in the raw data.

T h e analysis data w a s selected and partitioned. You are ready to build predictive models.


3-28

Chapter 3 Introduction to Predictive Modeling: Decision Trees

3.2 Cultivating Decision Trees

Predictive Modeling Tools Primary

IS

49

M a n y predictive modeling tools are found in S A S Enterprise Miner. T h e tools can be loosely grouped into three categories: primary, specialty, and multiple models. T h e primary tools interface with the three most c o m m o n l y used predictive modeling methods: decision tree, regression, and neural network.

Predictive Modeling Tools

'

5

g^Aiirftart j

Specialty

[u Ditv

1 ÂŁ, Gftctent

6E

so T h e specialty tools are either specializations of the primary tools or tools designed to solve specific problem types.


3.2 Cultivating Decision Trees

3-29

Predictive Modeling Tools

!! v. , i —' —

::u

L/: r :

Multiple

Multiple model tools are used to combine or produce more than one predictive model. T h e Ensemble tool is briefly described in a later chapter. T h e T w o Stage tool is discussed in the Advanced Predictive Modeling Using S A S ® Enterprise Miner™ course.

Model Essentials - Decision Trees

Predict n e w cases.

; Prediction rules

Select useful inputs.

|

Optimize complexity.

Split search Pruning

52

Decision trees provide an excellent introduction to predictive modeling. Decision trees, similar to all modeling methods described in this course, address each of the modeling essentials described in the introduction. Cases are scored using prediction rules. A split-search algorithm facilitates input selection. M o d e l complexity is addressed by pruning. T h e following simple prediction problem illustrates each of these model essentials.


3-30

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Simple Prediction Illustration

1.0

I

0.9

Predict dot color for each x, and x2. X

2

0.5 0.4 0.3 0.2 n

U-1

1

0.0

...

54

Consider a data set with two inputs and a binary target. The inputs, X\ and x 2 , locate the case in the unit square. T h e target outcome is represented by a color; yellow is primary and blue is secondary. T h e analysis goal is to predict the outcome based on the location in the unit square.

Decision Tree Prediction Rules

56

Decision trees use rules involving the values of the input variables to predict cases. The rules are arranged hierarchically in a tree-like structure with nodes connected by lines. The nodes represent decision rules, and the lines order the rules. The first rule, at the base (top) of the tree, is called the root node. Subsequent rules are called interior nodes. Nodes with only one connection are called leaf nodes.


3.2 Cultivating Decision Trees

3-31

Decision Tree Prediction Rules

57

T o score a n e w case, examine the input values and apply the rules defined by the decision tree.

Decision Tree Prediction Rules Predict:, \

Š w

Estimate = 0.70

0.9 >0.63

06 X

2

0.5 0.1

58

T h e input values of a n e w case eventually lead to a single leaf in the tree. A tree leaf provides a decision (for example, classify as yellow) and an estimate (for example, the primary-target proportion).


3-32

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Model Essentials - Decision Trees

Select useful inputs.

Split search

62

T o select useful inputs, trees use a split-search algorithm. Decision trees confront the curse of dimensionality by ignoring irrelevant inputs. f

Curiously, trees have no built-in method for ignoring redundant inputs. Because trees can be trained quickly and have simple structure, this is usually not an issue for model creation. H o w e v e r , it can be an issue for model deployment, in that trees might somewhat arbitrarily select from a set of correlated inputs. T o avoid this problem, you must use an algorithm that is external to the tree to m a n a g e input redundancy.


3.2 Cultivating Decision Trees

3-33

Decision Tree Split Search left

right

0.0 0.1 0.2 0.1 0.4 O.S OS 07 0.3 0.9 1.0

63

Understanding the default algorithm for building trees enables you to better use the Tree tool and interpret your results. T h e description presented here assumes a binary target, but the algorithm for interval targets is similar. (The algorithm for categorical targets with more than two outcomes is m o r e complicated and is not discussed.) T h e first part of the algorithm is called the split search. The split search starts by selecting an input for partitioning the available training data. If the measurement scale of the selected input is interval, each unique value serves as a potential split point for the data. If the input is categorical, the average value of the target is taken within each categorical input level. T h e averages serve the s a m e role as the unique interval input values in the discussion that follows. For a selected input and fixed split point, two groups are generated. Cases with input values less than the split point are said to branch left. Cases with input values greater than the split point are said to branch right. The groups, combined with the target outcomes, form a 2x2 contingency table with columns specifying branch direction (left or right) and rows specifying target value (0 or 1). A Pearson chi-squared statistic is used to quantify the independence of counts in the table's columns. Large values for the chisquared statistic suggest that the proportion of zeros and ones in the left branch is different than the proportion in the right branch. A large difference in outcome proportions indicates a good split. Because the Pearson chi-squared statistic can be applied to the case of multi w a y splits and multi-outcome targets, the statistic is converted to a probability value or/?-value. T h e v a l u e indicates the likelihood of obtaining the observed value of the statistic assuming identical target proportions in each branch direction. For large data sets, these /?-values can be very close to zero. For this reason, the quality of a split is reported by logworth = -/og(chi-squared /7-value). At least one logworth must exceed a threshold for a split to occur with that input. B y default, this threshold corresponds to a chi-squared p-value of 0.20 or a logworth of approximately 0.7.


3-34

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Decision Tree Split Search

The best split for an input is the split that yields the highest logworth.


3.2 Cultivating Decision Trees

3-35

Additional Split-Search Details (Self Study) There are several peripheral factors that m a k e the split search s o m e w h a t m o r e complicated than whatis described above. First, the tree algorithm settings disallow certain partitions of the data. Settings, such as the m i n i m u m number of observations required for a split search and the m i n i m u m n u m b e r of observations in a leaf, force a m i n i m u m n u m b e r of cases in a split partition. This m i n i m u m n u m b e r of cases reduces the number of potential partitions for each input in the split search. Second, w h e n y o u calculate the independence of columns in a contingency table, it is possible to obtain significant (large) values of the chi-squared statistic even w h e n there are n o differences in the outcome proportions between split branches. A s the number of possible split points increases, the likelihood of obtaining significant values also increases. In this way, an input with a multitude of unique input values has a greater chance of accidentally having a large logworth than an input with only a few distinct input values. Statisticians face a similar problem w h e n they combine the results from multiple statisticaltests. A s the number of tests increases, the chance of a false positive result likewise increases. T o maintain overall confidence in the statistical findings, statisticians inflate the p-values of each test by a factor equal to the number of tests being conducted. If an inflated /rvalue shows a significant result, then the significance of the overall results is assured. This type of p-value adjustment is k n o w n as a Bonferroni correction. Because each split point corresponds to a statistical test, Bonferroni corrections are automatically applied to the logworth calculations for an input. These corrections, called Kass adjustments (named after the inventor of the default tree algorithm used in S A S Enterprise Miner), penalize inputs with m a n y split points by reducing the logworth of a split by an amount equal to the log of the n u m b e r of distinct input values. This is equivalent to the Bonferroni correction because subtracting this constant from logworth is equivalent to multiplying the corresponding chi-squared p-value by the n u m b e r of split points. T h e adjustment enables a fairer comparison of inputs with m a n y and few levels later in the split-search algorithm. Third, for inputs with missing values, two sets of Kass-adjusted logworths are actually generated. For the first set, cases with missing input values are included in the left branch of the contingency table and logworths are calculated. For the second set of logworths, missing value cases are m o v e d to the right branch. T h e best split is then selected from the set of possible splits with the missing values in the left and right branches, respectively.


3-36

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Decision Tree Split Search

Repeat for input x . 2

0.0 0.1 0.2 0.3 0.4 0.5 0.6 07 0.8 0.9 1.0

65

T h e partitioning process is repeated for every input in the training data. Inputs whose adjusted logworth fails to exceed the threshold are excluded from consideration.

Decision Tree Split Search

1.0 0.9 0.8 0.7 0.6 X bottom

top

54%

35%

46%

65%

2

0.63

0.5 0.4 0.3

max

j

logworth(Xj)

4.92

0.2 0.1 00 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.6 0.9 1.0

Again, the optimal split for the input is the one that maximizes the logworth function.


3.2 Cultivating Decision Trees

Decision Tree Split Search left

right

53%

42%

47%

58%

bottom

top

54%

35%

46%

65%

max logworth(x,) 0.95 Compare partition logworth ratings. max logworth(jr2) 4.92

69

After you determine the best split for every input, the tree algorithm compares each best split's corresponding logworth. T h e split with the highest adjusted logworth is d e e m e d best.

Decision Tree Split Search

The training data is partitioned using the best split rule.

3-37


3-38

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Decision Tree Split Search

1.0

<0.63 ^

7

>0.63

>

:: 0.4

1

0.0

1

Repeat the process in each subset.

72

Decision Tree Split Search left

right

61%

55%

39%

; logworth{x,) 45% 5.72

1

m a x Q

B

1

1

D.3 [ M V i j r 0.2 Km/KjaP

"•1 r 1 1 1 ' 1

BiS&c

;

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.

74

1.0

...


3.2 Cultivating Decision Trees

3-39

Decision Tree Split Search

0.0 0.1 0 2 0.3 0.4 0.S

76

T h e logworth of the split is negative. This might seem surprising, but it results from several adjustments m a d e to the logworth calculation. (The Kass adjustment w a s described previously. Another, called the depth adjustment, is outlined in the following Self-Study section.)

Decision Tree Split Search left

right

0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 *1 77


3-40

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Decision Tree Split Search left

right

61%

55%

39%

45%

1.0 max logwortri(x,)

572 .J x , 0.5

0.2

-2.01

0.1 0.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 O.S 0.'

*1 70

T h e split search continues within each leaf. Logworths are compared as before.

Additional Split-Search Details (Self-Study) Because the significance of secondary and subsequent splits depends on the significance of the previous splits, the algorithm again faces a multiple comparison problem. To compensate for this problem, the algorithm increases the threshold by an amount related to the number of splits above the current split. For binary splits, the threshold is increased by logw{2)- d~ 03-d, where d is the depth of the split on the decision tree. B y increasing the threshold for each depth (or equivalently decreasing the logworths), the tree algorithm makes it increasingly easy for an input's splits to be excluded from consideration.


3.2 Cultivating Decision Trees

3-41

Decision Tree Split Search

79

The data is partitioned according to the best split, which creates a second partition rule. T h e process repeats in each leaf until there are no more splits whose adjusted logworth exceeds the depth-adjusted thresholds. This process completes the split-search portion of the tree algorithm.

Decision Tree Split Search

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

Repeat to form a maximal tree. 81

The resulting partition of the input space is k n o w n as the maximal tree. Development of the maximal tree is based exclusively on statistical measures of split worth on the training data. It is likely that the maximal tree will fail to generalize well on an independent set of validation data.


3-42

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Model Essentials - Decision Trees

Select useful inputs.

Split search

82

T h e split-search algorithm can easily cut through the curse of dimensionality and select useful modeling inputs. Because it is simply an algorithm, m a n y variations are possible. ( S o m e variations are discussed at the end of this chapter.) T h e following demonstration illustrates the use of the split-search algorithm to construct a decision tree model.


3.2 Cultivating Decision Trees

3-43

Constructing a Decision Tree Predictive M o d e l

This four-part demonstration illustrates the interactive construction of a decision tree model and builds on the process flows of previous demonstrations.

Preparing for Interactive Tree Construction Follow these steps to prepare the Decision Tree tool for interactive tree construction. 1. Select the M o d e l tab. T h e Decision Tree tool is second from the left.

fe| fr| 1§ %

•Q

Sample | Explore

Modify

Model

m

Assess

jrj i Hi n Utility Credit Scoring

2. Drag a Decision Tree tool into the diagram workspace. Place the node next to the Data Partition node. •cm Predictive Analysis

IE!


3-44

Chapter 3 Introduction to Predictive Modeling: Decision Trees

3. Connect the Data Partition node to the Decision Tree node. m o Predictive Analysis

£1

Decision True

TTi

100%

T h e Decision Tree tool can build predictive models autonomously or interactively. T o build a model autonomously, simply run the Decision Tree node. Building models interactively, however, is more informative and is helpful w h e n you first learn about the node, and even w h e n you are more expert at predictive modeling. 4. Select Interactive

from the Decision Tree node's Properties panel. Value

Property Node ID Imported Data Exported Data [Notes

Tree

Train [Variables nteractive Use Frozen Tree No Use Multiple Targets N O • Splitting Rule ProbF ! I-Interval Criterion ProbChisq Nominal Criterion Entropy [•Ordinal Criterion Significance Level 0.2 Use in search Missing Values f-Use Input Once No 2 Maximum Branch 6 Maximum Depth l * Minimum Categorical 5 llalNode 5 hLeaf Size

•«

• ••

I mmm


3.2 Cultivating Decision Trees T h e S A S Enterprise Miner Interactive Decision Tree application opens. Interactive Decision Tree - EMWS.TREE_BROWSETREE[EMWS.PART2_TRAIN] File Edit View g

&

Tree Train Window 1

6\6

a _|n|x|

Tree View

N in Node: 4843 Target: TargetB • (%) : 50% 1 (%) : 50%

£

1

, .

1

3-45


3-46

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Creating a Splitting Rule Decision tree models involve recursive partitioning of the training data in an attempt to isolate concentrations of cases with identical target values. T h e blue box in the Tree w i n d o w represents the unpartitioned training data. T h e statistics show the distribution of TargetB. Use the following steps to create an initial splitting rule: 1. Right-click the purple box and select Split Node... from the m e n u . T h e Split N o d e 1 dialog b o x opens. Split Node 1 Target Variable: TargetB Variable |GiftCnt36 :GiftAvgCard36 iGiftAvgLast GiftAvg36 QftAvgAll GiftCntCard36 GiftCntCardAH StatusCatStarAll StatusCat96NK GiftCntAll PromCntCard36 GiftTimeFirst GiftTirneLast PromCntAll PromCntCardAll Dern Me dHome Value PromCntCardl2 PromCnt36 PromCntl2 DernPctVeterans DemAge REP_DemMedIncome DemHomeOwner

1 Variable Description Gift Count 36 Months Gift Amount Average ... Gift Amount Last Gift Amount Average... Gift Amount Average .., Gift Count Card 36 M... Gift Count Card All Mo... Status Category Star... Status Category 96NK Gift Count All Months Promotion Count Card... Time Since First Gift Time Since Last GiFt Promotion Count AllM.,, Promotion Count Card.,, Median Home Value R... Promotion Count Card... Promotion Count 36 M... Promotion Count 12 M... Percent Veterans Reg... Age Replacement: DemMe... Home Owner

-Log(p)

Branches 18.44 17.24 15.02 14.63 13,95 12,74 11.91 U1S5 9.52 9.28 8.73 6.69 5.70 5.49 4.59 4.46 3.S7 3.35 2.61 1.99 1.51 1.47 0.40

Apply 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

Edit Rule.,. OK Cancel

T h e Split N o d e dialog box shows the relative value, -l.og(p) or logworth, of partitioning the training data using the indicated input. A s the logworth increases, the partition better isolates cases with identical target values. G i f t Count 36 Months has the highest logworth, followed by G i f t Amount Average C a r d 36 Months and G i f t Amount L a s t . You can choose to split the data on the selected input or edit the rule for more information.


3-2 Cultivating Decision Trees

3-47

2. Select Edit Rule.... T h e GittCnt36 - Interval Splitting Rule dialog box opens. GiftCnt36 - Interval Splitting Rule Target Variable: TargetB Assign missing values to {* A specific branch

[i

C A separate missing values branch C All branches Branches Branch 1 2

Split Point 2.5 2.5

<

>=

New split point:

Remove Branch

Add Branch

OK

Cancel

Apply

Undo

Reset

This dialog box shows h o w the training data is partitioned using the input G i f t Count 36 Months. T w o branches are created. T h e first branch contains cases with a 36-rnonth gift count less than 2.5, and the second branch contains cases with a 36-month gift count greater than or equal to 2.5. In other words, with cases that have a 36-month gift count of zero, one or two branch left and three or more branch right. In addition, any cases with a missing or u n k n o w n 36-month gift count are placed in the first branch. f

T h e interactive tree assigns any cases with missing values to the branch that maximizes the purity or logworth of the split by default. For G i f tCnt36, this rule assigns cases with missing values to branch 2.


3-48

Chapter 3 Introduction to Predictive Modeling: Decision Trees

3. Select Apply. T h e GiftCnt36 - Interval Splitting Rule dialog box remains open, and the Tree View w i n d o w is updated. Interactive Decision Tree - EMWSZ.TREE_BROW5ETREE[EMWS2.PART_TRAIN] File Edit View

Tree

Train

Window

Target: Count: 0: It

TargetB 4843 S0% 50%

1

Gift Countit 36 Months 31

Target Count 0 1

1 <2.5 1T a r g e t B 22B3

57% 43%

1

1 >"2.5 Of Missing

1

Target: TargetB Count: 25E0 0: 44% l: 56%

T h e training data is partitioned into two subsets. The first subset, corresponding to cases with a 36month gift count less than 2.5, has a higher than average concentration of T a r g e t B = 0 cases. T h e second subset, corresponding to cases with a 36-month gift count greater than 2.5, has a higher than average concentration of TargetB=l cases. The second branch has slightly more cases than the first, which is indicated by the N i n n o d e field. The partition of the data also represents your first (non-trivial) predictive model. This model assigns to all cases in the left branch a predicted T a r g e t s value equal to 0.43 and to all cases in the right branch a predicted T a r g e t B value equal to 0.56. f

In general, decision tree predictive models assign all cases in a leaf the same predicted target value. For binary targets, this equals the percentage of cases in the target variable's primary outcome (usually the t a r g e t = l outcome).


3.2 Cultivating Decision Trees

3-49

Adding More Splits Use the following steps to interactively add more splits to the decision tree: I. Select the lower left partition. T h e Split N o d e dialog box is updated to s h o w the possible partition variables and their respective logworths. Split Node 3 Target Variable: TargetB Variable DemMedHorneValue DernPctVeterans StatusCatStarAll REP_DemMedIncorne DernAge GiftAvgAll GiftCnt36 PromCntCardl2 5tatusCat96NK GiFtTimeFirst DemHorneOwner GiftCntCardAll GiftCntAII PromCnt36 PromCntCardAll GiFtAvg36 DemGender PromCntAll GiftCntCard36 GiftTimeLast GiftAvgCard36 GiftAvgLast PromCntl2 PromCntCard36 DemCluster

Variable Description Median Home Value Region 3 ercent Veterans Region Status Category Star All M... Replacement: Median Inco... Age Gift Amount Average All Mo... Gift Count 36 Months Promotion Count Card 12 M... Status Category 96NK Time Since First Gift Home Owner Gift Count Card All Months Gift Count All Months Promotion Count 36 Months Promotion Count Card All M... GiFt Amount Average 36 M,.. Gender Promotion Count All Months GiFt Count Card 36 Months Time Since Last GiFt Gift Amount Average Card ... GiFt Amount Last Promotion Count 12 Months Promotion Count Card 36 M... Demographic Cluster

Branches

-iog(p) 3.45360 2,38360 1.81739 1.56106 1.24470 1,03539 0.92244 0.88959 0.76545 0.68583 0.66592 0.58917 0.41340 0.33333 0.26385 0.19233 0.15916 0.13244 0.03293 0.01715 0.00000 0.00000 0,00000 0.00000 0.00000

2

Edit Rule... OK

Cancel

T h e input with the highest logworth is Median Home V a l u e Region.

1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

Apply

Refresh


Chapter 3 Introduction to Predictive Modeling: Decision Trees

3-50

2. Select Edit Rule.... T h e D e m M e d H o m e V a l u e - Interval Splitting Rule dialog box opens. D e m M e d H o m e V a l u e - Interval Splitting Rule Target Variable: TargetB Assign missing values to (• A specific branch

[2

^

C A separate missing values branch C All branches Br a r i c h e s

Branch 1

Split Point

<

67350 67350

>=

2

New split point:

Remove Branch

Add Branch

OK

Cancel

Apply

Undo

Reset

Branch 1 contains all cases with a median h o m e value less than 67350. Branch 2 contains all cases that are equal to or greater than 67350.


3.2 Cultivating Decision Trees 3. Select Apply. T h e Tree View w i n d o w is updated to show the additional split. Interactive Decision Tree -EMWS2.TREE_BROWSETREE[EMWS2.PART_TRAIN] File Edit View

pa

Tree Train

i I lib fUT

Window n 0

ml

&

Tree View Target: TargetB Count: 4843 0: 50% 1: 50%

T

Gift Count 36 Months

r

>=2.5 or Missing

<2.5

I

J.

Target: TargetB Count: 2233 L 0: 57% 1: 431

Target: Count:

TargetB 2560

• :

44%

1:

56%

I

1

Median Home Value Region

<S7350 Target: TargetB Count: 84£ 0: £3% 1: 37%

>=67350 or Missing Target: TargetB Count: 1437 •: 53% 47% 1:

Both left and right leaves contain a lovver-than-average proportion of cases with T a r g e t B = l .

3-51


3-52

Chapter 3 Introduction to Predictive Modeling: Decision Trees

4. Repeat the process for the branch that corresponds to cases with G i f t Count 36 Month in excess of 2.5. Interactive Decision Tree -EMWS2.TREE_BROWSETREE[EMW52.PART_TRAIN] File Edit

View

Tree Train

MiJL

i

ifnr m : 6% •a

|

Window | g | xa

^1

Tree View Tar ge t Count 0 1

TargetB 4843 50% 50%

T

Gift Count 36 Months

>=2.5 or Missing

J Target: TargetB Count: 2203 0: 57% 1: 43%

Target: Count: 0: 1:

Y

T~"

Median Home Value Region

<67350

s=67350 or Missing

TargetB 2S£0 44% 56%

Gift Amount Last

<7.5

=7.5 or Missing

I Target: TargetB Count: 34£ 0: 63% 1: 37%

Target: TargetB Count: 1437 S3% •: 47% 1:

Target: TargetB Count: £57 0: 36% 1: £4%

Target: Targ*tB Count: 1903 0: 47% 1: 53%

T h e right-branch cases are split using the G i f t Amount L a s t input. This time, both branches have a higher-than-average concentration of TargetB=l cases. f

It is important to remember that the logworth of each split, reported above, and the predicted T a r g e t B values are generated using the training data only. T h e main focus is on selecting useful input variables for the first predictive model. A diminishing marginal usefulness of input variables is expected, given the split-search discussion at the beginning of this chapter. This prompts the question: W h e r e do you stop splitting? T h e validation data provides the necessary information to answer this question.


3.2 Cultivating Decision Trees

3-53

Changing a Splitting Rule T h e bottom left split on Median Home V a l u e at 67350 was found to be optimal in a statistical sense, but it might seem strange to someone simply looking at the tree. ( W h y not split at the more socially acceptable value of 70000?) Y o u can use the Splitting Rule w i n d o w to edit where a split occurs. Use the following steps to define your o w n splitting rule for an input. 1.

Select the node above the label Median Home V a l u e Region. Interactive Decision Tree -EMWS2.TREE_BRDWSETREE[EMWS2.PART_TRAIN] File

Edit

View

Tree

Train

a 11 a l l P i l f e r

m

Window re 1 • I • 1 rid I n> | xa |

A

_j|

|

|

Tree View Target: TargetB Count: 4S43 •: 50* 1: 50V

Y

Girl Count 36 Months

r

>=2.5 or Missing

I

Target: TargetB Count: 2$£D O: 44% 1: 5£ %

Target: TargetB Count: "S3 . O: 57% 1: 43%

1

Gift Amount Last

Median Home Value Region

1 <67350 Target: TargetB Count: S4£ D: E3% 1: 37%

1 >=57350 or Missing

1

Target: TargetB Count: 1437 0: 53% 1: 471

1 <7.5

1

Tar get T a r g e t B Count 657 0: 36% 1 64%

1 >-75 or Missing

1

Target: TargetB Count: 1S03 •: 47% 1: 53%


3-54

Chapter 3 Introduction to Predictive Modeling: Decision Trees

2. Right-click this node and select Split Node... from the m e n u . T h e Split N o d e w i n d o w opens. Split Node 3 Target Variable: TargetB Variable 1 DernMedHorneValue DernPctVeterans StatusCatStarAil | REP_DemMedIncome DemAge GiftAvgAll GiftCnt36 PromCntCardl2 StatusCat96NK GiftTimeFirst iDemHorneOwner GiftCntCardAll GiftCntAll PromCnt36 PromCntCardAll GiftAvg36 DemGender PromCntAll |GiftCntCard36 iGiftTirneLast GiftAvgCard36 ;GiftAvgLast ;PromCntl2 jpromCntCard36 : DernCluster

Variable Description Median Home Value Region Percent Veterans Region Status Category Star All M... Replacement: Median Inco... Age Gift Amount Average All Mo... Gift Count 36 Months Promotion Count Card 12 M.., Status Category 96NK Time Since First Gift Home Owner Gift Count Card All Months Gift Count All Months Promotion Count 36 Months Promotion Count Card All M... GiFt Amount Average 36 M... Gender Promotion Count All Months Gift Count Card 36 Months Time Since Last Gift Gift Amount Average Card ... Gift Amount Last Promotion Count 12 Months Promotion Count Card 36 M... Demographic Cluster

Branches

-Log(p) 3.36085 2.38360 1.81739 1.56106 1.24470 1.08539 0.92244 0.88959 0.76545 0,68583 0.66592 0,58917 0,41340 0.38333 0.26385 0.19283 0.15916 0,13244 0.03293 0,01715 0,00000 0,00000 0.00000 ' 0,00000 0.00000

Edit Rule... OK

Cancel

1 Zi 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

Apply

Refresh


3.2 Cultivating Decision Trees

3-55

3. Select Edit Rule.... T h e D e m M e d H o m e V a l u e - Interval Splitting Rule dialog box opens. DernMedHorneValue - Interval Splitting Rule

LU

Target Variable: TargetB Assign missing values to (• A specific branch

[2 ^ ]

C A separate missing values branch C All branches Branches Split Point

Branch 67350 67350

>=

New split point:

Remove Branch

Add Branch

OK

Cancel

Apply

Undo

Reset

|


Chapter 3 Introduction to Predictive Modeling: Decision Trees

3-56

4. Type 7 0 0 0 0 in the N e w s p l i t p o i n t field. D e m M e d H D m e V a l u e - Interval Splitting Rule Target Variable: TargetB Assign missing values to — (* A specific branch

|2

C A separate missing values branch C All branches Branches Branch

Split Point

1 2

New split point:

67350 67350

<

>=

70000

Add Branch

OK

Remove Branch

Cancel

Apply

Undo

|

Reset


3.2 Cultivating Decision Trees 5. Select A d d Branch. T h e Interval Splitting Rule dialog box shows three branches. DernMedHorneValue - Interval Splitting Rule Target Variable: TargetB Assign missing values to (* A specific branch

fi

C A separate missing values branch C All branches Branches Branch

Split Point

< < >=

l 2 3

New split point:

67350 70000 70000

| Add Branch I

OK

Cancel

Remove Branch

Apply

Undo

Reset

3-57


3-58

Chapter 3 Introduction to Predictive Modeling: Decision Trees

6. Select Branch 1 by highlighting the row. DernMedHorneValue - Interval Splitting Rule Target Variable: TargetB Assign missing values to (* A specific branch

12

>J

C A separate missing values branch C All branches Branches Branch

Split Point

70000 70000

>=

New split point:

Add Branch

OK

Remove Branch

Cancel

Apply

Undo

|

Reset


3.2 Cultivating Decision Trees

3-59

7. Select R e m o v e Branch. DernMedHorneValue - Interval Splitting Rule Target Variable: TargetB Assign missing values to (* A specific branch

[1

•|

C A separate missing values branch C Al! branches Branches Split Point

Branch 1 2

70000 70000

<

>=

Add Branch

New split point: OK

Cancel

Apply

Remove Branch Undo

Reset

T h e split point for the partition is m o v e d to 70000. jf

In general, to change a Decision Tree split point, add the n e w split pointfirstand then remove the unwanted split point.


3-60

Chapter 3 Introduction to Predictive Modeling: Decision Trees Select O K to close the Interval Splitting Rule dialog box and to update the Tree View w i n d o w .

in N o d e : 4843 T a r g e t : Targets • (V) I 50% 1 (%): 50%

T

Gift Count 36 Months

I

1 I

<2.5 or Missing

.

1

>=2.5

,

22B3 in M o d e : T a r g e t : TargetB 57% 0 (%) : 43% 1 (%J:

25£0 in N o d e : Target: TargetB 44% 0 |%): 5c* 1 1%) :

T

Median H o m e Value Region

<700D0 or Missing

Gift Amount Last

=70000

J

>=7.5 or Missing

I H in N o d e : 922 T a r g e t : TargetB 0 (%) : £2% i (%> : 38%

N in N o d e : 1361 T a r g e t : TargetB 54% • (%) : 46% 1 (%) :

in N o d e : £57 T a r g e t : TargetB 0 (%) : 36% 1 (%): 64%

in N o d e Target 0 1%) 1 (*>

1903 TargetB 47% 53*

Overriding the statistically optimal split point slightly changed the concentration of cases in both nodes.


3.2 Cultivating Decision Trees

3-61

Creating the Maximal Tree A s demonstrated thus far, the process of building the decision tree model is one of carefully considering and selecting each split sequentially. There are other, more automated w a y s to grow a tree. Follow these steps to automatically grow a tree with the S A S Enterprise Miner Interactive Tree tool: 1. Select the root node of the tree. 2. Right-click and select Train N o d e from the m e n u . T h e tree is grown until stopping rules prohibit further growth. f

Stopping rules were discussed previously and in the Self-Study section at the end of this chapter.

It is difficult to see the entire tree without scrolling. However, you can z o o m into the Tree View w indow to examine the basic shape of the tree. 3. Select Options •=> Z o o m <=> 5 0 % from the main m e n u . T h e tree view is scaled to provide the general shape of the maximal tree.

s

' 1

r

i

t

T

1

1

i.

1

1

y

Ai ,±,,

|

i •,,'.„,,

T"

i

' 1 f

1

L

1

1 I ^^^^^^^^^

T

1

j

i

Y i

1

i i i ~

_!.„

"T

Y

1

...!....

_.u

~L

i

_i_

1

i

.J^I

i

i

L

' i"~

1

1

u

X

i

T

i

J

-L

iZT-i . . J ..i

Plots and tables contained in the Interactive Tree tool provide a preliminary assessment of the maximal tree.


3-62

Chapter 3 Introduction to Predictive Modeling: Decision Trees

4. Select V i e w •=> Assessment Plot. Interactive Decision Tree - EMW5.TREE jmOWSETREE[EMW5.PART2_TRAIN] File

Edit

View

i

Tree

Train

1

J

Window

1

|

•2)

OQ

Assessment Plot Proportion Misclassifiod 0.50-

N u m b e r of L e a v e s

While the majority of the improvement in fit occurs over the first few splits, it appears that the maximal, fifteen-leaf tree generates a lower misclassification rate than any of its simpler predecessors. T h e plot seems to indicate that the maximal tree is preferred for assigning predictions to cases. However, this plot is misleading. Recall that the basis of the plot is the validation data. Using the s a m e sample of data both to evaluate input variable usefulness and to assess model performance c o m m o n l y leads to overfit models. This is the mistake that the Mosteller and Tukey quotation cautioned you about. You created a predictive model that assigns one of 15 predicted target values to each case. Your next task is to investigate h o w well this model generalizes to another similar, but independent, sample: the validation data. Select File O Exit to close the Interactive Tree application. A dialog box opens. Y o u are asked whether you want to save the maximal tree to the default location. Unsaved Changes ?^

Do you want to save the changes to the default location, EMWS2.TREE_BROWSETREE? I Yes !

No

Cancel

Select Yes. T h e Interactive Tree results are n o w available from the Decision Tree node in the S A S Enterprise Miner flow.


3.3 Optimizing the Complexity of Decision Trees

3.3

3-63

Optimizing the Complexity of D e c i s i o n T r e e s

Model Essentials - Decision Trees

Optimize complexity.

Pruning

The maximal tree represents the most complicated model you are willing to construct from a set of training data. T o avoid potential overfitting. m a n y predictive modeling procedures offer s o m e mechanism for adjusting model complexity. For decision trees, this process is k n o w n as pruning.

Predictive Model Sequence mmm mm

mmm

J IHM M M WmW ^mWm

—

Create a sequence of models with increasing complexity.

)mplexity

The general principle underlying complexity optimization is creating a sequence of related models of increasing complexity from training data and using validation data to select the optimal model.


3-64

Chapter 3 Introduction to Predictive Modeling: Decision Trees

The Maximal Tree Training Data

Validation Data

Maximal

Create a sequence of models with increasing complexity.

Tree

A maximal tree is the most complex model in the sequence.

r 6 Model 87

Complexity

The Maximal Tree 1

E

- irp NO

i bfcH3

OIK

C

A maximal tree is the most complex model in the sequence. Mode/ 89

Complexity

T h e maximal tree that you generated earlier represents the most complex model in the sequence.


3.3 Optimizing the Complexity of Decision Trees

3-65

Pruning One Split Training Data

mm

Validation Data

mm mmm

*pufl • r -

• —

m

w

m

r

m

w

m

mm

The next model in the sequence is formed by pruning one split from the maximal tree. Model Complexity

The predictive model sequence consists of subtrees obtained from the maximal tree by removing splits. T h efirstbatch of subtrees is obtained by pruning (that is, removing) one split from the maximal tree. This process results in several models with one less leaf than the maximal tree.

Pruning One Split

Illll

II

I Il ll ll ll l

Data

III Illll

Validation

Illll

Illll Illll

Training Data

Each subtree's predictive performance is rated on validation data. Model 92

The subtrees are compared using a model rating statistic calculated on validation data. (Choices for model fit statistics are discussed later.)


3-66

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Pruning One Split Training Data

Validation Data WW*

\""""

1

1

; The subtree with the highest validation assessment is selected. Model 94

Complexity

T h e subtree with the best validation assessment statistics is selected to represent a particular level of model complexity.

Pruning Two Splits Training Data

Validation

Data twit

i

ILMJIII'IH m

"i

p**n

H+H I Similarly this is done for subsequent models.

Model Complexity

A similar process is followed to obtain the next batch of subtrees.


3.3 Optimizing the Complexity of Decision Trees

Pruning Two Splits Training Data

ÂŁ

A

^

Prune two splits from the maximal tree,...

Model Complexity

97

This time the subtrees are formed by removing two splits from the maximal tree.

Pruning Two Splits Training Data

Validation

Dais

mm: mmi mil mm

I

1

mmm

mi mi

mm

llll

II

II

mm

1

2 9 validation assessment, and. r a t e

"

"

e a c h

s u b t r e e

u s i n

Model Complexity

O n c e again, the predictive rating of each subtree is compared using the validation data.

3-67


3-68

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Pruning Two Splits Training Data

Validation

Data

T2

s 4k

...select the subtree with the best assessment rating.

Model Complexity

100

T h e subtree with the highest predictive rating is chosen to represent the given level of model complexity.

Subsequent Pruning lion Data

C~2

Continue pruning until all subtrees are considered.

4k 106

Model Complexity

T h e process is repeated until all levels of complexity are represented by subtrees.


3.3 Optimizing the Complexity of Decision Trees

Selecting the Best Tree Training Data

Validation Data

mm mmm mm mmm mmm mm

III

III

mm mm

imam Compare validation assessment between tree complexities.

A MMft ATocte/

Validation

Complexity

107

Assessment

Consider the validation assessment rating for each of the subtrees.

Selecting the Best Tree Validation Data

Training Data

mm

mm

mmm

mm mm

mm

mmm mm

Mkhttt

.<•• -trtrtrXr

Model Complexitv

mm

Which subtree should be selected as the best model?

Validation Assessment

Which of the sequence of subtrees is the best model?

3-69


3-70

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Validation Assessment Ml

mmm

mm — mmm mm mm

109

ft

•

Mode/

Validation

Complexity

Choose the simplest model with the highest validation assessment.

Assessment

In S A S Enterprise Miner, the simplest model with the highest validation assessment is considered best.

Validation Assessment Validation Data

What are appropriate validation assessment ratings?

110

O f course, you must finally define what is meant by validation assessment ratings.


3.3 Optimizing the Complexity of Decision Trees

3-71

Assessment Statistics

Validation

Data

Ratings depend on...

mm mm mm ——mm

- . - t a r g e t measurement prediction type.

112

T w o factors determine the appropriate validation assessment rating, or more properly, assessment statistic: • the target measurement scale • the prediction type A n appropriate statistic for a binary target might not m a k e sense for an interval target. Similarly, models tuned for decision predictions might m a k e poor estimate predictions.

Binary Targets

primary outcome secondary outcome

114

For this discussion, a binary target is assumed with a primary outcome ( t a r g e t = l ) and a secondary outcome ( t a r g e t = 0 ) . T h e appropriate assessment measure for interval targets is noted below.


3-72

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Binary Target Predictions

• ^^^ff^^^M t : mMM

HH

mm

•HBBB

Hi

mm

mmmwm

mm

mm

Mmm

mm

m

m

m

m

mm

^u 5! *I

mm • • • • ma

-

' 1

primary

1

decisions

secondary

520

rankings

720

;

estimates

116

There are different s u m m a r y statistics corresponding to each of the three prediction types: • decisions • rankings • estimates The statistic that you should use to judge a model depends on the type of prediction that you want.


3.3 Optimizing the Complexity of Decision Trees

3-73

Decision Optimization

mmm

mmm mmt

mm mm

mmm

warn mm

mm

MB H i

mm mm

M

mm. mm

mm mmnrnm

decisions

119

Consider decision predictionsfirst.With a binary target, you will typically consider two decision types: • the primary decision, corresponding to the primary outcome • the secondary decision, corresponding to the secondary outcome

Decision Optimization - Accuracy

true positive true negative Maximize accuracy: agreement between outcome and prediction

120

Matching the primary decision with the primary outcome yields a correct decision called a true positive Likewise, matching the secondary decision to the secondary outcome yields a correct decision called a true negative. Decision predictions can be rated by their accuracy, that is, the proportion of agreement between prediction and outcome.


3-74

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Decision Optimization - Misclassification

fnpulS

mm mm mm — mm mm H I mm mm mm mm mm ii -11.

mm mm mm mm mm mm

tj/giir

predietioti 1

1

| secondly!]

false negative

0

[ primly ||

false positive

mum mmm

Minimize misclassification: disagreement between outcome and prediction

122

Mismatching the secondary decision with the primary outcome yields an incorrect decision called a false negative. Likewise, mismatching the primary decision to the secondary outcome yields an incorrect decision called a false positive. Decision predictions can be rated by their misclassification, the proportion of disagreement between prediction and outcome.

Ranking Optimization

124

Consider ranking predictions for binary targets. With ranking predictions, a score is assigned to each case. T h e basic idea is to rank the cases based on their likelihood of being a primary or secondary outcome. Likely primary outcomes receive high scores and likely secondary outcomes receive low scores.


3.3 Optimizing the Complexity of Decision Trees

3-75

Ranking Optimization - Concordance

target=0—+lowscore target=l—"high score Maximize concordance: proper ordering of primary and secondary outcomes

126

W h e n a pair of primary and secondary cases is correctly ordered, the pair is said to be in concordance. Ranking predictions can be rated by their degree of concordance, that is, the proportion of such pairs whose scores are correctly ordered.

Ranking Optimization - Discordance

M

M 720 520

target=0—>high score target=l—dow score Minimize discordance: improper ordering of primary and secondary outcomes

129

W h e n a pair of primary and secondary cases is incorrectly ordered, the pair is said to be in discordance. Ranking predictions can be rated by their degree of discordance, that is, the proportion of such pairs whose scores are incorrectly ordered.


3-76

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Estimate Optimization

1

r~~ "

HH

• •

mmm

mm H

mm

mm

ffl^BBS

Hi

Hi

mm mm mm mm mmm H I E M

estimates

130

Finally, consider estimate predictions. For a binary target, estimate predictions are the probability of the primary outcome for each case. Primary outcome cases should have a high predicted probability; secondary outcome cases should have a low predicted probability.


3.3 Optimizing the Complexity of Decision Trees

3-77

Estimate Optimization - Squared Error

(target - estimate)2 Minimize squared error. squared difference between target and prediction

132

T h e squared difference between target and estimate is called the squared error. Averaged over all cases, squared error is a fundamental statistical measure of model performance.

where, for the /'' case, y, is the actual target value and yt is the predicted target value. W h e n calculated in an unbiased fashion, the average square error is related to the amount of bias in a predictive model. A model with a lower average square error is less biased than a model with a higher average square error. f

Because S A S Enterprise Miner enables the target variable to have more than two outcomes, the actual formula used to calculate average square error is slightly different (but equivalent, for a binary target). Suppose the target variable has L outcomes given by (Cj, C2, ...C/.), then Average

square

error

=

NL

where, for the t case,>'; is the actual target value, p

tJ

and the following:

is the predicted target value for the f

class,


3-78

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Complexity Optimization - Summary

decisions accuracy / misclassification rankings concordance/discordance estimates squared error

134

In s u m m a r y , decisions require high accuracy or low misclassification, rankings require high concordance or low discordance, and estimates require low (average) squared error.


3.3 Optimizing the Complexity of Decision Trees

3-79

A s s e s s i n g a Decision Tree

Use the following steps to perform an assessment of the maximal tree on the validation sample. 1. Select the Decision Tree node. In the Decision Tree node's properties, change Use Frozen Tree from N o to Yes.

Protect Fie Ed* Vew Araju Optitns W n d w

S E n t e r p r l M Miier- M y

Heb

D - l ^ --i|X|BI^|j]|aD|gJl^i|*

3 wf Prsjoil

S*Tte| Eiptaril rtad^vl Mafe|| Istns j Unity | Craftfrumpj Text Hlrtifj !

£i- 1.1. •• tr<-

4Hjperty

Gt rural

-J

1

vaue -

rft 5

3

Imnnrt !! Hals Exported Dal; Moles Vaiaoes tnlurattnr* Use Frozen Tree UseMjIliclBTaroels

M

Yea

InltrrJ Ci leilJd ProbF ProbCnsq -Ncmirar Cntenor • tTjin;' Crnerion Entncv 0.2 r Sijnfflrarue Level rMissinsVilues Jse In search l U5S Input Of It H Nu j-MBdrrumOro-icri 3 >M3rlmimDBB* 6 -Minimum :a!eao-ica Se5 Use Froien Tree

p U p v t Pr«*rtkr* £ r w J v w

J

2l\ZO

Q — J —o w i j '

fn[w*H

T h e Frozen Tree property prevents the maximal tree from being changed by other property settings w h e n the flow is run.


3-80

Chapter 3 Introduction to Predictive Modeling: Decision Trees

2. Right-click the Decision Tree node and run it. Select Results. Enterprise Miner-test I lie grit yjew fictions Options Window Help

3 teitl •*• Ol D a t a Sources H il Diagrams Predictive Analysi 1—J™ predictive analysi; 1 B ftj Model Packages I JlJ Users

Sample Explore Morify Model Assess Ut«y Oedt Scoring Teit r**ig ;

Predictive Analysis

o

•J

USH Multiple TaigetNa B| Interval Criterion ProBF Nominal Criterion ProttChisq Ordinal Criterion Entropy Significance Level 0.2 Missing Values Use In search Use Input Once |No Maximum Branch 2 Maiimum Depth 6 "Minimum Calegorfris

Run completed BaojerniFTeclrtive Analyst Path:

Decision Tree OK

Results...

-—


3.3 Optimizing the Complexity of Decision Trees

3-81

T h e Results w i n d o w opens. nr Results - fode: Decision Tree Diagram: Predictive Analysis Fte Ed* yew Wnctow

Si. i "i i'

ME

\>. mknnji Live rljy: Tjftjelfl

9 14 15 Leaf Index

19

21

• Tiaimng Percent 1 • validation Percent 1

Target TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB

Statistic 0:

It H:

Train so.a* so.at

Validation so.a* so.o\

4943

4S4J

Ft Statistics _NOBS_ _SUMW_ _MISC_ _MAX_ _SSE_ _ASE_ JMSE_ _DIV_ _DFT_

Statistics Label

Tram

S u m of f re._ S u m of Cas... Hisclassrfic . Maximum A... S u m of Squ.. Average Sq.. RoolAvera... Divisor (or A.. Total Degre...

Validation 4843

4843

9686

9686

0103425

0.432377

0.923571

0.928571

2293.347

2377.757

0236769

0 245484

0486599

0.495463

9666

9686

4843

Gift C o u n tret 36 Months

Variable Inpoct-encc

T h e Results w i n d o w contains a variety of diagnostic plots and tables, including a cumulative lift chart, a tree m a p , and a table of lit statistics. T h e diagnostic tools s h o w n in the results vary with the measurement level of the target variable. S o m e of the diagnostic tools contained in the results s h o w n above are described in the following Self-Study section.


3-82

Chapter 3 Introduction to Predictive Modeling: Decision Trees T h e saved tree diagram is in the bottom left corner of the Results window. Y o u can verify this by maximizing the tree plot.

at...

i

'

25

>-

I

Me... >. 70000

Pro...

175

i

*

I

Cif...

i

>- 175

:t

I

r

« 532

T~ c Owner

^7S I

>- is

I

i

I

i

>- 5.32

< 1D5

I

Tin...

,

,

>- 105

I

•> 175

l

>- 115

18.53...

I

'

i —

L >- 175

I

Do...

, . I ,

25

Pio...

,

(

&' 115

I

Gift Amount Lost

room

i

1

I

Me... 24,49

* 124100

. I . . I

Gil...

>- 124100

I

>- 14.415

< 14.415

. I . .• 1

~Y SSH...

7S500

< 7S500


3.3 Optimizing the Complexity of Decision Trees

3-83

4. The main focus of this part of the demonstration is on assessing the performance of a 15-leaf tree, created with the Interactive Decision Tree tool, on the validation sample. Notice that several of the diagnostics s h o w n in the results of the Decision Tree node use the validation data as a basis. Select V i e w <=> M o d e l •=> Iteration Plot. Iteration Plot

E M

Average Square Error

0.250 -

0.245 -

0.240-

10

15

N u m b e r of L e a v e s Train: Average Squared Error

Valid: Average Squared Error

T h e plot shows the Average Square Error corresponding to each subtree as the data is sequentially split. This plot is similar to the one generated with the Interactive Decision Tree tool, and it confirms suspicions about the optimality of the 15-leaf tree. The performance on the training sample becomes monotonically better as the tree becomes more complex. However, the performance on the validation sample only improves up to a tree of, approximately, four or five leaves, and then diminishes as model complexity increases. y

T h e validation performance shows evidence of model overtltting. Over the range of one to approximately four leaves, the precision of the model improves with the increase in complexity. A marginal increase in complexity over this range results in better accommodation of the systematic variation or signal in data. Precision diminishes as complexity increases past this range; the additional complexity accommodates idiosyncrasies in the training sample, and the model extrapolates less well.


3-84

Chapter 3 Introduction to Predictive Modeling: Decision Trees Both axes of the Iteration Plot, s h o w n above, require s o m e explanation.

Average Square Error: T h e formula for average square error w a s given previously. For any decision tree in the sequence, necessary inputs for calculating this statistic are the actual target value and the prediction. Recall that predictions generated by trees are the proportion of cases with the primary outcome ( T a r g e t B = l ) in each terminal leaf. This value is calculated for both the training and validation data sets. Number of leaves: Starting with the four-leaf tree that y o u constructed earlier, it is possible to create a sequence of simpler trees by removing partitions. For example, a three-leaf tree can be constructed from your four-leaf tree by removing either the left partition (involving M e d i a n H o m e V a l u e ) or the right partition (involving G i f t A m o u n t L a s t ) . R e m o v i n g either split might affect the generated average square error. T h e subtree with the lowest average square error (measured, by default, on the validation data) is the split that remains. T o create simpler trees, the process is repeated by, once m o r e , starting with the four-leaf tree that y o u constructed and removing m o r e leaves. For example, the two-leaf tree is created from the four-leaf tree by removing t w o splits, and so on.


3.3 Optimizing the Complexity of Decision Trees

3-85

5. To further explore validation performance, select the arrow in the upper left corner of the Iteration Plot, and switch the assessment statistic to Misclassification Rate. Iteration Plot

l-lnlx

Misclassification Rate

i

0.500 -

0.475 -

0.450 -

0.425 -

10

15

N u m b e r of L e a v e s Train: Misclassification Rate

Valid: Misclassification Rate

T h e validation performance under Misclassification Rate is similar to the performance under Average Square Error. T h e optimal tree appears to have, roughly, four or five leaves. Proportion misclassijied: T h e n a m e decision tree model comes from the decision that is m a d e at the leaf or terminal node. In S A S Enterprise Miner, this decision, by default, is a classification based on the predicted target value. A predicted target value in excess of 0.5 results in a primary outcome classification, and a predicted target value less than 0.5 results in a secondary outcome classification. Within a leaf, the predicted target value decides the classification for all cases regardless of each case's actual outcome. Clearly, this classification will be inaccurate for all cases that are not in the assigned class. T h e fraction of cases for which the wrong decision (classification) was m a d e is the proportion misclassified. This value is calculated for both the training and validation data sets. 6. T h e current Decision Tree node will be renamed and saved for reference. Close the Decision Tree Results window.


3-86

Chapter 3 Introduction to Predictive Modeling: Decision Trees

7. Right-click on the Decision Tree node and select R e n a m e . N a m e the node Maximal Tree.

T h e current process flow should look similar to the following: 2 Predictive Analysis

Maximal TreÂť

l.ita Partition

'J

-J

T h e next step is to generate the optimal tree using the default Decision Tree properties in S A S Enterprise Miner.


3.3 Optimizing the Complexity of Decision Trees

3-87

Pruning a Decision Tree The optimal tree in the sequence can be identified using tools contained in the Decision Tree node. T h e relevant properties are listed under the Subtree heading. 1. Drag another Decision Tree node from the Model tab into the diagram. Connect it to the Data Partition node so that your diagram looks similar to what is s h o w n below. ^149.173.38.118-Remote Desktop

ME

£f Enterprise Miner - M y Project O

Ed* £ewfiftkmsOptions Wncbw tJ*b

DH^l^XlSNBlDlBlailJjMI-.'.'l :-|Q|t5Hi •3 My Project Q| Data Sourc WH Diagrams

Sample EipJore Modify Model Assess Ltity I Credit Scoring

HE

i • - Predictive Analysis

_£] Model Packages Property

General Node ID ImportedData Exported Data Noles

Train

jr*^,p!«— ..at ; — —vfcieiaJliliPirl.li.".

Variables Interactive Use Frozen Tree No Use Multiple Tarjie No

J

6

-

J

RnteiyajCrtterlon Nominal Criterion ProbCnlsq r Ordinal Criterion Entropy r Significance Level G 2 rMisslng Values Use in search rUselnputpnce

General pun completed

- f~l.Mi.imil Ttt*

m

-ti * a a - j -

1 [0 student as student

Connected to SASApp - LogicaJ Workspace Server


Chapter 3 Introduction to Predictive Modeling: Decision Trees

3-88

2. Select the n e w Decision Tree node and scroll d o w n in the Decision Tree properties to the Subtree section. T h e main tree pruning properties are listed under the Subtree heading. Value Property Maximum ueptn b I Minimum Categorical Si 5 §Node 5 Leaf Size • N u m b e r of Rules 5 • 'Number of Surroqate R L 0 : Split Size dsplit Search 5000 iExhaustive 20000 Node Sample :E Subtree ; Method Assessment 1 jNumber of Leaves [Assessment Measure Decision Assessment Fraction 0 . 2 5 [slcross Validation Perform Cross Validatio NO Number of Subsets 10 [Number of Repeats 1 i 12345 iSeed [•jObservation Based Impt Hpbservation Based ImpcNo ^ N u m b e r Single Var ImpcS •Bonferroni Adj u stm e nt Yes "ime of Kass AdjustmerBefore Inputs No •Number of Inputs -Split Adjustment Yes T h e default method used to prune the maximal tree is Assessment. This means that algorithms in S A S Enterprise Miner choose the best tree in the sequence based on s o m e optimality measure. f

Alternative method options are Largest and N . The Largest option provides an autonomous w a y to generate the maximal tree. The N option generates a tree with N leaves. T h e maximal tree is upper-bound on N.

T h e Assessment Measure property specifies the optimality measure used to select the best tree in the sequence. T h e default measure is Decision. The default measure varies with the measurement level of the target variable, and other settings in S A S Enterprise Miner. f

Metadata plays an important role in h o w S A S Enterprise Miner functions. Recall that a binary target variable w a s selected for the project. Based on this, S A S Enterprise Miner assumes that you want a tree that is optimized for making the best decisions (as opposed to the best rankings or best probability estimates). That is, under the current project settings, S A S Enterprise Miner chooses the tree with the lowest misclassification rate on the validation sample by default.


3.3 Optimizing the Complexity of Decision Trees

3-89

Ritjht-click on the n e w Decision Tree node and select R u n . ^Enterprise Miner - M y Project lie Edft View Ac lions Options Wjndow Help

•lOlgHl-ul^l 3 My Prolect t a Date Sources -" Diagrams B

Model Packages

ME

Pr elite live Analysis

Property

General Node ID Imported Dot.) EipcrterJ Data Nates

J Use Multiple TargeNo interval Criterion ProbF •Nominal Criterion PrpbChisq I Ordinal Criterion Entropy r Significance Level 07 r Missing Values Use in search rUse Input Once No r Maximum Branch 2 ' Maximum Depth 6 ; Minimum Calegcnl5

Ed* Variables...

A Run SlResJts... -ft Oeate Model Package... j*) Export Path as SAS Program

Leaf See

cut

Cow Genera]

flxn completed

"P? student as student fat Cornec

4. After the Decision Tree node runs, select Results.... Run Status

V

L3

Run completed Diagram: Predictive Analysis Path:

Decision Tree

OK

Results...

Delete °—una


3-90 5.

Chapter 3 Introduction to Predictive Modeling: Decision Trees In the Result w i n d o w , select V i e w => M o d e l •=> Iteration Plot. T h e Iteration Plot shows model performance under Average Square Error by default. However, this is not the criterion used to select the optimal tree.

6. C h a n g e the basis of the plot to Misclassification Rate.

Sum of Square Errors Maximum Absolute Error Subtree Assessment

0.235 0

5

10

N u m b e r of Leaves Train: Average Squared Error Valid: Average Squared Error

15


3.3 Optimizing the Complexity of Decision Trees

3-91

The five-leaf tree has the lowest associated misclassification rate on the validation sample. Iteration Plot

013

Misclassification Rate 0.500 -

0.475 -

0.450 -

0.425 -

1

5

10

Number of Leaves Train: Misclassification Rate Valid: Misclassification Rate

Notice that the first part of the process in generating the optimal tree is very similar to the process followed in the interactive Decision Tree tool. The maximal, 15-leaf tree is generated. The maximal tree is then sequentially pruned so that the sequence consists of the best 15-leaf tree, the best 14-leaf tree, and so on. Best is defined at any point, for example, eight, in the sequence as the eight-leaf tree with the lowest Validation Misclassification Rate a m o n g all candidate eight-leaf trees. T h e tree in the sequence with the lowest overall validation misclassification rate is selected.


3-92

Chapter 3 Introduction to Predictive Modeling: Decision Trees The optimal tree generated by the Decision Tree node is found in the bottom left corner of the Results window.

SO.01 10.01 BM3

Gift Count 36 Months

2.5 Ti-iin 41.81

3t At 1st ic

V A I irtAt inn

44.Z\

41.0*

7

Girt Amount Last

>=

7.5

7.5 V A I idition

S)

4 1 . It s;.4t

.41

Time Since Last Gift

*=

17.5

7.5

I

T E A in

11.11 40.71

U A I AIIAC

Iialn

Ion

11.81 40.:%

41.11 sa.it

1 • • '~

Gift Amount Average Card 36 Mont

«

< 14.415 SCACucit

TlAtn

41.11 ll.tl

UAI

14.415

idArion

TtAin

tf.M 10.11

IS.CI

Vnlid.it ion

14.El

41.41

45.11

1113

411

7. Close the Results w i n d o w of the Decision Tree node.

1,Lit i o n

UAI

4;.BI

n.ii .

t

n

-Jjl idaalon 11.01

S3.01

l t . i t

TIAIJI

UAIidACicn

11.0 C4.4I ,-•7

34.01

ei.oi


3.3 Optimizing the Complexity of Decision Trees

3-93

Alternative Assessment Measures T h e Decision Tree generated above is optimized for decisions, that is, to m a k e the best classification of cases into donors and non-donors. W h a t if instead of classification, the primary objective of the project is to assess the probability of donation? The discussion above describes h o w the different assessment measures correspond to the optimization of different modeling objectives. Generate a tree that is optimized for generating probability estimates. T h e main purpose is to explore h o w different modeling objectives can change the optimal model specification. 1. Navigate to the M o d e l tab. Drag a n e w Decision Tree into the process flow. fc. Enterprise Miner - My Project FJe £dit View Actions fijtiors Window (jelp

D-l^l-tiixiSll^lBllGlDlAJl^l* 3 My Project - O Data Sources -i JTJ Diagrams

•|0|fl| \.UI-£i ^j^lidiiUj*]

^ J ^ l i d M Jill 5arr.pl? | Explore | M-j'i-fy | Model |

1

|

Scoring |

Mi

• Predictive Analysis

| Model Packages Property

General

Node ID imported Data Exported Data doles

Train

Variables Interactive Use Frozen Tree No use MultipleTargeNo

J

a

J

^JD.i.iion Tret

Interval Critenon ProbF •Nominal Criterion ProbChisc; '•Ordinal Criterion Enlmpy I Signified'e Lew*I 0.2 . Mssinc Values Use in sea-ch !Use Input Once No : vatrr.j':i Branch 2 j-Mailmum Depth E -Minimum CalegoriiS

m

•J533BI Leaf See

1

General pun completed

2.

ft} «udent as student ft*, Connected to SASApp - Logical Workspace Server

Right-click the n e w Decision Tree node and select R e n a m e .

3. N a m e the n e w Decision Tree node P r o b a b i l i t y Rename

m

Node Name; [probability Tree|

OK

Cancel

Tree.


3-94

Chapter 3 Introduction to Predictive Modeling: Decision Trees

4. Connect the Probability Tree node to the Data Partition node. Your diagram should look similar to the following: ...... p r e d i c t i v e A n a l y s i s

5.

Recall that an appropriate assessment measure for generating optimal probability estimates is Average Square Error.


3.3 Optimizing the Complexity of Decision Trees 6. Scroll d o w n in the properties of the Probability Tree node to Subtree. 7. Change the Assessment Measure property to Average Square Error. Value Property 2 ["Maximum Branch 6 j-Maximum Depth '•Minimum Categorical Si:5 ©Node 5 [•Leaf Size 5 '•• Number of Rules H N u m b e r of Surrogate Ru0 M s pi It Size [=]split Search 5000 '•• Exhaustive 20000 '-Node Sample § Subtree Assessment rpMethod 1 r Number of Leaves 1 [-'Assessment Measure Average Sguare Error i '-'Assessment Fraction 0.25 Tperform Cross Validatio N o 10 H N u m b e r of Subsets Number of Repeats 1 12345 '•Seed .[^Observation Based Impt [reservation Based ImpcNo '•Number Single Var Impc5 Pip-Value Adjustment [-Bonferroni Adjustment Yes ! '-Time of Kass AdjustmerBefore No r Inputs 1 H N u m b e r of Inputs Ves '•[Split Adjustment ^Output Variables ; res -LeafVariable Disk Performance 8. R u n the Probability Tree node. 9. After it runs, select Results....

3-95


3-96

Chapter 3 Introduction to Predictive Modeling: Decision Trees

10. In the Results w i n d o w , select View •=> M o d e l <=> Iteration Plot. Iteration Plot

• 2 1

Average Square Error 0.250 -

0.245 -

0.240 -

0.235 N u m b e r of Leaves Train: Average Squared Error Valid: Average Squared Error

A five-leaf tree is also optimal under the Validation Average Square Error criterion. However, viewing the Tree plot reveals that, although the optimal Decision (Misclassification) Tree and the optimal Probability Tree have the same number of leaves, they are not identical trees.


3.3 Optimizing the Complexity of Decision Trees

3-97

11. Maximize the Tree plot at the bottom left corner of the Results window. T h e tree s h o w n below is a different five-leaf tree than the tree optimized under Validation Misclassification. •T Results - Node: Probability Tree Diagram; Predictive Analysis He F,dt View Window

QJDj^lPi/] E m ll.lt I t . 11

f

Gill Caumafl nil Months

2.5

1

Trill. VilidttiOT lilt 11.n lt.lt 11. It

Ti4ln V.llittim ii tt si it

it :t ii.it

Gift Amounl Last

Median H o m e Value Region

I r-350 Ti.lr. V.lliiti

>=

II It II (t

I

T.5 Ilainii •

(111 It.M

1:

TzilB ».(!

a tt

l-.lii.tj-. It.It <(.n

I 1

Time Since Last Gift

\7£ T u r n '.'.liditlm U.tt U.ll ll.lt II,tt

Turn 11.It tilt

Ulllditln. 11.It ii :i

It is important to remember h o w the assessment measures work in model selection. In the first step, cultivating (or growing or splitting) the tree, the important measure is logworth. Splits of the data are m a d e based on logworth to isolate subregions of the input space with high proportions of donors and non-donors. Splitting continues until a boundary associated with a stopping rule is reached. This process generates the maximal tree. The maximal tree is then sequentially pruned. Each subtree in the pruning sequence is chosen based on the selected assessment measure. For example, the eight-leaf tree in the Decision (Misclassification) Tree pruning sequence is the eight-leaf tree with the lowest misclassification rate on the validation sample a m o n g all candidate eight-leaf trees. T h e eight-leaf tree in the Probability Tree pruning sequence is the eight-leaf tree with the lowest average square error on the validation sample a m o n g all candidate eight-leaf trees. This is h o w the change in assessment measure can generate a change in the optimal specification. f

In order to minimize clutter in the diagram, the Maximal Tree node is deleted and does not appear in the notes for the remainder of the course.


3-98

Chapter 3 Introduction to Predictive Modeling: Decision Trees

3.4

Understanding Additional Diagnostic Tools (Self-Study)

Several additional plots and tables contained in the Decision Tree node Results w i n d o w are designed to assist in interpreting and evaluating a decision tree. T h e most useful are demonstrated on the next several pages.


3.4 Understanding Additional Diagnostic Tools (Self-Study)

3-99

Understanding Additional Plots a n d Tables (Optional)

Tree Map T h e Tree M a p w i n d o w is intended to be used in conjunction with the Tree w i n d o w to gauge the relative size of each leaf.

1. Select the small rectangle in the lower right corner of the Tree m a p .

1

1


3-100

Chapter 3 Introduction to Predictive Modeling: Decision Trees

T h e lower right leaf in the tree is also selected.

Statistic 0;

Training

1-

SO*

node:

41 4 3

II

< Statistic

H

in

ii\

V a l idas i<m

so*

50* SO* 4»43

t.s

Training

>• ValiilacIon

0:

ST*

ST

1:

43*

49*

node:

ttoi

« 0 7

' '1 Z.S z

5c a c i n c i c 0:

T r a i n ing

1:

set zsto

0

in

Mil

i<taclon

44*

nodt: tift

Amount

44* SS* ZS96 List

t.S T r a i n ing IS*

(41 SSI

7.5

>= Val idat ion

Statistic

Training

Validation

94*

0:

41*

SS*

It

SJ*

St*

in n o J r .

1903

1971

sss

H

3 ince

I

Last

49*

G I :•

17.5 riing

Val idat ion

ininj

Val idation

41* J9»

49*

SI*

092

914

Sit 49* 1011

SI* 49* 10S7

Gift <

Amount

Average

36

Ho >=

Statistic 0:

Training 47*

Validation S0t

1:

S3*

SOt

in n « d « :

641

£40

>

Card

14.415 e

14.411

Tiaininj

<

T h e size of the small rectangle in the tree m a p indicates the fraction of the training data present in the corresponding leaf. It appears that only a small fraction of the training data finds its w a y to this leaf.


3.4 Understanding Additional Diagnostic Tools (Self-Study)

3-101

Leaf Statistic Bar Chart - m i x 0.6¬ £ 0.4(73 0.2 H 0.0

1

6 Leaf Index • Training Percent 1 • Validation Percent 1 This w i n d o w compares the blue predicted outcome percentages in the left bars (from the training data) to the red observed outcome percentages in the right bars (from the validation data). This plot reveals h o w well training data response rates are reflected in the validation data. Ideally, the bars should be of the s a m e height. Differences in bar heights are usually the result of small case counts in the corresponding leaf.


3-102

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Variable Importance T h e following steps open the Variable Importance window. Select V i e w ^ M o d e l O Variable Importance. The Variable Importance w i n d o w opens. 1 HI Variable Importance Variable Name

Label

I

H

i

^

Number of Splitting Rules

GiftCnt36 Gift Count 3... GiftAvgLast Girt Amount... DemMedH... Median Ho... GiftTimeLast Time Since... DemPctVet... Percent Vet... GiftAvg36 GiftAmount... GiftAvgCard... Gift Amount... DernAge Age GiftAvgAII GiftAmount... GiftCntAII Gift Count A... GiftCntCard... Gift Count... GiftCntCard... Gift Count... GiftTimeFirst Time Since... PromCntCa...Promotion... PromCnt12 Promotion... PromCnt36 Promotion... PromCntAII Promotion... PromCntCa...Promotion... PromCntCa...Promotion... StatusCatSt..Status Gate... REP_Dem... Replaceme... DemCluster Demograp... DernGender Gender DernHome... H o m e Owner StatusCatG... Status Cate...

i

H

Importance

1 1 1 1 0 0 0

n

o 0 0 H I 0

l

Validation Importance

1 0.524135 0.50277 0.480916 0 0

0 0 0 0 0 0 0 0 0 0 0 0

-

0 0 0 0 0 0 0 0 0 0 0 0 0 K O I £ 3 9 0 0 H I 0

^

j

Ratio of Validation to Training Importance

1 0.671883 0.10861 0.445303

1| 1.281891 0.216024 0.925948 0 . 0 . 0 . 0 . 0 . 0 . 0 . 0 . 0 . 0 . 0 .

0 0 0 0 • o 0 0

.

0

.

H I

T h e Variable importance w i n d o w provides insight into the importance of inputs in the decision tree. T h e magnitude of the importance statistic relates to the amount of variability in the target explained by the corresponding input relative to the input at the top of the table. For example, G i f t Amount L a s t explains 5 2 . 4 % of the variability explained by G i f t Count 36 Months.


3.4 Understanding Additional Diagnostic Tools (Self-Study)

3-103

The Score Rankings Overlay Plot Score Rankings Overlay: Targets

Hi H

E3

Percentile TRAIN

VALIDATE

The Score Rankings Overlay plot presents what is c o m m o n l y called a cumulative lift chart. Cases in the training and validation data are ranked, based on decreasing predicted target values. A fraction of the ranked data is selected (given by the decile value). The proportion of cases with the primary target value in this fraction is compared to the proportion of cases with the primary target value overall (given by the cumulative lift value). A useful model shows high lift in low deciles in both the training and validation data. For example, the Score Ranking plot shows that in the top 2 0 % of cases (ranked by predicted probability), the training and validation lifts are approximately 1.26. This means that the proportion of primary outcome cases in this top 2 0 % is about 2 6 % more likely to have the primary outcome than a randomly selected 2 0 % of cases.


3-104

Chapter 3 Introduction to Predictive Modeling; Decision Trees

Fit Statistics -\ Fit Statistics Target TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB

Fit Statistics Statistics Train Validation Test Label _NOBS_ S u m of Fre... 4843 4843 _SUMW_ S u m of Cas... 9686 9686 _MISC_ Misclassific... 0.420607 0.42804 _MAX_ Maximum A... 0.643836 0.643836 _SSE_ S u m of Squ... 2351.152 2357.331 _ASE_ Average Sq... 0.242737 0.243375 _RASE„ Root Avera... 0.492684 0.493331 _DIV_ Divisor for A... 9686 9686 DFT Total Degre... 4843

T h e Fit Statistics w i n d o w is used to compare various models built within S A S Enterprise Miner. T h e misclassification and average square error statistics are of most interest for this analysis.


3.4 Understanding Additional Diagnostic Tools (Self-Study) 3-105

The Output Window Output

* User: Date: Time: *

Dodotech 08APR08 13:04

* T r a i n i n g Output

V a r i a b l e Summary Measurement

Frequency

Level

Count

Role

2

INPUT

BINARY

INPUT

INTERVAL

20 3

INPUT

NOMINAL

REJECTED

INTERVAL

1

TARGET

BINARY

1

TARGET

INTERVAL

1

Model E v e n t s Number

The Output w i n d o w provides information generated by the S A S procedures that are used to generate the analysis results. For the decision tree, this information includes variable importance, tree leaf report, model fit statistics, classification information, and score rankings.


3-106

Chapter 3 Introduction to Predictive Modeling: Decision Trees

3.5 Autonomous Tree Growth Options (Self-Study)

Autonomous Decision Tree Defaults Default Settings m/*\

I

~]

Maximum Branches

Splitting Rule Criterion

X^*

Subtree Method

/^)^

Tree Size Options

2

Logworth

Average Profit

Bonferroni Adjust Split Adjust Maximum Depth Leaf Size

139

...

T h e behavior of the tree algorithm in S A S Enterprise Miner is governed by m a n y parameters that can be divided into four groups: • the number of splits to create at each partitioning opportunity • the metric used to compare different splits • the method used to prune the tree model • the rules used to stop the autonomous tree growing process T h e defaults for these parameters generally yield good results for an initial prediction.


3.5 Autonomous Tree Growth Options (Self-Study)

3-107

Tree Variations: Maximum Branches Complicates split search

Trades height for width

Uses heuristic shortcuts

140 S A S Enterprise Miner accommodates a multitude of variations in the default tree algorithm. T h e first involves the use of multiway splits instead of binary splits. Theoretically, there is no clear advantage in doing this. A n y multiway split can be obtained using a sequence of binary splits. T h e primary change is cosmetic. Trees with multiway splits tend to be wider than trees with only binary splits. T h e inclusion of multiway splits complicates the split-search algorithm. A simple linear search becomes a search whose complexity increases geometrically in the number of splits allowed from a leaf. T o combat this complexity explosion, the Tree tool in S A S Enterprise Miner resorts to heuristic search strategies.


3-108

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Tree Variations: Maximum Branches 1

r» • l-.vfi*J r.,*_. . .. • ' u • 1 O'ffin >•rtrriQn r,, i... L**il

,

U » «n >» n t h

"1

j | J M Input 0*.r» :

M a j v n u m D«pD>

! Ljlf S C * -UunMU'ol f?u*< l

S * W f

l :.••> - HM i -

IhB

1

Maximum branches in split

* ;

Exhaustive search size limit

BUM

- M*mn J i_. ,.- • • Ml* u**'. ' A s » » r n # r t Fractal - N I-E - I

h

rvrittr

En*

Oil

141 T w o fields in the Properties panel affect the number of splits in a tree. T h e M a x i m u m Branch property sets an upper limit on the number of branches emanating from a node. W h e n this number is greater than the default of 2. the number of possible splits rapidly increases. T o save computation time, a limit is set in the Exhaustive property as to h o w m a n y possible splits are explicitly examined. W h e n this n u m b e r is exceeded, a heuristic algorithm is used in place of the exhaustive search described above. T h e heuristic algorithm alternately merges branches and reassigns consolidated groups of observations to different branches. T h e process stops w h e n a binary split is reached. A m o n g all candidate splits considered, the one with the best worth is chosen. T h e heuristic algorithm initially assigns each consolidated group of observations to a different branch, even if the number of such branches is more than the limit allowed in the final split. At each merge step, the two branches that degrade the worth of the partition the least are merged. After the two branches are merged, the algorithm considers reassigning consolidated groups of observations to different branches. Each consolidated group is considered in turn, and the process stops w h e n no group is reassigned.


3.5 Autonomous Tree Growth Options (Self-Study)

3-109

Tree Variations: Splitting Rule Criterion Yields similar splits

t ti

Grows enormous trees

Favors many-level inputs

142

In addition to changing the number of splits, you can also change h o w the splits are evaluated in the splitsearch phase of the tree algorithm. For categorical targets, S A S Enterprise Miner offers three separate split-worth criteria. Changing from the chi-squared default criterion typically yields similar splits if the number of distinct levels in each input is similar. If not, the other split methods tend to favor inputs with more levels due to the multiple comparison problem discussed above. Y o u can also cause the chi-squared method to favor inputs with more levels by turning off the Bonferroni adjustments. Because Gini reduction and Entropy reduction criteria lack the significance threshold feature of the chi-squared criterion, they tend to grow enormous trees. Pruning and selecting a tree complexity based on validation profit limit this problem to s o m e extent.


3-110

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Tree Variations: Splitting Rule Criterion • IrtTtf-orCmflpjC" '•• Norn-rial Crdtnon -Qn>ny CtAtnan

Split Criterion

Pi:-. r

Us*mtt«cri • uttinpjo*(« Mi.krr^m Bf*rJh • itanrru*

r"* -

*

; fJirmrWrCfRmtl 5 • fiiiinh*ra[^Ljrrij^tf ftui*s • Sc-i::t :•

-Hflrtafiarnpla Htnm • tUPBtrULlm * • - •. LI_ ,

Chi-square logworth Entropy Gini

Categorical Criteria

Variance Prob-F logworth

Interval Criteria

ft ' 57^>tC"or

Ptfl '•'.H r - /<• A- • lis N u m b * ! of Subtil io QfHiOtlti .•

Logworth adjustments

Jifnm of kati U]<j(rri*nt .BtTor* "_lnpu(f Mq n

143

A total of five choices in S A S Enterprise Miner can evaluate split worth. Three (chi-squared logworth, entropy, and Gini) are used for categorical targets, and the remaining two (variance and ProbF logworth) are reserved for interval targets. Both chi-squared and ProbF logworths are adjusted (by default) for multiple comparisons. It is possible to deactivate this adjustment. T h e split worth for the entropy, Gini, and variance options are calculated as shown below. Let a set of cases S be partitioned intop subsets Si, ...,SP so that

Let the number of cases in S equal N and the number of cases in each subset S, equal n,. Then the worth of a particular partition of 5" is given by the following: worth =

l{S)-^w,l(Sl)

where WrnJN (the proportion of cases in subset SI), and for the specified split worth measure, /(•) has the following value: classes

res classes

Entropy Gini

Variance N

notk cotes

Each worth statistic measures the change in I(S) from node to branches. In the Variance calculation, Y is the average of the target value in the node with Y&tt as a m e m b e r .


3.5 Autonomous Tree Growth Options (Self-Study)

3-111

Tree Variations: Subtree Method Controls Complexity

144

S A S Enterprise Miner features t w o settings for regulating the complexity of subtree: modify the pruning process or the Subtree method. B y changing the model assessment measure to average square error, y o u can construct what is k n o w n as a class probability tree. It can be s h o w n that this action minimizes the imprecision of the tree. Analysts sometimes use this model assessment measure to select inputs for a flexible predictive model such as neural networks. Y o u can deactivate pruning entirely by changing the subtree to Largest. Y o u can also specify that the tree be built with a fixed number of leaves.


3-112

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Tree Variations: Subtree Method

Assessment Largest N Pruning options Pruning metrics imbii of Sublets H n- rii. • - r or Rspeaii

J

Decision Average Square Error Misclassification Lift

145

T h e pruning options are controlled by two properties: Method and Assessment Measure.

Tree Variations: Tree Size Options Avoids orphan nodes

Controls sensitivity

Grows large trees

146

T h e family of adjustments that you modify most often w h e n building trees are the rules that limit the growth of the tree. Changing the m i n i m u m number of observations required for a split search and the m i n i m u m number of observations in a leaf prevents the creation of leaves with only one or a handful of cases. Changing the significance level and the m a x i m u m depth allows for larger trees that can be more sensitive to complex input and target associations. T h e growth of the tree is still limited by the depth adjustment m a d e to the threshold.


3.5 Autonomous Tree Growth Options (Self-Study)

3-113

Tree Variations: Tree Size Options Logworth threshold Maximum tree depth Minimum leaf size

Threshold depth adjustment Changing the logworth threshold changes the m i n i m u m logworth required for a split to be considered by the tree algorithm (when using the chi-squared or ProbF measures). Increasing the m i n i m u m leaf size avoids orphan nodes. For large data sets, you might want to increase the m a x i m u m leaf setting to obtain additional modeling resolution. If you want big trees and insist on using the chi-squared split-worth criterion, deactivate the Split Adjustment option.


3-114

Chapter 3 Introduction to Predictive Modeling: Decision Trees

Exercises

1.

Initial Data Exploration A supermarket is offering a n e w line of organic products. T h e supermarket's management wants to determine which customers are likely to purchase these products. T h e supermarket has a customer loyalty program. A s an initial buyer incentive plan, the supermarket provided coupons for the organic products to all of the loyalty program participants and collected data that includes whether these customers purchased any of the organic products. T h e ORGANICS data set contains 13 variables and over 22,000 observations. T h e variables in the data set are s h o w n below with the appropriate roles and levels: Name

Model Role

Measurement Level

Description

ID

ID

Nominal

Customer loyalty identification number

DemAff1

Input

Interval

Affluence grade on a scale from 1 to 30

DemAge

input

Interval

A g e , in years

DemCluster

Rejected

Nominal

Type of residential neighborhood

DemClusterGroup

Input

Nominal

Neighborhood group

DemGender

Input

Nominal

M = male, F = female, U = u n k n o w n

DemRegion

Input

Nominal

Geographic region

DemTVReg

Input

Nominal

Television region

PromClass

Input

Nominal

Loyalty status: tin, silver, gold, or platinum

PromSpend

Input

Interval

Total amount spent

PromTime

Input

Interval

Time as loyalty card m e m b e r

TargetBuy

Target

Binary

Organics purchased? 1 = Yes, 0 - N o

TargetAmt

Rejected

1 nterval

N u m b e r of organic products purchased

$T

Although two target variables are listed, these exercises concentrate on the binary variable

TargetBuy. a. Create a n e w diagram n a m e d Organics.


3.5 Autonomous Tree Growth Options (Self-Study)

3-115

b. Define the data set A A E M 6 1 . O R G A N I C S as a data source for the project. 1) Set the roles for the analysis variables as s h o w n above. 2) Examine the distribution of the target variable. W h a t is the proportion of individuals w h o purchased organic products? 3) T h e variable D e m C l u s t e r G r o u p contains collapsed levels of the variable D e m C l u s t e r . Presume that, based on previous experience, y o u believe that D e m C l u s t e r G r o u p is sufficient for this type of modeling effort. Set the model role for D e m C l u s t e r to Rejected. 4) A s noted above, only T a r g e t B u y will be used for this analysis and should have a role of Target. C a n T a r g e t A m t be used as an input for a model used to predict T a r g e t B u y ? W h y or w h y not?

5) Finish the O r g a n i c s data source definition. c. A d d the A A E M 6 1 . O R G A N I C S data source to the Organics diagram workspace. d. A d d a Data Partition node to the diagram and connect it to the Data Source node. Assign 5 0 % of the data for training and 5 0 % for validation. e. A d d a Decision Tree node to the workspace and connect it to the Data Partition node. f. Create a decision tree model autonomously. Use average square error as the model assessment statistic. 1) H o w m a n y leaves are in the optimal tree? 2) W h i c h variable w a s used for thefirstsplit? W h a t were the competing splits for thisfirstsplit? g. A d d a second Decision Tree node to the diagram and connect it to the Data Partition node. 1) In the Properties panel of the n e w Decision Tree node, change the m a x i m u m number of branches from a node to 3 to allow for three-way splits. 2) Create a decision tree model using average square error as the model assessment statistic. 3) H o w m a n y leaves are in the optimal tree? h.

Based on average square error, which of the decision tree models appears to be better?


3-116

Chapter 3 Introduction to Predictive Modeling: Decision Trees

3.6 Chapter Summary Predictive modeling fulfills s o m e of the promise of data mining. Past observation of target events coupled with past input measurements enables s o m e degree of assistance for the prediction offuture target event values, conditioned on current input measurements. All of this depends on the stationarity of the process being modeled. A predictive model must perform three essential tasks: 1. It must be able to systematically predict n e w cases, based on input measurements. These predictions might be decisions, rankings, or estimates. 2. A predictive model must be able to thwart the curse of dimensionality and successfully select a limited n u m b e r of useful inputs. This is accomplished by systematically ignoring irrelevant and redundant inputs. 3.

A predictive model must adjust its complexity to avoid overfitting or underfitting. In S A S Enterprise Miner, this adjustment is accomplished by splitting the raw data into a training data set for building the model and a validation data set for gauging model performance. T h e performance measure used to gauge performance depends on the prediction type. Decisions can be evaluated by accuracy or misclassification, rankings by concordance or discordance, and estimates by average square error.

Decision trees are a simple but potentially useful example of a predictive model. They score n e w cases by creating prediction rules. They select useful inputs via a split-search algorithm. They optimize complexity by building and pruning a maximal tree. A decision tree model can be built interactively, automatically, or autonomously. For autonomous models, the Decision Tree Properties panel provides options to control the size and shape of the final tree.

Decision Tree Tools Review i fatten

Partition raw data into training and validation sets. Interactively grow trees using the Tree Desktop application. You can control rules, selected inputs, and tree complexity. Autonomously grow decision trees based on property settings. Settings include branch count, split rule criterion, subtree method, and tree size options.

149


3.7 Solutions

3.7

Solutions

Solutions to Exercises 1. Initial Data Exploration a. Create a n e w diagram n a m e d Organics. 1) Select File •=> N e w «=> Diagram.... The Create N e w Diagram w i n d o w opens.

Diagram Name:

OK

Cancel —

2) Type O r g a n i c s in the D i a g r a m N a m e field.

Diagram Name: Organics OK

3) Select O K .

Cancel

3-117


3-118

Chapter 3 Introduction to Predictive Modeling: Decision Trees

b. Define the data set AAEM61. ORGANICS as a data source for the project. 1) Set the model role for the target variable. 2) Examine the distribution of the target variable. W h a t is the proportion of individuals w h o purchased organic products? a) Select File <=> N e w ^> Data Source.... The Data Source Wizard w i n d o w opens. JjfData Source Wizard — StepI of 7 Metadata Source

Select a metadata source

Source: SAS Table

~3

Cancel

b) Select Next >. The wizard proceeds to Step 2.


3.7 Solutions

3-119

c) Type AAEM61. ORGANICS in the T a b l e field. jjjjfData Source Wizard - Step 2 of 7 Select a SAS Table

Select a S A S table

Table: |aaem6l .organics

Browse..

< Qack

Cancel

(text >

d) Select Next >. T h e wizard proceeds to Step 3.

m

Iff Data Source Wizard — Step 3 of 7 Table Information Table Properties

Properly Table Name Description Member Type Data Set Type Engine Number of Variable? Created Date loclfied Date

Value AAEM6I. ORGANICS DATA DATA BASE 13 22223 April 10, 2008 10:37:03 AM EDT April 10, 2008 10:37:03 AM EDT

<Back

j Net > "j|

Cancel


3-120

Chapter 3 Introduction to Predictive Modeling: Decision Trees e) Select Next >. T h e wizard proceeds to Step 4. SCOAta Source Wizard —5tep4of7 MetadataAdvisor Options Metadata Advisor Options Use the basic setting lo se! the initial measurement levels and roles based on tbe variable attributes. U s e the advanced setting to set the Initial measurement levels and roles based on both the variable attributes and distributions.

P Basic

r Advanced

<6ack

|QÂťxt>''l|

C***!

f) Select the A d v a n c e d radio button and select Customize.... T h e Advanced Advisor Options w i n d o w opens. g) Type 2 as the Class Levels Count Threshold value. Advanced Advisor Options Property

El Value

50 Missing Percentage Threshold Reject Vars with Excessive Missing ValueYes Class Levels Count Threshold 2 Detect Class Levels Yes Reject Levels Count Threshold 20 . , Reject Vars with Excessive Class Values Yes Y e s Class Levels Count Threshold If "Detect class levels"=Yes, interval variables with less than the number specified for this property wiU be marked as N O M I N A L . The default value is 20.

OK

Cancel

h) Select O K . T h e Advanced Advisor Options w i n d o w closes and you are returned to Step 4 of the Data Source Wizard.


3.7 Solutions 3-121 i) Select Next >. T h e wizard proceeds to Step 5. B y customizing the Advanced Metadata Advisor, most of the roles and levels are correctly set. j) Select Role O Rejected for T a r c r e t A m t . fjjjfData Source Wizard —Step 5 of 7 Column Metadata •* [ r not jEquai !o

i(nooe)

"J

lokrnns: V Label

r

Name | Role DemAfrl Input D e m A g e . _ Input DemCluster Rejected D e m C lusterO input DemGender Input DemReg Input D e m T V R e g Input ID ID PromClass Input PromSpend Input PromTime input TargetAmt Rejected TargetBuy Target

Show code

Explore

| Level Interval Interval Nominal Nominal Nominal Nominal Nominal Nominal Nominal Interval Interval Interval Binary

Oder

| Report No No No No Ho Ho Ho No No No No NO NO

Apply r Sanities

Bas* Drop

Lower Lir«

Upper Li

No Ho Ho Ho

1

No

No No No No No

Refresh Sgranary

< Back

.

Next >

Cancel

)

|

k) Select TargetBuy and select Explore. T h e Explore w i n d o w opens. Explore - AAEM61.ORG ANICS File

View fictions Window

HUl.ll.HJJJ.l.W!

^Jnjxj

_ t n l x | i Targets Value

Property

15000 -

22223

Rows Columns

13 AAEM61 ORGANICS DATA Random Max 22223 [l2345

Library Member Type S a m p l e Method Fetch Size Fetched R o w s R a n d o m Seed

S 10000 m S -

£

5000 -

n u

Apply

|

Plot...

^

' I

l

0

1

Or <i,inics Purchase Imlicato! ^jnjxj

5"! A A E M 6 1 . 0 R G A M C S Ous#

•ner

Organics...

o

°1


3-122

Chapter 3 Introduction to Predictive Modeling: Decision Trees T h e proportion of individuals w h o purchased organic products appears to be 2 5 % . I) Close the Explore window. 3) T h e variable DemC l u s t e r Group contains collapsed levels of the variable DemCluster. Presume that, based on previous experience, you believe that DemClusterGroup is sufficient for this type of modeling effort. Set the model role for DemCluster to Rejected. This is already done using the Advanced Metadata Advisor. Otherwise, select Role •=> Rejected for DemCluster. 4) A s noted above, only TargetBuy will be used for this analysis and should have a role of Target. C a n TargetAmt be used as an input for a model used to predict TargetBuy? W h y or w h y not? N o , using TargetAmt as an input is not possible. It is measured at the s a m e time as TargetBuy and therefore has no causal relationship to TargetBuy. 5) Finish the O r g a n i c s data source definition. a) Select Next >. T h e wizard proceeds to Step 6. mm

ItfData Source Wizard" - Step b or 9 Decision Configuration

Decision Processing Do you want to build models based on the vabes of the decisions ? IF you answer yes, you may enter information on the cost or profit of each possible decision, prior probability and cost Function, The data will be scanned for the distributions of the target variables,

® <Back

|| Ne.t >

|j

£ancel

|


3.7 Solutions 3-123 b) Select Next >. T h e wizard proceeds to Step 7. £f Data Source Wizard — Step 7of 8 Data Source Attributes

You m a y change the n a m e and the tole, and can specify a population segment Identifier for the data source

5

Name:

ORGANICS

Role:

| Raw

33

Segment: \

Notes:

<gack

|; Next > |

Cancel

c) Select Next >. T h e wizard proceeds to Step 8.

m

£ D,it ,i Source Wizard -- Step 0 of a Summary Metadata Completed. Library: AAEM81 Table: ORGANICS Role: Raw Role ID Input Input Rejected Rejected Target

Count 1 1

Level Nominal Interval Nominal Interval Nominal Binary

5 1 1 1

<Back

[ hash

|

Cancel

|

d) Select Finish. T h e wizard closes and the O r g a n i c s data source is ready for use in the Project Panel. 13 My Project <? E3| Data Sources IORGANICS PVA97NK <? Lii] Diagrams tj££> Organics Predictive Analysis ©- H j Model Packages ° - JLl Users


3-124

Chapter 3 Introduction to Predictive Modeling: Decision Trees

c. A d d the A A E M 6 1 . O R G A N I C S data source to the Organics diagram workspace. Organics

ET [El

ORGANICS

-V

100%

d. A d d a Data Partition node to the diagram and connect it to the Data Source node. Assign 5 0 % of the data for training and 5 0 % for validation. Organics

ORGANICS

ate Partition

°x

0

0 100%

•


3.7 Solutions 1) Type 5 0 as the Training and Validation values under Data Set Allocations. 2) Type 0 as the Test value.

Variables Data OutputType Partitioning Method Default 12345 Random Seed [•Training [•Validation -Test

.nil

50.0 50.0 0.0

e. A d d a Decision Tree node to the workspace and connect it to the Data Partition node. cc

° Organics

ORGANICS

)nta Paitrtion

Decision Tr ee

3-125


3-126

Chapter 3 Introduction to Predictive Modeling: Decision Trees

f. Create a decision tree model autonomously. Use average square error as the model assessment statistic. • Select Average Square Error as the Assessment Measure property. Train Variables Interactive

Q

I

B

H

Default [-Criterion 0 2 ?• Significance Level Use in search [Missing Values NO [ Use Input Once [•Maximum Branch 2 [•Maximum Depth 6 -Minimum Categorical S 5 llMode 5 [-Leaf Size [•Number ofRules 5 [-Number of Surrogate R 0 '•Split Size =]Split Search [-Exhaustive 5000 : 20000 -Node Sample Assessment [-Method 1 [•Number of Leaves [•Assessment Measure Average Square Error I -Assessment Fraction 0 . 2 5

Right-click on the Decision Tree node and select R u n from the Option m e n u . Select Yes in the Confirmation window.

Do you want to run this path? Diagram: Organics Path:

Decision Tree Yes

No

1) H o w m a n y leaves are in the optimal tree?

Run completed Diagram:Organics Path: OK

Decision Tree

Results..


3.7 Solutions

3-127

a) W h e n the Decision Tree node run finishes, select Results... from the R u n Status window. T h e Results w i n d o w opens. ' Result* - Node: Decision Tree Diagram: Organics Fie Ed* yjew Window Score Rankings Overlays TargetBuy :uirojl*stjve lift

lO Training g h ! Starbt c* Target TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargelSuy TargetBuy TargetBuy TargetBuy

n staaaci SUtntHn _NOBS_ _SUMW_ _M!SC_ _MAX _SSE_ „ASE_ _RASE_

_D1V_ _DFT_

Tn»i Label Sum orFre... Sum otCas.. Mtsclassific... 0. Maximum A... 0 Sum ofSqu.. 26 0. Average 8fl~ ^cotAwra... 0. Divisor for A. Total Degre...

•-ini*l Statisnc

at l a

Ttain 75.;% 24.B*

liu: T Age

J.

Validation 75.=% =4.8%

mil

•sec: Dace: Tlae: * Training Oucpuc •

Scudtnt July 0 6 , 200 13:35:37


3-128

Chapter 3 Introduction to Predictive Modeling: Decision Trees T h e easiest w a y to determine the number of leaves in your tree is via the iteration plot. b) Select V i e w •=> M o d e l •=> Iteration Plot from the Result w i n d o w m e n u . T h e Iteration Plot w i n d o w opens. Iteration Plot ;

x?\£

Average Squared Error

HI]

0.13 -

0.16 -

0.14I

i

10

20

30

Nil mh er of Leaves Train: Average Squared Error —Valid:Average Squared Error [

u

:

,

I

Using average square error as the assessment measure results in a tree with 29 leaves. 2) Which variable w a s used for the first split? W h a t were the competing splits for this first split? These questions are best answered using interactive training. a) Close the Results w i n d o w for the Decision Tree model. I

Train Variables Interactive

b) Select the Interactive ellipsis (...) from the Decision Tree node's Properties panel.


3.7 Solutions T h e S A S Enterprise Miner Tree w i n d o w opens.

I Ffe Edt Model Trah Vtow Options Window Hefc

! Statistic

Training

I

0;. l! I H in noda;

Statistic

Training

1: II In nodi: Aflluanca

« I t u i n l c

0: 1:

Truuuig

»«» 4C1

Training

481

a:

1 14.S

711

2)4 111 It

Sit

399 C n d *

1 >• 14.E Statistic Training 0: 111 1: 331

Statistic 0: 1:

Training »7» 114

jQ. Tree EMWS: ForHeti, preisFI

train Interactively No Piers

3-129


3-130

Chapter 3 Introduction to Predictive Modeling: Decision Trees c) Right-click the root node and select Split Node... from the Option m e n u . T h e Split N o d e I w i n d o w opens with information that answers the two questions. I

Split Node 1

01

Target Variable: TargetBuy Variable DemAffl DemGender PromSpend PromClass

Variable Description Affluence Grade Gender Total Spend .oyalty Status

Branches

-Log(p) 200.19 133.14 32.97 23.33

Apply 2 2 2 2

Edit Rule... OK Cancel

Age is used for the first split. C o m p e t i n g splits are A f f l u e n c e Grade, Gender, T o t a l Spend, a n d L o y a l t y Status.


3.7 Solutions A d d a second Decision Tree node to the diagram and connect it to the Data Partition node.

Decision Tree

: : : ORGANICS

Ll_

0

Decision Tree

3

l) In the Properties panel of the n e w Decision Tree node, change the m a x i m u m n u m b e r of branches from a node to 3 to allow for three-way splits. Train

Variables Interactive !• Criterion [-Significance Level rMissing Values rUse Input Once j-Maximum Branch |-Maximum Depth linimum Categorical

Default 0.2 Use in search No 3| |6 S5

2) Create a decision tree model again using average square error as the model assessment statistic. 3) H o w m a n y leaves are in the optimal tree. Based on the discussion above, the optimal tree with three-way splits has 33 leaves.

3-131


3-132 h.

Chapter 3 Introduction to Predictive Modeling: Decision Trees Based on average square error, which of the decision tree models appears to be better? 1) Select the first Decision Tree node. 2) Right-click and select Results... from the Option m e n u . The Results w i n d o w opens. 3) Examine the Average Squared Error row of the Fit Statistics window. Target TargetBuy

Fit Statistics Statistics Label ASE Average Squared Error

Train Validation 0.132861 0.132773

4) Close the Results window. 5) Repeat the process for the Decision Tree (2) model. Target TargetBuy

Fit Statistics Statistics A Label _ASE_

Average Sq...

Train 0.133009

Validation

Test

0.132662

It appears that the second tree with three-way splits has the lower validation average square error a n d is the better m o d e l .


3.7 Solutions

Solutions to Student Activities (Polls/Quizzes)

3.01 Quiz - Correct Answer Match the Predictive Modeling Application to the Decision Type. Modeling Application C j Loss Reserving

A.

Decision

B Risk Profiling

B.

Ranking

B Credit Scoring

C.

Estimate

A Fraud Detection C \ Revenue Forecasting A Voice Recognition

26

Decision type

3-133


3-134

Chapter 3 Introduction to Predictive Modeling: Decision Trees


Chapter 4 Introduction to Predictive Modeling: Regressions 4.1

4.2

Introduction

Demonstration: Running the Regression Node

4-25

Selecting Regression Inputs

Optimizing Regression Complexity

Interpreting Regression Models Demonstration: Interpreting a Regression Model

4.5

Transforming Inputs Demonstration: Transforming Inputs

4.6

Categorical Inputs Demonstration: Receding Categorical Inputs

4.7

3

4-18

Demonstration: Optimizing Complexity 4.4

-

Demonstration: Managing Missing Values

Demonstration: Selecting Inputs 4.3

4

P o l y n o m i a l R e g r e s s i o n s (Self-Study)

4-29 4-33 4-38 4-40 4-49 4-51 4-52 4-56 4-66 4-69 4-76

Demonstration: Adding Polynomial Regression Terms Selectively

4-78

Demonstration: Adding Polynomial Regression Terms Autonomously (Self-Study)

4-85

Exercises

4-88

4.8

Chapter Summary

4-89

4.9

Solutions

4-91

Solutions to Exercises Solutions to Student Activities (Polls/Quizzes)

4-91 4-102


4-2

Chapter 4 Introduction to Predictive Modeling: Regressions


4.1 Introduction

4.1

4-3

Introduction Model Essentials - Regressions Predict new cases.

Prediction formula

Select useful inputs.

Sequential selection

Optimize complexity.

Best model from sequence

3 Regressions offer a different approach to prediction compared to decision trees. Regressions, as parametric models, assume a specific association structure between inputs and target. By contrast, trees, as predictive algorithms, do not assume any association structure; they simply seek to isolate concentrations o f cases with like-valued target measurements. The regression approach to the model essentials in SAS Enterprise Miner is outlined over the following pages. Cases are scored using a simple mathematical prediction formula. One o f several heuristic sequential selection techniques is used to pick from a collection o f possible inputs and creates a series o f models with increasing complexity. Fit statistics calculated from validation data select the best model from the sequence.


4-4

Chapter 4 Introduction to Predictive Modeling: Regressions

Model Essentials - Regressions i > Predict new cases.

I

p

^<*f*

formula

5 Regressions predict cases using a mathematical equation involving values o f the input variables.


4.1 Introduction

4-5

Linear Regression Prediction Formula input measurement A

y

A

A

= w

0

intercept estimate

+ w,-

A

X)

+ w x 2

2

prediction estimate

parameter estimate

Choose intercept and parameter estimates to minimize: squat ed error function

liy.-y,)

1

framing data

6

In standard linear regression, a prediction estimate for the target variable is formed from a simple linear combination o f the inputs. The intercept centers the range o f predictions, and the remaining parameter estimates determine the trend strength (or slope) between each input and the target. The simple structure o f the model forces changes in predicted values to occur in only a single direction (a vector in the space o f inputs with elements equal to the parameter estimates). Intercept and parameter estimates are chosen to minimize the squared error between the predicted and observed target values (least squares estimation). The prediction estimates can be viewed as a linear approximation to the expected (average) value o f a target conditioned on observed input values. Linear regressions are usually deployed for targets with an interval measurement scale.


4-6

Chapter 4 Introduction to Predictive Modeling: Regressions

Logistic Regression Prediction Formula A

l

o

g

P W + Wj X, + vv x (i3j) = Q

2

2

/0

*

ft

scores

Logistic regressions are closely related to linear regressions. In logistic regression, the expected value o f the target is transformed by a link function to restrict its value to the unit interval. In this way, model predictions can be viewed as primary outcome probabilities. A linear combination o f the inputs generates a logit score, the log o f the odds o f primary outcome, in contrast to the linear regression's direct prediction o f the target. I f your interest is ranking predictions, linear and logistic regressions yield virtually identical results.


4.1 Introduction

4-7

Logit Link Function A

.

/

H P

\

A

log ^ - — A - j = w

p

A 0

A

+ Wi x

1

+ w

x

2

2

'"a"

/og/f link function

The logit link function transforms probabilities (between 0 and 1) to logit scores (between -°° and +»).

For binary prediction, any monotonic function that maps the unit interval to the real number line can be considered as a link. The logit link function is one o f the most common. Its popularity is due, in part, to the interpretability o f the model.

Logit Link Function log y

=

+

+ w * 2

= logit(p)

1

A _

^

2

1

+ e

-loglt(p)

To obtain prediction estimates, the logit equation is solved for ft.

The predictions can be decisions, rankings, or estimates. The logit equation produces a ranking or logit score. To get a decision, you need a threshold. The easiest way to get a meaningful threshold is to convert the prediction ranking to a prediction estimate. You can obtain a prediction estimate using a straightforward transformation o f the logit score, the logistic function. The logistic function is simply the inverse o f the logit function. You can obtain the logistic function by solving the logit equation for p.


4-8

Chapter 4 Introduction to Predictive Modeling: Regressions

14

To demonstrate the properties o f a logistic regression model, consider the t w o - c o l o r prediction p r o b l e m introduced in Chapter 3. A s before, the goal is to predict the target color, based on the location in the unit square. To make use o f the prediction f o r m u l a t i o n , y o u need estimates o f the intercept and other model parameters.

Simple Prediction Illustration - Regressions logit(p) = w + w ' X 1

0

A

p=

1

+ w

2

1 1 + -logiKp)

—

e

Find parameter estimates

by maximizing I

logtfj)+1109(1-^

MmlnjciiH

UUICOB*'

f lining c u t J

log-likelihood function

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

16 The presence o f the logit l i n k function complicates parameter estimation. Least squares estimation is abandoned in favor o f m a x i m u m l i k e l i h o o d estimation. The l i k e l i h o o d function is the j o i n t probability density o f the data treated as a function o f the parameters. The m a x i m u m l i k e l i h o o d estimates are the values o f the parameters that m a x i m i z e the probability o f obtaining the t r a i n i n g sample.


4.1 Introduction

4-9

Simple Prediction Illustration - Regressions

logit( p ) = 0.81+ 0.92X, +1.11 x

o.s

2

1 P

^^^^^^^

~ 1 + i^'(p> e

x 0.5 2

Using the maximum likelihood estimates, the prediction formula assigns a logit score to each x. and x . 2

0.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0,7 0.8 0.9 1.0

18

Parameter estimates are obtained by m a x i m u m l i k e l i h o o d estimation. These estimates can be used in the logit and logistic equations to obtain predictions. The plot on the right shows the p r e d i c t i o n estimates f r o m the logistic equation. One o f the attractions o f a standard logistic regression model is the s i m p l i c i t y o f its predictions. The contours are simple straight lines. (In higher dimensions, they w o u l d be hyperplanes.)


4-10

Chapter 4 Introduction to Predictive Modeling: Regressions

4.01 Multiple Choice Poll What is the logistic regression prediction for the indicated point? a. -0.243 b. 0.56 c. yellow d. It depends . . .

s

^

s

^

S

^ ^ \ v i v \ ' V ^ N

logit(p) =-0.81+0.92x + 1.11x 1

2

20

To score a new case, the values o f the inputs are plugged into the logit or logistic equation. T h i s action creates a logit score or prediction estimate. Typically, i f the prediction estimate is greater than 0.5 (or equivalently, the logit score is positive), cases are usually classified to the p r i m a r y outcome. ( T h i s assumes an equal misclassification cost.) The answer to the question posed is, o f course, it depends. • A n s w e r A , the logit score, is reasonable i f the goal is ranking. • A n s w e r B, the prediction estimate f r o m the logistic equation, is appropriate i f the goal is estimation. •

A n s w e r C, a classification, is a g o o d choice i f the goal is deciding dot color.


4.1 Introduction

4-11

Regressions: Beyond the Prediction Formula Manage missing values. Interpret the model. Handle extreme or unusual values. Use nonnumeric inputs. Account for nonlinearities.

While the prediction formula would seem to be the final word in scoring a new case with a regression model, there are actually several additional issues that must be addressed. • What should be done when one o f the input values used in the prediction formula is missing? You might be tempted to simply treat the missing value as zero and skip the term involving the missing value. While this approach can generate a prediction, this prediction is usually biased beyond reason. • How do you interpret the logistic regression model? Certain inputs influence the prediction more than others. A means to quantify input importance is needed. • How do you score cases with unusual values? Regression models make their best predictions for cases near the centers o f the input distributions. I f an input can have (on rare occasion) extreme or outlying values, the regression should respond appropriately. • What value should be used in the prediction formula when the input is not a number? Categorical inputs are common in predictive modeling. They did not present a problem for the rule-based predictions o f decision trees, but regression predictions come from algebraic formulas that require numeric inputs. (You cannot multiply marital status by a number.) A method to include nonnumeric data in regression is needed. • What happens when the relationship between the inputs and the target (or rather logit o f the target) is not a straight line? It is preferable to be able to build regression models in the presence o f nonlinear (and even nonadditive) input target associations.


4-12

Chapter 4 Introduction to Predictive Modeling: Regressions

Regressions: Beyond the Prediction Formula Manage missing values.

The above issues affect both model construction and model deployment. The first o f these, h a n d l i n g missing values, is dealt w i t h immediately. The r e m a i n i n g issues are addressed, in t u r n , at the end o f this chapter.

Missing Values and Regression Modeling

mm ^^HiCZZZj • • • mm h mm

rzii

C~l

MM • • • . " ; — am mm

m

Problem 1: Training data cases with missing values on inputs used by a regression mode! are ignored.

M i s s i n g values present t w o distinct problems. The first relates to model construction. The default method for treating m i s s i n g values in most regression tools in SAS Enterprise M i n e r is complete-case analysis. In complete-case analysis, only those cases w i t h o u t any missing values are used in the analysis.


4.1 Introduction

4-13

Missing Values and Regression Modeling Training Data beto

i

KBiia j tmm t'-'Wi ta'^a f r y * H O t^aa

Consequence: Missing values can significantly reduce your amount of training data for regression modeling!

26

Even a smattering o f missing values can cause an enormous loss o f data in high dimensions. For instance, suppose that each o f the k input variables is missing at random with probability a . In this situation, the expected proportion o f complete cases is as follows: (t-af Therefore, a 1 % probability o f missing ( a = 0 1 ) for 100 inputs leaves only 3 7 % o f the data for analysis, 200 leaves 13%, and 400 leaves 2%. I f the "missingness" were increased to 5% (cc=.05), then < 1 % o f the data would be available with 100 inputs.


4-14

Chapter 4 Introduction to Predictive Modeling: Regressions

Missing Values and the Prediction Formula logit(p) = -0.81 + 0.92 • x +1.11- x 1

2

Predict: (x1, x2) = (0.3, ? ) Problem 2: Prediction formulas cannot score cases with missing values.

27

Missing Values and the Prediction Formula logit(p) = ?

Problem 2: Prediction formulas cannot score cases with missing values.

30

The second missing value problem relates to model deployment or using the prediction formula. How would a model built on the complete cases score a new case i f it had a missing value? To decline to score new incomplete cases would be practical only i f there were a very small number o f missing values.


4.1 Introduction

Missing Value Issues Manage missing values. Problem 1: Training data cases with missing values on inputs used by a regression model are ignored. Problem 2: Prediction formulas cannot score cases with missing values.

31

A remedy is needed for the two problems o f missing values. The appropriate remedy depends on the reason for the missing values.

4-15


4-16

Chapter 4 Introduction to Predictive Modeling: Regressions

Missing Value Causes > Manage missing values. |CZD| Non-applicable measurement czj\

No match on merge Non-disclosed measurement

33

Missing values arise for a variety o f reasons. For example, the time since last donation to a card campaign is meaningless i f someone did not donate to a card campaign. In the PVA97NK data set, several demographic inputs have missing values in unison. The probable cause was no address match for the donor. Finally, certain information, such as an individual's total wealth, is closely guarded and is often not disclosed.

Missing Value Remedies Manage missing values. Non-applicable measurement No match on merge Non-disclosed measurement

34

Synthetic distribution

Estimation


4.1 Introduction

4-17

The primary remedy for missing values in regression models is a missing value replacement strategy. Missing value replacement strategies fall into one o f two categories. Synthetic distribution methods

Estimation methods

use a one-size-fits-all approach to handle missing values. A n y case with a missing input measurement has the missing value replaced w i t h a fixed number. The net effect is to modify an input's distribution to include a point mass at the selected fixed number. The location o f the point mass i n synthetic distribution methods is not arbitrary. Ideally, i t should be chosen to have minimal impact on the magnitude o f an input's association w i t h the target. M t h many modeling methods, this can be achieved by locating the point mass at the input's mean value. eschew the one-size-fits-all approach and provide tailored imputations for each case w i t h missing values. This is done by viewing the missing value problem as a prediction problem. That is, you can train a model to predict an input's value from other inputs. Then, when an input's value is unknown, you can use this model to predict or estimate the unknown missing value. This approach is best suited for missing values that result from a lack o f knowledge, that is, no-match or nondisclosure, but i t is not appropriate for not-applicable missing values.

Because predicted response might be different for cases with a missing input value, a binary imputation indicator variable is often added to the training data. Adding this variable enables a model to adjust its predictions in the situation where "missingness" itself is correlated with the target.


4-18

Chapter 4 Introduction to Predictive Modeling: Regressions

Managing Missing Values

The demonstrations in this chapter b u i l d on the demonstrations o f Chapters 2 and 3. A t this point, the process f l o w diagram has the f o l l o w i n g structure:

Data A s s e s s m e n t Continue the analysis at the Data Partition node. A s discussed above, regression requires that a case have a complete set o f input values for both t r a i n i n g and scoring. F o l l o w these steps to examine the data status after the p a r t i t i o n . 1.

Select the Data Partition node.

2.

Select Exported Data •=> ^ Property

Value

General Node ID Imported Data Exported Data Notes

Part

1

Train

Variables Data Output Type Partitioning Method Default Random Seed 1 234 5

j •BISsB IU 1 BETOfflU [-Training [•Validation '-Test

-50.0 50.0 0.0

f r o m the Data Partition node property sheet.

_ |jgj


4.1 Introduction

4-19

T h e E x p o r t e d Data w i n d o w opens. 3.

Select the T R A I N data p o r t and select E x p l o r e . . . .

mmm TRAIN VALIDATE TEST

Table EMWS.Part_TRA!N EMWS.Part VALIDATE EMWS.Part TEST

Port

|

Role

J ~

Train IValidate 'Test

Browse..

Data Exists

Yes Yes |No

Properties..

Explore..,

OK

There are several inputs w i t h a noticeable frequency o f m i s s i n g values, f o r e x a m p l e . A g e and the replaced value o f m e d i a n i n c o m e .

;

\

^^•llot^EM'Vl^ij.UMIiMlthMMMiifltti^B File

1

View

Actions

M B .

m ' >

Window

Sample Properties

i4843

Property Rows ; Columns Library Member Type j Sample Method Fetch Size Fetched Rows Random Seed

Value

bo EMWS PART TRAIN DATA iRandom Max 4843 12345

Apply

r/cf

o j EMWS.Part TRAIN Status Cate.J Demograp...| 100 035 035 115 136 016 017 024 035 139 045

Plot.

Gender

Age ,M 53M 58 M .U .F .F 39 M 50F M M .F

m

ll-iome Owner Median Ho...j Percent Vet.J Median Inc .lRepiacerneJ 1

Ij-,

=

' '

IpO

$139,200 $168,100 $234,700 $143,900 $134,400 $68,200 $115,100 $0 $207,900 $104,400 $70,000

V

27 37 22 30 41 47 30 33 48 35 37

TO

$38,942 $71,509 $72,868 $0 $0 $0 $63,000 $0 $0 $54,385 $0

38942 71609 72868

63000

54385


4-20

Chapter 4 Introduction to Predictive Modeling: Regressions

There are several ways to proceed: • Do nothing. I f there are very few cases with missing values, this is a viable option. The difficulty with this approach comes when the model must predict a new case that contains a missing value. Omitting the missing term from the parametric equation usually produces an extremely biased prediction. • Impute a synthetic value for the missing value. For example, i f an interval input contains a missing value, replace the missing value with the mean o f the nonmissing values for the input. This eliminates the incomplete case problem but modifies the input's distribution. This can bias the model predictions. Making the missing value imputation process part o f the modeling process allays the modified distribution concern. Any modifications made to the training data are also made to the validation data and the remainder o f the modeling population. A model trained with the modified training data w i l l not be biased i f the same modifications are made to any other data set that the model might encounter (and the data has a similar pattern o f missing values). • Create a missing indicator for each input in the data set. Cases often contain missing values for a reason. I f the reason for the missing value is in some way related to the target variable, useful predictive information is lost. The missing indicator is 1 when the corresponding input is missing and 0 otherwise. Each missing indicator becomes an input to the model. This enables modeling o f the association between the target and a missing value on an input. 4.

Close the Explore and Exported Data windows.


4.1 Introduction

4-21

Imputation To address missing values in the PVA97NK data set, use the following steps to impute synthetic data values and create missing value indicators: 1.

Select the Modify tab.

2.

Drag an Impute tool into the diagram workspace.

3.

Connect the Data Partition node to the Impute node. In the display, below, the Decision Tree modeling nodes are repositioned for clarity. : Predictive Analysis

JLl

*1 J

j

J

i n r "

<*>

n

q

i ~ j -

Š

| i h

p

i

ii

r


Chapter 4 Introduction to Predictive Modeling: Regressions

4-22

4.

Select the I m p u t e node and examine the Properties panel. Value

Property

General Node ID Imported Data Exported Data Notes

Impt

•••

Train Variables Non Missinq Variables Missing Cutoff = ] c i a s s Variables r Default Input Method [• Default Target Method -Normalize Values

Count None Yes

[- Default Input Method ' [Default Target Method

Mean None

:

... ...

No 50.0

[-Default Character Value '-[Default Number Value ^ M e t h o d Options 12345 •-Random Seed '•-Tuning Parameters - Tree Imputation l

...

... •••

Score [Hide Original Variables Yes [=]lndicatorVariables Unique [•Type Imputed Variables }-Source

.

The defaults o f the Impute node are as f o l l o w s : •

For interval inputs, replace any missing values w i t h the mean o f the nonmissing values.

For categorical inputs, replace any missing values w i t h the most frequent category.

tf

These are acceptable default values and are used throughout the rest o f the course.

W i t h these settings, each input w i t h missing values generates a new input. The new input named \\A?_originaljnpui_name

w i l l have missing values replaced by a synthetic value and n o n m i s s i n g

values copied f r o m the original input.


4.1 Introduction

4-23

Missing Indicators Use the f o l l o w i n g steps to create m i s s i n g value indicators. The settings for m i s s i n g value indicators are found in the Score property group.

.Score [Hide Original Variables rVes r-Type [-{Source ••Role

I

Unique Imputed Variables Input

1.

Select I n d i c a t o r V a r i a b l e s •=> T y p e

Unique.

2.

Select I n d i c a t o r V a r i a b l e s O R o l e

Input.

W i t h these settings, n e w inputs named M_original_input_name indicate the synthetic data values.

w i l l be added to the t r a i n i n g data to


4-24

Chapter 4 Introduction to Predictive Modeling: Regressions

Imputation Results Run the Impute node and review the Results w i n d o w . Three inputs had missing values. •r Results - Mode: Impute Diagram: Predictive Analysis File

Edit

View

Window

Imputation Summary Variable Name

Impute Method

DemAge MEAN GiftAvgCard...MEAN REP_Dem... MEAN

r/Rf Imputed Variable

Indicator Variable

IMP_DemA... M_DemAge IMP_GifiAvg... M_GiftAvgC... IMP_REP_... M_REP_De...

Impute Value Role

59.26291 INPUT 14.2049INPUT 53570.85INPUT

Measureme nt Level INTERVAL INTERVAL INTERVAL

[EI

Label

Number of Missing for TRAIN Age 1203 Gift Amount... 910 Repiaceme... 1191

H i Output

*

*

User: Date: Time:

*

n

* Training

Output

s

Variable

*

Summary

W i t h all o f the m i s s i n g values i m p u t e d , the entire training data set is available for b u i l d i n g the logistic regression m o d e l . In addition, a method is in place for scoring new cases w i t h m i s s i n g values. (See Chapter 7.)


4.1 Introduction

4-25

Running the Regression Node

There are several tools i n SAS Enterprise Miner to fit regression or regression-like models. By far, the most commonly used (and, arguably, the most useful) is the simply named Regression tool. Use the following steps to build a simple regression model. 1.

Select the Model tab.

2.

Drag a Regression tool into the diagram workspace.

3.

Connect the Impute node to the Regression node. ;

;; P r e d i c t i v e A n a l y s i s ;

XTUmpttte

1^

placement

Regression

Probability T r e e

-a Deciiion T r e e

JJ

J

>m"T~

<9

11

Q

I—J"

® ' r«r»l

S

9

I El

f

The Regression node can create several types o f regression models, including linear and logistic. The type o f default regression type is determined by the target's measurement level.


4-26

4.

Chapter 4 Introduction to Predictive Modeling: Regressions

Run the Regression node and v i e w the results. The Results - N o d e : Regression D i a g r a m w i n d o w opens.

5.

M a x i m i z e the Output w i n d o w by d o u b l e - c l i c k i n g its title bar. The initial lines o f the Output w i n d o w summarize the roles o f variables used (or not) by the Regression node. V a r i a b l e Summary ROLE

LEVEL

COUNT

INPUT

BINARY

INPUT INPUT

INTERVAL NOMINAL

3

REJECTED

INTERVAL

2

TARGET

BINARY

1

The fit model has 28 inputs that predict a binary target.

5 20


4.1 Introduction 4-27 Ignore the output related to model events and predicted and decision variables. The next lines give more information about the model, including the training data set name, target variable name, number o f target categories, and most importantly, the number o f model parameters. Model

Information

T r a i n i n g Data S e t

EMWS2.IMPT_TRAIN.VIEW

DMDB C a t a l o g

WORK.REG_DMDB

Target

TargetB (Target G i f t Flag)

Variable

Ordinal

T a r g e t Measurement L e v e l Number o f T a r g e t

Categories

2

Error

MBernoulli

Link Function Number o f Model P a r a m e t e r s

Logit 86

Number o f O b s e r v a t i o n s

4843

Based on the introductory material about logistic regression that is presented above, you might expect to have a number o f model parameters equal to the number o f input variables. This ignores the fact that a single nominal input (for example, D e m C l u s t e r ) can generate scores o f model parameters. Next, consider maximum likelihood procedure, overall model fit, and the Type 3 Analysis o f Effects. The Type 3 Analysis tests the statistical significance o f adding the indicated input to a model that already contains other listed inputs. A value near 0 in the Pr > ChiSq column approximately indicates a significant input; a value near 1 indicates an extraneous input. Type 3 A n a l y s i s o f E f f e c t s Wald Effect

DF

Chi-Square

Pr > ChiSq

DemCluster DemGender

53 2 1

58.9098

0.2682

0.5088 0.1630 2.4464

0.7754 0.6864 0.1178

DemHomeOwner DemMedHomeValue DemPctVeterans

1 1

5.2502

0.0219

GiftAvg36

1

1.6709

0.1961

GiftAvgAll

1

0.0339

0.8540

GiftAvgLast

1 1

0.0026

0.9593

1.2230

0.2688

GiftCntAll GiftCntCard36

1

0.1308

0.7176

1

1.0244

0.3115

GiftCntCardAll GiftTimeFirst

1

0.0061

1

1.6064

0.9380 0.2050

GiftTimeLast IMP_DemAge IMP_GiftAvgCard36

1

21.5351 0.0701

0.7911

GiftCnt36

1

<.0001

1 1

0.0476

0.8273

IMP_REP_DemMedIncome

0.1408

0.7074

M_DemAge

1

3.0616

0.0802

M_GiftAvgCard36

1 1

0.9190

0.3377

0.6228

0.4300

PromCntCard12

1 1 1 1

3.2335 1.0866 1.9715 0.0294

PromCntCard36

1

PromCntCardAll StatusCat96NK

1

0.0049 2.9149

0.0721 0.2972 0.1603 0.8639 0.9441

5

11.3434

0.0878 0.0450

StatusCatStarAll

1

1.7487

0.1860

M_REP_DemMedIncome PromCnt12 PromCnt36 PromCntAll


4-28

Chapter 4 Introduction to Predictive Modeling: Regressions The statistical significance measures a range from <0.0001 (highly significant) to 0.9593 (highly dubious). Results such as this suggest that certain inputs can be dropped without affecting the predictive prowess o f the model. Restore the Output window to its original size by double-clicking its title bar. Maximize the Fit Statistics window.

Target TargetB TargetB argetB argetB argetB argetB argetB argetB argetB argetB argetB argetB argetB argetB argetB | TargetB TargetB TargetB TargetB TargetB

Fit Statistics Statistics Train Validation Test Label _AIC_ Akaike's Inf... 6633.2 Average Sq... 0.237268 _ASE_ 0.24381 0.667066 _AVERR„ Average Err... 0.680861 Degrees of.. 4757 _DFE_ Model Degr... 86 _DFM_ Total Degre.. 4843 _DFT_ Divisor for A.. 9686 _DIV_ 9686 Error FunctL. 6461.2 _ERR_ 6594.821 Final Predic. 0.245847 _FPE_ Maximum A... 0.941246 0.841531 _MAX_ Mean Squa... 0.241558 0.24381 _MSE_ Sum ofFre... 4843 4843 _NOBS_ Number of... 86 RootAvera... _RASE_ 0.487102 0.493771 Root Final... _RFPE_ 0.49583 Root Mean... _RMSE_ 0.491485 0.493771 Schwa rz's... _SBC_ 7190.935 Sum of Squ.. _SSE_ 2298.179 2361.545 SumofCas.. _SUMW_ 9686 9686 Misclassific. MISC 0.411522 0.431964

I f the decision predictions are o f interest, model fit can be judged by misclassification. I f estimate predictions are the focus, model fit can be assessed by average squared error. There appears to be some discrepancy between the values of these two statistics in the train and validation data. This indicates a possible overfit o f the model. It can be mitigated by using an input selection procedure. 7.

Close the Results window.


4.2 Selecting Regression Inputs

4-29

4.2 Selecting Regression Inputs Model Essentials - Regressions

Select useful inputs.

Sequential selection

38

The second task that all predictive models should perform is input selection. One way to find the optimal set o f inputs for a regression is simply to try every combination. Unfortunately, the number o f models to consider using this approach increases exponentially in the number o f available inputs. Such an exhaustive search is impractical for realistic prediction problems. A n alternative to the exhaustive search is to restrict the search to a sequence o f improving models. While this might not find the single best model, it is commonly used to find models with good predictive performance. The Regression node in SAS Enterprise Miner provides three sequential selection methods.


4-30

Chapter 4 Introduction to Predictive Modeling: Regressions

Sequential Selection - Forward Input p-value

Entry Cutoff

• • • • • • • 1 1 • • • • • • • •

43

Forward selection creates a sequence o f models of increasing complexity. The sequence starts with the baseline model, a model predicting the overall average target value for all cases. The algorithm searches the set o f one-input models and selects the model that most improves upon the baseline model. It then searches the set o f two-input models that contain the input selected in the previous step and selects the model showing the most significant improvement. By adding a new input to those selected in the previous step, a nested sequence o f increasingly complex models is generated. The sequence terminates when no significant improvement can be made. Improvement is quantified by the usual statistical measure o f significance, the p-value. Adding terms in this nested fashion always increases a model's overall fit statistic. By calculating the change in the fit statistic and assuming that the change conforms to a chi-squared distribution, a significance probability, or p-value, can be calculated. A large fit statistic change (corresponding to a large chi-squared value) is unlikely due to chance. Therefore, a small p-value indicates a significant improvement. When no p-value is below a predetermined entry cutoff, the forward selection procedure terminates.


4.2 Selecting Regression Inputs

4-31

Sequential Selection - Backward

• •••• •••• •••• • •• • • • • Input p-value

m

Stay Cutoff

mm

|

i

• • •

51

In contrast to f o r w a r d selection, backward selection creates a sequence o f models o f d e c r e a s i n g c o m p l e x i t y . The sequence starts w i t h a saturated model, w h i c h is a model that contains all available inputs, and therefore, has the highest possible fit statistic. Inputs are sequentially r e m o v e d f r o m the m o d e l . A t each step, the input chosen for removal least reduces the overall m o d e l fit statistic. This is equivalent to r e m o v i n g the input w i t h the highest p-value. The sequence terminates w h e n all r e m a i n i n g inputs have a p-value that is less than the predetermined stay cutoff.


4-32

Chapter 4 Introduction to Predictive Modeling: Regressions

Sequential Selection - Stepwise Input p-value

•

•

Entry Cutoff Stay Cutoff

58 Stepwise selection combines elements f r o m both the forward and backward selection procedures. The method begins in the same way as the f o r w a r d procedure, sequentially adding inputs w i t h the smallest p-value b e l o w the entry cutoff. A f t e r each input is added, however, the algorithm reevaluates the statistical significance o f all included inputs. I f the p-value o f any o f the included inputs exceeds the stay cutoff, the input is removed f r o m the model and reentered into the pool o f inputs that are available for inclusion in a subsequent step. The process terminates when all inputs available for inclusion in the model have p-values in excess o f the entry c u t o f f and all inputs already included in the model h a v e p - v a l u e s b e l o w the stay cutoff.


4.2 Selecting Regression Inputs

Selecting Inputs

I m p l e m e n t i n g a sequential selection method in the Regression node requires a m i n o r change to the Regression node settings. 1.

Select Selection M o d e l •=> Stepwise on the Regression node property sheet.

Train Variables

m

B i M a i n Effects Yes r Two-Factor InteracticNo ^Polynomial Terms No r-Polynomial Degree 2 No f-UserTerms --Term Editor '•-Regression Type - Link Function

~U

Loqistic Regression Logit

E3 Suppress Intercept Input Coding Deviation B lection -Selection Model ;Step_wise -Selection Criterion Default -Use Selection DefaiYes -Selection Options

!...

The Regression node is n o w c o n f i g u r e d to use stepwise selection to choose inputs for the m o d e l . 2.

Run the Regression node and v i e w the results.

3.

M a x i m i z e the O u t p u t w i n d o w .

4.

H o l d d o w n the C T R L key and type G. The Go To Line w i n d o w opens.

OK

Enter line number:

. 5.

79

Type 7 9 in the E n t e r

— line

Cancel

n u m b e r field and select O K .

4-33


4-34

Chapter 4 Introduction to Predictive Modeling: Regressions

The stepwise procedure starts with Step 0, an intercept-only regression model. The value o f the intercept parameter is chosen so that the model predicts the overall target mean for every case. The parameter estimate and the training data target measurements are combined in an objective function. The objective function is determined by the model form and the error distribution o f the target. The value o f the objective function for the intercept-only model is compared to the values obtained in subsequent steps for more complex models. A large decrease in the objective function for the more complex model indicates a significantly better model. S t e p 0: I n t e r c e p t e n t e r e d . The DMREG P r o c e d u r e Newton-Raphson Ridge

Optimization

Without P a r a m e t e r S c a l i n g Parameter

Estimates

Optimization Active Constraints Max Abs G r a d i e n t

0

Element

1

Start

Objective Function

3356.9116922

5.707879E-12 Optimization

Results

Iterations

0

Function C a l l s

3

Hessian

1

Active Constraints

0

Calls

Objective Function

3356.9116922

Ridge

0

Convergence c r i t e r i o n

(ABSGC0NV=0.00001)

Max Abs G r a d i e n t

5.707879E-12 0

satisfied.

L i k e l i h o o d R a t i o T e s t f o r G l o b a l N u l l Hypothesis: -2 Log L i k e l i h o o d

Element

A c t u a l Over Pred Change

BETA=0

Likelihood

Intercept

Intercept &

Ratio

Only

Covariates

Chi-Square

DF

6713.823

6713.823

0.0000

0

Pr > ChiSq

A n a l y s i s o f Maximum L i k e l i h o o d

Estimates

Parameter

DF

Estimate

Standard Error

Wald Chi-Square

Pr > ChiSq

Intercept

1

-0.00041

0.0287

0.00

0.9885

Standardized Estimate

Exp(Est) 1 .000


4.2 Selecting Regression Inputs

4-35

Step 1 adds one input to the intercept-only model. The input and corresponding parameter are chosen to produce the largest decrease in the objective function. To estimate the values o f the model parameters, the modeling algorithm makes an initial guess for their values. The initial guess is combined with the training data measurements in the objective function. Based on statistical theory, the objective function is assumed to take its minimum value at the correct estimate for the parameters. The algorithm decides whether changing the values o f the initial parameter estimates can decrease the value o f the objective function. I f so, the parameter estimates are changed to decrease the value o f the objective function and the process iterates. The algorithm continues iterating until changes in the parameter estimates fail to substantially decrease the value of the objective function. S t e p 1: E f f e c t G i f t C n t 3 6 e n t e r e d . The

DMREG P r o c e d u r e

Newton-Raphson Ridge

Optimization

Without P a r a m e t e r S c a l i n g Parameter E s t i m a t e s Optimization Active Constraints Max

Abs

Gradient

0

Element

2

Start

Objective Function

3356.9116922

89.678463762 Ratio Between Actual Abs

and

Function

Gradient

Predicted

Objective Function

Active

Objective

Constraints

Max

Iter

Restarts

Calls

Function

Change

Element

Ridge

1

0

2

0

3316

41.4036

2.1746

0

Change 1.014

2

0

3

0

3315

0.0345

0.00690

0

1.002

3

0

4

0

3315

2.278E-7

4.833E-8

0

1.000

Optimization Results Iterations

3

Function

Hessian

4

Active Constraints

Calls

Objective

Function

3315.473573

Ridge

0

Convergence c r i t e r i o n

(GC0NV=1E-6)

Max

Abs

Calls Gradient

6 0 4.833086E-8

Element

A c t u a l Over P r e d Change

0.999858035

satisfied.

The output next compares the model fit in Step 1 with the model fit in Step 0. The objective functions o f both models are multiplied by 2 and differenced. The difference is assumed to have a chi-squared distribution with one degree o f freedom. The hypothesis that the two models are identical is tested. A large value for the chi-squared statistic makes this hypothesis unlikely. The hypothesis test is summarized in the next lines. Likelihood Ratio Test f o r Global Null Hypothesis: -2 Log L i k e l i h o o d Intercept Intercept & Only Covariates 6713.823

6630.947

Likelihood Ratio Chi-Square

DF

Pr > C h i S q

82.8762

1

<.0001

BETA=0


4-36

Chapter 4 Introduction to Predictive Modeling: Regressions

The output summarizes an analysis o f the statistical significance o f individual model effects. For the one input model, this is similar to the global significance test above. Type 3 A n a l y s i s o f E f f e c t s Wald Effect

DF

Chi-Square

Pr > C h i S q

1

79.4757

<.0001

GiftCnt36

Finally, an analysis o f individual parameter estimates is made. (The standardized estimates and the odds ratios merit special attention and are discussed in the next section o f this chapter.) A n a l y s i s of Maximum Likelihood Estimates Standard

Wald

Error

Chi-Square

Standardized

Parameter

DF

Estimate

Intercept

1

-0.3956

0.0526

56.53

<.0001

GiftCnt36

1

0.1250

0.0140

79.48

<.0001

Pr > ChiSq

Estimate

Exp(Est) 0.673

0.1474

1.133

Odds Ratio Estimates Point Effect

Estimate

GiftCnt36

1.133

The standardized estimates present the effect o f the input on the log-odds o f donation. The values are standardized to be independent o f the input's unit o f measure. This provides a means o f ranking the importance o f inputs in the model. The odds ratio estimates indicate by what factor the odds o f donation increase for each unit change in the associated input. Combined with knowledge o f the range o f the input, this provides an excellent way to judge the practical (as opposed to the statistical) importance o f an input in the model. The stepwise selection process continues for eight steps. After the eighth step, neither adding nor removing inputs from the model significantly changes the model fit statistic. A t this point the Output window provides a summary o f the stepwise procedure. 6.

Go to line 850 to view the stepwise summary. The summary shows the step in which each input was added and the statistical significance o f each input in the final eight-input model. NOTE: No ( a d d i t i o n a l ) e f f e c t s met t h e 0.05 s i g n i f i c a n c e l e v e l f o r e n t r y i n t o t h e model. Summary o f S t e p w i s e Number

Score

Wald

DF

In

Chi-Square

Chi-Square

1 1

1 2 3

81.6807 23.2884 16.9872

<.0001 <.0001 <.0001

Effect Step

Entered

Selection

Pr > C h i S q

1 2 3

GiftCnt36

4

GiftAvgAll

1

StatusCat96NK

5

4 5

14.8514

5

18.2293

0.0001 0.0027

6

DemPctVeterans

1

6

7.4187

0.0065

7

M_GiftAvgCard36

1

7

7.1729

8

M_DemAge

1

8

4.6501

0.0074 0.0311

GiftTimeLast DemMedHomeValue

1


4.2 Selecting Regression Inputs

4-37

The default selection criterion selects the model from Step 8 as the model with optimal complexity. As the next section shows, this might not be the optimal model, based on the fit statistic that is appropriate for your analysis objective. The selected model, based on the CH0OSE=NCrÂŁ criterion, i s the model trained i n Step 8. I t consists of the following effects: Intercept

DenMecHomeValue

DenPctVeterans GiftAvpAll GiftCnt36

GiftTimeLast

M_DemAge M_GiftAvgCard36 StatusCat96rl<

For convenience, the output from Step 8 is repeated. A n excerpt from the analysis o f individual parameter estimates is shown below. A n a l y s i s of Maximum L i k e l i h o o d Estimates Standard Parameter

DF

Estimate

Wald

Standardized

Error

Chi-Square

Pr > ChiSq

Estimate

Exp(Est)

Intercept

1

0.2727

0.2024

1

1.385E-6

3.009E-7

1.82 21.18

0.1779

DemMedHomeValue

<.0001

0.0763

DemPctVeterans

1

0.00658

0.00261

6.38

0.0115

0.0412

1.007

GiftAvgAll

1

-0.0136

0.00444

9.33

0.0023

-0.0608

0.987

GiftCnt36

1

0.0587

0.0187

9.79

0.0018

0.0692

1.060

GiftTimeLast

1

-0.0376

0.00770

23.90

<.0001

-0.0837

0.963

1.314

M_DemAge

0

1

0.0741

0.0344

4.65

0.0311

1.077

M_GiftAvgCard36

0

1

0.1112

0.0411

7.30

StatusCat96NK

A

1

-0.0880

0.0927

0.90

0.0069 0.3423

0.916

StatusCat96NK

E F

1 1

0.4974

StatusCat96NK StatusCat96NK

L

StatusCat96NK

N

0.1818 0.1303

7.48

0.0062

1.644

12.30

0.0005

0.633

1

0.3735

0.15

0.6966

1.157

1

-0.1206

0.1323

0.83

0.3621

0.886

Restore the Output window and maximize the Fit Statistics window. H f Fit Statistics

1.118

-0.4570 0.1456

The parameter with the largest standardized estimate (in absolute value) is G i f t T i m e L a s t . 7.

1.000

WM&

Target

Fit Statistics Statistics Label

TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB argetB argetB argetB argetB TargetB TargetB TargetB

_AIC_ _ASE_ _AVERR_ _DFE_ _DFM_ _DFT_ _DIV_ _ERR_ _FPE_ _MAX_ _MSE_ _NOBS_ _nw_ _RASE_ _RFPE_ _RMSE_ _SBC_ _SSE_ _SUMW_ _MISC

Akaike's Inf... Average Sq... Average Err... Degrees of ... Model Degr... Total Degre... Divisor for A... Error Functi... Final Predic... Maximum A... Mean Squa... Sum of Fre... Number of... RootAvera... Root Final... Root Mean... Schwarz*s... Sum of Squ... Sum of Cas... Misclassific...

Train 6563.093 0.240919 0.674901 4830 13 4843 9686 6537.093 0.242216 0.965413 0.241567 4843 13 0.490835 0.492154 0.491495 6647.402 2333.54 9686 0.421433

Validation

Test

0.242336 0.678517

9686 6572.12 0.998582 0.242336 4843 0.492277 0.492277 2347.27 9686 0.42453

The simpler model improves on both the validation misclassification and average squared error measures o f model performance.


4-38

Chapter 4 Introduction to Predictive Modeling: Regressions

4.3 Optimizing Regression Complexity Model Essentials - Regressions

Optimize complexity.

!

B e s t m o d e

l

! from sequence

61

Regression complexity is optimized by choosing the optimal model in the sequential selection sequence.

Model Fit versus Complexity Model fit statistic

| Evaluate each sequence step.

>xm~m ar.m im irnra tanra tmmm mmm

1

2

3

4

5

6

62

The process involves two steps. First, fit statistics are calculated for the models generated in each step o f the selection process. Both the training and validation data sets are used.


4.3 Optimizing Regression Complexity

4-39

Select Model with Optimal Validation Fit Model fit statistic

1t

Choose simplest optimal model. 3

63 T h e n , as w i t h the decision tree in Chapter 3. the simplest model (that is, the one w i t h the fewest inputs) w i t h the o p t i m a l f i t statistic is selected.


4-40

Chapter 4 Introduction to Predictive Modeling: Regressions

Optimizing Complexity

Iteration Plot The following steps illustrate the use o f the iteration plot in the Regression tool Results window. In the same manner as the decision tree, you can tune a regression model to give optimal performance on the validation data. The basic idea involves calculating a fit statistic for each step in the input selection procedure and selecting the step (and corresponding model) with the optimal fit statistic value. To avoid bias, o f course, the fit statistic should be calculated on the validation data set. 1.

Select View ÂŤ=> Model ÂŤ=> Iteration Plot. The Iteration Plot window opens.

Average Squared Error 0.2500 -

0.2475-

0.2450-

02425-

024002

4

Model Selection Step Number •Train: Average Squared Error -Valid: Average Squared Error

J

The Iteration Plot window shows (by default) average squared error (training and validation) from the model selected in each step o f the stepwise selection process. f

Surprisingly, this plot contradicts the naive assumption that a model fit statistic calculated on training data w i l l always be better than the same statistic calculated on validation data. This concept, called the optimism principle, is correct only on the average, and usually manifests itself only when overly complex (overly flexible) models are considered. It is not uncommon for training and validation fit statistic plots to cross (possibly several times). These crossings illustrate unquantified variability in the fit statistics.

Apparently, the smallest average squared error occurs in Step 4, rather than in the final model, Step 8. I f your analysis objective requires estimates as predictions, the model from Step 4 should provide slightly less biased ones.


4.3 Optimizing Regression Complexity 2.

4-41

Select Select Chart ÂŤ=> Misclassification Rate.

Misclassification Rate 0.500 -

0.475 -

0.450 -

0.425 -

Model Selection Step Number Train: Misclassification Rate -Valid: Misclassification Rate The iteration plot shows that the model with the smallest misclassification rate occurs in Step 3. I f your analysis objective requires decision predictions, the predictions from the Step 3 model are as accurate as the predictions from the final Step 8 model. The selection process stopped at Step 8 to limit the amount o f time spent running the stepwise selection procedure. In Step 8, no more inputs had a chi-squared p-value below 0.05. The value 0.05 is a somewhat arbitrary holdover from the days of statistical tables. With the validation data available to gauge overfitting, it is possible to eliminate this restriction and obtain a richer pool o f models to consider.


4-42

Chapter 4 Introduction to Predictive Modeling: Regressions

Full M o d e l S e l e c t i o n Use the f o l l o w i n g steps to b u i l d and evaluate a larger sequence o f regression models: 1.

Close the Results - Regression w i n d o w .

2.

Select Use Selection Default •=> No f r o m the Regression node Properties panel.

Train Variables

UJ

[Main Effects Yes [-Two-Factor InteracticNo '•Polynomial Terms No [-Polynomial Degree 2 [UserTerms No " T e r m Editor

y

Mciass Targets [-Regression Type -|_ink Function BModel Options '[•Suppress Intercept -Input Coding ;

Logistic Regression Logit No Deviation

[-Selection Model Stepwise [-Selection Criterion Default [-Use Selection DefaiNo ^Selection Options 3.

Select Selection Options "=> ^ J .

wmm i...


4.3 Optimizing Regression Complexity

The Selection Options w i n d o w opens.

J

Property Sequential Order Entry Significance Level Stay Significance Level Start Variable Number Stop Variable Number Force Candidate Effects Hierarchy Effects Moving Effect Rule Maximum Number of Steps

Value No 0.05 0.05 0 0 0 Class None 0

Sequential O r d e r Specifies whether to add or remove variables in the order that is specified in the M O D E L statement.

OK

Cancel

4-43


4-44

Chapter 4 Introduction to Predictive Modeling: Regressions

4.

Type 1 . 0 as the Entry Significance Level value.

5.

T y p e 0 . 5 as the Stay Significance L e v e l value.

I _ Selection Options Property Sequential Order Entry Significance Level Stay Significance Level Start Variable Number Stop Variable Number Force Candidate Effects Hierarchy Effects Moving Effect Rule Maximum Number of Steps

Value No 1.0 0.5 0 0 0 Class None 0

Stay Significance L e v e l Significance level for removing variables in backward or stepwise regression.

OK

Cancel

The Entry Significance value enables any input in the model. (The one chosen w i l l have the smallest p-value.) T h e Stay Significance value keeps any input in the model w i t h a p-value less than 0.5. This second choice is somewhat arbitrary. A smaller value can terminate the stepwise selection process earlier, w h i l e a larger value can maintain it longer. A Stay Significance o f 1.0 forces stepwise to behave in the manner o f a f o r w a r d selection. 6.

Run the Regression node and v i e w the results.


4.3 Optimizing Regression Complexity 7.

Select View

4-45

Model ÂŤ=> Iteration Plot. The Iteration Plot window opens.

Average Squared Error 0.250-

0245-

0.240 -

15

Model Selection Step Number •Train: Average Squared Error Valid: Average Squared Error The iteration plot shows the smallest average squared errors occurring in Steps 4 or 12. There is a significant change in average squared error in Step 13, when the D e m C l u s t e r input is added. Inclusion o f this nonnumeric input improves training performance but hurts validation performance.


4-46

Chapter 4 Introduction to Predictive Modeling: Regressions

Select Select C h a r t •=> M i s c l a s s i f i c a t i o n R a t e . Iteration Plot

r/eT m

Misclassification Rate 0.500 -

0.475 -

0.450 -

0.425 -

0.400 10

15

r

20

M o d e l S e l e c t i o n Step N u m b e r Train: Misclassification Rate 'Valid: Misclassification Rate

The iteration plot shows that the smallest validation misclassi fication rates occur at Step 3. N o t i c e that the change in the assessment statistic in Step 13 is m u c h less pronounced.

Best S e q u e n c e Model Y o u can c o n f i g u r e the Regression node to select the model w i t h the smallest f i t statistic (rather than the final stepwise selection iteration). This method is how SAS Enterprise M i n e r optimizes c o m p l e x i t y for regression models. 1.

Close the Results - Regression w i n d o w .

2.

I f y o u r predictions are decisions, use the f o l l o w i n g setting: Select Selection C r i t e r i o n •=> V a l i d a t i o n M i s c l a s s i f i c a t i o n . ( E q u i v a l e n t l y , y o u can select V a l i d a t i o n P r o f i t / L o s s . The equivalence is demonstrated in Chapter 6.)

3.

I f y o u r predictions are estimates (or rankings), use the f o l l o w i n g setting: Select Selection C r i t e r i o n => V a l i d a t i o n E r r o r . B

•Selection Model Stepwise -Selection Criterion Validation Error Use Selection DefauNo -Selection Options f

The c o n t i n u i n g demonstration assumes validation error selection criteria.


4.3 Optimizing Regression Complexity 4.

Run the Regression node and view the results.

5.

Select View

Model

4-47

Iteration Plot.

Average Squared Error 0250-

0245-

0240 -

Model Selection Step Number •Train: Average Squared Error -Valid: Average Squared Error

The vertical blue line shows the model with the optimal validation error (Step 12). 6.

Go to line 2690. A n a l y s i s of Maximum Likelihood Estimates

Parameter

DF

Estimate

Standard

Wald

Error

Chi-Square

Pr > ChiSq

Standardized Estimate

Exp(Est)

Intercept

1

0.4999

0.2575

3.77

0.0522

DemMedHomeValue DemPctVeterans

1

1.416E-6

3.011E-7

22.12

<.0CO1

0.0781

1.000

1

0.00651

0.00261

6.23

GiftAvg36

1

-0.0101

0.00355

8.02

0.0126 0.0046

0.0407 -0.0561

0.990

GiftCnt36

1

0.0574

0.0197

8.53

0.0035

0.0677

1.059

GiftTimeLast M_DemAge

1

-0.0415

0.00829

<.0001

-0.0923

0.959

0

1

0.0720

0.0345

25.07 4.36

M_GiftAvgCard36

0

1

0.1126

0.0412

7.46

0.0063

1

-0.0381

0.0281

1.85

0.1740

PromCntCard12

1.649

0.0367

1.007

1.075 1.119 -0.0282

0.963

StatusCat96NK

A

1

-0.0353

0.0957

0.14

0.7122

0.965

StatusCat96NK StatusCat96NK

E F

1 1

0.4010 -0.4485

0.1950 0.1314

4.23 11.66

0.0398 0.0006

1.493 0.639

StatusCat96NK

L

1

0.1733

0.3743

0.21

0.6433

1.189

StatusCat96NK

N 0

1

-0.0988

0.1353

0.53

0.906

1

-0.0701

0.0367

3.64

0.4649 0.0563

StatusCatStarAll

0.932


4-48

Chapter 4 Introduction to Predictive Modeling: Regressions

While not all the p-values are less than 0.05, the model seems to have a better validation average squared error (and misclassification) than the model selected using the default Significance Level settings. In short, there is nothing sacred about 0.05. It is not unreasonable to override the defaults o f the Regression node to enable selection from a richer collection o f potential models. On the other hand, most o f the reduction in the fit statistics occurs during inclusion o f the first three inputs. I f you seek a parsimonious model, it is reasonable to use a smaller value for the Stay Significance parameter.


4.4 Interpreting Regression Models

4-49

4.4 Interpreting Regression Models Beyond the Prediction Formula

Interpret the model.

66

After you build a model, you might be asked to interpret the results. Fortunately regression models lend themselves to easy interpretation.


4-50

Chapter 4 Introduction to Predictive Modeling: Regressions

Odds Ratios and Doubling Amounts A Wo'

Doubling amount Input change is required to double odds.

Ax,

consequence

1

=> odds x exp(tv,)

M2=>ocf<fex2

X?

logit

scores

Odds ratio: Amount odds change with unit change in input.

69

There are two equivalent ways to interpret a logistic regression model. Both relate changes in input measurements to changes in odds o f primary outcome. • A n odds ratio expresses the increase in primary outcome odds associated with a unit change in an input. It is obtained by exponentiation o f the parameter estimate o f the input o f interest. • A doubling amount gives the amount o f change required for doubling the primary outcome odds. It is equal to log(2) « 0.69 divided by the parameter estimate o f the input o f interest. f

I f the predicted logit scores remain in the range -2 to +2, linear and logistic regression models o f binary targets are virtually indistinguishable. Balanced stratified sampling (Chapter 6) often ensures this. Thus, the prevalence o f balanced sampling in predictive modeling might, in fact, be a vestigial practice from a time when maximum likelihood estimation was computationally extravagant.


4.4 Interpreting Regression Models

4-51

Interpreting a Regression Model

The following steps demonstrate how to interpret a model using odds ratios: 1.

Go to line 2712 o f the regression model output. Odds R a t i o E s t i m a t e s Point Effect

Estimate

DemMedHomeValue DemPctVeterans

1.000

GiftAvg36

0.990

1.007

GiftCnt36

1.059

GiftTimeLast M_DemAge

0.959 0 vs 1

M_GiftAvgCard36

0 vs 1

1 .253

PromCntCard12 StatusCat96NK

A vs S

0.963 0.957

StatusCat96NK

E vs S

1 .481

StatusCat96NK

F vs S

0.633

1.155

StatusCat96NK

L vs S

1.179

StatusCat96NK

N vs S

0.898

StatusCatStarAll

0 vs 1

0.869

This output includes most o f situations you will encounter when you build a regression model. For G i f tAvg36, the odds ratio estimate equals 0.990. This means that for each additional dollar donated (on average) in the past 36 months, the odds of donation on the 97NK campaign change by a factor o f 0.99, a 1 % decrease. For G i f tCnt36, the odds ratio estimate equals 1.059. This means that for each additional donation in the past 36 months, the odds o f donation on the 97NK campaign change by a factor o f 1.059, a 5.9% increase. For MJDemAge, the odds ratio (0 versus 1) estimate equals 1.155. This means that cases with a 0 value for M_DemAge are 1.155 times more likely to donate than cases with a 1 value for M_DemAge. f

2.

The unusual value o f 1.000 for the DemMedHomeValue odds ratio has a simple explanation. Unit (that is, single dollar) changes in home value do not change the odds o f response by an amount captured in three significant digits. To obtain a more meaningful value for this input's effect on response odds, you can multiply the parameter estimate by 1000 and exponentiate the result. You then have the change in response odds based on 1000 dollar changes in median home value. Equivalently, you could use the Transform Variables node to replace DemMedHomeValue with DemMedHmVall000=DemMedHomeValue/1000, and a unit increase on that new input would represent a $1000 increase in the DemMedHomeValue.

Close the Results window.


4-52

Chapter 4 Introduction to Predictive Modeling: Regressions

4.5 Transforming Inputs Beyond the Prediction Formula

Handle extreme or unusual values.

72

Classical regression analysis makes no assumptions about the distribution o f inputs. The only assumption is that the expected value o f the target (or some function thereof) is a linear combination o f fixed input measurements. Why should you worry about extreme input distributions?


4.5 Transforming Inputs

4-53

Extreme Distributions and Regressions Original I n p u t Scale

true association -gr

i

"

bundard regression

,

• standardmpm i ifnif • skewed input distribution

^

true association high leverage points

74

There are at least two compelling reasons. • First, in most real-world applications, the relationship between expected target value and input value does not increase without bound. Rather, it typically tapers off to some horizontal asymptote. Standard regression models are unable to accommodate such a relationship. • Second, as a point expands from the overall mean o f a distribution, the point has more influence, or leverage, on model fit. Models built on inputs with extreme distributions attempt to optimize fit for the most extreme points at the cost o f fit for the bulk o f the data, usually near the input mean. This can result in an exaggeration or an understating o f an input's association with the target. Both o f these phenomena are seen in the above slide. f

The first concern can be addressed by abandoning standard regression models for more flexible modeling methods. Abandoning standard regression models is often done at the cost o f model interpretability and, more importantly, failure to address the second concern o f leverage.


4-54

Chapter 4 Introduction to Predictive Modeling: Regressions

Extreme Distributions and Regressions Regularized Scale

Original Input Scale true association -sjg '*' 1

"^standard regression , ~~

i

!

* standardmifin i

" :

skewed input distribution

high leverage points

more symmetric distribution

76

A simpler and, arguably, more effective approach transforms or regularizes offending inputs in order to eliminate extreme values.

Regularizing Input Transformations Original Jriput 'jcaly

Regularized Scale

standard/egression B

skewed input distribution

high leverage points

rf j « • • • J

more symmetric distribution

Then, a standard regression model can be accurately fit using the transformed input in place o f the original input.


4.5 Transforming Inputs

4-55

Regularizing Input Transformations Regularized Scale

Original I n p u t Scale

true association u

regularized estimate

• ,

.

.

f*~** • '

regularized estimate °

"

standard regression \ • .«.-•

i

j

• standard/epression

true association

78

Often this can solve both problems mentioned above. This not only mitigates the influence o f extreme cases, but also creates the desired asymptotic association between input and target on the original input scale.


Chapter 4 Introduction to Predictive Modeling: Regressions

4-56

Transforming Inputs

Regression models are sensitive to extreme or outlying values in the input space. Inputs with highly skewed or highly kurtotic distributions can be selected over inputs that yield better overall predictions. To avoid this problem, analysts often regularize the input distributions using a simple transformation. The benefit o f this approach is improved model performance. The cost, o f course, is increased difficulty in model interpretation. The Transform Variables tool enables you to easily apply standard transformations (in addition to the specialized ones seen in Chapter 9) to a set o f inputs.

The Transform Variables Tool Use the following steps to transform inputs with the Transform Variables tool: 1.

Remove the connection between the Data Partition node and the Impute node. | • Predictive Analysis L:

2.

Select the Modify tab.

3.

Drag a Transform Variables tool into the diagram workspace.

4.

Connect the Data Partition node to the Transform Variables node.

5.

Connect the Transform Variables node to the Impute node.

H B E3


4.5 Transforming Inputs

6.

4-57

A d j u s t the diagram icons for aesthetics. (So that y o u can see the entire d i a g r a m , the z o o m level is reduced.) • Predictive Analysis

i

^

-&

-—»>r^j* . ™ pl

tr

rt

^ 6

The T r a n s f o r m Variables node is placed before the Impute node to keep the imputed values at the average (or center o f mass) o f the model inputs. 7.

Select the V a r i a b l e s ^>

property o f the Transform Variables node. Value

Prope

General

Node ID Imported Data Exported Data Notes

Trans

U

m

Train

• •

Variables Formulas Interactions SAS Code

...

u

Ibefault Methods '•-Interval Inputs p Interval Targets |; Class Inputs •-Class Targets

[slsample

None None None None

Properties First N j-Method Default [-Size 12345 - R a n d o m Seed


4-58

Chapter 4 Introduction to Predictive Modeling: Regressions

The Variables - Trans w i n d o w opens.

(none)

not

Name Method DemAge Default DemCluster Default DemGender Default D e rn Ho m e OwDef autt DemMedHomOefault DemMedlnconOefautt DemPctVeteraDefault GiftAvg36 Default GiftAvgAII DefauH GiftAvgCard36Default GifWvgLast Default GiftCnt36 Default GiftCntAII Default GiftCntCard3 6 Default GirtCnlCardAil DefauH GiftTimeFirst Default GiftTimeLast Default PromCntl2 Default PromCnt36 Default PromCntAII Default Pro mCntC a rd Default PromCntCard Default PromCntCard/Default REP DernMecDefault Statu sCat9 6NDefault Statu sCalStar/Oefault

Equal to

• Number of Bins

Apply Role 4.0 Input 4.0 Input 4.0 input 4.0'lnput 4.0 Input 4.0 Rejected 4.0 Input 4.0 Input 4.0'lnput 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4J)'lnput 4.0'lnput 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input

Level Interval Nominal Nominal Binary Interval Interval Interval Interval Interval (Interval Interval Interval Inlerval Inlerval Inlerval Interval Inlerval Interval Inlerval 'interval Interval Interval Interval Interval Nominal Binary

Type

Order

Reset Label

Numeric Character Character Character Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Character Numeric

Age Demograph Gender Home Owne Median Horr Median Incoi PercentVete GiftAmountJ • GiftAmountJ GiftAmountJ Gift Amount 1 Gift Count 3 £ Gift Count A l l GiftCountC Gift Count C; Times Since Time Since 1 1 Promotion Cj:. Promotion C Promotion C Promotion C : Promotion O Promotion 0 Repiacemer Status Cated Status Cated*

<

! H Update Path

OK

Cancel


4.5 Transforming Inputs

Select all inputs w i t h M.b, ..^lr.r. i

•

(none)

Gift

in the name.

M

not

Method

Name

4-59

Default DemAge DemCluster Default DemGender Default DemHomeOwDefauft DemMedHomOefault D e m M ed In c o nDefault DemPctVeteraDefault Default GiftAvg36 GiftAvgAII Default GiftAvgCard36Default GiftAvgLasl Default Default GiftCnt36 Default GiftCntAII GiflCntC a rd 36 Default GiftCntCardAII Default GiftTimeFirst Default GiftTimeLast Default PromCnt12 Default PromCnt36 Default PromCntAII Default Pro mCntC a rd Default P ro mCntC a rd Default P ro mC ntC ard/Default REP DernMecOefautt Status Cat9 6 N Default Status CatStar/Defautt

Apply

Equal to

Number of Bins

Role 4.0 Input 4.0Jnput 4.0 Input 4.0 Input 4.0 Input 4.0 Rejected 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4,0 Input 4.0 Input 4.0 input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0'lnput 4.0 Input

Level Interval Nomina] Nominal Binary Interval Interval Interval Interval Interval Inlerval Inlerval Interval Interval Interval Interval Inlerval Interval Interval Interval Interval Inlerval Interval Interval Interval Nominal Binary

Type

Label

Order

Age k Demographj Gender Home Owne Median Hon"| Median Incoi Percent Vetej Gift Amount; Gift Amount/ Gift Amount/ Gift Amount Gift Count 3^ Gift Count All Gift Count cj Gift Count Cj Times Since Time Since I Promotion C Promotion Cp : Promotion C Promotion C Promotion 0 Promotion Cj, Replacemer^" Status Catec Status Caterj*

Numeric Character Character Character Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Character Numeric

:

:

in

<

Explore..

Update Path

Reset

OK

Cancel


Chapter 4 Introduction to Predictive Modeling: Regressions

4-60

9.

Select E x p l o r e . . . . T h e E x p l o r e w i n d o w opens.

mWKM File

View

Actions

Window

1 0 ^ Sample Properties ; ; :

:

Property •Rows

4843

$133 p

EMWS.Par1_lRAlN Obs#

A

< | |

l°t~

r / l ^ 0

|

U GtftAvg36 X 4000 a) 2000 u. 0--

I 40005 2000¬ £ o .-.

1

i

[ - 1 •

i

| .

0

.

Gift Amount Average 36 M... n*Ef

t 4000 5 2000u0

i

| -.

• i

r

.

$100.00 $200.00

0.0

m

1000

5.8 11.G 17.4 23.2 20.0

Gift Count Card All Months t l GiftTimeFirst

U GfftCnt36

i

$104.00 $208:00

U GiftAvgAII

1500 - f — I — , 1000 500¬ 0

Gift Amount Last

• 1500 -

— i — i

i

$0.00

-

$0.00

!

1500 i | 1000- • — 500n - J—L \—i—I—'—I—i—I—'—I—r 0.0 1.8 3.6 5.4 7.2 9.0 Gift Count Card 36 Months U GiftCntCardAII

isi

Ef

$104.80 $208.27

I U GmAvgLast

n

:

L i G(ftCntCard36

Gift Amount Average Card...

|l

DATAOBS | Tarqet Git 1 4 5 7

1 2 3

l i GiftAvgCard36

I

Apply 3

m\

r/sf

Value

f - i

- i — r ~ r~TT

1000 500 0

t — . — i — i — i — i — i — i — i — • — r

* 8 : I U 0.0 3.0 6.0 9.0 12.0 15.0

15.0

79.5 144.0

208.5

Time Since First Gift GiftTimeLast

r^Ef

IH

IB) — i — — i —— i I • $1.50 $30.90 $11)0.30 1

1

1

1

Gift Amount Average All M...

- i — i — i— —i — • — i — i — i — i — r 4.0 8.6 13.2 17.8 22.4 27.0 Time Since Last Gift 1

G i f tAvg and G i f t C n t inputs show some degree o f skew i n t h e i r d i s t r i b u t i o n . T h e G i f tTime inputs do not. T o regularize the skewed d i s t r i b u t i o n s , use the l o g t r a n s f o r m a t i o n .

The

For

these i n p u t s , the order o f m a g n i t u d e o f the u n d e r l y i n g measure predicts the target rather than the measure itself. I 0 . C l o s e the E x p l o r e w i n d o w .


4.5 Transforming Inputs

4-61

1. Deselect the t w o inputs w i t h G i f tTime in their names. f*»'-

M i i l

(none) Name

*-

r t w

''

• •

not

Method

DemAge DefauH DemCluster befauR DemGender DefauH DemHomeOwOefau_R_ DemMedHomOefauR DernMedlncortJefault D e m P ctVetera DefauH GiftAvg36 DefauH GiftAvgAII DefauH GiftAvgCard36DefauK GiftAvgLast DefauH Gif!Cnt36 DefauH GiftCntAII Default GiftCntCard36 DefauH GiftCntCardAll DefauH GiftTimeFirst DefauH GiftTimeLast DefauH PromCnt12 DefauH P ro m C nt3 6 Dejault PromCntAII DefauH P ro mC ntC a rd Default PromCntCard DefauH Prom C ntC a rd/OefauK REP_DernMetDefauR StatusCat96NDefauR Status CatStar/DefauR

E q u a l to

Apply

Number of Bins

Role 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Rejected 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input

Level Interval Nominal Nominal Binary Inlerval Interval Interval Interval Interval Inlerval Inlerval Interval Interval Interval Inlerval Interval Interval Interval Inten/al Interval Interval Interval Interval Interval Nominal Binary

Explore...

Type

Order

Label Age Demograph Gender Home Owne Median Horn Median Incoi Percent Vete Gift Amount; GiftAmountJ GiftAmountJ Gift Amount | Gift Count 36 Gift Count At Gift Count C; Gift Count C; Times Since Time Since L Promotion Cj Promotion 0 Promotion C Promotion C Promotion C Promotion C Replacemeri Status C a t e g _ Status Calerj*_

Numeric Character Character Character Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Character Numeric

Update Path

Reset

OK

Cancel


4-62

12.

Chapter 4 Introduction to Predictive Modeling: Regressions

Select M e t h o d •=> L o g for one o f the r e m a i n i n g selected inputs. The selected m e t h o d changes f r o m D e f a u l t to L o g f o r the

(none)

Q not

T

G i f tAvg

and

G i f tCnt

inputs.

Apply

Equal to

Name Method DemAge Default DemCluster Default DemGender Default DernHomeOwDefautt DemMedHomOefautt DemMedlnconDefautt DernPctVeteraDefault Log GiftAvg36 GiftAvgAII Log GiftAvgCard36Log GiftAvgLast Log Log GiflCnt36 GiftCntAII Log GiftCntCard36Log GiftCntCardAll Loy GiftTimeFirst Default GiftTimeLast DefauH PromCntl 2 DefauH PromCnt36 Default PromCntftll DefauH Pro mCntCard DefauH PromCnlCardDefauH PromCnlCard/Default REP DemMecDefautt Statu sCal96NIDefault Statu sCalStar/Default

Number of Bins

Role 4.0 Input 4.0 input 4.0 Input 4.0 In put 4.0 Input 4.0 Rejected 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input

4

Level Interval Nominal Nominal Binary Inlerval Inlerval Inlerval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Nominal Binary

Explore..,

13.

i Type Numeric Character Character Character Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Character Numeric

Update Path

Order

Reset

Label Age Demographl Gender Home Owne Median Horn Median Incoi Percent Vete GiftAmountJ Gift Amount; Gift Amount/ Gift Amount i Gift Count 3 £ Gift Count Al Gift Count C: Gift Count C? Times Since Time Since t Promollon C Promolion C Promotion C Promotion C Promotion C Promotion C Replacemer Status Cates Status Catec.

OK

Cancel

Select O K to close the Variables - Trans w i n d o w .

14. Run the T r a n s f o r m Variables node and v i e w the results. 5. M a x i m i z e the O u t p u t w i n d o w and go to line 2 8 . Input I n p u t Name

Role

Level

GiftAvg36 GiftAvgAII GiftAvgCard36

INPUT INPUT INPUT

INTERVAL INTERVAL INTERVAL

L0G_ G i f t A v g 3 6

INTERVAL INTERVAL INTERVAL

log(GiftAvg36

L0G_ G i f t A v g A I I L0G_ _ G i f t A v g C a r d 3 6

GiftAvgLast GiftCnt36 GiftCntAII GiftCntCard36 GiftCntCardAll

INPUT INPUT INPUT INPUT INPUT

INTERVAL INTERVAL INTERVAL

LOG. L0G_ LOG LOG_ LOG_

INTERVAL INTERVAL INTERVAL INTERVAL INTERVAL

log(GiftAvgLast + 1) log(GiftCnt36 + 1) log(GiftCntAII + 1) log(GiftCntCard36 + 1) log(GiftCntCardAll + 1)

INTERVAL INTERVAL

Name

GiftAvgLast _GiftCnt36 GiftCntAII _GiftCntCard36 GiftCntCardAll

Level

Formula + 1 )

log(GiftAvgAII + 1) log(GiftAvgCard36 + 1)

N o t i c e the F o r m u l a c o l u m n . W h i l e a l o g transformation was s p e c i f i e d , the actual t r a n s f o r m a t i o n used was \og(input + 1). T h i s default action o f the T r a n s f o r m Variables tool avoids p r o b l e m s w i t h 0-valucs o f the u n d e r l y i n g inputs. 16.

Close the T r a n s f o r m Variables - Results w i n d o w .


4.5 Transforming Inputs

4-63

Regressions with Transformed Inputs The following steps revisit regression, and use the transformed inputs: 1.

Run the diagram from the Regression node and view the results.

2.

Go to line 3754 the Output window. Summary of Stepwise S e l e c t i o n Effect Step

Entered

Removed

DF

Number

Score

In

Chi-Square

Wald Chi-Square

Pr > ChiSq

1

L0G_GiftCnt36

!

1

95.0275

<.0CO1

2

GiftTimeLast

1

2

21.1330

<.OO01

3

DemMedHomeValue

1

3

17.7373

<.0001

4

L03_GiftAvgAll

4

21.7306

5 6

DemPctVeterans

1 1

7.0742

<.0O01 0.0078

StatusCat96NK L0G_GiftCntCard36

5 6

13.7906

1

7

5.9966

0.0170 0.0143

1

8

5.0301

0.0249

53 1

9

61.2167

0.2049

10

M_DemAge DemCluster StatusCatStarAll

10

1.2431

0.2649

11

PromCntCard12

1

11

1.4604

0.2269

12

PromCntAll

1

12

1.0022

0.3168

13

LOG_GiftCntAll

1

13

2.2990

0.1295

14 15

PromCntl2

1

14

0.8158

0.3664

PromCntCardAll

1

15

1.8875

1

0.6075

7 8 9

PromCntCard12

0.0358

0.1695 0.8500

16 17

M_REP_DemMedIncome

1

14 15

18

L0G_GiftAvg36

1

16

0.4691

0.4357 0.4934

19

M_L0G_GiftAvgCard36

0.6226

0.4301

GiftTimeFirst

1 1 1

17

20

18

0.3972

21

GiftTimeFirst

17

0.5285 0.3971

0.5286

The s e l e c t e d model, based on the (HX)SE=VERR0R c r i t e r i o n , i s the model t r a i n e d i n Step 4. I t c o n s i s t s of the following e f f e c t s : Intercept

DemMedHomeValue

GiftTimeLast

L0G_GiftAvgAll

L0G_GiftCnt36

The stepwise selection process took 21 steps, and the selected model came from step 4. Notice that half o f the selected inputs are log transformations o f the original gift variables.


4-64

3.

Chapter 4 Introduction to Predictive Modeling: Regressions

Go to line 3800 to view more statistics from the selected model. Analysis of Maximum Likelihood Estimates Standard DF

Parameter

Estimate

Error

Wald

Standardized

Chi-Square

Pr > ChiSq

Estimate

Exp(Est)

Intercept DerrrVtedHorneValue

1

0.8251

0.2921

7.98

0.0047

1

1.448E-6

23.26

<.CO01

0.0798

1.000

GiftTimeLast LOGjSiftAvgAH

1

-0.0341

3.002E-7 0.00756

20.33

<.0001

-0.0758

0.966

1

-0.3469

0.0747

21.58

<.0O01

-0.0895

0.707

LOG GiftCnt36

1

0.3736

0.0728

26.34

<.0O01

0.0998

1.453

2.282

Odds Ratio Estimates Point Estimate

Effect

4.

DerrrVtedHorneValue

1.000

GiftTimeLast

0.966

LOGjSiftAvgAll

0.707

LOG GiftCnt36

1.453

Select View <=> Model «=> Iteration Plot.

Average Squired Error

0250-

0245-

0240-

023510

-r~ 15

— r 20

Model Selection Step Number •Train: Average Squared Error -Valid: Average Squared Error

The selected model (based on minimum error) occurs i n Step 4. The value o f average squared error for this model is slightly lower than that for the model with the untransformed inputs.


4.5 Transforming Inputs

5.

4-65

Select Select Chart •=> Misclassification Rate.

m Misclassification Rate 0.500-

0.475 -

0.450 -

0.425 -

0.40010

15

- r 20

Model Selection Step Number -Train: Misclassification Rate -Valid: Misclassification Rate

The misclassification rate with the transformed input model is nearly the same as that for the untransformed input model. The model with the lowest misclassification rate comes from Step 3. I f you want to optimize on the misclassification rate, you must change this property in the Regression node's property sheet. 6.

Close the Results window.


4-66

Chapter 4 Introduction to Predictive Modeling: Regressions

4.6 Categorical Inputs Beyond the Prediction Formula

Use nonnumeric inputs.

81

Using nonnumeric or categorical inputs presents another problem for regressions. As was seen i n the earlier demonstrations, inclusion o f a categorical input with excessive levels can lead to overfitting.


4.6 Categorical Inputs

4-67

To represent these nonnumeric inputs in a model, you must convert them to some sort o f numeric values. This conversion is most commonly done by creating design variables (or dummy variables), with each design variable representing approximately one level o f the categorical input. (The total number o f design variables required is, in fact, one less than the number o f inputs.) A single categorical input can vastly increase a model's degrees o f freedom, which, in turn, increases the chances o f a model overfitting.


4-68

Chapter 4 Introduction to Predictive Modeling: Regressions

Coding Consolidation Level

Dgh

&ABCD

A

1

0

0

B

1

0

0

C

1

0

0

D

1

0

0

E

0

1

0

F

0

1

0

G

0

0

1

H

0

0

1

1

0

0

0

8G There are many remedies to this p r o b l e m . One o f the simplest remedies is to use d o m a i n k n o w l e d g e to reduce the number o f levels o f the categorical input. In this way, level-groups are encoded in the model in place o f the original levels.


4.6 Categorical Inputs

4-69

Recoding Categorical Inputs In Chapter 2, you used the Replacement tool to eliminate an inappropriate value in the median income input. This demonstration shows how to use the Replacement tool to facilitate combining input levels o f a categorical input. 1.

Remove the connection between the Transform Variables node and the Impute node. ; Predictive Analysis

> flDcdiioaTfee

-^|f5"O|0 2.

Select the Modify tab.

3.

Drag a Replacement tool into the diagram workspace.

4.

Connect the Transform Variables node to the Replacement node.

5.

Connect the Replacement node to the Impute node.

J

—© 80%|S:°Hj'

- i ; Predictive Analysis

ft

^toputa 1= L B

PVAOTNt

»

JLl

6

<

>

^

6

K

J

-ii||^Q|0 - J —Q 76%|gqqj


4-70

Chapter 4 Introduction to Predictive Modeling: Regressions

Y o u need to change some o f the node's default settings so that the replacements are l i m i t e d to a single categorical input. 6.

In the Interval Variables property group, select D e f a u l t L i m i t s M e t h o d O N o n e . Value

Property

General

Node ID Imported Data Exported Data iNotes

Repl2

... ...

...

Train

j- Replacement Editor

7.

Cutoff Values Class Variables Replacement Editor Ignore Unknown Levels

In the Class Variables property g r o u p , select R e p l a c e m e n t E d i t o r O Properties panel. 1

Value

Property

General

Node ID Imported Data Exported Data Notes

I

Repl2

Train

Slnterval Variables i-Replacement Editor |-Default Limits MethoNone '-CutofTValues a C l a s s Variables \- Replacement Editor Ignore " U n k n o w n Levels

,.u !

y

. „ [ f r o m the Replacement node


4.6 Categorical Inputs

4-71

The Replacement E d i t o r opens.

Variable DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster

Level 40 24 36 35 49 13 12 27 18 30 00 39 45 51 14 11 02 43 28 16 46 41 17 08

Frequency 236 209 206 194 178 154 152 152 148 143 129 129 114 114 113 111 107 103 99 98 98 97 93 89

Type

c c c c c

C

c c c c" c ;C C C C

c

c

iC !C

b c

*

I Char Raw v... Num RawV... Repiaceme... •40 (24 t 36 35 49 ,13 ~T~

gl

L

27 18 30 00 _i39 J45 [51

I. ~ ~ I. ~f

11 02 43

r

28 |16 *46

I

U1

|.

08 Reset

OK

Cancel

The categorical input Replacement Editor lists all levels o f each binary, o r d i n a l , and n o m i n a l input. Y o u can use the Replacement c o l u m n to reassign values to any o f the levels. The input w i t h the largest number o f levels is

DemCluster.

w h i c h has so many levels that

consolidating the levels using the Replacement Editor w o u l d be an arduous task. ( A n o t h e r , autonomous method for c o n s o l i d a t i n g the levels o f D e m C l u s t e r is presented as a special topic in Chapter 8.) For this demonstration, y o u c o m b i n e the levels o f another input,

StatusCat96NK.


4-72

8.

Chapter 4 Introduction to Predictive Modeling: Regressions

Scroll the Replacement E d i t o r to v i e w the levels o f

Variable DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemGender DemGender DemGender DemGender DemHomeOwner DemHomeOwner DemHomeOwner StatusCal96NK StatusCa!96NK StatusCat96NK StatusCat96NK StatusCat96NK StatusCat96NK StatusCat96NK StatusCatStarAII Status CatStarAII StatusCatStarAII

Level 33 06 50 04 19 52 JJNKNOW.. F M }U JJNKNOW.. H

ty ~

Frequency 30 C 26 c 26 "c 22 19 C 19 c 2592 c 2002 249

c

c

2655 2188

JJNKNOW.. A 2911 S 1168 342 N 286 E 115 IL 21 JJNKNOW.. 1 2590 0 2253 JJNKNOW..

FT

c c c c c c c c

c c

c C c N N N

StatusCat96NK.

Type

CharRawv... Num Raw... Repiaceme.., 33 06 50 04 19 52 JDEFAULT.

M U .DEFAULT, H U .DEFAULT. A~~ S F N E L .DEFAULT. 1.0 0.0 DEFAULT Reset

OK

Cancel

The input has six levels, plus a level to represent u n k n o w n values ( w h i c h do not occur in the t r a i n i n g data). The levels o f

StatusCat96NK

w i l l be consolidated as f o l l o w s :

Levels A and S (active and star donors) indicate consistent donors and are grouped into a single level, A .

Levels F and N ( f i r s t - t i m e and new donors) indicate new donors and are grouped into a single level, N .

Levels E and L (inactive and lapsing donors) indicate lapsing donors and are grouped into a single level L.


4.6 Categorical Inputs

9.

Type

10. T y p e

A as the N

Replacement level for

StatusCat96NK

levels A and S.

as the Replacement level for

StatusCat96NK

levels F and N .

4-73

11. Type L as the Replacement level for StatusCat96NK levels L and E. StatusCat96NK StatusCat96NK StatusCat96NK StatusCat96NK StatusCat96NK StatusCat96NK StatusCat96NK

A S F N E L UNKNOW..

2911 1168 342 286 115 21

12. Select O K to close the Replacement Editor.

C C

c |C c c

c

A

N

|7~

E

I

A A N N L

|L

I.

L

IS F

.

I DEFAULT


4-74

Chapter 4 Introduction to Predictive Modeling: Regressions

13. Run the Replacement node and v i e w the results.

file

Edit View

Window

1 1 1 0

1

I

c ^ E f [ET(|

Total Replacement Counts $ Variable

Role

Label

StatusCat9... INPUT

Train

Status Cate...

Validation 1625

1627

[2) output

*

*

Us e r :

Dodotech

Dace:

17HAY08

Time:

21:51

; K

* Training

Output

s

Variable

Role

*

Summary

Measurement

Frequency

Level

Count

The Total Replacement Counts w i n d o w shows the number o f replacements that occur in the t r a i n i n g and v a l i d a t i o n data.


4.6 Categorical Inputs

14. Select View «=> Model !§ Replaced Levels

4-75

Replaced Levels. The Replaced Levels window opens. |fH

W 0 0 $ $ ^

Variable

Formatted Value

Type

StatusCat9... StatusCat9... StatusCat9... StatusCat9... StatusCat9... StatusCat9...

A S F N E L

C C

Character Numeric Unformated Value Value A S F N E L

c c c c

Replacemen Variable Label t Value A A N N L L

Status Status Status Status Status Status

Cate... Cate... Cate... Cate... Cate... Cate...

The replaced level values are consistent with expectations. 15. Close the Results window. 16. Run the Regression node and view the results. 17. Go to line 3659 o f the Output window. Summary of Stepwise Selection Effect Step

Entered

Removed

DF

Number

Score

Wald

In

Chi-Square

Chi-Square

Pr > ChiSq

1

L0G_GiftCnt36

1

1

95.0275

2

GiftTimeLast

1

2

21.1330

<.0001

3

DenttedHomeValue

1

3

17.7373

<.0001

4

L0G_GiftAvgAll

1

4

21.7306

<.0001

5

DemPctVeterans

1

5

7.0742

0.0078

6

REP_StatusCat96NK

6

9.7073

0.0078

7

L0G_GiftCntCard36

1

7

6.2112

0.0127

8

M_DemAge DemCluster

1

8

4.8754

0.0272

9

61.7834

0.1910

10

StatusCatStarAII

1

10

1.6743

0.1957

11

PromCntCard12

1

11

1.3961

0.2374

12

1

12

1.1442

0.2848

13

PromCntAil L0G_GiftCntAll

1

13

1.8685

0.1717

14

PromCnt12

1

14

0.6761

0.4109

15

ProraCntCardAll

1

15

2.0585

1

14

1

15

0.7608

0.3831

1 1 1

16

0.7343

0.3915

17

0.5853

0.4443

18

0.3821

1

17

9

PromCntGard12

16 17 18 19 20

L0G_GiftAvg36 M_L0G_GiftAvgCarxf36 M_R£P_DemMedIncome GiftTimeFirst

21

GiftTimeFirst

<.0CO1

0.1514 0.0216

0.8830

0.5365 0.3821

0.5365

The REP_StatusCat96NK input (created from the original StatusCat96NK input) is included in Step 6 the Stepwise Selection process. The three level input is represented by two degrees o f freedom. 18. Close the Results window.


4-76

Chapter 4 Introduction to Predictive Modeling: Regressions

4.7

Polynomial Regressions (Self-Study)

Beyond the Prediction Formula

A c c o u n t f o r nonlinearities.

90

The Regression tool assumes (by default) a linear and additive association between the inputs and the logit o f the target. I f the true association is more complicated, such an assumption m i g h t result in biased predictions. For decisions and rankings, this bias can ( i n some cases) be unimportant. For estimates, this bias appears as a higher value for the validation average squared error fit statistic.

Standard Logistic Regression

0.7 0.6

X 0.5 2

0.4 03 0.2 0.1 0.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 O.S 0.9 1.0 *1

91

In the dot c o l o r p r o b l e m , the (standard logistic regression) assumption that the concentration o f y e l l o w dots increases t o w a r d the upper right corner o f the unit square seems to be suspect.


4.7 Polynomial Regressions (Self-Study)

4-77

Polynomial Logistic Regression

92

W h e n m i n i m i z i n g prediction bias is important, y o u can increase the f l e x i b i l i t y o f a regression model by adding p o l y n o m i a l c o m b i n a t i o n s o f the model inputs. This enables predictions to better match the true input/target association. It also increases the chances o f o v e r f i t t i n g w h i l e simultaneously reducing the interpretability o f the predictions. Therefore, p o l y n o m i a l regression must be approached w i t h some care In SAS Enterprise M i n e r , a d d i n g p o l y n o m i a l terms can be done selectively or a u t o n o m o u s l y .


Chapter 4 Introduction to Predictive Modeling: Regressions

4-78

Adding Polynomial Regression Terms Selectively

This demonstration shows how to use the Term Editor window to selectively add polynomial regression terms. You can modify the existing Regression node or add a new Regression node. I f you add a new node, you must configure the Polynomial Regression node to perform the same tasks as the original. A n alternative is to make a copy o f the existing node. 1.

Right-click the Regression node and select Copy from the menu.

2.

Right-click the diagram workspace and select Paste from the menu. A new Regression node is added with the label R e g r e s s i o n ( 2 ) to distinguish it from the existing one.

3.

Select the Regression (2) node. The properties are identical to the existing node.

4.

Rename the new regression node P o l y n o m i a l R e g r e s s i o n with model identification i n later chapters.

5.

Connect the Polynomial Regression (2) node to the Impute node.

( 2 ) . The (2) is retained to help

SHIES

• ^ P r e d i c t i v e Analysis

PVA97fK

I

^

"

i . .PotfOOBili

v r~lOedda>Tceo

JLl

J

0

- J —

©

76% S^D

To add polynomial terms to the model, you use the Term Editor. To use the Term Editor, you need to enable User Terms.


4-7 Polynomial Regressions (Self-Study)

6.

4-79

Select U s e r T e r m s "=> Yes in the Polynomial Regression ( 2 ) property panel. Value

Property

General Node ID Imported Data Exported Data Notes

Reg2 U u

Train

Variables

y

BEquatfon

Yes -Main Effects Two-Factor InteracticNo •Polynomial Terms NO •Polynomial Degree 2 Yes -UserTerms Term Editor

~12

T h e T e r m E d i t o r is n o w unlocked and can be used to add specific p o l y n o m i a l terms to the regression model. 3.

Select T e r m E d i t o r "=>

Node ID Imported Data Exported Data Notes

II

Value

Property

General

f r o m the Polynomial Regression Properties panel.

Reg2

Train

Variables

Q U

UJ

Yes Main Effects -Two-Factor InteracticNo -Polynomial Terms No •Polynomial Degree 2 Yes -UserTerms Term Editor


Chapter 4 Introduction to Predictive Modeling: Regressions

4-80

T h e T e r m s w i n d o w opens.

EH Target: TargetB

1

I !

* •

X

Variables

Term

DemCluster DemGender DemHomeOwner DemMedHomeV DemPctVeterans

Save

OK

Cancel


4.7 Polynomial Regressions (Self-Study)

Interaction T e r m s Suppose that y o u suspect an interaction between home value and time since last g i f t . (Perhaps a recent change in property values affected the donation patterns.) I.

Select D e m M e d H o m e V a l u e in the Variables panel o f the Terms d i a l o g box.

2.

Select the A d d button,

. The D e m M e d H o m e V a l u e input is added to the T e r m panel.

Target: TargetB

X

Term

Variables

0

DemCluster DemGender

DemMedHomeValue

DemHomeOwnei DemMedHomeV* DemPct Veterans <.

Save

J > OK

Cancel

4-81


4-82

3.

Chapter 4 Introduction to Predictive Modeling: Regressions

Repeat the previous step to add

GiftTimeLast. i-.M

Target: TargetB

1

* X

Variables

Term

GiftTimeFirst

DemMedHomeValuE

GiftTimeLast

GiftTimeLast

IMP_DemAge IMP_LOG_GiftAvt IMP REP DemM

Save

OK

4.

Cancel

Select Save. A n interaction between the selected inputs is now available for consideration by the Regression node.

t i l Target: TargetB DemMedHomeValue*GiftTimeLast

Variables

Term

GiftTimeFirst GiftTimeLast IMP_DemAge !MP_LOG_GiftAv£ IMP_REP_DemM

Save

OK

Cancel


4.7 Polynomial Regressions (Self-Study)

Quadratic Terms S i m i l a r l y , suppose that y o u suspect a parabola-shaped relationship between the l o g i t o f donation probability and median home value. 1.

Select DemMedHomeValue.

2.

Select the A d d b u t t o n ,

3.

Select

m

. The

again. A n o t h e r

DemMedHomeValue input is added to the T e r m panel.

DemMedHomeValue input is added to the T e r m panel.

m

Target: TargetB 1

DemMedHomeValue*GiftTimeLast

Variables

Term

DemMedHomeVE*

DemMetlHomeValuE

DemPctVeterans

DemMedHomeVakiE

GiftTimeFirst GiftTimeLast IMP_DemAge

Save

OK

4.

Cancel

Select Save. A quadratic median home value term is available for consideration by the m o d e l . L 3 Target: T TargetB I 2 3

T

DemMedHorneValue'GiftTimeLast DemMeclHomeValue'DemMedHomeValue

Variables DemMedHomeVc * DemPctVeterans GiftTimeFirst GiftTimeLast IMP_DemAue

*

Term

Save

" i l T > OK

Cancel

4-83


4-84

Chapter 4 Introduction to Predictive Modeling: Regressions

5.

Select O K to close the Terms dialog box.

6.

Run the Polynomial Regression node and view the results.

7.

Go to line 3752 in the Output window. Suimary of Stepwise Selection Effect Step 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Entered

m

Score Chi-Square

Nutter Remoted

DF

Chi-Square

Pr > ChiSq

LJDGjaftCnt36 GiftTimeLast DenMxHomeVal^

1

1

95.0275

<.CC01

1

21.1330 19.6082

<.OC01

1

2 3

U3GJÂŁftAvgAll

1 1

4 5

21.8432

1 1 1 1 1 1 1

6 7 8 9 10 11 12 11 12

DenFctVetBrans REP_StatosCaft95W

UDGJ3inaTtCand36 M_Demflge DemMedHomeVaJjue*TJemyed^ StatusCatStarAII PranCntCard12 PraiCntALL StalifiCatStarALl DarCiuster UDGJ^ftCntAll StatusCatStarAII PranCnt12 PraiCn^CardALL PronCntCard12 UDG_GiftAvg36 MjeMJenAtedlnoome MJ03_GaftAvgCard36

53 1 1 1 1 1 1 1 1 1

GiftTimeFirst GiftTimeFirst

1

13 14 15 16 15 16 17 18 19 18

<.OC01 <.0CO1

7.0965 9.7703 6.2012 4.9143 3.6530 1.8153 1.2570 1.3799

0.0077 0.0076

0.4504

0.0128 0.0265 0.0660 0.1779 0.2522 0.2401 0.5021

58.7308

0.2736

1.0539 1.2548 0.6591 2.0306

0.3046 0.2626 0.4169 0.1492 0.8931 0.3888 0.4223

0.0189 0.7423 0.6424 0.6013 0.3946

0.4381 0.5299 0.3945

0.5299

The stepwise selection summary shows the interaction term added in Step 3 and the quadratic term in Step 9. 8.

Close the Results window.

This raises the obvious question: How do you know which nonlinear terms to include in a model? Unfortunately, there is no simple solution to this question in SAS Enterprise Miner, other than including all polynomial and interaction terms in the selection process.


4.7 Polynomial Regressions (Self-Study)

4-85

A d d i n g Polynomial Regression Terms A u t o n o m o u s l y (Self-Study) SAS Enterprise M i n e r has the a b i l i t y to add every p o l y n o m i a l c o m b i n a t i o n o f inputs to a regression m o d e l . O b v i o u s l y , this feature must be used w i t h some care, because the number o f p o l y n o m i a l input combinations increases r a p i d l y w i t h input count. For instance, the PVA97NK data set has 20 interval inputs. I f y o u want to consider every quadratic c o m b i n a t i o n o f these 20 inputs, y o u r selection procedure must sequentially p l o d through more than 200 inputs. T h i s is not an o v e r w h e l m i n g task for today's fast computers. F o l l o w these steps to explore a f u l l t w o - f a c t o r stepwise selection process: 1.

Select T w o - F a c t o r I n t e r a c t i o n ^

Yes in the P o l y n o m i a l Regression property panel.

2.

Select P o l y n o m i a l T e r m s O Yes in the P o l y n o m i a l Regression Properties panel. Property

General Node ID Imported Data Exported Data Notes

Train Variables

•Main Effects •Two-Factor InteracticYes •PolynomialTerms Yes •Polynomial Degree 2 -UserTerms Yes -Term Editor 3.

Run the P o l y n o m i a l Regression (2) node and v i e w the results. ( I n general, this m i g h t take longer than most activities.)


Chapter 4 Introduction to Predictive Modeling: Regressions

4-86

4.

Go to line 1774 o f the Output window. Suimary of Stepwise Selection Effect Step

Removed

Entered

Chi-Square

Chi-Square

Pr > ChiSq

1

UTXLGinrjnt36*lIX3ja

1

1

101.0902

<.CO01

1

2

33.9163

<.CO01

3

t^ftTimeLast*LOG_GiftAvgUst rJenMedHomeVa!ue*DemPctVeterar^

1

3

25.2441

<.00O1

4

REP_StatusCat96MK

2

4

10.2804

0.0059

5 6

DemHomeOAner^_lJDGJ^m\^i^ DemCluster*TJemGender

7 8 10

1

5

5.8659

106

6

134.9632

0.0154 0.0302

GimimeLast*PromCrrt12

1

0.0174

1

7 8

5.6507

urjGGdftOrr^^ LOG_GiftAvgAll

3.7134

0.0540

1

9

5.8292

0.0158

50

10

64.6125

53

9

DemCluster

11

5.

Wald

Score In

2

9

f

Number DF

DemCluster

0.0601 39.8737

0.9086

Surprisingly, the selection process takes only 11 steps. This is the result o f the 106 degree-offreedom D e m C l u s t e r and DemGender interaction in Step 6. As the iteration plot shows below, the model is hopelessly overfit after this step. Inputs with many levels are problematic for predictive models. It is a good practice to reduce the impact o f these inputs either by consolidating the levels or by simply excluding them from the analysis.

Scroll down in the Output Window. The

s e l e c t e d model, based on t h e CH00SE=VERR0R c r i t e r i o n ,

consists of the following

i s t h e model t r a i n e d i n S t e p 3. I t

effects:

I n t e r c e p t DemMedHomeValue*DemPctVeterans GiftTimeLast*LOG_GiftAvgLast

The selected model includes only three terms!

L0G_GiftCnt36*L0G_GiftCntCardAll


4.7 Polynomial Regressions (Self-Study)

6.

Select View >=> M o d e l •=> Iteration Plot.

[Average Squared Error

0.250 -

0.245 -

0.240-

0.235-

0.230-

i 0.0

I

—r~

i

5.0 75 25 Model Selection Step Number

10.0

•Train: Average Squared Error •Valid: Average Squared Error

The validation average squared error o f the three-term model is lower than any other model considered to this point.

4-87


Chapter 4 Introduction to Predictive Modeling: Regressions

4-88

A

* ,

Exercises

1. Predictive Modeling Using Regression a. Return to the Chapter 3 Organics diagram. Attach the StatExplore tool to the ORGANICS data source and run it. b.

In preparation for regression, is any missing values imputation needed? I f yes, should you do this imputation before generating the decision tree models? Why or why not?

c. Add an Impute node to the diagram and connect it to the Data Partition node. Set the node to impute U for unknown class variable values and the overall mean for unknown interval variable values. Create imputation indicators for all imputed inputs. d. Add a Regression node to the diagram and connect it to the Impute node. e. Choose the stepwise selection and validation error as the selection criterion. f. Run the Regression node and view the results. Which variables are included in the final model? Which variables are important in this model? What is the validation ASE? g.

In preparation for regression, are any transformations of the data warranted? Why or why not?

h. Disconnect the Impute node from the Data Partition node. i. Add a Transform Variables node to the diagram and connect it to the Data Partition node, j.

Connect the Transform Variables node to the Impute node.

k. Apply a log transformation to the DexnAff 1 and PromTime inputs. I. Run the Transform Variables node. Explore the exported training data. Did the transformations result in less skewed distributions? m. Rerun the Regression node. Do the selected variables change? How about the validation ASE? n. Create a full second-degree polynomial model. How does the validation average squared error for the polynomial model compare to the original model?


4.8 Chapter Summary

4-89

4.8 Chapter Summary Regression models are a prolific and useful way to create predictions. New cases are scored using a prediction formula. Inputs are selected via a sequential selection process. Model complexity is controlled by fit statistics calculated on validation data. To use regression models, there are several issues with which to contend that go beyond the predictive modeling essentials. 1.

A mechanism for handling missing input values must be included i n the model development process.

2.

A reliable way to interpret the results is needed.

3.

Methods for handling extreme or outlying predictions should be included.

4.

The level-count o f a categorical should be reduced to avoid overfitting.

5.

The model complexity might need to be increased beyond what is provided by standard regression methods.

One approach to this is polynomial regression. Polynomial regression models can be fit manually with specific interactions in mind. They can also be fit autonomously by selecting polynomial terms from a list o f all polynomial candidates.


Chapter 4 Introduction to Predictive Modeling: Regressions

4-90

Regression Tools Review Replace missing values for interval (means) and categorical data (mode). Create a unique replacement indicator.

6

,jÂŁ Regression

^Transform ihfortaUes

|

Create linear and logistic regression models. Select inputs with a sequential selection method and appropriate fit statistic. Interpret models with odds ratios. Regularize distributions of inputs. Typical transformations control for input skewness via a log transformation.

ss

continued...

Regression Tools Review Consolidate levels of a nonnumeric input using the Replacement Editor window.

• ( Ftryrtocrfal Regression

Add polynomial terms to a regression either by hand or by an autonomous exhaustive search.


4.9 Solutions

4.9

4-91

Solutions

Solutions to Exercises 1. Predictive Modeling Using Regression a. Return to the Chapter 4 Organics diagram in the E x e r c i s e s project. Use the StatExplore tool on the ORGANICS data source. 1) Connect the StatExplore node to the O R G A N I C S node as shown.


4-92

Chapter 4 Introduction to Predictive Modeling: Regressions

2)

Run the StatExplore node and v i e w the results.

Edit

View

Window

i l l B

15"!

iS Chi-Square Plot Target = TargetBuy

—

0-25-

B

5 0.20 H trt ? 0.1 D M

5 0.10| 0.05 0.00

1

3

Ordered Inputs Q Output Usee:

Dodotech

Dace:

28HAY08

Time:

00:35

• Training

b.

Output

In preparation for regression, is any missing values imputation needed? I f yes, should y o u do this i m p u t a t i o n before generating the decision tree models? W h y or w h y not? G o to line 38 in the Output w i n d o w . Several o f the class inputs have m i s s i n g values.

Variable DemClusterGroup DemGender DemReg DemTVReg PromClass TargetBuy

Role INPUT INPUT INPUT INPUT INPUT TARGET

C l a s s V a r i a b l e Summary S t a t i s t i c s Number of Mode Levels Missing Mode Percentage 8 674 C 20.55 4 2512 F 54.67 6 465 South East 38.85 14 465 London 27.85 4 0 Silver 38.57 2 0 0 75.23

Mode D M Midlands Midlands Tin 1

Mode2 Percentage 19.70 26.17 30.33 14.05 29.19 24.77

G o to line 65 o f the O u t p u t w i n d o w . Most o f the Interval inputs also have m i s s i n g values. I n t e r v a l V a r i a b l e Summary S t a t i s t i c s Variable DemAffl DemAge PromSpend PromTime

ROLE INPUT INPUT INPUT INPUT

Mean 8.71 53.80 4420.59 6.56

Std. Deviation 3.42 13.21 7559.05 4.66

Non Missing 21138 20715 22223 21942

Missing 1085 1508 0 281

Minimum 0.00 18.00 0.01 0.00

Median 8 54 2000 5

Maximum 34.00 79.00 296313.85 39.00

Y o u d o n o t need t o i m p u t e before t h e Decision T r e e node. D e c i s i o n trees h a v e b u i l t - i n w a y s to h a n d l e m i s s i n g v a l u e s . (See Chapter 3.)


4.9 Solutions

c.

4-93

A d d an I m p u t e node to the diagram and connect it to the D a t a P a r t i t i o n node. Set the node to impute U for u n k n o w n class variable values and the overall mean for u n k n o w n interval variable values. Create i m p u t a t i o n indicators for all imputed inputs.

l_!t_ Decision Tree

± \

^StatExplore

Decision Tree

•Q

r I: i : ORGANICS

Impute

~™™Jata Partition

o

1)

Select D e f a u l t I n p u t M e t h o d •=> D e f a u l t C o n s t a n t V a l u e .

2)

Type U for the Default Character Value. IJciass Variables [-Default Input Method Default Target Method ^Normalize Values

I Default Constant Value Mone Ves

[• Default Input Method -Default Tarqet Method

Mean None

[•Default Character Value -Default Number Value

U

:

3)

Select I n d i c a t o r V a r i a b l e T y p e •=> U n i q u e .

4)

Select I n d i c a t o r V a r i a b l e R o l e <=> I n p u t .


4-94

Chapter 4 Introduction to Predictive Modeling: Regressions

d.

A d d a Regression node to the diagram and connect it to the I m p u t e node.

UL

Decision Tree

o

f ^ D e c i s i o n Tree

StatExplore

)ota Partition

ORGANICS

e.

.jf-

Regression

Select the Stepwise selection and V a l i d a t i o n E r r o r as the selection c r i t e r i o n .

[•Selection Model HSelection Criterion [•Use Selection Defaults ^Selection Options f.

Impute

••1

Stepwise Validation Error Ves

Run the Regression node and v i e w the results. W h i c h variables are included in the final model? W h i c h variables are important in this model? What is the v a l i d a t i o n A S E ? 1)

The Results w i n d o w opens.

File

I I

Edit View

Window

IB H ffl

Score Rankings Overlay: TargetBuy

Cumulative Lift

TargetBuy TargetB uy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy •TRAIN

TarfiPlfim/

-VALIDATE

< 1 ''-t:-

$

L l Effects Plot

0

1

1

5 Effect Number

! • + o-1

[Hi

Fit Statistics Statistics Train Vali Label _AIC_ Akaike's Inf... * 9691.257 _ASE_ Average Sq... 0.138587 _AVERR_ Average Err... 0.435352 _DFE_ Degrees of... 11104 _DFM_ Model Degr... 8 _DFT_ Total Degre... 11112 _DIV_ Divisor for A.. 22224 _ERR_ Error Fundi... 9675.257 Pinal Prnrtir

FPF

G) Output

n 1-5Q7RR

;

5

User:

Dodotech

Date:

28I1AY08

Time:

01:24

Training

4

I #1f

Fit Statistics

Target

Output


4.9 Solutions

2)

G o to line 664 in the Output w i n d o w . The selected model, based on the CHOOSE=VERROR c r i t e r i o n , i s the model t r a i n e d Step 6. I t c o n s i s t s of the f o l l o w i n g e f f e c t s : Intercept

3)

IMP__DemAff 1

IMP_DemAge

IMP_DemGender

M_DemAff1

M_DemAge

The odds ratios indicate the effect that each input has on the logit score. Effect IMP_DemAffl IMP_DemAge IMP_DemGender IMP_DemGender M_DemAffl U DemAge M DemGender

4)

4-95

F M 0 0 0

VS VS vs vs vs

Point Estimate 1 .283 0.947 6.967 2.899 0.708 0.796 0.685

u u 1 1 1

The v a l i d a t i o n A S E is g i v e n in the Fit Statistics w i n d o w . Fit Statistics Target TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy

Fit Statistics Statistics Test Train Validation Label _AIC_ Akaike's Inf... 9691.257 _ASE_ Average Sq... 0.138587 0.137156 _AVERR_ Average Err... 0.435352 0.432266 _DFE_ Degrees of... 11104 Model Degr... 8 _DFM_ _DFT_ Total Degre... 11112 _DIV_ Divisor for A,,. 22224 22222 _ERR_ Error Fundi... 9675.257 9605.81 _FPE_ Final Predic... 0.138786 _MAX_ Maximum A... 0.991147 0.987434 _MSE_ Mean Squa... 0.138687 0.137156 Mnpc

<

m

:

Ci i m

nf Fro

a.

| •

in

M_DemGender


Chapter 4 Introduction to Predictive Modeling: Regressions

4-96

g.

In preparation for regression, are any transformations o f the data warranted? W h y or w h y not? 1)

O p e n the Variables w i n d o w o f the Regression node.

2)

Select all Interval inputs.

(none)

fjrwt

Name Use DemCluster Default IMP_DemAffl Default IMP_DemAge Default IMP_DemClusOefautt IMP_DemGeri'Defaiilt lMP_DemReg Default IMP_DsmTVROefault lMP_PromTim Default M_DemAffl Default MJDemAge Default M_DernCluste Default i 1

Apply

Equal to

Report No No _Np No No No No NO No No No

Role Rejected Input Input J^put Input Input Input Input Input input Input

Level Nominal Interval Interval Nominal Nominal Nominal Nominal Inlerval Binary Binary Binary

Type Character Numeric Numeric Character Character Character Character Numeric Numeric Numeric Numeric

Order

Reset

Label Forma Nelgborhood' Imputed: Afflui Imputed: Age Imputed: Neig imputed: Gem Imputed: Geo£ Imputed: Tele» Imputed: Loya Imputation Ind Imputation Ind Imputation Ind

1 Explore™

Update Path

OK

Cancel

-


4.9 Solutions

3)

4-97

Select E x p l o r e . . . . The Explore w i n d o w opens.

File

View

Actions

Window

V-Z Sample Properties Property

Li IMP_DemAge 2500 -

Value

Rows Columns Library

Unknown 28 EMWS3

Mpmhpr

IMPT T R A I N

5

_

01

.? 500-

Apply

18 0

Plot...

eg EMWS3.lmpt_TRAIN

30.2

42.4

54.6

66.8

79.0

Imputed: Age Ll IMP_PromTime

DATAOBS ICustomer L. Affluence G. 10000D00140 40000001120

Obs#

2000-

| 15006 1000¬

50000002313 70000003131 110000007420 120000009814

-000 -

& 4000| 3000§• 2000 £ 1000 0 —

i

0.0

[

.

7.8

[

1

1 5.6

1

1

1

23.4

315

r 39.0

Imputed: Loyalty Cat d Tenure Ll IMP_DemAffl > 4000 -

S

3 0 0 0

"

| 2000 £ 1000-

U-

0

0.0

3.1

I

I

62

9.3

1

1

1

12.4 1?.:. 18.6 Imputed: Affluence Grade

21.7

24.8

27.9

310

B o t h C a r d T e n u r e a n d A f f l u e n c e G r a d e have m o d e r a t e l y s k e w e d d i s t r i b u t i o n s . A p p l y i n g a l o g t r a n s f o r m a t i o n t o these i n p u t s m i g h t i m p r o v e t h e m o d e l fit. h.

Disconnect the I m p u t e node f r o m the D a t a P a r t i t i o n node.

i.

A d d a T r a n s f o r m V a r i a b l e s node to the diagram and connect it to the D a t a P a r t i t i o n node,

j.

Connect the T r a n s f o r m V a r i a b l e s node to the I m p u t e node.

•J

it

f 'r' \

M i Patiwn <

)

Fnrafm -•.unll--.

• 1^

Pe^awfan

J


4-98

Chapter 4 Introduction to Predictive Modeling: Regressions

k.

A p p l y a l o g t r a n s f o r m a t i o n to the D e m A f f 1 and P r o m T i m e inputs. 1)

O p e n the Variables w i n d o w o f the Transform Variables node.

2)

Select M e t h o d •=> L o g f o r the D e m A f f 1 and P r o m T i m e inputs.

(none)

not

Equal i o

Name Method DemAttl Log DemAge Default DemCluster Default DemClusterG (Default DemGender Default Default DemReg DemTVReg Default PromClass Default PromSpend Default PromTime ton rargetAmt Default TargetBuy Default <

Number of Bins

,

.

.

Role 4.0 Input 4.0 Input 4.0 Rejected 4.0 Input 4.0 Input 4.0 Input 4.0 input 4.0 Input 4.0 Input 4.b'|nput 4.0 Rejected 4.0 Target

.

Level Inteival Interval Nominal Nominal Nominal Nominal Nominal Nominal Inlerval Inlerval Interval Binary

Type Numeric Numeric Character Character Character Character Character Character Numeric Numeric Numeric Numeric

I.

Order

Label Affluence Ora Age Neigborhood Neighborhoo Gender Geographic fi Television Re Loyalty Status Total Spend Loyalty Card' Organtcs Pur Organics Pur

1

Explore-.

3)

Reset

Apply

Update Path

OK

Cancel

Select O K to close the Variables w i n d o w .

Run the T r a n s f o r m V a r i a b l e s node. E x p l o r e the exported t r a i n i n g data. D i d the transformations result i n less s k e w e d distributions? 1)

T h e easiest w a y to e x p l o r e the created inputs is to open the Variables w i n d o w i n the subsequent Impute node. M a k e sure that y o u update the I m p u t e node before o p e n i n g its Variables w i n d o w .

(none)

not

Name Use DemAge Default DemCluster Default DemClusterGiDefault DemGender Default DemReg Default DemTVReg Default LOGJDemAffi DefauH LOG_PromTinDefault PromClass Default PromSpend Default TargetAmt Default TargelBuy Default

Apply

Equal t o

Method Default Default Default Default Default Default Default Default Default Default Default Default

Use Tree Default Default Default Default Default Defautt Default Default Default Default Defautt Default

Role Input Rejected Input Input Input Input Inpul input Input Input Rejected Target

Level Interval Nominal Nominal Nominal Nominal Nominal Inteival inlerval Nominal interval Interval Binary

Explore.

Type Numeric Character Character Character Character Character Numeric Numeric Character Numeric Numeric Numeric

Update Path

Order

OK

Reset

Labe Age j Neigbot Neighbi Gender Geogra Televisi Transfo Transfo; Loyalty: Total Sd Organic Organic

Cancel


4.9 Solutions

2)

W i t h the L O G

File

View

D c m A f f l and L O G

Actions

4-99

P r o m T i m e inputs selected, select E x p l o r e . . . .

Window

0,4 Qfl Sample Properties J _ Property Rows Columns Library Member Type Sample Method Fetch Size Fetched Rows Random Seed

Unknown 16 EMWS3 TRANS_TRAIN VIEW Random Max 11112 1 2345

4000¬ > ~ 3000o>

o 2000Um 1000¬

Apply

Plot.

Q] EMWSXTrans TRAIN

i

1 2 3 4 5

6 7 a 9 to 11

IEJ

sooo-

0

Obs#

r^tf

Ll LOG_DemAffl Value

0.69

1.39

2.08 2.77

3.47

Transformed: Affluence Grade [HI

ET

0.00

L l LOG_PromTime

DATAOBS [Customer L . Affluence G„ 10000000140 11 40000001120 11 50000002313 1 70000003131 1 110000007420 120000009814 130000010005 140000010219 150000010812 1 170000011932 180000014656 ! •

T h e d i s t r i b u t i o n s are nicely symmetric, m.

Rerun the R e g r e s s i o n node. D o the selected variables change? H o w about the v a l i d a t i o n A S E ? 1)

G o t o line 664 o f the O u t p u t w i n d o w .

The s e l e c t e d m o d e l , It

consists

Intercept 2)

based on t h e CH00SE=VERR0R c r i t e r i o n ,

of the following

i s t h e model t r a i n e d

i n Step 6 .

effects:

IMP_DemAge IMP_DemGender

IMP_LOG_DemAff1

IMP_LOG_DemAf f 1 and M_LOG_DemAf f 1 M DemAf f 1 , respectively.

M_DemAge

replace

M__DemGender

IMP _DemAf f 1

and

M_LOG_DemAff1


4-100

Chapter 4 Introduction to Predictive Modeling: Regressions

3)

A p p a r e n t l y the l o g transformation actually reduced the v a l i d a t i o n A S E slightly.

1

Fit Statistics Target

Fit Statistics Statistics Label

TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy

_AIC_ _ASE_ _AVERR_ _DFE_ _DFM_ _DFT_ _DIV_ _ERR_ _FPE_ _MAX_ _MSE_ _NOBS_

i | ;: n.

Akaike's Inf... Average Sq... Average Err... Degrees of ... Model Degr... Total Degre... Divisor for A... Error Fundi... Final Predic... Maximum A... Mean Squa... Sum of Fre...

Train

Validation

9758.609 0.139545 0.438382 11104 8 11112 22224 9742.609 0.139746 0.992317 0.139646 11112

T

0.138204 0.43599

' 1

22222 9688.581 0.994405 0.138204 11111

M

•

Create a f u l l second-degree p o l y n o m i a l model. H o w does the v a l i d a t i o n average squared error for the p o l y n o m i a l model compare to the original model? 1)

A d d another Regression node to the diagram and rename it

Polynomial

Regression.

y

PLMÂťt!Ullti.-li


4.9 Solutions

2)

4-101

M a k e the indicated changes to the P o l y n o m i a l Regression Properties panel.

Train [Variables

I

M

[•Main Effects hiTwo-Factor Interactions iPolynomial Terms i-Polynomial Degree rUserTerms f r e r m Editor

Yes Yes Yes 2 No

rRegressionType '-Link Function

Logistic Regression Logit

L

j-Suppress intercept Hnput Coding [SlModel Selection r Selection Model r Selection Criterion r U s e Selection Defaults '•Selection Options 3)

Stepwise Validation Error Yes

G o t o line 1598

The s e l e c t e d m o d e l , It

No Deviation

consists

based on t h e CHOOSE=VERROR c r i t e r i o n ,

of the following

i s t h e model t r a i n e d

i n S t e p 7.

effects:

Intercept IMP_DemAge IMP_DemGender IMP_L0G_DemAff1 M_OemAge MJ)emGender"M_LOG_DeniAffl IMP_DemAge*IMP_DemAge IMP_LOG_DemAffl"IMP_L0G_DemAff1

4)

The P o l y n o m i a l Regression node adds additional interaction terms.

5)

Examine the Fit Statistics w i n d o w . \Z Fit Statistics %

r/ET

| Target

Fit Statistics Statistics Label

TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy

_AIC_ _ASE_ _AVERR_ JDFE_ _DFM_ _DFT_ _DIV_ _ERR_ _FPE_ _MAX_

4

Akaike's Inf.. Average Sq... Average Err... Degrees of.. Model Degr... Total Degre.. Divisorfor A.. Error Fundi... Final Predic. Maximum A...

[

Validation

Train 9529.938 0.136407 0.428003 11103 9 11112 22224 9511.938 0.136628 0.98571 E

T

0.134038 0.421824

!

22222 9373.784 0.986233

••'\

T h e a d d i t i o n a l t e r m s r e d u c e the v a l i d a t i o n A S E s l i g h t l y .

I


4-102

Chapter 4 Introduction to Predictive Modeling: Regressions

S o l u t i o n s to Student Activities (Polls/Quizzes)

4.01 Multiple Choice Poll - Correct Answer What is the logistic regression prediction for the indicated point? a. -0.243 b. 0.56 c. yellow @ l t depends . . . o 1

A

logit(p)=-0.81+0.92x

P

1

1

A _

1 + -l째gi!(pt e

+1lix

2


Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools 5.1

Introduction Demonstration: Training a Neural Network

5.2

Input Selection Demonstration: Selecting Neural Network Inputs

5.3

5.4

Stopped Training

5-3 5-12 5-20 5-21 5-24

Demonstration: Increasing Network Flexibility

5-35

Demonstration: Using the AutoNeural Tool (Self-Study)

5-39

Other Modeling T o o l s (Self-Study)

5-46

Exercises

5-58

5.5

Chapter Summary

5-59

5.6

Solutions

5-60

Solutions to Exercises

5-60


5-2

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools


5.1 Introduction

5-3

5.1 Introduction Model Essentials - Neural Networks Predict new cases.

Prediction formula

Select useful inputs.

None

Optimize complexity.

Stopped training

3

With its exotic sounding name, a neural network model (formally, for the models discussed in this course, multi-layerperceptrons) is often regarded as a mysterious and powerful predictive tool. Perhaps surprisingly, the most typical form o f the model is, in fact, a natural extension o f a regression model. The prediction formula used to predict new cases is similar to a regression's, but with an interesting and flexible addition. This addition enables a properly trained neural network to model virtually any association between input and target variables. Flexibility comes at a price, however, because the problem o f input selection is not easily addressed by a neural network. The inability to select inputs is offset (somewhat) by a complexity optimization algorithm named stopped training. Stopped training can reduce the chances o f overfitting, even in the presence o f redundant and irrelevant inputs.


5-4

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

Model Essentials - Neural Networks Predict new cases.

Prediction formula

s

Like regressions, neural networks predict cases using a mathematical equation involving the values o f the input variables.


5.1 Introduction

5-5

Neural Network Prediction Formula pivdiction estimate

fcias estimate

-.wight

csnni.vfo

H = tanh(w + f

-5--

5

h = tanh(w + w 2

x., + w x )

10

20

n

21

2

+w x ) 22

2

H = tanh(w + w X! + w x ) 3

30

31

32

2

A neural network can be thought o f as a regression model on a set o f derived inputs, called hidden units. In turn, the hidden units can be thought o f as regressions on the original inputs. The hidden unit "regressions" include a default link function (in neural network language, an activation Junction), the hyperbolic tangent. The hyperbolic tangent is a shift and rescaling o f the logistic function introduced in Chapter 3. Because o f a neural network's biological roots, its components receive different names from corresponding components o f a regression model. Instead o f an intercept term, a neural network has a bias term. Instead o f parameter estimates, a neural network has weight estimates. What makes neural networks interesting is their ability to approximate virtually any continuous association between the inputs and the target. You simply specify the correct number o f hidden units and find reasonable values for the weights. Specifying the correct number o f hidden units involves some trial and error. Finding reasonable values for the weights is done by least squares estimation (for intervalvalued targets).


Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

5-6

Neural Network Binary Prediction Formula A

log(-j—jr)=w

0 0

+w

01

Hi + w

02

H + w - ^3 03

2

'of/'' link

(unction

J

5

o

! /

tf

Hi = tanh(w +

+w

12

x)

= tanh(vv + w x + w

22

x)

H = tanh(w + w x., + w

32

x)

10

2

20

3

30

21

31

1

2

2

2

When the target variable is binary, as in the demonstration data, the main neural network regression equation receives the same logit link function featured in logistic regression. As with logistic regression, the weight estimation process changes from least squares to maximum likelihood.


5.1 Introduction

5-7

Neural Network Diagram ,

o

g

( l — 8 " )

=

Woo + w

0

1

H,

+ w

Q2

H

2

Hi - tanh(w +

+ w

0

3

H

z

x + w x)

10

A

12

2

H = tanh(w + w x., + w x ) 2

20

21

22

2

H = tanh(w + w * i + w x ) 3

inpi/J

hidden

target

,'iiycf

/ayer

/ayer

30

31

32

2

9 Multi-layer perceptron models were originally inspired by neurophysiology and the interconnections between neurons, and they are often represented by a network diagram instead o f an equation. The basic model form arranges neurons in layers. The first layer, called the input layer, connects to a layer o f neurons called a hidden layer, which, in turn, connects to a final layer called the target or output layer. Each element in the diagram has a counterpart in the network equation. The blocks in the diagram correspond to inputs, hidden units, and target variables. The block interconnections correspond to the network equation weights.


5-8

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

Prediction Illustration - Neural Networks logit

1.0

equation

logit( jS) = wbo + " t o i * H

H

0.9

2 ^oj- W +

a

0.8

H, = tanh(w + vv„ x, + w x )

0.7

H = tanh(w + w

0.6

10

2

1 2

M

2 1

x, + w

2

x )

a

2

Wj = tanhfAta + w x , + w x ) 2 0.5 31

32

x

2

0.4 0.3 0.2 0.1 0.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.S 0.9 1.0

To demonstrate the properties o f a neural network model, consider again the two-color prediction problem. As always, the goal is to predict the target color based on the location in the unit square. As with regressions, the predictions can be decisions, rankings, or estimates. The logit equation produces a ranking or logit score.

Prediction Illustration - Neural Networks logit

1.0

equation

logit( p ) = Woo * <*W

H +w H

+

2

oi

H = tanh(vv + w x + w t

10

H = tanh(w 2

M

u

+

w

3

M

1 2

3

0.8

x )

Q.7

2

x, + Wn x )

2i

H = tanh(w + w

A

0.9

0.6

2

3 1

x +w 1

x 3i

z

0

5

04

Need weight estimates.

0.3 0.2 0.1 0.0 0.0 0.1 0.2 0.3 0.4 0.5 06 0.7 08 0.9 1.0

12 As with a regression model, the primary task is to obtain parameter or, in the neural network case, weight estimates that result in accurate predictions. A major difference between a neural network and a regression model, however, is the number o f values to be estimated and the complicated relationship between the weights.


5.1 Introduction

5-9

Prediction Illustration - Neural Networks /oy/f equation loglt( p ) = Wbo + W„- H, + Wqj- H + W - Hj 2

H, = tanh(w + w

1 t

H = tanh( Wjo + w

M

10

2

H,= tanh(Wjo + w

3 1

M

x, + w

1 2

x) 2

x, + x, + w

xj x)

i2

2

x

2 05 0.4

Weight estimates found by

maximizing: Ilog(p>Ilog(1-p ) f

" o.o0.0

Log-iike!ihood

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

Function

13

For a binary target, the weight estimation process is driven by an attempt to maximize the log-likelihood function. Unfortunately, in the case o f a neural network model, the maximization process is complicated by local optima as well as a tendency to create overfit models. A technique illustrated in the next section (usually) overcomes these difficulties and produces a reasonable model.


5-10

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

14 Even a relatively simple neural network with three hidden units permits elaborate associations between the inputs and the target. While the mode! might be simple, explanation of the model is decidedly not. This lack of explicability is frequently cited as a major disadvantage o f a neural network. O f course, complex input/target associations are difficult to explain no matter what technique is used to model them. Neural networks should not be faulted, assuming that they correctly modeled this association. After the prediction formula is established, obtaining a prediction is strictly a matter of plugging the input measurements into the hidden unit expressions. In the same way as with regression models, you can obtain a prediction estimate using the logistic function.


5.1 Introduction

5-11

Neural Nets: Beyond the Prediction Formula Manage missing values. Handle extreme or unusual values. Use non-numeric inputs. Account for nonlinaarliies.

16

Neural networks share some o f the same prediction complications with their regression kin. The most prevalent problem is missing values. Like regressions, neural networks require a complete record for estimation and scoring. Neural networks resolve this complication in the same way that regression does, by imputation. Extreme or unusual values also present a problem for neural networks. The problem is mitigated somewhat by the hyperbolic tangent activation functions in the hidden units. These functions squash extreme input values between - I and + 1 . Nonnumeric inputs pose less o f a complication to a properly tuned neural network than they do to regressions. This is mainly due to the complexity optimization process described in the next section. Unlike standard regression models, neural networks easily accommodate nonlinear and nonadditive associations between inputs and target. In fact, the main challenge is over-accommodation, that is, falsely discovering nonlinearities and interactions. Interpreting a neural network model can be difficult, and while it is an issue, it is one with no easy solution.


5-12

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

Training a Neural Network

Several tools i n SAS Enterprise Miner include the term neural in their name. The Neural Network tool is the most useful o f these. (The AutoNeural and D M Neural tools are described later in the chapter.) 1.

Select the M o d e l tab.

2.

Drag a Neural N e t w o r k tool into the diagram workspace.

3.

Connect the Impute node to the Neural Network node. Predictive Analysis

faptoccxomt

3

-0

"ye

=4*

-iJ|IW|0 With the diagram configured as shown, the Neural Network node takes advantage o f the transformations, replacements, and imputations prepared for the Regression node. The neural network has a default option for so-called "preliminary training."

ÂŽ8Âť|pal


5.1 Introduction

4.

Select Optimization •=> j

I

Property

from the Neural Network Properties panel. Value

General Neural Node ID Imported Data Exported Data Notes Train Variables No Continue Training Network Optimization Initialization Seed 12345 Model Selection Criterion Profit/Loss Suppress Output Score No {Hidden Units I 'Residuals res Nn Standardization

5.

... ... ... ...

Select Enable •=> No under the Preliminary Training options. I £ f Optimization Property -Accelerate Decelerate :- Learn t- Maximum Learning Minimum Learning Momentum :- Maximum Momentum Tilt ^Preliminary Training T Enable Number of Runs Maximum Iterations Maximum Time

d

Value 1.2 0.5 0.1 50.0 i1 OE-5 0.0 1.75 0.0

J

No 5 10 1 Hour

Enable Indicates whether to perform prelirninary training. The default is No.

OK

Cancel

I

5-13


5-14

6.

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

Run the Neural Network node and view the results. The Results - Node: Neural Network Diagram window opens.

File Edit View Window

Score Rankings Overlay: TargetB

r / Ef (Hi I EH Fit Statistics Target

Cumulative Lift

TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TarqetB

Percentile TRAIN

VALIDATE c^eT

Iteration Plot Average Squared Error

Fit Statistics Statistics Train Label _DFT_ Total Degre.. 48' a. _DFE_ Degrees of.. 45S _DFM_ Model Degr... 2< _NW_ Number of ... 2t _AIC_ Akaike's Inf... 7014.5E _SBC_ Schwarz's... 8655.3' _ASE_ Average Sq... 0.2396; _MAX_ Maximum A... 0.7804E Divisor for A.. _DIV_ 96( NOBS Sum of Fre... 48'

<

IE

\>

Output User:

Dodotech

Date:

18HAY06 19:04

Time: * Training

S

10 Training Iteration

Train: Average squared Error. Valid: Average Squared Error.

Output


5.1 Introduction 7.

Maximize the Fit Statistics window. 11

statistics

Target TargetB TargetB TargetB TargetB TargetB TargetB TargetB [TargetB argetB argetB argetB argetB argetB argetB argetB argetB argetB argetB TargetB TargetB TargetB

-^^MM^MX^^^Mw^^BMMMM.

w-f-f I S

Fit Statistics Statistics Train Validation Test Label _DFT_ Total Degre... 4843 _DFE_ Degrees of... 4590 _DFM_ Model Degr... 253 _NW_ Number of... 253 _AIC_ Akaike's Inf... 7014.564 _SBC_ Schwarz's... 8655.342 _ASE_ Average Sq... 0.239625 0.242925 _MAX_ Maximum A... 0.760482 0.774254 _DIV_ Divisor for A... 9686 9686 _NOBS_ SumofFre... 4843 4843 _RASE_ RootAvera... 0.489515 0.492874 _SSE_ Sum ofSqu... 2321.008 2352.972 Sum ofCas... 9686 9686 _SUMW_ _FPE_ Final Predic... 0.266041 _MSE_ Mean Squa... 0.252833 0.242925 _RFPE_ Root Final... 0.515792 _RMSE_ Root Mean... 0.502825 0.492874 _AVERR_ Average Err... 0.671956 0.678893 _ERR_ Error Functi... 6508.564 6575.759 _MISC_ Misclassific... 0.418542 0.430105 _WRONG_ Number of... 2027 2083

The average squared error and misclassification are similar to the values observed from regression models in the previous chapter. Notice that the model contains 253 weights. This is a large model.

5-15


5-16 8.

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

Go to line 54 of the Output window. There you can find a table of the initial values for the neural network weights. The NEURAL Procedure Optimization S t a r t Parameter Estimates

Estimate

Gradient Objective Function

DemMedHomeValue_H11 DemPctVeterans_H11 GiftTimeFirst_H11 GiftTimeLast_H11 IMP_DemAge_H11 IMP_L0G_GiftAvgCard36_H11 IMP_REP_DemMedIncome_H11 L0G_GiftAvg36_H11 L0G_GiftAvgAll_H11

-0.004746 -0.011042 -0.026889 -0.024545 0.008120 0.055146 -0.167987 0.087440 0.063190

0 0 0 0 0 0 0 0 0

H11_TargetB1 H12_TargetB1 H13_TargetB1 BIAS_TargetB1

0 0 0 -0.000413

-0.004814 -0.000030480 0.001641 1.178583E-15

N Parameter 1 2 3 4 5 6 7 8 9

250 251 252 253

Despite the huge number o f weights, the model shows no signs o f overfitting. 9.

Go to line 395. You can find a summary of the model optimization (maximizing the likelihood estimates o f the model weights). Optimization Results Iterations Gradient C a l l s Objective Function Slope of Search Direction

50 Function C a l l s 62 Active Constraints 0.6368394579 Hax Abs Gradient Element -0.000752598

136 0 0.0076065979

QUANEW needs more than 50 i t e r a t i o n s or 2147483647 function c a l l s . WARNING: QUANEU Optimization cannot be completed.

Notice the warning message. It can be interpreted to mean that the model-fitting process did not converge. 10. Close the Results - Neural Network window.


5.1 Introduction

5-17

Reopen the Optimization window and examine the Optimization options in the Properties panel for tlie Neural Network node. 5 £Optimization Value

Property Training Technique Maximum Iterations Maximum Time [§)Nonlinear Options Absolute Absolute Function Absolute Function Times 'Absolute Gradient jAbsolute Gradient Times Absolute Parameter Absolute ParameterTimes Relative Function •RfilativR Fnnntinn Timps

Default 50 4 Hours -1.34078E154 0 1 1.0E-5 1 1.0E-8 1 0.0 1

iJ

Training Technique Specifies the training technique. The following are valid selections: DEFAULT (the default depends on the number of weights that are applied during execution), Trust-Region, Levenberg-Marquardt, Quasi-Newton, Conjugate Gradient, DBLDOG, BackProp, RPROP (uses a separate learning rate for each weight, and adapts alt the learning rates during training), and QPROP (QuickProp is a Newton-like method but uses a diagonal Cancel

OK

The maximum number o f iterations, by default, is 50. Apparently, this is not enough for the network training process to converge. 12. Type 100 for the Maximum Iterations property. 13. Run the Neural Network node and examine the results. 14. Maximize the Output window. You are again warned that the optimization process still failed to converge (even after 100 iterations). QUANEW needs more t h a n 100 i t e r a t i o n s WARNING: QUANEW O p t i m i z a t i o n

o r 2147483647 f u n c t i o n

c a n n o t be c o m p l e t e d .

calls


5-18

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

15. Maximize the Fit Statistics window. Rt Statistics, Target TargetB TargetB TargetB TargetB TargetB TargetB argetB argetB argetB argetB argetB argetB argetB argetB argetB argetB argetB argetB argetB argetB argetB

Fit Statistics Statistics Train Validation Test Label _DFT_ Total Degre.. 4843 _DFE_ Degrees of.. 4590 DFM Model Degr... 253 Number of... 253 AIC_ Akalke's Inf... 7014.564 _SBC_ Schwarz's... 8655.342 _ASE_ Average Sq... 0.242925 0.239625 _MAX_ Maximum A... 0.774254 0.760482 _DIV_ Divisor for A.. 9686 9686 Sum of Fre... _Nois_ 4843 4843 RootAvera... _RASE_ 0.492874 0.489515 Sum ofSqu.. _SSE_ 2352.972 2321.008 Sum of Cas.. _SUMW_ 9686 9686 Final Predic. _FPE_ 0.266041 Mean Squa... _MSE_ 0.252833 0.242925 Root Final ... _RFPE_ 0.515792 Root Mean ... _RMSE_ 0.502825 0.492874 Average Err... AVERR_ 0.671956 0.678893 Error Functi... _ERR_ 6508.564 6575.759 Misclassific. _MISC_ 0.418542 0.430105 _WRONG_ Number of... 2027 2083

Curiously, increasing the maximum number o f iterations changes none o f the fit statistics. How can this be? The answer is found in the Iteration Plot window.


5.1 Introduction

5-19

16. Examine the Iteration Plot window.

[Average Squared Error

0.26-

0.24-

0.22-

Training Iteration Train: Average Squared Error. Valid: Average Squared Error.

The iteration plot shows the average squared error versus optimization iteration. A massive divergence in training and validation average squared error occurs near iteration 14, indicated by the vertical blue line. The rapid divergence o f the training and validation fit statistics is cause for concern. This primarily results from a huge number o f weights in the fitted neural network model. The huge number o f weights comes from the use o f all inputs in the model. Reducing the number o f modeling inputs reduces the number o f modeling weights and possibly improves model performance. 17. Close the Results window.


5-20

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

5.2 Input Selection Model Essentials - Neural Networks

Select useful inputs.

None

19

The Neural Network tool in SAS Enterprise Miner lacks a built-in method for selecting useful inputs. While sequential selection procedures such as stepwise are known for neural networks, the computational costs o f their implementation tax even fast computers. Therefore, these procedures are not part o f SAS Enterprise Miner. You can solve this problem by using an external process to select useful inputs. In this demonstration, you use the variables selected by the standard regression model. (In Chapter 8, several other approaches are shown.)


5-2 Input Selection

5-21

Selecting Neural Network Inputs

This demonstration shows how to use a logistic regression to select inputs for a neural network. f

Additional dimension reduction techniques are discussed in Chapter 8.

1.

Delete the connection between the Impute node and the Neural Network node.

2.

Connect the Regression node to the Neural Network node. Predictive Analysis

) - , Polynomial 'X'- Req/nsston (2)

)

rips

Replacement

-o

'J

Regression

(inputs

ral Network

a

3 3.

100%

Right-click the Neural Network node and select Update from the shortcut menu.

•-

hi


5-22

4.

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

Open the Variables dialog box for the Neural Network node.

(none) Name

^;

f j not

Use

DemCluster Default DernGencfer Default DemHomeOwDefault _ DemMedHormDefault D em Med In co "Default DemPctveteraDefault GiftTimeFirst Default GiftTimeLast Default IMP_DemAge Default IMP LOG^GirVDefautt IMP_REP_DerDefault LOG GiflAvg3fDefault LOG GirtAvgAIDefautt LOG GiftAvgLDefautt LOG GtrlCnt3 Default LOG_GiftCntAI Default LOG_GiflCntC Default LOG_GiftCntC Default M DemAge Default M_LOG_GiftAvDefault MJREPJDernt'Defautt PromCnt12 Default PromCnt36 Default PromCntAII Default P ro mCntG a rd Default P ro mCntC a rd Default *

Apply

Equal to

Report No No No No No Nn No No No No No No ^0 No No No No Ino No

No No No No No No No

Role Rejected Rejected Rejected Input Rejected Rejected Rejected Input Rejected Rejected Rejected Rejected Input Rejected Input Rejected Rejected Rejected Rejected __ Rejected Rejected Rejected Rejected Rejected Rejected Rejected

Level Nominal Nominal Binary Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Binary Binary Binary Interval Interval Interval Interval Interval

Type Character Character Character Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric

I Explore...

Order

Label

Close the Variables dialog box.

Format

Demographic Gender Home Owner Median HomeDOLLAR11 Median IncomDOLLAR11 Percent Veten Times Since F. Time Since Ls Imputed: Age Imputed: Tran! Imputed: Repl Transformed: Transformed: Transformed: Transformed: Transformed: Transformed: Transformed: Imputation Inc Imputation Ind Imputation Ind r Promotion Co Promotion Co Promotion Co Promotion Co! Promotion Co I • Update Path

OK

Only the inputs selected by the Regression node's stepwise procedure are not rejected. 5.

Reset

Cancel


5.2 Input Selection 6.

Run the Neural Network node and view the results. The Fit Statistics window shows an improvement in model fit using only 19 weights. 1 R t ^ i s B c s ^^^^^MMยงBMW^MiiB^$M0:'โ ข Target Fit Statistics Statistics Train Validation Test Label _DFT_ TargetB Total Degre... 4843 _DFE_ TargetB Degrees of... 4824 _DFM_ TargetB Model Degr... 19 TargetB _NW_ Number of... 19 TargetB _AIC_ Akaike's Inf... 6573.409 TargetB _SBC_ Schwarz's... 6696.629 TargetB _ASE_ Average Sq... 0.240983 0.240441 TargetB _MAX_ Maximum A... 0.777 0.78162 TargetB _DIV_ Divisor for A... 9686 9686 TargetB _NOBS_ Sum ofFre... 4843 4843 TargetB _RASE_ RootAvera... 0.490348 0.4909 TargetB _SSE_ Sum ofSqu... 2334.159 2328.913 TargetB _SUMW_ 9686 Sum of Cas... 9686 TargetB _FPE_ Final Predic... 0.242881 TargetB _MSE_ Mean Squa... 0.241932 0.240441 TargetB _RFPE_ Root Final... 0.49283 TargetB _RMSE_ Root Mean... 0.491866 0.490348 TargetB Average Err... 0.674727 0.673563 _AVERR_ TargetB _ERR_ Error Functi... 6524.13 6535.409 TargetB _MISC_ 0.421639 Misclassific... 0.427421 [TargetB _WRONG_ Number of... 2070 2042 The validation and training average squared errors are nearly identical.

K.

5-23


5-24

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

5.3 Stopped Training Model Essentials - Neural Networks

Optimize complexity.

j stopped training

22

As was seen in the first demonstration, complexity optimization is an integral part o f neural network modeling. Other modeling methods selected an optimal model from a sequence o f possible models. In the demonstration, only one model was estimated (a neural network with three hidden units), so what was being compared? SAS Enterprise Miner treats each iteration in the optimization process as a separate model. The iteration with the smallest value o f the selected fit statistic is chosen as the final model. This method o f model optimization is called stopped training.


5.3 Stopped Training

Fit Statistic versus Optimization Iteration initial hidden unit weights

logit(p) = o+

22

w,+

H+ 2

H

3

5-25


5-26

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

Fit Statistic versus Optimization Iteration logit(p) = . + H,+ j H + 0 H 2

3

H, = tanh( 1.5 - .03x,-.07x ) 2

H = tanh(.79- .17x,- .15x ) 2

2

W = tanh( 57 + .05x +.35x ) 3

1

z

random initial input weights and biases

24

To begin model optimization, model weights are given initial values. The weights multiplying the hidden units in the logit equation are set to zero, and the bias in the logit equation is set equal to the logit(jti), where K\ equals the primary outcome proportion. The remaining weights (corresponding to the hidden units) are given random initial values (near zero). This "model" assigns each case a prediction estimate: p = TT . An initial fit statistic is calculated on t

]

training and validation data. For a binary target, this is proportional to the log likelihood function:

£log(£.(w))+ primary outcomes

£log(l-£.(w))

sec ondary outcomes

where p

t

\v

is the predicted target value. is the current estimate of the model parameters.

Training proceeds by updating the parameter estimates in a manner that decreases the value of the objective function.


5.3 Stopped Training

5-27

Fit Statistic versus Optimization Iteration • • -:•

»y

"-:i i :^. :-v. ".' : ;

:

•V ^ • ••*•*»*"

\*

0

5

10 Iteration

15

-

;

r

• * ! ; . '

•'•>"'. • X v -

.

• *

20

26

As stated above, in the initial step of the training procedure, the neural network model is set up to predict the overall average response rate for all cases.

Fit Statistic versus Optimization Iteration

01

5

10 iteration

15

20

27 One step substantially decreases the value average squared error (ASE). Amazingly, the model that corresponds to this one-iteration neural network looks very much like the standard regression model, as seen from the fitted isoclines.


5-28

Chapter 5 Introduction to Predictive Modeling; Neural Networks and Other Modeling Tools

Fit Statistic versus Optimization Iteration

0

2

5

10 Iteration

15

20

28

The second iteration step goes slightly astray. The model actually becomes slightly worse on the training and validation data.

Fit Statistic versus Optimization Iteration

-I

^ ^ ^ ^ ^^^^

\

training

0

3

validation

5

10 15 Iteration

20

29

Things are back on track in the third iteration step. The fitted model is already exhibiting nonlinear and nonadditive predictions. Half of the improvement in ASE is realized by the third step.


5.3 Stopped Training

Fit Statistic versus Optimization Iteration

ASE

\

\

training

0

4 5

validation

10 Iteration

15

20

30

Fit Statistic versus Optimization Iteration ASE

Iteration

5-29


5-30

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

Fit Statistic versus Optimization Iteration \

•\ V

t r a i n i n gv . validation

V

0

5 6

10 Iteration

15

20

32

Fit Statistic versus Optimization Iteration

•IA

tralning^^vatidation

0

33

5

7

10 15 Iteration

20


5.3 Stopped Training

Fit Statistic versus Optimization Iteration

8

10 Iteration

3-1

Fit Statistic versus Optimization Iteration ASE

0

5

9'0 Iteration

Most of the improvement in validation ASE occurred by the ninth step. (Training ASE continues to improve until convergence in Step 23.) The predictions are close to their final form.

5-31


5-32

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

Fit Statistic versus Optimization Iteration

10

20

Iteration

36

Fit Statistic versus Optimization Iteration

0

5

1011 Iteration

15

20


5.3 Stopped Training

5-33

Fit Statistic versus Optimization Iteration

10 12 aeration

15

38

Step 12 brings the minimum value for validation ASE. While this model will ultimately be chosen as the final model, SAS Enterprise Miner continues to train until the likelihood objective function changes by a negligible amount on the training data.

Fit Statistic versus Optimization Iteration

ASE

49 In Step 23, training is declared complete due to lack of change in the objective function from Step 22. Notice that between Step 13 and Step 23, ASE actually increased for the validation data. This is a sign of overfitting.


5-34

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

50

The Neural Network tool selects the modeling weights from iteration 13 for the final model. In this Iteration, the validation ASE is minimized. You can also configure the Neural Network tool to select the iteration with minimum misclassification or maximum profit for final weight estimates. f

The name stopped training comes from the fact that the final model is selected as i f training were stopped on the optimal iteration. Detecting when this optimal iteration occurs (while actually training) is somewhat problematic. To avoid stopping too early, the Neural Network tool continues to train until convergence on the training data or until reaching the maximum iteration count, whichever comes first.


5.3 S t o p p e d Training

5-35

Increasing Network Flexibility

Stopped training helps to ensure that a neural network does not overfit (even when the number o f network weights is large). Further improvement in neural network performance can be realized by increasing the number o f hidden units from the default o f three. There are two ways to explore alternative network sizes: • manually, by changing the number o f weights by hand • automatically, by using the AutoNeural tool Changing the number o f hidden units manually involves trial-and-error guessing o f the "best" number of hidden units. Several hidden unit counts were tried in advance. One o f the better selections is demonstrated. 1.

Select Network •=> „ J from the Neural Network properties panel. Property

Value

General Node ID Neural limported Data jExported Data Notes Train 1 [Variables Continue Traininq No Network [Optimization 12345 initialization Seed Model Selection Critt Profit/Loss Suppress Output No

... ... ... ...

...


5-36

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

The Network window opens.

Property Architecture Direct Connection Number of Hidden Units Randomization Distribution Randomization Center Randomization Scale Input Standardization Hidden Layer Combination Function Hidden Layer Activation Function Hidden Bias Target Layer Combination Function Target Layer Activation Function Target Layer Error Function

Value MLP No 6 Normal 0.0 1.0 Standard Deviation Default Default Yes Default Default Default

T

Architecture Specifies which network architecture is used in constructing the network. The following are valid selections: generalized linear model, multilayer perceptron, ordinary radial basis function with equal widths, ordinary radial basis function with unequal widths, normalized radial basis function with equal heights, normalized radial basis function with equal volumes, normalized

OK

2.

Type 6 as the Number of Hidden Units value.

3.

Select O K .

4.

Run the Neural Network node and view the results.

Cancel


5.3 Stopped Training The Fit Statistics window shows good model performance on both the average squared error and misclassification scales.

Target TargetB argetB argetB argetB argetB argetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB argetB argetB TargetB TargetB argetB argetB argetB argetB

Fit Statistics Statistics Train Validation Test Label _DFT_ Total Degre.. 4843 _DFE_ Degrees of.. 4806 _DFM_ Model Degr... 37 Number of... 37 _NW_ Akaike's Inf... 6601.6 _AIC_ Schwarz's... 6841.556 _SBC_ Average Sq... 0.240623 _ASE_ 0.23988 Maximum A... 0.857328 _MAX_ 0.869935 Divisor for A. 9686 _DIV_ 9686 Sum ofFre... 4843 _NOBS_ 4843 RootAvera... 0.490533 _RASE_ 0.489776 Sum ofSqu.. 2330.673 _SSE_ 2323.478 Sum of Cas.. 9686 _SUMW_ 9686 Final Predic. 0.244328 _FPE_ Mean Squa... 0.242475 _MSE_ 0.23988 Root Final... 0.494295 _RFPE_ Root Mean... 0.492418 0.489776 _RMSE_ Average Err... 0.673921 0.672391 _AVERR_ Error Functi... 6527.6 6512.78 _ERR_ Misclassific. 0.427421 0.422878 _MISC_ Number of... 2070 2048 WRONG

5-37


5-38

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

The iteration plot shows optimal validation average squared error occurring on iteration 5. Iteration Plot

r/ D

K

IZi

Average Squared Error 0.250-

0.245-

0.240—f—

10

20

30

Training Iteration Train: Average Squared Error. Valid: Average Squared Error.

Unfortunately, there is little else of interest to describe this model. (In Chapter 8, a method is shown to give some insight into describing the cases with high primary outcome probability.)


5.3 Stopped Training

5-39

Using the AutoNeural Tool (Self-Study)

The AutoNeural tool offers an automatic way to explore alternative network architectures and hidden unit counts. This demonstration shows how to explore neural networks with increasing hidden unit counts. 1.

Select the Model tab.

2.

Drag the AutoNeural tool into the diagram workspace.

3.

Connect the Regression node to the AutoNeural node as shown. Predictive Analysis

Polynomial Recession (2)

-o

'J Replacement

-Q

6

nputo

Retji ession

nl Network

J

AutoNeu al

100::,

Six changes must be made to the AutoNeural node's default settings. 4.

Select Train Action ^ Search. This configures the AutoNeural node to sequentially increase the network complexity.

5.

Select Number of Hidden Units •=> \. With this option, each iteration adds one hidden unit.

6.

Select Tolerance •=> Low. This prevents preliminary training from occurring.

7.

Select Direct •=> No. This deactivates direct connections between the inputs and the target.

8.

Select N o r m a l <=> t No. This deactivates the normal distribution activation function.


5-40

9.

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

Select Sine •=> No. This deactivates the sine activation function. Train Variables [•Architecture Single Layer [•Termination Ove [fitting [•Train Action Search [-Target Layer Error FiDefault [-Maximum Iterations 8 [•Number of Hidden U [•Tolerance Low -TotalTime One Hour crementand Searc Adjust Iterations Yes •Freeze Connections No •Total Number of Hid 30 •Final Training Yes -Final Iterations 5 [SjActivation Functions -Direct No -Exponential No No - Identity -Logistic No -Normal No No •Reciprocal •Sine No -Softmax No -Square No Tanh Yes L

With these settings, each iteration adds one hidden unit to the neural network. Only the hyperbolic tangent activation function is considered. /f

After each iteration, the existing network weights are not reinitialized. With this restriction, the influence of additional hidden units decreases. Also, the neural network models that you obtain with the AutoNeural and Neural Network tools will be different, even i f both networks have the same number of hidden units.


5.3 Stopped Training

5-41

10. Run the AutoNeurai node and view the results. The Results - Node: AutoNeural Diagram window opens.

File

a

Edit

i'ln.: ;\!ti'^y^y^liJ:Jri!iildUti«Jl'lji'idri view Window

l

H

l

m

Score Rankings Overlay: TargetB

d*eT

IE!

3 Fit Statistics Target

Cumulative Lift

VALIDATE

•TRAII-.

-

Iteration Plot

TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB

Average Squared Error

r^kf

m

Fit Statistics Statistics Train valida Label _DFT_ Total Degre... 4843 _DFE_ Degrees of ... 4B36 _DFM_ Model Degr... 7 _NW_ Number of... 7 _AIC_ Akaike's Inf... 6584.S83 _SBC_ Schwarz's... 6630.085 _ASE_ Average Sq... 0.242693 0 • _MAX_ Maximum A... 0.703379 _DIV_ DivisorforA... 9696 _NOBS_ Sum of Fre... 4843 _RASE_ Root Avera... 0.492638 _SSE_ Sum of Sou... 2350.721 SUMW Sum ofCas... 9686

ID Output Dodotech 1BHAY03 20:53

Usee: Date: Time:

0.2500 0.247S -

* T r a i n i n g Output

0.2450 02425¬ 0.2400 Variable Summary Training Step •Train: Average Squared Error. • Vnlki: Average Square:! Error.

Measurement I.e.ve.1

Frequency


5-42

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

11. Maximize the Fit Statistics window. 3 Fit Statistics Target TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB [TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB

r / c ? [HI

Fit Statistics Statistics Train Validation Test Label _DFT_ Total Degre... 4843 _DFE_ Degrees of ... 4836 _DFM_ Model Degr... 7 _NW_ Number of... 7 _AIC_ Akaike's Inf... 6584.688 _SBC_ Schwarz's... 6630.085 _ASE_ Average Sq... 0.242693 0.2411 _MAX_ Maximum A... 0.703379 0,705861 _DIV_ DivisorforA.. 9686 9686 _NOBS_ Sum of Fre... 4843 4843 _RASE_ Root Avera... 0.492638 0.491019 _SSE_ Sum of Squ.. 2350.721 2335.291 _SUMW_ Sum of Cas.. 9686 9686 _FPE_ Final Predic. 0.243395 _MSE_ Mean Squa... 0.243044 0.2411 _RFPE_ Root Final... 0.493351 _RMSE_ Root Mean... 0.492995 0.491019 _AVERR_ Average Err... 0.67837 0.675163 _ERR_ Error Fundi... 6570.688 6539.625 _MISC_ Misclassific. 0.426389 0.41751 _WRONG_ Numberof... 2065 2022

The number o f weights implies that the selected model has one hidden unit. The average squared error and misclassification rates are quite low.


5.3 Stopped Training

5-43

12. Maximize the Iteration Plot window. Iteration Plot |

|

|

^

^

^

!

^

^

K

^

K

^

B

f

i f Uf: M

Average Squared 'Error 0.2500-

0.2475-

0.2450-

0.2425-

0.2400 Training Step •Train: Average Squared Error. Valid: Average Squared Error.

The AutoNeural and Neural Network node's iteration plots differ. The AutoNeural node's iteration plot shows the final fit statistic versus the number o f hidden units in the neural network. 13. Maximize the Output window. The Output window describes the AutoNeural process. 14. Go to line 52. Search # 1 SINGLE LAYER t r i a l # 1 : TANH _ITER_ 0 1 2 3 4 5 6 7 8 8

Training

_AIC_

_AVERR_

_MISC_

_VAVERR_

_VMISC_

6727.82 6725.17 6587.79 6584.69 6584.10 6575.69 6572.57 6571.21 6570.69 6570.69

0.69315 0.69287 0.67869 0.67837 0.67831 0.67744 0.67712 0.67698 0.67692 0.67692

0.49990 0.48462 0.42866 0.42639 0.42804 0.42660 0.42763 0.42866 0.42845 0.42845

0.69315 0.69311 0.67713 0.67516 0.67638 0.67472 0.67455 0.67427 0.67420 0.67420

0.50010 0.48193 0.42350 0.41751 0.43031 0.42061 0.42783 0.42205 0.42061 0.42061

These lines show various fit statistics versus training iteration using a single hidden unit network. Training stops at iteration 8 (based on an AutoNeural property setting). Validation misclassification is used to select the best iteration, in this case, Step 3. Weights from this iteration are selected for use in the next step.


5-44

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

15. View output lines 73-99. Search # 2 SINGLE LAYER t r i a l # 1 : TANH _AIC_

_AVERR_

_MISC_

_VAVERR_

_VMISC_

6596.69 6587.08 6581.99 6580.65 6579.01 6578.27 6577.64 6577.57 6575.88 6575.88

0.67837 0.67738 0.67685 0.67671 0.67654 0.67647 0.67640 0.67640 0.67622 0.67622

0.42639 0.42866 0.42928 0.42887 0.42763 0.43011 0.42474 0.42680 0.42845 0.42845

0.67516 0.67472 0.67405 0.67393 0.67392 0.67450 0.67426 0.67458 0.67411 0.67411

0.41751 0.42639 0.42123 0.42391 0.42267 0.42804 0.42453 0.42825 0.42783 0.42783

_ITER_ 0 1 2 3 4 5 6 7 8 8

Selected I t e r a t i o n based on

VMISC_

_AIC_

_AVERR_

_MISC_

_VAVERR_

_VMISC_

6596.69

0.67837

0.42639

0.67516

0.41751

_ITER_ 0

Training

A second hidden unit is added to the neural network model. A l l weights related to this new hidden unit are set to zero. A l l remaining weights are set to the values obtained i n iteration 3 above. In this way, the two-hidden-unit neural network (Step 0) and the one-hidden-unit neural network (Step 3) have equal fit statistics. Training o f the two-hidden-unit network commences. The training process trains for eight iterations. Iteration 0 has the smallest validation misclassification and is selected to provide the weight values for the next AutoNeural step. 16. Go to line 106. F i n a l Training Training _ITER_ 0 1 2 3 4 5 5

_AIC_

_AVERR_

_MISC_

_VAVERR_

_VMISC_

6584.69 6584.10 6573.21 6571.19 6570.98 6570.56 6570.56

0.67837 0.67831 0.67718 0.67698 0.67695 0.67691 0.67691

0.42639 0.42804 0.42783 0.42990 0.42887 0.43052 0.43052

0.67516 0.67638 0.67462 0.67437 0.67431 0.67418 0.67418

0.41751 0.43031 0.42350 0.42247 0.42308 0.42081 0.42081

Selected I t e r a t i o n based on _VMISC_ _ITER_ 0

_AIC_

_AVERR_

_MISC_

_VAVERR_

_VMISC_

6584.69

0.67837

0.42639

0.67516

0.41751

The final model training commences. Again iteration zero offers the best validation misclassification.


5.3 Stopped Training

The next block o f output summarizes the training process. Fit statistics from the iteration with the smallest validation misclassification are shown for each step. F i n a l Training History _step_

_func_ _ s t a t u s _ _ i t e r _ _AVERR_

SINGLE LAYER 1 SINGLE LAYER 1 SINGLE LAYER 2

TANH TANH TANH

initial keep reject Final

0 3 0 0

0.69315 0.67837 0.67837 0.67837

_MISC_

_AIC_

0.49990 0.42639 0.42639 0.42639

6727.82 6584.69 6596.69 6584.69

_VAVERR_ _VMISC_ 0.69315 0.67516 0.67516 0.67516

0.50010 0.41751 0.41751 0.41751

F i n a l Model Stopping: Termination c r i t e r i a was s a t i s f i e d : o v e r f i t t i n g based on _VMISC _func_

_AVERR_

_VAVERR_

TANH

0.67837

0.67516

neurons 1 1

The Final Model shows the hidden units added at each step and the corresponding value o f the objective function (related to the likelihood).

5-45


5-46

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

5.4 Other Modeling Tools (Self-Study) Model Essentials - Rule Induction Predict new cases.

| Prediction rules / | prediction formula

Select useful inputs. ! j

Optimize complexity.

Split search / none • •

1 Ripping/ ; stopped training

ss

There are several additional modeling tools in SAS Enterprise Miner. This section describes the intended purpose (that is, when you should consider using them), how they perform the modeling essentials, and how they predict the simple example data. The Rule Induction tool combines decision tree and neural network models to predict nominal targets. It is intended to be used when one o f the nominal target levels is rare. New cases are predicted using a combination o f prediction rules (from decision trees) and a prediction formula (from a neural network, by default). Input selection and complexity optimization are described below.


5.4 Other Modeling Tools (Self-Study)

5-47

Rule Induction Predictions [Rips create prediction rules.]

0.9 08

A binary model sequentially classifies and removes correctly classified cases. [A neural network predicts remaining cases.]

0.0 0.0 0.1 0.2 0.3

0.8 0.9 i.O

The Rule Induction algorithm has three steps: • Using a decision tree, the first step attempts to locate "pure" concentrations o f cases. These are regions of the input space containing only a single value o f the target. The rules identifying these pure concentrations are recorded in tile scoring code, and the cases in these regions are removed from the training data. The simple example data does not contain any "pure" regions, so the step is skipped. • The second step attempts to filter easy-to-classify cases. This is done with a sequence o f binary target decision trees. The first tree in the sequence attempts to distinguish the most common target level from the others. Cases found in leaves correctly classifying the most common target level are removed from the training data. Using this revised training data, a second tree is built to distinguish the second most common target class from the others. Again, cases in any leaf correctly classifying the second most common target level are removed from the training data. This process continues through the remaining levels o f the target. • In the third step, a neural network is used to predict the remaining cases. A l l inputs selected for use in the Rule Induction node will be used in the neural network model. Model complexity is controlled by stopped training using classification as the fit statistic.


5-48

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

Model Essentials - Dmine Regression j

Predict new cases.

Prediction formula

Select useful inputs.

Forward selection

Optimize complexity.

Stop R-square

57

Dmine regression is designed to provide a regression model with more flexibility than a standard regression model. It should be noted that with increased flexibility comes an increased potential o f over-fitting. A regression-like prediction formula is used to score new cases. Forward selection picks the inputs. Model complexity is controlled by a minimum R-squared statistic.


5.4 Other Modeling Tools (Self-Study)

5-49

Dmine Regression Predictions • Interval inputs binned, categorical inputs grouped

pi«*c»*»-i

• Forward selection picks from binned and original inputs

SB

The main distinguishing feature o f Dmine regression versus traditional regression is its grouping o f categorical inputs and binning o f continuous inputs. • The levels o f each categorical input are systematically grouped together using an algorithm that is reminiscent o f a decision tree. Both the original and grouped inputs are made available for subsequent input selection. • A l l interval inputs are broken into a maximum o f 16 bins in order to accommodate nonlinear associations between the inputs and the target. The levels o f the maximally binned interval inputs are grouped using the same algorithm for grouping categorical inputs. These binned-and-grouped inputs and the original interval inputs are made available for input selection. A forward selection algorithm selects from the original, binned, and grouped inputs. Only inputs with an R square o f 0.005 or above are eligible for inclusion in the forward selection process. Forward selection on eligible inputs stops when no input improves the R square o f the model by the default value 0.0005.


5-50

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

Model Essentials - DMNeural Predict new cases. Select useful inputs. Optimize complexity.

59

The DMNeural tool is designed to provide a flexible target prediction using an algorithm with some similarities to a neural network. A multi-stage prediction formula scores new cases. The problem o f selecting useful inputs is circumvented by a principal components method. Model complexity is controlled by choosing the number o f stages in the multi-stage predictions formula.


5.4 Other Modeling Tools (Self-Study)

5-51

DMNeural Predictions • Up to three PCs with highest target R square are selected.

• One of eight continuous transformations are selected and applied to selected PCs. • The process is repeated three times with residuals from each stage.

c

GO The algorithm starts by transforming the original inputs into principle components, which are orthogonal linear transformations of the original variables. The three principal components with the highest target correlation are selected for the next step. Next, one o f eight possible continuous transformations (square, hyperbolic tangent, arctangent, logistic. Gaussian, sine, cosine, exponential) is applied to the three principle component inputs. (For efficiency, the transformation is actually applied to a gridded version of the principle component, where each grid point is weighted by the number o f cases near the point.) The transformation with the optimal fit statistic (squared error, by default) is selected. The target is predicted by a regression model using the selected principal components and transformation. The process in the previous paragraph is repeated on the residual (the difference between the observed and predicted target value) up to the number of times specified in the maximum stage property (three, by default).


5-52

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

Model Essentials - Least Angle Regression Predict new cases. > Select useful inputs. Optimize complexity.

Prediction formula Generalized sequential selection Penalized best fit

61

Forward selection, discussed in the previous chapter, provides good motivation for understanding how the Least Angle Regression (LARS) node works. In forward selection, the model-building process starts with an intercept. The best candidate input (the variable whose parameter estimate is most significantly different from zero) is added to create a one-variable model in the first step. The best candidate input variable from the remaining subset o f input variables is added to create a two-variable model in the second step, and so on. Notice that an equivalent way to phrase the description o f what happens in Step 2 is that the candidate input variable most highly correlated with the residuals of the one variable model is added to create a two-variable model in the second step. The LARS algorithm generalizes forward selection in the following way: 1.

Weight estimates are initially set to a value o f zero.

2.

The slope estimate in the one variable model grows away from zero until some other candidate input has an equivalent correlation with the residuals of that model.

3.

Growth o f the slope estimate on the first variable stops, and growth on the slope estimate o f the second variable begins.

4.

This process continues until the least squares solutions for the weights are attained, or some stopping criterion is optimized.

This process can be constrained by putting a threshold on the aggregate magnitude o f the parameter estimates. The LARS node provides an option to use the LASSO (Least Absolute Shrinkage and Selection Operator) method for variable subset selection. The LARS node optimizes the complexity of the model using a penalized best fit criterion. The Schwarz's Bayesian Criterion (SBC) is the default.


5-4 Other Modeling Tools (Self-Study)

5-53

Least Angle Regression Predictions •

Inputs are selected using a generalization of forward selection.

• An input combination in the sequence with optimal, penalized validation assessment is selected by default. o.o

0.0 0.1 0.2 0.3 0.4 0.5 0-6 0.7 0.8 0.9 1.0

The Least Angle Regression node can be useful in a situation where there are many candidate input variables to choose from and the data set is not large. As mentioned above, the best subset model is chosen using SBC by default. SBC and its cousins (AIC, AICC, and others) optimize complexity by penalizing the fit o f candidate input variables. That is, for an input to enter the model, it must improve the overall fit (here, diminish the residual sum o f squares) o f the model enough to overcome a penalty term built into the fit criterion. Other criteria, based on predicted (simulated) residual sums o f squares or a validation data set, are available.


5-54

Chapter 5 introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

Model Essentials - MBR Predict new cases.

Training data nearest neighbors

Select useful inputs.

None

Optimize complexity.

Number of neighbors

63

Memory Based Reasoning ( M B R ) is the implementation o f ^-nearest neighbor prediction i n SAS Enterprise Miner. M B R predictions should be limited to decisions rather than rankings or estimates. Decisions are made based on the prevalence o f each target level in the nearest leases (16 by default). A majority o f primary outcome cases results i n a primary decision, and a majority o f secondary outcome cases results in a secondary decision. Nearest neighbor algorithms have no means to select inputs and can easily fall victim to the curse o f dimensionality. Complexity o f the model can be adjusted by changing the number o f neighbors considered for prediction.


5.4 Other Modeling Tools (Self-Study)

5-55

MBR Prediction Estimates • Sixteen nearest training data cases predict the target for each point in the input space.

0.6

• Scoring requires training data and the PMBR procedure.

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0 8 0.9 1.0 *1

G4 By default, the prediction decision made for each point in the input space is determined by the target values o f the 16 nearest training data cases. Increasing the number o f cases involved in the decision tends to smooth the prediction boundaries, which can be quite complex for small neighbor counts. Unlike other prediction tools where scoring is performed by DATA step functions independent o f the original training data, MBR scoring requires both the original training data set and a specialized SAS procedure, PMBR.


5-56

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

Model Essentials - Partial Least Squares Predict new cases.

Prediction formula L

Select useful inputs.

VIP

Optimize complexity.

Sequential factor extraction L.

65

In partial least squares regression (PLS) modeling, model weights are chosen to simultaneously account for variability in both the target and the inputs. In general, model fit statistics like average squared error w i l l be worse for a PLS model than for a standard regression model using the same inputs. A distinction o f the PLS algorithm, however, is its ability to produce meaningful predictions based on limited data (for example, when there are more inputs than modeling cases). A variable selection method named variable importance in the projection arises naturally from the PLS framework. In PLS, regression latent components, consisting o f linear combinations o f original inputs, are sequentially added. The final model complexity is chosen by optimizing a model fit statistic on validation data.


5.4 Other Modeling Tools (Self-Study)

5-57

Partial Least Squares Predictions •

Input combinations (factors) that optimally account for both predictor and response variation are successively selected.

• Factor count with a minimum validation PRESS statistic is selected. •

Inputs with small VIP are rejected for subsequent diagram nodes.

66

The PLS predictions look similar to a standard regression model. In the algorithm, input combinations, called factors, which are correlated with both input and target distributions, are selected. This choice is based on a projection given in the SAS/STAT PLS procedure documentation. Factors are sequentially added. The factor count with the lowest validation PRESS statistic (or equivalently, average squared error) is selected. In the PLS tool, the variable importance in the projection (VIP) for each input is calculated. Inputs with VIP less than 0.8 are rejected for subsequent nodes in the process flow.


5-58

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

Exercises 1. Predictive Modeling Using Neural Networks a. In preparation for a neural network model, is imputation o f missing values needed? _ Why or why not? b.

In preparation for a neural network model, is data transformation generally needed? _ Why or why not?

c. A d d a Neural Network tool to the Organics diagram. Connect the Impute node to the Neural Network node. d. Set the model selection criterion to average squared error. e.

Run the Neural Network node and examine the validation average squared error. How does it compare to other models?


5.5 Chapter Summary

5-59

5.5 Chapter Summary Neural networks are a powerful extension o f standard regression models. They have the ability to model virtually any association between a set o f inputs and a target. Being regression-like models, they share some o f the same prediction complications faced by regression models (missing values, categorical inputs, extreme values). Their performance is reasonable even without input selection, but it can be marginally improved by selecting inputs with an external variable selection process. Most o f their predictive success comes from stopped training, a technique used to prevent overfitting. Network complexity is marginally influenced by the number o f hidden units selected in the model. This number can be selected manually with the Neural Network tool, or it can be selected automatically using the AutoNeural tool. SAS Enterprise Miner includes several other flexible modeling tools. These specialized tools include Rule Induction, Dmine Regression, D M Neural, and M B R .

Neural Network Tool Review <

HfwNwralNrtworii

67

Create a multi-layer perceptron on selected inputs. Control complexity with stopped training and hidden unit count.


5-60

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

5.6 Solutions Solutions to Exercises 1. Predictive Modeling Using Neural Networks a. In preparation for a neural network model, is imputation o f missing values needed? Yes. Why or why not? Neural network models, as well as most models relying on a prediction formula, require a complete record for both modeling and scoring. b. In preparation for a neural network model, is data transformation generally needed? Not necessarily. Why or why not? Neural network models create transformations of inputs for use in a regression-like model. However, having input distributions with low skewness and kurtosis tends to result in more stable models. c. Add a Neural Network tool to the Organics diagram. Connect the Impute node to the Neural Network node.


5.6 Solutions

d.

Set the model selection criterion to average squared error. Value Property General Neural Node ID Imported Data Exported Data Notes Train Variables Network Model Selection Criterior Averaqe Error Use Current Estimates No 20 rMaximurn Iterations 4 Hours '-Maximum Time Default '•Training Technique [^Preliminary Training Opt Preliminary Training Yes 10 rMaximum Iterations 1 Hour r Maximum Time •-Number of Runs 5 Hconverqence Criteria Yes !-Uses Defaults -Options 0Print Options No '-Suppress Output :

I

• • y

i ...i i ...

y

5-61


5-62

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

e.

Run the Neural Network node and examine the validation average squared error. 1) The Result window should appear as shown below. •ft 149.I73.W.I1B-Remote Desktop Targets uy Targe [Buy TsrgelBuy TargelBuy raigelBuy riigetsur fjigelBuy TiigelBuy TaigetBuy TiigetBuy TaigetBuy TaigetBuy TaigetBuy Targets uy TargelBuy

J*M_ _DIV_ _NOBS_ _HASE_ _SSE_ _SUMW_ _FPE_ _MSE_ _RFPE_ _RWSE_ _AVERR_ _ERR_ _MISC_ _YVRONQ_

Average So... Maximum A... DrnSOrfoiA.. Sum of Fie.. RoolAiera. SumofSqu . SumofCas Final Pied it MeanSo.ua Root Final Root Mean .. Averaga En Eirar Fundi . Mrsc las sale. Number o f .

0 13335! 0978743 222J1 11113 0 355037 2951.309 22324 013G333 0 134792 0 389233 0 367141 0 41053 9323 636 0 183315 2037

0.133209 0.979105 21222 11111 0.364978 2900 168 22222

inlzl

0133209 0.364 97 B 0 416894 9308.673 0.185852 2065

Student J u l f 0 7 , 2009 L4;25: S3 T r a i n i n g Output

Var 1 at} 1 e S u u a c y Tiequency Cmifit nunm a r

BIVASY IETEBVAL

IDTFOT PZJECTED TAPGET

Trainlna Iterations

NOMINAL BIHAEY

A


5-6 Solutions

2)

5-63

Examine the Fit Statistics window. | g j Fit Statistics Target [TargetB uy TargetBuy TargelBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy (TargetBuy TargetBuy TargetBuy TargetBuy

Fit Statistics _DFT_ _DFE_ _DFM_ _NW_ _AIC_ _SBC_ „ASE_ _MAX_ _DIV_ _NOBS_ _RASE_ _SSE_ _SUMW_ _FPE_ _MSE_ _RFPE_ _RMSE_ _AVERR_ _ERR_ _MISC_ _WRONG_

Statistics Label Total Degre... Degrees of... Model Degr... Number of... Akaike's Inf... Schwarz's... Average Sq... Maximum A... Divisor for A... SumofFre... RootAvera... Sum of Squ... Sum of Cas... Final Predic... Mean Squ a... Root Final... Root Mean... Average Err... Error Functi... Misclassific... Number of...

Train

Validation 11112 10985 127

HUB

9577.636 10506.74 0.133252 0.978743 22224 11112 0.365037 2961.389 22224 0.136333 0.134792 0.369233 0.367141 0.41953 9323.636 0.183315 2037

0.133209 0.979105 22222 11111 0.364978 2960.168 22222 0.133209 0.364978 0.418894 9308.673 0.185852 2065

How does it compare to other models? The validation ASE for the neural network model is slightly smaller than the standard regression model, nearly the same as the polynomial regression, and slightly larger than the decision tree's ASE.


5-64

Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools


Turn static files into dynamic content formats.

Create a flipbook
Issuu converts static files into: digital portfolios, online yearbooks, online catalogs, digital photo albums and more. Sign up and create your flipbook.