Applied Analytics Using SAS® Enterprise Miner™
Course Notes

Applied Analytics Using SAS® Enterprise Miner™ Course Notes was developed by Jim Georges, Jeff Thompson, and Chip Wells. Additional contributions were made by Tom Bohannon, Mike Hardin, Dan Kelly, Bob Lucas, and Sue Walsh. Editing and production support was provided by the Curriculum Development and Support Department.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

Applied Analytics Using SAS® Enterprise Miner™ Course Notes

Copyright © 2010 SAS Institute Inc. Cary, NC, USA. All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

Book code E1755, course code LWAAEM61/AAEM61, prepared date 01Jul2010.
LWAAEM61_002
ISBN 978-1-60764-593-1
Table of Contents

Course Description ........................................................... ix
Prerequisites ................................................................. x

Chapter 1   Introduction ..................................................... 1-1
   1.1  Introduction to SAS Enterprise Miner ................................ 1-3

Chapter 2   Accessing and Assaying Prepared Data ............................. 2-1
   2.1  Introduction ........................................................ 2-3
   2.2  Creating a SAS Enterprise Miner Project, Library, and Diagram ....... 2-5
        Demonstration: Creating a SAS Enterprise Miner Project .............. 2-6
        Demonstration: Creating a SAS Library ............................... 2-10
        Demonstration: Creating a SAS Enterprise Miner Diagram .............. 2-13
        Exercises ........................................................... 2-14
   2.3  Defining a Data Source .............................................. 2-15
        Demonstration: Defining a Data Source ............................... 2-18
        Exercises ........................................................... 2-35
   2.4  Exploring a Data Source ............................................. 2-36
        Demonstration: Exploring Source Data ................................ 2-37
        Demonstration: Changing the Explore Window Sampling Defaults ........ 2-64
        Exercises ........................................................... 2-66
        Demonstration: Modifying and Correcting Source Data ................. 2-67
   2.5  Chapter Summary ..................................................... 2-80

Chapter 3   Introduction to Predictive Modeling: Decision Trees .............. 3-1
   3.1  Introduction ........................................................ 3-3
        Demonstration: Creating Training and Validation Data ................ 3-23
   3.2  Cultivating Decision Trees .......................................... 3-28
        Demonstration: Constructing a Decision Tree Predictive Model ........ 3-43
   3.3  Optimizing the Complexity of Decision Trees ......................... 3-63
        Demonstration: Assessing a Decision Tree ............................ 3-79
   3.4  Understanding Additional Diagnostic Tools (Self-Study) .............. 3-98
        Demonstration: Understanding Additional Plots and Tables (Optional) . 3-99
   3.5  Autonomous Tree Growth Options (Self-Study) ......................... 3-106
        Exercises ........................................................... 3-114
   3.6  Chapter Summary ..................................................... 3-116
   3.7  Solutions ........................................................... 3-117
        Solutions to Exercises .............................................. 3-117
        Solutions to Student Activities (Polls/Quizzes) ..................... 3-133

Chapter 4   Introduction to Predictive Modeling: Regressions ................. 4-1
   4.1  Introduction ........................................................ 4-3
        Demonstration: Managing Missing Values .............................. 4-18
        Demonstration: Running the Regression Node .......................... 4-25
   4.2  Selecting Regression Inputs ......................................... 4-29
        Demonstration: Selecting Inputs ..................................... 4-33
   4.3  Optimizing Regression Complexity .................................... 4-38
        Demonstration: Optimizing Complexity ................................ 4-40
   4.4  Interpreting Regression Models ...................................... 4-49
        Demonstration: Interpreting a Regression Model ...................... 4-51
   4.5  Transforming Inputs ................................................. 4-52
        Demonstration: Transforming Inputs .................................. 4-56
   4.6  Categorical Inputs .................................................. 4-66
        Demonstration: Recoding Categorical Inputs .......................... 4-69
   4.7  Polynomial Regressions (Self-Study) ................................. 4-76
        Demonstration: Adding Polynomial Regression Terms Selectively ....... 4-78
        Demonstration: Adding Polynomial Regression Terms Autonomously
        (Self-Study) ........................................................ 4-85
        Exercises ........................................................... 4-88
   4.8  Chapter Summary ..................................................... 4-89
   4.9  Solutions ........................................................... 4-91
        Solutions to Exercises .............................................. 4-91
        Solutions to Student Activities (Polls/Quizzes) ..................... 4-102

Chapter 5   Introduction to Predictive Modeling: Neural Networks and Other
            Modeling Tools ................................................... 5-1
   5.1  Introduction ........................................................ 5-3
        Demonstration: Training a Neural Network ............................ 5-12
   5.2  Input Selection ..................................................... 5-20
        Demonstration: Selecting Neural Network Inputs ...................... 5-21
   5.3  Stopped Training .................................................... 5-24
        Demonstration: Increasing Network Flexibility ....................... 5-35
        Demonstration: Using the AutoNeural Tool (Self-Study) ............... 5-39
   5.4  Other Modeling Tools (Self-Study) ................................... 5-46
        Exercises ........................................................... 5-58
   5.5  Chapter Summary ..................................................... 5-59
   5.6  Solutions ........................................................... 5-60
        Solutions to Exercises .............................................. 5-60

Chapter 6   Model Assessment ................................................. 6-1
   6.1  Model Fit Statistics ................................................ 6-3
        Demonstration: Comparing Models with Summary Statistics ............. 6-6
   6.2  Statistical Graphics ................................................ 6-9
        Demonstration: Comparing Models with ROC Charts ..................... 6-13
        Demonstration: Comparing Models with Score Rankings Plots ........... 6-18
        Demonstration: Adjusting for Separate Sampling ...................... 6-21
   6.3  Adjusting for Separate Sampling ..................................... 6-26
        Demonstration: Adjusting for Separate Sampling (continued) .......... 6-29
        Demonstration: Creating a Profit Matrix ............................. 6-32
   6.4  Profit Matrices ..................................................... 6-39
        Demonstration: Evaluating Model Profit .............................. 6-42
        Demonstration: Viewing Additional Assessments ....................... 6-44
        Demonstration: Optimizing with Profit (Self-Study) .................. 6-46
        Exercises ........................................................... 6-48
   6.5  Chapter Summary ..................................................... 6-49
   6.6  Solutions ........................................................... 6-50
        Solutions to Exercises .............................................. 6-50

Chapter 7   Model Implementation ............................................. 7-1
   7.1  Introduction ........................................................ 7-3
   7.2  Internally Scored Data Sets ......................................... 7-4
        Demonstration: Creating a Score Data Source ......................... 7-5
        Demonstration: Scoring with the Score Tool .......................... 7-6
        Demonstration: Exporting a Scored Table (Self-Study) ................ 7-9
   7.3  Score Code Modules .................................................. 7-15
        Demonstration: Creating a SAS Score Code Module ..................... 7-16
        Demonstration: Creating Other Score Code Modules .................... 7-22
        Exercises ........................................................... 7-24
   7.4  Chapter Summary ..................................................... 7-25
   7.5  Solutions to Exercises .............................................. 7-26

Chapter 8   Introduction to Pattern Discovery ................................ 8-1
   8.1  Introduction ........................................................ 8-3
   8.2  Cluster Analysis .................................................... 8-8
        Demonstration: Segmenting Census Data ............................... 8-15
        Demonstration: Exploring and Filtering Analysis Data ................ 8-24
        Demonstration: Setting Cluster Tool Options ......................... 8-37
        Demonstration: Creating Clusters with the Cluster Tool .............. 8-41
        Demonstration: Specifying the Segment Count ......................... 8-44
        Demonstration: Exploring Segments ................................... 8-46
        Demonstration: Profiling Segments ................................... 8-56
        Exercises ........................................................... 8-61
   8.3  Market Basket Analysis (Self-Study) ................................. 8-63
        Demonstration: Market Basket Analysis ............................... 8-67
        Demonstration: Sequence Analysis .................................... 8-80
        Exercises ........................................................... 8-83
   8.4  Chapter Summary ..................................................... 8-85
   8.5  Solutions ........................................................... 8-86
        Solutions to Exercises .............................................. 8-86

Chapter 9   Special Topics ................................................... 9-1
   9.1  Introduction ........................................................ 9-3
   9.2  Ensemble Models ..................................................... 9-4
        Demonstration: Creating Ensemble Models ............................. 9-5
   9.3  Variable Selection .................................................. 9-9
        Demonstration: Using the Variable Selection Node .................... 9-10
        Demonstration: Using Partial Least Squares for Input Selection ...... 9-14
        Demonstration: Using the Decision Tree Node for Input Selection ..... 9-19
   9.4  Categorical Input Consolidation ..................................... 9-22
        Demonstration: Consolidating Categorical Inputs ..................... 9-23
   9.5  Surrogate Models .................................................... 9-31
        Demonstration: Describing Decision Segments with Surrogate Models ... 9-32

Appendix A  Case Studies ..................................................... A-1
   A.1  Banking Segmentation Case Study ..................................... A-3
   A.2  Web Site Usage Associations Case Study .............................. A-16
   A.3  Credit Risk Case Study .............................................. A-19
   A.4  Enrollment Management Case Study .................................... A-36

Appendix B  Bibliography ..................................................... B-1
   B.1  References .......................................................... B-3

Appendix C  Index ............................................................ C-1
Course Description This course covers the skills required to assemble analysis flow diagrams using the rich tool set of SAS Enterprise Miner for both pattern discovery (segmentation, association, and sequence analyses) and predictive modeling (decision tree, regression, and neural network models).
To learn more...
SAS Education
For information on other courses in the curriculum, contact the SAS Education Division at 1-800-333-7660, or send e-mail to training@sas.com. You can also find this information on the Web at support.sas.com/training/ as well as in the Training Course Catalog.

SAS Publishing
For a list of other SAS books that relate to the topics covered in these Course Notes, USA customers can contact our SAS Publishing Department at 1-800-727-3228 or send e-mail to sasbook@sas.com. Customers outside the USA, please contact your local SAS office. Also, see the Publications Catalog on the Web at support.sas.com/pubs for a complete list of books and a convenient order form.
Prerequisites

Before attending this course, you should be acquainted with Microsoft Windows and Windows-based software. In addition, you should have at least an introductory-level familiarity with basic statistics and regression modeling. Previous SAS software experience is helpful but not required.
Chapter 1 Introduction

1.1 Introduction to SAS Enterprise Miner
[Slide: SAS Enterprise Miner]

The SAS Enterprise Miner 6.1 interface simplifies many common tasks associated with applied analysis. It offers secure analysis management and provides a wide variety of tools with a consistent graphical interface. You can customize it by incorporating your choice of analysis methods and tools.
[Slide: SAS Enterprise Miner - Interface Tour (menu bar and shortcut buttons)]

The interface window is divided into several functional components. The menu bar and corresponding shortcut buttons perform the usual Windows tasks, in addition to starting, stopping, and reviewing analyses.
[Slide: SAS Enterprise Miner - Interface Tour (Project panel)]

The Project panel is used to manage and view data sources, diagrams, results, and project users.
[Slide: SAS Enterprise Miner - Interface Tour (Properties panel)]

The Properties panel enables you to view and edit the settings of data sources, diagrams, nodes, results, and users.
[Slide: SAS Enterprise Miner - Interface Tour (Help panel)]

The Help panel displays a short description of the property that you select in the Properties panel. Extended help can be found in the Help Topics selection from the Help main menu.
[Slide: SAS Enterprise Miner - Interface Tour (diagram workspace)]

In the Diagram workspace, process flow diagrams are built, edited, and run. The workspace is where you graphically sequence the tools that you use to analyze your data and generate reports.
[Slide: SAS Enterprise Miner - Interface Tour (process flow)]

The Diagram workspace contains one or more process flows. A process flow starts with a data source and sequentially applies SAS Enterprise Miner tools to complete your analytic objective.
[Slide: SAS Enterprise Miner - Interface Tour (node)]

A process flow contains several nodes. Nodes are SAS Enterprise Miner tools connected by arrows to show the direction of information flow in an analysis.
[Slide: SAS Enterprise Miner - Interface Tour (tools palette)]
The SAS Enterprise Miner tools available to your analysis are contained in the tools palette. The tools palette is arranged according to a process for data mining, SEMMA. SEMMA is an acronym for the following:

Sample: You sample the data by creating one or more data tables. The samples should be large enough to contain the significant information, but small enough to process.

Explore: You explore the data by searching for anticipated relationships, unanticipated trends, and anomalies in order to gain understanding and ideas.

Modify: You modify the data by creating, selecting, and transforming the variables to focus the model selection process.

Model: You model the data by using the analytical tools to search for a combination of the data that reliably predicts a desired outcome.

Assess: You assess competing predictive models (build charts to evaluate the usefulness and reliability of the findings from the data mining process).

Additional tools are available under the Utility group and, if licensed, the Credit Scoring group.
[Slide: SEMMA - Sample Tab. Tools: Append, Data Partition, File Import, Filter, Input Data, Merge, Sample, Time Series]
The tools in each Tool tab are arranged alphabetically.

The Append tool is used to append data sets that are exported by two different paths in a single process flow diagram. The Append node can also append train, validation, and test data sets into a new training data set.

The Data Partition tool enables you to partition data sets into training, test, and validation data sets. The training data set is used for preliminary model fitting. The validation data set is used to monitor and tune the model during estimation and is also used for model assessment. The test data set is an additional holdout data set that you can use for model assessment. This tool uses simple random sampling, stratified random sampling, or cluster sampling to create partitioned data sets. (A code sketch of an equivalent partition appears at the end of this tool listing.)

The File Import tool enables you to convert selected external flat files, spreadsheets, and database tables into a format that SAS Enterprise Miner recognizes as a data source.

The Filter tool creates and applies filters to your training data set, and optionally, to the validation and test data sets. You can use filters to exclude certain observations, such as extreme outliers and errant data that you do not want to include in your mining analysis.

The Input Data tool represents the data source that you choose for your mining analysis and provides details (metadata) about the variables in the data source that you want to use.

The Merge tool enables you to merge observations from two or more data sets into a single observation in a new data set. The Merge tool supports both one-to-one and match merging.

The Sample tool enables you to take simple random samples, observation samples, stratified random samples, first-n samples, and cluster samples of data sets. For any type of sampling, you can specify either a number of observations or a percentage of the population to select for the sample. If you are working with rare events, the Sample tool can be configured for oversampling or stratified sampling. Sampling is recommended for extremely large databases because it can significantly decrease model training time. If the sample is sufficiently representative, relationships found in the sample can be expected to generalize to the complete data set. The Sample tool writes the sampled observations to an output data set and saves the seed values that are used to generate the random numbers for the samples so that you can replicate the samples.
The Time Series tool converts transactional data to time series data. Transactional data is time-stamped data that is collected over time at no particular frequency. By contrast, time series data is time-stamped data that is summarized over time at a specific frequency. You might have many suppliers and many customers, as well as transaction data that is associated with both. The size of each set of transactions can be very large, which makes many traditional data mining tasks difficult. By condensing the information into a time series, you can discover trends and seasonal variations in customer and supplier habits that might not be visible in transactional data.
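The partitioning and sampling that these tools automate can also be written directly in SAS. The following is a minimal sketch of a simple random partition into training and validation sets, similar in spirit to the Data Partition tool's simple random sampling method; the input table name and the 50/50 split are assumptions for illustration, not taken from the course data.

   /* Minimal sketch: a 50/50 simple random partition.
      The input table name aaem61.pva_raw_data is hypothetical. */
   data work.train work.validate;
      set aaem61.pva_raw_data;
      if ranuni(27513) < 0.5 then output work.train;   /* fixed seed so the split is repeatable */
      else output work.validate;
   run;

Like the Data Partition tool, this approach keeps the split reproducible by fixing the random number seed.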
[Slide: SEMMA - Explore Tab. Tools: Association, Cluster, DMDB, Graph Explore, Market Basket, MultiPlot, Path Analysis, SOM/Kohonen, StatExplore, Variable Clustering, Variable Selection]
The Association tool enables you to perform association discovery to identify items that tend to occur together within the data. For example, if a customer buys a loaf of bread, how likely is the customer to also buy a gallon of milk? This type of discovery is also known as market basket analysis. (A small worked example appears at the end of this page's tool descriptions.) The tool also enables you to perform sequence discovery if a timestamp variable (a sequence variable) is present in the data set. This enables you to take into account the ordering of the relationships among items.

The Cluster tool enables you to segment your data; that is, it enables you to identify data observations that are similar in some way. Observations that are similar tend to be in the same cluster, and observations that are different tend to be in different clusters. The cluster identifier for each observation can be passed to subsequent tools in the diagram.

The DMDB tool creates a data mining database that provides summary statistics and factor-level information for class and interval variables in the imported data set.

The Graph Explore tool is an advanced visualization tool that enables you to explore large volumes of data graphically to uncover patterns and trends and to reveal extreme values in the database. The tool creates a run-time sample of the input data source. You use the Graph Explore node to interactively explore and analyze your data using graphs. Your exploratory graphs are persisted when the Graph Explore Results window is closed. When you reopen the Graph Explore Results window, the persisted graphs are re-created.

The experimental Market Basket tool performs association rule mining over transaction data in conjunction with item taxonomy. Transaction data contains sales transaction records with details about items bought by customers. Market basket analysis uses the information from the transaction data to give you insight about which products tend to be purchased together.

The MultiPlot tool is a visualization tool that enables you to explore large volumes of data graphically. The MultiPlot tool automatically creates bar charts and scatter plots for the input and target. The code created by this tool can be used to create graphs in a batch environment.

The Path Analysis tool enables you to analyze Web log data to determine the paths that visitors take as they navigate through a Web site. You can also use the tool to perform sequence analysis.
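To make the Association tool's support and confidence measures concrete, consider a small worked example with hypothetical counts. Suppose 1,000 transactions are analyzed, 200 of them contain bread, and 120 contain both bread and milk. Then

   support(bread => milk)    = 120 / 1,000 = 12%
   confidence(bread => milk) = 120 / 200   = 60%

That is, 12% of all baskets contain the pair, and a customer who buys bread also buys milk 60% of the time.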
The SOM/Kohonen tool performs unsupervised learning by using Kohonen vector quantization (VQ), Kohonen self-organizing maps (SOMs), or batch SOMs with Nadaraya-Watson or local-linear smoothing. Kohonen VQ is a clustering method, whereas SOMs are primarily dimension-reduction methods. For cluster analysis, the Clustering tool is recommended instead of Kohonen VQ or SOMs.

The StatExplore tool is a multipurpose tool used to examine variable distributions and statistics in your data sets. The tool generates summarization statistics. (A code sketch of comparable summary statistics follows this tool listing.) You can use the StatExplore tool to do the following:
• select variables for analysis, for profiling clusters, and for predictive models
• compute standard univariate distribution statistics
• compute standard bivariate statistics by class target and class segment
• compute correlation statistics for interval variables by interval input and target

The Variable Clustering tool is useful for data reduction, such as choosing the best variables or cluster components for analysis. Variable clustering removes collinearity, decreases variable redundancy, and helps to reveal the underlying structure of the input variables in a data set.

The Variable Selection tool enables you to evaluate the importance of input variables in predicting or classifying the target variable. To select the important inputs, the tool uses either an R-squared or a Chi-squared selection criterion. The R-squared criterion enables you to remove variables in hierarchies, remove variables that have large percentages of missing values, and remove class variables that are based on the number of unique values. The variables that are not related to the target are set to a status of rejected. Although rejected variables are passed to subsequent tools in the process flow diagram, these variables are not used as model inputs by more detailed modeling tools, such as the Neural Network and Decision Tree tools. You can reassign the input model status to rejected variables.
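The univariate statistics that the StatExplore tool reports can also be produced directly with Base SAS. A minimal sketch follows; the table name is an assumption for illustration.

   /* Univariate distribution statistics of the kind StatExplore
      reports for interval variables. Table name is hypothetical. */
   proc means data=aaem61.pva_raw_data n nmiss mean std min max;
   run;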
[Slide: SEMMA - Modify Tab. Tools: Drop, Impute, Interactive Binning, Principal Components, Replacement, Rules Builder, Transform Variables]
The Drop tool is used to remove variables from scored data sets. You can remove all variables with the role type that you specify, or you can manually specify individual variables to drop. For example, you could remove all hidden, rejected, and residual variables from your exported data set, or you could remove only a few variables that you identify yourself.

The Impute tool enables you to replace values for observations that have missing values. You can replace missing values for interval variables with the mean, median, midrange, mid-minimum spacing, or distribution-based replacement, or you can use a replacement M-estimator such as Tukey's biweight, Huber's, or Andrew's Wave. You can also estimate the replacement values for each interval input by using a tree-based imputation method. Missing values for class variables can be replaced with the most frequently occurring value, distribution-based replacement, tree-based imputation, or a constant. (A code sketch of simple mean imputation appears at the end of this page's tool descriptions.)

The Interactive Binning tool is an interactive grouping tool that you use to model nonlinear functions of multiple modes of continuous distributions. The Interactive Binning tool computes initial bins by quantiles. Then you can interactively split and combine the initial bins. You use the Interactive Binning node to create bins or buckets or classes of all input variables, which include both class and interval input variables. You can create bins in order to reduce the number of unique levels as well as attempt to improve the predictive power of each input.

The Principal Components tool calculates eigenvalues and eigenvectors from the uncorrected covariance matrix, corrected covariance matrix, or the correlation matrix of input variables. Principal components are calculated from the eigenvectors and are usually treated as the new set of input variables for successor modeling tools. A principal components analysis is useful for data interpretation and data dimension reduction.

The Replacement tool enables you to reassign and consolidate levels of categorical inputs. This can improve the performance of predictive models.

The Rules Builder tool opens the Rules Builder window so that you can create ad hoc sets of rules with user-definable outcomes. You can interactively define the values of the outcome variable and the paths to the outcome. This is useful in ad hoc rule creation such as applying logic for posterior probabilities and scorecard values.
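As a point of reference for the Impute tool, simple mean imputation can be sketched outside the GUI with the SAS/STAT STDIZE procedure; the REPONLY option replaces only missing values rather than standardizing every value. The table and variable names below are hypothetical.

   /* Replace missing values of two interval inputs with their means. */
   proc stdize data=work.train out=work.train_imputed
               method=mean reponly;
      var GiftAvg36 DemAge;   /* hypothetical variable names */
   run;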
The Transform Variables tool enables you to create new variables that are transformations of existing variables in your data. Transformations are useful when you want to improve the fit of a model to the data. For example, transformations can be used to stabilize variances, remove nonlinearity, improve additivity, and correct nonnormality in variables. The Transform Variables tool supports various transformation methods. The available methods depend on the type and the role of a variable.
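For example, a log transformation is a common way to correct right skewness in a monetary input. A minimal sketch, assuming a hypothetical table and variable name:

   /* Create a log-transformed copy of a skewed interval input. */
   data work.transformed;
      set work.train;
      LogMedIncome = log(DemMedIncome + 1);   /* +1 guards against log(0) */
   run;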
[Slide: SEMMA - Model Tab. Tools: AutoNeural, Decision Tree, Dmine Regression, DMNeural, Ensemble, Gradient Boosting, Least Angle Regression, MBR, Model Import, Neural Network, Partial Least Squares, Regression, Rule Induction, Two Stage]
The AutoNeural tool can be used to automatically configure a neural network. It conducts limited searches for a better network configuration.

The Decision Tree tool enables you to perform multiway splitting of your database based on nominal, ordinal, and continuous variables. The tool supports both automatic and interactive training. When you run the Decision Tree tool in automatic mode, it automatically ranks the input variables based on the strength of their contributions to the tree. This ranking can be used to select variables for use in subsequent modeling. In addition, dummy variables can be generated for use in subsequent modeling. You can override any automatic step with the option to define a splitting rule and prune explicit nodes or subtrees. Interactive training enables you to explore and evaluate a large set of trees as you develop them.

The Dmine Regression tool performs a regression analysis on data sets that have a binary or interval level target variable. The Dmine Regression tool computes a forward stepwise least squares regression. In each step, an independent variable is selected that contributes maximally to the model R-square value. The tool can compute all two-way interactions of classification variables, and it can also use AOV16 variables to identify nonlinear relationships between interval variables and the target variable. In addition, the tool can use group variables to reduce the number of levels of classification variables.

Note: If you want to create a regression model on data that contains a nominal or ordinal target, then you would use the Regression tool.

The DMNeural tool is another modeling tool that you can use to fit a nonlinear model. The nonlinear model uses transformed principal components as inputs to predict a binary or an interval target variable.
The Ensemble tool creates new models by combining the posterior probabilities (for class targets) or the predicted values (for interval targets) from multiple predecessor models. The new model is then used to score new data. One common ensemble approach is to use multiple modeling methods, such as a neural network and a decision tree, to obtain separate models from the same training data set. The component models from the two complementary modeling methods are integrated by the Ensemble tool to form the final model solution. It is important to note that the ensemble model can only be more accurate than the individual models if the individual models disagree with one another. You should always compare the model performance of the ensemble model with the individual models. You can compare models using the Model Comparison tool.

The Gradient Boosting tool uses a partitioning algorithm described in "A Gradient Boosting Machine" and "Stochastic Gradient Boosting" by Jerome Friedman. A partitioning algorithm searches for an optimal partition of the data defined in terms of the values of a single variable. The optimality criterion depends on how another variable, the target, is distributed into the partition segments. When the target values are more similar within the segments, the worth of the partition is greater. Most partitioning algorithms further partition each segment in a process called recursive partitioning. The partitions are then combined to create a predictive model. The model is evaluated by goodness-of-fit statistics defined in terms of the target variable. These statistics are different than the measure of worth of an individual partition. A good model might result from many mediocre partitions.

The Least Angle Regression (LARS) tool can be used for both input variable selection and model fitting. When used for variable selection, the LAR algorithm chooses input variables in a continuous fashion that is similar to Forward selection. The basis of variable selection is the magnitude of the candidate inputs' estimated coefficients as they grow from zero to the least squares estimate. Either a LARS or LASSO algorithm can be used for model fitting. See Efron et al. (2004) and Hastie, Tibshirani, and Friedman (2001) for further details.

The Memory-Based Reasoning (MBR) tool is a modeling tool that uses a k-nearest neighbor algorithm to categorize or predict observations. The k-nearest neighbor algorithm takes a data set and a probe, where each observation in the data set consists of a set of variables and the probe has one value for each variable. The distance between an observation and the probe is calculated. The k observations that have the smallest distances to the probe are the k-nearest neighbors to that probe. In SAS Enterprise Miner, the k-nearest neighbors are determined by the Euclidean distance between an observation and the probe. Based on the target values of the k-nearest neighbors, each of the k-nearest neighbors votes on the target value for a probe. The votes are the posterior probabilities for the class target variable.

The Model Import tool imports and assesses a model that was not created by one of the SAS Enterprise Miner modeling nodes. You can then use the Assessment node to compare the user-defined model(s) with a model(s) that you developed with a SAS Enterprise Miner modeling node. This process is called integrated assessment.

The Neural Network tool enables you to construct, train, and validate multilayer feed-forward neural networks.
In general, each input is fully connected to the first hidden layer, each hidden layer is fully connected to the next hidden layer, and the last hidden layer is fully connected to the output. The Neural Network tool supports many variations of this general form. The Partial Least Squares tool models continuous and binary targets based on the SAS/STAT PLS procedure. The Partial Least Squares node produces DATA step score code and standard predictive model assessment results.
The Regression tool enables you to fit both linear and logistic regression models to your data. You can use continuous, ordinal, and binary target variables. You can use both continuous and discrete variables as inputs. The tool supports the stepwise, forward, and backward selection methods. The interface enables you to create higher-order modeling terms such as polynomial terms and interactions. (A code sketch of a comparable stepwise logistic regression follows these descriptions.)

The Rule Induction tool enables you to improve the classification of rare events in your modeling data. The Rule Induction tool creates a Rule Induction model that uses split techniques to remove the largest pure split from the data. Rule induction also creates binary models for each level of a target variable and ranks the levels from the rarest event to the most common.

The TwoStage tool enables you to model a class target and an interval target. The interval target variable is usually the value that is associated with a level of the class target. For example, the binary variable PURCHASE is a class target that has two levels, Yes and No, and the interval variable AMOUNT can be the value target that represents the amount of money that a customer spends on the purchase.
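For readers who know SAS/STAT, the Regression tool's behavior for a binary target resembles a stepwise logistic regression. The sketch below is an approximation of that idea, not the node's exact implementation; all data set and variable names are hypothetical.

   /* Stepwise logistic regression for a binary target (sketch). */
   proc logistic data=work.train;
      class StatusCat96NK / param=ref;
      model TargetB(event='1') = GiftCnt36 GiftAvgAll DemAge StatusCat96NK
            / selection=stepwise slentry=0.05 slstay=0.05;
   run;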
[Slide: SEMMA - Assess Tab. Tools: Cutoff, Decisions, Model Comparison, Segment Profile, Score]
The Cutoff tool provides tabular and graphical information to assist users in determining appropriate probability cutoff point(s) for decision making with binary target models. The establishment of a cutoff decision point entails the risk of generating false positives and false negatives, but an appropriate use of the Cutoff node can help minimize those risks.

The Decisions tool enables you to define target profiles for a target that produces optimal decisions. The decisions are made using a user-specified decision matrix and output from a subsequent modeling procedure.

The Model Comparison tool provides a common framework for comparing models and predictions from any of the modeling tools. The comparison is based on the expected and actual profits or losses that would result from implementing the model. The tool produces several charts that help to describe the usefulness of the model, such as lift charts and profit/loss charts.

The Segment Profile tool enables you to examine segmented or clustered data and identify factors that differentiate data segments from the population. The tool generates various reports that aid in exploring and comparing the distribution of these factors within the segments and population.

The Score tool enables you to manage, edit, export, and execute scoring code that is generated from a trained model. Scoring is the generation of predicted values for a data set that might not contain a target variable. The Score tool generates and manages scoring formulas in the form of a single SAS DATA step, which can be used in most SAS environments even without the presence of SAS Enterprise Miner. The Score tool can also generate C score code and Java score code.
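Because the exported score code is an ordinary DATA step fragment, deployment can be as simple as including it against a new table. A minimal sketch, assuming hypothetical file and table names:

   /* Apply exported score code to a table of new cases. */
   data work.scored;
      set work.new_prospects;
      %include "C:\scorecode\score_code.sas";   /* hypothetical path */
   run;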
[Slide: Beyond SEMMA - Utility Tab. Tools: Control Point, End Groups, Metadata, Reporter, SAS Code, Start Groups]
The Control Point tool enables you to establish a control point to reduce the number of connections that are made in process flow diagrams. For example, suppose that three Input Data Source tools are to be connected to three modeling tools. If no Control Point tool is used, then nine connections are required to connect all of the Input Data Source tools to all of the modeling tools. However, if a Control Point tool is used, only six connections are required.

The End Groups tool is used only in conjunction with the Start Groups tool. The End Groups node acts as a boundary marker that defines the end-of-group processing operations in a process flow diagram. Group processing operations are performed on the portion of the process flow diagram that exists between the Start Groups node and the End Groups node.

The Metadata tool enables you to modify the columns metadata information at some point in your process flow diagram. You can modify attributes such as roles, measurement levels, and order.

The Reporter tool uses SAS Output Delivery System (ODS) capability to create a single PDF or RTF file that contains information about the open process flow diagram. The PDF or RTF documents can be viewed and saved directly and are included in SAS Enterprise Miner report package files.

The SAS Code tool enables you to incorporate new or existing SAS code into process flow diagrams. The ability to write SAS code enables you to include additional SAS procedures into your data mining analysis. You can also use a SAS DATA step to create customized scoring code, to conditionally process data, and to concatenate or merge existing data sets. The tool provides a macro facility to dynamically reference data sets used for training, validation, testing, or scoring variables, such as input, target, and predict variables. After you run the SAS Code tool, the results and the data sets can then be exported for use by subsequent tools in the diagram. (A sketch of such code follows these descriptions.)

The Start Groups tool is useful when your data can be segmented or grouped, and you want to process the grouped data in different ways. The Start Groups node uses BY-group processing as a method to process observations from one or more data sources that are grouped or ordered by values of one or more common variables. BY variables identify the variable or variables by which the data source is indexed, and BY statements process data and order output according to the BY group values.
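Code placed in a SAS Code node typically reads and writes the node's data sets through the macro facility mentioned above. A minimal sketch, assuming the node's standard macro variables (here, &EM_IMPORT_DATA and &EM_EXPORT_TRAIN, which I take to reference the incoming and outgoing training tables) and a hypothetical derived variable:

   /* Inside a SAS Code node: copy the incoming training table to the
      outgoing one, adding a derived input along the way. */
   data &em_export_train;
      set &em_import_data;
      GiftRate = GiftCntAll / max(GiftTimeFirst, 1);   /* hypothetical derived input */
   run;

Writing against the macro variables rather than fixed table names is what lets the same node code run wherever it is placed in a flow.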
[Slide: Credit Scoring Tab (Optional). Tools: Credit Exchange, Interactive Grouping, Reject Inference, Scorecard]
The optional Credit Scoring tab provides functionality related to credit scoring.

Note: The Credit Scoring for SAS Enterprise Miner solution is not included with the base version of SAS Enterprise Miner. If your site did not license Credit Scoring for SAS Enterprise Miner, the Credit Scoring tab and its associated tools do not appear in your SAS Enterprise Miner software.
The Credit Exchange tool enables you to exchange the data that is created in SAS Enterprise Miner with the SAS Credit Risk Management solution.

The Interactive Grouping tool creates groupings, or classes, of all input variables. (This includes both class and interval input variables.) You can create groupings in order to reduce the number of unique levels as well as attempt to improve the predictive power of each input. Along with creating group levels for each input, the Interactive Grouping tool creates Weight of Evidence (WOE) values.

The Reject Inference tool uses the model that was built using the accepted applications to score the rejected applications in the retained data. The observations in the rejected data set are classified as inferred "goods" and inferred "bads." The inferred observations are added to the Accepts data set that contains the actual "good" and "bad" records, forming an augmented data set. This augmented data set then serves as the input data set of a second credit-scoring modeling run. During the second modeling run, attribute classification is readjusted and the regression coefficients are recalculated to compensate for the data set augmentation.

The Scorecard tool enables you to rescale the logit scores of binary prediction models to fall within a specified range.
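A common way to rescale logit scores to points (a conventional scorecard scaling, not necessarily the Scorecard tool's exact method) anchors the scale at a reference odds and a points-to-double-the-odds increment. The sketch below assumes a hypothetical posterior probability variable P_TargetB1 and a scale of 600 points at 50:1 odds, with 20 points to double the odds.

   /* Rescale a logit to scorecard points. */
   data work.scaled;
      set work.scored;
      factor = 20 / log(2);                 /* points to double the odds */
      offset = 600 - factor * log(50);      /* 600 points at 50:1 odds   */
      score  = offset + factor * log((1 - P_TargetB1) / P_TargetB1);
   run;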
[Figure: The diagram on this page was printed rotated and did not survive text extraction. Its recoverable step labels describe an analytic workflow that extends beyond the SEMMA tools: Define analytic objective; Select cases; Extract input data; Validate input data; Repair input data; Transform input data; Apply analysis; Generate deployment methods; Integrate deployment; Gather results; Assess observed results; Refine analytic objective.]
[Slide: SAS Enterprise Miner Analytic Strengths - Pattern Discovery and Predictive Modeling]
The analytic strengths of SAS Enterprise Miner lie in the realm traditionally known as data mining. There are a great number of analytic methods identified with data mining, but they usually fall into two broad categories: pattern discovery and predictive modeling (Hand 2005). SAS Enterprise Miner provides a variety of pattern discovery and predictive modeling tools.

Chapters 2 through 7 demonstrate SAS Enterprise Miner's predictive modeling capabilities. In those chapters, you use a rich collection of modeling tools to create, evaluate, and improve predictions from prepared data. Special topics in predictive modeling are featured in Chapter 9.

Chapter 8 introduces some of SAS Enterprise Miner's pattern discovery tools. In this chapter, you see how to use SAS Enterprise Miner to graphically evaluate and transform prepared data. You use SAS Enterprise Miner's tools to cluster and segment an analysis population. You can also choose to try SAS Enterprise Miner's market basket analysis and sequence analysis capabilities.
[Slide: Applied Analytics Case Studies - Bank usage segmentation; Web services associations; Credit risk scoring; University enrollment prediction]
Appendix A further illustrates SAS Enterprise Miner's analytic capabilities with case studies drawn from real-world business applications.
• A consumer bank sought to segment its customers based on historic usage patterns. Segmentation was to be used for improving contact strategies in the Marketing Department.
• A radio station developed a Web site to provide such services to its audience as podcasts, news streams, music streams, archives, and live Web music performances. The station tracked usage of these services by URL. Analysts at the station wanted to see whether any unusual patterns existed in the combinations of services selected.
• A bank sought to use performance on an in-house subprime credit product to create an updated risk model. The risk model was to be combined with other factors to make future credit decisions.
• The administration of a large private university requested that the Office of Enrollment Management and the Office of Institutional Research work together to identify prospective students who would most likely enroll as freshmen.
Chapter 2 Accessing and Assaying Prepared Data

2.1  Introduction ............................................................ 2-3
2.2  Creating a SAS Enterprise Miner Project, Library, and Diagram .......... 2-5
     Demonstration: Creating a SAS Enterprise Miner Project ................. 2-6
     Demonstration: Creating a SAS Library .................................. 2-10
     Demonstration: Creating a SAS Enterprise Miner Diagram ................. 2-13
     Exercises .............................................................. 2-14
2.3  Defining a Data Source ................................................. 2-15
     Demonstration: Defining a Data Source .................................. 2-18
     Exercises .............................................................. 2-35
2.4  Exploring a Data Source ................................................ 2-36
     Demonstration: Exploring Source Data ................................... 2-37
     Demonstration: Changing the Explore Window Sampling Defaults ........... 2-64
     Exercises .............................................................. 2-66
     Demonstration: Modifying and Correcting Source Data .................... 2-67
2.5  Chapter Summary ........................................................ 2-80
2.1 Introduction

[Slide: Analysis Element Organization - Projects; Libraries and Diagrams; Process Flows; Nodes]

Analyses in SAS Enterprise Miner start by defining a project. A project is a container for a set of related analyses. After a project is defined, you define libraries and diagrams. A library is a collection of data sources accessible by the SAS Foundation Server. A diagram holds analyses for one or more data sources. Process flows are the specific steps you use in your analysis of a data source. Each step in a process flow is indicated by a node. The nodes represent the SAS Enterprise Miner tools that are available to the user.

Note: A core set of tools is available to all SAS Enterprise Miner users. Additional tools can be added by licensing additional SAS products or by creating SAS Enterprise Miner extensions.

Before you can create a process flow, you need to define the following items:
• a SAS Enterprise Miner project to contain the data sources, diagrams, and model packages
• one or more SAS libraries inside your project, linking SAS Enterprise Miner to analysis data
• a diagram workspace to display the steps of your analysis

A SAS administrator can define these for you in advance of using SAS Enterprise Miner or you can define them yourself.

Note: The demonstrations show you how to create the items in the list above.
Behind the scenes (on the SAS Foundation Server where the project resides), the organization is more complicated. Fortunately, the SAS Enterprise Miner client shields you from this complexity. At the top of the hierarchy is the SAS Enterprise Miner project, corresponding to a directory on (or accessible by) the SAS Foundation. When a project is defined, four subdirectories are automatically created within the project directory: Datasources, Reports, System, and Workspaces. The SAS Enterprise Miner client via the SAS Foundation handles the writing, reading, and maintenance of these directories.
Note: In both the personal workstation and SAS Enterprise Miner client configurations, all references to computing resources such as LIBNAME statements and access to SAS Foundation technologies must be made from the selected SAS Foundation Server's perspective. This is important because data and directories that are visible to the client machine might not be visible to the server.
Projects are defined to contain diagrams, the next level of the SAS Enterprise Miner organizational hierarchy. Diagrams usually pertain to a single analysis theme or project. When a diagram is defined, a new subdirectory is created in the Workspaces directory of the corresponding project. Each diagram is independent, and no information can be passed from one diagram to another.

Libraries are defined using a LIBNAME statement. You can create a library using the SAS Enterprise Miner Library Wizard, using SAS Management Console, or by using the Start-Up Code window when you define a project. Any data source compatible with a LIBNAME statement can provide data for SAS Enterprise Miner. (A sketch of such a statement appears below.)

Specific analyses in SAS Enterprise Miner occur in process flows. A process flow is a sequence of tasks or nodes connected by arrows in the user interface, and it defines the order of analysis. The organization of a process flow is contained in a file, EM_DGRAPH, which is stored in the diagram directory. Each node in the diagram corresponds to a separate subdirectory in the diagram directory. Information in one process flow can be sent to another by connecting the two process flows.
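For reference, a library assignment of the kind described above is a one-line LIBNAME statement. This sketch uses the course data path that appears in the upcoming demonstration; the library reference aaem61 matches the one defined there.

   /* A library definition as it might appear in project start-up code. */
   libname aaem61 "C:\workshop\winsas\aaem61";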
2.2 Creating a SAS Enterprise Miner Project, Library, and Diagram

Your first task when you start an analysis is creating a SAS Enterprise Miner project, data library, and diagram. Often these are set up by a SAS administrator (or reused from a previous analysis), but knowing how to set up these items yourself is fundamental to learning about SAS Enterprise Miner. Before attempting the processes described in the following demonstrations, you need to log on to your computer, and start and log on to the SAS Enterprise Miner client program.
Creating a SAS Enterprise Miner Project

A SAS Enterprise Miner project contains materials related to a particular analysis task. These materials include analysis process flows, intermediate analysis data sets, and analysis results. To define a project, you must specify a project name and the location of the project on the SAS Foundation Server. Follow the steps below to create a new SAS Enterprise Miner project.

[Screenshot: the SAS Enterprise Miner welcome window. The Welcome to Enterprise Miner panel offers New Project..., Open Project..., Recent Projects..., Help Topics, and Exit. The status bar reads "student as student - No project open".]
1. Select File ⇒ New ⇒ Project from the main menu. The Create New Project wizard opens at Step 1.

   [Screenshot: Create New Project - Step 1 of 4 Select SAS Server. "Select a SAS Server for this project. All processing will take place on this server." SAS Server: SASApp - Logical Workspace Server.]

   In this configuration of SAS Enterprise Miner, the only server available for processing is the host server listed above.

2. Select Next >.

3. Name the project.

   [Screenshot: Create New Project - Step 2 of 4 Specify Project Name and Server Directory. "Specify a project name and directory on the SAS Server for this project. All SAS data sets and files will be written to this location." Fields: Project Name; SAS Server Directory D:\SAS\Config\Lev1\AppData\EnterpriseMiner\6.1.]

   Step 2 of the Create New Project wizard is used to specify the following information:
   • the name of the project you are creating
   • the location of the project

4. Type a project name, for example, My Project, in the Name field.

   [Screenshot: Step 2 of 4 with Project Name "My Project" and SAS Server Directory D:\SAS\Config\Lev1\AppData\EnterpriseMiner\6.1.]

   The path specified by the SAS Server Directory field is the physical location where the project folder will be created.

5. Select Next >.

   Note: If you have an existing project directory with the same name and location as specified, this project will be added to the list of available projects in SAS Enterprise Miner. This technique can be used to import a project created by another installation of SAS Enterprise Miner.

6. Select a location for the project's metadata.

   [Screenshot: Create New Project - Step 3 of 4 Register the Project. "Select the SAS Folders location for this project. Use these folders to organize your projects and control user access." SAS Folder Location: /Users/student/My Folder.]

   Note: The SAS folder, My Folder, is in a WebDAV directory. This is where the metadata associated with the project is stored. This folder can be accessed and modified using SAS Management Console.

7. Select Next >. Information about your project is summarized in Step 4.

8. To finish defining the project, select Finish.

   [Screenshot: Create New Project - Step 4 of 4 New Project Information. Name: My Project; SAS Metadata location: /Users/student/My Folder; SAS Application server: SASApp - Logical Workspace Server; Server Directory: D:\SAS\Config\Lev1\AppData\EnterpriseMiner...]

The SAS Enterprise Miner client application opens the project that you created.

[Screenshot: the SAS Enterprise Miner client with My Project open. The Properties panel shows Name: My Project; Created: 6/17/09 2:40 PM; Server: SASApp - Logical Workspace Server; Grid Available: No; Path: D:\SAS\Config\Lev1\AppData...; Max. Concurrent Tasks: Default. The status bar reads "Connected to SASApp - Logical Workspace Server".]

Creating a SAS Library

A SAS library connects SAS Enterprise Miner with the raw data sources, which are the basis of your analysis. A library can link to a directory on the SAS Foundation server, a relational database, or even an Excel workbook. To define a library, you need to know the name and location of the data structure that you want to link with SAS Enterprise Miner, in addition to any associated options, such as user names and passwords. Follow the steps below to create a new SAS library.

1. Select File ⇒ New ⇒ Library... from the main menu. The Library Wizard - Step 1 of 3 Select Action window opens.

   [Screenshot: Library Wizard - Step 1 of 3 Select Action, with the Create New Library action selected.]

2. Select Next >. The Library Wizard is updated to show Step 2 of 3 Create or Modify.

   [Screenshot: Library Wizard - Step 2 of 3 Create or Modify, with Name, Engine (BASE), Path, and Options fields.]

3. Type AAEM61 in the Name field.

4. Type the path C:\workshop\winsas\aaem61 in the Path field.

   [Screenshot: Step 2 of 3 completed with Name AAEM61, Engine BASE, and Path C:\Workshop\Winsas\aaem61.]

   Note: If your data is installed in a different directory, specify this directory in the Path field.

5. Select Next >. The Library Wizard window is updated to show Step 3 of 3 Confirm Action.

   [Screenshot: Library Wizard - Step 3 of 3 Confirm Action. Action: Create New; Name: AAEM61; Engine: BASE; Path: C:\Workshop\Winsas\aaem61. Status: Action Succeeded! The Library "AAEM61" was created.]

   The Confirm Action window shows the name, type, and path of the created SAS library.

6. Select Finish. All data available in the SAS library can now be used by SAS Enterprise Miner.

Creating a SAS Enterprise Miner Diagram

A SAS Enterprise Miner diagram workspace contains and displays the steps involved in your analysis. To define a diagram, you need only specify its name. Follow the steps below to create a new SAS Enterprise Miner diagram workspace.

1. Select File ⇒ New ⇒ Diagram... from the main menu.

2. Type the name Predictive Analysis in the Diagram Name field and select OK. SAS Enterprise Miner creates an analysis workspace window labeled Predictive Analysis.

   [Screenshot: the SAS Enterprise Miner client with the Predictive Analysis diagram workspace open. The Project panel lists Data Sources, Diagrams (Predictive Analysis), Model Packages, and Users; the tools palette shows the Sample, Explore, Modify, Model, Assess, Utility, and Credit Scoring tabs; the Properties panel shows ID: EMWS; Name: Predictive Analysis; Status: Open.]

You use the Predictive Analysis window to create process flow diagrams.

Exercises

1. Creating a Project, Defining a Library, and Creating an Analysis Diagram

   a. Use the steps demonstrated on pages 2-6 to 2-8 to create a project for your SAS Enterprise Miner analyses on your computer.

   b. Use the steps demonstrated on pages 2-10 to 2-12 to define a SAS library to access the course data from your project.

      Note: Your instructor will provide the path to the data if it is different than that specified in these notes.

   c. Use the steps demonstrated on page 2-13 to create an analysis diagram in your project.
2.3 Defining a Data Source

[Slide: Defining a Data Source. From the SAS Foundation Server libraries, select a table of analysis data; then define variable roles, define measurement levels, and define the table role.]
After you define a new project and diagram, your next analysis task in SAS Enterprise Miner is usually to create an analysis data source. A data source is a link between an existing SAS table and SAS Enterprise Miner. To define a data source, you need to select the analysis table and define metadata appropriate to your analysis task.

The selected table must be visible to the SAS Foundation Server via a predefined SAS library. Any table found in a SAS library can be used, including those stored in formats external to SAS, such as tables from a relational database. (To define a SAS library to such tables might require SAS/ACCESS product licenses.)

The metadata definition serves three primary purposes for a selected data set. It informs SAS Enterprise Miner of the following:
• the analysis role of each variable
• the measurement level of each variable
• the analysis role of the data set

The analysis role of each variable tells SAS Enterprise Miner the purpose of the variable in the current analysis. The measurement level of each variable distinguishes continuous numeric variables from categorical variables. The analysis role of the data set tells SAS Enterprise Miner how to use the selected data set in the analysis. All of this information must be defined in the context of the analysis at hand. Other (optional) metadata definitions are specific to certain types of analyses. These are discussed in the context of these analyses.
Note: Any data source that is defined for use in SAS Enterprise Miner should be largely ready for analysis. Usually, most of the data preparation work is completed before a data source is defined. It is possible (and useful for documentation purposes) to write the completed data preparation process in a SAS Code node embedded in SAS Enterprise Miner. Those details are outside the scope of this discussion.
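Before registering a table as a data source, it can be useful to confirm that the table is visible through a defined library and to review its variables. A minimal sketch, assuming a hypothetical table name:

   /* List the variables and attributes of a candidate analysis table. */
   proc contents data=aaem61.pva_raw_data varnum;
   run;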
[Slide: Charity Direct Mail Demonstration. Analysis goal: A veterans' organization seeks continued contributions from lapsing donors. Use lapsing-donor responses from an earlier campaign to predict future lapsing-donor responses.]

To demonstrate defining a data source, and later, using SAS Enterprise Miner's predictive modeling tools, consider the following specific analysis example:

A national veterans' organization seeks to better target its solicitations for donation. By only soliciting the most likely donors, less money will be spent on solicitation efforts and more money will be available for charitable concerns. Solicitations involve sending a small gift to an individual and include a request for a donation. Gifts to donors include mailing labels and greeting cards.

The organization has more than 3.5 million individuals in its mailing database. These individuals are classified by their response behaviors to previous solicitation efforts. Of particular interest is the class of individuals identified as lapsing donors. These individuals made their most recent donation between 12 and 24 months ago. The organization seeks to rank its lapsing donors based on their responses to a greeting card mailing sent in June of 1997. (The charity calls this the 97NK Campaign.) With this ranking, a decision can be made to either solicit or ignore a lapsing individual in the June 1998 campaign.
[Slide 12: Charity Direct Mail Demonstration. Analysis data: extracted from previous year's campaign; sample balances response/non-response rate; actual response rate approximately 5%.]
The source of this data is the Association for Computing Machinery's (ACM) 1998 KDD-Cup competition. The data set and other details of the competition are publicly available at the UCI KDD Archive at kdd.ics.uci.edu. For model development, the data were sampled to balance the response and non-response rates. (The reasons for and consequences of this action are discussed in later chapters.) In the original campaign, the response rate was approximately 5%.
The following demonstrations show how to access the 97NK campaign data in SAS Enterprise Miner. The process is divided into three parts:
• specifying the source data
• setting the columns metadata
• finalizing the data source specification
Defining a Data Source
Specifying Source Data
A data source links SAS Enterprise Miner to an existing analysis table. To specify a data source, you need to define a SAS library and know the name of the table that you will link to SAS Enterprise Miner. Follow these steps to specify a data source.
1. Select File ⇒ New ⇒ Data Source... from the main menu. The Data Source Wizard - Step 1 of 7 Metadata Source opens.
[Screenshot: Data Source Wizard - Step 1 of 7 Metadata Source, with a 'Select a metadata source' Source field and Next >/Cancel buttons.]
The Data Source Wizard guides you through a seven-step process to create a SAS Enterprise Miner data source. Step 1 tells SAS Enterprise Miner where to look for initial metadata values. The default and typical choice is the SAS table that you will link to in the next step.
2. Select Next > to use a SAS table (the common choice) as the source for the metadata.
The Data Source Wizard continues to Step 2 of 7 Select a SAS Table.
[Screenshot: Data Source Wizard - Step 2 of 7 Select a SAS Table, with a Table field and a Browse... button.]
3. In this step, select the SAS table that you want to make available to SAS Enterprise Miner. You can either type the library name and SAS table name as libname.tablename or select the SAS table from a list.
4. Select Browse... to choose a SAS table from the libraries that are visible to the SAS Foundation Server. The Select a SAS Table window opens.
[Screenshot: Select a SAS Table window listing the SAS libraries visible to the server: Aaem61 (BASE engine, path C:\Workshop\Winsas\aaem61), Apfrntlib, Maps, Sampsio, Sasdata, Sashelp, Sasuser, Stpsamp, Work, Wrsdist, and Wrstemp.]
One of the libraries listed is named AAEM61, which is the library name defined in the Library Wizard.
5. Double-click the Aaem61 library. The panel on the right is updated to show the contents of the AAEM61 library.
[Screenshot: Select a SAS Table window listing the tables in the AAEM61 library: Bank, Census2000, Credit, Dottrain, Dotvalid, Dungaree, Inq2005, Insurance, Organics, Profile, Pva97nk, Scoreorga..., and Scorepva9..., each of type DATA.]
6. Select the Pva97nk SAS table.
[Screenshot: Select a SAS Table window with the Pva97nk table highlighted.]
7. Select OK. The Select a SAS Table window closes and the selected table appears in the Table field.
[Screenshot: Data Source Wizard - Step 2 of 7 Select a SAS Table, with AAEM61.PVA97NK in the Table field.]
Select Next >. The Data Source Wizard proceeds to Step 3 of 7 Table Information.
[Screenshot: Data Source Wizard - Step 3 of 7 Table Information. Table Properties: Table Name AAEM61.PVA97NK, Member Type DATA, Data Set Type DATA, Engine BASE, Number of Variables 28, Number of Observations 9686, Created Date March 3, 2008 10:06:00 AM EST, Modified Date March 3, 2008 10:06:00 AM EST.]
This step of the Data Source Wizard provides basic information about the selected table. The SAS table PVA97NK is used in this chapter and subsequent chapters to demonstrate the predictive modeling tools of SAS Enterprise Miner. As seen in the Data Source Wizard - Step 3 of 7 Table Information window, the table contains 9,686 cases and 28 variables.
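You can verify these dimensions outside the wizard as well. A minimal sketch, assuming the AAEM61 library is assigned on the SAS Foundation Server:

   /* Report the table's attributes; the output should show   */
   /* 9,686 observations and 28 variables.                    */
   proc contents data=aaem61.pva97nk;
   run;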
Defining Column Metadata
With a data set specified, your next task is to set the column metadata. To do this, you need to know the modeling role and proper measurement level of each variable in the source data set. Follow these steps to define the column metadata:
1. Select Next >. The Data Source Wizard proceeds to Step 4 of 7 Metadata Advisor Options.
[Screenshot: Data Source Wizard - Step 4 of 7 Metadata Advisor Options, offering Basic and Advanced settings and a Customize... button.]
This step of the Data Source Wizard starts the metadata definition process. SAS Enterprise Miner assigns initial values to the metadata based on characteristics of the selected SAS table. The Basic setting assigns initial values to the metadata based on variable attributes such as the variable name, data type, and assigned SAS format. The Advanced setting assigns initial values to the metadata in the same way as the Basic setting, but it also assesses the distribution of each variable to better determine the appropriate measurement level.
2. Select Next > to use the Basic setting.
The Data Source Wizard proceeds to Step 5 of 7 Column Metadata.
[Screenshot: Data Source Wizard - Step 5 of 7 Column Metadata, listing each variable with its initially assigned Role and Level. Most variables have the Input role with an Interval level; DemCluster, DemGender, DemHomeOwner, and StatusCat96NK are Nominal; ID has the ID role; TargetB has the Target role.]
The Data Source Wizard displays its best guess for the metadata assignments. This guess is based on the name and data type of each variable. The correct values for model role and measurement level are found in the PVA97NK metadata table on the next page. A comparison of the currently assigned metadata to that in the PVA97NK metadata table shows several discrepancies. While the assigned modeling roles are mostly correct, the assigned measurement levels for several variables are in error. It is possible to improve the default metadata assignments by using the Advanced option in the Metadata Advisor.
3. Select < Back in the Data Source Wizard. This returns you to Step 4 of 7 Metadata Advisor Options.
4. Select the Advanced option.
PVA97NK Metadata Table

Name              Model Role  Measurement Level  Description
DemAge            Input       Interval           Age
DemCluster        Input       Nominal            Demographic Cluster
DemGender         Input       Nominal            Gender
DemHomeOwner      Input       Binary             Home Owner
DemMedHomeValue   Input       Interval           Median Home Value Region
DemMedIncome      Input       Interval           Median Income Region
DemPctVeterans    Input       Interval           Percent Veterans Region
GiftAvg36         Input       Interval           Gift Amount Average 36 Months
GiftAvgAll        Input       Interval           Gift Amount Average All Months
GiftAvgCard36     Input       Interval           Gift Amount Average Card 36 Months
GiftAvgLast       Input       Interval           Gift Amount Last
GiftCnt36         Input       Interval           Gift Count 36 Months
GiftCntAll        Input       Interval           Gift Count All Months
GiftCntCard36     Input       Interval           Gift Count Card 36 Months
GiftCntCardAll    Input       Interval           Gift Count Card All Months
GiftTimeFirst     Input       Interval           Time Since First Gift
GiftTimeLast      Input       Interval           Time Since Last Gift
ID                ID          Nominal            Control Number
PromCnt12         Input       Interval           Promotion Count 12 Months
PromCnt36         Input       Interval           Promotion Count 36 Months
PromCntAll        Input       Interval           Promotion Count All Months
PromCntCard12     Input       Interval           Promotion Count Card 12 Months
PromCntCard36     Input       Interval           Promotion Count Card 36 Months
PromCntCardAll    Input       Interval           Promotion Count Card All Months
StatusCat96NK     Input       Nominal            Status Category 96NK
StatusCatStarAll  Input       Binary             Status Category Star All Months
TargetB           Target      Binary             Target Gift Flag
TargetD           Rejected    Interval           Target Gift Amount
5. Select Next > to use the Advanced setting. The Data Source Wizard again proceeds to Step 5 of 7 Column Metadata.
[Screenshot: Data Source Wizard - Step 5 of 7 Column Metadata with the Advanced metadata assignments. DemCluster now has the Rejected role, and several numeric inputs have a Nominal measurement level.]
While many of the default metadata settings are correct, there are several items that need to be changed. For example, the DemCluster variable is rejected (for having too many distinct values), and several numeric inputs have their measurement level set to Nominal instead of Interval (for having too few distinct values). To avoid the time-consuming task of making metadata adjustments, go back to the previous Data Source Wizard step and customize the Metadata Advisor.
6. Select < Back. You return to the Metadata Advisor Options window.
7. Select Customize.... The Advanced Advisor Options dialog box opens.
[Screenshot: Advanced Advisor Options dialog box with its default values: Missing Percentage Threshold 50, Reject Vars with Excessive Missing Values Yes, Class Levels Count Threshold 20, Detect Class Levels Yes, Reject Levels Count Threshold 20, Reject Vars with Excessive Class Values Yes.]
Using the default Advanced options, the Metadata Advisor can do the following:
• reject variables with an excessive number of missing values (default=50%)
• detect the number of class levels of numeric variables and assign a measurement level of Nominal to those with level counts below the selected threshold (default=20)
• detect the number of class levels of character variables and assign a role of Rejected to those with level counts above the selected threshold (default=20)
Note: In the PVA97NK table, there are several numeric variables with fewer than 20 distinct values that should not be treated as nominal. Similarly, there is one class variable with more than 20 levels that should not be rejected. To avoid changing many metadata values in the next step of the Data Source Wizard, you should alter these defaults.
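If you want to check these level counts for yourself before choosing thresholds, PROC FREQ can report them. A minimal sketch, assuming the AAEM61 library is assigned:

   /* List the number of distinct levels of every variable to */
   /* guide the Class Levels Count and Reject Levels Count    */
   /* threshold choices.                                      */
   proc freq data=aaem61.pva97nk nlevels;
      tables _all_ / noprint;
   run;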
8. Type 2 as the Class Levels Count Threshold value so that only binary numeric variables are treated as categorical variables.
[Screenshot: Advanced Advisor Options dialog box with Class Levels Count Threshold set to 2.]
9. Type 100 as the Reject Levels Count Threshold value, so that only character variables with more than 100 distinct values are rejected.
Note: Be sure to press ENTER after you type the number 100. Otherwise, the value might not be registered in the field.
[Screenshot: Advanced Advisor Options dialog box with Reject Levels Count Threshold set to 100.]
10. Select OK to close the Advanced Advisor Options dialog box.
11. Select Next > to proceed to Step 5 of the Data Source Wizard.
[Screenshot: Data Source Wizard - Step 5 of 7 Column Metadata with the customized metadata assignments. DemCluster is again an Input, the numeric inputs have the Interval level, and both TargetB and TargetD have the Target role.]
A comparison of the Column Metadata table to the table at the beginning of the demonstration shows that most of the metadata is correctly defined. SAS Enterprise Miner correctly inferred the model roles for the non-input variables by their names. The measurement levels are correctly defined by using the Advanced Metadata Advisor.
The analysis of the PVA97NK data in this course focuses on the TargetB variable, so the TargetD variable should be rejected.
12. Select Role ⇒ Rejected for TargetD.
[Screenshot: Data Source Wizard - Step 5 of 7 Column Metadata with TargetD's role changed to Rejected.]
In summary, Step 5 of 7 Column Metadata is usually the most time-consuming of the Data Source Wizard steps. You can use the following tips to reduce the amount of time required to define metadata for SAS Enterprise Miner predictive modeling data sets:
• Include in your raw data source only variables that you intend to use in the modeling process.
• For variables that are not inputs, use variable names that start with the intended role. For example, an ID variable should start with ID and a target variable should start with Target.
• Inputs that are to have a nominal measurement level should have a character data type.
• Inputs that are to be interval must have a numeric data type.
• Customize the Metadata Advisor to have a Class Levels Count Threshold equal to 2 and a Reject Levels Count Threshold equal to a number greater than the maximum cardinality (level count) of your nominal inputs.
Finalizing the Data Source Specification
Follow these steps to complete the data source specification process:
1. Select Next > to proceed to Decision Configuration.
Note: The Data Source Wizard gained an extra step due to the presence of a categorical (binary, ordinal, or nominal) target variable.
[Screenshot: Data Source Wizard - Step 6 of 9 Decision Configuration, asking 'Do you want to build models based on the values of the decisions?' If you answer Yes, you can enter the cost or profit of each possible decision, prior probabilities, and a cost function. No is selected.]
When you define a predictive modeling data set, it is important to properly configure decision processing. In fact, obtaining meaningful models often requires using these options. The PVA97NK table was structured so that reasonable models are produced without specifying decision processing. However, this might not be the case for data sources that you will encounter outside this course. Because you need to understand how to set these options, a detailed discussion of decision processing is provided in Chapter 6, "Model Assessment."
Note: Do not select Yes here because that changes the default settings for subsequent analysis steps and yields results that diverge from those in the course notes.
2. Select Next >. By skipping the decision processing step, you reach the next step of the Data Source Wizard.
[Screenshot: Data Source Wizard - Step 7 of 8 Data Source Attributes, with Name PVA97NK, Role Raw, and an empty Segment field.]
This penultimate step enables you to set a role for the data source and add descriptive comments about the data source definition. For the upcoming analysis, a table role of Raw is acceptable.
3. The final step in the Data Source Wizard provides summary details about the data table that you created. Select Finish.
[Screenshot: Data Source Wizard - Step 8 of 8 Summary. Metadata Completed. Library AAEM61, Table PVA97NK, Role Raw. Counts by role and level: ID 1 Nominal, Input 2 Binary, Input 20 Interval, Input 3 Nominal, Rejected 1 Interval, Target 1 Binary.]
The PVA97NK data source is added to the Data Sources entry in the Project panel.
[Screenshot: SAS Enterprise Miner client with PVA97NK listed under Data Sources in the Project panel; the Properties panel shows Role Raw, Library AAEM61, Table PVA97NK, No. Obs 9686, and No. Cols 28.]
4. Select the PVA97NK data source to obtain table properties in the SAS Enterprise Miner Properties panel.
Exercises
2. Creating a SAS Enterprise Miner Data Source
Use the steps demonstrated on pages 2-18 to 2-34 to create a SAS Enterprise Miner data source from the PVA97NK data.
2.4 Exploring a Data Source
As stated in Chapter 1 and noted in Section 2.3, the task of data assembly largely occurs outside of SAS Enterprise Miner. However, it is quite worthwhile to explore and validate your data's content. By assaying the prepared data, you substantially reduce the chances of erroneous results in your analysis, and you can gain graphical insights into associations between variables. In this exploration, you should look for sampling errors, unexpected or unusual data values, and interesting variable associations. The next demonstrations illustrate SAS Enterprise Miner tools that are useful for data validation.
Exploring Source Data
SAS Enterprise Miner can construct interactive plots to help you explore your data. This demonstration shows the basic features of the Explore window. These include the following:
• opening the Explore window
• changing the Explore window sample size
• creating a histogram for a single variable
• changing graph properties for a histogram
• changing chart axes
• adding a missing bin to a histogram
• adding plots to the Explore window
• exploring variable associations
Opening the Explore Window
There are several ways to access the Explore window. Use these steps to open the Explore window through the Project panel.
1. Open the Data Sources folder in the Project panel and right-click the data source of interest. The Data Source Option menu appears.
[Screenshot: Project panel with the PVA97NK data source right-clicked, showing a menu with Rename, Duplicate, Delete, Edit Variables..., Refresh Metadata, Edit Decisions..., and Explore... items.]
2. Select Explore... from the Data Source Option menu. If you are using an installation of SAS Enterprise Miner with factory default settings, the Large Data Constraint alert window opens.
[Screenshot: alert window with the message 'Row limit of 2,000 observations reached.']
By default, a maximum of 2000 observations are transferred from the SAS Foundation Server to the SAS Enterprise Miner client. This represents about a fifth of the PVA97NK table.
3. Select OK to close the Large Data Constraint alert window. The Explore - AAEM61.PVA97NK window opens.
[Screenshot: Explore - AAEM61.PVA97NK window. Sample properties: Rows 9686, Columns 28, Library AAEM61, Member PVA97NK, Type DATA, Sample Method Top, Fetch Size Default, Fetched Rows 2000, Random Seed 12345. A data table appears below.]
The Explore window features a 2000-observation sample from the PVA97NK data source. Sample properties are shown in the top half of the window and a data table is shown in the bottom half.
Changing the Explore Sample Size
The Sample Method property indicates that the sample is drawn from the top (first 2000 rows) of the data set. Use these steps to change the sampling properties in the Explore window.
Note: Although selecting a sample through this method is quick to execute, fetching the top rows of a table might not produce a representative sample of the table.
1. Left-click the Sample Method value field. The Option menu lists two choices: Top (the current setting) and Random.
2. Select Random from the Option menu.
[Screenshot: Explore - AAEM61.PVA97NK window with Sample Method set to Random.]
3. Select Actions ⇒ Apply Sample Properties from the Explore window menu. A new, random sample of 2000 observations is made. This 2000-row sample now has distributional properties that are similar to the original 9686-observation table, which gives you an idea about the general characteristics of the variables. If your goal is to examine the data for potential problems, it is wise to examine the entire data set. SAS Enterprise Miner enables you to increase the sample transferred to the client (up to a maximum of 30,000 observations). See the SAS Enterprise Miner Help file to learn how to increase this maximum value. (A code-based equivalent of this random sampling is sketched after these steps.)
4. Select the Fetch Size property and select Max from the Option menu.
5. Select Actions ⇒ Apply Sample Properties. Because there are fewer than 30,000 observations, the entire PVA97NK table is transferred to the SAS Enterprise Miner client machine, as indicated by the Fetched Rows field.
[Screenshot: Explore - AAEM61.PVA97NK window with Fetch Size Max and Fetched Rows 9686.]
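The code-based equivalent mentioned in step 3: a minimal sketch using PROC SURVEYSELECT, assuming the AAEM61 library is assigned (the seed matches the Explore window's default Random Seed):

   /* Draw a simple random sample of 2000 rows, similar to    */
   /* the Explore window's Random sample method.              */
   proc surveyselect data=aaem61.pva97nk out=work.pva_sample
                     method=srs sampsize=2000 seed=12345;
   run;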
Creating a Histogram for a Single Variable
While you can use the Explore window to browse a data set, its primary purpose is to create statistical analysis plots. Use these steps to create a histogram in the Explore window.
1. Select Actions ⇒ Plot from the Explore window menu. The Chart wizard opens to the Select a Chart Type step.
[Screenshot: Chart wizard listing chart types: Scatter, Line, Histogram, Density, Box, Tables, Matrix, Lattice, Parallel Axis, Constellation.]
The Chart wizard enables the construction of a multitude of analysis charts. This demonstration focuses on histograms.
2. Select Histogram.
[Screenshot: Chart wizard with Histogram selected. Description: a histogram groups X values into bins or discrete categories.]
Histograms are useful for exploring the distribution of values in a variable.
3. Select Next >. The Chart wizard proceeds to the next step, Select Chart Roles.
[Screenshot: Select Chart Roles step with the message 'Missing required roles: X.' above a table of variables with Role, Type, Description, and Format columns.]
To draw a histogram, one variable must be selected to have the role X.
4.
Chapter 2 Accessing and Assaying Prepared Data
Select Role O X for the DEMAGE variable.
Use default assignments A Variable Role DEMAGE X DEMCLUSTER DEMGENDER DEMHOMEOVVWER DEMMEDHOMEVALUE DEMMEDINCOME DEMPCT VETERANS GIFTAVG36 GIFTAVGALL GIFTAVGCARD3fi GFTAVGLAST niFTCNnR Response statistic: Frequency
Type Numeric Character Character Character Numeric Numeric Numeric Numeric Numeric Numeric Numeric ^umerir.
Description 'Age Demographic Cluster bender jHome Owner Median Home Value... Median Income Region Percent Veterans ReGift Amount Average... |Gift Amount Average... [Gift Amount Average... !Gift Amount Last Gift Gnunl 3R MnntlT?
Format
DOLLAR11 DOLLAR11 DOLLAR9.2 DOLLARS .2 DOLLARS .2 DOLLARS .2
•
• Cancel
<Back
The Chart wizard is ready to make a histogram of the DEMAGE variable.
Next>
Finish
5. Select Finish. The Explore window is filled with a histogram of the DEMAGE variable.
Note: Variable descriptions, rather than variable names, are used to label the axes of plots in the Explore window.
[Histogram: Frequency versus Age, 10 bins spanning 0.0 to 87.0.]
Age Axes in Explore window plots are chosen to range from the minimum to the maximum values o f the plotted variable. Here you can see that Age has a minimum value of 0 and a maximum value o f 87. The mode occurs in the ninth bin, which ranges between about 70 and 78. F r e q u e n c y tells you that there are about 1400 observations in this range.
Changing the Graph Properties for a Histogram
By default, a histogram in SAS Enterprise Miner has 10 bins and is scaled to show the entire range of data. Use these steps to change the number of bins in a histogram and change the range of the axes. While the default bin size is sufficient to show the general shape of a variable's distribution, it is sometimes useful to increase the number of bins to improve the histogram's resolution.
1. Right-click in the data area of the Age histogram and select Graph Properties... from the Option menu. The Properties - Histogram window opens.
[Screenshot: Properties - Histogram window with Bins options (Show Missing Bin, Bin Axis, End Labels, Number of X Bins 10, Number of Y Bins 10) and Color options.]
This window enables you to change the appearance of your charts. For histograms, the most important appearance property (at least in a statistical sense) is the number of bins.
2. Type 87 into the Number of X Bins field.
[Screenshot: Properties - Histogram window with Number of X Bins set to 87.]
Because Age is integer-valued and the original distribution plot had a maximum of 87, there will be one bin per possible Age value.
3. Select OK. The Explore window reopens and shows many more bins in the Age histogram.
[Histogram: Frequency versus Age with 87 bins, ages 0 through 87.]
With the increase in resolution, unusual features become apparent in the Age variable. For example, there are unexpected spikes in the histogram at 10-year intervals, starting at Age=7. Also, you must question the veracity of ages below 18 for donors to the charity.
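A comparable fine-grained histogram can be drawn outside SAS Enterprise Miner. A minimal sketch using PROC SGPLOT, assuming the AAEM61 library is assigned:

   /* One bin per year of age makes the 10-year spikes and    */
   /* the implausibly young donors visible.                   */
   proc sgplot data=aaem61.pva97nk;
      histogram DemAge / binwidth=1;
   run;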
Changing Chart Axes
A very useful feature of the Explore window is the ability to zoom in on data of interest. Use the following steps to change chart axes in the Explore window:
1. Position your cursor under the histogram until the cursor appears as a magnifying glass.
2. Click and drag the cursor to the right to magnify the horizontal axis of the histogram.
[Histogram of Age zoomed to roughly ages 32 through 60.]
3.
Chapter 2 Accessing and Assaying Prepared Data
Position your cursor below the histogram but above the Age axis label. A horizontal scroll bar appears. -Inlxf
'/.-Explore - AAEM61.PVA97NK File
Edit
View
Actions Window
y o#
Graph 1
33 34 35 36 37 38 33 40 41 42 43 44 45 46 47 48 49 50 51 52 53 5 ^ 5 56 57 58 59 60 61 62 63 64
Age
2.4 Exploring a Data Source
4.
2-51
Click and drag the scroll bar to the left to translate the horizontal axis to the left. Explore - AAEM61 .P VA97NK File
Edit
H
View
O
-|D|X|
Actions Window
«
.tic Graph 1
^jnjxj
200-
150-
> o c
a> •
g" 100 k.
u_
50-
0
2
4
6
8
1 l i ^ 12
14
16
18
Age
20
22 ~24
26
28
30
32
34
36
2-52
5.
Chapter 2 Accessing and Assaying Prepared Data
Position your cursor to the left of the histogram to make a similar adjustment to the vertical axis o f the histogram. • • Explore - AAEM61.PVA97NK Fi!e
Edit
View
Actions Window
__jnj_xj
triii Graph 1
40-
u
30-
c 3
W i l 20-
10-
i — — i — ' — i——r 1
1
0
2
4
6
-1
10
— i
12
1
1
14
1
I
16
i
|
18
i
1
20
1
[
22
.
[
24
.
1
26
1
[
28
1
1
30
,
1
32
,
34
1 —
At this resolution, you can see that there are approximately 40 observations with Age less than 8. A review of the data preparation process might help you determine whether these values are valid.
6. Select the yellow triangle icon in the lower right corner of the Explore window to reset the axes to their original ranges.
[Histogram of Age restored to its full range, 0 through 87.]
Adding a "Missing" Bin to a Histogram
Not all observations appear in the histogram for Age. There are many observations with missing values for this variable. Follow these steps to add a missing value bin to the Age histogram:
1. Right-click on the graph and select Graph Properties... from the Option menu.
2. Check the Show Missing Bin option.
[Screenshot: Properties - Histogram window with Show Missing Bin checked and Number of X Bins 87.]
3. Select OK. The Age histogram is modified to show a missing value bin.
[Histogram of Age with a missing value bin on the left; the missing bin contains roughly 2500 observations.]
With the missing value bin added, it is easy to see that nearly a quarter of the observations are missing an Age value.
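You can quantify this missingness directly. A minimal sketch using PROC MEANS, assuming the AAEM61 library is assigned:

   /* Report the counts of nonmissing and missing Age values. */
   proc means data=aaem61.pva97nk n nmiss;
      var DemAge;
   run;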
Adding Plots to the Explore Window
You can add other plots to the Explore window. Follow these steps to add a pie chart of the target variable.
1. Select Actions ⇒ Plot from the Explore window menu. The Chart wizard opens to the Select a Chart Type step.
2. Scroll down in the chart list and select a Pie chart.
[Screenshot: Chart wizard with Pie selected. Description: shows each category's contribution to a sum.]
3. Select Next >. The Chart wizard continues to the Select Chart Roles step.
[Screenshot: Select Chart Roles step with the message 'Missing required roles: Category.']
The message at the top of the Select Chart Roles window states that a variable must be assigned the Category role.
4. Scroll the variable list and select Role ⇒ Category for the TARGETB variable.
[Screenshot: Select Chart Roles step with the Category role assigned to TARGETB.]
5. Select Finish. The Explore window displays a pie chart of the TARGETB variable.
The chart shows an equal number of cases for TARGETB=0 (top) and TARGETB=1 (bottom).
6. Select Window ⇒ Tile to simultaneously view all sub-windows of the Explore window.
[Screenshot: Explore window tiled, showing the sample properties, the Age histogram, the TARGETB pie chart, and the data table simultaneously.]
Exploring Variable Associations
All elements of the Explore window are connected. By selecting a bar in one histogram, for example, corresponding observations in the data table and other plots are also selected. Follow these steps to use this feature to explore variable associations:
1. Double-click the Age histogram title bar so that it fills the Explore window.
2. Click and drag a rectangle in the Age histogram to select cases with Age in excess of 70 years.
[Histogram of Age with the bars above age 70 selected.]
The selected cases are cross-hatched. (The vertical axis is rescaled to show the selection better.)
3. Double-click the Age histogram title bar. The tile display is restored.
[Screenshot: tiled Explore window with the Age selection highlighted in the pie chart, histogram, and data table.]
Notice that part of the TargetB pie chart is selected. This selection shows the relative proportion of observations with Age greater than 70 that do and do not donate. Because the arc on the TARGETB=1 segment is slightly thicker, it appears that there is a slightly higher number of donors than non-donors in this Age selection.
4. Double-click the TargetB pie chart title bar to confirm this observation.
5. Close the Explore window to return to the SAS Enterprise Miner client interface screen.
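The same association can be checked in code. A minimal sketch using PROC FREQ, assuming the AAEM61 library is assigned:

   /* Tabulate donation outcomes among cases older than 70;   */
   /* a donor share above 50% supports the pie chart reading. */
   proc freq data=aaem61.pva97nk;
      tables TargetB;
      where DemAge > 70;
   run;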
Changing the Explore Window Sampling Defaults
In the previous demonstration, you were alerted to the fact that only a sample of data was initially selected for use in the Explore window.
[Screenshot: alert window with the message 'Row limit of 2,000 observations reached.']
Follow these steps to change the preference settings of SAS Enterprise Miner to use a random sample or all of the data source data in the Explore window:
1. Select Options ⇒ Preferences... from the main menu. The Preferences window opens.
[Screenshot: Preferences window with Interactive Sampling properties: Sample Method Top, Fetch Size Default, Random Seed 12345.]
2. Select Sample Method ⇒ Random.
[Screenshot: Preferences window with Sample Method set to Random.]
The random sampling method improves on the default method (sampling from the top of the data set) because a random sample is far more likely to be representative of the original data source. The only negative aspect is an increase in processing time for extremely large data sources.
3. Select Fetch Size ⇒ Max.
[Screenshot: Preferences window with Fetch Size set to Max.]
The Max fetch size enables a larger sample of data to be extracted for use in the Explore window. If you use these settings, the Explore window uses the entire data set or a random sample of up to 30,000 observations (whichever is smaller).
Exercises
3. Using the Explore Window to Study Variable Distribution
Use the Explore window to study the distribution of the variable Median Income Region (DemMedIncome). Answer the following questions:
a. What is unusual about the distribution of this variable?
b. What could cause this anomaly to occur?
c. What do you think should be done to rectify the situation?
Modifying and Correcting Source Data
In the previous exercise, the DemMedIncome variable was seen to have an unusual spike at 0. This phenomenon often occurs in data extracted from a relational database table where 0 or another number is used as a substitute for the value missing or unknown. Clearly, having zero income is considerably different from having an unknown income. To use the income variable properly in a predictive model, this discrepancy must be addressed. This demonstration shows you how to replace a placeholder value for missing with a true missing value indicator. In this way, SAS Enterprise Miner tools can correctly respond to the true, but unknown, value. SAS Enterprise Miner includes several tools that you can use to modify the source data for your analysis. The following demonstration shows how to use the Replacement node to modify incorrect or improper values for a variable.
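For orientation, the effect of the Replacement node configured below can be expressed as a short DATA step. This is a minimal sketch, assuming the AAEM61 library is assigned; note that the node itself does not overwrite the variable but creates a new one (REP_DemMedIncome), as the results later show:

   /* Treat the 0 placeholder as a true missing value. */
   data work.pva97nk_repl;
      set aaem61.pva97nk;
      if DemMedIncome = 0 then DemMedIncome = .;
   run;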
Process Flow Setup
Use the following steps to set up the process flow that will modify the DemMedIncome variable:
1. Drag the PVA97NK data source to the Predictive Analysis workspace window.
[Screenshot: SAS Enterprise Miner client with the PVA97NK node placed in the Predictive Analysis diagram workspace.]
2. Select the Modify tab to access the Modify tool group.
[Screenshot: SAS Enterprise Miner client with the Modify tool tab selected above the diagram workspace.]
3. Drag the Replacement tool (third from the right) from the Tools Palette into the Predictive Analysis workspace window.
[Screenshot: Predictive Analysis diagram with the PVA97NK and Replacement nodes in the workspace.]
4. Connect the PVA97NK data source to the Replacement node by clicking near the right side of the PVA97NK node and dragging an arrow to the left side of the Replacement node.
[Screenshot: Predictive Analysis diagram with an arrow connecting the PVA97NK node to the Replacement node.]
You created a process flow, which is the method that SAS Enterprise Miner uses to carry out analyses. The process flow, at this point, reads the raw PVA97NK data and replaces the unwanted values of the observations. You must, however, specify which variables have unwanted values and what the correct values are. To do this, you must change the settings of the Replacement node.
Changing the Replacement Node Properties
Use the following steps to modify the default settings of the Replacement node:
1. Select the Replacement node and examine the Properties panel.
[Screenshot: Replacement node Properties panel showing Default Limits Method Standard Deviations, Replacement Value Computed, and Replacement Report Yes.]
The Properties panel displays the analysis methods used by the node when it is run. By default, the node replaces all interval variables whose values are more than three standard deviations from the variable mean.
Note: You can control the number of standard deviations by selecting the Cutoff Values property.
In this demonstration, you only want to replace the value for DemMedIncome when it equals zero. Thus, you need to change the default setting.
2. Select the Default Limits Method property and select None from the Options menu.
[Screenshot: Replacement node Properties panel with Default Limits Method set to None.]
You want to replace improper values with missing values. To do this, you need to change the Replacement Value property.
3. Select the Replacement Value property and select Missing from the Options menu.
[Screenshot: Replacement node Properties panel with Replacement Value set to Missing.]
You are now ready to specify the variables that you want to replace.
4. Select the Interval Variables: Replacement Editor ellipsis (...) from the Replacement node Properties panel.
[Screenshot: Replacement node Properties panel with the Interval Variables Replacement Editor ellipsis highlighted.]
Note: Be careful to open the Replacement Editor for Interval Variables, not for Class Variables.
The Interactive Replacement Interval Filter window opens.
[Screenshot: Interactive Replacement Interval Filter window listing the interval variables with Use, Limit Method, Lower Limit, and Upper Limit columns, all set to Default.]
5. Select User Specified as the Limit Method value for DemMedIncome.
[Screenshot: Interactive Replacement Interval Filter window with DemMedIncome's Limit Method set to User Specified.]
6. Type 1 as the Lower Limit value for DemMedIncome.
[Screenshot: Interactive Replacement Interval Filter window with DemMedIncome's Lower Limit set to 1.]
If you use this specification, any DemMedIncome values that fall below the lower limit of 1 are set to missing. All other values of this variable do not change.
7. Select OK to close the Interactive Replacement Interval Filter window.
Running the Analysis and Viewing the Results
Use these steps to run the process flow that you created.
1. Right-click the Replacement node and select Run from the Option menu. A Confirmation window appears and requests that you verify the run action.
[Screenshot: Confirmation window asking 'Do you want to run this path? Diagram: Predictive Analysis, Path: Replacement' with Yes and No buttons.]
2. Select Yes to close the Confirmation window. A small animation in the lower right corner of each node indicates analysis activity in the node. The Run Status window opens when the process flow run is complete.
[Screenshot: Run Status window stating 'Run completed. Diagram: Predictive Analysis, Path: Replacement' with Results... and OK buttons.]
3. Select Results... to review the analysis outcome. The Results - Node: Replacement Diagram: Predictive Analysis window appears.
[Screenshot: Results window with a Total Replacement Counts table (DemMedIncome, role INPUT, Train count 2357), an Interval Variables table (DemMedIncome replaced by REP_DemMedIncome, Lower Limit 1, Limit Method MANUAL, Replacement Method MISSING), and an Output window.]
The Total Replacement Counts window shows that 2357 observations were modified by the Replacement node. The Interval Variables window summarizes the replacement that was conducted. The Output window provides more or less the same information as the Total Replacement Counts window and the Interval Variables window (but it is presented as a static text file).
4. Close the Results window.
Examining Exported Data

In a SAS Enterprise Miner process flow diagram, each node takes in data, analyzes it, creates a result, and exports a possibly modified version of the imported data. While the report gives the analysis results in the abstract, it is good practice to see the actual effects of an analysis step on the exported data. This enables you to validate your expectations at each step of the analysis.

Use these steps to examine the data exported from the Replacement node.

1. Select the Replacement node in your process flow diagram.
2. Select the ellipsis button next to Exported Data in the Replacement node's Properties panel. The Exported Data - Replacement window appears.

[Screen capture: the Replacement node Properties panel with the Exported Data property selected.]
[Exported Data - Replacement window: the ports TRAIN, VALIDATE, TEST, SCORE, TRANSACTION, and DOCUMENT are listed; Data Exists is Yes only for the TRAIN port.]
This window lists the types of data sets that can be exported from a SAS Enterprise Miner process flow node. As indicated, only a TRAIN data set exists at this stage of the analysis.
3. Select the Train table in the Exported Data - Replacement window.
4. Select Explore... to open the Explore window again.

[Explore window: the Sample Properties show 29 columns and 9686 fetched rows sampled at random from EMWS.REPL_TRAIN with random seed 12345, followed by the sampled data table.]
5. Scroll completely to the right in the data table.

[Screen capture: the EMWS.Repl_TRAIN table scrolled to show the new Replacement: DemMedIncome column next to Median Income Region.]
A new column is added to the analysis data: Replacement: DemMedIncome. Notice that the values of this variable match the Median Income Region variable, except when the original variable's value equals zero. The replaced zero value is shown by a dot (.), which indicates a missing value.
6. Close the Explore and Exported Data windows to complete this part of the analysis.
2.5 Chapter Summary

Data Access Tools Review
• Link existing analysis data sets to SAS Enterprise Miner.
• Set variable metadata.
• Explore variable distribution characteristics.
• Remove unwanted cases from analysis data.
Analyses in SAS Enterprise Miner are organized hierarchically into projects, data libraries, diagrams, process flows, and analysis nodes. This chapter demonstrated the basics of creating each of these elements. Most process flows begin with a data source node. To access a data source, you must define a SAS library. After a library is defined, a data source is created by linking a SAS table and associated metadata to SAS Enterprise Miner. After a data source is defined, you can assay the underlying cases using SAS Enterprise Miner Explore tools. Care should be taken to ensure that the sample of the data source that you explore is representative of the original data source. You can use the Replacement node to modify variables that were incorrectly prepared. After all necessary modifications, the data is ready for subsequent analysis.
Chapter 3 Introduction to Predictive Modeling: Decision Trees

3.1 Introduction 3-3
    Demonstration: Creating Training and Validation Data 3-23
3.2 Cultivating Decision Trees 3-28
    Demonstration: Constructing a Decision Tree Predictive Model 3-43
3.3 Optimizing the Complexity of Decision Trees 3-63
    Demonstration: Assessing a Decision Tree 3-79
3.4 Understanding Additional Diagnostic Tools (Self-Study) 3-98
    Demonstration: Understanding Additional Plots and Tables (Optional) 3-99
3.5 Autonomous Tree Growth Options (Self-Study) 3-106
    Exercises 3-114
3.6 Chapter Summary 3-116
3.7 Solutions 3-117
    Solutions to Exercises 3-117
    Solutions to Student Activities (Polls/Quizzes) 3-133
3.1 Introduction
Predictive Modeling: The Essence of Data Mining

"Most of the big payoff [in data mining] has been in predictive modeling." - Herb Edelstein
Predictive modeling has a long history of success in the field of data mining. Models are built from historical event records and are used to predict future occurrences of these events. The methods have great potential for monetary gain. Herb Edelstein states in a 1997 interview (Beck 1997): "I also want to stress that data mining is successfully applied to a wide range of problems. The beer-and-diapers type of problem is known as association discovery or market basket analysis, that is, describing something that has happened in existing data. This should be differentiated from predictive techniques. Most of the big payoff stuff has been in predictive modeling." Predicting the future has obvious financial benefit, but, for several reasons, the ability of predictive models to do so should not be overstated. First, the models depend on a statistical property known as stationarity, meaning that the process that generates the events must not change its statistical properties over time. Unfortunately, many processes that generate events of interest are not stationary enough to enable meaningful predictions. Second, predictive models are often directed at predicting events that occur rarely and, in the aggregate, often look somewhat random. Even the best predictive model can only extract rough trends from these inherently noisy processes.
Predictive Modeling Applications
• Database marketing
• Financial risk management
• Fraud detection
• Process monitoring
• Pattern detection

Applications of predictive modeling are limited only by your imagination. However, the models created by SAS Enterprise Miner typically apply to one of the following categories.
• Database marketing applications include offer response, up-sell, cross-sell, and attrition models.
• Financial risk management models attempt to predict monetary events such as credit default, loan prepayment, and insurance claims.
• Fraud detection methods attempt to detect or impede illegal activity involving financial transactions.
• Process monitoring applications detect deviations from the norm in manufacturing, financial, and security processes.
• Pattern detection models are used in applications ranging from handwriting analysis to medical diagnostics.
Predictive Modeling: Training Data
[Slide: a training data table; a training data case consists of categorical or numeric input and target measurements.]
To begin using SAS Enterprise Miner for predictive modeling, you should be familiar with standard terminology. Predictive modeling (also known as supervised prediction or supervised learning) starts with a training data set. The observations in a training data set are known as training cases (also known as examples, instances, or records). The variables are called inputs (also known as predictors, features, explanatory variables, or independent variables) and targets (also known as a response, outcome, or dependent variable). For a given case, the inputs reflect your state of knowledge before measuring the target. The measurement scale of the inputs and the target can be varied. The inputs and the target can be numeric variables, such as income. They can be nominal variables, such as occupation. They are often binary variables, such as a positive or negative response concerning home ownership.
Predictive Model
[Slide: a predictive model is a concise representation of the input and target association.]
The purpose of the training data is to generate a predictive model. The predictive model is a concise representation of the association between the inputs and the target variables.
Predictive Model
[Slide: predictions are the output of the predictive model given a set of input measurements.]
The outputs of the predictive model are called predictions. Predictions represent your best guess for the target given a set of input measurements. The predictions are based on the associations learned from the training data by the predictive model.
Modeling Essentials
• Predict new cases.
• Select useful inputs.
• Optimize complexity.
Predictive models are widely used and come in many varieties. Any model, however, must perform three essential tasks:
• provide a rule to transform a measurement into a prediction
• have a means of choosing useful inputs from a potentially vast number of candidates
• be able to adjust its complexity to compensate for noisy training data

The following slides illustrate the general principles involved in each of these essentials. Later sections of the course detail the approach used by specific modeling tools.
Modeling Essentials: Predict New Cases

The process of predicting new cases is examined first.
Three Prediction Types
[Slide: a table of input measurements feeds a prediction; predictions can be decisions, rankings, or estimates.]
The training data is used to construct a model (rule) that relates the input variables to the target variable. The predictions can be categorized into three distinct types:
• decisions
• rankings
• estimates
Decision Predictions
[Slide: a predictive model uses input measurements to make the best decision for each case.]
The simplest type of prediction is the decision. Decisions usually are associated with some type of action (such as classifying a case as a donor or a non-donor). For this reason, decisions are also known as classifications. Decision prediction examples include handwriting recognition, fraud detection, and direct mail solicitation. Decision predictions usually relate to a categorical target variable. For this reason, they are identified as primary, secondary, and tertiary in correspondence with the levels of the target. By default, model assessment in SAS Enterprise Miner assumes decision predictions when the target variable has a categorical measurement level (binary, nominal, or ordinal).
Ranking Predictions
[Slide: input measurements produce scores such as 720, 520, 470, and 630; a predictive model uses input measurements to optimally rank each case.]
Ranking predictions order cases based on the input variables' relationships with the target variable. Using the training data, the prediction model attempts to rank high value cases higher than low value cases. It is assumed that a similar pattern exists in the scoring data so that high value cases have high scores. The actual, produced scores are inconsequential; only the relative order is important. The most common example of a ranking prediction is a credit score. Ranking predictions can be transformed into decision predictions by taking the primary decision for cases above a certain threshold while making secondary and tertiary decisions for cases below the correspondingly lower thresholds. In credit scoring, cases with a credit score above 700 can be called good risks; those with a score between 600 and 700 can be intermediate risks; and those below 600 can be considered poor risks.
Estimate Predictions
[Slide: a predictive model uses input measurements to optimally estimate the target value.]
Estimate predictions approximate the expected value of the target, conditioned on the input values. For cases with numeric targets, this number can be thought of as the average value of the target for all cases having the observed input measurements. For cases with categorical targets, this number might equal the probability of a particular target outcome. Prediction estimates are most commonly used when their values are integrated into a mathematical expression. An example is two-stage modeling, where the probability of an event is combined with an estimate of profit or loss to form an estimate of unconditional expected profit or loss. Prediction estimates are also useful when you are not certain of the ultimate application of the model. Estimate predictions can be transformed into both decision and ranking predictions. When in doubt, use this option. Most SAS Enterprise Miner modeling tools can be configured to produce estimate predictions.
Modeling Essentials - Predict Review
• Predict new cases: decide, rank, and estimate.

To summarize the first model essential, when a model predicts a new case, the model generates a decision, a rank, or an estimate.
3.01 Quiz

Match the predictive modeling application to the decision type.

Modeling Application: Loss Reserving, Risk Profiling, Credit Scoring, Fraud Detection, Revenue Forecasting, Voice Recognition
Decision Type: A. Decision, B. Ranking, C. Estimate

The quiz shows some predictive modeling examples and the decision types.
Modeling Essentials: Select Useful Inputs
Consider the task of selecting useful inputs.
The Curse of Dimensionality
[Slide: eight points shown in 1-D, 2-D, and 3-D space.]
The dimension of a problem refers to the number of input variables (more accurately, degrees of freedom) that are available for creating a prediction. Data mining problems are often massive in dimension. The curse of dimensionality refers to the exponential increase in data required to densely populate space as the dimension increases. For example, the eight points fill the one-dimensional space but become more separated as the dimension increases. In a 100-dimensional space, they would be similar to distant galaxies. The curse of dimensionality limits your practical ability to fit a flexible model to noisy data (real data) when there are a large number of input variables. A densely populated input space is required to fit highly complex models. When you assess how much data is available for data mining, you must consider the dimension of the problem.
3.02 Quiz

How many inputs are in your modeling data? Mark your selection in the polling area.
Input Reduction: Redundancy and Irrelevancy
[Slide: two panels illustrate the two reasons for eliminating an input, redundancy and irrelevancy.]
Input selection, that is, reducing the number of inputs, is the obvious way to thwart the curse of dimensionality. Unfortunately, reducing the dimension is also an easy way to disregard important information. The two principal reasons for eliminating a variable are redundancy and irrelevancy.
Input Reduction: Redundancy
[Slide: input x2 has the same information as input x1.]
A redundant input does not give any new information that was not already explained by other inputs. In the example above, knowing the value of input x1 gives you a good idea of the value of x2. For decision tree models, the modeling algorithm makes input redundancy a relatively minor issue. For other modeling tools, input redundancy requires more elaborate methods to mitigate the problem.
Input Reduction: Irrelevancy
[Slide: predictions change with input x4 but much less with input x3.]
An irrelevant input does not provide information about the target. In the example above, predictions change with input x4 but not with input x3. For decision tree models, the modeling algorithm automatically ignores irrelevant inputs. Other modeling methods must be modified or rely on additional tools to properly deal with irrelevant inputs.
Modeling Essentials - Select Review
• Select useful inputs: eradicate redundancies and irrelevancies.
To thwart the curse of dimensionality, a model must select useful inputs. You can do this by eradicating redundant and irrelevant inputs (or more positively, selecting an independent set of inputs that are correlated with the target).
Modeling Essentials: Optimize Complexity
Finally, consider the task of optimizing model complexity.
Model Complexity
[Slide: model fits of increasing flexibility to the same training data.]
Fitting a model to data requires searching through the space of possible models. Constructing a model with good generalization requires choosing the right complexity. Selecting model complexity involves a trade-off between bias and variance. An insufficiently complex model might not be flexible enough, which can lead to underfitting, that is, systematically missing the signal (high bias). A naive modeler might assume that the most complex model should always outperform the others, but this is not the case. An overly complex model might be too flexible, which can lead to overfitting, that is, accommodating nuances of the random noise in the particular sample (high variance). A model with the right amount of flexibility gives the best generalization.
Data Partitioning
[Slide: partition available data into training and validation sets.]
A quote from the popular introductory data analysis textbook Data Analysis and Regression by Mosteller and Tukey (1977) captures the difficulty of properly estimating model complexity: "Testing the procedure on the data that gave it birth is almost certain to overestimate performance, for the optimizing process that chose it from among many possible procedures will have made the greatest use of any and all idiosyncrasies of those particular data...." In predictive modeling, the standard strategy for honest assessment of model performance is data splitting. A portion is used for fitting the model, that is, the training data set. The remaining data is separated for empirical validation. The validation data set is used for monitoring and tuning the model to improve its generalization. The tuning process usually involves selecting among models of different types and complexities. The tuning process optimizes the selected model on the validation data.
Because the validation data is used to select from a set of related models, reported performance will, on the average, be overstated. Consequently, a further holdout sample is needed for a final, unbiased assessment. The test data set has only one use, that is, to give a final honest estimate of generalization. Consequently, cases in the test set must be treated in the same way that new data would be treated. The cases cannot be involved in any way in the determination of the fitted prediction model. In practice, many analysts see no need for a final honest assessment of generalization. An optimal model is chosen using the validation data, and the model assessment measured on the validation data is reported as an upper bound on the performance expected when the model is deployed.
With small or moderate data sets, data splitting is inefficient; the reduced sample size can severely degrade the fit of the model. Computer-intensive methods, such as the cross validation and bootstrap methods, were developed so that all the data can be used for both fitting and honest assessment. However, data mining usually has the luxury of massive data sets.
Predictive Model Sequence
[Slide: training data and validation data; create a sequence of models with increasing complexity.]
In SAS Enterprise Miner, the strategy for choosing model complexity is to build a sequence of similar models using the training component only. The models are usually ordered with increasing complexity.
Model Performance Assessment
[Slide: rate model performance using validation data; validation assessment is plotted against model complexity.]
As Mosteller and Tukey caution in the earlier quotation, using performance on the training data set usually leads to selecting a model that is too complex. (The classic example is selecting linear regression models based on R square.) To avoid this problem, SAS Enterprise Miner selects the model from the sequence based on validation data performance.
Model Selection
[Slide: select the simplest model with the highest validation assessment.]
In keeping with Ockham's razor, the best model is the simplest model with the highest validation performance.
Modeling Essentials - Optimize Review
• Optimize complexity: tune models with validation data.
To avoid overfitting (and underfitting) a predictive model, you should calibrate its performance with an independent validation sample. You can perform the calibration by partitioning the data into training and validation samples. In Chapter 2, you assigned roles and measurement levels to the PVA97NK data set and created an analysis diagram and a simple process flow. In the following demonstration, you partition the raw data into training and validation components. This sets the stage for using the Decision Tree tool to generate a predictive model.
Creating Training and Validation Data
A critical step in prediction is choosing among competing models. Given a set of training data, you can easily generate models that very accurately predict a target value from a set of input values. Unfortunately, these predictions might be accurate only for the training data itself. Attempts to generalize the predictions from the training data to an independent, but similarly distributed, sample can result in substantial accuracy reductions. To avoid this pitfall, SAS Enterprise Miner is designed to use a validation data set as a means of independently gauging model performance. Typically, the validation data set is created by partitioning the raw analysis data. Cases selected for training are used to build the model, and cases selected for validation are used to tune and compare models.
SAS Enterprise Miner also enables the creation of a third partition, named the test data set. The test data set is meant for obtaining unbiased estimates of model performance from a single selected model.
This demonstration assumes that you completed the demonstrations in Chapter 2. Your SAS Enterprise Miner workspace should appear as shown below:

[Screen capture: the SAS Enterprise Miner window with the Predictive Analysis diagram open, showing the PVA97NK data source connected to the Replacement node.]
Follow these steps to partition the raw data into training and validation sets.

1. Select the Sample tool tab. The Data Partition tool is the second from the left.

2. Drag a Data Partition tool into the diagram workspace, next to the Replacement node.

[Screen capture: the Predictive Analysis diagram with the Data Partition node placed next to the Replacement node.]
3. Connect the Replacement node to the Data Partition node.

[Screen capture: the process flow with the PVA97NK, Replacement, and Data Partition nodes connected.]
4. Select the Data Partition node and examine its Properties panel.

[Properties panel: Partitioning Method is Default, Random Seed is 12345, and the Data Set Allocations are Training 40.0, Validation 30.0, and Test 30.0.]
Use the Properties panel to select the fraction of data devoted to the Training, Validation, and Test partitions. By default, less than half the available data is used for generating the predictive models.
There is a trade-off in various partitioning strategies. More data devoted to training results in more stable predictive models, but less stable model assessments (and vice versa). Also, the Test partition is used only for calculating fit statistics after the modeling and model selection is complete. Many analysts regard this as wasteful of data. A typical partitioning compromise foregoes a Test partition and places an equal number of cases in Training and Validation.
5. Type 50 as the Training value in the Data Partition node.

6. Type 50 as the Validation value in the Data Partition node.

7. Type 0 as the Test value in the Data Partition node.

[Properties panel: the Data Set Allocations now show Training 50.0, Validation 50.0, and Test 0.0.]
With smaller raw data sets, model stability can become an important issue. In this case, increasing the number of cases devoted to the Training partition is a reasonable course of action.
8. Right-click the Data Partition node and select Run.

[Confirmation window: Do you want to run this path? Diagram: Predictive Analysis, Path: Data Partition. Yes / No]

9. Select Yes in the Confirmation window. SAS Enterprise Miner runs the Data Partition node.

Only the Data Partition node should run, because it is the only "new" element in the process flow. In general, SAS Enterprise Miner only runs elements of diagrams that are new or that are affected by changes earlier in the process flow.

When the Data Partition process is complete, a Run Status window opens.

[Run Status window: Run completed. Diagram: Predictive Analysis, Path: Data Partition. Results... / OK]
10. Select Results... in the Run Status window to view the results. The Results - Node: Data Partition window provides a basic metadata summary of the raw data that feeds the node and, more importantly, a frequency table that shows the distribution of the target variable, TargetB, in the raw, training, and validation data sets.

Summary Statistics for Class Targets

Data=DATA
Variable   Formatted Value   Frequency Count   Percent
TargetB    0                 4843              50
TargetB    1                 4843              50

Data=TRAIN
Variable   Formatted Value   Frequency Count   Percent
TargetB    0                 2422              50.0103
TargetB    1                 2421              49.9897

Data=VALIDATE
Variable   Formatted Value   Frequency Count   Percent
TargetB    0                 2421              49.9897
TargetB    1                 2422              50.0103
The Partition node attempts to maintain the proportion of zeros and ones in the training and validation partitions. The proportions are not exactly the same because each target value has an odd number of cases in the raw data.
The analysis data was selected and partitioned. You are ready to build predictive models.
3.2 Cultivating Decision Trees
Predictive Modeling Tools: Primary
[Slide: the Decision Tree, Regression, and Neural Network tool icons.]
Many predictive modeling tools are found in SAS Enterprise Miner. The tools can be loosely grouped into three categories: primary, specialty, and multiple models. The primary tools interface with the three most commonly used predictive modeling methods: decision tree, regression, and neural network.
Predictive Modeling Tools: Specialty
[Slide: the specialty modeling tool icons.]

The specialty tools are either specializations of the primary tools or tools designed to solve specific problem types.
Predictive Modeling Tools: Multiple Models
[Slide: the multiple-model tool icons.]

Multiple model tools are used to combine or produce more than one predictive model. The Ensemble tool is briefly described in a later chapter. The Two Stage tool is discussed in the Advanced Predictive Modeling Using SAS® Enterprise Miner™ course.
Model Essentials - Decision Trees
• Predict new cases: prediction rules
• Select useful inputs: split search
• Optimize complexity: pruning
Decision trees provide an excellent introduction to predictive modeling. Decision trees, similar to all modeling methods described in this course, address each of the modeling essentials described in the introduction. Cases are scored using prediction rules. A split-search algorithm facilitates input selection. Model complexity is addressed by pruning. The following simple prediction problem illustrates each of these model essentials.
Simple Prediction Illustration
[Slide: a scatter plot of dots on the unit square with axes x1 and x2; predict dot color for each (x1, x2).]
Consider a data set with two inputs and a binary target. The inputs, x1 and x2, locate the case in the unit square. The target outcome is represented by a color: yellow is primary and blue is secondary. The analysis goal is to predict the outcome based on the location in the unit square.
Decision Tree Prediction Rules
[Slide: a decision tree with a root node, interior nodes, and leaf nodes partitions the unit square.]
Decision trees use rules involving the values of the input variables to predict cases. The rules are arranged hierarchically in a tree-like structure with nodes connected by lines. The nodes represent decision rules, and the lines order the rules. The first rule, at the base (top) of the tree, is called the root node. Subsequent rules are called interior nodes. Nodes with only one connection are called leaf nodes.
Decision Tree Prediction Rules
[Slide: scoring a new case by applying the tree's rules.]

To score a new case, examine the input values and apply the rules defined by the decision tree.
Decision Tree Prediction Rules
[Slide: a new case with x2 > 0.63 reaches a leaf with estimate 0.70.]
The input values of a new case eventually lead to a single leaf in the tree. A tree leaf provides a decision (for example, classify as yellow) and an estimate (for example, the primary-target proportion).
Model Essentials - Decision Trees
• Select useful inputs: split search
To select useful inputs, trees use a split-search algorithm. Decision trees confront the curse of dimensionality by ignoring irrelevant inputs.

Curiously, trees have no built-in method for ignoring redundant inputs. Because trees can be trained quickly and have simple structure, this is usually not an issue for model creation. However, it can be an issue for model deployment, in that trees might somewhat arbitrarily select from a set of correlated inputs. To avoid this problem, you must use an algorithm that is external to the tree to manage input redundancy.
Decision Tree Split Search
[Slide: a candidate split point on input x1 divides the training cases into left and right branches.]
Understanding the default algorithm for building trees enables you to better use the Tree tool and interpret your results. The description presented here assumes a binary target, but the algorithm for interval targets is similar. (The algorithm for categorical targets with more than two outcomes is more complicated and is not discussed.)

The first part of the algorithm is called the split search. The split search starts by selecting an input for partitioning the available training data. If the measurement scale of the selected input is interval, each unique value serves as a potential split point for the data. If the input is categorical, the average value of the target is taken within each categorical input level. The averages serve the same role as the unique interval input values in the discussion that follows.

For a selected input and fixed split point, two groups are generated. Cases with input values less than the split point are said to branch left. Cases with input values greater than the split point are said to branch right. The groups, combined with the target outcomes, form a 2x2 contingency table with columns specifying branch direction (left or right) and rows specifying target value (0 or 1). A Pearson chi-squared statistic is used to quantify the independence of counts in the table's columns. Large values for the chi-squared statistic suggest that the proportion of zeros and ones in the left branch is different than the proportion in the right branch. A large difference in outcome proportions indicates a good split.

Because the Pearson chi-squared statistic can be applied to the case of multiway splits and multi-outcome targets, the statistic is converted to a probability value, or p-value. The p-value indicates the likelihood of obtaining the observed value of the statistic assuming identical target proportions in each branch direction. For large data sets, these p-values can be very close to zero. For this reason, the quality of a split is reported by logworth = -log(chi-squared p-value). At least one logworth must exceed a threshold for a split to occur with that input. By default, this threshold corresponds to a chi-squared p-value of 0.20 or a logworth of approximately 0.7.
Decision Tree Split Search
[Slide: the best split for an input is the split that yields the highest logworth.]
Additional Split-Search Details (Self-Study)

There are several peripheral factors that make the split search somewhat more complicated than what is described above.

First, the tree algorithm settings disallow certain partitions of the data. Settings, such as the minimum number of observations required for a split search and the minimum number of observations in a leaf, force a minimum number of cases in a split partition. This minimum number of cases reduces the number of potential partitions for each input in the split search.

Second, when you calculate the independence of columns in a contingency table, it is possible to obtain significant (large) values of the chi-squared statistic even when there are no differences in the outcome proportions between split branches. As the number of possible split points increases, the likelihood of obtaining significant values also increases. In this way, an input with a multitude of unique input values has a greater chance of accidentally having a large logworth than an input with only a few distinct input values. Statisticians face a similar problem when they combine the results from multiple statistical tests. As the number of tests increases, the chance of a false positive result likewise increases. To maintain overall confidence in the statistical findings, statisticians inflate the p-values of each test by a factor equal to the number of tests being conducted. If an inflated p-value shows a significant result, then the significance of the overall results is assured. This type of p-value adjustment is known as a Bonferroni correction. Because each split point corresponds to a statistical test, Bonferroni corrections are automatically applied to the logworth calculations for an input. These corrections, called Kass adjustments (named after the inventor of the default tree algorithm used in SAS Enterprise Miner), penalize inputs with many split points by reducing the logworth of a split by an amount equal to the log of the number of distinct input values. This is equivalent to the Bonferroni correction because subtracting this constant from the logworth is equivalent to multiplying the corresponding chi-squared p-value by the number of split points. The adjustment enables a fairer comparison of inputs with many and few levels later in the split-search algorithm.

Third, for inputs with missing values, two sets of Kass-adjusted logworths are actually generated. For the first set, cases with missing input values are included in the left branch of the contingency table and logworths are calculated. For the second set of logworths, missing value cases are moved to the right branch. The best split is then selected from the set of possible splits with the missing values in the left and right branches, respectively.
Decision Tree Split Search
[Slide: repeat the split search for input x2.]
The partitioning process is repeated for every input in the training data. Inputs whose adjusted logworth fails to exceed the threshold are excluded from consideration.
Decision Tree Split Search
[Slide: splitting x2 at 0.63 yields bottom and top branches with outcome proportions 54%/46% and 35%/65%; max logworth(x2) = 4.92.]
Again, the optimal split for the input is the one that maximizes the logworth function.
Decision Tree Split Search
[Slide: compare partition logworth ratings; the best x1 split (left 53%/47%, right 42%/58%) has max logworth(x1) = 0.95, and the best x2 split (bottom 54%/46%, top 35%/65%) has max logworth(x2) = 4.92.]
After you determine the best split for every input, the tree algorithm compares each best split's corresponding logworth. The split with the highest adjusted logworth is deemed best.
Decision Tree Split Search
[Slide: the training data is partitioned using the best split rule.]
Decision Tree Split Search
[Slide: the data is split at x2 = 0.63; repeat the process in each subset.]
Decision Tree Split Search
[Slide: within one subset, the best x1 split (left 61%/39%, right 55%/45%) has max logworth(x1) = 5.72.]
Decision Tree Split Search
[Slide: the candidate x2 split within the same subset.]
The logworth of the split is negative. This might seem surprising, but it results from several adjustments made to the logworth calculation. (The Kass adjustment was described previously. Another, called the depth adjustment, is outlined in the following Self-Study section.)
Decision Tree Split Search
[Slide: within the subset, max logworth(x1) = 5.72 is compared with max logworth(x2) = -2.01.]
The split search continues within each leaf. Logworths are compared as before.
Additional Split-Search Details (Self-Study)

Because the significance of secondary and subsequent splits depends on the significance of the previous splits, the algorithm again faces a multiple comparison problem. To compensate for this problem, the algorithm increases the threshold by an amount related to the number of splits above the current split. For binary splits, the threshold is increased by log10(2) x d, or approximately 0.3 x d, where d is the depth of the split on the decision tree. By increasing the threshold for each depth (or equivalently decreasing the logworths), the tree algorithm makes it increasingly easy for an input's splits to be excluded from consideration.
Decision Tree Split Search
[Slide: the best split creates a second partition rule.]
The data is partitioned according to the best split, which creates a second partition rule. The process repeats in each leaf until there are no more splits whose adjusted logworth exceeds the depth-adjusted thresholds. This process completes the split-search portion of the tree algorithm.
Decision Tree Split Search
[Slide: repeat to form a maximal tree.]
The resulting partition of the input space is known as the maximal tree. Development of the maximal tree is based exclusively on statistical measures of split worth on the training data. It is likely that the maximal tree will fail to generalize well on an independent set of validation data.
Model Essentials - Decision Trees
• Select useful inputs: split search
The split-search algorithm can easily cut through the curse of dimensionality and select useful modeling inputs. Because it is simply an algorithm, many variations are possible. (Some variations are discussed at the end of this chapter.) The following demonstration illustrates the use of the split-search algorithm to construct a decision tree model.
Constructing a Decision Tree Predictive Model
This four-part demonstration illustrates the interactive construction of a decision tree model and builds on the process flows of previous demonstrations.
Preparing for Interactive Tree Construction

Follow these steps to prepare the Decision Tree tool for interactive tree construction.

1. Select the Model tab. The Decision Tree tool is second from the left.

2. Drag a Decision Tree tool into the diagram workspace. Place the node next to the Data Partition node.

[Screen capture: the Predictive Analysis diagram with the Decision Tree node placed next to the Data Partition node.]
3. Connect the Data Partition node to the Decision Tree node.

[Screen capture: the process flow with the Data Partition node connected to the Decision Tree node.]
The Decision Tree tool can build predictive models autonomously or interactively. To build a model autonomously, simply run the Decision Tree node. Building models interactively, however, is more informative and is helpful when you first learn about the node, and even when you are more expert at predictive modeling.

4. Select the Interactive ellipsis button from the Decision Tree node's Properties panel.

[Properties panel: the Decision Tree node's Train properties, including Interval Criterion ProbF, Nominal Criterion ProbChisq, Significance Level 0.2, Maximum Branch 2, Maximum Depth 6, and Leaf Size 5.]
The SAS Enterprise Miner Interactive Decision Tree application opens.

[Screen capture: the Interactive Decision Tree window with the root node showing N in Node: 4843, Target: TargetB, 0 (%): 50%, 1 (%): 50%.]
Creating a Splitting Rule

Decision tree models involve recursive partitioning of the training data in an attempt to isolate concentrations of cases with identical target values. The blue box in the Tree window represents the unpartitioned training data. The statistics show the distribution of TargetB.

Use the following steps to create an initial splitting rule:

1. Right-click the root node and select Split Node... from the menu. The Split Node 1 dialog box opens.

[Split Node 1 dialog box: candidate inputs are listed with their -Log(p) values and branch counts; GiftCnt36 is highest at 18.44, followed by GiftAvgCard36 at 17.24 and GiftAvgLast at 15.02.]
The Split Node dialog box shows the relative value, -Log(p) or logworth, of partitioning the training data using the indicated input. As the logworth increases, the partition better isolates cases with identical target values. Gift Count 36 Months has the highest logworth, followed by Gift Amount Average Card 36 Months and Gift Amount Last. You can choose to split the data on the selected input or edit the rule for more information.
2. Select Edit Rule.... The GiftCnt36 - Interval Splitting Rule dialog box opens.

[GiftCnt36 - Interval Splitting Rule dialog box: missing values are assigned to a specific branch; Branch 1 contains cases with values < 2.5 and Branch 2 contains cases with values >= 2.5.]
This dialog box shows how the training data is partitioned using the input Gift Count 36 Months. Two branches are created. The first branch contains cases with a 36-month gift count less than 2.5, and the second branch contains cases with a 36-month gift count greater than or equal to 2.5. In other words, cases with a 36-month gift count of zero, one, or two branch left, and cases with three or more branch right. In addition, any cases with a missing or unknown 36-month gift count are placed in the first branch.

The interactive tree assigns any cases with missing values to the branch that maximizes the purity or logworth of the split by default. For GiftCnt36, this rule assigns cases with missing values to branch 2.
3. Select Apply. The GiftCnt36 - Interval Splitting Rule dialog box remains open, and the Tree View window is updated.

[Tree View: the root node (4843 cases, 50%/50%) splits on Gift Count 36 Months; the branch with values < 2.5 contains 2283 cases (57% zeros, 43% ones), and the branch with values >= 2.5 or Missing contains 2560 cases (44% zeros, 56% ones).]
The training data is partitioned into two subsets. The first subset, corresponding to cases with a 36-month gift count less than 2.5, has a higher than average concentration of TargetB=0 cases. The second subset, corresponding to cases with a 36-month gift count greater than 2.5, has a higher than average concentration of TargetB=1 cases. The second branch has slightly more cases than the first, which is indicated by the N in Node field. The partition of the data also represents your first (non-trivial) predictive model. This model assigns to all cases in the left branch a predicted TargetB value equal to 0.43 and to all cases in the right branch a predicted TargetB value equal to 0.56.
In general, decision tree predictive models assign all cases in a leaf the same predicted target value. For binary targets, this equals the percentage of cases in the target variable's primary outcome (usually the target=1 outcome).
Adding More Splits

Use the following steps to interactively add more splits to the decision tree:

1. Select the lower left partition. The Split Node dialog box is updated to show the possible partition variables and their respective logworths.

[Split Node 3 dialog box: candidate inputs sorted by -Log(p); DemMedHomeValue (Median Home Value Region) is highest at 3.45360, followed by DemPctVeterans at 2.38360 and StatusCatStarAll at 1.81739.]

The input with the highest logworth is Median Home Value Region.
2. Select Edit Rule.... The DemMedHomeValue - Interval Splitting Rule dialog box opens.

[DemMedHomeValue - Interval Splitting Rule dialog box: missing values are assigned to branch 2; Branch 1 has split point < 67350 and Branch 2 has split point >= 67350.]
Branch 1 contains all cases with a median home value less than 67350. Branch 2 contains all cases with a median home value equal to or greater than 67350.
3. Select Apply. The Tree View window is updated to show the additional split.

[Tree View: the branch with Gift Count 36 Months < 2.5 now splits on Median Home Value Region; the branch with values < 67350 contains 846 cases (63%/37%), and the branch with values >= 67350 or Missing contains 1437 cases (53%/47%).]
Both left and right leaves contain a lower-than-average proportion of cases with TargetB=1.
4. Repeat the process for the branch that corresponds to cases with Gift Count 36 Months in excess of 2.5.

[Tree View: the branch with Gift Count 36 Months >= 2.5 splits on Gift Amount Last; the branch with values < 7.5 contains 657 cases (36%/64%), and the branch with values >= 7.5 or Missing contains 1903 cases (47%/53%).]
The right-branch cases are split using the Gift Amount Last input. This time, both branches have a higher-than-average concentration of TargetB=1 cases.
It is important to remember that the logworth of each split, reported above, and the predicted TargetB values are generated using the training data only. The main focus is on selecting useful input variables for the first predictive model. A diminishing marginal usefulness of input variables is expected, given the split-search discussion at the beginning of this chapter. This prompts the question: Where do you stop splitting? The validation data provides the necessary information to answer this question.
Changing a Splitting Rule

The bottom left split on Median Home Value at 67350 was found to be optimal in a statistical sense, but it might seem strange to someone simply looking at the tree. (Why not split at the more socially acceptable value of 70000?) You can use the Splitting Rule window to edit where a split occurs. Use the following steps to define your own splitting rule for an input.

1. Select the node above the label Median Home Value Region.

[Screen capture: the Tree View with the node above the Median Home Value Region split selected.]
2. Right-click this node and select Split Node... from the menu. The Split Node window opens.

[Split Node 3 dialog box: DemMedHomeValue (Median Home Value Region) again has the highest logworth, 3.36085.]
3. Select Edit Rule.... The DemMedHomeValue - Interval Splitting Rule dialog box opens.

[DemMedHomeValue - Interval Splitting Rule dialog box: Branch 1 has split point < 67350, Branch 2 has split point >= 67350, and the New split point field is empty.]
4. Type 70000 in the New split point field.

[Screen shot: Interval Splitting Rule dialog box with 70000 entered as the new split point]
5. Select Add Branch. The Interval Splitting Rule dialog box shows three branches.

[Screen shot: Interval Splitting Rule dialog box with split points at 67350 and 70000]
6. Select Branch 1 by highlighting the row.

[Screen shot: Interval Splitting Rule dialog box with Branch 1 (< 67350) selected]
7. Select Remove Branch.

[Screen shot: Interval Splitting Rule dialog box with two branches at split point 70000]
The split point for the partition is moved to 70000.

Note: In general, to change a Decision Tree split point, add the new split point first and then remove the unwanted split point.
8. Select OK to close the Interval Splitting Rule dialog box and to update the Tree View window.
[Screen shot: updated Tree View. The Median Home Value Region node now splits at 70000: the < 70000 or Missing branch holds 922 cases (62% 0, 38% 1) and the >= 70000 branch holds 1,361 cases (54% 0, 46% 1).]
Overriding the statistically optimal split point slightly changed the concentration of cases in both nodes.
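The effect of overriding a split point can be checked directly. Below is a minimal sketch (Python with pandas; the data frame name train and the exact missing-value routing are assumptions for illustration, not the code SAS Enterprise Miner runs) that tabulates branch counts and TargetB proportions for both candidate split points.

    import pandas as pd

    def branch_stats(df, split_value, var="DemMedHomeValue", target="TargetB"):
        # Count cases and the TargetB=1 proportion in each branch. Cases with
        # a missing split variable are routed to the right branch here; the
        # Interval Splitting Rule dialog box lets you assign them to either.
        left = df[df[var] < split_value]        # NaN comparisons are False,
        right = df[~(df[var] < split_value)]    # so missing values land here
        return {"< split": (len(left), left[target].mean()),
                ">= split or Missing": (len(right), right[target].mean())}

    # train is assumed to hold the training partition
    # print(branch_stats(train, 67350))   # statistically optimal split point
    # print(branch_stats(train, 70000))   # manually chosen split point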
Creating the Maximal Tree

As demonstrated thus far, the process of building the decision tree model is one of carefully considering and selecting each split sequentially. There are other, more automated ways to grow a tree. Follow these steps to automatically grow a tree with the SAS Enterprise Miner Interactive Tree tool:

1. Select the root node of the tree.

2. Right-click and select Train Node from the menu. The tree is grown until stopping rules prohibit further growth.

Note: Stopping rules were discussed previously and in the Self-Study section at the end of this chapter.

It is difficult to see the entire tree without scrolling. However, you can zoom into the Tree View window to examine the basic shape of the tree.

3. Select Options ⇨ Zoom ⇨ 50% from the main menu. The tree view is scaled to provide the general shape of the maximal tree.
[Screen shot: zoomed Tree View showing the general shape of the maximal tree]
Plots and tables contained in the Interactive Tree tool provide a preliminary assessment of the maximal tree.
4. Select View ⇨ Assessment Plot.

[Screen shot: Assessment Plot showing Proportion Misclassified versus Number of Leaves]
While the majority of the improvement in fit occurs over the first few splits, it appears that the maximal, fifteen-leaf tree generates a lower misclassification rate than any of its simpler predecessors. The plot seems to indicate that the maximal tree is preferred for assigning predictions to cases. However, this plot is misleading. Recall that the basis of the plot is the training data. Using the same sample of data both to evaluate input variable usefulness and to assess model performance commonly leads to overfit models. This is the mistake that the Mosteller and Tukey quotation cautioned you about.

You created a predictive model that assigns one of 15 predicted target values to each case. Your next task is to investigate how well this model generalizes to another similar, but independent, sample: the validation data.

5. Select File ⇨ Exit to close the Interactive Tree application. A dialog box opens. You are asked whether you want to save the maximal tree to the default location.
[Screen shot: Unsaved Changes dialog box asking whether to save the changes to the default location, EMWS2.TREE_BROWSETREE]
6. Select Yes. The Interactive Tree results are now available from the Decision Tree node in the SAS Enterprise Miner flow.
3.3 Optimizing the Complexity of Decision Trees
Model Essentials - Decision Trees: Optimize Complexity

The maximal tree represents the most complicated model you are willing to construct from a set of training data. To avoid potential overfitting, many predictive modeling procedures offer some mechanism for adjusting model complexity. For decision trees, this process is known as pruning.
Predictive Model Sequence

The general principle underlying complexity optimization is creating a sequence of related models of increasing complexity from training data and using validation data to select the optimal model.
The Maximal Tree

A maximal tree is the most complex model in the sequence. The maximal tree that you generated earlier represents this most complex model.
Pruning One Split

The predictive model sequence consists of subtrees obtained from the maximal tree by removing splits. The first batch of subtrees is obtained by pruning (that is, removing) one split from the maximal tree. This process results in several models, each with one less leaf than the maximal tree.
Each subtree's predictive performance is rated on validation data: the subtrees are compared using a model rating statistic calculated on the validation sample. (Choices for model fit statistics are discussed later.)
The subtree with the best validation assessment statistic is selected to represent this level of model complexity.
Pruning Two Splits

A similar process is followed to obtain the next batch of subtrees.
This time the subtrees are formed by removing two splits from the maximal tree.
Once again, the predictive rating of each subtree is compared using the validation data.
The subtree with the highest predictive rating is chosen to represent the given level of model complexity.
Subsequent Pruning

Pruning continues until all subtrees are considered; the process is repeated until all levels of complexity are represented by subtrees.
Selecting the Best Tree

Compare validation assessment between tree complexities: consider the validation assessment rating for each of the subtrees.
Which of the sequence of subtrees is the best model?
Validation Assessment

In SAS Enterprise Miner, the simplest model with the highest validation assessment is considered best.
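This selection rule is simple enough to state in code. A minimal sketch, assuming the best subtree for each leaf count has already been scored on the validation data (the numbers below are invented for illustration):

    def best_tree(assessment):
        # assessment maps number of leaves -> validation statistic
        # (larger = better); return the smallest tree achieving the best value
        best_value = max(assessment.values())
        return min(leaves for leaves, value in assessment.items()
                   if value == best_value)

    # Hypothetical validation accuracies for subtrees with 1 to 6 leaves:
    scores = {1: 0.50, 2: 0.55, 3: 0.57, 4: 0.58, 5: 0.58, 6: 0.57}
    print(best_tree(scores))  # -> 4, the simplest tree with the highest rating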
What are appropriate validation assessment ratings? Of course, you must finally define what is meant by validation assessment ratings.
Assessment Statistics

Two factors determine the appropriate validation assessment rating, or more properly, assessment statistic:
• the target measurement scale
• the prediction type

An appropriate statistic for a binary target might not make sense for an interval target. Similarly, models tuned for decision predictions might make poor estimate predictions.
Binary Targets

For this discussion, a binary target is assumed with a primary outcome (target=1) and a secondary outcome (target=0). The appropriate assessment measure for interval targets is noted below.
Binary Target Predictions

There are different summary statistics corresponding to each of the three prediction types:
• decisions
• rankings
• estimates

The statistic that you should use to judge a model depends on the type of prediction that you want.
Decision Optimization

Consider decision predictions first. With a binary target, you will typically consider two decision types:
• the primary decision, corresponding to the primary outcome
• the secondary decision, corresponding to the secondary outcome
Decision Optimization - Accuracy

Matching the primary decision with the primary outcome yields a correct decision called a true positive. Likewise, matching the secondary decision to the secondary outcome yields a correct decision called a true negative. Decision predictions can be rated by their accuracy, that is, the proportion of agreement between prediction and outcome. Good decisions maximize accuracy.
Decision Optimization - Misclassification

Mismatching the secondary decision with the primary outcome yields an incorrect decision called a false negative. Likewise, mismatching the primary decision to the secondary outcome yields an incorrect decision called a false positive. Decision predictions can be rated by their misclassification, the proportion of disagreement between prediction and outcome. Good decisions minimize misclassification.
Ranking Optimization

Consider ranking predictions for binary targets. With ranking predictions, a score is assigned to each case. The basic idea is to rank the cases based on their likelihood of being a primary or secondary outcome. Likely primary outcomes receive high scores and likely secondary outcomes receive low scores.
Ranking Optimization - Concordance

When a pair of primary and secondary cases is correctly ordered (target=0 with a low score, target=1 with a high score), the pair is said to be in concordance. Ranking predictions can be rated by their degree of concordance, that is, the proportion of such pairs whose scores are correctly ordered. Good rankings maximize concordance.
Ranking Optimization - Discordance

When a pair of primary and secondary cases is incorrectly ordered (target=0 with a high score, target=1 with a low score), the pair is said to be in discordance. Ranking predictions can be rated by their degree of discordance, that is, the proportion of such pairs whose scores are incorrectly ordered. Good rankings minimize discordance.
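Concordance and discordance are defined over all primary/secondary case pairs. A minimal sketch (a brute-force enumeration, fine for illustration but too slow for large samples):

    def pair_ordering(primary_scores, secondary_scores):
        # Proportions of primary/secondary pairs that are correctly ordered
        # (concordant) and incorrectly ordered (discordant); ties are the rest.
        pairs = [(p, s) for p in primary_scores for s in secondary_scores]
        concordant = sum(p > s for p, s in pairs) / len(pairs)
        discordant = sum(p < s for p, s in pairs) / len(pairs)
        return concordant, discordant

    # Scores for three target=1 cases and three target=0 cases:
    print(pair_ordering([0.9, 0.7, 0.4], [0.6, 0.3, 0.2]))  # (0.889, 0.111)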
Estimate Optimization

Finally, consider estimate predictions. For a binary target, estimate predictions are the probability of the primary outcome for each case. Primary outcome cases should have a high predicted probability; secondary outcome cases should have a low predicted probability.
Estimate Optimization - Squared Error

The squared difference between target and estimate, (target - estimate)^2, is called the squared error. Good estimates minimize squared error. Averaged over all cases, squared error is a fundamental statistical measure of model performance:

    \text{Average square error} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2

where, for the i-th case, y_i is the actual target value and \hat{y}_i is the predicted target value. When calculated in an unbiased fashion, the average square error is related to the amount of bias in a predictive model. A model with a lower average square error is less biased than a model with a higher average square error.
Note: Because SAS Enterprise Miner enables the target variable to have more than two outcomes, the actual formula used to calculate average square error is slightly different (but equivalent, for a binary target). Suppose the target variable has L outcomes given by (C_1, C_2, ..., C_L). Then

    \text{Average square error} = \frac{1}{NL}\sum_{i=1}^{N}\sum_{j=1}^{L}\left(y_{ij} - p_{ij}\right)^2

where, for the i-th case, p_{ij} is the predicted probability of the j-th class, and y_{ij} equals 1 if the actual target value equals C_j and 0 otherwise.
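A short numeric check of both formulas, assuming the predictions are estimated probabilities of the primary outcome; as the note states, the multi-outcome version reduces to the binary one when L = 2:

    def ase_binary(actual, prob_primary):
        # actual is 0/1; prob_primary is the estimated P(primary outcome)
        return sum((y - p) ** 2
                   for y, p in zip(actual, prob_primary)) / len(actual)

    def ase_multiclass(actual, probs, classes):
        # probs[i][j] = estimated probability that case i is classes[j];
        # y_ij is 1 when the actual value equals C_j and 0 otherwise
        n, L = len(actual), len(classes)
        sse = sum(((1.0 if actual[i] == c else 0.0) - probs[i][j]) ** 2
                  for i in range(n) for j, c in enumerate(classes))
        return sse / (n * L)

    y, p = [1, 0, 1], [0.8, 0.3, 0.6]
    print(ase_binary(y, p))                                    # 0.0966...
    print(ase_multiclass(y, [[1 - q, q] for q in p], [0, 1]))  # same value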
Complexity Optimization - Summary

• decisions: accuracy / misclassification
• rankings: concordance / discordance
• estimates: squared error

In summary, decisions require high accuracy or low misclassification, rankings require high concordance or low discordance, and estimates require low (average) squared error.
Assessing a Decision Tree

Use the following steps to perform an assessment of the maximal tree on the validation sample.

1. Select the Decision Tree node. In the Decision Tree node's properties, change Use Frozen Tree from No to Yes.
[Screen shot: Decision Tree properties panel with Use Frozen Tree set to Yes]
The Frozen Tree property prevents the maximal tree from being changed by other property settings when the flow is run.
2. Right-click the Decision Tree node and run it. Select Results.

[Screen shot: Run Status dialog box with the Results... button]
The Results window opens.

[Screen shot: Results window for the Decision Tree node, showing a leaf statistics bar chart, a fit statistics table with training and validation columns, a tree map, and a variable importance chart]

The Results window contains a variety of diagnostic plots and tables, including a cumulative lift chart, a tree map, and a table of fit statistics. The diagnostic tools shown in the results vary with the measurement level of the target variable. Some of the diagnostic tools contained in the results shown above are described in the following Self-Study section.
The saved tree diagram is in the bottom left corner of the Results window. You can verify this by maximizing the tree plot.
[Screen shot: maximized tree plot of the saved 15-leaf maximal tree]
4. The main focus of this part of the demonstration is on assessing the performance of a 15-leaf tree, created with the Interactive Decision Tree tool, on the validation sample. Notice that several of the diagnostics shown in the results of the Decision Tree node use the validation data as a basis. Select View ⇨ Model ⇨ Iteration Plot.

[Screen shot: Iteration Plot showing training and validation Average Square Error versus Number of Leaves]
The plot shows the Average Square Error corresponding to each subtree as the data is sequentially split. This plot is similar to the one generated with the Interactive Decision Tree tool, and it confirms suspicions about the optimality of the 15-leaf tree. The performance on the training sample becomes monotonically better as the tree becomes more complex. However, the performance on the validation sample only improves up to a tree of approximately four or five leaves, and then diminishes as model complexity increases.

Note: The validation performance shows evidence of model overfitting. Over the range of one to approximately four leaves, the precision of the model improves with the increase in complexity. A marginal increase in complexity over this range results in better accommodation of the systematic variation, or signal, in the data. Precision diminishes as complexity increases past this range; the additional complexity accommodates idiosyncrasies in the training sample, and the model extrapolates less well.
Both axes of the Iteration Plot, shown above, require some explanation.

Average Square Error: The formula for average square error was given previously. For any decision tree in the sequence, the necessary inputs for calculating this statistic are the actual target value and the prediction. Recall that predictions generated by trees are the proportion of cases with the primary outcome (TargetB=1) in each terminal leaf. This value is calculated for both the training and validation data sets.

Number of Leaves: Starting with the four-leaf tree that you constructed earlier, it is possible to create a sequence of simpler trees by removing partitions. For example, a three-leaf tree can be constructed from your four-leaf tree by removing either the left partition (involving Median Home Value) or the right partition (involving Gift Amount Last). Removing either split might affect the generated average square error. The subtree with the lowest average square error (measured, by default, on the validation data) is the one whose split remains. To create simpler trees, the process is repeated by, once more, starting with the four-leaf tree that you constructed and removing more leaves. For example, the two-leaf tree is created from the four-leaf tree by removing two splits, and so on.
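The pruning-back step just described can be sketched abstractly. In the snippet below, tree.without(split) and tree.predict_proba(...) are hypothetical helpers standing in for "collapse one split into its parent leaf" and "score a data set"; only the selection logic is the point.

    def prune_one_split(tree, removable_splits, X_valid, y_valid, ase):
        # From all subtrees formed by collapsing one split, keep the one
        # with the lowest validation average square error.
        candidates = [tree.without(s) for s in removable_splits]
        return min(candidates,
                   key=lambda t: ase(y_valid, t.predict_proba(X_valid)))

    # Repeating this until one leaf remains yields the subtree sequence
    # plotted above: one best tree for each number of leaves.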
5. To further explore validation performance, select the arrow in the upper left corner of the Iteration Plot, and switch the assessment statistic to Misclassification Rate.

[Screen shot: Iteration Plot showing training and validation Misclassification Rate versus Number of Leaves]
The validation performance under Misclassification Rate is similar to the performance under Average Square Error. The optimal tree appears to have roughly four or five leaves.

Proportion misclassified: The name decision tree model comes from the decision that is made at the leaf, or terminal node. In SAS Enterprise Miner, this decision, by default, is a classification based on the predicted target value. A predicted target value in excess of 0.5 results in a primary outcome classification, and a predicted target value less than 0.5 results in a secondary outcome classification. Within a leaf, the predicted target value decides the classification for all cases, regardless of each case's actual outcome. Clearly, this classification will be inaccurate for all cases that are not in the assigned class. The fraction of cases for which the wrong decision (classification) was made is the proportion misclassified. This value is calculated for both the training and validation data sets.

6. The current Decision Tree node will be renamed and saved for reference. Close the Decision Tree Results window.
7. Right-click on the Decision Tree node and select Rename. Name the node Maximal Tree.

The current process flow should look similar to the following:

[Screen shot: process flow diagram with the Data Partition node connected to the Maximal Tree node]

The next step is to generate the optimal tree using the default Decision Tree properties in SAS Enterprise Miner.
Pruning a Decision Tree

The optimal tree in the sequence can be identified using tools contained in the Decision Tree node. The relevant properties are listed under the Subtree heading.

1. Drag another Decision Tree node from the Model tab into the diagram. Connect it to the Data Partition node so that your diagram looks similar to what is shown below.

[Screen shot: process flow diagram with the Data Partition node connected to both the Maximal Tree node and the new Decision Tree node]
2. Select the new Decision Tree node and scroll down in the Decision Tree properties to the Subtree section. The main tree pruning properties are listed under the Subtree heading.

[Screen shot: Decision Tree properties panel, Subtree section: Method = Assessment, Number of Leaves = 1, Assessment Measure = Decision, Assessment Fraction = 0.25]

The default method used to prune the maximal tree is Assessment. This means that algorithms in SAS Enterprise Miner choose the best tree in the sequence based on some optimality measure.
Note: Alternative Method options are Largest and N. The Largest option provides an autonomous way to generate the maximal tree. The N option generates a tree with N leaves. The maximal tree provides an upper bound on N.

The Assessment Measure property specifies the optimality measure used to select the best tree in the sequence. The default measure is Decision. The default measure varies with the measurement level of the target variable and other settings in SAS Enterprise Miner.

Note: Metadata plays an important role in how SAS Enterprise Miner functions. Recall that a binary target variable was selected for the project. Based on this, SAS Enterprise Miner assumes that you want a tree that is optimized for making the best decisions (as opposed to the best rankings or best probability estimates). That is, under the current project settings, SAS Enterprise Miner chooses the tree with the lowest misclassification rate on the validation sample by default.
3. Right-click on the new Decision Tree node and select Run.

[Screen shot: Decision Tree node menu with Run selected]
4. After the Decision Tree node runs, select Results....

[Screen shot: Run Status dialog box with the Results... button]
5. In the Results window, select View ⇨ Model ⇨ Iteration Plot. The Iteration Plot shows model performance under Average Square Error by default. However, this is not the criterion used to select the optimal tree.

6. Change the basis of the plot to Misclassification Rate.

[Screen shot: Iteration Plot menu with Misclassification Rate selected]
The five-leaf tree has the lowest associated misclassification rate on the validation sample.

[Screen shot: Iteration Plot showing training and validation Misclassification Rate versus Number of Leaves, with the minimum validation misclassification at five leaves]
Notice that the first part of the process in generating the optimal tree is very similar to the process followed in the Interactive Decision Tree tool. The maximal, 15-leaf tree is generated. The maximal tree is then sequentially pruned so that the sequence consists of the best 15-leaf tree, the best 14-leaf tree, and so on. Best is defined at any point in the sequence (for example, eight leaves) as the eight-leaf tree with the lowest validation misclassification rate among all candidate eight-leaf trees. The tree in the sequence with the lowest overall validation misclassification rate is selected.
The optimal tree generated by the Decision Tree node is found in the bottom left corner of the Results window.
[Screen shot: optimal five-leaf tree. The root splits on Gift Count 36 Months at 2.5; the branches split on Gift Amount Last at 7.5 and on Time Since Last Gift at 17.5; one branch splits further on Gift Amount Average Card 36 Months at 14.415.]
7. Close the Results window of the Decision Tree node.
Alternative Assessment Measures

The decision tree generated above is optimized for decisions, that is, to make the best classification of cases into donors and non-donors. What if, instead of classification, the primary objective of the project is to assess the probability of donation? The discussion above describes how the different assessment measures correspond to the optimization of different modeling objectives. Generate a tree that is optimized for generating probability estimates. The main purpose is to explore how different modeling objectives can change the optimal model specification.

1. Navigate to the Model tab. Drag a new Decision Tree into the process flow.
D-l^l-tiixiSll^lBllGlDlAJl^l* 3 My Project - O Data Sources -i JTJ Diagrams
•|0|fl| \.UI-£i ^j^lidiiUj*]
^ J ^ l i d M Jill 5arr.pl? | Explore | M-j'i-fy | Model |
1
|
Scoring |
Mi
• Predictive Analysis
| Model Packages Property
General
Node ID imported Data Exported Data doles
Train
Variables Interactive Use Frozen Tree No use MultipleTargeNo
J
a
J
^JD.i.iion Tret
Interval Critenon ProbF •Nominal Criterion ProbChisc; '•Ordinal Criterion Enlmpy I Signified'e Lew*I 0.2 . Mssinc Values Use in sea-ch !Use Input Once No : vatrr.j':i Branch 2 j-Mailmum Depth E -Minimum CalegoriiS
m
•J533BI Leaf See
1
General pun completed
2.
ft} «udent as student ft*, Connected to SASApp - Logical Workspace Server
Right-click the n e w Decision Tree node and select R e n a m e .
3. N a m e the n e w Decision Tree node P r o b a b i l i t y Rename
m
Node Name; [probability Tree|
OK
Cancel
Tree.
4. Connect the Probability Tree node to the Data Partition node. Your diagram should look similar to the following:

[Screen shot: process flow diagram with the Probability Tree node connected to the Data Partition node]

5. Recall that an appropriate assessment measure for generating optimal probability estimates is Average Square Error.
6. Scroll down in the properties of the Probability Tree node to Subtree.

7. Change the Assessment Measure property to Average Square Error.

[Screen shot: Probability Tree properties panel, Subtree section, with Assessment Measure = Average Square Error]

8. Run the Probability Tree node.

9. After it runs, select Results....
10. In the Results window, select View ⇨ Model ⇨ Iteration Plot.

[Screen shot: Iteration Plot showing training and validation Average Square Error versus Number of Leaves, with the minimum validation average square error at five leaves]
A five-leaf tree is also optimal under the Validation Average Square Error criterion. However, viewing the Tree plot reveals that, although the optimal Decision (Misclassification) Tree and the optimal Probability Tree have the same number of leaves, they are not identical trees.
11. Maximize the Tree plot at the bottom left corner of the Results window. The tree shown below is a different five-leaf tree than the tree optimized under Validation Misclassification.

[Screen shot: optimal Probability Tree. The root splits on Gift Count 36 Months at 2.5; the branches split on Median Home Value Region and on Gift Amount Last at 7.5; one branch splits further on Time Since Last Gift.]
It is important to remember how the assessment measures work in model selection. In the first step, cultivating (or growing or splitting) the tree, the important measure is logworth. Splits of the data are made based on logworth to isolate subregions of the input space with high proportions of donors and non-donors. Splitting continues until a boundary associated with a stopping rule is reached. This process generates the maximal tree. The maximal tree is then sequentially pruned. Each subtree in the pruning sequence is chosen based on the selected assessment measure. For example, the eight-leaf tree in the Decision (Misclassification) Tree pruning sequence is the eight-leaf tree with the lowest misclassification rate on the validation sample among all candidate eight-leaf trees. The eight-leaf tree in the Probability Tree pruning sequence is the eight-leaf tree with the lowest average square error on the validation sample among all candidate eight-leaf trees. This is how the change in assessment measure can generate a change in the optimal specification.
Note: In order to minimize clutter in the diagram, the Maximal Tree node is deleted and does not appear in the notes for the remainder of the course.
3.4 Understanding Additional Diagnostic Tools (Self-Study)

Several additional plots and tables contained in the Decision Tree node Results window are designed to assist in interpreting and evaluating a decision tree. The most useful are demonstrated on the next several pages.
Understanding Additional Plots and Tables (Optional)

Tree Map

The Tree Map window is intended to be used in conjunction with the Tree window to gauge the relative size of each leaf.

1. Select the small rectangle in the lower right corner of the Tree Map.

[Screen shot: Tree Map window with the lower right rectangle selected]
The lower right leaf in the tree is also selected.

[Screen shot: Tree window with the corresponding leaf highlighted]

The size of the small rectangle in the Tree Map indicates the fraction of the training data present in the corresponding leaf. It appears that only a small fraction of the training data finds its way to this leaf.
Leaf Statistic Bar Chart

[Screen shot: Leaf Statistics bar chart showing training and validation percentages by leaf index]

This window compares the blue predicted outcome percentages in the left bars (from the training data) to the red observed outcome percentages in the right bars (from the validation data). This plot reveals how well training data response rates are reflected in the validation data. Ideally, the bars should be of the same height. Differences in bar heights are usually the result of small case counts in the corresponding leaf.
Variable Importance

The following steps open the Variable Importance window. Select View ⇨ Model ⇨ Variable Importance. The Variable Importance window opens.

[Screen shot: Variable Importance window listing each input's number of splitting rules, training importance, validation importance, and the ratio of validation to training importance. Gift Count 36 Months has importance 1.0, followed by Gift Amount Last (0.524), Median Home Value (0.503), and Time Since Last Gift (0.481).]

The Variable Importance window provides insight into the importance of inputs in the decision tree. The magnitude of the importance statistic relates to the amount of variability in the target explained by the corresponding input relative to the input at the top of the table. For example, Gift Amount Last explains 52.4% of the variability explained by Gift Count 36 Months.
The Score Rankings Overlay Plot

[Screen shot: Score Rankings Overlay plot of cumulative lift versus percentile for the TRAIN and VALIDATE data]

The Score Rankings Overlay plot presents what is commonly called a cumulative lift chart. Cases in the training and validation data are ranked based on decreasing predicted target values. A fraction of the ranked data is selected (given by the decile value). The proportion of cases with the primary target value in this fraction is compared to the proportion of cases with the primary target value overall (given by the cumulative lift value). A useful model shows high lift in low deciles in both the training and validation data. For example, the Score Rankings plot shows that in the top 20% of cases (ranked by predicted probability), the training and validation lifts are approximately 1.26. This means that cases in this top 20% are about 26% more likely to have the primary outcome than a randomly selected 20% of cases.
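The cumulative lift reading can be reproduced in a few lines. A minimal sketch, assuming 0/1 actual outcomes and predicted primary-outcome probabilities:

    def cumulative_lift(actual, predicted, depth=0.20):
        # Response rate among the top `depth` fraction of cases, ranked by
        # predicted probability, divided by the overall response rate.
        ranked = sorted(zip(predicted, actual), key=lambda t: t[0], reverse=True)
        k = max(1, int(round(depth * len(ranked))))
        top_rate = sum(a for _, a in ranked[:k]) / k
        overall_rate = sum(actual) / len(actual)
        return top_rate / overall_rate

    # A value near 1.26 at depth=0.20 matches the reading from the plot above.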
Fit Statistics

  Statistic   Label                        Train      Validation
  _NOBS_      Sum of Frequencies           4843       4843
  _SUMW_      Sum of Case Weights          9686       9686
  _MISC_      Misclassification Rate       0.420607   0.42804
  _MAX_       Maximum Absolute Error       0.643836   0.643836
  _SSE_       Sum of Squared Errors        2351.152   2357.331
  _ASE_       Average Squared Error        0.242737   0.243375
  _RASE_      Root Average Squared Error   0.492684   0.493331
  _DIV_       Divisor for ASE              9686       9686
  _DFT_       Total Degrees of Freedom     4843

The Fit Statistics window is used to compare various models built within SAS Enterprise Miner. The misclassification and average square error statistics are of most interest for this analysis.
The Output Window

[Listing: Training Output, including a variable summary]

  Role      Measurement Level   Frequency Count
  INPUT     BINARY              2
  INPUT     INTERVAL            20
  INPUT     NOMINAL             3
  REJECTED  INTERVAL            1
  TARGET    BINARY              1
  TARGET    INTERVAL            1

The Output window provides information generated by the SAS procedures that are used to generate the analysis results. For the decision tree, this information includes variable importance, tree leaf report, model fit statistics, classification information, and score rankings.
3.5 Autonomous Tree Growth Options (Self-Study)

Autonomous Decision Tree Defaults

Default Settings:
• Maximum Branches: 2
• Splitting Rule Criterion: Logworth
• Subtree Method: Average Profit
• Tree Size Options: Bonferroni Adjust, Split Adjust, Maximum Depth, Leaf Size

The behavior of the tree algorithm in SAS Enterprise Miner is governed by many parameters that can be divided into four groups:
• the number of splits to create at each partitioning opportunity
• the metric used to compare different splits
• the method used to prune the tree model
• the rules used to stop the autonomous tree growing process

The defaults for these parameters generally yield good results for an initial prediction.
Tree Variations: Maximum Branches

• Complicates split search
• Trades height for width
• Uses heuristic shortcuts

SAS Enterprise Miner accommodates a multitude of variations in the default tree algorithm. The first involves the use of multiway splits instead of binary splits. Theoretically, there is no clear advantage in doing this. Any multiway split can be obtained using a sequence of binary splits. The primary change is cosmetic. Trees with multiway splits tend to be wider than trees with only binary splits.

The inclusion of multiway splits complicates the split-search algorithm. A simple linear search becomes a search whose complexity increases geometrically in the number of splits allowed from a leaf. To combat this complexity explosion, the Tree tool in SAS Enterprise Miner resorts to heuristic search strategies.
Tree Variations: Maximum Branches 1
r» • l-.vfi*J r.,*_. . .. • ' u • 1 O'ffin >•rtrriQn r,, i... L**il
—
,
U » «n >» n t h
"1
j | J M Input 0*.r» :
M a j v n u m D«pD>
! Ljlf S C * -UunMU'ol f?u*< l
S * W f
l :.••> - HM i -
IhB
1
Maximum branches in split
* ;
Exhaustive search size limit
BUM
- M*mn J i_. ,.- • • Ml* u**'. ' A s » » r n # r t Fractal - N I-E - I
h
rvrittr
En*
Oil
141 T w o fields in the Properties panel affect the number of splits in a tree. T h e M a x i m u m Branch property sets an upper limit on the number of branches emanating from a node. W h e n this number is greater than the default of 2. the number of possible splits rapidly increases. T o save computation time, a limit is set in the Exhaustive property as to h o w m a n y possible splits are explicitly examined. W h e n this n u m b e r is exceeded, a heuristic algorithm is used in place of the exhaustive search described above. T h e heuristic algorithm alternately merges branches and reassigns consolidated groups of observations to different branches. T h e process stops w h e n a binary split is reached. A m o n g all candidate splits considered, the one with the best worth is chosen. T h e heuristic algorithm initially assigns each consolidated group of observations to a different branch, even if the number of such branches is more than the limit allowed in the final split. At each merge step, the two branches that degrade the worth of the partition the least are merged. After the two branches are merged, the algorithm considers reassigning consolidated groups of observations to different branches. Each consolidated group is considered in turn, and the process stops w h e n no group is reassigned.
Tree Variations: Splitting Rule Criterion

• Yields similar splits
• Grows enormous trees
• Favors many-level inputs

In addition to changing the number of splits, you can also change how the splits are evaluated in the split-search phase of the tree algorithm. For categorical targets, SAS Enterprise Miner offers three separate split-worth criteria. Changing from the chi-squared default criterion typically yields similar splits if the number of distinct levels in each input is similar. If not, the other split methods tend to favor inputs with more levels due to the multiple comparison problem discussed above. You can also cause the chi-squared method to favor inputs with more levels by turning off the Bonferroni adjustments. Because the Gini reduction and Entropy reduction criteria lack the significance threshold feature of the chi-squared criterion, they tend to grow enormous trees. Pruning and selecting a tree complexity based on validation profit limit this problem to some extent.
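For reference, a chi-squared logworth is just -log10 of a split's p-value after a multiple-comparison correction. A sketch using scipy; the simple Bonferroni-style factor below illustrates the idea and is not the exact Kass adjustment that SAS Enterprise Miner applies:

    import math
    from scipy.stats import chi2_contingency

    def logworth(branch_by_target_counts, n_candidate_splits=1):
        # -log10(adjusted p-value) for a split's branch-by-target table
        _, p_value, _, _ = chi2_contingency(branch_by_target_counts)
        p_adj = min(1.0, p_value * n_candidate_splits)  # Bonferroni-style
        return math.inf if p_adj == 0.0 else -math.log10(p_adj)

    # 2x2 table: rows = split branches, columns = target levels (0, 1)
    print(logworth([[620, 403], [1380, 1597]], n_candidate_splits=50))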
Tree Variations: Splitting Rule Criterion • IrtTtf-orCmflpjC" '•• Norn-rial Crdtnon -Qn>ny CtAtnan
Split Criterion
Pi:-. r
Us*mtt«cri • uttinpjo*(« Mi.krr^m Bf*rJh • itanrru*
r"* -
*
; fJirmrWrCfRmtl 5 • fiiiinh*ra[^Ljrrij^tf ftui*s • Sc-i::t :•
-Hflrtafiarnpla Htnm • tUPBtrULlm * • - •. LI_ ,
Chi-square logworth Entropy Gini
Categorical Criteria
Variance Prob-F logworth
Interval Criteria
ft ' 57^>tC"or
Ptfl '•'.H r - /<• A- • lis N u m b * ! of Subtil io QfHiOtlti .•
Logworth adjustments
Jifnm of kati U]<j(rri*nt .BtTor* "_lnpu(f Mq n
143
A total of five choices in S A S Enterprise Miner can evaluate split worth. Three (chi-squared logworth, entropy, and Gini) are used for categorical targets, and the remaining two (variance and ProbF logworth) are reserved for interval targets. Both chi-squared and ProbF logworths are adjusted (by default) for multiple comparisons. It is possible to deactivate this adjustment. T h e split worth for the entropy, Gini, and variance options are calculated as shown below. Let a set of cases S be partitioned intop subsets Si, ...,SP so that
Let the number of cases in S equal N and the number of cases in each subset S, equal n,. Then the worth of a particular partition of 5" is given by the following: worth =
l{S)-^w,l(Sl)
where WrnJN (the proportion of cases in subset SI), and for the specified split worth measure, /(•) has the following value: classes
res classes
Entropy Gini
Variance N
notk cotes
Each worth statistic measures the change in I(S) from node to branches. In the Variance calculation, Y is the average of the target value in the node with Y&tt as a m e m b e r .
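A minimal sketch of the three impurity measures and the resulting split worth (plain Python; the parent and branch label lists are assumed inputs):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def variance(values):
        mean = sum(values) / len(values)
        return sum((y - mean) ** 2 for y in values) / len(values)

    def split_worth(parent, branches, impurity=entropy):
        # worth = I(S) - sum_i (n_i / N) * I(S_i)
        n = len(parent)
        return impurity(parent) - sum(len(b) / n * impurity(b) for b in branches)

    parent = [1] * 8 + [0] * 8
    left, right = [1] * 6 + [0] * 2, [1] * 2 + [0] * 6
    print(split_worth(parent, [left, right], impurity=entropy))  # ~0.189
    print(split_worth(parent, [left, right], impurity=gini))     # 0.125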
Tree Variations: Subtree Method

The Subtree method controls complexity. SAS Enterprise Miner features two settings for regulating the complexity of a subtree: modify the pruning process or the Subtree method. By changing the model assessment measure to average square error, you can construct what is known as a class probability tree. It can be shown that this action minimizes the imprecision of the tree. Analysts sometimes use this model assessment measure to select inputs for a flexible predictive model such as neural networks. You can deactivate pruning entirely by changing the subtree to Largest. You can also specify that the tree be built with a fixed number of leaves.
The pruning options are controlled by two properties: Method (Assessment, Largest, or N) and Assessment Measure (Decision, Average Square Error, Misclassification, or Lift).
Tree Variations: Tree Size Options

• Avoids orphan nodes
• Controls sensitivity
• Grows large trees

The family of adjustments that you modify most often when building trees are the rules that limit the growth of the tree. Changing the minimum number of observations required for a split search and the minimum number of observations in a leaf prevents the creation of leaves with only one or a handful of cases. Changing the significance level and the maximum depth allows for larger trees that can be more sensitive to complex input and target associations. The growth of the tree is still limited by the depth adjustment made to the threshold.
The relevant properties are the logworth threshold, the maximum tree depth, and the minimum leaf size, together with the threshold depth adjustment. Changing the logworth threshold changes the minimum logworth required for a split to be considered by the tree algorithm (when using the chi-squared or ProbF measures). Increasing the minimum leaf size avoids orphan nodes. For large data sets, you might want to increase the maximum leaf setting to obtain additional modeling resolution. If you want big trees and insist on using the chi-squared split-worth criterion, deactivate the Split Adjustment option.
Exercises

1. Initial Data Exploration

A supermarket is offering a new line of organic products. The supermarket's management wants to determine which customers are likely to purchase these products. The supermarket has a customer loyalty program. As an initial buyer incentive plan, the supermarket provided coupons for the organic products to all of the loyalty program participants and collected data that includes whether these customers purchased any of the organic products.

The ORGANICS data set contains 13 variables and over 22,000 observations. The variables in the data set are shown below with the appropriate roles and levels:
  Name              Model Role   Measurement Level   Description
  ID                ID           Nominal             Customer loyalty identification number
  DemAffl           Input        Interval            Affluence grade on a scale from 1 to 30
  DemAge            Input        Interval            Age, in years
  DemCluster        Rejected     Nominal             Type of residential neighborhood
  DemClusterGroup   Input        Nominal             Neighborhood group
  DemGender         Input        Nominal             M = male, F = female, U = unknown
  DemRegion         Input        Nominal             Geographic region
  DemTVReg          Input        Nominal             Television region
  PromClass         Input        Nominal             Loyalty status: tin, silver, gold, or platinum
  PromSpend         Input        Interval            Total amount spent
  PromTime          Input        Interval            Time as loyalty card member
  TargetBuy         Target       Binary              Organics purchased? 1 = Yes, 0 = No
  TargetAmt         Rejected     Interval            Number of organic products purchased
Note: Although two target variables are listed, these exercises concentrate on the binary variable TargetBuy.

a. Create a new diagram named Organics.
b. Define the data set AAEM61.ORGANICS as a data source for the project.

   1) Set the roles for the analysis variables as shown above.
   2) Examine the distribution of the target variable. What is the proportion of individuals who purchased organic products?
   3) The variable DemClusterGroup contains collapsed levels of the variable DemCluster. Presume that, based on previous experience, you believe that DemClusterGroup is sufficient for this type of modeling effort. Set the model role for DemCluster to Rejected.
   4) As noted above, only TargetBuy will be used for this analysis and should have a role of Target. Can TargetAmt be used as an input for a model used to predict TargetBuy? Why or why not?
   5) Finish the Organics data source definition.

c. Add the AAEM61.ORGANICS data source to the Organics diagram workspace.

d. Add a Data Partition node to the diagram and connect it to the Data Source node. Assign 50% of the data for training and 50% for validation.

e. Add a Decision Tree node to the workspace and connect it to the Data Partition node.

f. Create a decision tree model autonomously. Use average square error as the model assessment statistic.

   1) How many leaves are in the optimal tree?
   2) Which variable was used for the first split? What were the competing splits for this first split?

g. Add a second Decision Tree node to the diagram and connect it to the Data Partition node.

   1) In the Properties panel of the new Decision Tree node, change the maximum number of branches from a node to 3 to allow for three-way splits.
   2) Create a decision tree model using average square error as the model assessment statistic.
   3) How many leaves are in the optimal tree?

h. Based on average square error, which of the decision tree models appears to be better?
3.6 Chapter Summary

Predictive modeling fulfills some of the promise of data mining. Past observation of target events coupled with past input measurements enables some degree of assistance for the prediction of future target event values, conditioned on current input measurements. All of this depends on the stationarity of the process being modeled. A predictive model must perform three essential tasks:

1. It must be able to systematically predict new cases, based on input measurements. These predictions might be decisions, rankings, or estimates.

2. A predictive model must be able to thwart the curse of dimensionality and successfully select a limited number of useful inputs. This is accomplished by systematically ignoring irrelevant and redundant inputs.

3. A predictive model must adjust its complexity to avoid overfitting or underfitting. In SAS Enterprise Miner, this adjustment is accomplished by splitting the raw data into a training data set for building the model and a validation data set for gauging model performance. The performance measure used to gauge performance depends on the prediction type. Decisions can be evaluated by accuracy or misclassification, rankings by concordance or discordance, and estimates by average square error.
Decision trees are a simple but potentially useful example of a predictive model. They score n e w cases by creating prediction rules. They select useful inputs via a split-search algorithm. They optimize complexity by building and pruning a maximal tree. A decision tree model can be built interactively, automatically, or autonomously. For autonomous models, the Decision Tree Properties panel provides options to control the size and shape of the final tree.
Decision Tree Tools Review i fatten
Partition raw data into training and validation sets. Interactively grow trees using the Tree Desktop application. You can control rules, selected inputs, and tree complexity. Autonomously grow decision trees based on property settings. Settings include branch count, split rule criterion, subtree method, and tree size options.
149
3.7 Solutions
3.7
Solutions
Solutions to Exercises 1. Initial Data Exploration a. Create a n e w diagram n a m e d Organics. 1) Select File •=> N e w «=> Diagram.... The Create N e w Diagram w i n d o w opens.
Diagram Name:
OK
Cancel —
2) Type O r g a n i c s in the D i a g r a m N a m e field.
Diagram Name: Organics OK
3) Select O K .
Cancel
3-117
3-118
Chapter 3 Introduction to Predictive Modeling: Decision Trees
b. Define the data set AAEM61. ORGANICS as a data source for the project. 1) Set the model role for the target variable. 2) Examine the distribution of the target variable. W h a t is the proportion of individuals w h o purchased organic products? a) Select File <=> N e w ^> Data Source.... The Data Source Wizard w i n d o w opens. JjfData Source Wizard â&#x20AC;&#x201D; StepI of 7 Metadata Source
Select a metadata source
Source: SAS Table
~3
Cancel
b) Select Next >. The wizard proceeds to Step 2.
3.7 Solutions
3-119
c) Type AAEM61. ORGANICS in the T a b l e field. jjjjfData Source Wizard - Step 2 of 7 Select a SAS Table
Select a S A S table
Table: |aaem6l .organics
Browse..
< Qack
Cancel
(text >
d) Select Next >. T h e wizard proceeds to Step 3.
m
Iff Data Source Wizard â&#x20AC;&#x201D; Step 3 of 7 Table Information Table Properties
Properly Table Name Description Member Type Data Set Type Engine Number of Variable? Created Date loclfied Date
Value AAEM6I. ORGANICS DATA DATA BASE 13 22223 April 10, 2008 10:37:03 AM EDT April 10, 2008 10:37:03 AM EDT
<Back
j Net > "j|
Cancel
3-120
Chapter 3 Introduction to Predictive Modeling: Decision Trees e) Select Next >. T h e wizard proceeds to Step 4. SCOAta Source Wizard â&#x20AC;&#x201D;5tep4of7 MetadataAdvisor Options Metadata Advisor Options Use the basic setting lo se! the initial measurement levels and roles based on tbe variable attributes. U s e the advanced setting to set the Initial measurement levels and roles based on both the variable attributes and distributions.
P Basic
r Advanced
<6ack
|QÂťxt>''l|
C***!
f) Select the A d v a n c e d radio button and select Customize.... T h e Advanced Advisor Options w i n d o w opens. g) Type 2 as the Class Levels Count Threshold value. Advanced Advisor Options Property
El Value
50 Missing Percentage Threshold Reject Vars with Excessive Missing ValueYes Class Levels Count Threshold 2 Detect Class Levels Yes Reject Levels Count Threshold 20 . , Reject Vars with Excessive Class Values Yes Y e s Class Levels Count Threshold If "Detect class levels"=Yes, interval variables with less than the number specified for this property wiU be marked as N O M I N A L . The default value is 20.
OK
Cancel
h) Select O K . T h e Advanced Advisor Options w i n d o w closes and you are returned to Step 4 of the Data Source Wizard.
3.7 Solutions 3-121 i) Select Next >. T h e wizard proceeds to Step 5. B y customizing the Advanced Metadata Advisor, most of the roles and levels are correctly set. j) Select Role O Rejected for T a r c r e t A m t . fjjjfData Source Wizard —Step 5 of 7 Column Metadata •* [ r not jEquai !o
i(nooe)
"J
lokrnns: V Label
r
Name | Role DemAfrl Input D e m A g e . _ Input DemCluster Rejected D e m C lusterO input DemGender Input DemReg Input D e m T V R e g Input ID ID PromClass Input PromSpend Input PromTime input TargetAmt Rejected TargetBuy Target
Show code
Explore
| Level Interval Interval Nominal Nominal Nominal Nominal Nominal Nominal Nominal Interval Interval Interval Binary
Oder
| Report No No No No Ho Ho Ho No No No No NO NO
Apply r Sanities
Bas* Drop
Lower Lir«
Upper Li
No Ho Ho Ho
1
No
No No No No No
Refresh Sgranary
< Back
.
Next >
Cancel
)
|
k) Select TargetBuy and select Explore. T h e Explore w i n d o w opens. Explore - AAEM61.ORG ANICS File
View fictions Window
HUl.ll.HJJJ.l.W!
^Jnjxj
_ t n l x | i Targets Value
Property
15000 -
22223
Rows Columns
13 AAEM61 ORGANICS DATA Random Max 22223 [l2345
Library Member Type S a m p l e Method Fetch Size Fetched R o w s R a n d o m Seed
S 10000 m S -
£
5000 -
n u
Apply
|
Plot...
^
' I
l
0
1
Or <i,inics Purchase Imlicato! ^jnjxj
5"! A A E M 6 1 . 0 R G A M C S Ous#
•ner
—
Organics...
o
°1
3-122
Chapter 3 Introduction to Predictive Modeling: Decision Trees T h e proportion of individuals w h o purchased organic products appears to be 2 5 % . I) Close the Explore window. 3) T h e variable DemC l u s t e r Group contains collapsed levels of the variable DemCluster. Presume that, based on previous experience, you believe that DemClusterGroup is sufficient for this type of modeling effort. Set the model role for DemCluster to Rejected. This is already done using the Advanced Metadata Advisor. Otherwise, select Role •=> Rejected for DemCluster. 4) A s noted above, only TargetBuy will be used for this analysis and should have a role of Target. C a n TargetAmt be used as an input for a model used to predict TargetBuy? W h y or w h y not? N o , using TargetAmt as an input is not possible. It is measured at the s a m e time as TargetBuy and therefore has no causal relationship to TargetBuy. 5) Finish the O r g a n i c s data source definition. a) Select Next >. T h e wizard proceeds to Step 6. mm
ItfData Source Wizard" - Step b or 9 Decision Configuration
Decision Processing Do you want to build models based on the vabes of the decisions ? IF you answer yes, you may enter information on the cost or profit of each possible decision, prior probability and cost Function, The data will be scanned for the distributions of the target variables,
® <Back
|| Ne.t >
|j
£ancel
|
3.7 Solutions 3-123 b) Select Next >. T h e wizard proceeds to Step 7. £f Data Source Wizard — Step 7of 8 Data Source Attributes
You m a y change the n a m e and the tole, and can specify a population segment Identifier for the data source
5
Name:
ORGANICS
Role:
| Raw
33
Segment: \
Notes:
<gack
|; Next > |
Cancel
c) Select Next >. T h e wizard proceeds to Step 8.
m
£ D,it ,i Source Wizard -- Step 0 of a Summary Metadata Completed. Library: AAEM81 Table: ORGANICS Role: Raw Role ID Input Input Rejected Rejected Target
Count 1 1
Level Nominal Interval Nominal Interval Nominal Binary
5 1 1 1
<Back
[ hash
|
Cancel
|
d) Select Finish. T h e wizard closes and the O r g a n i c s data source is ready for use in the Project Panel. 13 My Project <? E3| Data Sources IORGANICS PVA97NK <? Lii] Diagrams tj££> Organics Predictive Analysis ©- H j Model Packages ° - JLl Users
3-124
Chapter 3 Introduction to Predictive Modeling: Decision Trees
c. A d d the A A E M 6 1 . O R G A N I C S data source to the Organics diagram workspace. Organics
ET [El
ORGANICS
-V
100%
d. A d d a Data Partition node to the diagram and connect it to the Data Source node. Assign 5 0 % of the data for training and 5 0 % for validation. Organics
ORGANICS
ate Partition
°x
0
0 100%
â&#x20AC;˘
3.7 Solutions 1) Type 5 0 as the Training and Validation values under Data Set Allocations. 2) Type 0 as the Test value.
Variables Data OutputType Partitioning Method Default 12345 Random Seed [•Training [•Validation -Test
.nil
50.0 50.0 0.0
e. A d d a Decision Tree node to the workspace and connect it to the Data Partition node. cc
° Organics
ORGANICS
)nta Paitrtion
Decision Tr ee
3-125
3-126
Chapter 3 Introduction to Predictive Modeling: Decision Trees
f. Create a decision tree model autonomously. Use average square error as the model assessment statistic. • Select Average Square Error as the Assessment Measure property. Train Variables Interactive
Q
I
B
H
Default [-Criterion 0 2 ?• Significance Level Use in search [Missing Values NO [ Use Input Once [•Maximum Branch 2 [•Maximum Depth 6 -Minimum Categorical S 5 llMode 5 [-Leaf Size [•Number ofRules 5 [-Number of Surrogate R 0 '•Split Size =]Split Search [-Exhaustive 5000 : 20000 -Node Sample Assessment [-Method 1 [•Number of Leaves [•Assessment Measure Average Square Error I -Assessment Fraction 0 . 2 5
Right-click on the Decision Tree node and select R u n from the Option m e n u . Select Yes in the Confirmation window.
Do you want to run this path? Diagram: Organics Path:
Decision Tree Yes
No
1) H o w m a n y leaves are in the optimal tree?
Run completed Diagram:Organics Path: OK
Decision Tree
Results..
3.7 Solutions
3-127
a) W h e n the Decision Tree node run finishes, select Results... from the R u n Status window. T h e Results w i n d o w opens. ' Result* - Node: Decision Tree Diagram: Organics Fie Ed* yjew Window Score Rankings Overlays TargetBuy :uirojl*stjve lift
lO Training g h ! Starbt c* Target TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargelSuy TargetBuy TargetBuy TargetBuy
n staaaci SUtntHn _NOBS_ _SUMW_ _M!SC_ _MAX _SSE_ „ASE_ _RASE_
_D1V_ _DFT_
Tn»i Label Sum orFre... Sum otCas.. Mtsclassific... 0. Maximum A... 0 Sum ofSqu.. 26 0. Average 8fl~ ^cotAwra... 0. Divisor for A. Total Degre...
•-ini*l Statisnc
at l a
Ttain 75.;% 24.B*
liu: T Age
J.
Validation 75.=% =4.8%
mil
•sec: Dace: Tlae: * Training Oucpuc •
Scudtnt July 0 6 , 200 13:35:37
3-128
Chapter 3 Introduction to Predictive Modeling: Decision Trees T h e easiest w a y to determine the number of leaves in your tree is via the iteration plot. b) Select V i e w •=> M o d e l •=> Iteration Plot from the Result w i n d o w m e n u . T h e Iteration Plot w i n d o w opens. Iteration Plot ;
x?\£
Average Squared Error
HI]
•
0.13 -
0.16 -
0.14I
i
10
20
30
Nil mh er of Leaves Train: Average Squared Error —Valid:Average Squared Error [
u
:
,
I
Using average square error as the assessment measure results in a tree with 29 leaves. 2) Which variable w a s used for the first split? W h a t were the competing splits for this first split? These questions are best answered using interactive training. a) Close the Results w i n d o w for the Decision Tree model. I
Train Variables Interactive
b) Select the Interactive ellipsis (...) from the Decision Tree node's Properties panel.
3.7 Solutions T h e S A S Enterprise Miner Tree w i n d o w opens.
I Ffe Edt Model Trah Vtow Options Window Hefc
! Statistic
Training
I
0;. l! I H in noda;
Statistic
Training
1: II In nodi: Aflluanca
« I t u i n l c
0: 1:
Truuuig
»«» 4C1
Training
481
a:
1 14.S
711
2)4 111 It
Sit
399 C n d *
1 >• 14.E Statistic Training 0: 111 1: 331
Statistic 0: 1:
Training »7» 114
jQ. Tree EMWS: ForHeti, preisFI
train Interactively No Piers
3-129
3-130
Chapter 3 Introduction to Predictive Modeling: Decision Trees c) Right-click the root node and select Split Node... from the Option m e n u . T h e Split N o d e I w i n d o w opens with information that answers the two questions. I
Split Node 1
01
Target Variable: TargetBuy Variable DemAffl DemGender PromSpend PromClass
Variable Description Affluence Grade Gender Total Spend .oyalty Status
Branches
-Log(p) 200.19 133.14 32.97 23.33
Apply 2 2 2 2
Edit Rule... OK Cancel
Age is used for the first split. C o m p e t i n g splits are A f f l u e n c e Grade, Gender, T o t a l Spend, a n d L o y a l t y Status.
3.7 Solutions A d d a second Decision Tree node to the diagram and connect it to the Data Partition node.
Decision Tree
: : : ORGANICS
Ll_
0
Decision Tree
3
l) In the Properties panel of the n e w Decision Tree node, change the m a x i m u m n u m b e r of branches from a node to 3 to allow for three-way splits. Train
Variables Interactive !â&#x20AC;˘ Criterion [-Significance Level rMissing Values rUse Input Once j-Maximum Branch |-Maximum Depth linimum Categorical
Default 0.2 Use in search No 3| |6 S5
2) Create a decision tree model again using average square error as the model assessment statistic. 3) H o w m a n y leaves are in the optimal tree. Based on the discussion above, the optimal tree with three-way splits has 33 leaves.
3-131
3-132 h.
Chapter 3 Introduction to Predictive Modeling: Decision Trees Based on average square error, which of the decision tree models appears to be better? 1) Select the first Decision Tree node. 2) Right-click and select Results... from the Option m e n u . The Results w i n d o w opens. 3) Examine the Average Squared Error row of the Fit Statistics window. Target TargetBuy
Fit Statistics Statistics Label ASE Average Squared Error
Train Validation 0.132861 0.132773
4) Close the Results window. 5) Repeat the process for the Decision Tree (2) model. Target TargetBuy
Fit Statistics Statistics A Label _ASE_
Average Sq...
Train 0.133009
Validation
Test
0.132662
It appears that the second tree with three-way splits has the lower validation average square error a n d is the better m o d e l .
3.7 Solutions
Solutions to Student Activities (Polls/Quizzes)
3.01 Quiz - Correct Answer Match the Predictive Modeling Application to the Decision Type. Modeling Application C j Loss Reserving
A.
Decision
B Risk Profiling
B.
Ranking
B Credit Scoring
C.
Estimate
A Fraud Detection C \ Revenue Forecasting A Voice Recognition
26
Decision type
3-133
3-134
Chapter 3 Introduction to Predictive Modeling: Decision Trees
Chapter 4 Introduction to Predictive Modeling: Regressions 4.1
4.2
Introduction
Demonstration: Running the Regression Node
4-25
Selecting Regression Inputs
Optimizing Regression Complexity
Interpreting Regression Models Demonstration: Interpreting a Regression Model
4.5
Transforming Inputs Demonstration: Transforming Inputs
4.6
Categorical Inputs Demonstration: Receding Categorical Inputs
4.7
3
4-18
Demonstration: Optimizing Complexity 4.4
-
Demonstration: Managing Missing Values
Demonstration: Selecting Inputs 4.3
4
P o l y n o m i a l R e g r e s s i o n s (Self-Study)
4-29 4-33 4-38 4-40 4-49 4-51 4-52 4-56 4-66 4-69 4-76
Demonstration: Adding Polynomial Regression Terms Selectively
4-78
Demonstration: Adding Polynomial Regression Terms Autonomously (Self-Study)
4-85
Exercises
4-88
4.8
Chapter Summary
4-89
4.9
Solutions
4-91
Solutions to Exercises Solutions to Student Activities (Polls/Quizzes)
4-91 4-102
4-2
Chapter 4 Introduction to Predictive Modeling: Regressions
4.1 Introduction
4.1
4-3
Introduction Model Essentials - Regressions Predict new cases.
Prediction formula
Select useful inputs.
Sequential selection
Optimize complexity.
Best model from sequence
3 Regressions offer a different approach to prediction compared to decision trees. Regressions, as parametric models, assume a specific association structure between inputs and target. By contrast, trees, as predictive algorithms, do not assume any association structure; they simply seek to isolate concentrations o f cases with like-valued target measurements. The regression approach to the model essentials in SAS Enterprise Miner is outlined over the following pages. Cases are scored using a simple mathematical prediction formula. One o f several heuristic sequential selection techniques is used to pick from a collection o f possible inputs and creates a series o f models with increasing complexity. Fit statistics calculated from validation data select the best model from the sequence.
4-4
Chapter 4 Introduction to Predictive Modeling: Regressions
Model Essentials - Regressions i > Predict new cases.
I
p
^<*f*
formula
5 Regressions predict cases using a mathematical equation involving values o f the input variables.
4.1 Introduction
4-5
Linear Regression Prediction Formula input measurement A
y
A
A
= w
0
intercept estimate
+ w,-
A
X)
+ w x 2
2
prediction estimate
parameter estimate
Choose intercept and parameter estimates to minimize: squat ed error function
liy.-y,)
1
framing data
6
In standard linear regression, a prediction estimate for the target variable is formed from a simple linear combination o f the inputs. The intercept centers the range o f predictions, and the remaining parameter estimates determine the trend strength (or slope) between each input and the target. The simple structure o f the model forces changes in predicted values to occur in only a single direction (a vector in the space o f inputs with elements equal to the parameter estimates). Intercept and parameter estimates are chosen to minimize the squared error between the predicted and observed target values (least squares estimation). The prediction estimates can be viewed as a linear approximation to the expected (average) value o f a target conditioned on observed input values. Linear regressions are usually deployed for targets with an interval measurement scale.
4-6
Chapter 4 Introduction to Predictive Modeling: Regressions
Logistic Regression Prediction Formula A
l
o
g
P W + Wj X, + vv x (i3j) = Q
2
2
/0
*
ft
scores
Logistic regressions are closely related to linear regressions. In logistic regression, the expected value o f the target is transformed by a link function to restrict its value to the unit interval. In this way, model predictions can be viewed as primary outcome probabilities. A linear combination o f the inputs generates a logit score, the log o f the odds o f primary outcome, in contrast to the linear regression's direct prediction o f the target. I f your interest is ranking predictions, linear and logistic regressions yield virtually identical results.
4.1 Introduction
4-7
Logit Link Function A
.
/
H P
\
A
log ^ - — A - j = w
p
A 0
A
+ Wi x
1
+ w
x
2
2
'"a"
/og/f link function
The logit link function transforms probabilities (between 0 and 1) to logit scores (between -°° and +»).
For binary prediction, any monotonic function that maps the unit interval to the real number line can be considered as a link. The logit link function is one o f the most common. Its popularity is due, in part, to the interpretability o f the model.
Logit Link Function log y
=
+
+ w * 2
= logit(p)
1
A _
^
2
1
+ e
-loglt(p)
To obtain prediction estimates, the logit equation is solved for ft.
The predictions can be decisions, rankings, or estimates. The logit equation produces a ranking or logit score. To get a decision, you need a threshold. The easiest way to get a meaningful threshold is to convert the prediction ranking to a prediction estimate. You can obtain a prediction estimate using a straightforward transformation o f the logit score, the logistic function. The logistic function is simply the inverse o f the logit function. You can obtain the logistic function by solving the logit equation for p.
4-8
Chapter 4 Introduction to Predictive Modeling: Regressions
14
To demonstrate the properties o f a logistic regression model, consider the t w o - c o l o r prediction p r o b l e m introduced in Chapter 3. A s before, the goal is to predict the target color, based on the location in the unit square. To make use o f the prediction f o r m u l a t i o n , y o u need estimates o f the intercept and other model parameters.
Simple Prediction Illustration - Regressions logit(p) = w + w ' X 1
0
A
p=
1
+ w
2
1 1 + -logiKp)
â&#x20AC;&#x201D;
e
Find parameter estimates
by maximizing I
logtfj)+1109(1-^
MmlnjciiH
UUICOB*'
f lining c u t J
log-likelihood function
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
16 The presence o f the logit l i n k function complicates parameter estimation. Least squares estimation is abandoned in favor o f m a x i m u m l i k e l i h o o d estimation. The l i k e l i h o o d function is the j o i n t probability density o f the data treated as a function o f the parameters. The m a x i m u m l i k e l i h o o d estimates are the values o f the parameters that m a x i m i z e the probability o f obtaining the t r a i n i n g sample.
4.1 Introduction
4-9
Simple Prediction Illustration - Regressions
logit( p ) = 0.81+ 0.92X, +1.11 x
o.s
2
1 P
^^^^^^^
~ 1 + i^'(p> e
x 0.5 2
Using the maximum likelihood estimates, the prediction formula assigns a logit score to each x. and x . 2
0.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0,7 0.8 0.9 1.0
18
Parameter estimates are obtained by m a x i m u m l i k e l i h o o d estimation. These estimates can be used in the logit and logistic equations to obtain predictions. The plot on the right shows the p r e d i c t i o n estimates f r o m the logistic equation. One o f the attractions o f a standard logistic regression model is the s i m p l i c i t y o f its predictions. The contours are simple straight lines. (In higher dimensions, they w o u l d be hyperplanes.)
4-10
Chapter 4 Introduction to Predictive Modeling: Regressions
4.01 Multiple Choice Poll What is the logistic regression prediction for the indicated point? a. -0.243 b. 0.56 c. yellow d. It depends . . .
s
^
s
^
S
^ ^ \ v i v \ ' V ^ N
logit(p) =-0.81+0.92x + 1.11x 1
2
20
To score a new case, the values o f the inputs are plugged into the logit or logistic equation. T h i s action creates a logit score or prediction estimate. Typically, i f the prediction estimate is greater than 0.5 (or equivalently, the logit score is positive), cases are usually classified to the p r i m a r y outcome. ( T h i s assumes an equal misclassification cost.) The answer to the question posed is, o f course, it depends. • A n s w e r A , the logit score, is reasonable i f the goal is ranking. • A n s w e r B, the prediction estimate f r o m the logistic equation, is appropriate i f the goal is estimation. •
A n s w e r C, a classification, is a g o o d choice i f the goal is deciding dot color.
4.1 Introduction
4-11
Regressions: Beyond the Prediction Formula Manage missing values. Interpret the model. Handle extreme or unusual values. Use nonnumeric inputs. Account for nonlinearities.
While the prediction formula would seem to be the final word in scoring a new case with a regression model, there are actually several additional issues that must be addressed. • What should be done when one o f the input values used in the prediction formula is missing? You might be tempted to simply treat the missing value as zero and skip the term involving the missing value. While this approach can generate a prediction, this prediction is usually biased beyond reason. • How do you interpret the logistic regression model? Certain inputs influence the prediction more than others. A means to quantify input importance is needed. • How do you score cases with unusual values? Regression models make their best predictions for cases near the centers o f the input distributions. I f an input can have (on rare occasion) extreme or outlying values, the regression should respond appropriately. • What value should be used in the prediction formula when the input is not a number? Categorical inputs are common in predictive modeling. They did not present a problem for the rule-based predictions o f decision trees, but regression predictions come from algebraic formulas that require numeric inputs. (You cannot multiply marital status by a number.) A method to include nonnumeric data in regression is needed. • What happens when the relationship between the inputs and the target (or rather logit o f the target) is not a straight line? It is preferable to be able to build regression models in the presence o f nonlinear (and even nonadditive) input target associations.
4-12
Chapter 4 Introduction to Predictive Modeling: Regressions
Regressions: Beyond the Prediction Formula Manage missing values.
The above issues affect both model construction and model deployment. The first o f these, h a n d l i n g missing values, is dealt w i t h immediately. The r e m a i n i n g issues are addressed, in t u r n , at the end o f this chapter.
Missing Values and Regression Modeling
mm ^^HiCZZZj • • • mm h mm
rzii
C~l
•
•
•
MM • • • . " ; — am mm
m
•
•
•
Problem 1: Training data cases with missing values on inputs used by a regression mode! are ignored.
M i s s i n g values present t w o distinct problems. The first relates to model construction. The default method for treating m i s s i n g values in most regression tools in SAS Enterprise M i n e r is complete-case analysis. In complete-case analysis, only those cases w i t h o u t any missing values are used in the analysis.
4.1 Introduction
4-13
Missing Values and Regression Modeling Training Data beto
i
KBiia j tmm t'-'Wi ta'^a f r y * H O t^aa
Consequence: Missing values can significantly reduce your amount of training data for regression modeling!
26
Even a smattering o f missing values can cause an enormous loss o f data in high dimensions. For instance, suppose that each o f the k input variables is missing at random with probability a . In this situation, the expected proportion o f complete cases is as follows: (t-af Therefore, a 1 % probability o f missing ( a = 0 1 ) for 100 inputs leaves only 3 7 % o f the data for analysis, 200 leaves 13%, and 400 leaves 2%. I f the "missingness" were increased to 5% (cc=.05), then < 1 % o f the data would be available with 100 inputs.
4-14
Chapter 4 Introduction to Predictive Modeling: Regressions
Missing Values and the Prediction Formula logit(p) = -0.81 + 0.92 â&#x20AC;˘ x +1.11- x 1
2
Predict: (x1, x2) = (0.3, ? ) Problem 2: Prediction formulas cannot score cases with missing values.
27
Missing Values and the Prediction Formula logit(p) = ?
Problem 2: Prediction formulas cannot score cases with missing values.
30
The second missing value problem relates to model deployment or using the prediction formula. How would a model built on the complete cases score a new case i f it had a missing value? To decline to score new incomplete cases would be practical only i f there were a very small number o f missing values.
4.1 Introduction
Missing Value Issues Manage missing values. Problem 1: Training data cases with missing values on inputs used by a regression model are ignored. Problem 2: Prediction formulas cannot score cases with missing values.
31
A remedy is needed for the two problems o f missing values. The appropriate remedy depends on the reason for the missing values.
4-15
4-16
Chapter 4 Introduction to Predictive Modeling: Regressions
Missing Value Causes > Manage missing values. |CZD| Non-applicable measurement czj\
No match on merge Non-disclosed measurement
33
Missing values arise for a variety o f reasons. For example, the time since last donation to a card campaign is meaningless i f someone did not donate to a card campaign. In the PVA97NK data set, several demographic inputs have missing values in unison. The probable cause was no address match for the donor. Finally, certain information, such as an individual's total wealth, is closely guarded and is often not disclosed.
Missing Value Remedies Manage missing values. Non-applicable measurement No match on merge Non-disclosed measurement
34
Synthetic distribution
Estimation
4.1 Introduction
4-17
The primary remedy for missing values in regression models is a missing value replacement strategy. Missing value replacement strategies fall into one o f two categories. Synthetic distribution methods
Estimation methods
use a one-size-fits-all approach to handle missing values. A n y case with a missing input measurement has the missing value replaced w i t h a fixed number. The net effect is to modify an input's distribution to include a point mass at the selected fixed number. The location o f the point mass i n synthetic distribution methods is not arbitrary. Ideally, i t should be chosen to have minimal impact on the magnitude o f an input's association w i t h the target. M t h many modeling methods, this can be achieved by locating the point mass at the input's mean value. eschew the one-size-fits-all approach and provide tailored imputations for each case w i t h missing values. This is done by viewing the missing value problem as a prediction problem. That is, you can train a model to predict an input's value from other inputs. Then, when an input's value is unknown, you can use this model to predict or estimate the unknown missing value. This approach is best suited for missing values that result from a lack o f knowledge, that is, no-match or nondisclosure, but i t is not appropriate for not-applicable missing values.
Because predicted response might be different for cases with a missing input value, a binary imputation indicator variable is often added to the training data. Adding this variable enables a model to adjust its predictions in the situation where "missingness" itself is correlated with the target.
4-18
Chapter 4 Introduction to Predictive Modeling: Regressions
Managing Missing Values
The demonstrations in this chapter b u i l d on the demonstrations o f Chapters 2 and 3. A t this point, the process f l o w diagram has the f o l l o w i n g structure:
Data A s s e s s m e n t Continue the analysis at the Data Partition node. A s discussed above, regression requires that a case have a complete set o f input values for both t r a i n i n g and scoring. F o l l o w these steps to examine the data status after the p a r t i t i o n . 1.
Select the Data Partition node.
2.
Select Exported Data •=> ^ Property
Value
General Node ID Imported Data Exported Data Notes
Part
1
Train
Variables Data Output Type Partitioning Method Default Random Seed 1 234 5
j •BISsB IU 1 BETOfflU [-Training [•Validation '-Test
-50.0 50.0 0.0
f r o m the Data Partition node property sheet.
_ |jgj
•
4.1 Introduction
4-19
T h e E x p o r t e d Data w i n d o w opens. 3.
Select the T R A I N data p o r t and select E x p l o r e . . . .
mmm TRAIN VALIDATE TEST
Table EMWS.Part_TRA!N EMWS.Part VALIDATE EMWS.Part TEST
Port
|
Role
J ~
Train IValidate 'Test
Browse..
Data Exists
Yes Yes |No
Properties..
Explore..,
OK
There are several inputs w i t h a noticeable frequency o f m i s s i n g values, f o r e x a m p l e . A g e and the replaced value o f m e d i a n i n c o m e .
;
\
^^•llot^EM'Vl^ij.UMIiMlthMMMiifltti^B File
1
View
Actions
M B .
m ' >
Window
Sample Properties
i4843
Property Rows ; Columns Library Member Type j Sample Method Fetch Size Fetched Rows Random Seed
Value
bo EMWS PART TRAIN DATA iRandom Max 4843 12345
Apply
r/cf
o j EMWS.Part TRAIN Status Cate.J Demograp...| 100 035 035 115 136 016 017 024 035 139 045
Plot.
Gender
Age ,M 53M 58 M .U .F .F 39 M 50F M M .F
m
ll-iome Owner Median Ho...j Percent Vet.J Median Inc .lRepiacerneJ 1
Ij-,
=
' '
IpO
$139,200 $168,100 $234,700 $143,900 $134,400 $68,200 $115,100 $0 $207,900 $104,400 $70,000
V
27 37 22 30 41 47 30 33 48 35 37
TO
$38,942 $71,509 $72,868 $0 $0 $0 $63,000 $0 $0 $54,385 $0
38942 71609 72868
•
63000
54385
—
4-20
Chapter 4 Introduction to Predictive Modeling: Regressions
There are several ways to proceed: â&#x20AC;˘ Do nothing. I f there are very few cases with missing values, this is a viable option. The difficulty with this approach comes when the model must predict a new case that contains a missing value. Omitting the missing term from the parametric equation usually produces an extremely biased prediction. â&#x20AC;˘ Impute a synthetic value for the missing value. For example, i f an interval input contains a missing value, replace the missing value with the mean o f the nonmissing values for the input. This eliminates the incomplete case problem but modifies the input's distribution. This can bias the model predictions. Making the missing value imputation process part o f the modeling process allays the modified distribution concern. Any modifications made to the training data are also made to the validation data and the remainder o f the modeling population. A model trained with the modified training data w i l l not be biased i f the same modifications are made to any other data set that the model might encounter (and the data has a similar pattern o f missing values). â&#x20AC;˘ Create a missing indicator for each input in the data set. Cases often contain missing values for a reason. I f the reason for the missing value is in some way related to the target variable, useful predictive information is lost. The missing indicator is 1 when the corresponding input is missing and 0 otherwise. Each missing indicator becomes an input to the model. This enables modeling o f the association between the target and a missing value on an input. 4.
Close the Explore and Exported Data windows.
4.1 Introduction
4-21
Imputation To address missing values in the PVA97NK data set, use the following steps to impute synthetic data values and create missing value indicators: 1.
Select the Modify tab.
2.
Drag an Impute tool into the diagram workspace.
3.
Connect the Data Partition node to the Impute node. In the display, below, the Decision Tree modeling nodes are repositioned for clarity. : Predictive Analysis
JLl
*1 J
j
J
i n r "
<*>
n
q
i ~ j -
Š
| i h
p
i
ii
r
Chapter 4 Introduction to Predictive Modeling: Regressions
4-22
4.
Select the I m p u t e node and examine the Properties panel. Value
Property
General Node ID Imported Data Exported Data Notes
Impt
•••
Train Variables Non Missinq Variables Missing Cutoff = ] c i a s s Variables r Default Input Method [• Default Target Method -Normalize Values
Count None Yes
[- Default Input Method ' [Default Target Method
Mean None
:
... ...
No 50.0
[-Default Character Value '-[Default Number Value ^ M e t h o d Options 12345 •-Random Seed '•-Tuning Parameters - Tree Imputation l
...
... •••
Score [Hide Original Variables Yes [=]lndicatorVariables Unique [•Type Imputed Variables }-Source
.
The defaults o f the Impute node are as f o l l o w s : •
For interval inputs, replace any missing values w i t h the mean o f the nonmissing values.
•
For categorical inputs, replace any missing values w i t h the most frequent category.
tf
These are acceptable default values and are used throughout the rest o f the course.
W i t h these settings, each input w i t h missing values generates a new input. The new input named \\A?_originaljnpui_name
w i l l have missing values replaced by a synthetic value and n o n m i s s i n g
values copied f r o m the original input.
4.1 Introduction
4-23
Missing Indicators Use the f o l l o w i n g steps to create m i s s i n g value indicators. The settings for m i s s i n g value indicators are found in the Score property group.
.Score [Hide Original Variables rVes r-Type [-{Source ••Role
I
Unique Imputed Variables Input
1.
Select I n d i c a t o r V a r i a b l e s •=> T y p e
Unique.
2.
Select I n d i c a t o r V a r i a b l e s O R o l e
Input.
W i t h these settings, n e w inputs named M_original_input_name indicate the synthetic data values.
w i l l be added to the t r a i n i n g data to
4-24
Chapter 4 Introduction to Predictive Modeling: Regressions
Imputation Results Run the Impute node and review the Results w i n d o w . Three inputs had missing values. â&#x20AC;˘r Results - Mode: Impute Diagram: Predictive Analysis File
Edit
View
Window
Imputation Summary Variable Name
Impute Method
DemAge MEAN GiftAvgCard...MEAN REP_Dem... MEAN
r/Rf Imputed Variable
Indicator Variable
IMP_DemA... M_DemAge IMP_GifiAvg... M_GiftAvgC... IMP_REP_... M_REP_De...
Impute Value Role
59.26291 INPUT 14.2049INPUT 53570.85INPUT
Measureme nt Level INTERVAL INTERVAL INTERVAL
[EI
Label
Number of Missing for TRAIN Age 1203 Gift Amount... 910 Repiaceme... 1191
H i Output
*
*
User: Date: Time:
*
n
* Training
Output
s
Variable
*
Summary
W i t h all o f the m i s s i n g values i m p u t e d , the entire training data set is available for b u i l d i n g the logistic regression m o d e l . In addition, a method is in place for scoring new cases w i t h m i s s i n g values. (See Chapter 7.)
4.1 Introduction
4-25
Running the Regression Node
There are several tools i n SAS Enterprise Miner to fit regression or regression-like models. By far, the most commonly used (and, arguably, the most useful) is the simply named Regression tool. Use the following steps to build a simple regression model. 1.
Select the Model tab.
2.
Drag a Regression tool into the diagram workspace.
3.
Connect the Impute node to the Regression node. ;
;; P r e d i c t i v e A n a l y s i s ;
XTUmpttte
1^
placement
Regression
Probability T r e e
-a Deciiion T r e e
JJ
J
>m"T~
<9
11
Q
I—J"
® ' r«r»l
S
9
I El
f
The Regression node can create several types o f regression models, including linear and logistic. The type o f default regression type is determined by the target's measurement level.
4-26
4.
Chapter 4 Introduction to Predictive Modeling: Regressions
Run the Regression node and v i e w the results. The Results - N o d e : Regression D i a g r a m w i n d o w opens.
5.
M a x i m i z e the Output w i n d o w by d o u b l e - c l i c k i n g its title bar. The initial lines o f the Output w i n d o w summarize the roles o f variables used (or not) by the Regression node. V a r i a b l e Summary ROLE
LEVEL
COUNT
INPUT
BINARY
INPUT INPUT
INTERVAL NOMINAL
3
REJECTED
INTERVAL
2
TARGET
BINARY
1
The fit model has 28 inputs that predict a binary target.
5 20
4.1 Introduction 4-27 Ignore the output related to model events and predicted and decision variables. The next lines give more information about the model, including the training data set name, target variable name, number o f target categories, and most importantly, the number o f model parameters. Model
Information
T r a i n i n g Data S e t
EMWS2.IMPT_TRAIN.VIEW
DMDB C a t a l o g
WORK.REG_DMDB
Target
TargetB (Target G i f t Flag)
Variable
Ordinal
T a r g e t Measurement L e v e l Number o f T a r g e t
Categories
2
Error
MBernoulli
Link Function Number o f Model P a r a m e t e r s
Logit 86
Number o f O b s e r v a t i o n s
4843
Based on the introductory material about logistic regression that is presented above, you might expect to have a number o f model parameters equal to the number o f input variables. This ignores the fact that a single nominal input (for example, D e m C l u s t e r ) can generate scores o f model parameters. Next, consider maximum likelihood procedure, overall model fit, and the Type 3 Analysis o f Effects. The Type 3 Analysis tests the statistical significance o f adding the indicated input to a model that already contains other listed inputs. A value near 0 in the Pr > ChiSq column approximately indicates a significant input; a value near 1 indicates an extraneous input. Type 3 A n a l y s i s o f E f f e c t s Wald Effect
DF
Chi-Square
Pr > ChiSq
DemCluster DemGender
53 2 1
58.9098
0.2682
0.5088 0.1630 2.4464
0.7754 0.6864 0.1178
DemHomeOwner DemMedHomeValue DemPctVeterans
1 1
5.2502
0.0219
GiftAvg36
1
1.6709
0.1961
GiftAvgAll
1
0.0339
0.8540
GiftAvgLast
1 1
0.0026
0.9593
1.2230
0.2688
GiftCntAll GiftCntCard36
1
0.1308
0.7176
1
1.0244
0.3115
GiftCntCardAll GiftTimeFirst
1
0.0061
1
1.6064
0.9380 0.2050
GiftTimeLast IMP_DemAge IMP_GiftAvgCard36
1
21.5351 0.0701
0.7911
GiftCnt36
1
<.0001
1 1
0.0476
0.8273
IMP_REP_DemMedIncome
0.1408
0.7074
M_DemAge
1
3.0616
0.0802
M_GiftAvgCard36
1 1
0.9190
0.3377
0.6228
0.4300
PromCntCard12
1 1 1 1
3.2335 1.0866 1.9715 0.0294
PromCntCard36
1
PromCntCardAll StatusCat96NK
1
0.0049 2.9149
0.0721 0.2972 0.1603 0.8639 0.9441
5
11.3434
0.0878 0.0450
StatusCatStarAll
1
1.7487
0.1860
M_REP_DemMedIncome PromCnt12 PromCnt36 PromCntAll
4-28
Chapter 4 Introduction to Predictive Modeling: Regressions The statistical significance measures a range from <0.0001 (highly significant) to 0.9593 (highly dubious). Results such as this suggest that certain inputs can be dropped without affecting the predictive prowess o f the model. Restore the Output window to its original size by double-clicking its title bar. Maximize the Fit Statistics window.
Target TargetB TargetB argetB argetB argetB argetB argetB argetB argetB argetB argetB argetB argetB argetB argetB | TargetB TargetB TargetB TargetB TargetB
Fit Statistics Statistics Train Validation Test Label _AIC_ Akaike's Inf... 6633.2 Average Sq... 0.237268 _ASE_ 0.24381 0.667066 _AVERRâ&#x20AC;&#x17E; Average Err... 0.680861 Degrees of.. 4757 _DFE_ Model Degr... 86 _DFM_ Total Degre.. 4843 _DFT_ Divisor for A.. 9686 _DIV_ 9686 Error FunctL. 6461.2 _ERR_ 6594.821 Final Predic. 0.245847 _FPE_ Maximum A... 0.941246 0.841531 _MAX_ Mean Squa... 0.241558 0.24381 _MSE_ Sum ofFre... 4843 4843 _NOBS_ Number of... 86 RootAvera... _RASE_ 0.487102 0.493771 Root Final... _RFPE_ 0.49583 Root Mean... _RMSE_ 0.491485 0.493771 Schwa rz's... _SBC_ 7190.935 Sum of Squ.. _SSE_ 2298.179 2361.545 SumofCas.. _SUMW_ 9686 9686 Misclassific. MISC 0.411522 0.431964
I f the decision predictions are o f interest, model fit can be judged by misclassification. I f estimate predictions are the focus, model fit can be assessed by average squared error. There appears to be some discrepancy between the values of these two statistics in the train and validation data. This indicates a possible overfit o f the model. It can be mitigated by using an input selection procedure. 7.
Close the Results window.
4.2 Selecting Regression Inputs
4-29
4.2 Selecting Regression Inputs Model Essentials - Regressions
Select useful inputs.
Sequential selection
38
The second task that all predictive models should perform is input selection. One way to find the optimal set o f inputs for a regression is simply to try every combination. Unfortunately, the number o f models to consider using this approach increases exponentially in the number o f available inputs. Such an exhaustive search is impractical for realistic prediction problems. A n alternative to the exhaustive search is to restrict the search to a sequence o f improving models. While this might not find the single best model, it is commonly used to find models with good predictive performance. The Regression node in SAS Enterprise Miner provides three sequential selection methods.
4-30
Chapter 4 Introduction to Predictive Modeling: Regressions
Sequential Selection - Forward Input p-value
Entry Cutoff
• • • • • • • 1 1 • • • • • • • •
43
Forward selection creates a sequence o f models of increasing complexity. The sequence starts with the baseline model, a model predicting the overall average target value for all cases. The algorithm searches the set o f one-input models and selects the model that most improves upon the baseline model. It then searches the set o f two-input models that contain the input selected in the previous step and selects the model showing the most significant improvement. By adding a new input to those selected in the previous step, a nested sequence o f increasingly complex models is generated. The sequence terminates when no significant improvement can be made. Improvement is quantified by the usual statistical measure o f significance, the p-value. Adding terms in this nested fashion always increases a model's overall fit statistic. By calculating the change in the fit statistic and assuming that the change conforms to a chi-squared distribution, a significance probability, or p-value, can be calculated. A large fit statistic change (corresponding to a large chi-squared value) is unlikely due to chance. Therefore, a small p-value indicates a significant improvement. When no p-value is below a predetermined entry cutoff, the forward selection procedure terminates.
4.2 Selecting Regression Inputs
4-31
Sequential Selection - Backward
• •••• •••• •••• • •• • • • • Input p-value
•
•
m
Stay Cutoff
•
•
mm
|
—
i
• • •
51
In contrast to f o r w a r d selection, backward selection creates a sequence o f models o f d e c r e a s i n g c o m p l e x i t y . The sequence starts w i t h a saturated model, w h i c h is a model that contains all available inputs, and therefore, has the highest possible fit statistic. Inputs are sequentially r e m o v e d f r o m the m o d e l . A t each step, the input chosen for removal least reduces the overall m o d e l fit statistic. This is equivalent to r e m o v i n g the input w i t h the highest p-value. The sequence terminates w h e n all r e m a i n i n g inputs have a p-value that is less than the predetermined stay cutoff.
4-32
Chapter 4 Introduction to Predictive Modeling: Regressions
Sequential Selection - Stepwise Input p-value
â&#x20AC;˘
â&#x20AC;˘
Entry Cutoff Stay Cutoff
58 Stepwise selection combines elements f r o m both the forward and backward selection procedures. The method begins in the same way as the f o r w a r d procedure, sequentially adding inputs w i t h the smallest p-value b e l o w the entry cutoff. A f t e r each input is added, however, the algorithm reevaluates the statistical significance o f all included inputs. I f the p-value o f any o f the included inputs exceeds the stay cutoff, the input is removed f r o m the model and reentered into the pool o f inputs that are available for inclusion in a subsequent step. The process terminates when all inputs available for inclusion in the model have p-values in excess o f the entry c u t o f f and all inputs already included in the model h a v e p - v a l u e s b e l o w the stay cutoff.
4.2 Selecting Regression Inputs
Selecting Inputs
I m p l e m e n t i n g a sequential selection method in the Regression node requires a m i n o r change to the Regression node settings. 1.
Select Selection M o d e l •=> Stepwise on the Regression node property sheet.
Train Variables
m
B i M a i n Effects Yes r Two-Factor InteracticNo ^Polynomial Terms No r-Polynomial Degree 2 No f-UserTerms --Term Editor '•-Regression Type - Link Function
~U
Loqistic Regression Logit
E3 Suppress Intercept Input Coding Deviation B lection -Selection Model ;Step_wise -Selection Criterion Default -Use Selection DefaiYes -Selection Options
!...
The Regression node is n o w c o n f i g u r e d to use stepwise selection to choose inputs for the m o d e l . 2.
Run the Regression node and v i e w the results.
3.
M a x i m i z e the O u t p u t w i n d o w .
4.
H o l d d o w n the C T R L key and type G. The Go To Line w i n d o w opens.
OK
Enter line number:
. 5.
79
Type 7 9 in the E n t e r
— line
Cancel
n u m b e r field and select O K .
4-33
4-34
Chapter 4 Introduction to Predictive Modeling: Regressions
The stepwise procedure starts with Step 0, an intercept-only regression model. The value o f the intercept parameter is chosen so that the model predicts the overall target mean for every case. The parameter estimate and the training data target measurements are combined in an objective function. The objective function is determined by the model form and the error distribution o f the target. The value o f the objective function for the intercept-only model is compared to the values obtained in subsequent steps for more complex models. A large decrease in the objective function for the more complex model indicates a significantly better model. S t e p 0: I n t e r c e p t e n t e r e d . The DMREG P r o c e d u r e Newton-Raphson Ridge
Optimization
Without P a r a m e t e r S c a l i n g Parameter
Estimates
Optimization Active Constraints Max Abs G r a d i e n t
0
Element
1
Start
Objective Function
3356.9116922
5.707879E-12 Optimization
Results
Iterations
0
Function C a l l s
3
Hessian
1
Active Constraints
0
Calls
Objective Function
3356.9116922
Ridge
0
Convergence c r i t e r i o n
(ABSGC0NV=0.00001)
Max Abs G r a d i e n t
5.707879E-12 0
satisfied.
L i k e l i h o o d R a t i o T e s t f o r G l o b a l N u l l Hypothesis: -2 Log L i k e l i h o o d
Element
A c t u a l Over Pred Change
BETA=0
Likelihood
Intercept
Intercept &
Ratio
Only
Covariates
Chi-Square
DF
6713.823
6713.823
0.0000
0
Pr > ChiSq
A n a l y s i s o f Maximum L i k e l i h o o d
Estimates
Parameter
DF
Estimate
Standard Error
Wald Chi-Square
Pr > ChiSq
Intercept
1
-0.00041
0.0287
0.00
0.9885
Standardized Estimate
Exp(Est) 1 .000
4.2 Selecting Regression Inputs
4-35
Step 1 adds one input to the intercept-only model. The input and corresponding parameter are chosen to produce the largest decrease in the objective function. To estimate the values o f the model parameters, the modeling algorithm makes an initial guess for their values. The initial guess is combined with the training data measurements in the objective function. Based on statistical theory, the objective function is assumed to take its minimum value at the correct estimate for the parameters. The algorithm decides whether changing the values o f the initial parameter estimates can decrease the value o f the objective function. I f so, the parameter estimates are changed to decrease the value o f the objective function and the process iterates. The algorithm continues iterating until changes in the parameter estimates fail to substantially decrease the value of the objective function. S t e p 1: E f f e c t G i f t C n t 3 6 e n t e r e d . The
DMREG P r o c e d u r e
Newton-Raphson Ridge
Optimization
Without P a r a m e t e r S c a l i n g Parameter E s t i m a t e s Optimization Active Constraints Max
Abs
Gradient
0
Element
2
Start
Objective Function
3356.9116922
89.678463762 Ratio Between Actual Abs
and
Function
Gradient
Predicted
Objective Function
Active
Objective
Constraints
Max
Iter
Restarts
Calls
Function
Change
Element
Ridge
1
0
2
0
3316
41.4036
2.1746
0
Change 1.014
2
0
3
0
3315
0.0345
0.00690
0
1.002
3
0
4
0
3315
2.278E-7
4.833E-8
0
1.000
Optimization Results Iterations
3
Function
Hessian
4
Active Constraints
Calls
Objective
Function
3315.473573
Ridge
0
Convergence c r i t e r i o n
(GC0NV=1E-6)
Max
Abs
Calls Gradient
6 0 4.833086E-8
Element
A c t u a l Over P r e d Change
0.999858035
satisfied.
The output next compares the model fit in Step 1 with the model fit in Step 0. The objective functions o f both models are multiplied by 2 and differenced. The difference is assumed to have a chi-squared distribution with one degree o f freedom. The hypothesis that the two models are identical is tested. A large value for the chi-squared statistic makes this hypothesis unlikely. The hypothesis test is summarized in the next lines. Likelihood Ratio Test f o r Global Null Hypothesis: -2 Log L i k e l i h o o d Intercept Intercept & Only Covariates 6713.823
6630.947
Likelihood Ratio Chi-Square
DF
Pr > C h i S q
82.8762
1
<.0001
BETA=0
4-36
Chapter 4 Introduction to Predictive Modeling: Regressions
The output summarizes an analysis o f the statistical significance o f individual model effects. For the one input model, this is similar to the global significance test above. Type 3 A n a l y s i s o f E f f e c t s Wald Effect
DF
Chi-Square
Pr > C h i S q
1
79.4757
<.0001
GiftCnt36
Finally, an analysis o f individual parameter estimates is made. (The standardized estimates and the odds ratios merit special attention and are discussed in the next section o f this chapter.) A n a l y s i s of Maximum Likelihood Estimates Standard
Wald
Error
Chi-Square
Standardized
Parameter
DF
Estimate
Intercept
1
-0.3956
0.0526
56.53
<.0001
GiftCnt36
1
0.1250
0.0140
79.48
<.0001
Pr > ChiSq
Estimate
Exp(Est) 0.673
0.1474
1.133
Odds Ratio Estimates Point Effect
Estimate
GiftCnt36
1.133
The standardized estimates present the effect o f the input on the log-odds o f donation. The values are standardized to be independent o f the input's unit o f measure. This provides a means o f ranking the importance o f inputs in the model. The odds ratio estimates indicate by what factor the odds o f donation increase for each unit change in the associated input. Combined with knowledge o f the range o f the input, this provides an excellent way to judge the practical (as opposed to the statistical) importance o f an input in the model. The stepwise selection process continues for eight steps. After the eighth step, neither adding nor removing inputs from the model significantly changes the model fit statistic. A t this point the Output window provides a summary o f the stepwise procedure. 6.
Go to line 850 to view the stepwise summary. The summary shows the step in which each input was added and the statistical significance o f each input in the final eight-input model. NOTE: No ( a d d i t i o n a l ) e f f e c t s met t h e 0.05 s i g n i f i c a n c e l e v e l f o r e n t r y i n t o t h e model. Summary o f S t e p w i s e Number
Score
Wald
DF
In
Chi-Square
Chi-Square
1 1
1 2 3
81.6807 23.2884 16.9872
<.0001 <.0001 <.0001
Effect Step
Entered
Selection
Pr > C h i S q
1 2 3
GiftCnt36
4
GiftAvgAll
1
StatusCat96NK
5
4 5
14.8514
5
18.2293
0.0001 0.0027
6
DemPctVeterans
1
6
7.4187
0.0065
7
M_GiftAvgCard36
1
7
7.1729
8
M_DemAge
1
8
4.6501
0.0074 0.0311
GiftTimeLast DemMedHomeValue
1
4.2 Selecting Regression Inputs
4-37
The default selection criterion selects the model from Step 8 as the model with optimal complexity. As the next section shows, this might not be the optimal model, based on the fit statistic that is appropriate for your analysis objective. The selected model, based on the CH0OSE=NCrÂŁ criterion, i s the model trained i n Step 8. I t consists of the following effects: Intercept
DenMecHomeValue
DenPctVeterans GiftAvpAll GiftCnt36
GiftTimeLast
M_DemAge M_GiftAvgCard36 StatusCat96rl<
For convenience, the output from Step 8 is repeated. A n excerpt from the analysis o f individual parameter estimates is shown below. A n a l y s i s of Maximum L i k e l i h o o d Estimates Standard Parameter
DF
Estimate
Wald
Standardized
Error
Chi-Square
Pr > ChiSq
Estimate
Exp(Est)
Intercept
1
0.2727
0.2024
1
1.385E-6
3.009E-7
1.82 21.18
0.1779
DemMedHomeValue
<.0001
0.0763
DemPctVeterans
1
0.00658
0.00261
6.38
0.0115
0.0412
1.007
GiftAvgAll
1
-0.0136
0.00444
9.33
0.0023
-0.0608
0.987
GiftCnt36
1
0.0587
0.0187
9.79
0.0018
0.0692
1.060
GiftTimeLast
1
-0.0376
0.00770
23.90
<.0001
-0.0837
0.963
1.314
M_DemAge
0
1
0.0741
0.0344
4.65
0.0311
1.077
M_GiftAvgCard36
0
1
0.1112
0.0411
7.30
StatusCat96NK
A
1
-0.0880
0.0927
0.90
0.0069 0.3423
0.916
StatusCat96NK
E F
1 1
0.4974
StatusCat96NK StatusCat96NK
L
StatusCat96NK
N
0.1818 0.1303
7.48
0.0062
1.644
12.30
0.0005
0.633
1
0.3735
0.15
0.6966
1.157
1
-0.1206
0.1323
0.83
0.3621
0.886
Restore the Output window and maximize the Fit Statistics window. H f Fit Statistics
1.118
-0.4570 0.1456
The parameter with the largest standardized estimate (in absolute value) is G i f t T i m e L a s t . 7.
1.000
WM&
Target
Fit Statistics Statistics Label
TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB argetB argetB argetB argetB TargetB TargetB TargetB
_AIC_ _ASE_ _AVERR_ _DFE_ _DFM_ _DFT_ _DIV_ _ERR_ _FPE_ _MAX_ _MSE_ _NOBS_ _nw_ _RASE_ _RFPE_ _RMSE_ _SBC_ _SSE_ _SUMW_ _MISC
Akaike's Inf... Average Sq... Average Err... Degrees of ... Model Degr... Total Degre... Divisor for A... Error Functi... Final Predic... Maximum A... Mean Squa... Sum of Fre... Number of... RootAvera... Root Final... Root Mean... Schwarz*s... Sum of Squ... Sum of Cas... Misclassific...
Train 6563.093 0.240919 0.674901 4830 13 4843 9686 6537.093 0.242216 0.965413 0.241567 4843 13 0.490835 0.492154 0.491495 6647.402 2333.54 9686 0.421433
Validation
Test
0.242336 0.678517
9686 6572.12 0.998582 0.242336 4843 0.492277 0.492277 2347.27 9686 0.42453
The simpler model improves on both the validation misclassification and average squared error measures o f model performance.
4-38
Chapter 4 Introduction to Predictive Modeling: Regressions
4.3 Optimizing Regression Complexity Model Essentials - Regressions
Optimize complexity.
!
B e s t m o d e
l
! from sequence
61
Regression complexity is optimized by choosing the optimal model in the sequential selection sequence.
Model Fit versus Complexity Model fit statistic
| Evaluate each sequence step.
>xm~m ar.m im irnra tanra tmmm mmm
1
2
3
4
5
6
62
The process involves two steps. First, fit statistics are calculated for the models generated in each step o f the selection process. Both the training and validation data sets are used.
4.3 Optimizing Regression Complexity
4-39
Select Model with Optimal Validation Fit Model fit statistic
1t
Choose simplest optimal model. 3
63 T h e n , as w i t h the decision tree in Chapter 3. the simplest model (that is, the one w i t h the fewest inputs) w i t h the o p t i m a l f i t statistic is selected.
4-40
Chapter 4 Introduction to Predictive Modeling: Regressions
Optimizing Complexity
Iteration Plot The following steps illustrate the use o f the iteration plot in the Regression tool Results window. In the same manner as the decision tree, you can tune a regression model to give optimal performance on the validation data. The basic idea involves calculating a fit statistic for each step in the input selection procedure and selecting the step (and corresponding model) with the optimal fit statistic value. To avoid bias, o f course, the fit statistic should be calculated on the validation data set. 1.
Select View ÂŤ=> Model ÂŤ=> Iteration Plot. The Iteration Plot window opens.
Average Squared Error 0.2500 -
0.2475-
0.2450-
02425-
024002
4
Model Selection Step Number â&#x20AC;˘Train: Average Squared Error -Valid: Average Squared Error
J
The Iteration Plot window shows (by default) average squared error (training and validation) from the model selected in each step o f the stepwise selection process. f
Surprisingly, this plot contradicts the naive assumption that a model fit statistic calculated on training data w i l l always be better than the same statistic calculated on validation data. This concept, called the optimism principle, is correct only on the average, and usually manifests itself only when overly complex (overly flexible) models are considered. It is not uncommon for training and validation fit statistic plots to cross (possibly several times). These crossings illustrate unquantified variability in the fit statistics.
Apparently, the smallest average squared error occurs in Step 4, rather than in the final model, Step 8. I f your analysis objective requires estimates as predictions, the model from Step 4 should provide slightly less biased ones.
4.3 Optimizing Regression Complexity 2.
4-41
Select Select Chart ÂŤ=> Misclassification Rate.
Misclassification Rate 0.500 -
0.475 -
0.450 -
0.425 -
Model Selection Step Number Train: Misclassification Rate -Valid: Misclassification Rate The iteration plot shows that the model with the smallest misclassification rate occurs in Step 3. I f your analysis objective requires decision predictions, the predictions from the Step 3 model are as accurate as the predictions from the final Step 8 model. The selection process stopped at Step 8 to limit the amount o f time spent running the stepwise selection procedure. In Step 8, no more inputs had a chi-squared p-value below 0.05. The value 0.05 is a somewhat arbitrary holdover from the days of statistical tables. With the validation data available to gauge overfitting, it is possible to eliminate this restriction and obtain a richer pool o f models to consider.
4-42
Chapter 4 Introduction to Predictive Modeling: Regressions
Full M o d e l S e l e c t i o n Use the f o l l o w i n g steps to b u i l d and evaluate a larger sequence o f regression models: 1.
Close the Results - Regression w i n d o w .
2.
Select Use Selection Default •=> No f r o m the Regression node Properties panel.
Train Variables
UJ
[Main Effects Yes [-Two-Factor InteracticNo '•Polynomial Terms No [-Polynomial Degree 2 [UserTerms No " T e r m Editor
y
Mciass Targets [-Regression Type -|_ink Function BModel Options '[•Suppress Intercept -Input Coding ;
Logistic Regression Logit No Deviation
[-Selection Model Stepwise [-Selection Criterion Default [-Use Selection DefaiNo ^Selection Options 3.
Select Selection Options "=> ^ J .
wmm i...
4.3 Optimizing Regression Complexity
The Selection Options w i n d o w opens.
J
Property Sequential Order Entry Significance Level Stay Significance Level Start Variable Number Stop Variable Number Force Candidate Effects Hierarchy Effects Moving Effect Rule Maximum Number of Steps
Value No 0.05 0.05 0 0 0 Class None 0
Sequential O r d e r Specifies whether to add or remove variables in the order that is specified in the M O D E L statement.
OK
Cancel
4-43
4-44
Chapter 4 Introduction to Predictive Modeling: Regressions
4.
Type 1 . 0 as the Entry Significance Level value.
5.
T y p e 0 . 5 as the Stay Significance L e v e l value.
I _ Selection Options Property Sequential Order Entry Significance Level Stay Significance Level Start Variable Number Stop Variable Number Force Candidate Effects Hierarchy Effects Moving Effect Rule Maximum Number of Steps
Value No 1.0 0.5 0 0 0 Class None 0
Stay Significance L e v e l Significance level for removing variables in backward or stepwise regression.
OK
Cancel
The Entry Significance value enables any input in the model. (The one chosen w i l l have the smallest p-value.) T h e Stay Significance value keeps any input in the model w i t h a p-value less than 0.5. This second choice is somewhat arbitrary. A smaller value can terminate the stepwise selection process earlier, w h i l e a larger value can maintain it longer. A Stay Significance o f 1.0 forces stepwise to behave in the manner o f a f o r w a r d selection. 6.
Run the Regression node and v i e w the results.
4.3 Optimizing Regression Complexity 7.
Select View
4-45
Model ÂŤ=> Iteration Plot. The Iteration Plot window opens.
Average Squared Error 0.250-
0245-
0.240 -
15
Model Selection Step Number â&#x20AC;˘Train: Average Squared Error Valid: Average Squared Error The iteration plot shows the smallest average squared errors occurring in Steps 4 or 12. There is a significant change in average squared error in Step 13, when the D e m C l u s t e r input is added. Inclusion o f this nonnumeric input improves training performance but hurts validation performance.
4-46
Chapter 4 Introduction to Predictive Modeling: Regressions
Select Select C h a r t •=> M i s c l a s s i f i c a t i o n R a t e . Iteration Plot
r/eT m
Misclassification Rate 0.500 -
0.475 -
0.450 -
0.425 -
0.400 10
15
r
20
M o d e l S e l e c t i o n Step N u m b e r Train: Misclassification Rate 'Valid: Misclassification Rate
The iteration plot shows that the smallest validation misclassi fication rates occur at Step 3. N o t i c e that the change in the assessment statistic in Step 13 is m u c h less pronounced.
Best S e q u e n c e Model Y o u can c o n f i g u r e the Regression node to select the model w i t h the smallest f i t statistic (rather than the final stepwise selection iteration). This method is how SAS Enterprise M i n e r optimizes c o m p l e x i t y for regression models. 1.
Close the Results - Regression w i n d o w .
2.
I f y o u r predictions are decisions, use the f o l l o w i n g setting: Select Selection C r i t e r i o n •=> V a l i d a t i o n M i s c l a s s i f i c a t i o n . ( E q u i v a l e n t l y , y o u can select V a l i d a t i o n P r o f i t / L o s s . The equivalence is demonstrated in Chapter 6.)
3.
I f y o u r predictions are estimates (or rankings), use the f o l l o w i n g setting: Select Selection C r i t e r i o n => V a l i d a t i o n E r r o r . B
•Selection Model Stepwise -Selection Criterion Validation Error Use Selection DefauNo -Selection Options f
The c o n t i n u i n g demonstration assumes validation error selection criteria.
4.3 Optimizing Regression Complexity 4.
Run the Regression node and view the results.
5.
Select View
Model
4-47
Iteration Plot.
Average Squared Error 0250-
0245-
0240 -
Model Selection Step Number â&#x20AC;˘Train: Average Squared Error -Valid: Average Squared Error
The vertical blue line shows the model with the optimal validation error (Step 12). 6.
Go to line 2690. A n a l y s i s of Maximum Likelihood Estimates
Parameter
DF
Estimate
Standard
Wald
Error
Chi-Square
Pr > ChiSq
Standardized Estimate
Exp(Est)
Intercept
1
0.4999
0.2575
3.77
0.0522
DemMedHomeValue DemPctVeterans
1
1.416E-6
3.011E-7
22.12
<.0CO1
0.0781
1.000
1
0.00651
0.00261
6.23
GiftAvg36
1
-0.0101
0.00355
8.02
0.0126 0.0046
0.0407 -0.0561
0.990
GiftCnt36
1
0.0574
0.0197
8.53
0.0035
0.0677
1.059
GiftTimeLast M_DemAge
1
-0.0415
0.00829
<.0001
-0.0923
0.959
0
1
0.0720
0.0345
25.07 4.36
M_GiftAvgCard36
0
1
0.1126
0.0412
7.46
0.0063
1
-0.0381
0.0281
1.85
0.1740
PromCntCard12
1.649
0.0367
1.007
1.075 1.119 -0.0282
0.963
StatusCat96NK
A
1
-0.0353
0.0957
0.14
0.7122
0.965
StatusCat96NK StatusCat96NK
E F
1 1
0.4010 -0.4485
0.1950 0.1314
4.23 11.66
0.0398 0.0006
1.493 0.639
StatusCat96NK
L
1
0.1733
0.3743
0.21
0.6433
1.189
StatusCat96NK
N 0
1
-0.0988
0.1353
0.53
0.906
1
-0.0701
0.0367
3.64
0.4649 0.0563
StatusCatStarAll
0.932
4-48
Chapter 4 Introduction to Predictive Modeling: Regressions
While not all the p-values are less than 0.05, the model seems to have a better validation average squared error (and misclassification) than the model selected using the default Significance Level settings. In short, there is nothing sacred about 0.05. It is not unreasonable to override the defaults o f the Regression node to enable selection from a richer collection o f potential models. On the other hand, most o f the reduction in the fit statistics occurs during inclusion o f the first three inputs. I f you seek a parsimonious model, it is reasonable to use a smaller value for the Stay Significance parameter.
4.4 Interpreting Regression Models
4-49
4.4 Interpreting Regression Models Beyond the Prediction Formula
Interpret the model.
66
After you build a model, you might be asked to interpret the results. Fortunately regression models lend themselves to easy interpretation.
4-50
Chapter 4 Introduction to Predictive Modeling: Regressions
Odds Ratios and Doubling Amounts A Wo'
Doubling amount Input change is required to double odds.
Ax,
consequence
1
=> odds x exp(tv,)
M2=>ocf<fex2
X?
logit
scores
Odds ratio: Amount odds change with unit change in input.
69
There are two equivalent ways to interpret a logistic regression model. Both relate changes in input measurements to changes in odds o f primary outcome. • A n odds ratio expresses the increase in primary outcome odds associated with a unit change in an input. It is obtained by exponentiation o f the parameter estimate o f the input o f interest. • A doubling amount gives the amount o f change required for doubling the primary outcome odds. It is equal to log(2) « 0.69 divided by the parameter estimate o f the input o f interest. f
I f the predicted logit scores remain in the range -2 to +2, linear and logistic regression models o f binary targets are virtually indistinguishable. Balanced stratified sampling (Chapter 6) often ensures this. Thus, the prevalence o f balanced sampling in predictive modeling might, in fact, be a vestigial practice from a time when maximum likelihood estimation was computationally extravagant.
4.4 Interpreting Regression Models
4-51
Interpreting a Regression Model
The following steps demonstrate how to interpret a model using odds ratios: 1.
Go to line 2712 o f the regression model output. Odds R a t i o E s t i m a t e s Point Effect
Estimate
DemMedHomeValue DemPctVeterans
1.000
GiftAvg36
0.990
1.007
GiftCnt36
1.059
GiftTimeLast M_DemAge
0.959 0 vs 1
M_GiftAvgCard36
0 vs 1
1 .253
PromCntCard12 StatusCat96NK
A vs S
0.963 0.957
StatusCat96NK
E vs S
1 .481
StatusCat96NK
F vs S
0.633
1.155
StatusCat96NK
L vs S
1.179
StatusCat96NK
N vs S
0.898
StatusCatStarAll
0 vs 1
0.869
This output includes most o f situations you will encounter when you build a regression model. For G i f tAvg36, the odds ratio estimate equals 0.990. This means that for each additional dollar donated (on average) in the past 36 months, the odds of donation on the 97NK campaign change by a factor o f 0.99, a 1 % decrease. For G i f tCnt36, the odds ratio estimate equals 1.059. This means that for each additional donation in the past 36 months, the odds o f donation on the 97NK campaign change by a factor o f 1.059, a 5.9% increase. For MJDemAge, the odds ratio (0 versus 1) estimate equals 1.155. This means that cases with a 0 value for M_DemAge are 1.155 times more likely to donate than cases with a 1 value for M_DemAge. f
2.
The unusual value o f 1.000 for the DemMedHomeValue odds ratio has a simple explanation. Unit (that is, single dollar) changes in home value do not change the odds o f response by an amount captured in three significant digits. To obtain a more meaningful value for this input's effect on response odds, you can multiply the parameter estimate by 1000 and exponentiate the result. You then have the change in response odds based on 1000 dollar changes in median home value. Equivalently, you could use the Transform Variables node to replace DemMedHomeValue with DemMedHmVall000=DemMedHomeValue/1000, and a unit increase on that new input would represent a $1000 increase in the DemMedHomeValue.
Close the Results window.
4-52
Chapter 4 Introduction to Predictive Modeling: Regressions
4.5 Transforming Inputs Beyond the Prediction Formula
Handle extreme or unusual values.
72
Classical regression analysis makes no assumptions about the distribution o f inputs. The only assumption is that the expected value o f the target (or some function thereof) is a linear combination o f fixed input measurements. Why should you worry about extreme input distributions?
4.5 Transforming Inputs
4-53
Extreme Distributions and Regressions Original I n p u t Scale
true association -gr
i
•
"
bundard regression
,
• standardmpm i ifnif • skewed input distribution
•
^
true association high leverage points
74
There are at least two compelling reasons. • First, in most real-world applications, the relationship between expected target value and input value does not increase without bound. Rather, it typically tapers off to some horizontal asymptote. Standard regression models are unable to accommodate such a relationship. • Second, as a point expands from the overall mean o f a distribution, the point has more influence, or leverage, on model fit. Models built on inputs with extreme distributions attempt to optimize fit for the most extreme points at the cost o f fit for the bulk o f the data, usually near the input mean. This can result in an exaggeration or an understating o f an input's association with the target. Both o f these phenomena are seen in the above slide. f
The first concern can be addressed by abandoning standard regression models for more flexible modeling methods. Abandoning standard regression models is often done at the cost o f model interpretability and, more importantly, failure to address the second concern o f leverage.
4-54
Chapter 4 Introduction to Predictive Modeling: Regressions
Extreme Distributions and Regressions Regularized Scale
Original Input Scale true association -sjg '*' 1
"^standard regression , ~~
i
!
* standardmifin i
" :
skewed input distribution
high leverage points
more symmetric distribution
76
A simpler and, arguably, more effective approach transforms or regularizes offending inputs in order to eliminate extreme values.
Regularizing Input Transformations Original Jriput 'jcaly
Regularized Scale
standard/egression B
skewed input distribution
high leverage points
rf j « • • • J
more symmetric distribution
Then, a standard regression model can be accurately fit using the transformed input in place o f the original input.
4.5 Transforming Inputs
4-55
Regularizing Input Transformations Regularized Scale
Original I n p u t Scale
true association u
regularized estimate
• ,
.
.
f*~** • '
regularized estimate °
"
standard regression \ • .«.-•
•
i
j
• standard/epression
true association
78
Often this can solve both problems mentioned above. This not only mitigates the influence o f extreme cases, but also creates the desired asymptotic association between input and target on the original input scale.
Chapter 4 Introduction to Predictive Modeling: Regressions
4-56
Transforming Inputs
Regression models are sensitive to extreme or outlying values in the input space. Inputs with highly skewed or highly kurtotic distributions can be selected over inputs that yield better overall predictions. To avoid this problem, analysts often regularize the input distributions using a simple transformation. The benefit o f this approach is improved model performance. The cost, o f course, is increased difficulty in model interpretation. The Transform Variables tool enables you to easily apply standard transformations (in addition to the specialized ones seen in Chapter 9) to a set o f inputs.
The Transform Variables Tool Use the following steps to transform inputs with the Transform Variables tool: 1.
Remove the connection between the Data Partition node and the Impute node. | â&#x20AC;˘ Predictive Analysis L:
2.
Select the Modify tab.
3.
Drag a Transform Variables tool into the diagram workspace.
4.
Connect the Data Partition node to the Transform Variables node.
5.
Connect the Transform Variables node to the Impute node.
H B E3
4.5 Transforming Inputs
6.
4-57
A d j u s t the diagram icons for aesthetics. (So that y o u can see the entire d i a g r a m , the z o o m level is reduced.) • Predictive Analysis
i
^
—
-&
-—»>r^j* . ™ pl
tr
rt
^ 6
The T r a n s f o r m Variables node is placed before the Impute node to keep the imputed values at the average (or center o f mass) o f the model inputs. 7.
Select the V a r i a b l e s ^>
property o f the Transform Variables node. Value
Prope
General
Node ID Imported Data Exported Data Notes
Trans
U
m
Train
• •
Variables Formulas Interactions SAS Code
...
u
Ibefault Methods '•-Interval Inputs p Interval Targets |; Class Inputs •-Class Targets
[slsample
None None None None
Properties First N j-Method Default [-Size 12345 - R a n d o m Seed
4-58
Chapter 4 Introduction to Predictive Modeling: Regressions
The Variables - Trans w i n d o w opens.
(none)
•
not
Name Method DemAge Default DemCluster Default DemGender Default D e rn Ho m e OwDef autt DemMedHomOefault DemMedlnconOefautt DemPctVeteraDefault GiftAvg36 Default GiftAvgAII DefauH GiftAvgCard36Default GifWvgLast Default GiftCnt36 Default GiftCntAII Default GiftCntCard3 6 Default GirtCnlCardAil DefauH GiftTimeFirst Default GiftTimeLast Default PromCntl2 Default PromCnt36 Default PromCntAII Default Pro mCntC a rd Default PromCntCard Default PromCntCard/Default REP DernMecDefault Statu sCat9 6NDefault Statu sCalStar/Oefault
Equal to
• Number of Bins
Apply Role 4.0 Input 4.0 Input 4.0 input 4.0'lnput 4.0 Input 4.0 Rejected 4.0 Input 4.0 Input 4.0'lnput 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4J)'lnput 4.0'lnput 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input
Level Interval Nominal Nominal Binary Interval Interval Interval Interval Interval (Interval Interval Interval Inlerval Inlerval Inlerval Interval Inlerval Interval Inlerval 'interval Interval Interval Interval Interval Nominal Binary
Type
Order
Reset Label
Numeric Character Character Character Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Character Numeric
Age Demograph Gender Home Owne Median Horr Median Incoi PercentVete GiftAmountJ • GiftAmountJ GiftAmountJ Gift Amount 1 Gift Count 3 £ Gift Count A l l GiftCountC Gift Count C; Times Since Time Since 1 1 Promotion Cj:. Promotion C Promotion C Promotion C : Promotion O Promotion 0 Repiacemer Status Cated Status Cated*
<
! H Update Path
OK
Cancel
4.5 Transforming Inputs
Select all inputs w i t h M.b, ..^lr.r. i
â&#x20AC;˘
(none)
Gift
in the name.
M
not
Method
Name
4-59
Default DemAge DemCluster Default DemGender Default DemHomeOwDefauft DemMedHomOefault D e m M ed In c o nDefault DemPctVeteraDefault Default GiftAvg36 GiftAvgAII Default GiftAvgCard36Default GiftAvgLasl Default Default GiftCnt36 Default GiftCntAII GiflCntC a rd 36 Default GiftCntCardAII Default GiftTimeFirst Default GiftTimeLast Default PromCnt12 Default PromCnt36 Default PromCntAII Default Pro mCntC a rd Default P ro mCntC a rd Default P ro mC ntC ard/Default REP DernMecOefautt Status Cat9 6 N Default Status CatStar/Defautt
Apply
Equal to
Number of Bins
Role 4.0 Input 4.0Jnput 4.0 Input 4.0 Input 4.0 Input 4.0 Rejected 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4,0 Input 4.0 Input 4.0 input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0'lnput 4.0 Input
Level Interval Nomina] Nominal Binary Interval Interval Interval Interval Interval Inlerval Inlerval Interval Interval Interval Interval Inlerval Interval Interval Interval Interval Inlerval Interval Interval Interval Nominal Binary
Type
Label
Order
Age k Demographj Gender Home Owne Median Hon"| Median Incoi Percent Vetej Gift Amount; Gift Amount/ Gift Amount/ Gift Amount Gift Count 3^ Gift Count All Gift Count cj Gift Count Cj Times Since Time Since I Promotion C Promotion Cp : Promotion C Promotion C Promotion 0 Promotion Cj, Replacemer^" Status Catec Status Caterj*
Numeric Character Character Character Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Character Numeric
:
:
in
<
Explore..
Update Path
Reset
OK
Cancel
Chapter 4 Introduction to Predictive Modeling: Regressions
4-60
9.
Select E x p l o r e . . . . T h e E x p l o r e w i n d o w opens.
mWKM File
View
Actions
Window
1 0 ^ Sample Properties ; ; :
:
Property •Rows
4843
$133 p
EMWS.Par1_lRAlN Obs#
A
< | |
l°t~
r / l ^ 0
|
U GtftAvg36 X 4000 a) 2000 u. 0--
I 40005 2000¬ £ o .-.
1
i
[ - 1 •
i
| .
0
.
Gift Amount Average 36 M... n*Ef
t 4000 5 2000u0
i
| -.
• i
r
.
$100.00 $200.00
0.0
m
1000
5.8 11.G 17.4 23.2 20.0
Gift Count Card All Months t l GiftTimeFirst
U GfftCnt36
i
$104.00 $208:00
U GiftAvgAII
•
1500 - f — I — , 1000 500¬ 0
Gift Amount Last
• 1500 -
— i — i
i
$0.00
-
$0.00
!
1500 i | 1000- • — 500n - J—L \—i—I—'—I—i—I—'—I—r 0.0 1.8 3.6 5.4 7.2 9.0 Gift Count Card 36 Months U GiftCntCardAII
isi
•
Ef
$104.80 $208.27
I U GmAvgLast
n
:
L i G(ftCntCard36
Gift Amount Average Card...
|l
DATAOBS | Tarqet Git 1 4 5 7
1 2 3
l i GiftAvgCard36
I
Apply 3
m\
r/sf
Value
f - i
- i — r ~ r~TT
1000 500 0
t — . — i — i — i — i — i — i — i — • — r
* 8 : I U 0.0 3.0 6.0 9.0 12.0 15.0
15.0
79.5 144.0
208.5
Time Since First Gift GiftTimeLast
r^Ef
IH
IB) — i — — i —— i I • $1.50 $30.90 $11)0.30 1
1
1
1
Gift Amount Average All M...
- i — i — i— —i — • — i — i — i — i — r 4.0 8.6 13.2 17.8 22.4 27.0 Time Since Last Gift 1
G i f tAvg and G i f t C n t inputs show some degree o f skew i n t h e i r d i s t r i b u t i o n . T h e G i f tTime inputs do not. T o regularize the skewed d i s t r i b u t i o n s , use the l o g t r a n s f o r m a t i o n .
The
For
these i n p u t s , the order o f m a g n i t u d e o f the u n d e r l y i n g measure predicts the target rather than the measure itself. I 0 . C l o s e the E x p l o r e w i n d o w .
4.5 Transforming Inputs
4-61
1. Deselect the t w o inputs w i t h G i f tTime in their names. f*»'-
M i i l
(none) Name
*-
r t w
''
• •
not
Method
DemAge DefauH DemCluster befauR DemGender DefauH DemHomeOwOefau_R_ DemMedHomOefauR DernMedlncortJefault D e m P ctVetera DefauH GiftAvg36 DefauH GiftAvgAII DefauH GiftAvgCard36DefauK GiftAvgLast DefauH Gif!Cnt36 DefauH GiftCntAII Default GiftCntCard36 DefauH GiftCntCardAll DefauH GiftTimeFirst DefauH GiftTimeLast DefauH PromCnt12 DefauH P ro m C nt3 6 Dejault PromCntAII DefauH P ro mC ntC a rd Default PromCntCard DefauH Prom C ntC a rd/OefauK REP_DernMetDefauR StatusCat96NDefauR Status CatStar/DefauR
E q u a l to
Apply
•
Number of Bins
Role 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Rejected 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input
Level Interval Nominal Nominal Binary Inlerval Interval Interval Interval Interval Inlerval Inlerval Interval Interval Interval Inlerval Interval Interval Interval Inten/al Interval Interval Interval Interval Interval Nominal Binary
Explore...
Type
Order
Label Age Demograph Gender Home Owne Median Horn Median Incoi Percent Vete Gift Amount; GiftAmountJ GiftAmountJ Gift Amount | Gift Count 36 Gift Count At Gift Count C; Gift Count C; Times Since Time Since L Promotion Cj Promotion 0 Promotion C Promotion C Promotion C Promotion C Replacemeri Status C a t e g _ Status Calerj*_
Numeric Character Character Character Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Character Numeric
Update Path
Reset
OK
Cancel
4-62
12.
Chapter 4 Introduction to Predictive Modeling: Regressions
Select M e t h o d •=> L o g for one o f the r e m a i n i n g selected inputs. The selected m e t h o d changes f r o m D e f a u l t to L o g f o r the
(none)
Q not
T
G i f tAvg
and
G i f tCnt
inputs.
Apply
Equal to
Name Method DemAge Default DemCluster Default DemGender Default DernHomeOwDefautt DemMedHomOefautt DemMedlnconDefautt DernPctVeteraDefault Log GiftAvg36 GiftAvgAII Log GiftAvgCard36Log GiftAvgLast Log Log GiflCnt36 GiftCntAII Log GiftCntCard36Log GiftCntCardAll Loy GiftTimeFirst Default GiftTimeLast DefauH PromCntl 2 DefauH PromCnt36 Default PromCntftll DefauH Pro mCntCard DefauH PromCnlCardDefauH PromCnlCard/Default REP DemMecDefautt Statu sCal96NIDefault Statu sCalStar/Default
Number of Bins
Role 4.0 Input 4.0 input 4.0 Input 4.0 In put 4.0 Input 4.0 Rejected 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input 4.0 Input
4
Level Interval Nominal Nominal Binary Inlerval Inlerval Inlerval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Nominal Binary
Explore..,
13.
i Type Numeric Character Character Character Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Character Numeric
Update Path
Order
Reset
Label Age Demographl Gender Home Owne Median Horn Median Incoi Percent Vete GiftAmountJ Gift Amount; Gift Amount/ Gift Amount i Gift Count 3 £ Gift Count Al Gift Count C: Gift Count C? Times Since Time Since t Promollon C Promolion C Promotion C Promotion C Promotion C Promotion C Replacemer Status Cates Status Catec.
OK
Cancel
Select O K to close the Variables - Trans w i n d o w .
14. Run the T r a n s f o r m Variables node and v i e w the results. 5. M a x i m i z e the O u t p u t w i n d o w and go to line 2 8 . Input I n p u t Name
Role
Level
GiftAvg36 GiftAvgAII GiftAvgCard36
INPUT INPUT INPUT
INTERVAL INTERVAL INTERVAL
L0G_ G i f t A v g 3 6
INTERVAL INTERVAL INTERVAL
log(GiftAvg36
L0G_ G i f t A v g A I I L0G_ _ G i f t A v g C a r d 3 6
GiftAvgLast GiftCnt36 GiftCntAII GiftCntCard36 GiftCntCardAll
INPUT INPUT INPUT INPUT INPUT
INTERVAL INTERVAL INTERVAL
LOG. L0G_ LOG LOG_ LOG_
INTERVAL INTERVAL INTERVAL INTERVAL INTERVAL
log(GiftAvgLast + 1) log(GiftCnt36 + 1) log(GiftCntAII + 1) log(GiftCntCard36 + 1) log(GiftCntCardAll + 1)
INTERVAL INTERVAL
Name
GiftAvgLast _GiftCnt36 GiftCntAII _GiftCntCard36 GiftCntCardAll
Level
Formula + 1 )
log(GiftAvgAII + 1) log(GiftAvgCard36 + 1)
N o t i c e the F o r m u l a c o l u m n . W h i l e a l o g transformation was s p e c i f i e d , the actual t r a n s f o r m a t i o n used was \og(input + 1). T h i s default action o f the T r a n s f o r m Variables tool avoids p r o b l e m s w i t h 0-valucs o f the u n d e r l y i n g inputs. 16.
Close the T r a n s f o r m Variables - Results w i n d o w .
4.5 Transforming Inputs
4-63
Regressions with Transformed Inputs The following steps revisit regression, and use the transformed inputs: 1.
Run the diagram from the Regression node and view the results.
2.
Go to line 3754 the Output window. Summary of Stepwise S e l e c t i o n Effect Step
Entered
Removed
DF
Number
Score
In
Chi-Square
Wald Chi-Square
Pr > ChiSq
1
L0G_GiftCnt36
!
1
95.0275
<.0CO1
2
GiftTimeLast
1
2
21.1330
<.OO01
3
DemMedHomeValue
1
3
17.7373
<.0001
4
L03_GiftAvgAll
4
21.7306
5 6
DemPctVeterans
1 1
7.0742
<.0O01 0.0078
StatusCat96NK L0G_GiftCntCard36
5 6
13.7906
1
7
5.9966
0.0170 0.0143
1
8
5.0301
0.0249
53 1
9
61.2167
0.2049
10
M_DemAge DemCluster StatusCatStarAll
10
1.2431
0.2649
11
PromCntCard12
1
11
1.4604
0.2269
12
PromCntAll
1
12
1.0022
0.3168
13
LOG_GiftCntAll
1
13
2.2990
0.1295
14 15
PromCntl2
1
14
0.8158
0.3664
PromCntCardAll
1
15
1.8875
1
0.6075
7 8 9
PromCntCard12
0.0358
0.1695 0.8500
16 17
M_REP_DemMedIncome
1
14 15
18
L0G_GiftAvg36
1
16
0.4691
0.4357 0.4934
19
M_L0G_GiftAvgCard36
0.6226
0.4301
GiftTimeFirst
1 1 1
17
20
18
0.3972
21
GiftTimeFirst
17
0.5285 0.3971
0.5286
The s e l e c t e d model, based on the (HX)SE=VERR0R c r i t e r i o n , i s the model t r a i n e d i n Step 4. I t c o n s i s t s of the following e f f e c t s : Intercept
DemMedHomeValue
GiftTimeLast
L0G_GiftAvgAll
L0G_GiftCnt36
The stepwise selection process took 21 steps, and the selected model came from step 4. Notice that half o f the selected inputs are log transformations o f the original gift variables.
4-64
3.
Chapter 4 Introduction to Predictive Modeling: Regressions
Go to line 3800 to view more statistics from the selected model. Analysis of Maximum Likelihood Estimates Standard DF
Parameter
Estimate
Error
Wald
Standardized
Chi-Square
Pr > ChiSq
Estimate
Exp(Est)
Intercept DerrrVtedHorneValue
1
0.8251
0.2921
7.98
0.0047
1
1.448E-6
23.26
<.CO01
0.0798
1.000
GiftTimeLast LOGjSiftAvgAH
1
-0.0341
3.002E-7 0.00756
20.33
<.0001
-0.0758
0.966
1
-0.3469
0.0747
21.58
<.0O01
-0.0895
0.707
LOG GiftCnt36
1
0.3736
0.0728
26.34
<.0O01
0.0998
1.453
2.282
Odds Ratio Estimates Point Estimate
Effect
4.
DerrrVtedHorneValue
1.000
GiftTimeLast
0.966
LOGjSiftAvgAll
0.707
LOG GiftCnt36
1.453
Select View <=> Model «=> Iteration Plot.
Average Squired Error
0250-
0245-
0240-
023510
-r~ 15
— r 20
Model Selection Step Number •Train: Average Squared Error -Valid: Average Squared Error
The selected model (based on minimum error) occurs i n Step 4. The value o f average squared error for this model is slightly lower than that for the model with the untransformed inputs.
4.5 Transforming Inputs
5.
4-65
Select Select Chart â&#x20AC;˘=> Misclassification Rate.
m Misclassification Rate 0.500-
0.475 -
0.450 -
0.425 -
0.40010
15
- r 20
Model Selection Step Number -Train: Misclassification Rate -Valid: Misclassification Rate
The misclassification rate with the transformed input model is nearly the same as that for the untransformed input model. The model with the lowest misclassification rate comes from Step 3. I f you want to optimize on the misclassification rate, you must change this property in the Regression node's property sheet. 6.
Close the Results window.
4-66
Chapter 4 Introduction to Predictive Modeling: Regressions
4.6 Categorical Inputs Beyond the Prediction Formula
Use nonnumeric inputs.
81
Using nonnumeric or categorical inputs presents another problem for regressions. As was seen i n the earlier demonstrations, inclusion o f a categorical input with excessive levels can lead to overfitting.
4.6 Categorical Inputs
4-67
To represent these nonnumeric inputs in a model, you must convert them to some sort o f numeric values. This conversion is most commonly done by creating design variables (or dummy variables), with each design variable representing approximately one level o f the categorical input. (The total number o f design variables required is, in fact, one less than the number o f inputs.) A single categorical input can vastly increase a model's degrees o f freedom, which, in turn, increases the chances o f a model overfitting.
4-68
Chapter 4 Introduction to Predictive Modeling: Regressions
Coding Consolidation Level
Dgh
&ABCD
A
1
0
0
B
1
0
0
C
1
0
0
D
1
0
0
E
0
1
0
F
0
1
0
G
0
0
1
H
0
0
1
1
0
0
0
8G There are many remedies to this p r o b l e m . One o f the simplest remedies is to use d o m a i n k n o w l e d g e to reduce the number o f levels o f the categorical input. In this way, level-groups are encoded in the model in place o f the original levels.
4.6 Categorical Inputs
4-69
Recoding Categorical Inputs In Chapter 2, you used the Replacement tool to eliminate an inappropriate value in the median income input. This demonstration shows how to use the Replacement tool to facilitate combining input levels o f a categorical input. 1.
Remove the connection between the Transform Variables node and the Impute node. ; Predictive Analysis
> flDcdiioaTfee
-^|f5"O|0 2.
Select the Modify tab.
3.
Drag a Replacement tool into the diagram workspace.
4.
Connect the Transform Variables node to the Replacement node.
5.
Connect the Replacement node to the Impute node.
J
—© 80%|S:°Hj'
- i ; Predictive Analysis
ft
^toputa 1= L B
PVAOTNt
»
JLl
—
6
<
>
^
6
K
J
-ii||^Q|0 - J —Q 76%|gqqj
4-70
Chapter 4 Introduction to Predictive Modeling: Regressions
Y o u need to change some o f the node's default settings so that the replacements are l i m i t e d to a single categorical input. 6.
In the Interval Variables property group, select D e f a u l t L i m i t s M e t h o d O N o n e . Value
Property
General
Node ID Imported Data Exported Data iNotes
Repl2
... ...
...
Train
j- Replacement Editor
•
7.
Cutoff Values Class Variables Replacement Editor Ignore Unknown Levels
•
In the Class Variables property g r o u p , select R e p l a c e m e n t E d i t o r O Properties panel. 1
Value
Property
General
Node ID Imported Data Exported Data Notes
I
Repl2
Train
Slnterval Variables i-Replacement Editor |-Default Limits MethoNone '-CutofTValues a C l a s s Variables \- Replacement Editor Ignore " U n k n o w n Levels
,.u !
y
•
•
•
. „ [ f r o m the Replacement node
4.6 Categorical Inputs
4-71
The Replacement E d i t o r opens.
Variable DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster
Level 40 24 36 35 49 13 12 27 18 30 00 39 45 51 14 11 02 43 28 16 46 41 17 08
Frequency 236 209 206 194 178 154 152 152 148 143 129 129 114 114 113 111 107 103 99 98 98 97 93 89
Type
c c c c c
C
c c c c" c ;C C C C
c
c
iC !C
b c
*
I Char Raw v... Num RawV... Repiaceme... â&#x20AC;˘40 (24 t 36 35 49 ,13 ~T~
gl
L
27 18 30 00 _i39 J45 [51
I. ~ ~ I. ~f
11 02 43
r
28 |16 *46
I
U1
|.
08 Reset
OK
Cancel
The categorical input Replacement Editor lists all levels o f each binary, o r d i n a l , and n o m i n a l input. Y o u can use the Replacement c o l u m n to reassign values to any o f the levels. The input w i t h the largest number o f levels is
DemCluster.
w h i c h has so many levels that
consolidating the levels using the Replacement Editor w o u l d be an arduous task. ( A n o t h e r , autonomous method for c o n s o l i d a t i n g the levels o f D e m C l u s t e r is presented as a special topic in Chapter 8.) For this demonstration, y o u c o m b i n e the levels o f another input,
StatusCat96NK.
4-72
8.
Chapter 4 Introduction to Predictive Modeling: Regressions
Scroll the Replacement E d i t o r to v i e w the levels o f
Variable DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemCluster DemGender DemGender DemGender DemGender DemHomeOwner DemHomeOwner DemHomeOwner StatusCal96NK StatusCa!96NK StatusCat96NK StatusCat96NK StatusCat96NK StatusCat96NK StatusCat96NK StatusCatStarAII Status CatStarAII StatusCatStarAII
Level 33 06 50 04 19 52 JJNKNOW.. F M }U JJNKNOW.. H
ty ~
Frequency 30 C 26 c 26 "c 22 19 C 19 c 2592 c 2002 249
c
c
2655 2188
JJNKNOW.. A 2911 S 1168 342 N 286 E 115 IL 21 JJNKNOW.. 1 2590 0 2253 JJNKNOW..
FT
c c c c c c c c
c c
c C c N N N
StatusCat96NK.
Type
CharRawv... Num Raw... Repiaceme.., 33 06 50 04 19 52 JDEFAULT.
—
M U .DEFAULT, H U .DEFAULT. A~~ S F N E L .DEFAULT. 1.0 0.0 DEFAULT Reset
OK
Cancel
The input has six levels, plus a level to represent u n k n o w n values ( w h i c h do not occur in the t r a i n i n g data). The levels o f
StatusCat96NK
w i l l be consolidated as f o l l o w s :
•
Levels A and S (active and star donors) indicate consistent donors and are grouped into a single level, A .
•
Levels F and N ( f i r s t - t i m e and new donors) indicate new donors and are grouped into a single level, N .
•
Levels E and L (inactive and lapsing donors) indicate lapsing donors and are grouped into a single level L.
4.6 Categorical Inputs
9.
Type
10. T y p e
A as the N
Replacement level for
StatusCat96NK
levels A and S.
as the Replacement level for
StatusCat96NK
levels F and N .
4-73
11. Type L as the Replacement level for StatusCat96NK levels L and E. StatusCat96NK StatusCat96NK StatusCat96NK StatusCat96NK StatusCat96NK StatusCat96NK StatusCat96NK
A S F N E L UNKNOW..
2911 1168 342 286 115 21
12. Select O K to close the Replacement Editor.
C C
c |C c c
c
A
N
|7~
E
I
A A N N L
|L
I.
L
IS F
.
I DEFAULT
4-74
Chapter 4 Introduction to Predictive Modeling: Regressions
13. Run the Replacement node and v i e w the results.
file
Edit View
Window
1 1 1 0
1
I
c ^ E f [ET(|
Total Replacement Counts $ Variable
Role
Label
StatusCat9... INPUT
Train
Status Cate...
Validation 1625
1627
[2) output
*
*
Us e r :
Dodotech
Dace:
17HAY08
Time:
21:51
; K
* Training
Output
s
Variable
Role
*
Summary
Measurement
Frequency
Level
Count
The Total Replacement Counts w i n d o w shows the number o f replacements that occur in the t r a i n i n g and v a l i d a t i o n data.
4.6 Categorical Inputs
14. Select View «=> Model !§ Replaced Levels
4-75
Replaced Levels. The Replaced Levels window opens. |fH
W 0 0 $ $ ^
Variable
Formatted Value
Type
StatusCat9... StatusCat9... StatusCat9... StatusCat9... StatusCat9... StatusCat9...
A S F N E L
C C
Character Numeric Unformated Value Value A S F N E L
c c c c
Replacemen Variable Label t Value A A N N L L
Status Status Status Status Status Status
Cate... Cate... Cate... Cate... Cate... Cate...
The replaced level values are consistent with expectations. 15. Close the Results window. 16. Run the Regression node and view the results. 17. Go to line 3659 o f the Output window. Summary of Stepwise Selection Effect Step
Entered
Removed
DF
Number
Score
Wald
In
Chi-Square
Chi-Square
Pr > ChiSq
1
L0G_GiftCnt36
1
1
95.0275
2
GiftTimeLast
1
2
21.1330
<.0001
3
DenttedHomeValue
1
3
17.7373
<.0001
4
L0G_GiftAvgAll
1
4
21.7306
<.0001
5
DemPctVeterans
1
5
7.0742
0.0078
6
REP_StatusCat96NK
6
9.7073
0.0078
7
L0G_GiftCntCard36
1
7
6.2112
0.0127
8
M_DemAge DemCluster
1
8
4.8754
0.0272
9
61.7834
0.1910
10
StatusCatStarAII
1
10
1.6743
0.1957
11
PromCntCard12
1
11
1.3961
0.2374
12
1
12
1.1442
0.2848
13
PromCntAil L0G_GiftCntAll
1
13
1.8685
0.1717
14
PromCnt12
1
14
0.6761
0.4109
15
ProraCntCardAll
1
15
2.0585
1
14
1
15
0.7608
0.3831
1 1 1
16
0.7343
0.3915
17
0.5853
0.4443
18
0.3821
1
17
9
PromCntGard12
16 17 18 19 20
L0G_GiftAvg36 M_L0G_GiftAvgCarxf36 M_R£P_DemMedIncome GiftTimeFirst
21
GiftTimeFirst
<.0CO1
0.1514 0.0216
0.8830
0.5365 0.3821
0.5365
The REP_StatusCat96NK input (created from the original StatusCat96NK input) is included in Step 6 the Stepwise Selection process. The three level input is represented by two degrees o f freedom. 18. Close the Results window.
4-76
Chapter 4 Introduction to Predictive Modeling: Regressions
4.7
Polynomial Regressions (Self-Study)
Beyond the Prediction Formula
A c c o u n t f o r nonlinearities.
90
The Regression tool assumes (by default) a linear and additive association between the inputs and the logit o f the target. I f the true association is more complicated, such an assumption m i g h t result in biased predictions. For decisions and rankings, this bias can ( i n some cases) be unimportant. For estimates, this bias appears as a higher value for the validation average squared error fit statistic.
Standard Logistic Regression
0.7 0.6
X 0.5 2
0.4 03 0.2 0.1 0.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 O.S 0.9 1.0 *1
91
In the dot c o l o r p r o b l e m , the (standard logistic regression) assumption that the concentration o f y e l l o w dots increases t o w a r d the upper right corner o f the unit square seems to be suspect.
4.7 Polynomial Regressions (Self-Study)
4-77
Polynomial Logistic Regression
92
W h e n m i n i m i z i n g prediction bias is important, y o u can increase the f l e x i b i l i t y o f a regression model by adding p o l y n o m i a l c o m b i n a t i o n s o f the model inputs. This enables predictions to better match the true input/target association. It also increases the chances o f o v e r f i t t i n g w h i l e simultaneously reducing the interpretability o f the predictions. Therefore, p o l y n o m i a l regression must be approached w i t h some care In SAS Enterprise M i n e r , a d d i n g p o l y n o m i a l terms can be done selectively or a u t o n o m o u s l y .
Chapter 4 Introduction to Predictive Modeling: Regressions
4-78
Adding Polynomial Regression Terms Selectively
This demonstration shows how to use the Term Editor window to selectively add polynomial regression terms. You can modify the existing Regression node or add a new Regression node. I f you add a new node, you must configure the Polynomial Regression node to perform the same tasks as the original. A n alternative is to make a copy o f the existing node. 1.
Right-click the Regression node and select Copy from the menu.
2.
Right-click the diagram workspace and select Paste from the menu. A new Regression node is added with the label R e g r e s s i o n ( 2 ) to distinguish it from the existing one.
3.
Select the Regression (2) node. The properties are identical to the existing node.
4.
Rename the new regression node P o l y n o m i a l R e g r e s s i o n with model identification i n later chapters.
5.
Connect the Polynomial Regression (2) node to the Impute node.
( 2 ) . The (2) is retained to help
SHIES
• ^ P r e d i c t i v e Analysis
PVA97fK
I
^
"
i . .PotfOOBili
v r~lOedda>Tceo
JLl
J
0
- J —
©
76% S^D
To add polynomial terms to the model, you use the Term Editor. To use the Term Editor, you need to enable User Terms.
4-7 Polynomial Regressions (Self-Study)
6.
4-79
Select U s e r T e r m s "=> Yes in the Polynomial Regression ( 2 ) property panel. Value
Property
General Node ID Imported Data Exported Data Notes
Reg2 U u
Train
Variables
y
BEquatfon
Yes -Main Effects Two-Factor InteracticNo •Polynomial Terms NO •Polynomial Degree 2 Yes -UserTerms Term Editor
~12
T h e T e r m E d i t o r is n o w unlocked and can be used to add specific p o l y n o m i a l terms to the regression model. 3.
Select T e r m E d i t o r "=>
Node ID Imported Data Exported Data Notes
II
Value
Property
General
f r o m the Polynomial Regression Properties panel.
Reg2
Train
Variables
Q U
UJ
Yes Main Effects -Two-Factor InteracticNo -Polynomial Terms No •Polynomial Degree 2 Yes -UserTerms Term Editor
•
Chapter 4 Introduction to Predictive Modeling: Regressions
4-80
T h e T e r m s w i n d o w opens.
EH Target: TargetB
1
•
I !
* •
X
Variables
Term
DemCluster DemGender DemHomeOwner DemMedHomeV DemPctVeterans
Save
OK
Cancel
4.7 Polynomial Regressions (Self-Study)
Interaction T e r m s Suppose that y o u suspect an interaction between home value and time since last g i f t . (Perhaps a recent change in property values affected the donation patterns.) I.
Select D e m M e d H o m e V a l u e in the Variables panel o f the Terms d i a l o g box.
2.
Select the A d d button,
. The D e m M e d H o m e V a l u e input is added to the T e r m panel.
Target: TargetB
X
Term
Variables
0
DemCluster DemGender
DemMedHomeValue
DemHomeOwnei DemMedHomeV* DemPct Veterans <.
Save
J > OK
Cancel
4-81
4-82
3.
Chapter 4 Introduction to Predictive Modeling: Regressions
Repeat the previous step to add
GiftTimeLast. i-.M
Target: TargetB
•
1
* X
Variables
Term
GiftTimeFirst
DemMedHomeValuE
GiftTimeLast
GiftTimeLast
IMP_DemAge IMP_LOG_GiftAvt IMP REP DemM
Save
OK
4.
Cancel
Select Save. A n interaction between the selected inputs is now available for consideration by the Regression node.
t i l Target: TargetB DemMedHomeValue*GiftTimeLast
Variables
Term
GiftTimeFirst GiftTimeLast IMP_DemAge !MP_LOG_GiftAv£ IMP_REP_DemM
Save
OK
Cancel
4.7 Polynomial Regressions (Self-Study)
Quadratic Terms S i m i l a r l y , suppose that y o u suspect a parabola-shaped relationship between the l o g i t o f donation probability and median home value. 1.
Select DemMedHomeValue.
2.
Select the A d d b u t t o n ,
3.
Select
m
. The
again. A n o t h e r
DemMedHomeValue input is added to the T e r m panel.
DemMedHomeValue input is added to the T e r m panel.
m
Target: TargetB 1
DemMedHomeValue*GiftTimeLast
Variables
Term
DemMedHomeVE*
DemMetlHomeValuE
DemPctVeterans
DemMedHomeVakiE
GiftTimeFirst GiftTimeLast IMP_DemAge
Save
OK
4.
Cancel
Select Save. A quadratic median home value term is available for consideration by the m o d e l . L 3 Target: T TargetB I 2 3
T
DemMedHorneValue'GiftTimeLast DemMeclHomeValue'DemMedHomeValue
Variables DemMedHomeVc * DemPctVeterans GiftTimeFirst GiftTimeLast IMP_DemAue
*
Term
Save
" i l T > OK
Cancel
4-83
4-84
Chapter 4 Introduction to Predictive Modeling: Regressions
5.
Select O K to close the Terms dialog box.
6.
Run the Polynomial Regression node and view the results.
7.
Go to line 3752 in the Output window. Suimary of Stepwise Selection Effect Step 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Entered
m
Score Chi-Square
Nutter Remoted
DF
Chi-Square
Pr > ChiSq
LJDGjaftCnt36 GiftTimeLast DenMxHomeVal^
1
1
95.0275
<.CC01
1
21.1330 19.6082
<.OC01
1
2 3
U3GJÂŁftAvgAll
1 1
4 5
21.8432
1 1 1 1 1 1 1
6 7 8 9 10 11 12 11 12
DenFctVetBrans REP_StatosCaft95W
UDGJ3inaTtCand36 M_Demflge DemMedHomeVaJjue*TJemyed^ StatusCatStarAII PranCntCard12 PraiCntALL StalifiCatStarALl DarCiuster UDGJ^ftCntAll StatusCatStarAII PranCnt12 PraiCn^CardALL PronCntCard12 UDG_GiftAvg36 MjeMJenAtedlnoome MJ03_GaftAvgCard36
53 1 1 1 1 1 1 1 1 1
GiftTimeFirst GiftTimeFirst
1
13 14 15 16 15 16 17 18 19 18
<.OC01 <.0CO1
7.0965 9.7703 6.2012 4.9143 3.6530 1.8153 1.2570 1.3799
0.0077 0.0076
0.4504
0.0128 0.0265 0.0660 0.1779 0.2522 0.2401 0.5021
58.7308
0.2736
1.0539 1.2548 0.6591 2.0306
0.3046 0.2626 0.4169 0.1492 0.8931 0.3888 0.4223
0.0189 0.7423 0.6424 0.6013 0.3946
0.4381 0.5299 0.3945
0.5299
The stepwise selection summary shows the interaction term added in Step 3 and the quadratic term in Step 9. 8.
Close the Results window.
This raises the obvious question: How do you know which nonlinear terms to include in a model? Unfortunately, there is no simple solution to this question in SAS Enterprise Miner, other than including all polynomial and interaction terms in the selection process.
4.7 Polynomial Regressions (Self-Study)
4-85
A d d i n g Polynomial Regression Terms A u t o n o m o u s l y (Self-Study) SAS Enterprise M i n e r has the a b i l i t y to add every p o l y n o m i a l c o m b i n a t i o n o f inputs to a regression m o d e l . O b v i o u s l y , this feature must be used w i t h some care, because the number o f p o l y n o m i a l input combinations increases r a p i d l y w i t h input count. For instance, the PVA97NK data set has 20 interval inputs. I f y o u want to consider every quadratic c o m b i n a t i o n o f these 20 inputs, y o u r selection procedure must sequentially p l o d through more than 200 inputs. T h i s is not an o v e r w h e l m i n g task for today's fast computers. F o l l o w these steps to explore a f u l l t w o - f a c t o r stepwise selection process: 1.
Select T w o - F a c t o r I n t e r a c t i o n ^
Yes in the P o l y n o m i a l Regression property panel.
2.
Select P o l y n o m i a l T e r m s O Yes in the P o l y n o m i a l Regression Properties panel. Property
General Node ID Imported Data Exported Data Notes
Train Variables
•Main Effects •Two-Factor InteracticYes •PolynomialTerms Yes •Polynomial Degree 2 -UserTerms Yes -Term Editor 3.
Run the P o l y n o m i a l Regression (2) node and v i e w the results. ( I n general, this m i g h t take longer than most activities.)
Chapter 4 Introduction to Predictive Modeling: Regressions
4-86
4.
Go to line 1774 o f the Output window. Suimary of Stepwise Selection Effect Step
Removed
Entered
Chi-Square
Chi-Square
Pr > ChiSq
1
UTXLGinrjnt36*lIX3ja
1
1
101.0902
<.CO01
1
2
33.9163
<.CO01
3
t^ftTimeLast*LOG_GiftAvgUst rJenMedHomeVa!ue*DemPctVeterar^
1
3
25.2441
<.00O1
4
REP_StatusCat96MK
2
4
10.2804
0.0059
5 6
DemHomeOAner^_lJDGJ^m\^i^ DemCluster*TJemGender
7 8 10
1
5
5.8659
106
6
134.9632
0.0154 0.0302
GimimeLast*PromCrrt12
1
0.0174
1
7 8
5.6507
urjGGdftOrr^^ LOG_GiftAvgAll
3.7134
0.0540
1
9
5.8292
0.0158
50
10
64.6125
53
9
DemCluster
11
5.
Wald
Score In
2
9
f
Number DF
DemCluster
0.0601 39.8737
0.9086
Surprisingly, the selection process takes only 11 steps. This is the result o f the 106 degree-offreedom D e m C l u s t e r and DemGender interaction in Step 6. As the iteration plot shows below, the model is hopelessly overfit after this step. Inputs with many levels are problematic for predictive models. It is a good practice to reduce the impact o f these inputs either by consolidating the levels or by simply excluding them from the analysis.
Scroll down in the Output Window. The
s e l e c t e d model, based on t h e CH00SE=VERR0R c r i t e r i o n ,
consists of the following
i s t h e model t r a i n e d i n S t e p 3. I t
effects:
I n t e r c e p t DemMedHomeValue*DemPctVeterans GiftTimeLast*LOG_GiftAvgLast
The selected model includes only three terms!
L0G_GiftCnt36*L0G_GiftCntCardAll
4.7 Polynomial Regressions (Self-Study)
6.
Select View >=> M o d e l •=> Iteration Plot.
[Average Squared Error
0.250 -
0.245 -
0.240-
0.235-
0.230-
i 0.0
I
—r~
i
5.0 75 25 Model Selection Step Number
10.0
•Train: Average Squared Error •Valid: Average Squared Error
The validation average squared error o f the three-term model is lower than any other model considered to this point.
4-87
Chapter 4 Introduction to Predictive Modeling: Regressions
4-88
A
* ,
Exercises
1. Predictive Modeling Using Regression a. Return to the Chapter 3 Organics diagram. Attach the StatExplore tool to the ORGANICS data source and run it. b.
In preparation for regression, is any missing values imputation needed? I f yes, should you do this imputation before generating the decision tree models? Why or why not?
c. Add an Impute node to the diagram and connect it to the Data Partition node. Set the node to impute U for unknown class variable values and the overall mean for unknown interval variable values. Create imputation indicators for all imputed inputs. d. Add a Regression node to the diagram and connect it to the Impute node. e. Choose the stepwise selection and validation error as the selection criterion. f. Run the Regression node and view the results. Which variables are included in the final model? Which variables are important in this model? What is the validation ASE? g.
In preparation for regression, are any transformations of the data warranted? Why or why not?
h. Disconnect the Impute node from the Data Partition node. i. Add a Transform Variables node to the diagram and connect it to the Data Partition node, j.
Connect the Transform Variables node to the Impute node.
k. Apply a log transformation to the DexnAff 1 and PromTime inputs. I. Run the Transform Variables node. Explore the exported training data. Did the transformations result in less skewed distributions? m. Rerun the Regression node. Do the selected variables change? How about the validation ASE? n. Create a full second-degree polynomial model. How does the validation average squared error for the polynomial model compare to the original model?
4.8 Chapter Summary
4-89
4.8 Chapter Summary Regression models are a prolific and useful way to create predictions. New cases are scored using a prediction formula. Inputs are selected via a sequential selection process. Model complexity is controlled by fit statistics calculated on validation data. To use regression models, there are several issues with which to contend that go beyond the predictive modeling essentials. 1.
A mechanism for handling missing input values must be included i n the model development process.
2.
A reliable way to interpret the results is needed.
3.
Methods for handling extreme or outlying predictions should be included.
4.
The level-count o f a categorical should be reduced to avoid overfitting.
5.
The model complexity might need to be increased beyond what is provided by standard regression methods.
One approach to this is polynomial regression. Polynomial regression models can be fit manually with specific interactions in mind. They can also be fit autonomously by selecting polynomial terms from a list o f all polynomial candidates.
Chapter 4 Introduction to Predictive Modeling: Regressions
4-90
Regression Tools Review Replace missing values for interval (means) and categorical data (mode). Create a unique replacement indicator.
6
,jÂŁ Regression
^Transform ihfortaUes
|
Create linear and logistic regression models. Select inputs with a sequential selection method and appropriate fit statistic. Interpret models with odds ratios. Regularize distributions of inputs. Typical transformations control for input skewness via a log transformation.
ss
continued...
Regression Tools Review Consolidate levels of a nonnumeric input using the Replacement Editor window.
â&#x20AC;˘ ( Ftryrtocrfal Regression
Add polynomial terms to a regression either by hand or by an autonomous exhaustive search.
4.9 Solutions
4.9
4-91
Solutions
Solutions to Exercises 1. Predictive Modeling Using Regression a. Return to the Chapter 4 Organics diagram in the E x e r c i s e s project. Use the StatExplore tool on the ORGANICS data source. 1) Connect the StatExplore node to the O R G A N I C S node as shown.
4-92
Chapter 4 Introduction to Predictive Modeling: Regressions
2)
Run the StatExplore node and v i e w the results.
Edit
View
Window
i l l B
15"!
iS Chi-Square Plot Target = TargetBuy
â&#x20AC;&#x201D;
0-25-
B
5 0.20 H trt ? 0.1 D M
5 0.10| 0.05 0.00
1
3
Ordered Inputs Q Output Usee:
Dodotech
Dace:
28HAY08
Time:
00:35
â&#x20AC;˘ Training
b.
Output
In preparation for regression, is any missing values imputation needed? I f yes, should y o u do this i m p u t a t i o n before generating the decision tree models? W h y or w h y not? G o to line 38 in the Output w i n d o w . Several o f the class inputs have m i s s i n g values.
Variable DemClusterGroup DemGender DemReg DemTVReg PromClass TargetBuy
Role INPUT INPUT INPUT INPUT INPUT TARGET
C l a s s V a r i a b l e Summary S t a t i s t i c s Number of Mode Levels Missing Mode Percentage 8 674 C 20.55 4 2512 F 54.67 6 465 South East 38.85 14 465 London 27.85 4 0 Silver 38.57 2 0 0 75.23
Mode D M Midlands Midlands Tin 1
Mode2 Percentage 19.70 26.17 30.33 14.05 29.19 24.77
G o to line 65 o f the O u t p u t w i n d o w . Most o f the Interval inputs also have m i s s i n g values. I n t e r v a l V a r i a b l e Summary S t a t i s t i c s Variable DemAffl DemAge PromSpend PromTime
ROLE INPUT INPUT INPUT INPUT
Mean 8.71 53.80 4420.59 6.56
Std. Deviation 3.42 13.21 7559.05 4.66
Non Missing 21138 20715 22223 21942
Missing 1085 1508 0 281
Minimum 0.00 18.00 0.01 0.00
Median 8 54 2000 5
Maximum 34.00 79.00 296313.85 39.00
Y o u d o n o t need t o i m p u t e before t h e Decision T r e e node. D e c i s i o n trees h a v e b u i l t - i n w a y s to h a n d l e m i s s i n g v a l u e s . (See Chapter 3.)
4.9 Solutions
c.
4-93
A d d an I m p u t e node to the diagram and connect it to the D a t a P a r t i t i o n node. Set the node to impute U for u n k n o w n class variable values and the overall mean for u n k n o w n interval variable values. Create i m p u t a t i o n indicators for all imputed inputs.
l_!t_ Decision Tree
± \
^StatExplore
Decision Tree
•Q
r I: i : ORGANICS
Impute
~™™Jata Partition
o
1)
Select D e f a u l t I n p u t M e t h o d •=> D e f a u l t C o n s t a n t V a l u e .
2)
Type U for the Default Character Value. IJciass Variables [-Default Input Method Default Target Method ^Normalize Values
I Default Constant Value Mone Ves
[• Default Input Method -Default Tarqet Method
Mean None
[•Default Character Value -Default Number Value
U
:
3)
Select I n d i c a t o r V a r i a b l e T y p e •=> U n i q u e .
4)
Select I n d i c a t o r V a r i a b l e R o l e <=> I n p u t .
4-94
Chapter 4 Introduction to Predictive Modeling: Regressions
d.
A d d a Regression node to the diagram and connect it to the I m p u t e node.
UL
Decision Tree
o
f ^ D e c i s i o n Tree
StatExplore
)ota Partition
ORGANICS
e.
.jf-
Regression
Select the Stepwise selection and V a l i d a t i o n E r r o r as the selection c r i t e r i o n .
[•Selection Model HSelection Criterion [•Use Selection Defaults ^Selection Options f.
Impute
••1
Stepwise Validation Error Ves
•
Run the Regression node and v i e w the results. W h i c h variables are included in the final model? W h i c h variables are important in this model? What is the v a l i d a t i o n A S E ? 1)
The Results w i n d o w opens.
File
I I
Edit View
Window
IB H ffl
Score Rankings Overlay: TargetBuy
•
Cumulative Lift
TargetBuy TargetB uy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy TargetBuy •TRAIN
TarfiPlfim/
-VALIDATE
< 1 ''-t:-
$
L l Effects Plot
0
1
1
5 Effect Number
! • + o-1
[Hi
Fit Statistics Statistics Train Vali Label _AIC_ Akaike's Inf... * 9691.257 _ASE_ Average Sq... 0.138587 _AVERR_ Average Err... 0.435352 _DFE_ Degrees of... 11104 _DFM_ Model Degr... 8 _DFT_ Total Degre... 11112 _DIV_ Divisor for A.. 22224 _ERR_ Error Fundi... 9675.257 Pinal Prnrtir
FPF
G) Output
n 1-5Q7RR
;
•
5
User:
Dodotech
Date:
28I1AY08
Time:
01:24
Training
4
I #1f
Fit Statistics
Target
Output
4.9 Solutions
2)
G o to line 664 in the Output w i n d o w . The selected model, based on the CHOOSE=VERROR c r i t e r i o n , i s the model t r a i n e d Step 6. I t c o n s i s t s of the f o l l o w i n g e f f e c t s : Intercept
3)
IMP__DemAff 1
IMP_DemAge
IMP_DemGender
M_DemAff1
M_DemAge
The odds ratios indicate the effect that each input has on the logit score. Effect IMP_DemAffl IMP_DemAge IMP_DemGender IMP_DemGender M_DemAffl U DemAge M DemGender
4)
4-95
F M 0 0 0
VS VS vs vs vs
Point Estimate 1 .283 0.947 6.967 2.899 0.708 0.796 0.685
u u 1 1 1
4) The validation ASE is given in the Fit Statistics window.

   Fit Statistics (target: TargetBuy)

   Statistic   Label                       Train       Validation
   _AIC_       Akaike's Info. Criterion    9691.257
   _ASE_       Average Squared Error       0.138587    0.137156
   _AVERR_     Average Error Function      0.435352    0.432266
   _DFE_       Degrees of Freedom          11104
   _DFM_       Model Degrees of Freedom    8
   _DFT_       Total Degrees of Freedom    11112
   _DIV_       Divisor for ASE             22224       22222
   _ERR_       Error Function              9675.257    9605.81
   _FPE_       Final Prediction Error      0.138786
   _MAX_       Maximum Absolute Error      0.991147    0.987434
   _MSE_       Mean Squared Error          0.138687    0.137156
g. In preparation for regression, are any transformations of the data warranted? Why or why not?
1) Open the Variables window of the Regression node.
2) Select all interval inputs.
[Variables window: the interval inputs IMP_DemAffl, IMP_DemAge, and IMP_PromTime are selected.]
3) Select Explore.... The Explore window opens.
[Explore window: histograms of Imputed: Age (IMP_DemAge), Imputed: Loyalty Card Tenure (IMP_PromTime), and Imputed: Affluence Grade (IMP_DemAffl).]
Both Card Tenure and Affluence Grade have moderately skewed distributions. Applying a log transformation to these inputs might improve the model fit.
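The effect of the transformation can be previewed outside SAS Enterprise Miner. The sketch below uses simulated right-skewed data (an assumption, not the actual ORGANICS table) and a log(x + 1) shift to accommodate zero values, similar in spirit to the offset a log transformation needs whenever an input contains zeros.

```python
import numpy as np

rng = np.random.default_rng(12345)
tenure = rng.exponential(scale=6.5, size=10_000)    # simulated skewed input

def skewness(x):
    """Sample skewness: the third standardized moment."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

log_tenure = np.log(tenure + 1)                     # +1 guards against log(0)
print(f"raw skewness: {skewness(tenure):5.2f}")     # near 2 for exponential data
print(f"log skewness: {skewness(log_tenure):5.2f}") # much closer to 0
```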
h. Disconnect the Impute node from the Data Partition node.
i. Add a Transform Variables node to the diagram and connect it to the Data Partition node.
j. Connect the Transform Variables node to the Impute node.
[Screen capture: Data Partition ⇒ Transform Variables ⇒ Impute ⇒ Regression diagram.]
k. Apply a log transformation to the DemAffl and PromTime inputs.
1) Open the Variables window of the Transform Variables node.
2) Select Method ⇒ Log for the DemAffl and PromTime inputs.
[Variables window: Method = Log for the DemAffl and PromTime inputs; all other variables keep Method = Default.]
3) Select OK to close the Variables window.
l. Run the Transform Variables node. Explore the exported training data. Did the transformations result in less skewed distributions?
1) The easiest way to explore the created inputs is to open the Variables window in the subsequent Impute node. Make sure that you update the Impute node before opening its Variables window.
[Variables window of the Impute node: the new LOG_DemAffl and LOG_PromTime inputs appear with the role Input.]
2) With the LOG_DemAffl and LOG_PromTime inputs selected, select Explore....
[Explore window: histograms of Transformed: Affluence Grade (LOG_DemAffl) and Transformed: Loyalty Card Tenure (LOG_PromTime).]
The distributions are nicely symmetric.
m. Rerun the Regression node. Do the selected variables change? How about the validation ASE?
1) Go to line 664 of the Output window.
   The selected model, based on the CHOOSE=VERROR criterion, is the model trained in Step 6.
   It consists of the following effects:

   Intercept IMP_DemAge IMP_DemGender IMP_LOG_DemAffl M_DemAge M_DemGender M_LOG_DemAffl

2) IMP_LOG_DemAffl and M_LOG_DemAffl replace IMP_DemAffl and M_DemAffl, respectively.
3) Apparently, the log transformation reduced the validation ASE slightly.
   Fit Statistics (target: TargetBuy)

   Statistic   Label                       Train       Validation
   _AIC_       Akaike's Info. Criterion    9758.609
   _ASE_       Average Squared Error       0.139545    0.138204
   _AVERR_     Average Error Function      0.438382    0.43599
   _DFE_       Degrees of Freedom          11104
   _DFM_       Model Degrees of Freedom    8
   _DFT_       Total Degrees of Freedom    11112
   _DIV_       Divisor for ASE             22224       22222
   _ERR_       Error Function              9742.609    9688.581
   _FPE_       Final Prediction Error      0.139746
   _MAX_       Maximum Absolute Error      0.992317    0.994405
   _MSE_       Mean Squared Error          0.139646    0.138204
   _NOBS_      Sum of Frequencies          11112       11111
n. Create a full second-degree polynomial model. How does the validation average squared error for the polynomial model compare to the original model?
1) Add another Regression node to the diagram and rename it Polynomial Regression.
[Screen capture: Polynomial Regression node added to the diagram.]
2) Make the indicated changes to the Polynomial Regression Properties panel.
[Properties panel:
   Main Effects = Yes
   Two-Factor Interactions = Yes
   Polynomial Terms = Yes
   Polynomial Degree = 2
   Regression Type = Logistic Regression
   Link Function = Logit
   Selection Model = Stepwise
   Selection Criterion = Validation Error
   Use Selection Defaults = Yes]
3) Go to line 1598 of the Output window.

   The selected model, based on the CHOOSE=VERROR criterion, is the model trained in Step 7.
   It consists of the following effects:

   Intercept IMP_DemAge IMP_DemGender IMP_LOG_DemAffl M_DemAge
   M_DemGender*M_LOG_DemAffl IMP_DemAge*IMP_DemAge IMP_LOG_DemAffl*IMP_LOG_DemAffl
4) The Polynomial Regression node adds additional interaction and polynomial terms.
5) Examine the Fit Statistics window.
   Fit Statistics (target: TargetBuy)

   Statistic   Label                       Train       Validation
   _AIC_       Akaike's Info. Criterion    9529.938
   _ASE_       Average Squared Error       0.136407    0.134038
   _AVERR_     Average Error Function      0.428003    0.421824
   _DFE_       Degrees of Freedom          11103
   _DFM_       Model Degrees of Freedom    9
   _DFT_       Total Degrees of Freedom    11112
   _DIV_       Divisor for ASE             22224       22222
   _ERR_       Error Function              9511.938    9373.784
   _FPE_       Final Prediction Error      0.136628
   _MAX_       Maximum Absolute Error      0.98571     0.986233
The additional terms reduce the validation ASE slightly.
Solutions to Student Activities (Polls/Quizzes)

4.01 Multiple Choice Poll - Correct Answer

What is the logistic regression prediction for the indicated point?
a. -0.243
b. 0.56
c. yellow
d. It depends... (correct answer)

   logit(p̂) = -0.81 + 0.92 x1 + 1.11 x2
   p̂ = 1 / (1 + e^(-logit(p̂)))
Chapter 5   Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools

5.1  Introduction                                              5-3
     Demonstration: Training a Neural Network                  5-12
5.2  Input Selection                                           5-20
     Demonstration: Selecting Neural Network Inputs            5-21
5.3  Stopped Training                                          5-24
     Demonstration: Increasing Network Flexibility             5-35
     Demonstration: Using the AutoNeural Tool (Self-Study)     5-39
5.4  Other Modeling Tools (Self-Study)                         5-46
     Exercises                                                 5-58
5.5  Chapter Summary                                           5-59
5.6  Solutions                                                 5-60
     Solutions to Exercises                                    5-60
5.1 Introduction

Model Essentials - Neural Networks
   Predict new cases.      Prediction formula
   Select useful inputs.   None
   Optimize complexity.    Stopped training
With its exotic-sounding name, a neural network model (formally, for the models discussed in this course, a multilayer perceptron) is often regarded as a mysterious and powerful predictive tool. Perhaps surprisingly, the most typical form of the model is, in fact, a natural extension of a regression model. The prediction formula used to predict new cases is similar to a regression's, but with an interesting and flexible addition. This addition enables a properly trained neural network to model virtually any association between input and target variables. Flexibility comes at a price, however, because the problem of input selection is not easily addressed by a neural network. The inability to select inputs is offset (somewhat) by a complexity optimization algorithm named stopped training. Stopped training can reduce the chances of overfitting, even in the presence of redundant and irrelevant inputs.
Model Essentials - Neural Networks
   Predict new cases.      Prediction formula

Like regressions, neural networks predict cases using a mathematical equation involving the values of the input variables.
Neural Network Prediction Formula

   ŷ = w00 + w01 H1 + w02 H2 + w03 H3      (prediction estimate; w00 is the bias estimate, w01-w03 are weight estimates)

   H1 = tanh(w10 + w11 x1 + w12 x2)
   H2 = tanh(w20 + w21 x1 + w22 x2)        (hidden units)
   H3 = tanh(w30 + w31 x1 + w32 x2)
A neural network can be thought of as a regression model on a set of derived inputs, called hidden units. In turn, the hidden units can be thought of as regressions on the original inputs. The hidden unit "regressions" include a default link function (in neural network language, an activation function), the hyperbolic tangent. The hyperbolic tangent is a shift and rescaling of the logistic function introduced in Chapter 3. Because of a neural network's biological roots, its components receive different names from corresponding components of a regression model. Instead of an intercept term, a neural network has a bias term. Instead of parameter estimates, a neural network has weight estimates. What makes neural networks interesting is their ability to approximate virtually any continuous association between the inputs and the target. You simply specify the correct number of hidden units and find reasonable values for the weights. Specifying the correct number of hidden units involves some trial and error. Finding reasonable values for the weights is done by least squares estimation (for interval-valued targets).
Neural Network Binary Prediction Formula

   log(p̂ / (1 - p̂)) = w00 + w01 H1 + w02 H2 + w03 H3      (logit link function)

   H1 = tanh(w10 + w11 x1 + w12 x2)
   H2 = tanh(w20 + w21 x1 + w22 x2)
   H3 = tanh(w30 + w31 x1 + w32 x2)
When the target variable is binary, as in the demonstration data, the main neural network regression equation receives the same logit link function featured in logistic regression. As with logistic regression, the weight estimation process changes from least squares to maximum likelihood.
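The two formulas above are compact enough to evaluate directly. Below is a minimal sketch that scores one case with three tanh hidden units and a logit link; the weight values are made up for illustration (in practice they come from training), so treat the numbers as placeholders.

```python
import numpy as np

# Made-up weights for illustration only; real values come from training.
W_hidden = np.array([[1.5, -0.03, -0.07],   # rows: [bias, w_x1, w_x2]
                     [0.79, -0.17, -0.15],
                     [0.57,  0.05,  0.35]])
w_out = np.array([-0.20, 0.9, -1.1, 0.6])   # [bias, w_H1, w_H2, w_H3]

def predict(x1, x2):
    """Neural network probability estimate for a binary target."""
    H = np.tanh(W_hidden @ np.array([1.0, x1, x2]))  # three hidden units
    logit = w_out[0] + w_out[1:] @ H                 # logit link equation
    return 1.0 / (1.0 + np.exp(-logit))              # logistic back-transform

print(round(predict(0.4, 0.6), 3))
```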
Neural Network Diagram

   log(p̂ / (1 - p̂)) = w00 + w01 H1 + w02 H2 + w03 H3
   H1 = tanh(w10 + w11 x1 + w12 x2)
   H2 = tanh(w20 + w21 x1 + w22 x2)
   H3 = tanh(w30 + w31 x1 + w32 x2)

[Diagram: blocks for the inputs (input layer) connect to three hidden units (hidden layer), which connect to the target (target layer).]
Multilayer perceptron models were originally inspired by neurophysiology and the interconnections between neurons, and they are often represented by a network diagram instead of an equation. The basic model form arranges neurons in layers. The first layer, called the input layer, connects to a layer of neurons called a hidden layer, which, in turn, connects to a final layer called the target or output layer. Each element in the diagram has a counterpart in the network equation. The blocks in the diagram correspond to inputs, hidden units, and target variables. The block interconnections correspond to the network equation weights.
Prediction Illustration - Neural Networks

   logit equation:
   logit(p̂) = w00 + w01 H1 + w02 H2 + w03 H3
   H1 = tanh(w10 + w11 x1 + w12 x2)
   H2 = tanh(w20 + w21 x1 + w22 x2)
   H3 = tanh(w30 + w31 x1 + w32 x2)

[Plot: the two-color training data on the unit square defined by inputs x1 and x2.]
To demonstrate the properties of a neural network model, consider again the two-color prediction problem. As always, the goal is to predict the target color based on the location in the unit square. As with regressions, the predictions can be decisions, rankings, or estimates. The logit equation produces a ranking or logit score.
Prediction Illustration - Neural Networks

[Plot: the same logit equation and training data, annotated "Need weight estimates."]
As with a regression model, the primary task is to obtain parameter or, in the neural network case, weight estimates that result in accurate predictions. A major difference between a neural network and a regression model, however, is the number of values to be estimated and the complicated relationship between the weights.
Prediction Illustration - Neural Networks

[Plot: weight estimates are found by maximizing the log-likelihood function
   Σ log(p̂i)  over primary outcomes  +  Σ log(1 - p̂i)  over secondary outcomes.]
For a binary target, the weight estimation process is driven by an attempt to maximize the log-likelihood function. Unfortunately, in the case of a neural network model, the maximization process is complicated by local optima as well as a tendency to create overfit models. A technique illustrated in the next section (usually) overcomes these difficulties and produces a reasonable model.
Even a relatively simple neural network with three hidden units permits elaborate associations between the inputs and the target. While the model might be simple, explanation of the model is decidedly not. This lack of explicability is frequently cited as a major disadvantage of a neural network. Of course, complex input/target associations are difficult to explain no matter what technique is used to model them. Neural networks should not be faulted, assuming that they correctly modeled this association. After the prediction formula is established, obtaining a prediction is strictly a matter of plugging the input measurements into the hidden unit expressions. In the same way as with regression models, you can obtain a prediction estimate using the logistic function.
Neural Nets: Beyond the Prediction Formula
   Manage missing values.
   Handle extreme or unusual values.
   Use non-numeric inputs.
   Account for nonlinearities.
Neural networks share some of the same prediction complications with their regression kin. The most prevalent problem is missing values. Like regressions, neural networks require a complete record for estimation and scoring. Neural networks resolve this complication in the same way that regression does, by imputation. Extreme or unusual values also present a problem for neural networks. The problem is mitigated somewhat by the hyperbolic tangent activation functions in the hidden units. These functions squash extreme input values between -1 and +1. Nonnumeric inputs pose less of a complication to a properly tuned neural network than they do to regressions. This is mainly due to the complexity optimization process described in the next section. Unlike standard regression models, neural networks easily accommodate nonlinear and nonadditive associations between inputs and target. In fact, the main challenge is over-accommodation, that is, falsely discovering nonlinearities and interactions. Interpreting a neural network model can be difficult, and while it is an issue, it is one with no easy solution.
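The squashing effect is easy to see numerically; this short sketch assumes nothing beyond NumPy.

```python
import numpy as np

# tanh maps any real value into (-1, 1), so an extreme input value can
# shift a hidden unit only so far.
print(np.tanh([-100.0, -2.0, 0.0, 2.0, 100.0]))
# approximately [-1, -0.96, 0, 0.96, 1]
```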
Training a Neural Network

Several tools in SAS Enterprise Miner include the term neural in their name. The Neural Network tool is the most useful of these. (The AutoNeural and DMNeural tools are described later in the chapter.)
1. Select the Model tab.
2. Drag a Neural Network tool into the diagram workspace.
3. Connect the Impute node to the Neural Network node.
[Screen capture: Predictive Analysis diagram with the Neural Network node connected after the Impute node.]
With the diagram configured as shown, the Neural Network node takes advantage of the transformations, replacements, and imputations prepared for the Regression node. The neural network has a default option for so-called "preliminary training."
4. Select Optimization ⇒ [...] from the Neural Network Properties panel.
5. Select Enable ⇒ No under the Preliminary Training options.
[Optimization window: under Preliminary Training, set Enable = No.]
6. Run the Neural Network node and view the results. The Results - Node: Neural Network Diagram window opens.
[Results window: Score Rankings Overlay, Iteration Plot, Fit Statistics, and Output panels.]
7. Maximize the Fit Statistics window.
   Fit Statistics (target: TargetB)

   Statistic   Label                         Train       Validation
   _DFT_       Total Degrees of Freedom      4843
   _DFE_       Degrees of Freedom            4590
   _DFM_       Model Degrees of Freedom      253
   _NW_        Number of Weights             253
   _AIC_       Akaike's Info. Criterion      7014.564
   _SBC_       Schwarz's Bayesian Crit.      8655.342
   _ASE_       Average Squared Error         0.239625    0.242925
   _MAX_       Maximum Absolute Error        0.760482    0.774254
   _DIV_       Divisor for ASE               9686        9686
   _NOBS_      Sum of Frequencies            4843        4843
   _RASE_      Root Average Squared Error    0.489515    0.492874
   _SSE_       Sum of Squared Errors         2321.008    2352.972
   _SUMW_      Sum of Case Weights           9686        9686
   _FPE_       Final Prediction Error        0.266041
   _MSE_       Mean Squared Error            0.252833    0.242925
   _RFPE_      Root Final Prediction Error   0.515792
   _RMSE_      Root Mean Squared Error       0.502825    0.492874
   _AVERR_     Average Error Function        0.671956    0.678893
   _ERR_       Error Function                6508.564    6575.759
   _MISC_      Misclassification Rate        0.418542    0.430105
   _WRONG_     Number Wrongly Classified     2027        2083
The average squared error and misclassification are similar to the values observed from regression models in the previous chapter. Notice that the model contains 253 weights. This is a large model.
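The 253 weights follow directly from the architecture: each hidden unit carries one weight per input plus a bias, and the output unit carries one weight per hidden unit plus a bias. A quick check (assuming the default of three hidden units) recovers the implied number of input terms.

```python
# weights = h*(k + 1) + (h + 1) for k input terms and h hidden units
h, total_weights = 3, 253
k = (total_weights - (h + 1)) // h - 1
print(k)   # 82 input terms (including dummy-coded categorical levels)
```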
8. Go to line 54 of the Output window. There you can find a table of the initial values for the neural network weights.

   The NEURAL Procedure
   Optimization Start Parameter Estimates
     N   Parameter                     Estimate     Gradient Objective Function
     1   DemMedHomeValue_H11          -0.004746     0
     2   DemPctVeterans_H11           -0.011042     0
     3   GiftTimeFirst_H11            -0.026889     0
     4   GiftTimeLast_H11             -0.024545     0
     5   IMP_DemAge_H11                0.008120     0
     6   IMP_LOG_GiftAvgCard36_H11     0.055146     0
     7   IMP_REP_DemMedIncome_H11     -0.167987     0
     8   LOG_GiftAvg36_H11             0.087440     0
     9   LOG_GiftAvgAll_H11            0.063190     0
   ...
   250   H11_TargetB1                  0            -0.004814
   251   H12_TargetB1                  0            -0.000030480
   252   H13_TargetB1                  0             0.001641
   253   BIAS_TargetB1                -0.000413      1.178583E-15
Despite the huge number of weights, the model shows no signs of overfitting.
9. Go to line 395. You can find a summary of the model optimization (maximizing the likelihood estimates of the model weights).

   Optimization Results
   Iterations                           50   Function Calls                 136
   Gradient Calls                       62   Active Constraints               0
   Objective Function         0.6368394579   Max Abs Gradient Element   0.0076065979
   Slope of Search Direction  -0.000752598

   QUANEW needs more than 50 iterations or 2147483647 function calls.
   WARNING: QUANEW Optimization cannot be completed.
Notice the warning message. It can be interpreted to mean that the model-fitting process did not converge.
10. Close the Results - Neural Network window.
11. Reopen the Optimization window and examine the optimization options in the Properties panel for the Neural Network node.
[Optimization window: Training Technique = Default, Maximum Iterations = 50, Maximum Time = 4 Hours, plus nonlinear optimization options. Valid training techniques include Trust-Region, Levenberg-Marquardt, Quasi-Newton, Conjugate Gradient, DBLDOG, BackProp, RPROP, and QPROP.]
The maximum number of iterations, by default, is 50. Apparently, this is not enough for the network training process to converge.
12. Type 100 for the Maximum Iterations property.
13. Run the Neural Network node and examine the results.
14. Maximize the Output window. You are again warned that the optimization process still failed to converge (even after 100 iterations).

   QUANEW needs more than 100 iterations or 2147483647 function calls.
   WARNING: QUANEW Optimization cannot be completed.
15. Maximize the Fit Statistics window.
[Fit Statistics window: the statistics are identical to those from the 50-iteration run.]
Curiously, increasing the maximum number of iterations changes none of the fit statistics. How can this be? The answer is found in the Iteration Plot window.
16. Examine the Iteration Plot window.
[Iteration Plot: training and validation average squared error versus training iteration.]
The iteration plot shows the average squared error versus optimization iteration. A massive divergence in training and validation average squared error occurs near iteration 14, indicated by the vertical blue line. The rapid divergence of the training and validation fit statistics is cause for concern. This primarily results from the huge number of weights in the fitted neural network model. The huge number of weights comes from the use of all inputs in the model. Reducing the number of modeling inputs reduces the number of modeling weights and possibly improves model performance.
17. Close the Results window.
5.2 Input Selection

Model Essentials - Neural Networks
   Select useful inputs.   None
The Neural Network tool in SAS Enterprise Miner lacks a built-in method for selecting useful inputs. While sequential selection procedures such as stepwise are known for neural networks, the computational costs of their implementation tax even fast computers. Therefore, these procedures are not part of SAS Enterprise Miner. You can solve this problem by using an external process to select useful inputs. In this demonstration, you use the variables selected by the standard regression model. (In Chapter 8, several other approaches are shown.)
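The idea of borrowing a simpler model's input list can be sketched with scikit-learn stand-ins. This illustrates the workflow only, not the SAS implementation; the data set, feature counts, and settings below are invented.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=4,
                           random_state=3)

# Stand-in for the Regression node's stepwise selection: a sequential
# selector wrapped around a logistic regression.
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=4).fit(X, y)
X_selected = selector.transform(X)

# The network then trains on the regression-selected inputs only.
net = MLPClassifier(hidden_layer_sizes=(3,), max_iter=1000,
                    random_state=3).fit(X_selected, y)
print("selected input columns:", selector.get_support(indices=True))
```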
Selecting Neural Network Inputs

This demonstration shows how to use a logistic regression to select inputs for a neural network.

Note: Additional dimension reduction techniques are discussed in Chapter 8.

1. Delete the connection between the Impute node and the Neural Network node.
2. Connect the Regression node to the Neural Network node.
[Screen capture: Predictive Analysis diagram with the Regression node connected to the Neural Network node.]
3. Right-click the Neural Network node and select Update from the shortcut menu.
4.
Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools
Open the Variables dialog box for the Neural Network node.
(none) Name
^;
f j not
Use
DemCluster Default DernGencfer Default DemHomeOwDefault _ DemMedHormDefault D em Med In co "Default DemPctveteraDefault GiftTimeFirst Default GiftTimeLast Default IMP_DemAge Default IMP LOG^GirVDefautt IMP_REP_DerDefault LOG GiflAvg3fDefault LOG GirtAvgAIDefautt LOG GiftAvgLDefautt LOG GtrlCnt3 Default LOG_GiftCntAI Default LOG_GiflCntC Default LOG_GiftCntC Default M DemAge Default M_LOG_GiftAvDefault MJREPJDernt'Defautt PromCnt12 Default PromCnt36 Default PromCntAII Default P ro mCntG a rd Default P ro mCntC a rd Default *
Apply
Equal to
Report No No No No No Nn No No No No No No ^0 No No No No Ino No
No No No No No No No
Role Rejected Rejected Rejected Input Rejected Rejected Rejected Input Rejected Rejected Rejected Rejected Input Rejected Input Rejected Rejected Rejected Rejected __ Rejected Rejected Rejected Rejected Rejected Rejected Rejected
Level Nominal Nominal Binary Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Interval Binary Binary Binary Interval Interval Interval Interval Interval
Type Character Character Character Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric Numeric
I Explore...
Order
Label
Close the Variables dialog box.
Format
Demographic Gender Home Owner Median HomeDOLLAR11 Median IncomDOLLAR11 Percent Veten Times Since F. Time Since Ls Imputed: Age Imputed: Tran! Imputed: Repl Transformed: Transformed: Transformed: Transformed: Transformed: Transformed: Transformed: Imputation Inc Imputation Ind Imputation Ind r Promotion Co Promotion Co Promotion Co Promotion Co! Promotion Co I â&#x20AC;˘ Update Path
OK
Only the inputs selected by the Regression node's stepwise procedure are not rejected. 5.
Reset
Cancel
6. Run the Neural Network node and view the results. The Fit Statistics window shows an improvement in model fit using only 19 weights.

   Statistic   Label                         Train       Validation
   _DFT_       Total Degrees of Freedom      4843
   _DFE_       Degrees of Freedom            4824
   _DFM_       Model Degrees of Freedom      19
   _NW_        Number of Weights             19
   _AIC_       Akaike's Info. Criterion      6573.409
   _SBC_       Schwarz's Bayesian Crit.      6696.629
   _ASE_       Average Squared Error         0.240983    0.240441
   _MAX_       Maximum Absolute Error        0.777       0.78162
   _DIV_       Divisor for ASE               9686        9686
   _NOBS_      Sum of Frequencies            4843        4843
   _RASE_      Root Average Squared Error    0.490348    0.4909
   _SSE_       Sum of Squared Errors         2334.159    2328.913
   _SUMW_      Sum of Case Weights           9686        9686
   _FPE_       Final Prediction Error        0.242881
   _MSE_       Mean Squared Error            0.241932    0.240441
   _RFPE_      Root Final Prediction Error   0.49283
   _RMSE_      Root Mean Squared Error       0.491866    0.490348
   _AVERR_     Average Error Function        0.674727    0.673563
   _ERR_       Error Function                6524.13     6535.409
   _MISC_      Misclassification Rate        0.427421    0.421639
   _WRONG_     Number Wrongly Classified     2070        2042

The validation and training average squared errors are nearly identical.
5.3 Stopped Training

Model Essentials - Neural Networks
   Optimize complexity.   Stopped training
As was seen in the first demonstration, complexity optimization is an integral part of neural network modeling. Other modeling methods selected an optimal model from a sequence of possible models. In the demonstration, only one model was estimated (a neural network with three hidden units), so what was being compared? SAS Enterprise Miner treats each iteration in the optimization process as a separate model. The iteration with the smallest value of the selected fit statistic is chosen as the final model. This method of model optimization is called stopped training.
Fit Statistic versus Optimization Iteration

   logit(p̂) = w00 + 0·H1 + 0·H2 + 0·H3      (initial hidden unit weights set to zero)
Fit Statistic versus Optimization Iteration

   logit(p̂) = w00 + 0·H1 + 0·H2 + 0·H3
   H1 = tanh(1.5 - 0.03 x1 - 0.07 x2)
   H2 = tanh(0.79 - 0.17 x1 - 0.15 x2)      (random initial input weights and biases)
   H3 = tanh(0.57 + 0.05 x1 + 0.35 x2)
To begin model optimization, model weights are given initial values. The weights multiplying the hidden units in the logit equation are set to zero, and the bias in the logit equation is set equal to logit(π1), where π1 equals the primary outcome proportion. The remaining weights (corresponding to the hidden units) are given random initial values (near zero). This "model" assigns each case the prediction estimate p̂i = π1. An initial fit statistic is calculated on the training and validation data. For a binary target, this is proportional to the log-likelihood function:

   Σ log(p̂i(w))          +   Σ log(1 - p̂i(w))
   (primary outcomes)        (secondary outcomes)

where p̂i is the predicted target value and w is the current estimate of the model parameters.

Training proceeds by updating the parameter estimates in a manner that decreases the value of the objective function.
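A small numerical sketch of this starting point follows. The binary target is simulated; the bias formula and the initial prediction p̂i = π1 are exactly those just described, and the log-likelihood is evaluated before any weight updates.

```python
import numpy as np

rng = np.random.default_rng(1)
y = (rng.random(1000) < 0.3).astype(float)   # simulated binary target

pi1 = y.mean()                               # primary outcome proportion
bias = np.log(pi1 / (1 - pi1))               # logit(pi1): the initial bias
p_hat = np.full_like(y, pi1)                 # initial prediction for all cases

# Log-likelihood: sum log(p) over primary outcomes, log(1-p) over secondary
loglik = np.log(p_hat[y == 1]).sum() + np.log(1 - p_hat[y == 0]).sum()
print(f"initial bias {bias:.3f}, initial log-likelihood {loglik:.1f}")
```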
Fit Statistic versus Optimization Iteration
[Plot: training and validation ASE at iteration 0.]
As stated above, in the initial step of the training procedure, the neural network model is set up to predict the overall average response rate for all cases.
Fit Statistic versus Optimization Iteration
[Plot: iteration 1.]
One step substantially decreases the value of the average squared error (ASE). Amazingly, the model that corresponds to this one-iteration neural network looks very much like the standard regression model, as seen from the fitted isoclines.
Fit Statistic versus Optimization Iteration
[Plot: iteration 2.]
The second iteration step goes slightly astray. The model actually becomes slightly worse on the training and validation data.
Fit Statistic versus Optimization Iteration
[Plot: iteration 3.]
Things are back on track in the third iteration step. The fitted model is already exhibiting nonlinear and nonadditive predictions. Half of the improvement in ASE is realized by the third step.
Fit Statistic versus Optimization Iteration
[Plots: iterations 4 and 5; training and validation ASE continue to decrease.]
Fit Statistic versus Optimization Iteration
[Plots: iterations 6 and 7.]
Fit Statistic versus Optimization Iteration
[Plots: iterations 8 and 9.]
Most of the improvement in validation ASE occurred by the ninth step. (Training ASE continues to improve until convergence in Step 23.) The predictions are close to their final form.
Fit Statistic versus Optimization Iteration
[Plots: iterations 10 and 11.]
Fit Statistic versus Optimization Iteration
[Plot: iteration 12, the minimum validation ASE.]
Step 12 brings the minimum value for validation ASE. While this model will ultimately be chosen as the final model, SAS Enterprise Miner continues to train until the likelihood objective function changes by a negligible amount on the training data.
Fit Statistic versus Optimization Iteration
[Plot: iteration 23, training complete.]

In Step 23, training is declared complete due to lack of change in the objective function from Step 22. Notice that between Step 13 and Step 23, ASE actually increased for the validation data. This is a sign of overfitting.
The Neural Network tool selects the modeling weights from iteration 13 for the final model. In this iteration, the validation ASE is minimized. You can also configure the Neural Network tool to select the iteration with minimum misclassification or maximum profit for final weight estimates.

Note: The name stopped training comes from the fact that the final model is selected as if training were stopped on the optimal iteration. Detecting when this optimal iteration occurs (while actually training) is somewhat problematic. To avoid stopping too early, the Neural Network tool continues to train until convergence on the training data or until reaching the maximum iteration count, whichever comes first.
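The selection rule itself is simple once the per-iteration fit statistics are recorded. A minimal sketch follows; the validation ASE values are invented stand-ins shaped like the iteration plots above.

```python
import numpy as np

# Invented validation ASE by training iteration, shaped like the plots above:
valid_ase = np.array([0.250, 0.235, 0.236, 0.228, 0.224, 0.221, 0.219,
                      0.218, 0.217, 0.216, 0.216, 0.215, 0.214, 0.215,
                      0.216, 0.217, 0.218, 0.219, 0.220, 0.221, 0.222,
                      0.223, 0.224, 0.224])

best = int(np.argmin(valid_ase))   # iteration whose weights are kept
print(f"training ran {len(valid_ase) - 1} iterations; "
      f"weights taken from iteration {best}")   # here, iteration 12
```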
Increasing Network Flexibility

Stopped training helps to ensure that a neural network does not overfit (even when the number of network weights is large). Further improvement in neural network performance can be realized by increasing the number of hidden units from the default of three. There are two ways to explore alternative network sizes:
• manually, by changing the number of weights by hand
• automatically, by using the AutoNeural tool
Changing the number of hidden units manually involves trial-and-error guessing of the "best" number of hidden units. Several hidden unit counts were tried in advance. One of the better selections is demonstrated.
1. Select Network ⇒ [...] from the Neural Network Properties panel.
The Network window opens.
[Network window: Architecture = MLP, Direct Connection = No, Number of Hidden Units = 6, Randomization Distribution = Normal, Input Standardization = Standard Deviation; the remaining properties keep their defaults.]
2. Type 6 as the Number of Hidden Units value.
3. Select OK.
4. Run the Neural Network node and view the results.
The Fit Statistics window shows good model performance on both the average squared error and misclassification scales.

   Statistic   Label                         Train       Validation
   _DFT_       Total Degrees of Freedom      4843
   _DFE_       Degrees of Freedom            4806
   _DFM_       Model Degrees of Freedom      37
   _NW_        Number of Weights             37
   _AIC_       Akaike's Info. Criterion      6601.6
   _SBC_       Schwarz's Bayesian Crit.      6841.556
   _ASE_       Average Squared Error         0.240623    0.23988
   _MAX_       Maximum Absolute Error        0.857328    0.869935
   _DIV_       Divisor for ASE               9686        9686
   _NOBS_      Sum of Frequencies            4843        4843
   _RASE_      Root Average Squared Error    0.490533    0.489776
   _SSE_       Sum of Squared Errors         2330.673    2323.478
   _SUMW_      Sum of Case Weights           9686        9686
   _FPE_       Final Prediction Error        0.244328
   _MSE_       Mean Squared Error            0.242475    0.23988
   _RFPE_      Root Final Prediction Error   0.494295
   _RMSE_      Root Mean Squared Error       0.492418    0.489776
   _AVERR_     Average Error Function        0.673921    0.672391
   _ERR_       Error Function                6527.6      6512.78
   _MISC_      Misclassification Rate        0.427421    0.422878
   _WRONG_     Number Wrongly Classified     2070        2048
The iteration plot shows optimal validation average squared error occurring on iteration 5.
[Iteration Plot: training and validation average squared error versus training iteration; minimum validation ASE at iteration 5.]
Unfortunately, there is little else of interest to describe this model. (In Chapter 8, a method is shown to give some insight into describing the cases with high primary outcome probability.)
Using the AutoNeural Tool (Self-Study)

The AutoNeural tool offers an automatic way to explore alternative network architectures and hidden unit counts. This demonstration shows how to explore neural networks with increasing hidden unit counts.
1. Select the Model tab.
2. Drag the AutoNeural tool into the diagram workspace.
3. Connect the Regression node to the AutoNeural node as shown.
[Screen capture: Predictive Analysis diagram with the AutoNeural node connected to the Regression node.]
Six changes must be made to the AutoNeural node's default settings.
4. Select Train Action ⇒ Search. This configures the AutoNeural node to sequentially increase the network complexity.
5. Select Number of Hidden Units ⇒ 1. With this option, each iteration adds one hidden unit.
6. Select Tolerance ⇒ Low. This prevents preliminary training from occurring.
7. Select Direct ⇒ No. This deactivates direct connections between the inputs and the target.
8. Select Normal ⇒ No. This deactivates the normal distribution activation function.
9. Select Sine ⇒ No. This deactivates the sine activation function.
[Properties panel: Train Action = Search, Number of Hidden Units = 1, Tolerance = Low; under Activation Functions, only Tanh = Yes.]
With these settings, each iteration adds one hidden unit to the neural network. Only the hyperbolic tangent activation function is considered.
Note: After each iteration, the existing network weights are not reinitialized. With this restriction, the influence of additional hidden units decreases. Also, the neural network models that you obtain with the AutoNeural and Neural Network tools will be different, even if both networks have the same number of hidden units.
10. Run the AutoNeural node and view the results. The Results - Node: AutoNeural Diagram window opens.
[Results window: Score Rankings Overlay, Iteration Plot, Fit Statistics, and Output panels.]
5-42
Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools
11. Maximize the Fit Statistics window. 3 Fit Statistics Target TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB [TargetB TargetB TargetB TargetB TargetB TargetB TargetB TargetB
r / c ? [HI
Fit Statistics Statistics Train Validation Test Label _DFT_ Total Degre... 4843 _DFE_ Degrees of ... 4836 _DFM_ Model Degr... 7 _NW_ Number of... 7 _AIC_ Akaike's Inf... 6584.688 _SBC_ Schwarz's... 6630.085 _ASE_ Average Sq... 0.242693 0.2411 _MAX_ Maximum A... 0.703379 0,705861 _DIV_ DivisorforA.. 9686 9686 _NOBS_ Sum of Fre... 4843 4843 _RASE_ Root Avera... 0.492638 0.491019 _SSE_ Sum of Squ.. 2350.721 2335.291 _SUMW_ Sum of Cas.. 9686 9686 _FPE_ Final Predic. 0.243395 _MSE_ Mean Squa... 0.243044 0.2411 _RFPE_ Root Final... 0.493351 _RMSE_ Root Mean... 0.492995 0.491019 _AVERR_ Average Err... 0.67837 0.675163 _ERR_ Error Fundi... 6570.688 6539.625 _MISC_ Misclassific. 0.426389 0.41751 _WRONG_ Numberof... 2065 2022
The number o f weights implies that the selected model has one hidden unit. The average squared error and misclassification rates are quite low.
5.3 Stopped Training
5-43
12. Maximize the Iteration Plot window. Iteration Plot |
|
|
^
^
^
!
^
^
K
^
K
^
B
f
i f Uf: M
Average Squared 'Error 0.2500-
0.2475-
0.2450-
0.2425-
   Training
   _ITER_    _AIC_      _AVERR_    _MISC_     _VAVERR_   _VMISC_
      0     6727.82     0.69315    0.49990    0.69315    0.50010
      1     6725.17     0.69287    0.48462    0.69311    0.48193
      2     6587.79     0.67869    0.42866    0.67713    0.42350
      3     6584.69     0.67837    0.42639    0.67516    0.41751
      4     6584.10     0.67831    0.42804    0.67638    0.43031
      5     6575.69     0.67744    0.42660    0.67472    0.42061
      6     6572.57     0.67712    0.42763    0.67455    0.42783
      7     6571.21     0.67698    0.42866    0.67427    0.42205
      8     6570.69     0.67692    0.42845    0.67420    0.42061
      8     6570.69     0.67692    0.42845    0.67420    0.42061
0.69315 0.69311 0.67713 0.67516 0.67638 0.67472 0.67455 0.67427 0.67420 0.67420
0.50010 0.48193 0.42350 0.41751 0.43031 0.42061 0.42783 0.42205 0.42061 0.42061
These lines show various fit statistics versus training iteration using a single hidden unit network. Training stops at iteration 8 (based on an AutoNeural property setting). Validation misclassification is used to select the best iteration, in this case, Step 3. Weights from this iteration are selected for use in the next step.
5-44
Chapter 5 Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools
15. View output lines 73-99. Search # 2 SINGLE LAYER t r i a l # 1 : TANH _AIC_
_AVERR_
_MISC_
_VAVERR_
_VMISC_
6596.69 6587.08 6581.99 6580.65 6579.01 6578.27 6577.64 6577.57 6575.88 6575.88
0.67837 0.67738 0.67685 0.67671 0.67654 0.67647 0.67640 0.67640 0.67622 0.67622
0.42639 0.42866 0.42928 0.42887 0.42763 0.43011 0.42474 0.42680 0.42845 0.42845
0.67516 0.67472 0.67405 0.67393 0.67392 0.67450 0.67426 0.67458 0.67411 0.67411
0.41751 0.42639 0.42123 0.42391 0.42267 0.42804 0.42453 0.42825 0.42783 0.42783
_ITER_ 0 1 2 3 4 5 6 7 8 8
Selected I t e r a t i o n based on
VMISC_
_AIC_
_AVERR_
_MISC_
_VAVERR_
_VMISC_
6596.69
0.67837
0.42639
0.67516
0.41751
_ITER_ 0
Training
A second hidden unit is added to the neural network model. A l l weights related to this new hidden unit are set to zero. A l l remaining weights are set to the values obtained i n iteration 3 above. In this way, the two-hidden-unit neural network (Step 0) and the one-hidden-unit neural network (Step 3) have equal fit statistics. Training o f the two-hidden-unit network commences. The training process trains for eight iterations. Iteration 0 has the smallest validation misclassification and is selected to provide the weight values for the next AutoNeural step. 16. Go to line 106. F i n a l Training Training _ITER_ 0 1 2 3 4 5 5
_AIC_
_AVERR_
_MISC_
_VAVERR_
_VMISC_
6584.69 6584.10 6573.21 6571.19 6570.98 6570.56 6570.56
0.67837 0.67831 0.67718 0.67698 0.67695 0.67691 0.67691
0.42639 0.42804 0.42783 0.42990 0.42887 0.43052 0.43052
0.67516 0.67638 0.67462 0.67437 0.67431 0.67418 0.67418
0.41751 0.43031 0.42350 0.42247 0.42308 0.42081 0.42081
Selected I t e r a t i o n based on _VMISC_ _ITER_ 0
_AIC_
_AVERR_
_MISC_
_VAVERR_
_VMISC_
6584.69
0.67837
0.42639
0.67516
0.41751
The final model training commences. Again iteration zero offers the best validation misclassification.
5.3 Stopped Training
The next block o f output summarizes the training process. Fit statistics from the iteration with the smallest validation misclassification are shown for each step. F i n a l Training History _step_
_func_ _ s t a t u s _ _ i t e r _ _AVERR_
SINGLE LAYER 1 SINGLE LAYER 1 SINGLE LAYER 2
TANH TANH TANH
initial keep reject Final
0 3 0 0
0.69315 0.67837 0.67837 0.67837
_MISC_
_AIC_
0.49990 0.42639 0.42639 0.42639
6727.82 6584.69 6596.69 6584.69
_VAVERR_ _VMISC_ 0.69315 0.67516 0.67516 0.67516
0.50010 0.41751 0.41751 0.41751
F i n a l Model Stopping: Termination c r i t e r i a was s a t i s f i e d : o v e r f i t t i n g based on _VMISC _func_
_AVERR_
_VAVERR_
TANH
0.67837
0.67516
neurons 1 1
The Final Model shows the hidden units added at each step and the corresponding value of the objective function (related to the likelihood).
5.4 Other Modeling Tools (Self-Study)

Model Essentials - Rule Induction
   Predict new cases.      Prediction rules / prediction formula
   Select useful inputs.   Split search / none
   Optimize complexity.    Ripping / stopped training
There are several additional modeling tools in SAS Enterprise Miner. This section describes the intended purpose of each (that is, when you should consider using them), how they perform the modeling essentials, and how they predict the simple example data. The Rule Induction tool combines decision tree and neural network models to predict nominal targets. It is intended to be used when one of the nominal target levels is rare. New cases are predicted using a combination of prediction rules (from decision trees) and a prediction formula (from a neural network, by default). Input selection and complexity optimization are described below.
Rule Induction Predictions
• Rips create prediction rules.
• A binary model sequentially classifies and removes correctly classified cases.
• A neural network predicts remaining cases.
The Rule Induction algorithm has three steps:
• Using a decision tree, the first step attempts to locate "pure" concentrations of cases. These are regions of the input space containing only a single value of the target. The rules identifying these pure concentrations are recorded in the scoring code, and the cases in these regions are removed from the training data. The simple example data does not contain any "pure" regions, so the step is skipped.
• The second step attempts to filter easy-to-classify cases. This is done with a sequence of binary target decision trees. The first tree in the sequence attempts to distinguish the most common target level from the others. Cases found in leaves correctly classifying the most common target level are removed from the training data. Using this revised training data, a second tree is built to distinguish the second most common target class from the others. Again, cases in any leaf correctly classifying the second most common target level are removed from the training data. This process continues through the remaining levels of the target.
• In the third step, a neural network is used to predict the remaining cases. All inputs selected for use in the Rule Induction node will be used in the neural network model. Model complexity is controlled by stopped training using misclassification as the fit statistic.
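A rough, simplified sketch of this idea follows, using scikit-learn stand-ins rather than the SAS implementation: a shallow tree filters easy-to-classify cases, and a neural network predicts the remainder. The data and settings are invented, and the per-case "easy" mask is a simplification of the leaf-by-leaf rule described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=4, random_state=1)

# Step 2 stand-in: a shallow tree; cases it classifies correctly as the
# most common target level are treated as "easy" and removed.
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)
most_common = np.bincount(y).argmax()
pred = tree.predict(X)
easy = (pred == y) & (pred == most_common)
X_rest, y_rest = X[~easy], y[~easy]

# Step 3: a neural network models the remaining, harder cases.
net = MLPClassifier(hidden_layer_sizes=(3,), max_iter=1000,
                    random_state=1).fit(X_rest, y_rest)
print(f"cases passed to the network: {len(y_rest)} of {len(y)}")
```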
Model Essentials - Dmine Regression
   Predict new cases.      Prediction formula
   Select useful inputs.   Forward selection
   Optimize complexity.    Stop R-square
Dmine regression is designed to provide a regression model with more flexibility than a standard regression model. It should be noted that with increased flexibility comes an increased potential for overfitting. A regression-like prediction formula is used to score new cases. Forward selection picks the inputs. Model complexity is controlled by a minimum R-square statistic.
Dmine Regression Predictions
• Interval inputs binned, categorical inputs grouped
• Forward selection picks from binned and original inputs
The main distinguishing feature of Dmine regression versus traditional regression is its grouping of categorical inputs and binning of continuous inputs.
• The levels of each categorical input are systematically grouped together using an algorithm that is reminiscent of a decision tree. Both the original and grouped inputs are made available for subsequent input selection.
• All interval inputs are broken into a maximum of 16 bins in order to accommodate nonlinear associations between the inputs and the target. The levels of the maximally binned interval inputs are grouped using the same algorithm for grouping categorical inputs. These binned-and-grouped inputs and the original interval inputs are made available for input selection.
A forward selection algorithm selects from the original, binned, and grouped inputs. Only inputs with an R square of 0.005 or above are eligible for inclusion in the forward selection process. Forward selection on eligible inputs stops when no input improves the R square of the model by the default value 0.0005.
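A small sketch of the binning idea follows (the details differ from the Dmine implementation, and the data are simulated): breaking an interval input into 16 bins lets the bin means capture a nonlinear input/target association.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 5000)
y = np.sin(x) + rng.normal(0, 0.3, 5000)         # nonlinear association

edges = np.linspace(0, 10, 17)                   # 16 equal-width bins
bins = np.digitize(x, edges[1:-1])               # bin index 0..15 per case
bin_means = np.array([y[bins == b].mean() for b in range(16)])

pred = bin_means[bins]                           # binned input as predictor
r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(f"R square of the binned input: {r2:.3f}") # captures the sine shape
```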
Model Essentials - DMNeural
   Predict new cases.      Multi-stage prediction formula
   Select useful inputs.   Principal components
   Optimize complexity.    Number of stages
The DMNeural tool is designed to provide a flexible target prediction using an algorithm with some similarities to a neural network. A multi-stage prediction formula scores new cases. The problem of selecting useful inputs is circumvented by a principal components method. Model complexity is controlled by choosing the number of stages in the multi-stage prediction formula.
DMNeural Predictions
• Up to three PCs with the highest target R square are selected.
• One of eight continuous transformations is selected and applied to the selected PCs.
• The process is repeated three times with residuals from each stage.
The algorithm starts by transforming the original inputs into principal components, which are orthogonal linear transformations of the original variables. The three principal components with the highest target correlation are selected for the next step. Next, one of eight possible continuous transformations (square, hyperbolic tangent, arctangent, logistic, Gaussian, sine, cosine, exponential) is applied to the three principal component inputs. (For efficiency, the transformation is actually applied to a gridded version of the principal component, where each grid point is weighted by the number of cases near the point.) The transformation with the optimal fit statistic (squared error, by default) is selected. The target is predicted by a regression model using the selected principal components and transformation. The process in the previous paragraph is repeated on the residual (the difference between the observed and predicted target value) up to the number of times specified in the maximum stage property (three, by default).
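A rough, greatly simplified sketch of one stage of this idea (not the SAS implementation; data, settings, and the choice of the tanh transformation are assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=4)

pcs = PCA().fit_transform(X)                     # principal component scores
corr = np.array([abs(np.corrcoef(pcs[:, j], y)[0, 1])
                 for j in range(pcs.shape[1])])
top3 = pcs[:, np.argsort(corr)[-3:]]             # three most target-correlated

# One of the eight candidate transformations (tanh) applied to the PCs,
# followed by a regression on the transformed values: a single "stage".
stage1 = LinearRegression().fit(np.tanh(top3), y)
print(f"stage-1 R square: {stage1.score(np.tanh(top3), y):.3f}")
```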
Model Essentials - Least Angle Regression
   Predict new cases.      Prediction formula
   Select useful inputs.   Generalized sequential selection
   Optimize complexity.    Penalized best fit
Forward selection, discussed in the previous chapter, provides good motivation for understanding how the Least Angle Regression (LARS) node works. In forward selection, the model-building process starts with an intercept. The best candidate input (the variable whose parameter estimate is most significantly different from zero) is added to create a one-variable model in the first step. The best candidate input variable from the remaining subset of input variables is added to create a two-variable model in the second step, and so on. Notice that an equivalent way to phrase what happens in Step 2 is that the candidate input variable most highly correlated with the residuals of the one-variable model is added to create a two-variable model.

The LARS algorithm generalizes forward selection in the following way:
1. Weight estimates are initially set to a value of zero.
2. The slope estimate in the one-variable model grows away from zero until some other candidate input has an equivalent correlation with the residuals of that model.
3. Growth of the slope estimate on the first variable stops, and growth on the slope estimate of the second variable begins.
4. This process continues until the least squares solutions for the weights are attained, or some stopping criterion is optimized.

This process can be constrained by putting a threshold on the aggregate magnitude of the parameter estimates. The LARS node provides an option to use the LASSO (Least Absolute Shrinkage and Selection Operator) method for variable subset selection. The LARS node optimizes the complexity of the model using a penalized best fit criterion. Schwarz's Bayesian Criterion (SBC) is the default.
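The scikit-learn library offers open-source implementations of both LARS and its LASSO variant, which can serve as a stand-in for experimenting with the algorithm outside of SAS Enterprise Miner. The sketch below uses simulated data; note that scikit-learn's defaults (including its model selection) differ from the LARS node's SBC-based defaults.

import numpy as np
from sklearn.linear_model import Lars, LassoLars

# Simulated data: only inputs 0 and 3 truly drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

lars = Lars().fit(X, y)                   # slope estimates grow stage by stage
print("LARS entry order:", lars.active_)  # input indices, in order of entry

# LASSO constrains the aggregate magnitude of the weights (larger alpha means
# a stronger constraint), zeroing out the least useful inputs entirely.
lasso = LassoLars(alpha=0.1).fit(X, y)
print("LASSO nonzero weights:", np.flatnonzero(lasso.coef_))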
Least Angle Regression Predictions
• Inputs are selected using a generalization of forward selection.
• An input combination in the sequence with optimal, penalized validation assessment is selected by default.
The Least Angle Regression node can be useful in a situation where there are many candidate input variables to choose from and the data set is not large. As mentioned above, the best subset model is chosen using SBC by default. SBC and its cousins (AIC, AICC, and others) optimize complexity by penalizing the fit of candidate input variables. That is, for an input to enter the model, it must improve the overall fit (here, diminish the residual sum of squares) of the model enough to overcome a penalty term built into the fit criterion. Other criteria, based on predicted (simulated) residual sums of squares or a validation data set, are available.
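As a concrete illustration, one common form of SBC for a least-squares model with k estimated parameters on n cases is SBC = n ln(SSE/n) + k ln(n), where smaller is better. The short Python computation below (with made-up numbers) shows how a small reduction in SSE can fail to pay the ln(n)-per-parameter penalty; the exact formula used by the node may differ in constants.

import numpy as np

def sbc(sse, n, k):
    # Schwarz's Bayesian Criterion for a least-squares fit; smaller is better.
    return n * np.log(sse / n) + k * np.log(n)

n = 1000
print(sbc(sse=500.0, n=n, k=3))  # about -672.4
print(sbc(sse=498.0, n=n, k=4))  # about -669.5: the tiny SSE gain does not
                                 # overcome the extra ln(n) penalty term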
Model Essentials - MBR
Predict new cases: Training data nearest neighbors
Select useful inputs: None
Optimize complexity: Number of neighbors
Memory Based Reasoning (MBR) is the implementation of k-nearest neighbor prediction in SAS Enterprise Miner. MBR predictions should be limited to decisions rather than rankings or estimates. Decisions are made based on the prevalence of each target level in the nearest cases (16 by default). A majority of primary outcome cases results in a primary decision, and a majority of secondary outcome cases results in a secondary decision. Nearest neighbor algorithms have no means to select inputs and can easily fall victim to the curse of dimensionality. Complexity of the model can be adjusted by changing the number of neighbors considered for prediction.
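Because MBR is k-nearest neighbor prediction, scikit-learn's k-nearest neighbor classifier with 16 neighbors gives a rough open-source stand-in for the decision rule described above (the PMBR procedure itself is specific to SAS). The data below are simulated.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy binary target on two interval inputs.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

mbr = KNeighborsClassifier(n_neighbors=16)   # the node's default neighbor count
mbr.fit(X_train, y_train)                    # scoring keeps the training cases

# Each decision is the majority target level among the 16 nearest training cases.
print(mbr.predict(rng.normal(size=(5, 2))))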
MBR Prediction Estimates
• Sixteen nearest training data cases predict the target for each point in the input space.
• Scoring requires training data and the PMBR procedure.
By default, the prediction decision made for each point in the input space is determined by the target values of the 16 nearest training data cases. Increasing the number of cases involved in the decision tends to smooth the prediction boundaries, which can be quite complex for small neighbor counts. Unlike other prediction tools, where scoring is performed by DATA step functions independent of the original training data, MBR scoring requires both the original training data set and a specialized SAS procedure, PMBR.
Model Essentials - Partial Least Squares
Predict new cases: Prediction formula
Select useful inputs: VIP
Optimize complexity: Sequential factor extraction
In partial least squares (PLS) regression modeling, model weights are chosen to simultaneously account for variability in both the target and the inputs. In general, model fit statistics such as average squared error will be worse for a PLS model than for a standard regression model using the same inputs. A distinction of the PLS algorithm, however, is its ability to produce meaningful predictions based on limited data (for example, when there are more inputs than modeling cases). A variable selection method named variable importance in the projection (VIP) arises naturally from the PLS framework. In PLS regression, latent components, consisting of linear combinations of original inputs, are sequentially added. The final model complexity is chosen by optimizing a model fit statistic on validation data.
Partial Least Squares Predictions
• Input combinations (factors) that optimally account for both predictor and response variation are successively selected.
• The factor count with a minimum validation PRESS statistic is selected.
• Inputs with small VIP are rejected for subsequent diagram nodes.
The PLS predictions look similar to those of a standard regression model. In the algorithm, input combinations, called factors, that are correlated with both the input and target distributions are selected. This choice is based on a projection given in the SAS/STAT PLS procedure documentation. Factors are sequentially added. The factor count with the lowest validation PRESS statistic (or, equivalently, average squared error) is selected. In the PLS tool, the variable importance in the projection (VIP) for each input is calculated. Inputs with VIP less than 0.8 are rejected for subsequent nodes in the process flow.
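The sketch below mimics this workflow with scikit-learn's PLSRegression on simulated data: factors are added sequentially, the factor count with the lowest holdout squared error is kept, and inputs with VIP below 0.8 are flagged for rejection. The VIP computation follows the standard formula; details of the SAS tool's implementation may differ.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls):
    # Variable importance in the projection for a fitted PLSRegression model.
    W, T, Q = pls.x_weights_, pls.x_scores_, pls.y_loadings_
    p = W.shape[0]
    # Target variation explained by each extracted factor.
    ssy = np.array([(T[:, k] @ T[:, k]) * Q[0, k] ** 2 for k in range(W.shape[1])])
    wnorm2 = (W ** 2) / np.sum(W ** 2, axis=0)
    return np.sqrt(p * (wnorm2 @ ssy) / ssy.sum())

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 8))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=150)
X_tr, X_va, y_tr, y_va = X[:100], X[100:], y[:100], y[100:]

# Add factors sequentially; keep the count with the lowest holdout squared
# error (a stand-in for the validation PRESS statistic).
errs = [np.mean((y_va - PLSRegression(n_components=k).fit(X_tr, y_tr)
                 .predict(X_va).ravel()) ** 2) for k in range(1, 6)]
best_k = int(np.argmin(errs)) + 1

pls = PLSRegression(n_components=best_k).fit(X, y)
vip = vip_scores(pls)
print("factors:", best_k, "rejected inputs:", np.flatnonzero(vip < 0.8))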
Exercises

1. Predictive Modeling Using Neural Networks

a. In preparation for a neural network model, is imputation of missing values needed? _____ Why or why not?

b. In preparation for a neural network model, is data transformation generally needed? _____ Why or why not?

c. Add a Neural Network tool to the Organics diagram. Connect the Impute node to the Neural Network node.

d. Set the model selection criterion to average squared error.

e. Run the Neural Network node and examine the validation average squared error. How does it compare to other models?
5.5 Chapter Summary

Neural networks are a powerful extension of standard regression models. They have the ability to model virtually any association between a set of inputs and a target. Being regression-like models, they share some of the same prediction complications faced by regression models (missing values, categorical inputs, extreme values). Their performance is reasonable even without input selection, but it can be marginally improved by selecting inputs with an external variable selection process. Most of their predictive success comes from stopped training, a technique used to prevent overfitting. Network complexity is marginally influenced by the number of hidden units selected in the model. This number can be selected manually with the Neural Network tool, or it can be selected automatically using the AutoNeural tool.

SAS Enterprise Miner includes several other flexible modeling tools. These specialized tools include Rule Induction, Dmine Regression, DMNeural, and MBR.
Neural Network Tool Review
Create a multi-layer perceptron on selected inputs. Control complexity with stopped training and hidden unit count.
5.6 Solutions

Solutions to Exercises

1. Predictive Modeling Using Neural Networks

a. In preparation for a neural network model, is imputation of missing values needed? Yes. Why or why not? Neural network models, as well as most models relying on a prediction formula, require a complete record for both modeling and scoring.

b. In preparation for a neural network model, is data transformation generally needed? Not necessarily. Why or why not? Neural network models create transformations of inputs for use in a regression-like model. However, having input distributions with low skewness and kurtosis tends to result in more stable models.

c. Add a Neural Network tool to the Organics diagram. Connect the Impute node to the Neural Network node.
d. Set the model selection criterion to average squared error. (In the Neural Network node's Train properties, Model Selection Criterion is set to Average Error.)
e. Run the Neural Network node and examine the validation average squared error.

1) The Results window appears, showing fit statistics for the training and validation data and the training output.
2) Examine the Fit Statistics window. The average squared error (_ASE_) is 0.133252 on the training data and 0.133209 on the validation data.
How does it compare to other models? The validation ASE for the neural network model is slightly smaller than that of the standard regression model, nearly the same as that of the polynomial regression, and slightly larger than the decision tree's ASE.