DATA MINING
Delinquent Loans

Author: Stephen Denham
Lecturer: Dr. Myra O’Regan
Prepared for: ST4003 Data Mining
Submitted: 24th February 2012
1. Background
   1.1 Dataset
2. Models
   2.1 Logistic Regression
   2.2 Principal Component Analysis
   2.3 Trees
   2.4 Random Forests
   2.5 Neural Networks
   2.6 Support Vector Machines
3. Further Analysis
   3.1 Ensemble Model
   3.2 Limitations and Further Work
   3.3 Variable Importance
4. Models Assessment
5. Conclusion
7. Appendix
1. BACKGROUND

This report evaluates predictive models applied to a dataset of over 21,000 rows of customer information across 12 variables. The aim is to model which loans are most likely to be deemed ‘delinquent’. Norman Ralph Augustine said ‘it’s easy to get a loan unless you need it’. By understanding customer trends, financial institutions can better allocate their resources. The combination of data-driven predictive models and human intuition can mean credit is given to those in the best position to repay. Techniques used in this report include logistic regression, trees, random forests, neural networks, support vector machines and the creation of an ensemble model. It is as much a problem of soft intuition as of mathematical skill.
1.1 Dataset

The original dataset contained 21,326 rows and 12 columns, each row representing a customer described by the following 12 variables:

1. nid (integer): A unique identifier that is of no use in our analysis, although such a variable may be necessary for other analysis tools.
2. dlq (0/1): Individuals who have been 90 days or more past their due date are termed ‘delinquent’, represented here by ‘1’. This is the target variable of the model.
3. revol (percentage): Total balance on credit cards and personal lines of credit (excluding real estate and instalment debt such as car loans) divided by the sum of credit limits. All revol values above 4 (40 cases) were deemed outliers and removed.
4. age (integer): Age of the borrower in years. This variable was not changed.
5. times1 (integer): Number of times the borrower has been 30-59 days past due, but no worse, in the last 2 years. All times1 values above 14 (159 cases) were deemed outliers and removed.
6. DebtRatio (percentage): Monthly debt payments, alimony and living costs divided by monthly gross income. All DebtRatio values above 14 (3,800 cases) were deemed outliers and removed. This may seem like a large number; however, the vast majority of these cases were also missing MonthlyIncome or had outliers in other variables.
7. MonthlyIncome (integer): Monthly income. 18.48% of cases had missing data for MonthlyIncome. All MonthlyIncome values above 60,000 (32 cases) were deemed outliers and removed.
8. noloansetc (integer): Number of open loans (instalment loans such as a car loan or mortgage) and lines of credit (e.g. credit cards). All noloansetc values above 22 (471 cases) were deemed outliers and removed.
9. ntimeslate (integer): Number of times the borrower has been 90 days or more past due. All ntimeslate values above 40 (159 cases) were deemed outliers and removed.
10. norees (integer): Number of mortgage and real estate loans, including home equity lines of credit. All norees values above 6 (205 cases) were deemed outliers and removed.
11. ntimes2 (integer): Number of times the borrower has been 60-89 days past due, but no worse, in the last 2 years. All ntimes2 values above 8 (262 cases) were deemed outliers and removed.
12. depends (integer): Number of dependents in the family, excluding the borrower (spouse, children etc.). 2.2% of original cases were missing this variable and were simply removed; all of these cases were also missing MonthlyIncome. All depends values above 10 (2 cases) were deemed outliers and removed.

Overall, 22.8% of cases were removed between outlier and missing-data rejection. Figure 1.1 shows boxplots of all variables after cleansing. They show how revol, times1, DebtRatio, ntimeslate and depends are all densely distributed around 0. Variables age and noloansetc are more normally distributed.
Figure 1.1: Cleansed Data Boxplots

Figure 1.2 shows a histogram of revol density for both delinquent and non-delinquent cases. revol showed the most visible relationship with delinquency.
Unfortunately, from a simplicity point of view, revol is the exception: when graphing other variables in this way, the relationship is far more subtle – not enough to make a judgement on a purely univariate basis, so multivariate analysis is necessary.
Figure 1.2: revol Histogram

It is often beneficial to create combination variables – blending variables together to create new ones that better express the cases. This was considered but deemed unnecessary: there are only a small number of variables, some of which are already combinations (DebtRatio and revol), and non-linear kernels would pick up such relationships. In some instances, missing data and outliers are just as insightful. To quickly determine whether any benefit could be gained from them in a model, a simplified version of the dataset was created. This technique can also be used to divide variables into brackets, such as age brackets; through this, variables that have a non-linear influence on the prediction can also be picked up in simple models using indicator variables. Variables with questionable values were recoded: 0 replaced a ‘normal’ value, and 1 replaced an outlier or missing value. This new dataset was fed into a basic tree to assess the value added by these cases (a sketch is shown below). The binary indicator variables did not prove to be accurate predictors, being inferior to the original variables, and so it was deemed satisfactory to omit outliers and missing data.
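As a rough illustration of this check, the sketch below (in R, following the appendix code where -999 marks a missing value; the *.flag names and thresholds are illustrative) builds binary indicators for questionable values and feeds them to a basic tree:

library(rpart)
# Flag questionable values: 1 for an outlier or missing value, 0 for a 'normal' value
flagged <- oData
flagged$inc.flag <- as.integer(oData$MonthlyIncome == -999 | oData$MonthlyIncome > 60000)
flagged$dep.flag <- as.integer(oData$depends == -999 | oData$depends > 10)
flagged$rev.flag <- as.integer(oData$revol > 4)
flag.tree <- rpart(dlq ~ inc.flag + dep.flag + rev.flag,
                   data = flagged, method = "class")
printcp(flag.tree)   # if the flags rarely appear as splitters, they add little predictive value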
2. MODELS

After cleansing, just under 16,500 cases remained. This was deemed more than enough to warrant dividing the data into training and test subsets (2:1). These subsets had representative distributions of delinquent loans, so stratified sampling was not required. A third, validation, subset would also have been possible. Six models were developed on the training set and tested; a sketch of the split is shown below.
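A minimal sketch of the 2:1 split, following the sampling approach in the appendix code (cData is the cleansed data frame; the proportion check is illustrative):

set.seed(12345)                                        # reproducible split
test_rows <- sample.int(nrow(cData), floor(nrow(cData) / 3))
test  <- cData[test_rows, ]                            # one third held out for testing
train <- cData[-test_rows, ]                           # two thirds used for training
prop.table(table(train$dlq))                           # check the delinquency proportions
prop.table(table(test$dlq))                            # are similar in both subsets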
2.1 Logistic Regression

The key element of this problem is the binary nature of the outcome variable. Cases can only be 0 or 1, and the associated probability should only be between 0 and 1. Furthermore, many of the variables were predominantly small values, although large outliers existed. For these reasons, a logistic regression model was suitable for initial testing. Table 2.1 shows the output produced by the first logistic regression. It showed that, on this linear scale, the variables DebtRatio and depends could be removed from the model.

Table 2.1: Logistic Regression Output

Coefficients:
                   Estimate     Std. Error    z value   Pr(>|z|)
(Intercept)    -1.06498817259  0.11027763366  -9.65734  < 2.22e-16  ***
revol           1.94583262923  0.06703656789  29.02644  < 2.22e-16  ***
age            -0.01751873613  0.00184558679  -9.49223  < 2.22e-16  ***
times1          0.54663351405  0.03189642304  17.13777  < 2.22e-16  ***
DebtRatio       0.04952123774  0.04646285001   1.06582  0.286503
MonthlyIncome  -0.00003790262  0.00000642855  -5.89598  3.72e-09    ***
noloansetc      0.04735721583  0.00604523194   7.83381  4.73e-15    ***
ntimeslate      0.89797913290  0.05709272686  15.72843  < 2.22e-16  ***
norees          0.06882099509  0.02810557602   2.44866  0.014339    *
ntimes2         0.95544920500  0.07330640163  13.03364  < 2.22e-16  ***
depends         0.03064240612  0.02053475812   1.49222  0.135641
Further trial-and-error experimentation, plotting and ANOVA showed that the norees and ntimes2 variables could also be removed. Removing these variables brought the AIC down from 10,916 to 10,909. Table 2.2 shows the final logistic regression misclassification rates.

Table 2.2: Logistic Regression Model
FALSE POSITIVE RATE      29.8%
FALSE NEGATIVE RATE      16.5%
MISCLASSIFICATION RATE   25.5%
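A sketch of the logistic regression fits described above (the formulas follow the appendix code; the object names are illustrative):

fitLR.full <- glm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome +
                    noloansetc + ntimeslate + norees + ntimes2 + depends,
                  data = train, family = binomial())
fitLR <- glm(dlq ~ revol + age + times1 + MonthlyIncome + noloansetc + ntimeslate,
             data = train, family = binomial())
AIC(fitLR.full, fitLR)                                # compare the two models
anova(fitLR, fitLR.full, test = "Chisq")              # likelihood-ratio test for the dropped terms

pred.lr <- predict(fitLR, test, type = "response")    # predicted probability of delinquency
table(actual = test$dlq, predicted = as.integer(pred.lr > 0.5))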
2.2 Principal Component Analysis

A principal component analysis was done to gain further insight into the underlying trends before further modelling. This method creates distinct linear combinations of the data – components – that account for multivariate patterns. It does not result in a direct model; however, it can reveal underlying trends. Figure 2.1 shows a plot of the proportion of total variance accounted for by each component. The ‘elbow’ at component 3 shows clearly decreasing marginal value, so only the first three components were analysed in detail; they explained a disproportionate 23% of the data variance.
Figure 2.1: Principal Component Analysis Scree Plot

The first component showed high values for dlq, revol, times1, ntimeslate and ntimes2. This would suggest that these four variables might be good indicators for dlq prediction. This component could be termed a ‘financial competence’ element. The second component contains high values for the two variables that indicate numbers of lines of credit, and it also contains some loading on dlq. The third component could be interpreted as a ‘middle-aged’ component, as it contains values for age, number of dependents, monthly income and debt ratio. This may be because those starting families typically have large debt obligations, having bought a house relatively recently. The fifth and ninth components showed high values for dlq, along with elements of all other variables apart from norees (which had a strong presence in component two). This suggests that all variables have some amount of correlation with dlq; however, these relationships are still unclear from the PCA alone. The values relevant to this discussion are highlighted in Table 2.3.
Table 2.3: Principal Component Analysis Loadings Output
(Loadings of the 11 variables – dlq, revol, age, times1, DebtRatio, MonthlyIncome, noloansetc, ntimeslate, norees, ntimes2 and depends – on components 1 to 11; the values relevant to the discussion above are highlighted in the original output.)
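A sketch of the principal component analysis (princomp on the scaled variables, as in the appendix code; nid is dropped as the first column):

sData <- scale(cData[, 2:12])                  # centre and scale the 11 variables (dlq included)
pca <- princomp(sData, cor = TRUE)
summary(pca)                                   # proportion of variance per component
plot(pca, type = "l", main = "Scree Plot")     # scree plot as in Figure 2.1
loadings(pca)                                  # loadings as in Table 2.3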
2.3 Trees

Trees are a relatively simple concept in machine learning. They classify predictions based on a series of ‘if-statements’ that can easily be represented graphically, and they are produced via a recursive-partitioning algorithm. Possibly the greatest benefit of a default classification tree is its simplicity and thus interpretability. It can be explained clearly in a diagram, so a bank manager could easily apply it to a loan applicant without even the use of a calculator. Figure 2.2 shows a basic tree. It shows ntimeslate, times1 and ntimes2 to be the most important determinants of the prediction. By this tree, if a customer has an ntimeslate value of 1 or greater, they have an 88% chance of being delinquent.
Figure 2.2: Simple Tree with Default Complexity Parameter (cp=0.01); total classified correct = 74.1%

Unfortunately, this tree is an oversimplified model. There exists a trade-off between model accuracy and complexity that can be seen in Figure 2.3.
Figure 2.3: Complexity Parameter vs. Relative Error
The rpart package chooses a default complexity parameter of 0.01; mathematically, however, the optimum complexity parameter is 0.0017000378, as it has the lowest corresponding cross-validation error. This tree contains 16 splits and is shown in Figure 2.4.
Figure 2.4: Optimized Complexity Parameter Tree (cp=0.0010406812)

The second problem with more complex models, after the loss of user accessibility, is the risk of over-fitting. As Figure 2.3 shows, there is a diminishing marginal decrease in the cross-validation error as the tree becomes more complex, and the error sometimes rises again. With that in mind, another way to decide on the number of splits (complexity) is to judge the complexity parameter curve. Figure 2.3 shows two ‘elbows’ where the marginal improvement from adding splits diminishes. There is one at cp=0.1; however, this tree is even more basic than the original one. The next is at roughly cp=0.0054, and the associated tree, shown in Figure 2.5, contains 8 splits.
Figure 2.5: Tree of Complexity Parameter at CP Plot Elbow
Due to the categorical output variable, classification trees were used throughout. The Gini measure of node purity, also known as the splitting index, was used for the trees selected. Entropy was also tried; Gini has become the convention, but the other methods appear equally competent. The party package was experimented with as well: although its graphical output is slightly better, it lacked flexibility in adjusting complexity and so was deemed unsuitable in this case. Trees facilitate loss matrices to weight the value of false negatives and false positives in model evaluation. This was experimented with, but without sufficient background knowledge and client communication it was deemed too arbitrary and was eventually removed. This issue is discussed further in the report (see section 4). Table 2.4 shows the occurrence of variables as splitters or surrogate splitters in the model. The table was produced by a word count of each variable name in the output generated from the variable importance function for rpart created by Dr. Noel O'Boyle of DCU (2011). The list coincided with the variable importance plot later generated by the random forests, and to a lesser degree with the PCA output. Simple trees are inferior at predicting linear relationships, which are evident in this dataset from the logistic regression.

Table 2.4: Variable Importance via rpart.importance

Rank   Variable        Occurrences in Split or Surrogate
1      revol           63
2      times1          38
3      ntimeslate      32
4      ntimes2         31
5      DebtRatio       26
6      age             23
7      MonthlyIncome   22
8      noloansetc      22
9      norees          12
10     depends          5
Table 2.5: Classification Tree Misclassification Table

FALSE POSITIVE RATE      22.4%
FALSE NEGATIVE RATE      24.5%
MISCLASSIFICATION RATE   23.4%
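A sketch of the classification trees described above: a deliberately deep tree, the complexity parameter plot, and pruning back to the ‘elbow’ value discussed in the text (pruning is used here instead of re-growing the tree, which the appendix code does; object names are illustrative):

fitTree <- rpart(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome +
                   noloansetc + ntimeslate + norees + ntimes2 + depends,
                 data = train, method = "class",
                 parms = list(split = "gini"),
                 control = rpart.control(cp = 0.001))      # deliberately deep tree
plotcp(fitTree)                                            # cp vs. relative error (Figure 2.3)
fitTree.elbow <- prune(fitTree, cp = 0.0054)               # prune back to the elbow (Figure 2.5)
pred.tree.class <- predict(fitTree.elbow, test, type = "class")
table(actual = test$dlq, predicted = pred.tree.class)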
2.4 Random Forests

The random forest method involves creating a large number of CART trees from a dataset to form a democratic ensemble. The method has two ‘random’ elements. First, the algorithm selects cases at random (bootstrapping). Secondly, each tree looks at a small selection of variables (roughly the square root of the total number of variables), which are used in random combinations. Each tree is built without roughly 34% of the data, which is then used as the ‘out of bag’ (OOB) sample to evaluate the tree. In this sense, the method creates internal train and test sets. 1,000 trees were created; this was enough to gather a large sample of possible combinations without being too computationally exhaustive. Each tree in the forest looked at 3 variables, roughly the square root of the total number of variables. As the target variable is binary, classification trees were best suited. As with the SVM, this meant the training data had to be converted to type ‘data.frame’ and the target variable dlq was converted to a factor.

Variable Insight

Random forests give every variable a chance to ‘vote’ because of the large number of trees created and the relatively low number of variables randomly selected for each tree. This allows them to provide deep insight into every variable, expressed by counting the number of times a variable is used in a split and by partial dependency plots. The bar chart in Figure 2.6 displays the number of times each variable was selected for a split in a tree, as opposed to the two other candidate variables in that tree (mtry = 3). The chart shows times1, ntimeslate and ntimes2 to be used rarely, while revol, age and MonthlyIncome were often used. Unfortunately, this graph does not represent the relative value of each variable: it may be biased towards more continuous variables, as opposed to discrete ones, which is why times1, ntimeslate and ntimes2 score low.
Figure 2.6: Bar Chart of the Number of Times Each Variable Was Used in a Split

‘Partial dependence plot gives a graphical depiction of the marginal effect of a variable on the class probability (classification) or response (regression).’ – randomForest Reference Manual. The variables are listed in Figure 2.7 in descending importance, as measured by the average decrease in accuracy and in Gini.
Figure 2.7: Variable Importance Plot via varImpPlot

Partial dependency plots show the marginal effect a variable has on the predicted probability as its values change. Figure 2.8 shows the partial dependency plots of times1 (black), DebtRatio (dark blue), ntimes2 (pink), norees (red), ntimeslate (green) and noloansetc (light blue). DebtRatio, times1, ntimeslate and noloansetc all show early rapid falls in credibility in going from 0 to 2. This graph clearly displays how being late on a single payment can have a dramatic
effect on a person’s credit standing. The noloansetc and norees variables, which both measure numbers of financial lines, rise from 0, which matches the assumption that at least one line of credit is required to be in this dataset; however, many lines of credit may suggest financial instability. Although having so many variables on a single graph can be inaccessible, it is the best way to maintain relative perspective by using the same scale. While analysing these graphs, one must recall the density distributions of the variables (see Figure 1.1). The times1 and DebtRatio curves show unexpected recoveries; however, these are caused by a small number of values that may have warranted removal as outliers.
Figure 2.8: Partial Dependency Plot for times1 (black), DebtRatio (dark blue), ntimes2 (pink), norees (red), ntimeslate (green) and noloansetc (light blue)

The initial drop in the partial dependency plot for monthly income may be explained by those with 0 monthly income being students or retirees, while low-income earners show the least probability of repayment. The large spike at 10,000 and the fluctuations thereafter are more difficult to understand; however, these earners are a minority. Figure 1.1 shows them to be in the tail end of the boxplot, and so they may be of less importance.
Figure 2.9: Partial Dependency Plot – MonthlyIncome

Figure 2.10 shows the effect age has on the predicted probability of delinquency. It shows steep increases in financial competence through the late twenties as people mature, and again leading towards retirement, followed by a decline after retirement.
Figure 2.10: Partial Dependency Plot – Age

The revol plot decreases rapidly as revol goes from 0 to 1, which is to be expected, as a low value of revol suggests high financial control. This is followed by a slow increase for which there is no obvious reason. Again, these cases are a ‘tailed’ minority and could be deemed outliers. Values of revol over 1 would be due to interest. Another possibility is that ‘total balance’ includes the sum of future interest payments due, in which case those deemed satisfactory to obtain a longer-term loan would have been approved on some other measure of financial stability.
Figure 2.11: Partial Dependency Plot – revol

Figure 2.12 shows how the error rate on the out-of-bag sample falls as the number of trees voting increases. This shows that 1,000 trees were more than sufficient for reaching a satisfactory error rate.
Figure 2.12: Out-of-Bag Error Rate vs. Number of Trees

A forest of 250 trees was also created, which showed approximately the same results. There was no major difference; however, the 1,000-tree forest was used for the majority of the analysis. Random forests are relatively robust to over-fitting, and they can handle large outliers, as the democratic voting reduces sensitivity to any single variable. Random forests cannot incorporate variable costs; this was not considered here, although a client might express a preference for certain variables. Possibly the greatest benefit of the random forest analysis is the in-depth understanding of the variables that it provides.

Table 2.6: Random Forest Misclassification Table

FALSE POSITIVE RATE      22.6%
FALSE NEGATIVE RATE      23.2%
MISCLASSIFICATION RATE   23.7%
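A sketch of the random forest fit and the variable diagnostics discussed above (settings follow the appendix code; pred.rf is an illustrative name for the test-set probabilities reused later):

library(randomForest)
set.seed(12345)
fitrf <- randomForest(as.factor(dlq) ~ revol + age + times1 + DebtRatio + MonthlyIncome +
                        noloansetc + ntimeslate + norees + ntimes2 + depends,
                      data = train, ntree = 1000, importance = TRUE)
varImpPlot(fitrf, main = "Variable Importance")            # Figure 2.7
varUsed(fitrf, count = TRUE)                               # counts behind Figure 2.6
partialPlot(fitrf, train, revol,
            main = "revol - Partial Dependency Plot")      # Figure 2.11
plot(fitrf, main = "")                                     # OOB error vs. number of trees (Figure 2.12)
pred.rf <- predict(fitrf, test, type = "prob")[, 2]        # test-set probability of delinquency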
2.5 Neural Networks

Neural networks are a black-box modelling technique. Put simply, the model creates a formula, similar to that of a linear regression, containing hidden layers. ‘Neurons’, which represent the input variables, the output variables and optional units in the hidden layer, are algorithmically weighted to minimise the difference between the desired results and the actual results. Both the nnet and neuralnet packages were tested in the analysis, but nnet was chosen as the primary function due to its speed and level of functionality. The model was tweaked to optimise the AIC, the error, positive Hessian matrix eigenvalues and the misclassification rates. The data was scaled and a trial-and-error method was used: several combinations of model inputs were created and tested with different values for size (the number of units in the hidden layer) and different input vectors. To tailor the model to the 0/1 classification target variable, softmax = TRUE was set as the activation function for the nnet package and act.fct="logistic" was set for the neuralnet package. A decay value can be added to the input to improve stability, mitigating the risk of weights tending towards infinity. Figure 2.13 plots the error rate for decay values between 0 and 1. It shows that adding a small decay value to the model vastly reduced the error. Given the high possibility of outliers, the neural network model could easily be over-fitted, so adding decay generalised it.
Figure 2.13: Decay ~ Error Plot
A single hidden neuron was used and the depends variable was removed from the model, reducing the AIC to -12291.02. This produced the following output for the optimum model:

a 8-1-2 network with 13 weights
options were - softmax modelling  decay=0.1
 b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1 i6->h1 i7->h1 i8->h1
  1.34   0.47  -0.13   0.56   0.02  -0.09   0.11   1.20   0.04
 b->o1 h1->o1
  3.25  -4.53
 b->o2 h1->o2
 -3.25   4.53
Figure 2.14 shows the model as graphically produced by the neuralnet package. On the left are the input variables – the values that are multiplied by the weights marked along the arrows into the hidden-layer neuron. The bias nodes feed into the hidden and output neurons, and the result is passed through an activation function to give the probability associated with the given case.
Figure 2.14: Neural Network as Plotted by neuralnet (error 891.320636 after 13,284 steps)

In some ways the neuralnet package is more advanced than the nnet package; however, it is poorly built in a number of ways. The naming conventions are often unclear and conflicting. Its prediction function, which is crucial for developing receiver operating characteristic (ROC) curves, shares its name with a function in the ROCR package. The result is several variables and functions, all with similar names and very similar tasks. Table 2.7 shows the performance of the neural network model on the test set.

Table 2.7: Neural Network Misclassification

FALSE POSITIVE RATE      23.2%
FALSE NEGATIVE RATE      26.1%
MISCLASSIFICATION RATE   24.6%
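A sketch of the final nnet model described above (scaled inputs, a single hidden unit, softmax outputs and decay = 0.1; the formula follows the appendix code and pred.nn is an illustrative name reused later):

library(nnet)
set.seed(999)
fitnn <- nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) +
                scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) +
                scale(ntimeslate) + scale(norees),
              data = train, size = 1, softmax = TRUE, decay = 0.1, Hess = TRUE)
summary(fitnn)                        # weights of the 8-1-2 network
eigen(fitnn$Hessian)$values           # all positive at a (local) minimum
pred.nn <- predict(fitnn, test)[, 2]  # test-set probability of delinquency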
2.6 Support Vector Machines

In the SVM process, points are transformed/mapped into a ‘hyper-space’ so they can be split by a hyper-plane. This transformation is not explicitly calculated, which would be computationally expensive; instead, the kernel, which is a function of the inner product of the original multidimensional space, is calculated. Consider a piece of string: in one dimension, the distance between the two ends of the string is the string’s length and they can never touch; in real, three-dimensional space, however, the string can be bent around so that both ends can easily touch. This may be an abstract example, but it demonstrates the idea that more dimensions allow more manipulation. Optimising the support vector machine involved tuning the values of gamma and cost on the training set. Gamma is a term in the radial basis transformation and cost is the constant for the Lagrange formulation. They were tested between 0.5 and 1.5, and between 10 and 150, respectively. The tune function concluded that a gamma of 1.3 and a cost of 130 were optimum, giving a best performance of 0.2118273. Figure 2.15 shows an early plot of the tuning function.
Figure 2.15: SVM Tuning Plot
SVM Classification Plots

SVM classification plots can be useful in determining the relationship between two variables and the predictive model. In this case, they are better suited to mapping more scattered variables, such as MonthlyIncome, than discrete ones such as ntimeslate. Figure 2.16 displays the relationship between age, MonthlyIncome and the delinquency classification by the radial C-classification method. The colours represent the classification determined in a hyperspace radial plane.
Figure 2.16: Radial SVM Plotted on MonthlyIncome ~ age

Figure 2.17 shows the same two variables being classified by a sigmoid plane. Clearly there is a large difference in classification using different kernels.
Figure 2.17: Sigmoid SVM Plotted on MonthlyIncome ~ age

Table 2.8 shows the performance of the support vector machine model on the test set.

Table 2.8: Support Vector Machines Misclassification

FALSE POSITIVE RATE      24.7%
FALSE NEGATIVE RATE      23.2%
MISCLASSIFICATION RATE   24.1%
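A sketch of the SVM tuning and final radial fit described above (grid and chosen values follow the text and appendix code; the full grid is slow to evaluate, and pred.svm.prob is an illustrative name reused later):

library(e1071)
svm.train <- train[, -1]                         # drop the nid identifier
svm.train$dlq <- as.factor(svm.train$dlq)
set.seed(12345)
svm.tune <- tune.svm(dlq ~ ., data = svm.train,
                     gamma = seq(0.5, 1.5, by = 0.1),
                     cost  = seq(10, 150, by = 10))
svm.tune$best.parameters                         # gamma = 1.3, cost = 130 in the text
svm.fit <- svm(dlq ~ ., data = svm.train, kernel = "radial",
               gamma = 1.3, cost = 130, probability = TRUE)
pred.svm <- predict(svm.fit, test[, -1], probability = TRUE)
pred.svm.prob <- attr(pred.svm, "probabilities")[, "1"]   # test-set probability of delinquency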
3. FURTHER ANALYSIS
3.1 Ensemble Model

In an attempt to gain the best elements of all models, an ensemble was created. For each individual case, all four models are run and the probabilities produced by the tree, random forest, neural network and support vector machine are simply averaged (a sketch follows below). This creates a less diverse array of probabilities; however, it is just as decisive. Table 3.1 shows its performance on the test set. Interestingly, this model had the lowest false positive rate on the test set.

Table 3.1: Ensemble Model Misclassification

FALSE POSITIVE RATE      22.4%
FALSE NEGATIVE RATE      25.3%
MISCLASSIFICATION RATE   23.7%
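A sketch of the ensemble: the test-set probability vectors from the four earlier sketches (pred.rf, pred.nn and pred.svm.prob, all illustrative names) are averaged with the tree probabilities for each case:

pred.tree <- predict(fitTree.elbow, test, type = "prob")[, 2]   # tree probability of delinquency
pred.ens  <- (pred.tree + pred.rf + pred.nn + pred.svm.prob) / 4
table(actual = test$dlq, predicted = as.integer(pred.ens > 0.5))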
3.2 Limitations and Further Work

Without the opportunity to talk with the client and properly understand the data, the variable descriptions are limited. The extent to which outliers could be removed, and the ease of collecting each variable, had to be estimated. The Rattle package offers the incorporation of risk into the model; however, this could not be calculated without a measure of risk associated with each case, such as the nominal value of each loan. The creation of this data must also be considered. These are loans which have already been approved, so these customers will already have gone through some screening system to get to that stage. Any model taken from this empirical study must be used in conjunction with the current structures that have already screened out potential loan applicants. Data is temporal, but this dataset gives no information about the time period from which it was taken; consider the value of such a model in the midst of the financial crisis. Models can expire and must be continually tested and updated. Unfortunately, it is difficult to test a model once implemented.
3.3 Variable Importance

Table 3.2 contains a simplified rating – good, poor, average (ave.) or inconclusive (--) – for each variable under different topics. The support vector machine was not included as it is a black-box technique.

Table 3.2: Simplified Variable Ratings

Variable        Missing Data  Outliers  Log.   PCA    Tree   R. Forest  N. Net
revol           good          good      good   good   good   good       good
age             good          good      good   ave.   ave.   good       good
times1          good          ave.      good   good   good   --         good
DebtRatio       good          poor      poor   ave.   ave.   good       good
MonthlyIncome   poor          good      good   ave.   ave.   good       good
noloansetc      good          ave.      good   ave.   ave.   ave.       good
ntimeslate      good          good      good   good   good   poor       good
norees          good          ave.      ave.   ave.   poor   ave.       good
ntimes2         good          ave.      good   good   good   --         --
depends         ave.          good      good   poor   poor   ave.       poor
4. MODELS ASSESSMENT

Models tend to produce different values for false positives and false negatives, which makes the question of the ‘best model’ ambiguous, particularly without client contact. Prospect Theory (Kahneman and Tversky, 1979) indicates that people’s negative response to a loss is more pronounced than their positive response to an equal gain: consider one’s feelings on losing a €50 note compared to finding one. Taking this aspect of human behaviour into account, models with a lower false positive rate (a loan would be given and money is lost as it is not paid back as agreed) may be weighted better than those with a lower false negative rate (a loan would not be given to a customer who would have paid it back, so potential profit is foregone). To finalise the ‘best’ model in this instance, an analysis of the material profit/loss associated with each loan would be required, together with a cost quantity representing the client’s attitude to risk. This would guide a weighting towards the user’s preference for minimal false negatives or false positives. Such a weighting can be applied via evaluateRisk or the loss matrices that some of these models facilitate; due to the lack of information, they could not be used here. There is also no information on the ease with which managers can obtain the input information or how reliable it tends to be. For example, revol may not be practical to obtain for every customer; in the case of trees and random forests, surrogates may be used instead. Figure 4.1 shows the receiver operating characteristic (ROC) curves for the four models plotted together, with the true positive rate against the false positive rate. A good model would arc high and to the left. In this respect, the random forest, neural network and support vector machine models are essentially the same, while the simple tree model is clearly suboptimal. This trend can also be seen in a lift chart; a sketch of how such curves can be produced is shown below.
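A rough sketch of how the ROC comparison in Figure 4.1 might be produced with the ROCR package (using the test-set probability vectors from the earlier sketches; the names and the roc.add helper are illustrative):

library(ROCR)
roc.add <- function(p, col, add = FALSE) {
  # true positive rate vs. false positive rate for one model's probabilities
  perf <- performance(prediction(p, test$dlq), "tpr", "fpr")
  plot(perf, col = col, add = add)
}
roc.add(pred.tree,     col = 1)               # simple tree
roc.add(pred.rf,       col = 2, add = TRUE)   # random forest
roc.add(pred.nn,       col = 3, add = TRUE)   # neural network
roc.add(pred.svm.prob, col = 4, add = TRUE)   # support vector machine
legend("bottomright", c("Tree", "Random Forest", "Neural Network", "SVM"),
       col = 1:4, lty = 1)
# A lift chart can be drawn the same way with performance(..., "lift", "rpp").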
Figure 4.1: ROC Curves

Figure 4.2 shows scatter plots of the four models’ predictions for the test set plotted against each other. The graphs also show the actual value for each case by colour; the visible correlations represent the similarity of the two models. Cases plotted in the top left or bottom right are cases on which both models agreed; if correctly classified, they will be blue for dlq=1 and green for dlq=0. The simple tree can clearly be identified by the discrete nature of its predictions for each node. Random forests and neural networks appear to be the most correlated models with the least variance. These graphs alone should not be used to determine misclassification rates: they give an idea, but the full density of cases at the corners of the graphs is crowded, and so the number of cases there is difficult to read. The true misclassification rates can be seen in the bar chart in Figure 4.3.
Figure 4.2: Model Comparison Scatter Plots
As explained earlier, the client’s attitude towards false negatives and false positives may be disproportionate due to the prospect effect. It is logical to assume that, to some degree, a low false positive rate is more desirable than a low false negative rate. Figure 4.3 displays a bar chart of the misclassification rates on the test set for the logistic regression, simple tree, random forest, neural network, support vector machine and ensemble models. The ensemble model, arguably the most complex, performed the best for false positives; however, it also performed poorly on false negatives. Interestingly, the simplest model, logistic regression, performed inversely: it performed well on false negatives and poorly on false positives. The neural network consistently performed worse than its competitors. The support vector machine was mediocre, as was the simple tree (though the latter also performed poorly in the ROC analysis).
Figure 4.3: Misclassification Rates by Model

The random forest had the most consistently low misclassification rates. It is relatively simple to implement, it is particularly good for classification and, as it is an ensemble method, it can often negate biases and handle missing data and outliers well. It also provided a level of insight into variable importance, and it has relatively strong correlations with the other models, as seen in Figure 4.2. Although random forests cannot include costs, purely based on the information provided, the random forest would be deemed the most suitable across all measures of model quality.
5. CONCLUSION

This report detailed an in-depth analysis of a dataset towards the prediction of loans becoming ‘delinquent’. Principal component analysis, logistic regression, classification trees, random forests, neural networks and support vector machines were all implemented. An ensemble model of the latter four was also tested. The best predictors, in descending order, are the percentage of credit limits utilised (revol) and three variables representing the number of times borrowers were late repaying their loans (times1, ntimeslate and ntimes2). The variables measuring the number of property loans (norees) and the number of dependents (depends) appear to be the least valuable predictors. Although the limitations of this study are clearly outlined, based on basic intuition and the information provided, the random forest model was deemed most suitable as it consistently performed well against all measures, most likely due to its ability to represent non-linear relationships while not over-fitting. It is also best placed to deal with outliers and missing data due to its democratic ensemble nature.
7. APPENDIX
7.1 References

KAHNEMAN, D. & TVERSKY, A. 1979. Prospect Theory: An Analysis of Decision Under Risk. Econometrica, 47, 263-291.

O'BOYLE, N. 2011. Supervised Classification: Variable Importance in CART (Classification and Regression Trees), DCU Redbrick, viewed 19 February 2012, <http://www.redbrick.dcu.ie/~noel/R_classification.html>
7.2 R Code

# Adding packages
library("rpart")
library(neuralnet)
library("ROCR")            # the prediction object is masked from neuralnet
library(tools)
library("randomForest")
library("caTools")
library("colorspace")
library("Matrix")
library("nnet")
library("gtools")
library("e1071")
library("rattle")
library(car)
library(cluster)
library(maptree)
library("RColorBrewer")
library(modeltools)
library(coin)
library(party)             # also loads sandwich, strucchange and vcd
library(zoo)
search()
? ctree
cData[cData$MonthlyIncome==-999,"MonthlyIncome"] <- NA
cData = na.omit(cData)
nrow(oData)-nrow(cData)        # 3943 missing
1-(nrow(cData)/nrow(oData))    # 18.5% missing for both depends and MonthlyIncome together
# All missing values of depends are also missing for MonthlyIncome
# 3943 NA's/-999s
na.omited.oData = cData

# Outlier removal measuring
cData = oData
nrow(cData)                    # 21326 original
# Insert outlier removal code.....
cData = subset(cData, cData$depends < 10)  # <5 removes 152; 10 is fine, good distribution
# cData = na.omit(cData)
nrow(oData)-nrow(cData)        # number removed
1-(nrow(cData)/nrow(oData))    # % removed
oData = read.csv("/Users/stephendenham/Dropbox/College/Data Mining/directory/creditData.csv")
cData = oData
nrow(cData)                    # 21326 original cases
cData[cData$MonthlyIncome==-999,"MonthlyIncome"] <- NA
cData = na.omit(cData)
nrow(oData)-nrow(cData)        # 3943 missing
1-(nrow(cData)/nrow(oData))    # 18.49% missing for MonthlyIncome
# Actual outlier removal
cData = oData
cData[cData$depends==-999,"depends"] <- NA
cData[cData$MonthlyIncome==-999,"MonthlyIncome"] <- NA
cData = na.omit(cData)
cData = oData
nrow(cData)                    # 21326
cData[cData$depends==-999,"depends"] <- NA
cData = na.omit(cData)
nrow(oData)-nrow(cData)        # 480 missing
1-(nrow(cData)/nrow(oData))    # 2.25% missing for depends
c0Data = subset(cData, cData$dlq == 0) #
# Outliers picked from boxplots cData = subset(cData, cData$revol < 4) # 5000>>6. 50>>28. 9>>33. 4>>40 cData = subset(cData, cData$times1 < 80) # cData = subset(cData, cData$times1 < 14) # 4 removes 390. 10 removes 359. 14 removes 248 cData = subset(cData, cData$DebtRatio < 14) # 4 removes 390. 10 removes 359. 14 removes 248 cData = subset(cData, cData$MonthlyIncome < 60000) # 100000 removes 6. 50000 removes 37. 14 removes 248 cData = subset(cData, cData$noloansetc < 22) # <22 removes 400. No need to remove any I think, good distribution cData = subset(cData, cData$ntimeslate < 40) # removes 33 cData = subset(cData, cData$norees < 6) # <6 removes 137. No need to remove any I think, good distribution cData = subset(cData, cData$ntimes2 < 8) # 8 removes 1. 5 removes 28. No need to remove any I think, good distribution cData = subset(cData, cData$depends < 10) x = nrow(cData) y = nrow(oData) y-x 1-(x/y) # Percent Removeded # Creating smaller sets for testing
set.seed(12345) test_rows = sample.int(nrow(cData), nrow(cData)/3) test = cData[test_rows,] train = cData[-test_rows,] set.seed(12345) otest_rows = sample.int(nrow(na.omited.oData), nrow(na.omited.oData)/3) otest = na.omited.oData[otest_rows,] otrain = na.omited.oData[otest_rows,] # Creating Data Frames train.Frame = data.frame(train) test.Frame = data.frame(test) # Cleaning Done ################ # Linear model # ################ ? glm fitLR.2 = glm(dlq ~ revol + age + times1 + ntimeslate + norees + ntimes2 + depends, data=train ,family=binomial() ) fitLR.2 summary(fitLR.2) confint(fitLR.2) exp(coef(fitLR.2)) exp(confint(fitLR.2)) predict(fitLR.2, type="response") residuals(fitLR.2, type="deviance") plot(fitLR.2) # predict.glm predict(fitLR.2, test, type = "response")
#par(mfrow=c(2,5)) #boxplot(cData$nid, main="nid") #boxplot(cData$dlq, main="nid") boxplot(cData$revol, main="revol", col=10) boxplot(cData$age, main="age", col=7) boxplot(cData$times1, main="times1", col=3) boxplot(cData$DebtRatio, main="DebtRatio", col=4) boxplot(cData$MonthlyIncome, main="MonthlyIncome", col=5) boxplot(cData$noloansetc, main="noloansetc", col=6) boxplot(cData$ntimeslate, main="ntimeslate", col="light green") boxplot(cData$norees, main="norees", col="light blue") boxplot(cData$ntimes2, main="ntimes2", col=1) boxplot(cData$depends, main="depends", col="purple") names(cData) #dev.off()
# Optimal Logistic Model fitLR = glm(dlq ~ revol + age + times1 + MonthlyIncome + noloansetc + ntimeslate # + norees + ntimes2, data=train ,family=binomial() ) summary(fitLR) plot(predict(fitLR, test, type = "response")~predict(fitLR.2, test, type = "response")) anova(fitLR,fitLR.2, test="Chisq") # Anova of regression with less data ? anova
c1Data = subset(cData, cData$dlq == 1) #
####### # PCA # ####### boxplot(cData[2:12,]) # Scale Data names(cData) sData = scale(na.omit(cData[2:12]), center = TRUE, scale = TRUE) sData sData$age age boxplot(sData[2:12, -7]) boxplot(sData[2:12,]) names(cData) boxplot(cData$ntimeslate)
plotcp(fitTree) # Main Tree fitTree = rpart(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends, data=train[,2:12] #,parms=list(split="gini")
sDataPCA = princomp(sData[, c(1:11)], cor=T) # Can't use NAs print(sDataPCA) summary(sDataPCA) round(sDataPCA$sdev,2) plot(sDataPCA, type='l', main="Scree Plot") #Simple PC variance plot. Elbows at PCs 2 & 9 loadings(sDataPCA)
,control=rpart.control(cp=0.0010406812
# biplot(sDataPCA, main="Biplot") # Difficult to run abline(h=0); abline(v=0)
#,loss=matrix(c(0,false.pos.weight, false.neg.weight,0), byrow=TRUE, nrow=2) ) # Loss Matrix ) fitTree fitTree$cp # xError of: 0.4826868 - CP: 0.0010406812 ? plotcp printcp(fitTree) fitTree$parm fitTree$parm$loss # Plotcac draw.tree(fitTree, cex=.8, pch=2,size=2.5, nodeinfo = TRUE, cases = "obs") ? draw.tree ? abline ? plot.rpart... type, extra, plotcp plot(fitTree,compress=TRUE,uniform= TRUE, branch=0.5) text(fitTree,use.n=T,all=T,cex=.7,p retty=0,xpd=TRUE) draw.tree(fitTree, cex=.8, pch=2,size=2.5, nodeinfo = TRUE, cases = "obs")
,control=(maxsurrogate=100) ) ,method = "class" ,parms=list(split="gini"
# pairs(cData[2:12], main = "Pairs", pch = 21, bg = c("red", "blue")[unclass(cData$dlq)]) #abline(lsfit(Sepal.Width,Sepal.Wid th)) #abline(lsfit((setosa$Petal.Length, setosa$Petal.Width), col="red", lwd = 2, lty = 2)) ######### # TREES # ######### # Simpler Tree - CP is fitTree.sim = rpart(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends, data=train, parms=list(split="gini") ,method = "class" ) draw.tree(fitTree.sim, cex=.8, pch=2,size=2.5, nodeinfo = FALSE, cases = "obs")
? draw.tree fitTree.sim$cp fitTree.sim$splits[,1] plot(fitTree.sim$splits[,1]) fitTree.sim$splits false.pos.weight = 10 false.neg.weight = 10 #Min Error CP fitTree.elbow = rpart(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc +
plot(fitTree.sim,compress=TRUE,unif orm=TRUE, branch=0.5) text(fitTree.sim,use.n=T,all=T,cex= .7,pretty=0,xpd=TRUE) draw.tree(fitTree.sim, cex=.8, pch=2,size=2.5, nodeinfo = TRUE, cases = "obs")
ntimeslate + norees + ntimes2 + depends, data=train
# ? rattle.print.rpart # Must figure out which is false positive and which is false negetive newdata0 = subset(cData[2:12], dlq==0) newdata1 = subset(cData[2:12], dlq==1) # newdata1 = subset(cData, dlq==0) noPredictions0 = predict(fitTree, newdata0) noPredictions1 = predict(fitTree, newdata1) noPredictions = predict(fitTree, test) max(noPredictions) min(noPredictions) noPredictions0 noPredictions1 correct0 = (noPredictions0 < 0.5) correct0 correct1 = (noPredictions1 > 0.5) correct1 table(correct0) table(correct1) # Confusion matrix?
,control=rpart.control(cp=0.0068) ,method = "class" ,parms=list(split="gini" #,loss=matrix(c(0,false.pos.weight, false.neg.weight,0), byrow=TRUE, nrow=2) ) # Loss Matrix ) fitTree.elbow plot(fitTree.elbow,compress=TRUE,un iform=TRUE, branch=0.5) text(fitTree.elbow,use.n=T,all=T,ce x=.7,pretty=0,xpd=TRUE) draw.tree(fitTree.elbow, cex=.8, pch=2,size=2.5, nodeinfo = FALSE, cases = "") # CP at the elbow 0.004701763719512 # Party ? ctree fitTree.party <- ctree(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends, data=train[,2:12] #, controls=ctree_control( #stump=TRUE, #maxdepth=3 #) ) plot(fitTree.party, type = "simple" )
# Still have to do miss class thing # 2nd Lab on Normal Trees # Gini/Information : seems to make no difference # To do # 1. Add loads of missing data and outliers to test for robustness # 2. Add maxsurrogates (end of lab 3) # ??? ################## # Random Forests # ################## train ? randomForest
fitTree asRules(fitTree) info.gain.rpart(fitTree) ? rpart
? randomForest set.seed(12345) fitrf=randomForest(as.factor(train$ dlq) ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends,
# Function which determines Variable Importance in rpart. See below. a <- importance(fitTree) summary(a) # NOTE: a different CP (.01) had a better #? rattle # parms=list(prior=c(.5,.5)) ?? Priors? # control=rpart.control(cp=0.0018))
data=train.Frame, # declared above ntree=1000, type="classification", predicted=TRUE, importance=TRUE, proximity=FALSE # Never run prox as is crashes computer
# RATTLE
)
col=c("red", "dark
fitrf
red"), horiz=FALSE, space=0.4, width=2, axis=FALSE, axisnames=FALSE) axis(1, at=Graph, las=1, adj=0, cex.axis=0.7, labels = Ylabels) # oposite order of Ylabels ? varUsed # getTree(fitrf, k=1, labelVar=TRUE) # View and individual tree
# Var Importance Plot importance(fitrf) varImpPlot(fitrf, main = "Variable Importance", sort = TRUE) varImpPlot(fitrf, class=1, main = "Variable Importance", sort = TRUE) # Looking good ? varImpPlot boxplot(cData$depends) # Partial Dep Plots give graphical depiction of the marginal effect of a variable on the class response (regression) ? partialPlot partialPlot(fitrf, train, age, main="Age Partial Dependency Plot") partialPlot(fitrf, train, revol,main="revol - Partial Dependency Plot", col="red") partialPlot(fitrf, train,MonthlyIncome,main="MonthlyIn come - Partial Dependency Plot") partialPlot(fitrf, train, depends,main="depends - Partial Dependency Plot") # 6 in one... partialPlot(fitrf, train, times1 ,main="Partial Dependency Plot") partialPlot(fitrf, train, add=TRUE,DebtRatio,main="DebtRatio - Partial Dependency Plot", col="blue") partialPlot(fitrf, train,ntimes2, add=TRUE,main="ntimes2 - Partial Dependency Plot", col="pink") partialPlot(fitrf, train, ntimeslate, add=TRUE,main="ntimeslate - Partial Dependency Plot", col="green") partialPlot(fitrf, train, norees, add=TRUE,main="Partial Dependency Plot", col="red") partialPlot(fitrf, train, noloansetc, add=TRUE,main="noloansetc - Partial Dependency Plot", col="light blue")
set.seed(54321) fitrf.250=randomForest(as.factor(tr ain$dlq) ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends, data=train.Frame, # declared above ntree=250, type="classification", predicted=TRUE, importance=TRUE, proximity=FALSE # Never run prox as is crashes computer ) fitrf.250 set.seed(12345) #fitrf.reg=randomForest(dlq ~ revol + age + times1 + DebtRatio # + MonthlyIncome + noloansetc + ntimeslate # + norees + ntimes2 + depends, # data=train, # ntree=1000, # type="regression", # predicted=TRUE, # importance=TRUE, # proximity=FALSE # ) # set1$similarity <as.factor(set1$similarity) # Need to work out this prox stuff etc.
partialPlot(fitrf, train, noloansetc,main="noloansetc Partial Dependency Plot", col="light blue") partialPlot(fitrf, train,DebtRatio,main="DebtRatio Partial Dependency Plot", col="blue")
fitrf names(fitrf) summary(fitrf) fitrf$importance hist(fitrf$importance) fitrf$mtry # mtry = 3 hist(fitrf$oob.times) # Normal Distrition fitrf$importanceSD hist(treesize(fitrf, terminal=TRUE))
# Var Used Barchart Ylabels = c("revol","age","times1","DebtRatio ","MonthlyIncome","noloansetc","nti meslate","norees","ntimes2","depend s") Graph = barplot(varUsed(fitrf, count=TRUE), xlab="Times variable used", c(1:14),
# b = bias
boxplot(fitrf$oob.times) plot(fitrf, main = "") ? plot.randomForest
class.ind(train$dlq) set.seed(12345) nn.lin.values = vector("numeric", 10) for(i in 1:10) { fitnn.lin = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees) + scale(ntimes2) + scale(depends), train, size=0, skip=TRUE, softmax=FALSE, Hess=TRUE) cat(fitnn.lin$value,"\n") nn.lin.values[i] = fitnn.lin$value } hist(nn.lin.values) plot(nn.lin.values) nn.lin.values eigen(fitnn.lin$Hess) eigen(fitnn.lin$Hess)$values
getTree(fitrf, k=3, labelVar=FALSE) fitrf$votes
margins.fitrf=margin(fitrf,churn) plot(margins.rf) hist(margins.rf,main="Margins of Random Forest for churn dataset") boxplot(margins.rf~data$churn, main="Margins of Random Forest for churn dataset by class") The error rate over the trees is obtained as follows: plot(fit, main="Error rate over trees") MDSplot(fit, data$churn, k=2) # Margins margins.fitrf=margin(fitrf,dlq) plot(margins.fitrf) hist(margins.fitrf, main="Margins of Random Forest for Credit Dataset") boxplot(margins.fitrf~train.Frame$d lq) plot(margins.fitrf~dlq) MDSplot(fitrf, cData$dlq, k=2) # can't do because missing proximity matrix
# Main Nnet set.seed(999) fitnn = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees), train, size=1, skip=FALSE, softmax=TRUE, Hess=TRUE, decay=0.1) b=nrow(train) findAIC(fitnn, b, 10) # -12168 (these AICs are not consistent) eigen(fitnn$Hess)$values # usually
# Rattle Random Forest Stuff treeset.randomForest(fitrf, format="R") # This takes forever # printRandomForests(fitrf)
#
# Making Predictions #predict(fitrf, test[101,]) # outputs a value of either 1 or 0 predict(fitrf, test) # outputs a value of either 1 or 0 print(pred.fitrf <- predict(fitrf, test, votes = TRUE) ) pred.fitrf = predict(fitrf, test, type="prob")[,2] pred.fitrf
? neuralnet set.seed(999) fitnn.2 = neuralnet( # Took forever to run as.numeric(dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees), data = data.matrix(train),##-act.fct = "logistic", hidden = 1, linear.output = FALSE #, rep = 4, err.fct="sse" ) # as.numeric(dlq) # Can we change this to class/int??? class.ind(train$dlq)
############### # Neural Nets # ############### names(cData) sData = scale(na.omit(cData[2:12]), center = TRUE, scale = TRUE) sData # i = inpur # h = hidden layer
# dlq runs! fitnn.2.2 = neuralnet( as.factor(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees), data = data.matrix(train), hidden = 1, linear.output = FALSE #, rep = 4, err.fct="sse" )
? nnet confus.fun(fitnn) eigen(fitnn$Hess) eigen(fitnn$Hess)$values # eigen(fitnn.0$Hess) eigen(fitnn.0$Hess)$values # ooooh, not all positive) - measure stability ??? # Postive definite. All eigenvalues greater than 0 # This is also good fitnn.5 = nnet(class.ind(train$dlq) ~ scale(revol) + scale(times1) + scale(MonthlyIncome) + scale(ntimeslate), train, size=1, skip=FALSE, softmax=TRUE, Hess=TRUE) c = 5 # ?? findAIC(fitnn.5, b, c)
# fitnn.2 plot(fitnn.2) # gwplot(fitnn.2, rep="best") nn.CI=confidence.interval(fitnn.2, alpha=0.05) # CI of weights nn.CI$upper nn.CI$lower nn.CI$upper.ci # ? compute # Don't think this scaling is right t = test t = t[-1] t = t[-1] t = t[-10] t = t[-9] names(t) t$revol = scale(t$revol) t$age = scale(t$age) t$times1 = scale(t$times1) t$DebtRatio = scale(t$DebtRatio) t$MonthlyIncome = scale(t$MonthlyIncome) t$noloansetc = scale(t$noloansetc) t$ntimeslate = scale(t$ntimeslate) t$norees = scale(t$norees) t print(pr <- compute(fitnn.2, t)) fitnn.2.pred=pr$net.result #print(pr.2 <- compute(fitnn.2.2, t)) #fitnn.2.2.pred=pr.2$net.result fitnn.2.pred # #
# Worst AIC from here on fitnn.11 = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees) + scale(ntimes2) + scale(depends), train, size=1, skip=FALSE, softmax=TRUE, Hess=TRUE) c = 11 # ?? findAIC(fitnn.11, b, c) # -11595.36 fitnn.10.1 = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(ntimes2) + scale(depends), train, size=1, skip=FALSE, softmax=TRUE, Hess=TRUE) c = 10 # + scale(norees) findAIC(fitnn.10.1, b, c) # 12452.82
##-- List of probabilites generated by the neuralnet package
fitnn.9 = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(norees) + scale(ntimes2), train, size=1, skip=FALSE, softmax=TRUE, Hess=TRUE) c = 9 # No , depends, ntimeslate findAIC(fitnn.9, b, c) # -11872.63
b = nrow(train) fitnn.0 = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees), train, size=0, skip=TRUE, softmax=FALSE, Hess=TRUE) c = 10 # No depends findAIC(fitnn.0, b, c) # -12013.49
# uses test!!! (woo)
# Softmax = TRUE requires at least two response categories
# Not always all negative, must set seed
# easy because it's already all in numbers. No need for class.ind etc...
# Task: manually fill in examples into a nnet
summary(fitnn)
names(fitnn)
fitnn$terms
fitnn$wts
# Small datasets
# Hessian: Hess = TRUE
# Matrix

# AIC
p = nrow(train)
k = 8 # ncol(train)
SSE = sum(fitnn$residuals^2)
AIC = 2*k + p*log(SSE/p)
# SBC

# Different Decays
errorRate = vector("numeric", 100)
DecayRate = seq(.0001, 1, length.out=100)
for(i in 1:100) {
  # set.seed(12345)
  fitnn = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) +
                 scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees) + scale(ntimes2),
               train, size=1, skip=FALSE, softmax=TRUE, decay=DecayRate[i])
  fitnn = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) +
                 scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees) + scale(ntimes2),
               train, size=0, skip=TRUE, decay=DecayRate[i])
  errorRate[i] = sum(fitnn$residuals^2) # Could add AIC here
  # Was inverse graph for size = 0
}
errorRate
plot(DecayRate, errorRate, xlab="Decay", ylab="Error", type="l", lwd="2") # Actual??

#######
# SVM #
#######
#plot(cData$age, cData$DebtRatio, col=(cData$dlq+3), pch=(cData$dlq+2))
set.seed(12345)
# attach(cData)
# Regression - Radial
svm.model.reg.rad <- svm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends,
                         data = train.Frame, type = "eps-regression", kernel = "radial",
                         cost = 100, gamma = 1)
# Regression - Linear
svm.model.reg.lin <- svm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends,
                         data = train.Frame, type = "eps-regression", kernel = "linear",
                         cost = 100) # no gamma
# Polynomial
svm.model.pol <- svm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends,
                     data = train.Frame, type = "C-classification", kernel = "polynomial",
                     cost = 100, gamma = 1)
# Sigmoid
svm.model.sig <- svm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends,
                     data = train.Frame, type = "C-classification", kernel = "sigmoid",
                     probability = FALSE, cost = 100, gamma = 1)
# linear started 5.23. Started 18.41 - 1848!!! :)
svm.model.lin <- svm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends,
                     data = train.Frame, type = "C-classification", kernel = "linear",
                     probability = TRUE, cost = 100) # no gamma for linear

# Smaller Sample now
svmtest_rows = sample.int(nrow(cData), 4000)
svmtrain = cData[svmtest_rows,]
svmtest = cData[-svmtest_rows,]
svmtrain = data.frame(svmtrain)
svmtest = data.frame(svmtest)
obj <- tune.svm(dlq~., data=svmtrain, gamma = seq(.5, .9, by = .1), cost = seq(100, 1000, by = 100))
plot(obj)
# obj 0.218031

obj2 <- tune.svm(dlq~., data=svmtrain, gamma = seq(1, 1.5, by = .1), cost = seq(10, 150, by = 10))
plot(obj2)
obj2
# Gamma  Cost  Best Performance
# 1.1    100   0.2132932
# 1.3    130   0.2118273
# 1.3    80    0.2118273
# 1.3    120   0.2137122

obj3 <- tune.svm(dlq~., data=svmtrain, gamma = seq(.5, 1.5, by = .1), cost = seq(0, 300, by = 10))
plot(obj3)
obj3

# Radial
svm.model.rad <- svm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends,
                     data = train.Frame, type = "C-classification", kernel = "radial",
                     probability = TRUE, cachesize = 1000, cost = 130, gamma = 1.3)
# tuning results: 0.2118273 (gamma = 1.3, cost = 130)

sm = svm.model
# change kernels
# svm.model.lin / svm.model... PROBS
x = svm.model.sig
x = svm.model.pol
x = svm.model.lin
x = svm.model
# PLOT.SVM. (cData must be attached). maybe test[,3]
detach(test)
nrow(test)
nrow(age)
nrow(test$MonthlyIncome)
attach(cData)
plot(x, data=test.Frame, MonthlyIncome ~ age, # age~revol, MonthlyIncome ~ DebtRatio
     svSymbol = 1, dataSymbol = 2, fill = TRUE)

detach(cData)
#attach(cData)
# Outputs
svm.model
names(svm.model)
str(svm.model)
summary(svm.model)
? predict
predict(svm.model, test[101,]) # outputs a value of either 1 or 0
predict(svm.model, test)       # outputs a value of either 1 or 0
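A hedged sketch (not from the original script): tune.svm stores the winning settings, so the final radial model could be refit directly from obj2 instead of typing the cost and gamma in by hand; svm.model.best is an illustrative name.
# Illustrative refit using the parameters chosen by cross-validation in obj2
best.par <- obj2$best.parameters
svm.model.best <- svm(dlq ~ ., data = svmtrain,
                      type = "C-classification", kernel = "radial", probability = TRUE,
                      cost = best.par$cost, gamma = best.par$gamma)
obj2$best.performance # cross-validated error of the chosen settings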
# predsvm.lin / predsvm.pol / predsvm.reg.lin / predsvm.reg.rad / predsvm.
predsvm = svm.model.lin
predsvm <- predict(svm.model, test, probability = TRUE)
predsvm
# This bit needed for classification, not for regression, class-radial
#predsvm = attr(predsvm, "probabilities")[,2] # Converts into probabilities for the purposes of comparison with other models and ensemble model creation
plot(predsvm)
predsvm # Are these probabilities?
svs = svm.model$SV
svs
? attr

# Model Evaluation
predsvmlin = attr(predict(svm.model.lin, test, probability = TRUE), "probabilities")[,2] # Converts into probabilities...
predsvmrad = attr(predict(svm.model.rad, test, probability = TRUE), "probabilities")[,2] # Converts into probabilities...
predsvmsig = attr(predict(svm.model.sig, test, probability = TRUE), "probabilities")[,2] # Converts into probabilities...
#predsvmpol = attr(predict(svm.model.pol, test, probability = TRUE), "probabilities")[,2] # Converts into probabilities...
#predsvmregrad = attr(predict(svm.model.reg.rad, test, probability = TRUE), "probabilities")[,2] # Converts into probabilities...
#predsvmreglin = attr(predict(svm.model.reg.lin, test, probability = TRUE), "probabilities")[,2] # Converts into probabilities...

########
# ROCR #
########
##############
# Evaluation #
##############
# MODELS
fitTree
fitTree.sim
fitTree.sml
fitrf.reg
fitrf
fitnn
fitnn.2
svm.model
# Probs
pred.fitrf
fitnn.2.pred
x1   = predict(fitTree, test)[,2]
x1.2 = predict(fitTree.sim, test)[,2]
x1.3 = predict(fitTree.sml, test)
x2   = pred.fitrf
##-- x2.2 = predict(fitrf.reg, test)
x3   = predict(fitnn, test)
x3   = x3[,2] # This is needed for the regression NN, not class. See above. Oh wait, seems to be needed now for softmax.
x3.2 = fitnn.2.pred
x4.2 = attr(predict(svm.model.lin, test, probability = TRUE), "probabilities")[,2] # Converts into probabilities...
x4.3 = attr(predict(svm.model.rad, test, probability = TRUE), "probabilities")[,2] # Converts into probabilities...
x4.4 = attr(predict(svm.model.sig, test, probability = TRUE), "probabilities")[,2] # Converts into probabilities...
x4   = x4.3 # Radial C-classification performs best
##- x4.5 = attr(predict(svm.model.pol, test, probability = TRUE), "probabilities")[,2] # Converts into probabilities...
x5   = predict(fitLR, test, type = "response") # Logistic Regression

plot(x1~x4, col=(test$dlq+3), pch=(1), ylab="", xlab="", main = "") # x4 here is linear C-classification
##-- x4.2 = predsvm.rad
ensem   = (x1+x2+x3+x4+x5)/4
ensem.2 = (x1+x2+x3+x4+x5)/5
dev.off()
par(mfrow=c(3, 2))
# Comparing Model Results
plot(x1~x2, col=(test$dlq+3), pch=(1), ylab="Tree", xlab="Random Forest", main = "Tree vs. Random Forest")
legend("bottomright", c("dlq = 1","dlq = 0"), pch=1, col=c("blue","green"))
abline(a=0, b=0, h=NULL, v=0.5, col=4)
abline(a=0, b=0, h=0.5, v=NULL, col=12)

plot(x1~x3, col=(test$dlq+3), pch=(1), ylab="Tree", xlab="Neural Net", main = "Tree vs. Neural Net")
legend("bottomright", c("dlq = 1","dlq = 0"), pch=1, col=c("blue","green"))
abline(a=0, b=0, h=NULL, v=0.5, col=4)
abline(a=0, b=0, h=0.5, v=NULL, col=12)

plot(x1~x4, col=(test$dlq+3), pch=(1), ylab="Tree", xlab="Support Vector Machine", main = "Tree vs. Support Vector Machine")
legend("bottomright", c("dlq = 1","dlq = 0"), pch=1, col=c("blue","green"))
abline(a=0, b=0, h=NULL, v=0.5, col=4)
abline(a=0, b=0, h=0.5, v=NULL, col=12)

plot(x2~x3, col=(test$dlq+3), pch=(1), ylab="Random Forest", xlab="Neural Net", main = "Random Forest vs. Neural Net") # Correlated, but x3 has negatives. FML
legend("bottomright", c("dlq = 1","dlq = 0"), pch=1, col=c("blue","green"))
abline(a=0, b=0, h=NULL, v=0.5, col=4)
abline(a=0, b=0, h=0.5, v=NULL, col=12)

plot(x2~x4, col=(test$dlq+3), pch=(1), ylab="Random Forest", xlab="Support Vector Machine", main = "Random Forest vs. Support Vector Machine")
legend("bottomright", c("dlq = 1","dlq = 0"), pch=1, col=c("blue","green"))
abline(a=0, b=0, h=NULL, v=0.5, col=4)
abline(a=0, b=0, h=0.5, v=NULL, col=12)

plot(x3~x4, col=(test$dlq+3), pch=(1), ylab="Neural Net", xlab="Support Vector Machine", main = "Neural Net vs. Support Vector Machine")
legend("bottomright", c("dlq = 1","dlq = 0"), pch=1, col=c("blue","green"))
abline(a=0, b=0, h=NULL, v=0.5, col=4)
abline(a=0, b=0, h=0.5, v=NULL, col=12)

plot(x1~x1.2, col=(test$dlq+3), pch=(1), main="Tree vs. Other Tree") # Class vs. Regression Random Forests
plot(x1~x1.2, col=(test$dlq+3), pch=(1), main="Tree vs. Other Tree") # Class vs. Regression Random Forests
plot(x2~x2.2, col=(test$dlq+3), pch=(1), main="Class Random Forests vs. Regression Random Forests Predictions") # Class vs. Regression Random Forests
plot(x3~x3.2, col=(test$dlq+3), pch=(1), ylab="NN", xlab="NN", main = "Neural Net vs. ")
plot(fitnn.2.pred~x3.2, col=(test$dlq+3), pch=(1), ylab="NN", xlab="NN", main = "Neural Net vs. ")

plot(x1~x1.2, col=(test$dlq+3), pch=(1))
plot(x1~x1.3, col=(test$dlq+3), pch=(1))
plot(x1.2~x1.3, col=(test$dlq+3), pch=(1))
dev.off()

plot(x1~ensem, col=(test$dlq+3), pch=(1), main = "Tree vs. Ensemble")
plot(x2~ensem, col=(test$dlq+3), pch=(1), main = "Random Forest vs. Ensemble")
plot(x3~ensem, col=(test$dlq+3), pch=(1), main = "Neural Net vs. Ensemble")
plot(x4~ensem, col=(test$dlq+3), pch=(1), main = "Support Vector Machine vs. Ensemble")

# Confusion Matrices
table(data.frame(predicted=predict(fitTree, test) > 0.5, actual=test[,2]>0.5)) # this works
# that doesn't work, but this does
TREEmat = table(data.frame(predict(fitTree, test)[,2] > 0.5, actual=test[,2]>0.5)) # this works
RFmat   = table(data.frame(pred.fitrf > 0.5, actual=test[,2]>0.5)) # this works
NNmat   = table(data.frame(predicted=(predict(fitnn, test) > 0.5)[,2], actual=test[,2]>0.5)) # this works
SVMmat  = table(data.frame(predicted=predsvm > 0.5, actual=test[,2]>0.5)) # takes probabilities directly from above
LMmat   = table(data.frame(predict(fitLR, test) > 0.5, actual=test[,2]>0.5)) # this works
ENSmat  = table(data.frame(predicted=ensem.2 > 0.5, actual=test[,2]>0.5)) # takes probabilities directly from above

mat = ENSmat
fp.rate = mat[1,2]/(mat[1,1] + mat[1,2])
fn.rate = mat[2,1]/(mat[2,1] + mat[2,2])
mc.rate = (mat[1,2]+mat[2,1])/(mat[1,1]+mat[1,2]+mat[2,1]+mat[2,2])
rates = c(fp.rate, fn.rate, mc.rate)
mat
fp.rate
fn.rate
mc.rate
rates
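A short sketch (not in the original script): the same three rates computed for every model's confusion matrix in one pass, mirroring the formulas used above for ENSmat.
# Illustrative side-by-side comparison of false positive, false negative and misclassification rates
mats = list(Tree = TREEmat, RF = RFmat, NN = NNmat, SVM = SVMmat, LR = LMmat, Ensemble = ENSmat)
t(sapply(mats, function(mat) c(
  fp.rate = mat[1,2]/(mat[1,1] + mat[1,2]),
  fn.rate = mat[2,1]/(mat[2,1] + mat[2,2]),
  mc.rate = (mat[1,2] + mat[2,1])/sum(mat))))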
# class.ind gives two lists
# Prediction Probabilities
predTree = predict(fitTree, newdata = test, prob = "class")[,2]
predrf   = predict(fitrf, newdata = test, type = "prob")[,2] # prob = "class" for Class membership
prednn   = predict(fitnn, newdata = test, prob = "prob")
predsvm  = predict(svm.model, newdata = test)
##-- predsvm = attr(predict(svm.model, test, probability = TRUE), "probabilities")#[,2] # Converts into probabilities...

# Can not plot risk without Nominal value of each loan
par(mfrow=c(2, 2))
evalTree = evaluateRisk(predTree, test$dlq)
plotRisk(evalTree$Caseload, evalTree$Precision, evalTree$Recall, show.legend=TRUE)
evalrf = evaluateRisk(predrf, test$dlq)
plotRisk(evalrf$Caseload, evalrf$Precision, evalrf$Recall)
evalnn = evaluateRisk(prednn, test$dlq)
plotRisk(evalnn$Caseload, evalnn$Precision, evalnn$Recall, show.legend=TRUE)
evalsvm = evaluateRisk(predsvm, test$dlq)
plotRisk(evalsvm$Caseload, evalsvm$Precision, evalsvm$Recall)
dev.off()

###########
par(mfrow=c(2, 2))
# Box Plots for Model Evaluation
boxplot(predTree~test$dlq, col = "red", main = "Simple Tree")
boxplot(predrf~test$dlq, col = "green", main = "Random Forest")
boxplot(prednn[,2]~test[,2], col = "purple", main = "Neural Network")
boxplot(predsvm~test[,2], col = "orange", main = "Support Vector Machine")
##-- TREE NOT LOOKING GOOD!!!!!

# Prediction for ROCR
detach("package:neuralnet") # Needed because neuralnet has a predictions function
predTree = predict(fitTree.sim, newdata = test, prob = "class")
predicsTree = prediction(predTree[,2], test$dlq)
predicsRF   = prediction(predrf, test$dlq)
predicsNN   = prediction(prednn[,2], test$dlq)
predicsSVM  = prediction(predsvm, test$dlq)

predicsRF
predicsTree
predicsNN
predicsSVM
#str(predicsRF)
#predicsRF@fp
perfTree = performance(predicsTree, "tpr", "fpr") # ROC Curve
perfrf   = performance(predicsRF, "tpr", "fpr")   # ROC Curve
perfNN   = performance(predicsNN, "tpr", "fpr")   # ROC Curve
perfSVM  = performance(predicsSVM, "tpr", "fpr")  # ROC Curve
# QUICK ROCR CURVES
plot(perfTree, col="red")
plot(perfrf, add=TRUE, col="green")
plot(perfNN, add=TRUE, col="orange")
plot(perfSVM, add=TRUE, col="purple")
legend("bottomright", c("Tree","R. Forest","N. Net","SVM"), pch=1, col=c("red","green","orange","purple"))

predics = predicsTree # predicsRF, predicsTree, predicsNN, predicsSVM
perfo = performance(predics, "acc")
plot(perfo)
perfo = performance(predics, "tpr", "acc")
plot(perfo)
perfo = performance(predics, "err", "acc")
plot(perfo)
perfo = performance(predics, "lift", "rpp") # lift chart
plot(perfo)
perfo = performance(predics, "tpr", "rpp")
plot(perfo)

perfTree = performance(predicsTree, "lift", "rpp") # Lift chart
perfrf   = performance(predicsRF, "lift", "rpp")   # Lift chart
perfNN   = performance(predicsNN, "lift", "rpp")   # Lift chart
perfSVM  = performance(predicsSVM, "lift", "rpp")  # Lift chart

plot(performance(predics, "lift", "rpp"))
plot(perfTree, col="red")
plot(perfrf, add=TRUE, col="green")
plot(perfNN, add=TRUE, col="orange")
plot(perfSVM, add=TRUE, col="purple")
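A brief addition (not part of the original script): ROCR can also report the area under each ROC curve, which summarises the curves plotted above in a single number per model.
# Illustrative AUC values from the existing ROCR prediction objects
aucTree = performance(predicsTree, "auc")@y.values[[1]]
aucRF   = performance(predicsRF,   "auc")@y.values[[1]]
aucNN   = performance(predicsNN,   "auc")@y.values[[1]]
aucSVM  = performance(predicsSVM,  "auc")@y.values[[1]]
c(Tree = aucTree, RF = aucRF, NN = aucNN, SVM = aucSVM)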
#########################
# CREATING BINARY DATA  #
#########################
# DebtRatio
j = 0
pc = 0
num_with_zero = 0
num_with_one = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$DebtRatio[j]>=3.5) {
    bData$DebtRatio[j]=0
    num_with_zero = num_with_zero + 1
  } else {
    bData$DebtRatio[j]=1
    num_with_one = num_with_one + 1
  }
}
bData$DebtRatio
num_with_zero
num_with_one
pc = num_with_zero/upto
pc
boxplot(bData)

# revol
j = 0
pc = 0
num_with_zero = 0
num_with_one = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$revol[j]==0) {
    bData$revol[j]=0
    num_with_zero = num_with_zero + 1
  } else {
    bData$revol[j]=1
    num_with_one = num_with_one + 1
  }
}
num_with_zero
num_with_one
pc = num_with_zero/upto
pc
bData$revol

# norees
j = 0
pc = 0
num_with_zero = 0
num_with_one = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$norees[j]>=6) {
    bData$norees[j]=0
    num_with_zero = num_with_zero + 1
  } else {
    bData$norees[j]=1
    num_with_one = num_with_one + 1
  }
}
num_with_zero
num_with_one
pc = num_with_zero/upto
pc
bData$norees

# MonthlyIncome - 0/1
j = 0
pc = 0
num_with_zero = 0
num_with_one = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$MonthlyIncome[j]=='999') {
    bData$MonthlyIncome[j]=0
    num_with_zero = num_with_zero + 1
  } else {
    bData$MonthlyIncome[j]=1
    num_with_one = num_with_one + 1
  }
}
num_with_zero
num_with_one
pc = num_with_zero/upto
pc
bData$MonthlyIncome

# Depends
j = 0
pc = 0
num_with_zero = 0
num_with_one = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$depends[j]=='-999') {
    bData$depends[j]=0
    num_with_zero = num_with_zero + 1
  } else {
    bData$depends[j]=1
    num_with_one = num_with_one + 1
  }
}
num_with_zero
num_with_one
pc = num_with_zero/upto
pc
bData$depends

# ntimeslate
boxplot(bData)
j = 0
pc = 0
num_with_zero = 0
num_with_one = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$ntimeslate[j]==0) {
    bData$ntimeslate[j]=0
    num_with_zero = num_with_zero + 1
  } else {
    bData$ntimeslate[j]=1
    num_with_one = num_with_one + 1
  }
}
num_with_zero
num_with_one
pc = num_with_zero/upto
pc
bData$ntimeslate

# times1
j = 0
pc = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$times1[j]<80) {
    bData$times1[j]=0
  } else {
    bData$times1[j]=1
  }
}
bData$times1

# ntimes2
j = 0
pc = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$ntimes2[j]<80) {
    bData$ntimes2[j]=0
  } else {
    bData$ntimes2[j]=1
  }
}
bData$ntimes2

bData = bData[-6]
bData = bData[-5]
bData = bData[-2]
bData = bData[-6]

boxplot(bData)
names(bData)
#############################################
##########plot(cData$times1~cData$revol)#####
# bData
set.seed(12345)
Btest_rows = sample.int(nrow(bData), nrow(bData)/3)
Btest = bData[Btest_rows,]
Btrain = bData[-Btest_rows,]
# Simpler Tree - CP is
fitTree.binary = rpart(dlq ~ age + times1 + noloansetc + ntimeslate + ntimes2 + depends
                       , data = Btrain
                       , parms = list(split="gini")
                       , method = "class")
draw.tree(fitTree.binary, cex=.8, pch=2, size=2.5, nodeinfo = FALSE, cases = "obs")

#######################
# Univariate Graphing #
#######################
attach(cData)
par(mfrow=c(1,2))

# REVOL!
hist(revol, col= "red", main="revol", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(revol), col="black", lwd = 2, lty = 1)
lines(density(c1Data$revol), col="red", lwd = 2, lty = 2)
lines(density(c0Data$revol), col="green", lwd = 2, lty = 2)
#legend("topright",c("dlq = 1","dlq = 0"),pch=1,col=c("red","green"))

# norees
hist(norees, col= "red", main="norees", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(norees), col="black", lwd = 2, lty = 1)
lines(density(c1Data$norees), col="red", lwd = 2, lty = 2)
lines(density(c0Data$norees), col="green", lwd = 2, lty = 2)
#legend("topright",c("dlq = 1","dlq = 0"),pch=1,col=c("red","green"))

dev.off()
attach(cData)
par(mfrow=c(1,2))
names(cData)
# REVOL!
hist(revol, col= "red", main="revol", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(revol), col="black", lwd = 2, lty = 1)
lines(density(c1Data$revol), col="red", lwd = 2, lty = 2)
lines(density(c0Data$revol), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0","All"), lwd = 3, col=c("red","green",1))

# AGE
hist(age, col= "red", main="age", xlab="age", ylab="Density", freq=FALSE)
lines(density(age), col="black", lwd = 2, lty = 1)
lines(density(c1Data$age), col="red", lwd = 2, lty = 2)
lines(density(c0Data$age), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))
# times1
hist(times1, col= "red", main="times1", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(times1), col="black", lwd = 2, lty = 1)
lines(density(c1Data$times1), col="red", lwd = 2, lty = 2)
lines(density(c0Data$times1), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# DebtRatio
hist(DebtRatio, col= "red", main="DebtRatio", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(DebtRatio), col="black", lwd = 2, lty = 1)
lines(density(c1Data$DebtRatio), col="red", lwd = 2, lty = 2)
lines(density(c0Data$DebtRatio), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# MonthlyIncome
hist(MonthlyIncome, col= "red", main="MonthlyIncome", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(MonthlyIncome), col="black", lwd = 2, lty = 1)
lines(density(c1Data$MonthlyIncome), col="red", lwd = 2, lty = 2)
lines(density(c0Data$MonthlyIncome), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# noloansetc
hist(noloansetc, col= "red", main="noloansetc", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(noloansetc), col="black", lwd = 2, lty = 1)
lines(density(c1Data$noloansetc), col="red", lwd = 2, lty = 2)
lines(density(c0Data$noloansetc), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# ntimeslate
hist(ntimeslate, col= "red", main="ntimeslate", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(ntimeslate), col="black", lwd = 2, lty = 1)
lines(density(c1Data$ntimeslate), col="red", lwd = 2, lty = 2)
lines(density(c0Data$ntimeslate), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# norees
hist(norees, col= "red", main="norees", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(norees), col="black", lwd = 2, lty = 1)
lines(density(c1Data$norees), col="red", lwd = 2, lty = 2)
lines(density(c0Data$norees), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# ntimes2
hist(ntimes2, col= "red", main="ntimes2", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(ntimes2), col="black", lwd = 2, lty = 1)
lines(density(c1Data$ntimes2), col="red", lwd = 2, lty = 2)
lines(density(c0Data$ntimes2), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# depends
hist(depends, col= "red", main="depends", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(depends), col="black", lwd = 2, lty = 1)
lines(density(c1Data$depends), col="red", lwd = 2, lty = 2)
lines(density(c0Data$depends), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))
detach(cData)

# Univariate
summary(cData)
mean(cData)
sd(cData)
var(cData)
summary(cData)
table(cData$dlq)

##############
# Univariate #
##############
par(mfrow=c(1, 1))
for(i in 2:12) {
  boxplot(cData[,i], xlab=names(cData)[i], main=names(cData)[i])
}

# Bivariate
plot(cData$MonthlyIncome~cData$age)
for(j in 2:11) {
  for(i in 3:12) {
    plot(cData[,j]~cData[,i], main="Plot", xlab=names(cData)[i], ylab=names(cData[j]))
  }
}
pairs(cData, col=as.integer(cData$dlq))
? pairs
# This could be creative. Surely older people are better at paying off loans and have higher monthly income
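A rough numerical check of that intuition (not in the original script), using simple correlations on the cleaned data; cData is as defined earlier in the appendix.
# Illustrative only: correlations behind the comment above
cor(cData$age, cData$MonthlyIncome, use = "complete.obs")
cor(cData$age, as.numeric(as.character(cData$dlq)), use = "complete.obs")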
########################
# Noel OBoyle Function #
########################
importance <- function(mytree) {
  # Calculate variable importance for an rpart classification tree
  # NOTE!! The tree *must* be based upon data that has the response (a factor)
  # in the *first* column
  # Returns an object of class 'importance.rpart'
  # You can use print() and summary() to find information on the result
  delta_i <- function(data, variable, value) {
    # Calculate the decrease in impurity at a particular node given:
    # data -- the subset of the data that 'reaches' a particular node
    # variable -- the variable to be used to split the data
    # value -- the 'split value' for the variable
    current_gini <- gini(data[,1])
    size <- length(data[,1])
    left_dataset <- eval(parse(text=paste("subset(data,", paste(variable,"<",value), ")")))
    size_left <- length(left_dataset[,1])
    left_gini <- gini(left_dataset[,1])
    right_dataset <- eval(parse(text=paste("subset(data,", paste(variable,">=",value), ")")))
    size_right <- length(right_dataset[,1])
    right_gini <- gini(right_dataset[,1])
    # print(paste(" Gini values: current=",current_gini,"(size=",size,") left=",left_gini,
    #             "(size=",size_left,"), right=",right_gini,"(size=",size_right,")"))
    current_gini*size - length(left_dataset[,1])*left_gini - length(right_dataset[,1])*right_gini
  }
  gini <- function(data) {
    # Calculate the gini value for a vector of categorical data
    numFactors = nlevels(data)
    nameFactors = levels(data)
    proportion = rep(0, numFactors)
    for (i in 1:numFactors) {
      proportion[i] = sum(data==nameFactors[i])/length(data)
    }
    1 - sum(proportion**2)
  }
  frame <- mytree$frame
  splits <- mytree$splits
  allData <- eval(mytree$call$data)
  output <- ""
  finalAnswer <- rep(0, length(names(allData)))
  names(finalAnswer) <- names(allData)
  d <- dimnames(frame)[[1]]
  # Make this vector of length = the max nodeID
  # It will be a lookup table from frame-->splits
  index <- rep(0, as.integer(d[length(d)]))
  total <- 1
  for (node in 1:length(frame[,1])) {
    if (frame[node,]$var!="<leaf>") {
      nodeID <- as.integer(d[node])
      index[nodeID] <- total
      total <- total + frame[node,]$ncompete + frame[node,]$nsurrogate + 1
    }
  }
  for (node in 1:length(frame[,1])) {
    if (frame[node,]$var!="<leaf>") {
      nodeID <- as.integer(d[node])
      output <- paste(output,"Looking at nodeID:",nodeID,"\n")
      output <- paste(output," (1) Need to find subset","\n")
      output <- paste(output,"     Choices made to get here:...","\n")
      data <- allData
      if (nodeID%%2==0) symbol <- "<" else symbol <- ">="
      i <- nodeID%/%2
      while (i>0) {
        output <- paste(output,"     Came from nodeID:",i,"\n")
        variable <- dimnames(splits)[[1]][index[i]]
        value <- splits[index[i],4]
        command <- paste("subset(allData,",variable,symbol,value,")")
        output <- paste(output,"     Applying command",command,"\n")
        data <- eval(parse(text=command))
        if (i%%2==0) symbol <- "<" else symbol <- ">="
        i <- i%/%2
      }
      output <- paste(output,"     Size of current subset:",length(data[,1]),"\n")
      output <- paste(output," (2) Look at importance of chosen split","\n")
      variable <- dimnames(splits)[[1]][index[nodeID]]
      value <- splits[index[nodeID],4]
      best_delta_i <- delta_i(data,variable,value)
      output <- paste(output,"     The best delta_i is:",format(best_delta_i,digits=3),
                      "for",variable,"and",value,"\n")
      finalAnswer[variable] <- finalAnswer[variable] + best_delta_i
      output <- paste(output,"     Final answer: ",paste(finalAnswer,collapse=" "),"\n")
      output <- paste(output," (3) Look at importance of surrogate splits","\n")
      ncompete <- frame[node,]$ncompete
      nsurrogate <- frame[node,]$nsurrogate
      if (nsurrogate>0) {
        start <- index[nodeID]
        for (i in seq(start+ncompete+1, start+ncompete+nsurrogate)) {
          variable <- dimnames(splits)[[1]][i]
          value <- splits[i,4]
          best_delta_i <- delta_i(data,variable,value)
          output <- paste(output,"     The best delta_i is:",format(best_delta_i,digits=3),
                          "for",variable,"and",value,"and agreement of",splits[i,3],"\n")
          finalAnswer[variable] <- finalAnswer[variable] + best_delta_i*splits[i,3]
          output <- paste(output,"     Final answer: ",paste(finalAnswer[2:length(finalAnswer)],collapse=" "),"\n")
        }
      }
    }
  }
  result <- list(result=finalAnswer[2:length(finalAnswer)], info=output)
  class(result) <- "importance.rpart"
  result
}
print.importance.rpart <- function(self) {
  print(self$result)
}
summary.importance.rpart <- function(self) {
  cat(self$info)
}
## wew

confus.fun = function(x) { # x = fitnn
  confus.mat = table(data.frame(predicted=predict(x, test) > 0.5, actual=test[,2]>0.5))
  false.neg = confus.mat[1,2] / (confus.mat[1,2] + confus.mat[1,1])
  false.pos = confus.mat[2,1] / (confus.mat[2,1] + confus.mat[2,2])
  confus.mat
  false.neg
  false.pos
  cat("Confusion Matrix: ", "\n",
      "FALSE NEGATIVE: ", false.neg, "\n",
      "FALSE POSITIVE: ", false.pos, "\n")
}
confus.fun(fitnn)
confus.fun(svm.model)

# AIC
findAIC = function(a, b, c) {
  p = b
  k = c # ncol(train)
  SSE = sum(a$residuals^2)
  AIC = 2*k + p*log(SSE/p)
  AIC
  #SBC = p*log(n) + n*log(SSE1/n)
  #SBC
}
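A brief usage sketch (not in the original script) for the helpers above. The data frame treeData and the tree fitTree.imp are illustrative names; importance() requires the response, as a factor, to be the first column of the data the tree was grown on.
# Hypothetical example: variable importance for a tree grown on the binary data
treeData     = Btrain[, c("dlq", setdiff(names(Btrain), "dlq"))]
treeData$dlq = as.factor(treeData$dlq)
fitTree.imp  = rpart(dlq ~ ., data = treeData, method = "class")
imp = importance(fitTree.imp)
print(imp)    # importance scores
summary(imp)  # verbose trace of the calculation
findAIC(fitnn, nrow(train), 8) # AIC of the nnet fit from earlier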