Access and Learning Study: Steps Followed in Machine Learning Methods for Predicting Dropouts (for RQc)
Kenya
One drawback of using the semi-public household surveys is that they are collected infrequently – specifically, while the most recent monitoring data was from 2019, the most recent training data from MICS and DHS was from 2014. This creates a potential bias if the correlates of drop-out changed between 2014 and 2019.
An additional challenge is that the training and prediction data do not cover the same counties. The MICS data used is not nationally representative, but only includes households from Turkana, Kakamega, and Bungoma counties. We combined the data from these three counties to form one dataset to use for training. The DHS data covers all counties in the country. By contrast, the EDT project works in seven counties, one of which is Turkana. One option was to restrict the sample to only the overlapping counties. However, that would have made the sample size too small, and we would not have been able to predict for EDT counties not present in MICS. Hence, we combined the data from all available counties to form the training set.
Nepal
We originally aimed to use both DHS and MICS for the training data, but focussed on just the MICS data, as it was collected in 2019. In Nepal, we restricted our sample to include only students at secondary level, because the project's girls were all at secondary level. However, this left us with the smallest training dataset (fewer than 1,500 cases). There was a trade-off between having a large number of training samples and having relevant samples – our hypothesis 5 was that only including girls with similar characteristics would make the prediction more accurate.
Step 3. Feature engineering: We start with a long list of questions from the MICS/DHS surveys, saved in a dataset with each characteristic stored in different ways. For example, a question with five options may be stored as five columns rather than one column with multiple choices. To extract information from the data, we perform feature engineering, which usually follows two steps. First is feature transformation, where textual answers are converted to numbers. This step is crucial because all algorithms work on numbers, and higher numbers signify higher weights. The weights can be assigned using knowledge from the literature about which characteristics are more predictive of the outcome. Next is feature creation, the process of creating new features – for example, wealth indexes.
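A minimal sketch of these two steps is shown below, using pandas. The column names (water source indicators, asset ownership, land ownership) are hypothetical stand-ins for the actual MICS/DHS variable names, which differ by survey round.

```python
import pandas as pd

# Illustrative sketch only; actual MICS/DHS column names differ by survey round.
df = pd.read_csv("mics_household_members.csv")

# Feature transformation (1): a yes/no textual answer converted to a number.
df["owns_land"] = df["owns_land"].map({"yes": 1, "no": 0})

# Feature transformation (2): a question stored across five 0/1 columns is
# collapsed into a single numeric feature.
water_cols = ["water_piped", "water_well", "water_rain", "water_surface", "water_other"]
df["water_source"] = df[water_cols].idxmax(axis=1).astype("category").cat.codes

# Feature creation: a simple asset-count wealth index as one example of
# building a new feature from existing ones.
asset_cols = ["has_radio", "has_tv", "has_phone", "has_bicycle"]
df["wealth_index"] = df[asset_cols].sum(axis=1)
```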
For the feature engineering, we first had to decide on a definition of students at-risk of dropout. This was done slightly differently for Kenya and Nepal, using two main variables:
• Whether a child attends school in the current year
• Whether a child attends school in the previous year
For Kenya, we tested a broad and a strict definition of at-risk, based on whether a child attended school in the current and/or previous year.
For the strict definition, a child is at-risk only if he/she attended school in the previous year but is no longer attending this year.
For the broad definition, a child is at-risk if he/she answers no to either question. This means we include those who attended in neither year, as well as those who did not attend the previous year but attend this year. To ensure that we do not wrongly categorise children who have just started school, we recategorise children aged 5 and 6 who attend this year but not last year as not at-risk.
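The sketch below encodes these two Kenya labels, assuming 0/1 indicator columns for attendance in the current and previous year and an age column (the column names are illustrative, not the actual survey variable names).

```python
import pandas as pd

def add_at_risk_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Add strict and broad at-risk labels; column names are illustrative."""
    # Strict: attended last year but is no longer attending this year.
    df["at_risk_strict"] = (
        (df["attended_last_year"] == 1) & (df["attends_now"] == 0)
    ).astype(int)

    # Broad: answered "no" to either attendance question.
    df["at_risk_broad"] = (
        (df["attends_now"] == 0) | (df["attended_last_year"] == 0)
    ).astype(int)

    # Children aged 5-6 who attend this year but not last year are new
    # entrants, so they are recategorised as not at-risk.
    new_entrant = (
        df["age"].isin([5, 6])
        & (df["attends_now"] == 1)
        & (df["attended_last_year"] == 0)
    )
    df.loc[new_entrant, "at_risk_broad"] = 0
    return df
```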
For Nepal, instead of having simple 'at-risk' and 'not at-risk' definitions, we defined students' status in three categories: progress, repeat, and dropout. This means that the model tries to predict these three outcomes for the girls. These definitions were adopted because we focused on the transition of students from Lower to Higher Secondary, in line with the project's sample of girls. Hence, it is important to separate out students who repeat Lower Secondary from those who progress to the next level of education. In our training data, we use four main variables from the MICS data, namely:
• Whether a child attends school in the current year
• Whether a child attends school in the previous year
• Grade level that a child attends in the current school year
• Grade level that a child attends in the previous school year
5 This is a testable hypothesis which we looked at, in part, in Kenya, where we trained using all age ranges and a restricted sample. We discuss this more later.
A child is progressing if he/she attends school in both years and advances by at least one grade level in the current year. A child is a repeater if he/she attends school in both years but remains in the same grade level. Lastly, a child is a dropout if he/she did not attend school in the current year (they may or may not have attended the previous school year, as long as they enrolled in school at some point). We dropped the cases where a child attended school in the current year but not last year (i.e. has re-enrolled).
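A sketch of this three-way labelling is shown below, again using illustrative column names for the attendance and grade variables rather than the actual MICS variable names.

```python
import numpy as np

def label_outcome(row):
    """Assign the Nepal outcome label; column names are illustrative."""
    if row["attends_now"] == 0:
        return "dropout"    # not attending this year (enrolled at some point)
    if row["attended_last_year"] == 0:
        return np.nan       # re-enrolled this year: dropped from training
    if row["grade_now"] > row["grade_last_year"]:
        return "progress"   # attends both years, moved up at least one grade
    return "repeat"         # attends both years, same grade level

# df is the MICS training dataframe built in the previous steps.
df["outcome"] = df.apply(label_outcome, axis=1)
df = df.dropna(subset=["outcome"])
```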
Other than the at-risk feature, we also put weights on the categorical variables based on our domain knowledge, and on the results from the Kenya modelling, of which groups are more at-risk of dropping out of school. For instance, a head of household having no schooling should have a higher weight than a head of household having a university education, as the former makes the student more at-risk of dropping out. The same applies to marital status, as being married puts girls at higher risk of dropping out. Additionally, we created an overage variable based on age and grade.
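The sketch below illustrates this weighting and the overage variable. The category labels, weight values, and the assumed official entry age of 6 are illustrative assumptions, not the values used in the study.

```python
# Domain-knowledge weights (illustrative): higher numbers mark groups at
# greater risk of dropping out.
head_edu_weights = {"university": 0, "secondary": 1, "primary": 2, "none": 3}
marital_weights = {"never married": 0, "married": 1}

df["head_education_w"] = df["head_education"].map(head_edu_weights)
df["marital_status_w"] = df["marital_status"].map(marital_weights)

# Overage: how many years older the child is than expected for their current
# grade, assuming grade 1 is normally entered at age 6 (an assumption here).
ENTRY_AGE = 6
df["overage"] = df["age"] - (df["grade_now"] + ENTRY_AGE - 1)
```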
Step 4. Algorithm selection and training: Many possible algorithms are available to researchers. In practice, choosing which algorithm to use is done by trialling several and observing their performance. The data is partitioned into train, test, and validation sets prior to training so that the resulting algorithm can be compared and tested on data it has not learned from. Once we have a trained algorithm, it can be tested with the data we set aside before training. The performance of the model is assessed through the confusion matrix, the classification report, and feature analysis. These are reviewed to see whether the algorithm suits our purposes. If the performance was not satisfactory, we adjusted the parameters and reran the process.
After the algorithm was trained, we reviewed its feature analysis and performance metrics (accuracy, precision, recall, and F1 score) so that the features could be fine-tuned, adjusted, and reanalysed.
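The sketch below shows the partitioning and review step described above. Here X and y stand for the engineered features and labels from the earlier steps, the 70/15/15 split proportions are assumptions for illustration, and the classifier is a placeholder for whichever candidate is being trialled.

```python
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Partition into train, validation, and test sets (split sizes illustrative).
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

# Placeholder candidate; any of the algorithms listed below could be used.
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Confusion matrix and classification report (accuracy, precision, recall,
# F1 score) on the held-out validation data.
print(confusion_matrix(y_val, model.predict(X_val)))
print(classification_report(y_val, model.predict(X_val)))
```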
We tested seven different algorithms for each model, all from the scikit-learn (sklearn) Python package (see the comparison sketch after this list):
• Logistic Regression
• Linear Discriminant Analysis
• K-Nearest Neighbours
• Decision Tree
• Random Forest
• Gaussian Naive Bayes (GaussianNB)
• Support Vector Machine (SVC)
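A sketch of the trial-and-compare step over these seven candidates is shown below, reusing X_train and y_train from the partitioning sketch above. The default hyperparameters and the F1-based cross-validation scoring are assumptions for illustration, not the tuning used in the study.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# The seven candidate classifiers, with default settings for illustration.
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    "K-Nearest Neighbours": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gaussian Naive Bayes": GaussianNB(),
    "SVM (SVC)": SVC(),
}

# Score each candidate with cross-validation on the training set and compare.
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1_macro")
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```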
The Random Forest algorithm performs best for all our models. This is likely due to the unbalanced nature of the data and the relatively small feature set, combined with the power of many decision trees participating in the prediction process.
While simpler models, such as logistic regression, were easier to interpret, they had lower predictive power and did not perform well on our data.
Step 5. Make predictions: Here we ran the trained algorithm on the prediction dataset.
Step 6. Review predictions
The algorithm was tested with the data we set aside before training, and we obtained a list of girls with their predictions. This list, with the predictions themselves removed, was shared with the data collection partners for roll call (see below).
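A minimal sketch of these last two steps is shown below, assuming `model` is the trained classifier and that `X_predict` and `girls` hold, respectively, the engineered features and the identifying details of the girls in the prediction dataset (all names are illustrative).

```python
# Score the prediction dataset with the trained model.
girls = girls.copy()
girls["predicted_at_risk"] = model.predict(X_predict)

# Internal file keeps the predictions for our own review.
girls.to_csv("predictions_internal.csv", index=False)

# The list shared with data collection partners for roll call excludes the
# predictions themselves.
girls.drop(columns=["predicted_at_risk"]).to_csv("roll_call_list.csv", index=False)
```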
Roll-call validation and qualitative work (primary data collection)
We conducted primary data collection to answer two questions:
• How accurate were the predictions from the machine learning model?