Annex 10 - Steps Followed in Machine Learning Methods for Predicting Dropouts (for RQc)
This Annex describes the steps taken by the IE team to apply machine learning in order to help answer RQ c: How useful is a machine learning approach in identifying the girls who are most ‘at risk’ of not returning to school?
The findings are summarised in the main document.
Machine learning has been applied to a broad range of problems. It extends predictive modelling by adding a review and reprocessing loop that allows the algorithms to ‘learn’ from previous models. In our case, we are limited to one wave of predictions and validations. We are therefore using key aspects of machine learning methods to try to improve the accuracy of predictions of which girls will drop out of school, rather than testing a true machine learning model (which would require multiple waves of review and reprocessing to fine-tune the algorithms). With that caveat, this note provides further detail on the technical steps taken and the key decisions made in the process.
At the highest level, we followed a standard series of steps to move from data collection, through processing and training, to predictions. These six steps are shown in Figure 1.
Step 1. Data collection: The first step in developing a machine learning system is to identify the source of the data to be used. We used secondary data for both training and prediction: semi-public household survey data to increase the sample size for training, and more targeted project data for prediction.
To answer our question, we used data from two sources that provided information on the characteristics of school-aged girls: public household surveys for training the machine learning models (which had information on girls’ characteristics and their drop-out status); and GEC project data (which had girls’ characteristics, but not their drop-out status), for which we could predict drop-out status (Table 1).
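The split between the two sources can be illustrated with a minimal sketch: a model is fitted on records that include drop-out status, then applied to records that have the same characteristics but no drop-out status. Everything in the sketch is an assumption made for illustration. The data are synthetic, the two features (`older_than_14`, `poorest_quintile`) are hypothetical, and a plain logistic regression stands in for whichever algorithm was actually used, which this Annex section does not specify.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(rows, labels, epochs=500, lr=0.5):
    """Fit a logistic regression by batch gradient descent; returns (weights, bias)."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this record under the current parameters.
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_proba(rows, weights, bias):
    """Predicted probability of drop-out for each record."""
    return [sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) for x in rows]


# SYNTHETIC stand-in for household survey records (training data).
# Hypothetical features per girl: [older_than_14, poorest_quintile].
random.seed(0)
survey_rows, survey_dropout = [], []
for _ in range(400):
    x = [random.randint(0, 1), random.randint(0, 1)]
    p_true = sigmoid(-2.0 + 1.5 * x[0] + 1.5 * x[1])  # invented relationship
    survey_rows.append(x)
    survey_dropout.append(1 if random.random() < p_true else 0)

# Train on the survey data, where drop-out status is known.
weights, bias = train_logistic(survey_rows, survey_dropout)

# SYNTHETIC stand-in for project records: same features, no drop-out status.
project_rows = [[0, 0], [1, 0], [1, 1]]
project_risk = predict_proba(project_rows, weights, bias)

for row, risk in zip(project_rows, project_risk):
    print(row, round(risk, 2))
```

The point of the sketch is the workflow rather than the model: the survey data supply the labels needed for training, and the project data receive predicted risk scores that can be validated later.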
For the training data, we used datasets from the Multiple Indicator Cluster Survey (MICS) and Demographic and Health Surveys (DHS), which capture both the status of schooling (to proxy drop-outs) and a range of household characteristics.¹ These datasets have several characteristics that make them suitable for use in machine learning models: the data are at the child level, are relatively large (at least 6,000 records), and contain demographic and socioeconomic features similar to those in the project data. Additionally, because both MICS and DHS operate in many different countries, the same process can be applied elsewhere if needed.
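These suitability criteria can be expressed as a simple check, sketched below. The column names are hypothetical, and treating 6,000 records as a hard minimum is an assumption made for illustration; the text above reports it as a property of the datasets used, not as a formal threshold.

```python
# Illustrative threshold, assumed from the "at least 6,000 records" description.
MIN_RECORDS = 6000


def shared_features(training_columns, project_columns, label):
    """Features present in both datasets, excluding the label, in sorted order."""
    return sorted((set(training_columns) & set(project_columns)) - {label})


def is_suitable(n_records, training_columns, project_columns, label):
    """True if the training data are large enough, contain the label,
    and share at least one feature with the project data."""
    return (
        n_records >= MIN_RECORDS
        and label in training_columns
        and len(shared_features(training_columns, project_columns, label)) > 0
    )


# Hypothetical column names, for illustration only.
survey_columns = ["age", "wealth_quintile", "region", "dropout"]
project_columns = ["age", "wealth_quintile", "school_id"]

print(shared_features(survey_columns, project_columns, "dropout"))
print(is_suitable(6500, survey_columns, project_columns, "dropout"))
```

Only the shared features can be used for training, because the model must later be applied to project records that contain no other columns.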