
II Access and Learning Study: Steps Followed in Machine Learning Methods for Predicting Dropouts (for RQc)

Table 1: Summary of data sources used

Kenya (EDT) - training data: MICS 2013-2014 (5th round of survey) and DHS 2014; prediction data: EDT monitoring data 2019.

Nepal (Mercy Corps) - training data: MICS 2019 (6th round of survey); prediction data: Mercy Corps midline survey 2018.

For the prediction data, we used the selected GEC projects’ monitoring data, with the long-run view of integrating any findings within the projects’ monitoring systems. This data contains the demographic information needed to match with the MICS/DHS training data, and also allows us to identify girls for validation purposes. The monitoring data collected by each project differs in terms of the features collected, as well as the scale of the project. In general, the monitoring data includes basic demographic information as well as a girl identifier.

In the case of EDT, we received project monitoring data that included all girls in the project, with their most recent demographic data collected in 2019. This data contained over 28,000 cases. In the case of Mercy Corps, we were given data on girls’ club attendance, which the project uses partly to flag at-risk girls (when a student accumulates a certain number of days of unexplained absences), as well as the project’s midline external evaluation data. We ended up using this midline data for prediction as it included attendance, dropout information, and demographic data.

Step 2. Data pre-processing: Prior to running models, we had to clean the data and ensure that it was structured in the right way for analysis. Additional work was required to make the data well-structured for the algorithms; this included dealing with missing values, outliers, wrongly labelled examples, and skewed data.
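To make these cleaning steps concrete, here is a minimal sketch in Python/pandas. The column names ('dropout', 'age', 'grade', 'household_size', 'gender') and the specific rules are illustrative assumptions, not the project's actual pipeline.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning: missing values, outliers, inconsistent labels."""
    df = df.copy()

    # Missing values: drop rows with no outcome; impute a numeric feature with the median.
    df = df.dropna(subset=["dropout"])
    df["household_size"] = df["household_size"].fillna(df["household_size"].median())

    # Outliers / implausible values: keep the school-age range used in the study (5-17).
    df = df[df["age"].between(5, 17)]

    # Wrongly labelled examples: e.g. rows marked "in school" but with no grade recorded.
    df = df[~((df["dropout"] == 0) & (df["grade"].isna()))]

    # Standardise categorical codes so different sources agree (e.g. 'F' -> 'female').
    df["gender"] = (
        df["gender"].astype(str).str.strip().str.lower().replace({"f": "female", "m": "male"})
    )
    return df
```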

We used public data to train our models and project data for prediction, so the first step was to ascertain the ‘features’ that were common across the datasets. Ultimately, we had to work with the lowest common denominator in terms of the data features we could train against. In this case, the prediction dataset had fewer features available than the training data (DHS and MICS), so we limited both sets of data to the features of the prediction set.

We then cleaned the datasets so that the features matched across the training and prediction sets. For Kenya, as we used two training datasets (the MICS data and the DHS data), we first did this separately for MICS and DHS. Variables that were the same (or closely related) were mapped and kept for training. In cases where the variables were coded differently, we recoded them so that they were in the same format.3 Lastly, we combined the MICS and DHS data, deleting variables that were present in one but not the other.
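The feature matching and recoding described here (including the household-size example in footnote 3) could look roughly like the sketch below. The column names, band edges, and the 'dropout' label are assumptions for illustration, not the study's actual code.

```python
import pandas as pd


def to_size_band(n: float) -> str:
    # Recode a numeric household size into the banded format used by the monitoring
    # data (footnote 3 gives "4-6 people" as an example band; the edges here are assumed).
    if n <= 3:
        return "1-3"
    if n <= 6:
        return "4-6"
    return "7+"


def harmonise(mics: pd.DataFrame, dhs: pd.DataFrame, monitoring: pd.DataFrame,
              label: str = "dropout"):
    # Recode survey variables so they share the monitoring data's format.
    for df in (mics, dhs):
        df["household_size"] = df["household_size"].apply(to_size_band)

    # Combine MICS and DHS, keeping only variables present in both (join="inner").
    training = pd.concat([mics, dhs], join="inner", ignore_index=True)

    # Limit both sets to the features available in the prediction (monitoring) data.
    common = [c for c in monitoring.columns if c in training.columns and c != label]
    return training[common + [label]], monitoring[common]
```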

In Nepal, the MICS data included a wider range of information, including disability and work/chores, which was not present in the MICS data from Kenya. This allowed the training set in Nepal to have a wider set of features than in Kenya. The features we included in the final model for both countries were age, grade, overage, education level of the head of household, household size, gender, marital status, and whether the family owns agricultural land.

For Nepal, the additional features included language, the kinds of disabilities the students had, whether the family lived in a rural or urban area, the number of children in the household, the religion and language of the head of household, roof material, whether the girl had given birth, and the number of children she had. For Kenya, one additional feature was whether the family owned animals.

For both countries, we limited the training data to the sub-sample of children and youth aged 5-17, as they were school-aged. Further restrictions were made for some countries to best fit the situation at hand (discussed below).

Issues with data and our decisions

In our data, there are more children of basic education age who are in school than out of school, meaning that our dataset is highly imbalanced, which can affect algorithm performance. To deal with this, we used the technique known as oversampling to increase the percentage of dropouts (i.e. the “at risk” rows) from 5% of the data up to about 20%.4
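As an illustration of the simple-duplication oversampling described here (and in footnote 4), the sketch below replicates minority-class rows until they make up roughly 20% of the training data. The 'dropout' column name, the target share, and the helper name are hypothetical.

```python
import pandas as pd


def oversample_by_duplication(train: pd.DataFrame, label: str = "dropout",
                              target_share: float = 0.20,
                              random_state: int = 42) -> pd.DataFrame:
    """Duplicate minority-class rows so they make up ~target_share of the data."""
    minority = train[train[label] == 1]
    majority = train[train[label] == 0]

    # Minority rows needed so that minority / (minority + majority) ~= target_share.
    needed = int(target_share * len(majority) / (1 - target_share))
    extra = needed - len(minority)
    if extra <= 0:
        return train

    duplicates = minority.sample(n=extra, replace=True, random_state=random_state)
    combined = pd.concat([majority, minority, duplicates], ignore_index=True)
    return combined.sample(frac=1, random_state=random_state)  # shuffle rows
```

The alternative mentioned in footnote 4, SMOTE, would generate synthetic minority rows rather than exact copies; the authors report that with the amount of data available, simple duplication worked better.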

2 Different datasets are available for each country in different years.

3 An example of this is the variable ‘Household size’. In MICS, household size is recorded as a number ranging from 1 to 17, whereas in the monitoring data it is recorded in ranges (such as 4-6 people). We grouped the MICS data to match the monitoring data.

4 We also tried a technique other than simple duplication, called SMOTE (synthetic minority oversampling technique), which uses an algorithm to generate new records based on existing records; however, we did not have enough data to make this work well, so simple duplication was better.
