Annex 2: Process for Creating the Unique Dataset and Additional Methodological Details
The unique baseline-midline dataset combines data from four instruments: i) the EGRA, EGMA, SeGRA and/or SeGMA Learning Assessments (LA); ii) the Girl Surveys (GS); iii) the Household Surveys (HHS); and iv) the Primary Caregiver Surveys (PCG).
Creating it followed a six-step process:
1. Map the raw datasets available for each GEC-T project;
2. Merge the datasets for each project using girls’ unique identifiers, separately at baseline and midline;
3. Append the project-level datasets to create two portfolio datasets (baseline and midline);
4. Harmonise variables into comparable codes and categories across projects and rounds (baseline and midline);
5. Append the two datasets to create a unique baseline-midline portfolio-level dataset; and
6. Run data-quality and consistency checks to ensure its completeness and validity.
Figure 9 shows a graphic overview of these steps.
As shown in the diagram above, the baseline and midline portfolio-level datasets were created through the same process. First, the scoping and mapping exercise was carried out separately for the baseline and midline datasets. This included mapping the baseline and midline learning assessments (Annex 8). The scorecards rated and reported by the FM were also reviewed and mapped during this initial phase (see Annex 6 for an overview of GEC-T projects’ midline results). This exercise allowed us to understand how each project defined learning and transition and what its targets and achievements were, and to use this as a reference. We then merged the different types of dataset (e.g. girl surveys and learning assessments) within each project, so that each project had a single midline dataset, and appended these project datasets to form a single portfolio-level midline dataset. Having harmonised and appended the baseline and midline datasets, we checked their validity and completeness.
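The merge and append steps described above can be illustrated with a minimal sketch. It is not the Study 3 team's actual code: the field names (`girl_id`, `egra`, `age`, `hh_size`), the project labels and the toy records are all hypothetical, and the choice to keep girls who appear in only some instruments is an assumption.

```python
from collections import defaultdict

def merge_instruments(instruments, id_key="girl_id"):
    """Combine records from several instruments into one row per girl."""
    merged = defaultdict(dict)
    for name, rows in instruments.items():
        for row in rows:
            girl = row[id_key]
            merged[girl][id_key] = girl
            for field, value in row.items():
                if field != id_key:
                    # Prefix with the instrument name to avoid column clashes.
                    merged[girl][f"{name}_{field}"] = value
    return list(merged.values())

def append_projects(project_datasets):
    """Stack project-level datasets, tagging each row with its project."""
    portfolio = []
    for project, rows in project_datasets.items():
        for row in rows:
            portfolio.append({"project": project, **row})
    return portfolio

# Hypothetical toy data for two projects at midline.
project_a = merge_instruments({
    "la": [{"girl_id": "A1", "egra": 41}, {"girl_id": "A2", "egra": 55}],
    "gs": [{"girl_id": "A1", "age": 12}],
})
project_b = merge_instruments({
    "la": [{"girl_id": "B1", "egra": 30}],
    "hhs": [{"girl_id": "B1", "hh_size": 6}],
})
midline_portfolio = append_projects({"A": project_a, "B": project_b})
```

In this sketch a girl who is missing from one instrument still gets a row, just without that instrument's fields. Whether the real pipeline kept or dropped such cases is not stated in the text.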
Note that a baseline portfolio-level dataset already existed at the start of Study 3. Because it lacked some key data (such as girls’ unique identifiers and project-specific data), the Study 3 team amended and augmented it by going back to the projects’ raw baseline datasets, and subsequently generated an updated version of the baseline portfolio-level dataset. This was done in parallel with the work on the midline portfolio-level dataset. For midline, the data from over 40 midline datasets were mapped, then reshaped and merged, and subsequently appended to create a unique midline dataset. After harmonising variables, the baseline dataset and the midline dataset were appended into a unique dataset. This process is summarised in Figure 10 below.
Independent Evaluation of the Girls’ Education Challenge Phase II – Aggregate Impact of GEC-T Projects Between Baseline and Midline Study - Report Annexes
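The harmonisation, final append and consistency check can be sketched in the same spirit. The recode map, the `treatment` and `round` field names and the duplicate-key rule below are illustrative assumptions, not the variables or checks actually used in Study 3.

```python
# Hypothetical recode map: project-specific labels -> one shared coding.
TREATMENT_CODES = {"treatment": 1, "intervention": 1, "control": 0, "comparison": 0}

def harmonise(rows, round_name):
    """Return copies of rows with a shared treatment code and a round tag."""
    out = []
    for row in rows:
        label = str(row["treatment"]).strip().lower()
        out.append({**row, "treatment": TREATMENT_CODES[label], "round": round_name})
    return out

def duplicate_keys(rows, keys=("project", "girl_id", "round")):
    """List key combinations that occur more than once."""
    seen, dupes = set(), []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key in seen:
            dupes.append(key)
        seen.add(key)
    return dupes

# Hypothetical toy records with inconsistent treatment labels.
baseline = [
    {"project": "A", "girl_id": "A1", "treatment": "Intervention"},
    {"project": "A", "girl_id": "A2", "treatment": "control"},
]
midline = [{"project": "A", "girl_id": "A1", "treatment": "Treatment"}]

# Append the two harmonised rounds into one baseline-midline dataset.
combined = harmonise(baseline, "baseline") + harmonise(midline, "midline")

# Consistency check: each girl should appear at most once per project and round.
assert duplicate_keys(combined) == []
```

An unrecognised label raises a `KeyError` here instead of being silently recoded, which is one way to surface categories that still need mapping.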
Figure 2: High-level overview of the data cleaning process followed in Study 3
This yielded about 77,000 girls at baseline and 60,700 at midline for the overall dataset. The data include boys (see footnote 4), who are excluded from our analysis. The sample distribution by treatment status is shown in Table 21.
Table 3: Sample distribution of the unique baseline-midline GEC-T dataset
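A tabulation of this kind, counting girls by round and treatment status after dropping boys, could be produced along the following lines. The `sex`, `round` and `treatment` field names and the four toy records are hypothetical and do not reproduce the report's figures.

```python
from collections import Counter

def sample_distribution(rows):
    """Count girls by (round, treatment status), dropping boys."""
    girls = [r for r in rows if r["sex"] == "girl"]
    return Counter((r["round"], r["treatment"]) for r in girls)

# Hypothetical toy records; 1 = treatment, 0 = comparison.
rows = [
    {"round": "baseline", "treatment": 1, "sex": "girl"},
    {"round": "baseline", "treatment": 0, "sex": "girl"},
    {"round": "baseline", "treatment": 1, "sex": "boy"},
    {"round": "midline", "treatment": 1, "sex": "girl"},
]
counts = sample_distribution(rows)
```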
The baseline data includes all 27 GEC-T projects, while the midline data includes only 23. Four GEC-T projects, namely Avanti (Kenya), Link (Ethiopia), Save the Children (DRC) and Save the Children (Mozambique), did not collect suitable midline evaluation data due to the COVID-19
4 Data on boys were included when submitted combined with data on girls. Where projects, such as AKF, submitted separate datasets, these have not been included, as preparing this data was outside the scope of this study.