Facilitating Future A Student Capstone Pr
Providing context and preservation for data science research at

The Library’s Research Data Curation (RDC) program has been helping UC San Diego researchers better manage and preserve their data for the past seven years. In 2019, RDC tackled a new challenge: student projects, specifically the capstone projects from the Data Science & Engineering (DSE) Master of Advanced Study (MAS) program.

The DSE MAS projects are serious business. Groups of two to five students spend more than a year working on their projects, from initial question development to reporting and presenting the results as the culmination of their studies. As in most data science projects, there are many moving parts to handle: raw data needs to be cleaned; models need to be created, tested, and tweaked; and results need to be interpreted and explained in context. The result of a project might be an improved or annotated dataset, a predictive model, or even an application such as a chatbot.

Raw data is cleaned by replacing or removing missing values and standardizing naming conventions.

And, typical of data science projects, one group’s finished project might be the start of another group’s project later. Reanalyzing data might lead to new insights, or new methodologies might result in a better model, so DSE MAS cohorts often continue to build on the projects initiated by previous graduates. While this type of collaboration is valuable, it also presents a challenge for course instructors: how best to store and share these types of projects?

Fortunately, the Library is no stranger to issues of access, storage, and preservation. When it comes to research data, RDC has the expertise to guide how these projects are cataloged. Last year, Data Science Librarian Stephanie Labou and Curation Analyst Ho Jung Yoo worked with Ilkay Altintas, chief data science officer at the San Diego Supercomputer Center and DSE MAS course instructor, to implement a pilot project to ingest the student projects into the Library’s Digital Collections.

According to Yoo, “The broad range of disciplines on our campus means that research outputs span a wide range of data types, organization, and workflows. The Digital Collections was intentionally designed to accommodate diverse needs while also providing long-term findability and context for the valuable data generated at UC San Diego. Our ultimate goal is to facilitate reuse of the data.”

Over the course of six weeks, Labou and Yoo worked with DSE MAS students to prepare their projects, which included raw data, processed data, analysis code, and a final report, for submission to the Library.
With the right partnerships, we can expand this model of collaboration between faculty and the Library... and start building a full collection of data- and code-intensive works to support teaching and learning.
A Digital Object Identifier (DOI) is a globally unique identifier that can be used to provide persistent, citable links to data. DOIs are available free of charge to campus affiliates via the EZID service administered by RDC.
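For readers who want to see what a "persistent, citable link" looks like in practice, here is a minimal Python sketch that turns a DOI into its link on the public doi.org resolver. The function name doi_to_url and the example DOI 10.1234/example-dataset are hypothetical, chosen only for illustration; they are not identifiers issued by RDC or EZID. The validity check is deliberately loose and is an assumption of this sketch, not a full DOI syntax validator.

```python
from urllib.parse import quote

# Public DOI resolver; a DOI appended to this prefix forms a persistent link.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Return the resolver link for a DOI (hypothetical helper for illustration)."""
    doi = doi.strip()
    # Accept a DOI that was pasted with a common prefix already attached.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Loose sanity check (an assumption of this sketch): DOIs start with "10."
    # and separate the registrant prefix from the suffix with a slash.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a recognizable DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL, keeping the slash.
    return DOI_RESOLVER + quote(doi, safe="/")


if __name__ == "__main__":
    # Made-up DOI, used only to show the shape of the resulting link.
    print(doi_to_url("doi:10.1234/example-dataset"))
```

Because the link goes through the resolver rather than pointing at a specific server, a citation built this way keeps working even if the dataset's hosting location changes.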