Facilitating Future Access to Student Projects
The Library’s Research Data Curation (RDC) program has been helping UC San Diego researchers better manage and preserve their data for the past seven years. In 2019, RDC tackled a new challenge: student projects, specifically the capstone projects from the Data Science & Engineering (DSE) Master of Advanced Study (MAS) program. The DSE MAS projects are serious business. Groups of two to five students spend more than a year working on their projects, from initial question development to reporting and presenting the results as the culmination of their studies. As with most data science projects, there are many moving parts to handle: raw data needs to be cleaned; models need to be created, tested, and tweaked; and results need to be interpreted and explained in context. The result of a project might be an improved or annotated dataset, a predictive model, or even an application such as a chat bot. And, typical of data science projects, one group’s finished project might be the start of another group’s project later. Reanalyzing data might lead to new insights, or new methodologies might result in a better model, so DSE MAS cohorts often continue to build on the projects initiated by previous graduates.
Raw data is cleaned by replacing or removing missing values and standardizing naming conventions.
While this type of collaboration is valuable, it also presents a challenge for course instructors: how best to store and share these types of projects? Fortunately, the Library is no stranger to issues of access, storage, and preservation. When it comes to research data, RDC has the expertise to advise on how such projects should be cataloged. Last year, Data Science Librarian Stephanie Labou and Curation Analyst Ho Jung Yoo worked with Ilkay Altintas, chief data science officer at the San Diego Supercomputer Center and DSE MAS course instructor, to implement a pilot project to ingest the student projects into the Library’s Digital Collections. According to Yoo, “The broad range of disciplines on our campus means that research outputs span a wide range of data types, organization, and workflows. The Digital Collections was intentionally designed to accommodate diverse needs while also providing long-term findability and context for the valuable data generated at UC San Diego. Our ultimate goal is to facilitate reuse of the data.” Over the course of six weeks, Labou and Yoo worked with DSE MAS students to prepare their projects, which included raw data, processed data, analysis code, and a final report, for submission to the Library.
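For readers curious what the cleaning step looks like in practice (replacing or removing missing values and standardizing naming conventions), here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not the method any DSE MAS group used: the function names, the list of missing-value markers, and the sample records are all hypothetical.

```python
import re

# Assumption: strings that should count as "missing" in this made-up dataset.
MISSING_MARKERS = {"", "na", "n/a", "null", "none", "nan"}


def standardize_name(name):
    """Convert a column name to lowercase snake_case."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


def clean_rows(rows, fill_values=None):
    """Standardize column names, then replace or remove missing values.

    A missing value is replaced when `fill_values` has a default for that
    (standardized) column; otherwise the whole row is removed.
    """
    fill_values = fill_values or {}
    cleaned = []
    for row in rows:
        record = {}
        complete = True
        for raw_name, value in row.items():
            name = standardize_name(raw_name)
            text = "" if value is None else str(value).strip()
            if text.lower() in MISSING_MARKERS:
                if name in fill_values:
                    text = str(fill_values[name])  # replace the missing value
                else:
                    complete = False  # no default available: drop this row
                    break
            record[name] = text
        if complete:
            cleaned.append(record)
    return cleaned


# Hypothetical raw records with inconsistent column names and gaps.
raw_rows = [
    {"Sample ID": "001", " Age (years)": "34", "Site-Name": "La Jolla"},
    {"Sample ID": "002", " Age (years)": "NA", "Site-Name": "Hillcrest"},
    {"Sample ID": "003", " Age (years)": "51", "Site-Name": ""},
]
cleaned_rows = clean_rows(raw_rows, fill_values={"site_name": "unknown"})
```

In this sketch the second record is removed because its age is missing and no default was supplied, while the third record keeps its place with the site filled in as "unknown". Whether to replace or remove a missing value is a per-column judgment call that a real project would document alongside its data.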
A Digital Object Identifier (DOI) is a globally unique identifier that can be used to provide persistent, citable links to data. DOIs are available free of charge to campus affiliates via the EZID service administered by RDC.
Projects represented a wide range of topics, including creating algorithms to identify specific parts of magnetic resonance imaging (MRI) data; developing a recommender system (akin to Amazon’s recommended items list) for video games; and analyzing human gut microbiome data.
Working with students on submission preparation also fits well with RDC’s mission to teach campus constituents about best practices for working with data. “It was really valuable to talk to students about the importance of properly citing datasets and checking for any data sharing restrictions, as well as formally documenting the software used,” said Labou. “These are the things that make a project not only reproducible, but also more user friendly for students and other researchers in the future.”
After additional processing of student-submitted data and metadata, the projects were ready to go live in the Library’s Digital Collections. All projects were assigned a Digital Object Identifier (DOI), which is a globally unique and persistent identifier. So, not only are projects stored and preserved for access by future DSE MAS cohorts, but students also have a full and formal citation for their projects to add to their resumes. The entire collection of DSE MAS capstone projects was also issued a DOI and is accessible online.
“This is the first collection of data science-oriented student projects in the Library’s research data collections,” RDC Director David Minor pointed out. “This was an exciting project because it allowed us to broaden our work on campus and reach the next generation of scholars. We look forward to working with future cohorts.”
After the success of the DSE MAS pilot, Labou sees plenty of opportunities to expand. “With the right partnerships, we can expand this model of collaboration between faculty and the Library to other courses and start building a full collection of data- and code-intensive works to support teaching and learning. This is only the beginning!”
To view the collection of DSE MAS capstone projects, visit lib.ucsd.edu/dsemas.
Right: Predicted segmentation of membranes, mitochondria, and nuclei. From “Electron Microscopic Data Analysis” (Tushar Singhal and Prashant Kolkur).