Chemistry International | Jan 2024 | Global Partnerships


Chemistry Digital Standards: Tools for an increasingly digital research culture

by Fatima Mustafa, Leah McEwen, and Ian Bruno¹

After more than 100 years of developing standards for chemistry, IUPAC is moving towards the “Digital IUPAC” era, which embraces modern practices for working with data, a propensity towards Open Science, and application of the FAIR Data Principles. In a recent webinar hosted on the ChemVoices platform, Fatima Mustafa, WorldFAIR Chemistry project coordinator, moderated a discussion with Leah McEwen, chair of the IUPAC Committee on Publications and Chemical Data Standards (CPCDS), and Ian Bruno, CPCDS titular member, to inform the chemistry community about efforts to digitize IUPAC.

From archives to digital publishing

“How did chemistry tools such as SciFinder and Reaxys and others evolve from the original print sources? As we move away from print, how can we make sure that chemical information is discoverable and accessible in the more networked environment of the internet and the cloud?” Those questions led Leah on a sabbatical to study the history of online chemical information and the IUPAC Archives [1], after the Physical Sciences Library at Cornell, where she works, decided to close its stacks and move online. Moving online raised many questions among scientists about how to publish digitally and how workflows could be improved. To consider these questions, Leah and her colleagues set up a symposium at the ACS in 2012 and invited chemical information specialists to discuss what the future might hold. The outcomes were published in a book [2].

As things stand, says Leah, “Our traditional reporting practice for much of chemistry data looks non-reusable, e.g., spectral data published in your final paper gets embedded in PDF.” Such static images do not let users search for the compound, find smaller peaks, or search by substructure. If measurements and spectral data are instead represented in digital files, the research published in manuscripts becomes more reusable. This is what Leah and colleagues in CPCDS and the chemistry community are tackling. Very few articles include original downloadable data files [3]. Why is this the case, given that most data are generated by instruments and software? Data presented in PDF can be searched by text-mining technologies, but this can be misleading and inefficient, as contextual information might get lost.

Journal policies on sharing data are also a topic of interest. For example, in organic chemistry journals, characterization data are often found in the Supplementary Information (SI) section, so there is a long-standing practice of publishing data with manuscripts, but very few data files are actively deposited in repositories and linked back to manuscripts [4]. A recent study of author guidelines in chemistry journals took a further look at many different kinds of data [5]. The authors observed that some progress has been made at a general level, for example with the adoption of ORCID for identifying researchers and the inclusion of data availability statements in journal articles. Much less progress has been made in publishing structure information or data types such as spectra in digital form. So, we haven’t really moved the needle much in the last decade.

What is an example of a digital standard file format for chemical data?

Ian has spent several decades at the Cambridge Crystallographic Data Centre, where he has seen the profound impact that digital data standards have had on data publishing workflows. “When I joined, creating a database of crystal structure data often involved retyping the coordinates that a researcher had typed to get them into a table in their article, but things are very different now,” said Ian. Today they see around 60,000 small-molecule crystal structures published each year, the majority of which are associated with one of 16,000 journal articles, and all of these are made discoverable and accessible through indexed data resources such as the Cambridge Structural Database. Underpinning these workflows is a standard file format known as CIF (Crystallographic Information File). What matters is not just how the information and data are laid out in the file, but that the file is structured so that each data item in it has a very specific definition associated with it. But having the file formats themselves is not enough. The CIF format was first published in 1992, and it took about 8 years to reach the point where around 80% of structures were being published in that format. A major factor driving this progress was

¹ This article is based on a ChemVoices webinar held on May 22, 2023. The recording is available at https://youtu.be/_Jd2kGQv_ak. ChemVoices is the result of a partnership between IUPAC and IYCN and was created to showcase the talents and impact of early-career scientists worldwide. It is a platform to discuss issues that are relevant and of immediate concern to early-career scientists. <https://chemvoices.org/>

Chemistry International

January-March 2024

