Proceedings of the 12th International Conference on Electronic Publishing
Open Scholarship: Authority, Community and Sustainability in the Age of Web 2.0
Toronto June 25-27, 2008
Editors: Leslie Chan, University of Toronto Scarborough (Canada); Susanna Mornati, CILEA (Italy)
Proceedings of the 12th International Conference on Electronic Publishing, Toronto 2008
University of Toronto (Toronto, Canada)
Edited by: Leslie Chan, University of Toronto Scarborough (Canada); Susanna Mornati, CILEA (Italy)
Published by: International Conference on Electronic Publishing (ELPUB)
ISBN: 978-0-7727-6315-0
First edition. All rights reserved. (C) 2008 Leslie Chan, Susanna Mornati; (C) 2008 for all authors in the proceedings. Disclaimer: Any views or opinions expressed in any of the papers in this collection are those of their respective authors. They do not represent the views or opinions of the University of Toronto, CILEA, the editors and members of the Programme Committee, nor of the publisher or conference sponsors. Any products or services that are referred to in this book may be either trademarks and/or registered trademarks of their respective owners. The Publisher, editors and authors make no claim to those trademarks.
Members of the 2008 Programme Committee

Apps, Ann - The University of Manchester (UK)
Baptista, Ana Alice - University of Minho (Portugal)
Borbinha, José - INESC-ID (Portugal)
Cetto, Ana Maria - IAEA (Austria)
Costa, Sely M.S. - University of Brasilia (Brazil)
Delgado, Jaime - Universitat Politècnica de Catalunya (Spain)
Diocaretz, Myriam - University of Maastricht (The Netherlands)
Dobreva, Milena - HATII, University of Glasgow & IMI, Bulgarian Academy of Sciences (Bulgaria)
Engelen, Jan - Katholieke Universiteit Leuven (Belgium)
Gargiulo, Paola - CASPUR (Italy)
Gradmann, Stefan - University of Hamburg (Germany)
Guentner, Georg - Salzburg Research (Austria)
Hedlund, Turid - Swedish School of Economics and Business Administration, Helsinki (Finland)
Horstmann, Wolfram - University of Bielefeld (Germany)
Ikonomov, Nikola - Institute for Bulgarian Language (Bulgaria)
Iyengar, Arun - IBM Research (USA)
Jezek, Karel - University of West Bohemia in Pilsen (Czech Republic)
Joseph, Heather - SPARC (USA)
Krottmaier, Harald - Graz University of Technology (Austria)
Linde, Peter - Blekinge Institute of Technology (Sweden)
Martens, Bob - Vienna University of Technology (Austria)
Moore, Gale - University of Toronto (Canada)
Markov, Krassimir - IJITA-Journal (Bulgaria)
Moens, Marie-Francine - Katholieke Universiteit Leuven (Belgium)
Mornati, Susanna - CILEA (Italy)
Nisheva-Pavlova, Maria - Sofia University (Bulgaria)
Paepen, Bert - Katholieke Universiteit Leuven (Belgium)
Perantonis, Stavros - NCSR - Demokritos (Greece)
Schranz, Markus - Pressetext Austria (Austria)
Smith, John - University of Kent at Canterbury (UK)
Tonta, Yasar - Hacettepe University (Turkey)
Acknowledgements

Local Organizing Committee: Gale Moore, Knowledge Media Design Institute, University of Toronto; Gabriela Mircea, University of Toronto Libraries; Jen Sweezie, University of Toronto Scarborough
Graphic design: Joe Beausoleil - floodedstudios@rogers.com
Typesetting: Jen Sweezie
Conference host: Knowledge Media Design Institute, University of Toronto
Conference Sponsors: CILEA Interuniversity Consortium; Synergies; SPARC; TeleGlobal Consulting Group; JISC; Department of Computer Science, University of Toronto
Exhibiting Sponsors: International Development Research Centre (Canada)
Promotional Sponsors: BioMedCentral: The Open Access Publisher; University of Toronto Bookstores
Table of Contents

Preface .......... XI
Leslie Chan; Susanna Mornati

Organizational and Policy issues

A Review of Journal Policies for Sharing Research Data .......... 1
Heather A. Piwowar; Wendy W. Chapman

Researcher's Attitudes Towards Open Access and Institutional Repositories: A Methodological Study for Developing a Survey Form Directed to Researchers in Business Schools .......... 15
Turid Hedlund

The IDRC Digital Library: An Open Access Institutional Repository Disseminating the Research Results of Developing World Researchers .......... 23
Barbara Porrett

Metadata and Query Formats

Keyword and Metadata Extraction from Pre-prints .......... 30
Emma Tonkin; Henk L. Muller

The MPEG Query Format, a New Standard For Querying Digital Content. Usage in Scholarly Literature Search and Retrieval .......... 45
Ruben Tous; Jaime Delgado

The State of Metadata in Open Access Journals: Possibilities and Restrictions .......... 56
Helena Francke

Collaboration in Scholarly Publishing

Establishing Library Publishing: Best Practices for Creating Successful Journal Editors .......... 68
Jean-Gabriel Bankier; Courtney Smith

Publishing Scientific Research: Is There Ground for New Ventures? .......... 79
Panayiota Polydoratou; Martin Moyle

The Role of Academic Libraries in Building Open Communities of Scholars .......... 90
Kevin Stranack; Gwen Bird; Rea Devakos
Semantic Web and New Services

Social Tagging and Dublin Core: A Preliminary Proposal for an Application Profile for DC Social Tagging .......... 100
Maria Elisabete Catarino; Ana Alice Baptista

Autogeneous Authorization Framework for Open Access Information Management with Topic Maps .......... 111
Robert Barta; Markus W. Schranz

AudioKrant, the daily spoken newspaper .......... 122
Bert Paepen

A Semantic Web Powered Distributed Digital Library System .......... 130
Michele Barbera; Michele Nucci; Daniel Hahn; Christian Morbidoni

Business Models in e-publishing

No Budget, No Worries: Free and Open Source Publishing Software in Biomedical Publishing .......... 140
Tarek Loubani; Alison Sinclair; Sally Murray; Claire Kendall; Anita Palepu; Anne Marie Todkill; John Willinsky

Should University Presses Adopt An Open Access [Electronic Publishing] Business Model For All of Their Scholarly Books? .......... 149
Albert N. Greco; Robert M. Wharton

Scholarly Publishing within an eScholarship Framework – Sydney eScholarship as a Model of Integration and Sustainability .......... 165
Ross Coleman

Usage Patterns of Online Literature

Global Annual Volume of Peer Reviewed Scholarly Articles and the Share Available Via Different Open Access Options .......... 178
Bo-Christer Björk; Annikki Roos; Mari Lauri

Characteristics Shared by the Scientific Electronic Journals of Latin America and the Caribbean .......... 187
Saray Córdoba-González; Rolando Coto-Solano

Consortial Use of Electronic Journals in Turkish Universities .......... 203
Yasar Tonta; Yurdagül Ünal
A Rapidly Growing Electronic Publishing Trend: Audiobooks for Leisure and Education .......... 217
Jan J. Engelen

New Challenges in Scholarly Communications

The SCOAP3 project: Converting the Literature of an Entire Discipline to Open Access .......... 223
Salvatore Mele

Modeling Scientific Research Articles – Shifting Perspectives and Persistent Issues .......... 234
Anita de Waard; Joost Kircz

Synergies, OJS, and the Ontario Scholars Portal .......... 246
Michael Eberle-Sinatra; Lynn Copeland; Rea Devakos

Open Access in Less Developed Countries

African Universities in the Knowledge Economy: A Collaborative Approach to Researching and Promoting Open Communications in Higher Education .......... 254
Eve Gray; Marke Burke

Open Access in India: Hopes and Frustrations .......... 271
Subbiah Arunachalam

An Overview of The Development of Open Access Journals and Repositories in Mexico .......... 280
Isabel Galina; Joaquín Giménez

Brazilian Open Access Initiatives: Key Strategies and Actions .......... 288
Sely M S Costa; Fernando C L Leite

Information Retrieval and Discovery Services

Interpretive Collaborative Review: Enabling Multi-Perspectival Dialogues to Generate Collaborative Assignments of Relevance to Information Artefacts in a Dedicated Problem Domain .......... 299
Peter Pennefather; Peter Jones

Joining Up 'Discovery to Delivery' Services .......... 312
Ann Apps; Ross MacIntyre
Web Topic Summarization .......... 322
Josef Steinberger; Karel Jezek; Martin Sloup

Open Access and Citations

Open Access Citation Rates and Developing Countries .......... 335
Michael Norris; Charles Oppenheim; Fytton Rowland

Research Impact of Open Access Research Contributions across Disciplines .......... 343
Sheikh Mohammad Shafi

Exploration and Evaluation of Citation Networks .......... 351
Karel Jezek; Dalibor Fiala; Josef Steinberger

Added-value Services for Scholarly Communication

Advancing Scholarship through Digital Critical Editions: Mark Twain Project Online .......... 363
Lisa R. Schiff

Preserving The Scholarly Record With WebCite (www.webcitation.org): An Archiving System For Long-Term Digital Preservation Of Cited Webpages .......... 378
Gunther Eysenbach

Enhancing the Sustainability of Electronic Access to ELPUB Proceedings: Means for Long-term Dissemination .......... 390
Bob Martens; Peter Linde; Robert Klinc; Per Holmberg

A Semantic Linking Framework to Provide Critical Value-added Services for E-journals on Classics: Proposal of a Semantic Reference Linking System between On-line Primary and Secondary Sources .......... 401
Matteo Romanello

Posters & Demonstrations

Creating OA Information for Researchers .......... 415
Peter Linde; Aina Svensson

Open Scholarship eCopyright@UP. Rainbow Options: Negotiating for the Proverbial Pot of Gold .......... 417
Elsabé Olivier
Scalable Electronic Publishing in a University Library .......... 421
Kevin Hawkins

Issues and Challenges to Development of Institutional Repositories in Academic and Research Institutions in Nigeria .......... 422
Gideon Emcee Christian

When Codex Meets Network: Toward an Ideal Smartbook .......... 425
Greg Van Alstyne; Robert K. Logan

Revues.org, Online Humanities and Social Sciences Portal .......... 426
Marin Dacos

AbstractMaster® .......... 428
Daniel Marr

A Deep Validation Process for Open Document Repositories .......... 429
Wolfram Horstmann; Maurice Vanderfeesten; Elena Nicolaki; Natalia Manola

Pre-Conference Workshops

Publishing with the CDL's eXtensible Text Framework (XTF) .......... 432
Kirk Hastings; Martin Haye; Lisa Schiff

Open Journal Systems: Working with Different Editorial and Economic Models .......... 433
Kevin Stranack; John Willinsky

Repositories that Support Research Management .......... 434
Leslie Carr

Opening Scholarship: Strategies for Integrating Open Access and Open Education .......... 435
Mark Surman; Melissa Hagemann

Boost your capacity to manage DSpace! .......... 436
Wayne Johnston; Rea Devakos; Peter Thiessen; Gabriela Mircea
Preface

It is a pleasure for us to present you with these proceedings, consisting of over 40 contributions from six different continents accepted for presentation at the 12th ELPUB conference. The conference, generously hosted by the Knowledge Media Design Institute at the University of Toronto, was chaired by Leslie Chan, University of Toronto Scarborough, Canada, and Susanna Mornati, CILEA, Italy.

The 12th ELPUB conference carried on the tradition of previous international conferences on electronic publishing: bringing together researchers, developers, librarians, publishers, entrepreneurs, managers, users and all those interested in issues regarding electronic publishing and scholarly communications to present their latest projects, research, or new publishing models or tools. This year marks the first time ELPUB was held in North America. Previous meetings were held in the United Kingdom (in 1997 and 2001), Hungary (1998), Sweden (1999), Russia (2000), the Czech Republic (2002), Portugal (2003), Brazil (2004), Belgium (2005), Bulgaria (2006), and Austria (2007).

The theme of this year's meeting was "Open Scholarship". Participants and presenters explored the future of scholarly communications resulting from the intersection of semantic web technologies and the development of new communication and knowledge platforms for the sciences as well as the humanities and social sciences. We also encouraged presenters to explore new publishing models and innovative sustainability models for providing open access to research outputs. Open Access is now a mainstream debate in publishing, and technological evolution and revolution in the digital world is transforming scholarly communication beyond traditional borders. The impact of the web on daily life has resulted in the involvement of rapidly growing audiences world-wide, who are now e-reading and e-writing, making e-publishing an even more important and widespread phenomenon.

Electronic calls for submissions for ELPUB 2008 were distributed widely, resulting in over 80 submissions that covered a broad range of scholarly publishing issues. In addition to technical papers dealing with metadata standards, exchange protocols, new online reading tools and service integration, we also received a fair number of papers reporting on the economics of openness, public policy implications, and institutional support and collaboration on digital publishing and knowledge dissemination. A number of conceptual papers also examined the changing nature of scholarly communications made possible by open peer-to-peer production and new financial models for the production and dissemination of knowledge.

In order to guarantee the high quality of papers presented at ELPUB 2008, all submissions were peer-reviewed by at least three members of the international Programme Committee (PC) and additional peer reviewers. Together, these reviewers represented a broad range of technical expertise as well as diverse disciplinary interests. Their contributions of time and feedback to the authors ensured the high quality of papers that were presented at the conference and in these proceedings. We would like to express our sincere appreciation of their efforts and dedication.

To assist with the assignment of reviewers, submitters were asked to characterise their entries by selecting 3-5 key words that best represent their work. In a similar way, reviewers identified their 3-5 fields of expertise, allowing the Programme team to match papers to reviewers.
Accepted papers were then grouped into sessions according to common and overlapping themes. The Table of Contents of this volume follows both the themes and the order of the sessions in which they were scheduled during the conference.

Over the past three years the SciX Open Publishing System was used to manage the submission and review of abstracts. This year, we decided to experiment with the Open Conference System, an open source software application designed by the Public Knowledge Project at Simon Fraser University, to manage all aspects of an academic conference. The system worked well in most respects, though we encountered a number of small bugs and irregularities. We provided feedback to the development team and we are sure that these issues will be addressed in the next release. This is indeed the beauty of open
source: community input and the sharing of benefits. We would like to thank the Public Knowledge Project for providing the software and for their key role in promoting open scholarship.

As with all previous ELPUB conferences, this collection of papers and their metadata are made available through several channels of the Open Archives Initiative, including Dublin Core metadata distribution and full archives at http://elpub.scix.net. It may appear ironic to have printed proceedings for a conference dedicated to electronic publishing. However, the "need" for printed publications is an old and continuing one. It seems that it is still essential for a significant number of delegates to have "something tangible" in their hands and to show their respective university administrations.

Thanks go to Tanzina Islam for checking the references, Jen Sweezie for copyediting and organizational support, Gabriela Mircea of the University of Toronto Library for maintaining the Open Conference System and providing valuable technical support, to the student volunteers, and to many others who made this event possible. Finally, we would like to thank the various sponsors for their generous contributions.

We hope you enjoy reading the proceedings from ELPUB 2008. It is also our pleasure to invite delegates and readers to ELPUB 2009, taking place in Milan, Italy. The 13th ELPUB conference will be organised by CILEA and the University of Milan. Details of the conference will be forthcoming at the ELPUB web site. As these proceedings go to press, we look forward to a very successful and productive conference.
General Chair Leslie Chan
Program Chair Susanna Mornati
A Review of Journal Policies for Sharing Research Data
Heather A. Piwowar; Wendy W. Chapman
Department of Biomedical Informatics, University of Pittsburgh, 200 Meyran Avenue, Pittsburgh PA, USA
e-mail: hpiwowar@gmail.com; wec6@pitt.edu
Abstract

Background: Sharing data is a tenet of science, yet commonplace in only a few subdisciplines. Recognizing that a data sharing culture is unlikely to be achieved without policy guidance, some funders and journals have begun to request and require that investigators share their primary datasets with other researchers. The purpose of this study is to understand the current state of data sharing policies within journals, the features of journals that are associated with the strength of their data sharing policies, and whether the strength of data sharing policies impacts the observed prevalence of data sharing.

Methods: We investigated these relationships with respect to gene expression microarray data in the journals that most often publish studies about this type of data. We measured data sharing prevalence as the proportion of papers with submission links from NCBI's Gene Expression Omnibus (GEO) database. We conducted univariate and linear multivariate regressions to understand the relationship between the strength of data sharing policy and journal impact factor, journal subdiscipline, journal publisher (academic societies vs. commercial), and publishing model (open vs. closed access).

Results: Of the 70 journal policies, 53 made some mention of sharing publication-related data within their Instruction to Author statements. Of the 40 journals with a data sharing policy applicable to gene expression microarrays, we classified 17 as weak and 23 as strong (strong policies required an accession number from database submission prior to publication). Existence of a data sharing policy was associated with the type of journal publisher: 46% of commercial journals had a data sharing policy, compared to 82% of journals published by an academic society. All five of the open-access journals had a data sharing policy. Policy strength was associated with impact factor: the journals with no data sharing policy, a weak policy, and a strong policy had respective median impact factors of 3.6, 4.9, and 6.2. Policy strength was positively associated with measured data sharing submission into the GEO database: the journals with no data sharing policy, a weak policy, and a strong policy had median data sharing prevalence of 8%, 20%, and 25%, respectively.

Conclusion: This review and analysis begins to quantify the relationship between journal policies and data sharing outcomes. We hope it contributes to assessing the incentives and initiatives designed to facilitate widespread, responsible, effective data sharing.

Keywords: data sharing; editorial policies; instructions for authors; bibliometrics; gene expression microarrays
1. Background
Widespread adoption of the Internet now allows research results to be shared more readily than ever before. This is true not only for published research reports, but also for the raw research data points that underlie the reports. Investigators who collect and analyze data can submit their datasets to online databases, post them on websites, and include them as electronic supplemental information – thereby making the data easy to examine and reuse by other researchers. Reusing research data has many benefits for the scientific community. New research hypotheses can be tested more quickly and inexpensively when duplicate data collection is reduced. Data can be aggregated
to study otherwise-intractable issues, and a more diverse set of scientists can become involved when analysis is opened beyond those who collected the original data. Ethically, it has long been considered a tenet of scientific behavior to share results[1], thereby allowing close examination of research conclusions and facilitating others to build directly on previous work. The ethical position is even stronger when the research has been funded by public money[2], or the data are donated by patients and so should be used to advance science to the greatest extent permitted by the donors[3].

Unfortunately, these advantages only indirectly benefit the stakeholders who bear most of the costs for sharing their datasets: the primary data-producing investigators. Data sharing is often time-consuming, confusing, scary, and potentially damaging to future research plans. Consequently, sharing data is commonplace in only a few subdisciplines. Recognizing that a data sharing culture is unlikely to be achieved without policy guidance, some funders and journals have begun to request and require that investigators share their primary datasets with other researchers. Funders are motivated by the promise of resource efficiency and rapid progress. The motivation for journals to act as an advocate and gatekeeper for data sharing is less straightforward. Journals seek to publish "well-written, properly formatted research that meets community standards" and in so doing have assumed monitoring tasks to "remind researchers of community expectations and enforce some behaviors seen as advantageous to the progress of science."[4] This role has been encouraged by many letters[5, 6], white-papers[7, 8], and editorials in high-profile journals[9].

Journal policies are usually expressed within "instruction for authors" statements. A study by McCain in 1995[4] explored the statements of 850 journals, looking for mandates for the dissemination of data (and the sharing of biological materials). She found that 132 (16%) natural science and technology journals had a policy regarding sharing of some type of research-related information. While McCain covered a wide breadth and depth of journals (especially given that her review predated electronic access to instruction for author statements), she did not attempt to associate the policies with journal attributes, nor did she measure the actual data sharing behavior of authors and correlate the prevalence with journal policy strength. We believe looking at these issues could help us better understand the causes and effects of journal data sharing policies.

The purpose of this study is to understand the current state of data sharing policies within journals, to identify which characteristics of journals are associated with the strength of their data sharing policies, and to measure whether the strength of data sharing policies impacts the observed prevalence of data sharing.
2. Methodology
Our study involved three steps. First, we identified a set of journals for examination. For each journal, based on a manual review of the instruction to author statement, we classified the strength of its policy for data sharing as none, weak, or strong. Second, we studied the relationship between the strength of a journal's data sharing policy and selected journal attributes. Third, for each journal, we measured how many of its recently published articles have submitted datasets to a centralized database. We used these estimates to study the relationship between data sharing prevalence and the strength of the journal's data sharing policy. Each of these steps is described below in more detail.
2.1 Collecting the journal's policies on sharing data
To avoid unnecessary complexity, we chose to investigate data sharing policies for a single type of data: biological gene expression microarrays. These "chips" allow investigators to measure the relative level of
RNA expression across tens of thousands (exponentially more each year, as the technology improves) of different genes for each cell line in their study. For example, a clinical trial might involve extracting a small piece of breast cancer tumor from each of 100 patients who responded to a given chemotherapy treatment and from another 100 patients who did not. Cells from each patient's tumor would be hybridized to a microarray chip, then the investigators would compare the relative levels of RNA expression across all the patients to identify a set of genes with expression levels that correlate with chemotherapy response. This high-throughput dataset would include at least a million data points. The dataset is expensive and time-consuming to collect, but very valuable not only to the original investigators for their original purpose but also to other investigators who may wish to study different questions.

Microarray data provide a useful environment for exploring data sharing policies and behaviors, for several reasons. Despite being valuable for reuse, microarray data are often but not yet universally shared. The best-practice guidelines for sharing microarray data are fairly mature, including standards for formatting and minimum-inclusion reporting developed by the active Microarray and Gene Expression Data (MGED) Society. A few centralized databases have emerged as best-practice repositories: the Gene Expression Omnibus (GEO)[10] and ArrayExpress[11]. Several high-profile letters have called for strong data sharing policies[5, 6]. Finally, the National Center for Biotechnology Information's Entrez website (http://www.ncbi.nlm.nih.gov/) makes it easy to identify journal articles that have submitted datasets to GEO, allowing us to study the association between journal policies and observed data sharing practice.

We identified journals with more than 15 articles published on "gene expression profiling" in 2006, using Thomson's Journal Citation Reports. We extracted the journal impact factors, subdiscipline categories, and publishing organizations. We looked up each journal in The Directory of Open Access Journals to determine which are based on an open-access publishing model. We used Google to locate the Instructions for Author policies for each of the journals. We manually downloaded and reviewed each policy for all mentions of data sharing.
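For readers who want to assemble a similar journal-attribute table, the short Python sketch below shows one way this data-collection step could be scripted. It is an illustration only, not the scripts used in this study; the file names, column names, and the DOAJ title list are assumptions.

```python
# Illustrative sketch (not the study's actual scripts): assemble the journal
# attribute table described above. File and column names are assumptions.
import pandas as pd

# Journal Citation Reports export for the "gene expression profiling" set
jcr = pd.read_csv("jcr_2006_gene_expression_profiling.csv")  # hypothetical export
jcr = jcr[jcr["articles_2006"] > 15]                          # keep journals with >15 articles

# Flag open-access journals against a DOAJ title list (hypothetical file)
doaj_titles = set(pd.read_csv("doaj_titles.csv")["title"].str.lower())
jcr["open_access"] = jcr["journal_title"].str.lower().isin(doaj_titles)

# Publisher type (academic society vs. commercial) was assigned by hand in the
# study; here it is simply read from a manually curated column.
journals = jcr[["journal_title", "impact_factor", "subdiscipline",
                "society_publisher", "open_access"]]
journals.to_csv("journal_attributes.csv", index=False)
```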
2.2 Classifying the relative strength of the data sharing policies
We classified each of the policies into one of three categories: no mention of sharing microarray data, a relatively weak data sharing policy, or a strong policy. We defined a weak policy as one that is unenforceable, echoing McCain's terminology.[4] This included policies that merely suggest or request that microarray data be shared, as well as policies that require sharing but fail to require evidence that data has been shared. Strong policies, in contrast, require microarray data to be shared and insist upon a database accession number as a condition of publication. We conducted univariate and linear multivariate regressions to understand the relationship between the strength of data sharing policy and journal impact factor, journal subdiscipline, journal publisher (academic societies vs. commercial), and publishing model (open vs. closed access).
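The classification itself was manual, but the regression step can be illustrated in a few lines of Python. The sketch below is a minimal example under stated assumptions (an input file with a manually coded policy_strength column of 0 = none, 1 = weak, 2 = strong, plus the covariates named above); it is not the authors' analysis code, and statsmodels OLS is used simply as one reasonable way to run the described univariate and multivariate linear regressions.

```python
# Illustrative sketch only (not the authors' analysis code): regress policy
# strength, coded 0 = none, 1 = weak, 2 = strong, on the journal attributes
# described above. The file name and column names are assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

journals = pd.read_csv("journal_attributes.csv")            # one row per journal
journals["log_impact"] = np.log(journals["impact_factor"])  # natural log, as in Table 5

# Univariate regression: policy strength vs. log impact factor
univariate = smf.ols("policy_strength ~ log_impact", data=journals).fit()

# Multivariate regression: add open access, publisher type, and subdiscipline
multivariate = smf.ols(
    "policy_strength ~ log_impact + open_access + society_publisher"
    " + C(subdiscipline)",
    data=journals,
).fit()
print(univariate.params)
print(multivariate.summary())
```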
2.3 Measuring the frequency with which authors share their data
To make a preliminary estimate of data sharing prevalence, we began by querying PubMed for journal articles published in 2006 or 2007 that were likely to have generated gene expression microarray data. These articles form the denominator of our prevalence estimate, so ideally only studies that produced raw data – articles with potentially shareable data – would be included. Unfortunately, PubMed does not provide a straightforward way to accurately identify only studies that produced their own data; a PubMed query for articles about gene expression microarray data ("Gene Expression Profiling"[MeSH] AND "Oligonucleotide Array Sequence Analysis"[MeSH]) returns not only studies that produced their own
data, but also studies that strictly reused previous datasets (and therefore don't have their own raw microarray data to share) and even articles about new tools for storing and analyzing gene expression microarray data. A more accurate retrieval of data-producing studies would require access to the article's full text, and was beyond the scope of this paper. Nonetheless, if we assume that articles about data reuse and tools occur in journals independently of the journal's data sharing policy, we can use the rough PubMed query to provide a preliminary estimate of relative prevalence. It is crucial, however, that we interpret these estimates relative to one another and not compare them to a theoretical ideal of 100%. Since the denominator of our percentages is not "number of papers that produced microarray data and could have shared it" but rather "number of papers about microarrays," even if all studies that produced data in fact shared it our estimates would still be less than 100%.

Using the NCBI's Entrez website, for each journal in our cohort, we counted the total number of articles returned by our PubMed query and the percentage of those articles that had links to the GEO data repository. We conducted univariate and linear multivariate regressions over the journal data-sharing prevalence percentages to understand if strength of data sharing policy was associated with observed data sharing prevalence, including covariates for journal impact factor, journal subdiscipline, publisher type, and publishing model.
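The per-journal counts described above can also be gathered programmatically from the NCBI E-utilities. The Python sketch below is a minimal illustration of that workflow rather than the scripts used in this study: it assumes the public esearch/elink endpoints and the GEO DataSets ("gds") link database, and the journal name, date range, and rate limiting are placeholders.

```python
# Sketch (not the scripts used in this study): estimate, for one journal,
# the fraction of 2006-2007 microarray articles with a link to GEO.
# Assumes the NCBI E-utilities endpoints and the "gds" (GEO DataSets) link
# database; consult the current E-utilities documentation before relying on them.
import time
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
QUERY = ('"Gene Expression Profiling"[MeSH] AND '
         '"Oligonucleotide Array Sequence Analysis"[MeSH]')

def microarray_pmids(journal):
    """Return PubMed IDs matching the microarray query for one journal, 2006-2007."""
    term = f'{QUERY} AND "{journal}"[Journal] AND 2006:2007[DP]'
    r = requests.get(f"{EUTILS}/esearch.fcgi",
                     params={"db": "pubmed", "term": term,
                             "retmax": 10000, "retmode": "json"})
    r.raise_for_status()
    return r.json()["esearchresult"]["idlist"]

def has_geo_link(pmid):
    """True if the article has at least one link to a GEO (gds) record."""
    r = requests.get(f"{EUTILS}/elink.fcgi",
                     params={"dbfrom": "pubmed", "db": "gds",
                             "id": pmid, "retmode": "json"})
    r.raise_for_status()
    linkset = r.json()["linksets"][0]
    return bool(linkset.get("linksetdbs"))

def sharing_prevalence(journal):
    pmids = microarray_pmids(journal)
    if not pmids:
        return None
    shared = 0
    for pmid in pmids:
        shared += has_geo_link(pmid)
        time.sleep(0.34)   # stay under NCBI's roughly 3 requests/second limit
    return shared / len(pmids)

if __name__ == "__main__":
    print(sharing_prevalence("Physiological genomics"))
```

As in the text, the resulting fraction is only a relative estimate: the denominator includes reuse and tool papers, so the values should be compared across journals rather than against an ideal of 100%.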
3. Results
3.1 Journal's policies on sharing data
Seventy journals met the selection criteria, spanning a wide range of impact factors (0.9 to 30.0, median: 4.5). A minority are published by academic societies (22). Only 5 use an open-access publishing model. Thomson's Journal Citation Reports identified 27 subdisciplines covered by these journals. We retained the categories with more than five members: Biochemistry and Molecular Biology (19), Biotechnology and Applied Microbiology (11), Cell Biology (11), Genetics and Heredity (11), Oncology (19), and Plant Sciences (7). We also retained Multidisciplinary Sciences (n=4) because we were curious about the policies for high-profile journals such as Nature and Science.

Of the 70 journal policies, 30 (43%) had no policy applicable to microarrays. This included 17 journals that make no mention of sharing publication-related data within their Instruction to Author statements, and 13 journal policies that request or require the sharing of non-microarray types of data (usually DNA and protein sequences), but no statement covering data in general or microarray data in particular. The remaining 40 journals had a policy applicable to microarrays. We classified 17 of the microarray-applicable policies as relatively weak and 23 as strong, as detailed in Table 1. The policies varied widely across a number of dimensions. We explore several of these dimensions below, using excerpts from the policies.

3.1.1 Statements of policy motivation

Several journals introduce their policies with a motivation for sharing data. These statements explain the anticipated benefits to the scientific community, the intended service to readers, or the principles of the journal. Examples are given in Table 2. In addition, 22 policies included general-purpose sharing statements, thereby implying their support for the principle of data sharing. An example from Bioinformatics:
All data on which the conclusions given in the publication are based must be publicly available.

From BMC Bioinformatics:

Submission of a manuscript to BMC Bioinformatics implies that readily reproducible materials described in the manuscript, including all relevant raw data, will be freely available to any scientist wishing to use them for non-commercial purposes.

No Policy: Acta Biochimica Et Biophysica Sinica; Annals Of The New York Academy Of Sciences; Biochemical And Biophysical Research Communications; British Journal Of Cancer; Cancer; Cancer Letters; Carcinogenesis; Experimental Cell Research; Frontiers In Bioscience; Gene; Genes Chromosomes & Cancer; Genomics; Human Molecular Genetics; IEEE-ACM Transactions On Computational Biology And Bioinformatics; International Journal Of Molecular Medicine; International Journal Of Oncology; Journal Of Clinical Oncology; Journal Of Leukocyte Biology; Journal Of Neurochemistry; Leukemia Research; Leukemia; Mammalian Genome; Microbes And Infection; Molecular Immunology; Molecular Plant-Microbe Interactions; Oncogene; Oncology Reports; Pharmacogenomics; Plant Molecular Biology; Planta

Weak Policy: Bioinformatics; BMC Bioinformatics; BMC Cancer; BMC Genomics; Breast Cancer Research; FASEB Journal; Genome Biology; Genome Research; International Journal Of Cancer; Molecular Endocrinology; Physiological Genomics; Plant Journal; Plant Physiology; Proteomics; Stem Cells; Toxicological Sciences; Virology

Strong (Enforceable) Policy: Applied And Environmental Microbiology; Blood; Cancer Research; Cell; Clinical Cancer Research; Developmental Biology; FEBS Letters; Gene Expression Patterns; Infection And Immunity; Journal Of Bacteriology; Journal Of Biological Chemistry; Journal Of Experimental Botany; Journal Of Immunology; Journal Of Pathology; Journal Of Virology; Molecular Cancer Therapeutics; Molecular And Cellular Biology; Nature Biotechnology; Nature; Nucleic Acids Research; Plant Cell; Proceedings Of The National Academy Of Sciences Of The USA (PNAS); Science

Table 1: Classification of journal data-sharing policies for gene expression microarray data

3.1.2 Datatype-specific policies

The journals with general data-sharing policies almost always supplement this with additional instructions for certain datatypes. In fact, many journals only have policies for certain datatypes and not for data sharing in general. The policies for depositing nucleotide sequences are usually more strict than policies for other datatypes, including gene expression microarray data.
Stem Cells, Blood (similar statement): Stem Cells supports the efforts of the National Academy of Sciences (NAS) to encourage the open sharing of publication-related data. Stem Cells adheres to the beliefs that authors should include in their publications the data, algorithms, or other information that is central or integral to the publication, or make it freely and readily accessible; use public repositories for data whenever possible; and make patented material available under a license for research use.

Bioinformatics: Bioinformatics fully supports the recommendations of the National Academies regarding data sharing.

Genome Research: Genome Research encourages all data producers to make their data as freely accessible as possible prior to publication. Open data resources accompanied by fair use will serve to greatly enhance the scientific quality of work by the entire community and for society at large.

Plant Cell: The purpose of this policy is to ensure that conclusions are scientifically sound.

Physiological Genomics: Work published in the APS Journals must necessarily be independently verifiable [....] Within a short time span, microarrays have become an important, commonly used tool in molecular genetics and physiology research. For microarray analysis of gene expression to have any long-term impact, it is crucial that the issue of reproducibility be adequately addressed.

Proceedings Of The National Academy Of Sciences Of The USA: To allow others to replicate and build on work published in PNAS, authors must make materials, data, and associated protocols available to readers.

Science: After publication, all data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science.

Journal Of Biological Chemistry: ... will substantially enhance an author's ability to communicate important research information and will also greatly benefit readers.

Table 2: Selected excerpts (from Instructions to Authors) to illustrate the variety of data-sharing policy motivations
The FASEB Journal, in contrast, explicitly treats all datatypes the same:

The FASEB Journal also does not distinguish between microarray data and other sorts of data (proteomics, sequence data, organic syntheses, crystal structures, etc.) All methods must be publicly available and described. Anything published in The FASEB Journal must have all data available not only for review but to every reader, electronic or print.

3.1.3 Sharing requested or required

Most journals with a policy for sharing microarray data state it as a requirement, using phrases like must, required, and as a condition of publication. A few policies (n=4) are less strict, stating their policies as requests through the words should, recommend, and request.

3.1.4 Data location

Most policies state that microarray data must be made available in a public database. A few are less specific, stating that sharing via public webpages or supplementary journal information is sufficient, or the policy leaves location unspecified. Some policies are more specific, insisting that the database be of a certain standard. Plant Cell, for example, specifies a permanent public database. Plant Physiology
expands on this theme: Links to web sites other than a permanent public repository are not an acceptable alternative because they are not permanent archives. Two databases, GEO and ArrayExpress, are the predominant centralized storage locations for microarray datasets. Many of the policies suggest that data be deposited into one of these two locations, and a few policies limit the choice to one of these centralized options.

3.1.5 Data format

None of the policies explicitly specified a data format. By recommending or requiring submission to one of the permanent public databases, the journals implicitly stipulate the standard formats used within those databases.

3.1.6 Data completeness

The Microarray and Gene Expression Data (MGED) Society has developed guidelines for the Minimum Information About a Microarray Experiment (MIAME) that is "needed to enable the interpretation of the results of the experiment unambiguously and potentially to reproduce the experiment."[12] Because the experimental conditions for collecting microarray data can be very complex, these MIAME guidelines are very helpful for both data sharers and data reusers. Physiological Genomics includes rationale for adopting the MIAME guidelines within their instruction for authors statement:

Within a short time span, microarrays have become an important, commonly used tool in molecular genetics and physiology research. For microarray analysis of gene expression to have any long-term impact, it is crucial that the issue of reproducibility be adequately addressed. In addition, since microarray analytic standards are certain to change, it is crucial that authors identify the nature of the experimental conditions prevalent at the time of their research. If today's research is to be relevant tomorrow, the core elements that are immune to obsolescence must be made clear. The APS Journals are adopting the MIAME standards to ensure that what is cutting edge today is not obsolete a few years later.

More than 30 of the data-sharing policies recommend that data be compliant with the MIAME guidelines. As an example of one of the strictest policies, Gene Expression Patterns requires adherence to the MIAME standards and even asks for a completed MIAME checklist to be submitted with the manuscript:

Authors submitting manuscripts relying on microarray or similar screens must supply the data as Supplementary data [...] at the time of submission, along with the completed MIAME checklist. The data must be MIAME-compliant and supplied in a form that is widely accessible.

3.1.7 Timeliness of public availability

A few policies specify that microarray data must be available to the public upon publication. None of the policies explicitly allow data to be withheld until a date after publication.

3.1.8 Consequences for not sharing data
Applied And Environmental Microbiology, Infection And Immunity, Journal Of Bacteriology, Journal Of Virology, Molecular And Cellular Biology: Failure to comply with the policies described in these Instructions may result in a letter of reprimand, a suspension of publishing privileges in ASM journals, and/or notification of the authors' institutions.

Nucleic Acids Research: The Editors are prepared to deny further publication rights in the Journal to authors unwilling to abide by these principles.

Mammalian Genome: Failure to comply with this policy may result in exclusion from publication in Mammalian Genome.

Nature, Nature Biotechnology: After publication, readers who encounter a persistent refusal by the authors to comply with these guidelines should contact the chief editor of the Nature journal concerned, with "materials complaint" and publication reference of the article as part of the subject line. In cases where editors are unable to resolve a complaint, the journal reserves the right to refer the correspondence to the author's funding institution and/or to publish a statement of formal correction, linked to the publication, that readers have been unable to obtain necessary materials or reagents to replicate the findings.

Table 3: Selected excerpts (from Instructions to Authors) of consequences for noncompliance with data-sharing journal policies
Several policies stipulate consequences for authors who fail to comply with journal conditions, as listed in Table 3. No weak policies included consequences, even though weak policies would benefit most since their requirements are the least enforceable prior to publication.

Although only tangentially related to dataset sharing, it is interesting to note the tough stance that some journals are willing to take when authors refuse to share their biological reagents after publication. From Blood:

Although the Editors appreciate that many of the reagents mentioned in Blood are proprietary or unique, neither condition is considered adequate grounds for deviation from this policy. … if a reasonable request is turned down and not submitted to the Editor-in-Chief, the corresponding author will be held accountable. The consequence for noncompliance is simple: the corresponding author will not publish in Blood for the following 3 years.

Forbidding exceptions to data sharing policies:
Genome Research: Genome Research will NOT consider manuscripts where data used in the paper is not freely available on either a publicly held Web site or, in the absence of such a Web site, on the Genome Research Web site. There are NO exceptions.

Permitting exceptions to data sharing policies:
Proceedings Of The National Academy Of Sciences Of The USA: Authors must disclose upon submission of the manuscript any restrictions on the availability of materials or information.
Developmental Biology, Gene Expression Patterns: The editors understand that on occasion authors may not feel it appropriate to deposit the entire data set at the time of publication of this paper. We are therefore willing to consider exceptions to this requirement in response to a request from the authors, which must be made at the time of initial submission or as part of an informal pre-submission enquiry.
Science: We recognize that discipline-specific conventions or special circumstances may occasionally apply, and we will consider these in negotiating compliance with requests. Any concerns about your ability to meet Science's requirements must be disclosed and discussed with an editor.

Table 4: Selected excerpts (from Instructions to Authors) to illustrate forbidden and permitted exceptions from data-sharing policies
Figure 1: A boxplot of the impact factors for each journal, grouped by the strength of the journal's data-sharing policy. For each group, the heavy line indicates the median, the box encompasses the interquartile range (IQR, 25th to 75th percentiles), the whiskers extend to datapoints within 1.5xIQR from the box, and the notches approximate the 95% confidence interval of the median.
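For readers unfamiliar with notched boxplots, the short matplotlib sketch below shows how a figure of this kind can be drawn; it is an illustration with placeholder values, not the plotting code behind Figures 1 and 2. In matplotlib's default notched boxplot, the notch spans roughly the median plus or minus 1.57 x IQR / sqrt(n), an approximate 95% confidence interval for the median.

```python
# Sketch with placeholder values (not the study's plotting code): a notched
# boxplot of journal impact factors grouped by data-sharing policy strength,
# in the style of Figure 1.
import matplotlib.pyplot as plt

groups = {
    "No policy":     [0.9, 2.1, 3.6, 4.0, 5.2],   # placeholder impact factors
    "Weak policy":   [2.5, 4.1, 4.9, 6.0, 8.3],
    "Strong policy": [3.0, 5.5, 6.2, 9.8, 30.0],
}

fig, ax = plt.subplots()
ax.boxplot(list(groups.values()), notch=True)   # notch ~ 95% CI of the median
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("Journal impact factor")
ax.set_title("Impact factor by strength of data-sharing policy")
fig.savefig("impact_by_policy.png", dpi=150)
```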
From PNAS:

Authors must make Unique Materials (e.g., cloned DNAs; antibodies; bacterial, animal, or plant cells; viruses; and computer programs) promptly available on request by qualified researchers for their own use. Failure to comply will preclude future publication in the journal… Contact pnas@nas.edu if you have difficulty obtaining materials.

Journal Attribute                        Estimate   p-value
Impact Factor, natural log                   0.34    <0.001 ***
Open Access                                  0.63     0.002 **
Published by Association                     0.23     0.046 *
Biochemistry & Molecular Biology            -0.28     0.031 *
Biotechnology & Applied Microbiology         0.04     0.784
Plant Sciences                              -0.08     0.636
Oncology                                    -0.37     0.004 **
Cell Biology                                 0.10     0.485
Genetics & Heredity                         -0.11     0.456
Multidisciplinary Sciences                  -0.29     0.207

Table 5: Results of linear multivariate regression over the existence of a journal's data-sharing policy

3.1.9 Exceptions to data sharing policies
At least one journal, Genome Research, explicitly disallows any exceptions to their principle of public data sharing. In contrast, a few other journals state or imply that they are willing to be flexible in some circumstances. Relevant excerpts are included in Table 4.

Figure 2: A boxplot of the relative data-sharing prevalence for each journal, grouped by the strength of the journal's data-sharing policy. For each group, the heavy line indicates the median, the box encompasses the interquartile range (IQR, 25th to 75th percentiles), the whiskers extend to datapoints within 1.5xIQR from the box, and the notches approximate the 95% confidence interval of the median.

Journal Attribute                        Estimate   p-value
Has a Data Sharing Policy                    0.11     0.037 *
Impact Factor, natural log                   0.06     0.118
Open Access                                 -0.07     0.386
Published by Association                     0.15     0.002 **
Biochemistry & Molecular Biology             0.01     0.850
Biotechnology & Applied Microbiology        -0.01     0.866
Plant Sciences                               0.08     0.232
Oncology                                     0.02     0.737
Cell Biology                                 0.04     0.475
Genetics & Heredity                          0.27    <0.001 ***
Multidisciplinary Sciences                   0.28     0.004 **

Table 6: Results of linear multivariate regression over the prevalence with which the articles in a journal submit their microarray data to a centralized database
3.2 The relative strength of the data sharing policies
Based on univariate analysis, data sharing policy strength was associated with impact factor. As seen in Figure 1, the journals with no data sharing policy, a weak policy, and a strong policy had respective median impact factors of 3.6, 4.9, and 6.2. Data sharing policy was also associated with journal publisher: 46% of commercially published journals had a data sharing policy, compared to 82% of journals published by an academic society. All five of the open-access journals had a policy.

In multivariate analysis, we found that the following variables were positively associated with the existence of a microarray data sharing policy: impact factor, open access, and academic society publishing. In contrast, the subdisciplines of Biochemistry&Molecular Biology and Oncology were negatively associated with the existence of a microarray data sharing policy. Details including all the covariates are provided in Table 5.
3.3 The frequency with which authors share their data
Journals with the strongest data sharing policies had the highest proportion of papers with shared datasets. As seen in Figure 2, the journals with no data sharing policy, a weak policy, and a strong policy had a median data sharing prevalence of 8%, 20%, and 25% respectively. As mentioned in the Methodology section, these proportions should be interpreted relative to each other rather than to a theoretical maximum of 100%.

Based on multivariate analysis, we found that articles were more likely to have submitted primary data to GEO when they were published in journals with a data sharing policy, published by an academic society, or in the subdisciplines of Genetics&Heredity or Multidisciplinary Sciences. Details are given in Table 6.
4. Discussion
We found wide variation amongst journal policies on data sharing, even for a data type with well-defined reporting standards and centralized repositories. Journals with a high impact factor, an open access publishing model, and a non-commercial publisher were most likely to have a data-sharing policy. This could be expected, as journals with a high impact factor are able to stipulate conditions to ensure research is of the highest quality without eroding their appeal, open-access journals are often particularly strong advocates for all aspects of open scholarship, and journals published by academic societies have previously been found to endorse data sharing more readily than commercial journals.[4]

Surprisingly, our study did not identify any subdisciplines with an unusually-high number of data sharing policies. In contrast, we found that Oncology journals and Biochemistry&Molecular Biology journals were relatively unlikely to have a data sharing policy. The Oncology result is consistent with our observation that medical journals have been slower to embrace new publishing paradigms and open scholarship principles than journals within biology and bioinformatics. This is unfortunate, since cancer microarray data holds particular promise and is often especially expensive and time-consuming to collect. It is also unnecessary, since microarray data can be (and is) shared without compromising patient privacy.

We found that the existence of a data sharing policy was associated with an increase in data sharing behavior. A non-commercial publisher and the subdisciplines of Genetics&Heredity and Multidisciplinary Sciences were also significantly associated with a relatively high frequency of dataset submissions into the GEO database, as a percentage of all published gene expression papers. Studies of Genetics&Heredity often reuse data, so perhaps authors in that field are well acquainted with the value of sharing data.
Interestingly, the two subdisciplines that were negatively associated with the existence of a data sharing policy were not less likely than usual to share their data when other factors are held constant. We were surprised that impact factor was not strongly associated with data sharing prevalence in multivariate analysis, because we suspect that well-funded and high-profile studies are under more pressure to share their data. In the future, we'd like to include variables about funding in these analyses.

A large number of journals had a policy for microarray data but not data in general. This probably reflects the success of MGED's efforts in actively encouraging and supporting microarray data exchange. As such, the results we have found are illuminating but may not be representative for other datatypes with a less mature infrastructure.

A study by Brown[13] in 2000 used several methods to investigate the adoption and usage of Genbank, one of the most mature and successful biological databases. She tracked changes in instruction to author statements across 23 journals over 20 years, and noted that the data sharing policies for sequences have become stronger over time. As she explains, the authors who published in the Journal of Biological Chemistry were urged to deposit sequence data into Genbank in 1984, told they "should" deposit data in 1985, and were required to submit data as a condition of publication by 1991. It would be interesting to study whether, as the microarray field continues to mature, the journals we consider to have weak data sharing policies will evolve stronger policies with time.

Journals ought to give careful consideration to changing their policies[14]. Although there may be direct benefits to journals when authors must share their raw research data (reducing fraud, encouraging more careful research), data sharing mandates are controversial.[15] It is possible new mandates may cause authors to shop for an alternative publishing venue to avoid hassle. To measure the acceptance of a policy change, the editorial team at Physiological Genomics surveyed their authors and reviewers two years after instituting a data sharing requirement. They found that the vast majority of authors (92%) believed depositing microarray data was of significant value to the scientific community, and "67% of those who responded said they did not find the deposit of microarray data into GEO to be an obstacle to submission or review of articles".[16] Database tools have evolved since that survey, and submitting data continues to get easier.

Nonetheless, there are many personal difficulties for those who undertake to share their data, resulting in a variety of reasons why investigators may choose to withhold it. First, sharing data is often time-consuming: the data have to be formatted, documented, and uploaded. Second, releasing data can induce fear. There is a possibility that the original conclusions may be challenged by a re-analysis, whether due to possible errors in the original study, a misunderstanding or misinterpretation of the data, or simply more refined analysis methods. Future data miners might discover additional relationships in the data, some of which could disrupt the planned research agenda of the original investigators. Investigators may fear they will be deluged with requests for assistance, or need to spend time reviewing and possibly rebutting future reanalyses.
They might feel that sharing data decreases their own competitive advantage, whether future publishing opportunities, information trade-in-kind offers with other labs, or potentially profit-making intellectual property. Finally, it can be complicated to release data. If not well managed, data can become disorganized and lost. Some informed consent agreements may not obviously cover subsequent uses of data. De-identification can be complex. Study sponsors, particularly from industry, may not agree to release raw detailed information, or data sources may be copyrighted such that the data subsets cannot be freely shared.

Given all of these hurdles, it is natural that authors may need extra encouragement to share their data. We suggest that journal editors take a few simple steps to increase adherence to data sharing policies and thus bring about a more open scholarship. First, journals that already mandate data sharing should require the inclusion of an accession number (or a web address for datatypes without databases) upon submission, since "prepublication compliance is much easier to monitor and enforce than postpublication compliance" [4]. Second, journals should instruct their editors and reviewers to confirm that accession numbers are included
in the manuscripts, as some journals do for their clinical trial reporting policies [17]. Third, journals should require that authors complete a MIAME checklist to increase the likelihood that shared data is complete and well annotated, following the example of Gene Expression Patterns. To take this step further, journals could contract with a service like the one offered by ArrayExpress [18] to verify that submitted datasets meet a threshold of annotation quality. Fourth, journals need to enforce the consequences they set out: papers that do not uphold the policies should not be published. Finally, during this cultural transition, we recommend that journals support measures that recognize and reward investigators who share data [19]. For example, journals could educate authors and reviewers on responsible data reuse and acknowledgement practices, either as part of instructions-to-authors statements or in editorials (see the Nature journals [20, 21, 22]). Acknowledging data sources in a machine-readable way (through references, URLs, and accession numbers) will allow the benefits of data reuse to be automatically linked back to the original data producers through citation counts [23] or other usage metrics, and thus provide a positive motivation for sharing data. Innovative attempts to provide microattribution or a data reuse registry may offer additional opportunities for journals to support these goals [21, 22, 24].

Our study has several important limitations: we explored journal policies for only one type of data, our measured data sharing behavior predated the policy downloads, and the policy classifications were performed by only one investigator. Our method of measuring data sharing behavior captures many but not all articles that shared data; we plan to use natural language processing techniques to find a wider variety of data sharing instances in the future [25]. Similarly, a full-text query to identify articles that produce primary, shareable data – perhaps using laboratory terms like purify and hybridize – could improve our preliminary estimates of data sharing prevalence. Finally, we note that the reported associations do not imply causation: we have not demonstrated that changing a journal's data sharing policy will change the behavior of authors.

Nonetheless, we believe this review and analysis is an important step in understanding the relationship between journal policies and data sharing outcomes. Policies are implemented with the hope of effecting change. It is often said, "You cannot manage what you do not measure." We need to understand the motivation and impact of our various incentives and initiatives if we hope to unleash the benefits of widespread data sharing.
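As an illustration of the full-text screening idea mentioned above, the following hedged sketch flags articles that likely generated primary microarray data and looks for GEO accession numbers as evidence of sharing. The term list and thresholds are illustrative assumptions, not the authors' planned method.

```python
# Hypothetical sketch of full-text screening: detect data-producing articles
# via wet-lab vocabulary, and detect sharing via GEO Series accession numbers.
import re

WET_LAB_TERMS = re.compile(r"\b(purif\w+|hybridi[sz]\w+|microarray|RNA extraction)\b", re.I)
GEO_ACCESSION = re.compile(r"\bGSE\d{3,6}\b")  # GEO Series accession format

def likely_produced_data(full_text: str, min_hits: int = 3) -> bool:
    """Crude heuristic: enough wet-lab vocabulary suggests primary data collection."""
    return len(WET_LAB_TERMS.findall(full_text)) >= min_hits

def shared_accessions(full_text: str) -> list[str]:
    """Return GEO Series accession numbers mentioned in the article text."""
    return sorted(set(GEO_ACCESSION.findall(full_text)))

# Example use on a single article's plain text:
# text = open("article.txt").read()
# if likely_produced_data(text):
#     print("Data-producing article; accessions found:", shared_accessions(text))
```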
5. Acknowledgements
HP is supported by NLM training grant 5T15-LM007059-19 and WC is funded through NLM grant 1R01LM009427-01. Raw data and statistical analysis code from this study are available at http://www.dbmi.pitt.edu/piwowar/

6. References
[1] MERTON R: The sociology of science: Theoretical and empirical investigations. 1973
[2] GASS A: Open Access As Public Policy. PLoS Biology 2(10):e353, 2004
[3] VICKERS A: Whose data set is it anyway? Sharing raw data from randomized trials. Trials 7:15, 2006
[4] MCCAIN K: Mandating Sharing: Journal Policies in the Natural Sciences. Science Communication 16(4):403-431, 1995
[5] BALL CA et al.: Standards for microarray data. Science (New York, NY) 298(5593), 2002
[6] BALL CA et al.: Submission of microarray data to public repositories. PLoS Biol 2(9), 2004
[7] CECH TR et al.: Sharing publication-related data and materials: responsibilities of authorship in the life sciences. Plant Physiology 132(1):19-24, 2003
[8] PANEL ON SCIENTIFIC RESPONSIBILITY AND THE CONDUCT OF RESEARCH: Responsible Science, Volume I: Ensuring the Integrity of the Research Process. 1992
[9] Microarray standards at last. Nature 419(6905), 2002
[10] BARRETT T et al.: NCBI GEO: mining tens of millions of expression profiles—database and tools update. Nucleic Acids Res 35(Database issue), 2007
[11] PARKINSON H et al.: ArrayExpress—a public database of microarray experiments and gene expression profiles. Nucleic Acids Res 35(Database issue), 2007
[12] BRAZMA A et al.: Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nat Genet 29(4):365-371, 2001
[13] BROWN C: The changing face of scientific discourse: Analysis of genomic and proteomic database usage and acceptance. Journal of the American Society for Information Science and Technology 54(10):926-938, 2003
[14] Democratizing proteomics data. Nat Biotech 25(3):262-262, 2007
[15] CAMPBELL P: Controversial Proposal on Public Access to Research Data Draws 10,000 Comments. The Chronicle of Higher Education A42, 1999
[16] VENTURA B: Mandatory submission of microarray data to public repositories: how is it working? Physiol Genomics 20(2):153-156, 2005
[17] HOPEWELL S et al.: Endorsement of the CONSORT Statement by high impact factor medical journals: a survey of journal editors and journal 'Instructions to Authors'. Trials 9:20, 2008
[18] BRAZMA A, PARKINSON H: ArrayExpress service for reviewers/editors of DNA microarray papers. Nature Biotechnology 24(11):1321-1322, 2006
[19] Got data? Nat Neurosci 10(8):931-931, 2007
[20] SCHRIGER DL, ARORA S, ALTMAN DG: The content of medical journal Instructions for authors. Ann Emerg Med 48(6), 2006
[21] Human variome microattribution reviews. Nat Genet 40(1), 2008
[22] Compete, collaborate, compel. Nat Genet 39(8), 2007
[23] PIWOWAR HA, DAY RS, FRIDSMA DB: Sharing detailed research data is associated with increased citation rate. PLoS ONE 2(3), 2007
[24] PIWOWAR HA, CHAPMAN WW: Envisioning a Biomedical Data Reuse Registry. Blog post, March 24, 2008. URL: http://researchremix.wordpress.com/2008/03/24/envisioning-a-biomedical-data-reuse-registry/
[25] PIWOWAR HA, CHAPMAN WW: Identifying data sharing in the biomedical literature. AMIA Annual Symposium [submitted], 2008
Researcher's Attitudes Towards Open Access and Institutional Repositories: A Methodological Study for Developing a Survey Form Directed to Researchers in Business Schools

Turid Hedlund
Information Systems Science, Swedish School of Economics and Business Administration
Pb 479, Arkadiagatan 22, 00101 Helsinki, Finland
e-mail: turid.hedlund@hanken.fi
Abstract
The aim of this study was to address the need for further studies on researchers' expectancies and attitudes towards open access publishing. In particular we wanted to focus on acceptance and user behavior regarding institutional archives. The approach is domain specific and was based on a framework of theories on the intellectual and social organization of the sciences and communication practices in the digital era. In the study we apply a theoretical model of user acceptance and user behavior (UTAUT), developed by Venkatesh et al. in 2003, as an explanatory model for developing a survey form for quantitative empirical research on user attitudes and preferences. Thus our research approach is new and cross-disciplinary in the way we combine earlier research results from the fields of organizational theory, information science and information systems science. This is in our view a fruitful approach, broadening the theoretical base of the study and bringing in a deeper understanding of the research problems. As a result of the study we will present a model framework and a web survey form for how to carry out the empirical study.
Keywords: end-user attitudes; methodological study; web survey
1. Introduction
In recent years we have seen quite a few studies on open access publishing. Among others, large cross-disciplinary survey results on author opinions of open access journals have been published [1], [2], [3]. Author perceptions of the author-charges business model have also been studied in the domain of a medical journal [4]. In studies on institutional repositories, attention has for several years been mainly on implementation, technical features and interoperability of systems using the OAI-PMH standard. It is a natural development that, now that institutional repositories have been in operation for some time, studies have begun to focus on evaluation of repository content by genre and type of the included documents, as well as on growth rates for submissions [5], [6]. However, even though the concept of open access is known among academic researchers, their research and publishing practices have still not undergone a radical change. The important question of non-use of institutional repositories has lately been raised by [7]. There is a need for a deeper understanding of the extent to which open access practices have spread among academics and of the main incentives and barriers to acceptance and use of new systems for open access dissemination of research results, for example institutional repositories. The aim of this study was to form a methodological part of a project on research on open access and in particular acceptance and user behaviour regarding institutional archives. The approach was to focus on the end-users, in this case researchers in business schools in Finland. As the approach of the project was to limit the collecting of data to a specific field, the framework of theories was based on intellectual and
social organization of the sciences and communication practices [8], [9], [10]. We also relied on previous studies on open access publishing in the domain of biomedicine [11]. In the study we applied a theoretical model of user acceptance and user behavior (UTAUT), developed by [12], as an explanatory model for the construction of a questionnaire directed to researchers in business schools. The framework will naturally also be used in the continuing project as a means for the analysis of results from the empirical surveys directed to business school researchers. Thus our research approach was new and cross-disciplinary in the way we combined earlier research results from the fields of organizational theory, information science and information systems science. This is in our view a fruitful approach, broadening the theoretical base of the study and bringing in a deeper understanding of the research problems. In the study we addressed the following questions:
• What are the prevailing attitudes toward open access among business school researchers in Finland?
• What types of incentives and barriers for use and non-use of open access publishing channels can be identified?
• Do factors such as the social influence of the faculty and the organization have an impact on acceptance and use of open access and institutional repositories?
• Do personal factors such as perceived usefulness for the research career and perceived ease of use have an impact on acceptance and use of open access and institutional repositories?
The final paper will start with an introductory section on theories describing domain-specific features in scientific communication and scientific publishing in the fields of research typically carried out in business schools. In the following section we build up the framework of theories and models on end-user attitudes and the diffusion of new technology. The section on study settings describes the methodology for the design of a web questionnaire and the survey to be carried out in business schools in Finland, followed by an analysis, discussion and concluding remarks.
2. Theoretical Background
For the study on the scientific disciplines represented in business schools we rely on Whitley's (1984; 2000) theory on the social organization of the scientific fields as our starting point. Whitley's theory characterizes the differences between scientific fields along two main dimensions:
• Degree of mutual dependence – associated with the degree of dependence between scientists, colleagues, or particular groups to make a proper research contribution
• Degree of task uncertainty – associated with differences in patterns of work, organisation and control in relation to changing contextual factors
Relating Whitley's taxonomy of scientific fields to economics and the related subject of business administration, we could define the following pattern. A high degree of mutual dependence would indicate that scientific communication patterns become more controlled as competition arises. For example, publishing in journals with a high rank within the community is favoured among researchers. This trend is reinforced also in business schools, where research evaluation and the meriting of researchers focus to an increasing degree on publishing in journals with high impact factors.
The degree of task uncertainty, which Whitley associates with differences in patterns of work and publishing, might also indicate that a scientific field with a high degree of task uncertainty is less controlled regarding its scientific output. In business schools several different patterns of work and different contextual factors are naturally present, since several different subjects are represented in the departments and faculty. The long tradition within the field of economics of publishing working papers can, for example, be associated with the need to communicate research results at an early stage of the research process. Hedlund and Roos [11] characterize, in their study on incentives to publish in open access journals, factors depending on the social environment and personal factors of the researcher. Social factors:
• Policymaking, governmental policy in science and technology, policy of other funding bodies, interest groups and officials
• Increased demands for productivity and accountability
• Internationalisation and strong competition in the scientific field
• Geographical location
• Availability of subject-based and institutional archives and open access journals
• Institutional policies that promote open access publishing
• Communication patterns of the scientific field and the field's willingness to adopt new techniques early

Personal factors:
• The importance of reputation and meriting to the researcher
• Speed of publication and visibility of research results
• Personal communication patterns and willingness to adopt new techniques
• Personal values
Figure 1. The UTAUT model. Source: Venkatesh et al. 2003, p. 447
In studies modelling user acceptance and behaviour, [12] developed a theoretical framework that is well suited for use as an explanatory framework for the intended statistical analysis of the results from a survey directed to business school researchers in Finland. In formulating the Unified Theory of Acceptance and Use of Technology (UTAUT), [12] identified four constructs as direct determinants of user acceptance and usage behaviour (see Figure 1): performance expectancy, effort expectancy, social influence and facilitating conditions. The four constructs are defined briefly in [12] as follows:
• Performance expectancy – "the degree to which an individual believes that using a system will help to attain gains in a job performance"
• Effort expectancy – "the degree of ease associated with the use of a system"
• Social influence – "the degree to which an individual perceives that important others believe he or she should use the new system"
• Facilitating conditions – "the degree to which an individual believes that an organizational and technical infrastructure exists to support use of the system"
Until now the above constructs have been used mainly in research on acceptance of, and intention to use, IT systems. In the present study we adapt the constructs to the study of acceptance, use and eventually also non-use of open access publishing systems such as institutional archives. In the following we have modified the constructs in the UTAUT model to the needs of a study on end-user attitudes (researchers) towards open access publishing.
• Performance expectancy – the degree to which the researcher expects gains with OA publishing in research performance, and thus increases in his/her personal merits
• Effort expectancy – the degree to which the researcher expects ease of use of an OA system. This naturally has to do with system technology and design, but also with personal factors such as willingness to learn and use new systems. Experience from the use of other information and communication systems on the web is probably a contributing factor.
• Social influence – the degree to which a researcher is influenced by fellow researchers and the organization
• Facilitating conditions – the degree to which an organizational and technical infrastructure is provided to support use of the system
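To indicate how the adapted constructs above could later be used in the statistical analysis of survey responses, the following minimal sketch aggregates 1-5 Likert items into the four construct scores. The item-to-construct mapping and the item keys are hypothetical; the real questionnaire items are described in section 3.

```python
# Hedged sketch: averaging Likert items into the four adapted UTAUT constructs
# for one respondent. Item identifiers (pe1, ee1, ...) are illustrative only.
from statistics import mean

CONSTRUCT_ITEMS = {
    "performance_expectancy": ["pe1", "pe2", "pe3"],
    "effort_expectancy": ["ee1", "ee2"],
    "social_influence": ["si1", "si2", "si3"],
    "facilitating_conditions": ["fc1", "fc2"],
}

def construct_scores(response: dict[str, int]) -> dict[str, float]:
    """Average the Likert items belonging to each construct."""
    return {
        construct: mean(response[item] for item in items if item in response)
        for construct, items in CONSTRUCT_ITEMS.items()
    }

# Example respondent (1 = totally agree ... 5 = totally disagree):
respondent = {"pe1": 2, "pe2": 1, "pe3": 2, "ee1": 3, "ee2": 2,
              "si1": 4, "si2": 3, "si3": 4, "fc1": 2, "fc2": 1}
print(construct_scores(respondent))
```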
3. Design of questions for the survey form
The survey form is designed as a web survey form intended to be directed to faculty members and doctoral students in business schools. At the beginning of the form, explanations are provided of concepts related to open access publishing, such as open access journal, university publication repository and subject-based publication repository. The survey form contains four section headings: demographic questions, questions on awareness and use of open access services, questions on open access publishing, and questions on reasons and barriers for open access publishing. In the following, the factors in the model and their representations in the questionnaire are described. The first section of the questionnaire is described in Table 1. The table includes examples of how the moderating factors in the theoretical framework were depicted in the survey.

Moderating factors     | Demographic questions
Gender                 | Gender
Age as researcher      | Position (doctoral student – professor)
Experience of use      | Research experience in years
Voluntariness of use   | Does your university mandate depositing a copy of your research publications in a publication archive?

Table 1. Demographic section of the survey form
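For illustration, the sketch below shows one way the four survey sections and their question types could be represented for a web survey tool. The structure and question texts are abbreviated from the examples given in this section and are assumptions about implementation, not the final form.

```python
# Illustrative representation of the survey structure; question wording is
# abbreviated from sections 3.2-3.6 and the keys are hypothetical.
LIKERT = {"type": "likert", "scale": (1, 5),
          "labels": {1: "I totally agree", 5: "I totally disagree"}}

SURVEY = [
    {"section": "Demographic questions",
     "questions": [
         {"id": "gender", "type": "choice", "text": "Gender"},
         {"id": "position", "type": "choice",
          "text": "Position (doctoral student - professor)"},
         {"id": "experience", "type": "number",
          "text": "Research experience in years"},
         {"id": "mandate", "type": "yes_no",
          "text": "Does your university mandate depositing a copy of your "
                  "research publications in a publication archive?"},
     ]},
    {"section": "Awareness and use of open access services", "questions": []},
    {"section": "Open access publishing", "questions": []},
    {"section": "Reasons and barriers for open access publishing",
     "questions": [
         {"id": "si1", **LIKERT,
          "text": "Researchers that are important to me tend to have a copy "
                  "of their publications on their home pages"},
     ]},
]
```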
3.1 Determinants of user acceptance and behaviour
Factors depending on the social environment:
• Social influence of fellow researchers and the organization
• Policy of funding bodies, university organization and officials
• Differences in patterns of work and changing contextual factors
• Facilitating conditions (organizational and technical infrastructure to support use of the system)

Personal factors of the researcher:
• Performance expectancy (expected gains in research performance, personal merits)
• Effort expectancy (expected ease of use of a system) – not reflected in the survey
3.2 Examples of social influence
How does your research community or fellow researchers react to open access publishing? Please indicate on a scale from 1-5 how well you find that the following statements reflect your opinion (1 = I totally agree, 5 = I totally disagree).
• Researchers that are important to me tend to have a copy of their publications on their home pages
• I can find publications on my research topic openly on the web
• My fellow researchers ask me to publish copies of my research papers if I do not have them publicly available in full text
3.3 Examples of differences in patterns of work and changing contextual factors
What are in your opinion the main reasons to publish in an open access publication archive of your university?
• By submitting my publication into the open access publication archive of my university I can reach a broader audience
• People interested in my research ask me to have my research available on the Internet
What are in your opinion the main reasons to publish in an open access journal?
• Open access journals reach a broader audience and especially professionals that do not have access to databases in the university libraries
• I can choose open access journals of good standard in my research field
• My research community favours publishing in open access journals
3.4 Examples of policy of funding bodies, university organisation and officials
• My research funders recommend or require me to have my research results freely available to the public
• My university recommends or requires open access publishing in the publication archive of the university
• My research funders recommend or require me to publish my research in an open access journal when possible
• My university recommends or requires me to publish my research in an open access journal when possible
Free comments are encouraged.

3.5 Facilitating conditions (organizational and technical infrastructure to support use of the system)
What are in your opinion the main barriers for publishing in an open access publication archive managed by your university? You can choose several alternatives from the list below.
• I do not know if my university has an open access publication archive
• I do not know how to submit a copy of a published article
• I believe that copyright issues are difficult to cope with
• I do not know which version of my article I am allowed to submit
Free comments are encouraged.

3.6 Personal factors of the researcher: Performance expectancy
How well do you find that open access journals meet the following criteria? Please indicate on a scale from 1-5 how well you find that the following statements reflect your opinion (1 = I totally agree, 5 = I totally disagree).
• They provide accessibility to the right focus groups
• They increase visibility
• The speed of publishing is increasing
• The quality and impact factors meet the standard of traditional journals
• The peer review is of good quality

4. Discussion
This conference paper is part of a project on open access institutional repositories carried out at the Swedish School of Economics and Business Administration. The framework for the study presented in the theoretical part therefore took into account discipline-specific factors from earlier research. The research questions involved modelling the constructs of researchers' attitudes to open access publishing and institutional archives for an empirical survey, and in practice a web survey form to be directed to researchers as end users of open access services.

The survey form was tested on two separate occasions: firstly, to get a picture of the methodological soundness of the constructs depicted in the questions in the survey form, and secondly to get feedback on what inconsistencies and difficulties there might be in actually answering the survey form. To test the methodological soundness, a presentation was given to a group of researchers and doctoral students from the Swedish School of Economics and Business Administration. It became clear that the concepts of open access publishing and institutional archives were not familiar to the audience. Therefore explanations and definitions were added to the survey form. The researchers' main concern was where to publish (in which journal) and in what type of publication (book chapters, journals etc.), not primarily whether in open access format. However, the factors depending on the social environment were seen as relevant for publishing practices. Gaining merits for a future career as a researcher was also important.

The survey form was also sent to a group of eight test persons (researchers and doctoral students). They were asked to fill in and return the form and to comment on problems and design features as well as on the relevance of the questions. The test respondents provided good comments and suggestions for improvements. The main structure and the questions in the survey were found rather easy to fill in and submit. The survey contained a suitable number of questions and did not take too long to fill in. From the test answers that were collected we could conclude that the main reason to publish in an open access journal was the ability to reach a broader audience, including professionals interested in the research. One of the main reasons to publish in a university publication repository was also to reach a broader audience; the other main reason among test respondents was that people interested in a person's research results asked to have them published on the web. The main barrier to publishing in an open access journal was that the department of the researcher did not consider open access journals meriting for a research career. The main barrier to publishing in a university publication archive was that copyright issues were considered difficult to cope with.

The project group developing and managing the open access publication archive named "DHanken" was also asked to comment on the survey questions. Some clarifications and improvements were suggested to single questions, but on the whole the feedback was positive. Many of the comments were on a general level, pointing to the fact that open access publishing might not be very well known among researchers. Therefore some definitions of concepts were added to the survey form. Based on the comments from the test of the survey form we were able to develop an updated version of the form.
The survey will now be sent out to researchers in business schools in Finland and the initial results will be collected and analysed.

5. References
[1] Nicholas, D. and Rowlands, I. Open access publishing: The evidence from the authors. The Journal of Academic Librarianship, vol. 31(3) 2005: 179-181.
[2] Nicholas, D., Huntington, P. and Rowlands, I. Open access journal publishing: the views of some of the world's senior authors. Journal of Documentation, vol. 61(4) 2005: 497-519.
[3] Swan, A. and Brown, S. Authors and open access publishing. Learned Publishing, vol. 13(3) 2004: 219-224.
[4] Schroter, S., Tite, L. and Smith, R. Perceptions of open access publishing: interviews with journal authors. British Medical Journal 2005; (330): 756. Published 26 January 2005.
[5] Thomas, C. and McDonald, R. H. Measuring and comparing participation patterns in digital repositories: Repositories by the numbers, Part 1. D-Lib Magazine 2007; vol. 13(9/10). doi:10.1045/september2007-mcdonald
[6] McDowell, C. S. Evaluating institutional repository deployment in American academe since early 2005: Repositories by the numbers, Part 2. D-Lib Magazine 2007; vol. 13(9/10). doi:10.1045/september2007-mcdowell
[7] Davis, P. M. and Connolly, M. J. L. Institutional repositories: Evaluating the reasons for non-use of Cornell University's installation of DSpace. D-Lib Magazine 2007; vol. 13(3/4). doi:10.1045/march2007-davis
[8] Whitley, R. The intellectual and social organization of the sciences. London: Clarendon Press 1984.
[9] Fry, J. Scholarly research and information practices: a domain analytic approach. Information Processing and Management 2006; vol. 42, pp. 299-316.
[10] Fry, J. and Talja, S. The intellectual and social organization of academic fields and the shaping of digital resources. Journal of Information Science 2007; vol. 33(2), pp. 115-133.
[11] Hedlund, T. and Roos, A. Open Access publishing as a discipline specific way of scientific communication: the case of biomedical research in Finland. In Advances in Library Administration and Organization 25. Elsevier book series 2007.
[12] Venkatesh, V., Morris, M., Davis, G. B. and Davis, F. D. User acceptance of information technology: Toward a unified view. MIS Quarterly 2003; vol. 27(3), pp. 425-478.
The IDRC Digital Library: An Open Access Institutional Repository Disseminating the Research Results of Developing World Researchers

Barbara Porrett
Research Information Management Services Division, International Development Research Centre (IDRC CRDI)
P.O. Box 8500, Ottawa, ON K1G 3H9, Canada
e-mail: bporrett@idrc.ca; web: http://www.idrc.ca/
Abstract
The International Development Research Centre (IDRC) has recently launched the OAI-PMH compliant IDRC Digital Library (IDL), a DSpace institutional repository. The digital library has been developed to enhance the dissemination of research outputs created as a result of Centre-funded research. The repository has a number of unique qualities. It is the public bibliographic database of a Canadian research funding organization, its subject focus is international development and the content is retrospective, dating back to the early 1970s. Intellectual property issues have been a major factor in the development of the repository. Copyright ownership of a majority of IDL content is held by developing world institutions and researchers. The digitization of content and its placement in the open access IDL has involved obtaining permissions from hundreds of copyright holders located in Africa, Asia and Latin America. IDRC has determined that obtaining permissions and populating the repository with developing world researchers' outputs will help to improve scholarly communication mechanisms for Southern researchers. The expectation is that the IDL will make a contribution to bridging the South to South and South to North knowledge gap. The IDRC Digital Library will serve as a dissemination channel that will improve the visibility, accessibility and research impact of southern research.
Keywords: developing world research; institutional repository; open access; DSpace; IDRC Digital Library; International Development Research Centre
1. Introduction
The subject of this presentation is an institutional repository called the IDRC Digital Library [1]. The repository has several unique qualities that distinguish it from other DSpace institutional repositories now accessible on the Internet. It is the repository of a research funding organization, it serves as the organization's public bibliographic database for the dissemination of funded research outputs and public corporate documents, and its content is retrospective, dating back to the early 1970s; as a result, its development and management have presented some significant intellectual property issues. Notwithstanding these and other challenges, the IDRC Digital Library is developing into a significant resource, sharing the research results of developing world researchers with the international research community. IDRC stands for the International Development Research Centre, a Canadian Crown corporation that works in close collaboration with researchers from the developing world in their search for the means to build healthier, more equitable, and more prosperous societies. The creation of IDRC was Canada's response to a climate of disillusionment and distrust that surrounded foreign aid programs during the late 1960s. Maurice Strong and others urged the then Canadian Prime Minister Lester B. Pearson to establish a "new instrument" to provide forward-thinking approaches to international challenges that could not be addressed by way of more conventional programs. This led to the establishment of the world's first
organization devoted to supporting research activities as defined by developing countries. IDRC's objectives, as stated in the International Development Research Centre Act of 1970, are "… to initiate, encourage, support, and conduct research into the problems of the developing regions of the world and into the means of applying and adapting scientific, technical, and other knowledge to the economic and social advancement of those regions." IDRC is guided by a 21-member international Board of Governors and reports to the Canadian Parliament through the Minister of Foreign Affairs. In 2007/08, IDRC received CA$135.3 million in funding from the Parliament of Canada.
2. IDRC and the Dissemination of Funded Research Results
IDRC has, from the outset, placed a great deal of importance on sharing the research outputs that are created as a result of Centre-funded research. Although the copyright ownership of the outputs has always remained with funding recipients, it has been a condition of funding that IDRC maintains the ability to disseminate the research outputs supported by Centre funding. An archive of these outputs has been maintained since 1970, originally on paper but now increasingly in digital format. Bibliographic management of this archive has been done through a library catalogue and more recently through an online public access catalogue that has been accessible to the research community on the IDRC Internet web site.

In an effort to enhance the dissemination of these research outputs and to provide an improved scholarly communication mechanism for Centre-funded researchers, it was decided in the fall of 2005 to explore the possibility of building an Open Archives Initiative (OAI) compliant institutional repository. Under the guidance of a Steering Committee and a Stakeholders Committee, and a policies and governance document [2], a project team of two librarians and a systems analyst undertook the initiative. In April 2007 a DSpace open access institutional repository, called the IDRC Digital Library or the IDL, was launched.
3. Content of the IDRC Digital Library
The IDL provides access to information about the IDRC research output archive dating back to the Centre's beginnings. The database holds 34,000 Dublin Core metadata records, approximately 30% of which provide links to digital full text. The subject coverage reflects the international development focus of IDRC research funding, with strong representation from the sciences and social sciences. The subject areas of research that have been supported by the Centre have changed over time. Research funding currently focuses on the following five themes: Environment and Natural Resource Management; Information and Communication Technologies for Development; Innovation, Policy and Science; Social and Economic Policy; and Global Health Research. An average of slightly over 500 IDRC-funded research projects are active at any point in time, and approximately 750 research outputs are added to the archive each year.
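Because the IDL is OAI-PMH compliant, its Dublin Core records can be harvested programmatically. The sketch below shows a minimal harvest of record titles; the endpoint path is an assumption (DSpace installations commonly expose OAI-PMH under a path such as /dspace-oai/request), so the real URL should be checked against the repository itself.

```python
# Hedged sketch: harvesting Dublin Core records from the IDL over OAI-PMH.
# The endpoint path is assumed, not confirmed by the paper.
import requests
import xml.etree.ElementTree as ET

OAI = "http://idl-bnc.idrc.ca/dspace-oai/request"   # assumed endpoint
NS = {"oai": "http://www.openarchives.org/OAI/2.0/",
      "dc": "http://purl.org/dc/elements/1.1/"}

def harvest_titles(base_url: str = OAI):
    """Yield dc:title values for all records, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    while True:
        root = ET.fromstring(requests.get(base_url, params=params, timeout=30).content)
        for record in root.iter("{http://www.openarchives.org/OAI/2.0/}record"):
            for title in record.iter("{http://purl.org/dc/elements/1.1/}title"):
                yield title.text
        token = root.find(".//oai:resumptionToken", NS)
        if token is None or not (token.text or "").strip():
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

# for t in harvest_titles():
#     print(t)
```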
4. Audience of the IDRC Digital Library
The primary audience of the repository is the international research community. This includes researchers, applicants for IDRC funding, donor agencies, policy makers, and development practitioners. The repository's purpose is to share the research results of developing world researchers, to facilitate the discovery of research literature in the fields of international development, and to identify researchers, research institutions and civil society organizations that have undertaken research in those fields. The IDL not only enhances the public accountability and transparency of IDRC-funded research, but also demonstrates the Centre's commitment to the global "public good" contribution of the research it supports. It also ensures that the research results will be freely accessible in order to contribute to the public debate on development issues for public benefit. The research literature in the IDL can be accessed
and used for non-commercial purposes in accordance with a definition of open access based on the Budapest Open Access Initiative. The expectation is that the IDL will make a contribution to bridging the South to South and South to North knowledge gap. These channels of scholarly communication and scholarly publishing are less heavily traveled than the North to North and North to South channels [3]. The IDL will serve as a dissemination channel that will improve the visibility, accessibility and research impact of southern research.
5. Focus of the Presentation
This presentation will focus on three aspects of the IDRC Digital Library: IDL content and how it will continue to develop, copyright permission challenges presented by the repository's retrospective content, and IDL services. Evidence of use of the repository will also be discussed.
6. Development of IDRC Digital Library Content
The bulk of the content disseminated by the IDL is in the form of final technical reports. The reports present the research results produced by Centre-funded researchers. They are submitted by funding recipients as a requirement of research funding. The IDL also disseminates books published by the Centre, documents and other writings by staff and IDRC governors for and about IDRC, as well as other substantial works related to the Centre's programs, projects and activities. This second category or collection of content represents about 25% of the digital library's content.

These two collections have, historically, been housed in the Centre's Library and managed in the library catalogue. The Library's catalogue records are the source of the majority of the digital library's metadata. These were mapped and migrated from the Library's online public access catalogue (OPAC) into MIT's DSpace. This kind of undertaking can be perilous, even under the best of circumstances. The library OPAC software, called MINISIS, was home-grown, originally designed to be used by developing world libraries. The non-standard, non-MARC record structure of the MINISIS bibliographic records presented significant challenges. Further, the record content and database structure had changed over time. Migrating this content into DSpace was much like opening a Pandora's box. Countless unanticipated challenges had to be overcome, but after a great deal of problem solving, some metadata field customization and two migration attempts, an IDL database with an acceptable level of integrity has emerged. The IDL now serves as the Centre's public bibliographic database.
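The migration described above essentially amounts to a field-by-field crosswalk from legacy catalogue records to Dublin Core. The following sketch illustrates the shape of such a mapping; the MINISIS-style field names are hypothetical and are not IDRC's actual record schema.

```python
# Illustrative crosswalk from non-MARC legacy fields to qualified Dublin Core,
# of the kind used when migrating catalogue records into DSpace.
FIELD_MAP = {
    "TI": "dc.title",
    "AU": "dc.contributor.author",
    "PY": "dc.date.issued",
    "AB": "dc.description.abstract",
    "KW": "dc.subject",
    "LG": "dc.language.iso",
}

def to_dublin_core(legacy_record: dict) -> list[tuple[str, str]]:
    """Map one legacy record to (Dublin Core element, value) pairs,
    splitting repeatable fields on ';' as a repeatable-value convention."""
    dc = []
    for field, values in legacy_record.items():
        element = FIELD_MAP.get(field)
        if element is None:
            continue  # unmapped legacy fields need manual review
        for value in str(values).split(";"):
            if value.strip():
                dc.append((element, value.strip()))
    return dc

record = {"TI": "Water harvesting in semi-arid regions",
          "AU": "Example, A.; Example, B.", "PY": "1986",
          "KW": "irrigation; policy"}
print(to_dublin_core(record))
```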
7. Submission Process of the IDRC Digital Library
The submission and metadata creation process for the IDL is centralized in the IDRC Library. The Centre's research subject specialists, called program officers, review final technical reports received from research project recipients. Although the review is not a true peer review, it can lead to redrafting of the reports by the funding recipients to ensure that they meet Centre funding requirements. Once finalized, the program officer determines whether the report is eligible for public dissemination in the IDL. For example, reports containing politically sensitive or patentable information are not added to IDL holdings. Further, if a researcher requests that their research results not be placed in the IDL because, for example, they have published or wish to publish with a publisher that does not permit dissemination on an OA platform, the report will similarly not be added to the IDL. A Centre-funded researcher's ability to choose not to have their final technical reports published in the IDL results from the fact that the contractual agreement for research funding has only just been modified to make provisions for OA and digital dissemination of final technical reports. For research projects approved after January 2008, funding recipients will submit outputs to their program officers in digital format and the researchers will
have granted the Centre permission to disseminate their funded research results in the OA IDL. A soon-to-be-released IDRC publishing policy reiterates these new conditions of IDRC funding. The impact of this contractual and policy change is that submission of final technical reports to the IDL will be mandatory. However, the implications of this will not be seen by the IDRC Library for a number of months, because the outputs are received only after the research has been completed.

If a report is destined for the digital library, the program officer determines where it will be placed within the IDL's browsing structure, that is, in which collection of his or her DSpace community. Incidentally, the community and collection structure of the IDL has been created in collaboration with the Centre's programming staff. This approach to the development of the IDL browsing structure is an example of how the Library has attempted to share ownership of the IDL with the Centre's program branch. The researchers and the program officers are asked to provide uncontrolled vocabulary or keywords for the report's IDL metadata. This is being done with the hope that keywords in the metadata recommended by the subject specialists and/or the researchers will help to enhance the retrievability of the digital library's content.

Four pieces of information – an indication that the report is destined for the IDL, the appropriate collection name, the keywords and the report itself – are emailed by the program officer to a records management staff member. These are then placed in the Centre's digital records management system by records staff. This information is transferred manually into the IDL by a library cataloguing technician, who completes the metadata creation and submission process. Additional subject description is added to metadata records using the OECD Macrothesaurus. Automating this process to enable migration of this information from the records management system to the digital library is planned.
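One plausible route for the planned automation is to package each report, its keywords and its collection assignment as a DSpace Simple Archive Format (SAF) item that can be batch-imported rather than re-keyed by hand. The sketch below follows DSpace's documented SAF layout (a dublin_core.xml file plus a contents file per item); the metadata values shown are examples only, and this is an assumption about how the workflow could be automated, not a description of IDRC's implementation.

```python
# Hedged sketch: writing a DSpace Simple Archive Format item from the four
# pieces of submission information described above. Values are illustrative.
from pathlib import Path
from xml.sax.saxutils import escape

def write_saf_item(folder: Path, title: str, keywords: list[str], report: Path):
    """Create one SAF item directory: dublin_core.xml, contents, and the file."""
    folder.mkdir(parents=True, exist_ok=True)
    values = [f'  <dcvalue element="title" qualifier="none">{escape(title)}</dcvalue>']
    values += [f'  <dcvalue element="subject" qualifier="none">{escape(k)}</dcvalue>'
               for k in keywords]
    xml = "<dublin_core>\n" + "\n".join(values) + "\n</dublin_core>\n"
    (folder / "dublin_core.xml").write_text(xml, encoding="utf-8")
    (folder / "contents").write_text(report.name + "\n", encoding="utf-8")
    (folder / report.name).write_bytes(report.read_bytes())

# write_saf_item(Path("import/item_0001"),
#                "Final technical report: example project",
#                ["water management", "Kenya"], Path("report.pdf"))
```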
8. IP Issues and the IDL
Seventy percent of the outputs described by IDL metadata are in paper format and, as mentioned earlier, the copyright of funding-recipient-created research outputs is owned by the researchers. In order to comply with Canadian copyright law, permission must be obtained from the copyright holders before the format of the outputs can be changed from paper to digital and made accessible through the open access digital library. This then leads to the subject of copyright permissions and digitization.

Developing world researchers have encountered and continue to face barriers to the publication of their work. To ensure that IDRC research funding does not further impede efforts to publish, the contract between the Centre and its researchers places copyright ownership of final technical reports with the funding recipient. Obtaining permission to digitize and to place final technical reports in the IDL has been a full-time occupation of a library staff member since the fall of 2006. To date, approximately 450 copyright holders have been contacted and asked to complete and sign a license granting IDRC permission to digitize and place their research results in the IDL. Many of the copyright holders are developing world institutions that hold the copyright of numerous works. The success rate in obtaining permissions is in the 65% range. It has not been difficult to obtain permission to digitize and place Centre-funded outputs in the IDL when it was possible to contact the copyright holder. For the most part, the copyright ownership of the outputs has not been transferred to publishers and, with just a few exceptions, copyright holders were willing to grant permission.

Copyright holders have not made requests for further information about open access. How this should be interpreted is not clear; however, the correspondence requesting permissions has been carefully drafted in an effort to ensure that its intent is not misunderstood. Recipients have asked to be notified when their outputs become accessible in the IDL, in one case because they planned to place their digitized research results on their own web site. The impediments to obtaining permissions can be summarized as follows: the copyright holder is deceased, the research institution that received the research funding no longer exists, it was not possible to identify the
copyright holder, or a reply to correspondence requesting permission simply has not been received. The Library has the capacity to continue to request permissions from copyright holders and to undertake in-house digitization with the objective of expanding the digital content in the IDL. However, not surprisingly, it has been difficult to locate the copyright holders of many older outputs. Regrettably, it is unlikely that the IDL will be able to disseminate the digital full text of all the final technical reports that its metadata describes.
9. IDRC Digital Library Services
This then means that IDL metadata will continue to describe final technical reports that are not delivered digitally by the IDL. In an effort to enable users to access these outputs, the IDRC Library does its best to offer a document delivery service. Users are invited to enquire about options for accessing the research results. IDL users can, of course, visit the IDRC Library in Ottawa, but this is not a practical choice for many researchers. The Centre's contractual agreement with recipients funded after February 2004 enables the Library to digitize an output and make it available on the IDRC web site, but not in the OA IDL. In cases where IDRC cannot obtain permission from copyright holders to disseminate their outputs and the project contract predates February 2004, the Centre may rely, to a limited extent, on the so-called 'fair dealing' exception under Canada's Copyright Act. This exception provides only a very narrow exclusion allowing a library to copy and distribute a portion of a work without infringing copyright in that work. The library must be satisfied that the use of the work will be for research or private study. The law does not set clear limits on what portion of a work may be copied under the fair dealing exception, but what is clear is that copying an entire work would not be permissible. This is not an ideal situation, but the document delivery staff do their best to meet the information needs of requesters.

Another service being explored by the IDL is the hosting of works authored by developing world researchers who are not IDRC funding recipients. A Centre-funded project has developed a research methodology that is being applied by non-Centre-funded researchers in the developing world. The project's lead researcher recognized the value of managing and disseminating the results of this disparate group of researchers. A partnership was established to create a DSpace community that makes the research results OA accessible through the IDL. A service agreement addressing issues such as content review, intellectual property, metadata creation and termination of the collaboration was developed to formalize the partnership. This led to the creation of the Social Analysis Systems2 (SAS2) Community [4] in the IDL. This community not only disseminates developing world researchers' work, but also facilitates the aggregation of a body of knowledge.
10. Integrating the IDRC Digital Library into Other Centre Systems
The IDL has been designed to integrate with a suite of other repositories of information created by IDRC. For example, as described earlier, final technical reports are filed in the records management system. The documents and their skeletal metadata are reused in the digital library. This eliminates the need for submission to both the records system and the IDL. Further, IDL content can be accessed through the Centre's project database, called IDRIS+ [5], and its persistent URLs are widely used in the IDRC web content management system. Integration enables reuse of the IDL's content through other IDRC systems and will help to ensure the long-term funding and survival of the repository.
11. Use of the IDRC Digital Library as a Research Resource
Preliminary data indicate that the IDL is on its way to accomplishing the objective of making a contribution toward bridging the South to South and South to North knowledge gap. The context of this data is as follows. The absence of links to digital full text in 70% of the IDL's metadata records has made it possible to gather some information about who is using the IDL as a research resource. Users are contacting the IDRC Library to enquire about receiving the full text of outputs that are described but not delivered digitally. The majority of the requests are received by email. Although it is not always possible to confirm that the requester originates from the developing world (many developing world researchers are studying and working in developed world institutions and organizations), these requests for full text reveal the following. The total number received between July 2007 and mid April 2008 was 96. The mailing address and/or signature indicate that 53 of the requesters, approximately 55% of the total, were writing from the South. The majority of these came from Africa and Latin America; a smaller number were received from India, Vietnam and Cambodia. The origin of the remaining 45% was, in order of frequency, Canada, the U.S., the UK, and France.

The IDL's server log files are not available for analysis at this time. However, the DSpace application provides a statistical summary that reveals some interesting information about the system's use. Data collection began in November 2007. The IDL has been searched an average of 16,000 times per month, an average of 81,000 items have been viewed each month, and an average of 35,500 bitstreams or digital files have been accessed each month. The words searched by IDL users are also noteworthy. The French, Spanish and English languages are equally well represented among the terms being used. Although the presence of French and English is not surprising, the numerous terms in Spanish may indicate that the IDL has caught the attention of Latin American researchers. Search terms such as reformas, gouvernance, poverty, tecnológicas, rurale and policy, as well as developing world geographic locations, are high on the list of frequently searched words. All of these terms reflect the research areas funded by IDRC. The terms also suggest that there is a strong potential that searchers' information needs are being met by the IDL.
12. Conclusion
By way of conclusion I would like to note that IDRC is the first Canadian research funding organization to build an OAI-PMH compliant institutional repository to disseminate its funded research results. It was the vision of Marjorie Whalen, the IDRC Library director, that led to the creation of an IDRC institutional repository to enhance the dissemination of southern researchers' research results. The experience of the project shows that challenges, some expected, others not, were inevitable but not insurmountable. The collaborative nature of this undertaking has been enriching for all of us at the Centre. But content development of the IDL is far from complete. Obtaining consent from copyright holders to distribute their works through the IDL will remain a high priority for some time. This is consistent with the belief at IDRC that open access will lead to the maximization of the societal benefits of investment in research.

To close, I would like to share the following, which was sent to us by a researcher:

"Thank for your email request for permission to include my works in the IDRC Digital Institutional Repository. I am a firm believer on the universal right of the people of the World to have free access to knowledge. Especially when the knowledge created is a result of communal effort as is the case for all IDRC projects."
13. References
[1] The International Development Research Centre Digital Library / La Bibliothèque numérique du Centre de recherches pour le développement international. URL: http://idl-bnc.idrc.ca
[2] WHALEN, M. IDRC Digital Library Policies and Governance. 2007. URL: http://idl-bnc.idrc.ca/dspace/handle/123456789/35334
[3] CHAN, L.; KIRSOP, B.; ARUNACHALAM, S. Open Access Archiving: the fast track to building research capacity in developing countries. SciDev.Net, February 11, 2005. URL: http://www.scidev.net/en/features/open-access-archiving-the-fast-track-to-building-r.html
[4] Social Analysis Systems2 (SAS2) / Les Systèmes d'analyse sociale2 (SAS2). URL: http://idl-bnc.idrc.ca/dspace/handle/123456789/34882
[5] IDRIS+. URL: http://idris.idrc.ca/app/Search
Keyword and Metadata Extraction from Pre-prints

Emma Tonkin (1); Henk L. Muller (2)
(1) UKOLN, University of Bath, UK; e-mail: e.tonkin@ukoln.ac.uk
(2) University of Bristol, UK; e-mail: henkm@cs.bris.ac.uk
Abstract
In this paper we study how to provide metadata for a pre-print archive. Metadata includes, but is not limited to, title, authors, citations, and keywords, and is used both to present data to the user in a meaningful way, and to index and cross-reference the pre-prints. We are particularly interested in studying different methods to obtain metadata for a pre-print. We have developed a system that automatically extracts metadata, and that allows the user to verify and correct metadata before it is accepted by the system.
Keywords: metadata extraction; Dublin Core; user evaluation; Bayesian classification
1. Introduction
There are two methods for obtaining metadata: the metadata can be mechanically extracted from the pre-print, or we can ask a person (for example the author or a digital librarian) to manually enter the metadata. The former approach, automated metadata generation, has attracted a great deal of attention in recent years, particularly for the role that it is expected to play in reducing the metadata generation bottleneck [1] – that is, the difficulty of producing metadata in a timely manner. Much of this interest arises from prior work in machine-aided indexing, or automated indexing – that is, either software-supported or entirely software-driven indexing approaches. The difference between machine-aided or automated indexing and automated metadata generation or extraction approaches is, as seen by the authors, simply that the metadata is here seen as an end in itself; we aim to emulate well-formed metadata generation, and do not concern ourselves greatly here with the subsequent question – evaluation of the usefulness of this metadata for a given purpose.

Greenberg et al [2] describe two primary approaches to metadata generation, stating that researchers have experimented primarily with document structure and knowledge representation systems. Document structure involves the use of the visual grammar of pages, for example making use of the observation that title, author(s) and affiliation(s) generally appear in content header information. Such metadata can be extracted via various means, for example using support vector machines upon linguistic features [3], a variable hidden Markov model [4], or a heuristic approach [5]. [6] describe an approach that primarily utilizes formatting information such as font size as features, and makes use of the following models: Perceptron with Uneven Margins, Maximum Entropy (ME), Maximum Entropy Markov Model (MEMM), Voted Perceptron Model (VP), and Conditional Random Fields (CRF); they note that an advantage of an approach that primarily makes use of visual features is the ease of generalisation to documents in languages other than English. This approach, however, focuses solely on the problem of extracting the document title.

The relevance of knowledge representation systems for Greenberg et al is the increasing availability of resources that can be useful to the process of metadata generation, or indeed the harvesting of existing
metadata registries; this is primarily of use in post-processing or enhancement, although such knowledge bases additionally provide a useful resource under many circumstances. For example, an authoritative but incomplete author name database can be used firstly for automatic name authority control, and secondly as an excellent basis for training of supervised machine learning systems in detection of fields containing author names. The issue of post-processing is, however, out of the scope of this paper, and will therefore be referred to only briefly. Recent work on the Semantic Web and on classification and knowledge management has focused on the extent to which these methods lead to equivalent or stable results. Whilst the two approaches may have compatible outcomes in terms of the type of metadata output, they depend upon very different underlying mechanisms. Factual metadata such as title and author is usually unambiguous; but other metadata, such as keywords for classification, is of an interpretative nature. User-entered classifications can be seen as based around a set of prototype concepts [7,8]. Mechanically generated classifications are generally built around an identified set of features. The features that are used by the mechanical system are meant to form a basis for making similar judgements to those given by a human, and hence are intended to emulate similar behaviour to the set of concepts recognised by the user; but they are in practice quite different, for they are based around a range of heuristics or learnt statistical measurements rather than a deeper understanding of the information within the data object. Because of this difference, care must be taken to ensure that the judgements are compatible, typically by choosing supervised methods that may be trained and verified against reference data (ground truth).
2. Available metadata
An electronic copy of a document is potentially a rich source of metadata. Some of the metadata is presented in an obvious manner to the reader, for example the title of a document, the number of pages and the authors. Other metadata is less obviously visible. Attributes of the eprint such as format - intrinsic document properties - can be automatically detected with ease [9]. The class of a document - that is, whether it has been peer-reviewed, whether it appeared as a conference paper, article, journal article, technical report or PhD/Masters’ thesis - is often unclear. The theme, subject matter and contributions contained within the document should be visible within the text, for this is after all the rationale behind making the document available at all, but a great deal of domain knowledge may be required to extract such information and recognise it for what it is. We focused on five general structures that can be examined in order to extract metadata:
• The document may have structure imposed on it in its electronic format. For example, from an HTML document one can extract a DOM tree, and find HTML tags such as <TITLE>.
• The document may have a prescribed visual structure. For example, Postscript and PDF specify how text is to be laid out on a page, and this can be used to identify sections of the text.
• The document may be structured following some tradition. For example, it may start with a title, then the authors, and end with a number of references.
• Documents that are interlinked via citation linking or co-authorship analysis may be analysed via bibliometric methods, making available various types of information.
• The document will have linguistic structure that may be accessible. For example, if the document is written in English, the authors may "conclude that xxx .", which gives some meaning to the words between the "conclude" and the full stop.
There exist in practice a huge number of features by which to describe a complex object such as an eprint. Readers effortlessly identify and use relevant subsets and combinations of these on a daily basis, but not all of those features are actually intrinsic to the document or the specific instance of the document (the file).
2.1 Formatting structure
Certain document types contain structural elements with relatively clear or explicit semantics. One of the potential advantages of a language like HTML, which stresses document structure, over a language such as Postscript, which stresses document layout, is that given a document structure it is potentially feasible to mechanically infer the meaning of parts of the document. Indeed, if HTML is used according to modern W3C recommendations, HTML should contain only structural information, with all design information contributed in CSS. This process of divorcing design from content began in the HTML 4.0 specification [10]. Under these circumstances, a large amount of information can potentially be gained by simply inspecting the DOM tree. For example, all headers H1, H2, H3, ... can be extracted and used to build a table of contents of the paper, and to find the titles of sections and subsections. Similarly, the HEAD section can be dissected in order to extract the title of a page, although this may not contain the title of the document. However, given that there are multiple ways in HTML to achieve the same visual effect, the use of the tags given above is not enforced. Many WYSIWYG tools use alternative means to produce a similar visual impression - for example, generating a <P class='header2'> tag rather than an H2 tag. Since the semantics of these alternatives are less clear, this makes extraction of data from HTML pages in practice difficult. A technical report by Bergmark [5] describes the use of XHTML as an intermediate format for the processing of online documents into a structure, but concedes that, firstly, most HTML documents are 'not well-formed and are therefore difficult to parse'; translation of HTML into XHTML resolves a proportion of these difficulties, but many documents cannot be parsed unambiguously into XHTML. A similar approach is proposed by Krause [11]. In this paper we ignore any such markup, and we have focussed on documents that are not presented in a structured language. On examination of Bergmark's metadata extraction algorithm, it seems likely that a robust metadata extraction from XHTML makes relatively little use of formatting information.
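The idea of reading structure directly from the markup can be illustrated with a short sketch. The following Python fragment is our own illustration, not part of paperBase; it assumes reasonably well-formed HTML and, as noted above, it will miss headings produced with presentational workarounds such as <P class='header2'>.

from html.parser import HTMLParser

class StructureExtractor(HTMLParser):
    """Collect the <title> and the H1-H3 headings of an HTML document."""
    def __init__(self):
        super().__init__()
        self._current = None          # tag currently being read
        self._buffer = []
        self.title = ""
        self.headings = []            # (tag, text) pairs, e.g. ("h2", "Related work")

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1", "h2", "h3"):
            self._current, self._buffer = tag, []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            else:
                self.headings.append((tag, text))
            self._current = None

extractor = StructureExtractor()
extractor.feed("<html><head><title>A Sample Pre-print</title></head>"
               "<body><h1>A Sample Pre-print</h1><h2>1. Introduction</h2></body></html>")
print(extractor.title)     # A Sample Pre-print
print(extractor.headings)  # [('h1', 'A Sample Pre-print'), ('h2', '1. Introduction')]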
Figure 1: Visual structure of a scientific paper
2.2 Visual structure
In contrast to HTML, other methods to present documents often prescribe visual structure rather than document structure. For example, both Postscript and PDF specify symbol or word locations on a page, and the document consists of a bag of symbols or words at specific locations. Document structure may be inferred from symbol locations. For example, a group of letters placed close together is likely to be a word, and a group of words placed at the same vertical position on the page may be part of a sentence in a western language. The disadvantage of those page description languages is that there are multiple ways to present text; for example, text can be encoded in fonts with bespoke encodings; the encoding itself has no relation to the characters depicted, and it is the shape of the character which conveys the meaning. In circumstances like this it is very difficult to extract characters or words, but the visual structure itself can still be used to identify sections of a document. For example, Figure 1 shows a (deliberately) pixelated image of the first page of a paper, and even without knowing anything about the particular characters, four sections can be highlighted that almost certainly contain text (red), authors (green), affiliation (yellow) and abstract (blue). Indeed, it turns out that visual structure itself can provide help in extracting sections of an image of, for example, legacy documents that have been scanned in. However, it is virtually impossible to distinguish between author names above the title and author names below the title, if the length of the title and the length of the author block are roughly the same. We have performed some experiments that show that we can extract bitmaps for the title and authors from documents that are otherwise unreadable - 3-6% of documents on average in a sample academic environment [12]. An approximately 80% degree of success is achievable using a simple image segmentation approach. These images, or indeed the entire page, may alternatively be handed to OCR software such as gOCR for translation into text, and the resulting text string processed appropriately. An account of the use of appearance and geometric position of text and image blocks for document analysis and classification of PDF material may be found in Lovegrove and Brailsford [13], and a rather later description of a similar 'spatial knowledge' approach applied to Postscript formatted files is given by Giuffrida et al [14]. In this paper we focus on documents from which we can extract the text as a simple stream of characters.
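The layout-driven segmentation described above can be pictured with a small sketch of our own (not the authors' implementation). It assumes that (x, y, text) word positions have already been recovered from the page description, and simply groups words into lines and lines into blocks by their coordinates; the tolerances are arbitrary illustrative values.

def group_into_blocks(words, line_tol=2.0, gap_tol=14.0):
    """Group (x, y, text) word boxes into lines, then lines into blocks.

    Words whose vertical positions differ by less than line_tol are treated
    as one line; consecutive lines separated by more than gap_tol are assumed
    to start a new block (title block, author block, abstract, and so on).
    """
    lines = {}
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        key = round(y / line_tol)
        lines.setdefault(key, []).append((x, text))

    ordered = [(k * line_tol, " ".join(t for _, t in sorted(ws)))
               for k, ws in sorted(lines.items())]

    blocks, current, prev_y = [], [], None
    for y, line in ordered:
        if prev_y is not None and y - prev_y > gap_tol:
            blocks.append(current)
            current = []
        current.append(line)
        prev_y = y
    if current:
        blocks.append(current)
    return blocks

Even such a crude grouping yields candidate regions that can be ranked or handed to OCR, but, as noted above, it cannot by itself tell an author block above the title from one below it.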
2.3 Document structure
From both structured description languages (such as HTML) and page description languages (such as PDF) we can usually extract the text of the document. The text itself can be analysed to identify metadata. In particular, author names usually stand out, and so do affiliations, and even the title and journal details. The information that can be extracted from the document structure includes:
1. Title
2. Authors
3. Affiliation
4. Email
5. URL
6. Abstract
7. Section headings (table of contents)
8. Citations
9. References
10. Figure and table captions, e.g. [15]
11. Acknowledgments [16]
Extracting these purely from the document structure is difficult, but together with knowledge about words likely to be found in, for example, author names or titles, the extraction is feasible. A detailed discussion of the methods that we use can be found later in this paper.
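A rough, hypothetical illustration of such word-knowledge-assisted heuristics is sketched below; author_words stands in for the kind of name lexicon mentioned above, and none of the names or rules come from paperBase itself.

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def first_page_fields(text, author_words=None):
    """Crude heuristics over the plain text of a first page: the title is
    guessed as the first non-empty line, the authors as the next line if it
    contains words from a known author-name list, and e-mail addresses are
    collected from anywhere on the page."""
    author_words = author_words or set()
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    fields = {"title": lines[0] if lines else "",
              "authors": "",
              "emails": EMAIL.findall(text)}
    if len(lines) > 1:
        tokens = set(re.findall(r"[A-Za-z]+", lines[1]))
        if tokens & author_words:
            fields["authors"] = lines[1]
    return fields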
2.4 Bibliographic citation analysis
There exists widespread enthusiasm for bibliometrics as an area, which depends heavily on citation analysis as an underlying technology. Some form of citation extraction is a prerequisite for this. As a consequence, a number of methods have been identified for this approach, making use of various degrees of automation. Harnad and Carr [17] describe the use of tools from the Open Journal Project and Cogprints that can, given well-formed and correctly specified bibliographic citations, extract and convert citations from HTML and PDF. Citation linking is of interest to many as a result of the potential of this data in analysis of impact and, arguably, value of scientific papers, but other uses of the information exist, in particular in the area of interface design and support for information-seeking practices. The nature and level of interlinking between documents is a rich source of information about the relations between them. For example, a high rate of co-citation may suggest that the subject area or theme is very similar. In this instance, we extracted citations via our software; these could potentially be used for various purposes. For example, Hoche and Flach [18] investigated the use of co-authorship information to predict the topic of scientific papers. The harvesting of acknowledgements has also been suggested as a measure of an individual's academic impact [16], but may also carry thematic information as well as information on a social-networking level that could potentially be useful for measuring points such as conflict of interest. Along with content classification, this constitutes part of a toolkit for 'similarity search' [9].
2.5 Linguistic structure
Finally, the document can be analysed linguistically, inferring the meaning of parts of sentences, or relationships between metadata. For example, citations in the main text may be contained within the same sentence, indicating that the two citations are likely to be related in some way. The relation may be a positive relationship or a negative relationship, depending on the text around it: "In contrast to work by Jones (1998), work by Thomas (1999)...". Analysing linguistic structure depends on knowledge of the document language, and possibly on domain knowledge. Using linguistic analysis one can attempt to extract:
1. keywords
2. relations between citations
Other than Bayesian statistics across term appearance, we do not use explicit linguistic information in the work presented below, but instead focus on the document structure, guided by simple probabilistic information.
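The phrase "Bayesian statistics across term appearance" can be pictured with a minimal naive Bayes keyword suggester. This is only an illustrative sketch under the usual bag-of-words and term-independence assumptions, not the classifier used in paperBase.

import math
from collections import Counter, defaultdict

class KeywordSuggester:
    """Multinomial naive Bayes over the words of a title and abstract."""
    def __init__(self):
        self.word_counts = defaultdict(Counter)   # keyword -> word frequencies
        self.doc_counts = Counter()               # keyword -> training documents
        self.vocab = set()

    def train(self, words, keywords):
        for kw in keywords:
            self.doc_counts[kw] += 1
            self.word_counts[kw].update(words)
        self.vocab.update(words)

    def suggest(self, words, top_n=5):
        total = sum(self.doc_counts.values())
        scores = {}
        for kw, docs in self.doc_counts.items():
            counts = self.word_counts[kw]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(docs / total)
            for w in words:
                score += math.log((counts[w] + 1) / denom)   # Laplace smoothing
            scores[kw] = score
        return sorted(scores, key=scores.get, reverse=True)[:top_n]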
3. Uncertainty and metadata
Potential discrepancies between mechanically generated metadata and user-generated metadata may not
be a big problem, because there is also considerable variation in metadata generated by users. There are three principal sources of variation in metadata as generated by humans: typographic errors, different interpretations of the document, and different interpretations of the metadata descriptions. Below we give a description of those three, and a discussion of the consequences of metadata uncertainty.
3.1 Differences in document interpretation
Differences in document interpretation come to light in, for example, the consistency of classifying preprints using keywords. Neither humans nor computers can index with 100% accuracy. If the same article is indexed by each author and a librarian in turn, then they will probably suggest different indexing terms, stemming from different interpretations of the work, the background of the person, knowledge about classifications, and in-depth knowledge of the subject matter. Indexing consistency is a well-known problem of interest to researchers in the domain of information science [19]. Indeed, it is doubtful that there is a "gold standard" classification, for even the author of the article may not agree with appropriate classification keywords. Differing interpretations of the work undoubtedly exist; for example, censorship is generally seen as a primary theme of Bradbury's classic work, Fahrenheit 451, an interpretation that the author does not accept. Furthermore, the relevance of a document changes over time, and may not coincide with the author's intention; as this occurs, the keywords associated with a document change over time too. This suggests that either keywords have to be kept up to date, or the interpretation of keywords must depend on the context in which those keywords were assigned.
3.2 Typographic errors in metadata
A common failure mode for a human entering metadata is typographic error. The frequency of typographic errors depends on the system interface, feedback, user profile and the type of metadata. High-grade metadata that is entered by professionals who are paid to, say, index scientific works contains very few errors. But low-grade metadata, entered for example by on-line users, may contain a significant number of errors. An upper bound for this value on the tagging system Panoramio was less than 10%, with other tagging systems showing far higher numbers. These errors are not limited to incorrect spellings, but include cases where, when the metadata value is selected using a drop-down menu, the user may select the keyword "above" or "below" the intended keyword, or where spell checkers have "corrected" a typographic error and have, for example, replaced recking with racking (rather than wrecking). The latter can be a big problem for people who write documents in a non-native language. In citing other authors, errors in orthography are common, stemming from typographical error, misreading, cultural misunderstanding (such as the inversion of first and last names), as well as from other sources such as issues with citation management software or, indeed, error propagated from replicating prior miscitation of the document. An overview and typology of features found in online orthography can be found in Tavosanis [20]. Automatically generated metadata does not contain any typos, other than those copied from the original document and those introduced during the extraction process. However, computer-generated metadata is subject to a different failure mode. In the simplest case, an incorrect keyword is suggested because it appears appropriate on the basis of the features, but turns out to be one that is inappropriate to a human who understands that identical words may refer to different concepts.
3.3 Different interpretations of metadata schemas
A third common variation in metadata is due to the interpretation of metadata schemas. This expresses itself commonly in the way in which author names are interpreted. Different parts of names have different meanings, and in some cultures the first part of the name may be a family name, whereas in other cultures the first name may be a given name, and there are languages where there are "middle parts" that are part of the surname. It is virtually impossible to design a metadata scheme that both allows all names to be stored in a single canonical format, and that at the same time is unambiguous and easy to use for authors from all different cultural backgrounds. One strategy around this is to treat author names as opaque strings of characters that warrant no interpretation. These are difficult to match because authors are frequently inconsistent in providing their names, preferring perhaps in certain cases to provide middle initials and in others to give only an initial of one of their given names. Indeed, such variation is often used consciously by authors to separate their publications in one field from those in another. This strategy may even be deliberately applied to "fool" automatic indexing [20]. Even where authors are consistent, errors in data extraction or journal style guide clashes may cause errors in author name extraction. For example, some article styles require "first" names in citations to be abbreviated to a single letter.
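The matching difficulty can be made concrete with a toy canonicalisation routine. The sketch below assumes a Western given-names-first order, which is exactly the assumption that the paragraph above warns is unsafe in general; the keys it produces merely resemble the canonical attribute used in the example record in section 4.2, and are not taken from paperBase.

def canonical_name(name):
    """Collapse a display name such as 'Henk L. Muller' to 'Muller HL'."""
    parts = [p.strip(".,") for p in name.split() if p.strip(".,")]
    if not parts:
        return ""
    surname, given = parts[-1], parts[:-1]
    return (surname + " " + "".join(p[0].upper() for p in given)).strip()

print(canonical_name("Henk L. Muller"))  # Muller HL
print(canonical_name("H. Muller"))       # Muller H - the key changes when initials are dropped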
3.4 Propagation of errors
In the general case, we consider metadata generation to be an inherently uncertain operation. This implies that metadata should not necessarily be seen as a discrete set of values, but might better be represented as a probability distribution [21,22]. Representing the metadata as a distribution gives us the opportunity to communicate the uncertainty in the suggested metadata to the user. For example, we can select a number of possible keywords based on features of a publication, and communicate which of those keywords are more probable than others. Once errors in metadata exist, they propagate, reinforce similar errors on future pre-prints, introduce seemingly unrelated extra errors, and obfuscate the data presented to the user. Firstly, a system will normally use previous classifications in order to classify future papers. In our system, paperBase, author names, title, abstract, and classification of previous pre-prints are used to predict the classification of new pre-prints. Once a pre-print has been misclassified, future papers may be misclassified in a similar manner. Secondly, a system typically uses the metadata found in pre-prints in order to establish connections between pre-prints. Connections can be made because two pre-prints are written by an author with the same name, because they cite each other, or because they cover a similar subject matter according to the keywords. Those connections can be used to, for example, disambiguate author identities. A missing link or an extraneous link would make the process of reasoning about clusters of related papers increasingly difficult. Thirdly, the answers to search queries are diluted when errors are introduced. Cascading errors cause a disproportionate dilution of search results. This is also true of user-contributed systems in which users may infer the use of classification terms through examining available exemplars. When machine-generated classifications are provided, they are generally represented as unitary facts; either a document may be described via a keyword, or it may not. Consider the following example of a machine-generated classification:
Figure 2: Candidate keywords with associated probabilities
In this case, a document is considered almost certain to be about "Computer Architecture" or "Parallel Processing", and to have a diminishing likelihood of being classifiable as about "Machine Learning" or any of the other terms. In general, a threshold is placed, or the top classification accepted by default, when the result is presented, but it is this distribution that describes the paper with respect to others. The shape of this distribution is very relevant in establishing the nature and relevance of the classification. There may be no clear winner if there are many keywords with similar probability, and then our confidence in the clarity of the results may be shaken in the absence of human evaluation of that judgement. In the case of classifications, many options may be acceptable, but this is less the case in other situations where uncertainty exists. Consider the following citation parses taken from a sample paper (bold text denotes the title and italic text denotes the author):
• Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE
• Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE
• ...
• Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE
• Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE
The likelihood for the correct parse is much higher than the likelihood for all other parses. Unlike the prior example of a classification, only one of these parses can be valid. Whilst it is the most likely, we do not have total confidence in this, but we are able to generate a probability of its accuracy (our level of confidence, a value between 0 and 1). Hence, it is possible to provide some guidance as to the validity of this datum as a "fact" about the document. The danger of reasoning over data in which we, or the system, have low confidence, is the risk of propagating errors. If we retain a Bayesian viewpoint, we may calculate any further conclusions on the basis of existing probabilities via Bayesian inference. If, however, we treat a probability as a fact and make
inferences over inaccurate data without regard to degree of confidence, the result may be the production of hypotheses over which we have very little confidence indeed. As a consequence, an extension of DC metadata to include estimates of confidence, as described in [23], is useful; so, in the case of classification, would be an estimate of the number of classifications considered "plausible" - the breadth or range of likely classifications - which could also be described in terms of variation or level of consistency in judgement, a similar value to that which might be generated in any other situation in which generated or contributed classifications may be treated as "votes", such as collaborative tagging systems. If the nature and extent of the error are known, further functions that employ these values may apply this information to estimate the accuracy of the result or that of derivative functions. We note that for certain types of metadata, this problem is well investigated. For example, author name disambiguation has received a great deal of interest in recent years, e.g. Han et al [24,25].
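As a small illustration of why confidence should be carried through rather than discarded, consider combining several extracted values under a (naive) independence assumption; the figures below are invented for illustration only.

def combine_independent(confidences):
    """Confidence that a chain of extracted facts is jointly correct,
    assuming the individual estimates are independent."""
    joint = 1.0
    for c in confidences:
        joint *= c
    return joint

# An author parse at 0.90, a title parse at 0.86 and a subject classification
# at 0.83: a conclusion drawn from all three should carry roughly 0.64
# confidence, not the 1.0 implied by treating each value as a plain fact.
print(round(combine_independent([0.90, 0.86, 0.83]), 2))  # 0.64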
4. Prototype
We developed a system, known as paperBase, for the automated extraction of metadata from pre-print papers. The extractor makes use of the structure that is inherent to scientific papers, and of Bayesian classifiers, in order to identify the metadata. We have captured the structure of scientific documents in a probabilistic grammar that produces most known forms of papers. More details on this grammar are given in Tonkin and Muller [12]. The grammar is used to parse the text of a paper, and this produces a collection of metadata with associated probabilities. The parser takes the path through the grammar that results in maximal probabilities for authors, title, affiliation and email addresses. The individual probabilities can then be used later on to decide how to use the metadata. We extended DC with appropriate attributes for the encoding of those confidence measures, so that, for example, a user interface might visually encode the confidence and highlight fields that are likely to contain errors.
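The selection of the maximal-probability path can be shown, in much simplified form, as ranking alternative segmentations and keeping the best. The sketch below is not the grammar described in [12]; it assumes only that each candidate parse carries per-field probabilities.

import math

def best_parse(candidate_parses):
    """Pick the candidate segmentation with the highest joint probability.

    Each candidate is a list of (field, probability) pairs, for example
    [('title', 0.92), ('authors', 0.81), ('affiliation', 0.77)].
    Log-probabilities are summed to avoid underflow on long documents."""
    def score(parse):
        return sum(math.log(p) for _, p in parse)
    best = max(candidate_parses, key=score)
    return best, math.exp(score(best))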
4.1 Visual interface
The interface displays the metadata in a tabbed form, one tab for each type of pre-print. The extracted metadata such as author names, title, journal name, and suggested keywords are displayed in the tab. The uncertainty that is assigned to each of the suggested keywords is shown by ordering the keywords based on the certainty, and by using a graded colour-coding to indicate probable keywords, providing clear and consistent interface semantics. The keywords are shown in a list with a scroll-bar, with the five most likely keywords visible.
4.2 Extension to Dublin Core
The metadata extracted, or in some cases generated, from the document object may be retrieved as an XML document via the paperBase API. The DC metadata itself is encoded into XML using the DC XML guidelines [26] as a basis. Additional terms, including confidence values (probabilities of accuracy) where appropriate for this interface, were included in this document. A fragmentary example of an Open Archives Initiative/Dublin Core XML record as generated by paperBase is below:
<oai_dc:dc>
  <dc:type>e-print</dc:type>
  <dc:title>An Evaluation Study of a Link-Based Data Diffusion Machine</dc:title>
  <dc:creator canonical='Muller HL'>Henk L. Muller</dc:creator>
  <dc:creator canonical='Stallard PWA'>Paul Stallard</dc:creator>
  <dc:creator canonical='Warren DHD'>David HD Warren</dc:creator>
  <dc:description> .... Abstract deleted...</dc:description>
  <dc:subject probability='834'>Computer Architecture</dc:subject>
  <dc:subject probability='827'>Parallel Processing</dc:subject>
  <dc:subject probability='183'>Machine Learning</dc:subject>
  <dc:subject probability='176'>Computer Vision</dc:subject>
  <dc:subject probability='156'>Mobile Software</dc:subject>
  ... More keywords ...
</oai_dc:dc>
All keywords are given with a number indicating a calculated probability that the keyphrase is applicable to this document. In this instance, the top two keywords, Computer Architecture and Parallel Processing, are good choices, with a high probability (the maximum value is 1000). The next three are less likely, and are, indeed, inappropriate. The probabilities given are not normalised into confidence values; at this time, there exists no consensus on how confidence values should best be encoded. Therefore, the structure of this record may well change in future.
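A client consuming such a record might, for example, keep only the subjects above some probability threshold. The following sketch assumes a complete record in which the oai_dc and dc namespaces are declared (the fragment above omits the declarations); the threshold is an arbitrary illustrative value.

import xml.etree.ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"

def subjects_above(record_xml, threshold=500):
    """Return (keyword, probability) pairs above a threshold, taken from a
    paperBase-style record; probabilities are on the 0-1000 scale used above."""
    root = ET.fromstring(record_xml)
    picked = [(s.text, int(s.get("probability", "0")))
              for s in root.iter(DC_NS + "subject")
              if int(s.get("probability", "0")) >= threshold]
    return sorted(picked, key=lambda kp: kp[1], reverse=True)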
4.3 Deployment Workflow
As a first trial, we have integrated the system into the institutional repository that stores papers written by members of the Department of Computer Science at the University of Bristol. We adapted the workflow so that authors first have to upload an electronic version of the paper, prior to providing any metadata. When the paper is uploaded, the user is presented with a form in which the user can enter the metadata for that publication.
4.4 Technical details
The extracted data is provided to the end user via a web service. The service is engineered to use web standards common in the Web 2.0 environment, including REST, Dublin Core and XML. The client interface for the user's web browser makes use of ECMA JavaScript and XML (AJAX) to retrieve the analysed data and place it into a web form. The webserver has a dedicated thread that interprets metadata. This thread decodes Postscript and PDF files, and extracts text from those using the public domain PDFbox Java library (www.pdfbox.org). When the text is extracted it is interpreted in a probabilistic grammar, and the results are stored in a database. Various web services make use of this database, including an independent browse interface along the lines of CiteSeer, and a machine-to-machine REST interface that is used to support AJAX applications requiring document metadata. Others, such as an OAI harvesting interface, can be built against the same database backend - however, as mentioned above (in "Propagation of errors") it is useful for client services to be aware of the data origin and constraints on its use. An AJAX application embedded into the repository's web interface polls the webserver for metadata, and fills the form in when metadata becomes available. Typically, metadata is available within a few seconds
of submitting the form. The form will then be filled in asynchronously when the web server has extracted the data. One might regard a synchronous implementation as ideal, where the form comes back when the file has been uploaded. However, since we only have limited computational resources on the web server, and it may take a few seconds for a paper to be completely analysed, we must queue all papers on the server and deal with them one at a time, in order to control congestion. The way in which the queue is handled can be optimised to limit the impact of likely causes of congestion, such as a batch file upload (a usage pattern supported by the service's own internal interface). As a result, users may have to wait for a few seconds before their form is filled in with the relevant information - however, we think this is beneficial because the user can use this time to familiarise themselves with the form. Providing and filling the form as two asynchronous steps is preferable to a user looking at a spinning hour-glass and then being taken to a filled-in form. In addition, our system fails gracefully, in that if the decoder service is not working for whatever reason, the form will simply stay with all fields blank and report that no metadata could be extracted. The accessibility of the resulting software represented a primary concern. As such, care has been taken to ensure that the system functions across multiple browser platforms, including IE, Firefox, Safari, Opera and other Gecko- or KHTML-based browsers such as Konqueror. Cross-browser compatibility is, however, a moving target; hence it is expected that this will impose a small ongoing maintenance cost. The non-availability of JavaScript simply means that the user must fill in each field manually, as was the case before this service became available. One further accessibility concern for us was the way in which screen readers and similar assistive technologies reacted to the dynamic content placement. The dynamic content proved not to be an issue in practical use; however, the presence of a (non-AJAX) "SELECT MULTI" element fell foul of a known showstopper bug in the screen reader, which meant that we could not complete the evaluation. It seems that fully supporting screen readers would involve at least the level of customisation and maintenance required for cross-browser compatibility, and furthermore this requires additional investment in developing or procuring a software base for testing purposes.
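The congestion-control arrangement described above - a single extraction worker serving a queue whose results the web form later polls - can be outlined as follows. This is an illustrative sketch only; extract_metadata is a placeholder for the real parser, and the production service is built around a Java web server rather than in Python.

import queue
import threading

def extract_metadata(text):
    # placeholder for the real probabilistic parser
    return {"title": text.splitlines()[0] if text else ""}

jobs = queue.Queue()   # uploaded documents waiting to be analysed
results = {}           # document id -> extracted metadata, polled by the form

def worker():
    """Analyse one document at a time so a batch upload cannot saturate the server."""
    while True:
        doc_id, text = jobs.get()
        try:
            results[doc_id] = extract_metadata(text)
        finally:
            jobs.task_done()

threading.Thread(target=worker, daemon=True).start()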
5. Evaluation
We have performed two trials of our system. In one trial we rebuilt our entire repository, logging suggested keywords together with keywords that were assigned by the author. We show that 80% of the keywords selected by the authors are in the top-five list of keywords. This is a conservative figure, since we expect that some authors would not have picked the other 20% of the keywords if they weren't suggested - see also our discussion earlier on the reliability of human indexing. Throughout the development process, sets of informal think-aloud trials were conducted that resulted in user feedback; applicable results were included in later phases of the iterative design process. We then performed a more formal evaluation study on 12 subjects, presenting them with a set of six papers to be entered into a repository. The participants were presented with a form to enter the data, and sometimes this form would be pre-filled with automatically extracted metadata. Half the participants had their first three papers entered without assistance, and had automatically extracted metadata for the last three papers. The other half of the participants were presented with automatic data for the first three papers, and had no assistance for the latter three papers. The papers were selected in order to cause maximum trouble for the metadata extractor; in particular, an author name would be extracted twice on one paper, there would be one missing author on a second paper, and other papers would have missing ligatures, or mathematical formulas in the abstract.
In order to quantify what a true user would see, we have manually judged the quality of title and author extraction on 186 papers. We found that 8% of the titles were completely wrong, and 8% were not completely correct, with the remaining 86% of the titles being right. For the experiment above, this would mean that a participant might have seen one bad title, with a probability of less than 50%. Three bad titles in a row has a probability of 0.4%. For the authors, 13% were wrong; of the remaining 87%, 32% included the right authors but had extraneous text that was misconstrued as authors. Our sample was not sufficiently randomised and had many papers by a Slovenian author with a diacritical mark in both surname and first name, which skewed our statistics. In addition, another author's affiliation was at the "George Kohler Allee", which was misrecognised as an author name. A detailed analysis of the quantitative results of that study is published in another paper [12]; in short, it was found that the assistive effect generally caused participants to take less time overall in depositing papers. Here, we report on the qualitative feedback that the participants provided. At the end of the trial participants were given a form with four questions, asking which system they preferred, whether they thought that system was faster, whether they thought that system was more accurate, and an "any other comments" box. A most interesting observation was that the participants were divided on the question of whether manually entered data had fewer errors. Many participants had spotted errors in the automatically extracted data, and had corrected them, and had concluded that the manual data must have been more accurate - however, analysing the errors it turns out that manual data contains more errors. The reasons for this are two-fold. First, there are people who take manual entry literally: they type the title in again (rather than using a copy-and-paste feature). Typing is an error-prone process. Second, people who use the copy-and-paste feature seem to assume that this is by definition error-free - hardly any of the participants spotted that, when they had used copy-and-paste, ligatures had gone missing during the process, or that hyphenation had been introduced because a word had been broken across two lines in the abstract. Instead, participants accepted copy-and-paste as a ground truth, and corrected the errors only when the copy-and-paste had been performed by the metadata extractor. We postulate that people have a limited amount of time to perform tasks such as entering publication data, and that they either spend it on manual entry, or on correcting automated entry - the latter leads to more accurate results. There is also a possibility that this is related to the "proofreader blindness" effect - it is known to be more difficult to proofread one's own work than work by others in one's own domain [27]. It is possible that the same effect plays a role in this instance. Many of the comments that were passed on using the last open question related to features that people would like to see in the system; in particular, we requested a month and a year, and many participants rightly complained that they had to give a month, even if they didn't know it. A number of comments gave qualitative feedback on the use of paperBase. A telling comment was: "Just adding a few fields makes the task of adding your publication much less boring and time consuming.
I'd prefer to see it try and occasionally fail as opposed to it be removed because it occasionally failed." This has been observed in other studies - users are aware that the task they are doing should be done automatically, and they appreciate any help that a system will give them [28]. Another user commented: "I particularly liked the ordered keywords." The suggested keywords are either ordered alphabetically, or in some order of likelihood. The latter works really well if the right keywords are somewhere in the top-5; if they are not in the top-5 they are very hard to find, because the ordering is related to the perception of the extraction algorithm, and no longer related to the user entering the publication. Even though people liked it in general, we should have an option to sort the
keywords alphabetically (or have an assisted keyword search) for situations in which the algorithm fails. One very interesting comment read: "Abstracts are a nuisance; I would remove those from the database." Indeed, this participant had blanked out all abstracts - they had not entered any abstract manually and had erased the automatically extracted abstracts. We postulate that they are only a nuisance because of the work involved in entering abstracts - from a search and user interface perspective abstracts are highly valuable and should be available. We think that automated extraction will aid in making metadata more complete - as long as people do not delete valuable information wholesale.
6. Conclusion
Semi-automatic metadata entry offers many advantages. From the limited study that we performed, we observed increased accuracy, faster entry time, and, most importantly, buy-in from the participants, who unambiguously preferred the semi-automated entry system. The evaluation that we performed is limited in that we studied only a single domain (computer science papers), with participants who were very computer literate (postgraduates in computer science), and with only a small number of participants. In future evaluations we would like to include different domains. The current version of the interface only uses a small amount of the data that could be used. In particular, we do not yet use links between papers (as found in the form of citations) to, for example, disambiguate author identities. The number of file formats supported by the system could be increased, and methods found for the user to correct other metadata, such as citations, which are also extracted by the system. Equally, the provision and use of error margins may have some promise in providing cross-site, hybrid search operating across a number of resource and metadata types. One feature of interest within the study results is the reminder that the quality of metadata, whether semi-automated or not, depends on the level of interest of the participant. Individuals who simply do not see the point of providing a given metadata element will at best put little thought into the process, and at worst will actively remove elements they regard as extraneous, despite the best efforts of an automated metadata extraction service. The ultimate arbiter in any system that is not fully automated is the individual contributor, despite any scaffolding that the system may provide, and any mismatch between the contributor's needs and the aims of the system designer should be identified and allowed for in design and development.
7. Notes and References
[1] Liddy, E. D., Sutton, S., Paik, W., Allen, E., Harwell, S., Monsour, M., Turner, A. and Liddy, J. Breaking the metadata generation bottleneck: preliminary findings. Proceedings of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries, Roanoke, Virginia, United States, 2001. p. 464.
[2] Greenberg, J., Spurgin, K. and Crystal, A. Functionalities for automatic metadata generation applications: a survey of metadata experts' opinions. Int. J. Metadata, Semantics and Ontologies, Vol. 1, No. 1, 2006.
[3] Han, H., Giles, C. L., Manavoglu, E. and Zha, H. Automatic Document Metadata Extraction using Support Vector Machines. Proceedings of the Third ACM/IEEE-CS Joint Conference on Digital Libraries, ACM Press, New York, 2003. pp. 37-48.
[4] Takasu, A. Bibliographic attribute extraction from erroneous references based on a statistical model. Proceedings of the Third ACM/IEEE-CS Joint Conference on Digital Libraries, ACM Press, New York, 2003. pp. 49-60.
[5] Bergmark, D. Automatic Extraction of Reference Linking Information from Online Documents. CSTR 2000-1821, Cornell Digital Library Research Group, November 2000.
[6] Hu, Yunhua, Li, Hang, Cao, Yunbo, Teng, Li, Meyerzon, Dmitriy and Zheng, Qinghua. Automatic extraction of titles from general documents using machine learning. Information Processing & Management, Vol. 42, Issue 5, September 2006. pp. 1276-1293.
[7] Rosch, E. Natural categories. Cognitive Psychology 4, 1973. pp. 328-350.
[8] Labov, W. The boundaries of words and their meanings. In C.-J. N. Bailey and R. W. Shuy (Eds.), New Ways of Analysing Variation in English. Washington: Georgetown University Press, 1973. pp. 340-373.
[9] Olivié, H., Cardinaels, K. and Duval, E. Issues in Automatic Learning Object Indexation. In P. Barker and S. Rebelsky (Eds.), Proceedings of World Conference on Educational Multimedia, Hypermedia and Telecommunications 2002, Chesapeake, VA: AACE, 2002. pp. 239-240.
[10] Austin, Daniel, Peruvemba, Subramanian, McCarron, Shane, Ishikawa, Masayasu and Birbeck, Mark. XHTML™ Modularization 1.1, W3C Working Draft, 2006. Retrieved April 30th, 2008, from http://www.w3.org/TR/xhtml-modularization/xhtml-modularization.html
[11] Krause, J. and Marx, J. Vocabulary Switching and Automatic Metadata Extraction or How to Get Useful Information from a Digital Library. Proceedings of the First DELOS Network of Excellence Workshop on "Information Seeking, Searching and Querying in Digital Libraries", Zurich, Switzerland, 2000.
[12] Tonkin, E. and Muller, H. L. Semi Automated Metadata Extraction for Preprints Archives. Proceedings of the Eighth ACM/IEEE Joint Conference on Digital Libraries, ACM Press, New York, 2008.
[13] Lovegrove, W. S. and Brailsford, D. F. Document analysis of PDF files: methods, results and implications. Electronic Publishing, Vol. 8(2&3), June and September 1995. pp. 207-220.
[14] Giuffrida, G., Shek, E. C. and Yang, J. Knowledge-based metadata extraction from PostScript files. DL '00: Proceedings of the Fifth ACM Conference on Digital Libraries, ACM, New York, 2000. pp. 77-84. DOI: http://doi.acm.org/10.1145/336597.336639
[15] Liu, Y., Mitra, P., Giles, C. L. and Bai, K. Automatic extraction of table metadata from digital documents. Proceedings of the Sixth ACM/IEEE-CS Joint Conference on Digital Libraries, 2006. pp. 339-340.
[16] Giles, C. L. and Councill, I. G. Who gets acknowledged: Measuring scientific contributions through automatic acknowledgment indexing. PNAS, Vol. 101, No. 51, 2004. pp. 17599-17604.
[17] Harnad, S. and Carr, L. Integrating, navigating and analysing open Eprint archives through open citation linking (the OpCit project). Current Science, 79(5), 2000. pp. 629-638.
[18] Hoche, S. and Flach, P. Predicting Topics of Scientific Papers from Co-Authorship Graphs: a Case Study. Proceedings of the 2006 UK Workshop on Computational Intelligence (UKCI 2006), September 2006. pp. 215-222.
[19] Olson, H. and Wolfram, D. Indexing Consistency and its Implications for Information Architecture: A Pilot Study. IA Summit 2006.
[20] Tavosanis, M. A causal classification of orthography errors in web texts. Proceedings of AND 2007.
[21] van Rijsbergen, C. J. The Geometry of Information Retrieval. Cambridge University Press, 2004.
[22] Widdows, D. Geometry and Meaning. (CSLI-LN) Center for the Study of Language and Information, 2004.
[23] Cardinaels, Kris, Duval, Erik and Olivié, Henk J. A Formal Model of Learning Object Metadata. EC-TEL 2006. pp. 74-87.
[24] Han, Hui, Giles, C. Lee, Zha, Hongyuan, Li, Cheng and Tsioutsiouliklis, Kostas. Two supervised learning approaches for name disambiguation in author citations. Proceedings of the Fourth ACM/IEEE Joint Conference on Digital Libraries, ACM Press, New York, 2004. pp. 296-305.
[25] Han, Hui, Zha, Hongyuan and Giles, C. Lee. Name disambiguation in author citations using a K-way spectral clustering method. Proceedings of JCDL 2005, ACM Press, New York, 2005. pp. 334-343.
[26] Powell, A. and Johnston, P. Guidelines for implementing Dublin Core in XML. DCMI Recommendation, April 2003. http://dublincore.org/documents/dc-xml-guidelines/
[27] Daneman, Meredyth and Stainton, Murray. The generation effect in reading and proofreading. Reading and Writing, Vol. 5, No. 3, 1993. pp. 297-313. DOI: 10.1007/BF01027393
[28] Berry, Michael W. and Browne, Murray. Understanding search engines: mathematical modeling and text retrieval. SIAM, 2005.
The MPEG Query Format, a New Standard For Querying Digital Content. Usage in Scholarly Literature Search and Retrieval
Ruben Tous; Jaime Delgado
Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Mòdul D6, Campus Nord, C/ Jordi Girona 1-3, E-08034 Barcelona, Spain
e-mail: rtous@ac.upc.edu; jaime.delgado@ac.upc.edu
Abstract
The initiative of standardization of the MPEG Query Format (MPQF) has refueled the research around the definition of a unified query language for digital content. The goal is to provide a standardized interface to multimedia document repositories, including but not limited to multimedia databases, documental databases, digital libraries, spatio-temporal databases and geographical information systems. The initiative is being led by MPEG (i.e. ISO/IEC JTC1/SC29/WG11). This paper presents MPQF as a new approach for retrieving multimedia document instances from very large document databases, and its particular application to scholarly literature search and retrieval. The paper also explores how MPQF can be used in combination with the Open Archives Initiative (OAI) to deploy advanced distributed search and retrieval services. Finally, the issue of rights preservation is discussed.
Keywords: scholarly literature; search framework; query format; MPQF; Open Archives Initiative; MPEG
1. Introduction
In recent years, the technologies enabling search and retrieval of multimedia digital contents have gained importance due to the large amount of digitally stored multimedia documents. Therefore, members of the MPEG standardization committee (i.e. ISO/IEC JTC1/SC29/WG11) have developed a new standard, the MPEG Query Format (MPQF) [1, 2, 3], which provides a standardized interface to multimedia document repositories, including but not limited to multimedia databases, documental databases, digital libraries, spatio-temporal databases and geographical information systems. The MPEG Query Format offers a new and powerful alternative to the traditional scholarly communication model. MPQF provides scholarly repositories with the ability to extend access to their metadata and contents via a standard query interface, in the same way as Z39.50 [4], but making use of the newest XML querying tools (based on XPath 2.0 [5] and XQuery 1.0 [6]) in combination with a set of advanced multimedia information retrieval capabilities defined within MPEG. This would allow, for example, querying for journal papers by specifying constraints over their related XML metadata (which is not restricted to a particular format) in combination with similarity search, relevance feedback, query-by-keywords, query-by-example media (using an example image for retrieving papers with similar ones), etc. MPQF has been designed to unify the way digital materials are searched and retrieved. This has important implications in the near future, when scholarly users' information needs will become more complex and will involve searches combining (in the input and the output) documents of a different nature (e-prints, still images, audio transcripts, video files, etc.). Currently, several forums, like [7], are trying to identify the necessary steps that could be taken to improve
interoperability across heterogeneous scholarly repositories. The specific goal is to reach a common understanding of a set of core repository interfaces that would allow services to interact with heterogeneous repositories in a consistent manner. Such repository interfaces include interfaces that support locating, identifying, harvesting and retrieving digital objects. There is an open discussion about whether the interoperability framework may benefit from the introduction of a search interface service. In general, it is felt that, while such an interface is essential, it should not be part of the core, and that it could be implemented as an autonomous service over one or more digital repositories fed through interaction with core repository interfaces for harvesting like the Open Archives Initiative (OAI) [8]. We argue that MPQF could be this search interface service, deployed in the last mile of the value chain, offering powerful and innovative ways to express user information needs.
2. Related work
In general, the preferred method for distributed acquisition of content descriptions from digital content repositories is metadata harvesting. Metadata harvesting consists of collecting the metadata descriptions of digital items (usually in XML format) from a set of digital content repositories and storing them in a central server. Metadata is lighter than content, and it is feasible to store the necessary amount of it in an aggregation server so that real-time access to information about distributed digital content can take place without the burden of initiating a parallel real-time querying of the underlying target content databases. Nowadays, the preferred harvesting method is the one offered by the Open Archives Initiative (OAI), which defines a mechanism for harvesting XML-formatted metadata from repositories (usually within the scholarly context). The OAI technical framework is intentionally simple with the intent of providing a low barrier for implementers and users. The trade-off is that its query expressiveness and output format description is very limited. In the OAI Protocol for Metadata Harvesting (OAI-PMH), metadata consumers or "harvesters" request metadata for updated records from the metadata producers or "repositories" (data providers are required to provide XML metadata at least in Dublin Core format). These requests can be based on a timestamp range, and can only be restricted to named sets defined by the provider. These sets provide a very limited form of selective harvesting, and do not act as a search interface. Consequently some repositories may provide other querying interfaces with richer functionality, usually in addition to OAI. The two principal examples are Z39.50 and SRU-CQL [9, 10]. Regarding OAI, the MPEG Query Format (MPQF) could also be used for harvesting (though in that case a metadata format offering record update timestamps would be needed), overlapping with the OAI functionalities. However, MPQF is a complex language which has been designed for fine-grained retrieval and more advanced filtering capabilities. Because OAI offers a specialised, low-barrier and mature protocol for harvesting, we think that both mechanisms should be used in conjunction. With respect to Z39.50 and related protocols/languages, MPQF surpasses their expressive power by offering a flexible combination of XML-based query capabilities with a broad set of multimedia information retrieval capabilities. A major difference with respect to the Z39.50 approach is that MPQF does not define abstract data structures to which the queries refer; instead, MPQF queries use generic XPath and XQuery expressions written in terms of the expected metadata format of the target databases. We envisage the usage of MPQF and its expressive power directly between user agents and service providers, while OAI will probably be used through the rest of the value chain. Regarding other multimedia query formats, there exist several languages explicitly for multimedia data such as SQL/MM [11], MOQL [12] or POQLMM [13], which are out of the scope of this paper because of their limitations in handling XML data. Today, this kind of work tends to be based on MPEG-7 descriptors and the MPEG-7 data model. Some simply advocate the use of XQuery or some extensions of it. Others define a more high-level and user-oriented approach. MPQF outperforms XQuery-based approaches like [14,
15, 16] because, while offering the same level of expressiveness, it offers multiple content-based search functionalities (QBE, query-by-freetext) and other IR-like features (e.g. paging or relevance feedback). Besides, XQuery does not provide means for querying multiple databases in one request and does not support multimodal or spatial/temporal queries.
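For comparison with MPQF's richer query capabilities, the selective harvesting that OAI-PMH does offer amounts to little more than the request sketched below; base_url is a hypothetical repository endpoint, and only the standard ListRecords parameters (metadataPrefix, from, set) are used.

from urllib.parse import urlencode
from urllib.request import urlopen

def list_records(base_url, from_date=None, set_spec=None):
    """Issue an OAI-PMH ListRecords request for Dublin Core metadata.

    from_date and set_spec correspond to the timestamp-range and named-set
    restrictions discussed above; they are the only selectivity available."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if from_date:
        params["from"] = from_date    # e.g. "2008-01-15"
    if set_spec:
        params["set"] = set_spec
    with urlopen(base_url + "?" + urlencode(params)) as response:
        return response.read()        # raw XML for the harvester to parse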
3. MPEG Query Format
3.1. Concepts and benefits
Formally, MPQF is Part 12 of ISO/IEC 15938 (ISO/IEC 15938-12), "Information Technology - Multimedia Content Description Interface", better known as MPEG-7 [17]. The standardization process started in July 2006 with the release of a "Call for Proposals on MPEG-7 Query Format" [18]. However, the query format was technically decoupled from MPEG-7 during the 81st MPEG meeting in July 2007, and its name changed to "MPEG Query Format" or simply "MPQF". The standardization process has proceeded and it is expected that MPQF will become an ISO/IEC final standard after the 85th MPEG meeting in July 2008. Basically, MPQF is an XML-based query language that defines the format of queries and replies to be interchanged between clients and servers in a distributed multimedia information search-and-retrieval context. The two main benefits of standardizing such a language are 1) interoperability between parties (e.g. content providers, aggregators and user agents) and 2) platform independence; developers can write their applications involving multimedia queries independently of the database used, which fosters software reusability and maintainability. The major advantage of having MPEG rather than industry forums leading this initiative is that MPEG specifies international, open standards targeting all possible application domains and which, therefore, are not conditioned by partial interests or restrictions.
Figure 1. MPEG Query Format diagram
MPQF defines a request-reply XML-based interface between a requester and a responder. Figure 1 shows a diagram outlining the basic MPQF scenario. In the simplest case, the requester may be a user's agent and the responder might be a retrieval system. However, MPQF has been specially designed for more complex scenarios, in which users interact, for instance, with a content aggregator. The content aggregator acts at the same time as responder (from the point of view of the user) and as a requester to
47
48
Ruben Tous; Jaime Delgado
a number of underlying content providers to which the user query is forwarded. 3.2. Multimedia information retrieval vs. (XML) data retrieval One of the novel features of MPQF is that it allows the expression of queries combining both the expressive style of information and XML Data Retrieval systems. Thus, MPQF allows combining e.g. keywords and query-by-example with e.g. XQuery allowing the fulfillment of a broad range of usersâ&#x20AC;&#x2122; multimedia information needs. Both approaches to data retrieval aim to facilitate usersâ&#x20AC;&#x2122; access to information, but from different points-of-view. On one hand, given a query expressed in a user-oriented manner (e.g. an image of a bird), an Information Retrieval system aims to retrieve information that might be relevant even though the query is not formalized. In contrast, a Data Retrieval system (e.g. an XQuery-based database) deals with a well defined data model and aims to determine which objects of the collection satisfy clearly defined conditions (e.g. the title of a movie, the size of a video file or the fundamental frequency of an audio signal). Regarding Information Retrieval, MPQF offers a broad range of possibilities that include but are not limited to queryby-example-media, query-byexample-description, query-by-keywords, query-by-feature-range, query-byspatial-relationships, query-by-temporalrelationships and query-by-relevance-feedback. For Data Retrieval, MPQF offers its own XML query algebra for expressing conditions over the multimedia related XML metadata (e.g. MPEG-7, Dublin Core or any other XMLbased metadata format) but also offers the possibility to embed XQuery expressions (see Figure 2). XML query algebra (metadata-neutral) DR-like criteria
Figure 2. MPQF IR and DR capabilities

3.3 Language parts

MPQF instances are XML documents that can be validated against the MPQF XML schema. Any MPQF instance always includes the MpegQuery element as the root element. Below the root element, an MPQF document includes either the Query element or the Management element. MPQF instances with the Query element are the usual requests and responses of a digital content search process. The Query element can include the Input element or the Output element, depending on whether the document is a request or a response. The part of the language describing the contents of the Input element is named the Input Query Format (IQF) in the MPQF standard. The part of the language describing the Output element is named the Output Query Format (OQF). IQF and OQF are just used to facilitate understanding, and do not have a representation in the schema. Alternatively, below the root element, an MPQF document can include the Management element. Management messages (which in turn can be requests and responses) provide means for requesting service-level functionalities such as discovering databases or other kinds of service providers, interrogating the capabilities of a service, or configuring service parameters.
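The following sketch illustrates the two top-level alternatives just described. It is schematic only: the element names follow Section 3.3 and Figure 3, the comments mark where real query or management content would go, and it is not intended as a complete, schema-valid instance.

<!-- A search request: Query with the Input (IQF) part -->
<MpegQuery>
  <Query>
    <Input>
      <!-- OutputDescription, QueryCondition, ... (see Code 1) -->
    </Input>
  </Query>
</MpegQuery>

<!-- A service-level request: Management with an Input part -->
<MpegQuery>
  <Management>
    <Input>
      <!-- e.g. a request to discover databases or service capabilities -->
    </Input>
  </Management>
</MpegQuery>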
Figure 3. MPQF Schema root elements

3.4 Input Query Format (IQF)
The MPQF Input Query Format (IQF) mainly allows specifying the search condition tree, which represents the user's search criteria, and also the structure and desired contents of the result set. The condition tree is the main component of MPQF, and can be built by combining different kinds of expressions and query types. When analyzing an MPQF condition tree, one must consider that it will be evaluated against an unordered set of Multimedia Content (MC). The concept of Multimedia Content [17] is analogous to the concept of a Digital Object from the Digital Libraries area, and refers to the combination of multimedia data and its associated metadata. MPQF allows search and retrieval of complete or partial MC data and metadata. Conditions within the condition tree operate on one evaluation item (EI) at a given time (or two EIs if a Join operation is used). By default, an evaluation item is a multimedia content in the database, but other types of EIs are also possible (spatial or time regions, metadata fragments, etc.). Figure 4 outlines the main elements of the IQF part of the MPQF schema. The condition tree is placed within the QueryCondition element, and is constructed by combining boolean operators (AND, OR, etc.), simple conditions over the XML metadata fields, and query types (QueryByFreeText, QueryByMedia, etc.). The example in Code 1 shows an MPQF query asking for PDF research papers related to the keywords "Open Access" with a Dublin Core date element greater than or equal to 2008-01-15. Note that the query expects the target repository to expose Dublin Core descriptors. Exposing Dublin Core metadata is not required for an MPQF-compliant server; therefore the requester must first ask the repository which metadata formats it supports.

3.5 Output Query Format (OQF)
The MPQF Output Query Format (OQF) allows specifying the structure of the result set. By default, the result set includes some fields such as the resource locator (the MediaResource element in MPQF), but MPQF also allows selecting specific XML elements from one or more target namespaces. MPQF allows sorting and grouping result records, but it is deliberately rigid in the way records are presented. Unlike XQuery, which allows defining any possible structure for a result, MPQF records always share the same structure at the top levels. As shown in Figure 5, a ResultItem element is returned for each record. Within each ResultItem, generic information about the record is placed within the Comment, TextResult and MediaResource elements, while the Description element is reserved for encapsulating the XML fields which have been selected in the query.
Figure 4. Input Query Format (IQF)
Figure 5. Output Query Format (OQF)

The example in Code 2 gives an idea of how the result of the query in Code 1 could look. The result set consists of two records which match the query conditions and includes the Dublin Core elements which have been selected (title, creator, publisher and date).

<MpegQuery>
  <Query>
    <Input>
      <OutputDescription outputNameSpace="//purl.org/dc/elements/1.1/">
        <ReqField>title</ReqField>
        <ReqField>creator</ReqField>
        <ReqField>publisher</ReqField>
        <ReqField>date</ReqField>
      </OutputDescription>
      <QueryCondition>
        <TargetMediaType>application/pdf</TargetMediaType>
        <Condition xsi:type="AND" preferenceValue="10">
          <Condition xsi:type="QueryByFreeText">
            <FreeText>Open Access</FreeText>
          </Condition>
          <Condition xsi:type="GreaterThanEqual">
            <DateTimeField>date</DateTimeField>
            <DateValue>2008-01-15</DateValue>
          </Condition>
        </Condition>
      </QueryCondition>
    </Input>
  </Query>
</MpegQuery>
Code 1: Input query example
<MpegQuery mpqfID="AB13DGDDE1">
  <Query>
    <Output>
      <ResultItem xsi:type="ResultItemType" recordNumber="1">
        <TextResult>Some advertising here</TextResult>
        <MediaResource>http://www.repository.com/item04.pdf</MediaResource>
        <Description xmlns:dc="http://purl.org/dc/elements/1.1/"
                     xsi:schemaLocation="http://purl.org/dc/elements/1.1/ dc.xsd">
          <dc:title>Open Access Overview</dc:title>
          <dc:creator>John Smith</dc:creator>
          <dc:publisher>VDM Verlag</dc:publisher>
          <dc:date>2008-02-21</dc:date>
        </Description>
      </ResultItem>
      <ResultItem xsi:type="ResultItemType" recordNumber="2">
        <TextResult>Some advertising here</TextResult>
        <MediaResource>http://www.repository.com/item08.pdf</MediaResource>
        <Description xmlns:dc="http://purl.org/dc/elements/1.1/"
                     xsi:schemaLocation="http://purl.org/dc/elements/1.1/ dc.xsd">
          <dc:title>Open Access in Germany</dc:title>
          <dc:creator>John Smith</dc:creator>
          <dc:publisher>VDM Verlag</dc:publisher>
          <dc:date>2008-02-01</dc:date>
        </Description>
      </ResultItem>
    </Output>
  </Query>
</MpegQuery>
Code 2: Output query example
4. Open Archives and MPQF together: a scholarly objects interchange framework
We envisage that MPQF could be one building block of a future scholarly objects interchange framework, interconnecting heterogeneous scholarly repositories. The framework would be based on the combination of the Open Archives Initiative (OAI) protocol for metadata harvesting (OAI-PMH) with MPQF. Figure 6 graphically outlines the basic elements of the framework in an example scenario. The search functionalities required by the different parties in the framework vary depending on their roles. On the one hand, aggregators (e.g. librarians) need to collect metadata descriptions from repositories (e.g. publishers) or from each other, and this is usually performed through a harvesting mechanism. On the other hand, content "retailers", which include aggregators and also some repositories (generally medium or large scale ones), should be able to deploy value-added services offering fine-grained access to digital objects, and advanced search and retrieval capabilities. We believe that the MPEG Query Format could be the search interface between "retailers" and users, in the last mile of the value chain, offering expressive ways to represent user information needs. The scenario in Figure 6 does not cover the real-time distributed usage of MPQF. Our experience in previous projects like [19] and [20] makes us think that real-time distributed search imposes severe limitations in terms of interoperability and performance, and is not always necessary. However, this scenario is just an example, and nothing prevents the distributed usage of MPQF (the standard provides extensive capabilities for that).
5. Advanced examples
5.1 QueryByMedia example: searching research papers with similar images
The example in Code 3 shows an MPQF query asking for PDF research papers that include images similar to a given one. An example usage of this query could be the detection of image copyright infringement. For instance, it could have been used in the early 1990s, when Playboy magazine discovered that an image copyrighted by the company in 1972, Lena Sjooblom's photo (Figure 7), was being widely used in image processing research papers. The query includes a Condition element of the QueryByMedia complex type. Query-by-example similarity searches allow the user information need to be expressed with one or more example digital objects (e.g. an image file). Although the usage of a low-level feature description instead of the example object bit stream is also considered query-by-example, in MPQF these two situations are differentiated: the first case (the digital media itself) is named query-by-media and the second one query-by-description.
Figure 6. OAI+MPQF Example scenario
Figure 7. Lena Sjooblom's photo from 1972 and a research paper where it appears [21]

<MpegQuery>
  <Query>
    <Input>
      <QueryCondition>
        <TargetMediaType>application/pdf</TargetMediaType>
        <Condition xsi:type="QueryByMedia">
          <MediaResource xsi:type="MediaResourceType">
            <MediaResource>
              <InlineMedia type="image/jpeg">
                <MediaData64>R0lGODlhDwAPAKECAAAAzMzM/////wAAACwAAAAADwAPAAACIISPeQHsrZ5ModrLlN48CXF8m2iQ3YmmKqVlRtW4MLwWACH+H09wdGltaXplZCBieSBVbGVhZCBTbWFydFNhdmVyIQAAOw==</MediaData64>
              </InlineMedia>
            </MediaResource>
          </MediaResource>
        </Condition>
      </QueryCondition>
    </Input>
  </Query>
</MpegQuery>
Code 3: QueryByMedia example
QueryByMedia and QueryByDescription are the fundamental operations of MPQF and represent the query-by-example paradigm. The difference lies in the sample data used. The QueryByMedia query type uses a media sample, such as an image, as the key for the search, whereas QueryByDescription allows querying on the basis of an XML-based description. Code 4 shows how an example Dublin Core description can be included in a query. The server should return records corresponding to digital objects with metadata similar to the given one. It is up to the server to decide which similarity algorithm to apply.

<MpegQuery>
  <Query>
    <Input>
      <QueryCondition>
        <TargetMediaType>application/pdf</TargetMediaType>
        <Condition xsi:type="QueryByDescription" matchType="exact">
          <DescriptionResource resourceID="desc07">
            <AnyDescription xmlns:dc="http://purl.org/dc/elements/1.1/">
              <dc:title>Open Access Overview</dc:title>
              <dc:creator>John Smith</dc:creator>
              <dc:publisher>VDM Verlag</dc:publisher>
              <dc:date>2008-02-21</dc:date>
            </AnyDescription>
          </DescriptionResource>
        </Condition>
      </QueryCondition>
    </Input>
  </Query>
</MpegQuery>
Code 4: QueryByDescription example
5.3 QueryByRelevanceFeedback example: refining the search for a research paper
In Information Retrieval, "relevance feedback" refers to using the relevance of the results initially returned for a query to improve the results of a subsequent query. MPQF offers the possibility of "explicit" relevance feedback by allowing the user to mark specific records as relevant or irrelevant. This is accomplished through the QueryByRelevanceFeedback query type. Let us return to the example in Code 1 and Code 2. The user was looking for research papers related to the words "Open Access" and submitted a query (Code 1) to the server. The server responded with several records (within a response with id "AB13DGDDE1"), some of which are shown in Code 2. Let us imagine that the user found records number 1, 2 and 5 especially interesting. By using the QueryByRelevanceFeedback query type, as shown in Code 5, the user can submit his/her preferences to the server, allowing the server to refine the response.

<MpegQuery>
  <Query>
    <Input>
      <QueryCondition>
        <Condition xsi:type="QueryByRelevanceFeedback" answerID="AB13DGDDE1">
          <ResultItem>1</ResultItem>
          <ResultItem>2</ResultItem>
          <ResultItem>5</ResultItem>
        </Condition>
      </QueryCondition>
    </Input>
  </Query>
</MpegQuery>
Code 5: QueryByRelevanceFeedback example
The examples presented demonstrate only a small part of MPQF's capabilities; they are intended to show the particularities of this language in comparison to other existing multimedia querying facilities, and especially in comparison to existing scholarly search interfaces.
6. Conclusions
This paper proposes the usage of a novel standard, the MPEG Query Format, to extend the functionality and to foster the interoperability of scholarly repository search interfaces. The paper argues that future scholarly digital object interchange frameworks could be based on the combination of MPQF and the Open Archives protocol. While Open Archives offers a low-barrier mechanism for "wholesale" metadata interchange, MPQF provides scholarly repositories with the ability to extend access to their metadata and contents via a standard query interface, making use of the newest XML querying tools (based on XPath 2.0 and XQuery 1.0) in combination with a set of advanced multimedia information retrieval capabilities defined within MPEG. The paper also describes how this idea can be applied to the design of a scholarly objects interchange framework. The framework interconnects heterogeneous scholarly repositories and is based on the combination of two standard technologies, the OAI-PMH protocol and the MPEG Query Format. The design has been guided by the conclusions of a previous experience, the XAC project [20], from which several lessons were learnt, such as the necessary separation between metadata harvesting and real-time search and retrieval, and the need to choose a more appropriate query format than XQuery. We are currently working on the first implementation of the framework. It is worth mentioning that the first known implementation of an MPEG Query Format processor is planned to emerge from this work. Furthermore, parts of the ongoing implementation are being contributed to the MPEG standardisation process in the form of Reference Software modules. Finally, it is also relevant to indicate that we are in fact working with a third standard, the MPEG-21 Rights Expression Language [22] and its extensions, in order to also cover rights management issues. Although it has not been the focus of this paper, we have also considered in our framework the possibility of having licenses associated with the content being distributed. Those licenses specify rights and conditions that apply to a resource for a specific user, and may be used, through an authorization process, to enforce these rights and conditions during the consumption of protected content. In [22] we have already developed some tools to create licenses, to verify them, to decide whether a specific consumption is to be authorised, and to distribute information about all events happening on the content. Apart from this, we have participated in the development of a system [23] that allows control of the rights related to the whole life cycle of intellectual property, from its creation to its final usage. We are currently considering adapting our system to specifically handle scholarly content, which would allow authors to register their work before sending it for review or publication, to decide which rights they want to give to their creations, and to keep control over the events related to them.

7. Acknowledgments
This work has been partly supported by the Spanish government (DRM-MM project TSI 2005-05277) and the European Network of Excellence (VISNET-II IST-1-038398), funded under the European Commission IST 6th Framework Program.

8. References
[1] ISO/IEC FDIS 15938-12:2008, "Information Technology - Multimedia Content Description Interface - Part 12: Query Format".
[2] Gruhne, Matthias; Tous, Ruben; Doeller, Mario; Delgado, Jaime; Kosch, Harald (2007). MP7QF: An MPEG-7 Query Format. 3rd International Conference on Automated Production of Cross Media Content for Multi-channel Distribution (AXMEDIS 2007), Barcelona, November 2007. IEEE Computer Society Press. ISBN 0-7695-3030-3. p. 15-18.
[3] Kevin Adistambha et al. (2007). The MPEG-7 Query Format: A New Standard in Progress for Multimedia Query by Content. 7th International Symposium on Communications and Information Technologies (ISCIT 2007), Sydney, Australia, October 16-19, 2007. IEEE Computer Society Press.
[4] ISO 23950. Information Retrieval (Z39.50): Application Service Definition and Protocol Specification.
[5] XQuery 1.0: An XML Query Language. W3C Recommendation 23 January 2007. See http://www.w3.org/TR/xquery/.
[6] XML Path Language (XPath) 2.0. W3C Recommendation 23 January 2007. See http://www.w3.org/TR/xpath20/.
[7] Tony Hey, Herbert Van de Sompel, Don Waters, Cliff Lynch, Carl Lagoze. Augmenting interoperability across scholarly repositories. JCDL '06: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, June 2006.
[8] Open Archives Initiative. http://www.openarchives.org/.
[9] Library of Congress, 2004. SRU (Search Retrieve via URL). See http://www.loc.gov/standards/sru/sru-spec.html.
[10] Common Query Language. See http://www.loc.gov/z3950/agency/zing/cql/cqlsyntax.html.
[11] J. Melton and A. Eisenberg. SQL Multimedia Application Packages (SQL/MM). ACM SIGMOD Record, 30(4):97-102, December 2001.
[12] J. Z. Li, M. T. Ozsu, D. Szafron, and V. Oria. MOQL: A Multimedia Object Query Language. In Proceedings of the Third International Workshop on Multimedia Information Systems, pages 19-28, Como, Italy, 1997.
[13] A. Henrich and G. Robbert. POQLMM: A Query Language for Structured Multimedia Documents. In Proceedings of the 1st International Workshop on Multimedia Data and Document Engineering (MDDE'01), pages 17-26, July 2001.
[14] J. Kang et al. An XQuery engine for digital library systems. In 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, Houston, Texas, May 2003.
[15] D. Tjondronegoro and Y. Chen. Content-based indexing and retrieval using MPEG-7 and XQuery in video data management systems. World Wide Web: Internet and Web Information Systems, pages 207-227, 2002.
[16] L. Xue, C. Li, Y. Wu, and Z. Xiong. VeXQuery: an XQuery extension for MPEG-7 vector-based feature query. In Proceedings of the International Conference on Signal-Image Technology and Internet-Based Systems (IEEE/ACM SITIS'2006), pages 176-185, Hammamet, Tunisia, 2006.
[17] ISO/IEC 15938 Version 2, "Information Technology - Multimedia Content Description Interface" (MPEG-7).
[18] ISO/IEC JTC1/SC29/WG11 N8220, July 2006. "Call for Proposals on MPEG-7 Query Format".
[19] Ruben Tous, Jaime Delgado. Advanced Meta-Search of News in the Web. ELPUB2002 Technology Interactions: Proceedings of the 6th International ICCC/IFIP Conference on Electronic Publishing, Karlovy Vary, Czech Republic, 6-8 November 2002. VWF Berlin, 2002. ISBN 3-89700-357-0.
[20] J. Delgado, S. Llorente, E. Peig, and A. Carreras. A multimedia content interchange framework for TV producers. 3rd International Conference on Automated Production of Cross Media Content for Multi-channel Distribution (AXMEDIS 2007), Barcelona, November 2007. IEEE Computer Society Press. ISBN 0-7695-3030-3. p. 206-213.
[21] Jose A. Rodrigo, Tatiana Alieva, Maria L. Calvo. Applications of gyrator transform for image processing. Optics Communications, Volume 278, Issue 2, 15 October 2007, pages 279-284.
[22] ISO/IEC, Information Technology - Multimedia Framework (MPEG-21) - Part 5: Rights Expression Language, ISO/IEC 21000-5:2004, March 2004.
[23] IPOS-DS (Intellectual Property Operations System - Digital Shadow), exploited by NetPortedItems, S.L. http://www.digitalmediavalues.com.
The State of Metadata in Open Access Journals: Possibilities and Restrictions

Helena Francke
Department of Cultural Sciences, Lund University, SE-223 62 Lund, Sweden
and Swedish School of Library and Information Science, Göteborg University and University College of Borås, Sweden
e-mail: helena.francke@hb.se
Abstract

This paper reports on an inquiry into the use of metadata, publishing formats, and markup in editor-managed open access journals. It builds on findings from a study of the document architectures of open access journals, conducted through a survey of 265 journal web sites and a qualitative, descriptive analysis of 4 journal web sites. The journals' choices of publishing formats and the consistency of their markup are described as a background. The main investigation concerns their inclusion of metadata. Framing the description is a discussion of whether the journals' metadata may be automatically retrieved by libraries and other information services in order to provide better tools for helping potential readers locate relevant journal articles.

Keywords: scholarly journals; metadata; markup; open access; information access

1. Introduction
This paper will report on an inquiry into the use of metadata, publishing formats, and markup in editor-managed open access journals [1]. The open access movement endorses and is actively working towards the possibility for everyone with an Internet connection and sufficient information literacy to be able to access scholarly contributions on the Web. However, given the amount of documents and services on the Web, making content available is no guarantee that it will also be found by the intended target groups. Although there are several ways for authors and publishers of open access scholarly journals to address the problem of their products "being found", including Search Engine Optimization, many of them require a potential reader either to already be familiar with the journal or to enter a suitable search query into a search engine. The latter is presumably the most common locating tool that readers use [2]. Making the journal articles searchable through OAI-compliant repositories or library online catalogues can aid in bringing articles to the attention of potential readers, often with the additional perk that comes with positioning the articles within the context of the journal to a larger extent than is the case when individual files are found through a search engine.

For small publishers of scholarly journals [3], particularly in cases where an open access journal is run on a low budget by an individual or an organization such as a university department or library [4],[5], it may be difficult to find the time and resources to promote the journal. Libraries and other information services may provide help with collecting and making available article metadata from this group of journals in order to increase their visibility. Such projects already exist, e.g. Lund University Library's DOAJ [6] or the University of Michigan's OAIster [7], but these services still require input from the journals in the form of harvestable metadata. If article metadata could be retrieved directly from the journal web sites without a need for the publishers to provide it in a specific format, there would be better opportunity for libraries to work with publishers of small and local journals so as to help them target a world-wide audience [cf. e.g. 8]. In this paper, I will present findings concerning the use of metadata, publishing formats, and markup in editor-managed open access journals [5, p. 5] that can be of use for librarians, scholars, and computer scientists who are considering taking on such tasks. Focus in the paper is on which metadata are included and marked up in the journals; the choice of format and markup consistency are included because they constitute important prerequisites for how metadata may be reused.

2. Methodology
The data were collected through a combination of qualitative and quantitative methods. This allows for conclusions to be drawn both across journals and across the different issues and articles within individual journals. The document architectures of the journals were studied with regard to their choice of publishing format, their use of markup in cases where markup languages were used, and the marked up and visible metadata or bibliographic data included. The study looked at three levels of the journals: the start page, the table of contents pages, and the article pages. The quantitative study comprised 265 journals. The most recent issue and its first article were studied, and for some variables the first issue published online was also included. The qualitative study included four journals, which were investigated in greater detail, including all or most of the issues and a few articles for each issue. The margins of error for each variable in the statistical study were estimated with 95% confidence using Jowett's method [9], [10].

2.1 Journals included in the study
The focus of the study was on journals that are published by small open access publishers. These journals are often run by individuals or groups of individuals, or sponsored by universities or university libraries, and they may be termed editor-managed journals [5, p. 5] because much of the publishing work is done by editors who are subject specialists rather than professional publishers. The journals included in the sampling frame were identified through the DOAJ [6] and Open J-Gate [11] databases, and the frame was restricted to journals that were peer reviewed, published their web site in one of the languages Danish, English, French, German, Norwegian, or Swedish, were open access, and could be considered editor-managed. From the sampling frame of approximately 700 journals (in spring 2006), a random sample of 265 journals was drawn. The majority of the journals in the sample, 70.2%, were published by university departments. Another 9.8% were published by university presses or e-journal initiatives, and 7.2% each by another type of non-profit organisation or under the journal's name. English was the most common language, with 85.3% of the journals having this as their main language. The journals represented every first-level subject category included in DOAJ. The four journals in the qualitative section were selected mainly because they use web technology in an innovative or interesting fashion; this was of relevance to other parts of the study than those reported in this paper. The journals were all from the humanities or education, namely: assemblage: the Sheffield graduate journal of archaeology, The Journal of Interactive Media in Education (JIME), The Journal of Music and Meaning (JMM), and The International Review of Research in Open and Distance Learning (IRRODL).

3. Results
From the study outlined above, data have been selected for presentation that concern three different areas: the publishing formats of the journals, their use of (X)HTML markup, and their inclusion of metadata. Focus is on marked up metadata included in the journal files at the various journal levels. To what extent do editor-managed open access journals include marked up metadata, and are the text strings that are marked up in this way potentially useful for various forms of automatic collection of metadata into a system? However, marked up metadata requires a file format based on a markup language of some sort. This motivates an initial look at the publishing formats used in the journals at the various journal levels. The usefulness of the metadata, as well as of other marked up text, is also to some extent restricted by how the markup has been performed. Therefore, the predictability and validity of the journals' markup is also discussed before turning to a more thorough report of the inclusion of marked up metadata.

3.1 Publishing formats
The start page of a Web-based journal is often intended to be a mutable space where news and updates are added regularly. The page also often functions as a portal with a collection of hyperlinks to the other parts of the journal web site. It is therefore not surprising that the start pages of all the journals in the sample publish through some version of (X)HTML. Most journals also have separate table of contents pages for each issue. These pages have a higher degree of permanency than the start pages, because they are generally not updated once the issue has been published. In most cases, their primary function is to direct the visitor to one of the issue's articles. When these pages exist separately, they are (X)HTML based, but in 5 of the journals the issue is published as a single unit in PDF or DOC, and the table of contents is placed at the beginning of that file. At the article level, the variety of file formats is much wider, but (X)HTML and PDF are by far the most common ones. As many as 67.1 to 78.1% of the journals in the population publish the articles in their latest issue in PDF, whereas between 36.6 and 48.8% of the journals use (X)HTML. The articles in somewhere around one fifth of the journals are actually made available in more than one file format, and it is often the case that both PDF and (X)HTML are used. Furthermore, the proportion of journals with PDF as the publishing format for the articles is higher in the latest issues than in the first ones, with a corresponding decline in the popularity of (X)HTML. There are many reasons that could account for why PDF has become more popular. These include a desire on the part of the journals to use a file format that indicates permanency, something that is often associated with credibility; the ease of using the same file for derivatives in several media (notably print and Web); and a wish to make it easier for readers who print the articles before reading.

Publishing format         1st issue   Latest issue
HTML non-specified        82          53
HTML 2.0                  2           --
HTML 3.2                  8           3
HTML 4.01 Transitional    26          29
HTML 4.01 Frameset        2           3
XHTML 1.0                 18          25
(X)HTML Total             139         113
PDF                       158         193
PostScript                10          9
MS Word                   5           4
RTF                       3           1
DVI                       5           5
Hyperdvi                  1           --
DjVu                      2           2
TeX                       3           3
ASCII/txt                 4           --
WordPerfect               1           --
Mp3                       1           1
PNG                       1           --
EPS                       --          1

Table 1: Frequency of publishing formats in the journals, including journals that publish their articles in more than one format. First peer-reviewed article in the first and most recent issue published on the journal web site.
Other file formats found occasionally at the article level are various LaTeX output formats such as DVI, Hyperdvi and TeX, as well as PostScript and DjVu. Apart from the latter format, these exist solely in journals within the areas of mathematics and computer science. A few occurrences of MS Word, RTF, and TXT were noted, and one journal - IRRODL - contained MP3 versions of some of its articles (see also Table 1). One of the consequences of the dominance of PDF and the decline of the use of (X)HTML at article level is that fewer journals provide the possibility of including marked up metadata at article level. Rather, the issue level becomes more important as a potential location for metadata, even for metadata describing an article rather than an issue. Some journals also offer a "paratext page", generally positioned between the issue's table of contents page and the article page, where they include non-marked up metadata (or paratexts) describing the article. This page can include information on the author(s) and the journal, and various descriptions of the article such as title, abstract, keywords, and sometimes even references. As these paratext pages are generally in (X)HTML, they can be a place to also identify marked up metadata. However, a consequence of the limited use of (X)HTML at article level is that the places to look for marked up metadata in the journals vary depending on which file formats are used at which levels.

3.2 Markup
The predictability and validity of the markup of (X)HTML pages may affect the possibilities to make use of the markup in various ways. If elements are correctly and consistently marked up, it is easier to identify and extract them for specific purposes. This includes identifying an article's title through a <title> tag, finding words occurring in headings, block quotes, or image texts, and the use of XPath to locate a specific position in a document. Among the journals that were studied, very few made use of valid (X)HTML markup. Among start pages, 6.8% passed validation, and the corresponding figure at the article level was 8.0%. Due to the low proportion of articles that were published in (X)HTML, this means that between 1.6 and 6.4% of all journals can be expected to publish articles with valid (X)HTML markup. It should be acknowledged that validation of the pages was made automatically, using the fairly strict W3C validator, and that no evaluation was made in the survey of the types of errors that it reported. A closer inspection of the types of errors that came up in the validation of one of the journals in the qualitative study illustrates how attempts to accommodate various (older) web browsers can cause the markup to break W3C recommendations. Thus, a conscious choice may in some cases have been made that has resulted in a minor violation of the recommendations. It was clear in the sample that a majority of the valid (X)HTML pages were found among start pages and articles where XHTML 1.0 was the HTML version used; this was the case in two thirds of the valid pages. With one exception, the remaining third of the valid pages were HTML 4.01 Transitional. A concern with validity (or the use of editor software that generates more correct markup) was thus found primarily among those web sites that use newer versions of (X)HTML. At the same time, only half of the start pages and article pages in the sample that used XHTML 1.0 had valid markup.

So far, the (X)HTML validators are not intelligent in the sense that they take into account whether or not the marked up content of the elements fits the logic for which it is marked up. It is, for instance, quite possible to mark up a section of the body text as a heading, such as <h3>, and this is sometimes done in order to achieve a specific visual effect. However, if one wishes to use markup for identifying and retrieving content, it is important both that the markup is used for a text string of the content type indicated by that markup element and that all the content of that type is marked up with the correct element and not with other elements. For instance, if one wishes to use the element <blockquote> in order to locate and extract any block quotes in the articles of a journal, this will only be successful if block quotes have in fact been marked up as such and not as, e.g., <dir><dir><font size=-1>, and if <blockquote> has not been used to achieve a desired visual appearance for, say, the abstracts.

The markup of three types of content was studied in the survey, namely headings, block quotes, and the inclusion of alternative text as an attribute in image elements. These three types were chosen because headings are a common element on a web page and may contain terms that are significant for describing an article's content, block quotes have close ties to the scholarly article as a genre and indicate a reference to somebody other than the article author(s), and the "alt" attribute could give an indication of what an image represents through means that are possible to use in text - rather than image - retrieval. Of these, block quotes was the element that was used correctly most often, namely in 51.3% of the cases. On the other hand, because many of the journals publish in other formats than (X)HTML and given that block quotes are less common than, for instance, headings, only between 11.3 and 20.4% of journals contain correctly marked up block quotes. Some journals that do not mark quotes using the <blockquote> element nevertheless indicate the function of the string of text by including block quote as a class, name, or ID attribute. All articles can be expected to contain headings, if nothing else then at least an article title, which would presumably be marked up as a heading of the highest degree. Just under half of the journals in the sample with articles in (X)HTML use <h> for headings, and slightly more than 40% of these journals have headings marked up according to hierarchy, beginning at the topmost level and downwards. A further 15.7% adhere to hierarchy but do not begin with <h1>. In total, between 7.5 and 15.3% of all the journals can be expected to use <h> to identify headings hierarchically. The "alt" attribute of the image element - optional in earlier versions of HTML but compulsory in later versions - was included in slightly under one third of the articles that contained the <img> element, which was 75.2% of the journals in the sample publishing articles in (X)HTML. A few articles contained the "alt" attribute, but it was left without content. This means that the total proportion of journals with "alt" attributes that could be used for various purposes is between 4.4 and 11.0%. In the survey, the markup was studied in the first peer-reviewed article in the most recent issue of each journal. The qualitative studies indicate that there can be large variations in how markup validity and predictability are handled between different issues of the same journal and even between articles in the same issue. At the moment, this makes the use of markup an unreliable means to identify specific logical elements in the articles.
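As an illustration of the three markup features examined above (hierarchical headings, block quotes, and alternative text for images), a fragment such as the following - a constructed example, not taken from any of the surveyed journals - would let all three be identified and extracted automatically:

<h1>Open Access Overview</h1>
<h2>1. Introduction</h2>
<p>As one author puts it:</p>
<blockquote>
  <p>Making content available is no guarantee that it will also be found.</p>
</blockquote>
<img src="figure1.png" alt="Bar chart of publishing formats by journal level" />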
3.3 Metadata
The journals' start pages, issue pages, paratext pages, and article pages contain data that describe the articles and the journal in various ways. This information can be divided into that which is marked up according to its content type and that which is not marked up but whose content type can be identified by a person or, in some cases, automatically through an algorithm that can identify such specific features as a copyright sign or a phone number. Focus here will be primarily on marked up, machine-readable metadata, which is the type most easily usable in, for instance, various projects for automated data collection (for more results concerning the non-marked up type, see [1]). The types of machine-readable metadata that will be discussed are those marked up by the <title> and <meta> elements, including <meta> elements that make use of elements from the Dublin Core Metadata Element Set. The occurrence of RSS feeds will also be briefly discussed. Three things are of particular interest in this context:

1. To what extent are various types of marked up metadata included at various levels in the journals?
2. What content is entered into the metadata elements?
3. Which levels of the journal do these metadata describe (journal, issue, article)?
Type of content in the <title> element   Issue level     Article level   Article level, journals
                                         (n=265), %      (n=265), %      with <title> (n=112), %
Journal title                            38.9            7.6             44.6
No/vol. of issue                         5.3             --              24.1
"Current issue" or similar               1.1             n.a.            --
2 of the above                           40.4            1.9             n.a.
Article title                            n.a.            7.9             53.6
Name of author                           n.a.            2.3             33.9
Article title and author                 n.a.            3.4             n.a.
More or other of the above               n.a.            18.1            n.a.
Other                                    10.9            1.1             1.1

Table 2: Types of content included in the <title> element at issue and article level. In the right-most column, the composite values have been broken down into single values. [1, p. 247]

In the presentation of findings that follows, it may be good to keep in mind that the file formats that the journals use vary at the different levels. All the journals use (X)HTML on their start pages and almost all (98.1%) for the table of contents pages (the issue level). The use of (X)HTML is less common at article level, where it is found in the most recent issues of 42.6% of the journals. This means that when the article level is discussed below, only this smaller sample of (X)HTML files has formed the basis for the results. The most commonly occurring metadata type is the <title> element, which can be found at the start page and issue levels in at least 95% of the journals and at the article level in a minimum of 93% of the journals publishing in (X)HTML at this level. The journal title is the most commonly included information in the <title> element at the issue level, occurring in between 73.9 and 84.0% of the journals. Information on the issue and/or volume number, or a text that indicates that it is the "current issue", occurs in between 40.7 and 53.0% of the journals. Both the journal title and the issue/volume number also occur fairly frequently in the <title> element at the article level - the title in just below half of the journals and the issue/volume number in about a quarter of them. Approximately as common - slightly more common in the sample, in fact - are the article title and the name of the author(s). Between 43.9 and 63.1% of the journals include the article title in the <title> element of the article files. However, at this level, it is not entirely uncommon for the <title> element to contain a number of different types of information. The figures for the most common content types are listed in Table 2. Some variety can also be found among the words listed in <title> - many are quite generic, such as "Article/s", "contributions", "Mainpage", or "Default Normal Template", whereas others provide additional information that may be used to identify the journal, support its credentials, or advertise the journal, such as the name of the publisher or the ISSN. Very few of the <title> elements contain nonsensical text.

A particular problem can be caused by journals that use frames. In many cases, frames mean that if the content of a <title> element on a page is to be used for some purpose, a decision has to be made with regard to which file (and <title> element) should be preferred over the others. Perhaps the most likely candidates are the frameset file and the file which contains the article text. However, these can have different text in their <title> elements. One of the journals in the qualitative study illustrates this, and also that there was some inconsistency in what content was included in the <title> elements of similarly positioned files in different issues (this was also the case in another of the journals in the qualitative study). The differences are by no means very large, but it is not uncommon for the content to be formulated according to varying patterns (abbreviations, notation, order, etc.) or to contain slightly different types of content. Overall, the variety of exactly what the <title> element contains is quite wide and covers many more types of content than, for instance, the main heading of the pages.
A comparison of the content in the <title> element and that marked up as DC.title (only a few journals make use of the Dublin Core title element, 12 journals at the issue level and 14 at the article level) shows that the content is similar in most cases (10 journals at the issue level, 8 at the article level). In the few other cases, the Dublin Core elements sometimes contain more precise content in the form of the article title where the <title> equivalent has more types of content, and sometimes the Dublin Core element contains generic content such as "Article". However, since the Dublin Core title element is much less common than the <title> element, and in many cases contains the same information, it does not seem to be particularly useful to target specifically.

A type of markup that is of specific interest in this case is the <meta> element available in (X)HTML, which can be used for marking up various types of metadata - in the words of the HTML 4.01 specification, "generic metainformation" [12, sect. 7.4.4]. The attributes name and http-equiv are used to describe the type of metadata (or property) that is included, and the attribute content to include the metadata text itself (the value). As the HTML specification does not restrict the properties that are possible to use, some variety in properties is likely to be encountered, but some properties have emerged as more common than others. Among the 90% of the journals in the sample that included a <meta> element, most used the technically oriented http-equiv with various properties. The details of this attribute were not included in the study. Among the properties associated with the name attribute, the most commonly used were keywords, description, and generator (see Table 3). Keywords and description, in particular, have emerged as quite frequently found on the web sites. Apart from some of the journals that include http-equiv, files often contain more than one <meta> element. Combinations of the properties keywords, description, and http-equiv, and of http-equiv and generator, are the most common (the two latter properties are likely to be included by the software employed and seldom require the person marking up the text to fill out the values).

Type of metadata    Journal level (n=265)   Issue level (n=260)   Article level (n=113)
http-equiv          206 (77.7%)             199 (76.5%)           91 (80.5%)
keywords            98 (37.9%)              74 (28.5%)            30 (26.5%)
description         93 (35.1%)              74 (28.5%)            31 (27.4%)
generator           66 (24.9%)              71 (27.3%)            40 (35.4%)
author              36 (13.6%)              30 (11.5%)            20 (17.7%)
robots              16 (6.0%)               11 (4.2%)             4 (3.5%)
copyright           11 (4.2%)               10 (3.8%)             6 (5.3%)
title               3 (1.1%)                2 (0.8%)              5 (4.4%)
date                2 (0.8%)                2 (0.8%)              2 (1.8%)

Table 3: Types of metadata in the <meta> element, in frequency and proportion of the (X)HTML files. [1, p. 251] The discrepancy in the number of (X)HTML files at article level compared to Table 2 is due to inconsistencies in the study.

Small variations can be seen in the sample when it comes to the frequency of the various properties at different journal levels, but generally they show a similar pattern. The differences need to be treated with caution, as they are not statistically significant for the population at large. The generator property is slightly more common at article level in the sample, as is the case with author. That the author property would not be more common at article level is perhaps a bit surprising, as it is generally easier to identify the particular author(s) of an article than to decide who should be listed in that position for the journal at large. The fact that the keywords and description properties are more common on the start pages than on the table of contents or article level could have to do with the fact that it is easy to enter the values for these properties once on the start page when creating the site, whereas entering them for each new table of contents page and article requires certain routines.
The qualitative studies, where more attention was placed on the values included in the <meta> elements, provide examples of how journals try to counter the fact that the general use of <meta> element properties does not adhere to a specific vocabulary, by offering various versions of suitable keywords. Anticipated variations in how users will search for certain words with regard to number, spelling, and synonyms were met by including alternative keywords, e.g. university and universities; archaeology and archeology; journal and periodical. Some journals also exploit the fact that search engines can as easily search through post- as pre-coordination. They include quite unexpected phrases among the keywords, phrases that one would perhaps not expect potential readers to search for but from which separate terms can still be retrieved.

In the fairly rare cases in the sample where the <meta> element is used to mark up a more regulated set of properties, namely those from the Dublin Core Metadata Element Set, the number of properties that are included is quite extensive, ranging from four to 14, with a median of 7 or 8 (depending on journal level). Between 3.8 and 11.4% of the journals contain Dublin Core metadata. In the sample, 18 journals were found to include this metadata type at the journal level, 17 at the issue level, and 20 at the article level. Only the Dublin Core properties that contained a value in the content attribute were included in the study. The practice of including subject (keywords) and description remains fairly strong at all levels, but even more commonly used are properties that may be easier to include (and in some cases to inherit from a template), such as DC.Type, DC.Format, and DC.Language. The Dublin Core elements are also used to indicate the originator to quite a large degree, through such properties as DC.Creator, DC.Publisher, DC.Rights, and DC.Identifier. The only other property that occurs in more than 10 journals on at least one of the levels is DC.Title (cf. above).

So far, it is mainly the types of properties included in the <meta> element that have been reported. However, as with the content of the <title> element, the <meta> elements are of little use if they do not contain values that may be used. For this reason, the quantitative study also recorded which journal levels the metadata describe. In order to discuss this, a distinction must be made between the level (journal/start page level, issue level, and article level) on which the file containing the <meta> element is placed and the level that the value of this <meta> element describes. I will refer to these as the level where the metadata is placed and the level that the metadata describes. The metadata (including Dublin Core elements) placed on the journals' start pages generally describe the journal at large. This is, however, also very often the case with metadata found at the issue and (to a smaller extent) article levels. When metadata at each of these levels does not (or not only) describe the journal level, it describes the level on which the metadata is placed. Thus, it is very rare for metadata placed on table of contents pages to describe individual articles, and for metadata placed in the article files to apply to the issue level. In fact, as can be seen in Figure 1, it is much more common at the issue level for the metadata to describe the journal than the issue.
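To make the discussion of <title>, <meta> and Dublin Core properties concrete, the fragment below sketches what a reasonably complete <head> section for an article page could look like, following the common convention of embedding Dublin Core elements via a schema.DC link; the journal, author and subject values are invented for illustration and do not come from the surveyed journals.

<head>
  <title>Journal of Hypothetical Studies, 4(2): Metadata in Practice</title>
  <meta name="keywords" content="metadata, meta data, open access, journals, periodicals" />
  <meta name="description" content="Survey of metadata use in open access journals." />
  <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
  <meta name="DC.title" content="Metadata in Practice" />
  <meta name="DC.creator" content="Jane Doe" />
  <meta name="DC.publisher" content="Journal of Hypothetical Studies" />
  <meta name="DC.date" content="2008-02-21" />
  <meta name="DC.format" content="text/html" />
  <meta name="DC.language" content="en" />
</head>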
This further supports the hypothesis that metadata that can be entered once and remain valid, such as metadata describing the journal level, are more commonly included than metadata that need to be updated for each new issue or article. The fact that some cases were found where the metadata had been copied from a previous issue or article without being changed indicates that, when a new file is created based on a previous issue or article file, changing the marked up metadata could easily be forgotten. One of the journals in the qualitative study included quite a few <meta> and Dublin Core elements at its various levels. With a few exceptions at the article level, however, the values for each property were the same across the three levels. The metadata in this journal are thus site-specific rather than page-specific, which influences the granularity with which one can search for content from the journal.

Marked up metadata that are placed in a separate file are offered by 25 of the journals in the form of RSS feeds. This means that RSS files are available in between 5.6 and 13.6% (possibly as high as 17.8% at the article level) of the journals, at all three journal levels. 7 of these journals make use of a journal management system (either PLONE or the Open Journal System), which has presumably made the inclusion of the feed easier. RSS feeds can provide marked up metadata that can be useful for various forms of reuse. Unlike the case with the <meta> element, the content of this metadata format is also more publicly visible, which could mean that the content is more carefully selected and entered.
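For comparison with the <meta>-based approach, an RSS 2.0 feed item describing a journal article might look roughly as follows (again with invented journal and article values):

<rss version="2.0">
  <channel>
    <title>Journal of Hypothetical Studies</title>
    <link>http://journal.example.org/</link>
    <description>Latest articles from the journal</description>
    <item>
      <title>Metadata in Practice</title>
      <link>http://journal.example.org/vol4/issue2/article1.html</link>
      <description>Survey of metadata use in open access journals.</description>
      <author>jane.doe@example.org (Jane Doe)</author>
      <pubDate>Thu, 21 Feb 2008 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>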
Figure 1: The levels of the journal described by the metadata (<meta> and Dublin Core) found in the files at the various journal levels, by number of journals. [1, p. 256]

4. Discussion and conclusions
Time is a valuable – and often scarce – resource for editorial staff of open access scholarly journals. A likely reason for the inconsequent use of marked up metadata that has come out as one of the results of this study is the lack of routines to follow when preparing an article for publication, both when a single person is responsible for the markup and design and when several people are involved. This results in great variations in what metadata are included in the various metadata elements as well as in how the metadata are notated and organized. The latter was shown to be the case in particular in the <title> element. As was illustrated in the qualitative studies, such variations occur not only between journals – where they are only to be expected – but also within journals and even within issues. Other problems that turned up in the study concern the reliability of metadata, such as when the values of the metadata elements are not updated when a new article file is created from an existing article or from a template. A certain lack of consistency was also found in one of the journals in the qualitative study that used frames. This raises the question of how to treat, and prioritize between, frames files when it comes to metadata. Thus, there are several potential problems with using existing metadata for various attempts at automatic collection of bibliographic data from the journals, even in the cases where there has been made an effort of including metadata elements. The great variety found in markup and metadata both between and within journals affects the possibilities for, for instance, libraries and other information services to retrieve data directly from the journal web sites in order to provide added value to the journals and their user communities. At the same time, many of the journals do include metadata in the form of <title> and <meta> elements, even though only keywords and description can be said to be properties that occur reasonably often in the journals. Below, some thoughts are offered on considerations to keep in mind for individual journal publishers and the editor-managed journal community as a whole – preferably in co-operation with the library community – when trying to develop simple improvements in the form of documented routines or even more long-term guidelines for improving metadata inclusion in the journals. The ambition here has been that the development and performance of such routines should require little technological know-how. However, if more consistency and predictability is found in the marked up metadata of the editor-managed open access journals, it would be more worthwhile to develop services that offer access to the journals through various forms of collections and through bibliographic control. Such initial improvement of the Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008
metadata should be seen as a step towards the use of more advanced metadata systems, such as OAIPMH. On the way towards the use of such systems, documented routines or guidelines can be developed that take into consideration the following aspects that emerged from the present study: What level to describe in the metadata elements at various journal levels. At the moment, metadata placed in the table of contents and article files quite often describe the journal as a whole rather than the content of that particular file. This is particularly common at the issue level. It is often of great importance to include information about the journal not only on the start page but also in the files at the issue and article levels in order to highlight the connection between, for instance, an article and the journal in which it has been published, but such metadata is preferably supplemented with metadata describing the content of the file in which the elements are included. In particular, many article metadata are often included on the web site, even if they are not marked up. This includes the name of the author(s), article title, abstract, keywords, and date of publishing. In fact, not surprisingly, the first article in the latest issue of every journal in the survey displayed the author names and article title in the article file. Abstracts were included in 78.9% of the journals, either in the article file, on a paratext page, or on the table of contents page. The corresponding figure for keywords was 40.4% and for author affiliation 86.8%. Another property that is easily obtainable for the journal staff is the date of publishing. This suggests that these metadata are in many cases available, they are simply not included among the marked up metadata in the files. At what journal level to place metadata describing the article. The article file seems to be the obvious place for metadata describing the article. However, in cases when the article is published in a file format that does not easily incorporate metadata for retrieval, an option can be to introduce a paratext page, a page situated between the table of contents page and the article page. When this is done, the paratext page generally serves the purpose of providing bibliographic data about the article that can help the potential reader to determine if it is relevant to download the article – possibly a practice that open access journals have inherited from closed access journals, but where cost rather than download time needs to be considered. Yet, the paratext page can also contain marked up metadata which can serve to direct a user to the article page itself. Another consideration to take into account is how much metadata describing the articles in an issue to include on the table of contents page. This was very rarely done in the journals in the survey. Associated with the issue of where to place metadata describing the article is the question of: How to treat web sites with frames. In journals that use frames for displaying the web site, there are generally several options of where to place metadata that describe the article. The content of the <title> element displaying in the web browser’s title bar will be that of the frameset file. As this file is most likely the same for the entire web site, it is in most cases not a likely candidate for where to place article level metadata. A careful choice needs to be made as to where to place them, taking the design of the site into account. What metadata properties to include. 
It is easy to be ambitious when planning for metadata elements but sometimes difficult to maintain those ambitions in the daily work. In these cases, it is probably better to keep the number of metadata properties down and aim to update them for each new issue or article. However, some metadata are likely to be constant from issue to issue and from article to article, mainly the ones that concern the journal level and more technical aspects, such as file format and encoding. If the markup is copied from one article to the next, such metadata can remain unchanged. Among the journals in the survey, keywords, description, and author were among the more common <meta> element properties to be included. Title and date were much less frequent. Keywords and description are fairly established properties, but if one wishes to include more properties, there could be reason to use the Dublin Core elements in order to achieve consistency in property names. The Dublin Core Metadata Element Set, still used very seldom among the open access journals, supplies a standardized set of properties that may be beneficial, including the possibility of qualifying such ambiguous properties as “date”.
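As an illustration only – the journal, the values, and the choice of qualification scheme below are invented rather than taken from any of the journals in the study – Dublin Core properties can be embedded in the <head> of an article file alongside the more familiar keywords and description properties roughly as follows:

  <head>
    <title>Journal of Examples, vol. 5, no. 2 (2008): An Example Article Title</title>
    <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">
    <meta name="DC.title" content="An Example Article Title">
    <meta name="DC.creator" content="Author, A.">
    <meta name="DC.date" scheme="W3CDTF" content="2008-05-10">
    <meta name="DC.identifier" content="http://www.example.org/journal/vol5/iss2/art3/">
    <meta name="keywords" content="open access; scholarly journals; metadata">
    <meta name="description" content="A one- or two-sentence abstract of the article.">
  </head>

If the same template is reused from article to article, only the DC.title, DC.creator, DC.date, DC.identifier, keywords, and description values need to be updated, which keeps the routine simple.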
How to achieve consistency in the metadata element values. Related to the question of how to achieve consistency in the choice of name for various properties is that of achieving consistency in the element values. There are two dimensions of interest here: consistency in the type of metadata included in an element vs. consistency in the notation of the element value, and consistency within a journal vs. consistency across journals. That the issue of what type of content to include in an element is difficult is illustrated by the great variety found in the content of the <title> element. Robertson and Dawson [8, p. 68] also point out that, of the four journals they worked with, there were three different interpretations of what should be the value of the DC.Relation property. Simple documentation of routines can help make both the type of content and its notation and organization more consistent across all new pages of a journal web site. If there is time to go over existing pages to align them with the guidelines outlined in the documentation, the web site as a whole will be more useful. One of the greatest challenges is to achieve such consistency across a number of journals while keeping the work both technologically simple and time efficient. At the same time, cross-journal consistency is only interesting if the machine-readable metadata are actually used, that is, if there is some benefit to be had from consistency. This is where journal editors and librarians/information specialists can work together to add value to and support services that increase the findability of open access journals published by small publishers. Creating basic guidelines for the inclusion of marked up metadata is one way to begin such collaboration, but as with all things that require some form of effort, there also needs to be a reward, a reason for putting in the work.

5. Notes and References
[1] FRANCKE, H. (Re)creations of Scholarly Journals: Document and Information Architecture in Open Access Journals. Borås, Sweden: Valfrid, 2008. Also available from: <http://hdl.handle.net/2320/1815/> [cited 10 May 2008].
[2] HAGLUND, L. et al. Unga forskares behov av informationssökning och IT-stöd [Young Scientists' Need of Information Seeking and IT Support] [online]. Stockholm, Sweden: Karolinska Institutet/BIBSAM, 2006. Available from: <http://www.kb.se/BIBSAM/bidrag/projbidr/avslutade/2006/unga_forskares_behov_slutrapport.pdf> [cited 19 April 2007].
[3] The term scholarly is used in this paper to cover contributions from the scholarly, scientific, and technological communities.
[4] HEDLUND, T.; GUSTAFSSON, T.; BJÖRK, B.-C. The Open Access Scientific Journal: An Empirical Study. Learned Publishing. 2004, vol. 17, no. 3, pp. 199-209.
[5] KAUFMAN-WILLS GROUP. The Facts about Open Access: A Study of the Financial and Non-financial Effects of Alternative Business Models for Scholarly Journals [online]. The Association of Learned and Professional Society Publishers, 2005. Available from: <http://www.alpsp.org/ForceDownload.asp?id=70> [cited 24 April 2007].
[6] The Directory of Open Access Journals is provided by Lund University Libraries at <http://www.doaj.org/>.
[7] OAIster is provided by the University of Michigan at <http://www.oaister.org/>.
[8] ROBERTSON, R. J.; DAWSON, A. An Easy Option? OAI Static Repositories as a Method of Exposing Publishers' Metadata to the Wider Information Environment. In MARTENS, B.; DOBREVA, M. (Eds.) ELPUB2006: Digital Spectrum: Integrating Technology and Culture – Proceedings of the 10th International Conference on Electronic Publishing held in Bansko, Bulgaria 14-16 June 2006 [online]. pp. 59-70. Available from: <http://elpub.scix.net/data/works/att/261_elpub2006.content.pdf> [cited 12 January 2008].
[9] JOWETT, G. H. The Relationship Between the Binomial and F Distributions. The Statistician. 1963, vol. 13, no. 1, pp. 55-57.
[10] ELENIUS, M. Några metoder att bestämma konfidensintervall för en binomialproportion: en litteratur- och simuleringsstudie [Some Methods for Determining Confidence Intervals for a Binomial Proportion: A Literature and Simulation Study]. Göteborg, Sweden: Department of Economics and Statistics, Göteborg University, 2004. C-essay in Statistics.
[11] Open J-Gate is provided by Informatics India Ltd at <http://www.openj-gate.com/>.
[12] RAGGETT, D.; LE HORS, A.; JACOBS, I., Eds. HTML 4.01 Specification: W3C Recommendation 24 December 1999 [online]. W3C (World Wide Web Consortium), 1999. Available from: <http://www.w3.org/TR/html4/> [cited 9 May 2008].
Establishing Library Publishing: Best Practices for Creating Successful Journal Editors
Jean-Gabriel Bankier1; Courtney Smith2
The Berkeley Electronic Press
2809 Telegraph Avenue Suite 202, Berkeley, CA
e-mail: 1jgbankier@bepress.com; 2csmith@bepress.com
Abstract Library publishing is a hot topic. We compiled the results of interviews with librarians and editors who are currently publishing journals with the Digital Commons platform. While the research and illustrations in this paper originate from Digital Commons subscriber interviews, we think the lessons and trends we’ve identified can serve as a roadmap for all librarians looking to provide successful publishing services to their faculty. Successful journal publishing appears to rely greatly upon the librarian hitting the pavement and promoting. The librarian must be ready to invest time and commit to a multi-year view. With support and encouragement, faculty will begin journals. The librarian can then use these early successes as showcases for others. While the first editors get involved in publishing because they believe in open-access or are looking to make a mark, for future editors the most powerful motivator is seeing the success of their peers. Publishing becomes viral, and the successful librarian encourages this. Keywords: University as a publisher of e-journals; journal publishing in an institutional repository; road map for library publishing; open-access; Digital Commons; university as publisher; library as publisher 1.
Introduction
A survey of the current literature on electronic academic publishing shows scholars are rapidly going digital. Commercial publishing is losing its stranglehold on the dissemination of scholarly communications, and the commercial publisher is no longer considered part of the vanguard. Rather, it is becoming apparent that as journal editors “go digital”, they are looking to the university for consulting and publishing support. The recent report “Research Library Publishing Services,” published by the Association of Research Libraries’s Office of Scholarly Communications, showed that 65% of responding libraries offer or plan to offer some form of publishing support, using editorial management and publishing systems including OJS, DPubs, homegrown platforms, and our own institutional repository platform, Digital Commons.[1] We at the Berkeley Electronic Press (bepress) are witnessing a groundswell of interest in publishing with the library – an average of five new journals are being created each month with Digital Commons. Our librarians are excited, and we are too. Library publishing is the hot new topic. We’ve seen several reports over the last year that address the library’s emerging role as publisher[2]. But, to date, we haven’t seen much work on best practices for successful library publishing initiatives, so we started asking, How does the library do it? We compiled the results of interviews with librarians and editors who are currently publishing with the Digital Commons platform, and drew conclusions about the best practices of librarians who drive successful library publishing programs. While the research and illustrations in this paper originate from Digital Commons subscriber interviews, we think the trends we’re seeing can be applied widely. In the following paper, we share lessons about how to best engage existing editors in library publishing and entice or support prospective editors to “jump in”. As a professional publisher ourselves, bepress has worked with hundreds of editors. From the outset, Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008
most know that becoming an editor will take tremendous energy and work. Creating and editing a journal is a huge investment of time for editors and a commitment to their discipline and to their early contributors. They bring a passion for the field and a desire to create a community for others who share their passion. But they often don’t know where to turn to find help in getting started. Despite the findings of the recent reports, we have found that many scholars do not implicitly think “library” when they want to publish digitally. Even after learning about journal publishing services, faculty sometimes question whether publishing is a core competency of their library. Where does this leave library publishing? It tells us that faculty need the following: 1) to know the library is available and can offer the services they need; 2) reassurance that the library is a partner and has proven success; and, 3) certainty that the library can be a successful publisher. We introduce the paper with three general observations—we call them “truths” about library publishing. Next, we discuss key services the library needs to offer to editors in order to encourage journal set-up and help them achieve long-term sustainability. Finally, we discuss the importance of creating a showcase that reflects the publishing expertise of the library, as well as the quality of library publications and, by extension, the editors. We close with thoughts about growing the service of library publishing and the viral nature of faculty engagement. 2.
Two Hard Truths and One Easy Truth About Establishing Library Publishing
The first truth: Librarians must maintain a long-term view. Journals don’t just happen with a snap of the fingers. As Ann Koopman of Thomas Jefferson University explained, her boss supported her in taking “the long-term view” because campus-wide investment in library publishing usually takes three to five years to establish. Starting new journals requires a cheerleader, promoting library publishing for as long as it takes to get faculty talking about it. Librarians who are ready to embark on a library publishing initiative must assume Koopman’s long-term view, and be prepared to spend significant time developing a suite of sustainable journals. The second truth: The first journal is the hardest. The first journal rarely, if ever, comes to the librarian. Instead, the librarian must seek out publishing opportunities by hitting the pavement and doing some good old-fashioned face-to-face networking to find the faculty that is ready to publish digitally. Which brings us to the third truth: It gets easier – much, much easier, in fact – to bring on new journals once the librarian is able to showcase initial successes. The first takers publish because they see themselves as forward thinkers and open-access advocates. But most scholars are simply persuaded by the success of their peers. Once the library has helped establish three or so publications, librarians describe a transformation. Events unthinkable early in the period of journal recruitment become second nature to faculty and students. Librarians begin to watch the publishing craze catch on. Marilyn Billings, Scholarly Communications and Special Initiatives Librarian at UMass-Amherst, says that after three years, she is not the primary publicizing force for ScholarWorks.[3] She finds that faculty and students, including the Dean of the Graduate School, the Vice Provost for Research, and the Vice Provost for Outreach, are now doing the publicizing for her. Of course, these truths still beg the question: How does the library actually establish itself as publisher? Well, here is what we’ve found. 3.
Getting Editors Started
When it comes to establishing a digital publishing system, Ann Koopman considers the librarian’s role as trifold. The librarian is or can be: all-around promoter; provider of clerical support; provider of technical Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008
support. To put it another way, a library publishing program requires a software platform with technical support, a support system for faculty authors and publishers, and a cheerleader to get them excited and involved. For library publishing to achieve success quickly, we now know that it must have an evangelist – a “librarian as promoter” at the helm, who is truly dedicated to growing it from the grassroots level, by getting out and talking to people about it. When Marilyn Billings unveiled UMass-Amherst’s ScholarWorks IR and publishing platform, she did so with gusto and a special flair for knowing how to throw a party. Billings chose to introduce the new IR at a Scholarly Communications Colloquium sponsored by the University Libraries, Office of Research, Graduate School and Center for Teaching. She introduced it with a show tune (she’s a singer as well as librarian), a slam-bang virtual curtain drawing, and a bit of digital bubbly – “three cheers for ScholarWorks!” The chancellor, who burst out laughing during the unveiling, became a staunch supporter from that moment on. Billings’s unveiling is a lesson in the importance of brewing campus excitement. Billings notes that once she started talking, everybody started talking, and soon (that is to say, soon in library time, i.e. three years), scholars started asking to publish within the library. Western Kentucky University’s Scott Lyons also recognizes that building excitement is the first way to build investment. In addition to personally signing up new reviewers at regional sports medicine conferences for his journal International Journal of Exercise Science[4], he is currently planning a kick-off celebration for the journal’s editorial board at the American College of Sports Medicine’s Sports Medicine Conference in Indianapolis. Once librarians have created initial awareness and excitement, how do they build campus-wide investment? The librarians we spoke with consistently recommended that new publishing programs seek out “lowhanging fruit”, in the words of Sue Wilson, Library Technology Administrator at Illinois-Wesleyan University. Faculty who publish digital, open-access journals regard themselves as forward-thinkers, publishing electronically in order to incorporate multimedia content, increase the rate of knowledge production, and enhance access to scholarship. So to find this “low-hanging fruit” librarians often seek out one or more of the following: proponents of open-access, young scholars looking to make their mark, faculty who use journal publishing as a pedagogical tool, faculty who care greatly about self-promotion, and/or editors whose journals are languishing, usually due to funding concerns. Once librarians have the fruit in sight, they still must be able to reach the faculty on faculty’s terms – to “close the deal” if you will – by eliminating the barriers to going digital. New editors, as well as established editors seeking to transition paper journals, ask for a sustainable infrastructure and an established workflow. In the case of open source software, the infrastructure is set up by the library or the Office of Information Technology. In the case of hosted software, the technical infrastructure is maintained either at an hourly consulting rate or, as is the case with Digital Commons, the host provides both set up and ongoing, unlimited technical support. 
Whatever the library’s choice of platform, it benefits from having established a training program and a peer-review workflow, so that when editors are ready to begin, start-up is quick. The idea for a new journal can come from anywhere at any time. The library must be able to say, “I can help you with that.” The library, in short, will want to strike while the iron is hot. Connie Foster, Professor and Serials Coordinator in the Department of Library Technical Services at Western Kentucky University, saw this firsthand when Scott Lyons and his colleague James Navalta began the International Journal of Exercise Science. Though the idea of starting his own journal had been germinating for a long time, Navalta did not seize the opportunity until the day Lyons, frustrated by the protracted submission and review process of paper journals, turned right into Navalta’s office instead of left into his own. As Lyons tells it, he marched into Navalta’s office, threw up his hands, and asked, “James, don’t you ever just want to start your own journal?”
“As a matter of fact,” Navalta replied without pause, “I do.” They are now growing the journal by traveling to conferences and soliciting submissions from their network of colleagues. The journal is student-focused – that is, an undergrad or grad student must either author or co-author the paper for it to be submitted. In addition to developing new titles, librarians cull from the well of established print journals that are looking to transition to hybrid paper-electronic publications or go fully digital. These editors are enticed by the opportunity to reach a much larger audience, and by the time saved in the editorial management process. Faculty members at Boston College are well-versed in both paper and electronic publishing. The experience of Alec Peck, associate professor at the Lynch School of Education, speaks to this. He maintains a print journal, Teaching Exceptional Children, and an electronic journal, Teaching Exceptional Children Plus (TECPlus),[5] which he originally chose to establish in order to supplement the print with content like podcasts, video, and hyperlinks. He notes to Mark Caprio, BC’s eScholarship Program Manager, that the time it takes him to work through a full editorial cycle for the digital journal is at least half that of the print cycle. Doug White, professor of anthropology at the University of California – Irvine, shares a similar perspective. He is founding editor of the e-journal World Cultures[6] as well as founder and editor of Structure and Dynamics.[7] He began his first electronic anthropology journal in 1985, publishing on 5 ¼” floppies, and edited paper journals previous to that. White, a strong proponent of open access publishing, says, “My publication output has roughly doubled because the journal is easy for me to manage.” Editing a journal is, by all accounts, a huge time investment; libraries that can offer time-saving workflow solutions make the scholar’s decision to invest easier. Editors expect not just a publishing plan, but also the support of the library staff, either to train them on a software system, or to act as coordinator between them and hosted IT support. The value of face-to-face support is relevant, and here is where the librarian fills his or her second role – that of facilitator, or in Koopman’s words, the “clerical role”. In the role of facilitator, librarians support scholars by applying to aggregation and indexing services when the time is right, as well as ensuring that publications receive an ISSN number, that metadata is entered and formatted correctly, and that issues are archived. They also may be called upon to practice mediated deposits when a faculty member doesn’t want to learn a publishing software. The librarian, first a promoter, next becomes facilitator, helping faculty manage and publish original content. The librarians we spoke to have the promoter and clerical roles covered – and if their excitement to share their success is any measure, they clearly enjoy them. So how do they find time for the technical role as well? Admittedly, they don’t. This is a two person job. Libraries choose Digital Commons partially because they want their librarians to do the work of building the publishing program and supporting scholars; librarians can only accomplish these two tasks if they are freed from ongoing technical support. 
Marilyn Billings points out that, after she spent a six-month sabbatical researching IRs and publishing platforms, one of the reasons UMass-Amherst chose Digital Commons was that they “felt it was more important to do the marketing and the education than spend time on technical concerns.” The changing role of Mark Caprio at Boston College speaks to this as well. As he put it, “Well, I have the time to go out and see who else is doing stuff.”

4. Sustaining Publishing
The first journal is the hardest. As is, perhaps, the second, given that library publishing is relatively new, and many would-be participants are still wary. A newly-launched journal that flounders can diminish rather than strengthen the chance that the library will ultimately succeed in its mission to become a Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008
publisher. Demonstrated sustainability is needed not just for the success of the journal but for its potential to influence prospective editors. In order to commit to editing a journal, scholars need to be reassured that the library has proven methods to ensure the journal’s success. Libraries can do this by providing download reports, optimizing discoverability, and branding the full-text. Library publishing is effective insofar as it is able to maintain and increase readership. We found across our interviews that generating automatic download reports validates editors’ and authors’ efforts. Each month authors receive their readership in total downloads for each article within the Digital Commons system. In the former days of paper, editors, authors, and libraries had no way to assess the impact of publications – that is the total readership for any given article or journal. Now, authors and editors can assess, in real-time, the impact of their research, and can use download and citation statistics in funding applications and review processes. Giving contributors feedback on the dissemination and downloads of their work creates excitement and a sense of investment in the journal and the publishing process. Authors are encouraged to submit other pieces of research and encourage their peers to do the same. Moreover, when an institution can statistically verify its impact, it is more likely to continue to support publishing endeavors. The library’s publishing system may or may not automatically generate download reports. If it does not generate them automatically, then the librarian or the editor should consider this an essential task to perform manually. Another way to provide valuable feedback is to show editors and authors their rank in Google search results and citations across the web. Doug White used both download counts and citation counts of his first issue of Structure and Dynamics to demonstrate initial success. He writes, “This [the numbers] reflects positively on quality of the articles, made possible in turn by the high quality and incisiveness of reviews, the number and diversity of reviewers who have responded, and selection for quality in article acceptance and reviewers.”[8] Editors use download reports to assess the impact of research, as well as identify the content most valuable to a journal’s constituency. Ann Koopman, editor of Jeffline[9] and manager of the Jefferson Digital Commons[10], tells a similar story about Thomas Jefferson University’s Health Policy Newsletter[11], which utilizes download reports to identify the topics readers find most compelling. After uploading back content from 1994 to the present, the editors now track article downloads on a quarterly basis. In analyzing the numbers, they can pinpoint the areas of research where readers show the most interest, and shortlist these topics for more in-depth coverage in future issues. Richard Griscom, Head of the Music Library at UPenn and former IR manager, discusses the disproportionate success of the institution’s undergraduate journal, CUREJ: College Undergraduate Research Electronic Journal. During September 2007, CUREJ documents made up a little over 2% of all the content in the repository, but they made up over 10% of the downloads[12]. Analyzing download statistics allows an institution to assess the impact of various scholarly endeavors and focus resources where they are most valuable. 
As Griscom tells it, these statistics encouraged other groups to approach him about creating various publications within UPenn’s ScholarlyCommons[13]. Clearly, wide readership encourages authors and serves as a reflection of a successful library publishing program. Since it is in the library’s best interest to facilitate the widest possible dissemination of its institutional publications, it must optimize the publications for search by Google and Google Scholar. Librarians as publishers must ensure that their journals are optimized for Google, using identifiers to provide Google with easy access to content. Editors and authors who publish within Digital Commons have their articles full-text indexed through Google and Google Scholar, as well as made highly-discoverable to other search engines. An independent professor posting work on his or her website likely does not know the ins and outs of search engine discovery, whereas a technical team has the knowledge and time to develop a format that maximizes discoverability. By structuring the underlying code in the appropriate way, a web Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008
development or design team can ensure that the search engine crawlers can discover it by citation data, abstract, or words in the full text. As a publisher of academic journals online, our data on readership referrals shows that 80% of readers arrive at the journal articles straight from Google, without traveling through the journal’s homepage, in which case while the download still registers in the report, the reader may not affiliate the content with the journal or the publishing institution. We’d like to share an approach to this issue. Digital Commons can automatically stamp all PDF articles with cover pages, which bear the journal name and/or the publishing institution’s name, as well as the key article metadata. These pages are produced by a title page-generating software that is incorporated into the Digital Commons platform. There is a lot of junk on the Web and the journal or university’s stamp on the cover page tells the reader that this is content from a reputable source. The cover page acts as a signal of quality.
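The paper does not reproduce the underlying markup, but one widely used convention for exposing citation data to scholarly search engines such as Google Scholar is a set of article-level <meta> tags. The sketch below uses an invented journal and invented values, and is offered as an illustration of the general approach rather than as the actual Digital Commons markup:

  <head>
    <meta name="citation_title" content="An Example Article Title">
    <meta name="citation_author" content="Author, A.">
    <meta name="citation_publication_date" content="2008/06/25">
    <meta name="citation_journal_title" content="International Journal of Examples">
    <meta name="citation_volume" content="1">
    <meta name="citation_issue" content="2">
    <meta name="citation_pdf_url" content="http://www.example.org/journal/vol1/iss2/art3/fulltext.pdf">
  </head>

Tags of this kind give a crawler the citation data directly, independently of how the human-readable page is laid out.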
Figure 1: The Macalester Islam Journal, a stamped cover page on the PDF article, produced by title page-generating software.

5. Extending the Publishing Model
Some of the things that editors and libraries are doing, we expected. For instance, we expected journals with a paper history to publish their back content online. But we were surprised by many of the ways libraries and editors are pushing the limits of our current conception of “digital publishing”. As the hub of an institution’s scholarly communications, the library is in a unique publishing position. Scholars take advantage of this position to create a “context” or a “scholarly environment” for one or more journals. At UMass-Amherst and McMaster University we are seeing scholars use library publishing to synthesize various content and resources within and outside of the university. Take, for instance, Rex Wallace, linguist and one of few Etruscan language scholars in America. During Marilyn Billings’ and Rex Wallace’s first conversations about the UMass-Amherst ScholarWorks repository, Billings discovered that Wallace had a database of arcane Etruscan inscriptions without a home. Wallace wanted to house these inscriptions where they could be freely accessed by the scholarly community, but also wanted a location that would act as a “springboard” to bring users to the Etruscan Texts Project and the Poggio Civitate Excavation Archive, both of which are housed within the Classics department. Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008
The pair used this opportunity to create a Center for Etruscan Studies within the repository, an idea that Wallace had been chewing on for some time. Soon after, Wallace and Tuck, an archaeologist also at UMass-Amherst, decided to extend this center by creating the journal Rasenna.[14] Months later, Tuck was at a meeting of the Etruscan Foundation at the annual convention of the American Institute of Archaeology. The Etruscan Foundation had been publishing a well-known paper journal, Etruscan Studies, for over ten years, but noted that it was difficult for many scholars in the field to get easy access to the content. As Wallace tells it, Tuck showed off the Center and Rasenna, and the members, who got very excited about the prospect of making the Etruscan Studies content more widely accessible, started talking about publishing the back content online. As Billings tells the same story, “After this presentation in Chicago some of these things become really self-evident, he showed them off, and something clicked.” Wallace, Tuck, and Lisa Marie Smith are now in the process of developing a digital version of Etruscan Studies,[15] a sibling journal to Rasenna. They are creating it, he says, out of a desire to make the back content “accessible to the field of Etruscan scholars.” His next goal is to position UMass-Amherst’s Center for Etruscan Studies as the place to go in America for Etruscan Studies – “a sort of clearinghouse for the field,” he says. Editors and librarians are discovering that library publishing offers the potential for an integration of content types – the ability to create what has been alternately called a “context” or a “scholarly environment” for a journal. Wallace calls it an umbrella. He speaks of Rasenna’s creation in these terms: “The e-journal dovetailed with things Marilyn [Billings] and I had been talking about for years. We saw it as a way to bring all the diverse programs we’re working on together under one umbrella.” In the same way that Wallace and Tuck are fashioning the Center for Etruscan Studies and all its associated parts into a “clearinghouse”, the Russell Archives at McMaster University is in the process of creating its own digital presence, under the direction of Kenneth Blackwell. Blackwell is Bertrand Russell’s archivist and has been the editor of Russell: the Journal of Bertrand Russell Studies since it began in 1971.[16] Dr. Blackwell was persuaded by the library at McMaster University to digitize all of the back issues and bring his journal online. While the most current years (2004-2007) are available by subscription only, he made past issues (1971-2003) openly available to all. It wasn’t long before dozens of Russell-related texts were added, turning the site from a journal into a virtual Bertrand Russell center. The Russell Center is beginning to extend the journal content itself with rare leaflets, notes on his readers, copies of his personal letters, and interviews. What does it mean, though, to create a “context” or a “scholarly environment”? In an effort to elucidate this concept, we have identified key practices that are features of library publishing and components of developing a scholarly environment for a journal.

• Publishing Back Issues: Creating historical continuity is important in establishing an e-journal that has transitioned from paper. Many Digital Commons journals are using the system not only to publish going forward, but to publish back content as well.
Editors, like those of Nsodia/Contributions in Black Studies, World Cultures, PLACES,1 and Etruscan Studies, invest time in digitizing and publishing the back content, making what was originally only paper and available only to a few, now accessible to all scholars in the field.
• Grouping Diverse Content Types: Faculty members are taking advantage of e-publishing to archive and link to many different content types. For example, Alec Peck, editor of Boston College’s Teaching Exceptional Children Plus, regularly includes links to supplemental podcasts and video. Bond University digitally publishes the journal Spreadsheets in Education2 precisely because the topic of its study requires additional materials conducive only to the electronic format.
• Creating Families of Journals: UMass-Amherst is currently creating two sets of sibling journals: Contributions in Black Studies and Nsodia, as well as Rasenna and Etruscan Studies. As the library develops the system, it can facilitate browsing and searching – across families or across all library-published journals – and provide a single, integrated look and feel.
• Publishing Cross-Departmentally and Campus-Wide: Librarians are also able to maintain continuity of publication, whether it is cross-departmental, cross-disciplinary, or campus-wide. Sue Wilson, for example, is pushing Illinois-Wesleyan’s campus-wide magazine to go digital. She sees the repository as allowing them to move “out of the disciplinary and into the university wide content.” Terri Fishel, Library Director at Macalester College, did not lose the Macalester Islam Journal when the editor, a professor in the Religious Studies department, left the school. Rather, she has found it a new home and it will begin publishing again under the editorship of a professor in the newly-established Middle Eastern Studies program. The flexibility of library publishing ensures continuity – it accommodates the changing nature of both disciplines and departments.

6. Making the Journal a True Showcase
To recap: the library has excited and engaged its faculty with the publishing program by offering the publishing services and support that faculty need. Editors are invested in their new journal ventures, and the library is helping them to expand the publishing model and achieve success. So where does the librarian go from here? We find that successful librarians get back out and continue to network, this time with successes in hand. It’s that simple: show off success. The librarian gets early adopters, he or she shepherds the first journals to success, and then, as Ann Koopman explained, “Once you’ve got a few and you show them around, they just come like dominos.” Why is this? As we observed before, most scholars are persuaded by the success of their peers. With success in hand, the librarian is now able to demonstrate that the library is a committed, knowledgeable provider of on-campus journal publishing solutions. Prospective editors will recognize that the library can provide the services they need to begin new online journals or transition existing print journals to digital. The journal is a reflection – a showcase, even – of its editorial board. Our editors want their journals to be as visually-compelling as traditional paper journals – and they want them to look good both on screen and in print. We’ve learned from experience that, to the editor, the librarian, and the readers, design matters. We have learned that successful journal sites have a “look” as compelling as commercial journals, and a “feel” that is clean, easy to read and navigate, and demonstrates a coherent logic. As a small publisher we have worked hard on the presentation and design of our family of journals. Our journals have been designed by award-winning professional web designers, and we would like to share some best practices. We present content with key aspects of visual harmony and readability in mind. Our Digital Commons journal pages were built upon concepts of the Golden Ratio and natural mapping, and use grid-based designs to both focus attention on the content and make that content as easily accessible as possible. We find that the little things matter: we always showcase new content from the journal’s homepage, with title, editor and author names given primary focus. Believing that access is primary, we even position the fulltext PDF icon to be the first thing the eyes meet when reading left to right. We ensure on screen readability by designing with attention to optimal line length, spacing, white space, and harmony of color. Because users are drawn to order, alignment and consistency, we have designed the journals to integrate smoothly by providing continuity of design and navigation. Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008
Figure 2: Illinois-Wesleyan University’s Journal, Res Publica. This journal was designed using the Golden Ratio. We also think it is necessary to consider what a digital object will look like in its printed form: when DC journal home pages and article pages are printed, they are rendered intelligible, without hyperlinks and other digital goodies irrelevant to the print format. We have also worked hard to maintain the important vestiges of print journals – down to serif fonts and continuous pagination. And as we mentioned before, since many readers find content through Google without traveling through an institutional portal, we make sure to stamp every article with a cover page branding it as the institution and author’s own. A picture is worth a thousand words – beautiful-looking, simple to navigate journal designs inspire other faculty members to take the leap. The excitement of good looks and good feel lends itself to the establishment of the library publisher, and it is the final key in getting publishing to “go viral”.
Figure 3: Western Kentucky University’s International Journal of Exercise Science. Article page – online view. Compare with print view, Figure 4.
Figure 4: Same article page as Figure 3, in print view.

Our librarians’ excitement is contagious – in a good way. As we mentioned before, once they start talking, everybody starts talking, and soon, scholars start asking to publish with the library. Boston College’s Mark Caprio recognizes that library publishing catches on only when scholars see their respected peers engaging in it and finding success. And recently, UMass-Amherst grad student Ventura Perez approached Marilyn Billings about creating a social justice conference, Landscapes of Violence, hoping to have the library publish the conference proceedings. Soon, he had decided to also start a journal of the same name, the first issue of which will publish the best conference presentations as scholarly articles.

7. Conclusions
So what do Digital Commons librarians do once they’ve relinquished the role of tech support, and eased up on the cheerleading? They are generally taking on the roles of high-level administration and continue with key content identification. Connie Foster calls her role that of “overarching coordinator”. She says, “Now, thankfully, when a journal or series is created, we [at the library] don’t have to get directly involved in the management of it. Once we know a dedicated faculty member is in charge, the library’s role is to make sure communication goes well. We set up the training for our editors, we coordinate and we troubleshoot.” She goes on to identify the library as “the central contact point, but not the day to day manager.” As Foster wrote in a follow-up email, “Seize every opportunity!” Because there is always more original content to discover, by and large, DC librarians now get to go out and see who is “doing stuff”. Foster, for example, is out finding more original content on campus and she, like many others, now shares stories of serendipitous discoveries. For instance, Foster recently attended an emeritus luncheon where the provost handed out photocopies of early WKU essays compiled by the president in 1926. Once she saw them, she decided to publish them online as the library’s first project under Presidential Papers. Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008
Successful journal publishing appears to rely on hitting the pavement and promoting. The librarian must be ready to invest time and commit to a long-term view. With support and encouragement, faculty will begin journals. The librarian uses these as a showcase for others, and lets design and success speak for themselves. While the first editors get involved in publishing because they believe in open access or are looking to make a mark, for future editors the most powerful motivator is seeing the success of their peers. Publishing becomes viral, and the successful librarian encourages this. As Marilyn Billings says, “I no longer have to talk about it – they all do!”

8. Notes and References
[1] Available at: http://www.arl.org/bm~doc/research-library-publishing-services.pdf
[2] These reports include: the ARL report; the Ithaka Report by Laura Brown, Rebecca Griffiths, and Matthew Rascoff, “University Publishing in a Digital Age,” available at: http://www.ithaka.org/strategic-services/Ithaka%20University%20Publishing%20Report.pdf; and Catherine Candee and Lynne Withey’s “Publishing Needs and Opportunities at the University of California,” available at: http://www.slp.ucop.edu/consultation/slasiac/102207/SLASIAC_Pub_Task_Force_Report_final.doc
[3] http://scholarworks.umass.edu/
[4] http://digitalcommons.wku.edu/ijes/
[5] http://escholarship.bc.edu/education/tecplus/
[6] A print journal transitioning to digital. The electronic version is currently in demo mode.
[7] http://repositories.cdlib.org/imbs/socdyn/sdeas/
[8] White, Douglas R. and Ben Manlove. “Structure and Dynamics Vol. 1 No. 2: Editorial Commentary.” Structure and Dynamics, Vol. 1, Iss. 2, 1996. Available at: http://repositories.cdlib.org/cgi/viewcontent.cgi?article=1050&context=imbs/socdyn/sdeas
[9] http://jeffline.jefferson.edu/
[10] http://jdc.jefferson.edu/
[11] http://jdc.jefferson.edu/hpn/
[12] DeTurck, Dennis and Richard Griscom. “Publishing Undergraduate Research Electronically.” Scholarship at Penn Libraries, Oct. 2007. Available at: http://works.bepress.com/richard_griscom/6
[13] http://repository.upenn.edu/
[14] http://scholarworks.umass.edu/rasenna/
[15] Published in paper. The journal site for the electronic version is currently in demo.
[16] http://digitalcommons.mcmaster.ca/russelljournal/
[17] http://repositories.cdlib.org/ced/places/
[18] http://epublications.bond.edu.au/ejsie/
Publishing Scientific Research: Is There Ground for New Ventures?
Panayiota Polydoratou and Martin Moyle
University College London, Library Services
DMS Watson Building, Malet Place, WC1E 6BT
Telephone: 020 7679 7795 Fax: 020 7679 7373
Email: lib-rioja@ucl.ac.uk
Abstract This paper highlights some of the issues that have been reported in surveys carried out by the RIOJA (Repository Interface for Overlaid Journal Archives) project (http://www.ucl.ac.uk/ls/rioja). Six hundred and eighty three scientists (17% of 4012 contacted), and representatives from publishing houses and members of editorial boards from peer-reviewed journals in astrophysics and cosmology provided their views regarding the overlay journal model. In general the scientists were disposed favourably towards the overlay journal model. However, they raised several implementation issues that they would consider important, primarily relating to the quality of the editorial board and of the published papers, the speed and quality of the peer review process, and the long-term archiving of the accepted research material. The traditional copy-editing function remains important to researchers in these disciplines, as is the visibility of research in indexing services. The printed volume is of little interest. Keywords: subject repositories; publishing models; overlay journal model; astrophysics & cosmology 1.
Introduction to the project
The RIOJA (Repository Interface for Overlaid Journal Archives) project (http://www.ucl.ac.uk/ls/rioja) is an international partnership of academic staff, librarians and technologists from UCL (University College London), the University of Cambridge, the University of Glasgow, Imperial College London and Cornell University. It aims to address the issues around the development and implementation of a new publishing model, the overlay journal - defined, for the purposes of the project, as a quality-assured journal whose content is deposited to and resides in one or more open access repositories. The project is funded by the JISC (Joint Information Systems Committee, UK) and runs from April 2007 to June 2008. The impetus for the RIOJA project came directly from academic users of the arXiv (http://arxiv.org) subject repository. For this reason, arXiv and its community is the testbed for RIOJA. arXiv was founded in 1991 to facilitate the exchange of pre-prints between physicists. It now holds over 460,000 scientific papers, and in recent years its coverage has extended to mathematics, nonlinear sciences, quantitative biology and computer science in addition to physics. arXiv is firmly embedded in the research workflows of these communities. This paper highlights some of the issues that have been reported in the community surveys, which, as part of the RIOJA project, surveyed the views of scientists, publishers and members of editorial boards of peer-reviewed journals in the fields of astrophysics and cosmology regarding the overlay journal model. To gather background to their views on publishing, the respondents were asked to provide information about their research, publication and reading patterns. The use of arXiv by this community and the reaction of its members to the overlay publishing model were also addressed in the survey. Respondents were asked to provide feedback about the suggested model; to indicate the factors that would influence them in deciding whether to publish in a journal overlaid onto a public repository; and to give their views on the Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008
relative importance of different features and functions of a journal in terms of funding priorities. The publishers and members of editorial boards of peer-reviewed journals provided an insight into existing publishing practices.

2. Statement of the problem
The overlay concept, and the term “overlay journal” itself, appear to be attributed to Ginsparg [1]. Smith [2] further defined the model by discussing and comparing functions of the existing publishing model with what he referred to as the “deconstructed journal”. Although aspects of overlay have been introduced to journals in some subject domains, such as mathematics and computing [3-6], overlay journals have not yet been widely deployed. Halliday and Oppenheim [7], in a report regarding the economics of Digital Libraries, recommended further research, in the field of electronic publishing in particular. Specifically, they suggested that the costs of electronic journal services should be further investigated, and commented that the degree of functionality that users require from electronic journals may have an impact on their costs. In a JISC funded report, consultants from Rightscom Ltd [8] suggested that commercial arrangements for the provision of access to the published literature are made based on the nature of the resource and the anticipated usage of the resource. Cockerill [9] indicated that what is regarded as a sustainable publishing model in the traditional sense (pay for access) is actually supported by the willingness of libraries to pay […”even reluctantly”, p.94] large amounts of money to ensure access to the published literature. He suggested that as open access does not introduce any new costs there should not be any problem, in theory, to sustain open access to the literature. Waltham [10] raised further questions about the role of learned societies as publishers as well as the overall acceptance of the ‘author pays’ model by the scientific community. Self-archiving and open access journals have been recommended by the Budapest Open Access Initiative (http://www.soros.org/openaccess/read.shtml) as the means to achieve access to publicly-funded research. The overlay model has the potential to combine both these “Green” (self-archiving) and “Gold” (open access journal) roads to open access. Hagemmann [11] noted that “…overlay journals complement the original BOAI dual strategy for achieving Open Access…” and suggested that the overlay model could be the next step to open access. In support of open access to information the BOAI published guides and handbooks on best practice to launching a new open access journal, converting an existing journal to open access, and business models to take into consideration [12-14]. Factors such as the expansion of digital repositories, the introduction of open source journal management software, an increasing awareness within the scholarly community at large of the issues around open access, and an increasing readiness within the publishing community to experiment with new models, suggest that the circumstances may now be right for an overlay model to succeed. The RIOJA survey was designed to test the reaction of one research community, selected for its close integration with a central subject repository, to this prospective new model. 3.
Methodology
The RIOJA project is currently being carried out in six overlapping work packages addressing both managerial and research aspects of the project. This paper will discuss the results from community surveys which were undertaken to explore the views of scientists in the fields of astrophysics and cosmology concerning the feasibility of an overlay journal model. In addition to a questionnaire survey, a number of publishers and members of editorial boards were approached to discuss and elaborate on some of the initial questionnaire findings. These complementary studies were intended to enable a more rounded understanding of the publishing process, and to help the project to explore whether an overlay journal Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008
model in astrophysics and cosmology could be viable in the long term . The Times Higher Education Supplement World Rankings [15-16] was used to identify scientists in the top 100 academic and 15 non-academic institutions in the fields of astrophysics and cosmology worldwide, so as to capture feedback from the research community at an international level. Additionally, the invitation to participate in the survey was posted to a domain-specific discussion list, “CosmoCoffee” (http:// www.cosmocofee.info). The survey was launched on June 8th 2007, and closed on July 15th. The questionnaire comprised 5 sections that aimed to: a) gather demographic and other background information about the respondents, b) find out about the research norms and practices of the scientists, from their perspectives as both creators and readers of research, c) identify issues around the researchers’ use of arXiv; and d) the final section sought their views regarding the viability of the overlay journal model. The target group was restricted to scientists who have completed their doctoral studies, and who therefore could be assumed to have produced research publications or to be in the process of publishing their research outcomes. 4012 scientists were contacted, and 683 (17%) participated. The supplementary interviews involved representatives from PhysMath Central, Public Library of Science (PloS), and Oxford University Press (OUP), and members of the editorial boards of the journals Monthly Notices of the Royal Astronomical Society (MNRAS) and Journal of Cosmology and Astroparticle Physics (JCAP). The interviews lasted between 1.5 and 2 hours, were comprised of semi-structured questions, and on several occasions benefited from the participation of the project’s academic staff. 4.
4. Results
The community surveys received responses from 683 scientists (17% of the 4012 contacted), and from representatives of publishing houses and members of editorial boards of peer-reviewed journals in astrophysics and cosmology, as described in the previous section. The respondents to the questionnaire survey represented a range of research interests, roles and research experience, and an almost equal proportion of returns (51/49) came from scientists who were English native speakers and those who were not. Results indicated that more than half of the respondents (53%) were favourably disposed to the idea of an overlay journal as a potential future model for scientific publishing. Over three quarters (80%) of the respondents were, in principle, willing to act as referees for an arXiv-overlay journal. Those scientists who expressed an interest in an overlay journal (35%) but did not consider it important elaborated on some concerns and provided suggestions that are described in the following subsections.
4.1 Some issues around publishing research outcomes
The vast majority of the respondents to the survey (663 people) noted that papers for submission to peer-reviewed journals were their main research output. An average of 13 papers per scientist over a two-year period indicates a healthy research field with substantial ongoing research activity. These findings confirm the important role that peer-reviewed journals, and peers in general, play in the validation and dissemination of research in this discipline. The journals in which the respondents had mostly published their research were among those with the highest impact factor as reported in the Thomson ISI Journal Citation Reports, 2005 [17]. Irrespective of ongoing discussions in the literature about the validity of citation analysis, these findings suggest that impact factor does have a bearing on scientists' decisions on where to publish. However, the majority of the researchers (494 people) reported that the most important factor in their decision where to publish was the quality of the journal as perceived by the scientific community. Other comments from the scientists
pointed to the relationship between the quality, readership and impact of a journal and the reputation of its editorial board, and to the importance of clear policies around the process of peer review. Although factors such as whether the journal is published by a professional society (473) or published in print (463) were considered unimportant, emphasis was placed on the importance of long-term archiving and sustainable access to the published literature. The subject coverage of the journal, the efficiency and ease of use of the submission system, its handling of images and various file formats (e.g. LaTeX), and the time that it takes for a paper to reach publication were also noted as influential factors (Table 1).

Statement | % agree | 95% confidence limit
Perceived quality of the journal by the scientific community | 97.3 | ± 1.2
High journal impact factor | 88.9 | ± 2.4
Being kept up-to-date during the refereeing process | 81.6 | ± 3.0
Other factors (please specify) | 75.3 | ± 9.4
Inclusion in indexing/abstracting services (e.g. ISI Science Citation Index) | 67.9 | ± 3.6
Reputation of the editor/editorial board | 66.2 | ± 3.6
Journals that do not charge authors for publication | 64.5 | ± 3.6
Open Access Journals (journals whose content is openly and freely available) | 52.8 | ± 3.8
Low or no subscription costs | 33.9 | ± 3.6
Journals which publish a print version | 29.8 | ± 3.5
Journals published by my professional society | 26.9 | ± 3.4
Journals which have a high rate of rejection of papers | 21.1 | ± 3.1
Key (rating scale): Very unimportant / Fairly unimportant / Neither / Fairly important / Very important

Table 1: Factors affecting the scientists' decision where to publish
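The method used to compute the 95% confidence limits in Table 1 (and in Table 3 below) is not stated in the paper; the figures are, however, consistent with a standard normal-approximation interval for a proportion, with the margin widening for items that attracted fewer responses. The following minimal sketch is an editorial illustration of that assumption, not the authors' own calculation; the function name and the response bases used are assumptions.

```python
from math import sqrt

def margin_95(p_agree_pct, n):
    """Half-width of a 95% normal-approximation confidence interval
    for a proportion, expressed in percentage points."""
    p = p_agree_pct / 100.0
    return 1.96 * sqrt(p * (1.0 - p) / n) * 100.0

# With a base of n = 683 respondents, the reported margins are reproduced
# to within rounding for the widely answered items:
print(round(margin_95(97.3, 683), 1))  # 1.2 -> matches the +/- 1.2 in Table 1
print(round(margin_95(88.9, 683), 1))  # 2.4 -> matches the +/- 2.4 in Table 1
# Items answered by fewer people have wider margins, e.g. "Other factors":
print(round(margin_95(75.3, 81), 1))   # ~9.4, assuming roughly 81 responses
```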
4.2 Use of arXiv and indexing services
The scientists confirmed the important role that arXiv plays in communicating and disseminating research in the fields of astrophysics and cosmology. About 77% of the respondents access the arXiv on a daily or weekly basis. About 80% visit the arXiv "new/recent" section to keep up to date with new research (Figure 1). In addition, when the scientists were asked "on finding an interesting title/abstract, where do you look for the full article", e-print repositories (such as arXiv) were cited as the first port of call by 610 people (89%). In the context of an overlay journal, repository policy clearly needs to be aligned sympathetically with the journal's objectives. For example, observations were made about the quality of papers submitted to arXiv, and the fact that papers which have been subjected to peer review and those which have not co-exist on the repository without being clearly distinguished. Limitations on the size and format of files that may be uploaded to arXiv were also highlighted. Some examples of the comments the scientists made:
• "arXiv has its own flaws, mostly related to the freewheeling unrefereed nature of the papers posted there…"
• "To be fair, arxiv is quick and fast in spreading information, but the quality of papers in terms of language and typesetting varies greatly - and this is the (expensive) benefit of having journals copy-editing the papers, which I do appreciate. Furthermore, other changes that they would welcome would be in the policies about file formats and image sizes."
• "Large versions of color figures should be available"
• "I think the idea of "enhancing" the arXiv with a proper peer-review lens is a good idea, provided that what I see are the key advantages of current journal articles are retained: 1. The refereeing process; 2. Proper copy editing; 3. High-quality figures (the current arXiv limits on file sizes for figures leads to figures which are often illegible)."

[Figure 1: Keeping up to date with research advances. Responses (number of scientists): arXiv new/recent 549, ADS website 396, indexing/abstracting services 194, journal web pages 164, alerts from ADS 148, discussion lists/forums 114, print copies of journals 101, alerts from arXiv 90, other 49, no response 38.]

To search for back literature, 68% of the scientists prefer the ADS service. "Other" responses showed that information gleaned from colleagues, journal alerting services, attendance at conferences and workshops, and visiting the SPIRES Web site are all important.
4.3 Costs
The interviews with publishers and editors did not reveal any substantial information about costings that had not already been reported in the literature [10] or made available on some publishers' websites, e.g. PhysMath Central (http://www.biomedcentral.com/info/about/apcfaq). Interviewees suggested that the article processing price varies by journal, discipline and usage. Drawing up exact costings for the setup, production and running of an overlay journal was outside the scope of the project.
The interviews with publishers indicated that the interest of academic and research staff in new publishing models is the prime driver for their adaptation to technological challenges. For example, one of the publishers interviewed stated that one of their most successful journals, both in terms of revenue to the publisher and in terms of perceived quality and acceptance by the scientific community, was converted to open access (the 'author pays' model) purely because of community demand. Meanwhile, a question included in the questionnaire survey concerning how expenditure should be apportioned towards particular functions of a journal was subject to criticism: respondents queried whether a scientist has adequate knowledge of the publishing process and its associated costs to make any useful observations. It was also observed that the publishing process entails more than the distribution phase, which some respondents felt the survey appeared only to address. However, the costs associated with the work of scientific editors, with the integrity and long-term archiving of journal content, and with the transparency of peer review were highlighted as worthwhile (Table 2, scale 1 (little) – 5 (most of the amount)). An indicative comment is listed below:

"… Very-little of a high-cost journal may be more than a considerable amou[n]t of a low-cost one. Perhaps it would be better posed in terms of one's priorities in paying for the journal. I think that in this day paying those such as the editors and referees, and ensuring the integrity of the archive, ought to be a higher priority than producing a paper version of the journal. Especially for an overlay journal such as you propose".
Suggested expenditure/priority | None | 1 | 2 | 3 | 4 | 5 | Not sure
Paying scientific editors | 23 | 23 | 60 | 240 | 141 | 15 | 21
Paying copy editors | 8 | 28 | 73 | 256 | 134 | 6 | 15
Maintenance of journal software | 4 | 20 | 73 | 238 | 147 | 9 | 30
Journal website | 5 | 28 | 79 | 225 | 149 | 20 | 15
Online archive of journal's own back issues | 9 | 27 | 52 | 202 | 189 | 18 | 19
Production of paper version | 138 | 101 | 125 | 107 | 29 | 4 | 14
Extra features such as storage of associated data | 30 | 63 | 105 | 182 | 100 | 6 | 26
Publisher profits | 142 | 122 | 138 | 91 | 9 | 0 | 19
Paying referees | 249 | 70 | 70 | 85 | 22 | 8 | 18
Other | 3 | 1 | 1 | 1 | 3 | 2 | 3

Table 2: Suggested expenditure/priorities
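The counts in Table 2 can be summarised in several ways. As one illustration only (an editorial sketch, not an analysis performed in the paper), tallying the ratings of 4 or 5 for each item gives a ranking that broadly matches the priorities discussed in Section 5; the variable names below are arbitrary.

```python
# Responses per item on the 1 (little) - 5 (most of the amount) scale, from
# Table 2, in the column order [None, 1, 2, 3, 4, 5, Not sure].
ratings = {
    "Paying scientific editors": [23, 23, 60, 240, 141, 15, 21],
    "Paying copy editors": [8, 28, 73, 256, 134, 6, 15],
    "Maintenance of journal software": [4, 20, 73, 238, 147, 9, 30],
    "Journal website": [5, 28, 79, 225, 149, 20, 15],
    "Online archive of journal's own back issues": [9, 27, 52, 202, 189, 18, 19],
    "Production of paper version": [138, 101, 125, 107, 29, 4, 14],
    "Extra features (e.g. storage of associated data)": [30, 63, 105, 182, 100, 6, 26],
    "Publisher profits": [142, 122, 138, 91, 9, 0, 19],
    "Paying referees": [249, 70, 70, 85, 22, 8, 18],
    "Other": [3, 1, 1, 1, 3, 2, 3],
}

# One simple (assumed) reading: count respondents who rated an item 4 or 5.
high_priority = {item: counts[4] + counts[5] for item, counts in ratings.items()}
for item, n in sorted(high_priority.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{n:4d}  {item}")
# The top of this ranking (online archive, journal website, software
# maintenance and scientific editors) is consistent with the Discussion.
```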
Copy editing, the level of author involvement in it, and who should be responsible for any costs associated with it, were also issues that were commented upon. Some respondents favoured the idea of charging extra for papers that require extensive copy editing. Almost half of the respondents favoured the suggestion that the cost of copy editing should be borne by the author, and that it should vary according to the amount of copy editing required. Furthermore, almost half of the respondents (47%) appeared to be in agreement that those changes should be carried out by the author (Table 3). The appearance and layout of the published papers were considered important.

• "The idea of charging authors for papers that require excessive copyediting is a great one!"
• "Copy editing is a difficult issue: it should be the [responsibility] of authors to improve their writing, on the other hand the journal should take [responsibility] for what it published. Perhaps an author could have say three chances and after that should pay for copy editing?"
• "…my position is that a basic copy editing should be provided by the journal, but that extremely messy papers should be penalized, perhaps by introducing extra costs"
• "I do believe money [is] being wasted on the copy-editing of already copy-edited articles, on paper copies of journals, on library subscriptions, etc. The publications process needs to be streamlined and a new type of open-access peer-reviewed journal might just be the right thing."

Statement | % agree | 95% confidence limit
The cost of copy editing should be borne by the author and vary from paper to paper, depending on the amount of copy editing required | 48.2 | ± 3.8
Copy editing should be carried out by the author | 47.3 | ± 3.8
A referee should be prepared to assess whether or not copy editing is required | 18.1 | ± 2.9
The cost of copy editing should be borne by the journal | 11.1 | ± 2.4
When a journal makes copy edits, the corrected LaTeX should be returned to the author (after his/her approval) | 4.7 | ± 1.6
Key (rating scale): Strongly disagree / Slightly disagree / Neither / Slightly agree / Strongly agree
Table 3: Copy editing

When asked where the funding to meet those costs should come from, the respondents preferred to select research funders (485 people, 71% of the base of 683), library subscriptions (432 people, 63%) and sponsorship, for example by a Learned Society (350 people, 51%). A model requiring an author to pay from research funds was endorsed by fewer respondents, whether on acceptance (218 people) or on submission (47 people) of a paper (Figure 2). Other sources mentioned in comments included personal donations, professional association contributions, commercial and/or not-for-profit organisations, advertisements, subscriptions, and even models in which authors pay partially on submission and partially on acceptance.

[Figure 2: Sources for covering journals' costs (bar chart; the main values are given in the text above).]
4.4 Peer review
The process of peer review, as noted above, was raised by the scientists as a very important factor when selecting the journals in which they publish their research, and in informing their opinion about a journal. Aspects of peer review that the respondents considered important were the transparency of the process, the proven track record of the referees and of the scientific editor, the editor's role in the peer review process, high reviewing standards, and the relevance of the chosen reviewers. These factors were cited as acceptance criteria for an overlay journal. In general, the comments were grouped around the speed, quality and reliability of the process. Some comments on the speed of peer review concerned the role of the editorial team and a journal's support services. It was indicated that an easily accessible editorial team that keeps scientists informed at each stage of the review process, while responding promptly and reliably to questions, is desirable. Also welcome, perhaps as an alternative, would be access to an online system that allows authors to keep track of the peer review process, supplemented by a clear statement of how review is conducted and the assessment criteria in place. In comments about the quality of peer review, the scientists raised issues around the transparency of the process, the selection of the referees and the importance of a proven record of past refereeing: what one respondent called "respected peer review". Furthermore, comments also referred to the competence, care, efficiency and responsibility of editors and editorial boards. Comments from the respondents also addressed other peer review models, such as open and community peer review [18-19]. One school of thought called for a more open, publicly available peer review system, incorporating the use of new technologies such as wikis, voting systems and discussion forums. A second preferred to maintain the anonymity of peer review, but was keen to see more exploration and possible adaptation of the more rigorous models of peer review which are applied in other disciplines. The publishers who were interviewed also pointed out that the administration of peer review, along with copy editing, is time-consuming and costly.
4.5 Concerns – overlay journal model
The scientists who participated in the survey expressed some concerns about new and untested models of publishing, the overlay model included. However, they were favourably disposed towards trying new models and means for publishing scientific research, provided that they could ensure that the published research outcomes would continue to assist them in establishing an academic record, attracting funding and securing tenure. Specifically, the following issues received particular mention:
• Impact, readership, and the financial sustainability of the journal
• The peer review process, with particular emphasis on ensuring quality
• Long-term archiving and the sustainability of the underlying repositories
• Clarity and proof of viability of the proposed model

4.6 The overlay journal model - success factors
The most important factors which would encourage publication in a repository-overlaid journal were the quality of other submitted papers (526 responses), the transparency of the peer review process (410) and the reputation of the editorial board (386). Respondents also provided a range of other factors that they considered important, among them the reputation of the journal; its competitiveness measured against other journals under the RAE (the UK's Research Assessment Exercise); the quality both of the journal's referees and of its accepted papers; a commitment to using free software; a commitment to the long-term archiving and preservation of published papers; relevant readership; and its impact factor (which, it was noted, should only take into account citations to papers after final acceptance and not while residing on arXiv prior to "publication").

5. Discussion
The questionnaire survey received responses from 683 scientists in the fields of astrophysics and cosmology (a 17% return). The respondents represented a range of research interests, roles and research experience, and an almost equal proportion of returns (51/49) came from scientists who were English native speakers and those who were not. The respondents indicated that they each produce, on average, 13 papers over each 2-year period. They confirmed the important role of scientific journals in communicating research: 97% indicated that papers for submission to peer-reviewed journals are the main written output of their research. When it comes to choosing a journal in which to publish, the scientists highlighted a journal's impact factor, readership levels and acceptance by the scientific community as having the most weight in the decision. This is exemplified by the list of journals in which the respondents had mostly published their research, which included the 10 with the highest impact factor in these fields (ISI Journal Citation Reports, 2005). Other factors which affect the scientists' decision on where to publish include the subject coverage of the journal, the efficiency and ease of use of the submission system, the time that it takes for a paper to reach publication, open access, indexing in services such as the ADS, and the publishing requirements of particular projects. The most important functions of a journal were identified as the online archive of the journal's back issues, the journal's website and maintenance of the journal software. Journal production costs should, it was felt, be covered by research funders or by library subscriptions. In the context of an overlay journal, repository policy clearly needs to support the journal's objectives - some of arXiv's current policies and practices (for example, policies about file sizes, submission, acceptance and citation of unrefereed papers, multiple versions of papers, etc.) were highlighted by this community as issues which would need to be addressed if arXiv overlay were trialled. Open access was also an issue brought up by several scientists, who emphasised the importance of having free access to the scientific literature. In particular, free access for less privileged scientists was highlighted as desirable. The inclusion of journal content in indexing and alerting services was deemed important. The ADS services are regarded favourably as an access point to the literature by the majority of the respondents. The respondents showed particular concern with the speed, quality and reliability of the peer review process, which was repeatedly mentioned in their comments. It is not always clear to authors how peer review is being conducted by a given journal. Their comments suggest that, perhaps, there is room for improvement in the system, although there was no consensus on the best way to make those improvements. As documented elsewhere, arXiv use is prevalent in this community:
• 77% of respondents access arXiv on a daily or weekly basis
• 80% visit arXiv's "new/recent" to keep up to date with advances in their fields
The respondents were broadly receptive to the idea of overlay publishing: 53% welcomed it, and 80% would be happy to be involved as referees for an arXiv-overlay journal.
The questionnaire survey, therefore, found some encouragement for the overlay journal model in the fields of astrophysics and cosmology. However, general issues were raised about new and untested models of publishing, the overlay model included. It is clear that, for any new publishing model to succeed, it will have to address many 'traditional' publishing issues, among them impact, peer review quality and efficiency, building a readership and reputation, arrangements for copy-editing, visibility in indexing services, and long-term archiving. These are generic concerns, for which repository overlay is not necessarily the complete answer.
6. Summary and conclusions
This paper has discussed some of the issues around scientific publishing in astrophysics and cosmology and presented some of the findings of two community surveys in those fields. The roles, responsibilities and experience of the respondents primarily involve research. The preferred output from their research is peer-reviewed journal articles, which confirms the importance in this discipline of certification by quality-assured journals. The scientists indicated that the quality of any journal publishing model is very important to them, and they choose to publish in journals that demonstrate to them the endorsement of the scientific community, whether through readership levels, impact factor, or perceived quality of the editorial board and journal content. In general the scientists were favourably disposed towards the overlay journal model. However, they raised several implementation issues that they would consider important, primarily relating to the quality of the editorial board and of the published papers, and to the long-term archiving of the accepted research material. The traditional copy-editing function remains important to researchers in these disciplines, as does visibility in indexing services. The traditional printed volume is of little interest. The initial results from this survey suggest that scientists in the fields of astrophysics and cosmology are, in the main, positively disposed towards a new publishing model that, in a respondent's own words, "…is more open, flexible, quicker (and cheaper?), and as "safe" or safer (i.e. ensuring science quality) as would be needed". A full examination of these results, together with the other findings from the RIOJA project, is expected to enrich our understanding of the many issues around the acceptance and sustainability of the overlay journal as a potential publishing model.
7. Acknowledgements
The authors would like to thank the scientists who participated in their survey for their time and input. We would also like to thank the representatives from PhysMath Central, the Public Library of Science (PLoS), and Oxford University Press (OUP), and the members of the editorial boards of the journals Monthly Notices of the Royal Astronomical Society (MNRAS) and Journal of Cosmology and Astroparticle Physics (JCAP) for their time and interest in the RIOJA project.
8. Notes and References
[1] GINSPARG, P. (1996). Winners and Losers in the Global Research Village. Invited contribution, UNESCO Conference HQ, Paris, 19-23 Feb 1996. [online]. [cited 08 May 2008]. Available from Internet: <http://xxx.lanl.gov/blurb/pg96unesco.html>
[2] SMITH, J. W. T. The deconstructed journal: a new model for academic publishing. Learned Publishing. 1999, Vol. 12, no. 2, pp. 79-91. [cited 08 May 2008]. Also available from Internet: <http://library.kent.ac.uk/library/papers/jwts/DJpaper.pdf>
[3] Logical Methods in Computer Science [online]. Available from Internet: <http://www.lmcs-online.org/index.php>. ISSN 1860-5974.
[4] Journal of Machine Learning Research [online]. Available from Internet: <http://jmlr.csail.mit.edu/>.
[5] Annals of Mathematics [online]. Available from Internet: <http://annals.princeton.edu/index.html>.
[6] Geometry and Topology [online]. Available from Internet: <http://www.msp.warwick.ac.uk/gt/2007/11/>.
[7] HALLIDAY, L. and OPPENHEIM, C. (1999). Economic models of the Digital Library. [online]. [cited 08 May 2008]. Available from Internet: <http://www.ukoln.ac.uk/services/elib/papers/ukoln/emod-diglib/final-report.pdf>
[8] RIGHTSCOM Ltd. Business model for journal content: final report, JISC. [online]. 2005. Available from Internet: <http://www.nesli2.ac.uk/JBM_o_20050401Final_report_redacted_for_publication.pdf>
[9] COCKERILL, M. Business models in open access publishing. In: JACOBS, Neil (ed.) Open Access: Key Strategic, Technical and Economic Aspects. Oxford: Chandos Publishing, pp. 89-95, 2006. [online]. [cited 08 May 2008]. Available from Internet: <http://demo.openrepository.com/demo/handle/2384/2367>
[10] WALTHAM, M. Learned Society Open Access Business Models, JISC. [online]. 2005. [cited 08 May 2008]. Available from Internet: <http://www.jisc.ac.uk/uploaded_documents/Learned%20Society%20Open%20Access%20Business%20Models.doc>
[11] HAGEMANN, M. SPARC Innovator: December 2006. [online]. [cited 08 May 2008]. Available from Internet: <http://www.arl.org/sparc/innovator/hagemann.html>
[12] CROW, R. and GOLDSTEIN, H. (2003). Model Business Plan: A Supplemental Guide for Open Access Journal Developers & Publishers, Open Society Institute. [online]. [cited 08 May 2008]. Available from Internet: <http://www.soros.org/openaccess/oajguides/oaj_supplement_0703.pdf>
[13] CROW, R. and GOLDSTEIN, H. (2003). Guide to Business Planning for Launching a New Open Access Journal, Open Society Institute. 2nd edition. [online]. [cited 08 May 2008]. Available from Internet: <http://www.soros.org/openaccess/oajguides/business_planning.pdf>
[14] CROW, R. and GOLDSTEIN, H. (2003). Guide to Business Planning for Converting a Subscription-based Journal to Open Access, Open Society Institute. [online]. [cited 08 May 2008]. Available from Internet: <http://www.soros.org/openaccess/oajguides/business_converting.pdf>
[15] The Times Higher Education Supplement. World university rankings: the world's top 100 science universities. 2006. [online]. [cited 08 May 2008]. Available from Internet: <http://www.timeshighereducation.co.uk/hybrid.asp?typeCode=162>
[16] The Times Higher Education Supplement. World university rankings: the world's top non-university institutions in science. 2006. [online]. [cited 08 May 2008]. Available from Internet: <http://www.timeshighereducation.co.uk/hybrid.asp?typeCode=164>
[17] At the time of the survey the ISI Journal Citation Reports for 2006 were not available; the list of journals used in the survey was therefore based on the 2005 reports.
[18] RODRIGUEZ, M. A., BOLLEN, J. and VAN DE SOMPEL, H. (2006). The convergence of digital libraries and the peer-review process. Journal of Information Science. 2006, Vol. 32, no. 2, pp. 149-159. [online]. [cited 08 May 2008]. DOI: http://dx.doi.org/10.1177/0165551506062327. An arXiv preprint of this paper is available at: arXiv:cs/0504084v3
[19] CASATI, F., GIUNCHIGLIA, F. and MARCHESE, M. (2007). Publish and perish: why the current publication and review model is killing research and wasting your money. Ubiquity. 2007, Vol. 8, issue 3. [online]. [cited 08 May 2008]. DOI: http://doi.acm.org/10.1145/1226694.1226695
The Role of Academic Libraries in Building Open Communities of Scholars

Kevin Stranack (1), Gwen Bird (2), Rea Devakos (3)
(1) reSearcher/Public Knowledge Project Librarian, email: kstranac@sfu.ca
(2) WAC Bennett Library, Simon Fraser University, 8888 University Dr., Burnaby, BC, Canada, email: gbird@sfu.ca
(3) Information Technology Services, University of Toronto Libraries, 130 St. George St, Toronto, ON, Canada, email: rea.devakos@utoronto.ca
Abstract
This paper describes three important pillars of publishing programs emerging at university libraries: providing a robust publishing platform, engaging the academic community in discussions about scholarly communication, and building a suite of production level services. The experiences of the Public Knowledge Project, the Simon Fraser University Library, and the University of Toronto Library's journal hosting service are examined as case studies. Detailed information is provided about the development of the Public Knowledge Project, its goals and history, and the tools it offers. Campus activities at Simon Fraser University have been coordinated to support the use of PKP tools, and to raise awareness on campus about the changing landscape of scholarly publishing. The University of Toronto's journal hosting service is profiled as another example. The role of university libraries in bringing together scholars, publishing tools and new models of scholarly publishing is considered.

Keywords: Public Knowledge Project; academic libraries; scholarly publishing.
1. Introduction
Libraries around the world are seeking to answer the fundamental question posed by Hahn in "The Changing Environment of University Publishing": "To what extent should the institutions that support the creation of scholarship and research take responsibility for its dissemination as well?"[1] Many libraries are in fact not only providing services, but actively experimenting in scholarly publishing. This paper describes three important pillars of library publishing programs: providing a robust publishing platform, engaging the academic community in discussions around scholarly communication, and building a suite of production level services. The experiences of the Public Knowledge Project, the Simon Fraser University Library, and the University of Toronto Library's journal hosting service will serve as case studies.
2. The Public Knowledge Project
Founded in 1998 by Dr. John Willinsky of Stanford University and the University of British Columbia, the Public Knowledge Project (PKP)[2] is an international research initiative promoting publishing alternatives for scholarly journals, conferences, and monographs. Through its development of innovative, open source publication management tools, the Project contributes to the growing, global community of scholars dedicated to furthering free and open access to information and research. By building in workflow efficiencies, the Project software allows publishers to significantly reduce their operating costs[3] and make their content
either free or available with low subscription fees. A recent indication of the software's impact can be found in Hahn's 2008 report, Research Library Publishing Services: New Options for University Publishing[4], which found that the Project's Open Journal Systems software is now the most frequently used program of its kind, whether commercial or open source, supporting academic library publishing initiatives. Since becoming a PKP partner in 2005, the Simon Fraser University Library has taken on responsibility for managing the development of the software, providing technical support to the global community, and publicizing the Project through the PKP web site, workshops, presentations, and publications. In 2006, the Project was the sole Canadian winner of the Mellon Award for Technological Collaboration[5] and was also recognized as a Leading Edge partner with the Scholarly Publishing and Academic Resources Coalition (SPARC)[6]. Currently, all five of the lead institutions in the Synergies project[7], described in the conference paper by Eberle-Sinatra, Copeland and Devakos, are using one or more elements of the PKP's software to advance online humanities and social sciences publishing in Canada. In addition, the software products continue to develop and mature, and the global community of scholars taking advantage of the Project's work continues to grow.
3. Open Source Software
The Public Knowledge Project's suite of software includes a variety of separate, but inter-related applications, including the Open Journal Systems (OJS), the Open Conference Systems (OCS), the Open Monograph Press (OMP), and Lemon8-XML. All are freely available as open source software. They share similar technical requirements and underpinnings (PHP, MySQL, Apache or Microsoft IIS 6, and a Linux, BSD, Solaris, Mac OS X, or Windows operating system), operate in any standard server environment, and need only a minimal level of technical expertise to get up and running. In addition, the software is well supported with a free, online support forum and a growing body of documentation. The Open Journal Systems (OJS) software[8] provides a complete scholarly journal publication management system, offering a journal web site (see Figure 1), an online submission system, multiple rounds of peer review, an editorial workflow that includes copyediting, layout editing, and proofreading, indexing, online publication, and full-text searching.
Figure 1: The International Journal of Design web site using OJS
OJS goes beyond managing and displaying content, however, and provides an interesting set of Reading Tools, helping the reader to contextualize the content, and allowing for innovative interactions between the reader, the text, and the author (see Figure 2).
Figure 2: Postcolonial Text's Reading Tools

The Reading Tools allow readers to communicate privately with the author or to place comments directly on the web site, providing an interesting model of post-publication, open review. OJS is currently in version 2.2, with version 2.3 expected for release in late 2008; upcoming features include online reader annotation tools and enhanced statistics and reporting. Today, over 1,500 journals worldwide are using the Project's OJS software to manage their scholarly publication process, with 50% coming from the Sciences, 38% from the Humanities and Social Sciences, and 12% being interdisciplinary. As well, a growing number of translations have been contributed by community members, with Chinese, Croatian, English, French, German, Greek, Hindi, Italian, Japanese, Portuguese, Russian, Spanish, Turkish, and Vietnamese versions of OJS completed, and several others in production.

The Open Conference Systems (OCS) software[9] provides a fully-featured conference management and publication system, including not only a conference web site, online submissions, peer review, editorial workflow, online publication, and full-text searching, but also a conference schedule, accommodation and travel information pages, and an online registration and payment system. The Reading Tools, similar to those provided with OJS, are also available. OCS is currently in version 2.1, with version 2.2 expected later in 2008. At least 300 scholarly conferences have used OCS to manage their events, including the 2008 International Conference on Electronic Publishing[10]. OCS has now been translated into English, French, German, Italian, Portuguese, and Spanish.

The Open Monograph Press (OMP)[11] is a new open source project that is still in a very early stage of development. Essentially, OMP will provide a similar management system for the production of scholarly monographs, with a built-in correspondence system for participants, marketing and cataloguing tools, and XML conversion (see Lemon8-XML below). It will allow editors to invite contributors to participate in the creation of a new work and provide authors with an online studio to assist with the research and writing process, including bibliographic management tools, a document annotation system, blogs, wikis, and more. The PKP has received significant international interest in this project, and the OMP will benefit from the wide-ranging community expertise that will be provided throughout the development process.

Lemon8-XML[12] is another innovation which is still in development. It is a document conversion system which will allow users of OJS, OCS, OMP, or any other publication system to automatically transform text files submitted by authors (such as Microsoft Word or Open Office Writer documents) into XML files to assist with online publication and compliance with indexing service requirements (e.g., PubMed Central). This will build in a significant new level of efficiency, saving layout editors the time-consuming task of producing PDF, XHTML, or XML documents manually. Although developed specifically for use with the other Project software tools, it will be a standalone open source product, allowing for uses independent of OJS, OCS, or OMP. Lemon8-XML will be released in mid-2008 and a beta version is available from the Project web site.
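The actual Lemon8-XML pipeline handles real word-processor formats and much richer structure than can be shown here; the sketch below is only an editorial illustration of the general idea of wrapping a manuscript in publication-oriented XML. The element names are loosely modelled on JATS/NLM-style tags, and the function and sample text are hypothetical, not PKP code.

```python
# A toy illustration of manuscript-to-XML conversion: treat the first line of
# a plain-text manuscript as the title, the second as the abstract, and the
# remaining lines as body paragraphs. This is not the Lemon8-XML implementation.
import xml.etree.ElementTree as ET

def manuscript_to_xml(text: str) -> bytes:
    title, abstract, *paragraphs = [ln for ln in text.splitlines() if ln.strip()]
    article = ET.Element("article")
    front = ET.SubElement(article, "front")
    meta = ET.SubElement(front, "article-meta")
    ET.SubElement(meta, "article-title").text = title
    ET.SubElement(meta, "abstract").text = abstract
    body = ET.SubElement(article, "body")
    for para in paragraphs:
        ET.SubElement(body, "p").text = para
    return ET.tostring(article, encoding="utf-8", xml_declaration=True)

sample = """Open Access and Overlay Journals
This hypothetical abstract summarises the manuscript.
First body paragraph of the submission.
Second body paragraph of the submission."""
print(manuscript_to_xml(sample).decode("utf-8"))
```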
4. Community
In addition to the hundreds of users of the Public Knowledge Project software products, the community also extends to the many people who volunteer their time and efforts in a variety of important ways. One critical contribution has been the translations mentioned earlier; it would simply have been impossible for the Project to create translations without the community volunteers, and without this contribution the PKP software tools would not have the international reach that they have today. Other forms of community participation include the recurring need to thoroughly test every new software release. This is a very time-consuming and somewhat repetitive task for the volunteers, but it ensures that crucial bugs have not been overlooked, which could cause very serious problems if they were introduced into production systems. Without community testers the Project would not be able to continue with its regular enhancement process and increase the functionality of the software, nor ensure its continued security and robustness. Community members also contribute important new software features, including the subscription module, which allows OJS journal publishers to continue to charge subscriptions or other fees as they consider the move to open access. Another important example of the health of the PKP community is the fact that the online support forum now has over 1100 members, many of whom not only post their questions, but are increasingly sharing their experiences and assisting other users by answering questions. The PKP community is made up of a wide variety of participants, including scholars (e.g., The International Journal of Communication[13]), university information technology divisions (e.g., The University of Saskatchewan College of Arts and Science[14]), government departments (e.g., Sistema Eletrônico de Editoração de Revistas[15]), publishers (e.g., Co-Action Publishing[16]), and, of course, libraries. As the Project grows, this form of community-based support will become increasingly important.
5. The Simon Fraser University Library
In 2007, the Simon Fraser University Library began a formal program of scholarly communication activities on campus. Scholarly communication was included as a theme and a clear priority in the Library's 3-year plan for 2007-2010. This theme included a cluster of issues arising out of the current system of academic publishing and a desire to reform that system. We identified the usual array of issues concerning libraries: the high and steeply rising cost of commercially published scholarly journals, widely recognized as unsustainable; a desire to support alternative publishing models, including Open Access; an opportunity to use library buying power to support alternative models that are sustainable and provide benefit for the Simon Fraser community; a desire to minimize limitations on the use of faculty-authored publications; and a desire to provide infrastructure and support for authors who wish to self-archive research outputs. As a result, we asked what the Library could do to contribute to the efforts to "create change"[17]. We asked what would best build on existing activities and strengths of the SFU Library. We were willing to take on new roles as needed in the changing landscape of scholarly communication. To provide a bit of context, Simon Fraser University is a mid-sized, publicly funded Canadian university. It offers programs in a full range of subject areas up to the doctoral level, but includes no professional schools, such as law or medicine, and serves just under 20,000 FTE students. The Library recognized that
while we were well positioned to take a leadership role on campus, we would be successful only to the extent that we could engage the interest of faculty members. As with other faculty endeavours, our team of liaison librarians would be key to this success, building on their knowledge of departments and well-established individual relationships. For this reason, after sketching out a modest set of events, we began by working with SFU librarians. We partnered with colleagues from neighbouring institutions, the University of British Columbia and the University of Victoria, to offer joint training for librarians which provided background on many of the issues listed above, and ran participants through a variety of interactive activities. The goal was to orient librarians to the subject in order to make them comfortable integrating discussions of scholarly communication into their liaison work. In short order, the participating librarians felt grounded and ready to incorporate scholarly publishing into their instruction and other interactions with faculty in the way that we had hoped. A few of the events put on for the campus community are described below.

In thinking about how we would build on existing strengths of the Library, it was clear that we had an "ace in the hole" for a scholarly communications program in the form of our participation in the PKP Project. Here was a set of tools we could put directly into the hands of those wanting to reclaim academic publishing, one journal or one conference at a time. In July 2007 the Library worked with the PKP project to host the first International PKP Conference[18], bringing together users of the tools and others interested in its goals from around the world. With over 200 participants and generous sponsorship from the Open Society Institute to cover costs for delegates from developing countries, the conference provided an astonishing picture of the development and operation of alternative publishing projects around the world. The conference featured papers from five continents, exploring both the practical and theoretical aspects of the Project.[19] After the conference we repeated the "OJS in a Day" workshops, which were quickly filled, to continue putting the skills needed to use OJS into the hands of interested researchers and editors. As our librarians work on campus to discuss scholarly publishing they are regularly turning up requests for more information about OJS, or requests for software support. The federally funded Synergies project is providing one-time funding to assist many Canadian journals in the Social Sciences and Humanities to move content online for the first time using OJS, and also provides further support for SFU scholars moving their publications in this direction.

Another place where the Library saw itself functioning as a hub was with respect to journal editors. Staff in the Collections Management office noticed they were often fielding inquiries from faculty members in their roles as editors, and that these inquiries began to form a pattern. When editorial boards were considering offers to license their journal content to third party aggregators, to change publishers, to digitize their backfiles, or to move from a for-fee to an Open Access business model, they were coming to the Library for guidance. We brought together a group of editors for a forum where they were able to find each other across disciplines, and share common experiences.
As many of the editors were active users of OJS, they were able to impart firsthand experience of using the software, and of running Open Access journals, or transitioning society publications to Open Access. In addition, the Library has continued to host campus events highlighting Open Access publishing more generally. These have included speakers from BioMedCentral, Public Library of Science, Open Medicine, and others. Typically they attract a mix of graduate students, faculty members and librarians, and a mix of advocates, skeptics and curious newcomers. Here the Library acts as a facilitator, putting the issues on the table, encouraging lively discussion and debate, highlighting positive stories from successful Open Access journals, and SFU authors willing to share their experiences and motivations for supporting Open Access. As appropriate, the Library can also provide information about the often invisible costs of the traditional system of academic publishing, providing members of the SFU community with a local perspective on our part in this $16 billion a year industry.
Finally, the Library has launched an Institutional Repository that offers trusted infrastructure and support for interested community members who wish to self-archive.[20] As this initiative grows, the Library plays an increasingly active role at earlier stages of the research process, ideally in a few cases as a co-applicant on funding applications where archiving is built into the project from its inception. As we build this program on our campus, we recognize that the current system of scholarly communication is embedded in larger institutional and industry-wide contexts. These include the tenure and promotion system generally, and its specific expression at SFU; trends in the academic, trade and commercial publishing sectors; and the requirements and regulations of granting agencies. We have launched a blog to help the campus community stay abreast of news, and to offer an online space for continued discussion.[21] Future plans include applied research into faculty attitudes and behaviors around scholarly communication to further inform our work in this area.

In holding events like these on the SFU campus, one of the common refrains is that faculty members and researchers are grateful for the opportunity to hear about what's going on in other disciplines. Even those who are keen on the topic are generally not able to keep up with developments in areas outside their own. For example, biologists are pleased to come to events hosted by the library to learn about discussions going on in the American Anthropological Association[22]; anthropologists are interested in learning about SCOAP3[23]; and social scientists are interested to hear what is happening in the life sciences, where OA journals have been making significant inroads. Taken alone, none of these activities have marked a departure for the SFU Library, but as a program, together they are certainly contributing to a changed role for libraries in building open communities of scholars. We have learned that faculty on our campus bring a varied level of understanding of the issues, and that our programs must be multivalent enough to address these multiple levels of need. We've also seen that integrating scholarly communication into our liaison program is a comfortable fit that has reinvigorated several of our long-serving librarians, and provided us with a renewed definition of liaison work in an academic library. Similar programs have been offered by many university libraries, and reports and lessons learned elsewhere have also been useful for us (e.g., The University of California Berkeley's Scholarly Communication News and Events[24], Scholarly Communications at the University of Washington Libraries[25], and the University of Guelph's Scholarly Communication Initiatives[26], to name a few).

Awareness on the SFU campus continues to build about the changing scholarly communication landscape. And the Simon Fraser University Library continues to explore new roles for itself in bringing together and building open communities of scholars. Like most other university libraries, we are operating in an environment where we have largely eliminated print journals in favour of online access; just a few years ago we were contending with feedback from members of our community lamenting the fact that their once-weekly trips to the library's periodical reading room were a chance to get out of their academic silos and mix with colleagues from elsewhere on campus.
We are pleased to see the Library continuing to occupy this role of campus hub, albeit in a new way.
6. The University of Toronto's Journal Hosting Service
Like many academic libraries, the University of Toronto is offering a range of journal publishing services. Indeed, U of T services parallel many of the trends found by Hahn[27]:

1. The Open Journal System is used.
2. Services provided include:
· hosting
· initial set up
· consultation on business models, advertising, launches, etc.
· training
· ongoing troubleshooting and customer support
3. The service was initially advertised through word of mouth.
4. The university has leveraged past investments in digital library services. The Scholarly Communication Initiatives unit also offers repository services, using the DSpace platform, and conference hosting services, using the Open Conference System.
5. Electronic-only publication services are offered, though a few journals also publish in print.
6. Services are funded through multiple sources; though initially funded by the Library's operating budget, this has now luckily been supplemented by a federal government grant, Synergies. Libraries in Australia, Germany and Denmark received similar government funding.
7. As with the National Library of Australia's Open Publish service[28], quality control and copyright clearance rest with the journal.
8. Analogous to the California Digital Library (CDL)[29], interdisciplinary journals are prominent, as are student journals.
9. Like Newfound Press[30], the Library is interested in enhancing access to peer-reviewed scholarship and specialized works with a potentially limited audience.
The journal’s focus is on original research and information of importance to clinician-scientists. Founded in 1978, the journal moved totally online in 2007 due partially to the cost of print production. Most subscribers are also society members. Immediate open access is offered to all Canadian universities; total open access is provided after six months. In the past, CIM had relationships with a variety of aggregators. The University of Toronto Journal of Undergraduate Life Sciences (JULS) showcases the research achievements of undergraduate life science students at the U of T and encourages intellectual exploration across the various life sciences disciplines. Established in 2006 by a small group of students, JULS quickly gained support from various departments and faculty members. The journal publishes research articles and mini-reviews. All articles undergo a two-stage double blind peer-review process conducted by students and faculty. Issues are published annually, in both print and electronic format. Currently, all but one hosted journals are open access, but this is not a strict requirement. Like the Copenhagen Business School Library’s Ejournals@cbs[32], we seek to provide a “low risk environment for small journals.” For journals concerned about losing subscription income we work to identify ways to “open” access while protecting revenues, such as delayed open access, providing free access to some articles, ip ranges or issues. We expect use of this mixed model to increase. Like Ejournals@cbs, our journals fall into two categories: those born digitally versus print. However we have found that whether a journal is born in print or digitally, has not affected comfort with the platform. Our born-digital journals include those born on our service, and those born on their own home grown system or another OJS service provider. Established journals with established workflows are prone to only utilize OJS’ dissemination features. As Felczak, Lorimer and Smith describe, journals often find the task of changing their production methods a “non trivial challenge.”[33] Launching an electronic journal, whether new or established, is a time consuming project. The OJS platform and the new medium prompt the editorial team to consider or reconsider policies such as copyright. The mixture of practical “click here” and policy questions to be addressed is often daunting. Editors ask what others have done, how long it takes to do x etc. The most common question is the cost of electronic journal production. It is not a question we can answer easily. In a review on journal publishing costs, King laments: a wide range of figures for publishing costs and average costs per subscription and per article. Many cost estimates are presented in the literature in support of a specific agenda: to explain high prices, to demonstrate the savings to be expected from electronic publishing, or to show why author-side payment should be adopted. Unfortunately, the way in which many publishing costs are presented in the literature is somewhat misleading, because the costs are not qualified by the various cost parameters or other factors that contribute to their value, large or small.[34] Indeed our initial meetings with existing journals are sometimes difficult. 
In relating the transition of the Canadian Journal of Sociology to electronic open access publication, Haggerty describes their first meeting with the U of Alberta Libraries: Laura and I, however, did not give our colleagues an easy time during our meeting, asking them a procession of difficult questions about the implication of such a move. Looking back, it is evident that Pam and Denise could not have answered most of those questions to my satisfaction as the answers were contingent upon their having detailed knowledge about the specifics of the journal's finances and assorted institutional arrangements. I also suspect that what I really wanted from them was an impossible guarantee that the journal could accrue all the benefits of going open access without also bearing the risks of such a move.[35]
Libraries are well positioned not only to acknowledge the unknown, but also to assist journals as they explore uncharted waters. In so doing, we have forged strong working relationships and gained unique insight into the scholarly communication process.
7. Conclusions
From the case studies presented in this paper, it is clear that libraries are becoming increasingly involved in scholarly publishing, whether by providing powerful software platforms to increase operational efficiency and technological innovation, as at the Public Knowledge Project, by developing new forms of scholar-librarian collaboration, as at the Simon Fraser University Library, or by offering a complete set of production services, as at the University of Toronto Library. And these libraries are by no means alone in these endeavours. Internationally, libraries are becoming increasingly involved in scholarly publishing activities, and this represents an important shift in the services libraries offer and in the perception of their organizations, both externally and internally. As Case and John[36] point out, however, the "next major step is to integrate the digital publishing operations into the library organization.... The role of library as publisher must be embedded in the culture of our organization."
8. Notes
[1] Hahn, K. (2007). The Changing Environment of University Publishing. ARL Bimonthly Report, (252/253). Retrieved June 1, 2008 from http://www.arl.org/bm~doc/arl-br-252-253-intro.pdf
[2] The Public Knowledge Project. (2008). Retrieved June 1, 2008 from http://pkp.sfu.ca
[3] Willinsky, J. (2005). Scholarly Associations and the Economic Viability of Open Access Publishing. Retrieved June 1, 2008 from http://jodi.tamu.edu/Articles/v04/i02/Willinsky/
[4] Hahn, K. (2008). Research Library Publishing Services: New Options for University Publishing. Retrieved June 1, 2008 from http://www.arl.org/bm~doc/research-library-publishing-services.pdf
[5] Recipients of First Annual Mellon Awards for Technology Collaboration Announced. Retrieved June 1, 2008 from http://rit.mellon.org/awards/matcpressrelease.pdf
[6] The SPARC Leading Edge publisher partner program. Retrieved June 1, 2008 from http://www.arl.org/sparc/partner/leadingedge.shtml
[7] Synergies Project. Retrieved June 1, 2008 from http://www.synergiescanada.org/
[8] Open Journal Systems. Retrieved June 1, 2008 from http://pkp.sfu.ca/ojs
[9] Open Conference Systems. Retrieved June 1, 2008 from http://pkp.sfu.ca/ocs
[10] International Conference on Electronic Publishing 2008. Retrieved June 1, 2008 from http://www.elpub.net
[11] Open Monograph Press. Retrieved June 1, 2008 from http://pkp.sfu.ca/omp
[12] Lemon8-XML. Retrieved June 1, 2008 from http://pkp.sfu.ca/lemon8
[13] The International Journal of Communication. Retrieved June 1, 2008 from http://ijoc.org
[14] The University of Saskatchewan College of Arts and Science Conference Server. Retrieved June 1, 2008 from http://ocs.usask.ca/
[15] Sistema Eletrônico de Editoração de Revistas. Retrieved June 1, 2008 from http://seer.ibict.br/
[16] Co-Action Publishing. Retrieved June 1, 2008 from http://www.co-action.net/
[17] Create Change Canada. Retrieved June 1, 2008 from http://www.createchangecanada.ca/about/index.shtml
[18] First International PKP Scholarly Publishing Conference. Retrieved June 1, 2008 from http://pkp.sfu.ca/ocs/pkp2007/index.php/pkp/1
[19] First Monday, October 2007, 12 (10). Retrieved June 1, 2008 from http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/fm/issue/view/250
[20] Simon Fraser University Institutional Repository. Retrieved June 1, 2008 from http://ir.lib.sfu.ca/index.jsp
[21] Simon Fraser University Library Scholarly Communication News. Retrieved June 1, 2008 from http://blogs.lib.sfu.ca/index.php/scholarlycommunication
[22] Cross, J. (2008). Open Access and AAA. Anthropology News, Feb 2008, 49 (2), 6. Retrieved June 1, 2008 from http://www.aaanet.org/pdf/upload/49-2-Jason-Cross-In-Focus.pdf
[23] SCOAP3 - Sponsoring Consortium for Open Access Publishing in Particle Physics. Retrieved June 1, 2008 from http://scoap3.org/
[24] The University of California Berkeley's Scholarly Communication News and Events. Retrieved June 1, 2008 from http://blogs.lib.berkeley.edu/scholcomm.php
[25] Scholarly Communications at the University of Washington Libraries. Retrieved June 1, 2008 from http://www.lib.washington.edu/ScholComm/
[26] The University of Guelph's Scholarly Communication Initiatives. Retrieved June 1, 2008 from http://www.lib.uoguelph.ca/scholarly_communication/initiatives/
[27] Hahn, K. (2007). The Changing Environment of University Publishing. ARL Bimonthly Report, (252/253). Retrieved June 1, 2008 from http://www.arl.org/bm~doc/arl-br-252-253-intro.pdf
[28] Graham, S. (2007). Open access to open publish: National Library of Australia. First Monday, 12 (10). Retrieved June 1, 2008 from http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/1960/1837
[29] Candee, C. H., & Withey, L. (2007). The University of California as publisher. ARL Bimonthly Report, (252/253). Retrieved June 1, 2008 from http://www.arl.org/bm~doc/arl-br-252-253-cal.pdf
[30] Phillips, L. L. (2007). Newfound Press: The digital imprint of the University of Tennessee Libraries. First Monday, 12(10). Retrieved June 1, 2008 from http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/1968/1843
[31] The University of Toronto Libraries' Request for Journal Hosting. Retrieved June 1, 2008 from http://jps.library.utoronto.ca/index.php/index/Boilerplate/submit
[32] Elbaek, M. K., & Nondal, L. (2007). The library as a mediator for e-publishing: A case on how a library can become a significant factor in facilitating digital scholarly communication and open access publishing for less web-savvy journals. First Monday, 12(10). Retrieved June 1, 2008 from http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/1958/1835
[33] Felczak, M., Lorimer, R., & Smith, R. (2007). From production to publishing at CJC online: Experiences, insights, and considerations for adoption. First Monday, 12(10). Retrieved June 1, 2008 from http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/1959/1836
[34] King, D. W. (2007). The cost of journal publishing: A literature review and commentary. Learned Publishing, 20(2), 85-106.
[35] Haggerty, K. D. (2008). Taking the plunge: Open access at the Canadian Journal of Sociology. Information Research, 13(1). Retrieved from http://informationr.net/ir/13-1/paper338.html
[36] Case, M. M., & John, N. R. (2007). Publishing Journals@UIC. ARL Bimonthly Report, (252/253). Retrieved June 1, 2008 from http://www.arl.org/bm~doc/arl-br-252-253-uic.pdf
Social Tagging and Dublin Core: A Preliminary Proposal for an Application Profile for DC Social Tagging
Maria Elisabete Catarino 1; Ana Alice Baptista 2
1 Information Systems Department, University of Minho, Campus Azurém, Guimarães, Portugal; CAPES-MEC-Brazil. e-mail: ecatarino@dsi.uminho.pt
2 Information Systems Department, University of Minho, Campus Azurém, Guimarães, Portugal. e-mail: analice@dsi.uminho.pt
Abstract
The Web 2.0 maximizes the Internet concept of encouraging its users to cooperate effectively in the offer of virtual services and in content organization. Among the various potentialities of Web 2.0, folksonomy appears as a result of the free assignment of tags to Web resources by their users/readers. Although tags describe Web resources, they are generally not integrated into the metadata. In order for them to be intelligible by machines, and therefore usable in the Semantic Web context, they have to be automatically allocated to specific metadata elements. There are many metadata formats. The focus of this investigation is the Dublin Core Metadata Terms (DCTerms), a widely used set of properties for the description of electronic resources. A subset of DCTerms, the Dublin Core Metadata Element Set (DCMES), has been adopted by the majority of Institutional Repositories' platforms as a way to promote interoperability. We propose research that intends to identify metadata elements originated from folksonomies and to propose an application profile for DC Social Tagging. That will allow tags to be conveniently processed by interoperability protocols, particularly the Open Archives Initiative – Protocol for Metadata Harvesting (OAI-PMH). This paper presents the results of the pilot study developed at the beginning of the research, as well as the metadata elements preliminarily defined.
Keywords: Social Tagging; Folksonomy; Metadata; Dublin Core.
1. Introduction
Metadata may be defined as a group of elements for the description of resources [1]. There are many metadata standards in the repository context; among them is the Dublin Core Metadata Element Set (DCMES), or simply Dublin Core (DC), a metadata element set for the description of electronic resources. This standard is widely diffused and used globally and on a broad scale for several reasons: a) it was created specifically for the description of electronic resources; b) it has an initiative responsible for its development, maintenance and dissemination, the Dublin Core Metadata Initiative (DCMI); c) it is the metadata set used by default by the Open Archives Initiative – Protocol for Metadata Harvesting (OAI-PMH). The more active participation of users in the construction and organization of Internet content is the result of the evolution of Web technologies. The so-called Web 2.0 is "the network as platform, spanning all connected devices; Web 2.0 applications are those that make the most of the intrinsic advantages of that platform: delivering software as a continually-updated service that gets better the more people use it, consuming and remixing data from multiple sources, including individual users, while providing their own data and services in a form that allows remixing by others, creating network effects through an 'architecture of participation', and going beyond the page metaphor of Web 1.0 to deliver rich user experiences."[2].
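Because DCMES is the default metadata set of OAI-PMH, repository records are typically exposed as oai_dc XML fragments. The following Python sketch builds such a fragment; the namespaces are the standard ones, but the element values are invented placeholders, not data from this study.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)

def oai_dc_record(fields):
    """Build a minimal oai_dc record from a mapping {element: [values]}."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for name, values in fields.items():
        for value in values:
            ET.SubElement(root, f"{{{DC}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")

# Hypothetical record; values are placeholders only.
print(oai_dc_record({
    "title":   ["The Semantic Web"],
    "creator": ["Berners-Lee, Tim"],
    "subject": ["Semantic Web", "Metadata"],
    "type":    ["Article"],
}))
```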
One of the new possibilities of Web 2.0 is folksonomy, which is "the result of personal free tagging of information and objects (anything with an URL) for one's own retrieval. The tagging is done in a social environment (shared and open to others). The act of tagging is done by the person consuming the information"[3]. The tags which make up a folksonomy may be keywords, categories or metadata [4]. Tags have several roles, as a study by Golder and Huberman [5][6] points out: Identifying What (or Who) it is About, Identifying What it Is, Identifying Who Owns It, Refining Categories, Identifying Qualities or Characteristics, Self Reference and Task Organizing. Another study, Kinds of Tags (KoT) [7], has the objective of verifying how tags derived from folksonomies can be normalized with a view to their interoperability with metadata standards, specifically DC. Its researchers observed that there are some tags that cannot be fitted into any of the already existing elements. Preliminary results indicate that the following new elements may have to be used: Action_Towards_Resource, To_Be_Used_In, Rate and Depth [8][9]. Generally, digital repositories' metadata is input by authors or by professionals who mediate deposit. In the Web 2.0 context, folksonomies arise as a result of Web resource tagging by the resources' own users. Tags are a complementary form of description which expresses the user's view of a given resource and is, therefore, potentially important for its discovery and retrieval. The preliminary results of KoT indicate that the current DCTerms elements are not enough to hold users' descriptions by means of tags. In this context, and following up the analysis resulting from the KoT project, we propose an application profile for DC Social Tagging so that tags may be used in the context of the Semantic Web. This application profile will be the result of research that aims to identify metadata elements derived from folksonomies and to compare them with DCTerms properties.
2. Investigation: Procedures
The procedures of this research project are divided into four stages. The first stage consists of an analysis of all tags contained in the KoT project dataset. At this stage all tags assigned to the resources are analysed, grouped into what we call key-tags, and DC properties are then assigned to them when possible. A key-tag is a normalised tag that represents a group of similar tags. For instance, the key-tag Controlled Vocabulary stands for the tags controlledvocabulary, controlled vocabularies or vocabulars controlatis. Since the meaning of tags is not always clear, it is sometimes necessary to dispel doubts by turning to lexical resources (dictionaries, encyclopaedias, WordNet, Wikipedia, etc.) and by analysing other tags of the same users. Contacting the users may be a last resort for trying to find out the meaning of a given tag. In this stage, a pilot study was developed in order to refine the proposed methodology and to verify whether the proposed variants for grouping and analysing tags are adequate. The second stage aims at proposing properties complementary to those already existing in the DCMI Metadata Terms [10]. Key-tags that were not assigned to any DC property in stage one will be subject to further analysis in order to infer new properties specific to Social Tagging applications. This analysis takes into account all DC standards and recommendations, including the DCAM model, the ISO Standard 15836-2003 and the NISO Standard Z39.85-2007. The next stage comprises the adaptation of an already existing DC ontology. This will make use of Protégé, an ontology editor developed at Stanford University. The ontology will be encoded in OWL, a language endorsed by the W3C. Finally, the fourth stage intends to submit the proposal to the DC-Social Tagging community for comments
and feedback via online questionnaires. After this phase, a first complete version of the proposed DC Social Tagging profile will be submitted to the community. This paper presents the results of the pilot study alongside the preliminary results of the first research stage: tag analysis. The preliminary results of KoT indicate that an application profile for Social Tagging applications would benefit from the inclusion of new properties beyond those in DCTerms. Those new properties will potentially accommodate tags that currently do not have a metadata holder. The results of this research will therefore make it possible to determine whether, and to what extent, the preliminary KoT findings are confirmed.
3. Pilot Study
The pilot study was carried out in order to improve the methodology proposed for the investigation project since, as Yin [11] states, "the pilot study helps investigators to refine their data collection plans with respect to both the content of the data and the procedures to be followed". The dataset used in this project is the same as that of the KoT project: it is composed of 50 records of resources which were tagged in two social bookmarking systems, Connotea and Delicious. Each record is composed of fields distributed into two groups of data: a) information related to the resource as a whole: URL, number of users, research date; and b) information related to the tags assigned to the resource: social bookmarking system, user, bookmarked date and the tags. A relational database was set up with the DCMI Metadata Terms and the KoT dataset, which was imported from its original files. The following tables were created: Tags, Users, Documents, Key-tags and Metadata.
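A minimal sketch of how such a working database might be laid out follows. Only the table names come from the paper; the column names and data types are our own assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect("kot_pilot.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS Documents (doc_id INTEGER PRIMARY KEY, url TEXT, n_users INTEGER, research_date TEXT);
CREATE TABLE IF NOT EXISTS Users     (user_id INTEGER PRIMARY KEY, nick TEXT, system TEXT);
CREATE TABLE IF NOT EXISTS Keytags   (keytag_id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE IF NOT EXISTS Metadata  (property TEXT PRIMARY KEY);  -- the DCMI Metadata Terms
CREATE TABLE IF NOT EXISTS Tags (
    tag_id INTEGER PRIMARY KEY,
    raw_form TEXT,                                   -- the tag exactly as the user typed it
    doc_id INTEGER REFERENCES Documents(doc_id),
    user_id INTEGER REFERENCES Users(user_id),
    keytag_id INTEGER REFERENCES Keytags(keytag_id), -- assigned during analysis
    property TEXT REFERENCES Metadata(property),     -- DC property, when one applies
    bookmarked_date TEXT
);
""")
conn.commit()
```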
3.1 Tag Analysis
In the pilot study, data from the first five resources of the dataset were analysed. This implied the analysis of a total of 311 tags with 1,141 occurrences, assigned by 355 users. It was important to register not only the number of tags but also their total number of occurrences, since a tag could have a different meaning for each of the resources to which it was assigned. Therefore, in some cases, it was possible to analyse the occurrences of a tag with respect to an individual resource.

3.1.1 Grouping Tags in their different forms: Key-tags
A Key-tag is the term that represents the various forms of a same Tag. In order to accomplish Tag grouping it was necessary to generate reports for each resource with the following information: Title (of the resource), User Nick and Tag, displaying the information in alphabetical order of the Tags to facilitate the visualization of the different existing Tag forms and the definition of Key-tags. In this stage it is necessary to use lexical resources (dictionaries, WordNet, Infopedia, etc.) and other online services, such as online translators, in order to fully understand the meaning of tags. In some cases further research and analysis of other tags of a given user, or even direct contact with this user by e-mail, may be necessary in order to understand the exact meaning of a tag. An important concern regarding tag analysis is the fact that, as tags are assigned by the resources' users, there is inevitably a lack of homogeneity in their form. Therefore, it was necessary to establish some rules in order to properly analyse tags, establish key-tags and relate DC properties to them. The first rule to be observed concerns the alphabet. In this project, only tags written in the Latin alphabet were considered. Further studies should involve the analysis of tags written in different alphabets.
Another rule is directly related to language. The dataset comprises tags written in different languages. As English is the dominant one, it was chosen as the language to represent Key-tags. Depending on the Key-tags, certain criteria concerning the classification of words need to be established: simple or compound, singular or plural, based on a thesaurus structure and its syntactical relations. In these cases, the rules presented by Currás [12] were followed. It was also necessary to create rules to deal with compound tags, as they contain more than one word. There are two kinds of compound tags: (1) those that are related to only one concept and therefore originate only one key-tag (e.g. Digital Libraries); and (2) those that are related to two or more concepts and therefore originate two or more key-tags (e.g. Library and Librarians). In the first kind, compound tags are composed of a focus (or head) and a modifier [13]: the focus is the noun component which identifies the general class of concepts to which the term as a whole refers, and the modifier is one or more components which serve to specify the extension of the focus; in the example above, Digital (modifier) Libraries (focus). It is a compound term that comprises a main component, or focus, and a modifier that specifies it. In the second kind, compound tags are related to two or more distinct Key-tags, for example Library and Librarians, which belongs to the group of two distinct Key-tags, Library and Librarian. Another example is Cataloguing-Classification, which would be assigned both to the Key-tag Cataloguing and to the Key-tag Classification. In this second kind there is no focus/modifier relation between the components, as their meanings are totally independent. Following these pre-established rules, the 311 tags were grouped in their different forms, adding up to 212 Key-tags. The first step of tag analysis comprises grouping tag variants by: a) language; b) simple/compound form; c) abbreviations and acronyms; d) singular/plural; e) capital/small letters. A Key-tag is then assigned to each of these groups according to the rules presented above. Below are some examples of tags and their assigned key-tag:
• Tags: _article, article, articles, artikel, article:sw. Key-tag: Article.
• Tags: biblioteca digital, biblioteques digitals, digital libraries, digital library, digital_libraries, digital_library, digitallibraries, digital-libraries, digitallibrary, dl. Key-tag: Digital Libraries.
The above key-tags show variation in:
• spelling: _article / article; digital library / digital_library / digitallibrary and dl;
• form (singular/plural): article / articles; digital library / digital libraries;
• language: article (EN) / artikel (DE); biblioteca digital (PT) / biblioteques digitals (CA) and digital library (EN).
The examples above also show the two kinds of compound tags. Focus/modifier compound tags like biblioteca digital and digital library are assigned to only one Key-tag. Tags composed of two focus components, like article:sw, are assigned to two distinct Key-tags: Article and Semantic Web.
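A minimal sketch of this grouping step, assuming a hand-maintained mapping from observed tag variants to key-tags (the variant lists are taken from the examples above; the function and table names are ours):

```python
# Map each observed tag variant to its normalised key-tag (examples from the text above).
KEYTAG_OF = {
    "_article": "Article", "article": "Article", "articles": "Article", "artikel": "Article",
    "biblioteca digital": "Digital Libraries", "biblioteques digitals": "Digital Libraries",
    "digital libraries": "Digital Libraries", "digital library": "Digital Libraries",
    "digital_libraries": "Digital Libraries", "digital_library": "Digital Libraries",
    "digitallibraries": "Digital Libraries", "digital-libraries": "Digital Libraries",
    "digitallibrary": "Digital Libraries", "dl": "Digital Libraries",
    "article:sw": ("Article", "Semantic Web"),   # two focus components -> two key-tags
}

def key_tags(raw_tag):
    """Return the key-tag(s) for a raw tag, as a tuple; empty if still unanalysed."""
    mapped = KEYTAG_OF.get(raw_tag.strip().lower())
    if mapped is None:
        return ()                      # unknown variant: left for manual analysis
    return mapped if isinstance(mapped, tuple) else (mapped,)

assert key_tags("Digital_Library") == ("Digital Libraries",)
assert key_tags("article:sw") == ("Article", "Semantic Web")
```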
3.1.2 Tag Analysis in Relation to DC
After Key-tag composition, an analysis was carried out in order to verify to which DC Properties these tags corresponded. This analysis becomes more complex because the definitions of the DCMI Terms are intentionally very inclusive, so that electronic documents can be described with a small but satisfactory number of metadata elements. This inclusiveness may cause some doubt when relating Key-tags to DC Properties. Another factor of complexity is that this is a qualitative study carried out manually, so that the analysis is as detailed as possible. Due to these factors, it was necessary to define basic rules for the correspondence of Key-tags to the DC Properties. For simple tags there is a peculiarity to be noticed that relates to the way tags are inserted in the social bookmarking sites: the way tags are inserted can interfere with the system's indexation. When the user inserts tags in Delicious, the only separator is the space character, and everything that is typed separated by spaces will be considered distinct tags. For example, if the compound term Digital Library is inserted containing only the space as separator, the system will consider two tags: Digital and Library. In order to be inserted as a compound tag it is necessary to use special characters such as underscores, dashes and colons. Some examples of such compound tags are: Digital_Library, Digital-library, Digital:Library, Digital.Library. In Connotea tags are also separated by a space or a comma. However, Connotea suggests that users type compound tags between inverted commas. For example, if the user inserts Information Science without placing the words between inverted commas, the words will be considered two distinct tags; however, if they are typed between inverted commas ("Information Science") the system will generate only one compound tag. This simple yet important issue has significant implications for the system's indexation of tags. To exemplify this, there is the example of a Delicious user who, when assigning tags to the resource "The Semantic Web", written by Tim Berners-Lee, inserted the following tags: the, semantic, web, article, by, tim, berners-lee, without using any word-combination device (_, -, etc.). The system generated seven simple tags. However, it is clear that these tags can be post-coordinated [14][15] to carry a meaning such as Title, Creator and Subject. Thus, as a first rule, in cases where simple tags could clearly be post-coordinated, they were analysed as a compound term for the assignment of the DC Property. However, this analysis could only be carried out in relation to a single user of a resource and never to a group, since otherwise it could mischaracterize the assignment of properties. The second rule concerns tags that correspond to more than one DC Property. Two different situations are considered: simple and compound tags. The easiest case is that of simple tags. If simple tags occur to which more than one property can be assigned, then all the properties are assigned to the tag. For example, in the resource entitled DSpace, the properties "Title" and "Subject" are assigned to the Key-tag dspace. As explained earlier, compound tags can correspond to two or more key-tags. Thus the relationship with DC properties is made through the key-tags, which are treated as simple tags in the way they are related to DC properties.
For example, the tag Web2.0:article corresponds to two Key-tags, Web 2.0 and Article, each of them corresponding to a different property: Subject and Type, respectively. There may also be cases of compound tags that represent two different values for the same property, as in Classification-Cataloguing, which was split into two Key-tags, Classification and Cataloguing, both Subject.
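A minimal sketch of the separator handling described above, i.e. splitting a compound tag into its word components before key-tag assignment. This is a simplification under our own assumptions; in the study itself the analysis was manual and qualitative.

```python
import re

def split_compound(raw_tag, system="delicious"):
    """Split a raw compound tag into candidate components using the separators discussed above."""
    if system == "connotea" and raw_tag.startswith('"') and raw_tag.endswith('"'):
        return [raw_tag.strip('"')]                 # quoted Connotea compound tag stays whole
    # underscore, dash, colon and dot act as intra-tag separators in Delicious
    parts = [p for p in re.split(r"[_\-:.]", raw_tag) if p]
    return parts or [raw_tag]

print(split_compound("Digital_Library"))                              # ['Digital', 'Library']
print(split_compound("Classification-Cataloguing"))                   # ['Classification', 'Cataloguing']
print(split_compound('"Information Science"', system="connotea"))     # ['Information Science']
```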
Another rule is related to tags whose value corresponds to the property Title. Tags are related to the element "Title" when they are composed of terms found in the main title of the resource, for example Dspace or Library2.0. Another example is the case of the resource entitled "The Semantic Web", where the tags The, Semantic and Web were assigned by the same user and thus may be considered post-coordinated.
3.2 Definition of DC Properties
From the 311 tags analysed, 212 Key-tags were created. Of these, 159 Key-tags (75%) corresponded to the following DC properties: Creator, Date, Description, Format, Is Part Of, Publisher, Subject, Title and Type. Of these 159, 90.5% correspond to Subject and Description. At this point it is worth highlighting that tags that referred both to the main subject and to other subjects related to the resource were allocated to Subject. The other properties present the following percentages of allocation: Type, 5%; Creator, Is Part Of and Title, 3.1% each; Date and Publisher, 1.3% each; and Format, 0.6%. The other 53 Key-tags (25%) could not be related to any DC property. New complementary properties were defined, and their definition is still in progress. The following properties identified in the pilot study will be described: Action, Category, Depth, Rate, User Name, Utility and Notes.

3.3 Proposed Properties
At this stage, potential new properties were defined for the Key-tags to which it was impossible to assign any DC property. The definition of these properties is still preliminary at this stage of the research, since it is based solely on the pilot study. The research on the full dataset will determine which properties will be included in the application profile, including any new ones that do not exist in DCTerms. The preliminary new properties identified in the pilot study are described below: Action, Category, Depth, Rate, User Name, Utility and Notes. The following percentages were observed for these proposed properties: Action, Rate and Utility (15.1% each), Category (11.3%), Depth (9.4%), Notes (7.5%) and User Name (1.9%). There remain 24.5% of Key-tags to which it was not possible to assign or propose any property, as their meaning in relation to the resources and users could not be identified. Below, each of these properties is described following the set of attributes used to specify the DCMI Metadata Terms [16]: Label, Definition, Comment and Example. Some additional information for a better understanding of these properties is also included.

3.3.1 Action
There is a group of Key-tags that represent the action of the user in relation to the tagged resource. This type of Tag can be easily identified, since the action is expressed in the very term used when tagging the resource. Eight Key-tags were identified: Print, Read-Review, Read Later, Read This, Reading-List, To Do and To Read.
Below is a descriptive table of the property to be proposed.

Label: Action
Definition: Action of the user in relation to the resource.
Comment: Has the role of registering the action undertaken by the user on the resource.
Example: The tags which represent the action To Read, attributed by 6 users, all from Delicious: _toread, a_lire, toread.

Table 1: Description of the property Action

3.3.2 Category
This property includes Tags whose function is to group the resources into categories, that is, to classify the resources. The classification is not determined by the subject or theme of the resource, since in those cases the key-tags could correspond to the Subject property. This property is not easy to identify, since it is necessary to analyse the given tag in the context of all the tags that user has inserted, independently of the resource under analysis. In some cases it may become necessary to analyse the whole group of resources the user has tagged with the tag under analysis. Six Key-tags which could correspond to the property Category were identified: Alternative Desktop, DC tagged, DMST, FW – Trends, Literature and Reference. See Table 2.

Label: Category
Definition: Terms that specify the category of a group of resources.
Comment: Applied to tags which were attributed to group resources into categories, but which are not theme or subject categories, since for those Subject should be used.
Example: During the analysis of the Key-tag DC Tagged it was noticed that the corresponding resources also had other tags with the prefix dc: (e.g. dc:contributor, dc:creator, dc:Publisher, dc:language or dc:identifier, among others). It was concluded that the tag DC Tagged was probably being applied to group all the resources that were tagged with tags prefixed by dc:. It was therefore considered a Category, since it is not a classification of subjects or a description of the content of the resource.

Table 2: Description of the property Category

3.3.3 Depth
This type of tag confers a degree of intellectual depth on the tagged resource. According to WordNet, Depth is the "degree of psychological or intellectual profundity" [17].
Label: Depth
Definition: Degree of intellectual depth of the tagged resource.
Comment: Registers the degree of intellectual profundity the user attributes to the tagged resource.
Example: Key-tags such as Overview or Introduction – Document.

Table 3: Description of the property Depth
The following Key-tags for this property were identified: Diagrams, Introduction – Document, Overview, SemanticWeb – Overview and Semantic Web – Introduction, each of which occurred only once.

3.3.4 Notes
This element is proposed to represent tags that are used as a note or reminder. According to WordNet, a note is "a brief written record" with the objective of registering some observation concerning the resource, but which does not refer to its content and is not intended to be used as its classification or categorization [18].

Label: Notes
Definition: A note or annotation concerning a resource.
Comment: Used to make some type of comment or observation with the objective of reminding something, or of registering an observation, comment or explanation related to a tagged resource.
Example: There is a resource that received the tags Hey and OR2007. The first tag, Hey, refers to Tony Hey, a well-known researcher who took part in a debate on important issues related to the tagged resource; in this case the information was given by the user who attributed the tags himself. The second tag refers to Open Repositories 2007, the event where the above-mentioned Tony Hey gave a keynote speech. Interestingly, the tagged resource has no direct relation either to that event or to Tony Hey; this information was confirmed by the user (creator) of the resource himself.

Table 4: Description of the property Notes

A note should be understood as an annotation to remind something, or an observation, comment or explanation inserted in a document to clarify a word or a certain part of the text [19]. From the five analysed resources, the following Key-tags considered as Notes were identified: Hey, Ingenta, OR2007, PCB Journal Club.

3.3.5 Rate
Rate, meaning pattern, category, class or quality, is important for including tags that evaluate the tagged resource. Thus, when using this type of tag, the user categorizes the resource according to its quality.

Label: Rate
Definition: Categorizes the quality of the tagged resource.
Comment: Used to register the user's evaluation of the quality of the tagged resource. Examples of this type of tag: good, great, important.
Example: A resource tagged with the tags Good and Great represents the user's qualification of its quality.

Table 5: Description of the property Rate

The following Key-tags were related to this property: academic, critical, important, old, great, good and vision. Most are easily identified as Rate from the terms themselves. In other cases the tags may be doubtful, and it becomes necessary to analyse them in relation to the tags assigned by the user to the resource under analysis, as well as to the whole collection of resources tagged by that user. For instance, the tag Vision could have several meanings, but after an analysis of the collection of resources it may be concluded that it is classifying the quality of the resource.
3.3.6 User Name
The Tag User Name labels the resource with the name of a user. The analysed resource had the name of the user of the tagged resource. Only one tag of this type was identified in the pilot study. Despite the preliminary results presented here, it is assumed that there may be other occurrences.

Label: User Name
Definition: Name of the user of the resource.
Comment: Refers to tags which register the nickname of the user of the resource.
Example: In the pilot study only one tag of this type was identified. The tag Alttablib was attributed by a user of Delicious to resource 4 (Resource Description and Access (RDA)).

Table 6: Description of the property User Name

3.3.7 Utility
After an analysis of the tags and resources, an element is proposed to gather the tags that register the utility of the resource for the user. It represents a specific categorization of the tags, so that the user may recognize which resources are useful to him in relation to certain tasks and uses. In the pilot study the following tags were identified: Class Paper, Research, Dissertation, Maass, Professional, Research, Search and Thesis. It was not difficult to identify the majority as being Utility. However, three of them, Class Paper, Maass and Professional, required an analysis of other tags and resources from the same users. Class Paper is a tag that is bundled in "1schoolwork" and was assigned to three resources. By analysing the group of resources and related tags, it supposedly refers to resources that would be or have been used for a certain activity. Maass is a tag that was bundled in "Study". The term represents the name of a teacher, information found in the user's notes in two resources tagged with Maass: "Forschung von Prof. Maass an der Fakultat Digitale Medien an der HFU"; and "Unterlagen für Thema 'Folksonomies' für die Veranstaltung "Semantic Web" bei Prof. Maass". Professional is a tag assigned by the user to separate those resources that are useful for work-related issues. This information was given by the user of the tag himself.

Label: Utility
Definition: Represents the purpose of use of the resource for the user.
Comment: Categorizes the resources according to utility, as for example: dissertation, thesis.
Example: A group of resources useful for the development of a research project could be tagged with the tag Research.

Table 7: Description of the property Utility
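To make the proposal concrete, the following sketch shows how a single tagged resource might be represented once its key-tags are mapped both to DCTerms properties and to the new properties proposed above. The namespace prefix for the new properties (here "tagap") and all field values are our own assumptions, not part of the paper.

```python
# One resource's metadata after tag analysis: standard DCTerms properties plus
# the social-tagging properties proposed in this paper (hypothetical "tagap" prefix).
tagged_resource = {
    "dcterms:title":   "The Semantic Web",
    "dcterms:creator": "Berners-Lee, Tim",
    "dcterms:subject": ["Semantic Web", "Metadata"],
    "dcterms:type":    "Article",
    "tagap:action":    "To Read",
    "tagap:rate":      "good",
    "tagap:utility":   "Research",
}

for prop, value in tagged_resource.items():
    print(f"{prop}: {value}")
```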
4. Final Considerations
In the following cases it was not possible to establish any correspondence with any property, since it was impossible to understand the meaning of the tags in relation to their resources: resource 1: Capstone;
resource 2: Suncat2; resource 4: Babel, Exp, L; and resource 5: Do it or Diet, Inner Space, Kynunan, and W. Nonetheless, these are the results of the pilot study and they will therefore be presented to the DC community for evaluation and validation along with the results of the final research. As a result of this pilot study it is important to highlight that a meaningful proportion of the tags, 25%, could not be assigned to the already existing DCTerms properties. This result strengthens what had already been concluded in the KoT project, where 37.3% of the analysed tags were not found to correspond to any of the DCTerms properties. Therefore, the adoption of new properties is justified, so that the metadata deriving from folksonomies can be used by metadata interoperability protocols.
5. Acknowledgments
The authors wish to thank Filomena Louro from the Program of Support to the Edition of Scientific Papers at the University of Minho for her help in editing the final English draft.
6. Notes and References
[1] DCMI. Using Dublin Core: Dublin Core Qualifiers. DCMI, 2005. Available at: http://dublincore.org/documents/usageguide/qualifiers.shtml, last accessed on August 30, 2007.
[2] O'REILLY, T. Web 2.0: Compact definition? O'Reilly Radar Blog, 1 October 2005. Available at: http://radar.oreilly.com/archives/2005/10/web_20_compact_definition.html, last accessed on November 6, 2006.
[3] WAL, Thomas Vander. Folksonomy definition and wikipedia. Available at: http://www.vanderwal.net/random/entrysel.php?blog=1750, last accessed on November 22, 2006.
[4] GUY, Marieke; TONKIN, Emma. Folksonomies: tidying up tags? D-Lib Magazine, v.12, n.1, Jan. 2006. Available at: http://www.dlib.org/dlib/january06/guy/01guy.html, last accessed on December 12, 2006.
[5] GOLDER, Scott A.; HUBERMAN, Bernardo A. The Structure of Collaborative Tagging Systems. Available at: http://arxiv.org/abs/cs.DL/0508082, last accessed on November 14, 2006a.
[6] ________. Usage patterns of collaborative tagging systems. Journal of Information Science, v.32, n.2, p.198-208, 2006b.
[7] Preliminary data presented in DC-2007 and NKOS-2007.
[8] BAPTISTA, Ana Alice et al. Kinds of Tags: progress report for the DC-Social Tagging community. In: DC-2007, International Conference on Dublin Core and Metadata Applications: application profiles: theory and practice. 27-31 August, Singapore. Available at: http://hdl.handle.net/1822/6881, last accessed on September 4, 2007.
[9] TONKIN, E. et al. Kinds of tags: a collaborative research study on tag usage and structure (presentation). In: European Networked Knowledge Organization Systems (NKOS), 6.; ECDL Conference, 11., Budapest, Hungary. Available at: http://www.us.bris.ac.uk/Publications/Papers/2000724.pdf, last accessed on December 10, 2007.
[10] DCMI Usage Board. DCMI Metadata Terms. 14 January 2008. Available at: http://dublincore.org/documents/2008/01/14/dcmi-terms, last accessed on January 21, 2008.
[11] YIN, Robert K. Case Study Research: design and methods. Thousand Oaks, USA, 1989.
[12] CURRÁS, Emília. Ontologías, taxonomia y tesauros: manual de construcción y uso. 3. ed. act. y ampl. Madrid: Treas, 2005.
[13] INTERNATIONAL STANDARDS ORGANIZATION. ISO 2788: Documentation: Guidelines for the establishment and development of monolingual thesauri. [S.l.]: ISO, 1986.
[14] Post-coordination is the principle by which the relationship between concepts is established at the moment of outlining a search strategy [15].
[15] MENEZES, E. M.; CUNHA, M. V.; HEEMANN, V. M. Glossário de análise documentaria. São Paulo: ABECIN, 2004. (Teoria e Crítica, 01).
[16] DCMI Usage Board. DCMI Metadata Terms. 14 January 2008. Available at: http://dublincore.org/documents/2008/01/14/dcmi-terms, last accessed on January 21, 2008.
[17] WORDNET. A lexical database for the English language. Princeton University, Cognitive Science Laboratory. Available at: <http://wordnet.princeton.edu/>, last accessed on 7 February 2008.
[18] WORDNET, ref. 17.
[19] INFOPEDIA. Enciclopédias e dicionários. Porto: Porto Editora. Available at: <http://www.infopedia.pt>, last accessed on 7 February 2008.
Autogeneous Authorization Framework for Open Access Information Management with Topic Maps
Robert Barta 1; Markus W. Schranz 2
1 Austrian Research Centers Seibersdorf, Seibersdorf, Austria. e-mail: robert.barta@arcs.ac.at
2 Department of Distributed Systems, Institute for Information Systems, Vienna University of Technology, Argentinierstr. 8/184-1, Vienna, Austria. e-mail: schranz@infosys.tuwien.ac.at
Abstract
Conventional content management systems (CMSes) consider user management, specifically authorization to modify content objects, to be orthogonal to any evolution of content within the system. This puts the burden on a system administrator or his delegates to organize an authorization scheme appropriate for the community the CMS is serving. Arguably, high-quality content (especially in open access publications with little or no a priori content classification) can only be guaranteed and later sustained if the fields of competence of authors and editors parallel the thematic aspects of the content. In this work we propose to abandon the above-mentioned line of demarcation between object authorization and object theming, and we describe a framework which allows content and its ontological aspect to evolve in lockstep with content ownership.
Keywords: Ontology; Semantic Technologies; Authorization Framework
1. Introduction
Content ownership, joint or individual, is the main driving factor in an information society. Currently, systems tend to be built with strong gravitational forces to attract content creation, so that the harvested information can be sold back into society. In the long run such business models can lead to monopolies and to a highly uneven content distribution. Traditionally, user management has been regarded as orthogonal to the life cycle of a document object within a content management system (CMS). That has allowed implementations to delegate not only authentication but also authorization to a middleware layer. Many modern platforms (such as .NET or J2EE) allow a wide variety of authorization technologies to be deployed. Most CMSes provide an identity-based authorization scheme to control access to the information nodes within the CMS. Individual users either get assigned particular privileges relative to these nodes, or particular privileges are clustered into 'roles', mainly to reduce the management effort. When particular users are associated with a particular role, they inherit all the role's privileges. Role management is then usually handled by an administrator. Such an individual can easily become a bottleneck (and a security risk), so role assignment is often delegated, or even further sub-delegated. Current authorization schemes are quite flexible: they can cover systems in the Wiki class (rather flat user base, hardly any workflow) up to corporate systems with a rather deep, organizationally imposed group hierarchy and considerable workflow capabilities. Despite the delegation features, in many practical deployments user management is still funnelled through very few administrators, and in many real-world deployments these administrators do not actually take part in the collective authoring effort.
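For contrast with the approach developed later in the paper, a minimal sketch of the conventional identity/role-based check described here might look as follows; the role and privilege names are purely illustrative assumptions of ours.

```python
# Conventional role-based authorization: privileges hang off roles, roles off users,
# and both mappings are maintained centrally by an administrator.
ROLE_PRIVILEGES = {
    "reader": {"read"},
    "author": {"read", "edit"},
    "editor": {"read", "edit", "publish"},
}
USER_ROLES = {"alice": {"author"}, "bob": {"reader"}}   # assigned by an administrator

def allowed(user, action):
    """A purely identity-based check: the node's subject matter plays no part."""
    return any(action in ROLE_PRIVILEGES[r] for r in USER_ROLES.get(user, ()))

print(allowed("alice", "edit"))     # True
print(allowed("bob", "publish"))    # False
```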
On a different front, a more recent trend in CMSes is the addition of semantic information (e.g. [1,2]). In the simplest case this is achieved by providing a background taxonomy against which the information nodes are organized. More sophisticated CMSes allow users not only to attach meta-information along well-defined attributes but also to use one of the semantic web technologies (RDF[4] and Topic Maps[5]). These make it possible to freely relate information items within the CMS to each other or to concepts and instance data defined elsewhere. Such outside information can be either referenced or integrated via virtualization[6]. Here, too, systems differ considerably in the degree to which individual users can extend the existing ontology or the types of relationships. In our approach we propose to coalesce the authorization mechanism with the ontological information, thus offering an Autogeneous Authorization Framework (AAF) for (open access) information management based on Topic Maps. The paper is structured as follows: in section 2 we describe the challenges of integrating content management and authorization, introduce our proposed methodology and refer to related work and necessary notation formats. In the following sections we formalize our proposed machinery for the AAF and cover implementation aspects such as visibility rules for nodes and the necessary ontological commitments. Finally we summarize our current experiences and outline the work needed for a scalable implementation.
2. Challenges in Combining Content Management and User Authorization
As the target audience for the integration of content management and authorization we have in mind loose federations of organizations which want to cooperate in certain areas on a number of topics. Each of the involved organizations may have its own experts in certain areas, but each will seek expert knowledge from its partners.
2.1 Proposed Method and Assumptions
Realistically, such federated projects will not have stable ontologies from the very beginning, much in contrast to a priori created ones (e.g. [3]). These will have to evolve over time, and each snapshot implicitly indicates deficits and hence which topics the project will have to focus on next. This will prompt experts to adopt certain topics and detail them to the extent necessary or feasible. From past experience it can also be expected that further field experts will be solicited, ad hoc or via affiliation, especially in the area of open access research publications, where the major focus lies on content quality, reliability and trust in topic experts. Ideally these invitations to co-authorship will not affect the whole content body but only that fraction for which a new expert is authoritative. Under the regime of 'taxonomy-based authorization' a particular user does not derive his privileges relative to a given node from the settings of a central administrator or from membership in a group, but from the commitment that the user is an expert in the field the node belongs to (the 'theme'). Accordingly, the system tracks for each node how it is classified to a theme in the current taxonomy, and it also keeps track of which individuals are authoritative in which themes. From there we generalize the regime in several directions:
• First, we abstract away from the particular privileges an individual may have regarding a particular topic. In the simplest case this may be read or write (edit) access to a node. In more complex cases privileges may include the start, promotion or finalization of workflow steps, or different levels of read access to only certain fractions or aspects of the topic. The only thing we are currently assuming is that all different privileges are totally ordered,
so that (a) it is always unambiguously determinable which privilege is higher than the other, and (b) there is always a highest privilege.
• When topics are generated, they will be classified into a theme. From there, topics themselves may or may not follow a workflow. As usual with workflows, the progress of a workflowing node depends on certain privileges, be it for promoting the topic into a final stage or for sending it back to an editing phase. Our framework does not prescribe any particular workflow states or any intrinsic privileges to move a topic to another state. The only assumption here is that any workflow privileges are still tied to themes and that topics can be adopted by anyone who has sufficient privileges for that theme. Topic adoption can happen either actively by a user (pull), or passively (push), by another user with higher privileges forcing adoption upon the user.
• Privileges themselves will also follow a life cycle, in that the initial privileges of users are extended (monotonically increasing over time). Here too we allow a push–pull setup: either a user receives, unsolicited, privileges for certain themes from other users, or a user requests higher privileges, which may later be granted.
Hereby we allow two subschemes (see the sketch after this list):
• In the 'delegation scheme' privileges can only be granted by someone with higher privileges on that theme.
• In the 'peer scheme' the granted privilege can be at the same level as that of the granting user.
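A minimal sketch of the two granting subschemes, under our own assumption that the totally ordered privileges can be modelled by a numeric rank per (user, theme):

```python
# Privileges are totally ordered; here modelled by an integer rank per (user, theme).
PRIVILEGE = {("carol", "finance"): 3, ("dave", "finance"): 2}   # illustrative data only

def may_grant(grantor, grantee_level, theme, scheme="delegation"):
    """Can `grantor` hand out `grantee_level` on `theme` under the given subscheme?"""
    own = PRIVILEGE.get((grantor, theme), 0)
    if scheme == "delegation":
        return own > grantee_level          # only strictly higher privileges may grant
    return own >= grantee_level             # peer scheme: the same level suffices

print(may_grant("dave", 2, "finance"))                    # False under the delegation scheme
print(may_grant("dave", 2, "finance", scheme="peer"))     # True under the peer scheme
```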
The ontology-based authorization described here can be bootstrapped from any existing taxonomy, even a pathological one with a single theme (the 'thing'). Authors in the system can create taxonomy nodes along with information nodes and establish the highest available privilege on those taxonomy nodes for themselves. Safeguards are in place to prevent users from reassigning nodes in the taxonomy in order to subvert the authorization system.
2.2 Basic Requirements and Related Standards
Since the basic elements of our AAF are described above as nodes in graphs that can be accessed via certain actions, we propose a standardized notation format to represent the resulting semantic network. In the literature such graph structures have been implemented in various forms under different names, including associative nets, semantic nets, partitioned nets, or knowledge maps in many AI systems. One of the most completely worked-out notations is the conceptual graphs developed by Sowa et al.[8]. Semantic networks rely on a basic model that is similar to that of the topics and associations found in indexes. Thus the two approaches promise great benefits in both information management and knowledge management. Exactly these benefits are targeted by the topic map standard. By introducing relations between topics and occurrences in addition to the topic-association model, topic maps provide a means to bridge the gap, as it were, between knowledge representation and information management. In this paper we want to extend this basic intention of topic maps to include user authorization based on thematic topics within the content.

2.2.1 Topic Maps and the Topic Map Standard
Topic maps are an ISO standard[9] for describing knowledge structures and associating them with information resources. Often described as the GPS of the information universe,
topic maps provide powerful new ways of navigating large and interconnected information sets. In keeping with the basic elements of information structuring and access embodied in indexes, the topic map standard is based on Topics, Associations, and Occurrences. The following section outlines the TAO of topic maps, as explained in detail by S. Pepper in [10]. We will use topic maps as the notation standard for our AAF as described in section 4.2.

Topics
A topic, in its most generic sense, can be anything whatsoever: a concept, an article, a person, etc. The term topic refers to the object in the topic map that represents the subject being referred to. Typically, there is a one-to-one relationship between topics and subjects, with every topic representing a single subject and every subject being represented by just one topic. In a topic map, any given topic is an instance of zero or more topic types, thus categorizing specific topics according to their kind. Similar to the usage of multiple indexes in a book (index of abbreviations, names, or illustrations), topic types semantically describe the nature of topics. What one chooses to regard as topics in any particular application may vary according to the needs of the application, the nature of the information, and the uses to which the topic map will be put: e.g. in software documentation they might be variables, functions, and objects. In order to identify objects symbolically, topics may have explicit names. Since names exist in various forms, such as formal names, symbolic names, nicknames, etc., the topic map standard provides the facility to assign multiple base names to a single topic, and to provide variants of each base name for use in specific contexts.

Occurrences
A topic may be linked to one or more information resources that are somehow relevant to the topic. Such resources are called occurrences of the topic; they are generally external to the topic map document itself and are referenced using various mechanisms the system supports, e.g. URIs in XTM[7]. A significant advantage of using topic maps is that the real-world documents (occurrences) themselves do not have to be touched; topic maps thus support a clear separation of the network into two layers: the topics and their occurrences. Following the concepts in the topic map standard, occurrences too may be of any number of different types (e.g. article, monograph, commentary). Such distinctions are supported in the standard by the concepts of occurrence role and occurrence role type. These constructs embody the basic organizing principle for information. The concepts of topic, topic type, name, occurrence, and occurrence role provide the means to organize information resources according to topics/subjects, and to create simple indexes.

Associations
What we really need, in addition to basic indexes, for constructing semantic networks is to be able to describe relationships between topics. To achieve this, the topic map standard provides a construct called the topic association, which asserts a relationship between two or more topics. Similar to the grouping of topics and occurrences according to their specific type (such as author/research area and article/commentary/monograph), associations between topics can also be categorized according to their type. And following the notation concepts of the standard, association types are themselves defined in terms of topics.
The ability to apply typing to topic associations significantly increases the expressive power of the topic map, making it possible to group together the set of topics that have the same relationship to any given topic. This is necessary to provide intuitive and user-friendly interfaces for navigating large information networks.
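To make the TAO constructs concrete, the following sketch models topics, occurrences and a typed association as plain Python data structures. It is an illustration only, not part of the topic map standard or of any particular engine; the identifiers, role names and the example.org URI are invented, while the TAO article URL is taken from reference [10].

from dataclasses import dataclass, field

@dataclass
class Topic:
    identifier: str
    topic_type: str = "topic"                  # topic types (e.g. "author") are themselves topics
    base_names: list = field(default_factory=list)
    occurrences: list = field(default_factory=list)   # (occurrence role type, URI) pairs

@dataclass
class Association:
    assoc_type: str                            # association types (e.g. "written_by") are topics too
    roles: dict = field(default_factory=dict)  # association role type -> topic identifier

# a tiny map: one author, one article, one typed association between them
pepper = Topic("pepper", "author", ["Steve Pepper"],
               [("homepage", "http://example.org/pepper")])
tao = Topic("tao", "article", ["The TAO of Topic Maps"],
            [("fulltext", "http://www.ontopia.net/topicmaps/materials/tao.html")])
wrote = Association("written_by", {"author": "pepper", "work": "tao"})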
Each topic that participates in an association plays a role in that association. Consequently, association roles can be typed, and the type of an association role is also represented as a topic. Due to the fact that associations in topic maps are multidirectional, the clear distinction between specific association roles is very important (e.g. CM_user has_edit_right_on article). Thanks to the clear separation of (real-world) information resources and the topic map itself, the same topic map can be overlaid on different information sets. Similarly, different topic maps can be overlaid on the same pool of information to provide different semantic views to different users. Furthermore, this separation provides the possibility to interchange topic maps among publishers or to merge several topic maps, thus handling semantic networks. Omitting other details of the topic map standard, which are out of scope for this paper, we proceed to introduce a notation scheme for the AAF, followed by a model of how to represent the nodes in Topic Maps. 3.
Proposed Concepts for an Autogeneous Authorization Framework
To describe and analyze the dynamics of an AAF-driven system, we need to abstract away (a) from any specifics of the underlying CMS and (b) from any representation technique used to manifest AAF-related ontological and operational information. For this purpose we introduce a simple ad-hoc formalism to describe static and dynamic integrity constraints. As a minimal ontological commitment we choose to have nodes as the unit to carry content. From the AAF’s point of view such a node itself is atomic: it can carry any content, be it text, structured or unstructured. The node may have attachments, or it may have meta information attached to it; in any case this is outside the scope of AAF. 3.1
Static Model
3.1.1 Themes One special kind of node are /themes/. Intuitively they represent topics such as finance or, say, UMTS. Themes can be organized into subclass relationships, usually referred to as a taxonomy. We write t’ < t if the theme t’ is a specialization, directly or transitively, of theme t. Of course, themes can be related to each other in more specific ways, but this is outside the AAF realm. The only exceptions are non-theme nodes which are affiliated with a theme t, something we denote as n → t. Any number of such affiliations may exist at any time. How such affiliation is modeled in the background ontology is deployment and implementation dependent. Here the notation should simply convey that the node is somehow related to a certain theme. One constraint imposed is that such affiliation inherits downwards the subclass hierarchy, so
3.1.2 Users Another special node type are users. Nodes of this type are supposed to carry content about a certain user, and we implicitly identify a user with its node (which is somewhat sloppy from an ontological viewpoint). Users are the only active component, so it is they who perform actions. 3.1.3 Actions AAF also assumes that there is a finite set of actions on nodes. The built-in actions are read and edit. read will always keep the content of the node, edit will always modify the node, but both will maintain the identity of any node. Additionally, applications and deployments are free to add workflow actions, i.e. actions which move nodes through a series of workflow steps. Formally, read and edit are embedded in this scheme as they are the only actions in their respective workflows. As is common in workflow applications, every workflow step will move the document into a new state, such as “edited” after an edit action. States are regarded here as derived concepts; still, we conveniently use the notation n@S when a node n is in state S. On any set of actions we also impose an ordering, so that two actions can be compared as to which of them is stronger. This is to model that editing usually also implies reading, or that moving a document in a workflow also implies editing it. As this comparison may only exist on some pairs of actions, we only need a half-order. In any case we require that there is a strongest action. We refer to it as top. The bottom action also exists in every system and represents the empty action. All actions are bigger than bottom. 3.1.4 Privileges When users have privileges, these are characterized by the maximal action that the user can exert on a node affiliated to a certain theme. We denote such a privilege of user u relative to a theme t for action a as u ~a~> t An example would be that Bill has editing rights for finance: bill ~edit~> finance Also here we expect the privilege to inherit downwards the theme taxonomy: u ~a~> t ⇒ u ~a~> t’ with t’ < t This makes sense: if Bill has editing privileges for finance he implicitly should also have them for accounting. But since actions are also under a half-order, more privileges can be inferred as well: u ~a~> t ⇒ u ~a’~> t with a’ < a
In the example we derive from Bill’s authorization to edit finance topics also his authorization to read them. Additionally we define that arbitrary nodes n affiliated to a theme t can be actioned too: u ~a~> n iff ∃ t : n → t and u ~a~> t If a node is not (yet) affiliated with a theme, then no access information can be inferred for it. In the same way as document states can be derived from the actions, user roles can be defined on the basis of their privileges. Since all privileges, though, are always relative to a theme, simply defining that someone is an editor is correct but ultimately loses essential information: Editor (u) ⇐ ∃ t: u ~edit~> t Still, we keep this as a notational convenience.
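As an illustration of how the two inference rules interact, the following Python sketch (our simplified reading of the formalism, not the authors' implementation) checks whether a user may exert an action on a node by walking the theme taxonomy and the action half-order; the taxonomy and ordering tables are invented examples.

# theme taxonomy: child -> parent, i.e. accounting < finance < thing
PARENT = {"accounting": "finance", "finance": "thing"}
# action half-order (here a simple chain): weaker -> stronger
STRONGER = {"bottom": "read", "read": "edit", "edit": "top"}

def is_subtheme(t_sub, t):
    # t_sub is a (possibly improper) specialization of t
    while t_sub is not None:
        if t_sub == t:
            return True
        t_sub = PARENT.get(t_sub)
    return False

def is_weaker(a, b):
    # a is weaker than or equal to b along the half-order
    while a is not None:
        if a == b:
            return True
        a = STRONGER.get(a)
    return False

def may(user, action, node, privileges, affiliations):
    # u ~a~> n iff some theme t with n -> t is covered by a privilege u ~a'~> t'
    # where t is a specialization of t' and a is weaker than or equal to a'
    return any(is_subtheme(t, t_priv) and is_weaker(action, a_priv)
               for t in affiliations.get(node, ())
               for (u, a_priv, t_priv) in privileges if u == user)

privileges = [("bill", "edit", "finance")]          # bill ~edit~> finance
affiliations = {"budget_2008": ["accounting"]}      # budget_2008 -> accounting
print(may("bill", "read", "budget_2008", privileges, affiliations))   # True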
3.2 System Dynamics
According to the set of privileges at a given time, nodes can evolve and move through their respective workflows. Privileges can also change throughout the lifetime of an AAF-governed system. This is achieved either by extending someone’s privileges directly, or indirectly in that nodes are affiliated with themes someone has a privilege on. The integrity constraints for this evolution we model with the help of pre- and post-conditions on node transitions. The preconditions guard certain actions and the post-conditions characterize the state in terms of AAF after a node has been acted on. Each of these transactions is atomic. To denote, for example, that every user is entitled to edit his node u into a new version u’ we write < User (u), u ~edit~> u || User (u’) > While the node u undergoes a change, it will maintain its identity. 3.2.1 Bootstrapping To put an AAF system into an initial state, it has to be bootstrapped into some configuration. The most minimal state is characterized by superuser ~top~> thing Superuser is one particular user. What makes him special is that he has the highest privilege (top) on the most abstract thing. 3.2.2 Document Life Cycle When a new document node is created it will not have any affiliation. Such an outlawed node can only be brought into the realm of a theme t if a user u has top privileges on t: < Node (n), u ~top~> t || n → t’ >
with t’≤ t
In this initiation phase the user has significant responsibility to choose the smallest reasonable subtheme t’ of t. Further changes of affiliations can happen later as well, but only in an accumulative manner, so that no existing privileges are hampered. If a node is moved along a workflow axis it always maintains its identity, even when it is modified. A user
u can exert action a on a document node n in workflow state S when a theme t exists such that: < n → t, n @ S, u ~a~> t || n’ > For the special, predefined actions read and edit we define < n → t, u ~read~> t || n > < n → t, u ~edit~> t || n’ > 3.2.3 User Life Cycle When a user node is created, obviously no explicit privileges are defined for that user. Implicitly we allow users to modify their own nodes, mainly to introduce themselves to the swarm of other users. This can be achieved on the policy level by priming every user node u with u ~edit~> u During the course of its lifetime a user can acquire new privileges, whereby we distinguish two scenarios:
•
In the unsolicited scenario a user will be promoted by another user without prior request: < User (u), User (v), v ~a~> t || u ~a’~> t’ > In any case t’ will be equal to or smaller than t if the user v determines that u only needs privileges for a more special theme. The granting v may also reduce the action level itself, so that a’ ≤ a. If a = a’ we call this process peer invitation, otherwise delegation. Following our running example, the editor of the finance sector may hand down reviewing rights to another person: < User (bill), User (fred), fred ~edit~> finance || bill ~edit~> accounting >
•
Unsolicited privilege escalation is not likely to be agile: the process requires that users constantly monitor the needs of other users, which is not something humans are well equipped to do. In the solicited scenario a user therefore first requests certain privileges. These will be responded to at a later point by other users, so that requests are then resolved. To better moderate the process of solicited privilege escalation we force users to escalate either along the theme taxonomy or alternatively along the action half-order. For the latter case we characterize the creation of a privilege request via < User (u), u ~a~> t || u ~b?~> t > With u ~b?~> t we symbolize the fact that u wants the privilege to do b on t. Note that u needs to have at least privilege a to launch such a request. To escalate along the taxonomy, the user also needs minimal entry rights:
< User (u), u ~a~> t || u ~a?~> t’ > with t < t’. Every request can be responded to. In case a request is granted, another user with sufficient privileges will have to come into play:
< u ~b?~> t’, v ~c~> t’’ || u ~a~> t >
Hereby v can choose to grant an action a below the requested b, so that a < b ≤ c. The user v can also choose to reduce the scope of the privilege, so that t < t’ ≤ t’’. If a is chosen to be bottom, i.e. the smallest action, then the request is effectively rejected.
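The granting step of the solicited scenario can be pictured with the following sketch; the theme table and the total action chain are invented simplifications of the half-order, and the function illustrates the transition above rather than a prescribed API.

ACTION_ORDER = ["bottom", "read", "edit", "top"]          # a simple chain used for illustration
THEMES_BELOW = {"thing": {"thing", "finance", "accounting"},
                "finance": {"finance", "accounting"},
                "accounting": {"accounting"}}

def weaker_or_equal(a, b):
    return ACTION_ORDER.index(a) <= ACTION_ORDER.index(b)

def grant(request, granter, granted_action, granted_theme):
    """Sketch of  < u ~b?~> t', v ~c~> t'' || u ~a~> t >  with  a < b <= c  and  t < t' <= t''."""
    u, b, t_requested = request                            # u ~b?~> t'
    v, c, t_granter = granter                              # v ~c~> t''
    ok = (weaker_or_equal(b, c)                            # granter holds at least the requested action
          and t_requested in THEMES_BELOW[t_granter]       # ... on the theme (or a supertheme of it)
          and weaker_or_equal(granted_action, b)           # the grant may reduce the action ...
          and granted_theme in THEMES_BELOW[t_requested])  # ... and narrow the theme
    if not ok or granted_action == "bottom":               # granting bottom is effectively a rejection
        return None
    return (u, granted_action, granted_theme)              # the new privilege u ~a~> t

print(grant(("fred", "edit", "finance"), ("bill", "top", "thing"), "edit", "accounting"))
# ('fred', 'edit', 'accounting')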
4. Implementation
4.1 Visibilities of Node Aspects
Once an AAF system is implemented, the user interface has to control which aspects of a node are visible to whom. Many of these visibilities will be policy controlled, so the following may vary between deployments. Regardless of the node type we distinguish between the content itself, the ontological embedding of the node and the defined (or derived) privileges on it. For reasons of reproducibility and auditing, not only the current information is displayed but also the past history of changes, so that it is evident who got which privilege at which time. While general document nodes follow the generic rules of section 3, user and theme nodes have to be treated differently. 4.1.1 User Nodes For user nodes the content visibility follows the generic rules above. Ontology-related information cannot be changed after a user node has been created, at least as far as AAF is concerned. Any ontological content is normally visible to everyone else. Typically all users also see all existing privileges of another particular user. Outgoing privilege requests, i.e. those which are pending, will be listed only for that particular user and for those users who have a stake in the themes involved and whose privilege level is at least on par with the one requested. Otherwise the request will normally not be shown. Incoming privilege requests, i.e. those where a particular user has the privilege level to grant or reject the request, are listed for that very user. As a convenience this list will include past grants and rejections. 4.1.2 Theme Nodes Again, for the content itself the generic AAF rules apply. As themes are meant to be abstract, only their relationships with other themes can be modified. Every theme node will list the privileges on it, direct or derived.
4.2
Model Representation with Topic Maps
All AAF-related information can be mapped into a topic map, although auditing information recording events is less suited to this representation and will normally be kept separate. The initial topic map will have to contain the ontological commitments, namely that every node is either a general document node, a user node or a theme node: user subclasses node theme subclasses node The predicate User(u) is then true when u is a user node in the map. The same holds for themes. Another commitment is the list of actions involved. The predefined ones will appear in any case: read isa action edit isa action But more can and should be readily added. Any ordering between actions is represented by an association of type comparison: comparison (stronger: edit, weaker: read) From the totality of all comparisons the smallest (bottom) and the biggest (top) action follow. All nodes are directly represented as Topic Map topics. For theme topics any taxonometric information is directly modeled with the onboard means available for Topic Maps, specifically transitive subclassing and instance-of relationships. In any case we flag theme nodes to be instances of theme, e.g. finance isa theme In the same vein, user nodes are marked as instances of user: bill isa user All content nodes are simply instances of node. If such a node is affiliated with a theme, this is modeled with an otherwise arbitrary association, for example for budget_2008: dc:subject (node: budget_2008, theme: finance) Hereby we make use of the predefined subject property from the Dublin Core vocabulary. Whenever a user is granted a privilege relative to a theme, this will also be represented natively in the map: privilege (theme: finance, user: bill, action: edit) Requests for privileges look similar, except that the associations representing them are scoped as pending: privilege @ pending (theme: finance, .... )
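The mapping just described can be mimicked with a handful of tuples. The sketch below is an illustration only and is not tied to any Topic Maps toolkit; it merely shows how a pending request is distinguished from a granted privilege purely by its scope.

# associations as (association type, {role: player}, scope)
MAP = [
    ("isa",        {"instance": "finance",     "class": "theme"},  None),
    ("isa",        {"instance": "bill",        "class": "user"},   None),
    ("isa",        {"instance": "budget_2008", "class": "node"},   None),
    ("dc:subject", {"node": "budget_2008",     "theme": "finance"}, None),
    ("privilege",  {"theme": "finance", "user": "bill", "action": "edit"}, None),
    ("privilege",  {"theme": "finance", "user": "fred", "action": "edit"}, "pending"),
]

def privileges(theme, scope=None):
    # select privilege associations on a theme, optionally restricted to a scope such as "pending"
    return [roles for (kind, roles, s) in MAP
            if kind == "privilege" and roles["theme"] == theme and s == scope]

print(privileges("finance"))             # granted: [{'theme': 'finance', 'user': 'bill', 'action': 'edit'}]
print(privileges("finance", "pending"))  # pending: [{'theme': 'finance', 'user': 'fred', 'action': 'edit'}]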
5. Conclusions
The formalized version of the AAF is the end result of a series of prototype implementations using a scripting language together with one of the mainstream wiki packages (Perl + TWiki). In hindsight, that particular platform has not proven flexible enough for two adaptations necessary to an existing system: • the implantation of an ontological backbone, in which to host taxonometric and other semantic network information, and
•
the injection of an authorization layer to implement AAF’s functionality.
The main motivation for the formalization lay in the prospect of better analyzing security implications and describing possible attack vectors. It also provides a foundation for formulating statistical means to subvert an AAF-operated system. At this stage we have little operational experience with an AAF deployment, not only because of the unsuitability of the chosen platform, but mostly because of the lack of a mature Topic Maps implementation which allows quick scaling to hundreds of users and thousands of topics. This shortcoming has been addressed recently, so that a reimplementation with a CMS, but also a conceptual integration with content frameworks such as JSR-283 [11], can be attempted. To substantiate our claim that an AAF-driven system will cause and sustain an adequate and balanced privilege distribution, our efforts will have to concentrate on developing metrics to measure this balance. It is yet unclear whether such metrics will depend on the social setting, be it a corporate environment, a group of cooperating NGOs or independent individuals. 6.
Notes and References
[1] Barbera, Michele; Di Donato, Francesca. Weaving the Web of Science. HyperJournal and the Impact of the Semantic Web on Scientific Publishing. Proceedings of the 10th International Conference on Electronic Publishing, Bansko, Bulgaria, 14-16 June 2006.
[2] Oren, Eyal; Delbru, Renaud; Moeller, Knud; Voelkel, Max; Handschuh, Siegfried. Annotation and Navigation in Semantic Wikis. Proceedings of the First Workshop on Semantic Wikis, 2006, Ed. Max Voelkel.
[3] Costa Oliveira, Edgard; Lima-Marques, Mamede. An Architecture of Authoring Environments for the Semantic Web. Proceedings of the 10th International Conference on Electronic Publishing, Bansko, Bulgaria, 14-16 June 2006.
[4] Lassila, O.; Swick, R. Resource Description Framework (RDF) Model and Syntax Specification. Technical report, W3C.
[5] Garshol, Lars Marius; Moore, Graham. TMDM, ISO 13250-2: Topic Maps - Data Model, 2003-11-02.
[6] Barta, Robert. Knowledge-Oriented Middleware Using Topic Maps. TMRA 2007, Leipzig (to appear in TMRA 2007 Proceedings, Springer LNCS/LNAI).
[7] Pepper, S.; Moore, G. XML Topic Maps (XTM) 1.0. TopicMaps.Org, http://www.topicmaps.org/xtm/1.0/, 2001.
[8] Sowa, J., et al. Knowledge Representation: Logical, Philosophical and Computational Foundations. Brooks-Cole, Pacific Grove, 2000.
[9] International Organization for Standardisation. ISO/IEC 13250, Information technology – SGML Applications – Topic Maps. ISO, Geneva, 2000.
[10] Pepper, S. The TAO of Topic Maps – Finding the Way in the Age of Infoglut. Ontopia, http://www.ontopia.net/topicmaps/materials/tao.html, April 2002.
[11] JSR 283: Content Repository for Java Technology API Version 2.0, http://jcp.org/en/jsr/detail?id=283.
AudioKrant, the daily spoken newspaper Bert Paepen Katholieke Universiteit Leuven – Centre for Usability Research (CUO) Parkstraat 45 Bus 3605 - 3000 Leuven e-mail: bert.paepen@soc.kuleuven.be
Abstract Being subscribed to a newspaper, readers expect some basic things: receiving their paper in their mailbox early in the morning, being able to read it privately when and where they want, reading first what they find most interesting, etc. For people with a reading disability all this is not that obvious, as only a few accessible alternatives are around; accessible news on a daily basis virtually does not exist. Knowing that the number of visually disabled persons follows the rise in the ageing population, an increasing number of citizens is thus being debarred from a daily news reading experience. At present Belgium is one of the rare countries publishing a daily newspaper accessible to readers with a visual impairment, both in a Braille print and an electronic version. Notwithstanding major accessibility improvements over a printed newspaper, these newspapers still have some important barriers for many visually impaired readers. Reading requires specific skills and/or equipment, such as the ability to interpret Braille or the availability of a personal computer, a screen reader, a speech synthesizer or an internet connection. The goal of the AudioKrant project was to develop a new, universally accessible news publication with a minimal learning curve, aiming at a wide range of potential readers: the “talking newspaper”. Thanks to significant progress in text-to-speech technology it is today possible to produce a newspaper read by a computer voice that is understandable, has an acceptable speech quality and is even pleasant to listen to. This paper explains how the talking newspaper is produced, what formats and technology are used, what the current status and challenges are and what future improvements can be anticipated. Keywords: newspaper; accessibility; Daisy; DTB (digital talking books) 1.
Introduction
According to the European Blind Union 1 in 30 people are blind or partially sighted. Blindness and partial sight are closely associated with old age, so as people live longer the number of visually impaired persons is increasing. Nearly 90% of all blind and partially sighted people in Europe are over the age of 60, and two thirds are over the age of 65 [2] [3]. In several countries initiatives exist for publishing news to readers with a visual impairment. Mostly this takes the form of an audio book containing a daily or weekly selection of news articles, read by a human voice, or a Braille book, also with a selection of articles. Of course this is a major improvement for disabled readers but it is still far from the reading experience offered by a traditional print paper: accessible news should also be complete, recent and allow a private reading experience. At present Belgium is one of the rare countries publishing a daily newspaper accessible to readers with a visual impairment. Both a Braille print and an electronic version are published on a day to day basis. Subscribers can read these papers either by “feeling” the dots on Braille printed papers or by listening to a text-to-speech synthesizer on their computer [1]. Notwithstanding major accessibility improvements to a printed newspaper, the Braille and electronic newspapers still have some important barriers for many visually impaired readers. Reading requires specific
skills and/or equipment, such as the ability to interpret Braille or the availability of a personal computer, a screen reader, a speech synthesizer or an internet connection. Knowing that a growing number of elderly readers have difficulties reading a printed paper and at the same time are unable to learn Braille or to operate a computer, an increasing number of people are excluded from getting information from a newspaper. Given the rise in age-related visual disabilities there is a clear need for a new, universally accessible newspaper publication with a minimum learning curve. The AudioKrant project has developed such a “talking newspaper”, which is not only targeted at visually impaired persons, but also at elderly persons and people with a reading disability such as dyslexia, a motor disability or language problems. The aim was to arrive at a very simple product, requiring as few skills as possible and thus being accessible to a wide range of potential readers. This could include, for example, elderly persons whose sight does not allow them to read the printed paper, but who do not understand Braille or know how to operate a computer. For this reason the talking newspaper is distributed on a CD-ROM by surface mail. As the simplest solution it can be listened to by means of a “Daisy” player, but a computer with specialized software or even a regular MP3 player is also possible. This paper explains how the talking newspaper is produced, what formats are used, what the current status and challenges are and what future improvements can be anticipated. 2.
Daisy
In the DiGiKrant project we used the DAISY format, an XML standard for digital talking books (DTB), for producing accessible electronic newspapers. Several types of books can be stored using this format: audio books, text books and audio-text combinations [4]. For the DiGiKrant we used the text only variant. This makes the file sizes very small so that it can be easily transferred by e-mail. The downside is that the electronic text still needs to be converted to an accessible format by means of a Braille screen reader or a speech synthesizer on the reader’s personal computer. This requires the reader to own a computer, an internet connection, accessibility software and/or hardware, and the skills to operate all that. To avoid these possible barriers at the side of the reader, the talking newspaper includes both text and its audio representation (hence “talking” newspaper). Technically this means that the spoken version of the text is created at the producer’s instead of the reader’s side. As a consequence the reader does not need a computer: a small Daisy player or mp3 player will do (however it can still be read on a computer as well). A digital talking Daisy book contains a set of mp3 audio files, HTML text files, SMIL synchronization files and an HTML navigation file. Thanks to the latter the reader can browse through the book’s contents in a structural way, jumping to chapters and paragraphs or skipping to the next sentence. SMIL (Synchronized Multimedia Integration Language) [6] enables synchronization between the text and audio version of a book up to the smallest available navigation level. When reading a DTB on a computer, users can see and hear the book’s structure, navigate to its sections, and read through its paragraphs. Thanks to the SMIL synchronization, words or sentences can be highlighted at the moment when they are spoken (see Figure 1, displaying a newspaper fragment in the application EaseReader). Highlighting can be helpful for example for dyslectic readers. A DTB can also be read using a more compact Daisy player or even a regular mp3 player. The former will interpret the navigation file, so that structural reading is possible. A simple four arrow button interface makes operating a Daisy player very easy. Figure 2 displays a Daisy reader with a simple and straightforward design, with the usual play/pause, rewind and forward buttons known from a CD player. Four navigation buttons (up, down, left, right) in the middle of the device allow structural reading: the up and down buttons define the navigation level, the left and right buttons navigate through the items in the chosen level.
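As a schematic illustration of the text-audio coupling, the following sketch emits a SMIL-like fragment that pairs sentence ids in an HTML text file with clip offsets in an mp3 file. It is not a complete, DAISY-conformant SMIL document and is independent of the actual AudioKrant production code; the file names and ids are invented.

import xml.etree.ElementTree as ET

def smil_fragment(pairs, audio_src):
    """pairs: list of (text_fragment_id, clip_begin_seconds, clip_end_seconds)."""
    body = ET.Element("body")
    seq = ET.SubElement(body, "seq")
    for frag_id, begin, end in pairs:
        par = ET.SubElement(seq, "par")                 # one par per synchronized sentence
        ET.SubElement(par, "text", src=f"article.html#{frag_id}")
        ET.SubElement(par, "audio", src=audio_src,
                      clipBegin=f"{begin:.3f}s", clipEnd=f"{end:.3f}s")
    return ET.tostring(body, encoding="unicode")

print(smil_fragment([("sent1", 0.0, 3.2), ("sent2", 3.2, 7.9)], "article.mp3"))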
An example newspaper could contain three navigation levels: sections, subsections and articles. The reader will start with the first article in the first section. When the user pushes the down button, the device reads out the current navigation level (1). At a second push on the down button the device changes the level to 2 (corresponding to the subsections) and reads out this new level. After pushing the left button the reader navigates to the next subsection and starts reading its first article. This way, navigation is possible up to the level of individual sentences or even words, as long as the DTB is structured up to this level of detail.
Figure 1: reading a DTB on a computer
Figure 2: Daisy reader A list of Daisy players and software can be found in [5]. A classical mp3 player does not allow structural reading but can still play the successive paragraphs in a sequential order. 3.
Accessible newspaper
For obtaining an accessible newspaper the Daisy format is only a starting point: several key requirements determine if the newspaper will be really accessible to and usable for readers with a wide range of visual
disabilities. First, the newspaper structure is of key importance. It should be clear, well-organized and simple, and it should resemble the structure of a printed newspaper, using the same type of columns and a recognizable order of sections. Page numbering should allow referencing between the printed and the spoken version of the newspaper. In the audio newspaper typically four navigation levels are included, from sections (like Front page, Politics, Economics, Local news and Sports), subsections (like Soccer, Baseball and Basketball) and articles to the lowest level of individual sentences. Such a structure allows “structural reading” of the DTB, meaning that the reader starts from the navigational “tree” structure of the book to browse to a specific part of its content. Second, navigation through the paper should be straightforward and fast. Readers should have an immediate view on the paper’s contents, seeing the sections, subsections and the number of articles in each section. Figure 3 displays an extract newspaper structure, showing the section titles and the number of articles in each section between brackets. For the audio newspaper we chose to show both the number of articles and the number of subsections (if any) in each section. For example: “Sports (23 articles and 4 subsections)”.
Figure 3: example newspaper structure Readers should also be able to jump from one section to the other or from one article to the other and to quickly skip the remainder of a sentence or paragraph. The four button interface, described above, makes this possible if a sufficient level of detail is provided in the newspaper structure. This type of user interaction makes “sequential reading” more efficient. In the audio newspaper sequential reading is further improved by providing two types of tunes, marking the end of a news article or the end of a section. Without these tunes it could be unclear when a new article or section has started, as the reading software or hardware just continues reading. Finally, the quality of the newspaper’s contents, both text and audio, should be impeccable. This seems obvious, but with a (semi-)automatic production process it is not an easy goal to achieve. These requirements formed the basis of the analysis, design and implementation work for the “production wizard”, an application allowing the daily production of the accessible audio newspaper.
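A small sketch of how the announced section overview could be derived from a navigation tree is given below; the data structure and function are illustrative only, as the paper does not describe the production wizard at this level of detail.

def announce(section):
    """Return the spoken label for a section, e.g. 'Sports (23 articles and 4 subsections)'."""
    n_subs = len(section.get("subsections", []))
    n_articles = len(section.get("articles", []))
    n_articles += sum(len(s.get("articles", [])) for s in section.get("subsections", []))
    label = f"{section['title']} ({n_articles} articles"
    if n_subs:
        label += f" and {n_subs} subsections"
    return label + ")"

sports = {"title": "Sports",
          "articles": [],
          "subsections": [{"title": "Soccer", "articles": ["a1", "a2"]},
                          {"title": "Baseball", "articles": ["a3"]}]}
print(announce(sports))   # Sports (3 articles and 2 subsections)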
4. Production process and challenges
The accessible newspaper in its three forms (Braille, text and audio) is produced at nighttime between the
journalists’ deadline and the postal truck’s departure time, leaving a very short production time (less than 2 hours). For this reason the production process of the spoken newspaper was optimized for efficiency, leaving little room for error and manual intervention.
[Figure 4 diagram: news articles, TV schedules and stock exchange figures feed a single production step, which outputs De AudioKrant (audio), De DiGiKrant (text) and De BrailleKrant (braille).]
Figure 4: accessible newspaper production in three forms As a first production step content is gathered from various sources, including newspaper articles, TV schedules and stock exchange figures. Most of the input files are in XML format, like the news article displayed in Figure 5, while some still use a text based format with simplified tags, like the stock exchange example in Figure 6.
Figure 5: news article source fragment
Figure 6: stock exchange figures fragment
All these files are filtered to improve the quality of the resulting newspaper and are converted to a central XML format. Thanks to such a centralized format the conversion software is flexible in processing any type of input into any type of accessible output. Filtering is important for obtaining a high quality electronic publication from a source that is intended purely for paper publishing. An article title for example might be missing from the source file because it was only available in graphical form for printing. In this case a title needs to be generated from the article content. In the next step the input file is inserted at the right place in the newspaper structure; the structure is gradually built up when new files arrive. Building such a structure is one of the most difficult tasks in the automatic production process: while the input is article-oriented (each file contains only one news article) the output is newspaper-oriented (containing the full structure). During the third production step, depicted using the Daisy logo in Figure 7, news articles are converted into their “spoken” version using a text-to-speech converter (or speech synthesizer) such as RealSpeak [7]. Immediately a SMIL synchronization file is created, linking the written text to its spoken audio representation. Because of time constraints it is impossible to have a full daily newspaper read by a human voice, knowing that a complete newspaper can take up to 20 hours of speech. Speech generation software has improved immensely since the typical computer voices from the early days, creating speech that is not only understandable, but even pleasant to listen to. Speech can be improved further by a rule set, defining how certain characters should be read (for example: & should be “and” instead of “ampersand”), and a pronunciation dictionary, defining how specific words should be read.
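Such pre-processing might look roughly as follows; the rule set and dictionary entries are invented examples, and the actual rules used for AudioKrant are not listed in this paper.

import re

CHARACTER_RULES = {"&": " and ", "%": " percent "}          # how certain characters should be read
PRONUNCIATIONS = {"UEFA": "oo-ay-fa"}                        # how specific words should be read

def prepare_for_tts(text):
    # apply the character rule set, then the pronunciation dictionary, then normalize whitespace
    for char, spoken in CHARACTER_RULES.items():
        text = text.replace(char, spoken)
    for word, spoken in PRONUNCIATIONS.items():
        text = re.sub(rf"\b{re.escape(word)}\b", spoken, text)
    return re.sub(r"\s+", " ", text).strip()

print(prepare_for_tts("Standings & results: UEFA cup"))
# Standings and results: oo-ay-fa cup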
Figure 7: Audio newspaper production process Finally the output is created as a complete DTB in Daisy format, containing the navigation file, the news articles in text format and in audio format (mp3), and the SMIL files. This end result is burned to a CD-ROM, duplicated, packed and sent by surface mail to the subscribers overnight. As with a regular newspaper,
subscribers should receive their audio newspaper in their mailbox in the morning. Several technical challenges arise from the fact that the source material, received from the newspaper publisher, is optimized for print rather than for a digital and accessible product. Some information is only available in graphic form, leaving no room for an accessible version, tables are poorly exported and even some article headlines are missing. The first solution for these problems was to try obtaining better source data from the publisher, for example sports results in structured tables. In some cases, where the publisher cannot provide better quality data, the production software tries to improve quality by means of several text filters. A second major project challenge was the time available for production. While the deadline for journalists is around 21:00 h, the first articles become available in XML format from around 22:00 h. At 00:30 h the first shipment is leaving, giving only about 2.5 hours for the production of the accessible newspapers. Every aspect of the production software was designed and developed for optimal production speed, for example running several speech processors at the same time in parallel threads, as the text-to-speech module is the most CPU-intensive task of the entire process. Today the total production time for all newspapers (in total about 500 MB in file size) averages around 1h20min, not including the time needed for duplicating, packing and transporting to the shipping department.
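The parallelization mentioned above could be organized along the following lines; synthesize() is a placeholder for the external text-to-speech call, and the thread count and file layout are assumptions rather than details given in the paper.

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def synthesize(article_path: Path) -> Path:
    """Placeholder for the external text-to-speech call, producing an mp3 next to the article."""
    mp3 = article_path.with_suffix(".mp3")
    # ... invoke the TTS engine here ...
    return mp3

def produce_audio(article_paths, workers=4):
    # the TTS step is the CPU-heavy part, so several articles are converted in parallel
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(synthesize, article_paths))

produce_audio([Path("front_page/article1.xml"), Path("sports/article2.xml")])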
5. Future work
The audio newspaper was developed between May 2007 and May 2008 and was launched on June 2nd, 2008 with a press conference and seminar on June 6th in Brussels. As of that date the production wizard is operational for the accessible newspapers’ production crew, allowing them to publish their products on a day-to-day basis. After a few months of beta testing it is clear that, although we are ready to produce a daily audio newspaper of acceptable quality, not all technical challenges have been conquered yet. Especially the time constraints, tied to the physical delivery of the CD-ROMs and the late availability of source material, are still a daily challenge. As a result the first delivery group (leaving at 00:10h) today receives a newspaper with less content and structure than the second group (leaving at 2:00h). We are working on two levels to solve the problems related to the short production time. As a first solution, the publisher is working on an earlier delivery of the source material. Of course the journalists’ deadline cannot be changed, but the accessible newspaper production process does not have to wait for the paper production process before starting. We are now trying to obtain news articles already when they are positioned in the newspaper’s layout, even if they are not set up yet on the printing plate. This gives some extra production time for the audio newspaper. A second solution for the (too) short production time could lie in non-physical distribution channels such as the internet. Being distributed over the Internet to its subscribers, the audio newspaper production could be postponed until later at night, when all source material is available. Although this might sound like an unnatural distribution channel, given that the target audience does not have a computer or internet connection, several user-friendly solutions exist for bringing the content to the reader automatically. The ORION Webbox for example is a device that downloads new content from the internet overnight, allowing the reader to start enjoying their fresh newspaper as soon as they get up in the morning, all without manual intervention. Knowing that an audio newspaper in mp3 format averages about 350 MB of data per day, a decent internet connection is necessary. As soon as this type of distribution is used, the increased production time allows new features improving the newspaper’s quality, such as personalization. Subscribers could choose in which types of content they are interested (e.g. sports but not economy) and receive a customized newspaper every day.
6.
Conclusions
To anyone hearing computer-generated voices during the 1980s, it was unimaginable that people could listen to such a voice reading an entire newspaper. Significant progress in text-to-speech technology today allows products such as a spoken newspaper that is understandable, has an acceptable speech quality and is even pleasant to listen to. One of the major achievements of the spoken newspaper for visually impaired persons is that it gives them back the opportunity to enjoy a daily, individual and private news reading experience. With a small player one can read anywhere, anytime and at one’s own pace without needing any assistance. In a world where ubiquitous information access is becoming commonplace, this can help impaired persons to be included and overcome the digital divide. 7.
Notes and References
[1] Paepen, B.; Engelen, J. Braillekrant and DiGiKrant: a Daily Newspaper for Visually Disabled Readers. In Proceedings of the 9th ICCC International Conference on Electronic Publishing, June 2005. Leuven, Belgium: Peeters Publishing Leuven, pp. 197-202.
[2] Eurostat. Health statistics – Key data on health 2002 – Data 1970-2001. http://epp.eurostat.ec.europa.eu/cache/ITY_OFFPUB/KS-08-02-002/EN/KS-08-02-002-EN.PDF, p. 144, 2004.
[3] European Blind Union. A Vision for Inclusion - A Guide to the European Blind Union. http://www.euroblind.org/fichiersGB/visincen.html, 2004.
[4] Daisy Consortium. Technology Overview - What is a DTB? http://www.daisy.org/about_us/dtbooks.asp, 2008.
[5] Daisy Consortium. Playback Tools. http://www.daisy.org/tools/tools.shtml?Cat=playback, 2008.
[6] W3C. Synchronized Multimedia. http://www.w3.org/AudioVideo/, 2008.
[7] Nuance Communications, Inc. RealSpeak. http://www.nuance.com/realspeak/, 2008.
A Semantic Web Powered Distributed Digital Library System Michele Barbera1; Michele Nucci2; Daniel Hahn1; Christian Morbidoni2 1 Net7 – Internet Open Solutions, Via Marche 8/a, 56123 Pisa, Italy e-mail: barbera@netseven.it; hahn@netseven.it 2 Dipartimento di Elettronica, Intelligenza artificiale e Telecomunicazioni, Università Politecnica delle Marche, Via Brecce Bianche, 60100 Ancona, Italy e-mail: mik.nucci@gmail.com; christian@deit.univpm.it
Abstract Research in Humanities and Social Sciences is traditionally based on printed publications such as manuscripts, personal correspondence, first editions and other types of documents which are often difficult to obtain. An important step to facilitate humanities and social sciences scholarship is to make digital reproductions of these materials freely available on-line. The collection of resources available on-line is continuously expanding. It is now necessary to develop tools to access these resources in an intelligent way and search them as if they were part of a unique information space. In this paper we present Talia, an innovative distributed semantic digital library, annotation and publishing system, which is specifically designed for the needs of scholarly research in the humanities. Talia is strictly based on standard Semantic Web technologies and uses ontologies for the organization of knowledge, which can help the definition of a highly adaptable and state-of-the-art research and publishing environment. Talia provides an innovative and flexible system which enables data interoperability and new paradigms for information enrichment, data retrieval and navigation. Moreover, digital libraries powered by Talia can be joined into a federation, to create a distributed peer-to-peer network of interconnected libraries. In the first three sections we will introduce the motivations and the background that led to the development of Talia. In sections 4 and 5 we will describe Talia’s architecture and the Talia Federation. In sections 6 and 7 we will focus on Talia’s specialized features for the Humanities domain and its relations with the Discovery Project. In section 9 we will describe Talia’s widget framework and how it can be used to customize Talia for other domains. In the final section we will compare Talia with related technologies and platforms and suggest some possible future research and development ideas. Keywords: digital library; semantic web; humanities. 1.
Introduction
In the last few years the amount of digital scholarly resources in the Humanities has grown substantially thanks to the efforts of many collection holders who digitized their materials and published them on-line. However, to date, many digital library projects can be characterized as both strongly hierarchical (top-down) and disconnected. Materials are selected for reformatting and inclusion by librarians, archivists and curators from their own collections to the presumed benefit of their patrons but with little actual consent from them. Collections are often assembled with little regard for existing complementary materials, leaving it to the end-user to make and sustain the connections across collections that remain fundamentally siloed, with no way to establish permanent semantic connections between their contents. In digital research libraries there is no longer any need to abide by the restrictions of physical organizational schemes or even physical location. New research libraries can and should be built across collections and across libraries. On the other side of this spectrum lie the large-scale aggregation initiatives, such as The European
Library [1] or OAIster [2], which serve as general purpose digital libraries but fail to provide the depth needed for research-level scholarship. The emergence of web 2.0 has resulted in a number of tools and technologies for annotation and personalization of resources, but these tools have yet to gain a strong foothold in a humanistic academic setting. We believe that Semantic Web Technologies have the potential to glue together the opposing needs of maintaining the context in which the collections originate, by leaving them under control of their holders, and at the same time making resources part of a global structured knowledge space that is independent of a single centralized authority or aggregation service. 2.
Semantic Web and Ontologies
The Semantic Web is an extension of the current Web in which information can be expressed in a machine-understandable format and can be processed automatically by software agents. The Semantic Web enables data interoperability, allowing data to be shared and reused across heterogeneous applications and communities [3]. The Semantic Web is mainly based on the Resource Description Framework (RDF) [4], by which it is possible to define relations among different data, creating semantic networks. RDF’s main strength is simplicity: it is a network of nodes connected by directed and labelled arcs (figure 1).
Figure 1: An example of an RDF Semantic Network The nodes are resources or literals (values), while arcs are used to express properties of resources. In the Semantic Web a resource is anything that can be somehow identified using a specific identifier. In particular, the identifiers used in the Semantic Web are known as Uniform Resource Identifiers (URIs). In the Semantic Web, ontologies are used to organize information and formally describe concepts of a domain of interest. An ontology is essentially a vocabulary which includes a set of terms and relations among them. Ontologies can be developed using specific ontology languages such as the RDF Schema Language (RDFS) and the Web Ontology Language (OWL).
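A minimal sketch of the triple model follows, using the Python rdflib library purely as a convenient notation (the paper does not prescribe any particular library); the example.org namespace, class names and property names are invented.

from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")        # illustrative namespace, not a real vocabulary
g = Graph()

# a tiny RDFS "ontology": Manuscript is declared a subclass of Source
g.add((EX.Manuscript, RDFS.subClassOf, EX.Source))

# and one resource described with it: nodes connected by labelled arcs
g.add((EX.ms42, RDF.type, EX.Manuscript))
g.add((EX.ms42, RDFS.label, Literal("An example manuscript")))
g.add((EX.ms42, EX.heldBy, EX.someArchive))

print(g.serialize(format="turtle"))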
3.
Digital libraries for all
In recent years, decreasing digitization and storage costs as well as the emergence of many easy-to-deploy Open Source content management systems and digital object repositories have led to the multiplication of small digital libraries run by smaller institutions. Despite their limited size, the collections of these libraries sometimes include cultural masterpieces. Unfortunately, due to limited resources, these libraries cannot afford to invest in professional digital library management platforms that are either too expensive in terms of license costs or too expensive to maintain because of their complexity. Talia is an Open Source semantic digital library management system that is easy to deploy and maintain. Building a Talia-based digital library does not require any advanced software development and management skills that smaller cultural institutions may not possess. Additionally, Talia is a distributed library system, meaning that it makes it possible to build virtual collections that go beyond the boundaries of a single archive without requiring the underlying content providers to lose control over their holdings. For all the reasons stated above, Talia aims at being a complete though powerful solution for the needs of smaller institutions’ digital libraries. 4.
The Talia Platform
Talia is a distributed semantic digital library system which has been specifically designed for the needs of scholarly research in the social sciences and Humanities. Talia combines the features of a digital archive management system with an on-line peer review system; thus it is capable of combining a digital library with an electronic publishing system. Talia is able to handle a wide range of different kinds of resources such as texts, images and videos. All the resources published in Talia are identified by a stable URI: documents can never be removed once they are published and are maintained in a fixed state in perpetuity. This, combined with other long-term preservation techniques, allows scholars to reference their works and gives the research community immediate access to new contents. One of the most innovative aspects of Talia is that it is completely based on Semantic Web Technologies, which enable deep data interoperability with other Semantic-Web-aware tools and applications. In particular, the Talia Knowledge Base is kept in RDF and is formally described using RDFS/OWL ontologies. Talia natively supports heterogeneous data sets whose metadata schemes can be very different from each other; therefore the system is not based on a predefined ontology. Talia embeds only a very broad structural ontology, which contains only general concepts and basic relations to link resources. Research communities are encouraged to develop their own domain ontologies to describe knowledge and content in their domains of interest. The domain ontologies can be developed using standard ontology languages such as RDFS and easily imported into the library’s data store.
Figure 2: Talia Architecture
Talia also includes facilities for semantic data-enrichment, data-annotations and data-retrieval as well as a lot of other specific tools. A complete overview of the Talia architecture as well as the underlying technical details can be seen in [5] and [6]. 5.
Distributed semantic digital libraries
Semantic digital libraries based on Talia can be joined in a Talia Federation to create a peer-to-peer network of interconnected libraries (nodes). Talia provides a mechanism to share parts of its knowledge base. This mechanism is based on a REST interface, an approach proposed in [7]. By using this feature a node can notify another node that a semantic link between them exists. This information can then be used by the notified node to update its own knowledge base to create a bidirectional connection between the contents. This feature allows individual scholarly communities, each one managing a single node of the federation, to retain control over their own content while at the same time having a strong interaction with the other nodes' content. Talia works both as a Digital Library and as an Open Access publisher of original contributions. In a digital library a notion of absolute quality can be acceptable even outside the boundaries of the community who manages the library. On the other hand, the concept of quality for newly published contributions varies significantly with culture and context; therefore each community must retain control over what their users see through that community web site. The approach used in Talia is to let each node decide which other federation nodes it trusts. Notifications of incoming links will then be processed only if they come from trusted nodes. The result for the end-user is that they see backlinks only to content held in trusted sources. At any time, a node administrator can modify the trust policy and recover the notifications that have been previously filtered out by the trust engine. Talia also features a single sign-on mechanism based on OpenID [8]. A Talia node acts as an OpenID client. Depending on its own policies a federation can run its own OpenID identity server or choose to rely on any existing external service. Each federation node keeps a copy of the user credentials, and user roles and permissions are managed locally. As any other of its components, the authentication and authorization component of Talia is pluggable and completely modular. Therefore, it will be possible in the future to develop specific authentication and authorization components based on other infrastructures, such as a more institution-centric approach based on the Shibboleth [9] model, or on any other legacy model.
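Such a notification could be sketched as a plain HTTP POST, as below; the endpoint path and JSON payload are invented for illustration, since the actual Talia federation API is not specified in this paper.

import json
import urllib.request

def notify_backlink(remote_node, local_resource, remote_resource, predicate):
    """Tell a (trusted) federation node that a semantic link to one of its resources exists.
    The URL path and payload shape are hypothetical, not Talia's real interface."""
    payload = json.dumps({"subject": local_resource,
                          "predicate": predicate,
                          "object": remote_resource}).encode("utf-8")
    req = urllib.request.Request(f"{remote_node}/federation/backlinks",
                                 data=payload,
                                 headers={"Content-Type": "application/json"},
                                 method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status        # the notified node decides, via its trust policy, what to do

# notify_backlink("http://nodeB.example.org",
#                 "http://nodeA.example.org/sources/ms42",
#                 "http://nodeB.example.org/sources/letter7",
#                 "http://example.org/refersTo")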
6. Digital Humanities
Projects in the domain of Digital Humanities deal with an incredible amount of different types of data (ranging from manuscript reproductions to statistical linguistic metadata, pictures of historically relevant places, maps and many different kinds of books in diverse digital formats, just to name a few). The level of standardization of data and metadata formats, but especially of the research process, is very limited compared with the natural sciences. Another important characteristic of this domain is that the level of computer literacy of Humanities scholars tends to be rather low compared to scholars in other sectors. These two facts (heterogeneity of data and processes, and low computer literacy of the users) suggest the need for an electronic environment that must be extremely flexible, adaptable and extensible but at the same time integrated and easy to use. Talia is a coherent and easy-to-use web-based working environment that integrates a set of features that are usually scattered across many different desktop and web-based tools rather than condensed in a unique environment. These tools include, for example, XML tagging and transformation, image and audio analysis, linguistic text analysis, manuscript annotation, electronic edition editing and so on. Talia’s widget system makes it possible to extend the core engine and easily integrate these tools into a unique
infrastructure. In the context of the Discovery project we will develop a limited amount of these tools as well as the documentation on how to develop additional plug-ins. We hope that the Open Source community and other Digital Library projects will contribute additional plug-ins in the near future. 7.
The Discovery project
The Discovery project [10], funded by the European Commission under the eContentplus programme, aims at the creation of a federation of digital libraries dedicated to different authors and themes of ancient, modern and contemporary philosophy. Talia was born in the context of the Discovery project to serve as its technological infrastructure. The federated libraries that are part of the Discovery federation are unique in that they function at the same time as traditional digital libraries and as Open Access publishers of original contributions. With this model, Discovery aims at stimulating the production of new knowledge by aggregating scholarly communities around thematic repositories of both primary sources (like manuscripts and first editions) and original contributions submitted by the scholars. Additionally, the nodes of the Discovery federation can also store and publish semantic annotations, which are another type of user contribution. In Discovery, there are two main categories of resources (called “Sources” in Discovery): primary and secondary sources. Primary sources are all the resources that belong to the digital library. These resources have been collected, digitized and published by the institution that runs the digital library. Secondary sources are all the resources that belong to the Electronic Publisher component of a Talia node. These resources have been submitted by the users and have passed through a peer-review process before being published. In Discovery there are four content providers, each of them managing its own instance of Talia. Each provider has its own Domain Ontology that specifies which types of Primary and Secondary Sources they deal with. Each content provider also has its own peer-review policies and procedures and user interfaces. In addition to running domain-specific Digital Libraries and Open Access Electronic Publishers, the content providers also engage in what is referred to as “Semantic Enrichment”. By using a tool called Philospace [10], domain specialists can semantically annotate the Sources published in Talia to add new semantic relations among them. As any other user generated content, the semantic annotations also go through the peer review process. If the annotations pass the peer-review scrutiny they are published into Talia and become available to end users. There is no limit on the meaning of the annotations, which may range from simple metadata added to a Source to complex relations to philosophical concepts expressed in a domain thesaurus. The only requirement is that the annotations must be based on an “Annotation Domain Ontology” that is loaded both into the annotation tool and into Talia. More details about the Discovery project are available on its website [10]. 8.
Item-centric vs relation-centric perspectives
In scholarly environments, where Talia is mostly expected to be used, the context in which each individual object is placed is of extreme importance. It is often by exploring the context, that is the relations that each object has with others, that new discoveries are made. As an example, consider the figure below. The interface shown there is part of Hyper, the software from which Talia derives. A similar interface is currently being re-implemented in Talia. It has been designed to visualize a particular type of relation that a set of manuscripts have among themselves. In particular, this interface allows the user to visualize a path of the genesis of a philosophical idea, from its conception in the manuscript to its publication in a printed book, through its evolution and refinements in successive manuscripts and pre-printing copies. The following figure is an alternative visualization of one of the resources shown in the previous figure. This view, called
A Semantic Web Powered Distributed Digital Library System
“rhizome view” shows all the “paths” that pass through a certain resource.
Figure 3: Path widget
Figure 4: Rhizome widget
The meaning of these two alternative visualizations clearly differs, but the interesting point is that in both interfaces the focus is on the relations among resources rather than on the resources themselves. Having alternative interfaces that allow the user to focus on an individual resource as well as on its relationships with other resources is one of the characteristics that make Talia a scholarly tool rather than a simple digital object repository. 9.
User Interface Widgets and Source Tool plugins
Talia is meant to be used to publish a vast amount of heterogeneous digital objects and resources. Scholarly communities are often interdisciplinary and their research output embraces more than one scientific domain. We believe that general-purpose user interfaces are unable to match the complexity of these contexts and fail to properly address the diverse needs of the users. Therefore, Talia provides a flexible and modular user interface framework based on widgets. Widgets are distributed independently and can be used as building blocks for customizing the application’s user interface. Talia’s widget engine offers a high-level framework that application developers can use to build community-specific user interfaces without programming low-level details. Widgets can easily be plugged into the default user interface.
Figure 5: Default user interface. The Source is shown on the right-hand side. The bar on the left lists semantic relations with other sources. Related sources are clickable.
The following figure shows an example of a Talia Semantic Navigation Interface, based on a widget that directly interacts with the Talia Knowledge Base, using metadata and ontologies to display information.
Figure 6: Semantic relations widget
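To make the interaction between such a widget and the knowledge base more concrete, the following minimal Python sketch (illustrative only: Talia itself is written in Ruby, and the namespace, property names and URIs below are invented) builds a toy RDF graph with rdflib and runs the kind of SPARQL query a semantic-relations widget might issue for a single Source:

# Illustrative only: a toy RDF graph queried the way a semantic-relations
# widget might query a knowledge base. All URIs and properties are invented.
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/talia/")  # hypothetical namespace

g = Graph()
manuscript = URIRef(EX["manuscript/n-1886-b"])
edition = URIRef(EX["edition/first-printed-edition"])

# A few hand-made triples standing in for the knowledge base content.
g.add((manuscript, RDF.type, EX.Manuscript))
g.add((manuscript, RDFS.label, Literal("Notebook N, 1886")))
g.add((manuscript, EX.precedes, edition))
g.add((edition, RDFS.label, Literal("First printed edition")))

# The widget's job: given one Source, list its outgoing relations and targets.
query = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?relation ?target ?label WHERE {
        ?source ?relation ?target .
        OPTIONAL { ?target rdfs:label ?label }
    }
"""
for row in g.query(query, initBindings={"source": manuscript}):
    print(row.relation, "->", row.target, row.label)

In Talia, a query of this kind would run against the node’s RDF store, with the results rendered as the clickable list of related Sources described above.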
We plan to host a library of Open Source widgets on Talia’s website that developers can use to distribute their widgets. In addition to the widgets component, Talia also includes another kind of plug-in called Source Tools. These tools are behaviours that can be attached to a specific type of digital object (called “Sources” in Talia). For example, a Source of type manuscript edition that includes a data object of type TEI-XML may have a Source Tool that allows the user to perform some kind of linguistic analysis on the text. A Source of type manuscript whose data objects are images representing the manuscript may have a Source Tool to OCR the text from the images. The rhizome view shown above is another example of a Source Tool applied to a subset of Sources of type manuscript.
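Talia’s real plug-in API is not reproduced here; the short Python sketch below (all names hypothetical) only illustrates the general pattern behind Source Tools, in which behaviours are registered against a Source type and looked up when a Source of that type is displayed:

# Hypothetical illustration of the "Source Tool" idea: behaviours registered
# against a Source type. Names and types are invented; this is not Talia's API.
from typing import Callable, Dict, List


class Source:
    def __init__(self, uri: str, source_type: str):
        self.uri = uri
        self.source_type = source_type  # e.g. "Manuscript", "ManuscriptEdition"


# Registry mapping a Source type to the tools that can operate on it.
SOURCE_TOOLS: Dict[str, List[Callable[[Source], str]]] = {}


def source_tool(source_type: str):
    """Decorator registering a function as a tool for one Source type."""
    def register(func: Callable[[Source], str]) -> Callable[[Source], str]:
        SOURCE_TOOLS.setdefault(source_type, []).append(func)
        return func
    return register


@source_tool("Manuscript")
def ocr_facsimile(source: Source) -> str:
    # A real tool would run OCR over the page images attached to the Source.
    return f"OCR requested for {source.uri}"


@source_tool("ManuscriptEdition")
def linguistic_analysis(source: Source) -> str:
    # A real tool would parse the TEI-XML transcription attached to the Source.
    return f"Linguistic analysis requested for {source.uri}"


def tools_for(source: Source) -> List[Callable[[Source], str]]:
    """Return the tools applicable to a given Source, as the UI would."""
    return SOURCE_TOOLS.get(source.source_type, [])


if __name__ == "__main__":
    ms = Source("http://example.org/talia/manuscript/n-1886-b", "Manuscript")
    for tool in tools_for(ms):
        print(tool(ms))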
10.
Conclusions and Related works
Talia is an innovative distributed semantic digital library system which aims at improving scholarly research in the humanities by avoiding fragmentation of materials. Using standard Semantic Web technologies, ontologies to organize information and a completely customizable user interface framework, Talia represents a state-of-the-art research and publishing environment for the humanities. Talia is distributed with an Open Source license and is easy to install and configure, so that it can be used to build single digital libraries and electronic publishing venues at very low cost. Talia nodes can then be joined together in a federation to create virtual collections that cross the borders of a single library or organization. At the same time, Talia’s full compliance with Semantic Web standards ensures deep interoperability with other Semantic Web tools and applications. Talia shares some properties with other semantic digital library systems such as JeromeDL [11], BRICKS [12] and Fedora [13]. These projects are, however, mostly focused on back-end technology and can hardly be deployed in low-tech environments such as small archives, libraries and museums. None of these tools offers the tight coupling between the semantic knowledge base and the flexible user interface framework provided by Talia. BRICKS is an architecture composed of a set of generic foundational components (called “core and basic services”) plus a number of additional specialized services (called “Pillars”) that can be invoked by applications as remote services. A BRICKS node (called a BNode) is an application that uses these services and interacts with other BNodes within a BRICKS network. BRICKS is therefore a huge infrastructure that requires a significant amount of central coordination to maintain the basic services. From the point of view of the individual institution that wants to join the network, BRICKS provides a set of very useful services on top of which each content provider should develop its own application and user interface. We believe that within the humanities, and in the cultural institutions sector in general, it is very uncommon for organizations to have access to the know-how, budget and organizational capacity needed to deploy such a complex product. Moreover, even though the technology itself has a decentralized architecture, BRICKS relies on ad-hoc components (the “core and basic services”) that depend on the availability of remote services in the network. In this way the need for centralized coordination is shifted from the technological level to the organizational and managerial level. We believe that the lack of organizational and managerial coordination is one of the weak spots of the humanities scholarly community. Therefore, the approach of Talia is to minimize the effort needed to set up and deploy a Talia node. Additionally, a Talia federation does not rely on any legacy coordination and knowledge exchange protocol (such as the BRICKS P2P component). Talia is built entirely on top of very simple Semantic Web standards such as RDF, HTTP and URIs. In short, Talia is designed to run out of the box in order to make it possible for smaller cultural institutions (which may have no computing staff, or
only a very small one) to contribute to a Semantic Web of Culture. The similarity between Talia and Fedora is that both allow relations between objects to be expressed in RDF. However, “...Fedora is a digital asset management (DAM) architecture, upon which many types of digital library, institutional repositories, digital archives, and digital libraries systems might be built. Fedora is the underlying architecture for a digital repository, and is not a complete management, indexing, discovery, and delivery application...”. As with BRICKS, Fedora is suitable for developing and deploying very large digital library applications. Talia instead aims to be a complete out-of-the-box application that comes with a pre-defined, generic and complete user interface. Talia also has a modular architecture that makes it easy to extend its features and customize its interface by developing plug-ins and UI widgets. JeromeDL is the application most similar to Talia that currently exists. Like Talia, JeromeDL is fully based on simple Semantic Web standards, works out of the box and is extensible through plug-ins. Apart from the different languages in which the two applications are written (JeromeDL is written in Java and Talia in Ruby), the main difference lies in their primary target user group. JeromeDL’s primary target audience is the generic user of a digital library, while Talia will also include default user interfaces and tools targeted at humanities scholars. At the time of writing, Talia is still in alpha stage and a first stable public release is planned for October 2008. The first release will include a set of visualization widgets specifically meant for handling Discovery content as well as a full-featured on-line peer-review system. The first release will also include an adapter for Philospace, a semantic annotation tool based on the DBin platform [14][15], briefly introduced in Section 7. In the meantime, Talia is being customized for other applications in the cultural heritage and digital library domains. In particular, additional research is being performed on the integration of Semantic Web based bibliometric tools. The focus of this research activity is to exploit the Semantic Web to explore bibliometric models and impact measures that can be used as alternatives to traditional impact indicators such as the Impact Factor. Some of these models have been proposed in [16] and [17]. Other areas of future improvement include the development of an infrastructure for collaborative ontology editing and mapping as well as an ontology library for the humanities and cultural heritage domains. Finally, we are also studying the integration of Talia with archival cataloguing software and standards, such as EAD editors and archival data management systems, with the objective of making Talia a suitable product for interlinking heterogeneous data and resources from the library, archival and museum domains to create digital collections of cultural resources. 11.
Acknowledgements
This work has been supported by Discovery, an ECP 2005 CULT 038206 project under the EC eContentplus programme. 12.
Notes
1 http://www.w3.org/TR/rdf-schema/
2 http://www.w3.org/TR/owl-features/
13.
References
[1] The European Library Portal, [http://www.theeuropeanlibrary.org/portal/index.html]
[2] OAIster, [http://www.oaister.org/]
[3] W3C Semantic Web Activity, [http://www.w3.org/2001/sw/]
[4] RDF Primer, W3C Recommendation, [http://www.w3.org/TR/rdf-primer/]
[5] Nucci, M., David, S., Hahn, D., Barbera, M., Talia: A Framework for Philosophy Scholars, in Proceedings of Semantic Web Applications and Perspectives, Bari, Italy, 2007.
[6] Talia Wiki, [http://trac.talia.discovery-project.eu/]
[7] Fielding, R.T., Architectural Styles and the Design of Network-based Software Architectures, PhD thesis, UC Irvine, 2000.
[8] OpenID Web Site, [http://openid.net/]
[9] Shibboleth Web Site, [http://shibboleth.internet2.edu/]
[10] Discovery Web Site, [http://www.discovery-project.eu/]
[11] Kruk, S., Woroniecki, T., Gzella, A., Dabrowski, M., McDaniel, B., Anatomy of a social semantic library, in: European Semantic Web Conference, Volume Semantic Digital Library Tutorial, 2007.
[12] Risse, T., Knezevic, P., Meghini, C., Hecht, R., Basile, F., The BRICKS infrastructure - an overview, in The International Conference EVA, Moscow, 2005.
[13] Fedora Development Team, Fedora open source repository software: White paper, Fedora Project, 2005.
[14] Tummarello, G., Morbidoni, C., Nucci, M., Enabling Semantic Web communities with DBin: an overview, in Proceedings of the Fifth International Semantic Web Conference ISWC 2006, Athens, GA, USA, 2006.
[15] DBin Web Site, [http://www.dbin.org/]
[16] Bollen, J., Van de Sompel, H., Smith, J., Luce, R., Towards alternative metrics of journal impact: a comparison of download and citation data. Information Processing & Management, Volume 41, Issue 6, pp. 1419-1440, Dec. 2005.
[17] Barbera, M., Di Donato, F., Weaving the Web of Science: HyperJournal and the impact of the Semantic Web on scientific publishing, in Martens, B., Dobreva, M. (Eds.), Proceedings ELPUB: International Conference on Electronic Publishing, pp. 341-348, Bansko, Bulgaria, 2006.
No Budget, No Worries: Free and Open Source Publishing Software in Biomedical Publishing Tarek Loubani1, Alison Sinclair1, Sally Murray1, Claire Kendall1, Anita Palepu1, Anne Marie Todkill1, John Willinsky2 1 Editorial Team, Open Medicine 2409 Wyndale Crescent, Ottawa, Ontario, K1H 8J2, Canada email: tarek@tarek.org, alison.sinclair@mac.com, smurray@openmedicine.ca, ckendall@openmedicine.ca, apalepu@openmedicine.ca 2 Board of Directors, Open Medicine 2409 Wyndale Crescent, Ottawa, Ontario, K1H 8J2, Canada e-mail: john.willinsky@ubc.ca
Abstract Open Medicine (http://www.openmedicine.ca) is an electronic open access, peer-reviewed general medical journal that started publication in April 2007. The editors of Open Medicine have been exploring the use of Free and Open Source Software (FOSS) in constructing an efficient and sustainable publishing model that can be adopted by other journals. The goal of using FOSS is to minimize the use of scarce financial resources and maximize return to the community by way of software code and high quality articles. Using information collected through archived documents and interviews with key editorial and technical staff responsible for journal development, this paper reports on the incorporation of FOSS into the production workflow of Open Medicine. We discuss the different types of software used; how they interface; why they were chosen; and the successes and challenges associated with using FOSS rather than proprietary software. These include the flagship FOSS office and graphics packages (OpenOffice, The GIMP, Inkscape); the content management system Drupal, which runs the Open Medicine blog; the wiki software MediaWiki, used to communicate and to archive our weekly editorial and operational meeting agendas, minutes and other documents that the team can collectively edit; Scribus for automated layout; and the VoIP software Skype and OpenWengo for communication. All software can be run on any of the main operating systems, including the free and open source GNU/Linux operating system. Journal management is provided by Open Journal Systems, developed by the Public Knowledge Project (http://pkp.sfu.ca/?q=ojs). OJS assists with every stage of the refereed publishing process, from submission and assignment of peer reviewers through to online publication and indexing. The Public Knowledge Project has also recently developed Lemon8-XML (http://pkp.sfu.ca/lemon8), which automates the conversion of text document formats to XML, enabling structured markup of content for automated searching and indexing. As XML is required for inclusion in PubMed Central, this integrated, semi-automated processing of manuscripts is a key ingredient for biomedical publishing, and Lemon8-XML has significant resource implications for the many journals where XML conversion is currently done manually or with proprietary software. Conversion to XML and the use of Scribus have allowed semi-automated production of HTML and PDF documents for online publication, representing another significant resource saving. Extensive use of free and open source software by Open Medicine serves as a unique case study for the feasibility of FOSS use for all journals in scholarly publishing. It also demonstrates how innovative use of this software adds to a more sustainable publishing model that is replicable worldwide. Keywords: Free and Open Source Software (FOSS); biomedical publishing; Open Medicine.
1.
Introduction
The private interests of medical society-owned and commercially owned medical journals do not encourage collaboration between journals for processes related to journal publishing. This is particularly apparent as journal publishing moves into the digital age: profit is sitting at the helm of an era where shared software code and reader-centric licenses could otherwise accelerate the development and advantages of electronic publishing for all readers and authors. The focus on profit also puts subscriptions beyond the reach of many potential readers. In a US periodical price survey published in early 2008, health science periodical subscriptions averaged US$1330, representing a ten percent increase from 2007. The same study showed that average subscription prices in the health sciences increased by 43% between 2004 and 2008 [1]. A report commissioned by the Wellcome Trust showed similar data [2]; in 2000 the average subscription price for a medical journal was £396.22, and the average cost of a medical journal increased 184% in the ten-year period between 1990 and 2000 [2]. These costs limit journal readership to academic and institution-affiliated professionals in developed countries, and exclude physicians and academics in developing countries not covered by initiatives such as the Health InterNetwork Access to Research Initiative (HINARI) [3]. Electronic publishing renders obsolete costly processes used to justify high subscription prices. In a recent publication costing study comparing print and electronic publications, Clarke [4] found that the publication costs of a print version of a non-profit association journal were more than double those of an electronic version (US$20 000 compared with US$8 000). Although editorial costs associated with the production of high-quality publications remain – and, for larger journals, can be a considerable part of their operating costs – it is clear that the impact of these costs on the financial viability of a journal can be considerably offset with reduced production costs. This has the potential to reduce the dependence of medical journals on pharmaceutical company and medical device manufacturer advertising, the effects of which have been well documented [5,6]. While the Clarke study does not itemize the contribution that publishing-related software makes to publication costs, it can only be assumed that the use of free and open source software (FOSS) would decrease these costs further. Willinsky and Mendis [7] recently published a paper describing their experience of publishing an entirely unfunded humanities journal using free publishing software and “a volunteer economy of committed souls”. Hitchcock [8] describes the only other journal that we are aware of that has exclusively used FOSS for this purpose. At Open Medicine, we employ “committed souls”, professional journal editors and FOSS to publish our biomedical journal. 2.
Open access (OA) publication
Open access publication has emerged as another way of increasing integrity, transparency and accessibility in biomedical publishing [9]. In 2002, the Budapest Open Access Initiative (BOAI) was launched to encourage science to be made freely available on the internet. The BOAI supports the archiving of scientific articles and their free availability, without copyright and proprietary limitations, so that articles can be read, downloaded, reproduced, distributed, printed, searched or linked to in full text, with proper attribution to the source (see http://www.soros.org/openaccess). Reframing traditional copyright limitations allows anyone to use science for learning, teaching and discussion without having to pay for its use in the form of a subscription or re-print purchase. Without this kind of protection, even an article’s authors cannot freely use published articles for these purposes. The trend towards opening access among journal publishers has been swift: the Directory of Open Access Journals now lists more than 3281 journals (http://www.doaj.org). The benefits of OA are also becoming clearer: studies are finding that articles published in open access journals are cited more widely [10], and
studies that make their data openly accessible show an additional citation advantage [11]. Academic institutions, funding bodies, regulators and even governments have recognized how open access might serve academic integrity and improve patient care [12]. 3.
Free and open source software (FOSS)
Like the copyright laws that continue to significantly limit readers’ ability to download, reproduce, distribute, print, share and expand upon knowledge printed in many journals, copyright limitations apply to sharing novel software programs and code. Software development under a free license such as the GNU General Public License ensures that source code is freely available and can be used, examined, changed, improved or redistributed without limits except that any changes must be released back into the community with the same license (http://en.wikipedia.org/wiki/Open_source_software). Developers of FOSS range from software hobbyists to multinational corporations. Programmers may or may not be paid for their work, and their motivations include the wish to satisfy user need, and to use and develop their skills [13]. Free licenses encourage code sharing and code integrity, and enable the rapid identification and fixing of critical bugs, and the adaptation and re-purposing of code. Among the best-known open source software projects are the GNU/Linux operating system, the Mozilla Firefox web browser, Open Office productivity software, and the MediaWiki publishing platform that underlies Wikipedia. The ability of many smaller journals to support open access publication has been enabled by the availability of open source journal management and publishing systems, including Open Journal Systems (http://pkp.sfu.ca/ojs/), DPubS (http://dpubs.org/), GAPworks (http://gapworks.berlios.de/), Hyperjournal (http://www.hjournal.org/), ePublishing Toolkit (https://dev.livingreviews.org/projects/epubtk/), OpenACS (http://openacs.org/), and SciX Open Publishing Services (SOPS; http://www.scix.net/) (see http://pkp.sfu.ca/ojs_faq). At Open Medicine we have taken our commitment to “openness” and developing a more sustainable publishing model a step further by using free and open source software (FOSS) for our journal management, blog and electronic publishing platform. We are also increasingly incorporating FOSS into our workflow to enable the production of XML (a document format required for NLM/MEDLINE indexing) and for our layout and copyediting process, with the end goal of publishing and managing the journal exclusively using a FOSS workflow. The use of FOSS in medical publishing has many advantages. Cost is one commonly cited factor, though by no means the most important. By using FOSS, Open Medicine is replacing software with single license costs (non-educational versions) ranging from hundreds to thousands of dollars, representing savings in startup costs of many thousands of dollars; this use of FOSS also avoids costly upgrades of both software and hardware. FOSS tends to be available for a broader range of platforms – at a minimum, there are likely to be GNU/Linux, Apple Mac OSX and Microsoft Windows versions – and since older versions of the software are not commercially competitive with newer versions, support for established FOSS projects does not end according to a commercial cycle. This means that older, slower computers remain viable platforms. It also means that backward compatibility of programs is more often maintained. FOSS also produces documents in open formats such as the Open Document Format, which means that the user is able to transfer documents to another program should development on the original one cease, or a more suitable alternative be found – unlike data kept in a proprietary format.
This problem, dubbed “vendor lock-in”, will become more pronounced with the introduction of Microsoft’s new proprietary office format, as well as with “patented” proprietary formats from other companies. 4.
FOSS at Open Medicine
Use of FOSS at Open Medicine was primarily driven by the added control, security, and usability of the
software. However, it was also in part prompted by cost considerations. As a start-up independent journal, committed to editorial independence, we operate principally with volunteer staff with minimal institutional support: the purchase of expensive proprietary journal management software was not only undesirable, but unfeasible. Our first step was to work with John Willinsky and the Public Knowledge Project to explore Open Journal Systems (OJS; http://pkp.sfu.ca/ojs). OJS is a free and open source online journal management and publishing system, developed by the Public Knowledge Project in a partnership among the Faculty of Education at the University of British Columbia, Simon Fraser Library and the Canadian Centre for Studies in Publishing [14]. We are not alone in recognizing the benefits of using OJS; there are now more than 1000 journals using OJS as a publishing platform, 20 percent of which are new titles and all of which offer some form of open access. Somewhat more than half are being published in low-income countries. OJS offers a complete manuscript management and publishing system. Correspondence between authors, editors, peer reviewers, copyeditors and layout editors can be managed within the system, with modifiable templates for correspondence. A database of peer reviewers, with contact information, interests and review history, is maintained within the system. Authors are able to track the progress of their manuscripts through the system, and peer reviewers are able to access their peer review requests, download the documents and enter or upload their completed peer reviews. OJS operates within a browser, with good attention to cross-platform, cross-browser compatibility (see Figure 1).
Figure 1: Open Medicine home page published using OJS
A critical advantage of OJS is its use of open source code and a free software license. This has allowed the technical staff at Open Medicine, with OJS support, to write new or revised code, targeted to our particular journal needs. And of course the cycle continues: any code written by our programmer with wider applicability for journal publishing has in turn been shared with the team at OJS and other journals. This relationship has been particularly productive in our testing and use of Lemon8-XML for OJS. Lemon8-XML (http://pkp.sfu.ca/lemon8) is a web-based program that automates the conversion of text document formats to National Library of Medicine (NLM) XML, permitting myriad uses, including the easy transmission of article information to the NLM. XML ensures that text is marked up so as to enable meaningful computer searching. For example, it allows the orderly tagging of date of publication and
author names so that computers can search and find data that would usually appear as text buried within the body of a document. XML conversion is required for PubMed/Medline indexing – a critical goal for any medical journal – and is currently performed in most journal operations either manually or with prohibitively expensive proprietary software. The development of Lemon8-XML will be a powerful contribution to data searching, and will have significant resource implications for journals, many of which have been unable to produce XML because of the high costs and expertise required. The easy creation of XML has enabled another recent innovation: the automated transformation of XML files into web-ready pages (HTML), as well as preliminary page layouts that can then be fine-tuned for print-ready publication (PDF). Some of our initial efforts at creating the editing-layout portion of the workflow involved using OpenOffice for both copy-editing and layout, and then generating both the XML markup version and the publication PDF from the final laid-out document. OpenOffice, although suitable for editing and copy-editing tasks, proved to lack the flexibility and fine control required to produce layouts to a professional standard. For example, fine-grained control over hyphenation, font kerning and so on was nearly impossible with OpenOffice. This led us to explore the use of Scribus, a well-established free and open-source desktop publishing package, for the layout stage. Rather than be obliged to maintain and reconcile separate XML and Scribus/laid-out versions, our technical staff developed a plugin that converts the copy-edited XML version of an article directly to a final web-ready page and a preliminary layout in Scribus, ready for final refinement (see Figure 2).
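The exact markup produced by Lemon8-XML is not reproduced here; as a rough, hypothetical sketch of the kind of NLM-style front matter such an automated conversion step emits (element names follow the NLM journal DTD, but the structure is simplified and the sample values are invented), a few lines of Python suffice:

# Rough sketch of building NLM-style article metadata in Python; element names
# follow the NLM journal DTD but the structure is simplified and the sample
# values are invented. Lemon8-XML's real output is considerably richer.
import xml.etree.ElementTree as ET

article = ET.Element("article")
front = ET.SubElement(article, "front")
meta = ET.SubElement(front, "article-meta")

title_group = ET.SubElement(meta, "title-group")
ET.SubElement(title_group, "article-title").text = "An example article"

contrib_group = ET.SubElement(meta, "contrib-group")
contrib = ET.SubElement(contrib_group, "contrib", {"contrib-type": "author"})
name = ET.SubElement(contrib, "name")
ET.SubElement(name, "surname").text = "Doe"
ET.SubElement(name, "given-names").text = "Jane"

pub_date = ET.SubElement(meta, "pub-date", {"pub-type": "epub"})
ET.SubElement(pub_date, "day").text = "16"
ET.SubElement(pub_date, "month").text = "4"
ET.SubElement(pub_date, "year").text = "2008"

# Serialize; downstream steps (HTML rendering, Scribus layout) would start here.
print(ET.tostring(article, encoding="unicode"))

Tagging of this kind is what allows indexing services to locate publication dates and author names reliably rather than searching for them as free text.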
Figure 2: Example of automated article layout using Lemon8-XML and Scribus
Day-to-day operations within the journal are also performed using FOSS. As our editorial team is distributed across Canada and Australia, team members communicate using email, instant messaging (IM), and voice over internet (VoIP). We have also experimented with the SIP-based Wengophone (now Qutephone) to support teleconferences of more than three people, but have been unsuccessful to date; we are currently still reliant on a proprietary solution for teleconferences involving the entire team. Coordination of journal activities is also made possible through an internal wiki, using MediaWiki (the platform originally written for Wikipedia). Editorial meeting agendas and minutes, projects and documents in development, lists of contacts and resources, and all other documentation associated with running the journal are all accessible to and editable by all members of the editorial team. Table 1 offers a summary of the programs we have explored or are exploring as part of a Free and Open Source (FOSS) publishing workflow or in support of our operations.
Article editing and preparation

Open Office (http://www.openoffice.org/)
  Open Medicine use(s): Editing and copy-editing of manuscripts; preliminary layout (current industry standard: Microsoft Office).
  Advantages: Best established FOSS office suite; increasing acceptance in business and enterprise; well supported by documentation.
  Disadvantages: Interface and customizations differ from the proprietary alternative; does not have the fine control required for layout.

GIMP (http://www.gimp.org/)
  Open Medicine use(s): Image editing (current industry standard: Adobe Photoshop).
  Advantages: Best all-around photo- and image-editing software; well supported with documentation and forums.
  Disadvantages: Contested user interface; CMYK support only with a plugin (relevant for print publishing).

Inkscape (http://www.inkscape.org/)
  Open Medicine use(s): Figure preparation (current industry standards: Adobe Illustrator / Corel Draw).
  Advantages: Intuitive, thought-out user interface; excellent SVG support.
  Disadvantages: Difficulty integrating with illustrators using Adobe Illustrator or Corel Draw.

Article management, layout, and publishing

Open Journal Systems (OJS) (http://pkp.sfu.ca/ojs/)
  Open Medicine use(s): Manuscript management; reader tools; on-line publishing; communication with editors, copyeditors and layout persons.
  Advantages: Many users; potential to request additional system features; responsive developers.
  Disadvantages: Some limitations with theme customization.

Lemon8-XML (http://lemon8.ca/)
  Open Medicine use(s): XML generation.
  Advantages: Removes considerable human resource cost, as XML generation is currently done manually at most journals.
  Disadvantages: Still in early testing phase; requires some manual reference searching; no current link with OJS author details, requiring duplicate data entry (planned for final version).

Scribus (http://www.scribus.net/)
  Open Medicine use(s): Layout of articles for print (PDF) publication (current industry standard: Quark Xpress).
  Advantages: Fine-grain control over text layout and font kerning; excellent PDF export control; excellent support community.
  Disadvantages: Confusing development cycle; poorly thought-out document format.

Drupal (http://drupal.org/)
  Open Medicine use(s): Blog (current industry standards: Wordpress, Movable Type, Blogger.com).
  Advantages: Powerful content-management system with user-access controls; extensible with plug-ins; active user community.
  Disadvantages: Learning curve; requires expertise to set up and manage.

Operations

MediaWiki (http://www.mediawiki.org/)
  Open Medicine use(s): Meeting minutes; shared projects; shared resources.
  Advantages: Web-based; minimal learning required for use; very flexible.
  Disadvantages: Some expertise required for installation and maintenance.

WengoPhone (http://www.openwengo.org/)
  Open Medicine use(s): Team communication.
  Advantages: Multiple sites can conference simultaneously; uses the SIP standard.
  Disadvantages: Unstable development; small userbase; decreased sound quality compared to other SIP products.

Chandler (http://www.osafoundation.org/)
  Open Medicine use(s): Shared calendars.
  Advantages: Multiple users can enter data.

Thunderbird (http://www.mozilla.com/thunderbird/)
  Open Medicine use(s): Team communication; editor-author-peer-reviewer communication.

Table 1: Free and open source software used at Open Medicine

Figure 3 shows a schematic flowchart of our operations and sites of FOSS use.
Figure 3: Workflow at Open Medicine using FOSS
5.
Free and Open Source Software in Medical Publishing: the challenges
There is no denying that there are challenges unique to adopting FOSS to create a workflow that has hitherto involved proprietary software. Some of these challenges arise from the software packages themselves, some from integration (or lack thereof) between various FOSS programs, and others simply from the time taken to learn to use new programs and troubleshoot without traditional help forums. For an individual user who is experienced in proprietary software and a proprietary workflow, the initial penalty of moving to FOSS is a loss of efficiency and a (re)learning curve. Users must learn one or several new interfaces, which may require them to adapt their personal workflow if it is not supported by the program, or to learn how to customize the program to suit their needs. This is especially true for little-used specialist components of software, which tend to be buried deep within the software and to be poorly documented. Users must find and identify sources and resources that will provide them with answers to questions that may be quite specific to the task; this can be time-consuming, particularly when the reason for there being no documentation is that that functionality has not been included in the software. The user interfaces of FOSS differ from their proprietary counterparts, in part as a result of the opportunity to solve perceived problems with existing proprietary interfaces and improve their design, and in part because developers in today’s litigious environment must avoid incorporating design elements that may be claimed under patent [see http://en.wikibooks.org/wiki/FOSS_Open_Standards/Patents_in_Standards for a discussion of patents and FOSS]. While improving on design, however, developers of the more “mainstream” and widely adopted FOSS (e.g., Firefox, GNOME, OpenOffice, GIMP) find themselves attempting to balance the needs of new users for an intuitive, familiar interface with the requirements of experienced users for a flexible interface that can be highly customized. Microsoft and Adobe own much of the software in common use in authoring and publishing, and have so shaped user expectations and workflow design that the user interfaces they do not own, they nonetheless influence. This results in consistency in the user interface when approaching different programs by the same manufacturer. One common complaint about FOSS interfaces is that they can be individually unique, even idiosyncratic, posing a barrier to new users. This problem has recently been recognized by the community, and is being addressed aggressively with massive usability projects (e.g., Open Usability; http://openusability.org/) and human interface guidelines (e.g., GNOME HIG http://developer.gnome.org/projects/gup/hig/; and KDE HIG http://usability.kde.org/hig/). FOSS applications lend themselves to development on multiple operating systems, since any developer with an interest in a platform and some knowledge is free to modify the code. This leads to support for esoteric operating systems such as IBM’s long-defunct OS/2. The upside of availability on multiple platforms is balanced by the lower quality of versions in which developers are uninterested. Because free software is available to the public at all stages of its development cycle, this also means that sometimes installation of applications on underdeveloped platforms is confusing or poorly implemented. Scribus, one of our mainstay applications for layout editing, is an excellent example of this challenge.
At the time of writing, Scribus version 1.3.3.11 is considered “stable”. However, versions 1.3.4 and 1.3.5 are in wide use as well, despite being “unstable”. Scribus’ installer for Mac OSX is also primitive, and does not install required libraries, or even the application itself, in an intuitive way. The user needs to select the correct version, may need to download and install the supporting libraries or packages, and may then need to interpret and troubleshoot any resulting error messages. It is worthwhile noting that this problem is essentially eliminated within free software operating systems (e.g., GNU/Linux), all of which use package management systems to easily install software and dependent libraries. Publishing requires a workflow that faithfully preserves detail of presentation – font, layout, figures. For proprietary publishing, this workflow has been developed largely by the consolidation of products involved in the process into end-to-end product lines that smooth the integration but offer little choice to the consumer.
The various components of FOSS are not integrated into a workflow and require additional customization and programming. Furthermore, given that almost all of our submissions are received in Microsoft Word document format, one of the areas the Open Medicine staff found most challenging was importing figures and tables prepared in Word, and citations and reference lists prepared in another widely used proprietary program, EndNote. We have yet to resolve our dependency on proprietary fonts for standardization of appearance and layout across stages and platforms. When difficulties are encountered in free software applications, solutions are not always easily located. The pace of progress means that documentation and technical support are primarily provided online by the user community, rather than in the form of published manuals. The majority of commercial publishers of books describing individual computer applications concentrate their efforts on mainstream proprietary software, which tends to have a much longer product lifecycle and slower development pace. Established FOSS projects commonly offer documentation in the form of a wiki (a collectively edited multi-page manual), and support in the form of forums and online communities. Individual users may develop extensive tips and support sites, either out of interest, or in support of their consulting business (or both). To find the documentation that suits one’s level of learning, or the exact answer to a technical question, requires skills in searching, and some experience in assessing the receptiveness of a forum to “newbie questions”. The move to lesser-known free software also negates the often overlooked advantage of “the geek next door”, the friend with a slightly higher level of skill who can help achieve certain tasks. The increasing popularity of free software will eventually render this challenge moot; however, it remains important at this time. 6.
FOSS in Medical Publishing: the Possibilities
By the very nature of FOSS, many of the frustrations cited should ease with increasing adoption of FOSS in scholarly publishing. FOSS-OA publishers are forming their own community, exchanging experiences and developing documentation specific to the task of using FOSS for publishing. Experience will teach us which programs are best suited to which step in the editing-publishing workflow, which programs integrate best with others, and how they might be customized for ease of workflow. The open architecture of FOSS permits the development of macros and plugins to automate repeated steps and to facilitate import and export. The most interesting possibilities presented by FOSS will have to do with the fruits of collaboration by several FOSS-OA publishers. A case in point: Open Medicine is collaborating with the Public Knowledge Project to develop a user commenting system for OJS, but we expect this system to truly mature and evolve when other publishers implement and expand upon it. For our own part, we hope Open Medicine can become a working template and case study for other journals interested in publishing using a complete FOSS workflow. Journals choosing to use FOSS because of their philosophy, cost considerations or availability of computing ‘power’ to run software applications can benefit from our learning experiences and, given the nature of FOSS, from the source code developed for our publishing purposes. We look forward to the ongoing dialogue and experience of pursuing a truly “Open”, academically independent, biomedical publishing option. For us, transparency and integrity are essential traits, and we want Open Medicine to embody these traits in the software we use as well as the articles we publish.
7.
References
[1] Van Orsdel LC, Born K. Periodicals Price Survey 2008: Embracing Openness. Library Journal [Internet]. 2008 Apr 15 [cited 2008 May 6]. Available from: http://www.libraryjournal.com/article/CA6547086.html
[2] SQW Limited. Economic analysis of scientific publishing: A report commissioned by the Wellcome Trust. Histon: Wellcome Trust, 2003.
[3] The PLoS Medicine Editors. How Can Biomedical Journals Help to Tackle Global Poverty? PLoS Med [Internet]. 2006 Aug 29 [cited 2008 May 6]; 3(8): e380. Available from: http://medicine.plosjournals.org/perlserv/?request=getdocument&doi=10.1371/journal.pmed.0030380
[4] Clarke R. The cost profiles of alternative approaches to journal publishing. First Monday [Internet]. 2007 Nov 21 [cited 2008 May 6]; 12(12). Available from: http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2048/1906
[5] Smith R. Medical Journals Are an Extension of the Marketing Arm of Pharmaceutical Companies. PLoS Med [Internet]. 2005 [cited 2008 May 6]; 2(5): e138. Available from: http://medicine.plosjournals.org/perlserv/?request=getdocument&doi=10.1371/journal.pmed.0020138
[6] Fugh-Berman A, Alladin K, Chow J. Advertising in Medical Journals: Should Current Practices Change? PLoS Med [Internet]. 2006 [cited 2008 May 6]; 3(6): e130. Available from: http://medicine.plosjournals.org/perlserv/?request=getdocument&doi=10.1371/journal.pmed.0030130
[7] Willinsky J, Mendis R. Open access on a zero budget: a case study of Postcolonial Text. Information Research [Internet]. 2007 [cited 2008 May 6]; 12(3): paper 308. Available from: http://InformationR.net/ir/12-3/paper308.html
[8] Hitchcock S. The effect of open access and downloads (‘hits’) on citation impact: a bibliography of studies. 2007 [cited 2007 May 24]. Unpublished paper, University of Southampton. Available from: http://opcit.eprints.org/oacitation-biblio.html
[9] Willinsky J, Murray S, Kendall C, Palepu A. Doing Medical Journals Differently: Open Medicine, Open Access and Academic Freedom. Canadian Journal of Communication [Internet]. 2007 [cited 2008 May 6]; 32(3): 595-612.
[10] Eysenbach G. Citation advantage of open access articles. PLoS Biol [Internet]. 2006 [cited 2008 May 6]; 4(5): e157. Available from: http://biology.plosjournals.org/perlserv/?request=getdocument&doi=10.1371%2Fjournal.pbio.0040157
[11] Piwowar HA, Day RS, Fridsma DB. Sharing detailed research data is associated with increased citation rate. PLoS ONE [Internet]. 2007 [cited 2008 May 6]; 2(3): e308. Available from: http://www.plosone.org/article/fetchArticle.action?articleURI=info:doi/10.1371/journal.pone.0000308
[12] Suber P. An open access mandate for the National Institutes of Health. Open Medicine [Internet]. 2008; 2(2), April 16 [cited 2008 May 10]. Available from: http://www.openmedicine.ca/article/view/213/135
[13] Lakhani KR, Wolf RG. Why Hackers Do What They Do: Understanding Motivation and Effort in Free/Open Source Software Projects. MIT Sloan Working Paper No. 4425-03. Sep 2003 [cited 2008 May 5]. Available from: http://ssrn.com/abstract=443040
[14] Willinsky J. Open Journal Systems: An example of open source software for journal management and publishing. Library Hi Tech [Internet]. 2005 [cited 2006 May 5]; 23(4): 504-519. Available from: http://research2.csci.educ.ubc.ca/eprints/archive/00000047/01/Library_Hi_Tech_DRAFT.pdf
Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008
Should University Presses Adopt An Open Access [Electronic Publishing] Business Model For All of Their Scholarly Books?
Albert N. Greco1; Robert M. Wharton2
1 Marketing Area, Fordham University Graduate School of Business Administration, 113 West 60th Street, New York, NY, United States 10023. email: agreco@fordham.edu
2 Professor of Management Science, Fordham University Graduate School of Business Administration, 113 West 60th Street, New York, NY, United States 10023. e-mail: R.FWharton@att.net
Abstract This paper analyzes U.S. university press datasets (2001-2007) to determine net publishers’ revenues and net publishers’ units, the major markets and channels of distribution (libraries and institutions; college adoptions; and general retailer sales) that these presses relied on, and the intense competition these presses confronted from commercial scholarly, trade, and college textbook publishers entering these three markets. ARIMA forecasts were employed to determine projections for the years 2008-2012 to ascertain changes or declines in market shares. The paper concludes with a brief series of substantive recommendations including the idea that university presses must consider abandoning a “print only” business model and adopt an “Open Access” electronic publishing model in order to reposition the presses to regain the unique value proposition these presses held in the late 1970s. Keywords: Innovative business models for scholarly publishing; university presses; electronic publishing; Open Access; scholarly communication; marketing strategies. 1.
Introduction
Since the late 19th century, university presses in the United States have played a pivotal role, and some individuals might argue the pivotal role, in the transmission of scholarly knowledge [1]. University press books have become the “gold standard” in many academic fields (e.g., history; literature; and certain areas of philosophy and sociology) in the departmental or college evaluation of a faculty member’s scholarly output (and reputation) for tenure, promotion, and merit pay [2]. In 2008 these presses ranged in size from exceptionally large presses (with annual revenues in excess of $50 million; e.g., Oxford University Press has U.S. annual revenues of approximately $140 million; Cambridge University Press, approximately $60 million), to large presses (+$6 million; e.g., University of Chicago Press), medium-sized presses (approximately $1.5-$3 million; e.g., The University of Notre Dame), and relatively small presses (approximately $900,000-$1.5 million; e.g., Carnegie Mellon University). They publish peer-reviewed scholarly monographs; trade and professional books; textbooks; and, in some instances, scholarly journals that they own or publish under contract for academic societies (e.g., Pennsylvania State University Press). In this paper all university press books will be considered one category for analysis; however, additional research on the relationship between monographs and textbooks is needed. Between 1945 and the late 1970s, the basic university press business model was incredibly successful since this diverse collection of presses had a unique value proposition. University press publishing during those years was a “cozy” world, where everyone knew someone who knew someone; and most editors
and press directors attended the same type of college (perhaps the Ivy League, small prestigious liberal arts colleges, or the large state universities). So these editors and publishers either went to school with or knew many of the major academic experts, who sent certain prestigious university presses their manuscripts and advised their graduate students to do likewise. During those years, the typical press received a “reasonable” level of financial and administrative support from its university; and presses were not expected to generate an annual “surplus” (i.e., a profit) [3]. The end result was these presses published superb books and, concomitantly, dominated the scholarly publishing field with preeminent sales in three major markets or channels of distribution: libraries and institutions; college and graduate school adoptions (in this paper “college” and “university” will be used interchangeably); and general readers (i.e., sales to general retailers). There was little competition from commercial professional scholarly publishing houses (the term “commercial professional scholarly publishing houses,” or “professional and scholarly publishing houses,” or “scholarly publishing houses” refer to the same cluster of publishing companies, and they will be used interchangeably in this paper). The vast majority of trade publishing firms tended to concentrate on “big, hit driven” fiction titles, although a cluster of firms (e.g., W.W. Norton or Random House) published serious scholarly works. By the mid to late 1970s, the total amount of net publishers’ revenues for all of these university press operations was quite “modest” (1972: $41.4 million; 1977: $56.1 million; all revenues used in this paper are U.S. dollars); and the suggested retail price for the typical university press book was often $10.00-$15.00 [4]. This was an important marketing strategy since inexpensive suggested retail prices (i.e., the MSRP) allowed the presses to penetrate the library market (the average university press expected to sell approximately 1,500 copies of each new scholarly book to academic and public libraries) as well as the college and graduate school adoption market, which often relied on scholarly titles from university presses in advanced undergraduate and graduate school courses. These presses tended to hire people who loved books; while wages were anemic, even by publishing industry standards, these presses offered editors an intellectually charged work environment in an academic environment that appealed to a significant number of people. The end result, a carefully written and edited and illustrated scholarly book, was indeed impressive. We reviewed press subsidies for 58 university presses for the years 2001-2006 (data for 2007 will not be available until late 2008) [5]. The most important subsidy was a direct financial grant to a press, with 70.69% of the presses receiving these funds. 
However, a wide variety of other free support services were provided to presses, including: payroll and human resources (86.21%); legal services (84.48%); audit services (70.7%); office space (62.07%); accounting services (60.34%); utilities (50%); working capital (44.83%; e.g., to pay printers, etc.); employee benefits (39.66%); salaries (37.93%); insurance (36.21%); carrying cost of accounts receivable (34.48%); warehouse space (32.76%); carrying cost of inventory (29.31%); parking (17.24%); work-study students (paid for by the university); and interns from the business school, the English department, or the mass communications department (no data on these options). These percentages remained rather constant during the years 2001-2006. 2.
Significant Changes: The Emergence of “Black Swans”
Yet this “insulated” world changed abruptly in the late 1970s (a phenomenon called a “Black Swan” because of the unexpected nature of the change), and the changes continued during the following decades [6]. What happened? Reliable statistical data is available for the years 1980 through 2007. However, because of space limitations, this paper addresses the years 2001-2007, and the ARIMA forecasting methodology was utilized to generate the 2008-2012 projections [7]. First, there was an increase in the number of new titles published in the U.S. by university presses as well as by all publishers. Table 1 outlines these trends between 2001-2007.
Year    New Title Output (University Presses)    Annual % Change (University Presses)    Total New Title Output (All U.S. Books)    Annual % Change (All U.S. Books)
2001    10,130     —         N/A         N/A
2002    9,915      -2.12     247,777     N/A
2003    11,104     11.99     266,322     7.48
2004    9,854      -11.26    295,523     10.96
2005    9,812      -0.43     282,500     -4.41
2006    9,969      1.60      291,922     3.34
2007    10,781     8.15      400,000*    37.02*
2008    N/A        —         N/A         —
Source: Yankee Book Peddler; R.R. Bowker (revised totals since 2002). Totals include both hardbound and paperbound books. *Rachael Donadio, “You’re An Author? Me Too!” The New York Times Book Review, April 27, 2008, page 27. The 2007 projection for all U.S. books was based on R.R. Bowker data in this article; Bowker issues all ISBNs in the U.S.
Table 1: University Press New Title Output: 2001-2007
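The ARIMA specification behind the 2008-2012 projections is not given here; purely as an illustration of the technique, the following Python sketch fits an arbitrarily chosen ARIMA(1,1,0) model to the 2001-2007 net publishers' revenue series reported in Table 2 below and projects five further years. The model order is an assumption for demonstration, not the specification used by Greco & Wharton.

# Illustrative only: forecasting university press net revenues (US$ millions,
# 2001-2007, from Table 2) with an ARIMA model. The (1,1,0) order is an
# arbitrary choice for demonstration purposes.
from statsmodels.tsa.arima.model import ARIMA

# Net publishers' revenues for 2001-2007 (US$ millions).
revenues = [474.8, 486.5, 494.8, 501.0, 513.5, 531.0, 546.9]

model = ARIMA(revenues, order=(1, 1, 0))
fitted = model.fit()

# Project five further years (2008-2012), as the paper's tables do.
print(fitted.forecast(steps=5))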
Year    Net Publishers’ Revenues    Annual % Change    C.P.I. % Change    Net Publishers’ Units    Annual % Change
2001    474.8    N/A     2.85     24.6    N/A
2002    486.5    2.46    1.58     24.7    2.92
2003    494.8    1.71    2.28     24.6    -0.40
2004    501.0    1.25    2.66     31.4    27.64
2005    513.5    2.50    3.39     31.4    0
2006    531.0    3.41    3.23     29.0    -7.64
2007    546.9    2.99    2.85     28.7    -1.03
2008    563.3    3.00    2.80*    28.4    -1.05
2009    580.2    2.99    1.90*    28.2    -0.70
2010    597.0    2.90    2.10*    27.9    -1.06
2011    613.7    2.80    2.10*    27.7    -0.71
2012    630.3    2.70    2.10*    27.5    -0.70
Source: Greco & Wharton’s estimates for 2001-2007 and ARIMA projections for 2008-2012; Greco & Wharton, Book Industry Trends (New York: Book Industry Study Group, Inc., various years). All numbers rounded off to one decimal place and may not add up to 100%. Totals include data for hardcover and paperbound books. Consumer Price Index (C.P.I.) is for “all items.” All data refers to the sale of new books; used book sales are excluded. *C.P.I. projections: The U.S. Congressional Budget Office (as of January 2008).
Table 2: University Press Books: Net Publishers’ Revenues and Net Publishers’ Units 2001-2007
University press net publishers’ revenues (i.e., gross sales minus returns equal net revenues; the same system is followed for units) increased because of changes in the suggested retail prices of these books (which generally exceeded annual increases in the Consumer Price Index, the C.P.I.) while units sagged after 2005. Table 2 outlines these trends. Since 1945, the three primary markets and channels of distribution for university presses were: (1) libraries and institutions; (2) college adoptions (which include graduate school adoptions); and (3) general retailers. The datasets for net publishers’ revenues indicated growth in all three channels. Total increases between
2001-2007 were: 7.73% for the general retailer sector; 14.9% for college adoptions; and 16.0% for libraries and institutions. Table 3 outlines these trends.

Year  Exports  General retailers  College adoptions  Libraries & institutions  High school adoptions  Direct to consumer  Other
2001  60.6  109.9  114.8  125.6  8.6   52.9  2.4
2002  61.9  112.3  117.3  129.6  8.9   53.9  2.6
2003  63.1  114.3  119.5  131.5  9.0   56.3  2.5
2004  63.9  108.9  121.3  132.3  9.1   67.5  2.5
2005  65.5  111.5  124.3  135.9  9.3   69.2  2.6
2006  67.8  115.2  128.4  140.5  9.7   71.5  2.7
2007  69.5  118.4  131.9  145.7  10.0  73.3  3.0
2008  71.9  122.1  136.0  149.7  10.3  75.6  2.8
2009  74.1  125.8  140.1  153.9  10.5  78.2  2.9
2010  76.3  129.5  144.3  158.2  10.9  80.4  2.9
2011  78.3  133.2  148.4  162.4  11.2  82.7  3.0
2012  80.4  146.0  152.5  166.6  11.5  70.2  3.1

Source: Greco & Wharton’s estimates for 2001-2007 and ARIMA projections for 2008-2012; Greco & Wharton, Book Industry Trends (New York: Book Industry Study Group, Inc., various years). All numbers rounded to one decimal place and may not add up to 100%. Totals include both hardbound and paperbound books. All data refer to the sale of new books; used book sales are excluded.
Table 3: University Press Books: Net Publishers’ Revenues By Channels of Distribution 2001-2007 With Projections for 2008-2012 (U.S. $ Millions)

Year  Exports  General retailers  College adoptions  Libraries & institutions  High school adoptions  Direct to consumer  Other
2001  3.1  8.3  7.4  3.9  0.3  1.4  0.3
2002  3.2  8.1  7.2  4.0  0.4  1.5  0.3
2003  3.1  8.1  7.2  4.0  0.4  1.5  0.3
2004  4.0  9.9  9.6  4.8  0.3  2.2  0.3
2005  4.0  9.9  9.4  4.9  0.4  2.2  0.3
2006  3.7  9.2  8.7  4.6  0.3  2.0  0.3
2007  3.7  8.8  8.4  4.6  0.5  2.1  0.3
2008  3.6  8.8  8.3  4.6  0.5  2.1  0.3
2009  3.6  8.8  8.3  4.5  0.4  2.1  0.3
2010  3.5  8.9  8.3  4.4  0.3  1.9  0.3
2011  3.5  8.8  8.3  4.3  0.3  1.9  0.3
2012  3.4  9.3  8.4  4.3  0.3  1.6  0.3

Source: Greco & Wharton’s estimates for 2001-2007 and ARIMA projections for 2008-2012; Greco & Wharton, Book Industry Trends (New York: Book Industry Study Group, Inc., various years). All numbers rounded to one decimal place and may not add up to 100%. Totals include both hardbound and paperbound books. All data refer to the sale of new books; used book sales are excluded.
Table 4: University Press Books: Net Publishers’ Units By Channels of Distribution 2001-2007 With Projections for 2008-2012 (Millions of Units)
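The channel growth rates quoted above for university presses (7.73% for general retailers, 14.9% for college adoptions, and 16.0% for libraries and institutions) follow directly from the 2001 and 2007 endpoints of the Table 3 revenue series. The short sketch below simply restates that arithmetic; the dictionary values are the Table 3 figures.

table3_endpoints = {
    "general retailers": (109.9, 118.4),          # US$ millions, 2001 and 2007
    "college adoptions": (114.8, 131.9),
    "libraries & institutions": (125.6, 145.7),
}

for channel, (v2001, v2007) in table3_endpoints.items():
    growth = (v2007 - v2001) / v2001 * 100        # total percentage change, 2001-2007
    print(f"{channel}: {growth:.2f}%")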
Between 2001-2007, net publishers’ unit data reveal a flattening of sales in the library and institution sector (essentially no growth after 2007) and in the college adoption sector (another area with flat sales after 2006). Based on a review of unit sales in the three major markets and channels, it appears likely that the market for scholarly non-profit university press books has plateaued, a potential weakness for presses in those channels. Table 4 outlines these trends. An analysis of the data for 2001-2007 revealed the substantial gains posted by professional and scholarly publishers in the university presses’ three main markets and channels in terms of net publishers’ revenues. Revenues were up 18.01% in the general retailer sector, 17.21% in college adoptions, and 17.55% in the library and institutional market. Unit sales were also strong during those years: +14.31% in general retailers; +11.09% in college adoptions; and +14.08% in libraries and institutions. The prognosis for 2008-2012 was for continued strong growth in both revenues and units in all three markets. Table 5 outlines these trends.

Year  Revenues: General retailers  Revenues: College adoptions  Revenues: Libraries & institutions  Units: General retailers  Units: College adoptions  Units: Libraries & institutions
2001  1399.3  1204.3  1659.4  63.6  51.4  41.9
2002  1444.4  1245.8  1714.7  64.3  52.2  45.5
2003  1482.2  1274.1  1759.0  69.8  55.0  46.0
2004  1535.7  1315.6  1816.0  70.5  55.7  46.3
2005  1547.5  1326.0  1832.7  71.2  56.4  47.1
2006  1599.7  1370.5  1893.8  71.5  57.0  47.6
2007  1651.3  1411.6  1950.7  72.7  57.1  47.8
2008  1692.2  1448.9  2002.6  72.7  57.4  48.2
2009  1734.8  1486.5  2054.6  72.9  57.7  48.3
2010  1777.6  1522.9  2104.4  72.9  58.0  48.4
2011  1821.3  1560.3  2155.3  72.9  58.3  48.5
2012  1872.0  1600.0  2207.2  73.1  58.5  48.8

Source: Greco & Wharton’s estimates for 2001-2007 and ARIMA projections for 2008-2012; Greco & Wharton, Book Industry Trends (New York: Book Industry Study Group, Inc., various years). All numbers rounded to one decimal place and may not add up to 100%. Totals include both hardbound and paperbound books. All data refer to the sale of new books; used book sales are excluded.
Table 5: Professional and Scholarly Publishers: Net Publishers’ Revenues and Net Publishers’ Units (2001-2012) for Sales to General Retailers, College Adoptions, and Libraries & Institutions (U.S. $ Millions; Millions of Units)

The pattern for college textbooks in these three markets was equally impressive. Sales to general retailers increased 17.65% between 2001-2007, with the tally for college adoptions hovering near the 17.64% mark, topping the 15.77% increase in the library and institutional market. Unit data were equally striking: +20.0% in general retailers; +10.11% in college adoptions; and +13.33% in the library and institutional market. Table 6 outlines these trends. A comparison of the revenue sales patterns for university presses, professional and scholarly publishers, and college textbook publishers in the three channels was revealing, illuminating the impressive market shares held by professional and scholarly and college textbook publishers:
• general retailers: university presses, +7.73%; professional & scholarly, +18.01%; college textbooks, +17.65%;
• college adoptions: university presses, +14.9%; professional & scholarly, +17.21%; college textbooks, +17.65%;
• libraries and institutions: university presses, +16.0%; professional & scholarly, +17.55%; college textbooks, +15.77%.
Year  Revenues: General retailers  Revenues: College adoptions  Revenues: Libraries & institutions  Units: General retailers  Units: College adoptions  Units: Libraries & institutions
2001  175.6  2875.4  274.6  3.0  47.1  3.0
2002  178.3  2930.8  279.6  3.1  47.7  3.1
2003  181.5  2989.7  285.5  3.3  49.8  3.1
2004  189.8  3133.7  291.8  3.5  56.1  3.4
2005  193.6  3197.2  297.6  3.5  55.7  3.4
2006  199.7  3293.8  307.3  3.5  56.0  3.4
2007  206.6  3382.5  317.9  3.6  56.1  3.4
2008  211.6  3478.7  325.9  3.6  56.4  3.5
2009  217.1  3575.8  334.4  3.6  56.7  3.5
2010  222.9  3677.7  343.2  3.6  57.1  3.5
2011  228.8  3777.2  352.1  3.6  57.4  3.5
2012  234.9  3880.0  362.4  3.6  57.7  3.5

Source: Greco & Wharton’s estimates for 2001-2007 and ARIMA projections for 2008-2012; Greco & Wharton, Book Industry Trends (New York: Book Industry Study Group, Inc., various years). All numbers rounded to one decimal place and may not add up to 100%. Totals include both hardbound and paperbound books. All data refer to the sale of new books; used book sales are excluded.
Table 6: College Textbook Publishers: Net Publishers’ Revenues and Net Publishers’ Units (2001-2012) for Sales to General Retailers, College Adoptions, and Libraries & Institutions (U.S. $ Millions; Millions of Units)

Year  New title output  Annual % change
2001  41,016  N/A
2002  43,554  6.19
2003  47,662  9.43
2004  44,981  -5.63
2005  42,975  -4.46
2006  47,124  9.65
2007  48,951  3.88
2008  N/A     N/A

Source: Yankee Book Peddler; R.R. Bowker (revised totals since 2000).
Table 7: New Title Output of Scholarly Books Published by Professional and Scholarly and Trade Publishers: 2001-2007

New title output of scholarly books by professional and scholarly publishers and trade publishers increased 19.35% between 2001-2007. Table 7 outlines this trend. Second, the emergence of the “serials crisis” (i.e., the growth in the number of, and annual subscription prices for, scholarly journals, often owned by large commercial scholarly publishers such as Elsevier, Wolters Kluwer, Springer, Blackwell, John Wiley, and Taylor & Francis) triggered declines in academic library purchases of university press books (about 1,500 in the mid-1970s; about 200-300 in 2008) [8].
Third, the decline in the number of independent bookstores (about 4,400 in the 1970s and 1,800 in 2008) and the rise of book superstores (Barnes & Noble; Borders; Books-A-Million) [9]. Fourth, a dramatic change in book retailing channels of distribution: the rise in importance of the mass merchants (e.g., Wal-Mart; K-Mart; Target), price clubs (e.g., Costco; BJ’s; Sam’s Club), and other retailing establishments (e.g., supermarkets; drug stores; convenience stores; terminals; etc.) [10]. Fifth, precipitous declines in media usage (i.e., annual hours per person, above the age of 18, spent reading books) [11]. Sixth, the development of an interest in publishing scholarly titles by many of the large trade houses [12]. Seventh, the growth of the college textbook educational publishing sector [13]. Lastly, by the 1990s and the early years of the 21st century, several “disruptive technologies” emerged (e.g., the Internet and electronic publishing options; print-on-demand, POD; the Open Access movement; etc.) that challenged traditional concepts regarding the distribution of intellectual content [14]. Starting around 1980, the majority of all university presses witnessed a sophisticated pincer movement by commercial trade, professional, and textbook companies eager to take business and market share away from university presses. In essence, the basic competitive advantage of university presses (i.e., the ability to dominate the publishing of scholarly books in their three key markets and channels of distribution) changed, at first slowly and then more rapidly; and many university press directors and editors (and many academics rightfully concerned about this situation) pursued innovative, and, unfortunately in some instances, unsuccessful strategies and responses to the frontal attack of commercial publishing companies. One cluster of press directors (and major industry leaders) issued jeremiads about the state of scholarly publishing, and they were joined by academics who ruminated, “How can I get tenured if you cannot publish my book?” Many directors (and industry leaders) tried to convince provosts and presidents to increase their funding to counterbalance the decline in sales. The next strategy was to ask foundations for funding to analyze the decline in sales. Lastly, some presses went to foundations for seed money to publish books in “critical areas” [15]. Another group of directors, more attuned to the ideas of finance and marketing, reevaluated their basic business models and crafted defensive strategies, including: reducing the print runs of new books; curtailing “dual editions” (often called “split runs,” i.e., the simultaneous printing of a hardbound and a paperbound version of a new title); outsourcing line editing and certain production tasks; off-shore typesetting and printing; reducing support staff (often secretaries); and changing domestic distributors, often going to one of the major university press distribution operations or relying on a printer to handle distribution and fulfillment. Some of these strategies worked; some did not. In the years after 1980, two dramatic and completely unanticipated developments occurred (another “Black Swan” to economists and marketers), which took the majority of press directors, editors, and industry leaders off guard. First, far too many university presses failed to realize that the basic laws of supply and demand cannot be rescinded. They continued to increase title output (even when trimming print runs) as demand for their books sagged.
Second, “Wall Street” firms decided to invest in the book industry. The term “Wall Street” refers to financial service companies in New York, Boston, Chicago, and San Francisco, as well as London, Paris, etc. (for example, Bain & Co., Thomas H. Lee, JP Morgan, and Goldman Sachs). Many commercial scholarly presses, trade houses, and college textbook companies were viewed by a growing number of Wall Street investment bankers, private equity managers, and hedge fund executives as either “value stocks” or “growth stocks,” and they invested in many of these companies, taking a few of them “private” [16]. This influx of invested money allowed these commercial publishing companies to gain access to needed capital (known as “capital deepening” to economists and marketers) for investment and expansion.
Why did Wall Street firms target book industry companies when they could have invested in more “glamorous” industries and firms? These Wall Street companies realized that book publishing economics were harsh and unforgiving, but they were understandable and quantifiable. This meant they could develop sophisticated statistical models to predict future earnings. For example, professional and scholarly book publishing companies (as of January 2008) had a low “beta,” which is a measure of volatility. In general, the Standard and Poor’s Index has a “beta” of 1.00. A stock with a “beta” higher than 1.00 has higher volatility but generally generates higher returns than the stock market; a stock with a “beta” below 1.00 has lower volatility but generally generates lower returns. For example, during the months of March and April 2008, Pearson PLC had a “beta” of 0.95; McGraw-Hill’s “beta” was 1.24; John Wiley’s “beta” stood at 1.57; and Reed Elsevier’s “beta” was a rather low 0.65. As a point of comparison, during this same time period, Hewlett-Packard’s “beta” was 1.09 and Amazon.com’s was 3.18 [17]. Scholarly and professional companies also had high “alphas” (i.e., successful editors and publishers able to find and cultivate authors who make money for the house). Clearly, many of these commercial scholarly, trade, and college textbook firms were targeted by Wall Street for investment and expansion. Scholarly and professional publishers, many trade publishers (including Bertelsmann AG’s Random House; Pearson PLC’s Penguin; News Corp.’s HarperCollins; CBS’ Simon & Schuster; and Lagardère’s Grand Central, formerly Little, Brown and Warner Books), and all of the major textbook publishers (e.g., Pearson PLC’s Prentice-Hall; Cengage Learning, formerly Thomson Learning; McGraw-Hill; John Wiley-Blackwell; Von Holtzbrinck; Informa’s Taylor & Francis; etc.) crafted innovative strategies to penetrate and increase their market positions in the scholarly publishing world, including: attracting major scholars with advances (e.g., Professor Mankiw was paid $1.6 million in 1996 by Harcourt, now part of Cengage, to write a principles of economics textbook); generous “step” royalty options; aggressive marketing strategies; and enlarged and expanded channels of distribution in this nation and abroad [18]. In the years after 1980, these commercial publishing companies were able to sell their scholarly tomes or textbooks, pay taxes (university presses are exempt from taxes since they are non-profit entities under the U.S. Internal Revenue Code), provide appealing wages for employees (they hired people who loved books that made money), and make profits for their stockholders. Many of the major commercial scholarly presses also published scholarly journals. Realizing the significant impact electronic journals had on their balance sheets (no printing, paper, binding, mailing, fulfillment, warehouses, warehouse personnel, etc.), many of the largest houses began to offer electronic versions of their books (either an entire book or one or more chapters of a title), a trend that was followed by the major college textbook publishers; as of 2008, trade houses have been unable to monetize significantly their content on digital platforms.
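As a rough illustration of the “beta” statistic discussed above, the sketch below computes beta as the covariance of a stock’s returns with the market’s returns divided by the variance of the market’s returns. The return series are invented for illustration only; they are not data for any of the publishers named in the text.

import numpy as np

market_returns = np.array([0.012, -0.008, 0.015, 0.004, -0.011, 0.009])  # hypothetical index returns
stock_returns = np.array([0.010, -0.012, 0.020, 0.002, -0.015, 0.011])   # hypothetical stock returns

# beta = Cov(r_stock, r_market) / Var(r_market)
beta = np.cov(stock_returns, market_returns, ddof=1)[0, 1] / np.var(market_returns, ddof=1)
print(f"beta = {beta:.2f}")  # > 1.00 implies more volatile than the index, < 1.00 less volatile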
While hard data on electronic sales revenues are difficult to obtain (quarterly and annual financial reports are silent on this issue), it is likely that between 15% and 20% (well over $1 billion) of scholarly and professional net publishers’ book revenues were generated through electronic sales or site license agreements. The number for textbooks is perhaps 5% (approximately $250 million); and trade publishers generate about $60 million annually through digital sales [19].

3. The Business Environment for University Presses: 2001-2006
Can university presses develop realistic marketing plans and regain their competitive advantage? Can they challenge the hegemony of large, global commercial publishers? In light of the proliferation of technological services, are university presses relevant and needed in the 21st century?
Business Model Assumptions

Print Run: 1,000 copies
Gross Sales: 970 copies [1,000 copies less 3% of the print run for author’s copies and office copies]
Estimated sales by channel: Exports 20 copies; General Retailers 183 copies; College Adoptions 239 copies; Libraries & Institutions 184 copies; High School Adoptions 0 copies; Direct to Consumers 20 copies; Other 0 copies
Net Sales: 646 copies
Suggested Retail Price: $65.00
Average Discount: 47% [publisher nets $34.45 per copy]
PPB: $5,341.13 [printing, paper & binding; approximately 19% of net sales; industry average]
Plant: $1,124.45 [editorial and typesetting; approximately 4% of net revenues; industry average]
Marketing: $1,000.00 [$1 times the number of printed copies]
Royalty Advance: 0
Royalty Rate: 0% for the first 500 copies sold; 10% of net revenues for copies 501 and above
Subrights: $200.00 [filmed entertainment; reprints; book clubs; foreign rights; serial rights; 50% for author and 50% for publisher]

Revenues and Expenses and Net Profit/Loss

1. Gross Sales: $33,416.50 [970 copies x $34.45]
2. Returns: -$5,856.50 [170 copies x $34.45]
3. Net Sales: $27,560.00
4. Plant: -$1,124.45
5. PPB: -$5,341.13
6. Earned Royalty: -$502.97 [146 copies x $3.445]
7. Inventory Write-Off: -$1,730.53 [970 - 646 = 324 copies x $5.34]
8. Total Cost of Goods Sold: $8,699.08 [COGS; #4 + #5 + #6 + #7]
9. Initial Gross Margin: $18,861.00 [#3 - #8]
10. Other Publishing Income: +$100.00 [50% of subrights to publisher]
11. Final Gross Margin: $18,961.00
12. Marketing: -$1,000.00
13. Overhead: -$8,268.00 [30% of net sales revenues]
14. Net Profit/Loss: $9,693.00

Source: Greco’s estimates; industry averages.
Table 8: Sample Profit & Loss (P & L) Statement for a Hardbound University Press Book

An analysis of a “typical” university press book’s profit and loss (P & L) statement provides a preliminary framework for addressing some of the questions listed above. We start with a series of basic business assumptions regarding: (1) the print run; (2) gross sales; and (3) potential sales to exporters, general retailers, college adoptions, libraries and institutions, high school adoptions, direct sales made by the press to consumers, and any “other” sales. These assumptions are based on past experiences with similar books and a healthy dose of optimism (perhaps more of the latter than the former).
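For readers who want to trace the arithmetic, the sketch below restates the Table 8 figures as a short calculation. All dollar figures are the authors’ assumptions as printed in the table; only the Python wrapper is added here.

def print_university_press_pnl():
    unit_net_price = 65.00 * (1 - 0.47)           # $34.45 to the publisher after the 47% discount
    gross_sales = 970 * unit_net_price            # 1. $33,416.50
    returns = 170 * unit_net_price                # 2. $5,856.50
    net_sales = gross_sales - returns             # 3. $27,560.00
    plant = 1124.45                               # 4. editorial and typesetting
    ppb = 5341.13                                 # 5. printing, paper & binding
    earned_royalty = 146 * unit_net_price * 0.10  # 6. 10% of net on copies beyond 500
    write_off = (970 - 646) * ppb / 1000          # 7. unsold copies at the unit PPB cost
    cogs = plant + ppb + earned_royalty + write_off           # 8. total cost of goods sold
    initial_margin = net_sales - cogs                         # 9.
    final_margin = initial_margin + 100.00                    # 10.-11. publisher's half of subrights
    net_profit = final_margin - 1000.00 - 0.30 * net_sales    # 12.-14. marketing and 30% overhead
    print(f"net profit/loss: ${net_profit:,.2f}")             # roughly the $9,693.00 shown in Table 8

print_university_press_pnl()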
The next step is to determine: (1) net sales (gross sales minus returns); (2) the suggested retail price; and (3) the average discount (books are sold to retail establishments and distributors at a discount; industry averages were utilized in all of these calculations). Other expenses are then estimated: (1) printing, paper and binding (PPB; 19% of net revenues is the industry average); (2) editorial, typesetting, etc. (plant; 4% of net revenues is the industry average); (3) marketing; (4) the royalty advance against earned royalties; (5) the royalty rate; and (6) any foreign or sub rights. Once these estimates are determined, the actual financial P & L can be run: (1) gross sales minus returns equals net sales revenues (in general, most books are fully returnable to the publisher for a full credit as long as the published terms and conditions of sale are followed by the retailer or distributor); (2) plant, PPB, earned royalty, and inventory write-offs are subtracted from net sales to determine the total cost of goods sold and the initial gross margin; (3) other income is added to the initial gross margin to calculate the final gross margin; (4) marketing costs and the ubiquitous overhead are deducted from the final gross margin; and (5) the end result is either a net profit or a net loss. Table 8 indicates that this “typical” book, which took months to edit and print and tied up thousands of dollars, generated a net profit of $9,693.00. Any slippage in sales could have generated a loss. Our extensive research (and hundreds of discussions with university press directors, commercial scholarly publishers, and trade book executives) indicated that seven out of every ten new books lose money, two books break even financially, and one is a financial success. The vast majority of university presses post financial losses annually, even with subsidies from their universities. Table 8 outlines in detail the economics of publishing a book. The second analysis centered on a study of 63 university presses between 2001 and 2006 (no data were available for 2007). These presses ranged in size, with annual revenues of $900,000-$1.5 million (22 presses), $1.5 million-$3 million (16 presses), $3 million-$6 million (18 presses), and more than $6 million (8 presses). In terms of net operating income (i.e., total book sales income plus any other publishing income minus operating expenses: editorial, production and design; order fulfillment; etc.), losses were posted for all of these presses between 2001 and 2006. The addition of direct parent institution financial support, other subsidies, grants, endowments, and “other press activities” changed the economic picture somewhat. These 63 presses recorded positive total net income results in 2004, 2005, and 2006; losses were generated in 2001, 2002, and 2003. We estimate that a positive net income will be posted by these presses in 2007 and a negative net income in 2008 (and possibly in 2009). So book operations had losses for six years, and financial support from the parent institution ameliorated the situation in three of those six years. In reality, the basic business model of selling printed scholarly books by university presses did not work between 2001-2006, and a review of substantive datasets revealed it has not worked since 1945. If parent institutions trimmed even slightly their financial commitments to the presses, the majority of presses would be in the red financially, and deeply in the red. What should these presses do?

4. Recommendations
Based on an analysis of the relevant, available data, we believe that university presses should consider adopting an exclusive Open Access (i.e., electronic publishing) policy. While each press would continue to utilize the well-established and critically important peer review process for manuscripts and develop its own guidelines, we believe it is imperative financially and economically for these presses to consider the following. First, institute a realistic manuscript submission fee structure, paid by the author(s) (or the author’s academic department and/or college), perhaps $250.00, to cover the initial internal editorial costs
associated with reviewing a submission. Second, if the manuscript has merit and fits into a press’s list, a second fee paid by the author(s) (or the author’s academic department and/or college-university), perhaps $250.00, would cover the expenses of sending the manuscript out for peer review. Many scholarly journals have similar fees, paid for by the author(s) or the author’s department or college; and most universities currently provide some budgets for academics to attend scholarly conferences. This fee structure would become another cost in running a department or college. Another issue centers on the fact that approximately 95 U.S. university presses support the scholarly book publishing activities of academics at more than 3,000 U.S. colleges and universities as well as foreign colleges and universities. So a fee structure provides financial support for the university presses that bear the brunt of reviewing, editing, and publishing an important number of books for faculty members at colleges that do not have a press. Would a fee structure place an unreasonable burden on an author earning a meager salary at a small college, and/or on a department at a college that did not have the financial resources of a well-endowed university? Yes; and the existing playing field is not even. Academics at universities with low teaching schedules and access to substantial financial resources for research have an important competitive advantage over scholars at underfunded departments. These are very serious issues, but they are clearly beyond the scope of this paper. Third, if the peer reviewers recommended publication, a final fee, paid by the author(s) (or the author’s academic department and/or college), perhaps $10,000.00, would cover costs associated with line editing, typesetting, posting the book on the press’s Open Access site, etc. Any or all of these fees can or should be waived for academics from developing nations. Table 9 outlines an Open Access P & L. A small press using Table 9 and releasing 20 Open Access books would generate $128,511.00 in profit; a large press releasing 100 titles would generate $642,555.00 in profit. Three other calculations must be considered. First, we were told that the average press received about 10 manuscripts for every one published. Assuming the fee-based structure dampened the submission of manuscripts, and the small press received 100 submissions at $250.00 each, an additional $25,000.00 in extra income could be booked; the large press might receive 500 submission fees of $250, generating an additional $125,000.00 in revenues. Second, not every press had a contract with an author covering electronic rights, so some backlist titles (i.e., books more than nine months old) would remain print only, although POD could handle these titles. Third, the existing inventory would have to be stored in a warehouse, triggering costs. It could take at best 4-5 years (2012-2013) to reduce this inventory through sales (or write-offs). The movement toward an Open Access-only system provides positive financial results for university presses, allows them to compete with other publishers that are moving rapidly toward the electronic distribution of content, and puts these presses on a sound financial footing, allowing them to continue to exist in both good and bad economic business cycles.
Business Model Assumptions

Print Run: 0
Net Sales: 25 POD copies [POD is print on demand]
Suggested Retail Price: $30.00 POD [$10.00 unit manufacturing cost]
Average Discount: 0
PPB: 0
Plant: $1,124.45
Marketing: $100.00
Royalty Advance: 0
Royalty Rate: 10% of net revenues for all POD copies
Subrights: $200.00

Revenues and Expenses and Net Profit/Loss

1. Gross Sales: $750.00 [25 copies x $30.00]
2. Returns: 0
3. Net Sales: $500.00 [25 copies x ($30.00 less the $10.00 unit manufacturing cost)]
4. Plant: -$1,124.45
5. Earned Royalty: -$50.00 [25 copies x $2.00]
6. Inventory Write-Off: 0
7. Peer Review Fee: -$250.00
8. Total Cost of Goods Sold: $1,424.45 [#4 + #5 + #7]
9. Other Publishing Income: Subrights +$100.00 [50% to publisher]; Submission fee +$250.00; Peer review fee +$250.00; Publication fee +$10,000.00
10. Marketing: -$100.00
11. Overhead: -$3,000.00 [30% of the $10,000.00 publication fee]
12. Net Profit/Loss: Income $10,000 + $250 + $500 + $100 = $10,850.00; $10,850.00 - $1,424.45 - $3,000.00 = $6,425.55 profit for this book

Source: Greco’s estimates; industry averages.
Table 9: Sample Profit & Loss (P & L) Statement for an Open Access University Press Book
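The closing arithmetic of Table 9, and the press-level projections quoted above (roughly $128,511 for a small press releasing 20 titles and $642,555 for a large press releasing 100), can be restated in a few lines. The figures are the authors’ assumptions as printed; the code is only a restatement of that arithmetic.

income = 10000.00 + 250.00 + 500.00 + 100.00   # publication fee, submission fee, POD net sales, subrights
cogs = 1124.45 + 50.00 + 250.00                # plant, earned royalty, peer-review payment ($1,424.45)
overhead = 3000.00                             # 30% of the $10,000 publication fee
per_title = income - cogs - overhead           # $6,425.55 per Open Access title, as in Table 9

print(f"per title: ${per_title:,.2f}")
print(f"small press, 20 titles: ${20 * per_title:,.2f}")    # about $128,511
print(f"large press, 100 titles: ${100 * per_title:,.2f}")  # about $642,555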
5. Conclusions
Clearly, the world changed in the last 20 years. Computers, the Internet, iPods, and cell phones seemed to sprout up everywhere (or in most developed nations), and satellites linked most regions of the world. Yet far too many university presses maintained a centuries-old commitment to an unprofitable business model for their books. Based on an analysis of the empirical data, a review of the published literature and existing business models, our visits and discussions with leaders at more than 50 U.S. university presses (e.g., Harvard, Princeton, MIT, Chicago, Stanford, Carnegie-Mellon, Duke), our discussions with faculty members, and our focus group interviews with more than 500 undergraduate and graduate students, we recommend the following procedures to ensure the continued viability of university presses. First, all direct university press financial subsidies (excluding non-financial subsidies, e.g., free rent, free access to legal services, etc.) provided by their home university should be discontinued by 2012-2013. Of
course, in a market economy, any university that insisted on providing a financial subsidy to its university press can continue this policy. Second, in light of the increased utilization and acceptance of “Open Access” [and electronic publishing] publication models in the scholarly journal sector, a realistic electronic publishing “Open Access” business model should be adopted by university presses for “all” of their books by 2012-2013. Third, existing stringent peer review should be maintained by each university press as it adopts an Open Access business model. Fourth, each university press should determine an appropriate Open Access fee to be paid to the press after a manuscript has undergone peer review and after it has been accepted for publication by a press; this fee can be paid by the author(s), by the author’s academic department and/or college, through research grants or funds, etc.; waivers of the Open Access fee should be granted to author(s) from a developing country. Fifth, each university press should consider selling a hard copy, preferably one produced through a “print on demand” (POD) system, to any individual, library, etc. that prefers or needs a hard copy. This Open Access-POD procedure has been utilized successfully by a number of non-profit publishers (e.g., National Academies Press; The World Bank). Sixth, the “university press community,” working with librarians, NGOs, etc., should craft a global marketing strategy (by 2012-2013) to license digital content in developing nations, especially titles addressing pivotal issues related to economic development, poverty, disease, global warming, and globalization. Seventh, it appears likely, at least in the next 3-5 years, that the scholarly book will remain the principal scholarly platform in the tenure, promotion and merit process in the humanities and in many areas of the social sciences. It will take more than 4-5 years to convince deans and provosts that peer reviewed Open Access electronic books have the same value as a printed book. What might expedite thinking in academia is the “acceptance” of “electronic books” and “electronic book readers” in the trade book market. Eighth, the transformation to an Open Access publishing platform will take 4-5 years. Contracts for many backlist books (especially contracts from the 1990s) might not contain clauses regarding the electronic distribution of a specific author’s book; and unless those contracts are renegotiated, those titles will remain print only. Recently signed contracts (for manuscripts to be delivered in 2008, 2009, and possibly 2010) are unlikely to contain an Open Access-only clause; unless they are renegotiated, these books will remain print only. So it is likely that a university press will have to announce its Open Access policy; and new contracts for manuscript submission in 2010 or 2011 will have to contain the appropriate language. Will some academics refuse to submit a manuscript to an Open Access university press? Yes; but the “publish or perish” mindset of university deans and provosts might be a significant counterbalancing force. However, our analysis sparked some intriguing questions. First, are university presses necessary in an age of electronic distribution of content and a plethora of publishing opportunities offered by scholarly, trade, and textbook publishing companies, all with a broad reach and financial resources that exceed those of the vast majority of university presses?
University presses have a mission to publish and disseminate scholarship, and they offer a useful counterbalance against commercial publishers, although university press title output should be reduced to better match supply with demand. Would scholarship continue to flourish in a strictly commercial publishing environment? While the precise answer to this question is unknown and unknowable unless all university presses disappeared, the outpouring of research titles from commercial publishers might indicate that scholars could continue to get their research out to academics and students. Second, is the institutional affiliation of university presses necessary in an Open Access-commercial publishing environment? From a financial point of view, the answer is no. Universities could reallocate press funding to support other activities, including faculty salaries, scholarships, Open Access publication
fees, etc. If we evaluate the social mission and public relations component of a university affiliation, we might reach a different conclusion. We asked this question of a number of press directors, and one response was telling. The director told us, “R______ University gets great P.R. every time we publish a book that is reviewed in The New York Times, especially if we do a book that has broad consumer appeal and is highlighted in an article by the Associated Press.” There is no easy answer to this question. Third, the electronic distribution of content by commercial scholarly and textbook publishers has been, at least so far, dependent on downloads onto desktop or laptop (notebook) computers and not e-book readers. Price and convenience have been the two main reasons. The average e-book reader costs between $300 and $400, and the reader must then also pay for the book download. Most academics and students have computers, making downloads relatively easy. The newest e-book reader (the highly publicized Kindle) has a black and white screen; most textbooks and many scholarly books rely on color for charts, graphs, etc. So the price of the e-book reader would have to be reduced significantly (in essence, the “King Gillette” razor-and-blades model would have to be utilized) and color options offered in order to penetrate the academic market. Fourth, what is the relationship between print-only sales and electronic downloads? While we reviewed data on print sales for 2001-2007 (in reality we also reviewed print sales datasets back to the 1960s), no publisher has released data on electronic downloads. We reviewed quarterly and annual financial reports, Wall Street analysts’ reports, and conference calls with stock analysts, and we visited a number of publishing companies. Publishing executives told us they book electronic download revenues and not units, and they did not release any data on the ratio between print and electronic download sales. We investigated this issue unsuccessfully in the summer of 2007; however, during the fall of 2007 we began to observe certain patterns that provide a “working analysis” of download revenues. We know that McGraw-Hill textbook operations sold 10,000 downloads in 2006, although we could not ascertain whether these were full book downloads and/or book chapter downloads. We estimate that the revenue number for commercial scholarly publishers is perhaps $1 billion; textbooks are approximately $250 million; and trade publishers generated about $60 million in 2007 through digital sales. However, a significant amount of research is needed to develop firmer numbers. Fifth, many universities have launched online course initiatives; and Harvard University’s faculty of arts and sciences created an opt-out only policy regarding Harvard’s posting of scholarly journal articles. Both of these developments are too recent to evaluate in the context of the Open Access book movement, although both will require analysis in the next year or so. The ultimate goal of all U.S. university presses is to reach readers able or unable to access or purchase printed university press intellectual content or books. Clearly, university presses in the U.S., and indeed throughout the world, face exceptionally complex problems related to their intellectual products, convoluted distribution systems, and the increased competition from commercial trade, scholarly, and commercial textbook publishers who are moving rapidly into the electronic publishing of their content.
There are no “simple” answers to any of these thorny problems; and a review of the published literature reveals the complexity associated with the current emphasis on printed scholarly books [20]. However, we believe a realistic Open Access (electronic publishing) business model will better position university presses to fulfill their mission to disseminate scholarly knowledge and, concomitantly, mitigate the debilitating economic problems that are undermining the very foundation of these presses and threatening their future.

6. Notes and References
[1] COSER, L.A.; KADUSHIN, C.; POWELL, W.W. (1982). Books: The Culture and Commerce of Publishing. Pages 45-57. Basic Books, New York.
[2] GRECO, A.N.; RODRIGUEZ, C.E.; WHARTON, R.M. (2007). The Culture and Commerce of Publishing in the 21st Century. Pages 36-84. Stanford University Press, Stanford, CA.
[3] HARVEY, W.B.; BAILEY, H.S.; BECKER, W.C.; PUTNAM, J.B. (1972). The impending crisis in university press publishing. Journal of Scholarly Publishing 3(3): 195-200; BEAN, D.P. (1981). The quality of American scholarly publishing in 1929. Journal of Scholarly Publishing 12(3): 259-268. The journal Scholarly Publishing changed its name to the Journal of Scholarly Publishing; the new name is the one used here for current and past issues.
[4] GRECO, A.N. (2005). The Book Publishing Industry, 2nd ed. Page 345. Lawrence Erlbaum Associates, Mahwah, NJ.
[5] VAN IERSSEL, H. (2007). Annual university press statistics 2003 through 2006. Pages 23-35. Association of American University Presses, New York; VAN IERSSEL, H. (2006). Annual university press statistics 2001 through 2004. Pages 24-30. Association of American University Presses, New York; VAN IERSSEL, H. (2001). Annual university press statistics 1997 through 2000. Pages 16-24. Association of American University Presses, New York; VAN IERSSEL, H. (2000). Annual university press statistics 1996 through 1999. Pages 16-22. Association of American University Presses, New York.
[6] TALEB, N.N. (2007). The Black Swan: The Impact of the Highly Improbable. Pages 3-21. Random House, New York; TALEB, N.N. (2004). Fooled By Randomness. Pages 5-20, 43-64.
[7] BOOK INDUSTRY STUDY GROUP, INC. (2007). Book Industry Trends 2007. Pages 136-152. Book Industry Study Group, Inc., New York; all of the statistical datasets, projections, and several essays were prepared by Greco, A.N. and Wharton, R.M.
[8] KYRILLIDOU, M.; YOUNG, M. (2008). ARL Statistics 2005-2006: A Compilation of Statistics From One Hundred and Twenty-Three Members of the Association of Research Libraries. Page 12. Association of Research Libraries, Washington, D.C.
[9] BOGART, D., ed. (2007). The Bowker Annual: Library and Book Trade Almanac 2007, 52nd ed. Pages 514-515. Information Today, Inc., New Providence, NJ.
[10] GRECO, A.N. (2005). The Book Publishing Industry, 2nd ed. Pages 26-50. Lawrence Erlbaum Associates, Mahwah, NJ.
[11] VERONIS SUHLER STEVENSON. (2007). Communications Industry Forecast 2007-2011, 21st ed. Pages 52-57. Veronis Suhler Stevenson, New York.
[12] BOOK INDUSTRY STUDY GROUP, INC. (2007). Book Industry Trends 2007. Pages 181-198. Book Industry Study Group, Inc., New York; all of the statistical datasets, projections, and several essays were prepared by Greco, A.N. and Wharton, R.M.
[13] GRECO, A.N. (1987-1988). University presses and the trade book market. Book Research Quarterly 3(4): 34-53.
[14] CHRISTENSEN, C.M. (2000). The Innovator’s Dilemma. Pages xi-xxxii. HarperCollins, New York.
[15] KERR, C. (1987). The Kerr report: One more time. Publishers Weekly, 5 June: 20; KERR, C. (1987). One more time: American university presses revisited. Journal of Scholarly Publishing 1(4): 8-10.
[16] BASS, T.A. (1999). The Predictors: How a Band of Maverick Physicists Used Chaos Theory to Trade Their Way to a Fortune on Wall Street. Pages 8-33. Henry Holt, New York; see also: RODRIK, D. (2007). One Economics, Many Recipes: Globalization, Institutions and Economic Growth. Pages 85-98. Princeton University Press, Princeton, NJ; BURTON, K. (2007). Hedge Hunters: Hedge Fund Masters on the Rewards, the Risk, and the Reckoning. Pages 163-177. Bloomberg Press, New York; GADIESH, O.; MacARTHUR, H. (2008). Lessons from Private Equity Any Company Can Use.
Pages 28-57. Harvard Business School Press, Boston, MA; DERMAN, E. (2004). My Life As A Quant: Reflections on Physics and Finance. Pages 17-28. John Wiley & Sons, Hoboken, NJ; MALKIEL, B. (1996). Pages 164-193. W.W. Norton, New York; DAVENPORT, T.H.; HARRIS, J.G. (2007). Competing on Analytics: The New Science of Winning. Pages 41-82. Harvard Business School Press, Boston, MA.
[17] Statistical data about a specific firm’s “beta” (and other key economic indicators) can be found at www.finance.yahoo.com.
[18] GRECO, A.N.; RODRIGUEZ, C.E.; WHARTON, R.M. (2007). The Culture and Commerce of Publishing in the 21st Century. Pages 36-84. Stanford University Press, Stanford, CA.
[19] Greco’s estimates of the electronic revenue streams for these formats.
[20] Rice University Press launched an Open Access-POD scholarly press in 2007; however, its operation is rather new, and few conclusions can be drawn from Rice’s history; see also: BROWN, L.; GRIFFITHS, R.; RASCOFF, M. (2007). The Ithaka Report: University Publishing in a Digital Age. Pages 1-62. Available at: http://www.ithaka.org/strategic-services/university-publishing; GREENBLATT, S. (2002). Dear Colleague letter to members of the Modern Language Association. Pages 1-2. May 28, 2002; HAHN, K.L. (2008). Research library publishing services: New options for university publishing. Available at: http://www.arl.org; HOWARD, J. (2008). New open-access humanities press makes its debut. The Chronicle of Higher Education, May 7, 2008. Available at: http://chronicle.com; RAMPELL, C. (2008). Free textbooks: An online company tries a controversial publishing model. The Chronicle of Higher Education, May 7, 2008: A14; COHEN, N. (2008). Start writing the eulogies for print encyclopedias. The New York Times, March 3, 2008: WK3; HOOVER, B. (2008). University press tries digital publishing (refers to the University of Pittsburgh Press). Available at: http://www.post-gazette.com; MILLIOT, J. (2008). Report finds growing acceptance of digital books. Publishers Weekly, February 18, 2008: 6; INTERNATIONAL DIGITAL PUBLISHING FORUM. (2008). Industry statistics. Available at: http://www.idfp.org; GRAFTON, A. (2007). Future reading: Digitization and its discontents. The New Yorker, November 5, 2007. Available at: http://www.newyorker.com; TAYLOR, P. (2007). Kindle turns a new page. The Financial Times, November 23, 2007: 12; CRAIN, C. (2007). Twilight of the books: What will life be like if people stop reading books? The New Yorker, December 24, 2007. Available at: http://www.newyorker.com; EBRARY. (2007). 2007 global faculty e-book survey. Pages 1-46. Available at: http://www.ebrary.com.
Scholarly Publishing within an eScholarship Framework – Sydney eScholarship as a Model of Integration and Sustainability
Ross Coleman
Sydney eScholarship, Fisher Library, F03, University of Sydney, New South Wales, 2006, Australia
email: r.coleman@library.usyd.edu.au
Abstract
This paper will discuss and describe an operational example of a business model where scholarly publication (Sydney University Press) functions within an eScholarship framework that also integrates digital collections, open access repositories and eResearch data services. The paper will argue that such services are complementary, and that such a level of integration benefits the development of a sustainable publishing operation. The paper describes the business model as a dynamic hybrid. The kinds of values considered include tangible and intangible benefits as well as commercial income. The paper illustrates the flexible operational model with four brief case studies enabled by integrating repository, digital library, and data services with an innovative publishing service.
Keywords: eScholarship; scholarly communication; Sydney University Press; eResearch; data publication

1. Introduction
Hardly a week goes by without some new challenge to scholarly communication that demands attention, and occasionally, perhaps, a pause to get some bearings. Information technologies and the opportunities of the semantic web and Web 2.0 sit on one side, the complexity of rights and open or managed access on another, and, on yet another side, the need for sustainable and viable business or operational models. Beneath it all lies a yawning divide between the corporate publishing world and that of the institutions (and, within the institutions, the relationships between the traditional presses and the emerging e-presses); and overlaying it all, the power and omnipresence of the global search engines. Approaching over the horizon are the demands and complexity of e-research – as cyberinfrastructure, but also as authoritative data itself becoming a form of publication. The decisions being made now about how to best engage in this environment are not the final solutions. What we need is the best kind of foundations – flexible, responsive, light and open – on which to build the new scholarly publishing and communications structures of the future. A tired cliché, but true: continuing change is the only certainty. Nor are there any single solutions; we need to work within innovative frameworks that accommodate this diversity and these challenges and opportunities, and frameworks that facilitate new partnerships. The framework we have chosen to work within is that of eScholarship. As there are no single solutions, this model must be a dynamic hybrid, seeking to respond and deliver to the diverse and changing set of demands and markets: a model providing solutions for the creators and the consumers of scholarly publications. This paper will discuss an operational program and a business model or methodology where scholarly
publication functions within an eScholarship framework that also integrates digital collections, open access repositories and eResearch data services. The paper will argue that such services are complementary, and that such a level of integration benefits the development of a sustainable publishing operation. This argument will be illustrated with results in four brief case studies. The primary platform for scholarly publishing at the University of Sydney – Sydney University Press – operates as an integral part of the University Library’s Sydney eScholarship program [http://escholarship.usyd.edu.au].

2. Discussion - on eScholarship, publishing, sustainability and integration
Sydney eScholarship operates as an integrated set of services, characterised by:
• commitment to standards for archiving and re-use
• delivery capabilities - publishing services for books, journals, conferences, new forms
• stable open digital repository services
• project analysis and planning advice
• digital library collections and services
• business planning, legal compliance and secure e-commerce capabilities
• partnerships, collaborations and opportunism

2.1 eScholarship
We at Sydney were inspired by the vision of eScholarship originally enunciated by the California Digital Library (CDL): “eScholarship … facilitates innovation and supports experimentation in the production and dissemination of scholarship. Through the use of innovative technology, the program seeks to develop a financially sustainable model and improve all areas of scholarly communication….” [1] CDL continues to explore sustainable models, and acts as a leader in innovation developing services and tools. For example, XTF (eXtensible Text Framework) is being implemented across a number of digital library and publishing services, including at Sydney. The term “eScholarship” is used variously according to context or, indeed, convenience. Most common use is in regard to digital repository services (Boston, Queensland et al.), and sometimes as a catch-all descriptor for services associated with digital activities in higher education [2]. If there is any commonality in usage, it is in reference to digital archive services. At Sydney we have taken a broader understanding of eScholarship: as an overarching framework. This vision enunciated by CDL enabled us to conceptualise and implement a coherent approach to deliver the strategic and operational ambitions for many of the Library’s digital collection and publishing activities. It allowed us to articulate the relationships underlying these activities, and the new roles and expectations in integrating digital collections, open repository services and emerging eResearch support services at the University with a publishing operation. Importantly, it has allowed us to address these activities and relationships pragmatically, offering a set of services that we feel are operationally sustainable, beneficial and productive.
The service components of Sydney eScholarship and the business model underlying these operations will be discussed below, after briefly considering the concept of sustainability.

2.2 Sustainability
Sustainability is one of those comforting aspirational - but slippery - goals, depending on context. Some insight into the complexity of sustainability in the digital environment was gained through participation in the federally funded Australian Partnership for Sustainable Repositories (APSR) [3]. Digital sustainability is described by Kevin Bradley in his Sustainability Issues Discussion Paper for APSR [4] as being technical, social and economic. Bradley describes the following as aspects of sustainability:
• The sustainability of the raw data - the retention of the byte-stream.
• The sustainability of access to meaning - content remaining meaningful for creator and user.
• The economics of sustainability – continued existence of the institutions that support the technology.
• The organisational structure of digital sustainability - relationships between the rights holder, the archive and the user.
• The economics of participation – matters of incentives and inhibitors.
• Sustainability and the value of the data – the value through the life-cycle.
• Tools, software and sustainability.
Central to any practical discussion of sustainability, and implicit in Bradley’s discussion, is the need for organisational continuity. Such a key requirement also underlies similar typologies, such as the attributes and responsibilities of the Trusted Digital Repository [5]. The traditional purveyor of curatorial continuity for publication is the library. While this does not necessarily need to be so in the future, it does explain the repository role of many libraries – in terms of assertion and expectation [6]. In Australia, these roles have been formalised as libraries are funded to provide repository services for various government research assessment initiatives, such as the new Excellence in Research for Australia (ERA) program replacing the Research Quality Framework (RQF) [7], or in the UK for the Research Assessment Exercise (RAE) [http://www.rae.ac.uk/]. The library at many universities, like Sydney, is often the only organisational and curatorial entity that has existed (in one form or another) throughout an institution’s history. The viability of services such as Sydney eScholarship, committed to the long term management and preservation of digital content, relies to a large extent on organisational continuity as part of the University Library. Indeed, this association has raised researchers’ expectations that libraries will have a central role in supporting such initiatives. In a practical sense, organisational continuity and commitment are important for any ambition to provide sustainable information services over the longer term. But this needs to be accompanied by the appropriate skills and expertise, infrastructure services with forward development plans, an innovative, proactive and responsive approach, and a viable and demonstrable operational or business plan to ensure future funding. Within a publishing environment the business plan is critical (even if the plan is 100% institutional subsidy).
2.3 eText to eScholarship
The Sydney eScholarship program was formally launched in 2006, but these services (and the appropriate skills sets) had been evolving over a decade, since the establishment of SETIS (Sydney Electronic Text and Image Service) in the mid 1990s. SETIS was initially established as an eText centre in 1996, similar to many others in the US, and due in part to the missionary zeal of a visiting David Seaman, then at Virginia. The evolution of SETIS from a service networking commercial full text databases to a service creating etext collections was rapid. The skills translated easily from one service to the other. These services provided a platform for the creation of text- and image-based digital library collections. The expertise built up during the 1990s gave SETIS a national reputation in creating and managing such text collections, with a focus on Australian studies [http://setis.library.usyd.edu.au/oztexts/]. This reputation has grown through active partnerships in major research grants in Australian literary and historical studies. The first major project for SETIS was to create and provide full text of primary (literary) and secondary (critical) texts for AUSTLIT [www.austlit.edu.au], the major Australian literary bibliographical and biographical database, funded through Australian Research Council (ARC) grants. This commitment to AUSTLIT continues with ongoing digital conversion of selected literary works. A production process for digital conversion was developed for this and other projects. To ensure the highest possible accuracy, digital conversion involved the double-keying of texts. We eventually settled on a preferred vendor in Chennai, India, and this company remains our major production vendor for digital conversion. Texts were converted and XML files were returned in our established DTD with basic structural mark-up. Further mark-up to the TEI (Text Encoding Initiative) guidelines and processing took place in SETIS, and the XML files were rendered into HTML and web PDFs depending on requirements. The textual corpora created by SETIS (XML-based collections with a range of presentation options) provided Sydney with leadership and acknowledged expertise in creating primary source text collections in Australian studies. The role of SETIS as the primary full text manager in Australian literature also provided the opportunity to consider establishing a publishing operation to meet the demand for print versions. Our Indian vendor - also a production house for several major European publishers - provided additional services such as type-setting for potential print production. The reputation of SETIS continues to bring new and exciting collaborations, consolidating our role, and providing the innovative impetus and funding for much of the new major project work done in Sydney eScholarship.
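The conversion workflow described above (double-keyed XML returned by the vendor, finished as TEI, then rendered to HTML or PDF) might be scripted along the following lines. This is a minimal sketch only: the file names and the XSLT stylesheet are hypothetical, and the paper does not specify the tooling actually used at SETIS; Python with lxml is simply one plausible choice.

from lxml import etree

# Vendor-supplied XML with basic structural mark-up, finished as TEI in-house.
tei_doc = etree.parse("converted/title.xml")

# A local TEI-to-HTML stylesheet (hypothetical path).
transform = etree.XSLT(etree.parse("stylesheets/tei2html.xsl"))
html_doc = transform(tei_doc)

# Write the rendered HTML for web delivery.
with open("web/title.html", "wb") as out:
    out.write(etree.tostring(html_doc, pretty_print=True, method="html"))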
2.4 Sydney University Press and Sydney eScholarship
Sydney University Press (SUP) had existed as a traditional print publisher and press. It was initially established by the University in 1962, but after 25 years of operation was effectively dismantled due to the heavy infrastructure costs. Over this period SUP produced a major list of over 600 titles and several major journals. In 1987 the SUP imprint was sold by the University to Oxford University Press. The imprint was then used mostly for textbooks, but was eventually relinquished by OUP in the early 1990s, and the business name and imprint were abandoned. The University re-registered SUP in 2003 under Library management “to address the challenges of scholarly publication in the networked environment”. The reputation of SETIS as a digital library platform facilitated the re-establishment. Case Study # 1 – digital library to publisher, Classic Australian Works (4.1 below) – describes how the operation and reputation of SETIS was fundamental to re-establishing SUP.
Sydney University Press [http://www.sup.usyd.edu.au/] was revived in the same milieu as a number of new e-presses (many associated with libraries). In Australia these included Monash e-Press, ANU ePress and UTS e-Press. Though SUP was re-established in this context, a decision was made to stay with the name Sydney University Press and not adopt an e-Press banner. There were several reasons for this: it was an established brand name, we were determined to be a print as well as electronic publisher, and we needed to present a business case that would include its operation as a commercial publisher generating income.

Sydney eScholarship was established as a set of innovative services for the University of Sydney to integrate the management and curation of digital content and research with new forms of access and scholarly publication. Within this framework a viable publishing operation was important to add value to the set of services. The business components of this ‘adding value’ are discussed in the methodology of the business model, section 3 below, but real and tangible value flows through all the services of Sydney eScholarship. As a commercial publisher SUP publishes new editorially accepted and market tested titles, as well as a growing re-print list. All titles are electronically archived and sold print on demand or short run. While publishing provides transactional value, the digital library collections and the repository manage content and provide the sound archival foundation to facilitate publishing services. In reality each service provides functional value to the other. Publications derived from associated data sets, described in Case study # 2 – data to publication – from surf to city (4.2 below), are increasingly part of this value chain.

Within the digital environment the key to the value chain (or value circle) is the capacity to re-use, re-present or re-engineer content into different environments. Depending on circumstances this may be into an open access environment or a managed or commercial environment. This capacity to address different demands is an essential part of operational viability. The Dictionary of Sydney project, another ARC research project [www.dictionaryofsydney.org/] in which we are a partner, explicitly has such a model of content re-use and re-engineering in both open and commercial spheres, to ensure the sustainability of the project when research funding ceases. This includes forms of publication via SUP. In this operational context Sydney eScholarship can be broadly understood as an integration of the Sydney Digital Library (creating, managing and curating content) and Sydney University Publishing (providing associated business and publishing services). This is outlined in Table 1 below.
Sydney Digital Library:
• eScholarship repository
• SETIS digital collections
• Sydney Digital Theses
• Data project analyst and advisory services
• hosting subject data services

Sydney University Publishing:
• Sydney University Press
• other imprints
• digital / print on demand services
• eStore, eCommerce and business services
• experimental publication

Table 1: Sydney eScholarship services

Sydney University Publishing, while centred on SUP, does provide other services, many of a business nature. SUP is established as a commercial and scholarly imprint, and this identity needs to be maintained both as a quality publisher and as one that complies with formal ‘research publication’ requirements. We do provide other imprints, such as Darlington Press, for more popular or semi-academic titles. We provide print-on-demand services for other publishers such as Monash ePress and UTS ePress, as well as for administrative publications such as University Faculty Handbooks. We will also provide a secure eStore service for the sale of other published content, in print, and soon in electronic form.
A niche area of increasing interest is conference publication, as a form requiring rapid and open publication. The PKP Open Conference Systems (OCS) provides the publishing platform, and we also provide an Open Journal Systems (OJS) platform. These are integrated into the repository services, as illustrated in Case study # 3 – repository and publishing – open and managed access, in 4.3 below. SUP is also interested in providing platforms for experimental types of publication, such as multi-media streaming. Toward this end SUP recently produced and published its first music CD, Wurrurrumi Kun-Borr. This CD was the joint winner of the 2007 Northern Territory Indigenous Music Awards, and the first in a series from the National Indigenous Recording research project.

The new Sydney University Press was established to integrate expertise in handling digital content with a production facility providing a viable print-on-demand service, and a secure eStore service for commercial sale. It provided the production capacity to meet the formal requirements of research publications.
2.5 Nature of a research publication
In Australia there is currently a formal set of requirements that define (for funding purposes) a ‘research publication’. These requirements do proscribe different modes of scholarly publication, but at the same time provide a useful and defendable set of criteria that offers (for good or bad) some benchmarks for research publication. Not surprisingly, meeting these requirements is a fundamental element in any scholarly publishing model. Publications that meet this definition generate research points that are converted into federal research funds – so publication output is important for both individuals and institutions. The definition of a research publication is outlined in the Higher Education Research Data Collection (HERDC) 2008 specifications [8]:

“For the purposes of these specifications, research publications are books, book chapters, journal articles and/or conference publications which meet the definition of research, and are characterised by:
• substantial scholarly activity, as evidenced by discussion of the relevant literature, an awareness of the history and antecedents of work described, and provided in a format which allows a reader to trace sources of the work, including through citations and footnotes
• originality (i.e. not a compilation of existing works)
• veracity/validity through a peer validation process or by satisfying the commercial publisher processes
• increasing the stock of knowledge
• being in a form that enables dissemination of knowledge” [print or electronic]
Publishing within an eScholarship framework enables us both to comply with these specifications and to investigate other types or modes of publication that still meet the fundamental characteristics of ‘research publication’. One area of growing pressure is the need for the creation of authoritative data sets to be recognised as a valid research activity, and perhaps recognised as a ‘publication’ for funding purposes. Appendix 1 illustrates how a data set may comply with the formal characteristics of a research publication.
3. Methodology - the business model
The historical change in scholarly publishing is facilitated by technologies which have enabled new business and strategic approaches. This is an operational shift in publishing from retail-type single products (eg print run, journal volume etc) to a dynamic services framework. This is more than multi-channel distribution; it is an ongoing process that allows for re-use and re-mix, facilitated by archival formats that enable content to be used in different contexts and markets. This is illustrated in Case study # 4 – literary re-use and customisation – APRIL, 4.4 below.
3.1 A dynamic hybrid
The development of an operationally ‘financially sustainable’ model (that is, one that generates income) was fundamental to the medium term planning of Sydney University Press. The model needed to meet appropriate scholarly and market needs, provide commercial income-generating services, and be capable of generating research points. It also needed to work in the new information environment that offers the benefits of open access, and the challenges and opportunities of packaging and re-using eResearch as publication. This business model is based on a hybrid operational and philosophical approach to scholarly publishing, and a broad recognition of the various elements of value in a business model.

The hybrid approach is demonstrated in the capacity to deliver both digital and print (including on demand) content as appropriate, and in the capacity to mix both open-access and paid delivery of publications as appropriate. This dynamic hybrid model enables response to different demands, requirements and markets. Publication outputs will take the forms appropriate for the work, the readership and the market. There is a continuing market demand for printed works which is serviceable in a digital print environment. Electronic delivery is currently by downloadable free PDFs because of functional constraints with the eStore; a current eStore upgrade will provide the capacity to extend services to sell e-versions in whole or in part.

As a delivery mechanism, print on demand (PoD) from stored files ensures that, theoretically, a work is never out of print. In a business sense this provides the long tail for publishing, where production is most cost-effective, and where a long list with low inventory and turnover contributes towards a viable business proposition. This is part of the business strategy behind Amazon BookSurge, and is also – on a more modest scale – a business strategy behind SUP, enabled through the text archiving processes of the digital library. It is a point where the digital library crosses into business.

This dynamic hybrid model does provide a flexible approach. Importantly, it allows us to alter and adapt the mix of delivery modes as technology, demands and markets shift. In a context of continual change this flexibility is critical. Another advantage of a business approach is that it – ironically perhaps – provides a particular credibility with authors, partners, the media and the trade. When SUP was first conceptualised we envisaged that most sales would be via the web site, direct to customers. However, currently about half of sales are into the trade, to both retail bookshops and library suppliers. This did require a review of how pricing was structured to include a margin for trade discount, and some careful thought about price points. SUP is not exclusively a publisher of University of Sydney work; to be so would only be self-serving, and ultimately self-defeating for a scholarly publisher. In 2007 only about half of new titles were associated with University of Sydney staff.

Our first goal in terms of business viability is operational self-sufficiency. That is, from direct and indirect income we cover all production costs including editorial, copy-edit, indexing where needed, design, layout, proof copies and final copies for legal deposit, review, authors etc, and some internal staff costs. This goal has largely been achieved. Core staffing (business manager) is currently provided by the Library.
3.2 Types of value
We take a broad view of business planning and strategies, and recognise that the values in this model are more complex than forms of income alone; we must also consider other real value and benefits that accrue to individuals and institutions. The value elements of this model can be expressed in a simple matrix:

Direct income (sales etc)                    Tangible benefits (metrics etc)
Indirect income (subsidies, points etc)      Intangible benefits (authority, brand etc)

The nature of these value elements:
• Direct income - from SUP publishing sales and diversified income from print-on-demand services. This income, accrued via eStore sales, is split between SUP and the Library (for infrastructure).
• Indirect income - to SUP in the form of subsidies to assist with publication, common in scholarly publishing. Although preferring a level of subsidy, SUP has taken the whole commercial risk with several titles, and recouped through sales or royalty sacrifice. Another, more substantial, indirect income - though not necessarily to SUP - is the accrual to individuals and universities of research publication points funding from the government (2.5 above). This underwrites some subsidies.
• Tangible benefits - to individuals include higher metrics and profiles for citations and downloads due to open access, internal institutional efficiencies from utilising services such as PoD, and the potential rationalisation of diverse publishing operations.
• Intangible benefits - relate to prestige and recognition through an active scholarly press, and increased individual and institutional research publication productivity.

3.3 Practicalities – legals, marketing and risk
Fundamental to the business process is the contractual basis under which publication is facilitated. All SUP contractual templates comply with University legal requirements, and have been developed with external intellectual property legal advice. Within all these contracts authors retain their copyright; SUP only licenses the work for publication. This enables authors to deposit their content in other repositories, and complies with an open access orientation, as described in Case study 4.3.

Marketing remains a major issue, as SUP does not operate in the traditional trade environment of high advertising, high inventory and distribution. Marketing is to the niche, with a tailored marketing plan for each title and little general advertising. SUP uses targeted media releases and media networks. Publication details are added to all the book-trade lists, and SUP has a small number of preferred independent book retailers. Like many publishers, SUP is negotiating to join GoogleBooks and has signed with Amazon BookSurge for delivery into the North American markets. Marketing is still an area requiring more effort and lateral thought.

Issues of risk have been considered from several perspectives. Legal risks and exposure across all the activities of Sydney eScholarship have been canvassed at length with the University Office of General Counsel. The outcomes of these discussions often take the form of approved templates for contracts, deeds, agreements, memos of understanding etc with project partners, for repository contributors, for data hosted on our servers, and for authors.
These discussions have sometimes involved the need to resolve differing views about exposure through open access and assumed loss of intellectual property rights. SUP does need to comply with university legal requirements (including copyright), and, despite some frustrations, we often arrive at a level of common agreement that enables services to operate largely as we envisage. It is very important that we liaise closely with legal counsel, and continue to have a good and open relationship with them.

The other level of risk is around the publishing operation itself. At the time of establishment SUP was subject to a risk assessment, in terms of production services, internal relationships, external partnerships, and initial support. We have been cognisant of these risks, and at all times have contingencies and alternatives planned for technical, production and business disruption. However, it has been accepted that developing new and innovative services does require the University and the Library to accept a degree of risk. This has been minimised as much as possible in legal terms. The benefits of these services, and the value they add in terms of improved access and communication of university research in the new information environments, have been embraced over any risks that may emerge.
4. Results – the case studies
While, for SUP, print services and sales continue to be a key part of business operations, the importance of integration within the eScholarship framework is fundamental to success. The working relationships through these kinds of integration, and the benefits in terms of productive scholarly outcomes, are described in the case studies below.
4.1 Case study #1 – digital library to re-print publisher – Classic Australian Works
As already noted, SETIS had built a reputation for creating and managing literary full text in the scholarly environment. As part of a digital library collection these texts were maintained in archival form (XML) for rendering into a range of delivery modes. In 2003 the Library was approached by the Copyright Agency Ltd (CAL), the national agency for overseeing copyright enforcement, to discuss a project to bring back to the market in a cost-effective way works of literature that were out of print but still in copyright. CAL had initially approached the National Library of Australia (NLA) to partner the project, but they had referred CAL to Sydney because of our reputation in text creation and archiving (we had partnered with the NLA in other digital projects).

The project proposed by CAL was for them to contribute to the establishment of a print-on-demand publishing operation, and to clear the reproduction rights of the books to be re-printed. In return we would convert and archive the works, establish a publishing operation, a secure web-based eStore for commercial sale, and a production service to facilitate print-on-demand. So the revival of Sydney University Press was set. It was re-established as a light infrastructure integrated on top of existing university services: the SETIS digital text expertise; the digital print capacity of the University Printing Service; and the secure e-commerce transaction service of central IT.

The Classic Australian Works series was established as a partnership between SUP and CAL, and twenty-five major ‘classics’ from twenty authors were selected as the initial tranche. After the series launch we were faced with the challenges of managing a commercial operation, marketing and a whole host of related challenges. The Classic Australian Works series continues with new reproduction of works and a new editorial presence. The infrastructure developed through this initiative provided the initial production and business foundations for the new Sydney University Press.
It is important to appreciate that SUP was re-established through an actual business opportunity and demand, not as the result of an administrative decision. This has set the tone and direction of SUP as a viable business operation.

4.2 Case study # 2 – data to publication – from surf to city

While a data set itself is not recognised as a form of research publication [though it is possible to peg the characteristics of a research publication against a data set, appendix 1], it is possible for some forms of data set to be converted into research and commercial publication. Publication of some data does work well within the context of re-use and re-mix. SUP has published works derived from research data sets. Associated with this is action to also ensure the technical sustainability of the data, as described by Bradley (2.2, above).

The major example is the “Beaches of the Australian Coast” series, currently published in seven volumes, representing states or regions of Australia. This series was derived from a substantial scientific data base detailing every one of the over 10,600 beaches around Australia. The data base covered over 30 elements for each beach including geomorphology, tidal and surf data, safety and recreational data, as well as images of all beaches. While used primarily as a marine data base, the possibilities for publication were obvious, both in the form in which the volumes are now published and in potential re-use or re-mix around particular themes (fishing beaches etc). The data base itself (a myriad of Excel files with little backup) is being re-built as an XML-based data base, which will be archived by the Library for current and future research. The data base provides benchmark data on beaches, important for climate change studies.

Another example of publication from a data set (also being archived) concerns urban planning legislation and practice in Australia. Parts of this data base were rendered to publication as Australian Urban Land Use Planning: Introducing Statutory Planning Practice in New South Wales. This work is used both as a text and a reference work (and data base) by planners.
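A minimal sketch, assuming hypothetical file and column names, of the kind of rebuild described above: each spreadsheet row describing a beach is serialised as an XML element so the data can be archived and re-used. This is illustrative only, not the project's actual schema.

import pandas as pd
import xml.etree.ElementTree as ET

def spreadsheet_to_xml(xlsx_path, xml_path):
    frame = pd.read_excel(xlsx_path)                      # one row per beach, ~30 columns of attributes
    root = ET.Element("beaches")
    for _, row in frame.iterrows():
        beach = ET.SubElement(root, "beach")
        for column in frame.columns:
            tag = str(column).strip().replace(" ", "_")   # make column names usable as element names
            ET.SubElement(beach, tag).text = str(row[column])
    ET.ElementTree(root).write(xml_path, encoding="utf-8", xml_declaration=True)

spreadsheet_to_xml("beaches_nsw.xlsx", "beaches_nsw.xml")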
4.3 Case study # 3 – repository and publishing – open and managed access
The Sydney eScholarship repository – a DSpace installation – provides a secure open access repository service. It provides the storage foundation of the Sydney Digital Library, and the SUP archive. While SUP does operate as a commercial publisher, we are committed and oriented to open access wherever possible. The publishing contract templates we use, which comply with university general counsel requirements, permit authors to retain their rights and allow deposit of content in other repositories.

All conference papers and chapters in edited works (unless specifically blocked by the author) are openly accessible via the repository, and are regularly harvested by services such as Google Scholar. The full work or conference is still available as a completed print work, and remains so as a print-on-demand file, with a link between the repository and the SUP eStore. There is demand for both the print volume and open access at the paper level. Print-on-demand (PoD) satisfies the low print demand in a cost-effective way. In the publication process we do need to ensure that the editorial processes meet the formal requirements of peer review so that individual authors receive due research publication recognition. This recognition is provided irrespective of whether the work is distributed in print or electronic form, as long as the criteria of research publication are met.

A repository service does fit neatly into a publishing operation; indeed it is a fundamental part of the operational model, ensuring that content remains continually available over the long term.
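The harvesting mentioned above relies on the standard OAI-PMH interface that DSpace exposes. The sketch below issues a ListRecords request and extracts record identifiers; the base URL is hypothetical, while the verb, metadataPrefix and namespace are standard OAI-PMH.

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "https://repository.example.edu/oai/request"   # hypothetical DSpace OAI-PMH endpoint

def list_record_identifiers(base_url):
    """Fetch one page of Dublin Core records and return their OAI identifiers."""
    query = urllib.parse.urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    with urllib.request.urlopen(f"{base_url}?{query}") as response:
        tree = ET.parse(response)
    ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
    return [el.text for el in tree.findall(".//oai:header/oai:identifier", ns)]

print(list_record_identifiers(BASE_URL))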
4.4 Case study # 4 – literary re-use and customisation – APRIL
A project funded by the Australian Research Council (ARC) with industry linkage to CAL (Copyright Agency) is the Australian Poetry Resources Internet Library (APRIL) [http://april.edu.au/]. This has been funded as a research project to study the reception and readership of Australian poetry. The project involves the digitisation of the complete works of over 300 Australian poets in the first phase (with associated video and audio of interviews, readings etc). About half of these works are still in copyright, and permissions will be cleared by CAL. This project is one of several supported by CAL from its Cultural Fund to encourage the study and appreciation of Australian poetry and plays.

The text of the poems will be double-keyed and also represented as images. The text will be archived in SETIS as TEI-tagged XML. The content will be rendered via Cocoon from within an XTF framework, and will be RDF (Resource Description Framework) capable for semantic web environments. Several publication options are also being investigated as part of the project, looking to the repackaging and delivery of anthologies in different contexts, including education and general readership. These include processes for producing and selling client-customised anthologies of poems by print on demand through SUP – a design and production challenge for publication. At the rights and business level the use of DOIs (Digital Object Identifiers) will articulate and record the transactions for each poem and poet. This work is in the first year of a three year project, but again illustrates the benefit of integrating digital library and publication services to enable new and experimental modes of publication.
5. Conclusion
The four case studies illustrate the kinds of benefits and synergies that are enabled by integrating repository, digital library, and data services with an innovative publishing service. The capacity of that service to deliver diverse content required the dynamic hybrid operational and business model described in this paper. As hypothesised at the start of this paper, there are no single publishing solutions. These case studies illustrate the need to be able to deliver in different circumstances – to be able to provide appropriate publishing solutions in different contexts.

Managing a commercial publishing operation does raise questions about whether this is a proper role for a library. At the University of Sydney Library it is regarded as an appropriate role, an extension of the traditional roles of preservation and access. A publishing service is regarded as integral to the Library providing leadership in addressing the challenges of communicating research and scholarship in the new contexts of networked information services.

The services integrated through Sydney eScholarship provide the fundamental components to facilitate and support innovative digital projects within the semantic web, and provide stable archival platforms for research. The association with a publishing enterprise that operates in both commercial and open environments provides a service that is attractive to researchers requiring recognised ‘research publications’ and the benefits of secure archiving and open access. Within this integrated eScholarship framework each service adds value to the other – each benefits the other. This is the kind of framework that can contribute to sustaining new models of scholarly publishing.
6. Notes and References
[1] Accessed from the CDL site in October 2005.
[2] For example, in Australia the eScholarship Research Centre at the University of Melbourne is a research and data archive group within an ITC unit, with no publishing agenda. www.esrc.unimelb.edu.au/
[3] Australian Partnership for Sustainable Repositories (APSR) - http://www.apsr.edu.au/ The APSR Project aims to establish a centre of excellence for the management of scholarly assets in digital format. The project has four interlinked programs: Digital Continuity and Sustainability – a centre of excellence to share software tools, expertise and planning strategies; International Linkages Program – participate in international standards and maintain a technology watching brief; National Services Program – support national teaching and research with technical advisory services, knowledge transfer, consultation and collaboration services; Practices & Testbed – build expertise in sustainable digital resource management through partner relationships. The Australian National University: develop and populate a broad-spectrum repository. The University of Sydney: sustainability of resources in a complex distributed environment. The University of Queensland: develop an integrated gateway to a range of repositories of research output.
[4] Bradley, Kevin. Sustainability Issues Discussion Paper. APSR (2005). dspace.anu.edu.au/handle/1885/46445, accessed May 2008.
[5] http://www.oclc.org/programs/ourwork/past/trustedrep/repositories.pdf
[6] It is an interesting sidelight that the role of librarians in taking leadership in these areas has generated some derisive comments from higher education administrators about ‘agenda stealing’. This has only been conversational but quite explicit – an interesting tangent that may be worth further consideration.
[7] http://minister.innovation.gov.au/SenatortheHonKimCarr/Pages/NEWERAFORRESEARCHQUALITY.aspx [accessed 4 May 2008]
[8] http://www.dest.gov.au/sectors/research_sector/online_forms_services/higher_education_research_data_collection.htm#2008_Specifications, retrieved 6 May 2008
Appendix 1 – data as research publication

The HERDC definition of a research publication can be adapted outside of those traditional – mostly textual – publication forms or categories (be they in print or electronic form). The table below identifies, at first cut, a range of requirements that could meet publication criteria and be applied to the development of datasets as a recognised research activity and as a new form of publication output.

HERDC criterion: substantial scholarly activity, as evidenced by discussion of the relevant literature, an awareness of the history and antecedents of work described, and provided in a format which allows a reader to trace sources of the work
Corresponding data set publication requirements:
• credibility of the researchers
• authority of platform/organisation (aka publisher)
• significance of the subject matter
• conceptualisation of data collection
• meeting data and metadata (descriptive, technical, provenance, etc) standard requirements
• relationship/linkage to other datasets
• persistent citability

HERDC criterion: originality (i.e. not a compilation of existing works)
Corresponding data set publication requirements:
• unique data collection
• replicated data necessary for testing or verification

HERDC criterion: veracity/validity through a peer validation process or by satisfying the commercial publisher processes
Corresponding data set publication requirements:
• use of recognised data and metadata standards
• peer review process for data inclusion
• credible/authoritative review panel
• usability/functionality for the research community

HERDC criteria: increasing the stock of knowledge; being in a form that enables dissemination of knowledge
Corresponding data set publication requirements:
• unique primary data
• persistence of citation
• being an identifiable set of data for citation purposes
• IP licence model
• OAIS compliance for harvesting (OAI-PMH)

These requirements may raise many practical questions, and many researchers could add other discipline-specific standards and requirements. However, the table does indicate that it is possible to develop an acceptable set of requirements that would provide defendable criteria for recognition as a research publication.

Source: Coleman, Ross. Field, file, data, conference: towards new modes of scholarly publication. In Sustainable Data from Digital Fieldwork. Proceedings of the conference held at the University of Sydney, 4-6 December 2006. http://hdl.handle.net/2123/1300
Global Annual Volume of Peer Reviewed Scholarly Articles and the Share Available Via Different Open Access Options

Bo-Christer Björk 1; Annikki Roos 1,2; Mari Lauri 1

1 Information Systems Science, Department of Management and Organization, Swedish School of Economics and Business Administration, Arkadiankatu 22, 00100 Helsinki, Finland
e-mail: bo-christer.bjork@hanken.fi; Annikki.roos@ktl.fi
2 National Public Health Institute, Mannerheimintie 166, 00300 Helsinki, Finland
Abstract

A key parameter in any discussion about the academic peer reviewed journal system is the number of articles published annually. Several diverging estimates of this parameter have been proposed in the past, and these have also influenced calculations of the average production price per article, the total costs of the journal system and the prevalence of Open Access publishing. With journals and articles increasingly present on the web and indexed in a number of databases, it has now become possible to estimate the number of articles quite accurately. We used the databases of ISI and Ulrich’s as our primary sources and estimate that the total number of articles published in 2006 by 23 750 journals was approximately 1 350 000. Using this number as denominator it was also possible to estimate the share of articles openly available on the web in primary OA journals (gold OA). This share turned out to be 4.6 % for the year 2006. In addition, at least a further 3.5 % was available after an embargo period of usually one year, bringing the total share of gold OA to 8.1 %. Using a random sample of articles, we also tried to estimate the proportion of published articles available as copies deposited in e-print repositories or on homepages (green OA). Based on the article title, a web search engine was used to search for a freely downloadable full-text version; for 11.3 % a usable copy was found. Combining these figures, we estimate that 19.4 % of the total yearly output can be accessed freely.

Keywords: scholarly publishing; scientific articles; article output; open access
1. Introduction
“Open Access” means access to the full text of a scientific publication on the internet, with no other limitations than possibly a requirement to register, for statistical or other purposes. This implicitly means that Open Access (OA) material is easily indexed by general purpose search engines. There are several widely quoted definitions on the net, for instance the Budapest Open Access Initiative [1]. For the scholarly journal literature in particular, OA can be achieved using two complementary strategies: gold OA means journals that are open access from the start, whereas green OA means that authors post copies of their manuscripts to OA sites on the web [2]. Since there are numerous different types of stakeholders involved in the scientific publishing value chain [3], such as publishers, libraries and authors, with sometimes conflicting interests, a lot of what is being written about OA is strongly biased either towards promoting open access or describing the dangers of open access to the scholarly publishing system. There has also been a discussion among OA advocates about which of the two strategies (gold or green) is better. There is thus an urgent need for reliable figures concerning the yearly volumes of journal publishing, and the share of the yearly volume which is available as open access via different channels.
In most of the earlier discussions about the economy of journal publishing the focus has been on the number of journals, and costs such as the subscription cost have mainly been related to the individual title [i.e. 4]. This was natural given the easy availability of subscription information for individual titles and the handling of paper copies in libraries all over the world. We argue that since the advent of digital delivery of content and the electronic licensing of vast holdings of journal content (“the big deal”), the focus should be more on the individual article as the basic molecule of the journal system, and any average costs should be related to the article. We also think that the ratio of open access articles to the overall number of articles published is a much more important indicator of the growing importance of OA than the number of OA titles compared to the number of titles in general.
2. Total number of articles published
A central hypothesis in this calculation was that the journals indexed by Thomson Scientific’s (ISI) three citation databases (SCI, SSCI and AHCI) on average tend to publish far more articles per volume than the often more recently established journals not covered by the ISI, and that this should explicitly be taken into account in the estimation method. We proceeded as follows. To estimate the total number of scholarly peer reviewed titles we used Ulrich’s periodicals directory and conducted a search with the following parameters: Academic/Scholarly, Refereed and Active. In the winter of 2007 this yielded a total of 23 750 journals.

For the journals indexed by the ISI it was possible to extract the total number of articles published in the last completed year (2006) by conducting a search in the Web of Science (WoS). A general search was done covering all three indexes (Science Citation Index Expanded, Social Sciences Citation Index and the Arts and Humanities Citation Index). The parameters were set as follows: Publication Year = 2006, Language = All languages, Document type = Article. Since the system has a limit of 100 000 on the number of items shown, it was not possible to directly get the total number of indexed articles. The problem was solved by systematically going through the alphabet, setting the Source Title to A*, B*, C* etc. This worked well for all letters for which the total number was less than 100 000, except for A and J. For the letter A a more detailed search on AA*, AB* etc was enough; for J we had to go down to the level of Journal of A*, Journal of B* etc. The total number of articles we arrived at in this way was 966 384. ISI as a rule only indexes peer reviewed journals, but with at least one notable exception, the “Lecture Notes in…” series published by Springer, which publishes conference proceedings in computer science and mathematics in book form. By doing a search using the above as Source Title we got the number of articles published in this series, which was 20 484. Subtracting this from the total results in a final number of ISI articles of 945 900.

If we know the exact number of titles that the ISI tracked in the WoS in 2006, we can easily derive the average number of articles published annually per title. Since we did not have access to exact figures from ISI we had to go a roundabout way to estimate this figure. One indication is given by the number of journals included in the Journal Citation Reports. When searched from Ulrich’s, defining Journal Citation Reports (JCR) as a further search criterion, the result is 6 877 titles. For one reason or another, searching directly in the JCR for 2006 gives more journals: 6 166 titles indexed in SCI and 1 768 in SSCI. AHCI journals are not included in the Journal Citation Reports. We can, however, estimate the number of titles by assuming that AHCI journals on average publish as many articles per year as SSCI journals (53.1), which would result in an additional 532 titles. Summing these up, we would get 8 466 titles.
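The counting workaround just described – partitioning the result set by source-title prefix and summing the partial counts – can be sketched as a small recursive procedure. This is purely illustrative: query_count is a hypothetical stand-in for manually running a Web of Science search and reading off the hit count, not an API that ISI provides.

import string

DISPLAY_LIMIT = 100_000   # WoS would not display result sets larger than this

def count_by_prefix(prefix, query_count):
    """Count 2006 'article' hits whose source title starts with the given prefix."""
    hits = query_count(prefix + "*")
    if hits < DISPLAY_LIMIT:
        return hits
    # Too many hits to display: split the partition one character further
    # (for A this means AA*, AB*, ...; the paper split J on "Journal of A*" etc.)
    return sum(count_by_prefix(prefix + letter, query_count)
               for letter in string.ascii_uppercase)

def total_isi_articles(query_count):
    return sum(count_by_prefix(letter, query_count)
               for letter in string.ascii_uppercase)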
Using these numbers as a base, we are able to estimate the average number of articles published in journals indexed in WoS by ISI as 111.7 per title. This can, for instance, be compared to the figure of 123 articles per year for 6 771 US publishers reported by Tenopir and King [5]. The number of titles indexed in the WoS is probably slightly higher than our estimate for a couple of reasons. The main reason is a time lag between inclusion in the indexes and the first journal citation report produced for a specific journal. According to ISI [6] the number of titles indexed in the citation databases at the end of 2007 was 9 190 journals; at the beginning of 2008, according to ISI’s webpages, the number had risen to 9 300. Assuming that the number of journals indexed rises steadily every year, this would indicate that the 2006 figure lay somewhere between our estimate and these numbers. However, we have chosen to use our earlier estimate (8 466) because the number of titles does not influence the number of ISI articles, which we have obtained separately. It does affect our estimate of the number of non-ISI journals, since these are obtained by subtraction (see below). Since we have estimated these to publish a much lower number of articles per year, the effect of a possible error of 1 % in our number of ISI titles would be only around 0.2 % in the total number of articles.

Taking as a starting point the total number of titles (23 750) and the number of titles indexed by the ISI (8 466), we arrive by subtraction at 15 284 titles not indexed by the ISI. In order to arrive at a total number of articles we now need to estimate how many articles these journals publish on average per year. This was done using a statistical sample of journals. The basis was Ulrich’s database, from which a statistical sample of 250 journals was taken. We set the search so that we only chose journals that have an on-line presence. This might statistically result in a slight bias, but was the only practical way we could study the publication volumes of the journals in the sample. We then extracted the number of articles published in 2006 until we had data for 104 journals (journals in the original sample which were indexed by the ISI, or for which the number of articles could not be found, were discarded). In this group the average number of articles published was 26.2, which, as we had suspected, was considerably lower than for ISI indexed journals. Five of the journals had published no articles and the journal with the highest output had published 225 articles. Multiplying 26.2 by 15 284 results in an estimate of 400 440 articles published in 2006. Adding the figures for ISI brings the estimate of the total number of peer reviewed articles to 1 346 000 (rounded off), with 70 % covered by the ISI. In their answer to a UK House of Commons committee, Elsevier in 2004 estimated that some 2000 publishers in STM (Science, Technology and Medicine) publish 1.2 million peer reviewed articles annually [7]. Taking into account publishing in the social sciences and the humanities, our estimate seems to be well in line with these figures.
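The arithmetic behind this estimate can be restated in a few lines; all figures are those reported above, so this is only a transcription of the calculation, not new data.

isi_articles = 966_384 - 20_484            # WoS hits for 2006 minus Springer "Lecture Notes" items = 945 900
isi_titles = 6_166 + 1_768 + 532           # SCI + SSCI + estimated AHCI titles = 8 466
articles_per_isi_title = isi_articles / isi_titles        # about 111.7

non_isi_titles = 23_750 - isi_titles       # Ulrich's refereed total minus ISI titles = 15 284
non_isi_articles = 26.2 * non_isi_titles   # sample mean 26.2 -> about 400 440

total_2006 = isi_articles + non_isi_articles              # about 1 346 000 (rounded)
isi_share = isi_articles / total_2006                     # about 0.70
print(round(articles_per_isi_title, 1), round(total_2006), round(isi_share, 2))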
3. Share of OA publishing
In policy discussions concerning Open Access publishing a very important question is “what share of all scientific articles is available openly?”. For a given year, in our case 2006, this concerns both articles directly published as open access (the so-called gold route in OA jargon) and articles published in subscription-based journals where the author has deposited a copy in a subject-based or institutional repository (green route). It is easier to estimate the number of gold route articles. For copies in repositories, the evidence is much more scattered, and there is the additional difficulty of checking the nature of the copies (copy of the manuscript as submitted, personal copy of the approved manuscript, or replica of the published article).
3.1. Gold

To estimate the number of articles directly available as OA in 2006 the Directory of Open Access Journals (DOAJ) would at first sight seem to be the natural entry point. At the time of checking, the directory listed 2 961 journals. Using the directory it is easy to go directly to the web pages of a journal and manually count the number of articles published. One problem is, however, that DOAJ states as its inclusion criteria that journals are quality controlled by peer review or editorial quality control. When we searched Ulrich’s for our earlier analysis, we only included journals which had self-reported as refereed (23 750 titles). If we relaxed that criterion and only required a journal to be active and scholarly/academic, a search in Ulrich’s yields 60 911 titles. With the additional criterion of open access, the corresponding figures were 1 735 refereed and 2 690 scholarly/academic in all. The latter figure is, as could be expected, quite close to the DOAJ total. For these reasons we decided to use Ulrich’s as an entry point, concentrating on the 1 735 journals listed as refereed and open access. In doing the actual counting we tried as far as we could, based on the tables of contents on the web, to include only research articles, excluding editorials etc. This is in line with our earlier use of ISI, where we concentrated on the article category only.

There are a handful of major OA publishers – Public Library of Science (PLoS), BioMed Central, Hindawi and Internet Scientific Publications (ISP) – which use article charges or other means to fund their operations. We counted their articles separately, since they have some high-volume journals. All 7 PLoS journals are listed in Ulrich’s as peer reviewed. Of the 176 BioMed Central journals listed in DOAJ, 172 are also listed in Ulrich’s as scholarly and 139 as refereed. For OA journals from other publishers, often published on university web sites using an open source mode of operation with neither publication charges nor subscriptions, we again used a sampling technique. The starting point for this was the figure from Ulrich’s of 1 735 OA titles in total, from which we subtracted the number of titles operated by the four publishers listed above, resulting in 1 487 titles. A selection of 100 journals was made from this set and the number of research articles was counted from the tables of contents on their web sites. This resulted in an estimate of the mean number of articles published per year of 34.6.

Table 1 shows our calculation of the number of OA titles and the number of articles published in 2006. We estimated the total number at 61 313, which represented 4.6 % of all articles published in 2006. Our figures can be compared to a number of earlier studies. Regazzi [8] used a similar sampling method to study the journals listed in DOAJ in 2003 and 2004 and found a drop in the estimated total number of articles from 25 380 to 24 516, indicating an overall share of 2 % of STM articles. He notes that OA journals on average publish far fewer articles (30 on average) than established journals, and quotes an average of 103 for ISI-tracked STM journals and 160 for the 1 800 titles of Elsevier. We have also ourselves earlier studied this number through a web survey of the editors of open access journals, and then obtained a rather lower figure of 16 articles per year [9].
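Before the table, the gold OA estimate can be restated as a short calculation. The figures are those given above; the small difference from the published 51 465 for the other OA journals presumably reflects an unrounded sample mean, 34.6 being the rounded value quoted in the text.

big_titles   = {"PLoS": 7, "BioMed Central": 139, "Hindawi": 44, "ISP": 58}
big_articles = {"PLoS": 881, "BioMed Central": 6_589, "Hindawi": 1_643, "ISP": 737}

other_titles   = 1_735 - sum(big_titles.values())     # 1 487 refereed OA titles outside the big four
other_articles = 34.6 * other_titles                  # about 51 450 (paper reports 51 465)

gold_articles = sum(big_articles.values()) + other_articles
gold_share = gold_articles / 1_346_000                # about 0.046, i.e. 4.6 %
print(round(gold_articles), round(100 * gold_share, 1))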
Publisher            Peer reviewed titles (Ulrich’s)    Articles 2006
PLoS                 7                                  881
BioMed Central       139                                6 589
Hindawi              44                                 1 643
ISP                  58                                 737
Other OA journals    1 487                              51 465
SUM                  1 735                              61 313
Table 1. Number of OA titles and articles in 2006

In a white paper on open access publishing from Thomson Scientific [10], the owner of ISI, numbers are given for open access articles included in the Science Citation Index. The text indicates that the OA publishers were first determined from the ROMEO database on publisher OA policies, after which the articles were counted.
The number of OA articles in SCI in 2003 was 22 095 out of a total of 747 060. Thus roughly 3.0 % of all articles in ISI’s Science Citation Index would have been open access in that year.

3.2. Delayed and hybrid OA

In addition to pure gold OA publishing there are two additional routes which could be worth studying: the open publishing of individual articles in otherwise closed journals against a separate fee (sometimes labelled open choice), and delayed open access publishing of whole journals. The important thing is that in both these options the version accessed is the original publication, at the publisher’s website; the only difference is that the access restrictions have been lifted either for a single article, or for articles published before a specific date.

All of the biggest publishers – Springer, Taylor & Francis, Blackwell, Wiley and Elsevier – provide the option of freeing individual articles against a fee for a wide spectrum of journals [see 11]. Typically this opportunity is offered for a subset of the journals in a publisher’s collection. Oxford University Press is an example of a publisher which has been among the first hybrid providers, and Karger is an example of a publisher which offers “Author’s Choice” for all of its journals. There are no systematic studies on how commonly the open choice option has been chosen by authors, but so far uptake appears to be rather low. We chose not to do any calculations of our own, since this would be very labor-intensive due to the scattering of relatively few articles among a vast number of titles.

Delayed open access is more common among society publishers than commercial publishers. A good example of an individual journal practicing delayed OA is Learned Publishing, the articles in which become OA roughly one year after publishing. A lower bound for an estimate of the prevalence of delayed OA can be obtained via the web portal of HighWire Press, which hosts the e-versions of currently 1 080 journals from over 130 mostly non-commercial publishers. Only a small number of the journals (43) are fully open access from the start, but of the total of 4.6 million articles 1.8 million are freely available. The fully open access ones are such that the print version is subscription-based but the online version is free. A search in the database for articles posted during 2006 results in 219 224 hits. This figure may not exactly coincide with the number of articles formally published during that same year, and some caution is in order given that some of the serials in HighWire Press should not be classified as fully refereed scholarly journals. Of the 1 080 HighWire journals, 277 (as of January 2008) offer direct or delayed open access. Table 2 lists the numbers in the different delay categories as well as an estimate of the total number of articles. The latter has been made assuming that the average number of articles for these journals is the same as for all the journals in the HighWire portal. Comparing this to the total number of articles published in 2006, the share of delayed OA can be estimated to be at least 3.5 %, bringing the sum of direct and delayed gold OA to 8.1 %.
Delay              No. of journals    % of all HW journals    Estimated number of articles
Direct OA          43                 4.0                     8 700
1-6 months         27                 2.5                     5 481
7-12 months        190                17.6                    38 567
Over 12 months     17                 1.6                     3 451
Delayed in total   234                21.7                    47 499

Table 2. OA articles published electronically by HighWire Press.
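The Table 2 estimate can be reproduced by assuming, as stated above, that each delay category publishes articles in proportion to its journal count; minor differences (e.g. 8 728 vs the quoted 8 700 for direct OA) are only rounding.

per_journal = 219_224 / 1_080                         # average HighWire articles per journal in 2006
journals = {"direct OA": 43, "1-6 months": 27, "7-12 months": 190, "over 12 months": 17}
articles = {k: round(v * per_journal) for k, v in journals.items()}

delayed_total = sum(v for k, v in articles.items() if k != "direct OA")   # about 47 500
delayed_share = delayed_total / 1_346_000             # about 0.035, i.e. at least 3.5 %
print(articles, delayed_total, round(100 * delayed_share, 1))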
From the viewpoint of readers, hybrid (“open choice”) and delayed open access are less useful than full and instant open access at the title level for “current awareness” reading, where academics track what is being published in a few essential journals either by getting a paper copy or an e-mail table-of-contents message. This type of information activity is called “monitoring” in Ellis’s model of information-seeking behaviour [12]. Hybrid and delayed open access help more in cases where a reader tries to access a given article based on a citation (called “chaining” in Ellis’s model).

3.3. Parallel publishing of copies (green)

It is much more difficult to estimate the prevalence of green OA than gold OA. Copies of articles published in refereed journals are scattered in hundreds of different repositories as well as on even more numerous homepages of authors. There is also the issue of the actual existence of a digital copy on some server versus how easy it is to find it using the most widely used web search engines. For the purposes of this article, we take the pragmatic view that unless you get a hit in Google (or Google Scholar) using the full title of an article, a copy “does not exist”. This is both because a copy which cannot be found this way is very difficult for a potential reader to find, and because the best systematic way of measuring the proportion of “green articles” is via systematic search on article titles using a popular search engine, such as Google.

An additional complication is that the full text copy found may differ quite substantially from the final published version. It can in the best of cases be an exact copy of the published file (usually PDF), but it can also be a manuscript version from any stage of the submission process. The most useful version is often labelled “accepted for publication” and sometimes includes the changes resulting from the final copy-editing done by the publisher’s technical staff, sometimes not. The layout and page numbering are also usually different from the final published version. Most publishers who allow posting of a copy of an article in an e-print repository allow posting of this so-called “personal version”. In addition, some researchers also upload earlier manuscript versions, often called preprints, but this is not as common except in certain disciplines such as physics.

In order to estimate the green route to open access we selected a random sample of all peer reviewed articles published in 2006. The entry point was again Ulrich’s, from which we took a sample of both journals listed in the ISI Web of Science and journals not listed there. The sample was proportional, so that the number of articles from ISI corresponded roughly to the share of ISI in the total number of articles (it included 200 articles in ISI journals and 100 articles in non-ISI journals). A spreadsheet listing the title of the article, the three first authors and the name of the journal was created from the sample. A search was then conducted in Google, systematically using the name of the article and secondarily the authors’ names, using a computer which had Internet access but no access to our university intranet, which would automatically allow access to the journals we subscribe to. (We also tried Google Scholar at first, but dropped it after a while since we noticed that the search results turned out to be almost identical.)
In order to keep the workload manageable and follow the viewpoint of an average searcher, who does not want to spend too much time and energy, we only searched the first 10 hits, which is also what you usually see on the first screen. If we got a hit which was not on the journal’s own website and which included a full-text document, available without subscription, that seemed to fulfil the criteria, a copy was downloaded and saved. The last check was performed by comparing the obtained copy to the published official version, which we obtained separately via our own university website or the website of the publisher, in order to see that the copy was close enough to the original article. Out of the 35 copies we studied we had subscribed access to 32 and were able to do the comparison; for the remaining three we assumed the copies to be usable. Two of the copies studied turned out to differ significantly in content from the original, and were therefore discarded.
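The per-article check described above can be summarised as a small decision procedure. Here web_search is a hypothetical helper standing in for a manual Google query returning the first hits as (url, has_full_text) pairs; in the study this step, and the final comparison with the published version, were done by hand.

def find_green_copy(article_title, journal_site, web_search):
    """Return the first candidate full-text URL not hosted on the journal's own site, or None."""
    for url, has_full_text in web_search(article_title, max_hits=10):
        if journal_site in url:
            continue              # the publisher's own version does not count as a green copy
        if has_full_text:
            return url            # candidate copy, to be compared with the published version
    return None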
The results concerning copies in repositories were very similar for ISI-indexed journals (11 %) and the other journals (12 %), bringing the weighted average to 11.3 %. The spread between different formats and different types of repositories is shown in the table below, but the absolute numbers per category are so small that it is difficult to generalize to the whole target population. Table 3 shows the percentage of green OA versions found, by type of site and type of copy:

Type of site                Exact copy    Personal version    Other version    All
Subject based repository    0.7           2.3                 0.3              3.3
Institutional repository    4.7           0.3                 0.0              5.0
Author's home pages         1.7           1.3                 0.0              3.0
All                         7.0           4.0                 0.3              11.3
We found no case of overlap in which the same article was both published as gold OA on a publisher's website and available as a copy in a repository. Thus the figures for green OA can be added to our earlier estimate for gold OA (8.1 %) to obtain a total OA availability of 19.4 %. We were of course also able to check the direct gold availability of the articles in this sample. For the articles in ISI journals the percentage was 15, but for non-ISI articles it was an astonishing 35 %, compared with our earlier figure of 8.1 %. The reasons for this can be twofold. Firstly, in producing the sample we were in practice restricted to journals which at least have their tables of contents freely available on the net. Our experience in producing the sample, in terms of how many candidate journals we had to disqualify because of a lacking web presence, indicated that for ISI-listed journals the availability of web tables of contents is nowadays rather high, whereas for non-ISI journals the percentage is much lower. Unfortunately we did not keep exact records when we produced our sample, which could have helped in correcting the estimate for this factor. Secondly, there might be a random element in this calculation, which could of course be reduced by increasing the sample size. All in all we believe our earlier estimate of gold availability to be the more reliable one.
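The headline figures quoted above follow directly from the per-stratum percentages. The minimal check below assumes an ISI article share of roughly 0.7, which is consistent with the weighted average reported here but is taken from earlier in the paper rather than from this passage.

```python
# Quick arithmetic check of the OA shares discussed above.
gold_and_delayed = 0.081  # earlier estimate for gold (incl. delayed) OA, from the text
green_isi        = 0.11   # green copies found for articles in ISI-indexed journals
green_non_isi    = 0.12   # green copies found for articles in non-ISI journals

isi_share = 0.7           # assumed share of ISI articles in the total volume

green_weighted = isi_share * green_isi + (1 - isi_share) * green_non_isi
print(f"weighted green OA share: {green_weighted:.1%}")                     # ~11.3 %
print(f"total OA availability  : {gold_and_delayed + green_weighted:.1%}")  # ~19.4 %
```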
4. Conclusions and discussion
We have estimated in this study that the number of scientific articles published in 2006 was approximately 1 346 000. Our hypothesis about the difference in the number of articles published per title between titles indexed by ISI and non-ISI titles appeared to be correct: the non-ISI journals published on average 26.7 articles per title and the ISI journals 111.7 articles. 4.6 % of the yearly article output appears in gold OA journals and at least 3.5 % becomes open after a delay period. 11.3 % of articles are openly available in repositories or, for example, on personal web pages. Altogether, 19.4 % of the yearly output is openly available. The different elements in our calculation differ in terms of accuracy. The total number of articles included in the ISI indices should be very accurate, provided that we have searched the database in a correct way. The total number of journals tracked by the ISI in a given year is also reasonably accurate. The total number of peer reviewed scholarly journals is much more difficult to estimate accurately. Ulrich's database is the best tool available for this purpose, but its coverage is not 100 %, and there are some inactive journals included in the category. On the other hand, if we order the total journal market by the number of yearly articles per title, we get a distribution with a few very high-volume titles and many journals with few articles. It is very likely that journals which are not listed in Ulrich's publish rather few articles per annum, and thus their contribution to the total volume of articles is rather marginal.
It is also likely that Ulrich's coverage of journals published in the Anglo-Saxon countries is more comprehensive than its coverage of journals published in non-English-speaking countries, and in particular of journals in languages other than English. It is also impossible to draw a clear borderline between journals practicing full peer review and journals where the editors merely check the content of the submission; in this respect we simply have to trust the journals' self-reporting to Ulrich's database. We have also excluded conference proceedings produced using a referee procedure, since it would be very difficult to find data about these; the one notable exception is the Springer Lecture Notes series, but we chose to exclude it from our calculations. An interesting study of the growth of open access and of the effect of open versus closed access on the number of citations has been carried out by Hajjem, Harnad and Gingras [13]. They used a web robot to search for full texts corresponding to the citation metadata of 1.3 million articles indexed by the ISI over a 12-year period (1992-2003), focusing in particular on differences between disciplines in the degree of open availability and in the citation advantage provided by OA. Articles published in OA journals were excluded, so their results concern articles published in subscription-based journals for which the author (or a third party) has deposited a copy on a web site which allows full-text retrieval by web robots. According to the study the degree of green OA varied from 5 to 16 % depending on the discipline, but from our viewpoint the most important figure was that, for the total of 1.3 million articles, OA full-text copies could be found for 12 %. This figure includes direct replicas, authors' accepted manuscripts after review (“personal versions”) and submitted manuscripts (“preprints”), since it can be assumed that the robot could not distinguish between these if the title and authors had remained unchanged. All in all we believe our estimates to be more accurate than the estimates that have been presented earlier in different contexts. We have defined our method in detail, and the estimate can easily be replicated and/or adjusted by other researchers in later years.
5. Acknowledgements
This study was partly financed by the Academy of Finland, through the research grant for the OACS project (application no. 205993). We would also like to thank Piero Ciarcelluto for his assistance in the data gathering phase.
6. References
[1] Budapest Open Access Initiative. 2002. http://www.soros.org/openaccess/read.shtml
[2] Harnad S., Brody T., Vallières F., Carr L., Hitchcock S., Gingras Y., Oppenheim C., Stamerjohanns H. and Hilf E.R. 2004. The Access/Impact Problem and the Green and Gold Roads to Open Access. Serials Review, Vol. 30(4), pp. 310-314.
[3] Björk B-C. 2007. A model of scientific communication as a global distributed information system. Information Research, Vol. 12(2), paper 307. Available at: http://InformationR.net/ir/12-2/paper307.html
[4] European Commission. 2006. Study on the economic and technical evolution of the scientific publication markets in Europe. Brussels: European Commission, Directorate General for Research. Available at: http://ec.europa.eu/research/science-society/pdf/scientific-publication-study_en.pdf
[5] Tenopir C. and King D. 2000. Towards Electronic Journals – Realities for Scientists, Librarians and Publishers. Washington D.C.: Special Libraries Association.
[6] Horky D. 2008. E-mail from David Horky, Thomson Scientific, 17 January 2008.
[7] Elsevier. 2004. Responses to the questions posed by the Science and Technology Committee. Document submitted to the UK House of Commons Select Committee on Science and Technology by Elsevier on 12 February 2004. Available at: http://www.elsevier.com/authored_news/corporate/images/UK_STC_FINAL_SUBMISSION.pdf
[8] Regazzi J. 2004. The Shifting Sands of Open Access Publishing, a Publisher's View. Serials Review, Vol. 30, pp. 275-280.
[9] Hedlund T., Gustafson T. and Björk B-C. 2004. The Open Access Scientific Journal: An Empirical Study. Learned Publishing, Vol. 17(3), pp. 199-209.
[10] McVeigh M.E. 2004. Open Access Journals in the ISI Citation Databases: Analysis of Impact Factors and Citation Patterns – A Citation Study from Thomson Scientific. http://scientific.thomson.com/media/presentrep/essayspdf/openaccesscitations2.pdf
[11] Morris S. 2007. Mapping the journal publishing landscape: how much do we know? Learned Publishing, Vol. 20(4), pp. 299-310.
[12] Ellis D. 2005. Ellis's model of information-seeking behaviour. In: Fisher K.E. et al. (eds.) Theories of Information Behaviour. Medford: Information Today, pp. 138-142.
[13] Hajjem C., Harnad S. and Gingras Y. 2005. Ten-Year Cross-Disciplinary Comparison of the Growth of Open Access and How it Increases Research Citation Impact. IEEE Data Engineering Bulletin, Vol. 28(4), pp. 39-47. Available at: http://eprints.ecs.soton.ac.uk/11688/
Characteristics Shared by the Scientific Electronic Journals of Latin America and the Caribbean

Saray Córdoba-González (1); Rolando Coto-Solano (2)
(1) Vicerrectoría de Investigación, Universidad de Costa Rica, Ciudad Universitaria Rodrigo Facio, San José, Costa Rica; e-mail: saraycg@gmail.com
(2) Vicerrectoría de Investigación, Universidad de Costa Rica, Ciudad Universitaria Rodrigo Facio, San José, Costa Rica; e-mail: rolandocoto@gmail.com
Abstract
Our objective is to analyze the use that Latin American peer-reviewed journals make of the tools and opportunities provided by electronic publishing, particularly those that would make them evolve into more than “mere photocopies” of their printed counterparts. While doing this, we also set out to discover whether there are any Latin American journals that use these technologies in an effective way, comparable to the most innovative journals in existence. We extracted a sample of 125 journals from the electronic resources index of LATINDEX – the Regional System of Scientific Journals of Latin America, the Caribbean, Spain and Portugal – and compared them along five dimensions: (1) non-linearity, (2) use of multimedia, (3) linking to external resources (“multiple use”), (4) interactivity, and (5) use of metadata, search engines, and other added resources. We found that very few articles in these journals (14%) used non-linear links to navigate between different sections of the article. Almost no journals (3%) featured multimedia contents. About one in every four articles (26%) published in the journals analyzed had their references or bibliographic items enriched by links that connected to the original documents quoted by the author. The most common form of interaction was user→journal, in the form of question forms (17% of journals) and new-issue warnings (17% of journals). Some journals, however (5%), had user→user interaction, offering forums and responses to published articles by the readership. About 35% of the journals have metadata within their pages, and 50% offer search engines to their users. One of the most pressing problems for these journals is the incorrect use of rather simple technologies such as linking: 49% of the external resource links were mismarked in some way, with a full 24% being mismarked through spelling or layout mistakes. Latin American journals still present a number of serious limitations in their use of electronic resources and techniques, with text being overwhelmingly linear and under-linked, e-mail to the editors being the main means of contact, and multimedia a scarce commodity. We selected a small sample of journals from other regions of the world, and found that they offer significantly more non-linearity (p = 0.005 < 0.1), interactive features (p = 0.005 < 0.1), use of multimedia (p = 0.04 < 0.1) and linking to external documents (p = 0.007 < 0.1). While these are the current characteristics of Latin American journals, a number of very notable exceptions speak volumes about the potential of these technologies to improve the quality of Latin American scholarly publishing.

Keywords: Electronic journals; scholarly journals; Latin America; serials quality criteria; LATINDEX
1. Introduction
Electronic journals in Latin America have been under-analyzed in terms of their architecture, particularly in how well they exploit the tools made available by Internet technologies, which provide new ways to produce interaction with the readers, non-linearity in the text, and multimedia content to illustrate and complement the articles. If online journals take advantage of these novel tools, they can become more
than mere clones of their printed versions. This would give them an advantage that could potentially place them, format-wise, on a par with the more innovative journals in the world, and could help debunk some pervasive prejudices held by the scientific community towards electronic publishing [1]. We analyzed a sample of peer-reviewed journals from Latin America and the Caribbean, and measured the adoption of those features in journals from across the region. We seek to diagnose the present situation in Latin America, but also to provide a basis for comparison with related journals in other parts of the world. Mayernik [2] published a study along this line, in which he analyzed 11 psychology and 10 physics journals, but he placed no emphasis on their geographical origin. A number of virtual libraries, indexes and repositories have sprung up in Latin America to support the work of the local journals, as well as to help them make better use of their resources (particularly monetary resources), with proposals that can improve editorial quality, including the introduction of open-source software in their work cycles and the adoption of Open Access as a philosophy for the journals. All of these efforts are aimed at making this “hidden science” more accessible to the academic world in both the local and the global spaces. The Regional System of Scientific Journals of Latin America, the Caribbean, Spain and Portugal – LATINDEX (www.latindex.org) – was created in 1997, and currently has a directory of more than 16,000 journals. It also provides a criterion-reviewed “catalogue” with 2,952 journals. These journals have been selected based on 36 evaluation criteria that describe basic editorial quality. Among these criteria, three aspects are aimed specifically at online journals: use of metadata, incorporation of a search engine for the content of the site, and inclusion of “added content”, such as lists of “links of interest”, discussion forums, etc. Of the journals in the directory, 2,490 have an electronic version. Electronic journals have been at the center of a long-running discussion in the editorial world. As early as 1997, Valuskas [3] defined an electronic journal as “a digital periodical dedicated to publishing, on the Internet, articles, essays, and analyses that have been read and commented upon initially by a select group of editors and reviewers, to meet a certain arbitrary standard of excellence (as determined by the editors) for a given discipline addressed by the journal itself”. In this sense, electronic journals were perceived in terms of their availability on the web. However, this definition leaves out any mention of the potential exploitation of Internet tools. Valuskas [3] complements the explanation by saying that “The very electronic nature of the journal provides ample opportunities for experimentation with formats, layouts, fonts, and other design features, although many electronic journals fail to jump at some obvious opportunities to make given issue more readable and appealing to the eye”. One year later, Hitchcock et al. [4] drew attention to the importance of links within the text, as well as other Internet-related advantages that could improve access to the knowledge contained in scientific journals. Efforts have been made to clearly define the parameters upon which electronic journals could be evaluated. An important description deals with the relation an electronic journal might have with a potential printed counterpart.
Kling & McKim [5] defined three possibilities for this relation: pure e-journals that were born electronic; p-e journals, where the articles are first published on paper but electronic distribution is also possible; and e-p journals, where the electronic format is the predominant version but limited quantities of paper versions are also produced. While the authors are very clear in pointing out that not all Internet-based journals are rigorous in their peer-review processes, they establish that the model of publication does not determine the quality of the final product. Notwithstanding the way they got onto the web, we assume as a matter of course that electronic journals must be peer-reviewed, must conform to international editorial standards (such as the LATINDEX criteria), and that the majority of their text must consist of scientific articles. In keeping with this, for this study we only use peer-reviewed journals, and do not include any bulletins or science popularization magazines. A few studies have scratched the surface of electronic publishing practices in Latin America. Dias [6] studied a number of Brazilian journals for their use of hypertext and search engines as a satisfactory
implementation of the inherent possibilities of the electronic medium. Marcondes et al. [7] wrote a descriptive study of Brazilian electronic journals, focusing on technical aspects such as the electronic text formats used, the availability of a site search engine and whether the journal belonged to a portal, showing that metadata were little known by Brazilian editors and that features such as interactivity, hypertext and multimedia were almost never used. They conclude that Brazilian journals resemble journals from other parts of the world, in that issues are designed following the print-only model, delivering the Internet version as a virtual photocopy of the document, in want of “more professionalism from the editors”. Another significant case in Latin America was presented at ELPUB 2006 by Muñoz, Bustos and Muñoz [8], who studied the Chilean Electronic Journal of Biotechnology and described the journal's innovative features in terms of usability of the website, speed and efficiency, use of metadata, adoption of the DOI system, and use of CrossRef as a citation linking system. Mayernik [2] wrote the most comprehensive study in this field. He used four specific dimensions to evaluate the journals: (1) non-linearity of the document, (2) external links to the documents quoted in the article (be it in the main body or in the references), which he refers to as “multiple use”, (3) multimedia use in the articles or on the website, and (4) interactivity with the readers, in the form of forums or other two-way communications. (Mayernik also studied a fifth dimension, speedy publication, which we will not consider here.) These characteristics are deemed innovative in their use of the technical possibilities of the web, and a valuable addition to the overall experience of the reader/user. However, as some authors have explained (Harnad [9], Tenopir & King [10]), these qualities are not fully exploited by the editors, and the journals' full potential is not achieved. These four dimensions are anything but casual. Hitchcock et al. [4] called the emergence of hyperlinking “the second frontier”. Harnad studied the benefits of hyperlinks as early as 1992 [11], and Lukesh [12] has explained how multimedia options “play a major role in the similar explosion we are undergoing today as they become tools in developing knowledge rather than simple illustrations”. A number of repositories and virtual collections (such as the Brazilian SciELO and the Mexican REDALyC, for example1), usually associated with universities, have played a significant role in pushing forward the digitization of scholarly journals in the region, particularly within the Open Access model. These solutions have emerged as a way to focus whatever resources become available and apply them to a number of journals at the same time, and they have become valuable tools for providing visibility and web presence to the scientific production of Latin America, providing data on how this information is used and quoted within the local scientific community, and on how a field of knowledge evolves in the region. In this study, however, we will focus on the specific website of each of the journals, and see what solutions are being used by individual journals and their editorial boards.
We will try to determine how “innovative” electronic journals in Latin America are, where innovation is understood as the exploitation and application of Internet resources, tools and programs that can improve the user→journal and user→user communication processes and add competitive value to the journal. We understand that the web offers these possibilities, but that they are not fully taken advantage of by Latin American editors, or indeed by editors around the world (Harnad [9], Tenopir & King [10]). Our first objective is to analyze the e-journals of Latin America, in order to (a) determine the degree to which they are copies of their printed counterparts, and (b) discover the specific Internet tools and resources that editors are using to improve their journals. Our second objective is to determine how these journals fare when compared to e-journals from other parts of the world that can be regarded as innovative and technically advanced.
2. Methodology
We selected peer-reviewed journals, with open access to their articles and at least 40% scientific content,
and that are published independently, not as part of a larger collection site (such as SciELO or REDALyC) that might have caused their content to conform artificially to different publishing standards. Based on the Mayernik [2] characteristics, we chose to study: (1) non-linearity, the ability to jump from one part of the article to another as the user wishes; (2) multimedia, the use of audio or video to enhance the user's experience; (3) multiple use, the existence of links to the full text of the documents quoted or referred to in the article; and (4) interactivity, the existence of tools that can provide interaction between the editors, the authors, and the readers of the journals. Mayernik described a fifth characteristic, speedy publication, but we will not consider it among the objectives of our study. Additionally, we also analyzed the use of criteria 34, 35 and 36 of the LATINDEX e-journal criteria: use of metadata, use of search engines, and use of “added value services”, such as links to external resources, documents relevant to the readership, forms of interaction, etc.2 To evaluate the journals, we used the Electronic Resources Index of the LATINDEX website. We chose the 12 countries that have more than 10 journals in the directory: Argentina, Brazil, Chile, Colombia, Costa Rica, Cuba, Ecuador, Mexico, Peru, Puerto Rico, Uruguay and Venezuela. Within these groups, we randomly chose 10% of the journals. While the original sample included 167 journals, it also included bulletins, non-peer-reviewed journals and science education magazines. After accounting for this we arrived at a final sample size of 125 journals (see Appendix 1), representative of the journals within LATINDEX's electronic directory with an estimated sampling error of 7.5% (a rough cross-check of this figure is sketched at the end of this section). The journals had on average 8.95 ± 5.8 articles per issue, so we randomly chose one third of the articles (3 for every issue) to perform the analysis, arriving at a final sample of 375 articles. The multimedia, interaction and LATINDEX-34-35-36 criteria were evaluated at the journal level (using the 125 journals as the sample), and the non-linearity and multiple-use criteria were evaluated at the article level (using the 375 articles as the sample). For the LATINDEX criteria, a journal scores one point if it meets all three criteria, 0.66 if it meets two, and 0 if neither metadata, a search engine nor added services are present. For the multimedia criteria, there are two individual criteria: (1) presence of video features and (2) presence of audio features. The presence of both features is worth one point, the presence of only one feature 0.5 points, and the absence of both features zero points. For interaction, the four criteria are: (1) presence of a contact form that a user can use to write to the editors, (2) presence of some means of communication between the reader and the author, or some expert in the field, (3) presence of some means of communication amongst the readers, and (4) use of alert features, such as e-mail subscriptions or RSS news feeds. A journal scores one point if it meets all criteria, 0.75 if it meets three, 0.5 if it meets only two, 0.25 if it meets just one, and 0 for no compliance with any criterion.
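The journal-level scores described in the previous paragraph reduce to simple proportions of criteria met. The sketch below is a minimal illustration of that scheme; how a journal meeting exactly one LATINDEX criterion is scored is not spelled out in the text, so the proportional value used here (one third) is an assumption.

```python
# Minimal sketch of the journal-level scoring described above.

def latindex_score(metadata: bool, search_engine: bool, added_services: bool) -> float:
    # All three criteria -> 1.0; two -> 0.67 (the paper quotes 0.66, i.e. two thirds);
    # exactly one criterion is assumed here to score one third; none -> 0.0.
    return round(sum([metadata, search_engine, added_services]) / 3, 2)

def multimedia_score(has_video: bool, has_audio: bool) -> float:
    # Both features -> 1.0; only one -> 0.5; none -> 0.0.
    return sum([has_video, has_audio]) / 2

def interactivity_score(contact_form: bool, author_contact: bool,
                        reader_to_reader: bool, alerts: bool) -> float:
    # 0.25 points per criterion met, up to 1.0.
    return sum([contact_form, author_contact, reader_to_reader, alerts]) / 4

# Example: a journal with metadata and a search engine, no multimedia,
# and only an e-mail alert service.
print(latindex_score(True, True, False))                 # 0.67
print(multimedia_score(False, False))                    # 0.0
print(interactivity_score(False, False, False, True))    # 0.25
```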
In addition to the whole-corpus counts, we also analyzed the corpus of journals and articles along two variables: country of the journal, and subject (Social Sciences, Medical Sciences, Agricultural Sciences, Exact and Natural Sciences, Multidisciplinary, Arts and Humanities, and Engineering). Finally, we chose a small intentional sample of e-journals from other regions of the world that make extensive use of the characteristics we studied, and compared them against a selection of journals chosen as “top in their class” by three LATINDEX officials3. We evaluated the non-Latin American journals using the same parameters as the Latin American ones, and proceeded to compare them. Our research design is exploratory, representative of a major collection of Latin American peer-reviewed journals. This paper being of an exploratory nature, we chose α = 0.1 for statistical comparisons. We used Microsoft Access 2003 for database keeping, Adobe Acrobat Reader 8 for PDF analysis, and Internet Explorer 6, Internet Explorer 7, Mozilla Firefox 2.0 and Opera 9.27 for Internet browsing within the Microsoft Windows XP Service Pack 2 operating system. We used the software JMP 7 for statistical analysis.
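As a rough cross-check of the 7.5% sampling error quoted above, the sketch below applies the usual margin-of-error formula for a proportion with finite-population correction. The population size (the 2,490 electronic journals in the directory), the 90% confidence level (chosen to match α = 0.1) and p = 0.5 are assumptions, since the paper does not state which convention was used; the result is only expected to be of the same order as the reported figure.

```python
from math import sqrt

N = 2490     # electronic journals in the LATINDEX directory (assumed population)
n = 125      # journals in the final sample
p = 0.5      # worst-case proportion
z = 1.645    # ~90 % confidence, consistent with alpha = 0.1

fpc = sqrt((N - n) / (N - 1))                      # finite-population correction
margin = z * sqrt(p * (1 - p) / n) * fpc
print(f"estimated margin of error: {margin:.1%}")  # close to the 7.5 % quoted in the text
```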
3. Results
3.1 General information about the corpus
We examined 125 journals from 12 different countries, using the most current issue as the point of reference for the comparative analysis. The following table describes the sample's age and use of computer formats:

Year of publication of the current issue (125 journals):
  2008: 31 (25%)    2007: 56 (45%)    2006: 20 (16%)    2005: 6 (5%)    Prior to 2005: 12 (9%)

Format of the articles in the latest edition (375 articles):
  PDF: 312 (83%)    HTML: 102 (27%)    Both PDF and HTML: 42 (11%)

Table 1. Age and format of the corpus.

3.2 Non-linearity

We defined four possibilities for intra-document links: (1) navigational links to jump between sections of the document, (2) links to footnotes or notes at the end of the document, (3) navigational and footnote links combined, and (4) links to reference or bibliography items. Mayernik uses our possibility (3) as his measure for non-linearity.
                                             Articles that contained   Average (links)
Navigational links                           20 articles (5.3%)        11.2 ± 11.8
Footnote links                               33 articles (8.8%)        16.4 ± 16.7
Navigational and footnote links combined     51 articles (13.6%)       15.0 ± 15.2
Links to references or bibliography          14 articles (3.7%)        32.6 ± 28.9
Table 2. Non-linearity across the corpus.

Given the large variability in the corpus, it is no surprise that the standard deviation is larger than the average in most categories. Only for the links to references is the variation narrow enough to say that each such article has at least 3 links to the bibliographic references. Table 3 describes the situation in more detail, breaking it down by country. We also compared how the use of the HTML and PDF formats affected the use of internal links, and found that there is indeed a significant difference: when the HTML format is used (alone or in combination with PDF), there is a significantly higher number of internal links within the document, both navigational and to the references.
3.3 Use of multimedia
We found that only 2.5% of the journals use audio resources, and only 3% of them post videos on their websites. On average, the use of multimedia in the journals analyzed equaled 0.03 points. There are no significant differences in the use of multimedia either by country or by subject.
3.4 Multiple use
We subdivided the multiple use category into four different areas: (1) external links embedded in the body of the article, (2) external links in the reference or bibliography section that lead directly to the text of the reference item, (3) external links in the reference or bibliography section that might lead directly or
indirectly to the text of the reference item, where “indirectly” is understood as “three clicks away or less”, and (4) external links in the reference or bibliography section that lead directly or indirectly to the text, or at the very least to an abstract of the reference item. Mayernik uses our area (2) as his measure for multiple use.
Number of articles and average number of links (count (%), average ± standard deviation):

         Navigational links         Footnote links             Navigational and footnote combined
Total    20 (5.3%)   11.2 ± 11.8    33 (8.8%)   16.4 ± 16.7    51 (13.6%)   15.0 ± 15.2
ARG      0                          5 (10%)     16.8 ± 15.7    5 (10%)      16.8 ± 15.7
BRA      0                          11 (16%)    16.5 ± 21.3    11 (16%)     16.5 ± 21.3
CHL      6 (17%)     18.5 ± 13.2    5 (14%)     13.8 ± 12.3    11 (31%)     16.4 ± 12.4
COL      3 (9%)      3.7 ± 2.1      0                          3 (9%)       3.7 ± 2.1
CRI      0                          3 (17%)     24.7 ± 21.1    3 (17%)      24.7 ± 21.1
CUB      5 (28%)     6 ± 4.8        0                          5 (28%)      6 ± 4.8
ECU      0                          0                          0
MEX      6 (7%)      12 ± 14.4      2 (2%)      5.5 ± 6.4      6 (7%)       13.8 ± 13.4
PER      0                          0                          0
PRI      0                          2 (17%)     8.5 ± 2.1      2 (17%)      8.5 ± 2.1
URY      0                          5 (42%)     21.2 ± 17.3    5 (42%)      21.2 ± 17.3
VEN      0                          0                          0
p        p = 0.01 < 0.1             p = 0.04 < 0.1             p = 0.01 < 0.1

Table 3. Non-linearity by country. (In each cell, the first figure is the number of articles and the percentage of articles in the corpus; the second is the average number of links and its standard deviation.)
                    Combined links (avg)   Reference links (avg)   Signif. (combined)   Signif. (references)
Total               11.2 ± 11.8            16.4 ± 16.7
Both HTML and PDF   6.0 ± 1.1              5.1 ± 1.2               A                    A
Only HTML           5.3 ± 0.9              2.5 ± 1.0               A                    A
Only PDF            0.7 ± 0.4              0.3 ± 0.5               B                    B
                                                                   p < 0.0001 < 0.1     p = 0.002 < 0.1

Table 4. Formats used by the journals and their level of significance. (Columns give the average number of combined navigational/footnote links and of links to references or bibliography, with signification groups per column.)
External links, articles that contained them, and average number of links:
  Originating from the body of the article: 51 articles (14%), 7.4 ± 14.5 links
  From the references, connecting directly to the item: 82 articles (22%), 3.4 ± 3.9 links
  From the references, connecting directly or indirectly to the item: 98 articles (26%), 3.7 ± 4.6 links
  From the references, connecting to the item or at least to its abstract: 98 articles (26%), 4.5 ± 6.1 links

Table 5. Articles with external links.
When examining multiple use, we also recorded which articles had any Internet links at all, and whether those links were in fact marked as clickable hyperlinks or not (they might have been just plain text, which the user couldn’t click on). A total of 143 articles (38%) used one or more Internet references, and from those, only 114 (30%) had all of their links properly marked. Table 6 shows the situation broken down by country.
         Articles with Internet addresses     Signif.            All Internet addresses marked     Signif.
         in the references or bibliography    group *            (clickable) in these articles     group *
Total    143 (38%)                                               114 (30%)
ARG      12 (23%)                             BC                 11 (22%)                          BC
BRA      36 (52%)                             A                  30 (43%)                          A
CHL      17 (47%)                             A                  12 (33%)                          AB
COL      15 (47%)                             A                  15 (47%)                          A
CRI      11 (61%)                             A                  9 (50%)                           A
CUB      7 (39%)                              AB                 5 (28%)                           ABC
ECU      0                                    C                  0                                 C
MEX      26 (29%)                             BC                 23 (26%)                          BC
PER      3 (33%)                              ABC                2 (22%)                           ABC
PRI      7 (58%)                              A                  4 (33%)                           ABC
URY      4 (33%)                              ABC                1 (8%)                            C
VEN      5 (24%)                              BC                 2 (9%)                            C
                                              p = 0.003 < 0.1                                      p = 0.006 < 0.1

Table 6. Use of Internet references by authors, and correct marking of Internet references as hyperlinks.
         External links (from the references) connecting      Signification
         directly or indirectly to the item                    groups *
Total    98 articles (26%)    3.7 ± 4.6 links
ARG      10 articles (20%)    4.0 ± 3.2 links                  C
BRA      27 articles (39%)    3.2 ± 3.7 links                  C
CHL      13 articles (36%)    3.4 ± 4.6 links                  AB
COL      10 articles (31%)    3.1 ± 3.4 links                  C
CRI      8 articles (44%)     4.2 ± 3.7 links                  BC
CUB      2 articles (11%)     1.0 ± 0.0 links                  C
ECU      0                                                     BC
MEX      16 articles (18%)    4.0 ± 6.2 links                  C
PER      3 articles (33%)     11.0 ± 13.9 links                A
PRI      5 articles (42%)     2.2 ± 0.8 links                  BC
URY      1 article (8%)       6.0 ± 0.0 links                  BC
VEN      3 articles (14%)     3.3 ± 1.5 links                  C
                                                               p = 0.098 < 0.1

Table 7. External links connecting directly or indirectly to the text of a reference item, broken down by country.
Of all the countries in Table 6, Costa Rica, Puerto Rico, Brazil, Chile and Colombia show the richest use of Internet references by authors (p = 0.003 < 0.1), but only Costa Rica, Colombia and Brazil match this with a thorough marking of those links (p = 0.006 < 0.1), and only in the case of Costa Rica do at least 50% of all articles have their links thoroughly marked as hyperlinks. Table 7 describes the situation for links leading directly or indirectly to the cited text, where we found a significant difference by country: Chile and Peru appear to make the most use of links in the reference sections of their articles (p = 0.098 < 0.1). When broken down by subject, we found that the total number of links from the references (direct, indirect and to abstracts) was significantly different. Table 8 shows that Natural and Exact Sciences and Agricultural Sciences journals used more links on average than the rest of the subjects (p = 0.04 < 0.1).
                             External links connecting to the item       Signification
                             or at least to its abstract                 groups *
Total                        98 articles (26%)    3.7 ± 4.6 links
Arts and Humanities          5 articles (14%)     2.2 ± 1.8 links        C
Agricultural Sciences        8 articles (33%)     4.0 ± 3.5 links        ABC
Engineering                  10 articles (42%)    2.4 ± 1.6 links        BC
Natural and Exact Sciences   8 articles (27%)     10.5 ± 12.2 links      A
Medical Sciences             14 articles (19%)    2.1 ± 2.0 links        C
Social Sciences              47 articles (28%)    5.3 ± 6.3 links        B
Multidisciplinary            6 articles (33%)     1.5 ± 0.8 links        BC
                                                                         p = 0.04 < 0.1

Table 8. External links connecting to the item or at least to its abstract, broken down by subject.

3.5 Interactivity

To analyze the possibilities offered by interactivity, we broke the category down into four different criteria: (1) presence of a contact form that a user can use to write to the editors, (2) presence of some means of communication between the reader and the author, or some expert in the field, (3) presence of some means of communication amongst the readers, and (4) use of alert features, such as e-mail subscriptions or RSS news feeds. The results for the 125 journals are summarized in Table 9. We can see that Cuba, Brazil and Chile emerge as the most solid competitors in this area (p = 0.022 < 0.1). Cuban journals make good use of rich websites to foster communication between the readers and the authors, mostly in the medical journals. Brazilian journals lead in offering alerts and “new issue warnings” to users (probably due to the common adoption of the OJS platform), and Chilean journals very frequently offer forms to the users (as opposed to simply e-mail addresses) to ask questions or relay opinions to the journal's editors. In total, we found that: (1) in the category of user→journal interaction, 17% of the journals offer contact forms, 5% offer ways to contact experts, and 17% offer alerts and news in some form; (2) in the category of user→user interaction, only 5% of the journals offer forums, discussion boards, or any other way for the readers to share information or reply to articles.
3.6 Latindex evaluation criteria
After calculating the points for the LATINDEX criteria, the following results emerged. The average LATINDEX score is 0.48 points, with 35% of the journals using metadata, 50% using search engines, and 58% using “added services”. Figure 1 describes which added services are most common in our sample, and Table 10 breaks down the LATINDEX score by country. Costa Rica and Brazil are the countries with the highest LATINDEX scores (0.83 and 0.62 respectively; p = 0.07 < 0.1).

Country   Total journals   Crit. 1 (journals)   Crit. 2 (journals)   Crit. 3 (journals)   Crit. 4 (journals)   Average points   Signif. group
Total     125              21 (17%)             6 (5%)               5 (4%)               21 (17%)             0.11
ARG       17               3 (17.6%)            0                    1 (5.9%)             5 (29.4%)            0.13             BCD
BRA       23               4 (17.4%)            0                    0                    11 (47.8%)           0.16             B
CHL       12               5 (41.7%)            0                    0                    2 (16.7%)            0.15             BC
COL       11               2 (18.2%)            0                    1 (9.1%)             0                    0.07             CDE
CRI       6                1 (16.7%)            1 (16.7%)            0                    0                    0.08             BCDE
CUB       6                4 (66.7%)            3 (50%)              0                    0                    0.29             A
ECU       2                0                    0                    0                    0                    0                BCDE
MEX       30               1 (3.3%)             2 (6.7%)             2 (6.7%)             1 (3.3%)             0.05             E
PER       3                1 (33.3%)            0                    0                    1 (33.3%)            0.17             ABCDE
PRI       4                0                    0                    0                    0                    0                DE
URY       4                0                    0                    0                    1 (25%)              0.06             BCDE
VEN       7                0                    0                    1 (14.3%)            0                    0.04             CDE
                                                                                                               p = 0.022 < 0.1

Table 9. Interactivity, broken down by country. Crit. 1: presence of a contact form for user→journal interaction. Crit. 2: presence of some means of communication between the user and the author or some expert in the field. Crit. 3: presence of some scheme of user→user communication. Crit. 4: use of alerts for the users.
         Latindex score (0-1)   Metadata usage (journals)   Search engine usage (journals)   Signif.: score   Signif.: metadata   Signif.: search engine
Total    0.48                   44 (35%)                    62 (50%)
ARG      0.55                   9 (53%)                     7 (41%)                          BC               AB                  B
BRA      0.62                   10 (43%)                    16 (70%)                         AB               BC                  A
CHL      0.39                   3 (25%)                     6 (50%)                          C                BCD                 AB
COL      0.33                   1 (9%)                      4 (36%)                          C                D                   B
CRI      0.83                   5 (83%)                     5 (83%)                          A                A                   A
CUB      0.56                   3 (50%)                     3 (50%)                          ABC              ABC                 AB
ECU      0.17                   1 (50%)                     0                                C                ABCD                B
MEX      0.38                   7 (23%)                     11 (37%)                         C                CD                  B
PER      0.56                   2 (67%)                     2 (67%)                          ABC              ABC                 AB
PRI      0.25                   0                           1 (25%)                          C                D                   B
URY      0.50                   2 (50%)                     1 (25%)                          ABC              ABCD                B
VEN      0.43                   1 (14%)                     6 (86%)                          BC               CD                  A
         p = 0.07 < 0.1 (score); p = 0.03 < 0.1 (metadata); p = 0.09 < 0.1 (search engine)

Table 10. LATINDEX evaluation criteria, broken down by country.
3.7 Comparison between Latin American and non-Latin American journals
For the comparison, we intentionally chose nine journals from Latin America and five from other parts of the world, based on their reputation for innovative use of Internet resources2. Within this small sample, we found that the non-Latin American options offer significantly more non-linearity (p = 0.005 < 0.1), interactive features (p = 0.005 < 0.1), use of multimedia (p = 0.04 < 0.1) and linking to external documents (p = 0.007
< 0.1). However, we did not find that the two groups were significantly different in their LATINDEX score.

Figure 1. Types of added resources used by Latin American journals (% of journals that use the resource): links to external resources 53%; issue warnings (e-mail, RSS) 24%; relevant resources or documents 15%; thematic indexes of the articles 13%; send the article to a third person 10%; user→user interaction (forums) 9%; contact with authors or experts 5%; others 5%.
Journals              Interactivity         Multimedia use       LATINDEX score
Non-Latin American    0.55 ± 0.41 pts.      0.40 ± 0.12 pts.     0.87 ± 0.13 pts.
Latin American        0.14 ± 0.18 pts.      0.05 ± 0.09 pts.     0.67 ± 0.33 pts.
                      p = 0.005 < 0.1       p = 0.04 < 0.1       p = 0.24 > 0.1

Journals              Navigational links    External links in    External links in the references
                                            the references       (including links to abstracts)
Non-Latin American    30.1 ± 5.2 links      4.2 ± 5.2 links      49.6 ± 12.9 links
Latin American        10.9 ± 18.2 links     1.3 ± 0.7 links      4.0 ± 8.1 links
                      p = 0.005 < 0.1       p = 0.02 < 0.1       p = 0.007 < 0.1

Table 11. Compared situation of a small sample of Latin American and non-Latin American journals.
4. Discussion
The data obtained indicate that the use of Internet-related tools and technologies is not very widespread in the Latin American region. While there is a certain presence of multiple use within the articles (about 26% of all articles evaluated had HTML links in their references or bibliography), non-linearity and interaction are very seldom present in the journals, and multimedia functions are almost nonexistent. Only 13.6% of all articles are non-linear, with an average of 15.0 ± 15.2 navigational links for the articles that are non-linear. Uruguay and Chile appear to stand out in this category (p = 0.01 < 0.1; see Table 3), with 42% and 31% respectively of their articles having non-linear links. From the entire corpus, a mere 3.7% of all articles have links to references or bibliography items; this appears to be one of the least sought-after features across Latin American publishing. For both types of non-linear links (navigational and directed to references), the use of HTML publishing is decisive in raising awareness and ultimately the frequency of links (p ≤ 0.002 < 0.1; see Table 4). PDF-only publishing might be keeping editors from fathoming the possibilities of non-linearity, which might be the cause of these results.
Multimedia was the least frequent of all the characteristics. Only 3 journals have audio features, and only 4 journals have some form of video. Three of the more noteworthy cases were: (i) Actualidades Investigativas en Educación (Costa Rica), where QuickTime audio and video are used in one of its articles; (ii) La pintura mural prehispánica en México (Mexico), which offers online Flash videos produced by its parent research unit; and (iii) Revista de Enfermedades Infecciosas en Pediatría (Mexico), where audio interviews present its authors and other researchers discussing current issues.

Multiple use was the characteristic that fared best in our corpus. About a quarter of the articles (26%) had links that departed from the references and landed either on the document or on a page with the abstract of the original article. Mayernik only considered direct links, meaning links that landed on the text of the document cited. In our corpus, 22% of the articles had this kind of direct link in their reference sections, and 14% of the articles had such links within the body of the text. In total, 38% of the articles did cite Internet references, with Costa Rica, Puerto Rico, Brazil, Chile and Colombia as the locations where these references are most common among authors (p = 0.003 < 0.1; see Table 6), and Chile and Peru as the countries where those references are most likely to include a link leading directly to (or within 3 clicks of) the cited text (p = 0.098 < 0.1; see Table 7). The fields of Natural Sciences and Agricultural Sciences presented the highest frequency of links in the references (p = 0.04 < 0.1; see Table 8), which might be influenced by the situation described by Cronin [13]: “publishing practices differ; for example, disciplines such as molecular biology follow a pattern characterized by a large number of relatively short papers with joint authorship, frequently appearing in highly cited journals”.

While looking into the multiple use functionality, we discovered a widespread and potentially serious problem of mismarking in the links leading to external documents. In our sample, only 51% of all potential links were well marked (leading to any Internet page at all when clicked on). Of the remaining 49%, 10% were broken links, 15% were completely unmarked links (appearing only as text), and a full 24% were misspelled or incomplete. The most common problem occurred when marking the reference items: since URL addresses must fit within the page layout, the longer links get “broken in two”, so that the two parts sit on different lines. The paragraph looks very orderly, but the automatic marking cannot recognize the second part of the address and marks only the first section. When this happens, the address gets “cut off in the middle”, and the browser cannot possibly find the right page (a minimal repair sketch is given at the end of this section). While the problem of link morbidity in scholarly writing still needs to be addressed, we believe this apparently simple problem should also be taken into account.

In the interactivity section, the most common forms of interaction offered by the journals were user→journal, in the form of question forms (17% of journals) and new-issue warnings (17% of journals). Some journals, however (5%), had ways to provide user→user interaction, through user forums and systems of response to published articles, so that the readers could participate in the discussion.
E-mail-only contact with the journal continues to be the norm, with Cuba as the most salient exception for the user interaction available in its medical journals (p = 0.02 < 0.1; see Table 9). As for the LATINDEX characteristics, about 35% of the journals have metadata within their pages, and 50% offer search engines to their users. Costa Rica and Brazil get the most points across the three LATINDEX categories (p = 0.07 < 0.1; see Table 10): 83% of Costa Rican journals offer metadata on their websites, and 83% use search engines for the sites' contents. The average LATINDEX grade for Costa Rican journals was 0.83, while the average for Brazilian journals was 0.62. When comparing the results of the non-Latin American journals with the Latin American ones, the differences are quite obvious. Every single balance tips in favor of the non-Latin American journals (and not only in quantity; the quality of the navigational linking, for example, is particularly noticeable). The only exception was in the case of the LATINDEX characteristics, where there is no significant difference.
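The “broken in two” URLs described above are, in principle, detectable and repairable before links are marked. The snippet below is a minimal, purely illustrative sketch of one such repair pass: the reference text is a made-up example and the rejoining heuristic (a URL fragment ending in “/” at a line break) is deliberately crude, not a description of any journal's actual workflow.

```python
import re

# Made-up reference text in which a long URL has been wrapped across two lines
# by the page layout; automatic marking would only link the first fragment.
text = """[4] HITCHCOCK, S. et al. Linking electronic journals. Available from: http://www.dlib.org/dlib/
december98/12hitchcock.html
[5] KLING, R. & McKIM, G. Scholarly Communication and the Continuum of Electronic Publishing."""

def rejoin_split_urls(raw: str) -> str:
    """Glue a wrapped URL back together when the previous line ends with a URL
    fragment ('http...' ending in '/') and the next line does not start a new
    reference item such as '[5]'."""
    out = []
    for line in raw.splitlines():
        previous = out[-1] if out else ""
        if previous.rstrip().endswith("/") and "http" in previous \
                and not re.match(r"^\s*\[\d+\]", line):
            out[-1] = previous.rstrip() + line.strip()   # rejoin the two halves
        else:
            out.append(line)
    return "\n".join(out)

print(rejoin_split_urls(text))
```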
5. Conclusions
At this point, we can conclude that the situation of the average scientific electronic journal from Latin America does not really differ from that studied by Mayernik [2] or from the results obtained by Marcondes et al. [7]; that is, the journals from this part of the world, “as many international e-journals, are still designed based on paper journals models. They incorporate few of those technological facilities”. Very few journals use formats and techniques that can take full advantage of the possibilities offered by Internet tools. Many journals take a great deal of care to offer a presentable “cover page”, with a very flexible and non-linear entry page. However, those efforts wilt as the user approaches the article pages, until the article's text becomes a copy of the original printed format. Traditionally, editors have thought that their articles have only one audience: human beings. An attractive presentation will surely play a role in a diligent editor's duty. However, a good visual layout will be impenetrable to what has become a second audience for the articles: computer systems such as robots and spiders that crawl the article in search of usable links, hoping to weave valuable connections between science web pages throughout the world. These two audiences (humans and web-exploration software) complement each other, and both demand attention from the editor. Creating awareness about this problem might be the only way forward. While the present situation of the journals of Latin America is not the best of all possible worlds, the adoption of basic Internet tools such as metadata and search engines is in itself no small feat. The LATINDEX network of associates constantly monitors the use of these features in each country and has campaigned among local editors to create awareness about them, which might have helped to reach the results that we see today. In spite of the low scores in the Mayernik categories, there was no significant difference in the use of metadata and search engines between the “top Latin American journals” and the “top non-Latin American journals” we studied in Table 11. Superficial as this comparison might be, it does speak to the achievements that electronic publishing has reached in this group of countries. In countries like Chile, Costa Rica, Colombia, Brazil and Mexico, we found examples of good journals that are well prepared to compete in the global arena. Five countries stand out from the corpus: Brazil, Chile, Costa Rica and Cuba (for their relatively good scores in all of the characteristics), and Mexico (for its incorporation of multimedia into its publishing practices). Every country has its peculiarities. As part of the informal BRIC bloc, Brazil has been hard pressed to improve the quality of its scientific output. Both Chile and Costa Rica fare well in the Global Competitiveness Report [14] (first and fifth place respectively among countries and territories in Latin America) and the Human Development Index [15] (second and fourth place respectively). Cuba is reputed in the region for “good old resourcefulness” in the face of economic difficulties, but also for a very strict and vertical research culture. Mexico's UNAM is the only university in the region in the top 200 universities of the QS Ranking, followed within the top 250 by Chile's Pontificia Universidad Católica [16].
In spite of these varying conditions, the one thread that ties these countries together is that they have the largest investment in research and development in the region relative to their GNP: Brazil, 0.83%; Chile, 0.68%; Cuba, 0.56%; Mexico, 0.46%; and Costa Rica, 0.41% [17]. This has allowed them to direct much-needed resources towards raising awareness about their scientific communication processes, a fact that appears to be reflected in the data we have obtained. Determining the exact extent to which funding, funding models, and the availability of materials at the individual journal level truly influence the results of this investigation is a question that calls for future study. Marcondes et al.'s [7] suggestion of a “lack of professionalism” is certainly a strong statement that has little or nothing to do with funds; Costa, Silva and Costa [18] point to a lack of computer literacy as the culprit. Some editors might consider that “just being on the Internet” is added value enough and that there is no need to improve or work on that online presence. Yet
another possibility is that they consider that the addition of links might be “baffling” to the user, where the “user” is still the narrow vision of humans as the only consumers of their information. The situation hints at a complex interaction between the availability of funds and the willingness to “think outside the box”, and more research is needed to understand the attitudes of editors towards electronic publishing.
6. Acknowledgements
We would like to thank the Vice-Rectory for Research of the University of Costa Rica (www.vinv.ucr.ac.cr) and the Incentive Funds Commission of the Ministry of Science and Technology of Costa Rica (www.micit.go.cr), as well as Marcela Alfaro for acting as consultant for the statistical analyses.
7. Notes

1. Further examples include the Mexican E-Journal at UNAM and the Costa Rican Latindex-UCR at the University of Costa Rica.
2. There is indeed overlap between the “added services” category in LATINDEX and the interactivity characteristic of Mayernik. Figure 1 describes the specific added services found in the sample, and can be contrasted with Table 9 for the differences between the two.
3. In this case, we asked the LATINDEX partners who have the biggest groups of online journals in the Directory. From them, we got three different answers: Mexico, Colombia and Costa Rica, but our experience begged us to also include journals from Chile and Brazil. We chose the following non-Latin American journals: Journal of Electronic Publishing, British Medical Journal, Behavioral and Brain Sciences, PLoS Medicine and CTheory. As for the Latin American journals, we selected: Online Brazilian Journal of Nursing, Revista Eletrônica de Estudos Hegelianos, Colombia Médica, Livestock Research For Rural Development, Revista E-mercatoria, e-Gnosis, Aleph Zero, Cinta de Moebio and the Electronic Journal of Biotechnology.
8. References

[1] MOGHADDAM, G.G. Archiving Challenges of Scholarly Electronic Journals: How do Publishers Manage them? Serials Review, 2007, vol. 33, no. 2 [cited on May 5, 2008], p. 2. Available from: http://eprints.rclis.org/archive/00011175/01/Archiving_Challenges_of_Scholarly_Electronic_Journal_How_do_publishers_manage_them.pdf
[2] MAYERNIK, N. The Prevalence of Additional Electronic Features in Pure E-Journals. Journal of Electronic Publishing, Fall 2007, vol. 10, no. 3 [cited on May 5, 2008]. Available from: http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;cc=jep;rgn=main;view=text;idno=3336451.0010.307, accessed on 15-01-2008.
[3] VALUSKAS, E. Waiting for Thomas Kuhn: First Monday and the Evolution of Electronic Journals. Journal of Electronic Publishing, September 1997, vol. 3, no. 1 [cited on May 5, 2008]. Available from: http://quod.lib.umich.edu/cgi/t/text/textidx?c=jep;cc=jep;q1=3336451.0003.1%2A;rgn=main;view=text;idno=3336451.0003.104, accessed on 01-15-2008.
[4] HITCHCOCK, S., CARR, L., HARRIS, S., HALL, W., PROBETS, S., EVANS, D., BRAILSFORD, D. Linking Electronic Journals. D-Lib Magazine, December 1998. Available from: http://www.dlib.org/dlib/december98/12hitchcock.html#note1, accessed on 10-12-2007. ISSN 1082-9873.
[5] KLING, R. & McKIM, G. Scholarly Communication and the Continuum of Electronic Publishing. Journal of the American Society for Information Science, 1999, vol. 50, no. 10, pp. 890-906. Available from: http://arxiv.org/abs/cs/9903015v1, accessed on 03-01-2008.
[6] DIAS, G.A. Periódicos eletrônicos: considerações relativas à aceitação deste recurso pelos usuários. Ciência da Informação, 2002, vol. 31, no. 3, Sept.-Dec. [cited on May 2, 2008]. Available from: http://www.scielo.br/scielo.php?script=sci_abstract&pid=S010019652002000300002&lng=en&nrm=iso&tlng=pt. Doi: 10.1590/SO100-19652000300002.
[7] MARCONDES, C.H.; SAYÃO, L.F.; MAIA, C.M.R.; DANTAS, M.A.R.; FARIA, W.S. State-of-the-art of Brazilian e-journals in science and technology. In: International Conference on Electronic Publishing, 8, Brasilia, D.F., Brazil, 23-26 June 2004 / Edited by Jan Engelen, Sely M. S. Costa, Ana Cristina S. Moreira. Universidade de Brasília, 2004 [cited on April 4, 2008]. Available from: http://elpub.scix.net/cgi-bin/works/Show?079elpub2004.
[8] MUÑOZ, G., BUSTOS-GONZÁLEZ, A. & MUÑOZ-CONEJO, A. Sharing the Know-how of a Latin American Open Access only e-journal: The Case of the Electronic Journal of Biotechnology. In: ELPUB2007. Openness in Digital Publishing: Awareness, Discovery and Access - Proceedings of the 11th International Conference on Electronic Publishing held in Vienna, Austria, 13-15 June 2007 / Edited by Leslie Chan and Bob Martens, 2007 [cited on February 20, 2008], pp. 331-340. Available from: http://elpub.scix.net/cgi-bin/works/Show?133_elpub2007. ISBN 978-3-85437-292-9.
[9] HARNAD, S. Post-Gutenberg Galaxy: The Fourth Revolution in the Means of Production of Knowledge. 1991 [cited on December 10, 2008]. Available from: http://users.ecs.soton.ac.uk/harnad/Papers/Harnad/harnad91.postgutenberg.html.
[10] TENOPIR, C. & KING, D. Designing Electronic Journals With 30 Years of Lessons from Print. Journal of Electronic Publishing, April 2002, vol. 7, no. 3 [cited on January 15, 2008]. Available from: http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;cc=jep;q1=Electronic%20journals;q2=Scholarly%20journals;op2=and;op3=and;rgn=main;view=text;idno=3336451.0007.303.
[11] HARNAD, S. Interactive publication: Extending the American Physical Society's discipline-specific model for electronic publishing. Serials Review, 1992, special issue [cited on March 15, 2008]. Available from: http://cogprints.org/1688/0/harnad92.interactivpub.html.
[12] LUKESH, S. Revolutions and Images and the Development of Knowledge: Implications for Research Libraries and Publishers of Scholarly Communications. Journal of Electronic Publishing, April 2002, vol. 7, no. 3 [cited on May 5, 2008]. Available from: http://hdl.handle.net/2027/spo.3336451.0007.303.
[13] CRONIN, B. The Hand of Science. Lanham, Maryland: Rowman and Littlefield, 2005. ISBN 0-8108-5282-9.
[14] WORLD ECONOMIC FORUM. Global Competitiveness Report 2007-2008. 2007 [cited on May 6, 2008]. Available from: http://www.weforum.org/pdf/Global_Competitiveness_Reports/Reports/gcr_2007/gcr2007_rankings.pdf.
[15] UNITED NATIONS DEVELOPMENT PROGRAMME. 2007/2008 Human Development Index rankings. 2008 [cited on May 6, 2008]. Available from: http://hdr.undp.org/en/statistics/.
[16] TIMES HIGHER EDUCATION. QS World University Rankings 2007 - Top 400 Universities. 2007 [cited on May 6, 2008]. Available from: http://www.topuniversities.com/worlduniversityrankings/results/2007/overall_rankings/top_400_universities/.
[17] RICYT. Red de Indicadores de America Latina: Indicadores Comparativos. 2008 [cited on May 6, 2008]. Available from: http://www.ricyt.org/interior/interior.asp?Nivel1=1&Nivel2=2&Idioma=.
[18] COSTA, S., SILVA, W. & COSTA, M. Electronic publishing in Brazilian academic institutions: changes in formal communication, too? In: International Conference on Electronic Publishing, 5, 2001, in the Digital Publishing Odyssey: Proceedings of an ICCC/IFIP conference held at the University of Kent, Canterbury, UK, 5-7 July 2001 / Edited by Arved Hübler, Peter Linde and John W.T. Smith. ISBN 1-58603-191-0. Available from: http://elpub.scix.net/cgi-bin/works/Show?200112, accessed on 03-13-2008.
8. Appendix 1: Journals included in the study
Argentina
Archivos argentinos de alergia e inmunología clínica, AdVersus, Biocell, Contabilidad y auditoría, Dermatología Argentina, Equipo Federal del Trabajo, Foro Iberoamericano sobre Estrategias de Comunicación (FISEC), Hologramática, Journal of Applied Economics, Journal of Computer Science and Technology, Psikeba, Rev. Argentina de Lingüística, Revista de Ciencias Sociales, Rev. de Investigaciones Agropecuarias, Telondefondo, Universitas, Urbe et Ius.
Brazil
Afro Asia, Boletim do Instituto de Pesca, Brazilian Administration Review, Brazilian Journal of Biomotricity, Caderno espaço feminino, Caderno Virtual de Turismo, Contingentia, Data Grama Zero, Economia e Energia, Educação Temática Digital, Engenharia Ambiental, Hegemonia, Klepsidra, Online Brazilian Journal of Nursing, Relações públicas em revista, Revista brasileira de educação médica, Revista Brasileira de Zoologia, Revista de Estudos da Religião, Revista de Gestão da Tecnologia e Sistemas de Informação, Revista Eletrônica de Estudos Hegelianos, Revista Expectativa, Revista Matéria, Semina.
Chile
Agenda Pública, Ciencia y Trabajo, Cinta de Moebio, Cuadernos de Economía, El Vigía (Santiago), Electronic Journal of Biotechnology, Journal of Technology Management and Innovation, Monografías electrónicas de patologia veterinaria, Política Criminal, Rev. Chilena de Ciencia de la Computación, Rev. Chilena de Semiótica, Revista Universitaria.
Colombia
Acta Biológica Colombiana, Colombia Médica, Cuadernos de Administración, Earth Sciences Research Journal, Livestock Research For Rural Development, Nómadas, Rev. Ciencias Humanas, Rev. Latinoamericana de Ciencias Sociales, Niñez y Juventud, Revista EIA Ingeniería Antioquía, Revista E-mercatoria, Revista Escuela Colombiana de Medicina - ECM.
Costa Rica
Actualidades Investigativas en Educación, Diálogos, MHSalud, Población y Salud en Mesoamérica, Reflexiones, Revista de Derecho Electoral.
Cuba
ACIMED, Fitosanidad, Multimed, Revista cubana de investigaciones biomédicas, Revista Cubana de Obstetricia y Ginecología, Revista cubana de pediatría.
Ecuador
Gaceta Dermatológica Ecuatoriana, Universidad-Verdad
Mexico
Acta Médica Grupo Ángeles, Alegatos, Aleph Zero Anales del I.Biología, Serie Zoología, Anuario Mexicano de Derecho Internacional, Archivos Hispanoamericanos de Sexología, Biblioteca Universitaria, Buenaval, Computación y Sistemas, Cuadernos de Psicoanálisis, Dugesiana, Educar, e-Gnosis, El Psicólogo Anahuac, Hitos de Ciencias Económico Administrativas, InFÁRMAte, Investigación Bibliotecológica, Journal of Applied Research and Technology, La pintura mural prehispánica en México, Los amantes de Sofía, Mensaje bioquímico, Nueva Antropología, Redes Música, Rev. Ciencia Veterinaria, Revista Biomédica, Revista de Enfermedades Infecciosas en Pediatría, Revista de la Educación Superior, Revista del Instituto Nacional de Cancerología, Revista Fractal, Revista Mexicana de Física.
Peru
Biblios, Diagnóstico, Escritura y Pensamiento
Puerto Rico
Ceteris Paribus, El Amauta, Rev. Int. Desastres Naturales, Infraestructura Civil, Videoenlace Interactivo.
Uruguay
Actas de Fisiología, Boletín Cinterfor, Boletín del Inst. de Inv. Pesqueras, Galileo.
Venezuela
Acción Pedagógica, Boletín Antropológico, Cayapa, Música en clave, Postgrado, Rev. Ingeniería UC, Revista de la Sociedad Médico-Quirúrgica del Hospital de Emergencia Pérez de León.
Consortial Use of Electronic Journals in Turkish Universities Yasar Tonta; Yurdagül Ünal Department of Information Management, Hacettepe University 06800 Beytepe, Ankara, Turkey e-mail: {tonta, yurdagul}@hacettepe.edu.tr
Abstract
The use of electronic journals has surpassed that of printed journals within the last decade. The consortial use of electronic journals through publishers' or aggregators' web sites is on the rise worldwide. This is also the case for Turkey. The Turkish academic community has downloaded close to 50 million full-text articles from various electronic journal databases since the year 2000. This paper analyzes seven years' worth of journal use data comprising more than 25 million full-text articles downloaded from Elsevier's ScienceDirect (SD) electronic journals package between 2001 and 2007. Some 100 core journals, constituting only 5% of all SD journal titles, satisfied over 8.4 million download requests. The lists of core journals were quite stable, consistently satisfying one third of all demand. A large number of journal titles were rarely used while some were never used at all. The correlation between the impact factors (IFs) of core journal titles and the number of downloads therefrom was rather low. Findings can be used to develop better consortial collection management policies and empower the consortium management to negotiate better deals with publishers.
Keywords: electronic journals; consortial use of electronic journals; core journal titles; Turkish universities; Bradford Law of Scattering; ScienceDirect
1. Introduction
Scientific journals are one of the major information sources of library collections. Currently, some 25,000 refereed journals are published world-wide. Libraries spend about two thirds of their budgets on subscribing to and licensing scientific journals and on making them available online. Consortial agreements signed between libraries and publishers/aggregators enable users to access electronic journals through the Internet. Users can easily download the full texts of articles that appear in thousands of electronic journals. Yet, the great majority of downloaded articles tend to be published in a relatively small number of "core journals" in each field. Those core journals can easily be identified through an analysis of COUNTER-based use data. Studies based on such analyses of empirical journal use data are scarce in Turkey. This paper attempts to identify the core journals most frequently used by Turkish academic users. Our analysis is based on more than 25 million full-text articles downloaded by Turkish universities from Elsevier's ScienceDirect (SD) electronic journal package over a seven-year period (2001-2007), making it perhaps one of the most comprehensive electronic journal use studies carried out on a national scale. The volume of data enables us to identify the core journals as well as to determine their stability over the years. Findings can be used to develop better collection management policies and improve the conditions of the national consortial license for Turkish universities.
2. Universities and Consortium Development in Turkey
As of 2008, the total number of universities in Turkey is 115. Most (85) are state-sponsored. The tertiary education system is governed by the Higher Education Act of 1981. The Higher Education Council (YÖK), consisting of members from universities and outside interests, is the policy-making body for all universities, including private/foundational ones. The selection and admission of students takes place through a national entrance exam administered by a center (ÖSYM) under the authority of YÖK. The total number of students enrolled in the higher education system (including students in distance education and vocational programs) was 2,453,664 in the 2006/2007 academic year. The number of graduate students was rather low: 108,653 master's and 33,711 Ph.D. students. More than half (54.5%) of all undergraduate students study social sciences. The rest study technical sciences (17.3%), math and sciences (10%), medicine (9%), language and literature (4%), agriculture and forestry (3.2%), and arts (2%) [1,2]. The number of faculty in the 2006/2007 academic year was 89,319 (12,773 professors, 6,150 associate professors, 15,844 assistant professors, and 54,562 research assistants, specialists, and others) [3]. The National Academic Network and Information Center (ULAKBIM) of the Turkish Scientific and Technological Research Center (TÜBITAK) was founded in 1996 to set up a national academic and research network and use it as a testbed to share precious information resources among university libraries. In addition, ULAKBIM aimed to provide access to electronic information sources and services by signing national site licenses with publishers on behalf of all Turkish universities. In fact, the first experience of Turkish universities with electronic journals dates back to the second half of the 1990s, following the establishment of ULAKBIM. On November 14, 1997, ULAKBIM organized a day-long meeting for university library directors and their superiors (i.e., vice-rectors overseeing libraries) and presented its views on setting up a consortium for university libraries to cooperate and share electronic resources, as stated in its by-law. In 1998, ULAKBIM offered the first trial databases to universities [4]. However, ULAKBIM's priorities changed due to the financial and administrative difficulties experienced at that time, and the Center was not able to immediately carry out some of its duties (one of which was to set up a consortium) as specified in its foundational by-law. Thus, ULAKBIM could not live up to the expectations of the potential members of the consortium, namely university libraries, in its formative years. Meanwhile, a few university libraries signed joint licensing agreements with publishers in 1999 and 2000. Following this, the Consortium of Anatolian University Libraries (ANKOS) was created in 2001 as a voluntary association run by a nine-member Steering Committee. ANKOS developed the Turkish National Site License (TRNSL) document and member libraries began to sign agreements with publishers to get access to electronic journals and bibliographic databases [5,6]. These initial agreements were "mostly informal subscription arrangements" for printed journals, including access to their electronic copies. In 2004, ANKOS began to sign multi-year consortial licenses for access to the electronic copies of journals (excluding their printed equivalents) [7,8].
Thanks to the indefatigable efforts of ANKOS, several universities, especially newly established ones, provided access to electronic journals for the first time through such licenses. Some universities did not even have any sizable journal collections at that time. As more university libraries joined ANKOS over the years, the number of databases offered and their use have increased tremendously. ANKOS currently has some 90 members, including a few non-university entities. ULAKBIM has also been a member of ANKOS from the very beginning. As of 2008, ANKOS licenses a total of 30 packages of electronic journals and books. Some of those packages are as follows: Blackwell's, Ebrary, Emerald, Gale, Nature Publishing Group, Proquest, Sage, ScienceDirect (SD) ebooks, Wiley Interscience, and journal packages offered by professional associations such as ACM, ACS, ALPSP, and SIAM. The number of licensees for each package ranges between 4 (Elsevier's MD Consult) and 74 universities (Oxford University Press), the average being 24 universities [9].
Apparently it took ULAKBIM longer than anticipated to convince TÜBITAK to allocate resources to provide access to electronic journals and books on a national scale [10]. After a loss of about seven years, ULAKBIM came onto the scene once again in 2005. Having secured funds (apparently) from the European Union (EU), TÜBITAK's Science Council authorized ULAKBIM, in late 2005, to sign national site licenses with publishers covering potentially all universities. This authorization enabled ULAKBIM to make electronic databases available to all Turkish universities and research centers through its National Academic License of Electronic Resources (EKUAL) starting from 2006 [11]. The first package offered to universities on a national scale through ULAKBIM's EKUAL was ISI's (now Thomson Scientific) Web of Science (WoS) [12]. The coverage of EKUAL was expanded in February 2006 to include the training and research centers of public hospitals under the administration of the Ministry of Health. EKUAL currently has 105 member universities and research centers as well as 48 hospitals. As of early 2008, ULAKBIM offers 11 electronic databases to universities and research centers. These databases are as follows: BMJ Clinical Evidence, BMJ Online Journals, CAB, EBSCOHost, Engineering Village 2, IEEE, Journal Citation Reports (JCR), Ovid-LWW, ScienceDirect, Taylor & Francis, and the Web of Science. Some databases are offered to all members (e.g., Thomson Scientific's WoS and JCR databases, and Elsevier's SD) while access to others depends on the number of members requesting it (for instance, almost 90 members requested access to the Engineering Village 2 and IEEE databases while 31 members preferred access to the CAB database). In addition to the above databases, all 48 hospitals get access to the following databases through ULAKBIM's EKUAL: Blackwell Synergy, Embase, ScienceDirect Health Sciences, Springer, The Cochrane Library, Wiley Interscience, and Xpharm [13]. In addition to ANKOS and ULAKBIM, the Association of University and Research Libraries (ÜNAK) has also taken part in consortial licensing of electronic resources, starting from 2001. The ÜNAK-OCLC Consortium provides access to OCLC's databases such as FirstSearch, WorldCat and NetLibrary [14]. The number of licensees ranges between 5 and 24. Non-OCLC databases are apparently outside the realm of the ÜNAK-OCLC Consortium. Some 12 million full-text journal articles and book chapters were downloaded in 2007 by Turkish academics from various databases [15]. Downloads from Elsevier's SD usually constitute more than half the total.
3. Literature Review
Libraries sign agreements with publishers for "big deals". Publishers provide a set of journals as part of the big deal package. In the early days, this approach was embraced readily by university libraries because it was attractive for users to perform a cross-search and get access to the full texts of articles regardless of whether their library had subscribed to a given title earlier. Yet, some of the journal titles provided in big deal agreements are not necessarily the most frequently used ones. Paying license fees for marginal journal titles embedded in the big deals tends to increase the total license fees to a certain extent and limits the choices of libraries, not to mention the possible overlap of journal titles offered by different publishers and aggregators. To support the license fees of the big deals, libraries often have to cut some of their subscriptions to journals that are perhaps used more frequently. Big deals have therefore been increasingly criticized in recent years because of monopoly, price hikes, and the inclusion of journals that may not be at the top of the priority lists of libraries [16,17,18]. Some universities in the United States therefore rejected the big deals and negotiated new agreements with publishers. For instance, Cornell University agreed to identify journal titles from a package and include only those as part of its license agreement with Elsevier [19]. Gatten and Sanville [20] analyzed download data to identify the use patterns of journal titles within a big consortium (OhioLINK). They wondered whether the rarely used journal titles within a consortial big deal package could be dropped from subsequent years' negotiations without undermining the use of one or more consortium members, thereby
making the big deal more cost-effective. They showed that an orderly retreat (i.e., title-by-title elimination of rarely used titles) "based on the ranking of articles-downloaded aggregated across member institutions appears to be a reasonable method to employ if needed. . . . An effective orderly retreat means consortia have the ability to manage a Big Deal based on a 'cost for content' approach." It may sometimes be more economical for a library to pay per view rather than sign a big deal agreement, especially if the use is not that great ([21,22]; see also [17]). It should be noted that big deal publishers seem to be softening their stance on the "all or nothing" approach, and some of them allow libraries to pick the titles they want out of a big deal package. One way for libraries to find out whether the electronic journals they license are used or not is to conduct use studies. Findings of such studies empower library administrators and enable them to develop better collection management policies [23,24]. Studies of cost-benefit analysis are especially noteworthy [25,26,27]. Use analyses based on the SD database of electronic journals are not numerous [28,29,30,31]. In general, core journals satisfied a large percentage of requests [28,29,32,33]. For instance, half the use at the Middle East Technical University (METU), a leading Turkish university, is satisfied by 136 core journals. One third of all journals satisfied 86% of all demand [25, p. 73]. Evans and Peters [22] analyzed the aggregated use of more than 100 business and management journals included in the Emerald Management Xtra (EMX) collection and tested whether the dispersal of some 6.4 million articles downloaded in 2004 by the "big deal" users fitted the "80/20 rule" or Pareto principle. They found that the most frequently used 15 journals satisfied 36.7% of all download requests and that the download data did not conform to the 80/20 rule: 47.4% of journals satisfied 80% of download requests. Aggregated use by the members of the Consortium of University Libraries of Catalonia (CBUC) of, among others, the EMX collection (formerly MCB) between 2001 and 2003 displayed a similar trend: 46.2% of journals satisfied 80% of more than 200 thousand download requests [23]. There are studies that test the relationship between bibliometric indicators such as the journal impact factor (IF), half-life and the total number of citations on the one hand and use (downloads) on the other [24]. Some studies report a statistically significant relationship between use based on bibliometric indicators and use based on download data [25] while others do not [26]. Darmoni, Roussel, Benichou, Thirion, and Pinhas [27] defined a new measure called the "Reading Factor" (RF), "the ratio between the number of electronic consultations of an individual journal and the mean number of electronic consultations of all the journals studied" (p. 323), and compared the RF and IF values for 46 journals. They reported no correlation "between IF and electronic journal use as measured by RF" (p. 325). Although such findings can be used in collection management to some extent, the use of electronic journals cannot be explained by a single factor such as journal IFs or RFs. Bollen, Van de Sompel, Smith and Luce [28] developed a taxonomy of impact measures based on journal usage data that includes frequentist author-based (i.e., IF) and reader-based (i.e., RF) measures as well as structural author-based (i.e., webometrics) and reader-based measures.
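As an aside, the Reading Factor defined above is straightforward to compute; the sketch below illustrates it with invented consultation counts (the journal names and figures are hypothetical, not data from any of the cited studies).

```python
# Illustrative sketch (hypothetical consultation counts): the Reading Factor of
# a journal is its number of electronic consultations divided by the mean
# number of consultations over all journals studied.

consultations = {"Journal A": 1200, "Journal B": 300, "Journal C": 60}

mean_use = sum(consultations.values()) / len(consultations)   # 520.0
reading_factors = {j: n / mean_use for j, n in consultations.items()}

for journal, rf in sorted(reading_factors.items()):
    print(f"{journal}: RF = {rf:.2f}")
# Journal A: RF = 2.31, Journal B: RF = 0.58, Journal C: RF = 0.12
```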
Recently, Bollen and Van de Sompel [29] examined the effects on journal usage of community-based characteristics such as total student enrollment and the size of the discipline in terms of the number of journals. They defined a journal Usage Impact Factor (UIF) mimicking ISI's IF. They then used two years' worth of download data obtained from the 23-campus California State University to rank journals on the basis of UIFs. They reported a negative correlation between UIF and ISI's IF values in general. No correlation was found for most disciplines between UIF and IF values. However, UIF and IF correlations "seemed to be related to the ratio between the sizes of the undergraduate and graduate community in a discipline." (p. 146) Studies based on the MESUR database containing large volumes of usage and citation data will shed new
light on the relationship between use-based measures and the community- or subject-based characteristics of journal use. Developed by Bollen and his colleagues, the database contains usage data spanning 100,000 journals and citation data spanning 10,000 journals for 10 years. In addition, the database has "publisher-provided COUNTER usage reports that span nearly 2000 institutions worldwide. . . . MESUR is now producing large-scale, longitudinal maps of the scholarly community and a survey of more than 60 different metrics of scholarly impact." [30]
4. Methodology
Data used in this paper come from Elsevier's ScienceDirect (SD) Freedom Collection of electronic journals. SD contains the full texts of some 8 million articles published in more than 2,000 journals. The SD Freedom Collection provides access to the contents of both subscribed and non-subscribed Elsevier journals with "dynamic linking to journals from approximately 2,000 STM publishers through CrossRef" [31]. Seven years' (2001-2007) worth of COUNTER-based download statistics of Turkish universities from Elsevier's SD database were obtained from the publisher. The number of full-text articles downloaded from each journal by each university was recorded. The analysis was based on more than 25 million articles downloaded from over 2,000 Elsevier journals. The most frequently used "core" journal titles were identified. Tests were carried out to see if the distribution of downloaded articles across journal titles conformed to Bradford's Law of Scattering, the 80/20 rule and the Price Law. Using ISI citation data (Journal Citation Reports 2006), the correlation between the journal impact factors (IFs) of core journal titles and their use based on the number of downloads was calculated to see if journals with high IFs were also used heavily by the Turkish academic community. What follows are the preliminary findings of our study.
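To make the analysis steps concrete, the following minimal sketch (with invented download counts rather than the actual COUNTER data) shows one way to rank journals by downloads and split them into three groups, each satisfying roughly one third of all download requests; this is the grouping used throughout the findings below.

```python
# Minimal sketch with invented download counts (not the actual COUNTER data):
# rank journals by the number of downloaded articles and split them into three
# groups, each satisfying roughly one third of all download requests.

def split_into_thirds(downloads):
    """downloads: dict mapping journal title -> number of full-text downloads."""
    ranked = sorted(downloads.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(downloads.values())
    groups, current, cumulative, boundary = [], [], 0, total / 3
    for title, count in ranked:
        current.append(title)
        cumulative += count
        if cumulative >= boundary and len(groups) < 2:
            groups.append(current)
            current = []
            boundary += total / 3
    groups.append(current)          # remaining, rarely used titles
    return groups                   # [core, moderately used, rarely used]

example = {"J1": 900, "J2": 500, "J3": 300, "J4": 200, "J5": 60, "J6": 40}
core, moderate, rare = split_into_thirds(example)
print(core, moderate, rare)         # ['J1'] ['J2'] ['J3', 'J4', 'J5', 'J6']
```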
5. Findings and Discussion
Turkish academic users downloaded a total of 25,145,293 full-text articles between 2001 and 2007 from 2,097 different journals included in Elsevier's SD database [48]. Two-thirds of those articles were downloaded over the last three years (2005-2007) (Fig. 1). March and December are the most heavily used months of the year, while the number of downloads decreases considerably during the summer. Table 1 shows the frequencies and percentages of journal titles satisfying one third, two thirds, and all requests downloaded between 2001 and 2007 as well as on an annual basis. Based on the data presented in Table 1, Figure 2 shows the annual distributions of journal titles by region (i.e., the percentage of journal titles satisfying one third, two thirds and all download requests). The first one third of all download requests (some 8.4 million articles) was satisfied by 105 "core" journals, constituting a mere 5% of all journal titles. The second one third was satisfied by 273 journals (12.9% of all journal titles). In other words, 378 journal titles (some 18% of all journal titles within SD) satisfied two thirds of all download requests. The last one third of requests was satisfied by 1,719 rarely used journals (82.1% of all SD journal titles). When the download statistics were analyzed on an annual basis for seven years, the pattern of use of core journals did not change much: on average about 90 core journal titles invariably satisfied one third of all download requests each year (77 journal titles in 2001, 83 in 2002, 95 in 2003, 103 each in 2004 and 2005, 92 in 2006, and 93 in 2007) (Table 1). The percentage of core journal titles ranged between 4.6% (2007) and 6.2% (2001) of all SD journals. The use patterns of moderately and rarely used journal titles did not fluctuate much, either: the percentage of moderately used journal titles ranged between 12.8% (2007) and 16.7% (2001) while the rarely used ones constituted the overwhelming majority (77% in 2001 and 82.6% in 2007) of all SD journals.
                          2001-2007      2001         2002         2003         2004         2005         2006         2007
Region                    N      %       N     %      N     %      N     %      N     %      N     %      N     %      N     %
1st (core)                105    5.0     77    6.2    83    5.2    95    5.7    103   5.8    103   5.5    92    4.8    93    4.6
2nd (moderately used)     273    12.9    206   16.7   225   14.1   255   15.4   271   15.2   274   14.6   262   13.7   257   12.8
3rd (rarely used)         1,719  82.1    950   77.0   1,292 80.8   1,304 78.8   1,409 79.0   1,498 79.9   1,553 81.4   1,663 82.6
Total                     2,097  100.0   1,233 99.9   1,600 100.1  1,654 99.9   1,783 100.0  1,875 100.0  1,907 99.9   2,013 100.0
Note: Some totals differ from 100% due to rounding.
Table 1. Distribution of journals by regions
Figure 1. Number of full-text articles downloaded from ScienceDirect (2001-2007). Annual totals: 810,203 (2001), 1,362,934 (2002), 3,346,381 (2003), 4,575,094 (2004), 5,264,423 (2005), 5,652,780 (2006), 5,843,049 (2007, estimated).
Note: The number of downloads in the last quarter of 2007 was estimated according to the average rate of increase (70.67%) over the last four years (2003-2006).
Figure 2. Yearly distributions of journal titles by region (percentage of journal titles in the 1st, 2nd and 3rd regions, for 2001-2007 overall and for each year).
Core journal titles satisfying one third of all download requests exhibited further interesting use patterns. Not only were their numbers quite stable (around 100) but the same journal titles also appeared, to a considerable extent, in the core journal lists throughout the seven years. To put it somewhat differently, a core journal receiving high use in a given year tends to do so in the following years as well. Ranks of individual journal titles based on the number of downloads did not fluctuate much on a yearly basis. This is despite the fact that new journal titles are constantly being added to the SD journal list, thereby increasing both the total number of SD journal titles available for download and the probability of further fluctuation. The total number of SD journal titles available in 2001 is likely to be smaller than that in 2007. Yet, the stability of the ranks of individual journals is especially noteworthy. Nonetheless, it should be noted that the ranks of some journals might be affected by the increase in the total number of available SD journal titles over the years. Spearman rank order correlation coefficients (ρ) for core journal titles in two consecutive years ranged between 0.402 (2001/2002) and 0.874 (2006/2007) (Table 2). As the number of downloaded articles increased over the years, so did the correlation between the annual ranks of core journal titles.
Years        Spearman rank order correlation coefficient (ρ)
2001-2002    0.402
2002-2003    0.706
2003-2004    0.778
2004-2005    0.780
2005-2006    0.791
2006-2007    0.874
Note: The correlation coefficient for 2006-2007 does not reflect the use of journal titles within the last quarter of 2007.
Table 2. Correlation coefficients for the core journal titles that were common in two consecutive years
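The rank correlations in Table 2 are standard Spearman coefficients computed over the journals common to two consecutive years' core lists; a hedged sketch of such a computation, using invented rank lists and SciPy, is shown below.

```python
# Hedged sketch (invented ranks): Spearman correlation between the ranks of the
# core journals common to two consecutive years' core lists.
from scipy.stats import spearmanr

ranks_year_t  = [1, 2, 3, 4, 5]     # hypothetical ranks of five journals in year t
ranks_year_t1 = [2, 1, 3, 5, 4]     # ranks of the same journals in year t+1

rho, p_value = spearmanr(ranks_year_t, ranks_year_t1)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")   # rho = 0.800
```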
A total of 29 journals appeared in the core journal lists of all seven years, satisfying roughly 3.3 million full-text download requests (13.1% of the total number of downloads). More than 200,000 articles were downloaded from the most frequently used journal (Food Chemistry), satisfying 0.8% of all requests. The average number of articles downloaded from those 29 top journals over seven years was 113,793 (16,256 per year). This is about 10 times more than the average for all journal titles [49]. Spearman rank order correlation coefficients (ρ) for the 29 core journal titles that were common to all seven years were even higher (minimum 0.472 in 2001, maximum 0.964 in 2005). In other words, Turkish academic users tend to use certain journal titles time and again to satisfy their information needs. The most frequently used top 29 journals, along with their rank orders based on the number of articles downloaded over all seven years and on an annual basis, are given in Table 3. It should be noted that the journals listed are common to each and every core journal list of all seven years (satisfying one third of all requests) as well as to that of the total use (2001/07). It was observed earlier that as the number of downloads increased, the ranks of core journals became more stable. This can be seen in the ranks of the top five journals for the years 2004 through 2007. None of these journals ranked lower than the 8th place (Journal of Food Engineering). As we go down the list, the ranks of top journals start to fluctuate. For instance, the journal Brain Research was at the top of the core journal list in 2002 whereas it moved down to the 70th place in 2006. Core journal lists need to be studied more closely in order to pinpoint possible use patterns. Findings of a use study based on the download statistics of one Turkish university (Hacettepe) produced similar results with regard to the SD core journal titles [33]. The most frequently used 30 journal titles
satisfied 20.4% of all use at Hacettepe University. The most frequently used 12 journal titles (within the first 30) at Hacettepe were also among the 29 journals used most heavily by all Turkish academics. Seven of those 12 titles were in medicine while the remaining 5 were in food chemistry, food engineering, chromatography, polymers and biomaterials. The ranks of journals differed as well. For instance, the journal Lancet was the most frequently used title in Hacettepe's core journal list (Hacettepe University has a medical school) while it ranked third in the consortial core journal list.

                                                    Rank order
Journal name                                        2001/07  2001  2002  2003  2004  2005  2006  2007
Food Chemistry                                            1     9     9    12     1     1     2     1
European Journal of Operational Research                  2     5     5     4     2     2     3     2
Lancet, The                                               3    29    11     1     3     3     5     6
Journal of Materials Processing Technology                4     6     2     2     4     6     7     3
Journal of Food Engineering                               5    26    24    21     6     8     4     5
Tetrahedron Letters                                       6    19    13    15    12     4    10    15
Journal of Chromatography A                               7    13    10    11    10    11     9    14
Analytica Chimica Acta                                   10     7    12     6    18    17    12    13
Water Research                                           11     3     4    10    15    16    22    30
Cement and Concrete Research                             12    27     6     5    19     5    30    52
Materials Science and Engineering A                      13    15    23    19    16    20    17     8
Tetrahedron                                              15    32    27    23    25     7    14    17
Polymer                                                  16    18    22    16    23    19    15    11
Biomaterials                                             17    49    19    20    14    15    20    26
Surface and Coatings Technology                          18    24    16    36    26    18    18    10
Bioresource Technology                                   20    36    28    39    28    21    16     7
Chemosphere                                              24    50    38    32    29    22    26    22
Energy Conversion and Management                         25    37    26    24    22    23    31    32
Aquaculture                                              26    10    29    38     9    24    52    63
International Journal of Production Economics            28    11    25    34    21    26    37    42
Thin Solid Films                                         29    34    59    70    47    28    25    12
Brain Research                                           30    66     1    22    50    48    70    73
Talanta                                                  32    14    33    53    34    33    35    25
International Journal of Food Microbiology               33    17    15    25    27    39    43    54
International Journal of Heat and Mass Transfer          35     4    41    31    40    42    63    51
European Journal of Pharmacology                         39    56     8    17    64    79    72    69
Renewable Energy                                         43    28    63    59    45    49    71    58
Journal of Membrane Science                              70    57    49    80    95    67    67    86
Enzyme and Microbial Technology                          80    58    56    78    96    92    80    83
Table 3. Top 29 journals common in the core journal lists of total use (2001/07) and individual years
The use of SD journals by the Turkish academic community seems to parallel the worldwide use of the same journals. By November 2006, more than one billion articles had been downloaded world-wide from SD [50]. Table 4 lists the "hottest" 10 SD journals based on download statistics along with the percentages of world-wide demand they satisfy. Weekly and fortnightly science journals such as The Lancet top the list. Table 4 also provides the equivalent percentages and ranks of those top journals on the basis of local download data. Four out of the 10 "hottest" journals (The Lancet, Tetrahedron Letters, Journal of Chromatography A, and Journal of the American College of Cardiology) are also among the top 10 journals used most often by Turkish academics. The percentages of use of four journals are also comparable. Some well-known journals such as Cell and the Journal of Molecular Biology, on the other hand, appear not to have been used heavily in Turkey [51].

Journals                                                 World-wide %   Turkey (2001-2007) %   Turkey rank
The Lancet                                                    1.56            0.71                   3
Tetrahedron Letters                                           1.55            0.55                   6
Cell                                                          0.99            0.03                 919
Biochemical and Biophysical Research Communications           0.97            0.27                  47
Tetrahedron                                                   0.93            0.47                  15
FEBS Letters                                                  0.87            0.21                  96
Journal of Chromatography A                                   0.67            0.54                   7
Journal of Molecular Biology                                  0.60            0.09                 309
Journal of the American College of Cardiology                 0.58            0.54                   8
Brain Research                                                0.55            0.36                  30
Source: Data in the first two columns come from http://www.info.sciencedirect.com/news/archive/2006/news_billionth.asp.
Table 4. The most frequently used top 10 ScienceDirect journals
Despite the fact that some 100 core journal titles satisfied one third, some 200 titles half, and some 500 titles 80% of all download requests, the distribution of downloaded articles did not conform to Bradford's Law of Scattering [52]. In separate studies, we found that the distribution of the five-year (2002-2006) download data of Hacettepe University users, representing over one million articles, and the distribution of both electronic document delivery and in-house journal use data of the National Academic Network and Information Center did not fit the Bradford Law, either [32,33]. It has been observed in the literature [53,54] that homogeneous bibliographies fit the Bradford Law better, whereas the article download data used in the present study come from over 2,000 journals representing all subject fields. It is also possible that distributions with long tails (e.g., very few articles being downloaded from a large number of journal titles, as was the case in our study) may not fit the Bradford Law very well. This is an issue that deserves to be explored further in its own right [55]. Notwithstanding this disconformity, the stability of the relatively small number of journal titles satisfying the great majority of download requests can nonetheless be seen in Figure 3, which depicts the Bradford curves for the aggregated use of all SD journal titles by Turkish academic users. Figure 3 also shows that the number of SD journals used at least once increased over the years (2,013 in 2007 as opposed to 1,233 in 2001). Yet, it is interesting to note that 17 SD journal titles were not used even once by more than two million (potential) users in Turkish universities during the seven-year period. Some 102 journal titles were used, on average, just once per annum. The download data did not quite fit the 80/20 rule, either [56]. In our case, 29% of all journals (or 602 titles) satisfied 80% of more than 25 million download requests. For individual years, the percentage of journals satisfying 80% of all requests ranged between 35% (2001) and 28% (2007), the average being 31.6%. Nor did the distribution of download data fit the Price Law (i.e., the number of journals equal to the square root of all journal titles should satisfy half the download demand) ([52], p. 362). In our study, half the downloads came from 208 journal titles instead of 46, as the Price Law suggests. Again, the disconformity can perhaps be explained by the wide variety of uses of the collection for
different purposes by different researchers. For instance, universities with medical schools may download medical articles more often whereas science and engineering schools may do the same for articles in their respective fields. Considering the fact that there are more than 100 Turkish universities with different subject concentrations, it is likely that the demand for articles is dispersed more evenly than the 80/20 rule predicts. The four-year (2000-2003) download data of the Consortium of University Libraries of Catalonia (CBUC) did not fit the 80/20 rule, either: an average of 35% of the journal titles of four different publishers satisfied 80% of the demand [35]. It has been suggested that the dispersal of use of journals fits the 80/20 rule better as the number of articles available for download in a collection increases. This does not seem to be the case, however. The SD electronic journals package used in this study has over 2,000 journal titles with more than 8 million articles available for download whereas, for instance, the Emerald Management Xtra (EMX) collection comprises about 190 electronic journal titles with 75,000 articles available for download. While 29% of the SD journal titles satisfied 80% of the download requests in our study, almost half the EMX journal titles satisfied 80% of the world-wide demand in 2004, representing more than 6 million article downloads [57].
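For readers who wish to repeat such checks, the sketch below illustrates, on invented per-journal counts, how the share of titles satisfying 80% of downloads (for the 80/20 rule) and the number of titles satisfying half the downloads (for comparison with the Price Law prediction) can be computed.

```python
# Minimal sketch (hypothetical per-journal counts): share of titles needed to
# satisfy 80% of downloads (80/20 rule) and the number of titles satisfying
# half the downloads, compared with the Price Law prediction (sqrt of titles).
import math

downloads = [900, 500, 300, 200, 60, 40]
downloads.sort(reverse=True)
total = sum(downloads)

def titles_needed(share):
    cumulative = 0
    for i, count in enumerate(downloads, start=1):
        cumulative += count
        if cumulative >= share * total:
            return i
    return len(downloads)

n80, n50 = titles_needed(0.80), titles_needed(0.50)
price_prediction = math.sqrt(len(downloads))

print(f"{n80} of {len(downloads)} titles satisfy 80% of downloads")
print(f"{n50} titles satisfy half the downloads; Price Law predicts about {price_prediction:.1f}")
```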
Figure 3. Bradford curves for the use of journal titles in SD (2001-2007 N = 2,097; 2001 N = 1,233; 2002 N = 1,600; 2003 N = 1,654; 2004 N = 1,783; 2005 N = 1,875; 2006 N = 1,907; 2007 N = 2,013).

We checked whether there is any correlation between the journal impact factors (IFs) and the download statistics. IF values of the 105 core journals, along with the total number of citations to articles published therein, were obtained from ISI's Journal Citation Reports 2006. The number of downloads for core journals ranged between 206,537 (Food Chemistry) and 50,020 (European Polymer Journal) (the average being 80,228 with s.d. = 33,329). Journals' IF values ranged between 25.8 (The Lancet) and 0.615 (Journal of Materials Processing Technology) (the average being 2.340 with s.d. = 2.624). There appears to be a low correlation between the IFs of core journals and the number of downloads therefrom (Pearson's r = 0.368). The correlation coefficient was even lower (0.291) for the 29 journals that were common to all core journal lists between 2001 and 2007. This finding is in parallel with those obtained in other studies that we recently carried out [32,42]. A low correlation also exists between the ranks of core journal titles based on the number of downloads and those based on the total number of citations (Spearman's ρ = 0.253, N = 104). It appears
that journals with high IFs tend to be used slightly more often by the Turkish academic community. It can be argued that journal IFs (and total citation counts) are calculated on the basis of world-wide use whereas the number of downloads used in this study reflects local use. The concentration of research in Turkey may well differ from that in other countries (e.g., the USA) and skew the downloads away from IF and the total number of citations. Yet, there are several studies showing that use based on citations (IFs) and use based on downloads are either slightly correlated or not correlated at all (see [32], p. 215; [36]; [40], p. 319; [41,42]). As we indicated earlier, Bollen and Van de Sompel [45] conducted a more careful study comparing use based on citations (i.e., journal IFs) and downloads (i.e., Usage Impact Factors) obtained from California State University (CSU). They reported a moderate negative correlation between the two, noting that "CSU usage data indicates significant, community-based deviations between local usage impact and global citation impact" and that "usage-based impact assessments are influenced by the demographic and scholarly characteristics of particular communities" (p. 146). It is also possible that use based on citations and use based on downloads measure two different dimensions of usage [36]. The motives of users downloading articles may be quite different from those of users citing articles, and the two groups may not overlap.
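The correlation figures quoted above are ordinary Pearson and Spearman coefficients over per-journal values; the sketch below shows the computation on invented IF and download figures (not the actual JCR 2006 or ScienceDirect data).

```python
# Illustrative sketch (invented values, not the JCR 2006 or ScienceDirect data):
# correlation between journal impact factors and full-text download counts.
from scipy.stats import pearsonr, spearmanr

impact_factors  = [25.8, 3.1, 2.4, 1.9, 0.6]
download_counts = [180000, 95000, 120000, 60000, 110000]

r, _ = pearsonr(impact_factors, download_counts)
rho, _ = spearmanr(impact_factors, download_counts)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```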
6. Conclusion
The preliminary findings of our analysis based on the download statistics of all Turkish universities from Elsevier's SD database show that some 100 core journals satisfied one third of the total of 25 million full-text download requests. Lists of core journal titles seem to be quite persistent, for they do not change much on an annual basis. A large number of journal titles were rarely used while some were never used at all. Coupled with pricing data, findings based on seven years' worth of national usage statistics can be used by individual university libraries as well as by the consortium management to develop collection management plans and devise negotiation strategies to be exercised with publishers. Based on national usage statistics, "an orderly retreat" for rarely used journal titles that are usually offered as part of the "big deals" can be negotiated with publishers on behalf of all consortium members [20].
7. Acknowledgments
This study was supported in part by a research grant of the Turkish Scientific and Technological Research Center (SOBAG-106K068). We thank Mr. Hatim El Faiz of Elsevier for providing the download data used in this study, and Mr. Umut Al of Hacettepe University for providing feedback on an earlier draft of this paper.
8. Notes and References
[1] Statistics come from the web site of the Student Selection and Admission Center: ÖĞRENCİ SEÇME VE YERLEŞTİRME MERKEZİ. (n.d.). 2006-2007 öğretim yılı yükseköğretim istatistikleri (Higher Education Statistics of the 2006-2007 Academic Year). Ankara: ÖSYM. Retrieved 26 March 2008, from http://www.osym.gov.tr/dosyagoster.aspx?DIL=1&BELGEANAH=19176&DOSYAISIM=1_Ogrenci_Say.pdf.
[2] The statistics on the distribution of students by subject disciplines come from p. 46, Table 4.2 of TÜRK YÜKSEKÖĞRETİMİNİN BUGÜNKÜ DURUMU (the state of the art of Turkish higher education). (November 2005). Ankara: Higher Education Council. Retrieved 26 March 2008, from http://www.yok.gov.tr/egitim/raporlar/kasim_2005/kasim_2005.doc.
[3] Statistics come from the web site of the Student Selection and Admission Center: ÖĞRENCİ SEÇME VE YERLEŞTİRME MERKEZİ. (n.d.). 2006-2007 öğretim yılı yükseköğretim istatistikleri (Higher Education Statistics of the 2006-2007 Academic Year). Ankara: ÖSYM. Retrieved 26 March 2008, from http://www.osym.gov.tr/dosyagoster.aspx?DIL=1&BELGEANAH=19176&DOSYAISIM=2_Ogretim_El_Say.pdf.
[4] TONTA, Y. (2001). Collection development of electronic information resources in Turkish university libraries. Library Collections, Acquisitions and Technical Services, 25(3): 291-298.
[5] LINDLEY, J.A., & ERDOĞAN, P.L. (2002). TRNSL: A model site license for ANKOS. Paper presented at the Symposium on Research in the Light of Electronic Developments, October 24-25, 2002, Bolu, Turkey. Retrieved 26 March 2008, from http://www.library.yale.edu/~llicense/TRNSLpaper.doc.
[6] LINDLEY, J.A. (2003). Turkish National Site License (TRNSL). Serials, 16(2): 187-190.
[7] ERDOĞAN, P.L., & KARASÖZEN, B. (2006). ANKOS and its dealings with vendors. The Journal of Academic Librarianship, 44(3-4): 69-83, p. 69.
[8] KARASÖZEN, B., & LINDLEY, J.A. (2004). The impact of ANKOS: Consortium development in Turkey. The Journal of Academic Librarianship, 30: 402-409.
[9] See the ANKOS web site for more information (http://www.ankos.gen.tr).
[10] TONTA, Y. (2007). Elektronik dergiler ve veri tabanlarında ulusal lisans sorunu (The national license issue in electronic journals and databases). Paper presented at Akademik Bilişim '07, 31 January – 2 February 2007, Kütahya, Turkey. (Online). Retrieved 12 May 2008, from http://yunus.hacettepe.edu.tr/~tonta/yayinlar/tonta-ab07-bildirisi.pdf.
[11] For more information on EKUAL, see http://www.ulakbim.gov.tr/cabim/ekual/hakkinda.uhtml.
[12] In fact, ULAKBİM paid the license fee for 2006 (the last year of a three-year license agreement signed by Elsevier and ANKOS) on behalf of ANKOS members.
[13] For more information on the databases offered through ULAKBİM's EKUAL, see http://www.ulakbim.gov.tr/cabim/ekual/veritabani.uhtml.
[14] ÜNAK-OCLC KONSORSİYUMU (The ÜNAK-OCLC Consortium). (2008). Retrieved 26 March 2008, from http://www.unak.org.tr/unakoclc/.
[15] See http://e-gazete.anadolu.edu.tr/ayrinti.php?no=6501.
[16] FRAZIER, K. (2001). The librarians' dilemma: Contemplating the costs of the "big deal". D-Lib Magazine, 7(3). (Online). Retrieved 12 May 2008, from http://www.dlib.org/dlib/march01/frazier/03frazier.html.
[17] BALL, D. (2004). What's the "big deal", and why is it a bad deal for universities? Interlending & Document Supply, 32(2), 117-125.
[18] JOHNSON, R.K. (2004). Open access: Unlocking the value of scientific research. Journal of Library Administration, 42(2), 107-124.
[19] DURANCEAU, E.F. (2004). Cornell and the future of the big deal: An interview with Ross Atkinson. Serials Review, 30(2), 127-130, p. 127. See also Johnson (2004, p. 109) in Ref. 18.
[20] GATTEN, J.N., & SANVILLE, T. (2004). An orderly retreat from the big deal: is it possible for consortia? D-Lib Magazine, 10(10). (Online). Retrieved 12 May 2008, from http://www.dlib.org/dlib/october04/gatten/10gatten.html.
[21] HAAR, J. (2000). Project PEAK: Vanderbilt's experience with articles on demand. Serials Librarian, 38(1/2), 91-99.
[22] HUNTER, K. (2000). PEAK and Elsevier Science. PEAK Conference, Ann Arbor, 23 March 2000. (Online). Retrieved 12 May 2008, from http://www.si.umich.edu/PEAK-2000/hunter.pdf.
[23] DAVIS, P.M. (2002). Patterns in electronic journal usage: Challenging the composition of geographic consortia. College & Research Libraries, 63, 484-497.
[24] GALBRAITH, B. (2002). Journal retention decisions incorporating use-statistics as a measure of value. Collection Management, 27(1), 79-90.
[25] BATI, H. (2006). Elektronik bilgi kaynaklarında maliyet-yarar analizi: Orta Doğu Teknik Üniversitesi Kütüphanesi üzerinde bir değerlendirme (Cost-benefit analysis in electronic information resources: An evaluation of the Middle East Technical University Library). Unpublished M.A. dissertation. Hacettepe University, Ankara.
[26] CHRZASTOWSKI, T.E. (2003). Making the transition from print to electronic serial collections: A new model for academic chemistry libraries? Journal of the American Society for Information Science and Technology, 54, 1141-1148.
[27] WILEY, L., & CHRZASTOWSKI, T.E. (2002). The Illinois Interlibrary Loan Assessment Project II: revisiting statewide article sharing and assessing the impact of electronic full-text journals. Library Collections, Acquisitions, & Technical Services, 26(1), 19-33.
[28] HAMAKER, C. (2003). Quantity, quality and the role of consortia. What's the Big Deal? Journal purchasing – bulk buying or cherry picking? Strategic issues for librarians, publishers, agents and intermediaries. Association of Subscription Agents and Intermediaries (ASA) Conference (24-25 February 2003). (Online). Retrieved 14 January 2007, from http://www.subscriptionagents.org/conference/200302/chuck.hamaker.pps.
[29] KE, H-R., KWAKKELAAR, R., TAI, Y-M., & CHEN, L-C. (2002). Exploring behavior of E-journal users in science and technology: Transaction log analysis of Elsevier's ScienceDirect OnSite in Taiwan. Library & Information Science Research, 24, 265-291.
[30] RUSCH-FEJA, D., & SIEBEKY, U. (1999). Evaluation of usage and acceptance of electronic journals: Results of an electronic survey of Max Planck Society researchers including usage statistics from Elsevier, Springer and Academic Press (Full report). D-Lib Magazine, 5(10). (Online). Retrieved 12 May 2008, from http://www.dlib.org/dlib/october99/rusch-feja/10rusch-feja-fullreport.html.
[31] VAUGHAN, K.T.L. (2003). Changing use patterns of print journals in the digital age: Impacts of electronic equivalents on print chemistry journal use. Journal of the American Society for Information Science and Technology, 54, 1149-1152.
[32] See also TONTA, Y., & ÜNAL, Y. (2007). Dergi kullanım verilerinin bibliyometrik analizi ve koleksiyon yönetiminde kullanımı (Bibliometric analysis of journal use data and its use in collection management). In Serap Kurbanoğlu, Yaşar Tonta & Umut Al (eds.), Değişen Dünyada Bilgi Yönetimi Sempozyumu 24-26 Ekim 2007, Ankara: Bildiriler (pp. 193-200). Ankara: Hacettepe Üniversitesi Bilgi ve Belge Yönetimi Bölümü.
[33] AL, U., & TONTA, Y. (2007). Tam metin makale kullanım verilerinin bibliyometrik analizi (Bibliometric analysis of full-text article use). In Serap Kurbanoğlu, Yaşar Tonta & Umut Al (eds.), Değişen Dünyada Bilgi Yönetimi Sempozyumu 24-26 Ekim 2007, Ankara: Bildiriler (pp. 209-217). Ankara: Hacettepe Üniversitesi Bilgi ve Belge Yönetimi Bölümü.
[34] EVANS, P., & PETERS, J. (2005). Analysis of the dispersal of use for journals in Emerald Management Xtra (EMX). Interlending & Document Supply, 33(3): 155-157.
[35] URBANO, C., ANGLADA, L.M., BORREGO, A., CANTOS, C., COSCULLUELA, C., & COMELLAS, N. (2004). The use of consortially purchased electronic journals by the CBUC (2000-2003). D-Lib Magazine, 10(6). (Online). Retrieved 10 May 2008, from http://www.dlib.org/dlib/june04/anglada/06anglada.html.
[36] COOPER, M.D., & MCGREGOR, G.F. (1994). Using article photocopy data in bibliographic models for journal collection management. Library Quarterly, 64, 386-413.
[37] MCDONALD, J.D. (2007). Understanding journal usage: A statistical analysis of citation and use. Journal of the American Society for Information Science and Technology, 58, 39-50.
[38] TSAY, M-Y. (1998a). Library journal use and citation half-life in medical science. Journal of the American Society for Information Science, 49, 1283-1292.
[39] TSAY, M-Y. (1998b). The relationship between journal use in a medical library and citation use. Bulletin of the Medical Library Association, 86, 31-39.
[40] WULFF, J.L., & NIXON, N.D. (2004). Quality markers and use of electronic journals in an academic health sciences library. Journal of the Medical Library Association, 92, 315-322.
[41] SCALES, P.A. (1976). Citation analyses as indicators of the use of serials: A comparison of ranked title lists produced by citation counting and from use data. Journal of Documentation, 32, 17-25.
[42] TONTA, Y., & ÜNAL, Y. (2005). Scatter of journals and literature obsolescence reflected in document delivery requests. Journal of the American Society for Information Science & Technology, 56(1): 84-94.
[43] DARMONI, S.J., ROUSSEL, F., BENICHOU, J., THIRION, B., & PINHAS, N. (2002). Reading factor: A new bibliometric criterion for managing digital libraries. Journal of the Medical Library Association, 90(3), 323-327.
[44] BOLLEN, J., VAN DE SOMPEL, H., SMITH, J., & LUCE, R. (2005). Toward alternative metrics of journal impact: A comparison of download and citation data. Information Processing & Management, 41(6), 1419-1440.
[45] BOLLEN, J., & VAN DE SOMPEL, H. (2008). Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. Journal of the American Society for Information Science and Technology, 59, 136-149.
[46] http://www.mesur.org/MESUR.html (bold in original)
[47] http://www.info.sciencedirect.com/content/journals/titles/
[48] The figure reflects the data obtained from the publisher. It is slightly different from the total given in Fig. 1, as the download statistics for the last quarter of 2007 were estimated and added to the total.
[49] The average number of articles downloaded per journal title over 7 years was 11,991 (1,713 per year) (s.d. = 20,101, median: 4,784).
[50] http://www.info.sciencedirect.com/news/archive/2006/news_billionth.asp
[51] Note that the Journal of the American College of Cardiology ranks 8th on the basis of total use (2001/07). It does not appear in Table 3 because the journal was not common to the core journal lists of all years.
[52] EGGHE, L., & ROUSSEAU, R. (1990). Introduction to informetrics: Quantitative methods in library, documentation and information science. Amsterdam: Elsevier Science Publishers. (Online). Retrieved 31 January 2008, from http://hdl.handle.net/1942/587.
[53] COLEMAN, S.R. (1994). Disciplinary variables that affect the shape of Bradford's bibliograph. Scientometrics, 29(1): 59-81.
[54] COLEMAN, S.R. (1993). Bradford distributions of social-science bibliographies varying in definitional homogeneity. Scientometrics, 27(1): 75-91.
[55] DROTT, M.C., & GRIFFITH, B.C. (1978). An examination of the Bradford's Law and the scattering of scientific literature. Journal of the American Society for Information Science, 29, 238-246.
[56] TRUESWELL, R.L. (1969). Some behavioral patterns of library users: the 80/20 rule. Wilson Library Bulletin, 43: 458-461.
[57] See Ref. 34. Current statistics on the number of electronic journals and articles in the Emerald Management Xtra (EMX) collection come from http://info.emeraldinsight.com/products/xtra/.
A Rapidly Growing Electronic Publishing Trend: Audiobooks for Leisure and Education Jan J. Engelen Kath. Univ. Leuven – DocArch group – ESAT-Dept. of Electrical Engineering Kasteelpark Arenberg 10, B-3001 Heverlee – Leuven (Belgium) jan.engelen@esat.kuleuven.be
Abstract
This contribution focuses on the relatively new phenomenon of the purely commercial availability of audiobooks, sometimes also called "spoken books", "talking books" or "narrated books". Having the text of a book read aloud and recorded has long been the favourite solution for making books and other texts accessible to persons with a serious reading impairment such as blindness or low vision. Specialised production centres for these talking books exist in most countries of the world. Now, however, a growing number of commercial groups have discovered that there is a booming market for these products, as people slowly get used to listening to books for leisure instead of reading them. Some companies already claim to have over 40,000 titles in spoken format in their catalogue. Major differences and possible synergies between the two worlds are discussed.
Keywords: audiobooks; talking books; spoken information; commercialization
1. Introduction
Electronic equivalents of printed books (e-books) have been around for a long time now, and multimedia documents have become more and more popular. The spoken variants of books, especially, are continuously gaining popularity. Until a few years ago, producing talking books was seen solely as a service to support reading-impaired persons, but nowadays commercial interest is growing at a high pace. We start by comparing the traditional specialised production processes with their equivalents in the commercial circuit, which inevitably involves some technical aspects. Digital Rights Management and copyright challenges are addressed too. Finally, we discuss a few implications of this phenomenon for the organisation of libraries and related cataloguing issues.
2. Specialised audiobook production centres
Most audiobook production centres in Western countries that focus mainly on consumers with a reading impairment have now abandoned cassette distribution in favour of CD-based solutions. Cassettes had been around since the beginning of the sixties, but recording, erasing and checking returned cassettes remained very time-consuming activities for these production centres. Furthermore, most books had to be put on a series of cassettes (due to their limited storage capacity), and clear indications on the cassettes and their boxes, preferably in Braille, were needed to keep some order in such a collection. But even at that time several measures for protecting copyright (nowadays called Digital Rights Management, DRM) were taken: special cassette formats or non-standard tape speeds were used to provide some copy protection. In the middle of the nineties internet technology, and especially web documents with hyperlinks to other
documents became widespread. Within the European Digibook project several hybrid books were developed, containing both the text and the linked audio files of the same book. The linking was done at sentence level [1]. Similar initiatives were developed at the Swedish production centre TPB. In 1996 a large group of specialised production centres from around the world created the Daisy consortium [2] in order to study and standardise the future audio recording of talking books and, very importantly, how a navigation structure could be added to the books in question. This led to the Daisy 2.02 and 3.0 standards, which have been turned into US standards by NISO but are accepted worldwide by all specialised production centres in order to permit the exchange of this new generation of audiobooks. Most centres nowadays distribute their productions on a data CD [3], or to a minor extent via the internet [4]. Data CD's permit a trade-off between quality and recording speed that is not possible for audio CD's (e.g. with music). As the human voice can be recorded at a much lower sampling frequency than high quality music, data CD's can easily contain 50 to 70 hours of speech.
Technically, the Daisy format describes the content of the book (in XHTML or XML type files) while the audio is recorded as a collection of mp3 files (.wav is rarely used). The Daisy CD can also contain the text of the document and a whole series of timing links (in SMIL format) between the two. That way a computer or reading device can search the text content while the user can still listen to the corresponding audio output. Furthermore, a Daisy book permits easy and rapid navigation through complex documents, as up to six levels of table of contents are possible. Daisy books are read with computer programmes (AMIS, Easereader, TPB reader…) or special players. These players are actually CD-ROM readers with Daisy reading software; some even look like a CD walkman. Since last year mini Daisy players have reached the market. These smaller devices, with PDA or mobile phone dimensions, use SD memory card readers instead of CD's. The next generation of these devices will connect automatically (through WiFi) to the internet and will then automatically download books (or newspapers, cf. below). Currently only a few UTP-cable connected devices (Webbox, Adela) exist, but their WiFi versions are under development.
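To make the text/audio synchronisation idea concrete, the following is a minimal sketch of a Daisy-style book expressed as plain Python data structures. It is an illustration only: it does not follow the actual Daisy 2.02/3.0 specifications or SMIL syntax, and all identifiers, file names and timings are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SyncPoint:
    """Links one text fragment to a clip inside an audio file (SMIL-style timing)."""
    text_id: str        # id of a sentence or heading in the XHTML/XML text file
    audio_file: str     # mp3 file holding the narration
    clip_begin: float   # seconds from the start of the file
    clip_end: float

@dataclass
class NavPoint:
    """One entry in the (up to six levels deep) table of contents."""
    label: str
    level: int          # 1 = chapter, 2 = section, ...
    target_text_id: str

@dataclass
class TalkingBook:
    title: str
    nav_map: List[NavPoint] = field(default_factory=list)
    sync_points: List[SyncPoint] = field(default_factory=list)

    def clip_for(self, text_id: str) -> Optional[SyncPoint]:
        """What a reading device does when the user jumps via the navigation
        structure: find the audio clip that corresponds to a text fragment."""
        return next((s for s in self.sync_points if s.text_id == text_id), None)

# Invented content, purely to show the text-to-audio linkage
book = TalkingBook(title="Example talking book")
book.nav_map.append(NavPoint(label="Chapter 1", level=1, target_text_id="ch1"))
book.sync_points.append(SyncPoint(text_id="ch1", audio_file="part01.mp3",
                                  clip_begin=0.0, clip_end=4.2))
print(book.clip_for("ch1"))
```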
3. Commercial audiobook production
3.1 Booming commercial audiobook popularity
Over the last few years we have witnessed an enormous increase in audiobook popularity outside the "traditional" user group of persons with a visual impairment. Several commercial groups, some linked to traditional publishers and some completely new, have popped up. Audible.com [5] is the leading online provider of digital spoken word audio content in English, specialising in digital audio editions of books, newspapers and magazines, television and radio programmes and original programming. Through its web sites in the US and UK and alliances in Germany and France, Audible.com offers over 40,000 programmes, including audiobooks from well-known authors such as Stephen King, Thomas Friedman, and Jane Austen, and spoken word audio content from periodicals including The New York Times and The New Yorker. However, these periodicals are only made available in excerpted form.
"It really is that easy. You don't need to install any special software. You don't have to join a club and pay a monthly subscription. You don't even have to break the bank as there are lots of titles for just a few dollars. Just get whatever you want, whenever you want, and sit back and enjoy."
Figure 1: Two different commercial approaches: Audible.com (top) and LeisureAudiobooks (bottom)
Meanwhile, in Belgium and in the Netherlands (two small countries, 15 million Dutch-speaking inhabitants) about a dozen specialised publishers have popped up in a short time. Curiously enough, customers will seldom buy audiobooks in bookshops: they seem to be used to downloading music and therefore also expect audiobooks to be downloadable. On the other hand, many public libraries have reacted to an enormous interest in audiobooks by adding them to their collections. There is also a growing interest in spoken versions of education material and course material [6].
3.2 Technical formats and standards for audiobooks; copy protection
A very important issue is the type of audiobook standard that is used. As stated above, within the sector of audiobook production for reading-impaired persons the Daisy standard is very common (and in fact globally accepted). Commercial publishers, on the other hand, do NOT use the Daisy standard but rely on several alternatives for distributing their audiobooks:
• Some companies provide documents on standard [7] audio CD's (e.g. Dan Brown's "Da Vinci Code" spans 13 audio CD's). The main reason for this choice is the universal usability on any audio CD-player developed since 1980.
• Others use data CD's with audio files in mp3 format, for the reasons explained above. Up to 40 hours of narration on one CD is not uncommon.
• A few, including the largest one (Audible.com, cf. above), provide their audiobooks in a DRM-protected format. Some of their more expensive books, however, can be burned onto (a pile of) audio CD's by a legal buyer. A special version of the NERO CD writer is needed to do this.
• Audible.com has developed the proprietary ".aa" format and provides free software for playing (legally acquired) .aa files on 290 platforms. This format also caters for different quality levels.
• Apple iTunes mainly used the proprietary MP4 format (a container format including the media and DRM info), which for some time made it impossible to use the files on non-iPod players.
• Since the beginning of 2007 more and more music on the internet has become available without DRM, although generally at a somewhat higher price. Many now see DRM as a thing of the past (cf. below).
But the most striking difference between all these solutions and the Daisy format is the lack of any sensible navigation system through the audio files. The available solution, the Daisy standard, is simply not used in the commercial audiobook world! A very important aspect of audiobook (and music) distribution on CD's (or via the internet) is copyright protection, often seen as copying protection. Digital rights management was once seen by the music industry as the method to prohibit illegal copying. In practice, however, DRM led to quite a lot of customer frustration as it hindered copying in general or sometimes made it impossible to play legally acquired files on a whole series of devices. In practice, all widely-used DRM systems have been defeated or circumvented when deployed to enough customers. Protection of audio and visual material is especially difficult due to the existence of the "analogue hole" [8], and there are even suggestions that effective DRM is logically impossible for this reason. A much more complex situation for illegal copiers arises when books become interactive and the sequential nature of the narration is abandoned.
3.3 Business models
A special audiobook issue is the business model used by the publishers: some companies, including again the largest one, prefer a subscription model with monthly instalments worth approximately one audiobook. Audible.com's business model closely mimics the well-established marketing system of "book clubs", i.e. one gets the possibility to download a number of books by paying a monthly membership fee. Buying individual books is possible too, but at much higher prices. Its main competitor, LeisureAudiobooks, on the contrary, stresses that no subscription is required (cf. Figure 1). Others charge different prices for the different audio qualities available. E.g. at audiobooksforfree.com, the lowest quality is free but users are charged for better quality files. In fact the company stores high quality audiobooks but degrades them for those who want to pay less.
Figure 2: Example of different pricing for different qualities (example from audiobooksforfree.com)
3.4 Audio: human voice vs synthetic voice
A major distinction between audiobooks must be made according to the type of audio: is the narration done by a human person or by a computer (synthetic voice)? Everyone agrees that, even nowadays, a human voice is much more agreeable to listen to than a synthetic voice, although very good quality text-to-speech (TTS) software is available. However, for some applications only electronic conversion is an option. E.g. during the production of the
spoken Flemish daily newspaper project – Audiokrant – with full text coverage of all articles, there are only some 30 minutes available after copy closure time to produce 12 to 15 hours of speech and to physically record the subscribers' CD's [9].
3.5 Growing synergy between commercial and not-for-profit audiobook publishers
Up to now, the worlds of the commercial and the not-for-profit publishers have been largely segregated. Commercial publishers often state that their products also benefit reading-impaired persons, but they show no interest in using the Daisy standard. On the other hand, specialised production centres are clearly exploring the commercial possibilities of the large archive of spoken books most of them have created over the past years. Sometimes, specialised and commercial productions go hand in hand. The Royal National Institute of the Blind (UK) recording of Terry Darlington's 'Narrow Dog To Carcassonne' won the APA 'Audies' award for 2007 in the category of best unabridged non-fiction. The book was produced by RNIB both as a DAISY digital talking book for RNIB clients and as a commercial audiobook on CD (ISIS publishing). In the Netherlands, the largest specialised audiobook production centre, "Dedicon", created a commercial branch named Lecticus [10] at the end of 2006. Mainly linearly organised books are provided as a series of mp3 or wma files. Books can be downloaded but can also be delivered on a cheap audio mp3 player (USB stick size).
4. Cataloguing Issues
The problem of how to find an audiobook in a library is clearly somewhat complicated by the fact that the number of production centres is increasing rapidly. Furthermore, a comprehensive cataloguing process for audiobooks requires a whole new series of descriptive items, including but certainly not limited to the following (a minimal record sketch follows the list):
• flags for abridged [11] and unabridged versions;
• a field for total reading time;
• fields for technical recording specifications (e.g. audio quality/sampling frequency, file types, use of Daisy standards 2.02 or 3.0, etc.);
• a field to distinguish between recorded and synthesized speech;
• flags for pronunciation details (UK English vs American English, Austrian or Swiss German vs Standard German, Dutch vs Flemish intonation, etc.). No standard for covering these subtle language differences is available;
• fields describing the audio-to-text linking mechanisms used in the audiobook (if the text is made available too): synchronisation between text and audio on a word, a paragraph or a page level;
• fields for the narrator's details (experienced narrators or books read by their author constitute selling arguments for commercially produced books!).
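A minimal sketch of such a catalogue record follows, with field names invented purely for illustration; they do not follow MARC, IFLA or any existing cataloguing schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AudiobookRecord:
    title: str
    abridged: bool                        # flag: abridged vs unabridged version
    total_reading_time_min: int           # total reading time in minutes
    sampling_frequency_hz: int            # technical recording specification
    file_type: str                        # e.g. "mp3", "wav"
    daisy_version: Optional[str]          # "2.02", "3.0", or None for non-Daisy books
    synthesized_speech: bool              # recorded (False) vs synthesized (True) speech
    pronunciation: str                    # e.g. "UK English", "Flemish Dutch"; no standard exists
    text_audio_sync_level: Optional[str]  # "word", "paragraph" or "page", if text is included
    narrators: List[str] = field(default_factory=list)  # narrator details sell commercial books

# Example with invented values
record = AudiobookRecord(
    title="Example title", abridged=False, total_reading_time_min=540,
    sampling_frequency_hz=22050, file_type="mp3", daisy_version="2.02",
    synthesized_speech=False, pronunciation="Flemish Dutch",
    text_audio_sync_level="paragraph", narrators=["Example Narrator"])
print(record)
```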
Some of these requirements resemble the cataloguing needs for books in large print. These topics are actually under the remit of a special section within the International Federation of Library Associations (IFLA) that caters for the needs of reading-impaired users [12].
5. Conclusions
Due to the explosive growth of commercial audiobooks, a huge number of titles theoretically becomes available for reading-impaired users too. This process, however, requires new business models for the traditional specialised centres and probably also a completely new societal vision on who is willing to pay for what type of audiobook service in the future.
6. Notes and References
[1] Andras Arato, Laszlo Buday, Teresa Vaspori, "Hybrid Books for Blind - a New Form of Talking Books", lecture at ICCHP'96 (Linz, July 1996), in Proceedings of the 5th International Conference "Interdisciplinary Aspects on Computers Helping People with Special Needs", Schriftenreihe der Oesterreichischen Computer Ges., Band 87 (part 1), ISBN 3-85403-087-8 (Linz, July 1996).
[2] Daisy consortium: http://www.daisy.org
[3] The medium to use is not part of the Daisy standard. The CD-ROM format is described in the "Yellow Book": http://en.wikipedia.org/wiki/Yellow_Book_%28CD-ROM_standards%29
[4] Downloading is not yet a very common procedure due to most internet providers' data download volume restrictions, although this is changing rapidly to permit more multimedia downloads.
[5] Audible.com: http://www.audible.com After having been a minority shareholder for some years, Amazon.com fully acquired Audible.com on January 31, 2008.
[6] Post, Hans-Maarten, "Luisterboeken winnen terrein", p. 32 in "De Standaard", 5 May 2008 (Corelio newspaper publishers, Belgium).
[7] The Audio CD format ("Red Book") was developed in 1980 by Philips & Sony for high quality music recordings and specifies a maximum playing time of 78 minutes. http://en.wikipedia.org/wiki/Red_Book_%28audio_CD_standard%29
[8] The "analogue hole" means simply that any audio or video signal has to be transformed into an analogue signal to be interpretable by human beings; but analogue signals can be re-digitised afterwards. Internet music stores have more or less given up DRM protection. E.g. it was found that a new iTunes music track (with DRM) made available from Apple needed less than 3 minutes to become available elsewhere on the web in an unprotected audio format.
[9] Paepen, Bert, "AudioKrant, the daily spoken newspaper", Proceedings of the 12th Electronic Publishing Conference (ELPUB, Toronto, June 2008). Available from: http://elpub.scix.net (Open Access).
[10] Lecticus audiobook shop: http://www.lecticus.nl
[11] And what does "abridged" precisely mean?
[12] Brazier, Helen, "The Role and Activities of the IFLA Libraries for the Blind Section", Library Trends, Volume 55, Number 4, Spring 2007, pp. 864-878.
The SCOAP3 project: converting the literature of an entire discipline to Open Access
Salvatore Mele
CERN – European Organization for Nuclear Research, CH-1211 Geneva 23, Switzerland
e-mail: Salvatore.Mele@cern.ch
Abstract
The High-Energy Physics (HEP) community spearheaded Open Access with over half a century of dissemination of pre-prints, culminating in the arXiv system. It is now proposing an Open Access publishing model which goes beyond present, sometimes controversial, proposals, with a novel practical approach: the Sponsoring Consortium for Open Access Publishing in Particle Physics (SCOAP3). In this model, libraries and research institutions federate to explicitly cover the costs of the peer-review and other editorial services, rather than implicitly supporting them via journal subscriptions. Rather than through subscriptions, journals will recover their costs from SCOAP3 and make the electronic versions of their journals free to read. Unlike many "author-pays" Open Access models, authors are not directly charged to publish their articles in the Open Access paradigm. Contributions to the SCOAP3 consortium are determined on a country-by-country basis, according to the volume of HEP publications originating from each country. They would come from nation-wide re-directions of current subscriptions to HEP journals. SCOAP3 will negotiate with publishers in the field the price of their peer-review services through a tendering process. Journals converted to Open Access will then be decoupled from package licenses. The global yearly budget envelope for this transition is estimated at about 10 Million Euros. This unique experiment of "flipping" from Toll Access to Open Access all journals covering the literature in a given subject is rapidly gaining momentum, and about a third of the required budget envelope has already been pledged by leading libraries, library consortia and High-Energy Physics funding agencies worldwide. This conference paper describes the HEP publication landscape and the bibliometric studies at the basis of the SCOAP3 model. Details of the model are provided and the status of the initiative is presented, debriefing the lessons learned in this attempt to achieve a large-scale conversion of an entire field to Open Access.
Keywords: SCOAP3; Open Access Publishing; High-Energy Physics.
1. Introduction
Recently, the Open Access debate has become mainstream, spreading to all areas and actors of scholarly communication and affecting its entire spectrum, from policy making to financial aspects [1]. Open Access models are actively being proposed by scholars, libraries and publishers alike, and Open Access definitions, of varying shades and colours, are actively debated. This change falls under the umbrella of the groundbreaking technological changes that are inspiring the transformation of science into e-Science in the 21st century. This contribution will not enter into these wide-ranging issues: its objective is to present a specific Open Access model tailored to the needs of a specific community, High-Energy Physics (HEP), as embodied by the SCOAP3 initiative (Sponsoring Consortium for Open Access Publishing in Particle Physics). Although this is a discipline-specific approach to the wider issue of Open Access, it is a particularly interesting one: HEP has a long tradition of innovations in scholarly communication and Open Access, which have then
spread to other fields, and the lessons learned from the momentum gathered by the SCOAP3 initiative can inform the evolution of Open Access publishing in other fields. A few words are in order to give the scale of the endeavors of HEP, and its strong collaborative texture, which inspires its position in the Open Access debate. The scientific goals of HEP are to attain a fundamental description of the laws of physics, to explain the origin of mass and to understand the dark matter in the universe. Any of these insights would dramatically change our view of the world. To reach these scientific goals, experimental particle physicists team up in thousand-strong collaborations to build the largest instruments ever, to reproduce on Earth the energy densities of the universe at its birth. At the same time, theoretical particle physicists collaborate to formulate hypotheses and theories, based on complex calculations, to accommodate and predict experimental findings. These goals are at the edge of current technology and drive developments in many areas, from engineering to electronics, from information technology to accelerator technology. The crowning jewel of HEP research is CERN's Large Hadron Collider (LHC), which will start accelerating particles in 2008, after more than a decade of construction. This 27 km-long accelerator will collide protons 40 million times a second. These collisions will be observed by large detectors, up to the size of a five-storey building, crammed with electronic sensors: think of a 100-megapixel digital camera taking 40 million pictures a second. This contribution is structured as follows: Section 2 traces a short history of Scholarly Communication and Open Access in HEP; Section 3 presents the HEP publication landscape and the way this has inspired the construction of the SCOAP3 model; Section 4 outlines the details of the SCOAP3 model; Section 5 discusses the transition from a model to reality, presenting the status of the initiative and debriefing the lessons learned in recent months, with an outlook for the future evolution of the SCOAP3 initiative.
2. Scholarly Communication and Open Access in HEP
HEP has long pioneered a bridge between scholarly communication and Open Access through its widespread preprint culture [2,3]. For decades, theoretical physicists and scientific collaborations, eager to disseminate their results in a way faster than the distribution of conventional scholarly publications, took to printing and mailing hundreds of copies of their manuscripts at the same time as submitting them to peer-reviewed journals. This ante-litteram form of "author-pays", or rather "institute-pays", Open Access assured the broadest possible dissemination of scientific results, albeit privileging scientists working in affluent institutions. These could afford the mass mailing and were most likely to receive a copy of preprints from other scientists eager to advertise their results. At the same time, for research-intensive institutions, preprint dissemination came at a cost: as an example, in the '90s the DESY (Deutsches Elektronen-Synchrotron) HEP research centre in Hamburg, Germany, used to spend about 1 Million DM a year (500'000€ of today, not corrected for inflation) for the production and mailing of hard copies of these preprints, while CERN used to spend about twice as much [2]. Against this background, three revolutions mark crucial advances in scholarly communication in HEP.
1. 1974, IT meets HEP libraries. The SPIRES database, the first grey-literature electronic catalogue, saw the light of day at the SLAC (Stanford Linear Accelerator Center) HEP laboratory in Stanford, California, in 1974. It listed preprints, reports, journal articles, theses, conference talks and books, and it now contains metadata for about 760'000 HEP articles, including links to full text. It offers additional tools like citation analysis and is interlinked with other databases containing information on conferences, experiments, authors and institutions [4]. A recent poll of HEP scholars has shown that SPIRES, in symbiosis with arXiv, is an indispensable tool in their daily research workflow [5].
2. 1991, the first repository. arXiv, the archetypal repository, was conceived in 1991 by Paul Ginsparg, then at LANL (Los Alamos National Laboratory) in New Mexico [6]. It
evolved the four-decade-old preprint culture into an electronic system, offering all scholars a level playing-field from which to access and disseminate information. Today arXiv has grown outside the field of HEP, becoming the reference repository for many disciplines: from mathematics to some areas of biology. It contains about 450'000 full-text preprints, receiving about 5'000 submissions each month, about 15% of which concern HEP.
3. 1991, the web is woven. The invention of the web by Tim Berners-Lee at CERN in 1991 is a household story [7], and April 30th, 2008 saw the 15th anniversary of the day CERN released the corresponding software into the public domain [8]. What is less known is that the first web server outside Europe was installed at SLAC in December 1991 to provide access to the SPIRES database, as an example of the "killer app" for the web [9]. HEP scholars imagined the web, from its inception, as a tool for scholarly communication. The interlinking of arXiv and SPIRES in summer 1992 eventually offered the first web-based Open Access application.
Thanks to its decades-old preprint culture, HEP is today an almost entirely "green" Open Access discipline, that is, a discipline where authors self-archive their research results in repositories which guarantee their unlimited circulation. Posting an article on arXiv, even before submitting it to a journal, is common practice. Even revised versions incorporating the changes due to the peer-review process are routinely uploaded. Publishers of HEP journals all allow such practices and, in some cases, even host arXiv mirrors! It is interesting to remark that this success of "green" Open Access in HEP originated without mandates and without debates: very few HEP scientists would not take advantage of the formidable opportunities offered by the discipline repository of the field, and the linked discovery and citation-analysis tools offered by SPIRES. The speed of adoption of arXiv at large in the field is presented in Figure 1, which plots the evolution with time of the submissions to arXiv in the four categories in which HEP results are conventionally divided. The number of preprints that are subsequently published in peer-reviewed journals is also indicated. The difference between the number of submissions and the number of published articles is mostly due to conference proceedings and other grey-literature material that is routinely submitted to arXiv, but which does not usually generate peer-reviewed publications.
Figure 1. HEP preprints submitted to arXiv in four different categories (hep-ex, hep-lat, hep-ph and hep-th) as well as total numbers (hep-*). Preprints subsequently published in peer-reviewed journals are indicated with a “P”. After a phase of adoption of the arXiv system, corresponding to the rise of all curves, present outputs are constant. Data from the SPIRES database.
As a consequence of the widespread role of arXiv in scholarly communication, it can be argued that HEP journals have to a large extent lost their century-old role as vehicles of scholarly communication. However, at the same time, they continue to play a crucial part in the HEP community. Evaluation of research institutes and (young) researchers is largely based on publications in prestigious peer-reviewed journals. The main role of journals in HEP is mostly perceived as that of "keeper-of-the-records", guaranteeing a high-quality peer-review process. In short, it can be argued that the HEP community needs high-quality journals as its "interface with officialdom". The synergy between HEP and Open Access extends beyond preprints, into peer-reviewed literature. In 1997, HEP launched one of the first peer-reviewed Open Access journals: the Journal of High Energy Physics (JHEP), published by the International School of Advanced Studies (SISSA) in Trieste, Italy. It then became a low-cost subscription journal, and it is now offering a successful institutional membership scheme where, for a small additional fee, all articles originating from a contributing institution are Open Access. It was followed in 1998 by Physical Review Special Topics - Accelerators and Beams, published by the American Physical Society (APS), which operates under a sponsorship scheme, with 14 research institutions footing the bill for the operation of this niche journal. Another example is the New Journal of Physics, published by the Institute of Physics Publishing (IOPP), which carries HEP content in a broader spectrum covering many branches of physics. This journal also started in 1998 and is financed by author fees, under the so-called "author-pays" model. In 2007, PhysMathCentral, a spin-off of BioMedCentral, started a new "author-pays" HEP journal, Physics A. Most HEP publishers, Springer first and APS and Elsevier later, now offer authors the possibility to pay an additional fee on top of subscription to make their single articles Open Access, under the so-called "hybrid model". The "author-pays" and "hybrid" schemes, however, are not very popular: the total number of HEP articles that appear as Open Access under these two schemes is below 1% of the yearly HEP literature. In comparison, the volume of Open Access articles financed by the institutional membership fee in JHEP is about 20% of this journal, corresponding to about 4% of the total volume of HEP articles. After preprints, arXiv and the web, a transition to Open Access journals appears to be the next logical step in the natural evolution of HEP scholarly communication, and the following sections of this contribution will describe the publishing landscape in HEP and how such a transition can be achieved, beyond the present experiments.
3. Bibliometric Facts
The aim of the SCOAP3 initiative is to convert the entire HEP literature to Open Access. In-depth studies have been performed to assess the HEP publication landscape and have informed the design of this model. The most relevant findings of these studies are summarised in the following, in particular the volume of HEP publishing, the journals favoured by HEP authors and the geographical distribution of HEP authorship [10,11,12]. Five numbers set the scale of HEP scientific publishing:
• 20'000: a lower limit to the number of active HEP scholars;
• 6'000: an upper limit to the number of HEP articles submitted to arXiv yearly and subsequently published in peer-reviewed journals; Figure 1 shows that this yearly HEP output is constant;
• 80%: the fraction of HEP articles produced by theoretical physicists;
• 20%: the fraction of these articles authored by large collaborations of experimental physicists;
• 50:50: the ratio of active experimental and theoretical HEP scholars.
Figure 2 presents the journals favoured by HEP authors in 2006. The large majority of HEP articles are published in just six peer-reviewed journals from four publishers. Five of those six journals carry a majority of HEP content. These are Physical Review D (published by the APS), Physics Letters B and Nuclear Physics B (Elsevier), JHEP (SISSA/IOPP) and the European Physical Journal C (Springer). The sixth journal, Physical Review Letters (APS), is a "broadband" journal that carries only about 10% of HEP content. These journals have long been the favourites of HEP scholars, albeit with varying fortunes. Figure 3 presents the percentage of HEP articles published in each of these six journals in the last 17 years. Only the articles published in these journals are considered in this graph, which makes it possible to assess the relative popularity of these titles over time. Periods of stability are followed by fast rises of some titles and corresponding declines of others.
Figure 2. Journals favoured by HEP scientists in 2006. Journals that attracted less than 75 HEP articles are grouped in the slice named “Others”. Data from the SPIRES database.
Figure 3. Journals favoured by HEP scientists in the last 18 years. For each year, only articles published in these six journals are considered, and the relative fractions are displayed. Articles published in Zeitschrift für Physik C and the European Physical Journal C are aggregated, as the latter is a successor of the former. Data from the SPIRES database.
It is interesting to remark that in a discipline such as HEP, with traditionally strong cross-border collaborative links, journals published in the United States or in Europe attract contributions from all geographical regions, as presented in Figure 4. Any Open Access initiative, therefore, can only succeed if it is truly global in scope.
Figure 4. Geographical origin of publications in HEP journals based in the United States and in Europe. Co-authorship is taken into account on a pro-rata basis, assigning fractions of each article to the countries in which the authors are affiliated. This study is based on all articles published in the years 2005 and 2006 in five HEP "core" journals: Physical Review D (US), Physics Letters B (EU), Nuclear Physics B (EU), Journal of High Energy Physics (EU) and the European Physical Journal C (EU), and the HEP articles published in two "broadband" journals: Physical Review Letters (US) and Nuclear Instruments and Methods in Physics Research A (EU) [12]. The European contribution is well represented by CERN and its Member States, which are: Austria, Belgium, Bulgaria, the Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Italy, the Netherlands, Norway, Poland, Portugal, the Slovak Republic, Spain, Sweden, Switzerland and the United Kingdom.

Country / Share of HEP scientific publishing:
United States 24.3%
Germany 9.1%
Japan 7.1%
Italy 6.9%
United Kingdom 6.6%
China 5.6%
France 3.8%
Russia 3.4%
Spain 3.1%
Canada 2.8%
Brazil 2.7%
India 2.7%
CERN 2.1%
Korea 1.8%
Switzerland 1.3%
Poland 1.3%
Israel 1.0%
Iran 0.9%
Netherlands 0.9%
Portugal 0.9%
Taiwan 0.8%
Mexico 0.8%
Sweden 0.8%
Belgium 0.7%
Greece 0.7%
Denmark 0.6%
Australia 0.6%
Argentina 0.6%
Turkey 0.6%
Chile 0.6%
Austria 0.5%
Finland 0.5%
Hungary 0.4%
Norway 0.3%
Czech Republic 0.3%
Remaining countries 3.1%
Table 1: Contributions by country to the HEP scientific literature. Co-authorship is taken into account on a pro-rata basis, assigning fractions of each article to the countries in which the authors are affiliated. The last cell aggregates contributions from countries with a share below 0.3%. This study is based on all articles published in the years 2005 and 2006 in five HEP "core" journals: Physical Review D, Physics Letters B, Nuclear Physics B, Journal of High Energy Physics and the European Physical Journal C, and the HEP articles published in two "broadband" journals: Physical Review Letters and Nuclear Instruments and Methods in Physics Research A. A total sample of about 11'300 articles is considered [11,12].
Table 1 and Figure 5 present the contribution by country to the HEP scientific literature. Co-authorship is taken into account on a pro-rata basis, assigning fractions of each article to the countries in which the authors are affiliated. This study is based on all articles published in the years 2005 and 2006 in five HEP "core" journals: Physical Review D, Physics Letters B, Nuclear Physics B, JHEP and the European Physical Journal C, and the HEP articles published in two "broadband" journals: Physical Review Letters and Nuclear Instruments and Methods in Physics Research A. A total sample of almost 11'300 articles is considered [11,12].
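To illustrate the pro-rata counting described above, the sketch below splits each article in equal fractions over the countries of its authors' affiliations. The sample input is invented, and the actual study [12] may weight multi-country authorship differently; this is only meant to make the counting rule concrete.

```python
from collections import defaultdict
from typing import Dict, List

def country_shares(articles: List[List[str]]) -> Dict[str, float]:
    """Each article is given as the list of its authors' affiliation countries.
    Every article contributes a total weight of 1, split equally over its author
    list, so co-authored papers are counted pro rata rather than once per country."""
    totals: Dict[str, float] = defaultdict(float)
    for countries in articles:
        weight = 1.0 / len(countries)
        for c in countries:
            totals[c] += weight
    n = len(articles)
    return {c: 100.0 * w / n for c, w in totals.items()}  # percentages of the sample

# Invented toy sample: three articles with mixed authorship
sample = [["US", "DE"], ["IT"], ["US", "CERN", "JP", "JP"]]
print(country_shares(sample))  # e.g. US ~25%, DE ~16.7%, IT ~33.3%, ...
```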
Figure 5. Contributions by country to the HEP scientific literature published in the largest journals in the field. Co-authorship is taken into account on a pro-rata basis, assigning fractions of each article to the countries in which the authors are affiliated. Countries with individual contributions less than 0.8% are aggregated in the "Other countries" category. This study is based on all articles published in the years 2005 and 2006 in five HEP "core" journals: Physical Review D, Physics Letters B, Nuclear Physics B, Journal of High Energy Physics and the European Physical Journal C, and the HEP articles published in two "broadband" journals: Physical Review Letters and Nuclear Instruments and Methods in Physics Research A. A total sample of almost 11'300 articles is considered [11,12].
4. The SCOAP3 model
The call for Open Access journals in HEP does not only originate from librarians frustrated by spiralling subscription costs and shrinking budgets; it is a solid pillar of the scientific community. At the beginning of 2007, the four experimental collaborations working at the CERN LHC accelerator, ATLAS, CMS, ALICE and LHCb, counting a total of over 5'000 scientists from 54 countries, declared: "We, […] strongly encourage the usage of electronic publishing methods for [our] publications and support the principles of Open Access Publishing, which includes granting free access of our publications to all. Furthermore, we encourage all [our] members to publish papers in easily accessible journals, following the principles of the Open Access paradigm" [11]. SCOAP3, the Sponsoring Consortium for Open Access Publishing in Particle Physics, aims to convert the HEP peer-reviewed literature to Open Access in a way that is transparent to authors [11,13], meeting the expectations of the HEP community for peer review of the highest standard, administered by the journals which have served the field for decades, while leaving room for new players. The SCOAP3 business model originates from a two-year debate involving the scientific community, libraries and publishers [11,14]. The essence of this model is the formation of a consortium to sponsor HEP publications and make them Open Access by redirecting funds that are currently used for subscriptions to HEP journals. Today, libraries (or the funding bodies behind them) purchase journal subscriptions to implicitly support the peer-review and other editorial services and to allow their users to read articles, even though in HEP the scientists mostly access their information by reading preprints on arXiv. The SCOAP3 vision for tomorrow is that funding bodies and libraries worldwide would federate in a consortium that will pay centrally for the
peer-review and other editorial services, through a re-direction of funds currently used for journal subscriptions, and, as a consequence, articles will be free to read for everyone. This evolution of the current "author-pays" Open Access models will make the transition to Open Access transparent for authors, by removing any financial barriers. The SCOAP3 model offers another advantage for libraries and funding bodies over the present "author-pays" model. Disciplines with successful "author-pays" journals often see publication costs met either by libraries or by funding bodies. At the same time the costs of the subscriptions to "traditional" journals do not decrease following the reduced volume of articles that these publish, due to the drain towards "author-pays" Open Access journals. Conversely, in the SCOAP3 model all the literature of the field could be converted to Open Access, keeping the total expenditure under control. In practice, the Open Access transition will be facilitated by the fact that the large majority of HEP articles are published in just six peer-reviewed journals from four publishers, as presented in Figure 2. Five of those six journals carry a majority of HEP content, and the aim of the SCOAP3 model is to assist publishers to convert these "core" HEP journals entirely to Open Access; it is expected that the vast majority of the SCOAP3 budget will be spent to achieve this target. Another journal, Physical Review Letters, is a "broadband" journal that carries only 10% of HEP content: it is the aim of SCOAP3 to sponsor the conversion to Open Access of this journal fraction. The same approach can be extended to other "broadband" journals. Of course, the SCOAP3 model is open to any other, present or future, "core" or "broadband", high-quality journals carrying HEP content, beyond those spotlighted here. This will ensure a dynamic market with healthy competition and a broader choice. The price of an electronic journal is mainly driven by the costs of running the peer-review system and editorial processing. Most publishers quote a price in the range of 1'000–2'000€ per published article. On this basis, given that the total number of HEP publications in high-quality journals is between 5'000 and 10'000, according to how one defines HEP and its overlap with cognate disciplines, the annual SCOAP3 budget for the transition of HEP publishing to Open Access would amount to a maximum of 10 Million Euros per year [11]. The costs of SCOAP3 will be distributed among all countries according to a fair-share model based on the distribution of HEP articles per country, as shown in Table 1 and Figure 5. In practice, this is an evolution of the "author-pays" concept: countries will be asked to contribute to SCOAP3, whose ultimate targets are Open Access and peer review, according to their use of the latter, measured from their scientific productivity. To cover publications from scientists from countries that cannot be reasonably expected to make a contribution to the consortium at this time, an allowance of not more than 10% of the SCOAP3 budget is foreseen. SCOAP3 will sponsor articles through a tendering procedure with publishers of high-quality journals. It is expected that the consortium will invite publishers to bid for their peer-review and other editorial services, on a per-article basis. The consortium will then evaluate these offers as a function of indicators such as journal quality and price, and attribute contracts within its capped budget envelope.
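As a back-of-the-envelope illustration of the fair-share idea (not the actual SCOAP3 assessment, which is settled through tendering and negotiation), a country's notional yearly contribution is simply its publication share applied to the capped budget envelope, with up to 10% of the envelope reserved as the allowance mentioned above. The shares used below are taken from Table 1; everything else is illustrative.

```python
BUDGET_EUR = 10_000_000      # yearly budget envelope quoted in the text
ALLOWANCE_CAP = 0.10         # at most 10% reserved for non-contributing countries

def yearly_contribution(publication_share: float, budget: float = BUDGET_EUR) -> float:
    """Fair-share sketch: a country's fraction of world HEP publishing
    applied to the capped budget envelope."""
    return publication_share * budget

# Illustrative use with two shares taken from Table 1
print(f"United States: {yearly_contribution(0.243):,.0f} EUR/year")
print(f"Germany:       {yearly_contribution(0.091):,.0f} EUR/year")
print(f"Allowance reserved: up to {ALLOWANCE_CAP * BUDGET_EUR:,.0f} EUR/year")
```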
SCOAP3 therefore has the potential to contain the overall cost of journal publishing by linking price, volume and quality and injecting competition into the market. In the SCOAP3 model, libraries will not be paying twice for the journals to be converted to Open Access where these are part of journal licence packages. Indeed, in the case of a "core" HEP journal (where an entire journal is converted to OA) that is part of a large journal licence package, the publisher will be required to un-bundle this package and to correspondingly reduce the subscription cost for the remaining part of the package. For "broadband" journals (where only the conversion of selected HEP articles is paid by SCOAP³), the subscription costs will be required to be lowered according to the fraction supported by SCOAP³. For journals of this kind that are part of a licence package, the reduction should be reflected in a corresponding reduction of the package subscription cost. In the case of existing long-term subscription
contracts between publishers, libraries, and funding agencies, publishers will be required to reimburse the subscription costs pertaining to OA journals or to the journal fractions converted to OA. It appears at first glance to be a formidable enterprise to organize a worldwide consortium of research institutes, libraries and funding bodies that cooperates with publishers in converting the most important HEP journals to Open Access. At the same time, HEP is used to international collaborations on a much bigger scale. As an example, the ATLAS experiment, one of the four detectors at the LHC, has been built over more than a decade by about 50 funding agencies on a total budget of 400 Million Euros (excluding person-power), placing about 1000 industrial contracts. In comparison, the SCOAP3 initiative has about the same number of partners, but a yearly budget of only 10 Million Euros, and will handle less than a dozen contracts with publishers. SCOAP3 will be operated along the blueprint of large HEP collaborations, profiting from the collaborative experience of HEP.
5. Conclusions and Outlook
SCOAP3 is now collecting Expressions of Interest from partners worldwide to join the consortium. Once it has reached a critical mass, and thus demonstrated its legitimacy and credibility, it will formally establish the consortium and its governance, issue a call for tender to publishers aimed at assessing the exact cost of the operation, and then move quickly forward with negotiating and placing contracts with publishers. SCOAP3 is rapidly gaining momentum. In Europe, most countries have pledged their contribution to the project. In the United States, leading libraries and library consortia have pledged a redirection of their current expenditures for HEP journal subscriptions to SCOAP3, and a call for action has originated from many associations, among which ARL, the Association of Research Libraries [15]. In total, SCOAP3 has already received pledges for about a third of its budget envelope, with another considerable fraction having the potential to be pledged in the short-term future, as presented in Figure 6 [13]. This consensus basis is not restricted to Europe and North America: Australia is part of the consortium and advanced negotiations are in progress in Asia and in Latin America.
Figure 6. Status of the SCOAP3 fund-raising at the time of writing. A third of the funds have already been pledged, 15% are expected to be pledged in the coming weeks, while discussions and negotiations are in progress for another 44% [13].
In conclusion, SCOAP3 is a unique experiment of "flipping" from Toll Access to Open Access all the journals covering the literature of a given discipline. Its success so far, and its eventual fate, will be important in informing other initiatives in Open Access publishing, for several reasons:
• The contained publication landscape of HEP, with less than 10'000 articles appearing in half a dozen journals from few publishers, simplifies a possible transition of the entire literature of the field to Open Access.
• HEP is a scientific discipline which has long embraced, and actually pioneered, "green" Open Access, with a long tradition of unrestricted circulation of preprints, via mass mailing first and arXiv later. SCOAP3 can be interpreted as an experiment, in a controlled environment, of possible future evolutions in Open Access publishing, or "gold" Open Access, in the light of the present acceleration of "green" Open Access, or self-archiving of research results, in many other fields of science, both on an institutional and on a disciplinary basis.
• Some of the obstacles met by "gold" Open Access publishing so far are related to justified authors' concerns about financial barriers to the payment of Open Access fees and their reluctance to submit articles to new, Open Access, journals. The SCOAP3 initiative benefits from a strong consensus on the researchers' side as it addresses both points: it does not imply any direct financial contribution from authors, and it aims to convert to Open Access the high-quality peer-reviewed journals which have served the community for decades.
• By construction, the SCOAP3 model implies a large worldwide consensus first, and financial commitment later. As Open Access is a global issue, the success of this initiative, in a well-organised discipline with strong cross-border links like HEP, can demonstrate the potential of international cooperation in addressing the global problems of scholarly communication.
6. Notes and References
[1] One of the most extensive sources of information on the Open Access movement is http://www.earlham.edu/~peters/fos/overview.htm [Last visited May 25th, 2008].
[2] R. Heuer, A. Holtkamp, S. Mele, 2008, Innovation in Scholarly Communication: Vision and Projects from High-Energy Physics, arXiv:0805.2739.
[3] L. Goldschmidt-Clermont, 1965, Communication Patterns in High-Energy Physics, http://eprints.rclis.org/archive/00000445/02/communication_patterns.pdf.
[4] L. Addis, 2002, Brief and Biased History of Preprint and Database Activities at the SLAC Library, http://www.slac.stanford.edu/spires/papers/history.html [Last visited May 25th, 2008]; P. A. Kreitz and T. C. Brooks, Sci. Tech. Libraries 24 (2003) 153, arXiv:physics/0309027.
[5] A. Gentil-Beccot et al., 2008, Information Resources in High-Energy Physics: Surveying the Present Landscape and Charting the Future Course, arXiv:0804.2701.
[6] P. Ginsparg, Computers in Physics 8 (1994) 390.
[7] T. Berners-Lee, Weaving the Web, HarperCollins, San Francisco, 1999.
[8] J. Gillies, 2008, The World Wide Web turns 15 (again), http://news.bbc.co.uk/2/hi/technology/7375703.stm [Last visited May 25th, 2008].
[9] P. Kunz et al., 2006, The Early World Wide Web at SLAC, http://www.slac.stanford.edu/history/earlyweb/history.shtml [Last visited May 25th, 2008].
[10] S. Mele et al., Journal of High Energy Physics 12 (2006) S01, arXiv:cs.DL/0611130.
[11] S. Bianco et al., 2007, Report of the SCOAP3 Working Party, http://www.scoap3.org/files/Scoap3WPReport.pdf.
[12] J. Krause et al., 2007, Quantitative Study of the Geographical Distribution of the Authorship of High-Energy Physics Journals, http://scoap3.org/files/cer-002691702.pdf [Last visited May 25th, 2008].
[13] http://scoap3.org [Last visited May 25th, 2008].
[14] R. Voss et al., 2006, Report of the Task Force on Open Access Publishing in Particle Physics, http://www.scoap3.org/files/cer-002632247.pdf.
[15] I. Anderson, 2008, The Audacity of SCOAP3, ARL Bimonthly Report, no. 257; J. Blixrud, 2008, Taking Action on SCOAP3, ibid.
Modeling Scientific Research Articles – Shifting Perspectives and Persistent Issues
Anita de Waard (1,2); Joost Kircz (3,4)
1 Elsevier Labs, Radarweg 29, 1043 NX, Amsterdam, The Netherlands; e-mail: a.dewaard@elsevier.com
2 Department of Information and Computing Sciences, Universiteit Utrecht, The Netherlands
3 Institute for Media and Information Management (MIM), Hogeschool van Amsterdam, The Netherlands; e-mail: j.g.kircz@hva.nl
4 Kircz Research Amsterdam, http://www.kra.nl
Abstract
We review over 10 years of research at Elsevier and various Dutch academic institutions on establishing a new format for the scientific research article. Our work rests on two main theoretical principles: the concept of modular documents, consisting of content elements that can exist and be published independently and are linked by meaningful relations, and the use of semantic data standards allowing access to heterogeneous data. We discuss the application of these concepts in five different projects: a modular format for physics articles, an XML encyclopedia in pharmacology, a semantic data integration project, a modular format for computer science proceedings papers, and our current work on research articles in cell biology.
Keywords: Scientific publishing models; new scholarly constructs and discourse methods; metadata creation and usage; pragmatic and semantic web technologies.
1. Introduction
The objective of our work is, on the one hand, to analyze and investigate what role the research article plays in the connected world that scientists live in today, and on the other hand to propose and experiment with new forms of publication, which contain the knowledge traditionally transferred by 'papers', but are better suited to an online environment. Our research is driven both by an analytical approach stemming from the humanities, including argumentation theory, discourse modeling, and sociology of science, and a knowledge engineering approach from the computer science end, using semantic web technologies, argumentation visualization, and authoring and annotation tools. We present five examples of our work, in roughly chronological order [1]. We have been driven by two main theoretical concepts: firstly, the concept of modularity: the idea that a scientific text can consist of a set of self-contained and reusable content elements that are strung together to form one or more variants of an evolving series of documents. To explore this concept, Kircz and Harmsze developed a modular format for the research article in physics [2]; this work was extended to create a modular format for a Major Reference Work in Pharmacology, which can be used as a database or a linear text [3]. The other main theoretical driver for our research is the use of semantic technologies to access scientific content. In the DOPE project [4], we developed an RDF (Resource Description Framework, [5])-based architecture to access a diverse content set through a thesaurus. This project included the RDF formatting
of Elsevier's EMTREE thesaurus [6] and the development of an explorative user interface [7] to access a heterogeneous dataset. Lastly, we discuss two projects where we combine the concept of modularity with that of semantic tools and standards. We first identified a simple modular structure for articles in computer science that can be created using LaTeX and converted to semantic formats, entitled the ABCDE Format [8]. Our current work delves more deeply into the text of research articles. We are currently investigating a discourse modeling approach to develop a theoretical framework for a pragmatic model of research articles, linked through a network of argumentational relations. We probe the pragmatic roles which various discourse elements play, and model the way in which textual coherence and the argumentative roles of textual elements are expressed, through an analysis of the linguistic forms used in various parts of a biology text.
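To give a flavour of this kind of thesaurus-driven, RDF-based access, the sketch below links a document to a concept and retrieves all documents indexed with it. It uses the rdflib library and the SKOS vocabulary purely as stand-ins; the URIs, term labels and properties are invented and do not reflect the actual EMTREE or DOPE schemas.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/")     # invented namespace for this sketch
g = Graph()

# A thesaurus concept with a broader term (stand-in for a thesaurus entry)
pain = EX["concept/neuropathic-pain"]
g.add((pain, RDF.type, SKOS.Concept))
g.add((pain, SKOS.prefLabel, Literal("Neuropathic pain", lang="en")))
g.add((pain, SKOS.broader, EX["concept/pain"]))

# A document annotated with that concept
doc = EX["doc/0001"]
g.add((doc, EX.about, pain))
g.add((doc, EX.title, Literal("An invented article title")))

# Thesaurus-driven retrieval: all documents indexed with the concept
for d in g.subjects(EX.about, pain):
    print(g.value(d, EX.title))
```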
2. Modular Documents in Physics
Kircz and Roosendaal [9] summarized the communications needs in the scientific community as follows:
• Awareness of knowledge about the body of knowledge of one's own or related research domains;
• Awareness of new research outcomes, needed for one's current research program;
• Specific information on relevant theories, detailed information on design, methodologies etc.;
• Scientific standards on research approaches and reporting, that develop in the process of a certain research program and shape the social structure of a field;
• Platform for communication as a tool that enables formal and informal exchange of ideas, opinions, results and (dis)agreements between peers;
• Ownership protection on the intellectual results and possible commercial applications.
All of these roles demand different ways of identifying pertinent information units that together compose the paper as we know it today. In the early period of the transition of the scientific paper to electronic media, proposals for new formats remained traditional, without taking into account the extent to which electronic media change the whole spectrum of dissemination and reading. In a critique of this, we explored the functions of the article and discussed changes in form due to the fact that in an electronic medium, text and non-textual material obtain a different relationship than in the paper world [10, 11]. In our proposal the essay format, typical for a paper product that is meant to be read as an individual information object, is replaced by a mode of communication that is an intrinsic fit for reading electronically. Specifically, this will allow the reader to read only those parts that really serve an information need at a particular place and time. In other words, with proper tools we will see a change from a 'read and locate' situation, where first a document is identified and then read to find the needed information, to a 'locate and read' situation, where we start with a relevant passage of text and from that starting point decide to read other parts or to skip on. Such organized browsing, by immediately skipping to determined parts of the text, demands changes in the way research reports are structured and represented on paper and electronic media. This aspect suggests an intrinsically modular structure for electronic publications, first explored in [2]. The PhD research project on modularity of scientific information by Harmsze [12] focused on the dissection of the research paper into different types of information that are conveyed by the structure of the research paper.
This approach leads to a modular model of scientific information in physics, which contains two elements:
• Modules: information elements such as positioning (introduction), methods, results, interpretation, outcome, and their subdivisions, and
• Relations between these elements, both to non-textual elements within the paper and external relations to (parts of) other works.
In Figure 1, we show the modular system developed by Harmsze [12] to model a set of papers in physics, where each module contained a unique type of content, focusing on e.g. the experimental setup or the central problem of a piece of research. Core to the use of modular elements is the concept of reusability: when a paper is updated, it might not need a new Positioning module, but merely provide e.g. new Methods and Results sections.
Figure 1: The module meta-information and the modules follow the conceptual function of the information, and the sequential paths leading through the article. The dashed line indicates the complete sequential path, and the dotted line the essay-type sequential path [12].
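Purely as an illustration of this reuse idea (and not part of Harmsze's formal model), a paper can be thought of as a mapping from module types to versioned module texts; in the Python sketch below an update replaces only the modules that actually changed, and all names and data are invented:

    # Illustrative sketch only: a paper as a set of typed, versioned modules.
    # The module names follow Harmsze's types; the data structure is invented.
    article_v1 = {
        "Positioning":    {"version": 1, "text": "Why the problem matters ..."},
        "Methods":        {"version": 1, "text": "Experimental setup ..."},
        "Results":        {"version": 1, "text": "Raw and treated results ..."},
        "Interpretation": {"version": 1, "text": "Discussion of the findings ..."},
    }

    def update_modules(article, new_modules):
        """Return a new article in which only the supplied modules are replaced."""
        updated = dict(article)  # shallow copy; untouched modules are reused as-is
        for name, text in new_modules.items():
            updated[name] = {"version": article[name]["version"] + 1, "text": text}
        return updated

    # A follow-up study reuses the Positioning module and only supplies
    # new Methods and Results modules.
    article_v2 = update_modules(article_v1, {
        "Methods": "Improved setup with a second detector ...",
        "Results": "New measurements ...",
    })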
The other pillar of the modular model is the concept of 'meaningful links' (as the jargon was at the time). Although the ubiquity of the typical <A HREF="link.htm"> html link </A> has meant the triumph of the simple hyperlink – one-to-one, mono-directional, not containing information about the link type – a body of research in hypertext has been going on for decades that identifies many different types of links and the roles they can play in connecting pieces of data. In thinking about relations in this way, it becomes clear that a relation is not simply a pointer to another piece of data. The fact that the relation exists, and the relationship it expresses between the linking and linked data, provides information in itself, which can be made explicit (e.g. visible and/or searchable) for the reader. We therefore explicitly considered the relation and the information presented within it, or the relation type, as separate entities.
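A minimal sketch of this idea, using RDF triples (a technology our later DOPE project adopted) so that the relation type becomes an explicit, queryable entity rather than an untyped hyperlink; the namespace and relation names below are invented for illustration and are not Harmsze's actual vocabulary:

    from rdflib import Graph, Namespace

    # Hypothetical namespace; not an actual published vocabulary.
    EX = Namespace("http://example.org/modular-article/")

    g = Graph()
    results = EX["paperA/TreatedResults"]
    theory = EX["paperB/TheoreticalMethods"]

    # The link itself is typed: a 'comparison' relation between two modules,
    # rather than an untyped <a href="..."> pointer.
    g.add((results, EX.comparisonRelation, theory))

    # Because the relation type is explicit, it can be queried, filtered or
    # rendered for the reader.
    for source, _, target in g.triples((None, EX.comparisonRelation, None)):
        print(source, "--comparison-->", target)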
Figure 2: Different types of organisational relations distinguished in the modular model by Harmsze [12]
For physics articles, Harmsze identified the following detailed taxonomy of relations between modules:
• Organisational relations, which are based on the structure of the information. These dovetail with the structural XML information we discussed above – see Figure 2 for a detailed subdivision, and
• Discourse relations, which define the reasoning of the argument. In Harmsze's model an elaborate skeleton has been worked out. Based on the systematic pragma-dialectical categorization of Garssen [13], these can further be subdivided into:
• Causal relations: relations where there is a causal connection between premise and conclusion (or between explanans and explanandum). This kind of relation exists between a statement or a formula and an elaborate mathematical derivation. Obviously, the uses of the causal relation as an argument and as an explanation lie close together.
• Comparison relations: relations where the relation is one of resemblance, contradiction or similarity. The analogy is a typical subtype. Comparisons used as argument are well-known phenomena, such as the comparison of measured data from, e.g., the module Treated Results with theoretical predictions that fit within certain acceptable boundaries. We can also think of similarity relations, where results of others on similar systems are compared to emphasize agreement or disagreement. In the case of an elucidation, we can think of the relation between the description of a phenomenon and a known mechanical toy model. A link between a text and an image that illustrates the reasoning or results belongs to this category. Another example is the suggestion that a drug that is effective in curing a particular ailment might also help against similar symptoms.
• Symptomatic relations, which are of a more complicated nature. Here we deal with relations where a concomitance exists between the two poles. This category is more heterogeneous than the other two. This kind of relation can be based on a definition or a value judgment, such as the role of a specific feature that serves as a sufficiently discriminatory value to warrant a conclusion. We can think of a relation between the textually described results and a picture in which a specific feature, like a discontinuity in a graph, is used to declare a particular physical effect present or not.
3. A Modular Major Reference Work
The main drawback of the model developed for physics articles was that it is very demanding on the author to adhere to the proposed structure, and that it presupposes strong editorial assistance in the form of advanced XML-based text-processing software. This could be enforced within the context of a reference work, where a) the content elements are commissioned, and therefore a writing template can be prescribed, and b) the main rhetorical purpose of the text is to inform the reader of existing knowledge, rather than to convince him or her of the validity of a specific claim (for more on this, see below). Therefore, we adapted Harmsze's model to use it for XPharm, a state-of-the-art, online, comprehensive pharmacology reference work. XPharm contains information on agents (drugs), targets, disorders and principles of pharmacology [3]. The 4,000 XPharm entries are authored by a group of 600 contributors who write in a very modular format. The idea for four databases was driven by the fact that Agents, including drugs, which are the core of pharmacology, act at molecular Targets to treat Disorders. The Principles database is included as a repository of information fundamental to the discipline but generally independent of the chemical entity, site of action, or clinical use.
Figure 3: Outline of an XPharm Target Record, showing the modular structure; all topic headings are the same for each Target.
Each XPharm record can be rendered in a customizable way, and the interface allows for the rendering of modular content elements within different user-defined contexts [3]. XPharm uses the concept of modularity by prescribing a rigid format for each type of entry: for example, all target entries follow the format shown in Figure 3. The XML of each record is highly granular; for example, physical constants are individually marked up so they can, in principle, be extracted and compared to create tables of data, thus enabling the XML to function either as a text (in the html instantiation) or as a database. Relations between records point to these modular headings, so that the text can be interlinked in a very granular way. This system also enables detailed updates of only specific parts of the text, e.g. if a new antibody is found for a specific target, only that module can be updated. In conclusion, the system of modular authoring can work quite well for texts in which structures can be mandated and which are more like a 'dressed-up database' than like a persuasive text. More generally, we believe that the difference between informative content sources (such as textbooks and databases) and persuasive texts (such as primary research articles) needs to be taken into greater account when modeling scientific information. In XPharm, a set of content relations was also proposed, which specifically hold between different elements; these are based on the specific biological rules that govern the interactions between content elements. For example, a disorder can be related to a drug (or Agent, in XPharm terms) by either the Treats relation (Aspirin treats Headache) or the Side Effect relation (Stomach Ache is a Side Effect of Aspirin). A system of 13 such relations was proposed but, because of technical issues (most notably the inability of current browsers to render relationship types), it has not yet been implemented.
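To illustrate the kind of granular markup described above, the sketch below pulls individually tagged physical constants out of an XPharm-like record so that they could be assembled into a comparison table. The element and attribute names are invented for illustration; they are not the actual XPharm DTD.

    import xml.etree.ElementTree as ET

    # Invented, simplified record; the real XPharm DTD is richer and different.
    record = """
    <target id="example-receptor">
      <section heading="Molecular Properties">
        <constant name="molecular_weight" value="51.6" unit="kDa"/>
        <constant name="Kd_ligand_X" value="2.3" unit="nM"/>
      </section>
    </target>
    """

    root = ET.fromstring(record)
    # Because each constant is individually marked up, it can be extracted and
    # compared across records, letting the XML act as a database as well as a text.
    rows = [(c.get("name"), c.get("value"), c.get("unit"))
            for c in root.iter("constant")]
    for name, value, unit in rows:
        print(f"{name}: {value} {unit}")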
4. Semantic Access to Heterogeneous Data: The DOPE Project
The technologies used in the work described above predate (our knowledge of) the Semantic Web (XPharm was designed in 1998), and the lack of interoperable standards was partly what prevented us from scaling up or connecting to other projects. Our next project focused on the use of such standards in the context of pharmacology research. DOPE, the Drug Ontology Project for Elsevier, focused on allowing access via a multifaceted thesaurus, EMTREE, to a large set of data: five million abstracts from the Medline database and about 500,000 full-text articles from Elsevier's ScienceDirect [7]. At the time (2003), no open architecture existed to support using thesauri for querying data sources. To provide this functionality, we needed a technical infrastructure to mediate between the information sources, the thesaurus representation, and the document metadata stored on the Collexis fingerprint server [14]. We implemented this mediation in our DOPE prototype using the RDF repository Sesame [15]. The records were first indexed to Elsevier's proprietary thesaurus EMTREE [6]. The version we used, EMTREE 2003, contained about 45,000 preferred terms and 190,000 synonyms organized in a multilevel hierarchy, and comprised the following information types:
• Facets: broad topic areas that divide the thesaurus into independent hierarchies.
• Preferred terms are enriched by a set of synonyms—alternative terms that can be used to refer to the corresponding preferred term. A person can use synonyms to index or query information, but they will be normalized to the preferred term internally.
• Links, a subclass of the preferred terms, serve as subheadings for other index keywords. They denote a context or aspect for the main term to which they are linked. Two kinds of link terms, drug-links and disease-links, can be used as subheadings for a term denoting a drug or a disease.
Each facet consists of a hierarchy of preferred terms used as index keywords to describe a resource's information content. Facet names are not themselves preferred terms, and they cannot be used as index keywords. A term can occur in more than one facet; that is, EMTREE is poly-hierarchical.
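As a toy illustration of the normalization behaviour described above (query or indexing terms are mapped to their preferred term before matching), consider the following sketch; the terms and mapping are invented and, of course, vastly smaller than EMTREE's 45,000 preferred terms:

    # Invented miniature thesaurus fragment; EMTREE itself is proprietary and far larger.
    preferred_term = {
        "acetylsalicylic acid": "aspirin",
        "asa": "aspirin",
        "aspirin": "aspirin",            # preferred terms map to themselves
        "myocardial infarction": "heart infarction",
        "heart attack": "heart infarction",
    }

    def normalize(term: str) -> str:
        """Map a synonym (or preferred term) to its preferred term, as an indexer or query front-end might."""
        return preferred_term.get(term.lower(), term)

    print(normalize("ASA"))            # -> aspirin
    print(normalize("heart attack"))   # -> heart infarction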
The indexing process was done by the Collexis Indexing Engine using a technique called fingerprinting [14], which assigns a list of weighted thesaurus keywords to a document. Next to the document fingerprints, the Collexis server housed bibliographic metadata about the document, such as authors and document location. The DOPE architecture (see Figure 4) then dynamically mapped the Collexis metadata to an RDF model. An RDF database, using the SOAP protocol, communicated with both the fingerprint server and the RDF version of EMTREE. A client application interface, based on Aduna's Spectacle Cluster Map [7], let users interact with the document sets indexed by the thesaurus keywords using SeRQL queries, an RDF query language, sent over HTTP [16]. The system design permits the addition of new data sources, which are mapped to their own RDF data source models and communicate with Sesame. It also allows the addition of new ontologies or thesauri, which can be converted into RDF Schema and communicate with the Sesame RDF server [15].
Figure 4: Basic components of the DOPE architecture (technologies are given in brackets)
We performed a small user study with 10 potential end users, including six academic and four industrial users [17]. These users found the tool useful for the exploration of a large information space, for tasks such as filtering information when preparing lectures on a certain topic and doing literature surveys (for example, using a "shopping basket" to collect search results). A more advanced potential application mentioned was to monitor changes in the research community's focus. This, however, would require extending the current system with mechanisms for filtering documents based on publication date, as well as advanced visualization strategies for changes that happen over time, which were not part of the project scope. Overall, the DOPE system was a useful, working implementation of Semantic Web technologies that allowed for the inclusion of new distributed data sources and ontologies using the RDF data standard. In juxtaposing this project with the experiments in modularity discussed above, we note that a complex representation of the EMTREE thesaurus in RDF was constructed, using historically meaningful relationships between thesaurus elements. The use of semantic standards enables easy scaling of the system with new thesauri, or new relationships. However, of course, within DOPE the documents accessed were not modular, and they could only be related using overlapping or related thesaurus entries. Combining these two concepts, modular documents with meaningful relations and semantic technologies, led to our next series of investigations.
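DOPE itself used SeRQL queries against a Sesame repository; purely as an illustration of the same pattern, the sketch below builds a tiny RDF graph of document 'fingerprints' (weighted thesaurus annotations) and retrieves documents for one preferred term with SPARQL. All URIs, property names and weights are invented.

    from rdflib import Graph, Namespace, Literal
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/dope-sketch/")   # hypothetical vocabulary
    g = Graph()

    # Two documents annotated ("fingerprinted") with weighted thesaurus terms.
    for doc, term, weight in [
        ("doc123", "aspirin", 0.82),
        ("doc123", "headache", 0.41),
        ("doc456", "aspirin", 0.35),
    ]:
        annotation = EX[f"{doc}-{term}"]
        g.add((annotation, RDF.type, EX.Annotation))
        g.add((annotation, EX.document, EX[doc]))
        g.add((annotation, EX.keyword, EX[term]))
        g.add((annotation, EX.weight, Literal(weight)))

    # Retrieve documents indexed with the preferred term 'aspirin', best match first.
    results = g.query("""
        PREFIX ex: <http://example.org/dope-sketch/>
        SELECT ?doc ?w WHERE {
            ?a a ex:Annotation ;
               ex:document ?doc ;
               ex:keyword  ex:aspirin ;
               ex:weight   ?w .
        } ORDER BY DESC(?w)
    """)
    for doc, weight in results:
        print(doc, weight)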
5. Semantic Modular Publishing: The ABCDE Format
Our current research focuses on developing a new format for publications that combines the concepts of modularity with semantic technologies. Our first foray into this area was to develop a simple modular format for structuring conference contributions in computer science, and the authoring, editing and retrieval processes needed to use them. Specifically, this format was meant as a way to allow the use of conference papers by Semantic Browsers such as PiggyBank [18] and semantic collaborative authoring tools such as Semantic Wikis [19]. The ABCDE Format (ABCDEF) for proceedings and workshop contributions is an open-standard, widely (re)usable format that can be easily mined, integrated and consumed by semantic browsers and wikis [8]. The format can be created in several interoperable data types, including LaTeX and XML, or a simple text file. It is characterized by the following elements:
• A - Annotation. Each record contains a set of metadata that follows the Dublin Core standard. Minimal required fields are Title, Creator, Identifier and Date.
• B, C, D - Background, Contribution, Discussion. The main body of the text consists of three sections: Background, describing the positioning of the research, ongoing issues and the central research question; Contribution, describing the work the authors have done: any concrete things created, programmed, or investigated; and Discussion, containing a discussion of the work done, comparison with other work, and implications and next steps. These section headings need to exist somewhere in the metadata of the article - but they can be hidden markup; also, each of the sections can have different, and differently named, subheadings.
• E - Entities. Throughout the text, entities such as references, personal names, project websites, etc. are identified by:
- the text linking to an entity
- the type of link (reference, footnote, website, etc.)
- the linking URI, if present
- the text for the link
In other words, the entity link can be described as an RDF statement [5].
• There is no abstract in an ABCDE document - instead, within the B, C and D sections the author denotes 'core' sentences. Upon retrieval or rendering of the article, these can be extracted to form a structured abstract of the article, from which one can jump directly to the core of the Background, Contribution or Discussion. This allows the author to create and modify statements summarizing the article only once, which prevents misrepresentation in the abstract of the paper, something that, in fact, occurs quite often [20].
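As an illustration of how such core sentences could be pulled out to form a structured abstract, the sketch below works on a hypothetical XML serialization of an ABCDE document; the element and attribute names are invented, not the published ABCDE/LaTeX markup.

    import xml.etree.ElementTree as ET

    # Hypothetical serialization of an ABCDE document; element names are illustrative only.
    doc = """
    <abcde>
      <background>
        <s core="yes">Linear articles are hard to reuse in semantic tools.</s>
        <s>Much prior work exists on hypertext and modularity.</s>
      </background>
      <contribution>
        <s core="yes">We define the ABCDE format and a LaTeX style for it.</s>
      </contribution>
      <discussion>
        <s core="yes">The format makes proceedings consumable by semantic browsers.</s>
      </discussion>
    </abcde>
    """

    root = ET.fromstring(doc)
    # The structured abstract is simply the concatenation of the core sentences,
    # grouped by section, so it never drifts away from the text itself.
    for section in ("background", "contribution", "discussion"):
        core = [s.text for s in root.find(section).findall("s") if s.get("core") == "yes"]
        print(section.capitalize() + ":", " ".join(core))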
ABCDEF allows an extensible set of relations to work on documents with a (simple) modular structure, and enables the use of open semantic standards. This format has been described and a LaTeX stylesheet has been published [8]; as a test, a small set of documents for the Semantic Wiki conference was converted to this format. The ABCDE format is a quite simple intermediary step towards creating a reusable, modular, semantic format for research articles. The relations between the modules are quite simple: the sequentiality is obvious (first B, then C, then D for the sections); 'elaboration' relations exist between core sentences in the abstract and their locations in the text; and the entities are related to their links by a link type which the user is free to name. Although this format allows access to the content by various semantic
tools, it still does not do a very good job of marking up the knowledge or argumentation in the text. An attempt at this is made in the currently ongoing project, discussed in the next section.
6. Semantic Modular Publishing: Rhetoric in Biology
At present, we are developing a more integrated approach, where we look more closely at the way in which rhetoric and persuasion are expressed. The main goal of our research is not a linguistic analysis of a research paper in a field, but the creation of a model that will enable faster browsing through a single paper as well as a collection of related papers. The most important observation from the work done on the modular physics articles was that when you break up the essay-type article into well-defined units that can be retrieved and re-used, the units of information never become fully independent. A research paper is an attempt to convey meaning, and convince peers of the validity of a specific claim, using research data: therefore, to optimally represent a scientific paper, we should model how it aims to convince. To use a chemical metaphor: breaking up a molecule into its constituent atoms immediately confronts you with the various aspects of chemical binding. In the same way, parts of a scientific text are glued together with arguments, which cannot be disconnected without a loss of meaning to the overall structure. As a knowledge transmission tool, the research article offers an amalgam of pragmatic, rhetorical and simply informative functions. Our modularity experiments led us to understand that although certain parts of the paper can be made into database-like elements, other parts are quite complex to modularize, and their format plays a critical role in transferring knowledge and convincing peers of the correctness of a statement. Our current efforts focus on obtaining a better understanding of the sociology and linguistic expression of truth creation in science. We are using a corpus of full-text articles in the field of cell biology, partly because it is a vast field, where presentations are already quite standardized, and partly because the role of research results vs. theoretical descriptions is very clear-cut. In modeling these articles, we are staying close to the traditional 'IMRaD' (Introduction, Methods, Results and Discussion) format, since first of all the field has consistently adopted this format [21]; an additional motivation for this format can be found by looking at models from classical rhetoric and story grammar models [22]. Therefore, to optimize granularity but still enable the rhetorical narrative flow, our current model in biology has three elements [23] (a schematic data-structure sketch follows the lists below):
I: Content Modules:
• Front matter/metadata
• Introduction, containing the following subsections:
  • Positioning
  • Central Problem
  • Hypothesis
  • Summary of Results
• Experiments, containing the following discourse unit types: Fact, Goal, Problem, Method, Result, Implication, Experimental Hypothesis
• Discussion, containing the following subsections:
  • Evaluation
  • Comparison
  • Implications
  • Next Steps
Figure 5: A subset of statements and relations from two biology texts, modeled in Cohere [25]; each 'target' is linked into the appropriate location in the underlying documents
II: Entities (contained within the (sub)sections described above), consisting of:
• Domain-specific entities, such as genes, proteins, anatomical locations, etc.
• Figures and tables
• Bibliographic references
III: Relations, consisting of:
• Relations within the document, from entity representations to entities (figure, table and bibliographic references)
• Relations out of the document, from entities to an external representation (e.g., from a protein name to its unique identifier in a protein databank)
• Relations between moves within a document (e.g. elaboration, from a summary statement in the Introduction section to a Result element within the Experiment section)
• Relations between moves in different documents (e.g., agreement between a Result in one paper and that in another paper)
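The sketch below renders the three-part model as a small data structure, to make the separation between modules, entities and typed relations concrete; the class and relation names mirror the lists above, but the encoding itself (and all identifiers and example values) is invented purely for illustration.

    from dataclasses import dataclass, field

    @dataclass
    class Entity:
        kind: str              # e.g. "protein", "figure", "bibliographic reference"
        label: str
        external_id: str = ""  # e.g. a protein-databank identifier, when one exists

    @dataclass
    class Module:
        kind: str              # e.g. "Positioning", "Result", "Evaluation"
        text: str
        entities: list = field(default_factory=list)

    @dataclass
    class Relation:
        kind: str              # e.g. "elaboration", "agreement", "refers-to"
        source: str            # identifier of a module, entity or external document
        target: str

    # A toy fragment: a summary statement in the Introduction elaborated by a Result
    # in an Experiment section, with one domain entity attached to the Result.
    intro_claim = Module("Summary of Results", "Protein X localizes to the membrane.")
    result = Module("Result", "GFP-tagged X was detected at the plasma membrane.",
                    entities=[Entity("protein", "Protein X", "EXAMPLE-ID-001")])
    links = [Relation("elaboration", "intro-claim-1", "experiment-2-result-1")]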
The modular division for the Introduction and the Discussion is based on Harmsze's model and our own empirical investigations (it was easy to fit a collection of 13 biology articles within this framework, and we
hope it will cover the needs of the corpus in general). The Experiments are subdivided in a different way: smaller elements, consisting of one or more phrases, are identified using verb tense and cue phrases, as motivated in [24] (a preliminary computational assessment will be given in [23]). Currently, we have marked up a corpus of 13 documents in this format, and we are working on implementing these, linked by the relationships described, in the online argumentation environment Cohere [25]; see Figure 5 for a screenshot of some of our statements in this environment. One of the main challenges is to represent the argumentation and the research data in a way that will allow a user to quickly see which claims are based on which experimental data, both within and between research articles. Our final goal is to develop a network that clearly differentiates claims from their validation, based on data, and enables insight into the quantitative motivation of a specific statement from its constituent experimental underpinnings. A further direction is to attempt automatic identification of the elements, specifically the moves within the experiment sections, which could enable a (semi-)automatic representation of a paper as a set of claims and underlying data.
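Purely to illustrate the idea of using cue phrases and verb forms to label discourse segments (the actual analysis in [24] is considerably more elaborate), a toy rule-based sketch might look like this; the cue lists are invented.

    import re

    # Invented, minimal cue lists; a real classifier would use verb tense, richer cues
    # and context, as described in the cited work.
    CUES = [
        ("Goal",        r"\b(to (test|determine|examine)|in order to)\b"),
        ("Method",      r"\b((was|were) (performed|measured|incubated)|we used)\b"),
        ("Result",      r"\b(we (found|observed)|showed that|increased|decreased)\b"),
        ("Implication", r"\b(suggest(s|ing)?|indicat(es|ing)|consistent with)\b"),
    ]

    def label_segment(sentence: str) -> str:
        """Assign a coarse discourse-move label to one sentence of an Experiment section."""
        for label, pattern in CUES:
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                return label
        return "Fact"   # default for background statements

    print(label_segment("Cells were incubated for 24 hours."))          # Method
    print(label_segment("We found a threefold increase in binding."))   # Result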
7. Conclusion
Each of these projects has provided us with insights that, in part, have led to the next experiment. In particular, we have explored various incarnations of modular content representations, linked by meaningful relations. In certain cases, this can be fruitful: for example, a modular structure for an encyclopedic work can allow certain user functions that a narrative, linear structure does not allow; the ABCDE format enables an accessible representation of a collection of research papers inside a semantic architecture. The next major issue is to see whether a partly modular, partly linear format, where content elements are at least identified by type (Method, Hypothesis etc.), can indeed replace the existing linear narrative. If it does turn out to enable more usable reading environments, we need to ensure that the creation of the format can be achieved, given current publishing practices. We hope that our current experiments can help provide a format that offers computational handholds to access the argumentative elements within a research paper. Lastly, we want to state our interest in exploring collaborations on this subject with the myriad initiatives that are currently ongoing, since we firmly believe this complex problem can only be solved by collaborative effort. This issue does not have a purely technological solution; to truly improve the way in which science is communicated will require serious scrutiny by the scientific community of the social, political and psycholinguistic ways in which it claims, confirms, and creates knowledge.
8. References
[1] This paper aims to describe previous and current projects, and does not contain a theoretical embedding or references to related work; these have been addressed in [9, 10, 22, 24] and will be addressed in the forthcoming [23].
[2] Kircz, J.G. and F.A.P. Harmsze, "Modular scenarios in the electronic age," Conferentie Informatiewetenschap 2000, De Doelen, Rotterdam, 5 April 2000. In: P. van der Vet and P. de Bra (eds.), CS-Report 00-20, Proceedings Conferentie Informatiewetenschap 2000, pp. 31-43.
[3] Enna, S.J. and D.B. Bylund, Preface, XPharm, doi:10.1016/B978-008055232-3.09004-9.
[4] Stuckenschmidt, H., F. van Harmelen, A. de Waard, et al., "Exploring Large Document Repositories with RDF Technology: The DOPE Project," IEEE Intelligent Systems, vol. 19, no. 3, pp. 34-40, May/June 2004.
[5] Brickley, D. (ed.), RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/rdf-schema
[6] For more information, see http://www.info.embase.com/emtree/about/
[7] Fluit, C., M. Sabou, and F. van Harmelen, "Ontology-Based Information Visualization," in: V. Geroimenko and C. Chen (eds.), Visualizing the Semantic Web, Springer-Verlag, 2003, pp. 36-48.
[8] Waard, A. de and Tel, G., "The ABCDE Format: Enabling Semantic Conference Proceedings," in: Proceedings of the First Workshop on Semantic Wikis, European Semantic Web Conference (ESWC 2006), Budva, Montenegro, 2006.
[9] Kircz, J.G. and Hans E. Roosendaal, "Understanding and shaping scientific information transfer," in: Dennis Shaw and Howard Moore (eds.), Electronic publishing in science. Proceedings of the ICSU Press / UNESCO expert conference, February 1996. UNESCO, Paris, 1996, pp. 106-116.
[10] Kircz, J.G., "New practices for electronic publishing 1: Will the scientific paper keep its form," Learned Publishing, Volume 14, Number 4, October 2001, pp. 265-272.
[11] Kircz, J.G., "New practices for electronic publishing 2: New forms of the scientific paper," Learned Publishing, Volume 15, Number 1, January 2002, pp. 27-32.
[12] Harmsze, F., "A modular structure for scientific articles in an electronic environment," PhD thesis, University of Amsterdam, February 9, 2000.
[13] Garssen, B., "The nature of symptomatic argumentation," in: Frans H. van Eemeren, Rob Grootendorst, J. Anthony Blair, Charles A. Willard (eds.), Proceedings of the 4th International Conference of the International Society for the Study of Argumentation, Amsterdam, June 16-19, 1998. Amsterdam: SICSAT, 1999.
[14] Van Mulligen, E.M. et al., "Research for Research: Tools for Knowledge Discovery and Visualization," Proc. 2002 AMIA Annual Symposium, American Medical Informatics Association, 2002, pp. 835-839.
[15] Broekstra, J., A. Kampman, and F. van Harmelen, "Sesame: An Architecture for Storing and Querying RDF and RDF Schema," Proc. 1st International Semantic Web Conference, LNCS 2342, Springer-Verlag, 2002, pp. 54-68.
[16] Broekstra, J. and A. Kampman, "SeRQL: Querying and Transformation with a Second-Generation Language," technical white paper, Aduna/Vrije Universiteit Amsterdam, January 2004.
[17] Stuckenschmidt, H., A. de Waard, R. Bhogal et al., "A Topic-Based Browser for Large Online Resources," in: E. Motta and N. Shadbolt (eds.), Proceedings of the 14th International Conference on Knowledge Engineering and Knowledge Management (EKAW'04), Lecture Notes in Artificial Intelligence.
[18] Huynh, D., Stefano Mazzocchi, and David Karger, "Piggy Bank: Experience the Semantic Web Inside Your Web Browser," Proceedings International Semantic Web Conference (ISWC) 2005.
[19] For definitions and examples, see http://en.wikipedia.org/wiki/Semantic_wiki
[20] Pitkin, R.M., Branagan, M.A., Burmeister, L.F., "Accuracy of data in abstracts of published research articles," JAMA 281 (1999), pp. 1110-1111.
[21] See e.g. the International Committee of Medical Journal Editors, "Uniform Requirements for Manuscripts Submitted to Biomedical Journals: Writing and Editing for Biomedical Publications," updated October 2007, available online at http://www.icmje.org/
[22] Waard, A. de, Breure, L., Kircz, J.G. and Oostendorp, H. van (2006), "Modeling Rhetoric in Scientific Publications," in: Vicente P. Guerrero-Bote (ed.), Current Research in Information Sciences and Technologies, pp. 352-356. Badajoz, Spain: Open Institute of Knowledge.
[23] Waard, A. de, "A Semantic Modular Structure for Biology Articles," forthcoming.
[24] Waard, A. de, "A Pragmatic Structure for the Research Article," in: Buckingham Shum, S., Lind, M. and Weigand, H. (eds.), Proceedings ICPW'07: 2nd International Conference on the Pragmatic Web, 22-23 October 2007, Tilburg, NL. Published in: ACM Digital Library & Open University ePrint 9275.
[25] Buckingham Shum, S., "Cohere: Towards Web 2.0 Argumentation," in: Proceedings COMMA'08: 2nd International Conference on Computational Models of Argument, 28-30 May 2008, Toulouse. IOS Press: Amsterdam.
Synergies, OJS, and the Ontario Scholars Portal
Michael Eberle-Sinatra1; Lynn Copeland2; Rea Devakos3
1 Centre d'édition numérique, Université de Montréal, CP 6129, succ. Centre Ville, Montreal, QC, H3C 3J7, Canada e-mail: michael.eberle.sinatra@umontreal.ca
2 W.A.C. Bennett Library, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada e-mail: copeland@sfu.ca
3 Information Technology Services, University of Toronto Libraries, 130 St George St, 7th Floor, Robarts Library, Toronto, ON, M5S 1A5, Canada e-mail: rea.devakos@utoronto.ca
Abstract
This paper introduces the CFI-funded project Synergies: The Canadian Information Network for Research in the Social Sciences and Humanities, and two of its regional components. This four-year project is a national distributed platform with a wide range of tools to support the creation, distribution, access and archiving of digital objects such as journal articles. It will enable the distribution and use of social sciences and humanities research, as well as create a resource and platform for pure and applied research. In short, Synergies will be a research tool and a dissemination tool that will greatly enhance the potential and impact of Social Sciences and Humanities scholarship. The Synergies infrastructure is built on two publishing platforms: Érudit and the Public Knowledge Project (PKP). This paper will present the PKP project within the broader context of scholarly communications. Synergies is also built on regional nodes, with both overlapping and unique services. The Ontario region will be presented as a case study, with particular emphasis on project integration with Scholars Portal, a digital library.
Keywords: content management; online publication; digital access
1. Synergies Overview
This four-year project will create a national distributed platform with a wide range of tools to support the creation, distribution, access and archiving of digital objects such as journal articles. It will enable the distribution and use of social sciences and humanities research, as well as create a resource and platform for pure and applied research. In short, Synergies will be a research and a dissemination tool that will greatly enhance the potential and impact of Social Sciences and Humanities scholarship. Canadian social sciences and humanities research published in Canadian journals and elsewhere, especially in English, is often confined to print. The dynamics of print mean that this research is machine-opaque and hence invisible on the Internet, where many students and scholars begin and more and more often end their background research. In bringing Canadian social sciences and humanities research to the internet, Synergies will not only bring that research into the mainstream of worldwide research discourse but also continue the legitimization of online publication in social sciences and humanities by the academic community and the population at large. The acceptance of this medium extends the manner in which knowledge can be represented. In one dimension, researchers will be able to take advantage of an enriched media palette: colour, image, sound, moving images, multimedia. In the second, researchers will be able to take advantage of interactivity. And in a third, those who query existing research will be able to broaden their vision by means of navigational interfaces, multilingual interrogation and automatic translation, metadata and intelligent
search engines, and textual analysis. In still another dimension, scholars will be able to expand further into areas of knowledge such as bibliometrics and technometrics, new media analysis, scholarly communicational analysis and publishing studies. Canadian researchers in the social sciences and humanities will benefit from accessing two research communication services within one structure. The first is an accessible online Canadian research record. The second is access to online publication production services that will place their work on record and will ensure widespread and flexible access. Synergies provides both these functions. Built on the dual foundation of Érudit, a Quebec-based research publication service provider in existence since 1998, and Open Journal Systems, a British Columbia-based online journal publishing software suite used by over 1,500 journals worldwide, and on the additional technical expertise developed by its three other partners, Synergies will aggregate publications from its twenty-one-university consortium to create a decentralized national platform. Synergies is designed to eventually encompass a range of formats—including published articles, pre-publication papers, data sets, presentations, electronic monographs—in short, to provide a rich scholarly record, the backbone of which is existing and yet-to-be-created peer-reviewed journals. Synergies will bring Canadian social sciences and humanities research into the mainstream of worldwide research discourse by using cost-effective public/not-for-profit partnerships to maximize knowledge dissemination. Synergies will also provide a needed infrastructure for the Social Sciences and Humanities Research Council (SSHRC) to follow through on its in-principle commitment to open access and facilitate its implementation by extending the current venues and means for online publishing in Canada. The members of the Synergies consortium are the University of New Brunswick, Université de Montréal (lead institution), University of Toronto, University of Calgary, and Simon Fraser University. Each brings appropriate but different expertise to the project. At its first level, Synergies consists of this five-university consortium that will provide a fully accessible, searchable, decentralized and inclusive national social sciences and humanities database of structured primary and secondary social sciences and humanities texts. This distributed environment is technically complex to implement, and represents a major political and social collaboration which attests to the project's transformative dimension for Canadian social sciences and humanities research and researchers. Synergies will be a primary aggregator of research that, in providing publishing services, will allow journal editors (and other producers) to manage peer review, structure subscriptions and maintain revenue control. At a second level, Synergies will reach out to 16 regional partner universities who will benefit from, and contribute to extending, Synergies functionality. At a third level, in a producer-to-consumer relationship with university libraries and organizations such as the Canadian Research Knowledge Network, Synergies will make possible national accessibility. Using this relationship as a model, Synergies will be positioned to facilitate similar relationships for journals with licensing consortia around the world.
There are many Canadian content and network infrastructure initiatives, such as electronic journals, institutional repositories, and electronic resources. Synergies partners and others are developing these infrastructures. What is needed nationally is an infrastructure that integrates these distributed components in order to enhance productivity and accessibility to Canadian social sciences and humanities at the national and international levels. The Synergies platform will integrate the outputs from the five distributed regional nodes in a centralized fashion on a large scale. The relevant technology is already partially in place. There is a need however to integrate and improve the technical infrastructure, and to address the financial processes whereby information can be made accessible to all Canadians in all sectors. Synergies will create public benefit from public funds invested in knowledge generation. Synergies is not only a pan-Canadian technical infrastructure but also a mobilizing and enabling resource for the entire scholarly community of Canadian social sciences and humanities researchers. In embracing the whole of the social sciences and humanities, Synergies will foster cross-disciplinary, problem- and issue-oriented research while also allowing further research explorations that can be time-framed, discipline-
based, media- or methodologically specific, theoretically constrained or geo-referenced. Synergies will thus serve to modernize Canadian social sciences and humanities research communication. It embraces emerging practices by utilizing existing texts, enriching, expanding, and greatly easing access to scholarly data and to audiences. It further provides deeper organizational capacity for a fragmented research record, ensuring and enhancing access to existing data sets. By providing a robust infrastructure, it allows content producers to explore new business models such as open access. However, it also facilitates access via aggregation of journals and an ability to enable agreements between Canadian social sciences and humanities journals and other producers' and buyers' consortia. It lays a foundation for expanding the research record to encompass all scholarly inquiry in order to achieve maximum accessibility and circulation. Synergies represents a project in parallel with other national projects and disciplinary databases emerging in other countries, for example, Project Muse, Euclid, JStor, and HighWire in the United States, and, in France, Persée and Adonis. Similar to these projects, Synergies will capture and disseminate knowledge through a cost-recovery, profit-neutral model. As mentioned above, Synergies is the result of a collaboration between five core universities which have been working together for several years. With each partner bringing its own expertise to the initiative, a genuine collaboration resulted in an infrastructure that was conceived from the start as truly scalable and extendable. Each regional node will integrate the input of current and future regional partners in the development of Synergies, thus continuing to extend its pan-Canadian dimension. Each node, in close collaboration with the head node, will develop the functionality and sustainability of the infrastructure over the course of the first three years starting in 2008. The latter will also co-ordinate the establishment of long-term goals and priorities that will ensure the functionality being developed is appropriate and achieves the overall goal of enhancing the end-user experience.
2. OJS in the Synergies context
As partners in the Synergies project, considering how Simon Fraser University Library would play a role as the British Columbia node, we initially focused on the most obvious and important contribution we could make — digital conversion of Canada's humanities and social science research journals, current and past issues, to electronic form (a Canadian JSTOR – CSTOR – if you will). However, we quickly realized that we could play another important role, and our thinking evolved along the lines reflected in the Ithaka Report [1]; we began to realize that our partnership with publishers in this key project could be stronger. It is worth considering this important report in the context of Synergies and noting that the conclusions and recommendations relating to university presses in the United States also provide an important model for Canadian scholarly journals. The recommendation that universities 'develop a shared electronic publishing infrastructure across universities to save costs, create scale, leverage expertise, innovate, extend the brand of U.S. higher education, create an interlinked environment of information, and provide a robust alternative to commercial competitors' [2] could equally well apply to Canada and its scholarly publishing community. One important facet of the Report's recommendation is that libraries are included as parts of the recommended model, and Synergies and OJS have also brought together traditional and electronic publishers and academic libraries. The Report notes the strengths that libraries bring to the partnership: technology; expertise in organizing information; storage and preservation capability; and deep connections to the academy, with networks of subject specialists familiar with faculty research, instructional needs and publishing trends. It goes on to note that librarians understand how to build collections and disciplinary differences. They understand multimedia content and own enormous collections of value to scholars, have extensive digitization experience and are committed to providing free access. They understand information searching and retrieval. They
are relatively well funded (although any university librarian will be quick to note that most of that funding is targeted, and that buying power is decreasing). Libraries excel at service. Through SPARC, they advocate nationally and institutionally to maximize the dissemination and bring down the costs of scholarly information, for example through open access and open source publishing options. They are good at collaborating across institutions (for example, most Canadian university libraries have reciprocal borrowing and interlibrary loan agreements, and have been highly successful at leveraging online journal costs through consortia organizations such as the Canadian Research Knowledge Network, the Ontario Council of University Libraries and the BC Electronic Library Network (ELN)). They have experience in building shared technology; for example, SFU Library, with funding from the BC ELN and the Council of Prairie and Pacific University Libraries, has developed the reSearcher software, which crosslinks index entries and journal content, as well as providing interlibrary loan requesting for print materials. They provide access to their collections through union catalogues, and are extending that model to the digital world, for example through Canadiana.org (formerly AlouetteCanada+CIHM) and of course through Synergies. Complementarily, the Report notes the strengths that publishers bring to the partnership: commercial discipline – they understand the financial aspects of distribution of scholarly research, and the need to protect the sustainability of the enterprise. Publishers understand the publishing process, know how to evaluate demand, and are experts at editorial selection, vetting and improving content quality. They work with faculty as the creators of scholarly content. They are marketing experts. They cultivate their longstanding national and international networks among wholesalers, retailers, libraries, and individuals. They are able to balance exposure for a work, financial rewards for creators and producers, and tolerable costs to consumers (libraries). They understand copyright protection and rights management. Thus the report sets out how libraries, with technological resources and expertise, can play a crucial role in fostering scholarly publishing, by partnering, appropriately, with the academics and publishers, who retain responsibility for maintaining the core editorial and peer-review functions. This model is by no means new, in some sense. For example, UBC Press was successfully launched with the leadership and support of the University Librarian, Basil Stuart-Stubbs. Coincidentally, while the Synergies project was being defined and brought into existence, the Public Knowledge Project evolved from its initial project-based inception into what has become an extraordinarily successful and sustainable partnership. It can be argued that part of the reason for its success lies in the conscious adoption of a partnership very similar to that subsequently laid out in the Ithaka report. Dr. John Willinsky, originator of the project, continues his vigorous leadership role, successfully attracting funding and new adopters. Not least of the reasons for the importance of the PKP Project and its success is the goal of bringing the tools for electronic publishing to developing countries and their research output to us.
There are three software tools in the PKP suite: Open Journal Systems (OJS), which provides a scholarly journal process management framework; Open Conference Systems (OCS), which provides the tools for conference management; and the metadata harvester, which can be configured to harvest a selection of resources, and is used, for example, for access to the Canadian Association of Research Libraries' institutional repositories. SFU Library has undertaken the role of system development and maintenance, as well as providing a hosting service for interested journals and conferences. Under the leadership of Dr. Rowland Lorimer, the SFU Canadian Centre for Studies in Publishing and the CCSP Press provide the publishing support itemized in the Report. Under this partnership, OJS has expanded its take-up to over 1,500 journals worldwide, been translated into dozens of languages, and developed partnerships with, among others, SPARC, the International Network for the Availability of Scientific Publications, Oxford, Instituto Brasileiro de Informação em Ciência e Tecnologia, Brasilia, Red de Revistas Científicas de América Latina y El Caribe, España y Portugal (REDALYC), Mexico, FeSalud - Fundación para la eSalud, Málaga, España, the Journal of Medical Internet Research, the Multiliteracy Project and the National Centre for Scientific Information, Indian Institute of Science, Bengalooru. Dr. Richard Kopak, Chia-ning Chan of UBC and the
team of Dr. Ray Siemens at the University of Victoria, who are BC partners in Synergies, are contributing to the reading tools, which will form significant added value for researchers. The growth has not been without its challenges, though they have all been met; most notable was the SFU Library hosting of 'Open Medicine', whose immediate success led us to realize we needed to ensure 24x7 uptime. With visits reaching over 40,000 per month, pkp.sfu.ca is the twelfth most visited SFU site. A continuing requirement is the recruitment of intelligent, inquiring and thoughtful individuals to work on various aspects of the project, but the possibility of working at a distance has to some extent ameliorated the scarcity of available local talent. With interest among all of the Synergies partners in aspects of Open Journal Systems (OJS), Open Conference Systems (OCS) and the metadata harvester, it became apparent that, in addition to fostering the transition or development of Canadian SSH journals online, a key component of the SFU Library role in Synergies will be to co-ordinate and foster the further development of the PKP software so that it meets Synergies partners' and, more importantly, Canadian scholarly SSH research publication needs. The Synergies nodes at the University of New Brunswick and the University of Toronto are contributing to the development of the software. This co-ordination will of course continue to involve our many other international partners. Thus, development is focussed on particular features such as interoperability with the Synergies national portal site, statistical reporting, reading tools, aggregator modules, scholarly monograph management, and interoperability with institutional repository software such as DSpace. The Synergies partnership has also led to a fruitful and ongoing exchange between the PKP and Érudit developers, in particular through the Technical committee. What is most exciting and encouraging is that our Synergies and international partners are undertaking much of the development work, enabling us to truly embody the vision of Open Source collaborative software development. As is often noted, the reasons for failure to achieve that vision have much to do with requisite time commitments and resources. Synergies funding allows us to overcome those barriers, to the benefit of Canadian and international scholars and publishers. The Ithaka Report concludes that "It is one thing to say that the organization needs to have a coherent vision of scholarly communications, quite another for provosts, press directors and librarians to agree on what that is and to put it into effect – especially when elements of this vision must be embraced across institutions… The basic infrastructure is there, and the question now is what the next layer (or layers) will look like. The recent report on cyberinfrastructure in the humanities and social sciences explored this question and focused attention on the state of scholarly communications in these fields. In addition, the terrain may now be more fertile for elements of the electronic research environments described in our report to take root, as the necessary ingredients (e.g. growing interest in eBooks) are falling into place.
Finally, there is more recognition that the challenges are too big to "go it alone," and that individual presses or even universities lack the scale to assert a desirable level of control over the dissemination of their scholarly output." [3] This conclusion applies no less to the Synergies project, and to the PKP Partnership.
3. Ontario Scholars Portal
In addition to the development of two publishing platforms and the national portal, the work of Synergies will be carried out by a series of linked regional centres. Each region will provide a common set of core services to Canadian scholars. In addition, regional nodes are focusing on related key elements. The Ontario region is exploring search. A key issue for electronic publications is academic findability, acceptance and persistence – clearly the latter two are related to the first. Canadian scholars and publishers want to be found on the open Net, but also in established scholarly databases. The Synergies Ontario node comprises York University and the Universities of Guelph, Toronto (lead) and Windsor. Services provided include journal hosting using OJS, conference hosting using the
Open Conference Systems, and repository services using DSpace. All selected platforms facilitate search engine crawling, but what about recognized scholarly finding tools such as abstracting and indexing sources? This is a common question from journal editors – how can I get indexed in the leading A & I disciplinary database(s)? What is the application process? And how long will I have to wait? The Ontario Synergies regional node has partnered with the Ontario Council of University Libraries (OCUL) [4] to provide a presence not only within a well-known scholarly finding tool, but within one that seeks to integrate itself into the academic workflow and emerging library archiving practices. OCUL is a consortium of twenty university libraries, including the Ontario Synergies partners. OCUL's vision is to be a recognized leader in provincial, national and international post-secondary communities for the collaborative development and delivery of outstanding and innovative library services. Organizational goals include building effective practices for advocacy, collaboration and organizational development; providing robust, sustainable and innovative access and delivery services; and building comprehensive and integrated digital collections. Projects often begin with grant funding, but ongoing costs are then assumed by the membership. Founded in 1967, OCUL serves approximately 382,000 FTE students, staff and faculty within the province of Ontario. Joint services include resource sharing, collective purchasing and the joint creation of the digital library, Scholars Portal (SP). Scholars Portal provides the infrastructure to all Ontario universities to support electronic access to major research materials. The portal is a gateway to a wide range of information and services for all faculty and students in the Ontario universities. The goals of SP are to support research, enhance teaching, simplify learning and advance scholarship. Specifically, Scholars Portal was established in 2002 with four primary objectives:
1. To provide for the long-term, secure archiving of resources to ensure continued availability.
2. To ensure rapid and reliable response times for information services and resources.
3. To provide an environment that fosters additional innovation in response to the needs of users.
4. To create a network of intellectual resources by linking ideas, materials, documents and resources.
OCUL’s strategy focuses on locally hosting and integrating a range of collections and services into Scholars Portal:
• SP contains approximately 200 million citations from 200 locally loaded abstracting and indexing databases: approximately 47% are scientific citations, 29% multidisciplinary, 18% social science and 5% from the arts and humanities.
• Thirteen million full-text journal articles from over 8,250 journals are locally loaded. In 2007, 4.2 million articles were downloaded. Publishers include Elsevier, Oxford, Taylor and Francis, Berkeley and the American Chemical Society. Member libraries have integrated Scholars Portal into RefWorks and course management systems such as Blackboard.
• RefWorks hosting is provided not only for Ontario but also for a total of sixty-seven institutions from every Canadian province. Thirty thousand regular users log in about 160,000 times a month during peak academic periods. These users collectively manage over 4 million citations.
• The Ontario Data Documentation, Extraction Service and Infrastructure (ODESI) project, in the early stages of implementation, will provide researchers with data discovery and
extraction services for social science survey data. It is expected that this service will grow to include geospatial data. Current plans are aimed at providing a rich tool set for users:
• Provincial funding will allow for some Scholars Portal data and search functionality to be made freely accessible. This will include open access journals, 120,000 books scanned from the University of Toronto collection as part of the Open Content Alliance, and some journal metadata.
• Distributing and archiving over 150,000 e-books on ebrary's stand-alone technology platform, ISIS.
Two planned initiatives carry special import for Synergies: the migration of data into Mark Logic and trusted digital repository certification. SP has begun migrating locally loaded data from ScienceServer to a Mark Logic content platform [5]. Mark Logic stores XML documents, an encoding format increasingly used by publishers, in their native format. By building indexes on individual works, XML elements and attributes, such as tables or illustrations, it builds indexes not only on words but also on context, and hence can provide a richer search. Relevance-based searching, facet-based browsing, thesaurus expansion, language-based stemming and collations, automatic classification, web services and AJAX facilitate incorporating current Web technologies into a new interface. As part of the content migration, SP staff will be transforming all records from the proprietary ScienceServer DTD to the NIH Journal Archiving and Publishing Schema. Not only is the NIH schema non-proprietary, it also supports full-text and metadata-only sources; ScienceDirect is a metadata-only DTD. The NIH schema will also allow for links to external resources such as the GenBank database and data integration with external applications such as Google documents. The target release date is September 2008. In order for Synergies data to be fully searchable within Scholars Portal, we have begun mapping the OJS native and Érudit DTDs to the NIH schema and have loaded a few sample journal issues. Once the pilot is complete, we will pilot integrating other content types and later invite participation from other Synergies-hosted content and OA providers. It is often difficult for small journals to transition into the electronic realm, let alone alter their production methods to fully exploit the realm's potential. Over the course of the project, we will be seeking cost-effective methods to assist journals with this transition. Processes are being re-engineered not only to fulfil the move to Mark Logic but also to begin satisfying the requirements of a "Trusted Digital Repository" [6]. External review of practices and policies is planned for 2008-9. Consultation with the University of Calgary, which is charged with developing a preservation framework for Synergies partners, is scheduled. Looking ahead, OCUL envisions a future where Scholars Portal can connect the citation to the user's workflow and support collaborative research. Synergies shares similar aims, though it focuses on moving and aggregating Canadian scholarly works online.
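As a much-simplified illustration of this kind of schema mapping (the real OJS, Érudit and NIH/NLM article models are considerably richer, and the source element names below are invented stand-ins), a transformation from an OJS-style export to a journal-archiving-style record might look like this:

    import xml.etree.ElementTree as ET

    # Invented, minimal "OJS-like" export record; not the actual OJS native XML.
    ojs_record = """
    <article>
      <title>Sample Title</title>
      <author><firstname>Ada</firstname><lastname>Lovelace</lastname></author>
      <abstract>Short abstract.</abstract>
    </article>
    """

    src = ET.fromstring(ojs_record)

    # Build a simplified target record loosely shaped like a journal-archiving schema
    # (front matter with article metadata); the target tag names are illustrative only.
    front = ET.Element("front")
    meta = ET.SubElement(front, "article-meta")
    ET.SubElement(ET.SubElement(meta, "title-group"), "article-title").text = src.findtext("title")
    contrib = ET.SubElement(ET.SubElement(meta, "contrib-group"), "contrib")
    name = ET.SubElement(contrib, "name")
    ET.SubElement(name, "surname").text = src.findtext("author/lastname")
    ET.SubElement(name, "given-names").text = src.findtext("author/firstname")
    ET.SubElement(ET.SubElement(meta, "abstract"), "p").text = src.findtext("abstract")

    print(ET.tostring(front, encoding="unicode"))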
4. Conclusions
The Synergies project is important for granting councils, for universities, for individual journals, for academics, and for Canadians. Synergies will facilitate both public access within Canada and international access and prestige. Academics’ citations will increase substantially as they enjoy much greater national and international exposure. Journals will be able to increase their exposure and find new ways of aggregating content with comparable journals while maintaining their financial viability. Universities, through their institutional repositories, will increase their international reputations. Scholars representing Canadian
universities will benefit from an enhanced profile on the national and international stages. Based on the current expertise of each of the regional partners, Synergies will also very quickly be in a position to become a leader in the field of digital humanities and electronic publications around the world. Thus, Canada’s position on the international level will be reinforced through Synergies and the research it will enable. Furthermore, Synergies will establish itself as an advisory technical committee for policymakers in Canada, and will play a role in the development of future collaborative projects around the world. The project will also help transform scholarly communication and promote a greater degree of interdisciplinary work. All Canadians are stakeholders in this enterprise and should be vitally interested in it, since they will be able to benefit from access to the research that is paid for by their tax dollars and that is contributing to transforming their society in an effort to democratize knowledge. More than just benefiting present-day research, the organization of data within the Synergies infrastructure will be standardized for use by future research initiatives. An initial investment in Synergies thus profits not only already-identified research projects but also many research projects to come. Academic communities in Canada and elsewhere will have access to content that was previously unavailable or obtainable only with great difficulty. As well, this content will enjoy the extensive functionality – powerful searching tools, textual and other forms of computer-assisted analysis, and cross-referencing between disciplines – that will be available in the online environment developed by Synergies. Moreover, Synergies will allow researchers to ask new questions, to draw on previously inaccessible information sources, and to disseminate their results to a much broader range of knowledge users in the public, private, and civil sectors of society. All of these possibilities will greatly benefit Canada as a whole. Once fully operational, Synergies will provide researchers, decision-makers and Canadian citizens with direct, organized and unprecedented access to the vast store of knowledge created within our universities, in both official languages, regardless of geographic location, subject or discipline.
5. Acknowledgements
The authors would like to thank the other members of the Synergies steering committee for their input on an earlier version of this essay: Guylaine Beaudry, Gérard Boismenu, Thomas Hickerson, Greg Kealey, Ian Lancashire, Rowland Lorimer, Erik Moore, and Mary Westell.
6. Notes and References
[1] Brown, Laura; Griffiths, Rebecca; Rascoff, Matthew. University Publishing in a Digital Age, Ithaka Report, July 26, 2007. <http://www.ithaka.org/strategic-services/Ithaka%20University%20Publishing%20Report.pdf>
[2] ibid. p. 32.
[3] ibid. p. 33.
[4] <http://www.ocul.on.ca>
[5] <http://www.marklogic.com>
[6] RLG/OCLC Working Group on Digital Archive Attributes; Research Libraries Group and OCLC. Trusted Digital Repositories: Attributes and Responsibilities: An RLG-OCLC Report, 2002. <http://www.oclc.org/programs/ourwork/past/trustedrep/repositories.pdf>
African Universities in the Knowledge Economy: A Collaborative Approach to Researching and Promoting Open Communications in Higher Education
Eve Gray1, Marke Burke2
1 Centre for Educational Technology, University of Cape Town, Private Bag, Rondebosch 7701, South Africa, email: eve.gray@gmail.com
2 Researcher, Link Centre, School of Public Development and Management, University of the Witwatersrand, email: burkem@developmentwork.co.za
Abstract
This paper will describe the informal collaborative approach taken by a group of donor funders and researchers in southern and eastern Africa, aimed at consolidating the results and increasing the impact of a number of projects dealing with research communications and access to knowledge in higher education in the region. The projects deploy a variety of perspectives and explore a range of contexts, using the collaborative potential of online resources and social networking tools for the sharing of information and results. The paper will provide a case study of donor intervention as well as analysing the methodologies, approaches and findings of the four projects concerned. The paper will explore the ways in which the projects and their funders have had to address the issues of the global dynamics of knowledge; of the changes in research practices being brought about by information and communication technologies; and of the promises that this could hold for improved access to knowledge in Africa. Finally, the conclusions of the paper address the complex dynamics of institutional change and national policy intervention and the ways in which a collaborative approach can address these.
Keywords: digital scholarship; knowledge ecology; open education; open access; scholarly publication
1. Introduction
For our continent to take its rightful place in the history of humanity ... we need to undertake, with a degree of urgency, a process of reclamation and assertion. We must contest the colonial denial of our history and we must initiate our own conversations and dialogues about our past. We need our own historians and our own scholars to interpret the history of our continent.
President Thabo Mbeki – launching the South Africa-Mali Timbuktu Library Project
When it comes to access to knowledge in higher education institutions in African countries, the emphasis has tended to be, in the first instance, on the difficulties that African researchers face in gaining access to expensive commercially published journals and books, and the extent to which this disables African participation in the knowledge society. John Willinsky is but one of a number of authors who have described the dismal circumstances in which African researchers work, with empty library shelves and minimal
access to international resources. He also describes some of the initiatives that have been put in place to remedy this situation, such as the negotiation of special journal packages by the International Network for the Advancement of Science Publications (INASP) and the World Health Organisation. On the other side of the coin are the difficulties experienced by African scholars in publishing from their home countries, also described in some detail by Willinsky [1] [2]. These are not only problems of resources, of funding for paper and printing, of the difficulties of print distribution or computer availability and bandwidth [3], but also of the power dynamics of international scholarly publishing, a more neglected topic. Developing countries, especially in Africa, face a broad spectrum of research infrastructure and capacity constraints that limit their capability to produce scientific output and absorb scientific and technical knowledge. Unequal access to information and knowledge by developing nations, exacerbated by unequal development and exchange in international trade, serves to reinforce the political and cultural hegemony of developed countries. Knowledge-based development will continue to have insignificant impact as long as this asymmetry in research output and access to up-to-date information remains [4]. There is no doubt that when it comes to participation in the global knowledge economy, Africa is particularly badly represented. According to a 2002 survey by the African Publishers’ Network, Africa produces about 3% of all books published, yet consumes 12% [5]. The statistics are even worse when considering Africa’s contribution to the internet. In 2002, Africa produced only 0.04% of all online content and, if one excludes South Africa’s contribution, the figure fell to 0.02% [6]. When it comes to journal publishing, the power dynamics of this commercialised global sector are clearly demonstrated. In 2005 there were 22 African journals out of 3,730 journals in the Thomson Scientific indexes. Twenty of these were from South Africa. The major Northern journals account for 80% of the journals in the Thomson Scientific indexes; just 2.5% overall come from developing countries [7]. Given the overwhelming social, economic and political problems that so many African countries face, the major need is for the production of locally relevant research to be effectively disseminated in order to have maximum impact where it is most needed. This need is skewed by the global scramble for publication in the most prestigious journals, as African scholars and their universities seek to establish their rankings in a competitive global research environment [8] [9]. The situation in most African countries has been compounded by decades of IMF and World Bank structural adjustment programmes, based on Milton Friedman’s theory that economic growth is generated through investment in primary education, while higher education creates unrest and instability [10]. This has led, in most African countries, to the decimation of higher education infrastructure and the virtual destruction of research capacity. South Africa is much better off in terms of research capacity; however, the higher education sector faces complex challenges as it addresses its transformation needs post-apartheid.
Ondari-Okemwa [11] categorises constraints specific to knowledge production and dissemination into economic (inadequate funding and budgetary cuts, lack of incentives, brain drain), technological (internet connectivity and telecommunications infrastructure) and environmental factors (freedom of expression). Kanyengo and Kanyengo [12] identify the non-existence of information policies for handling information, poor ICT infrastructure to manage the preservation of knowledge resources, inadequate financial resources, lack of technical knowledge and legal barriers as the key impediments to preserving information resources as inputs into knowledge production. Where there is agreement is that one of the major priorities for addressing Africa’s development challenges should be knowledge production by African researchers working primarily at African institutions, focusing on locally relevant knowledge. According to Sawyerr [13], this insistence ‘on African research and researchers at African institutions is to ensure rootedness and the sustainability of knowledge generation, as well as the increased likelihood of relevance and applicability. This condition presupposes local institutions and an environment adequate to support research of the highest calibre; and insists upon
the rootedness of such research as well as its positive spill-over effects on the local society’.
2. Policy contradictions
The policy framework that lies behind these projects has been described in Eve Gray’s paper on research publication policy in Africa produced for an Open Society Institute (OSI) International Policy Fellowship. This revealed fault lines and contradictions in South Africa’s well-elaborated research policy, which were reflected in policy developments in the region. Broadly speaking, policies that impact on research dissemination veer between an emphasis on the public role of the university, which demands social and economic impact in the national community, and an international role that is framed in the discourse of a competitive system of citation counts and international scholarly rankings. The former places the emphasis on the knowledge society, the use of ICTs and open and collaborative approaches to research; the latter on individual effort, proprietary intellectual property (IP) regulation and monetary returns garnered through the leverage of the university’s IP in the knowledge economy [8]. These two terms are not co-terminous, something eloquently explored in a recent paper by Jean-Claude Guédon, who points out that ‘the universality of scientific knowledge differs fundamentally from its globalisation’ and that ‘it is clear that the present situation of access to scientific publications arises less from aspirations for a “knowledge society” than from the rapacity of a “knowledge economy”’ [14]. One effect of the latter strand of policy is – strangely, given the emphasis on the need for national development impact – a remarkably narrow conceptualisation of what constitutes research publication. Peer-reviewed journal articles, books, chapters in books and refereed conference proceedings are valued and supported in a region that, given the serious developmental challenges it faces, could learn from the efforts of the Department of Education, Science and Training (DEST) in Australia to grapple with a broader conception of what could constitute effective research publication, given the opportunities offered by ICT use in a changing research environment [15].
3. Donor collaboration
This paper will review the ways in which a group of projects in southern Africa are seeking to address these issues through informal collaboration by donor funders seeking to maximise the impact of their interventions. Discussion of this collaboration started at the Workshop on Electronic Publishing and Open Access in Bangalore in 2006. This workshop recognised the potential for collaboration between second-economy countries as a power base for change and was attended by delegates from India, Brazil, South Africa and China. This recognition of the importance of collaboration spilled over to tea-break discussions about the fragmentation of donor interventions in southern Africa and the need for a consolidated and coordinated approach. In response, a group of funders and researchers – from the OSI, the International Development Research Centre (IDRC) and the Shuttleworth Foundation – subsequently met at the iSummit in Dubrovnik in June 2007 to take this idea further. The decision was that the funders would map their various projects in consultation with one another in order to try to achieve a consolidated impact in the transformation of policy and practice for the use of ICTs and open access publishing to increase access to knowledge in Africa. The projects that have emerged from this informal initiative thus consciously cross-reference one another in the pursuit of these goals, contractually requiring that research findings be made freely available through open licences, and also sharing project resources and findings through the use of social networking tools. This has already proved effective, as the projects have shared literature surveys and reading lists; have exchanged findings; have collaborated in interviews and workshops; and
have used collaborative workspaces and online discussion forums to exchange ideas and track common areas of interest.
4. The projects
This paper will describe four open access and scholarly publishing projects currently included in this collaborative effort, charting the ways in which they impact on one another and how their findings could coalesce to create an impact greater than the sum of their parts. These projects recognise that achieving shifts in policy and practice in an environment as conservative as the university sector and as sensitive as the under-resourced African higher education system needs a multi-pronged approach, working at all levels of the system – institutional, national and regional – to change entrenched policy and practice. A complex approach would have a better chance, this collaboration suggests, of delivering a substantial shift, leveraging the potential of ICT use and open access publishing models to transform the delivery of African knowledge dissemination. The projects all focus on the production of African knowledge from Africa, for African purposes, rather than the question of access alone. These projects also all share a contextual understanding of the need to take into account the changing research and teaching environment that has resulted from the impact of ICTs across the academic enterprise. Research is increasingly characterised by greater emphasis on interdisciplinary, multidisciplinary and transdisciplinary practices; an increasing focus on problems rather than techniques; and more emphasis on collaborative work and communication [16] [17]. This in turn creates new information and dissemination needs, since there is an increased demand for access to a wider range of more diverse sources; for access mechanisms that cut across disciplines; and for access to, and management of, non-traditional, non-text objects. The four projects to be evaluated in this paper are:
• The OpeningScholarship project, funded by the Shuttleworth Foundation and carried out in the Centre for Educational Technology (CET) [18] at the University of Cape Town, is using a case study approach to explore the potential of ICTs and Web 2.0 to transform scholarly communication between scholars, lecturers and students and also between the university and the community. The focus is at an institutional level; the lever for change is seen as the ICT systems that this institution has invested in and their use within the university.
• The Publishing and Alternative Licensing Africa (PALM Africa) project, funded by the IDRC, is working across the conventional publishing industry and open access content providers, seeking to better understand how flexible licences, including online automated licensing systems such as CC+ and the Automated Content Access Protocol (ACAP), can facilitate citizens’ access to knowledge in the digital environment and how the adoption of new and innovative business models of publishing can help African countries improve the publishing of learning materials. The first investigations are being carried out in South Africa and Uganda.
• The Opening Access to Southern African Research project, carried out by the Link Centre for the Southern African Regional Universities Association (SARUA) and funded by the IDRC, is studying the issues of access to knowledge constraints in southern African universities and the role of open approaches to research and science cooperation. The research project aims to inform the development of the basis for policy advocacy at the institutional, country and regional level with respect to academic publishing and knowledge sharing in the ‘digital commons’ context.
• The Shuttleworth Foundation and the OSI are supporting the production of the Publishing Matrix, an overview of the workings of the publishing industry – formal and informal – to allow researchers, activists and funders to better understand the context in which they are operating. The problem that this project addresses is that if projects are to achieve wider access to learning materials in Africa, they need to be backed by an understanding of how publication and knowledge dissemination work in the countries concerned, where there are blockages and weaknesses in the provision of learning materials and other knowledge resources, and where traditional systems are working well.
The projects described share methodologies of qualitative analysis and exploratory, descriptive and action research. They combine higher education policy studies with analysis of technology use and its impact. They share the perception that, as a result of the changes being brought about in research and teaching through ICT use, technical, organisational and communication infrastructure needs to be analysed within an integrated knowledge cycle. Most strikingly, in contrast to many open access initiatives, the projects combine to explore the potential for finding solutions that could also involve the publishing industry, formal and informal, in changed business practices that could deliver sustainable models for greater access to learning materials. In analysing these projects we will consider open access in the context of university missions: academic teaching (the knowledge-building cycle), research (the research, development and innovation cycle) and social engagement (promoting the utilisation of knowledge produced in universities for the benefit of communities and society). The potential value of commons-based, open access approaches for universities would be the creation of an environment which fosters more rapid growth in the volume of research output than is currently occurring, and the more effective utilisation of research activity to expand the knowledge base in any particular field by building on what has gone before. The conceptual framework shared by these projects acknowledges the context of African countries and their universities in the emerging information and knowledge economy, a world view that regards information and knowledge as central to the development and emergence of a new form of social organisation. This view endorses the role of universities as centres of knowledge production, with a primary mission to produce, communicate and disseminate knowledge. Using the case studies of the projects described above, this paper will describe the barriers in national and institutional policies that currently block the use of ICTs for enhanced access to knowledge and will report on the shifts that are taking place as a result of these interventions. Each project is examined in some detail, exploring the project methodology and its findings before drawing conclusions about the collaboration between the projects.
5. OpeningScholarship: the picture of an institution
The OpeningScholarship project is being carried out in the Centre for Educational Technology (CET) at the University of Cape Town, with the aim of investigating the impact of the use of ICT in scholarly communications in one of South Africa’s leading research universities. Acknowledging the impact of social networking and Web 2.0 on the hierarchies of knowledge production and the role that can be played by a range of formal and informal technologies, the question asked by the OpeningScholarship project is how the ICT systems that are in place could help deliver much greater intellectual capacity, how a university like UCT could make the most effective use of its research knowledge, and how it could avoid dependency by relying on its own intellectual output rather than on imported content. It also acknowledges the disruptive potential of ICT use: the ways in which changing communications could break down disciplinary silos in an increasingly interdisciplinary research environment, breaching the walls of the traditional
curriculum. The choice of university for this study was influenced by the fact that UCT, South Africa’s leading research university, has made a serious investment in its ICT infrastructure, designed to allow the university to develop and leverage the knowledge that it produces in innovative ways. UCT is also unusual in having invested in the development of an institutional infrastructure in the Centre for Educational Technology that combines technical, research and pedagogical skills in an academic department. The explicit aims of the department are to enrich and enhance the curriculum; provide for the needs of a diverse student body; and support staff in transforming, improving and extending their practice. CET is a partner in the international Sakai collaborative project for the development of an open source learning management system (LMS); in fact it was the first non-USA member of the community. The development of the UCT version of Sakai, Vula, has provided an interesting perspective on the relationship between open source and open access in delivering the increased capacity being sought through this project, as well as providing a potential platform for opening resources. The project has not taken a narrow view of what constitutes scholarly communications. It has taken seriously the university’s statement of its own mission and national higher education policy in tracking scholarly communications in three directions:
• Academic scholarship: academic to academic;
• Teaching and learning: academic to student, student to academic, student to student;
• Community engagement: university to community (and community to university).
Although some work has been done at UCT and other South African universities to reveal how ICTs could support academic scholarship, teaching and learning, not enough has been done in terms of understanding how ICTs could be usefully employed in supporting community engagement and, more particularly, how ICTs could undergird a coordinated approach to academic scholarship, teaching, learning and community engagement. On the national level, the question would be how to use ICTs to grow access to South African (and African) knowledge to deliver the aspirations of national policy, as set out in the White Paper on Science and Technology (1996), and the key objectives identified in the university’s own strategy. This is an important reflection of the South African government’s view of the role of the university in a knowledge society, particularly in an African country, in which research investment, the government suggests, needs to be recovered by way of impact on national development goals, for social upliftment, employment, health and economic growth. The use of ICTs is seen as an important component of this process, and as an essential tool if South African universities are to take their place in the global knowledge economy. As the South African White Paper on Science and Technology (1996) spelled it out:
The world is in the throes of a revolution that will change forever the way we live, work, play, organise our societies and ultimately define ourselves... Although the nature of this information revolution is still being determined... [t]he ability to maximise the use of information is now considered to be the single most important factor in deciding the competitiveness of countries as well as their ability to empower their citizens through enhanced access to information. [19]
What this project has aimed to do, therefore, is to pull together the various initiatives that are taking place and identify how maximum use could be made of ICTs at UCT to advance research, teaching and learning and community engagement through a coordinated set of coherent policies, action plans and technological and infrastructural systems.
5.1 Methodology
The principal methodology of this project has been the use of case studies to map a variety of uses of ICTs for scholarly communication at the University of Cape Town. The project has drawn upon desk research, semi-structured interviews, focus groups, and questionnaires in conducting these case studies. The case studies are framed by a review of international best practice and of national policy and practice. The approach of this project has therefore been to take the university as a case study within the national context, explore the ways in which the institution reflects national policy, and match this against international best practice. Finally, within the university, case studies have explored how individual academics and departments have been using ICTs in transformative ways for research communication, teaching and learning, and social impact, and what lessons for university policy and strategy can be learned from this.
5.2 Findings
The project is in its final stages, due for completion in July. Its findings, although not finally analysed and integrated, are therefore fairly complete. Although UCT’s mission incorporates teaching and learning, research and social responsiveness as if they are equally rated, in reality the system is heavily weighted towards research, and research of a particular kind. The impact of national policy in this regard was evident in all the case studies dealing with research publication. Not unexpectedly, the project revealed the extent to which institutional behaviour is distorted in South Africa by the financial rewards paid to the universities by the Department of Education for publication in accredited journals, books and refereed conference proceedings. The rush for a substantial revenue stream, reinforced by the appeal the policy makes to an entrenched conservatism, particularly in the upper ranks of the university hierarchy, leads the university to place a very strong emphasis on targets for the production of journal articles in particular ‘accredited’ journals. This is further strengthened by a system of competitive rankings for individual scholars run by the National Research Foundation, based on the metrics of citation counts. Both of these mechanisms place a neo-colonial emphasis on the primacy of international rather than local performance, and on the metrics of citation counts as opposed to any attempt at evaluation of the contribution that scholarship is making to the nation or region. The predictable results are that the production of local scholarly publication is under-supported, with an equally predictable backward drag on the professionalism and quality of a number of journals in an environment where journal publishing in the traditional model is unlikely to be self-sustaining [20]. Moreover, the activities of academics involved in publishing and editing are not tracked centrally in the university system, although they may be reflected in a fragmented way in departmental records. South Africa shares a presumption common in the English-speaking world that the delivery of scholarly publication is not the university’s business to fund. While the university seems willing to invest very large sums of money in patent registration, presumably against the (largely unrealistic) expectation of revenue, the much smaller sums needed for publication do not feature in its budgets. This means that there is no source of financial support for the development of digital open access publications, nor for the payment of author fees for publication in international open access journals. Although there are open access journals being published at UCT, such as Feminist Africa in the African Gender Institute [21], and expressions of interest by existing and potential new journals, there has been little actual take-up of open access scholarly publishing at UCT. In part as a result of the OpeningScholarship project, and in part because of a national project for the promotion of open access scholarly publishing funded by the DST and being delivered by the Academy of Science of South Africa (ASSAf) [14], there
is increasing awareness of the potential of electronic publishing and open access to increase research impact. However, given the lack of institutional support systems – financial and informational – or any centralised institutional policy framework for open access publishing, this interest remains fragmented. On the horizon, however, is the prospect of the creation of a national journal platform as part of the ASSAf programme for the development of local open access journals. As a result of the OpeningScholarship project, the UCT research department has become aware that its systems were tracking only the authorship of publications, and not the publications being produced on campus, nor the activities of UCT academics who are journal editors. There is therefore support for authorship and neglect of publishing efforts, even though this neglect is potentially detrimental to levels of authorship, as under-supported journals struggle to produce issues on time. An interesting spinoff from the project was the recognition that the profile of information in the university’s central systems can wield considerable power. This became clear in a workshop on UCT’s research information system at which Australian and South African universities compared their use of the publication record module of their shared system. A report from the University of Sydney provided a vivid example of how the creation of a record system, linked to a digital repository that records all publications – formal and informal – has served as a tool both to expand dissemination of university research and to profile and promote the university. Another issue that has emerged at institutional level is the fact that the university has a centralised facility for the use of ICT for teaching and learning through CET, but no university-wide integrated system to support not only teaching and learning but also research data and publications across the institution. Given that the DST is planning to implement policy on access to data from public funding in 2008-9, this will become an increasingly important issue. Also, if UCT is to retain its status as a leading research institution, given developments in higher education elsewhere, it will have to begin to address cyberinfrastructure needs for the 21st century, in collaboration with the South African higher education sector as a whole. Where technology use is having increasing impact is in teaching and learning, and this is because of UCT’s investment in CET as a dedicated department for research and development in ICT use for education. It became clear in 2007, when the Vula LMS – UCT’s version of Sakai – was launched, that the use of an open source system that is user-friendly and capable of adaptation to user needs had substantially increased the use of ICT for teaching and learning on campus. The courses delivered through the online LMS grew from 191 in 2006 to 908 in 2007, the year that Vula was launched. To date in 2008, less than halfway through the year, 933 courses are being delivered through Vula. The figures show very clearly that there is a strong response to the use of ICTs for teaching and learning, with a particularly steep rise in 2007, when the Vula system replaced the custom system that had hosted courses prior to the creation of CET as a university-wide service.
Anecdotal evidence suggests that this is at least in part the result of the ease with which lecturers can create their own course profiles and upload their course materials, given that this is an open source system, as well as student pressure for what they find a congenial and supportive environment. The case studies of teaching and learning practice within Vula have revealed that the development of innovative teaching tools and learning environments is largely the result of individual motivation. Although there are ostensibly weightings for teaching practice in the promotion system and the university offers prizes for good teaching, the perception of lecturers is that the primary route to promotion is through the publication of journal articles, preferably in international journals. An important driver of innovation, the case studies show, has been the support by the Mellon Foundation for Teaching with Technology grants [22]. These relatively small grants have been the source of a number of innovative programmes in Vula, using multimedia, animated simulations for technology teaching and
conceptual understanding; mobile technology for course administration and interaction; and simulation exercises building on social networking. What this makes clear is that relatively small levels of funding support can bring disproportionate results. While the university has shown vision in funding CET from mainstream funding, unlike most other universities, it still does not fully fund the department, and many posts in the department are still supported by grant funding. The case studies have revealed that the development of online courses is very much bound into individual departments, and even into individual courses within departments, in an institutional culture which is still built around disciplinary ‘silos’. The university does not seem to have much centrally coordinated space for collaboration in teaching and learning. The OpeningScholarship project has, however, revealed the potential for significant interdisciplinary collaboration once the connections have been made. For example, the project found that there was more than one department involved in the use of the VPython open source software for the creation of animated simulations to enhance theoretical understanding and develop algorithmic thinking in science and technology (a brief sketch of this kind of simulation follows this paragraph). This is of vital importance in a country that still faces an educational deficit in scientific education. The opening up and expansion of these resources could therefore be of national value. The problem, however, in delivering a vision such as this would be the question of resourcing in an already-stretched university system. The courses that use these innovations also demonstrate changes in the power relations between students and lecturers, with students playing a more active role in knowledge production. Another set of courses using online simulations of a different kind demonstrates a similar change in lecturer-student dynamics through the creation of online communities and role-playing. The Department of Public Law’s international law course, Inkundla yeHlabathi/World Forum, for example, has created an innovative tutorial simulation in which students learn to apply the rules and methods of international law through a series of African case studies from the 1960s to the present day by simulating the work of legal advisers to ten African States. The course is delivered through a combination of formal, doctrinal lectures, small-group tutorials and the Inkundla yeHlabathi simulation. A compilation of cases and materials, the e-casebook, is made available to students both online and on CD-ROM for offline use. This course has been recognised in the Sakai community, through the Teaching with Sakai Innovation Award, sponsored by IBM, as one of the two most innovative courses in the Sakai community worldwide in its use of technology for transformative teaching. When it comes to opening access to these resources beyond the university, the vision of Salim Nakjhavani, the course convenor, is to gradually engage other African universities in parts of the simulation, deployed through Sakai and hosted by the university. The University of the Witwatersrand (Johannesburg, South Africa) plans to join one component of the simulation in late 2008. This course provides an exemplar of the unfolding nature of open education and highlights some of the challenges that need to be addressed in order to make this simulation open to all. These challenges include copyright clearance and long-term sustainability models.
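By way of illustration only, the following minimal VPython sketch, an assumed example rather than one taken from the UCT courses and written in the syntax of recent VPython releases, animates a projectile under gravity; it is the explicit step-by-step update loop in simulations of this kind that supports the algorithmic thinking referred to above.

from vpython import sphere, vector, color, rate

# Hypothetical classroom-style example: a ball launched with an initial velocity,
# updated step by step under constant gravitational acceleration.
ball = sphere(pos=vector(0, 0, 0), radius=0.2, color=color.red, make_trail=True)
velocity = vector(3, 8, 0)    # initial velocity (m/s)
g = vector(0, -9.8, 0)        # gravitational acceleration (m/s^2)
dt = 0.01                     # time step (s)

while ball.pos.y >= 0:
    rate(100)                 # limit the animation to 100 updates per second
    velocity = velocity + g * dt
    ball.pos = ball.pos + velocity * dt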
While many of the cases that the students use, from the International Criminal Court in particular, are in the public domain, many commentaries on these and other cases that they need to refer to are not. In courses such as these, students form active communities and, according to their lecturers, can develop passionate alliances related to their roles in the simulations. There is also the potential for increased contributions to the creation of online learning materials by students. A further and unexpected sign of student willingness to create their own space in the system is the fact that a growing number of student societies are using the Vula space to manage their communities and their projects. Students are active in community responsiveness projects at UCT, and the latest social responsiveness report describes two of these [23]. While there is still no comprehensive tracking of public benefit programmes at UCT, the university has become aware of the need to demonstrate its
contribution to national development and so has started an initiative to track and report on the various projects on campus, both student- and staff-driven. At a recent social responsiveness workshop it became clear not only that there are a large number of programmes at UCT that make a considerable contribution, but also that among these are projects that produce a variety of publications: research reports, policy guidelines, training manuals, community information resources, and popularisations. These are produced without financial or logistical support, and the projects concerned complained that their work was not recognised as academic output and did not receive recognition or incentives, in spite of its importance to the community and the university’s reputation. The importance to the university is recognised at senior level. As Deputy Vice-Chancellor Martin Hall expressed it, in a climate in which government is questioning whether it gets value from its investment in higher education research:
[U]niversities [need to] assert the importance of their independence, and the value of the knowledge commons as a seedbed of innovation ranging from product development to the design of effective public policies. They also recognize that, for the knowledge commons to acquire public credibility and support, they need to show how their work is responsive to the pressing objectives of development. In pursuit of this, they develop a range of smart interfaces with both the state and private sectors, promoting effective knowledge transfer, and showing, through example, how there can be a valid social and economic return on public investment in their resources. [24]
A number of projects mentioned that they used Vula to support their work, and it is clear that a central record of these projects, support for publications through the provision of publishing platforms, and the creation of an institutional repository of research outputs would all be welcomed. The question the university needs to confront is how much effective dissemination and publication would add to the impact of its social responsiveness programmes, and how much this would contribute to profiling the institution and to its ability to attract government and donor funding.
6. PALM Africa - from polarisation to collaboration
Publishing and Alternative Licensing in Africa (PALM Africa), funded by the IDRC and led by Dr Frances Pinter, Visiting Fellow in the Centre for the Study of Global Governance at the London School of Economics, addresses some of the sustainability issues raised by the OpeningScholarship project and Opening Access to Knowledge in Southern African Universities. In an African context, in which access to internet connectivity is often limited and in which the distribution of learning materials is a serious challenge, what is missing, this project argues, is research on how open access approaches employing flexible licensing could work in conjunction with local publishing in developing countries to improve access to learning materials. Through the action research element of the project it is expected that a variety of new business models appropriate for Africa will be devised and tested. The focus of the project is intended to be primarily on the higher education sector, both because the levels of ICT infrastructure and connectivity in this sector are adequate to the task and also to align the project with the focus of other IDRC interventions in the region. The overarching research question that this project addresses is whether the adoption of more flexible licensing regimes could contribute to improved publishing of learning materials in Africa today. An important component of this project is the recognition of the contribution that can be made by professional publishing skills: the services of commissioning, editing, design, marketing, validating, branding and distributing learning materials. The project explores how more flexible licensing regimes might allow publishers to access a broader range of materials to which they might add local relevance, publish successfully and distribute in a manner that leads to more sustainable publishing and improved access for readers. In other words, what is being explored in this context is the potential for increased access to be generated through partnerships
between open access and commercial publishing models, or through the use of innovative licensing and business models that address the particular difficulties of African markets. The possible solutions to the various structural and process issues that are beginning to emerge from this study might range from alternative business models in market sectors in which the ‘free online’ open access models might be sustainable with public funding, to more complex models combining the commercial and the ‘free’ in various new ways. The scholarly literature has identified a number of viable ‘some rights reserved’ models with reference to a few examples, primarily in the fields of music and software. This is the first comparative study of its kind that engages with stakeholders to build up appropriate business models from inside the industry and then proceeds to test the viability of those models. The countries that will be participating are South Africa and Uganda. In the higher education sector, the problems that this project will address include the current difficulties faced in the development of, and access to, scholarly writing and textbooks produced on the African continent, given resourcing problems and small market size, as well as the barriers to inter-African trade. Then there are the barriers imposed by the high cost of imported textbooks when they are shipped or co-published using conventional publishing business models. Finally, there is the need for localisation of international materials [3] [6]. In all of these cases, there is potential for electronic publishing to transcend the distribution difficulties and added costs that arise in the physical movement of books across such great distances. The final product could then either be e-books, where technology availability allows, or the use of local printing to produce affordable print copies in the local market. Chris Anderson’s ‘Long Tail’ market model would suggest that, given that these are marginal markets for international publishers, there should be opportunities for exploring new financial models – including the potential for open access and commercial models used in conjunction with one another – in order to find innovative ways of meeting market needs without the high prices that have accompanied the conventional book trade models.
6.1 Methodology
This project brings together action research, in the form of publishing demonstration projects, with an academic assessment that reviews whether or not liberalising licences may bring about improvements in the publishing process, defined as increased access to materials while maintaining the sustainability of publishing services. Hence the emphasis is on collaborative efforts to find practical solutions. The outreach activities aim to create a space for discussion of the outputs and outcomes of the projects so as to encourage a deeper understanding of the role of licensing and broader engagement with decisions on the types of licences that fit specific needs. The methodologies being employed in the project include a literature review; qualitative analysis through questionnaires delivered at a stakeholder seminar; and a publishing workshop for capacity-building in each country. Following these interventions, publishing exercises will be supported in each country and a comparative analysis made of the results.
6.2 Findings
This project is still at an early stage, with publishing workshops due to be held in Uganda and South Africa in May and August 2008. However, some interesting insights are emerging already, some as a result of collaboration with other projects. It has become clear that the formal publishing industry internationally is trying to come to terms with the digital age and is experimenting with a number of new business models. This new disruptive digital technology is necessitating new approaches to copyright. Yet, where we stand today is still at the incubation stage of
these new models, with caution competing with boldness as the industry tries to find ways of recovering its investments. In the meantime there is still the urgent need to see how these new models may facilitate access and distribution in developing countries. Discussion and debate about new licensing and business models are becoming more insistent in the global North, but are less evident in Africa. This is ironic, as it is in the difficult market conditions in Africa that the use of flexible business models, linked to digital content delivery, could have real traction. In South Africa, connections have been forged between the OpeningScholarship project and PALM Africa. Some of the larger academic textbook publishers are interested in exploring the changing environment of teaching and learning at UCT, and as a result a workshop with one publisher and an interview session with another, including representatives from the PALM and OpeningScholarship projects, have already been organised. It was clear from these discussions that the publishers were beginning to realise the need to grapple with a changing environment brought about by the use of online learning platforms. This is challenging them to explore changing business models, and there is now an interest in exploring how to interface with online and multimedia content being developed in the universities. There might also be potential for exploring licensing options for the inclusion of textbooks and commentaries in online delivery in an LMS such as Vula, for fully integrated course material incorporating published materials and university-generated content. It is in the OpeningScholarship project that the first steps are being taken to investigate the copyright solutions that could allow such materials to be opened beyond the originating university. The results of the PALM Africa project should help provide sustainability models for the delivery of scholarly and textbook materials in an African context and, it is hoped, help foster inter-African trade, using flexible licensing and print-on-demand to overcome the current barriers that inhibit it. There would also appear to be potential for exploring different licensing models to make publications from the long tail of international publishers available at lower cost, bringing specialist but vitally necessary publications into Africa.
7. The Publishing Matrix – mapping the publishing industry
The Publishing Matrix project, funded by the Open Society Institute and the Shuttleworth Foundation, arises from the acknowledgement that the access debate is now shifting from access alone to a consideration of the need for participation by developing countries in open knowledge production. If projects to achieve this goal are to succeed, they need to be backed by an understanding of how publication and knowledge dissemination work in the countries concerned, where there are blockages and weaknesses in the provision of learning materials and other knowledge resources, and where traditional systems are working well. There also needs to be an understanding of how the supply chain works in the different publishing sectors, particularly where print products would be needed. An example has been a rush to provide free textbooks for schools in developing countries. Initial enthusiasm is now being tempered by the realisation that the inhibiting factors preventing the wide dissemination of school textbooks do not reside in content development alone, but in printing and distribution. Donors, activists and policy-makers are seeking a more complex understanding of how best to advance access in circumstances where print products need to be distributed in what are often complex supply networks. While there is a common understanding in the open access movement of the goals that are being pursued, there are obstacles as the new world meets with the old. People of good will are struggling to find consensus on what aspects of traditional ways of learning and communicating should be preserved and how we might be better served by newer ways of generating and communicating knowledge. Vested interests abound, opportunists deflect and derail good intentions, but equally serious is a lack of understanding
of past, present and possible future contexts, and this is leading to fragmented decision-making. Policies are being made on the hoof, with unintended consequences that can destroy many of the skills and capacities we actually wish to preserve. Equally, fear of the unknown is holding back the taking of justifiable risks. This study is intended to pull the various strands together, identify the friction nodes, and contribute to creating a more strategic vision of what changes to support. Outsiders face some difficulty in understanding how publishing works, not least in grappling with the fact that price differentials between countries are not determined only by publishers’ pricing practices, but by a complex set of circumstances in an industry in which price can be largely a factor of the size of the readership in a particular country. This study is therefore intended first and foremost to be a roadmap that aims to help others engaged in changing how knowledge, emanating out of both developed and developing countries, is communicated and how that may reduce the knowledge divide. A component of this project will also be a contribution to a better understanding of what professional publishing skills are needed for the effective development and dissemination of knowledge products. For example, the often-cited case study of the HSRC Press in South Africa, which offers a dual model of digital open access and for-sale print publication, depends upon a highly professional publishing and marketing team for its success. The Publishing Matrix is being prepared as a wiki that will provide an account of how publishing works in different sectors along the value chain, providing multiple perspectives on how the industry works. The information produced will help inform the PALM Africa project and should provide a useful resource for the investigation of ways to improve access to knowledge in the southern African region.
7.1 Findings
Although it is too soon to have hard findings to hand, there are some realisations that are already offering new insights. Most striking was the realisation, when the matrix outline was drawn up and the different sectors profiled, of just how much publishing actually happens outside of the publishing industry. A number of NGOs have been practising what are effectively open access models for years, while corporate and government publishing also produces a wide range of content, including training and curriculum materials. This needs to be better factored in when mapping the access to knowledge terrain.
Opening Access to Knowledge in Southern African Universities
SARUA has, in collaboration with the International Development Research Centre (IDRC) and the Link Centre at the University of the Witwatersrand, launched a research study entitled Opening Access to Knowledge in Southern African Universities to study the ‘access to knowledge’ constraints in Southern African universities and the role and potential contribution of Open Access frameworks and initiatives for research in the region. The project is a qualitative research study that will be implemented in seven countries in the Southern African region over a ten-month period. The study will assess the current situation pertaining to access to knowledge constraints in Southern African universities and the role of Open Access frameworks and initiatives for research and scientific collaboration. The research questions being asked explore:
• The existing constraints to availability of academic and other relevant research publications in the social sciences and humanities, the health sciences, and the natural sciences and engineering.
• The extent to which Southern African universities are changing their practices relating to the production and dissemination of research and publications and, if so, how.
• How Southern African universities can increase the availability of academic and other relevant research publications to students and researchers.
• What measures would be required to encourage new approaches to knowledge production and dissemination in Southern African universities among librarians, research managers and prominent researchers/scientists.
• How open access could benefit and contribute to scientific collaboration and endeavour, and what its implications would be for research at higher education institutions across the region, given the current limitations confronting Southern African universities.
• How feasible the establishment of a SARUA regional open access network(s) based on an ‘open knowledge charter’ and the development of a Science Commons would be, and what the options would be for doing so.
SARUA seeks to promote Open Access for increased quality research, enhanced collaboration, and the sharing and dissemination of knowledge. The outcomes of this project will inform the development of a longer-term SARUA programme aimed at promoting awareness and mobilising the university leadership across the region to promote free access to knowledge and enhance scientific research and collaboration. The importance of SARUA’s involvement in this project is that, as a regional university association, it has the potential for real traction in the formulation of policy in a wide region, involving some 64 universities in African countries south of the Sahara. The findings of this research project could therefore be of considerable importance in providing the base for the regeneration and growth of southern African universities. The project acknowledges that developing countries, especially in Africa, face a broad spectrum of research infrastructure and capacity constraints that limit their capability to produce scientific output and absorb scientific and technical knowledge. Unequal access to information and knowledge by developing nations, exacerbated by unequal development and exchange in international trade, serves to reinforce the political and cultural hegemony of developed countries. Knowledge-based development will continue to have an insignificant impact for as long as this asymmetry in research output and access to up-to-date information remains [4]. At the same time, the project acknowledges the importance of the network society, in which, as Manuel Castells [25] describes this order, the generation of wealth, the exercise of power and the creation of cultural codes depend on technological capacity, with information technology as the core of this capacity. This project is delivered in the understanding that knowledge production, communication and dissemination are becoming central to the mission of all universities in the 21st century, thus enabling a shift beyond teaching towards research and civic engagement. It acknowledges the ways in which the internet and other collaborative technologies are changing the way universities conduct their business: making it possible to carry out collaborative research across disciplines, institutions and countries; making it possible for researchers and students to share working research and publications online; and promoting e-learning for undergraduate and post-graduate programmes. This creates the opportunity for African universities to participate in global knowledge production activities, with significant potential gains through, inter alia, increased resources for research and publication in local and international academic journals. For institutions operating in developing countries within resource-constrained environments, such as SARUA member institutions, these technologies and associated practices offer tremendous opportunities for improving the research, publishing and dissemination processes and putting Southern African knowledge at the service of local economies and societies. The critical question is whether we are positioning our institutions to take advantage of these opportunities.
This question can only be answered if we understand the present constraints to knowledge production, processing and dissemination within our universities and the extent to which collaborative technologies and their associated practices can contribute to increasing our capacity for generating knowledge and expanding existing knowledge. The rise of open approaches to scientific endeavour and research is closely associated with open source technologies, open access and open data. Open research, for example, can significantly contribute to generating knowledge within our institutions.
8.1 Methodology
The study is employing a qualitative strategy of inquiry. A review of the literature and document analysis is being undertaken to assess what the emerging developments and trends are internationally and in Southern Africa. The literature review will serve to frame the inquiry and provide the basis for the formulation of the research questions and the key informant interview guidelines. A research methodology workshop has also been held to refine the design and methodology for the study in a participative way. The research will be aimed at gathering data from two respondent groups. The first group will consist of the heads of research and research managers of the selected universities, and the second group will comprise key informants in the community of librarians, academic publishing houses and teaching/research/scientific communities based at the universities.
8.2 Findings
Although the research findings for this project are not yet available, the initial results of the survey will be reported at the ELPUB 2008 conference.
9. Conclusions
These four projects, although still in progress, taken together have already demonstrated that there are gains to be made in collaboration between projects offering different perspectives on common problems. The projects share a common understanding of the importance of the information and knowledge economy, and also of the inequalities inherent in the economics and politics of global knowledge production. Acknowledging the changing research environment, in which collaboration is of primary importance and the hierarchies of knowledge production are changing, these projects together chart at different levels and from different perspectives how these changes are playing out in Africa. Mapping across the four projects, it becomes clear that before formulating policies and strategies at the national level, there needs to be an understanding of the institutional climate within the universities and the competing cultures within the institutions, as well as of the needs of the communities within which they are operating. A number of issues have emerged from the combined wisdom of these projects that would need to inform any effort to bring African research into the cyber-age and ensure that it is effectively published:
• The dominant culture of research publishing needs to be interrogated, with its narrow focus on journal articles in particular and its uncritical adherence to a global model that in fact depends upon an inequitable, imperialist and commercially driven value system. Rather, the full value of the research being produced in African universities needs to be released for the benefit of the continent.
• In order to achieve this, there would need to be a radical change in the current attitude of university administrations, government and senior academics that support for publication and dissemination is not the job of a university. It is clear that in a digital world, the universities and allied NGOs are already at the forefront, exploring ways of harnessing the potential of the internet, something that publishers could learn from. This in turn needs government policy to recognise open and collaborative approaches to generating impact from research investment, rather than only the lock-down and proprietary models of patents and copyright protection currently valued.
• Learning from the example of CET at UCT, there clearly needs to be a better understanding of what ICT infrastructure needs to be in place not only for teaching and learning, but also for an integrated approach to research management and effective research dissemination and publication. This would in turn need to include grappling with the changing job profiles and reward systems for staff working in ICT, who need to combine technical, research, communications and pedagogical skills.
• In Africa, there is a need to interrogate the common wisdom of both the open access movement and the commercial publishing models of the North to come up with sustainability models that are workable on the African continent. There is also, given the marginalised position of African research, a need for the incorporation of professional publishing skills and effective and targeted publishing strategies, wherever these are sited, for research outputs to reach their intended markets.
• The PALM Africa project and the Publishing Matrix provide a salutary reminder that in the African context, where resources are scarce, the use of new business models and commercial partnerships might well be needed to provide sustainability, particularly in a context where print is often still needed. Flexible licensing can also address the needs of a changing environment in the NGO sector, in which blended approaches are needed that combine sustainability and public interest.
Taken together, these projects should help to provide a comprehensive vision of what (complex) steps would be needed to create a publishing environment that could harness the potential of ICTs and open access approaches to give a voice to African knowledge. The SARUA project for Opening Access to Knowledge in Southern African Universities should hold the key to extending this vision to the region as a whole, with the capacity to drive an initiative at the upper levels of the university administrations in the region.
10. Notes and References
[1] WILLINSKY, JOHN. The access principle: The case for open access to research and scholarship. Cambridge, MA: MIT Press, 2006.
[2] KIRSOP, BARBARA & CHAN, LESLIE. Transforming Access to Research Literature for Developing Countries. Serials Review 31: 246–255, December 2005. https://tspace.library.utoronto.ca/bitstream/1807/4416/1/Kirsop_Chan_SerialsReview.pdf (accessed May 2007).
[3] GRAY, EVE. Academic Publishing in South Africa. In Evans N & Seeber M (eds) The politics of publishing in South Africa. Scotsville: Holger Ehling Publishers & University of Natal Press, 2001.
[4] CHAN, LESLIE & COSTA, SELY. Participation in the global knowledge commons: Challenges and opportunities for research dissemination in developing countries. New Library World 106(1210/1211): 141–163, 2005.
[5] WAFAWAROWA, BRIAN. Legal Exception to Copyright and the Development of the African and Developing Countries’ Information Sector, 2000.
[6] CZERNIEWICZ, LAURA & BROWN, CHERYL. Access to ICTs for teaching and learning: from single artefact to inter-related resources. Paper presented at the e-Merge 2004 Online Conference: Blended Collaborative Learning in Southern Africa, University of Cape Town, July 2004. http://emerge2004.net/profile/abstract.php?resid=7 (accessed May 2007).
[7] KING, DONALD. The scientific impact of nations. Nature 430: 311–316, 2004.
[8] GRAY, EVE. Achieving Research Impact for Development: A critique of research dissemination policy in South Africa. OSI Fellowship Policy Paper, Budapest, 2007. http://www.policy.hu/gray/IPF_Policy_paper_final.pdf (accessed May 2008).
[9] STEELE, C., BUTLER, L. & KINGSLEY, D. The publishing imperative: The pervasive influence of publication metrics. Learned Publishing 19(4): 277–290, 2006.
[10] BLOOM, D., CANNING, D. & CHAN, K. Higher education and economic development in Africa. Washington: World Bank, 2005. http://siteresources.worldbank.org/EDUCATION/Resources/278200-1099079877269/547664-1099079956815/HigherEd_Econ_Growth_Africa.pdf (accessed August 2006).
[11] ONDARI-OKEMWA, E. & MINISHI-MAJANJA, MK. Knowledge management education in the departments of Library/Information Science in South Africa. South African Journal of Libraries and Information Science 73(2): 136–146, 2007.
[12] KANYENGO, CW. & KANYENGO, BK. Information Services for Refugee Communities in Zambia. Proceedings of the 72nd IFLA World Library and Information Congress, 20–24 August, Seoul, Korea, 2006.
[13] SAWYERR, A. African universities and the challenge of research capacity development. Journal of Higher Education in Africa 2(1): 213–242, 2004, p. 218.
[14] GUEDON, JEAN-CLAUDE. Accès libre, archives ouvertes et États-nations: les stratégies du possible. Ametist, Numéro 2. http://www.ametist.inist.fr/document.php?id=465 (My translation).
[15] AUSTRALIAN PRODUCTIVITY COMMISSION. Public support for science and innovation. Research report, Canberra: Productivity Commission, 2007. http://www.pc.gov.au/study/science/finalreport/index.html (accessed May 2007).
[16] HOUGHTON, JOHN W, with STEELE, COLIN and HENTY, MARGARET. Changing Research Practices in the Digital Information and Communication Environment. Federal Government of Australia, DEST, 2003.
[17] BELL, ROBERT K, with HILL, DEREK and LEHMING, ROLF F. The Changing Research and Publication Environment in American Research Universities. Working Paper SRS 07-204, July 2007. Division of Science Resources Statistics, National Science Foundation, 2007.
[18] http://www.cet.uct.ac.za
[19] DACST (DEPARTMENT OF ARTS, CULTURE, SCIENCE AND TECHNOLOGY). White Paper on science and technology: Preparing for the 21st century. Pretoria: Department of Arts, Culture, Science and Technology, 1996.
[20] ASSAF (ACADEMY OF SCIENCE OF SOUTH AFRICA). Report on a Strategic Approach to Research Publishing in South Africa. Pretoria: Academy of Science of South Africa, 2006.
[21] http://web.uct.ac.za/org/agi/
[22] The Mellon Foundation has been the major supporter of CET, and was responsible for the funding of the unit in its original incarnation as the Multimedia Education Unit. Mellon still funds posts within CET, although the university has now taken responsibility for supporting the major part of the department’s human resource and infrastructure needs.
[23] http://www.uctsocialresponsiveness.org.za/home/default.asp
[24] HALL, MARTIN. Freeing the knowledge resources of public universities. Unpublished conference paper: KM Africa – Knowledge to Address Africa’s Development Challenges. University of Cape Town, March 2005.
[25] CASTELLS, M. The Rise of the Network Society (second edition). Oxford: Blackwell, 2000, p. 356.
Open Access in India: Hopes and Frustrations Subbiah Arunachalam Trustee, Electronic Publishing Trust for Development Flat No.1, Raagas Apartments, 66 Venkatakrishna Road, Mandiaveli, Chennai 600 028, India email: subbiah.arunachalam@gmail.com
Abstract The current status of scientific research and the progress made in open access – OA journals, OA repositories and open courseware – in India are reviewed. India is essentially feudal and hierarchical; there is a wide variation in the level of engagement with science and research and there is a wide gap between talk and action. Things never happen till they really happen. The key therefore is constant advocacy, never slackening the effort, and deploying both bottom-up and top-down approaches. The author’s own engagement with the Science Academies and key policymakers is described. The Indian Institute of Science is likely to deposit a very large proportion of the papers published by its faculty and students in the past hundred years in its EPrints archive. There is hope that CSIR will soon adopt open access. Keywords: India; open access; CSIR
1. Introduction
Intellectual (or knowledge) commons share with natural resource commons such as forests, grazing land, fisheries and the atmosphere some features such as congestion, free riding, conflict, overuse and pollution. But there is a big difference. Natural resources belong to the zero-sum domain: if you share something, your stock dwindles. But knowledge wants to be shared, and when shared it grows! The two kinds of commons, however, require strong collective action, self-governing mechanisms and a high degree of social capital on the part of the stakeholders. Unfortunately, knowledge can be enclosed, commodified, patented, polluted and degraded, and the knowledge commons could be unsustainable. That is exactly what we have allowed to happen to much of the knowledge produced by scientists around the world in the past few centuries and recorded in journals. We have allowed the copyright laws to protect the interests of publishers, who are intermediaries in bringing the knowledge to others, rather than protect the interests of the knowledge creators, viz. the authors of research papers, who want to give away their knowledge for free. The past two decades have seen the emergence of a movement that seeks to restore the knowledge commons to the knowledge creators by facilitating open access. Although the open access movement began before the advent of the Internet, it would not be an exaggeration to say that it would not have grown but for the emergence and widespread use of the Internet. This movement, like everything else, is uneven. It has done well wherever the stakeholders were able to ensure a certain degree of collective action, self-governing mechanisms and social capital. For example, physicists started technology-enabled sharing of preprints about two decades ago and are now moving to the next level with INSPIRE, whereas chemists are even now unable to get out of the shackles imposed by one of their own societies. Some countries like the UK and the USA have made some progress, whereas many other countries are lagging far behind. Among the developing countries, Latin America and notably Brazil have done better than others.
This paper presents a status report on open access in India.
2. General trend in scientific output and publishing from India
Independent India, led by Jawaharlal Nehru, decided to invest in science and technology and to use S&T to leverage development efforts and to improve the standards of living of the people. Ever since, virtually all political parties and the people have generally supported investing in science, even though one in four Indians is living below the poverty line. This is not at all surprising considering that knowledge has always been valued very much in India. Today India has a large community of scientists and scholars, and Indian researchers perform research in a wide variety of areas including science, technology, medicine, humanities and social sciences. They publish their research findings in a few thousand journals, roughly half of them in Indian journals and the rest in foreign journals, most of them low-impact journals. The Indian Academy of Sciences and the Council of Scientific and Industrial Research have been the leading publishers of S&T journals in India for a long time. The other Academies, professional societies, educational institutions and a few commercial firms also publish journals. But not many of them are indexed in SCI or Web of Science, which are selective in their coverage. MedKnow Publications, a Bombay-based private company, is emerging as a quality publisher of medical journals. As social science has long been neglected, there are not many social science journals of repute from India. The Economic and Political Weekly has a sizable following. India trains a very large number of scientists and engineers, and a large percentage of the best graduates, especially those trained at the famous IITs, migrate to the West, and they seem to perform well. Said an article in Forbes, “India is the leader in sending its students overseas for international education exchange, with over 123,000 students studying outside the country in 2006.” Indians constitute the largest contingent of foreign students in the USA; a recent estimate puts the number at over 83,000. The number of Indian students enrolled in British universities in 2006 was about 25,000. Of late there is a sizable outflow of students to Australian universities, and the Australians believe that most of them want to stay on in their country. Research is performed essentially in three sectors: (1) higher educational institutions such as the universities and deemed universities numbering over 400, the Indian Institutes of Technology and the Indian Institute of Science, (2) laboratories under different government agencies such as the Council of Scientific and Industrial Research (CSIR), Department of Atomic Energy (DAE), Indian Space Research Organization (ISRO), Defence Research and Development Organization (DRDO), Indian Council of Agricultural Research (ICAR), and Indian Council of Medical Research (ICMR), and (3) laboratories in the industrial sector, both public and private. Besides, a number of non-governmental organizations and think tanks are also contributing to India’s research output. Although its overall share of funds invested in R&D is decreasing, the Government continues to be the major source of funding for research, and currently it accounts for about 70%.
Industry’s share is increasing, as more and more Indian companies have started acquiring overseas companies in sectors ranging from automobiles and steel to pharmaceuticals, tea and information technology, and as many multinational corporations are setting up research centres in India to take advantage of high quality researchers they could hire at costs much lower than in the West. One would think that everything is fine with science and technology in India. Far from it. In terms of the number of papers published in refereed journals, in terms of the number of citations to these papers, in terms of citations per paper, and in terms of international awards and recognitions won, India’s record is not all that encouraging. In the past few years things have started looking up. In Table 1 I present some data on India’s contribution to the research literature of the world as seen from well-known databases. I also provide the number of papers from the People’s Republic of China to see India’s contribution in perspective.
India now accounts for 3.1% of journal papers abstracted in Chemical Abstracts; a few years ago the figure was a rather poor 2.4%.
Year    Scopus            MathSciNet       SciFinder         Engineering Village+
        India    China    India    China   India    China    India    China
2007    43005    190847   1765     11252   25126    205734   41697    235309
2006    40749    179388   1949     11762   25954    199881   38253    222371
2005    36385    157809   1936     10073   21870    173291   33675    183931
2004    32319    111219   1777     9544    18982    121725   29341    126647
2003    29972    74895    1904     8663    16804    81604    25985    106518
*Data accessed on 29 May 2008. Data for 2007 in MathSciNet and in Engineering Village+ (including both Compendex and INSPEC) are incomplete.
Table 1. The numbers of papers indexed from India and China in major bibliographic databases*
In Table 2 I present some data on the number of papers from India and three other countries and the average number of citations won by research papers from these countries as seen from Scopus. No doubt, the number of papers from India is increasing steadily, but the growth rate is nowhere near that of China. India moved from the 13th rank in 1996 in terms of the number of papers published in journals indexed by Web of Science to 10th in 2006, whereas China moved during the same period from 9th to second position.
Year        China                    India                    South Korea              Brazil
            Doc      Rank   C/d     Doc      Rank   C/d      Doc      Rank   C/d      Doc      Rank   C/d
1996        26853    9      4.37    20106    13     6.13     9669     20     8.55     8497     21     9.42
1997        29871    9      4.55    20694    13     5.67     11876    16     8.17     10167    20     8.45
1998        31887    8      4.24    19755    13     5.78     11579    16     8.88     10357    20     8.49
1999        36180    8      4.58    22578    12     5.30     14645    16     8.54     12196    18     7.95
2000        42250    6      4.29    22788    12     5.17     16321    15     8.07     12857    17     7.36
2001        55850    5      3.22    23362    12     4.34     17930    14     6.63     12708    19     5.84
2002        55400    5      3.32    24838    12     3.82     18740    14     5.74     14590    16     4.93
2003        66748    5      3.26    28741    12     3.31     23406    14     5.00     16978    17     4.31
2004        98577    2      1.92    30258    12     2.26     27200    14     3.14     18695    18     2.93
2005        148221   2      1.92    34849    11     1.00     32488    13     2.88     21239    17     1.29
2006        166205   2      0.12    38140    10     0.19     34025    12     0.22     25266    15     0.22
1996-2006   758042   5      3.14    286109   12     3.91     217879   14     5.84     163550   18     5.56
Source: SCImago Journal & Country Rank (based on data from SCOPUS), courtesy Prof. Félix de Moya of Grupo SCImago, Spain. Doc = Number of documents. C/d = Citations per document, computed for the 11-year period. Note the decrease in value for later years.
Table 2. Output of research papers from selected countries
Data provided in Table 3 (courtesy In-cites) clearly show that in no field does India receive enough citations to be on par with the world average. In certain fields like physics and materials science the gap is narrow, but in most areas of the life sciences the gap is indeed large. In areas like plant and animal science and immunology Indian research appears to be way behind.
India, after near stagnation, is now on the growth path. In the past two years the government has increased investments in both higher education and R&D. New specialized higher educational institutions are being set up in the hope that some of them will eventually emerge as world-class institutions. Science Academies are discussing ways to improve the quality of science education with a view to getting better-educated graduates into research.
Field                          Percentage of papers from India    Relative impact compared to world
Materials Science              5.12                               -25
Agricultural Sciences          5.06                               -57
Chemistry                      4.81                               -34
Physics                        3.71                               -20
Plant & Animal Sciences        3.44                               -65
Pharmacology                   3.21                               -45
Engineering                    2.93                               -28
Geosciences                    2.72                               -51
India's overall percent share, all fields: 2.63
Ecology/Environmental          2.55                               -51
Space Science                  2.52                               -47
Microbiology                   2.18                               -50
Biology & Biochemistry         2.06                               -56
Mathematics                    1.72                               -43
Computer Science               1.57                               -29
Immunology                     1.19                               -65
Clinical Medicine              1.18                               -56
Molec. Biol. & Genetics        1.17                               -62
Economics & Business           0.75                               -52
Social Sciences                0.73                               -44
Neurosciences & Behavior       0.55                               -51
Psychology/Psychiatry          0.30                               -38
Courtesy: SciBytes - ScienceWatch, Thomson Reuters
Table 3. Number of Indian papers published in different fields during the five years 2002-2006 and citations to them [Data from National Science Indicators, Thomson Reuters]
3. Awareness of OA in India
Scientists do research and communicate results to other scientists. They build on what is already known, on what others have done – the ‘shoulders of giants’ as Newton said. Indian scientists face two problems common to scientists everywhere, but acutely felt by scientists in poorer countries: Access and Visibility.
1. They find it difficult to access what other scientists have done, because of the high costs of access. With India’s annual per capita GDP well below US $1,000, most Indian libraries cannot afford to subscribe to key journals needed by their users. Most scientists in India are forced to work in a situation of information poverty. Even the programmes supported by UN agencies are not available for free in India, even though India’s per capita GDP is far below the agreed-upon threshold of US $1,000.
2. Other researchers are unable to access what Indian researchers are doing, leading to low visibility and low use of their work. As Indian scientists publish their own research in thousands of journals, small and big, from around the world, their work is often not noticed by others elsewhere, even within India, working in the same and related areas. No wonder Indian work is poorly cited.
Both these handicaps can be overcome to a considerable extent if open access is adopted widely both within and outside the country. That is easier said than done. As an individual I have been actively advocating open access for the past seven years. There are a few more advocates and proponents of OA in India. But what we have to show is rather limited. With the advent of the Internet and the Web, we need not suffer these problems any longer. If science is about sharing, then the Net has liberated the world of science and scholarship and made it a level playing field. The Net and the Web have not merely replaced print and speeded up things but have inherently changed the way we can do science (e.g. eScience and Grid computing), collaborate, datamine, and deal with datasets of unimaginable size. But the potential is not fully realized, largely because most of us are conditioned by our past experience and are inherently resistant to change. Our thinking and actions are conditioned by the print-on-paper era, especially in India! From colonial days, most people do things only when they are told to. The situation with accessing overseas toll-access journals has improved considerably thanks to five major (and a few smaller) consortia that provide access to a large number of journals for large groups of scientists in India (especially those in CSIR labs, IITs and IISc). Many universities have benefited through INFLIBNET. ICMR labs and selected medical institutions have formed ERMED, their own consortium. Rajiv Gandhi Health Sciences University, Bangalore, provides access to literature through the HELINET Consortium to a number of medical colleges in the South. On the open courseware front, the consortium of IITs and IISc has launched the NPTEL programme, under which top-notch IIT and IISc professors have come together to produce both web-based and video lessons in many subjects. Now these are available on YouTube as well. Many physicists in the better-known institutions use arXiv, which has a mirror site in India, both for placing their preprints and postprints and for reading preprints of others. But many others are not aware of it. A very large number of Indian researchers working in universities and national laboratories are not aware of open access – green or gold – and its advantages. Very few Indian authors know about author’s addenda, and whenever they receive a publisher’s agreement form they simply sign on the dotted line, giving away all the rights to the publisher. Call it ignorance or indifference, but it is rampant. Many authors think that attaching an author addendum to the copyright agreement may lead to rejection of their paper! Or at least they do not want to take a risk. What we need is advocacy and more advocacy.
4. OA journals and OA repositories in place
Thanks to the initiatives taken by Prof. M S Valiathan, former President of the Indian National Science Academy, the four journals published by INSA were made OA a few years ago. The Academy also signed the Berlin declaration. Four years ago, he convened a one-day seminar on open access as part of the Academy’s annual meeting. The Indian Academy of Sciences converted all its ten journals into OA.
The Indian Medlars Centre at the National Informatics Centre brings out the OA version of 40 biomedical journals (published mostly by professional societies) under its medIND programme. MedKnow brings out more than 60 OA journals, on behalf of their publishers, mostly professional societies. [Not all of them are Indian journals. Also, some MedKnow journals are included in the medIND programme of NIC.] Three OA medical journals are brought out from the Calicut Medical College. A few more OA journals are brought out from India. In all, the number of Indian OA journals will be around 100 (DOAJ lists 97, but it does not list journals published by the Indian National Science Academy). Dr D K Sahu, the CEO of MedKnow, has shown with ample data that OA journals can be win-win all the way. For example, the Journal of Postgraduate Medicine (JPGM) was transformed into a much better journal after it became OA. It attracts more submissions of better quality papers and from researchers from many countries; the circulation of the print version has increased; advertisement revenue has increased (both for the print version and for the online version). Its citations per document ratio has been increasing steadily. Dr Sahu has made several presentations on MedKnow journals and how open access is helping in improving the quality of the journals as well as their revenue, but not many other Indian journal publishers are coming forward to make their own journals OA. Incidentally, not a single Indian OA journal charges a publication fee. Several leading publishing firms (both European and multidisciplinary) have started poaching on these newly successful OA journals! In fact a few journals have moved out of MedKnow to foreign publishers who have lured them with money. The online versions of a few Indian journals are brought out by Bioline International. Two young OA advocates, Thomas Abraham and Sukhdev Singh, have formed a society to promote Open Journal Systems in India. The National Centre for Science Information at the Indian Institute of Science has also helped a few journals become OA by adopting OJS. The Indian Institute of Science, Bangalore, was the first to set up an institutional repository in India. They use the GNU EPrints software. Today the repository has close to 10,200 papers, not all of them full text and not all of them truly open (as many papers are available only to searchers within the campus). IISc also leads the Million Books Digital Library project’s India efforts under the leadership of Prof. N Balakrishnan. Today there are 31 repositories in India (as seen from ROAR; OpenDOAR lists only 28), including three in CSIR laboratories, viz. the National Chemical Laboratory, the National Institute of Oceanography, and the National Aerospace Laboratories. Three of them are subject-based central repositories. OpenMED of NIC, New Delhi, accepts papers in biomedical research from around the world. The Documentation Research and Training Centre at Bangalore maintains a repository for library and information science papers. Prof. B Viswanathan of the National Centre for Catalysis Research maintains, virtually single-handedly, a repository for Indian catalysis research papers with over 1,150 full-text papers.
Five of the thirty Indian repositories have found a place in the list of the top 300 repositories prepared by the Cybermetrics Lab of the Centro de Información y Documentación Científica (CINDOC) of the Consejo Superior de Investigaciones Científicas (CSIC), Spain: the Indian Institute of Science is placed at 36th rank, followed by the Indian Statistical Institute – Documentation Research and Training Centre at 96, OpenMED of the National Informatics Centre at 111, the Indian Institute of Astrophysics at 228 and the National Institute of Oceanography at 231. The repository at the Raman Research Institute has all the papers written by C V Raman, the winner of the 1930 Nobel Prize for Physics. The National Institute of Technology, Rourkela, is the only Indian institution to have mandated OA for all faculty publications. Apart from NIT-R, the deposition rate of current papers is pretty low in all other institutions. Soon ICRISAT, a CGIAR laboratory located in India, will throw open its OA repository. A small proportion of Indian physicists, mostly high energy and condensed matter physicists, use arXiv to deposit preprints and postprints. And arXiv has a mirror site at the Institute of Mathematical Sciences
(IMSc), Chennai, which is visited by an increasing number of researchers from India and the neighbouring countries. A few weeks ago IMSc set up its own institutional repository. A small team at the University of Mysore is digitizing doctoral dissertations from select Indian universities under a programme called Vidyanidhi. With funding from the Department of Scientific and Industrial Research, a small group at the Indian Institute of Science – the National Centre for Science Information – was helping Indian institutions set up OA archives (using EPrints or DSpace) and convert journals to open access using Open Journal Systems. Not many institutions have taken advantage. Informatics India Pvt Ltd, a for-profit company with its headquarters in Bangalore, is bringing out a service called Open J-Gate, which indexes all open access journals in the world. And it is absolutely free. Jairam Haravu of the Kesavan Institute of Information and Knowledge Management has made the NewGenLib library management software open source. NewGenLib can be used to set up and maintain institutional repositories.
5. Policy developments
The two Science Academies, INSA at New Delhi and IASc at Bangalore, and many of their Fellows have been engaged in a discussion on open access and its advantages, but there has been very little follow-up. As India continues, in a sense, to be feudal, one wonders if top-down approaches would work better than bottom-up approaches. But OA advocates are working on both fronts! On the bottom-up front, a number of workshops have been held with a view to training mostly library staff in the use of OA software such as EPrints, DSpace, and NewGenLib. Dr A R D Prasad of the Indian Statistical Institute – DRTC, Bangalore, is on the advisory board of DSpace, and has conducted many workshops on setting up repositories using DSpace. Two online discussion lists, OA-India and OADL, are used mostly by LIS professionals to discuss OA-related issues. But very few working scientists have taken part in these discussions. Several librarians have written about OA in professional journals. One major concern expressed by librarians and repository managers is about copyright violation; they are really worried about journal publishers taking action against their institutions. I have been writing to scientists and librarians regularly, alerting them to OA developments around the world and the need for India to adopt OA quickly. By now a very large number of Indian researchers, among them elected Fellows of Academies, must have heard about the advantages of OA several times. The Indian Academy of Sciences had started, on a pilot basis, placing online the full text of papers by Fellows of the Academy, but the project has not gone beyond the initial effort. A similar proposal is pending with INSA. If implemented, these projects will be the equivalent of the Cream of Science project in the Netherlands. Despite concerted advocacy and many individual letters addressed to policy makers, the heads of government departments of science and research councils do not seem to have applied their minds to opening up access to research papers. The examples of the research councils in the UK, the Wellcome Trust, the Howard Hughes Medical Institute and more recently NIH and Harvard University have had virtually no impact on the Indian S&T establishment. Many senior scientists and directors of research laboratories and vice chancellors of universities do not have a clear appreciation of open access and its implications. The more than 60 well-funded Bioinformatics Centres have been talking about setting up their own OA archives for more than six years, but nothing has happened. In a national laboratory, scientists do not want
to upload their papers in the OA repository set up by the library. There is great reluctance and apathy among scientists. The National Knowledge Commission, headed by Mr Sam Pitroda, a technocrat and a telecom expert, has recommended open access, and it is understood that both the Prime Minister and the Deputy Chairman of the Planning Commission have been apprised of the need for adopting OA as a national policy. Two OA advocates, yours truly and Dr A R D Prasad, were members of the Working Group on Libraries that advised the National Knowledge Commission. In addition, Dr Mangala Sundar Krishnan of NPTEL and IIT Madras and I were members of the working group on open and distance education. These two groups had submitted strong recommendations in favour of India adopting an open access mandate for publicly funded research.
6. Opportunities and Challenges
Among those who understand the issues, many would rather publish in high-impact journals, as far as possible, and would not take the trouble to set up institutional archives. A recent letter to the editor of Nature from a leading Indian scientist, a foreign associate of the National Academy of Sciences, USA, illustrates this point. Publishing firms work in subtle ways to persuade senior librarians to keep away from OA initiatives. There have been no equivalents of FreeCulture.org among Indian student bodies and no equivalent of the Taxpayers’ Alliance to influence policy at the political level. Hopes - As pointed out earlier, the National Knowledge Commission supports open access and has included it in its recommendations to the Government. Google is in touch with NKC with a proposal to digitize all doctoral theses, bring out OA versions of selected print journals and digitize back runs of OA journals. The Director of the Indian Institute of Science, which is in its centenary year, has decided to digitize all papers published from the Institute over the past more than 99 years and make them available to the world through the Institute’s EPrints archive, and the work has just begun. The Director General of the Council of Scientific and Industrial Research has said that it should be possible for CSIR to adopt a mandate similar to the one adopted by the Irish Research Council. Hope it becomes a reality soon. The Indian National Science Academy invited me to address its Council a few months ago, and the President, Vice Presidents and Members of the Council listened to me carefully; again, in early April 2008 the Academy held a half-day meeting on open access, free and open source software and copyright issues. I was asked to coordinate the presentations on the first two topics. But the lawyer who was invited to speak on copyright probably had very little understanding of the ‘give away’ nature of journal papers. INSA will before long send its recommendations to the Government. Developments around the world, including in Latin America, South Africa and China, will, I hope, goad the Indian establishment to action.
7. International collaboration and ways forward
The Principal Scientific Advisor to the Government is a former chairman of the Atomic Energy Commission and is fully aware of developments around the world. His own colleagues have been part of the work at CERN and are involved in many international collaborative projects. He often meets with his counterparts in other countries, especially in the UK and the European Union. Decisions on OA made in the UK and Europe may have an influence on him. India is an important member of both the InterAcademy Panel and the InterAcademy Council. If these
bodies could be persuaded to endorse and adopt OA, then India will fall in line. I am trying to bring a few OA champions to major events in India. Stevan Harnad came to India about eight years ago, but we did not provide him opportunities to meet many policy makers. Alma Swan came twice and did meet some key people. Maybe we need to facilitate more such visits and meetings. EIFL does not work in India. We should persuade them to include India in their programmes.
8. Conclusion
One never knows when things start happening in India. They go on talking and holding meetings but they rarely act here. That is why it is important we keep pushing.
An Overview of The Development of Open Access Journals and Repositories in Mexico Isabel Galina 1; Joaquín Giménez 2 1 School of Library, Archive and Information Studies (SLAIS), University College London, Gower Street, London, WC1E 6BT, UK e-mail: igalina@servidor.unam.mx; i.russell@ucl.ac.uk 2 UNIBIO (Unidad de Informática para la Biodiversidad), Instituto de Biología, Universidad Nacional Autónoma de México, Circuito exterior S/N Ciudad Universitaria, C.P. 04510, México DF, Mexico e-mail: joaquin@ibiologia.unam.mx
Abstract It has been noted that one of the potential benefits of Open Access is the increase in visibility for research output from less developed countries. However, little is known about the development of OA journals and repositories in these regions. This paper presents an exploratory overview of the situation in Mexico, one of the leading countries in terms of scientific output in Latin America. To conduct the overview we focused on OA journals and repositories already in place and in development. It was particularly hard to locate information and our results do not claim to be exhaustive. We identified 72 Mexican OA journals using DOAJ. Of these journals, 45 are from RedALyC, which we identified as a key project in OA journal development in Mexico. Using OpenDOAR and ROAR, ten Mexican repositories were identified. These were reviewed and classified. We found a large variation between repositories in terms of size, degree of development and type. The more advanced repositories were well developed in terms of content and were developing added-on services. We also found inter-institutional groups working on advanced OAI tools. We also did a case study of 3R, a repository development project at one of the country’s leading universities. This included interviews with two repository managers. The main challenges we found were lack of institutional buy-in, staffing and policy development. The OA movement has not yet permeated the academic research environment. However, there are important working groups and projects that could collaborate and coordinate in order to lobby university authorities, national bodies and funders. Keywords: repositories, developing countries, Open Access, Open Access journals, institutional repository
1. Introduction
This paper presents an overview of the Open Access movement in Mexico and the current OA journal and repository landscape. Although the importance of Open Access and repository building for developing countries, by increasing the visibility of underrepresented research, has been noted [1-3], more work is required on the current situation [4]. The main objective of this paper is to present an introductory overview which will hopefully promote further discussion and contributions on this subject. We do not intend to present an exhaustive overview, as information regarding this subject is not easily available. First we look at the general trends in scientific output and publishing from Mexico in order to contextualize the discussion, in particular with regard to other Latin American countries. This is followed by a broad discussion on the general awareness of Open Access in the country and a more detailed look at OA journals and repository development in place and in development. We present a case study of repository development at the National Autonomous University of Mexico (Universidad Nacional Autónoma de México – UNAM), in
order to discuss key issues faced. In addition, two interviews were conducted with repository managers to gather their views on repository development in Mexico. In particular, the lack of policy development at a national and institutional level is addressed. Finally, we look at the opportunities and challenges for Open Access in the country as well as the importance of international collaboration and other proposals to further the development of OA journals and repositories both in Mexico and in Latin America.
2. General trends in scientific output and publishing
In terms of scientific output and publishing, Mexico is an important country in Latin America and could act as one of the leading players in the development of OA journals and repositories in the region. Mexico’s contribution to the global research output as measured by the ISI database is around 0.75%, second only to Brazil. Fifteen journals are included within the ISI [5]. It is particularly important to note that the UNAM, Mexico’s national university, is the biggest contributor to the country’s research output, with over forty percent of the country’s research produced at this institution. We present a case study of the UNAM’s repository project in order to determine in more detail particular issues and challenges in repository development. It is worth noting that the UNAM website is ranked number 59 in the Webometrics Ranking of World Universities [6], and although not all of it can be considered research output, it is clear that there is already a considerable base of material published online. Considering the size of the UNAM, it could hopefully act as a key player in discussions and policy development in the country in collaboration with other institutions and national bodies. Both Mexico and Brazil have a relatively low number of citations for the number of articles published [5]. Increasing the visibility of Mexican research output is an important concern, and the development of OA journals and repositories could contribute to this. In this sense, Brazil has been leading the way with the creation of SciELO (Scientific Electronic Library Online). This project is discussed further in the paper. As with other countries in the Latin American region, Mexico has a fairly low investment in science and technology development compared to Europe, Asia and the USA and Canada [7]. Mexico invests about 0.46% of its GDP compared to around 2% for most developed countries. From 1997 onwards, however, Mexico has had a relatively steady investment in comparison to other countries, possibly due to its recently stable economy. More than half the funding for research and development in universities and other research institutes comes from the public sector [7]. This could be a key issue when discussing mandates for self-deposit when receiving public funding for research.
3. General awareness of OA in Mexico
It is difficult to gauge the level of knowledge about OA in Mexico but there is little evidence of a generalized national awareness. However, a number of events and projects were found that suggest a growing momentum towards more widespread recognition. A few Mexican institutions are signatories of the Budapest and Berlin initiatives. In 2006 the UNAM organized the 5th International Conference on University Libraries (Conferencia internacional sobre bibliotecas universitarias) with the theme ‘Open Access: an alternative access to scientific information’. This was a two-day conference on the subject with a wide array of international and national speakers. Unfortunately, it is not clear if any concrete policies or projects were developed as a result. At a national level, the Open Network of Digital Libraries (Red Abierta de Bibliotecas Digitales – RABID), together with the University Consortium for the Development of the Internet (Corporación Universitaria para el Desarrollo del Internet – CUDI), has worked for several years on interconnectivity of resources and services between Mexican digital libraries. Their work has focused mainly on Open Access journals
and electronic theses, promoting their use through OAI-PMH. No mention of institutional repositories was found on their website [8] but this information may well be documented elsewhere or currently in development. Fourteen Mexican higher education institutions currently collaborate in RABID and they have developed a number of resource discovery tools such as OA-Hermes, developed by the UNAM, which is an OAI harvester for selected quality-assured Open Access resources; and VOAI and XOAI, developed by the Universidad de las Américas – UDLA, which are federated tools for sharing resources. The national body for Science and Technology, CONACyT, has apparently not issued any public statement about Open Access. However, they have funded RedALyC (Red de Revistas Científicas de América Latina, el Caribe, España y Portugal), which is a large database of full-text Open Access journals for Latin America, the Caribbean, Spain and Portugal, developed by the Universidad Autónoma del Estado de México – UAEM. This project will be discussed further in the next section. In general, although we found little evidence of an elevated awareness of OA in Mexico, we did find several concrete examples of institutions working on a number of projects that could positively influence further awareness. It is clear that more work needs to be done in this area though, in particular with national bodies, in order to promote OA at a more national level and involve many more institutions.
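To make concrete what harvesting "using OAI-PMH" involves, the sketch below issues a ListRecords request and extracts Dublin Core titles, which is the kind of operation an OAI harvester such as OA-Hermes performs. It is a minimal illustration under stated assumptions rather than the actual OA-Hermes, VOAI or XOAI implementation, and the endpoint URL shown is hypothetical.

# Minimal OAI-PMH harvesting sketch; the endpoint URL is hypothetical.
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def harvest_titles(base_url):
    """Fetch one page of Dublin Core records and return their titles."""
    url = base_url + "?verb=ListRecords&metadataPrefix=oai_dc"
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    titles = []
    for record in tree.iter(OAI + "record"):
        title = record.find(".//" + DC + "title")
        if title is not None and title.text:
            titles.append(title.text)
    return titles

if __name__ == "__main__":
    # Replace with the OAI base URL of a real repository or journal platform.
    for t in harvest_titles("http://repositorio.example.mx/oai/request"):
        print(t)

A harvester such as this can be pointed at any OAI-PMH-compliant repository or journal platform and extended with resumption-token handling to page through the full record set.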
3. OA journals and OA repositories in place and in development
The following projects related to OA journals and repositories that we mention do not claim to be exhaustive. As the OA movement in Mexico is still relatively young, it was difficult to discover what projects are in development and in place, and there may be important initiatives that we have missed. It is hoped that this paper will indeed be an opportunity to promote discussion in order to gather further information and bring together key players.
3.1 OA Journals
We used the Directory of Open Access Journals (DOAJ) to perform a search using the term ‘Mexico’ and it produced 72 journal results. Fifty of these journals have DOAJ content and, more interestingly, all but five of these journals are from RedALyC. As mentioned in the previous section, RedALyC is the most notable development in terms of OA journals. It currently offers 512 journals and over 81,000 full-text articles. The site contains a section dedicated to Open Access, describing its development and the Budapest initiative. RedALyC works under the banner ‘Science that cannot be seen does not exist’ and its main objectives are to develop a common information space for Latin America, the Caribbean, Spain and Portugal, strengthen the quality of the publications of the region, act as a showcase for the region’s quality research output and promote a more inclusive information society. We also used Latindex, an online registry of Latin American journals, and found that of the 483 registered Mexican online publications, 238 are available freely. This, of course, is not strictly OA but it does show that a wide range of material is already publicly available. It is not known if these journals support metadata harvesting, but further work in this area could increase OA availability. A well-known Latin American journal publication project is SciELO (Scientific Electronic Library Online), originally developed in Brazil and which has now been expanded to eleven countries. Full-text articles are marked up in XML using the SciELO markup methodology and in recent years an OAI interface has been included. As well as the SciELO portals by country there are also two subject portals on Public Health and Social Sciences. Although Mexico has participated in the Public Health portal for some time now, the national site was only recently launched with twenty-one full-text journals. Most of the other country portals have been developed with strong support from national research councils or similar bodies. The
The development of SciELO Mexico could arguably have proceeded more effectively with similar support. It is currently developed by the UNAM and, now that the concept has been proven, it will hopefully receive more attention.
3.2 Repository development
Repositories have become increasingly important in the academic world [9-11] and can contribute to the development of Open Access. In 2004, ROAR (Registry of Open Access Repositories) registered around 200 repositories worldwide. This figure is currently over 1,000, with OpenDOAR (Directory of Open Access Repositories) recently reaching a similar number. However, the coverage of repositories on a global scale is patchy: a small number of countries lead the way, with most of their academic organizations developing an institutional repository plus a number of subject or national repositories [12-14], whilst other countries have none or only a few. For example, the Netherlands, Norway and Germany report one hundred percent coverage of universities with an institutional repository [13], whilst countries such as Zimbabwe, Mexico and Argentina register only a few repositories in the whole country. In these cases it is reasonable to assume that the registered figure is not representative of the total number of academic organizations, considering the size of these countries. In Latin America, Brazil has been leading repository development, with 26 repositories registered in OpenDOAR (55 in ROAR).
In order to look at repository development in Mexico, the browse-by-country function of both OpenDOAR and ROAR was used. Five Mexican repositories were found in the former and eight in the latter. Two duplicates were eliminated, leaving a total of eleven repositories for the whole country. This is quite a small number considering the size and academic importance of Mexico within Latin America. The repositories were reviewed and classified. Definitions of repositories vary considerably, so in order to classify them we used the Heery and Anderson typological model [15], describing repositories according to functionality, coverage, content type and user group. Despite there being only eleven Mexican repositories, a wide range of types was found: two national subject repositories, one theses, two institutional, two departmental, one subject, one catalogue, one regional and one unidentified, as shown in Table 1. It is clear that repositories in Mexico are still at an embryonic stage and there appears to be no coherent trend in their development. However, it is possible that a number of repositories currently in development have not been registered in ROAR or picked up by OpenDOAR, so this figure may not reflect the total. Of the eleven repositories inspected, three were over five years old, two were of unknown age and six had been registered in ROAR in the past two or three years, although it was unclear how long they had been in development. There was no evident relationship between age and number of items. Two had fewer than 100 items, three had between 1,000 and 5,000 items, whilst two were very large. Of the large repositories, one had over 200,000 items but on closer inspection was functioning as a library catalogue rather than a repository; the other holds over 80,000 full-text journal articles. Four repository sizes were unknown, as these repositories had not been successfully harvested by ROAR and there was no indication on their homepages, which is unfortunate as this information would be valuable.
In order to examine repository development in Mexico in more detail we took the Network of University Repositories (Red de Repositorios Universitarios, 3R), currently being developed at the UNAM, as a case study. This project has been particularly well documented [16] and both authors are involved in it, making access to information, experiences, interviews, documentation and development easier. Additionally, it provides an important case study because the UNAM is a large, highly centralized and productive national university, currently producing over fifty percent of the country's total research output. This was followed by interviews with two UNAM-based repository managers in order to gain a deeper understanding of repository development, content ingestion workflows, depositing behaviour, content typology, resource usage
monitoring, and dissemination. These interviews were compared to similar ones done with repository managers in the UK to discover points of convergence and differences.

NAME | INSTITUTION | TYPE | ITEMS
Acervo Digital del Instituto de Biología de la UNAM | UNAM | Institutional | 3,074
Acervo General de la biblioteca | ITESO | Catalogue | 213,500
Árboles de la UNAM: Instituto de Biología | UNAM | Subject | unknown
Biblioteca del IBUNAM | UNAM | Institutional | unknown
Colección de Tesis Digitales | UDLA | Theses | 2,773
Documentacion en Ciencias de la Comunicacion | ITESO-CONACyT | National subject repository | 4,510
DSpace en Publicaciones Digitales | UNAM | Departmental | unknown
Gobierno del Estado Chiapas | State government | Not found | 85
Publications of the Interactive and Cooperative Technologies Lab | UDLA | Departmental | 76
Redalyc | UAEM | Regional | 81,249
SciELO México | UNAM | National repository | unknown
Table 1. Mexican repositories by type and number of items

3R began in 2005 as part of a larger university funding scheme specifically designed to encourage interdisciplinary projects within the UNAM. A steering group was set up with members from Computing Services, the Library, the Biology Institute and the Centre for Applied Sciences and Development. Two full-time people were hired, one software engineer and one repository developer, to work on the project. The objectives were to investigate solutions for, and the design of, a university repository. The initial step was to diagnose the current state of repository development and digital collections at the UNAM. The UNAM has a particularly impressive web presence, appearing in 59th place in the World Universities' Ranking on the Web, with only universities from the UK, USA, Canada, Switzerland, Finland, Norway and Australia above it. Attempting to go through the domain www.unam.mx methodically was not possible due to a lack of consistency in ordering and subdomain assignment [16]. It was considered more practical to collect recommendations from the expert committee and from colleagues who work in the production of digital resources and are well acquainted with the numerous projects developed at the UNAM in the past. We did not find a repository in the strictly defined sense, or any information system with OAI-PMH interoperability. However, a large number of digital collections, publication listings, image databases and other resources, covering a very wide scope of material both in type and subject, were found. These were organized and collected in such a fashion that they could easily be repurposed as repositories. It was clear from this work that repositories could answer an obvious need for digital object management and distribution. We found little or no evidence of coordination between the different working groups.
This was followed by the development of a conceptual model, in which it was decided that a federated system formed by a set of university repositories would best answer the UNAM's extensive digital management and distribution needs. A minimum framework of policies would need to be developed at a global, university-wide level, with a clear framework of roles and responsibilities in three key areas: collection and material management, depositing, and usage. Each university repository would define at a local level: collection structure, types and formats of accepted items, revision and approval procedures, and access policies. Following an extensive review of the available repository development literature, it is clear that the most difficult aspects of repository development and functioning are workflow processes, the development and practical implementation of policies, and content ingestion. We set up four prototype repositories in order
to work on the technological aspects but, more importantly, on the development of policies, both global and local. Workshops were organized in order to bring the different future repository managers together. The main objective was to arrive at basic global policies and to work out local repository policies depending on local needs and requirements. These workshops were also aimed at examining and understanding the breadth and volume of the digital materials that we could expect. Acquiring a critical mass of digital materials is an important consideration. It was decided that the four prototype repositories would serve as a benchmark for this before discussing mandates and other forms of acquiring content. It was considered that an appropriate infrastructure, together with implemented workflow processes, should be in place before taking this step. We hope that this project can serve as a proof of concept before approaching higher university authorities.
Additionally, it has become clear that the strength of a repository also lies in the services that can be built on top of it, making it more useful for academics and encouraging them to use it; for example, generating publication lists for homepages, research assessment exercises or internal university reporting. The Biology Institute university repository (IB-UR) at the UNAM has done extensive work on developing services for its existing institutional repository. One interesting example is connecting the OAI-PMH repository, which contains mainly eprints, to another digital repository that holds information about biological collections using the Darwin Core metadata classification system. It is now possible for a user to consult information about a particular specimen within the biological collection and then automatically check all related images and articles held within the institutional repository. A future expert database will also be connected, allowing the user to find not only the research publications about a particular specimen produced by the Institute but also contact information for prominent researchers in that field. This type of work linking OAI-PMH and Darwin Core has not been done before and could be a useful contribution to the field.
As part of the research into the current situation of repositories, interviews with two repository managers were conducted: one with the manager of a national learning object repository and the other with the IB repository manager. These interviews were compared to interviews conducted with six UK repository managers on similar themes. From these interviews it was clear that the Mexican repositories are at a less developed stage in terms of institutional buy-in, content acquisition and staffing. Mexican repositories are still fairly recent and are usually being developed within a department alongside numerous other projects. In the case of the UK repositories, one important step has been to hire full-time staff to run the repository, although in most cases this was a fairly recent development (within the last year). The Mexican repositories were only just beginning to acquire content, although notably the IB repository, as mentioned before, is already working on add-on services and other content ingestion systems. According to the repository manager, this is one of the advantages of developing a repository at this later stage: software systems for running repositories are now fairly stable, allowing the team to focus on new technological developments.
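As a rough illustration of the kind of cross-linking described above for the IB-UR, and not a description of its actual implementation, the following Python sketch matches a Darwin Core specimen record against harvested Dublin Core eprint metadata by scientific name. The records, field choices and URLs are invented for illustration; only the Darwin Core term names (scientificName, catalogNumber) follow the standard.

```python
# Illustrative sketch only: link a Darwin Core specimen record to eprints
# whose Dublin Core metadata mentions the same scientific name. The records
# below are hard-coded stand-ins for data that would normally come from a
# collections database and an OAI-PMH harvest.

specimen = {
    "scientificName": "Quercus rugosa",   # Darwin Core term; value is invented
    "catalogNumber": "IB-000123",
}

eprints = [
    {"title": "Leaf morphology of Quercus rugosa in central Mexico",
     "subject": ["Quercus rugosa", "morphology"],
     "identifier": "http://repository.example.edu/eprints/42"},
    {"title": "Pollination networks in montane forests",
     "subject": ["pollination"],
     "identifier": "http://repository.example.edu/eprints/77"},
]

def related_eprints(specimen_record, eprint_records):
    """Return eprints whose title or subjects mention the specimen's name."""
    name = specimen_record["scientificName"].lower()
    for record in eprint_records:
        haystack = [record.get("title", "")] + record.get("subject", [])
        if any(name in field.lower() for field in haystack):
            yield record

if __name__ == "__main__":
    for hit in related_eprints(specimen, eprints):
        print(specimen["catalogNumber"], "->", hit["identifier"])
```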
4. Policy development
As mentioned previously, no national or even institutional policies on Open Access or repositories were found. There is still no full recognition of their importance, either from university authorities or from the national science council. This is a big difference from the UK, where JISC-funded projects such as TARDIS, SHERPA and others have been important motors for repository development. In addition, discussions about Open Access have reached research council level there, which is not the case in Mexico. It appears, however, from the previous overview, that the current scenario in Mexico is ripe for these types of discussion to take place. National government organizations such as CONACyT, together with national
collaborative academic organizations such as RABID and individual institutions developing projects such as 3R at the UNAM, are in a position to provide a platform to develop and promote Open Access in Mexico. Coordination and cooperation are key to ensuring that OA journal and repository projects can be pushed forward.
5. International collaboration
As the level of repository development at Mexican universities is still at a very early stage, it would be fruitful to consider building on international expertise to further its development. The repository at the IB is a good example of using stable software solutions as a springboard to more advanced stages of development quite rapidly. Successful international collaboration projects, such as SciELO and RedALyC, have demonstrated the effectiveness of cooperation for OA journal creation. In addition, Mexican experiences and technology could contribute towards the development of OA projects in other Latin American countries.
6. Conclusions
The Open Access movement has not permeated Mexico's academic research environment to the extent that it has in other, particularly more developed, countries. However, a few important OA journal and repository projects were found, although information regarding OA is still rather scattered and has not yet been formalized in consolidated OA working groups. It appears that Mexico has focused more on the development of OA journals than on repository building. Notable projects such as RedALyC are positive indicators that OA journals can be built effectively. It is clear that institutional collaboration and recognition in terms of funding are key elements for these types of endeavour to succeed. There is definitely a need for academic digital content management solutions within Mexican universities, as evidenced by the great amount of digital content found, for example, at the UNAM. There appears to be an important trend towards repository building in Mexico, although it is lagging behind in terms of development. However, the mature state of software development would allow Mexican universities to catch up. The few repositories that do exist are mostly still at an embryonic stage or were developed as prototypes and then abandoned or discontinued. One of the most important tasks would be to work towards making university administrators and national policy makers more aware of the need to promote, fund and develop OA journals and repositories. Although repositories are still not ubiquitous in the academic institutions of developed countries, their importance is acknowledged and discussed at policy-making level; this is a key step that must be taken in Mexico. Consolidated working groups such as RABID can play an important role in promoting OA within the country, and national funding bodies and research councils must be lobbied if OA is to be promoted. Although Mexico has subscribed to the Open Access movement, more concrete steps should be taken towards implementing it. Gathering experiences from the repositories and OA journals already in place within Mexico could help to develop an important body of literature in Spanish, allowing us to build a framework so that universities can work together to bring OA to the attention of a larger group of people, in particular university authorities, national policy makers and funders.
7. Notes and References
[1] ARUNACHALAM, S. 2003. Information for Research in Developing Countries: Information Technology – Friend or Foe? Bulletin of the American Society for Information Science and Technology, 2003, vol.29, no.5.
[2] CHAN, L., KIRSOP, B., et al. 2005. Open Access Archiving: the fast track to building research capacity in developing countries. SciDev Net, February 2005. Available at: http://www.scidev.net/en/features/open-access-archiving-the-fast-track-to-building-r.html
[3] CHAN, L., KIRSOP, B., et al. 2005. Improving access to research literature in developing countries: challenges and opportunities provided by Open Access. World Library and Information Congress: 71st IFLA General Conference, "Libraries - A voyage of discovery", 2005.
[4] FERNANDEZ, L. 2006. Open Access Initiatives in India - an Evaluation. Partnership: the Canadian Journal of Library and Information Practice and Research, 2006, vol.1, no.1.
[5] CONACyT. 2007. Informe General del Estado de la Ciencia y la Tecnología. Mexico, Consejo Nacional para la Ciencia y la Tecnología. 2007, 416pp.
[6] Webometrics Ranking of World Universities, January 2008. See http://www.webometrics.info/.
[7] RICYT. 2008. El estado de la ciencia 2007. Red Iberoamericana de Indicadores de Ciencia y Tecnología, Organización de Estados Iberoamericanos. 2008, 53pp.
[8] See http://ict.udlap.mx/rabid/ [Access date January 2008].
[9] CROW, R. 2002. The Case for Institutional Repositories: A SPARC Position Paper. Scholarly Publishing and Academic Resources Coalition. 2002, 37pp.
[10] LYNCH, C. 2003. Institutional Repositories: Essential Infrastructure for Scholarship in the Digital Age. ARL Bimonthly Report. 2003, vol.226.
[11] KIRCZ, J. G. 2005. Institutional Repositories, a new platform in Higher Education and Research. KRA Publishing Research. 2005.
[12] LYNCH, C., LIPPINCOTT, J. K. 2005. Institutional Repository deployment in the United States as of early 2005. D-Lib Magazine, 2005, vol.11, no.9.
[13] WESTRIENEN VAN, G., LYNCH, C. 2005. Academic Institutional Repositories: Deployment Status in 13 Nations as of Mid 2005. D-Lib Magazine, 2005, vol.11, no.9.
[14] MARKEY, K., ST JEAN, B., et al. 2006. Nationwide Census of Institutional Repositories: Preliminary Findings. MIRACLE (Making Institutional Repositories A Collaborative Learning Environment). Available from: http://journals.tdl.org/jodi/article/view/194/170
[15] HEERY, R., ANDERSON, S. 2005. Digital Repositories Review. UKOLN and AHDS. 2005, 33pp.
[16] LÓPEZ, C., CASTRO, A., et al. 2006. Red de repositorios universitarios de recursos digitales. Primer informe técnico. 2005, 75pp. Available from: http://eprints.rclis.org/archive/00006324/.
Brazilian Open Access Initiatives: Key Strategies and Actions
Sely M S Costa (1); Fernando C L Leite (2)
(1) Departamento de Ciência da Informação, Universidade de Brasília, Campus Universitário Darcy Ribeiro, 70910-900 Brasília DF – Brazil; Phone: +5561 33071205, Fax: +5561 33673297, email: selmar@unb.br
(2) Gerência de Organização e Difusão da Informação, Embrapa Informação Tecnológica; Universidade de Brasília, Parque Estação Biológica, PqEB, Av. W3 Norte (final), 70770-901 Brasília DF – Brazil; Phone: +5561 34484585, Fax: +5561 32724168, email: fernandodfc@gmail.com
Abstract
This overview of key Open Access (OA) strategies in Brazil over the last three years describes the guidelines, tools and methodologies needed for Brazil to become an effective actor in the worldwide open access movement. We review general trends and awareness of OA, as well as ongoing developments and policies, opportunities and challenges, both national and international. The institutionalization of Brazilian scientific research is described, along with advances in open access journals and repositories, as well as institutional and governmental policies and the problems that have slowed their progress. Among the major actions targeted recently are plans and actions specific to Portuguese-speaking countries, as well as international collaboration. We conclude with challenges and opportunities ahead.
Keywords: Learned publishing; Open Access journals; Open Access repositories; Governmental and institutional research policy; Lusophone collaboration; Brazil.
1. Introduction
Open Access (hereafter OA) in Brazil aims at reflecting the so-called BBB definition (Budapest, Bethesda and Berlin declarations): "literature [that] is digital, online, free of charge, and free of unnecessary copyright and licensing restrictions. It removes both price barriers and permission barriers. It allows reuse rights which exceed fair use" [1]. Nevertheless, as can be seen in the work reported here, there have been a number of problems concerning the Brazilian initiatives that aim at implementing OA policies and actions in this country. This paper provides a picture of the two kinds of approach taken: sensitising and real actions. The first approach consisted of a number of meetings that have taken place in different states; different types of academic institution, such as learned societies, research institutes and universities, have conducted these meetings. The second approach consisted of the creation of OA journals and digital OA repositories, as well as service providers. Most of the actors involved in this second approach are likewise universities and research institutions. This report is organised under six headings, which comprise the common outline adopted for papers presented at the special session on developing countries. We report Brazilian general trends in scientific output and publication, general OA awareness, the state of the art with open journals and digital repositories, policies, challenges and opportunities, as well as partnerships and collaboration at the international level.
2. General trends in scientific output and publishing in Brazil
Universities are the major settings for research worldwide. In Brazil this is particularly true, as it was only in the twentieth century that our current major universities were founded, and only after the Second World War that a process of institutionalising scientific and technological development began, with all the initiatives for scientific and technological development in the country being guided by government actions. In the 1950s and 1960s, the science and technology policies of the Brazilian Government were aimed at training researchers and creating and strengthening research teams. The National Research Council (CNPq) was created alongside the Higher Education Personnel Training Campaign (CAPES, http://www.capes.gov.br/), both in 1951. A more comprehensive process for supporting scientific and technological development started to appear via two instruments implemented by these two institutions: scholarships and research funding. They envisaged, ultimately, the creation of a solid academic and scientific environment in Brazil. The Financing Agency for Studies and Projects (FINEP), established in 1969, along with the National Scientific and Technological Development Fund (FNDCT), aimed to be the fundamental instruments for the support of scientific and technological development. Both the agency and the fund have been crucial for scientific and technological achievements in Brazil since then. Later, intense planning in the postgraduate and research sectors strongly linked research activities to post-graduate programmes. From the 1990s, the Brazilian post-graduate sector became more consolidated as a result of a "stable and well-defined governmental policy" (7). Nowadays, both Brazilian government bodies (at federal and state levels) and business companies (both public and private) fund research in Brazil. As might be expected, however, the Federal Government has been the major source of financial resources for research, particularly by means of grants and scholarships from CAPES, CNPq and FINEP. At the state level there are the Research Support Foundations, of which the best resourced is FAPESP (State of São Paulo). Brazilian researchers in some sectors can also count on grants and scholarships from research institutions such as Embrapa (the Brazilian Agricultural Research Corporation), the largest public research institution in the agricultural sector. The participation of Brazilian research output on the worldwide stage is very small, though figures have gradually increased. According to some of the latest figures available, Brazilian researchers produced 1.8% of the world's scientific knowledge in 2005, with the largest contribution coming from medicine (2,508 journal articles), followed by physics with 2,204 articles. It is interesting to note that ca. 85% of the national scientific output is carried out by post-graduate programmes, of which there were 3,325 masters and doctoral courses in 2005. Researchers in any field are encouraged by CNPq to organise themselves into research groups focusing on topics within their interests. These research groups have led research production in the country, which, along with research investment, is mostly centred in the Southeast and South (as is the country's population). Data in Table 1 show the breakdown of researchers in terms of the two major divisions of knowledge, namely science, technology and medicine (STM) and social sciences and humanities (SS&H).
"Researchers with doctorates" represent less than 70% of the total number of researchers; hence considerable research is also being carried out by people without doctoral training. Data in Table 2 show the breakdown of research groups by geographic region. The distribution of research groups (ca. 21,000), compared to the total number of researchers (ca. 100,000) in Table 1, shows that the average number of researchers in each group is around five. Concerning the geographical distribution of these research groups, it can be clearly observed that most of the research work carried out in Brazil is
concentrated in the south east, where major Brazilian universities are located. Three of these universities are the University of São Paulo, the University of Campinas, in the state of São Paulo, and the Federal University of Rio de Janeiro.

Division of Knowledge | Total of researchers (R) | Researchers with doctorates (D) | % (D)/(R)
STM | 47,512 | 36,037 | 76.3
SS&H | 54,672 | 31,668 | 58.3
Brazil | 102,184 | 67,705 | 67.3
Table 1: Breakdown of researchers with doctorates, by major divisions of knowledge (2006)

Geographic region | Research Groups | %
Southeast | 10,592 | 50.4
South | 4,955 | 23.6
Northeast | 3,269 | 15.5
Central-West | 1,275 | 6.1
North | 933 | 4.4
Brazil | 21,024 | 100.0
Table 2: Breakdown of research groups by geographical region (2006)

With regard to scientific publication, Brazilian research output has increased significantly over the last decade. According to data available on the CNPq web page, gathered from researchers' CVs on the Lattes Platform (http://lattes.cnpq.br)1, the breakdown of publications from researchers with doctorates whose curricula are accessible shows a low level of production. Despite disciplinary differences, researchers are expected to produce at least two publications per year on average. Nevertheless, as depicted in Table 3, the output per year does not reach this ideal average. It is important to note that the Lattes Platform covers ca. 85% of total Brazilian scientific output.

Type of output | Number of outputs (2003-2006) | Number of outputs per year | Outputs per researcher with doctorate per year
1. National journal articles | 238,480 | 59,620 | 0.88
2. International journal articles | 212,442 | 53,111 | 0.78
3. Conference papers | 332,707 | 83,177 | 1.23
4. Books | 21,778 | 5,445 | 0.08
5. Book chapters | 113,522 | 28,381 | 0.42
7. Doctoral theses | 35,753 | 8,938 | 0.13
8. Master dissertations | 121,227 | 30,307 | 0.45
Brazil | 1,075,909 | 268,979 | -
Table 3: Science & Technology production between 2003 and 2006, by type of output

In summary, the federal government (with the exception of the state of São Paulo) has funded most of the research and publication output of Brazil. This output is very small compared to global figures, for such reasons as shortage of money for research (project grants) and of scholarships for doctoral students, as well as language constraints.
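The derived columns of Table 3 are not explained in the source; assuming they are obtained by dividing each four-year total by four, and then by the 67,705 researchers with doctorates from Table 1, the published figures are reproduced, as the short Python check below illustrates.

```python
# Reproduce Table 3's per-researcher column from its raw totals (an
# illustrative check, based on the assumption stated above).
RESEARCHERS_WITH_DOCTORATES = 67705  # from Table 1
YEARS = 4                            # 2003-2006

totals = {
    "National journal articles": 238480,
    "International journal articles": 212442,
    "Conference papers": 332707,
}

for output_type, total in totals.items():
    per_year = total / YEARS                      # e.g. 238,480 / 4 = 59,620
    per_researcher = per_year / RESEARCHERS_WITH_DOCTORATES
    print(f"{output_type}: {per_researcher:.2f} outputs per researcher per year")
# Prints 0.88, 0.78 and 1.23, matching the last column of Table 3.
```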
3. General awareness of OA in Brazil
It is difficult to report precisely on the general awareness of OA in Brazil, as the requisite empirical data are not available. We accordingly discuss it indirectly, by describing the initiatives carried out to promote OA. The promotion of OA in Brazil, as everywhere else, has faced many challenges. Both the Brazilian Institute of Information in Science and Technology (IBICT) and the Scientific Electronic Library Online (SciELO) have been involved with the movement, taking the lead on most of the initiatives in the country. There are two basic initiatives to report. The first consists of declarations issued in support of OA. Since 2005, a number of declarations have been issued in Brazil, signed by either individuals or civil society entities through their representatives. So far, at least four major declarations have been issued in Brazil, following the Berlin Declaration. IBICT issued a declaration at the 57th Annual Meeting of the Brazilian Society for the Advancement of Science (SBPC). The other three were issued by the National Psychology Learned Society, by participants of an international conference in the health sciences (known as the "Salvador Declaration"), and by a group of researchers from the state of São Paulo, known as the "São Paulo Letter". The latter two are available on IBICT's web page (in Portuguese only) (http://www.acessoaberto.org/). The Salvador Declaration is also available in English from http://www.icml9.org/meetings/openaccess/public/documents/declaration.htm. The second kind of initiative consisted of a number of events to promote OA. Over the last three years, the events that have taken place in Brazil and included OA in their programmes include:
• All three last annual meetings (57th, 58th and 59th) of SBPC. Their programmes, with special sessions on OA, can be found on the SBPC site at http://www.sbpcnet.org.br/livro/57ra/programas/CONF_SIMP/simposios/1.htm, http://www.sbpcnet.org.br/livro/58ra/atividades/ENCONTROS/listagem.html and http://www.sbpcnet.org.br/livro/59ra/programacaocientifica.html#EA.
• Annual meetings of learned societies in different fields of knowledge have also been arenas for this sort of initiative. This has been the case in information and library science, which organised the Cipecc (Ibero American Conference in Electronic Publishing) event (http://portal.cid.unb.br/cipeccbr); the last two annual meetings of the National Association of Information and Library Science Research (Enancib - www.ancib.org.br); and the National Seminar on University Libraries (SNBU - http://www.snbu2006.ufba.br/). Researchers and practitioners from the health sciences have also organised the 9th World Congress on Health Information and Libraries (http://www.icml9.org/). The last three annual meetings of communication science have included OA topics in their programmes; these discussions can be found at http://www.portcom.intercom.org.br/www_antigo2/index.php?secao=projetos/endocom. Finally, psychology researchers discussed OA at their annual meeting of 2006 and issued a manifesto (http://www.anpepp.org.br/index-grupoXI.htm).
• The First Cipecc - Ibero American Conference in Electronic Publishing in the Context of Scholarly Communication ran very successfully in Brasília in April 2006. There were participants from 6 countries (Mexico, Chile, Portugal, Spain, Brazil and Canada) and 13 Brazilian states, spanning from the North to the South of the country and totalling 101 delegates. It offered a unique opportunity for open access, institutional repositories and other topics to be made known and discussed by people from Ibero America as a whole and Brazil in particular. The conference website contains all papers and presentations. The Second Cipecc will take place in Rio de Janeiro in November 2008 and also aims at publicising and discussing OA in Brazil and Ibero America.
Despite these endeavours, however, little has been achieved. Although significant steps have been taken by some of the government funding agencies, so far they have not been fully supportive, mostly because they still lack sufficient awareness of OA. It has therefore been decided to adopt new strategies. At these events, there has always been the same "audience" in attendance, mostly comprising information scientists and practitioners. Most of the scholarly and scientific communities (i.e., the researchers themselves) remain unaware of the real meaning, potential and benefits of OA. The same applies to decision makers at universities and research institutions. Hence, Brazil has seen few concrete undertakings as yet, despite the recognition that all these endeavours have had some effect on the previously profound and widespread unawareness of OA on the part of the Brazilian scholarly community. A great deal of both action and research on this issue is still needed.
4. OA journals and OA repositories in place and in development
One of the major OA initiatives in Brazil is the creation of online, open access electronic scholarly journals. In this, Open Journal Systems (OJS), from the Public Knowledge Project (PKP), Canada, has been the cornerstone. The package has been translated into Portuguese and customised to fit Brazilian journals. Training programmes established by IBICT have allowed more than 700 people, in the majority of higher education institutions, to learn to use OJS. So far, more than 360 Brazilian journals in a variety of disciplines have been created and are maintained by means of OJS. Table 4 shows the number of titles in terms of the two major divisions of knowledge. In order to support institutions unable to create their own journals, IBICT has recently launched a journal incubator (INSEER - http://inseer.ibict.br/index.php?option=com_frontpage&Itemid=1). INSEER is a service that supports and stimulates the creation and maintenance of OA scholarly journals on the Internet. This is expected not only to give rise to new journals but also, and mainly, to support the sustainability of both new and existing titles. This will undoubtedly help to increase and improve the figures in the country in this regard.

Major divisions of knowledge | Number of titles (OJS)
STM | 95
SS&H | 270
Brazil | 365
Table 4: Breakdown of scholarly journals created by means of OJS and accessible through IBICT's portal

Along with OJS, SciELO is another major OA scientific and scholarly journal service, in terms of both recording and disseminating. There are now more than 230 Brazilian titles in its collection, from different divisions of knowledge (Table 5). The major work of SciELO is to produce electronic versions of journals and then make them available and searchable through its platform in an interoperable environment. The service provides a statistics module that can generate the metrics widely used in the country as quality indicators. SciELO now has a collection of journals from 8 countries in Latin America, the Caribbean and Europe, among them Argentina, Brazil, Chile, Colombia, Spain, Portugal and Venezuela, with a total of 553 titles in its entire collection. There are also ongoing implementations of journals from Mexico, Jamaica, Costa Rica, Paraguay, Peru and Uruguay. Concerning repositories, the figures in Brazil are disappointing. So far, few university repositories (around 6, with others underway) have been implemented in the country, even though they have been the major focus
of the sensitising approach carried out by OA stakeholders. Taking into account that there are 266 universities in Brazil (142 public and 124 private), the figures would be expected to be higher. In fact, the "Green Road" to OA (author self-archiving of non-OA journal articles in institutional repositories; see http://www.nature.com/nature/focus/accessdebate/21.html) is not yet a reality in Brazil. Despite all the software translation, customisation and dissemination that IBICT has carried out over the last five years or so, very few successful initiatives have been registered within universities and research institutions. The difficulties seem to be related to a continuing lack of awareness of OA. Technical difficulties are no longer a problem (despite some remaining concern about computer personnel).

Major divisions of knowledge | Number of titles (SciELO)
STM | 157
SS&H | 81
Brazil | 238
Table 5: Breakdown of scholarly journals available at SciELO

Nevertheless, concrete progress in Brazil is now underway, and both the approved Bill and CAPES (see Section 5, Policy development) provide the evidence. It is interesting to highlight two initiatives:
• The Digital Library of the National Institute of Space Research (Inpe), available at http://bibdigital.sid.inpe.br/col/sid.inpe.br/bibdigital%4080/2006/11.11.23.17/doc/mirror.cgi. The software used to host the data has been developed entirely by Inpe's technicians, who also created persistent identifiers and a number of other resources needed to guarantee access and control. Figure 1 shows the home page of Inpe's digital library.
• The Digital Library of the University of Campinas (Unicamp), in São Paulo, available at http://libdigi.unicamp.br. Like Inpe, Unicamp works with software developed and maintained by its own technicians. In both cases, the standards are OAI compliant and open access. Figure 2 shows the home page of Unicamp's digital library.
Figure 1: Snapshot of Inpe digital library home page
Figure 2: Snapshot of Unicamp digital library home page

There are 6 DSpace and 4 EPrints installations in Brazil registered in ROAR (http://roar.eprints.org/). Some government initiatives, i.e. initiatives from the public administration sector, make successful use of DSpace in Brazil:
• The BDJur Consortium, which comprises a network of digital libraries from the judiciary sector. The consortium uses DSpace and the PKP metadata harvester, and is accessible at http://www.consorciobdjur.gov.br.
• The Virtual Library on Corruption, a partnership between a Brazilian judiciary body and the United Nations Office on Drugs and Crime (Unodc), available at http://bvc.cgu.gov.br.
A successful initiative in Brazil has also been implemented by the Antônio Carlos Jobim Institute using DSpace. The institute's collection was created with the aim of managing Tom Jobim's multimedia collection well, and is available at http://www.jobim.org/manakin. Concerning scientific information, the Digital Library of Theses and Dissertations (BDTD) is currently the most successful OA initiative in Brazil, with ca. 70,000 theses and dissertations available. It is noteworthy that this success arises from the fact that CAPES has adopted a mandatory Green OA self-archiving policy, requiring post-graduate programmes in Brazil to make their theses and dissertations available on the Internet. Currently, BDTD collects and makes available theses and dissertations from 77 higher education institutions. The strategic diffusion model of the BDTD programme seems to correspond to Brazilian needs. It consists of launching public calls aimed at higher and further education institutions with post-graduate programmes. The support given includes hardware, software, methodology and training. The selection process takes account of a number of requirements, among them the existence of a team of computer personnel who will be responsible for both implementing and operating the local digital library of ETDs. Brazil is one of the biggest contributors to NDLTD. A further very promising development in Brazil is the creation of a national web portal of OA initiatives on IBICT's server, named Oasis.br (Brazilian Open Access Scientific Information System), which should
encompass both e-journals and institutional repositories. It has been envisaged as an important twofold step. First, it is expected to serve as a tool for journals and repositories to be registered and, consequently, it will be a service provider for Brazilian OA output. Second, both journals and repositories must be compliant with the portal's policies and quality criteria in order to be registered. This, in turn, aims at ensuring the quality and sustainability of Brazilian research output. In conclusion, the "Gold road" to OA (OA journal publishing; see http://www.nature.com/nature/focus/accessdebate/21.html) is well implemented in Brazil, with a high possibility of improvement, despite a great need for quality and sustainability policies for journals created by means of OJS. There does not seem to be a need to increase, but rather to improve, the collection of OJS journals. SciELO journals are assessed in advance, which is a guarantee of their quality. OJS journals, however, because of OJS's freedom and ease of use, show a greater need for improvement. The Green road to OA (self-archiving) is still a dream, mostly because of the continuing low level of awareness among Brazilian scholars and scientists, as well as university librarians, funding agency personnel and university decision makers. This dream, however, has a high chance of becoming a reality thanks to the efforts of IBICT and other stakeholders.
5. Policy development
One of the most significant steps towards OA in Brazil, as a result of its stakeholders' efforts, is the recent approval of Bill 1120/2007 by the Science and Technology Commission of the Brazilian Chamber of Deputies. This Bill defines policies for the country that require the mandatory deposit, in a university repository, of publications resulting from research projects funded by public institutions. This is, indeed, an important step, despite being a manifestation of the country's overall top-down approach. It is interesting to note that no bottom-up approach (i.e., researcher voluntarism) tried so far has achieved positive results in Brazil. According to Harnad (comments given personally), "Likewise anywhere else in the world, except in the special case of High Energy Physics. In contrast, researchers state (http://eprints.ecs.soton.ac.uk/11006/), and outcome studies confirm (http://fcms.its.utas.edu.au/scieng/comp/project.asp?lProjectId=1830), that mandates successfully generate near-100% OA within 2 years of implementation." Kirsop (comments given personally), however, considers that "many of the existing >1300 IRs resulted from hard work in departments of universities and institutes – i.e. the IISc in Bangalore, Southampton and Tasmanian universities and so on – very little top-down help, and almost all happened because the department staff 'just did it'." Another highly effective decision that helps OA implementation in Brazil is the CAPES normative act that requires higher and further education institutions both to create a digital library of theses and dissertations and to effectively deposit all theses therein. In addition, the CNPq programme that funds scholarly and scientific journals states that if the published results of a research proposal are to be made OA, the proposal has a higher chance of receiving funding than if they are not. Finally, a Task Force was established in October 2007, involving seven major Brazilian universities representing the five geographical regions, with the aim of initiating OA promotional efforts within those universities and then propagating them to others. The modus operandi initially defined has not yet proved very successful and is now being reviewed. However, one important outcome has been the creation of the University of Brasília repository and its use as a pilot for the entire country, starting with the other six universities of the Task Force. The University of Brasília is expected to be the first Brazilian university to implement a mandatory deposit policy (http://www.eprints.org/openaccess/policysignup/), and this may occur shortly.
6. Opportunities and challenges
Based on the picture described so far, three important issues need to be tackled in Brazil; they accordingly represent the country's biggest challenges. The first is the lack of a good infrastructure for the scientific publication system. This can actually be seen as a great opportunity for the adoption of the new publication and dissemination models and technologies available nowadays. Brazil is one of the main users of OJS, with one of the highest numbers of titles created through the system. Another advantage of this lack of infrastructure is that the country does not face any battle with the publishing industry, as seen in the developed world. The great challenge, in this sense, is to remedy the unawareness of the scholarly community, particularly that of individual researchers themselves. The second important issue is that most of those responsible for OJS journals, for example, have mastered the use of the software itself but know very little about the publication system in particular and the scholarly communication process in general. This also represents a huge challenge for OA stakeholders in Brazil, and a difficult one to overcome. The same applies to DSpace, EPrints and other resources being used in the country. Finally, the information technology practitioner community seems unaware of the OA movement. Considering the importance of this community as key players in developing and implementing the Green and Gold approaches to OA, the country urgently needs to raise awareness of OA among these people and persuade them to collaborate. Opportunities and potential benefits include the free use of all available electronic resources, international collaboration and cooperation, an increasing presence of Brazilian researchers and practitioners at international conferences, and more visits by experts to the country itself, but, above all, greatly increased research access, usage and impact (http://opcit.eprints.org/oacitation-biblio.html).
7. International collaboration and ways forward
In November 2006, a group of researchers from Brazil, along with researchers and librarians from Portugal and Mozambique, held a meeting at the University of Minho, in Portugal, to discuss the OA movement in Portuguese-speaking countries. The programme of the meeting is available at http://www.sdum.uminho.pt/confOA/programa.htm. From this meeting, the Minho Commitment emerged as an important document for this community. The document is available in both Portuguese and English at http://www.ibict.br/openaccess/arquivos/compromisso.pdf. As a follow-up, the Open Access Seminar to the Scientific Knowledge in Portuguese Speaking Countries took place on November 13th, 2007, as part of a major event organised by the Internet Governance Forum; the programme is available at http://www.intgovforum.org/Rio_Schedule_final.html. The event took place in Rio de Janeiro, as part of a Brazil/United Nations meeting (http://www.intgovforum.org). Experts, researchers, librarians and government representatives jointly discussed the theme. Representatives of 8 Portuguese-speaking countries signed up to the Rio de Janeiro Protocol, which establishes their agreement to the aims of the Minho Commitment. Further developments are on the agenda of both the University of Minho, in Portugal, and IBICT, in Brasília. More discussion and action are needed in order to fully implement the Minho Commitment. Oasis.br and INSEER, already mentioned, are expected to complement efforts made by SciELO and Qualis by working in collaboration with Bireme and CAPES. The former (SciELO) is a collection of journals that share well-defined criteria for inclusion. The latter (Qualis) is a Brazilian programme implemented by CAPES to assess journals; CAPES is a federal agency responsible for assessing research and post-graduate work in Brazil. This should help guarantee journal quality and sustainability by defining Oasis.br assessment criteria both for the journals that are to be registered on the portal and for journal
publishers themselves. It should also set quality criteria for journals created by means of INSEER. Oasis.br aims at promoting both the green and the gold roads to OA. The gold road comprises methodology, strategies and criteria to either create or migrate scholarly journals using OJS. The green road consists of defining methodology, strategies and guidance for the creation and filling of institutional repositories at universities. All these actions will need to take into account disciplinary differences among scholarly communities, particularly in terms of the criteria to be defined. One special concern is to take into account Brazil's own reality and context as a developing country, in which the lack of skilled personnel is still a great problem. Partnerships with people from well-equipped institutions, such as PKP (Canada), the University of Minho (Portugal) and E-LIS, are under discussion and should result in training programmes.
8. Concluding remarks
Brazil has definitely been prominent on the agenda of the worldwide Open Access movement, thanks to a few stakeholders. Since ELPUB 2003 in Portugal, Brazilian researchers and technicians have been carrying out work on OJS, DSpace and EPrints, which were already being considered by IBICT's personnel. Apart from the pioneering OA initiative by SciELO since its beginnings in the late 1980s (though without an interoperable environment until very recently), the involvement of the Brazilian scholarly community with OA dates from 2003. Since then, a huge effort has been made to involve Brazil prominently in worldwide OA developments. Concrete achievements, however, have so far not been very impressive. There still remain long and wide roads to be traversed, not just Green and Gold but, as provocatively expressed in the FEST programme (http://www.festrieste.it), top-down or bottom-up? Costa's response: both top-down and bottom-up! Brazil needs the problems of OA implementation to be tackled by librarians, researchers and computer technicians who can persuade decision makers to provide the support needed. This would represent the bottom-up approach, which corresponds to the sensitising approach mentioned earlier. At the same time, the country needs decisions to be made both by the government as a whole (legislative, judiciary and executive branches) and by the research institutions, universities and funding bodies. It is understood that if these bodies establish policies for the country as a whole, this will help the scholarly community, defined in its largest sense, to follow. A synergy between top-down and bottom-up approaches appears optimal. This will constitute real action and requires strong involvement and advice from all key stakeholders. Let the country, therefore, go ahead and do it! Hope springs eternal.
9. Acknowledgements
Acknowledgments to Stevan Harnad and Barbara Kirsop for their valuable comments and corrections to this paper. We are also indebted to Finatec (http://www.finatec.org.br) for the financial support received.
10. Notes
1. The Lattes Platform is a system that provides a very comprehensive record of Brazilian researchers' activities concerning teaching, research, administrative positions, and so forth; in short, a comprehensive CV record from which it is easy to assess research work in the country.
11. References
[1] Suber, Peter. Strong and weak OA. Open Access News: news from the open access movement, April 29, 2008. Reposted by Stevan Harnad as Open Access: "Strong" and "Weak". American Scientist Open Access Forum, April 29, 2008.
Interpretive Collaborative Review: Enabling Multi-Perspectival Dialogues to Generate Collaborative Assignments of Relevance to Information Artefacts in a Dedicated Problem Domain
Peter Pennefather (1) and Peter Jones (1,2)
(1) Laboratory for Collaborative Diagnostics, Leslie Dan Faculty of Pharmacy, University of Toronto, 144 College St, Toronto ON, M5S 3M2, Canada, email: p.pennefather@utoronto.ca
(2) Redesign Research, Dayton, Ohio, USA, peter@redesignresearch.com
Abstract
Interpretive Collaborative Review (ICR) is a process designed to assemble electronically accessible research papers and other forms of information into collaboratively interpreted guides to information artefacts relevant to particular problems. The purpose of ICR is to enable collective understanding of a selected problem area that can be developed and represented by evaluating (reviewing) selected artefacts through a collaborative deliberation process. ICR has been conceptually formalized as an online environment enabling collaborative evaluation of relevancy relationships articulated in the triad of: 1) specific problems (topic), 2) diverse stakeholder and reviewer perspectives (context), and 3) particular settings where the problem matters (task). We define relevance as a cognitive recognition of proximal meaning relationships among the triad nodes of topic, task, and context. Three necessary dimensions of relevance relationships are proposed: 1) precedence, 2) validity, and 3) maturity. Based on experience with other forms of collaborative knowledge construction, such as structured dialogue and cooperative learning, we conceptualized the ICR process as encompassing three phases: 1) discovery, promoting initial interpretations and definition; 2) deliberation, promoting emerging understanding and acceptance of degrees of interpretation within the group; and 3) dissemination, promoting summation, validation, and distribution or publication of conclusions. The ICR method starts by recruiting a community of reviewers with necessarily diverse perspectives who agree to collaborate in identifying and evaluating information artefacts that can inform knowledge construction centered on a problem of common interest. A discovery phase allows reviewers to declare perspectives that are further delimited and explored collaboratively through the use of group dialogue around challenge questions. This is followed by a deliberative phase that facilitates collaborative dialogue aimed at developing a shared understanding of available information artefacts and their significance, and of how those sources are relevant to the problem context. A final dissemination phase involves recording and publishing the knowledge synthesis and innovation that emerged from this collaborative dialogical process, in order to effect knowledge transfer. Alignment of perspectives is promoted through collaborative generation of an aggregated report that describes the perceived relevancy relationships for each knowledge artefact evaluated in the review collection. While useful by itself, this report also serves as the raw material for a new form of scholarly publication, the 3D-Review, where relevancy relationships are used to guide suggested actions that could be taken with respect to advancing knowledge of the problem and options for addressing it. Both reports and reviews are indexable and electronically accessible, allowing other communities or individuals to find, retrieve, and act upon the new knowledge associated with the reports and reviews. This process of rigorous and purposeful deliberation, enabled through online support of honest dialogue, has the potential to develop into a new form of scholarly activity that should be useful in integrative scholarship.
Keywords: collaborative review; multi-perspectival dialogue; information relevancy; online deliberation; integrative research; knowledge synthesis.

1. Introduction

1.1 What is ICR?
We are developing an interpretive collaborative review (ICR) process that will enable ad hoc review of published literatures and their data. The aim of the ICR process is to facilitate and record human-perceived relevance of these information artefacts within the context of construction of knowledge concerning a specific applied research question, especially concerning interdisciplinary and wicked problems [1]. ICR facilitates multi-perspectival dialogues to generate collaborative assignments of relevance to information artefacts in a dedicated problem domain. Unlike peer review, which reinforces dominant disciplinary perspectives by privileging within-discipline peer assessment, ICR affirms the necessity of including and validating the multiple perspectives necessary to understand a complex problem and to inform decision making. The ICR process is designed to lead to online distribution of an ICR publication that summarizes and reflects upon the conclusions of the discovery and deliberation phases of the ICR process. We anticipate that these 3D-Reviews (Discovery, Deliberation, and Dissemination) will serve a need for rapidly-generated, problem-focused scholarly interpretations of the literature, available evidence, and data relevant for addressing significant practice and research questions. The 3DRs are envisioned as a venue for disseminating integrative research [2], or the deliberate association of multiple perspectives on interdisciplinary research problems, which is increasingly important in healthcare, disease management, planning, and other domains characterized by wicked problems. Here we describe a theoretical and conceptual framework for designing this ICR process.

1.2 Why ICR is Needed
The increasing rate of publication sponsored by massive investments in discovery and technology development has generated a bewildering array of knowledge and information artefacts, most of which are now accessible in a digital form. Digital information artefacts include: research reports, medical records, original research data, audio/video recordings, maps, images, financial documents, legal forms and case-law texts, databases, websites, public records, etc. An equally massive "knowledge aftermarket" corpus of internet-accessible white papers, reviews, books, proceedings, and guidelines has been published and archived. These represent attempts to interpret and make meaning from the original published body of research as well as other forms of information artefacts and to make them more accessible. This deluge of data, articles, and communications [4,5], both published and unpublished, creates significant complexity and cognitive load for scholars. Especially when engaged in multidisciplinary and integrative research, scholars may experience uncertainty with respect to the validity or perspectives of published accounts outside of their primary field. Significant cognitive overload burdens are to be expected in assessing and interpreting research findings, their contexts of meaning and definition, methodological soundness, and disciplinary applicability. Practitioners, developers and consumers who want to use the information for practical purposes face even greater difficulties than scholars. Appraisal of the published literature that might be relevant to a practice becomes a daunting exercise in sensemaking for professional information seekers [6] and even more so for the public at large [7], as members of the public are increasingly viewed as consumers of health and biomedical information [8]. In addition, new views are emerging about how digital data and accessible digital information artefacts can be reused, which will likely transform the nature of scientific publishing [9].
Yet, in applied and practice-oriented health professional disciplines, such as medicine, nursing and pharmacy, we find a continuous need to interpret, translate, and make clinical decisions based on rapid but non-trivial assessments of the current agreement of positions drawn from scientific research [10]. While there are review journals and even informatics services designed to supply summaries and expert assessments, these review services suffer from often severe publishing latency and inconsistent quality [11,12]. Moreover, evidence evaluation services (Cochrane, MDLinx) deliberately ignore much of the literature and favor specific perspectives (e.g., evidence-based medicine) over others [13]. There is a need to provide a guide to assist practitioners and multidisciplinary researchers in directly evaluating the literature as well as published and electronically accessible data. There is also a need to highlight the theoretical knowledge frameworks that guide construction, presentation and interpretation of research data and findings [14]. Both the interpretive and original scholarly literatures of all disciplines and the data and material they are based upon can now be accessed via digital libraries, repositories, publisher services, and OPACs hosted by academic libraries [5]. This accessibility has been developed through investments by universities and governments in public research libraries and open access regulations [15], as well as through innovative services developed and marketed by publishers. Since almost any published product of research and interpretation can now be delivered through the Internet, few technical barriers remain to achieve reliable archiving and retrieval of all forms of information artefacts. These include primary research papers but also other forms of text-based sources (e.g., patents, white papers, reviews, indexes, transcripts) and, increasingly, rich media sources. But despite (or due to) advances in subject information retrieval, information seekers often face intractable information-seeking problems related to information opportunity overload, knowledge overproduction, and disciplinary language complexity. Some scholars propose that the need for gatekeepers has increased, not diminished, due to overload and the information seeker's inability to acquire sufficient context [16]. ICR is a generative, anticipatory approach to gatekeeping, wherein interested scholars and stakeholders (but not necessarily experts in a problem) review and formally assess knowledge artefacts, the ideas within them, and their relationships to the problem to enable their dissemination as indexed scholarly review products available to open web searching. Publications have become so specialized that adaptation of knowledge between disciplines becomes an ever-increasing challenge, a cultural language challenge not resolved by technology. Even within major disciplines, competing factions arise and persist over decades, resulting in fragmented perspectives and guidance for major applications, such as healthcare and drug therapies. With over 600,000 new articles from almost 5,000 journals indexed by the National Library of Medicine in 2006 in the domain of healthcare alone [17], practitioners and researchers have trouble identifying and locating information that is germane (useful), interpretable (usable), and directly applicable (ready to use) to specific problems faced in their practices and their studies.
This problem becomes particularly acute when practitioners and researchers are engaged in interdisciplinary integrative research aimed at delivering knowing in action [18]. While there are language translators available on the web, we know of no disciplinary or problem-space decoders that effectively translate the relevance of given publications for the trans-disciplinarians interested in wicked problems. Such a decoder would need to be trusted. Indeed, Thiede [19] has argued that trust plays a key role within the transactional process of information exchange and communicative interaction. Moreover, he points out that trust both enables and is generated by communicative interaction. For too long, information access strategies have relied on authority rather than trust to help guide distribution of information. Information access must go beyond the mere possibility of access to embrace development and engagement of a capacity for access by communities and individuals. The decoder would also need to be dynamic, improving its functionality through the sharing of experience by its users. The ICR process is being designed to serve this decoding function.
2. Background

2.1 The Nature of Relevance
The ICR process is being designed to create a new venue for applying scholarship in the 21st century through relevancy mapping. Since we are advocating the need for collaborative, humanly-assigned relevancy, we propose a model for associating a scale of relevancy indicators to artefacts. The purpose of this discussion is to share the theoretical background; we do not present the scoring method alternatives in this paper. The adjective relevant is derived from the Latin word relevare, meaning to raise up or highlight. In most forms of common usage, relevance refers to properties and attributes of an information source that has bearing upon, is connected with, or is pertinent to a matter at hand [20]. Table 1 summarizes what we perceive as analogous nodes of meaning within conceptualizations of relevance and knowledge. Saracevic's [21] phenomenological framework for characterizing relevance emphasizes the importance of relating knowledge-seeking themes developed by individuals to the social situation, or Schutz's "lifeworld" [22], in which this theme operates. This formulation explicitly recognizes the social and constructed nature of relevance. Schutz recognizes relevance as being composed of a system of relevancy relationships consisting of topical relevance, interpretational relevance, and motivational relevance. These relationships reflect the knowledge relationship as understood within a practice or scholarly community. Wenger's [23] framework describes three dimensions of knowledge within a community of practice: what it is about, how it functions, and what capability it has produced. This corresponds to the three knowledge domains specified by Spender [24]: 1) data (abstraction), 2) meaning (codification) and 3) practice (diffusion). Mizzaro [25] has defined relevance in an information retrieval (IR) context as a relation, useful in guiding IR, that exists between any two entities of two groups, where one group includes documents, their surrogates, and the information they contain, and the other group includes problems, information needs, information requests, and information queries. Each of these relevancy relationships can be further decomposed into influences related to the topic, the context and the task. Thus Mizzaro [25], like many others, defines relevance as a system of articulating relevancy relationships. This system of relevancies is applied in different types of sensemaking or "abouts". Maron [26] has distinguished between three abouts: 1) the subjective about (the relation between the information and the resulting inner experience of the recipient), 2) the objective about (a well defined, explicit and external point of view) and 3) the retrieval about (the consequences of making that information available). These distinctions roughly reflect the different approaches for organizing knowledge: ontological, epistemological and methodological [14]. Building on these insights we define relevance of an information artefact to a problem as: a quality or attribute either providing guidance (it is germane and pertinent), perspective (it is material and valid) or options (it is applicable and mature) to a practice, situation, issue, problem, subject or other matter at hand. We will refer to these three quality dimensions as precedence, validity and maturity (Table 1).

2.2 ICR as a Form of Collaborative Informatics
Collaborative informatics is an emerging information practice developing from the requirement to improve decision making and understanding among professionals in research and practice by drawing from and integrating multiple personal and disciplinary knowledges and perspectives. Shortliffe and Blois [27] define (biomedical) informatics as: the scientific field that deals with (biomedical) information, data, and knowledge - their storage, retrieval and optimal use for problem solving and (their optimal use for) decision making.
We view collaboration as a structured recursive process where multiple individuals work together toward a common goal - typically an intellectual endeavour that is creative in nature - by sharing knowledge, learning and building consensus [28]. An important feature of collaboration is that it does not require leadership, only communication and self-organization. Scholarly or professional collaboration occurs within a community of practice comprised of members with distinct responsibilities and competencies who are mutually dependent on each other to accomplish the shared practice goals. Collaborative inquiry or learning within a group of stakeholders should be multilateral, where all members are expected to participate and benefit from the process. Pennefather and Suhanic [29] have described how diagnostics (the structured decision support leading to a diagnosis) resembles the stages of learning described by Bloom's taxonomy [30] and the stages of a structured inquiry that guides structured dialogic design [31]. These levels can be aggregated into three distinct phases (Table 2): 1) Discovery, 2) Deliberation, and 3) Dissemination. In a sense the ICR is a form of collaborative diagnostics aimed at developing a shared understanding of the meaning of the relevance of a collection of information to a particular problem.

                      Precedence (Germane)     Validity (Material)              Maturity (Applicable)
                      fit to the issue         ways of knowing the issue        usability & actionability
Source                (guidance/topic)         (perspectives/insight/context)   (options/applicability/task)
Schutz [22]           Topical                  Interpretational                 Motivational
Wenger [23]           What is it about?        How it functions?                How can it be used?
Spender [24]          Meaning                  Data                             Practice
Maron [26]            Subjective About         Objective About                  Retrieval About
Thomas [14]           Ontological              Epistemological                  Methodological
Amin & Roberts [18]   Professional             Epistemic/Creative               Craft/Task based

Table 1. Dimensions of relevance and their analogy with other conceptualizations of knowledge
Phases:                    Discovery; Deliberation; Dissemination
Learning (1):              Comprehension; Application; Analysis; Synthesis; Evaluation
Interactive Inquiry (2):   Discovery/Engagement; Definition/Mapping; Design; Decision Making; Action Planning
Diagnostics (3):           Initiation; Sensing; Analysis; Diagnosis; Reporting

(1) Bloom [30]; (2) Schreibman and Christakis [31]; (3) Pennefather and Suhanic [29]

Table 2. Analogous levels of different forms of collaborative construction of knowledge
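To make the triad and the three relevance dimensions more concrete, a minimal sketch of how a prototype might record one reviewer's relevancy assignment is given below. It is illustrative only and is not the scoring method (which, as noted above, is not presented in this paper); the class name, field names and the 0-5 scale are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class RelevanceAssignment:
    """One reviewer's judgement of one artefact against a trigger question.

    Dimension names follow the paper (precedence, validity, maturity); the
    0-5 scale and the perspective label are illustrative assumptions only.
    """
    artefact_id: str        # e.g. a DOI or repository identifier
    reviewer_id: str
    perspective: str        # reviewer-declared perspective (context)
    trigger_question: str   # the problem the review is centred on (topic)
    setting: str            # the setting where the problem matters (task)
    precedence: int = 0     # fit to the issue (germane)
    validity: int = 0       # ways of knowing the issue (material)
    maturity: int = 0       # usability and actionability (applicable)
    comment: str = ""

def aggregate(scores: List[RelevanceAssignment]) -> Dict[str, float]:
    """Average each dimension across reviewers for one artefact."""
    n = max(len(scores), 1)
    return {
        "precedence": sum(s.precedence for s in scores) / n,
        "validity": sum(s.validity for s in scores) / n,
        "maturity": sum(s.maturity for s in scores) / n,
    }
```

Aggregating such records across reviewers is one way a deliberation phase could surface where perspectives converge or diverge on a given artefact.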
3. Theory and process of ICR

3.1 Toward a theory of problem-centered knowledge communities
We have adopted a constructivist perspective in designing the ICR process. Participants in an ICR process are viewed as members of a learning community, even a community of practice, formed around a problem of common interest [18, 23]. This community is defined by the triad of: 1) specific problems (topic), 2) diverse stakeholders and reviewer perspectives (context), and 3) particular settings where the problem matters (task) [25]. Relevance will exist and be recognizable to the community as an articulated network of relationships linking aspects of meaning expressed through this triad. The goal of the ICR is to make evident the individual meanings perceived by community members and to explore the consensus that develops regarding that meaning through ICR-mediated deliberation. The ICR process aims to create an environment where necessarily multi-perspectival communities can collaboratively build knowledge and construct meaning. We anticipate that this collaboration will generate new synthesis, innovation and transfer of knowledge which can be fed back into a larger system of published reviews, citations, and
accounts. A key role for any form of review is to identify, within a plethora of alternatives, a subset of specific information sources that are relevant to specific needs and to explain why those sources are significant. This is a form of knowledge synthesis which infers generalities from specific instances. The Canadian Institutes of Health Research (CIHR) has defined knowledge synthesis as "the contextualization and integration of research findings of individual research studies within the larger body of knowledge on the topic". This is seen to be an essential first stage in knowledge transfer and translation. Knowledge translation is defined by CIHR as "a dynamic and iterative process that includes synthesis, dissemination, exchange and ethically sound application of knowledge to improve the health" [32]. In our opinion, the best reviews integrate synthesis, innovation and transfer of knowledge. In a sense such reviews help information seekers to develop scholarly literacy as it relates to problems those seekers are concerned with. This is especially true when the problems impinge on fields of knowledge within which seekers have limited experience. The UNESCO definition of literacy states that it is "the ability to identify, understand, interpret, create, communicate, compute and use printed and written materials associated with varying contexts. Literacy involves a continuum of learning to enable an individual to achieve his or her goals, to develop his or her knowledge and potential, and to participate fully in the wider society" [33]. An important outcome of literacy is an increased and continual access to knowledge, and with continued learning in a problem area comes the generation of new knowledge. Knowledge can be thought of as "justified true belief" that increases the capacity for effective action by a community of practice or a community sharing a common problem or goal [34]. Thus, literacy can be conceived as a capacity to access and exchange knowledge, and it serves an important role in helping a group engaged in sharing knowledge to judge the relevance of information to a particular problem. Another aspect of literacy embodied in the literate scholar is the quality of perceptiveness, which we see as the enabling of tacit knowledge to recognize patterns that may be overlooked by others but become apparent once pointed out. Tacit knowing [35] is essentially "that which we know but cannot tell," and is considered both the basis and consequence of expertise and deep domain knowledge acquired over time by practice and reflection. We propose that Nonaka's [34] dynamic model of knowledge creation (based on a cycle of knowledge exchange among members of a community, Fig 1) matches the desired ICR process, where the community would be a virtually-organized problem-review team. The ICR process is being designed to support most and perhaps all of the four-phase cycle of knowledge creation, effectively in an online collaborative dialogue setting. This cycle supports a model of knowledge synthesis, translation, innovation, and transfer through exchanges among reviewers reflecting different disciplinary perspectives and also necessarily different tacit sources of expertise and experience. It is this form of knowledge translation [34] that we seek to embody in the ICR process. Nonaka [34] presents a conceptual cycle of knowledge translation from tacit to external forms, then returning external back to tacit, embedded knowledge when it has been learned and internalized effectively.
In this cycle, tacit knowledge, inherent in expertise and deep disciplinary knowledge, is translated or exchanged in collaborative dialogue in the ICR process. As reviewers select, review, and assign relevance scores to artefacts in a problem-centered collection, they reveal perspectives and knowledge preferences, and are compelled to state claims that might otherwise remain as tacit knowing regarding a problem. The socialization process noted in the upper left part of this cycle is inherent in scientific communities, a tacit-to-tacit exchange wherein background knowledge and explicit support for claims is left unsaid, and remains understood as part of the community's background of knowledge. The ICR process aims to make this exchange evident and accessible, revealing the development of background assumptions, foundational concepts, and inherent expertise.
[Figure 1 depicts Nonaka's knowledge translation cycle between tacit and explicit knowledge, with four modes: socialization (tacit to tacit), externalization (tacit to explicit), combination (explicit to explicit), and internalization (explicit to tacit).]
Figure 1. Tacit knowledge translation cycle [34]

Scholars recognize their identification with an "invisible college" community to the extent they have internalized the encoding and shared background of a problem area or discipline. But since we are interested in amplifying weaker signals of meaning in order to articulate or bind together interdisciplinary research and to strengthen the collaborative review, our intent with ICR is to recruit and disclose the tacit formulations researchers may hold. We do so by engaging their perspectives through having them collaboratively reviewed and published as accessible knowledge artefacts that inform a problem or decision domain. The translation from tacit to "explicit," what Nonaka calls externalization, is necessary to produce an exchange of form in the sharing of explicitly interpreted claims in an ICR review. Especially in a review context, knowledge claims that may seem obvious to an expert will be revealed and explicated (externalized) in a form available to non-peer participants, requiring a translation process that combines perspective and knowledge claim within a capsule review form. We view this explicit exchange of reviews, while peripheral to any directed aim of "problem solving," as supporting participants in formulating the "combination" of knowledge identified in the knowledge creation/translation cycle. The disseminated ICR publication further combines knowledge represented in the individual reviews, the references to selected artefacts, and the continuation of dialogue among new participants in the intentional community surrounding the problem of interest.

3.2 Informatics Model of the ICR process

The ICR process has been conceptually formalized as an online environment that convenes a multi-perspectival, temporary, problem-centered community to facilitate collaborative evaluation of generalizations. While other prototypes and services have been identified that have attempted a similar process (e.g. the Digital Document Discourse Environment [36]), these other approaches differed in significant ways from the current design. Some of these distinctions include: 1) it is explicitly problem or question centered, 2) it engages multiple perspectives by design, 3) a strong editorial role is taken to ensure quality control and artefact selection, 4) a strong model of human-assigned relevancy is embedded in the process, and 5) this relevancy model is construed as an open indexing schema. Other differences exist in the theoretical frames underpinning the ICR, such as: 1) its strong view of knowledge translation and collaborative dialogue, 2) an inherent theory of scholarly motivation, and 3) a requirement or goal to distribute an indexable publication as a collaborative outcome. We articulate the ICR process as the triad of 1) specific problems, 2) diverse stakeholders and reviewer perspectives, and 3) particular settings or venues. These are linked to information about perceived relevancy
to generate shared meaning for a defined problem. We assert that a dialogical collaborative evaluation process is necessary to conserve tacit knowledge of autonomous participants within this community and to test knowledge propositions efficiently. Effectiveness of this dialogical process requires recognition of participant autonomy and authenticity, parsimony of resources, and an allowance for evolutionary learning and opinion changing in the process [31]. While not yet fully developed as an interactive prototype, the process has been tested in classroom and design settings. Fig 2 outlines a proposed ICR workflow that prototypes will have to support. The general workflow aims to develop an inclusive and non-hierarchical online process that allows assembly of a limited group of interested parties with diverse perspectives to arrive at agreement about the substance and extent of a problem and its deliberation. Participants engage in a collaborative effort to discover, deliberate about, and evaluate the relevancy of artefacts selected and screened by participants as candidates for the review. This live intervention enables access to tacit internal and external knowledge networks available to reviewers. The ICR process shares some surface similarity to the canonical scholarly publishing process, involving editors and boards of selected reviewers. The most significant difference is the ad hoc nature of the ICR, where a review may be registered (online) and organized at the time a problem area is identified for review. The ICR is initiated by an editor who, individually or by convening an editorial board, establishes a new ICR focused on a problem of concern to some community. The ICR is formulated around a triggering question, which may be identified by the editor (1), or circulated in review with up to 2 candidate questions.
[Figure 2 is a workflow diagram linking four activity clusters: (1) trigger question formulation (candidate questions are proposed and dialogued, and the selected question initiates the review); (2) collection building and editorial functions (editors start a new 3DR, build an initial collection from personal papers and Scopus and scholarly searches, narrow it to a top 30-50 core collection using criteria such as diversity, range and 3D relevancy, invite reviewers, moderate dialogues, prompt participants to complete reviews or scores, and declare completion); (3) reviewer functions (registering a profile with a perspective indicator, setting a to-read list, reviewing and scoring artefacts against the trigger question, adding new papers, and summarizing their contribution for publication); and (4) dialogue on articles (commenting on articles or sections with respect to the focus question, responding to other reviewers, and summarizing the dialogue across the review), leading to generation of the publication.]

Figure 2. ICR process flow.

Figure 2 illustrates four processes, each showing a clear listing of steps or tasks engaged in the process. The specification and framing of the trigger question (1) is one of the most important decisions of the editor, as the question sets the scope for relevancy of materials. The trigger question must invite and allow
multiple perspectives to converge on the problem itself, and should not introduce jargon or coding that disenfranchises perspectives. The "problem" as perceived by participants will invariably be judged by the needs of their personal disciplines and experiences, and the ICR is designed to specify a problem scope in the form of a question (or set of nested questions) that inspires multilateral inquiry. Given the formulation and review (and voting) on the trigger question(s) as the scope, the reviewing editors (2) each contribute sets of artefacts (papers, materials, data) they consider initially relevant to the problem and question. The editors conduct a scoping review of sorts to narrow the set of artefacts to be reviewed (by the full panel of review participants) to a manageable number of initiating items. This set of materials becomes the initial core collection to be reviewed, scored, and deliberated by participant reviewers. Participants are invited to join the ICR for a problem domain question for which they are known or are expected to share interest. They are invited to register and to declare a perspective relating to the question in their background statement. Reviewers select articles (3) from the core collection that they choose to read, identify key ideas that may be linked as information objects in the review, and write a brief review of their contribution. Reviewers then score the article for its relevancy to the problem, based on their review and consideration of its validity, maturity, and precedence as a contribution of published knowledge to a problem. In the next iteration participants can suggest new material to be reviewed, commented upon and scored. This process proceeds until a shared understanding is deemed to be sufficiently evident. Reviewed articles are disclosed to participants as available for dialogue, which differs from other types of scholarly commentary. In an ICR, deliberation (4) questions and develops ideas drawn forth in the reviews themselves, not the articles per se. The ICR process is being designed to elicit significant passages from articles in the collection, amplified by review commentary, and proffered to the community of reviewers for deliberation. The intention of this method is that of drawing out salient and even citable concepts that may have been overlooked in prior reading of the articles or of similar literature or perspectives. When editors declare the review completed, the full ICR corpus may be baselined and published online as a collection. Very soon, most of the desired primary publications that might be incorporated in a collection (as defined in the Creative Commons licensing language) will be available from freely accessible preprints of published papers [15]. This collection would include the artefact reference lists, the artefacts selected for review (and those dismissed with respect to the question), the reviews and deliberative comments on the core collection, and the relevancy scores. This rich set of contextual materials will be published as an ongoing inquiry, with an invitation for other interested readers, scholars and practitioners to further engage with the online review publication. This report can also be mined and integrated into an even more interpretive review where knowledge is integrated and specific conclusions related to the problem area are advanced.
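A minimal sketch of how a prototype might represent this workflow as data is given below. It is illustrative only, not the design of the ICR platform; the type names, the phase transitions and the guard conditions are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class Phase(Enum):
    DISCOVERY = "discovery"          # trigger question framed, core collection built
    DELIBERATION = "deliberation"    # reviews, scores, and dialogue on artefacts
    DISSEMINATION = "dissemination"  # corpus baselined and published as a 3D-Review

@dataclass
class ICRReview:
    trigger_question: str
    editors: List[str]
    reviewers: List[str] = field(default_factory=list)
    core_collection: List[str] = field(default_factory=list)  # artefact identifiers
    phase: Phase = Phase.DISCOVERY

    def add_to_collection(self, artefact_id: str) -> None:
        """Editors (or, in later iterations, reviewers) add a candidate artefact."""
        if artefact_id not in self.core_collection:
            self.core_collection.append(artefact_id)

    def open_deliberation(self) -> None:
        """Move to deliberation once reviewers are registered and a core collection exists."""
        assert self.reviewers and self.core_collection
        self.phase = Phase.DELIBERATION

    def declare_complete(self) -> None:
        """Editor declares the review complete; the corpus can be baselined and published."""
        self.phase = Phase.DISSEMINATION
```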
3.3 The ICR as a Novel Scholarly Publication
We conceive the outcome of an ICR process as a completed review publication, similar to an online review journal in genre. We conceive of this publication as a complete set of reviews and dialogues associated directly with articles (artefacts) within an interactive website indexed by these texts, and findable by association to the trigger question. By searching text associated with the issue, and by browsing subject taxonomies that link a problem to multiple subject classifications and keyword tags, we believe the ICR will serve as a timely response to scholars interested in questions similar to those explored in the ICR review. Figure 3 illustrates the format of the publication and its post-publication indexes. The published ICR report and 3DR disrupt current publication categories. Figure 3 shows the "masthead" declaring the publication as similar to a review journal or special issue, with an issue editor, and perhaps a series or imprint editor if a "sponsored" ICR. Being a novel published format, the preferred bibliographic information and citation are identified, and the permanent and alternate web locations are noted as well. The online format
is conceived as an expandable collection of the ICR components, with contextual statements added by the editors to produce an interactive, readable, responsive online publication. Depending on the intended audience, the reports and reviews can be associated with a digital library of artefacts or simply with bibliographical pointers to the accessible locations of these artefacts.
[Figure 3 diagrams the ICR publication format and its post-publication indexing. The publication is presented like an online collaborative-review "special issue", with an editor/editorial board and reviewers/contributors, a citable reference and online destination, and sections for: 1) the focus question (for example, a collaborative review of what informatics practices from research support the needs of primary health providers in global healthcare practice); 2) context, comprising an initial editorial overview, final summary and a consensus statement reflecting purpose, outcome and multiple perspectives; 3) discussion of issues (dialogues and links); 4) links to reviews; 5) articles and references; and 6) review "sign-off" statements. Post-publication indexing covers the ICR title, date, subject keywords, full text, editor and board, reviewers and contributors, citations and article links, reviews and scores, and the full text of dialogues with links to external sources and additional references; new articles can then find and cite the ICR through web searches.]

Figure 3. The ICR publication and post-publication indexing.

Post-publication indexing shows that certain document descriptors will be established as metadata and indexing terms for pervasive web indexing and findability in search engines and scholarly index services. A novel function of the ICR publication will be the text indexing of newly contributed materials in reviews and dialogue. All of these contributions could, in theory, be citable and linked to and from other publications. We anticipate using an open text indexing system that leverages search engine optimization techniques rather than traditional metadata indexing. The ICR generates a substantial body of original text statements, associated directly with known published articles and their links (DOIs), which magnifies context and improves the findability of the ICR publication. For example, when scholars interested in an author's work perform an open web search on the author's name, and that author's work is referenced and discussed in an ICR, the high number of links and text references can be expected to optimize its ranking in current search algorithms. Unlike traditional or even online journals, the full text of an ICR publication will be completely issue-focused, meaning that online searches by researchers investigating that issue will readily surface the ICR publication, increasing its value as a timely and perhaps newsworthy review publication.

4. Conclusion
Despite continual advances in technology, the cognitive task of identifying and locating candidate artefacts appropriate for consideration in problem-centered reviews remains daunting. We propose that by building a many-to-many network of human relevancy assessments of candidate artefacts and facilitating their evaluation, we can surpass the limitations of computer-based algorithms and create meaningful relevancy maps of information artefacts that are germane and useful with respect to important interdisciplinary [1,3] problems. People faced with a need to understand a particular issue or academic question often use various search techniques to identify electronically accessible information sources that may be relevant for constructing a mental model of an issue. Search engines use various retrieval algorithms to automatically rank the likely relevance of the information artefact to the matter at hand. The user then scans through retrieved lists and makes their own assessment. This works well for simple matters and can also be used to systematically identify
all papers on a new or prescribed topic [11]. But for more complex information problems where multiple perspectives of relevancy are apparent, it becomes more difficult to find problem-relevant materials or to develop a guide to an emerging or dynamic literature. We propose that a hybrid system of informatics algorithms and human assessment of relevancy, linked to specific contexts and multiple perspectives, may increase the societal utility of navigating electronically accessible collections of information sources. Gonzalez et al [37] have attempted to evaluate the quality of digital libraries and identify the pertinence and relevance of elements in these collections as key quality dimensions. They distinguish between pertinence and relevance with respect to where the relation between a document (text) and information need (query) sits. For pertinence, or more specifically cognitive relevance, the relation is perceived in the mind of the user of the digital library. General relevance, on the other hand, or more specifically systemic relevance, describes that relation as an objective, public, and social notion that can be established by general consensus. It is this public assessment of relevance that we are interested in. However, we recognize the need to engage a wide spectrum of perspectives and worldviews in order to establish a recognizable and widely acceptable consensus. We believe that attempts to map out relevance relationships and attempts to engage in "genuine dialogue" have many elements in common. We want to learn from what has been proposed in the past about engaging in genuine dialogue to develop a means by which a community can work together to create public knowledge about "a matter at hand", where the purpose of the dialogue is to create a shareable cognitive map of the relationships that define that problem. We propose that the principles of dialogical inquiry can be applied to facilitate this public and socially situated process. The dialogical process can be used to discover as a group which information sources are relevant to a particular topic, how that relevancy is coloured by the ways in which the information was produced and the situations in which it can be applied, and finally how that assessment of relevance is motivated by a need or an intention to complete a particular task. The theory and conceptualizations described above are being used to constrain and guide design of online platforms that will enable participants to track and reveal the processes they use to evaluate the pooled information artefacts associated with the reports and reviews. This draws out tacit knowing, by requiring participants to recognize and state the meaning value of knowledge they derive from the accessible information artefacts considered in the review. Furthermore, the process enables the explicit assignment of humanly-identified relevancy of those sources to a given problem, a relationship that can be preserved electronically and indexed so that external non-participants might locate the reviews as they seek information on problems similar to those indexed by ICR publications. We believe that this feedback will generate new knowledge and represent a new form of scholarly activity, which in turn will create a new medium for reuse of published or otherwise accessible scholarly papers and findings.
The syncopated nature of the information represented in the ICR products will resemble attempts by early 20th century authors like Joyce to deal with the "Gutenberg Galaxy" and produce "syncopated manipulation to permit inclusive or simultaneous perception of a total and diversified field. Such, indeed, is symbolism by definition - a collocation, a parataxis of components representing insight by carefully established ratios, but without a point of view or lineal connection or sequential order" [38, p. 267]. Following McLuhan we call this process a syncopated manipulation of accessible information. The process of ICR, driven by collaborative reflection and dialogical response, will reveal and accentuate weak beats in the information flow that resonate with the participants and can be amplified into new and more transmissible learning and teachings.
5. References and Notes

[1] A wicked problem is generally understood as an ill-defined, messy, circular problem that resists solution by logical analysis and planning. As originally defined by Rittel and Webber [3], wicked problems embody ten characteristics that distinguish them from "tame" problems. Many interdisciplinary research issues entail wicked problems, requiring the coordination of multiple disciplinary views.
[2] HOFMEYER A, NEWTON M, and SCOTT C. Valuing the scholarship of integration and the scholarship of application in the academy for health sciences scholars: recommended methods. 2007, Health Research Policy and Systems, vol. 5, p. 5.
[3] RITTEL H and WEBBER MM. Dilemmas in a general theory of planning. 1973, Policy Sciences, vol. 4(2), p. 155-169.
[4] HEY T and TREFETHEN A. The data deluge: an e-science perspective. 2003, in Grid Computing: Making the Global Infrastructure a Reality. Chichester: Wiley, p. 809-824.
[5] BORGMAN CL. Data disciplines and scholarly publishing. 2008, Learned Publishing, vol. 21, p. 29-38.
[6] DERVIN B. On studying information seeking methodologically: the implications of connecting metatheory to method. 1999, Information Processing and Management, vol. 35, p. 727-750.
[7] DERVIN B. Libraries reaching out with health information to vulnerable populations: guidance from research on information seeking and use. 2005, Journal of the Medical Library Association, vol. 93(4 Suppl), p. S74-S80.
[8] KESELMAN A, LOGAN R, SMITH CA, LEROY G, and ZENG-TREITLER Q. Developing informatics tools and strategies for consumer-centered health communication. 2008, J. Am. Med. Inform. Assoc., preprint published April 24, 2008; doi:10.1197/jamia.M2744.
[9] LYNCH CA. The shape of the scientific article in the developing cyberinfrastructure. 2007, CT Watch Quarterly, vol. 3. Accessed 21 May 2008.
[10] DAVIS D, EVANS M, JADAD A, PERRIER L, RATH D, SIBBALD G, STRAUS S, RAPPOLT S, WOWK M, and ZWARENSTEIN M. The case for knowledge translation: shortening the journey from evidence to effect. 2003, British Medical Journal, vol. 327, p. 33-35.
[11] Editorial. Many reviews are systematic but some are more transparent and completely reported than others. 2007, PLoS Medicine, vol. 4, p. e147.
[12] TRICCO AC, TETZLAFF J, SAMPSON M, FERGUSSON D, COGO E, HORSLEY T, and MOHER D. Few systematic reviews exist documenting the extent of bias: a systematic review. 2008, Journal of Clinical Epidemiology, vol. 61, p. 422-434.
[13] MILES A, LOUGHLIN M, and POLYCHRONIS A. Medicine and evidence: knowledge and action in clinical practice. 2007, Journal of Evaluation in Clinical Practice, vol. 13, p. 481-503.
[14] THOMAS P. General medical practitioners need to be aware of the theories on which our work depends. 2006, Annals of Family Medicine, vol. 4, p. 450-454.
[15] MAYOR S. Opening the lid on open access. 2008, British Medical Journal, vol. 336, p. 688-689.
[16] LU Y. The human in human information acquisition: understanding gatekeeping and proposing new directions in scholarship. 2007, Library & Information Science Research, vol. 29(1), p. 103-123.
[17] National Library of Medicine, 2007. http://www.nlm.nih.gov/pubs/factsheets/nlm.html
[18] AMIN A and ROBERTS J. Knowing in action: beyond communities of practice. 2008, Research Policy, vol. 37, p. 353-369.
[19] THIEDE M. Information and access to health care: is there a role for trust? 2005, Social Science & Medicine, vol. 61, p. 1452-1462.
[20] Relevance. Oxford English Dictionary, online. 2008, Oxford University Press.
[21] SARACEVIC T. Relevance: a review of and a framework for the thinking on the notion in information science. 1997, in Readings in Information Retrieval. San Francisco: Morgan Kaufmann, p. 143-165.
[22] SCHUTZ A. 1977, Reflections on the Problem of Relevance. New Haven, CT: Yale University Press.
[23] WENGER E. 1998, Communities of Practice: Learning, Meaning, and Identity. Cambridge, UK: Cambridge University Press.
[24] SPENDER JC. Method, philosophy, and empirics in KM and IC. 2006, Journal of Intellectual Capital, vol. 7(1), p. 12-28.
[25] MIZZARO S. Relevance: the whole history. 1997, Journal of the American Society for Information Science, vol. 48, p. 810-832.
[26] MARON ME. On indexing, retrieval, and the meaning of about. 1977, Journal of the American Society for Information Science, vol. 28, p. 38-43.
[27] SHORTLIFFE EH and BLOIS MS. The computer meets medicine and biology: emergence of a discipline. 2006, in Biomedical Informatics: Computer Applications in Health Care and Biomedicine, EH Shortliffe, editor, 3rd ed. New York: Springer-Verlag, p. 24. http://www.dbmi.columbia.edu/shortliffe/Ch01.pdf
[28] Collaborative. Oxford English Dictionary, online. 2008, Oxford University Press.
[29] PENNEFATHER P and SUHANIC W. Diagnostics. Laboratory for Collaborative Diagnostics, www.lcd.utoronto.ca
[30] BLOOM BS. 1956, Taxonomy of Educational Objectives. Handbook 1: Cognitive Domain. New York: David McKay.
[31] SCHREIBMAN V and CHRISTAKIS AN. New agoras: new geometry of languaging and new technology of democracy: the structured design dialogue process. 2007, International Journal of Applied Systemic Studies, vol. 1, p. 15-31.
[32] Knowledge Translation, 2007, CIHR. http://www.cihr-irsc.gc.ca/e/29418.html
[33] UNESCO. 2008. AIMS and Literacy Assessment, UNESCO Bangkok website. http://www.unescobkk.org/index.php?id=3749. Retrieved 22 May 2008.
[34] NONAKA I. A dynamic theory of organizational knowledge creation. 1994, Organization Science, vol. 5, p. 14-37.
[35] POLANYI M. 1966, The Tacit Dimension. New York: Anchor Day.
[36] SUMNER T and BUCKINGHAM SHUM S. From documents to discourse: shifting conceptions of scholarly publishing. 1998, Proc. CHI 98: Human Factors in Computing Systems, 18-23 April 1998, Los Angeles, CA. New York: ACM Press.
[37] GONZALES MA, MOREIRA BL, FOX EA, and WATSON LT. What is a good digital library? A quality model for digital libraries. 2007, Information Processing and Management, vol. 43, p. 1416-1437.
[38] MCLUHAN M. 1962, The Gutenberg Galaxy: The Making of Typographic Man. Toronto: University of Toronto Press. (p. 267)
Joining Up 'Discovery to Delivery' Services
Ann Apps; Ross MacIntyre
Mimas, The University of Manchester, M13 9PL, UK
e-mail: ann.apps@manchester.ac.uk; ross.macintyre@manchester.ac.uk
Abstract
Zetoc is a bibliographic current awareness service that provides discovery of relevant literature within the British Library's Electronic Table of Contents of journal articles and conference papers. A researcher having discovered an article of interest will wish to read it, preferring to locate an electronic copy of an article to be delivered directly to their desktop. However, until now, Zetoc was essentially the British Library's document delivery catalogue, containing details of journals that are published traditionally. The lack of open access articles in Zetoc, because there would be no reason to order and pay for copies of these articles, implied a deficiency in Zetoc as a current awareness and general article discovery service. This paper describes the introduction of open access article records into Zetoc by OAI-PMH harvesting from UK PubMed Central. The prototype concentrates on biomedicine and initially BioMed Central journals. But the paper discusses future extension to other disciplines, as well as general requirements for sharing bibliographic article records.
Keywords: open access; bibliographic article record; OAI-PMH; harvest; biomedical research services.

1. Introduction
Zetoc [1] is a bibliographic current awareness service that provides discovery of relevant literature within the British Library's Electronic Table of Contents of journal articles and conference papers. Hosted at Mimas, at The University of Manchester, Zetoc is available to researchers, learners and teachers in UK Higher and Further Education, as well as to members of several other organisations including Irish colleges and health care trusts. The Zetoc database holds details of around 20,000 current journals and 16,000 conference proceedings per year, covering all disciplines, with data available from 1993 and updated daily. Searches using the Zetoc Web or Z39.50 interfaces yield bibliographic citation details of the discovered articles. Zetoc Alert is an alerting service, via either email or RSS feeds, which provides tables of contents of new issues of journals. Each article in an alert is accompanied by a persistent URL enabling direct access to its full record within the Zetoc Web interface, and hence location of the article. Once a researcher has discovered an article of interest, they will wish to read it, and therefore to locate, and request delivery of, its full text. Zetoc provides access to the British Library's document delivery service, which requires payment, and also assistance with requesting articles from an institution's library via traditional inter-library loan routes. But researchers prefer to locate an electronic copy of an article to be delivered directly to their desktop [2]. Thus the full record of an article in Zetoc also includes a link to the reader's institution's OpenURL resolver [3] if this service is available, or alternatively to free article location services including Google Scholar [4], Scirus [5] and Copac [6]. However, because Zetoc is essentially the British Library's document delivery catalogue, it contains details of journals that are published traditionally, the majority still published in print as well as electronically.
Zetoc does not include open access electronic journals because there would be no reason to order and pay for copies of these articles. Thus a natural extension to Zetoc, as a current awareness and general article discovery service, is to include details of open access articles where there is no overlap with existing content. Mimas also hosts UK PubMed Central (UKPMC) [7], based on PubMed Central, the US National Institutes of Health (NIH) free digital archive of biomedical and life sciences journal literature; Mimas provides this service jointly with the British Library. UKPMC provides an Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) interface [8], which includes timely details of new articles. This raised the possibility of incorporating details of open access journal articles in UKPMC into Zetoc, both into the searchable database and into Zetoc Alert. Because these articles are open access, it is possible to provide a direct link from the Zetoc full record to the full text of the article in UKPMC.

2. Methodology
2.1. Domain and Publisher for Prototype
It was decided to concentrate initially on importing details of biomedical open access journals into Zetoc, in particular those published by BioMed Central (BMC) [9]. BMC states that 'all research articles published by BioMed Central are archived without delay in PubMed Central'.

2.2. Data Mapping
The first task was to map the UKPMC data within the OAI-PMH feed to the properties within the Zetoc namespace, to investigate any significant omissions and ascertain any requirements for new fields. The UKPMC OAI-PMH metadata format chosen is the article header (or front matter) details ('pmc_fm'), since the simple Dublin Core ('oai_dc') metadata format lacks precision. The disseminated UKPMC article header details are in XML according to the NLM DTD [10]. UKPMC also makes the full text articles available via a further OAI-PMH metadata format ('pmc'), but that is beyond requirements because Zetoc contains only bibliographic citation details of articles. The data mapping to existing data fields in the Zetoc namespace is shown in Table 1.

Zetoc                 UKPMC
Article Title         <article-title>
Author                <contrib contrib-type="author">
Journal Title         <journal-title>
ISSN                  <issn pub-type="ppub">
Publication Year      <pub-date pub-type="ppub"><year>
Date of Publication   <pub-date pub-type="epub"><day>,<month>,<year>
Volume / Issue        <volume> / <issue>
Pagination            <fpage>,<lpage>
Publisher             <publisher><publisher-name>
Abstract              <abstract>
Keywords              <kwd-group><kwd>

Table 1: Data Mapping from UKPMC to Existing Zetoc Data Fields

It was necessary to add several new fields to the Zetoc namespace and database to accommodate significant details in the UKPMC data records. Articles are identified in UKPMC by their PubMed identifier (PMID), so its capture was essential for providing a persistent URL for direct access to the full article. The Digital Object Identifier (DOI) is also a persistent URI, enabling a link to the publisher's copy of the
article. The PMID is also used to construct a Zetoc identifier for the imported articles. The new fields are shown in Table 2.

Zetoc                            UKPMC
PubMed Identifier                <article-id pub-id-type="pmid">
Digital Object Identifier (DOI)  <article-id pub-id-type="doi">
eISSN                            <issn pub-type="epub">
Copyright                        Constructed as agreed with BioMed Central

Table 2: New Zetoc Data Fields

The Zetoc Alert application requires a unique identification (within the Zetoc namespace) for a journal; for this it uses the British Library's 'shelfmark'. A similar identification for the new BMC journals was needed that does not conflict with the existing shelfmarks. This pseudo-shelfmark is constructed from the letters 'PM', the ISSN, and matching punctuation. BMC open access articles are available according to a Creative Commons [11] 'attribution required' licence, with no restrictions on sharing or remixing or commercial use. Following discussion with BioMed Central, a copyright statement was agreed, for instance, where the first author's family name is 'Smith': "© 2008 Smith et al; This article is distributed under the terms of the Creative Commons Attribution Licence (http://creativecommons.org/licenses/by/2.0)". There are a few data fields in the UKPMC record that are ignored on import to Zetoc, either because they are irrelevant to Zetoc, for example publication history details, or because they do not have a matching field in Zetoc, for instance author affiliations, which would need not only a new Zetoc field but also cross-referencing to author names. However, in the future author affiliations may be included to support improved author identification, which is becoming an increasing requirement, including: authenticating a researcher's work for assessment reporting, for which Zetoc has been used recently; or finding an author's papers in open access repositories. This is an area under investigation by various projects, including the Names Project [12], which has been informed by Zetoc as one of its sources of author data.

2.3. Data Enhancement
The noticeable omission in UKPMC article data is a subject classification, other than the author's keywords. Journals in Zetoc have Dewey Decimal Classification (DDC) [13] and Library of Congress Classification [14] terms, reproduced in each article record. The British Library provided DDC subject classifications for the BMC journals, which are added to the records during processing of the UKPMC import, via a look-up table using ISSN as its key. A process for the inclusion of DDC terms for new journals will be developed; Zetoc support staff will request these from the British Library and include them in the application via an administration interface. This introduces the possibility of a future extension to include improved subject classifications, which would enable possible future enhancements to Zetoc with more accurate subject-based end-user searching or alert selection, especially if coupled with a DDC-based terminology service such as the High-Level Thesaurus project (HILT) [15]. Another look-up table is needed for the publisher's country; UKPMC records the city. Currently this table contains only one entry, for BioMed Central in Great Britain. Again, Zetoc support staff will be able to update this table using an administration interface.

2.4. Data Load Implementation
Zetoc harvests open access OAI-PMH data (set = 'pmc-open') from UKPMC and selects from it articles published by BioMed Central. This harvest occurs nightly, immediately following the upload of the British
2.4. Data Load Implementation

Zetoc harvests open access OAI-PMH data (set = 'pmc-open') from UKPMC and selects from it the articles published by BioMed Central. This harvest occurs nightly, immediately following the upload of the British Library's Electronic Table of Contents data. Both sets of article records are aggregated to inform the Zetoc Alert service. The implementation sends a RESTful URL over HTTP GET [16] according to the OAI-PMH protocol, which returns XML conforming to the article header part of the NLM DTD. The transformation of the UKPMC data is driven by a mapping template defined in XML. This generic model should assist the future introduction of data harvesting and transformation from different suppliers and formats. The implementation is based on a similar application for the JISC Information Environment Service Registry (IESR) [17], which harvests details of ESDS International [18] macro- and micro-economic datasets from the UK Data Archive [19], transforming them from DDI metadata [20] into IESR format. For the initial prototype only current data is imported into Zetoc, but it is planned to import all the BioMed Central back data from UKPMC, whose records date from 2001. This will enhance the corpus of biomedical articles in Zetoc for discovery.
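A rough sketch of such a selective OAI-PMH harvest is given below. The base URL and metadataPrefix are illustrative assumptions (only the set name 'pmc-open' is taken from the text above), and resumption-token handling is omitted.

```python
# Hedged sketch of a selective OAI-PMH ListRecords harvest from a UKPMC-style
# endpoint. BASE and the metadataPrefix are assumed values, not confirmed ones.
import urllib.request
import xml.etree.ElementTree as ET

BASE = "http://ukpmc.ac.uk/oai.cgi"            # assumed endpoint
OAI = "{http://www.openarchives.org/OAI/2.0/}"

def harvest(from_date):
    url = (f"{BASE}?verb=ListRecords&metadataPrefix=pmc"
           f"&set=pmc-open&from={from_date}")
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    for record in tree.iter(f"{OAI}record"):
        yield record

# A downstream step would keep only BioMed Central articles and then apply the
# XML mapping template to transform each record into Zetoc fields.
```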
3. Results
3.1. Data Harvest

Zetoc is able to find details of BMC open access journal articles in a timely fashion by selecting them from the daily UKPMC OAI-PMH feed, and thus to provide opportune alerts to researchers. Since its introduction, the harvest of BMC article details from UKPMC has been trouble free. The quantity of new articles introduced into Zetoc is relatively small: 1325 BMC articles in a month, as opposed to 162300 journal articles from the British Library. It became apparent that some BMC articles do not have PubMed identifiers in the UKPMC data feed, and these are ignored in the Zetoc data load. On investigation it appears that the articles without PMIDs are supplementary or review articles, but PubMed Central have agreed to include PMIDs in the OAI-PMH data in the future wherever possible. There was some concern during the initial design phase that not all BMC content is open access, which would imply inaccessible links to the full article. However, on investigation, only open access BMC content is added to UKPMC; all research papers are open access.

3.2. Article Location

Figure 1 shows the full record for a BMC article in Zetoc; Figure 2 shows the document delivery links below the record. The result of following the UKPMC link is shown in Figure 3. A direct link is provided to the full text of the article in UKPMC, which, being open access, will be universally available. A second link is provided, at their request, to the article at BioMed Central, which is again open access. Because these links are based on global identifiers they are persistent, following good practice to minimise future loss of information [21]. The third link is to the user's institution's OpenURL resolver, in this example the John Rylands University Library at The University of Manchester. This is provided because it is assumed to be some institutions' preference and will potentially lead the reader to additional services about the article. The 'hidden' COinS machine-parsable bibliographic citation [22] is also provided alongside the OpenURL resolver link. The PubMed identifier and the Digital Object Identifier (DOI) are included in the OpenURL and COinS, information that is not available for other Zetoc articles. This more precise article identification will assist downstream consumers of the OpenURL or COinS.
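As a rough illustration of how such a machine-parsable citation might be embedded, the sketch below builds a COinS span carrying a DOI and PMID. The helper function and example values are invented for illustration and do not reproduce Zetoc's actual markup.

```python
# Hedged sketch: builds an OpenURL ContextObject query string and wraps it in a
# COinS <span>. The example identifiers are placeholders, not real article data.
from urllib.parse import urlencode
from html import escape

def coins_span(title, journal, doi=None, pmid=None):
    fields = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.atitle", title),
        ("rft.jtitle", journal),
    ]
    if doi:
        fields.append(("rft_id", f"info:doi/{doi}"))
    if pmid:
        fields.append(("rft_id", f"info:pmid/{pmid}"))
    return f'<span class="Z3988" title="{escape(urlencode(fields))}"></span>'

print(coins_span("An Example Article", "BMC Example", doi="10.1186/xxxx", pmid="12345678"))
```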
Figure 1: An Example Full Record for a BioMed Central Article in Zetoc
Figure 2: Links to the Full Article from Zetoc
Figure 3: Full Article in UKPMC

These links differ from the document delivery options shown for other Zetoc articles, which are not open access as far as Zetoc is aware. For articles that are not open access, in addition to the OpenURL resolver link, there are options to acquire the article through inter-library routes or to purchase it directly from the British Library. Obviously there is no point in providing links that require payment when a free version is available.

3.3. Zetoc Alert

Inclusion of the new BMC journals into the Zetoc Alert system has been seamless because their article details are appended, in the Alert data feed, to those supplied by the British Library. Zetoc Alert maintains its journal list from this data feed, automatically adding new journals to the potential alert list. A user wishing to locate an article from a link in a received alert is taken via its full record page in the main Zetoc application, again making the introduction of these new journals seamless.
4. Discussion
4.1. Enhanced Functionality

The import from UKPMC of BioMed Central journal article details is a significant introduction of high-profile, open access research literature in a particular discipline. Because the literature is open access, Zetoc is able to provide direct links to the full text of an article, at UKPMC and at BioMed Central. Although, at the time of writing, this is a recent introduction and has not yet been formally, publicly announced, links to full text articles are already showing use, with 38 accesses during April 2008. Previous Zetoc evaluation studies have indicated some dissatisfaction about the availability of articles among users from institutions that do not have OpenURL resolvers, or that have a low level of journal subscriptions
[23, 24]. Providing direct links to open access literature should address these concerns. An additional data field in the UKPMC data is the article abstract. The majority of Zetoc article records do not contain abstracts, so the inclusion of abstracts for the BMC articles is a significant enhancement, assisting researchers when choosing full articles to read.

4.2. Joining Up BioMedical Services

The British Library, jointly with the Technical Computing Group at Microsoft, is developing the Research Information Centre (RIC) [25]. This Virtual Research Environment is a desktop application to support researchers through the complete research lifecycle. The primary focus of the prototype RIC is biomedicine, the selected resources being those considered to be of importance to biomedical researchers. Zetoc, via its prototype Web Services SOAP interface [26], is one of the included resources. The introduction of BioMed Central journals into Zetoc will enhance the availability of biomedical literature within the RIC.

4.3. Extension to Other Open Access Journals

Clearly it should be possible to extend this model in the future to import into Zetoc details of all pure open access articles in UKPMC. But, as indicated above, journals will need manually assigned subject classifications and publisher countries. Thus it is probable that journals would be added to the import from UKPMC either by publisher or individually, rather than developing an automatic mass ingest of all UKPMC open access articles. Some articles recorded in UKPMC may also be published traditionally and so already be included in the Zetoc data feed from the British Library. Deduplication should be possible given the quality of the bibliographic citation data in both sources. However, having two records for an article would not really matter within a very large discovery service such as Zetoc, which already contains duplicate records for conference papers that are published in journals. It would rather give a reader the option of document delivery by either open access or traditional routes. Further, this solution could be used to import details of articles into Zetoc from other disciplines, if a suitable open access archive with a harvest interface and an acceptable data mapping were available. BMC also publish a couple of physics journals, according to their open access model, in PhysMath Central [27] and have suggested their inclusion in Zetoc. These journals are available within arXiv, the physics repository [28], from which article details could be harvested. Implementation would require a similar process to that described above for ingest from UKPMC, the first step being the definition of a data mapping from arXiv to Zetoc. If this were successful it would open up the possibility of introducing more open access physics literature into Zetoc.

4.4. Single Article Publishing

The article publishing model exhibited by UKPMC differs from the traditional journal issue model. BioMed Central articles are received by UKPMC, and hence imported into Zetoc and users alerted, as soon as they are available, because they are not tied to the printed paper paradigm in which a complete journal issue is assembled before publication. However, they do include the traditional bibliographic citation details within a journal, such as volume and pseudo-pagination, which is actually an article number within the volume.
(Although the Electronic Table of Contents data from the British Library consists of standalone records for articles, the complete table of contents for a journal issue is supplied consecutively in the data feed.)
Thus users of Zetoc Alert who subscribe to a BMC journal will not receive a complete journal issue table of contents in a single email. Rather they will receive multiple emails, over a period of time, for a journal issue, each containing a subset of the eventual articles. However, it was decided that the advantage to a researcher of receiving a very timely alert about the publication of an article of interest, which is also open access, should override any irritation at receiving multiple alerts rather than a single, complete table of contents. The introduction of the BMC journals into Zetoc is too recent to determine whether this situation is acceptable to researchers in practice.

4.5. Sharing Records of Open Access Literature

The data mapping to Zetoc from the UKPMC records available for harvesting via OAI-PMH has indicated a more general data requirement for sharing, between registries and repositories, bibliographic records of open access scholarly literature published in journals. The apparently necessary data fields are shown in Table 3, and further 'good to have' items in Table 4.

Data Field                        | Notes
Article Title                     |
Author Names                      | Family name and initials at a minimum
Journal Title                     |
ISSN                              |
Publication Year                  |
Volume and Issue                  | As per journal
Pagination                        | First page; last page optional
Subject Classification of journal | According to a standard scheme
Persistent URI                    |

Table 3: Data Fields for Sharing Journal Article Bibliographic Records

Data Field             | Notes
eISSN                  |
Date of Publication    |
Publisher              |
Country of Publication |
Abstract               |
Keywords               | E.g. author keywords
Global Identifiers     | E.g. DOI, PubMed Identifier
Table 4: Additional Useful Data Fields
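To make the shape of such a shared record concrete, the sketch below expresses the Table 3 fields (plus a few of the optional Table 4 items) as a simple data structure; the class and field names are illustrative only and do not define a proposed standard.

```python
# Illustrative sketch of a shareable journal-article record based on Tables 3 and 4.
# Field names are invented for the example; they do not define an actual schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SharedArticleRecord:
    article_title: str
    author_names: List[str]            # family name and initials at a minimum
    journal_title: str
    issn: str
    publication_year: int
    volume: str
    issue: Optional[str]
    first_page: str                    # or an article number ('pseudo-pagination')
    last_page: Optional[str] = None
    subject_classification: List[str] = field(default_factory=list)  # e.g. DDC
    persistent_uri: str = ""
    # 'good to have' items from Table 4
    eissn: Optional[str] = None
    abstract: Optional[str] = None
    keywords: List[str] = field(default_factory=list)
    global_identifiers: List[str] = field(default_factory=list)      # e.g. DOI, PMID
```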
5. Conclusions
This development 'joins up' two bibliographic services hosted by Mimas. It provides, through Zetoc, a single discovery and current awareness application, supplying bibliographic details of articles appropriate to a researcher's request, with the addition of abstracts for BioMed Central articles. The full text of discovered open access biomedical articles is immediately available for reading from UKPMC. The objective is to enhance the usefulness of Zetoc to biomedical researchers, and in the future to researchers in other disciplines, by accelerating 'discovery to delivery'.
6. Acknowledgements
The Zetoc service, including its development, is supported by the Joint Information Systems Committee (JISC) [29] of the UK Higher and Further Education Funding Councils and the British Library [30].
7. Notes and References
[1] Zetoc. Retrieved April 25, 2008, from http://zetoc.mimas.ac.uk/
[2] APPS, Ann; MACINTYRE, Ross. Customising Location of Knowledge. DC2006: Proceedings of the International Conference on Dublin Core and Metadata Applications, Manzanillo, Colima, Mexico, 3-6 October 2006. Universidad de Colima, 2006, p. 261-272. Retrieved April 25, 2008, from http://epub.mimas.ac.uk/papers/dc2006/appsmac-dc2006.html
[3] APPS, Ann; MACINTYRE, Ross. Why OpenURL? D-Lib Magazine. Vol 12 No 5, 2006, doi:10.1045/may2006-apps.
[4] Google Scholar. Retrieved April 25, 2008, from http://scholar.google.co.uk/
[5] Scirus - for scientific information only. Retrieved April 25, 2008, from http://www.scirus.com/
[6] Copac Academic and National Library Catalogue. Retrieved April 25, 2008, from http://copac.ac.uk/
[7] UK PubMed Central Free Archive of Life Sciences Journals. Retrieved April 25, 2008, from http://ukpmc.ac.uk/
[8] LAGOZE, C.; VAN de SOMPEL, H.; NELSON, M.; WARNER, S. The Open Archives Protocol for Metadata Harvesting. 2004. Retrieved April 25, 2008, from http://www.openarchives.org/OAI/openarchivesprotocol.html
[9] BioMed Central: The Open Access Publisher. Retrieved April 25, 2008, from http://www.biomedcentral.com/
[10] NLM Journal Archiving and Interchange Tag Suite. Retrieved April 25, 2008, from http://dtd.nlm.nih.gov/
[11] Creative Commons. Retrieved April 25, 2008, from http://creativecommons.org/
[12] HILL, Amanda. What's in a Name? Prototyping a Name Authority Service for UK Repositories. ISKO2008 Conference, Montreal, Canada, August 2008. 2008. (Accepted for publication). Retrieved May 1, 2008, from http://names.mimas.ac.uk/documents/Names_ISKO2008_paper.pdf
[13] OCLC Dewey Services: Dewey Decimal Classification. Retrieved April 25, 2008, from http://www.oclc.org/dewey/
[14] The Library of Congress Classification. Retrieved April 25, 2008, from http://www.loc.gov/aba/cataloging/classification/
[15] NICHOLSON, D; DAWSON, A; SHIRI, A. HILT: a Terminology Mapping Service with a DDC Spine. Classification Quarterly. Vol 42 No 3/4, 2006, p. 187-200. Retrieved May 1, 2008, from http://eprints.rclis.org/archive/00008767/
[16] WIKIPEDIA CONTRIBUTORS. Representational State Transfer. Wikipedia, The Free Encyclopedia. 2008. Retrieved May 1, 2008, from http://en.wikipedia.org/wiki/Representational_State_Transfer
[17] APPS, Ann. Using an Application Profile Based Service Registry. DC2007: Proceedings of the International Conference on Dublin Core and Metadata Applications, Singapore, 27-31 August 2007. Dublin Core Metadata Initiative and National Library Board Singapore, 2007, p. 63-73. Retrieved May 1, 2008, from http://epub.mimas.ac.uk/papers/2007/dc2007/apps-dc2007.html
[18] ESDS International. Retrieved May 1, 2008, from http://www.esds.ac.uk/international/
[19] UK Data Archive. Retrieved May 1, 2008, from http://www.data-archive.ac.uk/
[20] Data Documentation Initiative (DDI). Retrieved May 1, 2008, from http://www.icpsr.umich.edu/DDI/
[21] LAWRENCE, S; PENNOCK, DM; FLAKE, GW; KROVETZ, R; COETZEE, FM; GLOVER, E; NIELSEN, FA; KRUGER, A; GILES, CL. Persistence of Web References in Scientific Research. Computer. Vol 34 No 2, 2001, p. 26-31.
[22] HELLMAN, E. OpenURL COinS: A Convention to Embed Bibliographic Metadata in HTML. 2006. Retrieved April 25, 2008, from http://ocoins.info/
[23] EASON, Ken; HARKER, Susan; APPS, Ann; MACINTYRE, Ross. Towards an Integrated Digital Library; Exploration of User Responses to a 'Joined Up' Service. Lecture Notes in Computer Science (ECDL2004: Eighth European Conference on Research and Advanced Technology for Digital Libraries, University of Bath, UK, 13-15 September 2004). Vol 3232, 2004, p. 452-463. Retrieved April 25, 2008, from http://epub.mimas.ac.uk/papers/eham-ecdl2004.html
[24] EASON, Ken; MACINTYRE, Ross; APPS, Ann. A 'Joined Up' Electronic Journal Service: User Attitudes and Behaviour. Libraries Without Walls 6: Evaluating the Distributed Delivery of Library Services. London: Facet Publishing, 2005, p. 63-70. Retrieved April 25, 2008, from http://epub.mimas.ac.uk/papers/lww6/easonetal-lww6.html
[25] BARGA, RS; ANDREWS, S; PARASTATIDIS, S. The British Library Research Information Centre (RIC). UK e-Science ALL HANDS MEETING 2007, Nottingham, UK, 10-13 September 2007. 2007. Retrieved April 25, 2008, from http://www.allhands.org.uk/2007/proceedings/papers/800.pdf
[26] APPS, Ann. Zetoc SOAP: a Web Services Interface for a Digital Library Resource. Lecture Notes in Computer Science (ECDL2004: Eighth European Conference on Research and Advanced Technology for Digital Libraries, University of Bath, UK, 13-15 September 2004). Vol 3232, 2004, p. 198-208. Retrieved April 25, 2008, from http://epub.mimas.ac.uk/papers/appsecdl2004.html
[27] PhysMath Central. Retrieved April 25, 2008, from http://www.physmathcentral.com/
[28] arXiv.org. Retrieved April 25, 2008, from http://arxiv.org/
[29] Joint Information Systems Committee (JISC). Retrieved April 25, 2008, from http://www.jisc.ac.uk/
[30] The British Library. Retrieved April 25, 2008, from http://www.bl.uk/
Web Topic Summarization
Josef Steinberger; Karel Jezek; Martin Sloup
Department of Computer Science and Engineering, University of West Bohemia in Pilsen, Univerzitní 8, Pilsen 306 14, Czech Republic
e-mail: jstein@kiv.zcu.cz; jezek_ka@kiv.zcu.cz; msloup@student.zcu.cz
Abstract

In this paper, we present our online summarization system for web topics. The user defines the topic by a set of keywords. The system then searches the Web for relevant documents. The top-ranked documents are returned and passed on to the summarization component. The summarizer produces a summary which is finally shown to the user. The proposed architecture is fully modular. This enables us to quickly substitute a new version of any module, so the quality of the system's output will improve with module improvements. The crucial module, which extracts the most important sentences from the documents, is based on latent semantic analysis. Its main property is independence of the language of the source documents. In the system interface, one can choose to search a news site in English or Czech. The results show a very good search quality: most of the retrieved documents are fully relevant, only a few being marginally relevant. The summarizer is comparable to state-of-the-art systems.

Keywords: Information retrieval; searching; summarization; latent semantic analysis
1. Introduction
Searching the web has played an important role in human life in the past couple of years. A user either searches for specific information or just browses topics which interest him/her. Typically, a user enters a query in natural language, or as a set of keywords, and a search engine answers with a set of documents which are relevant to the query. Then the user needs to go through the documents to find the information that interests him. However, usually just some parts of the documents contain query-relevant information. A benefit to the user would be if the system selected the relevant passages, put them together, made them concise and fluent, and returned the resulting text. Moreover, if the resulting summary is not relevant enough, the user can refine the query. Thus, as a side effect, summarization can be viewed as a technique for improving querying. Our aim is to apply the following step after retrieval of the relevant documents: the set of documents is summarized and the resulting text is returned to the user. So, basically, the key work is done by the summarizer. In the past we created a single-document summarizer which extracted the most important sentences from a single source document [1]. The core of the summarizer was covered by latent semantic analysis (LSA – [2]). Now we are experimenting with its extension to process multiple documents – a cluster of documents concerning the same topic. Several new problems arise here. For example, because the documents are about the same topic, they can contain similar sentences, and we have to ensure that the summary does not contain this type of redundancy. In this paper, we present the SWEeT system (Summarizer of WEb Topics). A user enters a query in the system. That query should describe the topic he would like to read about (e.g. "George Bush Iraq War"). The system passes the query to a search engine, which answers with a set of relevant documents sorted by relevance to the query. The top n documents, where n is a parameter of the system, are then passed to our summarizer, the core of the system. The created summary is returned to the user, together with references
to the searched documents, which can help him to get more details about the topic. The structure of the paper is as follows. In Section 2, a quick overview of our SWEeT approach is presented. We then go deeper into the technical details (Section 3): we describe the architecture of the system and briefly explain the function and approach of each module. In Section 4 we discuss the evaluation results, which give an idea of the searching and summarizing quality, and we show a couple of resulting summaries and system screenshots. At the end, we discuss our vision of the system's further extensions and improvements.
2. Approach Overview
Before going into technical details, we first explain the approach in a simple way. After the user submits a query, it is passed to a search engine, which answers with a set of relevant documents. Their contents, together with some additional information, e.g. the date of publication, are extracted and passed on to the summarizer. The first task for the summarizer is to extract the most important sentences from the set of documents. Our approach follows what has been called a term-based strategy: find the most important information in the document(s) by identifying its main terms, and then extract from the document(s) the most important information (i.e., sentences) about these terms [3]. Moreover, to reduce the dimensionality of the term space, we use latent semantic analysis [2], which can cluster similar terms and sentences into 'topics' on the basis of their use in context. The sentences that contain the most important topics are then selected for the summary. In this step, however, we have to make sure that the summary does not already contain a similar sentence, to prevent redundancy: the vector of the sentence about to be included in the summary is compared, by cosine similarity, with those of the sentences already included in the summary.

After obtaining the summary sentences, we try to remove unimportant clauses from them. (In other words, we perform a second-level summarization.) We designed a set of knowledge-poor features that help in deciding whether the most important information contained in a sentence is still present in its compressed version (see Section 3.7, [4]). These features are used by a classifier, which decides whether a particular clause is important. The shortest of the compressed versions that still contains the main sentence information is selected to substitute the full sentence in the summary. Further, the summary sentences have to be ordered. Our method uses the fact that two sentences that are to appear next to each other in the final summary should be connected by occurrences of the same entities. The last step of our approach is to correct problematic occurrences of entities caused by extracting sentences without their context (e.g. a pronoun which the reader could not interpret). Our approach is to substitute each of these problematic expressions (e.g. 'he') with a full noun phrase (e.g. 'president George Bush') [5].
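A minimal sketch of the redundancy check described above is given below; the threshold value and function names are illustrative assumptions, not the system's actual parameters.

```python
# Hedged sketch of the anti-redundancy test: a candidate sentence vector is only
# added if it is not too similar (cosine) to any sentence already in the summary.
# The 0.5 threshold is an invented placeholder.
import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def add_if_novel(candidate_vec, summary_vecs, threshold=0.5):
    if all(cosine(candidate_vec, s) < threshold for s in summary_vecs):
        summary_vecs.append(candidate_vec)
        return True
    return False
```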
3. System Architecture
The crucial part of the system is the summarizer. However, state-of-the-art summarization is still far behind human-written summaries, so we designed a modular system that enables us to quickly improve the summarization process (see Figure 1). The first stage of the process is to pass the query to a search engine. We use the widely used Google search engine; moreover, the search engine can easily be instructed to search a single domain or a couple of domains, so that, for example, we can search just certain news domains to get a summary of a news topic. After getting the cluster of relevant documents, their source URLs, titles, dates of publication (if available) and the texts themselves are extracted. The cluster is saved in our own XML format.
Figure 1: System architecture

After getting the XML with the search details, the summarization pipeline starts. This pipeline consists of several modules whose aim is to create the final summary XML node, whose content is finally returned as the system's answer. The first module annotates entities (e.g., persons, organizations, places) which appear in the text. This is needed for sentence ordering and entity occurrence correction; later, we plan to use a complex co-reference resolution system for this task (e.g. BART [6]). The next module tries to automatically annotate the sentence clause structure, which is needed for sentence compression. The sentence extraction module is the main one. Its goal is to select the summary sentences; our LSA-based method is used here. After this step, the XML file contains the summary node with the selected sentences. Then it is the turn of the sentence compression module, which removes unimportant clauses from the summary sentences. The next module orders the sentences in the summary and the last module corrects the entity occurrences. The last stage takes the content of the XML summary node and presents it to the user. The modules are discussed in the following subsections.
3.1 User Query Processing and Keyword Extraction
The first stage after submitting the query is to extract significant terms from it. The resulting set of keywords is then used in the searching module. For this task we need a list of "stop words", i.e. words that do not carry any information – prepositions, conjunctions, etc. If the module finds a stop word among the query terms, it ignores it. Further, we need to convert the terms into their basic forms (lemmatization). We use a dictionary in which we can look up a lemma for each term. Thus, we get a set of lemmas that holds the query information and is passed on to the searching module.
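A rough sketch of this step is given below; the stop-word list and lemma dictionary are toy examples invented for illustration, not the resources the system actually uses.

```python
# Hedged sketch of query processing: drop stop words, map remaining terms to lemmas.
STOP_WORDS = {"the", "a", "in", "of", "and"}
LEMMAS = {"elections": "election", "wars": "war"}

def query_keywords(query: str) -> list:
    keywords = []
    for term in query.lower().split():
        if term in STOP_WORDS:
            continue
        keywords.append(LEMMAS.get(term, term))
    return keywords

print(query_keywords("George Bush Iraq War"))   # -> ['george', 'bush', 'iraq', 'war']
```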
3.2 Searching by External Search Engine
The aim of the searching module is to find documents relevant to the query. So far, the system has searched just a single pre-defined domain (for English it is nytimes.com and for Czech it is novinky.cz). We use well-known external search engines to guarantee the highest searching quality. The first one is Google whose performance cannot be doubted. However, we need to search just a single domain and thus we use the modifier “site:domain”. For searching in the Czech news site novinky.cz, we directly use their search engine. It is based on the Seznam engine, one of the most widely used engines on the Czech
Web. Thus, good searching quality is also guaranteed in the case of searching the Czech news domain. Nevertheless, the modular architecture enables us to use other search engines as well.
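As an illustration of the domain-restricted search described above, the sketch below builds a "site:"-restricted query string; the URL and parameters are illustrative assumptions, not the exact calls the system makes.

```python
# Hedged sketch: build a domain-restricted query in the style of Google's
# "site:" modifier. The search URL is illustrative only.
from urllib.parse import urlencode

def site_restricted_query(keywords, domain="nytimes.com"):
    q = " ".join(keywords) + f" site:{domain}"
    return "https://www.google.com/search?" + urlencode({"q": q})

print(site_restricted_query(["george", "bush", "iraq", "war"]))
```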
3.3 Content Extraction and Parsing
References to the top n retrieved documents1 are passed from the searching module to content extraction and parsing. The documents pointed to by the references are then downloaded and parsed. The parser needs to know which parts of the HTML structure have to be extracted. This cannot be done automatically for an arbitrary HTML structure; fortunately, each portal has its own uniform format. We created a simple configuration for each domain in which we run searches. This configuration tells the parser where it should find the title, the date of publication and the article text in the HTML structure. The resulting texts, together with titles and other meta-information, are converted into our own XML format, which is passed, and updated, through the summarization pipeline.
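The per-domain configuration might look roughly like the sketch below; the selector strings are invented examples, not the configurations actually used by the system.

```python
# Hedged sketch of a per-domain extraction configuration: for each news domain the
# parser is told where to find the title, date and article text.
DOMAIN_CONFIG = {
    "nytimes.com": {"title": "h1.headline", "date": "time.dateline", "text": "div.article-body p"},
    "novinky.cz":  {"title": "h1",          "date": "p.article-date", "text": "div.article p"},
}

def config_for(url: str) -> dict:
    for domain, cfg in DOMAIN_CONFIG.items():
        if domain in url:
            return cfg
    raise KeyError(f"no extraction configuration for {url}")
```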
3.4 Entity Markup
Entity markup starts the summarization pipeline. Each module of the pipeline adds some information to the XML data, and the first two modules add markup that is utilized by other modules further down the pipeline. The entity markup module tries to mark all entities that occur in the text (persons, institutions, geographic names, etc.). Here we have to use a natural language parser; this component cannot be language independent. For English we use the Charniak parser [7] and for Czech we use a parser from PDT 2.0 (Prague Dependency Treebank – [8]) which is based on the Collins parser [9]. Both these tools can mark noun phrases (NPs) and with a little effort we can get the heads of the NPs2. From these noun phrases we create co-reference chains: two NPs are added to the same co-reference chain if they contain the same noun. With this approach we can put together phrases like “president George Bush”, “Bush”, “the president” or “George”. On the other hand, “the Czech president” and “the U.S. president” will be bound by mistake. In the future we plan to use a complex co-reference resolution system [6] that would also resolve other anaphoric expressions such as pronouns. In the XML data file the entity occurrences are wrapped in tags and the identifier of the entity chain is contained in an attribute. The information about entities is later used in the modules for sentence ordering, reference correction and sentence compression.
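The noun-phrase chaining rule described above ("two NPs are linked if they share the same noun") can be sketched roughly as follows; the data structures are invented for illustration, and real NP extraction would come from the parser.

```python
# Hedged sketch of the head-noun chaining heuristic: NPs that share a noun are
# placed in the same entity chain.
from collections import defaultdict

def build_chains(noun_phrases):
    """noun_phrases: list of (np_text, set_of_nouns) pairs."""
    chains = defaultdict(list)          # chain id -> list of NP texts
    noun_to_chain = {}
    next_id = 0
    for np_text, nouns in noun_phrases:
        chain_id = next((noun_to_chain[n] for n in nouns if n in noun_to_chain), None)
        if chain_id is None:
            chain_id, next_id = next_id, next_id + 1
        for n in nouns:
            noun_to_chain[n] = chain_id
        chains[chain_id].append(np_text)
    return dict(chains)

print(build_chains([("president George Bush", {"president", "george", "bush"}),
                    ("Bush", {"bush"}),
                    ("the president", {"president"})]))
```

Note that, exactly as the text warns, this heuristic would also (wrongly) merge "the Czech president" and "the U.S. president" into one chain.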
3.5 Sentence Structure Markup
After finishing the entity markup, the sentence structure markup follows. Its aim is to identify sentence parts (clauses). For this task we again use the natural language parser’s output. It can derive a sentence tree structure like the one in Figure 2.
Figure 2: Tree structure of an example sentence.
The knowledge of sentence structure is later used by the sentence compression module. In the XML file, the clauses are wrapped in tags as in the case of entity marking.
3.6 LSA-based Sentence Extraction
This module is the core of the pipeline. It identifies and then extracts the most important sentences from the retrieved documents. The algorithm is based on our LSA-based single-document summarization method [1], extended to work with a set of documents [10]. LSA is a fully automatic mathematical/statistical technique for extracting and representing the contextual usage of word meanings in passages of discourse. The basic idea is that the aggregate of all the word contexts in which a given word does and does not appear provides mutual constraints that determine the similarity of meanings of words and sets of words to each other. LSA has been used in a variety of applications (e.g., information retrieval, document categorization, information filtering, and text summarization).

The heart of the analysis in the summarization setting is a document representation developed in two steps. The first step is the creation of a term-by-sentence matrix, where each column represents the weighted term-frequency vector of a sentence in the set of documents under consideration. The terms from a user query get a higher weight. The next step is to apply Singular Value Decomposition (SVD) to matrix A:

A = U Σ V^T,   (1)

where U = [u_ij] is an m × n column-orthonormal matrix whose columns are called left singular vectors, Σ = diag(σ_1, σ_2, ..., σ_n) is an n × n diagonal matrix whose diagonal elements are non-negative singular values sorted in descending order, and V = [v_ij] is an n × n orthonormal matrix whose columns are called right singular vectors. The dimensionality of the matrices is reduced to the r most important dimensions, so that U is m × r, Σ is r × r and V^T is r × n.

From an NLP perspective, what SVD does is derive the latent semantic structure of the document represented by matrix A: i.e. a breakdown of the original document into r linearly-independent base vectors which express the main 'topics' of the document. SVD can capture interrelationships among terms, so that terms and sentences can be clustered on a 'semantic' basis rather than on the basis of words only. Furthermore, as demonstrated in [11], if a word combination pattern is salient and recurring in the document, this pattern will be captured and represented by one of the singular vectors. The magnitude of the corresponding singular value indicates the importance degree of this pattern within the document. Any sentence containing this word combination pattern will be projected along this singular vector, and the sentence that best represents this pattern will have the largest index value with this vector. Assuming that each particular word combination pattern describes a certain topic in the document, each singular vector can be viewed as representing such a topic [12], the magnitude of its singular value representing the degree of importance of this topic. The method selects for the summary those sentences whose vectorial representation in the matrix Σ·V^T has the greatest 'length'. Intuitively, the idea is to choose the sentences with the greatest combined weight across all important topics.

In [10] we proposed the extension of the method to process a cluster of documents written about the same topic. Multi-document summarization is a more complex task than single-document summarization.
It brings new problems we have to deal with. The first step is again to create a term-by-sentence matrix; in this case we include in the matrix all sentences from the cluster of documents (whereas in single-document summarization we included only the sentences from that document). Then we run sentence ranking. Each sentence gets a score which is computed in the same way as when we summarized a single document – the vector length in the matrix Σ·V^T (the LSA score). Now we are ready to select the best sentences (the ones with the greatest LSA score) for the summary. However, two documents written about the same topic/event can contain similar sentences, and thus we need to deal with redundancy. We propose the following process: before adding a sentence to the summary, check whether there is a similar sentence already in the summary. The similarity is measured by the cosine similarity in the original term space, and we set a threshold on it. The extracted sentences should also be close to the user query; to satisfy this, query terms get a higher weight in the input matrix.
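A compact sketch of the scoring step, using numpy's SVD, is shown below; the matrix construction, weighting and the choice of r are simplified assumptions rather than the system's exact settings.

```python
# Hedged sketch of LSA sentence scoring: build a term-by-sentence matrix A,
# apply SVD, and score each sentence by the length of its column in Sigma * V^T,
# keeping only the r most important dimensions.
import numpy as np

def lsa_sentence_scores(A, r=3):
    """A: term-by-sentence matrix (terms x sentences). Returns one score per sentence."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = min(r, len(s))
    sigma_vt = np.diag(s[:r]) @ Vt[:r, :]          # r x n_sentences
    return np.linalg.norm(sigma_vt, axis=0)        # column lengths = LSA scores

# Example: 5 terms x 4 sentences, raw counts only (no weighting).
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 2]], dtype=float)
print(lsa_sentence_scores(A))
```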
3.7 Knowledge-poor Sentence Compression
Naturally, long sentences with many significant terms are usually selected for the summary. However, they often contain clauses that are unimportant from the summarization point of view. We try to identify these clauses and then remove them. Firstly, we need to create a set of possible compressed forms of each summary sentence, which we call compression candidates (CCs). In this step we use the knowledge of sentence structure obtained by the sentence structure markup module (see the example in Figure 2). If we cut the tree at an edge, we get a compressed sentence (CC) in which all clauses subordinate to that edge are removed; moreover, we can cut the tree more than once, at a combination of edges. In this way we obtain a set of CCs. After obtaining the set of CCs, we try to select the best candidate within the set. In some of the candidates important information is removed or even the sense is changed. We designed several features that can help in deciding whether the crucial information is retained in a particular candidate3. The final decision is left to a two-class classifier. The shortest candidate among the positive ones is selected to substitute the original sentence in the final summary.
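A toy sketch of how compression candidates might be enumerated by pruning subtrees of a clause tree is given below; it is simplified to single cuts (the real system also combines cuts), and the tree representation is invented for illustration rather than the parser output the system uses.

```python
# Hedged sketch: enumerate compression candidates (CCs) by removing any single
# subordinate clause subtree. A clause tree is a (text, [children]) pair.
def candidates(tree):
    text, children = tree
    ccs = []
    for i in range(len(children)):                 # cut the edge to child i
        ccs.append((text, children[:i] + children[i + 1:]))
    for i, child in enumerate(children):           # or cut deeper inside child i
        for sub in candidates(child):
            ccs.append((text, children[:i] + [sub] + children[i + 1:]))
    return ccs

def flatten(tree):
    text, children = tree
    return " ".join([text] + [flatten(c) for c in children])

sentence = ("The minister resigned", [("after the report was published",
                                       [("which criticised the department", [])])])
for cc in candidates(sentence):
    print(flatten(cc))
```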
3.8 Sentence Ordering
After obtaining the sentences (or their compressed versions) of which the final summary will consist, they have to be ordered. Our idea for resolving this problem is that two sentences that occur close to each other should deal with the same entities. The first step is to select the first summary sentence. Each sentence is assigned a score that describes how suitable it is to start the summary. From the entity markup we get entity co-occurrence chains and, moreover, for each chain and document we get the NP that starts the chain in that document; usually, each entity is introduced in a document with a full NP (e.g. “president George Bush”). Sentences are scored according to three features: the number of entities occurring in them, the number of entity introductions, and the date of publication of the document in which the sentence is contained, a sentence from the oldest document being preferred to start the summary. Having selected the first sentence, we select the next one: a sentence that contains the same entities as the previous one is preferred to continue the summary. The sentences are therefore scored according to the same three features, slightly changed: the number of entities occurring in them, with the entities that occur in the previous sentence emphasized (multiplied by a weight); the number of entity introductions, again with the entities that occur in the previous sentence emphasized; and the date of publication of the document in which the sentence is contained, a sentence from the oldest document again being preferred. This process is repeated until we have ordered all the sentences.
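The greedy ordering described above might be sketched as follows; the weights and the simplified feature combination are invented placeholders, not the tuned values used by the system (the entity-introduction feature is folded into the overlap term here).

```python
# Hedged sketch of greedy sentence ordering by entity overlap. Each sentence is a
# dict with its entity set and an ordinal publication date (smaller = older).
def order_sentences(sentences, w_overlap=2.0, w_entities=1.0, w_age=0.5):
    remaining = list(sentences)
    ordered = []
    prev_entities = set()
    while remaining:
        def score(s):
            return (w_overlap * len(s["entities"] & prev_entities)
                    + w_entities * len(s["entities"])
                    - w_age * s["doc_date"])       # older documents score higher
        best = max(remaining, key=score)
        remaining.remove(best)
        ordered.append(best)
        prev_entities = best["entities"]
    return ordered
```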
3.9 Reference Correction
Anaphoric expressions can only be understood with respect to a context. This means that summarization by sentence extraction can wreak havoc with their interpretation: there is no guarantee that they will have an interpretation in the context obtained by extracting sentences to form a summary, or that this interpretation will be the same as in the original text. For example, a pronoun can occur in the summary without any information about which entity it replaces. Our idea is to replace anaphoric expressions with a full noun phrase in cases where the anaphoric expression could otherwise be misinterpreted. The information marked by the entity marking module is utilized here. However, we need a co-reference resolver. So far we have experimented just with English and the GuiTAR resolver [13]. For details, see [5].
4. Experiments
There are two crucial parts that affect the performance of the system: the quality of searching and the quality of summarization. As for searching, we will present figures showing its accuracy (how many retrieved documents were relevant to the user query and how many were not). We use manual annotations. The quality of the summarization is assessed by the widely-used ROUGE measure [14, 15]. At the end of the section, we present a couple of system summaries and we show system screenshots.
4.1 Searching Results
The following tables demonstrate that with the proposed searching approach we can obtain mostly relevant documents. Just a couple of documents were classified as marginally relevant (i.e., the query terms are mentioned there in the right sense, but the main document's topic is different from the query topic). A few documents were irrelevant (e.g., when we submitted a query about a huge accident on Czech highway D1, the system returned a document about an accident on an Austrian highway). Proper names can increase the accuracy of searching. We analyzed a maximum of the top ten retrieved documents. The results are presented in Table 1 (English queries) and Table 2 (Czech queries).

Query ID | Significant terms in query                | Total | Relevant   | Marginally relevant | Irrelevant
1        | China Olympic games protests              | 10    | 10         | 0                   | 0
2        | American radar in Czech Republic          | 10    | 8          | 2                   | 0
3        | Independent Kosovo                        | 10    | 10         | 0                   | 0
4        | Polygamy U.S. sect                        | 10    | 9          | 1                   | 0
5        | Obama Hillary Clinton president elections | 10    | 10         | 0                   | 0
6        | Soccer stadium security                   | 10    | 8          | 0                   | 2
7        | Iraq attact U.S.                          | 8     | 7          | 1                   | 0
8        | Iranian nuclear program                   | 9     | 8          | 1                   | 0
9        | Mugabe Zimbabwe elections                 | 8     | 8          | 0                   | 0
10       | Al Queda Osama bin Laden                  | 6     | 4          | 2                   | 0
In total |                                           | 91    | 82 (90.1%) | 7 (7.7%)            | 2 (2.2%)

Table 1: Evaluation of searching quality on English queries
4.2 Summarization Results
Assessing the quality of a summary is much more problematic. The DUC (Document Understanding Conference – [16]) series of annual conferences controls the direction of the evaluation. However, the
only fully automatic and widely used method so far is ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [14, 15], which compares human-written abstracts and system summaries based on the overlap of n-grams4.

Query ID | Significant terms in query                       | Total | Relevant   | Marginally relevant | Irrelevant
1        | Peking Čína olympijské hry bojkot                | 5     | 4          | 1                   | 0
2        | Americký radar Brdy                              | 10    | 8          | 2                   | 0
3        | Samostatnost Kosova                              | 9     | 7          | 2                   | 0
4        | USA polygamní sekta                              | 3     | 3          | 0                   | 0
5        | Obama Hillary Clinton prezident volby            | 10    | 10         | 0                   | 0
6        | Fotbal stadión bezpečnost fanoušci               | 5     | 5          | 0                   | 0
7        | Poplatky u lékaře reforma zdravotnictví Julínek  | 10    | 9          | 1                   | 0
8        | Daňová reforma                                   | 10    | 7          | 3                   | 0
9        | Hromadná nehoda na dálnici D1                    | 10    | 7          | 0                   | 3
10       | Sraz neonacistů Praha                            | 9     | 5          | 0                   | 4
In total |                                                  | 81    | 65 (80.3%) | 9 (11.1%)           | 7 (8.6%)
Table 2: Evaluation of searching quality on Czech queries

Suppose a number of annotators created manual summaries. The ROUGE-n score of a candidate summary (the summary which is evaluated) is computed as follows:
ROUGE-n = ( Σ_{C ∈ {manual summaries}} Σ_{n-gram ∈ C} Count_match(n-gram) ) / ( Σ_{C ∈ {manual summaries}} Σ_{n-gram ∈ C} Count(n-gram) )
where Count_match(n-gram) is the maximum number of n-grams co-occurring in the candidate summary and a manual summary, and Count(n-gram) is the number of n-grams in the manual summary. Notice that the average n-gram ROUGE score, ROUGE-n, is a recall metric. It was shown that the bigram score ROUGE-2 and ROUGE-SU4 (a bigram measure that allows at most 4 unigrams to be skipped inside bigram components [15]) correlate best with the human (manual) comparison of systems. We present a comparison of our summarizer with those that participated at DUC 2005 in Tables 3 and 4, based on an ANOVA multiple comparison of the recall scores; not all of the differences between systems are statistically significant. To summarize these tables: in ROUGE-2, our summarizer performs worse than 5 systems and better than 27 systems; however, when significance is taken into account, none of the systems performs significantly better than ours and 8 of them perform significantly worse. Similarly, in ROUGE-SU4, our summarizer performs worse than 5 systems and better than 27 systems; again, none of the systems performs significantly better than ours and 11 of them perform significantly worse.
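A minimal sketch of the ROUGE-n recall computation defined above is shown below (clipped n-gram matching against multiple manual summaries); it ignores stemming, stop-word removal and the other options of the official ROUGE package.

```python
# Hedged sketch of ROUGE-n recall: clipped n-gram overlap between a candidate
# summary and a set of manual (reference) summaries.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate_tokens, manual_summaries_tokens, n=2):
    cand = ngrams(candidate_tokens, n)
    matched, total = 0, 0
    for ref_tokens in manual_summaries_tokens:
        ref = ngrams(ref_tokens, n)
        matched += sum(min(count, cand[g]) for g, count in ref.items())
        total += sum(ref.values())
    return matched / total if total else 0.0
```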
4.3 Example summaries
To demonstrate the system output we show two resulting summaries (their desired length is 255 words). One for English and the query: “Al Qaeda and Osama bin Laden” and one for Czech and the query “americký radar Brdy” (American radar Brdy) – Figures 3 and 4.
4.4 System Interface
To bring the reader closer to the user interface of the system, we present screen outputs. In Figure 5 there is a page where a user submits a query. The length of the resulting summary can be selected here. In Figure 6 there is a page with the searching (and summarization) results. Under the header with the query and the selected summary length, we can see the resulting summary and, below it, references to the original documents.

Summarizer ID | ROUGE-2 score
15            | 0.0725
17            | 0.0717
10            | 0.0698
8             | 0.0696
4             | 0.0686
SWEeT         | 0.06791
5             | 0.0675
11            | 0.0643
14            | 0.0635
16            | 0.0633
19            | 0.0632
7             | 0.0628
9             | 0.0625
29            | 0.0609
25            | 0.0609
6             | 0.0609
24            | 0.0597
28            | 0.0594
3             | 0.0594
21            | 0.0573
12            | 0.0563
18            | 0.0553
26            | 0.0547
27            | 0.0546
32            | 0.0534
20            | 0.0515
13            | 0.0497
30            | 0.0496
31            | 0.0487
2             | 0.0478
22            | 0.0462
1             | 0.0403
23            | 0.0256
Table 3: Multiple comparisons of all peers based on ANOVA of ROUGE-2 recall

Summarizer ID | ROUGE-SU4 score
15            | 0.1316
17            | 0.1297
8             | 0.1279
4             | 0.1277
10            | 0.1253
SWEeT         | 0.12390
5             | 0.1232
11            | 0.1225
19            | 0.1218
16            | 0.1190
7             | 0.1190
6             | 0.1188
25            | 0.1187
14            | 0.1176
9             | 0.1174
24            | 0.1168
3             | 0.1167
28            | 0.1146
29            | 0.1139
21            | 0.1112
12            | 0.1107
18            | 0.1095
27            | 0.1085
32            | 0.1041
13            | 0.1041
26            | 0.1023
30            | 0.0995
2             | 0.0981
22            | 0.0970
31            | 0.0967
20            | 0.0940
1             | 0.0872
23            | 0.0557
Table 4: Multiple comparisons of all peers based on ANOVA of ROUGE-SU4 recall
Even as American officials portrayed the case as mainly a Canadian operation, the arrests so close to the United States border jangled the nerves of intelligence officials who have been warning of the continuing danger posed by small "homegrown" extremist groups, who appeared to operate without any direct control by known leaders of Al Qaeda. These fighters include Afghans and seasoned Taliban leaders, Uzbek and other Central Asian militants, and what intelligence officials estimate to be 80 to 90 Arab terrorist operatives and fugitives, possibly including the Qaeda leaders Osama bin Laden and his second in command, Ayman alZawahri. In recent weeks, Pakistani intelligence officials said the number of foreign fighters in the tribal areas was far higher than the official estimate of 500, perhaps as high as 2,000 today. The area is becoming a magnet for an influx of foreign fighters, who not only challenge government authority in the area, but are even wresting control from local tribes and spreading their influence to neighboring areas, according to several American and NATO officials and Pakistani and Afghan intelligence officials. Some American officials and politicians maintain that Sunni insurgents have deep ties with Qaeda networks loyal to Osama bin Laden in other countries. Hussein’s government, one senior refinery official confided to American soldiers. In fact, money, far more than jihadist ideology, is a crucial motivation for a majority of Sunni insurgents, according to American officers in some Sunni provinces and other military officials in Iraq who have reviewed detainee surveys and other intelligence on the insurgency.
Figure 3: Example English summary. Result for the query: “Al Qaeda and Osama bin Laden”

Rozhovory Spojených států s českou vládou o umístění radaru by mohly být završeny na bukurešťském summitu NATO na počátku dubna, s Poláky by chtěl Washigton dohodu uzavřít do konce volebního období amerického prezidenta George Bushe, tedy do konce roku. Plán Američanů umístit v Brdech protiraketový radar a v Polsku sila s obrannými raketami vyvolává od počátku odpor ruských představitelů. "Naše velká síla neznamená, že si můžeme dělat, co chceme a kdy chceme," řekl McCain a dodal: "Musíme naslouchat (různým) názorům a respektovat kolektivní vůli našich demokratických spojenců."Republikánský kandidát uvedl, že součástí skupiny nejvyspělejších států G8 by měly být demokratické země včetně Indie a Brazílie. Podle informací z ruských médií nabízí Američané Rusům možnost inspekcí objektů systému v ČR a Polsku, omezení možností radaru tak, aby nemohl sledovat ruský vzdušný prostor a slibují, že rakety do sil v Polsku neumístí do té doby, než bude zjevné hrozící nebezpečí. Poté, co před několika dny v Moskvě američtí ministři zahraničí a obrany Condoleezza Riceová a Robert Gates předložili oficiálně zatím nezveřejněné návrhy mající ruské obavy rozptýlit, se zřejmě ruská strana s existencí systému smířila. Nejlepší způsob, jak uklidnit ruské obavy z evropských prvků americké protiraketové obrany, by ale podle něj bylo vůbec radar v ČR a sila pro antirakety v Polsku nestavět. Informace z Moskvy potvrzuje nedávné tvrzení předsedy ČSSD, že dohoda Ruska a USA o protiraketové obraně je na spadnutí. To je vítězstvím Ruska, které ovšem nechtělo americký radar v ČR a sila s obrannými raketami v Polsku.
Figure 4: Example Czech summary. Result for the query: “americký radar Brdy” (i.e., American radar construction in Czech Republic-Brdy)
Figure 5: SWEeT’s query form
Figure 6: SWEeT’s result
5. Conclusion and Future Work
Pilot experiments show the solid quality of the system summaries. The future version of the system will enable advanced searching, where the user will be able to select the domains to be searched and to choose and configure a summarizer. After that, we will work on multilingual processing. The system will search in various languages. The terms will be indexed by the EuroWordNet (EWN) thesaurus [17, 18] in an internal EWN format – the Inter Lingual Index (ILI). As a result the system's answer will be multilingual: if the user understands more languages, he will get to know what is written about the topic in different countries/languages. Moreover, because the same terms in different languages are linked, the summarizer can use all the documents together to decide what is important in the topic. The proposed modular architecture has several advantages. We can easily change the search engine or the summarizer or any of its modules. Our summarizer is based on LSA, which works only with the context of words and is thus not dependent on any particular language; we perform experiments with both Czech and English queries. Another possible function of the system is knowledge-poor question answering: when a user enters a question, the answer should be found in the summary. So far, the basic version of the system has been stable; however, some of the modules are still in the experimental stage and there are many things to be improved.
6. Acknowledgement
This research was partly supported by project 2C06009 (COT-SEWing).
7. Notes

1. In our experiments we used the 10 most relevant documents; however, this constant will be configurable in the advanced searching settings of the next version of the system.
2. E.g., the head of the noun phrase “the blue car” is “car”.
3. For example, the depth of the removed clause in the clause tree structure can signify how important the clause is (the lower, the less important), or the fall in the LSA score of the CC (compared to the LSA score of the full sentence) can show how important the removed information was. For details, see [4].
4. An n-gram is a subsequence of n words from a given text.
8. References

[1] Steinberger, J., and Jezek, K. Text summarization and singular value decomposition. In Proceedings of the 3rd International Conference on Advances in Information Systems, Lecture Notes in Computer Science 2457, Springer-Verlag Berlin Heidelberg, p. 245–254, 2004.
[2] Landauer, T. K., and Dumais, S. T. A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. In Psychological Review, 104, p. 211–240, 1997.
[3] Hovy, E., and Lin, C. Automated text summarization in SUMMARIST. In ACL/EACL Workshop on Intelligent Scalable Text Summarization, Madrid, Spain, 1997.
[4] Steinberger, J., and Tesař, R. Knowledge-poor Multilingual Sentence Compression. In Proceedings of the 7th Conference on Language Engineering, Cairo, Egypt, p. 369–379, 2007.
[5] Steinberger, J., Poesio, M., Kabadjov, M.A., and Jezek, K. Two Uses of Anaphora Resolution in Summarization. In Special Issue of Information Processing & Management on Summarization, volume 43, issue 6, Elsevier Ltd., p. 1663–1680, 2007.
[6] Versley, Y., Ponzetto, S.P., Poesio, M., Eidelman, V., Jern, A., Smith, J., Yang, X., and Moschitti, A. BART: A Modular Toolkit for Coreference Resolution. To appear in ACL'08.
[7] Charniak, E. A maximum-entropy-inspired parser. In Proceedings of NAACL, Philadelphia, 2000.
[8] The Prague Dependency Treebank 2.0. http://ufal.mff.cuni.cz/pdt2.0/
[9] Collins, M. Head-Driven Statistical Models for Natural Language Parsing. PhD Dissertation, University of Pennsylvania, 1999.
[10] Steinberger, J., and Křišťan, M. LSA-Based Multi-Document Summarization. In Proceedings of the 8th International PhD Workshop on Systems and Control, a Young Generation Viewpoint, Balatonfured, Hungary, p. 87–91, 2007.
[11] Berry, M.W., Dumais, S.T., and O'Brien, G.W. Using Linear Algebra for Intelligent IR. In SIAM Review, 37(4), 1995.
[12] Ding, C.H.Q. A probabilistic model for latent semantic indexing. In Journal of the American Society for Information Science and Technology, 56(6), p. 597–608, 2005.
[13] Poesio, M., and Kabadjov, M.A. A general-purpose, off-the-shelf anaphora resolution module: implementation and preliminary evaluation. In Proceedings of LREC, Lisbon, Portugal, 2004.
[14] Lin, C., and Hovy, E. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of HLT-NAACL, 2003.
[15] Lin, Ch. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out, Barcelona, Spain, 2004.
[16] Document Understanding Conference Past Data. http://www-nlpir.nist.gov/projects/duc/data.html
[17] EuroWordNet thesaurus. http://www.illc.uva.nl/EuroWordNet/
[18] Toman, M., Steinberger, J., and Jezek, K. Searching and Summarizing in Multilingual Environment. In Proceedings of the 10th International Conference on Electronic Publishing, p. 257–265, FOI Commerce, Bansko, Bulgaria, 2006.
Open Access Citation Rates and Developing Countries
Michael Norris (1); Charles Oppenheim (1); Fytton Rowland (2)
(1) Department of Information Science, Loughborough University, Loughborough, Leicestershire LE11 3TU, UK; e-mail: M.Norris2@lboro.ac.uk; C.Oppenheim@lboro.ac.uk
(2) 73 Dudley Street, Bedford MK40 3TA, UK; e-mail: fytton@googlemail.com
Abstract

Academics, having written their peer-reviewed articles, may at some stage make their work Open Access (OA). They can do this by self-archiving an electronic version of their article on a personal or departmental web page or in an institutional or subject repository, such that the article then becomes freely available to anyone with Internet access to read and cite. Those authors who do not wish to do this may leave their article solely in the hands of a toll access (TA) journal publisher who charges for access, consigning their article to remain behind a subscription barrier. Lawrence (2003), in a short study, noted that conference articles in computer science that were freely available on the World Wide Web were more highly cited than those that were not. Following this, there have been a number of studies which have tried to establish whether peer-reviewed articles from a range of disciplines which are freely available on the World Wide Web, and hence are OA, accrue more citations than those articles which remain behind subscription barriers (Antelman 2004, Davis and Fromerth 2007, Eysenbach 2006, Harnad and Brody 2004, Kurtz and Henneken 2007, Moed 2007). These authors generally agree that there is a citation advantage for articles that have been made OA, but are either uncertain about, or cannot agree on, the cause of this advantage. The causes of this citation advantage could simply be that OA articles are available well in advance of formal publication, and so have a longer period in which to accrue citations, or that, because they are freely available, more authors can read and cite them. As part of this debate, Smith (2007) asked whether authors from developing countries might contribute to higher citation counts by accessing OA articles and citing them more readily than TA articles. As part of a larger study of the citation advantage of OA articles (Norris, Oppenheim and Rowland 2008), research was undertaken to see whether a higher proportion of citations to OA articles came from authors based in countries where funds for the purchase of journals are very limited. Mathematics was chosen as the field to be studied, because no special programme for access in developing countries, such as HINARI (2007), covers this subject. The results show that the majority of citations were given by Americans to Americans, but the admittedly small number of citations from authors in developing countries do seem to show a higher proportion of citations given to OA articles than is the case for citations from developed countries. Some of the evidence for this conclusion is, however, mixed, with some of the data pointing toward a more complex picture of citation behaviour.

Keywords: Open Access; Citation advantage; Developing countries
Introduction
One of the basic arguments for OA is that those who cannot afford access to peer-reviewed journal articles could access them if the authors of these articles self-archived their work somewhere on the World Wide Web. It should follow that a higher percentage of those who cite these OA articles ought to come from countries where access to expensive journals is limited. A number of schemes, such as HINARI
(2007) and AGORA (2007), exist to provide access to scholarly information inexpensively to users in developing countries, but not all disciplines are covered by these. In the overall larger study (Norris, Oppenheim and Rowland 2008), four subjects (sociology, economics, ecology and mathematics) were selected, and a large number of papers were investigated to discover whether they were available on an OA basis anywhere. Citation data on all these papers were collected and subjected to statistical analyses of various kinds to establish whether or not OA availability of itself correlates with a greater number of citations to an article. As part of the larger study, mathematics – which is not covered by any of the assistance schemes – was chosen for an investigation of citation of articles by authors based in developing countries, the hypothesis being that these authors would be unlikely to have access to expensive toll access (TA) journals. 2.
Methods
In the main project, articles were selected from high-impact journals and their OA status was sought by using various search tools (OAIster 2007, OpenDOAR 2007, Google Scholar, and finally Google), and their availability or non-availability with OA was noted. Citations to them were then retrieved by using the ISI Web of Science databases. The country of origin of cited and citing articles was decided by the first author’s affiliation. Countries were classified by their per capita income using the World Bank’s (2007) categories (see Table 6), and were also grouped by their geographical location into twelve groups (see Table 1). Citation ratios for the TA group and the OA group of articles were calculated separately.
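The classification and tabulation step described here can be pictured with a minimal sketch. The record fields, the region lookup and the sample data below are illustrative assumptions, not the authors' actual scripts or data:

```python
from collections import defaultdict

# Hypothetical region lookup; the study grouped countries into twelve regions
# and also classified them by World Bank per-capita income.
REGION = {"USA": "USA", "China": "China", "India": "Rest of World"}

def tabulate_by_region(cited_articles):
    """cited_articles: dicts with 'first_author_country' and 'oa_status' ('OA' or 'TA')."""
    counts = defaultdict(lambda: {"OA": 0, "TA": 0})
    for article in cited_articles:
        region = REGION.get(article["first_author_country"], "Rest of World")
        counts[region][article["oa_status"]] += 1
    for region, c in sorted(counts.items()):
        total = c["OA"] + c["TA"]
        print(f"{region}: TA {c['TA']} ({100 * c['TA'] / total:.1f}%), "
              f"OA {c['OA']} ({100 * c['OA'] / total:.1f}%), total {total}")

tabulate_by_region([
    {"first_author_country": "USA", "oa_status": "OA"},
    {"first_author_country": "USA", "oa_status": "TA"},
    {"first_author_country": "China", "oa_status": "TA"},
])
```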
Region | Toll Access | Open Access | Total
Spain | 12 (46.2%) | 14 (53.8%) | 26
Japan | 16 (64.0%) | 9 (36.0%) | 25
Italy | 20 (47.6%) | 22 (52.4%) | 42
Germany | 18 (27.7%) | 47 (72.3%) | 65
France | 26 (40.0%) | 39 (60.0%) | 65
Canada | 11 (40.7%) | 16 (59.3%) | 27
Pacific Rim | 9 (29.0%) | 22 (71.0%) | 31
China | 15 (57.7%) | 11 (42.3%) | 26
Rest of World | 20 (52.6%) | 18 (47.4%) | 38
UK | 18 (39.1%) | 28 (60.9%) | 46
Europe | 31 (32.3%) | 65 (67.7%) | 96
USA | 102 (33.3%) | 204 (66.7%) | 306
Total | 298 (37.6%) | 495 (62.4%) | 793

Table 1. Cited articles by region and OA status
3.
Data
In the overall sample, 1158 mathematics journal articles were taken from 16 high impact journals. Only citation links from other-author citations were counted; all author and journal self-citations were discarded. After this filtering, 365 of the articles remained uncited, leaving 793 articles cited by other authors. Table 1 shows how these 793 were distributed amongst the twelve regions. All of the citation links to the remaining 793 articles, which totalled 3032, were then analysed. These 3032 citations came from 2680 citing articles; clearly, in some cases, there were multiple citations from some of the citing articles. Table 2 shows how the 3032 citations from the 2680 citing articles were broken down. For example, 2413 citing articles (80%) cited just one of the 793 articles, whereas three articles each cited six of the original articles. The first-author affiliations of the original 793 cited articles covered 47 countries. The first-author affiliations of the 2680 citing articles were drawn from 70 countries; 23 of these were in addition to the initial 47. The cited and citing countries were classified by their per capita income using The World Bank's (2007) system of classification: China, for example, is designated as being in the lower middle-income group of countries and India in the low-income bracket, whereas most of Western Europe and North America are in the high-income group. To further aid analysis and comparison, the original 47 countries and the 70 citing-author countries were classified by location into USA, Canada, France, Germany, Italy, Japan, Spain, UK, rest of Continental Europe, China, Pacific Rim, and the Rest of the World.
Frequency of Citation | Citing Articles | Overall Citations
1 | 2413 | 2413
2 | 208 | 416
3 | 43 | 129
4 | 9 | 36
5 | 4 | 20
6 | 3 | 18
Totals | 2680 | 3032
Table 2. Frequency of Citation

In Table 3 the 3032 citations are shown by the OA or TA status of the cited article, the region from which they were cited, and whether the cited article was matched by a citation from the same region. By way of illustration, 231 TA citations came from the USA, but only 115 of these went to articles originally authored, by first-author affiliation, in that territory; the other 116 went to articles from other regions. Table 4 takes the data from Table 3 a step further and shows the citing country and the income group of the related cited articles. The 231 citations from the USA to TA cited articles are shown by the World Bank per-capita income group of the cited articles, by first-author affiliation. It is evident that 115 of them went to articles from the USA, as shown in Table 3, but overall only 20 of the 231 went to articles from regions outside the high per-capita income bracket. It is noticeable at this stage that a greater percentage of the citations given to TA articles go to articles authored outside the high per-capita income bracket (13.50%) than is the case for citations to OA articles, for which the figure is only 4.7%. The cluster bar chart in Figure 1 further extends the data from Table 4 by giving a comparative percentage of the whole count for each category by the OA status of the citations from each region.
OA status | Citing region | No region match | Region match | Total
Toll Access | USA | 116 | 115 | 231
Toll Access | Rest of Europe | 104 | 16 | 120
Toll Access | UK | 48 | 12 | 60
Toll Access | Rest of World | 50 | 5 | 55
Toll Access | China | 70 | 20 | 90
Toll Access | Pacific Rim | 44 | 7 | 51
Toll Access | Canada | 26 | 1 | 27
Toll Access | France | 45 | 14 | 59
Toll Access | Germany | 59 | 11 | 70
Toll Access | Italy | 65 | 16 | 81
Toll Access | Japan | 28 | 24 | 52
Toll Access | Spain | 37 | 2 | 39
Toll Access | Total | 692 | 243 | 935
Open Access | USA | 251 | 359 | 610
Open Access | Rest of Europe | 266 | 55 | 321
Open Access | UK | 97 | 17 | 114
Open Access | Rest of World | 99 | 5 | 104
Open Access | China | 133 | 7 | 140
Open Access | Pacific Rim | 114 | 10 | 124
Open Access | Canada | 55 | 5 | 60
Open Access | France | 108 | 29 | 137
Open Access | Germany | 168 | 46 | 214
Open Access | Italy | 93 | 27 | 120
Open Access | Japan | 84 | 6 | 90
Open Access | Spain | 60 | 3 | 63
Open Access | Total | 1528 | 569 | 2097
Table 3. Citations by author country

The USA, for example, receives 24.71% of all the citations given to TA articles (231/935) and 29.09% of all those given to OA articles (610/2097). Of the twelve regions, only five receive a greater overall percentage of the OA citations than of the TA citations, even though in every case each region received a greater number of OA than TA citations. It is noticeable that China receives 9.6% (90/935) of all citations given to TA articles but only 6.7% (140/2097) of all citations to OA articles. The Rest of the World shows a similar but narrower disparity: 5.9% (55/935) of all citations to TA articles but 5.0% (104/2097) of all those to OA articles. Table 5 shows the distribution of citations to cited articles by their OA status and per-capita income group. What is evident is that there is a greater percentage of citations to the TA articles (20.00%, 187/935) from the low to upper middle income groups than is the case for OA articles, where the comparable figure is 15.40% (323/2097) of all the citations to the OA articles. Table 6 shows the distribution of the original 793 cited articles and the distribution of the 3032 citations by the country of their first-author affiliation. Countries are again classified by their World Bank per-capita income grouping; the ratio in each column is the number of citing articles divided by the number of cited articles, and the numbers in brackets indicate the number of article records in each category.
Citing region | Low | Lower middle | Upper middle | High | Total
(columns give the income group of the cited article's country; row percentages in parentheses)
Toll Access:
USA | 0 (0.0%) | 17 (7.4%) | 3 (1.3%) | 211 (91.3%) | 231
Rest of Europe | 0 (0.0%) | 7 (5.8%) | 10 (8.3%) | 103 (85.8%) | 120
UK | 0 (0.0%) | 2 (3.3%) | 0 (0.0%) | 58 (96.7%) | 60
Rest of World | 0 (0.0%) | 3 (5.5%) | 9 (16.4%) | 43 (78.2%) | 55
China | 3 (3.3%) | 21 (23.3%) | 6 (6.7%) | 60 (66.7%) | 90
Pacific Rim | 0 (0.0%) | 5 (9.8%) | 2 (3.9%) | 44 (86.3%) | 51
Canada | 0 (0.0%) | 5 (18.5%) | 1 (3.7%) | 21 (77.8%) | 27
France | 0 (0.0%) | 2 (3.4%) | 2 (3.4%) | 55 (93.2%) | 59
Germany | 0 (0.0%) | 5 (7.1%) | 1 (1.4%) | 64 (91.4%) | 70
Italy | 2 (2.5%) | 6 (7.4%) | 3 (3.7%) | 70 (86.4%) | 81
Japan | 0 (0.0%) | 3 (5.8%) | 1 (1.9%) | 48 (92.3%) | 52
Spain | 0 (0.0%) | 3 (7.7%) | 4 (10.3%) | 32 (82.1%) | 39
Total | 5 (0.5%) | 79 (8.4%) | 42 (4.5%) | 809 (86.5%) | 935
Open Access:
USA | 0 (0.0%) | 14 (2.3%) | 12 (2.0%) | 584 (95.7%) | 610
Rest of Europe | 0 (0.0%) | 11 (3.4%) | 7 (2.2%) | 303 (94.4%) | 321
UK | 0 (0.0%) | 3 (2.6%) | 2 (1.8%) | 109 (95.6%) | 114
Rest of World | 1 (1.0%) | 5 (4.8%) | 3 (2.9%) | 95 (91.3%) | 104
China | 0 (0.0%) | 7 (5.0%) | 4 (2.9%) | 129 (92.1%) | 140
Pacific Rim | 0 (0.0%) | 12 (9.7%) | 0 (0.0%) | 112 (90.3%) | 124
Canada | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 60 (100.0%) | 60
France | 0 (0.0%) | 1 (0.7%) | 1 (0.7%) | 135 (98.5%) | 137
Germany | 0 (0.0%) | 3 (1.4%) | 5 (2.3%) | 206 (96.3%) | 214
Italy | 0 (0.0%) | 1 (0.8%) | 3 (2.5%) | 116 (96.7%) | 120
Japan | 1 (1.1%) | 1 (1.1%) | 1 (1.1%) | 87 (96.7%) | 90
Spain | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 63 (100.0%) | 63
Total | 2 (0.1%) | 58 (2.8%) | 38 (1.8%) | 1999 (95.3%) | 2097
Table 4. Citing country to cited income group
4.
Results and Discussion
Overall there is a tendency for authors to cite work from their own country preferentially. Table 3 shows all the citations, analysed by whether there was a match or not between the nationality of the authors of the citing work and of the cited work. Generally, for every one citation that can be paired by country to the
article it is citing, there are three that do not match. This applies to both OA and TA articles. Clearly, however, these data are skewed by the predominance of the region-to-region match for the USA. Given that just over 25% of the citations come from this territory alone, it is not surprising that almost 42% of all citations are to USA-affiliated first authors (data not shown). Perhaps this result is unremarkable, given that a large proportion (38.6%) of the cited articles originate from the USA, and that, of all the citations from each region, the largest number are given to USA-affiliated authors. The 230 citations made by Chinese authors accounted for about 10% of all the citations to TA articles and about 7% of all the citations to OA articles. For the Pacific Rim and the Rest of the World territories, the overall TA/OA citation percentages were barely different, at around 5% each. This result appears to confirm the finding from the analysis by per capita income, that is, that there is little evidence to suggest that authors who live in countries that may have difficulties accessing TA journals are citing OA articles in greater numbers and hence boosting the citation count. Figure 1 shows that seven of the twelve regions account for a greater percentage of the TA citations than of the OA citations. These seven regions include the UK, Italy, Japan, Spain, China and the Rest of the World, the latter two helping to support the premise that low income does not generate exceptional OA citations. It is noticeable also, as demonstrated by Table 3, that the regional link between cited and citing article by first-author affiliation is generally weak once the USA has been excluded. For citations of either access status, OA or TA, the overall regional match is about a quarter, but noticeably in the case of China this is heavily skewed in favour of not citing other Chinese-affiliated authors.
Figure 1. Percentage of citations to cited articles by OA status

Whilst there is unmistakable evidence from the data collected here of an overall citation advantage (20%) for those articles that are made available as OA, the actual causes of this advantage, here and in other studies, are not always clear. One of the primary arguments in favour of OA is that those who cannot afford access to peer-reviewed journal articles could use them if these articles were self-archived on the World Wide Web, where they could be readily accessed and cited by those with limited incomes; if this were clearly so, a demonstrable cause of the citation advantage could be shown. Table 6 shows that, for the TA articles, the highest ratio of citing to cited articles occurs for citing authors in those countries in the lower middle income bracket, regardless of the nationality
of origin of the cited articles. If all but the high income level countries are taken together, then the citation ratio is 4.45 for the TA articles and 9.79 for the OA articles. However, the overwhelming majority of articles are both authored and cited from the high-income countries. Table 6 gives the full data, divided into TA and OA articles, and by the four income categories of the first author's country of residence. Although this appears to be a convincing advantage for OA, the 187 citations from lower-income countries to TA articles represent 20% of all 935 citations to TA articles, which is greater than the comparable figure of 15.40% for citations to OA articles. So there is a greater percentage of lower-income authors among those citing TA articles than among those citing OA articles.
Citing country income | Low | Lower middle | Upper middle | High | Total
(columns give the income group of the cited article's country; row percentages in parentheses)
Toll Access:
Low | 0 (0.0%) | 1 (16.7%) | 0 (0.0%) | 5 (83.3%) | 6
Lower middle | 3 (2.8%) | 22 (20.8%) | 8 (7.5%) | 73 (68.9%) | 106
Upper middle | 0 (0.0%) | 6 (8.0%) | 15 (20.0%) | 54 (72.0%) | 75
High | 2 (0.3%) | 50 (6.7%) | 19 (2.5%) | 677 (90.5%) | 748
Total | 5 (0.5%) | 79 (8.4%) | 42 (4.5%) | 809 (86.5%) | 935
Open Access:
Low | 0 (0.0%) | 1 (5.9%) | 0 (0.0%) | 16 (94.1%) | 17
Lower middle | 0 (0.0%) | 10 (5.6%) | 5 (2.8%) | 165 (91.7%) | 180
Upper middle | 1 (0.8%) | 3 (2.4%) | 6 (4.8%) | 116 (92.1%) | 126
High | 1 (0.1%) | 44 (2.5%) | 27 (1.5%) | 1702 (95.9%) | 1774
Total | 2 (0.1%) | 58 (2.8%) | 38 (1.8%) | 1999 (95.3%) | 2097

Table 5. Cited to citing articles by income group
Access Status | World Bank classification by per capita income
 | Low | Lower middle | Upper middle | High
TA Articles:
TA cited articles (298) | 2 | 21 | 19 | 256
TA citing articles (935) | 6 | 106 | 75 | 748
Ratio of citing to cited articles | 3 | 5.05 | 3.95 | 2.92
OA Articles:
OA cited articles (495) | 1 | 14 | 18 | 462
OA citing articles (2097) | 17 | 180 | 126 | 1774
Ratio of citing to cited articles | 17 | 12.85 | 7 | 3.84

Table 6. Ratio of citing to cited articles by national income groups
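The ratios in Table 6 are simply the citing-article counts divided by the cited-article counts within each income column; for example:

106 / 21 ≈ 5.05 (TA, lower middle)        180 / 14 ≈ 12.85 (OA, lower middle)
(6 + 106 + 75) / (2 + 21 + 19) = 187 / 42 ≈ 4.45 (TA, low to upper middle combined)
(17 + 180 + 126) / (1 + 14 + 18) = 323 / 33 ≈ 9.79 (OA, low to upper middle combined)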
Since self-citations were eliminated from the data, those doing the citing are authors citing the work of others in support of their own. As can be seen in Table 5, however, most of the citations given to articles from the lower income groups come from authors in high-income countries. It is also clear that authors from the low and lower income countries account for a greater percentage of the citations to TA articles than of the citations to OA articles, despite the higher ratio of citing to cited articles for OA articles. In fact, 95.33% of all OA citations and 86.5% of all TA citations come from high-income regions.
5.
Conclusions
Taken overall, the results give a mixed picture as to whether those in the lower per capita income bracket countries are citing OA articles more frequently than TA ones. The USA cites itself more than anyone else, which is not surprising given its level of authorship. The other developed countries, except for Japan, are all at about the same level in terms of within-nation citation. Table 6 suggests that while there is a modest difference between the citation ratios of OA and TA articles for citations given by authors in the developed world (3.84 versus 2.92), the difference becomes much greater when citations given by authors from the developing world are studied. The sample from the lowest income countries is very small, and it may be that the lack of reliable telecommunications networks in these countries hinders access to OA articles; scholars there may rely on a limited number of printed journals to which they have subscriptions. The results from the larger sample in the lower middle income group of countries, however, are striking: a citation ratio of 12.85 for OA articles versus 5.05 for TA articles.
6.
Notes and References
HINARI, 2007. <http://www.who.int/hinari/en/>, [accessed 16.02.07].
AGORA, 2007. <http://www.aginternetwork.org/en/>, [accessed 16.02.07].
OAIster, 2007. <http://www.oaister.org/>, [accessed 07.03.07].
OpenDOAR, 2007. <http://www.OpenDOAR.org/search.php>, [accessed 07.02.07].
The World Bank, 2007. <http://go.worldbank.org/K2CKM78CC0>, [accessed 08.08.07].
[1] Lawrence, S., 2001. Online or invisible. Nature, 411(6837), 521.
[2] Antelman, K., 2004. Do open-access articles have a greater research impact? College and Research Libraries, 65(5), 372-382.
[3] Eysenbach, G., 2006. Citation advantage of open access articles. PLoS Biology, 4(5), 692-698.
[4] Harnad, S. and Brody, T., 2004. Comparing the impact of open access (OA) vs. non-OA articles in the same journals. D-Lib Magazine [online], 10(6), 1-5. <http://mirrored.ukoln.ac.uk/lis-journals/dlib/dlib/dlib/june04/harnad/06harnad.html>, [accessed 22.12.05].
[5] Kurtz, M. & Henneken, E., 2007. Open access does not increase citations for research articles from The Astrophysical Journal. <http://arxiv.org/ftp/arxiv/papers/0709/0709.0896.pdf>, [accessed 07.09.07].
[6] Moed, H., 2007. The effect of 'Open Access' upon citation impact: An analysis of ArXiv's Condensed Matter Section. Journal of the American Society for Information Science and Technology, 58(13), 2047-2054.
[7] Smith, J., 2007. Re: The apparent OA citation advantage. To multiple recipients of list JISC Repositories, 20 May, 19:19:35 BST.
[8] Norris, M., Oppenheim, C. & Rowland, F. (in press). The citation advantage of open access articles. Journal of the American Society for Information Science and Technology.
Research Impact of Open Access Research Contributions across Disciplines
Sheikh Mohammad Shafi
Department of Library and Information Science, The University of Kashmir, Srinagar (J & K), India 190006
e-mail: smshafi@kashmiruniversity.ac.in
Abstract
The study is based on 4,413 papers identified from Elsevier's Scopus for various fields from 2000 to 2004, used to assess the research impact of OA journal articles from DOAJ-based journals, with samples drawn using the 'R' software. It tests the hypothesis that "OA articles in hard, urban and convergent fields receive more citations (hence higher research impact) than those in soft, rural and divergent subjects", and also makes a comparative study of research impact across disciplines, supported by an experimental method and a literature review.
Keywords: Research Impact, urban and rural disciplines, hard and soft disciplines, Open Access impact
1.
Introduction
Many grounds demand universal open access to scholarly information. The rising cost of scholarly materials, particularly journals; stable or declining budgets; declining numbers of society publishers providing reasonable pricing; mergers within the commercial publishing industry resulting in less competition and increased prices; and a shifting emphasis from communicating scientific information to generating profits for publishing company stakeholders have left scientists around the world unable to access the current literature in their fields. These access barriers represent impact barriers for research and researchers, whose careers largely depend on visibility and eventually on citation counts (1,2,3). Open access calls for the free availability of scholarly literature on the internet. The open access movement has gained significant momentum over the past several years; it maintains that all scientific and scholarly literature should be available to all for free via the internet (4,5). More recently, research funders and research institutions in several countries have been proposing official policies to actively encourage or even require their fundees and employees to self-archive their research output in order to make it freely accessible online to all potential users, rather than leave it accessible only to those who can afford the journals in which it happens to be published (6). What makes OA so important is its potential effect on the visibility, usage and impact of research. The careers and funding of researchers depend on the uptake of their findings, as does the progress of research itself. Academic institutions, federal agencies, publishers, editors, authors and librarians increasingly rely on citation analysis for promotion, tenure, funding, and reviewer evaluation and selection decisions (7). It is now established that open access increases research impact. It is in this context that the present paper endeavours to make a comparative study of the research impact of OA articles across six disciplines through citation analysis.
2.
Objectives
The following objectives are laid down for the study:
i) To assess the research impact of OA journal articles across disciplines.
ii) To compare the research impact of OA journal articles across disciplines.
iii) To verify the hypothesis drawn for the purpose.
3.
Scope
The scope of the study is limited to Open Access articles appearing in 24 English-language OA journals in the fields of Physics, Chemical Engineering, Sociology, Psychology, Economics and Environmental Science.
4.
Related Literature
Lawrence (2001)(8) analyzed 119,924 conference articles in Computer Science across 1990-2000 and found that OA articles are cited 4.5 times more than non-OA articles. Antelman (2004)(9) studied the research impact of OA vs. non-OA articles across four disciplines and found that OA articles have a greater research impact than non-OA articles in all four disciplines; furthermore, OA articles in Mathematics and Electrical & Electronics Engineering have a greater research impact than those in Political Science and Philosophy. Harnad & Brody (2004)(10) compared the citation counts of individual OA and non-OA articles appearing in the same (non-OA) journals, which revealed a citation advantage for OA; the same study (10) also found that OA increases the impact of publicly funded research, based on a 12-year sample of 14 million articles across all disciplines. In another study, Hajjem, Harnad & Gingras (2005)(11) covered ten disciplines across 12 years and found that OA articles consistently receive more citations, and that it is unlikely that the OA citation advantage is merely or mostly a self-selection bias (authors making only their better articles OA). Eysenbach (2006)(12) used the Web of Science database to study the prestigious journal "Proceedings of the National Academy of Sciences (PNAS)", which publishes both OA and non-OA articles, and found that OA articles published side by side with non-OA articles are cited earlier and about twice as often as non-OA articles. Tonta, Unal & Al (2007)(13) studied the research impact of OA articles across nine disciplines and found that research impact varies from discipline to discipline; OA articles in Biology and Economics had the highest research impact, compared with OA articles in hard, urban and convergent disciplines (such as Physics, Mathematics and Chemical Engineering). Hajjem et al. (2005)(14) analyzed the citation impact of OA articles across four disciplines (Biology, Business, Psychology and Sociology) using the ISI CD-ROM database from 1992-2003 and found a citation advantage of 25%-250% for OA articles. To examine the cause of the OA research impact in the field of Astronomy, Kurtz et al. (2005)(15) analyzed the OA (open access), EA (early access) and SB (self-selection bias) postulates and found strong EA and SB effects on citations, stronger than the OA effect.
5.
Methodology
5.1
Selection of Journals
The Directory of Open Access Journals (www.doaj.org) was used to select OA journals in six disciplines. Monolingual (English-language) journals dealing with a single discipline and having back issues available since 2000 were selected. Since Elsevier's Scopus database was used for the identification of citations, journal titles not covered by Scopus were excluded.
5.2
Selection of articles
The OA articles selected for the study were published in 2000, 2002 and 2004 in the disciplines of Physics, Chemical Engineering, Economics and Environmental Science, and between 2000 and 2004 in Psychology and Sociology, since the frequency of articles in the latter journals is low. After the articles of each journal were arranged in chronological order, a 10% sample was randomly selected using the function "sample" of the 'R' software (R Development Core Team)(16). Scopus was used as the tool for the identification of citations in view of the following findings from the literature. Jacso (2005)(17) compared the WOS, Scopus and Google Scholar citation databases in terms of their major features and found that Google Scholar lacks competence and understanding of basic issues of citation indexing. Meho & Yang (2006)(18) found that the use of Scopus and Google Scholar in addition to WOS significantly alters the ranking of scholars by citation counts. Burnham (2006)(19) found that Scopus and WOS complement each other, as neither resource is all-inclusive; WOS has an advantage over Scopus in depth of coverage, with the full WOS database going back to 1945 and Scopus going back to 1966. Bauer & Bakkalbasi (2005)(20) found that WOS provided the largest citation counts for JASIST articles published in 1985, whereas Google Scholar provided higher citation counts for those published in 2000, with no difference in citation counts between WOS and Scopus for the 2000 articles. Bakkalbasi, Bauer, Glover & Wang (2006)(21) found that WOS retrieved higher citation counts than Scopus and Google Scholar for articles published in Oncology and Condensed Matter Physics journals in 1993; Scopus showed higher citation counts for more recent (2003) Oncology articles, whereas Google Scholar provided the largest set of unique citations for the 2003 Oncology articles. The Institute for Scientific Information (ISI) citation databases have been used for decades, often as the only tool for locating citations and conducting citation analysis; ISI databases (or Web of Science) may, however, no longer be adequate as the only or even the main sources of citations, because new databases (such as Scopus and Google Scholar) and tools that allow citation searching are now available (Meho and Yang, 2006)(22). The 441 OA articles thus selected were searched for citations in the Scopus database, and the number of citations, self-citations, dates and other details were recorded for each article (Table 1). The data were tabulated and analyzed systematically to reveal the findings in accordance with the stated objectives, using standard statistical techniques.
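The authors drew the 10% sample with the sample function of R; the following is a rough Python equivalent of that single step, with placeholder journal names and article counts rather than the study's data:

```python
# Illustrative 10% random draw per journal (the paper used R's sample() function).
import math
import random

def draw_sample(articles_by_journal, fraction=0.10, seed=1):
    """articles_by_journal maps a journal name to its chronologically ordered article IDs."""
    rng = random.Random(seed)
    return {journal: sorted(rng.sample(ids, max(1, math.ceil(fraction * len(ids)))))
            for journal, ids in articles_by_journal.items()}

print(draw_sample({"Journal A": list(range(50)), "Journal B": list(range(200))}))
```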
Subject | DOAJ journals | Sample journals | Total articles | Sample articles
Physics | 69 | 6 | 2738 | 274 (10.00)*
Chemical Engineering | 15 | 3 | 842 | 85 (10.09)
Psychology | 79 | 4 | 188 | 19 (10.10)
Sociology | 57 | 4 | 102 | 11 (10.78)
Economics | 65 | 4 | 233 | 24 (10.30)
Environmental Science | 54 | 3 | 276 | 28 (10.14)
Total | 339 | 24 | 4379 | 441 (10.07)

Table 1: Sampling Statistics. *Figures in parentheses indicate percentages.
6.
Hypothesis
The hypothesis formulated for testing reads: "OA articles in hard, urban, and convergent disciplines have higher research impact than those in soft, rural and divergent disciplines". The cue is taken from the study of Tonta, Unal and Al (2007)(23), which builds on the previous
studies of Whitley (2000)(24), Antelman (2006)(25) and Becher & Trowler (2001)(26), and classifies the disciplines into hard/soft, urban/rural and convergent/divergent fields. Physics and Chemical Engineering represent hard/urban/convergent disciplines, as they are hard and applied sciences, convergent because of their close relationships with other disciplines, and urban in their social aspects. Sociology and Psychology represent soft/rural/divergent disciplines, as they are soft, rural and divergent in nature. Economics and Environmental Science represent a mixture of hard/soft, urban/rural and convergent/divergent components.
7.
Results and Discussion
The 441 OA articles received a total of 1052 citations (mean = 2.38, S.D. = 4.96). The average number of citations per OA article varied from discipline to discipline and from journal to journal within disciplines. Economics received the highest average citations (3.33) and Sociology the lowest (0.27). Among the journals, Demographic Research received the highest average citations (5.12), whereas IDEA: a Journal of Social Issues and the Journal of Memetics - Evolutionary Models of Information Transmission received the lowest (0.00). The distributions of citations for all disciplines are skewed, as the S.D. for every discipline is higher than its mean. The standard deviation within journals varied from 10.67 (Brazilian Journal of Physics) to 0.00 (Psyche: an Interdisciplinary Journal of Research on Consciousness). (Tables 2-4.)
S. No | Name of the Journal | Total articles | Total citations | Self citations | Mean | Median | Mode | S.D. | Half life (years)
Physics:
1 | Acta Physica Polonica B | 88 | 290 | 57 (19.65)* | 3.29 | 2 | 0 | 4.91 | 3
2 | Brazilian Jr. of Physics | 55 | 167 | 56 (33.53) | 3.03 | 1 | 0 | 10.67 | 4
3 | Entropy | 6 | 19 | 7 (36.84) | 3.16 | 3 | 0 | 3.12 | 3
4 | New Journal of Physics | 34 | 85 | 16 (18.82) | 2.5 | 1 | 0 | 3.07 | 3
5 | Pramana: Jr. of Physics | 73 | 80 | 22 (27.5) | 1.10 | 0 | 0 | 2.09 | 4
6 | Turkish Jr. of Physics | 18 | 13 | 3 (23.07) | 0.72 | 0 | 0 | 1.4 | 5
Physics (all) | 274 | 654 | 161 (24.61) | 2.38 | 1 | 0 | 5.81 | 3
Chemical Engineering:
1 | Brazilian Jr. of Chemical Engineering | 22 | 40 | 13 (32.5) | 1.81 | 1 | 1 | 2.01 | 4
2 | Iranian Polymer Journal | 13 | 20 | 8 (40.0) | 1.53 | 1 | 0 | 1.85 | 3
3 | Jr. of Chemical Engg. of Japan | 50 | 137 | 55 (40.14) | 2.74 | 1.5 | 1 | 3.32 | 4
Chemical Engineering (all) | 85 | 197 | 76 (38.57) | 2.31 | 1 | 1 | 2.86 | 4
Physics & Chemical Engineering | 359 | 851 | 237 (27.84) | 2.37 | 1 | 0 | 5.26 | 4
Table 2: Citation count of OA articles (Physics and Chemical Engineering). *Figures in parentheses indicate percentages.

Out of the 24 journals, two (IDEA: a Journal of Social Issues and the Journal of Memetics - Evolutionary Models of Information Transmission) received no citations, and four journals received only one citation each. Out of 1052 citations, 294 (27.94%) are self-citations. The self-citation rate differs from discipline to discipline and from journal to journal within disciplines: Environmental Science has the highest self-citation rate (46.15%), whereas Sociology received no self-citations (0.0%). Out of 441 articles, 164 (37.18%) received no citations, 100 (22.67%) received one citation each, 49 (11.11%) received two citations each, 38 (8.16%) received three citations each, and 90 (20.40%) received more than three citations. The highest proportion of uncited articles is in Sociology (72.72%) and the lowest in Economics
(16.66%). Out of the 24 journals, the top five research-impact journals belong to Economics (2), Environmental Science (1) and Physics (2). The top ten research-impact articles are from Physics (6), Chemical Engineering (2), Economics (1) and Environmental Science (1); they receive 236 (22.43%) citations in aggregate (Table 5).
S. No | Journal | Total articles | Total citations | Self citations | Mean | Median | Mode | S.D. | Half life (yrs)
Psychology:
1 | Current Research in Social Psychology | 10 | 13 | 3 (23.07)* | 1.3 | 0.5 | 0 | 1.82 | 3
2 | Dynamical Psychology | 3 | 1 | - | 0.33 | 0 | 0 | 0.57 | -
3 | Journal of Technology in Counseling | 5 | 12 | 1 (8.33) | 2.4 | 2 | 1 | 1.67 | 3
4 | Psyche | 1 | 1 | - | 1 | 1 | 1 | 0.00 | 4
Psychology (all) | 19 | 27 | 4 (14.81) | 1.42 | 1 | 0 | 1.67 | 3
Sociology:
1 | Jr. of Criminal Justice & Popular Culture | 4 | 2 | - | 0.5 | 0.5 | - | 0.57 | 4
2 | IDEA: a Journal of Social Issues | 1 | 0 | - | 0.0 | 0 | 0 | 0.00 | -
3 | Journal of Memetics | 2 | 0 | - | 0.0 | 0 | 0 | 0.00 | -
4 | Theory & Science | 4 | 1 | - | 0.25 | 0 | 0 | 0.5 | 5
Sociology (all) | 11 | 3 | - | 0.27 | 0 | 0 | 0.46 | 4
Psychology & Sociology | 30 | 30 | 4 (13.33) | 1.00 | 0.5 | 0 | 1.46 | 3
Table 3: Citation count of OA articles (Psychology and Sociology). *Figures in parentheses indicate percentages.
S. No | Name of the Journal | Total articles | Total citations | Self citations | Mean | Median | Mode | S.D. | Half life (yrs)
Economics:
1 | Asian Development Review | 3 | 12 | - | 4.00 | 4 | - | 1.00 | 4
2 | IMF Staff Papers | 9 | 21 | 2 (9.52) | 2.33 | 1 | 1 | 2.59 | 6
3 | Demographic Research | 8 | 41 | 9 (21.95) | 5.12 | 4.5 | 1 | 4.76 | 4
4 | Jr. of Regional Analysis & Policy | 4 | 6 | - | 1.5 | 1.5 | - | 1.29 | 3
Economics (all) | 24 | 80 | 11 (13.75) | 3.33 | 2.5 | 1 | 3.42 | 4
Environmental Science:
1 | Electronic Green Journal | 3 | 1 | - | 0.33 | 0 | 0 | 0.57 | 4
2 | Park Science | 4 | 3 | 1 (33.33) | 0.75 | 0.5 | 0 | 0.95 | 1
3 | Water S.A. | 21 | 87 | 41 (47.12) | 4.14 | 2 | 0 | 5.09 | 4
Environmental Science (all) | 28 | 91 | 42 (46.15) | 3.25 | 1 | 0 | 4.67 | 5
Economics & Environmental Science | 52 | 171 | 53 (30.99) | 3.28 | 2 | 0 | 4.10 | 5
Table 4: Citation count of OA articles (Economics and Environmental Science). *Figures in parentheses indicate percentages.

The half-life period (the time taken to receive half of all citations) differs from discipline to discipline and from journal to journal within each discipline. The half-life period estimated for Physics and Psychology is 3 years; it is 4 years for Chemical Engineering, Sociology and Economics, and 5 years for Environmental Science. The highest half-life period amongst the journals is 6 years, for IMF Staff Papers.
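The half-life figures can be computed from per-year citation counts; a minimal sketch follows, assuming the definition given above (the year by which half of all citations have accrued), since the paper does not spell out the exact procedure:

```python
# Illustrative half-life computation: number of years after publication needed
# to accumulate at least half of all citations received. The definition is assumed.
def citation_half_life(citations_per_year):
    """citations_per_year: list of citation counts, index 0 = publication year."""
    total = sum(citations_per_year)
    if total == 0:
        return None  # uncited articles have no half-life
    running = 0
    for year, count in enumerate(citations_per_year):
        running += count
        if running * 2 >= total:
            return year + 1  # whole years, counting the publication year as year 1
    return len(citations_per_year)

print(citation_half_life([1, 3, 6, 2, 1]))  # -> 3
```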
S. No | Title / Author / Volume / Year of publication | Citation count | Journal
1 | Paths to Self-Organized Criticality / Ronald Dickman, Miguel A. Muñoz, Alessandro Vespignani, and Stefano Zapperi / 30,1 / 2000 | 78 | Brazilian Journal of Physics**
2 | Chiral Doubling of Heavy-Light Hadrons: BaBar 2317 MeV/c2 and CLEO 2460 MeV/c2 Discoveries / M.A. Nowak, M. Rho, I. Zahed / 35,10 / 2004 | 34 | Acta Physica Polonica B****
3 | Soil-water utilisation and sustainability in a semi-arid grassland / HA Snyman / 26,3 / 2000 | 20 | Water S.A.*
4 | The Hadronic tau Decay of a Heavy H+- in ATLAS / Ketevi Adikle Assamagan, Yann Coadou / 33,2 / 2002 | 19 | Acta Physica Polonica B****
5 | The Ising Model and Real Magnetic Materials / W. P. Wolf / 30,4 / 2000 | 16 | Brazilian Journal of Physics**
6 | Removal and Recovery System of Hexavalent Chromium from Waste Water by Tannin Gel Particles / Yoshio Nakano, Michihito Tanaka, Yasuo Nakamura and Masayuki Konno / 33,5 / 2000 | 15 | Journal of Chemical Engineering of Japan**
7 | Extraction Mechanism of Rare Metals with Microcapsules Containing Organophosphorus Compounds / Eiji Kamio, Michiaki Matsumoto and Kazuo Kondo / 35,2 / 2002 | 15 | Journal of Chemical Engineering of Japan**
8 | From Atomic Physics to Solid-State Physics: Magnetism and Electronic Structure of PrNi5, ErNi5, LaCoO3 and UPd2Al3 / R.J. Radwanski, R. Michalski, Z. Ropka / 31,10-11 / 2000 | 13 | Acta Physica Polonica B****
9 | TRImuP - a New Facility to Investigate Fundamental Interactions with Optically Trapped Radioactive Atoms / Klaus Jungmann / 33,8 / 2002 | 13 | Acta Physica Polonica B****
10 | Tempo-Adjusted Period Parity Progression Measures: Assessing the Implications of Delayed Childbearing for Cohort Fertility in Sweden, the Netherlands and Spain / Hans-Peter Kohler, José Antonio Ortega / 6 / 2002 | 13 | Demographic Research*
Table 5: The ten most highly cited OA articles in the sample. The number of * marks on a journal title indicates the frequency of that journal in the list.
7.1
Testing the Hypothesis
The average citation rate for the hard/urban/convergent disciplines (Physics and Chemical Engineering) is 2.37, with a standard deviation of 5.26. Of the 851 citations received by the 359 articles in this group, 237 (27.84%) are self-citations; 136 articles (37.88%) received no citations, 79 (22.00%) received one citation each, 41 (11.42%) received two citations, 34 (9.47%) received three citations and 69 (19.22%) received more than three citations. In this group the lowest citation rate was for the Turkish Journal of Physics (0.72) and the highest for Acta Physica Polonica B (3.29). The highest S.D. was for the Brazilian Journal of Physics (10.67), while the lowest was for the Turkish Journal of Physics (1.40). The collective S.D. for the Physics journals was 5.81, whereas it is 2.86 for the Chemical Engineering journals, which means that the distribution of citations for Physics is more skewed than that for Chemical Engineering. The half-life period for this group is 4 years. For the soft/rural/divergent disciplines (Sociology and Psychology) the average citation rate is 1.00, with a standard deviation of 1.46. There are only four self-citations (13.33%) out of 30 citations for 30 OA articles. Of the 30 articles, 15 (50.00%) received no citations, 9 (30.00%) received one citation, 2 (6.66%) received two citations, 1 (3.33%) received three citations and 3 (10.00%) received more than three citations. Of the 8 journals in this group, two received no citations and three received only one citation each. The lower S.D. of this group suggests less skewness in the distribution of citations. The half-life period for this group stood at 3 years. In this group the highest citation rate is 2.4, for the Journal of Technology in Counseling. The third group, Economics and Environmental Science (a mixture of hard/soft, urban/rural and convergent/divergent components), presents an average citation rate of 3.28 with a standard deviation of 4.10. The
self-citation rate varies significantly, from 13.75% (Economics) to 46.15% (Environmental Science), the overall self-citation rate being 30.99%. Of the 52 OA articles in this group, which received 171 citations, 13 (25.00%) received no citations, 12 (23.07%) received one citation each, 6 (11.53%) received two citations, 3 (5.76%) received three citations, and 18 (34.61%) received more than three citations. None of the seven journals in this group received zero citations, and one journal received only one citation. The half-life period for the articles in this group is 5 years. The lowest citation rate in this group is for the Electronic Green Journal (0.33). Thus the hypothesis that "OA articles in hard, urban and convergent disciplines have higher research impact (receive more citations) than those in soft, rural and divergent disciplines" is accepted. However, the Economics and Environmental Science group (a mixture of hard/soft, urban/rural and convergent/divergent components) shows a higher citation rate than either of the other two groups. Even if self-citations are excluded from all the disciplines, the hypothesis is still acceptable, and Economics and Environmental Science still show the highest citation rate.
8.
Conclusion
The research impact of OA articles varies from discipline to discipline and from journal to journal within disciplines. The hard, urban and convergent disciplines have a higher research impact than the soft, rural and divergent disciplines, but a lower one than the mixed hard/soft, urban/rural, convergent/divergent disciplines. The results are somewhat similar to those of Tonta, Unal & Al (2007)(27). However, further studies need to be undertaken on a wider research canvas to ascertain the causes of the varying degrees of acceptance of OA by researchers in these disciplines.
9.
References
[1] KOHLER, Barbara M., & RODERE, Nancy K. (2006). Scholarly communications program: force for change. Biomedical Digital Libraries, 3(6). Retrieved November 17, 2007, from http://www.bio-diglib.com/content/3/1/6
[2] MADHAN, M., RAO, Y. Srinivasa, & AWASTHI, Shipra. (2006). Institutional repository enhances visibility and prestige of the institute - the case of National Institute of Technology, Rourkela. Paper presented at the National Conference on Information Management in Digital Libraries, IIT Kharagpur, India. Retrieved May 23, 2007, from http://dspace.nitrkl.ac.in/dspace/bitstream/2080/310/1/madhan1.pdf
[3] MARK, Timothy, & SHERRER, Kathleen. (2006). Institutional repositories: a review of content recruitment strategies. World Library and Information Congress: 72nd IFLA General Conference and Council, Seoul, Korea. Retrieved November 26, 2007, from http://www.ifla.org/IV/ifla72/papers/155-Mark_Shearer-en.pdf
[4] BLUH, Pamela. (2006). Open access, legal publishing, and online repositories. Journal of Law, Medicine & Ethics, 34(1). Retrieved March 20, 2008, from http://digitalcommons.law.umaryland.edu/cgi/viewcontent.cgi?article=1048&context=fac_pubs
[5] MARK and SHERRER, ref. 3.
[6] HAJJEM, C., et al. (2005). Open access to research increases citation impact. Retrieved December 8, 2007, from http://eprints.ecs.soton.ac.uk/11687/
[7] MEHO, L.I., & YANG, K. (2006). A new era in citation and bibliometric analyses: Web of Science, Scopus, and Google Scholar. Journal of the American Society for Information Science and Technology. Retrieved December 8, 2007, from http://arxiv.org/ftp/cs/papers/0612/0612132.pdf
[8] LAWRENCE, Steve. (2001). Online or invisible? Nature, 411(6837). Retrieved November 28, 2007, from http://citeseer.ist.psu.edu/cache/papers/cs/29685/http:zSzzSzwww.ipo.tue.nlzSzhomepageszSzmrauterbzSzpublicationszSzCITESEER2001onlinenature.pdf/lawrence01online.pdf
[9] ANTELMAN, Kristin. (2004). Do open-access articles have a greater research impact? College & Research Libraries, 65(5). Retrieved November 26, 2007, from http://eprints.rclis.org/archive/00002309/01/do_open_access_CRL.pdf
[10] HARNAD, Stevan, & BRODY, Tim. (2004). Comparing the impact of open access (OA) vs. non-OA articles in the same journals. D-Lib Magazine, 10(6). Retrieved November 29, 2007, from http://webdoc.sub.gwdg.de/edoc/aw/d-lib/dlib/june04/harnad/06harnad.html
[11] HAJJEM, C., HARNAD, S., & GINGRAS, Y. (2005). Ten-year cross-disciplinary comparison of the growth of open access and how it increases research citation impact. Retrieved November 26, 2007, from http://arxiv.org/ftp/cs/papers/0606/0606079.pdf
[12] EYSENBACH, Gunther. (2006). Citation advantage of open access articles. PLoS Biology, 4(5). Retrieved November 27, 2007, from http://biology.plosjournals.org/archive/1545-7885/4/5/pdf/10.1371_journal.pbio.0040157-L.pdf
[13] TONTA, Yasar, UNAL, Yurdagul, & AL, Umut. (2007). The research impact of open access journal articles. Proceedings ELPUB 2007 Conference on Electronic Publishing, Vienna, Austria. Retrieved November 21, 2007, from http://eprints.rclis.org/archive/00009619/01/tonta-unal-alelpub2007.pdf
[14] HAJJEM, C., et al. (2005). Open access to research increases citation impact. Retrieved December 8, 2007, from http://eprints.ecs.soton.ac.uk/11687/
[15] KURTZ, M.J., et al. (2005). The effect of use and access on citations. Retrieved November 27, 2007, from http://arxiv.org/PS_cache/cs/pdf/0503/0503029v1.pdf
[16] R DEVELOPMENT CORE TEAM. (2007). R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing. Retrieved December 8, 2007, from http://www.R-project.org
[17] JASCO, P. (2005). As we may search - Comparison of major features of the Web of Science, Scopus, and Google Scholar citation-based and citation-enhanced databases. Current Science, 89(9). Retrieved November 27, 2007, from http://www.ias.ac.in/currsci/nov102005/1537.pdf
[18] MEHO and YANG, ref. 7.
[19] BURNHAM, J. F. (2006). Scopus database: a review. Biomedical Digital Libraries, 3(1). Retrieved December 8, 2007, from http://www.bio-diglib.com/content/3/1/1
[20] BAUER, K., & BAKKALBASI, N. (2005). An examination of citation counts in a new scholarly communication environment. D-Lib Magazine, 11(9). Retrieved December 8, 2007, from http://www.dlib.org//dlib/september05/bauer/09bauer.html
[21] BAKKALBASI, N. (2006). Three options for citation tracking: Google Scholar, Scopus and Web of Science. Biomedical Digital Libraries, 7. Retrieved December 10, 2007, from http://www.biodiglib.com/content/pdf/1742-5581-3-7.pdf
[22] MEHO and YANG, ref. 7.
[23] TONTA and UNAL, ref. 13.
[24] WHITLEY, R. (2000). The intellectual and social organization of the sciences. New York: Oxford University Press.
[25] ANTELMAN, Kristin. (2006). Self-archiving practice and the influence of publisher policies in the Social Sciences. Learned Publishing, 19(2). Retrieved December 8, 2007, from http://www.lib.ncsu.edu/staff/kaantelm/antelman_self-archiving.pdf
[26] BECHER, T., & TROWLER, P.R. (2001). Academic tribes and territories: intellectual enquiry and the culture of disciplines (2nd ed.). Buckingham: SRHE and Open University Press.
[27] TONTA and UNAL, ref. 13.
Exploration and Evaluation of Citation Networks
Karel Jezek1; Dalibor Fiala2; Josef Steinberger1
1 Department of Computer Science & Engineering, University of West Bohemia, Univerzitni 8, 306 14 Pilsen, Czech Republic
e-mail: jezek_ka@kiv.zcu.cz; jstein@kiv.zcu.cz
2 Gefasoft AG, Dessauerstrasse 15, 80992 Munich, Germany
e-mail: dalibor.fiala@gefasoft.de
Abstract
This paper deals with the definition, explanation and testing of the PageRank formula modified and adapted for bibliographic networks. Our modifications of PageRank take into account not only citations but also co-authorship relationships. We verified the capabilities of the developed algorithms by applying them to data from the DBLP digital library and subsequently comparing the resulting ranks of the sixteen winners of the ACM SIGMOD E. F. Codd Innovations Award from 1992 to 2007. Such a ranking, based on both citation and co-authorship information, gives better and fairer results than standard PageRank. The proposed method is able to reduce the influence of citation loops and offers opportunities for further improvements, e.g. introducing temporal views into the citation evaluation algorithms.
Keywords: WWW structure mining; PageRank; citation analysis; citation networks; ranking algorithms; social networks
1.
Introduction
Rating of research institutions and of researchers themselves is a challenging and important area of investigation; its conclusions have a direct influence on the acquisition of financial support for research groups. The aim of our work is to investigate citation networks (networks of relationships between citing and cited publications) and other similar networks, e.g. hyperlink structures of the Web, and to derive a rating of the individual participants modeled as nodes of the network graph. Every system modeled as a graph is a network; the two notions are actually synonyms, although the word graph has a more abstract meaning, and therefore mathematicians prefer speaking of graphs, whereas network is the usual notion in the terminology of the social sciences. Real world networks are grouped into social, information, technological and biological networks [1]. Citation networks, as well as World Wide Web hyperlink structures, are mostly classed as information networks, but some authors [2] use the term "social" in this context. As stated above, network systems can be modeled as graphs: mathematical notions and formulas from graph theory are available to explore their features, and results from one type of network can be profitably utilized in others. In Section 2 we are concerned with the ranking of Web pages; methods originally developed for determining page importance were soon recognized as applicable to citation analysis. Connections between the ranking method and co-authorship networks are discussed in Section 3. Section 4 is the core of the article and introduces our evaluation method for citation networks. The next part presents results achieved on data from the DBLP digital library. Finally, possible further improvements are proposed together with other
application areas where the introduced method can be used. 2.
Ranking of Network Structures
The WWW is a gigantic, extensively explored network structure. Filtering Web documents by relevance to the topic the user is interested in does not sufficiently reduce the number of retrieved documents; some further criteria must decide which documents are worth the user's attention and which are not. In [3] Page and Brin proposed an iteratively calculated page ranking (or topic distillation) algorithm based on hyperlinks. This algorithm, suitably named PageRank, has at the same time been used in the famous search engine Google, and it is without doubt one of the basic reasons behind Google's success. The PageRank technique is able to order Web documents by their significance. Its principle lies in collecting and distributing "weights of importance" among pages according to their hyperlink connections. Figure 1 demonstrates PageRank calculations for a piece of a hypothetical network: it assigns high ranks to pages that are linked to by documents that themselves have a high rank. The whole process runs iteratively and represents probably the world's largest matrix computation.
Figure 1: Rank distribution and collection within a PageRank calculation

Approximately at the same time as PageRank appeared, J. Kleinberg [4] proposed a similar algorithm for determining significant web pages called HITS. Other new ranking methods and modifications soon appeared - SALSA, SCEAS Rank, ObjectRank, BackRank, AuthorRank, etc. To prove the applicability of a method for rating research institutions, we collected the Web pages of the main Czech computer science departments and applied the rating formula to their hyperlink structure [5].
2.1
PageRank
Let us briefly introduce the PageRank principles as presented in [3] and [6]. Let G = (V,E) be a directed graph, where V is a set of vertices (corresponding to Web pages) and E a set of edges (representing hyperlinks between Web pages). The PageRank score PR(u) for Web page u is defined as:
PR(u) = (1 - d)/|V| + d ∑_{(v,u)∈E} PR(v)/Dout(v)        (1)
where |V| is the number of nodes, d is the damping factor (an empirically determined constant set between 0.8 and 0.9) and Dout(v) is the out-degree of node v (the number of outgoing edges from node v). The PageRank of a node thus depends on the PageRanks of other nodes. As the hyperlink structure is usually cyclic, the PageRank evaluation is a recursive process allowing the current node to influence all nodes to which a path exists from it. The randomizing factor (1-d) represents the possibility of jumping to a random node in the graph regardless of the out-edges from the current node; d, on the contrary, stands for the probability of following an out-link from the present node. Introducing the random term prevents loops of nodes (rank sinks) from accumulating too much rank and not propagating it further. An example of a rank sink is illustrated in Figure 2. There are also problems with nodes without out-links (referred to as dangling pages in PageRank evaluation) that would not distribute their rank either. In fact, zero-out-degree Web pages and rank sinks are the main problems in PageRank processing. On the other hand, nodes without in-links are not harmful, and their rank is always smaller than that of any node with some in-links, as expected. The PageRank method is rather reliable: the necessary number of iterations depends on the extensiveness of the Web graph, but the computation converges promptly. For a graph with over 320 million nodes (pages), only about 50 iterations were required, as [3] claims. The frequency of normalization and the order of nodes affect the final ranking, but the effect on the resulting rank is not substantial.
Figure 2: An example of a graph with a rank sink

We evaluate an iterative calculation of PageRank as follows:
1. We remove duplicate links and self-links from the graph.
2. We set the initial PageRanks of all nodes in the graph uniformly so that the total rank in the system is one. This is the zeroth iteration.
3. We remove nodes having no out-links iteratively, because removing one zero-out-degree node may cause another one to appear.
4. We compute the PageRank scores for all nodes in the residual graph according to formula (1), using the scores from the previous iteration. We perform normalization so that the total rank in the system (including the vertices removed in step 3) is again one.
5. We repeat step 4 until convergence. Numerical convergence of the scores is usually not necessary; an ordering of nodes (by PageRank) that does not change (or changes relatively little) is satisfactory, as [7] claims.
6. We gradually add back the nodes removed in step 3, compute their rank scores and renormalize the whole system.
Normalization of the rank obtained from in-linking nodes by their out-degree is an important feature of PageRank: in this way, nodes that are connected to many other nodes are penalized. It corresponds to a similar situation in citation evaluation, where the citations of frequently citing authors are less valuable than those of authors who cite rarely. This analogy was a motivating idea for applying PageRank principles to bibliographic
citations. 2.2
SCEAS Rank
In [8] an iterative PageRank-like method, SCEAS (Scientific Collection Evaluator with Advanced Scoring), is used to rank scientific publications. It evaluates the impact of publications on the basis of their citations. In a graph where nodes are publications and edges are citations between them, the original PageRank metric is not appropriate: such a graph often contains cycles, which are in fact a kind of self-citation, so we would rather the nodes in a cycle did not have much influence on the rank distribution. Similarly, direct citations should have a higher impact than indirect citations, and the impact of an indirect citation should become smaller as the distance between the cited and citing publications grows.
$R(u) = (1 - d) + d \sum_{(v,u)\in E} \frac{R(v) + b}{D_{out}(v)}\, a^{-1}$, where $a \ge 1$, $b \ge 0$.   (2)
The SCEAS formula (2) computes the rank score R(u) with a direct citation enforcement factor b and a speed a at which the enforcement of indirect citations converges to zero. For b = 0 and a = 1, formula (2) is equivalent to the PageRank formula (1). The authors experimentally showed that SCEAS converges faster than PageRank. They carried out experiments with data from the DBLP digital library and compared the SCEAS rankings with several other ranking schemes, including PageRank, HITS and a "baseline" ranking constituted of authors who have won an ACM award, and showed that their method is superior to the others. We adopted their comparison methodology to test our novel algorithm.
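A minimal sketch of one SCEAS update per formula (2), assuming the same dictionary-of-out-links graph representation as in the earlier sketch (every publication appears as a key); the parameter values are illustrative, not taken from [8].

```python
def sceas_iteration(ranks, graph, d=0.85, a=2.0, b=1.0):
    """One update of formula (2): each publication u receives (R(v) + b) * a**-1 / D_out(v)
    from every publication v that cites it; b enforces direct citations and a controls
    how quickly the influence of indirect citations dies out."""
    new = {u: 1.0 - d for u in graph}
    for v, cited in graph.items():          # v cites every publication in `cited`
        if not cited:
            continue
        share = (ranks[v] + b) / (a * len(cited))
        for u in cited:
            new[u] += d * share
    return new
```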
2.3 Other ranking methods
As mentioned above, PageRank is not the only method of ranking. The most elementary way is to count in-links for each node. The most authoritative node is then the one with the highest number of in-linking edges. The rank Rin(u) of node u can be computed as:
$R_{in}(u) = \sum_{(v,u)\in E} w(v,u)$   (3)
In the case in which the graph G is unweighted, i.e. all weights w(v,u) are equal to one, the sum of in-linking edges gives the in-degree of the node. If applied to citations, all citations have the same weight and the citation of B in A does not influence the citation of C in B; publication C is ranked by (3) as if it were not indirectly (through B) cited in A. Note that PageRank does preserve this transitive feature, respecting contributions of reputation from distant nodes. Another ranking technique worth mentioning is HITS (Hyperlink-Induced Topic Search) [4], [9], which defines two values (authority A(u) and hubness H(u)) for each node u as follows:
$A(u) = \sum_{(v,u)\in E} H(v), \qquad H(u) = \sum_{(u,v)\in E} A(v)$   (4)
The importance of a node thus has two measures: nodes pointed to by many nodes with high hub scores have high authority, and nodes pointing to many good authorities have high hubness. The mutual reinforcement between hubs and authorities is evident. HITS is applicable to citation networks as well and gives reasonable results.
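A minimal sketch of the HITS updates in (4), again assuming a dictionary-of-out-links graph; normalizing the two score vectors after each pass is a standard practical step rather than part of formula (4).

```python
def hits(graph, iterations=50):
    """Iterate the mutually reinforcing authority/hub updates of formula (4)."""
    auth = {u: 1.0 for u in graph}
    hub = {u: 1.0 for u in graph}
    for _ in range(iterations):
        # Authority: sum of hub scores of the nodes linking to u.
        auth = {u: sum(hub[v] for v in graph if u in graph[v]) for u in graph}
        # Hubness: sum of authority scores of the nodes u links to.
        hub = {u: sum(auth[v] for v in graph[u]) for u in graph}
        # Normalize so the scores do not grow without bound.
        na, nh = sum(auth.values()) or 1.0, sum(hub.values()) or 1.0
        auth = {u: a / na for u, a in auth.items()}
        hub = {u: h / nh for u, h in hub.items()}
    return auth, hub
```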
The necessity of working with two scores was the main reason why we preferred the PageRank algorithm for our further research. A simple metric for scoring researchers, called the h-score (or h-index), was proposed in [10]. A researcher has a score h if h of his papers have at least h citations each. The h-score makes it possible to evaluate the successfulness of researchers at different levels of seniority. When n is the number of years a researcher has been in service (since the year of his first publication), then his successfulness m can be estimated as

$m \approx h / n$   (5)

E.g. a scientist in physics is successful if his/her m is close to 1. The h-index has obvious advantages: it is only a single number, and it does not prefer quantity to quality. On the other hand, it is not comparable across different scientific fields and does not reflect co-authorships.
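A minimal sketch of the h-index and of the seniority-normalized indicator m from formula (5); the input is simply a list of per-paper citation counts.

```python
def h_index(citation_counts):
    """h is the largest number such that h papers have at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    return sum(1 for i, c in enumerate(counts, start=1) if c >= i)

def m_quotient(citation_counts, years_since_first_publication):
    """Formula (5): m is approximately h / n, with n the years since the first publication."""
    return h_index(citation_counts) / years_since_first_publication

# Example: six papers with these citation counts over a five-year career give h = 3, m = 0.6.
print(h_index([10, 5, 3, 2, 1, 0]), m_quotient([10, 5, 3, 2, 1, 0], 5))
```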
3. Co-authorship Networks and Ranking Methods
Co-authorship networks are a special case of social networks in which the nodes represent authors and the edges represent collaboration between authors. Unlike the citation networks mentioned above, in which each edge means an acknowledgement of primacy, a declaration of debt or recognition, in a co-authorship graph an edge connecting two authors expresses the fact that those authors are or were colleagues: they have published one or more articles as a result of common research lasting a year or more. This is in contrast to citations in which the citing author does not know the cited author personally and the two have never collaborated. Co-authorship networks can also express the intensity of cooperation; we can consider the number of co-authors of a paper or the number of common papers to assess the weight of the cooperation.
3.1 AuthorRank method
A co-authorship network model is investigated in [2]. It introduces AuthorRank as an indicator of the importance of an individual author in the network. As the number of collaborating authors is rather limited, the co-authorship graph of all documents consists of strongly connected components whose number may be huge but which can be evaluated independently. AuthorRank assigns impact scores to authors using principles similar to PageRank. Let us briefly mention the main idea of AuthorRank. Any co-authorship network can be described simply as an undirected, unweighted graph where nodes represent authors and edges symbolize the existence of collaboration. If we allow a variety of authorities in the graph, we have to replace every undirected edge between two nodes, e.g. a1 and a2, with two directed edges (one directed from a1 to a2, the second directed from a2 to a1). Further, we have to weight the collaborations non-uniformly, i.e. assign weights wij to the edges. Therefore, we need some additional knowledge which is not included in the undirected co-authorship graph. To show the case in a non-trivial but simple enough example, let us suppose, as in [2], three cooperating authors. Figure 3 shows their co-authorship graphs.
Figure 3: Co-authorship graph
The remaining but substantial problem is the determination of the weights w. Co-authors of a paper published by two authors are obviously more tightly connected than co-authors of a paper written by ten authors, and frequently collaborating authors should be more strongly connected than authors who publish together only occasionally. This problem is solved in [2] with the help of two factors measured in the collaboration graph, co-authorship frequency and exclusivity, which give higher weight to edges that connect authors who often publish together with a minimum number of other authors involved. Let m be the number of publications, N the number of authors and f(pk) the number of authors of publication pk. Then the co-authorship exclusivity gi,j,k, the frequency cij and, on their basis, the weight wij (between authors ai and aj) can be computed in the following way:
$g_{i,j,k} = \frac{1}{f(p_k) - 1}, \qquad c_{ij} = \sum_{k=1}^{m} g_{i,j,k}, \qquad w_{ij} = \frac{c_{ij}}{\sum_{k=1}^{N} c_{ik}}$   (6)
The weights are normalized (divided by the sum of weights of outgoing edges from the node), which is necessary for convergence of an algorithm computing nodes’ prestige. The resulting AuthorRank of an author i is evaluated as follows:
$AR(i) = (1 - d) + d \sum_{j=1}^{N} AR(j) \times w_{ij}$   (7)
where AR(j) corresponds to the AuthorRank of node j, from which an edge with weight wij leads to node i. Let us remember that the method described above works with collaborations, not with citations. We believe that measuring the importance or prestige of nodes solely on the basis of collaboration is questionable at the very least. Why should researchers who have many co-authors be more authoritative than those having just a few co-workers? Consider, e.g., authors who frequently publish their works without co-authors: they are strongly handicapped in the AuthorRank methodology and, in the extreme case of publishing without co-authors at all, completely ignored. Single-author papers are quite common; in the DBLP collection we used in our experiments they made up one third of all papers. Authoritativeness in collaboration networks does not reflect authoritativeness based on citations, yet it is precisely citations that are an accepted means of evaluating a researcher's importance.
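A minimal sketch of the exclusivity, frequency and weight computations of formula (6) and of one AuthorRank update per formula (7); publications are assumed to be given as lists of author identifiers, and the value d = 0.85 follows the convention used elsewhere in this paper.

```python
from collections import defaultdict

def author_rank_weights(publications):
    """publications: list of author lists. Returns normalized weights w[i][j] per formula (6)."""
    c = defaultdict(lambda: defaultdict(float))      # co-authorship frequency c_ij
    for authors in publications:
        if len(authors) < 2:
            continue                                  # exclusivity is undefined for single authors
        g = 1.0 / (len(authors) - 1)                  # co-authorship exclusivity g_{i,j,k}
        for i in authors:
            for j in authors:
                if i != j:
                    c[i][j] += g
    # Normalize by the total weight of the outgoing edges of each node.
    return {i: {j: cij / sum(row.values()) for j, cij in row.items()} for i, row in c.items()}

def author_rank_iteration(ar, w, d=0.85):
    """One update of formula (7): AR(i) = (1 - d) + d * sum over j of AR(j) times the j-to-i weight."""
    return {i: (1 - d) + d * sum(ar[j] * w[j].get(i, 0.0) for j in w) for i in ar}
```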
4. Citation analysis and co-authorship
The main objective of this article is to adapt the PageRank method to the citation analysis task. There are other PageRank modifications, e.g. the one proposed in [8], which is meant particularly for bibliographic citations. The original contributions of our work are extensions and improvements of a traditional citation analysis method. Our innovations are based on considering the mutual cooperation between the cited and the citing author and on its various assessments. If we accept that co-authorship influences citations, we may want to refine the citation analysis results accordingly. To give a higher impact to citations between authors who do not cooperate, we need to involve co-authorship networks in the evaluation process.
Our rating model is based on three graphs, all of which are derivable from digital library documents:
i) a bipartite co-authorship graph,
ii) a publication-citation graph,
iii) an author-citation graph.
A simple example of the three graphs is shown in Figure 4.
Figure 4: Example of graphs derivable from a digital library (co-authorship, publication-citation and author-citation graphs)
Ad i: The nodes of this unweighted graph consist of two disjoint sets, one containing authors and the other publications. The edges are undirected and match authors with their publications.
Ad ii: This graph is unweighted and its nodes represent publications. The edges are directed and express the bindings between citing and cited publications. No common authors of a citing and a cited publication are allowed.
Ad iii: This is an edge-weighted, directed graph. Its nodes represent the set of authors and its edges represent citations between authors. The graph is derivable from the two graphs above. A triple of weights (wuv, cuv, buv) is associated with each edge, where wuv represents the number of citations from citing author u to cited author v, cuv is the number of common publications of the two authors connected by the edge, and buv expresses various aspects of collaboration we want to stress, e.g. the overall number of publications of both authors, the overall number of co-authors, the overall number of distinct co-authors and other alternatives giving a true picture of the effect of cooperation on citations. Strictly speaking, the author-citation graph should have the form of a multi-graph; the introduced triples substitute for the multiplicity of its edges.
For those who prefer mathematical symbolism, let us define the graphs introduced above formally. This allows us to express exactly the weights assigned to the edges of the author-citation graph:
i. The co-authorship graph GP = (P ∪ A, EP) is an undirected, unweighted, bipartite graph, where P ∪ A is the set of vertices (P a set of publications, A a set of authors) and EP is a set of edges. Each edge (p, a) ∈ EP, p ∈ P, a ∈ A, means that author a has co-authored publication p.
ii. The publication-citation graph GC = (P, EC) is a directed, unweighted graph, where P is a set of vertices representing the publications and EC is a set of edges. An edge (pi, pj) ∈ EC denotes a citation of publication pj in publication pi.
iii. The author-citation graph G = (A, E) is a directed, edge-weighted graph, where A is a set
of vertices representing authors and E is a set of edges denoting citations between authors. For every p ∈ P let Ap = {a ∈ A: ∃(p, a) ∈ EP} be the set of authors of publication p. For each pair (ai, aj), ai ∈ A, aj ∈ A, ai ≠ aj, for which there exists (pk, pl) ∈ EC such that (pk, ai) ∈ EP, (pl, aj) ∈ EP and Apk ∩ Apl = ∅ (i.e. no common authors of the citing and cited publications are allowed), there is an edge (ai, aj) ∈ E. Thus, (ai, aj) ∈ E if and only if ∃(pk, pl) ∈ EC ∧ (pk, ai) ∈ EP ∧ (pl, aj) ∈ EP ∧ Apk ∩ Apl = ∅ ∧ ai ≠ aj. The weight wu,v representing the number of citations from u to v can now be defined as wu,v = |C|, where C = {pk ∈ P: ∃(pk, u) ∈ EP ∧ ∃(pl, v) ∈ EP ∧ ∃(pk, pl) ∈ EC ∧ pk ≠ pl}. The weight cu,v representing the number of common publications by u and v is defined as cu,v = |CP|, where CP = {p ∈ P: (p, u) ∈ EP ∧ (p, v) ∈ EP}. The third weight bu,v symbolizes the values obtained from the various formulas we have used in our experiments; they should express more softly the examined views of the authors' cooperation. The considered alternatives were:
a. bu,v = |Pu| + |Pv|, where Pi = {p ∈ P: ∃(p, i) ∈ EP}, i.e. the total number of publications by u plus the total number of publications by v;
b. bu,v = |ADCu| + |ADCv|, where ADCi = {a ∈ A: ∃p ∈ P such that (p, a) ∈ EP ∧ (p, i) ∈ EP}, i.e. the number of all distinct co-authors of u plus the number of all distinct co-authors of v;
c. bu,v = |ADCu| + |ADCv|, where ADCi is defined as above but as a multiset, i.e. the number of all co-authors of u plus the number of all co-authors of v;
d. bu,v = |DCA|, where DCA = {a ∈ A: ∃p ∈ P such that (p, a) ∈ EP ∧ (p, u) ∈ EP ∧ (p, v) ∈ EP}, i.e. the number of distinct co-authors in common publications of u and v;
e. bu,v = |DCA|, where DCA is defined as above but as a multiset, i.e. the number of co-authors in common publications of u and v;
f. bu,v = |Pu| + |Pv| − |SPu| − |SPv|, where Pi = {p ∈ P: ∃(p, i) ∈ EP} and SPi = {p ∈ P: (p, i) ∈ EP ∧ dGP(p) = 1}, i.e. the number of publications by u where u is not the only author plus the number of publications by v where v is not the only author;
g. bu,v = 0, i.e. no refinements by bu,v are introduced.
A sketch of how the basic weights wu,v and cu,v can be derived from these graphs is given below.
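The following is a minimal sketch, assuming publications are given as a mapping from publication identifiers to sets of author identifiers and citations as pairs of publication identifiers; variable names are illustrative only.

```python
from collections import defaultdict
from itertools import combinations

def author_citation_weights(pub_authors, citations):
    """pub_authors: dict publication -> set of author ids (graph i).
    citations: iterable of (citing, cited) publication pairs (graph ii).
    Returns w[u][v], the number of publications of u citing some publication of v,
    and c[u][v], the number of common publications of u and v (weights of graph iii)."""
    citing_pubs = defaultdict(lambda: defaultdict(set))
    for pk, pl in citations:
        citing, cited = pub_authors[pk], pub_authors[pl]
        if citing & cited:
            continue                      # citing and cited publication share an author: skip
        for u in citing:
            for v in cited:
                citing_pubs[u][v].add(pk)
    w = {u: {v: len(pubs) for v, pubs in row.items()} for u, row in citing_pubs.items()}

    c = defaultdict(lambda: defaultdict(int))
    for authors in pub_authors.values():
        for u, v in combinations(sorted(authors), 2):
            c[u][v] += 1                  # one more common publication of u and v
            c[v][u] += 1
    return w, c
```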
The weights are used as parameters in a modified PageRank formula (see below), whose main innovative part is a function of wu,v, cu,v and bu,v, named contribution(u, v) and used as a multiplicative factor of the contributing ranks. The rank of each author is computed from the ranks of the authors citing him (there is an edge (u, v) from the citing author u to the cited author v). The rank formula is not as complicated as it looks at first sight:
$R(v) = \frac{1-d}{|A|} + d \sum_{(u,v)\in E} R(u)\,\frac{contribution(u,v)}{\sum_{(u,k)\in E} contribution(u,k)}$   (8)
Its similarity to the original PageRank is evident. Except for contribution, the meaning of the symbols was explained above; the rank of the cited author v is computed from the ranks of the authors u citing him, and d is, as usual, the damping factor, an empirically determined constant set to 0.85. The contribution from u to v must be normalized (divided by the sum of the contributions from u); the sum of all contributions from one author must be 1 to guarantee convergence. The contribution(x, y) is evaluated by formula (9):
$contribution(x, y) = \frac{w_{x,y}\, f(c_{x,y}, b_{x,y})}{\sum_{(x,j)\in E} w_{x,j}}, \qquad \text{where} \quad f(c_{x,y}, b_{x,y}) = \frac{c_{x,y} + 1}{b_{x,y} + 1}$   (9)
The goal of the presented modification is to penalize cited authors who frequently collaborate with the citing authors. The contribution(x, y) defined in (9) on the one hand increases the prestige of node v in formula (8) proportionally to the number of its citations, but on the other hand it reduces that prestige when the citing author has published (some other publication) together with the cited author (see cx,y in f(cx,y, bx,y)). The reduction was again chosen in proportion to the number of (common) publications. The tightness of the binding between the citing and cited author when they have published some other papers together (note that no common authors of citing and cited publications are allowed) should depend on the number of their co-authors; therefore, we introduced the term bx,y in the formula. Its variations were listed above, including the zero value discarding its effect. The constant 1 is used to prevent division by zero, and the sum of wx,j serves for normalization. Roughly speaking, contribution(x, y) represents the normalized weight of the citations from x to y with respect to the authors' cooperation. In the case where authors x and y have no common publications, the coefficient cx,y is zero; bx,y is then implicitly zero in alternatives d and e and, by definition, in alternative g. The other alternatives, which assign the bx,y value on the basis of the total number of an author's publications or co-authors, would by their definitions be non-zero even when no common publications of x and y exist. But this non-zero value is not justifiable: there is no reason for one citation to contribute more or less to an author's rank depending on the total number of his publications or co-authors. Therefore, whenever cx,y is zero we assign zero to bx,y as well. When the coefficients cx,y and bx,y are all zero, formula (8) corresponds to the weighted PageRank used e.g. in [11]. Certainly it is possible to devise other formulas to express the influence of the authors' cooperation on citations. The method just described works well, as we will show in the next section; other alternatives and experiments will be investigated in the future.
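A minimal sketch of the contribution function and of one update of formula (8), as reconstructed above; it assumes weight dictionaries w, c and b produced along the lines of the earlier sketch (keyed by citing author, then cited author, with ranks defined for every author), and d = 0.85 as stated in the text.

```python
def contribution(x, y, w, c, b):
    """Formula (9): normalized citation weight from x to y, adjusted by the cooperation factor f."""
    f = (c.get(x, {}).get(y, 0) + 1) / (b.get(x, {}).get(y, 0) + 1)
    return w[x][y] * f / sum(w[x].values())

def modified_pagerank_iteration(ranks, w, c, b, d=0.85):
    """One update of formula (8) over the author-citation graph defined by the weights w."""
    authors = list(ranks)
    new = {v: (1 - d) / len(authors) for v in authors}
    for u in authors:
        if not w.get(u):
            continue                                   # u cites nobody: nothing to distribute
        contribs = {k: contribution(u, k, w, c, b) for k in w[u]}
        total = sum(contribs.values())
        for v, cv in contribs.items():
            new[v] += d * ranks[u] * cv / total        # normalized contribution of u to v
    return new
```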
5. Evaluation
We tested our formula with various alternatives of the function of wu,v, cu,v and bu,v on a bibliographic dataset derived from the DBLP library, which is available in XML format. The version of the collection at http://dblp.uni-trier.de/xml/dblp20040213.xml.gz was used. As in [8], only journal and proceedings papers were extracted. Nearly half a million journal and conference papers were explored. Over eight thousand of them have references, but some of the references point outside the DBLP library. The investigated publication-citation graph has approximately five hundred thousand nodes and around one hundred thousand edges. The derived co-authorship graph was much larger, with around eight hundred thousand nodes (authors plus publications) and one million edges, each of them representing an author-publication pair. The most frequent number of co-authors is two; the average is 2.27. The resulting author-citation graph contains over three hundred thousand nodes and nearly the same number of edges. Fifteen thousand authors were not isolated. There remains the problem of how the ranking method should be assessed. An author's prestige surely depends on citations, but there are many choices, as stated above. Our results should reflect a common human
meaning: they should approximate the opinion of a broad group of professionals in the rated domain. Therefore, we decided to use established ACM honors. The resulting ranks were compared against the sixteen winners of the SIGMOD E. F. Codd Innovations Award from 1992 to 2007. We assumed that the ranks of the winners should be relatively high, so the positions of the winners in our rankings provide an evaluation of the abilities of the formulas used.
6. Results
The rankings obtained with our modified formula were clearly better (relative to the Codd Award winners) than those obtained with the standard PageRank. The sum of ranks, the worst rank and the median rank of the winners were used as indicators of rating quality; the "outlierless" median omits the worst value in each column. Table 1 presents the results. There is a drawback in using a time sequence of award winners for evaluating ranking quality: the "oldest" award winners, as can be seen in Table 1, occupy the best positions in all columns. This can be explained as a "permanency effect": they take advantage of their popularity, i.e. having become more popular and prestigious, they are cited more often. The column labelled "PageRank" shows the results of the standard PageRank formula and serves as a baseline. The next column gives the results of the weighted PageRank; remarkable improvements are obvious. The next seven columns present the results of modifications a to g of formula (9). The best behavior is seen in columns b and c. This is confirmed by the last row as well, which shows the median rank when the worst place is disregarded, a common practice when an outlier could distort the data. The last two columns are included just for reference: the relatively simple in-degree ranking behaves well, and "HITS authorities" in the last column surprisingly outperforms the basic PageRank rankings by a significant margin.
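A minimal sketch of the quality indicators used above (sum of ranks, worst rank, median rank and the "outlierless" median); the ranking and the list of award winners are illustrative inputs.

```python
from statistics import median

def ranking_quality(ranking, winners):
    """ranking: list of authors ordered from best to worst; winners: authors expected to rank high.
    Returns the indicators used in Table 1."""
    position = {author: i + 1 for i, author in enumerate(ranking)}   # 1-based ranks
    ranks = sorted(position[w] for w in winners)
    return {
        "sum": sum(ranks),
        "worst": ranks[-1],
        "median": median(ranks),
        "outlierless_median": median(ranks[:-1]) if len(ranks) > 1 else ranks[0],
    }
```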
7. Conclusion
Graph theory is a traditional discipline originating in the eighteenth century. Its utilization in information network analysis is only a few years old and is being intensively investigated with the expansion of the Web. Novel methods developed initially for Web mining have been recognized as useful and applicable in citation analysis as well. This contribution presented an overview of the most important and recent methods from the field of Web page, article and author citation analysis. We concentrated on the issue of analyzing the network structure in order to find authoritative nodes. The main contributions of our work are modifications of the PageRank equation suited to graphs of citations between publications and collaborations between authors. This enables one to rank authors "more fairly" by significance, taking into account not only the citations but also the collaborations between them. To test this new approach on actual data, we applied our ranking algorithms to a data set from the DBLP digital library and used the methodology of Sidiropoulos and Manolopoulos [8] for ranking comparisons. We compared the author rankings with a list of ACM SIGMOD E. F. Codd Innovations Award winners and found that the new rankings reflected the prize award scheme much better than the baseline, "standard" PageRank ranking. It was not possible to compare our results directly with those of Sidiropoulos et al. because they used a slightly different data set and their method is primarily intended for publications, not for authors. Our experiments showed that adding the aspect of the authors' cooperation to the ranking algorithm improves the rating performance. Nowadays, large electronic libraries offer the best opportunity for ranking scholars, research groups or even whole institutions, from departments to universities. There are many exciting research directions in the areas of bibliometrics, webometrics and scientometrics.
In future research, we plan to continue primarily in the following directions:
• It seems useful to analyze more carefully the sensitivity and stability of the computations with respect to the parameters b, c and w in formulas (8) and (9). Our next aim has to be their more expedient integration into the ranking formula; the one presently used is based only on simple reasoning. Although the standard PageRank has been shown to be relatively stable, the larger number of parameters involved in the calculation may negatively affect this property.
• We expect further improvements and more fair-minded results when time relations between citing and cited items are included in the ranking evaluation. Time stamps are, or at least should be, an ordinary part of bibliographic records, and they may certainly be utilized beneficially. The concept of a "fairer" ranking of researchers based not only on citations but also on collaborations invites the inclusion of the time factor. A citation between two scientists should without any doubt have a different meaning when it is made after their co-authorship of many articles or long before they got to know each other. This enhancement might add even more "justice" to the ranking.
8. Acknowledgement
This work was partly supported by National Research Grant 2C06009.
9. References
[1] Newman M. E. J. The Structure and Function of Complex Networks. SIAM Review, vol. 45, no. 2, pp. 167-256, 2003.
[2] Liu X., Bollen J., Nelson M. L., Van de Sompel H. Co-authorship Networks in the Digital Library Research Community. Information Processing and Management, vol. 41, no. 6, pp. 1462-1480, 2005.
[3] Brin S., Page L. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proceedings of the 7th World Wide Web Conference, pp. 107-117, 1998.
[4] Kleinberg J. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, vol. 46, no. 5, pp. 604-632, 1999.
[5] Fiala D., Tesar R., Jezek K., Rousselot F. Extracting Information from Web Content and Structure. Proceedings of the 9th International Conference on Information Systems Implementation and Modelling ISIM'06, Přerov, Czech Republic, pp. 133-140, 2006.
[6] Page L., Brin S., Motwani R., Winograd T. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66, Computer Science Department, Stanford University, California, USA, Nov. 1999.
[7] Chakrabarti S. Mining the Web: Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann Publishers, San Francisco, California, USA, 2002.
[8] Sidiropoulos A., Manolopoulos Y. A Citation-Based System to Assist Prize Awarding. SIGMOD Record, vol. 34, no. 4, pp. 54-60, 2005.
[9] Kleinberg J., Kumar R., Raghavan P., Rajagopalan S., Tomkins A. The Web as a Graph: Measurements, Models and Methods. Proceedings of the 5th Annual International Conference on Combinatorics and Computing, Tokyo, Japan, Lecture Notes in Computer Science, vol. 1627, Springer, pp. 1-17, 1999.
[10] Hirsch J. E. An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences, vol. 102, no. 46, pp. 16569-16572, 2005.
[11] Bollen J., Rodriguez M. A., Van de Sompel H. Journal Status. Scientometrics, vol. 69, no. 3, pp. 669-687, 2006.
Advancing Scholarship through Digital Critical Editions: Mark Twain Project Online
Lisa R. Schiff
California Digital Library, University of California
415 20th Street, 4th Floor, Oakland, California, United States
e-mail: Lisa.Schiff@ucop.edu
Abstract
Digital critical editions hold the promise of supporting new scholarly research activities not previously possible or practical with print critical editions. This promise resides in the specific ability to integrate corpora, their associated editorial material and other related content into system architectures and data structures that exploit the strengths of the digital publishing environment. The challenge is to do more than simply create an online copy of the print publication, but rather to provide the kind of resource that both eases and extends the research activities of scholars. Authoritative collections published online in this manner, and with the same rigor brought to the print publishing process, offer scholars: the ability to discover more elusive, granular pieces of information with greater facility; tighter, more obvious and more accessible connections between authoritative versions of texts, editorial matter and primary source material; and continually corrected and expanded "editions," no longer dependent upon the print lifecycle. This paper will explore these benefits and others as they are instantiated in the recently released Mark Twain Project Online (MTPO) (http://www.marktwainproject.org), created and published as a joint project of the Mark Twain Papers & Project at The Bancroft Library of UC Berkeley (the Papers), the University of California Press (UC Press), and the California Digital Library of the University of California (CDL). This current release of MTPO comprises more than twenty-three hundred letters written between 1853 and 1880; over twenty-eight thousand records of other letters whose text is not held by the Papers; and nearly one hundred facsimiles; and it makes available the many decades of archival research on the part of the editors at the Papers. Of particular focus in this discussion will be several key features of the system which, despite the many challenges they presented in development, were felt to be essential pieces of a digital publication that could support scholarship in new and significant ways. Those features include facets, which create intellectual structure and support serendipity; advanced search, which provides a means for researchers to apply their own analytical frameworks; citation support functionality, which serves to secure and record the outcomes of research exploration; and complex displays of individual letters, which allow detailed inspection by collocating the pieces of the authoritative object. These features together maintain the integrity and stability of the collection, while concurrently allowing for fluidity in the continued expansion of the material. In this way, MTPO hopes to succeed as a digital critical edition that will support and extend the research activities of scholars.
Keywords: Mark Twain Project Online; digital critical editions; facets; search; citation support
1. Introduction
Digital critical editions hold the promise of supporting new scholarly research activities not previously possible or practical with print critical editions. This promise resides in the specific ability to integrate corpora, their associated editorial material and other related content into system architectures and data structures that exploit the strengths of the digital publishing environment. The challenge is to do more than simply create an online copy of the print publication, but rather to provide the kind of resource that both
eases and extends the research activities of scholars. The vision this challenge speaks to has been compellingly articulated by Robinson [1], an early and consistent explorer in this terrain, with his work on such efforts as the Canterbury Tales Project [2] and the development of tools such as the XML publishing application Anastasia to bring digital editions online, who calls such works essential for the future of humanities scholarship. Authoritative collections published online in this manner, and with the same rigor brought to the print publishing process, offer scholars: the ability to discover more elusive, granular pieces of information with greater facility; tighter, more obvious and more accessible connections between authoritative versions of texts, editorial matter and primary source material; and continually corrected and expanded "editions," no longer dependent upon the print lifecycle and able to take advantage of a larger pool of knowledge found in both the scholarly and lay communities. D'Iorio [3] provides a dramatic example of the value of such possibilities in his account of the impact of a mistranslated word in a collection of aphorisms by Nietzsche, a mistranslation which he argues undercuts an interpretation of Nietzsche's concept of the "will to power" by the influential philosopher Deleuze [3 p.2]. Had the original manuscripts been available to the scholarly community long before, this translation error might have been noticed due to the presence of so many more individuals examining the facsimiles and their associated transcriptions, translations and analysis. This paper will explore these benefits and others as they are instantiated in the recently released Mark Twain Project Online (MTPO) [4], created and published as a joint project of the Mark Twain Papers & Project at The Bancroft Library of UC Berkeley (the Papers) [5], the University of California Press (UC Press) [6], and the California Digital Library of the University of California (CDL) [7]. This current release of MTPO comprises more than twenty-three hundred letters written between 1853 and 1880; over twenty-eight thousand records of other letters with text not held by the Papers; and nearly one hundred facsimiles; and it makes available the many decades of archival research on the part of the editors at the Papers. MTPO draws from 30 volumes of previously published material, including the critical apparatus created by the Mark Twain Papers, which have been encoded in XML according to an extended version of the Text Encoding Initiative P4 (TEI P4) [8] customized by technologists at the Papers just for this collection. The metadata gathered by the Papers has also been generated by them according to the XML-based metadata standards METS and MADS. Together the TEIs, METS and MADS served as the primary inputs to the MTPO system, an implementation of CDL's eXtensible Text Framework (XTF), a robust and flexible platform for providing search and display solutions for collections of digital content. UC Press has long been the publisher of the Papers' critical editions and serves in the same capacity by offering their imprint for the digital critical edition. Of particular focus in this paper will be several key features of the MTPO system which, despite the many challenges they presented in development, were felt to be essential pieces of a digital publication that could support scholarship in new and significant ways.
Those features include faceted browsing, advanced search, citations, and complex displays of individual letters.
2. The State of Digital Critical Editions
Critical editions, digital or otherwise, are the authoritative texts that textual editors produce and that other scholars use for their work. The material difficulties and the philosophies and approaches towards creating such texts continue to be widely debated, as can be seen in the works of both central participants such as Tanselle [9] and new entrants into the profession such as Gunder [10]. While such theoretical matters are significant, they are not germane to the discussion at hand, which is focused on how one produces such a digital edition in a way that accurately reflects the philosophy of its textual editors, while presenting itself usefully and meaningfully to the literary scholars that come after. For this discussion then, a very simple definition of the notion "digital critical edition" can suffice: specifically, a coherent collection of content
which has been assembled by qualified textual editors and that serves as the authoritative version for scholars wishing to work with original texts.
2.1 The Goals and Requirements of Digital Critical Editions
While digital critical editions should adhere to the high standards of the practices of textual editors, they need not, and indeed should not, simply replicate the printed volumes which have preceded them. The digital medium offers different constraints and opportunities, which must be openly acknowledged and addressed. Hillesund [11] has described the difficulties and possibilities presented at this moment when, despite our immersion in the Web, we are still quite grounded in print processes and technologies, so much so that many of our digital efforts are merely electronic re-presentations of print works, drawing from the same workflows and perspectives that go into making print publications. Hillesund argues that we need to shift from such a "print text cycle" to a "digital text cycle", in which "texts are produced, distributed and read with the aid of computers, networks and monitors in a predominantly digital environment." He identifies two hallmark attributes of the digital text cycle: the separation of the storage of the text (or data) from its manifestations (increasing the likelihood of a variety of presentations) and a significantly increased amount of digital reading, as opposed to print reading. The qualities embedded in Hillesund's notion of the digital text cycle quickly become obligations for digital critical editions, appearing as goals that those working in the field aspire to attain and requirements that those awaiting such texts expect to be met. The first of these is easy access to more original manuscripts. Along these lines, Tanselle [9] reasonably insists that a "hypertext" scholarly edition is "inadequate if in addition to transcriptions and editorial matter, it does not offer images of the original documents, both manuscript and printed. Important physical evidence will obviously still be unreproduced, but at least the range of paleographical and typographical evidence made available will be far greater than has been customary in editions of the past—even in "facsimile" editions, which have usually been limited to single documents." Tanselle even goes so far as to point to the need for regenerated texts, via the assemblage of bits and pieces: "Indeed, the point can be made more positively: that critically reconstructed texts ought to be included within the collection of texts available in a hypertext edition. Readers can of course make their own choices among variants, using whatever bases of judgment they wish, just as they have always been able to do with other forms of apparatus—though with hypertext they can more easily produce a smooth reading text of their own construction." The availability of original documents, and especially of the types of recreations Tanselle calls for, leads to a second requirement and goal, which is the wider exposure of the editorial work to a broader scholarly community (and the well-informed lay community as well), intensifying the gaze on all aspects of the text, thereby increasing opportunities for corrections of transcription and translation errors, and the bridging of gaps created by incomplete collections of materials.
D'Iorio [2] gives a resoundingly convincing argument of the need for digital critical editions in his gloss of the problematic history of a non-existent work attributed to Nietzsche, The Will To Power, in which he describes the various publications of this constructed work, the errors of translation and transcription which they include, and the resulting problematic scholarship, as mentioned earlier. D'Iorio's argument is that much of this could have been avoided through access to the original manuscripts. We can also add that access to the trail of scholarship for which these documents have served as the genesis would also have greatly enhanced the conversation about this text. Continual accretion of the edition by the entire community of interested parties is yet another goal and expectation theoretically more feasible with a digital critical edition. For instance, D'Iorio's vision [2, p.4] with the HyperNietzsche [12] project is of an edition that allows for correction and debate over primary sources, but that also serves as a focal point for contributors of vetted scholarly analysis: "...one may consider HyperNietzsche as the integration of a public archive, which
allows free access to primary sources; a public library, which allows free access to critical editions and other scholarly contributions; and a non-profit academic publisher with a prestigious editorial board and rigorous procedures of peer review." The significance of expanding participation within a scholarly community of interest is so great that it has been taken on (in addition to other goals) in an EU initiative called Interedition [13], which aims to create an "international digital infrastructure for scholarly editorial work." Certainly these opportunities are significant for MTPO, as the Papers receive and acquire new letters in a continual stream, a flow which they wish to extend to include a regular incremental dissemination of such texts once the editorial process has been completed. The goal is to move further into the "digital text cycle" realm and build an infrastructure that allows the digital collection to grow and be made available at the pace at which the work on small groups or even individual letters is completed, as opposed to having to wait until a print-volume-worthy number of texts is ready for public consumption.
2.2 The Challenges of Digital Critical Editions
The opportunities, goals and expectations of digital critical editions are not without their challenges. The constraining effect of print technology on how we imagine digital publications has almost become a trope. More interesting are the ways in which particular pieces of the work of producing digital publications (critical editions and other works) act in this liminal space. Hillesund [11] provides a very compelling analysis of the constraining effects of a backbone technology for digital publishing, XML, showing that while it achieves the goal of separating storage and representation, it can be seen as a limiting transitional technology as it is so frequently tied to print displays, especially in such standards as TEI: "This promotion of XML structures will have the paradoxical (and obviously unintended) consequence that conventions of print will dominate digital publishing for a long time, especially the parts based on cross-media publishing and single sourcing. These production workflows will lead to print-based content structures being contorted to fit new media, while new genres which exploit the potential of digital media will not be developed." Robinson [1] provides a different perspective on the challenges to digital critical editions, locating the major obstacles in the lack of usable tools and the unwillingness of major publishers to put forth these works. Formulated as goals instead of shortcomings, Robinson argues: "Our goal must be to ensure that any scholar able to make an edition in one medium should be able to make an edition in the other. Further, that an edition in either medium should be equally assured of appropriate distribution: just as once a library has bought a print edition it can be used by any member of the library for years to come, so too should it be for electronic editions." The call for tools has been both echoed and responded to by those involved with such systems as NINES [14] and the related Collex system [15], which, as described by one of the developers, Nowviskie [16], offers a standard platform for producing scholarly editions and end-user oriented tools for working with the resultant collections. Many have taken on the challenge to produce editions that move us along towards achieving the goals described above. Many notable examples of digital critical editions currently exist, particularly excelling at the provision of access to manuscript facsimiles. The most recent and perhaps most popularly known of these is the Charles Darwin site [17], which received over 7 million hits to its website on the day of its public release, April 17th, 2008. A selection of other examples that focus on facsimiles includes the Rossetti Archive [18], the Walt Whitman Archive [19], the Blake Archive [20] and HyperNietzsche [12]. In addition to addressing the problems involved as producers of digital critical editions, a major area of
concern must also be with the usability of such works on the part of the scholar or lay researcher. This concern resonates with Nowviskie's [21] identification of the need for a theory of interface for critical and scholarly editions, noting that little thought has been given to "editions as interface, though, because we as readers are so accustomed to jacking into the codex." While this arena has received less attention, it is a lynchpin in the success of such works. Some sites that have taken on the challenge of prioritizing the end-user experience (with or without an underlying theory of that interface) include the NINES-based sites which use Collex and the Carlyle Letters Online [22]. These publications have been experimenting with end-user tools, such as facets, tag clouds, advanced search functionality and support for citation activities, tools widely available in other online arenas. Similarly, interface was a motivating factor in many of the decisions regarding MTPO, as will be shown in the discussion that follows.
3. Mark Twain Project Online
The Mark Twain Project Online is the result of an effort to meet the challenges described above, an attempt to provide a scholarly resource that would significantly enhance the research activities of those interested in Mark Twain. The original vision for the complete works of Samuel Clemens (SLC) online began in 2001, when members of the Papers realized digital editions were a requirement of the future [23] and soon began the process of having the texts encoded in an extended form of TEI P4 [24]. At this point, a natural collaboration formed between the Papers, their publisher UC Press, and the CDL, in which each party contributed in distinct ways to the achievement of the aforementioned goal. The interface, known in CDL parlance as the "User Experience" or "UX," became a defining aspect, a touchstone for decision making regarding feature selection, representation of information, and content selection. Developing the UX was a lengthy group process that involved balancing differing allegiances to various aspects of the project, namely the purest reflection of the material and the editorial work that had gone into producing the edition; the intuitiveness and ease of the interface; and the overall production qualities of the design and content scope. A year and a half later, the beta version of MTPO was launched, complete with the components the collaborative team deemed essential for successfully using MTPO. Those components included: 1) faceted browsing functionality, in order to provide a stable structural view that would allow users to easily grasp the scope and nature of the collection even as it grew transparently over time and that would also support purposeful exploration and provide opportunities for serendipitous findings; 2) robust advanced search functionality providing support for general searching activities as well as precisely formed research tasks, allowing the user to apply his or her own analytical framework to the collection; 3) citation support, by allowing users to easily generate and collect citations at useful levels of granularity; and 4) holistic presentations of letters, providing integrated access to interdependent layers of the text (e.g. transcriptions, facsimiles, notes and apparatus entries). Each of these components will be discussed below.
3.1 Essential Features: Facets
Facets are the various dimensions of a piece of content, such as, in the case of a letter, the sender, receiver and date of composition. When facets can be defined across an entire set of documents, as in a collection of letters, they provide an effective means for users to peruse a large corpus, browse through a number of items across one or more dimensions, and, alternately, hone in quickly on a specific slice of a collection. Although facets are commonly discussed from the perspective of end-users, the notion was originally developed by the librarian S. R. Ranganathan [25] in the early 1930s as a more effective way for catalogers to describe the subjects addressed within a document. Ranganathan's approach was powerful in that it advanced a composite approach to describing a document, an approach in which individual concepts are
identified and then coexist as a set providing a richer description of the document than can be achieved by having to choose among pre-grouped concepts, adhering to a more brittle classification scheme. This more atomistic, dynamic approach allows for specificity and agility by allowing for terms to be combined as needed, as opposed to in advance of use. Ranganathan’s conception of facets (and that of those who followed him, namely Bliss [26]) was still a highly formalized structured, which has of late been overtaken by a more organic approach as an increasing amount of content is being provided digitally and thus the demands for creating more effective descriptions in order to promote discoverability are breaking down that formalism. The power of facets has been discovered by a range of people, from ecommerce sites, hoping to more effectively pair goods and consumers, to information retrieval scientists in their continued effort to improve our ability to find “relevant” information from large masses of content. Facets are useful along this spectrum because they provide improved intellectual access to such objects by breaking out discrete, meaningful attributes, making it easier to both to describe and discover content. In addition, they are flexible and adaptable in that facets can be added (or eliminated) as they are deemed useful dimensions on which to describe a collection and because they support multi-dimensionality as they can be combined as required for both the descriptive and discovery tasks in which we work with objects. Contemporary discussions of facets are frequently focused on automatically generating close approximations of handcrafted facets through activities such as clustering (Hearst [27]) and on how to effectively present facets in interfaces to better support the activities of end users in discovery. Hearst’s Flamenco browsing system [28] has shown that facets, particularly hierarchical facets support browsing and unstructured exploration of large collections, allowing for serendipity in a way that searching does not. The Flamenco system not only presents facets but allows users to move up and down a ladder of specificity within any dimension, using a hierarchy within each facet where appropriate. In another application of Flamenco, Yee et al [29] have demonstrated that hierarchical facets are not only useful for discovery of textual objects, but of images as well. The significance of hierarchical and interdependent facets as opposed to simply flat “bucket” categories has also been discussed by Ben-Yitzhak et al [30] in work on the use of faceted “search” that supports the complex, enterprise decision making. Facets are powerful, but that power is dependent upon the quality of the metadata that populates any given dimension, the degree to which that metadata is available across an entire collection and how clearly it is displayed. Additionally, although facets can seem intuitive if built properly, their use and manipulation can become complex putting a serious burden on the interface to clearly guide the user. For MTPO, facets were essential as the only way users would be able to explore such a large volume of texts and records. The key dimensions that were ultimately built were: letter direction (i.e. whether the letter was sent or received by Clemens); availability (i.e. whether the transcription and/or facsimiles were online); name (i.e. 
named persons in the "envelope" of a letter; date written (when the letter was written); the place of origin; the repository holding the original letter; and print volumes in which the letter is referenced. Using the CDL's eXtensible Text Framework (XTF) [31] as its backbone enabled MTPO to be built with facets that are both hierarchical where appropriate (with dates, for example) and interdependent, meaning that as a user makes a choice within one facet, the possible set of values within all other facets is updated as well. For instance, if a researcher chooses "Elmira, NY" from the "Place of Origin" facet, not only are the letters presented in the results pane only those letters originating from Elmira, NY, but the remaining choices within the facets are refined appropriately, so the people that can be selected in the "Name" facet are now only those people who are associated with letters that originated in Elmira, NY, and so on. The challenges in providing such a faceting feature lay not in the technology of the XTF application, but rather in questions of consistency in the data (because unlinked variations of names or other data will simply appear as one more choice, and will not return to the user all the associations he or she might
expect), but more importantly in: 1) determining formulations of that metadata that were acceptable to the editors of the Papers, feasible to create and work with for the technologists on the team, and meaningful to end users; and 2) identifying where and how in the interface users would explore and manipulate the facets. The seemingly innocuous example of the "date written" facet serves as a good illustration. In this case, the goal is to allow the researcher the dimension of time in which to examine the letters, supporting activities such as using dates to find specific letters, or letters within certain time periods, or to combine questions of date of writing with other aspects, such as developing an understanding of how Mark Twain's correspondents changed over the course of time. Providing an ability to ask and get answers to such questions involved first ensuring that the data actually was available for all letters where possible, devising solutions for those letters for which there was no date, and secondly maintaining the editorial integrity of the date information by coming to agreement upon a notation that could also display reasonably well. This second challenge is a good example of the work that goes on in this intervening space between the print and digital text cycles. Taking an individual letter, such as SLC's letter of February 13, 1869 to his future mother-in-law [32], as a case study, the editors at the Papers recorded this date as "1869.02.13", which was converted in the METS metadata record to an ISO 8601 compliant date format [33] "1869-02-13", but is presented in the print volumes as "13 February 1869." The first formulation is not end-user friendly and the third formulation does not fit into the narrow facet sidebar of the website, nor does it lend itself to easy visual parsing by end users for narrowing down by decade, year, month and day. Thus while the canonical date form was recorded in the metadata, it had to be transformed. This meant agreeing on how date information would be stored in the metadata so that the indexing and display software could distinguish it from other date information in the metadata. It also required coming to agreement on the ultimate representation of that data, specifically month abbreviations, the use of periods, and the ordering of each of the year, month and day components that together make up a complete date. A secondary challenge with facets is how to indicate to users when they have made a choice, which choices they have made, and how to back out of those choices. This interface difficulty is created by a combination of limited screen real estate and likely visual overload. Our final solution, arrived at through a detailed heuristic analysis by the CDL's User Assessment team, involved several pieces of functionality and some specific visual cues, including: ensuring that the list of letters presented in the main frame of the site reflects the result of the current set of facet choices; presenting selected terms within facets in a line of labeled boxes (one per chosen facet) just above the main frame of results, each of which could be independently removed by clicking on an "X" box; greying out and italicizing the chosen value within a given facet as it appears in the left sidebar in the facet area; and providing dynamically updated hit counts for all values each time a facet choice is made.
The ultimate effect has been to make a close visual link between the user's choices and the subsequent letters that appear in the result set in the main pane. Testing with users indicated that individuals understood how to use these tools and that the resulting sets of letters were what they expected.
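By way of illustration only (this is not the actual MTPO or XTF code, and the function name and field choices are hypothetical), a canonical ISO 8601 date of the kind discussed above can be expanded into hierarchical facet values and a compact display form along these lines:

```python
from datetime import date

def date_facet_values(iso_date):
    """Illustrative sketch: derive hierarchical facet values (decade / year / month / day)
    and a compact display form from an ISO 8601 date such as '1869-02-13'."""
    d = date.fromisoformat(iso_date)
    return {
        "decade": f"{d.year // 10 * 10}s",
        "year": str(d.year),
        "month": d.strftime("%b"),                      # abbreviated month for the narrow sidebar
        "day": str(d.day),
        "display": d.strftime("%d %b. %Y").lstrip("0"),  # e.g. '13 Feb. 1869'
    }

# Example: the 13 February 1869 letter discussed above.
print(date_facet_values("1869-02-13"))
```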
3.2 Essential Features: Advanced Search
Search activities are becoming increasingly simplified as online applications strive to provide the stripped-down Google experience that so many people now expect. For some academic work these searches may be sufficient, especially for those who are newly exploring a discipline and attempting to map out an unfamiliar terrain. Such an approach can also be strewn with invisible difficulties, as significant work within a given area may not appear in results generated by commercial search engines, a likelihood increased by a lack of knowledge of key concepts and central individuals. For more advanced scholars who bring more depth and breadth to their research, search tools must be calibrated more finely to support more
precision in working through a collection. Such precision offers scholars search results that are more likely to be relevant to their research interests as opposed to having a mere coincidental relation to each other. The issue of expertise extends to both knowledge of a domain and facility with online discovery tools, of which search tools are no doubt the most heavily used. Wildemuth [34], in a study of the efficacy of searches by medical students as they progress through their education, has shown that domain expertise is tied to the ability to form more precise, complex searches, thereby producing more satisfactory search results. In a study that adds another dimension to the question of expertise, Hsieh-Yee [35] has shown that not only does search experience trump domain knowledge in effective search practices, but that search experience is a key to effectively employing domain expertise, finding that novice searchers were unlikely to use their subject knowledge (as expressed by synonyms, etc.) to improve and enhance searches, whereas those individuals with expertise in both the field in question and online searching did avail themselves of their domain knowledge. In a confirming study, Bhavnani [36] observed individuals with online skills and expertise in healthcare or online shopping search both in and outside their areas of knowledge, revealing that domain expertise combined with search expertise yields better results, particularly because of the ability to effectively use domain specific tools. One explanation for such findings is that naïve searchers presumably do not have a good grasp on how the systems that they are searching actually work, and therefore are unable to usefully bring their domain knowledge into play. Palmer [37] notes that humanities scholars gravitate towards indexing tools within their subfields, but need help finding online search tools that are robust and precise in their fields. Turning this around, digital tools built for humanities scholars must at a minimum provide the same kind of utility that scholars rely on with print resources. These findings presents a compelling argument for translating the intellectual structure of content within a domain (e.g. “text,” “editorial notes,” etc. in the case of literary scholarship) in a way that is obvious to knowledgeable users within that field and learnable on the part of those with little or no domain expertise. With this body of knowledge pointing the way, the MTPO team faced the challenge of providing researchers with a search framework appropriate to their discipline and that could accommodate, and perhaps nurture, a range of online searching skills. We wanted to allow scholars to distinguish among the identifiable pieces of the edition they would need to search independently without overwhelming any user with overly precise, lengthy lists of options. The difficulty resided in determining how to translate those distinctions into a managable set of meaningful “content slices” that would fit on a small search form, and then how to implement the search technology to support the complex queries users would want to create. Providing search access to the essential descriptive fields of the letters (i.e., the “envelope” attributes of sender, address, and place of origin) was a clearcut decision, particularly since it provides users the opportunity to distinguish between sender and addressee, something not possible by browsing with the facets. 
However, identifying distinct components of the edition was much more difficult, becoming again a balancing act between adhering to editorial integrity, which meant specificity, and usability for both literary (as opposed to textual) scholars and lay enthusiasts, which meant larger groupings. Ultimately this required grouping very well defined pieces of the editorial matter into somewhat larger buckets. For instance, the editors were clear that Clemens' own text needed to be distinctly searchable, apart from any of their own work. In addition, as part of maintaining and exemplifying the standards of textual scholarship, the distinction between the "Explanatory Notes," which provide contextual information, and the "Textual Apparatus," which provides information such as provenance and emendations, also needed to be retained, as each offers a unique set of data to which scholars of all sorts would require separate access. An example can demonstrate how these "content slices" serve the needs of the end user (expert or lay) while addressing and validating the concerns of the editors. A scholar who was interested in knowing more about the physical condition of the letters might want to know about certain types of damage, for example which letters were known to have been torn or to have sections missing, a physical
attribute that would be captured by an archive. A simple keyword search on the word "missing," which searches a combination of all metadata fields, original text, and editorial text, returns 143 results, too many to look through in a reasonable amount of time. Searching on that same term in the various content slices discussed above demonstrates how the term appears with different meanings depending upon what is being searched. Looking for the term within the original texts retrieves 56 letters, including both letters with inline notes from the editors indicating missing portions and letters such as Clemens' 17 January 1869 letter to his wife Olivia, in which he notes that "Your Iowa City letter came near missing—it arrived in the same train with me." [38] This use of the word is, of course, entirely different from what the researcher in this scenario would be interested in. Further, while 56 letters is a more manageable set than 143, it is still quite a lot to work through. Searching only within the Textual Apparatus returns a more tractable 27 letters, which when examined include exactly the sort of letters pertinent to this scenario, as reflected in such critical notes as this, from an October 1865 letter from Clemens to Orion and Molly Clemens [39]: "All four leaves are creased and chipped, especially along the right edge where one crucial fragment is missing (324.34)." While the facets of MTPO discussed in the previous section present a structure in which researchers can browse the edition, the advanced search framework, by separating the various components of a critical edition, provides scholars the ability to uncover information related to their more individually tailored concerns. Serendipitous findings are possible in both settings, but a primary goal of such a search tool is the support of targeted exploration of a collection.
3.3 Essential Features: Citation Tools
Any scholarly artifact must be citable. Without the ability to reliably refer to a specific "location" that others can independently access, the conversation among scholars is halted. Print material and citation structures are closely married, as evidenced by the numerous standards for referring to any manner of printed item. With the rise of online primary and secondary source material, conventions for referring to resources on websites are beginning to be established, as indicated by the Modern Language Association guidelines, which insist that "Authors of scholarly writing on the Web should number paragraphs or other sections in texts of significant length so exact locations may be cited." [40] Accompanying the greater frequency of online sources requiring citation has come the development of tools to ease the generation and management of those references. These tools range from informal folksonomy tagging tools such as "del.icio.us" [41] to more formal but still web-based, web-friendly bibliographic management systems such as CiteULike [42] and Zotero [43] that can produce citations adhering to well known standards such as The Chicago Manual of Style [44] or the MLA Handbook for Writers of Research Papers [45]. Palmer [46] explored the issues presented by bibliographic web services in the development of his CiTEX citation management tool, identifying the importance of persistent identifiers such as DOIs and Handles and the need for more work to eliminate tedious copying and pasting. Similarly, Hitchcock et al. [47], in their work on the Open Citation project, discuss how in the web environment there are increasing possibilities for citations to become actionable reference links, positing that such a feature will come to be seen as essential to scholarly publication and communication. Bier et al. [48], who have taken web-based citation management one step further by building a document-centric research support tool, argue that citation extraction and management is central to supporting the overall research efforts of scholars, which include such activities as in-depth reading, following reference chains, and tracking which documents have or have not been read. Indeed, many of the existing digital critical editions provide citation support at the top level of the object: HyperNietzsche focuses on persistent URLs, with a predictable URL structure for referring to page numbers within a text; the Carlyle Letters supports output of citation data in the formats required by a number of standard citation managers (e.g. EndNote and BibTeX); and Collex/NINES supports
citations of the Collex-based search results. None of those conventions or approaches, however, is sufficient for dealing with digital texts that lack attributes transferred from print, such as page numbers, or that are likely to be modified in a non-linear manner as new material comes to light, thus eliminating the possibility of stable paragraph or line numbers. The MTPO team was therefore confronted with the need to support citations for individual texts that would likely expand but whose previous components would remain stable. An additional concern was the issue of monograph-length works that had no usable markers from the print world, leaving only the most easily attainable citations for individual chapters or the entire work, either of which was unacceptable in that they are at too coarse a level to support scholarly arguments in a comfortable fashion. MTPO addressed this complexity by developing the notion of "citable chunks," which meant determining which encoded chunks, identifiable automatically by the system, should have an associated citation widget that, when clicked, would automatically generate a reference that included a persistent URL to that exact location. Enabling this involved determining at which level such chunks made sense: from the perspective of scholars who would want to use them; from the perspective of the technology pieces that would have to come together to generate the citations; and from the perspective of end users who would have to understand how to use the citation tool and not have it interfere with their work with the texts. Because MTPO's first release includes letters as opposed to the longer works, the decisions were somewhat clearer. For the letters, users can click on an icon to gather a citation for a letter in its entirety and can do the same for editorial notes at the individual note level. In the next phase of MTPO, the infrastructure is in place to support citations at the paragraph level, which could provide useful precision for citing scholars, but could also introduce some interesting interface challenges, as one imagines a page of dialogue strewn with citation widget symbols at the beginning of nearly every line. Still, as there are ways to address this issue (e.g. enabling users to turn the citation widgets on and off), our tendency at the moment is towards supporting the greatest degree of citation precision, as this seems to be a significant advantage of a digital publication. Generating citations is only the first part of the equation, especially for original sources, for which the particular citation structure may not yet exist or, even if it does, is not as easily remembered as that of the traditional (online or offline) journal article. For MTPO, for instance, the editors devised citation structures for the various types of texts that could be referenced, ranging from the letters themselves to records of letters to editorial notes. As the development of new web-based reference management tools indicates, in addition to having access to acceptable citation structures, scholars need to be assisted in tracking what items they are interested in, either before or after having read them. To that end, MTPO includes the ability to gather citations of interest to a web page within the system (called "My Folder") that is available during any individual session. Citations can be added and removed, seen in a compact or full format, and emailed to oneself or any other interested party.
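As a rough illustration of a "citable chunk", the sketch below assembles a citation string with a persistent URL for a letter or an individual editorial note. The URL pattern is modelled loosely on the MTPO letter URLs cited in the references; the fragment identifier for a sub-letter chunk and the citation wording are invented for illustration and are not MTPO's actual formats.

    # Illustrative only: build a citation with a persistent URL for a
    # citable chunk. The anchor for a sub-letter chunk is hypothetical.
    BASE = "http://www.marktwainproject.org/xtf/view"

    def cite_chunk(doc_id, chunk_id=None, style="letter"):
        url = f"{BASE}?docId=letters/{doc_id}.xml;style={style};brand=mtp"
        if chunk_id:
            url += f"#{chunk_id}"   # hypothetical anchor for a note or paragraph
        return f"Mark Twain Project Online, {doc_id}. Available at: {url}"

    print(cite_chunk("UCCL00249"))                       # a whole letter
    print(cite_chunk("UCCL00249", chunk_id="note3"))     # a single editorial note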
Offering tools to help researchers as they work with the texts is yet another distinguishing feature that can set digital critical editions apart from their print-based antecedents. At this point, development teams have the option of supplying those tools within the sites they are building, as MTPO chose to do, or of integrating with generic reference management tools. The best choice, of course, is to make no choice at all, but rather to build sites that produce appropriate citation structures, offer good tools for scholars to use while doing their research on the site, and can also be used with the predominant tools in existence, which is where MTPO hopes to be in future phases.
3.4 Essential Features: Complex Object Views
MTPO, like other digital critical editions, faced the challenge of how to represent the overlapping layers of detailed, precise information that together make up a discrete work within the edition. For MTPO, a letter was the defining object in question, and the pieces that needed to be made available and put under the control of the reader qua user were transcriptions, facsimiles, editorial notes, the textual apparatus, and other related editorial material such as the descriptions of the text and provenance, and the guide to editorial signs. Similar to the manuscripts presented in Gallica Proust [49], letters are presented in two panes, left and right. In MTPO, the larger left pane always contains the transcribed text, which is consistently available for any letter for which there is text. The narrower right pane defaults to a facsimile, if one exists, and if not, to a display of the Editorial Notes and Textual Apparatus associated with that letter. For letters with a facsimile, users can magnify the image for closer inspection and can also select a link to switch back and forth in the right pane between the manuscript image and the editorial matter. In the notes view, when users glide the mouse over the transcribed text, notes and apparatus entries associated with that text are shaded in the right-hand pane. If a user begins to read one note and then decides to read further, clicking on an editorial note will sync up the reading pane with the portion of the text associated with that note or entry and, for the apparatus entries only, will additionally shade the portion of the text to which the apparatus entry applies. The independent movement of each of the panes supports the well-known research habit of perusing beyond the original note of interest, yet another way serendipitous research behaviours can be supported in the online environment. Additionally, users can call up in a secondary window a "print view" of the letter, with all of the Explanatory Notes and Textual Apparatus entries appended to the end. Finally, from the letter view, a researcher can add a citation for the letter currently on screen to the "My Folder" page and can also wander backwards or forwards along the editorially determined sequence of letters. Such functionality transfers as much control over the reading experience as possible to the reader. All the components of the edition are made apparent and available to the user whenever she may require them, without having that information overwhelm the central text itself.
4. Conclusion
Scholarly practices are being transplanted, re-imagined, and newly developed as the digital environment expands in both depth and breadth, becoming a richer arena in which scholars can discover and work with primary and secondary resources. Researchers are gaining a better understanding of how to use existing tools to work within this dimension and, at the same time, new tools and infrastructure are being dreamed up and crafted to aid them. According to Hillesund, the digital text cycle has not wholly overtaken the print text cycle, but advancements toward that end are being made. Digital critical editions such as MTPO are examples of the transitional works and associated embedded tools and practices that represent significant movement towards that new paradigm. Although the data the edition publishes online was originally created for print presentation, it is now stored in such a way that many alternate presentations and uses are possible, as evidenced by the publication of MTPO itself. As an original online edition and not simply an electronic print reproduction, MTPO offers unique access to a rich, dynamic collection of authoritative material that provides scholars easier entry into that material from many vantage points, easier means by which to closely inspect it, and easier methods for recording the trail of that investigation. Facets and advanced search together provide a rich set of ways for researchers to explore and discover the material and information in which they are or might be interested. By providing the ability to browse the collection via elemental dimensions and by exposing the structural elements of textual scholarship, MTPO supports both serendipitous discovery and more precise research strategies.
Bringing these approaches together, by allowing individuals to avail themselves of that overall structure to further winnow down their findings as they traverse the collection according to their own terms of interest, MTPO supports the dynamic and creative aspects of research that draw on the expertise of scholars within a field. By presenting a visually manageable, interactive composite view of the many layers of information which together comprise an individual work (i.e. letter) within the edition, MTPO provides a richer and easier environment for detailed inspection of the content. Finally, by creating automatically generated, reliable citation formats for the unique materials in the edition, and a means for gathering those citations and saving them (through emailing them), MTPO provides a means to safeguard and track the results of valuable, often irreproducible, research efforts. To briefly recapitulate: facets create intellectual structure and help to maintain the persistent integrity of the edition as it appears to users over time; advanced search supports serendipity and provides opportunities for researchers to apply their own frameworks; citations are the outcome of these together and serve to secure and record those outcomes; and complex displays of objects allow detailed inspection by collocating the pieces of the authoritative object. Together these features maintain the integrity and stability of the collection, while concurrently allowing for fluidity in the continued expansion of the material. In this way, MTPO hopes to succeed as a digital critical edition that will support and extend the research activities of scholars.
5. Acknowledgements
A great many people contributed over several years to the ultimate publication of MTPO, a complete listing of whom can be found on the site [50]. However, a core project team, of which I was a member, worked very closely together during the year and a half leading up to publication. I would like to directly acknowledge those individuals, as the result of their work is what made this paper possible: Laura Cerruti, Erim Foster, Sharon Goetz, Benjamin Griffin, Kirk Hastings, Catherine Mitchell, and Leslie Myrick.
6. Notes and References
[1] ROBINSON, Peter. Current issues in making digital editions of medieval texts - or, do electronic scholarly editions have a future? Digital Medievalist [online]. Spring 2005, vol. 1, no. 1 [cited 28 April 2008]. Available from Internet: <http://www.digitalmedievalist.org/journal/1.1/robinson/>. ISSN: 1715-0736.
[2] http://www.canterburytalesproject.org/
[3] D'IORIO, Paolo. Nietzsche on New Paths: The HyperNietzsche Project and Open Scholarship on the Web. [online]. [cited 1 April 2008]. In FIORINI, MC. and FRANZESE, S. Friedrich Nietzsche. Edizioni e Interpretazioni. Pisa: ETS, 2006. Available from Internet: <http://www.hypernietzsche.org/doc/doc.html>.
[4] Mark Twain Project Online [online]. Edited by FISCHER, Victor, FRANK, Michael B. and SMITHER, Harriet Elinor. Berkeley: University of California Press, 2007 [cited 28 April 2008]. Available from the Internet: <http://www.marktwainproject.org>.
[5] http://bancroft.berkeley.edu/MTP/
[6] http://www.ucpress.edu/
[7] http://www.cdlib.org
[8] www.tei-c.org/
[9] TANSELLE, G. Thomas. Critical Editions, Hypertexts and Genetic Criticism. The Romantic Review.
1995, vol 86, p. 581-593.
[10] GUNDER, Anna. 2002. Forming the Text, Performing the Work - Aspects of media, navigation, and linking. Human IT [online]. February-March 2001 [cited 26 April 2008]. Available from Internet: <http://www.hb.se/bhs/ith/23-01/ag.htm>. ISSN: 1402-150X.
[11] HILLESUND, Terje. Digital Text Cycles: From Medieval Manuscripts to Modern Markup. Journal of Digital Information [online]. 2006, vol 6, no. 1 [cited 26 April 2008]. Available from Internet: <http://journals.tdl.org/jodi/article/view/jodi-164/65>. ISSN: 1368-7506.
[12] HyperNietzsche. [online]. [cited 1 April 2008]. Available from Internet: <http://www.hypernietzsche.org>.
[13] http://interedition.huygensinstituut.nl/?page_id=2
[14] http://nines.org
[15] http://nines.org/collex
[16] NOWVISKIE, Bethany. A Scholar's Guide to Research, Collaboration, and Publication in NINES. Romanticism and Victorianism on the Net [online]. August 2007, vol 47 [cited 5 May 2008]. Available from: <http://www.erudit.org/revue/ravon/2007/v/n47/016707ar.html>. ISSN: 1916-1441.
[17] The Complete Work of Charles Darwin Online. [online]. Cambridge: University of Cambridge, 2008 [cited 28 April 2008]. Available from Internet: <http://darwin-online.org.uk/>.
[18] The Complete Writings and Pictures of Dante Gabriel Rossetti. [online]. Edited by McGANN, Jerome J. Charlottesville: IATH, in production [cited 28 April 2008]. Available from Internet: <http://www.rossettiarchive.org>.
[19] The Walt Whitman Archive. [online]. Edited by FOLSOM, Ed and PRICE, Kenneth. Lincoln, NB: Center for Digital Research in the Humanities, 15 February 2007 [cited 28 April 2008]. Available from Internet: <http://www.whitmanarchive.org/>.
[20] The William Blake Archive. [online]. Edited by EAVES, Morris, ESSICK, Robert N. and VISCOMI, Joseph. 13 November 1997 [cited 28 April 2008]. Available from Internet: <http://www.blakearchive.org/>.
[21] NOWVISKIE, B. Interfacing the Edition. 2000 [cited 5 May 2008]. Available from Internet: <http://www.iath.virginia.edu/~bpn2f/1866/interface.html>.
[22] The Carlyle Letters Online. [online]. Edited by KINSER, Brent E. [cited 28 April 2008]. Available from Internet: <http://carlyleletters.dukejournals.org/>.
[23] The Making of Mark Twain Project Online. Mark Twain Project Online [online]. Berkeley: University of California Press, 2007 [cited 28 April 2008]. Available from Internet: <http://www.marktwainproject.org/about_makingMTPO.shtml>.
[24] Mark Twain Project Online [online]. Berkeley: University of California Press, 2007 [cited 28 April 2008]. Available from Internet: <http://www.marktwainproject.org/about_technicalsummary.shtml#contentTransform>.
[25] RANGANATHAN, S. R. Colon Classification. 7th ed. Bangalore: Sarada Ranganathan Endowment for Library Science, 1987.
[26] BROUGHTON, Vanda. Faceted classification as a basis for knowledge organization in a digital environment; the Bliss Bibliographic Classification as a model for vocabulary management and the creation of multidimensional knowledge structures. New Review of Hypermedia and Multimedia. 2001, vol 7, no. 1, p. 67-102.
[27] HEARST, Marti A. Clustering versus faceted categories for information exploration. Communications of the ACM. 2006, vol 49, no. 4, p. 59-61.
[28] HEARST, Marti A. et al. Finding the flow in web site search. Communications of the ACM. 2002, vol 45, no. 9, p. 45.
[29] YEE, Ka-Ping, SWEARINGEN, Kirsten, LI, Kevin, et al. Faceted metadata for image search and browsing. In CHI '03: Proceedings of the conference on Human factors in computing systems.
2003, p. 401-408.
[30] BEN-YITZHAK, Ori et al. Beyond basic faceted search. In WSDM '08: Proceedings of the international conference on Web search and web data mining. 2008, p. 39.
[31] XTF is a full-text and metadata search and display technology, initially focused on documents, and built on top of the open source Lucene search index. For MTPO, that index was built out of the full text of the transcribed TEI-encoded letters and the METS-encoded metadata. For more information, please refer to <http://www.cdlib.org/inside/projects/xtf/>.
[32] SLC to Olivia Lewis (Mrs. Jervis) Langdon, 13 Feb 1869, Ravenna, Ohio (UCCL 00249). [online] In Mark Twain's Letters, 1869. Edited by FISCHER, Victor, FRANK, Michael B. and ARMON, Dahlia. Mark Twain Project Online. Berkeley: University of California Press. 1992, 2007 [cited 5 May 2008]. Available from Internet: <http://www.marktwainproject.org/xtf/view?docId=letters/UCCL00249.xml;style=letter;brand=mtp>.
[33] http://www.w3.org/TR/NOTE-datetime
[34] WILDEMUTH, Barbara M. The effects of domain knowledge on search tactic formulation. Journal of the American Society of Information Science and Technology. 2004, vol 55, no. 3, p. 246-258.
[35] HSIEH-YEE, Ingrid. Effects of search experience and subject knowledge on the search tactics of novice and experienced searchers. Journal of the American Society for Information Science. 1999, vol 44, no. 3, p. 161-174.
[36] BHAVNANI, Suresh K. Domain-specific search strategies for the effective retrieval of healthcare and shopping information. CHI '02 extended abstracts on Human factors in computing systems. Minneapolis, MN: ACM. 2002, p. 610-611.
[37] PALMER, Carole L. Scholarly work and the shaping of digital access. Journal of the American Society of Information Science and Technology. 2005, vol 56, no. 11, p. 1140-1153.
[38] SLC to Olivia L. Langdon ... , 17 Jan 1869, Chicago, Ill. (UCCL 00234). [online] In Mark Twain's Letters, 1869. Edited by FISCHER, Victor, FRANK, Michael B. and ARMON, Dahlia. Mark Twain Project Online. Berkeley: University of California Press. 1992, 2007 [cited 5 May 2008]. Available from Internet: <http://www.marktwainproject.org/xtf/view?docId=letters/UCCL00234.xml;style=letter;brand=mtp>.
[39] SLC to Orion and Mary E. (Mollie) Clemens, 19 and 20 Oct 1865, San Francisco, Calif. (UCCL 00092). [online] In Mark Twain's Letters, 1853-1866. Edited by BRANCH, Edgar Marquess, FRANK, Michael B., SANDERSON, Kenneth M. et al. Mark Twain Project Online. Berkeley: University of California Press. 1988, 2007 [cited 5 May 2008]. Available from Internet: <http://www.marktwainproject.org/xtf/view?docId=letters/UCCL00092.xml;style=letter;brand=mtp>.
[40] Minimal Guidelines for Authors of Web Pages. [online]. Modern Language Association [cited 28 April 2008]. Available from Internet: <http://www.mla.org/web_guidelines>.
[41] http://del.icio.us/
[42] http://www.citeulike.org
[43] http://www.zotero.org/
[44] The Chicago Manual of Style. 15th ed. Chicago: University of Chicago Press, 2003.
[45] GIBALDI, J. MLA Handbook for Writers of Research Papers. New York: Modern Language Association of America, 2003.
[46] PALMER, James D. Exploiting bibliographic web services with CiTeX. Proceedings of the 2007 ACM symposium on Applied computing. Seoul, Korea: ACM. 2007, p. 1673-1676.
[47] HITCHCOCK, Steve et al. Developing services for open eprint archives: globalisation, integration and the impact of links. Proceedings of the fifth ACM conference on Digital libraries.
San Antonio, Texas, United States: ACM. 2000, p. 143-151.
[48] BIER, Eric, GOOD, Lance, POPAT, Kris et al. A document corpus browser for in-depth reading. Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries. Tucson, AZ, USA: ACM. 2004, p. 87-96.
[49] Gallica Proust. [online]. Edited by CALLU, Florence [cited 28 April 2008]. Available from Internet: <http://gallica.bnf.fr/proust/>.
[50] Contributor Credits. Mark Twain Project Online [online]. Berkeley: University of California Press, 2007 [cited 28 April 2008]. Available from Internet: <http://www.marktwainproject.org/about_contributorcredits.shtml>.
Preserving The Scholarly Record With WebCite (www.webcitation.org): An Archiving System For Long-Term Digital Preservation Of Cited Webpages
Gunther Eysenbach 1,2,3
1 Centre for Global eHealth Innovation, University Health Network, 190 Elizabeth St, Toronto M5G2C4, Canada; e-mail: geysenba at uhnres.utoronto.ca
2 Department of Health Policy, Management, and Evaluation, University of Toronto
3 Knowledge Media Design Institute, University of Toronto
Abstract
Scholars are increasingly citing electronic "web references" which are not preserved in libraries or full-text archives. WebCite is a new standard for citing web references. To "webcite" a document is to archive the cited Web page through www.webcitation.org and to cite the WebCite permalink instead of (or in addition to) the unstable live Web page. Almost 200 journals are already using the system. We discuss the rationale for WebCite, its technology, and how scholars, editors, and publishers can benefit from the service. Citing scholars initiate an archiving process for all cited Web references, ideally before they submit a manuscript. Authors of online documents and websites which are expected to be cited by others can ensure that their work is permanently available by creating an archived copy using WebCite and providing the citation information, including the WebCite link, on their Web document(s). Editors should ask their authors to cache all cited Web addresses (Uniform Resource Locators, or URLs) "prospectively" before submitting their manuscripts to their journal. Editors and publishers should also instruct their copyeditors to cache cited Web material if the author has not done so already. In addition, WebCite can process publisher-submitted "citing articles" (submitted for example as eXtensible Markup Language [XML] documents) to automatically archive all cited Web pages shortly before or on publication. Finally, WebCite can act as a focussed crawler, retrospectively caching the references of already published articles. Copyright issues are addressed by honouring respective Internet standards (robot exclusion files, no-cache and no-archive tags). Long-term preservation is ensured by agreements with libraries and digital preservation organizations. The resulting WebCite Index may also have applications for research assessment exercises, being able to measure the impact of Web services and published Web documents through access and Web citation metrics.
Keywords: Internet archiving, digital preservation, citing web material
1. Introduction
Scholars (but also legal professionals1 and authors of lay publications) are increasingly citing electronic non-journal "web references" (such as wikis, blogs, homepages, PDF reports)2 which are generally not permanently preserved in libraries or repositories such as PubMed Central and are prone to become inaccessible over time (Error 404 - Not found). The unstable nature of web references is increasingly recognized as a problem within the scientific community and has been the subject of recent articles and discussions 1-11. In a seminal article, Dellavalle et al. have shown that 13% of Internet references in scholarly articles were inactive after only 27 months. Even if URLs are still accessible, another problem is that cited webpages
may have changed, so that readers see something different from what the citing author saw, sometimes without realizing this. Dellavalle et al. have concluded that "publishers, librarians, and readers need to reassess policies, archiving systems, and other resources for addressing Internet reference attrition to prevent further information loss" and called this an issue "calling for an immediate response" by publishers and authors1. Until recently (before WebCite was available), the only option readers of articles which cite inaccessible URLs had to retrieve the "lost" webmaterial was to consult services such as the Internet Archive (Wayback Machine) or the Google archive, hoping that they might - by pure chance - have a version of the cited web document in their archive (ideally a version close to the access date). However, the Internet Archive, Google, and other Internet archiving initiatives commonly use unspecific crawlers to harvest the Web in a shotgun approach, not focussing on academic references, and the archiving process cannot be initiated by authors, editors, or publishers wanting to archive a specific web reference at a specific time and date, as they saw it when they quoted it. Moreover, certain webmaterial may be part of the "hidden web" and not be accessible to archiving crawlers. Therefore, these traditional approaches are inadequate. The objective of this paper is to present and discuss a solution called WebCite (http://www.webcitation.org), an on-demand archiving system for authors, journal/book editors, and publishers for long-term preservation of cited web references. WebCite is a tool specifically designed to be used by authors, readers, editors and publishers of scholarly material, allowing them to permanently archive cited "non-journal" Web material, such as cited homepages, wiki pages, blogs, draft papers accessible on the web, "grey" PDF reports, news reports, etc. To prevent "link rot", authors simply have to cite the WebCite snapshot ID and/or a link to the permanent WebCite URL, in addition to citing the original URL. WebCite is now used by an increasing number of authors and journals3, ensuring permanent availability of cited web references for future readers and scholars. WebCite has built an XML-based webservice architecture which enables, for example, publishers, webmasters, editors, institutions, and vendors of bibliographic software packages to exchange data (e.g. metadata) and to trigger an archiving request. As such, WebCite can be seen as an intermediary between the scholarly community (authors/editors/publishers) and the digital preservation community. As a member of the International Internet Preservation Consortium (IIPC), WebCite works together with IIPC members such as the Internet Archive to create 1) a distributed storage infrastructure (so that WebCite snapshots are automatically fed into other digital archives, such as the Internet Archive or national libraries and archives), and 2) an interoperability infrastructure which would allow a federated search across different archives (e.g., if a reader clicks on a WebCite link with the format www.webcitation.org/query?date=..&url=..., the system looks across different archives to locate snapshots of that URL cited on a specific date). WebCite is also working on advanced Web 2.0 functionalities, allowing users to share and recommend documents with other users.
Finally, WebCite creates alternative statistics on the usage and citation of websites and non-journal webmaterial, which can be used (in analogy to the impact factor) to measure the scholarly “impact” of a given webservice or web document, based on citations in scholarly publications.
2. Methods: WebCite Architecture, Workflow and Functionality
2.1 History
The WebCite idea was first conceived in 1997 and mentioned in a 1998 article on quality control on the Internet, alluding to the fact that such a service would also be useful to measure the citation impact of webpages 12. In the same year, a non-functional mockup was set up at the address webcite.net (see archived screenshots of that service at the Internet Archive 1). However, shortly after, Google and the Internet Archive entered the market, both apparently making a service like WebCite redundant. The idea was revived in 2003, when a study published in Science 1 concluded that there was still no appropriate, agreed-upon solution available in the publishing world. Neither the Internet Archive nor Google allows for "on-demand" archiving by authors, and neither has interfaces to scholarly journals and publishers to automate the archiving of cited links. In 2005, the first journal [Journal of Medical Internet Research] announced using WebCite routinely 13, and dozens of other journals followed suit. BioMed Central, publisher of hundreds of open access journals, has been using WebCite routinely since 2005 (all URLs cited in BioMed Central articles are automatically archived by WebCite)2.
2.2 Functionality Overview
Authors and journal editors ensure long-term accessibility of cited URLs by using WebCite-enhanced references. A WebCite-enhanced reference is a reference which contains - in addition to the original live URL (which can and probably will disappear in the future, or whose content may change) - a link to an archived copy of the material, exactly as the citing author saw it when he accessed the cited material. There are two basic formats of a WebCite URL: the opaque and the transparent format. The former can be added to a cited URL, the latter can be used to replace a cited URL. Both formats will be returned in response to an archiving request, usually initiated by the citing author. The opaque URL is very short and handy, containing a short ID like 5Kt3PxfFl (http://www.webcitation.org/5Kt3PxfFl). This format should only be used in a reference where the original URL is still visible.
Example reference "enhanced" by WebCite:
[1] Lessig, Lawrence: "this is a fantastically cool idea" (Blog). Sept 8, 2006. http://lessig.org/blog/2006/09/ (Archived by WebCite at http://www.webcitation.org/5UzgHmsS7 on 20-01-2008)
Alternatively, the cited URL and the cited date can be part of a single WebCite URL (the transparent format), making it redundant to spell out the original URL. The drawback is that the WebCite URL can become pretty long.
Alternative format:
[2] Lessig, Lawrence: "this is a fantastically cool idea" (Blog). Sept 8, 2006. Archived by WebCite at http://www.webcitation.org/query?url=lessig.org/blog/2006/09/&date=2008-01-20
These are just examples; the actual citation formats preferred by different editors may differ. Most style guides currently give little or no guidance on how to cite URLs and their archived version, but most editors will accept something along the lines of citing the original URL together with the archived URL in a submitted manuscript. It is possible to omit the archiving or "accessed on" date (which is recommended in most style guides when citing URLs), because WebCite always tells the reader when the snapshot was taken, and in the transparent format it becomes part of the URL.
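As a rough illustration, the sketch below assembles the two URL formats just described from a snapshot ID, a cited URL, and a cited date. Only the URL patterns come from the examples above; the helper names are ours.

    from urllib.parse import urlencode

    def opaque_webcite_url(snapshot_id):
        # e.g. "5UzgHmsS7" -> http://www.webcitation.org/5UzgHmsS7
        return f"http://www.webcitation.org/{snapshot_id}"

    def transparent_webcite_url(cited_url, cited_date):
        # cited_date in ISO format, e.g. "2008-01-20"
        return "http://www.webcitation.org/query?" + urlencode(
            {"url": cited_url, "date": cited_date})

    print(opaque_webcite_url("5UzgHmsS7"))
    print(transparent_webcite_url("lessig.org/blog/2006/09/", "2008-01-20"))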
Another form of a WebCite link contains the cited URL and the DOI (Digital Object Identifier) of the citing document (refdoi): [3] IMEX. http://www.webcitation.org/query.php?url=http://imex.sourceforge.net&refdoi=10.1186/jbiol36
This format is used by publishers who use the DOI system to identify their articles and who have implemented WebCite (e.g. BioMed Central) by sending us their citing articles shortly before or at publication (which our software combs for URLs that have not been archived by the citing author). A WebCite URL containing a refdoi implies that a snapshot of the cited URL was taken when the citing paper (identified by its DOI) was published. Thus, the archiving date of the snapshot retrieved by WebCite when the reader clicks on this link will be close to the publishing date of the article. The example above is from the citing paper with the DOI 10.1186/jbiol36, which cites the URL http://imex.sourceforge.net (see reference 45). A fourth way to retrieve a WebCite snapshot is by using a hash sum (a sort of digital fingerprint of a document); however, citing a document by URL and date is currently much more common than using a hash. A fifth way is that WebCite may have (on request of the cited author/publisher) assigned a DOI to an archived snapshot, so that the link has the format http://dx.doi.org/10.2196/webcite.xxx (where xxx is the hash key of the material in WebCite). The DOI resolver at dx.doi.org (which is commonly also used to resolve cited journal and book references) would then resolve to a WebCite snapshot page (or an intermediary page pointing to the same work in other archives, to other manifestations such as print or PDF, or - in the case of online preprints - to "final" publications). Before the WebCite URL can be cited, the archiving process has to be initiated, as described in detail below. The archiving process can be initiated by citing authors, editors, publishers, or cited authors. Various degrees of "automation" exist, from manual initiation of the archiving of a specific page, to automatic harvesting of cited URLs and subsequent archiving of these URLs. Authors usually use the archiving form, the WebCite bookmarklet, or upload an entire citing manuscript to the WebCite server via the comb page, which initiates the WebCite tool to comb through the manuscript and to archive all cited non-journal URLs. Participating journal or book editors, publishers, or copyeditors participate by inserting a note in their "Instructions for authors" asking authors to use webcitation.org to permanently archive all cited webpages and websites before manuscript submission, and to cite the archived copy in addition to the original link. Participating publishers (such as BioMed Central) submit manuscript XML files to WebCite at the time of publication, so that WebCite can comb through the manuscript and archive cited webpages automatically. Citable authors, i.e. academic bloggers and authors of non-journal scholarly webpages who foresee the possibility of being cited in the scholarly literature ("citable Web-authors" or subsequently called "cited authors"), but are concerned about the persistence and citability of their work, can add a "WebCite this!" link to their work, which links dynamically to the archiving form.
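The "digital fingerprint" mentioned above can be illustrated with a short sketch that hashes the archived bytes and derives a hash-based DOI suffix. The hash algorithm WebCite actually uses is not stated here, so SHA-1 (whose 40-character hex digests match the length of the example DOI discussed later in this paper) is an assumption made purely for illustration.

    import hashlib

    # Illustration only: SHA-1 is an assumed stand-in for WebCite's
    # real fingerprinting algorithm.
    def fingerprint(archived_bytes):
        return hashlib.sha1(archived_bytes).hexdigest()

    def hash_based_doi(archived_bytes, prefix="10.2196/webcite."):
        # e.g. 10.2196/webcite.863eb3546af7384e44fb0e422cca1fa97704abeb
        return prefix + fingerprint(archived_bytes)

    snapshot = b"<html>...archived page content...</html>"
    print(hash_based_doi(snapshot))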
2.3 Detailed Architecture and Workflow (see Figure 1)
2.3.1 "Cited" Author-initiated archiving
These pathways are initiated by the author of webmaterial who is concerned about the citability and long-term preservation of his unpublished online work (we call these cited authors or citable authors). The work can be a blog, a discussion paper published on a website, an online preprint/draft of a paper, a posting in
a newsgroup etc. It is presumed that the cited author is also the copyright holder.
1. Self-archiving through the form at http://www.webcitation.org/archive (or a bookmarklet or browser plugin). Optionally, authors may post a "cite-as" statement on their document, containing the WebCite link, indicating that they prefer that people (citing authors) cite a given version.
2. For dynamically changing content, authors can publish a button ("Cite this page!") which dynamically creates a new archived version whenever somebody (a "citing author") clicks on that link. This is a link to the archiving form, populated with metadata and the URL to be archived handed over in the URL, for example http://www.webcitation.org/archive?url=http%3A%2F%2Foalibrarian.blogspot.com%2F2006%2F07%2Fopen-source-open-access-and-open.html&title=Open+Source%2C+Open+Access+-+And+Open+Search%3F&author=Giustini+Dean&date=2006-07-13&source=OA+Librarian&subject=electronic+publishing%3B+open+access (a small sketch of generating such a link follows below).
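The long prepopulated link in item 2 can be generated programmatically; the sketch below rebuilds it with standard URL encoding. The parameter names are taken from that example link; nothing further is implied about the WebCite API.

    from urllib.parse import urlencode

    def archive_form_link(url, title, author, date, source, subject):
        params = {"url": url, "title": title, "author": author,
                  "date": date, "source": source, "subject": subject}
        return "http://www.webcitation.org/archive?" + urlencode(params)

    print(archive_form_link(
        url="http://oalibrarian.blogspot.com/2006/07/open-source-open-access-and-open.html",
        title="Open Source, Open Access - And Open Search?",
        author="Giustini Dean",
        date="2006-07-13",
        source="OA Librarian",
        subject="electronic publishing; open access"))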
Figure 1: WebCite architecture

2.3.2 "Citing" Author-initiated archiving
2. See above. While the "cited author" facilitates archiving by publishing an archiving link, the actual archiving process is initiated by the citing author.
3. Most citing authors will initiate archiving of a webpage through the form at http://
www.webcitation.org/archive,
4. or through a bookmarklet (or through other yet-to-be-developed browser plugins which facilitate metadata entry).
5. Citing authors can also publish a draft manuscript on the web, and then let WebCite "comb" through this HTML manuscript to archive all (or selected) cited URLs. Further functionality to comb through Word/ODT/RTF files is planned.

2.3.3 Publisher/Editor-initiated archiving
6. Publishers, editors, copyeditors who publish "citing" documents (academic papers, books etc.) can use the same tools as citing authors to archive cited URLs (/archive, /comb).
7. When processing a manuscript for publishing, publishers can submit a tagged document to WebCite for processing. WebCite will automatically archive all cited URLs (unless they have already been archived by the citing author). Ideally, this submission is done via FTP, and uses a well-defined (preferably XML-based) schema for article data. Currently WebCite supports (X)HTML documents, NLM Journal Publishing DTD documents, and BioMed Central Article DTD documents. Adding new document types to this list is a straightforward process, and can be undertaken on a publisher-by-publisher basis by providing WebCite with a document DTD and a sample document for testing. Note that if publishers use this avenue and if they are using DOIs for their citing articles, they can link to the archived URL simply by using a format like this: http://www.webcitation.org/query?url=http%3A%2F%2Fwww.iom.edu%2F%3Fid%3D19750&refdoi=10.2196/jmir.8.4.e27, where url is the archived URL and refdoi is the DOI of the citing document. That way, publishers know the link to the WebCite-cached copy ahead of time, and do not necessarily have to analyze the XML document which WebCite returns and which contains the WebCite IDs and success/failure messages. Providing a refdoi in the query implies that WebCite retrieves an archived copy of the URL which is close to the submission date (a small sketch of assembling such a link is given after this list).
8. (possibly to be implemented in collaboration with CrossRef) Publishers who are members of CrossRef currently supply their article metadata in an XML file formatted according to the CrossRef XSD schema version 3.0.1 for forward linking. This schema provides for the optional inclusion of citation lists attached to the existing journal article metadata; however, the current version does not support non-journal, non-book, non-conference citations. Should the schema be revised to include cited URLs of webcitations, and if CrossRef makes these cited URLs (together with the referring DOIs of citing articles) available, then WebCite could automatically archive cited URLs without publishers having to upload their content separately to WebCite as described under (7).
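For publishers, the refdoi link described in item 7 can be assembled ahead of time from the cited URL and the DOI of the citing article. The sketch below only reproduces the query format quoted above and assumes nothing else about the service.

    from urllib.parse import urlencode

    def refdoi_link(cited_url, citing_doi):
        return ("http://www.webcitation.org/query?"
                + urlencode({"url": cited_url, "refdoi": citing_doi}))

    print(refdoi_link("http://www.iom.edu/?id=19750", "10.2196/jmir.8.4.e27"))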
2.3.4 Mirrors, Creating a redundant infrastructure
9. "Cited authors" can request assignment of a DOI to their archived work. The DOI will be something like 10.2196/webcite.863eb3546af7384e44fb0e422cca1fa97704abeb, with the part after webcite being the digital fingerprint (hash sum) of the archived document. This enables citing authors to use the DOI resolver at dx.doi.org instead of using a "direct" link to WebCite.
10. (to be implemented) WebCite deposits archived copies in mirrors and secondary archives
(e.g. the Internet Archive and other IIPC members). While not implemented yet, in the future citing authors and readers could use the link resolver at dx.doi.org to retrieve archived documents which have a DOI. Alternatively, cross-archive searches for snapshots with a certain URL/date could be conducted.
2.4 Use cases
2.4.1 Using WebCite as a (citing) author to archive webpages
WebCite is an entirely free service for authors who want to cite webmaterial, regardless of what publication they are writing for (even if they are not listed as members). The author of a citing manuscript can:
• Either manually initiate the archiving of a single cited webpage (by using either the WebCite bookmarklet or the archive page) and manually insert a citation to the permanently archived webdocument on webcitation.org in his manuscript, or
• Upload an entire citing manuscript to the WebCite server via the comb page, which initiates the WebCite tool to comb through the manuscript and to archive all cited non-journal URLs; the WebCite software also replaces all URLs in the manuscript with a link to the permanently archived webdocument on webcitation.org (a rough sketch of this combing step is given below).
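A toy sketch of the combing step mentioned in the second bullet: find cited URLs in a manuscript and append an archived link to each. The archive_url function is a placeholder; a real workflow would submit each URL to WebCite and receive the snapshot ID in return.

    import re

    URL_PATTERN = re.compile(r"https?://[^\s<>\"\)]+")

    def archive_url(url):
        # Placeholder only; the real step would call the WebCite service.
        return "http://www.webcitation.org/EXAMPLE-ID"

    def comb(manuscript_text):
        def enhance(match):
            original = match.group(0)
            return f"{original} (archived at {archive_url(original)})"
        return URL_PATTERN.sub(enhance, manuscript_text)

    print(comb("See the report at http://example.org/report.pdf for details."))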
2.4.2 Using WebCite as a reader
Readers simply click on the WebCite link provided by publishers or citing authors in their WebCite-enhanced references to retrieve the archived document in case the original URL has stopped working, or to see what the citing author saw when he cited the URL. Readers can also search the WebCite database to see how a given URL looked on a given date, provided somebody has cited that URL on or near that date. The date search is "fuzzy", i.e. the date does not have to match the archiving date exactly; we always retrieve the closest copy and give a warning if the dates do not match. A drop-down list on top of the frame tells readers on which dates snapshots were taken; selecting any of these dates retrieves the respective snapshot.
2.4.3 Using WebCite as an editor
Participating journal or book editors, publishers, or copyeditors should insert a note in their "Instructions for authors" asking their authors to use webcitation.org to permanently archive all cited webpages and websites, and to cite the archived copy in addition to the original link7. Secondly, editors/copyeditors should initiate the archiving of cited webpages (either manually, or through the automated comb mechanism, which involves uploading the entire manuscript so that the WebCite engine can crawl and archive all cited URLs) and replace all webcitations in a manuscript with links to the archived copy, before the manuscript is published. Thirdly, editors are encouraged to become a member of WebCite (which is free) so that their journal is listed on webcitation.org as a participating journal.
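The "fuzzy" date lookup described for readers in 2.4.2 above amounts to picking the snapshot whose archiving date is closest to the date the reader asked for; a minimal sketch with invented snapshot data follows.

    from datetime import date

    # Invented example data: archiving date -> snapshot ID.
    snapshots = {date(2007, 11, 26): "5TeS43Ipq",
                 date(2008, 1, 20): "5UzgHmsS7"}

    def closest_snapshot(requested):
        archived = min(snapshots, key=lambda d: abs(d - requested))
        mismatch_warning = (archived != requested)
        return snapshots[archived], archived, mismatch_warning

    print(closest_snapshot(date(2008, 1, 1)))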
2.4.4 Using WebCite as a library, internet archive, or digital preservation organization
WebCite works with digital preservation partners who are running dark mirrors, and is working on a federated cross-archive search and common API infrastructure. This ensures that archived content remains accessible for future generations, without being dependent on a single server. This is currently work in progress (interested potential partners are encouraged to contact the author).
2.4.5 Using WebCite as a citable Web-author (blogger, copyright holder of webpages, etc.), or for self-archiving
Academic bloggers and authors of non-journal scholarly webpages who foresee the possibility of being cited in the scholarly literature ("citable Web-authors" or subsequently called "cited authors"), but are concerned about the persistence and citability of their work, can add a WebCite link to their work. If their content is dynamically changing, then they are encouraged to publish a button which creates a new archived version whenever somebody cites the work: Cite this page!1
This links to the WebCite archiving request form; the link can contain prepopulated variables containing the Dublin Core metadata of the cited page, which will ensure that the reader ("citing author") knows exactly how to cite the work, and makes sure that a snapshot of the cited work is preserved in WebCite and its digital preservation partners. If their online content is static, bloggers (and other web authors) may want to encourage readers to cite a specific version. In this case, they first self-archive their work using the WebCite archiving request form, which simply requests the URL of the page to be archived. After successful archiving (which is done in an instant), web authors may then publish a button and a link to the snapshot in the WebCite archive within their work, or when referring to it, for example:
Cite as: Giustini D. Open Source, Open Access - And Open Search? URL: http://oalibrarian.blogspot.com/2006/07/open-source-open-access-and-open.html (accessed 26-Nov-2007). Self-Archived at WebCite on 26-Nov-2007 [http://www.webcitation.org/5TeS43Ipq]
Thus, WebCite can be used by authors as a one-click self-archiving tool, to ensure that, for example, preprints, discussion papers, and other formally unpublished material remain citable and available. All they have to do is publish a preprint online, and then self-archive it (one-click self-archiving). Note that these buttons should not be used for journal articles, which are presumably already archived through other mechanisms (LOCKSS, etc.). WebCite premium members will also be able to search the archive, assign a DOI to their work archived in WebCite, specify whether ads can be displayed (they receive a proportion of the ad revenues), etc. If a DOI is assigned, citing authors/publishers can use a link to dx.doi.org, which enables doi.org to resolve the link to either WebCite or another archive, e.g. the Internet Archive, if an archived copy with an identical hash is found.
2.4.6 Using WebCite as a publisher
Participating publishers include publishers of scholarly journals like BioMed Central who use WebCite to preserve cited webmaterial routinely and largely automatically.
They do this by encouraging their editors to instruct their authors and copyeditors to cache all cited URLs "prospectively" before submission (level 1) or during the copyediting process (level 2), respectively, and/or by submitting manuscript XML files to WebCite at the time of publication (level 3), so that WebCite can comb through the manuscript and archive cited webpages automatically. WebCite can also analyze back issues of a journal and archive the cited documents "retrospectively" (level 4). As an implementation example, BioMed Central uses a WebCite
logo to link to the archived copy in every webreference.
Another example is the Journal of Medical Internet Research: almost all articles in this journal cite URLs, and since 2005 all are archived. See http://www.jmir.org/2005/5/e60#ref9 for an example.
3. Results
Since 2005, WebCite has been used by over 200 scholarly journals, and has archived over 3 million files and webpages of scholarly importance.
4. Discussion
4.1 Copyright Issues
Caching and archiving webpages is widely done (e.g. by Google, the Internet Archive, etc.), and is not considered a copyright infringement, as long as the copyright owner has the ability to remove the archived material and to opt out. WebCite honors robot exclusion standards, as well as no-cache and no-archive tags. WebCite also honors requests from individual copyright owners to have archived material removed from public view. A U.S. court has recently (Jan 19th, 2006) ruled that caching does not constitute a copyright violation, because of fair use and an implied license (Field vs Google, US District Court, District of Nevada, CVS-04-0413-RCJ-LRL). Implied license refers to the industry standards mentioned above: if the copyright holder does not use any no-archive tags or robot exclusion standards to prevent caching, WebCite can (as Google does) assume that a license to archive has been granted. Fair use is even more obvious in the case of WebCite than for Google, as Google uses a "shotgun" approach, whereas WebCite selectively archives only material that is relevant for scholarly work. Fair use is therefore justifiable based on the fair-use principles of purpose (caching constitutes transformative and socially valuable use for the purposes of archiving, in the case of WebCite also specifically for academic research), the nature of the cached material (previously made available for free on the Internet, in the case of WebCite also mainly scholarly material), amount and substantiality (in the case of WebCite only cited webpages, rarely entire websites), and effect of the use on the potential market for or value of the copyrighted work (in the case of Google it was ruled that there is no economic effect, and the same is true for WebCite). In the future, WebCite will further reduce its liability by feeding content into third-party archives such as national libraries and archives, which (often) have a legal deposit mandate. Secondly, WebCite is working on a more sophisticated infrastructure which would allow copyright holders not only to withdraw their content, but also to specify a fair royalty fee.
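The opt-out mechanisms mentioned above (robot exclusion files and no-archive tags) can be checked with standard tools before a page is archived. The sketch below uses Python's robotparser plus a naive scan for a noarchive tag and is only an approximation of how an archiving service might honour them, not a description of WebCite's actual code.

    from urllib.parse import urljoin
    from urllib.robotparser import RobotFileParser

    def allowed_by_robots(url, user_agent="ExampleArchiver"):
        parser = RobotFileParser()
        parser.set_url(urljoin(url, "/robots.txt"))
        parser.read()                      # fetches robots.txt over the network
        return parser.can_fetch(user_agent, url)

    def has_noarchive_tag(html):
        # Naive check for a noarchive directive in already-fetched HTML.
        return "noarchive" in html.lower()

    # Usage sketch (requires network access for the robots.txt check):
    # if allowed_by_robots(url) and not has_noarchive_tag(fetched_html):
    #     proceed_with_archiving(url)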
4.2 Business Model
In order to cover the costs of ongoing and sustainable operations, WebCite will have to generate a
revenue stream, for example through the following mechanisms:
• Premium membership accounts for individuals (e.g. "cited authors") and institutions (publishers, universities) with an annual fee, which enables the assignment of a DOI to unpublished online work and the automatic processing of "citing" XML documents, and offers the opportunity to display advertisements (own ads, or Google AdSense) together with the archived content, if the member owns the copyright of the content. Premium accounts could also offer services such as access to certified citation statistics (for promotion and tenure, individuals and universities may want to have this information), citation recommendations (people who cited this webpage also cited ...), etc.
• Advertising: "cited authors" may decide to enable Google AdSense ads with their content (with them receiving royalties).
• Royalty collection: cited authors/copyright holders will get the option to specify a per-pay-view royalty, which WebCite could collect for the copyright holder (receiving a commission).
•
Developing tools and consulting for companies seeking to integrate WebCite into their products (e.g. vendors of bibliographic software packages such as RefMan or Endnote).
5. Conclusions

The current state of scholarly communication on the web can be characterized by the following paradox:
• blogs (and other Internet venues such as wikis) are - at least in theory - important venues for scholarship, used to publish hypotheses, analyses etc. outside of the traditional journal publishing system;

• yet, they are not considered "citable" or "publications" - which in turn affects their use, usefulness, and acceptance among researchers as tools for scholarly communication.
WebCite aims to make Internet material (any sort of digital objects) more "citable", long-term accessible, and hence more acceptable for scholarly purposes. Without WebCite, Internet citations are deemed ephemeral and therefore are often frowned upon by authors and editors. However, it does not make much sense to ignore opinions, ideas, draft papers, or data published on the Internet (including wikis and blogs), acknowledging none of them only because they are not "formally" published and because they are difficult to cite. The reality is that in the age of the Internet, "publication" is a continuum, and it makes little sense not to cite (and therefore acknowledge) for example the idea of a scholarly blogger, the collective wisdom of a wiki, ideas from an online discussion paper, or data from an online accessible dataset only because online material is not deemed "citable". By making Internet material more "citable" (and also by creating incentives such as mechanisms and metrics for measuring the "impact" of online material, by calculating and publishing a WebCite impact factor), we hope to encourage scholars to publish ideas and data online in a wide range of formats, which in turn should accelerate and facilitate the exchange of scientific ideas. While we do see the value of scholarly peer-reviewed journals for publishing research results, we also acknowledge that much of the scientific discourse takes place before research is "formally" published, and that peer review can also take on other forms (e.g. post-publication peer review, which is something WebCite plans to implement). Another broader societal aspect of the WebCite initiative is advocacy and research in the area of copyright. We aim to develop a system which balances the legitimate rights of the copyright holders (e.g. cited authors and publishers) against the "fair use" rights of society to archive and access important material.
We also advocate and lobby for a non-restrictive interpretation of copyright which does not impede the digital preservation of our cultural heritage, or the free and open flow of ideas. This should not be seen as a threat by copyright holders - we aim to keep material which is currently openly accessible online accessible for future generations, without creating economic harm to the copyright holder. This is a challenging but feasible goal, and future iterations of this service may include some sort of revenue-sharing mechanism for copyright holders. Yet another angle is that WebCite enables "one-click self-archiving", making it very easy for scholarly authors to create a permanent, openly accessible record of their own work and their ideas. While the primary pathway in the WebCite system is third-party-initiated archiving (triggered by a citing author), WebCite also provides a very simple mechanism for authors to self-archive their own work. Another perspective is that WebCite is an innovative Internet archiving process that could be referred to as a "reverse archiving" or "Archiving 2.0" approach. Rather than letting librarians or archiving crawlers (such as the Wayback Machine) archive material (and letting curators assign metadata), WebCite puts the initiation of the archiving process into the hands of the scientific community, which - by virtue of citing it - decides what is considered worth archiving. The assignment of metadata is also a highly decentralized, bottom-up process (which involves the community of "citing" authors, but also the cited author).

6. Notes
1. A New York Times article, published Jan 29, 2007, "Courts Turn to Wikipedia, but Selectively" by Noam Cohen, mentions WebCite: "(...) 'citation of an inherently unstable source such as Wikipedia can undermine the foundation not only of the judicial opinion in which Wikipedia is cited, but of the future briefs and judicial opinions which in turn use that judicial opinion as authority.' Recognizing that concern, Lawrence Lessig, a professor at Stanford Law School who frequently writes about technology, said that he favored a system that captures in time online sources like Wikipedia, so that a reader sees the same material that the writer saw. He said he used www.webcitation.org for the online citations in his amicus brief to the Supreme Court in Metro-Goldwyn-Mayer Studios v. Grokster Ltd., which "makes the particular reference a stable reference, and something someone can evaluate. (...)".

2. It is important to understand that WebCite focuses on documents exclusively available on the web, not documents such as journal articles which can be assumed to be archived in libraries.

3. For a full list of journals using WebCite see http://www.webcitation.org/members. Accessed: 2008-06-04. (Archived by WebCite® at http://www.webcitation.org/5YJvduH5t)

4. http://web.archive.org/web/19990203173551/webcite.net/home.htm

5. Cockerill M. WebCite links provide access to archived copy of linked web pages. BioMed Central Blog. URL: http://blogs.openaccesscentral.com/blogs/bmcblog/entry/webcite_links_provide_access_to [Archived in WebCite at http://www.webcitation.org/5Tb2FDt4e on 2007-11-14]

6. WebCite® automatically determines whether there is a need for storing another physical copy, or whether the content has already been archived, in which case a new WebCite ID is generated which points to the already archived copy.

7. For an example see: http://www.jmir.org/cms/viewInstructions_for_Authors:Instructions_for_Authors_of_JMIR#webcite

8. This links to the WebCite archiving request form – the link can contain prepopulated variables containing the Dublin Core metadata of the cited page, e.g. http://www.webcitation.org/archive?url=http%3A%2F%2Foalibrarian.blogspot.com%2F2006%2F07%2Fopen-source-open-access-and-open.html&title=Open+Source%2C+Open+Access+-+And+Open+Search%3F&author=Giustini+Dean&date=2006-07-13&source=OA+Librarian&subject=electronic+publishing%3B+open+access
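A short sketch may help to illustrate how such a pre-populated archiving link is assembled. The base address and the parameter names below are copied from the example URL in note 8; the helper function itself is merely illustrative and is not part of the WebCite code base.

from urllib.parse import urlencode

# Illustrative sketch: build a WebCite archiving-request link whose query string
# carries the Dublin Core metadata of the cited page (cf. note 8). The helper
# function and its argument names are hypothetical.
ARCHIVE_FORM = "http://www.webcitation.org/archive"

def archive_request_url(url, title, author, date, source, subject):
    # Each keyword below corresponds to one of the pre-populated variables
    # shown in the example link of note 8.
    params = {"url": url, "title": title, "author": author,
              "date": date, "source": source, "subject": subject}
    return ARCHIVE_FORM + "?" + urlencode(params)

print(archive_request_url(
    url="http://oalibrarian.blogspot.com/2006/07/open-source-open-access-and-open.html",
    title="Open Source, Open Access - And Open Search?",
    author="Giustini Dean",
    date="2006-07-13",
    source="OA Librarian",
    subject="electronic publishing; open access"))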
7. References
[1] Dellavalle RP, Hester EJ, Heilig LF, Drake AL, Kuntzman JW, Graber M, et al. Going, Going, Gone: Lost Internet References. Science 2003;302:787-8.
[2] Crichlow R, Davies S, Winbush N. Accessibility and Accuracy of Web Page References in 5 Major Medical Journals. JAMA 2004;292:2723-b.
[3] Hester EJ, Heilig LF, Drake AL, Johnson KR, Vu CT, Schilling LM, et al. Internet citations in oncology journals: a vanishing resource? J Natl Cancer Inst 2004;96:969-71.
[4] Johnson KR, Hester EJ, Schilling LM, Dellavalle RP. Addressing internet reference loss. Lancet 2004;363:660-1.
[5] Kelly DP, Hester EJ, Johnson KR, Heilig LF, Drake AL, Schilling LM, et al. Avoiding URL reference degradation in scientific publications. PLoS Biol 2004;2:E99.
[6] Schilling LM, Kelly DP, Drake AL, Heilig LF, Hester EJ, Dellavalle RP. Digital Information Archiving Policies in High-Impact Medical and Scientific Periodicals. JAMA 2004;292:2724-6.
[7] Schilling LM, Wren JD, Dellavalle RP. Bioinformatics leads charge by publishing more Internet addresses in abstracts than any other journal. Bioinformatics 2004;20:2903.
[8] Badgett RG, Berkwits M, Mulrow C. Scholarship Erosion. Ann Intern Med 2006;145:77-a.
[9] Wren JD, Johnson KR, Crockett DM, Heilig LF, Schilling LM, Dellavalle RP. Uniform Resource Locator Decay in Dermatology Journals: Author Attitudes and Preservation Practices. Arch Dermatol 2006;142:1147-52.
[10] Evangelou E, Trikalinos TA, Ioannidis JPA. Unavailability of online supplementary scientific information from articles published in major journals. FASEB J 2005;19:1943-4.
[11] Aronsky D, Madani S, Carnevale RJ, Duda S, Feyder MT. The Prevalence and Inaccessibility of Internet References in the Biomedical Literature at the Time of Publication. J Am Med Inform Assoc 2007;14:232-4.
[12] Eysenbach G, Diepgen TL. Towards quality management of medical information on the internet: evaluation, labelling, and filtering of information. BMJ 1998;317:1496-500.
[13] Eysenbach G, Trudel M. Going, Going, Still There: Using the WebCite Service to Permanently Archive Cited Web Pages. J Med Internet Res 2005;7:e60.
Enhancing the Sustainability of Electronic Access to ELPUB Proceedings: Means for Long-term Dissemination

Bob Martens 1; Peter Linde 2; Robert Klinc 3; Per Holmberg 4

1 Vienna University of Technology, Karlsplatz 13, A-1040 Vienna, Austria, e-mail: b.martens@tuwien.ac.at
2, 4 Blekinge Institute of Technology, SE-37179 Karlskrona, Sweden, e-mail: peter.linde@bth.se, per.holmberg@bth.se
3 University of Ljubljana, FGG KGI, SI-1000 Ljubljana, Slovenia, e-mail: rklinc@itc.fgg.uni-lj.si
Abstract

ELPUB can look back on a track record of a steadily growing number of conference papers. From a long-term perspective, access to this body of knowledge is of great interest to the community. Beyond this, the extended preoccupation with the collected scientific work in the area of digital publishing has to be mentioned. Naturally, authors are particularly focussed on the individual paper itself and possible connections with related efforts. Typically, conferences amplify and enhance opportunities for "getting together". A well-stocked repository may, however, serve in this respect as a fruitful complementary addition. In this contribution, the implementation of persistent identifiers on the existing ELPUB.scix.net base is elaborated in detail. Furthermore, the authors present the results of efforts related to the harvesting of ELPUB metadata and to the creation of a citation index. The paper concludes with an outlook on future plans.

Keywords: Digital preservation; shared information; self-organization; information retrieval; persistent identifiers

1. Introduction
After more than a decade of conferencing, the "threshold" of 500 published entries will be passed on the occasion of ELPUB 2008. There has been a long and persistent discussion on whether or not to have paper-based proceedings. The limited number of hard copies of proceedings might cause difficulties, whereas a parallel electronic release secures wider (and easier) dissemination. However, it can be regarded as an irony that 2003 was the first year in which all published papers were made electronically available in a repository (ELPUB.scix.net; [1]), which has since been extended consistently. In 2006, the repository was extended further, i.e. once the set of proceedings was on its way to printing, it was made available on the web as a pre-publishing alternative in good time before the conference itself, accompanied by direct links in the conference schedule. It has to be noted that the handling of ELPUB.scix.net is intentionally done on a shoestring budget to ensure archiving and collection in the long term. Furthermore, members of the ELPUB community became intrigued by the wealth of stored information and focused on content analysis in their work (see the work of S. Costa et al. [2,3]).

At the start of ELPUB.scix.net, a policy of "Limited Open Access" was chosen, i.e. users had to register at no cost in order to have access to all stored PDF papers. Open Access received a growing amount of dedication, and it was decided to convert the repository accordingly. The supporting measures required only a minimal amount of programming effort and led to registrations at www.openarchives.org, www.opendoar.org and roar.eprints.org. The paper will start out by exploring opportunities for harvesting metadata. The core part of this contribution focuses on matters of sustainability and describes the procedure towards the implementation of persistent identifiers for the ELPUB repository. There are several schemes that apply to the concept of persistent identifiers - DOI, PURLs etc. But which scheme would be appropriate for ELPUB? The authors provide an analysis concerning the practical and technical experience of introducing URN-NBN as a persistent identifier scheme for the ELPUB repository. A citation index has been made available in the ELPUB repository, and the authors explore the situation at the time of writing. In the future, this option will support researchers with a view to impact analysis. The contribution concludes with an outlook on planned developments.
Figure 1: Screenshot Open Archives – ELPUB record

2. Harvesting ELPUB.scix.net metadata
On the occasion of ELPUB 2007 the decision was taken to switch to an Open Access mode. Given the theme of the conference ("Openness in Digital Publishing"), this hardly came as a surprise. Participants provided the managers of the ELPUB repository with valuable information and hints. Certain programming efforts had to be made in order to adapt the existing OAI interface. Shortly after the conference, the registration of the ELPUB Digital Library at www.openarchives.org, www.opendoar.org and roar.eprints.org was successfully concluded.

The Open Archives Initiative (OAI) develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. OAI has its roots in the open-access and institutional-repository movements. Continued support of this work remains a cornerstone of the Open Archives program. Over time, however, the work of OAI has expanded to promote broad access to digital resources for eScholarship, eLearning, and eScience.

OpenDOAR delivers a quality-controlled list of repositories, as submitted applications are checked by the project staff, who visit the repository sites. The focus lies on academic repository contents. The listing shows impressive growth, with currently over 1100 entries, and can be regarded as a key resource for the Open Access community. The services offered - such as text mining - are continuously developed.

ROAR, an acronym for "Registry of Open Access Repositories", promotes open access to the pre- and post-peer-review research literature through author self-archiving. The registry monitors the overall growth in the number of eprint archives and also maintains a list of GNU EPrints sites.

It has to be noted that the entire repository (482 recorded papers - without ELPUB 2008) was made available to the general audience, which certainly boosted dissemination. By creating a personal login within ELPUB.scix.net, users can access the advanced search features and also store personal favourites etc. There are currently more than 500 registered user accounts. Building on the developments of the previous year, the 2008 conference papers will be put online prior to the conference as soon as they become available and will receive a direct link in the conference schedule. The work effort is manageable for the programme chair, who will derive metadata from the proceedings along with the full texts. If the idea of digital publishing is to be taken seriously, there should be hardly any delay in dissemination and, ideally, conference attendees can seek timely advance information. In the end it is just a matter of handling and processing already created digital materials. Further options for harvesting may be explored as well, and any proposal is welcome.
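To illustrate what harvesting the repository involves in practice, the sketch below issues a standard OAI-PMH ListRecords request using only the Python standard library. The base URL of the ELPUB OAI interface is assumed here for the sake of the example and may differ from the actual endpoint.

from urllib.request import urlopen
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical base URL of the repository's OAI-PMH interface (assumption).
OAI_BASE = "http://elpub.scix.net/cgi-bin/oai"

# Standard OAI-PMH request: list all records in unqualified Dublin Core.
query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
with urlopen(OAI_BASE + "?" + query) as response:
    tree = ET.parse(response)

# Print the identifier and title of every harvested record.
ns = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
for record in tree.iterfind(".//oai:record", ns):
    ident = record.findtext(".//oai:identifier", default="", namespaces=ns)
    title = record.findtext(".//dc:title", default="(no title)", namespaces=ns)
    print(ident, "-", title)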
Figure 2: Screenshot OpenDOAR – ELPUB listing
Figure 3: Screenshot ROAR.eprints – Overview of the harvested ELPUB Digital Library
Figure 4: Screenshot ROAR.eprints – Record creation statistics

While the DBLP server initially focused on DataBase systems and Logic Programming, the scope of interest was gradually widened, and the acronym can now be interpreted as Digital Bibliography & Library Project. The system provides bibliographic information on major computer science journals and proceedings. As of April 2008, more than a million articles and several thousand links to home pages of computer scientists were indexed. The co-author index (see the insert in Figure 5) is an interesting feature. In the meantime, ELPUB papers have also been recorded in the DBLP database.

3. Persistent Identifiers
The tedious "404 not found" message is a common experience for Web users. The problem with Uniform Resource Locators (URLs) is that they contain no information about their lasting reliability. Everything depends on the intentions of URL creators, and their level of commitment is hardly ever clear to the user. Hence, users cannot trust URLs. The concept of URLs was developed for the first Web software in 1991. URLs are one kind of Uniform Resource Identifiers (URIs) and only serve for locating resources.

From a preservation viewpoint, the weakness of URLs is the fact that they depend upon Domain Name System (DNS) information. This information is obtained locally, and passes through a national DNS registrar up to the DNS root servers. It has to be noted that domain names are not stable and may change at any time. Furthermore, URLs depend on the fact that the local internet server, which is identified through the DNS, actually stores the object according to the information in the URL string.

In 1994, the idea was born to create a system for naming resources instead of addressing them, and the syntax of Uniform Resource Names (URNs) was formally described in 1997. The purpose was to provide a unique, persistent identifier in order to access the resource itself. The URN syntax uses a Namespace Identifier (NID) from a defined list maintained by the Internet Assigned Numbers Authority (IANA). URNs are designed to fit the definition of URIs.
Figure 5: Screenshot DBLP – Recorded ELPUB conferences and author's details (insert)

A URN starts with a "urn" character string, followed by the NID and, separated by a single colon, the last part of the URN, which consists of a Namespace Specific String (NSS):

urn:nid:nss

Well-known standard concepts like ISBN, ISSN and MPEG can be found among the namespaces on the IANA list. NBN, which stands for National Bibliography Numbers, is another namespace which is exclusively assigned to National Libraries. The Library of Congress acts as the global registry for URN:NBN namespaces. The general syntax of the NBN-URN is as follows [5]:

urn:nbn:<ISO 3166 country code>-<assigned NBN string>
urn:nbn:<ISO 3166 country code>:<sub-namespace code>-<assigned NBN string>

NBN is a generic name referring to a group of identifier systems utilised by the national libraries, and only by them, for identifying deposited publications which lack an identifier. Each National Library decides independently to whom they will issue sub-namespaces [4]. It is quite common that Swedish university libraries manage institutional repositories that include digital versions of doctoral and licentiate theses. In some cases these theses do not have any paper equivalent at all. It is reasonable to assume that in the future a growing number of theses will solely be made available in a digital format. Hence the wish - both from the Swedish National Library and from the data providers at the universities - that links to object sources remain persistent.

The Swedish National Library (i.e. the Royal Library in Stockholm) is responsible for preserving all material printed in Sweden. However, collecting digital resources (e-books etc.) requires an infrastructure different from the one used for traditional print materials. Adequate technical solutions have been developed at the National Library to meet these challenges and needs. The aim is to further develop technical solutions and to find practical solutions to the problems involved in administrating legal deposits for digital material. One important task is to find out how to acquire metadata together with digital deposits. Swedish legal deposit legislation does not yet cover online publications, although the government has announced that a new Act which includes them will come in 2010. One step in the preparation for legal deposits of digital publications is the implementation of URN-NBN. Since 2004, Swedish publishers, libraries etc. have been able to use the urn-nbn scheme and the national library resolver service in order to secure persistent links to digital resources.

For some time now there have been considerations about implementing some sort of persistent identifiers, both for the ELPUB repository and for the institutional repository of Blekinge Institute of Technology (BTH). The main reason for finding a preservation scheme was the ambition to keep the unique repository of conference papers, in ELPUB's case, and theses, in BTH's case, available over a long period of time. The two repositories also had another feature in common: a shoestring budget and a lack of administration resources. Hence, the major prerequisites for a persistent identifier scheme were the following: persistence at no or at least low cost, involving as little technical and administrative effort as possible. With these requirements in mind, the horizon was scanned for plausible persistent identification schemas.

The Handle System was developed by the Corporation for National Research Initiatives (CNRI) within the framework of the Computer Science Technical Reports (CSTR) project. A first implementation of the system was made available in 1994. Participation in the Handle System requires registration along with the establishment of a Naming Authority. CNRI makes free software available, but requires the signing of a licence when registering a new Naming Authority. A small registration fee is charged. In 1997, the Digital Object Identifier (DOI) initiative, based on the Handle concept, was launched at the Frankfurt Book Fair. Today, many scientific journal publishers are using the DOI. The International DOI Foundation (IDF) was founded in 1998 [11]. Although DOI was created for the publishing industry, it is used just as much outside that sector, e.g. by electronic commerce applications. The usage of DOI is not restricted to digital objects. One of the aims is to provide identification mechanisms for commercial transactions concerning rights management [6].
At the early stages of the DOI system, a central DOI depository directory was used, with the IDF as the only administrator or registration agency (RA). Today there are several RAs that will assign prefixes to new registrants in accordance with IDF standards. The RAs are free to set their own rates for registration costs. The IDF has fixed rates for different levels of participation [7]. One of these RAs is CrossRef, the reference-linking network of the Publishers International Linking Association (PILA). They have a preference for what is called "original work" and are rather cautious about how to define that. They often consider Open Access materials as insufficiently defined, and CrossRef only accepts material which is not published from institutional repositories. Furthermore, DOIs are not granted to pre-print or post-print materials [8].

Another persistent identifier schema based on the Handle system is incorporated in the DSpace software package. For persistent identification and linking, DSpace uses HDL Identifier and Resolution Services [12]. The source code for DSpace is distributed with an API for CNRI's Handle Server. Users of DSpace intending to use the handle server announce this to CNRI and are then supplied with a handle prefix against a yearly cost of approx. 50 USD.

The Persistent URL (PURL) is based on the URN specification and is developed by the Online Computer Library Center (OCLC) [9]. As one would assume, a PURL is a URL with some degree of modification. A PURL turns into a URL for the user after being associated with the correct URL at an intermediate resolution service using HTTP redirect. The PURL server software is developed by OCLC and can be freely downloaded. It contains group management for maintainers, tools to administrate PURLs, etc.

After a brief period of reviewing the system solutions described above, the authors decided to proceed with URN-NBN. The decision was mainly based on the lack of funding and an aversion to running administrative software. DOI involves negotiations with commercial entrepreneurs suspicious of institutional repositories, and costs that are either too high initially or too unpredictable for the future for small-scale services. While alternatives such as PURLs are more or less free of cost, they demand installing new software and administrating the whole process of assigning identifiers to objects, which is not desirable in an organisation slimmed to the bone or relying on non-paid work. The URN-NBN scheme could be used by ELPUB.scix.net and the BTH repository at no cost and, on top of this, with only a minimal amount of administration.

In order to implement URN-NBN, contact was established with the National Library of Sweden, which delivered instructions on how to proceed. This is work in progress, and not all printed material and instructions on how to implement URN-NBN for customers are available at the library website yet. However, the staff of the Digital Library Department was very supportive whenever questions concerning implementation arose. As for DOI, PURL etc., the general concept of URN-NBN is to assign a permanent identity to a digital resource (an article or thesis, for example). A "resolver" translates the permanent identifier to the actual URL where the resource can be found at the moment. In order for the "resolver" to function it needs to be fed with the proper mapping instructions; in other words, "which information/data matches which URL?" This was arranged by creating a mapping file, the content of which is harvested on a regular basis and fed to the "resolver".

Since the institutional repository at BTH [10] is based on the Lotus Domino platform, a Lotus application was constructed that generates an XML file which maps the URN-NBN identifiers to the URLs. When generating the XML file, the interface looks up all published theses and journal articles in the BTH repository. There are two addresses where the BTH URN-NBN mapping can be harvested. One "cached" version exists, which is updated once a week:

http://www.bth.se/fou/forskinfo.nsf/urn-nbn-mapping/records/$file/records.xml

There is also an option to generate the XML file on the fly. The advantage of this option is, of course, that up-to-date content is received without delay. The downside is that the process is time-consuming and constitutes a burden on the server capacity:

http://saxofon.bth.se/fou/urn-nbn.nsf/on-the-fly

The BTH repository application was adapted in order to show URN-NBN identifiers in clear on the web for future reference purposes.
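The following minimal sketch illustrates the principle of such a mapping file: each repository record is paired with its current URL under its URN. The element names and the overall document structure are assumptions made for illustration only, not the format actually expected by the National Library's harvester.

import xml.etree.ElementTree as ET

# Illustrative records as they might be read from the repository database;
# the identifiers and URLs are modelled on the examples quoted in the text.
records = [
    ("urn:nbn:se:bth-e2b0aad6ae4c30f8c12573d8003b5b62",
     "http://www.bth.se/fou/forskinfo.nsf/0/e2b0aad6ae4c30f8c12573d8003b5b62"),
    ("urn:nbn:se:elpub-136_elpub2007",
     "http://elpub.scix.net/cgi-bin/works/Show?_id=136_elpub2007"),
]

# Build a simple URN-to-URL mapping document; element names are assumptions.
root = ET.Element("records")
for urn, url in records:
    rec = ET.SubElement(root, "record")
    ET.SubElement(rec, "urn").text = urn
    ET.SubElement(rec, "url").text = url

ET.ElementTree(root).write("records.xml", encoding="utf-8", xml_declaration=True)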
Similarly, identifiers can be retrieved at ELPUB.scix.net in the field <urn:nbn>. As indicated previously, every record in a URN-NBN scheme must have a unique identifier. In the BTH repository every record is assigned a unique serial number that can never be changed. These serial numbers, automatically generated by Lotus, were used for the NBN string after the sub-namespace. An example of a BTH URN-NBN using the "urn:nbn:ISO country code:sub-namespace-nbn string" version of the syntax is given below:

urn:nbn:se:bth-da5e6bd8f0ee409ac1257424004d33c4

It goes without saying that if, for whatever reason, the repository records were to be moved to a new database platform, it would be necessary to migrate the Lotus serial numbers to the proper records in the new database. It took roughly one day's work to implement the above schema. So far the implementation is running automatically without causing any noteworthy problems.

For resolving BTH or ELPUB URN-NBNs it is possible to use the Swedish Royal Library resolution service. Upon entering the persistent identifier generated at the repository level, one will receive the correct URL for the object. The service is available at http://urn.kb.se/start. Example:

urn:nbn:se:bth-e2b0aad6ae4c30f8c12573d8003b5b62
returns: http://www.bth.se/fou/forskinfo.nsf/0/e2b0aad6ae4c30f8c12573d8003b5b62

It is also possible to use the demonstration resolver of the Deutsche Nationalbibliothek. This resolver is configured to redirect to external resolution services by analyzing the hierarchical parts of a URN (e.g. "urn:nbn:de" or "urn:nbn:se"): http://nbn-resolving.de/ResolverDemo-eng.php. To give an example: an entry with the ID <urn:nbn:se:elpub-136_elpub2007> will be translated to http://elpub.scix.net/cgi-bin/works/Show?_id=136_elpub2007
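Conceptually, the resolution step is nothing more than a lookup in the harvested mapping followed by an HTTP redirect. The sketch below illustrates this idea with the two example identifiers quoted above; it is not the actual resolver code of the Royal Library.

# Conceptual sketch of URN-NBN resolution: the resolver keeps a table harvested
# from the data providers' mapping files and redirects each URN to its current URL.
mapping = {
    "urn:nbn:se:bth-e2b0aad6ae4c30f8c12573d8003b5b62":
        "http://www.bth.se/fou/forskinfo.nsf/0/e2b0aad6ae4c30f8c12573d8003b5b62",
    "urn:nbn:se:elpub-136_elpub2007":
        "http://elpub.scix.net/cgi-bin/works/Show?_id=136_elpub2007",
}

def resolve(urn):
    # The hierarchical parts ("se", then a sub-namespace such as "bth" or
    # "elpub") tell a national resolver which mapping to consult.
    scheme, nid, country, nbn_string = urn.split(":", 3)
    if (scheme, nid) != ("urn", "nbn"):
        raise ValueError("not a URN-NBN identifier: " + urn)
    return mapping[urn]

print(resolve("urn:nbn:se:elpub-136_elpub2007"))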
6. Citation Index
An index with references stemming from the conference papers was recently added to the ELPUB Digital Library. While the idea as such is not new, the addition can be regarded as a way of making the context of digital publishing more tangible. The procedure itself is fairly straightforward. First, the metadata has to be extracted from the PDF documents, and subsequently it is split into four parts (authors, title, source, year). This still requires manual work, and a higher degree of automated extraction would be desirable. Finally, the metadata is added to the existing records and may be viewed in both directions: the references at the end of a record lead to the citation index, and from the citation database a direct link is offered to the recorded paper in which the reference occurs. Over 1,600 citations stemming from nearly 300 recorded entries (period: 2003-2007 conference papers) are available at the time of writing. Entries before 2003 are also likely to be processed in the near future. Not all the proceedings are, however, available in a machine-readable format, and their handling will depend on the amount of volunteer resources available. The citation index can be found at http://ELPUB.scix.net/cgi-bin/refs. This sample would allow for an analysis of the networks between the references, including the highly interesting phenomenon of self-citation.
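The splitting step described above can be approximated by a very simple heuristic, as in the sketch below. It merely illustrates the four-part division (authors, title, source, year) on an invented reference string and would not remove the need for the manual checking mentioned in the text.

import re

# Naive illustration of splitting an extracted reference into the four parts
# used by the ELPUB citation index: authors, title, source, year.
def split_reference(ref):
    year_match = re.search(r"\b(19|20)\d{2}\b", ref)
    year = year_match.group(0) if year_match else ""
    # Assume the pattern "Authors. Title. Source, year." for this toy example.
    parts = [p.strip() for p in ref.split(". ") if p.strip()]
    authors = parts[0] if parts else ""
    title = parts[1] if len(parts) > 1 else ""
    source = parts[2] if len(parts) > 2 else ""
    return {"authors": authors, "title": title, "source": source, "year": year}

example = ("Costa S.M.S.; Gottschalg-Duque C. "
           "Towards an Ontology of ELPUB/SciX: A Proposal. "
           "Proceedings of ELPUB 2007, Vienna, 2007.")
print(split_reference(example))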
7. Future Plans and Outlook
With approx. 500 papers, a critical mass is available. Collecting and archiving is to be regarded as a pragmatic issue and would, for example, serve as a starting point for ontology-based research. The efforts required are not limited to a certain group of "core activists". Any member of the ELPUB community is invited to use the metadata for harvesting purposes.
Figure 6: Recording of references in an entry (elpub.scix.net)
Figure 7: Increase of the number of registered users with a trend line (31.03.2007 to 31.03.2008; y-axis: number of users) – clear decline after the shift towards open access

The shoestring-budget idea as such has been mentioned already and constitutes a core principle. It aims at securing the ELPUB Digital Library in the longer term without making it dependent on a heavy input of human resources. The principle of self-organization should also be highlighted. In this respect, the procedure related to the conference organization itself and the overlap of programme and general chair (year n = programme chair; year n+1 = general chair) has worked out well.

Rank  Title of the paper
1     Openness in higher education: Open source, open standards, open access
2     DCMI-Tools: Ontologies for Digital Application Description
3     The Deconstructed Journal revisited – a review of development
4     Music Publishing
5     The Newer, the Worse: the Status of Farsi Word Processing Softwares in Iran
6     Scientific publication life-cycle model (SPLC)
7     Scholarly Communication in Transition: Evidence for the Rise of a Two-Tier System
8     Convergence and divergence in media: different perspectives
9     Peer-to-Peer Networks as a Distribution and Publishing Model
10    From print to web to e-paper: the challenge of designing the e-newspaper

Table 1: Top 10 most downloaded full text papers from the ELPUB digital library
Figure 8: Constant increase of daily visitors of ELPUB.scix.net

The decision to support Open Access had an impact on the number of users registering with ELPUB.scix.net. Figure 7 clearly shows that, since the availability of this type of access, which requires no user account to access full papers, the number of registered users has remained more or less constant. What is more, since this change users no longer log into the repository, and it is almost impossible to identify groups of core users. It is rather difficult to make a clear distinction between active and inactive users. On the other hand, the number of visitors coming from the different search engines is increasing, and the number of visits to the collection of full text files is increasing. Table 1 shows the "Top Ten" papers that were accessed or downloaded from the ELPUB Digital Library. The paper in first place was downloaded more than 3000 times. There is no doubt that the step towards "Open Access", which was fully realised after the ELPUB 2007 conference, leads to enhanced user access. It is interesting to see that the number of visitors coming from general search engines (such as Google, AOL, Yahoo, Windows Live etc.) and special-purpose search engines (such as Google Scholar, Wikipedia, Open Archives, etc.) is growing, which proves that the shift toward open access was worth the effort. Finally, it has to be mentioned that some of the previous ELPUB proceedings are already indexed in the ISI Web of Knowledge, and those in charge will consider further steps towards having all the conference articles indexed there and in other important indexes such as Scopus and Google Scholar (making records visible in, for example, Harzing's Publish or Perish citation analysis software program).
8. Notes and References
[1] MARTENS, B.; LINDE, P.; TURK, Z. A Digital Library for ELPUB Proceedings: The Use of a Web-Based Prototype. In ELPUB2003. From information to knowledge: Proceedings of the 7th ICCC/IFIP International Conference on Electronic Publishing held at the Universidade do Minho, Portugal, 25-28 June 2003 / Edited by: Sely Maria de Souza Costa, João Álvaro Carvalho, Ana Alice Baptista, Ana Cristina Santos Moreira, pp. 363-371. Available from the Internet: urn:nbn:se:elpub-0344.
[2] COSTA, Sely M.S.; GOTTSCHALG-DUQUE, Claudio. Towards an Ontology of ELPUB/SciX: A Proposal. In ELPUB2007. Openness in Digital Publishing: Awareness, Discovery and Access - Proceedings of the 11th International Conference on Electronic Publishing held in Vienna, Austria, 13-15 June 2007 / Edited by: Leslie Chan and Bob Martens, pp. 249-256. Available from the Internet: urn:nbn:se:elpub-153_elpub2007.
[3] COSTA, Sely M.S.; BRÄSCHER, Marisa; MADEIRA, Fabyola; SCHIESSL, Marcelo. Ten Years of ELPUB: An Analysis of its Major Trends. In ELPUB2006. Digital Spectrum: Integrating Technology and Culture - Proceedings of the 10th International Conference on Electronic Publishing held in Bansko, Bulgaria, 14-16 June 2006 / Edited by: Bob Martens, Milena Dobreva, pp. 395-399. Available from the Internet: urn:nbn:se:elpub-1706_elpub2006.
[4] HAKALA, J. Using National Bibliography Numbers as Uniform Resource Names. Memo 2001. Available from the Internet: http://www.ietf.org/rfc/rfc3188.txt.
[5] HILSE, H.-W.; KOTHE, J. Implementing Persistent Identifiers: Overview of concepts, guidelines and recommendations (London/Amsterdam 2006, CERL/ECPA). Available from the Internet: www.knaw.nl/ecpa. urn:nbn:de:gbv:7-isbn-90-6984-508-3-8.
[6] WANG, Jue. Digital Object Identifiers and Their Use in Libraries. Serials Review 2007, vol. 33:3, pp. 161-164. Available from the Internet: doi:10.1016/j.serrev.2007.05.006.
[7] The DOI® Handbook, Version 4.4.1, Chapter 7, Section 7.13. The International DOI Foundation 2007. Available from the Internet: http://www.doi.org/membership/brochure.html.
[8] QUINT, Barbara. Linking up bibliographies: DOI Harvesting Tool Launched by CrossRef. (Electronic) newsbreaks.infotoday.com 11-06-2006. Available from the Internet: http://newsbreaks.infotoday.com/nbReader.asp?ArticleId=18581#
[9] http://www.purl.org
[10] http://www.bth.se/fou/
[11] http://www.doi.org
[12] http://www.handle.net/
A Semantic Linking Framework to Provide Critical Value-added Services for E-journals on Classics

Matteo Romanello 1

1 Ca’ Foscari University of Venice, V.lo Limbraga 16, 31100 Treviso (TV), Italy, e-mail: mromanel@dsi.unive.it
Abstract

In the field of Classical Studies, the use of e-journals as an effective means for scholarly research still needs to be bootstrapped. This paper proposes a possible implementation of two value-added services to be provided by e-journals in order to reach this goal: reference linking and reference indexing. Both services would contribute to making more machine-evident the hidden but highly important bond which links together primary and secondary sources in this field of studies. On a technical level, this paper proposes the use of Microformats and of the Canonical Texts Service (CTS) protocol to build the semantic and non-proprietary linking framework necessary to provide scholars reading e-journals with some advanced and critical features.

Keywords: E-journals; Classics; Reference linking; Microformats; Value-added services; Semantic Web; CTS URNs; Canonical Texts Services protocol

1. Introduction
Without a doubt, e-journals in the field of Classical Studies are not yet as popular as they are among scholars in other scientific disciplines. It follows that the use of e-journals as an effective means for research purposes needs to be bootstrapped. Greater use of these electronic resources would justify greater investments by publishers, in terms of both money and human resources, in the production of electronic versions of printed journals or even of digital-born journals. In order to reach this aim there is a need to identify what services and features really need to be provided as value-added services to scholars reading on-line publications. At the moment, in particular with respect to the Italian situation, journal publishers do not yet provide a sufficient amount of value-added services along with the electronic journals. The shift from content holding to service providing was suggested by C. Armbruster [17] as a profitable model to reach an effective economic sustainability of open access to research contents. Therefore, it is hoped that publishers' strategies would focus on providing new and useful services for readers, allowing both an increase in the use of e-journals among scholars and open access to a greater amount of content.

In the field of ancient language studies, and in particular Classical Literature and Philology, the distinction between ancient literary texts (primary sources) and modern monographs or journal articles written about them is of primary importance. Scholars' research work can be considered as lying for the most part in creating, retrieving or modifying links between these two different kinds of resources. Indeed, when they compare texts to produce new interpretations and when they provide secondary sources in support of their theses, they constantly draw new ties between primary sources, as well as between primary and secondary sources.

The first time this complex net of citations and references between primary and secondary sources tried to emerge in a single medium was with the Byzantine practice of writing scholia. The scholium made its first appearance as a new practice as a result of an emergent writing support: the codex. The codex left ancient erudite scholars more blank space around the main writing space than the papyrus scroll, in order to write comments and remarks a latere of the text itself (all around the text itself). This new practice led to the need for a system of graphical signs to link together a given text passage and the erudite comments referring to it [24]. Thus the scholia may probably also be considered an embryonic kind of hypertext. At the moment, the Web gives us the technical means to get closer again, in a truly digital context, to primary and secondary sources by creating a sort of big scholium on a Web scale. And this will be one of the most interesting challenges for cyberscholarship-related research in the near future.

Therefore, the aim of this paper is to discuss both the characteristic features and the actual design of a Microformat vocabulary set to encode references to ancient literary texts within web documents, in order to achieve a semantic linking system between primary and secondary sources. From now on such references will be referred to as "canonical references" to distinguish them from the citations of modern texts: they are references to discrete corpora of ancient texts that are written by scholars in a canonical citation format which often implies abridgements. The paper first describes the technical requirements of such a semantic linking system, identified by examining the characteristics of canonical references. Then the paper shows how Microformats can suitably be employed to encode such canonical text references, thus realizing a linking system for primary and secondary sources which has to be semantic, open-ended and cross-language. The implementation details of the proposed Microformat will be presented and discussed, along with some examples of its use. Some examples of value-added services that could be provided by journal publishers will be built upon the outlined linking system. In particular, the paper focuses on navigation services, according to the distinction made by Armbruster between certification and navigation services, since services of this kind seem to be the most desirable for scholars of classics. Specifically, these are reference linking (where by "reference linking" I mean the capability for a user to move directly from a text reference to its source) and the reference indexing of articles published by scholarly journals.
2. Methodology: loosely vs tightly coupled approach to realize a linking system between primary and secondary sources
Before presenting the methodological approach adopted in this paper to solve the big issue of bringing primary sources close again to secondary ones by using a semantic linking system, it is necessary to describe the current scenario and then how the desired scenario would look.
2.1 Scenarios
a) Jane is a scholar of classics and she is doing some research on Homer's Iliad, and in particular on the fourth book. During some preparatory bibliographic research, she experiences difficulties related to information retrieval in the field of classical studies. She would like to be able to find the journal articles and the monographs, written in any language, in which the work of interest to her is discussed. Instead, the common search engines only allow Jane to submit language-specific queries. Thus, in order to retrieve all the publications related to the fourth book of the Iliad, she would have to submit one query for each of the main world languages (at least German, English, French, Italian and Spanish). Jane is disappointed about the inefficiency of searching the web through the normal search engines and gives up her on-line search.

b) John is a colleague of Jane's and has just subscribed to an e-journal on Classical Studies which, in addition to open access to articles, also offers some experimental navigation features such as reference linking for ancient literary texts. When reading on-line articles, John can read directly on the screen the original text of each passage of an ancient work cited in the body of the article. Moreover, he is able to retrieve all the available editions or translations for a given text passage, comparing critical editions or finding related resources (illustrations from ancient pottery, reviews of articles or books concerned with that specific text passage, etc.). Furthermore, readers can browse all the journal issues using a reference index (namely an index locorum) which covers all the articles published by the journal itself. By using this tool John is able to find all the articles related to, for example, Aeschylus, the author he is focusing on, and in particular the exact lines of Agamemnon he is specifically investigating.
2.2 State of the art and proposal of an implementation solution
Currently, in the field of Classical Studies the secondary sources published electronically do not go beyond an incunabular stage, according to the definition of digital incunabula given by N. Smith [27]. A few on-line publications are provided as HTML, whereas most journals just keep publishing a binary file reproducing exactly the printed issue (generally in PDF format). Nonetheless, existing projects aimed at building electronic corpora of ancient languages most of the time harness an internal linking system that, on the one hand, permits reaching a worthy degree of hypertextuality but, on the other hand, just produces a fairly closed system of hard-linked resources. Within electronic secondary sources the references to canonical authors and texts (referred to above as "primary sources") are mostly not encoded, and only in a few cases are marked up as simple links to stable on-line resources such as digital libraries and electronic corpora. The issues resulting from such a situation have been briefly summarized in the first scenario described. Abbreviations and implicit statements are not expressed in a machine-readable format and thus constantly require human disambiguation capabilities. Such references are treated by robots and information retrieval tools merely as strings, without the slightest semantic markup that, if present, would permit metadata-aware information processing. Indeed, within secondary sources the references to ancient authors' texts are expressed most of the time by abridgements, such as Hom. Od. IX 1 or 145. But these abridged notations are unmanageable when using current text retrieval systems because, for instance, a query with a single Greek letter, which in the context of a Homeric citation is the book reference, leads to poor precision (it can mean something else, e.g. divisions in paragraphs) and recall (there are other ways to indicate the same book, e.g. by roman numbers).

<!-- Plut. Sol. 19.1 Canonical Text Reference from
http://www.stoa.org/projects/demos/article_democracy_development -->
<a class="citation" target="_blank" href="http://www.perseus.tufts.edu/cgi-bin/ptext?lookup=Plut.+Sol.+19.1">Plut. <em>Sol.</em> 19.1</a>

Figure 1: Example of a hard-linked canonical reference to a passage of Plutarch's Solon, encoded following a tightly coupled approach.

To sum up, at the moment the references to primary sources contained inside electronic secondary sources are rendered as hard-links through a tightly coupled linking system, or are not even encoded in a machine-readable format and thus still just replicate the references contained in printed texts. Notwithstanding this, there is not yet a shared standard or a well-established best practice to determine how to encode references to texts stored in any given corpus within an (X)HTML document. On top of that, digital corpora of primary sources do not use a common protocol which would guarantee interoperability among different collections of texts and would allow the creation of links between primary and secondary sources. Indeed, what has to be avoided as much as possible is the use of several solutions peculiar to given projects and un-interoperable with each other. For that reason it is necessary to find both a common protocol to access collections of texts and a shared format to encode canonical references within web online resources.
Therefore, the desired linking system for primary and secondary sources is required to be:
• open-ended: it should be possible to link and retrieve other resources related to a given author or work as soon as they appear on the Web. Each link would be resolved into an open-ended, and therefore potentially infinite, number of on-line resources;

• interoperable: it should guarantee the reuse of data and interoperability among web applications that use different communication protocols and interfaces;

• semantic and language-neutral: such a linking system should make it possible to identify each author, work and edition of a work with a unique identifier rather than with a language-dependent name. If an author is univocally identified, it is possible to map the name of the same author written in different languages to that unique identifier; but it is impossible to do the reverse.
In order to implement such a linking system, I adopted here a loosely coupled approach. The canonical references and the primary sources referred to have been assumed to be two separate sets that should be linked together using a loosely rather than tightly coupled approach. This kind of coupling is realized by distinguishing three different layers from each other: a) the layer of metadata references contained in the web pages; b) the layer of applications or agents capable of understanding such references in order to provide new navigation and information retrieval services; c) the layer of service providers, such as collections and digital libraries containing the texts referred to by canonical text references. Furthermore, the implementation of a loosely coupled linking system such as this is made of two main components: a metadata embedding technique to mark up, directly on the page, the canonical references in a machine-readable format, and a web protocol in order to create, from the metadata embedded in the web page, dynamic links to other resources made available by different service providers.

The adopted approach was inspired by two interesting experiences that achieved a high adoption rate on the web. The first one is the OpenURL ContextObject in SPAN (COinS) [12], which allows embedding in HTML the metadata necessary to construct a link compliant with the OpenURL protocol. This solution for marking up citations and references to any publication allows some context-sensitive reference linking services to be provided. Indeed, a user browsing a web page with a metadata-aware agent is able to switch directly from a bibliographic reference, for instance, to the corresponding record in a library OPAC or to an electronic version if available. Secondly, in the area of chemistry I found an interesting example of a loosely coupled linking system similar to the one being proposed in this paper [29,30]. In electronic publications in this field a shared and non-proprietary set of identifiers to refer to chemical substances is currently being used: the IUPAC International Chemical Identifier (InChI) [20]. The availability of this set of identifiers permits one to build a system which links together pieces of information related to the same chemical element or substance simply by embedding in HTML the corresponding unique identifier. The solution for metadata embedding proposed by Egon Willighagen is to use RDFa and a client-side script in order to dynamically link different on-line resources concerned with the same chemical structure, such as blog posts or a PubMed journal article.

Now, in order to implement a similar solution in the field of Classical Studies, I suggest the use of a) the Catalog of Greek Literary Works (and other similar catalogs), a CTS URN scheme that leverages the existing Thesaurus Linguae Graecae Canon of Greek authors [15], aimed at univocally identifying canonical authors, works and also different exemplars of the same work (e.g. critical editions and translations); b) an ad hoc Microformat to embed canonical metadata references in HTML elements; c) open protocols to provide some value-added services that use the semantic information embedded in microformatted references, such as the CTS protocol to retrieve texts or parts of them from a distributed digital library of classical texts. This solution seems to satisfy the above identified requirements of the desired linking system. Indeed, the open-endedness is guaranteed because the links between resources are created dynamically by a metadata-aware agent on the client side, rather than on the server side as currently happens most of the time. Furthermore, the use of a unique scheme of identifiers for authors and their works, instead of language-dependent named entities, will guarantee the semantic character of the suggested linking system. In the next sections both of the technologies involved, Microformats and the CTS protocol, will be briefly presented.
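To make the contrast with the hard-linked markup of Figure 1 concrete, the following sketch shows how a canonical reference could be emitted with an embedded identifier instead of a fixed target URL, leaving it to a metadata-aware agent to decide later which service providers to link to. The class name, the attribute used to carry the identifier and the identifier itself are invented for this illustration and do not anticipate the vocabulary defined in Section 3.

from html import escape

# Illustrative sketch of a loosely coupled canonical reference: the markup only
# carries a machine-readable identifier; a metadata-aware agent decides later
# which service providers to link to. The class name, the use of the title
# attribute and the identifier scheme are all hypothetical.
def canonical_reference(display_text, work_identifier, passage):
    return ('<span class="canonical-ref" title="{0}:{1}">{2}</span>'
            .format(escape(work_identifier, quote=True),
                    escape(passage, quote=True),
                    escape(display_text)))

# Invented identifier for Plutarch's Solon, chapter 19, section 1.
print(canonical_reference("Plut. Sol. 19.1", "greekLit:plutarch.solon", "19.1"))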
2.3 Embedded metadata layer: Microformats
Microformats are one of the most cutting-edge technologies from Web 2.0, aimed at the markup of structured data using HTML tags. They are currently developed by the developer community through a specific wiki, where it is possible to find principles and guidelines to be followed during the design process of new Microformats. Nevertheless, the development of new Microformats seems to be rather discouraged by the community itself. Types of data that are currently being encoded using Microformats include geographical data (geo), web feeds (hFeed), personal profiles (hCard), relationships (XFN), reviews (hReview), events (hCalendar), and curricula vitae (hResume). Microformats' success is due particularly to their use in blogs and social networks, and has demonstrated how semantic data can be easily embedded inside compounds of Plain Old Semantic HTML (POSH) without mixing content with presentational features, which are instead managed through Cascading Style Sheets (CSS).
2.4 Unique identification and communication layer: the CTS Protocol
The main features of projects that started in the past by building electronic corpora are so strictly dependent on the available technologies, that they can be thought in terms of diachronic evolution. Indeed, as pointed out by Neel Smith, “the TLG founded in 1970s is a response to the potential of mass storage and rapid retrieval in digital media” and “the Perseus project founded in 1980s is a response to richer media and higher-level data structures”. Now, a distributed digital library built upon various web-oriented protocols and standards would probably be the response to the technological convergence of XML-related technologies, an easier web service communication over interfaces designed in conformity with a Representational State Transfer (REST) architectural style [18]`†and an infrastructure taking some advantages from the distributed architecture of the web itself. The Canonical Texts Service protocol, developed among by others by Neel Smith and Chris Blackwell at Harvard’s Centre for Hellenic Studies, allows the joining together different digital repositories of TEIencoded texts, providing a common protocol to make these collections interact with each other as a single distributed library. The protocol lies in the conceptual model described by the Functional Requirementes for Bibliographic Records (FRBR) distinguishing between a work and its different exemplars, while introducing some slight differences, e.g. defining the term “workgroup” instead of “author”. One of the important features of the protocol is that it allows one to reach a higher granularity when accessing documents hierarchically and supports the use of a citation scheme referring to each level of the entire document hierarchical structure. In addition, it makes it possible to distinguish different exemplars (namely instances) of the same text. Digital libraries as well as libraries of printed texts need catalogs and authority lists to make the information they contain easily retrievable. To this end the CTS protocol defines a scheme of Uniform Resource Names (URN), called CTS-URNs, to identify univocally authors and works contained in corpora of canonical texts that could be accessed as CTS-repositories. Such a semantic authority list could be defined for each corpus of XML texts. Therefore the protocol is in turn built upon another ancillary protocol, the Registry Services Protocol which provides “an automated interface to authority lists”. The distributed and wide scope nature of Registry Service protocol allows the creation of several lists of identifiers that are organized in registries. These unique identifiers for authors, works Proceedings ELPUB 2008 Conference on Electronic Publishing - Toronto, Canada - June 2008
and other entities of interest (for instance geographical or mythological names) can be used as additional and significant entry points to texts stored in CTS-compliant repositories.
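By way of illustration only, the sketch below shows how a CTS URN of the kind discussed here might be turned into a repository request. The URN values follow examples given later in this paper, while the endpoint URL, the passage suffix and the query parameter names are assumptions made for this sketch rather than a statement of the actual protocol syntax.

// Illustrative sketch only: turning a CTS URN into a repository request.
// The URN values follow the examples used in this paper; the endpoint URL,
// passage suffix and query parameter names are hypothetical.
var workUrn = 'urn:cts:greekLit:tlg0012.tlg001';   // Homer (workgroup), Iliad (work)
var passageUrn = workUrn + ':1.1';                 // assumed citation suffix: book 1, line 1

function ctsRequest(repositoryUrl, urn) {
  // A CTS-compliant repository would answer such a request with a TEI/XML fragment.
  return repositoryUrl + '?request=GetPassage&urn=' + encodeURIComponent(urn);
}

console.log(ctsRequest('http://repository.example.org/cts', passageUrn));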
3. Results: a Microformat vocabulary set for canonical text references
This section presents one of the main contributions of this paper, namely the definition of a Microformat vocabulary set to encode canonical text references semantically in HTML elements. Firstly, some sample references are examined in order to identify the requirements a format must meet in order to encode them. Secondly, the actual design of such a Microformat is discussed. Finally, some examples are given of value-added services that could be built in the near future to take advantage of such microformatted content.
3.1 Properties of Canonical Text References
First of all it is necessary to define the properties of references to canonical texts, in order to devise a technical solution that really fits scholars' needs and usual practices. Since any reference may or may not follow a given citation scheme and may leave some information implicit, it is difficult to perform a fully automated text analysis to identify references to corpus texts and extract semantic information from them. Examining a case history (Fig. 2), it is possible to distinguish at least three types of reference on the basis of their appearance: abridged (3, 5-9), unabridged (1-2, 4, 10-11) and implicit (4). Three kinds can be distinguished on the basis of their meaning: references to a canonical author, to a canonical work and to a precise text passage. In addition, since references are language-dependent, the number of possible equivalent references to authors and works grows considerably (Homer, Omero, Homero, Homerus and Homère are all valid names by which the same author might be referred to). Furthermore, a common practice among scholars in classics is to refer to a specific edition by abbreviating the editor's name (7, 8), or to the current reference edition of a text simply by omitting the editor's name. Indeed, the editor's name is usually left out when a work is preserved in its entirety and indicated instead when the work is known only through fragments. This case makes references even more context-dependent, since when another edition becomes the reference edition for scholars, the meaning of the reference itself changes. Finally, subsequent anaphoric references to different loci of the same text are usually written by specifying the text referred to only the first time and then indicating just the line or chapter number (5, 6).
(1) [...] the Politics of Aristotle
(2) Homer and the Papyri
(3) Ar. 'Arch.' 803
(4) Protean Forms and Disguise in 'Odyssey' 4
(5) Aesch. 'Sept.' 565-67, 628-30
(6) Aesch. 'Pers.' 265, 711, 1008
(7) Hes. fr. 321 M.-W.
(8) Callimaco, 'ep.' 28 Pf., 5-6
(9) Paus. 7.23.1-3
(10) [ ] sulla parodo dell'"Ifigenia in Aulide" di Euripide (vv

Figure 2: Case history of canonical references written in different languages and following several citation styles.

In conclusion, the analysed case history shows that the indication of the author and of the work referred to should be considered, respectively, the zero degree and the minimum properties
of a canonical text reference. By contrast, the indication of a text range and of the editor's name are additional pieces of information that contribute to a precise text reference, one concerned with a specific exemplar (whether digital or printed) of a canonical work, such as a translation or a critical edition. Furthermore, since even the title of a work or the name of an ancient author can reasonably be considered a reference, the Microformat used to encode such references should cover as much of the case history as possible.
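Purely to illustrate why automated recognition of such references is non-trivial, the following naive pattern (an invented example, not part of the proposed framework) matches only abridged references of the kind shown in Fig. 2 and would already fail on implicit or unabridged forms and on language variants of the same name:

// Naive, illustrative pattern for abridged references such as "Aesch. 'Sept.' 565-67"
// (author abbreviation, quoted work abbreviation, numeric range). It would not cover
// implicit references ("Protean Forms and Disguise in 'Odyssey' 4"), unabridged forms
// or the many language-dependent variants of an author's name.
var abridged = /^([A-Z][a-z]+\.)\s+'([^']+)'\s+([\d.,\s-]+)$/;

var match = "Aesch. 'Sept.' 565-67".match(abridged);
if (match) {
  console.log({ author: match[1], work: match[2], range: match[3] });
  // { author: "Aesch.", work: "Sept.", range: "565-67" }
}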
3.2 Design of a Microformat vocabulary set to encode canonical text references
Once the properties of canonical text references have been singled out, it is necessary to design a Microformat vocabulary set in accordance with the requirements specified above, choosing the most fitting among the POSH tags. The design process of new encoding formats should conform to Microformats design principles and patterns, such as embeddability and modularity, as required by the developer community [16]. The first Microformats design principle states that a "microformat must solve a specific problem": the problem we started from is how to link primary and secondary sources within a distributed digital library in an open-ended, semantic and cross-language way. Furthermore, according to the fourth principle, which claims that designing Microformats means "paving the cowpaths", the proposed Microformat does not change the current behaviour of the users for whom it is especially designed (i.e. scholars). Hence, our solution fits the actual habits of citing canonical texts outlined above, since the Microformats are suitable for encoding abridged, unabridged and implicit references alike. The HTML tags used to compose the Microformat vocabulary set are the most semantic among the POSH elements. Since an entire canonical reference can be regarded, in a loose sense, as a citation, a <cite> element was used as the container element. As highlighted by the chunk of code reproduced in Fig. 3, the value of the class attribute of each element indicates the property expressed by the element itself, as implied by the Microformats class-design-pattern [5]. I propose three Microformats here that may be combined to encode almost the entire range of possible references identified above (Fig. 2). In particular, the decision to split the vocabulary set into three different Microformats responds on the one hand to needs of embeddability and modularity, and on the other corresponds to the classification of references by meaning adopted earlier. "ctauthor" (Fig. 3, line 3) is an elemental Microformat and encodes references to a canonical author alone. "ctwork" (line 6) belongs to the same kind and is suitable for references to canonical works in which the author's name is implicitly contained. The author statement is made machine-readable thanks to the structure of the CTS URN corresponding to a work, which also contains the CTS URN of its author (e.g. "urn:cts:greekLit:tlg0012.tlg001" refers to the Iliad and at the same time contains the CTS URN of Homer, "urn:cts:greekLit:tlg0012"). Both the "ctauthor" and "ctwork" Microformats were designed according to the abbr-design-pattern [1], which suggests using an <abbr> element to encode data provided in both machine-readable and human-readable form. Indeed, the title attribute of the <abbr> element is used to supply, in a machine-readable (and machine-parsable) format, the same data contained in the element, while the inner text carries the human-readable form. Finally, "ctref" (lines 3-10) is a compound Microformat to encode a complete canonical reference. It therefore requires as mandatory properties both a "ctauthor" and a "ctwork" element, together with a "range" property which supplies the numerical range of the text sections referred to. In addition it allows an edition statement property to be specified (line 9), which is designed following the same abbr pattern and encodes the editor's name in either complete or abridged form. To sum up, the proposed Microformat allows content providers to combine this encoding format with an internal linking system, where one is already present (Fig. 3).
The microformatted references embed the metadata necessary to make them machine-readable, and at the same time this semantic encoding is achieved regardless
of the external appearance of the canonical references, which is managed through CSS. Therefore, scholars are not required to change their citation practices, because the suggested Microformat vocabulary set is able to cover any citation format, and content providers can easily integrate it into their project-specific linking framework. Ultimately, the use of such a Microformat does not exclude the possibility of hard-linking the same references to resources on the web; it simply provides the semantic information necessary to also build a loosely coupled linking system on top of them.

1  <a class="citation" target="_blank" href="http://www.perseus.tufts.edu/cgi-
2  bin/ptext?lookup=Plut.+Sol.+19.1">
3    <cite style="" class="ctref">
4      <abbr class="ctauthor" title="urn:cts:greekLit:tlg0007">Plut.</abbr>
5      <em>
6        <abbr class="ctwork" title="urn:cts:greekLit:tlg0007.tlg007">Sol.</abbr>
7      </em>
8      <abbr class="range" title="19.1">19.1</abbr>
9      <abbr class="edition" title="Bernadotte Perin"/>
10     </cite>
11  </a>
Figure 3: Example of the hard-linked canonical reference of Fig. 1 after it has been re-encoded using the proposed Microformat vocabulary set.
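To make the machine-readability of this markup concrete, the following sketch (browser context, assuming markup like that of Fig. 3 is present in the page) reads the properties of a "ctref" back into a plain object; it is given for illustration only and is not part of a published parser.

// Sketch: reading the metadata embedded in a "ctref" microformat (cf. Fig. 3).
// Assumes a browser context in which such markup is present in the page.
function parseCtref(cite) {
  function prop(cls) {
    var el = cite.querySelector('.' + cls);
    return el ? el.getAttribute('title') : null;
  }
  return {
    author: prop('ctauthor'),   // e.g. "urn:cts:greekLit:tlg0007"
    work: prop('ctwork'),       // e.g. "urn:cts:greekLit:tlg0007.tlg007"
    range: prop('range'),       // e.g. "19.1"
    edition: prop('edition')    // e.g. "Bernadotte Perin"
  };
}

var refs = document.querySelectorAll('cite.ctref');
for (var i = 0; i < refs.length; i++) {
  console.log(parseCtref(refs[i]));
}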
3.3 Benefits of microformatted canonical text references
In addition to monographs and e-journal articles, various other types of resources often contain canonical references that could be microformatted: book reviews, conference announcements, descriptions of heterogeneous resources (such as illustrations on pottery) and the metadata of bibliographic records. All those resources could be suitably aggregated if the references to primary sources they contain were properly encoded. Once some microformatted content is available on the Web, an engine for targeted search, modeled on existing services that allow one to search information tagged with a given microformat (e.g. Technorati's Kitchen, which supports in particular searching the hCard, hCalendar and hReview microformats), will make it possible to retrieve resources pertinent to a given author or work with greater precision and recall, giving scholars information retrieval tools far more precise than traditional web search engines. The increased precision is granted by the capability of CTS URNs to address even a single part or section (i.e. book, chapter, paragraph, line, verse, etc.) of an XML-encoded literary text. Furthermore, microformatted content is searchable even through a non-semantic search engine (like Google) because of its bidirectionality [25]. This property is due to the fact that inside a microformatted canonical reference the URNs are directly embedded in the HTML, and thus a textual search engine will also retrieve them. In the near future, useful desktop applications allowing the reuse of microformatted data extracted from web pages may be created. It is likely, for instance, that an application could allow scholars to manage collections of the most frequently used references and to export them to a word processor formatted according to a given citation style (or in a localized or abridged, rather than unabridged, form). One interesting example of an application of this kind, working with citations and bibliographic references to modern publications, is Zotero. Zotero allows one to extract from web pages, and then manage or export in different formats, bibliographic references expressed using COinS, the above-mentioned metadata embedding technique which closely resembles a Microformat even though it is not a Microformat proper, because it was developed outside that community and without following the Microformats patterns and design principles. But for scholars of classical literature the most useful interface functionality to provide, and the only one capable of giving readers of the Digital Library a richer and more meaningful experience, is without a doubt reference linking. As soon as a significant number of digital libraries exposing a CTS-compliant
interface are available over the web, it will be possible to build client-side applications giving readers the ability to move directly from a text reference to its source, and to read or compare different critical editions of the same texts. The prototype of an e-journal providing such a service is described in the next section.
3.4 E-scholia: prototype of an e-journal on classics with a reference linking feature for primary sources
E-scholia is the prototype of an e-journal on classical philology, conceived to realize the scenario described in section 2.1.b. Its aim is to implement a reference linking feature to be provided as a critical value-added service to scholars who read on-line journal articles. The scholarly articles published by E-scholia are XML-encoded according to the TEI P5 specification. This makes it possible both to display the same article in several output formats (plain HTML, HTML with a rich user interface, and PDF) and to provide users with some interesting navigation features made possible by the text encoding. The canonical references contained in each article, previously marked up in TEI-XML, are then encoded with the proposed Microformat vocabulary set. Thus each reference is dynamically mapped onto CTS-compliant requests which, provided that a CTS repository is available, receive back an XML-encoded response containing the requested text passage. In this way each reference to a primary source becomes a direct link to the text referred to. Fig. 4 (n. 2) shows the text corresponding to the canonical reference Hom. Od. 2.272 being dynamically retrieved by querying a CTS repository available on the web and then displayed to the user in a small text box. A client-side script constructs the CTS query on the fly, using the metadata extracted from the microformatted reference. Once CTS repositories providing other critical editions and translations of Homer's Odyssey are available on the web, a reader using such a reference linking feature will be able to browse and read them starting from the reference Hom. Od. 2.272. Admittedly, the stability and persistence of existing digital libraries sharing their raw data as a canonical text service still need to be considerably improved. Once this happens, it will be possible to provide such advanced features as stable value-added services for e-journals in Classical Studies.
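A minimal sketch of this client-side step is given below; the repository URL and the request syntax are placeholders introduced for illustration, not the actual endpoints or query format used by the E-scholia prototype.

// Sketch: from the metadata of a microformatted reference to a CTS request.
// The repository URL and request syntax are placeholders for illustration.
function ctsPassageUrl(repositoryUrl, ref) {
  var urn = ref.work + ':' + ref.range;   // e.g. Odyssey (tlg0012.tlg002), book 2, line 272
  return repositoryUrl + '?request=GetPassage&urn=' + encodeURIComponent(urn);
}

var ref = { work: 'urn:cts:greekLit:tlg0012.tlg002', range: '2.272' };   // Hom. Od. 2.272
var xhr = new XMLHttpRequest();
xhr.open('GET', ctsPassageUrl('http://repository.example.org/cts', ref), true);
xhr.onreadystatechange = function () {
  if (xhr.readyState === 4 && xhr.status === 200) {
    // The XML fragment returned by the repository would then be displayed
    // to the reader in a small text box next to the reference.
    console.log(xhr.responseText);
  }
};
xhr.send();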
Figure 4: Screenshot of the E-scholia prototype of an advanced e-journal on classical philology. Canonical text references (1) dynamically become links to the full text of the passage referred to, which is retrieved from a CTS repository available on the Web. Furthermore, an extended version of the Operator extension for Mozilla Firefox detects the meaning of microformatted references and performs some client-side actions upon them (3).
The main benefit of using a Microformat, instead of a non-standard encoding format, to mark up canonical references is that the semantic metadata contained in the reference allow the implementation of a server-side feature while still leaving readers free to use other client-side agents to parse the same microformatted references and find related information. To demonstrate how this could be accomplished I extended Operator [21], a Javascript extension written by Michael Kaply for the popular open-source browser Mozilla Firefox, which will be fully integrated in version 3.0 of the browser. Operator works by parsing the entire page, extracting the microformatted content and then suggesting the available actions to the user (Fig. 4, n. 3). I extended Operator by adding just a few lines of code in order to support the proposed Microformat vocabulary set. The result is a browser aware of microformatted references and able to act upon them. The only action supported at the moment is searching the web and del.icio.us bookmarks for a given CTS URN: this is accomplished by dynamically appending a CTS URN as the query parameter to the standard search URL of those services (e.g. CiteULike's query URL for searching among tags is "http://www.citeulike.org/search/all?q=tag:<query parameter>"). Although support for this action was added merely by way of example, several services could be built on the basis of CTS URNs, among them the reference indexing services outlined in the next section.
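The construction of such a search URL is trivial; the sketch below uses the CiteULike query pattern quoted above, with a CTS URN taken from this paper as the tag to search for.

// Sketch: appending a CTS URN as the query parameter to a service's standard
// search URL, following the CiteULike pattern quoted in the text.
function searchUrlForUrn(urn) {
  return 'http://www.citeulike.org/search/all?q=tag:' + encodeURIComponent(urn);
}

console.log(searchUrlForUrn('urn:cts:greekLit:tlg0012.tlg001'));   // Homer, Iliad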
3.5 Outline of a reference indexing service for e-journals on classics
A semantic reference indexing service is another example of a service that could be provided to benefit from such microformatted references. Indeed, support for such a service can be added to existing client-side agents that are already aware of Microformats. This service is supposed to take a CTS URN as input and return a list of resources related to the supplied URN, regardless of the language they are written in. In particular, a reference index for e-journals would return all the indexed articles that contain, in the title or in the body text, one or more references to a given author, work or text passage. The glue between the layer of such a reference indexing service and the layer of metadata contained in canonical references is provided by a client-side application, for instance a browser add-on (see Fig. 5).
Figure 5: Separation of logical layers in the proposed linking framework between primary and secondary sources.
Currently, scholars of classics often use the electronic version of L'Année Philologique (APh) [10] for their bibliographic research; it allows one to search at once across all the publications summarized in the printed version of the well-known bibliographical index. Publications are indexed by the ancient authors they focus on (if specified) using a Latin thesaurus, which partially avoids the drawbacks, described above, due to the fact that named entities are language-dependent (see scenario 2.1.a). Users can search the index using the Latin name of an ancient author as the search key. However, the information retrieval system of APh does not allow users to search for publications related to a specific text passage. A semantic reference indexing service, instead, should allow one to search, for instance, for all the articles concerned with the first line of Homer's Iliad or Odyssey, since these articles would be mapped to the CTS URNs "urn:cts:greekLit:tlg0012.tlg001" and "urn:cts:greekLit:tlg0012.tlg002". Lastly, since it favours the traditional printed version of the summarized articles, APh does not provide the URL of articles that are also available electronically. At the moment at least two approaches can be envisaged: a top-down and a bottom-up one. The top-down approach is to use the Reference Index protocol, a web protocol which is being developed at CHS and could be profitably employed to create a semantic reference indexing service for articles published on line. This protocol allows the creation of a list of entries where each entry corresponds to a resource and is associated with a unique identifier, which could be a CTS URN or a canonical reference. Since such a service would take a CTS URN as input and return a list of related journal articles, it would be fully interoperable with the suggested Microformat. Although it is undoubtedly the most powerful solution, this approach presumes that the Reference Index protocol is widely adopted by content providers such as e-journals. A bottom-up solution, instead, would exploit the tagging feature provided by most online social bookmarking platforms. The basic assumption is that tags can also be semantic. In order to tag semantically an article which focuses on Homer's Iliad, it is enough to use, along with tags such as "Homer" and "Iliad", the CTS URN corresponding to the given work (i.e. urn:cts:greekLit:tlg0012.tlg001 for the Iliad). This application of the semantic tagging technique was inspired by some experiments recently carried out in the same research perspective by Sebastian Heath [26] and Gabriel Weaver [28]. Among the experiments I have begun conducting is the semantic tagging on CiteULike of some bibliographic citations from e-journals on classics, using CTS URNs as tags, instead of just named entities, to describe the topic declared in the title of each paper [4].
Once some bibliographic records are properly tagged, it is possible to envisage, for example, an application (even acting as a web service) which: a) harvests several CTS repositories containing the XML encoding of primary sources; b) displays the original text to the reader; c) with respect to the current reading context, retrieves from a service such as CiteULike the previously tagged bibliographic entries concerned in particular with the text passage being read; d) displays the relevant aggregated information to the user in a totally transparent way, so that the reader is not required to know what CTS URNs are (a minimal sketch of step (c) is given at the end of this section). Among the advantages of a bottom-up approach such as semantic tagging is the fact that it does not require any modification of existing content, such as lists of bibliographic records, but simply implies the reuse of raw content that may be created collaboratively through a Web 2.0 social application.
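The sketch mentioned above for step (c) follows; the JSON endpoint and the shape of the returned entries are hypothetical, since a real implementation would depend on whatever export or feed format the chosen bookmarking service provides.

// Sketch of step (c): retrieving bibliographic entries previously tagged with
// the CTS URN of the passage being read. The endpoint and the response shape
// are hypothetical placeholders.
function relatedEntries(urn, callback) {
  var url = 'http://bookmarks.example.org/tag/' + encodeURIComponent(urn) + '?format=json';
  var xhr = new XMLHttpRequest();
  xhr.open('GET', url, true);
  xhr.onreadystatechange = function () {
    if (xhr.readyState === 4 && xhr.status === 200) {
      callback(JSON.parse(xhr.responseText));   // e.g. [{ title: "...", url: "..." }, ...]
    }
  };
  xhr.send();
}

relatedEntries('urn:cts:greekLit:tlg0012.tlg001', function (entries) {
  for (var i = 0; i < entries.length; i++) {
    console.log(entries[i].title + ' :: ' + entries[i].url);
  }
});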
4. Discussion: the suitability of Microformats and CTS URNs
With respect to the adoption of Microformats to implement a semantic linking framework, it is noteworthy that their suitability is confirmed by the Rule of Least Power: this is the reason why they are preferred here over RDF [14]. Indeed, the complexity of the data carried by canonical references is not such as to require the capability of RDFa [13] or eRDF [7] to embed within the elements and attributes of
HTML such complex semantic data as have to be expressed using multiple RDF vocabularies. Thanks to their rapid and wide success, Microformats have undoubtedly contributed to making the goal of reaching the Semantic Web popular again and to drawing attention back to it. Indeed, the terms "lowercase semantic web" and "real world semantics" have appropriately been coined to describe them [22], since they are currently considered a sort of bottom-up path to the Semantic Web itself. The use of Microformats to express semantic data may be considered a starting point toward the future adoption of more powerful, Semantic Web-oriented technologies, such as RDF. Moreover, forward compatibility with a more powerful format such as RDF can be achieved through the use of some already existing vocabularies and ontologies, such as the Expression of Core FRBR Concepts in RDF [8], the Citation Oriented Bibliographic Vocabulary [3] and the Dublin Core Metadata Element Set [6]. Thus, after being processed by a GRDDL transformation [19], microformatted references can be extracted as RDF triples using these vocabularies to express the semantic data. Another fact confirming Microformats' suitability is that the next versions of the two most popular web browsers (Internet Explorer 8 and Mozilla Firefox 3, already at its Beta 5) have been announced as featuring support for this technology, thus contributing to redefining the very role played by web browsers. Indeed, web browsers are turning into platforms for content aggregation rather than mere tools to display static pages. There are interesting examples, such as the social web browser Flock [9], of browsers conceived around the needs of the community of users they are expected to serve. Scholars in classical studies could in the near future benefit from a browser customized to better fit their own needs and integrating some useful tools. Furthermore, with respect to the policy governing the development of new Microformats, it is noteworthy that, although it is controlled in quite a centralized way by the Microformats community, somewhat in contrast with the Web's own decentralised nature, it undoubtedly prevents a great deal of redundancy by encouraging discussion of and agreement on proposals for new standards. Currently, the Microformats developer community is working on hBib, a format to encode references to modern publications, i.e. what are called secondary sources according to a common distinction in the field of classical studies. It is therefore time to start a discussion on the Microformats Wiki [11] aimed at defining a common format to encode references to primary sources, and in particular canonical references. Finally, one might ask why such a linking system for primary and secondary sources should be built upon the CTS protocol and CTS URNs, given that they are not yet widely adopted. The first reason is that the CTS is currently an effective and unparalleled protocol in the field of classical studies. Furthermore, it perfectly fits the need of scholars of classics to refer to and access hierarchical sections of canonical works within digital libraries through unique identifiers [23]. Secondly, it is built entirely upon common and fully interoperable standards and responds to the Web's decentralized nature.
In fact, the Perseus Project recently started implementing a CTS-compliant web interface for its digital library, which is probably the largest freely accessible digital collection of classical texts. Perseus' choice to allow potentially thousands of web applications to use its raw XML data to provide new services is destined to increase its popularity rather than reduce it, and seems to bode well for a wide-scale adoption of the CTS protocol itself. Once the Homer Multitext [2], one of the first large projects built, among other technologies, upon the CTS protocol, is completed, it is hoped that it will become a model for other "monographic" projects aimed at digitizing the whole textual history of the works of a single ancient author. This is a great opportunity for cyberscholarship to realize a distributed and freely accessible digital library from which it will be possible to start improving both the quality of digital critical editions and their effective usefulness for scholars.
5. Conclusions
In conclusion, this paper has tried to show the benefits that could be gained from a linking framework between primary and secondary sources based on a loosely coupled approach and implemented using Microformats combined with CTS URNs. In particular, compared to a tightly coupled linking system, the benefits addressed in this paper would increase the efficiency of information retrieval tools and expand the range of services it is possible to provide. Once effective value-added services such as reference linking or reference indexing are provided by e-journals, it is hoped that scholars will be more encouraged to use electronic publications for their discipline-specific purposes. However, some improvements are still needed, in particular with regard to the automatic creation of microformatted references, since a model based entirely on manual markup will not scale. To this end, Natural Language Processing techniques such as finite-state automata, named entity disambiguation and semantic analysis of the context in which such references appear could be profitably exploited in the near future.
6. Notes and References
[1] abbr-design-pattern - Microformats. [cited 12 May 2008]. Available from world wide web: <http://microformats.org/wiki/abbr-design-pattern>.
[2] Center for Hellenic Studies - The Homer Multitext Project. [cited 10 May 2008]. Available from world wide web: <http://zeus.chsdc.org/chs/homer_multitext>.
[3] Citation Oriented Bibliographic Vocabulary. [cited 10 May 2008]. Available from world wide web: <http://vocab.org/biblio/schema>.
[4] CiteULike: mromanello56k's library. [cited 10 May 2008]. Available from world wide web: <http://www.citeulike.org/user/mromanello56k>.
[5] class-design-pattern - Microformats. [cited 9 May 2008]. Available from world wide web: <http://microformats.org/wiki/class-design-pattern>.
[6] Dublin Core Metadata Element Set, Version 1.1. [cited 10 May 2008]. Available from world wide web: <http://dublincore.org/documents/dces/>.
[7] ERDF - GetSemantic. [cited 10 May 2008]. Available from world wide web: <http://getsemantic.com/wiki/ERDF>.
[8] Expression of Core FRBR Concepts in RDF. [cited 10 May 2008]. Available from world wide web: <http://vocab.org/frbr/core>.
[9] Flock Browser - The Social Web Browser. [cited 12 May 2008]. Available from world wide web: <http://flock.com/>.
[10] L'Année philologique. [cited 12 May 2008]. Available from world wide web: <http://www.anneephilologique.com/aph/>.
[11] Microformats Wiki. [cited 10 May 2008]. Available from world wide web: <http://microformats.org/wiki/Main_Page>.
[12] OpenURL ContextObject in SPAN (COinS). [cited 10 May 2008]. Available from world wide web: <http://ocoins.info/>.
[13] RDFa Syntax: A collection of attributes for layering RDF on XML languages. [cited 10 May 2008]. Available from world wide web: <http://www.w3.org/2006/07/SWD/RDFa/syntax/>.
[14] Resource Description Framework (RDF) / W3C Semantic Web Activity. [cited 10 May 2008]. Available from world wide web: <http://www.w3.org/RDF/>.
[15] Thesaurus Linguae Graecae - TLG. [cited 13 May 2008]. Available from world wide web: <http://www.tlg.uci.edu/>.
[16] Allsopp, John. Microformats: empowering your markup for Web 2.0. Friends of ED, 2007.
[17] Armbruster, Chris. Society Publishing, the Internet and Open Access: Shifting Mission-Orientation from Content Holding to Certification and Navigation Services? Social Science Research Network, 2007. Available from world wide web: <http://papers.ssrn.com/sol3/papers.cfm?abstract_id=997819>.
[18] Fielding, Roy T., and Richard N. Taylor. Principled design of the modern Web architecture. ACM Trans. Inter. Tech. 2 (2002), 115-150.
[19] Hausenblas, M., W. Slany, and D. Ayers. A Performance and Scalability Metric for Virtual RDF Graphs. In 3rd Workshop on Scripting for the Semantic Web (SFSW07), Innsbruck, Austria, 2007.
[20] International Union of Pure and Applied Chemistry. The IUPAC International Chemical Identifier (InChI™). [cited 2 May 2008]. Available from world wide web: <http://old.iupac.org/inchi/>.
[21] Kaply, Michael. Operator. Mike's Musings. [cited 12 May 2008]. Available from world wide web: <http://www.kaply.com/weblog/operator/>.
[22] Khare, Rohit, and Tantek Çelik. Microformats: a pragmatic path to the semantic web. In Proceedings of the 15th International Conference on World Wide Web, 865-866. Edinburgh, Scotland: ACM, 2006.
[23] Mimno, David, Alison Jones, and Gregory Crane. Hierarchical Catalog Records: Implementing a FRBR Catalog. D-Lib Magazine 11, October 2005. [cited 13 May 2008]. Available from world wide web: <http://webdoc.sub.gwdg.de/edoc/aw/d-lib/dlib/october05/crane/10crane.html>.
[24] Reynolds, Leighton D. Scribes and scholars: a guide to the transmission of Greek and Latin literature. 2nd ed., rev. and enl. Oxford: Clarendon Press, 1974.
[25] Heath, Sebastian. A DNID that can find itself. Domain Name Identifiers, November 2007. [cited 10 May 2008]. Available from world wide web: <http://dnid-community.blogspot.com/2007/11/dnid-that-can-find-itself.html>.
[26] Heath, Sebastian. Wikipedia and Google like DNIDs. Domain Name Identifiers. [cited 10 May 2008]. Available from world wide web: <http://dnid-community.blogspot.com/2007/11/wikipediaand-google-like-dnids.html>.
[27] Smith, Neel. TextServer: Toward a Protocol for Describing Libraries. Classics@ 2, 2004. Available from world wide web: <http://www.chs.harvard.edu/publications.sec/publications.sec/classics.ssp/classics_2_smith.pg>.
[28] Weaver, Gabriel. Adding Value to Open Scholarly Content. Gabriel Weaver's Personal Blog, March 2007. Available from world wide web: <http://blog.gabrielweaver.com/2007/03/testpost.html>.
[29] Willighagen, Egon. Chemical RDFa with Operator in the Firefox toolbar. chem-bla-ics, June 2007. [cited 2 May 2008]. Available from world wide web: <http://chem-bla-ics.blogspot.com/2007/06/chemical-rdfa-with-operator-in-firefox.html>.
[30] Willighagen, Egon. SMILES, CAS and InChI in blogs: Greasemonkey. chem-bla-ics, December 2006. [cited 2 May 2008]. Available from world wide web: <http://chem-bla-ics.blogspot.com/2006/12/smiles-cas-and-inchi-in-blogs.html>.
Creating OA Information for Researchers
Peter Linde1, Aina Svensson2
Blekinge Institute of Technology, The Library, 371 79 Karlskrona, Sweden
email: 1peter.linde@bth.se, 2aina.svensson@ub.uu.se
Keywords: Open Access; Information; web portals; Learning objects

1. Background
About half of the Swedish universities and university colleges today administer some sort of institutional repository, and about a dozen of these can deliver metadata according to the recommendations set by the national SVEP project in 2003-05. Today the problem is not a lack of software or hardware technology. The immediate question is instead how we are going to fill our archives with full-text documents and how to make researchers see the possibilities and advantages of publishing their documents Open Access. Today there is widespread ignorance of OA in the Swedish research community. Many Swedish libraries therefore need support in order to tackle the task of sharing information about and marketing OA. Librarians' own understanding and knowledge of OA also need to be increased.
2. Goals
An easily accessed OA information kit aimed at researchers and administered by librarians and other information specialists would contribute to higher OA competence among teaching staff, increase researchers' knowledge about OA and at the same time increase the number of records in Swedish OA repositories. This poster describes how, during 2007, a project group consisting of representatives from six Swedish universities and university colleges produced a set of presentation and information material aimed at researchers and administered by librarians and other information specialists. By making this cooperative material freely available, there is no longer any need to produce the same material repeatedly in different locations. Instead, energy can be redirected to proactive marketing work, contributing to higher OA competence among teaching staff, increasing researchers' knowledge about OA and hopefully increasing the number of records in Swedish OA repositories.
3. Realization
During the project a number of education and information objects were created in English (by May 2008) and Swedish. These consisted of longer texts (4,000-5,000 words), PowerPoint presentations, leaflets, lists of links and contact information. A website was created for presenting and downloading the information. This was done using the Search Guide network framework, where independent libraries use the same open source platform for sharing information and material. All the material produced in the project is freely accessible and published under a Creative Commons licence.
4. Conclusions
The main part of the project ended in November 2007, but a basic evaluation of the website (http://www.searchguide.se/oa/) is due in April 2008. It will include an analysis of web traffic to the site during six months of use. A user survey among researchers and librarians will also be carried out by April 2008, and the results will be included in this poster. The project group has now received a grant for a follow-up project starting in early 2008. This will reinforce the objectives already fulfilled and strengthen knowledge about OA among Swedish researchers, which is a burning issue for making scientific documents available Open Access in Sweden.
5. References
http://www.searchguide.se/oa/
Open Scholarship eCopyright@UP. Rainbow options: negotiating for the proverbial pot of gold
Elsabé Olivier
Department of Library Services, University of Pretoria
PO Box 12411, Hatfield, Pretoria, South Africa 0028
+27 12 420 3719 (tel), +27 362 5100 (fax)
email: elsabe.olivier@up.ac.za
Keywords: copyright; Open scholarship; institutional repositories; research articles; Open access; scholarly publications

1. Introduction
There is a worldwide trend towards Open scholarship, and towards the end of 2005 the University of Pretoria (UP) developed an institutional repository named UPSpace (https://www.up.ac.za/dspace/). Within UPSpace, openUP houses the e-print collection of peer-reviewed and published research articles/papers by staff, students and other affiliates of the University of Pretoria. The purpose of this collection is to make the University's research visible and accessible to the entire international research community, in accordance with the philosophy and practice of Open access as well as the copyright policies of the publishers. These research articles correspond with the University's Annual Research Report. Submission of research articles started in July 2006 and the openUP collection (https://www.up.ac.za/dspace/handle/2263/1210) comprises 1,526 items which have been mapped from the Research articles collections. The aim of openUP is to open access to UP research output for the advancement of science and to provide equal opportunities to researchers worldwide. It implies a new business model in which the creators of knowledge retain copyright and take responsibility for the dissemination of their products. A University Open access policy has been drafted, which is being discussed with all University faculty members before its campus-wide implementation in 2009. It entails:
• mandatory submission of all research articles;
• encouraging researchers to negotiate copyright with publishers by adding the official UP author addendum or by granting copyright instead of transferring copyright;
• encouraging its authors to publish their research articles in Open access journals.
The purpose of this poster is to report on progress made in a project to manage copyright and Open Access (OA) at the openUP Office of the University of Pretoria. It forms part of a strategy to transform UP into an Open Scholarship institution. Although South Africa occupies a leading position in research publishing in Africa, it is still rated among the lowest producers of research publications in the world (Ocholla, 2007). In South Africa academics will only receive a subsidy or recognition from the government for the research articles they write if the journal titles appear in one of the following lists (usually referred to as accredited journals):
• Institute for Scientific Information (ISI) journal lists;
• International Bibliography of Social Sciences (IBSS) journal list;
• An index of Approved South African Journals. This list is maintained by the South African Department of Education and is subject to annual review (DOE list).
2. Statement of the problem
One of the major concerns associated with a collection such as openUP is the issue of publishers' copyright. A journal publisher usually requires authors to sign a "Copyright Transfer Agreement" whenever they publish their research articles. The publisher then holds the copyright and even the author has to abide by the copyright policies of the publisher. Copyright can be a deterrent to Open access, and universities wanting to provide Open access to their research output need to deal with this issue very carefully. The majority of international publishers allow inclusion of research articles in repositories under very specific conditions. These archiving policies can be found in the SHERPA/RoMEO database of publishers (www.sherpa.ac.uk/romeo.php) or on the publishers' web sites. Currently very few South African journals (journals in the approved DOE list) appear in the SHERPA/RoMEO database, and most publishers' policies do not make provision for self-archiving or depositing in repositories. Open access is still a new concept in South Africa, and neither publishers nor authors are knowledgeable about copyright and its implications for archiving in repositories. The openUP Office adopted the following policy regarding the copyright of scholarly research articles by University of Pretoria researchers as it relates to UPSpace:
• To abide by the copyright policies of the publishers, and to take the responsibility and initiative to contact the publishers on behalf of the authors and to negotiate copyright permission to archive the UP-authored research articles;
• To strictly adhere to any embargoes imposed;
• To archive only the metadata of research articles where archiving is forbidden by the publishers.
3. Methodology
The openUP copyright negotiation project is currently undertaken by one staff member and consists of an email survey/correspondence sent to South African publishers. The three lists of approved journals for accreditation and the University's annual Research Reports are used to identify the journals to be targeted. Publishers' copyright and archiving policies are first checked against the SHERPA/RoMEO database of publishers, which categorizes publishers' self-archiving policies using different colors. This useful database gives academic authors, institutional repository administrators and publishers the ability to check the conditions and restrictions that publishers place on self-archived scholarly articles. Whenever journals are not found in this database (usually South African journals), the publishers or editors are contacted via email and requested to supply information about their archiving and Open access policies by completing a questionnaire. The questionnaire covers the following questions:
• Which version of the article will be allowed for archiving?
• Which format of the article will be allowed (PDF with branding, etc.)?
• Should the URL of the journal append the citation as a form of recognition?
• Does the publisher require access restrictions / embargoes?
• Can openUP assume that this granting of reproduction permission reflects a policy that could be applied to all articles from University of Pretoria academics who have published in this particular journal?
This project started in February 2007 and will be an ongoing process until an extensive list of UP-preferred journals has been compiled.
4. Data and interpretation
Currently the openUP Office has successfully negotiated archiving policies for 219 journal titles, mostly from South African publishers. Journals that appear in the SHERPA/RoMEO database were excluded from this exercise. The journals are categorized using the same color codes as the SHERPA/RoMEO database, but two colors were added:
• gold, indicating journals which will allow the archiving of the published PDF version of articles and support Open access; and
• grey, indicating publishers that are currently in the negotiation phase or who have not responded to the survey yet.
The University of Pretoria's feedback from the publishers in South Africa indicates that:

• South African publishers (47%) support Open access and are willing to allow the archiving of the published PDF version (gold);
• a total of 44% of the publishers have either not replied or not yet formulated a general policy on Open access and archiving (grey);
• a small percentage (3%) prohibits archiving and will not allow any form of archiving (white);
• only 4% of the publishers will allow the archiving of the pre- or post-print version (green);
• only 1% of the publishers will allow the archiving of the post-print (post-refereed) version (blue); and
• lastly, only 1% will allow archiving of the pre-print (pre-refereed) version (yellow).
5. Lessons learned and tentative conclusions
Some of the important lessons learned from this project are:
• Copyright is an important and critical issue in South Africa. The situation in Africa, and in particular South Africa, is unique and differs vastly from Europe and North America due to the fact that both South African authors and publishers are ignorant of the implications of copyright:
• Authors usually don't retain any rights and don't realize the implications of signing away their copyright. One of the openUP Office's goals is to influence the copyright behavior of UP authors so that they rethink the rights they sign away and at least negotiate the right to archive their research articles in UPSpace or add a revised addendum to the Copyright Transfer Agreement;
• On the other hand, most South African publishers do not have any copyright policies
regarding self-archiving in repositories. The openUP Office plays an important role in promoting Open access and influencing publishers to either adjust or develop copyright archiving policies. This entails negotiation skills as well as applying the art of persuasion. At the same time South African publishers are given the opportunity to align their publisher copyright policies to those of their international partners;
• Many South African journals are still only available in printed format and electronic copies of articles are often unattainable. Authors seldom retain electronic copies of their articles, which complicates the archiving process;
• Many of the South African publishers are very supportive of the main goals of Open access once the concepts and advantages have been explained to them. Those that prohibit archiving are the smaller commercial publishers who rely heavily on subscriptions to fund the production of their journals;
• National bodies such as the Academy of Science of South Africa (ASSAf) and the Department of Science & Technology are taking responsibility for ensuring that Open access initiatives are promoted in South Africa. At ASSAf's Journal Editors' Forum in July 2007, publishers were encouraged to support the Open access publishing model in order to enhance the accessibility of South African research articles and make the African continent's research more visible to the international research community;
• The openUP Office supports these national strategies and cooperates wherever possible to enhance Open access in South Africa.
The extensive colour-coded list of South African archiving policies could either contribute to the current SHERPA/RoMEO database or be utilized as the basis for a (South) African database. In the words of Gevers: "… in the publishing arena, South Africa might be a dwarf internationally, but we are a giant on the African continent" (Gevers, 2006). The openUP Office realizes the significant role that needs to be played in securing pro-Open access archiving policies for scholarly publishing in the global community.
6. References
Academy of Science of South Africa. (2006). Report on a strategic approach to research publishing in South Africa. Pretoria: Academy of Science of South Africa.
Ocholla, DN. (2007). Common errors and challenges in publishing in a peer refereed library and information journal. South African Journal of Libraries and Information Science, 73(1):1-13.
Scalable electronic publishing in a university library
Kevin Hawkins
Scholarly Publishing Office, University Library, University of Michigan
300 Hatcher Graduate Library North, 920 N. University Ave., Ann Arbor, MI 48109-1205 USA
http://www.ultraslavonic.info/
email: kevin.s.hawkins@ultraslavonic.info
Keywords: scholarly publishing; service models; business models; tagging and annotation tools

Since 2001, the Scholarly Publishing Office (SPO), a division of the University of Michigan University Library, has published a broad range of scholarly literature in electronic and print form, extending the library's commitment to the distribution of scholarship by experimenting with innovative methods for publishing to serve the needs of scholars, both at the University of Michigan and around the world. In 2007, SPO's staff of approximately 7.5 FTEs published nearly 2,000 articles in journals, reviews, and conference proceedings, plus a handful of monographs, image collections, and other digital projects. Text content is stored in XML, with approximately half of the 2007 articles derived from unstructured electronic source documents. SPO's seven years of experience demonstrate how to build a scalable electronic publishing operation. SPO has emphasized efficient publication of content at the expense of publicity, acquiring content mostly through word of mouth and through referrals from SPARC, the Scholarly Publishing and Academic Resources Coalition. This presentation attempts to remedy that lack of publicity by giving a comprehensive overview of SPO's current publishing services—business models and rights agreements; technology used for content conversion, storage, retrieval, display, metadata exchange, usage statistics, and print production; and experimentation with tagging and annotation tools—as well as SPO's future plans.
Issues and Challenges to Development of Institutional Repositories in Academic and Research Institutions in Nigeria
Gideon Emcee Christian
International Development Research Centre (IDRC), Ottawa, Canada
gchristian@idrc.ca
Keywords: institutional repositories; Nigeria; Open Access; research output

1. Introduction
Open Access institutional repositories are electronic archives that may contain post-published articles, pre-published articles, theses, manuals, teaching materials or other documents that the authors or their institutions wish to make publicly available without financial or other access barriers. Open Access institutional repositories provide an avenue for the promotion and dissemination of knowledge and institutional research outputs. They can also provide a better picture of the type of research being conducted at these institutions. With about 92 universities, Nigeria boasts more universities than any other country in Sub-Saharan Africa. In addition, emerging facts seem to suggest that it also generates more academic research output than any other country in the region. For example, Nigeria has 125 online journals listed in African Journals Online (AJOL), as opposed to 54 from South Africa, 8 from Uganda and 10 from Zimbabwe. Curiously though, the Directory of Open Access Repositories shows that there are 15 Open Access repositories across Africa - 12 in South Africa, 1 in Namibia, 1 in Uganda and 1 in Zimbabwe. There is as yet no Open Access institutional repository in Nigeria. This prompts a reasonable inquiry as to what constraints have prevented a country with so many academic institutions and so much research output from developing Open Access institutional repositories. This presentation therefore seeks to highlight the challenges to the establishment of Open Access institutional repositories in Nigeria, as well as the views on and awareness of Open Access institutional repositories among scholars and researchers in the target institution surveyed in the course of my field research in the country.
2. Objectives

The main research objectives are:

1. Highlight the current state of Open Access institutional repositories in academic and research institutions in Nigeria.
2. Establish a basis for understanding the current state of Open Access institutional repositories in academic and research institutions in Nigeria.
3. Highlight the opportunities presented by Open Access institutional repositories and archives in Nigeria.
4. Identify the issues and challenges hindering the development of Open Access institutional repositories in academic and research institutions in Nigeria.
5. Highlight the primary and practical solutions to the major issues affecting the development of Open Access institutional repositories in Nigeria.
3. Methodology
The presentation will be based on the results of a survey conducted at the University of Lagos, Nigeria. The survey was conducted via questionnaires distributed among 100 scholars, researchers and graduate students.
4. Conclusions
Institutional repositories provide universities in developing countries with a good avenue for disseminating their intellectual output to the outside world. However, notwithstanding increased research output from universities in Nigeria, there is still an absence of Open Access institutional repositories. It is hoped that this presentation will highlight the level of awareness of institutional repositories among scholars and academic researchers in Nigeria, the problems that have militated against the development of institutional repositories in Nigeria, and a practical solution that will facilitate their development.
When Codex Meets Network: Toward an Ideal Smartbook
Greg Van Alstyne1 and Robert K. Logan
1 Beal Institute for Strategic Creativity, Ontario College of Art & Design, 100 McCaul Street, Toronto, Ontario M5T 1W1
Email: gvanalstyne@faculty.ocad.ca
Keywords: ebooks; reading; design thinking; network

"Understanding the future of reading and writing will mean learning to read the future, and learning to write the future."

The experience of using a book in the classical, codex form – more than 1000 years old – is far from "broken." However, it is ripe for evolutionary enhancement. A Cambrian explosion of forms is underway, offering new software, hardware, appliances, systems and networks that seek to extend and enhance the pleasure, power and utility of reading. But which of these forms, if any, promises the ideal combination of qualities and functions? Hardware ebooks offer several strengths, including numerous titles in one light package, rapid delivery of texts purchased online, some free ebooks for download, and potentially lower environmental impact than paper books. But the pleasure of the user experience, and rates of adoption, continue to be held back by strict hardware dependency along with proprietary standards and DRM that result in uncertain future access, obstacles to sharing, and little or no used book market. Readers want and need qualities like persistence and reliability. Many people enjoy reading physical books because they last and don't break. Still, their more recent expectations, whetted by the tapestry of Web 2.0 possibilities, lean increasingly toward user-generated content, including collaborative co-creation. People increasingly want and expect to work independently and/or together in developing, expressing and sharing ideas as well as simply reading official or commercial publications. Yet some ebook systems don't even permit, let alone encourage, original content creation and dissemination. Soft readers offer more flexibility and more access to open standards, and they more clearly point to what the Institute for the Future of the Book (IFB) calls the "network book." IFB's Ben Vershbow has said: "It's... possible that defining the networked book as a new species within the genus "book" sows the seeds of its own eventual obsolescence.... But that strikes me as too deterministic.... As with the evolution of biological life, things tend to mutate and split into parallel trajectories. The book... may indeed be headed for gradual decline, but we believe the network has the potential to keep it in play far longer than the techno-determinists might think."* Can design thinking and strategic foresight be leveraged toward a greater interpenetration of physical and digital book forms? Is it possible to marry the codex with the network? Can we aspire to increase pleasure, utility, sustainability, and evolvability in the same design gesture? We believe these goals are both desirable and feasible. This poster presentation briefly surveys the field before outlining an argument and a conceptual solution for a "smartbook" format that leverages machine-readable tags and persistent URL space to address the
When Codex Meets Network: Toward an Ideal Smartbook
issues raised. Our vision offers a synthesis of attributes, for it is first and foremost meant to be readable, providing the pleasure of a physical, codex book. At the same time it is fully searchable, with the utility of electronic text, and â&#x20AC;&#x153;smartâ&#x20AC;? in that it heavily leverages the intelligence of the network.
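The abstract leaves the tag-to-URL mechanism at the conceptual level; purely as a hedged sketch of the idea (every identifier and URL below is hypothetical, not part of the authors' design), a machine-readable tag printed on a codex page might resolve into the smartbook's persistent URL space as follows:

    # Hypothetical sketch only: maps a machine-readable tag found on a printed page
    # to a persistent URL, so the physical codex and the network share one address space.
    PERSISTENT_BASE = "http://purl.example.org/smartbook/"   # placeholder persistent URL space

    def resolve_tag(tag_id: str) -> str:
        """Turn a tag such as 'isbn-9780000000000/page-42' into a stable network address."""
        return PERSISTENT_BASE + tag_id

    print(resolve_tag("isbn-9780000000000/page-42"))
    # -> http://purl.example.org/smartbook/isbn-9780000000000/page-42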
Revues.org, online humanities and social sciences portal
Marin Dacos - CNRS
Director of the CLEO / Revues.org, Centre pour l'édition électronique ouverte
CNRS - EHESS - Université de Provence - Université d'Avignon
3, place Victor Hugo, Case n°86, 13331 Marseille Cedex 3
Tel. 04 88 57 69 29, direct line 04 88 57 69 38, Fax 04 88 57 69 30
email: marin.dacos@revues.org
Keywords: humanities and social sciences; learned societies; electronic publishing platform
1. Introduction
Since 1999, the CLEO, "Centre pour l'édition électronique ouverte" ("Centre for open electronic publishing"), has been developing Revues.org, the oldest French social sciences portal, which now gathers more than one hundred and fifty journals. The centre promotes the dissemination of scientific literature in the humanities and social sciences by developing electronic publishing. It federates scholarly journals, provides them with technological support and helps them establish their visibility on the internet. It also fosters the learning of skills linked to electronic publishing by organizing training sessions and producing documentation. The project originates from the French scientific community. All journals follow academic and scholarly standards in the fields of history, politics, geography, sociology, anthropology, psychology, etc. They are owned by learned societies, major research centres, university institutes or private publishers. Most of them receive funds from the CNRS, the CNL (Centre national du livre) or universities. To date, Revues.org has predominantly been involved in the publishing of French journals. It is now receiving an increasing number of international applications and has started to welcome multilingual French publications and international publications. The portal interface will soon be multilingual. Revues.org is a non-profit service supported by the CNRS (Centre national de la recherche scientifique), the EHESS (École des hautes études en sciences sociales, the main French social sciences institute), the University of Provence (Aix and Marseille) and the University of Avignon. The unit has offices in Marseilles and is now part of ADONIS, the major French cyberinfrastructure for the humanities and social sciences. For more information on its current activities, you can subscribe to the Revues.org online newsletter, la Lettre électronique de Revues.org.
2. Electronic skills and tools for publishing teams
Revues.org offers a simple publishing process and supports the learning of electronic skills, in order to give more freedom and independence to journals. It has developed specialized software, Lodel (Logiciel d'édition électronique: software for electronic publishing, http://www.lodel.org). Lodel is free, open-source software (GPL licence, http://sourcesup.cru.fr) that can be used by any publishing team after a few hours of training. Each journal on Revues.org has its own layout, designed to reflect its editorial specificities. The journals can publish footnotes, photographs, tables, graphics, fine-grained indexes, etc.
Revues.org also provides each journal with a panel of communication and promotion tools, which are easy to run and free of charge:
• newsletters to inform users: more than ten newsletters have already been created, notably the Terrain, Nuevo mundo and Cultures & conflits newsletters;
• an internal newsletter and mailing lists for each journal team;
• publicly available statistics;
• a publishing blog, using the CLEO's new service Hypotheses.org <http://www.hypotheses.org>;
• web feeds: for example, a journal can publish the latest news related to its discipline (through Calenda, le calendrier des sciences sociales) and/or to its own activities.
Revues.org also maintains and secures servers and databases.
3. Citability and visibility
From the very beginning, each document published on Revues.org has received a simple and stable URL. This ensures the long-term preservation of articles on the web and makes them citable in other work, like a traditional paper publication. Revues.org plans to use DOIs (Digital Object Identifiers) in order to be linked to the CrossRef service (a sketch of how DOI resolution works follows the list below). A department within Revues.org is dedicated to optimising the web visibility of the portal. Revues.org is now clearly identified by several major scientific citation indexes and search engines. It is linked to:
• Scirus (Holland), the Elsevier scientific search engine,
• Google Scholar (USA), Google's scientific search engine,
• the Hispanic American Periodicals Index (HAPI, USA),
• the Directory of Open Access Journals (DOAJ, Sweden),
• the SUDOC (France), Système universitaire de documentation,
• OAISTER (USA), the biggest search engine of its kind, harvesting more than 800 OAI-PMH repositories worldwide,
• Intute (England).
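The abstract does not describe how DOI-based linking would be wired up; as a rough illustration of the standard mechanism (not Revues.org's own implementation, and using a placeholder DOI), a registered DOI resolves to the publisher's article URL through the public DOI resolver:

    import urllib.request

    # Placeholder DOI for illustration; substitute any registered DOI.
    doi = "10.1000/xyz123"
    resolver_url = "https://doi.org/" + doi

    # The DOI resolver answers with an HTTP redirect to the article's current location,
    # which is what keeps DOI-based citations stable even if the journal's site moves.
    request = urllib.request.Request(resolver_url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        print("DOI", doi, "currently resolves to:", response.geturl())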
4. Dissemination of scientific information
The CLEO has also created Calenda, le calendrier des sciences sociales, the main online announcement service in the humanities and social sciences in France. Since 2000, it has been publishing information about colloquia, seminars, job offers and calls for papers. It currently stores 8,000 entries and welcomes 100,000 visitors every month. Announcements remain accessible through a stable URL in order to build a record of social sciences activities. Each announcement proposal is submitted to a verification process supported by a scientific committee. The CLEO is currently developing a new online service, "Les manuscrits de Revues.org" (http://manuscrits.revues.org), a free, multilingual online tool to collect, select and correct articles on a double-blind peer review basis. It will be adapted for journals working with international teams.
AbstractMaster®
Daniel Marr, FSA
Associate/Consultant, Toronto, Canada
http://www.abstractmaster.com
email: dmarr@abstractmaster.com
Keywords: National Library of Medicine, electronic search, literature tracking and sharing
AbstractMaster is powerful search and data-management software specifically designed to help users quickly locate relevant medical articles indexed by the National Library of Medicine* (NLM), to keep track of searches and of what one has read, to catalogue references and full-text articles, and to assist in the review and analysis of articles for referencing, publishing and research purposes. In short, it is an electronic finder, cataloguer and organizer for the medical sciences. An important feature of AbstractMaster is the ability to dynamically assign individual relevancy ratings to references based on importance, and to share findings and collaborate with colleagues, researchers and other interested healthcare personnel via the AbstractMaster Collaborators' Corner. The goal is to collectively end up with an abstract for every reference, to supplement search criteria, and to use global relevancy scores to assist researchers in their endeavours. Not only will researchers be able to simultaneously search the NLM and the AbstractMaster Collaborators' Library, they will also have available for free download specific 'Editions' identified and created with the help of specialists/experts, i.e. collaborators, in their respective fields. For example, the HIV/AIDS Edition contains over 300 thousand references; the Oncology Edition, 1.5 million; PharmaScience, 125 thousand; Anthropology, 200 thousand. Via the AbstractMaster web site, experts in their fields will be invited to become members of the AbstractMaster Collaborators' Corner. The Corner will become the basis for the AbstractMaster Collaborators' Directory, which will list interested specialists/experts by specialty and by availability for services such as speaking, clinical collaboration and research. The Corner is thus designed to connect specialists and experts to initiate discussions and the sharing of information in their respective fields. It is the position of AbstractMaster that full-text articles should be made more readily available and, if not free, then priced so as not to encumber their acquisition. AbstractMaster provides a means for researchers to share full-text articles more easily via conduits to central repositories, publishers' domains, medical/science libraries and individual collaborators' libraries, for free or at minimal charge.
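AbstractMaster's own query interface is proprietary and is not described in this abstract; purely as an illustration of the kind of NLM search it automates, the following sketch uses the NLM's public Entrez E-utilities API to run a PubMed query (the search term is arbitrary and error handling is omitted):

    import urllib.parse
    import urllib.request

    # Illustrative PubMed search via the NLM's public E-utilities service.
    # This is NOT AbstractMaster's internal mechanism, just the public API it parallels.
    params = urllib.parse.urlencode({
        "db": "pubmed",                   # search the PubMed database
        "term": "hiv drug resistance",    # arbitrary example query
        "retmax": "5",                    # return at most five PubMed IDs
    })
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + params

    with urllib.request.urlopen(url) as response:
        print(response.read().decode("utf-8"))  # XML listing the matching PMIDs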
A Deep Validation Process for Open Document Repositories
Wolfram Horstmann1, Maurice Vanderfeesten2, Elena Nicolaki3, Natalia Manola3
1 Bielefeld University, P.O. Box 10 01 31, 33501 Bielefeld, Germany; e-mail: wolfram.horstmann@uni-bielefeld.de
2 SURF Foundation, P.O. Box 2290, 3500 GG Utrecht, The Netherlands; e-mail: Vanderfeesten@surf.nl
3 National & Kapodistrian University of Athens, Panepistimioupolis-Ilissia, Athens 15784, Greece; e-mail: {natalia; enikol}@di.uoa.gr
Abstract
Institutional document repositories show systematic growth as well as sustainable deployment. They therefore represent the current backbone of a distributed repository infrastructure. Many developments for electronic publishing through digital repositories are heading in the direction of innovative value-added services such as citation analysis or annotation systems. A rich service layer based on machine-to-machine communication between value-added services and document repositories requires reliable operation and data management within each repository. Aggregating search services such as OAISTER and BASE provide good results, but in order to provide good quality they also have to overcome heterogeneity by normalizing much of the data they receive and by building specific profiles, sometimes even for one individual repository. Since much of the normalization is done on the side of the service provider, it often remains unclear (perhaps also to the manager of a local repository) exactly which data and services are exposed to the public. Here, an exploratory validation method for testing specification compliance in repositories is presented.
Keywords: OAI-PMH; Validation; Harvesting; Institutional Repository; Open Access
1. Introduction
Many developments for electronic publishing through digital repositories are heading in the direction of innovative value-added services (e.g. citation analysis or annotation systems) and abstract data representation of all types of resources (e.g. primary data or educational material). Still, conventional document repositories show the most systematic growth [1] as well as high-quality, sustainable deployment. They therefore represent the current backbone of a distributed repository infrastructure. A rich service layer based on machine-to-machine communication between value-added services and document repositories requires reliable operation and data management within each repository. Aggregating search services such as OAISTER [2] and BASE [3] provide good results, but in order to provide good quality they also have to normalize much of the data they receive and build specific profiles and processes, sometimes for individual repositories. Since much of the normalization is done on the side of the service provider, it often remains unclear (perhaps also to the manager of a repository) exactly which data and services are exposed to the public by a local data provider. Existing validation techniques [4, 5] are important, but their current scope is limited to testing basic compliance with OAI-PMH and simple DC. As a consequence, quantitative assessments of data quality, i.e. of what is specifically exposed by a repository and by the whole repository landscape, are widely missing. Mature infrastructures should, however, provide reliable data resources, robust standards, corresponding validation mechanisms and a systematic change-request cycle. Here, an exploratory validation method for testing specification compliance in repositories is presented. It produces specific and quantitative data on the quality of the technical and content-related behaviour of a repository and shall be developed further into a tool that can be applied by individual repository managers for monitoring the quality of their repository.
2. Methods
The basis for the validation is the DRIVER [6] Guidelines for Content Providers: "Exposing textual resources with OAI-PMH" [7]. They assume the usage of OAI-PMH [8] and are strongly influenced by the guidelines for using simple Dublin Core of Powell et al. [9], DC Best Practice [10], the DINI certificate for document and publication repositories [11] and experiences from DAREnet [12]. Software has been developed by the National & Kapodistrian University of Athens that is designed for repository managers or 'curators' of repository networks. It runs automated tests for three aspects: (i) general compliance with OAI-PMH, (ii) compliance with DRIVER-specific recommendations for the OAI-PMH implementation, and (iii) metadata compliance according to the DRIVER guidelines. Aspect (i) tests the validity of the XML according to the OAI-PMH schema in a variety of use patterns to discover flaws in expected behaviour. Aspect (ii) tests whether several strategies stay within the boundaries of the DRIVER guidelines, such as the deletion strategy, the batch size or the expiration time of a resumption token. Aspect (iii) looks into the records and tests how the simple Dublin Core fields are used compared to the recommendations in the DRIVER guidelines. Technically, the DRIVER validator is a web application (Servlet/JSP based). Its main characteristics are:
• it is a rule-based system allowing end users to customize the type of validation they want to perform through the selection of specific predefined rules;
• end users are able to validate multiple sets per repository or multiple repositories in a batch mode, also having the option of sampling for a "quick-look" operation;
• it produces comprehensive, full, on-line reports of the results with direct links to the original repository records;
• it provides registration means and persistent storage of the results so that end users may view the history of the changes performed in their repositories;
• it provides mechanisms for scheduled or periodic validation runs;
• it is extendable, so that the predefined types of rules may be configured to run on additional fields/attributes of the data.
Moreover, the internal software architecture is built from components, both local and remote Web Services, so that the system (i) is fully distributed, in order to accept and process simultaneous requests, (ii) may be used (some of its core components at least) by other services (e.g. the DRIVER Aggregator), and (iii) allows for extension to protocols beyond OAI-PMH. It is built with the Apache Struts 1.3.5 web development framework and is deployed and tested in the Apache Tomcat 5.0.28 web application server. It is currently deployed as a beta version on a temporary test environment [13].
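The paper does not include the validator's source code; as a rough, standalone illustration of aspect (i) and part of aspect (ii) (well-formedness of an OAI-PMH response, plus the batch size and resumption token of a ListRecords reply), the following sketch queries a repository's OAI-PMH endpoint. The endpoint URL is a placeholder, and the actual DRIVER rule set is far richer than this check.

    import urllib.request
    import xml.etree.ElementTree as ET

    # Placeholder endpoint; substitute the base URL of a real OAI-PMH data provider.
    BASE_URL = "http://repository.example.org/oai"
    OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

    def check_list_records(base_url: str, prefix: str = "oai_dc") -> None:
        """Fetch one ListRecords batch and report well-formedness, batch size and resumption token."""
        url = base_url + "?verb=ListRecords&metadataPrefix=" + prefix
        with urllib.request.urlopen(url) as response:
            raw = response.read()
        root = ET.fromstring(raw)  # raises ParseError if the XML is not well formed
        records = root.findall(".//" + OAI_NS + "record")
        token = root.find(".//" + OAI_NS + "resumptionToken")
        print("well-formed XML, batch size:", len(records))
        if token is not None:
            print("resumptionToken:", (token.text or "").strip(),
                  "expirationDate:", token.get("expirationDate"))

    check_list_records(BASE_URL)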
3. Results
In a sample of 61 repositories from 5 countries (Belgium, France, Germany, the Netherlands, UK), test validation was performed on a set of 300 records each. Results indicate that no repository fully complies with the guidelines, but many come close. The overall picture is heterogeneous: (i) for general OAI-PMH validation, results show that only a few repositories in the sample are non-harvestable; on the other hand, only a few repositories deliver 100% valid XML. For the specific DRIVER characteristics with respect to OAI-PMH (ii) and simple Dublin Core (iii), some systematic behaviours resulting from national conventions or platform conventions such as ePrints [14], DSpace [15] and OPUS [16] are observable, but many variations relating to the data entry of single records also contributed to the error distribution.
4. Discussion
The validation method is functional. Further improvements lie in removing points of confusion by (i) changing some of the functions of the validation rules so that they better fit the guideline explanations, and (ii) changing descriptions in the DRIVER Guidelines that allow multiple interpretations of the validation rules. New DRIVER guidelines are currently under development. Advanced test routines are still missing, for example for providing critical performance indicators such as record throughput, which measures the number of records delivered per second. Quantitative values shall be used to visualise performance and raise awareness among repository managers. Communication between the DRIVER helpdesk and the Mentor service [17] and repository managers, based on these reports, will be analysed in order to design the next development phase: establishing a change-request cycle, defining the validity threshold and automating the validation process.
5. References
[1] http://www.opendoar.org/
[2] http://www.oaister.org/
[3] http://www.base-search.net/
[4] http://www.openarchives.org/data/registerasprovider.html
[5] http://roar.eprints.org/
[6] http://www.driver-community.eu
[7] http://www.driver-support.eu/en/guidelines.html
[8] http://www.openarchives.org/pmh/
[9] http://eprints-uk.rdn.ac.uk/project/docs/simpledc-guidelines/
[10] http://oai-best.comm.nsdl.org/cgi-bin/wiki.pl
[11] http://www.dini.de/
[12] http://www.darenet.nl/en
[13] http://validator.driver.research-infrastructures.eu
[14] http://www.eprints.org/
[15] http://www.dspace.org/
[16] http://opusdev.bsz-bw.de/trac
[17] http://www.driver-support.eu/mentor.html
Publishing with the CDL's eXtensible Text Framework (XTF)
Kirk Hastings, Martin Haye, Lisa Schiff
California Digital Library, US
The eXtensible Text Framework (XTF) is a powerful and highly configurable platform for providing access to digital content. It is specifically designed to support rapid development by allowing the implementer to easily control data flow and presentation at any point using the easily learned XSLT language. However, this level of flexibility can make for a steep learning curve. In this hands-on workshop we will concentrate on some common tasks in the development of a custom interface and guide participants through their solution. Our goal is to give you enough skills to continue development on your own. Students will be provided with media containing a variety of content types to work on, example solutions, and the latest stable version of XTF. Please be sure to bring a laptop with a CD/DVD drive and a power cord. If you have never been exposed to XSLT, please take the time to complete one of the many available online tutorials (see below). Specifics to be covered include:
• Overview of XTF
• Feature suite supported by XTF
• Currently supported content types
• Suggested work-flow for application development
• Overview of default application
• Installation and configuration of XTF
• XSLT Basics
• Indexing exercise
• Search exercise
• Document view exercise
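XTF itself runs its XSLT stylesheets inside a Java web application, and the workshop materials are distributed on the provided media; purely as a way to warm up on the XSLT basics listed above before the session, the following hedged sketch applies a stylesheet to a document using Python's third-party lxml library (the file names are placeholders you would supply yourself, and this is not part of XTF).

    from lxml import etree  # third-party package: pip install lxml

    # Placeholder file names; substitute any XML document and XSLT stylesheet of your own.
    document = etree.parse("sample-document.xml")
    stylesheet = etree.XSLT(etree.parse("sample-stylesheet.xsl"))

    # Apply the transform and print the result, e.g. an HTML rendering of the document.
    result = stylesheet(document)
    print(str(result))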
Resources:
• XTF Documentation
• Basic XSLT Tutorial
• In-depth XSLT Tutorial
Open Journal Systems: Working with Different Editorial and Economic Models
Kevin Stranack, Simon Fraser University Library, Canada
John Willinsky, Stanford University, US
This session will provide an introduction to the Public Knowledge Project and an overview of the Open Journal Systems (OJS) online publication management software (http://pkp.sfu.ca/ojs). It will include an examination of the publishing process, peer review and editorial workflow, web site management, and tips for increasing journal visibility. This half-day, hands-on workshop is aimed at editors, publishers, librarians, and others interested in learning about this free, open source software that is being used by over 1000 journals around the world. Participants will come away with the ability to start up and operate their own online journal management system.
Repositories that Support Research Management
Leslie Carr
Southampton University, UK
The aim of institutional repositories has been to serve the interests of faculty – researchers and teachers – by collecting their intellectual outputs for long-term access, preservation and management. However, management are members of an institution too, and can reasonably ask for the repository to provide services that assist in the task of research management. There is also an entirely pragmatic argument for supporting management agendas: experience shows that in order to attain the engagement of the faculty, it is necessary to obtain the support of the institutional management. But in order to gain management support, a repository has to demonstrate a measurable and effective contribution to current management agendas and concerns, e.g. research management or research assessment.
Background Reading
Carr, L., White, W., Miles, S. and Mortimer, B. (2008) Institutional Repository Checklist for Serving Institutional Management. In: Third International Conference on Open Repositories 2008, 1-4 April 2008, Southampton, United Kingdom. This document is a "For Discussion" draft that came from the Research Assessment Experience session at Open Repositories 2008. Comment is invited from managers of all repository platforms to share experience of the demands of the processes involved in supporting research assessment at an institutional level. In particular, the demands placed on an institutional reporting tool are greater than normal. This document lists the success criteria as distilled from the authors' recent experience.
Carr, L., White, W. and Brown, M. (2008) Collecting & Verifying RAE Output. Presentation at: Beyond the RAE 2008: Bibliometrics, League Tables and the REF, 30 April 2008, King's College, London. This presentation was given as a reflective summary of the University of Southampton's experience in using their repository for collecting evidence for the UK's national Research Assessment Exercise.
Participants will learn about (discuss, reflect upon) the issues relevant to providing repository support to institutional management for research management and assessment. The workshop will provide participants with a checklist of the issues and recommendations for repository managers, based on practical experiences drawn from various repository managers.
Format:
• Experience with preparing for research assessment
• Presentations & discussions based on position papers
• Discussions of issues arising
• Recommendations for practical progress
Opening Scholarship: Strategies for Integrating Open Access and Open Education
Eve Gray1, Melissa Hagemann2, Heather Joseph3, Mark Surman4
1 University of Cape Town, South Africa
2 Open Society Institute, USA
3 SPARC, USA
4 Shuttleworth Foundation, Canada / South Africa
The objectives of this workshop are to:
1. Inform publishers and scholars involved in open access about recent developments in open education, including the Cape Town Declaration.
2. Identify lessons from the success of the open access publishing movement that can be applied to open education, and brainstorm opportunities for action based on these lessons.
3. Surface opportunities for long-term synergies and interconnection between open access and open education, feeding into the broader agenda of open scholarship.
Agenda
1. Overview: core concepts in open access and open education (Melissa Hagemann)
2. Achievements of the open access movement (Heather Joseph)
3. Status of the open education movement (Mark Surman)
4. Making connections: the link between access and education (Eve Gray)
5. Mapping lessons from OA to OE (facilitated discussion led by Melissa and Mark)
Format
This workshop will include presentations, large-group mapping and visioning, and small-group action planning sessions. These techniques will ensure that all participants have a chance to (a) learn about the pieces of the open access / open education universe that they are not yet aware of, and (b) contribute to charting the future course of action and collaboration within both of these movements. After the conference is over, the materials produced through the visioning and small-group sessions will be compiled into a short article by the workshop facilitators.
Suggested resources for preparation
1. Cape Town Open Education Declaration (framework for open education): http://www.capetowndeclaration.org/
2. Budapest Open Access Initiative (framework for open access): http://www.soros.org/openaccess/
3. Flatworld Knowledge open education videos (basic concepts in open education explained): http://www.flatworldknowledge.com/
Boost your capacity to manage DSpace
Wayne Johnston1, Rea Devakos2, Peter Thiessen3, Gabriela Mircea4
1 University of Guelph, Canada
2 University of Toronto Libraries, Canada
3 University of Toronto Libraries, Canada
4 University of Toronto Libraries, Canada
Abstract
This workshop will provide attendees with the knowledge needed to manage a DSpace repository. It will cover essential administration and maintenance tasks but will not focus on the more technical details of installing and customizing the software. We will focus on DSpace version 1.4.2, but we will also introduce some of the changes and new features offered in version 1.5.
Target audience: site administrators, collection managers, service planners
Agenda
• general information about site administration
• creating/editing communities and sub-communities
• creating/editing collections
• managing e-people
• item interventions
• managing the metadata registry
• managing authorization policies
• basic configuration/customization
• The dspace.cfg Configuration Properties File (an illustrative fragment follows this list)
• Wording of E-mail Messages
• The Metadata and Bitstream Format Registries
• The Default Submission License
• Customizing the Web User Interface
• Configuring LDAP Authentication
• Displaying Image Thumbnails
• Displaying Community and Collection Item Counts
• Configuring Controlled Vocabularies
• Introduction to DSpace 1.5
• Introduction to Manakin
• Introduction to OAI-PMH
• Introduction to the Handle service
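As light preparation for the configuration topics above, the fragment below sketches the kind of properties set in dspace.cfg. All values are placeholders, the property names are those commonly found in the DSpace 1.4.x line, and the exact set differs between versions, so the dspace.cfg shipped with your own installation remains the authority.

    # Illustrative dspace.cfg fragment -- placeholder values; verify names against your installation.
    dspace.url = http://repository.example.edu/dspace        # base URL of the web user interface
    mail.server = smtp.example.edu                           # SMTP host used for e-mail messages
    default.language = en_US                                 # default metadata language

    # Web UI options discussed in the workshop:
    webui.browse.thumbnail.show = true      # display image thumbnails in browse listings
    webui.strengths.show = true             # display community and collection item counts

    # Enable LDAP authentication alongside the built-in password authentication.
    ldap.enable = true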