Project Acronym: CRISPool Version: 2.2 Contact: akc@st-andrews.ac.uk Date: 01/12/2010
JISC Final Report Project Information Project Acronym
CRISPool
Project Title
Using CERIF-XML to integrate heterogeneous research information from several institutions into a single portal
Start Date
1 March 2010
Lead Institution
University of St Andrews
Project Director
Anna Clements
Project Manager & contact details
Anna Clements akc@st-andrews.ac.uk 01334 462761
Partner Institutions
SUPA (The Scottish Universities Physics Alliance) http://www.supa.ac.uk/ University of Edinburgh University of Glasgow EuroCRIS http://www.eurocris.org Atira A/S http://www.atira.dk
Project Web URL
http://www.crispool.org
Pilot Portal
http://crispool.atira.dk/portal
Programme Name (and number)
Information Environment Programme 2009-2011 Research Information Management Call 11/09
Programme Manager
Neil Jacobs / Frederique Van Till
End Date
31 August 2010
Document Name Document Title
Final Report
Reporting Period Author(s) & project role
Anna Clements (Project Manager) Niall Lockhart (Project Management Support Officer)
Date
Filename
URL
if document is posted on project web site
Access
√Project and JISC internal
√ General dissemination
Document History Version
Date
Comments
V1.0
31/08/2010
Circulated to partners and programme manager
V2.0
09/09/2010
Amendments from partners plus Appendices
V2.1
16/09/2010
Final version to JISC
JISC Final Report
CRISPool Project Using CERIF-XML to integrate heterogeneous research information from several institutions into a single portal
Author(s): Anna Clements (Project Manager) Niall Lockhart (Project Management Support Officer)
Contact Anna Clements akc@st-andrews.ac.uk University of St Andrews Business Improvements Butts Wynd Building St Andrews Fife KY16 9AD
Table of Contents
Acknowledgements
Executive Summary
Background
Aims and Objectives
Methodology
Implementation
Sourcing the data
Producing the CERIF-XML
Outputs and Results
Outcomes
Conclusions
Implications
References
Appendix 1: CRISPool Data Dictionary
Appendix 2: Class Scheme Data
Appendix 3: CRISPool CERIF to PURE4 mapping
Appendix 4: Technical Summary - CRISPool project prototype implementation
Acknowledgements
The CRISPool project would like to acknowledge the contributions of the following organisations to the success of the project:
• JISC for part-funding the project through the Information Environment Programme 2009-11 and its Research Information Management Call 11/09
• The project partners for their invaluable contributions to the project:
o SUPA (The Scottish Universities Physics Alliance) [1]
o University of Glasgow
o University of Edinburgh
o EuroCRIS [2]
o Atira A/S [3]
• The ERIS [4] and R4R (Readiness4Ref) [5] project teams for continuing enthusiastic support and advice.
1 www.supa.ac.uk
2 www.eurocris.org
3 www.atira.dk
4 http://eriscotland.wordpress.com/
5 http://www.kcl.ac.uk/iss/cerch/projects/portfolio/r4r.html
Executive Summary
We have successfully used CERIF-XML to bring together data on people, organisations and publications from three Universities for the SUPA [Scottish Universities Physics Alliance] research pool. These data are viewable and searchable at http://crispool.atira.dk/portal
This was the main aim of the project and has been achieved within the limited timescale and budget of this JISC call. The collaborative aspect of the project involving partner institutions, pool administrators, euroCRIS, third party developers, Atira, and related JISC-funded projects, Readiness4REF (R4R) and ERIS, has meant that a wide range of stakeholders have been involved at all stages to help ensure the success of the project. The approach taken meant that we were learning how to use CERIF-XML as we went along, so the expert help and advice of euroCRIS and Atira, who are members of the euroCRIS CERIF Task Group, and the sharing of preliminary findings from the Readiness4Ref project led by King's College London have been invaluable. Additionally, the enthusiastic support from the ERIS project has provided a channel to other pools in Scotland, several of which have expressed interest in the project.
The basic steps, once we had agreed on which data the partner Institutions (Glasgow and Edinburgh) could reasonably provide within the timescale, were that the University of St Andrews created some sample CERIF-XML files for the other University partner institutions which would allow them to generate the data needed for the portal. Each institution took a different approach to generating their XML data but all used relatively low-tech text editing and search and replace tools. No additional specialist knowledge was required.
Although the main aim of the project was to test the suitability of CERIF-XML as an exchange format, it was evident that those Institutions with an existing culture of integrated research information management were better able to provide the required data quickly. For St Andrews there was no additional work required as all data were fed in from their existing CRIS. Glasgow, which has had an in-house integrated research information management system for many years, was able to provide data on people and publications easily. Edinburgh were able to provide data on people but unfortunately were not able to provide publications data within the project timeframe.
Returning to the main aim of testing CERIF-XML's suitability as an exchange format, the CERIF data model fully supported the requirements of the project except for two relatively minor areas which have been reported to euroCRIS. For the pilot project we have been able to work around these issues by using CERIF classifications, something that R4R has also been able to do during the exercise to map the RAE2008 schema to CERIF.
The main technical issue we have found is to do with the fragmentation of CERIF-XML into so many individual XML files. The sheer number means that it is very resource intensive to process, as each item, whether a person, organisation or publication, is defined by data in up to 10 related XML files. The issue facing the designers of CERIF is that the model itself needs to represent the real world of interrelated research information (the fully connected graph), whereas XML is a linearised tree structure and cannot natively represent the complexity required. At the same time, XML is the vehicle of choice for data exchange in web services.
In conclusion, all partners are positive about the results of the CRISPool project and SUPA are keen to move forward from a pilot to a sustainable solution. While there are still areas to improve on (for example the processing of multiple XML files), the sector as a whole can take heart from our findings, which reinforce the conclusion from the EXRI report that CERIF should be used as the exchange format within the UK research information sector.
Background
The importance of collecting, maintaining and exchanging good quality, comprehensive and current research information has risen up the agenda in the UK Higher Education sector following the recently completed RAE2008 data collection exercise. In particular, the use of a standard exchange format, the Common European Research Information Format (CERIF), to improve interoperability of data between the different stakeholders (Funding Councils, Research Councils and other funders, HESA, Institutions) has been discussed by a JISC-led Research Information Management Group. This group commissioned the EXRI project to examine the suitability of CERIF versus other possible standards, or no standard. [6] The final report recommends the use of CERIF as a standard exchange format between the stakeholders. CRISPool builds directly on Recommendation 7 from this report: '.. pilots to look at real exchange of research activity data between HEIs using CERIF'.
Research Pooling [7] is well established in Scotland and there are currently 13 pools, the oldest being SUPA, established in 2005. The pools were set up in order to help create and maintain a critical mass of resources needed for Scotland's universities to carry out world-class research. The success of the initiative was highlighted by the RAE2008 results in which Scottish institutions increased their share of the UK's world-class research from 11.6% in 2001 to 12.3%, even though the country has only 8.5% of the UK population. Every Scottish institution now has world leading research in at least one of its disciplines.
This approach is also being discussed in the national UK press, as reported in Times Higher Education [8] (THE, 5th August 2010), which presented the views of David Price, Vice-Provost for Research, and Stephen Caddick, Vice-Provost for Enterprise, both at University College London: ' … the coming cuts to the sector will necessitate "major restructuring" to preserve the global standing of the elite universities on which the success of UK higher education depends. The elite, they propose, should pool and coordinate their research strengths to form hubs of about half a dozen regional "research clusters".'
The current information infrastructure underpinning SUPA, as with the other pools, is poor, and much resource is spent, with duplication of effort, by both SUPA administrators and members of the partner institutions in collecting and checking data on staff, students and publications. This information is held at member institutions in different formats with different vocabularies used, for example, for similar job descriptions or publication types. This information is collated and presented for reporting to the Scottish Funding Council.
In SUPA, information on staff and students is used to provide access to the My.SUPA portal [9], a virtual learning environment and research collaboration portal based on Moodle [10]. Logged-in staff and students are given access to lists of other users along with some limited profile information listing interests in various research themes, with the aim of fostering new collaborations between researchers across Scotland.
Gathering publications information has been particularly resource intensive. In previous years it has been carried out by requesting information from department administrators, who passed data using spreadsheets. SUPA administrators verified the information, emailing each staff member or research student, and provided opportunities to make corrections and additions prior to publication of a printed publications list. This final publications list is only available in a limited electronic form: a PDF version of the printed copy. This monolithic document does not provide any way to analyse the information except in the order and format provided, and by a simple text search in the PDF document.
Data supplied by the departments is of varying quality, and is sometimes provided indirectly rather than being sourced from institutional information systems. The complete dataset is based on information from different information systems with different business rules and data constraints. Feedback from several department administrators involved in these requests for information suggests that data is gathered from a variety of sources including institutional systems, departmental information systems or local files. This approach responds to the immediate query expeditiously, rather than following a repeatable process. Several requests and clarifications may be required in each information gathering exercise. The data is generally only updated on an annual basis and is not consistently maintained between annual reporting cycles, so it quickly goes out of date and becomes considerably less useful.
The CRISPool project partners have been working in the area of research information for several years, including innovative projects to link research information and management systems to open access repositories. Glasgow University has developed an innovative integrated research management system, the University of Edinburgh is leading a consortial approach to driving the open access agenda forward in Scotland and the University of St Andrews, in a joint project with the University of Aberdeen, is the first UK institution to implement the CERIF-based CRIS [Current Research Information System] product (PURE), made by the Danish company, Atira. A key stipulation by both St Andrews and Aberdeen, and supported by Atira, is that the conceptual data model developed for the UK should be made available to other UK Institutions implementing or investigating a CERIF-CRIS independent of which system they choose.
6 http://www.jisc.ac.uk/publications/briefingpapers/2010/bpexriv1.aspx#downloads
7 http://www.sfc.ac.uk/research/researchpools/researchpools.aspx
8 http://www.timeshighereducation.co.uk/story.asp?storycode=412909
9 http://my.supa.ac.uk/
10 http://www.moodle.org
Aims and Objectives
CRISPool builds on the experience gained by the partners and their desire to work with other Institutions to find practical ways of reducing the overall burden of research information management across the sector. The implementation of PURE has demonstrated the suitability of CERIF for capturing research information internally within the two Institutions (St Andrews and Aberdeen). The CRISPool project had the following aims:
• To demonstrate that CERIF-XML can be used to bring data from heterogeneous, cross institutional sources together.
• To provide evidence of the benefits and costs of adopting CERIF-XML as a cross-institutional data exchange format.
The aims were to be met through the main objective:
• To build an initial portal exposing these data on the web with basic search & retrieve functionality and basic technical exhibition of data (e.g. fetching data via RSS, XML/SOAP, OAI).
Whilst testing the suitability of CERIF-XML was the primary focus of this project, there was also an expectation that organisational and information systems changes would occur as a direct result of the need to ensure data is up to date, sufficiently accurate and meets the commonly agreed criteria. The results from CRISPool are both transferable (to other exchange scenarios) and scalable (to other CERIF elements not included in this project). These aims and objective remained constant throughout the project and the Outputs and Results section below discusses the degree to which they have been met.
Methodology With a six month project and multiple partners the methodology used was to build on existing expertise : euroCRIS with their in depth knowledge of CERIF; Atira with their existing PURE CRIS
product and established expertise in CERIF-based CRIS and the partner Institutions [St Andrews, Glasgow and Edinburgh] with their experience in the area of research information and repository systems. The project was split into three strands:
1. Scoping and Investigation: defining data model (entities, relationships, constraints), common vocabularies (people, publications, organisations) to meet SUPA requirement for annual publications report. This strand also identified data sources and determined any limitations necessary due to data availability. Due to the limited time span and resource available to the project at the partner institutions we kept the data requirements to a minimum to meet what could be provided by Glasgow and Edinburgh and was still useful to SUPA. Thus Glasgow started with their dataset produced for the REF Bibliometrics Pilot project in 2008-9. This data set already linked outputs to staff using the institutional ID. Edinburgh aimed to provide all current academic staff in the School of Physics and Astronomy and then match them against publications data from the Edinburgh Research Archive [ERA]. A comprehensive set of publications data related to current academics in the School of Physics and Astronomy at St Andrews was provided from the PURE CERIF-CRIS database in CERIF-XML format.
Figure 1: Summary of data flow in CRISPool
2. Technical delivery: configure and install PURE for the defined data model and build CERIF-XML integrator; data sources mapped to CERIF-XML to produce single or multiple data streams for integration into PURE; a simple portal was built to expose data via web pages, web services and RSS feeds.
Again, because of the short timeframe and limited budget, existing technology [PURE4 product] formed the backend database and administrator functionality. Atira built CERIF-XML export [to export St Andrews data to CERIF-XML] and import functions [to import St Andrews, Glasgow and Edinburgh CERIF-XML] as add-ons to the PURE4 product. Each function was triggered through the administrative interface on an ad-hoc basis or using standard cron job configuration. Finally, a simple portal was created based on the current SUPA website design.
3. Engagement and Evaluation: conduct a baseline review during SUPA annual data collection round; time and effort to identify sources and map to CERIF; advantages and disadvantages; ongoing engagement with regional, national and European projects and groups e.g. ERIS led by Edinburgh [project manager is member of CRISPool], Enquire led by Glasgow [ditto], Readiness4Ref (led by KCL), UCISA, WRN/ARMA, euroCRIS. The engagement strand ran throughout the project and is continuing, for example, at the Repository Fringe, Sep 2010 at Edinburgh. Due to resource issues at SUPA a full baseline review was not carried out; however, feedback from SUPA staff has been incorporated into the Outputs and Results section.
Implementation
Two workshops were held early in the project [March and April 2010] to familiarise all partners with the CERIF model, finalise the data requirements (taking into account what was achievable over the short timescale and also useful to SUPA), and share the experiences of the R4R project in mapping RAE2008 to CERIF-XML. We also agreed to create institution-specific unique IDs for the organisations, persons and publications being brought together into CRISPool by using the UK Learner Provider number as a prefix to institutional IDs. The scope of CRISPool did not allow for any time to deduplicate/merge data on publications and, potentially, people. A CRISPool project was created within the existing ERIS project online collaboration tool, Basecamp [11], to help plan and manage the project.
11 http://basecamphq.com/
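The ID convention agreed at the workshops (provider number as a prefix to the institutional ID) can be illustrated with a minimal Python sketch. The separator, the function name and the provider numbers used below are assumptions made for illustration only; the report does not state the exact format that was used. The 128-character limit is the CERIF-XML 2008-1.1 ID length reported in the section on producing the CERIF-XML.

```python
# Maximum ID length in CERIF-XML 2008-1.1, as reported in this document.
MAX_ID_LENGTH = 128


def make_crispool_id(provider_number: str, local_id: str, sep: str = "-") -> str:
    """Prefix an institutional ID with the institution's provider number.

    The separator is a hypothetical choice; any character that cannot occur
    in a provider number would serve the same purpose.
    """
    provider_number = provider_number.strip()
    local_id = local_id.strip()
    if not provider_number or not local_id:
        raise ValueError("both the provider number and the local ID are required")
    combined = f"{provider_number}{sep}{local_id}"
    if len(combined) > MAX_ID_LENGTH:
        raise ValueError(f"ID longer than {MAX_ID_LENGTH} characters: {combined!r}")
    return combined


# Placeholder provider numbers (not real ones): the same local staff ID at two
# institutions yields two distinct pool-wide IDs.
id_a = make_crispool_id("10000001", "staff-0042")
id_b = make_crispool_id("10000002", "staff-0042")
```

Because the prefix differs per institution, records can be pooled without ID collisions, although (as noted above) this does nothing to deduplicate a person or publication held at more than one institution.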
Figure 2: Elements of CERIF used in CRISPool
Figure 2 shows the basic CERIF model with the elements used in CRISPool highlighted. In all a total of 30 CERIF-XML files were used in the pilot.
cfPers_CORE
cfPers_Class-LINK
cfPers_EAddr-LINK
cfPers_OrgUnit-LINK
cfPers_PAddr-LINK
cfPers_ResPubl-LINK
cfPersKeyW-LANG
cfPersName-ADD
cfPersResInt-LANG
cfOrgUnit-CORE
cfOrgUnit_EAddr-LINK
cfOrgUnit_Class-LINK
cfOrgUnit_OrgUnit-LINK
cfOrgUnit_PAddr-LINK
cfOrgUnit_ResPubl-LINK
cfOrgUnitName-LANG
cfResPubl-RES
cfResPubl_Class-LINK
cfResPubl_ResPubl-LINK
cfResPublAbstr-LANG
cfResPublBiblNote-LANG
cfResPublKeyW-LANG
cfResPublAbbrev-LANG
cfResPublSubtitle-LANG
cfResPublTitle-LANG
cfEAddr-2ND
cfPAddr-2ND
cfEAddr_Class-LINK
cfClassTerm-LANG
cfClass-CLASS
Full details are in Appendix 1: CRISPool Data Dictionary, Appendix 2: CRISPool Class Scheme Data and Appendix 3: CRISPool CERIF to PURE4 mapping.
Sourcing the data
For Glasgow this was straightforward once the data requirements had been finalised. The information on persons came from the Institutional HR database and that for publications from the data set produced for the REF bibliometrics pilot. Glasgow considered using data from their institutional repository but at the time this did not link publications to internal authors via the institutional ID. The data from the HR database was already integrated with the research management system at Glasgow and so there were no problems with reusing this data for CRISPool.
For Edinburgh, detailed person data was provided from the Institutional HR database. While the data provided was of good quality, it is worth noting that it took the team at Edinburgh some time to find the right contact within HR who could authorise use of the data for CRISPool. At Glasgow and St Andrews these links had already been made and so no delay was incurred. The publications data was sourced from the central closed Publications Repository and checked to ensure the bibliographic data could be made publicly available. It had originally been planned to use the public Edinburgh Research Archive, but on investigation this included only two publications that were both by academics in the current HR feed and journal articles. Edinburgh therefore switched to using data from the closed repository, the Publications Repository, which had many more articles, and was the repository used for the RAE submission.
For St Andrews all the required data was sourced directly from the Institution's PURE4 CRIS, which itself is synchronised daily with data from the Institutional HR database. The CRIS is the golden source of publications data.
Producing the CERIF-XML
Following on from the workshops, the University of St Andrews created template files with some sample data for each of the CERIF-XML files to be used in the pilot. We used documentation from the eurocris.org web-site and advice and examples from the R4R mapping documents. These sample files were validated against the CERIF-XML 2008-1.1 schema at http://www.eurocris.org/fileadmin/cerif-2008/2008_1.1/XML-SCHEMAS/
Note: we started with the CERIF-2008_1.0 version but switched to the later version in order to be able to use IDs of greater length. Version 1.0 handled IDs up to 32 characters long; version 1.1 up to 128 characters.
The CERIF-XML sample files were created using the text editor Notepad. The sample CERIF-XML files were distributed to the Universities of Edinburgh and Glasgow via the Basecamp site for them to populate with their own data.
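The approach of populating a sample template with institutional data can also be scripted rather than done by hand. The following Python sketch is not the process used in the project; it is an illustration under stated assumptions. The element names in the template and the bare `<CERIF>` root element are simplified assumptions and should be checked against the CERIF-XML 2008-1.1 schema, and the sample row is invented.

```python
from string import Template
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Illustrative record template. The element names are assumptions made for
# this sketch, not copied from the CERIF-XML schema.
PERSON_NAME_TEMPLATE = Template(
    "  <cfPersName>\n"
    "    <cfPersId>$person_id</cfPersId>\n"
    "    <cfFamilyNames>$family_names</cfFamilyNames>\n"
    "    <cfFirstNames>$first_names</cfFirstNames>\n"
    "  </cfPersName>\n"
)


def build_document(rows):
    """Merge spreadsheet-style rows into one XML document, returned as UTF-8 bytes."""
    records = []
    for row in rows:
        # Escape &, < and > so that values such as "Smith & Jones" stay well-formed.
        safe = {key: escape(str(value)) for key, value in row.items()}
        # substitute() raises KeyError if a column needed by the template is missing.
        records.append(PERSON_NAME_TEMPLATE.substitute(safe))
    text = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<CERIF>\n" + "".join(records) + "</CERIF>\n"
    )
    # Encoding explicitly avoids files being saved in a platform default encoding.
    return text.encode("utf-8")


# Invented sample row; the ID follows the hypothetical prefix convention.
rows = [{"person_id": "10000001-0042", "family_names": "Müller", "first_names": "Zoë"}]
document = build_document(rows)

# Parsing confirms the result is well-formed. It does not validate the document
# against the CERIF-XML schema, which needs a schema-aware tool.
root = ET.fromstring(document)
```

Generating the files this way keeps element order and tag pairing fixed by the template, so only the data values vary between records.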
The University of Glasgow used MS Excel to create a worksheet for each CERIF-XML file required, populating the worksheets with data on persons from their Human Resources database and publications from the REF Bibliometrics pilot; the latter included links to the HR database via the Institutional staff ID. MS Word Mailmerge was then used to merge the data from each worksheet into the correct CERIF-XML template. The XML header and footer were added to each resulting .doc file, which was saved as .txt and then renamed as .xml. This process took approximately 1 day to complete.
For the University of Edinburgh, the process of generating the CERIF-XML files for persons was largely manual. HR provided them with an Excel spreadsheet containing academic names, HESA numbers and job titles. These were then amalgamated with further information manually copied from the school staff webpages. This process took 2.5 days to complete. The publications data was provided as an export from the DSpace Publications Repository but has not yet been converted to CERIF-XML for importing. Related work taking place as part of R4R to create a CERIF plug-in for DSpace is expected to provide this functionality.
SUPA were provided with lists of people from all three Institutions in order to match to the existing SUPA ID and SUPA theme/s. This data was provided back to St Andrews in spreadsheet format and CERIF-XML cfPers_OrgUnit-LINK files were produced linking each person to the main and additional SUPA themes. We had originally planned to use a classification for the SUPA themes but switched to using organisations very quickly once we realised that this would allow us more flexibility in linking other entities such as persons and publications to themes. In practice SUPA treat the themes as virtual organisations.
Both sets of files (from Glasgow and Edinburgh) required tidying up before they validated successfully against the CERIF-XML Schemas. The issues included mismatched CERIF tags and elements in the wrong order. There was also a problem initially with files being saved with LATIN-1 encoding rather than UTF-8. All these issues were solved using the freeware text and Unicode editor PSPad [12]. On the whole it was a successful and straightforward low-tech process, although there were a couple of more time-consuming problems where inconsistencies in the IDs across files prevented data being linked correctly once imported. Again these could be solved using PSPad, which had good functionality for checking several files side by side.
What is evident is that the time taken initially to define requirements and prepare sample files was very important. In this project the resource was very limited and undoubtedly, if we had had more resource at the member institutions, the errors would have been much fewer. Equally, once a more automated process can be established such errors should be removed completely.
For St Andrews data was exported using the export framework in Pure. See Appendix 4 for a technical summary from Atira on importing and exporting via CERIF-XML in the Pure product.
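The ID inconsistencies described above, where a link file refers to an ID that does not appear in the corresponding core file, were found in the project by comparing files side by side in an editor. A check of that kind could be automated along the following lines. This is a sketch only: the function names are hypothetical, and the sample documents use simplified element names and no namespace, which are assumptions rather than extracts from the project files.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def ids_in(xml_bytes, id_tag):
    """Collect the text of every element named id_tag in one XML document."""
    # fromstring() raises ParseError on mismatched tags, so badly paired
    # elements are reported before any ID comparison takes place.
    root = ET.fromstring(xml_bytes)
    return {
        el.text.strip()
        for el in root.iter()
        if local_name(el.tag) == id_tag and el.text and el.text.strip()
    }


def dangling_ids(core_xml, link_xml, id_tag):
    """Return IDs referenced in the link file but absent from the core file."""
    return sorted(ids_in(link_xml, id_tag) - ids_in(core_xml, id_tag))


# Invented sample data: the second link row has a mistyped person ID.
core = b"<CERIF><cfPers><cfPersId>10000001-0042</cfPersId></cfPers></CERIF>"
link = (
    b"<CERIF>"
    b"<cfPers_OrgUnit><cfPersId>10000001-0042</cfPersId>"
    b"<cfOrgUnitId>theme-1</cfOrgUnitId></cfPers_OrgUnit>"
    b"<cfPers_OrgUnit><cfPersId>10000001-042</cfPersId>"
    b"<cfOrgUnitId>theme-1</cfOrgUnitId></cfPers_OrgUnit>"
    b"</CERIF>"
)
print(dangling_ids(core, link, "cfPersId"))  # prints ['10000001-042']
```

Running such a check on each core and link file pair before upload would catch the linking errors that otherwise only surface after import.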
Suitability of CERIF
Most of the data mapped across to the CERIF model easily but there were two areas where the CERIF data model imposed restrictions. Both of these have been raised with the euroCRIS CERIF Task Group and could be worked around for this pilot using CERIF classifications.
• Issue 1: The placing of a person's contact details as an attribute of the person rather than an attribute of the relationship between the person and organisation (cfPers_EAddr-LINK, cfPers_PAddr-LINK). In the CRISPool model, which concentrates entirely on work contact details rather than personal contact details, it is normal that a person's contact details will change as they move from job to job.
o For the pilot a workaround was used whereby the classification of the cfPers_EAddr and cfPers_PAddr relations was used to carry data. See Appendix 3 CRISPool CERIF to PURE4 mapping for details.
• Issue 2: A one to one relationship between the publication entity and URI (cfResPubl-CORE.URI). Thus we were unable to record both a DOI [13] and a URI to a full-text version in the IR against publications. This issue has been discussed within euroCRIS previously and at length and is a philosophical issue rather than a technical one. The current euroCRIS view is that each Publication object is represented by 1 and only 1 URI; if another URI is needed then that is another Publication object. This debate leads into the definitions of 'work' and 'manifestation', and so on, from the FRBR model [14], and is not part of the CRISPool project.
o For the pilot we restricted ourselves to the DOI as there was more data for this than for URIs to full-text in IRs.
12 http://www.pspad.com/en/
13 http://www.doi.org
14 http://www.loc.gov/cds/FRBR.html
Importing the CERIF-XML
The importing of the CERIF-XML into the CRISPool PURE instance was done by uploading all the XML files into WebDAV [15] folders. There were four WebDAV folders, one for each institution involved (St Andrews, Glasgow, Edinburgh and SUPA). However, there were some issues with accessing these folders due to the operating systems used by the CRISPool team. A solution was found in using a free program called NetDrive [16] to gain access to the folders. Once the XML data files were uploaded into the appropriate WebDAV folders, the data could be imported into PURE by selecting to synchronise it. If there were any problems with the CERIF-XML in any of the files being imported into PURE then all details of errors could be accessed after a failed synchronisation: the error list would give the file name and line number of each issue so that it could be amended. This detailed error logging helped to identify inconsistencies between IDs in the separate files, for instance.
The synchronisation jobs could be run repeatedly to update existing data and this functionality was used, for example, when we received additional data on external authors from Glasgow. Unfortunately, due to the number of external authors on these publications (there was an average of 150 authors per publication), the import process was taking so long that we decided to limit the authors to the first 5 (including at least one Glasgow author).
Atira adopted an agile approach to the development of the import functionality, working closely with the CRISPool team at St Andrews to test first the organisation and person import and then the publications import. This incremental approach meant we could sort out any issues with the organisations and persons before moving on to the much larger data sets containing publications data. See Appendix 4 for a technical summary from Atira on importing and exporting via CERIF-XML in the Pure product.
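The decision to limit each publication to its first 5 authors while keeping at least one Glasgow author can be expressed as a small rule. The report does not say how the internal author was retained when none fell within the first five, so the Python sketch below makes an explicit assumption: in that case the last retained slot is given to the first internal author found later in the list. The function name and the way internal authors are marked are also hypothetical.

```python
def limit_authors(authors, is_internal, limit=5):
    """Keep at most `limit` authors, in order, retaining one internal author.

    Assumption for this sketch: if no internal author is among the first
    `limit` entries, the last kept entry is replaced by the first internal
    author that appears later in the list.
    """
    if limit <= 0:
        return []
    kept = list(authors[:limit])
    if any(is_internal(author) for author in kept):
        return kept
    for author in authors[limit:]:
        if is_internal(author):
            kept[-1] = author
            return kept
    # No internal author anywhere in the list: return the plain truncation.
    return kept


# Invented example: internal authors are marked with a trailing "*".
authors = ["A", "B", "C", "D", "E", "F", "G*", "H"]
shortlist = limit_authors(authors, lambda name: name.endswith("*"))
# shortlist is ["A", "B", "C", "D", "G*"]
```

Truncating in this way reduces the number of person and link records to import per publication, at the cost of an incomplete author list in the portal.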
Outputs and Results
The CRISPool project has several deliverables, the first being the CRISPool portal itself: http://crispool.atira.dk/portal/

The portal has been designed to have the same look and feel as the SUPA website. On the front page of the portal a selection of the most recent publications is displayed. A search bar allows anyone to search through researchers, organisations and publications. There is also a navigation menu on the right side of the portal pages which allows a user to search through the available data alphabetically. The portal also offers a ‘statistics’ option which allows a user to view charts showing the volume and format of research at the institutions involved over the last 5 years. RSS feeds are also available from the portal.

The CRISPool project successfully employed the CERIF data model (2008 version 1.1). This is demonstrated by the CERIF-XML files created during the project, each of which conforms to the data model documentation and the schema, which can be found on the euroCRIS website: http://www.eurocris.org/fileadmin/cerif-2008/2008_1.1/XML-SCHEMAS/ The data model used in CRISPool is described in detail in Appendices 1, 2 and 3.

A less tangible output is the transferable skills developed by the partner institutions in mapping internal data sources to CERIF-XML. The mapping process led to an improved understanding of how the CERIF data model works in a practical situation. For example, at Glasgow, this knowledge helped inform the JISC-funded Enquire project [17] looking at Research Council Outcomes and Outputs.

[15] http://www.webdav.org/
[16] http://www.sitepoint.com/blogs/2004/10/03/novell-netdrive-webdav-client-for-windows/
[17] http://researchoutcomes.wordpress.com/
Overall, sourcing, mapping and producing the initial CERIF-XML files took between 2 and 5 days once the sample template files had been created. A further period of time was spent by the project team at St Andrews checking and amending the files which, as they had been produced semi-manually and to a tight deadline, were prone to mistyping [or miscopying/pasting!]. This was a one-off, relatively manual process for both Glasgow and Edinburgh, and further work would be needed to develop it into a sustainable production of CERIF-XML that keeps the portal up to date and removes the errors in the files. For St Andrews the CERIF-XML files were created directly from PURE using the functionality that Atira developed.

As described in the Methodology section, a systematic baseline review could not be carried out due to resource issues at SUPA. However, SUPA provided the following feedback on the work done by SUPA to gather data from all 6 Institutions [SUPA has only recently been expanded to 8 Institutions]:

‘I'd split the data gathering into two types: data gathering about people in SUPA, and the publication list (which is informed by the first process, however). For the first type, which is gathering the essential contact data for each member of SUPA: This takes approximately 1 month, which includes one week of solid work plus additional time to follow up with the institutions and verify accuracy. For the publication data exercise: This takes approximately 3 months, which includes a mix of solid work periods and following up with institutions. This process includes the initial meetings covering scope, the request for information, the follow up with individual institutions and the accuracy verification, collation and report publishing.’

For St Andrews, prior to the implementation of Pure, a School Administrator took 4 days [spread over 2 weeks] to run Web of Science searches for all academic staff and research fellows. These data were not checked by individuals due to lack of time. With Pure in place, each individual member of staff can maintain an up-to-date, accurate publication list that can then be fed out to CRISPool regularly, not just once a year as now. These data are also reused in other online pages such as School web sites. This not only saves time for School Administrators and individual researchers, as the data is collected once, but also improves data quality and timeliness, with the researcher taking responsibility for their own data.

If production of XML data streams can be automated at member institutions and synchronised within the portal, this would cut out the annual process of data collection via emailing spreadsheets back and forth between SUPA and each of the institutions. It would also have the benefit that corrections and additions prompted by the publication of information through the pool portal could be carried out directly in the source institutional information systems, saving the staff time spent making separate updates to the SUPA systems and institutional systems.
Outcomes
The project plan listed a set of evaluation factors and questions to address. These are repeated below with an update following the project’s conclusion.

Factor to Evaluate: Suitability of CERIF 2008
Questions to Address: Does it contain all data elements required? Can it be easily extended if not?
Method(s): Evaluate against SUPA requirements
Measure of Success: All elements exist or can be easily added
Outcome: All but 2 directly mapped. Issue 1: contact details against person not person-organisation relation; worked around using classification. Issue 2: single URI per publication; philosophical point to be discussed with euroCRIS; opted for DOI not IR handle as more data; could have been addressed with classification.

Factor to Evaluate: Ease of mapping to CERIF-XML
Questions to Address: What level of technical expertise required?
Method(s): Feedback from technical experts who did the mapping
Measure of Success: Technical expertise already available at member institutions or easily acquired
Outcome: Yes. Standard text editor tools used by St Andrews, Glasgow and Edinburgh. Basic relational db understanding necessary and knowledge of staff and publications data held by University.

Factor to Evaluate: Usefulness of CERIF-CRIS
Questions to Address: Is the CERIF-CRIS an improvement on previous solution? If so, in what way? If not, in what way?
Method(s): Evaluation/Feedback
Measure of Success: CRISPool solution extended to other member institutions
Outcome: SUPA: yes, dynamic searchable portal much better than fixed pdf publications list. At least one other Pool expressed interest. Member Institutions: in principle yes, but requires up to date central data sources and further work to automate CERIF-XML production, so not a CERIF issue in itself. Main technical issue is fragmentation of CERIF-XML, which means import processes are resource intensive.

Factor to Evaluate: Overall time/cost savings
Questions to Address: Across all stakeholders, does use of CERIF-XML increase/decrease resource/cost?
Method(s): Evaluation [before/after]
Measure of Success: Decreases resource/cost or further investigation needed over longer time period
Outcome: Further investigation needed. However it is indicative that SUPA were unable to collect data this year using existing method because too resource intensive.

Factor to Evaluate: Data quality
Questions to Address: Does using a CERIF-CRIS facilitate improvement in data quality? Does using a publicly available portal facilitate improvement in data quality?
Method(s): Evaluation [before/after]
Measure of Success: Data quality improved or likely to be improved
Outcome: CERIF-CRIS: neutral, as data quality from partner institutions already good for the limited subset of data we were working with. Public portal: not answered as portal not public yet.

Factor to Evaluate: Project success and impact
Questions to Address: To what extent has the project delivered on objectives and how useful are the project's findings?
Method(s): End Project Report/Lessons Learned
Measure of Success: Funder, partners and stakeholder feedback positive. CRISPool solution extended to other member institutions and CERIF entities.
Outcome: Partner and stakeholder feedback positive and keen to move pilot to sustainable system; more pools are interested.
Project Acronym: CRISPool Version: 2.1 Contact: akc@st-andrews.ac.uk Date: 16/09/2010
Conclusions
On the technical side the project has been straightforward: CERIF is flexible and comprehensive and for the most part does not require additional expertise over and above standard relational database modelling. The exception is the use of Classification Schemes, particularly when used with Link entities. Here the project benefited from the model expertise of euroCRIS and the practical experience of Atira. For those who do not have access to this expertise and experience it would be very useful to have more sample CERIF-XML files available on the euroCRIS.org website.

The project has come across a couple of areas where the CERIF data model has not met our needs, or not immediately, and discussion on these is being taken forward in the CERIF Task Group. In one case a workaround was created relatively simply by extending the use of the CERIF classification concept. In the other case a similar workaround could have been employed if we had had time to do so. The discussion with euroCRIS is therefore about whether such workarounds are the correct way to extend CERIF or whether the core CERIF data model should be extended.

The resource-intensive nature of processing the CERIF-XML, which is due to its fragmentation into many separate XML files, does need to be addressed, whether by improving the algorithms that process the data or by adjusting the CERIF model; however, it is difficult to see how the latter can be done without losing the flexibility of the model. This proved to be such a problem when importing co-authors on some of the Glasgow papers, where typically 150-200 authors existed on each paper, that we had to limit the data to the first 5 named authors (including at least one Glasgow author).

Finally, the project has shown that, in order to take best advantage of an initiative such as CRISPool, institutions need at least publications and staff data joined up. For Glasgow this was straightforward, as they were able to provide the publications data set from the REF bibliometrics pilot, which was linked to internal authors via the HR staff ID. Going forward they are now linking the full publications data set in their Institutional Repository to internal authors and so will be able to provide a more comprehensive set of publications data in the future. For Edinburgh the publications data repository does hold the staff ID for the user, which could have been matched with the HR database to allow the publications to be matched easily to persons. St Andrews were able to provide comprehensive data on people and their publications directly from their CRIS.
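The author limit described above (the first 5 authors, including at least one Glasgow author) can be sketched as follows. The report does not say how the Glasgow author was guaranteed, so the rule used here (swap the last kept author for the first internal author found further down the list) is an assumption, as are the function names and the internal ID person-10007794-jb1. Internal and external authors are told apart by the ID patterns given in Appendix 1.

```python
def is_internal(person_id, ukprn):
    """True for IDs of the form person-[UKPRN]-[InstID], False for ...-ext-... IDs."""
    prefix = f"person-{ukprn}-"
    return person_id.startswith(prefix) and not person_id.startswith(prefix + "ext-")


def limit_authors(author_ids, ukprn, limit=5):
    """Keep the first `limit` authors, ensuring at least one internal author if any exists."""
    kept = list(author_ids[:limit])
    if any(is_internal(a, ukprn) for a in kept):
        return kept
    # Assumed rule: replace the last kept author with the first internal one found later.
    for candidate in author_ids[limit:]:
        if is_internal(candidate, ukprn):
            return kept[: limit - 1] + [candidate]
    return kept


# Hypothetical paper with 150 authors, only one of them internal to Glasgow (10007794).
authors = [f"person-10007794-ext-{i:07d}" for i in range(150)]
authors[40] = "person-10007794-jb1"

short_list = limit_authors(authors, "10007794")
print(short_list)
```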
Implications
There are specific implications for CRISPool and more generic implications for adopting CERIF-XML as an exchange format within the UK.

CRISPool
The project partners are keen to take CRISPool forward from a pilot to a live system. However, first we need to identify a clear, achievable objective, such as bringing in people and publications data for all members of SUPA to support decision-making and specific reporting requirements. We then need to develop processes to produce the CERIF-XML data automatically and regularly from the various source databases that exist within member institutions. This is not a small undertaking and requires buy-in from the partner institutions to the bigger picture across the UK research domain: that improved information management and operational efficiency can be gained by adopting CERIF-XML as the exchange format. CRISPool has demonstrated that those with an integrated research system or CRIS are already at an advantage here. Finally, but importantly, we need a CERIF-CRIS to bring the data together, with functionality to view, search, report, and so on. Atira have worked as project partners on the pilot, but there is no agreement to continue beyond the end of the pilot, and any commercial solution would necessarily need to follow the normal procurement route.
It is also worth noting that at least one other research pool, SICSA [18], has already expressed interest in the idea, so a project able to provide data for both pools from the common member institutions could be another option; its aim would be to demonstrate the scalability and transferability of using CERIF-XML for this purpose.
CERIF-XML in general
On the technical side, and of relevance to others working with CERIF-XML, the resource-intensive nature of processing CERIF-XML needs to be addressed. This could be done by reviewing the CERIF-XML model itself (which runs the risk of reducing CERIF’s ability to model the research information domain accurately) or by improving the technology that processes large and/or fragmented XML files. It should be noted that in other applications, especially those relating to research information, XML has proved to be an inefficient exchange format. EXEM has been developed to (partially) overcome this: http://portal.acm.org/citation.cfm?id=1285888. However, the article at http://www.criticism.com/dita/dss.html suggests that using XML provides gains over legacy data exchange mechanisms. Considering the spreadsheet exchange method used by SUPA hitherto, this appears to be borne out by CRISPool despite the apparent inefficiency of CERIF-XML.

Brigitte Joerg, CERIF Task Group leader, comments:

‘I understand the fragmentation is seen as a problem. But, from my whole experience with ontologies, with respect to interchange is still the most appropriate format - and makes it very flexible to map to from legacy systems. Especially due to the fact of fragmentation, you can exchange just the data that you need. Imagine an interrelated or networked ontological graph (which can be based on XML too). Here it becomes a problem of where to cut off - and where to locate the related data. I think - the only way to improve fragmentation in CERIF-XML, would be, to define mini-CERIF-Subontologies - like for person, including all the related entities and their basic attributes and also all the relationships for a particular context. That would mean, your CERIF Person Ontology would integrate the related entities - and you could consider such a Person Ontology as your "integration" manager for person records, because it tells you about all the entities you want to involve, and about all the attributes and relationships that come with them - according to your specification. Ontologies try to integrate information based on a real world view - they use URIs for interconnection - but finally they are also XML-based. They do the opposite of fragmentation - here you have to deal with the complexity - but down to the physical level - you still deal with XML.’

For both CRISPool and CERIF-XML in general, further JISC support is recommended, whether by the funding of follow-on project/s or by extending the scope of an existing project such as ERIS (for CRISPool) or R4R (for CERIF-XML in general).
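To make the fragmentation point concrete: a person's core record, name and publication links arrive in separate files and have to be joined on cfPersId before a person-centred view exists. The sketch below is an illustration under stated assumptions, not Pure's import algorithm: the function name assemble_persons and the in-memory tuples standing in for parsed cfPers-CORE, cfPersName-ADD and cfPers_ResPubl-LINK files are hypothetical.

```python
def assemble_persons(core_ids, name_rows, link_rows):
    """Join fragmented record sets into one record per person, keyed on cfPersId."""
    persons = {pid: {"name": None, "publications": []} for pid in core_ids}
    for pid, family, first in name_rows:  # rows from a cfPersName-ADD file
        if pid in persons:
            persons[pid]["name"] = f"{first} {family}"
    for pid, pub_id, role in link_rows:  # rows from a cfPers_ResPubl-LINK file
        if pid in persons:
            persons[pid]["publications"].append((pub_id, role))
    return persons


# Hypothetical sample data using the ID patterns from Appendix 1.
core_ids = ["person-10007803-akc"]
name_rows = [("person-10007803-akc", "Clements", "Anna Katharine")]
link_rows = [
    ("person-10007803-akc", "publication-10007803-010101", "is-author-of"),
    ("person-10007803-ext-0092169", "publication-10007803-010101", "is-author-of"),
]

persons = assemble_persons(core_ids, name_rows, link_rows)
print(persons)
```

Every additional link entity adds another pass of this kind, which is one way to picture why publications with very many co-authors were slow to import.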
References
Rogers, N and Ferguson, N (2009), Exchanging Research Information in the UK. EXRI-UK: A study funded by JISC. http://ie-repository.jisc.ac.uk/448/1/exri_final_v2.pdf
Price, D and Caddick, S (2010), How to stay on top, Times Higher, 5th Aug 2010 http://www.timeshighereducation.co.uk/story.asp?storycode=412909
Joerg, B, van Grootel, G and Jeffery, K [Eds], CERIF 2008 1-1 XML Data Exchange Format Specification http://www.eurocris.org/fileadmin/cerif-2008/CERIF2008_1.1_XML.pdf
Joerg, B et al, CERIF 2008 1-1 Semantics http://www.eurocris.org/fileadmin/cerif-2008/CERIF2008_1.1_Semantics.pdf
Natchetoi, Y, Wu, H, Babin, G and Dagtas, S (2007), EXEM: Efficient XML data exchange management for mobile applications, Information Systems Frontiers, 9, 439-448 http://portal.acm.org/citation.cfm?id=1285888
Hoenisch, S (2005), Using Data Structure Standards to Foster Efficiency and Opportunity http://www.criticism.com/dita/dss.html

[18] www.sicsa.ac.uk
Appendix 1
CRISPool Data Dictionary Niall Lockhart, Anna Clements
Version 1 30/04/10
Nal, akc
Version 2 11/05/10: a few classschemeIDs and classIDs revised for consistency and to match CRISPool Class Scheme Data.doc Also blanket changed cfPublicationId to cfResPublId Some minor corrections to xml files i.e. missing ‘<’ s To find changes look for 11/05/10
Akc
Version 2.1 19/05/10 Add info and examples for external people – in cfPers_CORE, cfPersName-ADD and cfPers_ResPubl-LINK
Nal
Version 2.2 24/08/10 Updated all tables to reflect use of CERIF 2008 V1.1
Akc, Nal
Final Version 2.3 30/08/10 Update at end of project
Note on IDs
CRISPool is bringing together data from several UK Institutions and will use a combination of UK Provider Reference Number (UKPRN) plus Institutional internal ID to ensure uniqueness of IDs within CRISPool for Person and Publication records. UKPRNs can be found at http://www.ukrlp.co.uk. For CRISPool we need:
University of Edinburgh 10007790
Glasgow University 10007794
University of St Andrews 10007803

Akc 19/05/10 For external authors, we have to assume each one is a separate entity unless the Institution has some kind of database of external authors. Examples have been added in the relevant tables below. There is no need to link such persons to an organisation, but they do need to be linked to publications. Tables affected: cfPers-CORE, cfPersName-ADD, cfPers_ResPubl-LINK

NAL 24/08/10 It is important to note that all CERIF data contained within this document relates to CERIF 2008 version 1.1 and is correct at time of writing.

NAL 31/08/10 Where the elements cfFraction, cfStartDate and cfEndDate are not supplied, default values shall be used. These will be “1” (cfFraction), “1900-01-01T00:00:00.000+01:00” (cfStartDate) and “2099-12-31T00:00:00.000+01:00” (cfEndDate). Also, I have identified 3 tables with a “*” to show that they have not been implemented in this version of CRISPool.
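The ID patterns and default values described in these notes can be sketched in a few lines. This is an illustration only: the function names (person_id, publication_id, with_defaults) and the dictionary layout are assumptions, and the default start date is taken to be 1900-01-01T00:00:00.000+01:00.

```python
# UKPRNs as listed in the note on IDs.
UKPRN = {
    "University of Edinburgh": "10007790",
    "Glasgow University": "10007794",
    "University of St Andrews": "10007803",
}

# Defaults applied when a link record omits these elements.
LINK_DEFAULTS = {
    "cfFraction": "1",
    "cfStartDate": "1900-01-01T00:00:00.000+01:00",
    "cfEndDate": "2099-12-31T00:00:00.000+01:00",
}


def person_id(ukprn, inst_id, external=False):
    """person-[UKPRN]-[Person InstID], or person-[UKPRN]-ext-[simple id] for external authors."""
    if external:
        return f"person-{ukprn}-ext-{inst_id}"
    return f"person-{ukprn}-{inst_id}"


def publication_id(ukprn, pub_id):
    """publication-[UKPRN]-[Publication InstID]."""
    return f"publication-{ukprn}-{pub_id}"


def with_defaults(link):
    """Return a copy of a link record with missing (or None) elements defaulted."""
    filled = dict(LINK_DEFAULTS)
    filled.update({k: v for k, v in link.items() if v is not None})
    return filled


link = with_defaults({
    "cfPersId": person_id(UKPRN["University of St Andrews"], "akc"),
    "cfResPublId": publication_id(UKPRN["Glasgow University"], "801001"),
    "cfClassId": "is-author-of",
    "cfStartDate": None,  # not supplied, so the default is used
})
print(link)
```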
PERSON

cfPers-CORE
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId | String : 128 chars | y / y | Unique person id INTERNAL; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”. Akc 19/05/10: Unique person id EXTERNAL; person-[UKPRN]-ext-[simple id] e.g. “person-10007803-ext-0092169”. For external authors suggest just create a sequential numeric id [For St Andrews can use internal PureID]
cfSex | String : 1 char | y | Examples: “m”, “f”
cfURI | String : 128 chars | | Example: “http://dept.physics.gla.ac.uk/staff/default.asp?record=672”

cfPers_Class-LINK
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId | String : 128 chars | y / y | Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfClassId | String : 128 chars | y | Examples: “internal-person”, “external-person”, “1626”
cfClassSchemeId | String : 128 chars | y | Schemes: “class-scheme-person-types”, “class-scheme-hesa-identifiers”, “class-scheme-wos-identifiers”, “class-scheme-supa-identifiers”
cfFraction | Float | y | Examples: “1.0”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfPers_EAddr-LINK
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId | String : 128 chars | y / y | Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfEAddrId | String : 128 chars | y | Unique email address id: email-[UKPRN]-[Person InstID] e.g. “email-10007803-akc”
cfClassId | String : 128 chars | y | Examples: “email”, “skype”
cfClassSchemeId | String : 128 chars | y | Scheme: “class-scheme-eaddress-types”
cfFraction | Float | y | Examples: “1.0”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
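The String lengths given in the person tables above lend themselves to a simple pre-upload check. The sketch below is an assumption-laden illustration: the MAX_LENGTH mapping copies the lengths stated in the data dictionary, while the function name length_errors and the sample records are hypothetical and no claim is made that Pure performs this check in the same way.

```python
# Maximum lengths as stated in the data dictionary for the person tables.
MAX_LENGTH = {
    "cfPersId": 128,
    "cfSex": 1,
    "cfURI": 128,
    "cfEAddrId": 128,
    "cfClassId": 128,
    "cfClassSchemeId": 128,
}


def length_errors(record):
    """List every element in the record whose value exceeds its stated maximum length."""
    return [
        f"{name}: {len(value)} chars exceeds {MAX_LENGTH[name]}"
        for name, value in record.items()
        if name in MAX_LENGTH and len(value) > MAX_LENGTH[name]
    ]


good = {"cfPersId": "person-10007803-akc", "cfSex": "f"}
bad = {"cfPersId": "person-10007803-akc", "cfSex": "female"}

print(length_errors(good))
print(length_errors(bad))
```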
cfPers_OrgUnit-LINK
11/05/10 Changed Content examples to make consistent with CRISPool Class Scheme Data.doc
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId | String : 128 chars | y / y | Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfOrgUnitId | String : 128 chars | y | Unique organisation unit id: organisation-[UKPRN]-[Organisation-InstID] e.g. “organisation-10007803-80UNIV”, “organisation-supa-condensed-matter-material-physics”, “organisation-supa-nuclear-plasma-physics”
cfClassId | String : 128 chars | y | Examples: “academic”
cfClassSchemeId | String : 128 chars | y | Scheme: “class-scheme-job-families”
cfFraction | Float | y | Examples: “1.0”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfPers_PAddr-LINK
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId | String : 128 chars | y / y | Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfPAddrId | String : 128 chars | y | Unique postal address id: paddress-[UKPRN]-[Organisation InstID] e.g. “paddress-10007803-40SCPHAS”
cfClassId | String : 128 chars | y | Examples: “work”
cfClassSchemeId | String : 128 chars | y | Scheme: “class-scheme-paddress-types”
cfFraction | Float | y | Examples: “1”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfPers_ResPubl-LINK
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId | String : 128 chars | y / y | Unique person id INTERNAL; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”. Akc 19/05/10: Unique person id EXTERNAL; person-[UKPRN]-ext-[simple id] e.g. “person-10007803-ext-0092169”. For external authors suggest just create a sequential numeric id [For St Andrews can use internal PureID]
cfResPublId | String : 128 chars | y | Unique publication id; publication-[UKPRN]-[PublicationID] e.g. “publication-10007794-801001”
cfClassId | String : 128 chars | y | Examples: “is-editor-of”, “is-author-of”
cfClassSchemeId | String : 128 chars | y | Schemes: “class-scheme-cerif-person-publication-roles”
cfFraction | Float | y | Examples: “1”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfPersKeyW-LANG
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId | String : 32 chars | y / y | Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfLangCode | String : 5 chars | y | Examples: “en-GB”, “DE”
cfTrans | String : 1 char | y | Examples: “o”
cfKeyW | String : 255 chars | | Examples: “Artificial Intelligence, AI, Human Computer Interfaces”, “Physics, Space, Satellite”
*cfPersName_Pers-LINK
31/08/10: Due to uncertainty of required data this table has not been used in CRISPool
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId1 | String : 128 chars | y / y | Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfPersId2 | String : 128 chars | y | Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfClassId | String : 128 chars | y | Example: “spelling-variant”
cfClassSchemeId | String : 128 chars | y | Examples: “class-scheme-person-name-variants”
cfFraction | Float | y | Examples: “1”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00.000+01:00”, “1999-12-31T00:00:00.000+01:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfPersNameVar | String : 128 chars | | Unknown Data

cfPersName-ADD
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId | String : 128 chars | y / y | Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”. Akc 19/05/10: Unique person id EXTERNAL; person-[UKPRN]-ext-[simple id] e.g. “person-10007803-ext-0092169”. For external authors suggest just create a sequential numeric id [For St Andrews can use internal PureID]
cfFamilyNames | String : 64 chars | y | “Clements”
cfOtherNames | String : 64 chars | |
cfFirstNames | String : 64 chars | y | “Anna Katharine”

cfPersResInt-LANG
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfPersId | String : 128 chars | y / y | Unique person id; person-[UKPRN]-[Person InstID] e.g. “person-10007803-akc”
cfLangCode | String : 5 chars | y | Examples: “en_GB”, “DE”
cfTrans | String : 1 char | y | Examples: “o”
cfResInt | NClob | | Examples: “John Smith's current research subject areas are Artificial Intelligence and Human Computer Interfaces.”
OrganisationUnit

cfOrgUnit-CORE
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfOrgUnitId | String : 128 chars | y / y | Unique organisation unit id; organisation-[UKPRN]-[Organisation InstID] e.g. “organisation-10007803-40SCPHAS”
cfAccro | String : 16 chars | | Example: “Physics”
cfURI | String : 128 chars | | Example: “http://www.gla.ac.uk/departments/physics/”

cfOrgUnit_Class-LINK
30/08/10 Added
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfOrgUnitId | String : 128 chars | y / y | Unique organisation unit id; organisation-[UKPRN]-[Organisation InstID] e.g. “organisation-10007803-40SCPHAS”
cfClassId | String : 128 chars | y | Examples: “university”, “school”, “research-pool”, “research-theme”
cfClassSchemeId | String : 128 chars | y | Schemes: “class-scheme-organisation-types”
cfFraction | Float | y | Examples: “1.0”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfOrgUnit_EAddr-LINK
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfOrgUnitId | String : 128 chars | y / y | Unique organisation unit id; organisation-[UKPRN]-[Organisation-InstID] e.g. “organisation-10007803-40SCPHAS”
cfEAddrId | String : 128 chars | y / y | Unique email address id: email-[UKPRN]-[Organisation-InstID] e.g. “email-10007803-40SCPHAS”
cfClassId | String : 128 chars | y | Examples: “email”, “skype”
cfClassSchemeId | String : 128 chars | y | Scheme: “class-scheme-eaddress-types”
cfFraction | Float | y | Examples: “1”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfOrgUnit_OrgUnit-LINK
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfOrgUnitId1 | String : 128 chars | y / y | Unique organisation unit id; organisation-[UKPRN]-[Organisation-InstID] e.g. “organisation-10007803-40SCPHAS”
cfOrgUnitId2 | String : 128 chars | y / y | Unique organisation unit id; organisation-[UKPRN]-[Organisation-InstID] e.g. “organisation-10007803-40SCPHAS”
cfClassId | String : 128 chars | y | Examples: “is-parent-of”
cfClassSchemeId | String : 128 chars | y | Scheme: “class-scheme-organisation-relationship-types”
cfFraction | Float | y | Examples: “1”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfOrgUnit_PAddr-LINK
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfOrgUnitId | String : 128 chars | y / y | Unique organisation unit id; organisation-[UKPRN]-[Organisation-InstID] e.g. “organisation-10007803-40SCPHAS”
cfPAddrId | String : 128 chars | y / y | Unique postal address id: paddress-[UKPRN]-[Organisation InstID] e.g. “paddress-10007803-40SCPHAS”
cfClassId | String : 128 chars | y | Examples: “work”
cfClassSchemeId | String : 128 chars | y | Scheme: “class-scheme-paddress-types”
cfFraction | Float | y | Examples: “1”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfOrgUnit_ResPubl-LINK
24/08/10 cfOrgUnitId actually declared as 32 chars on euroCRIS website but this is a mistake; RA from Atira has alerted euroCRIS to this error.
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfOrgUnitId | String : 128 chars | y / y | Unique organisation unit id; organisation-[UKPRN]-[InstID] e.g. “organisation-10007803-80UNIV”
cfResPublId | String : 128 chars | y / y | Unique publication id; publication-[UKPRN]-[PublicationID] e.g. “publication-10007794-801001”
cfClassId | String : 128 chars | y | Examples: “is-publisher-of”, “is-author-institution-of”, “claims-ipr”
cfClassSchemeId | String : 128 chars | y | Schemes: “class-scheme-cerif-orgunit-publication-roles”
cfFraction | Float | y | Examples: “1”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfOrgUnitName-LANG
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfOrgUnitId | String : 128 chars | y / y | Unique organisation unit id; organisation-[UKPRN]-[Organisation-InstID] e.g. “organisation-10007803-80UNIV”
cfLangCode | String : 5 chars | y | Examples: “en_GB”, “DE”
cfTrans | String : 1 char | y | Examples: “o”
cfName | String : 255 chars | y | Examples: “The University of St Andrews”

ResultPublication

cfResPubl-RES
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfResPublId | String : 128 chars | y / y | Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfResPublDate | Date | | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfNum | String : 30 chars | |
cfVol | String : 30 chars | |
cfEdition | String : 30 chars | |
cfSeries | String : 30 chars | |
cfIssue | String : 30 chars | |
cfStartPage | String : 30 chars | |
cfEndPage | String : 30 chars | |
cfTotalPages | String : 30 chars | |
cfISBN | String : 30 chars | |
cfISSN | String : 30 chars | |
cfURI | String : 128 chars | | Example: “http://www.st-andrews.ac.uk/departments/physics/book”
cfResPubl_Class-LINK
11/05/10 Changed Content examples to make consistent with CRISPool Class Scheme Data.doc
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfResPublId | String : 128 chars | y / y | Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfClassId | String : 128 chars | y | Examples: “textbook”, “journal-article”
cfClassSchemeId | String : 128 chars | y | Schemes: “class-scheme-cerif-publication-types”
cfFraction | Float | y | Examples: “1”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00.000+01:00”, “1999-12-31T00:00:00.000+01:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
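As an illustration of how one row of the cfResPubl_Class-LINK table might be serialised, the sketch below builds a single record with Python's standard ElementTree module. The child element names come from the table; the wrapper element name cfResPubl_Class, the field order, the helper build_link and the absence of a namespace are assumptions, so the euroCRIS schema should be consulted for the exact structure.

```python
import xml.etree.ElementTree as ET

# Field order follows the table above (an assumption, not taken from the schema).
FIELD_ORDER = [
    "cfResPublId",
    "cfClassId",
    "cfClassSchemeId",
    "cfFraction",
    "cfStartDate",
    "cfEndDate",
]


def build_link(tag, values):
    """Build one link record as an XML element with one child per field."""
    element = ET.Element(tag)
    for name in FIELD_ORDER:
        ET.SubElement(element, name).text = values[name]
    return element


record = build_link("cfResPubl_Class", {
    "cfResPublId": "publication-10007803-010101",
    "cfClassId": "journal-article",
    "cfClassSchemeId": "class-scheme-cerif-publication-types",
    "cfFraction": "1",
    "cfStartDate": "1900-01-01T00:00:00.000+01:00",
    "cfEndDate": "2099-12-31T00:00:00.000+01:00",
})

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```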
cfResPubl_ResPubl-LINK
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfResPublId1 | String : 128 chars | y / y | Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfResPublId2 | String : 128 chars | y | Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010102”
cfClassId | String : 128 chars | y | Examples: “is-part-of”
cfClassSchemeId | String : 128 chars | y | Schemes: “class-scheme-cerif-publication-publication-roles”
cfFraction | Float | y | Examples: “1”, “0.5”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”

cfResPublAbstr-LANG
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfResPublId | String : 128 chars | y / y | Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfLangCode | String : 5 chars | y | Examples: “en_GB”, “DE”
cfTrans | String : 1 char | y | Examples: “o”
cfAbstr | NClob | | Examples: “An abstract of a publication would be written here.”

cfResPublBiblNote-LANG
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfResPublId | String : 128 chars | y / y | Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfLangCode | String : 5 chars | | Examples: “en_GB”, “DE”
cfTrans | String : 1 char | | Examples: “o”
cfBiblNote | String : 255 chars | | Examples: “Additional information on publication up to 255 characters.”

cfResPublKeyW-LANG
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfResPublId | String : 128 chars | y / y | Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfLangCode | String : 5 chars | y | Examples: “en_GB”, “DE”
cfTrans | String : 1 char | y | Examples: “o”
cfKeyW | String : 255 chars | | Examples: “Physics, Space, Light, Gravity.”

*cfResPublAbbrev-LANG
Not used within CRISPool as no institution has this data available at this time.
Element | Type (CERIF) | Mandatory (CERIF / Pure) | Content
cfResPublId | String : 128 chars | y / y | Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101”
cfLangCode | String : 5 chars | y | Examples: “en_GB”, “DE”
cfTrans | String : 1 char | y | Examples: “o”
cfAbbrev | String : 255 chars | | Examples: “Abbreviated title of an article.”
Project Acronym: CRISPool Version 2.1 Contact: akc@st-andrews.ac.uk Date: 16/09/2010
*cfResPublSubtitle-LANG Not used within CRISPool as no institution has this data available at this time. Element Type Mandatory CERIF Pure CERIF Pure cfResPublId String : 128 chars y y cfLangCode
String: 5 chars
y
cfTrans
String :1 chars
y
cfSubtitle
String : 255 chars
cfResPublTitle-LANG 11/05/10 correction to xml tags to make valid Element Type CERIF Pure cfResPublId String : 128 chars
Mandatory CERIF y
cfLangCode
String: 5 chars
y
cfTrans
String :1 chars
y
cfTitle
String : 255 chars
Content Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101” Examples: “en_GB” “DE” Examples: “o” Examples: “Bloggs blogs about blogs”
Content Pure y
y
Unique publication id; publication-[UKPRN]-[Publication InstID] e.g. “publication-10007803-010101” Examples: “en_GB” “DE” Examples: “o” Examples: “An Example of a Textbook”
Other
cfClassTerm-LANG
11/05/10 Changed Content examples to make consistent with CRISPool Class Scheme Data.doc
Element | Type | Mandatory (CERIF, Pure) | Content
cfClassId | String : 128 chars | y | Examples: “academic-teaching” “academic-research”
cfClassSchemeId | String : 128 chars | y | Schemes: “class-scheme-10007803-job-families” “class-scheme-10007994-job-families”
cfLangCode | String : 5 chars | y | Examples: “en_GB” “DE”
cfTrans | String : 1 chars | y | Examples: “o”
cfTerm | String : 64 chars | | Examples: “Academic Teaching” “Academic Research”
ND
cfEAddr-2 Element cfEAddrId
Type CERIF String : 128 chars
cfPAddrId
String : 128 chars
cfURI
String : 128 chars
Pure
Mandatory CERIF y
Content Pure Unique email address id: email_UKPRN_[Person InstID] {OR} email-[UKPRN]-[Organisation-InstID] e.g. “email-10007803et37”, “email-10007803-40SCPHAS” Unique postal address id: paddress-[UKPRN]-[OrganisationInstID] e.g. “paddress-10007803-80UNIV” Examples: “et37@st-andrews.ac.uk”
y
ND
cfPAddr-2
Element | Type | Mandatory (CERIF, Pure) | Content
cfPAddrId | String : 128 chars | y y | Unique postal address id: paddress-[UKPRN]-[OrganisationInstID] e.g. “paddress-10007803-80UNIV”
cfCountryCode | String : 2 chars | | Examples: “UK” “DE”
cfAddrline1 | String : 80 chars | |
cfAddrline2 | String : 80 chars | |
cfAddrline3 | String : 80 chars | |
cfAddrline4 | String : 80 chars | |
cfAddrline5 | String : 80 chars | |
cfPostCode | String : 16 chars | |
cfCity/Town | String : 64 chars | |
cfStateOfCountry | String : 64 chars | |
cfURI | String : 128 chars | |
cfClassScheme-CLASS
Element | Type | Mandatory (CERIF, Pure) | Content
cfClassSchemeId | String : 128 chars | y | Schemes: “class-scheme-organisation-types” “class-scheme-cerif-publication-publication-roles”
cfURI | String : 128 chars | | Examples: “/uk/crispool/organisation/types” “/org/eurocris/cerif/publication/publication/roles”

cfClass-CLASS
Element | Type | Mandatory (CERIF, Pure) | Content
cfClassId | String : 128 chars | y | Examples: “supa-physics-and-life-sciences” “in-book”
cfClassSchemeId | String : 128 chars | y | Schemes: “class-scheme-supa-themes” “class-scheme-cerif-publication-publication-roles”
cfStartDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfEndDate | Date | y | Examples: “2001-01-01T00:00:00”, “1999-12-31T00:00:00”
cfURI | String : 128 chars | | Examples: “/uk/crispool/supa/themes/physics-and-life-sciences” “/org/eurocris/cerif/publication/types/in-book”
Appendix 2
Class Scheme Data Niall Lockhart, Anna Clements
Version 1 05/04/10
Nal
Version 2.0 02/06/10 Added personal job titles for Glasgow
Akc
Version 2.1 30/05/10 Remove cfTerm for wos and hesa
Nal
Version 2.2 24/08/10 Updated supa identifiers and schema. Job families also modified. Added paddress types, person types and organisation types.
This document lists the values for each of the class schemes to be used in CRISPool. Those with ‘cerif’ in the title are taken from the documentation on the euroCRIS website. See http://www.eurocris.org/fileadmin/cerif-2008/CERIF2008_1.1_Semantics.pdf
class-scheme-eaddress-types cfClassId cfTerm email Email Address skype
Skype Address
class-scheme-paddress-types cfClassId cfTerm work Work Address home
Home Address
class-scheme-person-types cfClassId cfTerm externalInternal Person person internalExternal Person person
Link Entity cfOrgUnit_EAddr cfPers_EAddr cfOrgUnit_EAddr cfPers_EAddr
Link Entity cfOrgUnit_PAddr cfPers_PAddr cfOrgUnit_PAddr cfPers_PAddr
Link Entity cfPers_Class cfPers_Class
class-scheme-organisation-relationship-types cfClassId cfTerm Link Entity is-parent-of Is Parent Of cfOrgUnit_OrgUnit
class-scheme-cerif-orgunit-publication-roles
cfClassId | cfTerm | Link Entity
is-publisher-of | Is Publisher Of | cfOrgUnit_ResPubl
claims-ipr | Claims IPR Of | cfOrgUnit_ResPubl
curator | Is Curator Of | cfOrgUnit_ResPubl
reviewer | Provides Reviewer For | cfOrgUnit_ResPubl
is-author-of | Is Author Of | cfOrgUnit_ResPubl
commissioned | Has Commissioned | cfOrgUnit_ResPubl
funded | Is Funded By | cfOrgUnit_ResPubl
author-institution | Is Author Institution Of | cfOrgUnit_ResPubl
publishing-inst | Is Publishing Institution Of | cfOrgUnit_ResPubl
external-org | Is External Institution Of | cfOrgUnit_ResPubl

class-scheme-personal-titles
cfClassId | cfTerm | Link Entity
mr | Mr | cfPers_Class
mrs | Mrs | cfPers_Class
miss | Miss | cfPers_Class
ms | Ms | cfPers_Class
dr | Dr | cfPers_Class
prof | Professor | cfPers_Class

class-scheme-academic-titles
cfClassId | cfTerm | Link Entity
mlitt | MLitt | cfPers_Class
msc | MSc | cfPers_Class
bsc | BSc | cfPers_Class
ma | MA | cfPers_Class
mphil | MPhil | cfPers_Class
mres | MRes | cfPers_Class
phd | PhD | cfPers_Class
meng | MEng | cfPers_Class
mphys | MPhys | cfPers_Class
mmath | MMath | cfPers_Class
beng | BEng | cfPers_Class
ba | BA | cfPers_Class
pgdip | PGDip | cfPers_Class
Akc 30/06/10 - remove cfTerm and use cfClassID as the actual data value, as here we are using a CERIF classification scheme purely as a way of augmenting base data for a person, i.e. not as a true classification scheme.

class-scheme-hesa-identifiers
cfClassId | cfClassID | Link Entity
hesa-1234567890123 | 1234567890123 | cfPers_Class
hesa-3210987654321 | 3210987654321 | cfPers_Class

class-scheme-wos-identifiers
cfClassId | cfClassID | Link Entity
web-of-science-1234-2009 | 1234-2009 | cfPers_Class
web-of-science-9876-2010 | 9876-2010 | cfPers_Class

class-scheme-supa-themes
cfClassId | cfTerm | Link Entity
main-theme | Main Theme | cfPers_OrgUnit
additional-theme | Additional Theme | cfPers_OrgUnit
class-scheme-supa-indentifiers
24/08/10 Nal no need for supaid as already identified as supa through scheme id
cfClassId | cfClassId | Link Entity
supaid-1620 | 1620 | cfPers_OrgUnit
supaid-349 | 349 | cfPers_OrgUnit

class-scheme-cerif-person-publication-roles
cfClassId | cfTerm | Link Entity
author | Is Author Of | cfPers_ResPubl
editor | Is Editor Of | cfPers_ResPubl
author-numbered | Is Author (Numbered) Of | cfPers_ResPubl
author-percentage | Is Author (Percentage) Of | cfPers_ResPubl
subject | Is Subject Of | cfPers_ResPubl
commissioned | Has Commissioned | cfPers_ResPubl
reviewer | Is Reviewer Of | cfPers_ResPubl
translator | Is Translator Of | cfPers_ResPubl
publisher | Is Publisher Of | cfPers_ResPubl
commissioned | Has Commissioned | cfPers_ResPubl
Akc 11/05/10 IGNORE
class-scheme-person-name-variants
cfClassId | cfTerm | Link Entity
spelling-variant | Spelling Variant of Person’s Name | cfPersName_Pers

class-scheme-cerif-publication-types
cfClassId | cfTerm | Link Entity
book | Book | cfResPubl_Class
book-review | Book Review | cfResPubl_Class
book-chapter-abstract | Book Chapter Abstract | cfResPubl_Class
book-chapter-review | Book Chapter Review | cfResPubl_Class
in-book | In Book | cfResPubl_Class
anthology | Anthology | cfResPubl_Class
monograph | Monograph | cfResPubl_Class
reference-book | Reference book | cfResPubl_Class
textbook | Textbook | cfResPubl_Class
encyclopaedia | Encyclopaedia | cfResPubl_Class
manual | Manual | cfResPubl_Class
other-book | Other Book | cfResPubl_Class
journal | Journal | cfResPubl_Class
journal-article | Journal Article | cfResPubl_Class
journal-article-abstract | Journal Article Abstract | cfResPubl_Class
journal-article-review | Journal Article Review | cfResPubl_Class
conference-proceedings | Conference Proceedings | cfResPubl_Class
conference-proceedings-article | Conference Proceedings Article | cfResPubl_Class
letter | Letter | cfResPubl_Class
letter-to-editor | Letter To Editor | cfResPubl_Class
phd-thesis | PhD Thesis | cfResPubl_Class
doctoral-thesis | Doctoral Thesis | cfResPubl_Class
report | Report | cfResPubl_Class
short-communication | Short Communication | cfResPubl_Class
poster | Poster | cfResPubl_Class
presentation | Presentation | cfResPubl_Class
news-clipping | News Clipping | cfResPubl_Class
commentary | Commentary | cfResPubl_Class
annotation | Annotation | cfResPubl_Class

class-scheme-cerif-publication-publication-roles
cfClassId | cfTerm | Link Entity
is-part-of | Is Part Of | cfResPubl_ResPubl
Job Titles etc
This area needs to be flexible to cope with the different ways in which different institutions categorise their staff. We suggest up to three levels, as follows:
1. Personal Job Title: what I want to be known as and what should be shown on the portal, e.g. ‘Professor of Photonics’
2. Generic Job Title: for filtering and grouping by SUPA, e.g. ‘Professor’
3. Job Family: for filtering and grouping by SUPA, e.g. Academic

class-scheme-job-families
24/08/10 Nal Currently these are the only available jobs and schema within CRISPool
cfClassId | cfTerm | Link Entity
academic | Academic | cfPers_OrgUnit
academic-research | Academic Research | cfPers_OrgUnit
academic-teaching | Academic Teaching | cfPers_OrgUnit
honorary | Honorary | cfPers_OrgUnit
emeritus | Emeritus | cfPers_OrgUnit
research-support | Research Support | cfPers_OrgUnit

At St Andrews we can only supply 1 and 3 at the moment.

St Andrews
class-scheme-10007803-personal-job-titles : EXAMPLES as one created per link
cfClassId | cfTerm | Link Entity
professor-photonics | Professor of Photonics | cfPers_OrgUnit
honorary-professor | Honorary Professor | cfPers_OrgUnit
supa-advanced-fellow | SUPA Advanced Fellow | cfPers_OrgUnit
research-assistant | Research Assistant | cfPers_OrgUnit
pic-technical-manager | PIC Technical Manager | cfPers_OrgUnit

class-scheme-10007803-job-families
cfClassId | cfTerm | Link Entity
academic | Academic | cfPers_OrgUnit
academic-research | Academic Research | cfPers_OrgUnit
academic-teaching | Academic Teaching | cfPers_OrgUnit
honorary | Honorary | cfPers_OrgUnit
emeritus | Emeritus | cfPers_OrgUnit
research-support | Research Support | cfPers_OrgUnit
Glasgow
class-scheme-10007794-personal-job-titles - added 02/06/10 NL
cfClassId | cfTerm | Link Entity
senior-research-fellow | Senior Research Fellow | cfPers_OrgUnit
research-fellow-knc-manager | Research Fellow/KNC Manager | cfPers_OrgUnit
professor | Professor | cfPers_OrgUnit
rcuk-research-fellow | RCUK Research Fellow | cfPers_OrgUnit
professor-of-physics | Professor of Physics | cfPers_OrgUnit
research-fellow | Research Fellow | cfPers_OrgUnit
regius-professor-of-astronomy-astronomer-royal-for-scotland | Regius Professor of Astronomy (Astronomer Royal for Scotland) | cfPers_OrgUnit
reader | Reader | cfPers_OrgUnit
kelvin-chair-of-natural-philosophy | Kelvin Chair of Natural Philosophy | cfPers_OrgUnit
senior-lecturer | Senior Lecturer | cfPers_OrgUnit
reader-in-astrophysics | Reader in Astrophysics | cfPers_OrgUnit
lecturer | Lecturer | cfPers_OrgUnit
egee-scotgrid-technical-coordinator | EGEE/ScotGrid Technical Coordinator | cfPers_OrgUnit
professor-cargill-chair-of-natural-philosophy | Professor - Cargill Chair of Natural Philosophy | cfPers_OrgUnit
research-fellow-atlas-neural-net-analysis | Research Fellow ATLAS Neural Net Analysis | cfPers_OrgUnit
Yellow highlights - may not be needed as historical records
class-scheme-10007794-generic-job-titles
cfClassId | cfTerm | Link Entity
administrative-library-and-computing1 | Administrative Library & Computing 1 | cfPers_OrgUnit
administrative-library-and-computing2 | Administrative Library & Computing 2 | cfPers_OrgUnit
advisor-of-studies | Advisor of Studies | cfPers_OrgUnit
atypical-worker | Atypical Worker | cfPers_OrgUnit
atypical-worker-grade5 | Atypical Worker Grade 5 | cfPers_OrgUnit
atypical-worker-minimum-wage | Atypical Worker Minimum Wage | cfPers_OrgUnit
head-of-department | Head Of Department | cfPers_OrgUnit
honorary-staff | Honorary Staff | cfPers_OrgUnit
mpa-level4 | MPA Level 4 | cfPers_OrgUnit
mpa-level5 | MPA Level 5 | cfPers_OrgUnit
mpa-level6 | MPA Level 6 | cfPers_OrgUnit
mpa-level7 | MPA Level 7 | cfPers_OrgUnit
mpa-level8 | MPA Level 8 | cfPers_OrgUnit
marie-curie-fellow | Marie Curie Fellow | cfPers_OrgUnit
operational2 | Operational 2 | cfPers_OrgUnit
professor | Professor | cfPers_OrgUnit
reader | Reader | cfPers_OrgUnit
research1a | Research 1A | cfPers_OrgUnit
research1b | Research 1B | cfPers_OrgUnit
research2 | Research 2 | cfPers_OrgUnit
research-and-teaching6 | Research & Teaching 6 | cfPers_OrgUnit
research-and-teaching7 | Research & Teaching 7 | cfPers_OrgUnit
research-and-teaching8 | Research & Teaching 8 | cfPers_OrgUnit
research-and-teaching9 | Research & Teaching 9 | cfPers_OrgUnit
scholar | Scholar | cfPers_OrgUnit
scholarship | Scholarship | cfPers_OrgUnit
senior-lecturer | Senior Lecturer | cfPers_OrgUnit
technical2 | Technical 2 | cfPers_OrgUnit
technical4 | Technical 4 | cfPers_OrgUnit
technical5 | Technical 5 | cfPers_OrgUnit
technical6 | Technical 6 | cfPers_OrgUnit
technical7 | Technical 7 | cfPers_OrgUnit
technician-a | Technician A | cfPers_OrgUnit
technician-c | Technician C | cfPers_OrgUnit
technician-d | Technician D | cfPers_OrgUnit
technician-e | Technician E | cfPers_OrgUnit
technician-f | Technician F | cfPers_OrgUnit

class-scheme-10007794-job-families
cfClassId | cfTerm | Link Entity
admin-library-computing | Administrative Library and Computing | cfPers_OrgUnit
academic-related | Academic and Related | cfPers_OrgUnit
atypical | Atypical Workers | cfPers_OrgUnit
honorary | Honorary University | cfPers_OrgUnit
mpa | Management Professional and Administrative | cfPers_OrgUnit
operational | Operational | cfPers_OrgUnit
academic | Academic | cfPers_OrgUnit
research | Research | cfPers_OrgUnit
research-and-teaching | Research and Teaching | cfPers_OrgUnit
scholars | Scholars | cfPers_OrgUnit
technical | Technical and Related | cfPers_OrgUnit

Edinburgh - tbc - need job-families and personal-job-titles
class-scheme-10007790-job-families
cfClassId cfTerm
Link Entity cfPers_OrgUnit cfPers_OrgUnit cfPers_OrgUnit cfPers_OrgUnit cfPers_OrgUnit cfPers_OrgUnit cfPers_OrgUnit cfPers_OrgUnit cfPers_OrgUnit cfPers_OrgUnit
SUPA - tbc
The suggestion here is for SUPA to provide class-scheme-supa-job-families, to which member institutions can map their own job families.

class-scheme-organisation-types
cfClassId | cfTerm | Link Entity
department | Department | cfOrgUnit_Class
university | University | cfOrgUnit_Class
school | School | cfOrgUnit_Class
college | College | cfOrgUnit_Class
research-pool | Research Pool | cfOrgUnit_Class
research-theme | Research Theme | cfOrgUnit_Class
Appendix 3
CRISPool CERIF to PURE4 mapping
1 Important notes
1.1 CERIF imposes constraints on data
1.2 Persons/Authors
1.3 Fragmentation
1.4 Translations
2 Classification mappings
2.1 Organisations
2.2 SUPA Themes
2.3 Persons
2.3.1 Employment types
2.4 Publications
2.4.1 Publication Peer Review
2.4.2 Organisation to publication relations
2.5 Electronic addresses
3 Entity Mappings
3.1 Organisation
3.2 Person
3.2.1 Person-organisation relationships
3.3 Address (UK)
3.4 Email, Skype, etc.
3.5 Publications
3.5.1 General fields
3.5.2 Contribution to Journal
3.5.3 Book Anthology
3.5.4 Conference Contribution
3.5.5 Contribution to Book Anthology
3.5.6 Other Contribution
3.5.7 Working Paper
Important notes
CERIF imposes constraints on data
The CERIF XML format imposes many constraints on the data it holds. Many text strings in the format are limited by a maximum length constraint, and this limit is often too small. Imposing such restrictions on data is not suitable for an exchange format, which is what CERIF actually is. If a receiving CRIS system has length constraints on text strings, the problem should be dealt with internally. This has been reported to euroCRIS for their information.
Persons/Authors
Only "real" persons are exported/imported, meaning that when a person is connected to a publication, the alias author name, which can differ from the person's actual name, is discarded.
Fragmentation
Generally the CERIF XML data model is heavily fragmented. Data regarding an entity is scattered across several XML files and namespaces, which means that referential integrity is lost. It is therefore up to the data provider to ensure that references are correct.
Translations
CERIF uses language codes for specifying languages and translations, but the only specification available is that language codes are 5 characters long. In this project we use the well-known convention <language code>_<country code>, where
• language code is the two-letter ISO 639-1 code (listed alongside the ISO 639-2 codes at http://www.loc.gov/standards/iso639-2/englangn.html)
• country code is the two-letter ISO 3166 code (see http://www.iso.ch/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/listen1.html)
Examples are: en_GB (British English), en_US (American English), da_DK (Danish), fr_FR (French as used in France).
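The language code convention described above can be checked mechanically before data is exported. The following is a minimal sketch: the helper names are hypothetical, and the check only verifies the shape of the code (two lowercase letters, underscore, two uppercase letters); it does not verify that the parts are registered ISO codes.

```python
import re

# Shape check only: two-letter language, underscore, two-letter country (5 chars).
LANG_CODE = re.compile(r"^[a-z]{2}_[A-Z]{2}$")


def is_valid_lang_code(code: str) -> bool:
    """Return True if code looks like <language code>_<country code>."""
    return LANG_CODE.match(code) is not None


def make_lang_code(language: str, country: str) -> str:
    """Normalise case and join the two parts, e.g. ('en', 'gb') -> 'en_GB'."""
    code = f"{language.lower()}_{country.upper()}"
    if not is_valid_lang_code(code):
        raise ValueError(f"not a <language>_<country> code: {code!r}")
    return code
```

Under this check a bare value such as "DE" would be rejected, so data providers using the shorter form would need to expand it (for example to de_DE) before export.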
Classification mappings
Classifications are mapped from a CERIF classification id and scheme id to either a PURE4 classification URI or a contextual meaning.
Organisations
An organisation is classified by an organisation type. In CERIF, organisations are classified via the cfOrgUnit_Class classification element.
CERIF Scheme id: class-scheme-organisation-types
CERIF cfClassId | PURE Classification URI
university | /dk/atira/pure/organisation/organisationtypes/organisation/university
college | /dk/atira/pure/organisation/organisationtypes/organisation/college
faculty | /dk/atira/pure/organisation/organisationtypes/organisation/faculty
school | /dk/atira/pure/organisation/organisationtypes/organisation/school
department | /dk/atira/pure/organisation/organisationtypes/organisation/department
institute | /dk/atira/pure/organisation/organisationtypes/organisation/institue
research-pool | /dk/atira/pure/organisation/organisationtypes/organisation/research
research-theme | /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
publisher | /dk/atira/pure/publisher/publishertypes/publisher/publisher
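A classification mapping of this kind is essentially a lookup keyed on the pair (scheme id, class id). The sketch below illustrates the idea with a handful of the organisation types listed above; the names `CLASSIFICATION_MAP` and `to_pure_uri` are invented for the example and do not describe how Pure implements the mapping.

```python
ORG_TYPE_SCHEME = "class-scheme-organisation-types"
PURE_ORG_PREFIX = "/dk/atira/pure/organisation/organisationtypes/organisation/"

# (CERIF scheme id, CERIF cfClassId) -> PURE classification URI.
# Only a subset of the rows from the mapping table is included here.
CLASSIFICATION_MAP = {
    (ORG_TYPE_SCHEME, "university"): PURE_ORG_PREFIX + "university",
    (ORG_TYPE_SCHEME, "college"): PURE_ORG_PREFIX + "college",
    (ORG_TYPE_SCHEME, "faculty"): PURE_ORG_PREFIX + "faculty",
    (ORG_TYPE_SCHEME, "school"): PURE_ORG_PREFIX + "school",
    (ORG_TYPE_SCHEME, "department"): PURE_ORG_PREFIX + "department",
}


def to_pure_uri(scheme_id: str, class_id: str) -> str:
    """Translate a CERIF classification into a PURE classification URI."""
    try:
        return CLASSIFICATION_MAP[(scheme_id, class_id)]
    except KeyError:
        raise LookupError(
            f"no PURE mapping for class {class_id!r} in scheme {scheme_id!r}"
        ) from None
```

Keying on both ids matters because the same cfClassId can appear in more than one scheme (for example "professor" occurs in both the personal and the generic job title schemes).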
The organisation relationship scheme classifies an organisation-to-organisation relation and is specified in the CERIF cfOrgUnit_OrgUnit link element.
Scheme id: class-scheme-organisation-relationship-types
CERIF cfClassId | Contextual meaning
is-parent-of | cfOrgUnitId1 is parent of cfOrgUnitId2 (cfOrgUnit_OrgUnit)
SUPA Themes
SUPA Themes are mapped to organisations in the Research Theme classification.
CERIF Scheme id: class-scheme-supa-themes
CERIF cfClassId | PURE Classification URI
supa-particle-physics | /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
supa-astronomy-space-physics | /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
supa-condensed-matter-material-physics | /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
supa-physics-life-sciences | /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
supa-energy | /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
supa-nuclear-plasma-physics | /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
supa-photonics | /dk/atira/pure/organisation/organisationtypes/organisation/researchtheme
Persons
Both internal and external authors are mapped to the CERIF cfPers type and distinguished from each other by classification via the cfPers_Class element.
Scheme id: class-scheme-person-types (cfPers_Class)
CERIF cfClassId | Contextual meaning
internal-person | the person is mapped to a PURE Person (and PURE authors)
external-person | the person is mapped to a PURE External Person Author (only present on publications etc.)
Employment types
A person's relation to an organisation is classified by an employment type. This is expressed in CERIF via the classification present in the cfPers_OrgUnit link element.
Scheme id: class-scheme-job-families
CERIF cfClassId | PURE Classification URI
academic | /dk/atira/pure/person/employmenttypes/academic
academic-research | /dk/atira/pure/person/employmenttypes/academicresearch
academic-teaching | /dk/atira/pure/person/employmenttypes/academicteaching
honorary | /dk/atira/pure/person/employmenttypes/honorary
emeritus | /dk/atira/pure/person/employmenttypes/emeritus
research-support | /dk/atira/pure/person/employmenttypes/research-support
Publications
Scheme id: class-scheme-cerif-publication-types
CERIF cfClassId | PURE Classification URI
book | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/book
book-review | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontobookanthology/other
book-chapter-abstract | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontobookanthology/forewo
book-chapter-review | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontobookanthology/other
in-book | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontobookanthology/entry
anthology | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/anthology
monograph | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/special
reference-book | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/book
textbook | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/book
encyclopaedia | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/anthology
manual | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/other
other-book | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/other
journal | /dk/atira/pure/journal/journaltypes/journal/journal
journal-article | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/article
journal-article-abstract | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/letter
journal-article-review | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/scientific
conference-proceedings | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontoconference/other
conference-proceedings-article | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontoconference/paper
letter | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/letter
letter-to-editor | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/letter
phd-thesis | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/scholarly
doctoral-thesis | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/scholarly
report | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/commissioned
short-communication | /dk/atira/pure/researchoutput/researchoutputtypes/bookanthology/other
poster | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontoconference/poster
presentation | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontoconference/other
news-clipping | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontobookanthology/entry
commentary | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/comment
annotation | /dk/atira/pure/researchoutput/researchoutputtypes/contributiontojournal/comment
Publication Peer Review
Whether or not a publication has been peer reviewed is signalled using the cfResPubl_Class element.
Scheme id: class-scheme-publication-peer-review
CERIF cfClassId | Contextual meaning
is-reviewed | The publication has been reviewed by a peer
is-not-reviewed | The publication has not been reviewed by a peer
Organisation to publication relations
Scheme id: class-scheme-cerif-orgunit-publication-roles
CERIF cfClassId | Contextual meaning
claims-ipr | the organisation is considered the owner of the publication
author-institution | the organisation has an author on the publication
is-author-of | the organisation has an author on the publication
is-publisher-of | for future use
publishing-inst | for future use
curator | for future use
reviewer | for future use
commissioned | for future use
funded | for future use
external-org | for future use
Electronic addresses
The electronic address classification is used to identify different types of addresses and is specified in the cfPers_EAddr element.
Scheme id: class-scheme-eaddress-types
CERIF cfClassId | Context Type
email | Email address (cfEAddr)
skype | Skype address (cfEAddr)
messenger | Instant Messaging
web | Web site URL
phone | Phone number
mobile | Mobile phone number
fax | Fax number
Entity Mappings Organisation PURE CERIF Default Mandatory Mandatory value
PURE field CERIF field
name
cfOrgUnitName.cfName
Y
shortName cfOrgUnit.cfAcro period.start
cfOrgUnit_Class.cfStartDate (the first appearance)
period.end
cfOrgUnit_Class.cfEndDate (the first appearance)
type
cfOrgUnit_Class (the first appearance)
Y
visibility
NA
Y
keywords
cfOrgUnitKeyw.cfKeyw
website
cfOrgUnit.cfURI
cfEAddr (first appearance classified as email)
Y
Y
today
Y Y FREE
Person
The UK model has two different person-organisation relation types, depending on whether the person is a member of staff or a student. In this proof-of-concept project we assume that only staff are synchronised.
PURE field
PURE Mandatory
CERIF field
name.firstname cfPersName-ADD.cfFirstNames (first appearance)
Y
name.lastName cfPersName-ADD.cfLastNames (first appearance)
Y
-
cfPersName-ADD.cfOtherNames
nameVariants
cfPersName-ADD (2nd to last appearance)
sex
cfPers.cfSex
CERIF Mandatory
Default value
Y
Person-organisation relationships
In CERIF, postal addresses, email addresses, etc. are associated directly with the person. In PURE these relations are gathered as metadata on a person-organisation relation, and a person can have one or more such relations. To overcome this mismatch, the CRISPool CERIF mapping bends the rules by using the classification of the cfPers_EAddr and cfPers_PAddr relations to carry data. The following special classification schemes have been created for this purpose. Common to all of them is that the cfClassId contains the cfOrgUnitId of the related organisation.
• cfPers_PAddr
o class-scheme-person-organisation-address-postal specifies a person's postal work address in relation to an organisation
• cfPers_EAddr
o class-scheme-person-organisation-address-email specifies a person's email work address in relation to an organisation
o class-scheme-person-organisation-address-web specifies a person's web work address in relation to an organisation
o class-scheme-person-organisation-address-phone specifies a person's work phone in relation to an organisation
o class-scheme-person-organisation-address-mphone specifies a person's work mobile phone in relation to an organisation
o class-scheme-person-organisation-address-fax specifies a person's work fax number in relation to an organisation
A person's relation to an organisation is classified by an employment type in the class-scheme-job-families scheme. This is expressed in CERIF via the classification present in a cfPers_OrgUnit link element.
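The workaround described in this section, where the cfClassId of an address link carries the cfOrgUnitId, means an importer can regroup a person's addresses per organisation. The sketch below illustrates this under stated assumptions: each link is represented as a plain dictionary with the hypothetical keys `cfClassSchemeId`, `cfClassId` and `address_id`, and the organisation id format used in the usage note is invented for the example.

```python
from collections import defaultdict

# Common prefix of the special schemes listed above; the suffix names the kind
# of address (postal, email, web, phone, mphone, fax).
SCHEME_PREFIX = "class-scheme-person-organisation-address-"


def group_addresses_by_org(links):
    """Group address ids by organisation and by kind of address.

    links: iterable of dicts with keys 'cfClassSchemeId', 'cfClassId' and
    'address_id' (a simplified, hypothetical view of cfPers_EAddr and
    cfPers_PAddr link records).
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for link in links:
        scheme = link["cfClassSchemeId"]
        if not scheme.startswith(SCHEME_PREFIX):
            continue  # an ordinary classification, not a person-organisation one
        kind = scheme[len(SCHEME_PREFIX):]
        org_unit_id = link["cfClassId"]  # by convention holds the cfOrgUnitId
        grouped[org_unit_id][kind].append(link["address_id"])
    return {org: dict(kinds) for org, kinds in grouped.items()}
```

Given two links for the same organisation, one in the email scheme and one in the phone scheme, the result is a single entry for that organisation holding both addresses, which corresponds to the metadata PURE expects on one person-organisation relation.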
Address (UK) PURE field CERIF field
PURE Mandatory
CERIF Mandatory
Default value
postalCode cfPAddr.cfPostCode country
cfPAddr.cfCountryCode
address1
cfPAddr.cfAddrline1
address2
cfPAddr.cfAddrline2
address3
cfPAddr.cfAddrline3
address4
cfPAddr.cfAddrline4
address5
cfPAddr.cfAddrline5
Y
Email, Skype, etc.
CERIF electronic addresses such as email, Skype and messenger are specified via a cfEAddr. The different electronic addresses are distinguished from each other by their classification, as specified earlier in the document. The actual electronic address is specified in the cfURI element.
Publications General fields PURE CERIF Mandatory Mandatory
PURE field
CERIF field
publishedDate, publicationYear, Month, -Day
cfResPubl.cfResPublDate
numberOfPages
cfResPubl.cfTotalPages
title (localised)
cfResPublTitle.cfTitle
abstract (localised)
cfResPublAbstr.cfAbstr
Default value
y
Y
bibliographicalNote cfResPublBiblNote.cfBiblNote keywords
cfResPublKeyw.cfKeyw
Contribution to Journal
PURE field
CERIF field
PURE Mandatory
CERIF Default Mandatory value
pages
cfResPubl.cfStartPage, cfResPubl.cfEndPage
journalNumber cfResPubl.cfNum volume
cfResPubl.cfVol
Book Anthology
PURE CERIF Default value Mandatory Mandatory
PURE field CERIF field printIsbns
cfResPubl.cfISBN
edition
cfResPubl.cfEdition
volume
cfResPubl.cfVol
Conference Contribution
PURE CERIF Default Mandatory Mandatory value
PURE field CERIF field pages
cfResPubl.cfStartPage, cfResPubl.cfEndPage
peerReview
cfResPubl_Class (peer review classification)
Contribution to Book Anthology
PURE CERIF Default value Mandatory Mandatory
PURE field
CERIF field
printIsbns
cfResPubl.cfISBN
edition
cfResPubl.cfEdition
hostPublicationTitle cfResPubl.cfSeries
Other Contribution
PURE field CERIF field
PURE
CERIF
Default value
Mandatory printIsbns
Mandatory
cfResPubl.cfISBN
Working Paper
PURE field CERIF field printIsbns
PURE CERIF Default value Mandatory Mandatory
cfResPubl.cfISBN
Appendix 4
Technical Summary - CRISPool project prototype implementation
Created by: Atira A/S, edited by Thomas Vestdam Date: 87 February 2011 Version: 1.0 Rev. nr. 19
Technical Summary
Below we outline how the CERIF-XML import and export functionality was implemented in Pure for the CRISPool project. In general, we observed a small number of important problems when using CERIF-XML as an exchange format:
Fragmentation – introduces unnecessary complexity, especially in import algorithms, as input must be scanned several times in order to collect all relevant XML-fragments that make up a single entity (e.g. a person). In addition, the excessive scanning of XML input also causes performance issues. Suggestion: allow certain XML entities/types to include other relevant entities resulting in a single comprehensive document type covering everything related to that type. E.g. the person element could allow inclusion of optional sub-elements such as names, keywords, relations other to other entities, etc. That is, the CERIF-model is kept as it is, but the exchange format becomes more suited for machine processing (as well as improve human readability). Too many namespaces – parsing and querying (e.g. using XPath) CERIF-XML is very cumbersome as every single element has its own namespace. Suggestion: only have one namespace per CERIF version. This would also allow having only one XML schema defining CERIF-XML. Constraints – the XML format should not impose too many constraints on data sizes other than IDs. It makes good sense to keep an upper limit to ID lengths, but we suggest that names, titles, abstract, etc. should be unbounded, and leave it up the different CRIS systems to decide, what to do if the incoming data length is greater than the systems internal representation. Most of the issues seem to stem from the fact that the CERIF-XML format is very close to the relational database schema defined the CERIF model. This leaves some desired improvements to CERIF-XML as an exchange format. However, the bottom line is that CERIF (XML) can, as such, be utilised as a flexible exchange format.
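To make the fragmentation and namespace points concrete, the sketch below contrasts the two styles using Python's standard xml.etree.ElementTree module. It is an illustration only: the namespace URIs (urn:example:...), the wrapper elements and the exact element layout are assumptions and do not reproduce the real CERIF-XML schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment documents, one namespace per element type.
# Namespace URIs and layout are illustrative, not the real CERIF-XML schemas.
PERSONS = """<persons xmlns="urn:example:cfPers">
  <cfPers><cfPersId>p1</cfPersId></cfPers>
  <cfPers><cfPersId>p2</cfPersId></cfPers>
</persons>"""

NAMES = """<names xmlns="urn:example:cfPersName">
  <cfPersName><cfPersId>p1</cfPersId><cfFamilyNames>Smith</cfFamilyNames></cfPersName>
  <cfPersName><cfPersId>p2</cfPersId><cfFamilyNames>Jones</cfFamilyNames></cfPersName>
</names>"""

# Hypothetical nested form with a single namespace, as suggested in the text.
NESTED = """<cerif xmlns="urn:example:cerif">
  <cfPers><cfPersId>p1</cfPersId>
    <cfPersName><cfFamilyNames>Smith</cfFamilyNames></cfPersName>
  </cfPers>
</cerif>"""


def collect_person(person_id):
    """Fragmented style: scan every document, each with its own namespace."""
    result = {"id": None, "family_names": []}
    ns_p = {"p": "urn:example:cfPers"}
    for pers in ET.fromstring(PERSONS).findall("p:cfPers", ns_p):
        if pers.findtext("p:cfPersId", namespaces=ns_p) == person_id:
            result["id"] = person_id
    ns_n = {"n": "urn:example:cfPersName"}
    for name in ET.fromstring(NAMES).findall("n:cfPersName", ns_n):
        if name.findtext("n:cfPersId", namespaces=ns_n) == person_id:
            result["family_names"].append(
                name.findtext("n:cfFamilyNames", namespaces=ns_n))
    return result


def read_nested(person_id):
    """Nested style: one pass, one namespace, everything under the person."""
    ns = {"c": "urn:example:cerif"}
    for pers in ET.fromstring(NESTED).findall("c:cfPers", ns):
        if pers.findtext("c:cfPersId", namespaces=ns) == person_id:
            names = [n.findtext("c:cfFamilyNames", namespaces=ns)
                     for n in pers.findall("c:cfPersName", ns)]
            return {"id": person_id, "family_names": names}
    return None
```

With only two fragment types the difference is small, but each additional fragment type (keywords, relations, and so on) adds another scan and another namespace mapping to the first function, while the second stays a single traversal.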
Outline of the CERIF-XML import functionality

Importing CERIF-XML into Pure is done by first loading the supplied CERIF-XML files for a given data provider into an XML database (eXist-db was used, http://exist.sourceforge.net/). Content is then either created or updated in Pure based on the information in the XML database. If a given piece of content (e.g. person, organisation, research output) does not already exist in Pure, the content is created and stored along with the id found in the input (e.g. cfPersId) as a source id. It is this source id that is used to check for existence in Pure.

When creating or updating content, all relevant fragments are loaded from the XML database. For a given person, for example, these are the XML fragments relating to the specific person id, such as:

- the person element
- person name elements
- associated keyword elements
- person-organisation relation elements
- and so on

The relevant XML fragments are found by performing several XPath queries in the XML database in order to provide a "single document" containing all XML fragments for a given piece of content. The XML fragments are then transferred to the relevant entities in the Pure model (we use XMLBeans to create bindings to Java types, http://xmlbeans.apache.org/).

The XML-database approach was chosen over a handwritten parser for simplicity, and to provide a better basis for handling very large data sets (if needed, the XML database can be kept in memory or be streamed to the file system).
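The create-or-update step keyed on the source id can be sketched as follows. This is a simplified illustration in Python, not the Pure implementation: plain dictionaries stand in for both the XML database and the Pure model, and the names ContentStore, upsert and import_persons are hypothetical.

```python
class ContentStore:
    """Stand-in for the target system; items are keyed by the source id."""

    def __init__(self):
        self._by_source_id = {}

    def upsert(self, source_id, fields):
        """Create the item if the source id is unknown, otherwise update it."""
        existing = self._by_source_id.get(source_id)
        if existing is None:
            self._by_source_id[source_id] = dict(fields)
            return "created"
        existing.update(fields)
        return "updated"

    def get(self, source_id):
        return self._by_source_id[source_id]


def import_persons(store, fragments):
    """Merge all fragments per person id, then create or update each person.

    fragments is an iterable of (cfPersId, field dict) pairs; several pairs
    may refer to the same person, mirroring the fragmented input.
    """
    merged = {}
    for pers_id, fields in fragments:
        merged.setdefault(pers_id, {}).update(fields)
    return {pers_id: store.upsert(pers_id, fields)
            for pers_id, fields in merged.items()}


store = ContentStore()
first = import_persons(store, [("p1", {"family_names": "Smith"}),
                               ("p1", {"keyword": "optics"})])
second = import_persons(store, [("p1", {"keyword": "lasers"})])
```

The first run creates the person because the source id is unknown; the second run finds the same source id and updates the existing item instead of creating a duplicate.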
Outline of the CERIF-XML export functionality

Exporting content from Pure is implemented using the export framework in Pure by defining a series of "converters". Each converter is responsible for converting a given Pure model entity (e.g. Person, Organisation, Research Output, Journal, Patent) to CERIF-XML. The converter for a specific model entity is responsible for creating all relevant CERIF-XML fragments representing that entity. For a person, for example, these are fragments such as:

- the CERIF person element, in a CERIF person elements XML file
- the person's name, in a CERIF person name elements XML file
- keywords associated with the person, in a CERIF person keyword elements XML file
- relations to organisations, in a CERIF person organisation elements XML file
- and so on
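The converter idea can be sketched as a function per entity type that returns its fragments grouped by the output file they belong to. The Python below is an assumption-laden illustration, not the Pure export framework: the file names, the dictionary-based person record and the simplified element layout are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def convert_person(person):
    """Return {output file name: [XML fragment, ...]} for one person.

    File names and element layout are simplified placeholders.
    """
    out = defaultdict(list)

    # The person element itself.
    pers = ET.Element("cfPers")
    ET.SubElement(pers, "cfPersId").text = person["id"]
    out["cfPers.xml"].append(pers)

    # The person's name, written to a separate file.
    name = ET.Element("cfPersName")
    ET.SubElement(name, "cfPersId").text = person["id"]
    ET.SubElement(name, "cfFamilyNames").text = person["family_names"]
    out["cfPersName.xml"].append(name)

    # One relation fragment per associated organisation.
    for org_id in person.get("organisations", []):
        rel = ET.Element("cfPers_OrgUnit")
        ET.SubElement(rel, "cfPersId").text = person["id"]
        ET.SubElement(rel, "cfOrgUnitId").text = org_id
        out["cfPers_OrgUnit.xml"].append(rel)
    return dict(out)


# Registry of converters, one per model entity type.
CONVERTERS = {"Person": convert_person}


def convert(entity_type, entity):
    """Dispatch to the converter registered for the entity type."""
    return CONVERTERS[entity_type](entity)


fragments = convert("Person", {"id": "p1", "family_names": "Smith",
                               "organisations": ["o1", "o2"]})
```

Grouping fragments by target file at conversion time means the final serialisation step only has to write each group out, regardless of which entity produced it.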
When exporting, the relevant data is loaded based on a list of the organisations for which data is needed: each research output associated with any of the input organisations or their sub-organisations is loaded and converted one by one. For each research output, the associated organisations, authors (persons) and journals are loaded and converted. In turn, when converting a person, any associated organisations are converted, and when converting an organisation, any associated organisations (e.g. sub-organisations) are converted as well. Because entities are loaded and converted recursively, the exporter keeps track of already converted entities by keeping a list of their UUIDs. This ensures that each entity is loaded and converted only once.

The actual XML files are generated once all relevant entities have been converted, by serialising all XML fragments into a set of CERIF-XML files. In this specific implementation an XML database was used to store the XML fragments temporarily while converting data, and when the conversion was done, the final XML files were created using the serialising capabilities of the XML database. The CERIF-XML export is provided as a special web service that delivers the serialised CERIF-XML files as a bundled zip file.

The procedure and techniques described above can be applied to any system that aims to export to CERIF-XML. However, the description of how data is loaded is of course Pure specific, and just serves as an example. While writing the exporter a mapping document was created, and mapping decisions were recorded in the document.
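The recursive traversal with a record of already converted entities, followed by bundling into a zip file, can be sketched as below. This is a minimal Python illustration under stated assumptions: short string ids stand in for UUIDs, dictionaries stand in for the Pure model, journals are omitted, tuples stand in for XML fragments, and the function names export and bundle are hypothetical.

```python
import io
import zipfile


def export(output_ids, outputs, persons, organisations):
    """Walk research outputs and related entities, converting each only once.

    outputs maps output id -> author ids, persons maps person id ->
    organisation ids, organisations maps organisation id -> sub-organisation ids.
    """
    converted = []  # conversion order; tuples stand in for XML fragments
    seen = set()    # ids (UUIDs in the text) of entities already converted

    def visit_org(org_id):
        if org_id in seen:
            return
        seen.add(org_id)
        converted.append(("organisation", org_id))
        for sub_id in organisations[org_id]:
            visit_org(sub_id)

    def visit_person(person_id):
        if person_id in seen:
            return
        seen.add(person_id)
        converted.append(("person", person_id))
        for org_id in persons[person_id]:
            visit_org(org_id)

    for output_id in output_ids:
        if output_id in seen:
            continue
        seen.add(output_id)
        converted.append(("output", output_id))
        for person_id in outputs[output_id]:
            visit_person(person_id)
    return converted


def bundle(files):
    """Pack {file name: text} into an in-memory zip archive (bytes)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    return buffer.getvalue()


order = export(
    ["r1", "r2"],
    outputs={"r1": ["p1", "p2"], "r2": ["p1"]},
    persons={"p1": ["o1"], "p2": ["o1"]},
    organisations={"o1": ["o2"], "o2": []},
)
archive_bytes = bundle({"cfPers.xml": "<cfPers/>"})
```

In the example, person p1 and organisation o1 are each reachable by more than one path but appear only once in the result, and the same check would also stop the traversal if two organisations referred to each other.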