PERSISTENT IDENTIFIERS:
COMMERCIAL AND HERITAGE VIEWS

5/9 – CASE STUDY 3: DataCite at TIB-UB [1]

Summary

The DataCite service identifies research data outputs so they can be re-used for new research. In this it shares some goals with digital cultural heritage and shows how DOI might be useful to museums and archives.

A Linked Heritage partner, the German National Library of Science and Technology (Technische InformationsBibliothek; TIB), funded by the German Federal Government and States, primarily offers a global documentation service covering mathematical and engineering sciences:

  • Architecture
  • Chemistry
  • Computer science
  • Mathematics
  • Physics
  • Engineering technology

TIB also hosts a centre of expertise in metadata for multimedia items, exemplified by its PROBADO initiative, which develops data models for music and digital 3D architectural drawings.

As we discussed in the introduction, identifiers link things with information about their characteristics. Hence TIB, having expertise in research information and the metadata describing it, is a natural home for the DataCite consortium, which maintains DOIs for full sets of research data.

Research data was traditionally a kind of “grey literature”, unpublished but still valuable – produced in the course of scientific research.

Although there is no commercial supply chain for research data (yet?), it is possible to identify a kind of “audit trail” for the products of the academic research process:

[Figure: Data, information and knowledge in the academic research 'trajectory' (after Brase, J. 2012)] [2]

DataCite’s identifier and attendant metadata aim to enable citation and quotation (or re-use) of original data in new contexts.

High-powered, networked computing can derive new value from large aggregations of real data through:

  • Simulations – creation of model experimental situations within a computer, based on real-world data
  • Meta-analysis – statistical studies of research data combined from many real-life studies

By sharing their original raw data from the earliest stages – rather than only at the end, in condensed summary form – scientists can:

  • do more and better experiments,
  • more quickly,
  • and get more value from the research that has been funded in the past.

DataCite uses DOI to give the research data identifiers that are:

  • persistent – the location of a data file may change when an academic changes institution, or when data archive systems are changed
  • flexible – the DOI system can link many versions of a data file to a single identifier so that researchers can choose the most appropriate copy, e.g. by format, subset of data or language of text
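The two properties above can be illustrated with a toy, in-memory resolver. This is only a sketch of the idea, not the DOI system's actual implementation: the `Resolver` class, the format labels, the DOI `10.1234/example-dataset` and all URLs are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DoiRecord:
    """One identifier pointing at one or more copies of the same data."""
    doi: str
    # Maps a format label (e.g. "csv") to the current location of that copy.
    locations: Dict[str, str] = field(default_factory=dict)


class Resolver:
    """Toy stand-in for a DOI resolver (hypothetical, not the real service)."""

    def __init__(self) -> None:
        self._records: Dict[str, DoiRecord] = {}

    def register(self, doi: str, fmt: str, url: str) -> None:
        # Registering the same format again simply updates its location.
        record = self._records.setdefault(doi, DoiRecord(doi))
        record.locations[fmt] = url

    def resolve(self, doi: str, fmt: str) -> str:
        return self._records[doi].locations[fmt]


resolver = Resolver()
doi = "10.1234/example-dataset"  # hypothetical DOI
resolver.register(doi, "csv", "https://old-university.example/data/run42.csv")
resolver.register(doi, "netcdf", "https://old-university.example/data/run42.nc")

# Persistence: the data moves with its author, but the DOI stays the same.
resolver.register(doi, "csv", "https://new-university.example/archive/run42.csv")

# Flexibility: one identifier, several copies chosen by format.
print(resolver.resolve(doi, "csv"))
print(resolver.resolve(doi, "netcdf"))
```

A citation that quotes only the DOI keeps working after the move, because the mapping from identifier to location is the only thing that changes.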

The DataCite service reports [3] that it has registered 1,498,811 DOIs to date, and the number of successful look-ups has grown steadily since the end of 2011.

The requirements for persistence and flexibility are identical to those in the digital cultural heritage world:

  • a digitised version of a cultural heritage object, or documentation about it, may move in the same way as a research data set
  • researchers and interested members of the public may want to access different resolution images, or description in different languages

Heritage datasets in DataCite?

There are some real similarities between the data available through DataCite and those aggregated and curated by cultural heritage projects like Linked Heritage:

  • Both begin with some kind of unique event that is the object of interest, rather than a set of information, or with a tangible object that bears witness to an event or a series of events in history
  • Both types of data arise from “research” – in fact, a set of related Linked Heritage Cultural Heritage Object records could be considered a dataset in DataCite terms
  • Both types of data are intended for use in further “research”
  • Additional value can be added to both types of metadata –
    • through linking multiple digital surrogates,
    • and through linking the metadata to object data through RDF representation
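As a rough illustration of the linking described above, RDF expresses each statement as a (subject, predicate, object) triple. The sketch below uses plain Python tuples rather than an RDF library. The `https://museum.example/` namespace, the object and image URIs and the title are hypothetical; the predicates are taken from the Dublin Core terms vocabulary, chosen here only for illustration and not as the vocabulary Linked Heritage or DataCite actually uses.

```python
# Each triple is (subject, predicate, object), the basic unit of RDF.
EX = "https://museum.example/"          # hypothetical namespace
DCTERMS = "http://purl.org/dc/terms/"   # Dublin Core terms namespace

painting = EX + "object/123"            # hypothetical heritage object URI

triples = [
    (painting, DCTERMS + "title", "Example painting"),
    # Two digital surrogates of the same object, at different resolutions.
    (painting, DCTERMS + "hasFormat", EX + "image/123-thumbnail.jpg"),
    (painting, DCTERMS + "hasFormat", EX + "image/123-full.tif"),
    # The descriptive metadata record that refers to the object.
    (painting, DCTERMS + "isReferencedBy", EX + "record/123.xml"),
]


def objects(subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


surrogates = objects(painting, DCTERMS + "hasFormat")
print(surrogates)
```

Because every statement hangs off the same object URI, adding another surrogate or another metadata record is just one more triple.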

Several heritage institutions have already used DataCite to assign DOIs to some of their digital collections; there are almost 19,000 documents submitted to the DataCite metadata catalogue whose publisher is some kind of museum. Most of these come from the Museum of Vertebrate Zoology at Berkeley as might be expected (they are in fact specimen records from a natural sciences collection).

DataCite also contains a set of web pages for paintings by Karl Hagemeister at the Bröhan Museum, Berlin (one example: Birken am Bach im Spätherbst). Collections like these are possibly under-represented in DataCite due to:

  • existing methods of citing heritage objects like paintings
  • slight differences in emphasis in the type of “research” that uses museum data:

| | DATACITE RECORDS | LINKED HERITAGE RECORDS |
| --- | --- | --- |
| Object of description | Unique “event” (experiment, investigation, publication or review) | Unique curated cultural heritage object or collection |
| Data creation context | Data created in the course of scientific research on a specific problem or project theme | Data created in the course of on-going curatorial research into a gallery, museum, archive or library’s cultural programme, or one specific project theme |
| Access to source object | Full original data often available openly | Original heritage object normally accessible to the public (or at least bona fide researchers) |
| Richness of metadata | Basic metadata to enable unambiguous identification, relationships with other identified objects, interoperability with richer data schemas | Extremely rich metadata enabling (in principle) detailed description of described object AND all related entities |
| Access to metadata describing object | Metadata openly available (or notice as to why not) | Metadata in full LIDO format NOT yet openly available |
| Digital surrogates available | Multiple representations of data and/or metadata possible through content negotiation | Multiple Digital Objects (e.g. digital photos of a building taken from different angles, scans of different pages – or different sections of pages – of documents) often collected and linked via one LIDO record |
| End-user value of metadata and source object | Citation and re-use (quotation, further analysis, incorporation into meta-study) of datasets in academic research, educational materials | Citation of object records in academic and professional research, educational materials, public-service cultural offerings |
| Underlying data model | Indecs-compatible data model (DOI Kernel) | CIDOC-CRM compatible data schema |
| Potential for linked data | Linked data representation of DataCite metadata already published | LIDO as RDF in development |
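
Content negotiation, mentioned in the comparison above, means the client states which representation it wants in the HTTP `Accept` header and the resolver answers accordingly. The sketch below only builds such requests and does not send them. The DOI is hypothetical, and the two media types are given on the assumption that the content negotiation service still accepts them, so they should be checked against the current DataCite documentation.

```python
from urllib.request import Request


def metadata_request(doi: str, media_type: str) -> Request:
    """Build (but do not send) a request for one representation of a DOI."""
    return Request(f"https://doi.org/{doi}", headers={"Accept": media_type})


doi = "10.1234/example-dataset"  # hypothetical DOI

# Same identifier, two different representations requested.
bibtex = metadata_request(doi, "application/x-bibtex")
datacite_xml = metadata_request(doi, "application/vnd.datacite.datacite+xml")

print(bibtex.full_url, bibtex.get_header("Accept"))
print(datacite_xml.full_url, datacite_xml.get_header("Accept"))
```

Passing either request to `urllib.request.urlopen` would perform the look-up over the network; that step is left out so the sketch runs offline.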

Lessons from the case study

  • The DOI system is already used for a type of data that is very similar to heritage object data
  • There are some differences between research data and museum and archive documentation that mean a different solution for this data might be needed
  • Both sectors share a concern for digital preservation and this is one reason to use DOI

Explore further

DataCite: Information for potential clients (UK via the British Library)
Step-by-step videos showing how the DataCite system works from the user’s perspective

DataCite workshops at the British Library
Presentations providing background and context, including case studies of DataCite implementations in the UK such as the Archaeology Data Service, the UK Data Archive and Data.bris, as well as discussion of technical issues and possible future developments

NOTES



1 This section draws on Brase, J. (2012) DataCite and Linked Data. Presented at: Global Interoperability and Linked Data in Libraries, Florence, Italy, 18th-19th June 2012.
2 Data, information and knowledge in the academic research "trajectory" (after Brase, J. 2012).
3 See http://stats.datacite.org/ for full statistics