LINKING CULTURAL HERITAGE INFORMATION
2/5 – Analysis of The Cloud by Linking cultural heritage information
Linking cultural heritage information is the name of the second work package (WP2) of Linked Heritage, a 30-month EU project that started on 1 April 2011 and is coordinated by ICCU (Istituto Centrale per il Catalogo Unico delle biblioteche italiane, Rome).
Some of the objectives set by WP2 are:
- to explore the state of the art in linked data and its applications and potential;
- to identify the most appropriate models, processes and technologies for the deployment of cultural heritage information repositories as linked data.
In particular, WP2 examined both the structure of the Linked Data Cloud and the information it carries.
The Cloud
The Cloud is the best known representation of linked data. It shows “packages” of linked data and the links between packages. It is growing very quickly and the most recent diagram (September 2011) counts 331 packages.
The Cloud is maintained on The Data Hub website, a W3C project that acts as a registry of open and not-open knowledge, with information on packages and projects.
The Data Hub website is managed by the Linking Open Data community project, which is part of the W3C’s Semantic Web Education and Outreach Interest Group (SweoIG). It may therefore be considered to represent a significant proportion of the linked data available.
For each package The Data Hub website gives information about:
- name and description;
- links to the resources available;
- intellectual property rights status;
- which other packages it links to (including the number of links);
- the number of “triples” in the package (a measure of size);
- subject information and the formats used.
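The fields listed above can be pictured as one record per package. The following is only a sketch: the class name, field names and example values are hypothetical and do not reproduce The Data Hub's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PackageRecord:
    """Hypothetical record mirroring the information listed for each package."""
    name: str
    description: str = ""
    resource_urls: List[str] = field(default_factory=list)
    licence: Optional[str] = None          # None models "licence not given"
    is_open: bool = False                  # IPR status: open or not open
    links_to: Dict[str, int] = field(default_factory=dict)  # target package -> number of links
    triple_count: Optional[int] = None     # size of the package in triples
    subject_tags: List[str] = field(default_factory=list)
    formats: List[str] = field(default_factory=list)


# Made-up values, for illustration only.
example = PackageRecord(
    name="Example package",
    licence="CC0",
    is_open=True,
    links_to={"DBpedia": 1200},
    triple_count=5_000_000,
    subject_tags=["library"],
    formats=["rdf", "skos"],
)
```

Keeping the licence and the triple count optional reflects the fact that this information is not always supplied for a package.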
The WP2 analysis revealed an unexpected picture of The Cloud, one that goes beyond guiding principles and plans.
Is The Cloud open?
Given that in The Cloud “open” means able to be reused, enriched and shared (including commercially), the analysis shows that a significant component of The Cloud is not open.
IPR Status | % by Package |
---|---|
Open | 42.6 |
Not open | 57.4 |
One reason for this anomaly may be that the publishers of the first packages in The Cloud did not consider it important to have a licence.
Which IPR licences are used?
Open licences
Of the 132 packages with open licences:
Licence type | % by Package |
---|---|
Creative Commons Attribution (CC BY) | 28.8 |
Creative Commons Attribution Share Alike (CC BY-SA) | 18.2 |
Open Data Commons Public Domain Dedication and Licence (ODC PDDL) | 10.6 |
Creative Commons CC Zero (CC0) | 9.1 |
UK Crown Copyright with data.gov.uk rights | 7.6 |
Other (Public Domain) | 6.8 |
Other (Open) | 5.3 |
Others | 12.9 |
CC0 is a relatively new option and is the choice made by Europeana, and in turn by its providers, for publishing linked open data. It is the most permissive of the open licences, with attribution being a recommendation rather than a requirement.
Not open licences
Of the 178 packages with licences that are not open, or with no licence information:
Licence type | % by Package |
---|---|
not given | 69.1 |
None | 14.6 |
Creative Commons Attribution Non-commercial (CC BY-NC) | 7.3 |
Other (Not Open) | 6.7 |
Creative Commons Attribution (CC BY) | 1.1 |
Other (Non-Commercial) | 0.6 |
Creative Commons Attribution Share alike (CC BY-SA) | 0.6 |
For over 80% of the packages in this part of The Cloud (the “not given” and “None” rows) there is no information about IPR: a large part of the published linked data does not seem to have a licence for its use. As a result, it is unclear what can be done with this data.
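The percentages in this section can be checked from the two package counts given above (132 open, 178 not open). The short calculation below is a sketch of that check; the variable names are illustrative, and the number of packages without licence information is back-calculated from rounded percentages, so it is approximate.

```python
open_packages = 132       # packages with open licences
not_open_packages = 178   # packages with non-open licences or no licence information
total = open_packages + not_open_packages  # 310 packages covered by the licence tables

open_share = round(100 * open_packages / total, 1)          # 42.6
not_open_share = round(100 * not_open_packages / total, 1)  # 57.4

# "not given" (69.1%) plus "None" (14.6%) within the not-open group
no_licence_info_share = round(69.1 + 14.6, 1)               # 83.7

# Approximate number of packages behind that share
no_licence_info_packages = round(not_open_packages * no_licence_info_share / 100)

print(open_share, not_open_share, no_licence_info_share, no_licence_info_packages)
```

On these figures, roughly 149 of the 310 packages covered by the tables carry no usable licence information.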
How big is The Cloud?
There are c. 38 billion triples in The Cloud, with package sizes varying widely: 9 packages (2.89%) have over a billion triples each, while nearly a quarter of the packages are relatively small.
The ten largest packages with open licences are:
Package | Number of triples |
---|---|
LinkedGeoData | 3.00 billion |
UK Legislation | 1.90 billion |
Linked Sensor Data (Kno.e.sis) | 1.73 billion |
data.gov.uk Time Intervals | 1.00 billion |
DBpedia | 1.00 billion |
Open Library data mirror in the Talis Platform | 0.54 billion |
The Open Library | 0.40 billion |
Freebase | 0.34 billion |
transport.data.gov.uk | 0.33 billion |
Data Incubator: MusicBrainz | 0.18 billion |
The ten largest packages without open licences are:
Package | Number of triples |
---|---|
TWC: Linking Open Government Data | 9.80 billion |
Data.gov | 6.40 billion |
Source Code Ecosystem Linked Data | 1.50 billion |
2000 U.S. Census in RDF (rdfabout.com) | 1.00 billion |
PubMed | 0.80 billion |
DBTune.org MySpace RDF Service | 0.66 billion |
UniParc | 0.63 billion |
DBTune.org AudioScrobbler RDF Service | 0.60 billion |
Linking Italian University Statistics Project | 0.59 billion |
UniProt UniRef | 0.49 billion |
TWC: Linking Open Government Data is the largest package in The Cloud and is an aggregation of US government data.
Which are the subjects in the data?
There does not seem to be a controlled terminology for The Cloud, with the same subject represented by different tags in different packages.
The WP2 analysis combined a number of tags that appear to denote the same subject. The ten most common subjects are:
Subject tag | Number of packages with tag |
---|---|
publications | 94 |
government | 54 |
life sciences | 46 |
geographic | 40 |
media | 32 |
library | 22 |
United Kingdom | 22 |
education | 20 |
user generated content | 19 |
bibliographic | 15 |
There is very little cultural heritage data. This is probably because, until the advent of Europeana, there had been no interest in linked data in this community. The appearance of United Kingdom as a tag mainly reflects the UK Government’s policy of publishing linked data. The role of the USA is not apparent, but this is because packages are not tagged United States.
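Combining tags that stand for the same subject, as described above, can be sketched as a lookup from variant spellings to one canonical label, followed by a count of packages per label. The synonym map and the sample packages below are invented for illustration; they are not the mapping actually used by WP2.

```python
from collections import Counter

# Hypothetical synonym map: variant tag (lower case) -> canonical subject
TAG_SYNONYMS = {
    "publication": "publications",
    "lifesciences": "life sciences",
    "life-sciences": "life sciences",
    "geo": "geographic",
    "uk": "United Kingdom",
}


def canonical_tag(tag):
    """Normalise case and whitespace, then map known variants to one label."""
    key = tag.strip().lower()
    return TAG_SYNONYMS.get(key, key)


def count_subjects(packages):
    """Count packages per canonical subject; a package counts once per subject."""
    counts = Counter()
    for tags in packages.values():
        counts.update({canonical_tag(t) for t in tags})
    return counts


# Made-up sample data
sample = {
    "pkg-a": ["Publications", "publication", "UK"],
    "pkg-b": ["life-sciences", "geo"],
    "pkg-c": ["lifesciences", "uk"],
}
subject_counts = count_subjects(sample)
print(subject_counts.most_common())
```

Using a set per package prevents a package that carries two variants of the same tag from being counted twice.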
Which formats are used to encode data?
The most commonly used formats are:
Format | Number of packages using the format |
---|---|
Resource Description Framework (rdf) | 261 |
Dublin Core (dc) | 97 |
Friend of a Friend (foaf) | 84 |
Simple Knowledge Organization System (skos) | 57 |
RDF Schema (rdfs) | 42 |
Web Ontology Language (owl) | 34 |
Basic Geo (geo) | 25 |
Advanced Knowledge Technologies Reference Ontology (akt) | 22 |
eXtensible HyperText Markup Language (xhtml) | 19 |
Bibliographic Ontology (bibo) | 14 |
none given | 13 |
Music Ontology (mo) | 13 |
DBpedia Ontology (dbpedia) | 12 |
Others | 52 |
The AKT Ontology, DBpedia Ontology and GeoNames Ontology were each developed in the context of publishing a single package as linked data. The adoption of formats of this type by further packages suggests that these de facto standards are playing a significant role in The Cloud.
It is surprising, given that Berners-Lee recommends using standard formats, to find that 75 formats are used by two or fewer packages: for the sake of interoperability, one may hope for the survival of the fittest!
How is The Cloud linked?
The most important aspect of The Cloud is how the packages are linked together. The most commonly linked-to packages, in terms of the number of packages linking to them, are:
Package being linked to | Number of packages linking |
---|---|
DBpedia | 158 |
GeoNames Semantic Web | 42 |
(none) | 34 |
DBLP Computer Science Bibliography (RKBExplorer) | 27 |
Association for Computing Machinery (ACM) (RKBExplorer) | 26 |
ePrints3 Institutional Archive Collection (RKBExplorer) | 26 |
Freebase | 25 |
Others | 72 |
The success of DBpedia and GeoNames is probably due to their being well-known. But the most interesting thing is that over 10% of the packages in The Cloud do not link to other packages – included in this group are some of the largest packages, e.g. Data.gov, 2000 U.S. Census. This shows that the linking of packages is not something that is growing in an “organic” way.
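The two measures used here, the number of packages linking to each target and the share of packages with no outgoing links, can be computed from a simple mapping of each package to its link targets. The function and the sample data below are an illustrative sketch, not the procedure WP2 followed.

```python
from collections import Counter


def link_statistics(outgoing):
    """outgoing maps each package to the set of packages it links to.

    Returns (incoming link counts, packages with no outgoing links,
    percentage of packages with no outgoing links).
    """
    incoming = Counter()
    for source, targets in outgoing.items():
        for target in targets - {source}:   # ignore self-links
            incoming[target] += 1
    unlinked = sorted(p for p, t in outgoing.items() if not (t - {p}))
    share_unlinked = 100 * len(unlinked) / len(outgoing)
    return incoming, unlinked, share_unlinked


# Made-up sample data
sample_links = {
    "pkg-a": {"DBpedia", "GeoNames"},
    "pkg-b": {"DBpedia"},
    "pkg-c": set(),
    "DBpedia": {"GeoNames"},
    "GeoNames": set(),
}
incoming, unlinked, share_unlinked = link_statistics(sample_links)
print(incoming.most_common(), unlinked, share_unlinked)
```

In the table above, the “(none)” row corresponds to the `unlinked` list in this sketch.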
Some initiatives have been responsible for creating large parts of The Cloud: such an initiative would be welcome in the cultural heritage sector too, where Europeana is currently taking a leading role.
Cultural Heritage data in The Cloud
The world of networked information is very interested in the legacy data produced by libraries, archives and museums, since these institutions are traditionally known to play a key role in producing quality information.
Unfortunately, only 18 packages in The Cloud could be identified as having “cultural heritage” as their subject or as being related to it:
Package | Number of triples |
---|---|
VIAF: The Virtual International Authority File | 200,000,000 |
Europeana Linked Open Data | 185,000,000 |
British National Bibliography (BNB) | 80,249,538 |
Hungarian National Library (NSZL) catalog | 19,300,000 |
Amsterdam Museum as Linked Open Data in the Europeana Data Model | 5,000,000 |
Library of Congress Subject Headings | 4,151,586 |
Swedish Open Cultural Heritage | 3,400,000 |
Calames | 2,000,000 |
RAMEAU subject headings (STITCH) | 1,619,918 |
data.bnf.fr - Bibliothèque nationale de France | 1,400,000 |
National Diet Library of Japan subject headings | 1,294,669 |
Gemeenschappelijke Thesaurus Audiovisuele Archieven – Common Thesaurus Audiovisual Archives | 992,797 |
Gemeinsame Normdatei (GND) | 629,582 |
Archives Hub Linked Data | 431,088 |
Thesaurus for Graphic Materials (t4gm.info) | 103,000 |
Italian Museums (LinkedOpenData.it) | 49,897 |
Thesaurus W for Local Archives | 11,000 |
MARC Codes List Open Data | 8,816 |
The cultural heritage part of The Cloud is still rather small (c. 500m triples, or <1.5%), but it is hoped that developments planned by Europeana will significantly increase its size. The Linked Heritage project will be a significant component of this growth.
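The totals quoted above can be reproduced by summing the triple counts in the table and dividing by the approximate size of The Cloud (c. 38 billion triples). The calculation below is a simple check under that assumption; the variable names are illustrative.

```python
# Triple counts of the 18 cultural heritage packages, in table order
triple_counts = [
    200_000_000, 185_000_000, 80_249_538, 19_300_000, 5_000_000,
    4_151_586, 3_400_000, 2_000_000, 1_619_918, 1_400_000,
    1_294_669, 992_797, 629_582, 431_088, 103_000,
    49_897, 11_000, 8_816,
]

cloud_total = 38_000_000_000  # approximate size of The Cloud quoted in this page

total_triples = sum(triple_counts)
share_percent = 100 * total_triples / cloud_total

print(f"{total_triples:,} triples, {share_percent:.2f}% of The Cloud")
```

The sum is 505,641,891 triples, about 1.33% of The Cloud, which is consistent with the rounded figures given in the text.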
Format
Cultural heritage packages usually follow these formats:
Format | Number of packages using the format |
---|---|
Resource Description Framework | 13 |
Simple Knowledge Organization System | 11 |
Dublin Core | 7 |
eXtensible HyperText Markup Language | 4 |
Friend of a Friend | 3 |
Basic Geo | 1 |
Bibliographic Ontology | 1 |
DBpedia | 1 |
Music Ontology | 1 |
Object Reuse and Exchange | 1 |
RDF Schema | 1 |
vCard | 1 |
Web Ontology Language | 1 |
XML Schema | 1 |
The general picture is similar to that of The Cloud as a whole, except that the use of SKOS is much more significant, indicating the importance of terminological resources and authority files in the sector. Of note is the absence of a format specifically for museum information. In addition, the Europeana Data Model is not mentioned in The Data Hub, although according to other sources it was certainly used by some packages.
Link
Cultural heritage packages in The Cloud link to the following targets:
Package being linked to | Number of packages linking |
---|---|
DBpedia | 5 |
Library of Congress Subject Headings | 4 |
VIAF: The Virtual International Authority File | 2 |
GeoNames Semantic Web | 2 |
Dewey Decimal Classification (DDC) | 2 |
RAMEAU subject headings (STITCH) | 2 |
Swedish Open Cultural Heritage | 1 |
Gemeinsame Normdatei (GND) | 1 |
IdRef: Sudoc authority data | 1 |
(DCMI Type Vocabulary – not in The Cloud) | 1 |
UK Postcodes | 1 |
AGROVOC | 1 |
Hungarian National Library (NSZL) catalog | 1 |
(none) | 1 |
DBpedia and GeoNames represent well known sources of cross-domain and geographical information to link to. The rest of the linked packages are mainly other cultural heritage packages – especially standard terminologies and authority files.
Serialisation
RDF/XML is used by almost all the packages, but Europeana Linked Open Data mentions only N-Triples.
Serialisation | Number of packages using (%) |
---|---|
RDF/XML | 16 (88.9%) |
N-Triples | 5 (27.8%) |
Turtle | 1 (5.6%) |
(none given) | 1 (5.6%) |
This suggests that cultural heritage linked data should be published at least as RDF/XML, and possibly also as N-Triples, in order to be compatible with existing data.
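To give an idea of what the N-Triples serialisation looks like, the sketch below writes one triple per line in the form `<subject> <predicate> object .`. It is a minimal illustration, not a complete serialiser: the `Literal` helper and the `example.org` identifiers are hypothetical, blank nodes and datatyped literals are not handled, and in practice an established RDF library would normally be used.

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    """Hypothetical helper marking a plain or language-tagged literal."""
    value: str
    lang: Optional[str] = None


def escape_literal(text):
    """Escape the characters that must not appear raw in an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def term(node):
    """Render an IRI (plain string) or a Literal as an N-Triples term."""
    if isinstance(node, Literal):
        suffix = f"@{node.lang}" if node.lang else ""
        return f'"{escape_literal(node.value)}"{suffix}'
    return f"<{node}>"


def to_ntriples(triples):
    """Serialise (subject, predicate, object) tuples, one statement per line."""
    return "".join(f"{term(s)} {term(p)} {term(o)} .\n" for s, p, o in triples)


# Made-up example data
triples = [
    ("http://example.org/object/1",
     "http://purl.org/dc/elements/1.1/title",
     Literal('The "Night Watch"', "en")),
    ("http://example.org/object/1",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://example.org/other/1"),
]
output = to_ntriples(triples)
print(output, end="")
```

Because every statement stands alone on its own line, N-Triples files are easy to split, merge and stream, which may be one reason large datasets are offered in this form.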