LINKING CULTURAL HERITAGE INFORMATION

2/5 – Analysis of The Cloud by Linking cultural heritage information

Linking cultural heritage information is the name of the second work package (WP2) of Linked Heritage, a 30-month EU project that started on 1 April 2011 and is coordinated by ICCU (Rome), the Istituto Centrale per il Catalogo Unico delle biblioteche italiane.

Some of the objectives set by WP2 are:

  • to explore the state of the art in linked data and its applications and potential;
  • to identify the most appropriate models, processes and technologies for the deployment of cultural heritage information repositories as linked data.

In particular, WP2 examined both the structure of the Linked Data Cloud and the information it carries.

The Cloud

The Cloud is the best known representation of linked data. It shows “packages” of linked data and the links between packages. It is growing very quickly and the most recent diagram (September 2011) counts 331 packages.

The Cloud in September 2011.
The Linking Open Data cloud diagram is maintained by Richard Cyganiak (DERI, NUI Galway) and Anja Jentzsch (HPI).

The cloud is maintained on The Data Hub website, a W3C project which represents a registry of open and not-open knowledge, with information on packages and projects.

THE DATA HUB

The Data Hub website is managed by the Linking Open Data community project, which is part of the W3C’s Semantic Web Education and Outreach Interest Group (SweoIG). It may therefore be considered to represent a significant proportion of the linked data available.

For each package The Data Hub website gives information about:

  • name and description;
  • links to the resources available;
  • intellectual property rights status;
  • which other packages are linked to (including number of links);
  • the number of “triples” in the package (a measure of size);
  • subject information and the formats used.

The WP2 analysis revealed an unexpected picture of The Cloud, one that differs from its guiding principles and plans.

Is The Cloud open?

Given that in The Cloud “open” means able to be reused, enriched and shared (including commercially), the analysis shows that a significant component of The Cloud is not open.

IPR status | % by package
Open | 42.6
Not open | 57.4

One reason for this anomaly may be that the publishers of the first packages in The Cloud did not consider it important to have a licence.

Which IPR licences are used?

Open licences

Of the 132 packages with open licences:

Licence type | % by package
Creative Commons Attribution (CC BY) | 28.8
Creative Commons Attribution Share Alike (CC BY-SA) | 18.2
Open Data Commons Public Domain Dedication and Licence (ODC PDDL) | 10.6
Creative Commons CC Zero (CC0) | 9.1
UK Crown Copyright with data.gov.uk rights | 7.6
Other (Public Domain) | 6.8
Other (Open) | 5.3
Others | 12.9

CC0 is a relatively new option and it is the choice made by Europeana – and secondly by its providers – for its publication of linked open data. It is the most permissive of the open licences, with attribution being a recommendation rather than mandatory.

Not open licences

Of the 178 packages with licences that are not open, or with no licence information:

Licence type | % by package
Not given | 69.1
None | 14.6
Creative Commons Attribution Non-commercial (CC BY-NC) | 7.3
Other (Not Open) | 6.7
Creative Commons Attribution (CC BY) | 1.1
Other (Non-Commercial) | 0.6
Creative Commons Attribution Share Alike (CC BY-SA) | 0.6

For over 80% of the packages in this part of The Cloud there is no information about IPR: a large part of published linked data does not seem to have a licence for its use. As a result, it is unclear what can be done with this data.
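The percentages in this section can be cross-checked against the package counts quoted above. The following is a minimal sketch, assuming that the 132 open and 178 not-open packages together make up the whole set that WP2 analysed; the variable names are illustrative only.

```python
# Package counts quoted in this section.
open_packages = 132
not_open_packages = 178
total = open_packages + not_open_packages  # assumed to be the full analysed set

# Share of open and not-open packages, rounded as in the IPR status table.
open_share = round(100 * open_packages / total, 1)
not_open_share = round(100 * not_open_packages / total, 1)

# Rows of the not-open table that carry no licence information.
not_given_pct = 69.1
none_pct = 14.6
no_info_pct = round(not_given_pct + none_pct, 1)

# Approximate number of packages behind that percentage.
no_info_packages = round(not_open_packages * no_info_pct / 100)

print(open_share, not_open_share)
print(no_info_pct, no_info_packages)
```

Under that assumption the counts reproduce the 42.6% and 57.4% split, and the two "no information" rows add up to 83.7% of the not-open group, or about 149 packages.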

How big is The Cloud?

There are about 38 billion triples in The Cloud, and package sizes vary widely: 9 packages (2.89%) have over a billion triples. Nearly a quarter of the packages are relatively small.

The ten largest packages with open licences are:

Package | Number of triples
LinkedGeoData | 3.00 billion
UK Legislation | 1.90 billion
Linked Sensor Data (Kno.e.sis) | 1.73 billion
data.gov.uk Time Intervals | 1.00 billion
DBpedia | 1.00 billion
Open Library data mirror in the Talis Platform | 0.54 billion
The Open Library | 0.40 billion
Freebase | 0.34 billion
transport.data.gov.uk | 0.33 billion
Data Incubator: MusicBrainz | 0.18 billion


The ten largest packages without open licences are:

Package | Number of triples
TWC: Linking Open Government Data | 9.80 billion
Data.gov | 6.40 billion
Source Code Ecosystem Linked Data | 1.50 billion
2000 U.S. Census in RDF (rdfabout.com) | 1.00 billion
PubMed | 0.80 billion
DBTune.org MySpace RDF Service | 0.66 billion
UniParc | 0.63 billion
DBTune.org AudioScrobbler RDF Service | 0.60 billion
Linking Italian University Statistics Project | 0.59 billion
UniProt UniRef | 0.49 billion

TWC: Linking Open Government Data is the largest package in The Cloud and is an aggregation of US government data.
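To see how concentrated The Cloud is, the two top-ten lists can be added up and compared with the overall total. This is a rough sketch only: it uses the rounded figures from the tables and the approximate total of 38 billion triples, so the result is indicative rather than exact.

```python
# Triple counts in billions, copied from the two top-ten tables.
open_top_ten = [3.00, 1.90, 1.73, 1.00, 1.00, 0.54, 0.40, 0.34, 0.33, 0.18]
not_open_top_ten = [9.80, 6.40, 1.50, 1.00, 0.80, 0.66, 0.63, 0.60, 0.59, 0.49]
cloud_total = 38.0  # approximate total quoted in the text

open_sum = round(sum(open_top_ten), 2)
not_open_sum = round(sum(not_open_top_ten), 2)

# Share of The Cloud held by these twenty packages.
combined_share = round(100 * (open_sum + not_open_sum) / cloud_total, 1)

# Packages listed at 1.00 billion triples or more (as rounded in the tables).
billion_plus = sum(1 for n in open_top_ten + not_open_top_ten if n >= 1.0)

print(open_sum, not_open_sum, combined_share, billion_plus)
```

On these rounded figures the twenty listed packages hold about 33 billion triples, or roughly 87% of The Cloud, and nine of them are listed at a billion triples or more, which matches the count given above.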

What are the subjects of the data?

There does not seem to be a controlled terminology for The Cloud, with the same subject represented by different tags in different packages.

The WP2 analysis combined a number of tags that appear to denote the same subject. The ten most common subjects are:

Subject tag | Number of packages with tag
publications | 94
government | 54
life sciences | 46
geographic | 40
media | 32
library | 22
United Kingdom | 22
education | 20
user generated content | 19
bibliographic | 15
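Combining tags that denote the same subject, as described above, could be sketched as follows. The source does not say how WP2 actually merged the tags, so the mapping table, the package names and the raw tags below are all hypothetical and serve only to illustrate the idea.

```python
from collections import Counter

# Hypothetical mapping from raw tag variants to one canonical subject.
CANONICAL = {
    "publication": "publications",
    "publications": "publications",
    "gov": "government",
    "government": "government",
    "lifesciences": "life sciences",
    "life-sciences": "life sciences",
}


def normalise(tag):
    """Lower-case and trim a raw tag, then map it to its canonical subject."""
    key = tag.strip().lower()
    return CANONICAL.get(key, key)


def count_subjects(packages):
    """Count packages per canonical subject, counting each package once per subject."""
    counts = Counter()
    for tags in packages.values():
        counts.update({normalise(t) for t in tags})
    return counts


# Hypothetical packages with inconsistent tagging.
packages = {
    "pkg-a": ["Publications", "gov"],
    "pkg-b": ["publication", "publications"],
    "pkg-c": ["life-sciences", "Government"],
}
subject_counts = count_subjects(packages)
print(subject_counts.most_common())
```

Using a set per package means that a package tagged with two variants of the same subject is counted only once for it, which is what a "number of packages with tag" column requires.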

There is very little cultural heritage data. This is probably because, until the advent of Europeana, there has been no interest in linked data in this community. The appearance of United Kingdom as a tag shows mainly the effect of the UK Government’s policy of publishing linked data. The role of the USA is not apparent, but this is because packages are not tagged United States.

Which formats are used to encode data?

The most commonly used formats are:

Format | Number of packages using the format
Resource Description Framework (rdf) | 261
Dublin Core (dc) | 97
Friend of a Friend (foaf) | 84
Simple Knowledge Organization System (skos) | 57
RDF Schema (rdfs) | 42
Web Ontology Language (owl) | 34
Basic Geo (geo) | 25
Advanced Knowledge Technologies Reference Ontology (akt) | 22
eXtensible HyperText Markup Language (xhtml) | 19
Bibliographic Ontology (bibo) | 14
none given | 13
Music Ontology (mo) | 13
DBpedia Ontology (dbpedia) | 12
Others | 52

The AKT Ontology, the DBpedia Ontology and the GeoNames Ontology were each developed in the context of publishing a single package as linked data. The adoption of formats of this type by other packages suggests that these de facto standards are playing a significant role in The Cloud.

Given that Berners-Lee recommends using standard formats, it is surprising to find that 75 formats are used by two or fewer packages. For the sake of interoperability, one may hope for the survival of the fittest!

How is The Cloud linked?

The most important aspect of The Cloud is how the packages are linked together. The most commonly linked-to packages, ranked by the number of packages linking to them, are:

Package being linked to | Number of packages linking
DBpedia | 158
GeoNames Semantic Web | 42
(none) | 34
DBLP Computer Science Bibliography (RKBExplorer) | 27
Association for Computing Machinery (ACM) (RKBExplorer) | 26
ePrints3 Institutional Archive Collection (RKBExplorer) | 26
Freebase | 25
Others | 72

The success of DBpedia and GeoNames is probably due to their being well-known. But the most interesting thing is that over 10% of the packages in The Cloud do not link to other packages – included in this group are some of the largest packages, e.g. Data.gov, 2000 U.S. Census. This shows that the linking of packages is not something that is growing in an “organic” way.

Some initiatives are responsible for creating large parts of The Cloud. Such an initiative would be welcome in the cultural heritage sector too, where Europeana is currently taking a leading role.

Cultural Heritage data in The Cloud

The world of networked information is very interested in the legacy data produced by libraries, archives and museums, as these institutions are traditionally known for their key role in producing quality information.

Unfortunately, only 18 packages in The Cloud could be identified as having “cultural heritage” as their subject or as being related to it:

Package | Number of triples
VIAF: The Virtual International Authority File | 200,000,000
Europeana Linked Open Data | 185,000,000
British National Bibliography (BNB) | 80,249,538
Hungarian National Library (NSZL) catalog | 19,300,000
Amsterdam Museum as Linked Open Data in the Europeana Data Model | 5,000,000
Library of Congress Subject Headings | 4,151,586
Swedish Open Cultural Heritage | 3,400,000
Calames | 2,000,000
RAMEAU subject headings (STITCH) | 1,619,918
data.bnf.fr - Bibliothèque nationale de France | 1,400,000
National Diet Library of Japan subject headings | 1,294,669
Gemeenschappelijke Thesaurus Audiovisuele Archieven – Common Thesaurus Audiovisual Archives | 992,797
Gemeinsame Normdatei (GND) | 629,582
Archives Hub Linked Data | 431,088
Thesaurus for Graphic Materials (t4gm.info) | 103,000
Italian Museums (LinkedOpenData.it) | 49,897
Thesaurus W for Local Archives | 11,000
MARC Codes List Open Data | 8,816

The cultural heritage part of The Cloud is still rather small (about 500 million triples, or less than 1.5%), but planned developments from Europeana should increase its size significantly. The Linked Heritage project will be a significant component of it.
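The size estimate for the cultural heritage part of The Cloud can be reproduced from the table of 18 packages. The sketch below simply adds the listed triple counts and compares them with the approximate total of 38 billion triples quoted earlier; that total is itself rounded, so the share is approximate.

```python
# Triple counts of the 18 cultural heritage packages, in table order.
heritage_triples = [
    200_000_000, 185_000_000, 80_249_538, 19_300_000, 5_000_000,
    4_151_586, 3_400_000, 2_000_000, 1_619_918, 1_400_000,
    1_294_669, 992_797, 629_582, 431_088, 103_000,
    49_897, 11_000, 8_816,
]
cloud_total = 38_000_000_000  # approximate size of The Cloud quoted in the text

heritage_total = sum(heritage_triples)
heritage_share = 100 * heritage_total / cloud_total

print(heritage_total, round(heritage_share, 2))
```

The listed packages add up to just over 505 million triples, or about 1.3% of The Cloud, in line with the figures given above. The two largest packages, VIAF and Europeana Linked Open Data, account for most of that total.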

Format

Cultural heritage packages use the following formats:

Format | Number of packages using the format
Resource Description Framework | 13
Simple Knowledge Organization System | 11
Dublin Core | 7
eXtensible HyperText Markup Language | 4
Friend of a Friend | 3
Basic Geo | 1
Bibliographic Ontology | 1
DBpedia | 1
Music Ontology | 1
Object Reuse and Exchange | 1
RDF Schema | 1
vCard | 1
Web Ontology Language | 1
XML Schema | 1

The general picture is similar to The Cloud as a whole, except that the use of SKOS is much more significant, indicating the importance of terminological resources and authority files in the sector. Of note is the absence of a format specifically for museum information. The Europeana Data Model is also not mentioned in The Data Hub, although other sources indicate that some packages use it.

Cultural heritage packages in The Cloud link to the following targets:

Package being linked to | Number of packages linking
DBpedia | 5
Library of Congress Subject Headings | 4
VIAF: The Virtual International Authority File | 2
GeoNames Semantic Web | 2
Dewey Decimal Classification (DDC) | 2
RAMEAU subject headings (STITCH) | 2
Swedish Open Cultural Heritage | 1
Gemeinsame Normdatei (GND) | 1
IdRef: Sudoc authority data | 1
(DCMI Type Vocabulary – not in The Cloud) | 1
UK Postcodes | 1
AGROVOC | 1
Hungarian National Library (NSZL) catalog | 1
(none) | 1

DBpedia and GeoNames are well-known sources of cross-domain and geographical information to link to. The other linked packages are mainly cultural heritage packages themselves, especially standard terminologies and authority files.

Serialisation

RDF/XML is used by almost all the packages, but Europeana Linked Open Data mentions only N-Triples.

Serialisation | Number of packages using (%)
RDF/XML | 16 (88.9%)
N-Triples | 5 (27.8%)
Turtle | 1 (5.6%)
(none given) | 1 (5.6%)

This suggests that cultural heritage linked data should be published at least as RDF/XML, and possibly also as N-Triples, in order to be compatible with existing data.
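For readers unfamiliar with N-Triples, the sketch below shows what the serialisation looks like: one triple per line, with URIs in angle brackets and literals in double quotes. It is a simplified illustration written in plain Python rather than with an RDF library. The example.org URIs and the record itself are hypothetical, only a few literal escapes are handled, and blank nodes and datatyped literals are left out.

```python
def escape_literal(text):
    """Escape backslashes, double quotes and line breaks for an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def to_ntriple(subject, predicate, obj, is_literal=False, lang=None):
    """Format one triple as a single N-Triples line (URIs and plain literals only)."""
    if is_literal:
        obj_term = '"' + escape_literal(obj) + '"'
        if lang:
            obj_term += "@" + lang
    else:
        obj_term = "<" + obj + ">"
    return "<{}> <{}> {} .".format(subject, predicate, obj_term)


# Hypothetical museum record; the example.org URI is an assumption for illustration.
record = "http://example.org/object/1"
lines = [
    # A Dublin Core title as a language-tagged literal.
    to_ntriple(record, "http://purl.org/dc/elements/1.1/title",
               'Portrait of a "Lady"', is_literal=True, lang="en"),
    # A link from the local record to a DBpedia resource.
    to_ntriple(record, "http://purl.org/dc/elements/1.1/coverage",
               "http://dbpedia.org/resource/Rome"),
]
print("\n".join(lines))
```

Each printed line is one triple, so the line count of such a file gives the "number of triples" measure used throughout this analysis. In practice an RDF toolkit would normally be used to produce both RDF/XML and N-Triples from the same data.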