TERMINOLOGY

1/4 – Terminology and terminology management

In this context we use the word “terminology” to refer to the resources that museums use to describe their collections. The word can be ambiguous: strictly speaking, “terminology” is the discipline that studies terms and their use within a specific domain, but it can also refer to the resource that results from this work. Nevertheless, “terminology” remains the clearest generic word for the different types of resources that exist.

Content sections:

Introduction

In this learning object you will often read the word “terminology”. Generally speaking, terminology refers to the study of words and how they are used. In the context of this learning object, we consider the term “terminology” as a general concept for different types of controlled vocabularies, such as thesauri, classifications, flat term lists etc. These controlled vocabularies are used by organizations to describe collections or to make them accessible in a local database or online catalogue.

Imagine a museum that wants to make an inventory of the 5000 paintings exhibited in its collections. First of all, museum staff will draw up a list of questions that describe each painting: Who painted it? When was it painted? Which material was used? Where was it painted? Which art historical period does it represent?

In databases these questions are conceptualized as metadata: “who painted it?” is the author, “when was it painted?” is the date, “which material was used?” is the material, “where was it painted?” is the production place, and “which art historical period does it represent?” is the period. The data, on the other hand, are the answers: the author is Leonardo da Vinci, the date is the 16th century, the material is oil on canvas, the production place is Italy and the art historical period is the Renaissance. After doing this exercise for every painting, the museum will have at its disposal a large list of authors, materials, techniques, geographic places etc.
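
As a rough illustration, the distinction between metadata and data can be pictured as field names and field values in a record. The sketch below is in Python; the field names, the helper collect_values and the second record are assumptions made for this example, not a prescribed schema.

```python
# Metadata are the field names (keys); data are the answers (values).
# Field names are illustrative only.
record = {
    "author": "Leonardo da Vinci",
    "date": "16th century",
    "material": "oil on canvas",
    "production_place": "Italy",
    "period": "Renaissance",
}

# A second, invented record so that the lists below contain several values.
other_record = {
    "author": "Unknown artist",
    "date": "15th century",
    "material": "tempera on panel",
    "production_place": "Flanders",
    "period": "Renaissance",
}

inventory = [record, other_record]


def collect_values(records, field_name):
    """Return the sorted, distinct values found in one metadata field."""
    return sorted({r[field_name] for r in records if field_name in r})


authors = collect_values(inventory, "author")
places = collect_values(inventory, "production_place")
```

Repeating collect_values for each field yields the lists of authors, materials and places that a museum could later organise into a controlled vocabulary.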

Imagine you want to learn more about paintings of the Renaissance and you have access to an online database. You will type the keywords “painting” and “Renaissance”. When the data resulting from the inventory of paintings are structured in a thesaurus, the result will give you all the records containing information on Renaissance painting. If you are only interested in paintings from Italy, you can type “Italy” in the metadata field “production place” (or select it from the thesaurus), and the results will be filtered again. If the thesaurus contains equivalence relations, you will also be referred to books or articles on Italian Renaissance paintings.

An even richer search result can be obtained when terminologies from different organizations are linked to one another. In the literature this is called mapping: terms in one thesaurus are linked to terms with the same meaning in another thesaurus. If the terminologies of a museum in Paris and a museum in Hong Kong are mapped, you will find all relevant information from both institutions, irrespective of language or form.
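
A minimal sketch of such a mapping, assuming each institution identifies its terms with local codes. The codes, the labels and the function equivalent_labels are invented for illustration and do not come from any real thesaurus.

```python
# Hypothetical term identifiers and labels from two institutions.
paris_terms = {"P12": "peinture", "P40": "Renaissance"}
hong_kong_terms = {"HK7": "繪畫", "HK9": "文藝復興"}

# The mapping links a Paris term to the Hong Kong term with the same meaning.
mapping = {"P12": "HK7", "P40": "HK9"}


def equivalent_labels(term_id):
    """Return the Paris label followed by the mapped Hong Kong label, if any."""
    labels = [paris_terms[term_id]]
    target = mapping.get(term_id)
    if target is not None:
        labels.append(hong_kong_terms[target])
    return labels


painting_labels = equivalent_labels("P12")
```

A search for any label in painting_labels could then retrieve records from both institutions, whatever the language of the query.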

Over the last decade, terminologies have started to play an important role in the semantic web. The semantic web aims to be an intelligent web. Imagine this time that you are looking for information on the Mona Lisa. When you type “Mona Lisa” into a search engine, your query results will be websites containing the lexical string “mona lisa”. When the search engine makes use of thesauri, on the other hand, you will find information on the Mona Lisa, but also on La Joconde, the French title of the same painting, and La Gioconda, its Italian title. All this information is stored in terminologies and can be shared and reused in the semantic web.
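
The difference between plain string matching and a thesaurus-aware search can be sketched as follows. The three titles come from the text above; the table layout, the function names and the sample documents are assumptions for this example.

```python
# Each inner list groups labels that name the same thing.
EQUIVALENT_LABELS = [
    ["Mona Lisa", "La Joconde", "La Gioconda"],
]


def expand_query(query):
    """Return every label equivalent to the query, or the query alone."""
    key = query.strip().lower()
    for labels in EQUIVALENT_LABELS:
        if key in (label.lower() for label in labels):
            return list(labels)
    return [query]


def search(documents, query):
    """Return documents containing any equivalent label, ignoring case."""
    labels = [label.lower() for label in expand_query(query)]
    return [
        doc for doc in documents
        if any(label in doc.lower() for label in labels)
    ]


documents = [
    "The Mona Lisa hangs in the Louvre.",
    "La Joconde est exposée au Louvre.",
    "A study of Flemish still lifes.",
]
hits = search(documents, "La Gioconda")
```

Here a query for the Italian title returns the English and French documents as well, which a purely lexical match on “la gioconda” would miss.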

These examples demonstrate the importance of terminologies in information systems. However, before we can make use of our local terminologies and share them in the semantic web, we need to meet some basic requirements: use controlled vocabularies, publish in SKOS/RDF, map to terminologies from other organizations etc. This WP3 learning object will provide you with guidelines and an open-source tool for thesaurus management and terminology publication, so you can optimize the visibility and accessibility of your data on the web.

Terminology?

The type of resource is closely tied to its purpose: in other words, an information retrieval tool and a knowledge management tool will not use the same kind of resources. In cultural institutions, terminology resources are mainly used for indexing and information retrieval purposes.

With this in mind, we have identified five main types of resources, organised according to their level of complexity:

Simple list of terms

The simple list of terms can be likened to a controlled vocabulary. A controlled vocabulary is a list of terms that have been explicitly enumerated. This list is controlled by, and available from, a controlled vocabulary registration authority. All terms in a controlled vocabulary should have an unambiguous, non-redundant definition. In practice, however, a simple list of terms generally consists of an alphabetical list of terms of a specific domain, without definitions or relations between terms. It can also be a list of named entities such as authors’ or persons’ names or location names. It represents the “minimalist” type of resource.

Glossary

A glossary is an alphabetical list of terms of a specific domain in which each term has a definition or an explanation. Despite some common features, a glossary is not a dictionary or a lexicon. It often concerns a very specific or technical domain and is generally aimed at non-experts, giving definitions of very technical terms in a simplified way. A glossary can be multilingual.

Classification

Classifications originate in library science and are mainly used for cataloguing: a classification is a system for coding and organizing knowledge. Classification is one of the tools used to facilitate subject access to collections. Thesauri and subject heading systems are another such tool. The main difference between the two is that a classification does not allow an object to be assigned to several classes, whereas a thesaurus allows several terms to be assigned to one object.

The Dewey Decimal Classification (DDC) and the Universal Decimal Classification (UDC) are the best-known classification systems in the world of information science and documentation. DDC is more likely to be used as a system for locating resources, while UDC, which is more expressive than DDC, especially in the relations between subjects, is preferred for subject browsing. Classification schemes may be either special, i.e. limited to a specific subject, or general, i.e. aiming to cover all subjects equally ('the universe of information').

Taxonomy

A taxonomy is very close to a classification, since it is also a system of coding and classification. Originally used to designate classifications in the natural sciences, the word “taxonomy” now refers to a form of classification scheme. In other words, a taxonomy can be likened to a controlled vocabulary organized into a hierarchical structure, in which the terms are connected through parent-child relationships. As classification and taxonomy are very similar, these two types of resources have been brought together for the needs of this report.

Thesaurus

A thesaurus can be defined as “a networked collection of controlled vocabulary terms”. Thesauri connect terms via several types of relationships: hierarchical, associative, equivalence or definition. This means that a thesaurus uses associative relationships in addition to parent-child relationships. A parent-child relationship is expressed by a Broader Term (BT) / Narrower Term (NT) pair. Associative relationships such as “Related Term” (RT) (i.e. term A is related to term B) express relationships that are neither hierarchical nor equivalent. Equivalence is expressed by USE (pointing to the preferred term) / Used For (UF) (pointing to the non-preferred term). Additional information such as a definition or a remark can be included in a Scope Note (SN). The equivalence relationship is especially useful within multilingual thesauri. Thesauri contain two different types of terms: descriptors and non-descriptors. Descriptors are the terms used for indexing. Non-descriptors are all the terms connected to the descriptors through the relationships mentioned above; they are not used for indexing.
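
The relationships described above can be pictured as a small data structure. This is only a sketch: the class Descriptor, its attribute names, the sample terms and the scope note are invented for the example and do not follow any particular thesaurus software.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Descriptor:
    """A preferred term together with its thesaurus relationships."""
    label: str
    broader: List[str] = field(default_factory=list)   # BT
    narrower: List[str] = field(default_factory=list)  # NT
    related: List[str] = field(default_factory=list)   # RT
    used_for: List[str] = field(default_factory=list)  # UF (non-descriptors)
    scope_note: str = ""                                # SN


# Illustrative content only.
thesaurus: Dict[str, Descriptor] = {
    "painting": Descriptor(
        label="painting",
        broader=["visual work"],
        narrower=["oil painting"],
        related=["painter"],
        used_for=["picture"],
        scope_note="Works made by applying paint to a surface.",
    ),
    "oil painting": Descriptor(label="oil painting", broader=["painting"]),
}


def preferred_term(term: str) -> Optional[str]:
    """Follow USE: return the descriptor for a term, or None if unknown."""
    for descriptor in thesaurus.values():
        if term == descriptor.label or term in descriptor.used_for:
            return descriptor.label
    return None
```

With this structure, preferred_term("picture") resolves the non-descriptor “picture” to the descriptor “painting”, which is the term an indexer would actually assign.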

A thesaurus can be either monohierarchical or polyhierarchical: in a monohierarchical thesaurus, a descriptor can be connected to only one broader descriptor, whereas in a polyhierarchical thesaurus a descriptor can have several broader descriptors as parents. This horizontal level of relationship is the main difference between a thesaurus and a taxonomy.
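
One way to check which of the two cases applies is to count the broader descriptors of each term. The function name and the sample hierarchies below are assumptions made for illustration.

```python
def is_monohierarchical(broader_terms):
    """Return True when no descriptor has more than one broader descriptor."""
    return all(len(parents) <= 1 for parents in broader_terms.values())


# Each key is a descriptor; each value lists its broader descriptors.
mono_example = {"oil painting": ["painting"], "painting": []}
poly_example = {"fresco": ["painting", "wall decoration"], "painting": []}

mono_result = is_monohierarchical(mono_example)
poly_result = is_monohierarchical(poly_example)
```

In the second hierarchy “fresco” has two parents, so the check reports that the structure is polyhierarchical.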

A dedicated standard, ISO 25964-1, Thesauri for information retrieval, was published in 2011 in order to address the evolution of thesauri alongside that of semantic technologies, notably the SKOS format.

Ontology

An ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts. Ontologies are the main kind of resource used for knowledge representation in the Semantic Web and in knowledge management. The concepts are linked together by hierarchical relationships on the one hand and semantic relationships on the other.

Semantic Web, Linked Data and You

The Semantic Web (part of Web 3.0) is “the Web of data with meaning in the sense that a computer program can learn enough about what the data means to process it”1. It provides “a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is a collaborative effort led by World Wide Web Consortium (W3C) with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (RDF), which integrates a variety of applications using XML for syntax and URIs for naming. It was proposed by World Wide Web inventor Tim Berners-Lee”.

As we can read on Wikipedia:

“Semantic Web is a term coined by World Wide Web Consortium (W3C) director Tim Berners-Lee. It describes methods and technologies to allow machines to understand the meaning - or "semantics" - of information on the World Wide Web.”

The availability of machine-readable metadata would enable automated agents and other software to access the Web more intelligently. The agents would be able to perform tasks automatically and locate related information on behalf of the user. While the term “Semantic Web” is not formally defined it is mainly used to describe the model and technologies proposed by the W3C. These technologies include the Resource Description Framework (RDF), a variety of data interchange formats (e.g. RDF/XML, N3, Turtle, N-Triples), and notations such as RDF Schema (RDFS) and the Web Ontology Language (OWL), all of which are intended to provide a formal description of concepts, terms, and relationships within a given knowledge domain.
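
To make the idea of machine-readable metadata more concrete, the sketch below writes three RDF statements about a painting in the N-Triples format using plain Python strings. The http://example.org/ URIs are placeholders and the helper names are invented; the title and creator properties are taken from the Dublin Core element set.

```python
DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core elements namespace
EX = "http://example.org/"               # placeholder namespace (assumption)


def uri(value):
    """Wrap an absolute URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text, lang=None):
    """Quote a string literal, optionally with a language tag."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}' if lang else f'"{escaped}"'


painting = uri(EX + "painting/mona-lisa")

# Each triple is (subject, predicate, object).
triples = [
    (painting, uri(DC + "title"), literal("Mona Lisa", "en")),
    (painting, uri(DC + "title"), literal("La Joconde", "fr")),
    (painting, uri(DC + "creator"), uri(EX + "person/leonardo-da-vinci")),
]


def to_ntriples(statements):
    """Serialise triples one per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in statements)


ntriples = to_ntriples(triples)
```

Because every statement names its subject and property with a URI, a program reading ntriples can tell that both titles describe the same resource without interpreting free text.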

The Semantic Web is therefore an evolution of the Web which implies that we, as users of the Web, change the way we make our data and documents available, so that they are accessible to machines as well as to humans.

Linked Data is a practice of the Semantic Web: once data are made available online, links between these data are necessary for them to make sense.

As a first definition we can say:

“In Semantic Web terminology, Linked Data is the term used to describe a method of exposing and connecting data on the Web from different sources. Currently, the Web uses hypertext links that allow people to move from one document to another. The idea behind Linked Data is that hyperdata links will let people or machines find related data on the Web that was not previously linked. The main point is that the focus is more about data and how to create and maintain links between these data than documents and links between documents.”

Here is a more “official” definition from Tim Berners-Lee:

“The Semantic Web isn't just about putting data on the web. It is about making links, so that a person or machine can explore the web of data. With linked data, when you have some of it, you can find other, related, data.
Like the web of hypertext, the Web of data is constructed with documents on the Web. However, unlike the Web of hypertext, where links are relationships anchors in hypertext documents written in HTML, for data they links between arbitrary things described by RDF,. The URIs identify any kind of object or concept. But for HTML or RDF, the same expectations apply to make the Web grow:
  1. Use URIs to identify things (anything, concrete or abstract things, not just documents)
  2. Use HTTP URIs so that people can look up those things.
  3. Provide useful information using standards (RDF*, SPARQL) when someone looks up a URI
  4. Include links to other URIs (RDF links generally) to enable the discovery of related information.”

Digitisation is a long and expensive process whose final aim is to make the digital cultural heritage of your institution available online. Using Semantic Web and Linked Data technologies to put your data online helps you make the most of your digitised content and optimise its visibility on the Web.
Thesauri and other terminology resources are mainly used for indexing and organising the collections of an institution. Semantic Web technologies make it possible to link several different thesauri and institutions, so that users can expand their search through federated searching of multiple controlled vocabularies and linked data sources.

The semantic enrichment provided by cultural institutions will facilitate multilingual information access and retrieval, thanks to a semantically rich visualisation of thesauri and of the links between them.

How to join the semantic web: guidelines

WP3 of Linked Heritage has produced a booklet with a set of recommendations and guidelines for bringing your terminology resources to the Semantic Web.

This chapter is a step-by-step guide to publishing your terminology as part of the Semantic Web.

Content sections:

STEP 1: Conceive your terminology

Building your terminology is the foundation for everything that follows. It determines the operations you will carry out later, when you make your terminology interoperable with other resources and when you link it to a network of terminologies.

The terminology type that we consider “ideal” is a domain-specific, multilingual and user-oriented thesaurus. The closer your terminology is to this ideal form, the better your semantic descriptions can be exploited on Europeana.

Define your collection domain(s) http://www.athenaeurope.org/athenawiki/index.php/A1

Identify your users’ expectations (about your semantic descriptions) http://www.athenaeurope.org/athenawiki/index.php/A2

Define your connection with the datamodel http://www.athenaeurope.org/athenawiki/index.php/A3

Choose the terms for the semantic description of your digital resources http://www.athenaeurope.org/athenawiki/index.php/A4

Organise your terms into a thesaurus structure http://www.athenaeurope.org/athenawiki/index.php/A5

Find equivalent terms in other languages http://www.athenaeurope.org/athenawiki/index.php/A6

Implement your thesaurus http://www.athenaeurope.org/athenawiki/index.php/A7

STEP 2: Make it interoperable

Evaluate how far SKOS is compatible with the features of your terminology http://www.athenaeurope.org/athenawiki/index.php/B1

Roughly skosify your terminology http://www.athenaeurope.org/athenawiki/index.php/B2

Define with precision the labels expressing concepts http://www.athenaeurope.org/athenawiki/index.php/B3

Identify your concepts and validate the structure http://www.athenaeurope.org/athenawiki/index.php/B4

Ensure the documentation of concepts http://www.athenaeurope.org/athenawiki/index.php/B5

Map your concepts http://www.athenaeurope.org/athenawiki/index.php/B6

Map your (multilingual) terms http://www.athenaeurope.org/athenawiki/index.php/B7

Validate your skosification http://www.athenaeurope.org/athenawiki/index.php/B8
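
As a rough sketch of what the skosification in Step 2 produces, the Python below writes one thesaurus concept as SKOS in Turtle syntax. It is not the tool or the procedure recommended by the guidelines: the base URI, the function concept_to_turtle, the sample labels and the mapping target are all assumptions. Only the SKOS property names (prefLabel, altLabel, broader, exactMatch) come from the SKOS vocabulary itself.

```python
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."
BASE = "http://example.org/thesaurus/"  # placeholder namespace (assumption)


def quote(text):
    """Return text as a double-quoted Turtle string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def concept_to_turtle(concept_id, pref_labels, alt_labels=(),
                      broader=(), exact_match=()):
    """Serialise one concept as a Turtle block using SKOS properties.

    pref_labels maps a language code to the preferred label (descriptor);
    alt_labels is a sequence of (language, label) pairs (non-descriptors);
    broader lists local concept ids; exact_match lists full URIs of
    equivalent concepts in other vocabularies.
    """
    statements = ["a skos:Concept"]
    statements += [f"skos:prefLabel {quote(text)}@{lang}"
                   for lang, text in pref_labels.items()]
    statements += [f"skos:altLabel {quote(text)}@{lang}"
                   for lang, text in alt_labels]
    statements += [f"skos:broader <{BASE}{parent}>" for parent in broader]
    statements += [f"skos:exactMatch <{target}>" for target in exact_match]
    return f"<{BASE}{concept_id}> " + " ;\n    ".join(statements) + " ."


turtle = concept_to_turtle(
    "painting",
    {"en": "painting", "fr": "peinture"},
    alt_labels=[("en", "picture")],
    broader=["visual-work"],
    exact_match=["http://example.net/other-thesaurus/peinture"],
)
document = SKOS_PREFIX + "\n\n" + turtle
```

The preferred labels correspond to the multilingual descriptors, the alternative label to a non-descriptor, and the exactMatch statement to a mapping towards a concept in another institution's terminology.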

STEP 3: Link to a network

Define the metadata of your terminology http://www.athenaeurope.org/athenawiki/index.php/C1

Identification of resources for mapping http://www.athenaeurope.org/athenawiki/index.php/C2

Mapping with other resources http://www.athenaeurope.org/athenawiki/index.php/C3

Validation of the interoperability http://www.athenaeurope.org/athenawiki/index.php/C4