

Connecting Digital Humanities with the CLARIN Infrastructure - Canonical Text Service


Canonical Text Service

Developed by researchers with a humanities background, the Canonical Text Service (CTS) protocol reflects the requirements that Digital Humanists place on a text reference system. The core idea is that, similar to the URLs used to find individual websites, Uniform Resource Names (URNs) reference text passages persistently. They roughly follow the hierarchical principle of common citation: first a group of documents is specified, then one document, and then one text passage inside that document. These URNs are then used to retrieve text via a web service, which means that every text passage can be stored persistently as a web link or a browser bookmark. Adding these links as citation references lets readers access the referenced text directly, instead of having to run to the nearest library and hope that it owns a copy of the referenced edition of the referenced book.
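As a rough illustration of how a URN can become such a web link, the sketch below builds a request URL from a CTS URN. The endpoint `https://cts.example.org/api` and the function name `passage_link` are hypothetical; the `request=GetPassage` and `urn` query parameters follow the request style of the CTS specification, but the exact URL layout of a given server is an assumption here.

```python
from urllib.parse import urlencode


def passage_link(endpoint: str, urn: str) -> str:
    """Build a bookmarkable link that asks a CTS server for one passage.

    `endpoint` is a hypothetical base URL of a CTS web service.
    """
    # urlencode percent-encodes the colons in the URN so that it is
    # safe to embed in a query string.
    query = urlencode({"request": "GetPassage", "urn": urn})
    return f"{endpoint}?{query}"


link = passage_link(
    "https://cts.example.org/api",  # hypothetical endpoint
    "urn:cts:pbc:bible.parallel.eng.kingjames:1.3.2",
)
print(link)
```

A link built this way can be stored as a bookmark or used as a citation reference, as long as the server behind the endpoint keeps resolving the URN.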

The protocol was developed by David Neel Smith and Christopher Blackwell in the Homer Multitext project and brought from the United States to Leipzig by Gregory Crane during the ESF project A Library of a Billion Words, a collaboration between the Leipzig University Library, the Natural Language Processing Group and the Visualisation department of Leipzig's Institute for Computer Science, and the newly established Digital Humanities Institute. The goal was to create a workflow for a digital library, and my task in this project was to set up a scalable implementation of CTS.

The exact specifications are available here. The project website for the CTS in Leipzig can be found here.
The following examples illustrate the core mechanics of CTS URNs.

Static CTS URNs reference fixed text parts such as chapters or sentences, for example:
Document ( urn:cts:pbc:bible.parallel.eng.kingjames: )
Verse ( urn:cts:pbc:bible.parallel.eng.kingjames:1.3.2 )
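The hierarchy in the two static URNs above can be made explicit with a small parser. This is a minimal sketch, not the Leipzig implementation: the class `CtsUrn` and the function `parse_static_urn` are hypothetical names, and the sketch only splits the URN on colons and dots without validating it against the full specification.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CtsUrn:
    namespace: str                   # e.g. "pbc"
    work_parts: Tuple[str, ...]      # document hierarchy, most general first
    passage_parts: Tuple[str, ...]   # citation hierarchy, empty for a whole document


def parse_static_urn(urn: str) -> CtsUrn:
    """Split a static CTS URN into namespace, work and passage components."""
    fields = urn.split(":")
    if len(fields) not in (4, 5) or fields[0] != "urn" or fields[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    work_parts = tuple(fields[3].split("."))
    # A trailing colon (document-level URN) yields an empty passage component.
    passage = fields[4] if len(fields) == 5 else ""
    passage_parts = tuple(passage.split(".")) if passage else ()
    return CtsUrn(fields[2], work_parts, passage_parts)


document = parse_static_urn("urn:cts:pbc:bible.parallel.eng.kingjames:")
verse = parse_static_urn("urn:cts:pbc:bible.parallel.eng.kingjames:1.3.2")
print(document.work_parts, document.passage_parts)
print(verse.passage_parts)
```

Keeping the work and passage components as tuples preserves the order of the hierarchy, so a shorter tuple always denotes a larger unit of text.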

Dynamic CTS URNs, which use spans of URNs or sub-passage notation, make it possible to reference any text passage in a document, for example:
Span of URNs ( urn:cts:pbc:bible.parallel.eng:1.2-1.5.6 )
Sub passage notation ( urn:cts:pbc:bible.parallel.eng:1.2@the[2]-1.5.6@five )
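The passage component of the two dynamic URNs above can be taken apart in the same spirit. The following sketch is written under simplifying assumptions: the names `PassagePoint`, `parse_point` and `parse_passage` are hypothetical, a hyphen is assumed to occur only as the span separator, and the text after `@` is assumed to contain no brackets other than the optional occurrence index.

```python
import re
from typing import NamedTuple, Optional, Tuple


class PassagePoint(NamedTuple):
    citation: Tuple[str, ...]   # e.g. ("1", "2") for "1.2"
    sub_text: Optional[str]     # text after "@", if any
    sub_index: Optional[int]    # occurrence index in "[n]", if given


# Sub-passage text followed by an optional "[n]" occurrence index.
_SUBREF = re.compile(r"^(?P<text>[^\[\]]+)(?:\[(?P<index>\d+)\])?$")


def parse_point(point: str) -> PassagePoint:
    """Parse one end of a span, such as "1.2@the[2]"."""
    citation, sep, subref = point.partition("@")
    if not citation:
        raise ValueError(f"missing citation in {point!r}")
    parts = tuple(citation.split("."))
    if not sep:
        return PassagePoint(parts, None, None)
    match = _SUBREF.match(subref)
    if match is None:
        raise ValueError(f"malformed sub-passage in {point!r}")
    index = match.group("index")
    return PassagePoint(parts, match.group("text"), int(index) if index else None)


def parse_passage(urn: str) -> Tuple[PassagePoint, Optional[PassagePoint]]:
    """Return the start and, for spans, the end of a URN's passage component."""
    passage = urn.rsplit(":", 1)[1]
    start, sep, end = passage.partition("-")
    return parse_point(start), (parse_point(end) if sep else None)


span = parse_passage("urn:cts:pbc:bible.parallel.eng:1.2-1.5.6")
sub = parse_passage("urn:cts:pbc:bible.parallel.eng:1.2@the[2]-1.5.6@five")
print(span)
print(sub)
```

In the second example the start point refers to the second occurrence of "the" in passage 1.2, and the end point refers to the word "five" in passage 1.5.6.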

At the current state of implementation, the information from the TEI/XML document is directly translated into the corresponding information in CTS.

What separates CTS from other reference systems for text, besides some technical advantages, is that creating this format for data sets is a research question in Digital Humanities itself. Researchers do not create these files merely to be able to connect to the tools. Instead, they want to create instances of CTS in order to reference text passages online. This is illustrated, for example, by mentions in the works of philologist Stylianos Chronopoulos and Digital Humanist Monica Berti, by humanities projects like Perseus and Croatiae Auctores, and by interesting, philosophical and sometimes heated discussions about various implications of CTS. The fact that the protocol is an end in itself for researchers, combined with its relatively strict technical specification, makes it a prime candidate for a connection between the data sets in Digital Humanities and the tools and services provided by CLARIN.