Integrating Canonical Text Service Support in the CLARIN Infrastructure

This joint effort of ScaDS and CLARIN Leipzig resulted in a connection between CTS and CLARIN that provides a fine-grained reference system for CLARIN and opens its infrastructure to many Digital Humanists across the world.

The Common Language Resources and Technology Infrastructure (CLARIN) is a European research infrastructure project that aims to provide a large, interoperable research environment for researchers. It is described in detail, for example, in the German article "Was sind IT-basierte Forschungsinfrastrukturen fuer die Geistes- und Sozialwissenschaften und wie koennen sie genutzt werden?" by Gerhard Heyer, Thomas Eckart and Dirk Goldhahn. Generally, CLARIN combines the efforts of various research groups to build an interoperable set of tools, data sets and web-based workflows. This interoperability enables, for example, faceted document search across the servers of different research groups in the Virtual Language Observatory (VLO).

Some of the more philosophical aspects of the Canonical Text Service (CTS) protocol follow from its role as an end in itself in the Digital Humanities, which is discussed, for example, in https://www.scads.de/de/blog/223-connecting-digital-humanities-with-the-clarin-infrastructure-2?showall=&start=5. Its more technical benefits are the flexible granularity of its references, its highly specialized and performant implementation, and its position as an address for outsourcing text content. By adding CTS support to CLARIN, these technical benefits (especially the finer granularity) can directly address some of the issues listed by Thomas Eckart in the paper "Jochen Tiepmar, Thomas Eckart, Dirk Goldhahn und Christoph Kuras: Canonical Text Services in CLARIN - Reaching out to the Digital Classics and beyond. In: CLARIN Annual Conference 2016, 2016":

* Many of the current solutions treat textual resources as atomic, i.e. all provided interfaces are focused on the complete resource. The inherent structure of textual data is left to be processed by external tools or manually extracted by the user. Although this is acceptable for some use cases, a highly integrated research environment loses much of its power and applicability for research questions if it ignores this obvious fact.

* Textual resources do not have a typical granularity. Even for rather similar textual resources (like Web-based corpora or document-centric collections), a "default structure" on which analysis or resource aggregation can take place cannot be assumed. As a consequence, many approaches require and assume a standard format that is the foundation for all provided applications and interfaces.

* Granularity has to be addressed as a basic feature of (almost) all textual resources. Current infrastructures make use of several identification and resolving systems (like Handle, DOI, URNs etc.), but fine-grained identification and retrieval of (almost) arbitrary parts is hardly supported or has to be artificially modelled using features that these systems provide. As a consequence, even textual resources already provided in CLARIN are often not directly accessible or combinable because of the heterogeneity of the reference solutions used or the level of supported granularity.
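
To illustrate what fine-grained referencing means in practice, the following minimal Python sketch splits a CTS URN into its components and lists the coarser references that enclose a given passage. The function and class names (`parse_cts_urn`, `coarser_references`, `CtsUrn`) are hypothetical and are not part of CLARIN or of any CTS implementation. The example URN is only illustrative, and the sketch assumes the common `urn:cts:<namespace>:<work>:<passage>` layout with dot-separated citation levels. Passage ranges and subreferences are deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CtsUrn:
    """Hypothetical container for the components of a CTS URN."""
    namespace: str          # e.g. "greekLit"
    work: str               # dot-separated work identifier
    passage: Optional[str]  # dot-separated citation; None means the whole text


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN at its colons (assumed layout, see lead-in)."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    # A missing or empty fifth component refers to the text as a whole.
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(namespace=parts[2], work=parts[3], passage=passage)


def coarser_references(urn: str) -> List[str]:
    """Return references from the whole text down to the given passage."""
    parsed = parse_cts_urn(urn)
    base = f"urn:cts:{parsed.namespace}:{parsed.work}"
    if parsed.passage is None:
        return [base]
    if "-" in parsed.passage:
        raise ValueError("passage ranges are not handled in this sketch")
    levels = parsed.passage.split(".")
    # Build one reference per citation level, from coarsest to finest.
    finer = [base + ":" + ".".join(levels[:i]) for i in range(1, len(levels) + 1)]
    return [base] + finer


if __name__ == "__main__":
    for ref in coarser_references("urn:cts:greekLit:tlg0012.tlg001.perseus-grc1:1.1"):
        print(ref)
```

Each entry in the returned list addresses the same text at a different granularity, which is the property that allows a single resource to be referenced as a whole or in part without changing the identification scheme.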

The result of this work is that CTS URNs for a configurable type of text part are now included in the Virtual Language Observatory, together with an import interface for CTS instances. After a very well received presentation at the CLARIN Annual Conference 2016 in Aix-en-Provence, several discussions have started about a further inclusion and the various implications of CTS as an (internally used) text communication protocol. These discussions also indicated that, even though this work already provides significant benefits for many researchers, this first step has to be considered just the tip of the iceberg. Future work will include an interface between the Federated Content Search of CLARIN and the full-text search of CTS, as well as a connection to WebLicht via a MIME-type-based interface.