CENTER FOR
SCALABLE DATA ANALYTICS
AND ARTIFICIAL INTELLIGENCE

Connecting Digital Humanities with the CLARIN Infrastructure - CLARIN

CLARIN

The Common Language Resources and Technology Infrastructure (CLARIN) is a European research infrastructure project that aims to provide a large, interoperable research environment for researchers. It is described in detail, for example, in the German article "Was sind IT-basierte Forschungsinfrastrukturen fuer die Geistes- und Sozialwissenschaften und wie koennen sie genutzt werden?" ("What are IT-based research infrastructures for the humanities and social sciences, and how can they be used?") by Gerhard Heyer, Thomas Eckart and Dirk Goldhahn. In general, CLARIN combines the efforts of various research groups to build an interoperable set of tools, data sets and web-based workflows. This interoperability enables, for example, faceted document search across the servers of different research groups in the Virtual Language Observatory (VLO). In the following example, I use the VLO to narrow 904014 resources on different servers down to one German resource that describes spoken language in XML markup.
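To make the idea of faceted search concrete, here is a minimal Python sketch. It is not the VLO's implementation or API: the records, the facet names (`language`, `modality`, `format`) and the helper functions `facet_filter` and `facet_counts` are all hypothetical and only illustrate how selecting facet values narrows a pooled set of metadata records.

```python
from collections import Counter

# Hypothetical metadata records pooled from several servers.
# Real VLO records carry many more fields; these are illustrative only.
RESOURCES = [
    {"id": "r1", "language": "German", "modality": "spoken", "format": "XML"},
    {"id": "r2", "language": "German", "modality": "written", "format": "XML"},
    {"id": "r3", "language": "English", "modality": "spoken", "format": "XML"},
    {"id": "r4", "language": "German", "modality": "spoken", "format": "PDF"},
]


def facet_filter(resources, **selected):
    """Keep only the resources that match every selected facet value."""
    return [
        r for r in resources
        if all(r.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(resources, facet):
    """Count how many resources fall under each value of one facet."""
    return Counter(r.get(facet) for r in resources)


# Each additional facet narrows the result set further.
hits = facet_filter(RESOURCES, language="German", modality="spoken", format="XML")
print([r["id"] for r in hits])  # ['r1']
print(facet_counts(RESOURCES, "language"))
```

The counts per facet value are what a faceted interface typically shows next to each option, so that the user can see how many resources remain before selecting it.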

Another great example of interoperability is WebLicht. The following description, copied from the project homepage, describes it better than I could: "WebLicht is an execution environment for automatic annotation of text corpora. Linguistic tools such as tokenizers, part of speech taggers, and parsers are encapsulated as web services, which can be combined by the user into custom processing chains. The resulting annotations can then be visualized in an appropriate way, such as in a table or tree format. (...) By making these tools available on the web and by use of a common data format for storing the annotations, WebLicht provides a way to combine them into processing chains."
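The principle behind such processing chains can be sketched in a few lines of Python. This is not WebLicht's code or its data format: the functions `tokenizer`, `toy_tagger` and `run_chain` are hypothetical stand-ins for web services, and a plain dictionary stands in for the common annotation format. The point is only that every step reads and writes the same document structure, which is what allows the steps to be combined freely.

```python
def tokenizer(doc):
    """Add a 'tokens' layer by splitting the text on whitespace."""
    return {**doc, "tokens": doc["text"].split()}


def toy_tagger(doc):
    """Add a 'pos' layer; capitalised tokens are naively tagged as nouns."""
    if "tokens" not in doc:
        raise ValueError("the tagger needs a 'tokens' layer from an earlier step")
    tags = ["NOUN" if token[:1].isupper() else "X" for token in doc["tokens"]]
    return {**doc, "pos": tags}


def run_chain(text, services):
    """Pass one shared document through each service in order."""
    doc = {"text": text}
    for service in services:
        doc = service(doc)
    return doc


result = run_chain("Leipzig hosts a CLARIN centre", [tokenizer, toy_tagger])

# Show the annotations as a simple two-column table.
for token, tag in zip(result["tokens"], result["pos"]):
    print(f"{token}\t{tag}")
```

Because the tagger checks for the layer it depends on, placing it before the tokenizer in the chain fails with a clear error instead of producing silent nonsense.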

The tools, web services and data sets in CLARIN are contributed by various research groups, which can be a complex task depending on the technical knowledge of the researcher. Bridging this potential gap for digitized documents requires some kind of common ground that is, on the one hand, strict and persistent enough to serve as a technical access point for data transfer and, on the other hand, meets the requirements that humanists have for text resources. This common ground, and my main working area, is the Canonical Text Service (CTS) protocol.