CENTER FOR
SCALABLE DATA ANALYTICS
AND ARTIFICIAL INTELLIGENCE

Connecting Digital Humanities with the CLARIN Infrastructure - Technical Variety

Technical Variety

After addressing the problem of file types by limiting the data to files that can be opened with plain text editors, and the problem of heterogeneous file formats by limiting the data to TEI/XML, another, more technical layer of heterogeneity becomes prominent: tools, workflows and data availability. Text data that is available online is generally offered on project-specific websites, including GitHub repositories, JavaScript applications, data dumps or HTML pages. Whenever researchers want to access text data that is advertised as publicly available, they first have to find out whether and how the data can be accessed. Compiling a text corpus may even require implementing a dedicated crawler for a given project.

Tools and workflows are mostly developed to suit a project-specific data set, which in many, if not most, cases results in individual tools that cannot be applied to other projects.

An example of technical variety in the Digital Humanities is provided by two projects developed by Monica Berti at the University of Leipzig. Project-specific technical solutions have been italicized. From a technical point of view, these parts are probably exchangeable: an unrelated project might deliver the same information as a different kind of data.

1) The first one is the "Digital Fragmenta Historicorum Graecorum" (DFHG), the digital version of a print collection of quotations and text reuses of ancient Greek authors: http://www.dfhg-project.org . The digital version was produced starting from the OCR output of the print edition. Using a combination of manual work and shell scripts, an SQL database was created to deliver web services and tools. The raw data files are inserted into the SQL database and enriched with information that supports searches and citation extraction (based on CTS and CITE URNs). Ajax web pages are generated automatically to increase the usability of the DFHG data. The project exports the data stored in the DFHG database in different formats: CSV files and EpiDoc-compliant XML files. The files are available on GitHub. An API can be queried with DFHG author names and text reuse numbers; the result is JSON output containing every piece of information about the requested text reuse. Integration with external resources makes it possible to obtain inflected forms of DFHG words and lemmata from dictionaries and encyclopedias. An alignment of DFHG Greek text reuses and their Latin translations is available through the Parallel Alignment Browser developed at the University of Leipzig: http://ctstest.informatik.uni-leipzig.de/cts_admin_tools/parallelbrowser/?ctsURL=../../fragmentary&sep=:

2) The second project is the "Digital Athenaeus", which is producing a digital edition of an ancient Greek work, the Deipnosophists by Athenaeus of Naucratis: http://digitalathenaeus.org. The project focuses on annotating quotations and text reuses in order to provide an inventory of reused authors and works, and to implement a data model for identifying, analyzing, and uniquely citing instances of text reuse in the Deipnosophists (http://dh2016.adho.org/abstracts/46). To achieve this, the project has produced an online converter for finding concordances between the numbering schemes used in different editions of the Deipnosophists and for obtaining stable CTS URNs for citing the text. Moreover, digital versions of printed indexes have been created in order to map the text references to reused authors and works. Index entries can be visualized in Martin Reckziegel's Canonical Text Reader and Citation Exporter (CTRaCE), developed at the University of Leipzig. The project has implemented two web-based interface tools for querying the SQL database of the indexes. An API with JSON output makes it possible to integrate index data into external services. Different file formats (CSV, XML, etc.) can be generated from the database.
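Both projects rely on CTS URNs as stable citation identifiers. As a minimal sketch of what such an identifier encodes, the following Python snippet splits a CTS URN into its components, assuming the general layout urn:cts:namespace:textgroup.work.version.exemplar:passage. The class and function names are hypothetical, the sample URN is purely illustrative and not taken from either project's data, and the sketch does not cover the full CTS URN specification (for example, passage ranges and subreferences are kept as one uninterpreted string).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CtsUrn:
    """Components of a CTS URN (hypothetical helper type for illustration)."""
    namespace: str
    textgroup: str
    work: Optional[str] = None
    version: Optional[str] = None
    exemplar: Optional[str] = None
    passage: Optional[str] = None


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into namespace, work hierarchy and passage reference."""
    parts = urn.split(":")
    # Expected: "urn", "cts", namespace, work component, optional passage.
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")

    namespace, work_component = parts[2], parts[3]
    # A trailing colon without a passage is treated as "no passage".
    passage = parts[4] if len(parts) == 5 and parts[4] else None

    # The work component is a dot-separated hierarchy of up to four levels.
    levels = work_component.split(".")
    if not namespace or not 1 <= len(levels) <= 4 or not all(levels):
        raise ValueError(f"malformed work component in: {urn!r}")

    padded = levels + [None] * (4 - len(levels))
    return CtsUrn(namespace, padded[0], padded[1], padded[2], padded[3], passage)


if __name__ == "__main__":
    # Illustrative URN only; not taken from the DFHG or Digital Athenaeus data.
    print(parse_cts_urn("urn:cts:greekLit:tlg0008.tlg001.perseus-grc2:1.1"))
```

Because the identifier itself carries the work hierarchy and the passage reference, a tool that understands this structure can cite the same passage regardless of which project-specific database or interface delivered the text.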

Both examples use different technical solutions, including some that were implemented specifically for these projects. While standards and guidelines can be shared among research groups, project-specific technical solutions, such as scripts, format conversions and computer setups, are potentially hard to reuse in other projects by different research groups because of small deviations in the data or other requirements. That is why it is often more reasonable to build yet another project-specific workflow, even if the problems overlap with an already existing solution.
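How small such a deviation can be is illustrated by the following hypothetical Python sketch. Two invented CSV exports carry the same information but differ in delimiter and column names; neither reflects the actual export format of the projects described above. A reader written for the first export fails on the second, while a reader that takes the delimiter and a column mapping as parameters handles both. All names and sample values are assumptions made for illustration.

```python
import csv
import io
from typing import Dict, List

# Invented sample exports: same information, different delimiter and headers.
PROJECT_A_CSV = "author,fragment,text\nExample Author,1,sample text\n"
PROJECT_B_CSV = "author_name;fragment_no;text\nExample Author;1;sample text\n"


def read_project_a(raw: str) -> List[Dict[str, object]]:
    """Reader tailored to project A: comma-delimited, fixed column names."""
    rows = csv.DictReader(io.StringIO(raw))
    return [
        {"author": r["author"], "fragment": int(r["fragment"]), "text": r["text"]}
        for r in rows
    ]


def read_with_mapping(
    raw: str, delimiter: str, columns: Dict[str, str]
) -> List[Dict[str, object]]:
    """Reader that maps project-specific column names to shared field names."""
    rows = csv.DictReader(io.StringIO(raw), delimiter=delimiter)
    return [
        {
            "author": r[columns["author"]],
            "fragment": int(r[columns["fragment"]]),
            "text": r[columns["text"]],
        }
        for r in rows
    ]


if __name__ == "__main__":
    print(read_project_a(PROJECT_A_CSV))
    try:
        read_project_a(PROJECT_B_CSV)
    except KeyError as err:
        # The tailored reader does not find its expected column names.
        print("project A reader failed on project B data:", err)
    mapping = {"author": "author_name", "fragment": "fragment_no", "text": "text"}
    print(read_with_mapping(PROJECT_B_CSV, ";", mapping))
```

Even the parameterized reader still needs a hand-written mapping for every new project, which is the kind of recurring effort that shared standards are meant to reduce.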

One step towards resolving this technical variety is the development of text research infrastructures like CLARIN, because they enable researchers to outsource technical problems.