ScaDS Logo

CENTER FOR
SCALABLE DATA ANALYTICS
AND ARTIFICIAL INTELLIGENCE

Connecting Digital Humanities with the CLARIN Infrastructure - Markup Variety

Beitragsseiten

Markup Variety

Still, even after when limiting the data to plain text, the heterogenity of formats is huge because different researchers use different markup for the meta information. There is no universal guideline how to format text data because - again and contrary to for example HTML - text data does not have to be compliant to any tool. For instance, a headline in HTML is generally marked as < h1 > because any browser will know that this means that the following content has to be rendered in a certain way. Edited digital documents on the other hand are mostly incorporated in project specific workflows. This means that a headline may be marked as < h1 >if the documents are available as a website. It may also be < headline > or < title > or < ueberschrift > depending on the purpose of the file and the technical knowledge of the editor.

To limit this heterogenity, guidelines are established. To my knowledge, TEI/XML provides the most prominent markup guidelines for digitized documents and many current corpus projects provide at least an export functionality for TEI/XML. Documents in this format are divided into a TEIHeader with the meta information for this document and a body that contains the actual text. The body can be structured in several ways depending on the type of document - for example chapter|sentence or song|stanza|verse. Being based on XML, this format requires a certain technical knowledge and understanding of structure, but because of its expressiveness and the relative intuitive syntax of XML, TEI/XML is worth the learning effort.

The benefits of being compliant to established guidelines can for example easily be understood when looking at HTML. The markup of HTML can be interpreted by any web browser, which means that this combination of characters : < i > some text < / i > will be rendered as some text if the spaces between the characters are deleted. Without this common set of rules, it would not be possible to design websites aside from using ASCI art.
In the following images, this principle is applied to text by using generic rules to render the content of certain TEI/XML tags nicely. The upper image shows the TEI/XML markup as it would be shown in any plain text editor. The lower image shows the way that the tags are visualized in Martin Reckziegel's CTRace.

Unstyled TEI/XML output
Unstyled TEI/XML output
Styled TEI/XML output
Styled TEI/XML output

Similiar to how any HTML is rendered nicely in a browser, any TEI/XML document can be rendered nicely using tools that understand the markup. This way TEI/XML can serve as a very effective way to significantly reduce the markup Variety. One main disadvantage of XML is that its use can have a serious performance impact compared to a simple plain text format like CSV.

Monica Berti kindly provided the following TEI/XML example use case:

An example of a project based on TEI/XML is the Open Greek and Latin (OGL) developed at the University of Leipzig: https://github.com/OpenGreekAndLatin. The project is producing machine-corrected XML versions of ancient Greek and Latin works and translations. The goal is to provide at least one version for all Greek and Latin sources produced during antiquity (about 150 million words). Analysis of 10,000 books in Latin, downloaded from Archive.org, identified more than 200 million words of post-classical Latin. With 70,000 public domain books listed in the Hathi Trust as being in Ancient Greek or Latin, the amount of Greek and Latin already available will almost certainly exceed 1 billion words. The project is using OCR-technology optimized for Classical Greek and Latin to create an open corpus that is reasonably comprehensive for the c. 100 million words produced through c. 600 CE and that begins to make available the billions of words produced after 600 CE in Greek and Latin that survive. OGL XML files are based on the EpiDoc guidelines (https://sourceforge.net/p/epidoc/wiki/Home/) and are CTS compliant.