

Connecting Digital Humanities with the CLARIN Infrastructure


One question I am often confronted with when presenting my work is what it has to do with BigData, when the biggest text collections I deal with fill only a couple of Gigabytes of hard disk space. The reasoning behind this question is that BigData must involve large amounts of data, and that BigData-related problems must deal with at least Tera- or Petabytes of stored information. As understandable as this argument is, there is a whole lot more to BigData than the size of a data set. With this article I want to explain what that is and, hopefully, answer the question in a satisfactory manner.

Digital Humanities and the 4 Big V's

Defining what BigData is, and what it is not, is no trivial task. IBM specified 4 terms that have to be considered when dealing with BigData: Volume, Variety, Veracity and Velocity. Since they all start with the letter V, these terms are often referred to as the 4 V's of BigData. Volume corresponds to size-related data issues and is probably the easiest to understand. Veracity and Velocity describe, respectively, how reliable data is and how fast it needs to be processed. Variety describes the heterogeneity of data sources, data formats, metadata markup and the tool or workflow requirements of data sets. To qualify as a BigData problem, something has to be problematic in at least one of these aspects.

Since my work has to do with corrected text, Veracity and Velocity can be ignored. In general, texts are not streamed in problematic amounts and, aside from automatic character recognition, the error potential for text is very small: a letter is either there or it is not. With all the digitization projects that are currently running, Volume might become problematic, but considering that even large text corpora like the Deutsche Textarchiv, with 142,015,148 words in 2,441 documents, only require a couple of Gigabytes of hard disk space, it is hard to talk about a Volume problem. This leaves us with Variety, which in my opinion and experience is the most prominent issue that has to be dealt with in text-based Digital Humanities.
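To make the Volume argument a bit more tangible, here is a rough back-of-the-envelope sketch in Python. Only the word and document counts come from the figures quoted above; the average bytes per word and the markup overhead factor are assumptions chosen purely for illustration, and the function name is hypothetical. The real storage footprint of any given corpus depends on its encoding and annotation depth.

```python
# Back-of-the-envelope storage estimate for a text corpus.
# Only WORDS and DOCUMENTS are taken from the article; every other
# constant is an illustrative assumption, not a measured value.
WORDS = 142_015_148          # word count quoted in the text
DOCUMENTS = 2_441            # document count quoted in the text
AVG_BYTES_PER_WORD = 7       # assumed: about 6 characters plus a separator
MARKUP_FACTOR = 3            # assumed: overhead of markup and metadata


def estimate_corpus_bytes(words, bytes_per_word, markup_factor=1):
    """Return a rough storage estimate in bytes."""
    return words * bytes_per_word * markup_factor


plain = estimate_corpus_bytes(WORDS, AVG_BYTES_PER_WORD)
annotated = estimate_corpus_bytes(WORDS, AVG_BYTES_PER_WORD, MARKUP_FACTOR)

print(f"plain text:  {plain / 1e9:.2f} GB")
print(f"with markup: {annotated / 1e9:.2f} GB")
print(f"average words per document: {WORDS // DOCUMENTS}")
```

Under these assumptions the estimate stays in the low single-digit Gigabyte range, far below the Tera- or Petabyte scale usually associated with a Volume problem.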