Digital Humanities and the 4 Big V's

Defining what Big Data is, and what it is not, is no trivial task. IBM specified 4 terms that have to be considered when dealing with Big Data: Volume, Variety, Veracity and Velocity. Since they all start with the letter V, these terms are often referred to as the 4 V's of Big Data. Volume corresponds to size-related data issues and is probably the easiest to understand. Veracity describes how reliable the data is, and Velocity describes how fast it needs to be processed. Variety describes the heterogeneity of data sources, data formats, metadata markup, and the tool or workflow requirements of data sets. To qualify as a Big Data problem, something has to be problematic in at least one of these aspects.

Since my work deals with corrected text, Veracity and Velocity can be ignored. In general, texts are not streamed in problematic amounts, and aside from errors introduced by automatic character recognition, the error potential for text is very small: a letter is either there or it is not. With all the digitization projects that are currently running, Volume might become problematic, but even a large text corpus like the Deutsches Textarchiv, with 142,015,148 words in 2,441 documents, requires only a couple of gigabytes of hard disk space, so it is hard to speak of a Volume problem. This leaves Variety, which in my opinion and experience is the most prominent issue that has to be dealt with in text-based Digital Humanities.
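
The Volume estimate above can be checked with a rough back-of-the-envelope calculation. The sketch below uses the word and document counts quoted for the Deutsches Textarchiv; the average bytes per word and the markup overhead factor are my own assumptions for illustration, not measured values, so the result only indicates an order of magnitude.

```python
# Figures quoted in the text for the Deutsches Textarchiv.
WORDS = 142_015_148
DOCUMENTS = 2_441

# Assumptions for illustration only, not measurements:
BYTES_PER_WORD = 7   # average word plus one separator, encoded as UTF-8
MARKUP_FACTOR = 3    # size increase caused by XML markup and metadata


def estimate_bytes(words, bytes_per_word, markup_factor):
    """Return a rough storage estimate in bytes."""
    return words * bytes_per_word * markup_factor


plain = estimate_bytes(WORDS, BYTES_PER_WORD, 1)
marked_up = estimate_bytes(WORDS, BYTES_PER_WORD, MARKUP_FACTOR)

print(f"plain text:  {plain / 1e9:.2f} GB")
print(f"with markup: {marked_up / 1e9:.2f} GB")
print(f"average words per document: {WORDS / DOCUMENTS:.0f}")
```

Under these assumptions the whole corpus stays in the range of a few gigabytes, which fits on any current laptop and supports the point that Volume is not the limiting factor here.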