

Application Area: Digital Humanities


A more detailed overview of the available data and tools, including ready-to-use demos, can be found on the project website.

The exponential growth of internet-based communication means that the humanities and social sciences have access to large amounts of data for data-driven analysis. Large-scale digitization programs and the release of official data have also made historical and statistical sources retrievable. The particular challenge in the digital humanities is linking quantitative, data-driven analysis with qualitative interpretation, so questions of knowledge extraction are especially important. Processing very high data volumes, complex data structures and rapidly changing data requires Big Data methods, which are already better established in other disciplines. At the same time, psychologists and social scientists can use their methodological repertoire to contribute to the critical reflection on how Big Data is handled in science and business.

In the digital humanities, the traditional separation of resources from the metadata describing them increasingly leads to a decomposition of the resources themselves into individual components. For text collections, these components include the raw texts, various annotations and the associated metadata. This calls for a multi-tiered architecture: a storage solution holding the full texts as well as the annotations, various indices (covering, among other things, annotations and metadata), and suitable interfaces that serve the data via these indices.

For the preparation and annotation of the data, it must be noted that many tools serve a similar purpose (such as crawling, cleaning, segmentation, tagging or parsing), but each works efficiently only for specific domains. The quality of a processing chain also depends directly on its pre-processing steps and on the parameterization of the individual stages. Because the same raw data can therefore yield a large number of different generated versions of a resource, the systematic evaluation and efficient provision of these data demand significantly more storage space and computing capacity.
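A minimal sketch can illustrate why parameterized processing chains multiply resource versions. The step names, parameters and hashing scheme below are hypothetical, not part of any ScaDS system; the point is only that each distinct parameterization of the same raw text produces a distinct, separately storable version.

```python
import hashlib
import json

def clean(text, lowercase=True):
    # Hypothetical cleaning step: normalize whitespace, optionally lowercase.
    text = " ".join(text.split())
    return text.lower() if lowercase else text

def segment(text, delimiter="."):
    # Hypothetical segmentation step: split into sentences on a delimiter.
    return [s.strip() for s in text.split(delimiter) if s.strip()]

def version_id(chain_config):
    # Derive a stable identifier from the chain's parameterization, so every
    # distinct configuration maps to a distinct, reproducible version key.
    blob = json.dumps(chain_config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]

def run_chain(raw_text, config):
    # One processing chain: clean, then segment, keyed by its configuration.
    cleaned = clean(raw_text, lowercase=config["lowercase"])
    sentences = segment(cleaned, delimiter=config["delimiter"])
    return {"version": version_id(config), "sentences": sentences}

raw = "Big Data methods are required. Psychologists contribute too."
v1 = run_chain(raw, {"lowercase": True, "delimiter": "."})
v2 = run_chain(raw, {"lowercase": False, "delimiter": "."})
# Same raw data, two parameterizations: two versions that must both be stored
# and indexed if results are to remain reproducible and comparable.
```

With even a handful of steps and parameters, the number of such version keys grows combinatorially, which is exactly the storage and compute pressure described above.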
The work is carried out by the Natural Language Processing group of the University of Leipzig.