
CENTER FOR
SCALABLE DATA ANALYTICS
AND ARTIFICIAL INTELLIGENCE

CTS Text Miner 


Background
The purpose of this project is to create a modular text mining framework that uses the Canonical Text Service (CTS) as its data source. By combining standardized functionality with standardized access to text data, the framework aims to reduce the heterogeneity of workflows in today’s Digital Humanities and to serve as an important element of a text research infrastructure.
 
Preliminary Results
The framework is designed as a three-layered modular webservice architecture.
Layer 1 consists of RESTful webservices that serve raw data.

On top of layer 1, layer 2 serves generic diagrams, also as RESTful webservices, that visualize the data by combining a diagram with a webservice request to layer 1. Combinations range from relatively simple visualizations, such as a bar chart for trend detection, to more advanced ones, such as the topic model browser, which combines several data-diagram combinations in one exploration tool. Every visualization obtains its input through requests to layer 1. The function call to layer 1 is wrapped as a single parameter in layer 2, which makes it possible to support future, *currently unknown* functions generically.
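The wrapping of a layer 1 call as one parameter of a layer 2 request can be sketched as follows. This is only a minimal illustration, not the framework's actual interface: the base URLs, the function name `wordfrequency`, the diagram name `barchart`, and the parameter name `data` are all assumptions made for this example.

```python
from urllib.parse import urlencode

# Hypothetical endpoints; the real service paths may differ.
LAYER1_BASE = "http://example.org/layer1"
LAYER2_BASE = "http://example.org/layer2"


def layer1_url(function, **params):
    """Build a request to a layer 1 webservice that serves raw data."""
    # Sorting the parameters keeps the URL identical for identical input.
    return f"{LAYER1_BASE}/{function}?{urlencode(sorted(params.items()))}"


def layer2_url(diagram, data_request):
    """Wrap a complete layer 1 request as a single parameter of a diagram request."""
    # urlencode percent-encodes the inner URL, so its own '?' and '&'
    # cannot be confused with the query string of the outer request.
    return f"{LAYER2_BASE}/{diagram}?{urlencode({'data': data_request})}"


# 'wordfrequency' and 'barchart' are hypothetical names for illustration.
data = layer1_url("wordfrequency", word="logos", urn="urn:cts:greekLit:tlg0012.tlg001")
chart = layer2_url("barchart", data)
print(chart)
```

Because the diagram service only sees an opaque `data` parameter in this sketch, a new layer 1 function would not require any change to the diagram side, which is the point of wrapping the call generically.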

The third and top layer, which is currently under development, is a user interface that will provide a public and open text mining tool covering several predefined use cases based on predefined diagram-data combinations. Unlike layers 1 and 2, this layer is deliberately not designed as a RESTful webservice, so that it can support stateful interactions such as session-wide parameter caching for better usability.

Because of the layered architecture, each step in the user interface can be traced back to the raw data layer, and since layers 1 and 2 work as RESTful webservices, each vertical step in the architecture can be backed by a persistent URL for its result. This means that users of the graphical user interface can share and bookmark their results and visualizations using persistent URLs that will always generate the same result.
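Tracing a shared visualization back to the raw data request it is based on could look roughly like the sketch below. It assumes, purely for illustration, that the layer 1 request is carried in a query parameter named `data`; the URL shown is a made-up example, not an address of the actual service.

```python
from urllib.parse import urlsplit, parse_qs


def wrapped_data_request(visualization_url, param="data"):
    """Return the layer 1 request embedded in a layer 2 URL, or None if absent."""
    query = parse_qs(urlsplit(visualization_url).query)
    values = query.get(param)
    return values[0] if values else None


# Hypothetical persistent URL of a visualization (layer 2).
shared = (
    "http://example.org/layer2/barchart"
    "?data=http%3A%2F%2Fexample.org%2Flayer1%2Fwordfrequency%3Fword%3Dlogos"
)
print(wrapped_data_request(shared))
```

Since the bookmarked URL contains the complete data request, anyone who receives it can inspect or re-run the underlying layer 1 call without access to the original user session.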
Because CTS URNs are used, each result can be linked to a publicly available text passage, and results can be connected across different tools and instances.
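To illustrate what makes such links possible, the following sketch splits a CTS URN into its namespace, work identifier, and optional passage reference. It is a simplified reading of the URN structure and not part of the framework: it does not validate sub-references or the internal structure of the work component, and the example URN is used only for illustration.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str          # e.g. "greekLit"
    work: str               # dot-separated work identifier
    passage: Optional[str]  # citation or range, None if the URN names a whole text


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN of the form urn:cts:namespace:work[:passage]."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work = parts[2], parts[3]
    if not namespace or not work:
        raise ValueError(f"missing namespace or work in: {urn!r}")
    # An absent or empty fifth component means the URN refers to the whole text.
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(namespace, work, passage)


ref = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.perseus-grc1:1.1")
print(ref.namespace, ref.work, ref.passage)
```

Two tools that both store the URN alongside a result can agree on the passage it refers to without sharing any other state.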
It is planned to implement an interface to one of the HPC clusters at ScaDS, which would provide a relatively easy way to outsource potentially expensive calculations of results.
The current state of development is illustrated by the demo available here: http://ctstm.informatik.uni-leipzig.de:8080/ctstm/vis/
 
Project Members
Prof. Dr. Gerhard Heyer
Jochen Tiepmar
Hans Dieter Pogrzeba (Student Member)