SHK Position (Leipzig): Entity Resolution and Linking as a Service
Improving data quality through duplicate detection and data cleansing is an important pre-processing step before meaningful data analysis can be performed. At the University of Leipzig, the Dedoop system was developed for large-scale duplicate detection tasks; it can identify large numbers of matching objects efficiently. Dedoop helps configure entity matching workflows via a GWT UI and transforms these workflows into MapReduce jobs (recently also Apache Flink jobs). Various techniques for load balancing the similarity computations are applied, realized as multiple interlinked MapReduce jobs or, in the future, Flink jobs.
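To illustrate the core idea behind such matching workflows (not Dedoop's actual implementation), here is a minimal Java sketch of blocking plus pairwise similarity comparison. The blocking key (first letter), the Jaccard token similarity, and the threshold are all illustrative choices for this sketch:

```java
import java.util.*;

public class BlockingMatchSketch {
    // Jaccard similarity over lower-cased whitespace token sets
    static double jaccard(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> inter = new HashSet<>(ta);
        inter.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Blocking: only records sharing a blocking key (here simply the first
    // letter) are compared pairwise. This pruning is what makes matching
    // feasible at scale -- and uneven block sizes are exactly why the
    // distributed similarity computations need load balancing.
    static List<String> findMatches(List<String> records, double threshold) {
        Map<Character, List<String>> blocks = new HashMap<>();
        for (String r : records)
            blocks.computeIfAbsent(Character.toLowerCase(r.charAt(0)),
                                   k -> new ArrayList<>()).add(r);
        List<String> matches = new ArrayList<>();
        for (List<String> block : blocks.values())
            for (int i = 0; i < block.size(); i++)
                for (int j = i + 1; j < block.size(); j++)
                    if (jaccard(block.get(i), block.get(j)) >= threshold)
                        matches.add(block.get(i) + " <-> " + block.get(j));
        return matches;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
            "Univ. Leipzig", "University Leipzig",
            "TU Dresden", "Dresden Technical University");
        for (String m : findMatches(records, 0.3))  // threshold is illustrative
            System.out.println("MATCH: " + m);
    }
}
```

In a MapReduce or Flink job, the blocking step corresponds to the shuffle/group-by phase, and the nested comparison loop runs per block on the reducers.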
One difficulty is integrating these linking features into existing software tools and making them usable for big data consumers. Therefore, a Dedoop service needs to be developed that can be used as a web-based tool or as a web/REST service for simple application integration. The Dedoop service will be deployed on a big data infrastructure at the University of Leipzig comprising 90 Hadoop nodes.
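As a rough sketch of what such a REST-style service interface might look like (the `/jobs` endpoint, the JSON response shape, and the job-id scheme are all hypothetical, not Dedoop's actual API), one could expose job submission over HTTP using only the JDK:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.atomic.AtomicInteger;

public class DedoopServiceSketch {
    static final AtomicInteger jobCounter = new AtomicInteger();

    // Hypothetical endpoint: POST a matching-workflow config to /jobs and
    // receive a job id back. A real service would validate the config and
    // hand the workflow off to the Hadoop/Flink cluster.
    public static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/jobs", exchange -> {
            String body = "{\"jobId\": " + jobCounter.incrementAndGet()
                        + ", \"status\": \"SUBMITTED\"}";
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.getBytes().length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body.getBytes());
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = start(0);               // 0 = pick a free port
        int port = server.getAddress().getPort();
        // Client side: submit a (dummy) workflow config and print the reply.
        HttpResponse<String> resp = HttpClient.newHttpClient().send(
            HttpRequest.newBuilder(URI.create("http://localhost:" + port + "/jobs"))
                .POST(HttpRequest.BodyPublishers.ofString("{\"workflow\": \"example\"}"))
                .build(),
            HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body());
        server.stop(0);
    }
}
```

Wrapping the existing GWT-based workflow configuration behind such an HTTP interface is precisely what would make Dedoop embeddable in other tools.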
We are looking for a student who
- Is open to new data management topics such as deduplication and linking
- Has a good programming background in Java
- Has good web programming skills (e.g., GWT, Angular, HTML+JS, jQuery UI, ...)