Bachelor's or Master's Thesis (Leipzig):
Linking and Duplicate Detection in Big Graphs – blocking approaches in distributed linking
The graph-based storage and processing of large amounts of data is becoming increasingly important. In our work we encounter large networks of interactions between genes, proteins and processes in the life sciences, chemical compounds and their reactions in chemistry, and information graphs in the business domain. A particularly prominent example is Facebook, which offers its users access to the information in its social network through a graph search.
In a current project at the University of Leipzig, a novel graph-processing platform (GRADOOP) is being developed. It simplifies the entire process of creating, processing and analyzing graphs with the help of standardized operators and workflows. These workflows are then executed efficiently in a distributed manner using Apache Flink.
An important initial step is the creation of graphs by linking various data sources and improving data quality through duplicate detection and data cleansing. A prototypical version of Gradoop already contains initial operators for calculating object and subgraph similarities as well as load balancing features. Still, we observed that performance improvement techniques, such as blocking in graphs, need to be investigated more thoroughly. Blocking reduces the number of pairwise comparisons by first grouping objects in a cheap computation step, so that only objects within the same block are compared. In the non-graph entity matching area these techniques are well researched. For graph-based linking, however, new approaches could be developed that take neighbors and edge properties into account when computing blocks. All new techniques should be developed with primitives from Apache Flink.
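To make the idea of blocking concrete, the following is a minimal sketch in plain Python. It does not use Gradoop or Apache Flink, and the toy vertices, edges and both key functions (`property_key`, `neighbor_key`) are hypothetical names chosen for illustration only. It compares a simple property-based blocking key with one possible key that also includes neighbor information; the approaches actually developed in the thesis may look quite different.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: vertex id -> properties (not taken from Gradoop).
vertices = {
    1: {"type": "author", "name": "Smith, J."},
    2: {"type": "author", "name": "Smith, John"},
    3: {"type": "author", "name": "Smyth, Jon"},
    4: {"type": "author", "name": "Miller, A."},
    5: {"type": "venue", "name": "VLDB"},
    6: {"type": "venue", "name": "SIGMOD"},
}
# Hypothetical undirected edges (author published at venue).
edges = [(1, 5), (2, 5), (3, 6), (4, 5)]


def property_key(vertex_id):
    """Cheap blocking key: vertex type plus the first two letters of the name."""
    props = vertices[vertex_id]
    return (props["type"], props["name"][:2].lower())


def neighbor_key(vertex_id):
    """Illustrative graph-aware key: property key plus the sorted neighbor names."""
    neighbors = sorted(
        vertices[b if a == vertex_id else a]["name"]
        for a, b in edges
        if vertex_id in (a, b)
    )
    return property_key(vertex_id) + (tuple(neighbors),)


def candidate_pairs(key_function):
    """Group vertices by key and emit pairs only within each block."""
    blocks = defaultdict(list)
    for vertex_id in vertices:
        blocks[key_function(vertex_id)].append(vertex_id)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Number of comparisons without any blocking: n * (n - 1) / 2.
all_pairs = len(vertices) * (len(vertices) - 1) // 2
property_pairs = candidate_pairs(property_key)
neighbor_pairs = candidate_pairs(neighbor_key)

print("without blocking:", all_pairs, "comparisons")
print("property key:", sorted(property_pairs))
print("neighbor key:", sorted(neighbor_pairs))
```

On this toy data the property key reduces 15 possible comparisons to 3, and the neighbor-aware key reduces them to 1. The stricter key also shows the trade-off: vertex 3 might be a duplicate of vertices 1 and 2, but it is no longer compared with them because it has a different neighbor, so a blocking key that is too selective can miss true matches.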
The work includes the following subtasks:
- Overview of related work in blocking for entity matching, focusing on graph-based blocking techniques
- Concept for new graph-based blocking techniques that take neighborhood information into account
- Prototypical implementation on top of Gradoop and Apache Flink
- Evaluation of the developed concepts on several datasets, such as a dataset of publications, conferences and authors
We promise close supervision by members of the Big Data Center ScaDS. In some cases we can offer student positions before or after the thesis to dive deeper into the topic.