Prof. Dr. Erhard Rahm & Dr. Eric Peukert & Alieh Saeedi & Marcel Gladbach - Big Data Integration Research at ScaDS Dresden/Leipzig
FAMER (FAst Multi-source Entity Resolution system) is a new scalable framework for distributed multi-source entity resolution. While existing link discovery methods focus on finding binary links between pairs of sources, FAMER supports more holistic data integration by clustering equivalent entities from many sources. Such an approach is especially useful for constructing large knowledge graphs from many sources. FAMER constructs a so-called similarity graph for the entities of interest as the basis for clustering. It supports parallel versions of several clustering schemes, including a new approach called CLIP that favors so-called strong entity links. CLIP can also be used to repair clusters determined by other methods such as connected components or correlation clustering. FAMER is based on Apache Flink, and its parallel execution supports scalability to large data volumes.
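To make the idea of a similarity graph with subsequent clustering concrete, here is a minimal single-machine sketch. It is not FAMER's implementation: it does not use Apache Flink, it does not implement CLIP, and the token-based Jaccard similarity, the 0.5 threshold, the sample records, and all function names are assumptions chosen for illustration. Links are only created between entities of different sources, which assumes each source is itself duplicate-free. The clustering step is plain connected components, one of the baseline schemes mentioned above.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set Jaccard similarity between two strings (illustrative choice)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def build_similarity_graph(entities, threshold=0.5):
    """Return edges (id1, id2, sim) between entities of different sources.

    entities: list of (entity_id, source, name) tuples.
    The threshold is a hypothetical parameter, not a FAMER default.
    """
    edges = []
    for (id1, src1, name1), (id2, src2, name2) in combinations(entities, 2):
        if src1 == src2:
            continue  # assumption: no duplicates within a single source
        sim = jaccard(name1, name2)
        if sim >= threshold:
            edges.append((id1, id2, sim))
    return edges


def connected_components(entity_ids, edges):
    """Cluster entities by connected components using union-find."""
    parent = {e: e for e in entity_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, _sim in edges:
        parent[find(a)] = find(b)

    clusters = {}
    for e in entity_ids:
        clusters.setdefault(find(e), set()).add(e)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical records from three sources A, B and C.
entities = [
    ("a1", "A", "Leipzig University"),
    ("b1", "B", "University Leipzig"),
    ("c1", "C", "Leipzig University Germany"),
    ("a2", "A", "Dresden Tech"),
    ("b2", "B", "Tech Dresden"),
]
edges = build_similarity_graph(entities)
clusters = connected_components([e[0] for e in entities], edges)
print(clusters)
```

With these sample records the sketch groups a1, b1 and c1 into one cluster and a2 and b2 into another. Connected components merges everything that is transitively linked, so a single weak link can join unrelated entities; this is the kind of cluster a repair step would need to split.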
Privacy-Preserving Record Linkage (PPRL) addresses the problem of matching personal records across different databases without revealing any sensitive information. It allows the combination of data from different sources for improved data analysis and research while not sharing uncoded identifying information. The linkage of person-related records (e.g., patients in hospitals) is based on encoded values of quasi-identifiers (e.g., name, address). The data needed for analysis (e.g., health data) is separated from these quasi-identifiers and can be linked with the ID pairs resulting from the PPRL process.
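The following sketch illustrates what matching on encoded quasi-identifiers can look like. The text above does not say which encoding is used, so the technique here is an assumption: a keyed Bloom-filter encoding of character bigrams compared with the Dice coefficient, a scheme commonly described in the PPRL literature. The secret key, the filter length, the number of hash functions, and all function names are hypothetical.

```python
import hashlib
import hmac


def bigrams(value):
    """Character bigrams of a padded, lower-cased quasi-identifier value."""
    padded = f"_{value.lower().strip()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(value, secret_key, num_bits=1024, num_hashes=4):
    """Encode a value as the set of bit positions set in a Bloom filter.

    Each bigram is mapped to num_hashes positions using HMAC-SHA256 with a
    key shared by the data owners (assumption: the linkage unit does not
    know this key).
    """
    positions = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            message = f"{k}:{gram}".encode("utf-8")
            digest = hmac.new(secret_key, message, hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % num_bits)
    return frozenset(positions)


def dice(a, b):
    """Dice coefficient of two encoded values (sets of bit positions)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


key = b"shared-secret"  # hypothetical key agreed on by the data owners
enc_meier = encode("Meier", key)
enc_meyer = encode("Meyer", key)
enc_schmidt = encode("Schmidt", key)

# Similar names share most bigrams and therefore most bit positions.
print(dice(enc_meier, enc_meyer), dice(enc_meier, enc_schmidt))
```

Because similar values share bigrams, their encodings overlap, so a linkage unit can compute approximate similarity on the encodings alone. Comparing every record pair this way is quadratic in the number of records, which is why efficiency is a challenge in its own right.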
PPRL faces several challenges that must be solved to ensure its practical applicability. In particular, a high degree of privacy has to be ensured by suitable encoding of sensitive data and by organizational structures, such as the use of a trusted linkage unit. PPRL must also achieve high linkage quality by avoiding false or missing matches. Furthermore, high efficiency with fast linkage times and scalability to large data volumes is needed despite the inherent quadratic complexity of the problem. The talk will give an overview of our research results and plans for future work in this area.
Dr. Eric Peukert coordinates the Service Center for Big Data at the University of Leipzig as part of ScaDS Dresden/Leipzig. He studied Computer Science and Media at the Dresden University of Technology and worked at SAP Research in the field of data integration and schema mapping within various BMBF and EU research projects. After completing his doctorate at the University of Leipzig and spending two more years with SAP, he joined ScaDS. He coordinates the activities of the center in Leipzig with a special focus on industry contacts and collaborations. His research includes big data technologies, data integration and learning-based duplicate detection methods.
Alieh Saeedi is a PhD student in computer science at the University of Leipzig. She works in the database group under the supervision of Dr. Eric Peukert and Prof. Erhard Rahm. Her research focuses on Entity Resolution (ER) in big data from multiple sources.
Marcel Gladbach is a researcher and PhD student in the database group at the University of Leipzig. He received his M.Sc. from the Leipzig University of Applied Sciences (HTWK Leipzig) in 2017. His research focus is Privacy-Preserving Record Linkage and its practical application within the Medical Informatics Initiative Germany.