B.Sc. Thesis: TF-IDF for Entity Resolution in Huge Knowledge Graphs
Entity Resolution (also known as Deduplication, Record Linkage, or Link Discovery) is the task of identifying entities that refer to the same real-world object. Entities are usually matched by computing a similarity between them and then using that similarity to decide whether they represent the same object. One such similarity measure is based on tf-idf (term frequency-inverse document frequency).
This bachelor thesis consists of implementing tf-idf as a similarity measure for FAMER (FAst Multi-source Entity Resolution system), a scalable framework for distributed multi-source entity resolution implemented with Apache Flink™.
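
To illustrate how tf-idf can serve as a similarity measure between entities, the following is a minimal single-machine sketch in Python. It is not FAMER's implementation and does not use Apache Flink. The function names (`tokenize`, `tfidf_vectors`, `cosine`), the sample records, the smoothed idf variant, and the use of cosine similarity to compare the weighted vectors are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, keep alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(records):
    """Return one sparse tf-idf vector (term -> weight) per record."""
    docs = [Counter(tokenize(r)) for r in records]
    n = len(docs)

    # Document frequency: in how many records does each term occur?
    df = Counter()
    for doc in docs:
        df.update(doc.keys())

    # Smoothed idf (an assumed variant): rare terms get higher weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for doc in docs:
        total = sum(doc.values())
        # Term frequency is normalized by the record length.
        vectors.append({t: (c / total) * idf[t] for t, c in doc.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up example records; the first two describe the same product.
records = [
    "Apple iPhone 12 64GB black",
    "iPhone 12 Apple 64 GB (black)",
    "Samsung Galaxy S21 128GB",
]
vectors = tfidf_vectors(records)
sim_same = cosine(vectors[0], vectors[1])
sim_diff = cosine(vectors[0], vectors[2])
```

In this sketch, `sim_same` is larger than `sim_diff` because the first two records share several terms while the first and third share none. A pair would then be treated as a match when its similarity exceeds a chosen threshold. In a distributed setting, the document frequencies would have to be computed across the whole dataset before the pairwise comparisons.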