Master's Thesis (Leipzig): Scalable image-based deduplication
Dipl. Medieninform. Christoph Müller (Check24 Vergleichsportal Reise GmbH)
Digital images that depict the same real-world object are called duplicates. Humans can identify them as such very quickly, yet their binary representations can differ greatly. The automated detection of such duplicates from image properties derived exclusively from the binary data has been a subject of research for many years. However, current deduplication systems often support only textual data throughout the matching process. This Master's thesis presents the concept of a system that enables image-based deduplication of large quantities of images on a distributed infrastructure. This system, the Similar Image Matching Suite (SIMaSu for short), was also implemented as a prototype using the message broker RabbitMQ. Furthermore, the thesis gives an overview of the currently available methods for calculating image similarity. These include perceptual hashing techniques, feature-based methods, and a mean squared error approach. Such metrics form the core of image-based deduplication. In addition, a similarity metric was designed that calculates a similarity value using the feature-based techniques SIFT, SURF, and ORB. In a final evaluation, eleven selected implementations of different metrics are compared with respect to runtime, invariance to image transformations, and efficiency. This fair comparison provides decision support for or against the use of a particular metric, as well as for the choice of an effective threshold for classifying a pair of images.
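To make the idea of a similarity metric combined with a classification threshold more concrete, the following minimal sketch shows two of the metric families named in the abstract: a simple average-based perceptual hash and a mean squared error. It is an illustration only and not the SIMaSu implementation. The function names, the toy 2x2 grayscale images, and the threshold of 0.9 are assumptions made for this example; a practical perceptual hash would typically first scale the image down to a small fixed size (for example 8x8 pixels).

```python
from typing import List

# Assumption for this sketch: an image is a row-major grid of
# grayscale values in the range 0..255, already scaled to a common size.
Image = List[List[int]]


def _flatten(img: Image) -> List[int]:
    return [pixel for row in img for pixel in row]


def average_hash(img: Image) -> List[int]:
    """One bit per pixel: 1 if the pixel is brighter than the image mean."""
    pixels = _flatten(img)
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def hash_similarity(a: Image, b: Image) -> float:
    """1.0 minus the normalized Hamming distance between both hashes."""
    ha, hb = average_hash(a), average_hash(b)
    if len(ha) != len(hb):
        raise ValueError("images must have the same dimensions")
    differing = sum(x != y for x, y in zip(ha, hb))
    return 1.0 - differing / len(ha)


def mean_squared_error(a: Image, b: Image) -> float:
    """Average squared pixel difference; 0.0 means identical pixel data."""
    pa, pb = _flatten(a), _flatten(b)
    if len(pa) != len(pb):
        raise ValueError("images must have the same dimensions")
    return sum((x - y) ** 2 for x, y in zip(pa, pb)) / len(pa)


def is_duplicate(a: Image, b: Image, threshold: float = 0.9) -> bool:
    """Classify a pair as duplicates; the threshold is a hypothetical value."""
    return hash_similarity(a, b) >= threshold


original = [[10, 200], [220, 30]]
brighter = [[30, 220], [240, 50]]   # same motif, every pixel 20 levels brighter
different = [[200, 10], [30, 220]]  # inverted layout

print(hash_similarity(original, brighter), mean_squared_error(original, brighter))
print(is_duplicate(original, brighter), is_duplicate(original, different))
```

In this toy case the hash-based similarity is unaffected by the uniform brightness change, while the mean squared error is not. This is the kind of difference in invariance to image transformations that a comparison of metrics has to account for when a threshold is chosen.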