Let's say you're researching George Bush and looking at different sources about him. First you have to make sure that every source you're looking at is talking about the same person: is it George H.W. or George W.? So in each source you check details such as name and birth date. What you're doing there is called entity resolution: determining which entity descriptions (in this case "George Bush") refer to the same real-world entity. Research in this field goes back a long way, but has usually focused on traditional databases or tabular data.
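The George Bush example can be sketched in a few lines of code. This is a minimal, illustrative matching rule (not any particular entity resolution library): two records are treated as the same entity if their normalized names and birth dates agree.

```python
def normalize(name: str) -> str:
    """Lowercase and strip punctuation so small spelling variants still match."""
    return " ".join("".join(c for c in name.lower() if c.isalnum() or c.isspace()).split())

def same_entity(a: dict, b: dict) -> bool:
    """Match on normalized name plus birth date."""
    return normalize(a["name"]) == normalize(b["name"]) and a["born"] == b["born"]

source_a = {"name": "George H. W. Bush", "born": "1924-06-12"}
source_b = {"name": "george h w bush",   "born": "1924-06-12"}
source_c = {"name": "George W. Bush",    "born": "1946-07-06"}

print(same_entity(source_a, source_b))  # True: same person, different spelling
print(same_entity(source_a, source_c))  # False: father vs. son
```

Real entity resolution systems use much richer similarity measures, but the core question is the same: do two descriptions refer to one real-world entity?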
In recent years, structuring data in so-called knowledge graphs has become a popular approach, not least due to their use by Google, Wikipedia, Facebook and many others. It's what gives you the info box on the right when you search for, e.g., a person or city in Google or Bing. The idea is to store things (entities) as nodes in a graph and to link them via relationships (edges).
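The nodes-linked-by-relationships idea can be illustrated with a toy knowledge graph stored as (subject, predicate, object) triples. This is a conceptual sketch, not the API of any particular graph database, and the entity names are invented for the example:

```python
# Each triple is one edge: subject --predicate--> object.
triples = {
    ("George_W_Bush", "born_in", "New_Haven"),
    ("George_W_Bush", "child_of", "George_H_W_Bush"),
    ("New_Haven", "located_in", "Connecticut"),
}

def objects(subject: str, predicate: str) -> set:
    """Follow all edges with the given label from a node."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("George_W_Bush", "child_of"))  # {'George_H_W_Bush'}
```

An info box is essentially the result of such lookups: collect all edges around one node and render them as facts.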
Yesterday, on April 22, the Girls' and Boys' Day took place across Germany to support career choices free of stereotypes. ScaDS.AI also took part in Leipzig and Dresden with a digital program for girls and young women.
In Dresden, more than 10 girls took part, this time from all over Germany - a big advantage of the virtual format. At the beginning, an AI quiz let them show whether they could make sense of technical terms such as algorithms, weak AI or supercomputers. Sunna Torge, one of our researchers, then explained how neural networks work and how we use them in our image recognition research. The genome cataloging of bats was another exciting topic, which showed how long a normal computer needs for this work compared to a high-performance computer - namely more than 10 years! Lena Jurkschat, our student research assistant, is studying for a master's degree in computer science. She reported on her everyday student life and explained that studying computer science can be pretty cool and that moving to Dresden was really worth it.
During the presentation of the computing center (ZIH) of TU Dresden, the "IT heart of the university", the participants learned how extensively our systems and applications are connected to all areas of the university: this network reaches not only into research, but also into the library and the administration, all the way to every single employee and student. Biological pattern formation was explained by Prof. Andreas Deutsch, who, as head of the department Innovative Methods of Computing, also does research on cancer detection.
Finally, Christina Mühlbach, an IT specialist at ZIH, gave insights into her work: namely, what programming has to do with creativity and why it is suitable as a creative tool for (almost) everything!
Our emerging Living Lab will be home to many of our competences at ScaDS.AI. To present them to our visitors and partners, many different demonstrators are being developed and will be on display. Unfortunately, due to the Covid-19 pandemic, several social distancing rules apply nowadays, and of course we must incorporate these new conditions into the development of our Living Lab. Existing solutions like the Corona-Warn-App monitor the contacts between users and warn them if they have been too close, based on the distance between their smartphones. For the Living Lab, Alexander Leipnitz and Timo Adameit therefore developed a new method to monitor the social distance between our visitors via cameras. The goal is to visualize contacts between the visitors of the Living Lab in a graph and to highlight dangerous contacts while preserving data privacy. Using this approach, we can identify contact groups, analyze group dynamics and thus show how the corona virus could possibly spread.
As part of the BMBF funding initiative "KMU-innovativ", ScaDS.AI started a project with partners from research and industry to develop a new tool for structuring, analyzing and exploring large volumes of heterogeneous, dynamic data sources.
Efficiently managing and merging heterogeneous, changing data sources has become a critical success factor for enterprises. AMPL (Automatic Meta Data Profiling and Lineage for Integrating Heterogeneous Data Sources) aims to develop a new tool for structuring, analyzing and exploring multiple heterogeneous, dynamic data sources. For this purpose, extensive data profiles are computed, consisting of statistics, correlations and complex provenance information (lineage). Machine-learning-assisted methods help with the schema mapping between data sources, and new methods enable the scalable and incremental computation of the data profiles.
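To make the notion of a data profile concrete, here is a minimal illustration (not AMPL's implementation): for a table given as a list of records, compute simple per-column statistics such as value counts, distinct counts and, for numeric columns, a summary.

```python
import statistics

rows = [  # toy table
    {"city": "Leipzig", "population": 600000},
    {"city": "Dresden", "population": 550000},
    {"city": "Leipzig", "population": 600000},
]

def profile_column(rows, column):
    """Basic profile: count, distinct values, and a numeric summary if applicable."""
    values = [r[column] for r in rows]
    profile = {"count": len(values), "distinct": len(set(values))}
    if all(isinstance(v, (int, float)) for v in values):
        profile["mean"] = statistics.mean(values)
        profile["min"] = min(values)
        profile["max"] = max(values)
    return profile

print(profile_column(rows, "population"))
```

Real profiling tools add correlations across columns and lineage metadata, but they build on exactly this kind of per-column summary.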
We are happy to announce that the research paper 'Enhancing Cross-lingual Semantic Annotations Using Deep Network Sentence Embeddings' has won the Best Paper Award at the HEALTHINF 2021 (BIOSTEC) conference. The paper proposes two new workflows that use deep learning sentence encoders to tackle the cross-lingual semantic annotation problem.
It incorporates state-of-the-art methods for annotating non-English medical forms using large ontology sources. The results show an impressive improvement in annotation quality compared to methods using conventional string matching. The awarded paper is authored by Dr. Ying-Chi Lin and Prof. Erhard Rahm from the database group and ScaDS.AI Leipzig, together with Phillip Hoffmann, a supervised Bachelor student. Dr. Ying-Chi Lin's presentation of the paper at the conference can be seen here. The BIOSTEC joint conference received a total of 317 paper submissions from 54 countries on all continents, of which 22% were accepted as full papers. It is therefore a great honor for our researchers to have won the Best Paper Award at HEALTHINF.
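The core idea of embedding-based annotation can be sketched as follows. Note that the encoder below is a toy bag-of-words stand-in, NOT the deep sentence encoders used in the paper, and the concept list is invented: a form question is mapped to the ontology concept whose embedding is most similar.

```python
import math
from collections import Counter

def encode(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words vector (stand-in for a sentence encoder)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

concepts = ["body weight", "blood pressure", "heart rate"]  # invented ontology entries

def annotate(question: str) -> str:
    """Pick the concept with the highest embedding similarity to the question."""
    q = encode(question)
    return max(concepts, key=lambda c: cosine(q, encode(c)))

print(annotate("what is the weight of the patient"))  # body weight
```

With a real multilingual sentence encoder in place of `encode`, the same nearest-neighbor scheme works across languages, which is what string matching cannot do.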
Even though there is a lot we need to think about, including hygiene precautions and the upcoming holidays, we wanted to wish you all a Merry Christmas and give you something to smile about. Maybe you (or your children) have wondered whether it's going to snow on Christmas in Leipzig. For all of you who have asked yourselves this question, a fellow researcher of ScaDS.AI Leipzig built a forecasting model based on open data from the German Weather Service. Perhaps the model can tell us whether it will really snow on Christmas. But how did he do it? What do gingerbread and weight loss have in common? And what does that question have to do with this snow height prediction model? To explain all this, let's first start with a small tutorial on how to build such a prediction model yourself.
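To give a first taste of what such a tutorial covers, here is a tiny forecasting sketch on entirely made-up toy data (not the actual model or the real DWD data): fit a linear trend to past snow heights with ordinary least squares and extrapolate one year ahead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    return slope, mean_y - slope * mean_x

years = [2016, 2017, 2018, 2019, 2020]
snow_cm = [4.0, 3.0, 3.5, 2.0, 1.5]  # invented snow heights on Dec 24

slope, intercept = fit_line(years, snow_cm)
forecast = slope * 2021 + intercept
print(round(forecast, 2))  # 1.0 cm predicted for 2021
```

A real model would of course use many more features than the year alone, but the fit-then-extrapolate pattern is the same.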
Under the heading "How will we live in the future?" DRESDEN-concept presents current cooperative research projects and innovations in the research fields of digitalization, living, climate & water, mobility, material and cultural heritage. Climate change, demographic change, pandemics and megacities are just some of the major challenges facing our society. Scientists around the world are working to develop innovative solutions. The numerous research institutions in Dresden make an important contribution to this with their excellent research.
👉 ScaDS.AI Dresden/Leipzig participates in the exhibition by presenting our fields of research
The joint exhibition is committed to making science visible and tangible for the public and to transferring results from science into society. The alliance shows the public the importance and strength of cooperative research.
Under the umbrella of DRESDEN-concept, an alliance of 32 Dresden research and cultural institutions, scientists work on common topics across institute and subject boundaries. We are happy to be part of this great alliance! You can visit this exhibition in front of the Kulturpalast Dresden. 👀
ScaDS.AI Participation at Student Panel of DI2KG Workshop
Since 2015, the data science center ScaDS.AI (Center for scalable Data Services and Artificial Intelligence) and the preceding Big Data center ScaDS Dresden/Leipzig have run a yearly international summer school. Its 6th edition was planned to take place for a full week in July 2020 at the University of Leipzig. Due to the Covid-19 pandemic, it was replaced by a virtual and more compact 2-day event that took place on July 7-8. This opened the summer school on current AI (artificial intelligence) and Big Data topics not only to a broader and more international crowd of participants but also to internationally renowned speakers. With more than 250 registrations from North and South America (USA, Ecuador, Brazil), Europe (Germany, Switzerland, Italy, Spain, Norway, France, UK, Romania, Ukraine), Asia (Russia, India, Thailand, Turkey, Iran), Africa (Morocco) and Australia, the ScaDS.AI Summer School 2020 achieved a great international outreach and a higher participation than the 70-100 participants of previous years.
From August 17th to 23rd, 2019, the two Germany-based Big Data competence centers ScaDS Dresden/Leipzig and BBDC held the fifth international summer school on Big Data and machine learning in Dresden. This time, the summer school bridged the gap between the research fields of Big Data and machine learning, with contributions from many internationally well-known experts from various fields. The highly recognized program included keynotes from IBM, NVIDIA and Intel, speakers from academia of both competence centers, BBDC and ScaDS Dresden/Leipzig, as well as invited speakers. The program spanned a wide range of topics around large-scale and data-intensive computing (Big Data) and exciting new trends in machine learning, such as uncertainty quantification, distributed machine learning and architectural optimization for deep learning. Almost sixty participants could not just take part and connect with the experts, but could also present a poster about their own research in a poster session and throughout the whole week, triggering discussions between participants. As a social activity, an archery tournament brought fun and a contrast into the program and sparked some friendly competition among the participants. Stay in touch with us about future activities, e.g. the Big Data and AI in Business Workshop on September 19-20 in Leipzig!
Last year, BBDC Berlin and the Big Data competence center ScaDS Dresden/Leipzig hosted the 4th international Summer School for Big Data and Machine Learning with Hackathon (https://www.scads.de/en/summerschool-2018). From June 30 to July 6, 2018, the University of Leipzig offered a wide-ranging program that gave the more than 80 participants from industry and research insights into new findings and challenges in dealing with very large amounts of data and machine learning, and enabled a lively exchange.
As in 2017, there were again exciting and timely talks and discussions on the individual topics raised by the overarching themes of Big Data and machine learning. We would like to take this opportunity to thank all speakers and participants once again for contributing to a successful event.
Speakers from well-known companies (e.g. Microsoft, neo4j, Zalando) as well as speakers from various universities (University of Munich, Politecnico di Milano, FZ Jülich and many more) reported on problems, current research and solutions. At the same time, a colorful accompanying program with dragon boat trips and city tours invited the participants to explore Leipzig and fostered mutual exchange.
This year, the Leipzig site of the Big Data competence center ScaDS Dresden/Leipzig again hosted a workshop on the topic of "Big Data in Business" (http://scads.de/bidib2017). On June 15 and 16, 2017, a wide-ranging program was offered in the Felix Klein lecture hall of the University of Leipzig, giving the more than 50 participants from industry and research insights into new findings and challenges in dealing with very large amounts of data and enabling a lively exchange. As in 2015, there were again exciting and timely talks and discussions on the individual topics raised by the overarching theme of Big Data. We would like to take this opportunity to warmly thank all speakers and participants for contributing to a successful event.
Speakers from well-known companies (including BMW Group, Immowelt AG, Huawei Technology) as well as local startups reported on practical problems and their solutions. At the same time, the proven accompanying program from 2015 was continued, with scientists of the University of Leipzig presenting research projects and prototypes (e.g. Gradoop and Exploids) of the Big Data competence center.
Nowadays, data analysis is one of the crucial parts of science and research as well as of business. The data analysis process includes different steps and areas: mainly data collection, data pre-processing (checking, cleaning etc.), the data analysis itself, and the visualization/interpretation of the results. Every single step can be realized with a large variety of tools. Developing an efficient and powerful analysis process, especially in connection with big data, can be a technical challenge. It is therefore an advantage to have an infrastructure that allows testing, modifying and evaluating every single part of the analysis as well as the whole process.
The cloud infrastructure described in this article provides a cost-efficient and flexible platform for developing and evaluating complex data analysis processes. In the following, the cloud infrastructure itself is presented first. In the second part, we demonstrate an application of the infrastructure to realize a data analysis task.
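The steps named above (collection, pre-processing, analysis, visualization/interpretation) can be sketched as a minimal pipeline on toy data. Each stage is a plain function, so each part can be tested, modified and swapped independently, which is exactly the property the infrastructure is meant to support:

```python
def collect():
    """Data collection: here just an inline toy sample with one bad value."""
    return [3.1, 2.9, None, 3.4, 3.0]

def preprocess(raw):
    """Pre-processing: drop missing values."""
    return [x for x in raw if x is not None]

def analyze(clean):
    """Analysis: a single summary statistic."""
    return sum(clean) / len(clean)

def report(result):
    """Interpretation: a human-readable summary."""
    return f"mean measurement: {result:.2f}"

print(report(analyze(preprocess(collect()))))  # mean measurement: 3.10
```

In a real big data setting each of these functions would be replaced by a distributed job, but the stage boundaries stay the same.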
Probably every research group in the world faces the problem of collecting all publications of all group members. Usually, the list of publications is displayed in a structured way on the group's homepage to provide an overview of the research topics and the impact of the group.
The larger and older the group, the more publications are in this list and the more painful the manual collection of the publication list becomes. Additional features such as searching for authors, keywords and titles, linking additional author data to the publication (such as the membership period in the group), and handling name changes turn a simple publication list into an interesting use case for big data.
An effective solution to this problem is given in this tutorial. The tutorial is written for Python beginners and gives an introduction to many techniques:
Advanced features of Python 3.6
Interacting with SQLite in Python
Interacting with a REST API in Python
Interacting with the ORCID public API
Reading and writing BibTeX files in Python
Creating HTML from a BibTeX file in Python
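As a taste of the SQLite part of the list above, here is a hedged sketch (the table and column names are invented for illustration, not taken from the tutorial): store publications in a small database and query them by author, using only the standard library.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE publications (author TEXT, title TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO publications VALUES (?, ?, ?)",
    [
        ("Doe, J.", "Scalable Entity Resolution", 2019),
        ("Doe, J.", "Graph Analytics at Scale", 2020),
        ("Roe, R.", "Profiling Data Lakes", 2020),
    ],
)

# Search by author, ordered by year -- one of the features a
# publication-list page needs.
titles = [
    row[0]
    for row in conn.execute(
        "SELECT title FROM publications WHERE author = ? ORDER BY year",
        ("Doe, J.",),
    )
]
print(titles)  # ['Scalable Entity Resolution', 'Graph Analytics at Scale']
```

The same table can then feed the BibTeX export and the HTML generation covered in the later parts of the tutorial.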
Understanding basic Python syntax is required for this tutorial, but all advanced features are explained.
The tutorial is subdivided into 8 parts. Each part introduces a technique and demonstrates its usage for the use case. You can thus jump to the part of interest or follow the tutorial step by step. A full understanding of the use case, however, can only be achieved by reading the complete tutorial.
Historical topographic maps are a valuable and often the only data source for tracking long-term land-use changes. Their availability and vast spatio-temporal coverage make these maps an important source of information for climate and earth system modeling (ESM). However, the automated retrieval of complex and compound geographical objects from these historical maps is a challenging task. To facilitate the laborious information extraction from these maps, we present a two-stage machine-learning-based approach for segmenting urban land-use from gray-scale scans using only a small set of training samples. We employ a Conditional Random Field (CRF) which obtains its unary potentials from a Random Forest (RF). The method is tested using two inference algorithms. To evaluate the performance and the scalability of the approach over large numbers of data sets, we conduct parallel computing experiments within a High Performance Computing (HPC) environment at the Center for Information Services and High Performance Computing at TU Dresden. We evaluated the methodology on the first Central European set of trigonometry-based maps (1:25000) from 1850-1940 with large spatial and temporal coverage, which makes them particularly valuable for land-use change research and historical geo-information systems (HGIS). Experimental results indicate the suitability of both the methodological approach and its parallel implementation.
This work has been presented at the GEOBIA 2016 Conference. A conference paper has been published and is available online.
Demonstration service for binary image segmentation
Binary image segmentation is a technique to identify different segments in a digital image. The main goal of segmentation is to enhance the information content of the image and to provide a standardized representation of the reconstructed segments. Image segmentation can be used in various ways; its applications range from low-level vision tasks like 3D reconstruction and motion estimation to high-level problems like image understanding and scene parsing. This demonstrator shows the applicability of the method for different raw image types to illustrate the potentially large range of application areas.
The main focus of this work was not to reach good performance in terms of, e.g., pixel accuracy. Instead, we focus on usability, computational efficiency and generalization. Usability: unlike many existing systems, the training data here may be incomplete, as e.g. in GrabCut binary segmentation, where a user provides only scribbles or bounding boxes marking some pixels as background or foreground. In doing so, most image pixels usually remain unmarked. In fact, we only have ground truth information for the pictures of one use case; this ground truth is, however, not used for learning but only to check the results afterwards. To summarize, we employ semi-supervised learning using quite incomplete user information. Computational efficiency: most of the system is implemented for processing on a GPU. For the use cases below, the full pipeline (computing features, learning, inference etc.) takes a couple of minutes, depending on the image size. Generalization: for each use case, we learn only relatively few unknown parameters. In particular, we do not learn image features; we use a pre-trained Convolutional Neural Network for this. The default values are quite stable and sufficient for most cases.
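The scribble-based idea can be illustrated with an intentionally tiny example (NOT the demonstrator's GPU pipeline, just the bare principle): the user marks a few pixels as foreground or background, and every other pixel is assigned to the class whose scribbled mean intensity is closer.

```python
image = [  # a 3x4 gray-scale image with a bright right half
    [0.1, 0.2, 0.8, 0.9],
    [0.1, 0.3, 0.7, 0.9],
    [0.2, 0.2, 0.8, 0.8],
]

# user scribbles: (row, col) positions marked as foreground / background
fg_scribbles = [(0, 3), (2, 2)]
bg_scribbles = [(0, 0), (2, 1)]

def segment(image, fg, bg):
    """Assign each pixel to the class with the closer scribbled mean intensity."""
    fg_mean = sum(image[r][c] for r, c in fg) / len(fg)
    bg_mean = sum(image[r][c] for r, c in bg) / len(bg)
    return [
        [1 if abs(p - fg_mean) < abs(p - bg_mean) else 0 for p in row]
        for row in image
    ]

mask = segment(image, fg_scribbles, bg_scribbles)
for row in mask:
    print(row)  # 1 = foreground, 0 = background
```

The real system replaces raw intensities with CNN features and adds learned smoothness, but the semi-supervised principle is the same: a handful of marked pixels steers the labeling of all the unmarked ones.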
Random numbers are an essential element of cryptography and therefore of security in general. Like most security aspects, random numbers sound simple but prove to be hard to get right. This text discusses problems regarding random numbers in computers, especially in virtual machines.
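As a general best-practice note (not specific to the virtual-machine issues discussed here): application code that needs security-relevant randomness should draw from the operating system's cryptographically secure generator rather than a seeded PRNG like `random.random()`. In Python, the standard-library `secrets` module does exactly that:

```python
import secrets

# 16 random bytes from the OS CSPRNG, rendered as a 32-character hex string.
token = secrets.token_hex(16)

# A uniform random integer in [0, 10000), suitable e.g. for a one-time PIN.
pin = secrets.randbelow(10000)

print(len(token), 0 <= pin < 10000)  # 32 True
```

Of course, this only shifts the problem to the OS: if the underlying entropy source is weak, as it can be in freshly cloned virtual machines, even `secrets` inherits that weakness, which is exactly the problem this text examines.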
Reference architectures are a key research topic in business information systems. They aim to simplify software development by reusing architectural and software components. But reusability also involves a trade-off: reference architectures can be kept on a high level of abstraction so that they can be reused across many domains and applications, or they can be focused on one subject, which makes them easier to apply.
In this blog post we discuss whether big data reference architectures are really needed. Our hypothesis is that current big data reference architectures are not sufficient to provide real benefit for implementing big data projects.
Big Data is usually used as a synonym for data science on huge datasets and for dealing with all kinds of obstacles that come with that. Having access to a large amount of data offers a high potential to find more accurate answers to many research questions. Moreover, the ability to handle Big Data volumes may facilitate solutions to previously unsolved problems. However, many research groups do not have the necessary facilities to run large analysis jobs on the computing resources they have access to at their home institution. Furthermore, the installation, administration and maintenance of a complex and agile software stack for data analytics is often a challenging task for domain scientists. One of the key goals of the Big Data competence center ScaDS Dresden/Leipzig is therefore to provide multi-purpose data analytics frameworks for research communities, which can be used directly on the computing resources of the Center for Information Services and High Performance Computing (ZIH). Using the high-performance computing (HPC) infrastructure of ZIH, ScaDS Dresden/Leipzig members and collaborating researchers can run their data analytics pipelines massively in parallel on modern hardware. The following general-purpose data analytics frameworks are currently available:
It is often necessary to build a proof of concept to show customers, project promoters or colleagues the ease and feasibility of Big Data. With OSTMap (Open Source Tweet Map), mgm partnered with ScaDS to prove that a lot can be accomplished in a short time frame with the right choice of technologies.