COMPETENCE CENTER
FOR SCALABLE DATA SERVICES
AND SOLUTIONS

Big Data in Business 2017

This year, the Leipzig site of the Big Data competence center ScaDS Dresden/Leipzig once again hosted a workshop on "Big Data in Business" (http://scads.de/bidib2017). On 15 and 16 June 2017, a wide-ranging program in the Felix-Klein-Hörsaal of the University of Leipzig gave more than 50 participants from industry and research insights into new findings and challenges in handling very large amounts of data, and enabled a lively exchange.
As in 2015, there were again exciting and topical talks and discussions on the individual topics that the overarching subject of Big Data brings with it. We would like to take this opportunity to once again thank all speakers and participants for contributing to a successful event.

Speakers from well-known companies (including BMW Group, Immowelt AG and Huawei Technology) as well as local startups reported on practical problems and their solutions. At the same time, the proven accompanying program from 2015 was continued, with scientists from the University of Leipzig presenting research projects and prototypes of the Big Data competence center (e.g. Gradoop and Exploids).

Read more ...

Virtual Cloud Infrastructure for Data Analysis

Introduction

Data analysis is now a crucial part of science and research as well as of business. The analysis process comprises several steps: data collection, data pre-processing (checking, cleaning, etc.), the analysis itself, and the visualization and interpretation of the results. Each step can be realized with a wide variety of tools. Developing an efficient and powerful analysis process, especially for big data, can be a technical challenge. It is therefore advantageous to have an infrastructure that allows testing, modifying and evaluating every single part of the analysis as well as the whole process.

The cloud infrastructure described in this article provides a cost-efficient and flexible platform for developing and evaluating complex data analysis processes. The article first presents an example of the cloud infrastructure itself. The second part demonstrates how the infrastructure is applied to a data analysis task.

Read more ...

Static Publications Site-Tutorial (ORC-Schlange)

Probably every research group in the world faces the problem of collecting all publications of all group members. Usually, the list of publications is displayed in a structured way on the group's homepage to provide an overview of the research topics and the impact of the group.

The larger and older the group, the more publications are in this list and the more painful the manual collection becomes. Additional features such as searching for authors, keywords and titles, linking additional author data to the publication (such as the membership period in the group), and handling name changes turn a simple publication list into an interesting use case for big data.

An effective solution to this problem is given in this tutorial. The tutorial is written for Python beginners and gives an introduction to many techniques:

  • Advanced features of Python 3.6
  • Interacting with SQLite in Python
  • Interacting with a REST API in Python
  • Interacting with the ORCID public API
  • Reading and writing BibTeX files in Python
  • Creating HTML from a BibTeX file in Python
  • Filtering HTML content with JavaScript

Understanding basic Python syntax is required for this tutorial, but all advanced features are explained.

The tutorial is subdivided into 8 parts. Each part introduces a technique and demonstrates its usage for the use case. You can thus jump to the part of interest or follow the tutorial step by step. A full understanding of the use case can only be achieved by reading the complete tutorial.
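
As a rough illustration of the SQLite part, the sketch below stores authors with a membership period and their publications in an in-memory database, then lists only the publications that fall into the period in which an author belonged to the group. The table and column names (`authors`, `publications`, `start_year`, `end_year`), the placeholder ORCID iDs and the sample rows are assumptions made for this example and are not taken from the tutorial.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical tables and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE authors (
            orcid TEXT PRIMARY KEY,
            name TEXT NOT NULL,
            start_year INTEGER NOT NULL,
            end_year INTEGER            -- NULL means still a member
        );
        CREATE TABLE publications (
            id INTEGER PRIMARY KEY,
            orcid TEXT NOT NULL REFERENCES authors(orcid),
            title TEXT NOT NULL,
            year INTEGER NOT NULL
        );
        """
    )
    # Made-up authors; the ORCID iDs are placeholders, not real records.
    conn.executemany(
        "INSERT INTO authors VALUES (?, ?, ?, ?)",
        [
            ("0000-0000-0000-0001", "A. Example", 2014, 2016),
            ("0000-0000-0000-0002", "B. Sample", 2015, None),
        ],
    )
    conn.executemany(
        "INSERT INTO publications (orcid, title, year) VALUES (?, ?, ?)",
        [
            ("0000-0000-0000-0001", "Paper before joining", 2012),
            ("0000-0000-0000-0001", "Paper as member", 2015),
            ("0000-0000-0000-0002", "Recent paper", 2017),
        ],
    )
    conn.commit()
    return conn


def group_publications(conn):
    """Return (year, title, name) for papers published during membership."""
    rows = conn.execute(
        """
        SELECT p.year, p.title, a.name
        FROM publications AS p
        JOIN authors AS a ON a.orcid = p.orcid
        WHERE p.year >= a.start_year
          AND (a.end_year IS NULL OR p.year <= a.end_year)
        ORDER BY p.year DESC
        """
    )
    return rows.fetchall()


if __name__ == "__main__":
    for year, title, name in group_publications(build_demo_db()):
        print(year, title, name)
```

Keeping the membership period in the author table means the filter lives in a single SQL query; the paper from 2012 is excluded because it predates the author's membership.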

Read more ...

Large-scale map analysis for land-use change monitoring using machine learning methods within an HPC environment

Historical topographic maps are a valuable and often the only data source for tracking long-term land-use changes. Their availability and vast spatio-temporal coverage make these maps an important source of information for climate and earth system modeling (ESM). However, the automated retrieval of complex and compound geographical objects from these historical maps is a challenging task. To facilitate the laborious information extraction from these maps, we present a two-stage machine learning-based approach for segmenting urban land use from gray-scale scans using only a small set of training samples. We employ a Conditional Random Field (CRF) that obtains its unary potentials from a Random Forest (RF). The method is tested using two inference algorithms. To evaluate the performance and scalability of the approach on large numbers of data sets, we conduct parallel computing experiments within a High Performance Computing (HPC) environment at the Center for Information Services and High Performance Computing at TU Dresden. We evaluate the methodology on the first Central European set of trigonometry-based maps (1:25000) from 1850 to 1940, whose large spatial and temporal coverage makes them particularly valuable for land-use change research and historical geo-information systems (HGIS). Experimental results indicate the suitability of both the methodological approach and its parallel implementation.

This work has been presented at the GEOBIA 2016 Conference. A conference paper has been published and is available online.

Demonstration service for binary image segmentation

Binary image segmentation is a technique to identify various segments in a digital image. The main goal of segmentation is to enhance the information content of the image and to provide a standardized representation of the reconstructed segments. Image segmentation can be used in various ways; its applications range from low-level vision tasks such as 3D reconstruction and motion estimation to high-level problems such as image understanding and scene parsing. This demonstrator shows the applicability of the method to different raw image types to illustrate the potentially large range of application areas.

The main focus of this work is not on maximizing performance measures such as pixel accuracy. Instead, we focus on usability, computational efficiency and generalization.
Usability: unlike in many existing systems, the training data may be incomplete, as in GrabCut binary segmentation, where a user provides only scribbles or bounding boxes that mark some pixels as background or foreground. Most image pixels therefore usually remain unmarked. In fact, we have ground truth information only for the pictures of one use case. This ground truth is not used for learning, however, but only to check the results afterwards. In summary, we employ semi-supervised learning based on quite incomplete user input.
Computational efficiency: most of the system is implemented to run on a GPU. For the use cases below, the full pipeline (computing features, learning, inference, etc.) takes a couple of minutes, depending on the image size.
Generalization: for each use case, we learn only relatively few unknown parameters. In particular, we do not learn image features; instead, we use a pre-trained Convolutional Neural Network to compute them. The default values are quite stable and sufficient for most cases.

Further information can be found here.


Random numbers are hard - even harder in virtual machines

Random numbers are an essential element of cryptography and therefore of security in general. Like most security aspects, random numbers sound simple but prove to be hard to get right. This text discusses problems regarding random numbers in computers, especially in virtual machines.
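
To make the distinction concrete in one language: Python's standard library separates the general-purpose generator in `random` from the `secrets` module, which is intended for security-sensitive values and relies on the randomness source of the operating system. The short sketch below is an illustration added here, not code from the article, and the function name `new_session_token` is hypothetical.

```python
import secrets


def new_session_token(num_bytes=32):
    """Return a hex token built from OS-provided random bytes."""
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    token = new_session_token()
    print(len(token))  # 64 hex characters for 32 bytes
```

Note that such a call is only as good as the randomness the operating system can provide, which is where the virtual machine setting comes in.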

Read more ...

Big Data Reference architectures: Are they really needed?

Reference architectures are a key research topic in business information systems. They try to simplify software development by reusing architectural and software components. Reusability, however, involves a trade-off: a reference architecture can be kept at a higher level of abstraction so that it can be reused in many domains and applications, or it can concentrate on one subject and thus be easier to reuse.

In this blog post we discuss whether big data reference architectures are really needed. Our hypothesis is that current big data reference architectures are not sufficient to provide real benefit for implementing big data projects.

Read more ...

Big Data Frameworks on highly efficient computing infrastructures

Introduction

Big Data is usually used as a synonym for Data Science on huge datasets and for dealing with all the obstacles that come with it. Having access to a large amount of data offers a high potential to find more accurate results for many research questions. Moreover, the ability to handle Big Data volumes may facilitate solutions to previously unsolved problems. However, many research groups do not have the facilities needed to run large analysis jobs on the computing resources available at their home institution. Furthermore, the installation, administration and maintenance of a complex and agile software stack for data analytics is often a challenging task for domain scientists. One of the key tasks of the Big Data competence center ScaDS Dresden/Leipzig is therefore to provide multi-purpose data analytics frameworks for research communities, which can be used directly on the computing resources of the Center for Information Services and High Performance Computing (ZIH). Using the high performance computing (HPC) infrastructure of ZIH, ScaDS Dresden/Leipzig members and collaborating researchers can run their data analytics pipelines massively in parallel on modern hardware. The following general purpose data analytics frameworks are currently available:

Read more ...

OSTMap - Open Source Tweet Map

Introduction

It is often necessary to build a proof of concept to show the ease and feasibility of Big Data to customers, project sponsors or colleagues. With OSTMap (Open Source Tweet Map), mgm partners with ScaDS to prove that it is possible to accomplish a lot in a short time frame with the right choice of technologies.

Read more ...

Big Data Cluster in “Shared Nothing” Architecture in Leipzig

The Galaxy Cluster

The state of Saxony funded a notable shared nothing cluster located at the University of Leipzig and the Technical University of Dresden. Here we want to give a short overview of this new “Galaxy” cluster, which is a very nice asset for ScaDS.

Shared nothing is probably the most referenced architecture when talking about big data. The idea behind this cluster architecture is to use large amounts of commodity hardware to store and analyze large amounts of data in a highly distributed, scalable and cost-effective way. It is optimized for massively parallel, data-oriented computations using e.g. Apache Hadoop, Apache Spark or Apache Flink.

Cluster Facts Overview:

Read more ...

Introduction to Privacy Preserving Record Linkage

Many companies and organizations collect a huge amount of data about people simply by offering their services in the form of online applications. Another, “more official” way to gather such data is to ask people directly (by means of printed forms), as is the case in hospitals and public administration. In both cases each data owner holds information that covers only one or a few aspects of each person. However, analyzing such data, mining interesting patterns or improving decision-making processes generally requires clean and aggregated data, which are held by several organizations. Record linkage operates as a preprocessing step for these tasks, with the main goal of finding records stored in different databases that refer to the same real-world object or person. This process finds application in many areas such as healthcare, national security and business. In healthcare, for example, linking records from two or more hospitals allows the treatment of a patient's disease to be adapted.

The main impediment when linking person-related data across many organizations is the privacy aspect. In several countries, processing such data is subject to strict privacy policies, e.g. on how and where to store the data and whether or not such data can be exchanged with a third party. Privacy Preserving Record Linkage (PPRL) provides techniques and methods to efficiently link similar records in different databases without compromising privacy and confidentiality.
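
One encoding often used in the PPRL literature, which this introduction does not itself describe, hashes the character bigrams of a field into a Bloom filter with keyed hash functions and compares the resulting bit sets with the Dice coefficient. The sketch below is a simplified illustration of that idea only: the filter length, the number of hash functions, the shared key and all function names are assumptions, and a real deployment would need further hardening.

```python
import hashlib
import hmac


def bigrams(value):
    """Split a padded, lower-cased string into character bigrams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, secret, size=1024, num_hashes=4):
    """Return the set of bit positions set for all bigrams of value."""
    bits = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            # Keyed hash; the counter k simulates independent hash functions.
            message = f"{k}:{gram}".encode()
            digest = hmac.new(secret, message, hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(bits_a, bits_b):
    """Dice coefficient of two bit-position sets (1.0 means identical)."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


if __name__ == "__main__":
    key = b"shared-secret"  # hypothetical key agreed on by the data owners
    a = bloom_encode("Katherine", key)
    b = bloom_encode("Catherine", key)
    c = bloom_encode("Wolfgang", key)
    print(dice(a, b), dice(a, c))
```

Because similar names share most of their bigrams, their encodings share most of their set bits, so the two spellings of the first name score much higher than the unrelated pair while the plain names are never exchanged.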

Read more ...

Webcrawling of building-relevant information

Urban and regional planning as well as the spatial sciences require detailed information about the functional, morphological and socio-economic structure of the built environment. Building-relevant information such as building height, number of storeys, usage, age and condition is of great interest in urban modelling, as it serves as a basis for training, calibrating and validating various models, e.g. for mapping populations, estimating energy demands or assessing flood risks.
Acquiring data at the level of individual buildings is time-consuming and costly, as this is usually done through field observations and local knowledge. Another way to collect this information is the automatic analysis of user-generated data from sources such as OpenStreetMap, Mapillary, WikiMapia and others. In the context of buildings, WikiMapia is a promising resource that contains usage information for a large number of buildings. Users also attach geocoded Street View data, which offers further opportunities for image interpretation (computer vision or human-based computation approaches).

Read more ...

Versioning system for modeling environmental data based on an automatic meta-data generation strategy

The Helmholtz-Centre for Environmental Research (UFZ) is one of the world's leading research centres in the field of Earth system science. The Department of Environmental Informatics of the UFZ develops software for the simulation of environmental phenomena via coupled thermal, hydrological, mechanical and chemical processes using innovative numerical methods. Examples include the prediction of groundwater contamination, the development of water management schemes and the simulation of innovative means of energy storage. The modeling process is a complete workflow, ranging from data acquisition and integration through process simulation to the analysis and visualization of calculated results.

Unfortunately, this modeling process is not transparent and traceable, and it is often poorly documented. A typical model is developed over many weeks or months, and usually a large number of revisions are necessary to update and refine the model so that the simulation is as exact as possible. The first setup of a model is often used to get an overview of existing data and to detect potential problems in both data and numerical requirements. Further revisions try to solve these problems by adding data, refining or adjusting finite element meshes, or updating and adjusting processes and their parametrization. Input and parameter files range from a few small files to hundreds of files containing detailed spatial, temporal or numerical information. Likewise, changes from one modeling step to the next may be small (e.g. one parameter value in a single input file) or major (e.g. the geometrical input changes and requires a new discretization of the FEM domain as well as a new parameterization).

Read more ...

Dynamics of Open Quantum Systems

The description of the dynamics of open quantum systems is a subject of ongoing research in theoretical quantum physics (solid state physics, quantum optics, quantum chemistry). Real (quantum) systems are never perfectly isolated from environmental fluctuations or forces. For the case of weak environmental influence, various approaches have been developed. By contrast, this project focuses on the description of open quantum systems that are significantly influenced by structured surroundings. Examples of experimental realizations can be found in energy transfer processes in molecular aggregates ([1],[2]) or quantum bits in solids ([3], [4]). Here, an exact and complete quantum mechanical description would be desirable. However, due to the exponential growth of the Hilbert space dimension of many-body quantum systems, the limits of computational resources are soon reached. We attack this challenge by means of a stochastic Schrödinger equation.

Read more ...

Multi-scale visualization – The key to an enhanced understanding of materials

In the computer-assisted design of fibre-reinforced composites for lightweight structures, their hierarchical structure must be taken into account – from fibres, matrices and rovings to reinforcing textiles, individual layers and multi-layered composites. This hierarchical approach also needs to be applied as components are joined to form structures and multiple structures interact within the overall system. The availability of a suitable simulation model for each scale is therefore a prerequisite for targeted, efficient system development. Up to now, it has nevertheless not been possible to implement a user-friendly cross-model concept which enables the multi-scale visualisation of individual sets of simulation results.
Jointly developed by the Institute of Lightweight Engineering and Polymer Technology and the Chair of Computer Graphics and Visualisation (both part of TU Dresden) within the framework of ScaDS Dresden/Leipzig – Competence Center for Scalable Data Services and Solutions, this solution is the first to facilitate the consistent visualisation of simulation results across all scales (Figure 1). The browser-based software demonstrates the potential offered by multi-scale visualisation in terms of gaining an enhanced understanding of material behaviour. The example presented here uses simulation data generated during the development of an adaptive leaf spring within the framework of special research project SFB 639.
The video shows the range of functions and the advantages of the browser-based software. The software will be presented to a wide range of potential users at the trade fair Composite Europe – 11. Europäische Fachmesse & Forum für Verbundwerkstoffe, Technologien und Anwendungen, 29 November to 1 December 2016, Messe Düsseldorf. There, prospective customers can use the software on their own and form their own impression of its possibilities.

Read more ...

Graph Mining for Advanced Data Analytics

In order to answer complex analytical questions, data mining methods are often combined with other data processing steps, for example to prepare the search space or to process results. To enable the combination of data mining algorithms and other operators, productive data mining solutions offer extensive toolkits for the analysis of relational or multidimensional data. However, the situation differs with regard to the less established methods of graph mining. Although there are research prototypes, there is no solution that supports complex analytical programs composed from multiple graph operations. Gradoop, an Open Source system developed at the University of Leipzig and the ScaDS Dresden / Leipzig, aims at changing this. Gradoop is the first system that supports the combination of multiple graph algorithms and graph operators in a single script. Hence, Gradoop enables new applications, for example, graph-based analyses of business data. Built on state-of-the-art Big Data technology, Gradoop not only offers a unique range of functionality, but is also out-of-the-box horizontally scalable. It further provides an interface for plug-in algorithms and is thus open to application-specific extensions.

Read more ...

Halloween Tutorial: How to do Vertex-Centric Iteration (Pregel) with Gelly

At the 2nd International ScaDS Summer School on Big Data we offered a couple of workshops that aimed to provide an introduction to the three Big Data technologies MongoDB, Flink and Gelly. This post extends the Gelly tutorial to demonstrate a new feature of Gelly: the Vertex-Centric Iteration, or Pregel Iteration.

Find out which child is getting the largest amount of candies in our Halloween-Special of Trick-or-treat...
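
For readers new to the model, the sketch below simulates vertex-centric (Pregel-style) supersteps in plain Python; it does not use the Gelly API and is not the tutorial's code. Each vertex starts with its own value, sends it to its neighbours, and in every superstep adopts the largest value it received until no more messages are in flight. The function name `pregel_max` and the sample graph of children and candy counts are made up for illustration.

```python
def pregel_max(values, edges, max_supersteps=20):
    """Propagate the maximum vertex value along undirected edges.

    values: dict mapping vertex -> initial value
    edges: iterable of (u, v) pairs, treated as undirected
    """
    neighbours = {v: set() for v in values}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    state = dict(values)
    # Superstep 0: every vertex sends its value to all neighbours.
    inbox = {v: [] for v in values}
    for v in values:
        for n in neighbours[v]:
            inbox[n].append(state[v])

    for _ in range(max_supersteps):
        outbox = {v: [] for v in values}
        any_message = False
        for v, messages in inbox.items():
            # Compute function: adopt a larger value if one was received.
            if messages and max(messages) > state[v]:
                state[v] = max(messages)
                for n in neighbours[v]:
                    outbox[n].append(state[v])
                    any_message = True
        if not any_message:
            break  # all vertices have halted and no messages are in flight
        inbox = outbox
    return state


if __name__ == "__main__":
    candies = {"Anna": 3, "Ben": 7, "Cleo": 5, "Dan": 1}  # hypothetical data
    friends = [("Anna", "Ben"), ("Ben", "Cleo")]
    print(pregel_max(candies, friends))
```

After the run, every vertex holds the largest value within its connected component; a vertex without neighbours, such as `Dan` here, keeps its own value.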

Read more ...

Connecting Digital Humanities with the CLARIN Infrastructure

One of the questions I am often confronted with when presenting my work is what it has to do with Big Data when the biggest text collections I deal with fill only a couple of gigabytes of hard disk space. The reasoning behind this question is that Big Data must involve large amounts of data and that Big Data problems must deal with at least terabytes or petabytes of stored information. As understandable and valid as this argument is, there is a whole lot more to Big Data than the size of a data set. With this article I want to explain what that is and, hopefully, answer the question in a satisfactory manner.

Read more ...

Sierra Platinum

We present the latest result of our research:

Sierra Platinum is a fast and robust peak-caller for replicated ChIP-seq experiments with visual quality control and steering. It lets the user generate peaks while influencing which replicates are most suitable for creating them. The results show that the new method outperforms available tools and methodologies.

Read more ...

Automated settlement detection in topographic maps

Introduction

The analysis of settlement structures, building stock and building use is one of the core tasks of spatial and landscape planning. Relevant fields of application include urban development measures, the planning of transport routes and infrastructure, nature conservation planning and the creation of hazard maps for natural disasters. Topographic maps are the largest centralized data basis for such studies and are available area-wide at least across Europe. Both current and historical maps play a role. The latter are used to analyze historical developments and for the long-term monitoring of settlement structures and land use. A decisive step in the analysis of map material is the detection of settlements, that is, the partitioning of the studied region into settlement and non-settlement areas (automatic segmentation of topographic maps).

Read more ...

Memories of our ScaDS summer school

Do you remember the ScaDS summer school 2016 in Leipzig? Seven weeks ago we enjoyed many interesting talks on Big Data and hands-on sessions with new frameworks, but also had a lot of fun: sightseeing, traditional German food and, of course, our fantastic trip with canoes and dragon boats on the waters of Leipzig. To refresh your memories, Jörn created a great video of our trip.

Read more ...

Using Apache Flink CEP for real-time logistics monitoring

With the increasing spread of smart devices and sensor systems, it is now possible to obtain data and context information about any element of the real world. The Internet of Things (IoT) is synonymous with this trend, which is also gaining social significance because it affects all areas of everyday life. This trend offers massive possibilities to improve our lives as well as our companies. For example, real-time insights into business processes have recently become more important. With the right information and only short delays after certain incidents, decision makers at the operational level can react quickly and adapt the processes. Thus, companies with in-depth knowledge of their processes have options to optimize their business, offer new service levels to customers and increase earnings.
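
The kind of rule that complex event processing evaluates on such a stream can be sketched in a few lines of plain Python. The example below does not use the Flink CEP API; it only mimics a "pickup followed by delivery" pattern and reports shipments whose delivery took longer than allowed. The event kinds `picked_up` and `delivered`, the tuple layout and the function name are assumptions made for this illustration.

```python
def late_deliveries(events, max_delay):
    """Yield (shipment_id, delay) for deliveries slower than max_delay.

    events: iterable of (timestamp, shipment_id, kind) in timestamp order,
    where kind is "picked_up" or "delivered" (hypothetical event types).
    """
    pending = {}  # shipment_id -> pickup timestamp (partial matches)
    for timestamp, shipment_id, kind in events:
        if kind == "picked_up":
            pending[shipment_id] = timestamp
        elif kind == "delivered" and shipment_id in pending:
            delay = timestamp - pending.pop(shipment_id)
            if delay > max_delay:
                yield shipment_id, delay


if __name__ == "__main__":
    stream = [
        (0, "s1", "picked_up"),
        (5, "s2", "picked_up"),
        (20, "s1", "delivered"),
        (50, "s2", "delivered"),
    ]
    for shipment, delay in late_deliveries(stream, max_delay=30):
        print(shipment, delay)
```

The generator consumes one event at a time and keeps only the open pickups in memory, so it can run on an unbounded stream and report an incident as soon as the matching event arrives.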

Read more ...

Successful ScaDS Big Data School in Leipzig - a Report

From 11 to 15 July 2016, the Big Data Center of Excellence (ScaDS Dresden/Leipzig) hosted its first summer school for Big Data in Leipzig. The program attracted many students and young graduates, as well as other academic and industrial practitioners and researchers working in the field of Big Data. We were overwhelmed by the number of registrations and by the many speakers who were willing to support our summer school. While we initially planned for 50 attendees in total, we finally counted 120 people at our summer school, including speakers and short-term attendees throughout the week. Surprisingly, more than 50% of the attendees were international, coming from all continents.

Read more ...

ScaDS Big Data Industry Forum at BIS-Conference in Leipzig

On 8 July 2016, the ScaDS Big Data Industry Forum was held in Leipzig. In conjunction with the 19th International Conference on Business Information Systems, various projects related to Big Data were presented by young as well as renowned software companies.

Read more ...