
Competence Center for Scalable Data Services and Solutions

ScaDS Lecture Series SS2017: Practical Tasks for Students of Leipzig University

db1: Analytics of Development Project Data

Many development projects (commercial and open source) manage their tasks and reporting through JIRA. In this task we would like to analyze the large number of projects listed at https://issues.apache.org/jira/secure/BrowseProjects.jspa#all with a distributed graph-based approach. The task is to use the JIRA REST API to export projects, tasks, assignees, project members, and log entries, and to transform them into a graph representation that can be consumed by Gradoop (www.gradoop.org). In the second step we would like to analyze this graph with the help of simple Apache Flink and Gradoop scripts.
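
As a starting point, issues can be fetched page-wise from JIRA's REST search endpoint; a minimal sketch in Java (the JQL query, project key, and page size are placeholders to adjust):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class JiraExport {
        public static void main(String[] args) throws Exception {
            // Fetch one page of issues for an example project (here: FLINK).
            URL url = new URL("https://issues.apache.org/jira/rest/api/2/search"
                    + "?jql=project%3DFLINK&startAt=0&maxResults=50");
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setRequestProperty("Accept", "application/json");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(con.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Raw JSON; parsing issues, assignees, and the changelog
                    // into vertices and edges is the actual task.
                    System.out.println(line);
                }
            }
        }
    }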

db2: Analytics of Git Project Data

Many development projects (commercial and open source) manage their code with Git. In this task we would like to analyze the contents of Git repositories with a distributed graph-based approach. The task is to use the JGit API to export commits, users, tags, and the complete history from Git repositories and to transform them into a graph representation that can be consumed by Gradoop (www.gradoop.org). In the second step we would like to analyze this graph with the help of simple Apache Flink and Gradoop scripts.
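
JGit exposes the commit history directly; a minimal sketch that clones a repository and emits one author-to-commit record per commit as a CSV line for a later graph import (the repository URL, output directory, and line format are assumptions):

    import java.io.File;
    import org.eclipse.jgit.api.Git;
    import org.eclipse.jgit.revwalk.RevCommit;

    public class GitExport {
        public static void main(String[] args) throws Exception {
            // Clone into a temporary directory and walk the full history.
            File dir = new File("/tmp/repo-export");
            try (Git git = Git.cloneRepository()
                    .setURI("https://github.com/apache/flink.git")
                    .setDirectory(dir)
                    .call()) {
                for (RevCommit commit : git.log().call()) {
                    // One "authored" edge per commit: author -> commit SHA.
                    System.out.println(commit.getAuthorIdent().getName()
                            + ";" + commit.getName()
                            + ";" + commit.getShortMessage().replace(';', ' '));
                }
            }
        }
    }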

db3: Schema Graph Fusion for the Microsoft Academic Graph and DBLP Graph

Due to its diversity and its different sources, publication data is a valuable input for graph ETL pipelines. The aim of this project is to bring the Microsoft Academic Graph (https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/) into Gradoop and to summarize its information with the help of the grouping operator. Afterwards, the result is compared to the grouping result of DBLP, and both grouped graphs are fused into a consolidated version of the graph.
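
For the summarization step, Gradoop's grouping operator condenses the graph by property keys; a minimal sketch, assuming the Gradoop 0.3 line of the API (class and method names may differ in other versions, and the input paths and the "type" property key are assumptions):

    import java.util.Arrays;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.gradoop.flink.io.impl.json.JSONDataSource;
    import org.gradoop.flink.model.impl.LogicalGraph;
    import org.gradoop.flink.util.GradoopFlinkConfig;

    public class SummarizeMag {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            GradoopFlinkConfig config = GradoopFlinkConfig.createConfig(env);
            // Read a previously converted EPGM representation of the MAG.
            LogicalGraph mag = new JSONDataSource(
                    "hdfs:///mag/graphs.json", "hdfs:///mag/vertices.json",
                    "hdfs:///mag/edges.json", config).getLogicalGraph();
            // Group vertices and edges by a schema-level property, e.g. entity type.
            LogicalGraph schema = mag.groupBy(
                    Arrays.asList("type"), Arrays.asList("type"));
            schema.getVertices().print();
        }
    }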

bioinf1: Supergenome data analysis in Flink and Gradoop

In the supergenome project of the Chair of Bioinformatics, the goal is to create a common coordinate system out of a multiple genome alignment. This is done mostly in a graph data structure. The existing project is written in Java and uses a third-party graph library. The task is to rewrite parts of the program to run on a cluster: Flink is used for filtering and other steps that need no graph representation, while the graph operations are implemented in Gradoop. As a starting point, a basic implementation of some steps exists in Flink and Gelly. The students need to be able to write programs in Java; the syntax of Flink and Gradoop is the main learning objective of this work.
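
For the non-graph steps, plain Flink DataSet transformations suffice; a minimal sketch of such a filter stage (the input path, line format, and length threshold are assumptions):

    import org.apache.flink.api.common.functions.FilterFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class AlignmentFilter {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            // Each line is assumed to describe one alignment block.
            DataSet<String> blocks = env.readTextFile("hdfs:///supergenome/blocks.tsv");
            // Keep only blocks above a minimum length before any graph construction.
            DataSet<String> filtered = blocks.filter(new FilterFunction<String>() {
                @Override
                public boolean filter(String line) {
                    String[] fields = line.split("\t");
                    return fields.length > 2 && Integer.parseInt(fields[2]) >= 50;
                }
            });
            filtered.writeAsText("hdfs:///supergenome/blocks-filtered.tsv");
            env.execute("alignment block filter");
        }
    }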

bioinf2: Supergenome data saved in Neo4j

In the supergenome project of the Chair of Bioinformatics, the goal is to create a common coordinate system out of a multiple genome alignment. This is done mostly in a graph data structure. The existing project is written in Java and uses a third-party graph library. The task is to write a connector that saves the supergenome graph with all its metadata in Neo4j. For this, a schema for the database must be designed. In addition, a read path should be implemented that can load the complete graph but also offers the possibility to read only parts of it. The students need to be able to write programs in Java; the handling of Neo4j is the main learning objective of this work.
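
With Neo4j's embedded Java API, nodes and relationships of the supergenome graph can be written inside a transaction; a minimal sketch (the labels, property keys, relationship type, and store path are assumptions for the schema to be designed):

    import java.io.File;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Label;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.RelationshipType;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.graphdb.factory.GraphDatabaseFactory;

    public class SupergenomeStore {
        public static void main(String[] args) {
            GraphDatabaseService db = new GraphDatabaseFactory()
                    .newEmbeddedDatabase(new File("/tmp/supergenome.db"));
            try (Transaction tx = db.beginTx()) {
                // Two alignment blocks connected in supergenome order.
                Node a = db.createNode(Label.label("Block"));
                a.setProperty("start", 0);
                Node b = db.createNode(Label.label("Block"));
                b.setProperty("start", 1024);
                a.createRelationshipTo(b, RelationshipType.withName("NEXT"));
                tx.success();
            }
            db.shutdown();
        }
    }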

ti1: Quantity and Quality of Random Numbers from shared TRNGs

Most modern server environments provide a hardware random number generator (true random number generator, TRNG) for cryptographic purposes. Unfortunately, the TRNG often has to be shared between multiple virtual machines, especially in big data use cases. The task is to implement a small test environment and to collect data on the quantity and quality of random numbers when multiple virtual machines share a TRNG. Basic knowledge of Linux is essential; first experience with QEMU is advantageous.
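
Inside each guest, the throughput available from the hardware generator can be approximated by timing blocking reads from the device; a minimal sketch (the device path and sample size are assumptions, and reading /dev/hwrng usually requires root):

    import java.io.FileInputStream;

    public class TrngThroughput {
        public static void main(String[] args) throws Exception {
            byte[] buf = new byte[4096];
            int total = 0;
            long start = System.nanoTime();
            // Blocking reads: the achieved rate reflects how much randomness
            // the shared TRNG actually delivers to this virtual machine.
            try (FileInputStream rng = new FileInputStream("/dev/hwrng")) {
                while (total < 64 * 1024) {
                    int n = rng.read(buf);
                    if (n < 0) break;
                    total += n;
                }
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%d bytes in %.2f s (%.1f B/s)%n",
                    total, seconds, total / seconds);
        }
    }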

ti2: Measuring the Quantity of Random Numbers under Linux

It is often desirable to know the amount of random numbers provided by the Linux kernel to user space programs, e.g. to measure the effect of a hardware random number generator. Unfortunately, there is no easily accessible way to obtain these numbers. The task is to design and implement a way to measure the quantity of random numbers provided by the Linux kernel during a specified time. Advanced knowledge of Linux is essential; C programming skills may be advantageous.
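
One observable that is easy to reach from user space is the kernel's entropy pool estimate in /proc; a minimal sketch that polls it over time (note that this tracks the pool level, not consumption directly, so designing an actual consumption measurement remains the core of the task):

    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class EntropyPoll {
        public static void main(String[] args) throws Exception {
            // Sample the kernel's entropy estimate once per second for a minute.
            for (int i = 0; i < 60; i++) {
                String avail = new String(Files.readAllBytes(
                        Paths.get("/proc/sys/kernel/random/entropy_avail"))).trim();
                System.out.println(System.currentTimeMillis() + " " + avail);
                Thread.sleep(1000);
            }
        }
    }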

tm1: Preparing a Text Corpus for Canonical Text Services

A vast number of text corpora are freely available online, but it is often unclear how they can be automatically processed or analyzed. The task is to find interesting German and international text corpora and to prepare them in such a way that they can be processed using the Canonical Text Services protocol. The result of this task should be a set of fresh text corpora as part of the CTS infrastructure. Further information can be found at http://cts.informatik.uni-leipzig.de/.
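
Once a corpus is registered, its passages are addressed by CTS URNs via plain HTTP requests such as GetPassage; a minimal sketch (the endpoint path and the URN are hypothetical placeholders, so check the CTS instance above for valid values):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLEncoder;

    public class CtsRequest {
        public static void main(String[] args) throws Exception {
            // GetPassage is one of the standard CTS protocol requests.
            String urn = "urn:cts:placeholder:work.edition:1.1"; // hypothetical URN
            URL url = new URL("http://cts.informatik.uni-leipzig.de/api/cts/"
                    + "?request=GetPassage&urn=" + URLEncoder.encode(urn, "UTF-8"));
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // TEI/XML passage response
                }
            }
        }
    }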

tm2: Converting LaTeX documents into CTS-compliant TEI/XML

LaTeX is one of the major typesetting formats in academic work environments and provides structured documents that can be included in the Canonical Text Services infrastructure. The task is to implement a converter that produces compliant TEI/XML documents from LaTeX input documents.
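
The core of such a converter is a mapping from LaTeX structure to TEI elements; a minimal sketch that handles only \section as a proof of concept (a real converter would need a proper LaTeX parser and closing div handling, which this deliberately omits):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LatexToTei {
        private static final Pattern SECTION =
                Pattern.compile("\\\\section\\{([^}]*)\\}");

        public static String convert(String latex) {
            // Map each \section{title} to a TEI div with a head element.
            Matcher m = SECTION.matcher(latex);
            StringBuffer tei = new StringBuffer();
            while (m.find()) {
                m.appendReplacement(tei, Matcher.quoteReplacement(
                        "<div type=\"section\"><head>" + m.group(1) + "</head>"));
            }
            m.appendTail(tei);
            return tei.toString();
        }

        public static void main(String[] args) {
            System.out.println(convert("\\section{Introduction} Some text."));
        }
    }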

bd1: Implementing a data mining pipeline for predictive business analytics

In recent years, new concepts and technologies have evolved that allow organizations to gain more detailed insights into operational business activities. By utilizing the collected data to anticipate prospective events, business process management and decision support can be optimized and automated. The goal of this work is to identify an appropriate process mining or data mining algorithm and to adapt it to process data from a financial use case. The algorithm should then be integrated into a data processing pipeline including preprocessing, analytics, and visualization steps.
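
One way to structure such a pipeline is as a chain of typed stages, so that preprocessing, analytics, and visualization remain exchangeable; a minimal sketch (the stage contents are placeholders, with a simple count standing in for the mining algorithm to be chosen):

    import java.util.Arrays;
    import java.util.List;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    public class MiningPipeline {
        public static void main(String[] args) {
            // Preprocessing: clean raw event records.
            Function<List<String>, List<String>> preprocess = events ->
                    events.stream().map(String::trim)
                          .filter(e -> !e.isEmpty())
                          .collect(Collectors.toList());
            // Analytics: a simple count stands in for the chosen mining algorithm.
            Function<List<String>, Long> analyze = events -> (long) events.size();
            // Visualization: render the result, here as plain text.
            Function<Long, String> visualize = n -> "processed events: " + n;

            String report = preprocess.andThen(analyze).andThen(visualize)
                    .apply(Arrays.asList(" payment ", "", "transfer"));
            System.out.println(report); // processed events: 2
        }
    }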