Successful ScaDS Big Data School in Leipzig - a Report

From 11 to 15 July 2016, the Big Data Center of Excellence (ScaDS Dresden/Leipzig) hosted its first summer school for Big Data in Leipzig. The program attracted many students and young graduates, as well as other practitioners and researchers from academia and industry who work in the field of Big Data. We were overwhelmed by the number of registrations and by the many speakers who were willing to support our summer school. While we initially planned for 50 attendees in total, we finally counted 120 people at our summer school, including speakers and short-term attendees throughout the week. Surprisingly, more than 50% of the attendees were international, coming from all continents.

As we envisioned, we had an inspiring mix of motivating keynotes, online training sessions, classes and excursions. The summer school taught the basics of working with large and complex amounts of data and provided an overview of relevant approaches and solutions. In particular, the practical sessions gave participants a first glimpse into the basics of storing, processing and analyzing big (graph) data.


Monday

On Monday, 11 July, we started our 2nd International Summer School on Big Data at the University of Leipzig. Prof. Dr. Erhard Rahm welcomed around 100 guests and, as the first speaker, gave a presentation on “Big Data Integration”. He introduced recent approaches and challenges for holistic data integration of many data sources and discussed, for example, parallel blocking and entity resolution on Hadoop platforms.

  

Prof. Dr. Peter Boncz from VU University Amsterdam continued with a presentation on “Benchmarking Graph Data Analysis with LDBC”. He introduced LDBC, an EU project of which he is scientific director, and focused his talk on “choke-point”-based benchmark development (Social Network Benchmark).

After a short break, Dr. Sherif Sakr from the King Saud bin Abdulaziz University for Health Sciences gave an overview of recent developments in his talk “Big Data 2.0 Processing Engines: The Time After Hadoop”. He discussed directions for future research as well as the latest challenges that go beyond what Hadoop frameworks can handle.

For the first evening we organised a sightseeing tour to introduce our guests to our 1,000-year-old city, which is known for historical events such as the Monday demonstrations and for famous inhabitants and guests such as Gottfried Wilhelm Leibniz and Johann Wolfgang von Goethe.


Tuesday

The second day started under the headline “Big Data Storage/NoSQL”. The introductory presentation “NoSQL: State of the Art & New Developments” by Prof. Dr. Stefan Edlich (professor at the Beuth University of Applied Sciences, Berlin) covered the different NoSQL applications in the DB landscape and how NoSQL will affect future approaches. Prof. Dr. Andreas Thor, from the Leipzig University of Telecommunications (HfTL), followed with his talk “NewSQL, SQL on Hadoop”. He compared the query languages, explained how they can be applied to the Hadoop infrastructure and finally gave an overview of NewSQL systems (e.g. VoltDB, Google Spanner). Another local speaker, Anika Groß, a postdoc at Leipzig University, focused her talk “NoSQL – Datastores for Big Data” on the different data models and technical models for NoSQL datastores, using Dynamo, an AP system and key-value store, and MongoDB, a CP system and document store, as examples.

    

After lunch, the practical sessions started. In three groups, participants could attend a course on Text Mining, Genome Alignment Processing or Logistics, and got an introduction to MongoDB.

At the end of the day we took our guests on a boat and canoe tour and finished the trip with a barbecue.


Wednesday

After this athletic outing, we started the next day with presentations on “Distributed Data Processing”. First, Prof. Dr. Kai-Uwe Sattler of the TU Ilmenau spoke about Big Data stream processing. He gave a survey of recent processing engines and discussed their different architectures, execution models and programming interfaces. The next speaker, Tilmann Rabl, a research director at the Database Systems and Information Management (DIMA) group and technical coordinator of the Berlin Big Data Center (BBDC), gave the talk “Distributed Data Processing and Streaming in Flink”. He introduced the open source system Apache Flink, which allows faster and more efficient data analysis on both batch and streaming data. A research assistant of the Center for Information Services and High Performance Computing at the TU Dresden continued with a different approach in his talk “Introduction to Big Data Analytics on HPC clusters”.

On this day we continued our practical courses with Apache Flink and finished the day with a dinner at the “Bayerischer Bahnhof”, which offered a wide range of international specialties and the locally brewed beer “Gose”.

Thursday

On Thursday we welcomed Prof. Dennis Shasha of New York University, who introduced the day’s topic “Graph Analytics” with his talk “Fast Methods for Finding Colored Motifs in Graphs”. He focused on the problem of finding subgraphs of a network. Next, Vasia Kalavri of KTH, Stockholm, introduced the Gelly framework in the talk “Graph processing on Apache Flink with the Gelly framework” and showed how graph analysis tasks can be expressed using Flink operators and different graph processing models. Another approach to graph analytics was presented by Martin Junghanns, a researcher at the University of Leipzig. In his talk “Graph Analytics with Gradoop” he explained the functionalities of Gradoop. The last speaker of the day, Prof. Sören Auer of the University of Bonn, spoke about “(Big) Knowledge Graphs”. He introduced the concept of knowledge graphs based on the RDF and Linked Data paradigm and discussed recent and future Big Knowledge Graph applications as well as strategies for combining Linked Data paradigms and Big Data.

In the last practical sessions of this summer school, we introduced the previously mentioned Flink Gelly framework.


Friday

Finally, the last three speakers discussed different aspects of the topic “Big Data Integration”. Prof. Peter Christen of the Australian National University opened with his talk “Privacy-Preserving Record Linkage (PPRL)”. Using real-world scenarios, he illustrated the significance of PPRL and showed how to apply it to large databases in Big Data environments. In the talk “Blocking for Big Data Integration”, Prof. Themis Palpanas of the Paris Descartes University (France) focused on blocking-based Entity Resolution and on blocking methods, especially for Web Data collections, and also gave a brief outlook on future applications. Dr. Maik Thiele, a postdoctoral researcher at the Database Systems Group in Dresden, finished the summer school with his talk “Building the Dresden Web Table Corpus and Beyond”. The presentation focused on relational web tables and on how to classify them into different categories to make them more usable.

We would like to thank the speakers and guests for making this summer school a success. The evaluation of the feedback gave us a good impression of how the event was perceived and provides a good basis for improving it next time. We hope our guests had a pleasant stay in Leipzig and at Leipzig University, and perhaps we will see you again next year.