Master Thesis (Leipzig): Distributed analysis of graph streams with Apache Flink
Graphs are ubiquitous - many of them are large and strongly change over time. Analyzing graphs via classical algorithms can be challenging if the set of relationships is unbounded and the graph elements arise in a very high frequency. An approach to handling such graphs is to process them in the data stream model, i.e. graph elements (typically edges) are represented as a stream of objects or events. This captures various challenges regarding to necessary online algorithms with only a limited amount of memory. Typical examples for these stream graphs are information networks in the web and social networks with a high frequent occurrence of relationships like friendships, messages etc. The analysis of such graphs representing interrelated and evolving information is needed for numerous applications, e.g., in social networks to analyse the evolution of user communities, in e-commerce to analyse the website usage and purchases of customers, or in criminology to analyse the behaviour of suspects with all their relevant actions.
In this work we would like to investigate how to represent and process graph streams with the distributed environment Apache Flink and its Table or Stream API to provide a basis for a toolset for distributed graph analyses. Table API offers the usage of relational, SQL-like operators to query and aggregate datasets as well as data streams. There are several graph stream algorithms, like graph grouping, connectivity measures and subgraph counting, that are addressed in current research. One or more of these should be considered in this work, especially how they can be implemented with a graph stream model.
The thesis consist of the following subtasks:
- Getting an overview to the related work of stream graph models
- Getting an overview of existing graph stream analyses and systems supporting them
- Get familiar with the technology stack of Apache Flink and it's Table API and Stream API
- Conception of a chosen stream graph model
- Conception of an analysis operator, e.g. graph grouping for graph streams
- Prototypical implementation of newly developed model, the analysis operator and a scalability benchmark
- Advanced: integration into the distributed graph analysis system Gradoop