ScaDS Logo

CENTER FOR
SCALABLE DATA ANALYTICS
AND ARTIFICIAL INTELLIGENCE

Graph Mining for Advanced Data Analytics

Article Index

In order to answer complex analytical questions, data mining methods are often combined with other data processing steps, for example to prepare the search space or to process results. To enable the combination of data mining algorithms and other operators, productive data mining solutions offer extensive toolkits for the analysis of relational or multidimensional data. However, the situation differs with regard to the less established methods of graph mining. Although there are research prototypes, there is no solution that supports complex analytical programs composed from multiple graph operations. Gradoop, an Open Source system developed at the University of Leipzig and the ScaDS Dresden / Leipzig, aims at changing this. Gradoop is the first system that supports the combination of multiple graph algorithms and graph operators in a single script. Hence, Gradoop enables new applications, for example, graph-based analyses of business data. Built on state-of-the-art Big Data technology, Gradoop not only offers a unique range of functionality, but is also out-of-the-box horizontally scalable. It further provides an interface for plug-in algorithms and is thus open to application-specific extensions.

Graphs for business analytics

The first graph mining technique available in Gradoop allows to find frequent patterns (Frequent Subgraph Mining). Similar to the relational counterpart (Frequent Itemset Mining), frequently occurring entities are evaluated. In the example of a classical shopping basket analysis, a pattern represents a certain combination of products, such as bread, butter and milk. However, in practical application scenarios, analytical questions are rarely as simple as in textbooks. For example, to optimize a webshop, not only products bought together could be of interest but frequently occurring product groups in canceled order processes. To answer such more complex questions, several steps of data processing are required. Before the actual algorithm is executed, canceled orders must be filtered and products have to be replaced by representatives of their groups. For relational and multidimensional data there already exist productive solutions supporting such multi-stage analyzes in a flexible manner. This is different with regard to graph data. But why should one use the graph model?

In addition to the common occurrence of certain entities (vertices), graph patterns additionally include their relationships (edges). One way to analyze business data is representing all data recorded during the execution of business processes by a collection of graphs. Here, every graph represents a single process execution. For example, a graph contains all the data around an online trading process, e.g., the customer's order, purchasing and warehouse operations as well as delivery data. Vertices can be either master data like employees and products or transactional data like invoices and delivery notes. Edges additionally provide information about relationships, for example, which employee has processed the packaging step. By representing process data by graph collections, one can identify patterns which represent the joint occurrence of certain master data as well as their roles in the process.

Figure 1: Example graph pattern representing process data.

 

Such pattern-driven analyses are especially of interest with regard to their economic context like correlations among patterns and complaints. But which steps are necessary to find such patterns? First, graphs containing complaints must be identified and filtered. Second, let us also assume that vertices describe business objects of certain types (e.g. employee, invoice, product) including properties (e.g. identifiers or prices). To increase expressiveness of resulting patterns, we are interested in information about concrete master data instances but, as they represent processing steps, only transactional data types. To find respective patterns, the remaining graphs must be transformed to enable the common evaluation master data identifiers and transactional as well as relationship types. Figure 1 shows two example graph patterns meeting this requirement. One can see, that simple patterns (i) as well as more complex relationships (ii) can be found. With Gradoop, we are able to declare our example analysis including categorization, filtering, transformation and pattern mining in a single program.