
COMPETENCE CENTER
FOR SCALABLE DATA SERVICES
AND SOLUTIONS

Research Cluster Management

The high diversity of Big Data researchers’ requirements around ScaDS Dresden/Leipzig is not easily mapped to the many different computing resources available.

Research in the ScaDS Dresden/Leipzig project spans a very broad range of topics and fields, and the IT infrastructure it is expected to need is correspondingly varied. The Apache Hadoop stack is common in the big data field, but it is not the only stack needed, and even the Hadoop stack itself can be required in many variations.

 

Variety of researchers’ requirements

ScaDS works in many different research areas, including its disciplinary research areas and computer science.

Here are just a few examples of the infrastructure requirements mentioned by researchers so far:

  • schedulers: Apache Hadoop stack (several specific versions), Slurm, separate cluster of computers
  • software: Apache Zookeeper, Apache HBase, KNIME, research area specific software, …
  • environment: Java (several specific versions), Docker, specific Linux OS, …
  • storage: Apache HDFS, network storage, SSD, …

Researchers also want to document their work well, so topics such as workflow management and provenance of their results are important to them.

 

Hardware resources in the environment of ScaDS

HPC

  • Taurus (ZIH Dresden)

Shared-Nothing-Cluster

  • Galaxy Dresden/Leipzig (URZ Leipzig)
  • bdclu (Ifi Leipzig)
  • a cluster at the database department, Leipzig
  • small temporary cluster of desktop computers

Virtual Machines

  • ZIH Dresden
  • URZ Leipzig

In-Memory Server

  • ZIH Dresden
  • Sirius (installation in progress) (URZ Leipzig)

 

Problem outline

Even with this broad range of resources available, developing an approach that maps requirements to existing or yet-to-be-developed infrastructure offers is an interesting research topic.
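To make the mapping idea concrete, here is a minimal sketch in Python. It assumes that requirements and infrastructure offers can both be described as sets of capability tags; the offer names and the tags attached to them are purely illustrative and do not describe the actual configuration of any system listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """An infrastructure offer described by a set of capability tags."""
    name: str
    capabilities: frozenset


def matching_offers(required, offers):
    """Return the names of all offers that cover every required capability."""
    required = frozenset(required)
    return [offer.name for offer in offers if required <= offer.capabilities]


def missing_capabilities(required, offer):
    """Return the required capabilities that the given offer lacks."""
    return sorted(frozenset(required) - offer.capabilities)


# Hypothetical offers; the tags are assumptions made for this example only.
OFFERS = [
    Offer("hpc-system", frozenset({"slurm", "network-storage"})),
    Offer("shared-nothing-cluster", frozenset({"hadoop", "hdfs", "java"})),
    Offer("virtual-machine", frozenset({"docker", "java", "network-storage"})),
]

if __name__ == "__main__":
    print(matching_offers({"hadoop", "hdfs"}, OFFERS))
    # prints ['shared-nothing-cluster']
    print(missing_capabilities({"hadoop", "docker"}, OFFERS[1]))
    # prints ['docker']
```

A non-empty result from `missing_capabilities` marks the gap that an offer still to be developed would have to close. Real requirements (specific versions, data locality, performance) are richer than flat tags, so this is only a starting point.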

Example 1: Shared-Nothing Scheduling

Focusing, for example, on the shared-nothing infrastructure often referred to in the big data area, one runs into problems with software configuration management and data locality. A straightforward approach to enabling several different cluster configurations on one set of resources is to build temporary subsets (called, for example, partitions or sub-clusters) and configure each according to the specific need.


Example of different cluster partitions on a cluster with 18 nodes at different times (week 1 to 7)
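The partitioning idea can be sketched in a few lines of Python. This is a simplified illustration, not the scheduling approach used in the project: it assumes each configuration requests a number of nodes for one time slot and simply hands out disjoint, contiguous node ranges. The configuration names and node counts are hypothetical.

```python
def build_partitions(requests, total_nodes=18):
    """Assign disjoint node ranges to named configurations for one time slot.

    `requests` maps a configuration name to the number of nodes it needs.
    Raises ValueError if a request is not positive or the cluster is too small.
    """
    partitions = {}
    next_free = 0
    for config, size in requests.items():
        if size <= 0:
            raise ValueError(f"{config}: node count must be positive")
        if next_free + size > total_nodes:
            raise ValueError(f"{config}: not enough free nodes")
        partitions[config] = list(range(next_free, next_free + size))
        next_free += size
    return partitions


# Hypothetical schedule: week number -> partitioning of the 18 nodes.
schedule = {
    1: build_partitions({"hadoop-a": 10, "slurm": 8}),
    2: build_partitions({"hadoop-a": 6, "hadoop-b": 6, "slurm": 6}),
}

if __name__ == "__main__":
    for week, partitions in schedule.items():
        for config, nodes in partitions.items():
            print(f"week {week}: {config} -> nodes {nodes[0]}..{nodes[-1]}")
```

Contiguous ranges keep the sketch short. A real scheduler would also have to consider which nodes already hold the data a configuration needs, since data locality is one of the problems named above.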

 

The problem is to keep the different configurations manageable and reproducible. At the same time, it is important to keep the overhead low so that researchers get reliable performance results.

One possible building block for a solution that we are investigating is the use of lightweight Linux containers, with Docker as the prominent example. Linux containers offer performance close to that of the bare hardware and encapsulate the individual configuration and software stack needed.
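As a rough illustration of how a container can capture a configuration reproducibly, the following Python sketch stores a configuration as plain data and derives a `docker run` command line from it. The image name `example/hadoop-stack:2.7` and the environment variable are hypothetical placeholders; the sketch only builds the command and does not start a container.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerConfig:
    """A software configuration recorded as data so it can be re-created later."""
    image: str            # image reference pinned to a tag (hypothetical here)
    env: tuple = ()       # (name, value) pairs passed into the container
    command: tuple = ()   # command executed inside the container


def docker_run_args(cfg):
    """Build the argument list for `docker run` from a stored configuration."""
    args = ["docker", "run", "--rm"]
    for key, value in cfg.env:
        args += ["-e", f"{key}={value}"]
    args.append(cfg.image)
    args.extend(cfg.command)
    return args


# Hypothetical configuration; none of these values come from the project.
cfg = ContainerConfig(
    image="example/hadoop-stack:2.7",
    env=(("JAVA_HEAP", "4g"),),
    command=("hadoop", "version"),
)

if __name__ == "__main__":
    # Print the command instead of executing it.
    print(shlex.join(docker_run_args(cfg)))
```

Keeping the configuration as data, separate from the command that launches it, means the same record can be versioned and documented alongside the results it produced.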