ScaDS: Competence Center for Scalable Data Services and Solutions

TEC-4: Spark usage/processing capabilities

Service Owner: ScaDS
Contact: Jan Frenzel
Target: Data analysts, high-level programmers (Java, Scala)
Dependencies: TEC-1

Description

Apache Spark is a scalable cluster-computing framework that integrates with the Apache Hadoop ecosystem. It is tailored to iterative computations and achieves significant performance gains through in-memory primitives. It provides libraries for graph processing, machine learning, structured and semi-structured data processing, and streaming analytics. Data partitioning and workload distribution are handled by the framework; the user only has to specify the algorithm in terms of Apache Spark's basic operations. We provide all the tools necessary for using a full-featured Apache Spark ecosystem for fast Big Data analytics.
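
As a minimal sketch of this programming model, the classic word count can be expressed as a short chain of basic operations in Spark's Scala API. The application name and input path below are placeholders, not part of our setup:

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("WordCount") // placeholder name
          .getOrCreate()

        // The framework partitions the file and distributes the work;
        // we only describe the algorithm as a chain of basic operations.
        val counts = spark.sparkContext
          .textFile("hdfs:///user/demo/input.txt") // hypothetical path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.take(10).foreach(println)
        spark.stop()
      }
    }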

Offerings

  •  tools to integrate the Apache Spark environment into HPC resource and workload sharing systems

  •  exploratory data analysis via an interactive Apache Spark shell (see the sketch after this list)

  •  dedicated Apache Spark environment

  •  distributed storage systems (Lustre, HDFS)

  •  tools to import HDF5 files (Matlab, R, ...)
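
As an example of the interactive workflow, a Spark shell session might look like the following. The master URL and HDFS path are hypothetical and depend on the local setup:

    $ spark-shell --master yarn
    scala> val logs = sc.textFile("hdfs:///user/demo/access.log")  // hypothetical path
    scala> val errors = logs.filter(_.contains("ERROR"))
    scala> errors.count()  // triggers distributed execution and returns the count

In the shell, the SparkContext is available as the predefined variable sc, so transformations can be tried out step by step before being assembled into a batch application.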

Consumption

  •  Collect information about your use case. Be prepared to answer the following questions:

    •  What is the focus of your research?

    •  Do you already have a (serial/parallel) program?

    •  What resources do you need? (type of computing resources and amount of required computing time)

    •  Who is responsible for your project?

  •  Contact us (via e-mail or phone).

  •  We will send you an application form. It lets us look at your use case and identify its specific requirements, which helps us provide any additional software you might need. We also need this form to request computing resources.

  •  Fill out the form and send it back to us.

  •  We will contact you once your login is granted and you can access our cluster. This may take some time.

  •  We will send you information about how to use our cluster, including material on how to log in, submit jobs, write programs, and avoid potential bottlenecks; a sketch of a typical batch submission follows below.
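
For orientation, submitting a packaged Spark application as a batch job typically looks like the following. The class name, jar, input path, and resource options are hypothetical; the exact submission procedure on our cluster is covered in the material we send you:

    # hypothetical resource options and class name; actual values depend on the use case
    $ spark-submit \
        --master yarn \
        --deploy-mode cluster \
        --num-executors 4 \
        --executor-memory 4G \
        --class example.WordCount \
        wordcount.jar hdfs:///user/demo/input.txt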