Common Hardware Recommendations
Several older public sources offer hardware recommendations, for example:
- Chapter 10 of the book “Hadoop: The Definitive Guide” by Tom White (4th ed., O'Reilly, 2015)
- the chapter “Cluster Planning Guide” in the documentation for Hortonworks HDP 2.4
- Cloudera's blog article from 2013, “How-to: Select the Right Hardware for Your New Hadoop Cluster”
Most guides go into more detail, e.g. they distinguish between compute-intensive and storage/I/O-intensive workloads, but here is a very brief summary:
- master nodes require higher reliability than slave/worker nodes
- 1-2 sockets with 4-8 cores each and medium clock speed
- 32-512 gigabyte ECC RAM
- 4-24 hard disks with 2-4 terabyte each and a SATA or SAS interface
- 1-10 gigabit/s Ethernet
The Actual Hardware of Galaxy
For our big data cluster Galaxy we decided, as mentioned before, on a shared-nothing architecture. Special focus was put on a large number of nodes and on high flexibility to serve the diverse needs of researchers. A Europe-wide tender procedure provided us with the following cluster hardware:
90 nodes with the following homogeneous hardware specification:
- 2 sockets equipped with 6 Core CPUs (Intel Xeon E5-2620 v3, 2.4 GHz, supports Hyperthreading)
- 128GB RAM (DDR4, ECC)
- 6 SATA hard disks with 4 terabyte each
- attached via a controller capable of JBOD (recommended for Apache HDFS) or of a fault-tolerant RAID 6 configuration
- 10 gigabit/s Ethernet interface
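As a back-of-the-envelope illustration, the aggregate capacity of the 90 nodes above can be computed as follows. Note that the HDFS replication factor of 3 used here is the Hadoop default and an assumption for this sketch, not a confirmed setting of the actual cluster:

```python
# Aggregate capacity figures for the Galaxy cluster described above.

NODES = 90
DISKS_PER_NODE = 6
TB_PER_DISK = 4
RAM_GB_PER_NODE = 128
CORES_PER_NODE = 2 * 6        # 2 sockets x 6 cores (Intel Xeon E5-2620 v3)
HDFS_REPLICATION = 3          # assumed Hadoop default, not a confirmed setting

raw_storage_tb = NODES * DISKS_PER_NODE * TB_PER_DISK
usable_hdfs_tb = raw_storage_tb // HDFS_REPLICATION
total_ram_gb = NODES * RAM_GB_PER_NODE
physical_cores = NODES * CORES_PER_NODE

print(f"Raw storage:          {raw_storage_tb} TB")   # 2160 TB
print(f"Usable HDFS (rep=3):  {usable_hdfs_tb} TB")   # 720 TB
print(f"Total RAM:            {total_ram_gb} GB")     # 11520 GB
print(f"Physical cores:       {physical_cores}")      # 1080
```

With HDFS triple replication, the roughly 2 petabyte of raw disk space thus translates into about 720 terabyte of usable storage.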
In addition, we operate a dedicated virtualization infrastructure, on which management nodes and master nodes can be organized in as many virtual machines as needed. These virtual machines benefit from better protection against hardware failures, but can leverage only limited resources compared with a dedicated server.
The big data cluster spans both ScaDS partner locations, Dresden and Leipzig. 30 of the 90 nodes are located at TU Dresden; 60 nodes and the management infrastructure are located at Leipzig University. The nodes of both locations are organized in a common private network, connected transparently via a VPN tunnel. To achieve optimal performance on this tunnel, the VPN tunnel endpoints on both sides are specialized routers with hardware support for VPN encapsulation. The Ethernet bandwidth within each location is 10 gigabit/s via non-blocking switches. For security reasons, the cluster's private network is accessible only from the network of Leipzig University. Scientists from other institutions with a project on the big data infrastructure are currently provided with a VPN login when needed.