By Rick Friedman
October 20, 2005 08:30 AM EDT
Thomas Edison famously described genius as 1 percent inspiration and 99 percent perspiration. Although he had no inkling of the emergence of Linux clusters for High Performance Computing (HPC), his words ring true for designing, building, and managing compute clusters.
Linux Computing Is Here
A Disruptive Computing Paradigm
Businesses, research labs, and universities that require intensive computing power have recently discovered that Linux cluster architectures can meet their ever-increasing demand for computing power, application speed, and "always-on" availability. Clusters have proven to be an excellent solution for compute-intensive technical and business computing tasks requiring significant data crunching.
Linux clusters have enjoyed widespread adoption in performance-driven sectors such as financial services, government, energy, life sciences, and advanced manufacturing. Linux clusters are used for a wide variety of demanding applications including derivative trading, drug design, risk analysis, weapons simulation, petroleum exploration, and design automation.
As penetration has moved beyond the early adoption phase, it's no longer a question of whether to deploy Linux clusters, but rather how to implement cluster computing to deliver high ROI, improved system performance, and minimal business risk. Success with Linux clusters depends as much on how the clusters are implemented and managed as on the choice of hardware or software tools.
Scaling Out Linux Clusters
Compelling Benefits but Hidden Costs and Challenges
For small clusters of 2 to 8 nodes, with minimal or no scheduling requirements, cluster management is relatively straightforward. However, clusters tend to have a strong growth impetus, given their attractive economics and the typically insatiable application demand for computing power. At a certain point, often around 16 to 64 nodes, cluster management becomes problematic and expensive due to growing workload scheduling requirements, increasing cluster size, and the increasing (and inherent) heterogeneity of software and hardware components. Every node added at this point demands more management attention, magnifies complexity and business risk, and multiplies both direct (overhead, labor) and indirect (lateness to market, hassle) costs. Adding to the complexity is a myriad of clustering tools and applications, interconnects, MPIs, compilers and libraries, and management tools.
Large cluster sizes make it even tougher to deploy applications quickly, fully control and monitor all resources, or respond to rapid changes in user or business requirements. All of these issues conspire to reduce the organization's return on IT assets and overall business performance. Consequently, as new clusters scale or existing Linux infrastructures grow, there is an ongoing and critical need for software management solutions that can quickly and economically scale, manage, and troubleshoot clusters.
"Clusters are enabling more work to be undertaken and therefore are a major driver in the growth of the HPC marketplace...there still are clearly large hurdles that clusters will need to overcome in order to expand the addressability of the technology...[including] the management of a large number of nodes and greater levels of parallelization of cluster applications as well as the conversion of applications to cluster operating environments," according to a 2004 IDC study. The pain around cluster management is shared across industries, processing needs, and user roles, representing a significant user concern.
Fortunately, there are management solutions that reduce the pain and cost of building and scaling HPC clusters, even when the cluster size exceeds 1,000 nodes. Powerful, commercially available software solutions can simplify and automate the management, scale out, and support of Linux clusters by bundling industry standard servers, operating systems, and application software into an easily managed Linux datacenter.
After deciding Linux clusters are the way to go, knowing how to plan and implement a cluster is equally crucial. This requires a significant amount of upfront analysis and a detailed, yet realistic, implementation methodology.
Implementing Linux Clusters
Document, Plan, Test, Launch
The process of implementing clusters is no different from the usual project-planning methodology, which includes a review of goals, resources, staffing, and features. There are, however, some guidelines that can shorten time to implementation and results. For example, Scali has identified a number of best practices for deploying, scaling, and managing clusters. In general, investing more time upfront in planning and testing pays off later in reduced rework, cost, and deployment time.
Baseline Information Needs
Before beginning an implementation, two important lists should be created: a hardware inventory, including specific information on minimum hardware requirements, performance tests, and specifications on how to set up and configure nodes; and an application inventory, including information on how to install applications, fix run errors, and run and test performance.
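As an illustration only, the hardware inventory can be kept as structured data and checked against the minimum requirements automatically. The sketch below is a minimal example under stated assumptions: the field names, the hostnames, and the minimum values are all hypothetical and would be replaced by whatever the applications in a given environment actually require.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """One row of the hardware inventory (field names are illustrative)."""
    hostname: str
    cpu_cores: int
    memory_gb: int
    disk_gb: int


# Hypothetical minimum requirements; real values come from the applications.
MINIMUM = NodeSpec(hostname="minimum", cpu_cores=2, memory_gb=4, disk_gb=40)


def shortfalls(node: NodeSpec, minimum: NodeSpec = MINIMUM) -> list[str]:
    """Return the names of the resources where the node is below the minimum."""
    checks = ("cpu_cores", "memory_gb", "disk_gb")
    return [name for name in checks if getattr(node, name) < getattr(minimum, name)]


def audit(inventory: list[NodeSpec]) -> dict[str, list[str]]:
    """Map each non-compliant hostname to its list of shortfalls."""
    report = {}
    for node in inventory:
        missing = shortfalls(node)
        if missing:
            report[node.hostname] = missing
    return report


# Example inventory with made-up nodes.
inventory = [
    NodeSpec("node01", cpu_cores=2, memory_gb=8, disk_gb=80),
    NodeSpec("node02", cpu_cores=1, memory_gb=2, disk_gb=80),
]
print(audit(inventory))  # {'node02': ['cpu_cores', 'memory_gb']}
```

Keeping the inventory in a form like this makes it straightforward to rerun the same check each time nodes are added.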
To minimize risk and the time associated with "going live," it is useful to set up a mini proof of concept to avoid unanticipated problems during actual deployment. The proof of concept helps to verify the cluster software installation and to review the performance and reliability of the network and hardware. Installation requirements may be customized to support hardware or other software unique to the cluster, as well as the performance-testing tools used to tune it. A management solution can handle replication of those requirements across additional nodes.
All applications must be tested on the mini proof of concept to ensure high performance and robustness for stakeholders. As there are likely many different kinds of off-the-shelf, proprietary, and custom applications in the environment, this process may take time with many fits and starts. It is critical to configure, run, and test the applications to highlight missing functionality or corrupted operation instead of relying only on third-party documentation. This process should also be repeated for all software, to ensure a complete environment.
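One way to make this testing repeatable on the proof-of-concept cluster is a small script that runs each application's check command and records whether it succeeded. The following is a rough sketch, not a prescribed tool: the test names are invented, and the Python interpreter stands in for the real application commands, which would differ in every environment.

```python
import subprocess
import sys
from typing import Dict, List


def run_smoke_tests(tests: Dict[str, List[str]], timeout_s: float = 60.0) -> Dict[str, bool]:
    """Run each command and record whether it exited with status 0."""
    results = {}
    for name, command in tests.items():
        try:
            completed = subprocess.run(command, capture_output=True, timeout=timeout_s)
            results[name] = completed.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            # A missing binary or a hung process counts as a failure.
            results[name] = False
    return results


# Placeholder commands: the Python interpreter stands in for real applications.
tests = {
    "passing_app": [sys.executable, "-c", "print('ok')"],
    "failing_app": [sys.executable, "-c", "raise SystemExit(3)"],
}
results = run_smoke_tests(tests)
for name, passed in results.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

An exit-status check only shows that an application starts and finishes; verifying correct output and performance still requires application-specific tests.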
Following configuring and testing, the next important phase is implementing the cluster. Successful cluster implementation should incorporate the following steps:
- Communicate critical information on policies, system information, support, and application performance to cluster users.
- Brief the operator on the design philosophy, system architecture, and user interface as well as management tools, scripting, and nomenclature.
- Prepare the system to run your application software by migrating the hardware. This may include power testing of any new hardware (i.e., master node, compute nodes, power systems), PXE boots, or changes to node BIOS settings.
- Once the new nodes are active, the system is ready for application migration and testing.
- Upon completion of application proofing, commissioning can begin. At this stage, users and operators should be told that the cluster can now run production jobs and should be informed of any missing or delayed features.