Preface & Origins
The Clumon performance monitoring system was developed for monitoring Linux-based clusters at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Champaign-Urbana. The system is currently based on Performance Co-Pilot by SGI and the PBS scheduler. It also uses MySQL as its database, and Apache with PHP as its bundled viewer. The system was created to give an overview of what the current state of the cluster was at a glance. This of which requires information from both the scheduler and the hosts themselves. Until now there was no tool that combined and cross-referenced data from the two areas. The system was also need to provide for a common structure and single point of access so that the many users and administrators would not overload the cluster itself with redundant requests for the same information. The primary requirements for the system were as follow: scalability, low bandwidth, use freely-available components, versatility, controllability. Scalability was first and foremost. The system had to be able to potentially thousands of hosts and, at the same time, not create too large a load on the hosts from which data was being collected or the network layer. This also led to the requirement of controllability. Since there are many, many different potential configurations for clusters, the system needed to be flexible enough to be tuned to work on a broad range of clusters. Monitoring performance data is a trade off between detail, scale & frequency versus load & bandwidth. Therefore, the performance monitor's administrator should be able to adjust the frequency and amount of data collected since the scale and bandwidth of a cluster is commonly a fixed value. This should allow an administrator to balance the amount of resources used for performance monitoring versus those required for computation. Versatility is a requirement in any system such as this one, because it is to be used by different categories of users. These generally include the users that are doing computation on the cluster, the system administrators keeping an eye on things, and developers and engineers doing testing and debugging; just to name a few. And all of these groups tend to care about different aspects of the cluster, with none of the group knowing what they want from the beginning. So the administrator must be able to add and, of course, remove items from the set of collected data in a very simple manner. One of the final requirements was that it be developed with freely-available, standardly used software packages so that others may easily develop components to interact with this system.
Before the structure can be explained, a look at the components is necessary. The central component is the clumond data collection daemon.
Benefits of Architecture
There are numerous benefits that the architecture itself grants an administrator, the users, as well as overall performance of the cluster. From a cluster performance and integrity standpoint, this system acts as a gateway for performance data. This allows the administrator to throttle the bandwidth and cycles used by the cluster for monitoring. It also allows others to develop their own applications that would use the same data and provides them with a standard set of tools for accessing that data. That of course being db connectivity and custom formatting via PHP. From an integrity standpoint, there are is only a single entity collecting data on the cluster and routing and filtering can be set up such this traffic is allowed and other types are explicitly forbidden. In essence it provides a known quantity such that the any other traffic can be disregarded. This is a very positive thing to be able to do in the realm of security. And, at the same time, virtually any performance metric can be provided to a user.
Administrative and uptime issues are addressed by having all configuration contained in the database and having only a passive set of software on each of the hosts being monitored. Changes take effect without having to communicate with the hosts and even without having to restart any service on the monitoring machine.
Limitations of Architecture
There are a few limitations of this system, most of which are tied to either hardware or the underlying software. The number of parallel collections will be limited by the memory, and to a lesser extent, the processor speed of the machine on which the clumond executes. Performance may be limited by the database handling it's various connections. The db may need to be configured to allow more concurrent connections to facilitate a large number of parallel collection processes. Clumond was designed to run on the same machine as the database due to the high volume of communications between it and the db. The same is true for the PHP web interface. The web side of things can easily be moved to another web server, but performance will degrade with larger sets of machines displayed.
In this section, the algorithm that the clumond executable goes through on each iteration will be described in general detail. Great detail can be gathered from the documentation within the source code itself. This section is for a general understanding of the process and an aid to those digesting the source.