To deliver increased computing capacity to its users, companies have traditionally
relied on two different approaches: mainframe-centric and desktop-centric computing. In the mainframe-centric approach, users connect to the mainframe via dumb terminals. The mainframe operating system serves as the job scheduler, seeing to it that all user jobs gain access to the CPU. The mainframe-centric approach offers straightforward administration and the time-sharing operating system manages the workload. However, the approach is notoriously not scalable.
In addition, the million-dollar investment
typically required of a mainframe is a barrier to entry for many smaller companies, and the cost of increasing mainframe capacity is prohibitive even for large companies. Because of this inherent lack of scalability, and because companies could be stuck with an obsolete mainframe before it had fully depreciated, mainframe computing rapidly gave way to Unix/RISC architectures, which affordably placed mainframe-like capacity and performance on an individual user's desktop.
The desktop-centric approach
offers the advantage of cost-effective, incremental scalability in compute power. The advances in microprocessor technology predicted by Moore's law allow companies to justify upgrading to the fastest, latest machines for their power users every 12-18 months. However, these desktop resources remain underutilized for most of the time. Then, when the users' need for computing power suddenly exceeds their current capacity, they resort to the manual process of "load balancing" their jobs.
To access remote
resources, users consult CPU monitors on desktops to determine which remote machine is likely to complete jobs the soonest. Unfortunately, users then remotely log into colleagues' computers to run urgent jobs. This practice is not only disturbing to colleagues, but fails to ensure efficient use of valuable computing resources during periods of peak demand.
The manual load balancing of resources on the network results in some big inefficiencies: underused desktop resources; overloaded shared server
resources; lost time-opportunity costs as users spend valuable time managing their compute cycles rather than doing the creative work they were hired to perform; and poorly managed software license resources as high-priority jobs run on slower machines or suffer license denials while low-priority jobs run.
A network-centric load-sharing facility (LSF) clusters all of the CPU resources across a network into a single networked entity. The batch scheduler within LSF allows companies to define policies to ensure
that high-priority work is served before low-priority work, and that users requiring resources receive a fair share of the CPU resources available to them on the network. Users need no longer be aware of the specific hosts that will serve their job requests. Rather, they simply submit their jobs to the system, and then the LSF identifies the available hosts, selects one, and automatically runs the job on it.
Attitude is everything
One obstacle to adopting the network-centric model is the
attitudes of the individuals in the organization. The view "I've got what I need right here, and don't want to give up my desktop" responds to the individual's fear that they must "give up" anything at all. Individuals like the control and the efficacy of managing their own resources and knowing to what extent that they can rely on them. Our experience has shown that we can go a long way toward allaying the fears of the individual desktop users by carefully introducing the network-centric model and defining
policies that govern the use of desktop and server resources.
Every organization faces a transitional period after it introduces the load sharing of shared server resources. With proper user awareness training and load-sharing policies, we have seen attitudes reversed within a span of five working days. Users who expressed concern about giving up resources learn that the network supplies vast amounts of computing resource. Jobs return faster than they did before, often being served on machines that they
wouldn't have previously considered for execution. These same users come to rely on the load sharing software as though it was an operating system component. Management now has control of its computing environment. They can easily manage priorities and fair share access to computing, increase the utilization of depreciating resources, and easily monitor the demand levels for capacity planning and the cost justification of new purchases.
Building a compute farm
The major components of the
compute farm architecture include the applications, the CPU and memory resources, the operating systems, the network infrastructure, the data storage infrastructure, the load-sharing configuration and scheduling policies, and the monitoring and tuning of the farm.
The best place to start when considering a compute farm is to inventory the applications. Are they single-threaded or multi-threaded? Do they employ a client-server architecture? How much graphics processing is required to display the information?
How much memory do the applications actually use? Is the solver process separated from the GUI? What OS and hardware platforms do they run on?
Single-threaded applications that don't rely on client-server architecture are the best candidates for load sharing on a network of CPUs. Multi-threaded applications benefit from load sharing, but their existence frequently demands some number of symmetrical multiprocessor boxes in the server pool, depending on the amount of communication required between threads
and the amount of memory each process consumes.
Client-server programs such as a CAD drawing package don't benefit as much from load sharing. This drawback occurs because once the client-server connection is made, requests for computation put to the server occur on the machine that hosts the server, and therefore aren't easily executed on another, more available host. Some recent advances in client-server technology have introduced "client-client-server" architectures that make the "client-client"
connection and run the requests for computation as a separate "server" process. This approach plays to the strengths of the load-sharing network.
The amount of memory that the applications consume determines the composition and mix of memory and CPU boxes in the farm. The histogram that plots the number of jobs and memory consumed for the jobs determines which system single-processor, 2-processor, or 4-processor is most cost effective. For example, if most of the jobs require less than 512 Mbytes, but 20
percent of the jobs require up to 1 Gbyte RAM. It would be more cost-effective to buy dual-processor boxes with 1 GByte of memory rather than twice the number of single-processor boxes each with 1 GByte of memory.
Operating systems
Deciding which operating system to use depends on the application base. If the applications run equally well under Unix, NT, or Linux, you're free to choose the operating system that minimizes the administrative overhead while maximizing the return on depreciated
assets. The fundamental law governing the network-centric architecture is to put most of the dollars into the shared resources (the fastest CPUs, the most memory, the fastest disks) and conserve the dollars on the desktop. The analysis might dictate a mixed OS strategy. The load-sharing software works well in a heterogeneous environment, and depending on application availability can allow companies to buy the fastest most cost-effective hardware, regardless of OS, and provide these resources transparently to
their users.
The decision about whether to have an EDA application write its data to a local scratch directory or to a network-mounted file depends on how much data is being read or written. An application that runs for 10 hours and writes 200 Mbytes of data is writing an average of 5,632 bytes per second. With the appropriate buffering, this amount of data is probably not enough to slow down the application. If, however, the application runs for 1 hour and needs to read or write 200 Mbytes, then
caching to local disk and copying the file before or after the execution would be most appropriate. Note that the ultimate cost to the network remains the same, but that the decision to cache or not to cache can greatly influence the running time of the application.
For Unix installations, it makes sense to locate the data server and the application server on the server farm's subnet, thus avoiding lengthy application and data load times. Windows NT also allows centralized serving of applications and data.
However, for increased reliability, performance, and ease of administration, our experience has shown that the applications should be installed locally on each server inside of the load-sharing network. The Microsoft SMS utility speeds this process and ensures that the application is installed identically on each host. To avoid confusion between users accessing shared data sets across the network, it's best to use NT's universal naming convention (UNC) path specifications when referring to data files.
For
mixed Unix/NT installations, we recommend establishing an NFS file server for sharing application data and using the SAMBA free software to provide centralized file services for both Unix and NT. This scheme allows both Unix and NT applications to read and write from the same database and eliminates the need for replicating files and for maintaining parallel back-up strategies.
Managing resources
Any discussion of what site-specific policies to implement begins with the question: "What
scarce resources do we need to manage?" Some companies have excess computers, but limited software licenses. Other companies have unlimited software licenses but a limited number of machines with large memory configurations.
The second major question to ask is: "What are the priorities of work in our environment?" In general, there should be an inverse proportion between the priority of the work and the numbers and duration of jobs that meet that criterion. In a typical EDA environment, for example, the
high-priority interactive synthesis and simulation job slots might number one or two per user (interactive queue). For normal priority, noninteractive batch job slots (batch), there might be a limit of three to five job slots per user, while the noninteractive regression job slots (regression) might allow an unlimited number of jobs slots for a subset of users. The expectation is that the interactive queue offers an immediate response from the server farm. Batch queue jobs should find an open job slot within
an hour or two; regression queue jobs, which might number in the hundreds, may not start for days. (When the regression cycle time is the critical path schedule item, it's possible to modify the regression queue to become the priority queue late in the project cycle.)
Armed with a detailed understanding of the important resources to manage and the basic priorities of work, you can then begin to design a set of queues to fulfill your business objectives. We have developed some simple guidelines for
defining a set of queues for users to submit their jobs. Fewer queues are better than more queues and each queue should have a distinct purpose. The user should have incentives in other words, no resource limits to submit to lower-priority queues. The higher the scheduling priority of the queue, the stricter the resource limit on that queue. Strategies might include limiting the number of jobs per user, limiting the maximum run time of a job from that queue, and limiting the memory a job from that queue can use.
The lower the priority, the more relaxed the resource limit. For example, a regression queue might allow hundreds of concurrent jobs, but be preempted by jobs in an interactive queue.
Avoid application-specific queues, unless specific priority and preemption strategies supercede. It's best to educate users to specify their resource requirements at job submission time, rather than to specify the resource requirement at the queue level. Avoid partitioning queues to specific hosts let the scheduler use
the resource requirements to determine the best host for the job. Take advantage of LSF's "nice" job value to effectively balance interactive and background batch jobs on desktop computers. Avoid complicated load-balancing strategies. If your applications are bound to a CPU or memory, then specify a maximum number of jobs per processor equal to 1 for all machines. Finally, avoid paging at all costs.
Cycle stealing
Automating the compile, build, and test cycle for software engineering is a
natural application for a load-sharing network. A software engineer's workflow cycle involves spending time thinking about and editing text files, followed by a compute-intensive compile, build, and test cycle. Cycle stealing seeks to find the idle cycles on desktop workstations for short, CPU-intensive jobs. A cycle-stealing scheduling policy takes advantage of idle desktops and distributes the load on to the idle machines, then transparently returns the result to the user's desktop machine. CPU utilization
and memory available offer key scheduling criteria to determine the suitability of a desktop. CPUs currently above a certain threshold, say 30 percent, would be unavailable for scheduling.
As an added precaution, to avoid interference with the user's interactive response time, the scheduled job should be executed at a very "nice" processor priority. If the user's machine suddenly activated, the scheduled job would lack priority for processor cycles and would thus take longer to finish. However, the job
wouldn't intrude on the desktop user's machine. Compute tasks that run for a significant duration (more than 5 minutes, for example) or require significant amounts of memory resources aren't good candidates for cycle stealing. This effect occurs because the longer-running jobs can increase the likelihood of intrusion on an individual desktop user's resource. The act of suspending a large memory job can severely degrade a desktop machine's performance and hold significant memory and swap resources while being
held in suspension.
Putting in overtime
Many customers who run the CPU-intensive jobs that require significant memory resources restrict their use of desktop resources to evenings and weekend processing. A typical setup mixes shared server resources to process the daily workload and a significant number of desktop machines with plentiful CPU processing and memory resources. Because the application demands significant memory resources to run adequately, scheduling those jobs on the desktop
resources during working hours unacceptably interferes with the user's day-to-day work. However, at many companies, when the employees go home, the machines sit idle and therefore represent a significant opportunity cost for the enterprise. The LSF software enables run windows or dispatch windows to be set up for an individual host or a group of hosts. We advise our customers to use the dispatch window so that no jobs will dispatch to a desktop host after a certain hour, say 7 AM. If the majority of jobs run
less than 1 hour, it's likely that overnight jobs dispatched to desktops will finish before users activate their machines. In the event that a dispatched job is running for a long time, the scheduling software can be programmed to suspend or terminate and subsequently requeue the job. For a low-priority regression job, this lost work presents a small cost in light of the benefits of the extended computing cycles gained by running overnight on the desktop machines.
For many enterprises, CPU and memory
resources are abundant, but software licenses are scarce. For this reason, it's insufficient to schedule a job based solely on an available machine; such users must combine a lightly loaded machine one with adequate physical memory and a software license. LSF's architecture for communicating network-wide shared resources or host-specific resources allows dynamic changes in license availability to be communicated to the scheduler. The mechanism for this process is the external load information manager (ELIM)
provided, in this case, by Blackstone Technology Group. This script can interrogate any software license managed by the Globetrotter Flexlm system and communicate its status to the LSF scheduler. We have found this technique reliable and robust, even in an environment that doesn't access all applications through LSF. The scheduling algorithm of the LSF software can easily allow jobs from a regression queue to consume all the licenses, to the exclusion of the interactive users. To address this problem, our
script includes provisions for specifying the time of the day acceptable to use all of the licenses (evenings), and when a buffer of licenses for users interactive use should remain (days).
In addition to the ELIM script method described above, LSF enables more simplistic license management policies as well. A simple approach is to schedule the job regardless of license availability, and requeue the job if it exits with a specific return code. The Model Technology simulator, for example, exits with a
value of 4 if the program has exited without a license. Another approach defines an application-specific queue and limits the number of jobs executing from that queue to equal the number of licenses for that particular application. This alternative works well if all of the requests for that license flow through that queue and if all of the jobs do in fact require that particular license as a condition of execution.
Yet another approach defines a pre-exec script, which runs prior to the actual job on the
execution host. This script tests the availability of the license and returns a zero if it's ok to run the job, while a non-zero returns the job to the front of the queue. Finally, LSF allows a jobstarter, which runs intrinsic to the job on the execution host, takes the actual user job command as an argument, and executes it. The jobstarter script parses the output log of the job and looks for error status or license status messages. Upon finding errors, the script returns a known program status code that
instructs the LSF scheduler to requeue the job because of license failure. Among all these flexible strategies, we have found the external script to be the most reliable and most straightforward approach.
We preempt this program
Many environments require the preemption of hardware and software license resources. LSF supports the preemption of low-priority jobs by high-priority jobs. The programmer can instruct the LSF system to take on the lower-priority job upon preemption (suspend, terminate,
re-nice processor priority). If there are abundant CPU job slots to run jobs, but a scarcity of software licenses, the preemption policy becomes more complex, as merely suspending a job doesn't free up the license. In order to free the license, the job must either be terminated or the license "removed" and appropriately "restored" upon resumption. For low-priority, short regression jobs, termination preemption might be a small cost to pay. The job is simply requeued and is rescheduled when ample resources
become avaiable. If the regression job took a long time to run and was nearing the end of the run, then termination could be a very costly action to take. One solution to this problem is to have the preempting job's preexecution command interrogate the jobs running from the preemptive queue. This would determine the last scheduled job from that queue and either suspend and remove, or terminate and requeue, the job as appropriate. This intelligence makes sure we interrupt the job with the smallest
opportunity cost if we need to rerun the job.
Some applications behave nicely when suspended or put into the background. For example, the VCS simulator from Synopsys, upon receiving a control-Z signal, will suspend and relinquish its license. Upon resumption it rerequests the license and resumes execution with no lost work. In this scenario, LSF is programmed to send the appropriate control-Z signal to the preemptive VCS job. Subsequently, when the preempting job is finished, it keeps track of the preempted job
and sends the appropriate resume signal to it before exiting.
Boarding the data local
For many companies, the data network is the precious resource in the load-sharing network. Some applications require significant I/O resources in addition to CPU and memory. In this case, administrators must consider the locality of data as a requirement to scheduling. The simplest approach simply copies the data file to the execution host before the execution, and then read/writes to a scratch area on the
execution host. However, for very large data sets this solution may not be practical. In this case, it might make sense to cache these data sets on specific hosts on the network and to see them as resources associated with those hosts. Under this scenario, the LSF scheduler would see those resources either by cluster configuration or via ELIM script; the scheduler would see to it that the job resource requirements were then matched to hosts with the appropriate data set. This approach becomes, however,
impractical if the number of different data set resources is large. (For instance, LSF imposes a maximum of 128 external resources to be defined via ELIM.) In this case, an external database can serve to maintain the data set-host location relationship. This information then works at job submission time to modify the job's host preference list, requiring the scheduler to pick the best host from a subset list of hosts, as defined by the current status of the database. (Blackstone Technology Group has created a
data-abstraction enhancement to the LSF system to provide this feature set.)
In general, it's best to rely on the network to serve the data up until the point where it becomes clear that simply increasing network bandwidth won't address the bottleneck. Only then should you consider the added complexity of caching data sets on specific hosts to achieve faster throughput.
LSF supports remote execution across heterogeneous platforms. The mechanism to support this is the "jobstarter." The jobstarter
script runs on the execution host, taking the actual job command line as an argument and running it. It's easy to set up this script to detect the host type and the appropriate paths and environment variables to execute the proper binaries. The jobstarter script is also useful for setting external environment parameters, such as with Rational Software's Clearcase "view" settings.
Tuning up
The LSF software maintains detailed logs of all jobs processed by the system, who submitted them, how long
the job waited in the queue, how much memory and CPU resource the job consumed, and other relevant data. Used in a historical context, this information provides a detailed summary about who is using the resources, the level of resource utilization, and where the bottlenecks are. In addition to detailed batch logs, Platform Computing offers the LSF Analyzer graphical application to plot and chart detailed utilization information for host load indices (CPU, memory, paging rates, I/O rates, and other
categories). Finally, the analyzer allows the definition of a rate multiplier for a load index, and can generate charge back accounting reports for individual users, groups, or projects.
In general, with a well-tuned LSF cluster, the administrator should expect to see close to 100-percent CPU utilization, with nearly all hosts busy (LSF's indication for a host unavailable to accept new jobs). If the number of pending jobs stays high, and CPUs remain idle, a mismatch of resources in the environment may be
occurring. Either the job's resource requirement is too strict (in terms of memory or host preference), or the hosts themselves aren't configured properly, or there is a scarce resource (such as a software license). If hosts are running multiple jobs per CPU and the paging rates are high, the administrator should reduce the number of running jobs allowable per CPU. It's almost never preferable for jobs to page. If CPUs are less than 100 percent utilized, and jobs are backing up in the queues, then some other
scarce resource must be isolated.
The information available in LSF's logs also enables the implementation of an effective alarm system. Here an administrator or manager is notified when the pending backlog is too high for too long, or when the number of available CPUs falls below a certain threshold, or when jobs stand in the queue for more than a specified amount of time. This data capture and reporting system can provide the basis of a total Quality of Service (QOS) contract between a user organization and
its support staff. Administrators can fully document the QOS, providing complete utilization and pending backlog documentation to show higher levels of management how resources are being applied and to make decisions about future investments in hardware and software components.
Overall, the use of server distribution software has increased the utilization of hardware and software licenses by a factor of two or more, while easing access to needed compute time for users. For example, one customer
estimated that the savings in reduced time spent managing compute cycles equaled a half hour per engineer per day.
Ron Ranauro is president and co-founder of Blackstone Technology Group, Inc., a firm that designs and builds compute farms. Previously, he worked in sales, marketing, and engineering within the electronics industry.
Send electronic versions of press releases to
news@isdmag.com
For more information about isdmag.com e-mail
webmaster@isdmag.com
Comments on our editorial are welcome.
Copyright © 2000
Integrated System Design
Magazine