Supercomputers Get Personal: BYTE - May, 1990 by Mike Russell

STATE OF THE ART DESKTOP SUPERCOMPUTING

Supercomputers Get Personal

Torque's new i860-based ComputeServer is fresh and fast: an attractive way to serve up computational power

Sam Bogoch, Iain Bason, Jeff Williams, and Mike Russell

Bringing supercomputer performance to your desktop is not so much a question of technology as it is one of economics. Many companies build workstations and coprocessors that, at least in some areas, deliver supercomputing performance from desktop machines. The problem is that you can spend a fortune trying to outfit everyone in your organization with high-priced hardware. The ComputeServer from Torque Computer offers an affordable alternative.

The ComputeServer (see the photo) is a parallel-processing computing engine that delivers supercomputing performance by putting a "desktop supercomputer" on a LAN. As a peripheral resource for personal computers and workstations, it is analogous to the already-popular print and file servers found on most LANs. In fact, it is the large installed base of standard networks (the ComputeServer uses Ethernet) that makes the computational server concept viable. The server approach is also consistent with the needs of most desktop "power" users, who wouldn't use a supercomputer full-time but would require its power in bursts.

Illustration: Cary Henrie © 1990

Client-Server Computing

The ComputeServer uses a client-server architecture, much like the database servers that are becoming so popular on LANs. Applications are split into client and server portions. The client portion runs on your desktop machine and provides the user interface for the application. The server portion consists of computationally intensive code residing on the ComputeServer. The client calls this code as needed, and the server returns the results (see figure 1).

While the ComputeServer is transparent to users, it is not transparent to programmers. Applications must be modified to take advantage of its power. Fortunately, many programmers are already using structured-programming techniques to separate user-interface modules from computationally intensive ones, so converting code to support a true client-server approach is not difficult.

The system software takes advantage of modular programs by using the Linda memory model originated by David Gelernter at Yale (see "Getting the Job Done," November 1988 BYTE). Linda adds just six statements to a conventional language such as C, yet it easily splits client and server functions and handles all the communications between the two. Just as important, Torque's Linda implementation allows multiple server subtasks to execute simultaneously on the ComputeServer's multiple processors.

Linda automatically handles the two main problems in getting one or more applications to run on multiple processors: process creation and data consistency. It spins off tasks to run on the different processors and makes certain the tasks send and receive the right data at the right times.

Torque's implementation of the Linda model asynchronously spins off C functions, as well as linked code in FORTRAN 77 or other languages, onto one or more remote processors. Linda treats the multiple CPUs as a processing pool. It can allocate any processor to one or more subtasks that the application programs create. Linda also creates software-based global memory for passing data between the client machine and the ComputeServer's multiple processors. The system subdivides the available processor pool on the fly among multiple clients and keeps each client's memory space separate during execution.

[Figure 1 diagram, "ComputeServer Software." The user interface on the desktop system and the processes on the ComputeServer exchange tuples through tuple space using eval, out, and in. Tuple groups labeled in the diagram: regions to process (based on number of processes), parameters (set, pixel granularity, etc.), and control tuples (start, stop, quit); completed scanlines (arrays), progress information (errors, out of memory, etc.), and optional ready signals from processes; boundary conditions.]

Figure 1: A ComputeServer application exists on both the client and server machines. Communication between the two parts occurs in tuple space.

End of the Rainbow

The processors at the heart of the ComputeServer supply its supercomputing power. The system contains from one to 16 Intel i860s, which offer marked performance advantages over other current RISC and complex-instruction-set-computer (CISC) processors (see "Intel's Cray-on-a-Chip," May 1989 BYTE). The 1-million-transistor i860 includes a RISC core, a vector-capable 64-bit
FPU, a 4K-byte instruction cache and an 8K-byte data cache, and hardware support for three-dimensional graphics primitives. The tight coupling of these functions within a single chip via fast, wide (up to 128 bits) on-chip buses provides much faster floating-point performance than the multichip sets needed to implement other RISC architectures.

The i860 can reach 66 million floating-point operations per second (MFLOPS) at peak performance, although only a few applications, such as neural-network simulations, can effectively harness all the potential of the microprocessor's multiply-accumulate pipeline. A more realistic performance figure is the 17 MFLOPS quoted by Intel for running an optimized LINPACK suite. The processor should be able to sustain 10 MFLOPS in most nonoptimized applications. So, a ComputeServer with 16 processing units is capable of more than 1000 MFLOPS at peak performance (16 x 66 = 1056) and 160 MFLOPS sustained (16 x 10).

By comparison, a Mac IIx can sustain 0.4 MFLOPS; a SPARCstation 1, 1.2 MFLOPS; and a Cray X-MP with four processors, 200 MFLOPS. A uniprocessor ComputeServer for $20,000 runs nearly as fast as the original Cray 1 on many applications.

The i860 also offers an upgrade path to superscalar technology, also known as very long instruction word technology (see "VLIW: Heir to RISC?" August
1989 BYTE). A superscalar architecture features multiple integer and floating-point arithmetic units within a single processor. Advanced compilers can simultaneously assign different jobs to these units. Superscalar is a form of parallel processing called microparallelism, which, unlike the macroparallelism of the ComputeServer's multiple processors, is transparent to the programmer. As superscalar technology matures, future Torque machines will continue to incorporate both microparallelism and macroparallelism.

Ties That Bind

One of the central concerns in parallel processing is how to most effectively tie multiple processors together. In distributed-memory architectures, each processor has private RAM and communicates with the other processors via messaging links. In shared-memory architectures, processors share the same bus and main memory.

Often, software development issues obscure the debate about the underlying pros and cons of each of these hardware architectures. The key issue becomes which architecture is easier to program for a given job, rather than which one is inherently better. Because programming a shared-memory machine is often easier than programming a message-passing one, most people are willing to live with the higher hardware costs and lack of scalability associated with the former.

However, C-Linda and other new development tools have made these architectures interchangeable from a programmer's point of view. Either architecture, or a hybrid of the two, can perform most parallel-programming tasks handily. Linda has shown that the real key to successful parallel processing is a machine's ability to provide high bandwidth and low latency for interprocessor communications.

The Virtual Tree (VT) architecture used in the ComputeServer combines many of the advantages of shared-memory and message-passing architectures.
Developed by the ComputeServer design team in 1987 for the multi-8086 Parallon processor, the VT uses a layered hierarchy of messaging buses optimized for burst-mode transmission. The tree is called "virtual" because each branching layer is implemented as a fast-messaging bus rather than as many slower point-to-point links. Unlike the buses on shared-memory systems, the fast-messaging buses pass messages rather than access and arbitrate shared-memory locations. Thus, you
don't need the complex bus-snooping hardware of shared-memory systems, and processors on one bus do not need knowledge of transactions occurring on other buses. The kicker is that, because the system passes messages by way of these very fast shared channels, it can adapt to changing communications loads without employing any complex routing overhead.

The VT also provides a hardware-broadcast facility that dramatically improves bus utilization when you have to share data among many processors. The single-layer VT bus used in the ComputeServer is 64 bits wide, features burst mode, and is capable of 66-megabyte-per-second point-to-point transfers and 1-gigabyte-per-second effective broadcast transfers when all 16 processors are "listening." The bus supports up to 32 devices within a given layer. The system can employ addresses that the processors don't use to accommodate parallel I/O devices such as multiple disk drives, frame buffers, and real-time data acquisition systems.

Each processor board contains one or two complete i860 processing units, each with up to 16 MB of local static-column DRAM, and interfaces to both the VT and the system I/O bus (see figure 2). The ComputeServer backplane can hold up to eight processor boards (16 CPUs). This modular architecture allows board-level upgrades to increase the number of processors, or board swaps to new CPUs. For instance, if you have an earlier, transputer-based ComputeServer, you can upgrade to an i860 system for the cost differential between the new system and your current one.

ComputeServer I/O

Central to the computational-server concept is the ComputeServer's ability to connect to standard networks. Although the I/O requirements for a computational server are less complex than those for file and database servers, rapid response to input tasks and sustained network performance are still critical.
The system employs a dedicated 386-based processor to handle I/O functions, and it features an Extended Industry Standard Architecture (EISA) backplane to ensure compatibility with both current and future high-speed networks. The I/O processor runs Unix System V/386 and manages the ComputeServer's built-in hard disk, which stores both executable code and temporary data files. The widespread availability of networking hardware and drivers for this I/O processor and bus combination means that the ComputeServer can support new networks without redesign. The high-level system software must simply be ported to the new I/O boards.

[Figure 2 diagram, "ComputeServer Hardware." Labeled elements: connection to the desktop system, network or SCSI adapter, I/O processor (386-based), disk subsystem.]

Figure 2: The ComputeServer uses an industry-standard I/O bus and a proprietary fast-messaging bus. Each processor has its own local memory.

Software Architectures

For all its high-tech hardware, the ComputeServer would be useless without software capable of harnessing its power. One option considered was to implement a system using the smart-terminal model of computation of the X Window System (referred to as X Window hereafter). The X Window model has a limited set of primitives for the desktop machine (the "window server" in X Window terminology) to execute, and it effectively runs the entire application on the server (the "window client"). These standard primitives express any communication between client and server. This places the desktop system in the role of intelligent terminal; it translates your input into primitives that the server uses, and primitives from the server into on-screen graphical elements.

The Torque ComputeServer connects to standard networks to deliver unparalleled computing power to Macintosh, DOS, and SunOS clients.

The ComputeServer system decides which tuples to pre-send based on a mix of compile-time and run-time factors. For instance, it's better to pre-send tuples that rd() operations are going to match, since rd() leaves the tuple in place for other operations, whereas in() removes it.

The ComputeServer also minimizes network communications by having one of its processes act as a proxy for the desktop computer. All tuple searching and synchronization are performed between the proxy and the other processors in the ComputeServer. When a tuple is chosen, it's sent back to the desktop computer.

These kinds of optimization are critical in sustaining a sense of tuple space as a global resource, while at the same time providing you with a responsive system. If application code must be optimized around a system's shortcomings, then that system is meeting the Linda model in name only. In fact, an ideal multiprocessor system should go out of its way to accommodate "mistakes."

One example of this approach is dynamic load balancing. Because a distributed Linda kernel can both allocate time slices and monitor processes, it could actively seek out execution bottlenecks and take appropriate actions. For instance, an iterative process that repeatedly holds up others could be given successively larger time slices to minimize overall execution time.

Tools of the Trade

You can create client-server applications without having to buy ComputeServer hardware. The Torque developer's toolkit is a self-contained C-Linda implementation running on the client machine that allows code to be compiled, tested, and debugged as multiple threads on that system. The Linda primitives incur a very small speed penalty, typically 2 percent to 5 percent, when the application runs completely on the client. Thus, it's reasonable to maintain only one version of an application's source code that can be compiled for client-only or client-server operation.
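
To make the idea of running a Linda program as multiple threads on a single machine more concrete, here is a minimal sketch in Python rather than C-Linda. It is not Torque's toolkit or its API. The class name TupleSpace, the method name in_ (in is a reserved word in Python), the use of None as a wildcard, the worker and render functions, and the -1 "quit" value are all assumptions made for illustration. The sketch shows the difference noted above: rd() leaves a matching tuple in place, while in() removes it.

```python
import threading


class TupleSpace:
    """Toy single-machine tuple space (illustrative only, not Torque's code)."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, *fields):
        """Deposit a tuple and wake any operations waiting for a match."""
        with self._cond:
            self._tuples.append(tuple(fields))
            self._cond.notify_all()

    def _wait_for(self, pattern):
        """Block (caller holds the lock) until some tuple matches pattern.

        None in the pattern acts as a wildcard; that is this sketch's choice.
        """
        while True:
            for tup in self._tuples:
                if len(tup) == len(pattern) and all(
                    p is None or p == f for p, f in zip(pattern, tup)
                ):
                    return tup
            self._cond.wait()

    def rd(self, *pattern):
        """Return a matching tuple but leave it in tuple space."""
        with self._cond:
            return self._wait_for(pattern)

    def in_(self, *pattern):
        """Return a matching tuple and remove it from tuple space."""
        with self._cond:
            tup = self._wait_for(pattern)
            self._tuples.remove(tup)
            return tup

    def eval(self, func, *args):
        """Simplified stand-in for Linda's eval: run func in a new thread."""
        thread = threading.Thread(target=func, args=args)
        thread.start()
        return thread


def worker(ts):
    """Hypothetical server subtask: claim regions until told to quit."""
    _, width = ts.rd("params", None)      # shared by all workers, so rd()
    while True:
        _, row = ts.in_("region", None)   # claimed by one worker, so in_()
        if row < 0:                       # control tuple: quit
            return
        ts.out("scanline", row, [row * x for x in range(width)])


def render(rows=4, width=3, n_workers=2):
    """Hypothetical client side: post work, then collect finished scanlines."""
    ts = TupleSpace()
    ts.out("params", width)
    for row in range(rows):
        ts.out("region", row)
    threads = [ts.eval(worker, ts) for _ in range(n_workers)]
    image = [ts.in_("scanline", row, None)[2] for row in range(rows)]
    for _ in threads:
        ts.out("region", -1)              # one quit tuple per worker
    for thread in threads:
        thread.join()
    return image, ts


if __name__ == "__main__":
    image, _ = render()
    for line in image:
        print(line)
```

The client and the workers never call each other directly; they only deposit and withdraw tuples, which is why the same program structure could in principle run either as threads on one machine or across separate processors.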
Once the code has been debugged on the desktop, it can be recompiled on the ComputeServer for client-server operation. Likewise, source code developed for the earlier transputer-based ComputeServer can be recompiled for the new machine without modification.

Mechanically, compilation for client-server operation is a matter of adding a few lines to the make file, because the ComputeServer compilers are themselves Linda applications that run from the desktop. Error messages from the Torque compilers are displayed within the standard programmer environments (MPW on the Mac, Microsoft's on the PC, and SunTools on the Sun).

The ComputeServer supports languages other than C-Linda. It does so in two ways: either as traditional, single-threaded languages callable from a C-Linda framework, or as intrinsically parallel superstructures running above the Linda run-time environment. Examples of the former include FORTRAN 77 with VAX extensions and ISO Pascal. Examples of the latter are Paralogic's n/Prolog, which is a multiprocessor interpreter written in C-Linda, and Strand's Strand88, a parallel language with sophisticated job-control functions. Finally, several high-level graphics tools are under development for the system, including the PPSE parallel CASE tools developed at Oregon's Advanced Computing Institute.

Join the Party

Torque has concentrated its efforts on applications that combine an advanced user interface with sustained number crunching. Most of these involve 3-D graphics, simulation, and image processing. Several prominent vendors have pledged support for the system, including MacroMind (Three-D 1.1) and Wolfram Research (Mathematica kernel). Smaller niche vendors are also supporting the system, including Market Engineering (Crystal Ball, for Monte Carlo forecasting) and Pre-Press Technologies (SpectreSeps color-separation software). ComputeServer-capable versions normally cost more than their desktop counterparts, but they provide considerably higher performance and support multiple users. A number of embedded applications are also in the works, including exposure control for submicron lithography at Lepton (an AT&T spin-off).

Third-party ComputeServer developers represent a collection of seemingly unrelated specialties, with little in common except the need for speed. In fact, support for this system by "mainstream" applications is almost an oxymoron: If a product performs well enough to be sold by the millions on today's desktops, it probably does not need the ComputeServer. Rather, we expect that the ComputeServer will help bring today's niche products into tomorrow's mainstream.

There's nothing to keep ad agencies from ray-tracing animated sequences if each frame takes only minutes to produce. There's nothing to keep financial analysts from running 100 Monte Carlo variations on a spreadsheet if the job takes only slightly longer than a single recalculation. And there's nothing to keep desktop publishers from running color separations alongside Linotronic output when the job no longer takes hours to complete. Such applications will become commonplace when access to the necessary computing power becomes available.

A new kind of computer, the computational server, will complement desktop machines for compute-intensive applications. By treating computational power as a shared resource, the ComputeServer delivers lots of FLOPS at a reasonable price per desktop, and it does so without forcing you to sacrifice the computing environment you're comfortable with. Given the ever-increasing importance of LANs, the ComputeServer is truly a machine for the 1990s.
Sam Bogoch, Iain Bason, Jeff Williams, and Mike Russell design and develop computational servers for Torque Computer, Inc. (New York). They can be reached on BIX as "sbogoch."