International Engineering Journal For Research & Development E-ISSN No: 2349-0721 Volume 1: Issue 1
AN ANALYSIS OF PARALLEL PROCESSING AT MICROLEVEL
Vina S. Borkar
Dept. of Computer Science and Engineering, St. Vincent Pallotti College of Engineering and Technology, Nagpur, India
vinaborkar@gmail.com
------------------------------------------------------------------------------------------------------------------------
Abstract: To achieve high performance, processors rely on two forms of parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). ILP and TLP are fundamentally equivalent: both identify independent instructions that can execute in parallel and can therefore utilize parallel hardware. In this paper we begin by examining the issues that program structure imposes on ILP (dependencies, branch prediction, window size, latency), and then present thread-level parallelism as an alternative or addition to instruction-level parallelism. The paper explores parallel processing on an alternative architecture, simultaneous multithreading (SMT), which allows multiple threads to compete for and share all of the processor's resources every cycle. The most compelling reason for running parallel applications on an SMT processor is its ability to use thread-level parallelism and instruction-level parallelism interchangeably. By permitting multiple threads to share the processor's functional units simultaneously, the processor can use both ILP and TLP to accommodate variations in parallelism.
Keywords: TLP, ILP, branch prediction, coarse-grain multithreading, SMT.
I. Introduction
Instruction-level parallelism (ILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations, such as memory loads and stores, integer additions, and floating-point multiplications, to execute in parallel [1]. Like circuit speed improvements, but unlike traditional multiprocessor parallelism and massively parallel processing, these techniques are largely transparent to users. The most basic ILP technique is pipelining. Pipelining breaks a processor into multiple stages and creates a pipeline that instructions pass through, much like an assembly line: an instruction enters at one end, passes through the different stages of the pipe, and exits at the other end. VLIWs and superscalars are examples of processors that derive their benefit from instruction-level parallelism, and software pipelining and trace scheduling are example software techniques that expose the parallelism these processors can use. A superscalar machine is one that can issue multiple independent instructions in the same cycle. A superpipelined machine issues one instruction per cycle, but its cycle time is set much less than the typical
instruction latency. A VLIW machine [8] is like a superscalar machine, except that the parallel instructions must be explicitly packed by the compiler into very long instruction words. A multithreaded processor aims to increase processor utilization by sharing resources at a finer granularity than a conventional processor [3]. SMT is a technique that permits multiple independent threads to issue multiple instructions each cycle to a superscalar processor's functional units.
This paper is organized as follows. Section II discusses how a processor executes with ILP. Section III discusses issues affecting ILP. Section IV discusses how ILP support can be used to exploit TLP on a multithreaded processor. Section V describes SMT, how it works, and how it exploits both TLP and ILP.
II. Execution with ILP
A typical ILP processor has the same type of execution hardware as a normal RISC machine. The difference between a machine with ILP and one without is that there may be more of that hardware, for example several integer adders instead of just one, and that the control will allow, and possibly arrange, simultaneous access to whatever execution hardware is present. The execution hardware of a simplified ILP processor consists of multiple functional units. Typically, ILP execution hardware allows multiple-cycle operations to be pipelined, so we may assume that all operations can be initiated each cycle. Instruction-level parallel execution means that multiple operations are simultaneously in execution, either because they were issued simultaneously or because the time to execute an operation is greater than the interval between the issuance of successive operations. A superscalar that has two data paths can fetch two instructions simultaneously from memory; this means the processor must also have double the logic to fetch and decode two instructions at the same time [2]. For example, if the longest-latency operation takes 10 cycles and one such operation is issued every cycle, this hardware could have 10 operations "in flight" at once, which would give it a maximum possible speed-up of a factor of 10 over a sequential processor with similar execution hardware.
III. Different issues with ILP
To exploit instruction-level parallelism we must determine which instructions can be executed in parallel. If two instructions are parallel, they can execute simultaneously in a pipeline without causing any stalls, assuming the pipeline has sufficient resources. If two instructions are dependent, they are not parallel and must be executed in order, though they may often be partially overlapped.
A. Dependencies and hazards
Determining how one instruction relates to another is critical to determining how much parallelism is available to exploit in an instruction stream. If two instructions are not dependent then they can execute simultaneously, assuming sufficient resources, that is, no structural hazards. Obviously, if one instruction
depends on another, they must execute in order, though they may still partially overlap. It is imperative, then, to determine exactly how much and what kind of dependency exists between instructions. The following sections describe the different kinds of non-structural dependency that can exist in an instruction stream. There are three types: data dependencies (also called true dependencies), name dependencies, and control dependencies.
1. Data dependencies
An instruction j is data dependent on instruction i either directly, where instruction i produces a result that may be used by instruction j, or indirectly, where instruction j is data dependent on instruction k and k is data dependent on i, and so on. Indirect data dependence means that one instruction is dependent on another if there exists a chain of dependencies between them; this chain can be as long as the entire program. If two instructions are data dependent, they cannot execute simultaneously nor be completely overlapped. A data dependency can be overcome in two ways: maintaining the dependency but avoiding the hazard, or eliminating the dependency by transforming the code. Code scheduling is the primary method used to avoid a hazard without altering the dependency. Scheduling can be done in hardware or by software; in this paper, in the interests of brevity, only the hardware-based solutions are discussed. A data value may flow between instructions through registers or memory locations. When registers are used, detecting the dependence is reasonably straightforward, as register names are encoded in the instruction stream. Dependencies that flow through memory locations are much more difficult to detect, as the effective address of the memory location needs to be computed and cannot be determined during the instruction-decode phase. Compilers can be of great help in detecting and scheduling around these sorts of hazards; hardware can only resolve these dependencies with severe limitations.
2. Name Dependencies
The second type of dependence is a name dependency. A name dependency occurs when two instructions use the same register or memory location, called a name, but there is no flow of data between them. There are two types of name dependencies between an instruction i that precedes instruction j: an anti-dependence occurs when j writes a register or memory location that i reads (the original value must be preserved until i can use it), and an output dependence occurs when i and j write to the same register or memory location (in this case instruction order must be preserved). Both anti-dependencies and output dependencies are name dependencies, as opposed to true data dependencies, since there is no information flow between the two instructions.
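A short fragment makes these dependence kinds concrete. The C statements below stand in for machine instructions; the variables are our own, purely illustrative:

    a = b + c;   /* i: writes a                                           */
    d = a * 2;   /* j: reads a  -- true data dependence on i              */
    b = e + 1;   /* k: writes b, which i reads -- anti-dependence on i    */
    a = f - g;   /* l: writes a, which i also writes -- output dependence */

Renaming removes both name dependencies: if k wrote a fresh location b' and l wrote a fresh location a', only the true dependence between i and j would constrain execution order. This is exactly what hardware register renaming (Section V) does automatically.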
3. Data Hazards
A data hazard is created whenever there is a data dependency between instructions and they are close enough that the pipeline would stall or instructions would otherwise need to be reordered. Because of the dependency, we must preserve program order, that is, the order in which the instructions would execute in a non-pipelined sequential processor. A requirement of ILP is to maintain the correctness of a program and reorder or overlap instructions only when correctness is not at risk. There are three types of data hazards: read after write (RAW), where j tries to read a source before i writes it—this is the most common type and corresponds to a true data dependence; write after write (WAW), where j tries to write an operand before it is written by i—this corresponds to an output dependence; and write after read (WAR), where j tries to write a destination before i has read it—this corresponds to an anti-dependence. Self-evidently, the read after read (RAR) case is not a hazard.
4. Control Dependencies
A control dependency determines the order of an instruction i with respect to a branch, so that i is executed in correct program order only if it should be. The first basic block in a program is the only block without some control dependency. Consider the statements:
if (p1) S1;
if (p2) S2;
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1. In general there are two constraints imposed by control dependencies: an instruction that is control dependent on a branch cannot be moved before the branch, and, conversely, an instruction that is not control dependent on a branch must not be moved after the branch in such a way that its execution would be controlled by the branch.
IV. Using ILP Support to Exploit Thread-Level Parallelism
Increasing performance by using ILP has the great advantage of being reasonably transparent to the programmer, but ILP can be quite limited or hard to exploit in some applications. For example, an online transaction-processing system has natural parallelism among the multiple queries and updates that are presented by requests. These queries and updates can be processed mostly in parallel, since they are largely independent of one another. This higher-level parallelism is called thread-level parallelism because it is logically structured as separate threads of execution. A thread is a separate process with its own instructions and data. A thread may represent a process that is part of a parallel program consisting of multiple processes, or it may represent an independent program on its own. Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute. Unlike instruction-level parallelism, which exploits implicit parallel operations within a loop or straight-line code segment, thread-level parallelism is explicitly represented by the use of multiple threads of execution that are inherently parallel. Thread-level parallelism is an important alternative to instruction-level parallelism primarily because it can be more cost-effective to exploit. Thread-level and instruction-level parallelism exploit two different kinds of parallel structure in a program. A data path designed to exploit high amounts of ILP will find that functional units are often idle because of either stalls or dependences in the code.
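As a minimal sketch of explicit thread-level parallelism, the following C program models the transaction-processing scenario above: each "query" is an independent unit of work run on its own POSIX thread. The workload and all names are our own illustration:

    #include <pthread.h>
    #include <stdio.h>

    #define NQUERIES 4

    /* Each query is independent: no thread needs another's result,
       so all of them can run in parallel. */
    static void *run_query(void *arg) {
        int id = *(int *)arg;
        long sum = 0;
        for (long i = 0; i < 1000000; i++)   /* stand-in for real query work */
            sum += i % (id + 2);
        printf("query %d done (checksum %ld)\n", id, sum);
        return NULL;
    }

    int main(void) {
        pthread_t tid[NQUERIES];
        int id[NQUERIES];
        for (int i = 0; i < NQUERIES; i++) {
            id[i] = i;
            pthread_create(&tid[i], NULL, run_query, &id[i]);
        }
        for (int i = 0; i < NQUERIES; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

On a conventional superscalar these threads time-share the core; on an SMT processor (Section V) they could share the core's issue slots within a single cycle.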
Multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion. To permit this sharing, the processor must duplicate the independent state of each thread: a separate copy of the register file, a separate PC, and a separate page table are required for each thread. The memory itself can be shared through the virtual memory mechanisms, which already support multiprogramming. In addition, the hardware must support the ability to change to a different thread relatively quickly; in particular, a thread switch should be much more efficient than a process switch, which typically requires hundreds to thousands of processor cycles. There are two main approaches to multithreading.
Fine-grained multithreading switches between threads on each instruction, causing the execution of multiple threads to be interleaved. This interleaving is often done in a round-robin fashion, skipping any threads that are stalled at that time. To make fine-grained multithreading practical, the CPU must be able to switch threads on every clock cycle. One key advantage of fine-grained multithreading is that it can hide the throughput losses that arise from both short and long stalls, since instructions from other threads can be executed when one thread stalls. The primary disadvantage is that it slows down the execution of the individual threads, since a thread that is ready to execute without stalls will be delayed by instructions from other threads. Sun's UltraSPARC T1 (Niagara) uses fine-grained multithreading.
Coarse-grained multithreading was invented as an alternative to fine-grained multithreading. It switches threads only on costly stalls, such as level 2 cache misses. This change relieves the need to make thread switching essentially free and is much less likely to slow the processor down, since instructions from other threads will only be issued when a thread encounters a costly stall. Coarse-grained multithreading suffers, however, from a major drawback: it is limited in its ability to overcome throughput losses, especially from shorter stalls. This limitation arises from the pipeline start-up costs of coarse-grained multithreading. Because a CPU with coarse-grained multithreading issues instructions from a single thread, when a stall occurs the pipeline must be emptied or frozen, and the new thread that begins executing after the stall must fill the pipeline before instructions will be able to complete. Because of this start-up overhead, coarse-grained multithreading is much more useful for reducing the penalty of high-cost stalls, where the pipeline refill time is negligible compared to the stall time.
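To make the two switch policies concrete, the toy model below (entirely our own illustration, with an arbitrary synthetic stall pattern) prints which thread issues in each cycle under each policy; '-' marks an idle cycle:

    #include <stdio.h>

    #define THREADS 4
    #define CYCLES  16

    /* Synthetic stall pattern: thread t is stalled in cycle c
       whenever c is a multiple of t + 2. */
    static int stalled(int t, int c) { return c % (t + 2) == 0; }

    int main(void) {
        /* Fine-grained: rotate every cycle, skipping stalled threads. */
        printf("fine-grained:   ");
        int next = 0;
        for (int c = 0; c < CYCLES; c++) {
            int tried;
            for (tried = 0; tried < THREADS; tried++) {
                int t = (next + tried) % THREADS;
                if (!stalled(t, c)) { printf("%d", t); next = (t + 1) % THREADS; break; }
            }
            if (tried == THREADS) printf("-");  /* every thread stalled */
        }
        /* Coarse-grained: stay on one thread; on a stall, switch and
           lose the cycle (standing in for the pipeline refill cost). */
        printf("\ncoarse-grained: ");
        int cur = 0;
        for (int c = 0; c < CYCLES; c++) {
            if (stalled(cur, c)) { printf("-"); cur = (cur + 1) % THREADS; }
            else printf("%d", cur);
        }
        printf("\n");
        return 0;
    }

The fine-grained schedule idles only when all threads stall at once, while the coarse-grained schedule loses a cycle on every switch; a real coarse-grained pipeline would lose several refill cycles, which is why it pays off only for long stalls.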
V. Simultaneous Multithreading
Simultaneous multithreading (SMT) is a variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP at the same time it exploits ILP. The key insight that motivates SMT is that modern multiple-issue processors often have more functional unit parallelism available than a single thread can effectively use. Furthermore, with register renaming and dynamic scheduling, multiple
instructions from independent threads can be issued without regard to the dependences among them; the resolution of the dependences can be handled by the dynamic scheduling capability. Figure 2 conceptually illustrates the differences in a processor's ability to exploit the resources of a superscalar for the following processor configurations:
• A superscalar with no multithreading support
• A superscalar with coarse-grained multithreading
• A superscalar with fine-grained multithreading
• A superscalar with simultaneous multithreading
In the superscalar without multithreading support, the use of issue slots is limited by a lack of ILP, a topic we discussed in earlier sections. In addition, a major stall, such as an instruction cache miss, can leave the entire processor idle. In the coarse-grained multithreaded superscalar, the long stalls are partially hidden by switching to another thread that uses the resources of the processor. Although this reduces the number of completely idle clock cycles, within each clock cycle the ILP limitations still lead to idle cycles. Furthermore, in a coarse-grained multithreaded processor, since thread switching only occurs when there is a stall and the new thread has a start-up period, there are likely to be some fully idle cycles remaining. In the fine-grained case, the interleaving of threads eliminates fully empty slots. Because only one thread issues instructions in a given clock cycle, however, ILP limitations still lead to a significant number of idle slots within individual clock cycles. In the SMT case, TLP and ILP are exploited simultaneously, with multiple threads using the issue slots in a single clock cycle. Ideally, issue slot usage is limited only by imbalances between the resource needs and resource availability of the multiple threads.
Simultaneous multithreading builds on the insight that a dynamically scheduled processor already has many of the hardware mechanisms needed to support the integrated exploitation of TLP through multithreading. In particular, dynamically scheduled superscalars have a large set of virtual registers that can be used to hold the register sets of independent threads (assuming separate renaming tables are kept for each thread). Because register renaming provides unique register identifiers, instructions from multiple threads can be mixed in the data path without confusing sources and destinations across the threads. Simultaneous multithreading has a dual effect on branch prediction, much as it has on caches: it is much less sensitive to the quality of the branch prediction than a single-threaded processor, but better branch prediction is still beneficial for both architectures. With register renaming, the larger SMT register file requires a longer access time; to avoid increasing the processor cycle time, the SMT pipeline was extended two stages to allow two-cycle register reads and two-cycle register writes [8]. Threads on an SMT processor share the same cache hierarchy, so their working sets may introduce inter-thread conflict misses. When increasing the number of threads from 1 to 8, the cache miss component of average memory access time increases by less than 1.5 cycles on average, indicating the small effect of inter-thread conflict misses.
Out-of-order execution, write buffering, and the use of multiple threads allow SMT to hide these small increases in memory latency, and large speedups can be attained.
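As a minimal sketch of the per-thread renaming point above (our own illustration with hypothetical sizes, not a description of any particular processor), the map below assigns each (thread, architectural register) pair its own physical register, so instructions from different threads can never alias one another's operands:

    #include <stdio.h>

    #define NTHREADS   4
    #define NARCH_REGS 32
    #define NPHYS_REGS 256

    /* rename_table[t][r]: physical register currently mapped to
       architectural register r of thread t. */
    static int rename_table[NTHREADS][NARCH_REGS];
    static int next_free = 0;   /* trivial allocator; real hardware recycles freed registers */

    /* A destination write gets a fresh physical register (this also
       removes WAR/WAW name dependencies, as in Section III). */
    static int rename_dest(int thread, int arch_reg) {
        rename_table[thread][arch_reg] = next_free++ % NPHYS_REGS;
        return rename_table[thread][arch_reg];
    }

    /* A source read uses the current mapping. */
    static int rename_src(int thread, int arch_reg) {
        return rename_table[thread][arch_reg];
    }

    int main(void) {
        /* Two threads both write, then read, architectural register r1;
           renaming keeps their values in distinct physical registers. */
        int p0 = rename_dest(0, 1);
        int p1 = rename_dest(1, 1);
        printf("write: thread 0 r1 -> p%d, thread 1 r1 -> p%d\n", p0, p1);
        printf("read:  thread 0 r1 -> p%d, thread 1 r1 -> p%d\n",
               rename_src(0, 1), rename_src(1, 1));
        return 0;
    }

Because every in-flight destination has a unique physical name, the dynamic scheduler can mix instructions from all threads in one issue window without any cross-thread source/destination confusion.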
Figure 2: How four different approaches use the issue slots of a superscalar processor.
The horizontal dimension represents the instruction issue capability in each clock cycle. The vertical dimension represents a sequence of clock cycles. An empty (white) box indicates that the corresponding issue slot is unused in that clock cycle. The shades of grey and black correspond to four different threads in the multithreading processors. Black is also used to indicate the occupied issue slots in the case of the superscalar without multithreading support.
VI. Related Work
Several other architectures have been proposed that exhibit simultaneous multithreading in some form. Tullsen et al. [6] demonstrated the potential for simultaneous multithreading, but did not simulate a complete architecture, nor did that paper present a specific solution to register file access or instruction scheduling. Yamamoto et al. [10] present an analytical model of multithreaded superscalar performance, backed up by simulation; their study models perfect branching, perfect caches, and a homogeneous workload. Hirata et al. [9] present an architecture for a multithreaded superscalar processor and simulate its performance on a parallel ray-tracing application; they do not simulate caches or TLBs, and their architecture has no branch prediction mechanism. Yamamoto and Nemirovsky [11] simulate an SMT architecture with separate instruction queues and up to four threads. In addition, Beckmann and Polychronopoulos [14], Gunther [12], Li and Chu [13], and Govindarajan et al. [15] all discuss architectures that feature simultaneous multithreading, none of which can issue more than one instruction per cycle per thread. The M-Machine [16] and the Multiscalar project [17] combine multiple issue with multithreading, but assign work to processors at a coarser level than individual instructions.
VII. Conclusion
Simultaneous multithreading is an extension of hardware multithreading that increases parallelism in all its forms. SMT combines the instruction-level parallelism exploited by pipelined, superscalar processors with the thread-level parallelism of multithreading. This allows the processor to issue multiple instructions from multiple threads in a single clock cycle, thus increasing overall instruction throughput. SMT attacks multiple sources of lost resource utilization in wide-issue processors.
References
[1] B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel processing: History, overview and perspective. HPL-92-132, October 1992.
[2] D. M. Harris and S. L. Harris. Digital Design and Computer Architecture. Morgan Kaufmann, Amsterdam, 2007.
[3] M. Gulati and N. Bagherzadeh. Performance study of a multithreaded superscalar microprocessor. In Second International Symposium on High-Performance Computer Architecture, pages 291–301, February 1996.
[4] Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm. Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor. In International Symposium on Computer Architecture, May 1996.
[5] Jack L. Lo, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Rebecca L. Stamm, and Dean M. Tullsen. Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, pages 322–354, August 1997.
[6] Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In International Symposium on Computer Architecture, 1995.
[7] Norman P. Jouppi and David W. Wall. Available instruction-level parallelism for superscalar and superpipelined machines. In Third International Symposium on Architectural Support for Programming Languages and Operating Systems, pages 272–282, April 1989.
[8] S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, and D. M. Tullsen. Simultaneous multithreading: A platform for next-generation processors. IEEE Micro 17 (1997).
[9] H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, and T. Nishizawa. An elementary processor architecture with simultaneous instruction issuing from multiple threads. In 19th Annual International Symposium on Computer Architecture, pages 136–145, May 1992.
[10] W. Yamamoto, M. J. Serrano, A. R. Talcott, R. C. Wood, and M. Nemirovsky. Performance estimation of multistreamed, superscalar processors. In Twenty-Seventh Hawaii International Conference on System Sciences, pages I:195–204, January 1994.
[11] W. Yamamoto and M. Nemirovsky. Increasing superscalar performance through multistreaming. In Conference on Parallel Architectures and Compilation Techniques, pages 49–58, June 1995.
[12] B. K. Gunther. Superscalar performance in a multithreaded microprocessor. PhD thesis, University of Tasmania, December 1993.
[13] Y. Li and W. Chu. The effects of STEF in finely parallel multithreaded processors. In First IEEE Symposium on High-Performance Computer Architecture, pages 318–325, January 1995.
[14] C. J. Beckmann and C. D. Polychronopoulos. Microarchitecture support for dynamic scheduling of acyclic task graphs. In 25th Annual International Symposium on Microarchitecture, pages 140–148, December 1992.
[15] R. Govindarajan, S. S. Nemawarkar, and P. LeNir. Design and performance evaluation of a multithreaded architecture. In First IEEE Symposium on High-Performance Computer Architecture, pages 298–307, January 1995.
[16] M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich, and W. S. Lee. The M-Machine multicomputer. In 28th Annual International Symposium on Microarchitecture, November 1995.
[17] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar processors. In 22nd Annual International Symposium on Computer Architecture, pages 414–425, June 1995.