Crash course in CPUs by Jeremy Ford


ISSUE 220 DEC 2008


WWW.PCFORMAT.CO.UK


The future of CPUs

Crash course in CPUs

How do tiny grains of sand turn numbers into stunning 3D games anyway? Adam Oxford goes under the ceramic of the CPU to find out


Light a candle and bake a cake, then pop down to Clintons to pick up an hilarious card – for your CPU has just turned 30! While its best years aren’t behind it quite yet, it could do with cheering up.

In 1978, Intel released its first 16-bit microprocessor, the 8086. Although it was the cheaper, cut-down 8-bit version – the 8088 – that made it into the IBM PC and quite literally changed the world as we know it, today’s Core 2 and Phenom chips are designed to run code based on what’s still called the x86 instruction set. In fact, they still share some important common core characteristics with the venerable 8086.

Quite why it should have been the x86 family is a different story for another time. Intel’s chips were far from the most advanced, cleverest or cheapest available at the end of the 1970s, and had some fairly serious design bugs; affected chips had to be replaced by IBM free of charge some years later. In the annals of our times, though, that will be deemed irrelevant: this was the general purpose processor that drove the desktop revolution. Curiously, one of its competitors – the Zilog Z80, which powered Sinclair’s home computer of (almost) the same name – is actually still manufactured and used today. The 8086, however, has been consigned to history.

Why do we bring these curious factoids up? Because later this month also sees the launch of Intel’s seventh generation of x86 CPUs, the Core i7 (Nehalem). Intel is touting it as the biggest architectural change in the company’s history; and for once we’re actually prepared to believe it.

The secret of x86’s success is, of course, backwards compatibility. Somewhere in the Core i7’s infinitely more complex design are the same 116 instructions that the 8086 could execute, albeit substantially enhanced with later additions, and the same is true of the AMD Phenom.
These are the basic arithmetic and logic commands – like ADD, MUL, OR and XOR – along with a few more specific instructions for which bit of data belongs in which block of memory or system register.

In reality, of course, the things couldn’t be more different today if they tried. The 8086 ran at 5MHz, had a total transistor count of less than 30,000 and was packaged in a 40-pin dual in-line chip: physically, it was one of those long black things with the legs sticking out from the sides like an evil metal spider. The Core i7, by contrast, is a two-, four- or eight-core beast, with up to 1.4 billion transistors in its largest variety. At launch, it will be clocked at well over the 3GHz mark. It has 1567 pin-outs, and comes in the flat FCLGA (flip chip land grid array) packaging that will be familiar from the Core 2 line. That means that balls of solder meet the circuit board head on, and end in simple pads which are then laid on top of pins in the motherboard socket.

We’ve come a long way, clearly. The CPUs of the seventies look like single-celled organisms in primordial processor sludge by comparison to the staggering complexity of today’s chips. It takes teams of hundreds of people several years to design a new CPU, and it’s unlikely that any individual could completely navigate the finished silicon topography by hand.

INSIDE THE SHELL

We can, however, do our bit to improve general understanding by looking at certain core principles of CPU design. Technically speaking, a CPU is any processor that can execute programmable code, but for the purposes of our sanity, we’ll stick to a discussion of modern day x86 chips here.

Though the layout of today’s chips bears as much resemblance to the original 8086s as a dog does to its jellyfish ancestors, the core operational procedure follows the same cycle. The CPU’s task is broken into four stages: fetch, decode, execute and writeback. Instructions are called from a memory store to the registers. These are then interpreted and processed, and a result is written back. This result can be output to, say, a graphics card or hard drive, or called back into the CPU for processing again.

However intricate a processor is, these basic four steps are a good way to understand how it works and why it is designed the way it is. That cycle can be sped up, of course, by increasing the clockspeed of the CPU and the number of cycles it performs per second. Intel learnt the hard way that the


[Diagram: INTEL NEHALEM MICROARCHITECTURE. Blocks shown include branch prediction, the predecode and instruction length decoder, the instruction queue, simple and complex decoders, the loop stream decoder, the decoded instruction queue with micro-op fusion, the register allocation table, the 128-entry reorder buffer, the reservation station, execution ports 0 to 5 (integer/MMX ALUs, FP and SSE units, load and store address units), the memory order buffer, and the ‘uncore’ with the DDR3 memory controller and QuickPath Interconnect. GT/s: gigatransfers per second. Diagram: Appaloosa, http://en.wikipedia.org/wiki/Intel_Nehalem_(microarchitecture)]


key to building a really fast processor isn’t just about raw gigahertz. If a single cycle requires a certain amount of electricity to be performed, more cycles per second means increasing the power consumed and – importantly – the heat produced. The theoretically scalable NetBurst architecture of the Pentium 4 came a cropper when it hit an unexpected top speed barrier beyond which trying to cool the chip was impracticable for most. Which tells us that processor designers may be very clever, but they can’t foresee everything.

In the same way that graphics technology has moved to unified shading in order to make more efficient use of the processing power available, today’s design goals are to keep all the various parts of the CPU working on useful information. Note the inclusion of the word ‘useful’ there.
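
For readers who like to see ideas as code, the four-stage cycle of fetch, decode, execute and writeback can be sketched as a toy interpreter. This is only an illustration: the three-instruction ‘instruction set’, the tuple format and the register names r0 to r3 are invented for the example and bear no relation to how a real x86 chip encodes or executes anything.

```python
# Toy model of the fetch / decode / execute / writeback cycle.
# The instruction format and register names are invented for illustration.

def run(program, registers):
    """Run a list of (opcode, dest, src_a, src_b) tuples; return the registers."""
    pc = 0  # program counter: index of the next instruction to fetch
    while pc < len(program):
        instruction = program[pc]             # fetch
        opcode, dest, a, b = instruction      # decode
        if opcode == "ADD":                   # execute
            result = registers[a] + registers[b]
        elif opcode == "MUL":
            result = registers[a] * registers[b]
        elif opcode == "XOR":
            result = registers[a] ^ registers[b]
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        registers[dest] = result              # writeback
        pc += 1
    return registers


program = [
    ("ADD", "r2", "r0", "r1"),  # r2 = r0 + r1
    ("MUL", "r3", "r2", "r2"),  # r3 = r2 * r2
    ("XOR", "r0", "r3", "r1"),  # r0 = r3 XOR r1
]
final = run(program, {"r0": 2, "r1": 3, "r2": 0, "r3": 0})
print(final)
```

Note that this loop finishes one instruction completely before it touches the next, so three of the four stages are idle at any moment.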

FETCH, DECODE, ETC

The simplest form of CPU takes one piece of data, works out what to do with it, does it and then outputs the result. The inherent problem is that it can only work on one piece of data at a time, and while that’s being passed through to the part of the execution engine that’s designed to perform the requested operation, the rest of the CPU is sitting idle.

The solution to this is to introduce some form of parallelism to the ‘pipeline’. To start with, this might have been simply to have the ‘fetch’ part of the CPU grabbing a new piece of data while the ‘decode’ bit is working on another. That’s been developed somewhat, mind you, and the last iteration of Pentium 4 had a whopping 31 stages to its pipeline.

The problem with long pipelines, however, is that they aren’t always terribly efficient because they’re not always full of useful information. On its journey through the pipeline, a piece of data may return an error or will become reliant on other information being drawn from the registers – if it isn’t there, the

INTEL INTERVIEW

WHO’S AFRAID OF THOSE GP-GPUs?

Ronak Singhal is the Chief Architect for Nehalem. His US-based team was also responsible for the original Pentium 4 design.

Chip design seems very hard work these days. Is there anyone left who actually understands a CPU in its entirety?

At a high level, a lot of people can explain what we’re doing here, here and here. It’s becoming much harder for any one person to grasp all the internal details from an architecture standpoint, as there are so many components integrated on the die. It’s hard to find folks who understand execution units and memory controllers at a very detailed level, and it’s going to become more complicated going forward. The team that developed Nehalem developed the first Pentium Pro processor in the mid-1990s. Back then each of the architects was able to hold all the key details at a fairly low level. We’re way beyond that now: complexity has grown substantially.

How many people does it take?

It depends how you define it – there are some activities where people are there for four or five years, others where people will complete their work in 12 months then move on to another project. If you look at the peak number of people it’s several hundred, but how many, I don’t know. Where do you draw the line?

What’s the best new feature in Core i7?

With each processor generation there’s something that’s evolutionary and something that’s revolutionary. What’s really different with Nehalem is the power management stuff. The concepts of power gating and turbo mode are what will be remembered as revolutionary.

Is your experience with Pentium 4 why you put so much work into this area?

You may read into that the fact that the needle has gone so far in the other direction, that we’ve been burnt by that before. If you’ve had an issue before you make sure that it’s never going to be an issue again. And we all understand the benefits of conserving power.

Is Hyperthreading your baby too?

This is the only team at Intel that could have resurrected Hyperthreading and implemented it. We’ve done it before, we understand the good and the bad, and we’re willing to take on that challenge. We have people on the team who’ve been working on Hyperthreading since day one of the Pentium 4.

Is Nehalem enough to see off the threat of GP-GPUs in the HPC market?

The big benefit of these versus GP-GPUs is the programming environments. You can use standard tools, rather than specialised kits. The x86 tools are extremely mature. The second advantage is the legacy of backward compatibility we have. You know that apps you write today will run on the next-gen processors; with the graphics cards, you haven’t had that yet.

Below Two dual-core dies mounted on a PCB. The lid you see is the Penryn’s heatsink

result will have to be written out while the new piece of data is fetched and the rest of the pipeline will stand idle.

The key workaround for this in today’s CPUs is to build logical areas that are dedicated to ‘branch prediction’ – in other words, guessing which way the code is going to go next and getting the instructions and data it will need ready for insertion into the pipe. Of course, branch predictors aren’t infallible, and if the wrong information is called then you’re back to having large amounts of wasted die area.

A large part of processor design is finding a happy balance between length of pipeline and CPU cycles lost to such stalling. Part of Core i7’s secret to success, for example, is using medium-length pipelines, and including a ‘Second-Level Branch Target Buffer’: an extra bit of memory to cache information and allow the branch to double back on itself if a problem arises.

There are other ways to speed up data throughput too. Your CPU’s inbox is always overflowing with work to be done, but it will rifle through the pages to take the best job next, not necessarily the first one it was given. The order in which instructions are executed is decided by a scheduler, which independently assesses the most efficient way to do them. That might mean looking ahead in the currently running thread and pulling out commands that aren’t dependent on the current operation – known as ‘out of order’ processing – or, in the case of a processor core capable of working on more than one thread at once, starting to work through an entirely different instruction loop that just happens not to


THE GPU CHALLENGE

Much has been made of the latest graphics cards being able to do more than graphics. Both NVIDIA and AMD have been touting the GP-GPU (General Purpose GPU) properties of their DirectX 10 chips and proprietary programming languages which coders can use to unlock said features. In NVIDIA’s case, this is the CUDA development environment; more phonetically pleasing, AMD calls its GP-GPU technology FireStream.

Both are very promising technologies, but one thing should be clear: no matter what happens, they’re no threat to the current CPU architecture of your PC. A processor like the Core 2 or Phenom is designed to be able to do lots of different things at different speeds at the same time. It might be running Windows, Outlook and a heavily branched AI routine in a game program at the same time. Or it may be running the user interface of a hosted telephone exchange. That takes a particular type of chip, and it isn’t a graphics one. It’s long been suspected that NVIDIA wants to get into the processor market: if it does, it won’t be with a GeForce-derived product.

But with the introduction of unified shaders, graphics companies have found themselves with some interesting designs in their inventories. Essentially, the DX10 GPUs are incredible at parallel processing. They can be configured on the fly to take in a lot of similar pieces of data and perform the same operation on all of them. That ability may have been designed to take all the pixels beneath a light source and add a green glow to them, but it’s also useful for the prosaically named High Performance Computing (HPC) applications. These are areas like weather simulators, medical imaging systems, molecular modelling, geophysical analysis and even the financial systems of insurance firms and the Stock Exchange.

Previously, they’ve all relied on enormous server farms filled with x86 processors to perform vector calculations quickly: but a single graphics card is, potentially, a hundred times faster than a blade server at these operations. There will be no GPU-CPU, but NVIDIA has already released the first of its CUDA applications for its desktop cards: making them able to simulate PhysX hardware on a GeForce G80.
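
The ‘same operation on lots of similar data’ idea from this box can be sketched in a few lines. This is a hypothetical illustration only: the function name, the pixel format and the glow amount are made up, and ordinary Python works through the list one pixel at a time, whereas a GPU would hand each pixel to a separate shader unit and process them side by side.

```python
# One small function applied identically to every pixel: the data-parallel pattern.
# Pixel format (r, g, b) and the glow amount are illustrative assumptions.

def add_green_glow(pixel, amount=40):
    """Return the pixel with its green channel raised, capped at 255."""
    r, g, b = pixel
    return (r, min(255, g + amount), b)


pixels = [(10, 20, 30), (200, 240, 50), (0, 0, 0)]

# No pixel's result depends on any other pixel, which is what lets
# hardware with many execution units do them all at once.
glowing = [add_green_glow(p) for p in pixels]
print(glowing)
```

The key property is independence: because no result feeds into another, the work scales across as many processing units as the hardware can offer.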

Above PhysX acceleration is also available in some ATI cards thanks to the CUDA SDK


Above Silicon wafers are carved up into individual processors by tiny saws – not all will survive

need the same parts of the pipeline as the currently running one. To speed things up further, Core i7 can execute up to four instructions per cycle.

Incidentally, it’s also interesting to note that a CPU’s instruction set – the programming language into which all commands are eventually decoded and compiled – isn’t completely hard-wired into the design. There’s a software layer that handles most of the interpretation known as the ‘microcode’ – a form of non-upgradable firmware stored in an on-board ROM, which works as a mini operating system. It’s a useful tool for chip builders: because the microcode isn’t finalised until the chip goes into production – and can be rewritten for a new manufacturing run – any problems or improvements that are discovered after the silicon has been laid out can be changed in the software stack. This is, of course, easier than going back to the drawing board and laying out another million or two transistors.
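
To make the branch prediction discussed earlier a little more concrete, here is a sketch of a two-bit saturating counter, a classic textbook predictor. It is not a description of what Core i7 actually does, which is far more elaborate; the class and function names are invented for the example.

```python
# Two-bit saturating counter: a textbook branch predictor, not Intel's design.

class TwoBitPredictor:
    def __init__(self):
        # States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
        self.state = 2

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Nudge the counter towards what actually happened, within 0..3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes):
    """Feed a sequence of real branch outcomes through the predictor."""
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1  # on real hardware, this is where the pipeline is flushed
        predictor.update(taken)
    return misses


# A loop branch that is taken nine times and then falls through, run three times.
loop_pattern = ([True] * 9 + [False]) * 3
misses = count_mispredictions(loop_pattern)
print(misses, "mispredictions out of", len(loop_pattern))
```

Because it takes two wrong guesses in a row to flip the prediction, the single ‘fall through’ at the end of each loop costs one miss without spoiling the guess for the next run of the loop.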

DEDICATED BITS

If you have a look at the Core i7 block diagram on page 74, you can see that the execution engine is also broken down further into dedicated areas for tasks like integer operations, floating point calculation and SSE instructions. The latter is an acronym of an acronym – Streaming SIMD Extensions, where SIMD stands for Single Instruction Multiple Data. It’s an on-board vector processor capable of performing the same transformation on several pieces of information at once. It’s included on Intel and AMD chips for speeding up things like video processing, where the same command must be performed on, say, all the pixels on a screen simultaneously.

There’s also, of course, the one important part of a CPU that we haven’t talked about yet: the memory. Closest to the actual instruction pipeline are the registers: there are 32 of these on a 64-bit chip, and each can either store a general piece of information or has a specific task or overlapping tasks. In order to help out those prefetch engines we mentioned earlier, though, there are two levels of fast cache memory to store the data which might be needed for the current process, or that has been written out but may be called again. The cache memory is much faster than system memory and prevents the whole system bottlenecking while the RAM is slowly scanned for instructions and data. For multi-core chips, where two or more processors are packaged onto the same die, there’s often a third cache area that is structured to allow the different cores to swap information quickly.

AMD has long led the way in memory access speed: since the introduction of the Athlon 64, its CPUs have been able to talk directly to the memory via a fast proprietary bus. Meanwhile, Intel chips have had to share access to the memory on the same general system bus – the FSB – as all other information travels. With Core i7, however, Intel has finally introduced a technology it calls ‘QuickPath Interconnect’, or QPI. Broadly analogous to HyperTransport, it allows the CPU to talk directly to components like memory without going via the northbridge on the motherboard, which otherwise acts rather like the router in your home network as a central hub for data transportation and can get quite congested. This should prevent Core i7 bottlenecking despite its enormous demand for incoming information.
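
Why a cache helps can be shown with a toy model. The sketch below simulates a small ‘direct-mapped’ cache, one of the simplest textbook arrangements; the sizes (eight lines of 16 bytes) are arbitrary illustrative values, and real L1 and L2 caches are larger and more sophisticated.

```python
# Toy direct-mapped cache. Line count and line size are arbitrary assumptions.

class DirectMappedCache:
    def __init__(self, num_lines=8, line_size=16):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # which memory block each line holds
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // self.line_size  # memory block containing this address
        index = block % self.num_lines     # the one cache line that block may use
        if self.tags[index] == block:
            self.hits += 1                 # data already on the chip: fast
        else:
            self.misses += 1               # slow trip out to RAM, then fill the line
            self.tags[index] = block


cache = DirectMappedCache()
# Read the same 64 consecutive bytes twice over.
for _ in range(2):
    for address in range(64):
        cache.access(address)
print(cache.hits, "hits,", cache.misses, "misses")
```

Only the first touch of each 16-byte block has to go out to RAM; every other access is served from the cache, which is why programs that reuse nearby data run so much faster.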

POWER MAD

There have been many other improvements to CPUs since the humble 8086. One, for example, is that power management systems are now built onto the die. These serve several purposes – shutting the chip down to protect it from damage when it gets too hot, say, or turning off areas that aren’t being used to conserve electricity. The latter is particularly useful for extending


the battery life on notebooks, but in these times of rising energy costs is also handy for datacentres, where a 50W saving per chip over a thousand servers can add up to a serious amount of money per year. Especially when it means you can turn the air-conditioning down a notch or two as well.

Perhaps the biggest ongoing technological advances are in how these complex designs actually get turned into transistors on a silicon die. Most of us will be familiar with the mind-roastingly tiny figures that are quoted by CPU manufacturers for their manufacturing processes – 45nm, 65nm and so on. These refer to the basic size of components on a chip, and are unimaginably small. Reducing them further has several advantages: performance-wise, the same chip on a smaller process can run cooler and faster; but more importantly – because they take up less physical space – more can be squeezed onto a single silicon wafer. So they’re cheaper too.

The basic manufacturing process hasn’t changed much in 30 years: take a large disc of purified silicon, and use a photolithographic process to build layers of extra materials onto it which connect together to create data pathways and logic gates. The materials and tools, however, are constantly being refined to increase the accuracy needed to achieve these tiny dimensions, and reduce the effects of ‘leakage’. Put

TAKE SOLACE IN QUANTUM

Want to read about the CPUs of the future in a human, digestible format? Read Charles Stross’ book Halting State. It’s a whodunnit starring a quantum processor as one of the main protagonists.

Quantum has been much vaunted as the future of computing, and not just because it sounds very clever. While manufacturers like Intel are doing well at finding ways around the physical limits of current CPU design to make their transistors ever smaller, there will one day come an impassable boundary for current engineering techniques. You can only make things so small when you’re playing around with silicon molecules. Manipulating the quantum characteristics of atoms to


simply, this is when electrons begin hopping out over the boundaries between interconnects that they’re not supposed to be able to mount, and becomes more of a problem the smaller the manufacturing process becomes. At the moment, Intel leads the way with its 45nm process, which is made possible thanks to a hafnium-derived material used for the transistor gates. Intel states that the next generation of Core i7s will be produced on an even smaller 32nm process.
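
As a rough worked example of why a process shrink matters, the sketch below estimates how much area the same design would need at 32nm compared with 45nm. It assumes ideal geometric scaling, where every feature shrinks by the same ratio in both dimensions; real chips never scale this neatly, so treat the result as a back-of-envelope figure. The function name is invented for the example.

```python
# Back-of-envelope die area scaling, assuming every feature shrinks uniformly.
# Real designs do not shrink this neatly; this is an idealised estimate.

def area_scale(old_nm, new_nm):
    """Fraction of the original area needed after an ideal process shrink."""
    linear_ratio = new_nm / old_nm
    return linear_ratio ** 2  # area shrinks with the square of the linear size


scale = area_scale(45, 32)
print(f"Same design needs about {scale:.0%} of the original area")
print(f"So roughly {1 / scale:.1f} times as many chips fit on a wafer")
```

Under that assumption the same design takes up about half the silicon, which is where the cost advantage of each new process comes from.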

MOORE TO COME

CPU design and manufacture isn’t showing any signs of slowing. The infamous ‘Moore’s Law’ – a prediction by Intel co-founder Gordon Moore that the number of transistors that can be placed on a circuit will double every two years – may not be based on any scientific assessment of the manufacturing capabilities of the future, but it has remained peculiarly true for the last forty-three years.

Indeed, it could be that we’re on the cusp of a far bigger architectural change than even Core i7 augurs. AMD and Intel are keen to move more functions onto the CPU, starting with a basic graphics processor, with the end goal of creating a simple, power-efficient system-on-a-chip that will, essentially, put a desktop PC on your fingernail. NVIDIA and the ex-ATI part of AMD, meanwhile, seem to recognise that the

represent data stores is, theoretically, one way to go beyond conventional computers – albeit a highly complicated way that no-one has perfected yet. The idea is that data would be stored in ‘Qubits’, rather than bits. These could use particle spin to represent a one, a zero or – crucially – a quantum superposition, allowing for calculations of such fiendish complexity to be carried out that it hurts our collective brains to think about.

Quite how it would be implemented is another matter – the University of Michigan has demonstrated a proof of concept, but there’s little sign that it’s on its way desktopwards any time soon. More likely, in the short term at least, are optical replacements for current microprocessors, using photons instead of electrons to carry data and refracting materials instead of silicon for logic gates.

Above Phenom: four central rectangles are cores; transistors on the edge are the shared resources

next big jump in real-time graphics engines is a little further off than previously supposed, and their hugely parallel GPUs are capable of performing important tasks like medical imaging and financial reporting better than an entire farm of CPU servers (see The GPU Challenge on the previous page).

Perhaps more likely to yield results faster, though, are the hardware hooks for virtualisation which are being built into CPU cores, allowing several operating systems to run at once without a performance penalty. Many speculate that ‘cloud computing’ – starting an instanced desktop from a web-based grid server – is the way forward, turning all our computing into one big Gmail-type application.

Quite where these developments – and the hundreds of others that are going on simultaneously – will lead, though, is anyone’s guess. But before gazing too far into the future, bear this in mind: there’s another, even bigger and more significant birthday than the 8086 this year. In September 1958, Texas Instruments welcomed the very first integrated circuit – just a single transistor on a germanium strip – off its production line and set the ball rolling for the information age. Did anyone, fifty years ago, predict World of Warcraft or even Microsoft Word? Happy birthday, computers. PCF raises a glass substrate to your future.
