
Energy- and Thermal-Aware Design of Manycore Heterogeneous Datacenters

Stanko Novakovic

PIs: David Atienza, Luca Benini, Edouard Bugnion, Babak Falsafi, Lothar Thiele, John Thome, Fabrice Roudet, Marcel Ledergerber, Patrick Segu


Digital data is growing at an unprecedented rate

4.4 ZB in 2013, projected to reach 44 ZB by 2020

Datacenters come to the rescue
• A tremendous amount of data is generated on a daily basis
• Datacenters are key to the development of modern society: they store, process, and serve user data on behalf of billions of users
• Goal: turn data into value with minimal cost


IT Energy Is Not Sustainable

[Chart: datacenter electricity demand in the US, 2001–2017 (source: Energy Star), equivalent to 50 million Swiss homes]

A modern datacenter: 17x a football stadium, $3 billion

• Modern datacenters → on the order of 20 MW each
• In the modern world, datacenters consume 6% of all electricity, growing at >20%
• Goal: develop open and green datacenter technologies


YINS: Holistic optimization of datacenters

Optimization spans every layer:
• Applications and system software
• Server and rack architecture design
• Chip design, power & cooling

Goal: improve energy efficiency
• Reduce power consumption
• Improve Power Usage Effectiveness (PUE), defined below

Cross-layer, vertical integration of:
• Software
• System software (e.g., messaging)
• Server hardware (e.g., CPU, memory, network)
• Technology (e.g., FDSOI)
• Infrastructure (e.g., cooling, power)
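Power Usage Effectiveness is the facility-level metric referenced above: total facility energy divided by the energy that actually reaches the IT equipment. A minimal worked example follows; the 30 MW / 20 MW figures are illustrative, not YINS measurements.

```latex
\mathrm{PUE} \;=\; \frac{\text{total facility energy}}{\text{IT equipment energy}}
```

For a facility drawing 30 MW in total to power 20 MW of IT load, PUE = 30/20 = 1.5: every watt of compute carries an extra 0.5 W of cooling, power conversion, and other overheads. Driving PUE toward 1.0 is what the cooling and power work on the following slides targets.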


YINS: Holistic optimization of datacenters (1/4)

[Figure: manycore chip (cores with per-core L2 caches, shared L3, and logic blocks) on an IC package with I/O and power-delivery vias; on-chip microchannels carry fuel and oxidant through a processing-unit controller, combining liquid power delivery and cooling. Inset: on-die map over chip width/length (mm).]

Chip-level cooling & energy recovery → 6 W of free power to power up the caches


YINS: Holistic optimization of datacenters (2/4)

Server- and rack-level cooling
• Passive cooling → pumping power not required

YINS: Holistic optimization of datacenters (3/4)

Hierarchical workload-based control scheme
• Local controller: a fan controller sets the local fan speed, a CPU capper sets the local CPU cap
• Global controller: coordinates the nodes by setting a global fan speed and a global CPU cap
• Better fan control → better use of P-states (23.5% energy reduction); a control-loop sketch follows below
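The slide names the building blocks (a local fan controller and CPU capper, coordinated by a global controller) but not their control laws. Below is a minimal sketch of one plausible hierarchical loop; the setpoint, gains, and all class and function names are assumptions for illustration, not the controller evaluated in YINS.

```python
# Hedged sketch of a two-level fan / CPU-cap control loop.
# Thresholds, gains and the proportional policy are illustrative assumptions.
from dataclasses import dataclass

TARGET_TEMP_C = 70.0              # hypothetical per-node thermal setpoint

@dataclass
class LocalController:
    fan_pct: float = 30.0         # local fan speed (% of max)
    cpu_cap_pct: float = 100.0    # local CPU power cap (% of TDP)

    def step(self, temp_c: float, global_fan_pct: float, global_cap_pct: float):
        error = temp_c - TARGET_TEMP_C
        # Proportional fan response, never below the rack-wide fan setpoint.
        self.fan_pct = min(100.0, max(global_fan_pct, self.fan_pct + 2.0 * error))
        # Cap the CPU (restricting P-states) only if the fan alone cannot
        # hold the setpoint, and never above the global cap.
        self.cpu_cap_pct = min(global_cap_pct,
                               100.0 if error <= 0 else max(50.0, 100.0 - 5.0 * error))
        return self.fan_pct, self.cpu_cap_pct

class GlobalController:
    """Aggregates node temperatures and issues rack-wide setpoints."""
    def step(self, temps_c):
        hottest = max(temps_c)
        global_fan = min(100.0, 30.0 + 2.0 * max(0.0, hottest - TARGET_TEMP_C))
        global_cap = 100.0 if hottest < TARGET_TEMP_C + 10 else 80.0
        return global_fan, global_cap

# One control period over a rack of three nodes (made-up temperatures).
nodes = [LocalController() for _ in range(3)]
temps = [65.0, 72.0, 78.0]
g_fan, g_cap = GlobalController().step(temps)
for node, t in zip(nodes, temps):
    print(node.step(t, g_fan, g_cap))
```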


YINS: Holistic optimization of datacenters (4/4)

Specialized scale-out server architectures (this talk)

In collaboration with: Alexandros Daglis, Edouard Bugnion, Babak Falsafi, Boris Grot


Big Data, Big Trouble

Latency/bandwidth-critical services
– Analytics, key-value stores, databases

Vast datasets → data must be distributed across servers

Today's networks:
– Latency: 20x–1000x that of DRAM
– Network bandwidth << DRAM bandwidth

Latency and bandwidth limit service performance


Big Data on Cache-coherent NUMA (ccNUMA)
✓ Ultra-low access latency, ultra-high bandwidth (i.e., DDR)
– Cost and complexity of scaling up (machines range from 512 GB to 3 TB to 32 TB)
– Fault-containment challenge

Ultra-low latency / high bandwidth, but ultra expensive


Big Data on Integrated Fabrics
✓ Cost-effective rack-scale fabrics of SoCs (e.g., AMD's SeaMicro, HP's Moonshot)
– High remote access latency, low bandwidth

Need a low-latency, high-bandwidth rack-scale fabric!


High-performance rack-scale systems

A tightly integrated group of servers (rack), with communication inside the rack optimized for:
– Latency → fast access to small blocks
– Bandwidth → fast access to large blocks

Our proposal: Scale-Out NUMA (soNUMA) [ASPLOS'14]
– Non-cache-coherent NUMA with remote access capability over a NUMA interconnect
– Used for rack-scale processing
– Used as a building block: scale-out → rack-out

soNUMA is a highly tuned rack-scale system: latency ≈ 300 ns; bandwidth = DDR
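soNUMA's key primitive is one-sided remote memory access handled by an integrated protocol controller. The snippet below is only a toy simulation of that model to make the idea concrete; the Node/Fabric classes and method names are invented here and are not the soNUMA API, and the latencies are the ballpark figures quoted on the slide rather than measurements.

```python
# Toy simulation of one-sided remote memory access in the spirit of a
# rack-scale fabric. All names are hypothetical.
REMOTE_READ_NS = 300   # ~300 ns remote access, per the slide
LOCAL_READ_NS = 100    # rough local DRAM latency, for comparison only

class Node:
    def __init__(self, node_id: int, mem_bytes: int):
        self.node_id = node_id
        self.memory = bytearray(mem_bytes)   # stands in for local DRAM

class Fabric:
    """Models the rack: any node can read another node's memory directly,
    without involving the remote CPU (one-sided access)."""
    def __init__(self, nodes):
        self.nodes = {n.node_id: n for n in nodes}

    def remote_read(self, src: Node, dst_id: int, offset: int, length: int):
        data = bytes(self.nodes[dst_id].memory[offset:offset + length])
        cost_ns = LOCAL_READ_NS if dst_id == src.node_id else REMOTE_READ_NS
        return data, cost_ns

# Example: node 0 reads 64 B that node 1 wrote into its own memory.
n0, n1 = Node(0, 1 << 20), Node(1, 1 << 20)
n1.memory[4096:4101] = b"hello"
fabric = Fabric([n0, n1])
data, ns = fabric.remote_read(n0, dst_id=1, offset=4096, length=64)
print(data[:5], f"~{ns} ns")
```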


Large manycore chip integration

Increasing trend toward manycore chip design
– How is the network interface then integrated?
– The network-on-chip (NOC) introduces overhead: 70% of end-to-end latency
– Remote transactions have to cross the NOC to reach the chip-to-chip (C2C) router and the NUMA interconnect

Why is the conventional design (edge NI) not sufficient?


Large manycore chip integration problem

Edge NI: on-chip interactions between the cores and the edge NI add latency
Per-tile NI: NOC bandwidth is wasted carrying data traffic

What each transfer class needs:
– Small transfers → localized control interactions
– Large transfers → cross the NOC once, let the NI transfer the data


Datacenter applications' requirements

Data processing (a.k.a. analytics)
– Transfers large amounts of data with low processing time → bandwidth-bound

Data serving (a.k.a. key-value stores)
– Transfers small amounts of data with low processing time → latency-bound

In both cases: transfer_time = latency + data_size / bandwidth

Need to optimize for both latency and bandwidth (worked example below)
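A quick worked instance of the transfer-time formula, assuming the ~300 ns remote latency quoted later for soNUMA and a hypothetical 10 GB/s link:

```latex
t_{\mathrm{transfer}} \;=\; t_{\mathrm{latency}} \;+\; \frac{\text{data size}}{\text{bandwidth}}
```

A 64 B transfer takes 300 ns + 6.4 ns ≈ 306 ns (latency-bound), while a 1 MB transfer takes 300 ns + 105 µs ≈ 105 µs (bandwidth-bound). Key-value lookups live in the first regime and analytics shuffles in the second, which is why the NI has to be good at both.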


Split NI design [ISCA'15]

The NI is split into two components:
– Per-tile frontend: minimal control overhead → latency stays at ~300 ns, as in the original design
– Edge backend: minimizes on-chip data movement → full bisection bandwidth

[Figure: tiled manycore chip with per-tile frontend logic and backend logic at the edge, next to the router]

The split NI is optimized for both low latency and high bandwidth


Latency micro-benchmark results [ISCA'15]

[Chart: remote access latency (ns, 200–3200) vs transfer size (64–1024 B) for NI_edge, NI_per-tile, Split NI, and a NUMA projection]

– Latency remains a small factor of DRAM (~300 ns)
– Latency grows with larger transfers
– Latency is the same as in the single-core node configuration


Bandwidth micro-benchmark results [ISCA'15]

[Chart: bandwidth (0–12 GBps) vs request size (64 B–8 KB) for NI_edge, NI_per-tile, and NI_split]

– High bandwidth for small transfers too
– Still full bandwidth at large request sizes
– Aggregate node bandwidth saturates for 8 KB transfers


Datacenter applications on soNUMA

Data processing
– A fast shuffle phase in graph processing improves execution time (see the Scale-Out NUMA paper for details [ASPLOS'14])

Data serving
– Goal: increase throughput without violating the SLA


Basics: Hash-partitioned data serving

Data is "sharded" based on a hash function
– Each storage server serves one part (shard) of the key space
– The web server routes each request using ShardID = CRC16(key) % …

[Figure: web server in front of storage servers, one shard per server (classic scale-out deployment)]
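A minimal sketch of the routing step above. The shard count and the use of CRC32 (Python's standard library has no CRC16) are stand-ins; the slide leaves the modulus unspecified.

```python
# Hedged sketch of hash-partitioned request routing.
# NUM_SHARDS and zlib.crc32 are illustrative substitutes for the
# CRC16(key) % <shards> computation on the slide.
import zlib

NUM_SHARDS = 512                                        # hypothetical shard count
SERVERS = [f"storage-{i}" for i in range(NUM_SHARDS)]   # one shard per server

def shard_id(key: str) -> int:
    return zlib.crc32(key.encode()) % NUM_SHARDS

def route(key: str) -> str:
    """The web server picks the storage server that owns the key's shard."""
    return SERVERS[shard_id(key)]

print(route("user:42"))   # prints one of 'storage-0' .. 'storage-511'
```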


Problem: Highly-skewed key popularity

Client access distribution is skewed
• Typically Zipfian, θ = 0.99

5M keys hash-partitioned into 32 and 512 shards → skew is still problematic
• 32 shards: MAX/AVG shard load = 3
• 512 shards: MAX/AVG shard load = 30

The more shards we have, the bigger the skew (simulation sketch below)
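The following sketch approximately reproduces these skew numbers. A uniform random assignment stands in for the hash function, and the exact ratios depend on the Zipf formulation and seed, so treat the output (roughly 3 and 30) as indicative rather than a replication of the slide.

```python
# Hedged sketch: 5M keys with Zipfian (theta = 0.99) popularity are
# partitioned into S shards; MAX/AVG shard load measures the skew.
import numpy as np

N_KEYS, THETA = 5_000_000, 0.99
rng = np.random.default_rng(0)

ranks = np.arange(1, N_KEYS + 1, dtype=np.float64)
popularity = ranks ** -THETA                 # unnormalized Zipfian weights

for n_shards in (32, 512):
    shard_of_key = rng.integers(0, n_shards, size=N_KEYS)   # stand-in for the hash
    load = np.bincount(shard_of_key, weights=popularity, minlength=n_shards)
    print(f"{n_shards:4d} shards: MAX/AVG = {load.max() / load.mean():.1f}")
```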


Penalty due to queuing (512 shards)

[Chart: 99th-percentile latency (0–2 ms) vs throughput (0–5000 TPS) for scale-out and scale-out with 4 replicas]

The skewed deployment is able to deliver only a fraction of the maximum achievable throughput
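The slide does not spell out the analytic model behind these curves. As a rough intuition pump (an assumption, not the authors' model), treat the server holding the hottest shard as an M/M/1 queue: its response time is exponentially distributed, so the 99th percentile is -ln(0.01)/(mu - lambda), and with a 30x hot shard the tail latency blows up while the average server is still nearly idle.

```python
# Hedged M/M/1 sketch (assumption, not the talk's analytic model):
# the hottest shard receives SKEW times the average per-server load,
# so its 99th-percentile latency explodes long before the fleet is busy.
import math

SERVICE_RATE = 10_000.0    # requests/s one server can sustain (hypothetical)
SKEW = 30.0                # MAX/AVG shard load for 512 shards (from the slide)
SLA_MS = 1.0               # hypothetical 99th-percentile target

def p99_latency_ms(arrival_rate: float, service_rate: float) -> float:
    if arrival_rate >= service_rate:
        return float("inf")                 # unstable: queue grows without bound
    return -math.log(0.01) / (service_rate - arrival_rate) * 1e3

for avg_tps_per_server in (100, 200, 300, 330):
    hot_tps = SKEW * avg_tps_per_server     # load seen by the hottest shard
    p99 = p99_latency_ms(hot_tps, SERVICE_RATE)
    print(f"avg {avg_tps_per_server:4d} TPS/server -> hot-shard p99 = {p99:6.2f} ms"
          f"{'  (SLA violated)' if p99 > SLA_MS else ''}")
```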


Solution: Rack-out, rather than scale-out

Group servers into racks using soNUMA
• Grouping Factor (GF) = number of servers pooled over the soNUMA fabric (GF = 4 in the figure)
• The group jointly serves a super-shard, the union of its members' shards

Handle hot shards with more compute (i.e., racks of servers); see the sketch below

Performance impact of rack-out (analytic model)

[Chart: 99th-percentile latency (0–2 ms) vs throughput (0–5000 TPS) for scale-out, scale-out with 4 replicas, rack-out (GF=32), and rack-out (GF=128); a horizontal line marks the SLA (Service Level Agreement)]

Groups of servers deliver higher throughput without violating the SLA


Scale-Out NUMA conclusion

Remote memory access is essential
– Low-latency, high-bandwidth remote access matters
– Commodity networks are ill-suited for rack scale

soNUMA offers ultra-low latency and high bandwidth
– Integrated protocol controller (NI)
– Leverages NUMA


YINS Conclusion

Energy-efficient computation is key in datacenters
• Drastic increase in the energy spent both powering and cooling servers

Future: cooling-aware design = system-level integration
• Specialization of components
• Cross-layer optimizations

HW/SW co-design solutions enable energy-proportional datacenter design
• Global computing-cooling control for cost-efficient datacenter management
• New processing architectures and service-based customization
• Novel cooling infrastructures with global thermal-aware control
• Future servers: power delivery and cooling designed jointly!

