Energy- and Thermal-Aware Design of Many-core Heterogeneous Datacenters

Stanko Novakovic

PIs: David Atienza, Luca Benini, Edouard Bugnion, Babak Falsafi, Marcel Ledergerber, Fabrice Roudet, Patrick Segu, Lothar Thiele, John Thome


Digital data is growing at an unprecedented rate: 4.4 ZB in 2013, projected to reach 44 ZB by 2020.

Datacenters come to the rescue
•  A tremendous amount of data is generated on a daily basis
•  Datacenters are key to the development of modern society: they store, process, and serve user data on behalf of billions of users

Turn data into value with minimal cost

2


IT Energy Not Sustainable

[Chart: Datacenter electricity demand in the US, 2001-2017 (source: Energy Star), roughly the consumption of 50 million Swiss homes]

A modern datacenter: 17x the size of a football stadium, $3 billion

Modern datacenters → 20 MW each. Datacenters consume a large share of the world's electricity, growing at more than 20%
•  e.g., London: 6% of total electricity consumption

Develop open and green datacenter technologies

3


YINS: Holistic optimization of datacenters

Goal: improve energy efficiency → reduce power consumption → improve Power Usage Effectiveness (PUE)

Cross-layer, vertical integration of:
•  Applications and system software
•  Server and rack architecture design
•  Chip design, power & cooling

[Layer diagram: Software; System Software (e.g., messaging); Server Hardware (e.g., CPU, memory, network); Technology (e.g., FDSOI); Infrastructure (e.g., cooling, power)]

4


YINS: Holistic optimization of datacenters (1/4)

Liquid power and cooling delivery
[Diagram: IC package with microchannels, power-delivery vias, and fuel/oxidant flows]

Efficient chip design & energy recovery
✓ Built the first multi-core ever in FD-SOI → 6 W of free power to power up the caches

5


YINS: Holistic optimization of datacenters (2/4)

Server- and rack-level cooling
✓ Passive cooling → no pumping power required

6


YINS: Holistic optimization of datacenters (3/4)

Hierarchical workload-based control scheme
[Diagram: a Global Controller sets the global fan speed and global CPU cap; per-server local controllers (Fan Controller, CPU Capper) use a thermal model to set the local fan speed and local CPU cap]

✓ Better fan control → better use of P-states (23.5% energy reduction)

7
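The two-level scheme above can be sketched as a pair of control loops: a global controller divides a rack-level power budget across servers, and each local controller tracks its share against a temperature target. The sketch below is purely illustrative; the targets, gains, and limits are hypothetical, not the actual YINS controller parameters.

```python
# Illustrative two-level thermal control loop. All constants (70 C target,
# 85 C headroom ceiling, cap/fan limits, gains) are made-up example values.

def local_controller(temp_c, local_cap_w, fan_pct):
    """Nudge the local fan speed and CPU power cap toward a target temperature."""
    TARGET_C = 70.0
    error = temp_c - TARGET_C
    # When too hot: spin fans up first (cheap), then lower the CPU cap (costly).
    fan_pct = min(100.0, max(20.0, fan_pct + 2.0 * error))
    local_cap_w = max(40.0, min(150.0, local_cap_w - 1.0 * max(error, 0.0)))
    return local_cap_w, fan_pct

def global_controller(server_temps, total_budget_w):
    """Split a rack-wide power budget across servers, favoring cooler ones."""
    headroom = [max(1e-3, 85.0 - t) for t in server_temps]
    total = sum(headroom)
    return [total_budget_w * h / total for h in headroom]

caps = global_controller([65.0, 75.0, 80.0], total_budget_w=300.0)
cap0, fan0 = local_controller(78.0, caps[0], fan_pct=40.0)
```

The point of the hierarchy is that the global layer only redistributes budgets, while the fast local loops react to temperature; this matches the slide's split between global/local fan speeds and CPU caps.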


YINS: Holistic optimization of datacenters (4/4)

Specialized scale-out server architectures (Scale-Out NUMA)
[Diagram: two coherence domains connected by a soNUMA fabric providing direct remote access]

8


Big Data, Big Trouble

Latency/bandwidth-critical services
•  Analytics, key-value stores, databases

Vast datasets → must be distributed

Today's networks:
•  Latency 20x-1000x that of DRAM
•  Network bandwidth << DRAM bandwidth

Latency and bandwidth limit service performance

9


Big Data on Cache-coherent NUMA (ccNUMA)
✓ Ultra-low access latency, ultra-high bandwidth (i.e., DDR)
✗ Cost and complexity of scaling up (512 GB → 3 TB → 32 TB machines)
✗ Fault-containment challenge

Ultra low-latency/high-bandwidth, but ultra expensive

10


Big Data on Integrated Fabrics
✓ Cost-effective rack-scale fabrics of SoCs (e.g., AMD's SeaMicro, HP's Moonshot)
✗ High remote access latency, low bandwidth

Need a low-latency, high-bandwidth rack-scale fabric!

11


High-performance rack-scale systems

Tightly integrated group of servers (rack), with communication inside the rack optimized for:
•  Latency → fast access to small blocks
•  Bandwidth → fast access to large blocks

Our proposal: Scale-Out NUMA [ASPLOS'14]
•  Non-cache-coherent NUMA with remote access capability over a NUMA interconnect
•  Used for rack-scale processing
•  Used as a building block: scale out → rack out

Scale-Out NUMA is a highly tuned rack-scale system: latency = 300 ns; bandwidth = DDR

12
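Scale-Out NUMA's remote access is one-sided: a node posts a read or write descriptor to an integrated protocol controller, which fetches remote memory without involving the remote CPU. The toy model below mimics that work-queue/completion-queue flow; the class and method names are invented for illustration and simplify away the real soNUMA controller interface.

```python
from collections import deque

# Toy model of one-sided remote reads: the application posts a descriptor to
# a work queue (WQ); the protocol controller services it against a remote
# node's memory and posts the data to a completion queue (CQ). This is a
# simplified stand-in, not the actual soNUMA hardware interface.

class RemoteMemoryController:
    def __init__(self, remote_memory):
        self.remote_memory = remote_memory  # node_id -> bytearray
        self.wq, self.cq = deque(), deque()

    def post_read(self, node, offset, length):
        self.wq.append((node, offset, length))

    def service(self):
        # One-sided: the controller copies data, no remote CPU involvement.
        while self.wq:
            node, off, length = self.wq.popleft()
            self.cq.append(bytes(self.remote_memory[node][off:off + length]))

    def poll_completion(self):
        return self.cq.popleft() if self.cq else None

rmc = RemoteMemoryController({1: bytearray(b"hello, rack-scale world")})
rmc.post_read(node=1, offset=7, length=10)
rmc.service()
data = rmc.poll_completion()  # b"rack-scale"
```

The asynchronous post/poll split is what lets the real design hide the ~300 ns fabric latency behind useful work on the requesting core.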


Large manycore chip integration

Increasing trend toward manycore chip designs
•  How is the network interface then integrated?
•  The network-on-chip introduces overhead: up to 70% of end-to-end latency

[Diagram: cores, network interfaces, and a router on one chip; each remote transaction must cross the network-on-chip, possibly multiple times, to reach the NUMA network]

Why is the conventional design (edge NI) not sufficient?

13


Large manycore chip integration problem

Edge NI: on-chip interactions between the requesting tile and the edge NI add latency; this hurts SMALL TRANSFERS most.

Per-tile NI: NOC bandwidth is wasted carrying data traffic between tiles; this hurts LARGE TRANSFERS most.

What we want:
•  SMALL TRANSFERS → localized control interactions
•  LARGE TRANSFERS → cross the NOC once, let the NI do the transfer

14


Datacenter applications' requirements

Data processing (a.k.a. analytics)
•  Transfers large amounts of data, low processing time → bandwidth-bound

Data serving (a.k.a. key-value stores)
•  Transfers small amounts of data, low processing time → latency-bound

In both cases: Transfer_time = latency + data_size / bandwidth

Need to optimize for both latency and bandwidth

15
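A quick back-of-the-envelope with the formula above shows why the two workloads stress different terms. The numbers are illustrative: the deck's ~300 ns remote access latency, and an assumed 10 GB/s of per-link bandwidth (not a figure from the slides).

```python
def transfer_time_ns(size_bytes, latency_ns=300.0, bandwidth_gbps=10.0):
    """Transfer_time = latency + data_size / bandwidth.

    1 GB/s moves 1 byte per ns, so bandwidth in GB/s is also bytes/ns.
    """
    return latency_ns + size_bytes / bandwidth_gbps

small = transfer_time_ns(64)        # key-value style:  300 + 6.4   = 306.4 ns
large = transfer_time_ns(8 * 1024)  # analytics style:  300 + 819.2 = 1119.2 ns

# For 64 B the latency term is ~98% of the total: latency-bound.
# For 8 KB the bandwidth term dominates: bandwidth-bound.
```

This is exactly why a single design point is hard: shrinking the latency term helps data serving, while raising usable bandwidth helps data processing.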


Split NI design [ISCA'15]

[Diagram: tiled manycore chip with per-tile NI frontend logic and edge NI backend logic attached to the router]

The NI is split into two components:
•  Per-tile frontend: minimal control overhead → latency still ~300 ns, as in the original design
•  Edge backend: minimized on-chip data movement → full bisection bandwidth

The split NI is optimized for both low latency and high bandwidth

16
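The division of labor above can be summarized as: control interactions stay local to the requesting tile's frontend, while bulk data crosses the NOC once through the edge backend. The toy cost model below counts NOC crossings per remote read for the three placements; the hop counts are invented to illustrate the qualitative trade-off, not measurements from the ISCA'15 evaluation.

```python
# Toy count of NOC crossings for one remote read under three NI placements.
# The constants are illustrative, chosen only to reproduce the slide's
# qualitative argument (edge NI hurts small transfers, per-tile NI wastes
# NOC bandwidth on large ones, split NI crosses once).

def noc_crossings(design, size_bytes, cacheline=64):
    lines = -(-size_bytes // cacheline)  # ceil division: payload in cache lines
    if design == "edge":
        # Control messages travel tile -> edge NI and back; data also crosses.
        return lines + 2
    if design == "per_tile":
        # Control is local, but data is moved on-chip twice (in and out).
        return 2 * lines
    if design == "split":
        # Frontend keeps control local; backend moves each line across once.
        return lines
    raise ValueError(f"unknown design: {design}")
```

Under this model the split design matches the per-tile NI on small transfers (no extra control hops) and halves the per-tile NI's data traffic on large ones, which is the slide's claim in miniature.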


Remote read latency results [ISCA'15]

[Chart: latency (ns) vs. transfer size (64 B to 16 KB) for NI_edge, Split NI, NI_per-tile, and a NUMA projection. Latency grows with larger transfers but stays a small factor of DRAM latency (~300 ns for small transfers).]

Latency is the same as in the single-core node configuration, and close to ideal ccNUMA

17


Remote read bandwidth results [ISCA'15]

[Chart: bandwidth (GB/s) vs. request size (64 B to 8 KB) for NI_edge, NI_per-tile, and NI_split. The split NI delivers high bandwidth even for small transfers, and still reaches full bandwidth for large ones.]

Aggregate node bandwidth is saturated for 8 KB transfers

18


Datacenter applications on soNUMA

Data processing
•  A fast shuffle phase in graph processing improves execution time

Data serving
•  Goal: increase throughput without violating the SLA

19


Conclusion

Energy-efficient computation is key in datacenters
•  Drastic increase in the energy used to both power and cool servers

YINS is a multidisciplinary research project on energy-efficient datacenters
•  Vertical integration via cross-layer optimizations

soNUMA offers ultra-low latency and high bandwidth
•  Remote access via integrated protocol controllers (NIs)
•  Leverages NUMA

20


Thank you! Questions?

Scale-Out NUMA collaborators: Alexandros Daglis, Edouard Bugnion, Babak Falsafi, Boris Grot

