Energy- and Thermal-Aware Design of Many-core Heterogeneous Datacenters
Stanko Novakovic
PIs: David Atienza, Luca Benini, Edouard Bugnion, Babak Falsafi, Marcel Ledergerber, Fabrice Roudet, Patrick Segu, Lothar Thiele, John Thome
Digital data is growing at an unprecedented rate: 4.4 ZB in 2013, projected to reach 44 ZB by 2020
Datacenters come to the rescue
•  Tremendous amounts of data are generated on a daily basis
•  Datacenters are key to the development of modern society: they store, process, and serve user data on behalf of billions
•  Turn data into value with minimal cost
IT Energy Not Sustainable
[Chart: datacenter electricity demand in the US, 2001-2017 (source: Energy Star), equivalent to the consumption of 50 million Swiss homes]
A modern datacenter: 17x the size of a football stadium, $3 billion, → 20 MW
Datacenters consume large amounts of electricity, and demand is growing at >20%
§ e.g., London: 6% of total electricity consumption
Develop open and green datacenter technologies
YINS: Holistic optimization of datacenters
Goal: improve energy efficiency → reduce power consumption → improve Power Usage Effectiveness (PUE; worked example below)
Cross-layer, vertical integration of:
§ Applications and system software
§ Server and rack architecture design
§ Chip design, power & cooling
[Layer stack: Software | System Software (e.g., messaging) | Server Hardware (e.g., CPU, memory, network) | Infrastructure (e.g., cooling, power) | Technology (e.g., FD-SOI)]
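As a reminder of the target metric: PUE is the ratio of total facility power to the power delivered to the IT equipment, so values near 1 mean little energy is spent outside computation. A worked example, where the 20 MW IT load echoes the earlier slide and the 24 MW facility total is an assumed, illustrative figure:

```latex
\mathrm{PUE} = \frac{P_{\text{total facility}}}{P_{\text{IT equipment}}}
             = \frac{24\ \mathrm{MW}}{20\ \mathrm{MW}} = 1.2
```

Here 4 MW, a sixth of the facility's power, would go to cooling, power conversion, and other overheads; lowering PUE means shrinking exactly that slice.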
YINS: Holistic optimization of datacenters (1/4)
Liquid power and cooling delivery
[Diagram: IC package with power-delivery vias and microchannels carrying fuel and oxidant through the package]
Efficient chip design & energy recovery
§ Built the first-ever multi-core in FD-SOI → 6 W of free power to power up the caches
[Layer-stack diagram as on the overview slide]
YINS: Holistic optimization of datacenters (2/4)
Server- and rack-level cooling
§ Passive cooling → pumping power not required
[Layer-stack diagram as on the overview slide]
YINS: Holistic optimization of datacenters (3/4)
Hierarchical workload-based control scheme (sketched below)
§ Better fan control → better use of P-states (23.5% energy reduction)
[Diagram: a Global Controller sets the global fan speed and global CPU cap; each server's local controller uses a thermal model to drive its Fan Controller (local fan speed) and CPU Capper (local CPU cap)]
[Layer-stack diagram as on the overview slide]
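To make the diagram concrete, here is a minimal sketch of how such a two-level scheme could be wired up; the setpoints, gains, and thresholds are invented for illustration and are not the project's actual control law:

```c
/* Minimal sketch of a two-level fan/CPU-cap control scheme. The global
 * controller picks rack-wide setpoints; each server's local controller
 * trims them using its own temperature reading (standing in for the
 * thermal model). All setpoints, gains, and thresholds are invented. */
#include <stdio.h>

#define NUM_SERVERS 4

typedef struct {
    double temp_c;     /* measured CPU temperature */
    double fan_speed;  /* 0.0 .. 1.0, local Fan Controller output */
    double cpu_cap;    /* 0.0 .. 1.0, local CPU Capper output */
} server_t;

static void local_control(server_t *s, double global_fan, double global_cap) {
    double error = s->temp_c - 70.0;           /* assumed 70 C target */
    s->fan_speed = global_fan + 0.02 * error;  /* proportional fan trim */
    /* Cap the CPU (forcing a lower P-state) only when the fan alone
     * cannot keep the server near the target. */
    s->cpu_cap = (error > 5.0) ? global_cap - 0.1 : global_cap;
}

int main(void) {
    server_t rack[NUM_SERVERS] = { { 65.0 }, { 72.0 }, { 78.0 }, { 69.0 } };
    double global_fan = 0.5, global_cap = 1.0;  /* global setpoints */

    for (int i = 0; i < NUM_SERVERS; i++) {
        local_control(&rack[i], global_fan, global_cap);
        printf("server %d: fan %.2f, cap %.2f\n",
               i, rack[i].fan_speed, rack[i].cpu_cap);
    }
    return 0;
}
```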
YINS: Holistic optimization of datacenters (4/4)
Specialized scale-out server architectures (Scale-Out NUMA)
[Diagram: two coherence domains connected by the soNUMA fabric with direct remote access]
[Layer-stack diagram as on the overview slide]
Big Data, Big Trouble
Latency/bandwidth-critical services
– Analytics, key-value stores, databases
Vast datasets → must be distributed across servers
Today's networks:
– Latency: 20x-1000x that of DRAM
– Network bandwidth << DRAM bandwidth
Latency and bandwidth limit service performance
Big Data on Cache-coherent NUMA (ccNUMA)
✓ Ultra-low access latency, ultra-high bandwidth (i.e., DDR)
✗ Cost and complexity of scaling up
✗ Fault-containment challenge
[Figure: ccNUMA machines scaling from 512 GB to 3 TB to 32 TB of memory]
Ultra-low latency and high bandwidth, but ultra expensive
Big Data on Integrated Fabrics
✓ Cost-effective rack-scale fabrics of SoCs (e.g., AMD's SeaMicro, HP's Moonshot)
✗ High remote access latency, low bandwidth
Need a low-latency, high-bandwidth rack-scale fabric!
High-performance rack-scale systems
Tightly integrated group of servers (rack)
Communication inside the rack optimized for:
– Latency → fast access to small blocks
– Bandwidth → fast access to large blocks
Our proposal: Scale-Out NUMA [ASPLOS'14]
– Non-cache-coherent NUMA with remote access capability (see the sketch below)
– Used for rack-scale processing
– Used as a building block: scale-out → rack-out
[Diagram: servers connected by a NUMA interconnect providing remote access]
Scale-Out NUMA is a highly tuned rack-scale system: latency = 300 ns; bandwidth = DDR
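To make "remote access capability" concrete, here is a hypothetical sketch of how an application might issue a one-sided remote read through a soNUMA-style work-queue/completion-queue pair; the struct fields and function names are illustrative stand-ins, not soNUMA's published API:

```c
/* Hypothetical one-sided remote read over a soNUMA-style queue pair.
 * A remote memory controller (RMC) polls the work queue, performs the
 * transfer over the NUMA fabric, and posts a completion -- the remote
 * CPU is never involved. All names here are illustrative. */
#include <stdint.h>
#include <stdbool.h>

enum { OP_READ = 1 };

typedef struct {            /* work-queue entry: one transfer descriptor */
    uint8_t  op;            /* OP_READ */
    uint8_t  node;          /* destination node within the rack */
    uint64_t remote_offset; /* offset into the node's exposed region */
    uint64_t local_buf;     /* address where the RMC places the data */
    uint32_t length;        /* transfer size in bytes */
} wq_entry_t;

typedef struct { volatile bool done; } cq_entry_t;

/* In a real system these queues would live in memory registered with
 * the RMC hardware; here they are plain globals for the sketch. */
static wq_entry_t wq[64];
static cq_entry_t cq[64];

/* Issue an asynchronous remote read, then spin until the RMC completes
 * it -- at ~300 ns round-trip, polling beats sleeping. */
void remote_read(int slot, uint8_t node, uint64_t offset,
                 void *buf, uint32_t len) {
    cq[slot].done = false;
    wq[slot] = (wq_entry_t){ OP_READ, node, offset,
                             (uint64_t)(uintptr_t)buf, len };
    while (!cq[slot].done)
        ;  /* completion is posted by the RMC, not by software */
}
```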
Large manycore chip integration
Increasing trend toward manycore chip design
– How is the network interface (NI) then integrated?
– The network-on-chip (NoC) introduces overhead: 70% of end-to-end latency
[Diagram: manycore chip with cores, routers, and edge network interfaces connecting to the NUMA network]
Each remote transaction requires crossing the network-on-chip (multiple times)
Why is the conventional design (edge NI) not sufficient?
Large manycore chip integration problem
Edge NI: on-chip control interactions add latency, hurting small transfers
Per-tile NI: NoC bandwidth is wasted on data traffic, hurting large transfers
What each transfer class needs:
– SMALL TRANSFERS → localized control interactions
– LARGE TRANSFERS → cross the NoC once, let the NI move the data
Datacenter applications' requirements
Data processing (a.k.a. analytics)
– Transfers large amounts of data, low processing time → bandwidth-bound
Data serving (a.k.a. key-value stores)
– Transfers small amounts of data, low processing time → latency-bound
In both cases: Transfer_time = latency + data_size/bandwidth (worked example below)
Need to optimize for both latency and bandwidth
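Plugging numbers into the formula shows why the two workload classes stress different terms. A minimal sketch, where the 300 ns latency echoes the soNUMA figure quoted in this talk and the 10 GB/s per-transfer bandwidth is an assumed, illustrative value:

```c
/* Evaluating Transfer_time = latency + data_size/bandwidth for a few
 * sizes. 300 ns matches the soNUMA latency cited in the talk; the
 * 10 GB/s per-transfer bandwidth is an assumed value. */
#include <stdio.h>

int main(void) {
    const double latency_ns = 300.0;
    const double bw_bytes_per_ns = 10.0;  /* 10 GB/s == 10 bytes per ns */
    const double sizes[] = { 64.0, 8192.0, 1048576.0 };

    for (int i = 0; i < 3; i++) {
        double total_ns = latency_ns + sizes[i] / bw_bytes_per_ns;
        printf("%9.0f B -> %9.1f ns (latency is %4.1f%% of the total)\n",
               sizes[i], total_ns, 100.0 * latency_ns / total_ns);
    }
    return 0;
}
```

At 64 B the fixed latency is ~98% of the transfer time (latency-bound); at 1 MB it is well under 1% (bandwidth-bound).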
Split NI design [ISCA'15]
[Diagram: manycore chip tiles with per-tile NI frontends and an edge NI backend attached to the router]
NI split into two components:
– Per-tile frontend: minimal control overhead → latency still ~300 ns, as in the original design
– Edge backend: minimized on-chip data movement → full bisection bandwidth
Split NI is optimized for both low latency and high bandwidth (toy model below)
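A toy latency model of the three NI placements, just to encode the structural argument from the last two slides; the 30 ns NoC-crossing cost and 270 ns fabric latency are invented numbers, not measurements from the paper:

```c
/* Toy model of the three NI placements discussed in the talk. Only the
 * structure mirrors the slides; the constants are invented. */
#include <stdio.h>

#define NOC_HOP_NS  30.0  /* assumed cost of one NoC crossing */
#define FABRIC_NS  270.0  /* assumed fabric + remote-memory latency */

/* control_hops: NoC crossings made by request/completion descriptors
 * data_hops:    NoC crossings made by the data payload itself        */
static double latency_ns(int control_hops, int data_hops) {
    return (control_hops + data_hops) * NOC_HOP_NS + FABRIC_NS;
}

int main(void) {
    /* Edge NI: descriptors cross the NoC to the chip edge and back. */
    printf("edge NI:     %.0f ns\n", latency_ns(2, 1));
    /* Per-tile NI: control stays local, but the payload enters at the
     * tile and crosses the NoC again, wasting NoC bandwidth when large. */
    printf("per-tile NI: %.0f ns\n", latency_ns(0, 2));
    /* Split NI: the per-tile frontend keeps control local; the edge
     * backend moves the payload across the NoC exactly once. */
    printf("split NI:    %.0f ns\n", latency_ns(0, 1));
    return 0;
}
```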
Remote read latency results [ISCA'15]
[Plot: latency (ns) vs. transfer size (64 B to 16 KB) for NI_edge, Split NI, NI_per-tile, and a NUMA projection; latency stays within a small factor of DRAM (~300 ns) and grows with larger transfers]
Latency is the same as in the single-core node configuration; close to ideal ccNUMA
Remote read bandwidth results [ISCA'15]
[Plot: bandwidth (GBps) vs. request size (64 B to 8 KB) for NI_edge, NI_per-tile, and NI_split; high bandwidth even for small transfers, and full bandwidth is retained for large ones]
Aggregate node bandwidth is saturated for 8 KB transfers
Datacenter applications on soNUMA
Data processing
– A fast shuffle phase in graph processing improves execution time
Data serving
– Goal: increase throughput without violating SLAs
Conclusion
Energy-efficient computation is key in datacenters
– The energy for both powering and cooling servers is increasing drastically
YINS is a multidisciplinary research project on energy-efficient datacenters
– Vertical integration via cross-layer optimizations
soNUMA offers ultra-low latency and high bandwidth
– Remote access via integrated protocol controllers (NIs)
– Leverages NUMA
Thank you! Questions?
Scale-Out NUMA collaborators: Alexandros Daglis, Edouard Bugnion, Babak Falsafi, Boris Grot