Energy- and Thermal-Aware Design of Manycore Heterogeneous Datacenters
Speaker: Stanko Novakovic
PIs: David Atienza, Luca Benini, Edouard Bugnion, Babak Falsafi, Lothar Thiele, John Thome, Fabrice Roudet, Marcel Ledergerber, Patrick Segu
Digital data is growing at an unprecedented rate
• 4.4 ZB of digital data in 2013, projected to reach 44 ZB by 2020
• Tremendous amounts of data are generated on a daily basis
Datacenters come to the rescue
• Datacenters are key to the development of modern society: they store, process, and serve user data on behalf of billions of users
• The goal is to turn data into value at minimal cost
IT Energy Not Sustainable
[Figure: datacenter electricity demand in the US, 2001-2017 (source: Energy Star), equivalent to the consumption of 50 million Swiss homes]
[Photo: a modern datacenter, roughly 17x the size of a football stadium and costing about $3 billion]
• Modern datacenters draw around 20 MW each
• Datacenters account for about 6% of the world's electricity, growing at more than 20% per year
• Goal: develop open and green datacenter technologies
YINS: Holistic optimization of datacenters
Cross-layer, vertical integration of:
– Applications and system software (e.g., messaging)
– Server and rack architecture design (e.g., CPU, memory, network)
– Chip design, power & cooling
– Technology (e.g., FDSOI) and infrastructure (e.g., cooling, power)
Goal: improve energy efficiency
• Reduce power consumption
• Improve Power Usage Effectiveness (PUE); a small worked example follows below
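
PUE is the ratio of total facility power to IT equipment power, so a value close to 1 means little energy is lost to cooling and power delivery. A minimal sketch with purely illustrative numbers (not measurements from the project):

    # PUE = total facility power / IT equipment power.
    # The figures below are illustrative assumptions, not project measurements.
    def pue(total_facility_kw: float, it_kw: float) -> float:
        return total_facility_kw / it_kw

    print(pue(30_000, 20_000))  # 1.5: half of the IT power again goes to cooling and power delivery
    print(pue(22_000, 20_000))  # 1.1: a far more efficient facility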
YINS: Holistic optimization of datacenters (1/4): chip-level power & cooling
[Figure: a manycore chip (cores, L2/L3 caches, logic, I/O) in its IC package, with microchannels carrying fuel and oxidant for liquid power and cooling delivery alongside the power-delivery vias, and a contour map over the chip area (width/length in mm)]
• Chip-level cooling & energy recovery → about 6 W of free power, enough to power up the caches
YINS: Holistic optimization of datacenters (2/4): server- and rack-level cooling
• Passive cooling → no pumping power required
YINS: Holistic optimization of datacenters (3/4): hierarchical workload-based control
• A global controller sets a global fan speed and a global CPU cap; per-server local controllers (a fan controller and a CPU capper) turn these into local fan speeds and local CPU caps
• Better fan control → better use of P-states (23.5% energy reduction)
A control-loop sketch follows below.
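
A minimal sketch of how such a two-level scheme might be wired up. The class and method names (GlobalController, LocalController, apply) and all thresholds are illustrative assumptions, not the project's actual interfaces:

    # Hedged sketch of a hierarchical fan/CPU-cap control loop.
    # All names and thresholds are illustrative assumptions.

    class LocalController:
        """Per-server controller: maps global targets to a local fan speed and CPU cap."""
        def __init__(self, server_id: int):
            self.server_id = server_id

        def read_temperature(self) -> float:
            # Placeholder for a real sensor read (e.g., via IPMI or hwmon).
            return 65.0

        def apply(self, global_fan_pct: float, global_cpu_cap_w: float):
            temp = self.read_temperature()
            # Bias the global fan target by the server's own thermal headroom.
            local_fan = min(100.0, global_fan_pct + max(0.0, temp - 70.0) * 2.0)
            # Never exceed the global cap; trim further if the server runs hot.
            local_cap = global_cpu_cap_w * (0.9 if temp > 75.0 else 1.0)
            print(f"server {self.server_id}: fan {local_fan:.0f}%, CPU cap {local_cap:.0f} W")

    class GlobalController:
        """Rack-level controller: picks rack-wide targets from aggregate load."""
        def __init__(self, local_controllers):
            self.local_controllers = local_controllers

        def step(self, rack_utilization: float):
            global_fan = 30.0 + 60.0 * rack_utilization              # idle floor + load-proportional term
            global_cap = 95.0 if rack_utilization > 0.8 else 120.0   # assumed watts per CPU
            for lc in self.local_controllers:
                lc.apply(global_fan, global_cap)

    GlobalController([LocalController(i) for i in range(4)]).step(rack_utilization=0.6)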
YINS: Holistic optimization of datacenters (4/4): specialized scale-out server architectures (this talk)
In collaboration with: Alexandros Daglis, Edouard Bugnion, Babak Falsafi, Boris Grot
Big Data, Big Trouble
• Latency/bandwidth-critical services: analytics, key-value stores, databases
• Vast datasets must be distributed across many servers
• Today's networks:
– Latency is 20x-1000x that of DRAM
– Network bandwidth << DRAM bandwidth
Latency and bandwidth limit service performance
Big Data on Cache-coherent NUMA (ccNUMA)
• Ultra-low access latency, ultra-high bandwidth (i.e., DDR)
• But: cost and complexity of scaling up (512 GB → 3 TB → 32 TB machines) and a fault-containment challenge
Ultra-low latency and high bandwidth, but ultra expensive
Big Data on Integrated Fabrics
• Cost-effective rack-scale fabrics of SoCs (e.g., AMD's SeaMicro, HP's Moonshot)
• But: high remote access latency, low bandwidth
We need a low-latency, high-bandwidth rack-scale fabric!
High-performance rack-scale systems
• A tightly integrated group of servers (a rack), with communication inside the rack optimized for:
– Latency: fast access to small blocks
– Bandwidth: fast access to large blocks
Our proposal: Scale-Out NUMA (soNUMA) [ASPLOS'14]
– Non-cache-coherent NUMA with remote access capability over a NUMA interconnect
– Used for rack-scale processing, and as a building block to turn scale-out into rack-out
soNUMA is a highly tuned rack-scale system: latency ≈ 300 ns, bandwidth ≈ DDR
A sketch of the remote-access programming model follows below.
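
The slide only states that soNUMA provides non-coherent, one-sided remote access; the sketch below is a hypothetical stand-in for that programming model, and every name in it (RemoteMemory, register_region, remote_read) is an assumption rather than the actual soNUMA interface:

    # Hedged sketch of one-sided remote access on a soNUMA-like fabric.
    # Names are hypothetical stand-ins; only the one-sided, non-coherent nature is from the slide.

    class RemoteMemory:
        def __init__(self):
            self.regions = {}                       # (node, offset) -> bytes

        def register_region(self, node: int, offset: int, data: bytes):
            # A node makes a local buffer remotely readable (no cache coherence involved).
            self.regions[(node, offset)] = data

        def remote_read(self, node: int, offset: int, length: int) -> bytes:
            # One-sided: the remote CPU is not interrupted; the NI serves the request.
            return self.regions[(node, offset)][:length]

    fabric = RemoteMemory()
    fabric.register_region(node=2, offset=0x1000, data=b"shard-0 values ...")
    print(fabric.remote_read(node=2, offset=0x1000, length=7))   # b'shard-0'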
Large manycore chip integration
• There is an increasing trend toward manycore chip designs, so how is the network interface (NI) integrated on chip?
• The network-on-chip (NOC) introduces overhead: remote transactions must cross the NOC, and on-chip interactions account for up to 70% of end-to-end latency
[Figure: a manycore chip with per-tile network interfaces, a mesh of NOC routers, and a chip-to-chip (C2C) router connecting to the NUMA interconnect for remote access]
Why are the conventional designs not sufficient?
– Edge NI: on-chip control interactions add latency to every remote transaction
– Per-tile NI: NOC bandwidth is wasted moving data between the tiles and the chip edge
What each transfer class needs:
– Small transfers: localized control interactions
– Large transfers: cross the NOC once and let the NI move the data
Datacenter applications' requirements
• Data processing (a.k.a. analytics): transfers large amounts of data with low processing time → bandwidth-bound
• Data serving (a.k.a. key-value stores): transfers small amounts of data with low processing time → latency-bound
• In both cases, transfer_time = latency + data_size / bandwidth
We need to optimize for both latency and bandwidth; the arithmetic sketch below shows why.
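
A quick back-of-the-envelope check of the formula above, using the ~300 ns latency and DDR-class (~12 GB/s) bandwidth figures quoted for soNUMA elsewhere in the talk; the numbers are only illustrative:

    # transfer_time = latency + data_size / bandwidth
    # Illustrative parameters only: ~300 ns latency, ~12 GB/s (DDR-class) bandwidth.
    LATENCY_NS = 300
    BANDWIDTH_BYTES_PER_NS = 12.0            # 12 GB/s expressed as bytes per nanosecond

    def transfer_time_ns(data_size_bytes: int) -> float:
        return LATENCY_NS + data_size_bytes / BANDWIDTH_BYTES_PER_NS

    print(transfer_time_ns(64))        # ~305 ns: latency dominates -> data serving is latency-bound
    print(transfer_time_ns(1 << 20))   # ~87,700 ns: size/bandwidth dominates -> analytics is bandwidth-bound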
Split NI design [ISCA'15]
[Figure: chip floorplan with per-tile frontend logic next to each router and backend logic at the chip edge]
The NI is split into two components:
– A per-tile frontend keeps control overhead minimal: latency stays at ~300 ns, as in the original design
– An edge backend minimizes on-chip data movement and delivers full bisection bandwidth
The split NI is therefore optimized for both low latency and high bandwidth (see the toy model below).
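
A toy cost model contrasting the three placements. The NOC hop latency, hop count, and bandwidth constants are made-up assumptions; only the qualitative trend (the split NI wins on both metrics) reflects the slide's argument:

    # Toy qualitative comparison of NI placements; constants are assumptions for illustration.
    NOC_HOP_NS, HOPS = 10, 8               # assumed NOC hop latency and tile-to-edge distance
    FABRIC_NS = 250                        # assumed off-chip fabric latency
    EDGE_BW, NOC_BW_PER_TILE = 12.0, 1.5   # GB/s: assumed edge-link vs. per-tile NOC share

    designs = {
        # (extra on-chip latency for a small transfer, achievable bandwidth for large transfers)
        "edge NI":     (2 * HOPS * NOC_HOP_NS, EDGE_BW),          # control crosses the NOC both ways
        "per-tile NI": (0,                      NOC_BW_PER_TILE),  # data trickles across the NOC
        "split NI":    (0,                      EDGE_BW),          # local control, edge-side data
    }
    for name, (extra_ns, bw) in designs.items():
        print(f"{name:12s} small-transfer latency ~{FABRIC_NS + extra_ns} ns, large-transfer BW ~{bw} GB/s")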
Latency micro-benchmark results [ISCA'15]
[Figure: remote-read latency (ns, 200-3200) vs. transfer size (64 B-1024 B) for NI_edge, NI_per-tile, and the split NI, against a NUMA projection]
• Latency remains a small factor of DRAM (~300 ns) and grows with larger transfers
• The split NI's latency is the same as in the single-core node configuration
Bandwidth micro-benchmark results [ISCA'15]
[Figure: bandwidth (GB/s, 0-12) vs. request size (64 B-8 KB) for NI_edge, NI_per-tile, and NI_split]
• The split NI delivers high bandwidth for small transfers and still reaches full bandwidth for large ones
• Aggregate node bandwidth saturates at 8 KB transfers
Datacenter applications on soNUMA
• Data processing: a fast shuffle phase in graph processing improves execution time (see the Scale-Out NUMA paper for details [ASPLOS'14])
• Data serving: the goal is to increase throughput without violating the SLA
Basics: Hash-partitioned data serving
• Data is "sharded" based on a hash function: ShardID = CRC16(key) % …
• Each storage server serves one part (shard) of the key space; a web server routes each request to the right storage server (the classic scale-out deployment)
A routing sketch follows below.
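
A minimal sketch of the routing step on the web-server side. The shard count and the 16-bit hash are illustrative assumptions (the slide elides the modulus, and a real deployment would use a true CRC16):

    # Hedged sketch of hash-partitioned routing: ShardID = CRC16(key) % num_shards.
    # NUM_SHARDS and the 16-bit hash below are illustrative assumptions.
    import binascii

    NUM_SHARDS = 512

    def shard_id(key: str) -> int:
        hash16 = binascii.crc32(key.encode()) & 0xFFFF   # stand-in for CRC16(key)
        return hash16 % NUM_SHARDS

    # The web server uses the shard ID to pick the storage server that owns the key.
    print(shard_id("user:42"), shard_id("user:43"))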
Problem: Highly-skewed key popularity
• Client access distribution is skewed: typically Zipfian, θ = 0.99
• Example: 5M keys hash-partitioned into 32 and 512 shards
– 32 shards: the hottest shard sees 3x the average load (MAX/AVG = 3)
– 512 shards: the hottest shard sees 30x the average load (MAX/AVG = 30)
The more shards we have, the bigger the skew; the simulation sketch below reproduces the trend.
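
A small simulation of this effect, assuming a Zipfian key-popularity distribution (θ = 0.99) and pseudo-random hash placement of keys onto shards; the exact ratios will differ from the slide's measured ones:

    # Hedged simulation: Zipfian (theta = 0.99) key popularity, keys hashed into shards.
    # Shows MAX/AVG shard load growing with the shard count; numbers are illustrative.
    import random

    def max_over_avg(num_keys: int, num_shards: int, theta: float = 0.99) -> float:
        weights = [1.0 / (rank ** theta) for rank in range(1, num_keys + 1)]  # Zipfian popularity
        load = [0.0] * num_shards
        rng = random.Random(0)
        for w in weights:
            load[rng.randrange(num_shards)] += w      # each key hashes to a pseudo-random shard
        return max(load) / (sum(load) / num_shards)

    for shards in (32, 512):
        print(f"{shards:4d} shards: MAX/AVG = {max_over_avg(50_000, shards):.1f}")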
Penalty due to queuing (512 shards)
[Figure: 99th-percentile latency (ms, 0-2) vs. offered load (TPS, 0-5000) for scale-out and scale-out with 4 replicas]
• Because hot shards saturate early, the deployment delivers only a fraction of the maximum achievable throughput; a queuing sketch follows below
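
An M/M/1 approximation of why a hot shard blows up the tail. This is a generic queuing sketch with assumed service times, not the analytic model behind the slide's results:

    # Hedged M/M/1 sketch: 99th-percentile latency of a hot shard vs. an average shard.
    # Service time and load figures are assumptions; only the trend mirrors the slide.
    import math

    SERVICE_US = 100.0                        # assumed mean service time per request

    def p99_latency_us(arrival_per_us: float) -> float:
        mu = 1.0 / SERVICE_US                 # service rate
        if arrival_per_us >= mu:
            return math.inf                   # saturated shard: the queue grows without bound
        # M/M/1 response time is exponential with mean 1/(mu - lambda);
        # its 99th percentile is -ln(0.01) times that mean.
        return -math.log(0.01) / (mu - arrival_per_us)

    avg_load = 0.0003                         # requests/us on an average shard (3% utilization)
    for skew in (1, 10, 30):                  # the hot shard receives skew x the average load
        print(f"skew {skew:2d}x: p99 ~ {p99_latency_us(avg_load * skew):,.0f} us")
    # Even at 3% average utilization, the 30x-hot shard pushes p99 into the millisecond range,
    # so the cluster delivers only a fraction of its peak throughput before violating a latency SLA.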
Solution: Rack-out, rather than scale-out
• Group servers into racks over the soNUMA fabric, e.g., with a Grouping Factor (GF) of 4
• Each group serves a super-shard, so hot shards are handled with more compute (a rack of servers rather than a single server)
A sketch of the super-shard mapping follows below.
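
A minimal sketch of how shards might be grouped into super-shards and served by any node in the owning group; the helper names and the least-loaded load-balancing policy are illustrative assumptions:

    # Hedged sketch of rack-out request routing: GF consecutive shards form a super-shard,
    # and any of the GF servers in the owning rack can serve a request for it, since the
    # data is reachable over the soNUMA fabric. Names and the policy are assumptions.
    NUM_SHARDS, GF = 512, 4

    def super_shard(shard: int) -> int:
        return shard // GF                                  # GF consecutive shards per group

    def rack_servers(sshard: int) -> list[int]:
        return list(range(sshard * GF, sshard * GF + GF))   # server IDs in the owning rack

    def route(shard: int, load: dict[int, int]) -> int:
        # Any server in the rack can serve the request; pick the least loaded one,
        # spreading a hot shard's traffic over GF servers instead of one.
        return min(rack_servers(super_shard(shard)), key=lambda s: load.get(s, 0))

    load = {0: 90, 1: 10, 2: 15, 3: 12}
    print(route(shard=1, load=load))   # -> 1: shards 0-3 share servers 0-3; server 1 is least loaded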
Performance impact of rack-out (analytic model)
[Figure: 99th-percentile latency (ms, 0-2) vs. offered load (TPS, 0-5000) for scale-out, scale-out with 4 replicas, rack-out with GF=32, and rack-out with GF=128, against the SLA line; SLA = Service Level Agreement]
• Groups of servers deliver much higher throughput without violating the SLA
Scale-Out NUMA conclusion
• Remote memory access is essential: low-latency, high-bandwidth remote access matters
• Commodity networks are ill-suited for rack-scale systems
• soNUMA offers ultra-low latency and high bandwidth through an integrated protocol controller (NI) that leverages NUMA
YINS Conclusion
• Energy-efficient computation is key in datacenters: the energy spent both powering and cooling servers is rising drastically
• The future is cooling-aware design, i.e., system-level integration, specialization of components, and cross-layer optimizations
• HW/SW co-design enables energy-proportional datacenter design
• Global computing-cooling control enables cost-efficient datacenter management
• New processing architectures and service-based customization
• Novel cooling infrastructures with global thermal-aware control
• Future servers: power delivery and cooling designed jointly!