Bringing Learnings from Googley Microservices with gRPC - Varun Talwar, Google


Bringing learnings from Googley microservices with gRPC Microservices Summit Varun Talwar

Google confidential │ Do not distribute


Contents

1. Context: Why are we here?
2. Learnings from Stubby experience
   a. HTTP/JSON doesn't cut it
   b. Establish a lingua franca
   c. Design for fault tolerance and control: Sync/Async, Deadlines, Cancellations, Flow control
   d. Flying blind without stats
   e. Diagnosing with tracing
   f. Load balancing is critical
3. gRPC
   a. Cross platform matters!
   b. Performance and standards matter: HTTP/2
   c. Pluggability matters: Interceptors, Name Resolvers, Auth plugins
   d. Usability matters!


CONTEXT WHY ARE WE HERE?


Business Agility


Developer Productivity


Performance


INTRODUCING STUBBY


Microservices at Google: ~O(10^10) RPCs per second.

Images by Connie Zhou


Stubby Magic @ Google


Making Google magic available to all

Borg → Kubernetes

Stubby → gRPC


LEARNINGS FROM STUBBY


Key learnings

1. HTTP/JSON doesn't cut it!
2. Establish a lingua franca
3. Design for fault tolerance and provide control knobs
4. Don't fly blind: Service Analytics
5. Diagnosing problems: Tracing
6. Load balancing is critical


1. HTTP/JSON doesn't cut it!

   1. WWW, browser growth - bled into services
   2. Stateless
   3. Text on the wire
   4. Loose contracts
   5. TCP connection per request
   6. Nouns based
   7. Harder API evolution
   8. Think compute, network on cloud platforms


2. Establish a lingua franca

   1. Protocol Buffers - Since 2003.
   2. Start with IDL
   3. Have a language agnostic way of agreeing on data semantics
   4. Code Gen in various languages
   5. Forward and Backward compatibility
   6. API Evolution


How we roll at Google


Service Definition (weather.proto)

    syntax = "proto3";

    service Weather {
      rpc GetCurrent(WeatherRequest) returns (WeatherResponse);
    }

    message WeatherRequest {
      Coordinates coordinates = 1;

      message Coordinates {
        fixed64 latitude = 1;
        fixed64 longitude = 2;
      }
    }

    message WeatherResponse {
      Temperature temperature = 1;
      float humidity = 2;
    }

    message Temperature {
      float degrees = 1;
      Units units = 2;

      enum Units {
        FAHRENHEIT = 0;
        CELSIUS = 1;
        KELVIN = 2;
      }
    }

Google Cloud Platform


3. Design for fault tolerance and control

● Sync and Async APIs

● Need fault tolerance: Deadlines, Cancellations

● Control Knobs: Flow control, Service Config, Metadata


gRPC Deadlines

● First-class feature in gRPC.
● Deadline is an absolute point in time.
● Deadline indicates to the server how long the client is willing to wait for an answer.
● RPC will fail with DEADLINE_EXCEEDED status code when deadline reached.


Deadline Propagation

[Figure: the Gateway calls withDeadlineAfter(200, MILLISECONDS) at Now = 1476600000000, giving Deadline = 1476600000200. The same absolute Deadline = 1476600000200 travels with every downstream call. Latencies shown on the hops: 40 ms, 60 ms, 20 ms, 90 ms, 20 ms. Downstream services observe Now = 1476600000040 and Now = 1476600000150; at Now = 1476600000230 the deadline has passed and DEADLINE_EXCEEDED is returned at every hop back to the Gateway.]
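The propagation rule on this slide (one absolute deadline, checked by every hop against its own clock) can be sketched in a few lines. This is a minimal simulation, not gRPC's implementation: `call_chain` and `DeadlineExceeded` are hypothetical names, the clock is a plain integer in milliseconds, and the order in which the slide's hop latencies are applied is an assumption.

```python
class DeadlineExceeded(Exception):
    """Stands in for the DEADLINE_EXCEEDED status code."""


def call_chain(now_ms, deadline_ms, hop_latencies_ms):
    """Walk a chain of hops, passing the same absolute deadline to each one.

    Returns the clock value after the last hop, or raises DeadlineExceeded
    as soon as a hop observes that the deadline has already passed.
    """
    for hop, latency in enumerate(hop_latencies_ms, start=1):
        now_ms += latency  # time spent reaching and processing this hop
        if now_ms > deadline_ms:
            raise DeadlineExceeded(
                f"hop {hop}: now={now_ms} is past deadline={deadline_ms}"
            )
    return now_ms


start = 1476600000000
deadline = start + 200  # the effect of withDeadlineAfter(200, MILLISECONDS)

try:
    call_chain(start, deadline, [40, 60, 20, 90, 20])
    outcome = "OK"
except DeadlineExceeded as exc:
    outcome = f"DEADLINE_EXCEEDED ({exc})"
print(outcome)
```

Because the deadline is absolute rather than a per-hop timeout, no hop needs to know how much time the earlier hops consumed; it only compares the deadline with its own clock.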


Cancellation?

Deadlines are expected. What about unpredictable cancellations?
• User cancelled request.
• Caller is not interested in the result any more.
• etc.


Cancellation?

[Figure: a gateway (GW) fans out to a tree of downstream services. Every service is Busy and each link carries an Active RPC.]


Cancellation Propagation

[Figure: the same tree behind the gateway (GW), with every downstream service now Idle.]


Cancellation

● Automatically propagated.
● RPC fails with CANCELLED status code.
● Cancellation status can be accessed by the receiver.
● Server (receiver) always knows if RPC is valid!
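A rough sketch of automatic propagation, assuming each RPC carries a context object that remembers the calls it spawned. `CallContext` is a hypothetical class written for this illustration; it is not a gRPC API.

```python
class CallContext:
    """A hypothetical per-RPC context that forwards cancellation to child calls."""

    def __init__(self, name, parent=None):
        self.name = name
        self.cancelled = False
        self.children = []
        if parent is not None:
            parent.children.append(self)
            # A call started under an already-cancelled parent is cancelled too.
            self.cancelled = parent.cancelled

    def cancel(self):
        """Mark this call and every call it spawned as cancelled."""
        if self.cancelled:
            return
        self.cancelled = True
        for child in self.children:
            child.cancel()

    def status(self):
        """What the receiver would see when it checks whether the RPC is still valid."""
        return "CANCELLED" if self.cancelled else "ACTIVE"


gateway = CallContext("GW")
service_a = CallContext("A", parent=gateway)
service_b = CallContext("B", parent=gateway)
service_c = CallContext("C", parent=service_a)

gateway.cancel()
statuses = {ctx.name: ctx.status() for ctx in (gateway, service_a, service_b, service_c)}
print(statuses)
```

Cancelling at the gateway reaches every downstream call, which is what lets busy servers stop work nobody is waiting for.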


BiDi Streaming - Slow Client

[Figure: a Slow Client sends a Request to a Fast Server, which streams Responses faster than the client can read them. Status codes shown: CANCELLED, UNAVAILABLE, RESOURCE_EXHAUSTED.]


BiDi Streaming - Slow Server

[Figure: a Fast Client streams Requests to a Slow Server faster than the server can handle them and return a Response. Status codes shown: CANCELLED, UNAVAILABLE, RESOURCE_EXHAUSTED.]


Flow-Control Flow-control helps to balance computing power and network capacity between client and server. gRPC supports both client- and server-side flow control.

Photo taken by Andrey Borisenko.
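One way to picture flow control is as a credit window: the sender may only put bytes on the wire while it holds credit, and the receiver returns credit as it consumes data. The sketch below illustrates that idea under simplifying assumptions (one stream, one window, no network); `FlowControlledStream` is a hypothetical class, not part of gRPC.

```python
from collections import deque


class FlowControlledStream:
    """Credit-based flow control: the sender may only send while it has window."""

    def __init__(self, window_bytes):
        self.window = window_bytes  # bytes the sender is still allowed to send
        self.buffer = deque()       # messages waiting for the receiver

    def try_send(self, message):
        """Send if the window allows it; otherwise tell the sender to wait."""
        if len(message) > self.window:
            return False
        self.window -= len(message)
        self.buffer.append(message)
        return True

    def receive(self):
        """Consume one message and hand its bytes back to the sender as credit."""
        message = self.buffer.popleft()
        self.window += len(message)
        return message


stream = FlowControlledStream(window_bytes=8)
sent = [stream.try_send(b"abcd"), stream.try_send(b"efgh"), stream.try_send(b"ijkl")]
stream.receive()  # the slow receiver finally reads one message
sent_after_read = stream.try_send(b"ijkl")
print(sent, sent_after_read)
```

The third send is refused until the receiver has read something, so a fast sender cannot flood a slow peer with unbounded buffered data.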


Service Config

● Policies through which the server tells clients what they should do
● Can specify deadlines, LB policy, and payload size per method of a service
● Loved by SREs: it gives them more control
● Discovery via DNS
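As an illustration of per-method policy, the sketch below parses a small service config and looks up the entry that applies to one method. The field names (`loadBalancingPolicy`, `methodConfig`, `name`, `timeout`, `maxRequestMessageBytes`) follow the JSON service config format gRPC documents today, which may differ from what existed when this talk was given; the values, the unqualified `Weather` service name, and the `method_config` helper are assumptions made for the example.

```python
import json

SERVICE_CONFIG = json.loads("""
{
  "loadBalancingPolicy": "round_robin",
  "methodConfig": [
    {
      "name": [{"service": "Weather"}],
      "timeout": "1s",
      "maxRequestMessageBytes": 4096
    },
    {
      "name": [{"service": "Weather", "method": "GetCurrent"}],
      "timeout": "0.2s"
    }
  ]
}
""")


def method_config(config, service, method):
    """Return the most specific methodConfig entry for service/method, if any."""
    service_wide = None
    for entry in config.get("methodConfig", []):
        for name in entry.get("name", []):
            if name.get("service") != service:
                continue
            if name.get("method") == method:
                return entry          # exact service+method match wins
            if "method" not in name:
                service_wide = entry  # fallback for every method of the service
    return service_wide


print(method_config(SERVICE_CONFIG, "Weather", "GetCurrent")["timeout"])
```

The point for operators is that the deadline for `GetCurrent` can be tightened in the config without touching or redeploying client code.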


Metadata helps in exchange of useful information

Metadata Exchange: Common cross-cutting concerns like authentication or tracing rely on the exchange of data that is not part of the declared interface of a service. Deployments rely on their ability to evolve these features at a different rate to the individual APIs exposed by services.


4. Don't fly blind: Stats

● What is the mean latency time per RPC?
● How many RPCs per hour for a service?
● Errors in last minute/hour?
● How many bytes sent? How many connections to my server?


Data collection by arbitrary metadata is useful

● Any service's resource usage and performance stats in real time by (almost) any arbitrary metadata
   1. Service X can monitor CPU usage in their jobs broken down by the name of the invoked RPC and the mdb user who sent it.
   2. Social can monitor the RPC latency of shared bigtable jobs when responding to their requests, broken down by whether the request originated from a user on web/Android/iOS.
   3. Gmail can collect usage on servers, broken down by POP/IMAP/web/Android/iOS. The layer propagates Gmail's metadata down to every service, even if the request was made by an intermediary job that Gmail doesn't own.

The stats layer exports data to varz and streamz, and provides stats to many monitoring systems and dashboards.
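The "break down by arbitrary metadata" idea can be sketched as a recorder that stores each RPC's latency together with whatever metadata arrived with it and groups on demand. `StatsRecorder`, its method names, and the sample numbers are all hypothetical; this says nothing about how Google's stats layer is built.

```python
from collections import defaultdict
from statistics import mean


class StatsRecorder:
    """Keeps one record per RPC so stats can be grouped by any metadata key later."""

    def __init__(self):
        self._records = []

    def record(self, method, latency_ms, metadata):
        # Metadata keys are stored next to the built-in fields of the record.
        self._records.append({"method": method, "latency_ms": latency_ms, **metadata})

    def mean_latency_by(self, key):
        """Mean latency in ms for each distinct value of the given key."""
        groups = defaultdict(list)
        for rec in self._records:
            groups[rec.get(key, "unknown")].append(rec["latency_ms"])
        return {value: mean(latencies) for value, latencies in groups.items()}


stats = StatsRecorder()
stats.record("GetCurrent", 12, {"platform": "android"})
stats.record("GetCurrent", 18, {"platform": "android"})
stats.record("GetCurrent", 30, {"platform": "ios"})

by_platform = stats.mean_latency_by("platform")
by_method = stats.mean_latency_by("method")
print(by_platform, by_method)
```

The same records answer both "latency per RPC" and "latency per client platform" because the grouping key is chosen at query time, not at collection time.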


5. Diagnosing problems: Tracing

● 1/10K requests takes very long. It's an ad query :-) I need to find out. Take a sample and store it in a database; this helps identify requests in the sample that took a similar amount of time.
● I didn't get a response from the service. What happened? Which link in the service dependency graph got stuck? Stitch a trace and figure it out.
● Where is a trace spending its time? Hotspot analysis.
● What are all the dependencies of a service?
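Stitching a trace and finding the hotspot can be illustrated with spans that carry an id and a parent id. The span fields, the `hotspot` function, and the sample trace are assumptions for this sketch; it also assumes each service appears once in the trace and that child calls run one after another.

```python
from collections import defaultdict


def hotspot(spans):
    """Return (service, self_time_ms) for the span that spent the most time itself.

    Self time is a span's duration minus the duration of its direct children.
    """
    children = defaultdict(list)
    for span in spans:
        children[span["parent"]].append(span)

    def duration(span):
        return span["end_ms"] - span["start_ms"]

    self_times = {}
    for span in spans:
        child_time = sum(duration(child) for child in children[span["id"]])
        self_times[span["service"]] = duration(span) - child_time
    return max(self_times.items(), key=lambda item: item[1])


trace = [
    {"id": 1, "parent": None, "service": "gateway", "start_ms": 0, "end_ms": 200},
    {"id": 2, "parent": 1, "service": "ads", "start_ms": 10, "end_ms": 190},
    {"id": 3, "parent": 2, "service": "storage", "start_ms": 20, "end_ms": 40},
]
slowest = hotspot(trace)
print(slowest)
```

The parent links are also the service dependency graph for this request: gateway depends on ads, which depends on storage.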


6. Load Balancing is important!

● Iteration 1: Stubby Balancer
● Iteration 2: Client side load balancing
● Iteration 3: Hybrid
● Iteration 4: gRPC-lb


Next gen of load balancing

● Current client support intentionally dumb (simplicity).
   ○ Pick first available - Avoid connection establishment latency
   ○ Round-robin-over-list - Lists not sets → ability to represent weights
● For anything more advanced, move the burden to an external "LB Controller", a regular gRPC server, and rely on a client-side implementation of the so-called gRPC LB policy.

[Figure: gRPC LB. 1) The client sends a Control RPC to the LB Controller. 2) The LB Controller returns an address-list. 3) The client does RR over the addresses of the address-list to reach the backends.]
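The two deliberately simple client policies can be sketched as follows. The function names and addresses are made up for the example, and `is_reachable` stands in for an actual connection attempt; the point is only to show why a list (rather than a set) lets an LB controller express weights.

```python
import itertools


def pick_first(addresses, is_reachable):
    """Return the first address that accepts a connection, or None."""
    for address in addresses:
        if is_reachable(address):
            return address
    return None


def round_robin(address_list):
    """Cycle over a list; repeating an address gives it proportionally more picks."""
    return itertools.cycle(address_list)


# Hypothetical address-list as an LB controller might return it:
# the first backend is listed twice, so it should get twice the traffic.
address_list = ["10.0.0.1:50051", "10.0.0.1:50051", "10.0.0.2:50051"]
picker = round_robin(address_list)
picks = [next(picker) for _ in range(6)]
print(picks.count("10.0.0.1:50051"), picks.count("10.0.0.2:50051"))
```

The client stays simple because all the weighting logic lives in whoever produced the list.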


In summary, what did we learn?

● Contracts should be strict
● Common language helps
● Common understanding for deadlines, cancellations, flow control
● Common stats/tracing framework is essential for monitoring, debugging
● Common framework allows uniform policy application for control and LB

A single point of integration for logging, monitoring, tracing, service discovery and load balancing makes lives much easier!


INTRODUCING gRPC


gRPC core gRPC Java gRPC Go

Open source on Github for C, C++, Java, Node.js, Python, Ruby, Go, C#, PHP, Objective-C


Where is the project today?

● 1.0 with stable APIs
● Well documented with an active community
● Reliable with continuous running tests on GCE
   ○ Deployable in your environment
● Measured with an open performance dashboard
   ○ Deployable in your environment
● Well adopted inside and outside Google


More lessons

1. Cross language & cross platform matters!
2. Performance and standards matter: HTTP/2
3. Pluggability matters: Interceptors, Name Resolvers, Auth plugins
4. Usability matters!




gRPC Principles & Requirements

Coverage & Simplicity: The stack should be available on every popular development platform and easy for someone to build for their platform of choice. It should be viable on CPU & memory limited devices.

http://www.grpc.io/blog/principles



gRPC Speaks Your Language

Service definitions and client libraries
● Java
● Go
● C/C++
● C#
● Node.js
● PHP
● Ruby
● Python
● Objective-C

Platforms supported
● MacOS
● Linux
● Windows
● Android
● iOS


Interoperability

[Figure: a GoLang Service, a Java Service, a Python Service and a C++ Service. Each one exposes a gRPC Service and calls the others through gRPC Stubs.]




HTTP/2 in One Slide

• Single TCP connection.
• No head-of-line blocking at the HTTP layer.
• Binary framing layer.
• Request → Stream.
• Header Compression.

[Figure: protocol stack, top to bottom: Application (HTTP/2), Binary Framing, Session (TLS) [optional], Transport (TCP), Network (IP). An HTTP/1.x request

    POST /upload HTTP/1.1
    Host: www.javaday.org.ua
    Content-Type: application/json
    Content-Length: 27

    {"msg": "Welcome to 2016!"}

is carried in HTTP/2 as a HEADERS frame followed by a DATA frame.]


Binary Framing

HTTP/2 breaks down the HTTP protocol communication into an exchange of binary-encoded frames, which are then mapped to messages that belong to a stream, and all of which are multiplexed within a single TCP connection.

[Figure: Stream 1, Stream 2 ... Stream N multiplexed over one TCP connection. On Stream 1 the Request is a HEADERS frame (:method: GET, :path: /kyiv, :version: HTTP/2, :scheme: https) and the Response is a HEADERS frame (:status: 200, :version: HTTP/2, :server: nginx/1.10.1, ...) followed by a DATA frame carrying <payload>.]
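To make "binary-encoded frames" concrete, the sketch below packs and unpacks the 9-byte frame header defined in RFC 7540 (24-bit payload length, 8-bit type, 8-bit flags, one reserved bit and a 31-bit stream identifier). The helper names are invented for this example, and it only frames a DATA payload; real HEADERS frames carry HPACK-compressed header blocks, which are not shown.

```python
import struct

DATA, HEADERS = 0x0, 0x1            # frame types from RFC 7540
END_STREAM, END_HEADERS = 0x1, 0x4  # frame flags from RFC 7540


def encode_frame(frame_type, flags, stream_id, payload):
    """Prefix payload with the 9-byte HTTP/2 frame header."""
    length = len(payload)
    header = struct.pack(
        ">BHBBI",
        length >> 16,            # high 8 bits of the 24-bit length
        length & 0xFFFF,         # low 16 bits of the 24-bit length
        frame_type,
        flags,
        stream_id & 0x7FFFFFFF,  # top bit is reserved and sent as 0
    )
    return header + payload


def decode_header(frame):
    """Return (length, type, flags, stream_id) from the first 9 bytes of a frame."""
    high, low, frame_type, flags, stream_id = struct.unpack(">BHBBI", frame[:9])
    return (high << 16) | low, frame_type, flags, stream_id & 0x7FFFFFFF


frame = encode_frame(DATA, END_STREAM, 1, b"hello")
print(decode_header(frame))
```

Every frame names its stream, which is what allows frames from many requests to be interleaved on one connection and reassembled on the other side.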


HTTP/1.x vs HTTP/2 http://http2.golang.org/gophertiles http://www.http2demo.io/



gRPC Service Definitions

● Unary: The client sends a single request to the server and gets a single response back, just like a normal function call.
● Server streaming: The client sends a request to the server and gets a stream to read a sequence of messages back. The client reads from the returned stream until there are no more messages.
● Client streaming: The client sends a sequence of messages to the server using a provided stream. Once the client has finished writing the messages, it waits for the server to read them and return its response.
● BiDi streaming: Both sides send a sequence of messages using a read-write stream. The two streams operate independently. The order of messages in each stream is preserved.
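The four call shapes map naturally onto plain functions and generators. The sketch below is only an analogy with hypothetical handler names and no networking; in particular, the bidirectional example echoes each request, whereas in gRPC the two streams are independent and the server may reply at any time.

```python
def unary(request):
    """One request in, one response out."""
    return f"response to {request}"


def server_streaming(request):
    """One request in, a stream of responses out."""
    for i in range(3):
        yield f"{request} part {i}"


def client_streaming(requests):
    """A stream of requests in, one response out after the client finishes."""
    return f"received {sum(1 for _ in requests)} messages"


def bidi_streaming(requests):
    """A stream in and a stream out; order within each stream is preserved."""
    for request in requests:
        yield f"echo {request}"


print(unary("kyiv"))
print(list(server_streaming("forecast")))
print(client_streaming(iter(["a", "b", "c"])))
print(list(bidi_streaming(iter(["ping", "pong"]))))
```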


BiDi Streaming Use-Cases

● Messaging applications.
● Games / multiplayer tournaments.
● Moving objects.
● Sport results.
● Stock market quotes.
● Smart home devices.
● You name it!


Performance

● Open Performance Benchmark and Dashboard
● Benchmarks run in GCE VMs per Pull Request for regression testing.
● gRPC users can run these in their environments.
● Good performance across languages:
   ○ Java Throughput: 500 K RPCs/sec and 1.3 M streaming messages/sec on 32 core VMs
   ○ Java Latency: ~320 us for unary ping-pong (netperf 120 us)
   ○ C++ Throughput: ~1.3 M RPCs/sec and 3 M streaming messages/sec on 32 core VMs




gRPC Principles & Requirements

Pluggable: Large distributed systems need security, health-checking, load-balancing and failover, monitoring, tracing, logging, and so on. Implementations should provide extension points to allow for plugging in these features and, where useful, default implementations.

http://www.grpc.io/blog/principles



Interceptors

[Figure: a Request travels from the Client through the client interceptors and then the server interceptors to the Server; the Response travels back through the same interceptors.]
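An interceptor chain is essentially a stack of wrappers around the real handler. The following sketch shows that pattern with two hypothetical server-side interceptors (one logging, one validating); the function names and the `(request, metadata)` handler signature are assumptions for this illustration and do not mirror any particular gRPC language API.

```python
def with_interceptors(handler, interceptors):
    """Wrap handler so the first interceptor in the list runs outermost."""
    for interceptor in reversed(interceptors):
        handler = interceptor(handler)
    return handler


def logging_interceptor(log):
    """Build an interceptor that records every request and response in log."""
    def interceptor(next_handler):
        def call(request, metadata):
            log.append(f"request: {request}")
            response = next_handler(request, metadata)
            log.append(f"response: {response}")
            return response
        return call
    return interceptor


def reject_empty_interceptor(next_handler):
    """Short-circuit empty requests before they reach the handler."""
    def call(request, metadata):
        if not request:
            return "INVALID_ARGUMENT"
        return next_handler(request, metadata)
    return call


def get_current(request, metadata):
    """The actual service method, unaware of any interceptors."""
    return f"weather for {request}"


log = []
server = with_interceptors(get_current, [logging_interceptor(log), reject_empty_interceptor])
ok = server("kyiv", {})
rejected = server("", {})
print(ok, rejected, log)
```

Because logging sits outermost, it also sees requests that an inner interceptor rejects, which is why ordering of interceptors matters.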


Pluggability

● Auth & Security - TLS [Mutual], Plugin auth mechanism (e.g. OAuth)
● Proxies
   ○ Basic: nghttp2, haproxy, traefik
   ○ Advanced: Envoy, linkerd, Google LB, Nginx (in progress)
● Service Discovery
   ○ etcd, Zookeeper, Eureka, …
● Monitor & Trace
   ○ Zipkin, Prometheus, Statsd, Google, DIY




Get Started


Coming soon!

1. Server reflection
2. Health Checking
3. Automatic retries
4. Streaming compression
5. Mechanism to do caching
6. Binary Logging
   a. Debugging, auditing though costly
7. Unit Testing support
   a. Automated mock testing
   b. Don't need to bring up all dependent services just to test
8. Web support


Some early adopters

Microservices: in data centres Client Server communication/Internal APIs

Streaming telemetry from network devices

Mobile Apps


Thank you!

Twitter: @grpcio
Site: grpc.io
Group: grpc-io@googlegroups.com
Repo: github.com/grpc, github.com/grpc/grpc-java, github.com/grpc/grpc-go


Q&A


Why gRPC?

● Multi-language: 9 languages
● Open: Open source and growing community
● Strict Service contracts: Define and enforce contracts, backward compatible
● Performant: 1m+ QPS - unary, 3m+ streaming (dashboard)
● Pluggable design: Auth, Transport, IDL, LB
● Efficiency on wire: 2-3X gains
● Streaming APIs: Large payloads, speech, logs
● Standard compliant: HTTP/2
● Easy to use: Single line installation


The Fallacies of Distributed Computing

● The network is reliable
● Latency is zero
● Bandwidth is infinite
● The network is secure
● Topology doesn't change
● There is one administrator
● Transport cost is zero
● The network is homogeneous

https://blogs.oracle.com/jag/resource/Fallacies.html





How is gRPC Used?

[Figure: three usage patterns. Direct RPCs: Microservices, running On Prem, on GCP, or on Other Cloud. Mobile/Web RPCs: from Your Mobile/Web Apps. RPCs to access APIs: Google APIs and Your APIs.]


What are the benefits?

● Developers: Ease of use, Performance, Versioning, Programming model
● Operators: Uniform Monitoring, Debugging/Tracing, Cross platform/language
● Architects/Managers: Defined Contracts, Single uniform framework for control, Visibility


gRPC Principles & Requirements

Layered: Key facets of the stack must be able to evolve independently. A revision to the wire-format should not disrupt application layer bindings.

http://www.grpc.io/blog/principles



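Layered Architecture

[Figure: API layers, top to bottom: Stub (Code Gen'd Service API), Code Gen Support Code, Channel API, Transport API. Annotations: "Standard applications" next to the upper layers; "Initialization, interceptors, and advanced applications" next to the lower layers.]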


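Layered Architecture

[Figure: on the client side, the RPC Client-Side App uses a Stub, Future Stub or Blocking Stub, which issues a ClientCall on a Channel. The Channel uses a NameResolver and a LoadBalancer (Pluggable Load Balancing and Service Discovery) to choose among transports Tran #1, Tran #2 ... Tran #N. Transports speak HTTP/2. On the server side, the Transport feeds a ServerCall handler, which dispatches a ServerCall to the Service Definition (extends generated definition) implemented by the RPC Server-side Apps.]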


Takeaways

HTTP/2 is a high performance production-ready multiplexed bidirectional protocol.

gRPC (http://grpc.io):
• HTTP/2 transport based, open source, general purpose, standards-based, feature-rich RPC framework.
• Bidirectional streaming over one single TCP connection.
• Netty transport provides asynchronous and non-blocking I/O.
• Deadline and cancellation propagation.
• Client- and server-side flow-control.
• Layered, pluggable and extensible.
• Supports 10 programming languages.
• Built-in testing support.
• Production-ready (current version is 1.0.1) and growing ecosystem.


Growing Ecosystem


gRPC Gateway https://github.com/grpc-ecosystem/grpc-gateway

Migration. Testing. Swagger / OpenAPI tooling.

Photo taken by Andrey Borisenko.


Metadata and Auth

● Protocol Structure
   ○ Request → <Call Spec> <Header Metadata> <Messages>*
   ○ Response → <Header Metadata> <Messages>* <Trailing Metadata> <Status>
● Generic mechanism for attaching metadata to requests and responses
● Commonly used to attach "bearer tokens" to requests for Auth
   ○ OAuth2 access tokens
   ○ JWT e.g. OpenId Connect Id Tokens
● Session state for specific Auth mechanisms is encapsulated in an Auth-credentials object
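Attaching a bearer token through header metadata can be sketched with plain key/value pairs. The dictionary layout below loosely mirrors the Request structure on this slide; the helper names, the `x-request-id` entry, and the token value are placeholders invented for the example, not gRPC APIs.

```python
def attach_bearer_token(metadata, token):
    """Return a copy of the header metadata with an authorization entry added."""
    return list(metadata) + [("authorization", f"Bearer {token}")]


def extract_bearer_token(metadata):
    """Return the bearer token from header metadata, or None if absent."""
    for key, value in metadata:
        if key.lower() == "authorization" and value.startswith("Bearer "):
            return value[len("Bearer "):]
    return None


# Request → <Call Spec> <Header Metadata> <Messages>*
request = {
    "call_spec": "/Weather/GetCurrent",
    "header_metadata": attach_bearer_token([("x-request-id", "42")], "example-token"),
    "messages": [b"request-bytes"],
}
token = extract_bearer_token(request["header_metadata"])
print(token)
```

The token rides alongside the call without appearing in the declared request message, which is why auth schemes can change without changing the service definition.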

