Become Efficient or Die: The Story of BackType
Nathan Marz @nathanmarz
BackType
BackType helps businesses understand social media and make use of it
BackType Data Services (APIs) Social Media Analytics Dashboard
APIs • Conversational graph for a URL • Comment search • # Tweets / URL • Influence scores • Top sites • Trending links stream • etc.
URL Profiles
Site comparisons
Influencer Profiles
Twitter Account Analytics
Topic Analysis
BackType stats • >30 TB of data • 100 to 200 machine cluster • Process 100M messages per day • Serve 300 req/sec
BackType stats • 3 full-time employees • 2 interns • $1.4M in funding
How? Avoid waste Invest in efficiency
Development philosophy • Waterfall • Agile • Scrum • Kanban
Development philosophy Suffering-oriented programming
Suffering-oriented Programming Don’t add process until you feel the pain of not having it
Example
• Growing from 2 people to 3 people
Example • Founders were essentially “one brain” • Cowboy coding led to communication mishaps
• Added biweekly meeting to sync up
Example
• Growing from 3 people to 5 people
Example • Moving through tasks a lot faster with 5 people
• Needed more frequent prioritization of tasks
• Changed biweekly meeting to weekly meeting
• Added “chat room standups” to facilitate mid-week adjustments
Suffering-oriented Programming Don’t build new technology until you feel the pain of not having it
Suffering-oriented Programming First, make it possible. Then, make it beautiful. Then, make it fast.
Example
• Batch processing
Make it possible • Hack things out using MapReduce/ Cascading
• Learn the ins and outs of batch processing
Make it beautiful • Wrote (and open-sourced) Cascalog • The “perfect interface” to our data
Make it fast • Use it in production • Profile and identify bottlenecks • Optimize
Overengineering Attempting to create beautiful software without a thorough understanding of the problem domain
Premature optimization Optimizing before creating a “beautiful” design, creating unnecessary complexity
Knowledge debt
(chart: your productivity plotted against your potential)
Knowledge debt Use small, independent projects to experiment with new technology
Example • Needed to write a small server to collect records into a Distributed Filesystem
• Wrote it using Clojure programming language
• Huge win: now we use Clojure for most of our systems
Example • Needed to implement social search • Wrote it using Neo4j • Ran into a lot of problems with Neo4j and rewrote it later using Sphinx
Example • Needed an automated deploy for a distributed stream processing system
• Wrote it using Pallet • Massive win: we anticipate a dramatic reduction in the complexity of administering our infrastructure
Knowledge debt
(Crappy job ad)
Knowledge debt Instead of hiring people who share your skill set, hire people with completely different skill sets
(food for thought)
Technical debt Technical debt builds up in a codebase
Technical debt • W needs to be refactored • X deploy should be faster • Y needs more unit tests • Z needs more documentation
Technical debt Never high enough priority to work on, but these issues built up and slow you down
BackSweep • Issues are recorded on a wiki page • We spend one day a month removing items from that wiki page
BackSweep • Keeps our codebase lean • Gives us a way to defer technical-debt issues when we don’t have time to deal with them
• “Garbage collection for the codebase”
What is a startup? A startup is a human institution designed to deliver a new product or service under conditions of extreme uncertainty. - Eric Ries
How do you decide what to work on?
Don’t want to waste three months building a feature no one cares about. This could be fatal!
Product development Form hypothesis → Test hypothesis → Learn
Valid? Keep. Invalid? Discard.
Example
Pro product didn’t actually exist yet
Example • We tested different feature combinations and measured click through rate
• Clicking on “sign up” went to a survey page
Example
Hypothesis #1
Customers want analytics on topics being discussed on Twitter
Testing hypothesis #1
• Fake feature -> clicking on topic goes to survey page
Testing hypothesis #1 • Do people click on those links? • If not, need to reconsider hypothesis
Hypothesis #2
Customers want to know how often topics are mentioned over time
Testing hypothesis #2 • Build topic-mentions-over-time graph for “big topics” our private beta customers are interested in (e.g. “nike”, “microsoft”, “apple”, “kodak”)
• Talk to customers
Hypothesis #3 • Customers want to see who’s talking about a topic on a variety of dimensions: recency, influence, num followers, or num retweets
Testing hypothesis #3
• Create search index on last 24 hours of data that can sort on all dimensions
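Sorting the last 24 hours of mentions on each of those dimensions can be sketched with per-dimension sort keys. A minimal sketch, assuming hypothetical record fields (`time`, `influence`, `followers`, `retweets`) invented for illustration:

```python
mentions = [
    {"user": "alice", "time": 100, "influence": 80, "followers": 5000, "retweets": 3},
    {"user": "bob",   "time": 200, "influence": 20, "followers": 100,  "retweets": 9},
]

def top_mentions(mentions, dimension):
    """Sort mentions on any supported dimension, most relevant first."""
    keys = {
        "recency":   lambda m: m["time"],
        "influence": lambda m: m["influence"],
        "followers": lambda m: m["followers"],
        "retweets":  lambda m: m["retweets"],
    }
    return sorted(mentions, key=keys[dimension], reverse=True)

top_mentions(mentions, "recency")[0]["user"]    # "bob" (most recent)
top_mentions(mentions, "influence")[0]["user"]  # "alice" (most influential)
```

A real search index (e.g. Sphinx, which the earlier slide mentions) precomputes these sort orders rather than sorting at query time.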
Lean Startup
Questions? Twitter: @nathanmarz Email: nathan.marz@gmail.com Web: http://nathanmarz.com
The Secrets of Building Realtime Big Data Systems Nathan Marz @nathanmarz
Who am I?
(Upcoming book)
BackType • >30 TB of data • Process 100M messages / day • Serve 300 requests / sec • 100 to 200 machine cluster • 3 full-time employees, 2 interns
Built on open-source Thrift Cascading Scribe ZeroMQ Zookeeper Pallet
What is a data system? Raw data → View 1, View 2, View 3
What is a data system? Tweets → # Tweets / URL, Influence scores, Trending topics
Everything else (schemas, databases, indexing, etc.) is implementation detail
Essential properties of a data system
1. Robust to machine failure and human error
2. Low latency reads and updates
3. Scalable
4. General
5. Extensible
6. Allows ad-hoc analysis
7. Minimal maintenance
8. Debuggable
Layered Architecture: a Speed Layer on top of a Batch Layer
Let’s pretend temporarily that update latency doesn’t matter
Let’s pretend it’s OK for a view to lag by a few hours
Batch layer • Arbitrary computation • Horizontally scalable • High latency
Batch layer
Not the end-all-be-all of batch computation, but the most general
Hadoop = Distributed Filesystem + MapReduce
(diagram: MapReduce jobs read input files from the distributed filesystem and write output files back to it)
Hadoop • Express your computation in terms of MapReduce
• Get parallelism and scalability “for free”
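The MapReduce model the slides refer to can be sketched as a toy in-memory simulation, not Hadoop itself. This is a minimal sketch assuming a hypothetical `# Tweets / URL` computation; the function names and record shapes are invented for illustration:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy in-memory MapReduce: group mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase
            groups[key].append(value)       # shuffle phase: group by key
    return {key: reducer(key, values) for key, values in groups.items()}  # reduce phase

# Example: count tweets per URL (the "# Tweets / URL" view)
tweets = [
    {"url": "a.com", "id": 1},
    {"url": "b.com", "id": 2},
    {"url": "a.com", "id": 3},
]
counts = map_reduce(
    tweets,
    mapper=lambda t: [(t["url"], 1)],
    reducer=lambda url, ones: sum(ones),
)
# counts == {"a.com": 2, "b.com": 1}
```

On Hadoop the same mapper and reducer run in parallel across a cluster, which is the "parallelism and scalability for free" point above.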
Batch layer • Store master copy of dataset • Master dataset is append-only
Batch layer view = fn(master dataset)
Batch layer (diagram): MapReduce jobs compute Batch View 1, Batch View 2, and Batch View 3 from the master dataset
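The `view = fn(master dataset)` idea can be illustrated with a recompute-from-scratch sketch. The record shapes and view below are hypothetical, assumed for illustration:

```python
master_dataset = [
    {"type": "tweet", "url": "a.com", "user": "alice"},
    {"type": "tweet", "url": "a.com", "user": "bob"},
    {"type": "tweet", "url": "a.com", "user": "alice"},
    {"type": "tweet", "url": "b.com", "user": "alice"},
]

def unique_users_per_url(dataset):
    """A batch view is a pure function of the entire master dataset.

    Recomputing from scratch always yields a consistent view, and human
    errors are fixed by deleting the bad records and recomputing.
    """
    users = {}
    for record in dataset:
        if record["type"] == "tweet":
            users.setdefault(record["url"], set()).add(record["user"])
    return {url: len(names) for url, names in users.items()}

view = unique_users_per_url(master_dataset)  # {"a.com": 2, "b.com": 1}
```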
Batch layer • In practice, too expensive to fully recompute each view to get updates
• A production batch workflow adds the minimum amount of incrementalization necessary for performance
Incremental batch layer (diagram): new data is appended to all data; the batch workflow performs view maintenance on Batch Views 1, 2, and 3, which serve queries
Batch layer Robust and fault-tolerant to both machine and human error. Low latency reads. Scalable to increases in data or traffic. Extensible to support new features or related services. Generalizes to diverse types of data and requests. Allows ad hoc queries. Minimal maintenance. Debuggable: can trace how any value in the system came to be. The one property it lacks: low latency updates.
Speed layer Compensate for high latency of updates to batch layer
Speed layer Key point: Only needs to compensate for data not yet absorbed in batch layer
Hours of data instead of years of data
Application-level Queries (diagram): query the Batch Layer, query the Speed Layer, and merge the results
Speed layer Once data is absorbed into batch layer, can discard speed layer results
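A query in the layered architecture can be sketched as merging a stale-but-complete batch view with a realtime view covering only the data not yet absorbed by the batch layer. The counters below are hypothetical, assumed for illustration:

```python
batch_view = {"a.com": 1000, "b.com": 400}  # computed hours ago by the batch layer
realtime_view = {"a.com": 7, "c.com": 2}    # covers only data since the last batch run

def query(url):
    """Merge: batch result plus the speed layer's compensation for recent data."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

query("a.com")  # 1007: batch total plus tweets seen since the batch run
# After the next batch run absorbs the recent data, realtime_view is discarded.
```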
Speed layer • Message passing • Incremental algorithms • Read/Write databases • Riak • Cassandra • HBase • etc.
Speed layer
Significantly more complex than the batch layer
Speed layer
But the batch layer eventually overrides the speed layer
Speed layer
So that complexity is transient
Flexibility in layered architecture • Do slow and accurate algorithm in batch layer
• Do fast but approximate algorithm in speed layer
• “Eventual accuracy”
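The slides don't name a specific approximate algorithm; as one illustration (an assumption, not BackType's actual implementation), a distinct-count view could use an exact set in the batch layer and linear counting in the speed layer:

```python
import hashlib
import math

def exact_uniques(users):
    """Batch layer: exact distinct count over the full dataset (slow but accurate)."""
    return len(set(users))

def approx_uniques(users, m=256):
    """Speed layer: linear-counting estimate of distinct users in O(m) memory."""
    buckets = [False] * m
    for user in users:
        h = int(hashlib.md5(user.encode()).hexdigest(), 16) % m
        buckets[h] = True
    zeros = buckets.count(False)
    return m if zeros == 0 else round(m * math.log(m / zeros))

users = ["user%d" % i for i in range(100)]
exact_uniques(users)   # 100
approx_uniques(users)  # close to 100; error is small and bounded
```

The approximation only ever covers a few hours of data, and the batch layer's exact answer eventually overrides it: "eventual accuracy."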
Data model Every record is a single, discrete fact at a moment in time
Data model • Alice lives in San Francisco as of time 12345 • Bob and Gary are friends as of time 13723 • Alice lives in New York as of time 19827
Data model • Remember: master dataset is append-only • A person can have multiple location records
• “Current location” is a view on this data: pick the location with the most recent timestamp
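The "current location" view over the timestamped facts above can be sketched directly (the dict-based fact encoding is an assumption for illustration):

```python
facts = [
    {"person": "Alice", "location": "San Francisco", "timestamp": 12345},
    {"person": "Bob",   "friend": "Gary",            "timestamp": 13723},
    {"person": "Alice", "location": "New York",      "timestamp": 19827},
]

def current_location(person, facts):
    """'Current location' view: the person's location fact with the latest timestamp.

    The master dataset is append-only, so older location facts are never
    deleted; the view simply ignores them.
    """
    locs = [f for f in facts if f.get("person") == person and "location" in f]
    return max(locs, key=lambda f: f["timestamp"])["location"] if locs else None

current_location("Alice", facts)  # "New York"
```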
Data model • Extremely useful having the full history for each entity
• Doing analytics • Recovering from mistakes (like writing bad data)
Data model (graph schema diagram): Alice (property Gender: female) is the Reactor of Tweet 123 (property Content: “RT @bob Data is fun!”, property Reshare: true), which is a Reaction to Tweet 456 (property Content: “Data is fun!”), whose Reactor is Bob
Questions? Twitter: @nathanmarz Email: nathan.marz@gmail.com Web: http://nathanmarz.com