Become Efficient or Die: The Story of BackType
Nathan Marz @nathanmarz
BackType
BackType helps businesses understand social media and make use of it
BackType Data Services (APIs) Social Media Analytics Dashboard
APIs • Conversational graph for a URL • Comment search • # Tweets / URL • Influence scores • Top sites • Trending links stream • etc.
URL Profiles
Site comparisons
Influencer Profiles
Twitter Account Analytics
Topic Analysis
BackType stats • >30 TB of data • 100 to 200 machine cluster • Process 100M messages per day • Serve 300 req/sec
BackType stats • 3 full-time employees • 2 interns • $1.4M in funding
How? Avoid waste Invest in efficiency
Development philosophy • Waterfall • Agile • Scrum • Kanban
Development philosophy Suffering-oriented programming
Suffering-oriented Programming Don’t add process until you feel the pain of not having it
Example
• Growing from 2 people to 3 people
Example • Founders were essentially “one brain” • Cowboy coding led to communication mishaps
• Added biweekly meeting to sync up
Example
• Growing from 3 people to 5 people
Example • Moving through tasks a lot faster with 5 people
• Needed more frequent prioritization of tasks
• Changed biweekly meeting to weekly meeting
• Added “chat room standups” to facilitate mid-week adjustments
Suffering-oriented Programming Don’t build new technology until you feel the pain of not having it
Suffering-oriented Programming First, make it possible. Then, make it beautiful. Then, make it fast.
Example
• Batch processing
Make it possible • Hack things out using MapReduce/ Cascading
• Learn the ins and outs of batch processing
Make it beautiful • Wrote (and open-sourced) Cascalog • The “perfect interface” to our data
Make it fast • Use it in production • Profile and identify bottlenecks • Optimize
Overengineering Attempting to create beautiful software without a thorough understanding of the problem domain
Premature optimization Optimizing before creating a “beautiful” design, creating unnecessary complexity
Knowledge debt
(chart: your productivity plotted against your potential)
Knowledge debt Use small, independent projects to experiment with new technology
Example • Needed to write a small server to collect records into a Distributed Filesystem
• Wrote it using Clojure programming language
• Huge win: now we use Clojure for most of our systems
Example • Needed to implement social search • Wrote it using Neo4j • Ran into a lot of problems with Neo4j and rewrote it later using Sphinx
Example • Needed an automated deploy for a distributed stream processing system
• Wrote it using Pallet • Massive win: we anticipate a dramatic reduction in the complexity of administering our infrastructure
Knowledge debt
(Crappy job ad)
Knowledge debt Instead of hiring people who share your skill set, hire people with completely different skill sets
(food for thought)
Technical debt Technical debt builds up in a codebase
Technical debt • W needs to be refactored • X deploy should be faster • Y needs more unit tests • Z needs more documentation
Technical debt Never high enough priority to work on, but these issues built up and slow you down
BackSweep • Issues are recorded on a wiki page • We spend one day a month removing items from that wiki page
BackSweep • Keeps our codebase lean • Gives us a way to defer technical-debt issues when we don’t have time to deal with them
• “Garbage collection for the codebase”
What is a startup? A startup is a human institution designed to deliver a new product or service under conditions of extreme uncertainty. - Eric Ries
How do you decide what to work on?
Don’t want to waste three months building a feature no one cares about. This could be fatal!
Product development Form hypothesis → Test hypothesis → Learn
Valid? Keep. Invalid? Discard.
Example
Pro product didn’t actually exist yet
Example • We tested different feature combinations and measured click through rate
• Clicking on “sign up” went to a survey page
Example
Hypothesis #1
Customers want analytics on topics being discussed on Twitter
Testing hypothesis #1
• Fake feature -> clicking on topic goes to survey page
Testing hypothesis #1 • Do people click on those links? • If not, need to reconsider hypothesis
Hypothesis #2
Customers want to know how often topics are mentioned over time
Testing hypothesis #2 • Build topic-mentions-over-time graph for “big topics” our private beta customers are interested in (e.g. “nike”, “microsoft”, “apple”, “kodak”)
• Talk to customers
Hypothesis #3 • Customers want to see who’s talking about a topic on a variety of dimensions: recency, influence, num followers, or num retweets
Testing hypothesis #3
• Create search index on last 24 hours of data that can sort on all dimensions
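Sorting the last 24 hours of mentions on each of those dimensions can be sketched with per-dimension sort keys. A minimal sketch, assuming hypothetical record fields (`time`, `influence`, `followers`, `retweets`) invented for illustration:

```python
mentions = [
    {"user": "alice", "time": 100, "influence": 80, "followers": 5000, "retweets": 3},
    {"user": "bob",   "time": 200, "influence": 20, "followers": 100,  "retweets": 9},
]

def top_mentions(mentions, dimension):
    """Sort mentions on any supported dimension, most relevant first."""
    keys = {
        "recency":   lambda m: m["time"],
        "influence": lambda m: m["influence"],
        "followers": lambda m: m["followers"],
        "retweets":  lambda m: m["retweets"],
    }
    return sorted(mentions, key=keys[dimension], reverse=True)

top_mentions(mentions, "recency")[0]["user"]    # "bob" (most recent)
top_mentions(mentions, "influence")[0]["user"]  # "alice" (most influential)
```

A real search index (e.g. Sphinx, which the earlier slide mentions) precomputes these sort orders rather than sorting at query time.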
Lean Startup
Questions? Twitter: @nathanmarz Email: nathan.marz@gmail.com Web: http://nathanmarz.com
The Secrets of Building Realtime Big Data Systems Nathan Marz @nathanmarz
Who am I?
(Upcoming book)
BackType • >30 TB of data • Process 100M messages / day • Serve 300 requests / sec • 100 to 200 machine cluster • 3 full-time employees, 2 interns
Built on open-source Thrift Cascading Scribe ZeroMQ Zookeeper Pallet
What is a data system? Raw data → View 1, View 2, View 3
What is a data system? Tweets → # Tweets / URL, Influence scores, Trending topics
Everything else (schemas, databases, indexing, etc.) is implementation detail
Essential properties of a data system
1. Robust to machine failure and human error
2. Low latency reads and updates
3. Scalable
4. General
5. Extensible
6. Allows ad-hoc analysis
7. Minimal maintenance
8. Debuggable
Layered Architecture: a Speed Layer on top of a Batch Layer
Let’s pretend temporarily that update latency doesn’t matter
Let’s pretend it’s OK for a view to lag by a few hours
Batch layer • Arbitrary computation • Horizontally scalable • High latency
Batch layer
Not the end-all-be-all of batch computation, but the most general
Hadoop = Distributed Filesystem + MapReduce
(diagram: MapReduce jobs read input files from the distributed filesystem and write output files back to it)
Hadoop • Express your computation in terms of MapReduce
• Get parallelism and scalability “for free”
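The MapReduce model the slides refer to can be sketched as a toy in-memory simulation, not Hadoop itself. This is a minimal sketch assuming a hypothetical `# Tweets / URL` computation; the function names and record shapes are invented for illustration:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy in-memory MapReduce: group mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase
            groups[key].append(value)       # shuffle phase: group by key
    return {key: reducer(key, values) for key, values in groups.items()}  # reduce phase

# Example: count tweets per URL (the "# Tweets / URL" view)
tweets = [
    {"url": "a.com", "id": 1},
    {"url": "b.com", "id": 2},
    {"url": "a.com", "id": 3},
]
counts = map_reduce(
    tweets,
    mapper=lambda t: [(t["url"], 1)],
    reducer=lambda url, ones: sum(ones),
)
# counts == {"a.com": 2, "b.com": 1}
```

On Hadoop the same mapper and reducer run in parallel across a cluster, which is the "parallelism and scalability for free" point above.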
Batch layer • Store master copy of dataset • Master dataset is append-only
Batch layer view = fn(master dataset)
Batch layer (diagram): MapReduce jobs compute Batch View 1, Batch View 2, and Batch View 3 from the master dataset
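The `view = fn(master dataset)` idea can be illustrated with a recompute-from-scratch sketch. The record shapes and view below are hypothetical, assumed for illustration:

```python
master_dataset = [
    {"type": "tweet", "url": "a.com", "user": "alice"},
    {"type": "tweet", "url": "a.com", "user": "bob"},
    {"type": "tweet", "url": "a.com", "user": "alice"},
    {"type": "tweet", "url": "b.com", "user": "alice"},
]

def unique_users_per_url(dataset):
    """A batch view is a pure function of the entire master dataset.

    Recomputing from scratch always yields a consistent view, and human
    errors are fixed by deleting the bad records and recomputing.
    """
    users = {}
    for record in dataset:
        if record["type"] == "tweet":
            users.setdefault(record["url"], set()).add(record["user"])
    return {url: len(names) for url, names in users.items()}

view = unique_users_per_url(master_dataset)  # {"a.com": 2, "b.com": 1}
```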
Batch layer • In practice, too expensive to fully recompute each view to get updates
• A production batch workflow adds the minimum amount of incrementalization necessary for performance
Incremental batch layer (diagram): new data is appended to all data; the batch workflow performs view maintenance on Batch Views 1, 2, and 3, which serve queries
Batch layer Robust and fault-tolerant to both machine and human error. Low latency reads. Scalable to increases in data or traffic. Extensible to support new features or related services. Generalizes to diverse types of data and requests. Allows ad hoc queries. Minimal maintenance. Debuggable: can trace how any value in the system came to be. The one property it lacks: low latency updates.
Speed layer Compensate for high latency of updates to batch layer
Speed layer Key point: Only needs to compensate for data not yet absorbed in batch layer
Hours of data instead of years of data
Application-level Queries (diagram): query the Batch Layer, query the Speed Layer, and merge the results
Speed layer Once data is absorbed into batch layer, can discard speed layer results
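A query in the layered architecture can be sketched as merging a stale-but-complete batch view with a realtime view covering only the data not yet absorbed by the batch layer. The counters below are hypothetical, assumed for illustration:

```python
batch_view = {"a.com": 1000, "b.com": 400}  # computed hours ago by the batch layer
realtime_view = {"a.com": 7, "c.com": 2}    # covers only data since the last batch run

def query(url):
    """Merge: batch result plus the speed layer's compensation for recent data."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

query("a.com")  # 1007: batch total plus tweets seen since the batch run
# After the next batch run absorbs the recent data, realtime_view is discarded.
```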
Speed layer • Message passing • Incremental algorithms • Read/Write databases • Riak • Cassandra • HBase • etc.
Speed layer
Significantly more complex than the batch layer
Speed layer
But the batch layer eventually overrides the speed layer
Speed layer
So that complexity is transient
Flexibility in layered architecture • Do slow and accurate algorithm in batch layer
• Do fast but approximate algorithm in speed layer
• “Eventual accuracy”
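The slides don't name a specific approximate algorithm; as one illustration (an assumption, not BackType's actual implementation), a distinct-count view could use an exact set in the batch layer and linear counting in the speed layer:

```python
import hashlib
import math

def exact_uniques(users):
    """Batch layer: exact distinct count over the full dataset (slow but accurate)."""
    return len(set(users))

def approx_uniques(users, m=256):
    """Speed layer: linear-counting estimate of distinct users in O(m) memory."""
    buckets = [False] * m
    for user in users:
        h = int(hashlib.md5(user.encode()).hexdigest(), 16) % m
        buckets[h] = True
    zeros = buckets.count(False)
    return m if zeros == 0 else round(m * math.log(m / zeros))

users = ["user%d" % i for i in range(100)]
exact_uniques(users)   # 100
approx_uniques(users)  # close to 100; error is small and bounded
```

The approximation only ever covers a few hours of data, and the batch layer's exact answer eventually overrides it: "eventual accuracy."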
Data model Every record is a single, discrete fact at a moment in time
Data model • Alice lives in San Francisco as of time 12345 • Bob and Gary are friends as of time 13723 • Alice lives in New York as of time 19827
Data model • Remember: master dataset is append-only • A person can have multiple location records
• “Current location” is a view on this data: pick the location with the most recent timestamp
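The "current location" view over the timestamped facts above can be sketched directly (the dict-based fact encoding is an assumption for illustration):

```python
facts = [
    {"person": "Alice", "location": "San Francisco", "timestamp": 12345},
    {"person": "Bob",   "friend": "Gary",            "timestamp": 13723},
    {"person": "Alice", "location": "New York",      "timestamp": 19827},
]

def current_location(person, facts):
    """'Current location' view: the person's location fact with the latest timestamp.

    The master dataset is append-only, so older location facts are never
    deleted; the view simply ignores them.
    """
    locs = [f for f in facts if f.get("person") == person and "location" in f]
    return max(locs, key=lambda f: f["timestamp"])["location"] if locs else None

current_location("Alice", facts)  # "New York"
```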
Data model • Extremely useful having the full history for each entity
• Doing analytics • Recovering from mistakes (like writing bad data)
Data model (graph schema diagram): Alice (property Gender: female) is the Reactor of Tweet 123 (property Content: “RT @bob Data is fun!”, property Reshare: true), which is a Reaction to Tweet 456 (property Content: “Data is fun!”), whose Reactor is Bob
Questions? Twitter: @nathanmarz Email: nathan.marz@gmail.com Web: http://nathanmarz.com