Data processing @ bit.ly
Jehiah Czebotar jehiah@gmail.com @jehiah
http://www.jsps.go.jp/english/e-jafos/2010_01.html
http://bit.ly/gFNuXa
DATA
The Big Data Problem
• Big Dataset, No Updates
• Big Dataset, Lots of Updates
• Small Dataset, Lots of Updates
sortdb
$ sort data.csv > sorted_data.csv
$ sortdb -F ',' -f sorted_data.csv -p 8080
$ curl http://127.0.0.1:8080/get?key=...
http://bit.ly/simplehttp
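The reason for sorting the file first is that a sorted file can be searched by key without an index. The slides do not describe how sortdb does this internally, so the following is only a minimal sketch of the general idea (binary search over sorted records), not sortdb's implementation. The `lookup` function and the sample records are hypothetical, and the sketch assumes the records are sorted in plain byte order, as `LC_ALL=C sort` would produce.

```python
import bisect

def lookup(sorted_lines, key, delimiter=","):
    """Return the rest of the record whose first field equals key, or None."""
    # Every matching record starts with "<key><delimiter>", so a binary
    # search for that prefix lands on the record if it exists.
    prefix = key + delimiter
    i = bisect.bisect_left(sorted_lines, prefix)
    if i < len(sorted_lines) and sorted_lines[i].startswith(prefix):
        return sorted_lines[i][len(prefix):]
    return None

# Hypothetical sample data standing in for sorted_data.csv
records = sorted(["cherry,3", "apple,1", "banana,2"])
print(lookup(records, "banana"))
print(lookup(records, "durian"))
```

Each lookup touches only about log2(N) records, which is why a large dataset with no updates can be served straight from a flat file.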
simplequeue
• $ curl -f "data=..." http://simplequeue/put
• $ data=`curl http://simplequeue/get`
passing database changes through a queue decouples the performance of receiving data from a client from the performance of adding it to a database
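A minimal in-process sketch of that decoupling, assuming Python's standard `queue.Queue` as a stand-in for simplequeue and a plain list as a stand-in for the database. The names `receive`, `writer`, `pending`, and `database` are hypothetical, and the 10 ms sleep only simulates a slow write.

```python
import queue
import threading
import time

pending = queue.Queue()   # stand-in for simplequeue
database = []             # stand-in for a slow database

def receive(data):
    """Client-facing path: enqueue the change and return immediately."""
    pending.put(data)

def writer():
    """Background path: drain the queue at the database's own pace."""
    while True:
        item = pending.get()
        if item is None:      # sentinel value: stop the worker
            break
        time.sleep(0.01)      # simulate a slow database write
        database.append(item)

worker = threading.Thread(target=writer)
worker.start()

for n in range(20):
    receive({"click": n})     # returns without waiting for any write

pending.put(None)             # ask the worker to finish
worker.join()
print(len(database), "records written")
```

The client-facing loop finishes almost immediately, while the writes complete later in the background. A burst of incoming data therefore grows the queue instead of slowing down the clients.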
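class BackoffTimer(object):
    def __init__(self):
        self.interval = 0  # seconds to sleep before the next attempt

    def failure(self):
        # double the interval on each failure, starting at 1 second
        self.interval = max(self.interval * 2, 1)

    def success(self):
        # shrink the interval quickly on success, bottoming out at 0
        self.interval = max(self.interval * .25, 1) - 1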
allows processing to slow down gracefully when remote systems become unavailable or start returning errors (i.e. sleep 1s, 2s, 4s, 8s, 16s, 32s, ...)
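To make the sleep sequence concrete, here is a small worked example. It assumes that `failure()` doubles the interval with a floor of 1 second, which is what the listed sequence 1s, 2s, 4s, ... implies, and that `success()` uses the formula from the slide unchanged. The two result lists are names introduced only for this illustration.

```python
class BackoffTimer(object):
    def __init__(self):
        self.interval = 0

    def failure(self):
        # assumed: double on failure, never less than 1 second
        self.interval = max(self.interval * 2, 1)

    def success(self):
        # quarter the interval, then subtract 1 so it can reach 0
        self.interval = max(self.interval * .25, 1) - 1

timer = BackoffTimer()

after_failures = []
for _ in range(6):
    timer.failure()
    after_failures.append(timer.interval)

after_successes = []
for _ in range(3):
    timer.success()
    after_successes.append(timer.interval)

print(after_failures)   # [1, 2, 4, 8, 16, 32]
print(after_successes)  # [7.0, 0.75, 0]
```

Recovery is much faster than backoff: six failures build up a 32 second delay, and three successes bring it back to zero.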
import time

class QueueReader(object):
    def __init__(self):
        self.backoff_timer = BackoffTimer()

    def run(self):
        # `queue` (the queue client) and `handle` (the per-message work)
        # are not defined on the slide
        while True:
            try:
                data = queue.get()
                if not data:
                    time.sleep(.5)  # queue is empty; poll again shortly
                    continue
                self.handle(data)
                self.backoff_timer.success()
            except Exception:
                self.backoff_timer.failure()

            if self.backoff_timer.interval:
                time.sleep(self.backoff_timer.interval)
pubsub
• long-lived, persistent HTTP connections that stream back JSON messages
• a way to separate the core data collection (or production) from data consumers
$ curl --silent http://pubsubserver/sub
{ "a": "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3 like Mac OS X; ja-jp) AppleWebKit/533.17.9 (KHTML, like Gecko) Mobile/8F190", "c": "JP", "nk": 1, "tz": "Asia/Tokyo", "gr": "40", "g": "hkIdmh", "h": "g0ABCf", "k": "4d8547e6-0022b-04438-d8ac8fa8", "l": "portalexcite", "al": "ja-jp", "hh": "bit.ly", "r": "direct", "u": "http://paltyyuria.exblog.jp/15685838/", "t": 1300587345, "hc": 1300587133, "cy": "Tokyo", "ll": [ 35.685001, 139.751404 ], "i": "613d872b3663f1f0cd54b48653ec788" }
{ "a": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; OfficeLiveConnector.1.5; OfficeLivePatch.1.3; .NET4.0C; .NET CLR 3.0.30729)", "c": "US", "nk": 0, "tz": "America/Chicago", "gr": "TX", "g": "fTPW1w", "h": "fK3C3I", "k": "4d856351-001b9-07054-c6ac8fa8", "l": "espn", "al": "en-us", "hh": "es.pn", "r": "http://espn.go.com/mlb/", "u": "http://espn.go.com/blog/dallas/texas-rangers/post/_/id/4861596/surprise-six-saturdaycamp-recap-4", "t": 1300587345, "hc": 1300584373, "cy": "Dallas", "ll": [ 32.809799, -96.799301 ], "i": "6686654b9493543ff18d36120d5caa9" }
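A consumer of such a stream can be very small. The sketch below assumes the stream carries one JSON object per line and that the "c" field is a country code, as the sample values "JP" and "US" suggest. The function `count_by_country` is hypothetical, and the sample lines are trimmed, illustrative messages rather than real traffic.

```python
import json
from collections import Counter

def count_by_country(lines):
    """Tally messages by their "c" field from an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, if the stream sends any
        message = json.loads(line)
        counts[message.get("c", "unknown")] += 1
    return counts

# Illustrative input; only a few fields are kept per message
sample = [
    '{"c": "JP", "hh": "bit.ly", "t": 1300587345}',
    '{"c": "US", "hh": "es.pn", "t": 1300587345}',
    '',
    '{"c": "US", "hh": "bit.ly", "t": 1300587346}',
]
print(count_by_country(sample))
```

Because the function accepts any iterable of lines, the same code could read from `sys.stdin` with the curl output piped in. Each consumer holds its own connection, so adding one does not touch the data collection side.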
tools we like
• memcached
• tokyo tyrant / tokyo cabinet
• simplehttp (simplequeue, sortdb, pubsub)
• tornado (fast async python framework)
• json files
• mysql (battle tested; reliable replication)
• mongod
github.com/bitly/simplehttp
Thank you! Questions?
@jehiah