
Data processing @ bit.ly

Jehiah Czebotar jehiah@gmail.com @jehiah


http://www.jsps.go.jp/english/e-jafos/2010_01.html

http://bit.ly/gFNuXa




DATA



The Big Data Problem
• Big Dataset, No Updates
• Big Dataset, Lots of Updates
• Small Dataset, Lots of Updates


sortdb
$ sort data.csv > sorted_data.csv
$ sortdb -F ',' -f sorted_data.csv -p 8080
$ curl http://127.0.0.1:8080/get?key=...

http://bit.ly/simplehttp
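The idea of serving lookups straight from a pre-sorted file can be sketched in a few lines of Python. This is only an illustration of key lookup by binary search over sorted records, not sortdb's actual implementation; the function names `load_sorted` and `get` are hypothetical, and the sketch assumes the file was sorted in plain byte order so that it matches Python's string comparison.

```python
import bisect

def load_sorted(lines, sep=","):
    """Split each 'key,value' line once; input must already be sorted by key."""
    records = [line.rstrip("\n").split(sep, 1) for line in lines if line.strip()]
    keys = [record[0] for record in records]
    return keys, records

def get(keys, records, key):
    """Binary-search the sorted keys; return the value, or None if absent."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return records[i][1]
    return None

# Stand-in for the contents of sorted_data.csv
keys, records = load_sorted(["apple,1\n", "banana,2\n", "cherry,3\n"])
print(get(keys, records, "banana"))
print(get(keys, records, "durian"))
```

Because the data never changes after sorting, no index has to be maintained: each lookup costs a logarithmic number of comparisons.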


simplequeue
• $ curl -d "data=..." http://simplequeue/put
• $ data=`curl http://simplequeue/get`



simplequeue

passing database changes through a queue decouples receiving data from a client from writing it to the database, so a slow database does not slow down the client-facing side
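The decoupling can be sketched in a single process, with Python's standard `queue.Queue` standing in for simplequeue and a plain dict standing in for the database. Both stand-ins, and the names `receive` and `writer`, are assumptions made for illustration; in the setup described on the slide the two sides would be separate processes talking to simplequeue over HTTP.

```python
import queue
import threading

pending = queue.Queue()   # stand-in for simplequeue
database = {}             # stand-in for the real database

def receive(key, value):
    """Front end: accept a change and return immediately."""
    pending.put((key, value))

def writer():
    """Back end: drain the queue at whatever pace the database allows."""
    while True:
        item = pending.get()
        if item is None:      # sentinel: stop the worker
            break
        key, value = item
        database[key] = value

worker = threading.Thread(target=writer)
worker.start()
for i in range(5):
    receive("k%d" % i, i)     # never blocks on the database
pending.put(None)
worker.join()
print(database)
```

The front end only pays the cost of an enqueue; the writer can fall behind during bursts and catch up later.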




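class BackoffTimer(object):
    def __init__(self):
        self.interval = 0  # seconds to sleep before the next attempt

    def failure(self):
        # start at 1s, then double on each consecutive failure
        self.interval = max(self.interval * 2, 1)

    def success(self):
        # shrink quickly; intervals of 4s or less drop straight to 0
        self.interval = max(self.interval * .25, 1) - 1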

allows processing to gracefully slow down when remote systems become unavailable or start returning errors (e.g. sleep 1s, 2s, 4s, 8s, 16s, 32s, ...)
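To see the sleep sequence concretely, the timer can be driven by hand. The sketch below uses `max` in `failure()` so that the interval can leave 0 and then double, which is what the 1s, 2s, 4s, ... sequence requires; the lists `failures` and `recoveries` are just for the demonstration.

```python
class BackoffTimer(object):
    def __init__(self):
        self.interval = 0

    def failure(self):
        # max() lets the interval leave 0: 0 -> 1 -> 2 -> 4 ...
        self.interval = max(self.interval * 2, 1)

    def success(self):
        # quarter the interval, and drop to 0 once it is small
        self.interval = max(self.interval * .25, 1) - 1

timer = BackoffTimer()

failures = []
for _ in range(6):            # six failures in a row
    timer.failure()
    failures.append(timer.interval)

recoveries = []
for _ in range(3):            # then three successes
    timer.success()
    recoveries.append(timer.interval)

print(failures)    # [1, 2, 4, 8, 16, 32]
print(recoveries)  # [7.0, 0.75, 0]
```

Recovery is much faster than the ramp-up: three successes are enough to return from a 32s interval to no delay.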


import time

class QueueReader(object):
    def __init__(self):
        self.backoff_timer = BackoffTimer()

    def run(self):
        while True:
            try:
                data = queue.get()
                if not data:
                    # queue is empty; poll again shortly
                    time.sleep(.5)
                    continue
                self.handle(data)
                self.backoff_timer.success()
            except Exception:
                self.backoff_timer.failure()

            if self.backoff_timer.interval:
                time.sleep(self.backoff_timer.interval)
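The interaction between the reader loop and the timer can be exercised without a real queue or real waiting. In this sketch a `collections.deque` stands in for the queue, the `sleep` function is injected so that the requested delays are recorded instead of slept, and `run_once` is a hypothetical single-step version of the loop; none of these are part of the original code.

```python
import collections

class BackoffTimer(object):
    def __init__(self):
        self.interval = 0

    def failure(self):
        self.interval = max(self.interval * 2, 1)

    def success(self):
        self.interval = max(self.interval * .25, 1) - 1

class QueueReader(object):
    def __init__(self, source, handle, sleep):
        self.source = source      # deque standing in for the queue
        self.handle = handle      # callable that processes one message
        self.sleep = sleep        # injected so the demo does not really wait
        self.backoff_timer = BackoffTimer()

    def run_once(self):
        """One loop iteration; returns False once the source is empty."""
        if not self.source:
            return False
        data = self.source.popleft()
        try:
            self.handle(data)
            self.backoff_timer.success()
        except Exception:
            self.backoff_timer.failure()
        if self.backoff_timer.interval:
            self.sleep(self.backoff_timer.interval)
        return True

def handle(data):
    # Simulate a remote system that rejects some messages
    if data.startswith("bad"):
        raise ValueError(data)

sleeps = []
messages = collections.deque(["ok", "bad1", "bad2", "bad3", "ok"])
reader = QueueReader(messages, handle, sleeps.append)
while reader.run_once():
    pass
print(sleeps)  # [1, 2, 4]
```

Each consecutive failure doubles the delay, and the final success brings the interval back to zero so no further sleep is requested.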


pubsub
• long-lived persistent HTTP connections that stream back JSON messages
• a way to separate the core data collection (or production) from data consumers



$ curl --silent http://pubsubserver/sub
{ "a": "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3 like Mac OS X; ja-jp) AppleWebKit/533.17.9 (KHTML, like Gecko) Mobile/8F190", "c": "JP", "nk": 1, "tz": "Asia/Tokyo", "gr": "40", "g": "hkIdmh", "h": "g0ABCf", "k": "4d8547e6-0022b-04438-d8ac8fa8", "l": "portalexcite", "al": "ja-jp", "hh": "bit.ly", "r": "direct", "u": "http://paltyyuria.exblog.jp/15685838/", "t": 1300587345, "hc": 1300587133, "cy": "Tokyo", "ll": [ 35.685001, 139.751404 ], "i": "613d872b3663f1f0cd54b48653ec788" }
{ "a": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; WOW64; Trident/4.0; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; OfficeLiveConnector.1.5; OfficeLivePatch.1.3; .NET4.0C; .NET CLR 3.0.30729)", "c": "US", "nk": 0, "tz": "America/Chicago", "gr": "TX", "g": "fTPW1w", "h": "fK3C3I", "k": "4d856351-001b9-07054-c6ac8fa8", "l": "espn", "al": "en-us", "hh": "es.pn", "r": "http://espn.go.com/mlb/", "u": "http://espn.go.com/blog/dallas/texas-rangers/post/_/id/4861596/surprise-six-saturdaycamp-recap-4", "t": 1300587345, "hc": 1300584373, "cy": "Dallas", "ll": [ 32.809799, -96.799301 ], "i": "6686654b9493543ff18d36120d5caa9" }
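A consumer of such a stream might look like the sketch below. It assumes one JSON object per line and that the `c` field holds a country code, as the sample messages suggest; the function name `count_by_country` is hypothetical, and the two sample lines are abbreviated versions of the messages above. In practice the lines would be read from the open HTTP response rather than from a list.

```python
import collections
import json

def count_by_country(lines):
    """Tally messages by their "c" field, reading one JSON object per line."""
    counts = collections.Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, if the stream sends any
        message = json.loads(line)
        counts[message.get("c", "unknown")] += 1
    return counts

# Abbreviated versions of the two sample messages
sample = [
    '{"c": "JP", "tz": "Asia/Tokyo", "hh": "bit.ly", "t": 1300587345}',
    '{"c": "US", "tz": "America/Chicago", "hh": "es.pn", "t": 1300587345}',
]
counts = count_by_country(sample)
print(dict(counts))
```

Because the function only needs an iterable of lines, any number of independent consumers can attach to the same stream without the producer knowing about them.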


tools we like
• memcached
• tokyo tyrant / tokyo cabinet
• simplehttp (simplequeue, sortdb, pubsub)
• tornado (fast async python framework)
• json files
• mysql (battle tested; reliable replication)
• mongod


github.com/bitly/simplehttp


Thank you! Questions?

@jehiah

