Data Structures and Algorithms for Big Databases
Michael A. Bender (Stony Brook & Tokutek) and Bradley C. Kuszmaul (MIT & Tokutek)
Big data problem

[Figure: a data-ingestion pipeline feeding a data-indexing layer, with a query processor answering queries ("42") while new data streams in.]

For on-disk data, one sees funny tradeoffs in the speeds of data ingestion, query speed, and freshness of data.
Funny tradeoff in ingestion, querying, freshness

• "Select queries were slow until I added an index onto the timestamp field... Adding the index really helped our reporting, BUT now the inserts are taking forever."
  ‣ Comment on mysqlperformanceblog.com
• "I'm trying to create indexes on a table with 308 million rows. It took ~20 minutes to load the table but 10 days to build indexes on it."
  ‣ MySQL bug #9544
• "They indexed their tables, and indexed them well, / And lo, did the queries run quick! / But that wasn't the last of their troubles, to tell– / Their insertions, like treacle, ran thick."
  ‣ Not from Alice in Wonderland by Lewis Carroll
This tutorial

Fractal-tree® index, Bɛ-tree, LSM tree

• Better data structures significantly mitigate the insert/query/freshness tradeoff.
• These structures scale to much larger sizes while efficiently using the memory hierarchy.
What we mean by Big Data

We don't define Big Data in terms of TB, PB, EB. By Big Data, we mean:
• The data is too big to fit in main memory.
• We need data structures on the data. Words like "index" or "metadata" suggest that there are underlying data structures.
• These underlying data structures are also too big to fit in main memory.
In this tutorial we study the underlying data structures for managing big data: file systems, SQL, NewSQL, and NoSQL systems.
But enough about databases... more about us.
Our Research and Tokutek

A few years ago we (Michael, Martin, and Bradley) started working together on I/O-efficient and cache-oblivious data structures. Along the way, we started Tokutek to commercialize our research.
Tokutek sells open source, ACID compliant, implementations of MySQL and MongoDB.

• TokuDB: Application → MySQL Database (SQL processing, query optimization) → libFT → File System → Disk/SSD
• TokuMX: Application → Standard MongoDB (drivers, query language, and data model) → libFT → File System → Disk/SSD

libFT implements the persistent structures for storing data on disk. It provides a Berkeley DB API and can be used independently.
Our Mindset

• This tutorial is self contained.
• We want to teach. If something we say isn't clear to you, please ask questions or ask us to clarify/repeat something.
• You should be comfortable using math.
• You should want to listen to data structures for an afternoon.
Topics and Outline for this Tutorial

• I/O model and cache-oblivious analysis.
• Write-optimized data structures.
• How write-optimized data structures can help file systems.
• Cache-oblivious analysis.
• Log-structured merge trees.
• Indexing strategies.
• Block-replacement algorithms.
• Sorting Big Data.
Data Structures and Algorithms for Big Data
Module 1: I/O Model and Cache-Oblivious Analysis

Story for Module
• If we want to understand the performance of data structures within databases, we need algorithmic models for understanding I/O.
• There's a long history of memory-hierarchy models. Many are beautiful. Most have found little practical use.
• Two approaches are very powerful: the Disk Access Machine (DAM) model and cache-oblivious analysis.
• We'll present the DAM model in this module to lay a foundation for the rest of the tutorial. Cache-oblivious analysis comes later in the tutorial.
I/O in the Disk Access Machine (DAM) Model [Aggarwal+Vitter '88]

How computation works:
• Data is transferred in blocks of size B between RAM (size M) and disk.
• The number of block transfers dominates the running time.

Goal: minimize the number of block transfers.
• Performance bounds are parameterized by block size B, memory size M, data size N.
Example: Scanning an Array

Question: How many I/Os to scan an array of length N?
Answer: O(N/B) I/Os; a scan touches at most N/B + 2 blocks.
Example: Searching in a B-tree

Question: How many I/Os for a point query or insert into a B-tree with N elements?
Answer: O(log_B N), the height of the tree.
Example: Searching in an Array

Question: How many I/Os to perform a binary search into an array of size N?
Answer: O(log_2(N/B)) ≈ O(log_2 N), since only the last ~log_2 B probes land in the same block.
Example: Searching in an Array Versus B-tree

Moral: B-tree searching is a factor of O(log_2 B) faster than binary searching, since
  O(log_B N) = O(log_2 N / log_2 B).
For example, with N = 2^30 and B = 1024, a binary search costs about 20 I/Os while a B-tree search costs about 3.
The DAM model is simple and pretty good

The Disk Access Machine (DAM) model
• ignores CPU costs and
• assumes that all block accesses have the same cost.

Is that a good performance model? Far from perfect, but very powerful nonetheless. (We'll discuss more later in the tutorial.)
Data Structures and Algorithms for Big Data
Module 2: Write-Optimized Data Structures
This module

Fractal-tree® index, Bɛ-tree, LSM tree

• Write-optimized structures significantly mitigate the insert/query/freshness tradeoff.
• One can insert 10x-100x faster than B-trees while achieving similar point query performance.
An algorithmic performance model

Recall the DAM model [Aggarwal+Vitter '88]: data is transferred in blocks between RAM and disk, the number of block transfers dominates the running time, and performance bounds are parameterized by B, M, N.

B-tree point queries: O(log_B N) I/Os. Binary search in an array: O(log_2(N/B)) ≈ O(log_2 N) I/Os, slower by a factor of O(log_2 B).
Write-optimized data structures performance

Data structures: [O'Neil, Cheng, Gawlick, O'Neil 96], [Buchsbaum, Goldwasser, Venkatasubramanian, Westbrook 00], [Arge 03], [Graefe 03], [Brodal, Fagerberg 03], [Bender, Farach-Colton, Fineman, Fogel, Kuszmaul, Nelson 07], [Brodal, Demaine, Fineman, Iacono, Langerman, Munro 10], [Spillane, Shetty, Zadok, Archak, Dixit 11].
Systems: BigTable, Cassandra, HBase, LevelDB, TokuDB.
Some write-optimized structures

Insert/delete cost:
• B-tree: O(log_B N) = O(log N / log B)
• Write-optimized: O(log N / B)

• If B=1024, then the insert speedup is B/log B ≈ 100.
• Hardware trends mean bigger B, bigger speedup.
• Less than 1 I/O per insert.
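As a sanity check on that speedup claim, here is a quick back-of-the-envelope calculation (our own sketch; the parameters are illustrative):

```python
import math

N = 2**30   # one billion elements
B = 1024    # elements per block

btree_insert = math.log(N, B)        # O(log_B N) I/Os per B-tree insert
buffered_insert = math.log2(N) / B   # O((log N)/B) I/Os per buffered insert

print(f"B-tree: ~{btree_insert:.1f} I/Os per insert")               # ~3.0
print(f"write-optimized: ~{buffered_insert:.4f} I/Os per insert")   # ~0.03
print(f"speedup: ~{btree_insert / buffered_insert:.0f}x")           # ~102 = B/log B
```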
Optimal Search-Insert Tradeoff [Brodal, Fagerberg 03]

Optimal tradeoff (as a function of ɛ, 0 ≤ ɛ ≤ 1):

                      insert                          point query
optimal tradeoff:     O(log_{1+B^ɛ} N / B^{1-ɛ})      O(log_{1+B^ɛ} N)
B-tree (ɛ=1):         O(log_B N)                      O(log_B N)
ɛ=1/2:                O(log_B N / √B)                 O(log_B N)
ɛ=0:                  O(log N / B)                    O(log N)

10x-100x faster inserts.
Illustration of Optimal Tradeoff [Brodal, Fagerberg 03]

[Figure: point-query speed (slow to fast) versus insert speed (slow to fast). B-trees sit at the fast-query end of the optimal curve and logging at the fast-insert end. A target of opportunity lies near the B-tree end: insertions improve by 10x-100x with almost no loss of point-query performance.]
One way to Build Write-Optimized Structures (other approaches later in the tutorial)
A simple write-optimized structure

O(log N) queries and O((log N)/B) inserts:
• A balanced binary tree with a buffer of size B on each node.

Inserts + deletes:
• Send insert/delete messages down from the root and store them in buffers.
• When a buffer fills up, flush it one level down.
Analysis of writes

An insert/delete costs amortized O((log N)/B):
• A buffer flush costs O(1) I/Os and sends B elements down one level.
• So it costs O(1/B) to send an element down one level of the tree.
• There are O(log N) levels in the tree.
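Here is a minimal in-memory sketch of this buffered structure in Python (our own illustrative code, assuming a fixed integer key space; a real implementation would keep each node's buffer in an on-disk block):

```python
class Node:
    def __init__(self, lo, hi, B, depth, max_depth):
        self.lo, self.hi = lo, hi        # key range covered by this node
        self.buffer = []                 # pending insert messages
        self.B = B
        self.leaf = depth == max_depth
        self.items = []                  # elements stored at a leaf
        if not self.leaf:
            mid = (lo + hi) // 2
            self.left = Node(lo, mid, B, depth + 1, max_depth)
            self.right = Node(mid, hi, B, depth + 1, max_depth)

    def insert(self, key):
        self.buffer.append(key)
        if len(self.buffer) >= self.B:   # buffer full: flush one level down
            self.flush()

    def flush(self):
        if self.leaf:
            self.items.extend(self.buffer)
        else:
            mid = (self.lo + self.hi) // 2
            for k in self.buffer:
                (self.left if k < mid else self.right).insert(k)
        self.buffer = []

    def search(self, key):
        # Examine each buffer along the root-to-leaf path: O(log N) probes.
        if key in self.buffer:
            return True
        if self.leaf:
            return key in self.items
        mid = (self.lo + self.hi) // 2
        return (self.left if key < mid else self.right).search(key)

tree = Node(0, 1 << 20, B=64, depth=0, max_depth=8)
for k in [5, 17, 123456, 999999]:
    tree.insert(k)
print(tree.search(123456), tree.search(42))  # True False
```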
Difficulty of Key Accesses
Analysis of point queries

To search:
• Examine each buffer along a single root-to-leaf path.
• This costs O(log N) I/Os.
Obtaining optimal point queries + very fast inserts

Use a tree with fanout √B, where each node has a buffer of size B.

• Point queries cost O(log_√B N) = O(log_B N).
  This is the tree height.
• Inserts cost O((log_B N)/√B).
  Each flush costs O(1) I/Os and flushes √B elements one level down.
What the world looks like

Insert/point query asymmetry:
• Inserts can be fast: >50K high-entropy writes/sec/disk.
• Point queries are necessarily slow: <200 high-entropy reads/sec/disk.

We are used to reads and writes having about the same cost, but writing is easier than reading. Reading is hard.
The right read-optimization is write-optimization

The right index makes queries run fast, and write-optimized structures maintain indexes efficiently.

Fast writing is a currency we use to accelerate queries. Better indexing means faster queries.
The right read-optimization is write-optimization

[Figure: I/O load over time drops sharply after adding selective indexes.]

Add selective indexes. (We can now afford to maintain them.)

Write-optimized structures can significantly mitigate the insert/query/freshness tradeoff.
Implementation Issues
Write optimization. ✔ What's missing?

Optimal read-write tradeoff: easy. Full featured: hard.
• Variable-sized rows
• Concurrency-control mechanisms
• Multithreading
• Transactions, logging, ACID-compliant crash recovery
• Optimizations for the special cases of sequential inserts and bulk loads
• Compression
• Backup
Systems often assume search cost = insert cost

Some inserts/deletes have hidden searches. Examples:
• Return an error when a duplicate key is inserted.
• Return the number of elements removed on a delete.

These "cryptosearches" throttle insertions down to the performance of B-trees.
Cryptosearches in uniqueness checking

Uniqueness checking has a hidden search:

    If Search(key) == True
      Return Error;
    Else
      Fast_Insert(key, value);

In a B-tree, uniqueness checking comes for free: the insert already fetches the leaf, so checking whether the key exists is no biggie.

In a write-optimized structure, that cryptosearch can throttle performance:
• Insertion messages are injected at the root and only eventually reach the "bottom" of the structure.
• Insertion with uniqueness checking can be 100x slower.
• Bloom filters, Cascade Filters, etc. help. [Bender, Farach-Colton, Johnson, Kraner, Kuszmaul, Medjedovic, Montes, Shetty, Spillane, Zadok 12]
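As a sketch of how a filter mitigates the cryptosearch, the code below guards the uniqueness check with a Bloom filter so that most inserts skip the expensive search. The `index.search`/`index.fast_insert` interface is an assumption of ours, not a real API:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may say 'maybe present', never misses a member."""
    def __init__(self, nbits=1 << 20, nhashes=4):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, key):
        for i in range(self.nhashes):
            h = hashlib.blake2b(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def maybe_contains(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def insert_unique(index, bloom, key, value):
    # Only pay for a real search when the filter says the key may exist.
    if bloom.maybe_contains(key) and index.search(key):
        raise KeyError(f"duplicate key {key!r}")
    index.fast_insert(key, value)   # blind, buffered insert: no search I/O
    bloom.add(key)
```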
A locking scheme with cryptosearches

One implementation of pessimistic locking: maintain locks (reader locks, writer locks, writer range locks) in the leaves.

To insert a new row v, determine whether there is already a lock on v at a leaf. This is also a cryptosearch.
Performance

iiBench insertion benchmark: write performance on large data. [Figure: insertion rate over time; the legend includes MongoDB and MySQL.]

TokuMX runs fast because it uses less I/O. [Figure: I/Os per second during 100M inserts into a collection with 3 secondary indexes.]

Compression (up to 25x). [Figure: on-disk size comparison.]
iiBench on SSD

[Figure: insertion rate versus cumulative insertions (up to 1.5e8); legend: TokuDB on FusionIO and X25-E RAID10, InnoDB on FusionIO and X25-E RAID10.]

TokuDB on rotating disk beats InnoDB on SSD.
Write-optimization Can Help Schema Changes

Hot Indexing: you can add an index to an existing table while it stays online. The total downtime is seconds to a few minutes, because when the index is finished MySQL closes and reopens the table; the downtime occurs not when the command is issued, but later on. Still, it is quite minimal, as we showed in a blog post.

(Hardware: CentOS 5.5; 2x Xeon E5310; 4GB RAM; 4x 1TB 7.2k SATA in RAID0.)
Scaling into the Future
Write-optimization going forward

Example: time to fill a disk in 1973, 2010, 2022, logging high-entropy data sequentially versus indexing it in a B-tree or a Fractal tree (row size 1K).

Year   Size    Bandwidth   Access Time   Log data   Fill via B-tree   Fill via Fractal tree*
1973   35MB    835KB/s     25ms          39s        975s              200s
2010   3TB     150MB/s     10ms          5.5h       347d              36h
2022   220TB   1.05GB/s    10ms          2.4d       70y               23.3d

* Projected times for fully multi-threaded version.

Better data structures may be a luxury now, but they will be essential by the decade's end.
Summary of Module

Write-optimization can solve many problems.
• There is a provable point-query/insert tradeoff. We can insert 10x-100x faster without hurting point queries.
• We can avoid much of the funny tradeoff between data ingestion, freshness, and query speed.
• We can avoid tuning knobs.
Data Structures and Algorithms for Big Data
Module 3: (Case Study) TokuFS, or How to Make a Write-Optimized File System
Story for Module

Algorithms for Big Data apply to all storage systems, not just databases. Some big-data users use a file system. The problem with Big Data is microdata...
HEC FSIO Grand Challenges

• Store 1 trillion files.
• Create tens of thousands of files per second.
• Traverse directory hierarchies fast (ls -R).

B-trees would require at least hundreds of disk drives.
TokuFS [Esmet, Bender, Farach-Colton, Kuszmaul HotStorage12]

• A file-system prototype (layered over TokuDB, which runs over XFS).
• >20K file creates/sec; very fast ls -R.
• Meets the HEC grand challenges on a cheap disk (except 1 trillion files).
• TokuFS offers orders-of-magnitude speedup on microdata workloads.
  ‣ Aggregates microwrites while indexing.
  ‣ So it can be faster than the underlying file system.
Big speedups on microwrites

We ran microdata-intensive benchmarks:
• Compared TokuFS to ext4, XFS, Btrfs, ZFS.
• Stressed metadata and file data.
• Used commodity hardware: 2-core AMD, 4GB RAM, a single 7200 RPM disk. A simple, cheap setup with no hardware tricks.

In all tests, we observed orders of magnitude speedup.
Faster on small file creation

Create 2 million 200-byte files in a directory tree. [Figure, log scale.]
Faster on metadata scan

Recursively scan the directory tree for metadata:
• Use the same 2 million files created before.
• Start on a cold cache to measure disk I/O efficiency.
Faster on big directories

Create one million empty files in a directory:
• Create files with random names, then read them back.
• Tests how well a single directory scales.
Faster on microwrites in a big file

Randomly write out a file in small, unaligned pieces.
TokuFS Implementation
TokuFS employs two indexes

Metadata index:
• Maps pathname → file metadata.
  ‣ /home/esmet → mode, file size, access times, ...
  ‣ /home/esmet/tokufs.c → mode, file size, access times, ...

Data index:
• Maps (pathname, blocknum) → bytes.
  ‣ (/home/esmet/tokufs.c, 0) → [block of bytes]
  ‣ (/home/esmet/tokufs.c, 1) → [block of bytes]
• Block size is a compile-time constant: 512, giving good performance on small files and moderate performance on large files.
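Here is a minimal in-memory sketch of the two-index layout (our own code; plain dicts stand in for the on-disk fractal-tree indexes, whose sorted key order is what provides the locality discussed next):

```python
BLOCK_SIZE = 512  # compile-time constant in TokuFS; fixed here for the sketch

class TokuFSSketch:
    def __init__(self):
        self.meta = {}    # metadata index: pathname -> metadata dict
        self.data = {}    # data index: (pathname, blocknum) -> bytes

    def create(self, path, mode=0o644):
        self.meta[path] = {"mode": mode, "size": 0}

    def write(self, path, offset, buf):
        # Split an arbitrary write into (path, blocknum) -> block updates.
        while buf:
            blocknum, off = divmod(offset, BLOCK_SIZE)
            chunk, buf = buf[:BLOCK_SIZE - off], buf[BLOCK_SIZE - off:]
            block = bytearray(self.data.get((path, blocknum), b"\0" * BLOCK_SIZE))
            block[off:off + len(chunk)] = chunk
            self.data[(path, blocknum)] = bytes(block)
            offset += len(chunk)
        self.meta[path]["size"] = max(self.meta[path]["size"], offset)

    def read(self, path, offset, n):
        out = bytearray()
        while n > 0:
            blocknum, off = divmod(offset, BLOCK_SIZE)
            block = self.data.get((path, blocknum), b"\0" * BLOCK_SIZE)
            take = min(n, BLOCK_SIZE - off)
            out += block[off:off + take]
            offset, n = offset + take, n - take
        return bytes(out)

fs = TokuFSSketch()
fs.create("/home/esmet/tokufs.c")
fs.write("/home/esmet/tokufs.c", 500, b"hello world")   # unaligned microwrite
print(fs.read("/home/esmet/tokufs.c", 500, 11))          # b'hello world'
```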
Common queries exhibit locality

Metadata index keys: the full path as a string.
• All the children of a directory are contiguous in the index, so reading a directory is simple and fast.

Data index keys: (full path, blocknum).
• All the blocks for a file are contiguous in the index, so reading a file is simple and fast.
TokuFS compresses indexes

Reduces overhead from full-path keys:
• Pathnames are highly "prefix redundant"; they compress very, very well in practice.

Reduces overhead from zero-valued padding:
• Uninitialized bytes in a block are set to zero.
• Good portions of the metadata struct are set to zero.

Compression is between 7-15x on real data (for example, a full MySQL source tree).
TokuFS is fully functional

TokuFS is a prototype, but fully functional:
• Implements files, directories, metadata, etc.
• Interfaces with applications via a shared library and header.

We wrote a FUSE implementation, too. FUSE lets you implement file systems in user space, but there's overhead, so performance isn't optimal. The best way to run is through our POSIX-like file API.
Microdata is the Problem
Data Structures and Algorithms for Big Data
Module 4: Cache-Oblivious Analysis
Recall the Disk Access Machine

External-memory model:
• Time bounds are parameterized by B, M, N.
• Goal: minimize # of block transfers ≈ time.

Cache-Oblivious (CO) Algorithms [Frigo, Leiserson, Prokop, Ramachandran '99]

Beautiful restriction:
• Parameters B, M are unknown to the algorithm or coder.

An optimal CO algorithm is universal for all B, M, N.
Overview of Module

• Cache-oblivious definition
• Cache-oblivious B-tree
• Cache-oblivious performance advantages
• Cache-oblivious write-optimized data structure (COLA)
• Cache-adaptive algorithms
Traditional B-trees aren't cache-oblivious

The fan-out is a function of B. But there do exist cache-oblivious B-trees: we can still achieve O(log_B N) I/Os per operation, even without parameterizing by B or M. [Bender, Demaine, Farach-Colton '00] [Bender, Duan, Iacono, Wu '02] [Brodal, Fagerberg, Jacob '02]
Static cache-oblivious B-tree (no inserts) [Prokop 99]

Technique: divide & conquer (the van Emde Boas layout).
• Start with a binary tree with N nodes and height lg N.
• Split it at the middle level into a top subtree X of height (lg N)/2 and √N bottom subtrees Y_1, Y_2, ..., Y_√N, each of height (lg N)/2.
• Lay out the tree as X followed by Y_1, Y_2, ..., Y_√N, laying out each subtree recursively the same way.
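A small sketch of the van Emde Boas layout (our own code): it computes the order in which the nodes of a complete binary tree, identified by heap-style indices, are laid out in memory.

```python
def veb_layout(height):
    """Return tree nodes (heap-style indices) in van Emde Boas layout order.

    The tree is split at the middle into a top subtree and bottom subtrees,
    each laid out recursively, so any root-to-leaf path touches O(log_B N)
    blocks for every block size B simultaneously.
    """
    def subtree_level(root, depth):
        nodes = [root]
        for _ in range(depth):
            nodes = [c for n in nodes for c in (2 * n, 2 * n + 1)]
        return nodes

    def layout(root, h):
        if h == 1:
            return [root]
        top_h = h // 2            # height of the top subtree
        bot_h = h - top_h         # height of each bottom subtree
        order = layout(root, top_h)
        # The bottom subtrees hang off the nodes just below the top subtree.
        for node in subtree_level(root, top_h):
            order += layout(node, bot_h)
        return order

    return layout(1, height)

print(veb_layout(4))
# [1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]
```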
Dynamic Cache-Oblivious B-trees [Bender, Demaine, Farach-Colton 00]

We won't describe how to dynamize... After all, the cache-oblivious dynamic B-tree isn't write-optimized. We believe that write-optimized data structures win out over B-trees (even cache-oblivious ones) in the majority of cases.
The DAM model is a simplification

• Disks are organized into tracks of different sizes. Tracks get prefetched into the disk cache, which holds ~100 tracks. The model assumes fixed-size blocks are fetched.
• 2kB or 4kB is too small for the model. B-tree nodes in Berkeley DB & InnoDB have this size. Issue: sequential block accesses run 10x faster than random block accesses, which doesn't fit the model.
• There is no single best block size. The best node size for a B-tree depends on the operation (insert/delete/point query).
Time for 1000 Random B-tree Searches [Bender, Farach-Colton, Kuszmaul '06]

                    Small disk   Big disk
B-tree, B = 4K      17.3ms       22.4ms
B-tree, B = 16K     13.9ms       22.1ms
B-tree, B = 32K     11.9ms       17.4ms
B-tree, B = 64K     12.9ms       17.6ms
B-tree, B = 128K    13.2ms       16.5ms
B-tree, B = 256K    18.5ms       14.4ms
B-tree, B = 512K    (n/a)        16.7ms
CO B-tree           12.3ms       13.8ms

There's no best block size. (The optimal block size for inserts is very different.)
Cache-Oblivious Analysis [Frigo, Leiserson, Prokop, Ramachandran '99]

• Cache-oblivious algorithms work for all B and M...
• ... and all levels of a multi-level hierarchy.

It's better to optimize approximately for all B, M than to pick the best B and M.
Cache-oblivious write-optimized data structures

You can even make write-optimized data structures cache-oblivious. [Bender, Farach-Colton, Fineman, Fogel, Kuszmaul, Nelson, SPAA 07] [Brodal, Demaine, Fineman, Iacono, Langerman, Munro, SODA 10]
Recall optimal search-insert tradeoff [Brodal, Fagerberg 03]

                      insert                          point query
optimal tradeoff:     O(log_{1+B^ɛ} N / B^{1-ɛ})      O(log_{1+B^ɛ} N)
B-tree (ɛ=1):         O(log_B N)                      O(log_B N)
ɛ=1/2:                O(log_B N / √B)                 O(log_B N)
ɛ=0:                  O(log N / B)                    O(log N)

We give a cache-oblivious solution for ɛ=0.
Simplified CO write-optimized structure (COLA)

O((log N)/B) insert cost & O(log² N) search cost:
• Sorted arrays of exponentially increasing size: 2⁰, 2¹, 2², 2³, ...
• Arrays are completely full or completely empty (which ones are full depends on the bit representation of the number of elements).
• Insert into the smallest array. Merge arrays to make room.
Simplified COLA analysis

Insert cost:
• Cost to flush a level of size X = O(X/B).
• Cost per element to flush a level = O(1/B).
• Max # of times each element is flushed = log N.
• Insert cost = O((log N)/B) amortized memory transfers.

Search cost:
• Binary search at each level:
  log(N/B) + (log(N/B) - 1) + (log(N/B) - 2) + ... + 2 + 1 = O(log²(N/B)).
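Here is a toy Python version of the simplified COLA (our own sketch; `sorted()` stands in for the linear-time merge of two sorted arrays):

```python
import bisect

class SimpleCOLA:
    """Simplified cache-oblivious lookahead array (COLA).

    Level k is either empty or a sorted array of exactly 2**k elements,
    mirroring the bit representation of the total element count.
    """
    def __init__(self):
        self.levels = []   # levels[k] is [] or a sorted list of 2**k items

    def insert(self, x):
        carry = [x]
        k = 0
        # Like binary addition: merge the carry into successive full levels.
        while True:
            if k == len(self.levels):
                self.levels.append([])
            if not self.levels[k]:
                self.levels[k] = carry
                return
            carry = sorted(self.levels[k] + carry)  # stands in for an O(n) merge
            self.levels[k] = []
            k += 1

    def search(self, x):
        # Binary search every nonempty level: O(log^2 N) comparisons.
        for level in self.levels:
            i = bisect.bisect_left(level, x)
            if i < len(level) and level[i] == x:
                return True
        return False

cola = SimpleCOLA()
for v in [9, 3, 7, 1, 8, 2]:
    cola.insert(v)
print(cola.search(7), cola.search(4))   # True False
```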
Cache-oblivious write-optimized structure (COLA)

The full COLA achieves O((log N)/B) insert cost & O(log N) search cost:
• Some redundancy of elements between levels.
• Arrays can be partially full.
• Horizontal and vertical pointers to redundant elements (fractional cascading).
Cache-Oblivious Algorithms [Frigo, Leiserson, Prokop, Ramachandran '99]

Recall: parameters B, M are unknown to the algorithm or coder, and an optimal CO algorithm is universal for all B, M, N. Cache-oblivious algorithms are universal algorithms: they are platform independent. They also adapt to changing RAM.

Michael presented this claim at a Dagstuhl workshop on database workload management ("I don't know what the heck I'm talking about," he admits; at that point it was a hunch, not a theorem). We had an activity of presenting an abstract for a fictional paper that we wanted to write, and the claim that cache-oblivious algorithms adapt to changing RAM struck a nerve: "We have this concern at --- (social media company)." "We have this concern at ---- (database company)."
So I proved some theorems

Theorem: Some (but not all) cache-oblivious algorithms adapt to changing sizes of RAM.
Some CO algorithms adapt when RAM changes

Some cache-oblivious algorithms run optimally even when the RAM size M(t) changes arbitrarily over time:
• Sorting.
• Problems with a special recursive structure (matrix multiplication, transpose, Gaussian elimination, all-pairs shortest paths).
Summary

In the cache-oblivious model, B and M are unknown to the coder. (Of course, we still use B and M in proofs.) It's remarkable how many common I/O-efficient data structures have cache-oblivious alternatives. Sometimes it's better to optimize approximately for all B and M instead of picking the best B and M.
Data Structures and Algorithms for Big Data
Module 5: Log-Structured Merge Trees
Log-Structured Merge Trees [O'Neil, Cheng, Gawlick, O'Neil 96]

Log-structured merge trees are write-optimized data structures developed in the 90s. Over the past 5 years, LSM trees have become popular (for good reason). Accumulo, Bigtable, bLSM, Cassandra, HBase, Hypertable, and LevelDB are LSM trees (or borrow ideas from them). http://nosql-database.org lists 122 NoSQL databases; many of them are LSM trees.
Recall Optimal Search-Insert Tradeoff [Brodal, Fagerberg 03]

Optimal tradeoff (as a function of ɛ): insert O(log_{1+B^ɛ} N / B^{1-ɛ}), point query O(log_{1+B^ɛ} N).

LSM trees don't lie on the optimal search-insert tradeoff curve, but they're not far off. We'll show how to move them back onto the optimal curve.
Log-Structured Merge Tree [O'Neil, Cheng, Gawlick, O'Neil 96]

An LSM tree is a cascade of B-trees T0, T1, T2, T3, T4, ..., where each tree Tj has a target size |Tj|. The target sizes are exponentially increasing; typically, |Tj+1| = 10 |Tj|.
LSM Tree Operations

Point queries: search T0, then T1, then T2, and so on.
Range queries: search all the trees and merge the results.

Insertions:
• Always insert the element into the smallest B-tree T0.
• When a B-tree Tj fills up, flush it into Tj+1.
Deletes are like inserts:
• Instead of deleting an element directly, insert a tombstone message.
• A tombstone knocks out a "real" element when it lands in the same tree.
Static-to-Dynamic Transformation

An LSM tree is an example of a "static-to-dynamic" transformation [Bentley, Saxe '80]:
• An LSM tree can be built out of static B-trees.
• When T3 flushes into T4, T4 is rebuilt from scratch.
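The following toy LSM tree (our own sketch) shows the cascade, the flush-and-rebuild, and tombstone deletes; sorted Python lists stand in for on-disk B-trees, and the 10x growth factor follows the slides:

```python
import bisect

TOMBSTONE = object()
GROWTH = 10      # target sizes: |T_{j+1}| = 10 |T_j|
T0_SIZE = 4      # tiny first level so the demo actually flushes

class ToyLSM:
    def __init__(self):
        self.levels = [[]]            # each level: sorted list of (key, value)

    def insert(self, key, value):
        self._put(key, value)

    def delete(self, key):
        self._put(key, TOMBSTONE)     # insert a tombstone, don't search

    def search(self, key):
        # Check T0 first: the newest message for a key wins.
        for lvl in self.levels:
            i = bisect.bisect_left(lvl, (key,))
            if i < len(lvl) and lvl[i][0] == key:
                return None if lvl[i][1] is TOMBSTONE else lvl[i][1]
        return None

    def _put(self, key, value):
        lvl = self.levels[0]
        i = bisect.bisect_left(lvl, (key,))
        if i < len(lvl) and lvl[i][0] == key:
            lvl[i] = (key, value)
        else:
            lvl.insert(i, (key, value))
        self._flush_full_levels()

    def _flush_full_levels(self):
        cap, j = T0_SIZE, 0
        while j < len(self.levels) and len(self.levels[j]) > cap:
            if j + 1 == len(self.levels):
                self.levels.append([])
            # Flush T_j into T_{j+1}: rebuild T_{j+1} from scratch by merging.
            merged = dict(self.levels[j + 1])
            merged.update(dict(self.levels[j]))   # newer entries win
            self.levels[j + 1] = sorted(merged.items())
            self.levels[j] = []
            cap *= GROWTH
            j += 1

db = ToyLSM()
for k in range(20):
    db.insert(k, f"v{k}")
db.delete(3)
print(db.search(7), db.search(3))   # v7 None
```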
This Module

Let's analyze LSM trees.

Recall: Searching in an Array Versus B-tree

Recall the cost of searching in an array versus a B-tree: a binary search costs O(log_2(N/B)) ≈ O(log_2 N), whereas a B-tree search costs O(log_B N) = O(log_2 N / log_2 B).
Analysis of point queries

Searching the cascade of trees T0, T1, T2, ..., T_{log N} costs:

  log_B N + log_B(N/2) + log_B(N/4) + ... + log_B B
    = (1/log B) · (log N + (log N - 1) + (log N - 2) + ... + 1)
    = O(log N · log_B N)
Insert Analysis

The cost to flush a tree Tj of size X is O(X/B), since flushing and rebuilding a tree is just a linear scan; Tj has size X and Tj+1 has size Θ(X). So the cost per element to flush Tj is O(1/B). The number of times each element is moved is ≤ log N. The insert cost is therefore O((log N)/B) amortized memory transfers.
Samples from LSM Tradeoff Curve

                                insert                          point query
tradeoff (function of ɛ):       O(log_{1+B^ɛ} N / B^{1-ɛ})      O((log_B N)(log_{1+B^ɛ} N))
sizes grow by B (ɛ=1):          O(log_B N)                      O((log_B N)(log_B N))
sizes grow by B^{1/2} (ɛ=1/2):  O(log_B N / √B)                 O((log_B N)(log_B N))
sizes double (ɛ=0):             O(log N / B)                    O((log_B N)(log N))
How to improve LSM-tree point queries?

Looking in all those trees is expensive, but can be improved by caching, Bloom filters, and fractional cascading.

Caching in LSM trees: when the cache is warm, the small trees are cached, so searching them costs no I/O.

Bloom filters in LSM trees: a Bloom filter per tree can avoid point queries for elements that are not in that particular B-tree.
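As a sketch, a point query can consult one Bloom filter per tree before searching it (reusing the BloomFilter class sketched in Module 2; the filters must be rebuilt whenever a flush rebuilds a level):

```python
import bisect

def search_with_filters(levels, filters, key):
    # levels[j]: sorted (key, value) list; filters[j]: BloomFilter over its keys.
    for lvl, bf in zip(levels, filters):
        if not bf.maybe_contains(key):
            continue                      # filter proves key absent: skip this tree
        i = bisect.bisect_left(lvl, (key,))
        if i < len(lvl) and lvl[i][0] == key:
            return lvl[i][1]
    return None
```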
Fractional cascading reduces the cost in each tree

Instead of avoiding searches in trees, we can use a technique called fractional cascading to reduce the cost of searching each B-tree to O(1).

Idea: we're looking for a key, and we already know where it should have been in T3; try to use that information to search T4.
Searching one tree helps in the next

Looking up c in Ti, we know it's between b and e. (Showing only the bottom level of each B-tree: Ti contains b, e, v, w; Ti+1 contains a, c, d, f, h, i, j, k, m, n, p, q, t, u, y, z.)
Forwarding pointers

If we add forwarding pointers from the elements of Ti to their positions in Ti+1, we can jump straight to the right node in the second tree to find c.
Remove redundant forwarding pointers

We need only one forwarding pointer for each block in the next tree. Remove the redundant ones.
Ghost pointers

We need a forwarding pointer for every block in the next tree, even if there are no corresponding pointers in this tree. Add ghost entries (in the example, h and m appear in Ti only as ghosts pointing into Ti+1).
LSM tree + forward + ghost = fast queries

With forward pointers and ghosts, LSM trees require only one I/O per tree, and point queries cost only O(log_R N), where R is the growth factor. [Bender, Farach-Colton, Fineman, Fogel, Kuszmaul, Nelson 07]
LSM tree + forward + ghost = COLA

This data structure no longer uses the internal nodes of the B-trees, and each of the trees can be implemented by an array. [Bender, Farach-Colton, Fineman, Fogel, Kuszmaul, Nelson 07]
Data Structures and Algorithms for Big Data
Module 6: What to Index
Story of this module

This module explores indexing. Traditionally (with B-trees), indexing improves queries but cripples insertions. But now we know that maintaining indexes is cheap. So what should we index?
An Indexing Testimonial

[Figure: I/O load over time.] This is a graph from a real user who added some selective indexes and reduced the I/O load on their server. (They couldn't maintain the indexes with B-trees.)
What is an Index?

To understand what to index, we need to get on the same page for what an index is.

Row, Index, and Table

Example table, created by: create table foo (a int, b int, c int, primary key(a));

a    b    c
100  5    45
101  92   2
156  56   45
165  6    2
198  202  56
206  23   252
256  56   2
412  43   45

Row: a key,value pair (key = a; value = b,c).
Index: an ordering of rows by key (a dictionary), used to make queries fast.
Table: a set of indexes.
An index is a dictionary

Dictionary API: maintain a set S subject to
• insert(x): S ← S ∪ {x}
• delete(x): S ← S - {x}
• search(x): is x ∊ S?
• successor(x): return the min y > x s.t. y ∊ S
• predecessor(x): return the max y < x s.t. y ∊ S

We assume that these operations perform as well as a B-tree. For example, the successor operation usually doesn't require an I/O.
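Here is a minimal in-memory model of that dictionary API (our own bisect-based sketch, standing in for a B-tree or write-optimized index):

```python
import bisect

class Dictionary:
    """Sorted-array dictionary with successor/predecessor (in-memory sketch)."""
    def __init__(self):
        self.keys = []

    def insert(self, x):
        i = bisect.bisect_left(self.keys, x)
        if i == len(self.keys) or self.keys[i] != x:
            self.keys.insert(i, x)

    def delete(self, x):
        i = bisect.bisect_left(self.keys, x)
        if i < len(self.keys) and self.keys[i] == x:
            self.keys.pop(i)

    def search(self, x):
        i = bisect.bisect_left(self.keys, x)
        return i < len(self.keys) and self.keys[i] == x

    def successor(self, x):
        i = bisect.bisect_right(self.keys, x)   # min y > x
        return self.keys[i] if i < len(self.keys) else None

    def predecessor(self, x):
        i = bisect.bisect_left(self.keys, x)    # max y < x
        return self.keys[i - 1] if i > 0 else None

d = Dictionary()
for k in [5, 6, 23, 43, 56, 92, 202]:
    d.insert(k)
print(d.search(43), d.successor(50), d.predecessor(50))  # True 56 43
```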
A table is a set of indexes

A table is a set of indexes with operations:
• Add index: add key(f1,f2,...);
• Drop index: drop key(f1,f2,...);
• Add column: adds a field to the primary key value.
• Remove column: removes a field and drops all indexes where the field is part of the key.
• Change a field type.
• ...

These are subject to index correctness constraints. We want table operations to be fast too.

Next: how to use indexes to improve queries.
Indexes provide query performance

1. Indexes can reduce the amount of retrieved data: less bandwidth, less processing, ...
2. Indexes can improve locality: not all data access cost is the same, and sequential access is MUCH faster than random access.
3. Indexes can presort data: GROUP BY and ORDER BY queries do post-retrieval work, and indexing can help get rid of this work.
An index can select needed rows

count(*) where a < 120;

The primary index on a immediately selects the two qualifying rows (a = 100, 101) and returns 2.
No good index means slow table scans

count(*) where b > 50 and b < 100;

With no index on b, the query must scan all eight rows of the table to find the three that qualify.
You can add an index

alter table foo add key(b);

This builds a secondary index on b: pairs (b, a) sorted by b.

b    a
5    100
6    165
23   206
43   412
56   156
56   256
92   101
202  198
A selective index speeds up queries

count(*) where b > 50 and b < 100;

The index on b finds the qualifying entries (56, 56, 92) contiguously and returns 3.
Selective indexes can still be slow

sum(c) where b > 50;

Selecting on b using the index is fast: the qualifying entries (b = 56, 56, 92, 202) are contiguous and yield primary keys a = 156, 256, 101, 198. But fetching the c values for those rows requires a point query into the primary index for each one (c = 45, 2, 2, 56, summing to 105). Those fetches have poor data locality, so the query is slow.
Covering indexes speed up queries

alter table foo add key(b,c);
sum(c) where b > 50;

A covering index on (b,c) contains every field the query needs, so the qualifying entries (56,2), (56,45), (92,2), (202,56) are read with one contiguous scan, and the sum (105) is computed with no row fetches.
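To make the locality argument concrete, here is a toy simulation (our own; the 4-rows-per-block figure is illustrative) that counts distinct block touches for the two plans on the example table:

```python
ROWS = {100: (5, 45), 101: (92, 2), 156: (56, 45), 165: (6, 2),
        198: (202, 56), 206: (23, 252), 256: (56, 2), 412: (43, 45)}
primary = sorted(ROWS)                                    # rows ordered by a
key_b   = sorted((b, a) for a, (b, c) in ROWS.items())    # index key(b)
key_bc  = sorted((b, c) for a, (b, c) in ROWS.items())    # covering key(b,c)

def blocks(positions, per_block=4):
    return {p // per_block for p in positions}            # distinct blocks hit

# Plan 1: scan key(b) for b > 50, then fetch c from the primary index.
hits = [i for i, (b, a) in enumerate(key_b) if b > 50]
fetches = [primary.index(key_b[i][1]) for i in hits]      # scattered row lookups
plan1 = len(blocks(hits)) + len(blocks(fetches))

# Plan 2: scan the covering index key(b,c); no row fetches at all.
hits2 = [i for i, (b, c) in enumerate(key_bc) if b > 50]
plan2 = len(blocks(hits2))

print(sum(c for b, c in key_bc if b > 50))   # 105
print(plan1, plan2)                          # 3 1: the row fetches scatter I/O
```

With only eight rows the gap is small; on a real table the scattered fetches dominate the query time.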
Indexes can avoid post-selection sorts

select b, sum(c) group by b;

The index on (b,c) returns the entries presorted by b, so the grouped sums (b=5→45, 6→2, 23→252, 43→45, 56→47, 92→2, 202→56) can be computed in one pass with no post-retrieval sort.
Data Structures and Algorithms for Big Data
Module 7: Paging
This Module

The algorithmics of cache management. This will help us understand I/O- and cache-efficient algorithms.
Recall the Disk Access Model [Aggarwal+Vitter '88]

Goal: minimize # of block transfers.
• Data is transferred in blocks between RAM and disk.
• Performance bounds are parameterized by B, M, N.

When a block is cached, the access cost is 0. Otherwise it's 1.
Recall Cache-Oblivious Analysis [Frigo, Leiserson, Prokop, Ramachandran '99]

In the DAM model, performance bounds are parameterized by B, M, N, and the goal is to minimize block transfers. The cache-oblivious restriction is that B and M are unknown to the algorithm or coder, so CO analysis applies to unknown multilevel hierarchies. Moral: it's better to optimize approximately for all B, M rather than to try to pick the best B and M.
Cache-Replacement in Cache-Oblivious Algorithms

Which blocks are currently cached in RAM? The system performs its own caching/paging. If we knew B and M we could explicitly manage I/O (but even then, what should we do?). And systems may use different mechanisms, so what can we actually assume?
This Module: Cache-Management Strategies

With cache-oblivious analysis, we can assume a memory system with optimal replacement. Even though the system manages memory, we can assume all the advantages of explicit memory management.

Why? An LRU-based system with memory M performs cache-management less than 2x worse than the optimal, prescient policy with memory M/2. Achieving optimal cache-management is hard because predicting the future is hard. But LRU with (1+ɛ)M memory is almost as good (or better) than the optimal strategy with M memory. [Sleator, Tarjan 85]
The paging/caching problem

A program is just a sequence of block requests: r1, r2, r3, ...

Cost of request rj:
  cost(rj) = 0 if block rj is already cached;
  cost(rj) = 1 if block rj must be brought into cache.

Algorithmic question: which block should be ejected when block rj is brought into cache?
RAM holds only k = M/B blocks, so some cached block must be ejected on each miss.
Paging Algorithms

• LRU (least recently used): discard the block whose most recent access is earliest.
• FIFO (first in, first out): discard the block brought in longest ago.
• LFU (least frequently used): discard the least popular block.
• Random: discard a random block.
• LFD (longest forward distance) = OPT [Belady 69]: discard the block whose next access is farthest in the future.
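Here is a small simulator for comparing these policies on a request trace (our own sketch; LFD peeks at the future, which only a simulator can do):

```python
def faults(trace, k, policy):
    """Count cache misses for a replacement policy on a block-request trace."""
    cache = set()
    misses = 0
    for t, r in enumerate(trace):
        if r in cache:
            continue
        misses += 1
        if len(cache) >= k:
            cache.remove(policy(cache, trace, t))
        cache.add(r)
    return misses

def lru(cache, trace, t):
    # Evict the cached block whose most recent access is earliest.
    return min(cache, key=lambda b: max(i for i in range(t) if trace[i] == b))

def lfd(cache, trace, t):
    # Belady/OPT: evict the block requested farthest in the future.
    def next_use(b):
        for i in range(t + 1, len(trace)):
            if trace[i] == b:
                return i
        return float("inf")
    return max(cache, key=next_use)

trace = [1, 2, 3, 4] * 5          # cyclic access to 4 blocks
print(faults(trace, 3, lru))      # 20: LRU faults on every request
print(faults(trace, 3, lfd))      # 9: after warm-up, LFD faults every k-th request
```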
Optimal Page Replacement

LFD (longest forward distance) [Belady '69]: discard the block requested farthest in the future.
• Cons: who knows the future?! ("Page 5348 shall be requested tomorrow at 2:00 pm.")
• Pros: LFD can be viewed as a point of comparison with online strategies.
Competitive Analysis

An online algorithm A is k-competitive if for every request sequence R:

  cost_A(R) ≤ k · cost_OPT(R)

Idea of competitive analysis: the optimal (prescient) algorithm is a yardstick we use to compare online algorithms.
LRU is no better than k-competitive

Suppose memory holds k = M/B = 3 blocks and the program cycles through 4 different blocks: 1, 2, 3, 4, 1, 2, 3, 4, ...

There's a block transfer at every step, because LRU always ejects exactly the block that's requested in the next step.
On the same request stream, LFD (longest forward distance) has a block transfer only every k=3 steps.
LRU is k-competitive [Sleator, Tarjan 85]

In fact, LRU is k = M/B-competitive, i.e., LRU has at most k = M/B times more transfers than OPT. A depressing result, because k is huge, so k·OPT is nothing to write home about.
On the other hand, the LRU bad example is fragile

In the cyclic example: if k = M/B = 4 (not 3), then both LRU and OPT do well. If k = M/B = 2 (not 3), then neither LRU nor OPT does well.
LRU is 2-competitive with more memory [Sleator, Tarjan 85]

LRU is at most twice as bad as OPT, when LRU has twice the memory:

  LRU_{|cache|=k}(R) ≤ 2 · OPT_{|cache|=k/2}(R)

LRU has more memory, but OPT=LFD can see the future. In general, LRU is nearly as good as OPT when LRU has a little more memory than OPT.
LRU Performance Proof

Divide the request stream into phases, each with k faults for LRU:

  r1, r2, ..., ri | ri+1, ..., rj | rj+1, ..., rl | rl+1, ...

• OPT[k] must have ≥ 1 fault in each phase (case analysis proof), so LRU is k-competitive.
• OPT[k/2] must have ≥ k/2 faults in each phase. Main idea: each phase must touch k different pages. So LRU is 2-competitive.
Under the hood of cache-oblivious analysis

Moral: with cache-oblivious analysis, we can analyze based on a memory system with optimal, omniscient replacement.
• Technically, an optimal cache-oblivious algorithm is asymptotically optimal versus any algorithm on a memory system that is slightly smaller.
• Empirically, this is just a technicality.
Ramifications for New Cache-Replacement Policies

Moral: there's not much performance on the table for new cache-replacement policies. Bad instances for LRU versus LFD are fragile and depend on a particular cache size.

There are still research questions:
• What if blocks have different sizes? [Irani 02][Young 02]
• What if there's a write-back cost? (Complexity unknown.)
• LRU may be too costly to implement (hence the clock algorithm).
• What if the cache size changes over time? [Bender, Ebrahimi, Fineman, Ghasemiesfeh, Johnson, McCauley 13]
Data Structures and Algorithms for Big Data
Module 8: Sorting Big Data
Story for Module

Another way to create an index is to sort:
• Sorting creates an index all at once.
• Sorting does not incrementally maintain an index.
• Sorting is faster than the best algorithms to incrementally maintain an index.

This module covers I/O-efficient mergesort and parallel sort.
Modeling I/O Using the Disk Access Model [Aggarwal+Vitter '88]

Recall: data is transferred in blocks of size B between RAM (size M) and disk; the number of block transfers dominates the running time, and the goal is to minimize it.
Merge Sort

To sort an array of N objects:
• If N fits in main memory, then just sort the elements.
• Otherwise:
  ‣ divide the array into M/B pieces;
  ‣ sort each piece (recursively); and
  ‣ merge the M/B pieces.
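A compact external-mergesort sketch in Python (our own code; lists stand in for run files, the fan-in constant stands in for M/B, and heapq.merge performs the multi-way merge):

```python
import heapq, itertools

FAN_IN = 8          # stands in for M/B

def sort_external(values, run_size=FAN_IN):
    """Sort an iterator of values too big for 'memory' (run_size items)."""
    runs, it = [], iter(values)
    while True:
        chunk = list(itertools.islice(it, run_size))   # fill memory
        if not chunk:
            break
        runs.append(sorted(chunk))                     # sort one piece
    while len(runs) > 1:                               # merge FAN_IN at a time
        runs = [list(heapq.merge(*runs[i:i + FAN_IN]))
                for i in range(0, len(runs), FAN_IN)]
    return runs[0] if runs else []

print(sort_external(iter([5, 3, 9, 1, 7, 2, 8, 6, 4, 0]), run_size=3))
```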
Why Divide into M/B pieces?

• We want as much fan-in as possible.
• The merge needs to cache one block for each sorted subinput, plus one block for the output.
• There are M/B blocks in memory, so the fan-in can be at most O(M/B).
[Figure: the merge tree. Sorted runs at the leaves are merged M/B at a time, level by level, until one sorted run remains.]
Intuition for Merge Sort Analysis

Question: how many I/Os to sort N elements?
• The first run formation takes N/B I/Os.
• Each level of the merge tree takes N/B I/Os.
• How deep is the merge tree? O(log_{M/B}(N/B)) levels.

Total: O((N/B) log_{M/B}(N/B)), the cost to scan the data times the number of scans. This bound is the best possible.
Merge Sort Analysis

T(N), the number of I/Os to sort N items, satisfies this recurrence (the cost to sort each of the M/B pieces recursively, plus the N/B cost to merge):

  T(N) = (M/B) · T(N / (M/B)) + N/B
  T(N) = N/B when N < M   (the cost to sort something that fits in memory)

Solution: T(N) = O((N/B) log_{M/B}(N/B)), the cost to scan the data times the number of scans.
Sorting is Faster Than Index Maintenance

• I/Os to sort N objects: O((N/B) log_{M/B}(N/B))
• I/Os to insert N objects into a COLA: O((N/B) lg(N/M))
• I/Os to insert N objects into a B-tree: O(N log_B(N/M))

Sorting can usually be done in 2 passes, since M/B is large.
Parallel Sort

Big data might not fit on one machine, so use many machines connected by a network and parallelize. Parallelizing merge sort is tricky, however.
Distribution Sort

To sort an array of N objects:
• If N fits in main memory, then just sort the elements.
• Otherwise:
  ‣ pick M/B pivot keys;
  ‣ partition the data according to the pivot keys; and
  ‣ sort each partition (recursively).
Parallelizing Partitioning

• Broadcast the pivot keys to every processor.
• Compute the local rank of each pivot on each processor (sort the local data to make this fast).
• Sum the local ranks to get global ranks.
• Send each datum to the right processor.
• The final step is a merge, since the local data was sorted.
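A sketch of the rank-computation step (our own code; lists simulate the processors, and a plain sum stands in for the all-reduce):

```python
import bisect

# Each "processor" holds a sorted shard of the data.
shards = [sorted([5, 17, 42, 99]), sorted([3, 8, 50, 77]), sorted([1, 60, 61, 80])]
pivots = [30, 70]    # broadcast to every processor

# Local rank of a pivot = how many local elements are < pivot (binary search).
local_ranks = [[bisect.bisect_left(shard, p) for p in pivots] for shard in shards]

# Global rank = sum of local ranks (an all-reduce in a real implementation).
global_ranks = [sum(r[i] for r in local_ranks) for i in range(len(pivots))]
print(global_ranks)   # [5, 9]: 5 elements < 30, 9 elements < 70
```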
Engineering Parallel Sort

• Scheduling: overlap I/O with computation and network communication; schedule network communication carefully to avoid network contention.
• Hardware: use a serious network; get rid of slow disks (some disks are 10x slower than average and are probably failing).
• In memory: compute local pivot ranks efficiently; employ a heap data structure to perform the merge efficiently.
Sorting Contest

Bradley holds the world record for sorting a terabyte (sortbenchmark.org):
• 400 dual-core machines with 2400 disks, in 2007.
• Ran in 3.28 minutes, using a distribution sort.
• Terabyte sort is now deprecated, since it's the same as minute sort (how much can you sort in a minute). Today to compete, you must sort 100TB in less than two hours.
Summary

Fast sorting is an important tool for big data. Sorting provides many opportunities for cleverness. No one can take my terabyte sorting trophy!
Closing Words

We want to feel your pain. We are interested in hearing about other scaling problems. Come talk to us.

bender@cs.stonybrook.edu, michael@tokutek.com
bradley@mit.edu, bradley@tokutek.com
Big Data Epigrams

• The problem with big data is microdata.
• Sometimes the right read-optimization is a write-optimization.
• It's often better to optimize approximately for all B, M than to pick the best B and M.
• As data becomes bigger, the asymptotics become more important.