BEYOND RELATIONAL: «NEURAL» DBMS? Roberto Reale @ Italian Association for Machine Learning 10 Apr 2019
Codd, E. F. (1970). A Relational Model of Data for Large Shared Data Banks. Commun. ACM, 13(6), 377-387.
Kraska, T., Beutel, A., Chi, E. H., Dean, J. and Polyzotis, N. (2017). The Case for Learned Index Structures. arXiv preprint arXiv:1712.01208.
RELATIONAL MODEL Can be expressed in first-order predicate logic Data is represented as tuples, grouped into relations Abstraction from physical storage model
INDEX STRUCTURES Needed for efficient data access B-Trees, Hash maps, Bloom filters, ...
They need tuning Being general-purpose data structures, they do not take advantage of patterns in the data
ENTER MACHINE LEARNING Replacing core components of a data management system with learned models
Traditional indexes are already models For efficiency, it is common not to index every single key of the sorted records, but only the key of every n-th record
Using other types of models as indexes can provide benefits
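Such a sparse index is already a coarse model of key positions. As a minimal sketch (function names are hypothetical, not from the paper): keep every n-th key of a sorted array, binary-search that smaller structure, then scan at most n records.

```python
import bisect

def build_sparse_index(keys, n=4):
    # Index only every n-th key of the already-sorted array.
    return keys[::n]

def lookup(keys, sparse, key, n=4):
    # Find the last indexed key <= key, then scan at most n records.
    j = bisect.bisect_right(sparse, key) - 1
    if j < 0:
        return None  # key is smaller than every indexed key
    start = j * n
    for i in range(start, min(start + n, len(keys))):
        if keys[i] == key:
            return i
    return None
```

The trade-off is the classic one: a larger n shrinks the index but lengthens the final scan.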
INDEXES ARE CDF MODELS An index is a model that takes a key as an input and predicts the position of the record
A model that predicts the position of a key inside a sorted array approximates the cumulative distribution function: the predicted position is p ≈ F(Key) × N, where F(Key) is the estimated CDF of the data, i.e. the likelihood of observing a key smaller than or equal to the lookup key
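As a minimal illustration (a sketch, not the paper's implementation), a least-squares linear model fitted to (key, position) pairs approximates N × F(key) over a sorted array:

```python
def fit_linear_cdf(keys):
    # Least-squares fit of position ~ a*key + b over a sorted array.
    # The fitted line approximates N * F(key), the scaled empirical CDF.
    n = len(keys)
    xs, ys = keys, range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

def predict_position(model, key, n):
    # Clamp the prediction into the valid position range [0, n-1].
    a, b = model
    return min(max(int(round(a * key + b)), 0), n - 1)
```

In practice the model's min/max prediction errors bound a local search around the predicted position, which recovers the exact record.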
ISSUES... Decision trees, in general, are really good at overfitting the data with only a few operations
A single neural net requires significantly more space and CPU time for the “last mile” B-Trees are extremely cache- and operation-efficient
THE LEARNING INDEX FRAMEWORK (LIF) Given a trained TensorFlow model, LIF automatically extracts all weights from the model and generates efficient index structures in C++ Designed for small models, with no unnecessary overhead
THE RECURSIVE MODEL INDEX Challenge: accuracy for the last-mile search We build a hierarchy of models Each model takes the key as input and, based on it, picks the model at the next stage
THE RECURSIVE MODEL INDEX, 2 We iteratively train each stage with loss Lℓ We separate model size and complexity from execution cost We effectively divide the key space into smaller sub-ranges, making it easier to achieve the required “last mile” accuracy
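A hedged two-stage sketch of the idea (simple linear models at both stages; function names and the fixed fanout are illustrative assumptions, not the paper's code). The root model routes each key to one of `fanout` leaf models, and each leaf is trained only on the keys routed to it:

```python
def fit_linear(xs, ys):
    # Ordinary least squares for y ~ a*x + b.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs) or 1.0  # guard degenerate buckets
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx

def build_rmi(keys, fanout=4):
    # Stage 1 routes a key to one of `fanout` stage-2 models;
    # each stage-2 model is trained only on its own sub-range.
    n = len(keys)
    root = fit_linear(keys, range(n))
    buckets = [[] for _ in range(fanout)]
    a, b = root
    for pos, key in enumerate(keys):
        m = min(max(int((a * key + b) * fanout / n), 0), fanout - 1)
        buckets[m].append((key, pos))
    leaves = [fit_linear([k for k, _ in bkt], [p for _, p in bkt]) if bkt else root
              for bkt in buckets]
    return root, leaves, fanout, n

def rmi_predict(index, key):
    root, leaves, fanout, n = index
    a, b = root
    m = min(max(int((a * key + b) * fanout / n), 0), fanout - 1)
    a2, b2 = leaves[m]
    return min(max(int(round(a2 * key + b2)), 0), n - 1)
```

The point of the hierarchy is visible even in this toy: each leaf only has to be accurate on its own narrow sub-range, not on the whole key space.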
HYBRID MODELS Top layer: a rectified linear unit (ReLU) neural net At the bottom: thousands of simple, inexpensive linear regression models Traditional B-Trees at the bottom if the data is particularly hard to learn
DOES THIS STUFF WORK? Simple NNs can be efficiently trained using stochastic gradient descent
A closed-form solution exists for linear multivariate models The results are promising, but “learned indexes” might not be the best choice in every use case A new way to think about indexing
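For illustration, the closed form mentioned above is the normal-equations solution w = (XᵀX)⁻¹Xᵀy. A sketch using NumPy (assumed available; in practice `numpy.linalg.lstsq` would be the numerically safer choice):

```python
import numpy as np

def fit_closed_form(X, y):
    # Ordinary least squares via the normal equations:
    # solve (X^T X) w = X^T y instead of forming an explicit inverse.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append intercept column
    w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
    return w  # last entry is the intercept
```

No gradient descent is needed here, which is why the bottom-layer linear models of a hybrid index are so cheap to (re)train.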
ROBERTO@REALE.ME