Ubiquitous Big Data for knowledge extraction from secured data silos Thierry Nagellen – Orange Labs thierry.nagellen@orange.com
Why Ubiquitous Big Data?
Orange
Thierry Nagellen - 2018
AI: beyond software some hardware constraints
“Our work in this area was initially motivated by our aim to reduce service response times and resource usage in our cloud environment which operates globally and at scale... Managing data access locality in geo-distributed systems is important because doing so can significantly improve data access latencies, given that intra-datacenter communication latencies are two orders of magnitude smaller than cross-datacenter communication latencies: e.g. 1ms vs 100ms.” Facebook - Akkio
Ref: https://www.usenix.org/conference/osdi18/presentation/annamalai
Reality of constraints: Energy, Latency...
Ref: http://web.eecs.umich.edu/~jahausw/publications/kang2017neurosurgeon.pdf
Lambda architecture: historical data + streaming data (Nathan Marz – 2011)
Objective: Maintaining code that needs to produce the same result in two complex distributed systems
Ref: https://dzone.com/articles/lambda-architecture-with-apache-spark
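As an illustration of the Lambda architecture slide, here is a minimal sketch in which a batch layer recomputes a view from historical data, a speed layer updates a view from streaming events, and a query merges both. The page-count use case and the names `batch_view`, `SpeedLayer` and `query` are assumptions made for this example; they do not come from the talk or from any specific framework.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute the counts from the full historical log."""
    return Counter(event["page"] for event in master_dataset)


class SpeedLayer:
    """Speed layer: update the counts incrementally, one streaming event at a time."""

    def __init__(self):
        self.view = Counter()

    def on_event(self, event):
        self.view[event["page"]] += 1


def query(page, batch, speed):
    """Serving layer: merge the batch view with the real-time view."""
    return batch[page] + speed.view[page]


# Hypothetical data: events already archived vs. events still arriving.
historical = [{"page": "home"}, {"page": "home"}, {"page": "cart"}]
recent = [{"page": "home"}, {"page": "cart"}, {"page": "cart"}]

batch = batch_view(historical)
speed = SpeedLayer()
for event in recent:
    speed.on_event(event)

home_total = query("home", batch, speed)
cart_total = query("cart", batch, speed)
```

The counting logic appears twice, once in `batch_view` and once in `SpeedLayer.on_event`. Keeping those two implementations consistent is the maintenance burden the slide refers to.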
Some emerging trends
SOLID – Tim Berners-Lee: re-decentralize the web
Smart IoT
Embedded Narrow AI
Federated Learning (Google)
1. A subset of existing clients is selected, each of which downloads the current model. (M -> A)
2. Each client in the subset computes an updated model based on its local data. (A -> B)
3. The model updates are sent from the selected clients to the server. (B -> C -> M)
4. The server aggregates these models (typically by averaging) to construct an improved global model. (M)
Ref: https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
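The four federated learning steps can be sketched as follows. This is a toy illustration, not Google's implementation: it assumes a linear model trained with plain SGD, synthetic noiseless data, and an average weighted by local dataset size (one common choice). All function names (`local_update`, `federated_round`, `make_client`) are hypothetical.

```python
import random


def local_update(weights, data, lr=0.1, epochs=5):
    """Step 2: a client refines the downloaded model on its own local data."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def federated_round(global_w, clients, fraction, rng):
    # Step 1: select a subset of clients; each starts from the current model.
    k = max(1, int(fraction * len(clients)))
    selected = rng.sample(clients, k)
    # Steps 2-3: each client trains locally and sends back only its model.
    updates = [(local_update(global_w, data), len(data)) for data in selected]
    # Step 4: the server averages the models, weighted by local dataset size.
    total = sum(n for _, n in updates)
    return [sum(w[i] * n for w, n in updates) / total
            for i in range(len(global_w))]


def make_client(rng, n):
    """Synthetic local dataset following y = 2*x1 - 1*x2 (assumption for the demo)."""
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, 2.0 * x[0] - 1.0 * x[1]))
    return data


rng = random.Random(0)
clients = [make_client(rng, 20) for _ in range(10)]
global_w = [0.0, 0.0]
for _ in range(30):
    global_w = federated_round(global_w, clients, fraction=0.5, rng=rng)
```

The raw `(x, y)` pairs never leave `local_update`; only model weights travel between clients and server, which is what makes the approach relevant for secured data silos.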
Parameter Server: split the data – centralize the model
Spark Summit 2016 talk by Erik Ordentlich (Yahoo) and Badri Bhaskar (Yahoo)
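The "split the data, centralize the model" idea can be sketched as below. This is a single-process, synchronous toy version under stated assumptions: a linear model, a deterministic synthetic dataset, and three equally sized shards. The `ParameterServer` class and `worker_gradient` function are hypothetical names for this example and do not reflect the Yahoo or TensorFlow APIs.

```python
class ParameterServer:
    """Holds the single, central copy of the model weights."""

    def __init__(self, dim, lr):
        self.weights = [0.0] * dim
        self.lr = lr

    def pull(self):
        # Workers fetch a copy of the current weights.
        return list(self.weights)

    def push(self, gradients):
        # Average the gradients sent by all workers and apply one update step.
        n = len(gradients)
        for i in range(len(self.weights)):
            avg = sum(g[i] for g in gradients) / n
            self.weights[i] -= self.lr * avg


def worker_gradient(weights, shard):
    """A worker computes the mean squared-error gradient on its own data shard."""
    grad = [0.0] * len(weights)
    for x, y in shard:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, xi in enumerate(x):
            grad[i] += err * xi / len(shard)
    return grad


# Synthetic dataset following y = 3*x1 + 0.5*x2 (assumption for the demo).
data = [([a / 4.0, b / 4.0], 3.0 * a / 4.0 + 0.5 * b / 4.0)
        for a in range(-4, 5) for b in range(-4, 5)]
shards = [data[i::3] for i in range(3)]  # split the data across 3 workers

server = ParameterServer(dim=2, lr=0.5)
for _ in range(200):
    w = server.pull()
    server.push([worker_gradient(w, shard) for shard in shards])
```

In a real deployment the workers would run on separate machines and could push asynchronously; here they run one after another only to keep the sketch self-contained.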
Parameter server with TensorFlow (Deep learning example)
Ref: https://medium.com/polyaxon/distributed-deep-learning-with-polyaxon-6d9f1288e4b8
Ref: https://docs.riseml.com/guide/advanced/distributed_tensorflow/
IBM: Centralized vs decentralized (Parallel SGD vs Asynchronous SGD)
Ref: https://www.ibm.com/blogs/research/2017/12/deep-learning-training-10x-improvement/
What about Data & Model parallelization?
Ref: https://www.youtube.com/watch?v=vwXolaBQfaU
But how to discover data characteristics?
Our main challenges:
  Find relevant data to solve my problem (from the user's point of view)
  No time to describe the data
  Solve the vocabulary issue:
    (a) Per vertical silo
    (b) Multiple languages
  Reduce data preparation time so that algorithms can be applied automatically
Some answers:
  Semantic search engine
  Natural language interface for data description
  Semantization chain
  New approach for structured data: Probabilistic Relational Model
“Probabilistic relational models (PRMs) are a rich representation language for structured statistical models. They combine a frame-based logical representation with probabilistic semantics based on directed graphical models (Bayesian networks).”
Ref: https://ai.stanford.edu/~koller/Papers/Getoor+al:SRL07.pdf
Orange Dataforum: semantization chain
Languages: English, French, Dutch, Spanish, Slovak, Romanian, German
Probabilistic Relational Model to combine with semantics
Ref: https://www.slideshare.net/AnthonyCoutant/
Thanks!
Thierry Nagellen – Orange Labs thierry.nagellen@orange.com