Statistical Artificial Intelligence: Learning from Complex Data
Globally Sparse Discriminant Analysis of High-Dimensional Data
Charles Bouveyron, Professor of Statistics, Inria Chair in "Data Science"
Laboratoire LJAD, UMR CNRS 7351, and Epione team, Inria Sophia-Antipolis, Université Côte d'Azur
charles.bouveyron@univ-cotedazur.fr, @cbouveyron
Introduction: the AI revolution hasn't happened yet!
Artificial intelligence is a strategic field of research:
with direct applications in most scientific fields (Medicine, Biology, Astrophysics, Humanities),
and probably with the largest impact in innovation and transfer (health, transport, defense).
The recent and impressive NN results should not hide the remaining issues:
deep learning achieves impressive results, but in a few specific cases and with high-level supervision,
the use of DL techniques in various fields is promising but not well understood.
"Artificial Intelligence: the revolution hasn't happened yet", M. Jordan (UC Berkeley)
Introduction: open problems for AI
Some open problems are critical:
reliability of models and algorithms,
handling data heterogeneity (categorical, functional, networks, images, texts, ...),
unsupervised learning (clustering, dimension reduction),
learning from small data (n small / p large).
The combination of statistical theory with deep learning techniques is certainly the future of AI!
Introduction: a new AI team @ UCA/Inria
To address those problems, we are building a new research team at UCA & Inria, organized around the following scientific objectives:
unsupervised learning,
theory of deep learning,
models & algorithms,
adaptive & robust learning for AI,
heterogeneous & complex data,
applications & innovation.
Figure: Scientific objectives of the (upcoming) Maasai team.
Introduction: a few recent projects
FunLBM: co-clustering of functional data,
HDMI: single-image denoising,
WAMiC: deep clustering with mixtures,
gsHDDA: classification of high-dimensional data.
Figure: illustrations of the FunLBM, HDMI, WAMiC and gsHDDA projects.
A motivating example: lung cancer diagnosis
The considered data set:
data from n = 87 patients with lung cancer,
p = 636 radiomic features extracted from CT images,
the objective is to build a classifier able to discriminate 3 subtypes of lesions.
Beyond the diagnosis task, the goal is to isolate some radiomic markers of the lesion subtypes.
Basics of PCA & sparsity: PCA
Let us consider an n × p data matrix X = (x_1, ..., x_n)^T that one wants to project onto a "good" d-dimensional subspace.
Principal component analysis (PCA):
the optimal subspace is spanned by the top-d eigenvectors of X^T X,
PCA can also be viewed as a factorization into a low-rank decomposition.
Figure: PCA viewed as a low-rank decomposition.
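To make the low-rank view concrete, here is a minimal NumPy sketch (not from the original slides; the data matrix and the dimension d are arbitrary placeholders) of PCA computed through the SVD of the centered data:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, d = 100, 20, 3
    X = rng.standard_normal((n, p))       # placeholder data matrix

    Xc = X - X.mean(axis=0)               # center the variables
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:d].T                          # p x d: top-d eigenvectors of Xc^T Xc
    Y = Xc @ W                            # n x d principal component scores
    X_lowrank = Y @ W.T                   # best rank-d approximation of Xc

The product of the scores and the loadings gives the best rank-d approximation of the centered data, which is the decomposition the figure refers to.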
Basics of PCA & sparsity: SPCA
However, regular PCA fails when p is large (Johnstone & Lu '09):
sparse versions of PCA (SPCA, Zou et al. '06) were consequently developed,
sparse PCA regularizes the problem but does not significantly improve the interpretation of the results.
Figure: Sparse PCA viewed as a low-rank decomposition.
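For comparison, a minimal sketch using scikit-learn's SparsePCA estimator (again not from the slides; the penalty alpha is an arbitrary illustrative value). It also hints at the interpretation issue mentioned above: the loadings are sparse entrywise, so most variables typically remain involved in at least one component.

    import numpy as np
    from sklearn.decomposition import SparsePCA

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 20))    # placeholder data matrix

    spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)
    Y = spca.fit_transform(X)             # n x d sparse scores
    W = spca.components_.T                # p x d loadings with many zero entries
    print((np.abs(W).sum(axis=1) > 0).sum())  # variables involved in at least one component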
Basics of PCA & sparsity: gsPPCA
Our objective is to truly perform unsupervised variable selection within PCA:
the projection matrix W should be row-sparse, leading to the globally sparse PCA problem,
this solution identifies the relevant original variables while reducing the dimensionality.
Figure: Globally sparse PCA viewed as a low-rank decomposition.
Probabilistic PCA
Let us consider probabilistic PCA (PPCA, Tipping & Bishop '99), which assumes that each observation is generated by the following model:
x = Wy + ε    (1)
where y ∼ N(0, I_d) is a low-dimensional Gaussian latent vector,
W is a p × d parameter matrix called the loading matrix,
and ε ∼ N(0, σ² I_p) is a Gaussian noise term.
Two remarks:
the PPCA model justifies PCA under a Gaussian assumption,
PPCA allows one to recover the principal components even in the noiseless limit σ → 0!
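As a minimal illustration, here is a sketch (with assumed, arbitrary dimensions and noise level) of sampling one observation from the PPCA generative model (1):

    import numpy as np

    rng = np.random.default_rng(0)
    p, d, sigma = 30, 5, 0.1              # arbitrary dimensions and noise level

    W = rng.standard_normal((p, d))       # loading matrix
    y = rng.standard_normal(d)            # latent vector, y ~ N(0, I_d)
    eps = sigma * rng.standard_normal(p)  # noise term, eps ~ N(0, sigma^2 I_p)
    x = W @ y + eps                       # one observation from model (1)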
The gsPPCA model
We propose the following model to handle variable selection within PPCA:
x = VWy + V̄ε_1 + Vε_2    (2)
where V = diag(v), with v ∈ {0, 1}^p, such that the matrix VW is row-sparse, leading to global sparsity,
ε_1 ∼ N(0, σ_1² I_p) and ε_2 ∼ N(0, σ_2² I_p) are respectively the noises of the inactive and active variables,
and we impose Gaussian priors w_ij ∼ N(0, α⁻²) on the loadings.
We want to investigate the noiseless case σ_2 → 0.
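In the same spirit, a sketch of sampling one observation from model (2); the sparsity pattern v, the noise levels and the dimensions below are arbitrary placeholders:

    import numpy as np

    rng = np.random.default_rng(0)
    p, d, q = 30, 5, 10                   # arbitrary dimensions
    sigma1, sigma2 = 1.0, 0.1             # noise levels of the inactive / active variables

    v = np.zeros(p)
    v[:q] = 1                             # the first q variables are active
    V, Vbar = np.diag(v), np.diag(1 - v)  # V selects active, Vbar inactive variables

    W = rng.standard_normal((p, d))       # loading matrix (VW is row-sparse)
    y = rng.standard_normal(d)            # latent vector
    eps1 = sigma1 * rng.standard_normal(p)
    eps2 = sigma2 * rng.standard_normal(p)
    x = V @ W @ y + Vbar @ eps1 + V @ eps2    # one observation from model (2)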
The gsPPCA model
In the context of the globally sparse PPCA model, we demonstrate the following result.
Theorem. In the noiseless limit σ_2 → 0, x converges in probability to a random variable x̃ whose density is
p(x̃ | v, α, σ_1²) = N(x̃_V̄ | 0, σ_1² I_{p−q}) Bessel(x̃_V | 1/α, (d − q)/2).    (3)
This theorem allows us to efficiently compute the noiseless marginal log-likelihood, defined as
L(X, v, α, σ_1) = ∑_{i=1}^{n} log p(x̃ = x_i | v, α, σ_1).
A last (big) issue:
there are 2^p possible models for v,
so the direct optimization over v is intractable.
The relaxed gsPPCA model
Our solution:
use a relaxed model of (2) to propose a family of p ordered models,
only compute the marginal likelihood along the proposed path of models.
The gsPPCA algorithm (a simplified sketch is given below):
input: a data matrix X ∈ R^{n×p},
2 main steps: a VEM algorithm on a relaxed model for finding a path of candidate variables, then model selection using (3) along the path of models for finding the relevant variables,
output: the number q of relevant variables and the sparsity pattern v ∈ {0, 1}^p.
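The following is only a hedged sketch of this two-step structure, under strong simplifications: the variable ordering uses ordinary PCA loading norms instead of the relaxed VEM step, and the score combines scikit-learn's PPCA likelihood with a crude complexity penalty instead of the noiseless marginal likelihood (3). Its sole purpose is to show that p nested models are scored along a path, rather than 2^p subsets.

    import numpy as np
    from scipy.stats import norm
    from sklearn.decomposition import PCA

    def gsppca_sketch(X, d):
        n, p = X.shape
        Xc = X - X.mean(axis=0)

        # Step 1 (stand-in for the relaxed VEM step): order the variables by the
        # row norms of the ordinary PCA loadings, most relevant first.
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        order = np.argsort(-np.linalg.norm(Vt[:d].T, axis=1))

        # Step 2 (stand-in for the marginal likelihood (3)): score the nested
        # models "top-q variables active" with a PPCA likelihood on the active
        # block and an isotropic Gaussian on the inactive block.
        scores = []
        for q in range(d + 1, p + 1):
            act, inact = order[:q], order[q:]
            ll = n * PCA(n_components=d).fit(Xc[:, act]).score(Xc[:, act])
            if q < p:
                ll += norm.logpdf(Xc[:, inact], scale=Xc[:, inact].std()).sum()
            scores.append(ll - 0.5 * (q * d + 2) * np.log(n))  # crude complexity penalty
        q_hat = d + 1 + int(np.argmax(scores))
        v = np.zeros(p, dtype=int)
        v[order[:q_hat]] = 1
        return q_hat, v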
An introductory example
As an educational example, we simulated a data set according to model (2), with n = 50, p = 30, d = 5 and q = 10, such that v = (1, ..., 1, 0, ..., 0), where the first q entries are equal to 1 and the remaining p − q entries are equal to 0.
Figure: Variable selection with gsPPCA on the introductory example (left: values per variable; right: evidence as a function of the number of variables).
Radiomics: lung cancer diagnosis
The considered data set:
data from n = 87 patients with lung cancer,
p = 636 radiomic features extracted from CT images,
the objective is to build a classifier able to discriminate 3 subtypes of lesions.
Experimental setup:
the cohort was randomly split 50 times into a learning subset (57 patients) and a test subset (30 patients).
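A sketch of this repeated-split protocol (the arrays below are random placeholders for the radiomic data, and plain LDA stands in for the classifiers compared on the next slide):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.standard_normal((87, 636))    # placeholder for the 636 radiomic features
    y = rng.integers(0, 3, size=87)       # placeholder for the 3 lesion subtypes

    rates = []
    for seed in range(50):                # 50 random splits: 57 learning / 30 test patients
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=57, stratify=y, random_state=seed)
        clf = LinearDiscriminantAnalysis().fit(X_tr, y_tr)  # may warn about collinear features
        rates.append((clf.predict(X_te) == y_te).mean())
    print(f"mean correct classification rate: {np.mean(rates):.1%}")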
Radiomics: classification results
Figure: Correct classification rate for gsHDDA and competitors (methods compared: LDA, sLDA, pLDA, ASDA, PLS-DA, sPLS-DA, HDDA, sHDDA); y-axis: percentage of observations correctly identified.
Radiomics: variable selection
Figure: Variable selection for each class using gsHDDA (left, panels sHDDA-class 1 to sHDDA-class 3) and with sPLS-DA (right); y-axis: percentage of tests where each variable is selected.
Conclusion
Take home messages:
AI is just rising, and the real AI revolution is still ahead!
academic research is a key element for solving the (numerous) open problems,
we believe in an enriching combination of statistical and deep learning tools.
References:
C. Bouveyron, P. Latouche and P.-A. Mattei, Bayesian Variable Selection for Globally Sparse Probabilistic PCA, Electronic Journal of Statistics, in press, 2018.
F. Orlhac, P.-A. Mattei, C. Bouveyron and N. Ayache, Class-specific Variable Selection in High-Dimensional Discriminant Analysis through Bayesian Sparsity, Journal of Chemometrics, in press, 2019.