Self-Supervised Learning for Textual Data Analysis. Ludovic Lebart, Telecom-ParisTech www.lebart.org
Self-Supervised Learning for Textual Data Analysis ▪ Outline
▪ 0. Self, semi, un / supervised learning
▪ 1. Some theoretical links, CA at the crossroad
▪ 2. Beyond supervision: synergy and hybridation: Example 1
▪ 3. Context, fragments, Word2Vec
▪ 4. The prism of fragmentation: Example 2
▪ 5. Bridges between Clustering and Principal axes: Example 3
▪ 6. Conclusion
▪ 0. Self, semi, un / supervised learning
Reminder: supervised and unsupervised approaches. In statistical learning theory (cf., e.g., Vapnik, 1998; Hastie et al., 2001): "Unsupervised approach" (exploratory or descriptive); "Supervised approach" (confirmatory or explanatory). Usually, principal axes analyses and clustering are unsupervised, while discriminant analysis and regression methods are supervised.
External validation is the standard procedure in supervised learning. Once the model parameters are estimated (learning phase), external validation (labels) is used to evaluate the model (generalization phase), usually with cross-validation methods. In the literature, however, the term unsupervised is applied to visualization or clustering performed without statistical validation (mostly bootstrap) and without external information.
▪0. Self, semi, un / supervised learning
What leads to self-supervision? 1) To follow LeCun et al. (2015)? "…we expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object." LeCun Y., Bengio Y., Hinton G. (2015). Deep Learning. Nature, 521, 436-444.
2) To cope with the scarcity, absence, or cost of labelled data. In the domains of images, sound and video, self-supervised methods correspond to several hybrid methodologies inspired by deep learning. 3) "For more than forty years I have been speaking prose while knowing nothing of it, and I am the most obliged person in the world to you for telling me so." [M. Jourdain, « Le Bourgeois Gentilhomme », Molière, 1670]. The statistical methods of Textual Data Analysis have been somewhat "self-supervised" avant la lettre.
▪0. Self, semi, un / supervised learning
For most supervised learning algorithms in an industrial context, the effort is concentrated on image, sound and video processing. Learning involves gigantic databases, very far from the usual problems of textual data analysis.
When dealing with supervised analysis of texts, examples of problems / questions:
→ Following speech recognition, selecting an answer to FAQs.
→ In sample surveys with answers to open-ended questions, predicting target variables such as "buying a product" or "belonging to a particular category of respondents".
→ In topic modelling, classifying tweets or texts into predefined themes.
→ In stylometry, authorship attribution.
In the field of textual data analysis, the priority is not systematically "recognition" but discovery, description, comparison, understanding. It remains partially supervised in the sense that both the available external information and the discovered structures are used to enhance the exploration.
▪ 0. Self, semi, un / supervised learning
External validation in the context of correspondence analysis (CA). External validation can be used in the unsupervised case, in the context of CA, in the following two practical circumstances: a) when the data set can be divided into two or more parts, one part being used to estimate the model and the other part(s) to verify the adequacy of the model; b) when certain metadata (external information) are available to supplement the description of the elements to be analyzed. The external information is then provided by supplementary elements (additional rows or columns of the data table). Such supplementary elements are subsequently projected onto the main visualization planes.
The bootstrap procedures provide an "internal supervision" of the obtained results.
▪ 0. Self, semi, un / supervised learning
The significance of these supplementary elements can be evaluated using conventional statistical tools (e.g. Student's t) or bootstrap validation. The technique of additional or supplementary variables can be viewed as a visualized regression; in this sense, we are in a supervised context.
The supplementary variable y is projected onto the space spanned by the active variables (or onto the principal-axes approximation of that space). It can answer the question: are these additional variables independent of the structure revealed by the active variables?
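As an illustration of this "visualized regression" reading (a minimal sketch with PCA on simulated data, not the CA of the slides; the supplementary variable y is hypothetical and is positioned by its correlations with the principal components):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4))
X[:, 0] *= 3.0                  # give the first active variable a dominant variance
X = X - X.mean(axis=0)          # active variables, centred

# principal axes of the active table via SVD
U, s, Vt = np.linalg.svd(X, full_matrices=False)
F = U * s                       # principal components (row coordinates)

# hypothetical supplementary variable, strongly related to the first active one
y = X[:, 0] + 0.1 * rng.standard_normal(100)
y = y - y.mean()

# "visualized regression": the coordinate of y on each axis is its
# correlation with the corresponding principal component
coords = np.array([np.corrcoef(y, F[:, k])[0, 1] for k in range(4)])
```

A coordinate close to ±1 on an axis indicates that the supplementary variable is strongly related to the structure carried by that axis; coordinates near 0 on every axis suggest independence from the revealed structure.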
▪1. Some theoretical links, CA at the crossroad
CA: a tool at the junction of many different methods. Correspondence Analysis of contingency tables (CA), independently discovered by various authors, can be presented from many points of view. It can be viewed, for example, as a particular case both of Linear Discriminant Analysis (LDA) (performed on dummy variables) and of Singular Value Decomposition (SVD) (performed after a proper scaling of the original data). After the seminal papers of Guttman (1941), Hayashi (1956) and Benzécri (1969a), various presentations of CA can be found in the literature (see, for instance, Lebart et al. (1984), Greenacre (1984), Gifi (1990), Benzécri (1992), Gower and Hand (1996)). In the context of neural networks, Correspondence Analysis is at the meeting point of several different techniques.
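The SVD view of CA mentioned above can be sketched on a toy contingency table (illustrative numbers, not from the source):

```python
import numpy as np

# toy contingency table (rows x columns)
N = np.array([[30.0,  5.0,  2.0],
              [ 5.0, 20.0,  5.0],
              [ 2.0,  5.0, 10.0]])

P = N / N.sum()                       # correspondence matrix
r = P.sum(axis=1)                     # row masses
c = P.sum(axis=0)                     # column masses

# standardized residuals: the "proper scaling" before the SVD
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

U, s, Vt = np.linalg.svd(S, full_matrices=False)

# principal coordinates of rows and columns
row_coords = (U * s) / np.sqrt(r)[:, None]
col_coords = (Vt.T * s) / np.sqrt(c)[:, None]

inertia = (s ** 2).sum()              # total inertia = chi-square statistic / n
```

The last singular value is zero (the trivial dimension), and the total inertia equals the Pearson chi-square statistic divided by the grand total, which is the classical link between CA and the chi-square distance.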
▪ 1. Some theoretical links, CA at the crossroad
CA can be described as a particular supervised Multilayer Perceptron (in that case, the input and the output layers are respectively the rows and the columns of the contingency table). CA is also an unsupervised Multilayer Perceptron (in such a case the input layer, and the output layer as well, could be the rows, whereas the observations - also named examples, or elements of the training set - could be the columns of the table). In both situations, the networks use the identity function as transfer function. More general transfer functions might lead to interesting non-linear extensions of CA. CA can also be obtained from linear adaptive networks, a series of methods closely related to stochastic approximation algorithms.
▪ 1. Some theoretical links, CA at the crossroad
Equivalence between Linear Discriminant Analysis and the supervised Multilayer Perceptron (when the transfer functions are identity functions) has been proved by Gallinari et al. (1988) and generalized to more general models (such as non-linear discriminant analysis) by Asoh and Otsu (1989). A general framework (see, e.g., Baldi and Hornik (1989)) can deal simultaneously with the supervised and the unsupervised cases.
Gallinari, P., Thiria, S. and Fogelman-Soulie, F. (1988): Multilayer perceptrons and data analysis. International Conference on Neural Networks, IEEE, 1, 391-399.
Baldi, P. and Hornik, K. (1989): Neural networks and principal component analysis: learning from examples without local minima. Neural Networks, 2, 52-58.
Asoh, H. and Otsu, N. (1989): Nonlinear data analysis and multilayer perceptrons. IEEE, IJCNN-89, 2, 411-415.
Ripley, B. D. (1993): Statistical aspects of neural networks. In: Networks and Chaos: Statistical and Probabilistic Aspects, Barndorff-Nielsen, O. E., Jensen, J. L., Kendall, W. S. (eds), Chapman and Hall, London, 40-123.
Cheng, B. and Titterington, D. M. (1994): Neural networks: a review from a statistical perspective. Statistical Science, 9, 2-54.
▪ 1. Some theoretical links, CA at the crossroad
Multi-layer Perceptron
Fig. 1: Perceptron with one hidden layer (i-th observation)
▪1. Some theoretical links, CA at the crossroad
The transfer function φ is often the logistic function:

φ(z) = exp(z) / (1 + exp(z))

The transfer function ψ could be linear, logistic, or binary (e.g. ψ(z) = 0 if z ≤ 0 and ψ(z) = 1 if z > 0). In the case of identity transfer functions (φ and ψ) and null constant terms, the model collapses to the simpler form:

y_ik = Σ_{m=1}^{c} b_mk ( Σ_{j=1}^{p} a_jm x_ij ) + e_ik
▪ 1. Some theoretical links, CA at the crossroad
Multi-layer Perceptron
Figure 2: The case of binary disjunctive input and output (dummy variables)
▪ 1. Some theoretical links, CA at the crossroad
Figure 3: Three equivalent correspondence analyses
▪1. Some theoretical links, CA at the crossroad
Self organized (or unsupervised) Perceptron
The « bottleneck » implies a compression of the data through Singular Value Decomposition (SVD, which underlies both PCA and CA).
▪ 1. Some theoretical links, CA at the crossroad
A Linear Adaptive Network Brief review of some computational techniques involved in CA
Several computational algorithms can be involved in Correspondence Analysis: reciprocal averaging, iterated power, QR and QL algorithms, the Jacobi method, the Lanczos method, as well as other classical numerical procedures for SVD. The use of the back-propagation method and other techniques usually associated with the Multilayer Perceptron provides new numerical approaches and a better insight into the method. The unsupervised MLP model is also closely related to various types of stochastic approximation algorithms that can roughly mimic the cognitive process involved in perusing a data table. These algorithms are able to tackle huge data sets like those encountered in text mining.
▪1. Some theoretical links, CA at the crossroad
Benzécri (1969b) and Krasulina (1970) independently proposed stochastic approximation algorithms for determining the largest eigenvalues of the expectation of a random matrix. Lebart (1974) gave a numerical proof of the convergence of Benzécri's algorithm, and showed its interest in the case of sparse data matrices, such as those involved in Multiple Correspondence Analysis. Oja and Karhunen (1981) proposed similar algorithms, adding new proofs and developments, reinforced by the results of Kushner and Clark (1978).
The first mention of neural networks can be found in Oja (1982), who has since proposed a wide variety of algorithms (see Oja (1992)). See: http://www.dtmvic.com/doc/MOD97.pdf
▪ 1. Some theoretical links, CA at the crossroad
Basics of stochastic approximation algorithms
The basic idea is as follows: X being the (n, p) matrix of properly re-scaled data, the product moment matrix X^T X can be written as a sum of n terms A_i:

X^T X = Σ_{i=1}^{n} A_i,   with   A_i = x_i x_i^T

The classical iterated power algorithm can then be performed using this decomposition, taking advantage of the possible sparsity of the data matrix X. Starting from a random vector u(0), step k of this algorithm, after setting u(k) = 0, consists of n assignments:

for i = 1 to n, do:  u(k) ← u(k) + A_i u(k-1)

The vector u(k-1) remains unchanged during the whole step k. We can improve the algorithm by modifying the estimate of u during each assignment:

for j = 1 to n, do:  u(j) ← u(j-1) + g(j) A_{i(j)} u(j-1)

where g(j) is a "gain parameter". During each step k, the index i(j) of the matrix A takes values 1 to n; at step k, i(j) = j - (k-1)n. To ensure the convergence of u(j) towards the largest eigenvector of X^T X, the series Σ g(j) must diverge whereas the series Σ g(j)² must converge. The series g(j) could be chosen among series closely related to the harmonic series, such as g(j) = a/(b+j).
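A minimal sketch of this stochastic algorithm (simulated data; the renormalization at the end of each pass is an added safeguard to keep u bounded, not part of the slide's formulation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
X[:, 0] *= 5.0                 # create a dominant direction

u = rng.standard_normal(p)
a, b = 1.0, 10.0               # gain series g(j) = a / (b + j)
j = 0
for epoch in range(50):
    for i in range(n):         # one pass: i(j) runs from 1 to n
        j += 1
        g = a / (b + j)        # sum g(j) diverges, sum g(j)^2 converges
        xi = X[i]
        u = u + g * xi * (xi @ u)     # A_i u = x_i x_i^T u
    u = u / np.linalg.norm(u)  # renormalize once per pass (added safeguard)

# compare with the dominant eigenvector of X^T X
w, V = np.linalg.eigh(X.T @ X)
cosine = abs(u @ V[:, -1])
```

After a few passes, u aligns with the dominant eigenvector of X^T X; the sparse structure of X can be exploited because each update only touches the non-zero components of x_i.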
2. Beyond supervision: synergy and hybridation: Example 1
Example 1: Comments about Spanish wines: Examples of "responses" ---- I001 Manzana reineta, pomelo maduro, flores blancas. en boca suave y frutoso, con un agradable toque de acidez al final. ---- I003 Expresivo en sus notas florales y frutales, lirio, manzana verde, pera de agua, pétalos blancos. en boca suave, taninos muy sedosos de la fruta, bayas blancas y una acidez perfecta. ---- I007 Nariz extremadamente perfumada: flores azules y blancas y cáscara de nuez. limón y frutos secos en boca. ---- I009 Boca muy equilibrada, con destellos de madera sobre un fondo de fruta amarilla madura. Buena persistencia. en nariz, sin embargo, algo insípido y dominado por notas de hierbas y un toque dulce de levaduras. ---- I010 …………………………………
2. Beyond supervision: synergy and hybridation: Example 1
Example 1: Comments about Spanish wines: Examples of “responses” (English translation)
---- I001 Pippin apple, ripe grapefruit, white flowers. Soft and fruity on the palate, with a pleasant touch of acidity at the end. ---- I003 Expressive in its floral and fruity notes, lily, green apple, water pear, white petals. Soft in the mouth, very silky fruit tannins, white berries, and perfect acidity. ---- I007 Extremely perfumed nose: blue and white flowers and nutshell. Lemon and nuts in the mouth. ---- I009 Very balanced mouth, with flashes of wood on a background of ripe yellow fruit. Good persistence. The nose, however, rather bland, dominated by notes of herbs and a sweet hint of yeast. ---- I010 …………………………………
2. Beyond supervision: synergy and hybridation: Example 1
Example 1: Comments about Spanish wines
Counts for the first phase of numeric coding, summary of results:
total number of responses = 443
total number of words (tokens) = 14,061
number of distinct words (types) = 1,394

Selection of words: when the words appearing at least 4 times are selected, 12,404 occurrences (tokens) of these words remain, with 395 distinct words (types).
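A sketch of this counting and thresholding phase on a toy list of responses (invented fragments, not the actual wine corpus):

```python
from collections import Counter

# toy "responses" (invented, for illustration only)
responses = [
    "manzana reineta pomelo maduro flores blancas en boca suave",
    "en boca suave taninos sedosos fruta blanca acidez",
    "flores blancas y frutos secos en boca",
]

tokens = [w for resp in responses for w in resp.split()]
counts = Counter(tokens)

threshold = 2                   # keep words appearing at least `threshold` times
kept = {w: f for w, f in counts.items() if f >= threshold}

n_tokens = len(tokens)          # total occurrences (tokens)
n_types = len(counts)           # distinct words (types)
n_kept_tokens = sum(kept.values())
```

Raising the frequency threshold trims the vocabulary (types) much faster than the occurrence count (tokens), as in the slide's 1,394 → 395 types against 14,061 → 12,404 tokens.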
2. Beyond supervision: synergy and hybridation: Example 1
Example 1: Comments about Spanish wines Selected statistical units
Words (frequency order)
!-------!--------------!--------!
! num.  ! used words   ! freq.  !
!-------!--------------!--------!
!  101  ! de           !  891   !
!  393  ! y            !  806   !
!  129  ! en           !  694   !
!   46  ! boca         !  433   !
!   87  ! con          !  356   !
!  174  ! fruta        !  334   !
!  378  ! un           !  308   !
!  261  ! nariz        !  246   !
!  259  ! muy          !  237   !
!  215  ! la           !  211   !
!  271  ! notas        !  211   !
!  309  ! que          !  168   !
!  355  ! taninos      !  167   !
!  123  ! el           !  158   !
!  379  ! una          !  152   !
!  232  ! madera       !  140   !
!-------!--------------!--------!
2. Beyond supervision: synergy and hybridation: Example 1
Example 1: Comments about Spanish wines
Selected statistical units
Words (alphabetical order)
+-------+--------------+--------+
!   1   ! a            !   66   !
!   2   ! abierto      !    9   !
!   3   ! acarameladas !    9   !
!   4   ! accesible    !   14   !
!   5   ! acidez       !   79   !
!   6   ! agradable    !   68   !
!   7   ! agradables   !   17   !
!   8   ! agua         !    6   !
!   9   ! ahora        !    5   !
!  10   ! al           !   27   !
!  11   ! albaricoque  !    5   !
!  12   ! algo         !   72   !
!  13   ! alguna       !   20   !
!  14   ! algunas      !    5   !
!  15   ! algún        !   35   !
!  16   ! alta         !    8   !
!  17   ! amable       !    7   !
+-------+--------------+--------+
2. Beyond supervision: synergy and hybridation: Example 1
Example 1: Comments about wines
The following slides show the principal plane produced by a correspondence analysis of the lexical contingency table. Proximity between two category-points (columns) means similarity of the lexical profiles of the two categories. Proximity between two word-points (rows) means similarity of the profiles of these words.
2. Beyond supervision: synergy and hybridation: Example 1
Example 1: Comments about wines. Principal plane of the CA of the contingency table crossing 395 words and 19 score groups (N79 -> N97). Partial bootstrap confidence ellipses.
2. Beyond supervision: synergy and hybridation: Example 1
Example 1: Comments about wine. Same first plane with the 395 words.
2. Beyond supervision: synergy and hybridation: Example 1
Example 1: Comments about wine. Same first plane with the 395 words and some confidence ellipses for words.
2. Beyond supervision: synergy and hybridation: Example 1
Example 1: Comments about wine. S.O.M. (Self-Organizing Map, Kohonen map): 395 words, 19 categories.
2. Beyond supervision: synergy and hybridation: Example 1
Example 1: Comments about wines
(Zoom on the S.O.M.)
2. Beyond supervision: synergy and hybridation: Example 1
Example 1 (« Wine » question). Direct CA of responses; the score groups are projected afterwards onto the principal plane. Bootstrap ellipses drawn after bootstrapping the respondents.
3. Context, fragments, Word2Vec
Word2Vec, semantic content of a lexical profile
Distributional linguistics (Z. Harris):
A is sometimes purring
A mews
A has whiskers
A likes milk
A likes chasing mice
→ In the end, the point « A » will be superimposed on the point « CAT ».
Note 1: semantic similarity is not a transitive relationship: (1) calm–wisdom–discretion–wariness–fear–panic; (2) fact–feature–aspect–appearance–illusion (not compatible with a Euclidean space).
Note 2: taking into account the local similarities between words through a sliding window to enhance a prediction (the Word2Vec approach) is similar to using the repeated segments of Salem (1983) to enhance the predictive power of the original words.
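The sliding-window idea of Note 2 can be sketched as a plain co-occurrence count (toy sentence built from the slide's example; the window size is an arbitrary choice):

```python
from collections import Counter

text = "a mews a has whiskers a likes milk a likes chasing mice".split()
window = 2                       # symmetric context window of +/- 2 positions

# count, for each word, the words appearing within its window
cooc = Counter()
for i, w in enumerate(text):
    for j in range(max(0, i - window), min(len(text), i + window + 1)):
        if i != j:
            cooc[(w, text[j])] += 1
```

Word2Vec learns embeddings from exactly this kind of local evidence; here the raw window counts already give « a » a profile dominated by its typical contexts.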
3. Context, fragments, Word2Vec
new variables, new metrics
3. Context, fragments, Word2Vec
Additive tree of French verbs [excerpt], frequency threshold = 19
3. Context, fragments, Word2Vec
Zoom on the previous additive tree of semantic similarities
4. The prism of fragmentation: Example 2
About an option of fragmentation of a corpus. We can create new "artificial observations" in a text corpus by fragmenting it into small consecutive units. This approach, originally proposed by Reinert (1983, 1986), is the basis of the procedure known as the ALCESTE methodology, also implemented in IRAMUTEQ.
Advantages of the fragmentation of the corpus:
- The structure of the text inside each part is taken into account, a piece of information overlooked in the classical approach based on a single aggregated table.
- A deeper understanding of the internal structure of each text, with a finer granularity.
- External validation evidence can be obtained using the partition of the initial corpus of texts (initial context units).
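A minimal sketch of the fragmentation step (the `fragment` helper is hypothetical, not taken from ALCESTE or IRAMUTEQ):

```python
def fragment(lines, block_size):
    """Split a list of lines into consecutive blocks of `block_size` lines."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]

lines = [f"line {k}" for k in range(12)]
pairs = fragment(lines, 2)       # elementary context units (pairs of lines)
blocks5 = fragment(lines, 5)     # blocks of 5 lines (last block may be shorter)
```

Each block then becomes an "artificial observation": a row of the lexical table crossing fragments and words.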
4. The prism of fragmentation: Example 2
Example 2:
State of the Union speeches of the last eight American presidents, excerpt from the "Inaugural address" corpus (which can be extracted from the nltk.book corpora: see e.g. Bird et al. 2009) [see also the website http://www.usa-presidents.info/union/ that contains all the texts back to the speeches of George Washington in 1790]. As a check, the corpus was also lemmatized using the software TreeTagger (Schmid, 1994), with elimination of function words and prepositions.
4. The prism of fragmentation: Example 2
The corpus « 20th-21st centuries » (new CA): 806,627 tokens, 17,321 types, 21 texts
4. The prism of fragmentation: Example 2
After: How the distances… Why the distances.
4. The prism of fragmentation: Example 2
New zoom on the corpus « 1940-2012 » (new CA): 296,905 tokens, 11,030 types
4. The prism of fragmentation: Example 2
New zoom on the corpus « Nixon-Obama » (new CA): 139,899 tokens, 8,306 types
4. The prism of fragmentation: Example 2
Fragmentation of the corpus into:
- Lines,
- Pairs of lines (Elementary Context Units),
- Blocks of 5 lines,
- Blocks of 20 lines,
- Blocks of 100 lines.
4. The prism of fragmentation: Example 2
Projection of the supplementary variable "President" on the first principal plane of the CA of the 583 x 12,854 table (the 12,854 lines are considered as context units), with specific bootstrap ellipses (lines are drawn with replacement, instead of words). [Same situation as the wine-tasting example, but here the units are arbitrary fragments, not individual responses].
4. The prism of fragmentation: Example 2
Unlike the Word2Vec approach (a sliding window taking the context into account), the fragmentation approach involves new statistical units which can be related to external data (metadata). Consequently, we were able to project the « presidents » onto the visualization space generated by the fragments.
From the distances computed between fragments, a « local semantic graph » can be derived [using either a distance threshold or nearest neighbours]. A similar graph can be computed from the distances derived from sliding windows. From a linguistic point of view, fragments, which can contain a whole number of sentences, are more legitimate units than sliding windows for defining a meaningful context.
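A sketch of the « local semantic graph » construction from fragment distances, under both options mentioned (distance threshold and nearest neighbours), on simulated coordinates:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)
coords = rng.standard_normal((6, 3))   # e.g. fragment coordinates on 3 axes

# pairwise Euclidean distances between fragments
D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

# option 1: threshold graph (connect pairs closer than t)
t = np.median(D[D > 0])
edges_threshold = {(i, j) for i in range(6) for j in range(i + 1, 6) if D[i, j] < t}

# option 2: k-nearest-neighbour graph (k = 2)
k = 2
edges_knn = set()
for i in range(6):
    for j in np.argsort(D[i])[1:k + 1]:    # position 0 is the point itself
        edges_knn.add((min(i, int(j)), max(i, int(j))))
```

The threshold graph controls edge density globally, whereas the k-nearest-neighbour graph guarantees that every fragment keeps at least k links, even in sparse regions.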
5. Bridges between Clustering and Principal Axes: Example 3
A general point of view: some theoretical links. Clustering methods and principal axes techniques (principal components analysis, two-way and multiple correspondence analysis, canonical and linear discriminant analyses, etc.) have often interacted; most practitioners consider them complementary approaches in the exploration of multivariate data sets.
◼ 1. Clustering from principal coordinates.
◼ 2. A posteriori projection of clusters onto the principal planes.
◼ 4. Use of the minimum spanning tree to complement principal axes visualizations.
◼ 5. Bridge between principal planes and Self Organizing Maps.
5. Bridges between Clustering and Principal Axes: Example 3
Reminder: Non probabilistic aspects of Principal axes A pedagogical example : Description of a textual graph
5. Bridges between Clustering and Principal Axes: Example 3
Each Irish county "answers" the fictitious open-ended question: Which are your neighbouring counties?
Table 1: Text encoding the contiguity relationship for four Irish counties
**** Galway Mayo Roscommon Offaly Clare Tipperary
**** Leitrim Sligo Roscommon Longford Fermanagh Cavan Donegal
**** Mayo Sligo Roscommon Galway
**** Roscommon Sligo Leitrim Longford Westmeath Offaly
……………
5. Bridges between Clustering and Principal Axes: Example 3
5. Bridges between Clustering and Principal Axes: Example 3
Example of a graph G (n = 25) associated with a square lattice (nodes numbered 1 to 25 on a 5 x 5 grid), and its associated matrix M →
5. Bridges between Clustering and Principal Axes: Example 3
Matrix M: the 25 x 25 symmetric incidence matrix of the graph G (rows r01 to r25; m_ij = 1 if nodes i and j coincide or are adjacent on the lattice, 0 otherwise).
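A sketch reconstructing a matrix of this kind (assuming, as in the extracted table, that the diagonal cells are set to 1):

```python
import numpy as np

side = 5
n = side * side

# M[i, j] = 1 when nodes i and j coincide or are neighbours on the square lattice
M = np.zeros((n, n), dtype=int)
for row in range(side):
    for col in range(side):
        i = row * side + col
        M[i, i] = 1                                  # diagonal, as in the table
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = row + dr, col + dc
            if 0 <= rr < side and 0 <= cc < side:
                M[i, rr * side + cc] = 1
```

Corner nodes have three 1s in their row (themselves plus two neighbours), interior nodes five; this incidence matrix is the input of the PCA and CA shown on the next slides.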
5. Bridges between Clustering and Principal Axes: Example 3
Description of chessboard G through Principal Component Analysis of M
5. Bridges between Clustering and Principal Axes: Example 3
Description of chessboard G through Correspondence Analysis of M: first principal plane (axes 1 and 2), displaying the 25 nodes of G.
5. Bridges between Clustering and Principal Axes: Example 3
Example: Compression via principal axes
Color image: 3 numbers (< 256) per pixel. Table 294 x 145 [294 = 98 x 3]
[Excerpt of the table of pixel values]
5. Bridges between Clustering and Principal Axes: Example 3
Compression via principal axes
Reconstructions with 2, 4, 10, 20 and 100 axes.
This ability to detect patterns (graph example) and to summarize them (image example) is not exploited by shallow supervised learning.
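The rank-k reconstructions shown above can be sketched with a truncated SVD (a random matrix stands in for the pixel table):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((60, 40))   # stand-in for the 294 x 145 pixel table

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def compress(k):
    """Rank-k reconstruction: keep only the first k principal axes."""
    return (U[:, :k] * s[:k]) @ Vt[:k]

err2 = np.linalg.norm(A - compress(2))     # coarse reconstruction
err10 = np.linalg.norm(A - compress(10))   # finer reconstruction
```

The reconstruction error decreases monotonically with the number of retained axes, which is why the 100-axis image is visually close to the original.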
5. Bridges between Clustering and Principal Axes: Example 3
The minimum spanning tree to complement principal axes
Example of application with PCA. The selected data set deals with « Semiometric » survey data. The questionnaire consists of 210 words that 3370 respondents are asked to rate (on a 7-item scale) according to the pleasure or displeasure they experience at the mention of each of these words. The pattern obtained in the space spanned by the first six principal axes of a Principal Component Analysis of the (3360 x 70) data table appears to be stable over time, and similar in several European countries. We run the example on a subset of 70 words and 12 principal coordinates derived from a preliminary PCA performed on a subset of 60 300 respondents.
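A minimal sketch of a minimum spanning tree computed on principal coordinates (Prim's algorithm on simulated coordinates, not the Semiometric data):

```python
import numpy as np

rng = np.random.default_rng(4)
coords = rng.standard_normal((10, 4))   # e.g. word coordinates on 4 principal axes

D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

# Prim's algorithm on the complete distance graph
n = len(D)
in_tree = [0]
edges = []
while len(in_tree) < n:
    best = None
    for i in in_tree:                   # grow the tree by its cheapest outgoing edge
        for j in range(n):
            if j not in in_tree and (best is None or D[i, j] < D[best[0], best[1]]):
                best = (i, j)
    edges.append(best)
    in_tree.append(best[1])
```

Drawn over the principal plane, these n-1 edges reveal which apparent proximities are supported by the full multidimensional distances and which are projection artefacts.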
5. Bridges between Clustering and Principal Axes: Example 3
Display of variable points (2 principal axes)
5. Bridges between Clustering and Principal Axes: Example 3
MST on 2 principal axes
5. Bridges between Clustering and Principal Axes: Example 3
MST on 4 principal axes
5. Bridges between Clustering and Principal Axes: Example 3
The self-organizing maps (SOM). The self-organizing maps (SOMs), proposed by Kohonen (1981), aim at clustering a set of multivariate observations. The obtained clusters are displayed as the vertices of a rectangular (chessboard-like) or hexagonal grid.
The distances between vertices on the graph are supposed to reflect, as much as possible, the distances between clusters in the initial space.
Principles of the algorithm: the size of the grid, and consequently the number of clusters, are chosen a priori (e.g. a square grid with 5 rows and 5 columns, leading to 25 clusters).
The algorithm is very similar to the MacQueen algorithm (1967) in its online version, and to the k-means algorithm (Forgy, 1965) in its batch version.
5. Bridges between Clustering and Principal Axes: Example 3
Sketch of the algorithm. Let us consider n points in a p-dimensional space (the rows of the (n, p) matrix X). At the outset, each cluster k is assigned a provisional centre C_k with p components (e.g. chosen at random, or among the first elements). At each step t, the element x_i(t) is assigned to its nearest provisional centre C_k(t). This centre, together with its neighbours on the grid, is then modified according to the formula:

C_k(t+1) = C_k(t) + ε(t) (x_i(t) - C_k(t))

In this formula, ε(t) is an adaptation parameter (0 < ε < 1) which is a (slowly) decreasing function of t, as usually encountered in stochastic approximation algorithms. This process is reiterated and eventually stabilizes, but the partition obtained generally depends on the initial choice of the centres.
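A minimal sketch of this online SOM update on simulated data (the 3 x 3 grid, the neighbourhood rule and the gain schedule are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((300, 2))          # n observations in p dimensions

rows, cols = 3, 3                          # 3 x 3 grid -> 9 clusters
grid = np.array([(r, c) for r in range(rows) for c in range(cols)])
centres = X[rng.choice(len(X), rows * cols, replace=False)].copy()

n_steps = 3000
for t in range(n_steps):
    eps = 0.5 * (1.0 - t / n_steps)        # slowly decreasing adaptation parameter
    x = X[t % len(X)]
    k = int(np.argmin(np.linalg.norm(centres - x, axis=1)))   # winning unit
    for m in range(rows * cols):
        # update the winner and its direct neighbours on the grid
        if np.abs(grid[k] - grid[m]).sum() <= 1:
            centres[m] += eps * (x - centres[m])
```

Updating the grid neighbours together with the winner is what makes neighbouring cells of the map end up with nearby centres in the data space.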
5. Bridges between Clustering and Principal Axes: Example 3
SOM (3 x 3) describing the associations between words
5. Bridges between Clustering and Principal Axes: Example 3
Projection of the 9 centroids of the SOM clusters in the PCA plane
5. Bridges between Clustering and Principal Axes: Example 3
Stylised incidence matrix M1 of the graph associated with a SOM map (all the cells in the white [resp. black] areas contain the value 0 [resp. 1]).
Reminder: L.D.A.: a set of classes
5. Bridges between Clustering and Principal Axes: Example 3
9 centroids of the SOM clusters in the contiguity plane
5. Bridges between Clustering and Principal Axes: Example 3
Convex hulls of the nine SOM clusters in the contiguity plane
Advantages: Shape, size, overlapping, distances, internal configuration
5. Bridges between Clustering and Principal Axes: Example 3
Convex hulls of the nine SOM clusters in the contiguity plane
5. Bridges between Clustering and Principal Axes: Example 3
In the same plane: bootstrap confidence areas for 4 elements
6. Conclusions
We have evoked both theoretical (parts 1 and 5) and pragmatic aspects of the links between methods, together with the interest of combining methods. Unsupervised techniques are not simply the counterpart of supervised techniques.
They are components of a knowledge process that may involve labelled data (when an external decision relying on these labels is expected). But that process, designated here as Textual Data Analysis (TDA), may also create, or at least suggest, new "labels" (clusters, axes, patterns [SOMs, trees]) to be validated by other components. The assessment of results through the powerful non-parametric bootstrap procedures highlights the confirmatory aspect of TDA. The qualification "self-supervised approach" is then particularly apt.
Software note: All the preceding computations (multidimensional analysis of texts and images, self-organizing maps, bootstrap) can be carried out with the software DtmVic (Data and Text Mining, Visualization, Inference, Classification), freely downloadable from the website: www.dtmvic.com.
Gracias · Obrigado · Thank You · Grazie · Danke · Merci