LDA Topic Models: turning words into meaning
Andrius Knispelis
In 2011 I joined the Danish startup issuu (the fastest growing online publishing platform) as their first Data Scientist. Over the following 4 years I worked on many interesting things there, and by far the coolest of them was Topic Modelling. Let me share with you:
What is LDA Topic Modelling? Why do you need it? How do you build it?
[Slide: related / similar: reading patterns? content? placement (where & when)? The content reaches the machine as a stream of 0s and 1s.]
TROPICAL FRUIT Serve up something new with... RECIPES: GLYNIS MCGUINNESS, GREGOR MCMASTER. PHOTOGRAPHS: JONATHAN KENNEDY. STYLING: TAMZIN FERDINANDO, JENNY IGGLEDEN. FOOD STYLING: DENISE SMART, KATE BLINMAN KIWI FRUIT Cheesecake layers Put a few digestive biscuits in a freezer bag and smash with a rolling pin. Beat a little icing sugar into soft cheese, then peel and slice some kiwi fruit. Layer in glasses until full. MANGO Spicy mango salad with pork Peel and stone ripe mango and slice. Mix with sliced red onion, quartered cherry tomatoes, a sliced chilli, chopped coriander and a squeeze of lemon juice. Serve with grilled pork chops or steaks.
PASSION FRUIT Tropical pavlova Whip double cream until thick and spoon into meringue nests. Top with mango slices. Halve 2 passion fruits, scoop out the seeds and flesh. Spoon over the meringues. PINEAPPLE Rum-flavoured rings Remove the pineapple ends and peel. Slice thickly and remove the core. Heat a little butter with brown sugar and stir until melted. Add a good splash of dark rum and the pineapple and simmer for 5-10 minutes.
Why not also try... • Salsa Peel and dice some Why not also try... • Ice cream topping Simply Why not also try... • Rice salad Cook long
kiwi fruits and peeled, stoned avocados. Toss in lime juice, then stir in a little finely chopped shallot and deseeded red chilli. Serve with meat or fish. • Kiwi & chicken wraps
knowledge about the world
what is it about? what is it related to? what does it feel like? what does it mean?
right level of abstraction
context, topic, word
knowledge about the world:
use the right words
capture the widest range of topics
set the right “window” of context
Millions of articles in Wikipedia, capturing the widest range of topics.
How to…
preprocess the data?
train the model?
score it on a new document?
evaluate the performance?
gensim: a topic modeling framework. Free Python library.
LDA, hierarchical LDA, dynamic LDA, DeepLearning, Word2Vec, Doc2Vec, POS, …
[Figure: Wikipedia as a corpus of articles, each a title followed by its words.]
Setting the right “window” of context (a): define the minimum number of words that must be present in an article.
Recommended: 100 to 300 words.
Setting the right “window” of context (b): skip articles whose titles start with these namespaces:
Wikipedia:, Category:, File:, Portal:, Template:, MediaWiki:, User:, Help:, Book:, Draft:
[Bar chart: number of articles per namespace, axis from 0 to 1,500,000. Values shown: 1,167,766; 907,811; 892,147; 571,248; 128,603; 8,893; 4,815; 2,324; 1,505; 915.]
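The two “window” filters, a minimum article length and a namespace skip list, can be sketched in a few lines. This is only an illustration under stated assumptions, not the code behind the talk: the function name `keep_article`, the whitespace tokenization, and the default of 200 words (picked from the recommended 100 to 300 range) are all hypothetical.

```python
# Namespace prefixes listed on the slide; articles whose titles start
# with any of them are skipped.
SKIPPED_NAMESPACES = (
    "Wikipedia:", "Category:", "File:", "Portal:", "Template:",
    "MediaWiki:", "User:", "Help:", "Book:", "Draft:",
)


def keep_article(title, text, min_words=200):
    """Return True if the article should stay in the training corpus.

    min_words is an assumed default within the recommended 100-300 range.
    """
    if title.startswith(SKIPPED_NAMESPACES):
        return False
    # Naive whitespace tokenization, good enough for a length check.
    return len(text.split()) >= min_words


# Toy articles (hypothetical): one regular, one in a skipped namespace,
# one that is too short.
articles = [
    ("Kiwifruit", "word " * 250),
    ("Category:Fruit", "word " * 250),
    ("Mango", "word " * 40),
]
kept = [title for title, text in articles if keep_article(title, text)]
print(kept)  # prints ['Kiwifruit']
```

Checking the title first means the (potentially long) text is never tokenized for pages that would be dropped anyway.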
Let the right words in
Word length:
1: i
2: do, be, am, …
3: ice, was, who, …
16: videoconferences, …
17: superbillionaires, …
18: intellectualization, …
Stoplists: general terms; last names, first names; countries, cities
Lemmatization: am, are, is = be
Parts of Speech:
NN - noun (computer, car, cake, …)
VB - verb (play, install, commit, …)
RB - adverb (today, quickly, patiently, …)
JJ - adjective (red, awesome, big, …)
IN - preposition (of, about, from, …)
Remove a word if it appears in more than 10% of the articles.
Remove a word if it appears in fewer than 20 articles.
Keep the top n words (recommended: 50,000 to 100,000) and discard the rest.
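The frequency-based rules (drop words found in more than 10% of the articles, drop words found in fewer than 20 articles, keep the top n) could look roughly like the sketch below. It uses only the standard library; the function name `build_vocabulary` is hypothetical, and ranking the “top n” by document frequency is an assumption, since the slide does not say how the words are ranked. gensim's `Dictionary.filter_extremes` exposes similar thresholds (`no_below`, `no_above`, `keep_n`).

```python
from collections import Counter


def build_vocabulary(docs, min_docs=20, max_doc_share=0.10, keep_n=100_000):
    """Pick the words to keep, given docs as a list of token lists."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document

    max_docs = max_doc_share * len(docs)
    candidates = [
        (word, df) for word, df in doc_freq.items()
        if min_docs <= df <= max_docs
    ]
    # Assumption: "top n" means highest document frequency,
    # with ties broken alphabetically. Everything else is discarded.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return {word for word, _ in candidates[:keep_n]}


# Toy corpus of 10 documents with thresholds scaled down to match.
docs = (
    [["the", "mango"]] * 3
    + [["the", "kiwi"]] * 2
    + [["the", "rare"]]
    + [["the"]] * 4
)
vocab = build_vocabulary(docs, min_docs=2, max_doc_share=0.5, keep_n=10)
print(sorted(vocab))  # prints ['kiwi', 'mango']
```

In the toy run, "the" is dropped for being everywhere and "rare" for appearing in a single document, which is the intent of the two thresholds.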
LATENT DIRICHLET ALLOCATION
Inputs: tfidf.mm (documents), wordids.txt (words)
Θ (context): the topic distribution for document i
Z (topic): the topic for the j’th word in document i
W (word): the observed words in document i
N words, M documents
A topic model developed by David Blei, Andrew Ng and Michael Jordan in 2003. It tells us which topics are present in any given document by observing all the words in it and producing a topic distribution.
Let’s assume that…
topics, themes, …
topic#1: P * word, P * word, P * word, ….
topic#2: P * word, P * word, P * word, ….
topic#3: P * word, P * word, P * word, ….
recipe: topic#1 50%, topic#2 30%, topic#3 20%
Take this recipe and generate a document based on the model’s “rules”.
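Generating a document from a recipe can be made concrete with a small sampler: for each word position, pick a topic according to the recipe, then pick a word according to that topic's weights. The 50/30/20 recipe comes from the slide; the topic contents, the function name `generate_document`, and the document length are made up for illustration.

```python
import random

# Hypothetical topics: each maps words to probabilities that sum to 1.
topics = {
    "topic#1": {"recipe": 0.5, "chef": 0.3, "sauce": 0.2},
    "topic#2": {"wine": 0.6, "grape": 0.4},
    "topic#3": {"travel": 0.7, "beach": 0.3},
}
# The recipe from the slide: how much of each topic goes into the document.
recipe = {"topic#1": 0.5, "topic#2": 0.3, "topic#3": 0.2}


def generate_document(recipe, topics, length, rng):
    """Sample `length` words: first a topic per word, then a word from it."""
    words = []
    for _ in range(length):
        topic = rng.choices(list(recipe), weights=list(recipe.values()))[0]
        dist = topics[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return words


doc = generate_document(recipe, topics, 20, random.Random(0))
print(" ".join(doc))
```

Word order carries no meaning here; the output is a bag of words whose mix reflects the recipe, which is all the model assumes about a document.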
What really happens…
Take this collection of documents and learn a model that describes it best, given these model parameters:
how many topics?
how are those topics assigned to a document?
Words appearing in the same context (document) are related.
The learned model: topic#1, topic#2, …, topic#N, each a list of P * word terms.
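To make “learn a model that describes the collection best” tangible, here is a deliberately tiny collapsed Gibbs sampler for LDA. It is a teaching sketch, not what the talk or gensim uses (gensim's `LdaModel` is based on online variational Bayes); the function name `fit_lda_gibbs`, the toy documents, and the hyperparameter defaults are assumptions. The number of topics and the priors `alpha` and `beta` are the model parameters that have to be chosen up front.

```python
import random
from collections import defaultdict


def fit_lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Return (theta, topic_word): per-document topic shares and word counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # words of doc d in topic k
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics                       # all words in topic k
    assign = []                                          # topic of every word

    # Start from a random topic for every word.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, k in zip(doc, zs):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the word, then resample its topic given all the others.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word


# Hypothetical toy corpus: two food documents, two wine documents.
docs = [
    ["chef", "sauce", "recipe", "chef", "sauce"],
    ["recipe", "sauce", "chef", "recipe"],
    ["wine", "grape", "vineyard", "wine"],
    ["grape", "wine", "vineyard", "grape", "wine"],
]
theta, topic_word = fit_lda_gibbs(docs, num_topics=2)
for row in theta:
    print([round(p, 2) for p in row])
```

The sampler only ever sees which words share a document, which is exactly the assumption stated above: words appearing in the same context are related.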
LATENT DIRICHLET ALLOCATION tfidf.mm
wordids.txt
β
documents
words word word word word word word word word word word word word word word word word
a parameter that sets the prior on the per-document topic distributions
α
Θ
Z
a parameter that sets the prior on the per-topic word distributions
W N M
the topic distribution for document i
the topic for the j’th word in a document i
N words M documents
observed words in a document i
A topic model developed by David Blei, Andrew Ng and Michael Jordan in 2003. It tells us what topics are present in any given document by observing all the words in it and producing a topic distribution.
LATENT DIRICHLET ALLOCATION wordids.txt
β
documents
words word word word word word word word word word word word word word word word word
a parameter that sets the prior on the per-topic word distributions
model.lda
a parameter that sets the prior on the per-document topic distributions
α
Θ
Z
topics
tfidf.mm
W N M
the topic distribution for document i
the topic for the j’th word in a document i
N words M documents
observed words in a document i
A topic model developed by David Blei, Andrew Ng and Michael Jordan in 2003. It tells us what topics are present in any given document by observing all the words in it and producing a topic distribution.
words
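The symbols on the plate diagram can be read as a recipe for generating documents. The sketch below follows that recipe using only the Python standard library. The vocabulary, the function names and the values of alpha and beta are illustrative assumptions, not taken from the slides.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Draw an index according to a discrete probability distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(vocab, num_topics, num_docs, words_per_doc, alpha, beta, seed=0):
    """Generate documents by following the LDA generative story."""
    rng = random.Random(seed)
    # One word distribution per topic, drawn with the symmetric prior beta.
    topic_word = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    docs, thetas = [], []
    for _ in range(num_docs):                                # plate M: documents
        theta = sample_dirichlet([alpha] * num_topics, rng)  # Θ: topic distribution of this document
        words = []
        for _ in range(words_per_doc):                       # plate N: words
            z = sample_index(theta, rng)                     # Z: topic of this word
            w = sample_index(topic_word[z], rng)             # W: the observed word
            words.append(vocab[w])
        docs.append(words)
        thetas.append(theta)
    return docs, thetas, topic_word

# Hypothetical toy vocabulary and settings, for illustration only.
vocab = ["recipe", "chef", "sauce", "wine", "grape", "vineyard"]
docs, thetas, topic_word = generate_corpus(
    vocab, num_topics=2, num_docs=3, words_per_doc=8, alpha=0.5, beta=0.5
)
for words, theta in zip(docs, thetas):
    print([round(t, 2) for t in theta], words)
```

Training a model runs this story in reverse: only the words are observed, and the topic distributions are inferred from them.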
How many topics (dimensions)?
[Figure: PERCEPTION, a combination of top-down and bottom-up processing. Labels: features, dimensions, spaces, gestalts, context, meaning thresholds; word, topic, context]
A document is a probability distribution over topics.
A topic is a probability distribution over words.
[Figure: documents shown as patterns over LDA topics numbered 1 to 250]
Each document gets represented as a pattern of LDA topics, making every document appear...
...different enough to be separable,
...similar enough to be grouped.
DNA
Example topics (#143, #270, #81) and their top words:
0.019*recipes + 0.017*chef + 0.017*peanut + 0.016*cuisine + 0.015*cooking + 0.015*meat + 0.015*restaurant + 0.015*dish + 0.014*cookery + 0.014*vegetables + 0.014*dishes + 0.012*rice + 0.012*chicken + 0.011*sauce + 0.010*fried + 0.010*beef + 0.009*chefs + 0.009*peanuts + 0.009*bean + 0.009*pork + 0.008*culinary + 0.008*restaurants + 0.008*cucumber + 0.008*recipe + 0.007*kitchen + 0.007*pepper + 0.007*melon + 0.007*ingredients + 0.007*eaten + 0.007*cooked + 0.007*cook + 0.006*potato + 0.006*soup + 0.006*cooks + 0.006*coconut + 0.005*onion + 0.005*meal + 0.005*sausage + 0.005*cabbage + 0.005*anise + 0.005*potatoes + ...
0.057*wine + 0.056*plantings + 0.030*wines + 0.024*vineyard + 0.020*grape + 0.020*winery + 0.016*peaches + 0.016*vineyards + 0.015*grapes + 0.012*cabernet + 0.012*pinot + 0.012*vine + 0.012*napa + 0.011*blanc + 0.011*velvety + 0.010*mourad + 0.010*magie + 0.010*sauvignon + 0.010*trophic + 0.009*approachable + 0.009*neda + 0.009*vines + 0.009*gall + 0.009*bano + 0.008*powdery + 0.008*degraw + 0.007*kimiko + 0.007*viticulture + 0.007*dagupan + 0.007*noir + 0.006*haridas + 0.006*aphid + 0.006*mccray + 0.006*chardonnay + 0.006*osmotic + 0.006*tasting + 0.006*merlot + 0.006*benidorm + 0.006*kyōko + ...
0.048*dutch + 0.034*netherlands + 0.029*amsterdam + 0.019*danish + 0.014*batavia + 0.014*denmark + 0.014*copenhagen + 0.012*rotterdam + 0.012*holland + 0.011*utrecht + 0.010*hague + 0.010*willem + 0.009*haarlem + 0.009*leiden + 0.008*pieter + 0.008*odense + 0.008*hansen + 0.008*cornelis + 0.007*congreve + 0.007*groningen + 0.007*sint + 0.007*hendrik + 0.007*frans + 0.006*lange + 0.006*roughriders + 0.006*rasmus + 0.005*wilhelmina + 0.005*jørgensen + 0.005*roskilde + 0.005*witton + 0.005*eskimos + 0.005*stampeders + 0.005*vries + 0.005*arnhem + 0.005*nijmegen + 0.005*delft + 0.004*johan + 0.004*niels + 0.004*johannes + ...
LDA space: a simplex (in this example, 3 topics)
Jensen-Shannon Distance = √(Jensen-Shannon Divergence) (gives values between 0 and 1 when base-2 logarithms are used)
[Figure: distance scale starting at 0, running from more similar to less similar, with the point 0.21 labelled "similar enough"]
A threshold that defines what is considered similar (found experimentally).
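As a minimal sketch of the metric, the Jensen-Shannon distance between two topic distributions can be computed with the standard library alone. It is the square root of the Jensen-Shannon divergence, and with base-2 logarithms it stays between 0 and 1. The function names and example vectors are illustrative assumptions.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Jensen-Shannon distance: the square root of the Jensen-Shannon divergence."""
    # m is the midpoint distribution; it is non-zero wherever p or q is non-zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(divergence, 0.0))

# Hypothetical 3-topic distributions, i.e. points on the simplex.
cooking = [0.80, 0.15, 0.05]
wine = [0.10, 0.85, 0.05]
print(js_distance(cooking, cooking))  # identical documents
print(js_distance(cooking, wine))     # documents dominated by different topics
```

Two documents that share no topics at all sit at distance 1, and identical distributions sit at distance 0.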
Does the model capture the right aspects of a magazine?
"All models are wrong, but some are useful." (George E. P. Box)
Magazine level: high number of words; noise (ads, editorial stuff, etc.)
What is the distance threshold under which magazines are perceived as similar?
Take this piece of text:
1. Preprocess it. Show me what was removed and what stayed.
2. Get the LDA topic distribution. Show me the topic distribution.
3. Calculate similarity between this page and the rest of the pages. Show me the nearest neighbours, sorted by the distance metric.
Do the neighbours look similar? Where is the distance threshold?
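The neighbour check in step 3 can be sketched as follows, assuming the topic distributions of all pages are already available. The page names, the vectors and the threshold value are hypothetical; the real threshold has to be found by looking at the neighbours.

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance with base-2 logarithms (values between 0 and 1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return math.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0))

def nearest_neighbours(query, pages, threshold):
    """Rank pages by distance to the query and flag those within the threshold."""
    ranked = sorted(
        ((page_id, js_distance(query, dist)) for page_id, dist in pages.items()),
        key=lambda item: item[1],
    )
    return [(page_id, d, d <= threshold) for page_id, d in ranked]

# Hypothetical topic distributions for three pages (3 topics each).
pages = {
    "cooking_page": [0.80, 0.15, 0.05],
    "wine_page": [0.10, 0.85, 0.05],
    "travel_page": [0.05, 0.05, 0.90],
}
query = [0.75, 0.20, 0.05]

# 0.21 is used here purely as an illustrative threshold.
ranked = nearest_neighbours(query, pages, threshold=0.21)
for page_id, distance, similar in ranked:
    print(f"{page_id}: distance={distance:.3f} similar={similar}")
```

Reading down the sorted list and noting where the neighbours stop looking similar is one way to locate the threshold by eye.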
1. preprocess the data
2. train the model
3. score it on a new document
4. evaluate the performance
The text corpus depends on the application domain. It should be contextualised, since the window of context determines which words are considered related. The only observable features for the model are words, so experiment with various stoplists to make sure only the right ones get in. The training corpus can be different from the documents the model will be scored on. A good all-purpose corpus is Wikipedia.
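A minimal preprocessing sketch, showing what was removed and what stayed. The stoplist, the minimum token length and the function name are illustrative assumptions; a real stoplist would be tuned to the domain.

```python
import re

# Hypothetical stoplist, for illustration only.
STOPLIST = {"a", "an", "and", "the", "of", "in", "into", "then", "some", "with", "until"}

def preprocess(text, stoplist=STOPLIST, min_length=3):
    """Lowercase and tokenise the text, then split tokens into kept and removed."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept, removed = [], []
    for token in tokens:
        if token in stoplist or len(token) < min_length:
            removed.append(token)
        else:
            kept.append(token)
    return kept, removed

text = "Beat a little icing sugar into soft cheese, then peel and slice some kiwi fruit."
kept, removed = preprocess(text)
print("kept:   ", kept)
print("removed:", removed)
```

Returning the removed tokens alongside the kept ones makes it easy to compare the effect of different stoplists.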
The key parameter is the number of topics, which again depends on the domain. The other parameters are alpha and beta; you can leave them at their defaults to begin with and tune them later. A good place to start is gensim, a free Python library.
The goal of the model is not to label documents, but rather to give each one a unique fingerprint so that documents can be compared to each other in a humanlike fashion.
Evaluation depends on the application. Use the Jensen-Shannon distance as the similarity metric. Evaluation should show whether the model captures the right aspects compared to a human judge. It will also show what distance threshold is still perceived as similar enough. Use perplexity to check whether your model is representative of the documents you're scoring it on.
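To make the perplexity check concrete, here is a simplified sketch. It assumes the document's topic distribution and the per-topic word distributions are already given, whereas a library would estimate them and report a bound instead. All numbers and names are hypothetical.

```python
import math

def perplexity(doc_word_ids, theta, topic_word):
    """Per-word perplexity of one document under a fixed topic model.

    theta is the document's topic distribution; topic_word[k][w] is the
    probability of word w under topic k.
    """
    log_likelihood = 0.0
    for w in doc_word_ids:
        # Probability of the word, mixing over all topics.
        p_w = sum(theta[k] * topic_word[k][w] for k in range(len(theta)))
        log_likelihood += math.log(p_w)
    return math.exp(-log_likelihood / len(doc_word_ids))

# Hypothetical model: 2 topics over a 4-word vocabulary.
topic_word = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
theta = [0.9, 0.1]

well_matched = perplexity([0, 0, 0, 1], theta, topic_word)
mismatched = perplexity([3, 3, 3, 1], theta, topic_word)
print(well_matched, mismatched)
```

Lower perplexity means the model is less surprised by the document. A model that assigns every word the same probability over a 4-word vocabulary would score exactly 4.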
thank you
Andrius Knispelis, andrius.knispelis@gmail.com