LDA Topic Models: turning words into meaning

Andrius Knispelis


In 2011 I joined the Danish startup issuu (the fastest-growing online publishing platform) as its first Data Scientist. Over the following four years I worked on many interesting things there, and by far the coolest of them was topic modelling. Let me share with you:

What is LDA topic modelling?
Why do you need it?
How do you build it?




[Figure: reading patterns (related, similar), content, placement (where and when); the content shown as a stream of bits]


TROPICAL FRUIT
Serve up something new with...
RECIPES: GLYNIS MCGUINNESS, GREGOR MCMASTER. PHOTOGRAPHS: JONATHAN KENNEDY. STYLING: TAMZIN FERDINANDO, JENNY IGGLEDEN. FOOD STYLING: DENISE SMART, KATE BLINMAN

KIWI FRUIT: Cheesecake layers. Put a few digestive biscuits in a freezer bag and smash with a rolling pin. Beat a little icing sugar into soft cheese, then peel and slice some kiwi fruit. Layer in glasses until full.

MANGO: Spicy mango salad with pork. Peel and stone ripe mango and slice. Mix with sliced red onion, quartered cherry tomatoes, a sliced chilli, chopped coriander and a squeeze of lemon juice. Serve with grilled pork chops or steaks.

PASSION FRUIT: Tropical pavlova. Whip double cream until thick and spoon into meringue nests. Top with mango slices. Halve 2 passion fruits, scoop out the seeds and flesh. Spoon over the meringues.

PINEAPPLE: Rum-flavoured rings. Remove the pineapple ends and peel. Slice thickly and remove the core. Heat a little butter with brown sugar and stir until melted. Add a good splash of dark rum and the pineapple and simmer for 5-10 minutes.

Why not also try...
• Salsa: Peel and dice some kiwi fruits and peeled, stoned avocados. Toss in lime juice, then stir in a little finely chopped shallot and deseeded red chilli. Serve with meat or fish.
• Kiwi & chicken wraps
• Ice cream topping: Simply…
• Rice salad: Cook long…

what is it about? what is it related to? what does it feel like? what does it mean?




knowledge about the world, at the right level of abstraction

Knowledge about the world: millions of articles in Wikipedia, capturing the widest range of topics.

word: use the right words
topic: capture the widest range of topics
context: set the right “window” of context


How to…

preprocess the data?
train the model?
score it on a new document?
evaluate the performance?

gensim: a topic modelling framework, free Python library

LDA, hierarchical LDA, dynamic LDA, DeepLearning, Word2Vec, Doc2Vec, POS, …


[Figure: Wikipedia as a collection of articles, each a title followed by its words]

Setting the right “window” of context

(a) Define the minimum number of words that must be present in an article. Recommended: 100 to 300.


Setting the right “window” of context

(b) Skip articles whose titles start with these namespaces:

Namespace     Count
Wikipedia:    1,167,766
Category:       907,811
File:           892,147
Portal:         571,248
Template:       128,603
MediaWiki:        8,893
User:             4,815
Help:             2,324
Book:             1,505
Draft:              915
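The two article filters, (a) a minimum word count and (b) a namespace prefix check, can be sketched in a few lines of plain Python. This is only an illustration: the function name `keep_article`, the default of 200 words (a value picked from the recommended 100 to 300 range) and the toy articles are assumptions, not part of the original pipeline.

```python
# Namespace prefixes taken from the slide; titles starting with them are skipped.
SKIPPED_PREFIXES = (
    "Wikipedia:", "Category:", "File:", "Portal:", "Template:",
    "MediaWiki:", "User:", "Help:", "Book:", "Draft:",
)


def keep_article(title, text, min_words=200):
    """Return True if the article should be used for training.

    min_words=200 is an assumed value inside the recommended 100-300 range.
    """
    if title.startswith(SKIPPED_PREFIXES):
        return False  # (b) not a regular content article
    return len(text.split()) >= min_words  # (a) long enough to give context


# Toy input (made up for illustration).
articles = [
    ("Kiwifruit", "kiwi " * 250),
    ("Category:Fruit", "fruit " * 250),
    ("Mango", "short stub"),
]
kept = [title for title, text in articles if keep_article(title, text)]
print(kept)
```

Counting whitespace-separated tokens is the simplest possible notion of "words"; a real pipeline would more likely count tokens after cleaning the wiki markup.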


Let the right words in

Word length:
1: i
2: do, be, am, …
3: ice, was, who, …
16: videoconferences, …
17: superbillionaires, …
18: intellectualization, …

Stoplists: general terms; last names, first names; countries, cities

Lemmatization: am, are, is = be

Parts of speech:
NN - noun (computer, car, cake, …)
VB - verb (play, install, commit, …)
RB - adverb (today, quickly, patiently, …)
JJ - adjective (red, awesome, big, …)
IN - preposition (of, about, from, …)

Document frequency:
remove a word if it appears in more than 10% of the articles
remove a word if it appears in fewer than 20 articles

Vocabulary size: keep the top n words (recommended: 50,000 to 100,000) and discard the rest.
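The word filters above can be combined into one vocabulary-building pass. The sketch below uses only the standard library; the function name `build_vocabulary` and the length bounds `min_len=4` and `max_len=15` are assumptions made for illustration (the slide only lists example words of length 1 to 3 and 16 to 18). The document-frequency thresholds and the top-n cut follow the slide. gensim's `Dictionary.filter_extremes(no_below, no_above, keep_n)` applies the same three frequency rules.

```python
from collections import Counter


def build_vocabulary(docs, min_len=4, max_len=15, stoplist=frozenset(),
                     no_below=20, no_above=0.10, keep_n=100_000):
    """Pick the words to keep; docs is a list of token lists.

    min_len and max_len are assumed bounds, not values from the slides.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document

    candidates = [
        (word, df) for word, df in doc_freq.items()
        if min_len <= len(word) <= max_len   # word length
        and word not in stoplist             # stoplists
        and df >= no_below                   # too rare
        and df <= no_above * n_docs          # too common
    ]
    # Keep the top n words by document frequency, discard the rest.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in candidates[:keep_n]]


# Toy corpus of 100 documents (made up for illustration).
docs = []
for i in range(100):
    tokens = ["common"]
    if i < 9:
        tokens.append("mango")
    if i < 8:
        tokens.append("ice")
    if i < 6:
        tokens.append("chef")
    if i < 2:
        tokens.append("rare")
    docs.append(tokens)

vocab = build_vocabulary(docs, no_below=5)  # lower threshold for the toy corpus
print(vocab)
```

Lemmatization and part-of-speech filtering are left out here; they would run on the tokens before this step.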


LATENT DIRICHLET ALLOCATION

A topic model developed by David Blei, Andrew Ng and Michael Jordan in 2003. It tells us what topics are present in any given document by observing all the words in it and producing a topic distribution.

Input files: tfidf.mm (documents by words) and wordids.txt.

Θ (context): the topic distribution for document i
Z (topic): the topic for the j’th word in document i
W (word): the observed words in document i
N words, M documents


Let’s assume that…

…there are topics (themes, …), and each topic is a weighted list of words:

topic#1: P * word, P * word, P * word, ….
topic#2: P * word, P * word, P * word, ….
topic#3: P * word, P * word, P * word, ….

…and a recipe: topic#1 50%, topic#2 30%, topic#3 20%.

Take this recipe and generate a document based on the model’s “rules”.

What really happens…

We start from the words. Take this collection of documents and learn a model (topic#1, topic#2, …, topic#N) that describes it best, given these model parameters:

how many topics?
how are those topics assigned to a document?

Words appearing in the same context (document) are related.
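The "recipe" direction of the model, going from topics to a document, is easy to simulate. The sketch below assumes three tiny made-up topics and the 50% / 30% / 20% recipe from the slide; the words, their weights and the function name `generate_document` are illustrative only. Learning runs this story in reverse: given only the documents, find the topics and recipes that explain them best.

```python
import random

# Made-up topics: each one is a probability distribution over words.
topics = {
    "topic#1": {"recipe": 0.5, "chef": 0.3, "sauce": 0.2},
    "topic#2": {"beach": 0.6, "island": 0.4},
    "topic#3": {"price": 0.7, "market": 0.3},
}

# The recipe from the slide: a probability distribution over topics.
recipe = {"topic#1": 0.5, "topic#2": 0.3, "topic#3": 0.2}


def generate_document(recipe, topics, n_words, rng):
    """For every word position, draw a topic, then draw a word from it."""
    topic_names = list(recipe)
    topic_weights = [recipe[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        word_dist = topics[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        words.append(word)
    return words


document = generate_document(recipe, topics, n_words=50, rng=random.Random(0))
print(" ".join(document))
```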


LATENT DIRICHLET ALLOCATION

A topic model developed by David Blei, Andrew Ng and Michael Jordan in 2003. It tells us what topics are present in any given document by observing all the words in it and producing a topic distribution.

Files on the slide: tfidf.mm (documents), wordids.txt (words), model.lda (topics)

α  a parameter that sets the prior on the per-document topic distributions
β  a parameter that sets the prior on the per-topic word distributions
Θ  the topic distribution for document i
Z  the topic for the j'th word in document i
W  the observed words in document i
N words, M documents
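The roles of α, β, Θ, Z and W are easiest to see by running the model forwards. What follows is a minimal sketch of the generative story only, not a training routine. The vocabulary, the sizes and the values of alpha and beta are made up for illustration.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]

def generate_documents(vocab, num_topics, num_docs, words_per_doc, alpha, beta, seed=0):
    """Sample documents from the LDA generative story (illustrative only)."""
    rng = random.Random(seed)
    # One word distribution per topic; beta is the prior on these.
    topic_word = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    thetas, docs = [], []
    for _ in range(num_docs):                                  # M documents
        theta = sample_dirichlet([alpha] * num_topics, rng)    # Θ: topic distribution of this document
        words = []
        for _ in range(words_per_doc):                         # N words
            z = rng.choices(range(num_topics), weights=theta)[0]        # Z: topic of this word
            words.append(rng.choices(vocab, weights=topic_word[z])[0])  # W: the observed word
        thetas.append(theta)
        docs.append(words)
    return topic_word, thetas, docs

# Hypothetical vocabulary and settings, chosen only to keep the output small.
vocab = ["recipe", "chef", "sauce", "wine", "grape", "vineyard"]
topic_word, thetas, docs = generate_documents(
    vocab, num_topics=2, num_docs=3, words_per_doc=8, alpha=0.5, beta=0.5
)
for theta, words in zip(thetas, docs):
    print([round(p, 2) for p in theta], words)
```

Training goes the other way: only the words W are observed, and Θ, Z and the per-topic word distributions are inferred from them.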


How many topics (dimensions)?

Figure: PERCEPTION, a combination of top-down and bottom-up processing. Diagram labels: context, word, topic, meaning thresholds, dimensions, features, spaces, gestalts.


A document is a probability distribution over topics.
A topic is a probability distribution over words.
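Put together, the two distributions give the probability of seeing a word in a document: sum over topics of p(topic | document) times p(word | topic). A tiny sketch with made-up numbers (the topic names and probabilities are hypothetical):

```python
# Hypothetical numbers, chosen only for illustration.
doc_topics = {"food": 0.7, "wine": 0.3}            # the document: a distribution over topics
topic_words = {                                    # each topic: a distribution over words
    "food": {"recipe": 0.5, "chef": 0.4, "grape": 0.1},
    "wine": {"recipe": 0.1, "chef": 0.1, "grape": 0.8},
}

def word_probability(word, doc_topics, topic_words):
    """p(word | document) as a topic-weighted mixture of word distributions."""
    return sum(p_topic * topic_words[topic].get(word, 0.0)
               for topic, p_topic in doc_topics.items())

doc_words = {w: word_probability(w, doc_topics, topic_words)
             for w in ["recipe", "chef", "grape"]}
print(doc_words)
```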




Figure: a document drawn as a pattern across numbered LDA topics (1, 2, 3, ... 247, 248, 249, 250).

Each document gets represented as a pattern of LDA topics, making every document appear different enough to be separable and similar enough to be grouped.


DNA


Three example topics (labelled #143, #270 and #81 on the slide), each a weighted list of words:

0.019*recipes + 0.017*chef + 0.017*peanut + 0.016*cuisine + 0.015*cooking + 0.015*meat + 0.015*restaurant + 0.015*dish + 0.014*cookery + 0.014*vegetables + 0.014*dishes + 0.012*rice + 0.012*chicken + 0.011*sauce + 0.010*fried + 0.010*beef + 0.009*chefs + 0.009*peanuts + 0.009*bean + 0.009*pork + 0.008*culinary + 0.008*restaurants + 0.008*cucumber + 0.008*recipe + 0.007*kitchen + 0.007*pepper + 0.007*melon + 0.007*ingredients + 0.007*eaten + 0.007*cooked + 0.007*cook + 0.006*potato + 0.006*soup + 0.006*cooks + 0.006*coconut + 0.005*onion + 0.005*meal + 0.005*sausage + 0.005*cabbage + 0.005*anise + 0.005*potatoes + ...

0.057*wine + 0.056*plantings + 0.030*wines + 0.024*vineyard + 0.020*grape + 0.020*winery + 0.016*peaches + 0.016*vineyards + 0.015*grapes + 0.012*cabernet + 0.012*pinot + 0.012*vine + 0.012*napa + 0.011*blanc + 0.011*velvety + 0.010*mourad + 0.010*magie + 0.010*sauvignon + 0.010*trophic + 0.009*approachable + 0.009*neda + 0.009*vines + 0.009*gall + 0.009*bano + 0.008*powdery + 0.008*degraw + 0.007*kimiko + 0.007*viticulture + 0.007*dagupan + 0.007*noir + 0.006*haridas + 0.006*aphid + 0.006*mccray + 0.006*chardonnay + 0.006*osmotic + 0.006*tasting + 0.006*merlot + 0.006*benidorm + 0.006*kyōko + ...

0.048*dutch + 0.034*netherlands + 0.029*amsterdam + 0.019*danish + 0.014*batavia + 0.014*denmark + 0.014*copenhagen + 0.012*rotterdam + 0.012*holland + 0.011*utrecht + 0.010*hague + 0.010*willem + 0.009*haarlem + 0.009*leiden + 0.008*pieter + 0.008*odense + 0.008*hansen + 0.008*cornelis + 0.007*congreve + 0.007*groningen + 0.007*sint + 0.007*hendrik + 0.007*frans + 0.006*lange + 0.006*roughriders + 0.006*rasmus + 0.005*wilhelmina + 0.005*jørgensen + 0.005*roskilde + 0.005*witton + 0.005*eskimos + 0.005*stampeders + 0.005*vries + 0.005*arnhem + 0.005*nijmegen + 0.005*delft + 0.004*johan + 0.004*niels + 0.004*johannes + ...



LDA space
A simplex; in this example, 3 topics.

similar enough: a threshold that defines what is considered similar (found experimentally)

√(Jensen-Shannon Divergence) = Jensen-Shannon Distance (gives values between 0 and 1 when the divergence uses base-2 logarithms)

Values marked on the slide: 0 and 0.21, along a scale from more similar to less similar.
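A minimal sketch of the distance in plain Python, assuming both inputs are topic distributions of the same length that each sum to 1. The example distributions and the function names are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the Jensen-Shannon divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]          # midpoint distribution
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))               # guard against tiny negative rounding

def is_similar(p, q, threshold):
    """True when two documents fall within the experimentally found threshold."""
    return js_distance(p, q) <= threshold

# Hypothetical 3-topic distributions (points on the simplex).
cooking_a = [0.7, 0.2, 0.1]
cooking_b = [0.6, 0.3, 0.1]
travel = [0.1, 0.1, 0.8]
print(js_distance(cooking_a, cooking_b), js_distance(cooking_a, travel))
```

With base-2 logarithms, identical distributions give 0 and distributions with no topics in common give 1.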

Does the model capture the right aspects of a magazine?

"All models are wrong, but some are useful." George E. P. Box

Magazine level: high number of words; noise (ads, editorial stuff, etc.)

What is the distance threshold under which magazines are perceived as similar?


Take this piece of text:
1. Preprocess it. Show me what was removed and what stayed.
2. Get the LDA topic distribution. Show me the topic distribution.
3. Calculate similarity between this page and the rest of the pages. Show me the nearest neighbours, sorted by the distance metric.

Do the neighbours look similar? Where is the distance threshold?
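Step 3 can be sketched as a ranking by Jensen-Shannon distance. This is a sketch under assumptions: the page names and topic distributions are invented, and the 0.21 threshold is only a placeholder taken from the value marked on the earlier slide, not a recommended setting.

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance with base-2 logs, so results lie in [0, 1]."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0))

def nearest_neighbours(query, pages, threshold):
    """Return (name, distance, within_threshold) tuples, nearest first."""
    ranked = sorted((js_distance(query, dist), name) for name, dist in pages.items())
    return [(name, round(d, 3), d <= threshold) for d, name in ranked]

# Hypothetical topic distributions for the page under inspection and three others.
query = [0.8, 0.1, 0.1]
pages = {
    "cooking": [0.7, 0.2, 0.1],
    "wine": [0.1, 0.8, 0.1],
    "travel": [0.3, 0.1, 0.6],
}
results = nearest_neighbours(query, pages, threshold=0.21)
for name, distance, within in results:
    print(name, distance, "similar" if within else "not similar")
```

Reading the sorted list next to the actual pages is what answers the two questions: the threshold sits where the neighbours stop looking similar to a human.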



1. preprocess the data
2. train the model
3. score it on new documents
4. evaluate the performance

Preprocess. The text corpus depends on the application domain. It should be contextualised, since the window of context determines which words are considered related. The only observable features for the model are words, so experiment with various stoplists to make sure only the right ones get in. The training corpus can be different from the documents the model will be scored on. A good all-purpose corpus is Wikipedia.
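A minimal sketch of stoplist filtering that reports both what stayed and what was removed. The stoplist and the minimum token length are hypothetical; a real stoplist is tuned per domain.

```python
import re

# Hypothetical stoplist for illustration only.
STOPLIST = {"a", "and", "the", "with", "of", "in", "until", "then", "some", "into"}

def preprocess(text, stoplist=STOPLIST, min_length=3):
    """Lowercase, tokenise on letters, and split tokens into kept and removed."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept, removed = [], []
    for token in tokens:
        if token in stoplist or len(token) < min_length:
            removed.append(token)
        else:
            kept.append(token)
    return kept, removed

kept, removed = preprocess(
    "Beat a little icing sugar into soft cheese, then peel and slice some kiwi fruit."
)
print("kept:   ", kept)
print("removed:", removed)
```

Printing both lists makes it easy to spot a stoplist that is too aggressive or too lenient.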

Train. The key parameter is the number of topics, which again depends on the domain. The other parameters are alpha and beta; you can leave them at their defaults to begin with and tune them later. A good place to start is gensim, a free Python library.

Score. The goal of the model is not to label documents, but rather to give each one a unique fingerprint so that they can be compared to each other in a humanlike fashion.

Evaluate. Evaluation depends on the application. Use Jensen-Shannon distance as the similarity metric. Evaluation should show whether the model captures the right aspects compared to a human, and which distance threshold is still perceived as similar enough. Use perplexity to check whether your model is representative of the documents you're scoring it on.
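Perplexity is the exponential of the negative average log-probability per word, so lower means the model is less surprised by the documents. A minimal sketch, assuming the topic distribution of each document is already known (a real evaluation would infer it on held-out text). All numbers and the probability floor are made up for illustration.

```python
import math

def perplexity(documents, doc_topics, topic_words, floor=1e-12):
    """exp(-average log p(word | document)) over every word in the documents."""
    log_likelihood, num_words = 0.0, 0
    for words, theta in zip(documents, doc_topics):
        for word in words:
            p_word = sum(p_topic * topic_words[k].get(word, 0.0)
                         for k, p_topic in enumerate(theta))
            # floor is an assumption of this sketch: it avoids log(0) for unseen words
            log_likelihood += math.log(max(p_word, floor))
            num_words += 1
    return math.exp(-log_likelihood / num_words)

documents = [["a", "a", "b"]]
doc_topics = [[1.0]]                                          # a single topic, for simplicity
uniform = [{"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}]      # knows nothing about the text
fitted = [{"a": 0.6, "b": 0.3, "c": 0.05, "d": 0.05}]         # matches the text better

uniform_ppl = perplexity(documents, doc_topics, uniform)
fitted_ppl = perplexity(documents, doc_topics, fitted)
print(uniform_ppl, fitted_ppl)
```

In gensim, `LdaModel.log_perplexity` reports a per-word likelihood bound on a chunk of documents, which serves the same purpose.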



thank you

Andrius Knispelis
andrius.knispelis@gmail.com

