Topic Models at Issuu


Topic Models (March 6, 2014)

Morten Arngren, Senior Data Scientist


Issuu: “…YouTube for Publications…”


DOCUMENT CORPUS

🌴 “… Sochi for the Winter Olympics and watch us invade another country…”

🚀 “…Moonbase alpha is under attack from unknown aliens…Arm the Rat Gun!”

🎧 “…Madonna joins Metallica as lead singer and publishes ‘Hits for Kids vol. 57’”

Topic modelling can be many things:

Discover hidden thematic structure
Topics: Travel, Sports, Science, Music etc.?

Search Engines

Recommendations (Similarity Measure)


LATENT DIRICHLET ALLOCATION

🌴 “… Sochi for the Winter Olympics and watch us invade another country…”

D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.

Bag-of-Words model

Assumes interchangeability between words.
Textual context not modeled.
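
The interchangeability assumption can be illustrated with a minimal sketch: two texts with the same words in a different order give the same bag. The tokenizer and the function name `bag_of_words` are illustrative assumptions, not part of the talk.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return word counts for `text`; word order is discarded."""
    # Assumption: a very simple lowercase tokenizer is good enough here.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


doc_a = "Sochi for the Winter Olympics"
doc_b = "Olympics Winter the for Sochi"  # same words, shuffled

# Both orderings map to the same bag, so textual context is not modeled.
print(bag_of_words(doc_a) == bag_of_words(doc_b))
```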

Parameters
No. topics K pre-defined
Topics inferred from observed documents

Generative Process: Graphical Model Representation

[Figure: plate diagram of the LDA graphical model, with the labels below.]
α: Topic prior (Dirichlet)
θ: Topic distribution of doc. d
Z: Topic of the j'th word in doc. d
W: Word (observed)
Vocabulary distribution (one per topic)
K: No. topics
N: No. words in a doc.
D: No. documents
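
The generative process behind the graphical model can be sketched in a few lines. This is a toy illustration under stated assumptions: the vocabulary, the two topics, the prior values and all function names are hypothetical, and the vocabulary distributions are fixed by hand rather than drawn from a prior.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalising Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(alpha, topic_word, vocab, n_words, rng):
    """Generate one document following the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)            # topic distribution of doc. d
    topics, words = [], []
    for _ in range(n_words):                        # N words in the doc.
        z = sample_categorical(theta, rng)          # topic of the j'th word
        w = sample_categorical(topic_word[z], rng)  # observed word
        topics.append(z)
        words.append(vocab[w])
    return theta, topics, words


rng = random.Random(0)
vocab = ["olympics", "sochi", "aliens", "moonbase"]  # toy vocabulary (assumption)
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # hypothetical "sports" topic
    [0.0, 0.0, 0.5, 0.5],  # hypothetical "sci-fi" topic
]
theta, topics, words = generate_document([0.5, 0.5], topic_word, vocab, 8, rng)
print(theta, words)
```

Inference runs this process in reverse: only the words are observed, and the topic distributions and vocabulary distributions have to be estimated from them.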

Inference
Posterior distribution of parameters can be inferred via e.g.:
Variational Inference
Gibbs Sampling
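
Of the two inference routes named above, Gibbs sampling is the easier one to sketch. The block below is a minimal collapsed Gibbs sampler on a toy corpus. The collapsed variant, the symmetric priors `alpha` and `beta`, their values, the iteration count and the function name `gibbs_lda` are all assumptions made for illustration; the talk does not say which sampler or settings were used.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # counts per doc, topic
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # counts per topic, word
    topic_total = [0] * n_topics                              # tokens per topic
    assignments = []

    # Give every word token a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                z = assignments[d][j]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalised conditional p(z = k | all other assignments).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights, k=1)[0]
                assignments[d][j] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed estimate of each document's topic distribution.
    thetas = []
    for d, doc in enumerate(docs):
        denom = len(doc) + n_topics * alpha
        thetas.append([(doc_topic[d][k] + alpha) / denom for k in range(n_topics)])
    return thetas


# Toy corpus (assumption): word ids 0-1 co-occur in some docs, ids 2-3 in others.
docs = [[0, 1, 0, 1, 0], [1, 0, 1, 0], [2, 3, 2, 3, 3], [3, 2, 2, 3]]
thetas = gibbs_lda(docs, n_topics=2, vocab_size=4)
for theta in thetas:
    print([round(t, 2) for t in theta])
```

Each returned vector sums to one, so it can be used directly as a document's topic distribution.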


LATENT DIRICHLET ALLOCATION Training

http://radimrehurek.com/gensim/

Wikipedia Training Data (English): ~4.5M single articles (pure topics)

LDA: K = 150 topics (preset parameter)

[Figure: topic distribution θ for an example document (🌴) over topics 1 to 150, with values between 0 and 1. Topic labels shown: hotels, arabic, Australia, history, business, islands, environment, poetic, food, design, arts, plants, animals.]


LATENT DIRICHLET ALLOCATION

Dirichlet Distribution or LDA Space

\[ 0 \le \theta_k \le 1 \quad \wedge \quad \sum_{k=1}^{K} \theta_k = 1 \]

[Figure: example documents (🌴 ✈ 🚀 📚 🎬) placed as points in the LDA space.]

[Figure: the real LDA space, PC 4+5+6 (Jan 2013).]


RECOMMENDATIONS Document Similarity

[Figure: documents (✈ 📺 📚 🎬) as points in the Dirichlet distribution or LDA space.]

Euclidean Distance

\[ \Delta\theta = \sum_k \left( \theta_k^{(\mathrm{ref})} - \theta_k \right)^2 \]
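
As a minimal sketch, the Euclidean measure can be computed as the sum of squared differences between a reference document's topic vector and a candidate's. Taking a square root would not change the ranking. The topic vectors, document names and function name below are made up for illustration.

```python
def squared_euclidean(theta_ref, theta):
    """Sum over topics k of (theta_ref[k] - theta[k]) ** 2."""
    if len(theta_ref) != len(theta):
        raise ValueError("topic vectors must have the same length")
    return sum((r - t) ** 2 for r, t in zip(theta_ref, theta))


# Hypothetical 3-topic distributions; real ones would have K = 150 entries.
reference = [0.7, 0.2, 0.1]
candidates = {
    "travel_mag": [0.6, 0.3, 0.1],
    "scifi_novel": [0.1, 0.1, 0.8],
}

# Recommend the candidates closest to the reference document first.
ranked = sorted(candidates, key=lambda name: squared_euclidean(reference, candidates[name]))
print(ranked)
```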

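Jensen-Shannon Divergence

\[ \Delta\theta = \sqrt{ \tfrac{1}{2} D_{KL}\left(\theta_k^{(a)} \,\|\, M\right) + \tfrac{1}{2} D_{KL}\left(\theta_k^{(b)} \,\|\, M\right) }, \]

where

\[ D_{KL}(P \,\|\, Q) = \sum_i \ln\left(\frac{P_i}{Q_i}\right) P_i \quad \wedge \quad M = \tfrac{1}{2}\left(\theta_k^{(a)} + \theta_k^{(b)}\right) \]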
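
A minimal sketch of this measure for two topic distributions follows. It assumes both inputs are valid probability vectors of equal length and treats terms with P_i = 0 as contributing zero. The function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P_i * ln(P_i / Q_i); terms with P_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(theta_a, theta_b):
    """Square root of the Jensen-Shannon divergence between two distributions."""
    # M is the midpoint of the two distributions.
    m = [0.5 * (a + b) for a, b in zip(theta_a, theta_b)]
    jsd = 0.5 * kl_divergence(theta_a, m) + 0.5 * kl_divergence(theta_b, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(jsd, 0.0))


# Hypothetical 3-topic distributions.
theta_a = [0.7, 0.2, 0.1]
theta_b = [0.1, 0.1, 0.8]
print(jensen_shannon_distance(theta_a, theta_b))
```

Unlike the plain Kullback-Leibler divergence, this measure is symmetric in its two arguments and stays finite when one distribution has zeros where the other does not, because M is positive wherever either input is.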


RECOMMENDATIONS Document Similarity

Dirichlet Distribution or LDA Space

[Figure: documents (✈ 📚) in the LDA space, with regions labelled Toys, Fashion and Travel.]



