Topic Models (March 6, 2014)
Morten Arngren, Senior Data Scientist
“…YouTube for Publications…”
DOCUMENT CORPUS
🌴 “… Sochi for the Winter Olympics and watch us invade another country…”
🚀 “…Moonbase alpha is under attack from unknown aliens… Arm the Rat Gun!”
🎧 “…Madonna joins Metallica as lead singer and publishes ‘Hits for Kids vol. 57’”
Topic modelling can be many things:
Discover hidden thematic structure (Topics: Travel, Sports, Science, Music etc.?)
Search Engines
Recommendations (Similarity Measure)
LATENT DIRICHLET ALLOCATION
[D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.]
🌴 “… Sochi for the Winter Olympics and watch us invade another country…”
Bag-of-Words model
Assumes interchangeability between words
Textual context not modeled
Generative Process
Graphical Model Representation
Topic Prior (Dirichlet)
θ: Topic distribution of doc. d
Z: Topic of the j’th word in doc. d
W: Word (observed)
Vocabulary distribution
N: No. words in a doc.
D: No. documents
K: No. topics
Parameters
No. topics K pre-defined
Topics inferred from observed documents
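The generative process named on this slide can be sketched in a few lines of plain Python. This is a toy illustration only, not the deck's code: the vocabulary, the prior values, and the helper names (`sample_dirichlet`, `sample_categorical`, `generate_document`) are all assumptions made for the example. The final `Counter` shows the bag-of-words view, where only word counts survive and word order is discarded.

```python
import random
from collections import Counter


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(n_words, alpha, topic_word, vocab, rng):
    """Generate one document: theta, then a topic Z and a word W per position."""
    theta = sample_dirichlet(alpha, rng)            # topic distribution of the doc
    words = []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)          # topic of this word
        w = sample_categorical(topic_word[z], rng)  # observed word
        words.append(vocab[w])
    return theta, words


rng = random.Random(0)
vocab = ["hotel", "beach", "goal", "match", "rocket", "orbit"]  # hypothetical vocabulary
K = 3                                                           # no. topics, pre-defined
alpha = [0.5] * K                                               # assumed symmetric topic prior
# One vocabulary distribution per topic, drawn from an assumed symmetric prior.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(K)]

theta, words = generate_document(20, alpha, topic_word, vocab, rng)
bag_of_words = Counter(words)  # order is lost, only counts remain
print(theta)
print(bag_of_words)
```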
Inference
Posterior distribution of parameters can be inferred via e.g.:
Variational Inference
Gibbs Sampling
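To make the Gibbs sampling option concrete, here is a minimal collapsed Gibbs sampler in plain Python. It is a sketch under stated assumptions, not the method used for the models in this deck: the function name `lda_gibbs`, the symmetric priors `alpha` and `beta`, the iteration count, and the toy corpus of word ids are all hypothetical choices for illustration.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # words in doc d with topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # times word w has topic k
    topic_total = [0] * n_topics                              # words assigned to topic k
    assignments = []

    # Start from a random topic for every word position.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                z = assignments[d][j]
                # Remove the current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights, k=1)[0]
                assignments[d][j] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Estimate each document's topic distribution from the final sample's counts.
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]


# Toy corpus: word ids 0 and 1 co-occur, as do word ids 2 and 3.
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 1, 1, 0, 1], [3, 2, 2, 3, 2]]
theta = lda_gibbs(docs, n_topics=2, vocab_size=4)
for row in theta:
    print([round(p, 2) for p in row])
```

Each returned row is a point in the topic simplex: non-negative entries that sum to one. A single final sample is used here for brevity; averaging over several samples would give a smoother estimate.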
LATENT DIRICHLET ALLOCATION Training
http://radimrehurek.com/gensim/
LDA Wikipedia Training Data (English)
~4.5M Single Articles (Pure Topics)
K = 150 topics (preset parameter)
[Figure: Topic Distribution θ (0 to 1) over topics 1 to 150; example topic labels: hotels, arabic, Australia, history, business, islands, environment, poetic, food, design, arts, plants, animals]
LATENT DIRICHLET ALLOCATION
Dirichlet Distribution or LDA Space
$0 \le \theta_k \le 1 \;\wedge\; \sum_k^K \theta_k = 1$
[Figure: documents (🌴 ✈ 🚀 📚 🎬) placed in the space of topic distributions θ, axes from 0 to 1]
[Figure: the real LDA Space, PC 4+5+6 (Jan 2013)]
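RECOMMENDATIONS Document Similarity
[Figure: Dirichlet Distribution or LDA Space, with documents (✈ 📺 📚 🎬) placed by topic distribution]
Euclidean Distance
$\Delta\theta = \sum_k \left( \theta_k^{(\mathrm{ref})} - \theta_k \right)^2$
Jensen-Shannon Divergence
$\Delta\theta = \sqrt{ \tfrac{1}{2} D_{KL}\left( \theta_k^{(a)} \,\|\, M \right) + \tfrac{1}{2} D_{KL}\left( \theta_k^{(b)} \,\|\, M \right) }$,
where $D_{KL}(P \,\|\, Q) = \sum_i \ln\left( \frac{P_i}{Q_i} \right) P_i \;\wedge\; M = \tfrac{1}{2} \left( \theta_k^{(a)} + \theta_k^{(b)} \right)$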
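Both similarity measures fit in a few lines of plain Python. The sketch below is illustrative only: the function names, the reference vector, and the candidate documents are hypothetical. The first function returns the sum of squared differences (taking its square root would give the usual Euclidean distance), and the second returns the square root of the Jensen-Shannon divergence, using the natural logarithm.

```python
import math


def squared_euclidean(theta_ref, theta):
    """Sum of squared differences between two topic distributions."""
    return sum((r - t) ** 2 for r, t in zip(theta_ref, theta))


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P_i ln(P_i / Q_i), treating 0 ln 0 as 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(theta_a, theta_b):
    """Square root of the Jensen-Shannon divergence between two distributions."""
    m = [0.5 * (a + b) for a, b in zip(theta_a, theta_b)]
    js = 0.5 * kl_divergence(theta_a, m) + 0.5 * kl_divergence(theta_b, m)
    return math.sqrt(max(js, 0.0))  # guard against tiny negative rounding errors


# Hypothetical topic distributions over three topics.
reference = [0.7, 0.2, 0.1]
candidates = {
    "travel_guide": [0.6, 0.3, 0.1],
    "sports_weekly": [0.1, 0.8, 0.1],
    "space_news": [0.05, 0.05, 0.9],
}

# Recommend the candidates closest to the reference document first.
ranked = sorted(candidates, key=lambda name: jensen_shannon(reference, candidates[name]))
print(ranked)
```

Unlike the Kullback-Leibler divergence on its own, the Jensen-Shannon measure is symmetric and stays finite when one distribution has zero entries, because the mixture M is positive wherever either input is.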
RECOMMENDATIONS Document Similarity
[Figure: Dirichlet Distribution or LDA Space, with documents (✈ 📚) and topic regions labelled Toys, Fashion, Travel]