Topic Models, Recommendations
Morten Arngren, Senior Data Scientist

About
Topic Modelling
Recommendations

“…YouTube for Publications…”
Started in 2006 by 5 dudes.
2013:
15M publications (free)
340M pages (25 km²)
7.5B page views / month
83M unique visitors / month
Data Science Team (Copenhagen)
Morten Arngren
Ph.D. in Machine Learning and AI (2011), M.Sc.A.M. (2007), B.Sc.E.E. (1997)
ISSUU, Data Scientist (2011 - present)
DTU & FOSS Analytical, Machine Learning in Food Quality (2008-2011)
Nokia Mobile Phones, Digital Signal Processing (2000-2007)
Alcatel Space Denmark, Building Rockets (1997-2000)

Andrius Butkus
Ph.D. in Digital Media Personalisation (2009), M.Sc.E.E. (2004), B.Sc.E.E. (2002)
ISSUU, Data Scientist (2011 - present)
DTU External Lecturer, Human Computer Interaction (2010 - present)
DTU Assistant Professor, Digital Media Engineering (2008-2010)

ML Gadgets
12x 2.6GHz, 96GB RAM, 2TB SSD, 2TB hard drive
Amazon Web Services

Data
Content
40k Pubs / Day
Page image
OCR, text
Detect Language: Explicit Detection (56)
Translate to English (from 24 languages)
LDA, Topics
Layout (quantify text and image boxes)
Cover Analysis
Article Extraction
Doc. Type Classification
DB
Reader Activity
“Birdie Nam Nam”
Sessions 1, 2, ... N over time
200GB / Day
DB
Topic Modelling
LATENT DIRICHLET ALLOCATION
Topic model based on Bag-of-Words Data
[D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.]
http://radimrehurek.com/gensim/
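Since the topic model works on bag-of-words data, a minimal sketch of that representation may help. It uses only the Python standard library; the tokenizer, the function names and the two sample sentences are assumptions for illustration, not the pipeline used at Issuu. The sorted (token_id, count) pairs are similar in shape to the sparse vectors gensim corpora use.

```python
from collections import Counter
import re


def bag_of_words(text):
    """Lowercase the text, keep alphabetic tokens, and count each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


def to_sparse_vector(counts, vocabulary):
    """Map token counts to sorted (token_id, count) pairs.

    Tokens missing from the vocabulary are dropped.
    """
    pairs = [(vocabulary[t], c) for t, c in counts.items() if t in vocabulary]
    return sorted(pairs)


# Hypothetical two-document corpus, for illustration only.
docs = ["Hotels on the islands", "Plants and animals of the islands"]
counts = [bag_of_words(d) for d in docs]

# Assign ids in alphabetical order so the mapping is reproducible.
vocabulary = {tok: i for i, tok in enumerate(sorted(set().union(*counts)))}
vectors = [to_sparse_vector(c, vocabulary) for c in counts]
```

Word order is discarded on purpose: only the counts per token reach the topic model.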
LDA Wikipedia Training Data
150 topics (preset parameter)
~4.5M Single Articles (Pure Topics)
Topic Distribution (topics 1 to 150): hotels, arabic, Australia, history, business, islands, environment, poetic, food, design, arts, plants, animals
LATENT DIRICHLET ALLOCATION
Issuu Publications
Properties: [0:1] ∧ Σ=1
The real LDA Space (PC 4, 5+)
TOPIC CATEGORIES (Learning from Wikipedia Dataset)
~9 Mio., ~4.5 Mio.
Travel, Botanics, Cocktails, Chemistry, Drinks
0.5 Travel, 0.4 Sports, 0.1 Dancing
Density distribution not the same
Empty locations in LDA space.
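A publication is described as a mixture of topics, for example 0.5 Travel, 0.4 Sports, 0.1 Dancing, with every weight in [0:1] and the weights summing to 1. The sketch below checks those two properties and ranks the topics. The function names and the raw scores are assumptions for illustration; the slides do not show how the weights are computed or stored.

```python
import math


def is_topic_distribution(weights, tol=1e-9):
    """Check that every weight lies in [0, 1] and that the weights sum to 1."""
    values = list(weights.values())
    in_range = all(0.0 <= w <= 1.0 for w in values)
    return in_range and math.isclose(sum(values), 1.0, abs_tol=tol)


def normalize(raw_scores):
    """Rescale non-negative raw scores so that they sum to 1."""
    if any(s < 0 for s in raw_scores.values()):
        raise ValueError("raw scores must be non-negative")
    total = sum(raw_scores.values())
    if total <= 0:
        raise ValueError("at least one score must be positive")
    return {topic: s / total for topic, s in raw_scores.items()}


def top_topics(weights, k=2):
    """Return the k heaviest topics as (topic, weight) pairs, heaviest first."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


# The example mixture from the slide.
doc = {"Travel": 0.5, "Sports": 0.4, "Dancing": 0.1}
# Hypothetical raw scores that normalize to the same mixture.
raw = {"Travel": 5, "Sports": 4, "Dancing": 1}
```

Calling `top_topics(doc, 1)` picks Travel as the dominant category for this mixture.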
Recommendation System
READER ACTIVITY
“Birdie Nam Nam”
Sessions 1, 2 over time
No Explicit Rating… Extract Implicit Rating…?
Readers
Publications
Browsing or Reading?
Session {
  UserName: ‘Birdie-Nam-Nam’
  DocID: xxx-xxxxx
  Pages:
    1: [250, 725, 569, 134, ...]
    2: [1056, 1259, ...]
    3: [1056, 1259, ...]
    4: [102, 356, 208, 438]
    5: [102, 356, 208, 438]
    6: [5250, 3567, 809]
    7: [5250, 3567, 809]
    ...
  TimeStamp: 1378935850
  DocID: yyy-yyyyy
  ...
}
Extracted:
  Pages: [1,2,3,6,7]
  ReadTime: 25789 ms.
  TimeStamp: 1378935850
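One possible way to turn the raw session into the extracted form is sketched below. It assumes that each list holds time samples in milliseconds for one page, and that a page counts as read once its total time passes a threshold. The 1500 ms threshold and the function names are hypothetical; the slides do not state the actual rule.

```python
def page_dwell_ms(page_events):
    """Sum the millisecond samples logged for each page."""
    return {page: sum(samples) for page, samples in page_events.items()}


def implicit_rating(page_events, min_read_ms=1500):
    """Keep pages whose total time reaches min_read_ms (assumed threshold).

    Returns the read pages in ascending order and their combined read time.
    """
    dwell = page_dwell_ms(page_events)
    pages = sorted(p for p, ms in dwell.items() if ms >= min_read_ms)
    return {"Pages": pages, "ReadTime": sum(dwell[p] for p in pages)}


# Only the samples printed on the slide; values elided with "..." are left out.
session_pages = {
    1: [250, 725, 569, 134],
    2: [1056, 1259],
    3: [1056, 1259],
    4: [102, 356, 208, 438],
    5: [102, 356, 208, 438],
    6: [5250, 3567, 809],
    7: [5250, 3567, 809],
}

rating = implicit_rating(session_pages)
```

With this threshold, pages 4 and 5 are treated as browsing and the remaining pages match the slide's list [1,2,3,6,7]. The summed read time does not reproduce the slide's 25789 ms, because the elided samples are missing from the input.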
Reader indexed learning
Decay function (time in weeks, decay per week)
Pages: [1,6,7,10,11] ReadTime: 11250 ms. TimeStamp: 1385437850
Item2Item Matrix (example entries: 1065, 850, 2508, 1150, 3690, 9860, 5685)
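The slide names a decay function measured in weeks and an Item2Item matrix learned per reader, without giving formulas. The sketch below is one plausible reading, not the method used at Issuu: read time is multiplied by `decay_per_week` raised to the session age in weeks, and every pair of publications read by the same reader gains the smaller of the two decayed weights. The decay rate, the pairing rule and all function names are assumptions.

```python
from collections import defaultdict
from itertools import combinations

SECONDS_PER_WEEK = 7 * 24 * 3600


def decayed_weight(read_time_ms, timestamp, now, decay_per_week=0.9):
    """Shrink a read time by decay_per_week for every week of age (assumed form)."""
    weeks = max(0.0, (now - timestamp) / SECONDS_PER_WEEK)
    return read_time_ms * decay_per_week ** weeks


def build_item2item(reader_histories, now, decay_per_week=0.9):
    """Accumulate a symmetric item-to-item score, one reader at a time."""
    matrix = defaultdict(float)
    for sessions in reader_histories.values():
        # Total decayed read time per publication for this reader.
        weights = defaultdict(float)
        for s in sessions:
            weights[s["DocID"]] += decayed_weight(
                s["ReadTime"], s["TimeStamp"], now, decay_per_week
            )
        # Hypothetical pairing rule: a pair scores the weaker of its two weights.
        for a, b in combinations(sorted(weights), 2):
            score = min(weights[a], weights[b])
            matrix[(a, b)] += score
            matrix[(b, a)] += score
    return dict(matrix)


# Hypothetical history: one reader, two publications, two weeks apart.
now = 1385437850
histories = {
    "Birdie-Nam-Nam": [
        {"DocID": "doc-a", "ReadTime": 11250, "TimeStamp": now},
        {"DocID": "doc-b", "ReadTime": 1000,
         "TimeStamp": now - 2 * SECONDS_PER_WEEK},
    ]
}
item2item = build_item2item(histories, now, decay_per_week=0.5)
```

Indexing the loop by reader keeps each update local to one reading history, so new sessions can be folded in without revisiting other readers.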
RECOMMENDING
Item2Item Matrix (example entries: 1065, 850, 2508, 1150)
Read History, Stacks, Likes (over time)
Item Matrix Weight Mapping Function
RECOMMENDING
Item Weights
Weighted Sampling
Max. Rank
Tuned Parameters
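Weighted sampling over item weights can be sketched as follows, using only the standard library. The draw is without replacement and proportional to the remaining weights. Reading "Max. Rank" as a cut-off that keeps only the heaviest candidates before sampling is an assumption, as are the function name and the example weights.

```python
import random


def weighted_sample(item_weights, k, max_rank=None, rng=None):
    """Draw up to k distinct items, each pick proportional to remaining weight.

    max_rank (assumed meaning): keep only the max_rank heaviest items
    as candidates before sampling.
    """
    rng = rng or random.Random()
    ranked = sorted(item_weights.items(), key=lambda kv: kv[1], reverse=True)
    if max_rank is not None:
        ranked = ranked[:max_rank]
    # Items with zero or negative weight can never be drawn.
    pool = {item: w for item, w in ranked if w > 0}
    picked = []
    while pool and len(picked) < k:
        items = list(pool)
        choice = rng.choices(items, weights=[pool[i] for i in items], k=1)[0]
        picked.append(choice)
        del pool[choice]
    return picked


# Hypothetical item weights for one reader.
weights = {"pub-a": 5.0, "pub-b": 1.0, "pub-c": 1.0, "pub-d": 0.0}
recommendations = weighted_sample(weights, k=2, rng=random.Random(0))
```

Sampling instead of always taking the top-weighted items lets lower-weighted publications appear occasionally, so repeated visits do not show an identical list.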
Master Student Project
Collaborative Filtering Using Social Media Knowledge
Deep Belief Network Model (2, 20, 500, 2000)
Training Data: Bag-of-Words model
Lars Maaløe
Kasper Johansen