Issuu Talk on Topic Models and Recommendation Systems


Topic Models and Recommendations
Morten Arngren, Senior Data Scientist


About

Topic Modelling


Recommendations


“…YouTube for Publications…”


Started in 2006 by 5 dudes.

2013
15M publications (free)
340M pages (25 km²)
7.5B page views / month
83M unique visitors / month


Data Science Team (Copenhagen)

Morten Arngren
Ph.D. in Machine Learning and AI (2011)
M.Sc.A.M. (2007)
B.Sc.E.E. (1997)
ISSUU, Data Scientist (2011 - present)
DTU & FOSS Analytical, Machine Learning in Food Quality (2008-2011)
Nokia Mobile Phones, Digital Signal Processing (2000-2007)
Alcatel Space Denmark, Building Rockets (1997-2000)

Andrius Butkus
Ph.D. in Digital Media Personalisation (2009)
M.Sc.E.E. (2004)
B.Sc.E.E. (2002)
ISSUU, Data Scientist (2011 - present)
DTU External Lecturer, Human Computer Interaction (2010 - present)
DTU Assistant Professor, Digital Media Engineering (2008-2010)

ML Gadgets
12x 2.6GHz
96GB RAM
2TB SSD
2TB Hard Drive
Amazon Web Services


Data


Content (40k Pubs / Day)
[Diagram: content processing pipeline with results stored in the DB. Recoverable components:]
Page, Image, OCR, Text
Detect Language (56)
Translate to English (from 24 languages)
LDA, Topics
Explicit Detection
Layout (Quantify text and image boxes)
Cover Analysis
Article Extraction
Doc. Type Classification


Reader Activity (200GB / Day)
[Diagram: sessions of reader “Birdie Nam Nam” along a time axis, logged to the DB]


Topic Modelling


LATENT DIRICHLET ALLOCATION
Topic model based on Bag-of-Words Data

[D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.]
http://radimrehurek.com/gensim/
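As a minimal illustration of the bag-of-words representation that LDA takes as input, the sketch below counts word occurrences per document and discards word order. The tokenizer and the sample sentence are assumptions made for illustration; they are not taken from the Issuu pipeline.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a word -> count mapping; word order is discarded."""
    # Hypothetical tokenizer: lowercase, keep runs of ASCII letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Illustrative input, not taken from the talk.
doc = "Hotels on the islands: island hotels and hotels by the sea"
bow = bag_of_words(doc)
print(bow.most_common(2))
```

A real pipeline would typically also remove stop words and map each word to an integer id before training a topic model on the counts.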


LDA Wikipedia Training Data
150 topics (preset parameter)
~4.5M Single Articles (Pure Topics)

[Figure: Topic Distribution over topics 1 to 150, with example topic labels: hotels, arabic, Australia, history, business, islands, environment, poetic, food, design, arts, plants, animals]


LATENT DIRICHLET ALLOCATION
Issuu Publications
Properties: topic weights in [0:1] ∧ Σ=1
[Figure: the real LDA Space; residual labels: PC 4 5+]


TOPIC CATEGORIES (Learning from Wikipedia Dataset)
~9 Mio. ~4.5 Mio.
[Figure: publications placed in LDA space among topic categories such as Travel, Botanics, Cocktails, Chemistry and Drinks; one example mixture is 0.5 Travel, 0.4 Sports, 0.1 Dancing]
Density distribution not the same
Empty locations in LDA space.
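To make the mixture reading concrete, here is a small sketch that treats a publication as a set of category shares, checks the two properties of a topic distribution (each share in [0, 1], shares summing to 1), and ranks the categories. The mixture values are the ones shown on the slide; the function names and the tolerance are assumptions for illustration.

```python
import math

# Topic mixture from the slide: category name -> share.
mixture = {"Travel": 0.5, "Sports": 0.4, "Dancing": 0.1}


def is_valid_mixture(weights, tol=1e-9):
    """Check that every share lies in [0, 1] and that the shares sum to 1."""
    in_range = all(0.0 <= w <= 1.0 for w in weights.values())
    return in_range and math.isclose(sum(weights.values()), 1.0, abs_tol=tol)


def ranked_categories(weights):
    """Return category names ordered from largest to smallest share."""
    return sorted(weights, key=weights.get, reverse=True)


print(is_valid_mixture(mixture))
print(ranked_categories(mixture))
```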



Recommendation System


READER ACTIVITY
[Figure: publications opened by reader “Birdie Nam Nam” along a Time axis]
No Explicit Rating... Extract Implicit Rating...?


Readers
Publications
Browsing or Reading?

Session {
  UserName: ‘Birdie-Nam-Nam’
  DocID: xxx-xxxxx
  Pages:
    1: [250, 725, 569, 134, ...]
    2: [1056, 1259, ...]
    3: [1056, 1259, ...]
    4: [102, 356, 208, 438]
    5: [102, 356, 208, 438]
    6: [5250, 3567, 809]
    7: [5250, 3567, 809]
    ...
  TimeStamp: 1378935850
  DocID: yyy-yyyyy
}

Shown alongside the page list:
  Pages: [1,2,3,6,7]
  ReadTime: 25789 ms.
  TimeStamp: 1378935850
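One way to turn such a session into an implicit signal is to total the time spent on each page and keep only pages above a cut-off as "read". The sketch below assumes the bracketed numbers are view durations in milliseconds; the 1500 ms threshold and the function name are hypothetical, not values from the talk.

```python
# Per-page view durations in milliseconds, copied from the session on the
# slide; values elided there ("...") are simply absent here.
page_times_ms = {
    1: [250, 725, 569, 134],
    2: [1056, 1259],
    3: [1056, 1259],
    4: [102, 356, 208, 438],
    5: [102, 356, 208, 438],
    6: [5250, 3567, 809],
    7: [5250, 3567, 809],
}

# Hypothetical cut-off: a page counts as "read" if its total time exceeds this.
READ_THRESHOLD_MS = 1500


def summarize_session(times_by_page, threshold_ms=READ_THRESHOLD_MS):
    """Split pages into read vs. browsed and total the time on read pages."""
    totals = {page: sum(times) for page, times in times_by_page.items()}
    read_pages = sorted(p for p, t in totals.items() if t > threshold_ms)
    read_time_ms = sum(totals[p] for p in read_pages)
    return {"Pages": read_pages, "ReadTime": read_time_ms}


summary = summarize_session(page_times_ms)
print(summary)
```

With this threshold the read pages come out as [1, 2, 3, 6, 7], the same page list as in the slide's summary. The total read time differs from the slide's 25789 ms because the elided durations are missing from the input.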


Reader indexed learning
Decay function
[Equation lost in extraction; surviving annotations: "in weeks", "decay per week", "To", "= 850"]
Pages: [1,6,7,10,11] ReadTime: 11250 ms. TimeStamp: 1385437850
Item2Item Matrix
[Figure: matrix of publication-to-publication weights along a Time axis; surviving values: 1065, 850, 2508, 1150, 3690, 9860, 5685]
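The decay equation itself did not survive extraction, so the sketch below only illustrates the general idea of down-weighting older sessions by their age in weeks. It assumes an exponential form (read time multiplied by a per-week factor raised to the age in weeks). That form, the 0.5 factor, and the choice of `now` are all assumptions, not the function used in the talk.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def decayed_weight(read_time_ms, timestamp, now, decay_per_week):
    """Down-weight a session's read time by its age in weeks.

    Assumes exponential decay: weight = read_time_ms * decay_per_week ** weeks.
    """
    age_weeks = max(0.0, (now - timestamp) / SECONDS_PER_WEEK)
    return read_time_ms * decay_per_week ** age_weeks


# Session values from the slide; `now` and the decay rate are hypothetical.
session = {"ReadTime": 11250, "TimeStamp": 1385437850}
now = session["TimeStamp"] + 2 * SECONDS_PER_WEEK  # two weeks later
weight = decayed_weight(session["ReadTime"], session["TimeStamp"], now, 0.5)
print(weight)
```

A factor below 1 makes recent reading count more than old reading when weights are accumulated into an item-to-item matrix.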


RECOMMENDING
Item2Item Matrix
[Figure: inputs Read History, Stacks and Likes along a Time axis, the Item2Item Matrix (surviving values: 1065, 850, 2508, 1150) and item weights (< 1, 1, 5)]
Item Matrix Weight Mapping Function


RECOMMENDING
[Figure: per-item weights (1 and 5 in the example) are added (+) into Item Weights, followed by Weighted Sampling]
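The two steps named on this slide, adding up item weights and then sampling by weight, can be sketched as follows. The item names, the source lists and the choice to sample without replacement are assumptions for illustration; only the example weights 1 and 5 appear on the slide.

```python
import random
from collections import Counter


def combine_weights(*sources):
    """Sum per-item weights from several candidate lists."""
    total = Counter()
    for source in sources:
        total.update(source)  # Counter.update adds values for mapping input
    return dict(total)


def weighted_sample(weights, k, rng=random):
    """Draw up to k distinct items; heavier items are more likely to be picked."""
    pool = dict(weights)
    picked = []
    while pool and len(picked) < k:
        items = list(pool)
        choice = rng.choices(items, weights=[pool[i] for i in items], k=1)[0]
        picked.append(choice)
        del pool[choice]  # sample without replacement
    return picked


# Hypothetical candidates and weights; the item names are placeholders.
from_history = {"doc_a": 5, "doc_b": 1}
from_likes = {"doc_b": 5, "doc_c": 1}
item_weights = combine_weights(from_history, from_likes)
recommendations = weighted_sample(item_weights, k=2, rng=random.Random(0))
print(item_weights, recommendations)
```

Sampling instead of always taking the top-weighted items gives lower-weighted publications a chance to appear, so the recommendations vary between requests.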


Max. Rank


Tuned Parameters


Master Student Project
Collaborative Filtering Using Social Media Knowledge
Deep Belief Network Model
[Figure: network with layer sizes 2, 20, 500, 2000, fed with Training Data in a Bag-of-Words model]
Lars Maaløe
Kasper Johansen


Morten Arngren, Senior Data Scientist

