Cross-Device Tracking September 2017 Morten Arngren Lead Data Scientist
Vlad Sandulescu Senior Data Scientist
Jan Kremer Data Scientist
đ&#x;“ˇ a lot of experimentation
Snapshot
!" Still under development
ADFORM PLATFORM
Cookie Churn
đ&#x;‘ť
' '
' '
Cross-Device
' đ&#x;“ą % đ&#x;’ť đ&#x;“ž time
⌖ mission
đ&#x;‘ť đ&#x;‘˝ )
⌖ mission
đ&#x;‘ť đ&#x;‘˝ )
đ&#x;“ˆ data
â?“
Ground Truth?
Cookie
â?˛
#

đ&#x;“ž
Time
Location
OS
Device
External Deterministic Providers
(Deterministic)
>700K
(Adform bid responses)
events/day
â?˛ Cookie
#

đ&#x;“ž
OS
Device
Time Location
1.5B
events/day
(Adform bid responses) (Deterministic)
â?˛
>700K events/day
Cookie
#

đ&#x;“ž
OS
Device
Time Location
1.5B
events/day 1 Months of data
94%
2M events Filtering Remove 
 cookies with
 multiple clusters
74K
Data set may not be representative.
6%
clusters
Co o
kie
Ch
Cro ss-
u rn
De vic e
2017-01-21
Training 2.236.222 events 2017-02-21 2017-03-10
Validation 4.776.974 events 2017-04-27
300M 300M
More of it‌
time
Cross-Device Graph
) đ&#x;‘˝
Data
đ&#x;‘ť
Cookie Time
â?˛
Location
#
OS

⌖
đ&#x;“ž
Cluster cookies into user profiles.
Device
Cookie IP
Device User-Agent
Cross-Device Graph Graph cutting algorithms • • • •
Min-Cut Markov Random Fields Community Detection Graph Clustering…
Production Scale
B
events Cookie IP
Device User-Agent
B
nodes
B
edges
Scale….what to do?? Divide Graph …by location (Pre-Clustering) Ignored
Cookie IP
1
2
3
(Connected Components Algorithm)
4
n-1
n
Cookie association by Pair-wise classification
đ&#x;‘ť A
đ&#x;‘˝ B
C
) D
E Positive
Negative
A B
A C
A E
C D
A D
B E
Create cookie pairs as classification data set
B C
C E
‌with ground truth
B D
D E
Crafting Features for each cookie pair‌ A B
A
B Pageviews
Observations ID Time
Location
Device
cookie_id
Aggregation 0
12
24 hours
log_time ip_v4 country_id region_id city_id zip_code_id device_type_id os_id browser_id browser_language_codes mobile_app_id
0
set a set b
31 days
Crafting Features for each cookie pair‌ Similarity
A B Pageviews
-
Observations ID Time
Location
Device
cookie_id
Aggregation 0
12
24 hours
log_time ip_v4 country_id region_id city_id zip_code_id device_type_id os_id browser_id browser_language_codes mobile_app_id
Features
Cosine Similarity Correlation Similarity Hellinger Distance
0
set a set b
31 days
Max Overlap
Jaccard Index
Crafting Features
PC 1+2
for each cookie pair‌ A B Observations ID Time
Location
Device
Features
cookie_id
25 dim. feature vector
Numerical
log_time ip_v4 country_id region_id city_id zip_code_id device_type_id os_id browser_id browser_language_codes mobile_app_id
Vectorial
Classifier
Sets
đ&#x;Œ´
đ&#x;Œ´đ&#x;Œ´
0.78
A B
Gradient Boosted Decision Trees
(http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
Classification
Clustering
Pruning Degrees of edges
A
A
B
G
B
C D E
F
2
2
2
2
A
B
D
E
C E
2
3
C
F
D
F
1 95% percentile
G
(Connected Components Algorithm)
G
Pre-Clustering
Classification
Clustering
Pruning
A
B
A
B
A A
B
C
G
C D
C
B
F
C
D
E
D
E
E E
F
D
F
F
G
G
Clusters
Users
G
Full Graph
Components
(Connected Components Algorithm)
Sub-Graph (XGBoost)
(Connected Components Algorithm)
đ&#x;“ˆ
performance
(Preliminary YouGov data set…)
Performance Graphs
(Preliminary YouGov data set…)
Performance Graphs
(Dasgupta et al. “Overcomming Browser Cookie Churn with Clustering” 2012)
(Preliminary YouGov data set…)
Performance Graphs
(Preliminary YouGov data set…)
Performance Graphs
Precision
Recall
F1-score
91% 32% 47% (Preliminary YouGov data set…)
(Preliminary YouGov data set…)
Performance Graphs
Precision
Recall
F1-score
93% 43% 59% (Preliminary YouGov data set…)
production
Scheduled tasks Predicting
Training XD Deterministic Data
Serving
PreProcessing
PreClustering
PreProcessing
Train Classifier
PreClustering
Model parameters
Model ready
( ) S3
Kafka
/
Predict Cross Device ID’s
Cross-Device Cookies Ready
XD Data
( ) /
S3
Kafka
Serve‌
đ&#x;Œ?
2017-05-01
â?˛
Germany 7 days
2017-05-08
1.3B events (10%)
2017-06-25
đ&#x;Œ?
Germany
2017-07-24
â?˛
56B 30 days events 300M 300M
More of it‌
210M 138M time
175M
Predicting
đ&#x;Œ?
2017-05-01
â?˛
Germany 7 days
2017-05-08
1.3B events (10%)
XD Data
2017-06-25
PreProcessing
đ&#x;Œ?
Germany
2017-07-24
â?˛
56B 30 days events Model parameters
PreClustering
30 min.
Predict Cross Device ID’s
2 hours
More of it‌ Cross-Device Cookies Ready
time
â˜˘ challenges
Large pre-clusters
IP Obscurity
176.22.252.???
Cookie IP
One IP to rule them all….
Noisy pre-clusters
â?˛ future
Data
Predicting
đ&#x;“ˆ
Features
Contextual Information
XD Data
PreProcessing
More of it‌
Topic modelling on website content
PreClustering
Validate‌ Model parameters
4
Predict Cross Device ID’s
Large pre-clusters IP Obscurity More‌
Cross-Device Cookies Ready
✈ đ&#x;“š
Streaming
Context
References [1] Tran et al. “Classification and Learning-to-rank Approaches for Cross-Device Matching” CIKM Cup 2016 [2] Dasgupta et al. ”Overcoming Browser Cookie Churn with Clustering” CIKM Cup 2016 [3] Blondel et al. “Fast unfolding of communities in large networks” 2008 [4] Malloy et al. “Internet Device Graphs” KDD 2017
6 thank you
Profile-cookie space Precision / Recall Precision =
TP TP + FP
=
Recall =
TP TP + FN
=
FP TN FN
Estimated Class A Class B
TP Precision • Recall F1 = 2 • Precision + Recall
đ&#x;Œ´
đ&#x;Œ˛
F1
F3
>1
F4 <0.5
2
>0
<=0
<=1
F8
+
>=0.5
17
<0.8
F5 <=6
>6
>=0.8
F2
12 10
4
>1
<=1
3
12
+
1
+ Boosted Decision Trees (http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
0.78
A B