Weborama Datascience


DATA SCIENCE MARCH 2014

Weborama has been building user profile databases since 1998: sociodemographic and behavioural profiles. In the first years, we had a structural vision of the web. We were not analysing the content of the web pages; we relied on the themes declared by the webmasters, which was approximate, and the themes were predefined. Sociodemographic profiling was possible because of our panel. At that time, panelists were recruited via a questionnaire addressed to users who were voting for their favourite Weborama network websites.

In 2008, everything changed. We found out that the key was a more neutral approach, based on the lexical content of the web pages. We started by working with words, and soon figured out that these were too many dimensions: we had to segment the words of the web. This is the first of the three segmentations we do. The Weborama Semantic Engine is like a 3-level rocket: we do 3 kinds of segmentations.

Data is collected on the web by crawling robots and algorithms: algorithms that identify the relevant content of the pages and perform word extraction, involving morphosyntactic analysis. The underlying science is NLP (Natural Language Processing). Basically, that's linguistics, maths, and CPU.

The unit of analysis for the taxonomy (the classification of meaning) is the word cloud. A word cloud is a vector of words, with weights. In order to build a classification, we have to consider a corpus. A corpus is a set of texts; it can be a pile of books. For us, the global corpus is the web, and each text in the corpus is a word cloud extracted from a web page.

Weborama is active in 7 countries: France, the UK, Russia, the Netherlands, Spain, Italy, and Portugal. Each country's taxonomy work starts with the acquisition of a lexicon, which can be an open-source lexicon. The lexicon carries two important fields: the spelling of the word, and the word's lemma. The lemma is like the root of the word: several inflected forms of a word share one unique lemma. Examples: plural and singular nouns are directed to the singular form; 'general motors', 'gm', and 'gm.com' are all directed to 'general motors'; etc. A lexicon evolves as the language evolves on the web. New words pop up continuously, and Weborama has a way to detect new popular words: an algorithm proposes automatic lexicon insertion for a string of characters like '12 years a slave', for instance.

In order to achieve classification in this lexical space, we need to define a metric, that is, a way to compute distances between words. Linguists use what is called an equivalence index. There is an algorithm there, taking account of both occurrences and co-occurrences of a pair of words within a corpus. The equivalence index lies between 0 and 1: 0 when the words never appear together, 1 when they always appear together. That way, we are able to see that the equivalence index between 'DS' and 'Nintendo' on one hand, and between 'DS' and 'Citroën' (a French car brand) on the other, is very different: 0.93 versus 0.17.

Now that the space and the distance are determined, we can run a classification algorithm. We've used many known techniques, and mixed them, in order to make our own secret sauce.
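The text does not give the formula behind the equivalence index. A common formulation from co-word analysis is E(i, j) = c_ij² / (c_i · c_j), where c_i counts the texts containing word i and c_ij the texts containing both words. Here is a minimal sketch under that assumption, with a made-up toy corpus (each "text" standing in for the word cloud of one page):

```python
def equivalence_index(corpus, w1, w2):
    """E(w1, w2) = c12^2 / (c1 * c2), in [0, 1].

    corpus: iterable of texts, each a set of (lemmatized) words.
    Returns 0 when the words never co-occur, 1 when they always do.
    """
    c1 = sum(1 for text in corpus if w1 in text)
    c2 = sum(1 for text in corpus if w2 in text)
    c12 = sum(1 for text in corpus if w1 in text and w2 in text)
    if c1 == 0 or c2 == 0:
        return 0.0
    return (c12 ** 2) / (c1 * c2)

# Toy corpus, invented for illustration.
corpus = [
    {"ds", "nintendo", "console"},
    {"ds", "nintendo", "mario"},
    {"ds", "nintendo", "zelda"},
    {"ds", "citroen", "car"},
    {"citroen", "car", "engine"},
]
print(equivalence_index(corpus, "ds", "nintendo"))  # 0.75
print(equivalence_index(corpus, "ds", "citroen"))   # 0.125
```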


In the first steps (iterations), the algorithm will put in the same bucket words that are very close, like 'general motors', 'gm', and 'gm.com'. At this stage, the algorithm performs automatic lemmatization; it will also do automatic spelling corrections. Then, in the next steps, you will find other words put in the 'general motors' bucket, like 'Cadillac', 'Buick', etc. The algorithm actually builds up clusters, that is, themes. It proceeds like this for a number of interests: fashion, cooking, sports, news, etc.
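The actual mix of techniques is Weborama's secret sauce. As an illustration only, here is a single-linkage agglomerative sketch over equivalence-index similarities, which reproduces the behaviour described above: near-duplicate forms merge in the first iterations, then related brands join the same bucket. All similarity values below are invented.

```python
def agglomerate(words, sim, threshold):
    """Greedy single-linkage clustering: repeatedly merge the two
    clusters whose closest word pair is most similar, until no
    pair of clusters beats `threshold`."""
    clusters = [{w} for w in words]

    def link(a, b):
        # Single linkage: similarity of the closest pair across clusters.
        return max(sim.get(frozenset({x, y}), 0.0) for x in a for y in b)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), link(clusters[i], clusters[j]))
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda t: t[1],
        )
        if best < threshold:
            break
        clusters[i] |= clusters.pop(j)  # merge cluster j into cluster i
    return clusters

# Invented similarities: near-duplicates first, then related brands.
sim = {
    frozenset({"general motors", "gm"}): 0.95,
    frozenset({"gm", "gm.com"}): 0.90,
    frozenset({"general motors", "cadillac"}): 0.60,
    frozenset({"cadillac", "buick"}): 0.55,
    frozenset({"gm", "recipe"}): 0.02,
}
words = ["general motors", "gm", "gm.com", "cadillac", "buick", "recipe"]
print(agglomerate(words, sim, threshold=0.5))
# -> one 'general motors' bucket; 'recipe' stays in its own cluster
```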

What you get at the end is a map of the whole taxonomy. All the words are there. You can see all the relations between the words, and between the clusters; each distance between a pair is known. The diameters of the words are correlated to their popularity, and the colours are correlated to the words' sociodemographic profiles. Thanks to our panel (which has evolved throughout the years; we now work with Toluna as a provider), we can assign weights to words on the different sociodemographic criteria.

The taxonomy is Weborama's vision of the web. All the data was crunched and analysed, and a big heterogeneous mass was turned into an organized structure. We now have 177 clusters and 29 sociodemographic criteria (including gender, age, and household revenue). Disposable income and urbanity criteria are derived from GeoIP data. This is the first level of the rocket, and it allows the building of the two upper levels.

The database of users is built by projecting user word clouds on the taxonomy, and assigning scores on sociodemographic criteria. Sociodemographic profiling is possible through machine learning algorithms: patterns are observed on the panel, and we learn from them in order to propagate the information and achieve predictions. Surf intensity is a criterion that reflects the number of data events for a user: the higher it is, the more confidence we have in the user profile.

All profile values on clusters and sociodemographic criteria are dispatched into 14 quantiles. Quantile 14 carries the users with the highest interest in a cluster, or the highest probability of belonging to a sociodemographic criterion; quantile 1 carries the users with the lowest interests and probabilities. A value of 0 reflects no interest, or a probability close to 0.
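The text doesn't say how the 14 quantiles are cut. A minimal sketch, assuming active users are simply ranked on a cluster score and split into 14 equal-population buckets, with 0 reserved for no interest:

```python
import math

def to_quantiles(scores, n=14):
    """Map raw user scores on one cluster to quantile values 0..n.

    0 means no interest; active users are ranked by score and cut
    into n equal-population buckets, n (here 14) being the highest.
    """
    active = sorted((s, uid) for uid, s in scores.items() if s > 0)
    out = {uid: 0 for uid in scores}
    for rank, (_, uid) in enumerate(active, start=1):
        out[uid] = math.ceil(rank * n / len(active))
    return out

# Toy scores for four users on one cluster.
scores = {"u1": 0.0, "u2": 3.1, "u3": 0.4, "u4": 7.8}
print(to_quantiles(scores))  # {'u1': 0, 'u2': 10, 'u3': 5, 'u4': 14}
```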

Here is a glance at the database:

USER_ID 123456ABCDEF:
Fashion => 3 ; Cooking => 12 ; Pets => 14 ; News => 7 ; Interior Design => 13 ; Sports => 0 ; Car brands => 1 ; etc.
Female => 13 ; HHR ++ => 9 ; Age 35-49 => 8 ; Age 25-34 => 4
Surf intensity => 12

The Weborama sociodemographic database is not declarative: we compute affinity indexes. It is modelled on a huge number of users. Cookies on user desktops only carry the user ID; the full profile is stored in Weborama datacenters (a NoSQL structure).
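A minimal sketch of that split, cookie on one side and server-side document on the other; the field names and key layout are illustrative assumptions, not Weborama's actual schema:

```python
import json

# Cookie payload: nothing but the opaque user id (hypothetical name).
cookie = {"wbo_uid": "123456ABCDEF"}

# Server-side profile document, keyed by that id in a NoSQL store
# (shown here as a plain dict; the real store is not specified).
profile_store = {
    "123456ABCDEF": {
        "clusters": {"Fashion": 3, "Cooking": 12, "Pets": 14,
                     "News": 7, "Interior Design": 13,
                     "Sports": 0, "Car brands": 1},
        "sociodemo": {"Female": 13, "HHR++": 9,
                      "Age 35-49": 8, "Age 25-34": 4},
        "surf_intensity": 12,
    }
}

# Ad-serving side: resolve the cookie id to the full profile.
profile = profile_store[cookie["wbo_uid"]]
print(json.dumps(profile["sociodemo"], indent=2))
```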


Now that we have built the 2nd level of the rocket (the user profile database), we can jump to the 3rd level: "C-Clones". This is a segmentation of the Weborama profile database. It allows us to serve clients according to their needs, for example: optimizing a campaign for an advertiser; extending the audience of a publisher. This is possible through the combination of 1st-party data and Weborama 3rd-party data.

In the case of an advertiser who wants to optimize a campaign, a data mining file is written. In it, you find data for users who have been exposed to the campaign and have not converted (negative examples), and users who have been exposed to the campaign and have converted (positive examples). Positive examples are much rarer than negative examples, of course. Each example comes with a series of more than 200 attributes: the quantile values on all Weborama clusters and sociodemographic criteria. The file is crunched using a decision tree.
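The text doesn't name the implementation. As an illustration, here is the same setup with scikit-learn's DecisionTreeClassifier on synthetic data: scikit-learn splits on Gini impurity rather than the chi-square test mentioned below, but the shape of the result is the same. All numbers are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data mining file: 200+ quantile attributes per exposed
# user (values 0..14), and a rare "converted" label.
n_users, n_attrs = 10_000, 206
X = rng.integers(0, 15, size=(n_users, n_attrs))
y = rng.random(n_users) < 0.01 + 0.02 * (X[:, 0] > 10)  # toy signal

tree = DecisionTreeClassifier(
    max_depth=5,              # keep nodes interpretable as segments
    min_samples_leaf=200,     # each leaf must be a usable volume
    class_weight="balanced",  # converters are much rarer
)
tree.fit(X, y)

# Each leaf is a candidate segment: compare its conversion rate to
# the base rate to see which nodes over-perform.
leaves = tree.apply(X)
base_rate = y.mean()
for leaf in np.unique(leaves):
    mask = leaves == leaf
    print(leaf, mask.mean(), y[mask].mean() / base_rate)
```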

The root of the tree contains all the users in the data mining file. The algorithm splits every node using statistical tests (a chi-square test, or others). The result of this process is a dispatch of all the users into the tree nodes. The nice thing about it is that the nodes are not all equivalent: some show a much higher conversion rate than others. By combining nodes with boolean OR and AND operators, an optimal segmentation can be built. How many nodes do you want to combine? It depends on the choice between performance and volume. There is a trade-off: the more nodes you combine, the bigger the volume you get, but the less performance (uplift of the conversion rate) you have. Weborama's latest business cases show that we can aim at a x2.0 to x4.0 uplift for a population ratio of 15-20%. In this case, we get both volume and precision.
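Continuing the hypothetical sketch above: ORing together the best leaves, from the highest conversion rate downwards, makes the volume/uplift trade-off explicit. Each extra node grows the segment but dilutes its conversion rate.

```python
import numpy as np

def segment_tradeoff(leaves, y):
    """OR the k best leaves together, for k = 1..K, and report the
    cumulative volume and the uplift versus the base conversion rate."""
    base = y.mean()
    ranked = sorted(
        (leaves == leaf for leaf in np.unique(leaves)),
        key=lambda mask: y[mask].mean(),
        reverse=True,
    )
    combined = np.zeros_like(y, dtype=bool)
    for mask in ranked:
        combined |= mask                    # boolean OR of the nodes
        volume = combined.mean()            # share of the population
        uplift = y[combined].mean() / base  # conversion-rate uplift
        print(f"volume {volume:6.1%}   uplift x{uplift:.2f}")

segment_tradeoff(leaves, y)  # leaves, y from the sketch above
```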

