
IDL - International Digital Library Of Technology & Research Volume 1, Issue 6, June 2017

Available at: www.dbpublications.org

International e-Journal For Technology And Research-2017

HADOOP based Recommendation Algorithm for Micro-video URL

Revathy Ramakrishnan
Department of CSE, CMRIT, VTU
Bengaluru, India
revathyr1@gmail.com

Abstract:

In recent years, social media applications have pervaded our daily lives, and Social Networking Sites (SNSs) have become dependent on users for content generation. Because each SNS produces content in isolation, a significant amount of content that matches a user's interests remains undiscovered. This led to features such as “like”, “share” and “hashtags” that deliver content from one platform to another. These features allowed users to interact with multiple SNSs, but users could still only receive content from each SNS separately. Although Open Identity gave users a single sign-in across multiple platforms, it did not yet serve multiple platforms together. A Unified Access Model is proposed for internet-based content modeling, where the content for users can be images, videos or text. Short videos, termed “micro-videos”, are popular with both viewers and producers. This work provides a recommendation algorithm for micro-video URLs which, in contrast to traditional recommendation algorithms such as content-based recommendation, uses a Big Data parallel computing framework. High-performance computing is achieved by running the Slope One algorithm with MapReduce on Hadoop. Hence, the proposed recommendation system for micro-video URLs can achieve high-performance parallel computing and can be used by both producers and viewers.

Keywords: Networking Sites; Hadoop; Mapreduce; parallel computing; Slope one; micro-video


1. INTRODUCTION

With the increase in the amount of data generated by social networks, Internet searches and similar sources, there was a need to revolutionize the way data is handled. "Big Data" describes a universe of very large datasets. Although Big Data refers to the volume of data, it also signifies the capabilities involved in processing such data. Typically, a wide range of media and e-commerce firms, such as news websites, video providers and social networking websites, provide data (hereafter referred to as "content") on the Internet, and their primary goal is to generate revenue. Content providers not only try to maximize their revenue through advertisements and subscriptions but also try to reduce the cost of content distribution. Hence, providers distribute their content across several geographical locations, and use special analytical services (e.g., Google Analytics) to understand and improve user experience. Social media applications that deliver content are completely dependent on their users, which pushes them to deliver the best possible quality at minimum cost. At the same time, content providers now have the ability to collect, store and analyze users' behavioral patterns. Users are proactively engaged in integrating content information with their social information, giving rise to social networking sites. Social networking sites such as Facebook and Twitter depend completely on individual users for content generation. Each of these social networking sites is single-platform based. As the number of social networking sites increases, the single-platform approach shows a limitation: significant user interests are left behind.

Copyright@IDL-2017



Moreover, users not only consume data but also engage actively with the content, and thus pose new challenges to Big Data repositories. To enhance user experience in the Big Data paradigm, it is essential to put a Big Data user-centric model in the foreground. Data can be differentiated using the characteristics of Big Data, generally referred to as the Five V's:

Volume: Data sources such as sensors, social media and online transactions produce a huge volume of data that demands large storage and heavy process management.

Velocity: Within a short span of time, data sources can accumulate an enormous amount of data, which in turn needs a short processing time.

Variety: Multiple types of data such as videos, images, text and audio, both structured and unstructured, bring challenges for data integration, storage and processing.

Veracity: Data quality matters when considering the source of data. For example, data from a controlled source such as a registered user has more fidelity than data from an uncontrolled source such as a blog post.

Value: The usefulness of data to an enterprise is a key factor that depends highly on veracity and processing time. Data with high veracity that can be analysed in a shorter time is of more value to an enterprise.

Existing System

Social media applications emerged as single platforms, which limits the content a user can access. Although efforts were made to propagate content across platforms through OpenID, users still had to spend more time and effort to follow all social media applications with the same dedication.

Limitations of Existing System

Even though several attempts were made to facilitate interest-based content access, such as "like" on Facebook and "hashtags" on Twitter, searching for content of interest across multiple platforms was not available. Although the "share" feature allowed user content to propagate across multiple platforms, single-platform sites remained isolated, and users still received content from each platform individually.

Fig. 1 Overview of Content Delivery to Users

Fig. 2 Overview of content share in Existing System

Problem Statement

Content on social networking platforms is widespread, and these single platforms account for a significant share of the data in our Internet lives. Since a single platform restricts itself from content discovery through other platforms, a large proportion of user experience and user interest is lost within it. We propose an access model for content modelling based on user interest and experience, using the Big Data paradigm.

Proposed System

For large-scale data storage and processing, we propose to use the Hadoop framework, which can process large amounts of structured and unstructured data. Hadoop implements the MapReduce processing technique, in which the input dataset is split into several independent segments that can be processed in parallel by map tasks. The outputs of the map tasks are then sorted and passed as input to the reduce tasks.
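The split, map, sort and reduce flow described for the proposed system can be illustrated with a small in-process sketch. It is not the paper's Hadoop job: the record layout `(user, url)`, the function names and the view-counting task are all assumptions chosen for illustration, and the segments are processed one after another here, whereas Hadoop would run the map tasks in parallel across a cluster.

```python
from collections import defaultdict
from itertools import chain


def split_input(records, num_segments):
    """Split the input into independent segments (round-robin)."""
    return [records[i::num_segments] for i in range(num_segments)]


def map_segment(segment):
    """Map task: emit a (url, 1) pair for every view record in one segment."""
    return [(url, 1) for _user, url in segment]


def shuffle_and_sort(mapped_pairs):
    """Group intermediate pairs by key and sort the keys before reducing."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_group(key, values):
    """Reduce task: sum the counts collected for one url."""
    return key, sum(values)


def run_job(records, num_segments=2):
    """Run split -> map -> sort -> reduce sequentially in one process."""
    segments = split_input(records, num_segments)
    mapped = chain.from_iterable(map_segment(s) for s in segments)
    return dict(reduce_group(k, v) for k, v in shuffle_and_sort(mapped))


# Hypothetical view log: (user, url) pairs.
views = [("u1", "video/a"), ("u2", "video/a"), ("u1", "video/b"), ("u3", "video/a")]
view_counts = run_job(views)
```

Because each segment is mapped independently and the reduce step only sees values grouped by key, the same logic can be distributed without changing the map or reduce functions.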

The proposed work consists of:

Design of cross-platform layers
Feedback-based user-interest algorithm

Advantages:

Content similarity is measured
High-speed processing due to the Hadoop (MapReduce) framework

Social community standards such as OpenID [6] enable access to the user's social profile and connection to content-based sites. This brings content-based sites (e.g., YouTube, Google Videos) much closer to social networking sites, which raises new challenges in data management and content discovery across multiple platforms. While some platforms provide access to a large amount of content, they lack support for content discovery through other users' experiences.

2. LITERATURE SURVEY

[1] Y. Beyer, G. S. Enli, A. J. Maasø, and E. Ytreberg, “Small talk makes a big difference: recent developments in interactive, sms-based television”

This paper provides information on the social media platforms that have emerged recently. These platforms reach out to users through add-ons such as mobile texting in Facebook, Instagram, Twitter, Google+ and WhatsApp. A key problem with these add-ons is that they offer no user-assessment experience. The paper focuses mainly on Twitter in the context of journalism. In more detail, it presents a structural analysis of Twitter use during the first season of a talk show called 'Hubinette', which aired on public service television in Sweden in 2011. State-of-the-art methods for data collection and analysis were applied to this dataset, and the paper shows that Twitter is used in ways that are not traditional in terms of journalist-reader relationships.

[2] N. B. Ellison et al., “Social network sites: Definition, history, and scholarship”

This paper establishes that social media platforms consist of social networking sites (SNSs) that rely heavily on users for the creation of their content, in contrast to professionally produced content. Without user involvement and participation, these SNSs would not succeed. This has attracted research attention, and institutes and industry currently focus on the issue. The paper describes and defines the different features of SNSs in a constructive way. It also gives a perspective on the history of these sites, noting the key changes and highlighting the most important developments. With the features and definitions of SNSs established, we can pursue the research of the current paper.

[3] K. P. R. Lee, J. Brenner. (2012, Sep 13) Photos and videos as social currency online

Most people today use the Internet to find images and videos, and social media is central to that activity. Lee shows that more than forty-one percent of the US population discover or transfer photos and videos on the Internet. This is a clear sign that internet-based content discovery is centre stage for content generation and redistribution, and that heavy social media use generates a huge amount of Internet traffic. The content streams are fragmented, which limits the relevant internet-based content that reaches users. Lee treats individual interest as something that is continuously stimulated by relevant content discovery.

Single-platform SNSs varied in technology and scope, targeting users by demographics, geographical attributes or pre-existing relationships. A focus on specific interests such as travel, religion, sports, music, photo sharing, video sharing, country-related news, politics and philosophy was mainstream for early SNSs. Scope was slowly widened by removing age restrictions, opening in previously unreachable countries, and so on. These changes addressed limitations such as limited content access, platform interoperability issues and the lack of relevant content segmentation across multiple platforms. Internet-based content access therefore moved from specific platforms to more general ones. Examples of features modeled in this way include 'like' on Facebook, 'hashtag' on Twitter and 'filters' on Instagram. These attempts lead to the conclusion that internet-based content is searched more effectively within a single platform than across multiple platforms. People look for one platform where everything can be found, and do not consider interacting with other users or content through different platforms.

[4] D. Recordon and D. Reed, “Openid 2.0: a platform for usercentric identity management”

This paper explains that cross-platform social media applications are designed to account for the content access limitations of a single platform.
The 'share' feature, which is a form of internet-based content redistribution, makes it easy to reach content from several platforms through a single platform. Content variety and flexibility were therefore ensured with the emergence of such social networking sites, especially with 'share' functions that let the user access multiple platforms by creating a user ID. The interests of the individual are stored based on the user's searches and shares, and corresponding content from multiple platforms is shown to the user.

[5] E. P. Bucy, “Interactivity in society: Locating an elusive concept”

This paper also deals with cross-platform applications for users. It elaborates on the fact that following multiple platforms takes more time, more effort and much more cognitive capacity. With the same dedication, one can gain far more knowledge, or save time, when the information from these multiple platforms is presented in a single platform. The trend towards 'share' functionality has a downside: the user engages in one-to-many content distribution, but is limited to receiving content from each separate platform individually. Bucy explains this clearly.

[6] OpenID Foundation. (2013, Sep 13)

The OpenID Foundation website describes an open identity that allows users to sign in to many websites using a single identity. OpenID is a creative approach that gives access to multiple platforms with a single ID, yet each sign-in still targets one platform rather than multiple platforms at once. Content aggregation platforms provide users with access to a large amount of content. What is still lacking is the use of the user's history to produce results: these platforms do not yet support interaction and content discovery through user experiences. Our unified access model for interest-based content modeling addresses this gap.

3. SYSTEM ARCHITECTURE

The system architecture provides the overall structure of the system together with its conceptual integrity.
The structural properties describe the components of the system and how they are connected through interfaces. With a proper specification of the structural properties, the architectural design can be realized.

The system consists of the following modules:

User Management: The user registers with the system through this module, which also manages content filtering and lets the user browse and view content. The module additionally blacklists content that does not match the user's interests.

Information Extraction: This module extracts content from social media such as YouTube, Facebook and Twitter.

Interest Mining: Based on the user's browsing behaviour, this module learns the user's interests and constructs user profiles, grouping users with similar interests.

Content Matching: This module matches content to user interests using metadata matching and collaborative recommendation, and provides content recommendations to the user.

The entire system runs on a Hadoop cluster.

Fig. 3 System Architecture

A. Class Design

Framework classes are drawn using the Unified Modelling Language (UML), which presents the logical connectivity among classes as a chart. The class diagram also provides information about individual class attributes. Fig 4 shows the class diagram.

Fig. 4 Class Diagram

B. Use Case Diagram

The behaviour of the system is visualized in the form of a graph. This shows what the system offers with respect to its objectives (referred to as use cases) and the dependencies between use cases. Fig 5 shows the use case diagram.

Fig. 5 Usecase Diagram

C. Sequence Diagram

The message sequence for the forms can be shown through a UML sequence diagram. Fig 6 shows the sequence diagram.

D. Data flow Diagram


A data-flow diagram (DFD) is a graphical representation of the flow of data through an information system. DFDs can also be used to visualize data processing (structured design). On a DFD, data items flow from an external data source or an internal data store to an internal data store or an external data sink, via an internal process. Fig 7 and Fig 8 show the data flow diagrams for Level 0 and Level 1.

Fig. 6 Sequence Diagram

Fig. 7 Level 0 Dataflow Diagram

Fig. 8 Level 1 Dataflow Diagram

4. IMPLEMENTATION

Step 1: Each content item extracted from a social networking site is expressed as a feature vector D = (W1, W2, ..., WN), where W1, W2, ..., WN are the weights of the feature items. Feature items are taken from the metadata of the content.

The framework keeps continuous track of user activity on the social networking site and records the content the user browses or accesses. This content is extracted from the site in the form of a feature vector D. D contains a list of weights, one per feature item, and each feature characterizes the particular content.

In short, every time the user clicks on a link that points to content, this content is recorded in the metadata. This continues for a time window (the time limit of a micro-video). The metadata collected during this interval is saved in the common user account. The framework reads this metadata, which contains information about the links clicked or viewed; each line of the metadata corresponds to one link clicked or viewed during the time period. The information about each content item is stored as a feature (W), and the list of such features collected during the micro-video time period is stored as the feature vector D.

Step 2: Whenever a user browses a particular content item, store its feature vector as an interest feature vector.

When the framework reads the metadata over a period of time, it finds the same information repeated within a given time gap. Since only new content is of interest, the framework extracts only the relevant and changed metadata. In this way, it identifies which metadata are useful for analysis, and stores that content as interest feature vectors. In the end, only the interest feature vectors are processed, together with their time periods. The frequency of the interest features and the time period of access reveal a great deal about what the user is looking for, so advertisements can be generated based on user experience.

Step 3: Remove outliers and cluster the items in the interest feature vector.

Not all content stored in the metadata is relevant. Content whose weights have the lowest frequency or the shortest associated time interval is treated as an outlier and ignored.

Step 4: For any new content extracted from the social site, compare the similarity of its feature vector with the cluster centers of the interest feature vectors; if the similarity value is less than the threshold, the content is recommended to the user.

Any new content found in the metadata is stored as a new weight. This new weight is compared with the weights contained in the clusters, and based on the frequency, the similarity value and the threshold, a decision is made on whether the content is recommended to the user. This in turn drives the decision to show an advertisement.
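Steps 1 to 4 can be sketched in a few lines of Python. The paper does not specify the weighting scheme, the clustering method or the similarity measure, so every concrete choice below is an assumption: weights are relative term frequencies, outliers are vectors accessed fewer than `min_count` times, clustering is a greedy single pass, and the "similarity value" is read as a Euclidean distance so that a value below the threshold means the content is close to the user's interests. All function and parameter names are hypothetical.

```python
import math


def build_feature_vector(metadata_terms, vocabulary):
    """Step 1: weight of each feature item = its relative frequency in the metadata."""
    total = len(metadata_terms) or 1
    return [metadata_terms.count(term) / total for term in vocabulary]


def euclidean(a, b):
    """Distance between two feature vectors (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def remove_outliers(interest_vectors, access_counts, min_count):
    """Step 3a: drop interest vectors accessed fewer than min_count times."""
    return [v for v, c in zip(interest_vectors, access_counts) if c >= min_count]


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cluster_centers(vectors, radius):
    """Step 3b: greedy single-pass clustering; returns one center per cluster."""
    clusters = []
    for v in vectors:
        for members in clusters:
            if euclidean(v, centroid(members)) <= radius:
                members.append(v)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append([v])
    return [centroid(members) for members in clusters]


def should_recommend(content_vector, centers, threshold):
    """Step 4: recommend when the nearest cluster center is closer than threshold."""
    return min(euclidean(content_vector, c) for c in centers) < threshold


# Hypothetical data: metadata terms of four browsed items and their access counts.
vocabulary = ["music", "sports", "travel"]
browsed = [["music", "music", "travel"], ["music", "travel", "music"],
           ["sports", "sports", "sports"], ["travel"]]
access_counts = [5, 4, 3, 1]

# Step 2: the vectors of browsed content become the interest feature vectors.
interest = [build_feature_vector(terms, vocabulary) for terms in browsed]
kept = remove_outliers(interest, access_counts, min_count=2)
centers = cluster_centers(kept, radius=0.5)

new_item = build_feature_vector(["music", "travel"], vocabulary)
other_item = build_feature_vector(["sports", "travel", "travel"], vocabulary)
recommend_new = should_recommend(new_item, centers, threshold=0.4)
recommend_other = should_recommend(other_item, centers, threshold=0.4)
```

If the similarity value were instead a cosine similarity, the comparison in `should_recommend` would have to be reversed, since larger cosine values mean closer vectors.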

Step 5: Group the users whose interest feature vectors have similar cluster centers, and whenever a user in the group watches a content item, recommend the same content to the other users in the group who have not watched it.

There are many ways to use the information gathered so far. This step describes how the weights and similarity measures of a micro-video (a cluster of weights) are used to predict the content the user wants to see. Similar cluster centers of the interest feature vectors are grouped together. When the user watches the recommended content, the similarity measure is increased, and when the user doesn't watch it, it is decreased. The similarity measure therefore changes dynamically with user activity, which is the purpose of this framework: to take user activity into account when deciding on advertisements.

5. OUTCOMES

Tables I, II and III summarize the Unit Testing, Integration Testing and Validation Testing performed on the implementation, respectively.

Table I

User interests are monitored and logged to profile.txt and transaction.txt in the Cross-Platform module, as shown in Fig 9. These files are later moved to a Linux system, where we run MapReduce on Hadoop and use the Slope One algorithm to generate a recommendation result file. This file is then used as input to the Cross-Platform module, where the recommendation results are shown to the user, as shown in Fig 10.
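The paper names Slope One as the recommendation algorithm but does not list it, so the following is a minimal sketch of the standard weighted Slope One scheme rather than the paper's implementation. The input layout (a nested dictionary of user ratings per item) is an assumption; how profile.txt and transaction.txt are turned into ratings is not described in the source.

```python
from collections import defaultdict


def train_slope_one(ratings):
    """Compute the average rating difference and co-rating count per item pair.

    ratings: {user: {item: rating}} (assumed layout).
    """
    diff_sum = defaultdict(float)
    counts = defaultdict(int)
    for user_ratings in ratings.values():
        for i, r_i in user_ratings.items():
            for j, r_j in user_ratings.items():
                if i != j:
                    diff_sum[(i, j)] += r_i - r_j
                    counts[(i, j)] += 1
    avg_diff = {pair: diff_sum[pair] / counts[pair] for pair in diff_sum}
    return avg_diff, dict(counts)


def predict(user_ratings, target, avg_diff, counts):
    """Weighted Slope One prediction of the user's rating for the target item."""
    numerator = 0.0
    denominator = 0
    for j, r_j in user_ratings.items():
        pair = (target, j)
        if j != target and pair in counts:
            numerator += (avg_diff[pair] + r_j) * counts[pair]
            denominator += counts[pair]
    # None signals that no co-rated item is available for a prediction.
    return numerator / denominator if denominator else None


# Hypothetical ratings of three micro-video URLs by three users.
ratings = {
    "john": {"url_a": 5, "url_b": 3, "url_c": 2},
    "mark": {"url_a": 3, "url_b": 4},
    "lucy": {"url_b": 2, "url_c": 5},
}
avg_diff, counts = train_slope_one(ratings)
lucy_url_a = predict(ratings["lucy"], "url_a", avg_diff, counts)
```

The pairwise difference accumulation maps naturally onto MapReduce: a map task can emit `((i, j), r_i - r_j)` for each user, and a reduce task can average the differences per item pair.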


Table II

Table III

Fig. 9 Recommended URL is empty during first login

Fig. 10 Multiplatform Recommended Results based on user interests on subsequent login

CONCLUSION

The paper proposes a method that emphasizes the interconnection of services across multiple social media platforms. Content sharing and content access across multiple platforms through a Big Data user-centric approach extend past implementations. User interests and individual experiences are treated as data, which is processed using Hadoop with MapReduce techniques. User-friendly GUIs were implemented using Java Swing to ease the use of the recommendation system, which delivers interest-based content to users.

REFERENCES

[1] Y. Beyer, G. S. Enli, A. J. Maasø, and E. Ytreberg, “Small talk makes a big difference: recent developments in interactive, sms-based television”, Television & New Media, vol. 8, no. 3, pp. 213–234, 2007

[2] N. B. Ellison et al., “Social network sites: Definition, history, and scholarship”, Journal of Computer-Mediated Communication, vol. 13, no. 1, pp. 210–230, 2007

[3] K. P. R. Lee, J. Brenner. (2012, Sep 13) Photos and videos as social currency online. http://pewinternet.org/Reports/2012/OnlinePictures/MainFindings.aspx?view=all

[4] D. Recordon and D. Reed, “Openid 2.0: a platform for usercentric identity management”, in Proceedings of the second ACM workshop on Digital identity management, ser. DIM '06. New York, NY, USA: ACM, 2006, pp. 11–16. http://doi.acm.org/10.1145/1179529.1179532

[5] E. P. Bucy, “Interactivity in society: Locating an elusive concept”, The information society, vol. 20, no. 5, pp. 373–383, 2004

[6] OpenID Foundation. (2013, Sep 13) Openid foundation website. http://openid.net/

[7] H. M. Inc. (2013) Social media management. https://hootsuite.com/

[8] Yoono. (2013) Your social networks united. http://www.yoono.com/
