INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY AND COMPUTER SCIENCE (IJITCS)
ISSN Print: 2074-9007, ISSN Online: 2074-9015

Editor-in-Chief
Prof. Matthew He, Nova Southeastern University, USA

Associate Editors
Prof. Youssef Amghar, INSA of Lyon, France
Prof. Bimal Kumar Mishra, Birla Institute of Technology Mesra, India
Prof. Jesuk Ko, Gwangju University, Korea
Prof. Ewa Ziemba, University of Economics in Katowice, Poland
Members of Editorial and Reviewer Board
Dr. Tarek S. Sobh, Egyptian Armed Forces, Egypt
Dr. Emad Elabd, Menoufia University, Egypt
Prof. Nguyen Thanh Thuy, VNU University of Engineering and Technology, Vietnam
Dr. A. Bharathi, Bannari Amman Institute of Technology, India
Dr. Muhammad Tahir Qadri, Sir Syed University of Engineering & Technology, Pakistan
Prof. A.V. Senthil Kumar, Hindusthan College of Arts and Science, India
Dr. Olaronke G. Iroju, Adeyemi College of Education, Nigeria
Prof. M. Phani Krishna Kishore, GVP College of Engineering (Autonomous), India
Dr. Rafiqul Zaman Khan, Aligarh Muslim University (AMU), India
Dr. A.A. Eludire, Joseph Ayo Babalola University, Nigeria
Prof. Swati V. Chande, International School of Informatics and Management, India
Dr. Adnan A. Hnaif, Al-Zaytoonah University of Jordan, Jordan
Dr. Fahrad H. Pashaev, Institute of Control Systems of the Azerbaijan National Academy of Sciences, Azerbaijan
Dr. Muhammad Asif, Norwegian University of Science and Technology, Norway
Dr. Juan Li, North Dakota State University, USA
Prof. Sumathi, Canara Engineering College, India
Dr. Pawan Whig, Bhagwan Parshuram Institute of Technology, India
Dr. R. Subha, Sri Krishna College of Technology, India
International Journal of Information Technology and Computer Science (IJITCS, ISSN Print: 2074-9007, ISSN Online: 2074-9015) is published monthly by the MECS Publisher, Unit B 13/F PRAT COMM'L BLDG, 17-19 PRAT AVENUE, TSIMSHATSUI KLN, Hong Kong, E-mail: ijitcs@mecs-press.org, Website: www.mecs-press.org. The current and past issues are made available on-line at www.mecs-press.org/ijitcs. Opinions expressed in the papers are those of the author(s) and do not necessarily express the opinions of the editors or the MECS publisher. The papers are published as presented and without change, in the interests of timely dissemination. Copyright © by MECS Publisher. All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without written permission from the copyright owner.
International Journal of Information Technology and Computer Science (IJITCS) ISSN Print: 2074-9007, ISSN Online: 2074-9015 Volume 8, Number 12, December 2016
Contents

REGULAR PAPERS

Automation in Software Source Code Development
Henryk Krawczyk, Dawid Zima ... 1

Usability Evaluation Criteria for Internet of Things
Michael Onuoha Thomas, Beverly Amunga Onyimbo, Rajasvaran Logeswaran ... 10

A Survey on Emotion Classification from EEG Signal Using Various Techniques and Performance Analysis
M. Sreeshakthy, J. Preethi, A. Dhilipan ... 19

Improving Performance of Dynamic Load Balancing among Web Servers by Using Number of Effective Parameters
Deepti Sharma, Vijay B. Aggarwal ... 27

ECADS: An Efficient Approach for Accessing Data and Query Workload
Rakesh Malvi, Ravindra Patel, Nishchol Mishra ... 39

Journey of Web Search Engines: Milestones, Challenges & Innovations
Mamta Kathuria, C. K. Nagpal, Neelam Duhan ... 47

SQL Versus NoSQL Movement with Big Data Analytics
Sitalakshmi Venkatraman, Kiran Fahd, Samuel Kaspi, Ramanathan Venkatraman ... 59

Improving Matching Web Service Security Policy Based on Semantics
Amira Abdelatey, Mohamed Elkawkagy, Ashraf Elsisi, Arabi Keshk ... 67

A Hybrid Approach for Blur Detection Using Naïve Bayes Nearest Neighbor Classifier
Harjot Kaur, Mandeep Kaur ... 75

Performance Optimization in WBAN Using Hybrid BDT and SVM Classifier
Madhumita Kathuria, Sapna Gambhir ... 83
I.J. Information Technology and Computer Science, 2016, 12, 1-9 Published Online December 2016 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2016.12.01
Automation in Software Source Code Development Henryk Krawczyk and Dawid Zima Department of Computer Systems Architecture, Gdansk University of Technology, Poland, E-mail: henryk.krawczyk@eti.pg.gda.pl, dawzima@student.pg.gda.pl
Abstract—The Continuous Integration idea lies beneath almost all automation aspects of the software development process. On top of that idea other practices are built that extend it: Continuous Delivery and Continuous Deployment, automating even more aspects of software development. The purpose of this paper is to describe those practices, including debug process automation, to emphasize the importance of automating not only unit tests, and to provide an example of complex automation of the web application development process.

Index Terms—Continuous integration, continuous delivery, continuous deployment, test automation, debug automation.
I. INTRODUCTION

According to the Forrester report [1], companies are looking to prioritize innovation through developing software services, but software development providers cannot deliver new services at the rate business leaders want (business leaders want software delivered in six months). Also, according to the report, corporate culture and development process immaturity impede communication and slow service delivery, and only a few IT organizations regularly perform advanced continuous delivery practices.

Automation in terms of software development exists in almost all stages of the software development life cycle, independently of the chosen development method. Some methods emphasize automation, like Test-Driven Development and many Agile methodologies. A deeper description of different software development models is not in the scope of this paper and can be found in other publications [2].

The Continuous Integration set of practices is the core of automation in the software development process. On top of that idea other practices are built that extend it: Continuous Delivery and Continuous Deployment. These ideas are the answer to the rapid demand of business for new services – they speed up the process by which the developed application is deployed to the users. However, not all aspects of software development are subject to automation.

In terms of the software development process (activities strictly related to development), three roles can be distinguished: (1) Development, (2) Validation and (3) Debug. Development is the actual programming effort, which ends when a piece of code is committed to
the source code repository (whether it is a new feature, a bug fix or a new test covering some functionality of the software). Validation is responsible for executing tests and interpreting their results. Debug means analyzing failed tests and error reports to find the root causes in the developed software source code. In small projects all three roles are performed by each developer, but in larger projects there may be entire teams specialized in each role.

The purpose of this paper is to describe the key practices that are the core of automation in software development – Continuous Integration, Continuous Delivery and Continuous Deployment – emphasizing test and debug process automation, and including a simplified example of an automated acceptance test of a web-based application.

The rest of the paper is structured as follows: Section III contains a detailed description of Continuous Integration practices. Section IV describes the practices extending Continuous Integration: Continuous Delivery and Continuous Deployment. Section V provides an introduction to test automation with a detailed example of a Continuous Integration process with acceptance test automation for a web application. Section VI introduces the debug automation process, extending the previous example. Section VII concludes this publication.
II. RELATED WORK

There are many publications describing and comparing different software development models. In the authors' other paper [2], they have been discussed, including traditional, agile and open source methodologies. Z. Mushtaq et al. [15] proposed a hybrid model combining two agile methodologies – Scrum and eXtreme Programming (of which many practices were the origin of Continuous Integration). Continuous Integration, Delivery and Deployment have been well described and discussed by M. Fowler [3] [5], J. Humble [6] [7] and D. Farley [6]. Forrester Consulting [1] prepared a Continuous Delivery Maturity Assessment Model based on the results of a survey they conducted. A. Miller from Microsoft Corporation [4] analyzed and presented data from their Continuous Integration process for a period of approximately 100 working days. Debug automation has not been the subject of much research; most of it was described and published by Microsoft researchers [14] [12]. They have
shared their experience from more than 10 years of implementing, using and improving the overall process of debug automation.
III. CONTINUOUS INTEGRATION

Continuous Integration (CI) is a set of practices, known for a long time, but formally introduced as part of the eXtreme Programming methodology and then well described by Martin Fowler [3]. He distinguished the 11 most important practices of CI: (1) "Maintain a Single Source Repository", (2) "Automate the Build", (3) "Make Your Build Self-Testing", (4) "Everyone Commits To the Mainline Every Day", (5) "Every Commit Should Build the Mainline on an Integration Machine", (6) "Fix Broken Builds Immediately", (7) "Keep the Build Fast", (8) "Test in a Clone of the Production Environment", (9) "Make it Easy for Anyone to Get the Latest Executable", (10) "Everyone can see what's happening" and (11) "Automate Deployment". It can be summarized in one sentence: the CI set of practices provides rapid feedback about the quality of a committed change and helps to avoid integration problems.

The first CI practice, "Maintain a Single Source Repository" (1), means that there should be a single, centralized source of application source code behind a version management system (an application that allows developers to work in parallel on the same files, allowing them to share and track changes, resolve conflicts etc., e.g. SVN, Git, Perforce) that is known to anyone who is involved in the project. A mainline should be distinguished among the other branches, containing the most up-to-date sources of the project that developers are currently working on. All developers working on the project should commit their changes to the repository. Everything that is needed to build the project, including test scripts, database schemas, third party libraries etc., should be checked in to that repository. As Martin Fowler says [3], the basic rule of thumb is that anyone should be able to build the project on a virgin machine (fresh setup) having access only to the centralized source code repository. Practice shows that this is not always possible, and sometimes some environment setup on new developer machines is required (e.g. installation of the Windows Driver Kit).

"Automate the Build" (2) might be considered the crucial practice in CI. It means that the process of converting source code into a running system should be simple, straightforward and fully automated (including any environment changes, like database schemas etc.). This is the first step that indicates the quality of a change checked in to the source code repository – if the build was compiling before and failed to compile after introducing the change, the developer who made the commit should fix the compilation as soon as possible. There are many existing solutions, such as GNU Make, Apache Ant for Java projects, NAnt for .NET or MSBuild, that allow automation of the build process.

By making the build self-testing (3), there should be low-level tests (i.e. unit tests) included in the project,
covering most of the codebase, that can be easily triggered and executed, with results that are clear and understandable. If any of the tests fail (a single test case or an entire test suite), the build should be considered failed. It is important to remember that testing will not tell us that software works under all conditions (it does not prove the absence of bugs), but it will tell us that under certain conditions it does not work. Execution of low-level tests after each check-in allows you to quickly check if the change introduced a regression into the project.

When many developers are working on the same project, developing different components (in isolation) that interact with each other based on a prepared contract (interface), and do not integrate their changes frequently but rather rarely (e.g. once every few weeks), they can experience something called "integration hell" – conflicts, redundant work and misunderstandings at the stage when the different components are integrated after being developed in isolation. To resolve these issues, developers should commit to the mainline very often (e.g. every day) (4), literally continuously integrating their changes. This practice allows one to quickly find any conflicts between developers. Before committing their changes to the repository, developers should get the latest source code from the repository, resolve any conflicts, and perform the build and low-level tests on their development machine. If all steps were successful, they are allowed to commit their change to the repository; however, this is not the end of their work. A dedicated machine (integration machine) detects changes in the source code repository and performs the build (5). Only when the build is successfully completed on the integration machine can the build be considered successful and the developer's work done.

It is important to maintain the codebase in a healthy state – each compilation break, unit test failure or static source code analyzer error should be fixed as soon as possible by the developers who have broken the build (6). Sometimes, to get the mainline quickly back to a successful state, the best way is to revert the latest commits to the last known good build.

"Keep the Build Fast" (7) – to be able to provide rapid feedback about the quality of a committed change, the build process time should be relatively short. The eXtreme Programming methodology tells us that the build should last no longer than 10 minutes.

All tests should be performed in an environment maximally similar to the production environment (8). This means, for example, using database servers of the same version as in production, the same web browsers as used by clients, etc. Every difference between the test and production environments introduces the risk that the developed software will behave differently when deployed to the production environment.

Martin Fowler [3] also pays special attention to the availability of the project executables. They should be easily accessible to anyone who needs them (9) for any purposes – manual tests, demonstrations etc. According to the exact words of Martin Fowler [3], CI
is all about communication, so it is important that everyone involved in the project can see what is happening (10) – what is the current state of the mainline, what is the latest successful build etc. Modern tools
supporting the CI process (often called CI servers) provide a web-based GUI that allows you to display all necessary information.
Fig.1. Diagram of the Continuous Integration Process
To perform higher level tests (integration, performance, acceptance etc.) there is a need to deploy the project to the test environment (as previously mentioned, it should be similar to the production environment). So there is a need to automate the deployment process (11). When deployment automation is used to deploy the project to production, it is also worth having an automated rollback mechanism that allows you to revert the application to the last good state in case of any failures. Deployment automation will be elaborated on further during the discussion of Continuous Delivery, Continuous Deployment and the Deployment Pipeline in this paper.

The CI server allows the practical implementation of the CI process. Its main responsibility is to monitor the source code repository, perform build, deploy and test when a change is detected, store build artifacts (i.e. project executables) and communicate the result to project participants. There are many commercial and open source CI servers available on the market, offering many collaboration features. The most popular are: TeamCity (JetBrains), Jenkins, Team Foundation Server (Microsoft) and Bamboo (Atlassian).

Ade Miller from Microsoft Corporation [4] analyzed data from their CI process. Data was collected for the "Web Service Software Factory: Modeling Edition" project, released in November 2007, for a period of approximately 100 working days (4000 hours of development). Developers checked in changes to the repository on average once each day, and the CI server was responsible for compiling the project, running unit tests and static code analysis (FxCop and NDepend), and compiling MSI installers. During those 100 days, developers committed 551 changes resulting in 515 builds and 69 build breaks (13%
of committed changes). According to his analysis, the causes of build breaks were: static code analysis (40%), unit tests (28%), compilation errors (26%) and server issues (6%). The great majority of build breaks were fixed in less than an hour (the average time to fix a CI issue was 42 minutes). There were only 6 breaks that lasted overnight. He also calculated the CI process overhead, which in that case was 267 hours (50 for server setup and maintenance, 165 for checking in, and 52 for fixing build breaks). In hypothetical calculations for an alternative heavyweight process without CI, but with similar codebase quality, he estimated the project overhead at 464 hours, so in his case the CI process reduced the overhead by more than 40%.
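As a quick consistency check of those reported figures (this calculation is ours, not given explicitly in [4]): 50 + 165 + 52 = 267 hours of CI overhead, and (464 − 267) / 464 ≈ 0.42, i.e. a reduction of roughly 42%, consistent with the stated reduction of more than 40%.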
IV. CONTINUOUS DELIVERY, CONTINUOUS DEPLOYMENT AND DEPLOYMENT PIPELINE

Continuous Delivery [5] [6] is the practice of developing software in a way where it is always ready to be deployed to the production environment (the software is deployable throughout its lifecycle and the development team prioritizes keeping the software deployable over time spent working on new features). Continuous Delivery is built on CI (adding stages responsible for deploying the application to production), so in order to do Continuous Delivery, you must be doing Continuous Integration. Continuous Deployment is a practice built on Continuous Delivery. Each change is automatically deployed to the production environment (which might result in multiple deployments per day). The main difference (and the only one) between Continuous Delivery and Continuous Deployment is that the deployment in Continuous
Delivery depends on business decisions and is triggered manually, while in Continuous Deployment each "good" change is immediately deployed to production [5] [7].

According to Jez Humble and David Farley [6], the Deployment Pipeline is a manifestation of the process of getting software from check-in to release ("getting software from version control into the hands of your users"). A diagram of a Deployment Pipeline is shown in Figure 3. Each change, after being checked in to the repository, creates a build that goes through a sequence of tests. As the build moves through the pipeline, the tests become more complex, the environment more production-like, and confidence in the build's fitness increases. If any of the stages fails, the build is not promoted to the next one, in order to save resources and send information to the development team rapidly.
Stages common to all project types in the Deployment Pipeline are: the commit stage (the build compiles, low-level unit tests pass, code analysis passes), the automated acceptance test stage (asserts whether the project works on a functional level), the manual test stage (asserts whether the system meets customer requirements, finding bugs omitted during the automated tests) and the release stage (the project is delivered to its users). This pattern does not imply that everything is automated and no user action is required – rather, it ensures that complicated and error-prone tasks are automated and repeatable.
Fig.2. Continuous Integration, Delivery and Deployment relations
Fig.3. Basic Deployment Pipeline [6]
Jez Humble and David Farley [6] have distinguished a number of Deployment Pipeline practices: (1) Only Build Your Binaries Once, (2) Deploy the Same Way to Every Environment, (3) Smoke-Test Your Deployments, (4) Deploy into a Copy of Production, (5) Each Change Should Propagate through the Pipeline Instantly, (6) If Any Part of the Pipeline Fails, Stop the Line.
V. TEST AUTOMATION

Software is tested to detect errors; however, the testing process is not able to confirm that the system works well in all conditions, but it is able to show that under certain conditions it does not work. Testing may also verify whether the tested software behaves in
accordance with the specified requirements used by developers during the design and implementation phase. It also provides information about the quality of the product and its condition. Frequent test execution (i.e. in an automated way) helps to address regressions introduced into the source code as soon as possible. All levels of tests can be automated: starting from unit tests examining application internals, through integration tests checking the integration between different software components, finishing with acceptance tests validating system behavior. For .NET projects, an example of a technology that might be used to automate unit tests is xUnit.net [8]. Almost all modern CI servers have built-in support for the most common test frameworks, and all modern frameworks have support for command line
instrumentation for automation purposes. There are many good practices, patterns and technologies related to test development or execution that encourage automation. One of them is colloquially named the "3-As" pattern (Arrange, Act, Assert), suggesting that each test consists of an initialization part (arrange), invoking the code or system under test (act) and verifying whether the execution results meet the expectations (assert). Another good example is the PageObject [9] pattern, which introduces a separation between the testing code and the UI of the application under test (so that a change in the tested application's UI requires only a single change in the layer of that UI abstraction, not affecting the numerous tests interacting with that UI through the PageObject layer). Selenium [10] is an example of a technology that allows automation of interactions with web browsers.
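The 3-As structure can be illustrated with a minimal xUnit.net test. This sketch is not from the paper: the Calculator class and its Add method are hypothetical and serve only to show the three parts of such a test.

    using Xunit;

    public class Calculator
    {
        // Hypothetical class under test.
        public int Add(int a, int b) => a + b;
    }

    public class CalculatorTests
    {
        [Fact]
        public void Add_ReturnsSumOfTwoNumbers()
        {
            // Arrange: set up the object under test and its inputs.
            var calculator = new Calculator();

            // Act: invoke the behavior being verified.
            int result = calculator.Add(2, 3);

            // Assert: check that the result meets the expectation.
            Assert.Equal(5, result);
        }
    }

A test structured this way can be executed by any of the CI servers mentioned above as part of a self-testing build.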
Fig.4. Test Bus Imperative [11]
In terms of automating higher levels of tests interacting with the UI (which might be very time-consuming), there is another pattern worth mentioning – the Test Bus Imperative [11]. As the author claimed, it is a software development architecture that supports automated acceptance testing by exposing a test bus – a set of APIs that allows convenient access to application modules. Having in an application a presentation API used by both the UI and the acceptance tests allows developers to bypass the UI to speed up test execution, which in consequence allows one to run higher level tests much more frequently – i.e. on every
commit. In very complex systems, which are developed by multiple teams, with thousands of tests at multiple levels, sometimes the information that a test failed is not sufficient – especially when one single source code change (maybe a complex one) causes hundreds of tests to fail, yet there may be a single root cause for all of those failures. It is a very time consuming task to inspect all of those tests separately, and it causes redundant work. Then it comes in handy to have a debug automation process, which will be described further in this publication.

To gain a better understanding of how test automation works, an example will be considered. For the sake of this example, the SUT (System Under Test) is a web-based application (hosted on an external HTTP server and accessed by users via web browsers). The system use case that will be covered by the automated acceptance test is very simple: a user using different browsers supported by the application (Firefox, Chrome and Internet Explorer) wants to log in to the application by providing a user name and password and clicking the "Log in" button. So the automated test must perform the following steps: open the web browser, navigate to the log in page, enter the username and password, click the "Log in" button, and validate whether the user was correctly redirected to the main page. The development process in terms of this example is: (1) the developer commits a change to the repository, (2) the CI server automatically detects that change, (3) downloads the sources and starts a new build (compilation etc.), (4) automatically deploys the application to a test environment that is similar to the production one (development HTTP server and database) and performs the automated acceptance tests (including the one considered in this example), (5) in case of any failures, a report with test results is generated and presented to the user, otherwise (6) if every step succeeded, the application is deployed to production. It is worth mentioning that the tests are written by developers or validation engineers and executed automatically on all provided web browsers. The entire process is illustrated in Fig. 5.
Fig.5. Continuous Deployment process example
For the sake of this example, Git has been chosen as the source code repository, and TeamCity from JetBrains as the CI server. The TeamCity server has all the necessary features built in, i.e. automated compilation (and deployment) using MSBuild or Visual Studio, running batch scripts, communication with many popular source code repositories (including Git), running static code analysis (i.e. using FxCop and StyleCop) and unit test runners (i.e. NUnit). Automation of the web application acceptance tests has been accomplished by using a combination of technologies: Gherkin, SpecFlow, NUnit and Selenium. Gherkin is a business readable language that allows one to specify acceptance criteria using English-like syntax and Given-When-Then patterns (Listing 1). SpecFlow allows you to generate a test skeleton for a provided Gherkin description using (i.e.) NUnit beneath as a test framework (Listing 3). The generated steps are implemented using Selenium, which allows interaction with the web browser via a PageObject abstraction layer (Listing 2). Listings 2 and 3 are written in the C# programming language.

    Feature: LoginFeature
    Scenario Outline: Correct logging in
        Given Start a web browser <browser>
        And Go to the log in page
        And enter login 'test' and password 'test'
        When press "Log In" button
        Then will be redirected to main page and will be logged in.

        Scenarios:
        | browser           |
        | Firefox           |
        | Chrome            |
        | Internet Explorer |
        | PhantomJS         |

Listing 1. Login use-case description in Gherkin language

    public abstract class PageObject : IDisposable
    {
        public IWebDriver WebDriver { get; set; }
        public string BaseUrl { get; set; }
        protected PageObject(IWebDriver webDriver) {…}
        public void Dispose() {…}
        public void Navigate(string url)
        {
            WebDriver.Navigate().GoToUrl(BaseUrl + url);
        }
    }

    public class LoginPage : PageObject
    {
        public LoginPage(IWebDriver webDriver) : base(webDriver)
        {
            Navigate("");
        }

        public void InsertUserAndPassword(string user, string pass)
        {
            IWebElement loginInput = WebDriver.FindElement(By.Id("login"));
            loginInput.SendKeys(user);
            IWebElement passInput = WebDriver.FindElement(By.Id("password"));
            passInput.SendKeys(pass);
        }

        public PageObject Submit()
        {
            WebDriver.FindElement(By.TagName("form")).Submit();
            if (WebDriver.FindElement(By.Id("login-status")).Text != "OK")
                return this;
            return new HomePage(WebDriver);
        }
    }

    public class HomePage : PageObject
    {
        public HomePage(IWebDriver webDriver) : base(webDriver) { }
    }

Listing 2. Implementation of PageObject pattern using Selenium to interact with web browser

    [Binding]
    public class LoginFeatureStepDefinitions
    {
        public PageObject PageObject { get; set; }

        [Given(@"Start a web browser Chrome")]
        public void GivenStartAWebBrowserChrome()
        {
            PageObject = new LoginPage(new ChromeDriver());
        }

        [Given(@"Start a web browser Firefox")]
        public void GivenStartAWebBrowserFirefox()
        {
            PageObject = new LoginPage(new FirefoxDriver());
        }

        [Given(@"Start a web browser Internet Explorer")]
        public void GivenStartAWebBrowserInternetExplorer()
        {
            PageObject = new LoginPage(new InternetExplorerDriver());
        }

        [Given(@"Start a web browser PhantomJS")]
        public void GivenStartAWebBrowserPhantomJs()
        {
            PageObject = new LoginPage(new PhantomJSDriver());
        }

        [Given(@"Go to the log in page")]
        public void GivenGoToTheLogInPage() { }

        [Given(@"enter login '(.*)' and password '(.*)'")]
        public void GivenEnterLoginAndPassword(string p0, string p1)
        {
            (PageObject as LoginPage).InsertUserAndPassword(p0, p1);
        }

        [When(@"press ""(.*)"" button")]
        public void WhenPressButton(string p0)
        {
            PageObject = (PageObject as LoginPage).Submit();
        }

        [Then(@"will be redirected to main page (…)")]
        public void ThenWillBeRedirectedToTheMainPage()
        {
            if (PageObject.GetType() != typeof(HomePage))
            {
                Assert.Fail();
            }
        }

        [AfterScenario]
        public void TearDown()
        {
            if (PageObject != null)
            {
                PageObject.Dispose();
                PageObject = null;
            }
        }
    }

Listing 3. Acceptance test implementation using SpecFlow
VI. DEBUG PROCESS AUTOMATION

When the developed application is very complex, consisting of many components with thousands of tests, sometimes the information that a test failed may not be sufficient – especially when a commit fails hundreds of tests at the same time. Inspecting all of them may be a time-consuming task. After all, it may be a single bug that caused multiple tests to fail. When the software is released to the users, they will probably report some errors (manually or via an automated system). The number of reported errors depends on the quality of the application and the number of users using the developed application. Manual error report analysis might be a time-consuming task with redundant work, because many of the reported errors will have the same root cause in the application's source code (hundreds of users experiencing the same bug and reporting it via the automated error collection system). When the inflow of error reports (coming from internal
test execution systems or from the users after the software was released) is large, it may come in handy to have some kind of post-mortem debug automation process to reduce the time that developers must spend on bug fixing. The main goal of debug automation is to automatically detect the root cause of a single crash report, and to aggregate collections of reports of the same bug into buckets to avoid duplicates, thus saving developers' time. Error reports might be prepared by the crashing application itself when the crash occurs (when developers expect that a crash may occur and have prepared some kind of exception handling and reporting subsystem) or by the operating system (when the error was unhandled by the application). All modern operating systems are capable of handling unhandled application exceptions, preparing crash reports consisting of memory dumps or log files from memory analysis (kernel or user memory dumps for Windows, coredumps for Linux, Tombstone files for Android).
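As an illustration of the first case – an application preparing its own error report – the following is a minimal .NET sketch, not taken from the paper, that registers a handler for unhandled exceptions and writes a simple text report; the report path and contents are assumptions made for the example.

    using System;
    using System.IO;

    class CrashReportingExample
    {
        static void Main()
        {
            // Register a handler that runs when an exception is not caught anywhere else.
            AppDomain.CurrentDomain.UnhandledException += (sender, args) =>
            {
                var exception = args.ExceptionObject as Exception;

                // Assumed report location; a real subsystem would also add
                // application version and environment data before uploading the report.
                string reportPath = Path.Combine(Path.GetTempPath(), "crash-report.txt");
                File.WriteAllText(reportPath,
                    "Time: " + DateTime.UtcNow.ToString("O") + Environment.NewLine +
                    "Message: " + exception?.Message + Environment.NewLine +
                    "Call stack:" + Environment.NewLine + exception?.StackTrace);
            };

            // Simulated failure to demonstrate the handler.
            throw new InvalidOperationException("Simulated crash");
        }
    }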
Fig.6. Example of debug automation process
The debug automation system should be able to analyze error reports, i.e. using command-line debuggers (KD and CDB on Windows, GDB on Linux). Sometimes it is necessary for the debuggers to have some additional resources, like the application source code or debug symbols, to provide more accurate data. The results of the automated analysis should provide a crash root cause signature that is used by the bucketing algorithm responsible for clustering duplicated crash reports into buckets, each representing a single bug in the source code. An example of a bucketing algorithm may be one using call stack similarity to indicate whether two crash reports represent the same bug. This similarity may be computed using a simple string-like similarity (i.e.
Levenshtein distance) or a much more sophisticated method, like the one proposed by the Microsoft Research team, the Position Dependent Model (PDM), which is part of a more complex method called ReBucket [12]. Another example of a bucketing algorithm on Windows might be using the result of the "!analyze -v" command in the KD or CDB debuggers, which provides information like the exception code and its arguments or the "BUCKET_ID" section [13]. The Windows Error Reporting system (WER) [14] is a good example of a large scale debug automation system used at Microsoft. It originated from a combination of diagnosis tools used by the Windows team and an automatic data collection tool from the Office team. As described in [14], WER is a distributed post-mortem
debugging system. When an error occurs on the client machine, a special mechanism that is part of the Windows operating system automatically collects the necessary data on the client machine, prepares an error report, and (with the user's permission) sends that report to the WER service, where it is debugged. If the resolution for the bug is already known, the client is provided with a URL to find the fix.

To gain a better understanding of how debug automation can reduce redundant work in failing test result analysis, an extended version of the example from the previous section will be considered. An additional assumption to the example provided in the test automation section is that the application under test is written in ASP.NET MVC technology (for the sake of this example). The automated flow has been extended with additional steps, application error acquisition and error correlation, so the entire process is: the developer commits a change to the repository, the CI server detects that change and starts a new build, automatically deploys the application to the test environment and performs the automated tests; in case of any failures it acquires error reports from the application and performs correlation of the error reports; otherwise, if every step succeeded, the application is deployed to production. An example of how a web application written in ASP.NET MVC technology can handle errors is presented in Listing 4. The method DumpCallStack sends a prepared text file with the call stack of the unhandled exception that occurred, which is acquired by the next step of the automated debug process. Then, after acquiring all error reports with exception call stacks, all of them are compared (i.e. using simple string comparison) to find out how many of them are identical. So the result of this example can be as follows: after submitting a change to the repository, 10 tests failed, but the debug automation step, after analyzing the call stacks of those 10 failures, finds out that all of them were caused by a single root cause (with one, identical call stack). So the developer, instead of analyzing all 10 test results, has to focus on a single bug, represented by a single call stack, causing those 10 tests to fail.

    public class MvcApplication : System.Web.HttpApplication
    {
        (…)
        public void Application_Error()
        {
            Exception e = Server.GetLastError();
            Response.Clear();
            Server.ClearError();
            DumpCallStack(e.StackTrace);
        }
    }

Listing 4. Function in Global.asax of ASP.NET MVC application collecting exception call stack after each failure
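The paper does not show the DumpCallStack helper or the correlation step itself; the following is a minimal sketch of both, under the assumption that each error report is a plain text file containing only the call stack. The class name, report directory and the GroupReportsByCallStack method are hypothetical and only illustrate the simple string comparison approach described above.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    public static class DebugAutomationSketch
    {
        // Sketch of the helper used in Listing 4: store the call stack as a text report
        // that the automated debug step later acquires from the test environment.
        public static void DumpCallStack(string callStack)
        {
            Directory.CreateDirectory("error-reports");
            string reportPath = Path.Combine("error-reports", "report-" + Guid.NewGuid() + ".txt");
            File.WriteAllText(reportPath, callStack);
        }

        // Correlation step: bucket the acquired reports by identical call stack
        // (simple string comparison); each bucket represents one suspected root cause.
        public static Dictionary<string, List<string>> GroupReportsByCallStack(string reportsDirectory)
        {
            return Directory.GetFiles(reportsDirectory, "*.txt")
                .GroupBy(file => File.ReadAllText(file))
                .ToDictionary(group => group.Key, group => group.ToList());
        }
    }

With the 10 failing tests from the example all producing one identical call stack, this grouping yields a single bucket, matching the outcome described above.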
VII. CONCLUSION

Increasing business demand for reducing the time of development and deployment of new features to production fosters automation in software development, as can be seen in practices like Continuous Integration, Delivery and Deployment. Starting from compilation, through deployment, tests and debug – almost all stages of iterative development activities might be automated. The core of the automation best practices is the previously mentioned Continuous Integration, which tends to evolve into its extended versions: Continuous Delivery and Continuous Deployment. Complex systems with advanced validation processes (i.e. complex and well-developed automated tests on many levels) need an automated debugging process to reduce redundant work when analyzing failed test results (when a single root cause in the source code caused hundreds of tests to fail).

REFERENCES

[1] Forrester Consulting, "Continuous Delivery: A Maturity Assessment Model," 2013.
[2] D. Zima, "Modern Methods of Software Development," TASK Quarterly, vol. 19, no. 4, 2015.
[3] M. Fowler, "Continuous Integration," [Online]. Available: http://www.martinfowler.com/articles/continuousIntegration.html. [Accessed 3 October 2015].
[4] A. Miller, "A Hundred Days of Continuous Integration," in Agile 2008 Conference, 2008.
[5] M. Fowler, "ContinuousDelivery," 30 May 2013. [Online]. Available: http://martinfowler.com/bliki/ContinuousDelivery.html. [Accessed 28 November 2015].
[6] J. Humble and D. Farley, "Anatomy of the Deployment Pipeline," in Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation, Addison-Wesley Professional, 2010, pp. 105-141.
[7] J. Humble, "Continuous Delivery vs Continuous Deployment," 13 August 2010. [Online]. Available: http://continuousdelivery.com/2010/08/continuous-delivery-vs-continuous-deployment/. [Accessed 22 November 2015].
[8] "xUnit.net," [Online]. Available: http://xunit.github.io/. [Accessed 28 November 2015].
[9] M. Fowler, "PageObject," 10 September 2013. [Online]. Available: http://martinfowler.com/bliki/PageObject.html. [Accessed 12 February 2016].
[10] "Selenium," [Online]. Available: http://www.seleniumhq.org/. [Accessed 12 February 2016].
[11] R. Martin, "The test bus imperative: architectures that support automated acceptance testing," IEEE Software, vol. 22, no. 4, pp. 65-67, 2005.
[12] Y. Dang, R. Wu, H. Zhang, D. Zhang and P. Nobel, "ReBucket: A Method for Clustering Duplicate Crash Reports Based on Call Stack Similarity."
[13] "Using the !analyze Extension," [Online]. Available: https://msdn.microsoft.com/en-us/library/windows/hardware/ff560201. [Accessed 22 November 2015].
[14] K. Glerum, K. Kinshumann, S. Greenberg, G. Aul, V. Orgovan, G. Nichols, D. Grant, G. Loihle and G. Hunt, "Debugging in the (very) large: ten years of implementation and experience," in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, 2009.
[15] Zaigham Mushtaq and M. Rizwan Jameel Qureshi, "Novel Hybrid Model: Integrating Scrum and XP," IJITCS, vol. 4, no. 6, pp. 39-44, 2012.
Authors' Profiles

Henryk Krawczyk: Rector of the Gdansk University of Technology in 2008-2012 and 2012-2016, the dean of the Faculty of Electronics, Telecommunications and Informatics in 1990-1996 and 2002-2008, and also Head of the Computer Systems Architecture Department since 1997; received his PhD in 1976 and became a Full Professor in 1996. Current research interests: software development and testing methods, modeling and analysis of dependability of distributed computer systems including emergency situations, defining a new category of so-called approaching threats and determining effective and suitable detection and elimination procedures, analyzing the essential relationship between system components and their user behaviors, developing web services and distributed applications with usage of digital documents.
Dawid Zima: PhD student at the Gdansk University of Technology, Department of Computer Architecture, Faculty of Electronics, Telecommunications and Informatics. Received his engineering degree in 2012 and master's degree in 2013. Research interests: debug process automation, methods for bucketing error reports based on call stack similarities.
How to cite this paper: Henryk Krawczyk, Dawid Zima, "Automation in Software Source Code Development", International Journal of Information Technology and Computer Science (IJITCS), Vol.8, No.12, pp.1-9, 2016. DOI: 10.5815/ijitcs.2016.12.01
I.J. Information Technology and Computer Science, 2016, 12, 10-18 Published Online December 2016 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2016.12.02
Usability Evaluation Criteria for Internet of Things Michael Onuoha Thomas Asia Pacific University of Technology & Innovation (APU), Technology Park Malaysia, Bukit Jalil, 57000 Kuala Lumpur, Malaysia. E-mail: onuohathomas@gmail.com
Beverly Amunga Onyimbo and Rajasvaran Logeswaran Asia Pacific University of Technology & Innovation (APU), Technology Park Malaysia, Bukit Jalil, 57000 Kuala Lumpur, Malaysia. E-mail: beverlyamunga@gmail.com, logeswaran@apu.edu.my
Abstract—The field of usability, user experience (UX) design and human-computer interaction (HCI) arose in the realm of desktop computers and applications. The current experience in computing has radically evolved into ubiquitous computing over the preceding years. Interactions these days take place on different devices: mobile phones, e-readers and smart TVs, amid numerous smart devices. The use of one service across multiple devices is, at present, common with different form factors. Academic researchers are still trying to figure out the best design techniques for new devices and experiences. The Internet of Things (IoT) is growing, with an ever wider range of daily objects acquiring connectivity, sensing ability and increased computing power. Designing for IoT raises a lot of challenges; the obvious difference being the much wider variety of device form factors. IoT is still a technically driven field, thus the usability of many IoT products is, in some way, not yet of the level anticipated of mature consumer products. This study focuses on proposing a usability evaluation criterion for the generic IoT architecture and essential technological components.

Index Terms—Internet of Things (IoT), Usability evaluation, Devices, Quality, Systems.
I. INTRODUCTION

The Internet of Things (IoT) is the future of the Internet and an integrated part of the modern person's daily life. Statistics and trends in the field of Information Technology (IT) indicate that the Internet is now literally used for everything and by everyone. IoT systems are perceived to support various spectrums of users, technical and non-technical, in the same way [1]. The encroachment of ubiquitous computing technologies, like wireless networks and mobile devices, contributes significantly to the augmented availability of digital information and services in our day to day lives, transforming how people access and make use of them. The IoT is a technology that spreads digital resources to the real world, linking such resources with daily objects
by augmenting the latter with RFID (Radio Frequency Identification) or Near Field Communication (NFC) tags.

Usability has been increasingly recognized as a crucial quality dimension to arbitrate the success of interactive systems. According to [2], usability is "the competency of a product to be understood, learned, operated and is attractive to the users when they are used to achieving certain goals with effectiveness and efficiency in specific environments". The authors further expound that the usability of a product is habitually validated through its interfaces. To date, there are no defined usability dimensions or guidelines specifically meant for IoT devices. The guidelines available are particularly intended for desktop and web based applications and systems. The IoT tends to use mobile applications, in addition to web based and desktop applications, due to its mobile nature. To ensure that systems meet their expected quality performance, a number of usability guidelines have been introduced in the past [38]. These, however, are only generic rules that guide design and implementation for the IoT. Usability guidelines in this field are lacking, as usability here is relatively unexplored and unproven [3].

This study aims to address this issue by proposing a set of usability dimensions to be considered in the design and evaluation of IoT systems. This research also intends to review the existing usability guidelines for IoT systems, in order to identify and prioritize the usability dimensions based on their importance. The guidelines and usability evaluation criteria are proposed based on reviews of previous related studies and observations of current trends. The findings of this work could motivate the initial steps in the development and introduction of more targeted guidelines for IoT systems.

This paper is organized as follows: this section provides a brief introduction on the meaning of usability and IoT. Next, Section II explores related works in IoT as well as usability, to gain insight on how they are intertwined. Section III gives an in-depth review of IoT, its applicability and technological components. Section IV then explores the aspect of usability engineering and important usability evaluation criteria, as well as
the important requirements necessary for usability integration. Finally, a conclusion of the general overview of the research and recommendations for the proposed evaluation criteria are discussed in Section V.
II. RELATED WORKS

The principle and concept of Internet usage have witnessed a new revolution with exponential changes in operational abilities [4]. This change has a significant effect and impact on every aspect of our lives, and serves as a new frontier towards the future evolution of technology. [5] explains that the technological concept of IoT revolves around interconnectivity between objects within our surroundings; this is applicable to all scenarios of our lives, e.g. wearables, smart phones, TVs, etc. According to [6], the operational abilities of these smart devices should be easy and smooth for the users, and must function maximally at all times. Although the technological concept of IoT presents great opportunities, according to [7] and [8], there are still many open challenges that must be addressed to ensure smooth operability and applicability. There are currently no standardization and quality dimensions for IoT applications and devices [7], [9]. As such, [10] proposes a quality dimensional model, which must address issues related to usability, operability, reliability, responsiveness and personalization. The application of a usability engineering process in the development of IoT device interfaces would ensure smooth and easy to use operational systems for the users. Integrating usability prospects into the development of IoT devices also provides reliability assurance for the user based on the functionality of the system or device.
III. INTERNET OF THINGS (IOT)

This technological paradigm has over the years become a focal area of research at both the industrial and academic levels. The concept, which introduces a new era in computing, gives inanimate objects the capability of communicating. The term 'Internet of Things' (or IoT, as it is popularly abbreviated) was invented by Kevin Ashton in 1999, in the context of supply chain management [5]. In both the industrial and academic scenario, IoT has been defined in different aspects and scenarios depending on its application. The IoT, as stated by [11], is a technological advancement recognized as one of the most important aspects of future technology, with a forecast that this technology would evolve beyond 26 billion units by 2020 (up from 0.9 billion in 2009). According to [12], the IoT paradigm is based on "intelligent and self-configuring nodes (things) interconnected in a dynamic and global network infrastructure". It provides connectivity for anyone and anything at any given point in time. The basic
11
technological concept behind this paradigm, according to [13], is to give autonomous and secure connectivity between objects (i.e. things) with processing capability through exchange of data amongst real world devices and applications. It is characterized with several features comprising of the complex working environ ment, wide distribution of network segment, as well as a no specific standardized network topology [14]. So me basic technological components towards making the vision for connected objects a reality include radio frequency identification (RFID), sensors, actuators, mobile phones and other portable devices. A. Internet of Things (IoT) Scenario As a technological parad ig m with heterogeneous components, the IoT has over the years been inevitably applied and integrated into different scenarios in the modern environ ment [15]. According to [8], the IoT embraces the convergence of different technological components for sensing, connectivity, processing, as well as control, through application software. It enables vast and sophisticated service processing for tracking, composed of heterogeneous devices by creating communicat ion channels and also translating their functionality into useful service for the user. Examp les of major application do mains are described as depicted in Fig. 1. Smart homes: According to [15], ho me and office automation has been made possible due to technological advancement in the IoT as it decreases consumption of resources associated to building, e.g. water and electricity, and also improves the standard of living for hu mans. Sensors are constantly integrated into homes and office equipment for easy monitoring of resources and users‘ needs [8]. A notable instance of automation in ho mes and office includes internet enabled television control, air conditioning automation, switching on/off of lighting, etc. Transportation: As suggested in [16], a notable application area of IoT technology in transportation is in the vehicle anti-theft t racking system. This technology gives the user the ability to co mfortably monitor transitions of the stolen vehicle with options for visualizing the process through a GPS software- enabled application. Smart transportation enables logistical tracking of assets irrespective of its location and environment, and also provides an avenue in which these assets communicate with the users through chips embedded in them [17]. Healthcare: Th is is an important domain in which the IoT actively enhances productivity and improves service delivery [5], [18]– [20]. It provides a systematic mechanis m in wh ich track records of human health can be analyzed, monitored and also provide emergency services to individuals in need as a result of actively tracking the health activity of the individual. In hospitals, RFID technology (an important driving force in the IoT) has been actively deployed into medical equip ment for easy tracking.
I.J. Information Technology and Computer Science, 2016, 12, 10-18
12
Usability Evaluation Criteria for Internet of Things
Fig.1. End-user application scenarios of IoT [4]
B. Internet of Things (IoT) Architecture

Research in [7]-[8] and [18]-[20] suggests that the IoT is composed of four fundamental layers: perception, network, middleware and application, as described below:

Perception layer: In [13], this layer is described as the device layer. It constitutes the physical objects as well as the microprocessor chips and sensors. It comprises the RFID tags, sensors, bar-codes, Bluetooth, infrared technology, etc. This layer deals with giving identity to objects as well as collecting identity and vital data through the sensors, which are passed to the network layer to provide secure data processing and transmission.

Network layer: Generally characterized as the fundamental enabling platform for communication between devices in the IoT paradigm, which relays and transmits object (thing) status and data [24]. It consists of both wired and wireless access communications, commonly using the cellular, WiFi, microwave and satellite infrastructures.

Middleware layer: This basically comprises information gathering and intelligent processing. It bridges the gap between the eccentric devices, which include the software control components of the IoT, the cloud management platform, data centers and control centers [25].

Application layer: This layer is characterized as the application support that enables developers to easily carry out authentication and certification of device management for the end users.

C. Essential Technological Components of IoT

According to [5]-[6], [19]-[20] and [22], the following are the essential components that make the IoT paradigm a possibility:

Radio Frequency Identification (RFID):

Fig.2. Schematic representation of RFID [52]

This technological component is characterized as one of the enabling components of the IoT, as described in [11]. It allows for autonomous identification of objects (things) as well as data capture of fundamental information about the objects through radio waves, tags and a reader. Fig. 2 shows a schematic representation of
RFID. These sensor-like chips provide identity to anything they are attached to and also enable the integration of objects into a broad network domain [12], [21]. This principle serves as a pivotal enabler for the technological component to continuously acquire status information about its environment. They play an important role in bridging the gap between the physical world and the information processing world. According to [5] and [18], there are three types of RFID tags: (1) Active RFID tags – this type of RFID tag is battery-enabled, with the capability of efficiently communicating with the reader. The energy supplied to the tag helps to initiate communication with the reader. (2) Passive RFID tags – this type of tag solely depends on the radio frequency energy that is
transferred from the reader to the tag, for power and communication. (3) Semi-passive RFID tags – this type of tag has an embedded battery that powers the microchip, while communication is powered by the reader.
Wireless Sensor Network (WSN):
This technological component of the IoT allows for different network topologies and multi-hop communication between embedded objects. It consists of a spatially autonomous distribution of sensors / RFID-enabled devices that continuously monitor environmental as well as physical conditions such as basic status, locations or transitions of embedded objects [11], [27]. A WSN architectural representation is given in Fig. 3.
Fig.3. Architectural representation of WSN [27]
Middleware:
It is a mechanism in which the cyber infrastructure, service oriented architecture and sensor network interpose, in order to provide access to the IoT's heterogeneous sensor resources [5]. It is based on the concept of isolating resources that can be utilized by different software applications. It is characterized as a software layer [11] in the IoT paradigm, which interposes between the functional control of the software applications and enables communication of the basic input and output of data.
Cloud computing:
As described in [28], cloud computing creates enabling ways in which IoT systems are designed, developed, tested, deployed and maintained on the Internet. It applies a utility model that defines how embedded systems consume computing resources (e.g. storage). Cloud computing is defined as a technological concept that provides huge data streams and storage capabilities to everything or anything with processing capability [29], [30]. It is considered a complex system of distributed parallel computing, combining computation with network and virtualization technology. It serves as a backbone of the IoT because of its capability to provide back-end solutions for handling the enormous data streams emanating from various IoT devices. Analyzing the technological concept of the IoT [11], [31], many of its applications demand enormous amounts of data storage and high processing speed in order to enable real-time communication, as well as easy decision-making processing. Cloud computing for the IoT is capable of providing on-demand access to a shared pool of configurable resources and devices, i.e. networks, servers, wearables, storage and applications, to meet the required needs. An example of an end-to-end model of interaction in a cloud centric IoT paradigm is given in Fig. 4.
Application software:
The application software of the IoT is considered a platform for communication. It enables easy machine-to-machine and human-to-machine communication.
Fig.4. An end-to-end model of interaction in cloud centric IoT paradigm [5]
A notable example of IoT application software is GPS tracking software for transportation. This software provides a platform for interaction, monitoring and communication involving objects in transit, i.e. monitoring of assets and vehicle movements.
IV. USABILITY ENGINEERING
Usability engineering is the branch of engineering concerned with the design and development of system interfaces with high usability, a need that arose due to the emergence of complex and sophisticated systems with more interactive interfaces and a wide range of inexperienced users [32], [33]. Usability engineering processes and techniques are geared towards improving existing or intended systems, and usability is considered one of the most important focus areas during development, as noted in [34]. It is also characterized as an important quality attribute that every system intended for humans must adhere to, i.e. enabling a user-friendly operational system. As described in [35], how user-friendly a system is should be judged against the objective of the system. This is in accordance with ISO 9241-11, which defines usability as "the extent to which a product can be used by specified users in achieving specified goals with efficiency, effectiveness and also satisfaction in the specified context of use", and "the quality which characterizes the functional use of a program and application" [36]. Fig. 5 describes the relationship between the important components taking part in the process of usability. Applying the principles of usability engineering criteria, processes or techniques, according to [32], [36]-[38], ensures that every system conforms to specifications and is fit for purpose for the particular group of user(s) for which the system is created. It should also comprise a broad range of easy-to-use functionalities.
Fig.5. Usability framework [36]
A. Usability Evaluation Criteria for IoT
Fig.6. Usability evaluation criteria [9], [35]-[36]
Generally, the usability criteria serve as a basis for determining the ease of use of a computing system. They are
geared towards ensuring the efficiency, effectiveness and satisfaction of computing systems and devices for the users. The criteria shown in Fig. 6 represent the basic fundamental criteria all interactive computing systems or devices must possess to assure ease of use for the users [9], [36]. According to [9], these are also applicable to the IoT.
1) Flexibility:
This characteristic describes the generalized abstraction between the various IoT application logic and the system interface [35]. An IoT system is characterized as a heterogeneous system; therefore its components should have the ability to be integrated or incorporated into a new environment without any disruption of service to the user. Every computing system is expected to offer different ways of executing the same task [36], in order to enable users to choose and adapt to the most suitable functionality based on preference and necessity.
2) Operability:
Since the IoT consists of interconnected heterogeneous components, it is pertinent that they all function towards the achievement of the users' specifications [39]-[40]. Consistency in functionality is expected, so that the users of these systems or devices become familiar with the important functional commands. The adoption of important usability standards, according to [36], would help prevent complications in navigating the user interface of IoT devices.
3) Learnability:
According to [41], ease of learning is a major quality characteristic every system should possess. In order to achieve this characteristic, [32] and [42] state that the system should support task(s) that fit the users' way of life. It ensures that obscure technical terms, elements or icons that are not familiar to the user are kept to a minimum - this principle ensures that the functionality is not misinterpreted, to avoid leading users astray. It is expected that an IoT system or component prevents wrong conclusions and irrelevant content, and minimizes the use of complex commands by allowing as few actions or functions as possible to perform a task. As noted in [36], complex tasks hinder learning and increase the possibility of errors.
4) Understandability:
It is generally expected that every system works consistently in terms of functionality, as this enables the user to become familiar with the basic composition of the system [36]. Functionality, as depicted in [35], describes the quality of an IoT system of being designed, developed and deployed so that it serves its purposes well for the prescribed duration for which it was manufactured. Considering the dynamic nature of our environment, it is expected that all IoT systems and devices are easy to learn in the deployment environment [7], [15]. To achieve ease of usage and
learnability, IoT systems or devices must be free from ambiguous technical terms or commands that are not widely known by the users [36], [43]. This is to avoid misinterpretation of functionality that might lead to adverse damage.
B. Achieving Usability Criteria in IoT Systems
From the literature on the IoT and the basic usability criteria, the scholars in [16] and [32] suggest that, to achieve the basic fundamental usability criteria in a system, some requirements must be considered when integrating the basic usability criteria into the development and deployment of IoT systems or devices. This is necessitated by the dynamic nature of our surroundings and by the fact that the user's perspective on, and satisfaction with, the system is important. The rationale behind requirement specification is to keep the fundamental stakeholders in the IoT technological paradigm focused on the goals or purpose of designing or developing the system. These requirements serve as prerequisites for quality assurance, or as a checklist to ensure that the basic functionality of a system strictly adheres to the users' needs and preferences. According to [13], [34]-[36], [38], in deploying any system (e.g., IoT), user requirements, data requirements, functional requirements, and environmental requirements must be taken into account. These are elaborated below.
1) Data Requirements
According to [19], the "internet of things is a structure in which objects and people are provided with unique identities with capabilities of relocating any data through a network without requiring any two-way handshaking between human-to-human, i.e. source-destination, or human-to-systems interaction". The IoT is a system of heterogeneously connected components with many functional and non-functional components and sub-components. These heterogeneous components constantly transmit data about their location and status [31], [44]. The transmitted data requires enormous amounts of data storage. As a necessity, the data requirement of any proposed IoT device or system should be an essential factor to be deliberated, to ensure that the usability of the system is properly addressed and that loss of data during transmission is prevented. Data loss causes disruption in service, which can be frustrating for the user [45]. Data requirements assist the developers in comprehending how the data will be collected and used, when planning and developing the database(s) whose functionality supports the information flow. During data / information gathering, it is imperative to be aware of the storage procedure and the format of the information / data, so that it can be utilized efficiently as well as in several ways [46].
2) Environmental Requirements
According to [4], the development of trusted, secure, reliable and interoperable usability criteria in the computing environment of the IoT requires technologies
that ensure the flexibility and scalability of the system, which gives the IoT system, with its diverse components, a robust operational environment for its processes. As discussed in [4], IoT systems operate around us and are sometimes attached to the human body. This requires their operational capabilities to fall within the environment in which they will be deployed. In [22], it is considered necessary for environmental factors to be taken into account so that reliable systems are deployed to the users' environment, with environmental parameters like humidity, temperature and illuminance determined to ensure that the system stays within the boundaries it was created for. According to [47], the system's operational environment determines its operability; when a built system fails to withstand its environmental conditions, failure occurs, which might create panic or discomfort among the users.
3) Functional Requirements
In order to support users with a list of features, it is necessary to give a description of their functionality. The workflow is the sequence of steps that users ought to take in order to complete a task. Determining the interaction between one feature and another is also vital, because most times the interaction between two or more features may generate new design problems related to functionality. According to [1] and [47], the IoT consists of basic intelligent devices (i.e. sensors), embedded processors and many forms of connectivity components. These components are considered functional elements, which enable business and human intelligence. The data transmitted from sensors and other intelligent objects serve as input to various decision-making processes for users as well as for business and customer management [1], [48]. In [50], the functionality of a system determines user acceptance. Through benchmarking, the fundamental functional requirements of a system are enforced; this lets the IoT system's functions be based on specifications, doing only what it is intended or built for.
4) User Requirements
This defines how the physical and cognitive needs of the envisioned users of the system are met. It is vital for users to be able to comfortably and effectively make use of an interface to realize the goals it has been designed to support. User requirements can be specified when the users of the system interface and the environment in which it will be used are clearly defined. According to [38], with regard to usability engineering, improving the usability of a system requires a standard benchmark for design, development, implementation and deployment. In [23], it was suggested that a draft technically oriented reference document be provided with the aim of representing the standardization of requirements for IoT systems or applications. A contextual task analysis method can be used to gain insight into how users expect to use the interface, by keenly scrutinizing the way they presently carry out tasks that are similar to what the interface will support. In the situation where the number of targeted users with disabilities (e.g. impaired vision, hearing problems or limited motor skills) is high, one may be forced to come up with an interface design that supports accessibility tools. According to [20], the IoT should have unique features and diverse requirements, be time tolerant, have secure connections, good monitoring and low energy consumption. These service requirements ensure that the user utilizes the system's full potential without disruption in service. The reliability of the IoT system provides trust and confidence during usage [9], [51].

V. CONCLUSION
As technological trends continue to expand, especially for user-enabled technologies, there is a continuing need to implement standards that ensure systems comply with their specifications. The IoT technological paradigm has over the years become a worldwide phenomenon, with most IoT systems exposing user interfaces, e.g. smart watches and application control points in smart phones. Usability engineering processes ensure that certain standards are met and guidelines adhered to, because the likelihood of affecting the user's perspective, way of life and feelings is high. The usability evaluation criteria discussed in this paper reflect previous literature on the basic fundamental criteria that all intelligent computing systems and devices, including the IoT, must possess. In order to properly implement these criteria in the technological paradigm of the IoT, the stated requirements should be met.

REFERENCES
[1] A. J. Choi, "Internet of Things: Evolution towards a Hyper-Connected Society," 2014.
[2] R. Baharuddin, D. Singh, and R. Razali, "Usability dimensions for mobile applications - a review," Res. J. Appl. Sci. Eng. Technol., vol. 5, no. 6, pp. 2225-2231, 2013.
[3] R. Yáñez Gómez, D. Cascado Caballero, and J.-L. Sevillano, "Heuristic evaluation on mobile interfaces: a new checklist," ScientificWorldJournal, vol. 2014, pp. 1-20, 2014.
[4] R. D. Sriram and A. Sheth, "Internet of Things Perspectives," IT Prof., vol. 17, no. 3, pp. 60-63, 2015.
[5] J. Gubbi, R. Buyya, S. Marusic, and M. Palaniswami, "Internet of Things (IoT): A vision, architectural elements, and future directions," Futur. Gener. Comput. Syst., vol. 29, no. 7, pp. 1645-1660, 2013.
[6] R. Baharuddin, D. Singh, and R. Razali, "Usability dimensions for mobile applications - a review," Res. J. Appl. Sci. Eng. Technol., vol. 5, no. 6, pp. 2225-2231, 2013.
[7] L. Atzori, A. Iera, and G. Morabito, "The Internet of Things: A survey," Comput. Networks, vol. 54, no. 15, pp. 2787-2805, 2010.
[8] D. Miorandi, S. Sicari, F. De Pellegrini, and I. Chlamtac, "Internet of things: Vision, applications and research challenges," Ad Hoc Networks, vol. 10, no. 7, pp. 1497-1516, 2012.
[9] T. Frühwirth, L. Krammer, and W. Kastner, "Dependability demands and state of the art in the internet of things," IEEE Int. Conf. Emerg. Technol. Fact. Autom. (ETFA), vol. 2015-October, 2015.
[10] P. Martin and K. Brohman, "CLOUDQUAL: A Quality Model for Cloud Services," IEEE Trans. Ind. Informatics, vol. 10, no. 2, pp. 1527-1536, 2014.
[11] I. Lee and K. Lee, "The Internet of Things (IoT): Applications, investments, and challenges for enterprises," Bus. Horiz., vol. 58, no. 4, pp. 431-440, 2015.
[12] A. Botta, W. de Donato, V. Persico, and A. Pescapé, "Integration of Cloud computing and Internet of Things: A survey," Futur. Gener. Comput. Syst., vol. 56, pp. 684-700, 2014.
[13] R. Khan, S. U. Khan, R. Zaheer, and S. Khan, "Future internet: The internet of things architecture, possible applications and key challenges," Proc. 10th Int. Conf. Front. Inf. Technol. (FIT 2012), pp. 257-260, 2012.
[14] L. Yong-Fei and T. Li-Qin, "Comprehensive Evaluation Method of Reliability of Internet of Things," 2014 Ninth Int. Conf. P2P, Parallel, Grid, Cloud Internet Comput., pp. 262-266, 2014.
[15] L. Coetzee and J. Eksteen, "The Internet of Things - Promise for the Future? An Introduction," Conf. Proc., pp. 978-1, 2011.
[16] Z. Liu, A. Zhang, and S. Li, "Vehicle anti-theft tracking system based on Internet of things," Proc. 2013 IEEE Int. Conf. Veh. Electron. Saf., pp. 48-52, 2013.
[17] I. M. Almomani, N. Y. Alkhalil, E. M. Ahmad, and R. M. Jodeh, "Ubiquitous GPS vehicle tracking and management system," 2011 IEEE Jordan Conf. Appl. Electr. Eng. Comput. Technol. (AEECT 2011), 2011.
[18] C. Rahmani, A. Azadmanesh, and H. Siy, "Architecture-Based Reliability Modeling of Web Services Using Petri Nets," 2010 IEEE 12th Int. Symp. High Assur. Syst. Eng., pp. 164-165, 2010.
[19] A. W. Burange and H. D. Misalkar, "Review of Internet of Things in development of smart cities with data management & privacy," Proc. 2015 Int. Conf. Adv. Comput. Eng. Appl. (ICACEA 2015), pp. 189-195, 2015.
[20] M. De Sanctis, E. Cianca, G. Araniti, I. Bisio, and R. Prasad, "Internet of Remote Things," vol. 3, no. 1, pp. 113-123, 2016.
[21] L. Tan, "Future internet: The Internet of Things," 2010 3rd Int. Conf. Adv. Comput. Theory Eng., pp. V5-376 - V5-380, 2010.
[22] S. Bin, Z. Guiqing, W. Shaolin, and W. Dong, "The development of management system for Building Equipment Internet of Things," 2011 IEEE 3rd Int. Conf. Commun. Softw. Networks, pp. 423-427, 2011.
[23] W. Pollard, "Internet of Things: European Research Cluster on the Internet of Things," 2015.
[24] J. Rui and S. Danpeng, "Architecture Design of the Internet of Things Based on Cloud Computing," 2015 Seventh Int. Conf. Meas. Technol. Mechatronics Autom., pp. 206-209, 2015.
[25] S. Frost, "Internet of Things," vol. 07, 2015.
[26] X. Li, R. Lu, X. Liang, X. Shen, J. Chen, and X. Lin, "Smart community: An internet of things application," IEEE Commun. Mag., vol. 49, no. 11, pp. 68-75, 2011.
[27] A. Mihailovic, M. Simeunovi, N. Leki, and M. Pejanovi, "A strategy for deploying diverse sensor-based networks as an evolution towards integrated Internet of Things and Future Internet," pp. 23-26, 2014.
[28] J. Zhou, T. Leppanen, E. Harjula, M. Ylianttila, T. Ojala, C. Yu, and H. Jin, "CloudThings: A common architecture for integrating the Internet of Things with Cloud Computing," Proc. 2013 IEEE 17th Int. Conf. Comput. Support. Coop. Work Des. (CSCWD 2013), pp. 651-657, 2013.
[29] M. I. Alam, M. Pandey, and S. S. Rautaray, "A Comprehensive Survey on Cloud Computing," I.J. Inf. Technol. Comput. Sci., vol. 02, no. 02, pp. 68-79, 2015.
[30] T. Liu, "Application of Cloud Computing in the Emergency Scheduling Architecture of the Internet of Things," 2015.
[31] T. Y. Wu, G. H. Liaw, S. W. Huang, W. T. Lee, and C. C. Wu, "A GA-based mobile RFID localization scheme for internet of things," Pers. Ubiquitous Comput., vol. 16, no. 3, pp. 245-258, 2012.
[32] N. Bevan, "Measuring usability as quality of use," Softw. Qual. J., vol. 4, no. 2, pp. 115-130, 1995.
[33] N. Zeni and L. Mich, "Usability issues for systems supporting requirements extraction from legal documents," 2014 IEEE 7th International Workshop on Requirements Engineering and Law (RELAW 2014) Proceedings, 2014.
[34] C. Eliasson, M. Fiedler, and I. Jørstad, "A criteria-based evaluation framework for authentication schemes in IMS," Proc. Int. Conf. Availability, Reliab. Secur. (ARES 2009), pp. 865-869, 2009.
[35] O. Gioug, K. Dooyeon, K. Sangil, and R. Sungyul, "A quality evaluation technique of RFID middleware in ubiquitous computing," Proc. 2006 Int. Conf. Hybrid Inf. Technol. (ICHIT 2006), vol. 2, pp. 730-735, 2006.
[36] V. Nassar, "Common criteria for usability review," Work, vol. 41, no. SUPPL.1, pp. 1053-1057, 2012.
[37] U. O. Nwokedi, B. A. Onyimbo, and B. B. Rad, "Usability and Security in User Interface Design: A Systematic Literature Review," Int. J. Inf. Technol. Comput. Sci., vol. 8, no. 5, pp. 72-80, 2016.
[38] T. Jokela, "Assessments of usability engineering processes: experiences from experiments," 36th Annu. Hawaii Int. Conf. Syst. Sci. 2003 Proc., p. 9 pp., 2003.
[39] C. Prehofer, "From the internet of things to trusted apps for things," Proc. 2013 IEEE Int. Conf. Green Comput. Commun. / IEEE Internet Things / IEEE Cyber, Phys. Soc. Comput. (GreenCom-iThings-CPSCom 2013), pp. 2037-2042, 2013.
[40] N. Maalel, E. Natalizio, A. Bouabdallah, P. Roux, and M. Kellil, "Reliability for emergency applications in internet of things," Proc. IEEE Int. Conf. Distrib. Comput. Sens. Syst. (DCoSS 2013), pp. 361-366, 2013.
[41] N. Nikmehr, "Content y Usability," pp. 347-351, 2008.
[42] S. Jimenez-Fernandez, P. De Toledo, and F. Del Pozo, "Usability and interoperability in wireless sensor networks for patient telemonitoring in chronic disease management," IEEE Trans. Biomed. Eng., vol. 60, no. 12, pp. 3331-3339, 2013.
[43] R. A. Canessane and S. Srinivasan, "A Framework for Analysing the System Quality," pp. 1111-1115, 2013.
[44] I. Lee and K. Lee, "The Internet of Things (IoT): Applications, investments, and challenges for enterprises," Bus. Horiz., vol. 58, no. 4, pp. 431-440, 2015.
[45] A. J. Jara, Y. Bocchi, and D. Genoud, "Social internet of things: The potential of the internet of things for defining human behaviours," Proc. 2014 Int. Conf. Intell. Netw. Collab. Syst. (IEEE INCoS 2014), pp. 581-585, 2015.
[46] P. Patel, A. Pathak, T. Teixeira, and V. Issarny, "Towards application development for the internet of things," Proc. 8th Middlew. Dr. Symp., pp. 1-6, 2011.
[47] J. Kim and J.-W. Lee, "OpenIoT: An open service framework for the Internet of Things," 2014 IEEE World Forum Internet Things, pp. 89-93, 2014.
[48] W. Zhang, "Study about IOT's Application in 'Digital Agriculture' Construction," Inf. Sci. (Ny), pp. 2578-2581, 2011.
[49] Y. Xu, Y. Wang, X. Gao, and S. Zhang, "Product Development Process Improvement Approach Based on Benchmarking," Electronics, pp. 1-4, 2010.
[50] M. H. Abdallah, "A quality assurance model for an information system development life cycle," Int. J. Qual. Reliab. Manag., vol. 13, no. 7, pp. 23-35, 1996.
[51] L. Yong-Fei and T. Li-Qin, "Comprehensive Evaluation Method of Reliability of Internet of Things," 2014 Ninth Int. Conf. P2P, Parallel, Grid, Cloud Internet Comput., pp. 262-266, 2014.
[52] S. Thomas, G.E., "Dual RFID-ZigBee Sensor enable NFC application for internet of things," 2012. [Online]. Available: http://www.electronicssourcing.com/2012/03/28/dual-rfid-zigbee-sensorsenable-nfc-applications-for-the-internet-of-things/
Authors' Profiles

Michael Onuoha Thomas received his B.Sc. degree in Computer Science from Caritas University, Nigeria in 2012 and is currently pursuing his M.Sc. degree in Software Engineering at Asia Pacific University of Technology and Innovation under the Staffordshire University franchised program. His research interests include software security, reliability engineering, software development process modeling, software project management, fog computing, cloud computing and the internet of things.

Beverly Amunga Onyimbo was awarded her Bachelor's degree by Kabarak University, Kenya in 2012 and is currently pursuing her M.Sc. degree in Software Engineering at Asia Pacific University of Technology and Innovation under the Staffordshire University franchised program. Her current research interests include User Experience and User Interface (UX & UI) design, mobile application development, human-computer interaction, requirements engineering, computer communication networks, artificial intelligence, fog computing, internet of things and information security.

Rajasvaran Logeswaran studied his B.Eng (Hons) Computing at Imperial College London, United Kingdom, and completed his M.Eng.Sc. as well as Ph.D. at Multimedia University, Malaysia. He is a Novell Certified Linux Professional, and a certified IC Digital Citizen and trainer. His areas of interest include multimedia data processing, data compression, neural networks, natural user interfaces and big data, with over a hundred publications in books, peer-reviewed journals and international conference proceedings. He has been a recipient of several scholarships, including Telekom Malaysia, the JCS 75th Anniversary Scholar, the Brain Gain Malaysia international fellowship and post-doctoral programme, as well as the Brain Korea21 post-doctoral grant. A Senior Member of the IEEE, he is the Secretary of the IEEE Signal Processing Society Malaysia chapter and a reviewer of numerous journals and conferences.

How to cite this paper: Michael Onuoha Thomas, Beverly Amunga Onyimbo, Rajasvaran Logeswaran, "Usability Evaluation Criteria for Internet of Things", International Journal of Information Technology and Computer Science (IJITCS), Vol.8, No.12, pp.10-18, 2016. DOI: 10.5815/ijitcs.2016.12.02
I.J. Information Technology and Computer Science, 2016, 12, 19-26 Published Online December 2016 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2016.12.03
A Survey On Emotion Classification From EEG Signal Using Various Techniques and Performance Analysis
M. Sreeshakthy, P.G. Scholar, Department of Computer Science, Anna University Regional Centre, Coimbatore. E-mail: m.sribtechit@gmail.com
J. Preethi Department of Computer Science, Anna University Regional Centre Coimbatore. E-mail: preethi17j@yahoo.com.
A. Dhilipan P.G Scholar, Department of Computer Science, Anna University Regional Centre, Coimbatore . E-mail: adhilipbe@gmail.com
Abstract—In this paper, human emotions are analyzed from the EEG (electroencephalogram) signal under different kinds of situations. Emotions are very important in many activities and in decision making. Various feature extraction techniques such as the discrete wavelet transform, higher order crossings and independent component analysis are used to extract the relevant features. Those features are used to classify the emotions into different groups, such as arousal and valence levels, using different classification techniques such as neural networks, support vector machines, etc. Based on these emotion groups, the performance and accuracy of the classification are determined.

Index Terms—Feature Extraction; Classification; Valence and Arousal; Neural Networks.
I. INTRODUCTION
Emotions play an important role in several activities such as decision making, cognitive processes and human-computer interfaces. Based on emotion, Human-Human Interaction (HHI) and Human-Machine Interaction (HMI) play an important role in affective computing [2]. Emotions can be determined in various ways. The first kind of approach focuses on facial expressions or speech; audio-visual based techniques are used to detect the emotion. The second kind of approach focuses on peripheral physiological signals; different emotional states have been identified using the electrocardiogram and skin conductance. The third approach uses the EEG signal. The EEG signal is used to extract human emotion because facial expression cannot be relied upon to find the exact emotions: people can choose not to show their feelings openly, and they can act in front of a camera [3]. But the EEG reflects the accurate feelings of a particular person. The emotions may be happy, sad, fear, disgust, etc.; these are the basic emotions, which are used to identify mental stress and mental disorders. In the human brain, each cell performs particular functions; for example, the occipital lobes perform visual tasks and the temporal cells perform auditory tasks. EEG power decreases during sadness and increases during happiness. The region that shows the difference between sadness and happiness is the frontal pole, with left CBF being higher during sadness and lower during happiness [1]. So we can identify positive and negative emotions from past experience. The emotions can be classified in two ways, using implicit memory and explicit memory: with implicit memory, emotion and decision-making analysis considers only the present incident, while explicit memory is used to analyze past experience [1]. The basic brain images are shown in Fig. 1, which presents the basic brain structure and its parts for storing and processing information.
Fig.1. Brain Configuration
EEG signals are captured from the brain activity, and the signal has to be preprocessed; after preprocessing
those signals, features have to be extracted and the emotions classified. A basic EEG signal is shown in Fig. 2. The figure shows a sample EEG signal of the kind used to analyze human emotion for cognitive thinking and decision-making processes.
Fig.2. Sample EEG Signal
In the rest of the paper, Section II deals with how the data are collected and organized, and with the different kinds of bands available in EEG data. Section III describes the methods: preprocessing of the EEG signal for analysis, feature extraction and emotion classification, together with the related work. A sample result is discussed in Section IV, the performance and accuracy of the classification are discussed in Section V, and the conclusion is given in Section VI.
Table 1. EEG - Different Band Levels

BAND     FREQUENCY RANGE    LOCATION
DELTA    0-4 Hz             Frontal Lobe
THETA    4-7 Hz             Midline, Temporal
ALPHA    8-13 Hz            Frontal, Occipital
MU       8-12 Hz            Central
BETA     13-30 Hz           Frontal, Central
GAMMA    30-100 Hz          -
Based on these bands, the different emotions and their related frequencies are recorded and classified using different classification methods, and the accuracy and performance of each method are analyzed. The overall structure of the EEG processing for emotion classification is shown in Fig. 3, which presents the flow of steps for emotion classification (EEG data, preprocessing, feature extraction, dimensionality reduction, feature smoothing and classification).
II. EEG DATA
The EEG data is collected from healthy subjects. The subject is seated in the experimental room in front of the system and asked to fill in a set of questions; after finishing that process, electrodes are placed on the scalp. The EEG is recorded using the BIMEC from Brainmaker BV [1]. The BIMEC has a reference channel and eight EEG sample channels at 250 Hz. The EEG signal is collected during several activities. It can be recorded under different factors, such as subject-elicited vs. event-elicited, laboratory setting vs. real world, focus on the expression vs. the feeling of the emotion, openly recorded vs. hidden recording, and emotion-purpose vs. other purpose [3]. The emotions are captured with one minute eyes closed and eyes open, and are also recorded with different kinds of picture-related stimuli. The pictures are collected from the International Affective Picture System (IAPS) and the sounds from the International Affective Digitized Sounds (IADS) [1]. The signal has different band levels, namely alpha, beta, gamma and theta. Each band stores particular information about the emotions, as shown in Table 1, which contains the different band details, their frequency ranges and their locations in the brain.
Fig.3. Overall structure
III. EEG DATA ANALYSIS METHODS
EEG signal analysis is based on three main steps: EEG preprocessing, feature extraction, and classification of emotions. The signal preprocessing methods and the various feature extraction and classification techniques are shown in Fig. 4.
A. EEG Preprocessing
Preprocessing is the process of removing noise from the signal. Since the EEG signal is recorded over the scalp, it may contain different kinds of artifacts: power line noise, heartbeat, ocular artifacts [4]. These noises are removed from the EEG using surface Laplacian filtering, spectral filtering, etc. A band-pass filter is used to extract the signal from 4 to 40 Hz together with surface Laplacian filtering.
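The 4-40 Hz band-pass step mentioned above can be sketched as follows. This is only an illustration written with SciPy, with an assumed 250 Hz sampling rate and a single channel; it is not the preprocessing code used in the surveyed studies.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass_eeg(signal, fs=250.0, low=4.0, high=40.0, order=4):
    """Zero-phase Butterworth band-pass filter for one EEG channel."""
    nyq = 0.5 * fs
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    return filtfilt(b, a, signal)

# Example: filter 10 s of a simulated single-channel recording.
raw = np.random.randn(2500)          # stand-in for a real EEG trace
clean = bandpass_eeg(raw, fs=250.0)  # keeps roughly the 4-40 Hz content
```

Zero-phase filtering (filtfilt) is chosen here only so that the filter does not shift the signal in time; a real pipeline might combine this with spatial filtering such as the surface Laplacian described above.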
fingerprint recognition.
Related Works for Discrete Wavelet Transformation: DWT covers time-scale signal analysis, signal decomposition and signal compression [6]. Murugappan et al. and Mohan Kumar et al. [4, 7] describe the wavelet transform as a non-parametric feature extraction method based on multi-resolution analysis. The time-frequency resolution is obtained by the wavelet transform, after which a wavelet function has to be chosen:
$\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}}\,\psi\!\left(\frac{t-b}{a}\right)$   (1)
In Xiao-Wei Wang et al. [5], the wavelet function is chosen based on its time-localization properties. This function decomposes the signal into different time-frequency components. The wavelet function captures the different shapes at a particular time and frequency, and then the total energy has to be estimated:
$P_j = \dfrac{E_j}{E_{total}}$   (2)

Fig.4. Various approaches in emotion classification - feature extraction methods
Then the wavelet entropy is calculated; it can be defined as

$W = -\sum_{j=1}^{n} P_j \ln P_j$   (3)
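A minimal sketch of the relative-energy and wavelet-entropy features of Eqs. (2)-(3), written with the PyWavelets library; the choice of the db4 wavelet and five decomposition levels is an assumption for illustration, not taken from the cited works.

```python
import numpy as np
import pywt

def wavelet_features(signal, wavelet="db4", level=5):
    """Return per-level relative energies P_j and the wavelet entropy W."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)  # [A5, D5, ..., D1]
    energies = np.array([np.sum(c ** 2) for c in coeffs])
    p = energies / energies.sum()                        # Eq. (2)
    entropy = -np.sum(p * np.log(p + 1e-12))             # Eq. (3), small guard for log(0)
    return p, entropy

p, w = wavelet_features(np.random.randn(2500))  # placeholder for a filtered EEG segment
```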
B. Feature Extraction Feature extract ion is the process of analysis the characteristics of waves and extracts the useful informat ion bearing feature which was used for pattern classification. The main aim of the feature extract ion is used to analyze the raw signal [7].There are Several feature extraction techniques like DWT, HOC, PCA, STFT, SB etc.
DWT is a linear signal processing which is applied to the particular data. The data has the same length, and then apply these techniques to data reduction. It has the following steps...
The Length of L, input vector must power of 2 Apply data smoothing (sum or weighted average) and perform a weight difference wh ich gives detail feature about particular data After applying this function the resulting output is L/ 2 wh ich has low frequency or high frequency respectively. These functions applied to the resulting dataset obtain the length of 2. Select values from data obtained apply to the coefficient of transformation.
The DWT can be prov iding the best result and it was applied for data cleaning, analysis, time series data, and Copyright © 2016 MECS
Mohan Kumar et al.,[7] Ext ract wavelet energy coefficient g ives a representation of EEG s ignal in t ime and frequency. It has deco mposed the different level of frequency signals. Table 2 shows that the details of the decomposed signals [8]. T able 2. Frequency corresponding to different level of decomposition
Discrete Wavelet Transformtion:
(3)
i 1
FREQUENCY RANGE
DECOMP OSITION LEVEL
BAND
FREQUENCY BAND WIDTH(Hz)
0 -(fs/2J+1)
A5
T heta
0-4
D5
Delta
4-8
D4
Alpha
8-16
D3
Beta
16-32
D2
Gama
32-64
(2fs/2J+1) (3fs/2 J+1) (2fs/2J+1) (4fs/2 J+1) (2fs/2J+1) (5fs/2 J+1) (2fs/2J+1) (6fs/2 J+1)
A5 decomposition is the in the range (0-4) Hz,D5 decomposition is in the range (4-8) Hz,D4 decomposition are in the range (8-12) Hz, D3 decomposition is in the range (12-32) Hz.. Different Level of deco mposition and its frequency levels are shown in Table 2. DWT is used to analysis the EEG signal for classification using mathematical steps.[6] wavelet functions can be derived fro m the dilation equation and also form the wavelet
I.J. Information Technology and Computer Science, 2016, 12, 19-26
22
A Survey On Emotion Classification From Eeg Signal Using Various Techniques and Performance Analysis
transform coefficient fro m the matrix with Nonzero elements and each row corresponding to the own dilation Co efficient. It defines the special filter, which can be used for analysis signals. Those filter features are used for classification. Statstical Based Feature It was applied to the phys iological signal like EEG, MRI,ECG etc.; it has to be following some of the mathematical steps.
Mean of the raw signal; standard deviation; mean of the absolute values of the first difference of the raw signal; mean of the absolute values of the second difference of the raw signal; mean of the absolute values of the second difference of the standardized signal. Based on these statistics, the characteristic patterns are extracted from the EEG.
Related Works For Statstical Based Features: Chai Tong Yuen et al., [9, 10] paper deals with that, statistical based features are used to classify the human emotion fro m EEG signals. Each Statistical features are used to classify different types of emotions. The signals recorded fro m the EEG is X, then the Nth Samp le of signal is Xn where N=1024, The Mean of the raw signal is
x
1 N
N
Xn
(4)
n 1
And also find Standard Deviations,
1 N X n x 2 x N 1 n 1
1
2
(5)
X 1 x , x
X 2 x , x
(6)
C. Petrantonais [2] paper, Statistical Feature Vector (FV) is used to classify the emotion fro m the EEG s ignal. FV is calculated by using the mean, standard deviation and etc. The corresponding FV is defined by
(7)
Short Time Fourier Transformation STFT is a signal processing method used to analyze the Copyright © 2016 MECS
Related Works For Short Time Fourier Transformation: STFT is used to ext ract fro m the each electrode sliding window of 512 samples and its overlapping between two consecutive windows. [3] EEG signal features are extracted by STFT and each electrode with slid ing window with 512 samp les. For each electrode, we have to select the 9 frequency band range fro m 4 to 22Hz and also find the mutual information between the each electrode wh ich was measured by the statistical dependencies between the different brain areas. Abdul-Bary Raouf Suleiman et al.,[11] the EEG signal is the time domain signal the spectrum can be changed over the time, so that features are extracted using Short Time Fourier Transform wh ich is type of Time frequency Representation. Principal Component Analysis PCA is a statistical technique that has used for face recognition, image co mpression, and it is common techniques for finding patterns in data of huge dimension. It has used some mathematical terminology like mean, standard deviation, variance, co-variance, covariance matrix, matrix algebra, Eigenvector, Eigen value. It has the following steps...
Gather the input data Calculate mean vector Computing co-variance Matrix Find corresponding Eigen Vector and Eigen value. Ranking and Choosing K Eigenvector and generate the new feature vector Transform the samples on the new subspace. New Data=Row Feature vector*Row Data Adjust
Related For Principal Component Analysis :
And also find the mean absolute values of raw signal, normalized signals. Here few co mbinations of Statistical Feature Vectors are,
X I X i : X j H X i H i X j
non-stationary signals. It is used to determine the sinusoidal frequency and phase content of the local section of a signal as it changes over the time.
Ales Proch Azka et al.,[12] paper deals with that, segmenting the EEG mu ltichannel signal and classify the human emotions. Principal Co mponent Analysis is used to reduce the dimensionality of the Matrix using linear algebra. The Principal Co mponent Matrix PM, M is orthogonal evaluating column matrix is Y N,M with the decreasing variance. Reza khosrowbadi et al [1] paper, Non Negative Principal Co mponent Analysis is used to reduce the dimensionality of the data, because the high dimensional data is difficult to process and identify the human emotions in accurate. This transformat ion maximizes the variance of the transformed features using the original coordinates. Independent Component Analysis M. Ungureanu et al [16] paper deals that, it converts the mult ivariate signal to signal having co mponent, which are independent. It removes all the noise fro m the EEG signal and ext racts the particular feature which is not related to another. Suppose the signal X (t) assume vector
I.J. Information Technology and Computer Science, 2016, 12, 19-26
A Survey On Emotion Classification From Eeg Signal Using Various Techniques and Performance Analysis
has zero mean then, C. Feature Classification After Ext racting the particular features, the best features have to be selected using optimization method like Genetic Algorithm. The best features are selected, based on this features the emotions are classified into particular group. Classification is the process of grouping the related data into a single group. The classification may be Supervised or Unsupervised Classificat ion. There are several classification methods like, NN, SVM , K-NN, and LDA. Neural Networks ANN is a biological way of work to store the informat ion in Neurons. Connection and Interconnection between the neurons are used to pass the informat ion fro m one node to another node. Neurons are called as the Elements. NN has the input layer, hidden layer and output layer. The Basic Neural Netwo rk diagram is shown in Fig 5. It has fo llo wing features, Co mputer Based Learning, Ability to arrange nodes itself, Real Time Processing, Ability to withstand in failure. Neu ral networks are used to find the pattern to analysis the particular emotion. There are three types of learn ing fro m the neural network. Supervised learning there is a target class wh ich is used to find the output exactly we want. Unsupervised Train ing there is no target class we have to find the output, which is related to the problem. Rein forcement is the combination of both. Based on these NN we have to find a particular pattern for emotion.
23
using two layers Back propagation network, which gives the feedback between the hidden layer and the input layer. Sig mo id Functional is used for the activation function. Levenberg-Marquardt back propagation algorith m is used for train ing the network. Based on these, emotions are classified. Chai Tong Yuen et al., [9] the emotions are classified using the back propagation neural network. The features are selected using the statistical features, these features are applied to the NN and the classification results are analysed using 60 data. Support Vector Machine SVM is used for classifying both linear and non-linear data. It uses a non-linear mapping to transform the original t rain ing data into higher dimension. The new dimension, it searches the linear optimal hyper plane that is decision boundary. It is used; separate the one class fro m another one. The SVM finds the hyper plane using support vectors and margin. The basic SVM is shown in fig 6.The fo llo wing figure exp lains the SVM classifier which has the margin between the two classes.
Fig.6. Small Margin using SVM
Related Works for Support Vector Machine: C. Petrantonais [2] reports that the SVM classifier uses a polynomial kernel function that reflects the high-dimensional feature space.
t K FV s , FVq FV SV , FVq 1
Fig.5. Neural Networks
Related Works For Neural Networks: Reza khosrowbadi et al., [1] paper deals that emotions are classed using Feed Forward Neural Netwo rk. The features are extracted using Connectivity between the one feature into other features. Magnitude square Coherence estimation is applied for co mputing the connectivity features between the brain signals. After reducing the dimensionality of the features, the best features are used for classification. The Rad ial Basis Function Net work is used to classify the particular emot ion. The emotions are classified into Valence and Arousal Level. Seyyed Abed Hosseini et al., [13] paper tells that classify the emotions Copyright © 2016 MECS
n
(8)
By using the SVM the different class of emotions is classified. Mohammed Soleymani [3] paper human short term emotions are identified using Support Vector Machine. It has maximu m margin classifiers that maximize the distance between the decision surface and the nearest point to this surface and also minimize the error on the training set. It also used for linear and rad ial basis function to analysis the features and classifies those emotions. Ali S. AlMejrad [15] the emotions are classified by using SVM. The Extracted features are input to the SVM for emot ion analysis; here they need not reduce the dimensionality of the features. The extracted features separated by linear classifier.
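For orientation only, an SVM of the kind referenced above can be trained on extracted feature vectors with scikit-learn. The feature matrix, the dummy labels and the polynomial kernel settings below are placeholders for illustration and are not the configurations used in [2], [3] or [15].

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X = np.random.randn(120, 10)            # 120 feature vectors (e.g. wavelet energies)
y = np.random.randint(0, 2, size=120)   # 0 = low arousal, 1 = high arousal (dummy labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="poly", degree=3)      # polynomial kernel, as discussed in the text
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```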
I.J. Information Technology and Computer Science, 2016, 12, 19-26
24
A Survey On Emotion Classification From Eeg Signal Using Various Techniques and Performance Analysis
K-Nearest Neighbor
Nearest-neighbor classifiers are based on learning by comparing a test tuple with training tuples that are similar to it. When a tuple is unfamiliar, the k-nearest-neighbor classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are the k "nearest neighbors" of the unfamiliar tuple. Closeness is defined using the Euclidean distance. A sample K-NN result is shown in Fig. 7, which illustrates how different tuples are grouped by the K-NN classifier.
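A small, generic k-NN illustration using the Euclidean distance described above; the feature vectors, labels and the choice k = 5 are invented for the example and do not reproduce the setups of [3] or [4].

```python
import numpy as np

def knn_predict(query, train_X, train_y, k=5):
    """Classify one feature vector by majority vote among its k nearest neighbours."""
    dists = np.sqrt(np.sum((train_X - query) ** 2, axis=1))  # Euclidean distance
    nearest = np.argsort(dists)[:k]
    votes = np.bincount(train_y[nearest])
    return int(np.argmax(votes))

train_X = np.random.randn(100, 10)           # placeholder EEG feature vectors
train_y = np.random.randint(0, 3, size=100)  # three dummy emotion classes
label = knn_predict(np.random.randn(10), train_X, train_y, k=5)
```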
Fig.7. K-Nearest Neighbor
The Euclidean space between two points or tuples, say, X1 = (x11, x12... x1n) and X2 = (x21, x22... x2n), is
Dist X 1 , X 2
X
1i
X 2i
2
fast computation of unknown inputs performed by distance calculation between the new samp le data and training data. It is used to find the optimal hyper plane to separate five classes of emotion like happy, sad, fear, surprise, etc. LDA does not require any external parameter to classify the emotions. Related Works For Lda: Mohammed Soleymani [3] In this paper deals that, the short term emotions are identified by using discriminate analysis. The emotions are analysis using Linear Discriminate Analysis and Quadratic analysis; both analyses are based on the Bayes Ru le, which fined the class with the highest posterior probability P (Wi|f). Murugappan [4] In this paper emotions are classified using DWT features and it has found the fast and extreme evaluation of unknown inputs performed by distance calculation between new sample and mean of train ing data samples. It is used to find the optimal hyper plane, which separates five different class emotion like happiness, surprise, fear, disgust, and neural. [14] In this paper deals that, how the EEG related emotions are classified using time domain parameter as a feature. To classify the emotion LDA is used because it has high dimensionality of the features which was co mpared to the number of trails; LDA has found a one dimensionality subspace. In which that class are separated. The rat io of the class variance to within the class variance is maximal.
W 11 2
(10)
Where sample Co variance Matrix µ1 and µ2 are sample Class Mean.
(9) IV. SAMPLE RESULT
Related Works For K-Nn Mohammed Soleymani [3] paper recognizes the emotions fro m EEG using Higher Order Crossing. The Feature Vectors (FV) is the input to the K-NN. In the DSpace, the nearest neighbors are defined by using the Euclidean Distance by equ 15. It follows two steps, 1. Define the training set of FV and 2. For a given query of FV’s then find the nearest of FV. And then it co mes under the particular majo rity of the class. Murugappan et al., [4] in this paper emotion are classified using K-NN. It is a simp le and intuitive method for classifying images and signal. The train ing samp le is co mpared with the testing data, the closest or nearest emotion related features are identified and majority vote related features are grouped into the particular same class.
In this Section we are exp lained sample result which was obtained by Independent Component Analysis method is used. The raw input signal is g iven to the input and preprocesses the data and that signal is analysis based on the component. Those components are separate from the different way. Using the co mponents the different emotions are classified. The sample screen shots are shown in fig 8 and 9. The following figure explains the sample result for co mponent analysis using ICA and classifies the emotions into different groups.
Linear Discriminant Analysis LDA methods used in statistics, recognition of pattern and regularities in data, art ificial intelligence to find a linear co mbination of features which separates two or more classes of objects. It is similar to regression analysis. The main concept of searching for a linear co mb ination of variables the best separates two targets. It is extremely Copyright © 2016 MECS
Fig.8. Front page of the tool box
I.J. Information Technology and Computer Science, 2016, 12, 19-26
A Survey On Emotion Classification From Eeg Signal Using Various Techniques and Performance Analysis
25
VI. CONCLUSION
An emotion can be identified by extracting different kinds of features from the signal. For extracting features from the signal, the wavelet transform gives the highest accuracy; after preprocessing, the signal has to be smoothed and the features optimized using different optimization techniques such as GA, PSO, etc. After obtaining the optimized features, applying RBF networks and SVM provides high emotion-classification accuracy in any situation. These emotions can be used for human-computer interaction, affective computing, robotics, etc.
Fig.9. Component analysis and classification
A CKNOWLEDGMENT V. PERFORMANCE AND A CCURACY The emotions are identified by EEG signal with different feature ext raction techniques and classification methods. The emotions accuracy may be varied fro m one extract technique to another. Different ext raction techniques that combine with the classification method provide better results. The Tab le 3 shows that the various accuracy levels of feature extract ion and classification methods during the emotion identification process . T able 3. Accuracy of methods
I extend my sincere thanks and gratitude to my guide and mentor who has helped me all through the way to make out this project. My heartfelt thanks to my family and friends. Of all, my humb le prayers to God for making me persuade through this way. REFERENCES [1]
[2]
NO
EXT RACT MET HOD
CLASSIFY MET HOD
ACCURACY
REF
1
DWT
K-NN
83.26
[4]
2
Dynamic and PCA Features
SVM
64.7 to 82.91
[5]
3
WT
SVM
82.38
[4]
[5]
4 5 6 7
HOC ST FT HOC PCA Features
SVM SVM NN RBF
82.33 80 80.5 84.6
[2] [3] [2] [1]]
[6]
8
DWT
LDA
75.21
[4]
[7]
9
HOC
QDA
62.30
[14]
[3] [4]
[8]
Fig 10 The accuracy of feature ext raction method and classification method [9]
[10]
[11]
[12]
[13] Fig.10. Performance Analysis
[14]
Reza khosrowbadi, Kai Keng Ang, and Abdul Wahab “ERNN: A biologically Feedforward Neural Network to Discriminate Emotion from EEG Signal” IEEE neural network and learning systems. Vol. 25, No 3, M ar-2014 C. Petrantonais, “Emotion recognition from EEG using higher order” vol. 14, pp-390-396 in 2010 IEEE. M ohammed Soleymani, “Short term Emotion assessment in a recall paradigm” in 2009 Elsevier. M urugappan, N. Ramachandaran, and Y. Sazali, M ohd “Classification of Human Emotion from EEG discrete wavelet” J. Biomed. SCI. Eng., vol. 3, pp -390-396, Apr2010 Xiao-Wei Wang, Dan Nile, “Emotional State Classification from EEG data using machine Learning Approach” in 2013 Elsevier. Ales Prochazka and Jaromır Kukal, Oldrich Vysata “Wavelet Transform Using for Feature Extraction and EEG Signal Segments Classification” in IEEE. M r. C. E. M ohan Kumar. M r. S. V. Dharani Kumar “Wavelet Based Feature Extraction Scheme of Electroencephalogram” in IJIRSET. M .Murugappan, M .Rizon, RNagarajan, S. Yaacob, I. Zunaidi, and D. Hazry “EEG Feature Extraction for Classifying Emotions using FCM and FKM in IJCC. Chai Tong Yuen , Woo San San1, M ohamed Rizon and Tan Ching Seong “Classification of Human Emotions from EEG Signals using Statistical Features and Neural Network” in IJIE. Chai Tong Yuen, Woo San San, Jee-Hou Ho and M . Rizon “ Effectiveness of Statistical Features for Human Emotions Classification using EEG Biosensors´in RJAET. Abdul-Bary Raouf Suleiman, Toka Abdul-Hameed Fatehi “features extraction techniqes of eeg signal for bci applications”. Ales porch Azka, M artina M udrov a, Oldˇrich Vysata, Robert H´av, and Carmen Paz Su´arez Araujo “M ultiChannel EEG Signal Segmentation and Feature Extraction”. Seyyed Abed Hosseini and M ohammad Bagher NaghibiSistani “ Classification of Emotional Stress Using Brain Activity”. C. Petrantonais, “Adaptive Emotional Information Retrieval from EEG signals in the Time Frequency
Domain," in 2012 IEEE.
[15] Ali S. AlMejrad, "Human Emotions Detection using Brain Wave Signals: A Challenging", 2010.
[16] M. Ungureanu, C. Bigan, R. Strungaru, V. Lazarescu, "Independent Component Analysis Applied in Biomedical Signal Processing," Measurement Science Review, Vol. 4, Section 2, 2004.
[17] S. Karpagachelvi, Dr. M. Arthanari, M. Sivakumar, "ECG Feature Extraction Techniques - A Survey Approach," IJCSIS, vol. 8, no. 1, April 2010.
[18] Yuan-Pin Lin, Chi-Hong Wang, Tien-Lin Wu, Shyh-Kang Jeng, "Support Vector Machine for EEG signal classification during listening to emotional music," Multimedia Signal Processing, IEEE 10th Workshop, 2008.
[19] Ali S. AlMejrad, "Human Emotions Detection using Brain Wave Signals: A Challenging", 2010.
[20] Garrett, D., Peterson, D. A., Anderson, C. W., & Thaut, M. H., "Comparison of linear, nonlinear, and feature selection methods for EEG signal classification," Neural Systems and Rehabilitation Engineering, IEEE Transactions on, 11(2), 141-144, 2003.
[21] P. C. Petrantonakis and L. J. Hadjileontiadis, "EEG-Based Emotion Recognition Using Hybrid Filtering and Higher Order Crossings," Proc. 3rd International Conference on Affective Computing and Intelligent Interaction (ACII) and Workshops, IEEE, 2009, pp. 1-6, Amsterdam.
[22] Marcin Kołodziej, Andrzej Majkowski, Remigiusz J. Rak, "Linear discriminant analysis as EEG features reduction technique for brain-computer interfaces," Przegląd Elektrotechniczny (Electrical Review), R. 88 NR 3a/2012.
[23] E. D. Ubeyli, "Combined Neural Network Model Employing Wavelet Coefficients for EEG Signal Classification," Digit. Signal Process., Vol. 19, pp. 297-308, 2009.
[24] Rashima Mahajan, Dipali Bansal, Shweta Singh, "A Real Time Set Up for Retrieval of Emotional States from Human Neural Responses," Vol 8, No: 3, 2011.
[25] Sample result performed by using the ICA tool box.
Authors' Profiles

M. Sreeshakthy received the B.Tech degree in Information Technology from University College of Engineering Tindivanam, Chennai, India, in 2012. She is now doing her M.E. in Software Engineering at Anna University Regional Centre, Coimbatore, India. Her areas of research interest include soft computing techniques, data warehousing and data mining.

J. Preethi received the B.E. degree in Computer Science and Engineering from Sri Ramakrishna Engineering College, Coimbatore, Anna University, Chennai, India, in 2003, the M.E. degree in Computer Science and Engineering from the Government College of Technology, Anna University, Chennai, India, in 2007, and obtained her Ph.D. degree in the Department of Computer Science and Engineering from Anna University Coimbatore in the year 2013. Currently, she works as an Assistant Professor in the Department of Computer Science and Engineering, Anna University, Regional Centre, Coimbatore. Her research interests include radio access technology selection, soft computing techniques, and data warehousing and data mining.

A. Dhilipan received the B.E degree in Computer Science and Engineering from Mahendra Engineering College, Tamil Nadu, India, in 2011. He is now doing his M.E. in Software Engineering at Anna University Regional Centre, Coimbatore, India. His areas of research interest include soft computing techniques, data warehousing and data mining.

How to cite this paper: M. Sreeshakthy, J. Preethi, A. Dhilipan, "A Survey On Emotion Classification From Eeg Signal Using Various Techniques and Performance Analysis", International Journal of Information Technology and Computer Science (IJITCS), Vol.8, No.12, pp.19-26, 2016. DOI: 10.5815/ijitcs.2016.12.03
I.J. Information Technology and Computer Science, 2016, 12, 27-38 Published Online December 2016 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2016.12.04
Improving Performance of Dynamic Load Balancing among Web Servers by Using Number of Effective Parameters Deepti Sharma Department of Information Technology, Jagan Institute of Management Studies, Affiliated to GGSIPU, Rohini, Delhi, India E-mail: deepti.jims@gmail.com
Dr. Vijay B. Aggarwal DIT, JIMS, Rohini, Delhi, India E-mail: vbaggarwal@jimsindia.org
Abstract—Web applications are challenged to develop methods and techniques for processing large volumes of data at optimum response time, and there are technical challenges in dealing with the increasing demand to handle vast traffic on these websites. As the number of users increases, web servers face several problems such as bottlenecks, delayed response time, load imbalance and density of services. The whole traffic cannot reside on a single server, so there is a fundamental requirement to allocate this huge traffic across multiple load-balanced servers. Distributing requests among the servers in a web server cluster is the most important means to address this challenge, especially under intense workloads. In this paper, we propose a new request distribution algorithm for load balancing among web server clusters. Dynamic load balancing among the web servers takes place based on the user's request, dynamically estimating server workload using multiple parameters such as processing and memory requirements, expected execution time and various time intervals. Our simulation results show that the proposed method dynamically and efficiently balances the load to scale up the services, and we calculate the average response time, average waiting time and server throughput on different web servers. At the end of the paper, we present an experiment running the proposed system, which shows that the proposed algorithm is efficient in terms of speed of processing, response time, server utilization and cost efficiency.

Index Terms—Load balancing, Distributed and Parallel Systems, Heterogeneous systems, response and waiting time.
I. INTRODUCTION
A critical challenge today is to process huge volumes of data from multiple sources. People have become very reliant on the Internet, and users depend increasingly on the web for daily activities such as electronic commerce, on-line banking, reservations and
stock trading. The performance of a web server system therefore plays an important role in the success of many Internet-related companies. Due to huge Internet traffic, requests served by only a single server will not serve the purpose: hundreds or thousands of requests can arrive at a single point of time, and web developers need to process multi-terabyte or petabyte-sized data sets, which is not possible using a single server. A big challenge today is how to handle this traffic with good response time and replication at minimum cost. One of the best ways to serve these huge volumes of requests and data processing is to perform parallel and distributed computing in a cluster environment; a web cluster has proved to be a better solution than an overloaded single server. The need for a web server cluster system arises from the fact that requests must be distributed among these web servers in an efficient manner. Over a period of time, the performance of each system can be identified and the information used for effective load balancing, so such systems are extremely suitable for job processing. For load balancing, various factors such as I/O overhead, job arrival rate and processing rate may be considered when distributing jobs to the various nodes, so as to derive maximum efficiency and minimum wait time for jobs. Large volumes of data arise in areas such as physics, astronomy, health care, finance and the web, so there is a need for data-intensive processing and for designing algorithms for real-world datasets. For these data-intensive workloads, a large number of cluster servers is preferred over a small number of high-end servers. Large amounts of data are processed by various companies: during 2010, the data processed by Google every day was 20 petabytes, and by Facebook 15 terabytes. This data processing requires very quick processing, but input/output is slow. The data also needs to be shared, and sharing is difficult as it leads to problems of synchronization, deadlocks, finite bandwidth and temporal dependency. There is a need to depart from this type of data processing technology towards High Performance Computing (HPC), to do large-scale data processing using, say, thousands of CPUs without
the hassle of managing things. Various algorithms have been proposed for load balancing in distributed job processing systems. These algorithms can be classified into static and dynamic: while a static algorithm relies on a predetermined distribution policy, a dynamic load balancing algorithm makes its decisions based on the current state of the system. This framework uses a dynamic algorithm that analyzes the current system load and various cost factors to arrive at the best target processor to handle the job processing. It allows experimenting with distributed computations on massive amounts of data, is designed for large-scale data processing, and parallelizes the computation across a large cluster of machines. The primary contribution of this research is to propose a framework for running a web server cluster system in a web environment based on the collected requirements and to present its implementation on one of the web services. The second contribution is to present an experimental analysis of running this framework in a web environment to validate the proposed research and to evaluate the experiments on various parameters such as speed of processing, response time, server utilization and cost efficiency. This paper is organized as follows: Section II reviews related work. Section III explains the approach of the proposed framework. Section IV explains the data and requirements used to develop the proposed framework. Section V explains the proposed framework and its implementation. Section VI gives experimental results and analysis. Section VII provides the conclusion.
II. RELATED WORK
Recently, cluster servers have been used for fast information retrieval on the Internet, and load balancing based on effective parameters has been studied by many researchers. Alan Massaru, T. N. Anitha et al. describe the load status of web servers as a solution for load balancing. Manoj Kumar Singh et al. present an optimal traffic distribution concept to distribute dynamic load based on traffic intelligence methods. Jianhai Shi presents two-step static and dynamic scheduling for load balancing, describing a strategy of distributed load balancing based on hybrid scheduling. Jorge E. Pezo et al. propose a solution for improving reliability in heterogeneous distributed systems and demonstrate results and testbed solutions. Y. S. Hong et al. propose a DNS-based load balancing solution using a ring and a global manager for distributing the traffic overflow. A. Khunkitti proposes TCP-handoff and multicast based load balancing that allows immediate and complete connection transfer to another available server. Shardal Jain et al. propose a load balancing solution using prioritization of nodes, done by comparing the efficiency factor and processing power of each and every node.
III. APPROACH
In this section, we discuss our framework for the load balancing mechanism. For load balancing, various factors need to be considered. Whenever there is a new request at any web site, the algorithm has to decide to which web server this incoming request should be assigned so that the load among web servers remains balanced. For this, we take into consideration the various aspects involved in fulfilling any request: the number of servers, the intervals, the jobs generated and the jobs' expected execution times.

There are 'n' servers, where the value of 'n' is variable. A server has as basic parameters its memory, processing speed and memory left over. The server's memory left over is modified whenever a job is allocated to that server or a job completes on that server. The total number of jobs is designated by 'x'. The jobs generated for an individual interval can be defined as J[y] = Q, where J = {J1, J2, ..., Jn}, 'y' is the interval generated and Q is the number of jobs generated for the respective interval, Q = {0, 1, ..., xn}, where 'xn' is the maximum number of jobs that can be generated for an individual interval. Job parameters are the job's memory, the job's processing speed and its total expected execution time. Here, the job's memory and processing speed mean how much server memory and processing the job requires for execution; the job's expected execution time is the maximum time required by the job for execution.

In our approach, the total time is divided into intervals with a fixed time slice of 5 milliseconds. At the start of each interval, jobs are generated, initialized and allocated to the servers, and the allocated jobs are recorded in the "main array". In addition, there are three types of arrays, defined as follows:

a) Main Array: contains all the jobs allocated to their respective servers, i.e., the old jobs already executing and carried over into the next interval, plus the new jobs taken from the new job array.
b) New Job Array: contains all the jobs that are generated when the interval begins.
c) Waiting Array: contains the jobs that are initialized but waiting for allocation to a server and execution.

There are also various processes running in the system, as follows:

i) Distribution of Jobs among Servers: Jobs are distributed among the various servers on the basis of the job's memory and processing requirements. Both parameters are compared with the server's memory left over and processing speed; if satisfied, the job is allocated to the respective server. If all servers have been checked and the job is still unallocated, it is placed in the waiting queue.
ii) Checking for Load Balance (LB): In the proposed approach, load balance is taken care of at the time of job allocation. A job is allocated to a server only if the server's available memory and processing capacity are greater than the job's memory and processing requirements. This keeps the respective server even, or load balanced.
iii) Job Completion Process: The remaining expected execution time (initially equal to the maximum expected execution time) is decremented by the reduction value of the server to which the job is allocated after every cycle. If it becomes zero or less, the job is completed and done. At the same time, the memory left of that server is incremented by the memory of the job that has just completed.
iv) Reduction Value Process: A reduction value is associated with each server; it is the value by which a job's expected execution time is decremented. It differs from server to server, based on the processing speed of each server.

A sketch of the allocation check in processes (i) and (ii) is given after this list.
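The following is a minimal sketch, not the authors' implementation, of the allocation and load-balance check described in processes (i) and (ii); all class and field names (Job, Server, memLeft, and so on) are assumptions made for the illustration.

// Illustrative sketch of the allocation check in processes (i) and (ii).
// All names (Job, Server, memLeft, etc.) are assumptions for the example.
import java.util.List;

class Job {
    int memRequired;       // memory the job needs on a server
    int procRequired;      // processing speed the job needs
    int remainBurst;       // remaining expected execution time
}

class Server {
    int memLeft;           // memory currently left on the server
    int procSpeed;         // processing speed of the server
    int reductionValue;    // value subtracted from a job's remaining burst each cycle
}

class Allocator {
    /** Returns the first server that satisfies the job's memory and processing
     *  requirements, or null so that the job goes to the waiting queue. */
    static Server allocate(Job job, List<Server> servers) {
        for (Server s : servers) {
            if (s.memLeft >= job.memRequired && s.procSpeed >= job.procRequired) {
                s.memLeft -= job.memRequired;   // server memory left is updated on allocation
                return s;
            }
        }
        return null;                            // unallocated: job waits for the next interval
    }
}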
IV. DATA REQUIREMENTS TO DEVELOP THE FRAMEWORK
The basic goal of load balancing techniques in a cluster of web servers is that every request is served by the more lightly loaded node in the cluster. This section describes the aspects specific to the architectural design we implement for the software model in which the incoming load from clients is distributed efficiently over the web server system. The following data are used in the proposed framework:

1. Time Interval: a definite length of time marked off by two instants. Our algorithm uses the concept of a time interval, generated randomly every 5 milliseconds.
2. Jobs Generated: a job is a task performed by the computer system. In the algorithm, jobs are generated randomly in each time interval. Generated jobs have various parameters such as processing time, memory requirement and total expected execution time. The performance of the algorithm is measured by factors that depend on the generated jobs, such as the jobs in the waiting queue, how many jobs are completed in an interval, response time, waiting time and total job runtimes.
3. Server: a server is a computer program which serves the requests made by clients. Servers have various parameters such as processing speed, memory, memory left and jobs assigned. A server's performance is also measured through its utilization level, its status and its throughput.
4. Scheduling Technique: in the proposed algorithm, the SJF (shortest job first) scheduling technique is used. It is applied twice: first on the total burst time, while allocating jobs to servers, and secondly on the remaining burst time of the jobs in the main array; a small illustration follows below.
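As a small illustration of the SJF ordering applied at both points, the job queues can simply be sorted by the relevant burst field. The field and class names below are assumptions for the example, not the paper's code.

// Illustrative only: SJF ordering on total burst time (new jobs) and on
// remaining burst time (main array). Names are assumptions for the example.
import java.util.Comparator;
import java.util.List;

class SjfOrdering {
    record QueuedJob(int id, int totalBurst, int remainBurst) {}

    // SJF on total burst time, applied to NEW_JOB_ARRAY when an interval begins
    static void sortByBurst(List<QueuedJob> newJobs) {
        newJobs.sort(Comparator.comparingInt(QueuedJob::totalBurst));
    }

    // SJF on remaining burst time, applied to MAIN_ARRAY before each execution cycle
    static void sortByRemainingBurst(List<QueuedJob> mainArray) {
        mainArray.sort(Comparator.comparingInt(QueuedJob::remainBurst));
    }
}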
V. PROPOSED FRAMEWORK AND ITS IMPLEMENTATION
This section describes the algorithm and the flowchart that depict the functioning of the proposed system.

A. Algorithm
Step 1: Server initialization.
Step 2: Interval initialization.
Step 3: If jobs are generated, then
  a) they are queued in NEW_JOB_ARRAY;
  b) SJF is applied on NEW_JOB_ARRAY on the basis of BURST_TIME.
Step 4: If INTERVAL_VALUE == 0, then
  a) allocation of the jobs in NEW_JOB_ARRAY starts;
  b) if allocation is successful, the jobs are queued to MAIN_ARRAY;
  c) if allocation of a job is not done, the job is queued to WAITING_QUEUE.
Step 5: If INTERVAL_VALUE > 0, then
  a) first, the jobs in WAITING_QUEUE are allocated; if allocation is successful, the jobs are queued to MAIN_ARRAY, else they remain in the WAITING_QUEUE;
  b) go to Step 4.
Step 6: MAPPER FUNCTION
  a) SJF is applied on MAIN_ARRAY on the basis of REMAIN_BURST.
Step 7: REDUCER FUNCTION
  a) Execution of the jobs in MAIN_ARRAY starts.
  b) The REMAIN_BURST value of each job is reduced by its server's RAD_VALUE.
  c) Once a job's REMAIN_BURST value == 0, its execution is completed.
Step 8: If JOB_ARR_VALUE >= 0 AND WAITING_VALUE >= 0 AND INTERVAL_VALUE >= MAX_JOB_INTERVAL, then repeat Step 3 to Step 8, else STOP.

B. Flowchart
This section gives a pictorial representation of the proposed approach; a code sketch of the same loop follows Fig. 1.
Fig.1. Flowchart showing the overall representation of the proposed approach
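The following is a compact sketch of one execution cycle of the loop in Steps 6 and 7, written only to illustrate the described behaviour; the 5 ms slice, SJF ordering, reduction value and completion test follow Sections III-V, while all class, field and method names are assumptions for the example.

// Illustrative sketch of one mapper/reducer cycle over MAIN_ARRAY (not the authors' code).
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

class IntervalCycle {
    static class Srv {
        int memLeft;          // memory currently free on the server
        int reductionValue;   // RAD_VALUE: amount removed from a job's remaining burst each cycle
    }
    static class RunningJob {
        int memRequired;
        int remainBurst;
        Srv server;           // server the job was allocated to
        RunningJob(int mem, int burst, Srv s) { memRequired = mem; remainBurst = burst; server = s; }
    }

    /** One execution cycle over MAIN_ARRAY (Steps 6 and 7 of the algorithm). */
    static void runCycle(List<RunningJob> mainArray) {
        // Step 6: SJF on REMAIN_BURST
        mainArray.sort(Comparator.comparingInt(j -> j.remainBurst));

        // Step 7: reduce each job's remaining burst by its server's reduction value
        for (Iterator<RunningJob> it = mainArray.iterator(); it.hasNext(); ) {
            RunningJob job = it.next();
            job.remainBurst -= job.server.reductionValue;
            if (job.remainBurst <= 0) {              // job completed
                job.server.memLeft += job.memRequired;   // completed job frees its memory
                it.remove();
            }
        }
    }
}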
VI. EXPERIMENTAL RESULTS AND ANALYSIS
Performance Measurement: In the experiments, the following notations are used: T.I.G.: Total Intervals Generated; T.J.G.: Total Jobs Generated; T.S.: Total Servers; E.E.T.: Expected Execution Time; F.E.T.: Final Execution Time; A.W.T.: Average Waiting Time; A.R.T.: Average Response Time; and J.R.: Job Runtime. The performance of the proposed load balancing algorithm is measured by the following parameters.

A. Calculating Performance with Low, Medium and High Load Jobs
The first parameter is the performance of the servers with respect to the number of jobs. Jobs generated per interval may be divided into low load (0-10), medium load (0-40) and high load (0-70) jobs. For different numbers of jobs arriving at different time intervals, the following table and chart show the average waiting time and response time of the servers.

Table 1. Performance with Different No. of Jobs
No. of Jobs    A.R.T.    A.W.T.
0-10           2         1
11-40          4         12
41-70          4         16
Fig.2. Graphical Representation of Table 1
B. Calculating Mean Response Time and Mean Waiting Time
The following experiment shows the average response time, average waiting time and job run time with respect to the number of jobs generated per interval.

Table 2. Mean Response and Waiting Time
T.J.G    A.R.T    A.W.T    J.R
370      52       73       50
715      52       93       90
458      53       80       82
510      85       85       89
Fig.3. Graphical Representation of Table 2

C. Results with Various Intervals, Jobs and Servers
In this experiment, we have taken four cases (with different numbers of intervals generated, jobs generated, total servers and expected execution times) and then calculated the following in each interval.

C.1 Final Execution Time
Experiments are performed on the algorithm in four cases, defined in the table below:
Table 3. Different Cases
Cases   T.I.G.   T.J.G (Per Interval)   T.J.G (All Intervals)   T.S.   E.E.T   F.E.T.
1       20       0-50                   508                     10     0-20    105
2       20       0-80                   768                     10     0-20    105
3       20       0-50                   495                     8      0-20    101
4       20       0-50                   354                     10     0-40    113
Fig.4. Different Cases
C.2 Results for Server Utilization
The following tables show each server's status in each interval, i.e., how much each server is utilized. Let the number of intervals generated be 'x'. Then

total time = total intervals * time per interval    (1)

In our experiment, 19 intervals are generated, so the total time for job execution is 95 ms (19 intervals * 5 ms/interval). Also, let server S be busy for 'm' intervals, where S є {s0, s1, ..., sn}. Then

total busy period = m * time per interval    (2)

server utilization = total busy period / total time    (3)

The tables below show the total busy period and the server utilization.

Case 1: Servers: 10 (S0-S9), Total Jobs Generated: 0-50 per interval, Intervals: 19 (1-19), EET: 0-20. The data collected is listed in Table 4 and shown in Figure 5 below.
Table 4. Case 1
Servers   Total Busy Period   Server Utilization
S0        65                  68.42%
S1        85                  89.47%
S2        90                  94.74%
S3        90                  94.74%
S4        85                  89.47%
S5        70                  73.68%
S6        70                  73.68%
S7        40                  42.11%
S8        0                   0.00%
S9        0                   0.00%
Fig.5. Case 1
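As a check of equations (1)-(3), the small sketch below recomputes the Case 1 utilization figures of Table 4; the busy-interval counts are inferred from the table's busy periods divided by the 5 ms slice, and everything else is an illustrative assumption rather than the authors' code.

// Illustrative check of equations (1)-(3) using the Case 1 data from Table 4.
class UtilizationExample {
    public static void main(String[] args) {
        final double timePerInterval = 5.0;                               // ms, as in Section III
        final int intervalsGenerated = 19;
        final double totalTime = intervalsGenerated * timePerInterval;    // eq. (1): 95 ms

        int[] busyIntervals = {13, 17, 18, 18, 17, 14, 14, 8, 0, 0};      // servers S0..S9 (from Table 4)

        for (int i = 0; i < busyIntervals.length; i++) {
            double busyPeriod = busyIntervals[i] * timePerInterval;       // eq. (2)
            double utilization = busyPeriod / totalTime;                  // eq. (3)
            System.out.printf("S%d: busy=%.0f ms, utilization=%.2f%%%n",
                    i, busyPeriod, utilization * 100);
        }
    }
}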
Case 2: Server: 10, Total Job Generated: 0-80 per Interval, Interval: 19, EET: 0-20
The data collected is listed in Table 5 and shown in Figure 6.
Table 5. Case 2
Servers   Total Busy Period   Server Utilization
S0        80                  84%
S1        80                  84%
S2        90                  95%
S3        90                  95%
S4        80                  84%
S5        80                  84%
S6        85                  89%
S7        70                  74%
S8        45                  47%
S9        5                   5%
Fig.6. Case 2
Case 3: Server: 8, Total Job Generated: 0-50 per Interval, Interval: 19, EET: 0-20
The data collected is listed in Table 6 and shown in Figure 7.
Table 6. Case 3
Servers   Total Busy Period   Server Utilization
S0        80                  84%
S1        70                  74%
S2        95                  100%
S3        75                  79%
S4        75                  79%
S5        55                  58%
S6        55                  58%
S7        20                  21%
Fig.7. Case 3
Case 4: Server: 10, Total Job Generated: 0-50 per Interval, Interval:19, EET: 0-40
The data collected is listed in Table 7 and shown in Figure 8.
Table 7. Case 4
Servers   Total Busy Period   Server Utilization
S0        90                  95%
S1        90                  95%
S2        65                  68%
S3        55                  58%
S4        40                  42%
S5        30                  32%
S6        80                  84%
S7        45                  47%
S8        15                  16%
S9        0                   0%
Fig.8. Case 4
C.3 Response Time, Waiting Time and Job Runtimes per Interval
Throughput is the output per interval. In this experiment, we have calculated the throughput after each interval. The results show the average waiting time, average response time and total job run time for all four cases above; a sketch of how these per-interval metrics are computed follows.
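The sketch below shows one plausible way to derive the per-interval A.R.T., A.W.T. and J.R. figures from per-job records; the record fields and aggregation are assumptions for illustration, not the paper's implementation.

// Illustrative per-interval throughput metrics. Field names are assumptions.
import java.util.List;

class IntervalMetrics {
    record FinishedJob(int responseTime, int waitingTime, int runTime) {}

    static void report(int interval, List<FinishedJob> jobs) {
        double art = jobs.stream().mapToInt(FinishedJob::responseTime).average().orElse(0);  // A.R.T.
        double awt = jobs.stream().mapToInt(FinishedJob::waitingTime).average().orElse(0);   // A.W.T.
        int jr = jobs.stream().mapToInt(FinishedJob::runTime).sum();                         // J.R.
        System.out.printf("interval %d: A.R.T.=%.1f, A.W.T.=%.1f, J.R.=%d%n", interval, art, awt, jr);
    }
}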
Case 1:
Table 8. Throughput-Case 1
Interval   T.J.G.   A.R.T.   A.W.T.   J.R.
0          20       2        3        2
1          37       4        11       13
2          24       3        4        3
3          22       3        3        1
4          14       2        3        1
5          8        2        2        2
6          15       2        2        1
7          20       3        4        2
8          25       2        3        3
9          1        3        2        0
10         16       3        2        0
11         29       3        3        1
12         14       2        2        1
13         36       3        7        7
14         13       3        6        1
15         12       2        2        2
16         6        3        2        0
17         14       2        3        3
18         14       2        3        2
19         30       3        6        5
Fig.9. Throughput-Case 1
Case 2:
Table 9. Throughput-Case 2
Interval   T.J.G.   A.R.T.   A.W.T.   J.R.
0          0        3        15       0
1          54       3        12       8
2          58       3        15       5
3          35       3        10       10
4          16       3        13       2
5          56       3        7        7
6          12       3        13       3
7          69       3        11       6
8          29       3        13       2
9          46       2        9        7
10         21       2        9        2
11         7        3        14       1
12         59       3        15       5
13         46       2        11       8
14         29       2        9        4
15         24       2        8        1
16         13       3        13       2
17         53       3        11       8
18         35       3        13       4
19         53       0        0        5
Fig.10. Throughput-Case 2
Case 3:
Table 10. Throughput-Case 3
Interval   T.J.G.   A.R.T.   A.W.T.   J.R.
0          3        2        0        0
1          6        2        1        3
2          29       3        5        6
3          34       3        5        5
4          36       3        7        8
5          6        2        1        1
6          34       3        3        3
7          23       3        5        5
8          29       3        4        4
9          15       3        3        3
10         35       3        6        8
11         31       3        4        4
12         30       3        7        7
13         0        3        5        0
14         34       3        5        6
15         36       3        4        4
16         14       3        8        2
17         33       3        6        10
18         11       2        1        2
19         19       0        0        1
Fig.11. Throughput-Case 3
Case 4:
Table 11. Throughput-Case 4
Interval   T.J.G.   A.R.T.   A.W.T.   J.R.
0          36       5        18       9
1          24       5        20       14
2          38       5        21       15
3          28       4        16       9
4          38       4        19       15
5          24       5        20       11
6          17       4        17       5
7          26       4        14       9
8          21       4        12       5
9          18       4        13       5
10         13       4        14       1
11         39       4        19       13
12         33       5        22       20
13         20       4        20       16
14         29       4        17       10
15         38       5        21       18
16         12       4        19       13
17         38       5        20       15
18         5        2        4        5
19         13       4        11       4
Fig.12. Throughput-Case 4
D. Traffic Intensity
Traffic intensity is a measure of the average occupancy of a server or resource during a specified period of time, normally a busy hour. In our experiments, we have calculated:

Traffic Intensity (T.I.) = A.R.T. / A.W.T., where T.I. <= 1    (4)
Idle Server = 1 - T.I.    (5)
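A minimal sketch of equations (4) and (5) is given below, reproducing the first rows of Table 12; the values are taken from the table, while the code itself is only an illustration.

// Illustrative computation of Traffic Intensity (eq. 4) and the idle fraction (eq. 5).
class TrafficIntensityExample {
    public static void main(String[] args) {
        int[] art = {2, 4, 3, 3};    // A.R.T. for intervals 0-3 (from Table 12)
        int[] awt = {3, 11, 4, 3};   // A.W.T. for intervals 0-3 (from Table 12)

        for (int i = 0; i < art.length; i++) {
            double ti = (double) art[i] / awt[i];   // eq. (4): T.I. = A.R.T. / A.W.T.
            double idle = 1.0 - ti;                 // eq. (5): idle fraction of the server
            System.out.printf("Interval %d: T.I.=%.2f, idle=%.2f%n", i, ti, idle);
        }
    }
}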
T able 12. T raffic Intensity
Interval   A.R.T.   A.W.T.   T.I.   Idle Server
0          2        3        0.67   0.33
1          4        11       0.36   0.64
2          3        4        0.75   0.25
3          3        3        1.00   0.00
4          2        3        0.67   0.33
5          2        2        1.00   0.00
6          2        2        1.00   0.00
7          3        4        0.75   0.25
8          2        3        0.67   0.33
9          2        2        1.00   0.00
Fig.13. T raffic Intensity and Idle Server
The above results show the performance of the proposed load balancing algorithm among web servers using a number of effective parameters such as server performance, utilization, traffic intensity and throughput.
VII. CONCLUSION
In this paper, we have proposed a framework for load balancing in heterogeneous web server clusters. Based on various factors, including processing capacity, memory size, expected execution time and time intervals, jobs are distributed among the different web servers and the load is balanced simultaneously. Preliminary evaluation reveals that this algorithm improves the performance of web servers through proper resource utilization and reduces the mean response time by distributing the workload evenly among the web servers. We present a cost-effective framework for a distributed job processing system that adapts easily to dynamic computing needs with efficient load balancing for heterogeneous systems. The proposed algorithm shows its efficiency in terms of server utilization, average response time, average waiting time and server throughput.
REFERENCES
[1] Ramamritham, K. and J. A. Stankovic, "Dynamic Task Scheduling in Hard Real-Time Distributed Systems", IEEE Software, 2002, 1(3), pp. 65-75.
[2] Konstantinou, Ioannis; Tsoumakos, Dimitrios; Koziris, Nectarios, "Fast and Cost-Effective Online Load-Balancing in Distributed Range-Queriable Systems", IEEE Transactions on Parallel and Distributed Systems, Vol. 22, Issue 8, 2011, pp. 1350-1364.
[3] J. H. Abawajy, S. P. Dandamudi, "Parallel Job Scheduling on Multi-cluster Computing Systems," Fifth IEEE International Conference on Cluster Computing (CLUSTER'03), 2003, p. 11.
[4] Dahan, S.; Philippe, L.; Nicod, J.-M., "The Distributed Spanning Tree Structure", IEEE Transactions on Parallel and Distributed Systems, Vol. 20, Issue 12, Dec. 2009, pp. 1738-1751.
[5] Wei Zhang, Huan Wang, Binbin Yu, "A Request Distribution Algorithm for Web Server Cluster", Journal of Networks, Vol. 6, No. 12, December 2011.
[6] Chandra, P. Pradhan, R. Tewari, S. Sahu, P. Shenoy, "An observation-based approach towards self-managing web servers", Computer Communications, 2006, pp. 1174-1188.
[7] V. Cardellini, E. Casalicchio, M. Colajanni, S. Tucci, "Mechanisms for quality of service in web clusters", Computer Networks, Vol. 37, No. 6, 2001, pp. 761-771.
[8] Yasushi Saito, Brian N. Bershad, and Henry M. Levy, "An approximation-based load-balancing algorithm with admission control for cluster web servers with dynamic workloads", Journal of Supercomputing, Vol. 53, No. 3, 2010, pp. 440-463.
[9] Tiwari A., Kanungo P., "Dynamic Load Balancing Algorithm for Scalable Heterogeneous Web Server Cluster with Content Awareness", IEEE, 2010.
[10] Mehta H., Kanungo P. and Chandwani M., "Performance Enhancement of Scheduling Algorithms in Clusters and Grids using Improved Dynamic Load Balancing Techniques," 20th International World Wide Web Conference 2011, hosted by IIIT Bangalore at Hyderabad, 28 March-01 April 2011, pp. 385-389.
[11] Chen, X., Chen, H. and Mohapatra, P., "An Admission Control Scheme for Predictable Server Response Time for Web Accesses," Proceedings of the 10th World Wide Web Conference, Hong Kong, May 2001, pp. 545-554; International Symposium on Distributed Objects and Applications (DOA 2000), Antwerp, Belgium, Sept. 2000, OMG.
[12] Castro, M., Dwyer, M., Rumsewicz, M., "Load Balancing and Control for Distributed World Wide Web Servers," Proceedings of the IEEE International Conference on Control Applications, Hawaii, USA, 22-27 Aug. 1999, pp. 1614-1618.
[13] Priyesh Kanungo, "Study of server load balancing techniques", International Journal of Computer Science & Engineering Technology (IJCSET), Vol. 4, No. 11, Nov 2013, ISSN: 2229-3345.
[14] Abdelzaher, T. F., Shin, K. G. and Bhatti, N., "Performance Guarantee for Web Server End Systems: A Control Theoretical Approach," IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 1, Jan. 2000, pp. 80-96.
[15] Cardellini V. et al., "The State of the Art in Locally Distributed Web-Server Systems," ACM Computing Surveys, Vol. 34, No. 2, 2002, pp. 264-311.
[16] H. Mehta, P. Kanungo and M. Chandwani, "A modified delay strategy for Dynamic Load balancing in cluster and Grid Environment", International Conference on Information Science and Applications (ICISA 2010), IEEE, Seoul, Korea, April 2010.
[17] Sandeep Singh Waraich, "Classification of Dynamic Load Balancing strategies in a network of workstations", Fifth International Conference on Information Technology, Washington, USA, 2008, pp. 1263-1265.
Authors’ Profiles

Ms. Deepti Sharma is an Assistant Professor in the Department of Computer Science at Jagan Institute of Management Studies, Rohini, Delhi. She holds M.Phil. and MCA degrees and is pursuing her Ph.D. in Computer Science from IGNOU. She has more than 12 years of teaching experience. Her research areas include load balancing in heterogeneous web server clusters, big data analytics, distributed systems and mobile banking, on which papers have been published in national and international conferences and journals. She has attended various seminars, workshops and AICTE-sponsored FDPs.
Dr. V. B. Aggarwal was awarded the Ph.D. degree by the University of Illinois, USA, in 1973 for his research work in the areas of supercomputers, array processors, the Cray X-MP and database management systems. He has been a faculty member of the Computer Science Departments at Colorado State University and the University of Vermont in the USA. Dr. Aggarwal has been Head and Professor of Computer Science at the University of Delhi and Professor in the Department of Electrical Engineering and Computer Science at the University of Oklahoma, USA. Currently he is Dean (Infotech), DIT, JIMS, Rohini, Delhi. In 2001 Dr. Aggarwal was elected to the prestigious office of Chairman, Delhi Chapter, Computer Society of India. He has been associated as a computer subject expert with NCERT, CBSE, AICTE and the Sikkim Government Technical Education Department. Presently he is nominated as a Computer Subject Expert in the Academic Council of Guru Gobind Singh Indraprastha University in Delhi. Prof. Aggarwal has authored more than 20 computer publications which are very popular among students of schools, colleges and institutes.
How to cite this paper: Deepti Sharma, Vijay B. Aggarwal, "Improving Performance of Dynamic Load Balancing among Web Servers by Using Number of Effective Parameters", International Journal of Information Technology and Computer Science(IJITCS), Vol.8, No.12, pp.27-38, 2016. DOI: 10.5815/ijitcs.2016.12.04
I.J. Information Technology and Computer Science, 2016, 12, 39-46 Published Online December 2016 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2016.12.05
ECADS: An Efficient Approach for Accessing Data and Query Workload
Rakesh Malvi
M.Tech (Computer Application & Technology), UIT, Bhopal, India
E-mail: malviyarakeshsh1@yahoo.in

Dr. Ravindra Patel, Dr. Nishchol Mishra
M.Tech (Computer Application & Technology), UIT, Bhopal, India
E-mail: ravindra@rgtu.net, nishchol@rgtu.net
Abstract—In the current scenario a huge amount of data is introduced over the web; because this data is introduced by various sources, it is heterogeneous in nature. Data extraction is one of the major tasks in data mining. Various techniques for data extraction have been proposed in the past which provide functionality to extract data, such as Collaborative Adaptive Data Sharing (CADS) and pay-as-you-go approaches. The drawback associated with these techniques is that they do not provide a global solution for the user: to get accurate search results, the user needs to know all the details of whatever he wants to search. In this paper we propose a new searching technique, the Enhanced Collaborative Adaptive Data Sharing platform (ECADS), in which predefined queries are provided to the user to search data. In this technique some keywords related to the domain are provided to the user for efficient data extraction; these keywords help the user to write proper queries to search data in an efficient way. In this way it provides an accurate, time-efficient and global search technique. A comparative analysis of the existing and proposed techniques is presented in the result and analysis section, which shows that the proposed technique performs better than the existing technique.

Index Terms—CADS (Collaborative Adaptive Data Sharing Platform), Dynamic Forms, Dynamic Query, ECADS (Enhanced Collaborative Adaptive Data Sharing Platform), Nepal Dataset, QCV.
I. INTRODUCTION
In the last decade, an enormous amount of data has flooded over the Internet, and that data is collected and stored in databases all over the world. The data comes from various social networking sites and other media. To turn that data into useful information, knowledge about the data is required. Data mining is the field where data, and information about that data, is fetched and useful information is extracted; it provides various techniques to extract useful information from data, which can then be used in various fields such as telecommunication, banking, or any crisis
situation or some other means [3]. There are many software domains in which users create and share information; for example, information blogs, scientific networks, social networking companies, or disaster management networks. Present knowledge-sharing tools, such as content management software (e.g., Microsoft SharePoint), permit users to share documents and annotate (tag) them in an ad-hoc manner. In a similar way, Google Base permits users to define attributes for their objects or select from predefined templates. This annotation procedure can facilitate subsequent information discovery. Many annotation techniques allow only "untyped" keyword annotation: for example, a user may annotate a weather report using a tag such as "Storm category 3." Annotation approaches that use attribute-value pairs are generally more expressive, as they can contain more information than untyped approaches; in such settings, the above knowledge may be entered as (Storm category, 3). A recent line of work towards utilizing more expressive queries that leverage such annotations is the "pay-as-you-go" querying approach in dataspaces: in dataspaces, users provide data integration hints at query time. The assumption in such techniques is that the data sources already contain structured information and the challenge is to match the query attributes with the source attributes. Many systems, though, do not even have the basic "attribute-value" annotation that would make "pay-as-you-go" querying possible. Annotations that use "attribute-value" pairs require users to be more principled in their annotation efforts. Users should understand the underlying schema and the field types to use; they should also know when to make use of each of those fields. With schemas that often have tens or even thousands of available fields to fill, this task becomes complicated and cumbersome, and data entry users end up ignoring such annotation capabilities. Even if the system permits users to arbitrarily annotate the information with such attribute-value pairs, the users are commonly unwilling to participate in this task: it not only requires significant effort but also has doubtful usefulness for subsequent searches: who is going to make
use of an arbitrary attribute type, undefined in a common schema, for future searches? And even when using a predetermined schema, when there are tens of potential fields that can be used, which of these fields are going to be valuable for searching the database in the future? Data mining provides various predefined tools to the user to extract valuable information from data, using machine learning or other techniques to automate the extraction. Data mining is a knowledge discovery process used to extract information from various types of source data, which can then be used for various applications. Generally, classification, regression and clustering are used to organize the data into different classes, which provides better identification of the data and also helps in the process of extraction [4]. In classification, data is classified into different classes and groups, and the extraction process is applied on the basis of that classification; the classified data can be used as training data to train and classify unlabeled data, and the result can be used for various types of decision making [5]. Various techniques have been presented by researchers to provide better classification of such data; for example, Google provides several platforms to manage and define measures for objects, which helps classify data into various classes [7]. CADS [1] provides a cost-effective and good solution to support efficient search results. The goal of CADS is to support a process that creates well-annotated documents that can be immediately useful for commonly issued semi-structured queries of the end user. Annotation methods that use attribute-value pairs are usually more expressive, as they can contain more data than untyped approaches. A recent line of work towards utilizing more expressive queries that leverage such annotations is the "pay-as-you-go" querying strategy in dataspaces, in which users give data integration hints at query time; the position of such systems is that the data sources already contain structured information and the problem is to match the query attributes with the source attributes. Even if the system permits users to annotate the information with such attribute-value pairs, the users are typically unwilling to perform the task. Such difficulties result in very basic annotations that are often restricted to straightforward keywords, which makes the analysis and querying of the data cumbersome: users are often restricted to plain keyword searches, or have access only to very basic annotation fields such as "creation date" and "size of document". CADS (Collaborative Adaptive Data Sharing) is an "annotate-as-you-create" infrastructure that facilitates fielded data annotation.
A key contribution of this system is the direct use of the query workload to direct the annotation process, in addition to examining the content of the document. The aim is to steer the annotation of documents towards generating attribute names and attribute values that will typically be employed by querying users, so that these attribute values give the best possible results and users have to deal only with relevant results. In this paper a new technique called Enhanced Collaborative Adaptive Data Sharing (ECADS) is presented. In this technique, predefined tools and keywords are provided to the user to search data in a large dataset. Existing techniques provide no global solution to the user: they offer a search mechanism in which the user needs prior knowledge about the structure of the system. The technique presented in this paper therefore provides more accurate search results than the other techniques. A short description of the proposed technique is as follows. First, the user selects the field in which he wants to search data. Suggestions for search keywords are then provided to the user, which he can use to form a query to search the data. An easy and flexible selection mechanism is provided to select keywords, and a query can be formed automatically to provide better search results for the user. A comparison of the search results for the existing and proposed techniques is presented in the result and analysis section, which shows that the proposed technique provides accurate and time-efficient search results. The rest of the paper is organized as follows: Section II presents a brief literature review of techniques which provide solutions for searching data. Section III presents related work describing the CADS technique for searching data. Section IV describes the proposed technique. Section V presents the experimental setup and result analysis for the proposed technique. An evaluation of the results for the proposed and existing techniques is presented in Section VI. Section VII concludes the paper.
II. LITERATURE REVIEW
In [2], a "pay-as-you-go" user feedback technique is presented to provide feedback for dataspace systems. The technique provides pay-as-you-go querying in dataspaces, using queries that take annotations into account to search data efficiently, with the user providing hints at query time. However, that technique pre-assumes that the user has all the structured information about the searched data, and the remaining problem is to match the query attributes to the source attributes. In [3], a framework for fast disaster recovery in business community information networks is proposed, in which a crisis management system and a disaster recovery system are used.
These systems provide a technique to deal with situations where a crisis occurs and merge it with natural calamity handling. Various techniques are used to provide a framework to recover from and prevent such issues, and a proper solution is required for this problem; the technique presented in this paper resolves the issues related to these problems. In [6], an ensemble technique called RAndom K-labELsets (RAKEL) is presented which provides multi-label classification of data. In RAKEL, the label set is first divided into small labelled subsets, a single-label classifier is learned over the power set of each subset, and the results of these classifiers are then combined. By using label correlation, single-label classifiers are applied over the subsets in a controlled way, with a limited number of labels per subset. In that technique, correlation between tags is used to annotate the labels, but no collaborative annotation mechanism is provided. In [10], a procedure is described to learn about the objects that require annotation by studying a set of previously annotated records; this generally requires the markup of a large collection of files. The MnM system, for example, was developed to examine how this task could be facilitated for domain experts. Simply marking a number of files is not enough; the items marked have to be good examples of the types of contexts in which the items are found. Finding the right mixture of exemplar records is a more difficult task for non-IE specialists than the time-consuming work of marking up a sample of records. Melita addressed this problem by suggesting the best mixture of records for annotation. Unsupervised systems, such as Armadillo, are starting to deal with these challenges by exploiting unsupervised learning approaches. PANKOW (used in OntoMat), for instance, demonstrates how the distribution of distinct patterns on the web can be used as evidence to approximate the formal annotation of entities in web pages through a principle of "annotation by maximal (syntactic) evidence". For illustration, the number of times the phrase "cities such as Paris" occurs on web sites would provide one piece of evidence that Paris is a city, which would be considered in the light of counts of other patterns containing "Paris". In [14], a segmentation technique for words in handwritten Gurumukhi script is presented. Segmentation of words into characters is a challenging task that becomes more difficult when segmenting handwritten script and touching characters. A water-reservoir-based technique is presented: if water is poured from the top to the bottom of a character, the water is stored in the cavity regions of the character, and a region filled by water poured from the top is considered a top reservoir. To analyse the technique, 300 handwritten Gurumukhi scripts were collected. In [15], a data extraction framework which uses a natural-language-processing-based feedback extraction technique is presented, in which data is extracted from the
feedback datasets to extract semantic relations in the domain ontology. Semantic analysis is the process of understanding the meaning of the words and what they represent; it is the process of determining the meaning of a sentence on the basis of the word-level meanings in the sentence.
III. RELATED WORK
Existing work follows the format in which the annotation scheme is improved by the CADS technique; Fig. 1 shows how the existing flow works.
Fig.1. Flow-diagram for Existing technique.
QV: The query value is the "suggest attributes based on the querying value" component, which is similar to ranking attributes based on their popularity in the workload.
CV: The content value is the "suggest attributes based on the content value" component.
In [13], a combination of the CADS and USHER techniques is presented to provide better performance in extracting useful data from raw data. CADS (Collaborative Adaptive Data Sharing Platform) provides a mechanism to automatically generate a query form and to search data from the raw data, while the USHER technique is used to enhance the quality of the results. That combination resolves the limitations of the existing technique and provides an enhanced query to extract useful data from raw data. In [1], an enhanced technique to search data in large-scale data is presented. A metadata structure, containing useful information for extracting data from large-scale data, is generated by analyzing the documents, and a structured algorithm is presented that identifies the structured features present in the documents. Thus the technique called CADS provides enhanced functionality to extract useful information from large-scale datasets that can be used in various applications.
A. Issues with the Previous Technique
In the existing technique, no suitable global solution is provided to the user to annotate and search data. The solution provided can be handled only by a user who is aware of the technique or comes from a technical background; a non-technical person who wants to search data about disaster management or a crisis, for example, would need to know the structure of the process, which is not an efficient way to work.
An efficient technique is therefore required to provide a global solution that enhances the efficiency of the on-demand service used to deliver this information to the user. Some issues with the existing techniques are discussed below: there is no efficient technique provided to the user; the available techniques are more suitable for technical persons; and to get accurate results, a proper query is required.

IV. PROPOSED WORK
An improved CADS technique is presented, in which an efficient query technique is provided to the user to search data. A framework is provided by which any user can access the data efficiently: a predefined query generator is provided to the user to generate the query for the search, so an accurate query is available which helps to produce proper results. Suppose a user wants to search for information about a crisis; in this technique some predefined keywords are provided to give a proper solution to the user, so the user does not need to know the structure of the whole process or remember the whole query to search data. In this way the technique resolves the issues that occur in previous techniques and provides better results than the other techniques. An analysis of the results is presented, showing that the proposed technique improves on the existing techniques, with precision, recall and computation time taken as parameters. In order to calculate precision and recall we use parameter values obtained from processing the dataset, namely TP, FP, TN and FN, which are obtained in the form of a confusion matrix:

True positive rate (or sensitivity): TPR = TP / (TP + FN)
False positive rate: FPR = FP / (FP + TN)
True negative rate (or specificity): TNR = TN / (FP + TN)
False negative rate: FNR = 1 − TPR
Recall and precision are generally used to evaluate the performance of a technique: the results of every technique are observed and used to compare the performance of the various techniques. In image retrieval, for a query q, let A(q) be the set of images retrieved from the dataset and B(q) the set of images relevant to the query. Then

Precision = |A(q) ∩ B(q)| / |A(q)|
Recall = |A(q) ∩ B(q)| / |B(q)|

Precision describes how many of the retrieved images are relevant to the query image, i.e., the ratio of the number of relevant retrieved images to the number of retrieved images, and recall is the ratio of the number of relevant retrieved images to the number of relevant images in the database. These measures are used to evaluate the performance of the technique, and the same holds for text or any other searched data. Fig. 2 shows the flow chart of the step-by-step technique to be used; fewer resources need to be provided when the data is published publicly, and these steps can be performed with different retrieval techniques on CADS.
Fig.2. Flow Chart for ECADS Technique.
ECADS Algorithm: In order to perform the execution of the form-based technique on the querying side, we use techniques that take advantage of the existing CADS approach. The abbreviations used in this paper are described in Table 1:
Table 1. Abbreviations and meanings
Abbreviation   Meaning and usage
CV             Content value, a data usage value from the existing dataset.
QV             Query value of the data, i.e., the available attributes in the dataset.
QQV            Query-side or database-side attribute value in the system.
QCV            A content value which is extracted from the database side.
T              Threshold value used in the proposed technique to generate the form at the query side, based on the QQV.
R              The considered-value output storage element.
Qn             Query table input.
Ci-n           Column indexing from i to n, i.e., from the first column (i = 0) to the last column of the dataset at the database end.
Sds            Sample dataset considered for the simulation environment.
A(Q)           The set of values in the dataset.
base end, using the query value mechanism described in the existing paper [1]. Based on the query value, the content value for each query value is obtained in this step. Let WAj be the set of queries in W that use Aj as one of the predicate conditions; then

P(Aj|W) = (|WAj| + 1) / (|W| + 1)    (1)

Calculate QCV: in this step the content value is determined from the QQV, so that it can be categorized for form-generation suitability:

Content value QCV = value of P(Aj|W)    (2)

Here, with the help of the QQV, we extract the value of the QCV. Constant or calculated T: where QCV is the maximum possible extraction over all unseen values, the score value (T) is generated here:
[A]. ECADS Algorithm Pseudo CODE: Score value (T) =Maximum Average valuesOf (Value of (P (Aj|w))) (3)
For (Qi-n) { Read data and process using technique Classification. Go to next; } Obtain QCV <- QQV Calculate the Score Waj; QCV<Qaj & QQV Content value=QCV= Value of (P (Aj|w))
In this step a conclusion value for form-generation eligibility, i.e., a threshold value based on the content value, is calculated, and the best available values are considered for further processing using this mechanism.
Calculation T Score value (T) =Maximu m Average values Of (Value of (P (Aj|w))) Sequentially process data T> value R put value Generate form and extraction data.
In this process we evaluate the value and read it and if it found the QCV according to the T, it is going to co nsider as final value fo r the form generation and storing in to a variable R, i.e. form generation attribute.
[B]. ECADS Algorithm:Input: SDS, Models. Output: Query output & QCADS Form Steps: Retrieve next Qi from column 1-n.
Apply QCV Else Go to Step 1 Finally in this step a query side form is generated in order to process the data available in form manner such that an efficient query technique can be apply in the d ataset.
In this first step the dataset is loaded and the column values are retrieved from the initial point to the end of the columns; data is read from the first column up to the last column in order to read the values and use them in the further work of generating the dynamic form according to the conditions of the algorithm. Get the column and data value for each column Ci, and calculate the QQV and QCV. In this step the first calculation obtains knowledge about the column value, which is called the query querying value, such that it works at the data
Sequentially process data T> value R put value
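The following is a minimal sketch of the attribute-scoring and selection step described above, written only as an illustration. It assumes the query workload W is available as a list of attribute sets, computes P(Aj|W) per equation (1), takes the threshold T as the average score (one reading of equation (3)), and keeps attributes whose score exceeds T for the generated form; all names are assumptions, not the authors' implementation.

// Illustrative sketch of ECADS attribute scoring and form-attribute selection.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class EcadsFormAttributes {
    static List<String> selectFormAttributes(List<Set<String>> workload, List<String> attributes) {
        Map<String, Double> score = new HashMap<>();
        for (String attr : attributes) {
            long withAttr = workload.stream().filter(q -> q.contains(attr)).count();   // |W_Aj|
            score.put(attr, (withAttr + 1.0) / (workload.size() + 1.0));               // eq. (1)
        }
        double threshold = score.values().stream()
                .mapToDouble(Double::doubleValue).average().orElse(0);                 // T, taken here as the mean score
        List<String> selected = new ArrayList<>();
        for (String attr : attributes) {
            if (score.get(attr) > threshold) {       // only high-scoring attributes go on the query form
                selected.add(attr);
            }
        }
        return selected;
    }
}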
V. EXPERIMENTAL SETUP & RESULT ANALYSIS
To evaluate the proposed work, we performed our experimental setup and analysis using the Java API with JDK 8.0, where the Swing API was used for the design and other supporting APIs for implementing the logical requirements of the algorithm; the following results were produced while comparing the existing and proposed algorithms.
In Table 2, the True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) counts are reported: TP = correctly identified, FP = incorrectly identified, TN = correctly rejected, FN = incorrectly rejected. These values are used to calculate the precision and recall of each technique, which are taken as parameters to evaluate its performance:

Precision = positive predictive value = TP / (TP + FP)
Recall = true positive rate = TP / (TP + FN)
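The sketch below computes precision and recall from the confusion-matrix counts reported for the proposed approach in Table 2; the counts come from the table, while the code itself is only an illustration.

// Illustrative precision/recall computation from the Table 2 counts (proposed approach).
class PrecisionRecallExample {
    public static void main(String[] args) {
        double tp = 53, fn = 37, fp = 38;            // proposed approach (Table 2)
        double precision = tp / (tp + fp);           // TP / (TP + FP)
        double recall = tp / (tp + fn);              // TP / (TP + FN)
        System.out.printf("precision=%.2f%%, recall=%.2f%%%n", precision * 100, recall * 100);
    }
}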
Fig.3. File search module for ECADS.
A NETBEA NS IDE a simu lator which provides development environment to develop projects in java is used. In the proposed technique a global query architecture to provide an enhanced query to search data is presented. In fig. 3. A query generation mechanism is presented. That provide keywords to build query to search data and enhance the performance of the whole search mechanis m. In existing technique, there is query generation mechanism is provided which degrades the performance of the whole search mechanis m. Because if user not aware about the technical aspects of the topic for wh ich the search operation is performed, that is too difficu lt to build an accurate query for the search operation. That degrades the performance o f the whole technique in terms of accuracy and efficiency. Thus an enhanced technique is presented in this paper. To imp lement the proposed technique a dataset of the earthquake in Nepal and disaster management is used. And an analysis over that data is performed to analyse the performance of the proposed technique. A statistical and graphical co mparison for the search results provides by the existing and proposed technique is provided in this section. Precision, recall and execution time are taken as an evaluation parameter to evaluate the performance of the techniques. T able 2. Comparison of the existing technique and Proposed technique. Algo name Existing Approach Proposed Approach
Table 2. Comparison of the existing technique and the proposed technique

Algorithm          | TP Value | FN Value | FP Value | TN Value
Existing approach  | 53       | 8.0      | 67.0     | 22.0
Proposed approach  | 53       | 37.0     | 38.0     | 22.0
From these parameters we calculated the measures actually required to compare the techniques, namely precision, recall and computation time, for the two approaches; the values obtained with the Nepal dataset used in this project are reported below as a statistical analysis of the results. The proposed approach achieved a lower computation time together with the required precision and recall values.
Recall = true positive rate = TP / (TP + FN). A comparison of the precision, recall and computation time of the existing and proposed techniques is presented in Table 3. It shows that the proposed technique consumes less time to provide search results for a query, and the comparison of precision and recall shows that the proposed technique provides better search results than the existing technique. In this way the proposed technique provides an enhanced mechanism for searching data.
Table 3. Statistical analysis of results for the existing and proposed techniques

Algorithm          | Computation time (ms) | Precision (%) | Recall (%)
Existing approach  | 25022                 | 44.16         | 88.88
Proposed approach  | 9142                  | 58.24         | 58.88
Fig.4. Result chart for the precision and recall values of the CADS and ECADS techniques.
A graphical comparison of the existing and proposed techniques over the precision and recall parameters is presented in Fig. 4, which shows that the proposed technique provides accurate results in the context of search. The proposed approach generates the query automatically, using a dynamic-form-based query and data accessing system. Recall and precision are the measures generally used to evaluate such techniques: the results of every technique are observed and used to compare the performance of the various techniques, so the evaluation of querying the dataset is always carried out with these two calculations. A graphical comparison over time is presented in Fig. 5, which shows that the proposed technique takes a smaller span of time to provide search results. The user is thus relieved of the burden of query formation, and the computation time is reduced while still producing good results.
Fig.5. Graphical comparison of the computation time of the CADS and ECADS techniques.

In this way a search technique is provided to the user which gives time-efficient and accurate search results.

VI. EVALUATION OF EXPERIMENTS
Precision and recall are generally used to evaluate the performance of a search technique. A statistical and graphical analysis of the technique is presented in Section V, which shows that the proposed technique performs better than the other techniques. For a query q, let A(q) be the set of results retrieved from the dataset and B(q) the set of results relevant to the query. Then
Precision = |A(q) ∩ B(q)| / |A(q)|
Recall = |A(q) ∩ B(q)| / |B(q)|

Computation time: computation time is the length of time required to perform a computational process. It can be represented as a sequence of time slots for performing computation on the various available segments of the services, and is proportional to the number of services. A comparative analysis of the existing and proposed techniques is presented in Section V; it shows that the proposed technique provides an accurate and efficient mechanism for building an accurate query to perform a search operation on the data.

VII. CONCLUSION
Various techniques are used to search data in large-dataset scenarios, but they suffer from defects such as high computation time and the lack of a global solution. In the existing technique, obtaining an accurate result requires a proper structure for the data, and it is difficult for every user to know that internal structure; a new technique is therefore proposed in this paper. The technique provides predefined keywords for searching the data; these keywords carry information that helps the user search the data and obtain results without knowing the structure of the whole system. In this way an ordinary user who knows nothing about the system can perform a search and obtain accurate results. A result analysis is presented in Section V, which shows that the proposed technique provides better results than the other techniques.
How to cite this paper: Rakesh Malvi, Ravindra Patel, Nishchol Mishra, "ECADS: An Efficient Approach for Accessing Data and Query Workload", International Journal of Information Technology and Computer Science (IJITCS), Vol.8, No.12, pp.39-46, 2016. DOI: 10.5815/ijitcs.2016.12.05
Authors' Profiles

Rakesh Malvi is currently working toward the M.Tech degree in the School of Information Technology at Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal. He received the BE degree from the RKDF College of Technology & Research, Bhopal.

Dr. Ravindra Patel is working as an Associate Professor in the Department of Computer Applications at Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal. He completed his Master in Computer Applications at Barkatullah University, Bhopal, India in 1998 and his Ph.D. in Computer Science at Rani Durgavati University, Jabalpur, India in 2005. He has published 32 research papers in various international journals (including Elsevier, Springer and IEEE-indexed journals) and around 11 research papers in the proceedings of peer-reviewed conferences in India and abroad, and has contributed chapters to edited books published by Elsevier and Springer. He has more than 14 years of post-graduate teaching and research experience and more than 8 years of administrative experience as Head of Department. His areas of research include, but are not limited to, Data Mining, Big Data Analytics, Cyber Security and Human-Computer Interaction.

Dr. Nishchol Mishra is working as an Associate Professor in the School of Information Technology at Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal. He holds an M.Tech and a Ph.D. in Computer Science & Engineering. He has published over 40 papers in reputed international journals and has more than 14 years of research experience. His areas of research include Multimedia, Data Mining, Image Processing and Data Analytics.
I.J. Information Technology and Computer Science, 2016, 12, 47-58 Published Online December 2016 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2016.12.06
Journey of Web Search Engines: Milestones, Challenges & Innovations
Mamta Kathuria
YMCA University of Science & Technology, Faridabad, 121001, India
E-mail: mamtakathuria7@rediffmail.com

C. K. Nagpal and Neelam Duhan
YMCA University of Science & Technology, Faridabad, 121001, India
E-mail: nagpalckumar@rediffmail.com, neelam.duhan@gmail.com
Abstract—Past few decades have witnessed an information big bang in the form of the World Wide Web, leading to a gigantic repository of heterogeneous data. A humble journey that started with a network connection between a few computers in the ARPANET project has reached a level wherein almost all the computers and other communication devices of the world have joined together to form a huge global information network that makes available most of the information related to every possible heterogeneous domain. Not only are the managing and indexing of this repository a big concern, but providing a quick answer to the user's query is also of critical importance. Amazingly, rather miraculously, the task is being done quite efficiently by the current web search engines. This miracle has been possible due to a series of mathematical and technological innovations continuously being carried out in the area of search techniques. This paper takes an overview of search engine evolution from the primitive to the present.

Index Terms—World Wide Web, Search Engines, Web Search, Information Retrieval.
I. INTRODUCTION
In today's life it has become hard to think of life without the internet. It is amazing to imagine that this integral part of our current daily life was almost non-existent half a century ago and was an expensive academic luxury a few decades back. An innovation which started in the 1960s, with a view to connecting the immobile, bulky computers of that time in order to avoid the postage and travel delay of storage devices, underwent tremendous scaling and started taking almost every communicating device into its fold. The flexible, scalable network created by these heterogeneous devices gave birth to an information repository that was commonly sharable worldwide, leading to the coining of the term World Wide Web (WWW) in the early 1990s.
For the purpose of information retrieval from the WWW, an application known as a web browser can be used, which has to be provided with the unique identity of the resource in possession of the information, known as its Uniform Resource Locator (URL). The tremendous growth of the WWW led to a huge number of information resources, each with its own URL(s), resulting in an enormous number of websites beyond the grasp of any individual. This led to the requirement of manual directories and automated mechanisms to provide the list of desired URLs in possession of the requisite information. The total number of online websites crossing the one billion mark in September 2014 [1], combined with continuous growth, has made it meaningless to manage the system solely through manual directories, making an automated system essential, though a combination of both still continues. Fig. 1 shows the year-wise rise in the number of websites.

Fig.1. Proliferation in the number of web sites.
The exploration of automated mechanisms to find the desired URLs led to the creation of one of the most complex and complicated types of software in the world, known as the Search Engine. Search engines help their users gather and analyze the large amount of information available on various resources on the internet by presenting it in a categorized, indexed and logical way. Use of a search engine is the second most common activity amongst internet users, next to sending/receiving e-mails [2], as depicted in Fig. 2.
Fig.2. Internet activities of different users (percent of internet users who report each activity).

With the use of mathematical, statistical and technological innovations to exploit the enormous growth of the WWW, search engines have been able to provide their users the requisite information in all heterogeneous domains and have proven to be indispensable information providers. Let us take a look at the rapid evolutionary process which search engine technology has undergone with time. The paper contains five sections. Section 2 contains basic terminology associated with search engine technology, including the various types of search engines, their basic architecture and search methodologies. Section 3 contains a list of a few prominent search engines that evolved along the journey, with their salient features. Section 4 talks about the current challenges faced by the search engine industry and the associated innovations. Section 5 includes the persistent issues which will continue to exist in the domain of web search due to its inherent structure and operations.

II. SEARCH ENGINE BASICS
This section describes the various types of search engines along with their architecture and search methodologies.

A. Crawler Based Search Engine
We start our journey with the general architecture of a typical crawler-based search engine, as shown in Fig. 3.
Fig.3. General architecture of a web search engine (front end: search interface, query processor and ranking module; back end: crawler, indexer and index, operating over the WWW).
The complete process of searching is divided into two phases: the back-end phase and the front-end phase.
At the front end, when the user submits his query in the form of keywords on the interface of the search engine, the query processor/engine executes it by matching the query keywords with the document information present in the index. A page is considered a hit if it possesses at least one of the query keywords. The matched URLs are retrieved from the index and given to the ranking module, which returns a ranked list to the user. At the back end, the crawler is the most important component of the search engine: it traverses the hypertext structure of the WWW, downloads the web pages and parses them. The parsed pages are then routed to an indexing module that builds the index on the basis of the
different terms present in the pages. The index is used to keep track of the web pages fetched by the crawler. Some of the most prevalent crawler-type search engines include Google, Yahoo, Bing, Ask and AOL. When one has a specific query in mind, crawler-based search engines are quite efficient at finding relevant information; for a generic query, however, a crawler-based search engine may return a large number of irrelevant responses.

B. Human-powered Directories
Another type of search engine is the human-powered directory. These search engines classify web pages on the basis of brief human descriptions, which can be provided by webmasters or by the editorial group of the directory. Search engines in this category include the Yahoo directory, Open Directory and LookSmart [3]. Human-powered directories are good for searches on general topics, where they can guide the searcher in narrowing down the search and obtaining refined results [4]; for specific searches, however, they do not provide an efficient way to find information.

C. Hybrid Search Engine
A hybrid search engine (HSE) uses different types of data, with or without ontologies, to yield algorithmically generated outcomes based on web crawling. Earlier types of search engines used only text to generate their results, while hybrid search engines use a combination of both crawler-based results and human-powered directories [5]. Most search engines these days are moving towards a hybrid model. Search engines in this category include Google, Yahoo and MSN Search.

D. Meta-search Engines
A meta-search engine uses the services of other search engines and forwards the user's query simultaneously to several search engines working in the background. The results supplied by these search engines are then integrated and, after the application of features like clustering and the removal of replicates, presented to the user. Search engines in this category include Dogpile [30, 31], Mamma [6] and Metacrawler [19]. Meta-search engines are good for saving time by searching in only one place, sparing the user the need to use several separate search engines. Fig. 4 shows the architecture of a meta-search engine.

E. Vertical Search Engine
Vertical search engines focus on a particular domain of search. They are also referred to as specialty or topical search engines. Common verticals of search include travel, online shopping, legal information and medical information. The crawler of these search engines focuses on the web pages of the particular domain and is referred to as a focused crawler.
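To make the back-end/front-end split for a crawler-based engine concrete, the sketch below builds a tiny in-memory inverted index (the output of the indexer) and answers a keyword query by collecting every page that contains at least one query term (a "hit" as defined above). It is an illustrative toy under those assumptions, not the implementation of any particular engine, and it omits the crawling and ranking stages.

```java
import java.util.*;

// Toy inverted index illustrating the crawler/indexer/query-processor split described above.
public class TinySearchEngine {

    private final Map<String, Set<String>> index = new HashMap<>();   // term -> URLs containing it

    // Back end: the indexer adds each parsed page under every term it contains.
    public void indexPage(String url, String pageText) {
        for (String term : pageText.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, t -> new HashSet<>()).add(url);
            }
        }
    }

    // Front end: a page is a hit if it contains at least one query keyword.
    public Set<String> query(String keywords) {
        Set<String> hits = new LinkedHashSet<>();
        for (String term : keywords.toLowerCase().split("\\W+")) {
            hits.addAll(index.getOrDefault(term, Collections.emptySet()));
        }
        return hits;   // a ranking module would order these before returning them to the user
    }

    public static void main(String[] args) {
        TinySearchEngine engine = new TinySearchEngine();
        engine.indexPage("http://example.org/a", "web search engines crawl and index pages");
        engine.indexPage("http://example.org/b", "directories are edited by humans");
        System.out.println(engine.query("search engines"));   // [http://example.org/a]
    }
}
```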
III. THE MILESTONES IN THE JOURNEY
After a brief discussion of the various types of currently prevalent search engines, let us look at the journey travelled by search engine technology over the period and the various milestones crossed. The journey is presented in Table 1, which lists most of the prominent search engines evolved along the journey together with their year of development, the developing team members/organization, features and innovations, current activation status and Alexa rank [83, 84].
IV. CHALLENGES & INNOVATIONS
With time, search engines have evolved and are facing novel and previously un-envisaged challenges. These challenges are being handled through innovations. This section takes a look at the challenges and the corresponding innovations.

A. Standardization
Multiple standards and schemas are prevalent for marking up different types of information on web pages, making it difficult for a webmaster to choose one. A common schema supported by the major search engines was required to resolve this problem. Schema.org [53] is a collaborative effort by Bing, Yahoo, Google and Yandex to help search engines achieve faster and more relevant search using a structured data markup schema that assists in recognizing people, events and attributes on web resources. The on-page markups help search engines understand the information on web pages and provide richer search results. Schema.org is not a standards body like the W3C [9, 10] or IETF [54] but a website providing the schema and markup supported by the major search engines. The common markup and schema is mutually beneficial for all the stakeholders, i.e. webmasters, search engines and users.

B. Beyond Keywords
The conventional mechanism of web search by a search engine is based upon the keywords typed by the user/searcher [55]. With time, efforts are being made to extend keyword-based web search to semantic search, wherein a search engine is expected to understand natural language using machine intelligence and identify the underlying intent of the searcher. The underlying concept of semantic search is based upon semantic similarity taken over documents [56], words [57, 58], terms [59], sentences [60] and entities [61]. The available search engines in this category include Powerset [62], Hakia [63] and Google Hummingbird [51]. To implement natural language search, Powerset uses a natural language technology platform developed by the Palo Alto Research Centre (PARC) that can encode synonyms and identify relationships between entities. Hakia uses its own feature called QDEX, which is inclined towards analyzing web pages rather than indexing them.
For short queries it displays relevant categories, and for long queries it displays relevant sentences and phrases. Google Hummingbird takes into account the entire sentence (instead of individual keywords) in order to understand the underlying intent of the user.
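Much of the semantic-similarity work cited above ([56]-[61]) ultimately reduces to comparing vector representations of words, documents or queries. The fragment below shows the standard cosine-similarity computation over such vectors; how the vectors themselves are produced (co-occurrence counts, embeddings, etc.) is left open, so this is only a generic illustration with made-up numbers.

```java
// Cosine similarity between two term/document vectors (generic illustration).
public class CosineSimilarity {

    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0;   // guard against empty vectors
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] car = {0.9, 0.1, 0.4};            // toy vectors; real ones would come from a corpus
        double[] automobile = {0.8, 0.2, 0.5};
        System.out.printf("similarity = %.3f%n", cosine(car, automobile));
    }
}
```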
Table 1. Milestones in the Journey
Sr.No
Year/Search Engine
1.
1990 Archie [7 ]
2.
1992 Veronica & Jughead[8]
3.
1993 W3 Catalog[9,10]
4.
5.
6.
Key Developer / Developed at or Owner Alan Emtag e, Peter J. Deutsch, Bill McGill University, Montreal Fred Barrie, Rhett Jones Unive rsity of Naved a Syste m Computing Servic e s group
Features and Innovations
Current Active status/ Alexa Rank
1. 2. 3. 4.
FT P Server based sharing of files crawling concept Script-based data gatherer Regular Expression based matching retrie val o f files for user query
Not Active Alexa N.A
1. 2. 3. 4.
Menu Driven approach Ability to search plain text files Keyword based search in 4. Its own designed Gopher Index System
Not Active Alexa N.A
Oscar Nierstr asz Unive rsity of Gene va
1. Purely textual browser 2. Integration of manually maintained catalogue. 3. Dynamic querying
Not Active Alexa N.A
1993 JumpStation[11]
Jonathon Fletcher Unive rsity of Stirling
1. Combines crawling, searching and indexing 2. Lays the foundation for current form of search engines 3. Unable to grow because of linear search drawback
Not Active Alexa N.A
1993 WWW Wanderer[12]
Matthe wG ra y Massa c huse tts Institute of Techno logy
1. Introduces web robots to crawl the web 2. T rack the web's growth, Indexed titles and URLs 3. Did not facilitate web search, major goal to measur e web size 4. Perl based web crawler
Not Active (Redir e cte d to Yaho o)
Martijn Koster United Kingdom
1. Devoid of crawling mechanism 2. Website administrator had to register with Aliweb to get their services listed & indexed 3. Cap ability to perform Archie Like Indexing for the web
Active ( ww w.a l iweb.com)
1993 Aliweb[13]
7.
1994 Web Crawler[14]
8.
1994 Meta Crawler[15]
Erik Selber g, Oren Etzioni Blucor a Inc.
1. Lays the foundation for Content Based Search 2. Use of Boolean operators in user query 3. User Friendly Interface
Active, Aggre ga tor, (https://ww w.w ebcrawler.com/) 674
1. Introduced the concept of meta search wherein search results of major search engines are combined to widen the search results. 2. Doesâ&#x20AC;&#x2DC;t have its own search index
Active, Aggre ga tor, (http://ww w.m etacrawler.co.uk/
1. Search tool compatible with Internet Explorer (4.x or above) and Netscape 4.x. 2. It is a spyware and search toolbar program 3. Displays algorithmic search results from Google, Ask.com, Yahoo and LookSmart, along with sponsored listings, primarily from Google. 4. Easy to add/remove additional software products to the Toolbar. 5. Free to use
Active but powered by google(http://h ome.mywebsea rch.com/index. jhtml) 405
9.
MywebSearch[16]
IAC
10.
1994 Lycos[17]
Mauldin Mich ea l L. Caneg ie Mellon Univ., Pittsburg
11.
1994 Inktomi[18]
Eric Brewer University of California
1. First major search engine to launch a paid inclusion service 2. Handles thousands of search queries by distributing among many servers
12.
1994 Infoseek[19]
Steve Kirsch Infoseek Corporation
1. Provided subject oriented search 2. Allowed real-time submission of the page
Not Active Alexa N.A
13.
1995
Joe Kraus,
1. Both concept & keyword based search
Active, Now
1. Prefix matching and word Proximity 2. Keyword, search on image or sound files 3. Focuses more on directory
(http://ww w.ly cos.co m /S ea r c h/) 9041 Not Active, Aquir ed by Yaho o
Excite[20]
2. 3. 4. 5.
Large & up-to-date index Excellent summaries Fast, flexible, reliable searching Idea of statistical analysis of word relat ionship for efficient search
1995 AltaVista[21]
Louis Monier, Micha e l Burrows Digital Equipment Corporationâ&#x20AC;&#x2DC;s
1. 2. 3. 4. 5. 6. 7.
Fast Multithreaded crawler & Back-end search Keyword based simple or advanced search Multilingual search capabilities Periodic Re-indexing of sites High bandwidth Allow natural language query Inbound link checking
Not Active, Shutdown in 2013, redirecte d to Yaho o
15.
1995 Yahoo[22,23]
David Filo, Jerry Yang Yahoo Corpor atio n
1. Keyword based search 2. Web directory organized in hierarchy 3. Separate searches for images, news stories, video, maps, shopping 4. Supports full Boolean searching 5. Support Wild Card Word in Phrase
2nd largest Active SE (https://in.yaho o.com/) 4
16.
1995 AOL[24]
Bill von Meiste r Control Video Corporation
1. 2. 3. 4.
Microsoft Microsoft ltd.
1. Large and unique database 2. Boolean searching 3. Cached copies of Web pages including date cached 4. Automatic local search options. 5. Neural n/w added features
Active as Bing (http://ww w.m sn.com/en-in/)
18.
1996 DogPile[26,27]
Aaron Flin Blucora Inc.
1. Meta Search engine 2. Has its own search Index 3. Searched multiple engines, filter ed fo r duplicates and then presented the results to the user 4. Special provisions for Stock quotes, weather forecast, yellow pages 5. etc.
Active, Aggre ga tor (http://ww w.d o gpile.com/) 3084
19.
1996 InfoSpace[28]
Naveen Jain Infospace Inc.
1. Meta Search Engine 2. Selects results from the leading search engines and then aggregates, filters and prioritizes the results to provide more comprehensive results 3. Instant messenger service
Active (http://infospac e.com /) 2110
1996 Hotbot[ 29,30]
Wired Magazine Inktomi Corporation
1. Extensive use of cookie technology to store personal search preference information 2. Ability to search within search results 3. Frequent updation of Database Use of parallel processing
Active(http://w ww.hotbot.com/) 100902
1996 WOW[31]
Jeniff er Thomp son Comp u Serv e
1. First internet service to be offered with a monthly "unlimited" rate 2. Brightly colored 3. Seemingly hand-drawn pages. 4. Find all of the breaking news articles, top videos and trending topics that matter to you. 5. Effective advertising 6. Highly communicative design 7. Budget friendly media services 8. Creative concept development
Active (http://www.w ow.com /) 767
22.
1996 Ask[32,33]
David Warthen, Garrett Grue ne r IAC/ InterActive Corporation
1. Natural language-based Search 2. Both concept & keyword based search 3. Allows to enter query in the form of sentence for humanize the online experience 4. Question answering system
Active (http://www.as k.com/) 28
23.
1997 Daum[34]
Daumkakao Daum Corparation
1. A popular search engine in Korea 2. Besides internet sear ch provides facilities for Email, Chat, Shopping etc.
Active (www.da um .ne t/) 140
24.
1997 Overture[35]
Bill Gross Yahoo
1. Paid search inspired from commercial telephone directory 2. Secured, pay-per-placement directory service
Not Active Alexa N.A
14.
17.
20.
21.
1995 MSN[25]
Graha spencer Garage in Silicon velley
Started as Internet Messenger Service Subscriber based service Movie & Game portal
an interne t Portal(http://w ww.e xcite .co m/) 7951
Not Active (http://w w w.ao l.in/)
25.
1997 Yandex[36]
26.
1998 Google[37,38]
Taylo r Nelson Sofres San Franc isco Bay Area Serge y Brin, Lawrence Page Stanford University, Stanford
1. Full-text search with Russian morphology support 2. Encrypted search 3. Multilingual
Active (https://ww w.y andex.com/) 20 Active as most popular SE (https://ww w.g oogle.co.in/) 1
1. 2. 3. 4. 5.
Keyword based search Page Rank algorithm Semantic search Free, Fast and easy to search No programming or database skills required
27.
1999 AlltheWeb[39]
T or Egge Norwegian Univ. of Sci. & Tech.
1. 2. 3. 4. 5. 6.
Faster Database Advanced search features Sleek interface FAST ‘s enterprise search engine search clustering completely customizable look
28.
2000 T eoma[40]
Apostolos Gerasoulis Rutgers Univ. compu ter lab
1. 2. 3. 4. 5.
Provide knowledge search Provide subject specific popularity Clustering T echniques to Determine Site Popularity Unique Link popularity
29.
2000 Baidu[41]
Robin Li Beijing China
1. largest internet user population 2. pay per click marketing platform 3. China‘s Google
30.
2007 LiveSearch[42]
Satya Nadella Microsoft
1. Uses a drag-and-drop interface that's really simple to pick up 2. T he new search engine used search tabs that include Web, news, images, music and desktop
2008 DuckDuckGo[43]
Gebriel Weinberg DuckDuckGo Inc.
1. Offers real privacy or protecting searchers' privacy and avoiding the filter bubble of personalized search results 2. Smarter search, and stories that user likes 3. Not profiling its users and by deliberately showing all users the same search results for a given search term 4. Emphasizes on getting information from the best sources rather than the most sources
Active (https://duc kd u ckgo.c om /) 506
32.
2008 Aardvark[44]
Max Ventilla, Nath an Stoll The Mecha nic al Zoo, A San Franc isc o based startup
1. Use Social n/w facilitated a live chat or email conversation with one or more topic experts 2. Social search Engine 3. Aadvark Ranking Algorithm
Not Active Alexa N.A
33.
2009 Bing[45]
Steve Billmer Microsoft
1. 2. 3. 4. 5.
Active (https://ww w.b ing.com/) 24
31.
34.
2009 Caffeine[46]
Matt Cutts Google
35.
2010 Google Insta nt[4 7]
Marissa Mayer & Matt Cutts Google
Keyword based search Index updated on weakly or daily basis Advertised as a decision engine Social integrations are stronger Direct information in the area of finance & sports
1. New web indexing system 2. Near-real-time integration of indexing and ranking 3. Allows easier annotation of the information stored with documents 4. Provide 50% fresher result 5. Find links to Relevant content much sooner 6. Update search index on a continuou s basis, globally. 7. Caffeine processes hundreds of thousands of pages in parallel. 8. Nearly 100 million gigabytes of storage in one database 1. Search-before-you-type 2. Predicts the users whole query 3. Faster Searches, Smarter Prediction, Instant Result 4. User Experience 5. Provide Autocomplete Suggestion
Not Active (URL redir e cte d to Yahoo)
Not Active , Redirected to Ask.com Active (http://w w w.ba idu.com/) 5 Active as Bing, Launch ed as rebra nde d MSN search (https://ww w.li ve.co m /)
Active (http://googlebl og.blo gspo t.in/ 2010/06/ournew-se a rc hindexcaffe ine .htm l)
Active
36.
37.
38.
2010 Blekko[48]
2013 Contenko[49]
2013 Alhea[50]
Rich Skrenta Blekko Inc.
1. Uses slash tags to allow people to search in more targeted categories 2. Spam Reduction 3. Provides better search results than those offered by Google Search, by offering results culled from a set of billion trusted websites and excluding material from such sites as content farms. 4. Dynamic interface graph algorithm 5. Blekko offers a web search engine and social news platform that provides users with curated links for the entered search criteria. 6. Provided downloa da ble search bar which was later acquired by IBM
Active, Aquir ed by IBM(www.ble kko.com ) 4518
T omas Meskauskas Amerow LLC
1. Deceptive Internet Search, promoted using various browsers hijackers 2. Provides Innovative means for browsing the internet 3. Its Startup page doesn‘t contain any links to privacy terms or terms of use
Active (http://w w w.co ntenko.c om /) 4505
1. Offers a single source to search the Web, images, audio, video, news from Google, Yahoo!, and many more search engines. 2. Alhea .com compile s results from many of the Web's major search properties, delivering
Active (http://www.al hea.com/) 11225
1. Focuses on eliminating sites that didn't have enough quality content and were more geared at moneymaking than providing useful content. 2. Provide new Google‘s search results ranking algorithm 3. Quality Search results
Active (http://ww w.g oogle panda.com/)
Manue l Barrios Amazon T echnologies Inc.
39.
2011 GooglePanda[51]
Navne et Panda and Vladimir Ofitse ro v Google
40.
2012 GooglePenguin[50]
Matt Cutts Google
41.
42.
2013 Google HummingBird[51]
GianlucaFiore Lli Google
2015 SciNet[52]
Tuukk aR u ots alo, KumaripabaA thukor ala , DorotaGłowa cka, Ksen iaK ony u shkova ,Antti Oula svirta ,Sa muliKaipiaine n, Samu el Kask i, Giulio Jacucci Helsinki Institute for Information T echnology HIIT , Finland
1. 2. 3. 4.
Web spam update goal of concentrating on webspam Search Algorithm update Protect your site from bad links .
1. A core algorithm update may enable more semantic search and more effective use of the Knowledge Graph in the future, Hummingbird is about synonyms but also about context Google 2. Hummingbird is designed to apply the meaning technology to billions of pages from across the web, in addition to 3. Knowledge Graph facts, which may bring back better results 4. Search Algorithm update 5. Understand the intent of the user
1. 2. 3. 4.
Reinforcement Learning Auto-suggestion for specific topic & document Interactive approach A new search engine that outperforms current ones and helps people search more efficiently. 5. SciNet displays a range of keywords and topics in a topic radar
This type of search is referred to by Google as conversational search [37, 38] and is intended to take into account both the context and the intent of the search.
(Current status, continuing Table 1: Google Penguin: Active, Alexa N.A.; Google Hummingbird: Active, Alexa N.A.; SciNet: Active, Alexa rank 2159988.)
One of the major differences between keyword-based search and semantic search is that semantic search takes into account connecting words like in, by, for and about, as they are vital to the meaning of the sentence (semantic impact), while these words are simply discarded in keyword-based search.
C. Knowledge Graph and Entity Based Search
The basic strategy of keyword-based search, as used by conventional search engines, has a major drawback: it is often unable to capture the real sense of a query because it does not explore the underlying real-world connections, properties and relationships [64]. The new type of search is referred to as entity-based search, and major work in this regard has been done by Microsoft's Satori [65] and Google's Knowledge Graph [66]. To accomplish entity-based search, data and unstructured information are extracted from web pages and a structured database of nouns (people, places, objects etc.), including their relationships, is created. The newly defined structure is referred to as a web of concepts [67]. The transformation from the unstructured web to the web of concepts includes three processes, namely information extraction, linking (mapping the relationships) and analysis (categorizing the information about an entity). The Knowledge Graph is similar to Facebook's Open Graph and is derived from Freebase [68].

D. Avoiding Memory Recall (Option Based Search)
A novel strategy was adopted by the SciNet [52] search engine to cater to the personal needs of the user, wherein the search process has been converted from a memory recall process (thinking of keywords) to a recognition process (making a selection from given choices). Depending upon the user's past behavior, the search engine exhibits the potential topics/keywords along with an intent radar indicating the potential directions in which the search may lead.

E. Social and Continuous Search
A novel initiative has been taken by a new search engine called Yotify [69] that does not reply to the user's query instantly but keeps on searching the websites to find appropriate answers and sends them by e-mail. For example, if somebody is looking for a house in a desired set of localities, or for a particular type of furniture, then Yotify keeps searching the associated websites. In contrast to Google and Yahoo alerts, which focus on news and other information, Yotify is more concerned with shopping. At present, the problem with Yotify is that it can scan only a small portion of the web and lacks the breadth of Google and Yahoo.

F. Deep Web Search
Web search engines are just web spiders which index webpages by following hyperlinks one after the other. However, there are some places a spider/crawler cannot enter, e.g. the database of a library, or webpages belonging to the private networks of organizations, which may require a password for access. Such parts of the web, which remain un-indexed, are referred to as the Deep Net, Deep Web, Invisible Web or hidden web. Despite the remarkable progress in search technology, the size of the deep web is much larger (nearly 500 times) [70] than the indexed web. The basic reasons for the non-indexing are the following:

- Dynamic pages which are accessed only through the filling of forms whose contents are related to domain knowledge.
- Web pages that are not linked to other pages, i.e. pages without any inlinks/backlinks, which makes their contents inaccessible.
- Websites requiring registration and login.
- Webpages whose content varies as per access rights and contexts.
- Websites prohibiting search engines from browsing them by using Robot Exclusion Standards or mechanisms such as CAPTCHA codes.
- Textual content encoded in multimedia files or other file formats which are not conventionally readable by search engines.
- Web content intentionally kept invisible to the standard internet; such content is accessible only through darknet software like Tor [71] and I2P [72].
- Archived versions of web pages and websites which have become time-irrelevant and are not indexed by search engines.

However, with time various search engines have come onto the market which make available certain segments of deep web resources. Some of these are Infomine [73], created by a group of libraries in the USA; Intute [74], created by a group of universities in the UK; CompletePlanet [75], providing access to nearly 70000 databases over heterogeneous domains; Infoplease [76], providing access to encyclopedias, atlases and other such resources; DeepPeep [77]; IncyWincy [78]; Scirus [79]; TechXtra [80]; etc. The deep web search engines mentioned in this subsection have been created with a positive intent: to provide controlled access to databases, clubbed through authorization, to the legitimate and authorized users who need them for academic or commercial purposes.

G. Onion Search
The onion search [81] is a type of deep web search, but with a negative intent. The onion search provides a kind of opacity wherein both parties, i.e. the information provider and the one accessing the information, are difficult to trace, not only by others but even by each other. The .onion is a pseudo top-level domain (TLD) host reachable via the Tor network [71]. The TLD in the case of .onion sites is not an actual DNS root but an access mechanism provided through a proxy server. The addresses in the onion domain are automatically generated based upon the public key when the hidden service is created/configured.
The onion search is being used for undesirable purposes such as drug and arms dealing. One such search engine is Onion.city [82]. The onion search is also being used by investigating agencies and defense organizations to penetrate the deep web. One such effort is Memex, a Defense Advanced Research Projects Agency (DARPA) project to find things on the deep web which are not indexed by the major search engines [82].

H. Entity Search
The search has now changed its way from finding "strings" (i.e. combinations of letters in a search query) to finding "things" (i.e. entities). The move from "strings" to "things" has helped in database searches, where bits of data are placed on a knowledge graph to answer who, what, when, where and how type questions. Entity search gives a new insight into search optimization because Google can now provide direct answers to many queries within the search results. This effort increases the relevancy of search results by identifying what a query term means and helps to understand the correlation between strings of characters and their real-life context. Google's entity search aims to expand the Knowledge Graph by understanding relationships, stringing together relevant data and making real-world connections between content and how users search.
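A web of concepts of the kind described in subsections C and H is, at its simplest, a store of subject-predicate-object facts about entities. The toy triple store below shows how a "things not strings" lookup might answer a who/what/where-style question directly from stored facts; the class names and data are invented purely for illustration and do not reflect any real knowledge-graph API.

```java
import java.util.*;

// Toy subject-predicate-object store illustrating entity-based ("things not strings") lookup.
public class TinyKnowledgeGraph {

    private record Triple(String subject, String predicate, String object) {}

    private final List<Triple> triples = new ArrayList<>();

    public void add(String s, String p, String o) {
        triples.add(new Triple(s, p, o));
    }

    // Answer "what is the <predicate> of <subject>?" directly from stored facts.
    public List<String> lookup(String subject, String predicate) {
        List<String> answers = new ArrayList<>();
        for (Triple t : triples) {
            if (t.subject().equalsIgnoreCase(subject) && t.predicate().equalsIgnoreCase(predicate)) {
                answers.add(t.object());
            }
        }
        return answers;
    }

    public static void main(String[] args) {
        TinyKnowledgeGraph kg = new TinyKnowledgeGraph();
        kg.add("Eiffel Tower", "locatedIn", "Paris");
        kg.add("Eiffel Tower", "height", "300 m");
        System.out.println(kg.lookup("Eiffel Tower", "locatedIn"));   // [Paris]
    }
}
```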
I. RankBrain in Google
RankBrain helps in processing and refining ambiguous search queries and connecting them to specific topics using pattern recognition. It is a machine-learning system that optimizes the results for specific query sets while executing hundreds of millions of search queries per day, and it refines the query results of Google's Knowledge Graph based entity search. It uses artificial intelligence to embed massive amounts of written language into mathematical entities known as vectors, which are easy for a computer to work with. If a word or phrase unfamiliar to RankBrain is seen, the machine can guess what words or phrases might have a similar meaning and filter the results accordingly, making it more effective at handling queries that have not been seen earlier.

Having discussed the various innovations in the web search process, let us summarize them and list the challenges they are intended to overcome. Table 2 shows this summary.

Table 2. Search Engine Challenges and Innovations

1. Challenge: multiple standards & schemas. Innovation: a standard schema accepted by the major search engines in the form of Schema.org.
2. Challenge: search based upon the user's actual intent. Innovation: semantic search engines.
3. Challenge: taking the real-world relationships into account. Innovation: entity-based search.
4. Challenge: relieving the user from keyword-based thinking. Innovation: option-based search.
5. Challenge: keeping the user's query in memory and searching over a period. Innovation: social & continuous search.
6. Challenge: exploring the hidden web. Innovation: authorized collaboration of databases and their access through a deep web search engine.
7. Challenge: maintaining opacity between the information provider and the information seeker. Innovation: onion search.

V. SPEED BREAKERS IN JOURNEY (INHERENT SEARCH ISSUES)
Due to its inherently huge size, diversity of users, diversity of search requirements and heterogeneity of contents, the following issues will continue to persist, and search engines will have to make compromises between various choices:

a) To simultaneously support a generic overview of topics and enable specialist groups to drill down to their exclusively relevant items.
b) To effectively deal with the invisible or deep web.
c) To offer demand anticipation, customization and personalization.
d) To go beyond a list of possibly relevant webpages and focus on providing an exact answer.
e) To effectively deal with web spam, i.e. web pages that exist only to mislead search engines as well as users towards certain web sites.
f) To effectively deal with noisy, low-quality, unreliable and contradictory content continuously being uploaded on the web.
g) To deal with multiple replicas of web pages.
h) To deal with unstructured or vaguely structured content.
VI. CONCLUSION
Search engines offer their users vast and impressive amounts of information, accessible with a speed and convenience few people could have imagined one or two decades ago. Yet the challenges are not over: every advancement in search methodology and technology leads to newly envisaged challenges, paving the way for further innovations, and the cycle continues. The paper has discussed the innovations carried out in the past with the hope that it will encourage researchers towards further innovation.

REFERENCES
[1] http://www.internetlivestats.com/total-number-of-websites/
[2] http://www.infoplease.com/ipa/A0921862.html
[3] http://www.irkawebpromotions.com/webdirectories/looksmart/
[4] http://www.yuanlei.com/studies/articles/is567searchengine/page2.htm
[5] https://forums.digitalpoint.com/threads/hybrid-searchengines.2612207/
[6] http://websearch.about.com/od/metasearchengines/a/mamma.htm
[7] https://en.wikipedia.org/wiki/Archie_search_engine
[8] https://en.wikipedia.org/wiki/Veronica_&Jughead_(search_engine)
[9] http://scg.unibe.ch/archive/software/w3catalog/W3CatalogHistory.html
[10] https://en.wikipedia.org/wiki/W3Catalog
[11] https://en.wikipedia.org/wiki/JumpStation
[12] https://en.wikipedia.org/wiki/World_Wide_Web_Wanderer
[13] http://thesearchenginearchive.wikia.com/wiki/Aliweb
[14] http://www.sciencedaily.com/terms/web_crawler.htm
[15] https://en.wikipedia.org/wiki/MetaCrawler
[16] http://malwaretips.com/blogs/remove-mywebsearch/
[17] http://www.livinginternet.com/w/wu_sites_lycos.htm
[18] http://searchenginewatch.com/sew/news/2047873/inktomi-debuts-self-serve-paid-inclusion
[19] https://en.wikipedia.org/wiki/Infoseek
[20] https://en.wikipedia.org/wiki/Excite
[21] http://searchenginewatch.com/sew/study/2067828/altavistas-search-by-language-feature
[22] http://www.searchengineshowdown.com/features/yahoo/review.html
[23] https://en.wikipedia.org/wiki/Yahoo!_Search
[24] https://en.wikipedia.org/wiki/AOL
[25] http://www.msn.com/en-in/
[26] https://en.wikipedia.org/wiki/Dogpile
[27] http://investor.blucora.com/releasedetail.cfm?ReleaseID=166325
[28] http://chj.tbe.taleo.net/chj04/ats/careers/requisition.jsp?org=INFOSPACE&cws=1&rid=181
[29] https://en.wikipedia.org/wiki/HotBot
[30] http://www.searchengineshowdown.com/features/hotbot/review.html
[31] https://en.wikipedia.org/wiki/Wow!_(online_service)
[32] https://en.wikipedia.org/wiki/Ask.com
[33] http://www.searchengineshowdown.com/features/ask/review.html
[34] https://en.wikipedia.org/wiki/Daum_(web_portal)
[35] http://www.search-marketing.info/search-engines/price-per-click/overture.htm
[36] https://en.wikipedia.org/wiki/Yandex_Search
[37] https://en.wikipedia.org/wiki/Google_Search#calculator
[38] http://www.telegraph.co.uk/technology/google/10346736/Google-search-15-hidden-features.html
[39] https://en.wikipedia.org/wiki/AlltheWeb
[40] http://www.seochat.com/c/a/marketing/webdirectories/teoma-the-superior-search-engine/
[41] https://en.wikipedia.org/wiki/Baidu
[42] https://en.wikipedia.org/wiki/Live_search
[43] https://en.wikipedia.org/wiki/DuckDuckGo
[44] D. Horowitz, S. D. Kamvar, "The Anatomy of a Large-Scale Social Search Engine", International World Wide Web Conference Committee (IW3C2), April 26-30, 2010, Raleigh, North Carolina, USA, ACM 978-1-60558-799-8/10/04.
[45] http://www.windowscentral.com/top-bing-features
[46] http://www.telegraph.co.uk/technology/google/6009176/Google-reveals-caffeine-a-new-faster-search-engine.html
[47] http://searchengineland.com/google-instant-complete-users-guide-50136
[47] https://en.wikipedia.org/wiki/Blekko [48] https://www.aihitdata.com/company/00D2051A/CONTE NKO/history [49] https://en.wikipedia.org/wiki/Althea [50] http://www.searchenginejournal.com/seo-guide/googlepenguin-panda-hummingbird/ [51] TuukkaRuotsalo,KumaripabaAthukorala, DorotaGłowacka, KseniaKonyushkova, AnttiOulasvirta, SamuliKaipiainen, Samuel Kaski, GiulioJacucci, ―Supporting Exploratory Search Tasks with Interactive User M odeling‖ ,Helsinki Institute for Information Technology HIIT, University of Helsinki, ASIST 2013, November 1-6, 2013 [52] https://schema.org/docs/faq.html [53] https://www.ietf.org/ [54] https://www.inbenta.com/en/blog/entry/keyword- basedversus-natural-language-search [55] R.Priyadarshini, LathaTamilselvan, ―Document clustering based on keyword frequency and concept matching technique in Hadoop‖, International Journal of Scientific & Engineering Research, Volume 5, Issue 5, M ay -2014 1367 ISSN 2229-5518 [56] DanushkaBollegala, Yutaka M atsuo, and M itsuru Ishizuka, ―A Web Search Engine-Based Approach to M easure Semantic Similarity between Words‖ IEEE Transactions on Knowledge and Data Engineering, vol. 23, NO. 7, July 2011 [57] D. M clean, Y. Li, and Z.A. Bandar, ―An Approach for M easuring Semantic Similarity between Words Using M ultiple Information Sources,‖ IEEE Trans. Knowledge and Data Eng., vol. 15, no. 4, pp. 871-882, July/Aug. 2003. [58] Elias Iosif, Alexandros Potamianos, ―Unsupervised Semantic Similarity Computation between TermsUsing Web Documents‖, IEEE Transactions on knowledge and data engineering, vol. 22, no. 11, november 2010 [59] Y. Li, D. M cLean, Z. A. Bandar, J. D. O'Shea, and K. Crockett, ``Sentence similarity based on semantic nets and corpus statistics,'' IEEE Trans.Knowl. Data Eng., vol. 18, no. 8, pp. 1138_1150, Aug. 2006. [60] Tao Cheng, Hady W. Lauw, and SteliosPaparizos, ―Entity Synonyms for Structured Web Search‖, IEEE Transactions on Knowledge and data engineering, vol. 24, no. 10, October 2012 [61] Tim Converse, Ronald M . Kaplan, Barney Pell, Scott Prevost, Lorenzo Thione, Chad Walters, ―Powerset‘s Natural Language Wikipedia Search Engine‖,Powerset, Inc. 475 Brannan Street San Francisco, California 94107 [62] https://www.crunchbase.com/organization/hakia and Website: http://www.hakia.com [63] http://arstechnica.com/information technology/2012/06/inside-the-architecture-of- googlesknowledge-graph-and-microsofts-satori/ [64] http://www.cnet.com/news/microsofts-bing-seeksenlightenment-with-satori/ [65] https://en.wikipedia.org/wiki/Knowledge_Graph [66] AdityaParameswaran, AnandRajaraman, Hector GarciaM olina, ―Towards The Web Of Concepts: Extracting Concepts fromLarge Datasets‖, publisher, ACM , VLDB ‗10, September 13-17, 2010, Singapore(http://ilpubs.stanford.edu:8090/917/1/concept M ining-Techrep.pdf) [67] http://www.freebase.com [68] http://www.technologyreview.com/news/410961/makingsearch-social/, http://www.yotify. com / [69] https://en.wikipedia.org/wiki/Deep_web(search [70] https://en.wikipedia.org/wiki/Tor(anonymity_network)
[71] https://en.wikipedia.org/wiki/I2P [72] http://www.lib.vt.edu/find/databases/I/infomine- searchengine.html [73] https://en.wikipedia.org/wiki/Intute [74] http://websearch.about.com/od/invisibleweb/a/completepl anet.htm [75] http://www.infoplease.com/ [76] http://content.lib.utah.edu/cdm/ref/collection/uspace/id/54 77 [77] http://www.seochat.com/c/a/search-engine-optimizationhelp/search-engines-for-the-invisible-web/ [78] http://searchenginewatch.com/sew/news/2065996/scirusa-new-science-search-engine [79] http://library.poly.edu/news/2007/10/09/techxtra-searchengine-for-engineering-mathematics-and-computing [80] https://www.deepdotweb.com/how-to-access-onion-sites/ [81] http://thehackernews.com/2015/02/Onion-city-darknetseach-engine.html [82] www.alexa.com/siteinfo/ [83] http://www. ebizmba.com/articles/search-engines [84] Deital P J and Deital H M , ― Internet & World Wide Web, How to Program‖, Pearson International Edition, 4 th edition, 2013 [85] C Jouis, I Biskri, J G Ganascia, M Roux, ― Next Generation Search Engines: Advanced M odels for Information Retreival‖, Information Science Reference,2012 [86] J.Bernard, S. Amanda,‖ How are we searching the world wide web?: a comparison of nine search engine transaction logs‖ Information Processing and M anagement: an International Journal(Elsevier), Volume 42 Issue 1, January 2006, Pages 248-263 [87] R Aravindhan, R. Shanmugalakshmi "Comparative analysis of Web 3.0 search engines: A survey report", International Conference on Advanced Computing and Communication Systems (ICACCS), IEEE Conference Publications, 2013, Page(s): 1 – 6 [88] Leslie S. Hiraoka ,‖ Evolution of the Search Engine in Developed and Emerging M arkets‖, International Journal of Information Systems and Social Change(DBLP), Vol. 5 Issue 1, January 2014, pp.30-46 [89] Capra, R.G.P. Quinones, ―Using Web search engines to find and refind information‖ IEEE Journals & M agazines 2005, Volume: 38, Issue: 10 DOI: 10.1109/M C.2005.355, Page(s): 36 - 42 [90] YipingKe, Lin Deng, Wilfred N g, Dik-Lun Lee, ― Web dynamics and their ramifications for the development of web search engines‖, The International Journal of Computer and Telecommunications Networking-Web dynamics, Elsevier North-Holland, Inc. New York, NY, USA, Volume 50 Issue 10, 14 July 2006, Pages 1430 1447 [91] P. M etaxas, ―Web Spam, Social Propaganda and the Evolution of Search Engine Rankings‖, SOFSEM 2007:Theory and Practice of Computer Science, Lecture Notes in Computer Science Volume 4362, 2007, pp 1-8 [92] Q. Yang, H. Wang, J. Wen, G. Zhang, Y. Lu, K. Lee, H. Zhang, ―Towards a Next-Generation Search Engine‖, The Connected Home: The Future of Domestic Life(Science Direct) 2011, pp 79-91 [93] M onica Peshave, KamyarDezhgosha, ―How Search Engines Work and a Web Crawler Application‖ [94] D.Horowitz, S.D. Kamvar, ―The Anatomy of a LargeScale Social Search Engine‖, International World Wide Web Conference Committee (IW3C2), 2010, April 26–30, 2010, Raleigh, North Carolina, USA, ACM 978-1-60558799-8/10/04.
[95] Stefano Ceri, Alessandro Bozzon, M arco Brambilla, Emanuele Della Valle, PieroFraternali, Silvia Quarteroni, ―Search Engines‖, Advanced Topics in Information Retrieval, The Information Retrieval Series Volume 33, 2011, pp 27-50 [96] Ricardo BaezaYates, Alvaro Pereira Jr, NivioZiviani, ―The Evolution of Web Content and Search Engines‖, WEBKDD'06, August 20, 2006, Philadelphia, Pennsylvania, USA. Copyright 2006 ACM 1-59593-4448... $5.00 [97] Gray, M atthew. "Internet Growth and Statistics: Credit and Background". mRetrieved February 3, 2014. [98] A. Ntoulas, J. Cho, C. Olston, "What's New on the Web ? The Evolution of the Web from a Search Engine Perspective", In Proceedings of the World-Wide Web Conference (WWW), M ay 2004. [99] ArvindArasu, Junghoo Cho, Hector Garcia- M olina, Andreas Paepcke, SriramRaghavan, "Searching the Web", ACM Transactions on Internet Technology, 1(1): August 2001. [100] Dirk Lewandowski, ―Web searching, search engines and Information Retrieval, Information Services & Use‖, 25 (2005) 137-147, IOS Press, 2005. [101] Tom Seymour, Dean Frantsvog, Satheesh Kumar, ―History Of Search Engines‖,International Journal of M anagement & Information Systems – Fourth Quarter 2011 Volume 15, Number 4 [102] TuukkaRuotsalo, KumaripabaAthukorala, DorotaGłowacka, KseniaKonyushkova, AnttiOulasvirta, SamuliKaipiainen, Samuel Kaski, GiulioJacucci, ―Supporting Exploratory Search Tasks with Interactive User M odeling‖ ,Helsinki Institute for Information Technology HIIT, University of Helsinki, ASIST 2013, November 1-6, 2013 [103] M archionini, G, ―Exploratory search: from finding to understanding‖, Comm. ACM 49, (2006), 41-46. [104] Gromov, G. R.,‖History of Internet and WWW: the roads and crossroads of Internet history‖. from http://www.netvalley.com/intvalstat.html, Retrieved December 5, 2004 [105] Holzschlag, M . E.,‖ How specialization limited the Web‖, Retrieved December 4, 2004, from http://www.webtechniques.com/archives/2001/09/desi/ [106] Jansen, B. J., Spink, A. & Pedersen, J., ‖An analysis of multimedia searching on AltaVista‖, Proceedings of the 5th ACM SIGMM international workshop on M ultimedia information retrieval, (2003) 186-192. [107] Kherfi, M . L., Ziou, D. &Bernardi, A., ―Image retrieval from the World Wide Web‖ issues, techniques and systems. ACM Computer Surveys, (2004),36(14), 35-67. [108] Wall, A., ―History of search engines & web history‖, Retrieved December 3, 2004, from http://www.searchmarketing.info/search-engine-history/ [109] Jansen, B. J., Spink, A. & Pedersen, J., ―An analysis of multimedia searching on AltaVista‖, Proceedings of the 5th ACM SIGMM international workshop on M ultimedia information retrieval, (2003), 186-192. [110] ArvindArasu, Junghoo Cho, Hector Garcia-M olina, Andreas Paepcke, SriramRaghavan, ―Searching the Web‖, (Stanford University). ACM Transactions on Internet Technology (TOIT), Volume 1, Issue 1 (August 2001). [111] Elgesem, D, ―Search Engines and the Public Use of Reason.‖ Ethics and Information Technology, 10(4), 2008 [112] Nagenborg, M . (ed.), 2005. The Ethics of Search Engines.Special Issue of International Review of Information Ethics.Vol. 3. [113] ―Search Engines, Personal Information, and the Problem
Authors’ Profiles

Mamta Kathuria received her MCA degree with Honors from Kurukshetra University, Kurukshetra in 2005 and M.Tech in Computer Engineering from Maharshi Dayanand University, Rohtak in 2006 and 2008, respectively. She is pursuing her Ph.D in Computer Engineering from YMCA University of Science and Technology, Faridabad. She is currently working as an Assistant Professor in YMCA University of Science & Technology and has eight years of experience. Her areas of interest are search engines, Web Mining and Fuzzy Logic.
Dr. Chander K. Nagpal holds a Ph.D (Computer Science) from Jamia Millia Islamia, New Delhi. He is currently working as a Professor in YMCA University of Science & Technology and has twenty-eight years of teaching experience. He has published two books. He has published many research papers in reputed international journals such as IEEE Transactions on Software Reliability, Wiley STVR, and CSI. His academic interests include ad hoc networks, Web Mining and Soft Computing.
Dr. Neelam Duhan received her B.Tech. in Computer Science and Engineering with Honors from Kurukshetra University, Kurukshetra and M.Tech with Honors in Computer Engineering from Maharshi Dayanand University, Rohtak in 2002 and 2005, respectively. She completed her PhD in Computer Engineering in 2011 from Maharshi Dayanand University, Rohtak. She is currently working as an Assistant Professor in the Computer Engineering Department in YMCA University of Science and Technology, Faridabad and has an experience of about 12 years. She has published over 30 research papers in reputed international journals and international conferences. Her areas of interest are databases, search engines and web mining.
How to cite this paper: Mamta Kathuria, C. K. Nagpal, Neelam Duhan, "Journey of Web Search Engines: Milestones, Challenges & Innovations", International Journal of Information Technology and Computer Science (IJITCS), Vol.8, No.12, pp.47-58, 2016. DOI: 10.5815/ijitcs.2016.12.06
I.J. Information Technology and Computer Science, 2016, 12, 59-66 Published Online December 2016 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2016.12.07
SQL Versus NoSQL Movement with Big Data Analytics Sitalakshmi Venkatraman School of Engineering, Construction and Design (IT), Melbourne Polytechnic, VIC 3072, Australia E-mail: SitaVenkat@melbournepolytechnic.edu.au
Kiran Fahd, Samuel Kaspi School of Engineering, Construction and Design (IT), Melbourne Polytechnic, VIC 3072, Australia E-mail: Kiran.Fahd@hotmail.com, SamKaspi@melbournepolytechnic.edu.au
Ramanathan Venkatraman National University of Singapore, Singapore E-mail: rvenkat@nus.edu.sg
Abstract—Two main revolutions in data management have occurred recently, namely Big Data analytics and NoSQL databases. Even though they have evolved with different purposes, their independent developments complement each other and their convergence would benefit businesses tremendously in making real-time decisions using volumes of complex data sets that could be both structured and unstructured. While on one hand many software solutions have emerged in supporting Big Data analytics, on the other, many NoSQL database packages have arrived in the market. However, they lack an independent benchmarking and comparative evaluation. The aim of this paper is to provide an understanding of their contexts and an in-depth study to compare the features of the four main NoSQL data models that have evolved. The performance comparison of traditional SQL with NoSQL databases for Big Data analytics shows that a NoSQL database poses to be a better option for business situations that require simplicity, adaptability, high performance analytics and distributed scalability of large data. This paper concludes that the NoSQL movement should be leveraged for Big Data analytics and would coexist with relational (SQL) databases.

Index Terms—Structured Query Language (SQL), Non SQL (NoSQL), Big Data, Big Data Analytics, Relational Database, SQL Database, NoSQL Database.
I. INTRODUCTION

As the technology environment transforms and faces new challenges, businesses increasingly realize the need to evaluate new approaches and databases to manage their data, to support changing business requirements and the growing complexity and expansion of their applications [1]. Relational databases have been the default choice for data model adoption in businesses worldwide over the past thirty years, with Structured Query Language (SQL)
as the standard language designed to perform the basic data operations. However, with the explosion of data volume, SQL-based data querying loses efficiency and, in particular, managing larger databases has become a major challenge [2]. In addition, relational databases exhibit a variety of limitations in meeting the recent Big Data analytics requirements in businesses. While cluster-based architecture has emerged as a solution for large databases, SQL is not designed to suit clusters, and this mismatch has led to thinking of alternate solutions. There are mismatches between the persistent data model and in-memory data structures, and servers based on SQL standards are now prone to memory footprint, security risks and performance issues. NoSQL (Non SQL) databases with a set of new data management features, on the other hand, are more flexible and horizontally scalable. They are considered as alternatives to overcome the limitations of the current SQL-dominated persistence landscape and hence they are also known as non-relational databases [3]. The main goal of the NoSQL movement is to allow easy storage and retrieval of data, regardless of its structure and content, which is possible due to the non-existence of a rigid data structure in non-relational databases. NoSQL databases exhibit horizontal scalability by taking advantage of new clusters and several low-cost servers. In addition, they are envisaged to automatically manage data administration, including fault recovery, and these capabilities would result in huge cost savings. Though non-relational databases provide different features and advantages, they were initially characterised by a lack of data consistency and the inability to query stored records using SQL. With the emergence of NoSQL databases, new features and optimisation characteristics are evolving to overcome these limitations as well. However, their total capabilities are still not disclosed [4]. Also, due to the increasing differences in NoSQL database offerings and their non-standard features, businesses are not clear on what stand to take. In this paper, we first provide an overview of the
present context of Big Data analytics and NoSQL databases. Next, we discuss the four main data models of non-relational databases and compare them with SQL databases. There is a variety of NoSQL databases, and which one is more appropriate for which business operation remains an unanswered question so far. We compare the different data models of NoSQL in terms of their features and the NoSQL databases available in the market that support those features. The different data manipulation mechanisms and optimisation techniques adopted by NoSQL databases could result in their difference in performance. We discuss how these factors play a major role in Big Data analytics and identify the associated challenges. We also consider the coexistence of NoSQL databases with relational databases and discuss their relevance in different business contexts.
II. RELATED WORK: THE CONTEXT OF NOSQL DATABASES WITH BIG DATA ANALYTICS

From the recent trends reported in the literature [5][6], it is evident that in today's context there is an exponential growth of data volume that is structured as well as unstructured (Big Data), from a variety of data sources such as social media, e-mails, text documents, GPS data, sensor data, surveillance data, etc., with increasing Internet usage. Hence, we can say that Big Data is characterised by structured, semi-structured, and unstructured data collected from digital and non-digital resources. The main challenge is the effective use of this Big Data that represents the data source for efficient decision-making by adopting suitable data mining techniques [7][8]. Based on our literature survey, we have identified that the current challenges presented by Big Data are due to the following general characteristics experienced by businesses:
High data Velocity – rapidly and continuously updated data streams from different sources and locations.
Data Variety – structured, semi-structured and unstructured data storage.
Data Volume – huge number of datasets with sizes of several terabytes or petabytes.
Data Complexity – data organized in several different locations or data centres.
It is important for businesses to perform Big Data analytics, which is the process of examining large data sets containing a variety of data types. Using Big Data analytics, businesses are able to arrive at a more accurate analysis of huge amounts of data to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information [2][9]. In order to support timely and effective decision making, Big Data analytics relies on large volumes of data that require clusters for data storage. However, since relational databases are not designed for clusters, and exhibit performance issues with regard to Big Data
analytics, businesses are considering the need for the NoSQL movement [10]. The schema of NoSQL is not fixed. It uses varied interfaces to store and analyse the sheer volume of user-generated content, personal data and spatial data being generated by modern applications, cloud computing and smart devices [1][11]. In this context, a NoSQL database presents a preferred solution over a SQL database, primarily for its ability to cater to the horizontal partitioning of data, flexible data processing and improved performance. Large Internet companies (Facebook, LinkedIn, Amazon and Google), which cannot process services by using existing relational databases, had researched and led to the advent of NoSQL to solve their problem of dealing with continuously increasing data, optimised data utilization and horizontal scalability of large data. NoSQL databases are a better option for information systems that require high performance and dynamic scalability more than the requirements of reliability, the highly distributed nature of three-tier Internet architecture systems and cloud computing [1][3][11]. Therefore, it is necessary to investigate further and compare SQL versus NoSQL as well as the salient differences in the performance of NoSQL data models in supporting the necessary features for Big Data analytics. This paper presents these investigations and findings in today's Big Data context.
III. NOSQL DATA MODELS

There are many NoSQL databases available; however, they fall under the four data models described below [3][11][12]. Each category has its own specific attributes, but there are crossovers between the different data models. Generally, all NoSQL databases are built to be distributed and scaled horizontally.

Key-Value Store Database – A Key-Value store is a simple but efficient and powerful NoSQL database. The data is stored in two parts, a string that represents the key and the actual data that represents the value, thus creating a "key-value" pair. This results in values being indexed by keys for retrieval, a concept similar to hash tables. In other words, the store allows the user to request the values according to the key specified. It can handle structured or unstructured data. It offers high concurrency and scalability as well as rapid lookups, but little consistency. Such Key-Value store databases can be used to develop forums, online shopping carts and websites where user sessions are required to be stored. Some notable examples are Amazon's DynamoDB, Apache's Cassandra, Azure Table Storage (ATS), Oracle Berkeley DB, and Basho Technologies' Riak. Amazon offers the fully managed NoSQL store service DynamoDB for the purpose of Internet-scale applications. It is a distributed key-value storage system which provides fast, reliable and cost-effective data access as well as high availability and durability due to its replica feature. One of the advantages of a Key-Value store database is its high insert/read rate compared to a traditional SQL
database. This is achieved by saving more than one entry to the store, as shown in the example below:

@db.bulk_save([
  {"hot"   => "and spicy"},
  {"cold"  => "yet loving"},
  {"other" => ["set", "of", "keys"]}
])

Column Oriented (or wide-column) Store Databases – In column store databases, columns are defined for each row instead of being predefined by the table structure having uniform sized columns for each row. Such stores have a two-level aggregate structure, a key and a row aggregate, which is a group of columns. Any column can be added to any row, and rows can have very different columns. In other words, each row has a different number of columns that are stored. It can also store data tables as sections of columns of data. Data can be viewed as either row-oriented, where each row is an aggregate, or column-oriented, where each column family defines a record type. Each key is associated with one or more columns, and a key for each column family is used for rapid data retrieval with less I/O activity, thereby offering very high performance. These databases provide high scalability as they store data in highly distributed architectures. Wide-column databases are ideal for data mining and analytic applications with Big Data. Examples of some column-oriented store providers are Facebook's high-performance Cassandra, Apache HBase, Google's Big Table and HyperTable. Google's Big Table is a high performance wide-column database that can deal with vast amounts of data. It is developed on the Google File System (GFS) using C/C++. It is used by multiple Google applications like YouTube and Gmail that have varied latency demands of the database. It is not distributed outside Google besides the usage inside Google's App Engine. Big Table is designed for easy scalability across thousands of machines; thus, it is tolerant to hardware failures.

Document Store Databases – A document database extends the basic key-value database concept and stores complex data in document form such as XML, PDF or JSON documents. A document store is typically schema-less, where each document can contain different fields of any length. Documents are accessed or identified by using a unique key which may be a simple string, URI string or path string. Document databases are more complex databases but offer high performance, horizontal scalability and schema flexibility, which allow storing virtually any structure required by any application. Document oriented databases are suitable for content management systems and blog applications. Some examples of providers using document oriented databases are 10gen's MongoDB, Apache CouchDB, Basho Technologies' Riak, Azure's DocumentDB and AWS DynamoDB. MongoDB is developed by 10gen using C++ and is a structure-free, cross-platform document oriented database. It uses the Grid File System to store large
files such as images and videos in BSON (Binary JSON) format. It provides efficient performance, high consistency and high persistence, but it is not very reliable and is resource hungry.

Graph Store – A graph database focuses on relationships between data. It uses the graph theory approach to store the data and optimises the search by using the index-free adjacency technique. It is designed for data whose relationships are well represented by graph structures consisting of nodes, edges and properties. A node represents an object (an entity in the database), an edge describes the relationship between the objects, and the property is the node on the other end of the relationship. In the index-free adjacency technique, each node consists of a pointer which directly points to the adjacent node, as shown in Fig. 1. These stores provide fast performance, ACID compliance and rollback support. These databases are suitable for developing social-networking applications, bioinformatics applications, content management systems and cloud management services. Examples of notable graph databases are Neo Technology's Neo4j, OrientDB, Apache Giraph and Titan. Apache Giraph is an open source large-scale graph processing system and an implementation of Google Pregel (a graph processing architecture which has a vertex-centric approach). It is designed for high scalability to overcome the crucial need for scalable platforms and parallel architectures that can process the bulk data produced by modern applications such as social networks and knowledge bases. For example, it is currently used at Facebook, LinkedIn and Twitter to analyse the graph formed by users and their connections. Giraph is a distributed and fault-tolerant system and offers features such as master computation, sharded aggregators, edge-oriented input and out-of-core computation.
Fig.1. Graph algorithm.
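As a toy illustration of the index-free adjacency idea described above, the following Python sketch stores each node's neighbours as direct object references, so a traversal follows pointers instead of consulting a global index. The class and property names are invented for this example and do not correspond to any particular graph database's storage format.

class Node:
    """A graph node keeping direct references to its neighbours,
    so traversal never consults a global index (index-free adjacency)."""
    def __init__(self, label, **properties):
        self.label = label
        self.properties = properties
        self.edges = []          # outgoing edges, each holding the target node directly

    def relate(self, rel_type, other):
        self.edges.append((rel_type, other))   # the edge stores the adjacent node itself

    def neighbours(self, rel_type=None):
        return [n for t, n in self.edges if rel_type is None or t == rel_type]

# toy usage: a two-node social graph
alice, bob = Node("Person", name="Alice"), Node("Person", name="Bob")
alice.relate("FRIEND_OF", bob)
print([n.properties["name"] for n in alice.neighbours("FRIEND_OF")])  # ['Bob']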
IV. HIGH LEVEL COMPARISON BETWEEN NOSQL AND SQL DATABASES

Based on the features of each type of database recently reported in the literature [1][3][11][13], we performed a high level comparison between SQL (relational) and NoSQL (non-relational) databases, and the summary of findings is given in Table 1. We considered aspects such as database type, schema, data model used, scaling model available, transactional capabilities, data manipulation method used, and popular
database software available in the market, in order to compare SQL databases versus NoSQL databases. Some examples are also given in Table 1 for a better understanding of their differences; the data-manipulation contrast is also illustrated in the short sketch after the table. Overall, Table 1 provides the high level differences in the key features and properties exhibited by relational and non-relational databases, which would support businesses in making decisions about using SQL or NoSQL database options in various Big Data application scenarios.

Table 1. Relational Versus NoSQL Databases - High Level Differences

Database Type
  Relational: One SQL DBMS product (marginal variations).
  NoSQL: Four general types: key-value, document, wide-column and graph stores.

Schema
  Relational: Based on pre-defined foreign-key relationships between tables in an explicit database schema. Strict definition of schema and data types is required before inserting data; any update alters the entire database.
  NoSQL: Dynamic database schema. Does not force schema definition in advance; different data can be stored together as required, and the schema can be modified freely with no downtime.

Data Models
  Relational: Data records are stored as rows and columns in different tables joined via relationships, with explicitly defined data types for the columns that store a specific piece of data. For example, the SQL engine joins two separate tables "employees" and "departments" together to find out the department of an employee.
  NoSQL: Supports all types of data – structured, semi-structured and unstructured. Different products offer different and flexible data models; for example, the document store type organizes all the related data using references and embedded documents.

Scaling Model
  Relational: Vertical scaling – data resides on a single node and capacity is added to existing resources (data storage or I/O capacity).
  NoSQL: Horizontal scaling – a modern approach of partitioning the data across additional servers or cloud instances as required.

Transactional Capabilities
  Relational: Based on ACID transactional properties, such as atomicity, consistency, isolation and durability, to ensure high data reliability and data integrity; atomic transactions degrade the performance.
  NoSQL: Supports AID transactions; the CAP theorem of distributed systems governs consistency of data across all nodes of a NoSQL database, and there is atomicity at the level of a single document.

Data Manipulation
  Relational: Structured Query Language – SQL DML statements are used to manipulate data, e.g. SELECT customer_name FROM customers WHERE customer_age>18;
  NoSQL: Object-oriented APIs are used to query data efficiently, e.g. db.customers.find( {customer_age: {$gt: 18}}, {customer_name: 1} )

Software
  Relational: Oracle, MySQL, DB2, SQLServer.
  NoSQL: MongoDB, Riak, Couchbase, RethinkDB, Redis, Aerospike, LevelDB, HBase, Cassandra, Neo4j, Elasticsearch, Lucene.
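The data-manipulation row of Table 1 can be made concrete with a small, hedged sketch. The SQL half below runs against Python's built-in sqlite3 module; the document-store half is shown only as a commented-out pymongo call, since it assumes a MongoDB server is available. Collection and field names mirror the examples in the table.

import sqlite3

# Relational side: the schema must exist before data can be stored or queried.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_name TEXT, customer_age INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Ann", 25), ("Raj", 17), ("Mei", 34)])
adults = conn.execute(
    "SELECT customer_name FROM customers WHERE customer_age > 18").fetchall()
print([name for (name,) in adults])            # ['Ann', 'Mei']

# Document side (requires pymongo and a running MongoDB server; shown for contrast only):
# from pymongo import MongoClient
# db = MongoClient().shop
# db.customers.insert_many([{"customer_name": "Ann", "customer_age": 25},
#                           {"customer_name": "Raj", "customer_age": 17}])
# names = db.customers.find({"customer_age": {"$gt": 18}}, {"customer_name": 1})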
V. PERFORMANCE OF NOSQL AND SQL DATABASES FOR BIG DATA ANALYTICS

The most important reason for moving towards NoSQL from relational databases is the requirement of performance improvements. Choi et al. [1] found that a NoSQL database such as MongoDB provided more stable and faster performance at the expense of data consistency. The tests were done on an internal blog system based on an open source project. It was found that MongoDB stored posts 850% faster than a SQL database. It has been suggested that NoSQL should be used in environments which are concerned with data availability rather than consistency. Fotache & Cogean [14] describe the use of MongoDB in mobile applications. Certain multiple update operations like Upsert are easier and faster to perform with NoSQL than with a SQL database. The use of cloud computing along with NoSQL is said to increase the performance, especially in the data layer for mobile platforms. Ullah [15] compared the performance of both a relational database management system (RDBMS) and a NoSQL database, where a Resource Description Framework (RDF) based Triple store was used as the NoSQL database. It was noted that the NoSQL database was slower than the relational database due to the massive amount of memory usage by the NoSQL database. Reading a large amount of data takes a toll on the database, and because of the unstructured format of the NoSQL database the storage of a thousand records requires a huge amount of storage, whereas the RDBMS uses a lesser amount of storage. For example, searching for "red berry" in the database took 5255 ms in the NoSQL database while it only took 165.43 ms to search for it in the RDBMS. Floratou et al. [4] performed the Yahoo Cloud Serving Benchmark (YCSB) test on an RDBMS and MongoDB. They tested a SQL client-sharded database against MongoDB auto- and client-sharded databases. The tests found that the SQL client-sharded database was able to attain higher throughput and lower latency in most of the benchmarks. The reason for the higher performance of SQL is attributed to the fact that the majority of the read requests are made to pages in the buffer pool, whereas NoSQL databases tend to read shards located at different nodes. The study has tried to prove that RDBMS still has the processing power to handle larger workloads similar to NoSQL. There are many advantages of NoSQL databases over SQL databases, like easy scalability, flexible schema, lower cost and efficient and high performance. Having said that, there are some weaknesses of NoSQL over SQL databases too [12][16]. These are summarised below:
NoSQL is new and immature; therefore, there is a lack of familiarity and limited expertise.
NoSQL databases scale horizontally by giving up either consistency or availability.
There is no standard query and manipulation language across all NoSQL databases.
There is no standard interface for NoSQL databases.
It is difficult to export all data in distributed ones (Cassandra) compared to non-distributed ones (MongoDB).
NoSQL databases are challenging to install and difficult to maintain.

We have identified the following situations when NoSQL should be more suitable than SQL in the context of Big Data analytics (item 2 is illustrated by the short sketch after this list):

1. Simplicity of use – current Big Data technologies are complex, requiring highly skilled technical expertise, while NoSQL offers a simplicity that would improve the productivity of both developers and users. The simple, small, intuitive and easy to learn NoSQL stacks can suit businesses that require Big Data analytics to adopt clean NoSQL-like APIs.
2. Adaptability to change – when business requirements and data models change, warranting flexible Big Data analytics, NoSQL that supports flexible data schemas is ideal to integrate siloed and disparate backend systems.
3. Efficiency for analytics functionality – the foundation data structure of the majority of NoSQL technology is the JavaScript Object Notation (JSON) data format, which caters to both schema-on-read and schema-on-write efficiently for data warehousing functionality. For example, the NoSQL Big Data warehouse SonarW for JSON makes analytics functionality efficient for Big Data applications.
4. Distributed scalability – with the more and more distributed nature of systems and transactions, flexible data becomes the norm and a strict schema approach is unsuitable. With schema evolution, NoSQL provides the necessary scalability for Big Data platforms to perform distributed queries faster.
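As a minimal sketch of the adaptability point (item 2 above), the snippet below contrasts adding a new attribute to a relational table, which needs an explicit ALTER TABLE migration, with a schema-less store, modelled here simply as a list of dictionaries the way a document database would accept them. Table, column and field names are made up for the illustration.

import sqlite3

# Relational store: adding a new attribute means migrating the schema first.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.execute("ALTER TABLE orders ADD COLUMN coupon_code TEXT")   # explicit migration step
conn.execute("INSERT INTO orders VALUES (1, 99.5, 'SPRING16')")

# Schema-less store (records with different fields coexist without any migration).
orders = [
    {"order_id": 1, "amount": 99.5},
    {"order_id": 2, "amount": 12.0, "coupon_code": "SPRING16", "gift_wrap": True},
]
print(sorted({k for o in orders for k in o}))   # union of all fields actually used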
Table 2. Comparison of NoSQL Data Models

Key-Value
  NoSQL Databases: DynamoDB, Cassandra, ATS, Riak, Berkeley DB
  Performance: High | Scalability: High | Flexibility: High | Complexity: None | Functionality: Variable (None)

Wide-Column
  NoSQL Databases: Cassandra, HBase, Big Table, HyperTable
  Performance: High | Scalability: High | Flexibility: Moderate | Complexity: Low | Functionality: Minimal

Document
  NoSQL Databases: MongoDB, CouchDB, Riak, DynamoDB
  Performance: High | Scalability: Variable (High) | Flexibility: High | Complexity: Low | Functionality: Variable (Low)

Graph
  NoSQL Databases: Neo4j, OrientDB, Giraph, Titan
  Performance: Variable | Scalability: Variable | Flexibility: High | Complexity: High | Functionality: Graph Theory
VI. COMPARISON OF NOSQL DATA MODELS

NoSQL databases vary in their performance depending on their data model [17]. We compare the key attributes of the four types of NoSQL data models and summarise them in Table 2. As shown in Table 2, we have considered key attributes such as performance, scalability, flexibility, complexity and functionality for comparing the four data models supported by the popular NoSQL database software that is available in the market. Fig. 2 shows the CAP theorem, which forms a visual guide to the NoSQL databases under each NoSQL data model [16], based on the consistency, availability and partition tolerance features. With NoSQL databases, there are now other options for storing different kinds of data, where typically a distributed set of servers has to fit two of the three requirements of the CAP theorem, which is usually a deciding factor in what technology could be used. Bazar & Losif [3] compared the performance of the MongoDB, Cassandra and Couchbase databases, each possessing different features and functionalities. The tests were conducted using the YCSB tool.
Fig.2. CAP theorem for NoSQL databases.
VII. RESULTS

The benchmark tests found that Couchbase produced the lowest latencies for interactive database applications. Couchbase is able to process more operations per second with a lower average latency in reading and writing data than both MongoDB and Cassandra. Document level
locking in the Couchbase database is the primary reason for faster read and write operations. Cassandra is faster in writing than MongoDB, but both of them have almost equal reading speed. It is also mentioned that each NoSQL database is suitable to specific application environments and cannot be considered a complete solution for every workload and use case. Another case study by Klein et al. [18] looked at the use of the NoSQL databases MongoDB, Riak and Cassandra in a distributed healthcare organisation. These databases use different NoSQL data models, including key-value (Riak), column (Cassandra) and document (MongoDB). Cassandra produced the overall best performance for all types of database operations (reading, writing, and updating). Riak's performance was degraded due to its internal thread pool creating a pool for each client session instead of creating a shared pool for all client sessions. Cassandra had the highest average latencies but also produced the best throughput results. This was firstly due to the indexing features that allowed Cassandra to retrieve the most recently written records efficiently, especially compared to Riak. Secondly, the hash-based sharding allowed Cassandra to distribute the requests for storage with better load balancing than MongoDB. Prasad & Gohil [11] discussed the use of different NoSQL databases for different work environments. It is reported that the performance of NoSQL databases is increased because of the use of a collection of processors in the distributed system. MongoDB and Cassandra are considered the best databases to be used in cases where data is frequently written but rarely read. The NoSQL databases are mentioned to be victims of the Consistency, Availability and Partitioning (CAP) theorem. This means that a trade-off is always made, e.g. the database can either be consistent with low performance or offer high availability and low consistency with fast performance [11][17][19]. Zhikun et al. [20] suggested the use of a new database allocation strategy based on load (DASB) in order to increase the performance of the NoSQL database. However, the DASB only works when it satisfies four conditions and is unable to cater to an unbalanced system load. Prasad et al. [11] compared different attributes such as replication, sharding, consistency and failure handling. We summarise all these findings in Table 3, which provides a list of the best NoSQL databases for each of the features reported in the literature. Several doubts arise on the NoSQL promises, and studies have been conducted to explore the strengths and weaknesses of NoSQL [21][22]. A recent study reviews the trends of storage and computing tools with their relative capabilities, limitations and the environments they are suitable to work with [23]. While high-end platforms like IBM Netezza AMPP could cater to Big Data, due to economic considerations choices such as Hadoop have proliferated world-wide, resulting in the rise of NoSQL database adoption that can integrate easily with Hadoop. Even though HBase supports strong integration with Hadoop using Apache Hive, it could provide a better choice for application development only but not for
real-time queries and OLTP applications, due to very high latency. On the other hand, graph-based platforms such as Neo4j and Giraph form better options for storage and computation due to their capability to model vertex-edge scenarios in businesses that involve data environments such as social networks and geospatial paths. Overall, Big Data has led to the requirement of new generation data analytics tools [24][25], and hence it is realistic to believe that both SQL and NoSQL databases will coexist. With cloud environments that support SQL databases, fast processing of data is warranted to enable efficient elasticity [26] and Big Data analytics that involve current and past data as well as future predictions. New solutions are being proposed for cloud monitoring with the use of a NoSQL database back-end to achieve very quick response times.

Table 3. NoSQL Databases mapped to their features

High availability: Riak, Cassandra, Google Big Table, CouchDB
Partition Tolerance: MongoDB, Cassandra, Google Big Table, CouchDB, Riak, HBase
High Scalability: Google Big Table
Consistency: MongoDB, Google Big Table, Redis, HBase
Auto-Sharding: MongoDB
Write Frequently, Read Less: MongoDB, Redis, Cassandra
Fault Tolerant (No Single Point of Failure): Riak
Concurrency Control (MVCC): Riak, Dynamo, CouchDB, Cassandra, Google Big Table
Concurrency Control (Locks): MongoDB, Redis, Google Big Table
VIII. CONCLUSIONS

The industry has been dominated by relational or SQL databases for several years. However, with business situations recently having the need to store and process large datasets for business analytics, NoSQL databases provide the answer to overcome such challenges. NoSQL offers a schema-less data store and transactions that allow businesses to freely add fields to records without the structured requirement of defining the schema a priori, which is a prime constraint in SQL databases. With the growing need to manage large data and unstructured business transactions via avenues such as social networks, NoSQL graphs are well suited for data that has complex relationship structures, and at the same time simplicity is achieved through key-value stores. NoSQL data models provide options for storing unstructured data as document-oriented, key-value pairs, column-oriented or graphs. These NoSQL storage models are easy to understand and implement and do not require complex
SQL optimization techniques to perform Big Data analytics. This paper has compared SQL versus NoSQL databases as well as the four data models of NoSQL in the context of Big Data analytics for business situations. We conclude that the flexible data modelling of NoSQL is well suited to support dynamic scalability and improved performance for Big Data analytics and could be leveraged as new categories of data architectures coexisting with traditional SQL databases.

REFERENCES
[1] Choi, Y., Jeon, W., & Yo, S. (2014), 'Improving Database System Performance by Applying NoSQL', Journal of Information Processing Systems, 10(3), 355-364.
[2] Moniruzzaman, A. B., & Hossain, S. A. (2013), 'NoSQL database: New era of databases for big data analytics - Classification, characteristics and comparison', International Journal of Database Theory and Application, 6(4), 1-14.
[3] Bazar, C., & Losif, C. (2014), 'The Transition from RDBMS to NoSQL. A Comparative Analysis of Three Popular Non-Relational Solutions: Cassandra, MongoDB and Couchbase', Database Systems Journal, 5(2), 49-59.
[4] Floratou, A., Teletia, N., Dewitt, D., Patel, J., & Zhang, D. (2012), 'Can the Elephants Handle the NoSQL Onslaught?', VLDB Endowment, 5(12), 1712-1723.
[5] Mason, R. T. (2015), 'NoSQL databases and data modeling techniques for a document-oriented NoSQL database', Proceedings of Informing Science & IT Education Conference (InSITE) 2015, 259-268.
[6] Pothuganti, A. (2015), 'Big Data Analytics: Hadoop-MapReduce & NoSQL Databases', International Journal of Computer Science and Information Technologies, 6(1), 522-527.
[7] Smolan, R. & Erwit, J. (2012), The Human Face of Big Data, Against All Odds Production, O'Reilly, USA.
[8] Sharda, R., Delen, D., & Turban, E. (2015), Business Intelligence and Analytics: Systems for Decision Support (10th ed.), Upper Saddle River, NJ: Pearson.
[9] Ohlhorst, F. (2013), Big Data Analytics: Turning Big Data into Big Money, Hoboken, NJ: John Wiley and Sons.
[10] Kaur, P. D., Kaur, A. & Kaur, S. (2015), 'Performance Analysis in Bigdata', International Journal of Information Technology and Computer Science (IJITCS), 7(11), 55-61.
[11] Prasad, A. & Gohil, B. (2014), 'A Comparative Study of NoSQL Databases', International Journal of Advanced Research in Computer Science, 5(5), 170-176.
[12] Nayak, A., Poriya, A. & Poojary, D. (2013), 'Types of NOSQL Databases and its Comparison with Relational Databases', International Journal of Applied Information Systems (IJAIS), 5(4), Foundation of Computer Science FCS, New York, USA.
[13] MongoDB (2014), 'Why NoSQL?', https://www.mongodb.com/nosql-explained, [Online: accessed 20-Feb-2016].
[14] Fotache, M., & Cogean, D. (2013), 'NoSQL and SQL Databases for Mobile Applications. Case Study: MongoDB versus PostgreSQL', Informatica Economica, 17(2), 41-58.
[15] Ullah, Md A. (2015), 'A Digital Library for Plant Information with Performance Comparison between a Relational Database and a NoSQL Database (RDF Triple Store)', Technical Library, Paper 205.
[16] Hurst, N. (2010), 'Visual Guide to NoSQL Systems', http://blog.nahurst.com/visual-guide-to-nosql-systems, [Online: accessed 5-Nov-2015].
[17] Planet Cassandra, 'NoSQL Databases Defined and Explained', http://www.planetcassandra.org/what-is-nosql, [Online: accessed 24-Mar-2016].
[18] Klein, J., Gorton, I., Ernst, N. & Donohoe, P. (2015), 'Performance Evaluation of NoSQL Databases: A Case Study', Proceedings of the 1st Workshop on Performance Analysis of Big Data Systems, PABS'15, Austin, 5-10.
[19] MongoDB (2015), 'Top 5 Considerations When Evaluating NoSQL Databases', https://s3.amazonaws.com/info-mongodb-com/10gen_Top_5_NoSQL_Considerations.pdf, [Online: accessed 5-Nov-2015].
[20] Zhikun, C., Shuqiang, Y., Shuan, T., Hui, Z., Li, Ge, Z., & Huiyu, Z. (2014), 'The Data Allocation Strategy Based on Load in NoSQL Database', Applied Mechanics and Materials, 513-517, 1464-1469.
[21] Leavitt, N. (2010), 'Will NoSQL Databases Live Up to Their Promise?', IEEE Computer, 43(2), 12-14.
[22] Subramanian, S. (2012), 'NoSQL: An Analysis of the Strengths and Weaknesses', https://dzone.com/articles/nosql-analysis-strengths-and, [Online: accessed 15-Jan-2016].
[23] Prasad, B. R. & Agarwal, S. (2016), 'Comparative Study of Big Data Computing and Storage Tools: A Review', International Journal of Database Theory and Application, 9(1), 45-66.
[24] Warden, P. (2012), Big Data Glossary - A Guide to the New Generation of Data Tools, O'Reilly, USA.
[25] Zareian, S., Fokaefs, M., Khazaei, H., Litoiu, M. & Zhang, X. (2016), 'A Big Data Framework for Cloud Monitoring', Proceedings of the 2nd International Workshop on BIG Data Software Engineering (BIGDSE'16), ACM Digital Library, 58-64.
[26] Ramanathan, V. & Venkatraman, S. (2015), 'Cloud Adoption in Enterprises: Security Issues and Strategies', 96-121, Book Chapter in Haider, A. and Pishdad, A. (Eds.), Business Technologies in Contemporary Organizations: Adoption, Assimilation, and Institutionalization, IGI Global Publishers, USA.
Authors’ Profiles Dr. Sitalakshmi Venkatraman obtained
her doctoral degree in Computer Science from the National Institute of Industrial Engineering, India in 1993 and MEd from the University of Sheffield, UK in 2001. Prior to this, she had completed MSc in Mathematics in 1985 and MTech in Computer Science in 1987, both from the Indian Institute of Technology, Madras, India. This author is a Senior Member (SM) of IASCIT. In the past 25 years, Sita's work experience involves both industry and academics - developing turnkey projects for the IT industry and teaching a variety of IT courses for tertiary institutions in India, Singapore, New Zealand, and more recently in Australia since 2007. She currently works as Lecturer (Information Technology) at the School of Engineering, Construction & Design, Melbourne Polytechnic, Australia. She also serves as a Member of the Register of Experts at Australia's Tertiary Education Quality and Standards Agency (TEQSA). Sita has published eight book chapters and more than 100 research papers in internationally well-known refereed journals and conferences that include Information Sciences, Journal of Artificial Intelligence in Engineering, International Journal of
Business Information Systems, and Information Management & Computer Security. She serves as a Program Committee Member of several international conferences, a Senior Member of professional societies and on the editorial board of three international journals.
How to cite this paper: Sitalakshmi Venkatraman, Kiran Fahd, Samuel Kaspi, Ramanathan Venkatraman, "SQL Versus NoSQL Movement with Big Data Analytics", International Journal of Information Technology and Computer Science (IJITCS), Vol.8, No.12, pp.59-66, 2016. DOI: 10.5815/ijitcs.2016.12.07
Kiran Fahd received the B.Eng in software engineering from the National University of Emerging Technologies, Pakistan in 2001, and the Master's degree in Enterprise Planning Systems - ERP from Victoria University, Melbourne in 2010. Since 2001, she has worked in the capacity of software engineer and as a teacher. Kiran has held various lecturing positions in Australian and overseas universities. She currently teaches the subjects of the Bachelor of Information Technology under the Software Development major at the School of Engineering, Construction & Design, Melbourne Polytechnic, Australia.
Dr. Samuel Kaspi earned his PhD (Computer Science) from Victoria University, a Masters of Computer Science from Monash University and a Bachelor of Economics and Politics from Monash University. He is a member of the Australian Computer Society (ACS) and the Association for Computing Machinery (ACM). Sam is currently the Information Technology Discipline Leader and Senior Lecturer of IT at the School of Engineering, Construction & Design, Melbourne Polytechnic, Australia. Previously, Dr Kaspi taught at Victoria University, consulted privately and was the CIO of OzMiz Pty Ltd. Sam has been active in both teaching and private enterprise in the areas of software specification, design and development. As chief information officer (CIO) of a small private company he managed the development and submission of five granted and three pending patents. He also managed the submission of a successful Federal Government Comet grant under the Commercialising Emerging Technologies category. He has also had a number of peer-reviewed publications, including with the Institute of Electrical and Electronics Engineers (IEEE).
Dr. Ramanathan Venkatraman is working as a Member, Advanced Technology Application Practice at the National University of Singapore. He has served industry and academia for more than 32 years and has a wide spectrum of experience in the fields of IT and business process engineering. His current research focuses on evolving decision models for business problems and, more recently, he has been contributing to frontiers of knowledge by devising innovative architectural models for ICT in domains such as Service Oriented Architecture, Big Data and Enterprise Cloud Computing. Dr Venkatraman has a strong practice approach, having worked in large scale IT projects across Asia, the US, Europe and NZ. He has published more than 20 research papers in leading journals. Apart from research and consulting, he also teaches advanced technical courses for the Masters program at NUS and has been a key architect in setting up innovative software engineering and business analytics curricula in the fast changing IT education scenario.
I.J. Information Technology and Computer Science, 2016, 12, 67-74 Published Online December 2016 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2016.12.08
Improving Matching Web Service Security Policy Based on Semantics Amira Abdelatey, Mohamed Elkawkagy, Ashraf Elsisi, Arabi Keshk Faculty of Computers and Information/Computer Science, Menofia University, Egypt E-mail: {Amira.mohamed, mohamed.elkhawaga, ashraf.elsisi, arabi.keshk}@ci.menofia.edu.eg.
Abstract—Nowadays the trend of the web is to become a collection of services that interoperate through the Internet. The first step towards this inter-operation is finding services that meet requester requirements, which is called service discovery. Service discovery matches the functional and non-functional properties of the requester with the provider. In this paper, an enhanced matching algorithm of Web Service Security Policy (WS-SP) is proposed to perform requirement-capability matchmaking of a consumer and a provider. A web service security policy specifies the security requirements or capabilities of a web service participant (a provider or a consumer). A security requirement or a capability of a participant is one of the non-functional properties of a web service. The security addressed in this paper is the integrity and the confidentiality of the web service SOA message transmitted between participants. The enhanced matching algorithm states simple policy and complex policy cases of web service security as a non-functional attribute. A generalization matching algorithm is introduced to get the best-matched web service provider from a list of available providers for serving the consumer.

Index Terms—Ontology matching, web service, SOA message security, Web service non-functional properties, web service security policy.
I. INTRODUCTION

Nowadays, web services have become a modern standard in the information technology industry. A service represents a self-contained, platform-independent computational element that can be accessible by other applications across organizational boundaries [1]. In the service-oriented model, a service is offered by a service provider and invoked by a service requester. A discovery mechanism is used by the provider to advertise its services and by the requester to find the suitable service that fulfills its requirements [2]. So, matchmaking and locating services is an important problem [3]. Selecting a web service must address not only the functional aspects but also the non-functional properties of the service [4]. The Web Service Description Language (WSDL) was inspired to represent the functional aspects of a Web Service. Web Service Policy (WS-Policy) is used to represent the non-functional attributes of a web service.
Matching both the functional and the non-functional capabilities and requirements of a Web Service is the initial step during the service discovery stage [5]. The discovery of web services is conducted by a WSDL processing system [6], besides the processing of WS-Policy. This paper focuses on message security, which is one of the non-functional properties of a web service [7]. Message security becomes a primary concern when using Web services. Message security mainly means the confidentiality and the integrity of data transmitted through the message [8]. Confidentiality and integrity are assured by applying security mechanisms such as encryption and digital signature. A framework is implemented to perform automatic semantic matching between the service requester's requirements and the service provider's security capabilities. Different security classes are associated with web services, like message encryption, digital signature, authentication, etc. [8]. The Web Service Security Policy (WS-SP) specification is used as a standard for representing security requirements for web service entities. WS-Policy [9] is capable of representing the syntactic aspects of the non-functional properties but lacks semantics. It allows only syntactic matching of policies, as it depends on the policy intersection mechanism [10]. Syntactic matching of security policies restricts the effectiveness of checking the compatibility between requester and provider policies, as it has a strict yes-no matching result. Semantic matching leads to a more flexible and correct result of matching policies. WS-SP is transformed into the Web Ontology Language Description Logic (OWL-DL) [11]. The Semantic Web Rule Language (SWRL) is used for extending OWL-DL with semantic relations to get the best matching level between requester and provider policies. These relationships lead to more correct and more flexible matching of security policies. In this paper, an improved matching algorithm for WS-SP is introduced. It considers different cases of WS-SP types, either simple or complex policy. A generalized matching algorithm is introduced for getting the best-matched provider from N available providers. In this paper, we introduce an improved semantic matching algorithm of WS-SP. WS-SP is described in Section II. Related work is discussed in Section III. In Section IV, an improved WS-SP matching algorithm is presented. In addition, a generalization of the improved WS-SP matching algorithm is provided in Section V. The work concludes in Section VI.
II. WEB SERVICE SECURITY POLICY

Web service architecture, as a type of SOA, involves three entities: the service provider, the service registry and the service consumer, which integrate together to perform a specific task [9]. In addition to WS-Policy, WS-SP (Web Service Security Policy) represents the syntactic aspects of security as a non-functional property. Because of the loosely coupled connections of SOA and HTTP as an open access protocol, SOA must provide web services with a set of security requirements or capabilities. Security is an important parameter of a web service. Security here refers to securing the SOAP message exchanged between provider and consumer. Message security assures SOAP message integrity, confidentiality, and identity. Each entity of a web service architecture has requirement or capability constraints. Matching these requirement or capability constraints is not an easy task. WS-Policy allows Web services to define policy requirements for endpoints. These requirements include privacy rules, encryption rules, and security tokens. WS-SP allows Web services to apply security to Simple Object Access Protocol (SOAP) messages through encryption and integrity checks on all or part of the message. WS-SP is an expressive language aggregated into the Web service architecture. Thus the matching of WS-SP becomes more and more important while integrating Web services. However, WS-SP has a big weakness, as it only allows syntactic matching of security policies. Security policy matching depends on the policy intersection mechanism provided by WS-Policy.

A. Security policy

Security Policy is widely spread in the industry, and it is currently a popular standard to be combined into the Web service architecture [12]. The Web Services Policy (WS-Policy) Framework is a framework for describing the capabilities and requirements of a web service provider and requester [10]. WS-Policy is used to represent the non-functional properties of a web service. In the matching algorithm, the non-functional properties of a web service that represent the policy requirements of the requester must be compatible with the capability policies of the provider. WS-SP is used to specify the web service security specification for Web services. In WS-Security, a security policy defines a set of security policy assertions that are used in determining individual security requirements or capabilities [13]. Policy operators are used to combine security policy assertions. Policy operators have two elements: "Exactly One" and "Exactly All". "Exactly One" is used to express assertions that have alternatives; it means only one of its children elements must hold. On the other hand, "Exactly All" means that all its children elements must hold. Alternatives are used to describe requirement options of a requester or a provider. Fig.1 shows security policy requirements expressed using WS-SP. It represents the security policy signature and encryption of a web service entity [14]. The example has one security alternative; this alternative has two
assertions. This security policy supports the signature of the message body with a symmetric key securely transported using an X.509 token [15]. Besides, the necessary cryptographic operations must be performed using the Basic256 algorithm suite [16].

<wsp:Policy>
  <wsp:ExactlyOne>
    <wsp:All>
      <sp:SymmetricBinding>
        <wsp:Policy>
          <sp:ProtectionToken>
            <wsp:Policy>
              <sp:X509Token/>
            </wsp:Policy>
          </sp:ProtectionToken>
          <sp:AlgorithmSuite>
            <wsp:Policy>
              <sp:Basic256/>
            </wsp:Policy>
          </sp:AlgorithmSuite>
        </wsp:Policy>
      </sp:SymmetricBinding>
      <sp:SignedParts>
        <sp:Body/>
      </sp:SignedParts>
    </wsp:All>
  </wsp:ExactlyOne>
</wsp:Policy>

Fig.1. Representation Example of WS-SP
B. Web service security policy matching problem

WS-SP (A):
<wsp:Policy>
  <wsp:ExactlyOne>
    <wsp:All>
      <sp:SymmetricBinding>                  <!-- RAss1 -->
        <wsp:Policy>
          <sp:ProtectionToken>
            <wsp:Policy>
              <sp:X509Token/>
            </wsp:Policy>
          </sp:ProtectionToken>
          <sp:AlgorithmSuite>
            <wsp:Policy>
              <sp:Basic256/>
            </wsp:Policy>
          </sp:AlgorithmSuite>
        </wsp:Policy>
      </sp:SymmetricBinding>
      <sp:SignedParts>                       <!-- RAss2 -->
        <sp:Body/>
      </sp:SignedParts>
    </wsp:All>
  </wsp:ExactlyOne>
</wsp:Policy>

WS-SP (B):
<wsp:Policy>
  <wsp:ExactlyOne>
    <wsp:All>
      <sp:SymmetricBinding>                  <!-- PAss1 -->
        <wsp:Policy>
          <sp:ProtectionToken>
            <wsp:Policy>
              <sp:X509Token/>
            </wsp:Policy>
          </sp:ProtectionToken>
          <sp:AlgorithmSuite>
            <wsp:Policy>
              <sp:Basic256/>
            </wsp:Policy>
          </sp:AlgorithmSuite>
          <sp:IncludeTimestamp/>
        </wsp:Policy>
      </sp:SymmetricBinding>
      <sp:SignedElements>                    <!-- PAss2 -->
        <sp:XPath>/Envelope/Body</sp:XPath>
      </sp:SignedElements>
    </wsp:All>
  </wsp:ExactlyOne>
</wsp:Policy>
Fig.2. Matching Problem of WS-SP
The automatic matching algorithm of WS-SP specifications checks the requester's policy against the provider's policy to ensure their compatibility. Syntactical matching
of their policies is quite straightforward, but it lacks semantics [17]. It is not able to discover matching when each policy uses different vocabularies, even though they have the same meaning. It is necessary to construct a formal model that describes conceptual relationships between requester and provider policies. Ontology is the most commonly used formal, explicit specification of a shared conceptualization [18]. Fig.2 clarifies the matching problem of web service security policy. WS-SP (A) and WS-SP (B) are two security policies, each of which has one alternative with two assertions. The assertions specified in the provider security policy, WS-SP (B), are syntactically different from those specified in the requester security policy, WS-SP (A). So, policy intersection will adopt a "no match" result for these security policies. However, semantic analysis of the above provider security policy and requester security policy leads to a different matching result. Although the RAss2 and PAss2 assertions do not have the same type, they have the same meaning of signing the body of the message. The other difference, between the RAss1 and PAss1 assertions, is that PAss1 includes an extra child element, "sp:IncludeTimestamp", which means a timestamp element must be included in the security header of the message. From a security viewpoint, this strengthens the integrity of the service [19]. So, matching these two assertions must lead to a perfect match rather than a no match.
C. Policy matching

WS-Policy represents a requester requirement and a provider capability. WS-Policy describes a normal policy form that is a disjunction of alternatives and a conjunction of all assertions in an alternative. The proper form for policy matching is as follows. A policy P is defined as a finite set of alternatives {Alt1, Alt2, ..., AltN}. It is expressed as a disjunction of all its alternatives as follows:

P = Alt1 ∨ Alt2 ∨ ... ∨ AltN    (1)

An alternative Alt is identified as a finite set of assertions {Ass1, Ass2, ..., AssN}. It can also be expressed as a conjunction of all its assertions as follows:

Alt = Ass1 ∧ Ass2 ∧ ... ∧ AssN    (2)

The requester web service security policy ReqP is defined as a set of alternatives, and each alternative is a set of assertions. The requester policy is expressed as follows:

ReqP = Alt1 ∨ Alt2 ∨ ... ∨ Alti    (3)
Alti = Ass1 ∧ Ass2 ∧ ... ∧ Assi    (4)

Moreover, the provider web service security policy ProvP is defined as a set of alternatives, and each alternative is a set of assertions. The provider policy is expressed as follows:
ProvP = Alt1 ∨ Alt2 ∨ ... ∨ Altj    (5)
Altj = Ass1 ∧ Ass2 ∧ ... ∧ Assj    (6)
Matching the ReqP and ProvP security policies is reduced to finding equivalent alternatives, as expressed in the following rule:
(∃ Alti) s.t. Alti ∈ ReqP and (∃ Altj) s.t. Altj ∈ ProvP and Alti ≡ Altj ⟹ (ReqP ≡ ProvP)    (7)
Finding equivalent alternatives is identified in the following manner. There are two alternatives are equivalent: if, for each assertion in both alternatives, there
((Assi )S.T . Assi Alti
and
exists satisfied assertion. Equivalent alternatives expressed in the following rule.
(Ass j ) S.T . Ass j Alt j
and
Assi Ass j ) ( Alti Alt j ) (8)
Fro m the above rules, an equivalent policy created from equivalent alternatives. Also, equivalent alternatives created from semantically equivalent assertions. In the proposed policy framework, equivalent assertions computed using semantic matching of these assertions.
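As an illustration only, rules (7) and (8) can be read as the following Python sketch; the assertions_match predicate is a placeholder for the semantic assertion matching described in the rest of this section, not part of the original formulation:

def alternatives_equivalent(req_alt, prov_alt, assertions_match):
    # Rule (8): every assertion in either alternative must have a semantically
    # matching assertion in the other alternative.
    return (all(any(assertions_match(a, b) for b in prov_alt) for a in req_alt) and
            all(any(assertions_match(a, b) for a in req_alt) for b in prov_alt))

def policies_equivalent(req_policy, prov_policy, assertions_match):
    # Rule (7): every requester alternative must have an equivalent provider alternative.
    return all(any(alternatives_equivalent(ra, pa, assertions_match) for pa in prov_policy)
               for ra in req_policy)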
D. Semantic relations of WS-SP

Semantic relations added to the WS-SP ontology are defined in detail in [20]. These semantic relations are prescribed by SWRL [21]. These relations bind the requester SP and the provider SP. These rules define the conditions that requester and provider assertions must satisfy to create a given semantic relation. These relations express the semantic interpretation between requester and provider. First, from these semantic relations, we get the matching degree between assertions. After having the matching level of all assertions, we can get the matching level between policies.

E. WS-SP Ontology

The ontology-based model of WS-SP consists of two main parts: an ontological representation of a Security Policy (SP) structure and an ontological representation of WS-SP assertions. Fig.3 shows the ontological representation of an SP structure. To specify SP in a normal form, three classes are created: "Security
Policy", "Security Alternative", and "Security Assertion". A security policy contains one or more alternatives. A security alternative consists of one or more security assertions. Therefore, the three previous classes are created in that particular order. In the WS-SP standard, an assertion can have an arbitrary number of types: "Security Binding", "Protection Scope", "Supporting Security Tokens", "Token Referencing And Trust Options". These classes are described in detail in [20, 22].

Fig.3. Ontological Representation of SP Structure (Security Policy -Has Alternative-> Security Alternative -Has Assertion-> Security Assertion)
Security assertion classes are modeled as an ontological representation based on the semantic meaning of these assertions. "Security Binding", "Supporting Security Tokens", "Token Referencing And Trust Options", and "Protection Scope" are the security assertions of the WS-SP ontology and are modeled as subclasses of the "Security Assertion" class, as shown in Fig.4.

Fig.4. Class assertion types (Security Assertion -> Security Binding, Protection Scope, Token Referencing And Trust Options, Supporting Security Tokens)
The "Security Binding" class specifies the primary security mechanism to apply for securing message exchanges [23]. "Security Binding" can be either transport level, represented by the "Transport Binding" class, or message level, represented by the "Message Security Binding" class. Transport binding protocols can be either symmetric binding or asymmetric binding, which are represented by the two subclasses "Symmetric Binding" and "Asymmetric Binding". The "Protection Scope" class is used to specify the message parts of the security policy that are encrypted and signed. It has two subclasses, "Encryption Scope" and "Signature Scope". The "Encryption Scope" class has two subclasses, "Encrypted Element" and "Encrypted Part". The "Signature Scope" class has two subclasses, "Signed Element" and "Signed Part". The "Supporting Security Tokens" class creates security binding elements and tokens. It supports tokens that specify encryption and signing requirements. In other words, it supports the security tokens required by the "Security Binding" class [24]. It has a "Binary Security Token" class and an "XML Security Token" class. The "Token Referencing And Trust Options" class defines various policy assertions related to exchanges between requester and provider. It is used for negotiation protocols. It has two subclasses, "Trust Referencing Options" and "Trust Options".
III. RELATED WORK

Several works dealt with adding semantics to WS-SP to overcome the deficits of policy intersection. M. Ben Brahim et al. [25] constructed a simple ontology to compare two security policies and then built an algorithm to compare security policies. This algorithm uses a semantic reasoner to get the result of the comparison. It uses WS-SP for specifying requester requirements and provider capabilities. It represents security requirements and capabilities as an OWL ontology, and the reasoner works on top of it. The comparing algorithm is not described in detail. S. Alhazbi et al. [26] introduced a framework for preference-based semantic matching between web services security policies. This approach utilizes the alternative feature in WS-Policy to allow the requester to specify multi-optional requirements ranked by preference. An ontology is used to model the relationships between different web service security concepts. The authors also use a reasoner to specify the level of matching. The matching algorithm uses the matching level and the requester preference to specify the best option to be mapped to provider capabilities. M. Ben Brahim et al. [22] presented a semantic matching technique to compare two security assertions. They proposed a WS-SP based ontology and some relations to compare two security assertions. They show how to get the matching level of two simple security assertions, but the approach lacks the comparison of two whole policies. It also lacks processing of complex security policies. T.-D. Cao et al. [20] presented a semantic approach for determining and matching security policies. This approach transforms WS-SP into an OWL-DL ontology. It adds a set of semantic relations that can exist between the provider and requester security concepts. The algorithm determines the matching level of the provider and requester security policies. However, it lacks processing of all probability cases of simple and complex security policies. It also lacks processing of complex policies. The improved matching algorithm depends on [25], [22], [20]. It enhances the WS-SP based ontology semantic matching algorithm. It improves the semantic matching of simple security policies and complex security policies, in addition to considering all cases of simple and complex policies.
IV. IMPROVED SEMANTIC MATCHING ALGORITHM

Web service security policy matching checks the compatibility of WS-SP alternatives and assertions. The improved matching algorithm is used for matching two security policies. The matching process checks to what extent requester security requirements are satisfied by provider capabilities. It first checks whether the security policy is a simple policy or a complex one. A simple policy is a security policy that has only one alternative with any number of assertions. A complex policy is a security policy that has more than one alternative, where each alternative has one or more assertions. Work [25] discussed matching security policies. After that, the authors extended their work to discuss matching provider and requester security policies in the simple policy case and complex policy cases [20]. They do not clarify how to apply the matching algorithm to complex policies. Moreover, in contrast to the standards [10, 17, 27], which state that an alternative can have more than one assertion, the authors in [20] apply their matching considering that a policy contains different assertions and an assertion contains one or more alternatives. We categorize WS-SP into a simple policy and a complex policy. As stated before, a simple policy is a policy in which each web service entity has zero or one alternative. If the requester alternatives and provider alternatives both equal zero, then it is a perfect match. If one of the requester alternatives and provider alternatives equals zero and the other has one or more alternatives, then it is no match. If the requester alternatives and provider alternatives both equal one, then simple policy matching applies. If one of the provider or requester policies has one alternative and the other has more than one alternative, then complex policy matching applies. If the provider and requester policies both have more than one alternative, then complex policy matching applies. The different cases for simple policy matching and complex policy matching are stated in Table 1.
compareAssertion(ReqAss, ProvAss) {
    If (ReqAss "isIdenticalTo" ProvAss) then perfect match.
    If (ReqAss "isEquivalentTo" ProvAss) then perfect match.
    If (ReqAss "isMoreSpecificThan" ProvAss) then close match.
    If (ReqAss "isMoreGeneralThan" ProvAss) then possible match.
    If (ReqAss "isLargerThan" ProvAss) then possible match.
    If (ReqAss "isStrongerThan" ProvAss) then possible match.
    If (ReqAss "hasTechDiffWith" ProvAss) then possible match.
    If (ReqAss "isDifferentFrom" ProvAss) then no match.
    If (ReqAss "isSmallerThan" ProvAss) then no match.
    If (ReqAss "isWeakerThan" ProvAss) then no match.
}
Fig.5. compareAssertion() pseudo code
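The mapping in Fig.5 can be transcribed directly into runnable Python. The sketch below assumes that the semantic relation holding between the requester and provider assertions has already been derived (for example by the SWRL rules of [20, 21]); that assumption is ours, not part of the pseudo code:

RELATION_TO_MATCH_LEVEL = {
    "isIdenticalTo": "perfect match",
    "isEquivalentTo": "perfect match",
    "isMoreSpecificThan": "close match",
    "isMoreGeneralThan": "possible match",
    "isLargerThan": "possible match",
    "isStrongerThan": "possible match",
    "hasTechDiffWith": "possible match",
    "isDifferentFrom": "no match",
    "isSmallerThan": "no match",
    "isWeakerThan": "no match",
}

def compare_assertion(relation):
    # Returns the assertion-level matching degree for a given semantic relation.
    return RELATION_TO_MATCH_LEVEL.get(relation, "no match")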
Table 1. Simple policy and complex policy different cases

Requester             | Provider              | State
|RAlt|=1 or |RAlt|>1  | |PAlt|=0              | (Simple Policy) No Match
|RAlt|=0              | |PAlt|=1 or |PAlt|>1  | (Simple Policy) No Match
|RAlt|=0              | |PAlt|=0              | (Simple Policy) Perfect Match
|RAlt|=1              | |PAlt|=1              | Simple Policy
|RAlt|>1              | |PAlt|=1              | Complex Policy
|RAlt|=1              | |PAlt|>1              | Complex Policy
|RAlt|>1              | |PAlt|>1              | Complex Policy
Therefore, the simple matching process is conducted in only one case, "if requester and provider have one alternative". The complex matching process is carried out if requester and provider have one or more alternatives. Previous works on this topic do not consider all the different cases. The improved algorithm studies all possible cases. There are four possible assertion-matching levels for requester and provider assertions: perfect match, close match, possible match and no match. These assertion-
matching levels depend on the previously described semantic relations. Assertion matching levels are described in detail in [25]. The compareAssertion() procedure compares requester assertions with provider assertions. It returns perfect
match, possible match, close match or no match. The pseudo code of compareAssertion() is shown in Fig.5. If the requester assertion has an "is identical to" or "is equivalent to" relation with the provider assertion, then it is a perfect match. If the requester assertion has an "is more specific than" relation with the provider assertion, then it is a close match. If the requester assertion has an "is more general than", "is larger than", "is stronger than" or "has tech diff with" relation with the provider assertion, then it is a possible match. If the requester assertion has an "is different from", "is smaller than" or "is weaker than" relation with the provider assertion, then it is no match. The matching process starts with the requirement of a requester, so the requester is defined as the starter of the matchmaking process [13]. The matching algorithm mainly depends on the number of alternatives and the number of assertions, which are the primary components of the security policy. To get the final matching level of a simple policy, the final assertion matching level is taken as the lowest degree of match found between requester assertions and provider assertions. In a complex policy, the matchmaker matches alternatives of the requester security policy with provider alternatives. To get the final matching level of a complex policy, the final matching level is taken as the highest level of match found between requester and provider alternatives. The alternative matchmaker calls the assertion matchmaker, which is the simple policy case. The improved matching security policy algorithm is shown in Fig.6. Through Fig.6, it first checks the number of requester and provider alternatives, which are the primary components of a policy. If the requester alternatives "RAlt" or the provider alternatives "PAlt" are less than or equal to one, then it is a simple policy; otherwise it is a complex policy. For a simple policy, it creates all pairs of requester and provider assertions, gets the matching level of all pairs, and finally takes the matching level with the lowest value of matching. It gets the matching level of each pair by calling the "compareAssertion" pseudo code. For a complex policy, the algorithm first creates all pairs of requester and provider alternatives. Second, it gets the matching level of all alternative pairs, and finally it combines all matching levels of all pairs and gets the final matching level of the two policies by finding the highest-matched alternatives. During the complex policy matching process, the algorithm calls simple policy matching. The complexity of the matching is analyzed by comparing the elements in the requester policy P1, with X1 alternatives and X2 requester assertions, and the provider policy P2, with Y1 alternatives and Y2 provider assertions.

Fig.6. Improved matching security policy algorithm
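A minimal Python sketch of this simple/complex dispatch is given below; the assertion_level callable stands in for the compareAssertion step above, and the helper names are ours, not the paper's:

LEVELS = ["no match", "possible match", "close match", "perfect match"]

def match_simple_policy(req_assertions, prov_assertions, assertion_level):
    # Simple policy: lowest match level over all requester/provider assertion pairs.
    pair_levels = [assertion_level(ra, pa) for ra in req_assertions for pa in prov_assertions]
    return min(pair_levels, key=LEVELS.index) if pair_levels else "no match"

def match_policies(req_alts, prov_alts, assertion_level):
    # req_alts / prov_alts are lists of alternatives, each a list of assertions.
    if not req_alts and not prov_alts:
        return "perfect match"
    if not req_alts or not prov_alts:
        return "no match"
    if len(req_alts) == 1 and len(prov_alts) == 1:
        return match_simple_policy(req_alts[0], prov_alts[0], assertion_level)
    # Complex policy: highest match level over all alternative pairs.
    alt_levels = [match_simple_policy(ra, pa, assertion_level)
                  for ra in req_alts for pa in prov_alts]
    return max(alt_levels, key=LEVELS.index)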
Table 2 defines the complexity of the improved matching algorithm compared to [25] and [20].

Table 2. Complexity of improved matching algorithm

Requester             | Provider              | Work [25] | Work [20] | Improved matching algorithm
|RAlt|=1 or |RAlt|>1  | |PAlt|=0              | -         | -         | 0
|RAlt|=0              | |PAlt|=1 or |PAlt|>1  | -         | -         | 0
|RAlt|=0              | |PAlt|=0              | -         | -         | 0
|RAlt|=1              | |PAlt|=1              | O(X2.Y2)  | O(X2.Y2)  | O(X2.Y2)
|RAlt|>1              | |PAlt|=1              | -         | -         | O(X1.X2.Y2)
|RAlt|=1              | |PAlt|>1              | -         | -         | O(Y1.X2.Y2)
|RAlt|>1              | |PAlt|>1              | -         | -         | O(X1.Y1.X2.Y2)

The complexity for a simple policy of the improved matching is the same as the time in [20, 25], which takes into account only the number of assertions of requester and provider. The complexity of a complex policy is defined in the improved matching algorithm only. Note that the cells marked "-" represent the simple and complex policy cases that are not considered in [20, 25].

V. GENERALIZATION OF IMPROVED MATCHING ALGORITHM

The improved semantic matching of WS-SP matches a requester security policy with a provider security policy. It only matches one requester with one provider and returns perfect match, possible match, close match or no match. During the web service selection phase, the requester selects the best-matched provider. Therefore, a requester matches against a number of providers. As a generalization of matching WS-SP, the requester matches with N providers to get the most suitable provider. After processing of the generalized matching algorithm, the requester chooses the best-matched provider out of the N providers.

Table 3. Complexity analysis of a generalized matching

Requester             | Provider              | Improved matching algorithm
|RAlt|=1 or |RAlt|>1  | |PAlt|=0              | 0
|RAlt|=0              | |PAlt|=1 or |PAlt|>1  | 0
|RAlt|=0              | |PAlt|=0              | 0
|RAlt|=1              | |PAlt|=1              | N*O(X2.Y2)
|RAlt|>1              | |PAlt|=1              | N*O(X1.X2.Y2)
|RAlt|=1              | |PAlt|>1              | N*O(Y1.X2.Y2)
|RAlt|>1              | |PAlt|>1              | N*O(X1.Y1.X2.Y2)

Table 3 defines the complexity of the generalized matching between one requester and N number of
providers. To address the increased time, a parallel technique is used to decrease the processing time of the generalized matching WS-SP algorithm. The processing of matching the requester with each provider is executed separately.
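A sketch of this generalization, reusing match_policies and LEVELS from the earlier sketch (ranking levels by their list index is our simplification of "best-matched"):

def best_matched_provider(req_alts, provider_policies, assertion_level):
    # Match the requester against N provider policies and return the index
    # and matching level of the best-matched provider.
    ranked = [(match_policies(req_alts, p, assertion_level), i)
              for i, p in enumerate(provider_policies)]
    level, index = max(ranked, key=lambda t: LEVELS.index(t[0]))
    return index, level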
VI. CONCLUSION AND FUTURE WORK

In this paper, an improved web service security policy matching algorithm is introduced, considering all cases of simple security policies and complex security policies. We considered the complex policy cases of WS-SP matching and stated all cases of simple and complex policies. For the simple policy, we stated all its instances. Furthermore, the improved algorithm addressed the complex policy in all its different situations. A generalized matching of WS-SP is conducted to get the best-matched provider from a set of different providers. As future work, we aim to extend the improved algorithm to match requester security requirements with different provider security policies and get the best-matched provider. We also target adding a negotiation technique so that interaction between the web service provider and the consumer can take place.

REFERENCES
[1] M. P. Papazoglou, "Service-oriented computing: Concepts, characteristics and directions," in Web Information Systems Engineering, WISE 2003, Proceedings of the Fourth International Conference on, 2003, pp. 3-12.
[2] D. Guinard, V. Trifa, S. Karnouskos, P. Spiess, and D. Savio, "Interacting with the soa-based internet of things: Discovery, query, selection, and on-demand provisioning of web services," Services Computing, IEEE Transactions on, vol. 3, pp. 223-235, 2010.
[3] A. Fellah, M. Malki, and A. Elçi, "Web Services Matchmaking Based on a Partial Ontology Alignment," 2016.
[4] E. M. Maximilien and M. P. Singh, "Toward autonomic web services trust and selection," in Proceedings of the 2nd international conference on Service oriented computing, 2004, pp. 212-221.
[5] T. Lavarack and M. Coetzee, "Considering web services security policy compatibility," in Information Security for South Africa (ISSA), 2010, pp. 1-8.
[6] N. N. Chiplunkar and A. Kumar, "Dynamic Discovery of Web Services using WSDL," International Journal of Information Technology and Computer Science (IJITCS), vol. 6, p. 56, 2014.
[7] L. Seinturier, P. Merle, R. Rouvoy, D. Romero, V. Schiavoni, and J. B. Stefani, "A component-based middleware platform for reconfigurable service-oriented architectures," Software: Practice and Experience, vol. 42, pp. 559-583, 2012.
[8] D. Jamil and H. Zaki, "Security issues in cloud computing and countermeasures," International Journal of Engineering Science and Technology (IJEST), vol. 3, pp. 2672-2676, 2011.
[9] S. Weerawarana, F. Curbera, F. Leymann, T. Storey, and D. F. Ferguson, Web services platform architecture: SOAP, WSDL, WS-policy, WS-addressing, WS-BPEL, WS-reliable messaging and more: Prentice Hall PTR, 2005.
[10] A. S. Vedamuthu, D. Orchard, F. Hirsch, M. Hondo, P. Yendluri, T. Boubez, et al., "Web services policy 1.5 - framework," W3C Recommendation, vol. 4, pp. 1-41, 2007.
[11] J. De Bruijn, R. Lara, A. Polleres, and D. Fensel, "OWL DL vs. OWL flight: conceptual modeling and reasoning for the semantic Web," in Proceedings of the 14th international conference on World Wide Web, 2005, pp. 623-632.
[12] P. Hallam-Baker, V. M. Hondo, H. Maruyama, M. McIntosh, and I. Nataraj Nagaratnam, "Web Services Security Policy Language (WS-SecurityPolicy)," 2005.
[13] L. Kagal, T. Finin, and A. Joshi, "A policy based approach to security for the semantic web," in International Semantic Web Conference, 2003, pp. 402-418.
[14] J. H. An, Y. Dodis, and T. Rabin, "On the security of joint signature and encryption," in Advances in Cryptology, EUROCRYPT 2002, 2002, pp. 83-107.
[15] C. Adams and D. Pinkas, "Internet X.509 public key infrastructure time-stamp protocol (TSP)," 2001.
[16] H. V. Chung, Y. Nakamura, and F. Satoh, "Security Policy Validation For Web Services," ed: Google Patents, 2007.
[17] S. Speiser, "Semantic annotations for ws-policy," in Web Services (ICWS), 2010 IEEE International Conference on, 2010, pp. 449-456.
[18] D. Martin, M. Paolucci, S. McIlraith, M. Burstein, D. McDermott, D. McGuinness, et al., "Bringing semantics to web services: The OWL-S approach," in Semantic Web Services and Web Process Composition, ed: Springer, 2005, pp. 26-42.
[19] K. Ono, Y. Nakamura, F. Satoh, and T. Tateishi, "Verifying the consistency of security policies by abstracting into security types," in Web Services, ICWS 2007, IEEE International Conference on, 2007, pp. 497-504.
[20] T.-D. Cao and N.-B. Tran, "Enhance Matching Web Service Security Policies with Semantic," in Knowledge and Systems Engineering, ed: Springer, 2014, pp. 213-224.
[21] I. Horrocks, P. F. Patel-Schneider, H. Boley, S. Tabet, B. Grosof, and M. Dean, "SWRL: A semantic web rule language combining OWL and RuleML," W3C Member submission, vol. 21, p. 79, 2004.
[22] M. B. Brahim, T. Chaari, M. B. Jemaa, and M. Jmaiel, "Semantic matching of ws-securitypolicy assertions," in Service-Oriented Computing - ICSOC 2011 Workshops, 2012, pp. 114-130.
[23] N. Gruschka and L. L. Iacono, "Vulnerable cloud: Soap message security validation revisited," in Web Services, ICWS 2009, IEEE International Conference on, 2009, pp. 625-631.
[24] D. Z. G. Garcia and M. B. F. De Toledo, "Ontology-based security policies for supporting the management of web service business processes," in Semantic Computing, 2008 IEEE International Conference on, 2008, pp. 331-338.
[25] M. Ben Brahim, T. Chaari, M. Ben Jemaa, and M. Jmaiel, "Semantic matching of web services security policies," in Risk and Security of Internet and Systems (CRiSIS), 2012 7th International Conference on, 2012, pp. 1-8.
[26] S. Alhazbi, K. M. Khan, and A. Erradi, "Preference-based semantic matching of web service security policies," in 2013 World Congress on Computer and Information Technology (WCCIT), 2013.
[27] K. Lawrence, C. Kaler, A. Nadalin, M. Goodner, M. Gudgin, A. Barbir, et al., "WS-SecurityPolicy 1.3," OASIS Standard, February, pp. 41-44, 2009.
Authors' Profiles

Arabi Keshk received the B.Sc. in Electronic Engineering and M.Sc. in Computer Science and Engineering from Menoufia University, Faculty of Electronic Engineering, in 1987 and 1995, respectively, and received his PhD in Electronic Engineering from Osaka University, Japan, in 2001. His research interests include software testing, software engineering, distributed systems, databases, data mining, and bioinformatics.

Ashraf El-Sisi received the B.Sc. and M.Sc. in Electronic Engineering and Computer Science Engineering from Menofia University, Faculty of Electronic Engineering, in 1989 and 1995, respectively, and received his PhD in Computer Engineering & Control from Zagazig University, Faculty of Engineering, in 2001. His research interests include cloud computing, privacy preserving data mining, and intelligent systems.

Mohamed Elkawkagy (1973), male, Faculty of Computers and Information, Menofia University, Egypt, Lecturer, received his PhD in 2012. His research directions include AI planning, software engineering, planning search strategy, multi-agent planning, web-based planning, agent systems and Human Computer Interaction (HCI).

Amira Abdelatey received the B.Sc. and M.Sc. in Computers and Information from Menofia University, Faculty of Computers and Information, in 2007 and 2012, respectively. She is currently a PhD student in the Faculty of Computers and Information, Menofia University. Her research interests include the semantic web, web services, intelligent systems, web service security, software engineering and database systems.
How to cite this paper: Amira Abdelatey, Mohamed Elkawkagy, Ashraf Elsisi, Arabi Keshk, "Improving Matching Web Service Security Policy Based on Semantics", International Journal of Information Technology and Computer Science (IJITCS), Vol.8, No.12, pp.67-74, 2016. DOI: 10.5815/ijitcs.2016.12.08
I.J. Information Technology and Computer Science, 2016, 12, 75-82 Published Online December 2016 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2016.12.09
A Hybrid Approach for Blur Detection Using Naïve Bayes Nearest Neighbor Classifier Harjot Kaur Dept. of Computer Science and Engineering, Sri Guru Granth Sahib World University, Punjab, India E-mail: harjotkaur94639@gmail.com
Mandeep Kaur Dept. of Computer Science and Engineering, Sri Guru Granth Sahib World University, Punjab, India E-mail: mandeepkaur.dhot@gmail.com
Abstract—Blur detection in a partially blurred image is challenging because in this case the blur varies spatially. In this paper, we propose a blurred-image detection framework for automatically detecting blurred and non-blurred regions of the image. We propose a new feature vector that consists of the information of an image patch as well as a blur kernel; that is why it is called a kernel-specific feature vector. The information extracted about an image patch is based on blurred pixel behavior in the local power spectrum slope, gradient histogram span, and maximum saturation methods. To make the feature vector useful for real applications, kernels consisting of motion-blur kernels, defocus-blur kernels, and their combinations are used. Gaussian filters are used for the filtering process of extracted features and kernels. Construction of the kernel-specific feature vector is followed by the proposed Naïve Bayes classifier based on the Nearest Neighbor classification method (NBNN). The proposed algorithm outperforms the up-to-date blur detection method. Because blur detection is an initial step for the de-blurring process of partially blurred images, our results also demonstrate the effectiveness of the proposed method in the deblurring process.

Index Terms—Blur detection, feature extraction, motion blur, defocus blur, support vector machine (SVM), NBNN, deblurring.
I. INTRODUCTION

In recent years, the commercialization of mobile cameras has increased the number of casual photographers, which means people can capture a huge quantity of photos without difficulty. The increase in the number of casual photographers increases the number of failure photographs containing noisy, blurred, and unnaturally-colored images. That is why an automatic system is required to avoid and correct photographs of low quality [1]. Digital cameras are integrated with auto-exposure, automatic white balance, and noise reduction capabilities to resolve exposure, color, and noise issues, but handle image blur only in an imperfect manner. For example, the depth of a specific scene can be focused with an autofocus
function, but it can't capture things at different depths sharply at the same time. Due to the incompatibilities of the autofocusing feature of cameras, defocus-blurred images are commonly seen in personal photo collections. We can discard bad image data at the source itself by integrating image blur detection in a camera. A blur detection approach can also help in the restoration (deblurring) process. One of the challenges of image deblurring is to recover information from available blurred data through efficient and reliable algorithms. Deblurring of a fully blurred image is easier than of partially blurred images, where only a part or a few objects of an image are blurred. Deblurring the whole partially blurred image is costly and also produces wrong results. Therefore, there is a need to detect the blurred region of the partially blurred image to restore it, where the restoration process applies only to the blurred region (not to the unblurred region) of the image [14]. Other applications to which blur detection can be applied are object extraction, scene classification, image quality assessment, forensic investigation (detecting criminals from low-quality camera footage that has some blurred content), etc. In fact, automatic blur detection can replace most of the human operator work of extracting useful information from the blurred image, and the vast applications of blur detection increase the need for research in this topic. Many techniques have been proposed for blur detection, such as edge and sharpness analysis [3], Gaussian blur kernel detection [1], Discrete Wavelet Transform (DWT) [13], Maximum Likelihood (ML) low depth of field (DOF) image auto-segmentation [23], kernel-based SVM classifiers, etc., but some problems arise, such as: (1) techniques are easy to apply on simpler images but less effective for complex images, (2) these techniques are effective for small databases but it is difficult to sample a large database, (3) the techniques are only effective for a specific type of blur (either motion blur or out-of-focus blur images), (4) user interaction is needed for correct estimations. So there is a need to propose a suitable technique which can resolve all these problems. In this paper, we propose a new technique that resolves the above-mentioned problems, and results in
improved blur detection accuracy. The contribution of this paper is threefold. First, we propose a new feature vector based on three features and a combination of motion and defocus blur kernels. Second, we use a hybrid approach (NBNN) to classify the input image regions into blurred and unblurred regions. Third, we apply a deblurring technique to our results to deblur pixels only inside the blur region. We provide results for defocus blur and motion blur images (which are the basic types of blur). Most of the existing blur-related research is based only on motion and defocus blur.
Fig.1. Types of blur: (a) Defocus Blur (b) Motion Blur
The structure of the paper: this Introduction section covers the basic introduction to our topic, applications of blur detection, existing problems and the basic structure of our proposed method. The organization of the rest of the paper is as follows: (a) Section II covers the related literature review, (b) Section III covers the proposed method, describing the blur features, blur kernels, classifier, and structure used to execute our proposed work, (c) Section IV covers the experimental results of the proposed blur detection method and presents a comparison, in terms of accuracy, of the proposed method with popular existing methods, (d) results after deblurring the blurred image using blur detection are presented in Section V, and (e) the Conclusion and Future Work are presented in Section VI and Section VII respectively.
II. RELATED WORK

During the last few decades, topics related to the blurred image have been studied deeply in the fields of computer vision and image processing. In this section, we review general blur detection methods. The shape of an object is due to its edges. An image is said to be sharp if objects and their shapes can be perceived correctly. For example, an object's face in an image looks clear only when we can identify eyes, ears, nose, lips, forehead, etc. very clearly. But factors like blurring (where an image is blurred through photo-editing tools or filters), environmental conditions, relative motion between camera and scene, a low-quality camera, etc. reduce the edge content and make the transition between colors very smooth. Blur can be detected directly through edge and sharpness analysis. A sharp object contains only step edges, but step edges turn into ramp edges when that object gets blurred. Therefore, measurement of the sharpness or blurriness of edges can
be used to detect blur. Chung et al. [2] used gradient magnitude and edge direction to measure the degree of blurriness in the image. First, they fitted gradient magnitude and edge direction into a normal distribution. Then, they computed the standard deviation of the normal distribution along with the gradient magnitude to measure edge width and edge magnitude, which makes blur measurement more reliable. In contrast, measurement of the degree of blur through the thickness of object contours is performed in the paper by Elder and Zucker [3]. They modeled the focal blur by a Gaussian blur kernel and then used the first and second order derivatives of Gaussian (steerable) filters [4] to calculate the response that describes the degree of blur in an image. This method is used only for local edge estimation over a wide range of contrast and local blur scale, and requires only the second moment of the sensor noise as an input parameter. For multi-scale blur estimation, Zhang and Bergholm [5] defined a Gaussian Difference Signature that functions similarly to the first-order derivative of Gaussian. A Bayes discriminant function can be constructed based on the statistics of the gradients, using the mean and standard deviation of both blurred and sharp regions [6]. In the distribution, the blurred region of an image has a smaller mean and standard deviation than a sharp region of the image. This concept helps to detect the blurred region in an image for further de-blurring processes. A Naive Bayesian classifier can also integrate a blur feature set obtained from different domains based on the posterior score. For example, Shi Jianping et al. [7] proposed a new blur feature set (in multiple domains) that is based on the gradient distribution as well as the frequency domain and local filters. Similarly, Renting Liu et al. [8] proposed a feature set based on image color, gradient and spectrum information. These features are used by a Bayes classifier to detect spatially-varying blur and the type of blur. Blur detection methods based on the no-reference (NR) block do not require the original sharp image to measure the degree of blur. Therefore, this method has less complexity and higher robustness compared with edge-based blur metrics. Blur metrics can be computed by averaging macro blocks of local blur, and a content-dependent weighting scheme reduces the texture influence [9]. Based on human blur perception for contrast-varying values, Niranjan D. Narvekar et al. [10] presented a no-reference image blur metric that utilizes a probabilistic model to evaluate the probability of blur detection at each edge in the image. The lowest directional high-frequency energy is used for motion blur detection and has a lower computational cost because estimation of the point spread function is not required [11]. Along the motion blur direction, the high-frequency energy decreases, where energy refers to the sum of the squared derivatives of the image [12]. Discrimination between blurred and non-blurred image regions (based on the gradient distribution) can be conducted using an SVM classifier [13]. The proposed method performs wavelet decomposition on the input image to
extract features in wavelet space and constructs gradient histograms on which a probabilistic SVM is applied to detect the blurred region. In [14], a kernel-specific feature vector consisting of the multiplication of the variance of the filtered gradient of the image patch and the filtered kernel is classified by an SVM. This paper is the most closely related to our proposed work. Their results showed higher accuracy for defocus blur compared with motion blur. We present a comparison of results between our method and their method in the results section. Low Depth of Field (DoF) is a photography technique which provides clear focus only on a specific object and is used to detect and remove blur at the source. Auto-segmentation of low DoF images can be conducted through the pixel-wise spatial distribution of the high-frequency components [15], morphological filters [16], localized blind deconvolution that produces a focus map [17], a multi-scale context-dependent approach [18], the ratio of wavelet coefficients [19], and so on. After the detection of the blurred region, we can apply deblurring techniques to it without affecting the unblurred region. The Inverse Filter [1, 21], Least-Squares Filter [21] and Iterative Filters [22] are used to restore the image when prior information about the degrading system is available. A priori blur identification [21], ARMA parameter estimation, non-parametric deterministic image-based restoration [23] and Maximum-Likelihood (ML) [24] methods are used when prior information about the degrading system is not available. These methods work well on spatially invariant blur. To tackle the partial blur problem, a transparency-based MAP model can be used [26], but this method requires user interaction for better results, whereas our proposed method does not require user interaction.
III. PROPOSED METHOD

Due to the diversity of natural images, we propose a framework to detect blur in partially blurred images. The basic flow chart of our system is given in Fig.2 and is explained briefly in the following:

a) For each input image, first extract the features of the image.
b) Based on the information of the extracted features, select a blur kernel from the pool of kernels consisting of motion blur kernels, defocus blur kernels and their combinations.
c) Construct the kernel-specific feature vector (or blur matrix) from steps (a) and (b).
d) On the basis of the blur matrix, detection of blurred pixels is performed, i.e., the hybrid NBNN classifier is used to classify the blurred and non-blurred regions of the image.

Fig.2. Basic Flow Chart for blur detection (Input image -> Feature Extraction -> Blur Kernel Identification -> Construct blur matrix -> Blur-unblur classification)

A. Feature Extraction

There are three different features developed and combined in our system. These features are derived by analyzing the visual and spectral clues from images. They are described in the following:

1. Local Power Spectrum Slope

The strength of change in an image is defined by its frequency components. Sharp edges of the image have high frequency values. The power spectrum uses frequency components to detect the blurred and unblurred regions. Some high-frequency components of the blurred image are absent due to the low-pass-filtering characteristic of the blurred regions. So, for a blurred region, the amplitude spectrum slope tends to be steeper than for an unblurred region.
Fig.3. (a) Input image (b) Power Spectrum Slope of a blurred patch (shown in a rectangle of red color) and an unblurred region (shown in a rectangle of blue color), where the blurred and unblurred regions have different values.
It is not reliable to simply set a threshold value for the blur estimation, because an image may have multiple objects with different types of edges or boundaries. On the basis of this observation, we first compute the global measure of the slope of the power spectrum (α0) for the
whole image. Then we compare the power spectrum slope (αp) computed in each local block p with α0. If αp is much larger than α0, it is quite possible that this block is blurred.

2. Gradient Histogram Span

The gradient magnitude distribution serves as important visual evidence in the blur detection process. Blurred regions have small gradient magnitudes (or, in the log gradient distribution, shorter tails), while unblurred regions have large gradient magnitudes. As shown in Fig.4, in the case of a blurred region, the number of pixels having a zero gradient value is large. That is why it has a short tail and a high peak at the zero gradient value, while the unblurred region has a heavy tail and less mass at the zero gradient value.
Fig.4. Gradient Histogram Span of the blurred patch (represented in red color) and the unblurred region (represented in blue color), where the blurred region has a strong peak at the zero gradient value, i.e., a small tail, while the unblurred region has a heavy tail.
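As a rough illustration (not the authors' exact implementation), the first two patch features can be computed with NumPy as follows; the radial averaging of the amplitude spectrum and the near-zero gradient threshold are our assumptions:

import numpy as np

def power_spectrum_slope(patch):
    # Slope of log(amplitude) vs. log(frequency); blurred patches tend to
    # have a steeper (more negative) slope.
    amp = np.abs(np.fft.fftshift(np.fft.fft2(patch)))
    h, w = patch.shape
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h / 2, xx - w / 2).astype(int)
    counts = np.maximum(np.bincount(radius.ravel()), 1)
    radial = np.bincount(radius.ravel(), amp.ravel()) / counts
    freqs = np.arange(1, len(radial))            # skip the DC component
    slope, _ = np.polyfit(np.log(freqs), np.log(radial[1:] + 1e-8), 1)
    return slope

def gradient_zero_peak(patch):
    # Fraction of near-zero gradient magnitudes; a strong peak at zero
    # corresponds to the short-tailed gradient histogram of blurred regions.
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    return float(np.mean(mag < 0.01 * mag.max())) if mag.max() > 0 else 1.0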
3. Maximum Saturation

The maximum saturation value of blurred regions is correspondingly expected to be smaller than that of unblurred regions. Based on this observation, we first compute the maximum saturation value max(S0) of the whole image. Then, within each patch p, we compute the saturation Sp for each pixel and find the maximum saturation value max(Sp), which is compared with max(S0). If max(Sp) is less than max(S0), then the patch is blurred; otherwise it is not.

B. Blur Kernel Identification

To identify the blur kernel for an input image, we perform blur edge analysis and derive the following attributes of the blur kernel:
- A blur kernel of the defocus blur type is isotropic in nature, i.e., edges in every direction are smoothened.
- A blur kernel of the motion blur type is anisotropic, i.e., edges with the same direction as the motion direction will be the least affected, while edges perpendicular to the motion direction will be blurred most severely.
- A blur kernel shows no effect when applied to a flat region without illuminance changes. In fact, flat regions contain no useful information for blur detection.

Example kernels of both types are sketched below.
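For illustration, simple disk-shaped (defocus) and line-shaped (motion) kernels can be generated as follows; the sizes and the interpolation used for rotation are our choices, not the paper's:

import numpy as np
from scipy.ndimage import rotate

def defocus_kernel(radius):
    # Isotropic disk kernel approximating defocus blur.
    size = 2 * radius + 1
    yy, xx = np.indices((size, size)) - radius
    k = (yy ** 2 + xx ** 2 <= radius ** 2).astype(float)
    return k / k.sum()

def motion_kernel(length, angle_deg):
    # Anisotropic line kernel approximating linear motion blur.
    k = np.zeros((length, length))
    k[length // 2, :] = 1.0
    k = rotate(k, angle_deg, reshape=False, order=1)
    return k / k.sum()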
C. Construct Blur Matrix

The kernel-specific feature vector (or blur matrix) is composed of the multiplication of the variance of the filtered kernel (obtained using the subsection B kernels) and the variance of the filtered patch features (obtained using the subsection A methods).

D. Blur-unblur Classification

Through a hybrid approach, complex problems can be solved by stepwise decomposition. Specific hierarchical levels (on the basis of concept granularity) are defined in intelligent hybrid systems. Therefore, to discriminate unblurred and blurred regions based on the proposed feature vector, we use the NBNN classifier, where the Naïve Bayes classifier is used for training purposes and the nearest neighbor approach is used to improve the accuracy through additional testing on the neighborhood pixels. When only the Naïve Bayes classification approach is used to discriminate the blurred and unblurred regions, it detects defocus blur more accurately than motion blur [7]. Therefore, we use the nearest neighbor classification approach on neighborhood pixels to increase the accuracy of motion blur detection, because in a motion-blurred image the color of one pixel spreads to its neighborhood, which increases its color similarity to its neighboring pixels.
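A minimal sketch of this construction (the Gaussian smoothing scale and the pairing of every feature map with every kernel are our assumptions for illustration):

import numpy as np
from scipy.ndimage import gaussian_filter

def kernel_specific_feature_vector(patch_feature_maps, blur_kernels, sigma=1.0):
    # One entry per (feature, kernel) pair:
    # var(Gaussian-filtered feature map) * var(Gaussian-filtered kernel).
    vec = []
    for feat in patch_feature_maps:
        v_feat = np.var(gaussian_filter(np.asarray(feat, float), sigma))
        for k in blur_kernels:
            vec.append(v_feat * np.var(gaussian_filter(np.asarray(k, float), sigma)))
    return np.asarray(vec)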
IV. EXPERIMENTAL RESULTS

The performance of the proposed method is estimated on the public dataset of 1000 images accessible at [29], with its ground truth images. Out of the 1000 blurred images, 296 images are motion blurred while the others are defocus blurred. The proposed method is compared with the method developed by Y. Pang et al. [14], referred to as the SVM classifier. In their method, a kernel-specific feature vector based on gradient magnitude and blur kernel is used, and an SVM classifier is adopted. It is the most successful existing method, because their experimental results outperform the methods proposed by Chakrabarti et al. [20], Liu et al. [8], and Su et al. [27]. It is therefore enough to compare our proposed method with the SVM classifier results. To show the need for the hybrid approach, we also compare our results with the results of the Naïve Bayes classifier obtained by Shi et al. [7], openly available at http://www.cse.cuhk.edu.hk/leojia/projects/dblurdetect/. Results (in the form of images) of the SVM classifier, Naïve Bayes classifier and the proposed method on motion blur and defocus blur images, along with their ground truth images, are shown in Fig. 5 and 6. The output image is displayed in black and white, where blurred pixels are represented in white while unblurred pixels are black, to make the differentiation between the blurred and unblurred parts of an image clearer. To compute the accuracy of these methods, the ground truth images are compared with the obtained results to measure how many pixels are classified accurately. The accuracy of the SVM classifier, Naïve Bayes classifier, and proposed method is shown in Table 1, and reasons for their
different behavior are given below. The Naïve Bayes classifier [7] has wrongly detected most of the unblurred pixels as blurred, because this classification method does not consider the neighborhood pixels during the classification process and its feature set was not directly based on the kernel, as used in the SVM and our proposed method. However, it detected blurred pixels more accurately than the SVM classifier, because its feature set is based on multiple blur features and is tested at multiple scales to avoid ambiguous results. The SVM classifier used in paper [14] uses a feature vector that is based on the gradient distribution and blur kernels. The accuracy of this method is higher than that of the Naïve Bayes classifier because it uses a kernel-specific feature set formed by multiplying the variance of the filtered gradient distribution and the filtered blur kernel, where the measure of variance reduces the chance of error (fewer unblurred pixels are detected as blurred) during the classification process. The results of the proposed method (Naïve Bayes Nearest Neighbor classifier) are more accurate because its classification results are based
on multiple blur features and a combination of blur kernels (defocus and motion blur kernels), and neighborhood pixels are also considered to improve the classification process. As shown in Table 1, the accuracy of the SVM and Naïve Bayes classifiers differs for motion and defocus blur. In fact, their accuracy on defocus blur images is approximately 10% higher than on motion blur images, whereas the accuracy on motion blur images and on defocus blur images is approximately the same in the case of the proposed method.

Table 1. Comparison in Accuracy

Method                      | Motion blur | Defocus blur | Average
Proposed method             | 64.12       | 65.2         | 64.7
SVM classifier [14]         | 49.8        | 57.6         | 54.4
Naïve Bayes Classifier [7]  | 41.2        | 53.3         | 51.1

Fig.5. Results of Motion Blur Detection: (a) Input Image (b) Proposed Method (c) SVM Classifier (d) Naïve Bayes (e) Ground Truth
Fig.6. Results of Defocus Blur Detection: (a) Input Image (b) Proposed Method (c) SVM Classifier (d) Naïve Bayes (e) Ground Truth
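The accuracy figures reported in Table 1 amount to a pixel-wise comparison of each output mask with its ground-truth mask; a minimal way to compute this (assuming nonzero pixels denote "blurred" in both masks) is:

import numpy as np

def pixel_accuracy(pred_mask, gt_mask):
    # Fraction of pixels whose blurred/unblurred label agrees with the ground truth.
    return float(np.mean((np.asarray(pred_mask) > 0) == (np.asarray(gt_mask) > 0)))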
Fig.7. Restoration results of defocus blur image: (a) Input Image (b) Blur segmented image where the black region is the segmented unblurred region (c) Deblurring results of the blurred region (d) Final restored image.
Fig.8. Restoration results of motion blur image: (a) Input Image (b) Blur segmented image where the black region is the segmented unblurred region (c) Deblurring results of the blurred region (d) Final restored image.
V. IMAGE DEBLURRING RESULTS

Several computer vision applications can benefit from our blur detection task. We show blur segmentation and deblurring in this section. Existing non-blind deconvolution mixes foreground and background under different motion, but our objective is to deblur pixels only inside blur masks. To perform deblurring, we first perform blur segmentation followed by the deblurring process, explained in the following:

Fig.9. Basic Deblurring Flow Chart (Results of proposed method -> Blur Segmentation -> Unblurred region / Blurred region -> Concatenate / Deblurring -> Restored image)

Step 1- Blur Segmentation: With our proposed method, it is possible to segment images into blur and clear regions. In Fig. 7 and Fig. 8, part (a) shows the defocus and motion blurred input image respectively, and part (b) shows the result of the blur segmentation process, which subtracts the detected unblurred region from the input image. That is why the unblurred region is displayed in black color.

Step 2- Deblurring of blurred region: With the help of the blur segmentation step, we subtract the unblurred region of the blurred image and apply an image deblurring method on the blurred part of the image (results of the restored blurred region are presented in Fig. 7(c) and 8(c)). Finally, we concatenate the original unblurred region and the recovered blurred region of the blurred image (presented in Fig. 7(d) and 8(d)).
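A compact sketch of this segment-then-deblur pipeline is given below; Richardson-Lucy deconvolution is used here only as a stand-in for "an existing deblurring method", and the point spread function psf is assumed to be known or estimated separately (both are our assumptions):

import numpy as np
from skimage import restoration

def restore_partially_blurred(image, blur_mask, psf):
    # image: grayscale float image in [0, 1]; blur_mask: boolean mask of blurred pixels.
    deblurred = restoration.richardson_lucy(image, psf)   # non-blind deconvolution
    # keep the original sharp pixels and take deblurred values only inside the mask
    return np.where(blur_mask, deblurred, image)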
VI. CONCLUSION

In this paper, we have developed a new kernel-specific feature based on more than one blur detection feature. The combination of motion blur kernels and defocus blur kernels also plays an important role in the construction process of the proposed blur detection feature vector. Instead of using a single classifier, we use the NBNN (hybrid) classifier. We conduct extensive experiments with a training and testing dataset of 1000 images and use them to test the accuracy of our classification algorithm. To estimate the effectiveness of the proposed method, we compare it with existing methods (the SVM and Naïve Bayes classifiers). Most of the existing methods provide good blur detection results only for a specific type of blur, but our method works well on both types of blur. To show how blur detection can be used for the deblurring process, after the detection of the blurred region of an
image, we segment the blurred part of the image from the unblurred part, and an existing deblurring method is applied only on the segmented blurred part. Then the original unblurred part of the image is concatenated with the results of the deblurring method.
VII. FUTURE WORK

Our classification results inevitably contain errors in natural images due to the similarity of blurred and low-contrast regions. So, the main problem for future work is to make our system more robust. Other future work involves segmentation of images into layers with different blur extents for further utilization of the blur features. To make our classification results more effective and robust, application of different classification methodologies is also possible.

REFERENCES
[1]
Bovik and J. Gibson, Handbook of image and video processing, Academic Press, Inc. Orlando, FL, USA, 2000. [2] Y. Chung, J. Wang, R. Bailey, S. Chen and S. Chang, ―A nonparametric blur measure based on edge analysis for image processing applications‖, IEEE Conference on Cybernetics and Intelligent Systems, vol. 1, 356 – 360, 2004. [3] J. H. Elder and S. W. Zucker, ―Local scale control for edge detection and blur estimation,‖ IEEE Conf. on Pattern Analysis and M achine Intelligence, vol. 20, no. 7, pp. 699–716, 1998. [4] W. T. Freeman and E. H. Adelson, ―The design and use of steerable filters,‖ IEEE Conf. on Pattern Analysis and M achine Intelligence (PAM I), vol. 13, no. 9, pp. 891– 906, 1991. [5] W. Zhang and F. Bergholm, ―M ulti-scale blur estimation and edge type classification for scene analysis,‖ International Journal of Computer Vision, vol. 24, no. 3, pp. 219–250, 1997. [6] Jaeseung Ko and Changick Kim, ―Low cost blur image detection and estimation for mobile devices,‖ IEEE International Conference on Advanced Computing Technologies (ICACT), vol. 3, pp. 1605-1610, 2009. [7] J. Shi, L. Xu. and J. Jia, ―Discriminative blur detection features,‖ IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp. 2965-2972, 2014. [8] R. Liu, Z. Li and J. Jia, ―Image partial blur detection and classification,‖ CVPR, pp. 1-8, 2008. [9] Liu Debing, Chen Zhibo, M a Huadong, Xu Feng and Gu Xiaodong, ―No reference block based blur detection,‖ Quality of Multimedia Experience International Workshop, pp. 75-80, 2009. [10] Niranjan D. Narvekar and Lina J. Karam, ―A nonreference image blur metric based on the cumulative probability of blur detection (CPBD),‖ IEEE Trans. on Image Processing, pp. 2678-2683, 2011. [11] Xiaogang Chen, Jie Yang, Qiang Wu and Jiajia Zhao, ―M otion blur detection based on lowest directional highfrequency energy,‖ IEEE International Conference on Image Processing (ICIP), pp. 2533-2536, 2010. [12] X. Zhu, S. Cohen, S. Schiller and P. M ilanfar X. Zhu, ―Estimating spatially varying defocus blur from single image,‖ IEEE Trans. on Image Process., pp. 4879-4891, 2013.
[13] V. Kanchev, K. Tonchev and O. Boumbarov, ―Blurred image regions detection using wavelet-based histograms and SVM ,‖ IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), pp. 457-461, 2011. [14] Y. Pang, H. Zhu, Xinyu Li and Xuelong li, ―Classifying discriminative features for blur detection,‖ IEEE Trans. on Cybernetics, pp. 2168-2267, 2015. [15] C. S. Won, K. Pyun and R. M . Gray , ―Automatic object segmentation in images with low depth of field,‖ ICIP, 3, pp. 805-808, 2002. [16] C. Kim, ―Segmenting a Low-Depth-of-Field image using morphological filters and region mergin,‖ IEEE Transactions on Image Processing, no. 14, pp. 1503-1511, 2005. [17] L. Kovacs and T. Sziranyi, ―Focus area extraction by blind deconvolution for defining regions of interest,‖ PAM I, vol. 29, no. 6, pp. 1080-1085, 2007. [18] J. Z. Wang, J. Li, R. M . Gray and G. Wiederhold, ―Unsupervised multiresolution segmentation for images with low depth of field,‖ PAM I, 23, pp. 85, 2001. [19] R. Datta, D. Joshi, J. Li and J. Z. Wang, ―Studying aesthetics in photographic images using a computational approach,‖ ECCV , pp. 288-301, 2006. [20] A. Chakrabarti, T. Zickler, and W.T. Freeman, ―Analyzing spatially-varying blur,‖ IEEE Int. Conf. on Comput. Vis. Pattern Recognit., pp. 2512-2519, 2010. [21] M . Banham and A. Katsaggelos, ―Digital image restoration,‖ IEEE Signal Processing M agazine, pp. 24-41, 1997. [22] R. Lagendijk, A. Katsaggelos and J. Biemond ―Iterative identification and restoration of image,‖ International Conference on Acoustics, Speech, and Signal Processing, pp. 992-995, 1998. [23] D. Kundur and D. Hatzinakos, ―Blind image deconvolution,‖ IEEE Signal Processing M agazine, vol. 13, no. 3, pp. 43 – 64, 1996. [24] G. Pavlovic and M . Tekalp , ―M aximum likelihood parametric blur identification based on a continuous spatial domain model,‖ IEEE Trans. on Image Processing, vol. 1, no. 4, pp. 496-504, 1992. [25] J. Jia, ―Single image motion deblurring using transparency,‖ IEEE Int. Conf. on Comput. Vis. and Pattern Recognition, pp. 1-8, 2007. [26] L. Bar, B.Berkels, M .Rumpf and G. Sapiro, ―A variational framework for simaltaneous motion estimation and restoration of motion-blurred video,‖ IEEE Int. Conf. on Comput. Vis., pp. 1-8, 2007. [27] B. Su., S. Lu., and C.L. Tan, "Blurred image region detection and classification," ACM Int. Conf. M ultimedia, pp. 1397-1400, 2011. [28] Elena Lazkano and Basilio Sierra, Progress in Artificial intelligence, Springer Berlin Heidelberg, vol. 2902, pp. 171-183, 2003. [29] Blur Detection Dataset (2015) Available online at: http://www.cse.cuhk.edu.hk/~leojia/projects/dblurdetect/d ataset.html. [30] X. Lu, X.Li. and L.M ou, ―Semi-Supervised multitask learning for scene recognition,‖ IEEE Trans. Cybernetics, vol. 45, no. 9, pp. 1967-1976, 2015. [31] Y. Pang., K. Wang, Y. Yuan and K. Zhang, ―Distributed object detection with linear SVM ,‖ IEEE Trans. Cybernetics, vol. 44, no. 11, pp. 2122-2133, 2014. [32] Regis Behmo, Arnak Dalalyan, Paul M arcombes and Veronique Prinet, ― Towards optimal naïve bayes nearest neighbor,‖ Springer Berlin Heidelberg, pp. 171-184, 2010. [33] Sancho M cCann and David G.Lowe, ―Local naive bayes
nearest neighbor for image classification,‖IEEE Conference on Comp. Vision and Pattern Recognition, pp. 3650-3656, 2012.
Authors' Profiles

Harjot Kaur: Post-graduate student for the M.Tech. degree in Computer Science and Engineering at Sri Guru Granth Sahib World University, Fatehgarh Sahib, Punjab, India.
Mandeep Kaur: currently working as Assistant Professor in Computer Science and Engineering Department, Sri Guru Granth Sahib World University, Fatehgarh Sahib, Punjab, India.
How to cite this paper: Harjot Kaur, Mandeep Kaur, "A Hybrid Approach for Blur Detection Using Naïve Bayes Nearest Neighbor Classifier", International Journal of Information Technology and Computer Science (IJITCS), Vol.8, No.12, pp.75-82, 2016. DOI: 10.5815/ijitcs.2016.12.09
Copyright © 2016 MECS
I.J. Information Technology and Computer Science, 2016, 12, 83-90 Published Online December 2016 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2016.12.10
Performance Optimization in WBAN Using Hybrid BDT and SVM Classifier Madhumita Kathuria YMCA University of Science and Technology, Faridabad, India E-mail: madhumita.fet@mriu.edu.in
Sapna Gambhir YMCA University of Science and Technology, Faridabad, India E-mail: sapnagambhir@gmail.com
Abstract—Wireless Body Area Networks (WBANs) have attracted significant research interest in various applications due to their self-automation and advanced sensor technology. The most severe issue in a WBAN is to sustain its Quality of Service (QoS) under a dynamically changing environment such as healthcare and patient monitoring systems. Another critical issue in WBAN is heterogeneous packet handling in such a resource-constrained network. In this paper, a new hybrid classifier combining a Binary Decision Tree and a Support Vector Machine is proposed to tackle these important challenges. The proposed hybrid classifier decomposes the N-class classification problem into N-1 sub-problems, each separating a pair of subclasses. The protocol dynamically updates the priority of packets and nodes, and adjusts the data rate, packet transmission order and time, and resource distribution for the nodes based on node priority. The proposed protocol is implemented and simulated using the NS-2 network simulator. The results generated for the proposed approach show that the new protocol can outperform in a dynamic environment and yields better performance by leveraging the advantages of both the Binary Decision Tree, in terms of efficient computation, and the Support Vector Machine, for high classification accuracy. This hybrid classifier significantly reduces the loss ratio and delay and increases the packet delivery ratio and throughput.

Index Terms—Binary decision tree, Multi-class packet classification, Packet prioritization, Support vector machine, Wireless Body Area Network.
I. INTRODUCTION

A Wireless Body Area Network (WBAN) consists of a number of tiny sensors located on or in the human body, depending on the needs of a patient. These sensors are equipped with a wireless interface, are capable of sensing the required health data, and can transmit the data to a central Controller Unit (CU), such as a personal digital assistant (PDA) or a smartphone, for pre-processing. The goal of the healthcare WBAN architecture is to remotely sense heterogeneous vital signals and to provide an alert message in critical conditions. This system will reduce the need for dedicated
medical personnel for patient health-status monitoring and help patients lead a normal life while still providing them with high-quality medical services. It is used to store health-related data, analyze the progress of a health condition over a period of time, predict an emergency situation, and send an alert message to the Medical Server Unit (MSU). The pre-processed data can be used by medical personnel for further clinical diagnosis. To provide better Quality of Service (QoS) [1-10] in a heterogeneous WBAN, accurate classification of traffic is essential. The proposed protocol can classify data packets and assign them different priorities, guaranteeing a certain level of performance in terms of better data delivery, reduced delay and improved throughput [11]. It can also update the various parameters dynamically and adjust them accordingly. The main goal of the proposed architecture is to design a packet classification approach that enhances resource utilization and improves the quality of medical services. In this paper, a step toward designing such an efficient classification [12-14] algorithm is taken. The proposed packet classification unit categorizes packets into different classes and assigns each of them a priority. To do so, the classification unit needs a multi-class classification technique. Support Vector Machine (SVM) classifiers [15-19] are multi-class classifiers that often have superior recognition rates in comparison with other classification methods. In SVMs, multi-class classification problems are usually decomposed into several two-class problems using Binary Decision Trees (BDTs) [20-25]. We therefore optimize the performance of the packet classification unit with the proposed hybrid Binary Decision Tree and Support Vector Machine based classifier, which uses a binary-tree architecture whose nodes are SVMs for solving multi-class problems [25-29]. The proposed classifier leverages the advantages of both the Binary Decision Tree, in terms of efficient computation, and SVMs, in terms of high classification accuracy. In this paper, we empirically investigate the performance of the hybrid classifier and find that this hybrid approach yields
noticeably better analytical performance. The organization of the paper is as follows: Section II presents the work done so far on this topic. Section III illustrates the proposed protocol. Section IV describes the proposed hybrid classifier algorithm. Section V compares the performance of the proposed protocol with an existing protocol. Section VI concludes the paper.
II. RELATED WORK

The existing work related to traffic-flow classification is summarized below. The architecture in the Optimized Congestion Management Protocol (OCMP) for Healthcare Wireless Sensor Networks [10] assigns dynamic weights to each child node and follows a fair queue-management policy: if any child node's queue is likely to become full, the free space of another child node's queue can be utilized. This approach helps in minimizing the packet drop and loss rates for high-priority traffic, reducing congestion, and providing fair scheduling. It also utilizes bandwidth better and achieves higher throughput; however, it lacks detection and classification of heterogeneous traffic flows in the frequently changing, dynamic environment of a healthcare system.

In [12], the authors describe the margin line between hard and soft classification techniques. According to them, a soft classifier explicitly estimates class probabilities and then performs classification based on the estimated probabilities; in contrast, a hard classifier directly targets the classification decision boundary without producing probability estimates. The authors of [13-14] provide a study and analysis of various packet classification techniques; they explore ideas in this field and briefly highlight the advantages, disadvantages, usage and platforms of these techniques.

In [26], a novel architecture of Support Vector Machine classifiers utilizing a binary decision tree (SVM-BDT) for solving multi-class problems is presented. A clustering algorithm that uses Euclidean distance measures in the kernel space converts the multi-class problem into a binary decision tree, with Euclidean distance measuring the similarity between classes. The experimental results show that training and testing speed are improved while accuracy remains comparable to, or even better than, other SVM multi-class methods. The classifiers proposed in [27-29] describe the use of SVM-BDT methods. These studies show that the testing time of SVM-BDT is noticeably better than that of the one-against-all and one-against-one methods; the SVM-BDT method was faster in the recognition phase for the Pendigit dataset, but slightly slower than the DAGSVM method for the Statlog dataset.

The limitations of the existing protocols and the requirements of multi-class, multi-objective heterogeneous traffic classification motivate us to design a novel protocol that effectively improves the speed and overall performance of the healthcare WBAN system.
III. PROPOSED ARCHITECTURE

Nowadays, monitoring patient health conditions is becoming increasingly important for immediate and emergency medical services. Furthermore, many vital signals must be monitored continuously and transmitted within a time bound. The proposed architecture, as discussed in [30], has a Controller Unit (CU) that gathers data packets from different sensor nodes and forwards them in a definite order towards the Medical Server Unit (MSU), positioned at a remote place. The order of packet forwarding depends on packet priority, and the priority is obtained after applying the classification policies. Classification of packets with guaranteed Quality of Service (QoS) in a heterogeneous and dynamic environment is a challenging task. In this paper, we have studied the existing protocols and identified the issues related to classification, emergency services, and resource utilization in the frequently changing environment of a WBAN. These limitations of existing protocols lead us to introduce a new hybrid classification technique in the proposed protocol, and the simulation results show that the proposed protocol outperforms existing ones in terms of throughput, delay and packet delivery ratio.

A. Wireless Body Area Network (WBAN)

All the sensors in the WBAN unit are capable of sensing the required data and transmitting it to a central Controller Unit (CU). In the Data Sensing and Pre-processing unit, the vital signal is sensed and processed into the desired format. This unit also calculates the data sending rate, packet transmission gap, and bandwidth allocation for each sensor node dynamically, where the data sending rate is defined as the number of packets to be sent in a given interval, the packet transmission time gap is the transmission time gap between two consecutive packets, and the bandwidth allocation is the amount of bandwidth assigned to an individual node. The Packet Dispatching unit performs classification, queuing and scheduling of packets at the sensor node according to their traffic or flow type. Real-time, bandwidth-deficient traffic flows are assigned a higher priority than non-real-time traffic; hence, they are queued into the high-priority queue and delivered to the CU with the minimum possible delay. Its main motive is to handle heterogeneous packet flows while reducing delay and starvation.
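To make the dispatching behaviour just described concrete, the following Python sketch (not part of the paper; the `Packet` fields and the function names are hypothetical) places bandwidth-deficient real-time flows in a high-priority queue, everything else in a low-priority queue, and always serves the high-priority queue first.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class Packet:
    flow_type: str        # "RT" (real-time) or "NRT" (non-real-time); assumed encoding
    bw_deficient: bool    # True when the flow currently lacks its required bandwidth
    payload: bytes = b""

high_q: deque = deque()   # bandwidth-deficient real-time traffic
low_q: deque = deque()    # all other traffic

def dispatch(pkt: Packet) -> None:
    """Sensor-side dispatching: real-time flows short of bandwidth get the high-priority queue."""
    if pkt.flow_type == "RT" and pkt.bw_deficient:
        high_q.append(pkt)
    else:
        low_q.append(pkt)

def next_packet() -> Optional[Packet]:
    """Always drain the high-priority queue before the low-priority one."""
    if high_q:
        return high_q.popleft()
    if low_q:
        return low_q.popleft()
    return None
```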
B. Controller Unit (CU)

The Controller Unit aggregates all data received from the various sensor nodes in the Aggregation unit and updates the database accordingly. It also pre-processes packets and transmits them to the Medical Server Unit for diagnosis. The Packet Handling unit maintains both data and control packets in the heterogeneous WBAN. This unit is also responsible for the detection of critical situations; the classification, queuing and scheduling of packets; and the updating of priorities in a dynamic healthcare environment.

The main responsibility of the Alerting unit is the immediate detection and notification of critical or emergency conditions. Because of the uncertainty in a patient's condition, accurate detection is a necessary requirement for the healthcare system: few and uneven observations give an imperfect depiction of what is actually happening and result in an inaccurate diagnosis. This unit therefore tries to detect a patient's critical condition by focusing on multiple discrete observations over a particular time interval rather than on a single observation, which gives a better picture of the situation and removes inaccurate results. Upon reception of a packet, it checks for variation in the sensed values. If it identifies variation, it calculates their standard deviation; if the standard deviation is greater than the critical threshold value (provided by the medical personnel), it activates the alert index field in the packet header and passes the packet to the classification unit.

The main job of the Packet Classification unit is to classify packets based on traffic or flow type and level of criticality, and to assign them a unique priority. It categorizes the heterogeneous traffic into one of these categories: Real-time traffic, Alert or Emergency traffic, On_Demand traffic, and Normal traffic. Real-time packets are stored and sent when an appropriate amount of bandwidth is available. Alert packets have the highest priority and are sent immediately, without any delay. On_Demand packets are sensed and sent to doctors on request. Normal packets are sent in a routine way.

The Queuing unit holds packets until the scheduler fetches and serves them. The proposed protocol uses two Double-Ended Priority Queues (DEPQs): one stores high-priority packets (Alert or Real-time) and the other stores low-priority packets (On_Demand or Normal).

The proposed Scheduling unit introduces a new scheduling approach, called Rate-based Earliest Deadline First (REDF) scheduling, which is used to reduce queuing time and drop rates. REDF finds and drops those low-priority packets whose deadline has been exceeded. The main idea behind this scheduling is resource utilization. The strength of this method is that it deals with heterogeneous flows with considerably different bandwidth requirements; it also avoids the starvation and delay problems faced by low-priority queues and low-priority packets.

In the Prioritization unit, the priority index of a sensor changes dynamically over time by activating the prioritization field. The priority is updated by the medical personnel by analyzing some fields of the packet, i.e. the sensor node's previous priority and level of criticality. A medical person can also ask for some On-Demand data by making this field active.
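The alert check performed by the Alerting unit can be illustrated with a short Python sketch. This is only a reading of the description above, not the authors' code; the window contents, the window length and the threshold value are assumptions.

```python
import statistics

def update_alert_index(window, critical_threshold):
    """Return 1 (alert) when the spread of the recent observations exceeds the
    critical threshold supplied by the medical personnel, otherwise 0."""
    if len(window) < 2:
        return 0                      # not enough observations to judge variation
    deviation = statistics.stdev(window)
    return 1 if deviation > critical_threshold else 0

# Example with hypothetical heart-rate readings from one monitoring interval
readings = [78, 80, 79, 118, 52]      # erratic values suggest a critical condition
print(update_alert_index(readings, critical_threshold=10.0))   # -> 1
```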
C. Medical Server Unit (MSU)

The MSU receives data from the CU and takes decisions intelligently. In the Packet Monitoring unit, the MSU collects information about a patient's vital signals, displays it on the screen and stores it for later processing. A medical database is built on the workstation computer, which stores the patient's identification data, physiological data and other diagnosis-related data. The Decision Making unit provides a filter facility: if the received packet is a highest-priority packet with alert index = 1, the MSU issues an alert message to the medical personnel. Upon reception of an alert message, the medical person might request On_Demand data from the patient or update old values of some pre-defined parameters, i.e. the priority of a sensor node, the vital signal range, the monitoring time, etc.
IV. PROPOSED PACKET CLASSIFIER

During classification, packets are categorized into distinct data traffics and then served accordingly, so packet classification is an essential processing task in the CU. It is a mechanism that places packets into an appropriate class and assigns them a specific priority. Some attributes of the packet header, i.e. the packet flow type, packet size, bandwidth, On_Demand index and Alert index fields, are mapped into a priority index. The priority determines the order in which packets are served or transmitted to the MSU during a particular time period.

The Packet Classification unit classifies the entire traffic into four categories: Normal, On-Demand, Real-time, and Alert or Emergency traffic. Normal traffic is transmitted in an uninterrupted manner; this includes unobtrusive and routine health monitoring of a patient. On-Demand traffic is initiated by the medical person to obtain certain information, mostly for the purpose of diagnosis and prescription. Real-time traffic is transmitted as soon as it is generated, provided bandwidth is available. Alert traffic is commenced when a sensed value exceeds a predefined threshold range and should be transmitted within an acceptable delay; alert traffic is not generated on a regular basis and is totally unpredictable.

The proposed Packet Classification unit applies a novel classifier employing a hybrid of a Binary Decision Tree and Support Vector Machines (named BDSVMT) to multiple header fields of a packet. This classifier compares the header fields of incoming packets with a set of predefined rules and decides their priority. A Binary Decision Tree (BDT) is a rule-based classification technique in which the rules, normally expressed as if-then statements, are converted into a binary decision tree, while a Support Vector Machine (SVM), based on statistical learning theory, classifies data by determining a set of support vectors, members of the training inputs, that outline a hyper-plane in feature space. This hybrid classifier is very useful for multi-class packet classification, which usually decomposes the multi-class categories into several two-class groups by
utilizing a binary decision tree. The classifier provides a tree-based architecture, shown in Fig. 1, that contains a binary SVM at each internal node.
[Fig. 1 diagram: a binary tree whose internal nodes are the classifiers SVM-1, SVM-2, SVM-3 and SVM-4. Each node splits the incoming traffic into a +1 and a -1 branch, and the leaves assign the priorities Pr = NULL, Pr = 1, Pr = 2, Pr = 3 and Pr = 4.]

Fig.1. The Hybrid Packet classifier.
In this classifier, N-1 SVMs need to be trained for an N-class problem. The proposed classification unit considers five classes of heterogeneous traffic, so it needs four SVMs. At each node of the BDT, the traffic is divided into two groups, since the SVM produces two outputs (+1 for the positive group and -1 for the negative group). Each class is then assigned a particular priority index (1 for Alert traffic, 2 for bandwidth-deficient Real-time traffic, 3 for On_Demand traffic, 4 for Normal traffic, and NULL for forwarded Real-time traffic). The rule set consists of five rules providing the five priority classes, as given in Table 1, where R = {R1, R2, R3, R4, R5} denotes the set of rules, RT denotes Real-time traffic or flow, NRT denotes Non-Real-time traffic or flow, PSize denotes the size of the packet, AIndex denotes the value of the Alert index field, BWav denotes the availability of bandwidth, and ODIndex denotes the value of the On_Demand index field.
Table 1. Rule set table for packet classification

Rule | RT/NRT | PSize | BWav | AIndex | ODIndex | Priority
R1   | RT     | fixed | Yes  | -      | -       | NULL
R2   | RT     | fixed | No   | -      | -       | 2
R3   | NRT    | fixed | -    | 1      | 0       | 1
R4   | NRT    | fixed | -    | 0      | 1       | 3
R5   | NRT    | fixed | -    | 0      | 0       | 4
The intermediate nodes of the Binary Decision based SVM tree (BDSVMT) follow these specific rules and assign the priority accordingly.
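Read as executable logic, the rule set R1-R5 of Table 1 amounts to the following Python sketch; the function name and the argument encoding (booleans and 0/1 index flags) are illustrative assumptions, and None stands for the NULL priority.

```python
def assign_priority(rt: bool, bw_available: bool, alert_idx: int, od_idx: int):
    """Evaluate rules R1-R5 of Table 1 and return the packet priority index."""
    if rt and bw_available:                          # R1: forwarded real-time traffic
        return None                                  # NULL priority
    if rt and not bw_available:                      # R2: bandwidth-deficient real-time
        return 2
    if not rt and alert_idx == 1 and od_idx == 0:    # R3: alert / emergency traffic
        return 1
    if not rt and alert_idx == 0 and od_idx == 1:    # R4: on-demand traffic
        return 3
    if not rt and alert_idx == 0 and od_idx == 0:    # R5: normal traffic
        return 4
    raise ValueError("packet does not match any rule in Table 1")

print(assign_priority(rt=False, bw_available=False, alert_idx=1, od_idx=0))  # -> 1
```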
A. Mathematical Formulation

The aim of the hybrid classifier is to create a statistical model, using SVMs in a BDT, that predicts the priority value (pr) of a packet P from its header attributes, treated as a vector v_i. Assume a set of packet field samples P_d taken from a packet and represented as

\[ \{ (v_i, pr_i) \mid i = 1, \ldots, P_d \} \tag{1} \]

where v_i is the value of attribute i in the header field of the packet, pr_i is the priority index, and pr_i ∈ {1, 2, 3, 4, NULL}. The separating hyper-plane is defined using Eq. (2):

\[ f(v) = w \cdot \phi(v) + b \tag{2} \]

where w is the weight vector, b is the optimal bias, and ϕ is the nonlinear mapping function applied to the input vectors. Following the rule set of Table 1, the mapping from header fields to priority classes is expressed as

\[ pr = \mathrm{NULL} \quad \text{if } P_{RT} = 1 \text{ and } BW_{av} = \text{Yes} \tag{3} \]
\[ pr = 2 \quad \text{if } P_{RT} = 1 \text{ and } BW_{av} = \text{No} \tag{4} \]
\[ pr = 1 \quad \text{if } P_{NRT} = 1,\ P_{AIndex} = 1 \text{ and } P_{ODIndex} = 0 \tag{5} \]
\[ pr = 3 \quad \text{if } P_{NRT} = 1,\ P_{AIndex} = 0 \text{ and } P_{ODIndex} = 1 \tag{6} \]
\[ pr = 4 \quad \text{if } P_{NRT} = 1,\ P_{AIndex} = 0 \text{ and } P_{ODIndex} = 0 \tag{7} \]

where P_RT denotes the Real-time field, P_NRT the Non-real-time field, P_Size the packet size field, P_AIndex the Alert index field, P_ODIndex the On_Demand index field, and pr the priority field. At each node of the tree, the samples of the two groups are labelled y_i ∈ {+1, -1}. The optimization is done by minimizing w, which maximizes the distance between the closest points and the hyper-plane itself:

\[ \min_{w,\,b,\,e}\ \frac{1}{2} \lVert w \rVert^2 + c \sum_i e_i \tag{8} \]

subject to

\[ y_i \left( w \cdot \phi(v_i) + b \right) \ge 1 - e_i \tag{9} \]

where i = 1, ..., P_d. Applying the Lagrangian method gives

\[ L(w, b, e, a) = \frac{1}{2} \lVert w \rVert^2 + c \sum_i e_i - \sum_i a_i \left[ y_i \left( w \cdot \phi(v_i) + b \right) - 1 + e_i \right] \tag{10} \]

such that a_i ≥ 0, where c is the constant used for regularization and e_i is the normalized slack variation with e_i ≥ 0. The output function can be expressed as

\[ f(v) = \sum_i a_i y_i K(v_i, v) + b \tag{11} \]

where a_i is the Lagrangian multiplier and K is the kernel induced by ϕ. On solving the above equations, the classification at each SVM node can be expressed as

\[ \mathrm{class}(v) = \begin{cases} +1, & f(v) \ge 0 \\ -1, & f(v) < 0 \end{cases} \qquad \lvert f(v) - \mathrm{class}(v) \rvert \le e \tag{12} \]

where e is the maximum allowed error. In the training set, all of the above five rules with multiple attributes are used to define the priority, while the test set is used to validate it in simulation and to determine the accuracy of the classifier. The proposed classifier yields faster search and recognition, and its performance is significantly better than that of the other protocols. The main focus of the proposed hybrid classifier is to obtain an optimized solution by applying SVM functions to binary decision rules: BDTs are much faster than SVMs at classifying new instances, while SVMs perform better than BDTs in terms of classification accuracy.
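A minimal sketch of how such a BDSVMT could be trained and queried with off-the-shelf tools is given below. It uses scikit-learn's SVC purely for illustration; the class ordering, feature encoding, linear kernel and parameters are assumptions, not the configuration used in the paper, and priority 0 stands in for NULL.

```python
import numpy as np
from sklearn.svm import SVC

class BDSVMT:
    """Binary decision tree of SVMs: node k separates class class_order[k] (+1)
    from all remaining classes (-1), so N classes need N-1 SVM nodes."""

    def __init__(self, class_order, C=1.0, kernel="linear"):
        self.class_order = list(class_order)
        self.nodes = [SVC(C=C, kernel=kernel) for _ in self.class_order[:-1]]

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        for k, cls in enumerate(self.class_order[:-1]):
            mask = np.isin(y, self.class_order[k:])          # samples not yet peeled off
            labels = np.where(y[mask] == cls, 1, -1)
            self.nodes[k].fit(X[mask], labels)
        return self

    def predict_one(self, x):
        x = np.asarray(x, dtype=float).reshape(1, -1)
        for k, cls in enumerate(self.class_order[:-1]):
            if self.nodes[k].predict(x)[0] == 1:
                return cls
        return self.class_order[-1]

# Toy usage with a made-up 4-field header encoding (RT, BWav, AIndex, ODIndex)
X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]] * 2
y = [0, 2, 1, 3, 4] * 2                     # 0 plays the role of the NULL priority
model = BDSVMT(class_order=[0, 1, 2, 3, 4]).fit(X, y)
print(model.predict_one([0, 0, 1, 0]))      # alert-style header, expected priority 1
```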
V. PERFORMANCE ANALYSIS

In this section, simulation is carried out to evaluate the performance of the proposed protocol under different scenarios. The simulation compares the performance of the proposed protocol (denoted pr_) with that of the existing OCMP protocol (denoted ex_). The implementation is done using the network simulator NS-2.35. The performance metrics that reflect the most fundamental characteristics of the proposed protocol are Packet Delivery Ratio, Throughput, and End-to-End Delay, each studied with respect to variation in the number of nodes. The results for these three metrics are plotted using Xgraph (an analysis tool for NS-2.35).

A. Packet Delivery Ratio (PDR)

The PDR is defined as the ratio of the total number of packets received by the receiver to the total number of packets transmitted from the source node. PDR can be calculated using Eq. (13):

\[ PDR = \frac{\sum P_{Transmit} - \sum P_{Lost}}{\sum P_{Transmit}} \tag{13} \]

where P_Lost denotes the number of packets lost in transmission and P_Transmit denotes the number of packets transmitted from the source node. The comparison graph for PDR, given in Fig. 2, shows that the proposed protocol transmits and receives a larger number of packets than the existing one, because more, and more useful, packets are transmitted in the preliminary phase.

Fig.2. The comparison graph for PDR.
B. Throughput

Throughput is the number of packets/bytes received by the receiver per unit of monitoring time. It is an important metric for analyzing network protocols, measuring the total number of packets received at the receiver side with respect to the total monitoring time. The total throughput is estimated from Eq. (14):

\[ Throughput = \frac{\left( \sum P_{Transmit} - \sum P_{Lost} \right) \times P_{Size}}{T_{MonitorTimeStop} - T_{MonitorTimeStart}} \tag{14} \]

where P_Lost denotes the number of packets lost in transmission, P_Transmit denotes the number of packets transmitted from the source node, P_Size denotes the size of a packet, T_MonitorTimeStop denotes the time when patient monitoring is stopped, and T_MonitorTimeStart denotes the time when monitoring is started. The comparison graph of throughput, given in Fig. 3, shows that OCMP exhibits major variations for different node counts, whereas the proposed DPPH protocol shows consistent results as the number of nodes varies. DPPH achieves significantly higher throughput than OCMP, which is due to the proposed dynamic prioritization based classification unit.

Fig.3. The comparison graph for Throughput.

C. End-to-End Delay

The delay is defined as the average time taken by a packet to arrive at the destination. It includes the delay caused by the route discovery process and the queuing time in data packet transmission. Only the data packets that are successfully delivered to destinations are counted. The delay is calculated by subtracting the packet sending time from the packet receiving time.
The total end-to-end delay is calculated from Eq. (15):

\[ Delay = \frac{\sum \left( P_{RcvTime} - P_{SndTime} \right)}{\sum P_{Transmit} - \sum P_{Lost}} \tag{15} \]

where P_Lost denotes the number of packets lost in transmission, P_Transmit denotes the number of packets transmitted from the source node, P_RcvTime denotes the time at which a packet arrives at the destination, and P_SndTime the time at which a packet is sent from the source.

The simulation results in Fig. 4 show that the delay of the proposed protocol is nearly the same across different numbers of nodes. The variation in delay is negligible for both protocols, but the delay of DPPH is lower than that of the OCMP protocol, which is a good indication of improved network throughput.

Fig.4. The comparison graph for E2E Delay.

The monitoring of a patient was conducted for a time interval of 60 minutes. The patient was implanted with different types of sensor nodes, and the number of sensor nodes was varied with time. The implementation was done in the network simulator NS-2.35 for this monitoring interval. The generated results show that the proposed protocol significantly improves network performance by increasing packet delivery ratio and throughput while reducing delay.
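The three metrics of Eqs. (13)-(15) are straightforward to compute from a simulation trace. The sketch below only illustrates the formulas with hypothetical numbers; it is not output from the NS-2.35 runs reported here.

```python
def pdr(transmitted, lost):
    """Eq. (13): fraction of transmitted packets that reach the receiver."""
    return (transmitted - lost) / transmitted

def throughput(transmitted, lost, packet_size, t_start, t_stop):
    """Eq. (14): delivered bytes per unit of monitoring time."""
    return (transmitted - lost) * packet_size / (t_stop - t_start)

def end_to_end_delay(recv_times, send_times):
    """Eq. (15): mean delay over the successfully delivered packets."""
    delays = [r - s for r, s in zip(recv_times, send_times)]
    return sum(delays) / len(delays)

# Hypothetical 60-minute run: 1000 packets of 64 bytes sent, 40 lost
print(pdr(1000, 40))                                        # 0.96
print(throughput(1000, 40, 64, 0.0, 3600.0))                # ~17.07 bytes per second
print(end_to_end_delay([1.2, 2.5, 3.1], [1.0, 2.2, 2.6]))   # ~0.33 seconds
```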
VI. CONCLUSION

A critical healthcare system must be well designed for vital-signal monitoring and for determining real-time conditions. Our proposed protocol implements a hybrid Binary Decision Tree and Support Vector Machine (BDSVMT) based classification mechanism, which ensures the delivery of critical and important packets in an urgent manner. The classifier is designed to provide superior multi-class classification by combining a BDT architecture, which requires much less computation to decide the class of an unknown sample, with SVMs, which utilize distance measures in the kernel space to convert the multi-class problem into a binary decision tree. The dynamically updated priority helps to sense and transmit the more important data with acceptable delay and minimum loss. The simulation results show that the proposed protocol improves packet delivery ratio and throughput while reducing packet transmission and queuing delay.

REFERENCES
[1] G. N. Bradai et al., "QoS architecture over WBANs for remote vital signs monitoring applications", 12th Annual IEEE Consumer Communications and Networking Conference, pp. 1-6, 2015.
[2] M. Kathuria and S. Gambhir, "Quality of service provisioning transport layer protocol for WBAN system", International Conference on Optimization, Reliability and Information Technology (IEEE Xplore), pp. 222-228, 2014.
[3] M. A. Ameen, A. Nessa, and K. S. Kwak, "QoS Issues with Focus on Wireless Body Area Networks", Third International Conference on Convergence and Hybrid Information Technology, vol. 1, pp. 801-807, 2008.
[4] M. Kathuria and S. Gambhir, "Layer wise Issues of Wireless Body Area Network: A Review", International Conference on Reliability, Infocom Technologies and Optimization (ICRITO), pp. 330-336, Jan 2013.
[5] S. Misra, V. Tiwari, and M. S. Obaidat, "LACAS: Learning Automata-Based Congestion Avoidance Scheme for Healthcare Wireless Sensor Networks", IEEE Journal on Selected Areas in Communications, vol. 27, no. 4, pp. 466-479, 2009.
[6] S. Gambhir, V. Tickoo, and M. Kathuria, "Priority based congestion control in WBAN", Eighth International Conference on Contemporary Computing (SCOPUS, DBLP, IEEE Xplore), pp. 428-433, 2015.
[7] N. Farzaneh and M. H. Yaghmaee, "Joint Active Queue Management and Congestion Control Protocol for Healthcare Applications in Wireless Body Sensor Networks", 9th International Conference on Smart Homes and Health Telematics (Springer Verlag), pp. 88-95, 2011.
[8] N. Farzaneh, M. H. Yaghmaee, and D. Adjeroh, "An Adaptive Congestion Alleviating Protocol for Healthcare Applications in Wireless Body Sensor Networks: Learning Automata Approach", Journal of Science and Technology (Springer), vol. 44, no. 1, pp. 31-41, 2012.
[9] M. H. Yaghmaee, N. F. Bahalgardi, and D. Adjeroh, "A Prioritization Based Congestion Control Protocol for Healthcare Monitoring Application in Wireless Sensor Networks", Wireless Personal Communications (Springer), vol. 72, no. 4, pp. 2605-2631, April 2013.
[10] A. A. Rezaee, M. H. Yaghmaee, and A. M. Rahmani, "Optimized Congestion Management Protocol for Healthcare Wireless Sensor Networks", Wireless Personal Communications (Springer), vol. 75, no. 1, pp. 11-34, 2013.
[11] M. Kathuria and S. Gambhir, "Reliable delay sensitive loss recovery protocol for critical health data transmission system", 2015 International Conference on Futuristic Trends on Computational Analysis and Knowledge Management (IEEE Xplore), pp. 333-339, Feb 2015.
[12] Y. Liu et al., "Hard and Soft Classification? Large margin unified machines", Taylor and Francis, vol. 106, pp. 166-177, 2011.
[13] D. E. Taylor and J. S. Turner, "ClassBench: a packet classification benchmark", IEEE/ACM Transactions on Networking, vol. 15, no. 3, pp. 499-511, 2007.
[14] D. E. Taylor, "Survey and taxonomy of packet classification techniques", ACM Computing Surveys (CSUR), vol. 37, no. 3, pp. 238-275, 2005.
[15] F. Huang and Y. Lu-Ming, "Research on classification of hyperspectral remote sensing imagery based on BDT SMO and combined features", Journal of Multimedia, vol. 9, no. 3, pp. 456-462, 2014.
[16] N. Xue, "Comparison of multi-class support vector machines", Computer Engineering and Design, vol. 32, no. 5, pp. 1792-1795, 2011.
[17] Q. Ai, Y. Qin, and J. Zhao, "An improved directed acyclic graphs support vector machine", Computer Engineering and Science, vol. 33, no. 10, pp. 145-148, 2011.
[18] G. Feng, "Parameter optimizing for Support Vector Machines classification", Computer Engineering and Applications, vol. 47, no. 3, pp. 123-124, 2011.
[19] C. C. Chang and C. J. Lin, "LIBSVM: A library for support vector machines", ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 1-27, 2011.
[20] M. Kathuria and S. Gambhir, "Leveraging machine learning for optimize predictive classification and scheduling E-Health traffic", International Conference on Recent Advances and Innovations in Engineering (IEEE Xplore), pp. 1-7, 2014.
[21] L. Wenlong and X. Changzheng, "Parallel Decision Tree Algorithm Based on Combination", IEEE International Forum on Information Technology and Applications (IFITA), Kunming, pp. 99-101, July 2010.
[22] M. Kathuria and S. Gambhir, "Genetic Binary Decision Tree based Packet Handling schema for WBAN system", Recent Advances in Engineering and Computational Sciences (IEEE Xplore), pp. 1-6, 2014.
[23] S. Geetha, N. Ishwarya, and N. Kamaraj, "Evolving decision tree rule based system for audio stego anomalies detection based on Hausdorff distance statistics", Information Sciences Journal (Elsevier), vol. 180, no. 13, pp. 2540-2559, 2010.
[24] K. Bhaduri, R. Wolff, C. Giannella, and H. Kargupta, "Distributed Decision-Tree Induction in Peer-to-Peer Systems", Statistical Analysis and Data Mining (John Wiley and Sons), vol. 1, pp. 1-35, June 2008.
[25] D. Kocev, C. Vens, J. Struyf, and S. Dzeroski, "Ensembles of multi-objective decision trees", 18th European Conference on Machine Learning (DBLP, Springer), pp. 624-631, 2007.
[26] G. Madzarov, D. Gjorgjevikj, and I. Chorbev, "A Multi-class SVM Classifier Utilizing Binary Decision Tree", Informatica, vol. 33, pp. 233-241, 2009.
[27] X. Wang and Y. Qin, "Research on SVM multi-class classification based on binary tree", Journal of Hunan Institute of Engineering, vol. 18, pp. 68-70, 2008.
[28] G. Madzarov, D. Gjorgjevikj, and I. Chorbev, "Multi-class classification using support vector machines in decision tree architecture", IEEE EUROCON 2009, pp. 288-295, 2009.
[29] K. K. Reddy and V. Reddy, "A Survey on Issues of Decision Tree and Non-Decision Tree Algorithms", International Journal of Artificial Intelligence and Applications for Smart Devices (SERSC), no. 1, pp. 9-32, 2016.
[30] S. Gambhir and M. Kathuria, "DWBAN: Dynamic Priority based WBAN Architecture for Healthcare System", 3rd International Conference on Computing for Sustainable Global Development (IEEE Xplore), pp. 3380-3386, 2016.
[31] M. Kathuria and S. Gambhir, "Comparison Analysis of proposed DPPH protocol for Wireless Body Area Network", International Journal of Computer Applications (IJCA), vol. 144, pp. 36-41, 2016.
[32] M. Kathuria and S. Gambhir, "Security and Privacy Assault of Wireless Body Area Network System", International Conference on Reliability, Infocom Technologies and Optimization (ICRITO), pp. 223-229, Jan 2013.
Authors' Profiles

Madhumita Kathuria is an Assistant Professor in the Computer Science and Engineering Department, Manav Rachna International University, Faridabad, India. She is pursuing her PhD (Computer Science & Engineering) at YMCA University of Science and Technology. She has published more than 15 papers in various international and national journals and conferences. Her areas of interest include Wireless Body Area Networks, Sensor Networks, Network Security, Digital Image Processing, and Learning and Computational Techniques.
Dr. Sapna Gambhir is an Assistant Professor in the Computer Engineering Department at YMCA University of Science and Technology, Faridabad, India. She completed her PhD at Jamia Millia Islamia University in 2010. She has published more than 50 papers in various national and international
journals and conferences. Her areas of interest include Wireless Sensor Networks, Ad-hoc and Social Networks, and Security of Wireless Networks.
How to cite this paper: Madhumita Kathuria, Sapna Gambhir, "Performance Optimization in WBAN Using Hybrid BDT and SVM Classifier", International Journal of Information Technology and Computer Science (IJITCS), Vol.8, No.12, pp.83-90, 2016. DOI: 10.5815/ijitcs.2016.12.10
Instructions for Authors

Manuscript Submission

We invite original, previously unpublished research papers, review, survey and tutorial papers, application papers, plus case studies, short research notes and letters, on both applied and theoretical aspects. Manuscripts should be written in English. All papers except surveys should ideally not exceed 18,000 words (15 pages) in length. Whenever applicable, submissions must include the following elements: title, authors, affiliations, contacts, abstract, index terms, introduction, main text, conclusions, appendixes, acknowledgement, references, and biographies. Papers should be formatted into A4-size (8.27″×11.69″) pages, with main text of 10-point Times New Roman, in single-spaced two-column format. Figures and tables must be sized as they are to appear in print. Figures should be placed exactly where they are to appear within the text. There is no strict requirement on the format of the manuscripts; however, authors are strongly recommended to follow the format of the final version. Papers should be submitted to the MECS Publisher, Unit B 13/F PRAT COMM'L BLDG, 17-19 PRAT AVENUE, TSIMSHATSUI KLN, Hong Kong (Email: ijitcs@mecs-press.org, Paper Submission System: www.mecs-press.org/ijitcs/submission.html), with a covering email clearly stating the name, address and affiliation of the corresponding author. Paper submissions are accepted only in PDF; other formats are not acceptable. Each paper will be provided with a unique paper ID for further reference. Authors may suggest 2-4 reviewers when submitting their works, by providing us with the reviewers' title, full name and contact information. The editor will decide whether the recommendations will be used or not.
Conference Version

Submissions previously published in conference proceedings are eligible for consideration provided that the author informs the Editors at the time of submission and that the submission has undergone substantial revision. In the new submission, authors are required to cite the previous publication and very clearly indicate how the new submission offers substantively novel or different contributions beyond those of the previously published work. The appropriate way to indicate that your paper has been revised substantially is for the new paper to have a new title. Authors should supply a copy of the previous version to the Editor, and provide a brief description of the differences between the submitted manuscript and the previous version. If the authors provide a previously published conference submission, the Editors will check the submission to determine whether there has been sufficient new material added to warrant publication in the Journal. The MECS Publisher's guidelines are that the submission should contain a significant amount of new material, that is, material that has not been published elsewhere. New results are not required; however, the submission should contain expansions of key ideas, examples, and so on, of the conference submission. The paper submitted to the journal should differ from the previously published material by at least 50 percent.
Review Process

Submissions are accepted for review on the understanding that the same work has been neither submitted to, nor published in, another publication. Concurrent submission to other publications will result in immediate rejection of the submission. All manuscripts will be subject to a well-established, fair, unbiased peer review and refereeing procedure, and are considered on the basis of their significance, novelty and usefulness to the Journal's readership. The reviewing structure will always ensure the anonymity of the referees. The review output will be one of the following decisions: Accept, Accept with minor revision, Accept with major revision, Reject with a possibility of resubmitting, or Reject. The review process may take approximately three months to be completed. Should authors be requested by the editor to revise the text, the revised version should be submitted within three months for a major revision or one month for a minor revision. Authors who need more time are kindly requested to contact the Editor. The Editor reserves the right to reject a paper if it does not meet the aims and scope of the journal, if it is not technically sound, if it is not revised satisfactorily, or if it is inadequate in presentation.
Revised and Final Version Submission

The revised version should follow the same formatting requirements as the final version, plus a short summary of the modifications the authors have made and the authors' comments. Authors are requested to use the MECS Publisher Journal Style for preparing the final camera-ready version. A template in PDF and an MS Word template can be downloaded from the web site. Authors are requested to strictly follow the guidelines specified in the templates. Only PDF format is acceptable. The PDF document should be sent as an open file, i.e. without any data protection. Authors should submit their paper electronically through email to the Journal's submission address. Please always refer to the paper ID in the submissions and any further enquiries. Please do not use the Adobe Acrobat PDFWriter to generate the PDF file; use the Adobe Acrobat Distiller instead, which is contained in the same package as the Acrobat PDFWriter. Make sure that you have used Type 1 or TrueType fonts (check with the Acrobat Reader or Acrobat Writer by clicking on File > Document Properties > Fonts to see the list of fonts and their type used in the PDF document).
Copyright

Submission of your paper to this journal implies that the paper is not under submission for publication elsewhere. Material which has been previously copyrighted, published, or accepted for publication will not be considered for publication in this journal. Submission of a manuscript is interpreted as a statement of certification that no part of the manuscript is under review by any other formal publication. Submitted papers are assumed to contain no proprietary material unprotected by patent or patent application; responsibility for technical content and for protection of proprietary material rests solely with the author(s) and their organizations and is not the responsibility of the MECS Publisher or its editorial staff. The main author is responsible for ensuring that the article has been seen and approved by all the other authors. It is the responsibility of the author to obtain all necessary copyright release permissions for the use of any copyrighted materials in the manuscript prior to submission. More information about permission requests can be found at the web site. Authors are asked to sign a warranty and copyright agreement upon acceptance of their manuscript, before the manuscript can be published. The Copyright Transfer Agreement can be downloaded from the web site.

Publication Charges and Re-print

There are no page charges for publications in this journal. Reprints of a paper can be ordered at a price of 150 USD. Electronic versions are freely available at www.mecs-press.org. To subscribe, please contact the Journal Subscriptions Department, E-mail: ijitcs@mecs-press.org. More information is available on the web site at http://www.mecs-press.org/ijitcs.