

IDL - International Digital Library Of Technology & Research Volume 1, Issue 5, May 2017

Available at: www.dbpublications.org

International e-Journal For Technology And Research-2017

Algorithm to Convert Unstructured Data in Hadoop and Framework to Secure Big Data in Cloud

Lakshmikantha G C (1), Anusha Desai (2), Keerthana G (3), Keerthi Kiran M E (4), Siri S Garadi (5)

(1) Assistant Professor, Department of CSE, VKIT, Bengaluru, Lakshmikantha.gc@gmail.com
(2) B.E Student, Department of CSE, VKIT, Bengaluru, anushadesai21@gmail.com
(3) B.E Student, Department of CSE, VKIT, Bengaluru, bramara.keerthana@gmail.com
(4) B.E Student, Department of CSE, VKIT, Bengaluru, keerthikiran351995@gmail.com
(5) B.E Student, Department of CSE, VKIT, Bengaluru, Sirigaradi82@gmail.com

Abstract— Unstructured Big Data is extracted from the dataset and converted into Hadoop format. The resultant data is stored in the cloud and secured by double encryption. The user can retrieve the data from the cloud through a user interface by double decryption. Security for the data in the cloud is provided using a Fully Homomorphic encryption algorithm. As a result, efficient encryption, transmission, and storage of sensitive data are achieved. We analyze existing search algorithms over ciphertext; because most of these algorithms disclose the user's access patterns, we propose a new method of private information retrieval supporting keyword search that combines homomorphic encryption with private information retrieval.

I. INTRODUCTION

Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating, and information privacy. The goal of most big data systems is to surface insights and connections from large volumes of heterogeneous data that would not be possible using conventional methods. Apache Hadoop is an open source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware. The Hadoop Distributed File System (HDFS) is a distributed file system that stores data on the commodity machines, providing very high aggregate bandwidth across the cluster. A MapReduce job usually splits the input data set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks.
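For illustration only, the canonical word-count job (not part of this paper's system, shown to make the map/sort/reduce flow concrete) can be written against the Hadoop MapReduce API roughly as follows:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map task: emit (word, 1) for every token in the input split it is given.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce task: the framework has already sorted and grouped map output by key.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```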


Cloud computing is a type of Internet-based computing that provides shared computer processing resources and data to computers and other devices on demand. Cloud computing security, or more simply cloud security, refers to a broad set of policies, technologies, and controls deployed to protect data, applications, and the associated infrastructure of cloud computing. Homomorphic encryption is a form of encryption that allows computations to be carried out on ciphertext, thus generating an encrypted result which, when decrypted, matches the result of the same operations performed on the plaintext. We define the relaxed notion of a semi-homomorphic encryption scheme, where the plaintext can be recovered as long as the computed function does not increase the size of the input "too much". The disadvantage of these two schemes is that the encrypted data can be decrypted relatively easily. To overcome the disadvantages of homomorphic and semi-homomorphic encryption, we propose a fully homomorphic encryption scheme, i.e., a scheme that allows one to evaluate circuits over encrypted data without being able to decrypt.
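As a concrete illustration of the homomorphic property (our notation, not taken from the paper), textbook RSA with public exponent a and modulus n, where E(m) = m^a mod n, is multiplicatively homomorphic: multiplying two ciphertexts yields a ciphertext of the product of the plaintexts.

```latex
E(m_1) \cdot E(m_2) \equiv m_1^{a} \, m_2^{a} \equiv (m_1 m_2)^{a} \equiv E(m_1 \cdot m_2) \pmod{n}
```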

II. EXISTING SYSTEM

Crawler algorithms for extraction
A Web crawler, sometimes called a spider, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. Web search engines and some other sites use Web crawling or spidering software to update their own web content or their indices of other sites' web content.


Web crawlers can copy all the pages they visit for later processing by a search engine, which indexes the downloaded pages so that users can search much more efficiently. Crawlers consume resources on the systems they visit and often visit sites without approval. Issues of scheduling, load, and "politeness" come into play when large collections of pages are accessed. Mechanisms exist for public sites not wishing to be crawled to make this known to the crawling agent; for instance, including a robots.txt file can request bots to index only parts of a website, or nothing at all. As the number of pages on the Internet is extremely large, even the largest crawlers fall short of making a complete index. For that reason, search engines were bad at giving relevant search results in the early years of the World Wide Web, before the year 2000. This has been improved greatly by modern search engines; nowadays very good results are given instantly.

Homomorphic Encryption
Homomorphic encryption is a form of encryption that allows computations to be carried out on ciphertext, thus generating an encrypted result which, when decrypted, matches the result of the same operations performed on the plaintext. This is sometimes a desirable feature in modern communication system architectures. Homomorphic encryption would allow the chaining together of different services without exposing the data to each of those services. For example, a chain of different services from different companies could calculate 1) the tax, 2) the currency exchange rate, and 3) the shipping on a transaction without exposing the unencrypted data to any of those services. Homomorphic encryption schemes are malleable by design. This enables their use in cloud computing environments for ensuring the confidentiality of processed data. In addition, the homomorphic property of various cryptosystems can be used to create many other secure systems, for example secure voting systems, collision-resistant hash functions, private information retrieval schemes, and many more. There are several partially homomorphic cryptosystems, and also a number of fully homomorphic cryptosystems. Although a cryptosystem which is unintentionally malleable can be subject to attacks on this basis, if treated carefully homomorphism can also be used to perform computations securely.

Data Encryption Standard (DES)
The Data Encryption Standard is a symmetric-key block cipher, published as FIPS-46 in January 1977 in the Federal Register by the National Institute of Standards and Technology. On the encryption side, DES takes a 64-bit plaintext and produces a 64-bit ciphertext; on the decryption side, it takes a 64-bit ciphertext and produces a 64-bit plaintext. The same 56-bit cipher key is used for both encryption and decryption. The encryption process consists of two permutations (P-boxes), which we call the initial and final permutations, and sixteen Feistel rounds.
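To make the shared-key, 64-bit-block behaviour described above concrete, here is a minimal Java sketch using the standard javax.crypto API (our own illustration, not part of the surveyed system; DES is shown only because it is the cipher under discussion):

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.util.Base64;

public class DesDemo {
    public static void main(String[] args) throws Exception {
        // Generate a DES key (56 effective key bits, operating on 64-bit blocks).
        SecretKey key = KeyGenerator.getInstance("DES").generateKey();

        Cipher cipher = Cipher.getInstance("DES/ECB/PKCS5Padding");

        // Encrypt: the same secret key is later used for decryption.
        cipher.init(Cipher.ENCRYPT_MODE, key);
        byte[] ciphertext = cipher.doFinal("sensitive record".getBytes("UTF-8"));
        System.out.println("Ciphertext: " + Base64.getEncoder().encodeToString(ciphertext));

        // Decrypt with the same key and recover the original plaintext.
        cipher.init(Cipher.DECRYPT_MODE, key);
        System.out.println("Plaintext:  " + new String(cipher.doFinal(ciphertext), "UTF-8"));
    }
}
```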

Advanced Encryption Standard (AES)
The Advanced Encryption Standard is a symmetric-key block cipher, published as FIPS-197 in the Federal Register in December 2001 by the National Institute of Standards and Technology. AES is a non-Feistel cipher. AES encrypts information with a block size of 128 bits and uses 10, 12, or 14 rounds.

Blowfish Algorithm
Blowfish is a symmetric block cipher algorithm. It uses the same secret key for both encryption and decryption of messages. The block size of Blowfish is 64 bits; messages that are not a multiple of 64 bits in length have to be padded. It applies a variable-length key, from 32 bits to 448 bits, and is suitable for applications where the key is not changed frequently. It is considerably faster than most encryption algorithms when implemented on 32-bit microprocessors with large data caches.

Disadvantages:

• Encryption is a very complex technology. Management of encryption keys is an added administrative task for often overburdened IT staff. One big disadvantage of encryption as it relates to keys is that the security of the data becomes the security of the encryption key: lose that key, and you effectively lose your data.
• Encrypting data and creating the keys necessary to encrypt and decrypt the data is computationally expensive. No matter what type of encryption is used, the systems performing the computational heavy lifting must have resources available.
• One of the common drawbacks of traditional full-disk encryption solutions is the reduction of overall system performance upon deployment. A key pitfall is that a poor encryption implementation can result in a false sense of security, when in fact the system is wide open to attack.
• There are three known attacks that can break the full 16 rounds of DES with less complexity than a brute-force search: differential cryptanalysis (DC), linear cryptanalysis, and Davies' attack. However, the attacks are theoretical and unfeasible to mount in practice; such attacks are sometimes termed certificational weaknesses.
• The key space to be searched by brute-force attacks increases by a factor of 2 for each additional bit of key length, so the difficulty of a brute-force search grows very rapidly. Key length alone does not provide sufficient security against attacks, however, as there are ciphers with very long keys which have still been found to be vulnerable.
• Encrypting a message does not guarantee that the message is not changed while encrypted. Hence a message authentication code is often added to a ciphertext to ensure that changes to the ciphertext will be noticed by the receiver (a minimal sketch of such a code follows this list).

• Message authentication codes can be constructed from symmetric ciphers. However, symmetric ciphers cannot be used for non-repudiation purposes except by involving additional parties. Another application is to build hash functions from block ciphers; see one-way compression functions for descriptions of several such methods.
• A reduced-round variant of Blowfish is known to be susceptible to known-plaintext attacks on reflectively weak keys; full 16-round Blowfish implementations are not susceptible to this attack.
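To make the message-authentication point above concrete, the following minimal Java sketch (our own illustration, not drawn from the surveyed systems) attaches an HMAC-SHA256 tag to a ciphertext so that the receiver can detect tampering:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.security.MessageDigest;

public class MacDemo {
    // Compute an HMAC-SHA256 tag over the ciphertext with a shared MAC key.
    static byte[] tag(byte[] macKey, byte[] ciphertext) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(macKey, "HmacSHA256"));
        return mac.doFinal(ciphertext);
    }

    public static void main(String[] args) throws Exception {
        byte[] macKey = "a-shared-mac-key-for-demo-only!!".getBytes("UTF-8"); // illustrative key
        byte[] ciphertext = "pretend-this-is-ciphertext".getBytes("UTF-8");

        byte[] sent = tag(macKey, ciphertext);       // sender attaches this tag
        byte[] recomputed = tag(macKey, ciphertext); // receiver recomputes it on arrival

        // If the ciphertext was modified in transit, the two tags will not match.
        // MessageDigest.isEqual performs a constant-time comparison.
        System.out.println("Tag valid: " + MessageDigest.isEqual(sent, recomputed));
    }
}
```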

III. LITERATURE SURVEY

DESIGN OF FOCUSED CRAWLER BASED ON FEATURE EXTRACTION, CLASSIFICATION AND TERM EXTRACTION
A web crawler is a software program that scans the hypertext layout of web pages, starting from a set of seed pages. The crawler retrieves these pages, indexes them, and extracts the hyperlinks inside these pages to find the addresses of further pages to be crawled. This work aims to improve the performance of the crawler by using different feature extraction and classification techniques.

AN EFFICIENT R-G-B ALGORITHM FOR WEB CRAWLER ON INFORMATION EXTRACTION
Vertical search engines suffer from an unstable crawl rate and low average accuracy. This work develops RDF data transfer and querying on the Hadoop framework; to handle the storage of huge data, MapReduce and Hadoop framework supporting tools are used to perform the data transfer. It is also used to reduce the access time.

HADOOP SECURITY MODELS - A STUDY
Big data is the emerging technology for handling large datasets in an efficient manner; it is used across different platforms and domains to support several services and improve system performance in a reliable manner. The main weakness in this domain is security, which can easily be destroyed or surpassed by the user. To address this, the study provides a detailed survey of the various security mechanisms and methodologies used with big data techniques.

BIG DATA EMERGING ISSUES: HADOOP SECURITY AND PRIVACY
With the growing volume of data, it has become increasingly vulnerable and exposed to malicious attacks. These attacks can damage essential properties such as the confidentiality, integrity, and availability of information systems. To deal with these malicious intents, it is necessary to develop effective protection mechanisms.

The work indicates the main risks arising in Big Data and the existing security mechanisms, focusing on Hadoop security and its components, because Hadoop remains the reference framework for the management and processing of big data.

DESIGN AND IMPLEMENTATION OF HDFS DATA ENCRYPTION SCHEME USING ARIA ALGORITHM ON HADOOP
Hadoop was developed as a distributed data processing platform for analyzing big data. Enterprises can analyze big data containing users' sensitive information by using Hadoop and utilize it for marketing. Therefore, research on data encryption has been widely carried out to prevent the leakage of sensitive data stored in Hadoop. However, existing research supports only the AES international standard data encryption algorithm, while the Korean government selected the ARIA algorithm as a standard data encryption scheme for domestic usage. This work presents an HDFS data encryption scheme which supports both the ARIA and AES algorithms on Hadoop.

MODIFIED RSA ALGORITHM TO ENHANCE SECURITY FOR DIGITAL SIGNATURE
Digital signatures provide security services to secure electronic transactions over the Internet. The Rivest, Shamir and Adleman (RSA) algorithm is the most widely used technique to provide such security. Here the RSA algorithm is modified to enhance its level of security. The paper presents a fair comparison between RSA and the modified RSA algorithm with respect to time and security by running several encryption and decryption settings to process data of different sizes. The efficiency of these algorithms was considered based on key generation speed and security level. Texts of different sizes were encrypted and decrypted using the RSA and modified RSA algorithms.

FINANCIALCLOUD: OPEN CLOUD FRAMEWORK OF DERIVATIVE PRICING
Predicting prices and risk measures of assets and derivatives and the rating of financial products have been studied and widely used by financial institutions and individual investors. In contrast to the centralized and oligopolistic nature of existing financial information services, this paper advocates the notion of a Financial Cloud, i.e., an open distributed framework based on a cloud computing architecture that hosts modularized financial services, such that these services may be integrated flexibly and dynamically to meet users' needs on demand. This new cloud-based architecture of modularized financial services provides several advantages. There may be different types of service providers in the ecosystem on top of the framework. For example, market data resellers may collect and sell long-term historical market data. Statistical analyses of macroeconomic indices, interest rates, and correlations of a set of assets may also be purchased online. Some agencies might be interested in providing services based on rating or pricing values of financial products.


THREE STEP DATA SECURITY MODEL FOR CLOUD COMPUTING BASED ON RSA AND STEGANOGRAPHY TECHNIQUES
Cloud computing is based on networks and computer applications, and data sharing is an important activity in the cloud. Small, medium, and big organizations use the cloud to store their data at minimal rental cost. At present, clouds have proved their importance in terms of resource and network sharing, application sharing, and data storage utility. Hence, most customers want to use cloud facilities and services, so security is an essential concern from the customer's point of view as well as the vendor's. There are several issues that need attention with respect to the service of data, the security or privacy of data, and the management of data. The security of stored data and information is one of the most crucial problems in cloud computing. Using good access control protection techniques, many security problems can be resolved.

IV. PROPOSED SYSTEM

Fully Homomorphic Encryption

• The fully homomorphic algorithm is used to double encrypt and decrypt the content of a file; we use the RSA and RNS algorithms to encrypt and decrypt the file content using its private and public keys.
• The RSA cryptosystem exhibits the properties of multiplicative homomorphic encryption.
• Ronald Rivest, Adi Shamir and Leonard Adleman devised the RSA algorithm, which is named after its inventors.
• RSA uses modular exponentiation for encryption and decryption.
• RSA uses two exponents, a and b, where a is public and b is private. Let the plaintext be Pt and the ciphertext be Ct; then encryption is Ct = Pt^a mod n.
• On the decryption side, Pt = Ct^b mod n.
• n is a very large number, created during the key generation process.
• The RSA method's security rests on the fact that it is extremely difficult to factor very large numbers.
• The residue number system (RNS) can be applied to any moduli set. Simulation results showed that the algorithm was many times faster than most competitive published work.
• Determining the position of the most significant non-zero bit of any residue number is the major speed-limiting factor in that algorithm.
• The same algorithm is customized to serve two specific moduli sets, (2^k, 2^k - 1, 2^(k-1) - 1) and (2^k + 1, 2^k, 2^k - 1), thereby eliminating that speed-limiting factor. Based on this work, the hardware needed to determine the most significant bit position has been reduced to a single adder.
• Therefore, computation time and hardware requirements are substantially improved. This would enable RNS to become a stronger force in building general-purpose computers. (A minimal RSA sketch in Java follows this list.)
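For illustration only (a textbook-RSA sketch with toy parameters, not the key sizes or the full double-encryption scheme proposed above), the relations Ct = Pt^a mod n and Pt = Ct^b mod n, together with the multiplicative homomorphic property, can be exercised with java.math.BigInteger:

```java
import java.math.BigInteger;
import java.security.SecureRandom;

public class RsaSketch {
    public static void main(String[] args) {
        SecureRandom rnd = new SecureRandom();

        // Toy key generation: n = p*q, public exponent a, private exponent b = a^-1 mod phi(n).
        // Assumes gcd(a, phi) == 1, which holds with overwhelming probability here.
        BigInteger p = BigInteger.probablePrime(512, rnd);
        BigInteger q = BigInteger.probablePrime(512, rnd);
        BigInteger n = p.multiply(q);
        BigInteger phi = p.subtract(BigInteger.ONE).multiply(q.subtract(BigInteger.ONE));
        BigInteger a = BigInteger.valueOf(65537);   // public exponent
        BigInteger b = a.modInverse(phi);           // private exponent

        // Encryption: Ct = Pt^a mod n; decryption: Pt = Ct^b mod n.
        BigInteger pt = new BigInteger("42424242");
        BigInteger ct = pt.modPow(a, n);
        System.out.println("Decryption recovers plaintext: " + ct.modPow(b, n).equals(pt));

        // Multiplicative homomorphism: E(m1)*E(m2) mod n decrypts to m1*m2.
        BigInteger m1 = BigInteger.valueOf(1234), m2 = BigInteger.valueOf(5678);
        BigInteger prodCt = m1.modPow(a, n).multiply(m2.modPow(a, n)).mod(n);
        System.out.println("Homomorphic product: " + prodCt.modPow(b, n).equals(m1.multiply(m2)));
    }
}
```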

Advantages:

• Security is easy, as only the private key must be kept secret.
• Maintenance of the keys is easy, since the keys remain constant throughout the communication, depending on the connection.
• The two main advantages of asymmetric encryption are that the two parties do not need to have already shared a secret in order to communicate using encryption, and that both authentication and non-repudiation are possible.
• Most implementations use asymmetric encryption to encode a symmetric key and transfer it to the other party. They then transmit the actual message using the symmetric key, which is much more efficient in CPU time.
• An important advantage of asymmetric ciphers over symmetric ciphers is that no secret channel is necessary for the exchange of the public key. The receiver only needs to be assured of the authenticity of the public key.
• Asymmetric ciphers also create fewer key-management problems than symmetric ciphers: only 2n keys are needed for n entities to communicate securely with one another.
• Hashing is a one-way function, whereas encrypted messages can be recovered if you know the key with which they were encrypted.
• Hashing is non-reversible and deterministic, which is why it is used to store passwords. When you set the password for, say, your e-mail account, the server never stores the password itself; assuming the password is "password", it stores h("password"). When you later log in, the server hashes the password you type and compares it with the stored hash.

V. SYSTEM DESIGN

The system architecture is a conceptual model that defines the structure, behavior, and other views of a system. An architecture description is a formal description and representation of a system, organized in a way that supports reasoning about the structures and behaviors of the system. A system architecture can comprise system components that work together to implement the overall system. There have been efforts to formalize languages to describe system architecture; collectively these are called architecture description languages. Components used in the system are:

A. HADOOP
Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. Hadoop makes it possible to run applications on systems with thousands of commodity hardware nodes, and to handle thousands of terabytes of data. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating in the case of a node failure.


This approach lowers the risk of catastrophic system failure and unexpected data loss, even if a significant number of nodes become inoperative. Consequently, Hadoop quickly emerged as a foundation for big data processing tasks, such as scientific analytics, business and sales planning, and processing enormous volumes of sensor data, including data from Internet of Things sensors. It was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Any of these parts, which are also called fragments or blocks, can be run on any node in the cluster. Organizations can deploy Hadoop components and supporting software packages in their local data center. However, most big data projects depend on short-term use of substantial computing resources. This type of usage is best suited to highly scalable public cloud services, such as Amazon Web Services (AWS), Google Cloud Platform, and Microsoft Azure. Public cloud providers often support Hadoop components through basic services, such as AWS Elastic Compute Cloud and Simple Storage Service instances.

B. CLOUD
Cloud storage is a cloud computing model in which data is stored on remote servers accessed over the Internet, or "cloud". It is maintained, operated, and managed by a cloud storage service provider on storage servers that are built on virtualization techniques. Cloud storage is also known as utility storage, a term subject to differentiation based on actual implementation and service delivery. Cloud storage works through data center virtualization, providing end users and applications with a virtual storage architecture that is scalable according to application requirements. In general, cloud storage operates through a web-based API that is remotely implemented through its interaction with the client application's in-house cloud storage infrastructure for input/output (I/O) and read/write (R/W) operations. When delivered through a public service provider, cloud storage is known as utility storage. Private cloud storage provides the same scalability, flexibility, and storage mechanism with restricted or non-public access.

• MySQL: MySQL is an open source relational database management system (RDBMS) based on Structured Query Language (SQL). MySQL runs on virtually all platforms, including Linux, UNIX, and Windows. Although it can be used in a wide range of applications, MySQL is most often associated with web-based applications and online publishing, and is an important component of the open source enterprise stack called LAMP. LAMP is a web development platform that uses Linux as the operating system, Apache as the web server, MySQL as the relational database management system, and PHP as the object-oriented scripting language.

• ECLIPSE: Eclipse is an integrated development environment (IDE) used in computer programming and is the most widely used Java IDE. It contains a base workspace and an extensible plug-in system for customizing the environment. Eclipse is written mostly in Java and its primary use is for developing Java applications, but it may also be used to develop applications in other programming languages such as C, C++, and JavaScript. It can also be used to develop documents with LaTeX (via the TeXlipse plug-in) and packages for the software Mathematica. Development environments include the Eclipse Java Development Tools (JDT) for Java and Scala, Eclipse CDT for C/C++, and Eclipse PDT for PHP, among others. The initial codebase originated from IBM VisualAge. The Eclipse software development kit (SDK), which includes the Java development tools, is meant for Java developers. Users can extend its abilities by installing plug-ins written for the Eclipse Platform, such as development toolkits for other programming languages, and can write and contribute their own plug-in modules. Since Equinox, plug-ins can be added and stopped dynamically and are termed OSGi bundles. The Eclipse SDK is free and open-source software, released under the terms of the Eclipse Public License, although it is incompatible with the GNU General Public License. It was one of the first IDEs to run under GNU Classpath, and it runs without problems under IcedTea.

• JDK: The Java Development Kit (JDK) is a software development environment used for developing Java applications and applets. It includes the Java Runtime Environment (JRE), an interpreter/loader (java), a compiler (javac), an archiver (jar), a documentation generator (javadoc), and other tools needed in Java development. Java developers are initially presented with two JDK tools, java and javac, both run from the command prompt. Java source files are simple text files saved with the extension .java. After writing and saving Java source code, the javac compiler is invoked to create .class files. Once the .class files are created, the java command can be used to run the program.

• TOMCAT SERVER: Apache Tomcat, often referred to as Tomcat Server, is an open-source Java Servlet container developed by the Apache Software Foundation (ASF). Tomcat implements several Java EE specifications, including Java Servlet, JavaServer Pages (JSP), Java EL, and WebSocket. Tomcat is developed and maintained by an open community of developers under the auspices of the Apache Software Foundation, and is released as open-source software under the Apache License 2.0.


Components of the TOMCAT SERVER are:
(i) Catalina: Catalina is Tomcat's servlet container.
(ii) Coyote: Coyote is a connector component for Tomcat that supports the HTTP 1.1 protocol as a web server. This allows Catalina, nominally a Java Servlet or JSP container, to also act as a plain web server that serves local files as HTTP documents.
(iii) Jasper: Jasper is Tomcat's JSP engine. Jasper parses JSP files and compiles them into Java code as servlets. At runtime, Jasper detects changes to JSP files and recompiles them.
(iv) Cluster: This component has been added to manage large applications. It is used for load balancing, which can be achieved through many techniques. Clustering support currently requires JDK version 1.5.
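As an illustrative sketch only (the endpoint name, request parameter, and storage lookup below are hypothetical and not taken from the paper, and the javax.servlet API is assumed to be provided by Tomcat), the user-facing download interface served by Tomcat could be a servlet along these lines:

```java
import java.io.IOException;
import java.io.OutputStream;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical download endpoint: looks up a stored (encrypted) file by name
// and streams it back to the user, who then decrypts it on the client side.
public class DownloadServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String fileName = req.getParameter("file");      // e.g. ?file=report.bin
        byte[] encryptedBytes = fetchFromCloud(fileName); // placeholder for the cloud lookup

        if (encryptedBytes == null) {
            resp.sendError(HttpServletResponse.SC_NOT_FOUND, "Unknown file: " + fileName);
            return;
        }

        resp.setContentType("application/octet-stream");
        resp.setContentLength(encryptedBytes.length);
        try (OutputStream out = resp.getOutputStream()) {
            out.write(encryptedBytes);
        }
    }

    // Placeholder: in the real system this would query the cloud storage service.
    private byte[] fetchFromCloud(String fileName) {
        return null; // not implemented in this sketch
    }
}
```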

Fig.1 Block Diagram

VI. WORKING PROCEDURE

HADOOP INSTALLATION
Steps to install Hadoop (for Windows 7, 32-bit):
1. Open Cygwin.exe to install Hadoop.
2. Click Next.
3. The "Install from Internet" radio button should be selected.
4. Click Next.
5. Select any available download site from the list.
6. Click Next.
7. Search for openss; inside openss (Base, Debug, Devel, Libs, Net, Python), click on Skip and then Next.
8. Search for tcp and follow the same procedure.
9. Search for utils; inside Debug, change diffutils-debuginfo from Skip.
10. Click Next.

11. Click Finish. The installation will take more than half an hour, after which a Cygwin terminal appears on the desktop.
12. Edit the environment variables and paste the following: C:\cygwin\bin; C:\cygwin\usr\bin;.;
13. Copy the Java folder and paste it outside the Program Files directory.
14. Download Notepad++.
15. Go to the following file: Hadoopsoftwares\Hadoop Software\hadoop-0.18.0\conf\hadoop-env.sh.
16. Open it with Notepad++.
17. Paste the Java path into the export line of the hadoop-env.sh file and change "\" to "/".
18. Open Cygwin, which is present on the desktop.
19. Change to the Hadoop bin directory: cd D:/Hadoop_softwares/Hadoop_Software/hadoop-0.18.0/bin
20. Enter ./hadoop namenode -format
21. Enter ./hadoop namenode
22. Open another Cygwin terminal and type ssh-host-config
23. Answer: no, no, yes, no, no, yes.
24. Go to Control Panel, Administrative Tools, Services, and start the Cygwin service.
25. Go to Cygwin and type ssh-keygen.
26. cd ~/.ssh/
27. cat id_rsa.pub >> authorized_keys
28. ssh localhost
29. Open a browser and go to localhost:50030.

FILE UPLOADING MODULE
Description of the MD5 Algorithm
Step 1: Get the message.
Step 2: Convert the message into bits.
Step 3: Append padding bits (so that the message bit length is an exact multiple of 512 bits, i.e., sixteen 32-bit words per block).
Step 4: Divide the total bits into blocks of 512 bits each.
Step 5: Initialize the MD buffer. A four-word buffer (A, B, C, D), 128 bits in total, is used to compute the message digest.
Step 6: Apply AND, XOR, OR, and NOT operations on A, B, C, and D, taking three inputs and producing one output.
Step 7: Repeat Step 6 until the 128-bit (16-byte) hash is obtained.
Step 8: Stop.
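A minimal Java sketch of this hashing step (the method and variable names are ours; the paper only specifies that a grade key and keyword are concatenated and digested with MD5, as described in the Searching and Downloading module below):

```java
import java.math.BigInteger;
import java.security.MessageDigest;

public class Md5Demo {
    // Concatenate the grade key and keyword, then compute the 128-bit MD5 digest.
    static String keywordHash(String gradeKey, String keyword) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest((gradeKey + keyword).getBytes("UTF-8"));
        // Render the 16-byte digest as a 32-character hexadecimal string.
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        String hash = keywordHash("grade-key-A", "invoice"); // illustrative values
        System.out.println("MD5 hash: " + hash);
        // The same hash computed at search time is matched against the stored index.
    }
}
```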


SEARCHING AND DOWNLOADING MODULE
Description of the Searching and Downloading Algorithm
Step 1: Start.
Step 2: Enter the keyword to search for.
Step 3: Get the grade key of the searching user.
Step 4: Concatenate the grade key and the search keyword and generate the hash code with the MD5 algorithm.
Step 5: Check whether the hash code matches the database content.
Step 6: Get the respective files having a matching hash code for download.
Step 7: Download the files from the cloud.
Step 8: Decompress the downloaded files.
Step 9: Decrypt the file with the RSA public key.
Step 10: Get the original file.
Step 11: Stop.

The user first selects the aggregate key and then inputs the search keyword, which is converted into a hash code. The aggregate key is decrypted to separate out the hash keys and the public key. Using the hash key and the keyword, hash codes are generated and sent to the server; based on the hash codes received, the server checks the keyword index and, if any matching files are available, lists all the file names to the user. The user views the shortlisted files from the server, downloads them, and finally decrypts each file with the owner's public key.

VII. RESULTS

Unstructured Big Data is extracted from the dataset and converted into Hadoop format. The resultant data is stored in the cloud and secured by double encryption. The user can retrieve the data from the cloud through a user interface by double decryption. Security for the data in the cloud is provided using a Fully Homomorphic encryption algorithm. As a result, efficient encryption, transmission, and storage of sensitive data are achieved. We analyze existing search algorithms over ciphertext; because most of these algorithms disclose the user's access patterns, we propose a new method of private information retrieval supporting keyword search that combines homomorphic encryption with private information retrieval.

CONCLUSION

Big Data consists of huge amounts of data, and providing security for it is the main concern. Once the admin uploads a file, it is sent to Hadoop for converting the unstructured data into structured data. The analysis takes place in Hadoop using the MapReduce algorithm. The resultant file is double encrypted and stored in the cloud. When the user requests a particular file, the required file is obtained from the cloud with the help of a Tomcat server, then double decrypted and sent to the user for download. This application is mainly used for storing and securing confidential Big Data.


FUTURE ENHANCEMENT

• A blockchain-based algorithm can be used for encryption to provide more security.
• Queries can be included to make the user's work easier.
• Different data formats, such as images and XML, can be supported for storage.
• A real-time case study can be implemented.
• The single step of converting unstructured data can be split into two steps: converting the XML file to a text file, and then converting the text file to the structured format.

REFERENCES

[1] B. Harikrishna, S. Kiran, G. Murali and R. Pradeep Kumar Reddy, "Security Issues in Service Model of Cloud Computing Environment", Procedia Computer Science 87 (2016), pp. 246-251, ScienceDirect.
[2] B. Harikrishna, N. Anusha, K. Manideep, Madhusudhanrao Ch, "Quarantine Stabilizing Multi Keyword Rated Discover with Unfamiliar ID Transfer over Encrypted Cloud Warning", IJERCSE, Vol. 2, Issue 2, February 2015.
[3] Alexa Huth and James Cebula, "The Basics of Cloud Computing", United States Computer Emergency Readiness Team, 2011.
[4] Sandipan Basu, "International Data Encryption Algorithm (IDEA) - A Typical Illustration", Journal of Global Research in Computer Science, Vol. 2, Issue 7, July 2011, ISSN: 2229-371X.
[5] Rajeev Bedi, Amritpal Singh and Tejinder Singh, "Comparative Analysis of Cryptographic Algorithms", International Journal of Advanced Engineering Technology, July-Sept. 2013, E-ISSN 0976-3945.
[6] Pradeep Mittal and Vinod Kumar, "Comparative Study of Cryptographic Algorithms", International Journal of Computer Science and Network, Vol. 3, Issue 3, June 2014, ISSN (Online): 2277-5420.
[7] V. G. Korat, A. P. Deshmukh, K. S. Pamu, "Introduction to Hadoop Distributed File System", International Journal of Engineering Innovations and Research, 1(2): 230-236, March 2012.
[8] Apache Hadoop, http://hadoop.apache.org/, 2012.
[9] P. K. Mantha, A. Luckow, S. Jha, "Pilot-MapReduce: An Extensible and Flexible MapReduce Implementation for Distributed Data", Proc. of 2012 Int. Conf. on MapReduce and Its Applications, pp. 17-24.
[10] Pradeep Adluru, Srikari Sindhoori Datla, Xiaowen Zhang, "Hug the Elephant: Migrating a Legacy Data Analytics Application to Hadoop Ecosystem".
[11] Thomas C. Bressoud, Qiuyi (Jessica) Tang, "Analysis, Modeling, and Simulation of Hadoop YARN MapReduce".
[12] Cao Nguyen, Jik-Soo Kim and Soonwook Hwang, "KOHA: Building a Kafka-based Distributed Queue System on the Fly in a Hadoop Cluster".
[13] "Hadoop Security Models - A Study".
[14] "Big Data Emerging Issues: Hadoop Security and Privacy".


[15] "Design and Implementation of HDFS Data Encryption Scheme Using ARIA Algorithm on Hadoop".
[16] Poonam S. Patil, Rajesh N. Phursule, "Survey Paper on Big Data Processing and Hadoop Components", IJSR, Volume 3, Issue 10, pp. 585-590, October 2010.
[17] Singh Arpita Jitendrakumar, "Security Issues in Big Data: In Context with a Big Data", IJIERE, Volume 2, Issue 3, 2015, pp. 127-130, ISSN: 2394-3343.
[18] Dr. E. Laxmi Lydia, Dr. M. Ben Swarup, "Analysis of Big Data through Hadoop Ecosystem Components like Flume, MapReduce, Pig and Hive", IJCSE, Vol. 5, No. 01, Jan 2016, pp. 21-29, ISSN: 2319-7323.
[19] Sanjeev Dhawan, Sanjay Rathee, "Big Data Analytics using Hadoop Components like Pig and Hive", AIJRSTEM, pp. 88-93, 2013, ISSN (Online): 2328-3580.
[20] Apache Flink. [Online]. Available: https://en.wikipedia.org/wiki/Apache_Flink [Accessed: October 10, 2016].
[21] "A gentle introduction to blockchain technology". [Online]. Available: https://bitsonblocks.net/2015/09/09/a-gentleintroduction-toblockchain-technology
[22] "Bitcoin, part one". [Online]. Available: http://tech.eu/features/808/bitcoin-part-one
[23] Block hashing algorithm. [Online]. Available: https://en.bitcoin.it/wiki/Block_hashing_algorithm
