IJIRST - International Journal for Innovative Research in Science & Technology | Volume 1 | Issue 6 | November 2014 | ISSN (online): 2349-6010
A Survey on Appliance and Secure In Big Data

N. Naveenkumar
PG Scholar
Department of Information Technology
SNS College of Technology, Coimbatore

N. Naveenkumar
Assistant Professor
Department of Information Technology
SNS College of Technology, Coimbatore
Abstract
Big Data is a new and challenging area with a rapidly evolving landscape. This survey covers the tools and techniques used to process big data in a highly performant and scalable manner. It also covers the development of new connectors to big data systems, new data structures and techniques to represent the various datasets, and ways to analyze and secure them efficiently in order to uncover insights. Data security and privacy deliver data protection across the enterprise. Together, they comprise the people, processes, and technology required to prevent destructive forces, threats, and unwanted actions; now is the time to keep customer data, business data, personally identifiable information, and other types of sensitive data safe against internal and external threats. Data should be protected no matter where it resides: in databases, applications, or reports, across production and non-production environments.
Keywords: Analysis, Big Data, Database, Hadoop, Security
I. INTRODUCTION

Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include different types, such as structured/unstructured and streaming/batch, and different sizes, from terabytes to zettabytes. Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process with low latency, and it has one or more of the following characteristics: high volume, high velocity, or high variety. Big data comes from sensors, devices, video/audio, networks, log files, transactional applications, the web, and social media, much of it generated in real time and at very large scale [1]. Analyzing big data allows analysts, researchers, and business users to make better and faster decisions using data that was previously inaccessible or unusable. Using advanced analytics techniques such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language processing, businesses can analyze previously untapped data sources independently of or together with their existing enterprise data to gain new insights, resulting in significantly better and faster decisions.
Big Data is a term defining data that has three main characteristics. First, it involves a great volume of data. Second, the data cannot be structured into regular database tables, and third, the data is produced with great velocity and must be captured and processed rapidly. Big Data is a relatively new term that arose from the need of big companies such as Yahoo, Google, and Facebook to analyze large amounts of unstructured data, but the same need can be identified in a number of other large enterprises as well as in the research and development field. A widely used framework for processing such data is Hadoop [2], an open source platform that consists of the Hadoop kernel, the Hadoop Distributed File System (HDFS), MapReduce, and several related tools. Two of the main problems that arise when studying Big Data are storage capacity and processing power.
Trillions of dollars are at stake in the ongoing battle against cybercriminals and fraudsters, but big data technologies like Hadoop are now helping to prevent unauthorized access, both behind the corporate firewall and in public coffers. One company hoping to simultaneously put a cap on internal and external threats is Fortscale, which recently announced the availability of a Hadoop-based security monitoring product. The San Francisco-based company uses proprietary machine learning algorithms running in Hadoop to score the security risk of every user in an organization, and then monitors user behavior for significant deviations from that score [3].
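To make the MapReduce model concrete, here is a minimal sketch of the canonical word-count job written against Hadoop's Java MapReduce API; the job wiring follows the standard Hadoop tutorial shape, and the input and output paths supplied on the command line are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The map phase tokenizes each input split in parallel across the cluster, the reduce phase aggregates counts per word, and the combiner performs local pre-aggregation to cut shuffle traffic.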
II. PROFICIENCIES USED IN BIG DATA

A. Bigtable
A proprietary distributed database system built on the Google File System; it served as the inspiration for HBase.

B. Business Intelligence (BI)
A type of application software designed to report, analyze, and present data. BI tools are often used to read data that have been stored in a data warehouse or data mart. They can also be used to create standard reports that are generated on a periodic basis, or to display information on real-time management dashboards, i.e., integrated displays of metrics that measure the performance of a system.
C. Cassandra
An open source (free) database management system designed to handle huge amounts of data on a distributed system. It was originally developed at Facebook and is now managed as a project of the Apache Software Foundation.

D. Cloud Computing
A computing paradigm in which highly scalable computing resources, often configured as a distributed system, are provided as a service through a network.

E. Data Mart
A subset of a data warehouse, used to provide data to users, usually through business intelligence tools.

F. Data Warehouse
A specialized database optimized for reporting, often used for storing large amounts of structured data. Data is uploaded using ETL (extract, transform, and load) tools from operational data stores, and reports are often generated using business intelligence tools.

G. Distributed System
Multiple computers, communicating through a network, used to solve a common computational problem. The problem is divided into multiple tasks, each of which is solved by one or more computers working in parallel. Benefits of a distributed system include higher performance at a lower cost (because a cluster of lower-end computers can be less expensive than a single higher-end computer), higher reliability (because there is no single point of failure), and greater scalability (because the power of a distributed system can be increased by simply adding more nodes rather than completely replacing a central computer) [4].

H. Dynamo
A proprietary distributed data storage system developed by Amazon.
I. Extract, Transform, and Load (ETL)
Software tools used to extract data from outside sources, transform the data to fit operational needs, and load the data into a database or data warehouse.
J. Google File System
A proprietary distributed file system developed by Google; part of the inspiration for Hadoop.

K. Hadoop
An open source (free) software framework for processing huge datasets for certain kinds of problems on a distributed system. Its development was inspired by Google's MapReduce and the Google File System. It was originally developed at Yahoo! and is now managed as a project of the Apache Software Foundation [4].

L. HBase
An open source (free), distributed, non-relational database modelled on Google's Bigtable. It was originally developed by Powerset and is now managed as a project of the Apache Software Foundation as part of Hadoop [4].

M. MapReduce
A software framework introduced by Google for processing huge datasets for certain kinds of problems on a distributed system. Also implemented in Hadoop [4]; see the word-count sketch in the introduction.

N. Mashup
An application that uses and combines data, presentation, or functionality from two or more sources to create new services. These applications are often made available on the Web, and frequently use data accessed through open application programming interfaces or from open data sources.

O. Metadata
Data that describe the content and context of data files (e.g., means of creation, purpose, time and date of creation, and author).
P. Non-relational Database
A database that does not store data in tables (rows and columns), in contrast to a relational database.

Q. R
An open source (free) programming language and software environment for statistical computing and graphics. The R language has become a de facto standard among statisticians for developing statistical software and is widely used for statistical computing and data analysis. R is part of the GNU Project, a collaboration that supports open source projects.

R. Relational Database
A database made up of a collection of tables (relations), i.e., data is stored in rows and columns. Relational database management systems (RDBMS) store structured data. SQL is the most widely used language for managing relational databases.
S. Semi-structured Data
Data that do not conform to fixed fields but contain tags and other markers to separate data elements. Examples of semi-structured data include XML- or HTML-tagged text. Contrast with structured data and unstructured data.

T. SQL
Originally an acronym for Structured Query Language, SQL is a computer language designed for managing data in relational databases. It includes the ability to insert, query, update, and delete data, as well as to manage data schemas (database structures) and control access to data in the database [4]. A brief example follows this glossary.

U. Stream Processing
Technologies designed to process large real-time streams of event data. Stream processing enables applications such as algorithmic trading in financial services, RFID event processing, fraud detection, process monitoring, and location-based services in telecommunications. Also known as event stream processing.

V. Structured Data
Data that reside in fixed fields. Examples of structured data include relational databases and data in spreadsheets. Contrast with semi-structured data and unstructured data.

W. Unstructured Data
Data that do not reside in fixed fields. Examples include free-form text (e.g., books, articles, bodies of e-mail messages) and untagged audio, image, and video data. Contrast with structured data and semi-structured data.

X. Visualization
Technologies used for creating images, diagrams, or animations to communicate a message, often used to synthesize the results of big data analyses.
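To make the SQL entry (item T) concrete, the sketch below uses Java's standard JDBC API to insert and query rows in a relational table; the connection URL, credentials, and the events table are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SqlExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection string; any JDBC-compliant RDBMS would do.
    String url = "jdbc:postgresql://localhost:5432/analytics";
    try (Connection conn = DriverManager.getConnection(url, "analyst", "secret")) {

      // INSERT: add a row to a (hypothetical) structured table.
      try (PreparedStatement ins = conn.prepareStatement(
          "INSERT INTO events (user_id, event_type) VALUES (?, ?)")) {
        ins.setLong(1, 42L);
        ins.setString(2, "login");
        ins.executeUpdate();
      }

      // QUERY: read rows back with a parameterized SELECT.
      try (PreparedStatement sel = conn.prepareStatement(
          "SELECT event_type, COUNT(*) FROM events GROUP BY event_type");
           ResultSet rs = sel.executeQuery()) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + ": " + rs.getLong(2));
        }
      }
    }
  }
}

Parameterized statements such as these are also relevant to the injection threats discussed under API security in Section IV.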
III. SECURITY ISSUES IN BIG DATA

Given the scale of the Internet and the fact that the world's population is steadily coming online, protecting users from cybercrime can be viewed as a numbers game. The same forces that are driving big data are driving threats concurrently [5]. New methods of addressing cyber threats are needed to process the enormous amount of data emerging from the world and to stay ahead of a sophisticated, aggressive, and ever-evolving threat landscape. No off-the-shelf solution can address a problem of this magnitude; the traditional rules of engagement no longer apply. Scaling up to manage the changes in the threat landscape is necessary, but it must be done intelligently, as a brute force approach is not economically viable. Successful protection relies on the right combination of methodologies, human insight, an expert understanding of the threat landscape, and the efficient processing of big data to create actionable intelligence.
Complicating the issue further, security software companies need not only to stop malicious behavior that has already been initiated, but to predict future behavior as well. Predicting the next threat can mean preventing an attack that could potentially cause millions of dollars in damages. Accurate prediction requires knowledge of previous history. Successful security software companies examine past behaviors and model them to predict future behavior. This means employing effective mechanisms to archive historical information, access it, and provide instant reporting and details.
Consumers rarely glimpse the enormous amount of effort conducted below the surface to protect them from cyber threats. A licensing agreement that allows customers to anonymously donate suspicious data for analysis and reverse engineering can provide valuable access to real data on real machines operating in the real world. Based on data gathered from this community network, specialized search algorithms, machine learning, and analytics can then be brought to bear on this data to identify abnormal patterns that can signal a threat [5], [6].
For example, many computer users follow a typical daily pattern. That pattern may consist of visiting a news site, encountering several ad servers, and logging on to Facebook. If that pattern suddenly changes, perhaps taking the user to a domain never previously visited, the incident can be immediately prioritized for further analysis (a toy sketch of this kind of baseline check appears at the end of this section). These types of complex correlations can be identified only by a system that can perform a very large number of database searches per second.
A feedback loop for process improvement is another critical component. Keen observation and curation of key data that is fed back into the process allows for continual process improvement. Over time, the process can predict malicious behavior long before it occurs. While big data in security is a numbers game, human experts play the most important role. Trained analysts need to constantly evolve the combination of methodologies, apply human intuition to complex problems, and identify trends that computers miss. Using the right approach when an attack slips through the cracks is also crucial: a savvy security software company works directly with the ISP involved in an attack to drive a better end result.

A. Volume: A Growing Threat Landscape
The threat landscape is evolving in various ways, including growth in the sheer volume of threats [5], [6]. The numbers are daunting, but this is only the tip of the iceberg. The Internet Protocol shift currently under way (from IPv4 to IPv6) is providing cybercriminals a new playground to exploit. Approximately four billion unique IP addresses are available for use with IPv4, a large yet tractable number. By contrast, IPv6 provides an almost inexhaustible number of IP addresses. Growing demand for unique IP addresses for devices ranging from smart TVs to telephones motivated development of the new IPv6 standards; the goal was to generate sufficient IP addresses to avoid the need to revisit the problem later. While IPv6 fixed one problem, it simultaneously created an enormous opportunity for cybercriminals and introduced an entirely new set of challenges to the industry.

B. Variety: Innovative Malicious Methods
The lure of financial gain has motivated cybercriminals to implement innovative new methods and to become more thorough with each passing year. Today, cybercriminals are sophisticated, evolving their craft and tools in real time [5], [6]. For example, malware created today often undergoes quality control procedures: cybercriminals test it on numerous machines and operating systems to ensure it bypasses detection. Meanwhile, server-side polymorphic threats drive rapid evolution and propagation and are undetectable using traditional methods. One hundred pieces of malware can be multiplied in thousands of different ways. And malware is no longer restricted to personal computers: multi-platform malware means mobile devices are also at risk. The number of distribution points for spam, viruses, malware, and other malicious tools that cybercriminals employ is constantly increasing, while geo-specific threats have become more common. A recent threat infected computer users with IP addresses based in Italy, while those who accessed the Internet from IP addresses outside Italy were connected to an innocuous web page. This requires security vendors' detection to become more geographically granular.
Spear phishing threats now target individuals rather than countries, cities, companies, or demographic groups, further complicating detection.

C. Velocity: Fluidity of Threats
The need to manage, maintain, and process this huge volume and variety of data on a regular basis presents security vendors with an unprecedented velocity challenge. The fluidity of the Internet over time adds to the complexity of the problem [5], [6]. Unlike a physical street address, which cannot be relocated without leaving significant evidence behind, changing IP addresses on the Internet is trivial, rapid, and difficult to track. An individual or a company can move effortlessly and quickly from one location to another without leaving a trace. Determining whether a particular web site or page contains malicious content is likewise fluid over time: cybercriminals routinely transform legitimate sites into corrupt sites almost instantly.
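The pattern-deviation idea described at the start of this section can be illustrated with a toy baseline check: record the domains each user has visited and flag visits outside that history. This is a deliberately simplified sketch, not any vendor's algorithm; a production system would combine such signals with risk scoring, machine learning, and analyst review, and all names here are hypothetical.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy per-user baseline: flag visits to domains never seen before for that user.
public class DomainBaseline {
  private final Map<String, Set<String>> seenDomains = new HashMap<>();

  // Returns true if this visit deviates from the user's historical baseline.
  public boolean isAnomalous(String userId, String domain) {
    Set<String> baseline = seenDomains.computeIfAbsent(userId, k -> new HashSet<>());
    boolean novel = !baseline.contains(domain);
    baseline.add(domain); // update the baseline after scoring
    return novel;
  }

  public static void main(String[] args) {
    DomainBaseline monitor = new DomainBaseline();
    monitor.isAnomalous("alice", "news.example.com"); // builds baseline
    monitor.isAnomalous("alice", "ads.example.net");  // builds baseline
    // A never-before-seen domain is prioritized for further analysis.
    if (monitor.isAnomalous("alice", "evil.example.org")) {
      System.out.println("Flagged for review: evil.example.org");
    }
  }
}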
IV. BIG DATA SECURITY MANAGEMENT

When dealing with "Big Data," the volume and types of data about IT and the business are too great to process in an ad hoc manner. Moreover, it has become increasingly difficult to extract meaningful information from the data being collected. A number of factors explain this. Attackers are becoming more organized and better funded, but while attacks have become dynamic, defences have remained static. Today's attacks are designed to exploit the weaknesses of our user-centric, hyper-connected infrastructures. IT-enabled organizations continue to grow more complex: organizations now demand much more open and agile systems, creating incredible new opportunities for collaboration, communication, and innovation, which also results in new vulnerabilities that cybercriminals, "hacktivist" groups, and nation states have learned to exploit. Compliance is even more far-reaching: regulators and legislators are getting more prescriptive, and companies, particularly those with multiple lines of business or international operations, have an increasingly hard time keeping track of the controls currently in place, the controls that are needed, and how to ensure controls are being managed properly.
The combined effect of these factors makes security management in IT environments much more complex, with many more interdependencies and a wider scope of responsibility. As more business processes become digitized, security teams have both the opportunity and the challenge to collect and manage more data. Investments are increasingly made in log management, vulnerability management, identity management, and configuration management tools. However, breaches continue to happen, causing more disruption and expense than ever.
The following is an overview of the most common threats to big data management systems. Data stored in big data clusters is commonly regulated by law and needs to be protected accordingly. Attacks on corporate IT systems and data theft are prevalent, and large distributed data management systems provide a tempting target. Big data exhibits the same weak points we know from traditional IT systems, and technologies already exist that address these common threats, so there is no need to reinvent the wheel; the trick is selecting options that work with big data. Cluster administrators should consider security controls for each of the following areas when setting up or managing a big data cluster.

A. Data-at-Rest Protection
The standard for protecting data at rest is encryption, which guards against attempts to access data outside established application interfaces. With traditional data management systems, we worry about people stealing archives or directly reading files from disk. Encrypted files are protected against access by users without encryption keys. Replication effectively replaces backups for big data, but that does not mean a rogue administrator or cloud service manager won't create their own; encryption protects data copied from the cluster. One or two obscure NoSQL variants provide encryption for data at rest, but most do not. Worse, most available encryption products lack sufficient horizontal scalability and transparency to work with big data. This is a critical issue.

B. Administrative Data Access
Each node has at least one administrator with full access to its data. As with encryption, we need a boundary or facility that provides separation of duties between different administrators. The requirement is the same as on relational platforms, but big data platforms lack the relational world's array of built-in facilities, documentation, and third-party tools to address it. Unwanted direct access to data files or data node processes can be addressed through a combination of access controls, separation of roles, and encryption technologies, but data is only as secure as the least trustworthy administrator. It is up to the system designer to select controls that close this gap.

C. Configuration and Patch Management
With clusters of servers, it is common to have nodes running different configurations and patch levels as new nodes are added over time. If dissimilar OS platforms are used in the cluster, determining what constitutes equivalent patch revision levels can be difficult. Existing configuration management tools work for the underlying platforms, and HDFS Federation will help with cluster management, but careful planning is still necessary. The cluster may tolerate nodes cycling without loss of data or service interruption, but reboots can still cause serious performance issues, depending on which nodes are affected and how the cluster is configured.

D. Authentication of Applications and Nodes
Hadoop can use Kerberos to authenticate users and add-on services to the Hadoop cluster [7]. But a rogue client can be inserted onto the network if a Kerberos ticket is stolen or duplicated, perhaps using credentials extracted from virtual image files or snapshots. This is more of a concern with embedded credentials in virtual and cloud environments, where it is relatively easy to introduce an exact replica of a client app or service; a clone of a node is often all that is needed to introduce a corrupted node or service into a cluster. In other words, it is easy to impersonate a cluster node or service. Kerberos improves security significantly, but care is still needed: it is a pain to set up, yet strong authentication of nodes is a principal security tool for keeping rogue servers and requests out of the cluster.
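As a concrete illustration of Kerberos authentication in Hadoop, the sketch below logs a service into a Kerberized cluster from a keytab using Hadoop's UserGroupInformation API; the principal name and keytab path are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Tell the Hadoop client that the cluster requires Kerberos.
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // Authenticate this process from a keytab (principal and path are placeholders).
    UserGroupInformation.loginUserFromKeytab(
        "svc-etl/node1.example.com@EXAMPLE.COM",
        "/etc/security/keytabs/svc-etl.keytab");

    System.out.println("Logged in as: "
        + UserGroupInformation.getCurrentUser().getUserName());
  }
}

A keytab-based login like this lets long-running services re-authenticate without interactive passwords, which is what makes bi-directional trust between nodes practical.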
E. Audit and Logging
We need a record of activity, and logging is one area that offers a variety of add-on capabilities [8]. Scribe and LogStash are open source tools that integrate into most big data environments, as do a number of commercial products. We just need to find a compatible tool, install it, integrate it with other systems such as SIEM or log management, and then actually review the results. Without actually looking at the data and developing policies to detect fraud, logging is not useful.

F. Monitoring, Filtering, and Blocking
There are no built-in monitoring tools to look for misuse or block malicious queries. In fact, there is no consensus on what a malicious big data query looks like, aside from MapReduce scripts written by bad programmers [9]. It is assumed that clients are authenticated through Kerberos, and MapReduce access is gated by digest authentication. Several monitoring tools are available for big data environments, but most review data and user requests at the API layer.

G. API Security
The APIs for big data clusters need to be protected from code and command injection, buffer overflow attacks, and every other web services attack. Most of the time this responsibility falls upon the application(s) that use the cluster, but not always. Common security controls include integration with directory services, mapping to API services, filtering requests, input validation (a small validation sketch follows below), managing policies across nodes, and so on. Some of the APIs even work without authentication. Again, there are a handful of off-the-shelf solutions to help address API security issues, but most are based on a gateway that funnels users and all requests through a single interface.
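Input validation, one of the controls listed above, can be as simple as a whitelist check on API parameters. The sketch below is illustrative only; the identifier policy, class, and method names are hypothetical.

import java.util.regex.Pattern;

// Minimal whitelist-style input validation for an API parameter.
public class InputValidator {
  // Accept only short alphanumeric identifiers (illustrative policy).
  private static final Pattern SAFE_ID = Pattern.compile("^[A-Za-z0-9_-]{1,64}$");

  public static String requireSafeId(String input) {
    if (input == null || !SAFE_ID.matcher(input).matches()) {
      throw new IllegalArgumentException("Rejected unsafe identifier");
    }
    return input;
  }

  public static void main(String[] args) {
    requireSafeId("job_42"); // accepted
    try {
      requireSafeId("42; DROP TABLE users"); // rejected: injection attempt
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}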
H. Use Kerberos for Node Authentication
Kerberos is effective for validating inter-service communication and helps keep rogue nodes and applications out of the cluster [9]. It can also help protect web console access, making administrative functions harder to compromise. Kerberos is a pain to set up, and (re-)validation of new nodes and applications adds overhead. But without bi-directional trust establishment, it is too easy to fool Hadoop into letting malicious applications into the cluster or accepting malicious nodes, which can then add, alter, and extract data. Kerberos is one of the most effective security controls at our disposal, and it is built into the Hadoop infrastructure.

I. Use File Layer Encryption
File encryption protects against two attacker techniques for circumventing application security controls: it protects data if malicious users or administrators gain access to data nodes and directly inspect files, and it renders stolen files or copied disk images unreadable. While it may be tempting to rely on encrypted SAN/NAS storage devices, they do not provide protection against credentialed user access, granular protection of files, or multi-key support. File layer encryption provides consistent protection across different platforms, regardless of OS, platform, or storage type, with some products even protecting encryption operations in memory. Just as important, encryption meets our requirements for big data security: it is transparent to both Hadoop and calling applications, and it scales out as the cluster grows. Open source products are available for most Linux systems; commercial products additionally offer external key management, trusted binaries, and full support. This is a cost-effective way to address several data security threats.
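To illustrate the underlying mechanics (not a substitute for a scalable, transparent file-encryption product), here is a minimal sketch that encrypts a file with AES-GCM using Java's standard javax.crypto API; the file paths are placeholders, and key handling is deliberately simplified.

import java.nio.file.Files;
import java.nio.file.Path;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class FileEncryptExample {
  public static void main(String[] args) throws Exception {
    // Generate a 256-bit AES key (in practice it would come from a key manager).
    KeyGenerator kg = KeyGenerator.getInstance("AES");
    kg.init(256);
    SecretKey key = kg.generateKey();

    // Fresh random 12-byte IV per file; GCM also authenticates the ciphertext.
    byte[] iv = new byte[12];
    new SecureRandom().nextBytes(iv);

    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));

    byte[] plaintext = Files.readAllBytes(Path.of("data.block")); // placeholder path
    byte[] ciphertext = cipher.doFinal(plaintext);

    // Store the IV alongside the ciphertext; it is not secret, but must be unique.
    Files.write(Path.of("data.block.enc"), ciphertext);
    Files.write(Path.of("data.block.iv"), iv);
  }
}

A production product would stream data in chunks, manage keys externally, and hook into the file system transparently; this sketch only shows why the key, and not the encrypted file, becomes the asset to protect, which motivates the next recommendation.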
J. Use Key Management
File layer encryption is not effective if an attacker can access the encryption keys. Many big data cluster administrators store keys on local disk drives because it is quick and easy, but it is also insecure, as keys can be collected by the platform administrator or an attacker. Use a key management service to distribute keys and certificates and to manage different keys for each group, application, and user (a small keystore sketch appears at the end of this section). This requires additional setup, and possibly commercial key management products to scale with a big data environment, but it is critical: most of the encryption controls we recommend depend on key/certificate security.

K. Deployment Validation
Deployment consistency is difficult to ensure in a multi-node environment. Patching, application configuration, updating the Hadoop stack, collecting trusted machine images, certificates, and platform discrepancies all contribute to what can easily become a management nightmare. The good news is that most big data clusters are deployed in cloud and virtual environments. Machine images, patches, and configurations should be fully updated and validated prior to deployment. You can even run validation tests, collect encryption keys, and request access tokens before nodes are accessible to the cluster. We also recommend using the service-level authentication built into Hadoop to help segregate administrative responsibilities. Building the scripts and setting up these services takes time up front, but pays for itself in reduced management time and effort later, and ensures that each node comes online with baseline security in place [9].

L. Log It
To detect attacks, diagnose failures, or investigate unusual behavior, we need a record of activity. Unlike less scalable data management platforms, big data is a natural fit for collecting and managing event data; many web companies started with big data specifically to manage log files, and most SIEM and log management products embed big data capabilities for log management. There is no reason not to add logging to an existing cluster: it gives us a place to look when something fails. Logging MapReduce requests and other cluster activity is easy, and the increase in storage and processing demands is small [9].

M. Use Secure Communication
Implement secure communication between nodes, and between nodes and applications. This requires an SSL/TLS implementation that actually protects all network communications rather than just a subset. This imposes a small performance penalty on the transfer of large data sets around the cluster, but the burden is shared across all nodes. Cloudera offers TLS, and some cloud providers offer secure communication options as well; otherwise you will likely need to integrate these services into your application stack.
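As a small illustration of keeping keys off plain local disk (see item J above), the sketch below stores an AES key in a password-protected PKCS#12 keystore using Java's standard KeyStore API; a real deployment would use a dedicated key management service, and the alias, file name, and password here are placeholders.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.security.KeyStore;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class KeyStoreExample {
  public static void main(String[] args) throws Exception {
    char[] storePass = "change-me".toCharArray(); // placeholder password

    // Create an empty PKCS#12 keystore and add a generated AES key to it.
    KeyStore ks = KeyStore.getInstance("PKCS12");
    ks.load(null, storePass);

    KeyGenerator kg = KeyGenerator.getInstance("AES");
    kg.init(256);
    SecretKey key = kg.generateKey();
    ks.setEntry("hdfs-data-key",
        new KeyStore.SecretKeyEntry(key),
        new KeyStore.PasswordProtection(storePass));

    try (FileOutputStream out = new FileOutputStream("cluster-keys.p12")) {
      ks.store(out, storePass);
    }

    // Later: reload the keystore and retrieve the key by alias.
    KeyStore loaded = KeyStore.getInstance("PKCS12");
    try (FileInputStream in = new FileInputStream("cluster-keys.p12")) {
      loaded.load(in, storePass);
    }
    SecretKey recovered = (SecretKey) loaded.getKey("hdfs-data-key", storePass);
    System.out.println("Recovered key algorithm: " + recovered.getAlgorithm());
  }
}

Even this modest step separates key material from the data it protects; a key management service additionally centralizes rotation, per-application keys, and access auditing.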
V. CONCLUSION

This survey presented the concepts of different techniques used for implementing and storing the large amounts of data, known as big data, generated by various organizations. Generally, the big data life cycle comprises four stages:
data generation, data acquisition, data storage, and data analysis. The survey focused mainly on the proficiencies and security of big data. To that end, the literature above described various proficiencies, namely Bigtable, Dynamo, Cassandra, R, metadata, mashups, and so on, and, from the security point of view, the issues of volume, variety, and velocity. Each technique presents a unique characterization of data and provides security in big data. Since data management is gaining a significant role in digital data growth, security system technology still needs to be enhanced and improved in future IT systems.
REFERENCES
[1] Manish Kumar Kakhani, Sweeti Kakhani and S. R. Biradar, "Research Issues in Big Data Analytics," Vol. 2, Issue 8, August 2013, ISSN 2319-4847.
[2] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes and Robert E. Gruber, "Bigtable: A Distributed Storage System for Structured Data," Google.
[3] Papineni Rajesh and Y. Madhavi Latha, "HADOOP the Ultimate Solution for BIG DATA," Vol. 4, Issue 4, April 2013.
[4] P. Beaulah Soundarabai, Aravindh, Thriveni, K. R. Venugopal and L. M. Patnaik, "Big Data Analytics: An Approach using Hadoop Distributed File System," Vol. 3, Issue 11, May 2014.
[5] Priya P. Sharma and Chandrakant P. Navdeti, "Securing Big Data Hadoop: A Review of Security," Vol. 5 (2), 2014, pp. 2126-2131, ISSN 0975-9646.
[6] Venkata Narasimha Inukollu, Sailaja Arsi and Srinivasa Rao Ravuri, "Security Issues Associated with Big Data in Cloud Computing," Vol. 6, No. 3, May 2014.
[7] Solaimurugan Vellaipandiyan, "Big Data Framework - Cluster Resource Management with Apache Mesos," National Resource Centre for Free and Open Source Software (NRCFOSS).
[8] Bhavani Thuraisingham, "Big Data - Security with Privacy," draft, October 16, 2014. Available online: http://csi.utdallas.edu/events/NSF/NSF%20workshop%202014.htm
[9] Nivethitha Somu, A. Gangaa and V. S. Shankar Sriram, "Authentication Service in Hadoop using One Time Pad," Vol. 7(S4), pp. 56-62, April 2014, ISSN (Print) 0974-6846, ISSN (Online) 0974-5645.
[10] Chanchal Yadav, Shuliang Wang and Manoj Kumar, "Algorithm and Approaches to Handle Large Data - A Survey," Vol. 2, Issue 3, 2013, ISSN (Online) 2277-5420.