

Challenges and Opportunities in Big Data Analytics

Prof. Sharad Gore
Department of Statistics (Retd.)
Savitribai Phule Pune University, Pune 411 007

Abstract

Big data analytics has opened up a vast field of challenges for computer engineers, computer scientists and statisticians, with the potential for a great amount and variety of research and development in hardware, software and human expertise. Data science is becoming an integral part of the big data revolution, and statistics plays an important role in the emerging area of big data analytics. This talk is about the challenges and opportunities for statistical science and statisticians in the new era of big data analytics, as opposed to classical statistical analysis for developing inferential or predictive models. The large volume, velocity and variety of big data are making some of the classical analytical methods either irrelevant or inadequate. At the same time, efficient statistical methods are overcoming the problem of time complexity that arises from large data sizes. Survey-based methods are making way for model-based methods, because pervasive methods of data collection generate data without any human supervision, resulting in observations that are not obtained through survey sampling. The conclusion is that statistics as a scientific endeavor must modify some of its methodology in order to face the challenges and grab the opportunities presented by the big data revolution.

1. Introduction

The United Nations Economic and Social Council has a Statistical Commission. The forty-fifth session of the Commission took place during March 4-7, 2014, on the theme “Big Data and Modernization of Statistical Systems.” The main conclusions drawn at the session were that big data constitute a source of information that cannot be ignored by statisticians, and that statisticians must exploit the opportunities and confront the challenges effectively. Pervasive use of electronic devices and the perpetual generation and availability of digital data have led to a fundamental change in the nature of data. Data that are continuously being generated in enormous quantities are referred to as big data. Big data consist of a “high volume, velocity, and variety of data that demand cost-effective and innovative forms of processing.” Big data have the potential of producing more relevant and timely statistics than traditional sources. However, most sources of big data are controlled by the private sector, and most countries have still not promulgated legislation to permit the use of big data in the public domain.


2. Role of Statistics in Big Data Analytics

The statistical community has begun to recognize a paradigm shift due to big data. The National Institute of Statistics of Italy acknowledged that “if they want to exploit the enormous treasure in the big data mountain, statistical institutes must climb that mountain.” The fifty-ninth International Statistical Institute (ISI) World Statistics Congress, held in Hong Kong during August 25-30, 2013, devoted significant attention to the topic of big data. Big data will also be a major topic in the scientific programme of the sixtieth World Statistics Congress, to be held during July 26-31, 2015, in Rio de Janeiro, Brazil. It is obvious that big data will have a big impact on the statistical community. The specifics of this impact will become clear only with time, but some of its features are already visible. There are already indications of a paradigm shift from the survey-based approach to an approach based more on secondary data, where model-based analyses are more common than design-based analyses.

Big data sources fall into the following categories: administrative sources arising from programme administration, whether government or non-government; commercial sources arising from transactions between two entities; sensor network sources; tracking device sources; behavioural data sources; and opinion data sources.

In order to confront the challenges posed by the big data revolution, statistical systems have to be modernized. More research is needed to overcome the methodological difficulties impeding the exploitation of big data sources. The nature of big data may raise questions about representativeness and population coverage. Moreover, the variability and temporal nature of the data pose problems, because the quality of a statistical analysis is established by comparability, continuity and coherence. Necessary changes in methodology may include more frequent use of modelling and may require more involvement of academia. More research and experimental studies are required as part of exploring the potential applications of big data. Standardization of tools and methods should be considered for developing the ability and competence to confront big data challenges. Acquisition of the latest technology is necessary to satisfy the technological needs of collecting and processing big data. For example, organizations may have to consider cloud computing instead of moving large amounts of data to their own servers.

3. Methodological Aspects of Big Data Analytics

When it comes to statistical methodology, it is necessary to keep in mind that the nature, structure and content of big data are very different from the data assumed in classical or traditional statistical methods. More specifically, observations in big data may not be independent and identically distributed, as most classical statistical methods require. If observations are not independent, how can likelihood-based methods be applied? The likelihood of a sample is defined as the product of the density functions of the individual observations only because of the assumed mutual independence of the observations.
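To recall the standard definition being appealed to here (textbook material, not specific to this talk), independence is what makes the likelihood factorize and the log-likelihood become a sum:

\[
L(\theta \mid x_1, \ldots, x_n) \;=\; f(x_1, \ldots, x_n; \theta)
\;=\; \prod_{i=1}^{n} f(x_i; \theta)
\quad \text{(under independence)},
\qquad
\ell(\theta) \;=\; \sum_{i=1}^{n} \log f(x_i; \theta).
\]

When the observations are dependent, the joint density no longer factorizes, and this convenient sum structure, on which much of large-sample theory and most scalable optimization routines rely, is lost.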


Also, if the observations are not identically distributed, what happens to the parameter space and its dimensionality? How many parameters will the model involve? It is only because of the assumption that the observations in the sample are identically distributed, and hence share common parameters, that the number of parameters does not grow with the sample size. Large volumes of data make the analysis more complex in both time and space, but more efficient and faster algorithms can overcome this problem. This is one reason why more research is required in statistical methodology to cope with the big data revolution. Since large volumes of data are more likely to be heterogeneous, it may be useful to first segment the data into homogeneous clusters, so that every cluster is both much smaller and much more homogeneous than the original data. A consequence of these two features is that every cluster may be more convenient for statistical analysis. Model parameters may change from cluster to cluster, but smaller clusters may make cluster-level computations much faster than computations over the entire data.
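A minimal sketch of this segment-then-analyse idea (not part of the original talk; the synthetic data, the choice of k-means, and the per-cluster linear model are all assumptions made here for illustration) might look as follows in Python:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic heterogeneous data: two sub-populations with different regression slopes.
x = rng.uniform(0, 10, size=(10_000, 1))
group = rng.integers(0, 2, size=10_000)
y = np.where(group == 0, 2.0 * x[:, 0], -1.5 * x[:, 0]) + rng.normal(0, 1, 10_000)

# Step 1: segment the data into (hopefully) more homogeneous clusters.
features = np.column_stack([x[:, 0], y])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Step 2: fit a separate, much cheaper model within each cluster.
for k in np.unique(labels):
    mask = labels == k
    model = LinearRegression().fit(x[mask], y[mask])
    print(f"cluster {k}: n = {mask.sum()}, slope = {model.coef_[0]:.2f}")
```

Besides being smaller, the clusters can be processed independently, so the per-cluster fits parallelize naturally on a big data platform.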
4. Model-based Analysis

Statistical analysis of given data can be of two types: model-based and design-based. When a random sample is obtained from a large population, the randomness in the data arises from the random selection of sampling units, because once a sampling unit is selected for inclusion in the data, its value is a constant. This uncertainty is called design-based uncertainty, because it can be controlled by controlling the mechanism by which sampling units are selected for analysis. For instance, the sampling design may be changed from simple random sampling to stratified random sampling if the population is heterogeneous. This consideration is relevant in the case of sample data. The other method of data collection is through an experiment. In an experiment, observations are random not because some sampling units are selected for inclusion in the sample, but because the response of a sampling unit in the experiment cannot be determined or controlled. This uncertainty is called model-based uncertainty. If a factor that causes noticeable variation in the response variable is included in the study, then a part of the uncertainty is accounted for by the variation in that causal variable. Correlation and regression analyses are examples of such analyses, where the model parameters cannot be controlled by carefully selecting some sampling units rather than others.
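As a concrete illustration of model-based uncertainty (a standard textbook formulation, not taken from the talk itself), consider the simple linear regression model

\[
y_i \;=\; \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2),
\]

where the randomness enters through the error term \(\varepsilon_i\), that is, through the model for the response, rather than through the mechanism that decides which units enter the sample; no choice of sampling design removes this source of uncertainty.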
5. Challenges of Big Data

Computer science involves studying the management of resources such as time, space and energy. It views data as a workload and not as a resource. With the paradigm shift due to the emergence of big data, there is a need to view data as a resource. Data as a resource combine with other resources to provide timely, cost-effective and high-quality inferences, and hence to deliver more efficient decisions. As with resources like time and space, more of the data resource should mean better results. However, that does not appear to be the situation at present, for two main reasons. First, the complexity of a query grows faster than the data size, in the following sense. More rows in a data table usually come with more columns. More columns imply that more hypotheses have to be considered; in fact, the number of hypotheses grows exponentially in the number of columns. As a result, the larger the data, the higher the probability of false positives, that is, of random fluctuations being taken as signal. Second, the larger the data, the smaller the probability that a sophisticated algorithm will be able to finish within the stipulated time frame. As a consequence, it may be necessary to use a less efficient algorithm that may have a higher error rate. Alternatively, it is possible to subsample from the huge data set to expedite the computations, but this requires complete knowledge of the entire data, which is not available before undertaking any analysis. It is also necessary to make sure that the quality of inference improves monotonically with growing data size. Classical statistical theory specifies the asymptotic distribution of an estimator, thereby putting a lower bound on the precision of the estimator in the form of its asymptotic variance.
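A standard back-of-the-envelope calculation (not part of the original text, and assuming for simplicity m independent tests, each carried out at level α) makes the false-positive inflation described above concrete:

\[
P(\text{at least one false positive}) \;=\; 1 - (1 - \alpha)^m,
\]

so with α = 0.05, even m = 100 hypotheses already give 1 − 0.95^100 ≈ 0.994; since the number of candidate hypotheses grows exponentially in the number of columns, wide tables make spurious findings almost certain without explicit multiple-testing corrections.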
6. Frequentist and Bayesian Perspectives

Frequentist and Bayesian are the two main paradigms of statistical inference. Frequentist statistics tends to focus on the analysis of procedures rather than on how they are derived. Machine learning research, along with optimization, has changed this to some extent through techniques like the bootstrap. In this paradigm, procedures can come from anywhere; they need not be derived from any probability model. It is therefore necessary to develop principles and techniques of analysis, so that bad methods can be ruled out and reasonable methods can be ranked. The analysis proceeds through a hierarchy of results: consistency, rates of convergence, and sampling distributions. Classical frequentist statistics focused on parametric inference. Then there was a wave of non-parametric tests. More recently, there has been a wave of work on variations of non-parametric methods, such as function estimation and the analysis of data with small sample sizes but high dimension. Empirical process theory is one of the most powerful general tools for obtaining consistency, rates and sampling distributions in a uniform way, and it covers most of statistical learning theory.
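As an illustration of a procedure that "comes from anywhere" yet can still be analysed in frequentist terms (a sketch only; the median as the statistic of interest and the use of plain NumPy are choices made here, not something specified in the talk), a nonparametric bootstrap estimate of a standard error can be written as:

```python
import numpy as np

def bootstrap_se(data, statistic, n_boot=2000, seed=0):
    """Estimate the standard error of `statistic` by the nonparametric bootstrap."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # Resample the data with replacement and recompute the statistic each time.
    estimates = np.array([
        statistic(rng.choice(data, size=n, replace=True))
        for _ in range(n_boot)
    ])
    return estimates.std(ddof=1)

# Example: standard error of the sample median of skewed data.
sample = np.random.default_rng(1).exponential(scale=2.0, size=500)
print(f"median = {np.median(sample):.3f}, bootstrap SE = {bootstrap_se(sample, np.median):.3f}")
```

The appeal, from the frequentist point of view, is that no probability model needs to be written down for the procedure itself, while its behaviour can still be studied through sampling distributions.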
The Bayesian paradigm, on the other hand, is divided into two parts: subjective Bayes and objective Bayes. The subjective Bayes approach works with a domain expert to come up with the model, the prior distribution and the loss function. It therefore involves developing new models, new computational methods for integration, and new subjective techniques of assessment. Subjective Bayes provides a strong framework in principle, but it has some serious problems in practice. Complex models can have many parameters, whose distributions must all be assessed. Assumptions of independence must be imposed in order to develop criteria for assessment, and may also be required to keep the models computationally tractable. Finally, it is particularly difficult to assess the tail behaviour of a distribution, and tail behaviour can sometimes be important. Objective Bayes can be handy when subjective Bayes fails due to complexity. It attempts to develop principles for setting priors that have minimal impact on posterior inference. For example, reference priors maximize the divergence between the prior and the posterior. Objective Bayes often uses frequentist ideas to develop principles for choosing priors. As a result, it provides an appealing framework, but it can be quite challenging for complex models such as multivariate or hierarchical models.

Coherence and calibration are two important quality requirements in statistical inference. The Bayesian approach is focused on coherence, while the frequentist approach does not worry much about it. The problem with pure coherence is that it may produce an inference that is perfectly coherent but completely wrong, due to a lack of calibration. On the other hand, the frequentist approach tends to focus on calibration, while the Bayesian approach pays little attention to it. The problem with pure calibration is that it may produce a fully calibrated but useless inference, due to a lack of coherence. It is therefore useful to develop a blend of the two approaches in order to achieve a balance between coherence and calibration.

7. Conclusions

It can be safely concluded that big data analytics are exciting because they are challenging, and they are challenging because they are complex in terms of both space and time. Statistical learning, machine learning and advances in computing technology must merge in order to confront the challenges of big data and contribute fruitfully to the emerging field of big data analytics. This is an open area for exploration and can accommodate computer engineers, computer scientists, mathematical scientists and statisticians alike. The four V's of big data, namely volume, variety, velocity and veracity, were initially considered intimidating, but there is now a better understanding of the challenge, and it has been found that the challenge is worth facing. It is both demanding and rewarding: demanding because there are no pre-determined rules of the game, and rewarding because it can handle complex situations and provide real-life, real-time solutions. The automation of data acquisition demands automation of data analysis, and machine learning algorithms satisfy precisely this requirement. It is therefore useful to learn probability theory, probability distributions, statistical inference, Bayesian statistics, multivariate statistics, sampling theory, linear models, computational statistics, statistical learning theory and data mining algorithms for big data analytics. Relevant tools and technologies, such as Hadoop, MapReduce and cloud computing, have already been developed by the big data community. As a result, it can be said that, like any other new field of expertise, the field of big data analytics is open to anyone who is willing and capable, curious and interested, and, most importantly, adaptive and yet well prepared.


