e-ISSN: 2582-5208
International Research Journal of Modernization in Engineering Technology and Science
Volume: 03 / Issue: 03 / March 2021    Impact Factor: 5.354    www.irjmets.com
DEMYSTIFICATION OF DATASET SHIFT IN THE FIELD OF DATA SCIENCE
Mir Nawaz Ahmad*1, Ummer Altaf Bhat*2
*1 Department of Computer Science and Engineering, SSM College of Engineering, Parihaspora 193121, Jammu and Kashmir, India.
*2 Department of Computer Science and Engineering, SSM College of Engineering, Parihaspora 193121, Jammu and Kashmir, India.
ABSTRACT
Dataset shift is defined by a change in data distribution between the training and test sets. When building a machine learning algorithm, we train it on training data with the expectation that it will yield comparable results on test data. In real-world implementations of machine learning models this is often not the case: the distribution of the data inevitably changes, and the model's performance degrades as a result of this shift. This paper also includes an experimental discussion of the impact of different approaches on the performance of the algorithms mentioned in the following sections. After applying techniques such as importance reweighting, Batch Normalization, and feature dropping, the performance of the machine learning models improved significantly.
Keywords: Dataset shift, Covariate shift, Concept drift, Internal covariate shift, Batch normalization, Sample selection bias, Prior probability shift, Data science.
I. INTRODUCTION
Data scientists have examined data quality in machine learning problems from a variety of angles, including missing values, data uncertainty, noise, data complexity, imbalance, and, in this case, dataset shift [1]. Dataset shift occurs when the testing (unseen) data experience a phenomenon that changes the distribution of a single feature, a group of features, or the class boundaries. As a consequence, the standard assumption that training and testing data follow the same distribution is often violated in real-world applications and situations. Formally, dataset shift is the difficult situation in which the joint distribution of inputs and outputs differs between the training and test phases; in the special case of covariate shift, only the input distribution changes ("covariate" denotes input) while the conditional distribution of the outputs given the inputs, P(y|x), remains unchanged. Dataset shift arises in most practical implementations for a variety of reasons, ranging from bias in the experimental design to the simple impossibility of reproducing the testing conditions at training time. In an image classification task, for example, training data may have been collected under controlled laboratory conditions, while test data may be collected under varying lighting conditions. In certain applications, the data generation process is itself adaptive. Some authors consider the problem of spam email filtering: effective "spammers" try to craft spam that differs from the spam on which the automated filter was trained [2]. Until recently, the machine learning community seems to have shown little interest in dataset shift. Indeed, many machine learning algorithms assume that the training data come from the same distribution as the test data on which the model will later be evaluated. This paper aims to provide a summary of the various recent activities in the machine learning field that cope with dataset shift [3].
II. DATA SHIFT
As the name implies, a dataset shift happens when the data distribution changes. When creating a machine learning model, the aim is to discover the (possibly nonlinear) relationships between the data and the target variable. We could then feed new data from the same distribution into the model and expect similar performance and results. However, this is rarely the case; in fact, consumer preferences, technological breakthroughs, political, social, and other uncontrollable variables can all drastically alter the input dataset, the target variable, and the underlying trends and relationships between the input and output data. Each of these scenarios has its own name in data science, but they all result in deteriorating model output. Handling them is application-specific, relying heavily on the data scientist's ability to investigate and fix issues. For example, how do we know when the dataset has changed enough to cause a problem for our algorithms? How do we assess the trade-off
between losing accuracy by eliminating features and losing accuracy through an inaccurate data distribution if only those features begin to diverge? Data shift or data drift, concept shift, changing environments, and data fractures are all similar terms describing the same phenomenon: a different distribution of data between the train and test sets [4]. In this paper, we discuss the different types of dataset shift, the problems that can arise from their presence, and current best practices that one can use to avoid them.

2.1. Covariate shift
Covariate shift refers to a change in the distribution of one or more of the independent variables, that is, a change in the distribution of the covariates in particular. It is typically caused by changes in the state of latent variables, which may be temporal (including changes in the stationarity of a temporal process), spatial, or less apparent. A changing environment that affects the input variables but not the target variable causes a covariate shift. In mathematical notation, covariate shift is the situation where:

Ptrain(Y|X) = Ptest(Y|X) but Ptrain(X) ≠ Ptest(X)
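To make the definition concrete, the following minimal Python sketch (illustrative only; the data and variable names are ours, not from the original paper) simulates a covariate shift: the relationship P(Y|X) is identical in both sets, but the input distributions differ.

import numpy as np

rng = np.random.default_rng(0)

def label(x):
    # P(y|x) is the same everywhere: a fixed noisy linear relationship.
    return 2.0 * x + rng.normal(0.0, 0.1, size=x.shape)

# Training inputs concentrated around 0; test inputs shifted to around 3.
x_train = rng.normal(loc=0.0, scale=1.0, size=1000)   # Ptrain(X)
x_test = rng.normal(loc=3.0, scale=1.0, size=1000)    # Ptest(X) differs

y_train, y_test = label(x_train), label(x_test)

# P(y|x) unchanged, P(x) changed: this is covariate shift.
print(x_train.mean(), x_test.mean())   # roughly 0.0 vs 3.0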
Figure 1: Covariate shift [5]

Here are some examples where covariate shift is likely to cause problems [6]:
• Categorizing photographs as cats or dogs, while animals that appear in the test set are excluded from the training collection.
• Facial recognition algorithms trained mostly on younger faces, even though the test dataset contains a much higher percentage of older faces.
• Predicting life expectancy with only a small sample of smokers in the training set but a much larger proportion of smokers in the test set.

The underlying relationship between the input and output does not vary in this situation (the regression line remains the same), but part of the relationship is data-sparse, omitted, or misconstrued, so the test and training sets do not reflect the same distribution, as shown in Figure 1. Covariate shift also triggers plenty of issues when doing cross-validation: without covariate shift, cross-validation is nearly unbiased, yet with covariate shift it is massively biased.

Addressing the problem of covariate shift
There are different techniques with which we can treat the covariate shift problem in order to improve our model. We mention the two most used techniques:
01. Dropping of drifting features: This approach is straightforward, since it simply eliminates the features identified as drifting. But merely removing features could result in some loss of information, so a basic guideline helps: remove features with a drift value greater than 0.8 that are not significant in the model, and retain the important drifting features while removing the other shifting ones. However, before removing any feature, double-check whether it can be reused to engineer a new one.

02. Importance weighting using density ratio estimation: With this method, the importance is calculated by first estimating the training and test densities separately and then taking the ratio of the estimated density of the test set to that of the train set. These ratios are then used as per-instance weights on the training data. A sketch of this idea appears below.
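The text above describes estimating the two densities separately; a common practical variant approximates the density ratio directly with a probabilistic classifier that distinguishes train from test samples. A minimal sketch of that variant, assuming scikit-learn and hypothetical array names:

import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(X_train, X_test):
    """Estimate w(x) = p_test(x) / p_train(x) via a domain classifier."""
    X = np.vstack([X_train, X_test])
    # Label 0 = came from the train set, 1 = came from the test set.
    d = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    clf = LogisticRegression(max_iter=1000).fit(X, d)
    p_test = clf.predict_proba(X_train)[:, 1]
    # p(test|x) / p(train|x) is proportional to p_test(x) / p_train(x).
    return p_test / np.clip(1.0 - p_test, 1e-6, None)

# The weights can then be passed to any estimator that accepts them, e.g.:
# model.fit(X_train, y_train, sample_weight=importance_weights(X_train, X_test))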
2.2. Prior probability shift
Prior probability shift is the polar opposite of covariate shift: the input variable distributions remain constant while the output variable's distribution varies. Whereas covariate shift concerns changes in the feature (x) distribution, prior probability shift concerns changes in the distribution of the class variable y. Prior probability shift is the situation where:

Ptrain(X|Y) = Ptest(X|Y) but Ptrain(Y) ≠ Ptest(Y)

Figure 2: Prior probability shift [7]

If the training set has balanced prior probabilities for spam (i.e., the chance of an email being spam is 0.5), we should expect half of the training set to be spam and half to be non-spam. If in deployment 90% of our emails are actually spam (which isn't impossible), then the prior probability of the class variable has shifted. This concept is related to data sparsity and biased feature selection, both of which are also factors in triggering covariate shift, but here they affect our target distribution rather than our input distribution. This issue appears only in Y → X problems and is often linked with naive Bayes. A sketch of correcting a model's outputs for shifted priors follows.
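When only the class priors change, a probabilistic classifier's outputs can be rescaled using the standard identity p_new(y|x) ∝ p_old(y|x) · p_new(y) / p_old(y). A minimal sketch (illustrative; the function and data are ours):

import numpy as np

def adjust_for_new_priors(proba, train_priors, test_priors):
    """Rescale predicted class probabilities p(y|x) for shifted priors,
    using p_new(y|x) proportional to p_old(y|x) * p_new(y) / p_old(y)."""
    ratio = np.asarray(test_priors) / np.asarray(train_priors)
    adjusted = proba * ratio                              # reweight each class column
    return adjusted / adjusted.sum(axis=1, keepdims=True)  # renormalize rows

# A spam filter trained on balanced data (priors 0.5/0.5) deployed where
# 90% of mail is spam (priors 0.1/0.9):
proba = np.array([[0.6, 0.4]])                   # model output for one email
print(adjust_for_new_priors(proba, [0.5, 0.5], [0.1, 0.9]))  # ~[0.14, 0.86]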
2.3. Concept drift
Concept drift, in simple terms, occurs when the relationships between the input and output variables change. We are then no longer concentrating solely on the X or Y distributions, but on the relationship between them. Mathematically, concept drift is the situation where:

Ptrain(Y|X) ≠ Ptest(Y|X) but Ptrain(X) = Ptest(X) in X → Y problems
Ptrain(X|Y) ≠ Ptest(X|Y) but Ptrain(Y) = Ptest(Y) in Y → X problems

Concept drift differs from covariate shift and prior probability shift in that it applies to the relationship between the two variables rather than to the data distribution or the class distribution alone. Time series analysis is a natural setting in which to observe concept drift. Before doing any time series analysis, it is normal to check whether the time series is stationary, as stationary time series are much simpler to evaluate than non-stationary ones.
Figure 3: Concept drift [8]

How to address concept drift
1. Do Nothing (Static Model). The most popular approach is to ignore drift entirely and presume that the data will remain unchanged. This lets you create a single model once and apply it to all future data. It should serve as your starting point and as a benchmark against which you compare other approaches. If you think the dataset will suffer from concept drift, a static model can be used in two ways. Concept drift detection: monitor the performance of the static model over time; if it decreases, concept drift may be happening and intervention is required. Baseline performance: use the performance of the static model as a benchmark against any intervention you make [9].
2. Periodically Re-Fit. Periodically updating the static model with more recent historical data is an effective first-level intervention. Back-testing the model may also be necessary to determine how much historical data to use when re-fitting the static model.
3. Weight Data. Use a weighting that is inversely related to the age of the data, so that the most recent data gets more importance (higher weight) and less recent data gets less importance (lower weight); a minimal sketch of this weighting follows the list.
4. Detect and Choose Model. For certain problem domains it may be necessary to build systems that identify changes and select a distinct model to make forecasts. This is suitable for domains where sudden shifts have been observed in the past and are expected in the future, and it means that specialized models can be developed to accommodate each of the observable data regimes.
5. Data Preparation. When dealing with this sort of challenge, it is common to use differencing to eliminate systematic changes in the data over time, such as trends and seasonality. This is so common that it is built into classical methods such as the ARIMA model [9].
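The age-based weighting in approach 3 can be sketched as follows (an illustrative example with synthetic data and names of our own; any scikit-learn estimator that accepts sample_weight would work):

import numpy as np
from sklearn.linear_model import SGDClassifier

def age_weights(timestamps, half_life_days=30.0):
    """Exponential decay: a sample half_life_days old counts half as much."""
    age = timestamps.max() - timestamps          # age of each row, newest = 0
    return 0.5 ** (age / half_life_days)

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
t = rng.uniform(0, 365, size=200)                # collection day of each row

model = SGDClassifier()
model.fit(X, y, sample_weight=age_weights(t))    # recent rows dominate the fit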
2.4. Internal covariate shift
Changes in the distribution of the activations output by a given hidden layer, which serve as the input to a subsequent layer, can trigger covariate shift within the network layers and hinder deep neural network learning. Internal covariate shift in the hidden layers slows training, necessitating lower learning rates and careful parameter initialization. Weights are modified and new data is seen at each training epoch, producing subtly different inputs to a neuron each cycle. When these modifications are carried on to the next neuron, the input distribution of each neuron varies at every epoch. Normally this is not a big deal, but in deep networks minor variations in input distribution add up quickly and have a big impact farther down. In the end, the input distribution seen by the deepest neurons varies greatly from epoch to epoch. As a result, these neurons must constantly adapt to an evolving input distribution, and their ability to learn is seriously hampered. This dynamically shifting input distribution is what internal covariate shift refers to.

Treatment of internal covariate shift
Batch Normalization: Batch normalization normalizes a layer's input by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. The term "mini-batch" refers to a single batch of data, a subset of the entire training data, processed at each step. The normalization means that the inputs have a mean of zero and a standard deviation of one, ensuring that the input distribution to each neuron stays consistent, mitigating internal covariate shift and providing some regularization.
Figure 4: Batch normalization reduces Internal Covariate Shift [10]
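The forward pass of batch normalization can be written in a few lines of NumPy. This is a minimal sketch of the mechanics, not production code; real framework layers also learn gamma and beta and track running statistics for inference:

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of layer inputs (rows = samples): subtract the
    mini-batch mean, divide by the mini-batch standard deviation, then apply
    the learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.random.default_rng(1).normal(5.0, 3.0, size=(32, 4))
out = batch_norm(batch)
print(out.mean(axis=0).round(6), out.std(axis=0).round(6))  # ~0 and ~1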
III. MAJOR CAUSES OF DATASET SHIFT
In this section we address some of the most common causes of dataset shift. These terms have caused some misunderstanding in the past, so it is necessary to clarify that they are factors that may cause dataset shift to occur, but they are not dataset shifts in and of themselves. There are a number of reasons for dataset shift; this section focuses on the two we believe are the most important: sample selection bias and non-stationary environments.

Sample selection bias
Sample selection bias is a form of bias caused by choosing non-random data for statistical analysis. The bias stems from a flaw in the sample collection procedure, by which a portion of the data is systematically excluded due to a particular attribute. The omission of that subset may influence the statistical significance of a test and skew the parameter estimates of a statistical model. Sample selection bias is not caused by an error in any algorithm or data processing step; it is purely a methodological flaw in the data collection or labeling procedure that yields a non-uniform selection of training examples from the population, resulting in bias during training. Since it manipulates the distribution of the observed data, sample selection bias can produce a form of covariate shift: the training environment is a misleading representation of the operational system, tuned to fictitious or cherry-picked operating conditions. In general, under sample selection bias the data in the training set obey Ptrain = P(s = 1|y, x) P(y, x), while the data in the test set follow Ptest = P(y, x). Depending on the type of problem, we have:

Ptrain = P(s = 1|y, x) P(y|x) P(x) and Ptest = P(y|x) P(x) in X → Y problems
Ptrain = P(s = 1|y, x) P(x|y) P(y) and Ptest = P(x|y) P(y) in Y → X problems

where s is a binary selection variable that determines whether a data point is included in the training sample (s = 1) or rejected from it (s = 0). A small simulation of such a selection variable follows.
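The sketch below (illustrative; the data-generating process is our own assumption) shows how a selection probability that depends on x produces a training set whose feature distribution no longer matches the population:

import numpy as np

rng = np.random.default_rng(2)

# Full population: feature x and label y with a fixed relationship.
x = rng.normal(0.0, 1.0, size=10_000)
y = (x + rng.normal(0.0, 0.5, size=x.size) > 0).astype(int)

# Biased selection: P(s=1 | x) is higher for small x, so large-x
# examples are systematically under-represented in training.
p_select = 1.0 / (1.0 + np.exp(2.0 * x))      # selection probability
s = rng.random(x.size) < p_select

x_train, y_train = x[s], y[s]                  # biased training sample
print(f"population mean x: {x.mean():+.2f}, "
      f"training mean x: {x_train.mean():+.2f}")  # shifted distribution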
Types of sample selection bias
Sample selection bias can manifest itself in a variety of ways. The most prominent forms are the following:
1. Self-selection. Self-selection occurs when survey subjects have some discretion over whether to engage in the research. Because individuals decide for themselves whether to participate, the chosen sample does not represent the whole population.
2. Selection from a specific area. The study's participants are chosen from only a few regions, while others are not included in the survey.
3. Exclusion. Certain segments of the population are excluded from the analysis.
4. Survivorship bias. Survivorship bias exists when a study is based on subjects who passed a selection process while ignoring subjects who did not. It causes the study's conclusions to be overly optimistic.
5. Pre-screening of participants. The study's participants are drawn from a select group of people, so the survey does not represent the whole research population.

Non-stationary environments
In real-world implementations it is often the case that the data are not (time- or space-) stationary. Non-stationary environments induce various types of change [11] depending on the type of problem: in X → Y problems, a non-stationary environment may cause changes in P(x) or P(y|x), resulting in covariate shift or concept drift, respectively; in Y → X problems, it may cause prior probability shift through a change in P(y), or concept drift through a change in P(x|y). Among the most relevant non-stationary situations are adversarial classification problems, such as spam filtering and malware/intrusion detection. This sort of problem is gaining interest in the machine learning community because of the presence of an adversary who tries to work around the rules learned by the current classifier. In terms of the machine learning challenge, this adversary reshapes the test set so that it differs from the training set, producing some form of dataset shift.
IV. IDENTIFYING DATASET SHIFT
There are many approaches for determining whether shift is occurring in a dataset and how severe it is. Unsupervised approaches are by far the most practical methods for detecting dataset shift because they do not need the pre-analysis that is time-consuming in some production processes. The most common approaches, handy in the initial stage of processing, are described below.

Statistical distance
The statistical distance approach is used to see how your model inputs and estimates have changed over time, and it is accomplished through the construction and comparison of histograms. Making histograms allows you not only to see how your model forecasts have changed over time, but also to check whether your most significant features have changed. Simply create histograms from your training data, track them over time, and compare them to see whether any changes have occurred. Various institutions use this approach in credit-scoring models. A number of metrics can be used to monitor the evolution of the distributions over time, including the histogram intersection, the Kolmogorov-Smirnov statistic, the Population Stability Index (PSI), and the Kullback-Leibler divergence (or other f-divergences); a sketch computing two of these follows.
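The following sketch (illustrative; the psi helper, data, and the usual credit-scoring rule of thumb PSI > 0.25 for major shift are assumptions, not from the paper) computes a histogram-based PSI and the Kolmogorov-Smirnov statistic for one feature:

import numpy as np
from scipy.stats import ks_2samp

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of one feature."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf        # catch out-of-range values
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)           # avoid log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(3)
train_scores = rng.normal(0.0, 1.0, 5000)
live_scores = rng.normal(0.4, 1.0, 5000)         # drifted feature

print("PSI:", psi(train_scores, live_scores))    # > 0.25 suggests major shift
print("KS:", ks_2samp(train_scores, live_scores).statistic)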
Figure 5: Kullback-Leibler divergence [12]

Novelty detection
Novelty detection is a technique well suited to more complex domains such as computer vision. The aim is to build a model of the source distribution and, given a new data point, estimate the probability that it was drawn from that distribution. Different techniques are available for this, such as the one-class support vector machine found in the most popular libraries (sketched below). If you are working with homogeneous yet complex inputs (e.g., visual, audio, or remote sensing data), this is an approach to consider, since the statistical distance (histogram) method will not be as effective.
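A minimal one-class SVM sketch with scikit-learn, under assumed synthetic data (the nu and gamma settings are illustrative choices, not prescriptions from the paper):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(4)
X_train = rng.normal(0.0, 1.0, size=(500, 2))    # source distribution
X_new = np.array([[0.1, -0.2],                   # looks like the source
                  [6.0, 6.0]])                   # clearly novel

detector = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
print(detector.predict(X_new))   # +1 = inlier, -1 = novelty / possible shift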
Figure 6: Novelty detection [13]

Discriminative distance
Although the discriminative distance approach is less common, it can be useful. The idea is to train a classifier to determine whether an example belongs to the source or the target domain. The training error then serves as a proxy for the distance between the two distributions: the higher the error, the closer they are (i.e., the classifier cannot distinguish between the source and target domain). Discriminative distance is broadly applicable and works well for data that is both high-dimensional and sparse. It is useful when you are performing domain adaptation, even though it takes time and can be difficult, and for some deep learning approaches it may be the only workable technique. It can, however, only be performed offline, and it is more difficult to execute than the previous methods [14]. A domain-classifier sketch follows.
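A minimal sketch of the idea (the helper name, classifier choice, and synthetic data are our assumptions); cross-validated accuracy of the domain classifier stands in for the distance between distributions:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def discriminative_distance(X_source, X_target):
    """Train a domain classifier; accuracy near 0.5 means the distributions
    are close, accuracy near 1.0 means they clearly differ."""
    X = np.vstack([X_source, X_target])
    d = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    acc = cross_val_score(RandomForestClassifier(n_estimators=100),
                          X, d, cv=5, scoring="accuracy").mean()
    return acc

rng = np.random.default_rng(5)
X_src = rng.normal(0.0, 1.0, size=(1000, 5))
X_tgt = rng.normal(0.5, 1.0, size=(1000, 5))     # shifted target domain
print(discriminative_distance(X_src, X_tgt))     # well above 0.5 implies shift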
V. CONCLUDING REMARKS
In many realistic applications of machine learning, the data available for model building, i.e., the training data, are not fully representative of the data on which the classifier will actually be deployed (the test data). This problem, which we refer to as dataset shift in accordance with [15], encompasses a broad range of studies found in the machine learning literature. The aim of this paper has been to compile and synthesize existing studies in order to help inform future efforts in the field.
VI. REFERENCES
[1] R.J.A. Little, D.B. Rubin, Statistical Analysis with Missing Data, Wiley Series in Probability and Statistics, second ed., Wiley, New Jersey, 2002.
[2] E.M. Bahgat, S. Rady, W. Gad, An e-mail filtering approach using classification techniques, in: The 1st International Conference on Advanced Intelligent System and Informatics (AISI2015), November 28-30, 2015, Beni-Suef, Egypt, Springer International Publishing, 2016, pp. 321-331.
[3] Dataset Shift, The MIT Press, https://cs.nyu.edu/~roweis/papers/invar-chapter.pdf
[4] J.G. Moreno-Torres, T. Raeder, R. Alaiz-Rodríguez, N.V. Chawla, F. Herrera, A Unifying View of Dataset Shift in Classification, Pattern Recognition, 2011, in press.
[5] Covariate shift, http://iwann.ugr.es/2011/pdf/InvitedTalk-FHerrera-IWANN11.pdf
[6] Understanding dataset shift, https://towardsdatascience.com/understanding-dataset-shift-f2a5a262a766
[7] Prior probability shift, https://data-newbie.tistory.com/355
[8] Shweta Kadam, A Survey on Classification of Concept Drift with Stream Data, 2019, hal-02062610.
[9] Jason Brownlee, A Gentle Introduction to Concept Drift in Machine Learning.
[10] Jae Duk Seo, Batch Normalization and Internal Covariate Shift, https://medium.com/@SeoJaeDuk/archive-post-batch-normalization-and-internal-covariate-shift-1d47661d236f
[11] Jose G. Moreno-Torres, Troy Raeder, Rocío Alaiz-Rodríguez, Nitesh V. Chawla, Francisco Herrera, A unifying view on dataset shift in classification, Pattern Recognition 45 (1) (January 2012) 521-530.
[12] Kullback-Leibler divergence, https://www.researchgate.net/figure/KL-divergences-between-two-normal-distributions-In-this-example-p-1-is-a-standard-normal_fig1_319662351
[13] Novelty detection, https://scikit-learn.org/stable/_images/sphx_glr_plot_oneclass_0011.png
[14] J. Li, X. Lin, X. Rui, Y. Rui, D. Tao, A Distributed Approach Toward Discriminative Distance Metric Learning, IEEE Transactions on Neural Networks and Learning Systems 26 (9) (Sept. 2015) 2111-2122, doi: 10.1109/TNNLS.2014.2377211.
[15] J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, N.D. Lawrence, Dataset Shift in Machine Learning, The MIT Press, 2009.