DEMYSTIFICATION OF DATASET SHIFT IN THE FIELD OF DATA SCIENCE

e-ISSN: 2582-5208 International Research Journal of Modernization in Engineering Technology and Science Volume:03/Issue:03/March-2021

Impact Factor- 5.354

www.irjmets.com

DEMYSTIFICATION OF DATASET SHIFT IN THE FIELD OF DATA SCIENCE Mir Nawaz Ahmad*1, Ummer Altaf Bhat*2 *1Department

of Computer Science and Engineering, SSM College of Engineering, Parihaspora 193121, Jammu and Kashmir, India.

*2Department

of Computer Science and Engineering, SSM College of Engineering, Parihaspora 193121, Jammu and Kashmir, India.

ABSTRACT A change in data distribution is what defines the process of dataset shift. This difference exists between the training and test sets. When building a machine learning algorithm, we use training data to train it in the view that it can yield comparable effects when applied to test data. This might not be the case in the real-world implementation of machine learning models. As a result, the distribution of data will undoubtedly change. The model's efficiency degrades as a result of this shift. It also includes an experimental study of the impact of different approaches on the efficiency of the algorithms mentioned in the following sections. After using techniques such as the Reweighting method, Bach Normalization, and feature-dropping, the performance of machine learning models improved significantly. Keywords: Dataset shift, Covariate shift, Concept drift, Internal covariate shift, Bach normalization, Sample selection bias, Prior probability, Data science.

I.

INTRODUCTION

Data scientists have examined data consistency in machine learning problems from a variety of angles, including missing values, data uncertainty, noise, data complexity, imbalance, and, in this case, dataset shift [1]. When the testing (unseen) data experiences a phenomenon that causes a change in the distribution of a single feature, a group of features, or the class boundaries, this is known as dataset shift. As a consequence, in real-world applications and situations, the standard assumption that training and testing data follow the same distributions is often violated. When the joint distribution of inputs and outputs varies between the training and test phases, dataset shift is a difficult situation. Only the input distribution shifts (covariate denotes input), while the conditional distribution of the outputs given the inputs P(y|x) remains unchanged. Dataset shift occurs in most practical implementations for a variety of purposes, ranging from experimental design bias to the simple deficiency that can lead to testing conditions at training time. In an image classification task, for example, training data may have been collected under controlled laboratory conditions, while test data may have been collected under varying lighting conditions. In certain applications, the data generation process is adaptive in and of itself. Some authors consider the issue of spam email filtering: efficient "spammers" would try to create spam that is different from the spam that the automated filter is based on [2]. Until recently, the machine learning group seems to have shown little interest in dataset shift. Indeed, many machine learning algorithms assume that the training data comes from the same distribution as the test data that will later be used to evaluate the model. This paper aims to provide a summary of the various recent activities in the machine learning field to cope with dataset shift [3].

II.

DATA SHIFT

As the name implies, a dataset shift happens when the data distribution changes. When creating a Machine Learning model, the aim is to discover the (possibly nonlinear) relationships between the data and the target variable. We could then feed new data from the same distribution into the model and expect similar performance and results. However, this is rarely the case; in fact, consumer preferences, technological breakthroughs, political, social, and other uncontrollable variables can all drastically alter the input dataset, the target variable, and the underlying trends and relationships between the input and output data. Each of these scenarios has its own name in Data Science, but they all result in model output deterioration. It is applicationspecific, relying heavily on the data scientist's ability to investigate and fix issues. How do we know when the dataset has changed enough to cause a problem for our algorithms, for example? How do we assess the trade-off www.irjmets.com

@International Research Journal of Modernization in Engineering, Technology and Science

[341]

Turn static files into dynamic content formats.

Create a flipbook