e-ISSN: 2582-5208 International Research Journal of Modernization in Engineering Technology and Science Volume:02/Issue:09/September-2020
Impact Factor- 5.354
www.irjmets.com
PERFORMANCE ANALYSIS OF REGRESSION MODELS USING MYANMAR SALES DATA
Kyawt Kyawt San*1
*1Faculty of Information Science, University of Information Technology, Yangon, Myanmar.
ABSTRACT Regression analysis is a widely used technique that plays a crucial role in sales estimation and prediction. It is applied to estimate future sales values, or the values of any target variable, using information from other features. Regression methods are supervised machine learning algorithms and are widely used in business to understand how changes in a set of independent variables influence a dependent variable. In this paper, linear regression, random forest regression and K-Nearest Neighbors (KNN) regression are evaluated on a Myanmar supermarket sales dataset. The main purpose of the experiment is to compare the performance of these three regressors and to identify the most suitable one for supermarket sales data analysis. According to the experiment, the linear regression model performs the best among these regression models.
KEYWORDS: Analysis, regression analysis, supervised learning, linear regression, sales.
I.
INTRODUCTION
Supervised machine learning techniques such as regression and classification make it possible to carry out evaluation analysis, including error measurement, model selection and model assessment, in order to choose the optimal model for a given dataset. Most supermarkets can make a better profit if they have a good estimate of their yearly sales. Currently, most of them use ad hoc tools or traditional statistical methods to estimate yearly sales. However, many challenges and problems may be encountered, and these can result in prediction models that perform poorly. A supermarket makes a profit only when more goods are sold and turnover is high. Therefore, estimating sales in order to increase yearly sales has become a pressing issue for every supermarket. Sales data from high-performing supermarkets, which is produced by customers as they interact with the store, has become valuable. Meaningful patterns and features extracted from these data are used to build machine learning models that can lead to better sales forecasts. Many techniques exist for this kind of problem. Among them, machine learning has become an important field because of its highly accurate predictive performance. To predict an event, a machine learning model is built on training data, from which it learns patterns that are used to forecast unseen events. The main purpose of this paper is to present a comparative performance analysis of regression techniques for supermarket sales data. Machine learning models, namely multiple linear regression, KNN regression and random forest regression, are used to predict unseen events. Supermarket data from Myanmar is used for the experiments.
II.
RELATED WORKS
This section presents prior work on prediction using different machine learning techniques. Odegua [8] adopted K-Nearest Neighbor, Random Forest and Gradient Boosting regression algorithms for sales estimation in a supermarket store. According to [8], random forest had the lowest mean absolute error among the three algorithms, and the regressor performs better when more data is observed. Shiwani et al. [10] predicted the spending amount of customers using different machine learning algorithms and compared the performance of these algorithms. Two features, time and location, were considered the most important for prediction in their experiment. Their aim was to increase the profit of the store.
Rajalaxmi et al. [3] applied multiple linear regression analysis to the academic performance of students. The predicted label, or dependent variable, is the Cumulative Grade Points of the students, which is measured based on the students' examination grades. They showed how strongly the dependent variable is related to the values of the independent variables. Sakhare et al. [4] measured the performance of three regression algorithms, Linear Regression, Polynomial Regression and Support Vector Regression, using the S&P 500 dataset. According to [4], Support Vector Regression performed best among the three regression algorithms, and it performs well regardless of the dimensionality of the feature space. Khan et al. [7] proposed a runoff forecasting model by evaluating the performance of a multiple linear regressor, an Artificial Neural Network with Levenberg-Marquardt training, a decision tree regressor and a Least Squares Support Vector Regressor. They found that the regression tree was the best-performing algorithm for runoff forecasting.
III.
SYSTEM DESIGN AND REGRESSION MODELLING
Statistical techniques such as regression methods are applied to model the relationship between a dependent variable and one or more independent variables. Regression is used to estimate how strong this relationship is in the observed data points so that it can be fitted and used for future predictions. One variable is treated as the explanatory variable and the other as the dependent variable. The primary concern of machine learning models, especially in the field of predictive modelling, is to reduce the error and make the best possible estimate. The design methodology for comparing the performance of the regression models is presented in Figure 1. In this experiment, a linear regressor, a K-Nearest Neighbors (KNN) regressor and a Random Forest regressor are fitted, and their respective performance is analyzed using standard evaluation metrics for regression analysis. For evaluation purposes, the mean squared error, root mean squared error and R-squared measures of these three models are computed and compared using the Myanmar supermarket dataset. The performance is analyzed based on the validation, training and testing data of the dataset.

[Figure 1: System Design of the Regression Analysis. Blocks: Supermarket Dataset; Regression Algorithms (Random Forest Regressor, Linear Regressor, KNN Regressor); Results and Evaluation]
a) Linear Regression
Linear regression is a model that assumes a linear relationship between the input variable (X) and the single output variable (Y). Specifically, the predicted output value (Y) is computed as a linear combination of the input values (X). The linear model for a single variable is shown in equation (1).

Y = m*X + c    (1)

Where:
Y – Dependent variable
X – Independent (explanatory) variable
c – Intercept (constant term)
m – Slope

The linear model for multiple variables is shown in equation (2).

Y = m1*X1 + m2*X2 + m3*X3 + … + mn*Xn + c    (2)
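As an illustration of equation (1), the slope m and intercept c can be estimated from data by ordinary least squares. The paper does not state which fitting procedure or library was used, so the following is only a minimal standard-library sketch under that assumption; the function names `fit_simple_linear` and `predict` and the toy values are hypothetical.

```python
from statistics import mean


def fit_simple_linear(xs, ys):
    """Least-squares estimate of m and c in Y = m*X + c (one explanatory variable)."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: co-variation of X and Y divided by the variation of X.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    c = y_bar - m * x_bar
    return m, c


def predict(m, c, x):
    """Evaluate equation (1) for a new input value."""
    return m * x + c


# Hypothetical toy data lying exactly on Y = 2*X + 1 (not taken from the dataset).
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, c = fit_simple_linear(xs, ys)
print(f"slope m = {m}, intercept c = {c}")
print("prediction for X = 6:", predict(m, c, 6))
```

For the multiple-variable model in equation (2), the same least-squares idea applies with one slope per feature, which is usually solved with a linear algebra routine rather than by hand.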
b) Random Forest Regression
Random Forest is a supervised machine learning algorithm and an ensemble technique that can perform both regression and classification tasks. It works by building a collection of decision trees at training time; for regression, the output is the average of the predictions of the individual trees. Two key modifications prevent the decision trees from being too highly correlated:
1) Random sampling of training data points when building each decision tree
2) Random subsets of features considered when splitting nodes
c) K-Nearest Neighbors Regression Algorithm (KNN)
A regression task can be performed with KNN by computing the mean of the numeric labels of the K nearest neighbors. The KNN regression algorithm is shown in Figure 2, and its step-by-step procedure is described below.
1: Load the dataset.
2: Choose a value of K, the number of nearest neighbors to use.
3: For each data point in the testing set, do the following:
3.1 – Compute the distance between the testing sample and each row of the training samples using a distance function.
3.2 – Sort the distances in ascending order.
3.3 – Select the first K entries from the sorted distance array.
3.4 – Predict the output as the mean of the numeric labels of the K entries obtained in step 3.3.
4 – End
Figure 2: K-Nearest Neighbors Algorithm
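The procedure in Figure 2 can be sketched in a few lines of Python. This is an illustrative standard-library version, not the code used in the experiment; it assumes Euclidean distance (the paper does not name the distance function) and, as stated for the regression case, predicts the mean of the K nearest numeric labels. The names `euclidean` and `knn_regress` and the toy data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_regress(train_X, train_y, query, k=3):
    """Predict a numeric label for `query` as the mean label of its k nearest neighbors."""
    # Step 3.1: distance from the query to every training row.
    distances = [(euclidean(row, query), label) for row, label in zip(train_X, train_y)]
    # Step 3.2: sort by distance in ascending order.
    distances.sort(key=lambda pair: pair[0])
    # Step 3.3: keep the first k entries.
    nearest = distances[:k]
    # Step 3.4 (regression): average the numeric labels of those neighbors.
    return sum(label for _, label in nearest) / len(nearest)


# Hypothetical one-feature toy data (not taken from the dataset).
train_X = [[1.0], [2.0], [3.0], [10.0]]
train_y = [10.0, 20.0, 30.0, 100.0]
pred = knn_regress(train_X, train_y, [2.1], k=3)
print("KNN prediction:", pred)
```

In this toy example the three nearest neighbors of 2.1 are the rows 2.0, 3.0 and 1.0, so the distant row at 10.0 does not influence the prediction.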
d) Evaluation Metrics for Regression
Regression models solve the problem of estimating a quantity. In this paper, the following evaluation measures are described for comparing the performance of regression models. These measures are extensively used in regression machine learning problems.
Mean Squared Error (MSE): the mean of the squares of the errors, that is, the mean squared difference between the predicted values and the actual values.
Root Mean Squared Error (RMSE): describes how closely the model's estimated data points fit the observed data. It is the square root of the mean of the squared residuals, i.e., the square root of the MSE.
Mean Absolute Error (MAE): the mean of the absolute differences between the predictions and the actual observations.
R-Squared (R^2): measures how well the regression model explains the observed data points. Its calculation is shown with the other metrics in Figure 3. R-squared values typically range from 0 to 1, and R^2 approaches 1 when the regression model fits the data points well.
All of the performance metrics described above are shown in Figure 3. The MSE, RMSE and R-squared measures are used in this paper to analyze the performance of the regression models.
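The metric definitions above translate directly into code. The sketch below uses the standard textbook formulas, with R-squared computed as one minus the ratio of the residual sum of squares to the total sum of squares; the function names and the toy values are illustrative assumptions, not the paper's implementation.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical actual and predicted values (not taken from the dataset).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print("MSE :", mse(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
print("MAE :", mae(y_true, y_pred))
print("R^2 :", r_squared(y_true, y_pred))
```

Because RMSE is the square root of MSE, the two always rank a given set of predictions in the same order; RMSE is simply expressed in the units of the target variable.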
Figure 3: Evaluation Metrics for Regression Models
e) Myanmar Supermarket Dataset
The supermarket dataset [5], collected from the sales history of a supermarket company in Myanmar, is described in Table 1. It consists of sales data from three different branches over three months. The dataset has sixteen columns: InvoiceID, Branch, City, Customer type, Gender, Productline, Unitprice, Quantity, Tax 5%, Total, Date, Time, Payment, cogs, grossincome and Rating. The performance of the regression models is compared and analyzed using this dataset.

Table 1: Myanmar Supermarket Dataset Description

| No. | Attribute | Description |
|---|---|---|
| 1 | InvoiceID | Computer-generated sales slip invoice identification number |
| 2 | Branch | Branch of supercenter (3 branches are available, identified by A, B and C) |
| 3 | City | Location of supercenters |
| 4 | Customer type | Type of customer, recorded as Member for customers using a member card and Normal for customers without a member card |
| 5 | Gender | Gender type of customer |
| 6 | Productline | General item categorization groups |
| 7 | Unitprice | Price of each product in $ |
| 8 | Quantity | Number of products purchased by customer |
| 9 | Tax 5% | 5% tax fee for customer buying |
| 10 | Total | Total price including tax |
| 11 | Date | Date of purchase |
| 12 | Time | Purchase time |
| 13 | Payment | Payment method used by customer for purchase |
| 14 | cogs | Cost of goods sold |
| 15 | grossincome | Gross margin percentage |
| 16 | Rating | Customer satisfaction rating on their overall shopping experience (on a scale of 1 to 10) |
Some exploratory analysis is carried out and visualized using features from the dataset. Among the three payment methods, the greatest usage of Ewallet is found in Yangon city, as shown in Figure 4. Figure 5 shows the sales of each product line in the dataset. According to Figure 5, the Fashion product line has the greatest sales and the cosmetics product line has the lowest. Figure 6 shows the time of day at which sales peak at the supermarket. According to the analysis, the greatest sales occur at 2 pm local time.
Figure 4: Payment Analysis by City
Figure 5: Product Data Analysis
Figure 6: The Greatest Sales-Hour Analysis
IV.
RESULTS AND DISCUSSION
The dataset used in this experiment has no missing values, but some of the attributes are categorical, so all categorical variables are converted into numeric values using the label encoding method. The transformed dataset is then standardized. The standardized dataset is split into training and testing datasets, and the regression models are applied to it. The three regression models are fitted to the dataset and evaluated using the MSE, RMSE and R-squared performance metrics. The three models are implemented in Python in a Jupyter notebook. The testing results of the three regressors are presented in this section, and the performance of the three models is compared based on MSE, RMSE and R-squared.
Table 2: Performance Comparison of the Regressors

| Model | MSE (train data) | MSE (test data) | RMSE (train data) | RMSE (test data) |
|---|---|---|---|---|
| Linear Regression | 2.8267 | 3.0897 | 1.6813 | 1.6813 |
| Random Forest Regressor | 0.47961 | 3.4116 | 0.69254 | 0.69254 |
| KNN Regressor | 2.6669 | 3.2448 | 1.6331 | 1.6331 |
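
The preprocessing steps described at the start of this section (label encoding, standardization, and a train/test split) can be sketched with the Python standard library as follows. This is a minimal illustration rather than the notebook used in the experiment: the helper names `label_encode`, `standardize` and `train_test_split`, the 20% test ratio, the random seed and the toy values are all assumptions, since the paper does not report them.

```python
import random
from statistics import mean, pstdev


def label_encode(values):
    """Map each distinct category to an integer code (sorted for determinism)."""
    mapping = {cat: code for code, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def standardize(column):
    """Rescale a numeric column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle a copy of the rows and split off the first `test_ratio` share as test data."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical values for two of the columns in Table 1 (made up for illustration).
branch = ["A", "B", "C", "A", "B"]
unit_price = [10.0, 20.0, 30.0, 40.0, 50.0]

codes, mapping = label_encode(branch)   # categorical -> integer codes
scaled_price = standardize(unit_price)  # numeric -> standardized values
rows = list(zip(codes, scaled_price))
train, test = train_test_split(rows, test_ratio=0.2, seed=0)

print("label mapping:", mapping)
print("train rows:", len(train), "test rows:", len(test))
```

The sketch standardizes before splitting because that is the order described in the text.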
Table 3: R-Squared Comparison of the Regressors

| Model | R-Squared Score |
|---|---|
| Linear Regression | 94.00% |
| Random Forest Regressor | 93.46% |
| KNN Regressor | 93.78% |
The MSE, RMSE and R-squared measures are used in this experiment to assess prediction quality. Table 2 reports the MSE and RMSE measures of the three regression models on the training and testing datasets.
[Figure 7: Comparing MSE and RMSE of the Three Models. Chart title: Evaluation Metrics; groups: MSE for train data, MSE for test data, RMSE for train data, RMSE for test data; series: Linear Regression, Random Forest Regressor, KNN Regressor]
The RMSE scores of linear regression on the training and testing data are equal, which indicates that the linear regressor fits the data well. Although the RMSE of the random forest model is lower, its MSE on the testing data is considerably larger than on the training data, which means that the model is overfitted. The RMSE score of the KNN regressor is 1.6331, which is slightly lower than that of linear regression and shows a small improvement of the model.
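
Because RMSE is defined as the square root of MSE, the RMSE entries in Table 2 can be cross-checked against the MSE entries with a short calculation. The sketch below takes only the MSE values reported in Table 2 and prints their square roots; the variable names are illustrative. The square roots of the training MSE values reproduce the tabulated training RMSE, while the square roots of the test MSE values come out at roughly 1.758, 1.847 and 1.801, which differ from the tabulated test RMSE values, so the test RMSE column may be worth re-checking.

```python
import math

# MSE values as reported in Table 2: model -> (train MSE, test MSE).
reported_mse = {
    "Linear Regression": (2.8267, 3.0897),
    "Random Forest Regressor": (0.47961, 3.4116),
    "KNN Regressor": (2.6669, 3.2448),
}

# RMSE implied by each reported MSE (RMSE = sqrt(MSE)).
rmse_from_mse = {
    model: (math.sqrt(train_mse), math.sqrt(test_mse))
    for model, (train_mse, test_mse) in reported_mse.items()
}

for model, (train_rmse, test_rmse) in rmse_from_mse.items():
    print(f"{model}: train RMSE = {train_rmse:.4f}, test RMSE = {test_rmse:.4f}")
```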
In addition, the performance of the three models is compared using R-squared, as described in Table 3 and Figure 8. The linear regression model also performs best on this measure, obtaining an R-squared score of 94%. Among the three regression models, linear regression is therefore identified as the best-performing one, and it can be used to obtain accurate prediction results.
[Figure 8: R-Squared Comparison of the Three Models. Chart title: R-Squared Score; series: Linear Regression, Random Forest Regressor, KNN Regressor]
V.
CONCLUSION
Sales estimation is a critical problem in the business sector, and estimating future sales from previous sales can be accomplished with machine learning models. Regression models are widely used for these problems. In this paper, the performance of three different regression models is compared and evaluated using evaluation measures computed on a supermarket dataset from Myanmar. The regression analysis shows that the linear regression model outperforms the other two algorithms. According to the MSE and RMSE scores, the linear regression model fits the sales data well, and it obtains the highest R-squared measure of the three models. The experiment aims to support predictive analysis for forecasting future sales.
ACKNOWLEDGEMENTS
I cannot express enough thanks to my family. The completion of this work could not have been accomplished without the support and motivation of my family, my husband and my children.
VI.
REFERENCES
[1] O. El Aissaoui et al., “Multiple Linear Regression-Based Approach to Predict Student Performance”, in Proceedings of the Advanced Intelligent Systems for Sustainable Development, AISC 1102, pp. 9–23, January 2020.
[2] Pinki, S. Gupta, “Sales Forecasting using Linear Regression and Support Vector Machine”, International Journal of Innovative Research in Computer and Communication Engineering, Vol. 6, Issue 4, pp. 3749-3755, April 2018.
[3] R. R. Rajalaxmi, P. Natesan, N. Krishnamoorthy, S. Ponni, “Regression Model for Predicting Engineering Students Academic Performance”, International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, Volume-7, Issue-6S3, pp. 71-75, April 2019.
[4] N. Sakhare, S. S. Sagari, “Performance analysis of regression based machine learning techniques for prediction of stock market movement”, International Journal of Recent Technology and Engineering, Volume-7, Issue-6S4, pp. 206-213, April 2019.
[5] “Supermarket Dataset”, https://www.kaggle.com/
[6] R. R. Rajalaxmi et al., “Regression Model for Predicting Engineering Students Academic Performance”, International Journal of Recent Technology and Engineering (IJRTE), Volume-7, Issue-6S3, pp. 71-75, April 2019.
[7] M. Khan et al., “Performance Analysis of Regression-Machine Learning Algorithms for Predication of Runoff Time”, Agrotechnology, Vol. 8, Iss. 1, No. 187.
[8] R. Odegua, “Applied Machine Learning for Supermarket Sales Prediction”, 2020.
[9] “A Forecast for Big Mart Sales Based on Random Forests and Multiple Linear Regression”, The International Journal of Engineering Development and Research (IJEDR), Volume 6, Issue 4, pp. 4142, 2018.
[10] S. Joshi, L. S. Rao, B. I. Seraphim, “Customer Centric Sales Analysis and Prediction”, International Journal of Engineering and Advanced Technology (IJEAT), ISSN: 2249-8958, Volume-8, Issue-4, pp. 1749-1753, April 2019.