Research paper
Analysis and Prediction of Income and Economic Hierarchy on Census Data using Data Analytics
TAGS: Machine-learning, Data Analytics, Statistical Analysis, Statistical data analysis, Data Interpretations, Cluster Analysis
SERVICES: Research Planning | Data Collection | Semantic Annotation | Business Analytics | Bio Statistics | Econometrics
Copyright © 2019 Statswork. All rights reserved
The greatest difficulty in the machine learning field is obtaining clean, high-quality datasets.
INTRODUCTION
Demographic data plays a major role in the economic growth of the nation.
It helps in determining the income growth of the people, how many of them live in urban and rural areas, and how educated the population is.
Example: DATA ANALYTICS IN PREDICTING THE INCOME AND ECONOMIC HIERARCHY ON CENSUS DATA
The data analytics method is used to predict the income and economic hierarchy on census data obtained from Kaggle (Sharath et al., 2016). The dataset covers 3.5 million U.S. households and includes their education, work, the transportation they use, their internet usage, etc. Before analysing the data, the main prerequisite is that the data be normalized for statistical analysis. Hadoop is used as the first stage for the large dataset, and Pig MapReduce is adopted to normalize it.
Later, the statistical analysis is performed and the results are interpreted.
Aim of the study
- Gender distribution against occupation
- Relationship between education and salary
- Economic hierarchy and prediction of classes
- Plotting theoretical versus the actual values for Benford’s Law
- Mean and Median of Income using Heatmap
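The Benford's Law aim listed above compares observed first-digit frequencies with the theoretical ones, P(d) = log10(1 + 1/d). The following is a minimal sketch of that comparison, assuming whole-number income amounts; the function names and the sample values are hypothetical and are not taken from the census data.

```python
import math
from collections import Counter


def benford_expected():
    """Theoretical first-digit probabilities: P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(value):
    """Leading digit of a number whose absolute value is at least 1."""
    return int(str(abs(int(value)))[0])


def benford_actual(values):
    """Observed first-digit frequencies, skipping values below 1 in magnitude."""
    digits = [first_digit(v) for v in values if abs(v) >= 1]
    counts = Counter(digits)
    total = len(digits)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Hypothetical income values, for illustration only (not census data).
incomes = [1200, 15000, 18500, 23000, 31000, 1100, 47000, 9800, 120000, 2600]

expected = benford_expected()
actual = benford_actual(incomes)
for d in range(1, 10):
    print(d, round(expected[d], 3), round(actual[d], 3))
```

Plotting the two columns side by side gives the theoretical versus actual comparison; a sample this small is not expected to follow the theoretical curve closely.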
Fig. 1: Step by step procedure (stages shown: Huge Dataset; Load; Hadoop PIG for MapReduce; Normalized Data; Data Mining / Statistical Analysis; Graphical Representation and Interpretation; Final Processed Data)
Fig. 2: The importance of the normalization before proceeding with the Statistical data analysis. Panels: (a) Before Normalization, (b) After Normalization. Occupations shown: Personal Care; Building Cleaning; Food; Farming, Fishing. Axis ticks run from 0 to 70.
RESULTS & DISCUSSIONS
It is noted that the percentage of men in the farming and fishing industry increases by 3.8% after normalization, and the percentage of women in that field also increases. This clearly demonstrates the need for and importance of normalizing the census data. Normalization reduces the execution time and improves the efficiency of the analysis. Two levels of normalization are applied for this purpose; before that, the few blank entries in the dataset must be handled to avoid invalid results.
- First level: the actual data is used without any modification.
- Second level: the actual data is taken as input and then modified with a suitable mathematical method.
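The blank-entry handling and the two levels above can be sketched as follows. The source does not name the mathematical method used at the second level, so min-max rescaling is assumed here purely for illustration; the field names and records are also hypothetical, not the census schema.

```python
def drop_blanks(rows, field):
    """Keep only rows whose `field` is present and non-empty."""
    return [r for r in rows if r.get(field) not in (None, "")]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1] (assumed method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical records, for illustration only.
rows = [
    {"occupation": "Sales", "income": "30000"},
    {"occupation": "Food", "income": ""},          # blank entry
    {"occupation": "Farming", "income": "50000"},
    {"occupation": "Sales", "income": "70000"},
]

clean = drop_blanks(rows, "income")
level_one = [float(r["income"]) for r in clean]   # first level: values as-is
level_two = min_max_normalize(level_one)          # second level: rescaled
print(level_two)
```

Dropping blank rows is only one way to handle missing entries; imputing a value would be another, and the source does not say which was used.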
Data Interpretations
Fig. 3: The first objective of the study, i.e. the percentage of gender distribution against occupation.
The percentage of men is higher in the sales field than in the other fields. The transportation field has the next-highest percentage of men after sales. In addition, the percentage of women is distributed almost equally across all the occupational fields. Further, to achieve the second objective, a boxplot is used to identify the relationship between education and salary. This helps in understanding the income growth under different levels of education.
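The gender percentages discussed on this slide could be computed along the following lines. This is a minimal sketch that assumes the percentage is taken within each occupation; the source does not state the denominator, and the records and function name are hypothetical.

```python
from collections import Counter, defaultdict


def gender_share_by_occupation(records):
    """Return {occupation: {gender: percentage of that occupation's workers}}."""
    counts = defaultdict(Counter)
    for occupation, gender in records:
        counts[occupation][gender] += 1
    shares = {}
    for occupation, by_gender in counts.items():
        total = sum(by_gender.values())
        shares[occupation] = {g: 100.0 * n / total for g, n in by_gender.items()}
    return shares


# Hypothetical (occupation, gender) pairs, for illustration only.
records = [
    ("Sales", "M"), ("Sales", "M"), ("Sales", "M"), ("Sales", "F"),
    ("Transportation", "M"), ("Transportation", "F"),
]

shares = gender_share_by_occupation(records)
for occupation, by_gender in shares.items():
    print(occupation, by_gender)
```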
Usually, a more educated person earns a higher salary. However, this graph shows professional degree holders earning more than doctorate degree holders, which is quite unusual. Likewise, one can compare the median and quartiles of each education level in the boxplot to better understand the relationship between level of education and annual salary.
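The median and quartiles that a boxplot draws for each education level can be computed directly. The sketch below uses Python's `statistics.quantiles` with the inclusive method, which is one of several quartile conventions; the salary figures are invented purely for illustration and are not results from the study.

```python
import statistics


def boxplot_summary(values):
    """Five-number summary (min, Q1, median, Q3, max) for one group."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


# Hypothetical annual salaries per education level, for illustration only.
salaries = {
    "Bachelor's": [40000, 50000, 60000, 70000, 80000],
    "Professional": [60000, 80000, 100000, 120000, 140000],
    "Doctorate": [55000, 70000, 90000, 110000, 130000],
}

summary = {level: boxplot_summary(vals) for level, vals in salaries.items()}
for level, stats in summary.items():
    print(level, stats)
```

Comparing the `median` and the `q1` to `q3` range across levels is the numeric counterpart of reading the boxes side by side.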
K-MEANS CLUSTERING
Cluster analysis methods are useful tools for analysing high-dimensional datasets, and k-means clustering is one of the most versatile of these techniques. To obtain the economic hierarchy, k-means clustering is applied to the income variable: the distance between each data value and a set of cluster centroids is measured, and the clusters are then plotted against the classes. Even though clustering is widely used in the literature, the problem of choosing the number of clusters still persists. Furthermore, Benford’s law is discussed by plotting the actual versus the theoretical values. Finally, the mean and median of the income across the states are depicted using heat maps. In addition, the decrease in the time complexity of the analysis obtained by including the levels of normalization is discussed and tabulated below.
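The centroid-based clustering of the income variable described above can be illustrated with a small one-dimensional k-means (Lloyd's algorithm). This is a sketch under stated assumptions, not the study's implementation: the number of clusters `k=3`, the deterministic starting centroids, and the income values are all hypothetical choices.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on one-dimensional data; returns (centroids, labels)."""
    data = sorted(values)
    # Deterministic start: pick k values spread evenly across the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centroids.append(
                sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, labels


# Hypothetical income values, for illustration only.
incomes = [12000, 15000, 18000, 48000, 52000, 55000,
           140000, 150000, 160000]

centroids, labels = kmeans_1d(incomes, k=3)
print(centroids)
print(labels)
```

Because the choice of `k` is the open problem the text mentions, in practice one would rerun this for several values of `k` and compare the within-cluster spread before fixing the number of economic classes.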
Summary
Handling large datasets is a major problem in statistical analysis because it can lead to invalid inferences. With the recent advent of computational strategies, researchers can handle big data more easily, particularly using Hadoop and MapReduce techniques. There is a lot of scope for data scientists to handle big data through machine learning and deep learning techniques. The development of a nation is analysed using the census data collected over certain periods. This helps in understanding the growth of the population and the wealth of the nation, and in identifying what is needed to improve the welfare of the nation and its people.
Contact Us
INDIA: +91-4448137070
UK: +44-1143520021
info@statswork.com
Work With Us
Freelancer Consultant | Guest Blog Editor
hr@workfoster.com