Analysis And Prediction Of Income And Economic Hierarchy On Census Data Using Data Analytics And Dat


Research paper

Analysis and Prediction of Income and Economic Hierarchy on Census Data using Data Analytics

TAGS: Machine-learning, Data Analytics, Statistical Analysis, Statistical data analysis, Data Interpretations, Cluster Analysis

SERVICES: Research Planning | Data Collection | Semantic Annotation | Business Analytics | Bio Statistics | Econometrics

Copyright © 2019 Statswork. All rights reserved


The greatest difficulty in the machine learning field is the availability of clean, high-quality datasets.

INTRODUCTION

Demographic data plays a major role in the economic growth of a nation.

It helps in determining the income growth of the people, how many of them live in urban and rural areas, and how educated each person in the nation is.



Example: DATA ANALYTICS IN PREDICTING THE INCOME AND ECONOMIC HIERARCHY ON CENSUS DATA


This example follows the data analytics method of Sharath et al. (2016) for predicting income and economic hierarchy from census data obtained from Kaggle. The dataset covers 3.5 million U.S. households and includes their education, work, the transportation they use, their internet usage, etc. Before the data is analysed, the main prerequisite is that it must be normalized for statistical analysis. Hadoop is used as the first stage for handling the large dataset, and Pig on MapReduce is adopted for normalising the dataset.

Later, the statistical analysis is performed and the results are interpreted.


Aim of the study

- Gender distribution against occupation
- Relationship between education and salary
- Economic hierarchy and prediction of classes
- Plotting theoretical versus actual values for Benford's Law
- Mean and median of income using a heatmap



Fig. 1: Step by step procedure (flow diagram). Stages shown: Huge Dataset; Load; Hadoop Pig for MapReduce; Normalized Data; Final Processed Data; Data Mining / Statistical Analysis; Graphical Representation and Interpretation.



Fig. 2: The importance of normalization before proceeding with the statistical data analysis. Panels: (a) Before Normalization, (b) After Normalization. Each panel is a bar chart of percentages by occupation (Personal Care; Building Cleaning; Food; Farming, Fishing).



RESULTS & DISCUSSIONS


After normalization, the percentage of men in the farming and fishing industry is found to increase by 3.8%, and the percentage of women in that field also increases. This demonstrates the need for, and the importance of, normalizing the census data. Normalization is used to reduce execution time and improve the efficiency of the analysis. Two levels of normalization are applied for this purpose. Before that, however, the dataset contains a few blank entries, which must be handled to avoid invalid results.

- First level: the actual data is used without any modifications.
- Second level: the actual data is taken as input and then modified with suitable mathematical methods.
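The handling of blank entries and the second level of normalization can be illustrated with a small sketch. The source does not say which mathematical method is used, so min-max scaling is only an assumption here, and the field names and records are hypothetical rather than taken from the census file.

```python
def drop_blank_rows(rows, required_keys):
    """Keep only rows where every required field is present and non-blank."""
    cleaned = []
    for row in rows:
        values = [row.get(key) for key in required_keys]
        if all(v is not None and str(v).strip() != "" for v in values):
            cleaned.append(row)
    return cleaned


def min_max_normalize(values):
    """Rescale numbers to [0, 1]. Min-max scaling is an assumed method."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical records for illustration only.
records = [
    {"occupation": "Sales", "income": "30000"},
    {"occupation": "", "income": "52000"},       # blank occupation
    {"occupation": "Food", "income": None},      # missing income
    {"occupation": "Farming", "income": "50000"},
    {"occupation": "Transportation", "income": "40000"},
]

clean = drop_blank_rows(records, ["occupation", "income"])
incomes = [float(r["income"]) for r in clean]
scaled = min_max_normalize(incomes)
```

Dropping incomplete rows is the simplest way to handle blanks; imputing a value would be an alternative if too many rows were lost.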


Data Interpretations

Fig. 3: The first objective of the study, i.e. the percentage of gender distribution against occupation.




The percentage of men in the sales field is higher than in the other fields. The transportation field has the next-highest percentage of men after sales. In addition, the percentage of women is distributed almost equally across all the occupational fields. Further, to achieve the second objective, a boxplot is used to identify the relationship between education and salary. This helps in understanding the income growth under different levels of education.
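A minimal sketch of the gender-against-occupation percentages, assuming the figure shows, for each gender, the share of that gender working in each occupation. That reading is an assumption, and the rows below are made up for illustration.

```python
from collections import Counter, defaultdict


def occupation_share_by_gender(rows):
    """For each gender, return the percentage of that gender in each occupation."""
    counts = defaultdict(Counter)
    for occupation, gender in rows:
        counts[gender][occupation] += 1
    shares = {}
    for gender, per_occupation in counts.items():
        total = sum(per_occupation.values())
        shares[gender] = {
            occupation: 100.0 * n / total
            for occupation, n in per_occupation.items()
        }
    return shares


# Hypothetical (occupation, gender) pairs, not real census rows.
rows = [
    ("Sales", "M"), ("Sales", "M"), ("Sales", "M"), ("Transportation", "M"),
    ("Sales", "F"), ("Transportation", "F"),
]
shares = occupation_share_by_gender(rows)
```

Swapping the roles of gender and occupation in the grouping would instead give the gender split within each occupation.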


Usually, a more educated person earns a higher salary. However, this graph shows professional degree holders earning more than doctorate degree holders, which is quite unusual. Likewise, one can compare the median and quartiles of each category in the boxplot for a better understanding of the relationship between level of education and annual salary.
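The median and quartiles that a boxplot draws can also be computed directly, which makes the comparison between education levels explicit. The sketch below uses Python's `statistics` module; the salary figures (in thousands) are invented for illustration and are not taken from the study.

```python
import statistics


def box_summary(salaries):
    """Return the quartiles, median and extremes for one education level."""
    q1, median, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
    return {
        "min": min(salaries),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(salaries),
    }


# Hypothetical annual salaries in thousands, for illustration only.
salary_by_education = {
    "Bachelor's degree": [40, 50, 60, 70, 80],
    "Master's degree": [50, 60, 75, 90, 100],
}
summaries = {
    level: box_summary(values)
    for level, values in salary_by_education.items()
}
```

Plotting libraries usually cap the whiskers at 1.5 times the interquartile range rather than at the minimum and maximum, so the extremes here are a simplification.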



K-MEANS CLUSTERING


Cluster analysis methods are useful tools for analysing high-dimensional datasets, and k-means clustering is among the most versatile of these techniques. To derive the economic hierarchy, k-means clustering is applied to the income variable: the distance between each data value and a set of cluster centroids is measured, and the resulting clusters are then plotted against the classes. Even though clustering is widely used in the literature, the problem of choosing the number of clusters still persists. Furthermore, Benford's law is used to plot the actual versus the theoretical values. Finally, the mean and median of income across the states are depicted using heat maps. In addition, the reduction in the running time of the analysis obtained by including each level of normalization is discussed and tabulated below.
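Two of the computations described above, k-means on a single income variable and the Benford's law comparison, can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: the number of clusters, the way the initial centroids are chosen, and the income values (in thousands) are all hypothetical.

```python
import math
from collections import Counter


def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on one numeric variable; returns (centroids, labels)."""
    ordered = sorted(values)
    # Assumed deterministic start: spread centroids across the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: nearest centroid by absolute distance.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


def benford_expected():
    """Theoretical share of each leading digit d: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit_shares(values):
    """Observed share of each leading digit among the non-zero values."""
    digits = [int(str(abs(int(v)))[0]) for v in values if int(v) != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


# Hypothetical incomes in thousands, for illustration only.
incomes = [12, 15, 18, 48, 52, 55, 110, 120, 125]
centroids, labels = kmeans_1d(incomes, k=3)
expected = benford_expected()
observed = leading_digit_shares(incomes)
```

Here the labels could be read as low, middle and high income classes. Plotting `expected` against `observed` for digits 1 to 9 gives the theoretical-versus-actual comparison. The choice of `k` remains the open problem noted above.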



Summary

Handling large datasets is a major problem in statistical analysis because it can lead to invalid inferences. With the recent advent of computational strategies, researchers can handle big data more easily, particularly using Hadoop and MapReduce techniques. There is a lot of scope for data scientists to handle big data through machine learning and deep learning techniques. The development of a nation is analysed using census data collected over certain periods. This helps in understanding the growth of the population, the wealth of the nation, and the improvements needed for the welfare of the nation and its people.



Contact Us

INDIA: +91-4448137070
UK: +44-1143520021
info@statswork.com

Work With Us

Freelancer Consultant | Guest Blog Editor
hr@workfoster.com
