Research paper
Analysis and Prediction of Income and Economic Hierarchy on Census Data using Data Analytics
TAGS: Machine Learning, Data Analytics, Statistical Analysis, Statistical Data Analysis, Data Interpretations, Cluster Analysis
SERVICES: Research Planning | Data Collection | Semantic Annotation | Business Analytics | Bio Statistics
Copyright © 2019 Statswork. All rights reserved
The greatest difficulty in the machine learning field is the availability of clean, high-quality datasets.
INTRODUCTION
Demographic data plays a major role in the economic growth of a nation.
It helps in finding the income growth of the people, how many of them are from urban and rural areas, and how educated each person in the nation is.
Example: DATA ANALYTICS IN PREDICTING THE INCOME AND ECONOMIC HIERARCHY ON CENSUS DATA
A data analytics method is used to predict the income and economic hierarchy on census data obtained from Kaggle (Sharath et al., 2016). The dataset covers 3.5 million U.S. households and includes their education, work, the transportation they use, their internet usage, etc. Before analysing the data, the main prerequisite is that the data must be normalized for statistical analysis. Hadoop is used as the first stage for handling the large dataset, and PIG MapReduce is adopted for normalising it. Later, the statistical analysis is performed and the results are interpreted.
Aim of the study
Gender distribution against occupation
Relationship between education and salary
Economic hierarchy and prediction of classes
Plotting theoretical versus actual values for Benford's Law
Mean and median of income using a heat map
Fig. 1: Step by step procedure. Stages shown: huge dataset, load, Hadoop, PIG for MapReduce, normalized data, data mining / statistical analysis, graphical representation and interpretation, final processed data.
Fig. 2: The importance of normalization before proceeding with the statistical data analysis. Panels: (a) Before Normalization, (b) After Normalization. Occupations shown: Personal Care; Building Cleaning; Food; Farming, Fishing.
RESULTS & DISCUSSIONS
Normalization is used to reduce the execution time and improve the efficiency of the results. Two levels of normalization are applied for this purpose, but before that, the few blank entries that exist in the dataset must be handled to avoid invalid results.
First level: the actual data is used without any modifications.
Second level: the actual data is taken as input and then modified with a suitable mathematical method.
It is noted that the percentage of men in the farming and fishing industry is found to have increased by 3.8% after normalization, and the percentage of women in that field has also increased. This clearly demonstrates the need for, and the importance of, normalization of the census data.
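The two levels described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source does not name the "suitable mathematical method", so min-max scaling is assumed here, the income values are made up, and the function names `drop_blank` and `min_max_normalize` are hypothetical. The original work performs this step with Hadoop and PIG, not Python.

```python
from typing import List, Optional


def drop_blank(values: List[Optional[float]]) -> List[float]:
    """Discard blank (None) entries so they cannot produce invalid results."""
    return [v for v in values if v is not None]


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values linearly to [0, 1] (assumed method, not from the source)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up household incomes; None marks a blank entry.
raw_incomes = [52000.0, None, 18000.0, 96000.0, None, 35000.0]

# First level: the actual data, unmodified apart from blank handling.
first_level = drop_blank(raw_incomes)

# Second level: the same data after the mathematical transformation.
second_level = min_max_normalize(first_level)
```

Keeping both levels side by side makes it possible to run the same analysis on each and compare the results, which is how the before/after comparison in Fig. 2 is framed.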
Data Interpretations
Fig. 3: The first objective of the study, i.e. obtaining the percentage of gender distribution against occupation.
The percentage of men in the sales field is higher than in the other fields.
The transportation field also contains a high percentage of men, next to sales. In addition, the percentage of women is distributed almost equally across all the occupational fields.
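A gender-against-occupation breakdown of this kind can be computed with a simple tally. The sketch below is illustrative only: the records are made up, the function name `gender_share_by_occupation` is hypothetical, and it assumes the percentage is taken within each occupation, which the source does not state explicitly.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def gender_share_by_occupation(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, float]]:
    """Return, for each occupation, the percentage of workers of each gender."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for occupation, gender in records:
        counts[occupation][gender] += 1

    shares: Dict[str, Dict[str, float]] = {}
    for occupation, tally in counts.items():
        total = sum(tally.values())
        shares[occupation] = {g: 100.0 * n / total for g, n in tally.items()}
    return shares


# Made-up (occupation, gender) records, not taken from the census data.
sample_records = [
    ("Sales", "M"), ("Sales", "M"), ("Sales", "F"),
    ("Transportation", "M"), ("Transportation", "F"),
    ("Food", "F"), ("Food", "M"), ("Food", "F"), ("Food", "M"),
]

shares = gender_share_by_occupation(sample_records)
```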
Further, in order to achieve the second objective, the box plot technique is used to identify the relationship between education and salary.
This helps in understanding the income growth under different levels of education.
Usually, the more educated person will get a higher salary. However, from this graph, professional degree holders are getting a higher salary than doctorate degree holders, which is quite unusual.
Likewise, one can compare the median and quartiles of each field in the box plot for a better understanding of the level of education and annual salary.
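The medians and quartiles that a box plot displays can also be computed directly. The following is a minimal sketch using Python's standard `statistics` module. The salary figures (in thousands) and education labels are made up for illustration, and `five_number_summary` is a hypothetical helper name.

```python
import statistics
from typing import Dict, List


def five_number_summary(values: List[float]) -> Dict[str, float]:
    """Return the minimum, quartiles, and maximum that a box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# Made-up annual salaries (in thousands) per education level.
salaries_by_education = {
    "Bachelor": [30.0, 40.0, 50.0, 60.0, 70.0],
    "Doctorate": [60.0, 70.0, 80.0, 90.0, 100.0],
}

summaries = {
    level: five_number_summary(values)
    for level, values in salaries_by_education.items()
}
```

Comparing `summaries[level]["median"]` across levels gives the same reading as comparing the centre lines of the boxes.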
K-MEANS CLUSTERING
Cluster analysis methods are a useful tool for analysing high-dimensional datasets. Among them, k-means clustering is the most versatile technique for getting valid results. In that sense, in order to achieve the economic hierarchy, the k-means clustering technique is adopted for the income variable: the distance between each data value and a set of clusters is measured using the centroid clustering method, and the clusters are then plotted against the economic classes.
Even though the clustering technique is widely used in the literature, the problem of finding the number of clusters still persists.
Furthermore, Benford's law is discussed for plotting the actual versus the theoretical values.
Finally, the mean and median of the income across the states are depicted using heat maps.
In addition, the decrease in the time complexity of the analysis obtained by including the level of normalization is also discussed and tabulated.
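For a single income variable, the centroid-based k-means procedure can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the income values are made up, the number of clusters is fixed at three by assumption, and the deterministic initialisation from evenly spaced sorted values is a choice made here for reproducibility.

```python
from typing import List, Tuple


def kmeans_1d(
    values: List[float], k: int, max_iterations: int = 100
) -> Tuple[List[float], List[int]]:
    """Cluster one-dimensional values into k groups around centroids."""
    ordered = sorted(values)
    # Deterministic start: pick evenly spaced values from the sorted data.
    centroids = [
        ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)
    ]

    for _ in range(max_iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters: List[List[float]] = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)

        # Update step: each centroid moves to the mean of its cluster.
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break  # Converged: no centroid moved.
        centroids = updated

    labels = [
        min(range(k), key=lambda i: abs(v - centroids[i])) for v in values
    ]
    return centroids, labels


# Made-up incomes (in thousands); three clusters stand in for economic classes.
incomes = [12.0, 15.0, 18.0, 45.0, 50.0, 55.0, 120.0, 130.0, 140.0]
centroids, labels = kmeans_1d(incomes, k=3)
```

The number of clusters `k` has to be supplied by the analyst, which is the open problem the text refers to.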
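Benford's law states that the leading digit d of many naturally occurring figures appears with probability log10(1 + 1/d). A comparison of theoretical versus actual values can be sketched as below. The sample values are made up, the helper names are hypothetical, and for simplicity the sketch only considers values of magnitude 1 or more.

```python
import math
from collections import Counter
from typing import Dict, List


def benford_expected() -> Dict[int, float]:
    """Theoretical share of each leading digit 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value: float) -> int:
    """First digit of the integer part; assumes abs(value) >= 1."""
    return int(str(int(abs(value)))[0])


def observed_shares(values: List[float]) -> Dict[int, float]:
    """Actual share of each leading digit among values with magnitude >= 1."""
    digits = [leading_digit(v) for v in values if abs(v) >= 1]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


# Made-up income figures, used only to exercise the functions.
sample_incomes = [120.0, 15.0, 1999.0, 23.0, 31.0]

expected = benford_expected()
observed = observed_shares(sample_incomes)
```

Plotting `expected` and `observed` as two series over the digits 1 to 9 gives the theoretical versus actual chart.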
Summary
Handling large datasets is a major problem in statistical analysis because of the risk of inferring invalid results. Recently, with the advent of new computational strategies, researchers have been handling big data more easily, particularly using Hadoop and MapReduce techniques. There is a lot of scope for data scientists to handle big data through machine learning and deep learning techniques. The development of a nation is analysed using the census data collected during certain periods. This helps in understanding the growth of the population, the wealth of the nation, and what improvements are needed for the welfare of the nation and its people.
Contact Us
UK: +44-1143520021
INDIA: +91-4448137070
info@statswork.com
Work With Us
Freelancer, Consultant, Guest Blog Editor (hr@workfoster.com)