STATANALYTICA
The Basics of Statistics for Data Science By Statisticians
statanalytica.com
INSIDE THE GUIDE
TOPICS AND HIGHLIGHTS
Overview
Introduction to Statistics
Terminologies in Statistics
Types of Analysis
Data Types
Measures of Central Tendency
Measures of Variability
Measurements of Relationships between Variables
Probability Distribution Functions
Continuous Data Distributions
Discrete Data Distributions
Moments
Probability
Accuracy
Conclusion
OVERVIEW
Data science is booming across the industry and is one of the most popular fields today. Many statistics students want to learn data science because statistics is the building block of machine learning algorithms. But most students don't know how much statistics they need before starting data science. To help with this problem, this guide covers the statistics concepts that are crucial for getting started with data science.
INTRODUCTION TO STATISTICS
Statistics is one of the most important subjects for students. It provides methods that help solve complex real-life problems, and it is used almost everywhere. Data scientists and data analysts use it to find meaningful trends in the world. Statistics also has the power to derive meaningful insight from data. It offers a variety of functions, principles, and algorithms that help analyze raw data, build a statistical model, and infer or predict results.
TYPES OF ANALYSIS
STATISTICS HAS TWO TYPES OF ANALYSIS
QUANTITATIVE ANALYSIS
Quantitative analysis is also known as statistical analysis. It is the science, or art, of collecting and interpreting data with numbers and graphs. We also use it to identify patterns and trends.
QUALITATIVE ANALYSIS
Qualitative analysis is also known as non-statistical analysis. It gives general, descriptive information and uses text, sound, and other forms of media.
DATA TYPES
STATISTICS HAS TWO TYPES OF DATA
NUMERICAL
Numerical data is expressed with digits and is measurable. There are two major types of numerical data: discrete and continuous.
CATEGORICAL
Categorical data is qualitative data that is classified into categories. There are two major types of categorical data: nominal (no order) and ordinal (ordered).
MEASURES OF CENTRAL TENDENCY
MEAN
The mean is the average of the given dataset.
MEDIAN
The median is the middle value of the given dataset once it is ordered.
MODE
The mode is the most common value in a given dataset. It is only relevant for discrete data.
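The three measures above can be sketched with Python's standard statistics module. The sample values are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical sample data, used only for illustration.
scores = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(scores)      # sum of the values divided by their count
median_value = statistics.median(scores)  # middle of the sorted values (average of the two middle ones here)
mode_value = statistics.mode(scores)      # most frequent value

print(mean_value, median_value, mode_value)
```

Because the list has an even number of values, the median is the average of the two middle values, 3 and 5.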
MEASURES OF VARIABILITY
RANGE
Range is the difference between the maximum and minimum values in a given dataset.
VARIANCE (σ²)
Variance measures how spread out a set of data is relative to the mean.
STANDARD DEVIATION (σ)
Standard deviation also measures how spread out the numbers in a dataset are. It is the square root of the variance.
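A minimal sketch of range, variance, and standard deviation, assuming the data is a whole population (so the sum of squared deviations is divided by n). The sample variance, which divides by n - 1, is shown for comparison. The data values are hypothetical.

```python
import statistics

# Hypothetical dataset, used only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)            # maximum minus minimum
variance = statistics.pvariance(data)         # population variance (divides by n)
std_dev = statistics.pstdev(data)             # square root of the population variance
sample_variance = statistics.variance(data)   # sample variance (divides by n - 1)

print(data_range, variance, std_dev, sample_variance)
```

Which variance to use depends on whether the data is the full population or a sample drawn from it.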
Z-SCORE
The z-score is the number of standard deviations a data point is from the mean.
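As a rough sketch, the z-score can be computed by subtracting the mean and dividing by the standard deviation. The helper name z_score and the data are hypothetical, and the population standard deviation is assumed.

```python
import statistics

def z_score(x, values):
    """Number of standard deviations that x lies from the mean of values."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    return (x - mu) / sigma

# Hypothetical dataset with mean 5 and standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]

z_high = z_score(9, data)  # 9 is above the mean
z_mean = z_score(5, data)  # a point at the mean has a z-score of 0

print(z_high, z_mean)
```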
R-SQUARED
R-squared is a statistical measure of fit. It indicates how much of the variation in a dependent variable is explained by the independent variable(s). It is easiest to interpret for simple linear regression; with more than one predictor, adjusted R-squared is usually the better guide.
ADJUSTED R-SQUARED
Adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. It increases if a new term improves the model more than would be expected by chance, and decreases if the term improves the model less than expected.
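The two fit measures can be sketched as follows, assuming the model's predictions are already available. The function names and the numbers are hypothetical; the formulas are the standard ones, R² = 1 - SS_res/SS_tot and adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors.

```python
def r_squared(actual, predicted):
    """Share of the variation in actual that is explained by predicted."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total variation
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    """R-squared penalized for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical observed values and model predictions.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_samples=len(actual), n_predictors=1)

print(round(r2, 4), round(adj_r2, 4))
```

With a fixed R-squared, adding predictors lowers the adjusted value, which is why it is a stricter measure for models with several predictors.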
MEASUREMENTS OF RELATIONSHIPS BETWEEN VARIABLES
COVARIANCE
Covariance measures how two variables vary together. If it is positive, the variables tend to move in the same direction. If it is negative, they tend to move in opposite directions. If it is zero, there is no linear relationship between them.
CORRELATION
Correlation measures the strength of the relationship between two variables. It is the normalized version of covariance and ranges from -1 to 1. As a rule of thumb, a correlation of +/- 0.7 or stronger represents a strong relationship between two variables, while a correlation between -0.3 and 0.3 indicates little or no relationship.
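A small sketch of how covariance and correlation relate, assuming sample statistics (division by n - 1) and the Pearson form of correlation. The function names and the data are hypothetical.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equally long lists (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: hours studied and marks obtained.
hours = [1, 2, 3, 4, 5]
marks = [2, 4, 5, 4, 5]

cov = covariance(hours, marks)
corr = correlation(hours, marks)

print(cov, round(corr, 3))
```

The covariance of a variable with itself is its variance, which is why the same helper is reused for the normalization step.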
PROBABILITY DISTRIBUTION FUNCTIONS
PROBABILITY DENSITY FUNCTION (PDF)
The PDF is for continuous data. Its value at any point can be interpreted as the relative likelihood that the random variable takes a value near that point.
PROBABILITY MASS FUNCTION (PMF)
The PMF is for discrete data. It gives the probability that the random variable takes a given value.
CUMULATIVE DISTRIBUTION FUNCTION (CDF)
The cumulative distribution function, also called the cumulative density function, gives the probability that the random variable is less than or equal to a certain value. For continuous data it is the integral of the PDF.
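To make the three functions concrete, here is a minimal sketch that uses the normal distribution for the PDF and CDF and a fair six-sided die for the PMF. These choices, the function names, and the simple midpoint-rule integration are assumptions made for illustration.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution at x (a relative likelihood, not a probability)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Probability that a normal random variable is less than or equal to x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def die_pmf(k):
    """Probability that a fair six-sided die shows k."""
    return 1 / 6 if k in range(1, 7) else 0.0

def integrate_pdf(lower, upper, steps=10_000):
    """Approximate the area under the standard normal PDF with the midpoint rule."""
    width = (upper - lower) / steps
    return sum(normal_pdf(lower + (i + 0.5) * width) for i in range(steps)) * width

# The area under the PDF up to 1.0 should closely match the CDF at 1.0.
# The lower limit -8.0 stands in for minus infinity.
area = integrate_pdf(-8.0, 1.0)

print(round(area, 6), round(normal_cdf(1.0), 6), die_pmf(3))
```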
CONTINUOUS DATA DISTRIBUTIONS
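CONTINUOUS DISTRIBUTION
A continuous distribution is a probability distribution in which the random variable can take any value within a range. The probability of any single exact value is zero, so probabilities are assigned to intervals of values.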
NORMAL/GAUSSIAN DISTRIBUTION
The normal distribution is commonly referred to as the bell curve and is closely related to the central limit theorem. It is described by its mean and standard deviation; the standard normal distribution has a mean of 0 and a standard deviation of 1.
T-DISTRIBUTION
The t-distribution is another probability distribution. It is used to estimate population parameters when the sample size is small.
UNIFORM DISTRIBUTION
In this probability distribution, the density takes a single constant value within a certain range and is 0 outside that range. It is also known as an on and off distribution.
POSITION DISTRIBUTION
It is quite similar to the normal distribution, but it has an additional factor: skewness. When the skewness is low, the distribution spreads relatively uniformly in all directions. When the skewness is high, the data spreads unevenly in different directions.
DISCRETE DATA DISTRIBUTIONS
POISSON DISTRIBUTION
The Poisson distribution is one of the most common probability distributions. It expresses the probability of a given number of events occurring within a fixed time period, when events occur independently at a constant average rate.
BINOMIAL DISTRIBUTION
The binomial distribution is the probability distribution of the number of successes in a sequence of n independent experiments, each with a Boolean-valued outcome: success with probability p or failure with probability 1-p.
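Both discrete distributions have short closed-form probability mass functions. The sketch below uses the standard formulas; the function names and the example numbers are hypothetical.

```python
import math

def poisson_pmf(k, rate):
    """Probability of exactly k events in a period with an average of rate events."""
    return rate ** k * math.exp(-rate) / math.factorial(k)

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hypothetical examples:
# exactly 3 calls in an hour when the average is 2 calls per hour,
p_three_calls = poisson_pmf(3, rate=2.0)
# and exactly 2 heads in 4 tosses of a fair coin.
p_two_heads = binomial_pmf(2, n=4, p=0.5)

print(round(p_three_calls, 4), p_two_heads)
```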
MOMENTS
Moments describe different aspects of the nature and shape of a distribution. They come in sequence: the mean is the first moment, the variance is the second, skewness is the third, and kurtosis is the fourth.
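The four moments can be sketched from one helper. This assumes the common convention in which the variance is the second central moment and skewness and kurtosis are the third and fourth central moments divided by the matching power of the variance. The function names and the data are hypothetical.

```python
def central_moment(values, order):
    """Average of the deviations from the mean raised to the given power."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Third standardized moment: asymmetry of the distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth standardized moment: weight of the tails (about 3 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

# Hypothetical dataset, used only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

first = sum(data) / len(data)     # mean
second = central_moment(data, 2)  # variance (population form)
third = skewness(data)
fourth = kurtosis(data)

print(first, second, third, fourth)
```

A positive skewness, as in this dataset, means the longer tail lies to the right of the mean.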
PROBABILITY
CONDITIONAL PROBABILITY
Conditional probability, P(A|B), is the likelihood of event A occurring given that event B has already occurred.
BAYES’ THEOREM
Bayes’ theorem is a widely used mathematical formula for determining conditional probability. It states that the probability of A given B equals the probability of B given A times the probability of A, divided by the probability of B: P(A|B) = P(B|A) P(A) / P(B).
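A short worked example of Bayes' theorem, using made-up numbers for a screening test. Here A is "the condition is present" and B is "the test is positive". The probabilities and the function name are assumptions chosen only to illustrate the formula.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical inputs.
p_a = 0.01              # P(A): 1% of people have the condition
p_b_given_a = 0.95      # P(B|A): the test is positive when the condition is present
p_b_given_not_a = 0.05  # P(B|not A): the test is positive when the condition is absent

# P(B) from the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

posterior = bayes_posterior(p_b_given_a, p_a, p_b)

print(round(p_b, 4), round(posterior, 4))
```

Even with a fairly accurate test, the probability of having the condition after a positive result stays low here because the condition itself is rare.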
ACCURACY
TRUE POSITIVE
The test detects the condition when the condition is present.
TRUE NEGATIVE
The test does not detect the condition when the condition is absent.
FALSE POSITIVE
The test detects the condition when the condition is absent.
FALSE NEGATIVE
The test does not detect the condition when the condition is present.
SENSITIVITY
Sensitivity measures the ability of a test to detect the condition when the condition is present. Sensitivity = TP/(TP+FN)
SPECIFICITY
Specificity measures the ability of a test to correctly exclude the condition when the condition is absent. Specificity = TN/(TN+FP)
PREDICTIVE VALUE POSITIVE
Predictive value positive, also called precision, is the proportion of positive results that correspond to the presence of the condition. PVP = TP/(TP+FP)
PREDICTIVE VALUE NEGATIVE
Predictive value negative is the proportion of negative results that correspond to the absence of the condition. PVN = TN/(TN+FN)
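The four outcome counts and the four formulas above can be sketched as follows. The labels are hypothetical, with 1 meaning the condition is present (or detected) and 0 meaning it is absent (or not detected).

```python
def confusion_counts(actual, predicted):
    """Count true positives, true negatives, false positives, and false negatives."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn

# Hypothetical ground truth and test results for ten cases.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)

sensitivity = tp / (tp + fn)  # share of present conditions that were detected
specificity = tn / (tn + fp)  # share of absent conditions that were excluded
pvp = tp / (tp + fp)          # predictive value positive (precision)
pvn = tn / (tn + fn)          # predictive value negative

print(tp, tn, fp, fn)
print(sensitivity, round(specificity, 3), pvp, round(pvn, 3))
```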
CONCLUSION
Now we have gone through all the basic concepts of statistics for data science. If you are going to start with data science, then you should try to have a good command of all these statistics concepts. It will help you a lot when you start learning data science. With the help of these concepts you will be able to understand the data science concepts. So what are you waiting for? Grab the best statistics books and start learning these concepts.
FOLLOW US ON SOCIAL MEDIA
FACEBOOK @statanalytica
TWITTER @statanalytica
PINTEREST @statanalytica
CONTACT US
WEBSITE https://statanalytica.com
EMAIL Info@statanalytica.com