SAMPLING AND ESTIMATION
- Simple Random Sampling => each item in the population has an equal probability of being selected.
- Sampling Distribution => the distribution of a sample statistic (e.g. the sample mean) over all possible samples of the same size drawn from the population.
- Sampling Error => the difference between the sample statistic and the population parameter it estimates. For the mean this is x̄ − μ, the distance between the observed sample mean and the true population mean.
Central Limit Theorem (CLT)
The CLT states that, for a population with finite variance, the sampling distribution of the sample mean (for samples of sufficient size) is approximately normal, with mean equal to the population mean μ and variance σ²/n. In practice it is considered to hold for sample sizes of n ≥ 30.
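The CLT can be illustrated with a short simulation (a sketch using only Python's standard library; the skewed exponential population is an arbitrary choice): draw many samples of size n = 30 from a clearly non-normal population and check that the mean of the sample means lands close to the population mean.

```python
import random
import statistics

random.seed(42)

# Population: exponential with rate 1 -> population mean 1.0, heavily skewed.
POP_MEAN = 1.0
n = 30          # sample size (the usual CLT rule of thumb)
trials = 5000   # number of independent samples drawn

# Draw many samples and record each sample's mean.
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(trials)
]

# The mean of the sample means should sit close to the population mean,
# and their spread should be roughly sigma/sqrt(n) = 1/sqrt(30) ~ 0.18.
print(round(statistics.mean(sample_means), 2))
print(round(statistics.stdev(sample_means), 2))
```

Even though each individual observation is drawn from a skewed distribution, the distribution of the sample means is approximately normal and centred on μ.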
Standard Error of the Mean (SEM)
The standard error of the sample mean is different from the standard deviation. Where the latter quantifies how much the values of a population vary from one another, the former quantifies how accurately we know the true mean of the population – it depends on both the standard deviation and the sample size. (Given a larger sample, we can be more confident in our estimate of the mean, so the SEM shrinks.) There are two formulae for the SEM depending on whether the population standard deviation σ is known or unknown – in the latter case we use the sample's standard deviation, s:

SEM = σ/√n (σ known), or SEM = s/√n (σ unknown).
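A minimal sketch of the two SEM formulae using only the standard library (the numbers passed in are made up for illustration):

```python
import math
import statistics

def sem_known_sigma(sigma: float, n: int) -> float:
    """SEM when the population standard deviation sigma is known."""
    return sigma / math.sqrt(n)

def sem_unknown_sigma(sample: list[float]) -> float:
    """SEM when sigma is unknown: use the sample standard deviation s."""
    s = statistics.stdev(sample)  # n-1 denominator (sample stdev)
    return s / math.sqrt(len(sample))

print(sem_known_sigma(15.0, 100))  # 15 / sqrt(100) = 1.5
```

Note how quadrupling the sample size only halves the SEM, because n appears under a square root.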
- Point Estimates (PE) => single-value estimates of population parameters.
- Confidence Interval (CI) => point estimate ± (reliability factor × standard error)

t-Distribution
The t-distribution is a continuous probability distribution that arises when estimating the mean of a normally distributed population from a small sample with an unknown standard deviation. It exhibits fatter tails than the normal distribution; that is, it places more probability on extreme values, reflecting the extra uncertainty from estimating σ. We will henceforth use df to mean "degrees of freedom" – the number of independent values that are free to vary in a statistical calculation; for the t-distribution, df = n − 1. For a sample of size n it is intuitive that as n increases (i.e. gets closer to the size of the population) the sampling distribution becomes more nearly normal (as we know, for a t-distribution the population itself is assumed normally distributed), so the sample represents the population more accurately and the tails become less heavily populated. The confidence intervals are x̄ ± z(α/2)·σ/√n when σ is known, and x̄ ± t(α/2, n−1)·s/√n when σ is unknown.
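As a sketch, the z-based interval (σ known) can be computed with the standard library's NormalDist; the standard library has no t quantile function (scipy.stats.t.ppf would be the usual choice), so only the z case is shown here. The input numbers are illustrative, not from the text.

```python
import math
from statistics import NormalDist

def z_confidence_interval(xbar: float, sigma: float, n: int,
                          confidence: float = 0.95) -> tuple[float, float]:
    """CI = point estimate +/- (reliability factor * standard error),
    for a known population standard deviation sigma."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # e.g. ~1.96 for 95%
    half_width = z * sigma / math.sqrt(n)
    return (xbar - half_width, xbar + half_width)

lo, hi = z_confidence_interval(xbar=100.0, sigma=15.0, n=36, confidence=0.95)
print(round(lo, 2), round(hi, 2))  # 95.1 104.9
```

The standard error here is 15/√36 = 2.5, so the 95% interval is 100 ± 1.96 × 2.5.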
[Tip: a summary of the information above is given in the table below…]

Distribution   Variance   n < 30   n > 30
Normal         Known      z        z
Normal         Unknown    t        t
Non-normal     Known      n/a      z
Non-normal     Unknown    n/a      t
Some Common Biases
- Data Mining => significant relationships that occurred purely by chance are reported as genuine.
- Sample Selection Bias => the sample is selected non-randomly.
- Look-Ahead Bias => the study uses data that was not actually available at the time being analysed.
- Survivorship Bias => only data that "survived" to the end of the period is included.
- Time-Period Bias => the relationship holds only within the particular period tested.

Why Not Just Make The Sample Very Large?
1. Cost.
2. The increased risk of including inappropriate data, e.g. going back in time too far to a non-comparable market era.

Monte Carlo Method
The Monte Carlo method is based on parameters that are not limited to past experience. It is a computational technique that varies all inputs over their plausible ranges and, by running thousands of trials, produces a likelihood distribution of the potential outputs.

[Tip: learn these commonly used confidence interval/significance level z-scores…]
90% confidence (10% significance): z = 1.645
95% confidence (5% significance): z = 1.960
99% confidence (1% significance): z = 2.576
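A minimal Monte Carlo sketch (the portfolio figures are invented for illustration, not from the text): simulate thousands of ten-year investment paths with randomly drawn annual returns, then summarise the distribution of outcomes.

```python
import random
import statistics

random.seed(1)

def simulate_final_value(start: float, years: int,
                         mean_return: float, vol: float) -> float:
    """One trial: compound the start value by a random normal return each year."""
    value = start
    for _ in range(years):
        value *= 1 + random.gauss(mean_return, vol)
    return value

# Run many trials, varying the inputs randomly on every pass.
trials = 10_000
outcomes = [simulate_final_value(1000.0, 10, 0.07, 0.15) for _ in range(trials)]

# Summarise the likelihood of the potential outputs.
median = statistics.median(outcomes)
prob_loss = sum(v < 1000.0 for v in outcomes) / trials
print(f"median final value: {median:.0f}")
print(f"probability of ending below the start: {prob_loss:.1%}")
```

The point of the method is the full distribution of outcomes, not a single forecast: from the same trials you can read off medians, tail probabilities, or any percentile you care about.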