Probability for Deep Learning Part 1 Johar M. Ashfaque
A. Why Probability?
Probability is the mathematical language for quantifying uncertainty. We can apply probability theory to a diverse set of problems from flipping a coin to the analysis of computer algorithms. The starting point is to specify the sample space, the set of all the possible outcomes.
B. Sample Space and Events
The sample space Ω is the set of all possible outcomes of an experiment. Events are subsets of Ω.

Example 1. If we flip a coin twice then Ω = {HH, HT, TH, TT}. The event that the first flip gives a head is A = {HH, HT}.

Example 2. Let ω be the outcome of a measurement of some physical quantity, say temperature. Then Ω = R. The event that the measurement is larger than 10 but less than or equal to 23 is A = (10, 23].

Given an event A, let A^c denote the complement of A. Informally, A^c can be read as "not A". The complement of Ω is the empty set ∅.

Ω        sample space
ω        outcome
A        event
A^c      complement of A ("not A")
A ∪ B    union ("A or B")
A ∩ B    intersection ("A and B")
A − B    set difference (points in A that are not in B)
A ⊂ B    set inclusion (A is a subset of, or equal to, B)
∅        null event (always false)
TABLE I. Sample Space and Events

We say that A1, A2, ... are disjoint or mutually exclusive if Ai ∩ Aj = ∅ whenever i ≠ j. A partition of Ω is a sequence of disjoint sets A1, A2, ... such that ∪_{i=1}^∞ Ai = Ω.
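These definitions can be made concrete for a finite experiment. As a rough sketch in plain Python (with outcomes represented as tuples), the two-flip sample space and its events are just sets:

```python
from itertools import product

# Sample space for two coin flips: all ordered pairs of H/T.
omega = set(product("HT", repeat=2))

# Event A: the first flip gives a head.
A = {w for w in omega if w[0] == "H"}
# Event B: the second flip gives a head.
B = {w for w in omega if w[1] == "H"}

A_c = omega - A        # complement: "not A"
union = A | B          # A or B
inter = A & B          # A and B
diff = A - B           # points in A that are not in B

print(len(omega), sorted(inter))
```

Set operations on events are exactly the set operations of Table I, which is why finite probability examples are easy to check by enumeration.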
C. Probability Measure
We want to assign a real number P(A) to every event A, called the probability of A. We also call P a probability distribution or a probability measure. To qualify as a probability, P has to satisfy three axioms.

Definition .1 A function P that assigns a real number P(A) to each event A is a probability distribution or a probability measure if it satisfies the following three axioms:
• P(A) ≥ 0 for every A
• P(Ω) = 1
• If A1, A2, ... are disjoint then

P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai).
One can derive many properties of the function P from these axioms. Here are a few:
• P(∅) = 0
• A ⊂ B =⇒ P(A) ≤ P(B)
• 0 ≤ P(A) ≤ 1
• P(A^c) = 1 − P(A)
• A ∩ B = ∅ =⇒ P(A ∪ B) = P(A) + P(B)

A less obvious property is given by the following lemma.

Lemma .2 For any events A and B,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Example 3. Flip two coins. Let H1 be the event that heads occurs on flip 1 and let H2 be the event that heads occurs on flip 2. If all outcomes are equally likely, that is

P({H1, H2}) = P({H1, T2}) = P({T1, H2}) = P({T1, T2}) = 1/4,

then

P(H1 ∪ H2) = P(H1) + P(H2) − P(H1 ∩ H2) = 1/2 + 1/2 − 1/4 = 3/4.

D. Probability on Finite Sample Spaces
Suppose that the sample space Ω = {ω1, ..., ωn} is finite. For example, if a die is thrown twice then Ω has 36 elements in total. If each outcome is equally likely then P(A) = |A|/36 where |A| denotes the number of elements in A. The probability that the sum of the dice is 11 is 2/36 since there are two outcomes that correspond to this event. In general, if Ω is finite and if each outcome is equally likely then

P(A) = |A|/|Ω|,

which is called the uniform probability distribution. To compute probabilities, we need to count the number of points in an event A using combinatorial techniques. Given n objects, the number of ways of ordering these objects is n!. For convenience, we define 0! = 1. We also define

C(n, k) = n!/(k!(n − k)!),

which is the number of distinct ways of choosing k objects from n. For example, if we have a class of 20 people and we want to choose a committee of 3 students then there are

C(20, 3) = 20!/(3! 17!) = 1140

possible committees. Note the following property: C(n, 0) = C(n, n) = 1.
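These counting rules map directly onto the Python standard library; a small sketch using math.factorial and math.comb:

```python
from math import comb, factorial

# Number of orderings of n objects is n!.
print(factorial(3))        # 6 orderings of 3 objects

# comb(n, k) = n! / (k! (n - k)!): ways to choose k objects from n.
print(comb(20, 3))         # 1140 committees of 3 from a class of 20

# Boundary cases: comb(n, 0) == comb(n, n) == 1.
print(comb(7, 0), comb(7, 7))
```

Under the uniform distribution, probabilities then reduce to ratios of such counts.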
E. Independent Events
If we flip a fair coin twice, then the probability of two heads is 1/2 × 1/2 = 1/4. We multiply the probabilities because we regard the two tosses as independent. The formal definition of independence is as follows.

Definition .3 Two events A and B are independent if P(A ∩ B) = P(A)P(B).

Independence can arise in two distinct ways. Sometimes, we explicitly assume that two events are independent. In other instances, we derive independence by verifying that P(A ∩ B) = P(A)P(B) holds. Suppose that A and B are disjoint events, each with positive probability. Can they be independent? The answer is no. This follows since P(A)P(B) > 0 yet P(A ∩ B) = P(∅) = 0.

1) A and B are independent if P(A ∩ B) = P(A)P(B).
2) Independence is sometimes assumed and sometimes derived.
3) Disjoint events with positive probability are not independent.
TABLE II. Summary of Independence
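Both halves of the argument can be checked by enumeration under the uniform distribution; a minimal sketch (the prob helper below is illustrative, not a standard API):

```python
from itertools import product
from fractions import Fraction

# Four equally likely outcomes of two coin flips.
omega = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event under the uniform distribution on omega."""
    return Fraction(len(event), len(omega))

H1 = {w for w in omega if w[0] == "H"}   # heads on flip 1
H2 = {w for w in omega if w[1] == "H"}   # heads on flip 2
T1 = {w for w in omega if w[0] == "T"}   # tails on flip 1: disjoint from H1

# Derived independence: P(H1 ∩ H2) = P(H1) P(H2).
print(prob(H1 & H2) == prob(H1) * prob(H2))   # True

# Disjoint events with positive probability are never independent:
print(prob(H1 & T1))           # 0
print(prob(H1) * prob(T1))     # 1/4, so the product rule fails
```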
F. Conditional Probability
Assuming that P(B) > 0, we define the conditional probability of A given that B has occurred as follows.

Definition .4 If P(B) > 0 then the conditional probability of A given B is

P(A|B) = P(A ∩ B)/P(B).

If A and B are independent events then

P(A|B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A).
From the definition of conditional probability, we can write P(A ∩ B) = P(A|B)P(B) and also P(A ∩ B) = P(B|A)P(A). Often these formulae give us a convenient way to compute P(A ∩ B) when A and B are not independent.

1) If P(B) > 0 then the conditional probability of A given B is P(A|B) = P(A ∩ B)/P(B).
2) In general, P(A|B) ≠ P(B|A).
3) A and B are independent if and only if P(A|B) = P(A).
TABLE III. Summary of Conditional Probability
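The definition can be exercised on a concrete finite example; a sketch with two fair dice, where B is "the first die shows 6" and A is "the sum is 12" (both events chosen purely for illustration):

```python
from itertools import product
from fractions import Fraction

# Two fair dice: 36 equally likely outcomes.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability under the uniform distribution on omega."""
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] + w[1] == 12}   # the sum is 12, i.e. {(6, 6)}
B = {w for w in omega if w[0] == 6}           # the first die shows 6

p_A_given_B = prob(A & B) / prob(B)   # 1/6
p_B_given_A = prob(A & B) / prob(A)   # 1
print(p_A_given_B, p_B_given_A)       # confirms P(A|B) != P(B|A) in general
```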
Probability for Deep Learning Part 2 Johar M. Ashfaque
I. EXPECTATION

A. Expectation of a Random Variable
The expectation or the mean of a random variable X is the average value of X. The formal definition is as follows.

Definition I.1 The expected value or the mean or the first moment of X is defined to be

E(X) = ∫ x dF(x) = { Σ_x x f(x)    if X is discrete
                   { ∫ x f(x) dx   if X is continuous,

assuming that the sum (or integral) is well-defined. We use the following notation to denote the expected value of X:

E(X) = EX = ∫ x dF(x) = μ = μ_X.

The expectation is a one-number summary of the distribution. From now on, whenever we discuss expectations, we assume that they exist. Let Y = r(X). How do we compute E(Y)?

Theorem I.2 Let Y = r(X). Then

E(Y) = E(r(X)) = ∫ r(x) dF_X(x).

This result makes intuitive sense. Think of playing a game where we draw X at random and then I pay you Y = r(X). Your average income is r(x) times the chance that X = x, summed (or integrated) over all values of x. The kth moment of X is defined to be E(X^k), assuming that E(X^k) < ∞. We shall rarely make much use of moments beyond k = 2.
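For a discrete X the expectation of r(X) is a finite weighted sum, so Theorem I.2 can be evaluated directly; a sketch for a fair die (the expect helper is illustrative):

```python
from fractions import Fraction

# Probability function of a fair die: f(x) = 1/6 for x = 1, ..., 6.
f = {x: Fraction(1, 6) for x in range(1, 7)}

def expect(r, f):
    """E[r(X)] = sum of r(x) f(x) over the support (discrete case)."""
    return sum(r(x) * p for x, p in f.items())

mean = expect(lambda x: x, f)          # first moment
second = expect(lambda x: x * x, f)    # second moment
print(mean, second)                    # 7/2 and 91/6
```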
B. Properties of Expectation
Theorem I.3 If X1, ..., Xn are random variables and a1, ..., an are constants then

E(Σ_i ai Xi) = Σ_i ai E(Xi).

Theorem I.4 Let X1, ..., Xn be independent random variables. Then

E(∏_{i=1}^n Xi) = ∏_{i=1}^n E(Xi).
Notice that the summation rule does not require independence but the multiplication rule does.
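Both rules can be verified exactly by summing over the joint distribution of two independent fair dice; a small sketch:

```python
from itertools import product
from fractions import Fraction

# Joint distribution of two independent fair dice: each pair has mass 1/36.
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)

E_X = sum(x * p for x, y in outcomes)
E_Y = sum(y * p for x, y in outcomes)
E_lin = sum((2 * x + 3 * y) * p for x, y in outcomes)   # E(2X + 3Y)
E_prod = sum(x * y * p for x, y in outcomes)            # E(XY)

print(E_lin == 2 * E_X + 3 * E_Y)   # linearity: holds for any X, Y
print(E_prod == E_X * E_Y)          # product rule: holds here since X, Y independent
```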
C. Variance and Covariance
The variance measures the spread of the distribution.

Definition I.5 Let X be a random variable with mean μ. The variance of X, denoted σ², σ²_X or V(X), is defined by

σ² = E(X − μ)² = ∫ (x − μ)² dF(x),

assuming this expectation exists. The standard deviation is sd(X) = σ.

Theorem I.6 Assuming the variance is well-defined, it has the following properties:
1. V(X) = E(X²) − E(X)² = E(X²) − μ².
2. If a and b are constants then V(aX + b) = a² V(X).
3. If X1, ..., Xn are independent and a1, ..., an are constants then

V(Σ_{i=1}^n ai Xi) = Σ_{i=1}^n ai² V(Xi).
If X1, ..., Xn are random variables then we define the sample mean to be

X̄n = (1/n) Σ_{i=1}^n Xi

and the sample variance to be

Sn² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄n)².

Theorem I.7 Let X1, ..., Xn be IID and let μ = E(Xi), σ² = V(Xi). Then

E(X̄n) = μ,   V(X̄n) = σ²/n,   E(Sn²) = σ².
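Theorem I.7 can be checked by simulation; the sketch below draws repeated samples from an assumed Normal(5, 4) distribution and compares the averages against μ, σ²/n and σ² (the sample sizes and tolerances are arbitrary choices):

```python
import random
import statistics

random.seed(0)

mu, sigma2, n = 5.0, 4.0, 10     # assumed Normal(5, 4), samples of size 10
reps = 20000

means, variances = [], []
for _ in range(reps):
    xs = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    means.append(statistics.fmean(xs))
    variances.append(statistics.variance(xs))   # divides by n - 1, matching Sn^2

print(statistics.fmean(means))        # close to mu = 5
print(statistics.pvariance(means))    # close to sigma2 / n = 0.4
print(statistics.fmean(variances))    # close to sigma2 = 4
```

Note that statistics.variance divides by n − 1, which is exactly why Sn² is an unbiased estimate of σ².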
If X and Y are random variables then the covariance and correlation between X and Y measure how strong the linear relationship is between X and Y.

Definition I.8 Let X and Y be random variables with means μX and μY and standard deviations σX and σY. Define the covariance between X and Y by

Cov(X, Y) = E[(X − μX)(Y − μY)]

and the correlation by

ρ = ρ(X, Y) = Cov(X, Y)/(σX σY).
Theorem I.9 The covariance satisfies Cov(X, Y ) = E(XY ) − E(X)E(Y ). The correlation satisfies −1 ≤ ρ(X, Y ) ≤ 1. If Y = a + bX for some constants a and b then ρ(X, Y ) = 1 if b > 0 and ρ(X, Y ) = −1 if b < 0. If X and Y are independent, then Cov(X, Y ) = ρ(X, Y ) = 0. The converse is not true in general.
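The boundary case Y = a + bX can be worked through exactly; a sketch with X uniform on a die and b = −2 < 0, so ρ should be −1 (ρ² is computed to stay in exact arithmetic):

```python
from fractions import Fraction

# X uniform on {1, ..., 6}; Y = 3 - 2X, a decreasing linear function of X.
xs = list(range(1, 7))
p = Fraction(1, 6)
ys = [3 - 2 * x for x in xs]

mean = lambda vals: sum(v * p for v in vals)
mx, my = mean(xs), mean(ys)

cov = sum((x - mx) * (y - my) * p for x, y in zip(xs, ys))
var_x = sum((x - mx) ** 2 * p for x in xs)
var_y = sum((y - my) ** 2 * p for y in ys)

# Shortcut formula Cov(X, Y) = E(XY) - E(X)E(Y):
shortcut_ok = cov == mean([x * y for x, y in zip(xs, ys)]) - mx * my
rho_sq = cov * cov / (var_x * var_y)   # rho^2, kept exact to avoid square roots

print(shortcut_ok, rho_sq, cov < 0)    # rho^2 = 1 with cov < 0, so rho = -1
```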
Distribution       Mean         Variance
Point mass at a    a            0
Bernoulli(p)       p            p(1 − p)
Binomial(n, p)     np           np(1 − p)
Geometric(p)       1/p          (1 − p)/p²
Poisson(λ)         λ            λ
Uniform(a, b)      (a + b)/2    (b − a)²/12
Normal(μ, σ²)      μ            σ²
Exponential(β)     β            β²
Gamma(α, β)        αβ           αβ²
Beta(α, β)         α/(α + β)    αβ/((α + β)²(α + β + 1))
t_ν                0 (ν > 1)    ν/(ν − 2) (ν > 2)
χ²_p               p            2p
Probability for Deep Learning Part 3 Johar M. Ashfaque
I. LINEAR REGRESSION
The term "regression" is due to Sir Francis Galton. Regression is a method for studying the relationship between a response variable Y and a covariate X. The covariate is also called a predictor variable or a feature. There can be one or more covariates. The data are of the form (Y1, X1), ..., (Yn, Xn). One way to summarize the relationship between X and Y is through the regression function

r(x) = E(Y|X = x) = ∫ y f(y|x) dy.
A. Simple Linear Regression
The simplest version of regression is when Xi is simple (a scalar, not a vector) and r(x) is assumed to be linear:

r(x) = β0 + β1 x.

This model is called the simple linear regression model. Let εi = Yi − (β0 + β1 Xi). Then E(εi|Xi) = 0. Let σ²(x) = V(εi|X = x). We will make the further simplifying assumption that σ²(x) = σ² does not depend on x. We can thus write the linear regression model as follows.

Definition I.1 (The Linear Regression Model)

Yi = β0 + β1 Xi + εi

where E(εi|Xi) = 0 and V(εi|Xi) = σ². The unknown parameters in the model are the intercept β0, the slope β1 and the variance σ².

Let β̂0 and β̂1 denote the estimates of β0 and β1 respectively. The fitted line is defined to be r̂(x) = β̂0 + β̂1 x. The predicted values or fitted values are Ŷi = r̂(Xi) and the residuals are defined to be ε̂i = Yi − Ŷi = Yi − (β̂0 + β̂1 Xi). The residual sum of squares or RSS is defined by

RSS = Σ_{i=1}^n ε̂i².

The quantity RSS measures how well the fitted line fits the data.
Definition I.2 The least squares estimates are the values β̂0 and β̂1 that minimize

RSS = Σ_{i=1}^n ε̂i².

Theorem I.3 The least squares estimates are given by

β̂1 = Σ_{i=1}^n (Xi − X̄n)(Yi − Ȳn) / Σ_{i=1}^n (Xi − X̄n)²,
β̂0 = Ȳn − β̂1 X̄n.

An unbiased estimate of σ² is

σ̂² = (1/(n − 2)) Σ_{i=1}^n ε̂i².
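The closed-form estimates in Theorem I.3 translate directly into code; a sketch in plain Python on made-up data lying near the line y = 2x (the data values are illustrative only):

```python
# Least squares via the closed-form estimates of Theorem I.3.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# Slope: sum of cross-deviations over sum of squared x-deviations.
b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
      / sum((x - xbar) ** 2 for x in xs))
b0 = ybar - b1 * xbar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
rss = sum(e * e for e in residuals)
sigma2_hat = rss / (n - 2)     # unbiased estimate of sigma^2

print(b0, b1, rss)
```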
II. GRAPHICAL MODELS
Graphical models are a class of multivariate statistical models that are useful for representing independence relations. Graphical models often require fewer parameters and may lead to estimators with smaller risk. There are two main types of graphical models: undirected and directed.

An undirected graph G = (V, E) has a finite set V of vertices and a set E of edges consisting of pairs of vertices. The vertices correspond to random variables X, Y, Z, ... and edges are written as unordered pairs. For example, (X, Y) ∈ E means that X and Y are joined by an edge. Two vertices are adjacent, denoted X ∼ Y, if there is an edge between them. A graph is complete if there is an edge between every pair of vertices. A subset U ⊂ V of vertices together with their edges is called a subgraph. If A, B and C are three distinct subsets of V, we say that C separates A and B if every path from a variable in A to a variable in B intersects a variable in C.

Directed graphs are similar to undirected graphs except that there are arrows between vertices instead of edges. Like undirected graphs, directed graphs can be used to represent independence relations. A directed graph G consists of a set of vertices V and an edge set E of ordered pairs of variables. If (X, Y) ∈ E then there is an arrow pointing from X to Y. If an arrow connects two variables X and Y (in either direction), we say that X and Y are adjacent. If there is an arrow from X to Y then X is a parent of Y and Y is a child of X. The set of all parents of X is denoted π(X). A directed path from X to Y is a set of vertices beginning with X and ending with Y such that each pair is connected by an arrow and all the arrows point in the same direction. A sequence of adjacent vertices starting with X and ending with Y, but ignoring the direction of the arrows, is called an undirected path. X is an ancestor of Y if there is a directed path from X to Y. We also say that Y is a descendant of X. A directed path that starts and ends at the same variable is called a cycle. A directed graph is acyclic if it has no cycles. In this case we say that the graph is a directed acyclic graph (DAG).
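The parent and ancestor relations can be computed directly from an edge list; a rough sketch (the graph and the helper functions are illustrative, not a standard API):

```python
# A directed graph as a list of (parent, child) edges; the vertex names
# X, Y, Z, W are purely illustrative.
edges = [("X", "Y"), ("Y", "Z"), ("X", "W")]

def parents(v, edges):
    """pi(v): vertices with an arrow pointing into v."""
    return {a for a, b in edges if b == v}

def ancestors(v, edges):
    """Vertices with a directed path ending at v."""
    found, frontier = set(), parents(v, edges)
    while frontier:
        u = frontier.pop()
        if u not in found:
            found.add(u)
            frontier |= parents(u, edges)
    return found

print(parents("Z", edges))     # {'Y'}
print(ancestors("Z", edges))   # {'X', 'Y'}
```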
III. LOG-LINEAR MODELS
Log-linear models are useful for modelling multivariate discrete data. There is a strong connection between log-linear models and undirected graphs.

A. The Log-Linear Model
Let X = (X1, ..., Xm) be a random vector with probability function

f(x) = P(X = x) = P(X1 = x1, ..., Xm = xm)

where x = (x1, ..., xm). Let rj be the number of values that Xj takes. Without loss of generality, we can assume that Xj ∈ {0, 1, ..., rj − 1}. Suppose now that we have n such random vectors. We can think of the data as a sample from a multinomial with N = r1 × r2 × ... × rm categories. The data can be represented as counts in an r1 × r2 × ... × rm table. Let p = (p1, ..., pN) denote the multinomial parameter.

Let S = {1, ..., m}. Given a vector x = (x1, ..., xm) and a subset A ⊂ S, let xA = (xj : j ∈ A). For example, if A = {1, 3} then xA = (x1, x3).

Theorem III.1 The joint probability function f(x) of a single random vector X = (X1, ..., Xm) can be written as

log f(x) = Σ_{A⊂S} ψ_A(x)

where the sum is over all subsets A of S = {1, ..., m} and the ψ's satisfy the following conditions:
• ψ_∅(x) is a constant.
• For every A ⊂ S, ψ_A(x) is only a function of xA and not of the rest of the xj's.
• If i ∈ A and xi = 0 then ψ_A(x) = 0.

The formula

log f(x) = Σ_{A⊂S} ψ_A(x)

is called the log-linear expansion of f. Note that this is the probability function for a single draw. Each ψ_A(x) will depend on some unknown parameters βA. Let β = (βA : A ⊂ S) be the set of all these parameters. We will write f(x) = f(x; β) when we want to emphasize the dependence on the unknown parameters β. In terms of the multinomial, the parameter space is

P = { p = (p1, ..., pN) : pj ≥ 0, Σ_{j=1}^N pj = 1 }.

This is an (N − 1)-dimensional space. In the log-linear representation, the parameter space is

Θ = { β = (β1, ..., βN) : β = β(p), p ∈ P }

where β(p) is the set of β values associated with p. The set Θ is an (N − 1)-dimensional surface in R^N.
Probability for Deep Learning Part 4 Johar M. Ashfaque
I. INTRODUCTION
We will consider sequences of dependent random variables. For example, daily temperatures form a sequence of time-ordered random variables, and clearly the temperature on day one is not independent of the temperature on day two.

A stochastic process {Xt : t ∈ T} is a collection of random variables. The variables Xt take values in some set X called the state space. The set T is called the index set and for our purposes can be thought of as time. The index set can be either discrete or continuous.

Example 1. Let X = {sunny, cloudy}. A typical sequence might be sunny, sunny, cloudy, sunny, cloudy, ... This process has a discrete state space and a discrete index set.

Example 2. A sequence of IID random variables can be written as {Xt : t ∈ T} where T = {1, 2, 3, ...}. Hence a sequence of IID random variables is an example of a stochastic process.

If X1, ..., Xn are random variables then we can write the joint density as

f(x1, ..., xn) = ∏_{i=1}^n f(xi | past_i)

where past_i refers to all the variables before Xi.

II. MARKOV CHAINS
The simplest stochastic process is a Markov chain in which the distribution of Xt depends only on Xt−1 . We will assume that the state space is discrete X = {1, ..., N } and that the index set is T = {0, 1, 2, ...}. Definition II.1 The process {Xn : n ∈ T } is a Markov chain if P(Xn = x|X0 , ..., Xn−1 ) = P(Xn = x|Xn−1 ) for all n and for all x ∈ X. For the Markov chain, the joint probability is f (x1 , ..., xn ) = f (x1 )f (x2 |x1 )...f (xn |xn−1 ).
A. Transition Probabilities
The key quantities of a Markov chain are the probabilities of jumping from one state into another.

Definition II.2 We call

P(Xn+1 = j | Xn = i)

the transition probabilities. If the transition probabilities do not change with time, we call the chain homogeneous. In this case, we define pij = P(Xn+1 = j | Xn = i). The matrix P whose (i, j) element is pij is called the transition matrix.

We will only be interested in homogeneous Markov chains. Notice that P has two properties:
• pij ≥ 0
• Σ_j pij = 1
Each row is a probability mass function. A matrix with these properties is called a stochastic matrix. Let

pij(n) = P(Xm+n = j | Xm = i)

be the probability of going from state i to state j in n steps. Let Pn be the matrix whose (i, j) element is pij(n). These are called the n-step transition probabilities.

Theorem II.3 (The Chapman-Kolmogorov Equations) The n-step transition probabilities satisfy

pij(m + n) = Σ_k pik(m) pkj(n).

This statement of the theorem is nothing more than the equation for matrix multiplication. Hence Pm+n = Pm Pn.

1) Transition Matrix: P(i, j) = P(Xn+1 = j | Xn = i).
2) n-step Matrix: Pn(i, j) = P(Xm+n = j | Xm = i).
TABLE I. Summary
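Since Pm+n = Pm Pn, the n-step probabilities are just matrix powers; a sketch for an assumed two-state chain, kept in exact arithmetic:

```python
from fractions import Fraction

F = Fraction
# An assumed two-state homogeneous chain; each row is a probability mass function.
P = [[F(1, 2), F(1, 2)],
     [F(1, 4), F(3, 4)]]

def matmul(A, B):
    """Matrix product, which is exactly the Chapman-Kolmogorov sum."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

P2 = matmul(P, P)      # two-step transition probabilities
P3 = matmul(P2, P)     # three steps; equals matmul(P, P2)

print(P2)
print(all(sum(row) == 1 for row in P3))   # rows still sum to 1
```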
B. States
The states of a Markov chain can be classified according to various properties.

Definition II.4 We say that i reaches j, or j is accessible from i, if pij(n) > 0 for some n, and we write i → j. If i → j and j → i then we write i ↔ j and we say that i and j communicate.

Theorem II.5 The communication relation satisfies the following properties:
1. i ↔ i.
2. If i ↔ j then j ↔ i.
3. If i ↔ j and j ↔ k then i ↔ k.
4. The set of states X can be written as a disjoint union of classes X = X1 ∪ X2 ∪ · · · where two states i and j communicate with each other if and only if they are in the same class.

If all states communicate with each other then the chain is called irreducible. A set of states is closed if once you enter that set of states you never leave. A closed set consisting of a single state is called an absorbing state.

Example. Let X = {1, 2, 3, 4} and

        1/3  2/3   0    0
P =     2/3  1/3   0    0
        1/4  1/4  1/4  1/4
         0    0    0    1

The classes are {1, 2}, {3} and {4}. State 4 is an absorbing state.

Theorem II.6 A state i is recurrent if and only if Σ_n pii(n) = ∞. A state i is transient if and only if Σ_n pii(n) < ∞.

Theorem II.7 Some facts:
• If state i is recurrent and i ↔ j then j is recurrent.
• If state i is transient and i ↔ j then j is transient.
• A finite Markov chain must have at least one recurrent state.
• The states of a finite, irreducible Markov chain are all recurrent.
III. POISSON PROCESSES
One of the most studied and useful stochastic processes is the Poisson process. It arises when we count occurrences of events over time, for example traffic accidents or radioactive decays. As the name suggests, the Poisson process is related to the Poisson distribution.

Definition III.1 A Poisson process is a stochastic process {Xt : t ∈ [0, ∞)} with state space X = {0, 1, 2, ...} such that
• X(0) = 0.
• For any 0 = t0 < t1 < t2 < · · · < tn, the increments X(t1) − X(t0), X(t2) − X(t1), ..., X(tn) − X(tn−1) are independent.
• There is a function λ(t), called the intensity function, such that the number of events in any interval is a Poisson random variable with mean given by the integral of λ(t) over that interval. When λ(t) is constant, the number of events in any interval of length t has mean λt.

Definition III.2 A Poisson process with intensity function λ(t) ≡ λ for some λ > 0 is called a homogeneous Poisson process with rate λ. In this case X(t) ∼ Poisson(λt).
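A homogeneous Poisson process can be simulated using the standard fact (not derived in these notes) that its interarrival times are independent exponentials with mean 1/λ; a sketch with an assumed rate λ = 2 on the interval [0, 5]:

```python
import random
import statistics

random.seed(1)

def poisson_count(rate, t, rng=random):
    """Number of events of a homogeneous Poisson process in [0, t],
    simulated by accumulating exponential interarrival times."""
    total, count = 0.0, 0
    while True:
        total += rng.expovariate(rate)   # waiting time to the next event
        if total > t:
            return count
        count += 1

lam, horizon, reps = 2.0, 5.0, 20000
counts = [poisson_count(lam, horizon) for _ in range(reps)]

# X(t) ~ Poisson(lambda * t): mean and variance should both be near 10.
print(statistics.fmean(counts))
print(statistics.pvariance(counts))
```

The agreement of the empirical mean and variance (both near λt = 10) is the fingerprint of the Poisson distribution.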