Applications of Information Theory to Epidemiology

2.2. Expected information content

In both the potato late blight example (Section 1.2) and the Sclerotinia stem rot example (Section 1.3), the true status of a crop may be either D1 (denoting a disease outbreak, or the need for treatment) or D2 (denoting no outbreak, or no need for treatment). Here, we adopt an extended notation in which the true status of a crop may take any one of m states, D1 … Dj … Dm. The corresponding probabilities are Pr(D1) … Pr(Dj) … Pr(Dm), and:

$$\sum_{j=1}^{m} \Pr(D_j) = 1, \qquad \Pr(D_j) \ge 0 \quad (j = 1 \ldots m).$$

When we receive a definite message that a crop has true status Dj, the information content of this message (from Equation 5) is h(Pr(Dj)), abbreviated to h(Dj):

$$h(D_j) = \log\!\left[\frac{1}{\Pr(D_j)}\right].$$

We cannot calculate this quantity until the message is received, because the message ‘Dj occurred’ may refer to any one of D1 … Dj … Dm. We can, however, calculate the expected information content before the message is received. This is the weighted average of the h(Dj) values. Since the message ‘Dj occurred’ is received with probability Pr(Dj), the expected information content, denoted H(D), is:

$$H(D) = \sum_{j=1}^{m} \Pr(D_j)\,\log\!\left[\frac{1}{\Pr(D_j)}\right] \qquad (7)$$

This is the expected information content of a discrete probability distribution (in this case, the distribution of D), and is often referred to as the entropy of that distribution, with Equation 7 written as follows:

$$H(D) = -\sum_{j=1}^{m} \Pr(D_j)\,\log\!\left[\Pr(D_j)\right] \qquad (8)$$

(Theil, 1967).⁸ We note that H(D) ≥ 0, and we take Pr(Dj) log[Pr(Dj)] = 0 if Pr(Dj) = 0.⁹ If any Pr(Dj) = 1, then H(D) = 0. This is reasonable, since we expect nothing from the forecast if we are already certain of the actual outcome. H(D) has its maximum value when all the Pr(Dj) have the same value, equal to 1/m (Theil, 1967). This is also reasonable, since a message that tells us what actually happened when all outcomes have the same prior probability has a larger information content than when some outcomes have larger prior probabilities than others. More generally, we can say that the greater the uncertainty prevailing before a message is received, the larger is the expected information content of a message that tells us what happened.¹⁰

⁸ The expected information content of a continuous probability distribution, say f(x), is often referred to as the differential entropy (to distinguish it from the expected information content, or entropy, of a discrete probability distribution), and is written:

$$H[f(x)] = -\int_X f(x)\,\log\!\left[f(x)\right]\,dx$$

(Cover and Thomas, 2006). While the equation for the differential entropy of a continuous distribution is analogous, in an informal way, to the equation for the entropy of a discrete distribution, it is not the case that the former arises simply as a limiting form of the latter.

⁹ Because $\lim_{x \to 0} x\log(x) = 0$.
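As a numerical aside to footnote 8, the differential entropy integral can be approximated on a grid. The minimal Python sketch below uses a standard normal density purely as a convenient example (its differential entropy has the closed form ½ln(2πe) ≈ 1.419 nits); the choice of density and the variable names are ours.

```python
import numpy as np

# Sketch only: approximate H[f(x)] = -∫ f(x) ln f(x) dx for a standard normal
# density f(x), chosen here purely as an illustrative example.
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
f = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)   # standard normal density
h = -np.sum(f * np.log(f)) * dx                  # Riemann-sum approximation
print(h)                                         # ≈ 1.4189 nits
print(0.5 * np.log(2.0 * np.pi * np.e))          # closed form, ≈ 1.4189 nits
```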



For example, for forecasts of potato late blight outbreaks in commercial fields in south-central Washington (Section 1.2), the probability of an outbreak was taken to be Pr(D1) = 0.5, and so the corresponding probability of no outbreak was taken to be Pr(D2) = 0.5. Working in natural logarithms, the expected information content of a message that tells us what happened is (from Equation 7) H(D) = 0.5·ln(2) + 0.5·ln(2) = 0.69 nits, which is the maximum expected information content when m = 2. For forecasts of Sclerotinia stem rot of oil seed rape in east-central Sweden (Section 1.3), the probability of need for treatment is Pr(D1) = 0.16 and the probability of no need for treatment is Pr(D2) = 0.84. In this case, the expected information content of a message that tells us what happened is (from Equation 7) H(D) = 0.16·ln(6.25) + 0.84·ln(1.19) = 0.44 nits. The expected information content of a message that tells us what happened is larger for a potato late blight forecast in south-central Washington than for a Sclerotinia stem rot forecast in east-central Sweden because, in the former case, the uncertainty was greater before the message was received.
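For readers who wish to reproduce these calculations, a minimal Python sketch of Equation 7 (equivalently, Equation 8) follows; the function name entropy_nits is, of course, arbitrary.

```python
import numpy as np

def entropy_nits(probs):
    """Expected information content H(D) of Equations 7/8, in nits (natural logs).
    Zero-probability outcomes contribute zero (see footnote 9)."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                       # drop zero-probability outcomes
    return float(-np.sum(p * np.log(p)))

# Potato late blight forecasts (Section 1.2): Pr(D1) = Pr(D2) = 0.5
print(entropy_nits([0.5, 0.5]))        # ≈ 0.69 nits, the maximum for m = 2

# Sclerotinia stem rot forecasts (Section 1.3): Pr(D1) = 0.16, Pr(D2) = 0.84
print(entropy_nits([0.16, 0.84]))      # ≈ 0.44 nits
```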

We now need to generalize Equation 7 so that we can calculate the expected information content of an indefinite message. Recall that the true status of a crop may be any of D1 … Dj … Dm, with corresponding probabilities Pr(D1) … Pr(Dj) … Pr(Dm). A message (denoted T) is received which serves to transform these prior probabilities into the posterior probabilities Pr(D1|T) … Pr(Dj|T) … Pr(Dm|T), where:

$$\sum_{j=1}^{m} \Pr(D_j \mid T) = 1, \qquad \Pr(D_j \mid T) \ge 0 \quad (j = 1 \ldots m).$$

When we receive the message T, the information content of this message, as viewed from the perspective of a particular Dj, is (from Equation 6):

$$\text{information content of message } T = \log\!\left[\frac{\Pr(D_j \mid T)}{\Pr(D_j)}\right].$$

The expected information content of the message T, denoted I(T), is the weighted average of the information contents, the weights being the posterior probabilities Pr(Dj|T):

$$I(T) = \sum_{j=1}^{m} \Pr(D_j \mid T)\,\log\!\left[\frac{\Pr(D_j \mid T)}{\Pr(D_j)}\right] \qquad (9)$$

(Theil, 1967).¹¹ The quantity I(T) ≥ 0, and it is equal to zero if and only if Pr(Dj|T) = Pr(Dj) for all j = 1…m. Thus the expected information content of a message that leaves the prior probabilities unchanged is zero, which is reasonable.

¹⁰ Entropy can be thought of as characterizing either information or uncertainty, depending on our point of view. We know that just one of a number of events will occur, and we know the corresponding prior probabilities. Entropy quantifies how much information we will obtain, on average, from a message that tells us what actually happened. Alternatively, entropy characterizes the extent of our uncertainty prior to receipt of the message that tells us what happened.

¹¹ For continuity with later chapters, we could refer to the posterior probability distribution as the comparison distribution and the prior probability distribution as the reference distribution (see, for example, Section 3.1). However, we do not require explicit specification of a comparison distribution and a reference distribution here because, for practical purposes, the only calculations of expected information content of interest in this chapter are those in which the posterior probabilities form the comparison distribution and the prior probabilities form the reference distribution.



In the terminology of Kullback (1968), the quantity I(T) is a directed divergence (sometimes colloquially referred to as the Kullback-Leibler distance, although Solomon Kullback himself did not endorse this terminology; see Kullback, 1987). Cover and Thomas (2006) use the term relative entropy. Its application as a measure of diagnostic information is the subject of a clinically-orientated discussion by Benish (1999). In this context, relative entropy is a synonym for expected information content: it quantifies the expected information from a specific test result (Benish, 2002).

As an illustration, we use the Sclerotinia stem rot test (see Sections 1.3 and 1.5) with a threshold risk points score of 40. The prior probabilities are Pr(D1) = 0.16 and Pr(D2) = 0.84. The required posterior probabilities are given in Table 3. Working in natural logarithms, information contents in nits are calculated using Equation 6, and expected information contents in nits are then calculated from Equation 9 (see Table 3). For this implementation of the test, the expected information content of prediction T1 (need for treatment) is much larger than that of prediction T2 (no need for treatment) (Table 3).

Here we are interested, in essence, in characterizing the transitions from prior probabilities to posterior probabilities that result when we receive an indefinite message. Such transitions are characterized a posteriori by the information contents of predictions based on the application of a test. To obtain information content, we require both the prior probability and the posterior probability, but not (explicitly) the details of the test.

We continue to restrict our attention, for the moment, to the situation in which the true status of a crop, Dj (j = 1…m), is described in one of two categories (so m = 2). D1 denotes that the true status is a disease outbreak, or the need for treatment; D2 denotes that the true status is no outbreak, or no need for treatment. The predicted status, Ti (i = 1…n), is also described in one of two categories (so n = 2). T1 denotes a prediction of a disease outbreak, or the need for treatment; T2 denotes a prediction of no outbreak, or no need for treatment. We denote the prior probability of true status Dj as Pr(Dj) (j = 1…m), and the posterior probability of true status Dj given prediction Ti as Pr(Dj|Ti) (i = 1…n).

Table 3 provides a convenient starting point. As previously, we work in natural logarithms. For the example shown in the table, Pr(D1) = 0.16 and Pr(D2) = 1 − Pr(D1) = 0.84. We note from the body of Table 3A that when the test result is T1, Pr(D1|T1) > Pr(D1), and when the test result is T2, Pr(D1|T2) < Pr(D1). Similarly, when the test result is T2, Pr(D2|T2) > Pr(D2), and when the test result is T1, Pr(D2|T1) < Pr(D2). From the body of Table 3B we note that when the test result is correct (i = j), the information content ln[Pr(Dj|Ti)/Pr(Dj)] > 0; and when the test result is incorrect (i ≠ j), ln[Pr(Dj|Ti)/Pr(Dj)] < 0. This is all as we would expect for a useful test.
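The calculation of expected information content via Equation 9 can be sketched in Python as follows. In place of the full Table 3, the sketch uses the posterior probabilities quoted in the caption of Fig. 1 (Pr(D1|T1) = 0.475, Pr(D1|T2) = 0.05161); the function and variable names are arbitrary.

```python
import numpy as np

def expected_information_nits(posterior, prior):
    """Expected information content I(T) of Equation 9 (relative entropy,
    or Kullback-Leibler directed divergence), in nits."""
    q = np.asarray(posterior, dtype=float)
    p = np.asarray(prior, dtype=float)
    keep = q > 0                        # zero-posterior terms contribute zero
    return float(np.sum(q[keep] * np.log(q[keep] / p[keep])))

prior = [0.16, 0.84]                    # Pr(D1), Pr(D2)

# Posterior distributions for the two predictions (values from the Fig. 1 caption)
post_T1 = [0.475, 1 - 0.475]            # Pr(D1|T1), Pr(D2|T1)
post_T2 = [0.05161, 1 - 0.05161]        # Pr(D1|T2), Pr(D2|T2)

print(expected_information_nits(post_T1, prior))  # I(T1)
print(expected_information_nits(post_T2, prior))  # I(T2), much smaller than I(T1)

# A message that leaves the prior probabilities unchanged carries no information:
print(expected_information_nits(prior, prior))    # 0.0
```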



Fig. 1A shows these results in graphical form. To read the figure, note that the abscissa is calibrated in information units. The ordinate is calibrated for the prior probability Pr(D1) and the posterior probabilities Pr(D1|Ti), while the prior probability Pr(D2) and the posterior probabilities Pr(D2|Ti) are the complements, respectively 1 − Pr(D1) and 1 − Pr(D1|Ti). Four regions are created by the intersection of the lines Information = 0 and Probability = Pr(D1) (= 1 − Pr(D2)). The two regions for which Information > 0 correspond to the data representing correct test outcomes, and conversely, the two regions for which Information < 0 correspond to the data representing incorrect test outcomes. The two regions for which Probability > Pr(D1) correspond to data representing the test outcome T1, and conversely, the two regions for which Probability < Pr(D1) correspond to data representing the test outcome T2. Data such as those in the body of Table 3 then provide the coordinates of four points on the plot, one point in each region (Fig. 1A).

Still using Table 3 as a reference point, and working in natural logarithms, we now consider the cases (subjects categorized definitively as D1). For these, we have a pair of test results, T1 and T2, such that (from Equation 6):

$$\text{information content of test result} = \ln\!\left(\frac{\Pr(D_1 \mid \text{test result})}{\Pr(D_1)}\right).$$

For a given prior probability Pr(D1), the two test results lie on the curve:

$$\Pr(D_1 \mid \text{test result}) = \Pr(D_1) \cdot \exp(\text{information content of test result}),$$

shown in Fig. 1B. Similarly, considering the controls (subjects categorized definitively as D2), we have a pair of test results, T1 and T2, such that (again from Equation 6):

$$\text{information content of test result} = \ln\!\left(\frac{\Pr(D_2 \mid \text{test result})}{\Pr(D_2)}\right) = \ln\!\left(\frac{1 - \Pr(D_1 \mid \text{test result})}{1 - \Pr(D_1)}\right).$$

For a given prior probability Pr(D2) = 1 − Pr(D1), the two test results lie on the curve:

$$\Pr(D_1 \mid \text{test result}) = 1 - \left(1 - \Pr(D_1)\right) \cdot \exp(\text{information content of test result}),$$

also shown in Fig. 1B. The two curves intersect at the point where Information = 0 and Probability = Pr(D1) (Fig. 1B). The positions of the two curves in Fig. 1B are governed by the prior probability Pr(D1).¹² In turn, this provides the basis for the calculation of an expected information curve, shown in Fig. 1C and discussed further in Section 2.3.

¹² In addition, if we refer to a particular binary predictor, we note that the horizontal distance from the data point on the ‘cases’ curve to the data point on the ‘controls’ curve in the region of the graph for which Probability > Pr(D1) (i.e., points b and a, respectively, in Fig. 1B, corresponding to data representing the test outcome T1) is equal to ln(LR1). The horizontal distance from the data point on the ‘controls’ curve to the data point on the ‘cases’ curve in the region of the graph for which Probability < Pr(D1) (i.e., points d and c, respectively, in Fig. 1B, corresponding to data representing the test outcome T2) is equal to −ln(LR2). For a numerical example (referring again to Fig. 1), 1.088 − (−0.470) = 1.558 = ln(LR1), so exp(1.558) = 4.75 = LR1; and 0.121 − (−1.131) = 1.252 = −ln(LR2), so exp(−1.252) = 0.286 = LR2 (see Table 3 and Section 1.5). Recall that we would like tests with LR1 as large as possible and LR2 as small as possible (while remaining > 0) (see Section 1.4). Diagrammatically, these conditions translate into a general preference for tests with longer horizontal distances b − a and d − c.
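The coordinates quoted in the caption of Fig. 1 and the likelihood ratios in footnote 12 can be reproduced from the two curve equations above; a minimal Python sketch follows (variable and function names are ours).

```python
import numpy as np

pr_d1 = 0.16  # prior probability Pr(D1), as in Table 3 and Fig. 1

def cases_curve(info):
    """Pr(D1|test result) on the 'cases' curve of Fig. 1B."""
    return pr_d1 * np.exp(info)

def controls_curve(info):
    """Pr(D1|test result) on the 'controls' curve of Fig. 1B."""
    return 1.0 - (1.0 - pr_d1) * np.exp(info)

# Points b and c (cases) and a and d (controls), using the information
# contents quoted in the Fig. 1 caption; each pair recovers the same ordinate.
print(cases_curve(1.088), controls_curve(-0.470))   # both ≈ 0.475 (outcome T1)
print(cases_curve(-1.131), controls_curve(0.121))   # both ≈ 0.052 (outcome T2)

# Horizontal distances between the curves recover the likelihood ratios
# discussed in footnote 12.
ln_LR1 = 1.088 - (-0.470)        # distance b - a
ln_LR2 = -(0.121 - (-1.131))     # minus the distance d - c
print(np.exp(ln_LR1), np.exp(ln_LR2))  # ≈ 4.75 and ≈ 0.286
```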



FIG. 1. Information content (calibrated in nits). Calculations are based on the data given in Table 3. Four regions are created by the intersection of the lines Information = 0 and Probability = Pr(D1) (= 1 − Pr(D2)) = 0.16.
A. Point (a) is (ln[Pr(D2|T1)/Pr(D2)], 1 − Pr(D2|T1)) or (−0.470, 0.475). Point (b) is (ln[Pr(D1|T1)/Pr(D1)], Pr(D1|T1)) or (1.088, 0.475). Point (c) is (ln[Pr(D1|T2)/Pr(D1)], Pr(D1|T2)) or (−1.131, 0.05161). Point (d) is (ln[Pr(D2|T2)/Pr(D2)], 1 − Pr(D2|T2)) or (0.121, 0.05161).
B. Points (b) and (c) refer to the cases (subjects definitively categorized as D1) and lie on the red curve specified by Pr(D1|test result) = Pr(D1)·exp(information content of test result). Points (a) and (d) refer to the controls (subjects definitively categorized as D2) and lie on the blue curve specified by Pr(D1|test result) = 1 − (1 − Pr(D1))·exp(information content of test result).
C. The solid black line is the expected information content curve (see also Fig. 3).


