Statistics using Excel 2007
Scatterdiagram Age and Income (*€100) ) 80 0 0 1 (* e m 70 co In 60
50
40 Income (*€100) Linear (Income (*€100)) 30
20
10
0 0
10
20
30
40
50
60
70
Age
(Quantitative Methods course)
Johan van Berkel
Statistics using Excel 2007 (Quantitative Methods course)
Johan van Berkel
Foreword
Foreword To get more integration in the first year at the University of Applied Sciences Utrecht between the courses in Dutch and the courses in English it has been decided to translate the reader ‘Statistiek met Excel’, that has been already used for over ten years, in English. This new edition was written specifically for performing statistical calculations with Excel 2007. The previous edition was amended as follows: • all screen shots are from Excel 2007 rather than Excel 2003; • the work in Excel 2007 is subdivided into tasks as opposed to menus, and the descriptions of how to perform tasks have been fully updated. Any comments or criticism are welcome. E-‐mail: johan.vanberkel@hu.nl Johan van Berkel Enschede, June 2009
3
Statistics using Excel 2007
4
Table of contents
Table of contents 0
The Excel window ........................................................................................................................ 9
0.1
Introduction ........................................................................................................................................ 9
0.2
Starting Excel ....................................................................................................................................... 9
0.3
Excel window elements ....................................................................................................................... 9
0.3.1
Title Bar ............................................................................................................................................. 10
0.3.2
The Office Button .............................................................................................................................. 11
0.3.3
Quick Access Toolbar ........................................................................................................................ 11
0.3.4
Knoppen voor instellen venster ........................................................................................................ 12
0.3.5
The Ribbon ........................................................................................................................................ 13
0.3.6
Name Box .......................................................................................................................................... 14
0.3.7
The Formula Bar ................................................................................................................................ 14
0.3.8
Scroll bars .......................................................................................................................................... 14
0.3.9
Worksheets ....................................................................................................................................... 14
0.3.10
The status area .................................................................................................................................. 15
0.3.11
Window view buttons ....................................................................................................................... 15
1
Introduction .............................................................................................................................. 17
1.1
An example ....................................................................................................................................... 17
1.2
Statistical research ............................................................................................................................ 18
1.3
Problem description .......................................................................................................................... 18
1.4
The survey ......................................................................................................................................... 19
1.5
The data file ...................................................................................................................................... 20
1.6
Qualitative and quantitative data ..................................................................................................... 21
2
Tables ....................................................................................................................................... 23
2.1
Introduction ...................................................................................................................................... 23
2.2
Tables of qualitative data .................................................................................................................. 23
2.2.1
Simple pivot tables ............................................................................................................................ 24
2.2.2
Multiple pivot tables ......................................................................................................................... 31
2.2.3
Relative tables ................................................................................................................................... 33
2.2.4
Filters ................................................................................................................................................ 37
2.3
Tables of quantitative data ............................................................................................................... 40
2.3.1
Cumulative tables ............................................................................................................................. 43
2.3.1.1
Summing data from a pivot table ..................................................................................................... 44
2.3.1.2
The FREQUENCY function ................................................................................................................. 46
2.3.1.3
Data analysis ..................................................................................................................................... 51
2.4
Entering a change .............................................................................................................................. 57
2.4.1
Changes and pivot tables .................................................................................................................. 57
2.4.2
Changes and the FREQUENCY function ............................................................................................. 58
2.4.3
Changes and data analysis ................................................................................................................ 58
2.4.4
When to use pivot tables, the FREQUENCY function, or data analysis ............................................. 58
2.5
Exercises ............................................................................................................................................ 59
5
Statistics using Excel 2007
3
Charts ....................................................................................................................................... 61
3.1
Introduction ...................................................................................................................................... 61
3.2
Charts of qualitative data .................................................................................................................. 62
3.2.1
Column charts ................................................................................................................................... 62
3.2.2
Bar chart ........................................................................................................................................... 70
3.2.3
Pie chart ............................................................................................................................................ 72
3.3
Charts of quantitative variables ........................................................................................................ 75
3.3.1
Histogram ......................................................................................................................................... 76
3.3.2
Frequency curve ............................................................................................................................... 82
3.3.3
Cumulative line chart ........................................................................................................................ 84
3.3.4
Time series charts ............................................................................................................................. 88
3.3.4.1
Short time series ............................................................................................................................... 89
3.3.4.2
Long time series ................................................................................................................................ 92
3.3.5
Scatter chart ..................................................................................................................................... 93
3.4
Changes ............................................................................................................................................. 97
3.5
Exercises ........................................................................................................................................... 98
4
Distribution characteristics ....................................................................................................... 99
4.1
Introduction ...................................................................................................................................... 99
4.2
Distribution characteristics of distinct observations ........................................................................ 99
4.2.1
Measures of location ...................................................................................................................... 100
4.2.2
Measures of dispersion ................................................................................................................... 101
4.2.3
Measures of shape .......................................................................................................................... 104
4.3
Distribution characteristics of distinct observations with Excel ..................................................... 106
4.3.1
Functions ......................................................................................................................................... 106
4.3.2
Data analysis ................................................................................................................................... 109
4.3.3
Pivot tables ..................................................................................................................................... 112
4.4
Characteristics for frequency distributions with classes ................................................................. 114
4.4.1
Measures of location for frequency distributions .......................................................................... 114
4.4.2
Measures of dispersion for frequency distributions ....................................................................... 115
4.5
Characteristics for frequency distributions with classes with Excel ............................................... 117
4.6
Exercises ......................................................................................................................................... 120
5
Probabilities ............................................................................................................................ 123
5.1
Introduction .................................................................................................................................... 123
5.2
Calculating with probabilities ......................................................................................................... 124
5.3
Probabilities via pivot tables ........................................................................................................... 126
5.4
Exercises ......................................................................................................................................... 134
6 6.1
Discrete probability distributions ............................................................................................ 135 Introduction .................................................................................................................................... 135
6.2
Binomial probability distribution .................................................................................................... 135
6.2.1
Probabilities of binomial distributions with Excel ........................................................................... 137
6.3
Hypergeometric probability distribution ........................................................................................ 143
6.3.1
Probabilities of hypergeomtric distributions with Excel ................................................................. 144
6.4
Negative binomial probability distribution ..................................................................................... 147
6.4.1
Probabilities of negativ binomial distributions with Excel .............................................................. 148
6.5
Poisson probability distribution ...................................................................................................... 152
6.5.1
Probabilities of Poisson distributions with Excel ............................................................................ 152
6.6
Exercises ......................................................................................................................................... 155
6
Table of contents
7
Continuous probability distributions ....................................................................................... 157
7.1
Introduction .................................................................................................................................... 157
7.2
Normal probability distribution ...................................................................................................... 157
7.2.1
Probabilities of normal distributions with Excel ............................................................................. 160
7.3
Exponential probability distribution ............................................................................................... 168
7.3.1
Probabilities of exponential distributions with Excel ...................................................................... 169
7.4
Exercises .......................................................................................................................................... 171
8
Simple linear regression and correlation analysis .................................................................... 173
8.1
Introduction .................................................................................................................................... 173
8.2
Regression models .......................................................................................................................... 173
8.3
The linear regression model ............................................................................................................ 175
8.4
Explained and unexplained variation .............................................................................................. 176
8.5
Lineair regression with Excel ........................................................................................................... 178
8.5.1
Lineair regression with SLOPE and INTERCEPT functions ............................................................... 178
8.5.2
Lineair regression analysis by mean of a chart: scatter chart ......................................................... 181
8.5.3
Lineair regression analysis by means of Data Analysis: Regression ................................................ 184
8.6
Correlation analysis ......................................................................................................................... 187
8.6.1
Correlation analysis with Excel ........................................................................................................ 187
8.7
Lineair regression with time series ................................................................................................. 188
8.8
Nonlinear regression models .......................................................................................................... 191
8.9
Forecasts ......................................................................................................................................... 192
8.10
Exercises .......................................................................................................................................... 193
9
Time series analysis ................................................................................................................. 195
9.1
Introduction .................................................................................................................................... 195
9.2
The components of a time series .................................................................................................... 195
9.2.1
Trend ............................................................................................................................................... 196
9.2.2
Economic cycle influence ................................................................................................................ 196
9.2.3
Seasonal influence .......................................................................................................................... 196
9.2.4
Irregular component ....................................................................................................................... 197
9.2.5
Break in trend ................................................................................................................................. 197
9.3
Trend via moving average ............................................................................................................... 197
9.3.1
Moving averages with an odd number of periods .......................................................................... 198
9.3.2
Moving average with an even number of periods .......................................................................... 198
9.3.3
Moving average via data analysis ................................................................................................... 199
9.3.4
Moving average via formulas .......................................................................................................... 204
9.4
Trend via lineair regression (least squares method) ....................................................................... 206
9.5
Seasonal patterns and cyclic movements ....................................................................................... 208
9.5.1
The additive model ......................................................................................................................... 208
9.5.2
The multiplicative model ................................................................................................................ 209
9.6
Forecasting ...................................................................................................................................... 211
9.6.1
Forecasting with the additive model .............................................................................................. 212
9.6.1
Forecasting with the multiplicative model ...................................................................................... 212
9.7
Seasonal adjustment of a time series ............................................................................................. 213
9.8
Exercises .......................................................................................................................................... 214
7
Statistics using Excel 2007
Appendix A: Data file “Fiction 2000” ................................................................................................................ 217 Appendix B: Answers ....................................................................................................................................... 223 Answers Chapter 2: Tables ........................................................................................................................................ 223 Answers Chapter 3: Charts ........................................................................................................................................ 225 Answers Chapter 4: Distribution characteristics ....................................................................................................... 228 Answers Chapter 5: Probabilities .............................................................................................................................. 230 Answers Chapter 6: Discrete probability distributions ............................................................................................. 232 Answers Chapter 7: Continuous probability distributions ........................................................................................ 233 Answers Chapter 8: Simple linear regression and correlation analysis .................................................................... 234 Answers Chapter 9: Time series analysis .................................................................................................................. 237 Appendix C: Specimen examination questions ................................................................................................. 239 Appendix D: Excel files ..................................................................................................................................... 253
8
0
The Excel window
0.1
Introduction
0.2
Starting Excel
0 The Excel window
This introductory chapter briefly explains how to start Excel 2007 and mentions the most important elements of the Excel window. If you have already had an opportunity to work with Excel 2007 you may skip this chapter. After starting Windows, you can start Excel by pressing the Start button and All Programs, with a desktop shortcut, or using the Office Taskbar. The Excel 2007 shortcut icon is as follows:
Figure 0.1: Excel 2007 shortcut Another way of starting Excel 2007 is to open an Excel workbook. An Excel workbook file has the following icon:
Figure 0.2: Excel 2007 workbook icon
0.3
Excel window elements
When you start Excel 2007, most of the window is filled with a grid of cells (see overleaf). This grid is known as a worksheet and it consists of 16,384 columns and 1,048,576 rows. The columns are identified by letters from A to Z, followed by AA, AB, and so on. The rows are numbered from 1 to 1,048,576.
9
Statistics using Excel 2007
The other window elements are: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11.
Title Bar; Office Button; Quick Access Toolbar; Window sizing buttons; Ribbon; Name Box; Formula Bar; Scroll bars; Worksheets; Status area; Window view buttons.
5 6
7
8
8
9 1 0 Figure 0.3: The Excel window
1 1
0.3.1
Title Bar
The Title Bar displays the name of the file, Book1, and the name of the program you are working with, which is Microsoft Excel (see Figure 0.4). Figure 0.4: The Title Bar
10
0.3.2
0 The Excel window
The Office Button
If you click the Office Button (1) (at the top left of the window) the Office menu (2) will open to allow you to alter Excel settings (3) and close Excel (4) (see Figure 0.5). The Office menu can be used for opening new or existing Excel files, and saving, printing, or closing open files. You can view and change Excel settings after clicking Excel Options. Next to the Excel Options button is the button to close Excel. 1
2
3
4
Figure 0.5: The Office button and the Office menu
0.3.3
Quick Access Toolbar
Next to the Office Button is the small Quick Access Toolbar, which by default contains the buttons Save, Undo and Redo. Figure 0.6: Quick Access Toolbar You can add buttons for commands you often use to this toolbar. You can right-‐click a button with the mouse to add it to the Quick Access Toolbar. You can also add any button by right-‐clicking the Quick Access Toolbar and selecting Add. Similarly, you can remove any you no longer want by selecting Delete.
11
Statistics using Excel 2007
Figure 0.7: Window for customizing the Quick Access Toolbar
0.3.4
Knoppen voor instellen venster
The buttons for altering the Excel window size are at the top right. You can Minimize (1), Restore Down or Maximize (2), or Close (3) the Excel window. 1 2 3 Figure 0.8: Window sizing buttons
12
0.3.5
0 The Excel window
The Ribbon
Most of the upper part of the Excel window is occupied by the Ribbon. The Ribbon consists of: 1. Tabs to the right of the Office Button; 2. A set of buttons, depending on which tab is currently open; 3. Start icons for dialogs, which also depends on which tab is currently open; 4. Sizing buttons for the currently open workbook window, and the Excel Help button. 1
4
2
3 Figure 0.9: The Ribbon When you open a tab the group of commands most likely to be used with the type of activity concerned are displayed. For instance, the Insert tab groups together the Tables, Illustrations, Charts, Links and Text commands. Some groups have an icon (3) at the bottom right to bring up a dialog containing additional commands.
Figure 0.10: The Ribbon with the Insert tab open
Special contextual tabs may appear when working with certain Excel objects. The Ribbon can then contain commands specific to the object concerned. You will encounter contextual tabs of this kind mainly when editing tables, charts and illustrations. The contextual tab will disappear as soon as you stop editing a table, chart or illustration.
Figure 0.11: Contextual tabs for pivot table
13
Statistics using Excel 2007
0.3.6
Name Box
The Name Box is on the left below the Ribbon. It contains the address or name of the active cell in Excel. You can also enter a cell address into the Name Box to select the cell.
Figure 0.12: The Name Box
0.3.7
The Formula Bar
The Formula Bar is to the right of the Name Box. It displays the active cell contents. You may edit a cell’s contents directly in the worksheet, or, after selecting the cell to make it active, in the Formula Bar. Figure 0.13: The Formula Bar
0.3.8
Scroll bars
There are horizontal and vertical scroll bars to help navigate an Excel worksheet.
Figure 0.14: The horizontal scroll bar
0.3.9
Worksheets
Tabs for the various worksheets that belong to the open workbook are shown at the bottom left. By default, Excel loads three worksheets. You can open a new worksheet by clicking the button next to Sheet3. The buttons to the left of Sheet1 are for navigating the various worksheets, which is useful if there is not enough room in the bar to display all the worksheet tabs. You can add a new worksheet by clicking the button next to Sheet3. Figure 0.15: The worksheets
14
0 The Excel window
0.3.10 The status area
The status area is located underneath the worksheets. You can start to enter data into Excel when the text Ready is displayed here. The status then changes to Enter. If you start to modify cell data, the status changes to Edit. You will not be able to use all the buttons on the Ribbon while the status is Enter or Edit. Figure 0.16: The status area
0.3.11 Window view buttons
The window view buttons are at the bottom right. The leftmost button is for Normal View, the second is the Page Layout button, and the third is the Page Break Preview button. You can zoom in or out with the zoom bar, which normally shows 100%. You can click the setting and type in a different value, or drag the slider with the mouse. Figure 0.17: Window view buttons
15
Statistics using Excel 2007
16
1
1 Introduction
Introduction
1.1
An example Youth form a promising market for free newspapers
‘We already had our suspicions, but these figures confirm that a relatively large number of young and highly qualified people read free newspapers.’ Buter points out that this is a group of consumers that is now hard to reach, such as through expensive advertising slots in cinema films. ‘I expect we will be doing more business with these papers.’ Metro and Spits have made inroads in the advertising market with substantial discounts. Last summer advertisers supported both newcomers in what they saw as a welcome new initiative in the newspaper market. Many advertisers have been adopting a wait-‐and-‐see attitude in recent months, pending the arrival of independent SUMMO figures. These figures now confirm the advertisement sellers’ claims. ‘In terms of reach we are already the number four newspaper in the country’, Metro editor-‐in-‐chief Jelle Leenes concludes with satisfaction. ‘And our performance among young people is even better.’ Leenes is convinced that a free newspaper is compatible with the busy, zapping type of person who has grown accustomed to not paying on Internet. ‘The strange thing is that the content hardly targets young people at all.’ Leenes expects other newspapers and some magazines to take ‘a substantial hit’ this year. The SUMMO figures justify further rises in Metro advertising tariffs. ‘We started with high discounts, but our tariffs have already increased in the past quarter. We want this trend to continue, and these figures will help. (Representative random sample among 16,098 people) Volkskrant, 11-‐04-‐2000
The free newspapers Metro and Spits would appear to be particularly popular among young and highly qualified Dutch people with an above-‐average income. None of the mainstream daily newspapers scores as well among these readers, which are an interesting target group for advertisers. This is one of the conclusions in the annual SUMMO reach survey released this week. The results confirm the commercial strategy of the Swedish publisher Modern Times Group (Metro) and De Telegraaf (Spits). They are confident that a free newspaper can be profitable because so many train passengers are students or workers in their twenties or thirties. It is hard for advertisers to selectively reach this group of consumers. Figures for the second half of 1999 reveal that every edition of Metro is read by 469 thousand passengers. The corresponding figure for Spits is 448 thousand. The readership of each newspaper includes 227 thousand Dutch people in the 13 to 35 age range. This is an enormous number in a world where this group tends to be underrepresented. Only the Volkskrant reaches the same number of young people as in the random sample. Furthermore, Metro and Spits are read respectively by 106 thousand and 90 thousand Dutch people from the highest social class. This figure is less impressive, but important nonetheless for advertisers. The hardest hit is the Volkskrant, whose reach among young readers has plummeted by 21 per cent. De Telegraaf has seen a loss of 10 per cent of readers in their twenties and thirties. All daily newspapers together reached 6 per cent fewer readers compared with the second half of 1998. Only some of this decline can be blamed on the two free newspapers. ‘They have found a gap in the market’, according to Remon Buter, an advertisement specialist with Initiative Media. Various statements are made in the above article: • Metro is read by 469 thousand passengers; • the corresponding figure for Spits is 448 thousand; • the readership of each newspaper includes 227 thousand Dutch people in the 13 to 35 age range; • Metro and Spits are read respectively by 106 thousand and 90 thousand Dutch people from the highest social class; • the Volkskrant’s reach among young readers has plummeted by 21 per cent; • De Telegraaf saw a loss of 10 per cent of readers in their twenties and thirties; • all daily newspapers together reached 6 per cent fewer readers
17
Statistics using Excel 2007
Statistics is about numbers. The numbers in the statements are referred to as statistics or statistical indicators. These numbers help marketing and other managers of businesses like ‘Metro’ to make the ‘right’ marketing decisions: ‘The advertising tariffs can rise further’.
1.2
Statistical research
However, statistical investigations for market research or some other purpose involve much more than numbers alone. They can be divided into the following phases: 1. preparation; 2. gathering and checking data; 3. data processing; 4. drawing conclusions; 5. publishing and presenting results. This basic statistics module focuses on data processing. The other investigation phases are covered in the Investigation Methods and Market Research modules. 1.3 Problem description There has been a conspicuous decline in daily newspaper readership in recent years. The decline amounted to 6 per cent in April 2000 compared with the 2nd half of 1998. The largest falls were:
* Volkskrant: * Telegraaf:
21 per cent; 10 per cent.
The decline is suspected to be attributable to the free distribution of Metro and Spits to passengers at railway stations. A subscription to a national daily newspaper might be irrelevant in these times of ‘free’ Internet, on which national daily newspapers are also published. It was decided to survey train passengers in the morning peak period (6am – 9am). The objective was to ascertain whether the free distribution of Metro and Spits had any influence on reading a national daily newspaper. To find the answer, the objective was broken down into the following items of information about the morning commuters:
1. 2. 3. 4. 5.
the length of the journey time; the number of days the journey was made in an average week; whether they ever read Metro or Spits, and what they thought of them; whether they subscribed to a national daily newspaper, or had cancelled a subscription in the past year; whether the above items of information depended on gender, age or income.
The above led to the following questionnaire.
18
1 Introduction
1.4
The survey The following survey was taken in the trains during the morning rush hours (between 6 am and 9 am) in order to gather information about the free morning papers (Metro and Spits): 1 1. At which rail station did your train journey start? ……….…………………………………. 1 2. At which rail station did your train journey end? …………………………………………… 3. On average, how many days a week do you travel by train? …………… days 4. Have you ever read ‘Metro’? o No o Yes 5. What is your opinion about ‘Metro’? o No opinion o Very bad o Bad o Reasonable o Good o Very good 6. Have you ever read ‘Spits’? o No o Yes 7. What is your opinion about ‘Spits’? o No opinion o Very bad o Bad o Reasonable o Good o Very good 8. Do you have a subscription to a daily paper? o Algemeen Dagblad o NRC o Telegraaf o Trouw o Volkskrant o Other o None 9. Have you cancelled a subscription to a daily paper in the past year, because of the free distribution of ‘Metro’or ‘Spits’? o No o Yes
1
The travelling time by train was deduced from the reported start and end points of the train journeys by referring to the timetable. The deduced time was then incorporated in the data file as travelling time (in minutes).
19
Statistics using Excel 2007
10. What is your gender? o Male o Female 11. How old are you? ……. Years 12. What is your net monthly income? € ……… 1.5 The data file
Part of the data gathered (the data file) is shown in Table 1.1. The complete file is given as Appendix 1 at the end of the syllabus. Table 1.1: Data from a random sample of 300 train passengers in the morning peak period Respondent Gender Age (years) Travelling time per Travel days per Income Daily number train (minutes) train per week (* € 100) newspaper 1 2 3 4 5 6 7 8 9 10 11 12
2 2 2 1 1 1 1 1 1 1 1 1
21 27 18 20 26 60 42 51 19 21 22 55
35 29 23 32 14 104 58 55 44 5 36 67
3 5 5 4 5 3 5 5 4 5 5 4
3 14 5 3 18 27 23 32 2 10 3 21
7 3 7 5 5 2 5 3 7 6 5 6
Source: ‘Fiction 2000’
The following three aspects of this survey of 300 train passengers can be distinguished. 1. Units of analysis. The 300 surveyed train passengers in the morning peak period are the units of analysis. Each surveyed train passenger was given a respondent number (see column 1). It would have taken too much time or money to interview all the train passengers in the morning peak period (the population). For the purpose of this investigation it was therefore decided to interview part of the population: 300 train passengers (the random sample). 2. Variables or characteristics. The head of each column contains the name of a variable. A variable is a quantity that can have a different value for each investigation subject. For instance, the ‘Gender’, ‘Age’, ‘Travelling time’, etc. of each surveyed train passenger was noted. 3. Observed value or score of a unit for a variable. When someone is of the female gender, the value of that person’s (respondent’s) ‘Gender’ variable is female. The table (data file) often contains only numbers (codes) rather than words. One reason is that it is much faster to type ‘2’ than ‘female’, besides which it makes it easier to process the data. A data file containing codes can be understood only with an accompanying code book (which defines the various codes). The code book in Table 1.2 applies to the file in Appendix 1.
20
Table 1.2: Variable Gender
1 Introduction
Code book for the variables in the survey of 300 train passengers in the morning peak period
Daily newspaper
Cancelled Metro Opinion Metro
Spits Opinion Spits
Code 1 2 1 2 3 4 5 6 7 0 1 0 1 0 1 2 3 4 5 0 1 0 1 2 3 4 5
Definition Male Female Algemeen Dagblad NRC Telegraaf Trouw Volkskrant Other None No Yes No Yes No opinion Very bad Bad Reasonable Good Very good No Yes No opinion Very bad Bad Reasonable Good Very good
Source: ‘Fiction 2000’
The first train passenger surveyed (respondent number 1) was therefore of the female gender (code 2), 21 years old, has a 35-‐minute train journey, commutes by train 3 days a week, has an income of €300, has not cancelled (code 0) a subscription to a daily newspaper in the past year, does sometimes read Metro (code 1), which she rates as good (code 4), and also occasionally reads Spits (code 1), which she also rates as good (code 4). 1.6 Qualitative and quantitative data How the data can be manipulated into charts and characteristics and so on, and what kind of statistical analysis is appropriate for a given variable, depends on whether qualitative or quantitative data (values) are involved. Qualitative variables can generally assume a limited number of values (words or labels) and are measured on a nominal or ordinal scale.
21
Statistics using Excel 2007
Example Variable Measurement level Possible values Gender nominal male, female Opinion ordinal very bad, bad, reasonable, good, very good When a measure is nominal the values serve only to distinguish different attributes. For the Gender variable, male is different from female, for the Daily newspaper variable, Algemeen Dagblad is different from NRC, Telegraaf, Trouw, and so on. When a measure is ordinal, the values serve both to distinguish different attributes and arrange the attributes in a 2 logical sequence. The attributes of the Opinion variable can be ranked from very bad to very good . Quantitative (numerical) variables are always measured in numbers, and the measurement level is interval or ratio. Example Variable Measurement level Possible values Temperature in °C interval …., -‐3, -‐2, -‐1, 0, 1, 2, 3, ….. Number of travelling days per week ratio 0, 1, 2, 3, 4, 5, 6, 7 In interval measurement, you can rank both the attributes in sequence and attach meaning to the difference between two attributes. For example, the difference between 10 °C and 14°C is 4°C. Interval scales have no absolute zero, and are relatively uncommon in practice. Some familiar examples are temperature (with the exception of degrees Kelvin), calendar years, and shoe sizes. Ratio measurement gives meaning to the difference between two values and also to how many times greater one value is than another. A person who commutes by train 4 days a week travels twice as often as someone who commutes 2 days a week. A ratio measurement has an absolute zero. Data that are inherently on ratio level are often measured on ordinal level, since people may be more willing to provide the less detailed information involved. For example: What is your net monthly income? € …………. This question measures the variable ‘Income’ on ratio level. Someone entering €3500 would conspicuously have an income 5 times higher than someone who enters €700. What is your net monthly income? 1. €0 -‐< €1000 2. €1000 -‐< €2000 3. €2000 -‐< €3000 4. €3000 -‐< €4000 5. > €4000 This question measures the variable ‘Income’ on ordinal level. While it is possible to rank the different values from low to high, it is impossible to state how many times higher the income of someone who answered 4 is than that of someone who answered 2. People would be more inclined to answer the second question than the first. However, a lower level of measurement will preclude certain kinds of analysis and restrict the conclusions that can be drawn.
2
The survey also included an attribute no opinion, which makes the level of measurement nominal, because no opinion cannot be ranked logically with the other attributes. 22
2
2 Tables
Tables
2.1
Introduction
2.2
Tables of qualitative data
You can read from a data file how each respondent answered the questions, but you have no immediate overall view of all respondents’ scores for a given variable. For example while you can read each individual respondent’s opinion about Metro in Appendix I, it is almost impossible to see at a glance what all respondents thought about Metro. Do people generally think Metro is good, reasonable, or poor, or are opinions actually divided? Therefore, one of the first operations to be performed on the gathered data (possibly in a data file) is to produce one or more tables. What kinds of table to produce depends on what statistical information you want to convey in the report of the investigation, and the nature of the conclusions you want to draw. Qualitative variables can generally assume only a limited number of attributes and are measured on a nominal or ordinal level. Example Variable Scale level Possible values Gender nominal male, female Opinion ordinal very bad, bad, reasonable, good, very good When constructing a frequency table (that shows how often a given value occurs) each possible attribute corresponds with a category (group). The Gender variable therefore has two corresponding groups (Male and Female). Constructing a frequency table for the Gender variable involves counting how many respondents were male and how many were female. The Daily newspaper variable in the data file (see code book Table 1.2) has seven corresponding groups. Constructing a frequency table for the Daily newspaper variable involves counting how many respondents said they had a subscription to Algemeen Dagblad, how many to NRC, how many to Telegraaf, and so on. A table is a compact way of presenting figures, by arranging them in (horizontal) rows and (vertical) columns. A position in a table that can be occupied by a number is known as a cell. A typical table has a column to the left of the data cells and a row of column heads above, which identify the attributes of the variables concerned. The variables are identified above the additional column to the left and the column heads. There may also be a total row (with the totals of each column) and a total column (with the totals of each row) if that would provide useful information. It is a fairly simple matter to compose tables of this kind in Excel. What you have to add to the table yourself are a number and a descriptive title for the contents of the table. If applicable, the unit of analysis may be part of the title or the caption. You can add an acknowledgement under the table to declare the source of the data. Table 2.1 shows the various components of a table.
23
Statistics using Excel 2007
It is possible for some of the cells in a table to remain unoccupied. Ascertain the reason and then look up the appropriate special symbol to enter in the cell. Table 2.2: Symbols to be used in a table and their definitions Symbol -‐ empty (blank cell) . x * 0 (0.0)
Definition Null This figure is nonexistent (cannot occur) No data available This figure is confidential This figure is provisional This figure is less than the half of the unit used
Observe the following conventions when referring to calendar years in titles and acknowledgements. 2000 – 2001 : 2000 to 2001, inclusive; 2000 / ‘01 : the financial year, academic year, etc. starting in 2000 and ending in 2001 2000 / 2001 : the average for 2000 and 2001
2.2.1
Simple pivot tables
You can use a pivot table in Excel to construct a frequency table where the data were entered as qualitative values. Suppose you want to construct frequency tables with the data from the file introduced in Chapter 1 (Fiction 2000) for the variables Gender and Daily newspaper (subscriptions to daily newspapers).
24
2 Tables
Method 1. Open in Excel the workbook you want to use as a source for the tables. To do so, click the Office Button (at the top left of the window) and then Open.
2.
Figure 2.1: Opening an Excel workbook. Find the disk and folder where the workbook file was saved. Click the workbook and then Open (Fiction 2000 s on the Hogeschool network at: K:\FEM\Excel\Statistics ursing Excel 2007\Fiction2000.xlsx) Open the Insert tab, click PivotTable and select PivotTable in the drop-‐down list.
Figure 2.2: Opening pivot table dialog.
25
Statistics using Excel 2007
3. In the dialog, specify the range of the data to use in the table. Excel provides as default all the data in the currently open worksheet. The default range for the opened Fiction 2000 file is Data!$A$1:$L$301. Next choose where to place the table. Since the opened Excel workbook currently contains only one worksheet (which holds the gathered data), you should select New Worksheet, and then click OK.
Figure 2.3: The Create PivotTable dialog. 4. A new worksheet (Sheet1) opens, where you can see the future location of the table (starting in cell A3). You can also see the PivotTable Field List dialog and observe that the ribbon now has contextual tabs with the appropriate options for creating a pivot table.
Figure 2.4: The initial table creation window
26
2 Tables
For example, if you wanted to create a table of how many men and women contributed to the survey (Gender variable), you would drag the Gender variable (by left-‐clicking Gender and holding the mouse button down) from the field list to the Row Labels box (which specifies the column of stubs that will later contain the names of the different scores (Male and Female). Drag the Gender variable again from the field list, but this time to the ∑ Values box.
Figure 2.5: Dragging the Gender variable to the ∑ Values box
Figure 2.6: The Gender table so far 5. You can see that the table that appears displays the arithmetic sum rather that the frequency (count). Since Female was coded with the value 2, the cell following 2 does not contain a count of the number of women (= 77), but the sum of the codes, 154 (= 77*2). The cell values therefore have to be changed from Sum to Count. Click Sum of Gender and then Value Field Settings.
27
Statistics using Excel 2007
Figure 2.7: Opening the Value Field Settings dialog
Next, in the Value Field Settings dialog, change ‘Summarize by’ from Sum to Count
Figure 2.8: Changing the ‘Summarize value field by’ setting The table is now as follows.
28
Figure 2.9: Table of Gender with the correct numbers
2 Tables
6. You can find what ‘1’ and ‘2’ mean in the code book (Table 1.2). A table is complete only when what it displays is clear. You therefore have to change ‘1’ to ‘Male’ by selecting cell A5 (on Sheet1) and typing ‘Male’, and change ‘2’ to ‘Female’ by selecting cell A6 and typing ‘Female’. An additional advantage of having the correct labels is that they are automatically included in any charts (see Chapter 3) you create with the table.
Figure 2.10: Final Gender table You can simply copy this table as a figure in a report created using a word processor (e.g. Word 2007). To do this, select the cell range A3:B7. Open the Home tab and the click Copy in the Clipboard group (alternatively, you can type CTRL+C on the keyboard).
29
Statistics using Excel 2007
Figure 2.11: Copying a table In the word processor, open the Home tab and click Paste in the Clipboard group (alternatively, you can type CRTL+V on the keyboard).
Figure 2.12: Pasting into a Word document Be careful to observe the requirements that apply to a table: 1. it must have a number, and a caption that describes the table contents; 2. it should have an acknowledgment (and must have one where desk research is involved). After some editing, a table will have the following appearance in Word 2007. Table 2.3: Gender of the respondents Gender Male Female Grand Total Source: ‘Fiction 2000’
Figure 2.13: The pivot table in Word
30
Total 223 77 300
2.2.2
2 Tables
Multiple pivot tables
Pivot tables are frequently used for creating contingency tables to cross-‐tabulate two or more variables. The values of one variable are then listed in the stub column, and the possible values of another variable become the heads of the data columns. It is usual to expand the independent variable in the column heads and the dependent variable as rows in the stub column. For example, you could cross-‐tabulate the number of subscriptions to the various daily newspapers with gender. The Gender variable is then the independent variable and the Daily newspaper variable the dependent variable. The choice of daily newspaper might depend on gender. In other words, there could be differences between men and women in the numbers of the various newspaper subscriptions. Method 1. Open the Data worksheet in the ‘Fiction 2000’ Excel workbook and select cell A1.
2. Open the Insert tab, click PivotTable in the Tables group and then select PivotTable.
3. In the dialog, specify the range of the data to use in the table. Excel provides as default all the data in the currently open worksheet. The default range for the opened Fiction 2000 file is Data!$A$1:$L$301. Next choose where to place the table. For example, cell A10 of worksheet Sheet1.
Figure 2.14: Completed Create PivotTable dialog
4. Drag the Gender variable (by left-‐clicking Gender and holding the mouse button down) from the field list to the Column Label box (the spanner for the column heads that will later contain the names of the different attributes Male and Female). Drag the variable Daily newspaper from the field list to the Row Label box (which specifies the column of stubs that will later contain the names of the different scores (Algemeen Dagblad, NRC, etc.). Finally, drag the Gender variable (or the Daily newspaper variable) to the ∑ Values box. Next select Sum of Gender and change Sum to Count in the Value Field Settings dialog.
31
Statistics using Excel 2007
Figure 2.15: Contingency table of Daily newspaper and Gender 5. Finally, change the codes in cell range A12:A18 to the names of the respective daily newspapers and the codes in the cell range B11:C11 with Male or Female (See Table 1.2).
Figure 2.16: Final contingency table of Daily newspaper and Gender The total column and total row of a contingency table comprise essential information. The total column in this example gives the total number of subscriptions to a given daily newspaper, and the total row gives the separate total numbers of men and women.
32
2 Tables
The following table finally emerges after some editing in Word 2007. Table 2.4: Number of subscriptions by daily newspaper and by gender Gender Daily newspaper Male Female Grand Total Algemeen Dagblad 18 7 25 NRC 9 3 12 Telegraaf 44 6 50 Trouw 16 3 19 Volkskrant 40 20 60 Other 37 16 53 None 59 22 81 Grand Total 223 77 300 Source: ‘Fiction 2000’
Figure 2.17: Word version of contingency table of Daily newspaper and Gender 2.2.3 Relative tables Rather than an absolute frequency distribution (count) a table can also display a relative (proportional) distribution, and this is definitely preferable when comparing multiple categories. The available options in a relative distribution, depending on what is being emphasized, are percentages of: 1. the total; 2. the column; 3. the row. The method is analogous to the one in Section 2.2.2 except for how the ∑ Values are modified. The frequency (count) representation used in the above table of daily newspaper subscriptions by gender makes it hard to see which daily newspaper is favoured more by men than by women. The counts for each daily newspaper in the ‘male’ column in the above table are greater than the corresponding values in the ‘female’ column. This is because the group of men in the survey was far larger than the group of women. To be able to use the above table of daily newspaper subscriptions by gender to ascertain which daily newspaper is favoured more by men than by women, you need to display the proportions, usually as a percentage, instead of the counts of the columns. Each cell then indicates the proportion of men with a subscription to a given daily newspaper, or the corresponding proportion of women with a subscription to the same daily newspaper. You have to modify the above contingency table for Daily newspaper and Gender accordingly. First copy the above contingency table by selecting the cell range A10:D19 either by typing CRTL+C (copy), or selecting Copy in the Clipboard group on the Home tab. Select a cell, e.g. cell A23, and then either type CRTL+V (paste), or select Paste in the Clipboard group on the Home tab. Select a cell of the copied pivot table (e.g. A25). The PivotTable Field List now opens. In this dialog click Count of Gender in the ∑ Values box and then select Value Field Settings in the drop-‐down list.
33
Statistics using Excel 2007
Figure 2.18: Opening the Value Field Settings dialog Open the ‘Show values as’ tab in the Value Field Settings dialog and then click the button alongside Normal.
Figure 2.19: Opening the ‘Show values as’ menu
34
2 Tables
Select ‘% of column’ in the drop-‐down list.
Figure 2.20: Changing the values to ‘% of column’. When you finally click OK, you will see the values shown in the pivot table change to % of column as opposed to the original counts.
Figure 2.21: Pivot table of daily newspaper subscription by gender as percentages of gender.
35
Statistics using Excel 2007
The percentages can be made clearer and comparisons easier by removing the fractional part of the numbers. Select the cell range B25:D32 (by pressing and holding down the left mouse button). The selected cells now have a thick border. Then click the Decrease Decimal button twice in the Number group on the Home tab.
Figure 2.22: Reducing the number of decimal places shown in the selected cells Finally the table is as follows. Count of Gender Gender Daily newspaper Male Female Grand Total Algemeen Dagblad 8% 9% 8% NRC 4% 4% 4% Telegraaf 20% 8% 17% Trouw 7% 4% 6% Volkskrant 18% 26% 20% Other 17% 21% 18% None 26% 29% 27% Grand Total 100% 100% 100% Figure 2.23: Table of the proportion of daily newspaper subscriptions by gender as percentage. It is easy to see in this table that a larger proportion (percentage) of men subscribe to De Telegraaf (20% of males ) than women (‘only’ 8% of females). A larger proportion of women subscribe to De Volkskrant (26% of females) than men (18% of males).
36
2.2.4
2 Tables
Filters
Sometimes you will want to create a table that refers to only some of the respondents. For instance, you may be interested only in the ‘Opinions about Metro’ of only the respondents who have actually read Metro. Method 1. Select cell A1 in the Data worksheet. 2. Open the Insert tab, click PivotTable and select PivotTable in the drop-‐down list. 3. Data!$A$1:$L$301 is entered automatically by Table/Range in the dialog. Select the location on the worksheet Sheet1, for example cell A36. 4. Next drag the Opinion Metro variable from the field list to the Row Label box and again drag the Opinion Metro variable to the ∑ Values box. In Value Field Settings, change Sum of Opinion Metro to Count of Opinion Metro. Drag the Metro variable to the Column Labels box. A frequency table of the Opinion Metro variable is now created that is grouped according to whether respondents had or had not read Metro. Substitute the codes in the pivot table with their definitions (see Table 1.2).
Figure 2.24: Opinion Metro grouped according to whether respondents had or had not read Metro 5. Now drag Metro from the field list to the Report Filter box. The variable name Metro now appears in cell A34, and you can choose what aspect of the variable Metro to display in cell B34. The default selection is All, but you can click the arrow to choose among the attributes of the variable Metro. If you wanted to display Opinion Metro only for those who had actually read Metro, you would select Yes and then click OK.
37
Statistics using Excel 2007
Figure 2.25: Selecting respondents who had actually read Metro using a filter You finally obtain a table of Opinion Metro of those who had actually read Metro at some time.
Figure 2.26: Opinion Metro of the group who had actually read Metro By clicking the filter for B34 again you can alter the choice back to All or to the group that had never read Metro. Filtering does not have to be restricted to a single variable. For example, you might be interested only in the Opinion about Metro of male respondents who had ever read Metro. In that case you would first drag the Gender variable to the Column Labels box. Then in the pivot table you would change the code 1 to Male and code 2 to Female. Now drag the Gender variable from the field list to the Column 3 Labels box . Then click the arrow next to (All) in the Gender row, select Male in the window that opens, and finally click OK. The following table then appears. 3
Alternatively, you can drag a variable to the report filter directly, but you cannot change the codes in the report filter. This is why the variable is first dragged to the column (or row) label to change a code to its definition. 38
Metro Gender
2 Tables
Yes Male
Count of Metro Opinion Metro Total Very bad 6 Bad 19 Reasonable 36 Good 38 Very good 13 Grand Total 112 Figure 2.27: Frequency table of Opinion about Metro of men who had actually read Metro. You finally change the name of the worksheet Sheet1 into something meaningful, such as ‘Qualitative tables’, in order to identify the different worksheets. To do this, right-‐click Sheet1 and then click Rename.
Figure 2.28: Changing a worksheet name The name Sheet1 is now displayed with a black background and you can type in the new worksheet name: Qualitative tables.
39
Statistics using Excel 2007
2.3
Tables of quantitative data
When dealing with quantitative data (e.g. the Age variable in the Fiction 2000 file) you can create a frequency table in much the same way as with the data of qualitative variables. If you follow the method given in 2.2.1 for creating a pivot table of the Age variable, you will obtain the following result. Create this table on a new worksheet and change its name to Quantitative tables . Age 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 62 63 64 Grand Total
Total 1 4 4 3 5 8 20 10 20 9 12 3 5 10 7 11 16 6 11 5 4 1 10 8 5 7 5 6 6 7 8 5 4 6 5 2 6 5 2 1 4 4 4 1 3 1 3 2 2 1 2 300
Figure 2.29: Table of the various ages This table shows exactly how many times each value (age) occurs. However, a significant drawback of this table is its lack of clarity. The more than fifty different values make it hard to see the distribution of ages clearly at a glance.
40
2 Tables
Since the data of quantitative variables tend to have an extremely large range of values, the values are usually grouped into a frequency table. A frequency table shows how many times a value occurs within a given interval (class or group). You have to decide how many classes you want before creating a frequency table. A general rule is to use the square root of the number of observations. For example, if you had 100 numbers, you would choose to have approximately 10 groups or classes. You would group the 300 observations in the Fiction 2000 file into approximately 17 classes. N.B. the more classes there are, the less clear the table will be (see the table that was created for Ages), whereas the fewer classes there are, the more information will be lost (the most extreme case would be a single class, 13 – 64 years of age, containing 300 observations, but conveying no information about the age distribution whatsoever). The choice of interval width depends on the maximum and minimum values and the number of classes.
Width of interval =
Maximum − Minimum Number
For 300 numbers, with a minimum of 13 and a maximum of 64: class size = (64 -‐ 13)/Ö300 = 2.94 Make the class size a ‘round’ number (e.g. 5, 10, 25, 50, 100, etc.), and a good choice in this case would be 5. Each class must start with a multiple of the class size. The first class is not therefore from 13 -‐< 18, but from 10 -‐< 15. The classes are then: 10 -‐< 15 15 -‐< 20 20 -‐< 25 etc. You can now adjust the table for ages. Method 1. Select a cell that contains an age (e.g. cell A5). 2. The Options tab of PivotTable Tools opens. Click Group Field in the Group group.
Figure 2.30: Opening the dialog to group fields
41
Statistics using Excel 2007 4.
The following dialog appears.
Figure 2.31: Grouping dialog The minimum age is 13 and the maximum age is 64. The dialog suggests making classes of 10 values. If you were to click OK now, you would have only 6 classes (which is very few). Furthermore, the first class should start with a multiple of the class size. A better choice of class size would be 5, Starting at 10 and Ending at 64. You can change the values in the dialog by clicking 13 and typing 10, and then clicking the 10 and changing it to 5.
Figure 2.32: Modified Grouping dialog If you click OK now, you will obtain the following table. Age Total 10-‐14 15-‐19 20-‐24 25-‐29 30-‐34 35-‐39 40-‐44 45-‐49 50-‐54 55-‐59 60-‐64
5 40 54 49 27 35 32 23 16 12 7
Grand Total 300 Figure 2.33: Frequency table of ages in 5-‐year classes
42
2 Tables
You can restore the original table with all the ages by clicking cell A4 and then clicking Ungroup Field in the Group group on the Options tab.
Figure 2.34: Ungrouping
2.3.1
Cumulative tables
Sometimes you will be interested not only in the numbers within a given class, but also in all the numbers (or percentages) below a certain limiting value. In that case you would create a cumulative frequency (or relative) table. There are three ways of creating a cumulative frequency table: 1. by summing the data from a pivot table; 2. using the FREQUENCY function; 3. with data analysis.
43
Statistics using Excel 2007
2.3.1.1 Summing data from a pivot table
We use the Age frequency table as an example. Method 1. Type ‘Age’ in cell D4, then ‘< 15’ in cell D5 (i.e. ages up to and including 14), ‘< 20’ in cell D6 (i.e. ages up to and including 19), etc., finally entering ‘< 65’ in D15 (i.e. ages up to and including 64). 2. Type ‘Total’ in cell E4 and then ‘=B5’ in cell E5. 3. Type ‘=E5+B6’ in cell E6.
5.
Figure 2.35: Entering the formula in cell E6. You can copy the formula in cell E6 into each of the cells from E7 to E15 (E7:E15). To do this, move the pointer over the lower right-‐hand corner of cell E6 until it turns into a ‘+’ symbol, signifying that it is on the fill handle.
Figure 2.36: Fill handle for copying the formula 44
2 Tables
Now press and hold down the left mouse button while moving the pointer (i.e. dragging the fill handle) to cell E15.
Figure 2.37: Selecting the cells to copy the formula into If you then release the mouse button, the formula from cell E6 will be copied to the cells E7:E15, incrementing the cell numbers in the original formula by 1 for each downward move. In other words, cell E7 will have ‘=E6+B7’ cell E8 will have ‘=E7+B8’, etc.. This finally yields the cumulative frequencies (totals).
Figure 2.38: Result of copying the formula into cells E7:E15 For example, there are 265 respondents below the age of 50. You can create the cumulative relative table (with values in per cent) similarly. Base this table on the pivot table, selecting ‘% of column’ in the ‘Show values as’ box, and typing ‘Percentage’ in cell E4 instead of ‘Total’. Finally select E5:E15 and choose % in the Number group on the Home tab.
45
Statistics using Excel 2007
Figure 2.39: Cumulative relative frequency table of Age Remember to add a clear title and an acknowledgement when presenting the table!
2.3.1.2 The FREQUENCY function
Before you can use the FREQUENCY function you have to set the upper limit of the classes. N.B. the upper limit of the class 10 -‐< 15 is 15, but this value is just outside the class. You therefore enter a value such as 14.9 or 14.99 as the upper limit, depending on the accuracy of the original data. For the Age variable, 14 is acceptable. Method 1. Open a new worksheet in the Fiction 2000 Excel workbook. Type ‘Age’ in cell A1 and ‘Cumulative frequency’ in cell B1. 2. Type ‘14’ in cell A2 and ‘19’ in cell A3. You can now quickly copy the ages 24, 29, ….., 64 into the cells A4:A12 by selecting cells A2 and A3 and moving the pointer over the lower right-‐hand corner of cell A3 until it turns into a ‘+’ symbol.
Figure 2.40: Fill handle for copying a series Press and hold down the left mouse button while moving the pointer to cell A12. Then release the mouse button. You will then obtain the numbers 24, 29, ….., 64 in the correct cells. This method is quicker than typing each number in separately.
46
2 Tables
3. Select cell B2 and click Insert Function in the Function Library group on the Formulas tab, or simply click the fx button in the Formula Bar.
Figure 2.41: Opening the Insert Function dialog 4. Select Statistical from the list next to ‘Or select a category:’, select FREQUENCY from the list next to ‘Select a function’, and finally click OK.
Figure 2.42: Opening the FREQUENCY function dialog 5. For Data_array select cells C2:C301 on the Data worksheet and then press F4 on the keyboard. Pressing F4 adds ‘$’ symbols to the selected cell addresses. These cells will not now change when you copy the formula. Select for Bins_array cell A2 and then click OK.
47
Statistics using Excel 2007
Figure 2.43: Completed FREQUENCY function dialog 6. You now copy the formula in cell B2 to cells B3:B12. To do this, move the pointer over the lower right-‐hand corner of cell B2 until it turns into a ‘+’ symbol.
Figure 2.44: Fill handle for copying the formula Now press and hold down the left mouse button while moving the pointer to cell B12. Then release the mouse button. The result will be as follows.
48
2 Tables
Figure 2.45: Result of copying the formula from cell B2 to cells B3:B12 The FREQUENCY function can also be used to obtain an ordinary frequency distribution. To do so, you follow the same steps as above, with the following changes: • at point 1 type ‘Frequency’ in cell B1; • at point 3 select cells B2:B12;
•
Figure 2.46: Selecting cells B2:B12 at point 5 by Bins_array enter: A2:A12.
49
Statistics using Excel 2007
Figure 2.47: Completed FREQUENCY function dialog for making a frequency distribution Do not press OK, but press the Ctrl and Shift keys on the keyboard simultaneously. Hold these keys down and press the Enter key. The normal frequency distribution will appear.
Figure 2.48: Result of the FREQUENCY function for creating a frequency distribution
50
2 Tables
2.3.1.3 Data analysis
It is also a simple matter to create a frequency distribution and a relative cumulative frequency distribution by data analysis. Method 1. Open the Data worksheet from the Fiction 2000 workbook. Open the Data tab and then click Data Analysis in the Analysis group.
Figure 2.49: Opening the Data Analysis dialog If Data Analysis is not present in the ribbon, click the Office Button and then click Excel Options. Select Add-‐Ins in the Excel Options menu and then click Go.
Figure 2.50: Opening the Add-‐Ins dialog
51
Statistics using Excel 2007
Check the Analysis ToolPak box and click OK
Figure 2.51: Activating Analysis ToolPak Data Analysis will now be available 2. Select Histogram under Analysis Tools, and click OK
Figure 2.52: Opening the Histogram dialog 3. The Histogram dialog appears. Enter after Input Range the addresses of the cells that contain the data (the range for the data of the Age variable is $C1:$C301). Check the Labels box. Select New Worksheet Ply and enter a name (e.g. ‘Frequency Age’). Finally check the Cumulative Percentage box.
52
Figure 2.53: A completed Histogram dialog When you click OK the following will appear.
2 Tables
Figure 2.54: Result of the data analysis Histogram for the variable Age It is hard to read the contents of cells A1:C1 unless you adjust the column widths to suit the text. Open the Home tab and click Format in the Cells group.
53
Statistics using Excel 2007
Figure 2.55: Opening the cell format drop-‐down list Then click AutoFit Column Width in the format drop-‐down list
Figure 2.56: Setting automatic column width adjustment according to contents Now you can read the table properly. 54
2 Tables
Figure 2.57: Result of the data analysis with adjusted column width A disadvantage of this frequency distribution is the ‘inelegant’ choice of classes. The first class runs to age 13, the second from age 14 to age 16, etc.. The class size is therefore 3 years. There is also an open class at the end for over 61 years of age. Fortunately, you can also choose the class size for a Histogram. First fill in the bin range on a worksheet. For example, enter in cell A1: ‘Age’, in cell A2: ‘14’, in cell A3: ‘19’, etc. until cell A12: ‘64’. Copy the original data for age to the same sheet as the class structure. In the Histogram dialog for Bin Range enter $A$1:$A$12. (You should change the name of the New Worksheet Ply if necessary, because Excel cannot work with two worksheets with the same name).
Figure 2.58: A completed Histogram dialog with specified class boundaries If you now click OK the following appears.
55
Statistics using Excel 2007
Figure 2.59: Result of the data analysis with specified class boundaries You now have the processed data to produce an orderly frequency table and a relative cumulative table. Remember to add a title, clear class names and an acknowledgement. Table 2.4: Frequency of respondents by age Table 2.5: Relative cumulative distribution of respondents’ ages Age (years) 10 -‐< 15 15 -‐< 20 20 -‐< 25 25 -‐< 30 30 -‐< 35 35 -‐< 40 40 -‐< 45 45 -‐< 50 50 -‐< 55 55 -‐< 60 60 -‐< 65 Total Source: ‘Fiction 2000’
56
Frequency 5 40 54 49 27 35 32 23 16 12 7 300
Age (years) < 10 < 15 < 20 < 25 < 30 < 35 < 40 < 45 < 50 < 55 < 60 < 65
Source: ‘Fiction 2000’
Percentage (%) 0 1,7 15,0 33,0 49,3 58,3 70,0 80,7 88,3 93,7 97,7 100,0
2.4
2 Tables
Entering a change
It can happen that you discover an entry error in an item of data after it has already been processed. For example, you might notice that the age of respondent number 5 should not have been 26, but 62. You can change this age into 62 in the data file, but what effect would that have on the frequency distributions that have already been created? The effect depends on the method used. 2.4.1 Changes and pivot tables Nothing happens to a frequency distribution that was created using pivot tables after changing the age in the data file. In order to implement the change in the pivot table you have to open the Data tab and then click Refresh All in the Connections group.
Figure 2.60: Opening the Refresh All menu to change data Then select Refresh All in the drop-‐down list.
Figure 2.61: Refresh drop-‐down list
57
Statistics using Excel 2007
The pivot table now updates to reflect the changed data.
Figure 2.62: Modified Age table You can see that the number in the 25-‐29 group has decremented by 1 to 48 and the number in the 60-‐64 group has incremented by 1 to 8.
2.4.2
Changes and the FREQUENCY function
The frequency table you created using the FREQUENCY function changes immediately when you change the age of respondent number 5. All functions behave like this!
2.4.3
Changes and data analysis
The frequency table you created with the Histogram tool from Data Analysis does not change when you change the age of respondent number 5. Furthermore, this frequency table cannot be refreshed. The only way to obtain a correct frequency table is to repeat the steps of clicking Data Analysis, selecting Histogram and creating a new frequency table on a new worksheet. The same is true of all analysis tools available under Data Analysis.
2.4.4
When to use pivot tables, the FREQUENCY function, or data analysis
Contingency tables are frequently used in surveys. The only way to create them is with PivotTable. If data change in the course of an investigation , you have to refresh the pivot tables through the Data tab in order to ensure consistency with the new situation. Some frequency tables are created from constantly changing data. An example would be bank account balances, or stock exchange share prices. If you want to display these frequency tables correctly at all times you should create them using the FREQUENCY function. If a statistical investigation does not require contingency tables, and you have checked the data and are confident that they are correct, Data Analysis provides a rapid way of creating frequency tables.
58
2.5
2 Tables
Exercises
The Fiction 2000 workbook must be used for all exercises. 1. Create a frequency table from the data of all respondents for the Opinion Spits variable. 2. Create a frequency table from the data of the respondents who ‘read Spits’ on the Opinion Spits variable, broken down according to data of the Gender variable. 3. Create a relative frequency table from the data of the respondents who ‘read Spits’ on the Opinion Spits variable, showing evidence that females have a poorer opinion of this newspaper than males. 4. Create a relative frequency table from the data of the respondents who ‘read Spits’ on the Opinion Spits variable, showing evidence that females read this newspaper less often than males. 5. Create a frequency table from the data of all respondents on the Travelling Time variable. 6. Create a frequency table from the data of all respondents with respect to their incomes. 7. Create a cumulative relative frequency table from the data of all respondents with respect to the Income variable. 8. Create a frequency table from the data of all respondents for the ‘Weekly travel days by train’ variable, broken down according to Gender. 9. Create a frequency table from the data of respondents aged 30 or over with respect to the Daily Newspaper variable. 10. Create a frequency table from the data of male respondents aged 30 or older with respect to the Daily Newspaper variable.
59
Statistics using Excel 2007
60
3
3 Charts
Charts
3.1
Introduction
The previous chapter explained how to present gathered data clearly in the form of a table. Another way of presenting a quantity of data is in the form of a chart, which has the advantage that information about the data can be conveyed at a glance. After making a pivot table (see Chapter 2) a chart can be created from the data simply and quickly in Excel 2007. Excel 2007 provides eleven chart types, each of which has several subtypes.
Figure 3.1: List of the various chart types and some of the subtypes This chapter deals with only selected chart types and subtypes. The chart you choose depends primarily on the nature of the data (e.g. whether it is qualitative or quantitative). A column chart, bar chart, or pie chart is often used for tables of qualitative data, whereas a histogram (i.e. a column chart with contiguous columns) or a line chart (curve or ogive) is used for frequency distributions of quantitative data.
61
Statistics using Excel 2007 Table 3.1: Data Qualitative Quantitative
Subject and chart style for qualitative and quantitative data Subject Totals Totals with categories Short time series Long time series Frequency distribution Cumulative frequency distribution Simple correlation or regression
Chart style Column chart, bar chart, pie chart Composite column chart or bar chart Column chart, bar chart Line chart Histogram, frequency curve Ogive Scatter chart
As with a table, a good chart also has to meet several requirements: 1. it must have a number, and a caption (title) describing the information conveyed by the chart; 2. it will usually have a horizontal and a vertical axis with clear markers, and a caption that includes the unit of measurement; 3. it should have an acknowledgment (and must have one where desk research is involved); 4. column charts cannot have a broken vertical axis, but this is often necessary in line charts. 3.2 Charts of qualitative data You must create a table (see 2.2 Tables of qualitative data) before you can create a column, bar or pie chart of qualitative data in Excel. It is fairly simple to create a chart once you have a pivot table of the qualitative data available. 3.2.1 Column charts Take the following pivot table of the Gender variable as an example (see also Chapter 2.2.1).
Figure 3.2: Table of Gender Method 1. Click on any cell of the pivot table (one of the cells A3:B7). 2. Open the Insert tab in the Ribbon and then click Column in the Charts group (to insert a column chart). Select the first column chart subtype (clustered column chart).
62
3 Charts
Figure 3.3: Inserting a column chart of the Table of Gender The chart appears as an embedded chart on the active worksheet. There are two additional active panes: the PivotChart Filter Pane and the PivotTable Field List.
Figure 3.4: (Embedded) column chart of the Gender variable
63
Statistics using Excel 2007
A chart comprises several elements. Each element has a name and can be formatted individually.
Figure 3.5: The elements of the example column chart The commonest chart elements are given below. Table 3.2: Various chart elements defined Element Definition Chart area Everything inside the frame. The entire chart and the background. Horizontal axis (X-‐axis) The horizontal axis with the various categories. Vertical axis (Y-‐axis) The vertical axis with the values. Chart title Description of the information conveyed by the chart. Plot area The part of the chart bounded by the axes. Legend Definition of the various colours used in the chart. Data series The values plotted in the plot area. Grid lines Horizontal and vertical lines to make the chart easier to read. Chart Area Gridlines Chart Title Legend
Vertical Axis Horizontal Axis Plot Area Figure 3.5: Various chart elements Each chart element can be formatted after right-‐clicking it.
64
Series
3 Charts
The PivotChart Tools contextual tabs appear when you create or select a chart. The contextual tabs are Design, Layout, Format and Analyze.
Figure 3.6: The PivotChart Tools contextual tabs • You can use commands on the Design tab to edit the basic chart data and overall chart layout. • You can use commands on the Layout tab to determine the position of, or show or hide, elements. • You can use commands on the Format tab to format selected chart elements. • You can use commands on the Analysis tab to clear and refresh data and to show and hide the PivotChart Filter Pane and the PivotTable Field List. You will now edit the example column chart to meet the established requirements: a clear chart title , titles for the axes, an acknowledgement, and no superfluous legends. • Click Total and type ‘Chart 3.1: Gender of the 300 respondents’. Then select this text and open the Home Tab in the Ribbon and click the tool in the Font group to change the size to 14. • The axes are untitled at present. Open the Layout tab in the Ribbon and then click Axis Titles in the Labels group. Select Primary Horizontal Axis Title and then Title Below Axis. See Figure 3.7.
• • •
Figure 3.7: The selections for inserting a title below the horizontal axis Click on the newly inserted axis title below the horizontal axis and change the name to ‘Gender’. Repeat the above for Primary Vertical Axis Title and change the name to ‘Frequency’. Open the Layout tab and click the Legend button in the Labels group. Select None.
Figure 3.8: Removing the chart legend
65
Statistics using Excel 2007 •
There is no separate command button for acknowledgment, which therefore has to be added as a text box. Open the Layout tab in the Ribbon and click the Text Box button in the Insert group. Drag the text box into position below the horizontal axis. Type in the text box: ‘Source: Fiction 2000’ and finally change the font size to 8.
Figure 3.9: Inserting a text box into the chart The column chart is now as follows.
Figure 3.10: The formatted chart of the Gender variable If you would prefer to have the chart on a separate worksheet rather than embedded, open the Design tab in the Ribbon and click the Move Chart button in the Location group. In the dialog that opens select ‘New sheet: Chart 1’, and finally click OK.
Figure 3.11: Move Chart dialog 66
3 Charts
If you need to create a column chart for two variables, you should select one of the nineteen subtypes under Column. The differences between the subtypes are: • shape: column, cylinder, cone, or pyramid; • dimensionality: 2-‐D (flat), or 3-‐D (spatial); • clustered, stacked, or 100% stacked. Clustered columns are used to compare values across categories, stacked columns to compare the contribution of each value to a total across categories, and 100% stacked columns to compare the percentage that each value contributes to a total across categories. While column charts with a 3-‐D effect may have a more striking appearance, they tend to make it harder to read values from the chart accurately. Similarly, the shape of cones and cylinders can distort the view. When creating a column chart of two variables in Excel 2007, the variable set out in the column of labels to the left of each table row is considered to be the independent variable, and its values are used as labels on the chart X-‐ axis. The variable set out in the row of labels for each table column is considered to be the dependent variable, and each column in the column chart is subdivided in accordance with the values of this variable . As an example you can create a stacked column chart from the pivot table for the Daily newspaper and Gender variables. Method 1. Click any cell in the A10:D19 range of this pivot table. 2. Open the Insert tab in the Ribbon and then click Column in the Charts group and select the stacked column chart. A stacked column chart of the data about the number of subscriptions to the various daily newspapers by gender then appears as an embedded chart.
Figure 3.12: Inserting a stacked column chart of the selected table
67
Statistics using Excel 2007
3. Add a chart title, axis titles and an acknowledgment. Do not remove the legend. A legend is an essential part of a composite chart.
Figure 3.13: Stacked column chart of the variable Daily newspaper by Gender The main emphasis in this chart is on the total number of subscriptions for each daily newspaper (the lengths of the columns) and secondary emphasis is given to the gender of the subscribers (the subdivision of the columns). If you wanted to emphasize the gender distribution per daily newspaper, you would open the Design tab in the Ribbon and click Change Chart Type in the Type group. Then select the first column chart subtype (clustered column). This result is as follows.
Figure 3.14: Clustered column chart of the variable Daily newspaper by Gender
68
3 Charts
If you wanted to present the distribution of the number of newspaper subscriptions over men and women, you would edit the pivot by swapping the variables in Row and Column (i.e. put Gender in Row and ‘Daily newspaper’ in Column). Method 1. In the PivotTable Field List drag the Daily newspaper variable from the Axis Fields box to the Legend Fields box under the Gender variable.
Figure 3.15: Dragging the Daily newspaper variable to the Legend Fields box 2. Now drag the Gender variable from the Legend Fields box to the Axis Fields box. The chart and the corresponding table refresh immediately after moving a variable. Finally change the chart title and the horizontal axis title.
Figure 3.16: A column chart of the Gender variable by Daily newspaper
69
Statistics using Excel 2007
It goes without saying that all these charts can be edited. You can make them as colourful and shapely as you want. Whether that makes them easier to read is debatable. Each chart element is editable.
Figure 3.17: A 3-‐D cone chart of the Daily newspaper variable by Gender
3.2.2
Bar chart
A bar chart is actually a column chart that has been rotated by 900. The category axis is then vertical and the value axis horizontal. This has definite advantages if the categories have long names. For example, you would proceed as follows to create a bar chart of the opinions about Metro of respondents who had actually ever read it. Method 1. Click any of the cells A36:B42 in the pivot table concerned. 2. Open the Insert tab in the Ribbon. Click Bar in the Charts group and then select the clustered bar subtype. 3. Add a chart title, axis titles and acknowledgment and remove the legend.
70
3 Charts
Figure 3.18: Creating a clustered bar chart of the Opinion Metro variable The result is as follows.
Figure 3.19: Bar chart of the Opinion Metro variable of the respondents who had read the newspaper
71
Statistics using Excel 2007
3.2.3
Pie chart You can create a pie chart for the totals of only one qualitative variable. Take as an example the pie chart of the table of the distribution of the respondents according to Gender. Method 1. Click any cell in the A3:B6 range of the pivot table of the Gender variable. 2. Open the Insert tab in the Ribbon and then click Pie in the Charts group. Then select the first pie chart.
Figure 3.20: Creating a pie chart of the Gender variable 3. Add a chart title and an acknowledgment.
Figure 3.21: Pie chart of the Gender variable
72
3 Charts
When creating a pie chart from the data of a contingency table, the data from the first column are shown initially in the chart. Take as an example the pivot table of subscriptions to daily newspapers broken down by gender. Method 1. Click any cell in the A10:D19 range of the pivot table of the Daily newspaper variable by Gender. 2. Open the Insert tab in the Ribbon and then click Pie in the Charts group. Then select the first pie chart.
Figure 3.22: Inserting a pie chart of the Daily newspaper variable broken down by Gender This yields the following pie chart.
Figure 3.23: Pie chart of the Daily newspaper variable for male respondents
73
Statistics using Excel 2007
Data for male respondents appear in the pie chart because they are in the first column of the contingency table. If you want to create a pie chart of the number of newspaper subscriptions of the female respondents, you first have to change the pivot table to put the number of women’s newspaper subscriptions in the first column. There are two ways of doing this. 1. Click the arrow at the right of the column labels box in the table and then select Sort Largest to Lowest. The result is that females move to the first column and males to the second. 2. Unselect Male by clicking the box. The result is a table with only a single column, which is for females.
Figure 3.24: Sorting column labels from largest to smallest When you are finished, you will see the pie chart change immediately as follows.
Figure 3.25: Pie chart of the Daily newspaper variable for female respondents Finally, remember to add a meaningful chart title and acknowledgment. If you want a pie chart of the newspaper subscriptions of all respondents, change the contingency table to one with only the Daily newspaper variable and create a pie chart with this new pivot table.
74
3.3
3 Charts
Charts of quantitative variables
The main thing to be done before you can use Excel to generate a line chart of quantitative data is to arrange the (quantitative) values to be plotted on the X-‐axis correctly and in the proper sequence. A column chart (histogram) identifies the x values centrally below each column, whereas the (x, y) values are set out in a line chart. Choose the axis divisions for quantitative data with care. The axes are a kind of ruler. Each interval between two numbers must represent the same difference. If you have created a frequency table with classes, a class is represented by the class midpoint. Class midpoints are well suited to creating line charts and histograms. N.B. ensure that all intervals have the same class size! For example, if one interval is twice as wide as the other class sizes, change it. Table 3.2: Breakdown of labour force in 2008 from 15 – 64 years by age Age Number* 1,000 15 -‐< 25 817 25 -‐< 45 3,794 45 -‐< 65 2,800 Total 7,411 Source: Statistical Yearbook 2009 (Statistics Netherlands)
The first class (15 -‐< 25) has 10 different ages, while the second and third classes group together 20 ages. Divide the second and third classes into groups of 10 years and assume a uniform distribution within the classes. You will then obtain the following table, which you can use to create a histogram. Table 3.3: Breakdown of labour force in 2008 from 15 – 64 years by age
Age 15 -‐< 25 25 -‐< 35 35 -‐< 45 45 -‐< 55 55 -‐< 65 Total
Number* 10.000 817 1,897 1,897 1,400 1,400 7,411
Source: Statistical Yearbook 2009 (Statistics Netherlands)
75
Statistics using Excel 2007
3.3.1
Histogram
Take as an example the table of the ages of the 300 survey respondents. You do not need to change the class size in this case, because there are 5 years in each class. The ages will appear on the horizontal axis, and by convention the horizontal axis starts at zero. The classes 0 -‐< 5 and 5 -‐< 10 are not included in the table because they contain no values. However, you will need these classes to produce a histogram with a horizontal axis that starts at zero. Alternatively, you could create a table that also includes the classes that have no values. Method 1. Click any cell in the A3:B15 range of the grouped pivot table of the Age variable. 2. A lower limit of 10 was used when the date were grouped. If you want a table that starts at zero you would have to change the lower limit to 0. To do this, click on any cell in the A4:A14 range and open the Options contextual tab under PivotTable Tools, then click Group Selection in the Group group.
Figure 3.26: Opening the Grouping dialog 3. For ‘Starting at:’ enter ‘0’ and click OK.
Figure 3.27: Creating 5-‐wide groups starting at 0 and ending at 64
76
3 Charts
4. At first glance the table will appear not to have changed. However if you click the Row Labels filter in cell A3, you will see that there are some new groups that are not shown in the table.
Figure 3.28: Row Labels dialog with the various groups 5. Unselect <0 and >65. The reason that the 0-‐4 and 5-‐9 groups are not shown is that Excel omits groups by default if they contain no observations. The number in these groups is 0. Open the Options tab and then click Field Settings in the Active Field group. In the Field Settings dialog that opens, open the Layout & Print tab and check the ‘Show items with no data’ box under Layout.
Figure 3.29: Showing items without data
77
Statistics using Excel 2007
6. Groups 0-‐4 and 5-‐9 will now be shown in the table. A proper presentation of the horizontal axis of a histogram requires the labels to be changed to class midpoints. Click cell A4 and change ‘0-‐4’ by overtyping it with ‘2.5’. Click cell A5 and change it to ‘7.5’. Continue until cell A16 has been changed to ‘62.5’.
Figure 3.30: Edited table for the Age variable 7. Now open the Insert tab and then click Column in the Charts group.
Figure 3.31: Inserting a column chart for the Age variable
78
3 Charts
The result will be as follows.
Figure 3.32: Column chart of the Age table This is not a histogram, because there are gaps between the columns. To solve this problem, click any column and then open the Layout tab and click Format Selection in the Current Selection group (the columns).
Figure 3.33: Opening the Format Selection dialog for columns
79
Statistics using Excel 2007
8. Move the Gap Width slider to 0% (No Gap) and click Close.
Figure 3.34: Removing the gap between columns The following is the result after editing the chart title, axis titles, legend and acknowledgment.
Figure 3.35: Histogram of the Age variable
80
3 Charts
It goes without saying that this histogram could be tidied up somewhat. Each chart element can be edited after double-‐clicking with the left mouse button, or single-‐clicking with the right mouse button. Another way is to select the chart and use the Format or Chart drop-‐down lists to edit the various chart elements. Try this for yourself.
Figure 3.36: Formatted histogram of the Age variable
81
Statistics using Excel 2007
3.3.2
Frequency curve A frequency curve links the class midpoints of the columns of a histogram. A frequency curve is created in much the same way as a histogram, except for the chart type to be used. To give a frequency curve a stable appearance, it should start and end on the X-‐axis, which is achieved by adding a class at the end with a frequency of 0 if necessary. This implies for the table of ages adding a 65 -‐< 70 year class (with a midpoint of 67.5) with a total of 0. Method 1. Click the table Row Labels filter, which you used earlier for the histogram (cell A3). Check the box for > 65.
Figure 3.37: Selecting the > 65 group in the table. 2. Change the label in cell A17 from ‘> 65’ to ‘67.5’. 3. Open the Insert tab in the Ribbon and then click the Line button in the Charts group. Now select the first line chart subtype shown.
82
3 Charts
Figure 3.38: Inserting a line chart of the selected table 4. Edit the chart title and axis titles. Add an acknowledgment and remove the legend. The chart will be as follows
Figure 3.39: Line chart of the Age variable
83
Statistics using Excel 2007
3.3.3
Cumulative line chart
A cumulative line chart can normally be generated fairly rapidly from a cumulative frequency table. However, you do need to modify the table somewhat to ensure a satisfactory horizontal X-‐axis layout (i.e. one that starts at 0, the same as for a histogram and a frequency curve). You therefore have to edit the cumulative frequency table, such as the one created in Chapter 2 for the Age variable, in the same way as for the Ages frequency table when creating a histogram. Take as an example the cumulative (absolute) frequency table on page 45. The values in the cells D4:D14 are the labels for the category axis. You have to delete the ‘<‘ character from these cells (because only digits are allowed on the horizontal axis) and add the values 0, 5 and 10 in the cells D4, D5 and D6 (because the horizontal axis must start at 0 and have uniformly spaced divisions). Finally you have to type ‘0’ in cells E4, E5 and E6.
Figure 3.40: Table and cumulative table of the Age variable Method 1. Select the cell range E3:E17 using the mouse. 2. Open the Insert tab in the Ribbon and then click the Line button in the Charts group. Now select the first line chart subtype shown. The result will be as follows.
84
3 Charts
Figure 3.41: Cumulative frequency chart of the Age variable 3. The horizontal axis layout still has to be chosen. To do this, right-‐click the chart area and then select Select Data. The following dialog box then opens.
Figure 3.42: The Select Data Source dialog 4. Click the Edit button in the Horizontal (Category) Axis Labels area of the Select Data Source dialog. Then select the cell range D4:D17 using the mouse and click OK.
85
Statistics using Excel 2007
Figure 3.43: Selecting the horizontal axis labels 5. Edit the chart title and axis titles. Add an acknowledgment and remove the legend. You will then obtain the following cumulative line chart, or ogive.
Figure 3.44: Ogive of the Age variable 6. The numbering of the horizontal axis must now be modified. Open the Layout tab and click Axes in the Axes group. Select Primary Horizontal Axis and finally click More Primary Horizontal Axis Options. 86
3 Charts
Figure 3.45: Opening the Format Axis dialog for the horizontal axis 7. In the Format Axis dialog change the axis position. Select ‘On tick marks’ instead of ‘Between tick marks’.
Figure 3.46: Positioning the axis on the tick marks The horizontal axis values are now correctly positioned. The chart will appear as follows.
87
Statistics using Excel 2007
Figure 3.47: Ogive of the Age variable 3.3.4 Time series charts Since the Fiction 2000 workbook has no time series data, you can use the following table to practice creating time series charts. Table 3.5: Percentage of newspaper subscriptions by age group, 1997-‐2007 Age 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 18 -‐ 24 years
55
55
54
53
52
51
48
48
40
42
45
25 -‐ 34 years
47
47
43
43
40
39
37
33
31
28
27
35 -‐ 44 years
62
61
59
57
55
55
50
50
48
46
40
45 -‐ 54 years
73
74
73
72
69
67
65
64
59
58
57
55 -‐ 64 years
74
76
75
74
74
76
76
72
71
71
67
65 -‐ 74 years
75
74
73
73
74
75
72
72
73
70
70
75 years and older
70
69
68
68
72
69
70
71
66
64
73
© Statistics Netherlands, The Hague/Heerlen 19-‐6-‐2009 Time is always shown on the X-‐axis of a time series. Where calendar years are shown, the X-‐axis will obviously not start at zero.
88
3 Charts
3.3.4.1 Short time series
The data for the years of a short time series (e.g. up to 5 years) are shown in a column chart. For creating a column chart see also 3.2.1 (Column chart). The Y-‐axis of a column chart is a real ruler that starts at zero and each interval on the axis represents the same unit. A broken axis is therefore not acceptable. This is a problem you have not encountered so far, but you will often have figures that differ little from year to year in time series. Take for example the percentage of newspaper subscriptions of the 18 to 25-‐year-‐olds for the 2003 – 2007 period (see Table 3.5). The figure varies from 40% to 48%. You can see that the columns do not start at zero on the vertical axis, which gives a distorted view of reality. Method 1. Open the ‘Newspaper subscriptions.xlsx’ file. 2. Select the cell range H2:L2 to create a column chart of the newspaper subscriptions of the 18 to 25-‐year-‐olds for the 2003 – 2007 period. Open the Insert tab and then click Column in the Charts group. Then select the first subtype.
Figure 3.48: Inserting a column chart for the selected cells 3. You will obtain the following column chart.
Figure 3.49: Column chart of the selected cells Add a chart title, vertical axis title (percentage) and acknowledgment. Remove the legend (Series1)
89
Statistics using Excel 2007
4. The horizontal axis layout still has to be chosen. To do this, right-‐click the chart area and then select Select Data. The following dialog box then opens..
Figure 3.50: The Select Data Source dialog Click the Edit button in the Horizontal (Category) Axis Labels area of the Select Data Source dialog. Then select the cell range H1:L1 using the mouse and click OK. The years 2003, 2004, …, 2007 will now be shown below the horizontal axis. 5. The vertical axis still needs some tidying so that it starts at zero rather than 36. Open the Layout tab in the Ribbon and click Axes in the Axes group, then select Primary Vertical Axis and finally More Primary Vertical Axis Options.
Figure 3.51: Opening the Format Axis dialog for the vertical axis
90
3 Charts
In the Format Axis dialog under Axis Options Minimum: select Fixed and change ‘36’ to ‘0’. The vertical axis now starts at zero. For Major unit: select Fixed, and change ‘2’ to ‘10’. The units on the vertical axis will now increase in steps of 10. Finally click Close.
Figure 3.52: Setting the axis minimum to zero and the major unit to 10 The column chart now appears as follows.
Figure 3.53: Chart of newspaper subscriptions of 18 to 25-‐year olds for the 2003-‐2007 period
91
Statistics using Excel 2007
3.3.4.2 Long time series
The data for the years of long time series (e.g. more than 5 years) are shown in a line chart. For creating a line chart see also 3.3.2 Frequency curve. A line chart is also used to represent multiple time series, e.g. all percentages of newspaper subscriptions of all age groups for several years. A line chart may have a broken axis, and the Y-‐axis does not therefore have to start at zero. Excel does not show an axis break marker, but you could add one to your chart in a word processor. Take as an example the percentage of newspaper subscriptions of the 18 to 25-‐year-‐olds for the 1997–2007 period. Method 1. Open the ‘Newspaper subscriptions.xlsx’ file. 2. Select the cells B2:L2 for creating a column chart of the newspaper subscriptions of the 18 to 25-‐year-‐olds for the 1997 – 2007 period. Open the Insert tab and then click Line in the Charts group. Then select the fourth subtype.
Figure 3.54: Inserting a line chart of the selected data 3. You will obtain the following line chart.
Figure 3.55: Line chart of the selected data Add a chart title, vertical axis title (percentage) and acknowledgment. Remove the legend (Series1). 92
3 Charts
4. The horizontal axis layout still has to be chosen. To do this, right-‐click the chart area and then select Select Data. Click the Edit button in the Horizontal (Category) Axis Labels area of the Select Data Source dialog. Then select the cell range B1:L1 using the mouse and click OK. The years 1997, 1998, …, 2007 will now be shown below the horizontal axis. The final chart will appear as follows.
Figure 3.56: Line chart of the newspaper subscriptions of 18 to 25-‐year-‐olds 1997–2007
3.3.5
Scatter chart Scatter charts are used for investigating expected logical connections between two quantitative variables (see also Chapter 8). The independent variable is plotted on the horizontal (X) axis and the dependent variable on the vertical (Y) axis. For instance, it would be possible using the information in the Fiction 2000 file about a respondent’s age and income to investigate whether it is indeed true that income increases with age. The independent variable is then Age, and the dependent variable Income. N.B. all the above charts used tables. However, scatter charts use ‘distinct’ observations! To give an example of a scatter chart you can take the ages and incomes of the 300 respondents. Method 1. Open the Fiction 2000 workbook. 2. Open the Data worksheet and select the data you want to show in the scatter chart. Therefore, select C2:C301 (the various ages), then press and hold down the Ctrl-‐key and select the data of the Income variable (F2:F301). Be careful not to include the cells with the names (C1 and F1) in the selection. 3. Open the Insert tab in the Ribbon, then click XY (Scatter) in the Charts group and finally select the first subtype. You will then obtain the following scatter chart.
93
Statistics using Excel 2007
Figure 3.57: Scatter chart of the selected data The data from column C (ages) are plotted on the horizontal axis and the data from column F (incomes) on the vertical axis. If the ages were in column F and the incomes in column C you would swap the axes. Go to Select Data and click Edit in the Legend Entries (Series) area.
Figure 3.58: Opening the Select Data Source dialog to edit the selected data Then under ‘Series X values:’ change the letters C to F and under ‘Series Y values’ change the letters F to C, and then click OK.
94
Figure 3.59: Edit Series dialog
3 Charts
4. Add a chart title, axis titles and acknowledgment, and remove the legend. The scatter chart should have the following appearance.
Figure 3.60: Scatter chart of Age and Income 5. Looking at the points on the scatter chart, you can see that in general the higher the age, the higher the income. You could represent this connection with the equation of a straight line that passes through the points as accurately as possible. The following method can be used to display the line and equation in the chart. Open the Layout tab , click Trendline in the Analysis group and finally select More Trendline Options.
95
Statistics using Excel 2007
Figure 3.61: Opening the Trendline Options dialog Select the Linear option button in the Trend/Regression Type area. Finally check Display Equation on chart and Display R-‐squared value on chart.
Figure 3.62: Adding the trendline and the equation and R-‐squared value of this line to the scatter chart
96
3 Charts
The chart should now be as follows.
Figure 3.63: Scatter chart of Income and Age with regression line and equation The regression line and its equation are now displayed on the chart. The equation is y = 0.4953x – 2.2733. Another item of information that is displayed is that R2 = 0.6635. What this means is explained in Chapter 8.
3.4
Changes
Any changes you make to tables are incorporated immediately in the charts based on these tables. Scatter charts are also updated immediately on any change in the original data.
97
Statistics using Excel 2007
3.5
Exercises
Exercises 1 to 11 refer to the Fiction 2000 workbook. 1. Create a pie chart of the data of all respondents on the Opinion Spits variable . 2. Create a composite column chart of the data of the respondents who ‘read Spits’ on the ‘Opinion Spits’ variable, where the opinions are broken down according to the Gender variable. 3. Create a bar chart of the data of the respondents who ‘read Spits’ on the ‘Opinion Spits’ variable, showing evidence that females have a poorer opinion of this newspaper than males. 4. Create a histogram of the data of all respondents on the ‘Travelling time’ variable. 5. Create a frequency curve of the data of all respondents for the Income variable. 6. Create an ogive (cumulative frequency distribution) of the data of the male respondents for the ‘Travelling time’ variable. 7. Create an appropriate chart of the data of all respondents on the ‘Daily newspaper’ variable. 8. Create an appropriate chart of the data of all respondents on the variable ‘Weekly travel days by train’. 9. Create an appropriate chart of the data of all respondents on the ‘Daily newspaper’ variable broken down according to the Gender variable. 10. Create an appropriate chart from which the percentage of females with an income lower than a given amount can be read. 11. Create a scatter chart of the data of all respondents, with Age as independent variable and ‘Travelling time’ as dependent variable. Exercises 12, 13 and 14 refer to data from the ‘Newspaper subscriptions’ workbook. 12. Create an appropriate chart of the data of ‘Newspaper subscriptions’ for 2004 – 2007 of the 25 to 35 age group. 13. Create an appropriate chart of the data of ‘Newspaper subscriptions’ for 1997-‐2007 of the 75 and older age group. 14. Create an appropriate chart of the data of ‘Newspaper subscriptions’ for 1997-‐2007 of all the different age groups.
98
4
4 Distribution characteristics
Distribution characteristics
4.1
Introduction
4.2
Distribution characteristics of distinct observations
The previous two chapters explained how to use tables and charts to clarify the structure of large numbers of observed values. However, many reports or articles do not present observed values in tables or charts because they take up too much space. Instead, the observed values are represented in the text using measures of location, dispersion or shape. A measure of location indicates the value that may be considered to occupy the centre, average or middle of a distribution. A measure of dispersion expresses how much the individual observed values differ from each other and deviate from the centre, average or middle of the distribution. A measure of shape expresses how closely the observed values conform with a symmetrical bell-‐shaped distribution curve. It is important to know the level of measurement of a variable. If you entered the scores for nominal and ordinal variables in the form of codes, Excel will readily calculate all the measures of location, dispersion and shape. However most of these measures will be meaningless. The only meaningful measure for a nominal variable is mode, and for an ordinal variable only mode, median, range and interquartile range. The other measures of location, dispersion and shape are meaningful only for quantitative variables (measured on interval or ratio level). You can calculate the measures of location, dispersion and shape accurately once you have possession of the original gathered data. It is usually impossible to gather data about all objects of interest (the population). For instance, it would be too expensive in time and money to interview all rush-‐hour commuters about their opinions of the free morning newspapers. This is why ‘only’ 300 morning commuters were surveyed (the random sample). Population characteristics are conventionally denoted with Greek letters. For instance, the arithmetic mean of a population is denoted by ‘µ’, the standard deviation by ‘σ’ and the proportion (percentage) that meets a certain condition by ‘π’. The population size is represented by a capital ‘N’. Random sample characteristics are conventionally denoted with Latin letters. For instance, the arithmetic mean of a random sample is denoted by ‘ ‘, the standard deviation by ‘s’ and the proportion (percentage) that meets a certain condition by ‘p’. The size of the random sample is expressed by the small letter ‘n’. Many of the formulas used for calculating the characteristics of random samples are the same as those used for the characteristics of populations, except that a different notation is used. Formulas for the standard deviation and the formulas derived from it (variance, coefficient of variance, skewness and kurtosis) do differ between population and random samples, and may produce substantially different results, certainly for small random samples.
99
Statistics using Excel 2007
4.2.1
Measures of location A measure of location indicates the middle value, in some sense, of all outcomes. The measure of location that you can read directly from a table or chart and that applies to all variables is the mode. The mode is the observed value that occurs most frequently. It is the observed value with the highest frequency in a frequency table. You can read from Table 2.2 that the modal gender is male (223 males versus 77 females). In a bar or column chart it is the value associated with the longest bar or highest column. For instance, you can read from Chart 3.1 that the modal gender is male, because the column for males is higher than the column for females. A variable must be measured at ordinal level at least in order to determine the median. The median is the value of the middle observed value when sorted from low to high. Take as an example the journey times of the first 7 respondents from the Fiction 2000 workbook: 35, 29, 23, 32, 14, 104, 58. Sorting these numbers (from low to high) gives: 14, 23, 29, 32, 35, 58, 104. The median travelling time in this case is 32. The journey times of 3 respondents is less than 32 (the median) and the journey times of 3 respondents is greater than 32 (the median). There is no middle observation if the number of scores is even, in which case the median is the arithmetic mean of the two middle observed values. Take as an example the journey times of the first 8 respondents of the Fiction 2000 workbook file: 35, 29, 23, 32, 14, 104, 58, 55. Sorting these numbers (from low to high) gives: 14, 23, 29, 32, 35, 55, 58, 104. The median travelling time in this case is the average of 32 and 35, which is 33.5 minutes. A median can therefore have a value that does not actually occur. If you have many outcomes, you may wonder which number is in the middle. For example, if you have 90 th th numbers, should you use the value of the 45 , the 46 , or the mean of these two values? To find out, you add 1 to the number of numbers and divide the result by 2. As a formula: ranked value median =
n+1 2
90 + 1 1 th th = 45 which is to say that the 45 and 46 2 2 observation are both in the middle. You then have to take the mean of the values concerned. The arithmetic mean is the most commonly used measure of location for quantitative data, i.e. variable data measured on interval or ratio level. You already know how to calculate a mean: ‘add all the numbers together and divide by the number of numbers’. As a formula: Random sample: Population: For 90 ordered numbers, you should therefore take number
n
x=
100
∑ i=1
n
N
xi
µ=
∑x i=1
N
i
4 Distribution characteristics
Take as an example the journey times of the first 7 respondents from the Fiction 2000 workbook: x1 = 35, x2 = 29, x3 = 23, x4 = 32, x5 = 14, x6 = 104, x7 = 58. The mean travelling time ( x )is then: 7
x=
∑ xi i=1
4.2.2
7
=
35 + 29 + 23 + 32 + 14 + 104 + 58 = 42,1 7
Measures of dispersion
Measures of dispersion indicate how widely the data are spread out. There is no measure of dispersion for nominal variables. If you measure at ordinal level or higher, you can determine the range (also known as the variation width or dispersion width), for example. The range is the difference between the highest (maximum) and the lowest (minimum) observed value. For instance, the range of ‘Opinion Metro’ for the first 6 respondents who had ever read Metro is from poor to very good. The actual scores are: good, good, reasonable, poor, very good, good. The range of ‘Travelling time’ for the first 7 respondents is 90 minutes. The actual scores are: 35, 29, 23, 32, 14, 104, 58. The maximum is 104 and the minimum 14, so the range is 104 – 14 = 90 The range is fairly sensitive to extremes at both the high and low ends. For instance, the range of incomes would be €200 million if one person had no income and another had an income of €200 million. A measure of dispersion that is less sensitive to extremes of this kind is the interquartile range, in which the ordered data is divided into 4 groups of 25% each. Q1 = first quartile; 25% of the observations have a score less than or equal to Q1 Q2 = second quartile (median); 50% of the observations have a score less than or equal to Q2 Q3 = third quartile; 75% of the observations have a score less than or equal to Q3 Q4 = fourth quartile (maximum); 100% of the observations have a score less than or equal to Q4 The interquartile range is the difference between Q3 and Q1. As a formula:
interquartile range = Q3 − Q1 You can determine which of the ordered scores to pick for the first quartile as follows.
ranked value Q1 =
n+1 4
You can determine the position in the ranking for the third quartile with the following formula:
ranked value Q 3 =
3(n + 1) 4
The position in the ranking that is determined may be a whole number, in which case the quartile is the score at that position. But, like the median, the position in the ranking may also end in ½, in which case you take the arithmetic mean of the two scores immediately above and below the rounded position number.
101
Statistics using Excel 2007
If the position number ends in ¼, multiply the score at the rounded-‐down position number by ¾ and the score at the rounded-‐up position number by ¼. If the position number ends in ¾, multiply the score at the rounded-‐down position number by ¼ and the score at the rounded-‐up position number by ¾. Take for example the data of the ‘Travelling time’ variable for the first 10 respondents: 35, 29, 23, 32, 14, 104, 58, 55, 44, 5. Sorting these numbers (from low to high) gives: 5, 14, 23, 29, 32, 35, 44, 55, 58, 104. 10 + 1 Dus Q1 = ¼ * 14 + ¾ * 23 = 20 ¾ ranked value Q1 = = 2,75 4 3(10 + 1) ranked value Q3 = = 8,25 Dus Q3 = ¾ * 55 + ¼ * 58 = 55 ¾ 4 The interquartile range = 55 ¾ – 20 ¾ = 35. Neither the range or the interquartile range take account of how far all numbers spread from each other, but only how far two numbers are separated. A measure of dispersion that does take into account how far all numbers spread from the mean is the average absolute deviation (AAD). 1 This measure of dispersion starts by subtracting the mean from each score . The absolute value is then calculated (a positive difference remains positive, a negative difference is made positive). All these positive values are added together and finally divided by the number of observations (scores). As a formula: Random sample: Population: n
A.A.D. =
N
∑| xi − x | i =1
∑ | xi − μ|
n
i =1 A.A.D. =
N
Take for example the data of the ‘Travelling time’ variable for the first 10 respondents: 35, 29, 23, 32, 14, 104, 58, 55, 44, 5. The mean travelling time x = 39,9 minutes. 10
A.A.D. =
∑ |xi− x|
i =1
10
=
=
|35 − 39,9| +|29 − 39,9| +|23 − 39,9| +|32 − 39,9| +|14 − 39,9| +|104 − 39,9| +|58 − 39,9| +|55 − 39,9| +| 44 − 39,9| +|5 − 39,9| = 10
=
4,9 +10,9 +16,9 + 7,9 + 25,9 + 64,1+18,1+15,1+ 4,1+ 34,9 202,8 = = 20,28 10 10
However, the measure of dispersion most frequently used for quantitative variables in statistical practice is the standard deviation. Like the average absolute deviation, the standard deviation circumvents the difficulty that the sum of all deviations from the mean is always zero.
1
The variable must therefore be measured at interval level or higher.
102
4 Distribution characteristics
The average absolute deviation achieves this by summing the absolute values of the deviations rather than the deviations. The standard deviation squares the differences between the values and the arithmetic mean, thus summing only positive deviations. As a formula: Random sample:
Population:
n
∑ i=1
s=
N
2
( xi − x )
n −1
σ =
2
∑ ( x i − µ) i=1
N
Take for example the data of the ‘Travelling time’ variable for the first 10 respondents: 35, 29, 23, 32, 14, 104, 58, 55, 44, 5. The mean travelling time x = 39,9 minutes. 10
s=
2
∑ (xi − x)
i =1
10 - 1
=
=
2 2 2 2 2 2 2 2 2 2 (35 − 39,9) + (29 − 39,9) + (23 − 39,9) + (32 − 39,9) + (14 − 39,9) + (104 − 39,9) + (58 − 39,9) + (55 − 39,9) + (44 − 39,9) + (5 − 39,9) = 9
=
24,01 + 118,81 + 285,61 + 62,41 + 670,81 + 4108,81 + 327,61 + 228,01 + 16,81 + 1218,01 = 9
7060,9 = 784,5444 = 28,0 9
If you do not take the square root of the squared deviations from the mean, the result is referred to as the variance. Or in other words: the variance is the square of the standard deviation. As a formula: Random sample: Population: n 2 s =
∑ i=1
N
2
( x i − x) n −1
2 σ =
2
∑ ( x i − µ) i=1
N
The variance of the ‘Travelling time’ variable of the first 10 respondents is therefore 784.5. The variance and the standard deviation are important concepts in statistics theory and practice. They are used among other things in standardizing observed values, allowing different distributions to be compared with each other. This point is covered in Chapter 7 under the normal distribution. Furthermore, the variance and the standard deviation play an important part in correlation and regression analysis (see Chapter 8). The final measure of dispersion to be mentioned is the coefficient of variance (CV). This measure of dispersion is often used to indicate that a series of observations with a high average and a large standard deviation can have a smaller range in relative terms than one with a low average and a small standard deviation. For instance, the coefficient of variance of the incomes of the Dutch population measured in guilders in 2001 will be the same as that of the incomes of the Dutch population measured in euros in 2001, whereas the standard deviation of the incomes in guilders will be approximately 2.2 times the standard deviation of the incomes in euros.
103
Statistics using Excel 2007
The coefficient of variance is calculated by dividing the standard deviation by the arithmetic mean and multiplying by 100%. As a formula: Random sample: Population:
C. V. =
s * 100% x
C.V. =
σ * 100% μ
Take for example the data of the ‘Travelling time’ variable for the first 10 respondents: 35, 29, 23, 32, 14, 104, 58, 55, 44, 5. The standard deviation s = 28.0 minutes. The mean travelling time = 39.9 minutes. The coefficient of variance =
28,0 * 100% = 70% 39,9
Remarks: 1. the greater the spread, or the more the data differ from each other, the larger the range, the interquartile range, the average absolute deviation, the standard deviation, the variance and the coefficient of variance; 2. the more homogenous the data are, the smaller the range, the interquartile range, the average absolute deviation, the standard deviation, the variance and the coefficient of variance; 3. if all the data are equal (i.e. there are no differences), the range, the interquartile range, the average absolute deviation, the standard deviation, the variance and the coefficient of variance are all zero; 4. none of the measures of dispersion (range, interquartile range, average absolute deviation, standard deviation, variance and coefficient of variance) can ever be negative. 4.2.3 Measures of shape The first aspect of shape to be assessed is symmetry. A data distribution is referred to as symmetrical if it has the same shape on either side of the median. The arithmetic mean in a symmetrical distribution equals the median. If the distribution has only one peak, the mode also equals the median and the arithmetic mean. Therefore, for a unimodal symmetrical distribution: mode = median = arithmetic mean Diagrammatic examples of symmetrical distributions:
Figure 4.1: Unimodal symmetrical distribution
104
4 Distribution characteristics
Figure 4.2: Uniform or rectangular distribution When the shape of the distribution is different on either side of the median, the distribution is referred to as asymmetrical or skewed. The measure of shape for symmetry is skewness. There are three cases of interest in interpreting skewness: 1. skewness = 0, for a symmetrical distribution, where the median equals the arithmetic mean. 2. skewness > 0, for a distribution that is said to have positive skew, or to be right-‐skewed. The peak (head) is left of centre, and the distribution has a long tail on the right. Almost always in this case: mode < median < arithmetic mean. 3. skewness < 0, for a distribution that is said to have negative skew, or to be left-‐skewed. The peak (head) is right of centre and the distribution has a long tail to the left. Almost always in this case: arithmetic mean < median < mode. Diagrammatic examples of skewed distributions: Left skewed distribution
Right skewed distribution
Figure 4.3: Negative, left-‐skewed, distribution Figure 4.4: Positive, right-‐skewed distribution To calculate skewness, the arithmetic mean is subtracted from each score and the result divided by the standard deviation. Each value is then raised to the third power and all the results summed. The sum is finally multiplied by the number of scores and divided by the number of scores –1 and the number of scores –2. As a formula: Random sample: Population:
n Skewness = (n − 1)(n
n
− 2) ∑ i =1
3
( ) xi − x s
1 Skewness = N
N
−μ
∑(xiσ i=1
3
)
Another aspect of the shape of a distribution is its kurtosis, which expresses how flat or sharply pointed it is. A distribution’s kurtosis is compared with a normal distribution (see Chapter 7), which is referred to as mesokurtic, 2 and has a kurtosis of zero . A distribution that is flatter than a normal distribution is referred to as platykurtic, and the kurtosis is less than zero. A distribution that is more sharply pointed than a normal distribution is referred to as leptokurtic and the kurtosis is greater than zero.
2
A distribution can have zero kurtosis but not resemble a normal distribution at all. A chart of the distribution is indispensable in interpreting the kurtosis satisfactorily.
105
Statistics using Excel 2007
Figure 4.5: Diagrammatic examples of leptokurtic, mesokurtic and platykurtic distributions To calculate the kurtosis, the arithmetic mean is subtracted from each score and the result divided by the standard deviation. The values are raised to the fourth power and all the results summed. The sum is then multiplied by the number of scores and the number of scores + 1 and divided by the number of scores –1, the number of scores –2 and the number of scores -‐3. Finally, 3 times the square of the number of scores –1 is divided by the number of scores –2 and the number of scores –3 and subtracted from the above result. There is clearly a lot of calculation involved. As a formula: Random sample: Population:
⎧⎪ n(n + 1) Kurtosis = ⎨ ⎪⎩ (n − 1)(n − 2)(n − 3)
n
∑ i=1
( ) xi − x s
4 ⎫
2
3 (n − 1) ⎪ ⎬ − ⎪⎭ (n − 2)(n − 3)
⎧⎪ 1 Kurtosis = ⎨ ⎪⎩ N
N
∑ i=1
4 ⎫
( ) xi − µ σ
⎪ ⎬ − 3 ⎪⎭
4.3
Distribution characteristics of distinct observations with Excel
If you have seen the formulas for the measures of shape in the previous section, you will be delighted to know that you can use Excel for all the calculations. If you have ‘distinct’ observed values at your disposal, you can calculate measures of location, dispersion and shape using Excel’s: 1. functions; 2. data analysis; 3. pivot tables.
4.3.1
Functions
3 For calculating measures of location or dispersion : AVERAGE(number1;number2;….) : calculates the arithmetic mean of the values in the range; MEDIAN(number1;number2;….) : calculates the median of the values in the range; MODE(number1;number2;….) : calculates the mode of the values in the range; QUARTILE(array;quart) : calculates the k-‐th quartile; PERCENTILE(array;k) : calculates the k-‐th percentile; MIN(number1;number2;….) : calculates the lowest value in the range; MAX(number1;number2;….) : calculates the highest value in the range. 3
There are several statistical functions for calculating means, such as HARMEAN (for calculating the harmonic mean), TRIMMEAN (for calculating the arithmetic mean but disregarding a certain proportion of extremes) and GEOMEAN (for calculating the geometric mean).
106
4 Distribution characteristics
For calculating measures of dispersion: AVEDEV(number1;number2;….) : calculates the average of the absolute deviations of the observed values relative to the mean value; STDEV(number1;number2;….) : gives an estimate of the standard deviation based on a random sample; STDEVP(number1;number2;….) : calculates the standard deviation of the values of the range of the population; VAR(number1;number2;….) : gives an estimate of variance based on a random sample; VARP(number1;number2;….) : calculates the variance of the values of the range of the population. For calculating measures of shape: SKEW(number1;number2;….) : gives an estimate of the skewness based on a random sample; KURT(number1;number2;….) : gives an estimate of the kurtosis (curvature) based on a random sample. The Age variable from the random sample file of the 300 respondents in Chapter 1.4 is taken as an example of calculating the various measures of location, dispersion and shape. You can see below how to calculate these various measures using Excel functions and formulas.
Figure 4.6: Excel functions for calculating measures of location, dispersion and shape Final result:
Figure 4.7: Distribution characteristics of the Age variable
107
Statistics using Excel 2007
To further clarify the measures of location, dispersion and shape, it is helpful to create a chart (histogram) of the above data of the Age variable.
Chart 4.1: Ages of 300 respondents Frequency per year
25 20
15 10
5 0 0
5
10
15
20
25
30
35
40
45
50
55
60
65
Year
Source: Fiction 2000
Mode Median Mean Figure 4.8: Histogram of the Age variable with different location measures
Remarks. 1. The chart shows a bimodal distribution. The scores for both 19 years and 21 years occur most frequently. The Excel MODE function produces at most one mode, which will be the one for 21 years, because it is the first score encountered in the list. 2. The chart shows a distribution that is clearly right skewed. This is reflected in the calculated skewness, which is positive (0.56). Likewise, the relative positions of the measures of location, mode < median < arithmetic mean, point to a right-‐skewed distribution. 3. The chart shows a distribution that is clearly platykurtic (flat). This is reflected in the calculated kurtosis, which is negative (-‐0.63). You can determine the mode for nominal values with Excel by replacing the ‘words’ with numbers (codes), and translating the numerical result back into the associated word. You can also determine the median for ordinal values in the same way. Be careful when changing the ‘words’ of nominal or ordinal variables into ‘numbers’ not to perform operations that are incompatible with the variable’s measurement level (e.g. do not calculate an arithmetic mean for an ordinal variable).
108
4 Distribution characteristics
4.3.2
Data analysis The most important measures of location, dispersion and shape can be obtained using Excel 2007’s data analysis for descriptive statistics. A drawback of data analysis is that a change in any of the original values is not propagated and the data analysis has to be repeated (this also applies to data analysis for histograms (see: Chapter 2.3.1.3)). Take as an example the Age variable of the Fiction 2000 workbook. Method 1. Open the Fiction 2000 workbook and open the Data worksheet with the file for the 300 respondents. 2. Open the Data tab and click Data Analysis in the Analysis group. 3. If either the Analysis group or Data Analysis are not present, click the Office Button and then click Options for Excel. Click Analysis ToolPak and click Go.
Figure 4.9: Opening the Add-‐Ins dialog
109
Statistics using Excel 2007
Check the Analysis ToolPak box and click OK.
Figure 4.10: Activating Analysis ToolPak Data Analysis will now be available. In the Data Analysis dialog select Descriptive Statistics and then click OK.
Figure 4.11: Opening the Descriptive Statistics dialog 4. The Descriptive Statistics dialog appears. • For ‘Input Range:’ select the cells that contain the data for analysis (the cell range for the Age variable is $C$1:$C$301); • the data are arranged in columns, so select the Columns option button; • cell C1 contains a label (‘Age’) so check the ‘Labels in first row’ box; • check the ‘New Worksheet Ply:’ box and enter a name (e.g. ‘Descriptives’); • check the ‘Summary statistics’ box; • check the ‘Confidence Level for Mean:’ box and enter 95%; • to obtain the third quartile and the first quartile check the ‘Kth Largest:’ and Kth Smallest:’ boxes and enter 75 for both (you will then obtain the third and first quartiles).
110
4 Distribution characteristics
The dialog will be as follows.
Figure 4.12: Completed dialog for calculating the characteristics of the Age variable Click the OK button. The following output screen will appear.
Figure 4.13: Data Analysis output for the Age variable Remarks. 1. Not surprisingly, the results are the same as those produced by the functions, except for the first quartile (Smallest 75). This difference is because functions can also handle numbers with fractional parts when calculating quartiles, whereas Data Analysis requires whole numbers. 2. Standard Error and Confidence Level(95.0%) are defined in Chapter 8.
111
Statistics using Excel 2007
4.3.3
Pivot tables
You can obtain the most important measures of location (arithmetic mean) and dispersion (standard deviation and variance) using a pivot table. You can likewise obtain the total, sum, maximum and minimum from the table. Take as an example the Age variable from the Fiction 2000 workbook. Method 1. Open the Fiction 2000 workbook. 2. Open the Data worksheet. Open the Insert tab, click PivotTable in the Tables group and then select PivotTable. 3. The following completed Create PivotTable dialog than appears. Click OK.
Figure 4.14: Create PivotTable dialog 4. Drag the Age variable for each characteristic you want calculated to the ∑ Values box. Next click Count of Age in this box, and select Value Field Settings. Then select the characteristics you wish to calculate, e.g. Average, Min, Max, StDev and Var. You will then obtain the following.
Figure 4.15: Characteristics of the Age variable in a pivot table If you would prefer the data to be arranged vertically in a table, click ∑ Values in the Column Labels area and then select Move to Row Labels.
112
4 Distribution characteristics
Figure 4.16: Moving the labels from column to row The use of pivot tables becomes more interesting when you first break down a numerical variable into categories. You then have a way of comparing the statistics of two or more groups. Example: You want to compare the ages of males and females in the survey. Drag the Gender variable from the field list to the Column Label box. Change the code ‘1’ to ‘male’ and the code ‘2’ to ‘female’. It is now easy to compare the characteristics of males and females with respect to the Age variable.
Figure 4.17: Table of characteristics of Age by Gender The average age of females in the survey is therefore lower than the average age of males, while the standard deviation of the ages is also lower for female than male respondents.
113
Statistics using Excel 2007
4.4
Characteristics for frequency distributions with classes
When performing desk research you will often have no distinct observed values available from a survey, but you may have a frequency distribution of the observed values. If you then wish to calculate (actually estimate) measures of location or dispersion, you will assume that the values in a class are distributed uniformly and that the classes can be represented by their midpoints. The calculations you have to perform are almost the same as those in the previous sections. You have to interpolate to calculate the median and the quartiles. As an example, we will use the frequency distribution and relative cumulative frequency distribution created in Chapter 2 for the ages in the Fiction 2000 workbook. These tables are repeated below.
Figure 4.18: Frequency distribution and cumulative frequency distribution of the Age variable
4.4.1
Measures of location for frequency distributions The mode is simple to estimate. The modal class is the one with the highest frequency density (= frequency / class size) and the mode is the middle of the modal class. In the example you find the highest frequency density in the class 20 -‐ 24 (54 by 5 years), so the modal class is 20 -‐ 24 and the mode is 22.5 years. All the classes in the example have the same width, i.e. 5. If the classes do not all have the same width, you must first calculate the frequency density (= frequency / class size) before determining the modal class. See the table below. Table 4.1: Class grouping of ages Age 10 -‐ < 20 20 -‐ < 25 25 -‐ < 30 30 -‐ < 40 40 -‐ < 65 Total
Frequency 45 54 49 62 90 300
Source: Fiction 2000
114
Frequency density (= frequency per 5 year) = 45 / 2 = 22.5 = 54 / 1 = 54 = 49 / 1 = 49 = 62 / 2 = 31 = 90 / 5 = 18
4 Distribution characteristics
The smallest class size is 5 years. If you take this class size as the standard for calculating the frequency density, you will have to handle the class 10 -‐ < 20 years, which is twice as wide as the standard (2 * 5 year = 10 year). You must then divide the frequency by 2 to obtain the frequency density. Therefore the frequency density = 45 / 2 = 22.5. The highest frequency density is for the modal class, 20 -‐ < 25, and the mode equals the class midpoint, which is 22.5 years. A cumulative percentage frequency distribution can be used to estimate the median of the Age variable. The median is at 50%, which is therefore somewhere between 30 years (49.33%) and 35 years (58.33%). You can estimate the median by interpolating using the following formula: median = L +
50% − CFb CFm − CFb
* IM
L: The low real limit of the class interval that contains the median; CFb: cumulative relative percentage in the class interval below the class interval that contains the median; CFm: cumulative relative percentage in the class interval that contains the median; IM: Interval width of the class interval that contains the median. Applied to the Age variable :
median = 30 +
50% − 49,33% * 5 = 30,37 58,33% − 49,33%
Because a class can be represented by its class midpoint, you can calculate (estimate) the arithmetic mean with the following formula: Random sample:
Population:
j
j
x=
∑ f i * mi
∑ f i * mi i =1
n
µ=
i=1
N
where fi is the frequency in the i-‐th class and mi is the class midpoint of the i-‐th class. In total there are j classes and the total number of observations is n (for a random sample) or N (for a population). Applied to the distribution of the Age variable in classes: x=
5 * 12,5 + 40 * 17,5 + 54 * 22,5 + 49 * 27,5 + 27 * 32,5 + 35 * 37,5 + 32 * 42,5 + 23 * 47,5 + 16 * 52,5 + 12 * 57,5 + 7 * 62,5 = 33,12 300
4.4.2
Measures of dispersion for frequency distributions
As with distinct observations, the range is defined as the difference between the maximum and the minimum, where the maximum is the upper (right-‐hand) limit of the last class and the minimum is the lower (left-‐hand) limit of the first class. Applied to the Age variable : Maximum = 65 Minimum = 10 Range = 55
115
Statistics using Excel 2007
Also as with distinct observations, the interquartile range is defined as the difference between the values of the third and first quartiles. The same formula is used to calculate (estimate) the first quartile (the value below which 25% of the observations fall) and the third quartile (the value below which 75% of the observations fall) as for the median. 25% − CFb 75% − CFb * IQ 1 * IQ 3 Q 1 = LQ 1 + Q 3 = LQ 3 + − CFb CFQ 3 − CFb CFQ1
where: LQ1: The low real limit of the class interval that contains the first quartile; LQ3: The low real limit of the class interval that contains the third quartile; CFb: cumulative relative percentage in the class interval below the class interval that contains the first (third) quartile; CFQ1: cumulative relative percentage in the class interval that contains the first quartile; CFQ3: cumulative relative percentage in the class interval that contains the third quartile; IQ1: Interval width of the class interval that contains the first quartile; IQ3: Interval width of the class interval that contains the third quartile. Applied to the Age variable : The first quartile is in the 20 -‐ 24 class: 25% − 15% * 5 = 22,78 Q1 = 20 + 33% − 15% The third quartile is in the 40 – 44 class:
Q3 = 40 +
75% − 70% * 5 = 42,34 80,67% − 70%
As with determining the arithmetic mean, estimating the average absolute deviation also assumes that a class is represented by its midpoint and the frequency is used as a weighting factor for the class midpoint. As a formula: Random sample: Population: j
A.A.D. =
j
∑ f i * | mi − x | i =1
n
A.A.D. =
∑ f i * | mi − µ | i =1
N
where fi is the frequency in the i-‐th class and mi is the class midpoint of the i-‐th class. In total there are j classes and the total number of observations is n (for a random sample) or N (for a population). Applied to the data of the Age variable of Table 4.1, where the mean age ( ) is 33.78 years: 45 * | 15 − 33,78 | +54 * | 22,5 − 33,78 | +49 * | 27,5 − 33,78 | +62 * | 35 − 33,78 | +90 * | 52,5 − 33,78 | A.A.D. = = 11,74 300
116
4 Distribution characteristics
Finally the two most important measures of dispersion: the variance and the standard deviation. The formula for estimating the variance for a frequency distribution is as follows: Random sample: Population: j
s2 =
j
2
∑ f i *( mi − x ) i=1
n -1
σ2 =
2
∑ f i * (mi − µ)
i=1
N
where fi is the frequency in the i-‐th class and mi is the class midpoint of the i-‐th class. In total there are j classes and the total number of observations is n (for a random sample) or N (for a population). The standard deviation is then the square root of the variance. The formulas applied to the data for the Age variable of Table 4.1:
s
2
=
45 * (15 − 33,78) 2 + 54 * (22,5 − 33,78) 2 + 49 * (27,5 − 33,78) 2 + 62 * (35 − 33,78) 2 + 90 * (52,5 − 33,78) 2 = 187,69 300
and the random sampling variance (s) is then:
s = 187,69 = 13,70
4.5
Characteristics for frequency distributions with classes with Excel
Unfortunately, unlike for distinct observations, Excel has no functions for calculating the characteristics of frequency distributions with classes that can be applied directly. You have to use the formulas. You can use the frequency distribution per class in calculating the arithmetic mean, average absolute deviation, variance and standard deviation. Take as an example the Age variable divided into classes (see the pivot table in 4.4 columns A and B). Method 1. Enter in cell C4: Midpoint (m), in cell D4: fm, in cell E4: f*|m-‐x| and in cell F4: f*(m-‐x)^2. 2. Enter in cell A18: Mean, in cell A19: A.A.D., in cell A20: Variance, and in cell A21: Standard deviation. 3. Enter in cell C5: 12.5 and in cell C6: 17.5. Select cells C4 and C5 and drag the cells up to and including C14 to copy the class midpoints. 4. Enter in cell D5: =B5*C5. 5. Enter in cell E5: =B5*ABS(C5-‐$B$18. 6. Enter in cell F5: =B5*(C5-‐$B$18)^2. 7. Select cells D4 to F4, inclusive, and drag these cells up to and including cell F14 to copy the formulas. 8. Enter in cell D16: =SUM(D5:D15), in cell E16: =SUM(E5:E15) and in cell F16: =SUM(F5:F15), or use the 4 summation button in the Editing group on the Home tab; 9. Enter in cell B18: =D16/B16, in cell B19: =E16/B16, and in cell B20: =F16/(B16-‐1); 5 10. Enter in cell B21: =SQRT(B20) . See the following updated Excel worksheet. 4 5
the summation button. SQRT is in the math and trigonometry function category.
117
Statistics using Excel 2007
Figure 4.19: Excel formulas for calculating the arithmetic mean and measures of dispersion The final result is as follows.
Figure 4.20: Arithmetic mean and measures of dispersion for the Age frequency distribution Remarks. 1. Column A contains the classes, B the frequencies, C the class midpoints, D frequencies * class midpoints (for calculating the arithmetic mean), E the frequencies * absolute deviations of the class midpoints from the arithmetic mean (for calculating the average absolute deviation), and F the frequencies * squared deviations between the class midpoints and the arithmetic mean (for calculating the variance). 2. The estimates of the arithmetic mean, average absolute deviation, variance and standard deviation are close to the actual values. 3. If the data relate to a population rather than a random sample, enter in cell B20: = F16/B16
118
4 Distribution characteristics
You can use the (relative) cumulative frequency distribution in calculating the first quartile, median, third quartile and interquartile range. Take as an example the Age variable (see the pivot table in 4.4 columns D and E). Method 1. Enter in cell A18: First quartile (Q1), in cell A19: Median, in cell A20: Third quartile, and in cell A21: Interquartile range; 2. Enter in cell B18: =20+((25%-‐E6)/(E7-‐E6))*5; 3. Enter in cell B19: =30+((50%-‐E8)/(E9-‐E8))*5; 4. Enter in cell B20: =40+((75%-‐E10)/(E11-‐E10))*5; 5. Enter in cell B21: =B20-‐B18. See the lower part of the Excel worksheet.
Figure 4.21: Excel worksheet with tables and associated formulas for the quartiles The final answers are then:
Figure 4.22: Results of the calculation of the quartiles and interquartile range If you compare these estimates with the actual values, then you will see that they too are satisfactory (Q1 = 21.25, Q2 = 30 and Q3 = 42).
119
Statistics using Excel 2007
4.6
Exercises
Exercises 1 to 8 refer to the Fiction 2000 workbook. 1. Determine the measures of location (arithmetic mean and median), the measures of dispersion (standard deviation and interquartile range) and the measures of shape (skewness and kurtosis) of the data of the 300 respondents with respect to the Income variable. 2. Determine the mode and median of the data of the respondents who had ever read Spits, with respect to the ‘Opinion Spits’ variable. 3. Determine the measures of location (arithmetic mean, median and mode), the measures of dispersion (standard deviation, average absolute deviation and interquartile range) and the measures of shape (skewness and kurtosis) of the data of the 300 respondents with respect to the ‘Travelling time’ variable, broken down according to the Gender variable. 4. Determine the mode of the data of the 300 respondents with respect to the ‘Daily newspaper’ variable. 5. Determine the measures of location (arithmetic mean, median and mode), the measures of dispersion (standard deviation, variance and interquartile range) and the measures of shape (skewness and kurtosis) of the data of the 300 respondents with respect to the ‘Weekly travel days by train’ variable broken down according to the Gender variable. 6. The following table was created for the ‘Travelling time’ variable: Travelling time by train (in minutes) Number 0 -‐< 20 68 20 -‐< 30 56 30 -‐< 40 57 40 -‐< 50 25 50 -‐< 60 45 60 -‐< 100 38 100 -‐< 150 11 Grand total 300 Source: Fiction 2000
Determine the measures of location (arithmetic mean, median and mode), the measures of dispersion (standard deviation, variance and interquartile range) and the measures of shape (skewness and kurtosis) of the data of the 300 respondents with respect to the ‘Travelling time’ variable based on the data in the above table.
120
7.
4 Distribution characteristics
The following table was created for the Income variable: Income (* €100) Total 0 -‐< 5 51 5 -‐< 10 32 10 -‐< 15 69 15 -‐< 20 85 20 -‐< 25 42 25 -‐< 30 17 30 -‐< 40 4 Grand total 300 Source: Fiction 2000
Determine the measures of location (arithmetic mean, median and mode), the measures of dispersion (range, average absolute deviation, standard deviation and interquartile range) and the measures of shape (skewness and kurtosis) of the data of the 300 respondents with respect to the ‘Income’ variable based on the data in the above table. 8.
The following table was created for the breakdown of the Income variable by the Gender variable: Income (* €100) male Female Grand total 0 -‐< 5 38 13 51 5 -‐< 10 16 16 32 10 -‐< 15 45 24 69 15 -‐< 20 67 18 85 20 -‐< 25 36 6 42 25 -‐< 40 21 -‐ 21 Grand total 223 77 300
Source: Fiction 2000
Determine the measures of location (arithmetic mean, median and mode) and the measures of dispersion (range and standard deviation) of the data of the 300 respondents broken down according to Gender with respect to the ‘Income’ variable based on the data in the above table.
121
Statistics using Excel 2007
122
5 Probabilities
5
Probabilities
5.1
Introduction
The previous three chapters explained how to create tables and charts and calculate distribution characteristics that describe the group of 300 survey respondents (the random sample). Some of the conclusions drawn in Chapter 2 about the random sample are that: 77 of the 300 (25.7%) morning commuters were women; Metro tended to be read by more morning commuters (163) than Spits (150). Actually, researchers, and certainly clients, are far less interested in the results of a specific random sample than in what the results mean for the target group of all morning commuters (the population). For example, are you right to conclude from your random sample that Metro is read by more morning commuters than Spits, or might the group of Metro readers be overrepresented in your random sample, simply by chance? Unfortunately, the element of chance in a random sample is beyond dispute, but its role can be illuminated using statistics. It is therefore important to have some knowledge of probabilities. Probability is a measure of how likely it is that a given event will occur. The probability that you will manage to catch a given train is a subjective probability that different people will estimate very differently. For example, you might suspect that you will miss the train, while your friend remains convinced that you can still make it. Similarly, the designer of a new product may believe that it will be a great success, while the marketing director disagrees completely. The magnitude of a subjective probability therefore depends on who assesses it. Probabilities of this kind are not handled in this chapter. If you know that half the morning commuters read Spits, you can calculate in advance, or a priori, that the probability that a commuter selected arbitrarily will read Spits is 50%. The magnitude of an a priori probability is independent of who calculates it. Everyone would arrive at 50% in the above example. If you don’t know how many morning commuters read Spits, you can’t calculate in advance the probability that a commuter selected arbitrarily will read Spits. However, you can determine this kind of probability through investigation. For example, you could ask a group (a random sample) of 300 morning commuters whether they read Spits. If out of this group 150 read Spits, the empirical (i.e. experimentally determined) probability is 150/300 = 0.5, or 50%. In common parlance probability and chance are used interchangeably. For example, it may be said that the government is to give the railway operator one more chance to achieve the target of 80% of trains running on time within six months. This is a qualitative use of probability, in that it cannot be measured numerically. This chapter is about calculating quantitative probabilities. There are two ways of representing these probabilities numerically: 1. as a fraction, or a number between 0 and 1; 2. as a percentage, or a number between 0% and 100%. The extremes 0 (=0%) and 1 (=100%) are allowed when calculating probabilities. 0 means that it is impossible for the event to occur, and 1 means that it is certain that the event will occur.
123
Statistics using Excel 2007
5.2
Calculating with probabilities This chapter is about calculating quantitative probabilities. We use the Laplace rule: P=
X where: T
P: Probability of occurrence X: Number of ways in which the event occurs T: Total number of possible outcomes The Laplace rule is often misapplied. For instance, sometimes people will say that there is a 50% probability of an arbitrary train being delayed, reasoning that there are two options: the train is delayed, or the train is not delayed. The rule is then applied without satisfying the condition ‘if all outcomes are equally probably’. It is therefore important to know the number of favourable outcomes and the total number of outcomes, and whether all these outcomes are equally probable. This is not always a simple matter, but you can use various probability rules, which are explained with reference to the contingency table below. Tabel 5.1: Aantal “Metro” lezers per geslacht Gender Metro No Yes Grand Total Male 111 112 223 Female 26 51 77 Grand Total 137 163 300 Source: Fiction 2000
a.
Complement rule:
P( A' ) = 1 − P( A ) In words, the probability that an event will not occur (A’) is 1 -‐ the probability that the event will occur. The probability that a morning commuter reads Metro: P(Metro) = 163/300 = 54.3%. The probability that a morning commuter does not read Metro: P(No Metro) = 1 – 163/300 = 137/300 = 45.7%. The complement rule is based on the fact that the experiment has only two possibilities: G occurs or G does not occur, or P(A) + P(A’) = 1. You use the complement rule when P(A’) is easier to calculate than P(A).
124
5 Probabilities b.
Sum rule:
c.
P(A or B) = P(A) + P(B) - P(A and B) The probability of event A or event B occurring is equal to the probability of event A plus the probability of event B minus the probability of both events occurring. The probability that a morning commuter reads Metro or is female: P(Metro or Female)=P(Metro) + P(Female) – P(Metro and Female) = 163/300 + 77/300 – 51/300 = 189/300 = 63%.
Product rule (for statistically independent events): P(A and B) = P(A)*P(B) The probability of event A and event B occurring is equal to the probability of event A times the probability of event B, provided the two events are independent of each other. You can use this probability rule in two ways: if you know that two events are independent of each other (have no mutual influence), you can calculate the probability of two events by multiplying the two probabilities; if the probability of two events equals the product of the two separate probabilities, you can conclude that the two events are independent of each other. However, if they are unequal, you know that the two events are mutually dependent. The probability that a morning commuter reads Metro and is female: P(Metro and Female) = 51/300 = 17% and P(Metro) * P(Female) = 163/300 * 77/300 = 13.9% Conclusion: whether a person reads Metro depends on gender. In relative terms the percentage of women who read Metro (=66.2%) is different from the percentage of men who read Metro (=50.2%). The two events in the example below are independent. Of the men 120/200 = 60% read Metro, and for the women the figure is 60/100 = 60%.
Gender
d.
Metro No 80 40 120
Yes 120 60 180
Grand Total 200 100 300
Male Female Grand Total You can use the product rule with this table: P(Metro and Female) = 60/300 = 20%. and P(Metro) * P(Female) = 180/300 * 100/300 = 20%. General product rule: P(A and B) = P(A | B) * P(B) The probability of event A and event B occurring equals the probability that event A will occur while you know that event B has taken place, times the probability of event B. You can always apply this probability rule, and it does not matter whether the two events are mutually dependent or independent. The probability that a morning commuter reads Metro and is female: P(Metro and Female) = P(Metro | Female) * P(Female) = 51/77 * 77/300 = 17%. P(Metro | Female) means the probability that a morning commuter reads Metro where you already know that the morning commuter concerned is female. Of the 77 women, 51 read Metro, so this probability is 51/77.
125
Statistics using Excel 2007
5.3
Probabilities via pivot tables
Probabilities can be calculated conveniently using pivot tables, by expressing the numbers in percentages of columns, rows, or the total. Take the Fiction 2000 file of the 300 respondents. You want to know the probability that a person selected at random from the file travels by train 5 days a week. Method 1. Open the Fiction 2000 workbook and then open the Data worksheet. 2. Open the Insert tab, click PivotTable in the Tables group and then select PivotTable.
3.
Figure 5.1: Opening the create pivot table dialog All data of the Data worksheet are automatically selected for the pivot table range. It would suffice for creating the pivot table of the ‘Travel days’ variable to select the cell range E1:E301, but it is easier to select all the data and later specify which data you want to be shown in the pivot table. Specify that the pivot table be created on a New Worksheet.
Figure 5.2: Selecting the range for the pivot table and its location
126
5 Probabilities 4.
5.
An empty pivot table is created on a new worksheet and the PivotTable Field List (containing the variables in the first row of the Data worksheet, (A1:L1)) is opened. Drag the ‘Travel days’ variable to the Row Labels box and to the ∑ Values box. Now click ‘Sum of Travel days’ box in this box and select Value Field Settings in the drop-‐down list.
Figure 5.3: Opening the Value Field Settings dialog Change Sum to Count and then open the ‘Show values as’ tab.
Figure 5.4: Changing from Sum to Count and opening the ‘Show values as’ tab
127
Statistics using Excel 2007 6.
Click the arrow to the right of Normal and select % of total in the drop-‐down list. Then click OK.
Figure 5.5: Changing the values from Count to % of total The table should be as follows.
Figure 5.6: Pivot table of the ‘Travel days’ variable shown as a percentage of total There is therefore a 54% probability that a person selected at random from the group of 300 respondents will travel by train 5 days a week. You can create a pivot table similarly for the probability of randomly selecting a man or a woman from the group of 300 respondents.
Figure 5.7: Pivot table of the ‘Travel days’ variable and a pivot table of the Gender variable
128
5 Probabilities If you now wanted to determine the probability of selecting at random from the group of 300 respondents someone who travels 5 days a week and is female, then you could consider using the product rule: P(5 days and Female) = P(5 days) * P(Female) = 54% (=B8) * 25.67% (=E5) = 13.86% But this rule applies only if number of travel days and gender are independent events. This would mean that 54% of the men travel 5 days a week and 54% of the women 5 days a week. You can determine the mutual independence using a contingency table combining the ‘Travel days’ and gender variables. Method 1. Insert a new pivot table and place it in cell A12 of the same worksheet as the previous two tables. 2. Drag the Gender variable to the Column Labels box (and change the codes there from ‘1’ to ‘Male’ and from ‘2’ to ‘Female’). Drag the ‘Travel days’ variable to the Row Labels box, and to the ∑ Values box (the Gender variable may be dragged there instead). Change the presentation from Total to Count and change the presentation From Normal to % of total. The following table then appears.
Figure 5.8: Contingency table of the ‘Travel days’ variable and Gender variable as percentages of the total
The probability of selecting at random from the group of 300 respondents someone who travels by train 5 days a week and is female is therefore 9.67%, not 13.86%. You may therefore conclude that the ‘Travel days’ and ‘Gender’ variables are dependent. You could also have reached this conclusion with a similar contingency table, but showing the counts as % of column. See the table below.
Figure 5.9: Contingency table of the ‘Travel days’ and Gender variables as percentages of column This table shows that men and women do not (in relative terms) travel equally often by train each week. 59.64% of the men travel 5 days a week, compared with ‘only’ 37.66% of the women. The variables ‘Travel days’ and Gender are therefore dependent.
129
Statistics using Excel 2007
You could also have used the general product rule to calculate the probability of a person who travels by train 5 days a week and is female: P(Number of travel days = 5 and Gender = Female) = P(Number of travel days = 5 | Gender = Female) * P(Gender = Female) = 37.66% (cell C26) * 25.67% (cell E5) = 9.67% To summarize: 1. for calculating a probability of event 1 and event 2: use ‘% of total’ in the contingency table. 2. for calculating a probability of event 1 (shown by ROW) under the condition that event 2 (shown by COLUMN) has occurred: use ‘% of column’ in the contingency table . 3. for calculating a probability of event 2 (shown by ROW) under the condition that event 1 (shown by COLUMN) has occurred: use ‘% of row’ in the contingency table. In order to determine the probability that a person selected at random from the group of 300 respondents will travel by train five days a week or is female, you can use the sum rule and the data from the first three pivot tables. P(Number of travel days = 5 or Gender = Female) = P(Number of travel days = 5) + P(Gender = Female) -‐ P(Number of travel days = 5 and Gender = Female) = 54.00% (cell B9) + 25.67% (cell E6) – 9.67% (cell C20) = 70.00%. Finally, an example to illustrate some other uses of pivot tables. What is the probability that a person selected at random from the group of respondents will have ever read Metro, be female, and think Metro is good or very good? You could explore this using Report Filter. Method 1. Insert a new pivot table. 2. Drag the Gender variable to the Column Labels box (and change the codes there from ‘1’ to ‘Male’ and from ‘2’ to ‘Female’), the variable ‘Opinion Metro’ to the Row Labels box (change the codes there from ‘0’ to ‘No opinion’, ‘1’ to ‘Very bad’, ‘2’ to ‘Bad’, ‘3’ to “Reasonable”, ‘4’ to ‘Good’ and ‘5’ to ‘Very good’) and again the ‘Opinion Metro’ variable to the ∑ Values box (the Gender variable may be dragged there instead). Change the presentation from Total to Count and change the presentation from Normal to % of total. 3. Finally drag the Metro variable to the Report Filter box. The Excel worksheet will then be as follows.
130
5 Probabilities
4.
Figure 5.10: Contingency table of Opinion Metro by Gender, filtered on Metro The table currently shows the results of all respondents and not only those who have ever read Metro. Click the arrow by (All), which is above the table next to Metro, to select the values for the variable Metro to be shown. ‘1’ is the code for the group that have ever read Metro, so click ‘1’ and then OK.
Figure 5.11: Filtering the pivot table for the group that have ever read Metro The table is then as follows.
131
Statistics using Excel 2007
5.
Figure 5.12: Contingency table of Opinion Metro by Gender for the group of Metro readers (code 1) Finally you could merge the groups for Good and Very good. Select the cells A37 and A38 and click on Group Selection in the Group in the Options tab.
Figure 5.13: Grouping the values Good and Very good The Excel worksheet is now as follows.
132
5 Probabilities
6.
Figure 5.14: Table of the grouped Opinion Metro2 and ungrouped values of Opinion Metro This table is unclear. The row labels include both the grouped and ungrouped labels. Furthermore the values in the table for Group1 still have to be added together. Drag the Opinion Metro variable from the Row Labels box to the ‘Choose fields to add to report:’ box. The pivot table now shows only the labels of the group. Change the label from ‘Group1’ to ‘Good or very good’. The pivot table is then as follows. You can see that the Good and Very good percentages have been added together.
Figure 5.15: Contingency table of Opinion Metro by Gender for Metro readers It is now easy to read from the table that almost 16% (15.95%) of Metro the readers are female who think it is good or very good.
133
Statistics using Excel 2007
5.4
Exercises
The exercises below refer to data from the Fiction 2000 workbook. 1. Determine the probabilities of selecting at random a person from the group of 300 respondents who meets the following criteria: a. has a subscription to the Volkskrant; b. male and has a subscription to De Telegraaf; c. is 20 years of age or more; d. is male with an income less than 15 (* €100). 2. Determine the probabilities of selecting at random a person from the group of 300 respondents who has ever read Spits and meets the following criteria: a. has a subscription to the Algemeen Dagblad; b. has a subscription to the Algemeen Dagblad or De Telegraaf; c. has a very bad or bad opinion of Spits; d. is at least 30 years of age and thinks Spits is good. 3. The following contingency table was created using data from the Fiction 2000 workbook. Table 5.2: Numbers of Metro and Spits readers Metro Spits No Yes Grand Total No 60 77 137 Yes 90 73 163 Grand Total 150 150 300 Source: Fiction 2000
a. b. c. d. e. 4.
Assume that the respondent reads Spits. What is the probability that this Spits reader also reads Metro? Assume that the respondent reads Metro. What is the probability that this Metro reader also reads Spits? What is the probability that an arbitrary respondent reads both Metro and Spits? What is the probability that an arbitrary respondent reads Metro or Spits? Are the variables Metro (reads Metro) and Spits (reads Spits) statistically independent? Give reasons for your answer.
The following contingency table was created using data from the Fiction 2000 workbook. Table 5.3: Subscription to Volkskrant by gender Volkskrant
Gender Male No 183 Yes 40 Grand Total 223
Female Grand Total 57 240 20 60 77 300
Source: Fiction 2000
a. b. c. d.
134
Assume that the respondent is female. What is the probability that she has a subscription to the Volkskrant? Assume that the respondent is male. What is the probability that he has a subscription to the Volkskrant? What is the probability that an arbitrary respondent will have a subscription to the Volkskrant? Are the variables Volkskrant and Gender independent? Give reasons for your answer.
6 Discrete probability distributions
6
Discrete probability distributions
6.1
Introduction
A discrete variable is one with a limited number of possible values. A variable that is measured on a nominal or ordinal level is a discrete variable. E.g. gender has only two possible values: male or female. Variables that are measured on an interval or ratio level can also be discrete variables. E.g. the number of children in a family will be somewhere in the range 0 to 20. At any rate, the range is not unlimited, and only whole numbers are allowed as values. A discrete probability distribution involves a discrete variable with a probability associated with each possible value. The most important types of discrete probability distribution are discussed in this chapter. 6.2 Binomial probability distribution A binomial probability distribution has the following properties: 1. the possible values can be gathered using two random sampling methods: a. from an infinite population without replacement; b. from a finite population with replacement; 2. each outcome can be viewed as a success or a failure; 3. the probability of success, which is designated with the letter ‘p’, does not change, and neither does the probability of failure: q = 1 -‐ p; 4. each outcome is independent of previous outcomes. Examples 1. If you ask two morning commuters selected at random whether they had read Metro that morning, then the number of Metro readers (X) in the random sample is binomially distributed, because: • it is a random sample (n = 2) from an ‘infinite’ (the number of morning commuters is enormous) population without replacement; • each surveyed passenger is deemed a ‘success’ (read Metro) or a ‘failure’ (had not read Metro); • if 30% of the morning commuters had read Metro on that day, then the probability of success is p = 0.3 and the probability of failure is q = 0.7; • whether or not a surveyed morning commuter has read Metro does not depend on whether a previous morning commuter had read Metro. Table 6.1: Probability distribution for a binomial distribution with p = 0.3 and n = 2 Possible outcomes success, success success, failure failure, success failure, failure
P (Probability) 0,3 * 0,3 = 0,09 0,3 * 0,7 = 0,21 0,7 * 0,3 = 0,21 0,7 * 0,7 = 0,49
X (Number of successes) 2 1 1 0
The probability of the number of Metro readers being 2, P(X=2) is therefore: 2 2 probability of success * probability of success = probability of success = p . The probability of the number of Metro readers being 1, P(X=1) is therefore: probability of success * probability of failure + probability of failure * probability of success = p*q + q*p = 2*p*q. The probability of the number of Metro readers being 0, P(X=0) is therefore: 2 probability of failure * probability of failure = q*q = q .
135
Statistics using Excel 2007 2.
You blindly distribute three complementary tickets to an exhibition among a group of ten train passengers consisting of four women and six men (the blind distribution means that is it possible for a single train passenger to receive all three tickets). The number of complementary tickets that go to a female (X) is then binomially distributed, because: • three tickets are distributed, the random sample (n = 3) occurs with a finite number of people, the population (N = 10), with replacement (a person who receives the first ticket may also receive the second); • each ticket distributed is a ‘success’ if received by a woman, or a ‘failure’ if received by a man; • the probability of success p = 4/10 = 0.4 and the probability of failure q = 6/10 = 0.6; • whether a ticket goes to a woman or a man does not depend on whether the previous ticket went to a woman or a man. Table 6.2: Probability distribution for a binomial distribution with n = 3 and p = 0.4
Possible outcomes woman, woman, woman woman, woman, man woman, man, woman man, woman, woman woman, man, man man, woman, man man, man, woman man, man, man
P (Probability) 3 0 0,4*0,4*0,4 = 0,4 *0,6 = 0,064 2 1 0,4*0,4*0,6 = 0,4 *0,6 = 0,096 2 1 0,4*0,6*0,4 = 0,4 *0,6 = 0,096 2 1 0,6*0,4*0,4 = 0,4 *0,6 = 0,096 1 2 0,4*0,6*0,6 = 0,4 *0,6 = 0,144 1 2 0,6*0,4*0,6 = 0,4 *0,6 = 0,144 1 2 0,6*0,6*0,4 = 0,4 *0,6 = 0,144 0 3 0,6*0,6*0,6 = 0,4 *0,6 = 0,216
X (=Number of women) 3 2 2 2 1 1 1 0
There is 1 possible outcome where there are 3 women in the random sample, with a probability of P(X=3) = 1* 3 0 p *q . There are 3 possible outcomes where there are 2 women in the random sample, each with a probability of 2 1 p *q . 2 1 Therefore P(X=2) = 3 * p *q . There are 3 possible outcomes where there is 1 woman in the random sample, each with a probability of 1 2 p *q . 1 2 Therefore P(X=1) = 3 * p *q . There is 1 possible outcome where there are no women at all in the random sample, with a probability of 0 3 P(X=0) = 1* p *q . The following formula is used to calculate the probability of k successes for a sample size of n and a probability of success of p.
⎛ n ⎞ P(X = k | n ; p) = ⎜⎜ ⎟⎟ * pk * (1 − p)n −k ⎝ k ⎠ where:
⎛ n ⎞ n! 9 ⎜⎜ ⎟⎟ = the number of possibilities ; k k !*( n − k )! ⎝ ⎠ pk means the probability of success raised to the power of the number of successes; (1-‐p)n-‐k means the probability of failure raised to the power of the number of failures. The binomial probability distribution is therefore determined by only two data, the sample size n (or the number of trials n), and the probability of success p. These two data, n and p, are also referred to as the parameters of the binomial probability distribution. 9
n! = n * (n-‐1) * (n-‐2) * (n-‐3) * ……..* 1. Voorbeeld: 5! = 5 * 4 * 3 * 2 * 1 = 120 136
6 Discrete probability distributions
6.2.1
Probabilities of binomial distributions with Excel
Excel uses the statistical function BINOMDIST for calculating binomial probabilities. Method 1. Select the cell where you want to calculate the binomial probability. 2. Click Insert Function in the Function Library group on the Formulas tab, or simply click fx in the Formula Bar.
3.
Figure 6.1: Opening the Insert Function dialog Select Statistical from the list adjacent to ‘Or select a category:’, select BINOMDIST from the list adjacent to ‘Select a function:’ and finally click OK.
Figure 6.2: Opening the BINOMDIST function arguments dialog The following dialog appears.
137
Statistics using Excel 2007
4.
Figure 6.3: BINOMDIST function arguments dialog By Number_s enter the number of successes (k). By Trials enter the number of trials or the size of the random sample (n). By Probability_s enter the probability of success (p). By Cumulative enter either FALSE or 0 (alternatively, either TRUE or 1 if you want to calculate P(X < k)). Finally click OK.
Example 10 What is the probability that in a random sample of 300 passengers 77 are female, given that 25% of all morning commuters are female? P(X = 77 | n = 300 ; p = 0.25) = ? where X is the number of women in the random sample. Number_s = 77 ; Trials = 300 ; Probability_s = 0.25 ; Cumulative = FALSE (you want to calculate the probability of exactly 77 women) P(X = 77 | n = 300 ; p = 0.25) =BINOMDIST(77;300;0.25;FALSE) = 0.0508
Figure 6.4: Completed dialog for calculating a binomial probability 10
You should actually take a random sample with replacement here, because the population of morning commuters is a finite group. However, since the group is large (there are certainly more than 100,000 morning commuters) you may assume that drawing a relatively small group (300) from this large group will hardly alter the ratio of men to women. Consider the following extreme example. Suppose that 25,000 of the 100,000 morning commuters are female. Then the probability of the first to be drawn being a woman is 25,000/100,000 = 0,250. Suppose that 299 people have already been drawn, and by coincidence they are all female (which is very unlikely), th then the probability that the 300 person is also female is: 24,701 / 99,701 = 0,248 In other words, even in the most extreme case, the probability of success remains almost constant.
138
6 Discrete probability distributions If you wanted to calculate the probability of 77 or fewer females: P(X < 77 | n = 300 ; p = 0.25) then enter by Cumulative TRUE (or 1): =BINOMDIST(77;300;0.25;TRUE) = 0.6342
Figure 6.5: Completed dialog for calculating a cumulative binomial probability Finally, if you wanted to calculate the probability of 77 or more females, use the complement rule: P(X > 77 | n = 300 ; p = 0.25) = 1 – P(X < 77) = 1 – P(X <76) = =1-‐BINOMDIST(76;300;0.25;TRUE) = 0.4166 Now you know how to use the BINOMDIST function in Excel, you also have a simple and rapid way to create a complete binomial probability distribution table for all X values (number of successes) for an arbitrary number of trials (n) and a random probability of success (p). Method 1. Enter in cell A1: Computing Binomial Probabilities 2. Enter in cell A3: Sample size. In the adjacent cell B3 you will later enter the sample size n. 3. Enter in cell A4: Probability of success. In the adjacent cell B4 you will later enter the probability of success p. 4. Enter in cell A5: Expected value 5. Enter in cell B5: =B3*B4. 6. Enter in cell A6: Variance 7. Enter in cell B6: =B3*B4*(1-‐B4). 8. Enter in cell A7: Standard deviation. 9. Enter in cell B7: =SQRT(B6). 10. Enter in cell A9: X 11. Enter in cell B9: P(X) 12. Enter in cell C9: P(<=X) 13. Enter in cell A10: 0 14. Enter in cell B10: =BINOMDIST(A10;$B$3;$B$4;0)
139
Statistics using Excel 2007
Figure 6.6: Completed dialog for cell B10 in which the binomial probability will be calculated. 15. Enter in cell C10: =BINOMDIST(A10;$B$3;$B$4;1). 16. Enter in cell A11: 1. 17. Select cells A10 and A11 and drag the selection (move the pointer over the lower right-‐hand corner of cell A11 until it turns into a ‘+’ symbol, and then press and hold down the left mouse button until you reach the cell in column A that you want to calculate the binomial probability for (e.g. to A16 if you want to calculate the binomial probability for a maximum of 6 successes);
Figure 6.7: Copying a series by dragging 18. Select cell B10 and drag it to the cell in column B that is in the same row as where you stopped in column A (e.g. to cell B16). 19. Select cell C10 and drag it to the cell in column C, that is in the same row as where you stopped in column A (e.g. to cell C16). See the following Excel worksheet.
140
6 Discrete probability distributions
Figure 6.8: Excel worksheet with formulas for calculating binomial and cumulative binomial probabilities Remarks. 1. Cell B5 holds the formula for calculating the expected value (the mean). The expected value for a binomial probability distribution is n * p; 2. Cell B6 holds the formula for calculating the variance. The variance for a binomial probability distribution is n * p * (1-‐p); 3. Cell B7 holds the formula for calculating the standard deviation. The standard deviation is the square root of the variance. 4. In cell B10, the address of cell B3 is shown with $ characters, which will prevent the address changing when it is copied (by dragging). The same is true for cell B4. If a value for n is now entered in cell B3 and a value for p in cell B4, the binomial probabilities for 0 to six 6 successes are calculated in cells B10:B16, and the cumulative binomial probabilities for 0 to 6 successes in cells C10:C16. Take as an example drawing a random sample of 6 people from an infinite population of people of equal numbers of women and men, P(female) = 0.5. The number of women in the random sample may then vary from 0 up to and including 6. If you then enter 6 in cell B3 and 0.5 in cell B4 you will obtain the following result.
Figure 6.9: Completed Excel worksheet with calculated binomial probabilities
141
Statistics using Excel 2007
The table will change if you enter a different sample size (n), or a different probability of success (p). The table can easily be extended downwards by copying (for a larger random sample), enabling the probability to be calculated for a larger number of successes. You can create a histogram of the binomial probability table that has just been produced. Method 1. Open the Insert tab in the Ribbon and click Column in the Charts group. 2. Select the first subtype in the 2-‐D Column group. 3. Open the Design tab in the ribbon and then click Data in the Data group. 4. Select for ‘Chart data rage:’ cells B10:B16. 5. Click Edit in the Horizontal (Category) Axis Labels area and select cells A10:A16.
Figure 6.10: Completed Select Data Source dialog 6. 7.
Enter the chart title, the title for the Primary Horizontal Axis (X) and the title for the Primary Vertical Axis (Y), and unselect Legend (available only in the Labels group on the Layout tab). Double-‐click any column. Click Format Selection in the Current Selection group on the Layout tab. Set the Gap Width to 0%. The chart will be as follows.
Figure 6.11: Chart of the binomial probability distribution for a 50% probability of success and a random sample of 6 142
6 Discrete probability distributions
6.3
Hypergeometric probability distribution
The hypergeometric probability distribution closely resembles the binomial probability distribution. The difference is in the sampling method: A hypergeometric probability distribution involves a random sample from a finite population without replacement. The probability of success changes, as does the probability of failure. An outcome is dependent on previous outcomes. Example You distribute three complementary tickets to an exhibition among a group of ten train passengers consisting of four women and six men. A passenger that receives a ticket is no longer eligible to receive any more. The number of complementary tickets that go to a female (X) is then hypergeometrically distributed, because: • three tickets are distributed with random sampling (n = 3), from a finite number of people, the population (N = 10), without replacement (once anyone receives a ticket they are removed); • each ticket distributed is a ‘success’ if received by a woman, or a ‘failure’ if received by a man; • whether a ticket goes to a woman or a man depends on whether the previously distributed ticket went to a woman or a man. Table 6.3: Probability distribution for a hypergeometric distribution with n = 3, N = 10 and number of successes in the population A = 4 Possible outcomes P (Probability) X (=Number of women) 4 3 2 woman, woman, woman 3 * * = 0,0333 10 9 8 4 3 6 woman, woman, man 2 * * = 0,1000 10 9 8 4 6 3 woman, man, woman 2 * * = 0,1000 10 9 8 6 4 3 * * = 0,1000 man, woman, woman 2 10 9 8 4 6 5 woman, man, man 1 * * = 0,1667 10 9 8 6 4 5 man, woman, man 1 * * = 0,1667 10 9 8 6 5 4 man, man, woman 1 * * = 0,1667 10 9 8 6 5 4 man, man, man 0 * * = 0,1667 10 9 8 Every possibility with the same number of successes in the random sample produces the same probability. For example, therefore, if the random sample includes one man, it is irrelevant to the probability whether the man was drawn first, second or last. You can use the following formula to calculate the probability of k successes with a sample size of n from a population size of N in which there are A successes (and N -‐ A failures).
⎛ A ⎞⎛ N − A ⎞ ⎜⎜ ⎟⎟⎜⎜ ⎟ k ⎠⎝ n − k ⎟⎠ ⎝ P( X = k | n;N; A ) = ⎛ N⎞ ⎜⎜ ⎟⎟ ⎝ n ⎠ The hypergeometric probability distribution is therefore determined by three data, the sample size n (or the number of trials n), the population size N, and the number of successes in the population, A. These three data n, N and A, are also referred to as the parameters of the hypergeometric probability distribution. 143
Statistics using Excel 2007
6.3.1
Probabilities of hypergeomtric distributions with Excel
Excel uses the statistical function HYPERGEOMDIST for calculating hypergeometric probabilities . Method 1. Select the cell where you want to calculate the hypergeometric probability. 2. Click Insert Function in the Function Library group on the Formulas tab, or simply click fx in the Formula Bar. 3. Select Statistical from the list adjacent to ‘Or select a category:’, select HYPERGEOMDIST from the list adjacent to ‘Select a function:’ and finally click OK. The following dialog appears.
4.
Figure 6.12: Hypergeometric distribution dialog By Sample_s enter the number of successes in the sample (k). By Number_sample enter the number in the random sample (n). By Population_s enter the number of successes in the population (A). By Number_pop enter the number in the population (N).
Example What is the probability that when selecting ten people at random (without replacement) from the group of 300 respondents, half of them (5) will be female? P(X = 5 | n = 10 ; N = 300 ; A = 77) = ? where X is the number of women in the random sample of 10. Sample_s: 5 ; Number_sample: 10 ; Population_s: 77 ; Number_pop: 300 P(X = 5 | n = 10 ; N = 300 ; A = 77) =HYPERGEOMDIST(5;10;77;300) = 0.0621
144
6 Discrete probability distributions
Figure 6.13: Completed dialog for calculating a hypergeometric probability You can now create a complete hypergeometric probability distribution table in the same way as for a binomial probability distribution.
Figure 6.14: Excel worksheet with formulas for calculating hypergeometric and cumulative hypergeometric probabilities Remarks. 1. Cell B6 holds the formula for calculating the expected value (the mean). The expected value is n * A / N. 2. Cell B7 holds the formula for calculating the variance. The variance is (n * A (N – A) / N2) * ((N -‐ n) / (N – 1)). 3. Cell B8 holds the formula for calculating the standard deviation. The standard deviation is the square root of the variance. 4. In cell B11, the address of cell B3 is shown with $ characters, which will prevent the address changing when it is copied (by dragging). The same is true for cells B4 and B5. 5. There is no cumulative version of HYPERGEOMDIST. You can calculate the cumulative probabilities using: P(X) + P(<X). Example P(X <1) = P(X = 1) + P(X < 1), or cell B12+C11.
145
Statistics using Excel 2007
Entering the data of the example (a group of ten people drawn from the 300 respondents, 77 of which are female) gives the following result.
Figure 6.15: Completed Excel worksheet with calculated hypergeometric probabilities The associated chart is as follows.
Figure 6.16: Chart of the hypergeometric probability distribution for a random sample of 10 (N=300; A=77)
146
6 Discrete probability distributions
6.4
Negative binomial probability distribution The negative binomial probability distribution can be used to calculate the probability of a number of failures before a given number of successes occurs. This probability distribution (like the binomial probability distribution) has a constant probability of success. Examples 1. If you know that 30% of people approached are willing to take part in a telephone survey, what is the probability of finding someone willing only on the fifth call? In other words, the first four people called say no, and the fifth agrees to take part. The fifth person being the first to be willing to take part is negatively binomially distributed, because: • the sample is random (n=5 ) from an ‘infinite’ (the number of people with a telephone is enormous) population without replacement (you don’t call the same person twice); • each person approached is a ‘success’ if they are willing to take part in the survey, or a ‘failure’ otherwise; • the probability of ‘success’ p = 0.3 and the probability of ‘failure’ q = 0.7; • whether someone who is approached will be willing to take part does not depend on whether the people approached previously were willing. 4 Probability = 0.7 * 0.7 * 0.7 * 0.7 * 0.3 = 0.3 * 0.7 = 0.0720 2. See Example 1. What is the probability of finding the second person to be willing to take part on the fifth call? In other words, one of the first four people was willing to take part, and finally the fifth person was also willing. Table 6.4: Probability distribution for a negative binomial probability distribution with n=5 = 5, p = 0.3 and X = 2 Possible outcomes P (Probability) X (=Number of successes) 2 3 yes,no,no,no,yes 0,3*0,7*0,7*0,7*0,3 =0,3 *0,7 =0,0309 2 2 3 no,yes,no,no,yes 0,7*0,3*0,7*0,7*0,3 =0,3 *0,7 =0,0309 2 2 3 no,no,yes,no,yes 0,7*0,7*0,3*0,7*0,3 =0,3 *0,7 =0,0309 2 2 3 no,no,no,yes,yes 0,7*0,7*0,7*0,3*0,3 =0,3 *0,7 =0,0309 2 There are therefore four possible outcomes in total, each with the same probability, which is 0.32 * 0.73. The probability is therefore 4 * 0.32 * 0.73 = 0.1235. You can use the following formula to calculate the negative binomial probability that the Xth success will occur on the nth attempt. ⎛ n − 1 ⎞ X ⎟⎟ ∗ p ∗ (1− p)n− X P(n) = ⎜⎜ X − 1 ⎝ ⎠ where: p: probability of success th n: number of attempts (sample size) where the last attempt delivers the X success The negative binomial probability distribution is determined by three data, the sample size n (or the number of 11 trials n), the probability of success p and the number of successes X . These three data n, p and X, are also referred to as the parameters of the negative binomial probability distribution.
11
You could also say that the negative binomial probability distribution is determined by the probability of success p, the number of successes X and the number of failures n-‐X. The sample size happens to be equal to the number of successes and the number of failures.
147
Statistics using Excel 2007
6.4.1
Probabilities of negativ binomial distributions with Excel
Excel uses the statistical function NEGBINOMDIST for calculating negative binomial probabilities. Method 1. Select the cell where you want to calculate the negative binomial probability. 2. Click Insert Function in the Function Library group on the Formulas tab, or simply click fx in the Formula Bar. 3. Select Statistical from the list adjacent to ‘Or select a category:’, select NEGBINOMDIST from the list adjacent to ‘Select a function:’ and finally click OK. The following dialog appears.
Figure 6.17: Negative binomial distribution function dialog 4.
By Number_f enter the number of failures; by Number_s enter the minimum number of successes; by Probability_s enter the probability of success (p).
Example As part of your investigation you want to conduct a special interview of morning commuters who have an annual season ticket. It is known that 35% of morning commuters have an annual season ticket. You need five people with th an annual season ticket for your investigation. What is the probability that when you come to the 15 morning th th commuter you find the 5 person for your investigation (in other words, that the 15 person has an annual season ticket, and before this 10 passengers had no annual season ticket)? P(n = 15 | X = 5 ; p = 0.35) = ? Number_f: 10 ; Number_s: 5 ; Probability_s: 0.35 P(n = 15 | X = 5 ; p = 0.35) =NEGBINOMDIST(10;5;0.35) = 0.0708
148
6 Discrete probability distributions
Figure 6.18: Completed dialog for calculating a negative binomial probability You can now create a complete negative binomial probability distribution table in the same way as for the binomial and hypergeometric probability distributions.
Figure 6.19: Excel worksheet with formulas for calculating negative binomial probabilities Remarks: 1. Number_s starts at 1 and cannot exceed the sample size (n); 2. Number_f is the sample size – Number_s; 3. columns B and C have references to cells $B$3 and $B$4. These cell addresses must not change when copying. The following appears after entering the data of the example (a random sample of 15 morning commuters with a 35% probability of success, defined as a person being in possession of an annual season ticket).
149
Statistics using Excel 2007
Figure 6.20: Completed Excel worksheet with calculated negative binomial probabilities There is another negative binomial probability table you could create, which focuses on the Xth success, and calculates the probability of the Xth success for various sample sizes. See the following Excel worksheet.
Figure 6.21: Excel worksheet with formulas for calculating the sample size for negative binomial probabilities Remarks: 1. the minimum sample size is the number of successes (B3); 2. cumulative negative binomial probabilities are calculated in column C. Entering the data of the example (5 successes with a 35% probability of success) gives the following result.
150
6 Discrete probability distributions
Figure 6.22: Completed Excel worksheet with calculated negative binomial probabilities for various sample sizes Remarks: th 1. the probability of the 5 success occurring on the final element of the random sample is greatest for a sample size of 12; th 2. if you were to take a random sample of 20, there would be an 88% probability of having a 5 success. You could also have calculated this probability with a binomial probability distribution: P(X > 5 | n = 20 ; p = 0.35) = 1 – P(X < 4 | n = 20 ; p = 0.35).
151
Statistics using Excel 2007
6.5
Poisson probability distribution
You can use the Poisson distribution for calculating the probability of a number of successes in a continuous unit (of time, length, etc.). A Poisson distribution applies if the interval of the unit concerned (e.g. time, length, or area) can be reduced to a point where the following conditions are satisfied: 1. there is a constant probability of exactly one success in the interval; 2. the probability of more than one success in the interval is 0; 3. the probability of a success in an interval is independent of other success in other intervals. Example If an average of 180 passengers arrive at Amsterdam Central Station ticket counters in the morning peak period between 7am and 8am, the number of passengers will have a Poisson distribution if you divide the hour into 3600 one-‐second intervals, because: 1. the probability that a passenger will arrive at a counter in a given second is 180 / 3600 = 5%; 2. the probability of another passenger arriving at a counter in the same second is: 5% * 5% = 0.25% ≈ 0%; 3. the probability of a passenger arriving at a counter in a given second has no effect on a passenger who arrives at the counter in a different second. You can use the following formula for calculating the probability of X successes for a Poisson distribution with a given average of λ.
P(X) =
e−λ * λX X!
where: λ : the expected number of successes e: the base of the natural logarithm (mathematical constant) ≈ 2.718 X: number of successes per unit
6.5.1
Probabilities of Poisson distributions with Excel
Instead of using the formula, Excel also allows use of the statistical function POISSON. Method 1. Select the cell where you want to calculate the Poisson probability. 2. Click Insert Function in the Function Library group on the Formulas tab, or simply click fx in the Formula Bar. 3. Select Statistical from the list adjacent to ‘Or select a category:’, select POISSON from the list adjacent to ‘Select a function:’ and finally click OK. The following dialog appears.
152
6 Discrete probability distributions
4.
Figure 6.23: Poisson distribution function dialog By X enter the number of ‘successes’; by Mean enter the average number of successes per unit (λ); by Cumulative enter FALSE to calculate the probability of the number of successes, or TRUE to calculate the probability of the number of successes or fewer.
If you examine the example of the number of passenger arrivals per minute at the Amsterdam Central Station ticket counters during the morning peak period between 7am and 8am, you will find an average of 3 per minute. The probability or 4 passengers arriving at a counter in an arbitrary minute can be calculated in Excel as follows: x : 4 λ : 3 cumulative: FALSE P(X = 4 | λ = 3) =POISSON(4;3;FALSE) = 0.1680
Figure 6.24: Completed dialog for calculating a Poisson probability You can now create a complete Poisson probability distribution table in the same way as for the binomial, hypergeometric and negative binomial probability distributions.
153
Statistics using Excel 2007
Figure 6.25: Excel worksheet with formulas for calculating Poisson probabilities An average number of successes of 3, as in the example of the number of passenger arrivals per minute at an Amsterdam Central Station ticket counter, gives the following result.
Figure 6.26: Completed Excel worksheet with calculated Poisson probabilities
154
6 Discrete probability distributions
6.6 1.
Exercises
87% of trains run on time. What is the probability that someone who travels 10 times on an arbitrary train: a. always arrives on time? b. is delayed no more than once? c. is delayed on at least the half the number of journeys?
2.
If you take a random sample of 25 people (without replacement), from the group of 300 respondents (from the Fiction 2000 data file), what is the probability that in this group: a. 15 people have ever read Metro? b. at least 15 people have ever read Metro? c. the number of people who have ever read Metro is between 10 and 20 (including 10 and 20)?
3.
In a survey of train passengers, it is the intention to interview four train passengers who have made an international train journey at some time. 18% of train passengers have done so. th a. What is the probability that the 10 passenger approached is the second passenger to have made an international train journey at some time? th b. What is the probability that the 40 passenger approached is the fourth passenger to have made an international train journey at some time? c. What is the minimum number of train passengers to approach to have at least a 75% probability of finding 4 train passengers who have made an international train journey at some time?
4.
Every day an average of 9 passengers call at the lost property desk at a given station. What is the probability on an arbitrary day that: a. 7 passengers call at the lost property desk? b. fewer than 5 passengers call at the lost property desk? c. 10 or more passengers call at the lost property desk ?
5.
72% of Spits readers do not leave the newspaper behind in the train. What is the probability that 12 Spits readers selected at random will: a. all have taken Spits away with them (not left it behind in the train)? b. include only 2 people who left Spits behind in the train? c. include at least 6 people who took Spits away with them? d. include a minimum of 6 and a maximum of 8 people who took Spits away with them?
6.
In a group of 40 train passengers, 25 have at some time filled in a form to claim a refund because of a delay. What is the probability if you select at random a group of 10 passengers from the group of 40 train passengers that: a. all 10 passengers have ever completed a refund form? b. none has ever completed a refund form? c. 5 passengers have ever completed a refund form? d. a minimum of 4 and a maximum of 6 passengers have ever completed a refund form?
7.
A telephone interviewer has to survey 5 women 25 years of age or older who have travelled by train at some time in the past year (i.e. a quota survey). If 15% of people with a telephone meet these requirements (female, 25 years of age or older and have travelled by train in the past year), what is the probability that the interviewer will: a. find the first person who meets these requirements on the fourth telephone call? b. find the second person who meets these requirements on the tenth telephone call? c. find the fifth person who meets these requirements on the thirtieth telephone call? d. have to call 50 people or fewer to find the five people who meet these requirements?
155
Statistics using Excel 2007
156
7
7 Continuous probability distriburions
Continuous probability distributions
7.1
Introduction
A continuous variable is one that is able to assume any value within a given range. In other words, it is always possible to find a third value that lies between any two given values of a continuous variable. A continuous value is always measured on a quantitative level (interval or ratio). Some examples of continuous variables are time, length, and weight. A continuous probability distribution has no regard to the probability of a specific value (which will always be 0), but the probability of a range of values (less than some value or greater than some value). This chapter discusses the most important continuous probability distribution, the normal distribution. The exponential probability distribution is also briefly addressed. 7.2 Normal probability distribution A normal distribution is characterized by the mean µ, which is also referred to as the expected value E[X], and the standard deviation σ.
Probability density
Chart 7.1: Normal distributions 0.45
A
0.4 0.35 0.3 0.25
B
0.2 0.15 0.1
C
0.05 0 -4
-3
-2
-1
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
X Distributions A and B have a mean (µ) of 4 and standard deviations (σ) of 1 and 2, whereas Distribution C has a mean (µ) of 6 and a standard deviation (σ) of 3. Distributions A, B and C differ, but have a number of properties in common, because they are all normal distributions.
157
Statistics using Excel 2007
The properties of normal distributions are: 1. a bell-‐shaped and symmetrical probability density. Values that are more than σ above µ occur as frequently as values that are more than σ below µ; 2. an arithmetic mean, median and mode that are all equal; 3. a strong central tendency. Values close to the mean occur most frequently; 4. infrequent occurrences of relatively large and relatively small values. For instance, 95.4% of all observations fall within two standard deviations of µ, and as many as 99.7% fall within three standard deviations of µ. The probability density function f(X) of a normal distribution can be calculated using the following formula.
f(X) =
1 X −µ 2 1 e− 2 ( σ ) σ 2π
where: σ: standard deviation of the population µ: arithmetic mean of the population π: mathematical constant, approximately 3.14159 e: mathematical constant, approximately 2.71828 If you want to calculate the probability of the value X of a normal distribution lying between the values a and b, P(a < X < b), you have determine the area below the probability density function between a and b.
You can calculate this area by integrating the probability density function over the interval [a,b] or by standardizing and using the standard normal probability table. A normal distribution can be converted into standard normal distribution with an average µ = 0 and a standard deviation σ = 1. This yields a z value, the probability of which is given in a standard normal table. Converting a normal distribution into a standard normal distribution, which is referred to as standardization or normalization, is performed in two steps, as explained graphically with reference to the normal distribution of Distribution B, with µ = 4 and σ = 2.
158
1.
2.
7 Continuous probability distriburions
The mean is subtracted from all X values (X -‐ µ), which shifts the chart so that the mean is at 0.
The new X values are now divided by the standard deviation σ. I.e.: (X -‐ µ) / σ This produces the standard normal distribution, for which the mean is 0 and the standard deviation is 1. The values of a standard normal distribution are referred to as z values rather than X values.
The value of z is how many times the standard deviation equals the distance of the value of X from the mean, µ. If the mean (µ) of a normal distribution is 4 and the standard deviation (σ) is 2, then: for X = 6 the z value is 1 (6 is 1 times the standard deviation above the mean of 4); for X = 0 the z value is -‐2 (2 is 2 two times the standard deviation below the average of 4); for X = 3 the z value is – ½ (3 is ½ the standard deviation below the mean of 4); 159
Statistics using Excel 2007
for X = 8 the z value is 2 (8 is 2 times the standard deviation above the mean of 4). Therefore the probability that X < 6, for a normal distribution with a mean of 4 and a standard deviation of 2 is: 1 P(X < 6 | µ = 4 ; σ = 2) = P(z < 1) = 0.8413 7.2.1 Probabilities of normal distributions with Excel Excel has five functions that are concerned with normal and standard normal probability distributions. I. NORMDIST(X;µ;σ;TRUE) This function calculates the probability of a value smaller than X for a normal distribution with parameters µ and σ. If you enter FALSE as the final parameter you will obtain the probability density. Method 1. Select the cell where you want to calculate the normal probability. 2. Click Insert Function in the Function Library group on the Formulas tab, or simply click fx in the Formula Bar. 3. Select Statistical from the list adjacent to ‘Or select a category:’, select NORMDIST from the list adjacent to ‘Select a function:’ and finally click OK.
Figure 7.1: Opening the Normal Distribution dialog
1
The value can be found in a cumulative probability or standard normal distribution table.
160
7 Continuous probability distriburions
The following dialog appears.
4.
Figure 7.2: Normal probability distribution dialog By X enter the value for which you wish to calculate the cumulative probability or probability density; by Mean enter the population mean; by Standard_dev enter the standard deviation of the population; by Cumulative enter TRUE (for the probability as opposed to the probability density).
Example
What is the probability for a normal distribution with a mean of 8 and a standard deviation of 3 that X is less than 6?
P(X < 6 | µ = 8 ; σ = 3) NORMDIST(6;8;3;TRUE) = 0.2525 You can also calculate the above probability in Excel in two steps, by first standardizing the probability distribution (i.e. determining the z value, see the function mentioned under II) and then determining the cumulative probability of this z value (see the function mentioned under III).
161
Statistics using Excel 2007 II.
STANDARDIZE(X;µ;σ) This function calculates the standardized z value for a normal distribution with parameters µ and σ. Method 1. Select the cell where you want to calculate the z value. 2. Click Insert Function in the Function Library group on the Formulas tab, or simply click fx in the Formula Bar. 3. Select Statistical from the list adjacent to ‘Or select a category:’, select STANDARDIZE from the list adjacent to ‘Select a function:’ and finally click OK.
Figure 7.3: Opening the STANDARDIZE function dialog The following dialog appears.
4.
Figure 7.4: STANDARDIZE function dialog By X enter the value for which you want to calculate the associated z value; by Mean enter the population mean; by Standard_dev enter the standard deviation of the population. Example
What is the z value associated with an X value of 6 for a normal distribution with an mean of 8 and a standard deviation of 3? STANDARDIZE(6;8;3) = -‐0.6667 162
7 Continuous probability distriburions
III. NORMSDIST (z) This function calculates the probability of a value less than z (the (left-‐tailed) cumulative probability). Method 1. Select the cell where you want to calculate the cumulative probability of the z value. 2. Click Insert Function in the Function Library group on the Formulas tab, or simply click fx in the Formula Bar. 3. Select Statistical from the list adjacent to ‘Or select a category:’, select NORMSDIST from the list adjacent to ‘Select a function:’ and finally click OK.
Figure 7.5: Opening the standard normal distribution dialog The following dialog appears.
4.
Figure 7.6: Standard normal distribution dialog By Z enter the value of z for which you wish to calculate the cumulative probability. Example
What is the probability that z is less than –0.6667? NORMSDIST(-‐0.6667) = 0.2525
163
Statistics using Excel 2007
IV. NORMINV(probability;µ;σ) This function calculates the X value that corresponds with the given probability for a normal distribution with parameters µ and σ. This function is therefore the opposite (inverse) of the NORMDIST(X;µ;σ) function, which calculates the probability that corresponds with an X value. Method 1. Select the cell where you want to calculate the X value. 2. Click Insert Function in the Function Library group on the Formulas tab, or simply click fx in the Formula Bar. 3. Select Statistical from the list adjacent to ‘Or select a category:’, select NORMINV from the list adjacent to ‘Select a function:’ and finally click OK.
Figure 7.7: Opening the inverse normal distribution dialog The following dialog appears.
Figure 7.8: Inverse normal distribution dialog 4. By Probability enter a number between 0 and 1 representing the cumulative probability for which you want to calculate the corresponding X value; by Mean enter the population mean; by Standard_dev enter the standard deviation of the population.
164
7 Continuous probability distriburions
Example Which X value for a normal distribution corresponds with a mean of 25, a standard deviation of 4 and a probability of 0.75?
P(X < ? | µ = 25 ; σ = 4) = 0.75 NORMINV(0.75;25;4) = 27.70 V.
NORMSINV(probability) This function calculates the z value that corresponds with the given probability. The function is therefore the opposite (inverse) of the NORMSDIST(Z) function, which calculates the probability that corresponds with a Z value. Method 1. Select the cell where you want to calculate the Z value. 2. Click Insert Function in the Function Library group on the Formulas tab, or simply click fx in the Formula Bar. 3. Select Statistical from the list adjacent to ‘Or select a category:’, select NORMSINV from the list adjacent to ‘Select a function:’ and finally click OK.
Figure 7.9: Opening the inverse standard normal distribution dialog The following dialog appears.
165
Statistics using Excel 2007
Figure 7.10: Inverse standard normal distribution dialog Example What z value corresponds with a probability of 0.75?
NORMSINV(0.75) = 0.6745 Remember to use the following probability rules with normal probability distributions. 1. P(X < a) = P(X < a) P(X = a) = 0 2. P(X > a) = 1 – P(X < a) complement rule 3. P(a < X < b) = P(X < b) – P(X < a)
166
7 Continuous probability distriburions
An Excel worksheet is given below that you can use for calculating left-‐tailed and right-‐tailed cumulative probabilities and interval probabilities and determining the corresponding X value for a given probability, for normal distributions. The ‘&’ symbol appears in A7 and other cells, and it ensures that any change in the X value (in cell B5) is reflected immediately in the cell.
Figure 7.11: Excel worksheet for calculating probabilities of normal distributions ABS is a mathematical function that gives the absolute value of a number. Entering an arithmetic mean of 75, a standard deviation of 6, a first X value of 69, a second X value of 81 and a cumulative percentage of 10% gives the following output in Excel.
Figure 7.12: Completed Excel worksheet for calculating probabilities of normal distributions If you change the value of cell B5 to 72, then cell A7 becomes: P(X<=72)
167
Statistics using Excel 2007
7.3
Exponential probability distribution
The exponential probability distribution is used in waiting time theories to model the time between two arrivals. You can use the following formula to calculate the probability for an exponential distribution that a subsequent arrival will occur within a given period X: P(Arrival time < X) = 1 – e-λx where: e: the mathematical constant approximated by 2,71828 λ: the mean number of arrivals per unit X: any value of the continuous variable where 0 < X < + ∞ The exponential probability distribution has only one parameter, the mean, λ, which equals the average number of arrivals per unit of time. The mean time between two arrivals is then 1/λ. For example, if the average rate of arrivals is 5 per hour, then the time between two arrivals is 1/5 hour, or 12 minutes.
Example If on average 30 customers an hour arrive at a railway station ticket office, what is the probability that the next customer will arrive within 3 minutes of the predecessor? Be careful to use the same units of time throughout: λ (lambda) = 30 per hour = 0.5 per minute X = 3 minutes = 0.05 hours -‐3*0.5 -‐1.5 Unit minutes: P(X < 3) = 1 – e = 1 – e = 1 – 0.2213 = 0.7769 -‐0.05*30 -‐1.5 Unit hours: P(X < 0.05) = 1 – e = 1 – e = 1 – 0.2213 = 0.7769
168
7.3.1
7 Continuous probability distriburions
Probabilities of exponential distributions with Excel
Excel uses the following function for calculating exponential probabilities. EXPONDIST(x;lambda;cumulative) Method 1. Select the cell where you want to calculate the exponential probability. 2. Click Insert Function in the Function Library group on the Formulas tab, or simply click fx in the Formula Bar. 3. Select Statistical from the list adjacent to ‘Or select a category:’, select EXPONDIST from the list adjacent to ‘Select a function:’ and finally click OK;
Figure 7.13: Opening the exponential distribution dialog
The following dialog appears.
Figure 7.14: Exponential distribution dialog
169
Statistics using Excel 2007 4.
By X enter the value for which you wish to calculate the cumulative probability or probability density; by Lambda enter the average of the exponential distribution. by Cumulative enter TRUE.
Example If on average 20 customers an hour arrive at a railway station ticket counter, what is the probability that the next customer will arrive within 6 minutes? Be careful to use the same units of time throughout: λ (lambda) = 20 per hour x = 6 minutes = 0.1 hours P(X < 0.1 | λ = 20) = EXPONDIST(0.1;20;TRUE) = 0.8647 In this case too, use the probability rules given above if you want to calculate probabilities other than the cumulative probability. Example What is the probability that it will be longer than 6 minutes before the next customer arrives? P(X > 0.1) = 1 -‐ P(X < 0.1) = 1 – 0.8647 = 0.1353 An Excel worksheet is given below that you can use for calculating left-‐tailed and right-‐tailed cumulative probabilities for exponential distributions.
Figure 7.15: Excel worksheet for calculating exponential probabilities Entering the average value 20 in cell B3 and the X value 0.1 in cell B4 gives the following result.
Figure 7.16: Completed Excel worksheet for calculating exponential probabilities 170
7.4 1.
2.
3.
4.
5.
7 Continuous probability distriburions
Exercises
On average a delivery service takes 2 hours to deliver Metro to several railway stations each morning, with a standard deviation of 10 minutes. Assuming that the delivery time is normally distributed, determine: a. the probability on an arbitrary day that the delivery service will be finished within 1½ hours; b. the probability on an arbitrary day that the delivery service will take longer than 2¼ hours to deliver Metro; c. the probability on an arbitrary day that delivering Metro will take a minimum of 1¾ hours and a maximum of 2¼ hours; d. the length of time for which there is a 75% probability of delivering within that time. On average 23 train passengers arrive per hour in the morning peak period at a ticket machine at a given railway station. If train passenger arrivals at the ticket machine are exponentially distributed, determine the probability that: a. the next passenger arrives within 5 minutes; b. the next passenger arrives after 10 minutes; c. it takes a minimum of 3 minutes and a maximum of 7 minutes before the next passenger arrives. The time train passengers need to buy a ticket from the machine is approximately normally distributed with an average of 2 minutes and a standard deviation of 15 seconds. Determine: a. the probability that an arbitrary passenger will spend less than 1½ minutes at the ticket machine; b. the probability that an arbitrary passenger will spend more than 2 minutes and 20 seconds at the ticket machine; c. the probability that an arbitrary passenger will spend more than 2 minutes but less than 2½ minutes at the ticket machine; d. the time within which 95% of train passengers manage to buy a ticket from the machine.
Newspapers arrive at a Spits distribution point at 6am. It has been found on average that all the copies of Spits go within 5 hours (by 11am). If the time that copies of Spits remain available is normally distributed with an average of 5 hours and a standard deviation of ½ hour, determine: a. the probability that someone arriving at the distribution point at 10am on an arbitrary day finds no Spits available; b. the probability that someone arriving at the distribution point at 11.30am on an arbitrary day still finds a Spits available; c. the latest time at which someone has to arrive on an arbitrary day at this distribution point to have a 90% probability of finding a Spits available.
The railway operator’s central journey time information service is called 18 times on average during the morning peak period. If the arrival of telephone calls at the central information service is exponentially distributed, determine: a. the probability that the next telephone call arrives within 3 minutes; b. the probability that it takes more than 5 minutes before the next telephone call arrives; c. the probability that the next telephone call arrives between 2 minutes and 4 minutes.
171
Statistics using Excel 2007
172
8 Enkelvoudige lineaire regressie-‐ en correlatieanalyse
8
Simple linear regression and correlation analysis
8.1
Introduction
8.2
Regression models
Regression analysis is concerned with developing models that explain the relationship between a response variable and one or more explanatory variables, with the objective of forecasting a response variable given the explanatory variables. A response variable in a regression model is also referred to as the dependent variable, and the explanatory variables as the independent variables. For instance, the ‘Price’ advertisers will be willing to pay for an advertisement in Spits depends on the ‘Circulation’ of Spits. The ‘Price’ is then the dependent (response) variable and the ‘Circulation’ the independent (explanatory) variable. A response variable will frequently depend on multiple explanatory variables in practice. The term multiple regression analysis is used when more than one explanatory variable is involved in an investigation, and simple regression analysis when only one explanatory variable is involved. The ceteris paribus condition is assumed, which means that all other explanatory variables, which are outside the scope of the investigation, are assumed constant. Correlation analysis is about investigating the strength of the relationship between two variables. The variables in both correlation and regression analysis must be quantitative. If one or more variables are qualitative, a different statistical technique must be used to investigate the relationship between the variables.
Chapter 3.3.5 showed the relationship between the Age variable and the Income variable in a scatter chart. Income depends on Age, so the Income variable is shown on the vertical (Y) axis and the Age variable on the horizontal (X) axis. The mathematical functions that describe the relationship between variables range from simple to extremely complex. The simplest form is a straight line. The mathematical function of a straight line is: y=a*x+b where: x : independent (explanatory) variable y : dependent (response) variable a : slope (the increase in y units for an increase in x of one unit) b : intercept with the y axis (the value of y when x equals zero) Chart 8.1 shows that the linear regression line between the variables Age and Income can be shown as follows: y = 0.4935 x – 2.2733 If a person is one year older, the income is 0.4935 * €100 = € 49.35 higher.
173
Statistics using Excel 2007
Chart 8.1: Scatter Diagram for Age and Income
Income (* € 100)
40 35
y = 0,4935x - 2,2733
30 25 20 15 10 5 0 0
10
20
30
40
50
60
70
Age
Whether you should use a linear function for your model or a more complex mathematical function depends on the distribution of the (x,y) values in your scatter chart. Some examples of different forms of regression (relationships) between x and y are shown in the following charts. The values of y in Chart A increase approximately linearly and proportionally as the values of x increase. An example of this is shown in Chart 8.1 of the relationship between Age and Income. The values of y in Chart B decrease approximately linearly and proportionally as the values of x increase. A falling regression line means that the relationship is negative. An example of a negative linear relationship is that between the ‘Price of a product’ and ‘Sales of the product’. Chart C shows no relationship. There are high and low y values for all x values. The values of y in Chart D increase as the x values increase. The increases are more than proportional at the start and less than proportional at the end. An example of a positive curved relationship is ‘Advertising expenses for a product’ and ‘Sales of the product’. Initially, sales will increased substantially until the advertising message has reached the entire target group, at which point sales will remain almost constant. The values of y in Chart E decrease as the x values increase. The decrease is more than proportional at the start and less than proportional at the end. An example is the relationship between the variables ‘Depreciation’ and ‘Years’. Cars depreciate considerably more in the first few years than later. The values of y in Chart F decrease at first as the x values increase. Further on, the y values increase as the values of x increase. An example would be the relationship between the variables ‘Time’ and ‘Number of errors a person makes when performing a task’. At first practice quickly reduces the number of errors, but as fatigue and boredom set in the number of errors will increase again.
174
Chart A: Positive linear relationship Y
X
8 Enkelvoudige lineaire regressie-‐ en correlatieanalyse Chart B: Negative linear relationship Y
X
Chart C: No relationship
Chart D: Positive curvlinear relationship
Y
Y
X
X
Chart E: Negative curvlinear relationship Y
X
8.3
Chart F: U-shaped curvlinear relationship Y
X
The linear regression model
It can be seen in Chart 8.1 and Chart A that the points do not all lie on a straight line. It is still possible to draw a straight line such that most of the points are reasonably close to the line. The line that best represents the relationship between x and y is the regression line. The regression line is determined by the least squares method. This method produces a regression line that gives the smallest sum of the squares of the vertical distances of the points in the scatter chart from the line. As with variance, the distances are squared because some points are above the line and some below. The vertical distance of a point in the scatter chart from the regression line is known as a residual value. It is therefore these residual values that are squared and summed. The regression line is therefore the line for which the sum of the squares of the residual values is smallest.
175
Statistics using Excel 2007
Chart 8.2: Scatter Diagram with regression line
Y
y = ax + b
y5 r5
y2
r4 y4
r2
r3 y3
r1 y1
X The vertical distance between a point in the scatter chart (yi) from the regression line (ax + b) is shown in Chart 8.2 as ri. Now: r1 = y1 – ax1 – b ; r2 = y2 – ax2 – b ; r3 = y3 – ax3 – b ; and so on. All these residual values must then be squared and summed. n
2
S = ∑ ( y i - ax i - b) i=1
S is a function of two variables, a and b. To find a solution for the minimum of S requires partial differentiation of the function. The mathematical detail involved is not covered here. The minimum is found for the following values of a and b.
a=
1n ∑x y − x * y n i=1 i i 2 σx
b = y − a* x
8.4
Explained and unexplained variation
You have to confirm that the model is suitable for the purpose before using it to make forecasts of the dependent variable based on the independent variable . A linear model is suitable provided the observed points are not too far from the linear regression line. You need a distance criterion that can give a valid measure of the distances of the multiple points from the regression line. 2 The criterion is the coefficient of determination (r ), which is defined as follows.
r2 =
176
explained variation total variation
8 Enkelvoudige lineaire regressie-‐ en correlatieanalyse
The total variation refers to the sum of the squares of the distances between the observed y values and the mean of y ( y ). The explained variation refers to the sum of the squares of the distances between the y values on the regression line and the mean of y ( y ). What remains is the sum of the squares of the distances between the y values on the regression line and the observed y values, which gives the unexplained variation. In other words, you can go some way towards explaining why the observed values of y do not all equal the mean of y through the relationship between x and y given by the regression line. However, the points do not all lie on the regression line, for other, unexplained, reasons.
Chart 8.3: Measures of variation Y Unexplained variation
Ŷ Total variation
Explained variation
Y
X
The sum of squares total (SST) is the sum of the squares of the distances between yi and y . SST =
n
2
∑ ( y i − y)
i=1
The sum of squares determined by regression (sum of squares regression, SSR) is the sum of the squares of the ^
distances between y (point on the regression line) and the mean of y ( y ). SSR =
n
2
^ ∑ ( y i − y)
i=1
The residual sum of squares (sum or squares error, SSE) is the sum of the squares of the distances between the ^
observed y values (yi) and the y values on the regression line ( y ). 2
SSE =
n ^ ∑ ( y i − y i)
i=1
For the sums of squares: SST = SSR + SSE Dividing the sums of the squares by n-‐1 gives what is referred to as the variance, for which: Total variance = Explained variance + Unexplained variance
177
Statistics using Excel 2007
In other words: The variance of y can be divided into an explained variance component, which is deemed to be caused by the relationship with x, and an unexplained variance that is independent of the relationship with x, and is probably caused by other factors that are not covered by the investigation. The value you should attach to the coefficient of determination depends on the number of points involved in the 2 investigation. If you have only 2 points in your investigation, the resulting r will be 100%, simply because there is always a straight line that passes exactly through two points.
8.5
Lineair regression with Excel
Excel provides three different methods of determining the equation of the linear regression line: 1. functions; 2. scatter charts; 3. Data Analysis Regression. 8.5.1 Lineair regression with SLOPE and INTERCEPT functions Simple linear regression analysis is a technique for determining the rectilinear (straight line) relationship between two quantitative variables, with the objective of predicting the magnitude of a dependent variable (y) by reference to the magnitude of another, independent, variable (x). This linear relationship can be described by the function: y = ax + b The Age and Income variables in the Fiction 2000 workbook serve as an example. You might expect people’s income to increase as they get older. The independent variable (x) is then Age, and the dependent variable (y) is Income. Determining the regression line requires the slope (a) and the intercept with the y axis (b) to be found. Method 1. Open the Fiction 2000 workbook. 2. Open an empty worksheet and enter in cell A1: Slope (a) and in cell A2: Y intercept (b) 1 3. Enter in cell A3: ="y="&ROUND(B1;2)&"*x+"&ROUND(B2;2)&"" 4. The slope is calculated in cell B1 with the statistical function SLOPE(known_y’s;known_x’s). y in this example is ‘Income (depends on Age) and x is Age (independent variable). Select cell B1. Open the Formulas tab, then click Insert Function in the Function Library group. Select Statistical from the list adjacent to ‘Or select a category:’, select SLOPE from the list adjacent to ‘Select a function:’ and finally click OK.
1
The mathematical function ROUND is used here to remove superfluous decimal places in the regression line.
178
Figure 8.1: Opening the SLOPE function dialog
8 Enkelvoudige lineaire regressie-‐ en correlatieanalyse
Enter by Known_y’s the array of the dependent variable ‘Income’, which is Data!F2:F301 Enter by Known_x’s the array of the independent Age variable, which is Data!C2:C301
Figure 8.2: Completed dialog for calculating the slope between Income and Age 5.
The value of a will appear in cell B1: 0.493521; Cell B2 will receive the value of the intercept with the y-‐axis calculated with the statistical function: INTERCEPT (known_y’s;known_x’s). Select cell B2. Click Insert in the menu bar and then select Function from the drop-‐down list. Select Statistical from the list adjacent to ‘Or select a category:’, select INTERCEPT from the list adjacent to ‘Select a function:’ and finally click OK. You will obtain a dialog box similar to the one for SLOPE (Step 4). Enter by Known_y’s the array of the dependent variable ‘Income’, which is Data!F2:F301 Enter by Known_x’s the array of the independent Age variable, which is Data!C2:C301 Cell B2 will receive the value of b: -‐2.27328.
179
Statistics using Excel 2007 6.
In cell A3 you will then obtain the function giving the linear relationship between income and age: y=0.49*x+-‐ 2.27, or y = 0.49x – 2.27. A one-‐year increase in age therefore gives an increase in income of 0.49 * € 100.
Figure 8.3: Excel worksheet with calculated slope and intercept between the Income and Age variables Linear regression is used for forecasting the dependent variable y (in this case Income) from the independent variable x (in this case Age) using the function. Excel uses the FORECAST(x,known_y’s,known_x’s) function for forecasting. For example, if you want to forecast Income for a morning train commuter aged 34, you can do so with Excel as follows. 1. Enter Age in cell A5 and Predicted income in cell A6. 2. Select cell B6. Open the Formulas tab, then click Insert Function in the Function Library group. Select Statistical from the list adjacent to ‘Or select a category:’, select FORECAST from the list adjacent to ‘Select a function:’ and finally click OK. The following dialog appears.
Figure 8.4: FORECAST function dialog 3. 4.
Enter by X: B5, by Known_y’s Data!$F$2:$F$301, and by Known_x’s Data!$C$2:$C$301, and click OK. Select B5 and enter 34. The result is as follows.
Figure 8.5: Completed Excel worksheet for predicting an income given an age You can change the age in cell B5, and the predicted income value in cell B6 will update simultaneously.
180
8.5.2
8 Enkelvoudige lineaire regressie-‐ en correlatieanalyse
Lineair regression analysis by mean of a chart: scatter chart
You can also use a scatter chart to obtain the regression line equation for the relationship between Age and Income. Method 1. Open the Fiction 2000 workbook. 2. Determine which of the two variables is the dependent variable: à the dependent variable is Income; 3. Select the values of the Income variable, i.e. Data!$F$2:$F$301, then press and hold down the Ctrl-‐key and select the values of the Age variable, i.e. Data!$C$2:$C$301. 4. Open the Insert tab, click Scatter in the Charts group and select the first subtype. 5. Enter the titles in and disable the legend display. You will obtain the following chart.
Chart 8.4: Scatter Diagram for Age and Income
Income (* € 100)
40 35 30 25 20 15 10 5 0 0
10
20
30
40
50
60
70
Age Source: ‘Fiction 2000’
To obtain the regression line you must first open the Chart Tools Layout tab in the Ribbon, click Trendline in the Analysis group and then select Linear Trendline from the drop-‐down list.
Figure 8.6: Opening the dialog to add a trendline
181
Statistics using Excel 2007
You can achieve the same result by right-‐clicking one of the points in the chart. In the Trend Regression Type area select the Linear option button, check the ‘Display Equation on chart’ and ‘Display R-‐squared value on chart’ boxes, and click Close.
Figure 8.7: Completed Format Trendline dialog The result is as follows.
182
8 Enkelvoudige lineaire regressie-‐ en correlatieanalyse
Income (*€ 100)
Chart 8.4: Scaper Diagram for Age and Income 40
y = 0.4935x -‐ 2.2733 R² = 0.66346
35
30
25
20
15
10
5
0 0
10
20
30
40
50
60
70
Age
Source: ‘Fiction 2000’
2 2 R is the coefficient of determination . This is a measure of the proportion of the variance in Income that depends on Age. In other words, 66.35% of the change in income is attributable to the change in ages. 33.65% of the change in income then depends on other factors that are not being investigated, such as qualifications, the business sector concerned, etc..
2
You can also calculate the coefficient of determination with the statistical function RSQ(known_y’s;known_x’s).
183
Statistics using Excel 2007
8.5.3
Lineair regression analysis by means of Data Analysis: Regression
A third method of obtaining a regression line is with Data Analysis from the Analysis group on the Data tab. Select Regression in the Data Analysis dialog.
Figure 8.8: Opening the Regression dialog Complete the dialog that appears as follows.
Figure 8.9: Completed Regression dialog The SUMMARY OUTPUT on the following page is divided into four parts. I Regression statistics The data you can read here include the correlation coefficient (R) and the coefficient of determination (R-‐ Square). II Analysis of variance (ANOVA) III Regression line Here you can read the intercept with the y-‐axis (b) and the slope (a) of the regression line.
184
8 Enkelvoudige lineaire regressie-‐ en correlatieanalyse
IV Residual output Here you can see the predicted income for the 300 different observations based on the regression line, and the deviations of the observations from the regression line (residuals). You also obtain a scatter chart and a residuals chart.
Age Line Fit Plot 40 35
Income (*€100)
30 25 20
Income (*€100)
15
Predicted Income (*€100)
10 5 0
0
10
20
30
40
50
60
70
Age
Figure 8.10: Scatter chart of Age and Income with regression line
Age Residual Plot 20 15 10
Residuals
5 0
0
10
20
30
40
50
60
70
-‐5 -‐10 -‐15 -‐20 -‐25
Age
Figure 8.11: Chart of residuals (differences between observed and predicted incomes) of the various ages
185
Statistics using Excel 2007
186
8 Enkelvoudige lineaire regressie-‐ en correlatieanalyse
8.6
Correlation analysis The usual measure of the strength of the relationship between two variables in a population is the correlation coefficient, r. The definition of r is shown in the table below. Table 8.1: The definition of various correlation coefficient (r) value ranges Correlation coefficient Strength of relationship -‐0,2 < r < 0,2 Negligible 0,2 < r < 0,4 of –0,4 < r < -‐0,2 Weak 0,4 < r < 0,7 of –0,7 < r < -‐0,4 Moderate 0,7 < r < 0,9 of –0,9 < r < -‐0,7 Strong 0,9 < r < 1 of –1 < r < -‐0,9 Very strong r = 1 of r = -‐1 Perfect A positive correlation implies that an increase in the independent (explanatory) variable will cause an increase in the dependent (response) variable. A negative correlation implies that an increase in the independent variable will cause a decrease in the dependent variable. This corresponds respectively with a positive and a negative regression line slope. When correlation is perfect, all the points on the scatter chart lie on the regression line. The strength of the correlation is obviously also determined by the reliability of r, or the number of points on the scatter chart. 8.6.1 Correlation analysis with Excel For calculating the correlation coefficient (r) Excel has the statistical function CORREL(array1;array2).
Figure 8.12: Opening the CORREL function dialog Unlike linear regression analysis, the choice of which variable is independent and dependent is irrelevant in calculating the correlation coefficient.
187
Statistics using Excel 2007
For calculating the correlation coefficient in the Age and Income example you would select Data!C2:C301 for array1, and Data1!F2:F301 for array2.
Figure 8.13: Completed dialog for calculating the correlation between Age and Income This yields an r value of 0.8145 (strong positive correlation). Interchanging the two arrays has no influence on the value of r. The correlation coefficient is not shown when you perform simple linear regression analysis with a scatter chart, but you can calculate it by taking the square root of the coefficient of determination. Conversely, the correlation coefficient is shown when regression analysis is performed via Data Analysis. See Chapter 8.5.3. 8.7 Lineair regression with time series Linear regression is also frequently applied in time series analysis (see also Chapter 9), where time (e.g. the year) is considered to be the independent variable. Taken as an example the data of Table 3.5 (Percentage of newspaper subscriptions by age group, 1997-‐2007) for the 18-‐25 group. The Dutch Daily newspapers expect a linear relationship between the years (independent variable) and the revenues achieved (dependent variable). One way of finding this linear relationship would be by using a scatter chart. Method 1. Open ‘Newspaper subscriptions.xlsx’. 2. Select the cells with the years and the cells with the data of the 18 -‐ 25 year age group (B1:L2). 3. Open the Insert tab, click Scatter in the Charts group, and select the first subtype. 4. Add titles and remove the legend. 5. Open the Layout tab and click Trendline in the Analysis group. Add the linear trendline and check the boxes for ‘Display Equation on chart’ and ‘Display R-‐squared value on chart’. The chart will be as follows.
188
8 Enkelvoudige lineaire regressie-‐ en correlatieanalyse
Percentage
Chart 8.5: Scatter diagram percentage daily newspaper subscription (18 -‐ 24 year) 60
y = -‐1,4364x + 2925 R² = 0,8327
50 40 30 20
10 0 1996
1998
2000
2002
2004
2006
2008
Source: Statistics Netherlands, The Hague/Heerlen 19-‐6-‐2009
You can produce forecasts for several years at the same time based on a linear, or other, regression line using the statistical TREND(known_y’s;known_x’s;new_x’s;const) function. For example, if you want to forecast the percentage of newspaper subscriptions in the 18 to 25 group for 2008, 2009 and 2010, you would first enter the years concerned in cells M1:O1 (i.e. in M1: 2008, in N1: 2009 and in O1: 2010). Now select the cells where you want the forecast percentages for the years to go, i.e. select cells M2:O2. Open the Formulas tab and click Insert Function in the Function Library group. Select Statistical and then TREND.
Figure 8.14: Opening the TREND function dialog
189
Statistics using Excel 2007
Enter B2:L2 by Known_y’s, B1:L1 by Known_x’s, and M1:O1 by new_x’s. Either leave Const blank or enter TRUE. See the following dialog.
Figure 8.15: Completed TREND function dialog Do not click OK but click Ctrl, Shift and Enter keys simultaneously. The result is as follows.
Figure 8.16: Excel worksheet with calculated percentages based on regression analysis It would clearly be better to round the values in cells M2:O2 to whole numbers.
190
8.8
8 Enkelvoudige lineaire regressie-‐ en correlatieanalyse
Nonlinear regression models
An assumption in all the relationships described above in this chapter is that they take the form of a straight line (rectilinear) and involve two variables. You will often encounter nonlinear relationships. Consider for example a product life cycle chart (introduction, expansion, maturity, saturation, decline), where revenue depends on time. This chart is definitely not rectilinear, but is clearly polynomial. It is fairly simple with Excel to use a scatter chart to find other, nonlinear, regression models (e.g. polynomial, exponential, logarithmic, and power function). You have to select an appropriate trendline type when creating the chart.
Figure 8.17: Completed dialog for adding a polynomial trendline An example of a polynomial relationship is given below, for the Age and Income variables of the data in the Fiction 2000 workbook.
191
Statistics using Excel 2007
Income (*€ 100)
Chart 8.6: Scaper Diagram for Age and Income y = -‐0.0133x2 + 1.4551x -‐ 17.444 R² = 0.74104
40 35 30 25 20 15 10 5 0 0
10
20
30
40
50
60
70
-‐5
Age Source: Fiction 2000
Comparing the coefficient of determination in this polynomial model with its linear model counterpart would lead to the conclusion that the polynomial model describes the relationship between the Age and Income variables better. 8.9 Forecasts The goodness of a prediction depends on the following. 1. Whether you are predicting a value that lies within the range of x values, i.e. between the lowest and highest observation of x (interpolation), or one that lies outside the range (extrapolation). Interpolation can yield a good prediction, whereas extrapolation generally does not, particularly for values that are far outside the range. Therefore, predicting the income of a morning train commuter aged 45 (by interpolation) based on a regression line is more reliable than predicting the income of a fellow passenger who is aged 70 (by extrapolation). 2 2. The coefficient of determination (r ). The closer the coefficient of determination is to 1, the more correct the prediction. 3. The number of points in the scatter chart. The regression line in the example is based on 300 observations (points). If, for example, you were to determine the regression line based on only the first 10 respondents (points), you would indeed obtain a higher coefficient of determination, but the 95% confidence interval for the intercept with the y-‐axis and the slope would be significantly larger. Number of Coefficient of Y intercept Slope observations determination Lower 95% Upper 95% Lower 95% Upper 95% 300 0,663 -‐3,674 -‐0,873 0,453 0,534 10 0,831 -‐14,673 1,701 0,419 0,905 Figure 8.18: Table showing the effect of the number of observations on the confidence intervals
192
8 Enkelvoudige lineaire regressie-‐ en correlatieanalyse
8.10 1.
2.
3.
4.
5.
Exercises
It is required to ascertain whether there is any relationship between the Income and Travelling time variables using data from the Fiction 2000 workbook. This is to test the expectation that people with higher earnings are willing to travel further to work. a. Investigation of the relationship between Income and Travelling time. b. Determine the linear regression line that describes the relationship between the (independent) variable Income and the (dependent) variable Travelling time. c. What travelling time would you expect based on the linear regression line for someone with an income of 20 (* € 100)? d. What is your opinion of the reliability of the prediction made in point c? It is required to investigate a possible relationship between the Travelling time and Number of travel days variables using the data in the Fiction 2000 workbook. This is to test the expectation that people whose journeys take longer travel on fewer days each week. a. Investigate the correlation between Travelling time and Number of travel days. b. Determine the linear regression line that gives the relationship between the (independent) variable Travelling time and the (dependent) variable Number of travel days. c. Calculate the corresponding correlation coefficient r. d. Predict the number of travel days for a travelling time of 100 minutes. e. What comment can you make about any predictions made? It is required to ascertain whether there is any relationship between the Age and Income variables of female passengers using the data in the Fiction 2000 workbook. a. Investigate the correlation between Age and Income for female respondents. b. Determine the linear regression line that describes the relationship between the (independent) Age variable and the (dependent) Income variables for female respondents. c. What income would you expect based on the linear regression line for a 40-‐year-‐old female morning commuter? d. What do you observe if you compare the answers to a. and b. with the correlation and linear regression line for all respondents? It is required to investigate whether there is any relationship for the 25-‐35 year category between the Percentage newspaper subscriptions and Years variables using data for 1997 up to and including 2007 in the Newspaper subscriptions.xlsx file. This is to test the expectation that the proportion of newspaper subscriptions decreases with time. a. Determine the correlation between Years and Percentage newspaper subscriptions for the 25-‐35 year category. b. Determine the linear regression line that describes the relationship between the Years and Percentage newspaper subscriptions variables for the 25-‐35 year category. c. Predict for the year 2015 the percentage of newspaper subscriptions for the 25-‐35 year category. d. What comment can you make about the prediction? It is required to investigate whether there is any relationship for the 75 and older category between the Percentage newspaper subscriptions and Years variables using data for the years 1997 up to and including 2007 in the Newspaper subscriptions.xlsx file. This is to test the expectation that the proportion of newspaper subscriptions decreases with time. a. Determine the correlation between Years and Percentage newspaper subscriptions for the 75 and older category. b. Determine the linear regression line that describes the relationship between the Years and Percentage newspaper subscriptions variables for the category 75 and older. c. Predict for the 2008 -‐ 2010 period the percentage of newspaper subscriptions for the 75 and older category. d. What comment can you make about the prediction?
193
Statistics using Excel 2007
194
9 Time series analysis
9
Time series analysis
9.1
Introduction
Chapter 8 explains how to investigate the relationship between two variables. This chapter considers a special case, where the independent variable is always time. The objectives of time series analysis is to predict the value of a variable at a given future time. This objective can be achieved by studying the past behaviour of the variable. You can attempt to ascertain the magnitude of the factors that determine the progress of the variable by means of analysis. The factors are the trend, the cyclical effect, the seasonal pattern, the irregular component, and the break in the trend. You endeavour to develop a mathematical model using these factors to enable a prediction to be made for the investigated variable. The various time series components are discussed before describing the different methods for determining the trend. Seasonal patterns and irregular components are then covered. 9.2 The components of a time series Fluctuations in an economic time series are caused by five components: • trend; • economic cycle influence; • seasonal influence; • irregular component; • break in the trend.
Chart 9.1: Component factors of a time serie
Trend
Variable
Break
Trend
Time
The dotted lines are the fluctuations due to seasonal, cyclical and irregular variations
195
Statistics using Excel 2007
9.2.1
Trend The trend in a time series is the mean long term progress of a variable. In everyday conversation, someone might say ‘There is a falling trend in the number of trains that run on time’, or ‘There is a rising trend in the number of passengers travelling by train each day’. The latter statement does not mean that the number passengers on any day is always more than the day before, but that ‘disregarding fluctuations, the number of passengers travelling by train is increasing in the long term’. The trend is not usually the only factor in the value of the numbers in a time series. If it was, the regression analysis techniques covered in Chapter 8 would suffice for developing a mathematical model to support predictions. 9.2.2
Economic cycle influence
9.2.3
Seasonal influence
Many economic phenomena, such as in national income, employment, saving and spending, and import and export. are influenced to some extent by cyclical movements. Economists have identified three cyclic influences: • the Kitchin cycle, which takes approximately 4 years; • the Juglar cycle, which takes approximately 9 years; • the Kondratieff cycle, which takes approximately 50 years. It is hard to establish how long an economic cycle lasts, and the result varies substantially depending on the variable involved. The various economic cycles have only limited influence in the Netherlands. For instance, the Kitchin cycle adds or subtracts only about 2.5% to or from the average economic growth of 2.5% in the Netherlands. Furthermore, all you will often have at your disposal are statistics that cover less than one cycle time, which makes it almost impossible to estimate the size of the cyclic component. This chapter therefore does not discuss details of how to incorporate economic cycles in a mathematical time series model. The calculation of the cyclic component is analogous to that of the seasonal influence.
Looking at the number of rail passengers, you may observe that passenger numbers are considerably higher between 7am and 9am than between noon and 2pm. There is a daily pattern, and it is also referred to as a daily cycle. If you compare the number of passengers on different days, you will observe that the numbers are considerably higher on weekdays than at weekends. Besides a daily cycle, there is therefore also a weekly cycle. Similarly, you could look at the number of passengers in the course of a month (monthly cycle) or year (annual cycle). Each regular movement with fixed duration (also referred to as the cycle time) is known as a seasonal movement. The seasonal pattern comprises a total of mean fluctuations around the trend. The fluctuations caused by the seasons are observed at the regularly recurring times (one or more cycle times apart) and then averaged. The most important seasonal movement in the economy has a cycle time of one year. For instance, there are substantially more train passengers in the winter (above the trend or the mean) than in the summer (below the trend or the mean).
196
9 Time series analysis
9.2.4
Irregular component
9.2.5
Break in trend
The time series figure will usually be unequal to the trend figure adjusted for seasonal movement. The difference between these two numbers, which is generally small compared with the trend and seasonal pattern figures, is referred to as the irregular component. The cause of the irregular component is nonsystematic. For instance, it can happen that the number of train passengers is slightly higher or lower on a Monday than would be expected based on the trend (the mean) and the weekly pattern for Mondays (the seasonal pattern). A break in the trend is a singular irregularity in which the value or slope of the trend changes drastically and permanently. For example, a permanent drastic increase in the trend in the number of train passengers coincided with the introduction of season tickets for all Dutch students. Airline passenger numbers in the United States fell drastically after the World Trade Centre attacks in September 2001. Stock market indices also fell drastically because of the credit crisis. Since breaks in the trend are infrequent and usually unpredictable, this time series component will not be covered in the development of time series models. Sometimes a trend interruption may occur. A trend interruption is when there are no data for one or more periods. The trend then resumes after the dataless period with the same slope and at approximately the same level as previously, unlike the situation after a break in the trend. For example, after a rail strike (trend interruption), passenger numbers may return to approximately the same level as before.
9.3
Trend via moving average
A mean is calculated over the length of the cycle in order to eliminate the cyclic component (seasonal pattern or business cycle) . If you have trimester data and want to eliminate the trimester influence, you would calculate the average for the three trimesters. For quarterly data you would calculate the mean for the four quarters. To eliminate the business cycle you first need to know the lengths of the economic cycles (Kondratieff cycle » 50 years, Juglar cycle » 7 years, Kitchin cycle » 4 years). You then calculate means for 50, 7 and 4 years. Depending on whether the number of periods in a cycle is odd or even, one of two methods is used to calculate the trend using a moving average. To start with, the simpler of the methods is presented below: an odd number of periods.
197
Statistics using Excel 2007
9.3.1
Moving averages with an odd number of periods
Example The volume in millions of dollars of a company’s sales for the years 1980 – 2007 are given. It is required to determine the trend by eliminating the medium-‐term business cycle (7 years) from these data. See Table 9.1. Table 9.1: Sales (in millions of dollars) for a company from 1980 -‐2007 Year Sales Year Revenues Year Revenues Year Revenues 1980 5.3 1987 9.1 1994 8.3 2001 7.0 1981 7.8 1988 9.5 1995 9.3 2002 7.2 1982 7.8 1989 9.0 1996 8.6 2003 8.9 1983 8.7 1990 7.1 1997 7.8 2004 9.7 1984 6.7 1991 6.8 1998 8.1 2005 7.6 1985 6.6 1992 6.2 1999 7.9 2006 7.5 1986 8.6 1993 7.8 2000 7.5 2007 6.9 The moving average for a given year for a 7-‐period (year) cycle is calculated by adding together the data for the year concerned, the data of the three preceding years and the data of the three following years, and dividing the result by 7. The moving average for 1993 = (7.1+6.8+6.2+7.8+8.3+9.3+8.6) / 7 = 7.73
9.3.2
Moving average with an even number of periods
Example: Consider the company in section 9.2.1, except that now you want to determine the trend for the sold quantities by eliminating the short (four-‐year) economic cycle. Calculating an average for an even number of periods (four years in this case) presents a problem, since none of the numbers is in the middle of an even number, and the middle falls precisely between two numbers. You can solve this problem as follows: 1980 5.3 1981 7.8 7.40 1982 7.8 7.575 7.75 1983 8.7 1984 6.7 The average of the years 1980 to 1983, inclusive (7.40), falls between 1981 and 1982. The average of the years 1981 to 1984, inclusive (7.75), falls between 1982 and 1983. If you average these two numbers (7.575), the average coincides precisely with 1982. The difference in calculating the trend via the moving average for even and odd numbers of periods is also relevant when using Excel to calculate the trend.
198
9 Time series analysis
9.3.3
Moving average via data analysis
You first determine the moving average for an odd number of periods. You can use Table 9.1 and calculate the trend based on a medium-‐term economic wave (seven years). Method 1. Enter in cell A1 ‘Year’ , in cell B1 ‘Sales’ and in cell C1 ‘Trend’. 2. Enter 1980 – 2007 in the cells A2:A29 (you can save much effort by dragging) and the associated sales (see Table 9.1) into cells B2:B29. 3. Open the Data tab, click Data Analysis in the Analysis group, and finally click Moving Average.
Figure 9.1: Opening the Moving Average dialog The following dialog appears when you click OK.
Figure 9.2: Moving Average dialog 4. 5.
Enter for Input Range cells B2:B29, for Interval 7 (for a seven-‐year cycle) and for Output Range cells C2:C29, and then click OK. Excel does not position the moving average correctly. The moving average for the first seven years (1980-‐ 1986) is not positioned behind the middle year (1983), but behind the last year (1986).
199
Statistics using Excel 2007
Figure 9.3: Excel worksheet with calculated trend You can correct the positioning as follows: Select cells C2:C4. Open the Home tab and click Delete in the Cells group, then select Delete Cells.
Figure 9.4: Deleting selected cells Select ‘Shift cells up’ and click OK;
6.
Figure 9.5: Specifying the action when deleting cells Enter in cells C27:C29 #N/A. Select cells C5:C26 and decrease decimals to 1 (which is clearer when comparing the data with the trend).
200
9 Time series analysis The upper part of the Excel worksheet is now as follows.
Figure 9.6: Result after deleting cells You will need to create a chart to obtain a clear view of the original data and the trend. Method 1. Select cells B1:C29. th 2. Open the Insert tab. Click Line in the Charts group and select the 4 subtype (line with data markers).
Figure 9.7: Inserting a line chart with data markers 3.
4.
A line chart of Sales and Trend appears. The various years must appear on the horizontal axis. Open the Design tab and click Select Data in the Data group. Click Edit under Horizontal (Categories) Axis Labels and in the Axis Label Range box specify cells A2:A29. Finally on the Layout tab in the Labels group add a chart title and vertical axis title. Chart title: Chart 9.2: Sales and trend, Vertical axis title: Sales (* million dollars).
You then obtain (after a few minor adjustments) the following result.
201
Statistics using Excel 2007
Chart 9.2: Sales and trend Sales (* m illion dollars)
12,0
10,0
8,0
6,0
Sales Trend
4,0
2,0
0,0 1980
1985
1990
1995
2000
2005
The pronounced peaks and troughs in the original sales have disappeared in the trendline. The method for determining the moving average for an even number of periods (e.g. the short four-‐year economic wave), is similar to that for an odd number of periods, but with the following differences. Method 4. Enter for Input Range 4 (for a four-‐year cycle). 5. For an even number of periods you have to average two calculated trend values, which you can perform in column D. Enter in cell D1: Trend. The first trend value you can calculate is for 1982. It is the average for the first four years (1980 – 1983), which is in cell C5, and the average for the years 1981-‐1984, is in cell C6. Therefore, enter in cell D4: =(C5+C6)/2. Then copy this formula into cells D5:D27 (the last year for which you can calculate the trend is 2005). The upper part of your Excel worksheet will be as follows.
Figure 9.8: Excel worksheet of calculated trend in four-‐year periods
202
9 Time series analysis You can create a chart in a similar way to the sales and trend based on a seven-‐year cycle. N.B. the columns you want to use to produce a chart (B and D) are not contiguous. First select cell range B1:B29 for data range and then press and hold down the Ctrl key while selecting the cell range D1:D29. This finally produces the following result.
Chart 9.3: Sales and trend Sales (* m illion dollars)
12,0
10,0
8,0
6,0
Sales Trend
4,0
2,0
0,0 1980
1985
1990
1995
2000
2005
If you compare the trendline (based on four years) with the trendline in Chart 9.1 (based on seven years) you will see that the first trendline has more peaks and troughs. Therefore, the longer the cycle, the more stable (straighter) the trend.
203
Statistics using Excel 2007
9.3.4
Moving average via formulas
Since it is easy to copy formulas in Excel, you could also opt to enter the formula for the trend. A substantial advantage compared with the moving average via data analysis is the ability to enter the formula immediately in the correct location. As an example, you can use the data in Table 9.2. Table 9.2: Revenues (* $ 1,000) for a hotel Revenues (* $ 1,000) Year Quarter 2004 I 118 II 237 III 425 IV 165 2005 I 117 II 235 III 422 IV 170 2006 I 127 II 258 III 472 IV 186 2007 I 135 II 273 III 497 IV 195 2008 I 140 II 282 III 511 IV 200 Method 1. Enter ‘Year’ in cell A1, ‘Quarter’ in cell B1, ‘Revenues’ in cell C1 and ‘Trend’ in cell D1. 2. Enter the various years in cells A2, A6, A10, A14 and A18, the various quarters in cells B2:B21 and the associated revenues in cells C2:C21. 3. The first quarter for which you can calculate the trend is the third quarter of 1994. Enter in cell D5: =(C2+2*C3+2*C4+2*C5+C6)/8. 4. Copy the function from cell D4 to cells D5:D19. 5. Select cells D4:D19 and decrease decimals to 1 (which is clearer when comparing the data with the trend). This finally produces the following result.
204
9 Time series analysis
Figure 9.9: Excel worksheet with calculated trend values per quarter You can create a chart of the above in the same way as in Chapter 9.3.3. The final result (with the x-‐axis modified) is as follows.
Chart 9.4: Revenues and trend for a hotel 2004-‐2008 600
Revenues (* m illion dollars)
500
400
300
Revenues
Trend 200
100
0 2004
2005
2006
2007
2008
205
Statistics using Excel 2007
9.4
Trend via lineair regression (least squares method) This chapter presents the linear model (see also Chapter 8 linear regression analysis). The trend in the linear model is a straight line: y = ax + b, where y is the variable being investigated and x is the various years (or quarters), a is the slope of the line and b the intercept with the y-‐axis. Example: See Table 9.1: Sales (in millions of dollars) for a company from 1980-‐2007. Method 1. Click Data Analysis in the Analysis group on the Data tab. In Data Analysis select Regression. 2. For Input Y Range: enter B1:B29. 3. For Input X Range: enter A1:A29. 4. Check the Labels box. 5. Enter Linear by New Worksheet Ply. 6. Check the Residuals and Residual Plots boxes. Part of the result is shown below.
Figure 9.10: Regression analysis output The trendline is: y = 0.015681445 x – 23.27200328 To determine the trend for the various years with the linear model of the least squares method, enter the various years in the formula for x. E.g. enter in cell C2: =0.015681445*A2-‐23.27200328 and copy this formula to cells C3:C29. This finally produces the following result.
206
9 Time series analysis
Figure 9.11: Excel worksheet with calculated trend values produced by regression analysis (least squares method) You cannot use this trendline directly in forecasting, because the correlation coefficient (0.1214) is negligible.
Year Residual Plot 2
Residuals
1 0 1975 -‐1
1980
1985
1990
1995
2000
2005
2010
-‐2 -‐3
Year
Figure 9.12: Chart with residuals (difference between trend and actual figures) It is conspicuous in the chart with residuals that no account was taken of cyclic movement. Therefore, when dealing with quarterly or trimester data, number the quarters or trimesters 1, 2, 3, 4, 5, 6, 7, etc.. The x values must be numbers in the regression analysis!
207
Statistics using Excel 2007
9.5
Seasonal patterns and cyclic movements
You were able to conclude in all the above examples where a trend was calculated that the trend could not be used directly in forecasting. All the examples had clear seasonal patterns or cyclic movements that have to be dealt with. This chapter examines seasonal patterns. The method for calculating cyclic movements is analogous with this account. 9.5.1 The additive model The following model can be used when seasonal fluctuations (the difference between high and low seasons) are almost constant: Time series = Trend + Seasonal component (+ Irregular component) The revenues of the hotel in Table 9.2 are used as an example, where the trend is determined by means of moving averages (see section 9.2.4). N.B. the procedure if you calculated the trend in a different way is similar to the account below.
The seasonal fluctuations are conspicuous in the chart, as are the almost constant differences between high and low season, which confirm that the additive model can be used. Use the Excel worksheet from 9.2.4 and enter data as follows. 1. Enter in cells E1, F1, G1, H1: I, II, III, IV (you will determine the trend deviations per quarter in the various columns). 2. Enter in cell E5: = C4-‐D4 and copy this formula to cells E5:E19. 3. Move (using Cut and Paste in the Edit drop-‐down list) cell E4 (trend deviation from quarter III) to cell G4, cell E5 to H5, do not move cell E6, cell E7 to F7 etc. up to and including cell E19 to F19. 4. Enter in cell A22: Sum of deviations, in cell A23: Mean of deviations, in cell A24: Preliminary seasonal component, in cell A25: Between calculation, and in cell A26: Seasonal component. 5. In cell E22 you sum cells E2:E21, and copy this formula to cells F22:H22. 6. In cell E23 you divide E22 by COUNT(E2:E21) and copy this formula to cells F23:H23. 7. In cell I23 sum the cells E23:H23. 8. In E24 you round the seasonal pattern to a whole number. Since the sum of the seasonal patterns has to be 0, the value of I23/4 must first be subtracted from E23. 9. What remains is a rather obscure adjustment for any difference of the sum of the seasonal deviations from 0 in cells E25:H26. N.B. it is also possible to adjust the seasonal deviations where the sum differs from 0 using the least rounding errors method. The formulas in cells E22:E26 are shown below.
Figure 9.14: Formulas for calculating the various deviations in the additive model
208
9 Time series analysis The irregular component is calculated in column J with the formula: irregular component = historical series – trend – seasonal component Therefore enter in cell J4: =C4-‐D4-‐G26, in cell J5: =C5-‐D5-‐H26, etc.. Finally round the numbers to zero decimal places. The final result is shown below.
Figure 9.14: Excel worksheet with additive model applied 9.5.2 The multiplicative model The multiplicative model can be used for growing seasonal fluctuations and a rising trend, or declining seasonal fluctuations with a falling trend. Time series = Trend * Seasonal index (* Irregular component) This model uses trend indices. These indices can be calculated by dividing the historical series by the trend and multiplying by 100%. The rest of the procedure is analogous to that of the additive model. 1. The trend indices per quarter are calculated first. 2. The sum of the trend indices is then determined and an average trend index calculated. 3. The sum of the trend indices must be 400 (for four periods in a cycle). 4. The irregular component can then be calculated by: Irregular component = historical series / trend * (seasonal index/100)
209
Statistics using Excel 2007
Example Table 9.3: Number of visitors (* 1000) to Addo National Elephant Park Year Quarter Number of visitors 2005 I 94 II 200 III 182 IV 139 2006 I 97 II 213 III 193 IV 150 2007 I 104 II 225 III 208 IV 158 2008 I 110 II 239 III 223 IV 169 When you present the above data in a chart, the result is as follows.
Chart 9.5: Number of visitors of Addo National Elephant Park 300
Visitors (* 1 ,000)
250 200 150 100 50 0 2005
2006
2007
2008
There is a rising trend with increasing seasonal fluctuations (the lines linking the low-‐season and high-‐season figures diverge). You could therefore apply the multiplicative model in this case.
210
9 Time series analysis The formulas in lower part of the worksheet are repeated below.
Figure 9.15: Formulas for calculating the seasonal indices The final result is as follows.
Figure 9.16: Excel worksheet with multiplicative model applied 9.6 Forecasting The objective of time series analysis is to use the model to forecast the future of the dependent variable under investigation. Forecasting occurs in two steps: 1. extrapolate the trend figures; 2. incorporate the seasonal effect into the extrapolated trend figures. The same method for extrapolating the trend figures is used in both the additive and multiplicative models. However, the extrapolation does depend on whether the trend is determined with the moving averages method or linear regression analysis. The method for incorporating the seasonal effect depends on which model (additive or multiplicative) was chosen.
211
Statistics using Excel 2007
9.6.1
Forecasting with the additive model The data of Table 9.2, the quarterly revenues of a hotel, are used as an example. If you use these data to forecast the revenues of the various quarters of 2009, you first have to calculate the predicted trend figures for the various quarters. It is expected with moving averages that the trend change will be the same in the future as in the observed periods. In the example you can assume that the trend increase from the third quarter of 2004 to the second quarter of 2008 will be repeated in the future. 15 Average trend increase = (282.6 – 236.1) / 15 = 3.1 Therefore the average revenue (trend) increases each quarter by 3.1 (* € 10,000). You may therefore expect for the third quarter trend figure for 2008: 282.6 + 3.1 = 285.7 For the fourth quarter: 285.7 + 3.1 = 288.8, and so on. For the quarters of 2009 the predicted trend figures are then: 2009 Quarter Predicted trend I 282.6 + 3 * 3.1 = 291.9 II 282.6 + 4 * 3.1 = 295.0 III 282.6 + 5 * 3.1 = 298.1 IV 282.6 + 6 * 3.1 = 301.2 Finally you have to add the seasonal components to the predicted trend figures to obtain the expected revenue. 2009 Quarter Predicted trend Seasonal component Predicted revenue I 282.6 + 3 * 3.1 = 291.9 -‐128 291.9 –128 = 193.9 II 282.6 + 4 * 3.1 = 295.0 0 295.0 + 0 = 295.0 III 282.6 + 5 * 3.1 = 298.1 203 298.1 + 203 = 501.1 IV 282.6 + 6 * 3.1 = 301.2 -‐75 301.2 – 75 = 226.2 9.6.1 Forecasting with the multiplicative model The data of Table 9.3, the visitor numbers of Het Hooge Veld wildlife park, are taken as an example. The trendline and least squares method are now used to predict the trend. The trend line is: y = 3.2971x + 140.98 (where: x = 1 the first quarter of 2005, x = 2 the second quarter of 2005, and so on) The predicted trend for the first quarter of 2009 (x = 17) is then: 3.2971 * 17 + 140.98 = 197.0 In order then to be able to predict the visitor numbers you have to multiply the trend by the seasonal index / 100%. 2009 Period Predicted trend Seasonal index Predicted visitors I 17 3.2971 * 17 + 140.98 = 197.0 61 197.0 * 61/100 = 120.2 II 18 3.2971 * 18 + 140.98 = 200.3 131 200.3 * 131/100 = 262.4 III 19 3.2971 * 19 + 140.98 = 203.6 118 203.6 * 118/100 = 240.2 IV 20 3.2971 * 20 + 140.98 = 206.9 90 206.9 * 90/100 = 186.2 15
The number of rises equals the number of periods – 1. For 16 quarters you would therefore have 16 –1 = 15 rises.
212
9 Time series analysis
9.7
Seasonal adjustment of a time series
People are sometimes interested more in how the average (trend) develops than in the actual figure. An example would be unemployment statistics. Unemployment is generally higher in the winter months than in the summer. If you were only to look at the actual figures, you would observe unemployment falling in the course of the year and then increasing again. However, people are more interested in the general unemployment trend, and they then refer to a seasonally adjusted unemployment figure. For instance, actual unemployment might increase by 10,000 in December, while average unemployment (i.e. seasonally adjusted unemployment) declines by 4,000. The unemployment trend is therefore favourable in that case. Another example is the proportion of trains that run on time. It is known that autumn and winter weather causes a fall in this proportion compared with the spring and summer. This is another time series for which people are more interested in the seasonally adjusted figure than the actual figure. The procedure for adjusting a time series is the reverse of forecasting. For an additive model: Adjusted time series = Real time series – Seasonal component For a multiplicative model: Adjusted time series = (Real time series / Seasonal index)* 100 If, for example, it is known that the proportion of trains that run on time is 5% below the trend in the fourth quarter (assuming an additive model), then this would mean that after adjustment a 72% actual punctuality rate for the fourth quarter of 2001 would be 72% -‐ (-‐ 5%) = 77%.
213
Statistics using Excel 2007
9.8
1.
Exercises
The revenues from 2007 to 2009, inclusive, of a company are given. Table 1: Revenues from 2007 to 2009, inclusive Year Quarter Sales (* million dollars) 2007 I 11.87 II 12.13 III 12.52 IV 16.11 2008 I 12.45 II 13.08 III 12.86 IV 17.03 2009 I 12.53 II 13.20 III 13.30 IV 17.50 a. Calculate the trend values using the moving average. b. Determine which seasonal model is more appropriate for the data. c. Calculate the seasonal components for each quarter. d. Predict the revenue for the first quarter of 2010, assuming that the trend increase after the second quarter of 2009 equals the average rise of the trend values from the third quarter of 2007 up to and including the second quarter of 2009.
2.
A container company has recorded the quarterly sales of a given product for several years. Table 2: Sales of a product for 2006 to 2008, inclusive Year Quarter Sales (*1,000 tons) 2006 I 83.0 II 95.6 III 117.9 IV 125.3 2007 I 96.5 II 109.4 III 121.1 IV 130.5 2008 I 104.8 II 116.2 III 131.6 IV 143.4 a. Calculate the trend values using the moving average. b. Determine which seasonal model is more appropriate for the data. c. Calculate the seasonal components for each quarter. d. Derive the following trendline equation based on linear regression: T = 3.45 * t + 92.2 where t: the time in quarters (t = 1 in 2006 quarter I) T: the trend value (in 1,000 tons) e. Calculate the annual trend increase. f. Predict the sales for each quarter of 2009 based on the trendline and the seasonal components.
214
9 Time series analysis 3.
The management of the Miracle Star amusement and entertainment park is considering whether to hold several events in 2011, with a view to making the park more attractive to the public. To provide a sound justification of the investments that would be needed, they consider that annual visitor numbers on a ‘business as usual’ basis must exceed 1,500,000 in 2010. They perform a simple analysis to form a view of the visitor number trend since 1998. We will carry out the analysis in this exercise, with the objectives of forecasting visitor numbers in 2010. Table 3: Visitor numbers per quarter in the 2006-‐2009 period Year Quarter Visitors (* 1,000) 2006 I 141 II 400 III 373 IV 216 2007 I 146 II 420 III 390 IV 225 2008 I 156 II 438 III 412 IV 237 2009 I 165 II 465 III 435 IV 254 a. Calculate the trend values based on the moving average. b. Show that the multiplicative model is more appropriate to the above data. c. Determine the seasonal pattern. d. Visitor numbers would appear to be 182,000 in the first quarter of 2010. Adjust this figure for the season. e. Derive the following trendline equation based on linear regression: T = 4.96 * t + 262.4 where t: the time in quarters (t=1 for quarter I in 2006; t=2 for quarter II in 2006, etc..) T: the trend values (in 1,000) f. Forecast the visitor numbers for quarters II, III and IV of 2010 using the given regression line. g. Do the calculations suggest that the target of 1,500,000 visitors a year will be met?
215
Statistics using Excel 2007
216
Appendix A: Data file “Fiction 2000”
Appendix A: Data file “Fiction 2000” Number Gender
Age
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
21 27 18 20 26 60 42 51 19 21 22 55 38 39 31 20 23 21 19 31 19 32 55 15 59 43 62 33 17 15 29 59 31 26 22 40 39 53 23 38 23 63 21 31 60 20 35 19 55 53
2 2 2 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 2 2 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 2 2 2 1 1 1 1 2
Traveling time 35 29 23 32 14 104 58 55 44 5 36 67 35 54 61 7 28 37 66 22 19 22 120 14 75 35 98 23 29 9 42 51 32 14 15 58 37 100 54 98 22 147 35 58 42 75 15 7 100 66
Travel days 3 5 5 4 5 3 5 5 4 5 5 4 5 3 5 4 5 3 4 5 4 3 4 5 4 3 3 5 5 5 5 4 1 5 4 5 4 4 5 2 5 3 5 4 4 3 5 5 3 4
Income 7 28 11 6 36 54 47 63 5 19 6 41 40 23 35 8 15 13 7 33 9 18 46 1 52 28 51 28 3 1 32 33 28 24 10 42 47 52 25 37 23 45 7 35 48 11 56 8 52 35
Daily Cancelled newspaper 7 0 3 0 7 0 5 0 5 0 2 0 5 0 3 0 7 0 6 0 5 0 6 0 1 0 1 0 5 0 7 0 5 0 1 0 7 0 3 0 5 0 4 0 3 0 7 0 4 0 6 0 6 0 3 0 7 0 7 0 2 0 6 0 5 0 7 1 1 0 3 0 3 0 6 0 1 0 1 0 3 0 6 0 7 0 6 0 6 0 7 1 4 0 7 0 4 0 3 0
Metro Opinion Spits Opinion Metro Spits 1 4 1 4 0 0 1 5 1 4 1 3 1 3 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 0 1 4 1 5 1 5 1 4 1 2 0 0 0 0 1 3 0 0 1 4 0 0 1 2 1 3 1 2 0 0 1 4 1 4 0 0 1 2 0 0 1 3 0 0 1 4 0 0 1 4 1 3 0 0 1 2 0 0 0 0 1 5 0 0 1 4 0 0 0 0 1 3 0 0 1 4 0 0 1 2 1 4 1 4 1 4 0 0 1 3 1 3 0 0 0 0 0 0 0 0 0 0 1 5 0 0 0 0 1 4 0 0 1 4 1 4 1 5 1 4 0 0 0 0 1 3 1 4 1 3 0 0 1 4 1 3 0 0 1 4 0 0 1 4 1 2 1 1 0 0 0 0 1 5 0 0 0 0 0 0 1 4 0 0 0 0 0 0 1 4
217
Statistics using Excel 2007 Number 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
218
Gender
Age
2 1 1 1 2 1 2 1 1 2 1 1 1 2 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 2 1 1 1 2 1 1 2 1
33 19 64 39 21 49 22 35 46 16 29 33 50 42 29 27 51 21 43 42 36 28 40 26 28 32 42 45 50 41 26 29 22 40 47 31 23 31 35 35 35 20 54 26 25 18 19 32 29 48
Traveling time 51 14 42 61 32 70 36 37 58 22 15 29 70 9 23 51 100 54 58 89 28 23 67 5 23 44 58 54 104 35 35 23 15 32 98 51 19 89 61 31 42 37 89 14 23 58 22 15 36 9
Travel days 3 5 4 3 4 1 4 5 5 5 5 5 4 4 5 5 2 4 4 4 3 5 4 5 5 5 5 5 2 2 5 5 4 5 3 5 5 4 3 5 4 4 1 5 5 5 5 5 5 5
Income 24 6 41 45 14 58 18 36 72 2 32 35 74 25 33 36 47 25 33 36 30 35 29 32 35 28 36 53 58 15 25 28 15 32 35 36 10 26 24 42 35 9 58 39 35 5 6 38 26 35
Daily newspaper 1 7 6 6 5 5 3 3 3 7 1 4 3 6 6 5 3 5 3 1 2 5 7 7 5 3 3 3 6 5 3 1 5 7 3 5 7 7 6 5 3 7 5 2 5 7 7 5 6 6
Cancelled Metro Opinion Spits Opinion Metro Spits 0 1 5 1 3 0 1 4 0 0 0 0 0 1 1 0 1 4 1 2 0 1 3 0 0 0 1 3 0 0 0 0 0 1 3 0 0 0 1 4 0 1 2 1 4 0 0 0 0 0 0 1 4 0 0 0 1 4 0 0 0 0 0 1 4 0 1 2 0 0 0 1 3 0 0 0 1 4 0 0 0 1 3 0 0 0 1 3 1 1 0 0 0 1 3 0 0 0 1 3 0 1 3 0 0 0 1 3 0 0 1 1 4 0 0 1 1 5 1 5 0 1 3 0 0 0 0 0 1 4 0 0 0 0 0 0 1 3 1 3 0 1 4 0 0 0 0 0 0 0 0 0 0 1 4 0 0 0 1 4 0 1 4 1 2 1 1 4 1 3 0 0 0 1 3 0 1 2 0 0 0 1 4 1 4 1 0 0 1 4 0 1 4 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4 1 5 0 0 0 0 0 0 1 2 0 0 0 0 0 0 0 0 1 5 1 3 0 0 0 1 3 0 1 4 0 0 0 1 3 0 0 0 0 0 0 0
Number 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
Gender
Age
1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 2 1 1 1 2 1 1 2 1 1 2 1 1 2 1 1 2 2 2
27 64 45 22 19 21 46 57 29 23 29 19 35 35 41 14 54 56 29 29 13 39 29 22 21 19 36 29 39 21 30 19 44 14 50 28 49 14 43 29 37 19 43 28 23 27 23 45 50 19
Traveling Travel time days 31 5 42 4 22 5 12 4 14 5 7 5 23 5 66 3 29 5 70 3 28 5 23 5 54 4 15 5 100 4 22 5 75 4 32 3 12 5 42 5 19 5 67 4 54 5 31 4 5 4 37 5 14 5 58 5 51 4 12 4 35 4 43 5 58 5 35 5 22 4 22 5 35 4 7 5 31 3 61 5 17 5 9 5 66 5 12 5 75 5 15 5 43 5 14 4 67 4 22 5
Income 20 41 36 26 8 4 34 44 37 26 30 6 30 32 35 1 48 44 29 25 2 35 32 8 18 9 28 32 36 19 22 14 59 2 57 30 44 1 24 32 34 18 47 29 27 28 25 32 40 9
Appendix A: Data file “Fiction 2000” Daily newspaper 3 3 6 5 7 7 7 6 5 1 3 7 7 6 6 7 4 4 5 1 7 6 6 7 1 7 7 5 5 3 1 2 2 7 2 5 4 7 6 6 1 3 5 5 5 5 5 6 1 7
Cancelled Metro Opinion Spits Opinion Metro Spits 0 0 0 0 0 0 0 0 1 3 0 0 0 0 0 0 1 4 0 0 0 1 5 1 2 0 0 0 1 3 1 1 3 0 0 0 0 0 0 0 0 1 3 0 0 0 1 5 1 4 0 0 0 1 3 0 1 4 1 4 0 1 4 1 4 0 1 2 0 0 0 0 0 0 0 0 0 0 1 3 0 0 0 0 0 0 0 0 0 0 0 1 3 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4 0 0 0 1 4 1 1 0 1 5 1 3 0 0 0 1 3 0 1 4 1 5 1 0 0 0 0 0 1 3 0 0 0 1 2 0 0 0 0 0 1 5 0 0 0 1 3 0 1 2 1 1 0 1 1 0 0 0 0 0 0 0 0 1 3 0 0 0 1 3 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 1 1 0 1 4 0 0 0 1 3 1 2 0 0 0 1 3 0 0 0 0 0 0 1 2 0 0 0 1 4 0 0 0 1 3 1 2 0 0 0 0 0 0 1 3 1 3 0 0 0 0 0 0 1 4 1 5
219
Statistics using Excel 2007 Number 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
220
Gender
Age
1 2 1 1 1 1 1 1 1 1 2 1 1 2 1 2 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 2 1 2 1 2 1 1 1 1 2 2 1
37 17 20 18 21 40 21 26 27 25 23 30 17 20 41 19 15 22 58 35 16 21 19 23 18 35 25 49 49 26 29 24 32 29 28 30 46 38 50 28 28 53 30 24 36 20 27 21 21 26
Traveling Travel time days 5 5 19 5 23 5 28 5 40 4 29 5 14 5 54 3 35 3 12 5 25 5 43 5 31 5 32 4 36 5 37 5 22 5 22 5 75 2 15 4 7 5 36 5 5 4 31 4 14 3 42 5 70 3 51 2 26 4 25 5 17 4 37 5 67 3 9 4 15 5 22 2 58 5 61 4 43 5 44 4 19 4 40 4 36 3 31 3 12 5 26 5 35 3 17 4 14 5 40 5
Income 26 4 10 8 10 32 7 25 24 30 29 25 4 11 36 6 4 11 56 29 2 7 9 26 5 44 25 15 36 22 21 23 28 24 22 16 46 42 40 30 25 35 26 22 33 5 33 24 6 33
Daily Cancelled Metro Opinion Spits Opinion newspaper Metro Spits 7 1 1 5 1 3 7 0 0 0 1 4 7 0 1 4 1 4 7 0 0 0 1 4 7 0 1 5 0 0 3 0 0 0 1 3 7 0 1 4 1 4 5 0 1 3 0 0 5 0 1 4 0 0 4 0 1 2 1 1 2 0 1 4 0 0 6 0 1 3 0 0 7 0 1 3 0 0 7 0 1 4 1 3 6 0 0 0 1 2 7 0 1 4 1 4 7 0 0 0 1 4 7 0 1 3 1 4 2 0 1 3 0 0 3 0 0 0 1 4 7 0 0 0 0 0 7 0 0 0 1 4 7 0 1 4 0 0 5 0 1 3 0 0 7 0 0 0 0 0 5 0 1 2 1 1 5 0 0 0 1 2 4 0 1 2 1 2 3 0 0 0 0 0 1 0 1 5 1 3 3 0 1 3 0 0 7 1 1 4 1 5 6 0 1 4 0 0 5 0 0 0 0 0 3 0 0 0 1 4 1 0 0 0 0 0 2 0 1 1 1 1 5 0 1 2 1 2 6 0 1 3 0 0 6 0 0 0 1 3 7 0 1 3 1 3 6 0 0 0 0 0 5 0 1 3 0 0 3 0 0 0 1 4 1 0 0 0 1 4 7 0 1 4 1 4 2 0 1 3 1 2 5 0 1 3 1 2 7 0 1 4 1 5 5 0 1 4 0 0
Number 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250
Gender
Age
1 1 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 2 1 1 2 2 2 1 1 1 1 2 1 1 1 2 1 2 1 1 1 2 1 1 1 2 1
23 42 34 40 44 35 21 38 57 46 38 36 31 24 30 41 27 47 19 43 47 37 17 18 28 19 42 44 29 49 45 20 46 40 29 28 37 36 17 21 57 23 21 62 26 41 43 31 36 44
Traveling time 26 51 22 37 54 23 5 54 32 50 28 61 22 7 50 26 35 43 12 44 84 58 29 17 37 9 54 39 14 19 39 36 43 67 35 37 61 22 23 50 40 5 23 54 12 37 44 14 50 51
Travel days 5 5 4 4 4 4 5 5 4 4 4 5 5 5 4 4 5 5 5 5 4 3 5 5 5 5 4 3 5 4 5 5 5 5 5 4 3 4 5 5 4 5 4 4 5 5 5 5 4 5
Income 25 35 36 42 38 32 6 34 44 49 35 39 25 27 25 45 23 36 6 38 52 36 3 5 36 23 35 36 33 36 45 7 44 48 35 29 30 25 2 12 40 21 20 42 28 35 38 30 36 40
Appendix A: Data file “Fiction 2000” Daily newspaper 3 3 5 4 4 6 7 3 4 4 6 5 3 1 6 1 3 3 7 1 5 5 7 7 6 4 7 6 6 6 5 7 6 3 5 5 3 3 7 7 6 6 7 6 5 7 6 6 7 6
Cancelled Metro Opinion Spits Opinion Metro Spits 0 0 0 1 3 0 1 3 0 0 0 1 2 0 0 0 1 3 0 0 0 0 0 0 0 0 0 0 1 4 0 0 0 1 3 0 0 0 1 5 0 1 2 0 0 0 0 0 1 3 0 0 0 1 2 0 1 3 0 0 0 0 0 1 1 0 0 0 1 4 0 1 5 1 2 0 0 0 0 0 0 1 2 1 4 0 0 0 1 3 0 1 3 1 3 0 0 0 0 0 0 1 2 0 0 0 1 3 0 0 0 0 0 1 4 0 1 4 1 5 0 0 0 1 3 0 1 3 1 3 0 0 0 0 0 0 0 0 0 0 0 1 4 1 3 0 0 0 0 0 0 0 0 0 0 0 1 5 1 3 0 0 0 0 0 0 0 0 1 4 0 1 2 0 0 0 1 1 1 1 0 0 0 0 0 0 1 2 1 4 0 0 0 1 2 1 1 5 0 0 0 1 2 0 0 0 0 0 1 4 1 1 4 1 5 0 1 3 0 0 0 1 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 3 0 1 4 0 0 0 1 2 0 0
221
Statistics using Excel 2007 Number 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
222
Gender
Age
1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 2 1 2 1 1 1 2 1 1 2 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1
55 54 28 32 20 47 19 18 15 43 41 59 53 48 22 33 31 19 25 23 36 18 36 43 16 18 42 52 46 37 20 21 38 30 31 49 38 28 31 44 21 25 21 14 26 47 19 19 54 22
Traveling time 23 19 9 32 35 36 40 7 39 54 37 120 50 54 12 61 23 67 29 84 19 28 50 37 14 29 114 114 51 50 9 22 54 28 29 50 5 36 32 61 44 51 29 37 50 44 12 28 67 35
Travel days 4 4 5 3 5 4 4 4 5 5 5 3 4 5 5 3 4 5 5 5 3 5 5 4 5 4 4 4 5 3 5 5 2 5 5 4 5 4 4 5 4 5 4 5 5 5 5 5 2 4
Income 35 40 26 30 8 45 12 9 2 37 44 56 42 48 23 22 19 4 20 27 23 7 35 40 3 8 47 52 43 32 6 7 15 30 28 35 32 29 31 40 9 12 12 1 28 35 18 10 38 22
Daily newspaper 3 1 3 5 7 6 7 7 7 6 5 3 3 3 5 5 4 7 7 6 5 7 6 3 7 7 3 4 5 4 7 7 5 5 6 6 3 2 5 6 7 7 7 7 5 3 1 7 1 4
Cancelled Metro Opinion Spits Opinion Metro Spits 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4 0 1 3 0 0 0 1 4 1 4 0 0 0 0 0 0 1 3 1 4 0 0 0 1 4 0 0 0 1 5 0 0 0 0 0 0 1 3 1 1 0 0 0 1 3 0 0 0 0 0 0 0 0 0 0 0 1 3 1 1 0 1 4 0 0 0 1 2 0 0 0 0 0 1 4 1 1 5 0 0 0 1 3 1 3 0 1 3 0 0 0 0 0 1 4 0 1 4 0 0 0 0 0 0 0 0 0 0 0 0 0 1 5 1 3 0 0 0 1 4 0 1 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4 0 1 5 1 3 0 1 4 0 0 0 1 2 0 0 0 0 0 1 3 0 1 1 0 0 0 0 0 1 5 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 5 1 3 0 1 4 1 3 0 0 0 1 3 0 0 0 0 0 0 1 4 0 0 0 0 5 1 5 0 0 5 0 0 0 1 4 1 3 0 0 0 1 3 0 1 4 1 3
Appendix B: Answers
Appendix B: Answers Answers Chapter 2: Tables
Table 2.1: Opinion on "Spits" of 300 respondents Opinion Spits Total No opinion 150 Very bad 12 Bad 17 Reasonable 51 Good 52 Very good 18 Grand Total 300 Source: Fiction 2000
Table 2.2: Opinion on "Spits" per gender of respondents who read "Spits" Opinion Spits Male Female Grand Total Very bad 8 4 12 Bad 12 5 17 Reasonable 36 15 51 Good 41 11 52 Very good 14 4 18 Grand Total 111 39 150 Source: Fiction 2000
Table 2.3: Opinion on "Spits" as percentage per gender of respondents who read “Spits" Opinion Spits Male Female Grand Total Very bad 7% 10% 8% Bad 11% 13% 11% Reasonable 32% 38% 34% Good 37% 28% 35% Very good 13% 10% 12% Grand Total 100% 100% 100% Source: Fiction 2000
Table 2.4: Opinion on “Spits”per gender as percentage per opinion of respondents who read “Spits” Opinion Spits Male Female Grand Total Very bad 67% 33% 100% Bad 71% 29% 100% Reasonable 71% 29% 100% Good 79% 21% 100% Very good 78% 22% 100% Grand Total 74% 26% 100% Source: Fiction 2000
Table 2.5: Travelling time per train of 300 respondents Travelling time per train (minutes) 0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89 90-99 100-109 110-119 120-129 140-149 Grand Total
Total 23 45 56 57 25 45 21 9 5 3 6 2 2 1 300
Source: Fiction 2000
223
Statistics using Excel 2007
Table 2.6: Income of 300 respondents Income (*€100) Total 0-4 51 5-9 32 10-14 69 15-19 85 20-24 42 25-29 17 30-34 2 35-39 2 Grand Total 300
Source: Fiction 2000
Table 2.7: Cumulative percentage for income of 300 respondents Income (*€100) <0 0,0% <5 17,0% < 10 27,7% < 15 50,7% < 20 79,0% < 25 93,0% < 30 98,7% < 35 99,3% < 40 100,0%
Source: Fiction 2000
Table 2.8: Travel days per train per week per gender of 300 respondents Travel days per Male Female Grand Total train per week 1 3 0 3 2 6 3 9 3 19 16 35 4 62 29 91 5 133 29 162 Grand Total 223 77 300 Source: Fiction 2000
224
Table 2.9: Subscriptions to daily newspaper of respondents of 30 years and older Daily newspaper Total Algemeen Dagblad 14 NRC 6 Telegraaf 36 Trouw 16 Volkskrant 27 Other 43 None 10 Grand Total 152 Source: Fiction 2000
Table 2.10: Subscriptions to daily newspaper of maler respondents of 30 years and older Daily newspaper Total Algemeen Dagblad 10 NRC 5 Telegraaf 34 Trouw 13 Volkskrant 18 Other 31 None 8 Grand Total 119 Source: Fiction 2000
Answers Chapter 3: Charts
Source: Fiction 2000
Appendix B: Answers
Source: Fiction 2000
Source: Fiction 2000
Source: Fiction 2000
Source: Fiction 2000
Source: Fiction 2000
Source: Fiction 2000
Source: Fiction 2000
225
Statistics using Excel 2007
Source: Fiction 2000
226
Source: CBS
Source: Fiction 2000
Source: CBS
Source: Fiction 2000
Appendix B: Answers
Source: CBS
227
Statistics using Excel 2007
Answers Chapter 4: Distribution characteristics 1. Income Measures of location Mean Median Mode
Measures of dispersion Interquartile range Standard deviation Measures of shape Skewness Kurtosis 2. Opinion on “Spits” Measures of location Median Mode 3. Travelling time Measures of location Mean Median Mode
13,9 14 18 10 7,5 0,11 -‐0,42
Reasonable Good
Male Female 39,8 35,7 35 35 23 22 Measures of dispersion Interquartile range 32 28 Average absolute deviation 20,2 14,4 Standard deviation 26,2 18,1 Measures of shape Skewness 1,18 0,56 Kurtosis 1,56 -‐0,13 4. Daily newspaper Measures of location Mode None
228
5. Travel days per train per week Measures of location Male Female Mean 4,4 4,1 Median 5 4 Mode 5 4 Measures of dispersion Minimum 1 2 Maximum 5 5 Range 4 3 First quartile 4 4 Third quartile 5 5 Interquartile range 1 1 Average absolute deviation 0,7 0,7 Standard deviation 0,9 0,9 Variance 0,7 0,7 Coefficient of variance 19% 21% Measures of shape Skewness -‐1,69 -‐0,56 Kurtosis 2,94 -‐0,54 6. Travelling time Measures of location Mean 40,3 Median 34,6 Mode 35 Measures of dispersion Interquartile range 33 Standard deviation 27,5 Variance 756,6 Measures of shape Skewness 1,15 Kurtosis 1,28
7. Income Measures of location Mean Median Mode Measures of dispersion Range Interquartile range Average absolute deviation Standard deviation Measures of shape Skewness Kurtosis 8. Income Measures of location Mean Median Mode Measures of Variability Range Standard deviation
Appendix B: Answers
14,2 14,9 17,5 40 10,5 6,2 7,6 0,08 -‐0,43
Male Female 15,4 11,7 15,9 12,0 17,5 12,5 40 25 8,5 6,0
229
Statistics using Excel 2007
Answers Chapter 5: Probabilities Count of Daily newspaper Daily newspaper Algemeen Dagblad NRC Telegraaf Trouw Volkskrant Other None Grand Total Answer 1a Count of Age Age2 < 20 > = 20 Grand Total Answer 1c Count of Daily newspaper Daily newspaper Algemeen Dagblad NRC Telegraaf Trouw Volkskrant Other None Grand Total
Total 8,3% 4,0% 16,7% 6,3% 20,0% 17,7% 27,0% 100,0% 20,0%
Total 15,00% 85,00% 100,00% 85,0%
Total 10,7% 2,0% 25,3% 3,3% 7,3% 12,0% 39,3% 100,0% 10,7% 36,0%
< 30 4,0% 6,0% 21,3% 23,3% 8,7% 63,3% 11,3%
Male
Count of Income (*€100) Income (*€100)2 < 15 > = 15 Grand Total Answer 1d
Female Grand Total 6,0% 2,3% 8,3% 3,0% 1,0% 4,0% 14,7% 2,0% 16,7% 5,3% 1,0% 6,3% 13,3% 6,7% 20,0% 12,3% 5,3% 17,7% 19,7% 7,3% 27,0% 74,3% 25,7% 100,0% 14,7%
Gender Male Female Grand Total 33,0% 17,7% 50,7% 41,3% 8,0% 49,3% 74,3% 25,7% 100,0% 33,0%
Count of Opinion Spits Opinion Spits Very bad Bad Reasonable Good Very good Grand Total Answer 2c
Age2
230
Daily newspaper Algemeen Dagblad NRC Telegraaf Trouw Volkskrant Other None Grand Total Answer 1b
Answer 2a Answer 2b Count of Opinion Spits Opinion Spits Very bad Bad Reasonable Good Very good Grand Total Answer 2d
Count of Daily newspaper Gender
> = 30 4,0% 5,3% 12,7% 11,3% 3,3% 36,7%
Grand Total 8,0% 11,3% 34,0% 34,7% 12,0% 100,0%
Total
8,0% 11,3% 34,0% 34,7% 12,0% 100,0% 19,3%
3.
4.
Appendix B: Answers
a. b. c. d. e.
P(Metro | Spits) = 73 / 150 = 48,7% P(Spits | Metro) = 73 / 163 = 44,8% P(Metro and Spits) = 73 / 300 = 24,3% P(Metro or Spits) = 240 / 300 = 80,0% of 163 / 300 + 150 / 300 – 73 / 300 = 240 / 300 = 80,0% No, because P(Metro) * P(Spits) = 163 / 300 * 150 / 300 = 27,2% and this is unequale to answer c. P(Metro and Spits) = 24,3%.
a. b. c. d.
P(Volkskrant | Female) = 20 / 77 = 26,0% P(Volkskrant | Male) = 40 / 223 = 17,9% P(Volkskrant) = 60 / 300 = 20% No, a higher proportion of women have a subscription on ‘Volkskrant’ than men.
231
Statistics using Excel 2007
Answers Chapter 6: Discrete probability distributions
1. X = Number of times train arrives on time Answer a P(X = 10) = 0,2484 Answer b P(X > 9) = 0,6196 Answer c P(X < 5) = 0,0053 2. X = Number of people who have read Metro Answer a P(X = 15) = 0,1408 Answer b P(X > 15) = 0,3525 Answer c P(10 < X < 20) = P(X < 20) -‐ P(X < 9) = 0,9552 3. X = Number of international train passengers Answer a P(X = 2 | n = 10) = 0,0596 Answer b P(X = 4 | n = 40) = 0,0076 Answer c n = 28 4. X = Number of passengers arriving at the counter during a day Answer a P(X = 7) = 0,1171 Answer b P(X < 5) = 0,0550 Answer c P(X > 10) = 0,4126 5. X = Number of passengers who did not leave Spits behind in the train Answer a P(X = 12) = 0,0194 Answer b P(X =10) = 0,1937 Answer c P(6 < X < 8) = 0,4452 – 0,0267 = 0,4185 6. X = Number of passengers who filled in a cash refund form Answer a P(X = 10) = 0,0039 Answer b P(X = 0) = 0,0000 Answer c P(X = 5) = 0,1882 Answer d P(4 < X < 6) = 0,5680 – 0,0199 = 0,5481 7. X = Number of women 25 years and older who travelled by train last year Answer a P(X = 1 | n = 4) = 0,0921 Answer b P(X = 2 | n =10) = 0,0552 Answer c P(X = 5 | n = 30) = 0,0310 Answer d P(X = 5 | n < 50) = 0,8879
232
Appendix B: Answers
Answers Chapter 7: Continuous probability distributions 1.
2.
3.
4.
5.
a. b. c. d.
P(X < 90) = 0,0013 P(X > 135) = 0,0668 P(105 < X < 135) = 0,8664 126,7 minutes
a. P(X < 0,0833) = 0,8529 b. P(X > 0,1667) = 0,0216 c. P(0,500 < X < 0,1167) = 0,2483 a. b. c. d.
P(X < 90) = 0,0228 P(X > 140) = 0,0912 P(120 < X < 150) = 0,4772 144,7 seconds
a. P(X < 4) = 0,0228 b. P(X > 5,5) = 0,1587 c. 10,21 hours a. P(X < 0,0500) = 0,5934 b. P(X > 0,0833) = 0,2231 c. P(0,0333 < X < 0,0667) = 0,2476
233
Statistics using Excel 2007
Answers Chapter 8: Simple linear regression and correlation analysis 1.
a. b. c. d.
Correlation (r) = 0,4952 y = 1,6073x + 16,51 Travel time is 49 minutes The correlation is moderate (0,4 < r < 0,7). Number of observations is fairly large (n = 300). So the prediction will be moderate.
Chart 8.1: Scatter Diagram Income and Travelling time 160 y = 1,6073x + 16,51 R² = 0,2452
Travelling time (minutes)
140 120 100
80 60 40 20 0 0
5
10
15
20
25
30
35
40
Income (*€ 100)
Source: Fiction2000 2.
a. b. c. d. e.
Correlation (r) = – 0,42 y = – 0,0148x + 4,907 r = – √ 0,1724 = – 0,42 Travel days y = – 0,0148*100 + 4,907 = 3 The correlation is moderate (– 0,7 < r < – 0,4). Number of observations is fairly large (n = 300). So the prediction will be moderate.
Chart 8.2: Scatter Diagram Travelling time and Travel days 6 y = -‐0,0148x + 4 ,907 R² = 0 ,1724
5
Travel days
4 3 2
1 0 0
20
40
60
80
100
120
140
160
Travelling time (minutes)
Source: Fiction2000
234
3.
a. b. c. d.
Appendix B: Answers
Correlation (r) = 0,77 y = 0,3909x – 0,3542 15,3 * € 100 The correlation of female respondents is lower than the correlation of all respondents. The slope of the regression line is lower for women than the slope of the regression line for all respondents. So the increase in income is lower for women when they grow older than for all the respondents.
Chart 8.3: Scatter Diagram Age and Income of female respondents y = 0,3909x -‐ 0,3542 R² = 0,5936
30 25
Income (* 1 00)
20 15
10 5 0 0
10
20
30
40
50
60
70
Age (Years)
4.
Source: Fiction2000
a. b. c. d.
Correlation (r) = 0,99 y = – 2,1364x + 4314,7 9,37% The relationship between Time and percentage Daily Newspaper Subscriptions is very strong. The prediction is not very reliable because the extrapolation is very strong (2015 is far in the future).
Percentage
Chart 8.4: Scatter diagram percentage daily newspaper subscription (25 -‐ 34 year) 60 y = -‐2,1364x + 4314,7 R² = 0,9802
50 40 30 20
10 0 1996
1998
2000
2002
2004
2006
2008
Source: Statistics Netherlands, The Hague/Heerlen 19-‐6-‐2009
235
Statistics using Excel 2007 5.
a. Correlation (r) = – 0,08 b. y = – 0,0636x + 196,49 c. Year Percentage Subsriptions Daily Newspapers 2008 2009 2010 d.
68,71 68,65 68.58
The extrapolation is not far into the future, but the prediction is unreliable nonetheless because the relationship is negligible.
Percentage
Chart 8.5: Scatter diagram percentage daily newspaper subscription (75 year and older) 74 73 72 71
y = -‐0,0636x + 196,49 R² = 0,0067
70 69 68 67 66 65 64 63 1996
236
1998
2000
2002
Source: Statistics Netherlands, The Hague/Heerlen 19-‐6-‐2009
2004
2006
2008
Appendix B: Answers
Answers Chapter 9: Time series analysis
1. Multiplicative model Trend indices Year Quarter Time series Trend I II III IV 2007 I 11.87 II 12.13 III 12.52 13.23 94.63341 IV 16.11 13.42125 120.0335 2008 I 12.45 13.5825 91.66207 II 13.08 13.74 95.19651 III 12.86 13.865 92.75153 IV 17.03 13.89 122.6062 2009 I 12.53 13.96 89.75645 II 13.2 14.07375 93.79163 III 13.3 IV 17.5 Indices I II III IV Sum 181.4185 188.9881 187.3849 242.6397 Mean 90.70926 94.49407 93.69247 121.3199 400.2157 Preliminary 91 94 94 121 400 Between calculation 9.290744 5.50593 6.307529 21.31986 0 Seasonal index 91 94 94 121 d. 13.14 * $ 1 miljoen 2. Additive model Trend deviations Year Quarter Time series Trend I II III IV 2006 I 83 II 95.6 III 117.9 107.138 10.7625 IV 125.3 110.55 14.75 2007 I 96.5 112.675 -‐16.175 II 109.4 113.725 -‐4.325 III 121.1 115.413 5.6875 IV 130.5 117.3 13.2 2008 I 104.8 119.463 -‐14.6625 II 116.2 122.388 -‐6.1875 III 131.6 IV 143.4
Deviations I II Sum -‐30.8375 -‐10.5125 Mean -‐15.4188 -‐5.25625 Preliminary -‐16 -‐6 Between calculation 15.4188 5.25625 Seasonal component -‐16 -‐6 e. 13.8 f. II I III IV 121.05 134.5 151.95 161.4
III 16.45 8.225 8 8.225 8
IV 27.95 13.975 14 13.975 14
1.525 0 0
237
Statistics using Excel 2007 3.
Multiplicative model Year Quarter 2006 I II III IV 2007 I II III IV 2008 I II III IV 2009 I II III IV
Trend indices Time series Trend I II III IV 141 400 373 283.125 131.7439 216 286.25 75.45852 146 290.875 50.19338 420 294.125 142.7964 390 296.5 131.5346 225 300 75 156 305 51.14754 438 309.25 141.633 412 311.875 132.1042 237 316.375 74.9111 165 322.625 51.14297 465 327.625 141.9306 435 254
Indices
I
Sum Mean Preliminary Between calculation Seasonal index
152.4839 50.82796 51 49.17204 51
II
III
IV
426.36 395.3827 225.3696 142.12 131.7942 75.12321 399.8654 142 132 75 400 42.11999 31.79424 24.87679 0 142 132 75
b. Differences between high and low season are increasing and the trend is also increasing. d. f. g.
238
Corrected: (182 * 1000) / 0.51 = 356.9 * 1000 II: III: IV:
(4.96 * 18 + 262.4) * 1.42 = 499.4 (4.96 * 19 + 262.4) * 1.32 = 470.8 (4.96 * 20 + 262.4) * 0.75 = 271.2
182.0 + 499.4 + 470.8 + 271.2 = 1,241.4 No, because it is less than 1,500
Appendix C: Specimen examination questions
Appendix C: Specimen examination questions A random sample of train passengers in September 1999 were asked the following questions. 1. State your gender. 0 Male 1 Female 2. How old are you? ………. years 3. How often on average do you read Spits in a week? 0 never 0 once or twice 0 3 times or more 4. How do you rate Spits content? 0 very bad 0 bad 0 reasonable 0 good 0 very good 0 no opinion 1. How is the set of all train passengers referred to in the above investigation? a. Parameter b. Population c. Statistic d. Random sample 2. On what level is the Gender variable measured (see research question 1)? a. Nominal b. Ordinal c. Interval d. Ratio 3. On what level is the Age variable measured (see research question 2)? a. Nominal b. Ordinal c. Interval d. Ratio 4.
On what level is the ‘Average number of times Spits is read in a week’ variable measured (see research question 3)? a. Nominal b. Ordinal c. Interval d. Ratio
239
Statistics using Excel 2007 5.
Indicate which of the following measures of location you can calculate for the values of the answers to Question 1 (‘State your gender’): a. Median b. Mode c. Arithmetic mean d. None of the above
Indicate which of the following frequency distributions can be created with the answers to Question 2 (‘How old are you?’). Table 1: Ages Age Total 15 -‐< 20 24 20 -‐< 30 35 30 -‐< 40 53 40 -‐< 50 31 50 -‐< 80 57 Total 200 6. In Table 1, the modal age in years is: a. 35 b. 53 c. 57 d. 65 7. Which of the following charts is a proper histogram of Table 1? Frequency per 5 years
Age of 200 train passengers
Frequency
30
60
25
50
20
40
15
30
10
20
5
Age of 200 train passengers
10
0 0
10
20
30
40
50
60
70 Age
80
0 10
0
20
Chart a
Frequency Age per 5 years 30
30
50
60
70 Age
Chart b
of 200 train passengers
Frequency
Age of 200 train passengers
60 50
25 20
40
15
30
10
20
5
10
0 15
20
30
40
0 50
80
15
20
30
Age
Chart c
40
50
80 Age
240
40
Chart d
80
Appendix C: Specimen examination questions 8.
9.
The mean age (arithmetic mean) of the 200 train passengers shown in Table 1 rounded to whole years is: a. 35 b. 37 c. 41 d. 47
The variance in age of the 200 train passengers shown in table 1 is rounded to whole years: a. 17 b. 33 c. 65 d. 289 10. In order to study the data of Question 3 (‘How often on average do you read Spits in a week’) and Question 4 (‘How do you rate Spits content’) simultaneously, you should use a: a. Frequency table b. Histogram c. Contingency table d. Percentage frequency table 11. Which of the following tables gives the clearest indication that female train passengers read Spits relatively less often than male train passengers? Table a Table b Gender Gender How often Male Female Grand total How often Male Female Grand total Never 24 17 41 Never 59% 41% 100% 1 -‐ 2 times 83 19 102 1 -‐ 2 times 81% 19% 100% 3 times or more 46 11 57 3 times or more 81% 19% 100% Grand total 153 47 200 Grand total 77% 24% 100% Table c Table d Gender Gender How often Male Female Grand total Hoe often Male Female Grand total Never 16% 36% 21% Never 12% 9% 21% 1 -‐ 2 times 54% 40% 51% 1 -‐ 2 times 42% 10% 51% 3 times or more 30% 23% 29% 3 times or more 23% 6% 29% Grand total 100% 100% 100% Grand total 77% 24% 100% 12. See the above tables (a, b, c and d). Which of the following statements is correct? a. 9% of female train passengers never read Spits. b. 23% of train passengers who read Spits 3 times or more a week are male. c. 51% of train passengers read Spits once or twice a week. d. All the statements a, b and c are correct. 13. Which chart is best suited to presenting the data of Table a in a single chart? a. Pie chart b. Histogram c. Line chart d. Bar chart
241
Statistics using Excel 2007
Chart 1: Cumulative frequency curve of one way distance by train of 200 passengers Frequency 250 200 150 100 50 0 0
20
40
60
80
100
120 miles
140
14. See Chart 1. Which of the following statements is correct? a. 30% of the passengers have a one-‐way journey of 20 kms. b. Most passengers have a one-‐way journey of between 120 and 140 kms. c. The modal class for one-‐way journey distance is 20 -‐< 40 kms. d. There are no passengers with a one-‐way train journey less than 20 kms. Only 15 of the 200 interviewed train passengers wished to disclose their gross annual income. These gross annual incomes (in thousands of guilders) were: 12 ; 40 ; 67 ; 50 ; 45 ; 8 ; 50 ; 25 ; 34; 75 ; 35 ; 29 ; 0 ; 6 ; 86. 15. The median income (in thousands of guilders) of the 15 train passengers who disclosed their gross annual incomes is: a. 25 b. 35 c. 50 d. 86 16. The standard deviation of the incomes (in thousands of guilders) of the 15 train passengers, who disclosed their gross annual incomes is: a. 16 b. 20 c. 26 d. 32 17. The range of the incomes (in thousands of guilders) of the 15 train passengers who disclosed their gross annual incomes is: a. 15 b. 20 c. 74 d. 86
242
Appendix C: Specimen examination questions 18.
If the incomes had been measured in euros rather than in guilders, using an exchange rate of 1 euro = 2 guilders, then: a. the standard deviation does not change b. the standard deviation is halved c. the standard deviation is doubled d. the standard deviation is 4 times as large
Chart 2: Opinion of "Spits" of 200 train passengers Very bad 10% No opinion 30% Bad 20%
Very good 5%
Good 20%
Reasonable 15%
19. See Chart 2. Which of the following statements is correct? a. 10 passengers consider Spits to be very bad b. 50 passengers consider Spits to be good or very good c. 60 passengers do not read Spits d. None of the above statements are correct 20. You can calculate the following measure of dispersion from the values of the answers to Question 4 (‘How do you rate Spits content?’): a. Range b. Interquartile range c. Standard deviation d. None of the above 21. Assume that the probability of a boy being born is equal to the probability of a girl being born, or P(boy) = P(girl) = 0.5. Which probability is greatest with reference to the gender of the first two children born in the new year in a given maternity hospital. a. Both children have the same gender b. The first child is a boy and the second is a girl c. The first child is a girl and the second a boy d. Options a, b and c are all equally probable.
243
Statistics using Excel 2007
22. For a ‘fair’ die, the probability of throwing a 4 is 1/6. What is the probability that both of two throws of the die will produce a 4? a. 1/36 b. 1/12 c. 1/6 d. 1/3 23. Someone throws a ‘fair’ die 4 times. Which of the following sequences of throws is most probable? a. First a 1, then a 2, then a 3 and finally a 4 b. First a 1, then a 1, then a 1 and finally a 1 c. First a 3, then a 5, then a 2 and finally a 6 d. All the sequences a, b and c are equally probable 24. You meet an arbitrary member of the Dutch public (X). Which probability is the greatest? a. P(X is a teacher) b. P(X is a teacher or a minister) c. P(X is a teacher who lives in Amsterdam) d. P(X is a teacher or a minister who lives in Amsterdam or The Hague)
Chart 3: Binomial distribution (p = 0.3 ; n = 10) P(X) 0.30 0.25
0.20 0.15 0.10
0.05 0.00 0
1
2
3
4
5
6
7
8
9
10
X
25. See also Chart 3. For a binomial probability distribution with p = 0.3 and n = 10, the expected value of the binomial probability distribution is: a. 0.03 b. 1.45 c. 2.10 d. 3.00 26. See also Chart 3. P(3 < X < 5 | p = 0.3 ; n = 10) = a. 0.20 b. 0.57 c. 0.71 d. 0.84 244
Appendix C: Specimen examination questions 27. A random sample of 5 journalists was drawn without ‘replacement’ from a group of 9 female and 18 male journalists to report on the Dutch queen’s visit to the Netherlands Antilles. The probability distribution of the number of women in the random sample is: a. binomial b. hypergeometric c. negative binomial d. Poisson 28. See also Question 27. A random sample of 5 journalists was drawn without ‘replacement’ from a group of 9 female and 18 male journalists to report on the Dutch queen’s visit to the Netherlands Antilles. The probability that the number of women in the sample is 2 is: a. 0.1111 b. 0.3292 c. 0.3639 d. 0.7901 29. A very large number of ‘Superfris’ plastic bottles are placed on a conveyor belt for inspection each hour. The average number of rejected bottles placed on the conveyor belt is 20 per hour. The number of rejected bottles placed on the conveyor belt in a 10-‐minute period has the following probability distribution: a. binomial b. hypergeometric c. negative binomial d. Poisson 30. See also Question 29. A very large number of ‘Superfris’ plastic bottles are placed on a conveyor belt for inspection each hour. The average number of rejected bottles placed on the conveyor belt is 20 per hour. What is the probability that the number of rejected bottles placed on the conveyor belt in an arbitrary 10-‐ minute period is 2? a. 0.0000 b. 0.0176 c. 0.1982 d. 0.7165 31. In the production of serial interfaces, 9% have been found to be defective. The probability distribution of the number of defective serial interfaces in a random sample of 50 serial interfaces is: a. binomial b. hypergeometric c. negative binomial d. Poisson 32. See also Question 31. In the production of serial interfaces, 9% have been found to be defective. What is the probability that the number of defective serial interfaces in a random sample of 50 serial interfaces is greater than or equal to 4? a. 0.1973 b. 0.3303 c. 0.5277 d. 0.6697 245
Statistics using Excel 2007
Chart 4: Normal distribution (μ = 10 and σ = 2) 0,25
Possibility density
0,2 0,15
0,1 0,05 0 4
5
6
7
8
9
10
11
12
13
14
15
16
X 33. See Chart 4. In the normal distribution with µ = 10 and σ = 2, P(7 < X < 10) is: a. 0.13 b. 0.27 c. 0.43 d. 0.57 34. The volume of beer deposited by a bottle filling machine into each bottle is approximately normally distributed, with µ = 30.4 cl and σ = 0.5 cl. What is the probability that a bottle of beer filled using this filling machine will contain less than the minimum 30.0 cl stated on the label? a. 0.0026 b. 0.0359 c. 0.2119 d. 0.8000 35. The volume of beer deposited by a bottle filling machine into each bottle is approximately normally distributed, with µ = 30.4 cl and σ = 0.5 cl. What is the minimum volume setting for the filling machine to ensure that the probability of a bottle being filled to less than 30 cl of beer is no more than 0.0001, assuming the same standard deviation? a. 28.5 cl b. 28.6 cl c. 31.9 cl d. 32.3 cl
246
Appendix C: Specimen examination questions The soft drinks company management wish to develop a method of calculating the delivery costs of crates of soft drink to be charged to its customers. One of the aspects delivery cost factors is the travelling time to the customer’s premises. Management expects that the relationship between time and distance is linear. Management gathers data from 10 customer deliveries selected at random about the time taken for the delivery and the distance to the customer. The table below shows these data. Table 2: Delivery time of crates of soft drinks to 10 customers Customer Time (minutes) Distance (miles) 1 30 12 2 26 34 3 102 101 4 59 26 5 120 128 6 78 33 7 145 189 8 129 159 9 86 52 10 67 86 36. Based on the above data management wishes to determine the linear regression line based on the least squares method. Which variable should they choose as dependent (effect) and which variable as independent variable (cause)? a. dependent: distance independent: customer b. dependent: distance independent: time c. dependent: time independent: distance d. dependent: time independent: customer 37. For the above data, if the Distance variable is denoted by x and the Time variable by y, then the regression line equation based on the least squares method is: a. y = 0.6 x + 35.0 b. y = 1.4 x – 34.0 c. y = 8.8 x + 23.5 d. y = 10.4 x + 11.9 38. The correlation coefficient (r) between Time and Distance based on the data in Table 1 is: a. 0.30 b. 0.41 c. 0.83 d. 0.91
247
Statistics using Excel 2007
Another aspect that influences delivery cost is the time needed to unload the crates of soft drink at the customer’s premises. Here too, management expects the relationship between the Number of deliverable crates variable and the Time variable to be linear. Based on 20 data of the number of deliverable crates and the time needed, management performs linear regression analysis with Excel, in which they choose the number of crates as variable x and the as variable y. Part of the output of the linear regression analysis is given below. SUMMARY OUTPUT Regression Statistics 0.986 Multiple R 0.972 R Square 0.970 Adjusted R Square 1.987 Standard Error 20 Observations ANOVA df SS 1 2443.466006 Regression 18 71.03149378 Residual 19 2514.4975 Total Coefficients Standard Error 24.83 Intercept 1.054218648 Number of crates 0.14 0.005627243
39. The linear regression line equation based on the above data is: a. y = 0.14 x + 24.83 b. y = 0.97 x + 20 c. y = 24.83 x + 0.14 d. y = 2443 x + 71
40. For a given product in the soft drink manufacturer’s range there is a negative relationship between Product price and Product sales. It would appear that the price is able to explain 64% of the range of the sales. What is the correlation coefficient (r) for the correspondence between Product price and Product sales? a. – 0.80 b. – 0.64 c. + 0.64 d. + 0.80
0
41. The equation for the linear relationship between Temperature (x) measured in C and ‘Sales of a given soft drink (y) measured in hectolitres is: y = 0.83 x – 0.46 You can conclude from the above that the correlation between x and y is: a. negative b. weakly negative c. positive d. weakly positive
42. Based on the regression line in Exercise 41 (y = 0.83 x – 0.46), the forecast sales (in hl. to 2 decimal places) of 0 the soft drink concerned at a temperature of 20 C is: a. 16.14 b. 17.06 c. 20.37 d. 23.54 248
Appendix C: Specimen examination questions
Chart 5: Sales of soft drink ‘Cocicool’ 1997-1999
Sales (*€100.000)
y = 10x - 19930 R2 = 0.0769
100 90 80 70 60 50 40 30 20 10 0 1996
1997
1998
1999
2000 Year
43. See the above scatter chart of ‘Cocicool’ sales. A sales forecast for 2010 based on the regression line gives sales of 170 * € 100,000. This forecast is highly unreliable because: a. the forecast involves extremely strong extrapolation b. the regression line is determined by only 3 points c. the correlation is negligible d. all the answers a, b and c are correct The management of a soft drinks manufacturer wishes to analyse the sales of one of its products (‘Ansicool’) in order to produce forecasts for each quarter of 2000. Management has the data shown in the table below at its disposal. Table 3 Sales of soft drink ‘Ansicool’ 2007-‐2009 Year Quarter Sales (hl) 2007 I 6.4 II 16.1 III 18.9 IV 5.3 2008 I 6.7 II 16.5 III 19.0 IV 5.4 2009 I 8.8 II 20.0 III 22.6 IV 8.7 44. What is the trend figure in hectolitres (rounded to 1 decimal place) for Ansicool in the fourth quarter of 2007, if it is calculated on the basis of the moving average? a. 5.3 b. 10.3 c. 11.8 d. 12.7
249
Statistics using Excel 2007
45. Management decides that the additive model is most appropriate for the Ansicool data. The 4 seasonal components (or seasonal patterns) are then: (N.B. it is also possible to answer this question satisfactorily without performing calculations). a. Quarter I: – 5 Quarter II: + 5 Quarter III: + 7 Quarter IV: – 7 b. Quarter I: + 5 Quarter II: – 5 Quarter III: – 7 Quarter IV: + 7 c. Quarter I: + 7 Quarter II: – 7 Quarter III: + 5 Quarter IV: – 5 d. Quarter I: 60 Quarter II: 138 Quarter III: 159 Quarter IV: 43 For another of the soft drink manufacturer’s products, ‘Sisicool’, management has calculated the trend on the basis of the quarterly sales figures for 2007-‐2009 on the basis of the moving average. See Table 4. Table 4: Sales and trend of soft drink ‘Sisicool’ 2007-‐2009 Year Quarter Sales (hl) Trend (hl) 2007 I 60.0 II 102.0 III 126.0 97.3 IV 81.0 111.5 2008 I 100.0 130.5 II 176.0 147.0 III 204.0 161.9 IV 135.0 187.0 2009 I 165.0 222.0 II 312.0 252.4 III 348.0 IV 234.0 The seasonal indices determined using the multiplicative model are as follows: Quarter I II III IV Seasonal index 76 122 129 73 46. How much does the irregular component contribute (in %) to Sisicool sales in the first quarter of 2008 based on the above data? The multiplicative model used is: Time series = Trend * Seasonal index * Irregular component a. – 0.82 b. – 0.63 c. + 0.63 d. + 0.82 Management expects the increase in the trend figures after the second quarter of 2009 to be the same as the average increase in the trend values from the third quarter of 2007 up to and including the second quarter of 2009. 47. What is the trend figure (rounded to 1 decimal place) that management should expect for the second quarter of 2010 based on the above data for the Sisicool product? a. 318.9 b. 330.0 c. 341.0 d. 374.0
250
Appendix C: Specimen examination questions 48. Sisicool sales in the first quarter of 2010 are 250.2 hectolitres. Sisicool sales in the first quarter of 2010 adjusted in accordance with the multiplicative model in hectolitres (rounded to 1 decimal place) are: a. 174.2 b. 190.2 c. 326.2 d. 329.2 49. Based on the above data, management has also calculated the trend on the basis of linear regression analysis. For the third quarter of 2010 they have found a trend figure for sales of 348.1 hectolitres. What sales volume (in hectolitres rounded to 1 decimal place) should management expect on the basis of these data and the multiplicative model for the third quarter of 2010? a. 269.8 b. 348.1 c. 449.0 d. 477.1 50. Management has the sales data for a new soft drink product that had been on the market only since 2 January 2009, for each week in January, February and March, 2009. Which of the time series components below can management determine based on these data? a. Cyclic b. Seasonal c. Trend d. None of the above components
251
Statistics using Excel 2007
Answers: Question 1 2 3 4 5 6 7 8 9 10
252
Answer Question b 11 a 12 d 13 b 14 b 15 a 16 a 17 c 18 d 19 c 20
Answer Question Answer c 21 a c 22 a d 23 d c 24 b b 25 d c 26 b d 27 b b 28 c b 29 d d 30 c
Question Answer 31 a 32 d 33 c 34 c 35 c 36 c 37 a 38 d 39 a 40 a
Question Answer 41 c 42 a 43 d 44 c 45 a 46 d 47 c 48 d 49 c 50 d
Appendix D: Excel files
Appendix D: Excel files
Table: Excel files on intranet FEM (K:\FEM\Excel\Statistics using Excel 2007)
Workbook Fiction 2000
Worksheets Data
Subsription to daily newspapers
Data
Summary measures
Raw data Aggregated data Binomial Hypergeometric Negative binomial Poisson Normal Exponential Additive quarters Multiplicative quarters Additive terms Multiplicative terms
Discrete distributions
Continuous distributions Time series models
253
Statistics using Excel 2007
254