Chap2-Applied Stats in Bus & Eco - Doane/Seward-2E


Chapter 2: Data Collection

Definitions
Level of Measurement
Time Series and Cross-sectional Data
Sampling Concepts
Sampling Methods
Data Sources
Survey Research

McGraw-Hill/Irwin

Copyright © 2009 by The McGraw-Hill Companies, Inc. All rights reserved.


Data: Singular or Plural?

Data is the plural form of the Latin datum (a “given” fact).


Data: Singular or Plural? Subjects, Variables, Data Sets

We will treat data as plural and use data set for a particular collection of data as a whole.
Observation – each data value.
Subject (or individual) – an item for study (e.g., an employee in your company).
Variable – a characteristic of the subject or individual (e.g., employee’s income).


Data Definitions (Table 2.2)

Number of Variables and Typical Tasks

Data Set       Variables       Typical Tasks
Univariate     One             Histograms, descriptive statistics, frequency tallies
Bivariate      Two             Scatter plots, correlations, simple regression
Multivariate   More than two   Multiple regression, data mining, econometric modeling


Data Definitions (Table 2.1)

A Small Multivariate Data Set (8 subjects, 5 variables)


Data Definitions

(Figure 2.1)


Data Definitions

Data Types

Categorical or qualitative data: values are described by words rather than numbers. For example,
- Automobile style (e.g., X = full, midsize, compact, subcompact).
- Mutual fund (e.g., X = load, no-load).


Data Definitions

Data Coding

Coding refers to using numbers to represent categories to facilitate statistical analysis.
Coding an attribute as a number does not make the data numerical. For example,
1 = Bachelor’s, 2 = Master’s, 3 = Doctorate
Rankings may exist. For example,
1 = Liberal, 2 = Moderate, 3 = Conservative
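As a minimal sketch of the coding idea, the Python below maps the degree categories to the codes 1 to 3 and tallies frequencies. The response list and the variable names are hypothetical and serve only as illustration. Counting the codes is meaningful; averaging them would not be, because the numbers are only labels.

```python
from collections import Counter

# Coding scheme from the example above: numbers stand in for categories.
DEGREE_CODES = {"Bachelor's": 1, "Master's": 2, "Doctorate": 3}

# Hypothetical survey responses (illustrative only).
responses = ["Master's", "Bachelor's", "Bachelor's", "Doctorate", "Master's", "Bachelor's"]

# Replace each category with its numeric code.
coded = [DEGREE_CODES[answer] for answer in responses]

# Frequency tallies are a valid operation on coded categories.
frequencies = Counter(coded)

print(coded)
print(dict(frequencies))
```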


Data Definitions Binary Data

A binary variable has only two values, 1 = presence and 0 = absence of a characteristic of interest (the codes themselves are arbitrary). For example,
1 = employed, 0 = not employed
1 = married, 0 = not married
1 = male, 0 = female
1 = female, 0 = male
The coding itself has no numerical value, so binary variables are attribute data.


Data Definitions

Numerical Data

Numerical or quantitative data arise from counting or some kind of mathematical operation. For example,
- Number of auto insurance claims filed in March (e.g., X = 114 claims).
- Ratio of profit to sales for last quarter (e.g., X = 0.0447).
Numerical data can be broken down into two types: discrete or continuous.


Data Definitions Discrete Data

A numerical variable with a countable number of values that can be represented by an integer (no fractional values). For example,
- Number of Medicaid patients (e.g., X = 2).
- Number of takeoffs at O’Hare (e.g., X = 37).


Data Definitions

Continuous Data

A numerical variable that can have any value within an interval (e.g., length, weight, time, sales, price/earnings ratios). Any continuous interval contains infinitely many possible values (e.g., 426 < X < 428).



Data Definitions

Rounding

Ambiguity is introduced when continuous data are rounded to whole numbers.
The underlying measurement scale is continuous.
Precision of measurement depends on the instrument.
Sometimes discrete data are treated as continuous when the range is very large (e.g., SAT scores) and small differences (e.g., 604 or 605) aren’t of much importance.


Level of Measurement Levels of Measurement

Level of Measurement   Characteristics          Example
Nominal                Categories only          Eye color (blue, brown, green, hazel)
Ordinal                Rank has meaning         Bond ratings (Aaa, Aab, C, D, F, etc.)
Interval               Distance has meaning     Temperature (57° Celsius)
Ratio                  Meaningful zero exists   Accounts payable ($21.7 million)


Level of Measurement Nominal Measurement

Nominal data merely identify a category.
Nominal data are qualitative, attribute, categorical or classification data (e.g., Apple, Compaq, Dell, HP).
Nominal data are usually coded numerically, and the codes are arbitrary (e.g., 1 = Apple, 2 = Compaq, 3 = Dell, 4 = HP).
The only permissible mathematical operations are counting (e.g., frequencies) and simple statistics.


Level of Measurement Ordinal Measurement

Ordinal data codes can be ranked (e.g., 1 = Frequently, 2 = Sometimes, 3 = Rarely, 4 = Never).
Distance between codes is not meaningful (e.g., distance between 1 and 2, or between 2 and 3, or between 3 and 4 lacks meaning).
Many useful statistical tests exist for ordinal data.
Especially useful in social science, marketing and human resource research.


Level of Measurement Interval Measurement

Data can not only be ranked, but also have meaningful intervals between scale points (e.g., difference between 60°F and 70°F is same as difference between 20°F and 30°F).
Since intervals between numbers represent distances, mathematical operations can be performed (e.g., average).
Zero point of interval scales is arbitrary, so ratios are not meaningful (e.g., 60°F is not twice as warm as 30°F).
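One way to see why ratios fail on an interval scale is to express the same temperatures in Celsius. The sketch below is illustrative only, and its function and variable names are hypothetical. Equal Fahrenheit intervals stay equal after conversion, but the ratio of 60°F to 30°F changes completely.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a Fahrenheit temperature to Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0

# Two 10-degree Fahrenheit intervals from the slide's example.
gap_high_c = fahrenheit_to_celsius(70.0) - fahrenheit_to_celsius(60.0)
gap_low_c = fahrenheit_to_celsius(30.0) - fahrenheit_to_celsius(20.0)

# Ratio of the same two temperatures on each scale.
ratio_f = 60.0 / 30.0
ratio_c = fahrenheit_to_celsius(60.0) / fahrenheit_to_celsius(30.0)

print(gap_high_c, gap_low_c)   # equal intervals remain equal
print(ratio_f, ratio_c)        # the ratio depends on where zero sits
```

The two gaps agree on both scales, while the ratio is 2 in Fahrenheit and negative in Celsius, which is what an arbitrary zero point implies.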


Level of Measurement Likert Scales

A special case of interval data frequently used in survey research. The coarseness of a Likert scale refers to the number of scale points (typically 5 or 7).

“College-bound high school students should be required to study a foreign language.” (check one)


Strongly Agree

Somewhat Agree

Neither Agree Nor Disagree

Somewhat Disagree

Strongly Disagree


Level of Measurement Likert Scales

A neutral midpoint (“Neither Agree Nor Disagree”) is available when an odd number of scale points is used; it can be omitted (an even number of scale points) to force the respondent to “lean” one way or the other.
• Likert data are coded numerically (e.g., 1 to 5) but any equally spaced values will work.

Likert coding: 1 to 5 scale     Likert coding: -2 to +2 scale
5 = Help a lot                  +2 = Help a lot
4 = Help a little               +1 = Help a little
3 = No effect                    0 = No effect
2 = Hurt a little               −1 = Hurt a little
1 = Hurt a lot                  −2 = Hurt a lot


Level of Measurement Likert Scales

Careful choice of verbal anchors results in measurable intervals (e.g., the distance from 1 to 2 is “the same” as the interval, say, from 3 to 4).
Ratios are not meaningful (e.g., here 4 is not twice 2).
Many statistical calculations can be performed (e.g., averages, correlations, etc.).


Level of Measurement Likert Scales

More variants of Likert scales:

How would you rate your marketing instructor? (check one)
Terrible   Poor   Adequate   Good   Excellent

How would you rate your marketing instructor? (check one)
Very Bad to Very Good (only the end points are labeled)


Level of Measurement Ratio Measurement

Ratio data have all properties of nominal, ordinal and interval data types and also possess a meaningful zero (absence of quantity being measured).
Because of this zero point, ratios of data values are meaningful (e.g., $20 million profit is twice as much as $10 million).
Zero does not have to be observable in the data; it is an absolute reference point.


Level of Measurement

Use the following procedure to recognize data types:

Question                                               If “Yes”
Q1. Is there a meaningful zero point?                  Ratio data (all statistical operations are allowed)
Q2. Are intervals between scale points meaningful?     Interval data (common statistics allowed, e.g., means and standard deviations)
Q3. Do scale points represent rankings?                Ordinal data (restricted to certain types of nonparametric statistical tests)
Q4. Are there discrete categories?                     Nominal data (only counting allowed, e.g., finding the mode)
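The four questions are asked in order, and the first “Yes” decides the level. A minimal sketch of that decision procedure follows; the function name, the argument names and the "unclassified" fallback are assumptions made for illustration.

```python
def measurement_level(meaningful_zero: bool,
                      meaningful_intervals: bool,
                      ranked: bool,
                      categories: bool) -> str:
    """Return the level of measurement from answers to Q1 through Q4."""
    if meaningful_zero:          # Q1
        return "ratio"
    if meaningful_intervals:     # Q2
        return "interval"
    if ranked:                   # Q3
        return "ordinal"
    if categories:               # Q4
        return "nominal"
    return "unclassified"        # assumption: none of the questions applied

# Examples taken from the levels-of-measurement table.
accounts_payable = measurement_level(True, True, True, True)
temperature = measurement_level(False, True, True, True)
bond_rating = measurement_level(False, False, True, True)
eye_color = measurement_level(False, False, False, True)

print(accounts_payable, temperature, bond_rating, eye_color)
```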


Level of Measurement Changing Data by Recoding

In order to simplify data or when exact data magnitude is of little interest, ratio data can be recoded downward into ordinal or nominal measurements (but not conversely).
For example, recode systolic blood pressure as “normal” (under 130), “elevated” (130 to 140), or “high” (over 140).
The above recoded data are ordinal (ranking is preserved) but intervals are unequal and some information is lost.
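The blood pressure recoding can be sketched as a small function. The cutoffs come from the example above; the function name and the sample readings are hypothetical.

```python
def recode_systolic(reading: float) -> str:
    """Recode a ratio-level systolic reading into an ordinal category."""
    if reading < 130:
        return "normal"      # under 130
    if reading <= 140:
        return "elevated"    # 130 to 140
    return "high"            # over 140

# Hypothetical readings (illustrative only).
readings = [118, 129, 130, 140, 141, 165]
categories = [recode_systolic(r) for r in readings]

print(categories)
```

The recoding cannot be reversed: 141 and 165 both become “high”, so the original magnitudes are no longer recoverable.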


Time Series versus Cross-Sectional Data

Time Series Data

Each observation in the sample represents a different equally spaced point in time (e.g., years, months, days).
Periodicity may be annual, quarterly, monthly, weekly, daily, hourly, etc.
We are interested in trends and patterns over time (e.g., annual growth in consumer debit card use from 2001 to 2008).


Time Series versus Cross-Sectional Data

Cross-sectional Data

Each observation represents a different individual unit (e.g., person) at the same point in time (e.g., monthly VISA balances).
We are interested in
- variation among observations or in
- relationships.
We can combine the two data types to get pooled cross-sectional and time series data.


Time Series versus Cross-Sectional Data Sample or Census?

A sample involves looking only at some items selected from the population.
A census is an examination of all items in a defined population.
Why can’t the United States Census survey every person in the population?
- Mobility
- Illegal immigrants
- Budget constraints
- Incomplete responses or non-responses


Parameters and Statistics? Situations Where A Sample May Be Preferred:

Infinite Population
No census is possible if the population is infinite or of indefinite size (an assembly line can keep producing bolts, a doctor can keep seeing more patients).

Destructive Testing
The act of sampling may destroy or devalue the item (measuring battery life, testing auto crashworthiness, or testing aircraft turbofan engine life).


Parameters and Statistics? Situations Where A Sample May Be Preferred:

Timely Results
Sampling may yield more timely results than a census (checking wheat samples for moisture and protein content, checking peanut butter for aflatoxin contamination).

Accuracy
Sample estimates can be more accurate than a census. Instead of spreading limited resources thinly to attempt a census, our budget of time and money might be better spent to hire experienced staff, improve training of field interviewers, and improve data safeguards.


Parameters and Statistics? Situations Where A Sample May Be Preferred:

Cost
Even if it is feasible to take a census, the cost, either in time or money, may exceed our budget.

Sensitive Information
Some kinds of information are better captured by a well-designed sample, rather than attempting a census. Confidentiality may also be improved in a carefully done sample.


Parameters and Statistics? Situations Where A Census May Be Preferred

Small Population
If the population is small, there is little reason to sample, for the effort of data collection may be only a small part of the total cost.

Large Sample Size
If the required sample size approaches the population size, we might as well go ahead and take a census.


Parameters and Statistics? Situations Where A Census May Be Preferred

Database Exists
If the data are on disk we can examine 100% of the cases. But auditing or validating data against physical records may raise the cost.

Legal Requirements
Banks must count all the cash in bank teller drawers at the end of each business day. The U.S. Congress forbade sampling in the 2000 decennial population census.


Parameters and Statistics?

• Statistics are computed from a sample of n items, chosen from a population of N items.
• Statistics can be used as estimates of parameters found in the population.
• Symbols are used to represent population parameters and sample statistics.


Parameter or Statistic?

Parameter

Any measurement that describes an entire population.
Usually, the parameter value is unknown since we rarely can observe the entire population.
Parameters are often (but not always) represented by Greek letters.


Parameter or Statistic?

Statistic

Any measurement computed from a sample.
Usually, the statistic is regarded as an estimate of a population parameter.
Sample statistics are often (but not always) represented by Roman letters.


Parameters or Statistics? Target Population

The population must be carefully specified and the sample must be drawn scientifically so that the sample is representative.
The target population is the population we are interested in (e.g., U.S. gasoline prices).
The sampling frame is the group from which we take the sample (e.g., 115,000 stations).
The frame should not differ from the target population.


Parameters or Statistics? Finite or Infinite?

A population is finite if it has a definite size, even if its size is unknown.
A population is infinite if it is of arbitrarily large size.
Rule of Thumb: A population may be treated as infinite when N is at least 20 times n (i.e., when N/n ≥ 20).

[Figure: a small sample of n items inside a population of N items; here, N/n ≥ 20]


Sampling Methods Probability Samples

Simple Random Sample   Use random numbers to select items from a list (e.g., VISA cardholders).
Systematic Sample      Select every kth item from a list or sequence (e.g., restaurant customers).
Stratified Sample      Select randomly within defined strata (e.g., by age, occupation, gender).
Cluster Sample         Like stratified sampling except strata are geographical areas (e.g., zip codes).


Sampling Methods Non-probability Samples

Judgment Sample      Use expert knowledge to choose “typical” items (e.g., which employees to interview).
Convenience Sample   Use a sample that happens to be available (e.g., ask co-worker opinions at lunch).
Focus Groups         In-depth dialog with a representative panel of individuals (e.g., iPod users).


Sampling Methods Simple Random Sample

Every item in the population of N items has the same chance of being chosen in the sample of n items.
We rely on random numbers to select a name, e.g., =RANDBETWEEN(1,48).


Sampling Methods Random Number Tables

A table of random digits used to select random numbers between 1 and N. Each digit 0 through 9 is equally likely to be chosen.

Setting Up a Rule

For example, NilCo wants to award cash prizes to 10 of its 875 loyal customers.
To get 10 three-digit numbers between 001 and 875, we define any consistent rule for moving through the random number table.


Sampling Methods

Setting Up a Rule

Randomly point at the table to choose a starting point.
Choose the first three digits of the selected five-digit block, move to the right one column, down one row, and repeat.
When we reach the end of a line, wrap around to the other side of the table and continue.
Discard any number greater than 875 and any duplicates.


Table of 1,000 Random Digits

82134 14458 66716 54269 31928 46241 03052 00260 32367 25783
07139 16829 76768 11913 42434 91961 92934 18229 15595 02566
45056 43939 31188 43272 11332 99494 19348 97076 95605 28010
10244 19093 51678 63463 85568 70034 82811 23261 48794 63984
12940 84434 50087 20189 58009 66972 05764 10421 36875 64964
84438 45828 40353 28925 11911 53502 24640 96880 93166 68409
98681 67871 71735 64113 90139 33466 65312 90655 75444 30845
43290 96753 18799 49713 39227 15955 46167 63853 03633 19990
96893 85410 88233 22094 30605 79024 01791 38839 85531 94576
75403 41227 00192 16814 47054 16814 81349 92264 01028 29071
78064 92111 51541 76563 69027 67718 06499 71938 17354 12680
26246 71746 94019 93165 96713 03316 75912 86209 12081 57817
98766 67312 96358 21351 86448 31828 86113 78868 67243 06763
37895 51055 11929 44443 15995 72935 99631 18190 85877 31309
27988 81163 52212 25102 61798 28670 01358 60354 74015 18556
19216 53008 44498 19262 12196 93947 90162 76337 12646 26838
28078 86729 69438 24235 35208 48957 53529 76297 41741 54735
34455 61363 93711 68038 75960 16327 95716 66964 28634 65015
53510 90412 70438 45932 57815 75144 52472 61817 41562 42084
30658 18894 88208 97867 30737 94985 18235 02178 39728 66398

Start here: 43939 (row 3, column 2).
Highlighted blocks: 43939, 51678, 20189, 11911, 33466, 46167, 38839, 01028, 12680, 26246.
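The rule for moving through the table can be sketched in Python. The sketch assumes the table is laid out as rows of ten five-digit blocks, an inference from the positions of the highlighted blocks rather than something the slide states, and it copies only the first 12 rows. The function name and its arguments are hypothetical.

```python
# First 12 rows of the random digit table (assumed layout: 10 blocks per row).
TABLE_ROWS = [
    "82134 14458 66716 54269 31928 46241 03052 00260 32367 25783",
    "07139 16829 76768 11913 42434 91961 92934 18229 15595 02566",
    "45056 43939 31188 43272 11332 99494 19348 97076 95605 28010",
    "10244 19093 51678 63463 85568 70034 82811 23261 48794 63984",
    "12940 84434 50087 20189 58009 66972 05764 10421 36875 64964",
    "84438 45828 40353 28925 11911 53502 24640 96880 93166 68409",
    "98681 67871 71735 64113 90139 33466 65312 90655 75444 30845",
    "43290 96753 18799 49713 39227 15955 46167 63853 03633 19990",
    "96893 85410 88233 22094 30605 79024 01791 38839 85531 94576",
    "75403 41227 00192 16814 47054 16814 81349 92264 01028 29071",
    "78064 92111 51541 76563 69027 67718 06499 71938 17354 12680",
    "26246 71746 94019 93165 96713 03316 75912 86209 12081 57817",
]
TABLE = [row.split() for row in TABLE_ROWS]


def select_numbers(table, start_row, start_col, how_many, largest):
    """Take the first three digits of a block, then move right one column
    and down one row, wrapping at the table edges. Numbers above `largest`,
    zero, and duplicates are discarded."""
    n_rows, n_cols = len(table), len(table[0])
    row, col = start_row, start_col
    chosen = []
    steps = 0
    # Cap the number of steps so the loop ends even if too few numbers qualify.
    while len(chosen) < how_many and steps < n_rows * n_cols:
        value = int(table[row][col][:3])
        if 1 <= value <= largest and value not in chosen:
            chosen.append(value)
        col = (col + 1) % n_cols
        row = (row + 1) % n_rows
        steps += 1
    return chosen


# Start at row 3, column 2 (zero-based indices 2 and 1), as marked in the table.
winners = select_numbers(TABLE, start_row=2, start_col=1, how_many=10, largest=875)
print(winners)
```

With this starting point the rule visits exactly the ten highlighted blocks, so the selected customer numbers are the first three digits of each.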


Sampling Methods With or Without Replacement

If we allow duplicates when sampling, then we are sampling with replacement.
Duplicates are unlikely when n is much smaller than N.
If we do not allow duplicates when sampling, then we are sampling without replacement.
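The two schemes map onto two calls in Python's standard `random` module. This is a sketch for illustration: the seed value and the variable names are arbitrary assumptions, and the population of 875 customers is borrowed from the NilCo example.

```python
import random

rng = random.Random(2009)            # arbitrary seed, only for reproducibility
population = list(range(1, 876))     # customer numbers 1 through 875

# With replacement: the same customer may be drawn more than once.
with_replacement = rng.choices(population, k=10)

# Without replacement: every draw is a different customer.
without_replacement = rng.sample(population, k=10)

print(with_replacement)
print(without_replacement)
```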


Sampling Methods Computer Methods

Excel - Option A: Enter the Excel function =RANDBETWEEN(1,875) into 10 spreadsheet cells. Press F9 to get a new sample.
Excel - Option B: Enter the function =INT(1+875*RAND()) into 10 spreadsheet cells. Press F9 to get a new sample.
Internet: The web site www.random.org will give you many kinds of excellent random numbers (integers, decimals, etc).
Minitab: Use Minitab’s Random Data menu with the Integer option.

These are pseudo-random generators because even the best algorithms eventually repeat themselves.
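For readers working outside Excel, the two formulas have rough Python counterparts. The helper names below are hypothetical. Seeding two generators identically also illustrates what pseudo-random means: the same seed reproduces the same sequence.

```python
import random


def randbetween(rng, low, high):
    """Counterpart of =RANDBETWEEN(low, high): an integer, both ends included."""
    return rng.randint(low, high)


def int_rand(rng, n):
    """Counterpart of =INT(1+n*RAND()): an integer from 1 to n."""
    return int(1 + n * rng.random())


first = random.Random(42)    # arbitrary seed (assumption)
second = random.Random(42)   # same seed, so the same sequence

sample_a = [randbetween(first, 1, 875) for _ in range(10)]
sample_b = [randbetween(second, 1, 875) for _ in range(10)]
sample_c = [int_rand(first, 875) for _ in range(10)]

print(sample_a == sample_b)
print(sample_c)
```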


Sampling Methods Row – Column Data Arrays

When the data are arranged in a rectangular array, an item can be chosen at random by selecting a row and column.
For example, in the 4 x 3 array, select a random column between 1 and 3 and a random row between 1 and 4.
This way, each item has an equal chance of being selected.


Sampling Methods Row – Column Data Arrays

Use =RANDBETWEEN function to choose row 3 and column 3 (Target).

Dillard's               K-Mart            Saks
Dollar General          Kohl's            Sears Roebuck
Federated Dept Stores   May Dept Stores   Target
J. C. Penney            Nordstrom         Wal-Mart Stores
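The row and column selection can be sketched as follows. The array layout is an assumption reconstructed from the statement that row 3, column 3 is Target; the function name and the seed are hypothetical.

```python
import random

# 4 x 3 array of retailers (layout inferred so that row 3, column 3 is Target).
STORES = [
    ["Dillard's", "K-Mart", "Saks"],
    ["Dollar General", "Kohl's", "Sears Roebuck"],
    ["Federated Dept Stores", "May Dept Stores", "Target"],
    ["J. C. Penney", "Nordstrom", "Wal-Mart Stores"],
]


def pick_item(array, rng):
    """Choose a random row and a random column, each numbered from 1."""
    row = rng.randint(1, len(array))
    col = rng.randint(1, len(array[0]))
    return row, col, array[row - 1][col - 1]


row, col, store = pick_item(STORES, random.Random(3))  # arbitrary seed
print(row, col, store)
```

Because the array is rectangular, each of the 12 cells is chosen with probability 1/4 × 1/3 = 1/12.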


Sampling Methods Randomizing a List

In Excel, use function =RAND() beside each row to create a column of random numbers between 0 and 1.
Copy and paste these numbers into the same column using “Paste Special | Values” (to paste only the values and not the formulas).
Sort the spreadsheet on the random number column.


Sampling Methods Randomizing a List (Figure 2.6)

• The first n items are a random sample of the entire list (they are as likely as any others).
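The same idea, attaching a random key to each row, sorting on the key, and keeping the first n rows, can be sketched in Python. The function name, the list of names and the seed are hypothetical.

```python
import random


def random_sample_by_sorting(items, n, rng):
    """Pair each item with a random number in [0, 1), sort on that number,
    and return the first n items of the shuffled list."""
    keyed = [(rng.random(), item) for item in items]
    keyed.sort(key=lambda pair: pair[0])
    return [item for _, item in keyed[:n]]


# Hypothetical list to randomize (illustrative only).
names = ["Ann", "Bob", "Cai", "Dee", "Eli", "Fay", "Gus", "Hoa"]
sample = random_sample_by_sorting(names, 3, random.Random(1))  # arbitrary seed
print(sample)
```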



Sampling Methods Systematic Sampling

• Sample by choosing every kth item from a list, starting from a randomly chosen entry on the list. For example, starting at item 2, we sample every k = 4 items to obtain a sample of n = 20 items from a list of N = 78 items.

Note that N/n = 78/20 ≈ 4.


Sampling Methods Systematic Sampling

A systematic sample of n items from a population of N items requires that periodicity k be approximately N/n.
Systematic sampling should yield acceptable results unless patterns in the population happen to recur at periodicity k.
Can be used with unlistable or infinite populations.
• Systematic samples are well-suited to linearly organized physical populations.


Sampling Methods Systematic Sampling For example, out of 501 companies, we want to obtain a sample of 25. What should the periodicity k be?

k = N/n = 501/25 ≈ 20. So, we should choose every 20th company from a random starting point.
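A minimal sketch of systematic sampling, assuming the list fits in memory. The function name and its `start` and `rng` arguments are hypothetical. It computes k by rounding N/n and then takes every kth item from a 1-based starting position.

```python
import random


def systematic_sample(items, n, start=None, rng=None):
    """Take every kth item, with k approximately N/n, from a starting item.
    If no start is given, a random one between 1 and k is used. Depending on
    the start, the result can fall slightly short of n items."""
    k = max(1, round(len(items) / n))
    if start is None:
        rng = rng or random.Random()
        start = rng.randint(1, k)
    return items[start - 1::k][:n]


# The N = 78, n = 20 example: k = 4, starting at item 2.
small = systematic_sample(list(range(1, 79)), 20, start=2)

# The 501-company example: k = 20.
companies = systematic_sample(list(range(1, 502)), 25, start=7)

print(small)
print(len(companies))
```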



Sampling Methods Stratified Sampling

Utilizes prior information about the population.
Applicable when the population can be divided into relatively homogeneous subgroups of known size (strata).
A simple random sample of the desired size is taken within each stratum.
For example, from a population containing 55% males and 45% females, randomly sample 110 males and 90 females (n = 200).


Sampling Methods Stratified Sampling

Or, take a random sample of the entire population and then combine individual strata estimates using appropriate weights.
For a population with L strata, the population size N is the sum of the stratum sizes: N = N1 + N2 + ... + NL
The weight assigned to stratum j is wj = Nj / N

For example, take a random sample of n = 200 and then weight the responses for males by wM = .55 and for females by wF = .45.
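A short sketch of the weighting step, with weights equal to each stratum's share of the population. The stratum sizes and stratum means below are hypothetical numbers chosen to give the weights .55 and .45; only the weights themselves come from the example.

```python
def stratified_estimate(stratum_means, stratum_sizes):
    """Combine stratum means using weights w_j = N_j / N."""
    total = sum(stratum_sizes.values())                      # N = N1 + ... + NL
    weights = {name: size / total for name, size in stratum_sizes.items()}
    estimate = sum(weights[name] * stratum_means[name] for name in weights)
    return estimate, weights


# Hypothetical population sizes giving weights of .55 and .45.
sizes = {"male": 5500, "female": 4500}
# Hypothetical mean responses within each stratum.
means = {"male": 40.0, "female": 60.0}

estimate, weights = stratified_estimate(means, sizes)
print(weights)
print(estimate)
```

With these numbers the weighted estimate is .55 × 40 + .45 × 60 = 49.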


Sampling Methods Cluster Sample

Strata consist of geographical regions.
One-stage cluster sampling: the sample consists of all elements in each of k randomly chosen subregions (clusters).
Two-stage cluster sampling: first choose k subregions (clusters), then choose a random sample of elements within each cluster.


Sampling Methods Cluster Sample (Figure 2.7)

• Here is an example of 4 elements sampled from each of 3 randomly chosen clusters (two-stage cluster sampling).
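Two-stage cluster sampling can be sketched as two nested random draws. The cluster names, element labels, seed and function name are hypothetical; the sizes mirror the example of 4 elements from each of 3 clusters.

```python
import random


def two_stage_cluster_sample(clusters, k, per_cluster, rng):
    """Stage 1: choose k clusters at random.
    Stage 2: choose per_cluster elements at random within each chosen cluster."""
    chosen = rng.sample(sorted(clusters), k)
    return {name: rng.sample(clusters[name], per_cluster) for name in chosen}


# Hypothetical population: 6 regions with 10 households each.
clusters = {
    f"Region {i}": [f"R{i}-H{j}" for j in range(1, 11)]
    for i in range(1, 7)
}

sample = two_stage_cluster_sample(clusters, k=3, per_cluster=4, rng=random.Random(7))
for region, households in sample.items():
    print(region, households)
```

One-stage cluster sampling would instead keep every element of the chosen clusters, i.e., return `clusters[name]` unchanged.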



Sampling Methods Cluster Sample

Cluster sampling is useful when
- Population frame and stratum characteristics are not readily available
- It is too expensive to obtain a simple or stratified sample
- The cost of obtaining data increases sharply with distance
- Some loss of reliability is acceptable


Sampling Methods Judgment Sample

A non-probability sampling method that relies on the expertise of the sampler to choose items that are representative of the population.
Can be affected by subconscious bias (i.e., nonrandomness in the choice).
Quota sampling is a special kind of judgment sampling, in which the interviewer chooses a certain number of people in each category.


Sampling Methods Convenience Sample

Take advantage of whatever sample is available at that moment. A quick way to sample.

Focus Groups

A panel of individuals chosen to be representative of a wider population, formed for open-ended discussion and idea gathering.


Data Sources Useful Data Sources (Table 2.11)

Type of Data         Examples
U.S. general data    Statistical Abstract of the U.S.
U.S. economic data   Economic Report of the President
Almanacs             World Almanac, Time Almanac
Periodicals          Economist, Business Week, Fortune
Indexes              New York Times, Wall Street Journal
Databases            CompuStat, Citibase, U.S. Census
World data           CIA World Factbook
Web                  Google, Yahoo, msn


Survey Research Basic Steps of Survey Research

• Step 1: State the goals of the research.
• Step 2: Develop the budget (time, money, staff).
• Step 3: Create a research design (target population, frame, sample size).
• Step 4: Choose a survey type and method of administration.


Survey Research Basic Steps of Survey Research

• Step 5: Design a data collection instrument (questionnaire).
• Step 6: Pretest the survey instrument and revise as needed.
• Step 7: Administer the survey (follow up if needed).
• Step 8: Code the data and analyze it.


Survey Research Survey Types (Table 2.12)

Type of Survey: Mail
Characteristics: You need a well-targeted and current mailing list (people move a lot). Low response rates are typical and non-response bias is expected (nonrespondents differ from those who respond). Zip code lists (often costly) are an attractive option to define strata of similar income, education, and attitudes. To encourage participation, a cover letter should clearly explain the uses to which the data will be put. Plan for follow-up mailings.


Survey Research Survey Types

Type of Survey: Telephone
Characteristics: Random dialing yields very low response and is poorly targeted. Purchased phone lists help reach the target population, though a low response rate still is typical (disconnected phones, caller screening, answering machines, work hours, no-call lists). Other sources of non-response bias include the growing number of non-English speakers and distrust caused by scams and spam.


Survey Research Survey Types

Type of Survey: Interviews
Characteristics: Interviewing is expensive and time-consuming, yet trading a smaller sample size for higher-quality results may still be worth it. Interviews must be carefully handled, so interviewers must be well trained, which is an added cost. But you can obtain information on complex or sensitive topics (e.g., gender discrimination in companies, birth control practices, diet and exercise habits).


Survey Research Survey Types

Type of Survey: Web
Characteristics: Web surveys are growing in popularity, but are subject to non-response bias because those who participate may differ from those who feel too busy, don’t own computers or distrust your motives (scams and spam are again to blame). This type of survey works best when targeted to a well-defined interest group on a question of self-interest (e.g., views of CPAs on new proposed accounting rules, frequent flyer views on airline security).


Survey Research Survey Types

Type of Survey: Direct Observation
Characteristics: This can be done in a controlled setting (e.g., psychology lab) but requires informed consent, which can change behavior. Unobtrusive observation is possible in some non-lab settings (e.g., what percentage of airline passengers carry on more than two bags, what percentage of SUVs carry no passengers, what percentage of drivers wear seat belts).


Survey Research Survey Guidelines (Table 2.13)

Planning: What is the purpose of the survey? Consider staff expertise, needed skills, degree of precision, budget.
Design: Invest time and money in designing the survey. Use books and references to avoid unnecessary errors.
Quality: Take care in preparing a quality survey so that people will take you seriously.


Survey Research Survey Guidelines

Pilot Test: Pretest on friends or co-workers to make sure the survey is clear.
Buy-in: Improve response rates by stating the purpose of the survey, offering a token of appreciation or paving the way with endorsements.
Expertise: Work with a consultant early on.


Survey Research Questionnaire Design

• Use a lot of white space in layout.
• Begin with short, clear instructions.
• State the survey purpose.
• Assure anonymity.
• Instruct on how to submit the completed survey.


Survey Research Questionnaire Design

• Break survey into naturally occurring sections.
• Let respondents bypass sections that are not applicable (e.g., “if you answered no to Question 7, skip directly to Question 15”).
• Pretest and revise as needed.
• Keep as short as possible.


Survey Research Questionnaire Design (Table 2.14)

Type of Question: Open-ended question
Example: Briefly describe your job goals.

Type of Question: Fill-in-the-blank
Example: How many times did you attend formal religious services during the last year? ________ times

Type of Question: Check boxes
Example: Which of these statistics packages have you ever used? SAS, Visual Statistics, SPSS, MegaStat, Systat, Minitab


Survey Research Questionnaire Design

Type of Question: Ranked choices
Example: “Please evaluate your dining experience”

              Excellent   Good   Fair   Poor
Food
Service
Ambiance
Cleanliness
Overall


Survey Research Questionnaire Design

Type of Question: Pictograms
Example: “What do you think of the President’s economic policies?” (circle one)

Type of Question: Likert scale
Example: Statistics is a difficult subject.
Strongly Agree   Slightly Agree   Neither Agree Nor Disagree   Slightly Disagree   Strongly Disagree


Survey Research Question Wording

• The way a question is asked has a profound influence on the response. For example,
1. Shall state taxes be cut?
2. Shall state taxes be cut, if it means reducing highway maintenance?
3. Shall state taxes be cut, if it means firing teachers and police?


Survey Research Question Wording

• Make sure you have covered all the possibilities. For example,
Are you married?   Yes   No
• Overlapping classes or unclear categories are a problem. For example,
How old is your father?   35 – 45   45 – 55   55 – 65   65 or older


Survey Research Coding and Data Screening

• Responses are usually coded numerically (e.g., 1 = male, 2 = female).
• Missing values are typically denoted by special characters (e.g., blank, “.” or “*”).
• Discard questionnaires that are flawed or missing many responses.
• Watch for multiple responses, outrageous or inconsistent replies or range answers.
• Follow up if necessary and always document your data-coding decisions.


Survey Research Data File Format

• Enter data into a spreadsheet or database as a “flat file” (n subjects x m variables matrix).


Survey Research Advice on Copying Data

• Using commas (,), dollar signs ($), or percents (%) as part of the values may result in your data being treated as text values.
• A numerical variable may only contain the digits 0-9, a decimal point, and a minus sign.
• To avoid round-off errors, format the data column as plain numbers with the desired number of decimal places before you copy the data to a statistical package.
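If values have already been copied with formatting symbols, they can be cleaned before analysis. The sketch below is one possible approach, with a hypothetical function name and made-up values. It strips commas, dollar signs and percent signs, and it leaves a percentage as the number shown (12.5% becomes 12.5, not 0.125).

```python
def to_number(text: str) -> float:
    """Remove commas, dollar signs and percent signs, then convert to float."""
    cleaned = text.strip()
    for symbol in (",", "$", "%"):
        cleaned = cleaned.replace(symbol, "")
    return float(cleaned)


# Hypothetical values as they might arrive from a formatted spreadsheet.
raw_values = ["$1,234.50", "12.5%", "-3", "1,000,000"]
numbers = [to_number(value) for value in raw_values]
print(numbers)
```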


Applied Statistics in Business and Economics

End of Chapter 2


