Business Analytics Communicating with Numbers 2nd Edition by Sanjiv Jaggia Professor, Alison Kelly P



CHAPTER 1

TRUE/FALSE - Write 'T' if the statement is true and 'F' if the statement is false.

1) The ability to use qualitative reasoning with quantitative tools allows management to make decisions to improve business performance. ⊚ true ⊚ false

2) The use of historical information to predict what could happen in the future describes prescriptive analytics. ⊚ true ⊚ false

3) Data are the compilation of facts, figures, or other contents, both numerical and nonnumerical. ⊚ true ⊚ false

4) Generally speaking, it is not feasible to obtain complete population data due to expense and near impossibility to examine every member of the population. ⊚ true ⊚ false

5) Collecting height data annually on the sample set of participants is an example of time series data. ⊚ true ⊚ false

6) Structured and unstructured data are only machine generated. ⊚ true ⊚ false

7) Numerical variables are either discrete or continuous. ⊚ true ⊚ false

Version 1

8) When each piece of data in a file is separated by a comma, the comma is called a delimiter and the file is called a comma-spliced file. ⊚ true ⊚ false

9) When coding in HTML, <table> is a tag used to provide structure for textual data. ⊚ true ⊚ false

10) In XML, tags are not case-sensitive and are interchangeable. For example, <City> and <city> represent the same pieces of information. ⊚ true ⊚ false

11) Sally created a table that summarizes the dollar amount of last year’s sales for each store. This is an example of descriptive analytics. ⊚ true ⊚ false

12) Social media data, such as Twitter, Facebook, and TikTok, are examples of structured data. ⊚ true ⊚ false

13) The characteristics marital status and income are examples of observations. ⊚ true ⊚ false
MULTIPLE CHOICE - Choose the one alternative that best completes the statement or answers the question.

14) Which of the following broad categories is not a type of analytic technique? A) manipulative analytics B) descriptive analytics C) predictive analytics D) prescriptive analytics

15) The people of Appleton, WI represent the __________blank, whereas we analyze the education level of a subset or __________blank to make inferences about the population. A) information; cross-section B) population; information C) items; sample D) population; sample

16) A massive volume of both structured and unstructured data that is extremely difficult to manage, process, and analyze is known by which catch phrase? A) wrangling B) big data C) data mining D) general data

17) The 2019 FIFA Women’s World Cup contained 52 matches in total with 24 teams competing. The use of __________blank data will display team standings during and at the end of the tournament. A) split-section B) organized C) cross-sectional D) numerical

18) According to a report in US Today, 26% of young people between the ages of 19–28 have at least two tattoos. What does the 26% represent? A) categorical data B) random data C) population D) sample set

19) According to a report in US Today, 38% of young people between the ages of 18–29 have at least one tattoo. What does the 38% represent? A) categorical data B) random data C) population D) sample set

20) According to a report in US Today, 38% of young people between the ages of 19–31 have at least one tattoo. What do the overall observations in the study represent? A) categorical data B) random data C) population D) sample set

21) According to a report in US Today, 38% of young people between the ages of 18–29 have at least one tattoo. What do the overall observations in the study represent? A) categorical data B) random data C) population D) sample set

22) Mary asks her friends on Facebook for recommendations for the best restaurants in Chicago. The results are then placed in a table for review. What does the data represent? A) time-series data B) cross-sectional data C) numerical data D) quantitative data

23) In big data, the most important aspect of any analytic initiative is __________blank. A) volume B) veracity C) values D) variety

24) What term refers to the credibility and quality of data? A) volume B) veracity C) values D) variety

25) When compiling data, it is important to know data comes in all types, forms, and granularity. This is known as A) volume. B) veracity. C) values. D) variety.

26) A New York Times article notes that there are an increasing number of people calling for tech companies to ease their grip on the personal data of consumers. The concern is that a handful of companies holds most of the data. The immense amount of data is also called A) volume. B) veracity. C) values. D) variety.

27) Unstructured data is best defined as A) not conforming to a predefined, row-column format. B) not conforming to a way to analyze data. C) conforming to a predefined, row-column format. D) conforming to a managing velocity.

28) Tobias Smith is working with his company’s data to examine inventory information. His intent is to use the variables to express ratios on inventory turnover. Based on this description, what is the strongest level of measurement being used? A) continuous variable B) interval scale C) categorical D) ratio scale

29) The time in minutes to travel from city A to city B is what type of variable? A) distraction B) discrete numerical C) categorical D) continuous numerical

30) The time in hours spent sleeping per day is what kind of variable? A) distraction B) discrete numerical C) categorical D) continuous numerical

31) Molly Nelson has been collecting temperatures in degrees Fahrenheit, daily over the past five spring seasons, to determine the optimal point to plant her heirloom tomatoes. Because the difference between each degree is the same, irrespective of the temperature, this is what type of measurement scale? A) ratio B) nominal C) ordinal D) interval

32) Cassidy is researching the impacts of eating breakfast on college students who have classes prior to 9 am. To do this, she issued a Likert scale questionnaire to the students, with a scale of 1 through 10 to answer a series of 20 questions. What is the type of measurement scale? A) ratio B) nominal C) ordinal D) interval

33) A large retailer is asking each customer at checkout for their zip code. If the zip code is the only recorded variable, what would the summarized results field headers be in tabular format? A) zip code B) customer number, zip code, count C) zip code, count D) count

34) A large retailer is asking each customer at checkout for their zip code. If the zip code is the only recorded variable, what is the type of measurement scale? A) ratio B) nominal C) ordinal D) interval

35) Which one is a drawback of interval-scaled data? A) The zero point is arbitrarily chosen. B) The degree of measurement is not a whole number. C) The scale is categorized and qualitative. D) The scale is nominal and the zero point is meaningful.

36) Of the following numerical variables, which is continuous? A) number of newborn babies B) number of goals scored C) cars sold by a car dealer D) height

37) Of the following numerical variables, which is continuous? A) number of goals scored B) number of stocks C) number of children D) weight

38) Which one of the following variables is numerical? A) city B) height C) hair color D) religion

39) Which one of the following variables is numerical? A) city B) population C) state D) color

40) An instructor hands out course evaluations where students have a rank of 0 to 5. What is the best way for the data to be measured? A) filtered B) ordinal C) nominal D) numerical

41) During the winter, the ice festival committee measures the depth of the ice during the month of February. What is the type of measurement scale? A) ratio B) interval C) nominal D) numerical

42) At the local animal shelter, each animal is marked as a boy or a girl. What is this type of measurement scale? A) filtered B) ordinal C) nominal D) numerical

43) The following table is an example of what type of format?

Inv_Nbr  Inv_Name     Inv_Cost
12232    Filter, Air  $9.65
13425    Gasket       $4.32
19932    Battery      $32.00

A) hypertext B) extensible C) delimited D) fixed-width

44) What is the following file format called?

Inv_Nbr, Inv_Name, Inv_Cost
876521,battery,45.00

A) delimited format B) extensible markup C) fixed-width format D) hypertext format

45) What type of markup language is this?

<Data>
<Inventory>
<Inv_Nbr>102304</Inv_Nbr>
<Inv_Name>Filter</Inv_Name>
</Inventory>
</Data>

A) HTML B) XML C) DFF D) JSON

46) The following is an example of which markup language? { “Inventory”: [ { “Inv_nbr”: “0199284”, “Inv_Name”: “Filter” } ] } A) HTML B) XML C) DFF D) JSON

47) The following is an example of which markup language? <table> <tr> <th> Inventory_Nbr </th> <th> Inventory Description </th> </tr> </table> A) HTML B) XML C) DFF D) JSON

48) Which is not a viable markup language? A) HTML B) XML C) DFF D) JSON

49) Which of the following is not related to data privacy? A) Data collection B) Data ethics C) Data usage D) Data transmission

50) __________blank is a set of data that are organized and processed in a meaningful and purposeful way. A) Data B) Knowledge C) Statistics D) Information

51) Which of the following is an example of a fixed-width format?

A)
Cust_Nbr  Cust_Name      Cust_Bal
1232      Mike Barnes    $1,059.65
1325      Lakshmi Singh  $2,914.32
1972      Seo-Jun Hak    $932.00

B)
Cust_Nbr,Cust_Name,Cust_Bal
1232,Mike Barnes,$1,059.65
1325,Lakshmi Singh,$2914.32
1972,Seo-Jun Hak,$932.00

C)
<table> <tr> <th> Cust_Nbr </th> <th> Cust_Name</th> <th> Cust_Bal</th> </tr> </table>

D)
{ “Customer”: [ { “Cust_nbr”: “1325”, “Cust_Name”: “Lakshmi Singh” } ] }

52) Which of the following is an example of a JSON markup language?

A)
Cust_Nbr  Cust_Name      Cust_Bal
1232      Mike Barnes    $1,059.65
1325      Lakshmi Singh  $2,914.32
1972      Seo-Jun Hak    $932.00

B)
Cust_Nbr,Cust_Name,Cust_Bal
1232,Mike Barnes,$1,059.65
1325,Lakshmi Singh,$2914.32
1972,Seo-Jun Hak,$932.00

C)
<table> <tr> <th> Cust_Nbr </th> <th> Cust_Name</th> <th> Cust_Bal</th> </tr> </table>

D)
{ “Customer”: [ { “Cust_Nbr”: “1325”, “Cust_Name”: “Lakshmi Singh”, “Cust_Bal”: “$2,914.32” } ] }
Answer Key

Test name: Chap 01_2e_Jaggia

1) TRUE Business analytics allows for the combination of qualitative reasoning with quantitative tools to identify key business problems and translate them into improved business processes.

2) FALSE What could happen in the future describes predictive analytics, whereas “what should we do” describes prescriptive analytics.

3) TRUE The term data is defined as a compilation of facts, figures, or other contents, both numerical and non-numerical.

4) TRUE Obtaining population data is expensive and it is generally impossible to examine every member of the population.

5) TRUE Time series data are collected over several time periods on a certain group of people.

6) FALSE Structured and unstructured data can be both human-generated and machine-generated.

7) TRUE Numerical variables assume meaningful numerical values and can be categorized as either discrete or continuous, whereas categorical variables assume names or labels.

8) FALSE The file is called a comma-separated value (CSV) file or comma-delimited file.
9) TRUE Tags are an effective and efficient way of providing structure to code by identifying the beginning and end of items such as tables and paragraphs.

10) FALSE XML is case-sensitive and would view a deviation as two separate data points.

11) TRUE Descriptive analytics refers to gathering, organizing, tabulating, and visualizing data that summarizes “what has happened?” This includes summarizing financial statistics such as the total amount of last year’s sales at each store.

12) FALSE Structured data have a pre-defined, row-column format. Social media data, while they have some defined structure, do not have the pre-defined, row-column format and are therefore considered examples of unstructured data.

13) FALSE The characteristics marital status and income are examples of variables because a person’s marital status and income vary from person to person. Observations (records) are data we collect about variables.

14) A Descriptive, predictive, and prescriptive are the three broad categories of analytic technique.

15) D The population is the entire group of interest, and the sample is the subset we use to make inferences about the population.

16) B Big data is a catch phrase for massive volumes of both structured and unstructured data that are difficult to manage, process, and analyze.
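The case-sensitivity point in answer 10 can be checked with a short sketch. It uses Python's standard `xml.etree.ElementTree` module; the choice of Python and the sample record are assumptions made for illustration and are not part of the test bank.

```python
import xml.etree.ElementTree as ET

# Hypothetical record whose two child tags differ only in letter case.
doc = "<Record><City>Appleton</City><city>Chicago</city></Record>"
root = ET.fromstring(doc)

# Tag names are case-sensitive, so each lookup matches a different element.
upper = root.find("City").text
lower = root.find("city").text
print(upper, lower)

# An opening tag closed with a differently cased tag is not well-formed XML.
try:
    ET.fromstring("<City>Appleton</city>")
    mismatched_ok = True
except ET.ParseError:
    mismatched_ok = False
print(mismatched_ok)
```

The parser keeps `City` and `city` apart and rejects the mismatched pair, which is why the statement in question 10 is false.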

17) C Cross-sectional data is the best at displaying team standings containing many subjects at the same point in time.

18) D 26% represents the sample set, whereas the total of 19–28 year-olds represents the population.

19) D 38% represents the sample set, whereas the total of 18–29 year-olds represents the population.

20) C 38% represents the sample set, whereas the total of 19–31 year-olds represents the overall observations for the study known as the population.

21) C 38% represents the sample set, whereas the total of 18–29 year-olds represents the overall observations for the study known as the population.

22) B Cross-sectional represents the data collected by recording the characteristics, in this case restaurants, in a city location.

23) C Values derived from big data are the most important aspects of any analytic initiative.

24) B Veracity is the credibility and quality of data.

25) D Variety is knowing the data comes in various formats, types, and forms.

26) A Volume reflects the immense amount of data from a single (or multiple) source(s).
27) A Unstructured data do not conform to a predefined, row-column format.

28) D The ratio scale is the strongest level of measurement used to determine ratios in a live changing company environment.

29) D There are only 2 types of numerical variables, discrete or continuous. Continuous numerical represents the uncountable variables between intervals, and is the best case for this example.

30) D There are only 2 types of numerical variables, discrete or continuous. Continuous numerical represents the uncountable variables between intervals, and is the best case for this example.

31) D When the increments are of the same spacing and have no true zero point, then it is an interval measure.

32) C Ordinal scales are the ranked order of the answers on the scale, thus a Likert scale is a form of ordinal measurement.

33) C You would have the zip code variable and a count to capture the number of records in each zip code.

34) B You are grouping the zip code data and counting, thus, this is nominal in measurement. Zip codes are values that differ by name or label and allow us to categorize or group the data.

35) A The interval-scaled data drawback is that the zero point is arbitrarily chosen and does not reflect a complete absence of what is being measured.
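A small numerical sketch may help with answers 31 and 35. Assuming the standard Fahrenheit-to-Celsius conversion and some hypothetical temperature readings (neither appears in the test bank), it illustrates why ratios are not meaningful on an interval scale even though differences are.

```python
def f_to_c(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9


# Hypothetical readings, chosen only for illustration.
low_f, high_f = 40.0, 80.0
low_c, high_c = f_to_c(low_f), f_to_c(high_f)

# The ratio depends on where the arbitrary zero sits, so it changes with the unit.
ratio_f = high_f / low_f   # 80 / 40 gives 2.0
ratio_c = high_c / low_c   # roughly 6, not 2
print(ratio_f, round(ratio_c, 2))

# Differences stay comparable: equal gaps in Fahrenheit are equal gaps in Celsius.
gap1_c = f_to_c(50.0) - f_to_c(40.0)
gap2_c = f_to_c(90.0) - f_to_c(80.0)
print(round(gap1_c, 2), round(gap2_c, 2))
```

Because 80°F is "twice" 40°F only in Fahrenheit, the statement "twice as warm" carries no meaning, which is the drawback named in answer 35.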

36) D Continuous variables are characterized by uncountable values within an interval. Thus, height offers the possibility of half and partial numbers between whole numbers.

37) D Continuous variables are characterized by uncountable values within an interval. Thus, weight offers the possibility of half and partial numbers between whole numbers.

38) B Numerical variables are numbers. The other variable options are all categorical in nature.

39) B Numerical variables are numbers. The other variable options are all categorical in nature.

40) B Ordinal scales allow for a stronger type of measurement where items can be both categorized and ranked.

41) A Ratio scales are a way to categorize and rank data and find meaningful differences in observations. However, ratio scales have a true zero point, whereas zero in interval scales is arbitrarily chosen. For example, an ice depth of zero feet or zero inches implies the absence of ice.

42) C Nominal scales are a way to sort or group categories.

43) D All columns are aligned and have fixed widths: 5, 10, 5.

44) A Delimited format, also called comma-delimited file format, is where data is partitioned by commas.

45) B The format of XML is that each line in the file contains a pair of user-defined markup tags of the file, in this case Inventory.

46) D JavaScript Object Notation (JSON) format contains a clear path of identifying attributes in the file, is faster to code, supports a wide range of data types, and is common for open source programming.

47) A HTML can be quickly distinguished by the tags used in the code, which conform to standards created by organizations such as World Wide Web Consortium (W3C).

48) C DFF or Data File Format is a format, not a markup language.

49) B Data privacy, also referred to as information privacy, is a branch of data security related to the proper collection, usage, and transmission of data. Data ethics is a branch of ethics that studies and evaluates moral problems related to data such as whether data are being used to do the right thing for people and society.

50) D Information is a set of data that are organized and processed in a meaningful and purposeful way, whereas data are compilations of facts and figures; and knowledge is a combination of data, information, experience, and intuition.

51) A Data files in fixed-width format (or fixed-length format) have columns that start and end at the same place in every row.

52) D JSON is a markup language that is commonly used because it is less wordy, produces smaller files, and supports a wide range of data types.
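The file formats covered in answers 43 to 46 and 51 to 52 can be compared by reading similar inventory records three ways. This is a minimal sketch using only Python's standard library. Python itself, the column widths (8, 12, 8), and the helper name `parse_fixed` are assumptions chosen for illustration; the sample values are borrowed from questions 43, 44, and 46.

```python
import csv
import io
import json

# Delimited format: a comma separates the fields in each row.
delimited_text = "Inv_Nbr,Inv_Name,Inv_Cost\n876521,battery,45.00\n"
csv_rows = list(csv.DictReader(io.StringIO(delimited_text)))

# Fixed-width format: every column starts and ends at the same position.
# The widths (8, 12, 8) are assumed for this sketch.
WIDTHS = (8, 12, 8)
records = [
    ("Inv_Nbr", "Inv_Name", "Inv_Cost"),
    ("12232", "Filter, Air", "$9.65"),
    ("13425", "Gasket", "$4.32"),
    ("19932", "Battery", "$32.00"),
]
# Pad each value to its column width to build the fixed-width lines.
fixed_lines = ["".join(v.ljust(w) for v, w in zip(rec, WIDTHS)) for rec in records]


def parse_fixed(line, widths):
    """Slice one fixed-width line into fields using the column widths."""
    fields, start = [], 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields


fixed_rows = [parse_fixed(line, WIDTHS) for line in fixed_lines]

# JSON: nested name/value pairs instead of rows and columns.
json_text = '{"Inventory": [{"Inv_nbr": "0199284", "Inv_Name": "Filter"}]}'
inventory = json.loads(json_text)["Inventory"]

print(csv_rows[0]["Inv_Name"])   # battery
print(fixed_rows[1])             # ['12232', 'Filter, Air', '$9.65']
print(inventory[0]["Inv_Name"])  # Filter
```

The comma inside "Filter, Air" causes no trouble in the fixed-width layout because position, not a delimiter, marks the column boundary; in a comma-delimited file the same value would need to be quoted.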

CHAPTER 2

TRUE/FALSE - Write 'T' if the statement is true and 'F' if the statement is false.

1) The process of retrieving, cleaning, integrating, transforming, and enriching data to support analysis is called data wrangling. ⊚ true ⊚ false

2) A foreign key (FK) is the only unique identifier in a table structure. ⊚ true ⊚ false

3) In R, the following represents how to receive results from column 3, row 2: > myData[3,2]. ⊚ true ⊚ false

4) In R, to sort data in descending order, we use a negative parameter in the order function. ⊚ true ⊚ false

5) Simple mean imputation is the best route for replacing large quantities of missing variables within a data set without distorting the relationship among variables. ⊚ true ⊚ false

6) To view only a portion of the data that is of interest, subsetting is used. ⊚ true ⊚ false

7) Converting data from one structure to another is called data transformation. ⊚ true ⊚ false

8) Subsetting is a technique used to convert numerical values into categorical variables. ⊚ true ⊚ false

9) A dummy variable takes on a value of 1 or 0 to describe two categories of a variable. ⊚ true ⊚ false

10) Megan took a phone survey where each question posed had an answer range of unsatisfied to completely satisfied describing her purchase experience. Because the categories are in equal increments, the category can be recoded into a number transforming the category into what is called a category score. ⊚ true ⊚ false

11) The strategy of removing observations with missing data is called omission. ⊚ true ⊚ false

12) Changing an individual’s date of birth to age, combining height and weight to create body mass index, calculating percentages, or converting values to natural logarithms are examples of transforming categorical data. ⊚ true ⊚ false

MULTIPLE CHOICE - Choose the one alternative that best completes the statement or answers the question.

13) Which of the following is NOT a process of the data management system? A) acquire B) distribute C) store D) summarize

14) Which term represents data items, events, or things stored in a database file? A) instance B) entity C) settings D) quantitative

15) Mary in the accounting department has been assigned a specific vehicle as her company car to perform audits. This represents which type of relationship? A) 1 : 1 B) 1 : M C) M : N D) M : M

16) Select, From, and Where keywords are statements used in _____________blank. A) DBMS B) XML C) SQL D) JAVA

17) The primary purpose of a(n) _____________blank is to support decision-making and provide a composite view of the organization. A) data warehouse B) data mart C) entity D) attribute

18) A non-relational database structure that can support the storage of a wide range of data, including structured, semi-structured, and unstructured is called _____________blank. A) SQL B) Free Range C) NoSQL D) Recreational

19) Mary has been tasked with reviewing a large data file. She wants to begin by first inspecting the number of values in each cell, both numeric and non-numeric, for any blank entries. The plan is to first find the blank or missing values for first review. Using Excel, what function(s) should she use to complete this task? A) COUNT B) COUNTA C) COUNTIF D) Both COUNT and COUNTA

20) Molly wants to view observations with missing values in Inventory. However, her data set is quite large. What functions should she use to complete her task in R? A) > is.na (myData.Inventory) B) > is.na (myData$Inventory) C) > which (is.na(myData$Inventory)) D) > which (is.na(myData.Inventory))

21) In the presence of outliers in a data set, extremely small or large values, it is preferred to use the _____________blank instead of the _____________blank to impute missing variables. A) median; mean B) mean; median C) subset; total D) average; range

22) In a data set with 22 variables, if 13% of the values, randomly spread across observations, are missing (blank), what is the probable percent of complete and usable observations? A) 87% B) 13% C) 4.67% D) 4.16%

23) In a data set with 20 variables, if 8% of the values, randomly spread across observations, are missing (blank), what is the probable percent of complete and usable observations? A) 92% B) 8% C) 18.87% D) 15.29%

24) Using the simple mean imputation strategy, what value would be placed in the missing observation in x1?

x1: 76, 82, 87, 84 (one value missing)
x2: 22, 32, 41, 28

A) No value because excluded B) 82 C) 80 D) 66

25) Using the simple mean imputation strategy, what value would be placed in the missing observation in x1?

x1: 76, 82, 91, 88 (one value missing)
x2: 22, 32, 41, 28

A) No value because excluded B) 84 C) 83 D) 90

26) Using the omission strategy, what value would be placed in the missing observation in x1?

x1: 76, 70, 95, 92 (one value missing)
x2: 22, 32, 41, 28

A) No value because excluded B) 83 C) 81 D) 67

27) Using the omission strategy, what value would be placed in the missing observation in x1?

x1: 76, 82, 91, 88 (one value missing)
x2: 22, 32, 41, 28

A) No value because excluded B) 84 C) 83 D) 90

28) When performing an analysis, one technique is called RFM. Which of the following is not reflective of RFM? A) recency B) frequency C) monetary D) relevancy

29) Mark wants to have a better understanding of his client base at the credit union. To do so, he is running a report to show loan amount approval with corresponding credit scores. He realized the data set is quite large and wants to create categories by grouping. To do this, he needs to do all the following except A) identify the value he wants to transform into smaller groups or bins. B) remove 20% of the data to create a training set. C) ensure the data sets are not overlapping. D) identify how he wants the observations to be labeled in the bin.

30) In Analytic Solver, Aimee is trying to create a new column called RFM. This column merges multiple values into one cell. Which function accomplishes this? A) TRANSFORM B) CONCATENATE C) VARIABLE D) VLOOKUP

31) Which function provides a natural logarithm in Excel? A) INT function B) LN function C) YEARFRAC function D) VLOOKUP function

32) In R, Mary wants to understand the number of days between rain events in Chicago, IL. What function is used to find the number of days between today and January 1, 2026? A) difftime B) as.numeric C) diffdate D) floor

33) Using R, what is the formula that will allow for the weekday function to display the day of the week for November 15, 2020? A) >weekdays< (as.Date(“2020-11-15”) B) > format(as.Date(“2020-11-15”), “%d”) C) > weekdays(as.Date(“2020-11-15”)) D) > Sys.Date(“2020-11-15”)

34) Four observations were binned into one group. In this group, the values are: 40, 45, 66, and 33. What is the average of the group? A) 48 B) 47 C) 45 D) 46

35) Four observations were binned into one group. In this group, the values are: 40, 45, 38, and 33. What is the average of the group? A) 41 B) 40 C) 38 D) 39

36) The following table contains 2 variables with 2 observations. A new variable was created named Sum. This is the sum of the values x1 and x2 for each observation. What is the average value of Sum if the chart is completed?

x1  x2  Sum
80  40
82  32

A) 117 B) 64 C) 120 D) 114

37) The following table contains 2 variables with 2 observations. A new variable was created named Sum. This is the sum of the values x1 and x2 for each observation. What is the average value of Sum if the chart is completed?

x1  x2  Sum
76  22
82  32

A) 106 B) 53 C) 98 D) 114

38) When too many variables are categorized in an analysis, several potential issues may occur. Which of the following is not one of the issues that may occur? A) model performance suffers B) rarely occurring categories may not be captured accurately C) difficulty in differentiating among observations D) an increase in the number of categories as the data set becomes larger

39) Henry wants to analyze income, but the sheer number of categories in the data’s current form will make a clear analysis less meaningful. In Excel with Analytic Solver, how will Henry determine the frequency of each category to transform his data? A) Income variable is selected and Analytic Solver produces frequency levels for each Income category from most to least frequent. B) Inspect the frequency of Income category: >table(myData$Income). C) Income variable is selected and Analytic Solver produces a new category for non-use variables. D) Apply a limit to the number of categories from the drop-down to a reasonable number.

40) Using R, what function is used to evaluate the categories in the variable to identify the dummy variables? A) referral B) if C) ifelse D) view

41) In the following table, there are four observations with three variables. Which category is the best fit to be transformed into dummy variables?

Marital Status  Age  Income
Single          24   $45,000
Married         26   $33,000
Single          33   $53,000
Married         28   $59,000

A) age B) marital status C) income D) none are a good fit for a dummy variable

42) Ann is analyzing a data set that contains two variables, Job Title and 401K. 401K contains the name of the three companies that carry the retirement accounts. It is mandatory to have an account, thus no observation is blank. If 401K was transformed to dummy variables, how many should be created? A) 2 B) 3 C) 4 D) 1

43) Transform the marital status into dummy variables where Single = 1 and Married = 0. How many would have the category score of 0?

Marital Status  Age  Income
Single          24   $45,000
Married         26   $33,000
Single          33   $53,000
Single          28   $59,000
Single          36   $62,000
Married         29   $48,000

A) 4 B) 6 C) 2 D) 0

44) Transform the marital status into dummy variables where Single = 1 and Married = 0. How many would have the category score of 0?

Marital Status  Age  Income
Single          24   $45,000
Married         26   $33,000
Single          33   $53,000
Married         28   $59,000
Married         36   $62,000
Married         29   $48,000

A) 2 B) 6 C) 4 D) 0

45) Michael is examining a data set and trying to determine which category he can transform into a dummy variable. Of the four variables, Employee Number, Pay Rate, Hire Date, and Sex, which is the best fit to use a dummy variable? A) employee number B) pay rate C) hire date D) sex

46) Marcus wants to include the month of the year in the analysis as categories. How many dummy variables will be needed? A) 12 B) 11 C) 6 D) 1

47) Kara is reviewing categories where a series of numbers represent the type of loan. She would prefer the actual name of the loan be retained when running her analysis. Using Analytic Solver, what function will allow Kara to retain the category name instead of recording them in numbers? A) log function B) view function C) IF function D) head function

48) Using the following table view, Mark wants to create a relationship between the two tables. What will he need to add to establish a relationship?

A) primary key B) foreign key C) instances D) entities

49) Which of the following Excel functions will Ibrahim use to determine how many employees make more than $20 per hour? A) COUNT B) COUNTA C) COUNTIF D) COUNTIFS

50) Which of the following Excel equations will identify the number of married individuals under the age of 30?

Marital Status  Age  Income
Single          24   $45,000
Married         26   $33,000
Single          33   $53,000
Married         28   $59,000
Married         36   $62,000
Married         29   $48,000

A) =COUNT(A2:A7, “=Married”, B2:B7,“<30”) B) =COUNTA(A2:A7, “=Married”, B2:B7,“<30”) C) =COUNTIF(A2:A7, “=Married”, B2:B7,“<30”) D) =COUNTIFS(A2:A7, “=Married”, B2:B7,“<30”)

51) Using the imputation strategy for categorical values, what value would be placed in the missing observation in Fav_Color?

Age: 22, 32, 41, 28
Fav_Color: Purple, Blue, Purple, Red (one value missing)

A) No value because excluded B) Blue C) Purple D) Red

52) What data preparation technique is Maeve using when she extracts a payroll data set into two separate files, one for hourly employees and one for salary employees? A) Separating B) Subsetting C) Typesetting D) Wrangling

53) Which of the following is NOT an example of categorical data transformation? A) Category binning B) Category reduction C) Category scores D) Dummy variables

54) The variable x1 contains three categories ranging from “Poor” to “Good.” Convert the category names into category scores into x2 (i.e., 1 = “Poor”, 2 = “OK”, and 3 = “Good”). How many observations have a category score of 1?

x1    x2
OK
Good
Good
Poor
OK
Good

A) 0 B) 1 C) 2 D) 3
Answer Key

Test name: Chap 02_2e_Jaggia

1) TRUE The definition of data wrangling is the process of retrieving, cleansing, integrating, transforming, and enriching data to support subsequent data analysis.

2) FALSE A primary key (PK) is the main unique identifier in a table structure, whereas a foreign key is a primary key from a related entity.

3) FALSE > myData[3,2] represents how to view the data in row 3, column 2.

4) FALSE To sort in descending order, we use the decreasing parameter in the order function.

5) FALSE If the number of missing values is relatively small, simple mean imputation fills in the observations without biasing the results. In large quantities, however, simple mean imputation distorts the relationships among variables, leading to biased results.

6) TRUE If only a portion of the data is of interest, such as sales by region, subsetting is the method to use to view only the needed information.

7) TRUE Data transformation is used to convert data from one format or structure to another.

8) FALSE

9) TRUE Dummy variables are often used to describe categories. For example, 1 = male and 0 = female.

10) TRUE Because the survey response is one of four options, the categorical variable can be transformed into a category score for running an analysis.

11) TRUE The omission strategy, also called complete-case analysis, recommends that observations with missing values be excluded from the analysis and is appropriate when only a small number of values are missing from the data.

12) FALSE Each of these examples converts data into numerical data values as opposed to transforming them into categories.

13) D Data management is the process of acquiring, organizing, storing, manipulating, and distributing data. Summarizing is not part of data management.

14) B An entity is a generalized category, representing people, places, or things to be stored in a database file.

15) A In this situation, it is one person, Mary, assigned to a specific vehicle. So a 1 : 1 relationship is the best fit for the scenario.

16) C Structured Query Language (SQL) is driven by the statements Select, From, and Where, to specify tables, attributes, and criteria to retrieve.

17) A A data warehouse or enterprise data warehouse is the central repository for data in an organization. The historical and comprehensive view allows management to make strategic decisions for the business.

18) C A NoSQL database offers the flexibility, performance, and scalability to handle high volumes of data. This next phase database will become common to handle the new world of growing data analysis.

19) D Because the data is both numerical and non-numerical, both the COUNT and the COUNTA functions need to be used to find the blank or missing values quickly.

20) C When a small data set is in use, the is.na function is fine, but when the set is larger, which and is.na are used together to quickly identify the missing values.

21) A With outliers, the median is preferred over the mean for imputing the missing values. The reasoning is that extreme values can swing the mean. Both Analytic Solver and R have easy ways to compute this imputation.

22) C (1 − 0.13)^22 = 0.0467 or 4.67%.

23) C (1 − 0.08)^20 = 0.1887 or 18.87%.

24) B Simple mean imputation replaces the blank or missing value with the mean of the present values: (76 + 82 + 87 + 84) ÷ 4 = 82.25, or 82.

25) B Simple mean imputation replaces the blank or missing value with the mean of the present values: (76 + 82 + 91 + 88) ÷ 4 = 84.25, or 84.

26) A In the omission strategy, observations with missing values are excluded from the analysis.

27) A In the omission strategy, observations with missing values are excluded from the analysis.

28) D RFM is the acronym for recency, frequency, and monetary.

29) B Binning is taking the entire data set, identifying the value to be binned into smaller groups, ensuring no data overlapping, and labeling the bin accordingly.

30) B The CONCATENATE function allows for multiple cells to be merged into one cell.

31) B In Excel, the LN function provides a natural logarithm transformation.

32) A The difftime function is used to determine the number of days between dates.

33) C The weekdays function returns the day of the week. This example, > weekdays(as.Date(“2020-11-15”)), comes back with the result of “Sunday”.

34) D (40 + 45 + 66 + 33) ÷ 4 = 184 ÷ 4 = 46.

35) D

20


40 + 45 + 38 + 33 = 156 ÷ 4 = 39. 36) A First you need to sum the variables x1 and x2 for each row (120 and 114, respectively). (120 + 114) ÷ 2 = 117 is the average of Sum. 37) A First you need to sum the variables x1 and x2 for each row (98 and 114, respectively). (98 + 114) ÷ 2 = 106 is the average of Sum. 38) D If the results of a smaller set are applied to a larger data set, then errors may be created. The categories will not increase in numbers as the set becomes larger, just more data will reside under the same amount of categories. 39) A Analytic Solver will produce results indicating the frequency levels from most to least frequent category for income. 40) C The ifelse function evaluates the category and determines the assignment of the 1 or 0. For example, if the category is sex, then 1 for male, 0 for female. 41) B Marital status can be transformed into 1 = married and 0 = single. 42) A The dummy variables would cover the three possible options for the company being used for the 401K funds. Given k categories of a variable, the general rule is to create k − 1 dummy variables, using the last category as reference. For 401k we only need to define two dummy variables ( k − 1 = 3 − 1 = 2). Creating a third dummy would create data redundancy. 43) C The category score for Marital is 0, thus there are 2 with that status. Version 1

21


44) C The category score for Marital is 0, thus there are 4 with that status. 45) D Sex would be the best solution because the options are minimal. Example: 1 = Female, 0 = Male 46) B If a given k categories = 12, then k − 1, or 12 − 1 = 11 dummy variables. 47) C An IF function allows for statements to be crafted to transform numbers into category names. 48) B A foreign key is the primary key from another entity used to create a relationship between the tables.
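As a rough check on the arithmetic in answers 22 through 25, the following Python sketch recomputes the complete-case fraction and the simple mean imputation. It is only an illustration: the text itself works in Excel, Analytic Solver, and R, the function names are made up here, and the reading of (1 − 0.13)^22 as "22 variables, each missing independently 13% of the time" is an assumption, since the question text is not shown in this answer key. The position of the missing value in the list is also hypothetical.

```python
def complete_case_fraction(missing_rate, n_variables):
    """Share of observations with no missing value, assuming each variable
    is missing independently with the same probability (an assumption)."""
    return (1 - missing_rate) ** n_variables


def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


print(round(complete_case_fraction(0.13, 22), 4))  # 0.0467, as in answer 22
print(round(complete_case_fraction(0.08, 20), 4))  # 0.1887, as in answer 23

# The missing value (None) is placed arbitrarily; only the four observed
# values come from answer 24.
print(mean_impute([76, 82, None, 87, 84]))  # [76, 82, 82.25, 87, 84]
```

Rounding the imputed 82.25 to 82 reproduces the value reported in answer 24.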

49) C COUNTIF is used to count the number of cells in a row or column that meet a single criterion such as “more than $20 per hour”. COUNTIFS is used to count the number of cells that meet more than one criterion. COUNT counts the cells in a row or column that contain numerical values, and COUNTA counts the cells that are not empty, whether numerical or nonnumerical.

50) D COUNTIFS is used to count the number of cells that meet more than one criterion, whereas COUNTIF is used to count the number of cells in a row or column that meet just one criterion such as “=married”. COUNT counts the cells in a row or column that contain numerical values, and COUNTA counts the cells that are not empty, whether numerical or nonnumerical.

51) C In the case of categorical variables, the most frequent category is often used as the imputed value.

52) B The process of extracting portions of a data set that are relevant to the analysis is called subsetting.

53) A Three common approaches for transforming categorical data are category reduction, dummy variables, and category scores. Binning is the process of converting numerical values into categorical variables by grouping them into bins, whereas dummy variables and category scores convert categorical data into numerical values, and category reduction reduces the number of categories.

54) B When converted, x2 should be 2, 3, 3, 1, 2, 3, respectively. Since “Poor” only occurs once, the category score 1 appears in only one observation.
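The k − 1 rule for dummy variables (answers 42 and 46) and the category scores in answer 54 can be illustrated with a short Python sketch. This is not the textbook's method, which uses Excel, Analytic Solver, and R; the function names, the provider names, and the labels "Fair" and "Good" are hypothetical. Only "Poor" and the scores 2, 3, 3, 1, 2, 3 come from the answer key.

```python
def dummy_variables(values, reference):
    """Create k - 1 dummy variables (0/1 lists), omitting the reference category."""
    categories = sorted(set(values))
    return {
        category: [1 if v == category else 0 for v in values]
        for category in categories
        if category != reference
    }


def category_scores(values, ordered_categories):
    """Map ordered categories to scores 1, 2, 3, ... in the order given."""
    score = {category: i + 1 for i, category in enumerate(ordered_categories)}
    return [score[v] for v in values]


# Hypothetical provider names standing in for the three 401K options.
providers = ["Fund A", "Fund B", "Fund C", "Fund A", "Fund C"]
dummies = dummy_variables(providers, reference="Fund C")
print(len(dummies))  # 2 dummy variables for k = 3 categories

# "Poor" comes from answer 54; "Fair" and "Good" are assumed labels.
ratings = ["Fair", "Good", "Good", "Poor", "Fair", "Good"]
print(category_scores(ratings, ["Poor", "Fair", "Good"]))  # [2, 3, 3, 1, 2, 3]
```

An observation with 0 in both dummy variables belongs to the reference category, which is why a third dummy would be redundant.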

CHAPTER 3

TRUE/FALSE - Write 'T' if the statement is true and 'F' if the statement is false.

1) The number of defects in a production process, the salary of a business graduate, the rental price in a neighborhood, and the number of orders for a subscription-based service are examples of dispersion measures. ⊚ true ⊚ false

2) If approximately p percent of the observations have values less than the pth percentile, then approximately (100 − p) percent of the observations have values greater than the pth percentile. ⊚ true ⊚ false

3) A percentile is calculated using the QUANTILE.INC function in Excel. ⊚ true ⊚ false

4) The variance and the standard deviation are the two most widely used measures of association. ⊚ true ⊚ false

5) A symmetric distribution is one that is a mirror image of itself on both sides of its center, whereas the skewness coefficient measures the degree to which a distribution is not symmetric about its mean. ⊚ true ⊚ false

6) The covariance between two variables x and y indicates both the direction and the strength of the linear relationship. ⊚ true ⊚ false

7) In a boxplot, if the median is in the center of the box and the left and right whiskers are equidistant from their respective quartiles, a symmetrical shape is implied. ⊚ true ⊚ false

8) Central location is defined as the way numerical data tends to cluster around a middle or central value. ⊚ true ⊚ false

9) In data sets that contain outliers, the arithmetic mean is used as the measure of the central location. ⊚ true ⊚ false

10) Converting observations into z-scores is also called doubling the observation. ⊚ true ⊚ false

MULTIPLE CHOICE - Choose the one alternative that best completes the statement or answers the question.

11) Which of the following functions do you use to calculate the pth percentile in Excel? A) PERCENTILE.INC B) QUANTILE.INC C) quantile D) summary

12) Ashley has a data set of sales transactions in the following format:

Cust  Gender  Clothing  Tech
1     Female  246       64
2     Male    171       345
...
130   Male    52        58

Which of the following functions will return the average sales for the technology department for females in R? A) =AVERAGEIF(B2:B131, "Female", D2:D131) B) =AVERAGEIF(D2:D131, "Female", B2:D131) C) tapply(myData$Gender, myData$Tech, mean) D) tapply(myData$Tech, myData$Gender, mean)

13) Find the mode of the values: 1, 4, 5, 3, 4, 6, 1, 2, 4, 3, 2 A) 1 B) 2 C) 3 D) 4

14) Find the median of the values: 1, 4, 5, 3, 4, 6, 1, 2, 4, 3, 2 A) 1 B) 2 C) 3 D) 4

15) Survey results provided a skewness coefficient of 0.21672 and an (excess) kurtosis coefficient of −1.15926. These values imply that the return value for the survey is __________ skewed, and the distribution has a __________ tail than a normal distribution. A) negatively; longer B) negatively; shorter C) positively; longer D) positively; shorter

16) Four investment return distributions are the same, except for skewness (e.g., same mean and standard deviation). Which investment would you choose to increase the likelihood of positive returns? A) −0.25 B) 0.05 C) 0.55 D) 0.75

17) The correlation coefficient for rent and square footage is computed to be 0.84. This means the relationship between the two variables is __________. A) strong and positive B) weak and negative C) weak but positive D) strong but negative

18) In a boxplot, the dashed vertical line in the middle of the box represents which of the following measures of location? A) mean B) median C) mode D) percentile

19) What is the median in the following table on the variable score?

Student_ID  Score
R304110     0.98
R304003     0.88
R102234     0.65
R209939     0.92

A) 0.86 B) 0.88 C) 0.90 D) 0.92

20) What is the only meaningful measure of the central location for a categorical variable? A) mean B) mode C) median D) model

21) What is the arithmetic mean in the following table on the variable score?

Student_ID  Score
R304110     0.95
R304003     0.78
R102234     0.65
R209939     0.92

A) 0.92 B) 0.78 C) 0.715 D) 0.8250

22) What is the arithmetic mean in the following table on the variable score?

Student_ID  Score
R304110     0.98
R304003     0.88
R102234     0.65
R209939     0.92

A) 0.92 B) 0.88 C) 0.765 D) 0.8575

23) Carmen is a professor at a local university. After collecting data on her Introduction to Business course for a year, she wants to calculate the z-score for a student who scores 92 on the final exam. The mean and the standard deviation scores on the exam are 76 and 6, respectively. Calculate the z-score. A) 1.67 B) 2.67 C) 0.67 D) 2.33

24) Carmen is a professor at a local university. After collecting data on her Introduction to Business course for a year, she wants to calculate the z-score for a student who scores 90 on the final exam. The mean and the standard deviation scores on the exam are 76 and 6, respectively. Calculate the z-score. A) 1.33 B) 2.33 C) 0.33 D) 2.67

25) The empirical rule states all the following except A) almost all observations fall in the interval x̄ ± 3s. B) approximately 95% of all observations fall in the interval x̄ ± 2s. C) approximately 65% of all observations fall in the interval x̄ ± s. D) approximately 68% of all observations fall in the interval x̄ ± s.

26) The mean credit score of 345 used car loan applicants is 650, with a standard deviation of 14. Assuming a bell-shaped distribution, approximately how many loan applicants have a score between 622 and 678? A) 56 B) 235 C) 328 D) 345

27) The mean credit score of 300 used car loan applicants is 640, with a standard deviation of 16. Assuming a bell-shaped distribution, approximately how many loan applicants have a score between 608 and 672? A) 96 B) 204 C) 285 D) 300

28) Candice is preparing for her final exam in Statistics. She knows she needs 62 out of 100 to earn an A overall in the course. Her instructor provided the following information to the students: 200 students have taken the final, with a mean score of 54 and a standard deviation of 6. Assume the distribution of scores is bell-shaped. Determine whether a score of 62 is within one standard deviation of the mean. A) Yes, 62 is the upper limit of one standard deviation from the mean. B) Yes, the upper level of one standard deviation is 60. C) Yes, 62 is greater than 48, one standard deviation below the mean. D) No, 62 is greater than 60, one standard deviation above the mean.

29) Candice is preparing for her final exam in Statistics. She knows she needs 80 out of 100 to earn an A overall in the course. Her instructor provided the following information to the students: 200 students have taken the final, with a mean score of 72 and a standard deviation of 6. Assume the distribution of scores is bell-shaped. Determine whether a score of 80 is within one standard deviation of the mean. A) Yes, 80 is the upper number of one standard deviation from the mean. B) No, the upper level of one standard deviation is 78. C) Yes, 80 is greater than 66, one standard deviation below the mean. D) No, 80 is greater than the mean of 72.

30) Using the following Boxplot, identify the median score on the test.

A) 28 B) 68 C) 60 D) 90

31) Using the following Boxplot, what is the star to the far right considered?

A) upper quartile mark B) an outlier C) a whisker D) a deviation

32) In the following Boxplot, the left whisker is longer than the right whisker. This indicates that the underlying distribution is __________.

A) negatively skewed B) median skewed C) positively skewed D) no indicated skew

33) What is the measure of the strength of the linear relationship between x and y called? A) correlation determination B) index C) standard deviation D) correlation coefficient

34) As observations become more dispersed, the difference between the minimum and maximum observations of a variable increases. This difference is called __________. A) the range B) the variable C) the interquartile D) the mean absolute deviation

35) The interquartile range is IQR = Q3 − Q1. Thus, it can be thought of as A) the 75% interquartile range. B) the quartile or 25% of the variable. C) the middle 50% of the variable. D) the incorporation of all observations.

36) Find the Mean Absolute Deviation (MAD) of 13, 9, 9, 11, 13. A) 4.0 B) 1.60 C) 11.0 D) 3.0

37) Find the Mean Absolute Deviation (MAD) of 10, 9, 3, 8, 10. A) 5 B) 2 C) 8 D) 4

38) The following table shows the summary statistics for Scores. Calculate the Sharpe ratio of growth. Assume Rf = 4.

Type    Minimum  Maximum  Mean     Standard Deviation
Scores  8        61       27.8182  4.63193

A) 3.223 B) 1.158 C) 6.006 D) 5.142

39) The following table shows the summary statistics for Scores. Calculate the Sharpe ratio of growth. Assume Rf = 4.

Type    Minimum  Maximum  Mean     Standard Deviation
Scores  8        61       30.8182  4.93193

A) 3.699 B) 1.232 C) 6.472 D) 5.438

40) Alex is working on an investment portfolio, reviewing summary statistics for Gas and Diesel. Which investment offers more reward per unit of risk, assuming Rf = 2?

Type    Minimum  Maximum  Mean   Standard Deviation
Gas     113.9    125.9    129.9  3.620927
Diesel  118.3    123.4    128.5  2.783401

A) Gas provides a higher Sharpe ratio with 35.3224, thus more reward per unit of risk. B) Gas provides a higher Sharpe ratio with 63.1395, thus more reward per unit of risk. C) Diesel provides a higher Sharpe ratio with 62.8583, thus more reward per unit of risk. D) Diesel provides a higher Sharpe ratio with 45.448, thus more reward per unit of risk.

41) Alex is working on an investment portfolio, reviewing summary statistics for Gas and Diesel. Which investment offers more reward per unit of risk, assuming Rf = 2?

Type    Minimum  Maximum  Mean   Standard Deviation
Gas     113.9    125.9    119.9  3.620927
Diesel  118.32   123.4    120.5  2.783401

A) Gas provides a higher Sharpe ratio with 32.5607, thus more reward per unit of risk. B) Gas provides a higher Sharpe ratio with 73.9700, thus more reward per unit of risk. C) Diesel provides a higher Sharpe ratio with 153.8165, thus more reward per unit of risk. D) Diesel provides a higher Sharpe ratio with 42.5738, thus more reward per unit of risk.

42) The standard deviation of midterm scores and the final exam are 13.5 and 10.5, respectively. Which of the two exams is riskier and why? A) Both the midterm and the final share the same amount of risk. B) The midterm exam is riskier because the standard deviation is higher. C) The midterm exam is riskier because the standard deviation is lower. D) There is not enough information to determine which is the riskier of the two.

43) The standard deviation of midterm scores and the final exam are 8 and 6, respectively. Which of the two exams is riskier and why?
A) Both the midterm and the final share the same amount of risk. B) The midterm exam is riskier because the standard deviation is higher. C) The final exam is riskier because the standard deviation is lower. D) There is not enough information to determine which is the riskier of the two.

44) Survey results provided a skewness coefficient of −0.141974 and an (excess) kurtosis coefficient of 1.15926. These values imply that the return value for the survey is __________ skewed, and the distribution has a __________ tail than a normal distribution. A) positively; longer B) positively; shorter C) negatively; longer D) negatively; shorter

45) In analyzing the S&P 500 and XYZ Incorporated in a five-year study, the covariance (S&P 500, XYZ Incorporated) is 4,670.60. What kind of linear relationship do the S&P 500 and XYZ Incorporated have? A) a negative linear relationship B) a positive linear relationship C) no linear relationship D) a neutral linear relationship

46) In analyzing the S&P 500 and XYZ Incorporated in a five-year study, the covariance (S&P 500, XYZ Incorporated) is 9,107.30. What kind of linear relationship do the S&P 500 and XYZ Incorporated have? A) a negative linear relationship B) a positive linear relationship C) no linear relationship D) a neutral linear relationship

47) If the correlation coefficient is 0, then x and y A) are not linearly related. B) are absolute and perfectly related. C) have a perfect positive relationship. D) have a perfect negative relationship.

48) If the correlation coefficient is computed to be −0.95, this means the relationship between the two variables is __________. A) strong, positive B) weak, negative C) weak, positive D) strong, negative

49) If the correlation coefficient is computed to be −0.85, this means the relationship between the two variables is __________. A) strong, positive B) weak, negative C) weak, positive D) strong, negative

50) Find the mean of the values: 1, 4, 5, 3, 4, 6, 1, 2, 4, 3, 2 A) 3 B) 3.18 C) 3.27 D) 4

Answer Key
Test name: Chap 03_2e_Jaggia

1) FALSE The number of defects in a production process, the salary of a business graduate, the rental price in a neighborhood, and the number of orders for a subscription-based service are examples of numerical variables, not dispersion measures.

2) TRUE A percentile is a measure of relative location that divides a data set into two parts: approximately p percent of the observations lie below the pth percentile, and approximately (100 − p) percent lie above it.

3) FALSE A percentile is calculated using the PERCENTILE.INC function in Excel. R uses the quantile function.

4) FALSE The variance and the standard deviation are the two most widely used measures of dispersion.

5) TRUE The skewness coefficient is a measure of shape and measures the degree to which a distribution is not symmetric.

6) FALSE The correlation coefficient between two variables x and y indicates both the direction and the strength of the linear relationship, whereas the covariance indicates only the direction of the relationship.

7) TRUE A boxplot is also used to informally gauge the shape of the distribution. Symmetry is implied if the median is in the center of the box and the left and right whiskers are equidistant from their respective quartiles. If not, skewness is implied.

8) TRUE A measure of central location identifies a typical or central value that describes the data; that is, the value around which the data tend to cluster.

9) FALSE The arithmetic mean is the primary measure of the central location. However, the median is used when the mean can be misleading due to outliers.

10) FALSE Converting observations into z-scores is also called standardizing the observation.

11) A When calculating the pth percentile in Excel, the PERCENTILE.INC function is used. The function quantile is used to calculate the pth percentile in R, while summary reports the quartiles.

12) D The tapply function is useful for finding means of subgroups, which in this case is female tech sales. The input for the function is (1) the outcome variable (Tech), (2) the categorical variable (Gender), and (3) the function to be performed (mean).

13) D The mode of a variable is the observation that occurs most frequently. Within this data set, the number 4 occurs most frequently.

14) C The median of a variable is the middle value of a data set; that is, an equal number of observations lie above and below the median. After arranging the data in ascending order (smallest to largest), in this case 1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 6, the middle observation is 3.

15) D The positive skewness coefficient indicates that the distribution is positively skewed, and the (excess) kurtosis coefficient of −1.15926 indicates a shorter tail than the normal distribution.

16) D If all four investment distributions are the same with the exception of skewness, you would want to invest in the option that has the largest positive skewness (0.75) because that distribution implies a greater probability of extremely large gains.

17) A A correlation coefficient is interpreted by its proximity to −1, 0, or 1. In this case, 0.84 is close to 1, which indicates a strong, positive linear relationship.

18) B The boxplot is used to graphically display the five-number summary of a variable, which includes the minimum, Q1, median, Q3, and maximum values. In a boxplot the box represents the interquartile range (Q3 − Q1) and the dashed vertical line in the box represents the median.

19) C The median is the middle value of a data set; that is, an equal number of observations lie above and below the median. After arranging the data in ascending order (0.65, 0.88, 0.92, 0.98), we calculate the median as (1) the middle value if the number of observations is odd or (2) the average of the two middle values if the number of observations is even. Since the number of observations is even, we calculate the median as (0.88 + 0.92)/2 = 0.90.

20) B If we want to summarize a categorical variable, then the mode is the only meaningful measure of central location.

21) D The arithmetic mean is calculated by adding all the scores together and dividing by the number of observations. (0.95 + 0.78 + 0.65 + 0.92) ÷ 4 = 0.8250.

22) D The arithmetic mean is calculated by adding all the scores together and dividing by the number of observations. (0.98 + 0.88 + 0.65 + 0.92) ÷ 4 = 0.8575.

23) B We use the z-score to find the relative position of an observation by dividing the difference of the observation from the mean by the standard deviation: z = (92 − 76) ÷ 6 = 2.67.
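The z-score in answer 23 can be checked with a few lines of Python. This is a minimal sketch for illustration only; the function name is made up here, and the second call uses the score of 90 from question 24 with the same mean and standard deviation.

```python
def z_score(x, mean, std_dev):
    """Standardize an observation: its distance from the mean in standard deviations."""
    return (x - mean) / std_dev


print(round(z_score(92, 76, 6), 2))  # 2.67 (question 23)
print(round(z_score(90, 76, 6), 2))  # 2.33 (question 24)
```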

24) B We use the z-score to find the relative position of an observation by dividing the difference of the observation from the mean by the standard deviation: z = (90 − 76) ÷ 6 = 2.33.

25) C Approximately 68% of all observations fall in the interval x̄ ± s, not 65%.

26) C 650 − (2 × 14) = 622 is two standard deviations below the mean, and 650 + (2 × 14) = 678 is two standard deviations above the mean. The empirical rule states that approximately 95% of observations fall within this range. Therefore, 345 × 0.95 ≈ 328 applicants fall within the 622 to 678 range.

27) C 640 − (2 × 16) = 608 is two standard deviations below the mean, and 640 + (2 × 16) = 672 is two standard deviations above the mean. The empirical rule states that approximately 95% of observations fall within this range. Therefore, 300 × 0.95 = 285 applicants fall within the 608 to 672 range.

28) D 54 + 6 = 60 for the upper level and 54 − 6 = 48 for the lower level. The answer is therefore no, 62 does not fall within one standard deviation.

29) B 72 + 6 = 78 for the upper level and 72 − 6 = 66 for the lower level. The answer is therefore no, 80 does not fall within one standard deviation.

30) C The median score is identified by the line in the box. In this case, it is at 60.
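The empirical-rule arithmetic in answers 26 through 29 can be reproduced with a small Python sketch. The function names are illustrative, and the shares 0.68, 0.95, and 0.997 are the usual approximations for a bell-shaped distribution (the text says only "almost all" for three standard deviations, so 0.997 is an assumption about the intended figure).

```python
# Approximate share of observations within k standard deviations of the mean
# for a bell-shaped distribution (0.997 is the commonly quoted "almost all").
EMPIRICAL_SHARE = {1: 0.68, 2: 0.95, 3: 0.997}


def interval(mean, std_dev, k):
    """Return the interval mean +/- k standard deviations."""
    return mean - k * std_dev, mean + k * std_dev


def expected_count(n, k):
    """Approximate number of the n observations within k standard deviations."""
    return round(n * EMPIRICAL_SHARE[k])


print(interval(650, 14, 2), expected_count(345, 2))  # (622, 678) 328
print(interval(640, 16, 2), expected_count(300, 2))  # (608, 672) 285

low, high = interval(54, 6, 1)
print(low <= 62 <= high)  # False: 62 is above the upper limit of 60
```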

31) B A star or other mark beyond the whiskers, which extend at most 1.5 × IQR from the quartiles, is considered an outlier.

32) A When the whisker is longer to the left and the median is right of center, the underlying distribution is negatively skewed.

33) D The correlation coefficient measures the strength of the linear relationship between two variables.

34) A The range is the difference between the maximum observation of a variable and the minimum observation of a variable.

35) C The interquartile range, or IQR, is the difference between the third quartile and the first quartile; it can be thought of as the middle 50% of the variable.

36) B MAD is the average of the absolute differences between the observations and the mean. For this data set, the mean is 11, so MAD = (|13 − 11| + |9 − 11| + |9 − 11| + |11 − 11| + |13 − 11|) ÷ 5 = 1.60.

37) B MAD is the average of the absolute differences between the observations and the mean. For this data set, the mean is 8, so MAD = (|10 − 8| + |9 − 8| + |3 − 8| + |8 − 8| + |10 − 8|) ÷ 5 = 2.

38) D The Sharpe ratio is often calculated as (x̄i − Rf) ÷ si, where x̄i is the mean return for investment i, Rf is the mean return for a risk-free asset such as a Treasury bill (T-bill), and si is the standard deviation for investment i. Sharpe ratio for Scores = (27.8182 − 4) ÷ 4.63193 = 5.142.
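The MAD calculations in answers 36 and 37 can be verified with the following Python sketch. The function name is illustrative and not taken from the text.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of the observations from their mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


print(mean_absolute_deviation([13, 9, 9, 11, 13]))  # 1.6
print(mean_absolute_deviation([10, 9, 3, 8, 10]))   # 2.0
```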

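39) D The Sharpe ratio is often calculated as (x̄i − Rf) ÷ si, where x̄i is the mean return for investment i, Rf is the mean return for a risk-free asset such as a Treasury bill (T-bill), and si is the standard deviation for investment i. Sharpe ratio for Scores = (30.8182 − 4) ÷ 4.93193 = 5.438.

40) D The Sharpe ratio is often calculated as (x̄i − Rf) ÷ si, where x̄i is the mean return for investment i, Rf is the mean return for a risk-free asset such as a Treasury bill (T-bill), and si is the standard deviation for investment i. Sharpe ratio for Gas = (129.9 − 2) ÷ 3.620927 = 35.3224. Sharpe ratio for Diesel = (128.5 − 2) ÷ 2.783401 = 45.4480. Thus, Diesel has the higher Sharpe ratio and offers more reward per unit of risk.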
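The Sharpe ratio comparison in answer 40 can be recomputed with a short Python sketch. The function name is illustrative; the inputs are the means, standard deviations, and risk-free rates given in questions 38 and 40.

```python
def sharpe_ratio(mean_return, risk_free, std_dev):
    """Reward per unit of risk: excess mean return divided by the standard deviation."""
    return (mean_return - risk_free) / std_dev


gas = sharpe_ratio(129.9, 2, 3.620927)
diesel = sharpe_ratio(128.5, 2, 2.783401)

print(round(gas, 4), round(diesel, 4))
print("Diesel" if diesel > gas else "Gas", "offers more reward per unit of risk")

# The single-investment case from question 38 (Rf = 4).
print(round(sharpe_ratio(27.8182, 4, 4.63193), 3))
```

Because the ratio divides by the standard deviation, an investment with a slightly lower mean can still rank higher when its returns are less dispersed, which is what happens with Diesel here.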

41) D The Sharpe ratio is often calculated as (x̄i − Rf) ÷ si, where x̄i is the mean return for investment i, Rf is the mean return for a risk-free asset such as a Treasury bill (T-bill), and si is the standard deviation for investment i. Sharpe ratio for Gas = (119.9 − 2) ÷ 3.620927 = 32.5607. Sharpe ratio for Diesel = (120.5 − 2) ÷ 2.783401 = 42.5738. Thus, Diesel has the higher Sharpe ratio and offers more reward per unit of risk.

42) B Because the standard deviation for the midterm is greater than the one for the final, the midterm is considered riskier.

43) B Because the standard deviation for the midterm is greater than the one for the final, the midterm is considered riskier.

44) C The negative skewness coefficient indicates that the distribution is negatively skewed, and the (excess) kurtosis coefficient of 1.15926 indicates a longer tail than the normal distribution.

45) B The covariance is a measure of the linear relationship between two variables. If the covariance is positive, then there is a positive linear relationship.

46) B The covariance is a measure of the linear relationship between two variables. If the covariance is positive, then there is a positive linear relationship.

47) A When the correlation coefficient equals 0, x and y are not linearly related.

48) D A correlation coefficient is interpreted by its proximity to −1, 0, or 1. In this case, −0.95 is close to −1, which indicates a strong, negative linear relationship.

49) D A correlation coefficient is interpreted by its proximity to −1, 0, or 1. In this case, −0.85 is close to −1, which indicates a strong, negative linear relationship.

50) B To calculate the mean of a variable, we add up all the observations and divide by the number of observations. (1 + 4 + 5 + 3 + 4 + 6 + 1 + 2 + 4 + 3 + 2)/11 = 35/11 = 3.18.
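Questions 13, 14, and 50 all use the same eleven values, so the mode, median, and mean can be checked together. The sketch below uses Python's standard statistics module purely as an illustration; the text itself relies on Excel and R.

```python
import statistics

values = [1, 4, 5, 3, 4, 6, 1, 2, 4, 3, 2]

print(statistics.mode(values))            # 4    (question 13)
print(statistics.median(values))          # 3    (question 14)
print(round(statistics.mean(values), 2))  # 3.18 (question 50)
```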

CHAPTER 4

TRUE/FALSE - Write 'T' if the statement is true and 'F' if the statement is false.

1) An easy way to convert relative frequencies into percentages is by dividing by 100. ⊚ true ⊚ false

2) When working with numerical variables, the frequency distribution is equal to the number of observations that falls into each interval. ⊚ true ⊚ false

3) Constructing a contingency table allows for a clear visualization of the relationship between two categorical variables. ⊚ true ⊚ false

4) In a scatter plot diagram, if there is no discernable pattern, then there is a positive relationship between the numerical variables. ⊚ true ⊚ false

5) A scatterplot with a categorical variable allows for the dynamic view of the addition of a categorical variable to the numeric plot points adding an additional layer of visible detail. ⊚ true ⊚ false

6) A study focused on the following numerical variables: Age, Income, and Candy (lbs.). A bubble chart will help the researcher understand the relationship based on the location of the age and income plots and the size of the bubble based on candy consumption. ⊚ true ⊚ false

7) A line chart displays the numerical variable of a series of data points connected by a line. ⊚ true ⊚ false

8) A bar chart is a series of rectangles where the width and height of each rectangle represent the interval width and frequency (or relative frequency) of the respective interval for numerical variables. ⊚ true ⊚ false

9) Stacked column charts can be used to visualize more than one categorical variable. ⊚ true ⊚ false

10) Heat maps are useful in identifying combinations of the numerical variables that have economic significance. ⊚ true ⊚ false

MULTIPLE CHOICE - Choose the one alternative that best completes the statement or answers the question.

11) Using the following table, what is the relative frequency, expressed as a percentage, of a Blue car being observed?

Car Color  Frequency  Relative Frequency
Black      251        0.2650
Red        198        0.2090
Blue       205        0.2170
Silver     293        0.3090

A) 21.65% B) 2.1650% C) 20.50% D) 0.2165%

12) Using the following table, what is the relative frequency, expressed as a percentage, of a Blue car being observed?

Car Color  Frequency  Relative Frequency
Black      251        0.2660
Red        198        0.2100
Blue       203        0.2150
Silver     293        0.3100

A) 21.48% B) 2.1480% C) 20.30% D) 0.2148%

13) Using R, Bart wants to create a bar chart showing the frequency of the color of cars that pass over the I-270 overpass at the Main Street exit. What function should he use? A) table B) barplot C) abline D) view

14) Bill wants to calculate the width of each interval by using the approximation formula. He first created a frequency distribution with 4 intervals. The minimum and maximum for the variable are −32.4 and 66.69, respectively. Calculate the width of each interval for Bill. A) 16.673 B) 8.100 C) 24.773 D) −24.773

15) Bill wants to calculate the width of each interval by using the approximation formula. He first created a frequency distribution with 5 intervals. The minimum and maximum for the variable are −32.4 and 65.42, respectively. Calculate the width of each interval for Bill. A) 13.084 B) 6.604 C) 19.564 D) −19.564

16) Marin produced the following histogram based on his observations on the age of players willing to sample a video game. He then organized the age into frequencies and interval width of the respective intervals. Based on the results, which range has the best frequency for future video game sampling?

A) 16–18 B) 19–22 C) 23–25 D) 8–10

17) Using the following histogram, how would the distribution be described?

A) bell-shaped B) symmetric skewed C) positively skewed D) negatively skewed

18) As a researcher, you should be mindful when constructing and interpreting graphs. All of the following are recommended guidelines you should follow except for one. Which one does not reflect a guideline? A) The simplest graph should be used for a given set of data. B) Axes should be clearly marked and labeled. C) On a bar chart or histogram, bar widths should always reflect various widths of the data. D) Be mindful of the upper limits on the vertical axis to prevent compression, hiding variant details.

19) The following chart represents the results of a two categorical variable study reflecting on education level survey in 2006 and the same survey results from 2016. Which type of chart is being displayed?

A) scatterplot B) contingency table C) histogram D) stacked column chart

20) Below is the scatterplot of two numerical variables. What type of relationship is represented?

A) nonlinear relationship B) linear relationship C) matrix relationship D) no relationship

21) Simone is a marketing consultant hired to review the product sales for a new high-end barista machine line. The product line has four variations, selling in four specialty store regions. To clearly show where each variation is selling best and in which regions, she plans to provide a color-scaled chart using percentage by type and location. What is the name of the chart she will be using? A) pivot table B) bubble plot C) color map D) heat map

22) The following scatterplot is of % of fat and BMI Index Score. What type of linear relationship is being displayed?

A) negative linear relationship B) positive linear relationship C) no relationship D) scattered negative relationship

23) Which visualization method would best represent the following frequency distribution?

Car Color    Frequency    Relative Frequency
Black        251          0.266
Red          198          0.210
Blue         203          0.215
Silver       293          0.310

A) bar chart B) box plot C) histogram D) line chart

24) What is the relative frequency of individuals who rated the restaurant “Excellent” from the following sample?

Rating       Frequency    Relative Frequency
Poor         25
Ok           19
Good         20
Excellent    29

A) 0.27 B) 0.29 C) 0.31 D) 0.33

25) How would you describe the following histogram?

A) bell-shaped B) negatively skewed C) positively skewed D) symmetrically skewed

26) Which of the chapter's recommended guidelines is violated in the graph below?

A) The simplest graph should be used for a given set of data. B) Axes should be clearly marked and labeled. C) On a bar chart or histogram, bar widths should always be consistent as differing widths may create distortions. D) Be mindful of the upper limits on the vertical axis to prevent compression, hiding variant details.

27) Which of the following would likely be used to report data related to gender and phone model purchased? A) contingency table B) frequency distribution C) line chart D) scatterplot

28) Which is the most popular beverage from this sample of 270 men and women?

A) men; beer B) men; soft drink C) women; beer D) women; soft drink

29) Which of the following would use a stacked column chart to visualize the results? A) daily stock prices B) gender and restaurant ratings C) income and purchase price D) location and house prices

30) Which of the following would use a contingency table to visualize the results? A) daily stock prices B) gender and restaurant ratings C) income and purchase price D) location and house prices

31) What type of relationship is represented in the following scatterplot of average income per state and median house price?

A) positive relationship B) negative relationship C) no relationship D) steady relationship

32) Which of the following would use a scatterplot to visualize the results? A) daily stock prices B) gender and restaurant ratings C) income and purchase price D) location and house prices

33) What type of relationship is represented in the following scatterplot of the returns for stock portfolios A and B?

A) positive relationship B) negative relationship C) no relationship D) steady relationship

34) Which of the following statements is true regarding the scatterplot below?

A) females tend to spend less on technology than males B) females tend to spend less on clothing than males C) males tend to spend more on clothing than females D) males tend to spend less on technology than females

35) Which visualization method will Jorge want to use if he has square footage, property value, and property type (e.g., single-family home, a condominium) data? A) bubble plot B) heat map C) line chart D) scatterplot with a categorical variable

36) The relationship between math and writing scores in public schools is _____________blank but _____________blank than private schools.

A) negative; higher B) negative; lower C) positive; higher D) positive; lower

37) Which visualization method will Jorge want to use if he wants to understand the relationships between study time, screen time, and academic performance (e.g., GPA) data? A) bubble plot B) heat map C) line chart D) scatterplot with a categorical variable

38) While birth rate and life expectancy have a _____________blank relationship, life expectancy and gross national income (size of bubble) have a _____________blank relationship.

A) negative; negative B) negative; positive C) positive; negative D) positive; positive

39) Which visualization method will Jorge want to use if he wants to understand daily sales over the course of the year? A) bubble plot B) heat map C) line chart D) scatterplot with a categorical variable

40) The following line chart shows the daily close prices for Citigroup and Overstock.com stock prices for the last 30 days. Which stock appears to be more volatile?

A) Citigroup B) Overstock.com C) Both are volatile D) Neither are volatile

41) Which visualization method will Jorge want to use if he wants to understand the relationship between number of downloads for various audiobook genres? A) bubble plot B) heat map C) line chart D) scatterplot with a categorical variable

42) Which Tees4U t-shirt color and size combination is most popular, according to their heatmap, which uses red-yellow-green in ascending order?

A) Blue XS B) Green M C) Purple M D) Yellow XL

Answer Key Test name: Chap 04_2e_Jaggia 1) FALSE An easy way to convert is by multiplying the relative frequency by 100 to create the percentage of the relative frequency. 2) TRUE 3) TRUE 4) FALSE 5) TRUE 6) TRUE A bubble plot marks the Age and Income plots and the third variable candy, represents the size of the bubble on the plot. This provides a visible size difference in how the three variables interact. 7) TRUE 8) FALSE A histogram is a series of rectangles where the width and height of each rectangle represent the interval width and frequency (or relative frequency) of the respective interval. Histograms are used for numerical variables whereas bar charts are used for categorical variables. 9) TRUE 10) FALSE Heat maps are especially useful to identify combinations of the categorical variables that have economic significance. 11) A The percentage of relative frequency is 21.65%, calculated as 0.2165 multiplied by 100. 12) A

The percentage of relative frequency is 21.48%, calculated as 0.2148 multiplied by 100. 13) B In R, the barplot function allows for the creation of a bar chart. 14) C Using the approximation formula, the interval is calculated as (66.69 − (−32.40)) ÷ 4 = 24.773. 15) C Using the approximation formula, the interval is calculated as (65.42 − (−32.4)) ÷ 5 = 19.564. 16) A Based on the frequency, the age range of 16-18 has the highest frequency and most likely the best target sampling for future game testing.
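
The interval widths in answers 14 and 15 follow from the approximation formula (maximum − minimum) ÷ number of intervals. The sketch below is one way to check that arithmetic. It assumes Python purely as a calculator alongside the Excel and R functions the questions mention, and the function name interval_width is illustrative, not from the text.

```python
def interval_width(minimum: float, maximum: float, intervals: int) -> float:
    """Approximate interval width: (maximum - minimum) / number of intervals."""
    if intervals <= 0:
        raise ValueError("intervals must be a positive integer")
    return (maximum - minimum) / intervals


# Minimum, maximum, and interval counts are taken from questions 14 and 15.
width_q14 = interval_width(-32.4, 66.69, 4)
width_q15 = interval_width(-32.4, 65.42, 5)

print(f"Question 14 width: {width_q14:.4f}")
print(f"Question 15 width: {width_q15:.4f}")
```

A negative width, as in option D of both questions, would only arise if the minimum and maximum were swapped.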

17) C The histogram is positively skewed, which means the tail extends to the right and reflects the presence of a small number of relatively large values.

18) C On a bar chart or histogram, you want the widths to be consistent because differing widths may create distortions. 19) D A bar chart, as we learned, is best for single-variable analysis, whereas for two or more variables, a stacked column chart shows a comparison of the composition in each category.

20) B The scatterplot is a linear relationship. Specifically, the points are clustered along the line with a negative slope.

21) D To determine the economic significance of combinations of categorical variables, a heat map, which shows the differences on a color scale, is the best selection. 22) B The points follow a linear path in which y increases as x increases. Thus, the relationship is positive and linear.

23) A
A bar chart depicts the frequency or the relative frequency for each category of the categorical variable as a series of horizontal or vertical bars, the lengths of which are proportional to the values that are to be depicted. As such, the frequencies of various car colors (a categorical variable) are best represented by bar charts. 24) C The relative frequency for each category equals the proportion of observations in each category. The sample size is 25+19+20+29 = 93. And the relative frequency is 29/93 = 0.3118 or approximately 31% of the individuals who rated the restaurant. 25) C The histogram is positively skewed, which means the tail extends to the right and reflects the presence of a small number of relatively large values.
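
The relative frequency in answer 24 is each category's count divided by the sample size. A minimal Python sketch of that calculation, using the counts from question 24; the variable names are illustrative only.

```python
# Frequencies from the restaurant-rating table in question 24.
ratings = {"Poor": 25, "Ok": 19, "Good": 20, "Excellent": 29}

sample_size = sum(ratings.values())

# Relative frequency = category frequency / total number of observations.
relative_frequency = {
    rating: count / sample_size for rating, count in ratings.items()
}

for rating, share in relative_frequency.items():
    print(f"{rating:<10} {share:.4f}")
```

Multiplying each relative frequency by 100 gives the corresponding percentage.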

26) D The vertical axis should not be given a very high value as an upper limit. In these instances, the data may appear compressed so that an increase (or decrease) of the data is not as apparent as it perhaps should be.

27) A Gender and phone model are categorical variables. As such, a contingency table is best since it shows the frequencies for two categorical variables whereas frequency distributions are used for a single variable (categorical or numerical), line charts depict numerical values over time and scatterplots are used for two numerical variables. 28) A Beer for men is the most popular both by frequency (142) and relative frequency 142/202 = 0.703.

29) B

Stacked column charts are used to visualize two categorical variables. Gender and restaurant ratings is the only option containing two categorical variables. 30) B Contingency tables are a tabular format used to visualize two categorical variables. Gender and restaurant ratings is the only option containing two categorical variables. 31) A The scatterplot is a linear relationship. Specifically, the points are clustered along the line with a positive slope.

32) C Scatterplots are a graph format used to visualize two numerical variables. Income and purchase price is the only option containing two numerical variables. 33) C The scatterplot shows no discernable linear relationship.

34) A In this sample, no woman spends more than $100 on technology whereas the vast majority of men spend more than $100 on technology.

35) D Jorge’s data contains two numerical variables and a categorical variable. Scatterplots are used to graph the relationship between two numerical values. By adding the categorical variable “property type”, we can use different colors or symbols to see how the relationships may differ based on type. 36) D The relationship between math and writing scores is positive for both public and private schools but is generally lower for public schools.

37) A A bubble plot shows the relationship between three numerical variables in a two-dimensional graph. The third numerical variable is represented by the size of the bubble. 38) B From the direction of the bubbles, we can see that birth rate and life expectancy have a negative relationship. However, life expectancy and gross national income have a positive relationship, which can be seen by the increasing size of the bubbles as life expectancy increases.

39) C A line chart connects the consecutive observations of a numerical variable with a line. It tends to be used to track changes of the variable over time. 40) B Over the prior 30 market days Overstock.com has been more volatile than Citigroup, which has remained more stable over the prior 30 days.

41) B A heat map uses color or color intensity to display relationships between variables. A heat map is especially useful for identifying combinations of the categorical variables that have economic significance. In this case, the genres with more downloads will have better economic performance than those that are downloaded less. 42) B The red-yellow-green in ascending order heat map in Excel will color the cell with the largest number in dark green and the smallest number in dark red. In this example, Green M is the darkest green with the largest number 122, which indicates it is the most popular color and size combination.

CHAPTER 5 TRUE/FALSE - Write 'T' if the statement is true and 'F' if the statement is false. 1) The sample space contains all probable outcomes of an experiment. ⊚ true ⊚ false

2) If 42% of interns are between the ages of 20 to 25, the complement rule dictates P(Ac) = 1 − P(A) = 1 − 0.42 = 0.58. ⊚ true ⊚ false

3) The union of two events A and B, denoted as A ∪ B, will contain outcomes of both A and B. ⊚ true ⊚ false

4) According to the total probability rule, P(A) equals the sum of P(A ∩ B) and P(A ∩ Bc), and is considered conditional on two mutually exclusive and exhaustive events independent of an experiment. ⊚ true ⊚ false

5) Bayes’ theorem is a procedure for updating probabilities based on new information. ⊚ true ⊚ false

6) A discrete random variable is denoted as distinct countable values x1, x2, x3,… ⊚ true ⊚ false

7) The expected value of the discrete random variable X, denoted simply by μ, is also referred to as the maximum of all possible values of X. ⊚ true ⊚ false

8) A result of attaching probabilities to the outcomes of a Bernoulli process is called a binomial distribution. ⊚ true ⊚ false

9) The Poisson process is satisfied only if the number of successes counted in nonoverlapping intervals is independent and is not dependent on the proportional size of an interval. ⊚ true ⊚ false

10) The Gaussian distribution is the most extensively used probability distribution in statistical work. ⊚ true ⊚ false

11) Bayes’ Theorem cannot be extended to include n mutually exclusive and exhaustive events. ⊚ true ⊚ false

MULTIPLE CHOICE - Choose the one alternative that best completes the statement or answers the question. 12) What is the probability when a personal assessment is made without referencing data? A) empirical probability B) classical probability C) subjective probability D) exclusive probability

13) The union of two events is denoted as A) A ⊂ S. B) A ∪ B. C) P(A ∪ B). D) A ∩ B.

14) Which rule is being followed when summing P(A) and P(B) then subtracting P(A ∩ B) from the sum? A) complement B) multiplication C) addition D) joint probability

15) Mutually exclusive events __________blank. A) have joint probability of zero B) contain the multiple of two probabilities C) are conditional on interest D) contain all possible experiment outcomes

16) Using conditional probability, if P(A) = 0.15, P(B) = 0.36, and P(A ∩ B) = 0.09, then P(A ∣ B) = A) 0.03. B) 4.00. C) 0.25. D) 0.80.

17) Using conditional probability, if P(A) = 0.50, P(B) = 0.20, and P(A ∩ B) = 0.15, then P(A ∣ B) = A) 0.40. B) 0.30. C) 0.75. D) 2.5.

18) In Holland, 74% of the people own a car. If four adults are randomly selected, what is the probability that none of the four have a car? A) 7.24% B) 0.46% C) 39.44% D) 0.47%

19) In Holland, 60% of the people own a car. If five adults are randomly selected, what is the probability that none of the five have a car? A) 7.8% B) 1.02% C) 40% D) 1.03%

20) In Holland, 30% of the people own a car. If five adults are randomly selected, what is the probability that no more than two own a car? A) 37.2% probability that more than two own a car. B) 30.9% probability that more than two own a car. C) 66.8% probability that no more than two own a car. D) 83.7% probability that no more than two own a car.

21) A simple event __________blank. A) contains unlimited outcomes B) contains exactly two outcomes C) contains only subsets of the outcome D) contains a single outcome

22) Michael has interviewed for two jobs. He feels that he has a 68% chance of getting an offer on Job A and a 62% chance of getting an offer on Job B. He also believes there is a 55% chance of getting an offer on both jobs. What is the probability that he receives an offer on at least one of the jobs? A) 0.60 B) 0.70 C) 0.10 D) 0.75

23) Michael has interviewed for two jobs. He feels that he has a 65% chance of getting an offer on Job A and a 45% chance of getting an offer on Job B. He also believes there is a 40% chance of getting an offer on both jobs. What is the probability that he receives an offer on at least one of the jobs? A) 0.60 B) 0.25 C) 0.10 D) 0.70

24) Michael has interviewed for two jobs. He feels that he has a 65% chance of getting an offer on job A and a 45% chance of getting an offer on job B. He also believes there is a 40% chance of getting an offer on both jobs. What is the probability that he does not get an offer at either job? A) 0.10 B) 0.30 C) 0.70 D) 0.25

25) Two events are __________blank if the occurrence of one event does not affect the probability of the occurrence of the other. A) simple B) dependent C) independent D) comparative

26) Which is not a characteristic of the normal distribution? A) It is bell-shaped. B) It is inverse. C) It is asymptotic. D) It is symmetric.

27) Simone, owner of the Blue Canoe Coffee Shop, ran a report showing the identified valued customer visits the shop on average 9 times in a 30-day period. Simone now wants to break down the information further to determine how many visits she should expect in a 10-day period from a tracked value customer. How many visits should Simone expect? A) 3 B) 5 C) 1 D) 4

28) Simone, owner of the Blue Canoe Coffee Shop, ran a report showing the identified valued customer visits the shop on average 18 times in a 30-day period. Simone now wants to break down the information further to determine how many visits she should expect in a 5-day period from a tracked value customer. How many visits should Simone expect? A) 3 B) 5 C) 1 D) 4

29) Which of the following Excel formulas will calculate the probability the percentage of returns will be greater than 17% for a balanced portfolio that historically earned 7% with a standard deviation of 10%? A) =NORM.DIST (17,7,10,FALSE) B) =NORM.DIST (17,7,10,TRUE) C) =1 − NORM.DIST (17,7,10,FALSE) D) =1 − NORM.DIST (17,7,10,TRUE)

30) Which of the following R formulas will calculate the probability the percentage of returns will be between 5% and 15% for a balanced portfolio that historically earned 7% with a standard deviation of 10%? A) pnorm(15,7,10, lower.tail=TRUE) - pnorm(5,7,10, lower.tail=TRUE) B) pnorm(5,7,10, lower.tail=TRUE) - pnorm(15,7,10, lower.tail=TRUE) C) pnorm(15,7,10, lower.tail=FALSE) - pnorm(5,7,10, lower.tail=FALSE) D) pnorm(5,7,10, lower.tail=FALSE) - pnorm(15,7,10, lower.tail=FALSE)

31) The Daytona 500 runs 40 race cars. Of the 40, 19 cars crashed. This is a probability of 0.475 that a car will crash in the race. This is an example of which probability? A) subjective B) empirical C) classical D) random

32) Which is the best probability to determine the outcome of rolling seven with two dice? A) subjective B) empirical C) classical D) random

33) Which one does not satisfy the Poisson process? A) Success is presented as an integer between zero and infinity. B) Number of successes counted in nonoverlapping intervals are independent. C) The interval is the same for probability failure as in success in exceeding the size of the interval. D) Probability of success in an interval is the same for all intervals of equal size and proportionality.

34) Events are considered __________blank if they include all outcomes in the sample space. A) posterior B) exhaustive C) a sample D) a rule

35) The intersection of two events is denoted as __________blank. A) A ∩ B B) A ∪ B C) P (A ∪ B)² D) P (Ac) = 1 − P(A)

36) Marketing analysis determined 47% of females between the ages of 25 and 34 years search for green technology and practice being green, as compared to 35% of men in the same age group. What is the probability that a randomly selected woman between the age of 25 and 34 does not search for green technology? A) 12% probability B) 68% probability C) 53% probability D) 44% probability

37) Marketing analysis determined 44% of females between the ages of 25 and 34 years search for green technology and practice being green, as compared to 32% of men in the same age group. What is the probability that a randomly selected woman between the age of 25 and 34 does not search for green technology? A) 12% probability B) 68% probability C) 56% probability D) 44% probability

38) Marketing analysis determined 52% of females between the ages of 25 to 34 years old search for green technology and practice being green, as compared to 35% of men in the same age group. What is the probability that a randomly selected man between the age of 25 and 34 does not search for green technology? A) 56% probability B) 65% probability C) 12% probability D) 68% probability

39) Marketing analysis determined 44% of females between the ages of 25 to 34 years old search for green technology and practice being green, as compared to 32% of men in the same age group. What is the probability that a randomly selected man between the age of 25 and 34 does not search for green technology? A) 44% probability B) 68% probability C) 32% probability D) 56% probability

40) The __________blank of the discrete random variable X, denoted by E(X), or simply μ, is a weighted average of all possible values of X. A) summary value B) corresponding value C) random value D) expected value

41) Howard Simpson at Organics Central Market ran a report showing the identified valued customer visits the market on average 36 times in a 60-day period. Howard now wants to break down the information further to determine how many visits he should expect in a 5-day period from a tracked value customer. What is the probability of a valued customer visiting all 5 days? A) 10% B) 8% C) 4% D) 9%

42) Howard Simpson at Organics Central Market ran a report showing the identified valued customer visits the market on average 18 times in a 30-day period. Howard now wants to break down the information further to determine how many visits he should expect in a 5-day period from a tracked value customer. What is the probability of a valued customer visiting all 5 days? A) 10% B) 8% C) 4% D) 9%

43) Which theorem allows the posterior probability to be found using the prior probability and conditional probabilities? A) Fisher B) Poisson C) Bernoulli D) Bayes’

44) Andrea decided her job opportunities will increase conditional on completing her bachelor’s degree. Based on her assumption, what probability would best fit? A) classical probability B) conditional probability C) complement rule D) empirical probability

45) Based on the provided table, the expected employee bonus is 7.85 or 7,850. What are the variance and the standard deviation of the annual bonus amount?

xi       P(X = xi)    xi P(X = xi)    (xi − µ)² P(X = xi)
7        0.50         3.50            ?
5        0.45         2.25            ?
6        0.35         2.10            ?
Total                 7.85            ?

A) σ = √9.36 = 3.059 B) σ = √4.95 = 2.225 C) σ = √9.85 = 3.138 D) σ = √5.21 = 2.283

46) Based on the provided table, the expected employee bonus is 4.95 or 4,950. What are the variance and the standard deviation of the annual bonus amount?

xi       P(X = xi)    xi P(X = xi)    (xi − µ)² P(X = xi)
8        0.40         3.20            ?
5        0.35         1.75            ?
0        0.25         0               ?
Total                 4.95            ?

47) Tiffany Ham’s business is thriving in Houston, TX. To reward her team, Tiffany is implementing a performance incentive program. Annual bonuses are $5,000 for excellent performance, $3,000 for good performance, $1,500 for fair performance, and $0 for poor performance. The probability levels are 0.20, 0.30, 0.15, and 0.30, respectively. What is the expected value of the annual bonus amount for an employee? A) $1,875 B) $2,325 C) $2,125 D) $2,375

48) Tiffany Ham’s business is thriving in Houston, TX. To reward her team, Tiffany is implementing a performance incentive program. Annual bonuses are $5,000 for excellent performance, $3,000 for good performance, $1,500 for fair performance, and $0 for poor performance. The probability levels are 0.15, 0.40, 0.25, and 0.20, respectively. What is the expected value of the annual bonus amount for an employee? A) $1,875 B) $2,300 C) $2,325 D) $2,375

49) In reviewing retirement portfolios, Kim determined the probability of a client owning stock is 0.50 and the probability of owning a bond is 0.20. The probability of a customer who owns bonds already owning stock is 0.60. What is the probability a client owns both securities in their retirement portfolio? A) 0.52 B) 0.40 C) 0.39 D) 0.30

50) In reviewing retirement portfolios, Kim determined the probability of a client owning stock is 0.70 and the probability of owning a bond is 0.40. The probability of a customer who owns bonds already owning stock is 0.55. What is the probability a client owns both securities in their retirement portfolio? A) 0.52 B) 0.30 C) 0.40 D) 0.39

51) Alison has been hired to sell two different homes on the same street that are two houses apart. She predicts that Home A has a 72% chance of selling in the first week of being listed, whereas Home B is in lesser condition and has a 26% probability. There is also a 14% chance that neither home will sell in the first week of being listed. What is the probability that Home A does not sell given that Home B does not sell due to its poor condition? A) 0.667 B) 0.189 C) 0.250 D) 0.700

52) Alison has been hired to sell two different homes on the same street that are two houses apart. She predicts that Home A has a 75% chance of selling in the first week of being listed, whereas Home B is in lesser condition and has a 30% probability. There is also a 20% chance that neither home will sell in the first week of being listed. What is the probability that Home A does not sell given that Home B does not sell due to its poor condition? A) 0.267 B) 0.286 C) 0.250 D) 0.700

53) Find P(A|B) using the following table of probabilities.

Prior Probability    Conditional Probability    Joint Probability
P(A) = 0.70          P(B|A) = 0.90              P(B∩A) = 0.630
P(Ac) = 0.30         P(B|Ac) = 0.05             P(B∩Ac) = 0.015

A) 0.397 B) 0.645 C) 0.950 D) 0.977

54) Discrete random variables are NOT associated with which of the following items? A) a probability that the random variable X assumes a particular value x B) the sum of all probabilities is equal to or greater than 1 C) can be defined in terms of their cumulative distribution function D) a probability that each value of x is 0 ≤ P(X = xi) ≤ 1

55) In 2020, 63% of high school graduates were enrolled in colleges or universities. If five high school graduates are randomly selected, what is the probability that no more than three are enrolled in college or university? A) 39.1% B) 34.2% C) 60.9% D) 63.0%

56) In 2020, 63% of high school graduates were enrolled in colleges or universities. If five high school graduates are randomly selected, what is the probability that exactly three are enrolled in college or university? A) 20.1% B) 34.2% C) 60.9% D) 63.0%

57) How many defects should Umair at East Side Manufacturing expect out of 20 manufactured items if the plant usually identifies 3 defects in every 15 items manufactured? A) 3 B) 4 C) 5 D) 6

Answer Key Test name: Chap 05_2e_Jaggia 1) TRUE All possible outcomes are contained in the sample space for an experiment. 2) TRUE The complement rule is P(Ac) = 1 − P(A). 3) TRUE The union will contain all outcomes of A and B. 4) FALSE The total probability rule expresses the probability of an event in terms of joint or conditional probability that must relate to an experiment, not independent of one. 5) TRUE Bayes’ theorem is a procedure for updating probabilities based on new information. 6) TRUE A discrete random variable assumes a countable number of distinct values. 7) FALSE The expected value is also referred to as the mean or weighted average of X. 8) TRUE Attaching probabilities to the Bernoulli process results is called binomial distribution. 9) FALSE

For the Poisson process to be satisfied, three conditions must be met, one of which is that the probability of success in any interval be the same for all intervals of equal size and proportional to the size of the interval. 10) TRUE Gaussian is another name for the normal distribution, which is the most extensively used probability distribution in statistical work. 11) FALSE Bayes’ theorem is based on two mutually exclusive and exhaustive events, namely, B and Bc, but it can easily be extended to include n mutually exclusive and exhaustive events, B1, B2, . . . , Bn. 12) C Subjective probability is assigning a probability without referencing any data. 13) B A ∪ B denotes the union of events. 14) C The addition rule states that the probability that Event A or Event B occurs is derived as P(A ∪ B) = P(A) + P(B) − P(A ∩ B). 15) A For mutually exclusive events, P(A ∩ B) = 0, so the probability of the union is simply the sum of the two probabilities. 16) C P(A ∣ B) = P(A ∩ B) ÷ P(B); P(A ∣ B) = 0.09 ÷ 0.36 = 0.25. 17) C P(A ∣ B) = P(A ∩ B) ÷ P(B); P(A ∣ B) = 0.15 ÷ 0.20 = 0.75. 18) B Using the Bernoulli process, the probability of success (having a car) is p = 0.74 and the probability of failure (not having a car) is 1 − p = 1 − 0.74 = 0.26. The probability that none of the four people has a car corresponds to x = 0, thus: P(X = 0) = (0.26)⁴ = 0.0046, or 0.46%.

19) B Using the Bernoulli process, the probability of success (having a car) is p = 0.60 and the probability of failure (not having a car) is 1 − p = 1 − 0.60 = 0.40. The probability that none of the five people has a car corresponds to x = 0, thus: P(X = 0) = (0.40)⁵ = 0.0102, or 1.02%.

20) D With n = 5 and p = 0.30, P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2) = 0.168 + 0.360 + 0.309 = 0.837, or 83.7%.
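
The binomial probabilities behind answers 18 through 20 can be checked with a few lines of code. The following is a sketch in Python using only the standard library; the helper names binom_pmf and binom_cdf are illustrative and are not taken from the text, which works in Excel and R.

```python
from math import comb


def binom_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binom_cdf(x: int, n: int, p: float) -> float:
    """P(X <= x), obtained by summing the probability mass function from 0 to x."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))


# Question 18: four adults, 74% own a car, none of the four owns one.
p_q18 = binom_pmf(0, 4, 0.74)
# Question 19: five adults, 60% own a car, none of the five owns one.
p_q19 = binom_pmf(0, 5, 0.60)
# Question 20: five adults, 30% own a car, no more than two own one.
p_q20 = binom_cdf(2, 5, 0.30)

print(f"Q18: {p_q18:.4f}  Q19: {p_q19:.4f}  Q20: {p_q20:.4f}")
```

The same two helpers cover any later binomial question in this chapter, since only n, p, and the target count change.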

21) D It is called a simple event if it contains a single outcome. 22) D P(A) = 0.68; P(B) = 0.62; P(A ∩ B) = 0.55. Use the addition rule. P(A ∪ B) = 0.68 + 0.62 − 0.55 = 0.75. 23) D P(A) = 0.65; P(B) = 0.45; P(A ∩ B) = 0.40. Use the addition rule. P(A ∪ B) = 0.65 + 0.45 − 0.40 = 0.70. 24) B Use the addition rule, then take the complement of the union. P(A ∪ B) = 0.65 + 0.45 − 0.40 = 0.70, so P((A ∪ B)c) = 1 − P(A ∪ B) = 1 − 0.70 = 0.30. 25) C Two events are independent if the occurrence of one event does not affect the probability of the occurrence of the other event. 26) B The normal distribution is bell-shaped, symmetric, and asymptotic. “Inverse” refers to the inverse transformation that converts Z to X to produce a corresponding value; it is not a characteristic of the distribution. 27) A μ30 = 9; 9 visits per 30 days breaks down to 3 visits per 10 days. Thus, μ10 = 3. 28) A μ30 = 18; 18 visits per 30 days breaks down to 3 visits per 5 days. Thus, μ5 = 3. 29) D

P(X > 17) = 1 − P(X < 17). In Excel: =1 − NORM.DIST(X,μ,σ,TRUE), that is, =1 − NORM.DIST(17,7,10,TRUE). 30) A P(5 < X < 15) = P(X < 15) − P(X < 5). In R: pnorm(15,7,10, lower.tail=TRUE) - pnorm(5,7,10, lower.tail=TRUE) 31) B The event is observed with relative frequency of occurrence. The empirical probability of a crash occurring is 19/40. 32) C Classical probability is the best fit to determine roll outcomes in a game of chance. 33) C The probability of success in any interval is the same for all intervals of equal size and is proportional to the size of the interval. Thus, exceeding is the incorrect option. 34) B Events are considered exhaustive if they include all outcomes in the sample space. 35) A A ∩ B denotes the intersection of events. 36) C P(A) = 0.47, so P(Ac) = 1 − P(A) = 1 − 0.47 = 0.53 using the complement rule. 37) C P(A) = 0.44, so P(Ac) = 1 − P(A) = 1 − 0.44 = 0.56 using the complement rule. 38) B
P(B) = 0.35; so P(Bc) = 1 − P(B) = 1 − 0.35 = 0.65 using the complement rule. 39) B P(B) = 0.32; so P(Bc) = 1 − P(B) = 1 − 0.32 = 0.68 using the complement rule. 40) D The expected value is also referred to as the mean. 41) A 36 visits per 60 days breaks down to 3 visits per 5 days, so μ5 = 3. Using the Poisson distribution, P(X = 5) = e⁻³ × 3⁵ ÷ 5! = 0.1008. There is a 10% probability a customer will visit all five days.

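42) A 18 visits per 30 days breaks down to 3 visits per 5 days, so μ5 = 3. Using the Poisson distribution, P(X = 5) = e⁻³ × 3⁵ ÷ 5! = 0.1008. There is a 10% probability a customer will visit all five days.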
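
Questions 41 and 42 rescale a Poisson mean to a shorter interval and then evaluate the probability of five visits. A minimal Python sketch of both steps follows; the function names are illustrative, and the reading of "visiting all 5 days" as X = 5 visits in the 5-day interval is an assumption consistent with the 10% answer in the key.

```python
from math import exp, factorial


def scaled_mean(visits: float, period_days: float, target_days: float) -> float:
    """Rescale a Poisson mean proportionally to a different interval length."""
    return visits * target_days / period_days


def poisson_pmf(x: int, mu: float) -> float:
    """P(X = x) for a Poisson random variable with mean mu."""
    return exp(-mu) * mu**x / factorial(x)


# Question 41: 36 visits per 60 days; question 42: 18 visits per 30 days.
mu_q41 = scaled_mean(36, 60, 5)
mu_q42 = scaled_mean(18, 30, 5)

p_q41 = poisson_pmf(5, mu_q41)
p_q42 = poisson_pmf(5, mu_q42)

print(f"Q41: mean = {mu_q41}, P(X = 5) = {p_q41:.4f}")
print(f"Q42: mean = {mu_q42}, P(X = 5) = {p_q42:.4f}")
```

Both questions reduce to the same 5-day mean, which is why they share an answer.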

43) D Bayes’ Theorem says the posterior probability P(B|A) can be found using the information on the prior probability P(B), along with the conditional probabilities P(A|B) and P(A|Bc). 44) B In business applications, the probability of interest is often a conditional probability. 45) D

xi       P(X = xi)    xi P(X = xi)    (xi − µ)² P(X = xi)
7        0.50         3.50            (7 − 7.85)² × 0.50 = 0.36
5        0.45         2.25            (5 − 7.85)² × 0.45 = 3.65
6        0.35         2.10            (6 − 7.85)² × 0.35 = 1.20
Total                 7.85            5.21

The variance is σ² = 5.21 and the standard deviation is σ = √5.21 = 2.283.

46) D
xi | P(X = xi) | xi P(X = xi) | (xi − µ)2 P(X = xi)
8 | 0.40 | 3.20 | (8 − 4.95)2 × 0.40 = 3.72
5 | 0.35 | 1.75 | (5 − 4.95)2 × 0.35 = 0.001
0 | 0.25 | 0 | (0 − 4.95)2 × 0.25 = 6.13
Total | | 4.95 | 9.85
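
The expected value and variance calculations in answers 45 and 46 follow the same two sums. A minimal Python sketch, using the outcomes and probabilities from answer 46 (the function names are hypothetical, chosen only for this illustration):

```python
# Outcomes and probabilities from answer 46.
outcomes = [8, 5, 0]
probabilities = [0.40, 0.35, 0.25]


def expected_value(xs, ps):
    """Expected value: the sum of x_i * P(X = x_i)."""
    return sum(x * p for x, p in zip(xs, ps))


def variance(xs, ps):
    """Variance: the sum of (x_i - mu)^2 * P(X = x_i)."""
    mu = expected_value(xs, ps)
    return sum((x - mu) ** 2 * p for x, p in zip(xs, ps))


mu = expected_value(outcomes, probabilities)
var = variance(outcomes, probabilities)
print(round(mu, 2), round(var, 2))
```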



47) C
xi | P(X = xi) | xi P(X = xi)
5,000 | 0.20 | 5,000 × 0.20 = 1,000
3,000 | 0.30 | 3,000 × 0.30 = 900
1,500 | 0.15 | 1,500 × 0.15 = 225
0 | 0.30 | 0 × 0.30 = 0
Total | | $2,125

48) C
xi | P(X = xi) | xi P(X = xi)
5,000 | 0.15 | 5,000 × 0.15 = 750
3,000 | 0.40 | 3,000 × 0.40 = 1,200
1,500 | 0.25 | 1,500 × 0.25 = 375
0 | 0.20 | 0 × 0.20 = 0
Total | | $2,325

49) D Use the multiplication rule (joint probability): P(S ∩ B) = P(B|S)P(S) = 0.60 × 0.50 = 0.300.
50) D Use the multiplication rule (joint probability): P(S ∩ B) = P(B|S)P(S) = 0.55 × 0.70 = 0.385.
51) B Use the conditional probability rule: P(A|B) = P(A ∩ B) / P(B).

52) B Use the conditional probability rule: P(A|B) = P(A ∩ B) / P(B).
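
The multiplication rule and the conditional probability rule used in answers 49 through 52 are inverses of one another. A short Python sketch under that reading, with hypothetical function names and the numbers from answers 49 and 50:

```python
def joint_probability(p_b_given_s, p_s):
    """Multiplication rule: P(S and B) = P(B|S) * P(S)."""
    return p_b_given_s * p_s


def conditional_probability(p_joint, p_condition):
    """Conditional probability rule: P(A|B) = P(A and B) / P(B)."""
    return p_joint / p_condition


# Answer 49: P(B|S) = 0.60 and P(S) = 0.50
joint_49 = joint_probability(0.60, 0.50)

# Answer 50: P(B|S) = 0.55 and P(S) = 0.70
joint_50 = joint_probability(0.55, 0.70)

# Dividing the joint probability by P(S) recovers the conditional probability.
recovered = conditional_probability(joint_49, 0.50)
print(round(joint_49, 3), round(joint_50, 3), round(recovered, 2))
```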

53) D To find P(B|A) we first must find P(B), which is the sum of the joint probabilities (0.63 + 0.015 = 0.645). Next, we use Bayes’ Theorem P(B|A) =

54) B The probabilities of a discrete random variable sum to exactly 1; the sum cannot exceed 1.



55) C P(X = 0) =

56) B P(X = 3) =

57) B Given the rate of 3 defects over 15 items, the mean for 15 items is μ15 = 3. The defect rate is 3/15 = 0.20 defects per item, so the proportional mean for 20 items is μ20 = 0.20 defects/item × 20 items = 4 defects.



CHAPTER 6
TRUE/FALSE - Write 'T' if the statement is true and 'F' if the statement is false.
1) Many samples can be drawn from a population, but there is only one overall population. ⊚ true ⊚ false

2) The value of the sample mean will remain static even when the data set from the population is changed. ⊚ true ⊚ false

3) The sampling distribution of P̄ is closely related to the normal distribution. ⊚ true ⊚ false

4) In general, a normal distribution approximation is justified for the sample proportion when np ≥ 5 and n (1 − p) ≥ 5. ⊚ true ⊚ false

5) The more degrees of freedom, the broader the tails of the distribution. ⊚ true ⊚ false

6) With a df = 20, and α = 0.05 and a referencing table value of 1.74742, the upper tail suggests that P(T20 ≥ 1.74742) = 0.05. ⊚ true ⊚ false

7) The population proportion p is the essential measure for a quantitative variable. ⊚ true ⊚ false



8) An alternative hypothesis, denoted HA , is defined as the contradiction of the default state or status quo. ⊚ true ⊚ false

9) When the p-value is greater than alpha, then the null hypothesis is rejected. ⊚ true ⊚ false

10) When computing the test statistic for p, the formula is only valid if P̄ (approximately) follows a normal distribution. ⊚ true ⊚ false

11) Samples of men's and women's salaries are independent random samples if the process that generates one sample is completely separate from the process that generates the other sample. ⊚ true ⊚ false

12) When testing two population means based on samples that we believe are not independent, one option is to use unmatched-pairs sample testing. ⊚ true ⊚ false

MULTIPLE CHOICE - Choose the one alternative that best completes the statement or answers the question.
13) If the average derived from a specific sample is $43,000, then x̄ = 43,000 is the _____________blank of the population mean.



A) deviation B) estimate C) point D) static point

14) In a population set, 20 students of the 149 observations received a failing grade in the Accounting Principles course. What is the estimate of the sample proportion of failure in the course? A) 0.1342 B) 7.450 C) 0.7199 D) 0.0134

15) In a population set, 13 students of the 132 observations received a failing grade in the Accounting Principles course. What is the estimate of the sample proportion of failure in the course? A) 0.0985 B) 10.154 C) 0.6842 D) 0.0099

16) Cameron is the quality inspector for the Mason Pot Company, an artisan cooperative crafting bowls. In the smaller series, the mean diameter is 7 inches with a standard deviation of 0.3 inch. First, find the expected value. Then answer: what is the standard error of the sample mean derived from a random sample of 11 bowls? A) 0.090 B) 0.9 C) 0.001 D) 0.11



17) Cameron is the quality inspector for the Mason Pot Company, an artisan cooperative crafting bowls. In the smaller series, the mean diameter is 4 inches with a standard deviation of 0.2 inch. First, find the expected value. Then answer: what is the standard error of the sample mean derived from a random sample of 12 bowls? A) 0.058 B) 0.58 C) 0.001 D) 0.12

18) Megan ran an analysis with a population standard deviation σ = 0.90. When reviewing sample sizes of 4 and 8, the standard error of the sample mean was se(X̄) = 0.45 and se(X̄) = 0.32, respectively. By comparison, what do the expected value and the standard error of the sample mean reflect? A) By comparison, the results confirm the sample is normally distributed with increased variability. B) The standard error for the sample mean is lower, confirming the initial population standard deviation is incorrect. C) By comparing, the results confirm that averaging reduces variability. D) The standard error for the sample mean is lower, increasing variability in the sample set.

19) The Central Limit Theorem, by definition, implies that as the sample size increases, the A) average of the large number of independent observations has an approximate normal distribution. B) average of the large number of independent observations reflects the leftmost tail of the distribution. C) expected underlying distribution is only justified when n ≤ 30. D) expected normal distribution declines.

20) What is the expected value of the sample proportion?



A) number of success B) sample population C) standard error D) population proportion

21) CNBC (2018) reported 67% of households are using streaming services compared to standard cable. What are the expected value and the standard error of the sample proportion derived from a random sample of 300? A) E(P̄) = 0.33 and se(P̄) = 0.0271 B) E(P̄) = 0.0271 and se(P̄) = 0.33 C) E(P̄) = 0.67 and se(P̄) = 0.0271 D) E(P̄) = 0.8515 and se(P̄) = 0.0391

22) CNBC (2018) reported 58% of households are using streaming services compared to standard cable. What are the expected value and the standard error of the sample proportion derived from a random sample of 500? A) E(P̄) = 0.42 and se(P̄) = 0.0221 B) E(P̄) = 0.0221 and se(P̄) = 0.42 C) E(P̄) = 0.58 and se(P̄) = 0.0221 D) E(P̄) = 0.7615 and se(P̄) = 0.0341

23) A range of values used to estimate a population parameter of interest is called A) estimated error. B) confidence interval. C) standard error. D) probability of success.



24) In a meeting, Matt estimates the average miles per gallon (mpg) of the fleet cars is 36 mpg. With a certain level of confidence, he announces the actual mpg is in the 35 to 37 mpg range. What is the Margin of Error? A) 2 mpg B) 0.50 mpg C) 36 mpg D) 1 mpg

25) In a meeting, Matt estimates the average miles per gallon (mpg) of the fleet cars is 24 mpg. With a certain level of confidence, he announces the actual mpg is in the 22 to 26 mpg range. What is the Margin of Error? A) 4 mpg B) 0.50 mpg C) 24 mpg D) 2 mpg

26) If the confidence coefficient is 0.64, what is the implied probability of error α? A) 0.36 B) 0.64 C) 1 D) 0.5

27) If the confidence coefficient is 0.85, what is the implied probability of error α? A) 0.15 B) 0.85 C) 1 D) 0.05



28) In a sample of 24, the mean is x̄ = 92.62 with a standard deviation of s = 9.26. Assuming the results follow a normal distribution, construct the 90% confidence interval for the population mean. The t table value is 1.714. A) 90% confidence interval is 92.62 ± 2.94. B) 90% confidence interval is 95.86. C) 90% confidence interval is between 89.38 and 95.86. D) 90% confidence interval is 89.68 ± 1.714.

29) In a sample of 25, the mean is x̄ = 86.92 with a standard deviation of s = 9.87. Assuming the results follow a normal distribution, construct the 90% confidence interval for the population mean. The t table value is 1.711. A) 90% confidence interval is 86.92 ± 3.68. B) 90% confidence interval is 90.30. C) 90% confidence interval is between 83.54 and 90.30. D) 90% confidence interval is 86.82 ± 1.711.

30) In a sample of 33, the mean is x̄ = 89.30 with a standard deviation of s = 10.02. Assuming the results follow a normal distribution, construct the 90% confidence interval for the population mean. The t table value is 1.694. A) 90% confidence interval is 89.30 ± 2.95. B) 90% confidence interval is 2.95. C) 90% confidence interval is between 86.18 and 92.42. D) 90% confidence interval is 86.35 ± 1.69.

31) In a sample of 60, the mean is x̄ = 76.29 with a standard deviation of s = 7. Assuming the results follow a normal distribution, construct the 90% confidence interval for the population mean. The t table value is 1.671.



A) 90% confidence interval is 76.29 ± 1.51. B) 90% confidence interval is 77.97. C) 90% confidence interval is between 74.61 and 77.97. D) 90% confidence interval is 76.29 ± 1.684.

32) In a sample of 19 with x̄ = 81 and a standard deviation of s = 7, the standard error of the sample mean is A) 0.684. B) 1.606. C) 1.725. D) 2.717.

33) In a sample of 9 with x̄ = 76 and a standard deviation of s = 5, the standard error of the sample mean is A) 0.745. B) 1.667. C) 1.786. D) 2.778.

34) In a sample of 40, 14 were found to have over 200 Blu-Ray Discs in one location. Construct the point estimate for the population proportion. A) 0.70 B) 0.10 C) 0.18 D) 0.35

35) In a sample of 30, 9 were found to have over 200 Blu-Ray Discs in one location. Construct the point estimate for the population proportion.



A) 0.60 B) 0.045 C) 0.15 D) 0.30

36) In a sample of 34 iPhones, 21 had over 94 apps downloaded. Construct a 90% confidence interval for the population proportion of all iPhones that obtain over 94 apps. Assume z0.05 = 1.645. A) 0.62 ± 0.14 B) 0.62 ± 0.09 C) 0.36 ± 0.13 D) 0.36 ± 0.14

37) In a sample of 25 iPhones, 12 had over 85 apps downloaded. Construct a 90% confidence interval for the population proportion of all iPhones that obtain over 85 apps. Assume z0.05 = 1.645. A) 0.48 ± 0.16 B) 0.48 ± 0.09 C) 0.29 ± 0.15 D) 0.29 ± 0.16

38) In a sample of 62 iPhones, 16 had over 86 apps downloaded. Construct a 99% confidence interval for the population proportion of all iPhones that obtain over 86 apps. Assume z0.005 = 2.576. A) 0.16 ± 0.14 B) 0.16 ± 0.09 C) 0.26 ± 0.13 D) 0.26 ± 0.14



39) In a sample of 50 iPhones, 13 had over 100 apps downloaded. Construct a 99% confidence interval for the population proportion of all iPhones that obtain over 100 apps. Assume z0.005 = 2.576. A) 0.13 ± 0.16 B) 0.13 ± 0.09 C) 0.26 ± 0.15 D) 0.26 ± 0.16

40) A Type II error occurs when A) the null hypothesis is rejected when the null hypothesis is true. B) the null hypothesis is rejected, proving the hypothesis is true. C) the null hypothesis is not rejected when the null hypothesis is false. D) the null hypothesis is not rejected when the null hypothesis is true.

41) According to the National Retail Federation, the average shopper will spend $1,007.24 during the holiday shopping season. What are the null and alternative hypotheses? A) Sample Population is needed to complete the hypothesis. B) H0: μ ≠ 1007.24; HA: μ = 1007.24 C) H0: μ ≥ 1007.24; HA: μ ≤ 1007.24 D) H0: μ = 1007.24; HA: μ ≠ 1007.24

42) Peter provided an analysis for the building of a multi-unit retail strip for Corning Construction using the following hypotheses: H0: Do not build the strip mall; HA: Build the strip mall. Based on his calculations of the area and the probabilities for growth, he shows the project will be profitable. The client built the strip mall, but it operated at a loss. What type of error would this represent? A) Type I error because the building was built. B) Type II error because the building was built. C) Type I error because the building was built, but it was not profitable. D) Type II error because the building was built and was profitable.



43) Using R, which of the following is used to denote the hypothesized value of the mean? A) alternative B) option conf. C) t.test D) mu

44) In a sample of 490 students, 78% prefer online resources versus printed materials. The standard error for the sample proportion is A) 0.001. B) 0.019. C) 0.015. D) 0.039.

45) In a sample of 500 students, 80% prefer online resources versus printed materials. The standard error for the sample proportion is A) 0.002. B) 0.018. C) 0.014. D) 0.040.

46) Which one of the following is not a characteristic of the t distribution? A) The t distribution is defined by the degrees of freedom. B) The t distribution is bell-shaped. C) The t distribution is symmetric around zero. D) The t distribution is commonly known as the population mean.



47) According to the United States Census Bureau, the average single-family home constructed in 2016 was 2,438 square feet. Carmen wondered if families in her area still want a smaller home in comparison to the average. She randomly selected 45 people and asked what the ideal square footage of a home is for them. From the responses, she calculated the sample mean equals 1,550 square feet with a sample standard deviation of 54 square feet. What are the competing hypotheses? A) H0:μ ≠ 2,438; HA : μ = 2,438 B) H0:μ ≠ 1,550; HA : μ = 1,550 C) H0:μ ≥ 2,438; HA : μ < 2,438 D) H0:μ = 1,550; HA : μ ≠ 1,550

48) According to the United States Census Bureau, the average single-family home constructed in 2016 was 2,462 square feet. Carmen wondered if families in her area still want a smaller home in comparison to the average. She randomly selected 41 people and asked what the ideal square footage of a home is for them. From the responses, she calculated the sample mean equals 1,370 square feet with a sample standard deviation of 45 square feet. Calculate the value of the test statistic. A) −155.38 B) 155.61 C) −2.23 D) 1.10

49) According to the United States Census Bureau, the average single-family home constructed in 2016 was 2,438 square feet. Carmen wondered if families in her area still want a smaller home in comparison to the average. She randomly selected 45 people and asked what the ideal square footage of a home is for them. From the responses, she calculated the sample mean equals 1,550 square feet with a sample standard deviation of 54 square feet. Calculate the value of the test statistic. A) −110.31 B) 110.54 C) −2.45 D) 1.32



50) If t34 = −4.322 and α = 0.05, what is the approximate p-value for a left-tailed test? A) P(T34 ≤ −4.322) < 0.005. B) P(T34 ≤ −4.322) < 0.05. C) P(T34 ≥ −4.322) < 0.05. D) P(T34 ≥ 4.322) < 0.50.

51) What does α = 0.01 reflect? A) There is a 1% chance of rejecting a true null hypothesis. B) There is a 1% chance of accepting a true null hypothesis. C) Without an identified μ, there is no significant meaning. D) The 1% is the lowest significance level.

52) In a World Atlas study, 10% of people have blue eye color. Lane decided to observe 31 people and she concluded 6 people had blue eyes. Calculate the z-score. A) 0.1935 B) 0.0940 C) 1.7362 D) 0.8951

53) In a World Atlas study, 10% of people have blue eye color. Lane decided to observe 35 people and she concluded 5 people had blue eyes. Calculate the z-score. A) 0.1428 B) 0.0430 C) 0.8452 D) 0.0041



54) In a 2019 Quinnipiac University poll of registered voters, 52% oppose making all U.S. public colleges free. The Glangariff Group in Michigan collected data from 600 voters, where 342 support a taxpayer-funded free college program. What are the competing hypotheses? A) H0:p = 0.52; HA:p ≠ 0.52 B) H0 = 0.51; HA ≠ 0.51 C) H0:p = 0.51; HA:p ≠ 0.51 D) H0:p = 0.57; HA:p ≠ 0.57

55) In a 2019 Quinnipiac University poll of registered voters, 54% oppose making all U.S public colleges free. The Glangariff Group in Michigan collected data from 490 voters, where 295 support a taxpayer-funded free college program. Calculate the value of the test statistic. A) 2.40 B) 2.76 C) 12.33 D) 2.35

56) In a 2019 Quinnipiac University poll of registered voters, 52% oppose making all US public colleges free. The Glangariff Group in Michigan collected data from 600 voters, where 342 support a taxpayer-funded free college program. Calculate the value of the test statistic. A) 2.09 B) 2.45 C) 12.02 D) 2.04

57) A p-value of 0.07698 was reported for a study with a 10% significance level. Should the null hypothesis be accepted or rejected? A) Accept the null hypothesis because the p-value = 0.07698 is part of α = 0.10 B) Accept the null hypothesis because the p-value = 0.07698 is less than α = 0.10 C) Reject the null hypothesis because the p-value = 0.07698 is less than α = 0.10 D) Not enough information to determine.



58) Which of the following represents the null hypothesis to test whether the health insurance premiums of men (μm) and women (μw) differ from each other? A) H0: μm − μw = 0 B) H0: μm − μw ≠ 0 C) H0: μm − μw < 0 D) H0: μm − μw > 0

59) Jackie wanted to know if there was any difference between low-carb and low-fat diets. Jackie ran a t-test to test the hypothesis for H0: μLC − μLF = 0 using experiment data she received from her nutrition professor. Given the results below, which of the statements is not true?

t-Test: Two-Sample Assuming Unequal Variances
 | Low-carb | Low-fat
Mean | 9.33 | 6.60
Variance | 2.9106 | 2.1552
Observations | 20 | 20
Hypothesized Mean Difference | 0 |
df | 37 |
t Stat | 5.4343 |
P(T<=t) one-tail | 0.0000 |
t Critical one-tail | 1.6871 |
P(T<=t) two-tail | 0.0000 |
t Critical two-tail | 2.0262 |

A) At the α = 0.05 level, H0 is rejected. B) The test statistic value is 5.4343. C) There is no difference between diets. D) The two-tail test is used for this analysis.

60) Buster believes the new manufacturing process will result in assembly times that are one-half minute faster than the previous process. Running 35 widgets through the old process and 30 widgets through the new process, Buster found the mean assembly time to be 32.6571 minutes (s.d. 1.924) and 32.1667 minutes (s.d. 1.949), respectively. Calculate the t-statistic for testing μnew − μold. Assume the population variances are unequal.



A) −2.055 B) −1.017 C) −0.019 D) 1.017

61) Buster believes the new manufacturing process will result in assembly times that are faster than the previous process. Running 35 widgets through the old process and 30 widgets through the new process, Buster found the mean assembly time to be 32.6571 minutes (s.d. 1.924) and 32.1667 minutes (s.d. 1.949), respectively. Calculate the t-statistic for testing μold − μnew. Assume the population variances are unequal. A) −2.055 B) −1.017 C) −0.019 D) 1.017

62) In a 2019 Quinnipiac University poll of registered voters, 52% oppose making all U.S. public colleges free. The Glangariff Group in Michigan collected data from 600 voters, where 342 support a taxpayer-funded free college program. What is the null hypothesis to test whether these two proportions are different from each other at the α = 0.05 level? A) H0: μD ≠ 0 B) H0: μD < 0 C) H0: μD > 0 D) H0: μD = 0

63) Calculate the test statistic used to determine whether the mean difference between monthly estimated sales and actual sales for the last three years (36 months) differs from zero, given the following information.

 | Estimated | Actual | Difference
Average Sales | 75972.2222 | 75725.0833 | 247.1389
Standard Deviation | 15882.5801 | 15974.4273 | 1997.9809
Sample size | 36 | 36 | 36

A) −0.742 B) −0.066 C) 0.066 D) 0.742



Answer Key
Test name: Chap 06_2e_Jaggia
1) TRUE There is only one population from which samples are drawn.
2) FALSE
3) FALSE The sampling distribution of P̄ is closely related to the binomial distribution. A binomial distribution differs from a normal distribution because it is discrete. If the sample size is large enough, it approximates a normal distribution, but it still cannot take values between two adjacent possible values.

4) TRUE In general, a normal distribution approximation for the sample proportion is justified when np ≥ 5 and n(1 − p) ≥ 5. When both quantities are greater than or equal to 5, the normal distribution does a fairly good job of approximating the binomial probabilities.
5) FALSE The degrees of freedom determine how broad the tails of the distribution are. The fewer the degrees of freedom, the broader the tails.
6) TRUE The upper tail with the given information is P(T20 ≥ 1.74742) = 0.05.
7) FALSE The population proportion p is the essential measure for a qualitative variable. The population mean μ is the descriptor for a quantitative variable.
8) TRUE The null hypothesis corresponds to the presumed default state of nature or status quo, whereas the alternative hypothesis is defined as the contradiction of the default state or status quo.



9) FALSE When the p-value ≥ α, the null hypothesis is not rejected.
10) TRUE The formula for the test statistic for p is valid only if P̄ (approximately) follows a normal distribution.

11) TRUE Two (or more) random samples are considered independent if the process that generates one sample is completely separate from the process that generates the other sample. The samples are clearly delineated.
12) FALSE When testing two population means based on samples that we believe are not independent, one option is to use matched-pairs sample testing.
13) B The sample mean X̄ is commonly known as the estimator, or point estimator, of the population mean. Thus, x̄ = 43,000 is the estimate of the population mean.

14) A p̄ = 20 ÷ 149 = 0.1342 is the estimate of the population proportion failing the Accounting Principles course.

15) A p̄ = 13 ÷ 132 = 0.0985 is the estimate of the population proportion failing the Accounting Principles course.

16) A With a sample size of n = 11, E(X̄) = 7 and se(X̄) = 0.3/√11 = 0.090.

17) A With a sample size of n = 12, E(X̄) = 4 and se(X̄) = 0.2/√12 = 0.058.

18) C For both sample sizes, the standard error of the sample mean is lower than the standard deviation of an individual observation, confirming that averaging reduces variability.
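
The standard errors in answers 16 through 18 all come from se(X̄) = σ/√n. A minimal Python sketch of that calculation (the function name is hypothetical, and the inputs are the values stated in questions 16 to 18):

```python
import math


def standard_error_mean(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


# Question 16: sigma = 0.3, n = 11; question 17: sigma = 0.2, n = 12
se_16 = standard_error_mean(0.3, 11)
se_17 = standard_error_mean(0.2, 12)

# Question 18: sigma = 0.90 with n = 4 and n = 8; the larger sample
# gives the smaller standard error, which is why averaging reduces variability.
se_n4 = standard_error_mean(0.90, 4)
se_n8 = standard_error_mean(0.90, 8)

print(round(se_16, 3), round(se_17, 3), round(se_n4, 2), round(se_n8, 2))
```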

19) A



The Central Limit Theorem, by definition, implies that as the sample size increases (n ≥ 30), the average of a large number of independent observations has an approximately normal distribution.
20) D The expected value of the sample proportion is equal to the population proportion.
21) C Given p = 0.67 and n = 300, the expected value and standard error of P̄ are E(P̄) = 0.67 and se(P̄) = √(0.67 × 0.33/300) = 0.0271.

22) C Given p = 0.58 and n = 500, the expected value and standard error of P̄ are E(P̄) = 0.58 and se(P̄) = √(0.58 × 0.42/500) = 0.0221.
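
For answers 21 and 22, the expected value of the sample proportion is p itself and the standard error is √(p(1 − p)/n). The following Python sketch illustrates the calculation; the function name is an assumption made for this example.

```python
import math


def sample_proportion_moments(p, n):
    """Return (expected value, standard error) of the sample proportion."""
    expected = p
    standard_error = math.sqrt(p * (1 - p) / n)
    return expected, standard_error


# Answer 21: p = 0.67, n = 300
e_21, se_21 = sample_proportion_moments(0.67, 300)

# Answer 22: p = 0.58, n = 500
e_22, se_22 = sample_proportion_moments(0.58, 500)

print(e_21, round(se_21, 4), e_22, round(se_22, 4))
```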

23) B A range of values used to estimate a population parameter of interest is called a confidence interval. For example, you can, with a certain level of confidence, suggest that the population parameter falls within a range of values.
24) D The margin of error is the +/− amount around the point estimate. In this scenario, the point estimate was 36 mpg and the difference of 1 mpg is the margin of error.
25) D The margin of error is the +/− amount around the point estimate. In this scenario, the point estimate was 24 mpg and the difference of 2 mpg is the margin of error.
26) A The confidence coefficient equals 1 − α. Therefore, the probability of error is α = 1 − 0.64 = 0.36.
27) A The confidence coefficient equals 1 − α. Therefore, the probability of error is α = 1 − 0.85 = 0.15.



28) C x̄ ± t(α/2, df) × s/√n = 92.62 ± 1.714 × 9.26/√24 = 92.62 ± 3.24. Thus, the 90% confidence interval is between 89.38 and 95.86.

29) C x̄ ± t(α/2, df) × s/√n = 86.92 ± 1.711 × 9.87/√25 = 86.92 ± 3.38. Thus, the 90% confidence interval is between 83.54 and 90.30.

30) A x̄ ± t(α/2, df) × s/√n = 89.30 ± 1.694 × 10.02/√33 = 89.30 ± 2.95. Thus, the 90% confidence interval is between 86.35 and 92.25.

31) A x̄ ± t(α/2, df) × s/√n = 76.29 ± 1.671 × 7/√60 = 76.29 ± 1.51. Thus, the 90% confidence interval is between 74.78 and 77.80.
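
The interval calculations in answers 28 through 31 share one pattern: margin = t × s/√n, then x̄ ± margin. A hedged Python sketch follows. The function name is hypothetical, and the t value is passed in from the table exactly as the questions supply it rather than computed.

```python
import math


def mean_confidence_interval(x_bar, s, n, t_crit):
    """Return (lower, upper, margin) for x_bar +/- t_crit * s / sqrt(n)."""
    margin = t_crit * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin, margin


# Answer 28: n = 24, x_bar = 92.62, s = 9.26, t = 1.714
low_28, high_28, margin_28 = mean_confidence_interval(92.62, 9.26, 24, 1.714)

# Answer 31: n = 60, x_bar = 76.29, s = 7, t = 1.671
low_31, high_31, margin_31 = mean_confidence_interval(76.29, 7, 60, 1.671)

print(round(low_28, 2), round(high_28, 2), round(margin_31, 2))
```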

32) B The standard error for the sample mean is se(X̄) = s/√n = 7/√19 = 1.606.

33) B The standard error for the sample mean is se(X̄) = s/√n = 5/√9 = 1.667.

34) D The point estimate for the population proportion is p̄ = 14 ÷ 40 = 0.35.

35) D The point estimate for the population proportion is p̄ = 9 ÷ 30 = 0.30.

36) A First, calculate p̄ = 21 ÷ 34 = 0.62; then 0.62 ± 1.645 × √(0.62 × 0.38/34) = 0.62 ± 0.137.

37) A First, calculate p̄ = 12 ÷ 25 = 0.48; then 0.48 ± 1.645 × √(0.48 × 0.52/25) = 0.48 ± 0.16.

38) D First, calculate p̄ = 16 ÷ 62 = 0.26; then 0.26 ± 2.576 × √(0.26 × 0.74/62) = 0.26 ± 0.14.

39) D First, calculate p̄ = 13 ÷ 50 = 0.26; then 0.26 ± 2.576 × √(0.26 × 0.74/50) = 0.26 ± 0.16.
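
Answers 36 through 39 build a confidence interval for a proportion as p̄ ± z × √(p̄(1 − p̄)/n). A minimal Python sketch under that formula is shown here; the function name is hypothetical, and the code uses the unrounded p̄, so intermediate digits can differ slightly from the hand calculations, which round p̄ first.

```python
import math


def proportion_confidence_interval(successes, n, z_crit):
    """Return (p_bar, margin) for p_bar +/- z_crit * sqrt(p_bar * (1 - p_bar) / n)."""
    p_bar = successes / n
    margin = z_crit * math.sqrt(p_bar * (1 - p_bar) / n)
    return p_bar, margin


# Answer 36: 21 of 34 iPhones, z = 1.645 (90% confidence)
p_36, margin_36 = proportion_confidence_interval(21, 34, 1.645)

# Answer 39: 13 of 50 iPhones, z = 2.576 (99% confidence)
p_39, margin_39 = proportion_confidence_interval(13, 50, 2.576)

print(round(p_36, 2), round(margin_36, 2), round(p_39, 2), round(margin_39, 2))
```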

40) C



A Type II error is made when we do not reject the null hypothesis when the null hypothesis is actually false.
41) D The null hypothesis is H0: μ = 1007.24 (i.e., the average shopper will spend $1,007.24), whereas the alternative hypothesis is its opposite, HA: μ ≠ 1007.24.
42) C This would be a Type I error because the market analysis showed the new strip mall would be profitable, rejecting the null hypothesis. The building was built; however, it was not profitable, causing a Type I error.
43) D When using R, the t.test function is used to obtain both the test statistic and the p-value. The mu option denotes the hypothesized value of the mean.
44) B
45) B
46) D The t distribution is also known as the Student’s t distribution, not the population mean.
47) C The purpose of the study of 45 people is to determine whether they preferred a smaller home in comparison to the national average. So the hypotheses will be H0: μ ≥ 2,438; HA: μ < 2,438.
48) A The value of the test statistic for the hypothesis test of the population mean μ is computed as t = (x̄ − μ0)/(s/√n) = (1,370 − 2,462)/(45/√41) = −155.38.

49) A The value of the test statistic for the hypothesis test of the population mean μ is computed as t = (x̄ − μ0)/(s/√n) = (1,550 − 2,438)/(54/√45) = −110.31.
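
The test statistic in answers 48 and 49 is t = (x̄ − μ0)/(s/√n). The Python sketch below reproduces both values from the numbers given in the questions; the function name is an assumption made for illustration.

```python
import math


def t_statistic(x_bar, mu_0, s, n):
    """One-sample t statistic: (x_bar - mu_0) / (s / sqrt(n))."""
    return (x_bar - mu_0) / (s / math.sqrt(n))


# Question 48: x_bar = 1,370, mu_0 = 2,462, s = 45, n = 41
t_48 = t_statistic(1370, 2462, 45, 41)

# Question 49: x_bar = 1,550, mu_0 = 2,438, s = 54, n = 45
t_49 = t_statistic(1550, 2438, 54, 45)

print(round(t_48, 2), round(t_49, 2))
```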

50) A By symmetry, P(T34 ≤ −4.322) is equal to P(T34 ≥ 4.322), which is less than 0.005. In other words, the approximate p-value is P(T34 ≤ −4.322) < 0.005.
51) A A significance level of α = 0.01 means we allow a 1% chance of rejecting a true null hypothesis.
52) C
53) C
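
Answers 52 and 53 are listed without a worked calculation. One way to reproduce them, assuming the usual test statistic for a proportion, z = (p̄ − p0)/√(p0(1 − p0)/n), is sketched below in Python. The function name is hypothetical, and the formula is the standard one rather than something stated in the key itself.

```python
import math


def z_statistic(successes, n, p_0):
    """Test statistic for a proportion: (p_bar - p_0) / sqrt(p_0 * (1 - p_0) / n)."""
    p_bar = successes / n
    return (p_bar - p_0) / math.sqrt(p_0 * (1 - p_0) / n)


# Question 52: 6 of 31 people with blue eyes, p_0 = 0.10
z_52 = z_statistic(6, 31, 0.10)

# Question 53: 5 of 35 people with blue eyes, p_0 = 0.10
z_53 = z_statistic(5, 35, 0.10)

print(round(z_52, 4), round(z_53, 4))
```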

54) A The parameter of interest is the population proportion. Therefore, the competing hypotheses would be H0:p = 0.52; HA:p ≠ 0.52. 55) B

56) B 57) C The p-value of 0.07698 is less than α = 0.10, therefore the null hypothesis is rejected. 58) A One possible null hypothesis for testing the difference between two means is H0: μm − μw = 0. The remaining options are all possible alternative hypotheses. 59) C



A two-tailed test is used because we are hypothesizing no difference between diets. The t-value is 5.4343 with a p-value < α = 0.05; therefore H0 is rejected, and we conclude there is a difference between the two diets.
60) A
61) B
62) D One possible null hypothesis for testing the difference between two means is H0: μD = 0. The remaining options are all possible alternative hypotheses.
63) D Comparing the same 36 months of estimated and actual sales is a matched-pairs example, so the t-statistic is t = d̄/(sD/√n) = 247.1389/(1997.9809/√36) = 0.742.
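
For answer 63, the matched-pairs statistic divides the mean difference by its standard error, sD/√n. A short Python sketch, assuming a hypothesized mean difference of zero (the function and parameter names are illustrative only):

```python
import math


def paired_t_statistic(mean_diff, sd_diff, n, hypothesized_diff=0.0):
    """Matched-pairs t statistic: (d_bar - d_0) / (s_D / sqrt(n))."""
    return (mean_diff - hypothesized_diff) / (sd_diff / math.sqrt(n))


# Question 63: mean difference 247.1389, s_D = 1997.9809, n = 36 months
t_63 = paired_t_statistic(247.1389, 1997.9809, 36)

print(round(t_63, 3))
```

Only the Difference column of the table enters the calculation; the separate standard deviations of estimated and actual sales are not needed for a matched-pairs test.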



CHAPTER 7
TRUE/FALSE - Write 'T' if the statement is true and 'F' if the statement is false.
1) Regression analysis captures the relationship between only two distinct variables. ⊚ true ⊚ false

2) The response variable is the outcome of a variable, whereas the predictor is the input variable(s). ⊚ true ⊚ false

3) A regression model treats all predictor variables as numerical, where observations of a categorical variable are first converted into numerical values. ⊚ true ⊚ false

4) R2 in linear regression is the correlation coefficient. ⊚ true ⊚ false

5) R2, also known as the coefficient of determination, quantifies the proportion of the sample variation in the predictor variables (xi) that is explained in the sample regression equation. ⊚ true ⊚ false

6) The total sum of squares (SST) can be broken into two: explained variation and unexplained variation. ⊚ true ⊚ false

7) When using Excel for a one-tailed test, the returned p-value will need to be divided in half. ⊚ true ⊚ false

8) When working with big data, a sample size is significantly large if the variability virtually disappears. ⊚ true ⊚ false

9) If the required assumptions of Ordinary Least Squares (OLS) linear regression are met, the OLS estimators of the regression coefficients βj are unbiased. ⊚ true ⊚ false

10) If residual plots exhibit strong nonlinear patterns, the inferences made by a linear regression model can be quite accurate. ⊚ true ⊚ false

11) For interval estimates for the response variable y, the prediction interval is narrower than the confidence interval. ⊚ true ⊚ false

12) Adjusted R2 is used to compare competing linear regression models when the models have the same numbers of predictor variables. ⊚ true ⊚ false

13) While tables are often used to report regression results, a notes section below the table is recommended to explain any important notations. ⊚ true ⊚ false



MULTIPLE CHOICE - Choose the one alternative that best completes the statement or answers the question.

14) This represents which relationship? A) Positive linear B) Negative linear C) No linear D) Multiple linear

15) The Department of Natural Resources conducted a study examining the condition of fish in the Wolfe River (Wisconsin). In total, 110 fish were captured. The variables that were measured are: mile marker of location in river, species (rainbow trout, northern pike, musky, walleye, and panfish), length, and weight. Which item is not numerical? A) amount of fish captured B) length C) species D) mile marker

16) In a simple linear regression based on 20 observations, it is found b1 = 2.95 and se(b1) = 1.05. Consider the hypotheses H0: β1 = 0 and HA: β1 ≠ 0. Calculate the value of the test statistic.



A) 2.81 B) 2.410 C) 0.356 D) 0.121

17) In a simple linear regression based on 20 observations, it is found b1 = 3.25 and se(b1) = 1.35. Consider the hypotheses H0: β1 = 0 and HA: β1 ≠ 0. Calculate the value of the test statistic. A) 2.41 B) 0.415 C) 1.42 D) 0.121

18) In the following equation ŷ = 40,000 + 7x with given sales (y in $500) and marketing (x in dollars), what does the equation imply? A) An increase of $7 in marketing is associated with an increase of $43,500 in sales. B) An increase of $1 in marketing is associated with an increase of $3,500 in sales. C) An increase of $7 in marketing is associated with an increase of $3,500 in sales. D) An increase of $1 in marketing is associated with an increase of $43,500 in sales.

19) In the following equation ŷ = 30,000 + 4x with given sales (y in $500) and marketing (x in dollars), what does the equation imply? A) An increase of $1 in marketing is associated with an increase of $32,000 in sales. B) An increase of $1 in marketing is associated with an increase of $2,000 in sales. C) An increase of $4 in marketing is associated with an increase of $2,000 in sales. D) An increase of $4 in marketing is associated with an increase of $32,000 in sales.

20) What is the relationship between the predictor variables and the response variable called when the omission of various relevant factors can influence the response variable?



A) deterministic B) independent C) regressive D) stochastic

21) If R2 = 0.48, then how much of the sample variation in y is explained by the sample regression equation? A) 31.0% B) 38.0% C) 62% D) 48%

22) If R2 = 0.62, then how much of the sample variation in y is explained by the sample regression equation? A) 31% B) 7.8% C) 38% D) 62%

23) If SST = 2,500 and SSE = 625, then the coefficient of determination is A) 0.23. B) 0.43. C) 0.75. D) 0.57.

24) If SST = 2,500 and SSE = 575, then the coefficient of determination is A) 0.43. B) 0.23. C) 0.77. D) 0.57.



25)

If the coefficient of the determination is 0.43, what is the percent of the R2? A) 43% B) 57% C) −0.43 D) −0.57

26)

If the coefficient of the determination is 0.60, what is the percent of the R2? A) 60% B) 40% C) −0.60 D) −0.40

27)

In a linear regression, ε, read as epsilon, is A) a dummy variable. B) the unrounded number. C) the random error. D) the relationship between variables.

28) In a linear regression model, the competing hypotheses can take all but which of the following forms? A) H0: βj = βj0 and HA: βj ≠ βj0 B) H0: βj ≤ βj0 and HA: βj > βj0 C) H0: βj ≥ βj0 and HA: βj < βj0 D) H0: βj > βj0 and HA: βj ≠ βj0

29) The slope coefficient β is called A) regression. B) beta. C) alpha. D) intercept.

30) Which of the following is not a goodness-of-fit measure? A) adjusted coefficient of determination B) coefficient of determination C) standard error of the estimate D) simple regression model

31) When determining if there is evidence of a linear relationship between variables, OLS estimators must be _____________ for the test to be valid. A) normally distributed B) scattered C) a single variable D) predicted

32) Based on the following table, what is the sample regression equation?

            Coefficients   Standard Error   t Stat   p-value
Intercept   10,133.8188    7,543.6071       1.311    0.1927
Cost        0.7363         0.0060           3.917    0.0002
Grad        178.8439       78.8670          2.574    0.0114
Debt        111.4955       137.8860         1.207    0.2300

A) Earnings = 10,133.8188 + 0.7363Cost + 178.8439Grad + 111.4955Debt B) Earnings = 10,133.8188 + 0.736Cost + 178.8439Grad − 111.496Debt C) Earnings = 10,133.8188 − 0.736Cost + 178.8439Grad − 111.496Debt D) Earnings = 10,133.8188 − 0.736Cost + 178.8439Grad + 111.496Debt

33) Based on the following table, what is the sample regression equation?

            Coefficients   Standard Error   t Stat   p-value
Intercept   10,004.9665    7,634.3338       1.311    0.1927
Cost        0.4349         0.1110           3.917    0.0002
Grad        178.0989       69.1940          2.574    0.0114
Debt        141.4783       117.2120         1.207    0.2300

A) Earnings = 10,004.9665 + 0.4349Cost + 178.0989Grad + 141.4783Debt B) Earnings = 10,004.9665 − 0.4349Cost + 178.0989Grad + 141.4783Debt C) Earnings = 10,004.9665 + 0.4349Cost + 178.0989Grad − 141.4783Debt D) Earnings = 10,004.9665 − 0.4349Cost + 178.0989Grad − 141.4783Debt

34) If the average Service Cost in marketing services is 30,000, with a Cost Increase of 70% and 60% of client Payment for services made upfront, in addition to the advantages of City (City = 1), the predicted annual earnings for the firm are:

                Coefficients   Standard Error   t Stat   p-value
Intercept       10,597.4751    7,575.3077       1.399    0.1646
Service Cost    0.4292         0.1106           3.879    0.0002
Cost Increase   175.5073       69.1755          2.537    0.0126
Payment         136.5302       116.4507         1.172    0.2435
City            2,778.4538     1,098.6256       2.529    0.0128

A) $43,950.80. B) $35,758.99. C) $34,591.26. D) $46,729.25.

35) If the average Service Cost in marketing services is 30,000, with a Cost Increase of 70% and 60% of client Payment for services made upfront, in addition to the advantages of City (City = 1), the predicted annual earnings for the firm are:

                Coefficients   Standard Error   t Stat   p-value
Intercept       10,004.9665    7,634.3338       1.311    0.1927
Service Cost    0.4349         0.1110           3.917    0.0002
Cost Increase   178.0989       69.1940          2.574    0.0114
Payment         141.4783       117.2120         1.207    0.2300
City            2,526.7888     1,103.4026       2.290    0.0239

A) $44,007.59. B) $35,518.89. C) $34,351.16. D) $46,534.38.

36) If SSE = 240 and SSR = 260, then the coefficient of determination is A) 0.60. B) 0.52. C) 0.67. D) 0.40.

37) If SSE = 200 and SSR = 300, then the coefficient of determination is A) 0.20. B) 0.60. C) 0.67. D) 0.40.
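
The coefficient of determination in Questions 23, 24, 36, and 37 can be checked with a short Python sketch. It assumes the usual decomposition SST = SSR + SSE; the helper name `r_squared` and its keyword arguments are illustrative and do not come from the text.

```python
def r_squared(sst=None, ssr=None, sse=None):
    """Coefficient of determination from SST and SSE, or from SSR and SSE.

    Hypothetical helper; assumes SST = SSR + SSE.
    """
    if sst is None:
        sst = ssr + sse      # total variation from its two components
    if sse is None:
        sse = sst - ssr      # unexplained variation
    return 1 - sse / sst     # equivalently SSR / SST


print(round(r_squared(sst=2500, sse=625), 2))  # Question 23: 0.75
print(round(r_squared(sst=2500, sse=575), 2))  # Question 24: 0.77
print(round(r_squared(ssr=260, sse=240), 2))   # Question 36: 0.52
print(round(r_squared(ssr=300, sse=200), 2))   # Question 37: 0.6
```

Both forms give the same value because SSR/SST = 1 − SSE/SST whenever the decomposition holds.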

38) If ŷ = 100 − 7x with y = demand for the product and x = price of the product, what happens to the demand if the price is increased by 2 units? A) decreases by 49 units B) increases by 14 units C) decreases by 14 units D) increases by 49 units

39) If ŷ = 110 − 5x with y = demand for the product and x = price of the product, what happens to the demand if the price is increased by 3 units? A) increases by 15 units B) increases by 125 units C) decreases by 15 units D) decreases by 95 units

41) Based on goodness-of-fit measures, which is the preferred model based on the results below?

                                    Model 1      Model 2      Model 3
Standard error of the estimate se   6,498.4419   5,946.7042   5,789.4569
Coefficient of determination R2     0.2934       0.4023       0.4288
Adjusted R2                         0.2902       0.3890       0.4187

A) Model 1 B) Model 2 C) Model 3 D) Both Model 1 and Model 2

42) Using the goodness-of-fit measures below, interpret the coefficient of determination for Model 3: how much of the sample variation in Earnings does the model explain?

                                    Model 1      Model 2      Model 3
Standard error of the estimate se   6,761.3420   5,641.8049   5,886.0024
Coefficient of determination R2     0.5846       0.5304       0.2319
Adjusted R2                         0.5389       0.5362       0.7950

A) 0.8475% of the sample variation in Earnings is explained by the model selection. B) 0.0100 of the sample variation in Earnings determines the model selection. C) 79.50 of the sample variation in Earnings determines the regression model. D) 23.19% of the sample variation in Earnings is explained by the regression model.

43) Using the goodness-of-fit measures below, interpret the coefficient of determination for Model 3: how much of the sample variation in Earnings does the model explain?

                                    Model 1      Model 2      Model 3
Standard error of the estimate se   6,498.4419   5,946.7042   5,789.4569
Coefficient of determination R2     0.2934       0.4023       0.4288
Adjusted R2                         0.2902       0.3890       0.4187

A) 41.87% of the sample variation in Earnings is explained by the regression model. B) 0.8475 of the sample variation in Earnings determines the model selection. C) 0.010 of the sample variation in Earnings determines the model selection. D) 42.88% of the sample variation in Earnings is explained by the regression model.

44) Based on goodness-of-fit measures, what is the percentage of the sample variation unexplained by Model 2?

                                    Model 1      Model 2      Model 3
Standard error of the estimate se   6,637.5388   5,527.6696   5,676.6253
Coefficient of determination R2     0.6835       0.7620       0.7078
Adjusted R2                         0.4389       0.5125       0.1808

A) 76.20% B) 51.25% C) 18.08% D) 23.80%

45) Based on goodness-of-fit measures, what is the percentage of the sample variation unexplained by Model 2?

                                    Model 1      Model 2      Model 3
Standard error of the estimate se   6,498.4419   5,946.7042   5,789.4569
Coefficient of determination R2     0.2934       0.4023       0.4288
Adjusted R2                         0.2902       0.3890       0.4187

A) 38.90% B) 40.23% C) 41.87% D) 59.77%

46) Conduct a test to determine if the predictor variables are jointly significant in explaining Earnings at α = 0.05.

ANOVA
             df   SS               MS              F        Significance F
Regression   1    43,435,936.05    43,435,936.05   53.125   0.000
Residual     18   14,717,014.82    817,611.934
Total        19   158,355,047.50

A) P(F(1,18) ≥ 53.125), p-value is less than α = 0.05, we reject H0, and the predictor variables are jointly significant in explaining Earnings. B) P(F(1,19) ≥ 61.423), p-value is less than α = 0.05, we accept H0, and the predictor variables do not jointly explain Earnings. C) P(F(1,18) ≥ 53.125), p-value is less than α = 0.05, we accept H0, and the predictor variables are jointly significant in explaining Earnings. D) P(F(1,19) ≥ 61.423), p-value is less than α = 0.05, we reject H0, and the predictor variables jointly do not explain Earnings.

47) Conduct a test to determine if the predictor variables are jointly significant in explaining Earnings at α = 0.05.

ANOVA
             df   SS              MS              F        Significance F
Regression   1    44,182,633.37   44,182,633.37   57.737   0.000
Residual     18   13,774,291.07   765,238.393
Total        19   57,956,924.44

A) P(F(1,18) ≥ 57.737), p-value is less than α = 0.05, we reject H0, and the predictor variables are jointly significant in explaining Earnings. B) P(F(1,19) ≥ 60.945), p-value is less than α = 0.05, we accept H0, and the predictor variables do not jointly explain Earnings. C) P(F(1,18) ≥ 57.737), p-value is less than α = 0.05, we accept H0, and the predictor variables are jointly significant in explaining Earnings. D) P(F(1,19) ≥ 60.945), p-value is less than α = 0.05, we reject H0, and the predictor variables jointly do not explain Earnings.

48) Camber Seal is a financial planner hired to review KMB stock. She is considering the CAPM, where the KMB risk-adjusted stock return R − Rf is used as the response variable and the risk-adjusted market return Rm − Rf is used as the predictor variable. KMB's products are considered staples, whether the economy is good or bad. The estimated beta coefficient is 0.7055, with a standard error of 0.1385 and a p-value of 0.027 for the hypotheses H0: β ≥ 1 and HA: β < 1. At a 5% significance level, what is the risk determination of the stock against the market? A) β is significantly less than one; thus, H0 is rejected and the stock is less risky than the market. B) β is significantly higher than one; thus, H0 is accepted and the stock is less risky than the market. C) β is significantly less than one; thus, H0 is accepted and the stock is riskier than the market. D) β is significantly higher than one; thus, H0 is rejected and the stock is riskier than the market.

49) Camber Seal is a financial planner hired to review KMB stock. She is considering the CAPM, where the KMB risk-adjusted stock return R − Rf is used as the response variable and the risk-adjusted market return Rm − Rf is used as the predictor variable. KMB's products are considered staples, whether the economy is good or bad. The estimated beta coefficient is 0.7503, with a standard error of 0.1391 and a p-value of 0.039 for the hypotheses H0: β ≥ 1 and HA: β < 1. At a 5% significance level, what is the risk determination of the stock against the market? A) β is significantly less than one; thus, H0 is rejected and the stock is less risky than the market. B) β is significantly higher than one; thus, H0 is accepted and the stock is less risky than the market. C) β is significantly less than one; thus, H0 is accepted and the stock is riskier than the market. D) β is significantly higher than one; thus, H0 is rejected and the stock is riskier than the market.

50) To conduct a test of joint significance, you want to employ which test? A) regressed mean F test B) left-tailed F test C) double-tailed F test D) right-tailed F test

51) The simple linear regression model y = β0 + β1x + ε implies that if x _____________, we expect y to change by β1, irrespective of the value of x. A) goes up by one unit B) is a straight line C) curves by one unit D) goes down by one unit

52) Abe is calculating a stock investment risk. If the hypotheses are H0: β ≥ 1 and HA: β < 1, the p-value is 0.027, and α = 0.05, is the investment riskier than the market? A) β is equal to one, so the investment is less risky than the market. B) β is more than one, so the investment is riskier than the market. C) β is less than one, so the investment is less risky than the market. D) β is less than one, so the investment is riskier than the market.

53) Which one of the following is not a common violation in the test of validity? A) estimation B) multicollinearity C) changing variability D) nonlinear patterns

54) In a study where the least squares estimates were based on 34 sets of sample observations, the total sum of squares and regression sum of squares were found to be: SST = 4.46 and SSR = 4.22. What is the error sum of squares? A) 1.07 B) 0.24 C) 0.320 D) 0.93

55) In a study where the least squares estimates were based on 34 sets of sample observations, the total sum of squares and regression sum of squares were found to be: SST = 4.53 and SSR = 4.21. What is the error sum of squares? A) 1.07 B) 0.32 C) 0.929 D) 8.74

56) It is important to review residual plots to identify any signs of _____________ and correlated observations in cross-sectional and time-series studies. A) variable studies B) residual plot crosses C) changing variability D) standard error


57) In Excel, to construct a residual plot, input of a y range and an x range is needed. Aimee is examining the relationship between age and square foot range (sqft) of living space. In the scenario provided, what would be the y and what would be the x range data? A) In selecting a regression, residual plot is the first selection before range input. B) Input y range would be age and Input x range would be sqft. C) Input y range would be blank to produce a concise x range. D) Input y range would be sqft and Input x range would be age.

58) Using the 95% confidence interval data below, what is the mean net profit/loss range when x1 = $5 million and x2 = $3.5 million and the standard error of the estimate is 0.2144?

            Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept   0.62113        0.02732          22.73849   0.00000   0.56691     0.67535
x1new       0.08746        0.01224          7.14528    0.00000   0.06317     0.11175
x2new       0.11226        0.01080          10.39308   0.00000   0.09082     0.13369

A) $570,000 to $680,000 B) $190,000 to $1,050,000 C) $710,000 to $810,000 D) $340,000 to $1,190,000

59) Using the data below to construct a 95% prediction interval, what is the individual net profit/loss range when x1 = $5 million and x2 = $3.5 million and the standard error of the estimate is 0.2144?

            Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept   0.62113        0.02732          22.73849   0.00000   0.56691     0.67535
x1new       0.08746        0.01224          7.14528    0.00000   0.06317     0.11175
x2new       0.11226        0.01080          10.39308   0.00000   0.09082     0.13369

A) $570,000 to $680,000 B) $190,000 to $1,050,000 C) $710,000 to $810,000 D) $340,000 to $1,190,000

60) Which of the following statements about Adjusted R2 is false? A) It is possible to increase Adjusted R2 unintentionally by including a predictor variable that has no foundation in the model. B) Adjusted R2 explicitly accounts for the sample size n and the number of predictor variables k. C) Adjusted R2 imposes a penalty for any additional predictor variable that is included in the analysis. D) In models with the same response variable, the model with the higher Adjusted R2 is preferred because it explains more of the sample variation in y.

61) How many dummy variables will be created if the following four modes of transportation to work are captured: biking, public transportation, driving alone, and carpooling? A) 1 B) 2 C) 3 D) 4



Answer Key
Test name: Chap 07_2e_Jaggia
1) FALSE Regression analysis captures the relationship between two or more variables.
2) TRUE The outcome of a variable, called the response variable, is related to one or more other input variables, called the predictor variables.
3) TRUE A regression model treats all predictor variables as numerical, where observations of a categorical variable are first converted into numerical values. Recall from Chapter 1 that the observations of a numerical variable represent meaningful numbers, whereas the observations of a categorical variable represent different categories.
4) FALSE R2 in linear regression is the coefficient of determination, which is the proportion of the sample variation in the response variable that is explained by the sample regression equation. The correlation coefficient measures the strength and direction of the linear relationship between two variables.
5) FALSE R2 quantifies the sample variation of the response variable y that is explained by the sample regression equation, not the variation of the predictor variables.
6) TRUE The total sum of squares, SST, can be broken down into two components: explained variation and unexplained variation.
7) TRUE The p-value will need to be divided in half in a one-tailed test.
8) TRUE Sample size is significantly large if the variability virtually disappears.
9) TRUE OLS estimators of the regression coefficients βj are unbiased unless the error term is correlated with the predictor variables (i.e., one or more relevant predictors are excluded).
10) FALSE If residual plots exhibit strong nonlinear patterns, the inferences made by a linear regression model can be quite misleading. Nonlinear regression models should be used.
11) FALSE For interval estimates for the response variable y, the prediction interval is wider than the confidence interval because of the added uncertainty in predicting the individual value of y.
12) FALSE Adjusted R2 is used to compare competing linear regression models when the models have different numbers of predictor variables.
13) TRUE While tables are often used and are typically in a user-friendly format, some of the notations may vary from author to author. As a result, notes explaining typical notations are included below the table.
14) A The slope parameter β1 determines whether the linear relationship between x and E(y) is positive (β1 > 0).
15) C Categorical variables are descriptors as in species, versus the other variables, which are numerical.
16) A Use the test statistic for the test of individual significance: t = (bj − βj0)/se(bj).
17) A Use the test statistic for the test of individual significance: t = (b1 − 0)/se(b1) = 3.25/1.35 = 2.41.
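
As a check on answer 17, the test statistic for individual significance can be computed directly. This is a minimal Python sketch; the function name `t_statistic` is illustrative, and the degrees of freedom assume one predictor (k = 1), as in a simple linear regression.

```python
def t_statistic(b_j, se_b_j, beta_j0=0.0):
    """Test statistic for individual significance: (b_j - beta_j0) / se(b_j)."""
    return (b_j - beta_j0) / se_b_j


n, k = 20, 1              # observations and predictors stated in Question 17
df = n - k - 1            # degrees of freedom of the t distribution
t_17 = t_statistic(3.25, 1.35)

print(df)                 # 18
print(round(t_17, 2))     # 2.41
```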

18) B ŷ = 40,000 + 7x. The slope is 7 and y is measured in units of $500, so a $1 increase in marketing is associated with a 7 × $500 = $3,500 increase in sales.
19) B ŷ = 30,000 + 4x. The slope is 4 and y is measured in units of $500, so a $1 increase in marketing is associated with a 4 × $500 = $2,000 increase in sales.
20) D The relationship between the response variable and the predictor variables is deterministic if the value of the response variable is uniquely determined by the predictor variables. Otherwise, the relationship is stochastic due to the omission of relevant factors (sometimes not measurable) that influence the response variable.
21) D R2 is the proportion of the sample variation in y that is explained by the sample regression equation, thus 48%.
22) D R2 is the proportion of the sample variation in y that is explained by the sample regression equation, thus 62%.
23) C R2 = 1 − SSE/SST = 1 − 625/2,500 = 0.75.
24) C R2 = 1 − SSE/SST = 1 − 575/2,500 = 0.77.
25) A 0.43 or 43% is the coefficient of determination.
26) A 0.60 or 60% is the coefficient of determination.
27) C y = β0 + β1x1 + ε, where ε (the Greek letter read as epsilon) is the random error term.
28) D The pair H0: βj > βj0 and HA: βj ≠ βj0 is not a valid form of competing hypotheses.
29) B The slope coefficient β is read as beta and represents the population slope coefficient to be estimated.
30) D The standard error of the estimate (denoted as se), the coefficient of determination (denoted as R2), and the adjusted coefficient of determination (denoted as adjusted R2) are goodness-of-fit measures. The simple regression model is not a goodness-of-fit measure.
31) A The OLS estimators b0, b1, . . . , bk must be normally distributed.
32) A Earnings = 10,133.8188 + 0.7363Cost + 178.8439Grad + 111.4955Debt is correct because all of the coefficients are positive. This is the only equation meeting that requirement.
33) A Earnings = 10,004.9665 + 0.4349Cost + 178.0989Grad + 141.4783Debt is correct because all of the coefficients are positive. This is the only equation meeting that requirement.
34) D 10,597.4751 + 0.4292 × 30,000 + 175.5073 × 70 + 136.5302 × 60 + 2,778.4538 × 1 = 46,729.252 or $46,729.25.
35) D 10,004.9665 + 0.4349 × 30,000 + 178.0989 × 70 + 141.4783 × 60 + 2,526.7888 × 1 = 46,534.376 or $46,534.38.
36) B SST = SSR + SSE = 260 + 240 = 500; R2 = SSR/SST = 260/500 = 0.52.
37) B SST = SSR + SSE = 300 + 200 = 500; R2 = SSR/SST = 300/500 = 0.60.
38) C Two times seven is 14 units and the demand decreases due to the negative sign before the 7x.
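
The arithmetic in answers 34 and 35 can be reproduced with a short Python sketch. The helper `predict` and the dictionary layout are illustrative; the coefficients are copied from the tables in Questions 34 and 35, and the sketch assumes, as the answer key does, that the percentages are entered as 70 and 60 and that City = 1.

```python
def predict(intercept, coefficients, values):
    """Evaluate y-hat = b0 + sum of b_j * x_j over the named predictors."""
    return intercept + sum(coefficients[name] * values[name] for name in coefficients)


# Predictor values stated in Questions 34 and 35 (assumed coding: 70, 60, City = 1).
x = {"Service Cost": 30_000, "Cost Increase": 70, "Payment": 60, "City": 1}

q34 = predict(
    10_597.4751,
    {"Service Cost": 0.4292, "Cost Increase": 175.5073, "Payment": 136.5302, "City": 2_778.4538},
    x,
)
q35 = predict(
    10_004.9665,
    {"Service Cost": 0.4349, "Cost Increase": 178.0989, "Payment": 141.4783, "City": 2_526.7888},
    x,
)

print(round(q34, 2))  # 46729.25
print(round(q35, 2))  # 46534.38
```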

39) C Three times five is 15 units and the demand decreases due to the negative sign before the 5x.
40) D The closer se is to zero, the better the model fits the sample data.
41) C Using the lowest standard error of the estimate, Model 3 is the best choice. Using the highest value of adjusted R2, Model 3 is the best choice. Even though the coefficient of determination for Model 3 is also the highest, it cannot be used to determine goodness-of-fit unless all three models have the same number of predictor variables, which is unknown in this example.
42) D Coefficient of determination R2 is 0.2319, meaning 23.19% of the sample variation in Earnings is explained by the regression model.
43) D Coefficient of determination R2 is 0.4288, meaning 42.88% of the sample variation in Earnings is explained by the regression model.
44) D In Model 2, 76.20% of the sample variation is explained, so 1 − 0.7620 = 0.2380, or 23.80%, is unexplained.
45) D In Model 2, 40.23% of the sample variation is explained, so 1 − 0.4023 = 0.5977, or 59.77%, is unexplained.
46) A The p-value is P(F(1,18) ≥ 53.125); because the p-value is less than α = 0.05, we reject H0. The predictor variables are jointly significant in explaining Earnings.

47) A
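
For Question 47, the F statistic can be recomputed from the sums of squares in the ANOVA table. This is a minimal Python sketch; the function name `f_statistic` is illustrative, and n = 20 is inferred from the total degrees of freedom (19) rather than stated in the question.

```python
def f_statistic(ssr, sse, n, k):
    """F statistic for joint significance: MSR / MSE."""
    msr = ssr / k              # mean square regression
    mse = sse / (n - k - 1)    # mean square error
    return msr / mse


# Sums of squares from the ANOVA table in Question 47; n is an inferred value.
f_47 = f_statistic(ssr=44_182_633.37, sse=13_774_291.07, n=20, k=1)

print(round(f_47, 3))  # 57.737
```

The p-value is the right-tail probability P(F(1,18) ≥ 57.737). If SciPy is installed, `scipy.stats.f.sf(f_47, 1, 18)` should return it; the table reports it as 0.000, which is below α = 0.05.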

48) A The beta for KMB stock is significantly less than one (p-value = 0.027 < 0.05), so H0 is rejected and therefore the stock is less risky than the market.
49) A The beta for KMB stock is significantly less than one (p-value = 0.039 < 0.05), so H0 is rejected and therefore the stock is less risky than the market.
50) D To conduct the test of joint significance, we employ a right-tailed F test.
51) A x goes up by one unit.
52) C Based on the hypothesis test, β is less than one, indicating the investment is less risky than the market.
53) A Estimation is not a common violation in testing the validity of a model but nonlinearity, multicollinearity, changing variability, correlated observations, and excluded variables are.
54) B SSE = 4.46 − 4.22 = 0.24.
55) B SSE = 4.53 − 4.21 = 0.32.
56) C It is important that we explore residual plots to look for signs of changing variability and correlated observations in cross-sectional and time-series studies, respectively.
57) B The x range data should be sqft because the question requests the result of age range in y.
58) A The 95% confidence interval for E(y0) is ŷ0 ± tα/2,df se(ŷ0). Here ŷ0 = 0.62113, and the reported bounds are 0.56691 and 0.67535 (in $ millions), or about $570,000 to $680,000.
59) B The prediction interval is given by ŷ0 ± tα/2,df √(se(ŷ0)² + se²), which is approximately 0.62113 ± 0.429 (in $ millions), or about $190,000 to $1,050,000.
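
The intervals in answers 58 and 59 can be checked with the following Python sketch. It rests on two assumptions that are not stated in the questions: that x1new and x2new are the predictors shifted by 5 and 3.5, so the intercept row gives ŷ0 and se(ŷ0), and that 0.2144 is the standard error of the estimate. The critical t value is backed out from the reported confidence bounds rather than looked up.

```python
import math

# Values read from the regression table in Questions 58 and 59 ($ millions).
y_hat = 0.62113                        # intercept: predicted y at x1 = 5, x2 = 3.5 (assumed)
se_y_hat = 0.02732                     # standard error of the intercept
s_e = 0.2144                           # standard error of the estimate (assumed meaning)
ci_lower, ci_upper = 0.56691, 0.67535  # reported Lower 95% and Upper 95%

# Back out the critical t value from the reported confidence interval.
t_crit = (ci_upper - ci_lower) / 2 / se_y_hat

# The prediction interval adds the variance of an individual error term.
margin = t_crit * math.sqrt(se_y_hat**2 + s_e**2)
pi_lower, pi_upper = y_hat - margin, y_hat + margin

print(round(ci_lower, 2), round(ci_upper, 2))  # 0.57 0.68
print(round(pi_lower, 2), round(pi_upper, 2))  # 0.19 1.05
```

The prediction interval is much wider than the confidence interval because se dominates se(ŷ0) under the square root.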

60) A
61) C With four categories, one dummy variable is excluded from the regression because it represents the reference category against which the others are assessed. Therefore, only three dummy variables need to be created.


CHAPTER 8

TRUE/FALSE - Write 'T' if the statement is true and 'F' if the statement is false.

1) In determining the partial effect of dummy variable d in a regression model with an interaction variable, ŷ = b0 + b1x + b2d + b3xd, the value of the numerical variable x needs to be known. ⊚ true ⊚ false

2) In a regression model with two dummy variables and an interaction variable d1d2, y = β0 + β1d1 + β2d2 + β3d1d2 + ε, the interaction variables are easy to estimate. ⊚ true ⊚ false

3) The quadratic regression model is appropriate for use when the slope changes in magnitude as well as sign. ⊚ true ⊚ false

4) In the quadratic regression model, if β2 > 0 then the relationship between x and y is an inverted U-shape. ⊚ true ⊚ false

5) ln(y) = β0 + β1 ln(x) + ε represents the exponential regression model. ⊚ true ⊚ false

6) In the quadratic regression model, if β2 < 0 then the relationship between x and y is an inverted U-shape. ⊚ true ⊚ false

7) For cross-validation methods, the holdout method and the k-fold cross-validation method, competing regression models are assessed with the mean square error (MSE). ⊚ true ⊚ false

8) The natural logarithm converts changes in a variable into percentage changes. ⊚ true ⊚ false

9) In the k-fold cross-validation method, the smaller the k value, the greater the reliability of the k-fold method. ⊚ true ⊚ false

10) Because software packages use random draws of the observations to partition data, the results will not be identical to a fixed partitioning of the observations. ⊚ true ⊚ false

MULTIPLE CHOICE - Choose the one alternative that best completes the statement or answers the question.

11) In a regression model, the _____________ exists when the partial effect of a predictor variable on the response variable depends on the value of another predictor variable. A) target effect B) interaction effect C) dummy effect D) predictor effect

12) What is the predicted value (ŷ) when the numerical variable is x = 70 for the regression equation ŷ = −837 + 24.4x − 0.100x^2? A) 1,832.2 B) 1,305.4 C) 381.0 D) Not enough data to calculate the predicted value.

13) What is the predicted value (ŷ) when the numerical variable is x = 70 for the regression equation ŷ = −810 + 24.4x − 0.142x^2? A) 1,822.2 B) 1,593.8 C) 202.2 D) Not enough data to calculate the predicted value.
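
Predictions from a quadratic regression equation, as in Questions 12 and 13, amount to evaluating b0 + b1x + b2x^2 at the given x. A minimal Python sketch follows; the function name `quadratic_prediction` is illustrative.

```python
def quadratic_prediction(b0, b1, b2, x):
    """Evaluate the quadratic regression equation y-hat = b0 + b1*x + b2*x^2."""
    return b0 + b1 * x + b2 * x**2


print(round(quadratic_prediction(-837, 24.4, -0.100, 70), 1))  # Question 12: 381.0
print(round(quadratic_prediction(-810, 24.4, -0.142, 70), 1))  # Question 13: 202.2
```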

14) A variable with a value of x1x2 is added to the general linear regression model to account for two predictor variables, x1 and x2, and their combined effect on the response variable. This type of effect is called _____________. A) transformative B) dummy C) interaction D) predictive

15) Consider the sample correlation coefficients in the table below. How much of the variability in time can be explained by boxes, using the alternative way to determine the coefficient of determination?

         Time     Boxes     Weight
Boxes    0.7029
Weight   0.6909   0.3846
Route    0.050    −0.1147   0.0134

A) 71% B) 26.0% C) 50.7% D) 49.4%

16) Consider the sample correlation coefficients in the table below. How much of the variability in time can be explained by boxes, using the alternative way to determine the coefficient of determination?

         Time     Boxes     Weight
Boxes    0.7452
Weight   0.6572   0.3801
Route    0.082    −0.1615   0.0114

A) 75% B) 25.5% C) 49.5% D) 55.5%
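
In a simple linear regression, the coefficient of determination equals the square of the sample correlation coefficient between the response and the predictor, which is presumably the alternative route Questions 15 and 16 refer to. A short Python sketch, with an illustrative function name:

```python
def r_squared_from_correlation(r):
    """R2 of a simple linear regression from the sample correlation r."""
    return r**2


# Correlations between Time and Boxes from the tables in Questions 15 and 16.
for label, r in (("Question 15", 0.7029), ("Question 16", 0.7452)):
    print(label, f"{r_squared_from_correlation(r):.1%}")
# Question 15 49.4%
# Question 16 55.5%
```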

17) A study was completed on cholesterol in 100 male adults 40–60 years of age to determine if there is a relationship between cholesterol concentration and time spent watching TV. The researchers wanted to determine if there are any predictive results, such as if the amount of time spent watching TV increases or decreases cholesterol levels. Based on the following regression results, is TV_time statistically significant?

Variable      Model 1    p-value
Constant      −2.07921   0.0000
TV_time       0.041399   0.001
Adjusted R2   0.1391

A) The p-value = 0.001 and is statistically significant because it is under the 5% level. B) The p-value = 0.002 and is not statistically significant because it is under the 5% level. C) The p-value = 0.000 and is statistically significant because it is under the 5% level. D) The p-value = 0.000 and is not statistically significant because it is under the 5% level.

18) A study was completed on cholesterol in 100 male adults 40–60 years of age to determine if there is a relationship between cholesterol concentration and time spent watching TV. The researchers wanted to determine if there are any predictive results, such as if the amount of time spent watching TV increases or decreases cholesterol levels. Based on the following regression results, is TV_time statistically significant?

Variable      Model 1    p-value
Constant      −2.13478   0.000
TV_time       0.044069   0.002
Adjusted R2   0.1426

A) The p-value = 0.002 and is statistically significant because it is under the 5% level. B) The p-value = 0.002 and is not statistically significant because it is under the 5% level. C) The p-value = 0.000 and is statistically significant because it is under the 5% level. D) The p-value = 0.000 and is not statistically significant because it is under the 5% level.

19) Consider a linear regression model where y represents the response variable and d1 and d2 represent two dummy variables. The model is estimated as ŷ = −2.95 + 1.83x + 4.70d1 − 2.79d2 + 0.89d1d2. Compute ŷ for x = 3, d1 = 1, and d2 = 0. A) 14.25 B) 3.66 C) 12.01 D) 7.24

20) Consider a linear regression model where y represents the response variable and d1 and d2 represent two dummy variables. The model is estimated as ŷ = −2.53 + 2.04x + 4.20d1 − 2.86d2 + 0.68d1d2. Compute ŷ for x = 3, d1 = 1, and d2 = 0. A) 13.53 B) 3.71 C) 12.85 D) 7.79
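
The predictions in Questions 19 and 20 follow from substituting x, d1, and d2 into the estimated equation; with d2 = 0, both the d2 term and the interaction term d1d2 drop out. A minimal Python sketch, with an illustrative function name and argument order:

```python
def predict_with_dummies(b0, b_x, b_d1, b_d2, b_d1d2, x, d1, d2):
    """y-hat = b0 + b_x*x + b_d1*d1 + b_d2*d2 + b_d1d2*(d1*d2)."""
    return b0 + b_x * x + b_d1 * d1 + b_d2 * d2 + b_d1d2 * d1 * d2


q19 = predict_with_dummies(-2.95, 1.83, 4.70, -2.79, 0.89, x=3, d1=1, d2=0)
q20 = predict_with_dummies(-2.53, 2.04, 4.20, -2.86, 0.68, x=3, d1=1, d2=0)

print(round(q19, 2))  # 7.24
print(round(q20, 2))  # 7.79
```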

21) Sam, a marketing manager for XYZ big box stores, is trying to determine if there is a relationship between shelf space (in feet) and sales (in hundreds of dollars). To do this, Sam selected the top 12 producing locations. The regression results produced the following adjusted R2 values: Model 1: 0.8874 and Model 2: 0.6028. Which model is more suitable for prediction? A) Model 1 B) Model 2 C) Both are suitable D) Neither model

22) Sam, a marketing manager for XYZ big box stores, is trying to determine if there is a relationship between shelf space (in feet) and sales (in hundreds of dollars). To do this, Sam selected the top 12 producing locations. Using the provided Model 1 results, what is the estimated equation for Sales?

Variable            Model 1   p-value
Constant            10.641    0.0000
Number of shelves   0.225     0.000
Shelf Space         0.067     0.001
Adjusted R2         0.6276

A) ^Sales = 10.641 + 0.225shelves + 0.067shelf_space B) ^Sales = −0.225shelves + 0.067shelf_space C) ^Sales = 0.225shelves + 0.067shelf_space D) ^Sales = 0.067shelf_space ÷ (2 × 0.225shelves)

23) Sam, a marketing manager for XYZ big box stores, is trying to determine if there is a relationship between shelf space (in feet) and sales (in hundreds of dollars). To do this, Sam selected the top 12 producing locations. Using the provided Model 1 results, what is the estimated equation for Sales?

Variable            Model 1   p-value
Constant            10.365    0.000
Number of shelves   0.211     0.000
Shelf Space         0.074     0.000
Adjusted R2         0.6523

A) ^Sales = 10.365 + 0.211shelves + 0.074shelf_space B) ^Sales = −0.211shelves + 0.074shelf_space C) ^Sales = 0.211shelves + 0.074shelf_space D) ^Sales = 0.074shelf_space ÷ (2 × 0.211shelves)

24) Sam, a marketing manager for XYZ big box stores, is trying to determine if there is a relationship between shelf space (in feet) and sales (in hundreds of dollars). To do this, Sam selected the top 12 producing locations. Using the provided Model 1 results, which option best interprets the impact of the coefficients and p-values?

Variable            Model 1   p-value
Constant            10.365    0.000
Number of shelves   0.211     0.000
Shelf Space         0.074     0.001
Adjusted R2         0.6523

A) The predictor variables are below one offering a negative correlation. B) All predictor variables are positive with a significant influence on sales. C) There is no impact presented with the coefficients and p-value results. D) The p-value offers significant influence where the coefficient provides reduction on sales.

25) The estimated regression equation for the average cost of widgets is: ^Widgets = 7.21 − 0.2705 Output + 0.1904 Output^2. Both predictor variables are statistically significant at the 5% level, confirming the quadratic effect. What is the change in predicted average cost when output increases from 2 million units to 3 million units? A) The increase in output units results in a $0.02 decrease in predicted average cost. B) The increase in output units results in a $0.24 increase in predicted average cost. C) The increase in output units results in a $0.68 decrease in predicted average cost. D) The increase in output units results in a $0.68 increase in predicted average cost.

26) The estimated regression equation for the average cost of widgets is: ^Widgets = 7.23 − 0.2478 Output + 0.1954 Output^2. Both predictor variables are statistically significant at the 5% level, confirming the quadratic effect. What is the change in predicted average cost when output increases from 2 million units to 3 million units? A) The increase in output units results in a $0.02 decrease in predicted average cost. B) The increase in output units results in a $0.24 increase in predicted average cost. C) The increase in output units results in a $0.73 decrease in predicted average cost. D) The increase in output units results in a $0.73 increase in predicted average cost.

27) Todd uses the quadratic regression model to determine the predicted average unit cost of baseballs produced in his production facility. After determining predicted average costs at multiple batch sizes (in millions of units), he now wants to know the output level that minimizes his average cost. Given b1 = −0.3669 and b2 = 0.0192, what is the output level that will minimize his average cost? A) 5.0 million units B) 2.1 million units C) 19.80 million units D) 9.55 million units

28) Todd uses the quadratic regression model to determine the predicted average unit cost of baseballs produced in his production facility. After determining predicted average costs at multiple batch sizes (in millions of units), he now wants to know the output level that minimizes his average cost. Given b1 = −0.3802 and b2 = 0.0198, what is the output level that will minimize his average cost? A) 7.5 million units B) 3.7 million units C) 19.20 million units D) 9.60 million units
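
For a quadratic regression equation ŷ = b0 + b1x + b2x^2, the predicted value reaches its turning point at x = −b1/(2b2). With b2 > 0, as in Questions 27 and 28, the fitted curve is U-shaped, so the turning point is where predicted average cost is lowest. A short Python sketch, with illustrative function names:

```python
def turning_point(b1, b2):
    """x at which y-hat = b0 + b1*x + b2*x^2 reaches its maximum or minimum."""
    return -b1 / (2 * b2)


def is_minimum(b2):
    """A positive b2 gives a U-shape, so the turning point is a minimum."""
    return b2 > 0


print(round(turning_point(-0.3669, 0.0192), 2), is_minimum(0.0192))  # 9.55 True
print(round(turning_point(-0.3802, 0.0198), 2), is_minimum(0.0198))  # 9.6 True
```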

29) Ava Diego, a doctoral student, is researching car loans issued at a local bank. She prepared a sample of 200 to determine if there is a relationship between the loan amount, length of the loan, and interest rate provided. The regression results are in the table below. Which model is more suitable for prediction and what is the best fit reason?

Variable           Model 1    p-value    Model 2    p-value
Constant           114.325    0.000      110.54     0.000
Interest Rate      106.505    0.000      108.650    0.000
Loan Length        0.2074     0.000      0.3290     0.006
Interest × Loan    NA                    −0.1430    0.0005
Adjusted R²        0.2178                0.2089

A) Model 2 is the most suitable because of the p-value variance. B) Model 1 is the most suitable because of the higher adjusted R² value. C) Model 2 is the most suitable because of the lower adjusted R² value. D) Neither provides enough results data to determine the model or reasoning.

30) An estimated linear regression of annual fuel expenditures y on annual income x is represented by the following equation: ŷ = 1,500 + 0.05x. What is the estimated slope coefficient value? A) 0.05 B) 2b1 + 0.05 C) 75 D) 75x

31) An estimated linear regression of annual fuel expenditures y on annual income x is represented by the following equation: ŷ = 2,200 + 0.05x. What is the estimated slope coefficient value? A) 0.05 B) 2b1 + 0.05 C) 110 D) 110x

32) An estimated linear regression of annual fuel expenditures y on annual income x is represented by the following equation: ŷ = 2,200 + 0.05x. Jim was offered a new job that would increase his salary by $2,000. What would be his potential increase in fuel costs? Based on this information, is the assumption of increased fuel cost against annual salary meaningful? A) The increase in fuel cost is $100.00 annually; linearity assumption is justified. B) The increase in fuel cost is $100.00 annually; linearity assumption is not justified. C) The increase in fuel cost is $50.00 annually; linearity assumption is justified. D) The increase in fuel cost is $1,000.00 annually; linearity assumption is not justified.

33) In the logarithmic regression model y = β0 + β1ln(x) + ε, β1 × 0.01 measures the approximate unit change in E(y) when x increases by 1%. If β1 = 10,000, what is the approximate unit change in E(y)? A) 10.0 units B) 1,000 units C) 100 units D) 10,000 units

34) In the logarithmic regression model y = β0 + β1ln(x) + ε, β1 × 0.01 measures the approximate unit change in E(y) when x increases by 1%. If β1 = 6,000, what is the approximate unit change in E(y)? A) 6 units B) 600 units C) 60 units D) 6,000 units

35) In an exponential regression model, the exact percentage of change can be calculated as: (exp(β1) − 1) × 100. If β1 = 0.15, what is the percent increase in E(y)? A) 25% B) 16% C) 75% D) 22%

36) In an exponential regression model, the exact percentage of change can be calculated as: (exp(β1) − 1) × 100. If β1 = 0.25, what is the percent increase in E(y)? A) 25% B) 28% C) 75% D) 22%

37) In the model y = β0 + β1ln(x) + ε, the predicted value is ŷ = b0 + b1ln(x). What is the impact of the estimated slope coefficient? A) b1 measures the approximate change in ŷ when x increases by 1 unit. B) b1 × 0.01 measures the approximate change in ŷ when x increases by 1%. C) b1 measures the approximate change in ŷ when x increases by 1%. D) b1 × 100 measures the approximate change in ŷ when x increases by 1 unit.

38) In the model y = β0 + β1x + ε, the predicted value is ŷ = b0 + b1x. What is the impact of the estimated slope coefficient? A) b1 measures the approximate change in ŷ when x increases by 1 unit. B) b1 × 0.01 measures the approximate change in ŷ when x increases by 1%. C) b1 measures the approximate change in ŷ when x increases by 1%. D) b1 × 100 measures the approximate change in ŷ when x increases by 1 unit.

39) In the model ln(y) = β0 + β1x + ε, the predicted value is ŷ = exp(b0 + b1x + se² ÷ 2). What is the impact of the estimated slope coefficient? A) b1 measures the approximate change in ŷ when x increases by 1 unit. B) b1 × 0.01 measures the approximate change in ŷ when x increases by 1%. C) b1 measures the approximate change in ŷ when x increases by 1%. D) b1 × 100 measures the approximate percentage change in ŷ when x increases by 1 unit.

40) Consider the following quadratic model: ŷ = 17 + 2.50x − 0.25x². Predict y when x = 14. A) 3 B) 40 C) 12 D) 9

41) Consider the following quadratic model: ŷ = 25 + 1.5x − 0.25x². Predict y when x = 12. A) 40 B) 12 C) 9 D) 7

42) Why are the partial effects of the two predictor variables more difficult to interpret when a model contains the interaction of two numerical variables? A) The partial effect of either predictor variable depends on the value of the other predictor variable. B) Models with an interaction of two numerical variables are no more difficult to interpret than other interactions. C) The partial effects can only be interpreted at the sample means of x1 and x2. D) Comparing models with and without the interaction terms depends on the value of R².

43) Christian Za is using the holdout method to partition his data into two independent and mutually exclusive sets: 75% in a training set and 25% in a validation set. Based on an R² value of 0.4532 and RMSE of 0.3268 for the training set and an R² value of 0.1426 and RMSE of 0.5371 for the validation set, which of the competing models is the preferred model? A) The validation data set would be the preferred model. B) The training data set would be the preferred model. C) Both models are preferred models. D) There is not enough data to determine the preferred model.

44) Compute ŷ given x1 = 25 with x2 evaluated at 5, where ŷ = −97.2850 + 5.5416x1 + 3.0928x2 − 0.1210x1x2. A) −7.38 B) 22.87 C) 41.59 D) 71.88

45) Which of the following statements about the logarithmic regression model is true? A) The model allows us to estimate the percentage change in E(y) when x increases 1%. B) The model is useful when only the predictor variable is better captured in percentages. C) The model is useful when the response and predictor variables are best captured in percentages. D) The model allows us to estimate the percent change in E(y) when x increases by one unit.

46) Which of the following statements about the exponential regression model is true? A) The model allows us to estimate the unit change in E(y) when x increases 1%. B) The model is useful when only the predictor variable is better captured in percentages. C) The model is useful when the response and predictor variables are best captured in percentages. D) The model allows us to estimate the percent change in E(y) when x increases by one unit.

47) A regression model made to conform too closely to a sample set of data, compromising its predictive power, is called _____________. A) cross-validation B) flooding C) overfitting D) binary choice

48) A nonlinear regression model where both the response and predictor variables are transformed into natural logs is called a _____________. A) logistic regression model B) log-transformed model C) log-log regression model D) linear probability regression model

49) What regression model best matches this scatterplot? A) logistic regression model B) log-log regression model C) log-transformed model D) linear probability regression model

50) In cross-validation, if k equals the sample size, the resulting method is also called the _____________. A) sensitive data cross-validation method B) leave-one-out cross-validation method C) equality cross-validation method D) partition cross-validation method

51) In using both the linear probability regression and the logistic regression models for n = 40, the following table is the analysis of the holdout method. Based on the table, what is the impact of changing the ŷ to binary predictions?

Observation    y    ŷ (Model 1)    ŷ (Model 2)
18             1    1.0263         0.9907
19             0    0.1987         0.3032
:              :    :              :
40             1    0.1509         0.1989
Accuracy            100%           100%

A) There is no relevant outcome changing ŷ to a binary variable. B) By comparing ŷ to y, the validation results are not able to be validated. C) By comparing ŷ to y, the accuracy of the model is 100%, but can change with a larger validation set. D) By comparing ŷ to y, the accuracy of the model is 100% and will not change based on a larger validation set.

Answer Key
Test name: Chap 08_2e_Jaggia
1) TRUE The numerical variable x needs to be known to properly interpret the partial effect.
2) TRUE The partial effect of d1 on ŷ, given by b1 + b3d2, equals b1 if d2 = 0 or b1 + b3 if d2 = 1. The partial effect of d2 likewise depends on the value of d1, which makes the interaction effect easy to estimate.
3) TRUE When the slope, capturing the influence x has on y, changes in magnitude and sign, the quadratic regression model is appropriate to show the changes.
4) FALSE β2 > 0 would reflect a standard U-shape, whereas β2 < 0 would indicate an inverted U-shape.
5) FALSE This formula represents the log-log regression model, as it transforms all variables (response and predictor) into logs. An exponential regression model is specified as ln(y) = β0 + β1x + ε, where β1 × 100 represents the approximate percentage change in E(y) when x increases by one unit.
6) TRUE β2 > 0 would reflect a standard U-shape, whereas β2 < 0 would indicate an inverted U-shape.
7) FALSE For cross-validation methods, the holdout method and the k-fold cross-validation method, competing regression models are assessed with the root mean square error (RMSE).

8) TRUE The natural logarithm converts changes in a variable into percentage changes, which is useful because many relationships are naturally expressed in terms of percentages.
9) FALSE The greater the k value, the more reliable the k-fold method.
10) TRUE Results from a fixed set will never match that of a random partitioning of observation data.
11) B Interaction effect in a regression model exists when a predictor variable has a partial effect on another variable. An example is the number of bedrooms and the sale price of a home.
12) C Replace the x in the equation with 70 and solve: ŷ = −837 + 24.4(70) − 0.100(70)².
13) C Replace the x in the equation with 70 and solve: ŷ = −810 + 24.4(70) − 0.142(70)².
14) C A linear regression model with interaction variables x1x2 may have a partial effect of a predictor variable on the response variable.
15) D The coefficient of determination (R²) between time and boxes is 0.7029² = 0.4941 or 49.4%.
16) D The coefficient of determination (R²) between time and boxes is 0.7452² = 0.5553 or 55.5%.
17) A

The p-value for TV-time is 0.001 and is under the 5% significance level; thus, TV watch time is a predictor of cholesterol.
18) A The p-value for TV-time is 0.002 and is under the 5% significance level; thus, TV watch time is a predictor of cholesterol.
19) D ŷ = −2.95 + 1.83(3) + 4.70(1) − 2.79(0) + 0.89(1)(0); ŷ = 7.24.
20) D ŷ = −2.53 + 2.04(3) + 4.20(1) − 2.86(0) + 0.68(1)(0); ŷ = 7.79.
21) A Model 1 is more suitable because (0.8874 > 0.6028).
22) A The estimate of Sales begins with top hat Sales, followed by the components of the regression analysis. The formula is the constant + Number of shelves + Shelf Space. The p-value is not considered in this equation.
23) A The estimate of Sales begins with top hat Sales, followed by the components of the regression analysis. The formula is the constant + Number of shelves + Shelf Space. The p-value is not considered in this equation.
24) B Anytime the coefficient is positive and the p-value is approximately 0, then there is a positive impact and significant influence.
25) D ^Widgets = 7.21 − 0.2705(2) + 0.1904(2²) = $7.43. ^Widgets = 7.21 − 0.2705(3) + 0.1904(3²) = $8.11.
26) D ^Widgets = 7.23 − 0.2478(2) + 0.1954(2²) = $7.52. ^Widgets = 7.23 − 0.2478(3) + 0.1954(3²) = $8.25.
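
The arithmetic behind answers 25 and 26 can be checked with a short script. This is a minimal sketch in Python (not the R referenced elsewhere in this test bank); the function name predict_quadratic is a hypothetical helper, and the coefficients are the ones quoted in the two questions.

```python
def predict_quadratic(b0, b1, b2, x):
    """Predicted value from an estimated quadratic model: b0 + b1*x + b2*x^2."""
    return b0 + b1 * x + b2 * x ** 2


# Question 25: predicted average cost at 2 and 3 million units.
cost_25_at_2 = predict_quadratic(7.21, -0.2705, 0.1904, 2)
cost_25_at_3 = predict_quadratic(7.21, -0.2705, 0.1904, 3)
change_25 = cost_25_at_3 - cost_25_at_2

# Question 26: same calculation with the second set of coefficients.
cost_26_at_2 = predict_quadratic(7.23, -0.2478, 0.1954, 2)
cost_26_at_3 = predict_quadratic(7.23, -0.2478, 0.1954, 3)
change_26 = cost_26_at_3 - cost_26_at_2

print(round(change_25, 2), round(change_26, 2))
```

A positive difference means the predicted average cost rises as output moves from 2 to 3 million units.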

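27) D With b2 > 0, ŷ reaches a minimum where the partial effect equals zero, which happens when x = −b1/(2b2) = 0.3669/(2 × 0.0192) = 9.55 (in millions).

28) D With b2 > 0, ŷ reaches a minimum where the partial effect equals zero, which happens when x = −b1/(2b2) = 0.3802/(2 × 0.0198) = 9.60 (in millions).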
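
The turning point used in answers 27 and 28 follows from setting the slope b1 + 2·b2·x of the quadratic model to zero. The sketch below, in Python with a hypothetical helper named turning_point, computes x = −b1/(2·b2) and reports whether that point is a minimum (b2 > 0) or a maximum (b2 < 0).

```python
def turning_point(b1, b2):
    """Return (x, kind) where the quadratic b0 + b1*x + b2*x^2 changes direction."""
    if b2 == 0:
        raise ValueError("b2 must be nonzero for a quadratic model")
    x_star = -b1 / (2 * b2)
    kind = "minimum" if b2 > 0 else "maximum"
    return x_star, kind


x_27, kind_27 = turning_point(-0.3669, 0.0192)  # coefficients from question 27
x_28, kind_28 = turning_point(-0.3802, 0.0198)  # coefficients from question 28
print(round(x_27, 2), kind_27)
print(round(x_28, 2), kind_28)
```

Because both questions give a positive b2, the cost curve is U-shaped and the turning point is the cost-minimizing output level.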

29) B The higher the adjusted R² value, the more suited for predictive modeling.
30) A The slope coefficient b1 is 0.05.
31) A The slope coefficient b1 is 0.05.
32) B $2,000 in salary times 0.05 = $100.
33) C 10,000 units times 0.01 (1%) = 100 units.
34) C 6,000 units times 0.01 (1%) = 60 units.
35) B The formula to calculate the exact percentage change in E(y) is (exp(β1) − 1) × 100 = (exp(0.15) − 1) × 100 = 16.1834, or 16% (rounded).
36) B The formula to calculate the exact percentage change in E(y) is (exp(β1) − 1) × 100 = (exp(0.25) − 1) × 100 = 28.4025, or 28% (rounded).
37) B The logarithmic regression model has the following impact on the estimated slope: b1 × 0.01 measures the approximate change in ŷ when x increases by 1%.
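
The two interpretations used in answers 33 through 36 can be reproduced numerically. This is an illustrative Python sketch; the function names are hypothetical, and the inputs are the β1 values quoted in the questions.

```python
import math


def log_model_unit_change(beta1):
    """Logarithmic model: approximate unit change in E(y) for a 1% increase in x."""
    return beta1 * 0.01


def exponential_exact_pct_change(beta1):
    """Exponential model: exact percentage change in E(y) for a one-unit increase in x."""
    return (math.exp(beta1) - 1) * 100


change_33 = log_model_unit_change(10_000)
change_34 = log_model_unit_change(6_000)
pct_35 = exponential_exact_pct_change(0.15)
pct_36 = exponential_exact_pct_change(0.25)
print(round(change_33), round(change_34), round(pct_35, 2), round(pct_36, 2))
```

For small β1 the approximate figure β1 × 100 (15% and 25% here) sits close to the exact figure, and the gap widens as β1 grows.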

38) A A simple linear regression model has an impact of b1, which measures the approximate change in ŷ when x increases by 1 unit.

39) D The exponential model results, having an estimated slope impact of b1 × 100, measure the approximate percentage change in ŷ when x increases by 1 unit.

40) A ŷ = 17 + 2.50(14) − 0.25(14²) = 3.

41) D ŷ = 25 + 1.5(12) − 0.25(12²) = 7.

42) A For the estimated model ŷ = b0 + b1x1 + b2x2 + b3x1x2, the partial effects of both predictor variables are difficult to interpret. For example, the partial effect of x1 on ŷ, given by b1 + b3x2, depends on the given value of x2. Similarly, the partial effect of x2 on ŷ, given by b2 + b3x1, depends on the given value of x1.

43) B The lower RMSE and higher R² indicate the preferred model, in this case the training data.
44) C ŷ = −97.2850 + 5.5416(25) + 3.0928(5) − 0.1210(25)(5) = 41.59
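
Answers 42 and 44 both rest on the interaction model ŷ = b0 + b1x1 + b2x2 + b3x1x2. The Python sketch below uses hypothetical helper names and the coefficients from question 44 to compute the prediction and the two partial effects at x1 = 25 and x2 = 5.

```python
def predict_interaction(b0, b1, b2, b3, x1, x2):
    """Prediction from a model with an interaction term x1*x2."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


def partial_effect_x1(b1, b3, x2):
    """Partial effect of x1 on y-hat; it depends on the value of x2."""
    return b1 + b3 * x2


def partial_effect_x2(b2, b3, x1):
    """Partial effect of x2 on y-hat; it depends on the value of x1."""
    return b2 + b3 * x1


b0, b1, b2, b3 = -97.2850, 5.5416, 3.0928, -0.1210
y_hat_44 = predict_interaction(b0, b1, b2, b3, x1=25, x2=5)
effect_x1 = partial_effect_x1(b1, b3, x2=5)
effect_x2 = partial_effect_x2(b2, b3, x1=25)
print(round(y_hat_44, 2), round(effect_x1, 4), round(effect_x2, 4))
```

Changing x2 changes effect_x1, and changing x1 changes effect_x2, which is why neither partial effect can be read from a single coefficient.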

45) B The model is useful when only the predictor variable is better captured in percentages. Logarithmic and exponential regression models are different when β1 > 0 and β1 < 0.
46) D The exponential regression model allows us to estimate the percent change in E(y) when x increases by one unit.
47) C Overfitting occurs when a regression model is made overly complex to fit all the elements of given sample data, compromising its predictive power.
48) C In the log-log regression model, both the response variable and the predictor variable are transformed into natural logs. We write the model as ln(y) = β0 + β1ln(x) + ε.
49) B For 0 < β1 < 1, the log-log regression model implies a positive relationship between x and E(y); as x increases, E(y) increases at a slower rate.

50) B The method is called the leave-one-out method, where you leave out one observation for the validation set.
51) C The accuracy is increased comparing ŷ to y, although a larger validation set than 40 may produce different results.
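
The leave-one-out case in answer 50 is simply k-fold cross-validation with k equal to the sample size. The sketch below is a hypothetical Python helper that splits observation indices into k consecutive folds; it assumes no shuffling, whereas in practice the partition is usually drawn at random.

```python
def k_fold_indices(n, k):
    """Return a list of (training, validation) index lists for k consecutive folds."""
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    base, extra = divmod(n, k)
    folds = []
    start = 0
    for i in range(k):
        # The first `extra` folds get one additional observation.
        size = base + (1 if i < extra else 0)
        validation = list(range(start, start + size))
        training = [j for j in range(n) if j < start or j >= start + size]
        folds.append((training, validation))
        start += size
    return folds


# With k equal to n, every validation set holds exactly one observation.
leave_one_out = k_fold_indices(5, 5)
for training, validation in leave_one_out:
    print(training, validation)
```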

CHAPTER 9 TRUE/FALSE - Write 'T' if the statement is true and 'F' if the statement is false. 1) Dummy (binary) variables cannot be used as response variables. ⊚ true ⊚ false

2) In both linear probability and the logistic regression models, bj measures the partial effect of xj. ⊚ true ⊚ false

3) Odds range from 0 to 1, whereas probability ranges from 0 to infinity. ⊚ true ⊚ false

4) Logistic regression coefficients are easier to interpret using odds rather than probabilities. ⊚ true ⊚ false

5) There are no rules as to how the training and validation sets of the sample data should be partitioned but generally random draws are used. ⊚ true ⊚ false

6) In the linear probability regression model, the response variable y will equal 1 or 0 to represent the probability of success. ⊚ true ⊚ false

7) In the logistic regression model, estimates can be made with standard ordinary least squares procedures. ⊚ true ⊚ false

8) To test for the accuracy rate in a binary choice model, the number of correct classification observations for both outcomes should be reported. ⊚ true ⊚ false

9) In the k-fold cross-validation method, essentially the holdout cross-validation method is used k times with the average of the performance measures being used for model selection. ⊚ true ⊚ false

10) Specificity is the proportion of target class cases that are identified correctly. ⊚ true ⊚ false

15) Which R function do you use to construct a logistic regression model? A) glm B) lm C) predict D) summary

16) Genie is worried that her binary response model will not perform as well with data outside of the estimation sample. What technique should she use to assess the predictive power of her model? A) sample-validation B) population-validation C) model-validation D) cross-validation

17) In using both the linear probability regression and the logistic regression models for n = 40, the following table is the analysis of the holdout method. Based on the table, what is the impact of changing the ŷ to binary predictions?

Observation    y    ŷ (Model 1)    ŷ (Model 2)
18             1    1.0263         0.9907
19             0    0.1987         0.3032
:              :    :              :
40             1    0.1509         0.1989
Accuracy            100%           100%

A) There is no relevant outcome changing ŷ to a binary variable. B) By comparing ŷ to y, the validation results are not able to be validated. C) By comparing ŷ to y, the accuracy of the model is 100%, but can change with a larger validation set. D) By comparing ŷ to y, the accuracy of the model is 100% and will not change based on a larger validation set.

18) Which of the following is not a step for the holdout cross-validation method? A) Partition the sample data into two parts, labeled training set and validation set. B) Use the training set to estimate competing models. C) Use these estimates to make predictions in the validation set. D) Calculate the accuracy rates and select the model with the lowest value.

19) Which of the following statements regarding odds is false?

20) What is the probability of a soccer team winning with odds of 5 (i.e., 5 to 1 odds)? A) 5:1 B) 0.20 C) 0.80 D) 0.83

21) What is the probability of a soccer team winning with odds of 9 (i.e., 9 to 1 odds)? A) 0.20 B) 0.90 C) 9:1 D) 9:10

22) What are the odds of a tennis player winning with a probability of 0.20? A) 0.20 B) 0.25 C) 0.75 D) 0.80

23) What are the odds of a tennis player winning with a probability of 0.80? A) 0.20 B) 0.25 C) 4 D) 8

24) Given the following logistic regression model results, what is the percentage change in the odds when x2 increases by one unit holding x1 constant?

A) −49.69 B) −35.65 C) 35.65 D) 49.69

25) Given the following logistic regression model results, calculate the odds when x1 = 10 and x2 = 7?

A) 0.079 B) 0.185 C) 1.580 D) 1.768

26) Which of the following goodness-of-fit measures assesses how well the model fits the data for binary choice models? A) standard error of the estimate B) coefficient of determination R2 C) adjusted R2 D) accuracy rate

27) Which R function do you use to compute the predicted probabilities of a binary choice model? A) glm B) lm C) predict D) summary

28) Elise computed the linear probability model and logistic regression model accuracy rates for 300 observations. The binary predictions match 73.8 percent of the time for the linear probability model and 238 times for the logistic regression model. Using the preferred model, calculate ŷ for x1 = 50 and x2 = 10.

Coefficients    Linear Probability Model    Logistic Regression Model
Intercept       0.5334                      −1.0555
x1              0.0164                      0.1439
x2              −0.0549                     −0.4025

A) 58.22% B) 69.09% C) 80.44% D) 89.23%

29) Given a dataset labeled myData with 750 observations, how do you partition a training set (T) and validation set (V) in R to use 70% of the observations to train the data? A) myData <- TData[1:525,]; myData <- VData[526:750,] B) myData <- TData[1:225,]; myData <- VData[226:750,] C) TData <- myData[1:225,]; VData <- myData[226:750,] D) TData <- myData[1:525,]; VData <- myData[526:750,]

30) Given the following accuracy rates for k-fold cross-validation with k = 4, which model will we choose to make predictions and what is the accuracy rate for that model?

Observations in the Validation Set    Model 1    Model 2
301-400                               69.5%      72.2%
201-300                               70.2%      68.9%
101-200                               71.8%      71.5%
1-100                                 67.3%      69.4%

A) Model 1; 69.7% B) Model 1; 71.8% C) Model 2; 70.5% D) Model 2; 72.2%

31) Consider a binary response variable y and a predictor variable x that varies between 0 and 5. The linear model is estimated as ŷ = −2.90 + 0.65x. What is the estimated probability for x = 5? A) 0.35 B) 6.15 C) 0.65 D) −6.15

32) How well the binary choice model predicts the nontarget class cases is called _____________. A) Accuracy B) Adjusted R² C) Sensitivity D) Specificity

33) Using a sample of 50, the following regression output is obtained from estimating the linear probability model y = β0 + β1x + ε. What is the predicted probability when x = 14?

             Coefficients    Standard Error    t Stat    P-value
Intercept    4.03            0.40              1.23      0.0001
X            −1.98           0.04              −4.55     0.0000

A) 4.42 B) −23.69 C) 3.86 D) 0.72

34) Using a sample of 50, the following regression output is obtained from estimating the linear probability model y = β0 + β1x + ε. What is the predicted probability when x = 14?

             Coefficients    Standard Error    t Stat    P-value
Intercept    4.14            0.30              1.45      0.0001
X            −0.02           0.03              −4.65     0.0000

A) 4.42 B) 3.86 C) 8.34 D) 0.72

35) The following table contains the parameter estimates of the linear probability model and the logistic regression model. When considering a binary response variable y and two predictor variables, x1 and x2, what is the predicted probability implied by the logistic regression model for x1 = 2 with x2 = 15? (Hint: for logit, the model is ŷ = exp(b0 + b1x1 + b2x2) / (1 + exp(b0 + b1x1 + b2x2)).)

Variable     Linear Probability    Logistic
Intercept    −0.98                 −5.80
x1           0.48                  1.46
x2           −0.02                 −0.26

A) 0.838 B) 0.001 C) 0.006 D) −5.160

36) The following table contains the parameter estimates of the linear probability model and the logistic regression model. When considering a binary response variable y and two predictor variables, x1 and x2, what is the predicted probability implied by the logistic regression model for x1 = 2 with x2 = 15? (Hint: for logit, the model is ŷ = exp(b0 + b1x1 + b2x2) / (1 + exp(b0 + b1x1 + b2x2)).)

Variable     Linear Probability    Logistic
Intercept    −0.76                 −4.1
x1           0.43                  1.12
x2           −0.02                 −0.22

A) 0.838 B) 0.006 C) −5.16 D) 0.005

37) The following table contains the parameter estimates of the linear probability model and the logistic regression model. When considering a binary response variable y and two predictor variables, x1 and x2, what is the estimated probability implied by the linear probability model for x1 = 3 with x2 = 9?

Variable     Linear Probability    Logistic
Intercept    −0.62                 −1.13
x1           0.45                  0.90
x2           −0.02                 −0.20

A) 1.87 B) −0.35 C) 0.55 D) 0.35

38) The following table contains the parameter estimates of the linear probability model and the logistic regression model. When considering a binary response variable y and two predictor variables, x1 and x2, what is the estimated probability implied by the linear probability model for x1 = 3 with x2 = 9?

Variable     Linear Probability    Logistic
Intercept    −0.76                 −1.12
x1           0.43                  0.85
x2           −0.02                 −0.18

A) 1.87 B) −0.35 C) 0.35 D) −0.87

39) Specificity and sensitivity are particularly relevant when _____________. A) the response variable has many ones and a few zeros, or many zeros and a few ones B) the response variable has approximately the same number of ones and zeroes C) the less likely outcome is correctly classified highly accurately D) the response variable is based on the number of times heads is the correct answer

40) A linear regression model applied to a binary response variable is called a _____________. A) logistic regression model B) log-transformed model C) log-log regression model D) linear probability model

41) Which interpretation of a linear probability model represents the estimate ^P = −40 + 0.05x? A) For each 1 unit increase in x, the predicted probability ^P decreases by 0.05. B) For each 40 unit decrease in x, the predicted probability ^P increases by 40. C) For each 1 unit increase in x, the predicted probability ^P increases by 0.05. D) For each 0.05 unit increase in x, the predicted probability ^P decreases by 40.

42) Given the following accuracy, sensitivity, and specificity measures, which model is the preferred model?

Measure        Model 1    Model 2
Accuracy       69.5%      72.2%
Sensitivity    68.9%      70.2%
Specificity    71.8%      71.5%

A) Model 1 because specificity is higher than Model 2. B) Model 2 because accuracy and sensitivity are higher than Model 1. C) Model 2 because accuracy is higher than Model 1. D) Model 2 because the average of the measures is higher than Model 1.

43) Given the following accuracy, sensitivity, and specificity measures, which model is the preferred model?

Measure        Model 1    Model 2
Accuracy       79.5%      77.2%
Sensitivity    78.9%      78.2%
Specificity    81.3%      81.3%

A) Model 1 because accuracy and sensitivity are higher than Model 2. B) Model 1 because accuracy is higher than Model 2. C) Model 1 because the average of the measures is higher than Model 2. D) Model 2 because specificity is higher than Model 1.

Answer Key
Test name: Chap 09_2e_Jaggia
1) FALSE Dummy (binary) variables can be used both as predictor and response variables.
2) FALSE In linear probability models, bj measures the partial effect of xj. In the logistic regression model, however, the interpretation of bj is not straightforward. While bj conveys whether the relationship between xj and p̂ is positive or negative, it does not imply a unique partial effect of xj on p̂.
3) FALSE Probability ranges between 0 and 1, whereas odds range between 0 and infinity.
4) TRUE Since a logistic model does not allow us to determine the partial effect of a predictor variable on the probability, it is often preferable to interpret logistic regression coefficients in terms of odds rather than probabilities.
5) TRUE We start by partitioning the data into two independent and mutually exclusive data sets: the training set and the validation set. There is no rule as to how the sample data should be partitioned. We generally use random draws when partitioning the data.
6) TRUE In the linear probability model, y is a binary response variable equal to 1 or 0. This allows for the probability of success to be determined.
7) FALSE

The logistic regression model cannot be estimated using standard ordinary least squares (OLS) procedures, but it can be estimated using the method of maximum likelihood estimation (MLE).
8) TRUE The model focuses on the number of correct predictions divided by total observations times 100. By only focusing on one accuracy rate, the larger picture of improving the cutoff value may be ignored.
9) TRUE In the k-fold cross-validation method, we partition the data into k subsets, and the one that is left out in each iteration is the validation set. In other words, we perform the holdout method k times and use the average of the performance measures for model selection.
10) FALSE Sensitivity is the proportion of target class cases that are classified correctly. Specificity is the proportion of nontarget class cases that are identified correctly.
11) B Because the regression coefficient b1 = 0.24 is positive, we can infer that x exerts a positive influence on p̂.

12) A Because the regression coefficient b1 = −0.84 is negative, we can infer that x exerts a negative influence on p̂.

13) C p̂ =ŷ =

14) D p̂ =ŷ =

15) A The function glm is used in R to construct a logistic regression model while lm constructs linear models. The function predict is used to find predicted probabilities and summary is used to view the output.

16) D A useful method to assess the predictive power of a model is to test it on a data set not used in estimation. We use cross-validation to assess models by partitioning the data into a training set to build (train) the model and a validation set to evaluate (validate) it.
17) C The accuracy is increased comparing ŷ to y, although a larger validation set than 40 may produce different results.
18) D Accuracy rates should be calculated, but the model with the highest value is the model that should be selected.
19) C Odds range between 0 and infinity, whereas probability ranges between 0 and 1.

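20) D Odds can be used to calculate probability using the formula p = odds/(1 + odds) = 5/(1 + 5) = 0.83.

21) B Odds can be used to calculate probability using the formula p = odds/(1 + odds) = 9/(1 + 9) = 0.90.

22) B Odds are computed using the formula odds = p/(1 − p) = 0.20/(1 − 0.20) = 0.25.

23) C Odds are computed using the formula odds = p/(1 − p) = 0.80/(1 − 0.80) = 4.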
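
The conversions in answers 20 through 23 are inverses of each other: p = odds/(1 + odds) and odds = p/(1 − p). A minimal Python sketch with hypothetical function names, applied to the values from those four questions:

```python
def odds_to_probability(odds):
    """Convert odds (0 to infinity) into a probability (0 to 1)."""
    return odds / (1 + odds)


def probability_to_odds(p):
    """Convert a probability strictly below 1 into odds."""
    if not 0 <= p < 1:
        raise ValueError("probability must be in [0, 1)")
    return p / (1 - p)


p_20 = odds_to_probability(5)        # question 20
p_21 = odds_to_probability(9)        # question 21
odds_22 = probability_to_odds(0.20)  # question 22
odds_23 = probability_to_odds(0.80)  # question 23
print(round(p_20, 2), round(p_21, 2), round(odds_22, 2), round(odds_23, 2))
```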

24) A
25) C The odds can now be derived as odds =

26) D There is no universal goodness-of-fit measure for binary choice models to assess how well the model fits the data. It is common to assess the performance of binary choice models based on the accuracy rate, defined as the percentage of correctly classified observations.
27) C The function predict is used to find predicted probabilities in binary choice models. The function glm is used in R to construct a logistic regression model while lm constructs linear models. The function summary is used to view the output.
28) D Since the logistic regression model was correct 238/300 = 79.3% of the time, the logistic model is preferred over the linear probability model with 73.8%. Therefore, we calculate ŷ = exp(−1.0555 + 0.1439(50) − 0.4025(10)) / (1 + exp(−1.0555 + 0.1439(50) − 0.4025(10))) = 0.8923.
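
Answer 28 combines two steps: compare accuracy rates, then predict with the preferred model. The sketch below is illustrative Python (the test bank itself refers to R's glm and predict); logistic_probability is a hypothetical helper, and the coefficients are the logistic estimates given in question 28.

```python
import math


def logistic_probability(intercept, coefficients, values):
    """Predicted probability exp(z) / (1 + exp(z)), with z the linear combination."""
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1 / (1 + math.exp(-z))  # algebraically equal to exp(z) / (1 + exp(z))


# Step 1: accuracy rates on the 300 observations.
accuracy_linear = 73.8
accuracy_logistic = 238 / 300 * 100

# Step 2: predict with the logistic model at x1 = 50, x2 = 10.
p_hat_28 = logistic_probability(-1.0555, [0.1439, -0.4025], [50, 10])
print(round(accuracy_logistic, 1), round(p_hat_28, 4))
```

Unlike the linear probability model, this prediction always stays strictly between 0 and 1, whatever the values of x1 and x2.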

29) D If a dataset has 750 observations and 70% are used to develop the model, the training dataset will need 525 observations and the remaining 225 observations will be used to validate the data.
30) C The model with the highest average accuracy will be the preferred model. In this case, Model 1 has an average accuracy rate of 69.7%, whereas Model 2 has an average accuracy of 70.5%; thus Model 2 is considered the superior model.
31) A −2.90 + 0.65(5) = 0.35; thus the estimated probability for x = 5 is 0.35.
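
The model selection in answer 30 averages the four fold-level accuracy rates and picks the larger mean. A short Python sketch using the rates from question 30; the variable names are illustrative only.

```python
# Accuracy rates (%) for each of the k = 4 validation folds, from question 30.
fold_accuracy = {
    "Model 1": [69.5, 70.2, 71.8, 67.3],
    "Model 2": [72.2, 68.9, 71.5, 69.4],
}

# Average the fold-level rates for each model.
average_accuracy = {
    model: sum(rates) / len(rates) for model, rates in fold_accuracy.items()
}

# The preferred model is the one with the highest average accuracy.
preferred_model = max(average_accuracy, key=average_accuracy.get)
print(preferred_model, round(average_accuracy[preferred_model], 1))
```

Model 2 wins on the average even though Model 1 has the higher rate in two of the four folds.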

32) D Specificity is the proportion of nontarget class cases that are classified correctly, whereas sensitivity is the proportion of target class cases that are classified correctly.
33) B The estimated linear probability model is ŷ = b0 + b1x. ŷ = 4.03 − 1.98(14) = −23.69 is the predicted probability when x = 14.

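34) B The estimated linear probability model is ŷ = b0 + b1x. ŷ = 4.14 − 0.02(14) = 3.86 is the predicted probability when x = 14.
35) B ŷ = exp(−5.80 + 1.46(2) − 0.26(15)) / (1 + exp(−5.80 + 1.46(2) − 0.26(15))) = exp(−6.78) / (1 + exp(−6.78)) ≈ 0.001.

36) B ŷ = exp(−4.1 + 1.12(2) − 0.22(15)) / (1 + exp(−4.1 + 1.12(2) − 0.22(15))) = exp(−5.16) / (1 + exp(−5.16)) ≈ 0.006.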

37) C The estimated linear probability is ŷ = b0 + b1x1 + b2x2 = −0.62 + 0.45(3) − 0.02(9) = 0.55.
38) C The estimated linear probability is ŷ = b0 + b1x1 + b2x2 = −0.76 + 0.43(3) − 0.02(9) = 0.35.
39) A Specificity and sensitivity are particularly relevant when the response variable has many ones and a few zeros, or many zeros and a few ones, and when the less likely outcome is classified poorly.
40) D The linear probability model is a linear regression model applied to a binary response variable.
41) C Each unit increase in x increases the predicted probability by 0.05.
42) B Model 2 is preferred because in addition to its higher accuracy, it also has higher sensitivity. Specificity is slightly lower than Model 1 but only by 0.3%.
43) A Model 1 is preferred because in addition to its higher accuracy, it also has higher sensitivity. Specificity is the same in both models.
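
The three measures compared in answers 42 and 43 can be computed from actual outcomes and binary predictions. The Python sketch below uses made-up data and an assumed cutoff of 0.5 purely for illustration; none of these numbers come from the test bank.

```python
def classification_measures(actual, predicted):
    """Return (accuracy, sensitivity, specificity) for 0/1 outcomes and predictions."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    true_neg = sum(1 for a, p in pairs if a == 0 and p == 0)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    false_neg = sum(1 for a, p in pairs if a == 1 and p == 0)
    accuracy = (true_pos + true_neg) / len(pairs)
    sensitivity = true_pos / (true_pos + false_neg)  # target class (y = 1)
    specificity = true_neg / (true_neg + false_pos)  # nontarget class (y = 0)
    return accuracy, sensitivity, specificity


# Hypothetical validation data: few ones, many zeros.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
probabilities = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1, 0.3, 0.45, 0.05, 0.35]
cutoff = 0.5  # assumed cutoff for converting probabilities to binary predictions
predicted = [1 if p >= cutoff else 0 for p in probabilities]

accuracy, sensitivity, specificity = classification_measures(actual, predicted)
print(accuracy, round(sensitivity, 3), round(specificity, 3))
```

With an unbalanced response like this one, accuracy alone can look acceptable while sensitivity for the rarer class is noticeably lower.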

CHAPTER 10 TRUE/FALSE - Write 'T' if the statement is true and 'F' if the statement is false. 1) Aimee’s bookstore had a 45% increase in profits on Wednesday, June 12th, over the previous year’s sales. Without the presence of a holiday, events in the area, or sale promotion, this business event is considered random. ⊚ true ⊚ false

2) The use of quantitative forecast can be criticized because biases in optimism and overconfidence may skew the results. ⊚ true ⊚ false

3) In a 3-period moving average, when a new observation becomes available, the highest numerical observation is dropped. ⊚ true ⊚ false

4) When a time series is expected to grow by fixed amounts each time period, then the linear trend model should be used. ⊚ true ⊚ false

5) When visually inspecting data to confirm the existence of a trend, a scatterplot of the data with a superimposed linear trend line is advisable to view the series over time. ⊚ true ⊚ false

6) In reviewing stock growth in Amazon, the linear trend model would be best used when an increase in the series happens over time. ⊚ true ⊚ false



7) If a time series reverses direction, then a quadratic trend model will allow for the curvature to be graphed. ⊚ true ⊚ false

8) By combining the validation and the training set, the sample is larger for estimation and includes the most recent validation set for predictions. ⊚ true ⊚ false

9) When a time series exhibits seasonal variations, the Holt exponential smoothing method, or double exponential smoothing method, is appropriate to capture the upward and downward movement of the time series. ⊚ true ⊚ false

10) The triple exponential smoothing method uses seasonality variations in the analysis of the data. ⊚ true ⊚ false

11) The functionality to perform the moving average or simple exponential smoothing techniques in Excel can be found on the Formulas ribbon. ⊚ true ⊚ false

12) Causality can be determined using the model y = β0 + β1d1 + β2d2 + β3d3 + β4t + ε. ⊚ true ⊚ false

13) The linear trend model with seasonality y = β0 + β1d1 + β2d2 + β3d3 + β4t + ε can be modified to reflect an exponential trend by using the model ln(y) = β0 + β1d1 + β2d2 + β3d3 + β4t + ε. ⊚ true ⊚ false

MULTIPLE CHOICE - Choose the one alternative that best completes the statement or answers the question. 14) Sydney is evaluating monthly sales for her Etsy account. Based on the given data y1 = 4,170; y2 = 4,002; y3 = 4,245, what is her 3-period moving average? A) 4,139 B) 4,245 C) 4,086 D) 4,207

15) Sydney is evaluating monthly sales for her Etsy account. Based on the given data y1 = 4,321; y2 = 3,876; y3 = 4,190, what is her 3-period moving average? A) 4,129 B) 4,190 C) 4,098 D) 4,255

16) Mark is using a 3-period moving average to forecast the number of filters needed for the fourth quarter. Using the following data, what is the forecasted amount? Quarter 1 2 3 4 1 2 3 4

Filters 43 32 38 37 37 32 40 ?


A) 35 Filters B) 38 Filters C) 36 Filters D) 40 Filters

17) Mark is using a 3-period moving average to forecast the number of filters needed for the fourth quarter. Using the following data, what is the forecasted amount? Quarter 1 2 3 4 1 2 3 4

Filters 43 32 38 37 41 35 40 ?

A) 37 Filters B) 38 Filters C) 39 Filters D) 40 Filters

18) Using only the residual in the following table, calculate the overall MSE.

Quarter   y       ŷ          e = y − ŷ   e²
4         2,789   2,984.00   −195        38,025
1         3,009   2,956.33   53          2,774
2         2,787   2,924.67   −138        18,953
3         2,968   2,721.00   247         61,009
Total                                    120,761

A) 2,896.50 B) 2,888.25 C) 30,190.25 D) 222.83



19) Using only the residual in the following table, calculate the overall MSE.

Quarter   y       ŷ          e = y − ŷ   e²
4         2,789   2,984.00   −195.00     38,025
1         3,009   2,956.33   52.67       2,774
2         2,965   2,924.67   40.33       1,627
3         2,876   2,921.00   −45.00      2,025
Total                                    44,451

A) 2,946.50 B) 2,909.75 C) 11,112.75 D) 210.83

20) Complete the simple exponentially smoothed series on the following table where α = 0.40 and L1 = y1.

Quarter 1 2 3

y 2,872 3,104 3,006

Lt 2,872 2,819 ?

A) 3,045.20 B) 2,893.80 C) 2,931.20 D) 2,952.40

21) Complete the simple exponentially smoothed series on the following table where α = 0.40 and L1 = y1.

Quarter 1 2 3

y 2,872 3,104 2,976

Lt 2,872 2,964.8 ?


A) 3,027.20 B) 2,969.28 C) 2,971.52 D) 2,934.40

22) Consider the following estimated linear trend model and make a forecast for t = 16: ŷ = 11.76 + 1.05t. A) 0.76 B) 1.79 C) 27.76 D) 28.56

23) Consider the following estimated linear trend model and make a forecast for t = 18: ŷ = 11.23 + 1.04t. A) 0.68 B) 1.67 C) 29.23 D) 29.95

24) Consider the following quadratic trend model and make a forecast for t = 18: ŷ = 11.57 + 1.07t + 0.03t². A) 31.91 B) 40.55 C) 29.75 D) 48.24

25) Consider the following quadratic trend model and make a forecast for t = 20: ŷ = 15.84 + 0.98t + 0.02t². A) 35.08 B) 43.44 C) 25.80 D) 45.10

26)

Consider the following exponential trend model and make a forecast for t = 21 and . A) 34.48 B) 34.52 C) 34.40 D) 33.45



28) Based on the following plotted series of linear and exponential trends with an overlay of observation points, which model is the best fit for the observed data?

A) the exponential trend model B) the linear trend model C) neither model D) the quadratic model

29) The polynomial trend model that allows for two changes in the direction of a series is called the __________blank. A) quadratic trend model B) cubic trend model C) linear trend model D) exponential trend model



30) What is the polynomial trend model that allows for one change in direction of a series called? A) quadratic trend model B) cubic trend model C) linear trend model D) exponential trend model

31) Which option is not a description of polynomial trend models? A) Higher level polynomials run the risk of overfitting. B) MSE, MAD, MAPE cannot be used to compare polynomial trend models. C) Adjusted R² is used to compare polynomial models. D) The coefficient β2 determines the direction of the linear trend line.

32) Using the following table of results, what is the estimated linear trend model for the sample?

Regression Statistics
Multiple R          0.918811476
R Square            0.852298529
Adjusted R Square   0.835559781
Standard Error      0.8270129490
Observations        18

               Coefficients
Intercept      12.647631580
X Variable 1   0.767368421

A) ŷ = 0.91881 + 0.7674t B) ŷ = 12.6476 + 0.7674t C) ŷ = 0.8523 + 0.83556t D) ŷ = 0.83556 + 12.6476t

33) Using the following table of results, what is the estimated linear trend model for the sample?

Regression Statistics
Multiple R          0.918811476
R Square            0.844214529
Adjusted R Square   0.835559781
Standard Error      0.820824949
Observations        20

               Coefficients
Intercept      11.76063158
X Variable 1   0.314368421

A) ŷ = 0.91881 + 0.3144t B) ŷ = 11.7606 + 0.3144t C) ŷ = 0.84421 + 0.83556t D) ŷ = 0.83556 + 11.7606t

34) Using the following table of results, what is the estimated quadratic trend model for the sample?

Regression Statistics
Multiple R          0.976507605
R Square            0.953567103
Adjusted R Square   0.952481409
Standard Error      0.46111766
Observations        20

               Coefficients   Standard Error   t Stat     p-value
Intercept      12.2736316     0.343063         29.3388    5.3E-16
X Variable 1   0.875136651    0.075239         10.32438   9.68E-09
X Variable 2   −0.01449539    0.00348          −6.32741   7.58E-06

A) ŷ = 12.2736 + 0.8751t − 6.3274t² B) ŷ = 12.2736 + 0.9525t + 0.0034t² C) ŷ = 12.2736 + 0.0034t + 0.0145t² D) ŷ = 12.2736 + 0.8751t − 0.0145t²



35) Using the following table of results, what is the estimated quadratic trend model for the sample?

Regression Statistics
Multiple R          0.976507605
R Square            0.953567103
Adjusted R Square   0.948104409
Standard Error      0.46111766
Observations        20

               Coefficients   Standard Error   t Stat     p-value
Intercept      10.0650614     0.343063         29.3388    5.30E-16
X Variable 1   0.776796651    0.075239         10.32438   9.68E-09
X Variable 2   −0.02202039    0.00348          −6.32741   7.58E-06

A) ŷ = 10.0651 + 0.7768t − 6.3274t² B) ŷ = 10.0651 + 0.9481t + 0.0034t² C) ŷ = 10.0651 + 0.0034t + 0.0220t² D) ŷ = 10.0651 + 0.7768t − 0.0220t²

36) Based on the following results, which model is the best fit for the data?

Summary Output: Linear Trend Model
Regression Statistics
Multiple R          0.918811476
R Square            0.844214529
Adjusted R Square   0.835559781
Standard Error      0.820824949
Observations        20

Summary Output: Quadratic Trend Model
Regression Statistics
Multiple R          0.976507605
R Square            0.953567103
Adjusted R Square   0.948104409
Standard Error      0.46111766
Observations        20


A) The adjusted R2 for the quadratic model is higher, thus best fit for the data. B) The adjusted R2 for the linear model is lower, thus best fit for the data. C) The R2 for the quadratic model is higher, thus best fit for the data. D) The R2 for the linear model is lower, thus best fit for the data.

37) Using the following table, estimate the quadratic trend model with seasonal dummy variables.

Regression Statistics
Multiple R          0.935185002
R Square            0.874570988
Adjusted R Square   0.855988912
Standard Error      0.054863436
Observations        32

               Coefficients
Intercept      1.415457253
X Variable 1   0.017428641
X Variable 2   −0.073442595
X Variable 3   −0.06362285
X Variable 4   −0.004004323
X Variable 5   0.000294365

A) ŷ = 0.9352 + 0.0174d1 − 0.0734d2 − 0.0636d3 − 0.0040t + 0.0003t² B) ŷ = 1.4155 + 0.0174d1 − 0.0734d2 − 0.0636d3 − 0.0040t + 0.0003t² C) ŷ = 0.8560 + 0.0174d1 − 0.0734d2 − 0.0636d3 − 0.0549t + 0.0003t² D) ŷ = 1.4155 + 0.9352d1 − 0.8746d2 − 0.8560d3 − 0.0549t + 0.0003t²

38) Using the following results, estimate the linear trend model with seasonal dummy variables.

Regression Statistics
Multiple R          0.918811476
R Square            0.852822529
Adjusted R Square   0.852395529
Standard Error      0.053583436
Observations        32

               Coefficients
Intercept      2.433809156
X Variable 1   0.095028641
X Variable 2   −0.068102865
X Variable 3   −0.05981912
X Variable 4   −0.01356797

A) ŷ = 2.4338 + 0.0950d1 + 0.0681d2 + 0.0598d3 + 0.0136t B) ŷ = 2.4338 − 0.0950d1 − 0.0681d2 − 0.0598d3 + 0.0136t C) ŷ = 2.4338 + 0.0950d1 − 0.0681d2 − 0.0598d3 − 0.0136t D) ŷ = 2.4338 − 0.0950d1 + 0.0681d2 + 0.0598d3 + 0.0136t

39) Using the following results, estimate the linear trend model with seasonal dummy variables.

Regression Statistics
Multiple R          0.935185002
R Square            0.874570988
Adjusted R Square   0.855988912
Standard Error      0.054863436
Observations        32

               Coefficients
Intercept      1.470209156
X Variable 1   0.017428641
X Variable 2   −0.072853865
X Variable 3   −0.06303412
X Variable 4   −0.01371837

A) ŷ = 1.4702 + 0.0174d1 + 0.0729d2 + 0.0630d3 + 0.0137t B) ŷ = 1.4702 − 0.0174d1 − 0.0729d2 − 0.0630d3 + 0.0137t C) ŷ = 1.4702 + 0.0174d1 − 0.0729d2 − 0.0630d3 − 0.0137t D) ŷ = 1.4702 − 0.0174d1 + 0.0729d2 + 0.0630d3 + 0.0137t

40) Mary is determining the maximum quantity given the seasonality is constant. If the coefficients are 113t and −23.1326t² in a quadratic model, what is the maximum quantity that can be reached? A) 4.88 B) 0.102 C) 10.23 D) 2

41) Mary is determining the maximum quantity given the seasonality is constant. If the coefficients are 112t and −23.4502t² in a quadratic model, what is the maximum quantity that can be reached? A) 4.78 B) 0.015 C) 9.10 D) 2

42) Which one of the following is not a step in cross-validation with time series? A) Use both the training and validation set to re-estimate the preferred model. B) Split data series into early and later periods, representing both training and validation set. C) Determine the proper forecast model based on R² results. D) Explore suitable forecasting models to compute MSE, MAD, and MAPE.

43) In the __________blank model, to estimate, the response variable is measured in natural logs, ln (yt), and then a regression is run of ln (yt) on t. A) linear B) exponential C) cubed D) quadratic

44) The following model yt = β0 + β1t + β2t² + β3t³ + εt allows for what changes in a series? A) allows for a curvature indicating the change of direction B) allows for one change in the direction of a series C) allows for two changes in the directions of a series D) allows for the increase in the time period and standard error of the estimate

45) Martin owns a food truck that for the past five years has frequented local festivals selling fried cheese curds. He has experienced a variation in sales for no known reason and wants to develop a forecast using the exponential smoothing method. After loading data from the previous 20 days, the following summary table was calculated. What is the MSE?

Day     Curds   At      ŷt     et = yt − ŷt   e²        |e|
1       70
2       69      69.7    70     −1             1         1
3       67      68.89   69.7   −2.7           7.29      2.7
…       …       …       …      …              …         …
20      61      63.03   63.9   −2.898         8.39682   2.8977
Total                                         213.799   52.042

A) 10.69 B) 4.11 C) 11.25 D) 2.739

46) Martin owns a food truck that for the past five years has frequented local festivals selling fried cheese curds. He has experienced a variation in sales for no known reason and wants to develop a forecast using the exponential smoothing method. After loading data from the previous 20 days, the following summary table was calculated. What is the MSE?

Day     Curds   At      ŷt     et = yt − ŷt   e²        |e|
1       70
2       69      69.7    70     −1             1         1
3       67      68.89   69.7   −2.7           7.29      2.7
…       …       …       …      …              …         …
20      61      63.03   63.9   −2.898         8.39682   2.8977
Total                                         206.064   51.211

A) 10.85 B) 4.02 C) 10.30 D) 2.695

47) Martin owns a food truck that for the past five years has frequented local festivals selling fried cheese curds. He has experienced a variation in sales for no known reason and wants to develop a forecast using the exponential smoothing method. After loading data from the previous 20 days, the following summary table was calculated. What is the MAD?

Day     Curds   At      ŷt     et = yt − ŷt   e²        |e|
1       70
2       69      69.7    70     −1             1         1
3       67      68.89   69.7   −2.7           7.29      2.7
…       …       …       …      …              …         …
20      61      63.03   63.9   −2.898         8.39682   2.8977
Total                                         212.920   51.776

A) 10.65 B) 4.11 C) 11.21 D) 2.725

48) Martin owns a food truck that for the past five years has frequented local festivals selling fried cheese curds. He has experienced a variation in sales for no known reason and wants to develop a forecast using the exponential smoothing method. After loading data from the previous 20 days, the following summary table was calculated. What is the MAD?

Day     Curds   At      ŷt     et = yt − ŷt   e²        |e|
1       70
2       69      69.7    70     −1             1         1
3       67      68.89   69.7   −2.7           7.29      2.7
…       …       …       …      …              …         …
20      61      63.03   63.9   −2.898         8.39682   2.8977
Total                                         206.064   51.211

A) 10.30 B) 4.02 C) 10.85 D) 2.695

49) Martin is handed quarterly sales data from a small subsidiary. Prior to creating a strategy, he wants to forecast the Q4 results to have an idea of the full year potential sales. What is the most popular technique Martin can use to compute the Q4 sales estimate? A) quadratic trend model B) 3-period moving average technique C) polynomial trend model D) exponential smoothing model

50) Consider the following table of the derivations for the MSE, MAD, and MAPE in the validation set. Based on the results, which model is preferred and why?

        Linear   Exponential
MSE     144.64   120.43
MAD     11.06    10.82
MAPE    10.4     2.45

A) Exponential, because the MSE, MAD, and MAPE are consistently lower. B) Both Linear and Exponential are preferred models because the results are positive. C) Linear, because the MSE, MAD, and MAPE are consistently higher. D) Neither, because an adjusted R2 is needed for determination.

51) Which method would be the best fit for a sample containing seasonality, but no trend, and is further divided into structures depending on the type of seasonality exhibited by the series?



A) the exponential method B) the Holt exponential smoothing method C) the Holt-Winters exponential smoothing method D) quadratic trend model

52) When constructing a quick review table in Excel, knowing the formula is essential in populating the cells correctly. Using the Holt exponential smoothing method, complete the following table for Year 3 where α = 0.1 and β = 0.2.

Year   yt       Lt       Tt       Ft
1      78,000
2      90,000   78,000   11,000
3      98,000   91,900   ?        89,000

A) 6,400 B) 9,180 C) 7,580 D) 11,580

53) When constructing a quick review table in Excel, knowing the formula is essential in populating the cells correctly. Using the Holt exponential smoothing method, complete the following table for Year 3 where α = 0.1 and β = 0.2.

Year   yt       Lt       Tt       Ft
1      78,000
2      89,000   78,000   11,000
3      98,000   89,900   ?        89,000

A) 6,680 B) 8,980 C) 8,680 D) 11,180



54) When constructing a quick review table in Excel, knowing the formula is essential in populating the cells correctly. Using the Holt exponential smoothing method, complete the following table for Year 4 where α = 0.1 and β = 0.2.

Year   yt       Lt       Tt       Ft
1      78,000
2      89,000   78,000   11,300
3      98,000   92,900   12,020   89,000
4      98,000   ?        11,000   89,000

A) 103,536 B) 104,228 C) 103,080.8 D) 114,720.0

55) When constructing a quick review table in Excel, knowing the formula is essential in populating the cells correctly. Using the Holt exponential smoothing method, complete the following table for Year 4 where α = 0.1 and β = 0.2.

Year   yt        Lt       Tt          Ft
1      78,000
2      89,000    78,000   11,000
3      98,000    89,900   11,180      89,000
4      115,000   ?        11,458.40   101,080

A) 103,864 B) 102,472 C) 104,510.8 D) 114,625.6

56) When performing a cross-validation of regression model with R, in the forecast package, we use the __________blank function to find the number of observations in the validation set.



A) length B) forecast C) tslm D) accuracy

57) When using R to perform the Holt-Winters exponential smoothing method, in forecast, we use the __________blank function with the model inputs of ‘AAA’ and ‘AAM’. A) ts B) window C) ets D) fAdd

58) Of the smoothing methods, which one does the level Lt, as well as the trend Tt, adapt over time and is a best fit when the time series expresses no seasonality? A) simple exponentially smoothing technique B) the moving average technique C) the Holt exponential smoothing method D) the Holt-Winters exponential smoothing method

59) Which of the following model performance measures gives you a sense of the magnitude of the errors by showing the error as a percentage of the actual value? A) MAD B) MAPE C) MSE D) RMSE

60) Which of the following is not a valid reason for using a simple smoothing technique?


A) To reduce the effect of random fluctuations B) To provide forecasts if short-term fluctuations represent random departures from the structure C) When forecasts of multiple variables need to be updated frequently D) When there are variations that can be explained due to trend or seasonality

61) Which of the following models reflects a linear trend model with quarterly dummy variables? A) y = β0 + β1 t + ε B) y = β0 + β1d1 + β2d2 + β3d3 + ε C) y = β0 + β1d1 + β2d2 + β3d3 + β4d4 + ε D) y = β0 + β1 d1 + β2d2 + β3d3 + β4t + ε



Answer Key Test name: Chap 10_2e_Jaggia 1) TRUE Random is the absence of reason for an event occurrence outside the normal trend. 2) FALSE Quantitative is numerical, data-based, and used to project historical data. Qualitative is observation-based and dependent on the judgment and skill of the forecaster and tends to be prone to biases in optimism and overconfidence. 3) FALSE In a 3-period moving average, when a new observation is added, the oldest observation is dropped. 4) TRUE The linear trend model is used when a time series is expected to grow by a fixed amount each time period. 5) TRUE By superimposing the scatterplot points and the trend line, a visual of the time line can be inspected. 6) FALSE The exponential trend model would be the best fit when expected increases in the series gets larger over time. 7) TRUE A quadratic trend model is constructed by a common polynomial function creating U-shaped trends. 8) TRUE



The combination of the validation set and training set must be used for estimating the preferred model for making forecasts. Otherwise, you are introducing unnecessary noise in the data by using the training period to project past the validation time period. 9) FALSE The Holt exponential smoothing method is used for level and trend and is appropriate when the time series exhibits trend but not seasonality. 10) TRUE The Holt-Winters exponential smoothing, also called the triple exponential smoothing method, includes seasonality in the analysis. 11) FALSE The functionality to perform the moving average or simple exponential smoothing techniques in Excel can be found on the Data ribbon, within the Data Analysis Tools icon. 12) FALSE The model y = β0 + β1d1 + β2d2 + β3d3 + β4t + ε does not offer any explanation of the mechanism generating the target variable and simply provides a method for projecting historical data. To test for potential causality in a time series model, with or without seasonality, add a predictor variable β5x to the model, such as the firm’s advertising budget. 13) TRUE Just as seasonality can be incorporated into linear trend models, seasonality can be incorporated into nonlinear trend models like exponential and quadratic nonlinear models as well. 14) A ӯ₂ = (4,170 + 4,002 + 4,245)/3 = (12,417/3) = 4,139. 15) A ӯ₂ = (4,321 + 3,876 + 4,190)/3 = (12,387/3) = 4,129. 16) C The forecasted amount is based on the 3 most recent observations: (37 + 32 + 40)/3 = 36.33, so the forecast is 36 filters. 17) C The forecasted amount is based on the 3 most recent observations: (41 + 35 + 40)/3 = 38.67, so the forecast is 39 filters. 18) C MSE = 120,761/4 = 30,190.25.
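The moving average forecasts and the MSE calculation above can be checked with a short script. The following Python sketch is illustrative only: the function names moving_average_forecast and mean_squared_error are assumptions introduced here, and the data are the filter counts from question 16 and the y and ŷ columns from question 18. Because the table in question 18 rounds the residuals, the script's MSE differs from 30,190.25 by a small rounding amount.

```python
def moving_average_forecast(series, m=3):
    """Forecast the next period as the mean of the m most recent observations."""
    if len(series) < m:
        raise ValueError("need at least m observations")
    return sum(series[-m:]) / m


def mean_squared_error(actual, predicted):
    """Average of the squared residuals e = y - y_hat."""
    residuals = [y - y_hat for y, y_hat in zip(actual, predicted)]
    return sum(e ** 2 for e in residuals) / len(residuals)


# Seven observed quarters of filter data (question 16).
filters = [43, 32, 38, 37, 37, 32, 40]
print(round(moving_average_forecast(filters), 2))

# Observed and forecast values from the table in question 18.
y = [2789, 3009, 2787, 2968]
y_hat = [2984.00, 2956.33, 2924.67, 2721.00]
print(round(mean_squared_error(y, y_hat), 2))
```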

19) C 20) B The simple exponential smoothing technique is updated using the formula Lt = αyt + (1 − α)Lt−1. L3 = 0.40(3,006) + 0.60(2,819.0) = 2,893.80. 21) B The simple exponential smoothing technique is updated using the formula Lt = αyt + (1 − α)Lt−1. L3 = 0.40(2,976) + 0.60(2,964.8) = 2,969.28. 22) D The estimated linear trend model is specified as ŷ = 11.76 + 1.05(16) = 28.56.

23) D The estimated linear trend model is specified as ŷ = 11.23 + 1.04(18) = 29.95.
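The simple exponential smoothing update Lt = αyt + (1 − α)Lt−1 with L1 = y1 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name simple_exponential_smoothing is an assumption, and the data are the quarterly values from question 21.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed levels L1..Ln, initialized with L1 = y1."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    levels = [series[0]]
    for y in series[1:]:
        # L_t = alpha * y_t + (1 - alpha) * L_{t-1}
        levels.append(alpha * y + (1 - alpha) * levels[-1])
    return levels


# Quarterly observations from question 21 with alpha = 0.40.
levels = simple_exponential_smoothing([2872, 3104, 2976], alpha=0.40)
print([round(level, 2) for level in levels])
```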

24) B The estimated quadratic trend model is specified as ŷ = 11.57 + 1.07(18) + 0.03(18)² = 40.55. 25) B The estimated quadratic trend model is specified as ŷ = 15.84 + 0.98(20) + 0.02(20)² = 43.44. 26) A The estimated model is used to make forecasts as ŷ = exp(b0 + b1t + se²/2).

27) A The estimated model is used to make forecasts as ŷ = exp(b0 + b1t + se²/2).
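The trend forecasts in this part of the key amount to plugging t into an estimated equation. A minimal Python sketch follows, assuming the function names linear_trend, quadratic_trend, and exponential_trend (all introduced here for illustration). The exponential version applies the exp(b0 + b1t + se²/2) form; no numeric inputs are shown for it because the coefficients for questions 26 and 27 are not available in this text.

```python
import math


def linear_trend(b0, b1, t):
    """Forecast from y_hat = b0 + b1*t."""
    return b0 + b1 * t


def quadratic_trend(b0, b1, b2, t):
    """Forecast from y_hat = b0 + b1*t + b2*t^2."""
    return b0 + b1 * t + b2 * t ** 2


def exponential_trend(b0, b1, se, t):
    """Forecast from y_hat = exp(b0 + b1*t + se^2 / 2)."""
    return math.exp(b0 + b1 * t + se ** 2 / 2)


# Coefficients taken from questions 22 and 24.
print(round(linear_trend(11.76, 1.05, 16), 2))
print(round(quadratic_trend(11.57, 1.07, 0.03, 18), 2))
```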

28) A The shape of the exponential trend is the best fit for the plotted observations.

29) B The polynomial trend model that allows for two changes in the direction of a series is a cubic trend model. 30) A The polynomial trend model that allows for one change in the direction of a series is the quadratic trend model. 31) D All are descriptors of a polynomial trend model except D: the coefficient β2 determines whether the trend is U-shaped or inverted U-shaped, not the direction of a linear trend line. 32) B The output provides the intercept, 12.6476, and the slope, 0.7674. Plug these into the formula ŷ = b0 + b1t and you have ŷ = 12.6476 + 0.7674t.

33) B The output provides the intercept, 11.7606, and the slope, 0.3144. Plug these into the formula ŷ = b0 + b1t and you have ŷ = 11.7606 + 0.3144t. 34) D The quadratic trend model is ŷ = b0 + b1t + b2t², so you have ŷ = 12.2736 + 0.8751t − 0.0145t². 35) D The quadratic trend model is ŷ = b0 + b1t + b2t², so you have ŷ = 10.0651 + 0.7768t − 0.0220t². 36) A The adjusted R² for the quadratic trend model is 0.94810 and for the linear trend model is 0.83556. The model with the higher adjusted R², the quadratic trend model, is the best fit for the data. 37) B The quadratic trend model with seasonal dummy variables is ŷ = β0 + β1d1 + β2d2 + β3d3 + β4t + β5t². So, ŷ = 1.4155 + 0.0174d1 − 0.0734d2 − 0.0636d3 − 0.0040t + 0.0003t². 38) C The linear trend model with seasonal dummy variables is ŷ = β0 + β1d1 + β2d2 + β3d3 + β4t. So, ŷ = 2.4338 + 0.0950d1 − 0.0681d2 − 0.0598d3 − 0.0136t. 39) C The linear trend model with seasonal dummy variables is ŷ = β0 + β1d1 + β2d2 + β3d3 + β4t. So, ŷ = 1.4702 + 0.0174d1 − 0.0729d2 − 0.0630d3 − 0.0137t. 40) D Given b1 = 113 and b2 = −23.1326, the output level that maximizes quantity holding seasonality constant = −b1/(2b2) = −113/(2 × −23.1326) = 2.442 or 2 maximum units.

41) D Given b1 = 112 and b2 = −23.4502, the output level that maximizes quantity holding seasonality constant = −b1/(2b2) = −112/(2 × −23.4502) = 2.388 or 2 maximum units.
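For a quadratic trend b1t + b2t² with b2 < 0, the maximum occurs where the derivative b1 + 2b2t equals zero, that is, at −b1/(2b2). The Python sketch below is an illustration under that assumption; the function name quadratic_turning_point is introduced here and the inputs are the coefficients given in questions 40 and 41.

```python
def quadratic_turning_point(b1, b2):
    """Value of t at which b1*t + b2*t^2 reaches its turning point."""
    if b2 == 0:
        raise ValueError("b2 must be nonzero for a quadratic trend")
    return -b1 / (2 * b2)


# Coefficients from questions 40 and 41.
print(round(quadratic_turning_point(113, -23.1326), 3))
print(round(quadratic_turning_point(112, -23.4502), 3))
```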

42) C Cross-validation with a time series involves the following steps: splitting the series into early and late periods, representing the training and validation sets, respectively; exploring suitable forecasting models for the training set and using the forecast errors for the validation set to compute MSE, MAD, and MAPE, then selecting the model with the lowest values; and using the entire data set, combining both the training and validation sets, to re-estimate the preferred model for making forecasts. 43) B The exponential model is estimated by first generating the series in natural logs, ln(yt), and then by running a regression of ln(yt) on t. 44) C The cubic trend model, yt = β0 + β1t + β2t² + β3t³ + εt, allows for two changes in the direction of a series. 45) A MSE is computed as

= (213.799/19) = 10.69.

46) C MSE is computed as

= (206.064/19) = 10.30.

47) D MAD

= 51.776/19 = 2.725.

48) D MAD

= 51.211/19 = 2.695.

49) B The moving average technique, specifically the 3-period moving average, is the most popular technique to calculate the Q4 forecast. 50) A The exponential model is preferred because the calculated MSE, MAD, and MAPE are lower. 51) C



The Holt-Winters exponential smoothing method is best for seasonality in the sample and is further divided into additive and multiplicative structures depending on the type of seasonality exhibited in the series. 52) D In Excel, the formula would be: = 0.2 × (91,900 − 78,000) + 0.8 × 11,000 = 11,580. 53) D In Excel, the formula would be: = 0.2 × (89,900 − 78,000) + 0.8 × 11,000 = 11,180. 54) B In Excel, the formula would be: = 0.1 × 98,000 + 0.9 × (12,020 + 92,900) = 104,228. 55) B In Excel, the formula would be: = 0.1 × 115,000 + 0.9 × (11,180 + 89,900) = 102,472. 56) A The length function identifies the number of observations in the validation set. 57) C In the Holt-Winters exponential smoothing method we use the ets function with the model inputs of “AAA” for additive seasonality and “AAM” for multiplicative seasonality. We also enter restrict = FALSE to override the default setting of TRUE. 58) C The Holt exponential smoothing method is an extension of simple exponential smoothing where the level Lt, as well as the trend Tt, adapt over time. 59) B The main attraction of using MAPE is that it shows the error as a percentage of the actual value, giving a sense of the magnitude of the errors. 60) D Smoothing methods are typically used when fluctuations appear random and cannot be explained by particular events or seasonality. 61) D With quarterly data, a linear trend model with seasonal dummy variables can be specified as y = β0 + β1d1 + β2d2 + β3d3 + β4t + ε, where d1, d2, and d3 are the dummy variables representing the first three quarters. 62) B
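The Holt updates used in answers 52 through 55 follow the pattern Lt = αyt + (1 − α)(Lt−1 + Tt−1) and Tt = β(Lt − Lt−1) + (1 − β)Tt−1. A minimal Python sketch of one update step is shown below; the function name holt_update is an assumption made for illustration, and the inputs are the Year 3 values from question 55.

```python
def holt_update(y, prev_level, prev_trend, alpha, beta):
    """One step of Holt (double) exponential smoothing.

    Returns the updated (level, trend) pair.
    """
    # Level: blend the new observation with the previous one-step forecast.
    level = alpha * y + (1 - alpha) * (prev_level + prev_trend)
    # Trend: blend the latest change in level with the previous trend.
    trend = beta * (level - prev_level) + (1 - beta) * prev_trend
    return level, trend


# Year 4 update from question 55: y4 = 115,000, L3 = 89,900, T3 = 11,180.
level, trend = holt_update(115_000, 89_900, 11_180, alpha=0.1, beta=0.2)
print(round(level, 2), round(trend, 2))
```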



CHAPTER 11 TRUE/FALSE - Write 'T' if the statement is true and 'F' if the statement is false. 1) The process of applying a set of analytical techniques for the development of machine learning and artificial intelligence is called data mining. ⊚ true ⊚ false

2) The key distinction between supervised and unsupervised data mining is that the target variable is identified in supervised data mining. ⊚ true ⊚ false

3) Common applications of unsupervised learning include dimension reduction and prediction model. ⊚ true ⊚ false

4) Normalization is the process that makes the numerical data independent of scale. ⊚ true ⊚ false

5) The Jaccard’s coefficient is appropriate when it is more informative to match negative outcomes between observations. ⊚ true ⊚ false

6) Oversampling involves intentionally selecting more samples from one class than from other classes to adjust the class distribution of a data set. ⊚ true ⊚ false

7) A diagram that represents the information in equal-sized intervals, deciles, is called a cumulative lift chart. ⊚ true ⊚ false

8) In real-world situations, data sets contain many variables. If some variables are eliminated, valuable information may be lost. ⊚ true ⊚ false

9) The principal component analysis (PCA) is a dimension reduction technique used to reduce variables without removing variables. ⊚ true ⊚ false

10) In Excel, Analytic Solver only provides the covariance matrix for performing principal component analysis (PCA). ⊚ true ⊚ false

11) Pairwise observations that have a large distance between observations are said to have a high degree of similarity. ⊚ true ⊚ false

12) A bar or column chart is used to create a decile-wise lift performance chart in Excel. ⊚ true ⊚ false

MULTIPLE CHOICE - Choose the one alternative that best completes the statement or answers the question. 13) Cross-Industry Standard Process for Data Mining (CRISP-DM) consists of six phases. Of the six, which one represents the phase where data wrangling occurs?



A) Deployment B) Modeling C) Data understanding D) Data preparation

14) Using the Manhattan distance between pairwise observations, which pairwise observation is most similar? Observation 1 2 3

x1 1 3 13

x2 2 4 1

A) Observations 1 & 2 B) Observations 2 & 3 C) Observations 1 & 3 D) Both Observations 1 & 2 and 2 & 3

15) Using the Manhattan distance between pairwise observations, which pairwise observation is most similar? Observation 1 2 3

x1 2 6 8

x2 3 4 2

A) Observations 2 & 3 B) Observations 1 & 3 C) Observations 1 & 2 D) Both Observations 2 & 3 and 1 & 3

16) Using the Euclidean distance between pairwise observations, which pairwise observation is most dissimilar? Observation 1 2 3

x1 2 5 13

x2 1 4 2


A) Observations 1 & 3 B) Observations 2 & 3 and 1 & 3 C) Observations 2 & 3 D) Observations 1 & 2 and 2 & 3

17) Using the Euclidean distance between pairwise observations, which pairwise observation is most dissimilar? Observation 1 2 3

x1 2 6 8

x2 3 4 2

A) Observations 1 & 3 B) Observations 2 & 3 and 1 & 3 C) Observations 2 & 3 D) Observations 1 & 2 and 2 & 3
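The pairwise distance questions above can be explored with a short script. The Python sketch below is illustrative only: the function names manhattan and euclidean are assumptions, and the observations are the (x1, x2) pairs from question 15. A smaller distance indicates greater similarity.

```python
import math
from itertools import combinations


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Observations from question 15 as (x1, x2) pairs.
observations = {1: (2, 3), 2: (6, 4), 3: (8, 2)}

for i, j in combinations(observations, 2):
    a, b = observations[i], observations[j]
    print(i, j, manhattan(a, b), round(euclidean(a, b), 3))
```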

18) The partial data set in the table represents online hours spent shopping by age and income. The average and standard deviation of Income for the full data set are $47,650 and $14,223, respectively. Using z-scores to standardize the observations, what is the average z-score of Income for the three observations provided?

ID     Income   Age   Online Hours
2201   69,000   49    3
2202   50,000   57    4
2203   68,000   37    4

A) 1.501 B) 0.1740 C) 1.2990 D) 1.0323



19) The partial data set in the table represents online hours spent shopping by age and income. The average and standard deviation of Income for the full data set are $47,667 and $14,292, respectively. Using z-scores to standardize the observations, what is the average z-score of Income for the three observations provided?

ID     Income   Age   Online Hours
2201   62,000   48    2
2202   58,000   52    4
2203   53,000   44    5

A) 1.003 B) 0.7320 C) 0.2410 D) 0.6997
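Standardizing with z-scores means subtracting the mean and dividing by the standard deviation, z = (x − mean)/sd. The following Python sketch shows the calculation for the Income values in question 19; the function name z_scores is an assumption introduced for illustration, and the mean and standard deviation are the full-data-set figures stated in the question.

```python
def z_scores(values, mean, std):
    """Standardize each value as (x - mean) / std."""
    if std <= 0:
        raise ValueError("std must be positive")
    return [(x - mean) / std for x in values]


# Income values and full-data-set statistics from question 19.
incomes = [62_000, 58_000, 53_000]
z = z_scores(incomes, mean=47_667, std=14_292)
average_z = sum(z) / len(z)
print([round(value, 4) for value in z], round(average_z, 4))
```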

20) The partial data set in the table represents online hours spent shopping by age and income. Using the min-max transformation to normalize Income, what is the average normalized value of Income for the observations provided?

ID     Income   Age   Online Hours
2201   57,000   51    5
2202   57,000   48    4
2203   54,000   45    5
2204   17,000   35    9
2205   37,000   33    2
2206   45,000   31    4

A) 1.000 B) 0.6875 C) 0 D) 0.7455

21) The partial data set in the table represents online hours spent shopping by age and income. Using the min-max transformation to normalize Income, what is the average normalized value of Income for the observations provided?

ID     Income   Age   Online Hours
2201   62,000   48    2
2202   58,000   52    4
2203   53,000   44    5
2204   22,000   28    7
2205   43,000   33    4
2206   48,000   35    3

A) 1 B) 0.6417 C) 0 D) 0.6997
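The min-max transformation rescales each value to the interval [0, 1] as (x − min)/(max − min). A minimal Python sketch using the Income column from question 21 is given below; the function name min_max_normalize is an assumption made for illustration.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("all values are identical; range is zero")
    return [(x - lowest) / (highest - lowest) for x in values]


# Income column from question 21.
incomes = [62_000, 58_000, 53_000, 22_000, 43_000, 48_000]
normalized = min_max_normalize(incomes)
print([round(value, 3) for value in normalized])
print(round(sum(normalized) / len(normalized), 4))
```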

22) The following table is a segment of Loan Data from a bank for car loans. Compute the matching coefficient between Lines 1 and 4.

Line   Term   interest_rate   loan_amount   Sex
1      60     3.45%           33,300        M
2      50     3.51%           15,000        M
3      35     6.36%           12,900        M
4      60     6.86%           15,280        F

A) Matching coefficient is 0.25. B) Matching coefficient is 0.13. C) Matching coefficient is 0.88. D) Matching coefficient is 0.35.

23) The following table is a segment of Loan Data from a bank for car loans. Compute the matching coefficient between Lines 1 and 4.

Line   Term   interest_rate   loan_amount   Sex
1      60     3.58%           35,000        F
2      60     4.00%           15,350        F
3      36     6.71%           12,500        M
4      60     6.07%           15,350        F


A) Matching coefficient is 0.50. B) Matching coefficient is 0.25. C) Matching coefficient is 0.75. D) Matching coefficient is 0.40.

24) Sandra began collecting transaction details to see if the same items were in each sales transaction. Compute the matching coefficient and Jaccard’s coefficient for the pair of Transactions 1 and 2.

Transaction  Coffee  Tea  Scone  Muffin  Cookie
1            No      Yes  No     Yes     No
2            Yes     No   No     Yes     Yes

A) Matching Coefficient 0.40 and Jaccard's 0.40 B) Matching Coefficient 0.40 and Jaccard's 0.25 C) Matching Coefficient 0.40 and Jaccard's 0.30 D) Matching Coefficient 0.40 and Jaccard's 0.50

25) Sandra began collecting transaction details to see if the same items were in each sales transaction. Compute the matching coefficient and Jaccard’s coefficient for the pair of Transactions 1 and 2.

Transaction  Coffee  Tea  Scone  Muffin  Cookie
1            Yes     No   Yes    No      Yes
2            Yes     Yes  Yes    No      No

A) Matching Coefficient = 0.60 and Jaccard’s = 0.40 B) Matching Coefficient = 0.40 and Jaccard’s = 0.50 C) Matching Coefficient = 0.60 and Jaccard’s = 0.50 D) Matching Coefficient = 0.40 and Jaccard’s = 0.40

26) The process of dividing a data set into a training, a validation, and an optional test data set is called _____________blank. A) overfitting B) oversampling C) optional testing D) data partitioning

27) Cameron is performing a study on the IQ of groups in various areas. He has calculated that the average IQ of Group A is 135 with a standard deviation of 10. What is the z-score for someone with an IQ of 98? A) 3.70 B) −3.70 C) 3.90 D) −2.90

28) Cameron is performing a study on the IQ of groups in various areas. He has calculated that the average IQ of Group A is 105 with a standard deviation of 10. What is the z-score for someone with an IQ of 98? A) 0.7 B) −0.7 C) 0.9 D) 0.1

29) Cameron is performing a study on the IQ of groups in various areas. He has calculated that the average IQ of Group B is 109 with a standard deviation of 10. What is the z-score for someone with an IQ of 116? A) 1.10 B) −0.70 C) 0.22 D) 0.70



30) Cameron is performing a study on the IQ of groups in various areas. He has calculated that the average IQ of Group B is 118 with a standard deviation of 12. What is the z-score for someone with an IQ of 125? A) 0.98 B) −0.58 C) 0.10 D) 0.58

31) When a predictive model is made overly complex to fit in the quirks of given sample data, it is called _____________blank. A) oversampling B) overfitting C) partitioning D) distribution

32) Molly e-mailed her clients offering a free 30-minute massage for referrals. In the following validation set of 100, Class 1 reflects the clients predicted to provide referrals and Class 0 reflects the clients predicted to not provide referrals. Based on the confusion matrix, what was the True Positive (TP) of current clients who provided referrals for a free massage?

Actual Class  Predicted Class 1  Predicted Class 0
Class 1       23                 14
Class 0       14                 49

A) 37 B) 14 C) 23 D) 49

33) Molly e-mailed her clients offering a free 30-minute massage for referrals. In the following validation set of 100, Class 1 reflects the clients predicted to provide referrals and Class 0 reflects the clients predicted to not provide referrals. Based on the confusion matrix, what was the True Positive (TP) of current clients who provided referrals for a free massage?

Actual Class  Predicted Class 1  Predicted Class 0
Class 1       18                 11
Class 0       11                 60

A) 29 B) 11 C) 18 D) 60

34) Molly e-mailed her clients offering a free 30-minute massage for referrals. In the following validation set of 100, Class 1 reflects the clients predicted to provide referrals and Class 0 reflects the clients predicted to not provide referrals. Based on the confusion matrix, what was the False Negative (FN) of current clients who provided referrals for a free massage?

Actual Class  Predicted Class 1  Predicted Class 0
Class 1       18                 11
Class 0       11                 60

A) 29 B) 18 C) 11 D) 60

35) The sensitivity, also called recall, is computed using which equation? A) Sensitivity = TP ÷ (TP + FN). B) Sensitivity = TN ÷ (TP + FN). C) Sensitivity = (TP + TN) ÷ (TP + TN + FP + FN). D) Sensitivity = TP ÷ (TP + FP).

36) The precision, also called positive predictive value, is computed using which equation? A) Precision = TP ÷ (TP + FN). B) Precision = TN ÷ (TP + FN). C) Precision = (TP + TN) ÷ (TP + TN + FP + FN). D) Precision = TP ÷ (TP + FP).


37) Based on the following confusion matrix with a validation set of 100, Class 1 reflects the members targeted who purchased services and Class 0 reflects the non-targeted respondents who did not purchase services. Calculate the specificity rate.

Actual Class  Predicted Class 1  Predicted Class 0
Class 1       13                 11
Class 0       11                 65

A) 54% B) 70% C) 85% D) 85.5%

38) Based on the following confusion matrix with a validation set of 100, Class 1 reflects the members targeted who purchased services and Class 0 reflects the non-targeted respondents who did not purchase services. Calculate the specificity rate.

Actual Class  Predicted Class 1  Predicted Class 0
Class 1       18                 11
Class 0       11                 60

A) 62% B) 78% C) 84% D) 84.5%

39) Based on the following confusion matrix with a validation set of 100, Class 1 reflects the members targeted who purchased services and Class 0 reflects the non-targeted respondents who did not purchase services. Calculate the sensitivity rate.

Actual Class  Predicted Class 1  Predicted Class 0
Class 1       27                 14
Class 0       14                 45


A) 82% B) 76% C) 66% D) 36%

40) Based on the following confusion matrix with a validation set of 100, Class 1 reflects the members targeted who purchased services and Class 0 reflects the non-targeted respondents who did not purchase services. Calculate the sensitivity rate.

Actual Class  Predicted Class 1  Predicted Class 0
Class 1       18                 11
Class 0       11                 60

A) 78% B) 84% C) 62% D) 32%

41) The calculated average error measuring the average magnitude of errors in predictive performance measures is called _____________blank. A) mean percentage error B) root mean square error C) mean error D) mean absolute deviation

42) The _____________blank is the average absolute percentage error, shown as a percentage of the actual value, displaying the magnitude of the errors in performance measures. A) RMSE B) MAD C) MAPE D) ME



43) The root mean square error calculation is

44) Which is the best-fit definition for the use of Principal Component Analysis (PCA)?

A) The elimination of the number of components in a data set to identify errors. B) The transformation of a small set of correlated variables into larger uncorrelated subsets. C) The transformation of a large number of correlated variables into a smaller number of uncorrelated variables. D) The driving of pattern recognition in a supervised data mining set used to visualize data methods.

45) The following tables display the weights for computing the principal components and the data for two observations. The mean and standard deviation for x1 are 3.90 and 1.60, respectively. What is the z-score of x1 for Observation 1?

Weights  PC1    PC2
x1       −0.74  0.51
x2       −0.69  0.70

               x1    x2
Observation 1  4.57  12.20
Observation 2  1.48  8.86

A) 0.42 B) 0.41 C) 0.97 D) 0.45

46) The following tables display the weights for computing the principal components and the data for two observations. The mean and standard deviation for x1 are 4.2 and 1.4, respectively. What is the z-score of x1 for Observation 1?

Weights  PC1    PC2
x1       −0.78  0.51
x2       −0.65  0.82

               x1    x2
Observation 1  4.45  12.08
Observation 2  1.43  8.62


A) 0.18 B) 0.17 C) 0.73 D) 0.21

47) The following tables display the weights for computing the principal components and the data for two observations. The mean and standard deviation for x1 are 3.3 and 1.2, respectively. The mean and standard deviation for x2 are 6.3 and 5.3, respectively. Compute the first principal component score for Observation 1.

Weights  PC1
x1       −0.62
x2       −0.46

               x1    x2
Observation 1  4.31  11.48
Observation 2  4.42  8.46

Variance             3.90
Variance Percentage  78.323183

A) −0.972 B) −0.523 C) 1.179 D) 0.842

48) The following tables display the weights for computing the principal components and the data for two observations. The mean and standard deviation for x1 are 4.2 and 1.4, respectively. The mean and standard deviation for x2 are 6.8 and 4.8, respectively. Compute the first principal component score for Observation 1.

Weights  PC1
x1       −0.72
x2       −0.85

               x1    x2
Observation 1  4.45  12.08
Observation 2  4.56  8.62

Variance             3.36
Variance Percentage  78.323221

A) −0.293 B) −1.065 C) 0.949 D) 0.612

49) Of the following selections, which is not a descriptor of principal component analysis? A) The first principal component is not suitable for analysis. B) Principal components are uncorrelated variables. C) The first principal component accounts for most of the variability. D) Principal component variables are weighted linear combinations of the original variables.

50) Calculate the misclassification rate for the following confusion matrix.

Actual Class  Predicted Class 1  Predicted Class 0
Class 1       19                 12
Class 0       12                 57

A) 0.74 B) 0.24 C) 0.80 D) 0.30

51) Calculate the misclassification rate for the following confusion matrix.

Actual Class  Predicted Class 1  Predicted Class 0
Class 1       18                 11
Class 0       11                 60


A) 0.72 B) 0.22 C) 0.78 D) 0.28

52) Calculate the accuracy rate for the following confusion matrix.

Actual Class  Predicted Class 1  Predicted Class 0
Class 1       17                 13
Class 0       13                 57

A) 0.76 B) 0.26 C) 0.74 D) 0.32

53) Calculate the accuracy rate for the following confusion matrix.

Actual Class  Predicted Class 1  Predicted Class 0
Class 1       18                 11
Class 0       11                 60

A) 0.72 B) 0.22 C) 0.78 D) 0.28

54) Tess is tasked with analyzing a data set with multiple variables with various scales. To reduce the difference in scale in variables, she is following a common process called _____________blank to make the observations unit-free.



A) data mining B) Euclidean distance C) Manhattan distance D) standardizing

55) Which chart allows for a visual representation to determine a point where a model’s predictions become less useful? A) ROC Curve B) Cumulative Lift Chart C) Decile-wise Lift Chart D) Sensitivity Measure

56) Which chart is a bar chart displayed in 10 equal-sized intervals, or every 10% of the observations? A) ROC Curve B) Cumulative Lift Chart C) Decile-wise Lift Chart D) Sensitivity Measure

57) When using PCA, all the following are disadvantages except A) PCA results are difficult to interpret clearly. B) components are weighted linear combinations and abstract. C) PCA only works with numerical data. D) PCA significantly increases the dimension of the data.

58) What is the first step to take in Excel to create a confusion matrix given a dataset that includes the actual class to which the observation belongs (ActualClass) and the probability of the target class (TargetProb)?



A) Compute values for False Positives and False Negatives B) Compute values for True Positives and True Negatives C) Derive the predicted class based on a cutoff of 0.5 D) Build the confusion matrix using a COUNTIF formula

59) Calculate the misclassification rate for the following 5 observations using a 0.25 cutoff value.

Observation  Actual Class  Class 1 Probability
1            0             0.11
2            1             0.24
3            0             0.76
4            1             0.37
5            1             0.07

A) 0.33 B) 0.40 C) 0.50 D) 0.60

60) Cross-Industry Standard Process for Data Mining (CRISP-DM) consists of six phases. Which of the following is the first phase? A) Business understanding B) Data preparation C) Data understanding D) Modeling

61) Based on the following chart, how many components will need to be retained in order to account for at least 90% of the total variance in the data?

            PC1      PC2      PC3     PC4     PC5
Variance %  74.3890  12.5242  7.3262  4.7306  1.030


A) 2 B) 3 C) 4 D) 5



Answer Key
Test name: Chap 11_2e_Jaggia
1) TRUE Data mining is applying analytical techniques in machine learning and artificial intelligence.
2) TRUE The target variable is identified in supervised data mining.
3) FALSE In unsupervised data mining, common applications are dimension reduction and pattern recognition. Prediction models are common in supervised data mining.
4) TRUE The key is the numerical data is independent of scale.
5) FALSE Jaccard’s is appropriate to match positive outcomes between observations.
6) TRUE Oversampling is a technique that involves the intentional selection of more samples in a class to adjust the distribution set.
7) FALSE A decile-wise chart is similar to the cumulative lift chart but presents the information in ten equal-sized intervals.
8) TRUE Where some variables may be dropped, others are key and by removing them, valuable information will be lost.
9) TRUE PCA is a dimension reduction technique that reduces the data set without the loss of important information.
10) FALSE In Analytic Solver, you have 2 options for PCA, the covariance method and correlation matrix.
11) FALSE A small distance between the observations implies a high degree of similarity, whereas a large distance between the observations implies a low degree of similarity.
12) TRUE A bar or column chart is used to create a decile-wise lift performance chart in Excel. A scatterplot with smooth lines is used for the cumulative lift performance chart and the ROC curve.
13) D Data preparation is the phase that includes variable selection, data cleansing, and data wrangling.
14) A Observations 1 & 2 are the most similar with the shortest distance at |1 − 3| + |2 − 4| = 4.
15) A Observations 2 & 3 are the most similar with the shortest distance at |6 − 8| + |4 − 2| = 4.
16) A Observations 1 & 3 are the most dissimilar with the largest result of 11.045, whereas Observations 1 & 2 equal 4.243 and Observations 2 & 3 equal 8.246.
17) A Observations 1 & 3 are the most dissimilar with the largest result of 6.083, whereas Observations 1 & 2 equal 4.123 and Observations 2 & 3 equal 2.828.
18) D Line 1: z-score = (x − x̄) ÷ s = (69,000 − 47,650) ÷ 14,223 = 1.501. Line 2 = 0.165 and Line 3 = 1.431. The average of the three scores is 1.0323.


19) D Line 1: z-score = (x − x̄) ÷ s = (62,000 − 47,667) ÷ 14,292 = 1.003. Line 2 = 0.723 and Line 3 = 0.373. The average of the three scores is 0.6997.
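As a hedged illustration of answers 18 through 21, the sketch below applies the two rescalings to the Income values from Questions 19 and 21. The function names (z_scores, min_max, average) are illustrative and not from the text; the mean, standard deviation, and incomes are the ones given in the questions.

```python
def z_scores(values, mean, std):
    """Standardize each value against a given mean and standard deviation."""
    return [(x - mean) / std for x in values]


def min_max(values):
    """Rescale values to [0, 1] using the minimum and maximum of the values supplied."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def average(values):
    return sum(values) / len(values)


# Question 19: three incomes, full-data mean 47,667 and standard deviation 14,292.
z19 = z_scores([62_000, 58_000, 53_000], mean=47_667, std=14_292)
print(round(average(z19), 4))  # 0.6997

# Question 21: six incomes, normalized against their own minimum and maximum.
mm21 = min_max([62_000, 58_000, 53_000, 22_000, 43_000, 48_000])
print(round(average(mm21), 4))  # 0.6417
```

Note that min_max as written assumes the values are not all equal; otherwise the range is zero and the division fails.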

20) B Line 1 = (57,000 max income − 17,000 min income) ÷ 40,000 (range) = 1.000. Line 2 = 1.000; Line 3 = 0.925; Line 4 = 0; Line 5 = 0.500; and Line 6 = 0.70. The average of the six lines equals 0.6875.
21) B Line 1 = (62,000 max income − 22,000 min income) ÷ 40,000 (range) = 1. Line 2 = 0.9; Line 3 = 0.775; Line 4 = 0; Line 5 = 0.525; and Line 6 = 0.65. The average of the six lines equals 0.6417.
22) A The pair, Line 1 and 4, has a 1/4 matching or 0.25 matching coefficient.
23) A The pair, Line 1 and 4, has a 2/4 matching or 0.50 matching coefficient.
24) B Matching coefficient 2/5 = 0.40 and Jaccard’s coefficient = 1/(5 − 1) = 0.25.
25) C Matching coefficient 3/5 = 0.60 and Jaccard’s coefficient = 2/(5 − 1) = 0.50.
26) D Data Partitioning is the process of dividing the data into training, validation, and optional test sets.
27) B z-score = (x − x̄) ÷ s = (98 − 135) ÷ 10 = −3.70.
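The matching and Jaccard’s coefficients in answers 24 and 25 can be sketched as follows. This is a minimal illustration, assuming Yes/No values are coded as 1/0; the function names are hypothetical and not from the text.

```python
def matching_coefficient(a, b):
    """Share of attributes on which two records agree (1-1 and 0-0 both count)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def jaccard_coefficient(a, b):
    """Share of 1-1 matches among attributes that are not 0-0 in both records."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    neither = sum(x == 0 and y == 0 for x, y in zip(a, b))
    # Undefined (division by zero) if every attribute is 0 in both records.
    return both / (len(a) - neither)


# Question 24, columns: Coffee, Tea, Scone, Muffin, Cookie (Yes = 1, No = 0).
t1 = [0, 1, 0, 1, 0]
t2 = [1, 0, 0, 1, 1]
print(matching_coefficient(t1, t2))  # 0.4
print(jaccard_coefficient(t1, t2))   # 0.25
```

The only difference between the two measures is the treatment of 0-0 pairs: the matching coefficient counts them as agreement, while Jaccard’s coefficient drops them from both numerator and denominator.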

28) B z-score = (x − x̄) ÷ s = (98 − 105) ÷ 10 = −0.7.
29) D z-score = (x − x̄) ÷ s = (116 − 109) ÷ 10 = 0.70.
30) D z-score = (x − x̄) ÷ s = (125 − 118) ÷ 12 = 0.58.

31) B Overfitting is when a model is too complex and altered to fit quirks, which causes the predictive power to be compromised.
32) C TP = 23: There are 23 current clients who provided referrals that received a free massage that were correctly classified in the model.
33) C TP = 18: There are 18 current clients who provided referrals that received a free massage that were correctly classified in the model.
34) C FN = 11: There are 11 current clients who provided referrals for a free massage that were misclassified as clients who would not provide referrals.
35) A Sensitivity = TP ÷ (TP + FN).
36) D Precision = TP ÷ (TP + FP).
37) D TN ÷ (TN + FP) = 65 ÷ (65 + 11) = 0.855 implying 85.5% proportion of non-purchasers that are classified correctly.
38) D TN ÷ (TN + FP) = 60 ÷ (60 + 11) = 0.845 implying 84.5% proportion of non-purchasers that are classified correctly.
39) C TP ÷ (TP + FN) = 27 ÷ (27 + 14) = 0.66 implying 66% of clients are identified correctly.
40) C TP ÷ (TP + FN) = 18 ÷ (18 + 11) = 0.62 implying 62% of clients are identified correctly.
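The formulas in answers 35 through 40 can be checked with a short sketch. The ConfusionMatrix class below is a hypothetical helper, not something from the text; it assumes the layout used in the questions, with actual classes in rows and predicted classes in columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts from a 2x2 confusion matrix (rows actual, columns predicted)."""
    tp: int  # actual Class 1, predicted Class 1
    fn: int  # actual Class 1, predicted Class 0
    fp: int  # actual Class 0, predicted Class 1
    tn: int  # actual Class 0, predicted Class 0

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)


q37 = ConfusionMatrix(tp=13, fn=11, fp=11, tn=65)
q39 = ConfusionMatrix(tp=27, fn=14, fp=14, tn=45)
print(round(q37.specificity(), 3))  # 0.855
print(round(q39.sensitivity(), 2))  # 0.66
```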



41) D Mean Absolute Deviation (MAD) measures the calculated average error in prediction performance measures.
42) C The mean absolute percentage error (MAPE) is the average absolute percentage error.
43) B The formula for RMSE is similar to the standard error estimate, except it is calculated for the validation set, rather than the training set.
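To make the measures in answers 41 through 43 concrete, here is a minimal sketch of ME, MAD, MAPE, and RMSE. It assumes the prediction error is defined as actual minus predicted; the function name and the three data points are hypothetical and chosen only to show the arithmetic.

```python
import math


def error_measures(actual, predicted):
    """Return ME, MAD, MAPE (in percent), and RMSE for paired values."""
    errors = [a - p for a, p in zip(actual, predicted)]  # assumed sign convention
    n = len(errors)
    return {
        "ME": sum(errors) / n,
        "MAD": sum(abs(e) for e in errors) / n,
        "MAPE": 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
    }


# Hypothetical validation-set values, not taken from the test bank.
actual = [100, 200, 400]
predicted = [110, 190, 380]
measures = error_measures(actual, predicted)
for name, value in measures.items():
    print(name, round(value, 3))
```

Because ME lets positive and negative errors cancel, it can be near zero for a poor model; MAD and RMSE do not have that property, and RMSE weights large errors more heavily.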

44) C The main definition is the ability to transform a large number of correlated variables into a more manageable smaller number of uncorrelated variables.
45) A z-score = (x − x̄) ÷ s = (4.57 − 3.90) ÷ 1.60 = 0.4188 or 0.42.
46) A z-score = (x − x̄) ÷ s = (4.45 − 4.2) ÷ 1.4 = 0.1786 or 0.18.
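The z-scores in answers 45 and 46 are the first step of the principal component scores computed in answers 47 and 48. A hedged sketch of the full calculation, using the Question 47 values, is shown here; pc_score is an illustrative name, not one from the text.

```python
def pc_score(values, means, stds, weights):
    """Weighted sum of z-scores: one principal component score for one observation."""
    z = [(x - m) / s for x, m, s in zip(values, means, stds)]
    return sum(w * zi for w, zi in zip(weights, z))


# Question 47, Observation 1: x1 = 4.31, x2 = 11.48; PC1 weights are -0.62 and -0.46.
score47 = pc_score(
    values=[4.31, 11.48],
    means=[3.3, 6.3],
    stds=[1.2, 5.3],
    weights=[-0.62, -0.46],
)
print(round(score47, 2))  # -0.97
```

The sketch keeps the z-scores unrounded, so the third decimal can differ slightly from a key that rounds each z-score to two decimals before weighting.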

47) A First compute the z-scores [(xki − x̄k) ÷ s]. z11 = (4.31 − 3.3) ÷ 1.2 = 0.84 & z21 = (11.48 − 6.3) ÷ 5.3 = 0.98. Then compute PC1,Ob1 = w11z11 + w12z21 = −0.62 × 0.84 + −0.46 × 0.98 = −0.972.
48) B First compute the z-scores [(xki − x̄k) ÷ s]. z11 = (4.45 − 4.2) ÷ 1.4 = 0.18 & z21 = (12.08 − 6.8) ÷ 4.8 = 1.1. Then compute PC1,Ob1 = w11z11 + w12z21 = −0.72 × 0.18 + −0.85 × 1.1 = −1.065.
49) A Principal components are uncorrelated variables, weighted linear combinations of the original variables, and the first principal component accounts for the most variability in the data.
50) B Misclassification rate = (FP + FN) ÷ (TP + TN + FP + FN) = (12 + 12) ÷ (19 + 57 + 12 + 12) = 0.24 is the misclassification rate.
51) B Misclassification rate = (FP + FN) ÷ (TP + TN + FP + FN) = (11 + 11) ÷ (18 + 60 + 11 + 11) = 0.22 is the misclassification rate.
52) C Accuracy = 1 − Misclassification rate = 1 − [(FP + FN) ÷ (TP + TN + FP + FN)] = 1 − (13 + 13) ÷ (17 + 57 + 13 + 13) = 1 − 0.26 = 0.74.
53) C Accuracy = 1 − Misclassification rate = 1 − [(FP + FN) ÷ (TP + TN + FP + FN)] = 1 − (11 + 11) ÷ (18 + 60 + 11 + 11) = 1 − 0.22 = 0.78.
54) D Standardizing or normalizing the numerical set before calculating is a common process to make the observations unit-free.
55) B A cumulative lift chart, also called cumulative gains chart, or lift chart allows for a visual representation of the less useful point.
56) C The decile-wise lift is a bar graph displayed in 10 equal intervals.
57) D PCA significantly reduces the dimension of the data, not increasing the dimensions.
58) C The first step is to create a predicted class column, which evaluates the TargetProb column and assigns it a 1 if the value is greater than or equal to 0.5 and 0 otherwise.
59) D
Actual Class  Predicted Class 1  Predicted Class 0
Class 1       1                  2
Class 0       1                  1
Misclassification rate = (FP + FN) ÷ (TP + TN + FP + FN) = (1 + 2) ÷ (1 + 1 + 1 + 2) = 0.60.
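The two steps behind answers 58 and 59, deriving a predicted class from a cutoff and then counting disagreements, can be sketched as below. The function name is hypothetical; the data are the five observations from Question 59, and the rule "probability greater than or equal to the cutoff means Class 1" follows the wording of answer 58.

```python
def misclassification_rate(actual, probabilities, cutoff):
    """Classify as 1 when the Class 1 probability is >= cutoff, then count errors."""
    predicted = [1 if p >= cutoff else 0 for p in probabilities]
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


# Question 59: actual classes and Class 1 probabilities for five observations.
actual_class = [0, 1, 0, 1, 1]
class1_prob = [0.11, 0.24, 0.76, 0.37, 0.07]
print(misclassification_rate(actual_class, class1_prob, cutoff=0.25))  # 0.6
```

Observations 2 and 5 are false negatives and Observation 3 is a false positive, which gives the three errors out of five.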


60) A Business understanding is the first phase of CRISP-DM and includes understanding the data mining project and its business objectives.
61) B Three components will need to be retained to account for at least 90% of the total variance. 74.3890 + 12.5242 + 7.3262 = 94.2394, which is greater than 90%.
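The cumulative-variance reasoning in answer 61 can be written as a short sketch. The function name components_needed is illustrative; the percentages are those listed in Question 61.

```python
from itertools import accumulate


def components_needed(variance_pct, target):
    """Smallest number of leading components whose cumulative variance % reaches target."""
    for k, total in enumerate(accumulate(variance_pct), start=1):
        if total >= target:
            return k
    return None  # target not reached even with all components


pct = [74.3890, 12.5242, 7.3262, 4.7306, 1.030]
print(components_needed(pct, 90))  # 3
```

Two components reach only 86.9132%, so the third is the first point at which the running total passes 90%.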



CHAPTER 12 TRUE/FALSE - Write 'T' if the statement is true and 'F' if the statement is false. 1) The use of classifying or predicting the value to create an outcome is called scoring a record. ⊚ true ⊚ false

2) KNN is a simple data mining tool, known for developing personalized recommendations for many online company applications. ⊚ true ⊚ false

3) Naïve Bayes classifiers are relatively simple, efficient, and assume dependency among predictors. ⊚ true ⊚ false

4) KNN belongs to a category of mining techniques called computer-based-reasoning. ⊚ true ⊚ false

5) While k-nearest neighbors is effective as a classifier, it provides no information on predictor importance. ⊚ true ⊚ false

6) The naïve Bayes method is an unsupervised data mining technique that uses partitioning to assess model performance. ⊚ true ⊚ false

7) When performing a naïve Bayes analysis, all predictor variables must be categorical. ⊚ true ⊚ false

8) Unlike the KNN method, the naïve Bayes method does not use the validation data set to optimize model complexity. ⊚ true ⊚ false

9) To use the naïve Bayes method, numerical variables can be converted into discrete categories, through a process called binning, and then stored in a newly created categorical value. ⊚ true ⊚ false

10) Binning is a process where categorical data is transformed into numerical segments that can be appended back to the original data set to use a naïve Bayes method. ⊚ true ⊚ false

11) Anastasia should use a classification model to predict the starting salary of a university graduate. ⊚ true ⊚ false

MULTIPLE CHOICE - Choose the one alternative that best completes the statement or answers the question. 12) Using the table below, find the k-nearest neighbors for record 4 using k = 3 for age.

Record  Age  Marital  Loan        Risk
1       33   Other    $4,900.00   High
2       23   Single   $13,000.00  High
3       30   Single   $9,000.00   Medium
4       32   Single   $16,000.00  High
5       44   Single   $9,000.00   Low
6       53   Single   $7,000.00   Low


A) 23, 30, & 33 B) 30, 33, & 44 C) 30, 32, & 33 D) 32, 33, & 44

13) Using the table below, find the k-nearest neighbors for record 4 using k = 3 for age.

Record  Age  Marital  Loan        Risk
1       34   Single   $5,000.00   High
2       24   Married  $15,000.00  Low
3       31   Single   $7,000.00   Medium
4       32   Other    $10,000.00  Medium
5       44   Married  $12,000.00  High
6       53   Single   $9,000.00   Low

A) 24, 31, & 34 B) 31, 34, & 44 C) 31, 32, & 34 D) 32, 34, & 44

14) A new applicant, age 32, is applying for a loan. Using the table below, what is the estimated probability the loan will default using k = 3?

Record  Age  Marital  Loan        Risk    Default
1       34   Other    $4,500.00   Medium  No
2       24   Single   $23,000.00  Low     Yes
3       31   Married  $7,000.00   High    Yes
4       32   Married  $6,000.00   Medium  No
5       44   Other    $11,000.00  Low     No
6       53   Married  $12,000.00  High    Yes

A) 100% B) 66% C) 33% D) 0%



15) A new applicant, age 32, is applying for a loan. Using the table below, what is the estimated probability the loan will default using k = 3?

Record  Age  Marital  Loan        Risk    Default
1       34   Single   $5,000.00   High    No
2       24   Married  $15,000.00  Low     No
3       31   Single   $7,000.00   Medium  Yes
4       32   Other    $10,000.00  Medium  No
5       44   Married  $12,000.00  High    Yes
6       53   Single   $9,000.00   Low     No

A) 100% B) 66% C) 33% D) 0%

16) For a new observation of (0, 0, 0), what is the k-nearest neighbor when k = 1?

Observation  Y     X1  X2  X3
1            Red   1   0   2
2            Blue  1   0   2
3            Blue  0   2   1
4            Red   1   2   0
5            Blue  1   1   2
6            Blue  1   −1  0

A) 1, 1, 2 B) 1, 0, 2 C) 1, 2, 0 D) 1, −1, 0

17) For a new observation of (0, 0, 0), what is the k-nearest neighbor when k = 1?

Observation  Y     X1  X2  X3
1            Red   0   2   0
2            Blue  −2  −1  0
3            Blue  −1  0   1
4            Red   0   1   3
5            Blue  1   1   1
6            Blue  0   3   0


A) 1, 1, 1 B) −2, −1, 0 C) 0, 1, 3 D) −1, 0, 1

18) What is the estimated probability that the cheese sample tested in NW will be Gouda? k = 3

Sample  Location  Cheese
1       NW        Gouda
2       SW        Cheddar
3       SE        Parmesan
4       NW        Gouda
5       NW        Gouda

A) 33% B) 67% C) 40% D) 100%

19) What is the estimated probability that the cheese sample tested in NW will be Gouda? k = 3

Sample  Location  Cheese
1       NW        Gouda
2       NW        Gouda
3       SW        Cheddar
4       SE        Gouda
5       NW        Parmesan

A) 40% B) 60% C) 34% D) 67%



20) A new applicant, age 45, is applying for a loan. Using the table below, what is the estimated probability the loan will be approved? k = 4.

Record  Age  Marital  Loan        Risk    Default
1       34   Single   $5,000.00   Low     Yes
2       24   Other    $18,000.00  Medium  Yes
3       49   Other    $15,000.00  High    No
4       32   Single   $9,000.00   High    No
5       44   Single   $15,000.00  Medium  Yes
6       53   Single   $9,000.00   Low     No

A) The probable default rate is 50%, the loan will be declined. B) The probable success rate is 50%, the loan will be approved. C) The probable success rate is 30%, the loan will be approved. D) The probable default rate is 25%, the loan will be declined.

21) A new applicant, age 45, is applying for a loan. Using the table below, what is the estimated probability the loan will be approved? k = 4.

Record  Age  Marital  Loan        Risk    Default
1       34   Single   $5,000.00   High    Yes
2       24   Married  $15,000.00  Low     No
3       49   Single   $7,000.00   Medium  Yes
4       32   Other    $10,000.00  Medium  No
5       44   Married  $12,000.00  High    Yes
6       53   Single   $9,000.00   Low     No

A) The probable default rate is 75%, the loan will be declined. B) The probable success rate is 75%, the loan will be approved. C) The probable success rate is 30%, the loan will be approved. D) The probable default rate is 25%, the loan will be declined.

22) Using the following table, which k should be used in the subsequent calculations?

k   % Misclassification
1   10.66666785
2   10.33333593
3   9.33333245
4   9.33333591
5   8.00
6   11.66666397
7   11.33333574
8   9.80
9   8.33333201
10  8.66666584

A) 8 B) 5 C) 6 D) None of the percentages should be used.

23) Using the following table, which k should be used in the subsequent calculations?

k   % Misclassification
1   10.66666667
2   10.33333333
3   9.33333333
4   9.33333333
5   7
6   11.66666667
7   11.33333333
8   10
9   8.33333333
10  8.66666666

A) 8 B) 5 C) 6 D) None of the percentages should be used.

24) If the performance measures from the training data are considerably higher than the values from the validation and test data, what could be the issue? A) Proportion B) Sensitivity C) Duplication D) Overfitting



25) An issue with the naïve Bayes classifier is determining rare outcomes because the estimate is 0. To overcome this problem, the algorithm allows a replacement of zero probability with a nonzero value. This technique is called A) replacement. B) smoothing. C) discrete. D) combinations.

26) What is the Euclidean distance between Observation 1 and the origin point of (0, 0, 0)?

Observation  Y     X1  X2  X3
1            Red   0   2   0
2            Blue  1   0   2
3            Blue  0   1   1

A) 2 B) 1 C) 0 D) 3

27) What is the Euclidean distance between Observation 1 and the origin point of (0, 0, 0)?

Observation  Y     X1  X2  X3
1            Red   0   2   0
2            Blue  2   1   0
3            Blue  1   0   1

A) 2 B) 1 C) 0 D) 3

28) The chart below is a summary of the main results of a test data set representing the population observed purchasing a virtual digital assistant. What does the accuracy rate indicate?

Metrics
Metric                Value
Accuracy (#correct)   265
Accuracy (%correct)   92.9
Specificity           0.88129
Sensitivity (Recall)  0.7
Precision             0.82494
F1 Score              0.88238
Success Class         1
Success Probability   0.5

A) 92.9% of the population has purchased a virtual assistant. B) 70% of the observations are correctly classified. C) 92.9% of the observations are correctly classified. D) 7.1% of the observations are correctly classified.

29) The chart below is a summary of the main results of a test data set representing the population observed purchasing a virtual digital assistant. What does the accuracy rate indicate?

Metrics
Metric                Value
Accuracy (#correct)   212
Accuracy (%correct)   88.5
Specificity           0.88188
Sensitivity (Recall)  0.9
Precision             0.82453
F1 Score              0.88229
Success Class         1
Success Probability   0.5

A) 88.5% of the population has purchased a virtual assistant. B) 90% of the observations are correctly classified. C) 88.5% of the observations are correctly classified. D) 11.5% of the observations are correctly classified.

30) The chart below is a summary of the main results of a test data set representing the population observed purchasing a virtual digital assistant. What is the percent of the results that are incorrectly classified?

Metrics
Metric                Value
Accuracy (#correct)   169
Accuracy (%correct)   92.2
Specificity           0.88222
Sensitivity (Recall)  0.6
Precision             0.82494
F1 Score              0.88201
Success Class         1
Success Probability   0.5

Error Report
Class    # Cases  # Errors  % Errors
0        160      15        9.15
1        80       10        13
Overall  240      28        7.80

A) 8% B) 13% C) 9.15% D) 7.80%

31) The chart below is a summary of the main results of a test data set representing the population observed purchasing a virtual digital assistant. What is the percent of the results that are incorrectly classified?

Metrics
Metric                Value
Accuracy (#correct)   212
Accuracy (%correct)   88.5
Specificity           0.88188
Sensitivity (Recall)  0.9
Precision             0.82453
F1 Score              0.88229
Success Class         1
Success Probability   0.5

Error Report
Class    # Cases  # Errors  % Errors
0        140      18        12.85
1        100      10        10
Overall  240      28        11.5


A) 12% B) 10% C) 12.85% D) 11.5%

32) Marta is partitioning her data set into 60% for training and 40% for validation. She is first specifying ‘Member’ as her target variable. Which command will she need to use to fix the random seed and ensure consistent results? A) myIndex B) trainSet C) createDataPartition D) set.seed

33) Which chart allows for the categorization of large data sets from high to low values, dividing sets of observations into an easy visual representation of the data? A) Decile-wise chart B) Cumulative lift chart C) Scatterplot D) ROC Curve

34) This chart measures the effectiveness of a predictive model, containing both a baseline and a lift curve. A) Decile-wise chart B) Cumulative lift chart C) Scatterplot D) ROC Curve

35) This chart determines how well the model performs in terms of sensitivity and specificity.


A) Decile-wise chart B) Cumulative lift chart C) Scatterplot D) ROC Curve

36) The following table is the count of observations in each class of the training data set on approvals and declines for a loan at a local bank. Using the naïve Bayes method, calculate the conditional probabilities of being approved (declined) for the loan for both male and female applicants, and indicate which one should be given the approved classification.

                  Male (x = 1)  Female (x = 0)
Approved (y = 1)  25            33
Declined (y = 0)  20            78

A) 0.56 > 0.44 for male; 0.297 < 0.703 for female, providing male with the approved classification. B) 0.56 > 0.44 for male; 0.40 < 0.60 for female, providing male with the approved classification. C) 0.297 < 0.703 for male; 0.56 > 0.44 for female, providing female with the approved classification. D) 0.297 < 0.703 for male; 0.50 >= 0.50 for female, providing female with the approved classification.

37) The following table is the count of observations in each class of the training data set on approvals and declines for a loan at a local bank. Using the naïve Bayes method, calculate the conditional probabilities of being approved (declined) for the loan for both male and female applicants, and indicate which one should be given the approved classification.

                  Male (x = 1)  Female (x = 0)
Approved (y = 1)  20            35
Declined (y = 0)  15            70


A) 0.57 > 0.43 for male; 0.333 < 0.667 for female, providing male with the approved classification. B) 0.57 > 0.43 for male; 0.40 < 0.60 for female, providing male with the approved classification. C) 0.333 < 0.667 for male; 0.57 > 0.43 for female, providing female with the approved classification. D) 0.333 < 0.667 for male; 0.50 >= 0.50 for female, providing female with the approved classification.

38) A researcher is preparing data for a k-fold cross-validation. The number of groups the sample data is to be split into is 10. What would k equal in a 10-fold cross-validation? A) k = 1 B) k = 5 C) k = 10 D) k cannot be determined until applied to a machine model.

39) The following table reflects the observations made on the color and type of vehicle, if a speeding ticket was received (1) or a warning (0), and if there was a prior driving violation (yes or no). Using the naïve Bayes calculation, what is the conditional probability of receiving a ticket with a red vehicle given a prior driving violation? Assume predictor variables are independent.

Prior  Color  Type   Ticket
No     Green  Sport  0
No     Red    Sport  0
No     Red    Sport  0
No     Green  Sport  0
Yes    Green  SUV    0
No     Green  SUV    0
Yes    Red    Sport  1
Yes    Black  SUV    1
No     Green  SUV    1
No     Black  SUV    1


A) 0.97 B) 0.62 C) 0.49 D) 0.84

40) The following table reflects the observations made on the color and type of vehicle, if a speeding ticket was received (1) or a warning (0), and if there was a prior driving violation (yes or no). Using the naïve Bayes calculation, what is the conditional probability of receiving a ticket given a red vehicle and a prior driving violation? Assume predictor variables are independent.

Prior    Color    Type     Ticket
Yes      Red      SUV      0
No       Red      Sport    0
No       Green    Sport    0
No       Red      SUV      0
No       Black    SUV      0
No       Red      Sport    0
Yes      Red      Sport    1
Yes      Black    Sport    1
No       Green    SUV      1
Yes      Black    SUV      1

A) 0.88 B) 0.53 C) 0.40 D) 0.75

41) An ROC curve produced in R, with AUC = 0.9453, is presented below from an analysis of potential membership upgrades among current basic members at Costco Wholesale. What does the AUC indicate about the prediction of upgraded membership enrollment among current basic members? {MISSING IMAGE}

A) The high AUC indicates an anomaly in that it requires data smoothing to match the baseline. B) The high AUC indicates the KNN model performs well and better than the baseline model. C) The high AUC indicates the KNN model is not predicting the level increase based on the baseline model. D) The high AUC indicates there is a 0.05% probability the KNN model performs as predicted.

42) The marketing group at Rings Are Us is trying to predict if undergraduate or graduate students are more inclined to purchase (y = 1) or not purchase (y = 0) a class ring at graduation. Using the following counts from the training data set, calculate the conditional probabilities for both groups to determine which should be classified to the purchase group.

                          Purchase (y = 1)    No Purchase (y = 0)
Undergraduate (x = 1)     70                  36
Graduate (x = 0)          102                 102

A) Undergraduate 0.66 > 0.272; Graduate 0.428 < 0.272. The undergraduate is assigned to the purchaser group. B) Undergraduate 0.66 > 0.34; Graduate 0.5 = 0.5. The undergraduate is assigned to the purchaser group. C) Undergraduate 0.303 > 0.272; Graduate 0.476 < 0.524. The undergraduate is assigned to the purchaser group. D) Undergraduate 0.30 < 0.20; Graduate 0.524 > 0.476. The graduate is assigned to the purchaser group.

43) The marketing group at Rings Are Us is trying to predict if undergraduate or graduate students are more inclined to purchase (y = 1) or not purchase (y = 0) a class ring at graduation. Using the following counts from the training data set, calculate the conditional probabilities for both groups to determine which should be classified to the purchase group.

                          Purchase (y = 1)    No Purchase (y = 0)
Undergraduate (x = 1)     70                  30
Graduate (x = 0)          100                 110

A) Undergraduate 0.70 > 0.272; Graduate 0.428 < 0.272. The undergraduate is assigned to the purchaser group. B) Undergraduate 0.70 > 0.30; Graduate 0.476 < 0.524. The undergraduate is assigned to the purchaser group. C) Undergraduate 0.303 > 0.272; Graduate 0.476 < 0.524. The undergraduate is assigned to the purchaser group. D) Undergraduate 0.30 < 0.20; Graduate 0.524 > 0.476. The graduate is assigned to the purchaser group.

44) Specificity is A) TP ÷ (TP + FN). B) TN ÷ (TN + FP). C) TP ÷ (TN + FP). D) 1 − TN ÷ (TN + FP).

45) To examine classification for k-fold cross-validation and naïve Bayes, two packages contain the necessary functions for partitioning the data. These are A) caret & klaR. B) caret & Crisp. C) klarR & SEMMA. D) predictive & caret.

46) Mark is reviewing a partial summary of results from a test data set on a small health clinic. With an accuracy of 61% (100 observations), a sensitivity of 41%, and a specificity of 100%, can Mark rely on the true positive rate to identify those with the flu? A) Yes, because the specificity is 100% identifying healthy patients. B) No, because the test set identified all patients with the flu. C) No, because the sensitivity rate is only 41% in identifying those with the flu. D) Yes, because there is a 61% accuracy rate with all healthy identified.

47) Mark is reviewing a partial summary of results from a test data set on a small health clinic. With an accuracy of 75% (100 observations), a sensitivity of 50%, and a specificity of 100%, can Mark rely on the true positive rate to identify those with the flu? A) Yes, because the specificity is 100% identifying healthy patients. B) No, because the test set identified all patients with the flu. C) No, because the sensitivity rate is only 50% in identifying those with the flu. D) Yes, because there is a 75% accuracy rate with all healthy identified.

48) Of the following options, which does not represent the naïve Bayes method? A) All predictor variables are categorical. B) All predictor variables are independent. C) Does not capture possible interactions between predictor variables. D) Works best on a small data set.

49) Using the following table, what is the estimate of P(Color = Black), and what is the smoothed estimate of P(Color = Black) with k = 1?

Number   1      2    3      4    5      6    7      8      9      10
Color    Green  Red  Black  Red  Green  Red  Green  Black  Green  Black

A) P(Color = Black) = 0.70 and Smoothed 0.471 B) P(Color = Black) = 0.30 and Smoothed 0.380 C) P(Color = Black) = 0.3 and Smoothed 0.308 D) P(Color = Black) = 0.30 and Smoothed 0.471

50) Using the following table, what is the estimate of P(Color = Black), and what is the smoothed estimate of P(Color = Black) with k = 1?

Number   1      2    3      4    5      6      7    8    9      10
Color    Black  Red  Green  Red  Black  Black  Red  Red  Green  Red

A) P(Color = Black) = 0.70 and Smoothed 0.471 B) P(Color = Black) = 0.30 and Smoothed 0.380 C) P(Color = Black) = 0.30 and Smoothed 0.308 D) P(Color = Black) = 0.30 and Smoothed 0.471

51) In a decile-wise lift chart, what does the lift value of the leftmost bar imply? {MISSING IMAGE} A) The lift value is determined by the smoothing of the data. B) The first 10% yields twice as many as a random selection of 10% would. C) The bar represents 10% of the data cumulative score. D) The first 10% is twice as prevalent as 20%.

52) To validate the model on the validation set, Mary calibrates the output of the model by examining all possible outcomes of the prediction (true positive, true negative, false positive, false negative). One way is to use a cutoff value and functions such as the ifelse() function. These statements are called A) prediction. B) set.seed. C) reference. D) confusionMatrix.

53) What is the Euclidean distance between Observation 2 and the origin point of (0, 0, 0)?

Observation   Y      X1   X2   X3
1             Red    0    2    0
2             Blue   2    1    0
3             Blue   1    0    1

A) B) C) D)

54) Using the following table of the results of a paper towel study and selection, the XYZ company is making a new product with Durability = 3 and Feel = 4. Using the Euclidean distance, which Type is closest to the new observation?

Type   Durability   Feel   Choice
1      4            2      Yes
2      2            3      No
3      1            4      No
4      3            3      Yes

A) Type 2 B) Type 1 C) Type 4 D) Type 3

55) Using the following table of the results of a paper towel study and selection, the XYZ company is making a new product with Durability = 3 and Feel = 4. What are the k nearest neighbors when k = 2?

Type   Durability   Feel   Choice
1      4            2      No
2      2            3      Yes
3      1            4      No
4      3            3      Yes

A) Types 2 & 4 B) Types 1 & 4 C) Types 2 & 3 D) Types 3 & 4

56) XYZ Streaming services wants to be able to provide programming recommendations to their customers. What type of supervised data mining technique is most used? A) classification and regression trees B) k nearest neighbors method C) linear regression model D) Naïve Bayes classifiers

57) Kurt wants to build a model that uses categorical weather conditions (e.g., outlook, humidity, windy, and temperature) to predict whether it is a good day to go golfing. What type of supervised data mining technique would most likely be used? A) classification and regression trees B) k nearest neighbors method C) linear regression model D) Naïve Bayes classifiers

58) The following table reflects the observations made on the weather outlook and temperature, if Kurt golfed (1) or did not golf (0), and if it was windy (yes or no). Using the naïve Bayes calculation, what is the conditional probability that Kurt golfs given a sunny and windy day? Assume predictor variables are independent.

Windy   Outlook   Temp   Golf
Yes     Sunny     Hot    0
No      Rainy     Cold   0
Yes     Cloudy    Mild   0
No      Cloudy    Cold   0
No      Sunny     Hot    1
No      Sunny     Mild   1
Yes     Sunny     Hot    1
No      Cloudy    Cold   1
No      Sunny     Hot    1
Yes     Cloudy    Mild   1

A) 0.10 B) 0.38 C) 0.73 D) 0.95

59) The marketing group at Fabulous Flowers is trying to predict if the parents of undergraduate or graduate students are more inclined to purchase (y = 1) or not purchase (y = 0) flowers at graduation. Using the following counts from the training data set, calculate the conditional probabilities for both groups to determine which should be classified to the purchase group.

                          Purchase (y = 1)    No Purchase (y = 0)
Undergraduate (x = 1)     160                 80
Graduate (x = 0)          50                  90

A) Undergraduate 0.667 > 0.333; Graduate 0.357 < 0.643. The undergraduate is assigned to the purchaser group. B) Undergraduate 0.760 > 0.471; Graduate 0.529 > 0.238. The undergraduate is assigned to the purchaser group. C) Undergraduate 0.333 < 0.667; Graduate 0.643 > 0.357. The graduate is assigned to the purchaser group. D) Undergraduate 0.30 < 0.500; Graduate 0.529 > 0.500. The graduate is assigned to the purchaser group.

60) What is the function used in R to create bins for a variable myData$x1? A) breaks B) cut C) pnorm D) transform

Answer Key

Test name: Chap 12_2e_Jaggia

1) TRUE In its basic form, scoring a record means using classification or prediction, based on selected variables, to arrive at an outcome for a new observation.

2) TRUE KNN is a simple data mining tool with clear, understandable results. It is an excellent choice for movie, book, and other online recommendation models.

3) FALSE Naïve Bayes is simple, efficient, effective, and assumes independence among predictors.

4) FALSE KNN belongs to a category of mining techniques called memory-based reasoning.

5) TRUE Along with providing no information on predictor importance, the k-nearest neighbors method searches the entire data set and is, thus, computationally expensive.

6) FALSE The naïve Bayes method is a supervised data mining technique.

7) TRUE The naïve Bayes method requires that the predictor variables be categorical.

8) TRUE The validation set is not used; the partitioned training set is used to compute the probabilities.

9) TRUE The numerical data need to be converted to discrete categories by binning the values into ranges, creating a new variable that holds a categorical value.

10) FALSE The naïve Bayes method requires categorical values. Binning is the process for transforming numerical values into discrete categories.

11) FALSE Since a graduate’s starting salary is numerical, Anastasia should use a prediction model. Classification models are used when the target variable is categorical.

12) A Using KNN, you need to locate the k = 3 values of age nearest to record number 4, which is 32. These would be 23, 30, & 33, which have Euclidean distances of 9, 2, and 1, respectively. Records 5 and 6 are 12 and 21 away from 32.

13) A Using KNN, you need to locate the k = 3 values of age nearest to record number 4, which is 32. These would be 24, 31, & 34, which have Euclidean distances of 8, 1, and 2, respectively. Records 5 and 6 are 12 and 21 away from 32.

14) C The three nearest in age are 31, 32, & 34. Based on the default rate, 1 out of 3 has defaulted; thus, there is a 0.33 or 33% chance the new applicant will default.

15) C The three nearest in age are 31, 32, & 34. Based on the default rate, 1 out of 3 has defaulted; thus, there is a 0.33 or 33% chance the new applicant will default.

16) D When k = 1, the nearest neighbor is 1, −1, 0.

17) D When k = 1, the nearest neighbor is −1, 0, 1.

18) D With k = 3, all three of the nearest neighbors in the NW location are identified as Gouda. Thus, there is a 3/3 = 100% probability that the new sample is Gouda.

19) D With k = 3, two of the three nearest neighbors in the NW location are identified as Gouda. Thus, there is a 2/3 = 67% probability that the new sample is Gouda.

20) A The four nearest in age are 34, 44, 49, 53. Based on the default rate, 2 out of 4 have defaulted; thus, there is a 0.50 or 50% chance the new applicant will default.

21) A The four nearest in age are 34, 44, 49, 53. Based on the default rate, 3 out of 4 have defaulted; thus, there is a 0.75 or 75% chance the new applicant will default.

22) B k = 5 has the lowest misclassification rate (false positives and false negatives combined) and, thus, should be used.

23) B k = 5 has the lowest misclassification rate (false positives and false negatives combined) and, thus, should be used.

24) D Overfitting may exist when the model fits the training data too closely, thus hindering its reliability on future observations.

25) B Smoothing, or Laplace smoothing, is a technique used to overcome the rare problem of a zero probability when classifying a new record; the zero probability is replaced with a nonzero value.

26) A

27) A

28) C The accuracy rate of 92.9% reflects the observations that are correctly identified.

29) C The accuracy rate of 88.5% reflects the observations that are correctly identified.

30) D In the error report, the overall % Errors = 7.80%.

31) D In the error report, the overall % Errors = 11.5%.

32) D Marta needs to use the set.seed function to fix the random seed.

33) A A decile-wise lift chart shows data in easy block increments, ordered from high to low values, allowing for analysis by category.

34) B The cumulative lift chart allows for a visual representation of the effectiveness of the predictive model. The greater the area between the lift curve and the baseline, the better the model effectiveness.

35) D The receiver operating characteristic (ROC) curve assesses the model performance in terms of sensitivity and specificity.

36) A P(y = 1 | x = 1) = 25 ÷ 45 = 0.56; P(y = 0 | x = 1) = 20 ÷ 45 = 0.44, thus 0.56 > 0.44 for the male classification, and P(y = 1 | x = 0) = 33 ÷ 111 = 0.297; P(y = 0 | x = 0) = 78 ÷ 111 = 0.703, thus 0.297 < 0.703 for the female class. So, the male class would be assigned to the approved and the female to the declined.

37) A P(y = 1 | x = 1) = 20 ÷ 35 = 0.57; P(y = 0 | x = 1) = 15 ÷ 35 = 0.43, thus 0.57 > 0.43 for the male classification, and P(y = 1 | x = 0) = 35 ÷ 105 = 0.333; P(y = 0 | x = 0) = 70 ÷ 105 = 0.667, thus 0.333 < 0.667 for the female class. So, the male class would be assigned to the approved and the female to the declined.
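
The single-predictor calculations in answers 36 and 37 are row-wise relative frequencies of the count table. A minimal Python sketch using the counts from question 37 follows; the function name class_probabilities is hypothetical, and Python is used only for illustration rather than as the workflow the test bank assumes.

```python
def class_probabilities(approved, declined):
    """Return (P(y=1 | x), P(y=0 | x)) from the class counts for one value of x."""
    total = approved + declined
    return approved / total, declined / total

# Counts from question 37: (approved, declined) for each group
counts = {"male": (20, 15), "female": (35, 70)}

for group, (approved, declined) in counts.items():
    p1, p0 = class_probabilities(approved, declined)
    label = "approved" if p1 > p0 else "declined"
    print(f"{group}: P(y=1|x)={p1:.3f}, P(y=0|x)={p0:.3f} -> {label}")
```

Each group is assigned to whichever class has the larger conditional probability, which is the comparison the answer options spell out.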

38) C The number of groups the data will be split into is indicated in the name. k = 10, hence, 10-fold cross-validation.

39) B y = ticket; x1 = prior; x2 = color. P(y = 1 | x1 = yes, x2 = red) = P(x1 = yes | y = 1) P(x2 = red | y = 1) P(y = 1) / P(x1 = yes, x2 = red). (A) P(x1 = yes | y = 1) P(x2 = red | y = 1) P(y = 1) = (2/4)(1/4)(4/10) = 0.050. (B) P(x1 = yes, x2 = red, y = 0) = P(x1 = yes | y = 0) P(x2 = red | y = 0) P(y = 0) = (1/6)(2/6)(6/10) = 0.033. Now, P(x1 = yes, x2 = red) = P(x1 = yes, x2 = red, y = 1) + P(x1 = yes, x2 = red, y = 0) = A + B = 0.050 + 0.033 = 0.08 (rounded). Using the rounded denominator, (A)/(A + B) = 0.62 is the conditional probability that a red colored vehicle with a prior driving violation receives a ticket.

40) B y = ticket; x1 = prior; x2 = color. P(y = 1 | x1 = yes, x2 = red) = P(x1 = yes | y = 1) P(x2 = red | y = 1) P(y = 1) / P(x1 = yes, x2 = red). (A) P(x1 = yes | y = 1) P(x2 = red | y = 1) P(y = 1) = (3/4)(1/4)(4/10) = 0.075. (B) P(x1 = yes, x2 = red, y = 0) = P(x1 = yes | y = 0) P(x2 = red | y = 0) P(y = 0) = (1/6)(4/6)(6/10) = 0.067. Now, P(x1 = yes, x2 = red) = P(x1 = yes, x2 = red, y = 1) + P(x1 = yes, x2 = red, y = 0) = A + B = 0.075 + 0.067 = 0.14. (A)/(A + B) = 0.53 is the conditional probability that a red colored vehicle with a prior driving violation receives a ticket.

41) B With a very high AUC, the model is considered a strong predictor and better than the baseline model in terms of sensitivity and specificity across all cutoff values.
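
The two-predictor naïve Bayes calculation in answer 40 can be checked directly from the rows of the question 40 table. The sketch below is illustrative Python under the assumption that the extracted columns line up row by row as listed; the helper name joint_score is hypothetical.

```python
# Training data from question 40: (prior violation, color, ticket)
rows = [
    ("Yes", "Red", 0), ("No", "Red", 0), ("No", "Green", 0), ("No", "Red", 0),
    ("No", "Black", 0), ("No", "Red", 0), ("Yes", "Red", 1), ("Yes", "Black", 1),
    ("No", "Green", 1), ("Yes", "Black", 1),
]

def joint_score(rows, prior, color, ticket):
    """P(prior | y) * P(color | y) * P(y), assuming independent predictors."""
    in_class = [r for r in rows if r[2] == ticket]
    n = len(in_class)
    p_prior = sum(r[0] == prior for r in in_class) / n
    p_color = sum(r[1] == color for r in in_class) / n
    return p_prior * p_color * n / len(rows)

a = joint_score(rows, "Yes", "Red", 1)  # ticket received
b = joint_score(rows, "Yes", "Red", 0)  # warning only
posterior = a / (a + b)
print(round(a, 3), round(b, 3), round(posterior, 2))
```

The denominator A + B plays the role of P(x1 = yes, x2 = red), so the posterior is simply the ticket score divided by the sum of both class scores.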

42) B P(y = 1 | x = 1) = 70 ÷ 106 = 0.66; P(y = 0 | x = 1) = 36 ÷ 106 = 0.34, thus 0.66 > 0.34 for the undergraduate classification. P(y = 1 | x = 0) = 102 ÷ 204 = 0.500; P(y = 0 | x = 0) = 102 ÷ 204 = 0.500, thus 0.500 = 0.500 for the graduate class. So, the undergraduate class would be assigned to the purchaser group.

43) B P(y = 1 | x = 1) = 70 ÷ 100 = 0.70; P(y = 0 | x = 1) = 30 ÷ 100 = 0.30, thus 0.70 > 0.30 for the undergraduate classification. P(y = 1 | x = 0) = 100 ÷ 210 = 0.476; P(y = 0 | x = 0) = 110 ÷ 210 = 0.524, thus 0.476 < 0.524 for the graduate class. So, the undergraduate class would be assigned to the purchaser group.

44) B Specificity = True Negative (TN) ÷ (True Negative (TN) + False Positive (FP)).

45) A caret & klaR are the two recommended packages to examine k-fold cross-validation and naïve Bayes.

46) C By the results, all healthy patients were correctly identified (specificity of 100%), whereas the sensitivity shows that only 41% of the patients with the flu were identified. Thus, the test can identify healthy cases but cannot reliably identify those with the flu.

47) C By the results, all healthy patients were correctly identified (specificity of 100%), whereas the sensitivity shows that only 50% of the patients with the flu were identified. Thus, the test can identify healthy cases but cannot reliably identify those with the flu.

48) D Naïve Bayes works best on a large data set with a large number of predictor variables.

49) C P(Color = Black) = 3 ÷ 10 or 0.30 and smoothed: (3 + 1) ÷ (10 + 3 × 1) = 0.308.

50) C P(Color = Black) = 3 ÷ 10 or 0.30 and smoothed: (3 + 1) ÷ (10 + 3 × 1) = 0.308.

51) B Each bar represents the partitioned data in 10% increments, so the leftmost bar identifies the group with the highest predicted probabilities and compares it with a random selection of the same size.
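
The smoothed estimate in answers 49 and 50 follows the pattern (count + k) ÷ (n + k × number of categories). A minimal Python sketch with the question 49 data is given below; the function name smoothed_estimate is hypothetical, and the number of categories is assumed to be the number of distinct colors observed in the sample (three here).

```python
from collections import Counter

def smoothed_estimate(values, target, k=1):
    """Laplace-smoothed estimate: (count + k) / (n + k * number_of_categories)."""
    counts = Counter(values)
    return (counts[target] + k) / (len(values) + k * len(counts))

# Colors from question 49
colors = ["Green", "Red", "Black", "Red", "Green",
          "Red", "Green", "Black", "Green", "Black"]

raw = colors.count("Black") / len(colors)
smoothed = smoothed_estimate(colors, "Black", k=1)
print(round(raw, 2), round(smoothed, 3))
```

With k = 1, every category receives one extra pseudo-count, so a category that never appears in the training data would still get a nonzero probability.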

52) D The confusion matrix statement specifies the value of “1” in the target variable to determine the positive class based on identified factors. If the variable name is Loan, then the code would be: confusionMatrix(as.factor(ifelse(nb_class_prob[,2] > 0.75, '1', '0')), validationSet$Loan, positive = '1').

53) C The calculation is √(2² + 1² + 0²) = √5 ≈ 2.24.

54) C

55) A Based on the Euclidean distances of Type 1 = √((4 − 3)² + (2 − 4)²) = 2.24; Type 2 = √((2 − 3)² + (3 − 4)²) = 1.41; Type 3 = √((1 − 3)² + (4 − 4)²) = 2.00; Type 4 = √((3 − 3)² + (3 − 4)²) = 1.00, the ranking from lowest to highest distance is Type: 4, 2, 3, 1. k = 2 would reflect Types 4 & 2.
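
The distance ranking in answer 55 can be reproduced with a few lines of Python. This is an illustrative sketch only; the helper name k_nearest and the dictionary layout are assumptions, while math.dist is the standard-library Euclidean distance function (Python 3.8 or later).

```python
import math

# Question 55 training data: type -> (durability, feel)
towels = {1: (4, 2), 2: (2, 3), 3: (1, 4), 4: (3, 3)}
new_product = (3, 4)

def k_nearest(points, query, k):
    """Return the k keys whose points have the smallest Euclidean distance to query."""
    return sorted(points, key=lambda name: math.dist(points[name], query))[:k]

distances = {t: round(math.dist(p, new_product), 3) for t, p in towels.items()}
print(distances)
print(k_nearest(towels, new_product, k=2))
```

Sorting by distance and taking the first k entries is the whole method; the class of the new observation would then be decided by the Choice values of those neighbors.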

56) B The k-nearest neighbors (KNN) method is used most to offer personalized recommendations for new books, movies, or TV shows as well as to diagnose diseases based on symptoms like those of previously diagnosed illnesses.

57) D Naïve Bayes classifiers can “learn” and recognize new weather observations as well as allow individual users to train the naïve Bayes algorithm by personally classifying weather observations as good or bad days for golf. In addition, the weather information can usually be captured and reduced into a data set that consists solely of categorical variables (e.g., hot/cold, windy/not windy, etc.), which is a requirement for traditional naïve Bayes methods.

58) C y = golf; x1 = windy; x2 = outlook. P(y = 1 | x1 = yes, x2 = sunny) = P(x1 = yes | y = 1) P(x2 = sunny | y = 1) P(y = 1) / P(x1 = yes, x2 = sunny). (A) P(x1 = yes | y = 1) P(x2 = sunny | y = 1) P(y = 1) = (2/6)(4/6)(6/10) = 0.133. (B) P(x1 = yes, x2 = sunny, y = 0) = P(x1 = yes | y = 0) P(x2 = sunny | y = 0) P(y = 0) = (2/4)(1/4)(4/10) = 0.050. Now, P(x1 = yes, x2 = sunny) = P(x1 = yes, x2 = sunny, y = 1) + P(x1 = yes, x2 = sunny, y = 0) = A + B = 0.133 + 0.050 = 0.183. (A)/(A + B) = 0.73 is the conditional probability that Kurt golfs on a windy, sunny day.

59) A P(y = 1 | x = 1) = 160 ÷ 240 = 0.667; P(y = 0 | x = 1) = 80 ÷ 240 = 0.333, thus 0.667 > 0.333 for the undergraduate classification. P(y = 1 | x = 0) = 50 ÷ 140 = 0.357; P(y = 0 | x = 0) = 90 ÷ 140 = 0.643, thus 0.357 < 0.643 for the graduate class. So, the undergraduate class would be assigned to the purchaser group.

60) B The function cut is used to create bins for variable myData$x1; breaks is the argument within the cut function that tells where the bins are cut.

CHAPTER 13 TRUE/FALSE - Write 'T' if the statement is true and 'F' if the statement is false. 1) A pure subset contains leaf nodes where cases have contradicting values to the target variable, to enhance the variable case outcomes and allow for further splits. ⊚ true ⊚ false

2) Decision trees produced by the CART algorithm are binary, meaning that there are two branches for each decision node. ⊚ true ⊚ false

3) The best-pruned tree is the smallest set, least complex tree, with the smallest validation error. ⊚ true ⊚ false

4) Small changes in the training set, while using the CART algorithm, will result in drastically different trees. ⊚ true ⊚ false

5) A subset with the highest degree of impurity is when a 50% and 50% split occur between classes. ⊚ true ⊚ false

6) Based on the Gini index, 0.10 implies a higher degree of purity because it is closer to 0 than 0.5. ⊚ true ⊚ false

7) In a decision tree, the recursive process of partitions continues and only terminates when the Gini index reaches 0.5. ⊚ true ⊚ false

8) To measure impurity in a regression tree, mean square error (MSE) is used. ⊚ true ⊚ false

9) The overall MSE split for Age = 24 is $23,987,487.29 and for Age = 26 is $21,983,723.40. Of the two presented, Age = 24 is slightly higher and has a lower level of impurity for constructing a regression tree. ⊚ true ⊚ false

10) The overall MSE split for Age = 25 is $22,987,111.29 and for Age = 23 is $21,983,723.40. Of the two presented, Age = 25 is slightly higher and has a lower level of impurity for constructing a regression tree. ⊚ true ⊚ false

11) Before constructing a decision tree, one of the first steps is identifying possible splits of the predictor variable. ⊚ true ⊚ false

12) Boosting is an ensemble modeling strategy that uses the bootstrap aggregation technique to create multiple training data sets by repeatedly sampling the original data with replacement. ⊚ true ⊚ false

13) The random forest technique is particularly useful if the predictor variables are not highly correlated. ⊚ true ⊚ false

MULTIPLE CHOICE - Choose the one alternative that best completes the statement or answers the question. 14) When a target variable is categorical, the CART algorithm produces a __________ tree to predict the class memberships of new cases. A) classification B) regression C) minimum D) pruned

15) Which tree is the least complex and contains the smallest validation error? A) best-pruned tree B) full-grown tree C) minimum error tree D) categorical tree

16) Based on the following sorted 20 values for age, what are the possible split points? {20, 22, 24, 26, 29, 31, 32, 33, 35, 41, 42, 43, 45, 47, 49, 50, 52, 53, 55, 57} A) {20, 21, 23, 25, 27.5, 30, 31.5, 32.5, 34, 38, 41.5, 42.5, 44, 46, 48, 49.5, 51, 52.5, 54, 56} B) {21, 23, 25, 27.5, 30, 31.5, 32.5, 34, 38, 41.5, 42.5, 44, 46, 48, 49.5, 51, 52.5, 54, 56, 57} C) {0, 21, 23, 25, 27.5, 30, 31.5, 32.5, 34, 38, 41.5, 42.5, 44, 46, 48, 49.5, 51, 52.5, 54, 56} D) {21, 23, 25, 27.5, 30, 31.5, 32.5, 34, 38, 41.5, 42.5, 44, 46, 48, 49.5, 51, 52.5, 54, 56}

17) Based on the following sorted 20 values for age, what are the possible split points? {20, 22, 24, 26, 28, 31, 32, 33, 35, 40, 42, 43, 45, 47, 49, 50, 52, 53, 55, 57}

A) {20, 21, 23, 25, 27, 29.5, 31.5, 32.5, 34, 37.5, 41, 42.5, 44, 46, 48, 49.5, 51, 52, 54, 56} B) {21, 23, 25, 27, 29.5, 31.5, 32.5, 34, 37.5, 41, 42.5, 44, 46, 48, 49, 51, 52.5, 54, 56, 57} C) {0, 21, 23, 25, 27, 29.5, 31.5, 32.5, 34, 37.5, 41, 42.5, 44, 46, 48, 49, 51, 52.5, 54, 56} D) {21, 23, 25, 27, 29.5, 31.5, 32.5, 34, 37.5, 41, 42.5, 44, 46, 48, 49.5, 51, 52.5, 54, 56}

18) Based on the following values for income, what are the possible split points? {12,665, 15,432, 28,763, 34,876, 45,437, 53,987} A) {14048.5, 22097.5, 31819.5, 40156.5, 49712, 53987} B) {12665, 14048.5, 22097.5, 31819.5, 40156.5, 49712} C) {14048.5, 22097.5, 31819.5, 40156.5, 49712} D) {14048, 22097, 31819, 40156, 49712}

19) Based on the following values for income, what are the possible split points? {12665, 15432, 28763, 34876, 41967, 52997} A) {14048.50, 22097.50, 31819.50, 38421.50, 47482, 52997} B) {12665, 14048.50, 22097.50, 31819.50, 38421.50, 47482} C) {14048.50, 22097.50, 31819.50, 38421.50, 47482} D) {14048, 22097, 31819, 38421, 47482}

20) If 71% of the cases belong to Class 0 and 29% belong to Class 1, what is the Gini index? A) 0.41 B) 0 C) 0.58 D) 0.14

21) If 80% of the cases belong to Class 0 and 20% belong to Class 1, what is the Gini index?

A) 0.32 B) 0 C) 0.40 D) 0.16

22) In reviewing the split of data, Maggie notes among the 16 cases, 2 belong to Class 1 and the remaining to Class 0. What is the Gini index for the cases and is it pure or impure? A) 0.00 default because it is under 0.5 and pure. B) 0.22 is closer to 0 implying relative purity. C) 0.28 is at the halfway point is not considered pure. D) 0.78 is over the 0.5 level and is impure.

23) In reviewing the split of data, Maggie notes among the 15 cases, 2 belong to Class 1 and the remaining to Class 0. What is the Gini index for the cases and is it pure or impure? A) 0.00 default because it is under 0.5 and pure. B) 0.23 is closer to 0 implying relative purity. C) 0.27 is at the halfway point is not considered pure. D) 0.77 is over the 0.5 level and is impure.

24) Viewing the results in the following scatterplot, for the 11 cases in the left subset (Age < 40), two belong to Class 1 and nine belong to Class 0. In the right subset (Age ≥ 40), three belong to Class 1 and one belongs to Class 0. What is the Gini index for each of the two subsets?

A) (Age < 40) = 0.3636; (Age ≥ 40) = 0.50 B) (Age < 40) = 0.20; (Age ≥ 40) = 0.25 C) (Age < 40) = 0.298; (Age ≥ 40) = 0.375 D) (Age < 40) = 0.375; (Age ≥ 40) = 0.298

25) Robin wanted to know if the age partition chosen for her data was the best fit for her 50 case, 90% Class 1, 10% Class 0 partition. She completed the Gini impurity index with the results of (Age < 32) = 0.2196 and (Age ≥ 32) = 0.2804. What is the weighted combination and what did partition at Age 32 produce? A) Robin was able to reduce the Gini index from 0.2804 to 0.2524 confirming the best split for age. B) Robin was able to reduce the Gini index from 0.2804 to 0.20 confirming the best split for age. C) Robin was able to reduce the Gini index from 0.2804 to 0.2257 confirming the best split for age. D) Robin realized with the 0.2524 weighted average, the age split was not the best split for the age range.

26) Robin wanted to know if the age partition chosen for her data was the best fit for her 30 case, 90% Class 1, 10% Class 0 partition. She completed the Gini impurity index with the results of (Age < 32) = 0.2034 and (Age ≥ 32) = 0.2786. What is the weighted combination and what did partition at Age 32 produce? A) Robin was able to reduce the Gini index from 0.2786 to 0.2507, confirming the best split for age. B) Robin was able to reduce the Gini index from 0.2786 to 0.20, confirming the best split for age. C) Robin was able to reduce the Gini index from 0.2786 to 0.2109, confirming the best split for age. D) Robin realized with the 0.2507 weighted average, the age split was not the best split for the age range.

27) A split at the $32,000 Income point creates a top and bottom partition. Compute the overall (weighted) Gini index given an income split of $32,000.

A) Ginisplit (Income=$36,000) = 0.2667 B) Ginisplit (Income=$36,000) = 0.0000 C) Ginisplit (Income=$36,000) = 0.4959 D) Ginisplit (Income=$36,000) = 0.3636

28) Which description best fits the following tree structure for loan debt balance with a single age predictor?

A) The split points presented represent the MSE calculated points for Age = 35. B) The MAD of the single age predictor is $42,964 and $32,980, respectively. C) The MSE split for Age = 35 is between the two partitions of $42,964 and $32,980, respectively. D) The average loan debt balances of the two partitions are $42,964 and $32,980, respectively, when Age = 35.

29) In R, what does the rpart function use to determine the number of splits in the default classification tree, that is, when to stop growing the tree?

A) nsplit B) complexity parameter C) prune D) predict

30) In an R complexity parameter table, the xerror column represents: A) the cross-validation errors associated with each candidate tree. B) the recommended measure for the full tree. C) the maximum error point for the first node split. D) the root node type argument point.

31) Using the following pruning table, what does the Rel Error represent?

Level   CP     Num Splits   Rel Error   X Error   X Std Dev
1       0.50   0            1.00        1.17      0.050735
2       0.44   1            0.50        0.73      0.061215
3       0.00   2            0.06        0.10      0.030551

A) Rel error is the calculated difference after the standard deviation is removed. B) Rel error is the cross-validation error associated with each candidate tree. C) Rel error is the error for predictions of the data that were used to estimate the model. D) Rel error is the parameter associated with the candidate tree and complexity level.

32) Using the following pruning table, which tree is the minimum error tree?

Level   CP     Num Splits   Rel Error   X Error   X Std Dev
1       0.50   0            1.00        1.17      0.050735
2       0.44   1            0.50        0.73      0.061215
3       0.00   2            0.06        0.10      0.030551

A) Level 3 B) Level 2 C) Level 1 D) Additional Levels needed to identify minimum error tree among candidate trees.

33) Using the following chart for age and income, determine the split points for income.

Age      41       45       38       24       25       39       40       36
Income   43,000   63,000   32,650   33,450   32,650   51,000   63,000   54,000

A) {32650, 33050, 38225, 47000, 52500, 58500, 63000} B) {32650, 38225, 47000, 52500, 63000} C) {33050, 47000, 58500, 63000} D) {33050, 38225, 47000, 52500, 58500}

34) Using the following chart for age and income, determine the split points for income.

Age      42       44       38       21       23       39       40       39
Income   43,000   63,000   33,000   39,000   33,000   51,000   63,000   54,000

A) {33000, 36000, 41000, 47000, 52500, 58500, 63,000} B) {33000, 41000, 47000, 52500, 63000} C) {36000, 47000, 58500, 63000} D) {36000, 41000, 47000, 52500, 58500}

35) Which is not a purpose of running classification and regression trees (CART)?

A) To remove nodes that do not produce additional information B) To simplify and reduce complexity C) To identify the most diverse case set for the target variable D) To reduce the chances of overfitting

36) If the RMSE for the validation set is 49.95 and the RMSE for the test set is 48.45, then what range will the new data RMSE lie in? A) 48.45–49.95 range B) 48–49 range C) 48.45–50 range D) 48-50 range

37) If the RMSE for the validation set is 58.78 and the RMSE for the test set is 57.12, then what range will the new data RMSE lie in? A) 57.12–58.78 range B) 57–58 range C) 57.12–59 range D) 57–59 range

38) A regression tree was developed to predict customer spending for a hotel during football season. One of the leaf nodes consists of six cases in the training set with the following values: 325.00, 361.00, 285.00, 295.00, 380.00, 220.00. What is the predicted spending amount on a hotel for the night for a customer that falls into this leaf node? A) 311.00 B) 312.40 C) 314.50 D) 310.80

39) A regression tree was developed to predict customer spending for a hotel during football season. One of the leaf nodes consists of six cases in the training set with the following values: 312.00, 350.00, 285.00, 295.00, 380.00, 220.00. What is the predicted spending amount on a hotel for the night for a customer that falls into this leaf node? A) 307.00 B) 308.40 C) 310.50 D) 306.80

40) When using the CART algorithm, the Gini index is used in the classification tree; however, in a regression tree, __________ is used to measure impurity. A) mean percentage error B) mean squared error C) mean absolute deviation D) mean absolute percentage error

41) Using the following sample of a regression prune log, the minimum error tree is decision node number 19 with a standard error of 4.689492 (not shown). Using the information provided, which decision node number represents the best-pruned tree?

Number Decision Nodes   Cost Complexity   Train. MSE   Validation MSE
21                      0                 11.59784     22.37068
20                      2.47933762        11.79482     22.01144
19                      2.602975478       11.92497     21.99134
18                      2.510227273       12.16648     22.14245
17                      2.874095507       12.28455     22.12984
16                      2.874095507       12.42569     22.12984
15                      3.636959064       12.653       22.69819
14                      3.277857555       12.87976     22.43582
13                      3.277857555       13.01188     22.43582
12                      3.166379434       13.29182     22.73964
11                      3.359372577       14.02543     22.74113
10                      3.819333349       14.6908      23.05767
9                       4.60470163        14.93437     23.38113
8                       5.988329591       15.39484     23.85401
7                       5.868890688       17.27776     23.74365
6                       6.68214015        17.62498     23.78791
5                       11.29753292       18.57957     24.99924
4                       17.4988252        22.07933     29.80257
3                       17.11427713       29.89755     31.73721
2                       23.45465646       34.17612     27.43468
1                       33.43802325       50.89513     51.82928
0                       32.671226         83.56636     86.50055

A) decision node number 21 B) decision node number 5 C) decision node number 4 D) decision node number 17

42) The following table shows part of the Analytic Solver performance measures for hotel cost during an NFL game night. What does the MAD imply?

Performance Measure   Value
RMSE                  311.76
MAD                   55.24

A) The predicted mean absolute deviation is 0.55 of the mean absolute percentage error. B) The predicted cost is relatively low, providing the need for full tree. C) The predicted average cost is lesser than the standard error, thus impure. D) The predicted cost on average differs from the actual cost by $55.24.

43) The following table shows part of the Analytic Solver performance measures for hotel cost during an NFL game night. What does the MAD imply?

Performance Measure   Value
RMSE                  310.12
MAD                   50.56

A) The predicted mean absolute deviation is 0.51 of the mean absolute percentage error. B) The predicted cost is relatively low, providing the need for full tree. C) The predicted average cost is lesser than the standard error, thus impure. D) The predicted cost on average differs from the actual cost by $50.56.

44) When generating a single regression tree visually in R, the prp function is used. Based on the following example code, what does setting type = 1 mean? >prp(default_tree, type = 1, extra = 1, under = TRUE) A) type = 1 argument is the number of observations that fall into each node displayed. B) type = 1 argument places the number of cases under each decision node in the diagram. C) type = 1 argument allows for all nodes, except leaf nodes, to be labeled in the diagram. D) type = 1 argument allows for the predicting variable to be displayed in root node.

45) Using the following pruning table, which tree is the best-pruned tree?

Level   CP     Num Splits   Rel Error   X Error   X Std Dev
1       0.50   0            1.00        1.17      0.050735
2       0.44   1            0.50        0.73      0.061215
3       0.00   2            0.06        0.10      0.030551

A) Level 3 B) Level 2 C) Level 1 D) Additional Levels needed to identify best-pruned tree.

46) Which option is not one of the three common strategies used in creating ensemble models? A) bagging B) boosting C) bootstrapping D) random forest

47) If the performance measures are based on a cutoff value of 0.5, then if we lower the cutoff value, more cases will be in the target class, resulting in different performance measurement values. What chart can be used to review the data that are independent of the cutoff value? A) cumulative lift chart B) decile-wise lift chart C) ROC curve D) All options are independent of the cutoff value.

48) If predictor variables are highly correlated, then repeated sampling of the training data and a random selection of features are used to construct trees. This is an example of which strategy? A) random forest B) bagging C) boosting D) banking

49) In a random forest model, as a guideline the user needs to select a number of the random features for each tree. If there are 144 predictor variables in the data, each tree will randomly select how many features to be included in the tree? A) 4 B) 9 C) 144 D) 12

50) In a random forest model, as a guideline the user needs to select a number of the random features for each tree. If there are 9 predictor variables in the data, each tree will randomly select how many features to be included in the tree? A) 4 B) 1 C) 9 D) 3

51) When constructing the argument for a bagging tree strategy, the varImpPlot function displays feature importance graphically. For this we set the type argument equal to either 1 or 2. If type = 2, then what does this setting do? A) to show the average decrease in the predictive variable mean in a percentage form B) that R will use the average decrease in the Gini impurity index to compare the feature importance C) to show the feature importance as the average decrease in overall accuracy D) that R will use the average increase in the Gini impurity index to compare future importance

52) Ensemble tree models combine multiple single-tree models to reduce the variation in prediction error. Of the strategies, which may lead to overfitting? A) boosting B) random forest C) bagging D) banking

53) In the following tree, how many leaf nodes are there?

A) six B) seven C) two D) four

54) In the following tree, the arrow is pointing to the __________ node.

A) exterior B) interior C) leaf D) root

55) In the following tree, the arrow is pointing to the __________ node.

A) exterior B) interior C) leaf D) root

56) A data file contains four predictor variables x1, x2, x3, and x4 and a numerical target variable y. Which of the following MSE split points should be selected as the root node? A) MSEsplit(x1=4) = 400.57 B) MSEsplit(x2=42) = 385.29 C) MSEsplit(x3=225) = 40.79 D) MSEsplit(x4=16) = 210.17

57) In which of the following situations is using an MSE to determine the split point appropriate? A) predictor variable is categorical B) predictor variable is numerical C) target variable is categorical D) target variable is numerical

58) The regression tree below relates credit score to number of defaults (NUM_DEF), revolving balance (REV_BAL), and years of credit history (YRS_HIST). Predict the score for an individual with two defaults, a revolving balance of $12,800, and 14 years of credit history.

A) 535 B) 588 C) 672 D) 740

Answer Key
Test name: Chap 13_2e_Jaggia

1) FALSE Pure subsets contain leaf nodes that contain the same value of the target variable. There is no need to further split pure subsets.

2) TRUE CART is binary with two branches for each decision node.

3) FALSE The best-pruned is the smallest set and least complex tree with a validation error within one standard error of the minimum error tree (the tree that produces the lowest error rate).

4) TRUE Because of the complexity requiring a large data set, any change, even a small one, may produce a contrasting tree result.

5) TRUE When half the cases belong to one and the other half to the other, the subset is considered to have the highest degree of impurity, meaning the two classes are not separated as well as they could be. In comparison, a “pure” subset happens when 100% of the cases belong to one class and 0% to the other class.

6) TRUE For binary classification problems, the Gini index of a set of cases can range from 0 (highest purity) to 0.5 (highest impurity). Because 0.10 is relatively small and close to 0, it is purer than 0.5, the highest impurity level.

7) FALSE The process continues until all partitions become a pure subset (Gini index of 0).

8) TRUE The target value in a regression tree is numerical and MSE is used to measure its impurity.

9) FALSE Age 26, with the lower MSE value, represents the lower level of impurity and should be used to construct the decision tree.

10) FALSE Age 23, with the lower MSE value, represents the lower level of impurity and should be used to construct the decision tree.

11) TRUE Splits have to be identified prior to construction of decision tree.

12) FALSE Bagging is an ensemble modeling strategy that uses the bootstrap aggregation technique to create multiple training data sets by repeatedly sampling the original data with replacement. Boosting is an ensemble modeling strategy that forces the model to pay more attention to cases that are misclassified or have large prediction errors in previous trees through a weighted sampling process.

13) FALSE The random forest technique is particularly useful if the predictor variables are highly correlated.

14) A The classification tree is produced from CART using categorical variables to predict factors.

15) C The definition of a minimum error tree is being the least complex with the smallest validation error.

16) D The first possible split point is calculated as (20 + 22)/2 = 21, the next possible split point is calculated as (22 + 24)/2 = 23, and the remaining possible split points are found: {21, 23, 25, 27.5, 30, 31.5, 32.5, 34, 38, 41.5, 42.5, 44, 46, 48, 49.5, 51, 52.5, 54, 56}.

17) D The first possible split point is calculated as (20 + 22)/2 = 21, the next possible split point is calculated as (22 + 24)/2 = 23, and the remaining possible split points are found: {21, 23, 25, 27, 29.5, 31.5, 32.5, 34, 37.5, 41, 42.5, 44, 46, 48, 49.5, 51, 52.5, 54, 56}.

18) C The first possible split point is calculated as (12665 + 15432)/2 = 14048.50, the next possible split point is calculated as (15432 + 28763)/2 = 22097.50, and the remaining possible split points are: {14048.5, 22097.5, 31819.5, 40156.5, 49712}.

19) C The first possible split point is calculated as (12665 + 15432)/2 = 14048.50, the next possible split point is calculated as (15432 + 28763)/2 = 22097.50, and the remaining possible split points are: {14048.50, 22097.50, 31819.50, 38421.50, 47482}.

20) A Gini index = 1 − (0.71² + 0.29²) = 1 − (0.5041 + 0.0841) = 0.4118.

21) A Gini index = 1 − (0.80² + 0.20²) = 1 − (0.64 + 0.04) = 0.32.

22) B Gini(16 cases) = 1 − [(14/16)² + (2/16)²] = 1 − (0.7656 + 0.0156) = 0.2188.

23) B Gini(15 cases) = 1 − [(13/15)² + (2/15)²] = 1 − (0.7511 + 0.0178) = 0.2311.

24) C Gini(Age<40) = 1 − [(9/11)² + (2/11)²] = 1 − [0.6694 + 0.0331] = 0.298. Gini(Age≥40) = 1 − [(3/4)² + (1/4)²] = 1 − [0.5625 + 0.0625] = 0.375.
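The split-point and Gini index arithmetic in answers 16 through 24 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names candidate_split_points and gini_index are hypothetical and do not come from the textbook, Analytic Solver, or R.

```python
def candidate_split_points(values):
    """Return midpoints between consecutive distinct values, in ascending order."""
    distinct = sorted(set(values))  # duplicates are used only once
    return [(low + high) / 2 for low, high in zip(distinct, distinct[1:])]


def gini_index(class_counts):
    """Gini index = 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)


# First three values quoted in answer 18 (the rest of that data set is not shown here).
print(candidate_split_points([12665, 15432, 28763]))

# Class counts from answers 22 and 24.
print(round(gini_index([14, 2]), 4))
print(round(gini_index([9, 2]), 3), round(gini_index([3, 1]), 3))
```

A Gini index of 0 corresponds to a pure subset, and 0.5 is the maximum for a binary target, which matches the ranges cited in answer 6.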

25) C To compute the overall Gini index for the split, we use the weighted combination of the Gini indexes using the percentage of cases in each partition as the weight: Ginisplit(Age = 32) = (45/50) × 0.2196 + (5/50) × 0.2804 = 0.19764 + 0.02804 = 0.22568. By partitioning at 32, Robin was able to reduce the Gini index from 0.2804 to 0.22568.

26) C To compute the overall Gini index for the split, we use the weighted combination of the Gini indexes using the percentage of cases in each partition as the weight: Ginisplit(Age = 32) = (27/30) × 0.2034 + (3/30) × 0.2786 = 0.18306 + 0.02786 = 0.21092. By partitioning at 32, Robin was able to reduce the Gini index from 0.2786 to 0.21092.

27) D Gini(Income < 32,000) = 1 − [(4/4)² + (0/4)²] = 1 − 1 = 0. Gini(Income ≥ 32,000) = 1 − [(6/11)² + (5/11)²] = 1 − [0.2975 + 0.2066] = 0.4959. Ginisplit(Income = 32,000) = (11/15) × 0.4959 + (4/15) × 0 = 0.3636 + 0 = 0.3636.
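The weighted Gini calculation for a split, as used in answers 25 through 27, can be checked with a short sketch. The helper names gini_index, weighted_gini, and gini_split are assumptions made for this illustration only.

```python
def gini_index(class_counts):
    """Gini index of one partition from its class counts."""
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)


def weighted_gini(partitions):
    """Combine (number of cases, Gini index) pairs, weighting by share of cases."""
    total = sum(size for size, _ in partitions)
    return sum(size / total * gini for size, gini in partitions)


def gini_split(class_counts_per_partition):
    """Overall Gini index of a split, given class counts for each partition."""
    return weighted_gini(
        [(sum(counts), gini_index(counts)) for counts in class_counts_per_partition]
    )


# Answer 25: partitions of 45 and 5 cases with Gini indexes 0.2196 and 0.2804.
print(round(weighted_gini([(45, 0.2196), (5, 0.2804)]), 5))

# Answer 27: class counts (4, 0) for Income < 32,000 and (6, 5) for Income >= 32,000.
print(round(gini_split([[4, 0], [6, 5]]), 4))
```

Passing class counts rather than pre-rounded Gini values avoids carrying rounding error from the intermediate steps into the final split value.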

28) D The chart represents the initial split points of the partitioned average loan debt balance of $42,964 and $32,980, respectively, when Age = 35.

29) B The rpart function uses complexity parameters (cp) to determine when to stop growing the tree. If adding a split exceeds the value of cp, then the tree will not continue to grow.

30) A xerror is the cross-validation error associated with each candidate tree.

31) C Rel error shows the fraction of misclassified cases for each tree relative to the fraction of misclassified cases in the root node if all cases are classified into the predominant class.

32) A The minimum error tree is Level 3, which has the lowest prediction error (i.e., X error) among all candidate trees.

33) D First you sort the income in ascending order, only using a duplicate amount once, and you have: {32650, 33450, 43000, 51000, 54000, 63000}. The first possible split point is calculated as (32650 + 33450)/2 = 33050. Continue through the sets and you have your split points as {33050, 38225, 47000, 52500, 58500}.

34) D First you sort the income in ascending order, only using a duplicate amount once, and you have: {33000, 39000, 43000, 51000, 54000, 63000}. The first possible split point is calculated as (33000 + 39000)/2 = 36000. Continue through the sets and you have your split points as {36000, 41000, 47000, 52500, 58500}.

35) C The purpose is to identify subsets that contain only the cases with similar values for the target variable.

36) B Because the RMSE was 49.95 on the validation set and 48.45 on the test set, the new data RMSE will lie in the 48–49 range.

37) B Because the RMSE was 58.78 on the validation set and 57.12 on the test set, the new data RMSE will lie in the 57–58 range.

38) A The predicted value is the average value of the previous cases that belong to the same leaf node. Average the six values: (325.00 + 361.00 + 285.00 + 295.00 + 380.00 + 220.00)/6 = 311.00.

39) A The predicted value is the average value of the previous cases that belong to the same leaf node. Average the six values: (312.00 + 350.00 + 285.00 + 295.00 + 380.00 + 220.00)/6 = 307.00.

40) B The mean squared error (MSE) is used to measure impurity in a regression tree.

41) B The best-pruned tree is decision node number 5, because it’s the smallest tree with a validation error within one standard error of the minimum error tree (21.99134 + 4.689492 = 26.680832). Node 5 with validation MSE 24.99924 is the smallest tree within one standard error of the minimum error tree.

42) D MAD is implying the predicted cost on average differs from the actual cost by $55.24.

43) D MAD is implying the predicted cost on average differs from the actual cost by $50.56.

44) C In the sample code, the argument type = 1 refers to having all nodes, except leaf nodes, labeled in the diagram.

45) A The minimum error tree is Level 3. However, there are no other candidates that are within one standard error of the minimum (0.10 + 0.030551 = 0.130551), making Level 3 the best-pruned tree also.

46) C Bootstrapping is an aggregate technique used in conjunction with bagging and not one of the three common strategies.

47) D All options are independent of the cutoff values and are applicable to evaluate model performance.

48) A An extension of the bagging strategy, random forest takes the highly correlated bagging results, implements a repeated sampling of the training data and a random selection of features to construct trees, which increases diversity.

49) D The general guideline is to take the square root of the predictor variable count to find the number of features to be included in the tree. The square root of 144 is 12.

50) D The general guideline is to take the square root of the predictor variable count to find the number of features to be included in the tree. The square root of 9 is 3.

51) B type = 2 in R uses the average decrease in the Gini impurity index to compare the feature importance.

52) A Boosting cannot develop trees in parallel, which makes it more computationally expensive, and it may lead to overfitting.

53) D Leaf nodes (or terminal nodes) are the bottom nodes of the decision tree and are where the classification or prediction outcomes are given.
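The best-pruned rule applied in answers 41 and 45 (the smallest tree whose validation error is within one standard error of the minimum error tree) can be sketched as follows. The function name best_pruned_tree is a hypothetical helper for illustration, and the dictionary holds only a subset of the prune log from question 41.

```python
def best_pruned_tree(validation_error, std_error):
    """Return (size of best-pruned tree, error threshold).

    validation_error maps tree size (decision nodes or splits) to validation error;
    std_error is the standard error of the minimum error tree.
    """
    min_size = min(validation_error, key=validation_error.get)
    threshold = validation_error[min_size] + std_error
    eligible = [size for size, err in validation_error.items() if err <= threshold]
    return min(eligible), threshold


# Subset of the prune log in question 41 (decision nodes -> validation MSE).
validation_mse = {
    19: 21.99134, 6: 23.78791, 5: 24.99924, 4: 29.80257,
    3: 31.73721, 2: 27.43468, 1: 51.82928, 0: 86.50055,
}
print(best_pruned_tree(validation_mse, 4.689492))

# Pruning table in question 45 (number of splits -> X error).
print(best_pruned_tree({0: 1.17, 1: 0.73, 2: 0.10}, 0.030551))
```

In the second call no smaller tree falls under the threshold, so the minimum error tree is also the best-pruned tree, as answer 45 states.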

54) B The arrow is pointing to the interior node where more decision rules are applied. In this case, the Income variable is an interior node.

55) D The arrow is pointing to the top or root node, which is the first variable to which a split value is applied. In this case, the root node is Age.

56) C Because the split point x3 has the lowest MSE, the best split is on x3, and the best split value is 225.

57) D Using an MSE to determine the split point is appropriate when the target variable is numerical.

58) A Because the individual has two defaults, the right branch is taken. The next interior node is credit history; our individual has a 14-year credit history, so the left branch is taken. This next branch evaluates NUM_DEF of 2.1, and our individual takes the left branch to the leaf node, a credit score of 535.
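The regression tree arithmetic in answers 38, 39, and 56 can be illustrated with a minimal sketch: a leaf predicts the mean of its training cases, impurity is measured by MSE, and the split with the lowest MSE is chosen. The function names below are assumptions for illustration and are not taken from the textbook or any library.

```python
def leaf_prediction(values):
    """A regression tree leaf predicts the mean of its training cases."""
    return sum(values) / len(values)


def mse(values):
    """Mean squared error of the cases around their own mean (impurity measure)."""
    mean = leaf_prediction(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(mse_by_split):
    """Pick the (variable, split value) pair with the lowest split MSE."""
    return min(mse_by_split, key=mse_by_split.get)


# Leaf cases from answer 38.
print(leaf_prediction([325.00, 361.00, 285.00, 295.00, 380.00, 220.00]))  # 311.0

# Split MSE values from question 56.
print(best_split({("x1", 4): 400.57, ("x2", 42): 385.29,
                  ("x3", 225): 40.79, ("x4", 16): 210.17}))
```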

CHAPTER 14 TRUE/FALSE - Write 'T' if the statement is true and 'F' if the statement is false. 1) The forming of groups into internally homogeneous groups where each has a unique characteristic, different from other groups, is called cluster analysis. ⊚ true ⊚ false

2) The most commonly used approach for hierarchical clustering is divisive clustering. ⊚ true ⊚ false

3) When using k-means clustering, the number of clusters is specified at the end of the analysis to remove overlapping clusters. ⊚ true ⊚ false

4) When evaluating large data sets, it is customary to cluster them using k-means, which reduces the computation of measures during each iteration compared to hierarchical clustering methods. ⊚ true ⊚ false

5) When using R, after the data is imported, set.seed function is used to set the random seed and the k function sets the k parameters to preselect the number of clusters. ⊚ true ⊚ false

6) In understanding the association rules, it is best to think of them as an If-Then statement. ⊚ true ⊚ false

7) Under the association rule, a lift ratio between 0 and 1 indicates a positive association. ⊚ true ⊚ false

8) If-Then logical statements are constructed with the If portion being the consequent and the Then being the antecedent. ⊚ true ⊚ false

9) The Ward’s method is the use of a different algorithm to minimize the dissimilarity within clusters by using error sum of squares. ⊚ true ⊚ false

10) A dendrogram allows for a visual inspection of the clustering results. ⊚ true ⊚ false

11) Because clustering is essentially an unsupervised technique for data exploration, the appropriate technique would be the one that makes the fewest clusters. ⊚ true ⊚ false

12) Gower’s coefficient is commonly used to measure the distance between two observations when using agglomerative hierarchical clustering to analyze mixed data. ⊚ true ⊚ false

13) Hierarchical clustering methods are more efficient than k-means clustering methods for large data sets. ⊚ true ⊚ false

MULTIPLE CHOICE - Choose the one alternative that best completes the statement or answers the question. 14) Which method uses the farthest distance between a pair of observations that do not belong to the same cluster? A) The single linkage method B) The complete linkage method C) The centroid method D) The average linkage method

15) In the following AGNES algorithm, what type of linkage distance method is being displayed?

A) The centroid method B) The single linkage method C) The complete linkage method D) No linkage method is shown

16) When using R for Agglomerative Clustering, the plot function is used to create the dendrogram as well as a banner plot. What function is used to split these results into distinct clusters? A) aResult B) data.frame C) cutree D) view

17) The following results are a subset of a study on the demographics of a city population. Participants were asked to respond if male (1) or female (0), current annual salary, and if they were raised in a suburb (1) or in a city (0). Based on the hierarchical clustering results, which of the following is not a valid observation that can be made?

Cluster      Salary   Gender   City
1 (N = 21)   58,980   0.4354   0.4877
2 (N = 68)   60,873   0.4102   0.568
3 (N = 41)   72,390   0.5142   0.2351
4 (N = 34)   82,080   0.6101   0.5041

A) Cluster 3 has the most observations of participants that grew up in the city. B) Cluster 1 has a 44% male population with 49% being raised in the suburbs. C) Cluster 4 has the highest average salary, with 61% of participants being female. D) Cluster 2 has the most participants in the cluster with over half being raised in a suburb.

18) The following results are a subset of a study on the demographics of a city population. Participants were asked to respond if male (1) or female (0), current annual salary, and if they were raised in a suburb (1) or in a city (0). Based on the hierarchical clustering results, which of the following is not a valid observation that can be made?

Cluster      Salary   Gender   City
1 (N = 21)   58,980   0.4782   0.4877
2 (N = 68)   60,873   0.4102   0.568
3 (N = 41)   72,390   0.5142   0.2351
4 (N = 34)   82,080   0.5301   0.5041

A) Cluster 3 has the most observations of participants that grew up in the city. B) Cluster 1 has a 48% male population with 49% being raised in the suburbs. C) Cluster 4 has the highest average salary, with 53% of participants being female. D) Cluster 2 has the most participants in the cluster with over half being raised in a suburb.

19) In the k-Means Clustering Method, there is a general process of how k-means clustering algorithm can be classified. Which one of the following is not one of the general processes? A) Specify the k value B) Randomly assign k observations to its nearest cluster center C) Calculate the cluster centroids D) Reassign each observation to the nearest observation point

20) Using the following dendrogram created by using the k-means method, identify the value of k that was given to form the clusters present. A) 5 B) 8 C) 9 D) 10

21) The marketing department is examining the data pulled from the retail stores over the month of December. In this time period, three items are of interest: sound bars, LED under-counter lights, and shelving units. In researching whether the third item is also purchased when two of the items are purchased, the confidence was calculated at 0.685, with an expected confidence of 0.20. Calculate the lift ratio. A) 0.1370 B) 3.425 C) 0.2920 D) 0.485

22) The marketing department is examining the data pulled from the retail stores over the month of December. In this time period, three items are of interest: sound bars, LED under-counter lights, and shelving units. In researching whether the third item is also purchased when two of the items are purchased, the confidence was calculated at 0.575, with an expected confidence of 0.10. Calculate the lift ratio. A) 0.0575 B) 5.75 C) 0.6389 D) 0.475

23) Of 22,000 grocery store transactions, 1,645 have been identified as having coffee, ice cream, and chips as part of the same transaction. Calculate the support of the association rule. A) 13.374 B) 0.0748 C) 16.45 D) 1.645

24) Of 10,000 grocery store transactions, 895 have been identified as having coffee, ice cream, and chips as part of the same transaction. Calculate the support of the association rule. A) 11.173 B) 0.0895 C) 8.95 D) 0.895

25) Martin wants to understand the strength of association among toilet paper, milk, and eggs. Of 11,000 transactions, the number of transactions including antecedent is 1,910 whereas the number of transactions including both antecedent and consequent transactions is 810. Calculate the confidence. A) 0.2473 B) 0.1736 C) 0.4241 D) 0.0736

26) Martin wants to understand the strength of association among toilet paper, milk, and eggs. Of 5,000 transactions, the number of transactions including antecedent is 1,520 whereas the number of transactions including both antecedent and consequent transactions is 690. Calculate the confidence. A) 0.4420 B) 0.3040 C) 0.4539 D) 0.1380

27) Of 6,300 total transactions, 1,450 transactions include the consequent, and the confidence equals 0.5212. Calculate the expected confidence. A) 0.23 B) 2.30 C) 3,284 D) 0.7698

28) Of 5,000 total transactions, 1,400 transactions include the consequent, and the confidence equals 0.5203. Calculate the expected confidence. A) 0.28 B) 2.80 C) 2,601 D) 0.6919

29) Using the following transactions, what is the frequency distribution?

Transaction   Item
001           Latte, Scone, Muffin
002           Coffee, Scone
003           Espresso, Egg, Fruit Cup
004           Scone, Egg, Muffin
005           Scone, Latte, Muffin
006           Latte, Scone, Fruit Cup
007           Latte, Muffin, Egg
008           Coffee, Muffin, Fruit Cup
009           Espresso, Scone, Cookie
010           Latte, Muffin, Cookie

A) Latte-4%; Scone-6%; Muffin-6%; Egg-3%; Espresso-2%; Coffee-1%; Fruit Cup-3%; Cookie-2% B) Latte-1/6; Scone-1/6; Muffin-1/6; Egg-1/3; Espresso-1/2; Coffee-1/2; Fruit Cup-1/3; Cookie-1/3 C) Latte-1; Scone-2; Muffin-4; Egg-3; Espresso-5; Coffee-6; Fruit Cup-7; Cookie-10 D) Latte-5; Scone-6; Muffin-6; Egg-3; Espresso-2; Coffee-2; Fruit Cup-3; Cookie-2

30) Using the following transactions, what is the frequency distribution?

Transaction   Item
001           Latte, Scone, Muffin
002           Coffee, Muffin
003           Espresso, Egg, Fruit Cup
004           Coffee, Egg, Muffin
005           Scone, Latte, Muffin
006           Latte, Scone, Fruit Cup
007           Latte, Muffin, Egg
008           Coffee, Muffin, Fruit Cup
009           Espresso, Scone, Cookie
010           Latte, Muffin, Cookie

A) Latte-4%; Scone-4%; Muffin-7%; Egg-3%; Espresso-2%; Coffee-2%; Fruit Cup-3%; Cookie-2% B) Latte-1/6; Scone-1/4; Muffin-1/7; Egg-1/3; Espresso-1/2; Coffee-1/2; Fruit Cup-1/3; Cookie-1/2 C) Latte-1; Scone-2; Muffin-4; Egg-3; Espresso-5; Coffee-6; Fruit Cup-7; Cookie-8 D) Latte-5; Scone-4; Muffin-7; Egg-3; Espresso-2; Coffee-3; Fruit Cup-3; Cookie-2

31) Use the proportion of the following transactions that contain both Latte and Scone and calculate the Support of the association rule.

Transaction   Item
001           Latte, Scone, Muffin
002           Coffee, Muffin
003           Espresso, Scone, Fruit Cup
004           Espresso, Egg, Muffin
005           Scone, Latte, Muffin
006           Latte, Scone, Fruit Cup
007           Latte, Muffin, Egg
008           Coffee, Muffin, Fruit Cup
009           Espresso, Scone, Cookie
010           Latte, Muffin, Cookie

A) 0.25 B) 0.70 C) 0.30 D) 0.75

32) Use the proportion of the following transactions that contain both Latte and Scone and calculate the Support of the association rule.

Transaction   Item
001           Latte, Scone, Muffin
002           Coffee, Muffin
003           Espresso, Egg, Fruit Cup
004           Coffee, Egg, Muffin
005           Scone, Latte, Muffin
006           Latte, Scone, Fruit Cup
007           Latte, Muffin, Egg
008           Coffee, Muffin, Fruit Cup
009           Espresso, Scone, Cookie
010           Latte, Muffin, Cookie

A) 0.25 B) 0.70 C) 0.30 D) 0.75

33) Carmen pulled transactions from the month of March to see if there is an association between coffee and breakfast sandwich purchases. After running through confidence and expected confidence calculations, she calculated the lift ratio at 1.52. What association does a calculated lift ratio of 1.52 reflect? A) A lift ratio of 1.52 is over 1, indicating a weak association between coffee and breakfast sandwiches. B) 1.52 implies that identifying a person who purchases coffee and a breakfast sandwich is 52% better than guessing that a random person will purchase a breakfast sandwich. C) 1.52 implies that 52% of consumers will purchase coffee and not a breakfast sandwich. D) The lift ratio cannot determine the association between purchases.

34) Using R, which function is used to conduct association rule analysis? A) itemFrequency B) apriori C) inspect D) read.transaction

35) Eva is trying to figure out the total number of possible combinations for 110 inventory items. Calculate the number for Eva. A) 1.09237E + 52 possible combinations B) 9.2614E + 104 possible combinations C) 3.04325E + 52 possible combinations D) 8.57733E + 212 possible combinations

36) Eva is trying to figure out the total number of possible combinations for 50 inventory items. Calculate the number for Eva. A) 2.57689E + 23 possible combinations B) 5.15378E + 47 possible combinations C) 7.17898E + 23 possible combinations D) 2.65614E + 98 possible combinations

37) Martin wants to use Gower’s coefficient to compute the distance for each variable and to convert it into a [0,1] scale. What analysis package will he use to run the analysis? A) Analytic Solver only B) Analytic Solver & R C) R only D) Excel

38) Of the following options, which is not accurate for clustering? A) Euclidean distance or Manhattan distance measures for numerical variables and matching. B) AGNES takes each observation in the data initially and forms its own cluster. C) Hierarchical clustering commonly follows agglomerative and divisive clustering. D) Cluster analysis is where small amounts of data are organized against larger statistical sets.

39) The use of error sum of squares to measure the loss of information that occurs when observations are clustered describes which method? A) The centroid method B) The complete linkage method C) The average linkage method D) Ward’s method

40) k-means clustering algorithm can be summarized as all of the following except for A) can be either numerical or character variables. B) specify the k value. C) randomly assign k observations as cluster centers. D) assign each observation to its nearest cluster center.

41) Tosh Marketing Group wants to identify customers who are likely to purchase high-end appliances for a new marketing campaign. After collecting data from recent customers, the following plot was created showing age and spend variables with 3 distinct clusters. Which cluster offers the best target with the highest spend for marketing materials?

A) The lower-left cluster reflects the younger age range and should be the target of the marketing. B) The upper-middle cluster reflects a higher spend for appliances versus the lower-left and far-right clusters. C) The far-right cluster is in the middle and the highest spend for appliances. D) Both the upper-middle cluster and the lower-left cluster are the highest spends for appliances.

42) In reviewing purchases at Costco on a given Saturday, 400 transactions out of 1,200 included toilet paper, detergent, and clothing or {toilet paper, detergent} => {clothing}. Calculate the support of the association rule. A) 0.333 B) 1.333 C) 1,200 D) 4.00

43) In reviewing purchases at Costco on a given Saturday, 385 transactions out of 1,000 included toilet paper, detergent, and clothing or {toilet paper, detergent} => {clothing}. Calculate the support of the association rule. A) 0.385 B) 1.385 C) 1,000 D) 3.85

44) Toilet paper and detergent are the antecedent, where 1,350 of the 10,500 transactions include both items. Of the overall 10,500 transactions, 365 include toilet paper, detergent, and the consequent of clothing. Using the association rule, what is the confidence? A) 0.0348 B) 46.929 C) 0.129 D) 0.2704

45) Toilet paper and detergent are the antecedent, where 890 of the 10,000 transactions include both items. Of the overall 10,000 transactions, 234 include toilet paper, detergent, and the consequent of clothing. Using the association rule, what is the confidence? A) 0.0234 B) 20.826 C) 0.089 D) 0.2629

46) Costco is known for lower prices for bulk items. After calculating the support and confidence values for {toilet paper} => {avocados}, there appears to be a strong association because a large percentage of customers purchase these items. However, to avoid assuming the strength of association, what option should be done to confirm the strength of the association? A) Review the transactions for accuracy B) Rerun the confidence C) Calculate the lift ratio D) If the support and confidence are strong then no need to run anything more

47) Based on the frequency distribution of the items in the following table, which item is purchased most often?

Transaction   Item
001           bagels, milk, cheese
002           milk, yogurt
003           bagels, cheese
004           bagels, bread
005           milk, cheese, bagels

A) milk B) bread C) bagels D) cheese

48) Based on the frequency distribution of the items in the following table, which item is purchased most often?

Transaction   Item
001           bread, milk, cheese
002           milk, yogurt
003           bread, cheese
004           milk, bread
005           milk, cheese, bagels

A) bread B) cheese C) milk D) yogurt

49) Based on the following table, what is the proportion of the transactions that include both milk and bread?

Transaction   Item
001           bread, bagels, cheese
002           bagels, yogurt
003           bread, cheese
004           bagels, bread
005           milk, bread, bagels

A) 0.80 B) 0.60 C) 0.40 D) 0.20

50) Based on the following table, what is the proportion of the transactions that include both milk and bread?

Transaction   Item
001           bread, milk, cheese
002           milk, yogurt
003           bread, cheese
004           milk, bread
005           milk, cheese, bagels

A) 0.80 B) 0.60 C) 0.20 D) 0.40

51) The Corner Market is using 3,800 transactions on item purchases for analysis. Based on initial results, the manager noticed eggs and potato chips were frequently in the same transactions. After calculating the confidence and the expected confidence on the data, 0.48 and 0.57 respectively, they want to run a lift ratio to ensure there is a positive association. Calculate the lift ratio and determine if the association is positive or negative. A) 1.188, positive number, positive association B) 0.842, under one, negative association C) 1.188, over one, positive association D) −1.188, negative number, negative association

52) The Corner Market is using 2,500 transactions on item purchases for analysis. Based on initial results, the manager noticed eggs and potato chips were frequently in the same transactions. After calculating the confidence and the expected confidence on the data, 0.54 and 0.62 respectively, they want to run a lift ratio to ensure there is a positive association. Calculate the lift ratio and determine if the association is positive or negative. A) 0.871, positive number, positive association B) 0.871, under one, negative association C) 1.148, over one, positive association D) −1.148, negative number, negative association

53) Sara is a marketing analysis manager for a top cereal producer. Part of her job is to review product sales to grocery chains and if the contracted product shelf placement is the optimal location to maximize sales. Because of the large data sets, she is only interested in creating a small number of clusters to view the results. What type of clustering method would work best? A) hierarchical clustering B) agglomerative clustering C) divisive clustering D) k-means clustering

54) Amazon uses searches and items purchased to create future product marketing recommendations. Additionally, demographics drive additional potential products to be recommended. To do this, what type of market basket analysis is used? A) Information Rule B) Supervised Data Analysis C) Association Rule D) k-mean

55) Using R, what function is used to view the rules by their lift ratios? A) lookup B) apriori C) rules D) sort

56) In cluster analysis, measures are used to form clusters. However, when large data sets are imported into R, sometimes the variables do not share the same format. To overcome this, you standardize the data using the __________ function. A) score B) scale C) standard D) dist

57) When using R for Agglomerative Clustering, the cutree function is used to split results into distinct clusters. What function is used to create the dendrogram as well as a banner plot? A) aResult B) data.frame C) plot D) view

58) What type of (dis)similarity measure should you use when your variables result in observations with mostly 0s as the values for binary variables? A) Euclidean distance B) Jaccard’s coefficient C) Manhattan distance D) matching coefficient

59) To what cluster should Record 2 be assigned, given the following distances to the cluster centroids?

Record ID   Dist.Cluster-1   Dist.Cluster-2   Dist.Cluster-3   Dist.Cluster-4
1           4.0503384        3.830521         1.793345         4.166309
2           1.9593924        4.235910         3.394509         3.899235
3           3.8503278        1.230822         2.977934         2.781135

A) Cluster 1 B) Cluster 2 C) Cluster 3 D) Cluster 4

60) To what cluster should Record 3 be assigned, given the following distances to the cluster centroids?

Record ID   Dist.Cluster-1   Dist.Cluster-2   Dist.Cluster-3   Dist.Cluster-4
1           4.0503384        3.830521         1.793345         4.166309
2           1.9593924        4.235910         3.394509         3.899235
3           3.8503278        1.230822         2.977934         2.781135

A) Cluster 1 B) Cluster 2 C) Cluster 3 D) Cluster 4

Answer Key
Test name: Chap 14_2e_Jaggia

1) TRUE Cluster analysis is an unsupervised data mining technique that groups data into categories that share some similar characteristic or trait.

2) FALSE Agglomerative clustering is the most commonly used approach for hierarchical clustering.

3) FALSE The importance of k-means clustering is to determine the number of clusters, k, prior to performing the analysis. This allows for nonoverlapping clusters and clusters that are as homogenous as possible.

4) TRUE When using large clustered data sets, it is customary to use the k-means cluster method because it will dramatically reduce the number of iterations.

5) FALSE After set.seed, the pam function is used to perform the k-means clustering. Within pam, the parameter of k is set to the number of preselected clusters.

6) TRUE It is best to think as If-Then because you are looking for hidden associations in the data set.

7) FALSE A lift ratio between 0 and 1 is a negative association and needs to be greater than 1 for a strong and positive association to exist.

8) FALSE The If portion is the antecedent and the Then portion is the consequent.

9) TRUE The Ward’s method uses ESS to measure the loss of information that occurs when observations are clustered. It is widely implemented in many clustering analysis software applications including Analytic Solver and R.

10) TRUE A dendrogram is a treelike structure which allows for a subjective view of the clustering results.

11) FALSE The ability of a clustering method to discover useful hidden patterns of the data depends on how it is implemented. Because clustering is essentially an unsupervised technique for data exploration, the appropriate technique would be the one that makes the most sense conceptually, not simply the one with the fewest clusters.

12) TRUE To measure the distance between two observations with mixed data, it is common to use Gower’s coefficient as the distance measure. Gower’s coefficient converts the value of each variable into a [0, 1] scale and calculates a weighted average of the scaled distances as a measure of similarity between two observations.

13) FALSE Compared to the hierarchical clustering methods, the k-means clustering method is more computationally efficient, especially when dealing with large data sets.

14) B The AGNES algorithm uses one of several linkage methods. Of these methods, the complete linkage method is the one that uses the farthest distance between a pair of observations that do not belong to the same cluster.

15) B The chart reflects the nearest distance between a pair of observations which represents the single linkage method.

16) C cutree is used to take the results of both the dendrogram as well as a banner plot and create distinct clusters based on the number assigned in the number for k.

17) C Based on the table, Cluster 4 has the highest average salary, with 61% of participants being male, not female.

18) C Based on the table, Cluster 4 does have the largest average salary, however, the male population is 53%, not female.

19) D Of the summarizations of the general process of the k-means clustering algorithm, the reassign of observations is done with the clustering to the nearest centroid, not nearest observation point.

20) A By visual inspection, the predetermined number was k = 5.

21) B The lift ratio is confidence divided by expected confidence. (0.685 ÷ 0.20) = 3.425.

22) B The lift ratio is confidence divided by expected confidence. (0.575 ÷ 0.10) = 5.75.

23) B The calculation is support = number of transactions including antecedent and consequent divided by total number of transactions. Support = 1,645 ÷ 22,000 = 0.0748.

24) B The calculation is support = number of transactions including antecedent and consequent divided by total number of transactions. Support = 895 ÷ 10,000 = 0.0895.

25) C Confidence = number of transactions including both antecedent and consequent divided by number of transactions including antecedent. 810 ÷ 1,910 = 0.4241.

26) C Confidence = number of transactions including both antecedent and consequent divided by number of transactions including antecedent. 690 ÷ 1,520 = 0.4539.

27) A The expected confidence = number of transactions including consequent divided by total number of transactions = 1,450 ÷ 6,300 = 0.23.

28) A The expected confidence = number of transactions including consequent divided by total number of transactions = 1,400 ÷ 5,000 = 0.28.

29) D Out of the 10 transactions, the frequency is the number of times the item appears in a transaction. Latte-5; Scone-6; Muffin-6; Egg-3; Espresso-2; Coffee-2; Fruit Cup-3; Cookie-2.

30) D Out of the 10 transactions, the frequency is the number of times the item appears in a transaction. Latte-5; Scone-4; Muffin-7; Egg-3; Espresso-2; Coffee-3; Fruit Cup-3; Cookie-2.

31) C The proportion of the transactions that have both Latte and Scone is 3/10 transactions or Support = 0.30.

32) C The proportion of the transactions that have both Latte and Scone is 3/10 transactions or Support = 0.30.

33) B The lift ratio when over 1 indicates a strong association and identifies the person purchasing coffee and a breakfast sandwich is 52% better than guessing a random customer purchased a breakfast sandwich.

34) B The apriori function is used to conduct association rule analysis. In this function, minimum values of support and confidence are specified as inputs.

35) C Number of combinations = $3^{n} - 2^{n+1} + 1$, where n is the number of items. So, $3^{110} - 2^{110+1} + 1$ = 3.04325E+52 possible combinations.

36) C Number of combinations = $3^{n} - 2^{n+1} + 1$, where n is the number of items. So, $3^{50} - 2^{50+1} + 1$ = 7.1789799E+23 possible combinations.

37) C The current version of Analytic Solver does not calculate Gower’s similarity coefficient; therefore, R is used for agglomerative clustering with mixed data.

38) D Cluster analysis is where companies can organize large amounts of customer related data and group the transitions into different segments.

39) D Ward’s method uses ESS to measure the loss of information. ESS is defined as the squared difference between individual observations and the cluster mean.

40) A k-means clustering algorithm only works with numerical data.

41) B Based only on visual analysis, the upper-middle cluster is the higher spending age range by far and the best target to market for higher end appliances.
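
The association-rule arithmetic in answers 21 through 36 can be checked with a short script. What follows is a minimal Python sketch, not the R or Analytic Solver workflow the chapter uses; the function names are chosen here for illustration only, and the inputs are the counts quoted in answers 21, 23, 25, and 36.

```python
def support(n_both: int, n_total: int) -> float:
    """Share of all transactions containing both antecedent and consequent."""
    return n_both / n_total


def confidence(n_both: int, n_antecedent: int) -> float:
    """Share of antecedent transactions that also contain the consequent."""
    return n_both / n_antecedent


def lift_ratio(conf: float, expected_conf: float) -> float:
    """Confidence divided by expected confidence (above 1: positive association)."""
    return conf / expected_conf


def possible_rules(n_items: int) -> int:
    """Number of possible rules for n items: 3^n - 2^(n+1) + 1."""
    return 3 ** n_items - 2 ** (n_items + 1) + 1


print(round(support(1645, 22000), 4))     # counts from answer 23
print(round(confidence(810, 1910), 4))    # counts from answer 25
print(round(lift_ratio(0.685, 0.20), 3))  # values from answer 21
print(f"{possible_rules(50):.7E}")        # 50 items, as in answer 36
```

Python integers have arbitrary precision, so the rule count is computed exactly and only formatted in scientific notation for display.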

42) A The calculation is support = number of transactions including antecedent and consequent divided by total number of transactions. Support = 400 ÷ 1,200 = 0.333.

43) A The calculation is support = number of transactions including antecedent and consequent divided by total number of transactions. Support = 385 ÷ 1,000 = 0.385.

44) D Confidence = number of transactions including both antecedent and consequent divided by number of transactions including antecedent. 365 ÷ 1,350 = 0.2704.

45) D Confidence = number of transactions including both antecedent and consequent divided by number of transactions including antecedent. 234 ÷ 890 = 0.2629.

46) C The lift ratio compares the confidence with the expected confidence providing the ratio determining if there is a strong association or not. If the lift ratio is greater than 1, then there is a strong association. Anything between 0 and 1 indicates a negative relationship.

47) C Out of the 5 transactions, the frequency is the number of times the item appears in a transaction. The frequency and order of transactions from most to least is bagels = 4; milk = 3; cheese = 3; bread = 1; yogurt = 1.

48) C Out of the 5 transactions, the frequency is the number of times the item appears in a transaction. The frequency and order of transactions from most to least is milk = 4; bread = 3; cheese = 3; yogurt = 1; bagels = 1.

49) D The calculation is support = number of transactions including antecedent and consequent divided by total number of transactions. Support = 1 ÷ 5 = 0.20.

50) D The calculation is support = number of transactions including antecedent and consequent divided by total number of transactions. Support = 2 ÷ 5 = 0.40.

51) B Lift ratio = confidence ÷ expected confidence = 0.48 ÷ 0.57 = 0.842. The number is between 0 and 1, so there is a negative association.

52) B Lift ratio = confidence ÷ expected confidence = 0.54 ÷ 0.62 = 0.871. The number is between 0 and 1, so there is a negative association.

53) D k-means clustering allows for the number of clusters to be predetermined, shortening the process time.

54) C The association rule is used by many companies to determine what goes with what to identify items that tend to occur together.

55) D The sort function is used to sort the rules. In this case: srules <- sort(rules, by = 'lift', decreasing = TRUE).

56) B In R the scale function is used to standardize the values of the variables.

57) C In R, the plot function is used to produce the dendrogram as well as a banner plot.

58) D Because Jaccard’s coefficient does not account for the nonexistence of attributes when measuring similarity, observations with mostly 0s as the values for binary variables are often found to have no or close to no similarity to other observations, resulting in multiple clusters with only one observation. In these cases, it is recommended that other similarity measures such as matching coefficient be used to render more meaningful clustering structures.

59) A The second record is closest to the centroid in Cluster 1 (a distance of 1.9594, compared to the distance from the other three clusters: 4.2359, 3.3945, and 3.8992).

60) B The third record is closest to the centroid in Cluster 2 (a distance of 1.2308, compared to the distance from the other three clusters: 3.8503, 2.9779, and 2.7811).
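
The assignment rule behind answers 59 and 60 (pick the cluster whose centroid is nearest) can be sketched in a few lines of Python. This is an illustrative check only, not the chapter's R procedure. The distances are the values from the table in questions 59 and 60, read one column per cluster, and the helper name nearest_cluster is an assumption made for this example.

```python
# Distances from each record to the centroids of Clusters 1 to 4,
# as read from the table in questions 59 and 60.
distances = {
    1: [4.0503384, 3.830521, 1.793345, 4.166309],
    2: [1.9593924, 4.235910, 3.394509, 3.899235],
    3: [3.8503278, 1.230822, 2.977934, 2.781135],
}


def nearest_cluster(dists):
    """Return the 1-based number of the cluster with the smallest distance."""
    best_index = min(range(len(dists)), key=lambda i: dists[i])
    return best_index + 1


assignments = {record: nearest_cluster(d) for record, d in distances.items()}
for record, cluster in assignments.items():
    print(f"Record {record} -> Cluster {cluster}")
```

Records 2 and 3 land in Clusters 1 and 2, matching the two answers above.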

CHAPTER 15 TRUE/FALSE - Write 'T' if the statement is true and 'F' if the statement is false. 1) There are two types of decision models, deterministic and stochastic. ⊚ true ⊚ false

2) Stallion Airlines knows for certain what they can charge per seat, as well as the labor, fuel, and plane maintenance costs for its flight from Minneapolis, MN to Sioux Falls, SD. Stallion Airlines would likely use a stochastic model to compute total profit for a full flight. ⊚ true ⊚ false

3) When entering a formula, careful consideration should be given to the choice of absolute, relative, and mixed references so that the formula can be copied to other cells that require a similar calculation. ⊚ true ⊚ false

4) The textbook recommends using an influence diagram to avoid critical design flaws or bugs. ⊚ true ⊚ false

5) Spreadsheet modeling is the process of designing, constructing, and testing spreadsheet models. ⊚ true ⊚ false

6) What-if analysis, also called sensitivity analysis, helps the decision maker predict how the target outputs of a spreadsheet model would change if one or more input variables are changed. ⊚ true ⊚ false

7) In a two-way data table, you can compare how different values of one input variable influence the return on invested capital by holding the remaining input variables constant. ⊚ true ⊚ false

8) Excel’s Goal Seek is the what-if analysis that can determine the exact value of the target outcome variable based on a desired input. ⊚ true ⊚ false

9) A formula that refers to its own cell, either directly or indirectly, is called a circular reference. ⊚ true ⊚ false

10) Trace Dependents allows you to identify cells that contain data that are part of the formula in the active cell. Trace Precedents, on the other hand, allows you to identify cells whose values are affected by the active cell. ⊚ true ⊚ false

MULTIPLE CHOICE - Choose the one alternative that best completes the statement or answers the question. 11) What are the two types of decision models? A) deterministic and stochastic B) dynamic and static C) deterministic and static D) dynamic and stochastic

12) Which one of the following is not one of the three primary components in a decision model? A) input variables B) processes C) target outputs D) test variables

13) What type of cell references is Sophia using in the following Excel formula: =IF(D4="Management", B$12, C$12)? A) absolute and relative B) absolute and mixed C) relative and mixed D) relative and absolute

14) What type of cell references is Sophia using in the following Excel formula: =IF($B3="Non-Management", $B$12, $C$12)? A) absolute and relative B) absolute and mixed C) relative and mixed D) relative and absolute

15) The following partially completed Excel spreadsheet contains information for computing an annual budget. Assuming salary and bonus values are net of taxes, what is the formula for calculating the estimated annual savings (cell B11) for Mr. Ramirez this year? Note: Consider the columns as labeled "A" and "B" and rows as labeled "1-12". Use these labels to determine the correct formula.

      A                         B
1     Ramirez Monthly Budget
2
3     Variables:
4     Annual salary             $ 54,900.00
5     One-time bonus            $ 10,000.00
6     Fixed expenses            50%
7     Salary savings            10%
8     Bonus savings             50%
9
10    Annual Estimates
11    Savings
12    Spending money

A) =B4*B7 B) =(B4+B5)*(B5+B8) C) =B4*B7 + B5*B8 D) =64900*0.60

16) The following partially completed Excel spreadsheet contains information for computing an annual budget. Assume salary and bonus values are net of taxes and fixed expenses are based on annual salary. What is the formula for calculating the estimated annual spending money (cell B12) for Mr. Ramirez this year? Consider the columns as labeled "A" and "B" and rows as labeled "1-12". Use these labels to determine the correct formula.

      A                         B
1     Ramirez Monthly Budget
2
3     Variables:
4     Annual salary             $ 54,900.00
5     One-time bonus            $ 10,000.00
6     Fixed expenses            50%
7     Salary savings            10%
8     Bonus savings             50%
9
10    Annual Estimates
11    Savings
12    Spending money

A) =B4 − B4*B6 B) =B4*B6 + B5*B8 C) =B4 + B5 − B4*B7 − B5*B8 D) =B4 + B5 − B4*B6 − B11

17) The following partially completed spreadsheet contains information for computing an annual budget. Is this a deterministic or stochastic model and why?

      A                         B
1     Ramirez Monthly Budget
2
3     Variables:
4     Annual salary             $ 54,900.00
5     One-time bonus            $ 10,000.00
6     Fixed expenses            50%
7     Salary savings            10%
8     Bonus savings             50%
9
10    Annual Estimates
11    Savings
12    Spending money

A) deterministic because all model parameters are known with certainty, and the output of the model is fully determined by the model parameters B) deterministic because some randomness is incorporated in the model; therefore, the model will produce different outputs even with the same inputs C) stochastic because all model parameters are known with certainty, and the output of the model is fully determined by the model parameters D) stochastic because some randomness is incorporated in the model; therefore, the model will produce different outputs even with the same inputs

18) Which of the following statements best describes the key elements of an influence diagram? A) It is made up of nodes (decision variables) and arrows, starting with the final target output variable and decomposes the decision to show how variables are influenced by input variables at each level. B) It starts with input variables at the top and builds its way down to intermediate, and then, final output variables (nodes) using nodes (decision variables) and arrows. C) It creates all of the input and output formulas that will be used in the spreadsheet model where each formula should each be placed. D) None of the statements accurately describe the key elements of an influence diagram.

19) Below is an influence diagram for a basic make versus buy manufacturing decision. What is the final target output variable?

A) buy cost per unit B) cost difference C) fixed costs D) units

20) Below is an influence diagram for a basic make versus buy manufacturing decision. Which of the following is not an intermediate output variable?

A) total buy costs B) total make cost C) total variable costs D) variable cost per unit

21) Mad Merna’s Halloween Emporium is planning for this year’s busiest time and is debating between buying or making this year’s limited edition witch’s hat. Merna estimates that she will need 1,000 units. Her normal vendor charges $45.75 per hat including shipping. To make the hat, Merna will incur $15,000 in fixed costs and an additional $30.25 per hat in variable costs. Using the influence diagram below, create a spreadsheet model and determine 1) whether Merna should make or buy this year’s limited edition witch’s hat and 2) how much she will save.

A) buy; $250 B) buy; $500 C) make; $250 D) make; $500

22) Bindle-Featherston Company has a new product. It estimates that first year sales will result in 150,000 units sold and a growth rate of 12% per year. The sales price is $25 per unit and will increase by 5% per year. The fixed costs are $200,000 per every 100,000 units and variable costs are $10 per unit. Variable costs will increase by 3% per year due to increased material and labor costs. What is the total profit expected over the next five years? A) $11,893,906.56 B) $14,031,510.54 C) $14,525,656.33 D) $15,093,381.95

23) Bindle-Featherston Company has a new product. It estimates that first year sales will result in 150,000 units sold and a growth rate of 12% per year. The sales price is $25 per unit and will increase by 5% per year. The fixed costs are $200,000 per every 100,000 units and variable costs are $10 per unit. Variable costs will increase by 3% per year due to increased material and labor costs. What is the profit expected in year three? A) $2,279,600.00 B) $2,422,400.00 C) $2,789,970.56 D) $2,963,150.56

24) Bindle-Featherston Company has a new product. It estimates that first year sales will result in 150,000 units sold and a growth rate of 12% per year. The sales price is $25 per unit and will increase by 5% per year. The fixed costs are $200,000 per every 100,000 units and variable costs are $10 per unit. Variable costs will increase by 3% per year due to increased material and labor costs. What is the total profit expected over the next five years if the sales growth rate is 15% instead? A) $11,893,906.56 B) $14,031,510.54 C) $14,525,656.33 D) $15,093,381.95

25) Use the Excel template below for a vertical one-way data table that shows loan amounts in $2,500 increments starting from $10,000 and their monthly payment amounts, assuming the loan term and annual interest rate are 5 years and 3.9%, respectively. What is the formula for the total loan amount in cell A10 if we want to be able to copy the formula to the remainder of the cells? Consider the columns as labeled "A" and "B" and rows as labeled "1-13". Use these labels to determine the correct formula.

      A                         B
1     Loan Payment Calculator
2
3     Loan amount               $ 10,000.00
4     Loan term (in years)      5
5     Annual interest rate      3.9%
6     Monthly payment
7
8     Loan Amount               Monthly Payment
9     $ 10,000.00               ($183.71)
10
11
12
13

A) = A9 + 2500 B) = $A$9 + 2500 C) = A$9 + 2500 D) = $B$3 + 2500

26) Use the Excel template below for a vertical one-way data table that shows loan amounts in $2,500 increments starting from $10,000 and their monthly payment amounts, assuming the loan term and annual interest rate are 5 years and 3.9%, respectively. What is the formula for the monthly payment of this loan in cell B9 if we want to be able to copy the formula to the remainder of the cells? Consider the columns as labeled "A" and "B" and rows as labeled "1-13". Use these labels to determine the correct formula.

      A                         B
1     Loan Payment Calculator
2
3     Loan amount               $ 10,000.00
4     Loan term (in years)      5
5     Annual interest rate      3.9%
6
7
8     Loan Amount               Monthly Payment
9     $ 10,000.00               ($183.71)
10
11
12
13

A) =PMT($B$5,$B$4,A9) B) =PMT($B$5/12,$B$4*12,A9) C) =PMT(A9, $B$4*12,$B$5/12) D) =PMT(B5/12,B4*12,A9)

27) Use the Excel template below to create a vertical one-way data table that shows loan amounts in $2,500 increments starting from $10,000 and their monthly payment amounts, assuming the loan term and annual interest rate are 5 years and 3.9%, respectively. What is the monthly payment for a loan of $20,000? Consider the columns as labeled "A" and "B" and rows as labeled "1-13". Use these labels to determine the correct formula.

      A                         B
1     Loan Payment Calculator
2
3     Loan amount               $ 10,000.00
4     Loan term (in years)      5
5     Annual interest rate      3.9%
6
7
8     Loan Amount               Monthly Payment
9     $ 10,000.00               ($183.71)
10
11
12
13

A) $367.43 B) $413.36 C) $867.35 D) $4,479.93

28) Bindle-Featherston Company has a new product. It estimates that first year sales will result in 150,000 units sold and a growth rate of 12% per year. The sales price is $25 per unit and will increase by 5% per year. The fixed costs are $200,000 per every 100,000 units and variable costs are $10 per unit. Variable costs will increase by 3% per year due to increased material and labor costs. Calculate the breakeven point using Excel’s Goal Seek in year one. A) 8,000 B) 12,333 C) 13,334 D) 20,000

29) Bindle-Featherston Company has a new product. It estimates that first year sales will result in 150,000 units sold and a growth rate of 12% per year. The sales price is $25 per unit and will increase by 5% per year. The fixed costs are $200,000 per every 100,000 units and variable costs are $10 per unit. Variable costs will increase by 3% per year due to increased material and labor costs. Using Excel’s Goal Seek, calculate the number of units that will need to be sold in year one to earn a profit of $100,000. A) 8,000 B) 12,333 C) 13,334 D) 20,000

30) The following partially completed Excel spreadsheet contains information for computing an annual budget. Assume salary and bonus values are net of taxes and fixed expenses are based on annual salary. Use Excel’s Goal Seek to determine what percentage Ramirez will need to save monthly to have $20,000 in his savings account by the end of the year.

      A                         B
1     Ramirez Monthly Budget
2
3     Variables:
4     Annual salary             $ 54,900.00
5     One-time bonus            $ 10,000.00
6     Fixed expenses            50%
7     Monthly savings           10%
8     Bonus savings             50%
9
10    Annual Estimates
11    Savings
12    Spending money
13    Fixed expenses

A) 10.00% B) 13.85% C) 22.39% D) 27.32%

31) The benefit of using Excel’s Scenario Manager over data tables and Excel’s Goal Seek is that it allows you to create and compare __________. A) one variable at a time B) multiple possible scenarios based on multiple input variables C) conceptual models of the business problem D) randomness that models deterministic models have

32) Trulovia Manufacturing is looking to purchase a machine that will increase its efficiency in their manufacturing process. Trulovia wants to use Scenario Manager to evaluate the following four scenarios based on possible purchase prices and interest rates for a 10-year loan. Which option results in the lowest monthly payment?

Scenario   Equipment Cost   Loan Term   Interest Rate
1          $ 145,000        9 years     7%
2          $ 147,500        10 years    5%
3          $ 150,000        11 years    4%
4          $ 155,000        9 years     2%

A) 1 B) 2 C) 3 D) 4

33) Trulovia Manufacturing is looking to purchase a machine that will increase its efficiency in their manufacturing process. Trulovia wants to use Scenario Manager to evaluate the following four scenarios based on possible purchase prices and interest rates for a 10-year loan. Which option results in the highest monthly payment?

Scenario   Equipment Cost   Loan Term   Interest Rate
1          $ 145,000        9 years     7%
2          $ 147,500        10 years    5%
3          $ 150,000        11 years    4%
4          $ 155,000        9 years     2%

A) 1 B) 2 C) 3 D) 4

34) Trulovia Manufacturing is looking to purchase a machine that will increase its efficiency in their manufacturing process. Trulovia wants to use Scenario Manager to evaluate the following four scenarios based on possible purchase prices and interest rates for a 10-year loan. What is the monthly payment for Scenario 2?

Scenario   Equipment Cost   Loan Term   Interest Rate
1          $ 145,000        9 years     7%
2          $ 147,500        10 years    5%
3          $ 150,000        11 years    4%
4          $ 155,000        9 years     2%

A) $1,161.74 B) $1,260.41 C) $1,385.72 D) $1,448.26

35) The following spreadsheet and formula results in what type of common spreadsheet error?

      A                         B
1     Total Employee Salaries
2
3     Employee Name             Salary
4     Fletcher Davis            $ 54,900.00
5     Simon Jennings            $ 35,000.00
6     Ainsley Smart             $ 22,500.00
7     Casey Manning             $ 85,000.00
8     Moira Fowler              $ 49,000.00
9     Grand Total               =SUM(B4:B9)

A) circular reference B) data validation C) missing data D) mixed reference

36) Which Excel feature will let you enforce a rule where the user can only enter whole numbers in a particular cell? A) data validation B) error alert C) new rule D) what-if analysis

37) Which option in Data Validation would you choose if you want to make sure only certain options can be selected for a particular cell? A) any value B) date C) list D) whole number

38) Which of the following is not one of the model auditing tools available in Excel? A) data validation B) show formulas C) trace dependents D) trace precedents

39) Jeremy Valdez found three #N/A formula errors in his spreadsheet. Thinking an error in one cell is causing errors in the following cells, he decided to use the __________ formula auditing tool. A) show formulas B) trace dependents C) trace precedents D) watch window

40) Jeremy Valdez thinks he has found the source of a series of #N/A errors in his spreadsheet. He is ready to begin correcting the errors but would like a way to monitor how the results change when the formulas are modified. What model auditing tool should he use? A) show formulas B) trace dependents C) trace precedents D) watch window

Answer Key
Test name: Chap 15_2e_Jaggia

1) TRUE In a deterministic model, all model parameters are known with certainty, and the output of the model is fully determined by the model parameters. A stochastic model incorporates some randomness in the model; therefore, the model will produce different outputs even with the same inputs.

2) FALSE Since the company is certain about its revenues and costs associated with the flight, a deterministic model, which is often represented by mathematical formulas, is used as it allows us to understand how each input variable affects the output.

3) TRUE When entering a formula, careful consideration should be given to the choice of absolute, relative, and mixed references so that the formula can be copied to other cells that require a similar calculation.

4) TRUE An influence diagram is a visual tool used in spreadsheet engineering to illustrate the various components of a decision and their relationships.

5) FALSE Spreadsheet engineering is the process of designing, constructing, and testing spreadsheet models. Spreadsheet modeling is one component of spreadsheet engineering.

6) TRUE What-if analysis, also called sensitivity analysis, helps the decision maker predict how the target outputs of a spreadsheet model would change if one or more input variables are changed.

7) FALSE The return on invested capital (ROIC) is influenced by a variety of factors, such as inventory turnover, manufacturing overhead, logistic costs, the corporate tax rate, etc. In a one-way data table, you can compare how different values of one input variable influence the ROIC by holding the remaining input variables constant.

8) FALSE Excel’s Goal Seek is the what-if analysis that can determine the exact value of the input variable to achieve a desired outcome.

9) TRUE One common error in spreadsheet models is circular reference, which means that a formula refers back to its own cell, either directly or indirectly.

10) FALSE Trace Precedents allows you to identify cells that contain data that are part of the formula in the active cell. Trace Dependents, on the other hand, allows you to identify cells whose values are affected by the active cell.

11) A There are two types of decision models, deterministic and stochastic.

12) D There are three primary components in a decision model: target outputs, input variables, and processes.

13) C The reference to D4 is a relative reference and will change when the formula is copied into another cell. The references B$12 and C$12 are mixed references. The B & C column references will change when the formula is copied into another cell but the row reference 12 will remain constant.

14) B The reference to $B3 is a mixed reference and the row reference will change when the formula is copied into another cell but column B will remain constant. The references $B$12 and $C$12 are absolute references. If we copy this formula to a new cell, the references to $B$12 and $C$12 will remain the same.

15) C The correct formula is =B4*B7 + B5*B8. Annual salary times 10% plus the one-time bonus times 50%.

16) D The correct formula is =B4 + B5 − B4*B6 − B11. Total income minus 50% of annual salary for fixed expenses minus savings.

17) A The model is deterministic because all model parameters are known with certainty, and the output of the model is fully determined by the model parameters.

18) A The two key elements of an influence diagram are nodes (shown as ovals) and arrows. The nodes represent the decision variables, whereas the arrows represent the influence that variables have on other variables. An influence diagram usually starts with the final target output variable at the top and continuously decomposes the decision to show how the intermediate output variables are influenced by input variables at each level.

19) B An influence diagram usually starts with the final target output variable at the top and continuously decomposes the decision to show how the intermediate output variables are influenced by input variables at each level. In this case, cost difference is the final target output variable.

20) D An influence diagram usually starts with the final target output variable at the top and continuously decomposes the decision to show how the intermediate output variables are influenced by input variables at each level. In this case, variable cost per unit is an input variable whereas the others are intermediate output variables.

21) D Total cost to make is $15,000 + 1,000 × $30.25 = $45,250, which is $500 cheaper than the cost to buy, which is $45.75 × 1,000 = $45,750.
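The make-or-buy arithmetic in answer 21 can be checked with a short sketch. The function and variable names below are illustrative assumptions and are not part of the test bank; only the dollar figures come from the answer.

```python
# Make-or-buy comparison using the figures quoted in answer 21.
# Function names are hypothetical, chosen only for illustration.

def total_make_cost(fixed_cost, unit_cost, units):
    """Fixed setup cost plus a per-unit production cost."""
    return fixed_cost + unit_cost * units


def total_buy_cost(unit_price, units):
    """Purchase cost when every unit is bought at one price."""
    return unit_price * units


units = 1_000
make = total_make_cost(15_000, 30.25, units)
buy = total_buy_cost(45.75, units)
savings = buy - make  # positive when making is cheaper than buying

print(f"make = ${make:,.2f}, buy = ${buy:,.2f}, savings from making = ${savings:,.2f}")
```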

22) B The total profit over five years will be $14,031,510.54. Solution:

Bindle-Featherston Company

Inputs:
Sales units | 150,000
Sales unit growth | 12% per year
Sales price | $25.00 per unit
Price increase | 5% per year
Fixed costs | $200,000.00 for every 100,000 units
Variable costs | $10.00 per unit
VC increase | 3% per year

Outputs:
Year | Volume | Sales price/unit | Total Sales | Fixed costs | VC per unit | Variable costs | Profit | Total Profit
0 | | | | | | | | 0
1 | 150,000 | $25.00 | $3,750,000.00 | $400,000.00 | $10.00 | $1,500,000.00 | $1,850,000.00 | $1,850,000.00
2 | 168,000 | $26.25 | $4,410,000.00 | $400,000.00 | $10.30 | $1,730,400.00 | $2,279,600.00 | $4,129,600.00
3 | 188,160 | $27.56 | $5,186,160.00 | $400,000.00 | $10.61 | $1,996,189.44 | $2,789,970.56 | $6,919,570.56
4 | 210,739 | $28.94 | $6,098,924.16 | $600,000.00 | $10.93 | $2,302,804.14 | $3,196,120.02 | $10,115,690.58
5 | 236,028 | $30.39 | $7,172,334.81 | $600,000.00 | $11.26 | $2,656,514.85 | $3,915,819.96 | $14,031,510.54

Formulas:
Row | A | B | C | D | E | F | G | H | I
1 | Bindle-Featherston Company
3 | Inputs:
4 | Sales units | 150,000
5 | Sales unit growth | 0.12 | per year
6 | Sales price | 25 | per unit
7 | Price increase | 0.05 | per year
8 | Fixed costs | 200,000 | for every | 100,000 | units
9 | Variable costs | 10 | per unit
10 | VC increase | 0.03 | per year
12 | Outputs:
13 | Year | Volume | Sales price/unit | Total Sales | Fixed costs | VC per unit | Variable costs | Profit | Total Profit
14 | 0 | | | | | | | | 0
15 | 1 | =B4 | =$B$6 | =B15*C15 | =$B$8*ROUNDUP(B15/$D$8,0) | =B9 | =B15*F15 | =D15-E15-G15 | =I14+H15
16 | 2 | =B15*(1+$B$5) | =C15*(1+$B$7) | =B16*C16 | =$B$8*ROUNDUP(B16/$D$8,0) | =F15*(1+$B$10) | =B16*F16 | =D16-E16-G16 | =I15+H16
17 | 3 | =B16*(1+$B$5) | =C16*(1+$B$7) | =B17*C17 | =$B$8*ROUNDUP(B17/$D$8,0) | =F16*(1+$B$10) | =B17*F17 | =D17-E17-G17 | =I16+H17
18 | 4 | =B17*(1+$B$5) | =C17*(1+$B$7) | =B18*C18 | =$B$8*ROUNDUP(B18/$D$8,0) | =F17*(1+$B$10) | =B18*F18 | =D18-E18-G18 | =I17+H18
19 | 5 | =B18*(1+$B$5) | =C18*(1+$B$7) | =B19*C19 | =$B$8*ROUNDUP(B19/$D$8,0) | =F18*(1+$B$10) | =B19*F19 | =D19-E19-G19 | =I18+H19
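The spreadsheet logic in the solution to question 22 can also be sketched outside Excel. The Python below is a minimal, hedged translation of the formulas (growth applied year over year, fixed costs stepped with ROUNDUP per block of 100,000 units). The function name, parameter names, and tuple layout are assumptions made for illustration; only the input values and formulas come from the answer key.

```python
import math


def profit_model(units=150_000, unit_growth=0.12, price=25.0, price_growth=0.05,
                 fixed_per_block=200_000.0, block_size=100_000,
                 var_cost=10.0, var_cost_growth=0.03, years=5):
    """Return one tuple per year:
    (year, volume, price, sales, fixed, vc_per_unit, variable, profit, cumulative_profit).
    """
    rows = []
    cumulative = 0.0
    for year in range(1, years + 1):
        sales = units * price
        # Mirrors =$B$8*ROUNDUP(volume/$D$8,0): one fixed-cost block per started 100,000 units.
        fixed = fixed_per_block * math.ceil(units / block_size)
        variable = units * var_cost
        profit = sales - fixed - variable
        cumulative += profit
        rows.append((year, units, price, sales, fixed, var_cost, variable, profit, cumulative))
        # Grow the drivers for the next year, as the copied-down formulas do.
        units *= 1 + unit_growth
        price *= 1 + price_growth
        var_cost *= 1 + var_cost_growth
    return rows


rows = profit_model()
for year, volume, _, _, _, _, _, profit, total in rows:
    print(f"year {year}: volume {volume:,.0f}, profit {profit:,.2f}, total {total:,.2f}")
```

Like the spreadsheet, the sketch carries unrounded volumes forward, which is why the displayed year-4 and year-5 volumes are rounded while the dollar totals are not. Calling `profit_model(unit_growth=0.15)` corresponds to the what-if change described in answer 24.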

23) C The year three profit will be $2,789,970.56. Solution:

Bindle-Featherston Company

Inputs:
Sales units | 150,000
Sales unit growth | 12% per year
Sales price | $25.00 per unit
Price increase | 5% per year
Fixed costs | $200,000.00 for every 100,000 units
Variable costs | $10.00 per unit
VC increase | 3% per year

Outputs:
Year | Volume | Sales price/unit | Total Sales | Fixed costs | VC per unit | Variable costs | Profit | Total Profit
0 | | | | | | | | 0
1 | 150,000 | $25.00 | $3,750,000.00 | $400,000.00 | $10.00 | $1,500,000.00 | $1,850,000.00 | $1,850,000.00
2 | 168,000 | $26.25 | $4,410,000.00 | $400,000.00 | $10.30 | $1,730,400.00 | $2,279,600.00 | $4,129,600.00
3 | 188,160 | $27.56 | $5,186,160.00 | $400,000.00 | $10.61 | $1,996,189.44 | $2,789,970.56 | $6,919,570.56

Formulas:
Row | A | B | C | D | E | F | G | H | I
1 | Bindle-Featherston Company
3 | Inputs:
4 | Sales units | 150,000
5 | Sales unit growth | 0.12 | per year
6 | Sales price | 25 | per unit
7 | Price increase | 0.05 | per year
8 | Fixed costs | 200,000 | for every | 100,000 | units
9 | Variable costs | 10 | per unit
10 | VC increase | 0.03 | per year
12 | Outputs:
13 | Year | Volume | Sales price/unit | Total Sales | Fixed costs | VC per unit | Variable costs | Profit | Total Profit
14 | 0 | | | | | | | | 0
15 | 1 | =B4 | =$B$6 | =B15*C15 | =$B$8*ROUNDUP(B15/$D$8,0) | =B9 | =B15*F15 | =D15-E15-G15 | =I14+H15
16 | 2 | =B15*(1+$B$5) | =C15*(1+$B$7) | =B16*C16 | =$B$8*ROUNDUP(B16/$D$8,0) | =F15*(1+$B$10) | =B16*F16 | =D16-E16-G16 | =I15+H16
17 | 3 | =B16*(1+$B$5) | =C16*(1+$B$7) | =B17*C17 | =$B$8*ROUNDUP(B17/$D$8,0) | =F16*(1+$B$10) | =B17*F17 | =D17-E17-G17 | =I16+H17

24) D The total profit over five years will be $15,093,381.95 if the sales growth rate is 15% instead of 12%. Solution:

Bindle-Featherston Company

Inputs:
Sales units | 150,000
Sales unit growth | 15% per year
Sales price | $25.00 per unit
Price increase | 5% per year
Fixed costs | $200,000.00 for every 100,000 units
Variable costs | $10.00 per unit
VC increase | 3% per year

Outputs:
Year | Volume | Sales price/unit | Total Sales | Fixed costs | VC per unit | Variable costs | Profit | Total Profit
0 | | | | | | | | 0
1 | 150,000 | $25.00 | $3,750,000.00 | $400,000.00 | $10.00 | $1,500,000.00 | $1,850,000.00 | $1,850,000.00
2 | 172,500 | $26.25 | $4,528,125.00 | $400,000.00 | $10.30 | $1,776,750.00 | $2,351,375.00 | $4,201,375.00
3 | 198,375 | $27.56 | $5,467,710.94 | $400,000.00 | $10.61 | $2,104,560.38 | $2,963,150.56 | $7,164,525.56
4 | 228,131 | $28.94 | $6,602,260.96 | $600,000.00 | $10.93 | $2,492,851.76 | $3,509,409.19 | $10,673,934.76
5 | 262,351 | $30.39 | $7,972,230.11 | $600,000.00 | $11.26 | $2,952,782.91 | $4,419,447.19 | $15,093,381.95

Formulas:
Row | A | B | C | D | E | F | G | H | I
1 | Bindle-Featherston Company
3 | Inputs:
4 | Sales units | 150,000
5 | Sales unit growth | 0.15 | per year
6 | Sales price | 25 | per unit
7 | Price increase | 0.05 | per year
8 | Fixed costs | 200,000 | for every | 100,000 | units
9 | Variable costs | 10 | per unit
10 | VC increase | 0.03 | per year
12 | Outputs:
13 | Year | Volume | Sales price/unit | Total Sales | Fixed costs | VC per unit | Variable costs | Profit | Total Profit
14 | 0 | | | | | | | | 0
15 | 1 | =B4 | =$B$6 | =B15*C15 | =$B$8*ROUNDUP(B15/$D$8,0) | =B9 | =B15*F15 | =D15-E15-G15 | =I14+H15
16 | 2 | =B15*(1+$B$5) | =C15*(1+$B$7) | =B16*C16 | =$B$8*ROUNDUP(B16/$D$8,0) | =F15*(1+$B$10) | =B16*F16 | =D16-E16-G16 | =I15+H16
17 | 3 | =B16*(1+$B$5) | =C16*(1+$B$7) | =B17*C17 | =$B$8*ROUNDUP(B17/$D$8,0) | =F16*(1+$B$10) | =B17*F17 | =D17-E17-G17 | =I16+H17
18 | 4 | =B17*(1+$B$5) | =C17*(1+$B$7) | =B18*C18 | =$B$8*ROUNDUP(B18/$D$8,0) | =F17*(1+$B$10) | =B18*F18 | =D18-E18-G18 | =I17+H18
19 | 5 | =B18*(1+$B$5) | =C18*(1+$B$7) | =B19*C19 | =$B$8*ROUNDUP(B19/$D$8,0) | =F18*(1+$B$10) | =B19*F19 | =D19-E19-G19 | =I18+H19

25) A The total loan amount in A10 should be $2,500 higher than the amount in A9 and the amount in A11 should be $2,500 higher than the amount in A10. Therefore, the formula should either be = A9 + 2500 or = $A9 + 2500. The row reference in the formula needs to change as the formula is copied down.

26) B The function PMT is used to calculate loan payments. The parameters are interest rate, loan term, and loan amount, respectively. Because our payments are made monthly, the correct formula is =PMT($B$5/12,$B$4*12,A9).

27) A The monthly payment amount for a $20,000 loan for 5 years at 3.9% is $367.43. Solution:

Loan Payment Calculator

Loan amount | $10,000.00
Loan term (in years) | 5
Annual interest rate | 3.9%

Loan Amount | Monthly Payment
$10,000.00 | ($183.71)
$12,500.00 | ($229.64)
$15,000.00 | ($275.57)
$17,500.00 | ($321.50)
$20,000.00 | ($367.43)
$22,500.00 | ($413.36)

Formulas:
Row | A | B
1 | Loan Payment Calculator
3 | Loan amount | 10,000
4 | Loan term (in years) | 5
5 | Annual interest rate | 0.039
8 | Loan Amount | Monthly Payment
9 | =B3 | =PMT($B$5/12,$B$4*12,A9)
10 | =A9+2500 | =PMT($B$5/12,$B$4*12,A10)
11 | =A10+2500 | =PMT($B$5/12,$B$4*12,A11)
12 | =A11+2500 | =PMT($B$5/12,$B$4*12,A12)
13 | =A12+2500 | =PMT($B$5/12,$B$4*12,A13)
14 | =A13+2500 | =PMT($B$5/12,$B$4*12,A14)
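For readers who want to check the PMT results without Excel, the standard annuity payment formula can be sketched in Python. The function name and signature below are illustrative assumptions. The sketch returns the payment as a positive number, whereas Excel's PMT shows it as a negative cash outflow, which is why the table above displays the amounts in parentheses.

```python
def monthly_payment(annual_rate, years, principal):
    """Level monthly payment on a fully amortizing loan (positive number)."""
    rate = annual_rate / 12       # periodic rate, as in $B$5/12
    periods = years * 12          # number of payments, as in $B$4*12
    if rate == 0:
        return principal / periods
    return principal * rate / (1 - (1 + rate) ** -periods)


# Loan amounts from the solution table, stepping by $2,500.
for amount in range(10_000, 25_000, 2_500):
    print(f"${amount:,.2f} -> ${monthly_payment(0.039, 5, amount):,.2f}")
```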

28) C The company will need to sell 13,334 units in year one to break even. Note: The result is 13,333.3333. In break-even analyses, we round up since we cannot sell a partial unit. Given the spreadsheet formulas table below, the Goal Seek values should be entered as follows:

Spreadsheet formulas:
Row | A | B | C | D | E | F | G | H | I
1 | Bindle-Featherston Company
3 | Inputs:
4 | Sales units | 150,000
5 | Sales unit growth | 0.12 | per year
6 | Sales price | 25 | per unit
7 | Price increase | 0.05 | per year
8 | Fixed costs | 200,000 | for every | 100,000 | units
9 | Variable costs | 10 | per unit
10 | VC increase | 0.03 | per year
12 | Outputs:
13 | Year | Volume | Sales price/unit | Total Sales | Fixed costs | VC per unit | Variable costs | Profit | Total Profit
14 | 0 | | | | | | | | 0
15 | 1 | =B4 | =$B$6 | =B15*C15 | =$B$8*ROUNDUP(B15/$D$8,0) | =B9 | =B15*F15 | =D15-E15-G15 | =I14+H15
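The Goal Seek result can be cross-checked algebraically. The sketch below is an assumption-laden illustration, not the textbook's method: it solves units = (target profit + fixed costs) / (price − variable cost) and rounds up, stepping the fixed cost in blocks of 100,000 units to mirror the ROUNDUP term. The function and parameter names are hypothetical.

```python
import math


def units_for_target_profit(target_profit, price=25.0, var_cost=10.0,
                            fixed_per_block=200_000.0, block_size=100_000):
    """Smallest whole number of units whose year-one profit reaches the target."""
    margin = price - var_cost
    if margin <= 0 or margin * block_size <= fixed_per_block:
        raise ValueError("each block of units must more than cover its own fixed cost")
    blocks = 1
    while True:
        fixed = fixed_per_block * blocks
        units = math.ceil((target_profit + fixed) / margin)
        if units <= blocks * block_size:  # the answer fits within this fixed-cost block
            return units
        blocks += 1


break_even_units = units_for_target_profit(0)
print(break_even_units)
```

Passing a nonzero target, for example `units_for_target_profit(100_000)`, applies the same calculation to a desired profit level.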

29) D The company will need to sell 20,000 units in year one to earn a profit of $100,000. Given the spreadsheet formulas table below, the Goal Seek values should be entered as follows:

Spreadsheet formulas:
Row | A | B | C | D | E | F | G | H | I
1 | Bindle-Featherston Company
3 | Inputs:
4 | Sales units | 150,000
5 | Sales unit growth | 0.12 | per year
6 | Sales price | 25 | per unit
7 | Price increase | 0.05 | per year
8 | Fixed costs | 200,000 | for every | 100,000 | units
9 | Variable costs | 10 | per unit
10 | VC increase | 0.03 | per year
12 | Outputs:
13 | Year | Volume | Sales price/unit | Total Sales | Fixed costs | VC per unit | Variable costs | Profit | Total Profit
14 | 0 | | | | | | | | 0
15 | 1 | =B4 | =$B$6 | =B15*C15 | =$B$8*ROUNDUP(B15/$D$8,0) | =B9 | =B15*F15 | =D15-E15-G15 | =I14+H15

30) D Given that annual salary, bonus, fixed expenses, and bonus savings remain the same, Ramirez will need to save 27.32% of his annual salary to have $20,000 in total savings by the end of the year. Given the spreadsheet formulas table below, the Goal Seek values should be entered as follows:

Spreadsheet formulas:
Row | A | B
1 | Ramirez Monthly Budget
3 | Variables
4 | Annual Salary | 54,900
5 | Bonus | 10,000
6 | Fixed Expenses | 0.5
7 | Salary Savings | 0.1
8 | Bonus Savings | 0.5
10 | Estimated
11 | Savings | =B4*B7+B5*B8
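Because the savings formula =B4*B7+B5*B8 is linear in the salary savings rate, the Goal Seek answer can be verified by rearranging it. The sketch below assumes hypothetical function and parameter names; the numbers are those shown in the formulas table.

```python
def required_salary_savings_rate(target, salary, bonus, bonus_savings_rate):
    """Solve target = salary * rate + bonus * bonus_savings_rate for rate."""
    return (target - bonus * bonus_savings_rate) / salary


rate = required_salary_savings_rate(20_000, 54_900, 10_000, 0.5)
print(f"{rate:.2%}")  # formatted as a percentage with two decimals
```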

31) B Scenario Manager allows you to create and compare up to 32 possible scenarios based on multiple input variables. 32) C Scenario 3 results in the lowest monthly payment at $1,161.74. Setting up a data model that puts loan amount, loan term (in years), and annual interest rate in cells B3:B5 and inputting the problem’s four scenarios in Scenario Manager arrives at the following output:

33) D



Scenario 4 results in the highest monthly payment at $1,448.26. Setting up a data model that puts loan amount, loan term (in years), and annual interest rate in cells B3:B5 and inputting the problem’s four scenarios in Scenario Manager arrives at the following output:

34) B Scenario 2 results in a monthly payment of $1,260.41. Setting up a data model that puts loan amount, loan term (in years), and annual interest rate in cells B3:B5 and inputting the problem’s four scenarios in Scenario Manager arrives at the following output:

35) A Using the SUM function and including cell B9 as a part of the SUM formula in cell B9 will result in a circular reference. 36) A



Clicking Data > Data Validation to open the Data Validation dialog box, then choosing Whole number from the Allow: drop-down box will let you ensure only a whole number can be entered in a particular cell.

37) C Clicking Data > Data Validation to open the Data Validation dialog box, then choosing List from the Allow: drop-down box will let you set up an in-cell dropdown list.

38) A Show formulas, trace dependents, and trace precedents are all model auditing tools in the Formulas ribbon. They help you diagnose errors in your spreadsheet model. Data Validation is an option in the Data ribbon and helps you set up rules to prevent certain spreadsheet model errors.

39) B It is likely that the errors in the cells are related; an error in one cell may have a cascading effect on other cells. To diagnose the errors, start with the formula in the first cell and make it the active cell. Then click Formulas > Trace Dependents.

40) D To monitor how the results change when the formulas are modified, we use the Watch Window tool in Excel. The Watch Window tool creates a floating window that lets you view the values of selected cells in any workbook.



CHAPTER 16 TRUE/FALSE - Write 'T' if the statement is true and 'F' if the statement is false. 1) Prescriptive analytics is the process of using decision analysis tools to improve decision making. ⊚ true ⊚ false

2) The Monte Carlo simulation relies solely on average values to capture risk and uncertainty. ⊚ true ⊚ false

3) Binomial and Poisson distributions are the two most relevant probability distributions for many business and nonbusiness applications. ⊚ true ⊚ false

4) Continuous uniform distribution, also known as the rectangular distribution, captures the time that elapses between occurrences. ⊚ true ⊚ false

5) Risk can be defined as the possibility or probability that an undesirable outcome may occur. ⊚ true ⊚ false

6) In Excel, random observations can be generated, however random seed cannot be set, requiring values to be cut and pasted. ⊚ true ⊚ false



7) Risk analysis is a process of examining and assessing different scenarios to better understand the possibility of both intended and undesired outcomes. ⊚ true ⊚ false

8) Katerina’s team used best practices when they only ran a best case scenario risk analysis before debuting a new electric car. ⊚ true ⊚ false

9) Many real-life events and the data obtained from these events may not conform to any theoretical probability distributions. In these situations, we may construct a triangular distribution. ⊚ true ⊚ false

10) A Bayesian model is used to run simulations and analyze risk and uncertainty by recreating a real-world process. ⊚ true ⊚ false

MULTIPLE CHOICE - Choose the one alternative that best completes the statement or answers the question. 11) When using Excel to model uncertainty, the RAND function and BINOM.INV functions are used. What key is used to redraw new observations, changing the output? A) Return key B) ALT4 C) F4 D) F9



12) A CNBC study in 2017 determined that 39% of adults have between zero and a couple of hundred dollars in a savings account. With n = 7 and p = 0.39, using the binomial probability distribution, what are the mean and the standard deviation of the population? Hint: σ = √(np(1 − p))

A) 7 and 1.66 respectively B) 2.73 and 1.2905 respectively C) 3.5 and 2.04 respectively D) 1.65 and 1.6653 respectively

14) _____________blank analytics refers to using simulation and optimization algorithms to provide advice on ‘what we should do.’ A) Descriptive B) Diagnostic C) Predictive D) Prescriptive

15) Serena only wants to run risk analyses on situations with uncertainty. Which of the following scenarios does she not need to run a risk analysis on? A) A certificate of deposit at a bank B) An investment in shares of stock C) The level of inventory at any point in time D) The ideal interior size of vehicle prototype

16) Consider the function: y = (x1 + x2)x3. In the most likely scenario, the values of x1, x2, and x3 are 4, 21, and 4, respectively. In the worst-case scenario, the values of the three variables are 1, 10, and 2, respectively. In the best-case scenario, the values of the variables are 8, 32, and 5, respectively. What is the range of y? A) 13 – 200 B) 22 – 100 C) 22 – 200 D) 100 – 200

17) Consider the function: y = (x1 + x2)x3. In the most likely scenario, the values of x1, x2, and x3 are 4, 21, and 4, respectively. In the worst-case scenario, the values of the three variables are 1, 10, and 2, respectively. In the best-case scenario, the values of the variables are 8, 32, and 5, respectively. What is the most likely value of y? A) 22 B) 88 C) 100 D) 200

18) Which of the following is an example of a discrete random variable? A) number of bankruptcies B) amount of data usage C) inches of rainfall D) increase in stock price

19) Which of the following is an example of a continuous random variable? A) number of bankruptcies B) amount of data usage C) number of students D) total t-shirts

20) Which of the following is an example of a discrete random variable? A) height of a person B) amount of data usage C) number of students D) increase in stock price

22) How many defects should Black Beauty Cosmetics expect out of 20 mascara tubes if they usually identify 3 defects in every 15? A) 3 B) 4 C) 5 D) 6

23) The Monte Carlo simulation is also known as A) stochastic & probabilistic. B) stochastic & deterministic. C) deterministic. D) probabilistic & deterministic.

24) Below is a scatterplot of a Monte Carlo simulation of 100 profit distribution scenarios based on employing 3 tailors or 4 tailors. What would you recommend to the manager if she is looking to minimize variability?



A) Four tailors because profitability is higher. B) Four tailors because risk is lower. C) Three tailors because profitability has a lower standard deviation. D) Three tailors because risk is lower.

25) Below is a scatterplot of a Monte Carlo simulation of 100 profit distribution scenarios based on employing 3 tailors or 4 tailors. What would you recommend to the manager if she is looking to maximize profit?

A) Four tailors because profitability is higher. B) Four tailors because risk is lower. C) Three tailors because profitability has a lower standard deviation. D) Three tailors because risk is lower.

26) Black Beauty Cosmetics wants to model demand (normal distribution with μ = 95 units and σ = 10.67) and a production rate (uniform distribution between 23 units and 31 units) for producing mascara tubes. Using R, what formulas are needed to generate 100 random observations for demand and production rate, respectively? A) output1 <- sample(95, 100, replace = TRUE); output2 <- sample(23:31, 100, replace = TRUE) B) output1 <- rnorm(100, 95, 10.67); output2 <- rnorm(100, 23, 31) C) output1 <- rnorm(23:31, 100, replace = TRUE); output2 <- sample(100, 95, 10.67) D) output1 <- rnorm(100, 95, 10.67); output2 <- sample(23:31, 100, replace = TRUE)



27) A deterministic process is A) created by random variable selection. B) created by constraint identification. C) a Monte Carlo simulation. D) a precise estimation based on known variables.

28) What type of probability distribution will Elva need to use to generate random observations for delivery times so that she can set up the drivers’ schedules? A) Continuous B) Exponential C) Normal D) Triangular

29) What type of probability distribution will Evelyn need to use to generate random observations for the expected amount of time between customer purchases? A) Continuous B) Exponential C) Normal D) Triangular

30) What type of probability distribution will Axel need to use to generate random observations for weekly demand for ventilators needed at a hospital? A) Continuous B) Empirical C) Exponential D) Triangular

31) What type of probability distribution will Herman need to use to generate random observations when he only knows his best, worst, and most likely scenarios? A) Continuous B) Empirical C) Exponential D) Triangular

32) Javier needs to project how many vases to produce each month for his new customer. The new customer only provided the minimum, maximum, and average number of vases the flower shop goes through each month. Which type of probability distribution should Javier use to generate random observations? A) Continuous B) Empirical C) Triangular D) Variable

33) What type of data is used with the continuous uniform distribution when generating random observations? A) data with unknown distributions B) data generated from symmetrical distributions C) data generated from nonsymmetrical distributions D) data with constant probability within a specified range

34) What type of data is used with the exponential distribution when generating random observations? A) data with unknown distributions B) data generated from symmetrical distributions C) data generated from nonsymmetrical distributions D) data with constant probability within a specified range

35) What type of data is used with the empirical distribution when generating random observations? A) continuous data generated from symmetrical distributions B) continuous data generated from nonsymmetrical distributions C) historical data with discrete outcomes D) data with constant probability within a specified range

36) What type of data is needed to use the triangular distribution when generating random observations? A) minimum, maximum, and mode values B) worst, best, and most likely case scenarios C) either the min, max, and mode or the worst, best, and most likely case scenarios D) neither the min, max, and mode or the worst, best, and most likely case scenarios

37) Finn has historical data but does not know the distribution pattern of the data. What type of graph or chart could they use to plot the data to identify potential patterns? A) box plots B) histogram C) line chart D) scatterplot

38) Which formulas are used for an empirical probability distribution?

39) Which formulas are used for a triangular probability distribution?

40) What function is used in R to generate random observations from the continuous uniform distribution? A) cumsum B) rexp C) runif D) ifelse

41) What function is used in R to generate random observations from the exponential distribution? A) rexp B) runif C) ifelse D) cumsum



Answer Key
Test name: Chap 16_2e_Jaggia

1) TRUE Prescriptive analytics uses simulation and optimization algorithms to quantify the effects of different possible actions to make more informed decisions.

2) FALSE The Monte Carlo simulation considers all values using a random variable to capture risk and uncertainty.

3) TRUE Two of the most relevant discrete probability distributions in most business and nonbusiness applications are the binomial and Poisson distributions.

4) FALSE The continuous uniform distribution represents a constant probability within a specified range, whereas the exponential random variable captures the time that elapses between occurrences.

5) TRUE Risk can be defined as the possibility or probability that an undesirable outcome may occur.

6) TRUE In Excel, you can generate random observations many times, but the results will change and will not be saved because a random seed cannot be set. Therefore, values must be copied and pasted into blank cells to retain the information.

7) TRUE



Risk analysis is a process of examining and assessing different scenarios in order to better understand the possibility of both intended and undesired outcomes.

8) FALSE We typically perform risk analysis by considering three alternatives: most likely, pessimistic (or worst case), and optimistic (or best case) scenarios, allowing us to evaluate a range of outcomes from multiple situations.

9) FALSE In general, we use an empirical probability distribution to model uncertainty in applications with discrete outcomes.

10) FALSE By recreating a real-world process, Monte Carlo simulation is used to understand the impact of the risk and uncertainty in a wide variety of applications.

11) D Pressing the F9 key redraws new observations, changing the output.

12) B By using the probability distributions in Chapter 5, we know the formula is σ = √(np(1 − p)), where n = 7 and p = 0.39. Thus, the mean and standard deviation of the binomial probability distribution are np (= 7 times 0.39) = 2.73 and √(7 × 0.39 × 0.61) = 1.2905, respectively.
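The binomial mean and standard deviation used in answer 12 can be reproduced with a few lines of Python. This is a minimal sketch; the function name is an assumption made for illustration, and the formulas are the standard binomial results np and √(np(1 − p)).

```python
import math


def binomial_mean_sd(n, p):
    """Mean and standard deviation of a binomial(n, p) random variable."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return mean, sd


mean, sd = binomial_mean_sd(7, 0.39)
print(f"mean = {mean:.2f}, sd = {sd:.4f}")
```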

13) B By using the probability distributions in Chapter 5, we know the formula is σ = √(np(1 − p)), where n = 5 and p = 0.35. Thus, the mean and standard deviation of the binomial probability distribution are np (= 5 times 0.35) = 1.75 and √(5 × 0.35 × 0.65) = 1.0665, respectively.

14) D



Prescriptive analytics refers to using simulation and optimization algorithms to provide advice on ‘what we should do.’ Descriptive analytics summarizes historical data (‘what has happened’), diagnostic analytics determines ‘why something has happened’, and predictive analytics uses historical data to predict ‘what could happen.’

15) A The outcome of a certificate of deposit at a bank is certain. You earn X% of interest if you keep your money in for the stated amount of time. You pay Y penalty if you withdraw early. All the other situations have higher risk and more uncertainty and would warrant a risk analysis.

16) C The worst-case scenario is y = (1 + 10) × 2 = 22. The best-case scenario is y = (8 + 32) × 5 = 200. The range is between 22 and 200.

17) C The most likely scenario is y = (4 + 21) × 4 = 100.

18) A The number of bankruptcies is a discrete random variable because it represents a countable number of distinct values. There is no such thing as half of a bankruptcy.

19) B The amount of data usage is a continuous random variable because it represents an uncountable number of values. The amount of data used can be 1GB, 1.25GB, 14.9GB, etc.

20) C The number of students is a discrete random variable because it represents a countable number of distinct values. There is no such thing as half of a student.

21) C By using the probability distributions in Chapter 5, we know the formula is

22) B


Given the rate of 3 defects over 15 items, we can write the mean for the 15-item period as λ15 = 3. We compute the proportional mean for 20 items as λ20 = 4 because the proportion of defects is 3/15 = 0.20 defects/item = 0.20 defects/item × 20 items = 4 defects. 23) A The Monte Carlo simulation is also called stochastic & probabilistic simulations. 24) C 25) A To maximize profitability, the manager will want to employ four tailors because profitability is higher. This is evidenced by more of the orange dots being higher on the y-scale than the blue dots.

26) D Since demand has a normal distribution, the rnorm function is used with the following parameters: observations, mean, and standard deviation. Since production rate has a uniform distribution, the sample function is used with the following parameters: range, observations, replace = TRUE. 27) D The deterministic process is based on nonrandom variables that precisely estimate based on a formula; for example, time and value of money formulation calculations. 28) A Delivery times are an example of continuous uniform distribution, which is also referred to as the rectangular distribution, as it represents a constant probability within a specified range. 29) B Time between customer purchases is an example of an exponential distribution, as it represents a variable with a nonsymmetric distribution. 30) B



When an event like weekly demand for ventilators does not conform to a theoretical probability distribution, an empirical distribution can be created using historical data.

31) D A triangular distribution is a simplistic representation of a continuous probability distribution with a triangular shape and is defined by three parameters: the minimum value, the maximum value, and the mode, or sometimes the best, worst, and most likely scenarios. It is commonly used when we have no detailed historical data or additional information to make assumptions about any theoretical distribution; we use a triangular probability distribution to assess the uncertainty.

32) C A triangular distribution is a simplistic representation of a continuous probability distribution with a triangular shape and is defined by three parameters: the minimum value, the maximum value, and the mode, or sometimes the best, worst, and most likely scenarios. It is commonly used when we have no detailed historical data or additional information to make assumptions about any theoretical distribution; we use a triangular probability distribution to assess the uncertainty.

33) D The continuous uniform distribution is appropriate when you have a variable with a constant probability within a specified range, such as delivery times and flight times between destinations.

34) C The exponential distribution is appropriate when you have a variable with a nonsymmetrical distribution, such as times between customer purchases.

35) C



An empirical distribution, which uses cumulative relative frequencies, is appropriate when the distribution of a variable is unknown but you have historical or empirical data.

36) A A triangular distribution is a simplistic representation of a continuous probability distribution with a triangular shape and is defined by three parameters: the minimum value, the maximum value, and the mode. Sometimes the minimum value represents the worst-case scenario (e.g., the lowest profit level), and the maximum value represents the best-case scenario (e.g., the highest profit level).

37) B Sometimes it is possible to plot the data and inspect the histogram for distinctive shapes corresponding to different theoretical distributions such as continuous or normal distributions.

38) D An empirical probability distribution is based on a relative frequency table so the formulas

39) C A triangular probability distribution is based on the min, max, and mode so the formulas

40) C We use the runif function to generate random observations from the continuous uniform distribution. For example, if we enter runif(100, 20, 21), then R generates 100 random observations from a continuous uniform distribution with values between 20 and 21.

41) A We use the rexp function to generate random observations from the exponential distribution. For example, if we enter rexp(100,3), then R generates 100 random observations from the exponential distribution with λ = 3.
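The answers above use R's runif and rexp. For comparison only, a roughly equivalent sketch in Python's standard library is shown below; it is not part of the textbook's material. `random.uniform(a, b)` and `random.expovariate(lambd)` are standard-library calls, while the seed value and variable names are arbitrary choices made for illustration.

```python
import random

rng = random.Random(1)  # arbitrary seed so the draws can be repeated

# Analogue of runif(100, 20, 21): 100 draws between 20 and 21.
uniform_obs = [rng.uniform(20, 21) for _ in range(100)]

# Analogue of rexp(100, 3): 100 draws from an exponential distribution with rate 3.
exp_obs = [rng.expovariate(3) for _ in range(100)]

print(min(uniform_obs), max(uniform_obs))
print(sum(exp_obs) / len(exp_obs))  # sample mean; the theoretical mean is 1/3
```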



CHAPTER 17 TRUE/FALSE - Write 'T' if the statement is true and 'F' if the statement is false. 1) A minimization linear programming model will usually have at least one constraint with the ≤ sign. ⊚ true ⊚ false

2) The constraints in the minimization problem usually describe maximum requirements that must be met. ⊚ true ⊚ false

3) Sometimes, in real-world applications, a linear programming model may yield multiple optimal solutions that have the same maximized or minimized value of the objective function. ⊚ true ⊚ false

4) A linear programming problem is infeasible when there is no solution for which all constraints in the LP model are satisfied. ⊚ true ⊚ false

5) Linear programming problems are bounded when the objective function can attain an infinite value without violating any constraints. ⊚ true ⊚ false

6) Linear programming is an optimization technique in which an exponential function is maximized or minimized when subjected to resource constraints. ⊚ true ⊚ false

7) The objective function is a mathematical representation of an objective. ⊚ true ⊚ false

8) Constraints with slack or surplus in the linear programming solutions are called binding constraints. ⊚ true ⊚ false

9) The shadow price, or dual price, of a constraint indicates the change in the optimized value of the objective function for a one-unit change in a binding constraint. ⊚ true ⊚ false

10) The optimal solution to a linear programming problem can be found at the boundary or corner of the feasibility region in a graphic model. ⊚ true ⊚ false

MULTIPLE CHOICE - Choose the one alternative that best completes the statement or answers the question. 11) Which of the following is not an example of where linear programming can be used? A) choosing a combination of products to maximize profits B) developing a mix of investment options to maximize return C) using available resources to minimize delivery costs D) using a simulation to analyze demand for a particular product

12) What special case in linear programming does the following graph depict?

A) infeasibility B) multiple optimal solutions C) objectivity D) unboundedness

13) Lane Accessories has a growing demand for custom Apple watch designer bands. The current manufacturing costs are $120 per hour to operate, for each hour of operation, 210 black band designs and 180 multi-color designs are completed. However, Lane found a new larger manufacturing space that will cost $160 per hour and produce 325 black design bands and 289 multi-color bands per hour. Lane has newly placed orders to restock other retail outlets nationwide for 6,000 black band designs and 5,000 multi-color bands. Because Lane is out of inventory, she needs to decide how many hours to operate each facility to fulfill orders while minimizing cost. Formulate the minimization function for production costs. A) Production Costs = 120x1 + 160x2 B) Production Costs = (120x1 + 390) + (160x2 + 614) C) Production Costs = 390x1 + 614x2 D) Production Costs = (120x1 + 6,000) + (160x2 + 5,000)

14) Which one of the following is not an essential component of linear programming? A) random B) an objective function C) decision variables D) constraints

15) Lane Accessories has a growing demand for custom Apple watch designer bands. The current manufacturing costs are $120 per hour to operate, for each hour of operation, 210 black band designs and 180 multi-color designs are completed. However, Lane found a new larger manufacturing space that will cost $160 per hour and produce 325 black design bands and 289 multi-color bands per hour. Lane has newly placed orders to restock other retail outlets nationwide for 6,000 black band designs and 5,000 multi-color bands. Because Lane is out of inventory, she needs to decide how many hours to operate each facility to fulfill orders while minimizing cost. Formulate the orders for black design bands. A) 210x1 + 325x2 ≥ 6,000x3 B) 210x1 + 325x2 ≥ 6,000 C) 180x1 + 210x2 ≥ 6,000 D) 180x1 + 325x2 ≥ 5,000

16) Sampson Limited produces two products that can be produced on either of two machines. Each month, only 500 hours of time are available on each machine. The time required to produce each item by hour and machine is:

            Machine 1   Machine 2
Product 1   3           3
Product 2   5           4

            Month 1 Demand   Month 2 Demand   Month 1 Price   Month 2 Price
Product 1   100              160              $45             $10
Product 2   120              110              $65             $35

The demand and price point for each product that customers are willing to pay are above. The company goal is to maximize revenue from sales from the next two months. Based on the provided information, how many constraints does this problem have excluding the nonnegativity constraints? A) 2 total constraints B) 4 total constraints C) 6 total constraints D) 8 total constraints

17) By viewing the following Excel snapshot, in Excel Solver what would be entered to formulate the Objective Function for E3?

A) =SUM(B5:D11,C8:D11) B) =SUMPRODUCT(B5*C5) + (B8*C8) C) =SUMPRODUCT(B5+B8) * (C5+C8) D) =SUMPRODUCT(B5*B8) + (C5*C8)

18) By viewing the following Excel snapshot, in Excel Solver what would be entered to formulate B11, Material (units) for Quantity?

A) =SUMPRODUCT(B2+B8) * (C2+C8) B) =SUMPRODUCT(B2*B8) + (C2*C8) C) =SUMPRODUCT(B2*D15) + (C2*D16) D) =SUMPRODUCT(B2*D15) * (C2*D16)

19) Based on the following sensitivity results on constraints, what is the optimal linear programming solution for Products 1 and 2?

Constraints
Cell    Name                                 Final Value   Shadow Price   Constraint R.H. Side   Allowable Increase   Allowable Decrease
$B$11   Material (units) Quantity            1,050         0              1,300                  1E+30                250
$B$12   Machine (hours) Quantity             525           0              750                    1E+30                225
$B$13   Labor (hours) Quantity               1,020         0              1,400                  1E+30                500
$B$15   Number of Product 1 Units Produced   290           5              300                    125                  300
$B$16   Number of Product 2 Units Produced   140           8              150                    83.33333             150

A) Product 1: 290 units and Product 2: 234 units B) Product 1: 425 units and Product 2: 140 units C) Product 1: 290 units and Product 2: 140 units D) Product 1: 125 units and Product 2: 84 units

20) Based on the following sensitivity results on constraints, what is the optimal linear programming solution for Products 1 and 2?

Constraints
Cell    Name                                 Final Value   Shadow Price   Constraint R.H. Side   Allowable Increase   Allowable Decrease
$B$11   Material (units) Quantity            1,050         0              1,300                  1E+30                250
$B$12   Machine (hours) Quantity             525           0              750                    1E+30                225
$B$13   Labor (hours) Quantity               900           0              1,400                  1E+30                500
$B$15   Number of Product 1 Units Produced   300           5              300                    125                  300
$B$16   Number of Product 2 Units Produced   150           8              150                    83.33333333          150

A) Product 1: 300 units and Product 2: 234 units B) Product 1: 425 units and Product 2: 150 units C) Product 1: 300 units and Product 2: 150 units D) Product 1: 125 units and Product 2: 84 units

21) Based on the following sensitivity results on constraints, how would an increase in machine hours impact the linear programming solution?

Constraints
Cell    Name                                 Final Value   Shadow Price   Constraint R.H. Side   Allowable Increase   Allowable Decrease
$B$11   Material (units) Quantity            1,050         0              1,300                  1E+30                250
$B$12   Machine (hours) Quantity             525           0              750                    1E+30                225
$B$13   Labor (hours) Quantity               900           0              1,400                  1E+30                500
$B$15   Number of Product 1 Units Produced   300           5              300                    125                  300
$B$16   Number of Product 2 Units Produced   150           8              150                    83.33333333          150

A) The hours can have an infinite increase without altering the solution. B) The hours can only decrease without altering the solution. C) The hours can have an increase by 750 units before price decreases. D) The hours cannot be predicted with the information presented.

22) Based on the following Variable cells segment of a Solver Sensitivity Report, what range does Product 1 per-unit profit fall between at 360 units?

Variable cells
Cell   Name                                       Final Value   Reduced Cost   Objective Coefficient   Allowable Increase   Allowable Decrease
$B$8   Decision Variable (to produce) Product 1   360           0              4                       0.76                 0.884
$C$8   Decision Variable (to produce) Product 2   150           0              8                       1                    0.8

A) The range is between 0.884 and 0.76. B) The range is between 0.884 and 4.76. C) The range is between 3.116 and 4.76. D) The range can only be determined with more details from the sensitivity report.

23) Based on the following Variable cells segment of a Solver Sensitivity Report, what range does Product 1 per-unit profit fall between at 300 units?

Variable cells
Cell   Name                                       Final Value   Reduced Cost   Objective Coefficient   Allowable Increase   Allowable Decrease
$B$8   Decision Variable (to produce) Product 1   300           0              5                       0.75                 0.889
$C$8   Decision Variable (to produce) Product 2   150           0              8                       1                    0.5

A) The range is between 0.889 and 0.75. B) The range is between 0.889 and 5.75. C) The range is between 4.111 and 5.75. D) The range can only be determined with more details from the sensitivity report.

24) What special case in linear programming does the following graph depict?

A) infeasibility B) multiple optimal solutions C) objectivity D) unboundedness

25) In a linear programming model, the parameter values in an objective function are referred to as the A) objective function coefficients. B) parameter function. C) constraint coefficient. D) quantitative function.

26) At Taste of Thyme coffee shop a dirty chai latte creates a profit point of $2.89 for a small and $3.45 for a large. In a month, 250 small lattes were sold and 285 large lattes. As a fast growing demand item with fall approaching, the demand is estimated at 430 and 420 per month. The amount of machine time needed to produce the lattes is 5 minutes and 7 minutes each or for a month, 16.47 hours a month for a small and 33.25 hours a month for a large. What is the maximization function for profit? A) Profit = 430x1 + 420x2 B) Profit = 2.89x1 + 3.45x2 C) Profit = 16.47x1 + 33.25x2 D) Profit = 2.54x1 + 3.15x2

27) At Taste of Thyme coffee shop a dirty chai latte creates a profit point of $2.85 for a small and $3.80 for a large. In a month, 200 small lattes were sold and 285 large lattes. As a fast growing demand item with fall approaching, the demand is estimated at 400 and 420 per month. The amount of machine time needed to produce the lattes is 5 minutes and 7 minutes each or for a month, 16.67 hours a month for a small and 33.25 hours a month for a large. What is the maximization function for profit? A) Profit = 400x1 + 420x2 B) Profit = 2.85x1 + 3.80x2 C) Profit = 16.67x1 + 33.25x2 D) Profit = 2.50x1 + 3.50x2

28) At Taste of Thyme coffee shop a dirty chai latte creates a profit point of $2.93 for a small and $3.26 for a large. In a month, 225 small lattes were sold and 285 large lattes. As a fast growing demand item with fall approaching, the demand is estimated at 430 and 420 per month. The allotted machine time for both lattes is 100 hours. The amount of machine time needed to produce the lattes is 5 minutes and 9 minutes each or for a month, 16.67 hours a month for a small and 34.66 hours a month for a large. What is the corresponding parameters formulation for machine time? A) 5x1 + 9x2 ≤ 100 B) 16.67x1 + 34.66x2 ≥ 100 C) 2.93x1 + 3.31x2 ≤ 100 D) 16.67x1 + 34.66x2 ≤ 100

29) At Taste of Thyme coffee shop a dirty chai latte creates a profit point of $2.85 for a small and $3.80 for a large. In a month, 200 small lattes were sold and 285 large lattes. As a fast growing demand item with fall approaching, the demand is estimated at 400 and 420 per month. The allotted machine time for both lattes is 90 hours. The amount of machine time needed to produce the lattes is 5 minutes and 7 minutes each or for a month, 16.67 hours a month for a small and 33.25 hours a month for a large. What is the corresponding parameters formulation for machine time? A) 5x1 + 7x2 ≤ 90 B) 16.67x1 + 33.25x2 ≥ 90 C) 2.85x1 + 3.85x2 ≤ 90 D) 16.67x1 + 33.25x2 ≤ 90

30) Which of the following statements is false regarding Excel’s Solver and R's capabilities regarding multiple optimal solutions in linear programming? A) Solver and R can only show one optimal solution. B) Solver and R can provide a simple way to find the other optimal solutions. C) If the allowable increase of a decision variable is zero, the linear programming problem may have multiple optimal solutions. D) If the allowable decrease of a decision variable is zero, the linear programming problem may have multiple optimal solutions.

31) The first step in performing linear programming is A) to generate random numbers. B) to formulate a problem into a series of mathematical expressions. C) to create intervals. D) to analyze the file for patterns.

32) Smith Industries has a growing demand for custom athleisure wear. The current manufacturing costs are $90 per hour to operate, for each hour of operation, 110 black yoga pants and 150 multi-color yoga pants per hour are completed. However, Smith found a new larger space that will cost $125 per hour and produce 175 black yoga pants and 140 multi-color yoga pants per hour. Smith has newly placed orders to restock retail outlets for 4,000 black pants and 3,500 multi-color pants. Because Smith is out of inventory, she needs to decide how many hours to operate each facility to fulfill orders while minimizing cost. Formulate the minimization function for production costs. A) Production Costs = (90x1 + 260) + (125x2 + 315) B) Production Costs = 215x1 + 575x2 C) Production Costs = 90x1 + 125x2 D) Production Costs = (90x1 + 5,000) + (125x2 + 3,500)

33) Smith Industries has a growing demand for athleisure wear. The current manufacturing costs are $90 per hour to operate, for each hour of operation, 110 black yoga pants and 150 multi-color yoga pants are completed. However, Smith found a new larger manufacturing space that will cost $120 per hour and produce 175 black yoga pants and 140 multi-color yoga pants per hour. Smith has newly placed orders to restock other retail outlets nationwide for 5,000 black yoga pants and 3,500 multi-color yoga pants. Because Smith is out of inventory, she needs to decide how many hours to operate each facility to fulfill orders while minimizing cost. Formulate the orders for multi-color yoga pants. A) 250x1 + 190x2 ≥ 8,500 B) 150x1 + 140x2 ≥ 5,000 C) 150x1 + 140x2 ≥ 3,500 D) 110x1 + 175x2 ≥ 3,500

34) The range of objective function coefficients between the allowable increase and decrease, within which the optimal solution for an LP problem remains unchanged is called the _____________blank. A) shadow price B) binding constraint C) range of feasibility D) range of optimality

35) The range of values of a binding constraint, within which the shadow price remains the same is called the _____________blank. A) shadow price B) binding constraint C) range of feasibility D) range of optimality

36) Hwang Manufacturing has a growing demand for high quality carving knives. The current manufacturing costs are $150 per hour to operate, for each hour of operation, 210 straight blades and 175 curved blades are completed. However, CEO Kelly Hwang found a new larger space that will cost $175 per hour and produce 250 straight blades and 190 curved blades per hour. Kelly has newly placed orders to restock retail outlets for 6,000 straight blades and 4,500 curved blades. Because Kelly is out of inventory, she needs to decide how many hours to operate each facility to fulfill orders while minimizing cost. Formulate the minimization function for production costs. A) Production Costs = 150x1 + 175x2 B) Production Costs = 325x1 + 825x2 C) Production Costs = (150x1 + 460) + (175x2 + 365) D) Production Costs = (150x1 + 6,000) + (175x2 + 4,500)

37) Hwang Manufacturing has a growing demand for high quality carving knives. The current manufacturing costs are $150 per hour to operate, for each hour of operation, 210 straight blades and 175 curved blades are completed. However, CEO Kelly Hwang found a new larger space that will cost $175 per hour and produce 250 straight blades and 190 curved blades per hour. Kelly has newly placed orders to restock retail outlets for 6,000 straight blades and 4,500 curved blades. Because Kelly is out of inventory, she needs to decide how many hours to operate each facility to fulfill orders while minimizing cost. Formulate the orders for straight blades. A) 460x1 + 325x2 ≥ 10,500 B) 210x1 + 250x2 ≥ 6,000 C) 175x1 + 190x2 ≥ 4,500 D) 150x1 + 175x2 ≥ 6,000

38) Serena Limited produces two products that can be produced on either of two machines. Each month, only 320 hours of time are available on each machine. The time required to produce each item by hour and machine is:

            Machine 1   Machine 2
Product 1   2           2
Product 2   4           3

            Month 1 Demand   Month 2 Demand   Month 1 Price   Month 2 Price
Product 1   125              150              $45             $30
Product 2   130              120              $65             $55

The demand and price point for each product that customers are willing to pay are above. The company goal is to maximize revenue from sales from the next two months. Based on the provided information, how many constraints does this problem have excluding the nonnegativity constraints? A) 2 total constraints B) 4 total constraints C) 6 total constraints D) 8 total constraints

39) In a linear programming solution, constraints with slack are called _____________blank constraints. A) binding B) nonbinding C) surplus D) overage

40) If x1 and x2 are outside of the feasibility region, what does that mean? A) The first constraint is a valid solution to be considered. B) Solutions outside of the feasibility region can be considered as viable. C) The farthest outside the feasibility region is the optimal solution. D) Solutions outside of the feasibility region cannot be considered.

41) As a manager, Mike uses linear programming to formulate a problem into a series of mathematical expressions. _____________blank refers to the choices or alternatives Mike selects to minimize or maximize the value of his goals. A) Objective function B) Decision variables C) Optimization D) Feasible region

42) Numerical values that are associated with objective function, decision variables, and constraints are called _____________blank. A) parameters B) binary C) value D) assumptions

43) Which of the following is not one of the “special cases” in linear programming that was discussed in this chapter? A) Interminableness B) Infeasibility C) Multiple optimal solutions D) Unboundedness

44) At Tasty Sweets ice cream shop, a banana split creates a profit point of $1.75 for a small and $2.50 for a large. Last month, 100 small banana splits and 185 large banana splits were sold. However, with summer approaching, the demand is estimated to increase. The maximum allotted labor for both banana splits sizes is 50 hours for the month. Based on last month’s sales the amount of labor needed to produce one small and large banana split is 3 minutes and 5 minutes, respectively for a total of 5 hours a month for a small and 15.416 hours a month for a large. What is the corresponding constraint for monthly labor? A) 3x1 + 5x2 ≤ 50 B) 5x1 + 15.416x2 ≤ 50 C) 5x1 + 15.416x2 ≥ 50 D) 1.75x1 + 2.50x2 ≤ 50

Answer Key

Test name: Chap 17_2e_Jaggia

1) FALSE A minimization linear programming model will usually have at least one constraint with the ≥ sign.

2) FALSE The constraints in the minimization problem usually describe minimum requirements that must be met. The constraints in the maximization problem usually describe maximum requirements that must be met.

3) TRUE Sometimes, in real-world applications, an LP model may yield multiple optimal solutions that have the same maximized or minimized value of the objective function. For example, when a retail store manager chooses a mix of products to market to customers in order to maximize the revenue or profit, it is likely that there are a very large variety of products from which the manager can choose.

4) TRUE A linear programming problem is infeasible when there is no solution for which all constraints in the LP model are satisfied. Examples include when production capacity is insufficient to meet demand, some of the constraints are incorrectly formulated, or an error like ≥ is used rather than ≤.

5) FALSE Linear programming problems are unbounded when the objective function can attain an infinite value without violating any constraints. Maximization problems can result in objective functions increasing to infinity, whereas minimization problems can result in objective functions decreasing to infinity.

6) FALSE Linear programming (LP) is an optimization technique in which a linear function is maximized or minimized when subjected to resource constraints.

7) TRUE Objective function is a mathematical representation of an objective, expressed as a maximization or minimization function of a single variable.

8) FALSE Constraints with slack or surplus are called nonbinding constraints. Therefore, no optimal solution is on the line for the constraint.

9) TRUE With all other parameters held constant, the change in the optimized value of the objective function that results from a one-unit change in a binding constraint is the shadow price, or dual price.

10) TRUE When viewing the profit line on the graph, profits will increase as long as part of the profit line remains within the feasibility region. The demand will remain within limits and to the point where constraints are met achieving the optimized profit point.

11) D Simulation is an attempt to imitate a real-world process that examines several possible scenarios, whereas optimization is an attempt to find an optimal way to achieve an objective under given constraints, such as limited capacities, financial resources, and competing priorities.

12) A The graph shows an example of an infeasible LP problem where no solutions satisfy the two constraints.

13) A Production Costs = 120x1 + 160x2 is the minimization function.

14) A The missing linear programming essential component is parameters. Parameters are numerical values associated with the objective function, decision variables, and constraints.

15) B 210x1 + 325x2 ≥ 6,000 for orders for black band designs.

16) C There are two constraints to demonstrate the capacity limit of machine 1 and machine 2 (Machine 1 time = 3p1 + 5p2 <= 500; Machine 2 time = 3p1 + 4p2 <= 500). There are 4 more constraints that show the maximum production level of each product in each month based on their demand (Month1 P1 demand = p1 <= 100; Month1 P2 demand = p2 <= 120; Month2 P1 demand = p1 <= 160; Month2 P2 demand = p2 <= 110).

17) D The objective function can be calculated in Excel Solver as =SUMPRODUCT(B5*B8) + (C5*C8), or it can also be formulated as =SUMPRODUCT(B5:C5, B8:C8).

18) B The quantity formula for units of material can be calculated in Excel Solver as =SUMPRODUCT(B2*B8) + (C2*C8). As for any SUMPRODUCT, you can also use SUMPRODUCT(B2:C2, B8:C8).
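The Lane Accessories formulation from questions 13 and 15 can be checked numerically. The sketch below is a minimal corner-point enumeration in plain Python, not the Excel Solver or R workflow the test bank refers to. The function name solve_2var_lp is hypothetical, and the multi-color constraint 180x1 + 289x2 ≥ 5,000 is an assumption read from the question stem, since the answer key states only the black band constraint. The approach also assumes the problem has a bounded optimum at a corner, which holds here because both cost coefficients are positive.

```python
from itertools import combinations

def solve_2var_lp(cost, constraints, tol=1e-9):
    """Minimize cost[0]*x1 + cost[1]*x2 subject to a1*x1 + a2*x2 >= rhs.

    Hypothetical helper: checks every intersection of two boundary lines
    and keeps the cheapest feasible corner point.
    """
    # Nonnegativity is written in the same >= form: x1 >= 0 and x2 >= 0.
    lines = list(constraints) + [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    best = None
    for (a1, a2, b), (c1, c2, d) in combinations(lines, 2):
        det = a1 * c2 - a2 * c1
        if abs(det) < tol:
            continue  # parallel boundaries never intersect
        # Cramer's rule for the intersection of the two boundary lines
        x1 = (b * c2 - a2 * d) / det
        x2 = (a1 * d - b * c1) / det
        feasible = all(p * x1 + q * x2 >= r - 1e-6 for p, q, r in lines)
        if feasible:
            value = cost[0] * x1 + cost[1] * x2
            if best is None or value < best[0]:
                best = (value, x1, x2)
    return best

# Hours at the current facility (x1) and the new facility (x2)
best_cost, hours_1, hours_2 = solve_2var_lp(
    cost=(120, 160),
    constraints=[
        (210, 325, 6000),  # black band orders (answer 15)
        (180, 289, 5000),  # multi-color orders (assumed from the stem)
    ],
)
print(round(best_cost, 2), round(hours_1, 2), round(hours_2, 2))
```

Enumerating corners is practical only for tiny problems like this one; it is used here because it mirrors the graphical method of checking the corners of the feasibility region.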

19) C Based on the sensitivity report, the optimal linear programming solution for Product 1 is 290 units and for Product 2 is 140 units.

20) C Based on the sensitivity report, the optimal linear programming solution for Product 1 is 300 units and for Product 2 is 150 units.

21) A Since the machine hours constraint is nonbinding (implied by an Allowable Increase of 1E+30), an infinite increase can occur without altering the solution.

22) C The objective coefficient is 4 with an allowable increase of 0.76 and decrease of 0.884 at the 360-unit level. Thus, the range is between 3.116 and 4.76.

23) C The objective coefficient is 5 with an allowable increase of 0.75 and decrease of 0.889 at the 300-unit level. Thus, the range is between 4.111 and 5.75.

24) D The graph shows an example of an unbounded maximization LP problem with two constraints.
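The arithmetic behind answers 22 and 23 is the objective coefficient minus the allowable decrease for the lower bound and plus the allowable increase for the upper bound. A minimal Python sketch follows; the function name range_of_optimality is a hypothetical helper chosen for illustration, and the inputs are the values quoted in the two answers.

```python
def range_of_optimality(coefficient, allowable_increase, allowable_decrease):
    """Return (lower, upper) bounds for an objective function coefficient."""
    lower = coefficient - allowable_decrease
    upper = coefficient + allowable_increase
    return lower, upper

# Question 22: coefficient 4, allowable increase 0.76, allowable decrease 0.884
low_22, high_22 = range_of_optimality(4, 0.76, 0.884)

# Question 23: coefficient 5, allowable increase 0.75, allowable decrease 0.889
low_23, high_23 = range_of_optimality(5, 0.75, 0.889)

print(round(low_22, 3), round(high_22, 3))
print(round(low_23, 3), round(high_23, 3))
```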

25) A The objective coefficient is the alternate name for the parameter value in the objective function.

26) B Profit = 2.89x1 + 3.45x2, which reflects the per-unit profit for the two products, where x1 and x2 represent the decision variables for the number that will be produced.

27) B Profit = 2.85x1 + 3.80x2, which reflects the per-unit profit for the two products, where x1 and x2 represent the decision variables for the number that will be produced.

28) D The machine constraint is 16.67x1 + 34.66x2 ≤ 100, which takes the total hourly time for a small latte times the decision variable plus the large latte hour time times the decision variable.

29) D The machine constraint is 16.67x1 + 33.25x2 ≤ 90, which takes the total hourly time for a small latte times the decision variable plus the large latte hour time times the decision variable.

30) B Solver and R can only show one optimal solution and do not provide a simple way to find the other optima.

31) B The first step in linear programming is to formulate a problem into a series of mathematical expressions.

32) C Production Costs = 90x1 + 125x2 is the minimization function.

33) C 150x1 + 140x2 ≥ 3,500 for orders for multi-color yoga pants.

34) D The range of objective function coefficients between the allowable increase and decrease, within which the optimal solution for an LP problem remains unchanged is called the range of optimality.

35) C The range of values of a binding constraint, within which the shadow price remains the same is called the range of feasibility.

36) A Production Costs = 150x1 + 175x2 is the minimization function.

37) B 210x1 + 250x2 ≥ 6,000 for orders for straight blades.

38) C Based on the provided information, there are two constraints for machine capacity, one per machine. There are four more constraints requiring that the number of units produced of each product in each month be no more than the demand for that product in that month. For example, Machine 1 time = 2p1 + 4p2 <= 320; Machine 2 time = 2p1 + 3p2 <= 320; Month1 P1 demand = p1 <= 125; Month1 P2 demand = p2 <= 130; Month2 P1 demand = p1 <= 150; Month2 P2 demand = p2 <= 120.

39) B Constraints with slack or surplus in the linear programming solution are called nonbinding constraints. Adding or reducing a nonbinding constraint by one will not change the solution.

40) D Solutions outside of the feasibility region cannot be considered because the result violates at least one constraint.

41) B Decision variables refer to the different choices or alternatives from which a decision maker has to choose to minimize or maximize the value of the goal or objective function.

42) A Parameters, or input parameters, are numerical values associated with all three. For example, 30 minutes to make a breakfast is a parameter value for the labor constraint.

43) A Multiple optimal solutions, infeasibility, and unboundedness were all special cases discussed in the chapter.

44) B The labor constraint is 5x1 + 15.416x2 ≤ 50, which takes the total monthly labor hours for a small banana split times the decision variable (number of small banana splits x1) plus the large banana split monthly labor hours times the decision variable (number of large banana splits x2).
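The binding versus nonbinding distinction in answers 8 and 39 can be read directly off a sensitivity report by comparing each constraint's final value with its right-hand side. The following Python sketch uses the figures from the question 20 report; the classify helper and the shortened constraint labels are hypothetical names introduced only for this illustration.

```python
# (name, final value, right-hand side) as read from the question 20 report
constraints = [
    ("Material (units)", 1050, 1300),
    ("Machine (hours)", 525, 750),
    ("Labor (hours)", 900, 1400),
    ("Product 1 units", 300, 300),
    ("Product 2 units", 150, 150),
]

def classify(final_value, rhs, tol=1e-9):
    """Label a constraint binding when it has no slack or surplus."""
    slack = abs(rhs - final_value)
    status = "binding" if slack <= tol else "nonbinding"
    return status, slack

report = {name: classify(final, rhs) for name, final, rhs in constraints}
for name, (status, slack) in report.items():
    print(f"{name}: {status} (slack or surplus = {slack})")
```

In this report the nonbinding constraints are also the ones with a shadow price of 0, which matches the explanation given for answer 21.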

CHAPTER 18 TRUE/FALSE - Write 'T' if the statement is true and 'F' if the statement is false. 1) The general process for modeling an integer programming optimization problem is similar to that of linear programming, including the four essential components: an objective function, decision variables, constraints, and parameters. ⊚ true ⊚ false

2) Using linear programming on an integer programming problem and then rounding the optimal fractions to the nearest integer results in an optimal solution. ⊚ true ⊚ false

3) Linear and integer programming techniques can be used to solve many real-world problems. ⊚ true ⊚ false

4) Capital budgeting is a common linear programming application. ⊚ true ⊚ false

5) Integer Programming, like Linear Programming, requires the analyst to round to the nearest integer for optimization. ⊚ true ⊚ false

6) Theoretically, Shad can use linear or integer programming techniques to minimize the transportation costs related to shipping goods from his company’s warehouse to various retailers across the nation if the demand and supply both have integer values. ⊚ true ⊚ false


7) An assignment problem can be formulated as a maximization or minimization problem. ⊚ true ⊚ false

8) Renita runs a local food bank and needs to schedule staff for the month ahead, but her integer programming model will result in one and only one optimal solution. ⊚ true ⊚ false

9) A facility location problem usually seeks to minimize the number of facilities and therefore the total cost, while satisfying the needs of all constituents. ⊚ true ⊚ false

10) Unlike a linear programming model, where the optimal solution is found at the corner or boundary of the feasibility region, the optimal solution to a nonlinear programming problem can be found anywhere, making a nonlinear programming model easier to solve. ⊚ true ⊚ false

MULTIPLE CHOICE - Choose the one alternative that best completes the statement or answers the question. 11) Which of the following is not an example of a linear or integer programming application? A) capital budgeting B) distribution of electricity C) transportation D) workforce

12) Which of the following is not an example of a nonlinear programming application? A) assignment of employees to tasks B) distribution of electricity C) investment volatility or risk D) vehicle traffic in busy city

13) When using R, how do you ensure all decision variables are integers when solving an integer programming problem? A) all.int = FALSE B) all.int = TRUE C) all.num = INT D) all.num = TRUE

14) What type of programming problem is this graph depicting?

A) global programming B) integer programming C) linear programming D) nonlinear programming

15) What type of nonlinear programming algorithm would you use in your software program, if your nonlinear function is smooth and differentiable? A) derivative-based B) derivative-free C) evolutionary D) simplex

16) Williams LLC wants to run advertisements to promote their digital game, Fractions are Fun! during the month of March in celebration of Pi Day. They have four options, Facebook, Instagram, TikTok, and Twitter. They have decided they can afford to fund at least two but no more than three platforms. What formula should they add to reflect this constraint?

A)

and

B)

and

C)

and

D)

and

17) Creighton Medical Supplies manufactures medical gowns and scrubs for hospitals and medical offices. Each medical gown uses about 30 square feet of material and uses about 0.8 hours of machine time. Each set of scrubs uses about 50 square feet of material and uses about 1.3 hours of machine time. For this production cycle the company secured 1,500 square feet of material and has up to 80 hours available. Given the following template, what constraint(s) would be set in Solver add-in?

A) $B$13:$B$14 <= $D$13:$D$14 B) $B$13:$B$14 >= $D$13:$D$14 C) $B$10:$C$10 = integer; $B$13:$B$14 <= $D$13:$D$14 D) $B$10:$C$10 = integer; $B$13:$B$14 >= $D$13:$D$14

18) Pumpkin Enterprises is evaluating 4 projects for potential capital funding. However, they only have $75,000 for Year one, and $60,000 for the remaining years to invest. Unfortunately, they cannot receive any additional funds unspent from year to year. Based on the following data, what is the constraint for the Year 1 integer programming formulation?

Cash Investment   Year 1   Year 2   Year 3   Year 4   Expected Return
Project 1         50       10       10       0        350,000
Project 2         45       10       10       10       400,000
Project 3         30       10       10       0        385,000
Project 4         40       10       10       10       450,000

A) B) C) D)

19) Old Orchard Brewing Company has three warehouses that service ten locations for Good Eats Grocery. The first warehouse supplies up to 210 pallets a week, the second warehouse supplies 480 pallets a week, and the third warehouse supplies 175 pallets a week. The orders received weekly from Good Eats are 40 pallets for the first five locations and 50 pallets for locations 6 through 10. Formulate the objective function for the total shipping costs.

A) Maximize: Total Shipping Cost = B) Minimize: Total Shipping Cost = C) Maximize: Total Shipping Cost = D) Maximize: Total Shipping Cost =

20) Old Orchard Brewing Company has two warehouses that service three locations for Good Eats Grocery. The first warehouse supplies up to 210 pallets a week and the second warehouse supplies 480 pallets a week. The orders received weekly from Good Eats are 240 pallets for locations 1 and 2 and 350 pallets for location 3. Formulate the warehouse capacity constraint(s).

21) A kitchen cabinet shop with four cabinet makers receives a contract to work on a project that requires four different tasks: milling, sanding, building, and staining. The project needs to be completed as soon as possible. The owner wants to assign each of the four tasks to an employee with an overall goal of minimizing the project’s completion time and has recorded the amount of time each employee has taken to complete the four tasks from previous projects. The accompanying table shows the employees’ average completion time (in hours) for each task. Define the decision variables.

Employees   Kristy   Michelle   Mike   Ronaldo
Milling     5.65     6.05       5.19   5.48
Cutting     4.05     3.91       5.27   5.99
Finishing   4.73     4.20       4.55   4.62
Assembly    3.41     5.57       6.16   4.82

A) x_ij representing employee i being assigned to task j B) x_ij representing task i being assigned to employee j C) w_ij representing average completion time for employee i, task j D) w_ij representing average completion time for task i, employee j


22) A kitchen cabinet shop with four cabinet makers receives a contract to work on a project that requires four different tasks: milling, cutting, finishing, and assembly. The project needs to be completed as soon as possible. The owner wants to assign each of the four tasks to an employee with an overall goal of minimizing the project’s completion time and has recorded the amount of time each employee has taken to complete the four tasks from previous projects. The accompanying table shows the employees’ average completion time (in hours) for each task. Define the objective function.

Task        Kristy   Michelle   Mike   Ronaldo
Milling     5.65     6.05       5.19   5.48
Cutting     4.05     3.91       5.27   5.99
Finishing   4.73     4.20       4.55   4.62
Assembly    3.41     5.57       6.16   4.82

A) Maximize: Total Completion Time = B) Maximize: Total Completion Time = C) Minimize: Total Completion Time = D) Minimize: Total Completion Time =

23) A department chair at Sassafrass University has four remaining classes to staff (intro accounting, advanced accounting, finance, and data analytics) and four instructors available to teach those classes. The department chair would like to assign each of the four classes to an instructor with an overall goal of maximizing overall instructor ratings. The accompanying table shows the instructors’ average course ratings for each course. Define the objective function assuming that each instructor is available to teach only one class. Using Solver, who will teach the finance class?

Course                Ingrid   Mark   Mary   Uday
Intro Accounting      4.65     4.05   3.89   4.62
Advanced Accounting   4.05     3.91   4.24   4.77
Finance               4.65     4.67   4.73   4.50
Data Analytics        4.63     4.85   4.06   4.25

A) Ingrid B) Mark C) Mary D) Uday

24) A department chair at Sassafrass University has four remaining classes to staff (intro accounting, advanced accounting, finance, and data analytics) and four instructors available to teach those classes. The department chair would like to assign each of the four classes to an instructor with an overall goal of maximizing overall instructor ratings. The accompanying table shows the instructors’ average course ratings for each course. Define the objective function assuming that each instructor is available to teach only one class. Using Solver, what will be the total and average evaluation scores?

Course                Ingrid   Mark   Mary   Uday
Intro Accounting      4.65     4.05   3.89   4.62
Advanced Accounting   4.05     3.91   4.24   4.77
Finance               4.65     4.67   4.73   4.50
Data Analytics        4.63     4.85   4.06   4.25

A) 16.66; 4.17 B) 17.33; 4.33 C) 18.25; 4.64 D) 19; 4.75

25) The owners of CaroKat Café need to create a new schedule for its staff. The café is open from 6 am to 10 pm seven days a week. All of the café’s employees are full time and work a schedule of five 8-hour days followed by two days off. Recently the owners have been busier on Fridays and Saturdays due to music acts. Caro and Kat like that all of their employees work full time. What is the objective function?

Days             Monday   Tuesday   Wednesday   Thursday   Friday   Saturday   Sunday
Workers Needed   6        6         5           8          10       10         5

A) Maximize: Total number of employees B) Maximize: Total number of employees C) Minimize: Total number of employees D) Minimize: Total number of employees

26) The owners of CaroKat Café need to create a new schedule for its staff. The café is open from 6 am to 10 pm seven days a week. All of the café’s employees are full time and work a schedule of five 8-hour days followed by two days off. Recently the owners have been busier on Fridays and Saturdays due to music acts. Caro and Kat like that all of their employees work full time. What is the constraint for Wednesday if Monday = x1?

Days             Monday   Tuesday   Wednesday   Thursday   Friday   Saturday   Sunday
Workers Needed   6        6         5           8          10       10         5

A) x1 + x2 + x3 + x4 + x5 ≥ 5 B) x1 + x2 + x3 + x6 + x7 ≥ 5 C) x3 + x4 + x5 + x6 + x7 ≥ 5 D) x1 + x4 + x5 + x6 + x7 ≥ 5

27) The owners of CaroKat Café need to create a new schedule for its staff. The café is open from 6 am to 10 pm seven days a week. All of the café’s employees are full time and work a schedule of five 8-hour days followed by two days off. Recently the owners have been busier on Fridays and Saturdays due to music acts. Caro and Kat like that all of their employees work full time. What is the minimum number of employees needed to be able to fully staff the café?

Days             Monday   Tuesday   Wednesday   Thursday   Friday   Saturday   Sunday
Workers Needed   6        6         5           8          10       10         5

A) 8 B) 9 C) 10 D) 11

28) Martin is evaluating 4 projects for potential capital funding. However, he has only $60,000 for Year 1 and $45,000 for the remaining years to invest. Unfortunately, unspent funds cannot be carried over from year to year. Based on the following data, what is the constraint for the Year 1 integer programming formulation?

Cash Investment   Year 1   Year 2   Year 3   Year 4   Expected Return
Project 1         50       10       10       0        350,000
Project 2         45       10       10       10       400,000
Project 3         30       10       10       0        385,000
Project 4         40       10       10       10       450,000

A) \(\sum_{i=1}^{4} c_i x_i\) B) \(\sum_{i=1}^{4} a_{i1} x_i \le 60{,}000\) C) \(\sum_{i=1}^{4} a_i x_i \le 45{,}000\) D) \(\sum_{i=1}^{4} a_{i1} x_i \le 165{,}000\)

29) Martin is evaluating 4 projects for potential capital funding. However, he has only $90,000 for Year 1 and $50,000 for the remaining years to invest. Unfortunately, unspent funds cannot be carried over from year to year. Based on the following data, what is the constraint for the Year 1 integer programming formulation?

Cash Investment   Year 1   Year 2   Year 3   Year 4   Expected Return
Project 1         50       10       10       0        350,000
Project 2         45       10       10       10       400,000
Project 3         30       10       10       0        385,000
Project 4         40       10       10       10       450,000

30) Martin is evaluating 4 projects for potential capital funding. However, he has only $130,000 for Year 1 and $50,000 for the remaining three years to invest. Each of the four projects is projected to generate an expected return of $350,000, $400,000, $385,000, and $450,000, respectively. Based on the summary information provided, what is the Expected Return objective function?

A) Expected Return = \(\sum_{i=1}^{4} c_i x_i\) B) Expected Return = 130x1 + 50x2 C) Expected Return = \(\sum_{i=x_1}^{4} c_i x_i\) D) Expected Return = ∑i=1 ai1xi6i

31) Martin is evaluating 4 projects for potential capital funding. However, he has only $90,000 for Year 1 and $50,000 for the remaining three years to invest. Each of the four projects is projected to generate an expected return of $350,000, $400,000, $385,000, and $450,000, respectively. Based on the summary information provided, what is the Expected Return objective function?

32) The following Excel Solver Results show capital projects selected for investment. When constructing the Solver parameters, the “subject to the constraints” need to be set. What two constraints are required?

A) $G$8 AND B10:E10 = binary B) B8:E8 = binary AND F2:F5 <= G2:G5 C) B10:E10 = binary AND F2:F5 => G2:G5 D) B10:E10 = binary AND F2:F5 <= G2:G5


33) In Excel, the worksheet needs to be prepared prior to running the Solver. As such, the total (total return), cell G8, shows the total of the expected return for approved projects. What is the formula that must be set in G8 to capture the results accurately?

A) =SUM(B8:E8 * B10:E10) B) =SUMPRODUCT(B8:E8, B10:E10) C) =SUMPRODUCT(B8:E8 * B10:E10) D) =SUMPRODUCT(B8:E8, G2:G5)

34) Whether the problem is capital project selection or transportation, a manager’s goals drive the project. In integer programming, the manager’s goal translates to the __________ function. A) objective B) fractional C) matrix D) transport

35) Watkins Trucking has two warehouses that service four retail locations for Harmons Hardware. The first warehouse supplies up to 180 pallets a week and the second warehouse supplies 300 pallets a week. The orders received weekly from Harmons are 75 pallets for the first location and 50 pallets each for locations 2, 3, and 4. Formulate the objective function for the total shipping costs.

            Harmons 1   Harmons 2   Harmons 3   Harmons 4
Watkins 1   2.85        3.32        4.85        6.25
Watkins 2   3.10        2.90        3.48        4.90

A) Minimize: Total Shipping Cost = B) Minimize: Total Shipping Cost = C) Maximize: Total Shipping Cost = D) Maximize: Total Shipping Cost =

36) When using R, how do you store shipping costs?

            Harmons 1   Harmons 2
Watkins 1   2.93        3.26
Watkins 2   3.60        3.00

A) >unit.costs <- matrix(2.93, 3.26, 3.60, 3.00) B) >unit.costs <- unit.costs(2.93, 3.26, 3.60, 3.00) C) >unit.costs <- matrix(c(2.93, 3.26, 3.60, 3.00), nrow=2, byrow=TRUE) D) >unit.costs <- unit.costs(2.93, 3.26, 3.60, 3.00), nrow=2, byrow=TRUE)

37) When using R, how do you store shipping costs?

            Harmons 1   Harmons 2
Watkins 1   2.85        3.32
Watkins 2   3.10        2.90

A) >unit.costs <- matrix(2.85, 3.32, 3.10, 2.90) B) >unit.costs <- unit.costs(2.85, 3.32, 3.10, 2.90) C) >unit.costs <- matrix(c(2.85, 3.32, 3.10, 2.90), nrow=2, byrow=TRUE) D) >unit.costs <- unit.costs(2.85, 3.32, 3.10, 2.90), nrow=2, byrow=TRUE)


38) The owners of CaroKat Café need to create a new schedule for its staff. The café is open from 6 am to 10 pm seven days a week. All of the café’s employees are full time and work a schedule of five 8-hour days followed by two days off. Recently the owners have been busier on Fridays and Saturdays due to music acts. Caro and Kat like that all of their employees work full time. How many employees start their work week on Wednesday?

Days             Monday   Tuesday   Wednesday   Thursday   Friday   Saturday   Sunday
Workers Needed   6        6         5           8          10       10         5

A) 0 B) 1 C) 2 D) 3

39) Chasing Windmills, manufacturer of wind turbines, is looking to open one or more new plants to service the Midwest region. They want to keep the distance between the plant and five metropolitan areas to less than 300 miles. Using the Solver add-in, what is the minimum number of plants needed to service the Midwest?

From/To        Chicago   Des Moines   Indianapolis   Omaha   St. Louis
Chicago        0         333          184            467     296
Des Moines     333       0            477            134     349
Indianapolis   184       477          0              611     242
Omaha          467       134          611            0       432
St. Louis      296       349          242            432     0

A) 0 B) 1 C) 2 D) 3


40) Chasing Windmills, manufacturer of wind turbines, is looking to open one or more new plants to service the Midwest region. They want to keep the distance between the plant and five metropolitan areas to less than 300 miles. What is the objective function?

From/To        Chicago   Des Moines   Indianapolis   Omaha   St. Louis
Chicago        0         333          184            467     296
Des Moines     333       0            477            134     349
Indianapolis   184       477          0              611     242
Omaha          467       134          611            0       432
St. Louis      296       349          242            432     0

A) Minimize: Total number of plant sites B) Minimize: Total number of plant sites C) Maximize: Total number of plant sites D) Maximize: Total number of plant sites

41) Chasing Windmills, manufacturer of wind turbines, is looking to open one or more new plants to service the Midwest region. They want to keep the distance between the plant and five metropolitan areas to less than 300 miles. What is the constraint for Indianapolis if Chicago = x1 where xi is 0 or 1?

From/To        Chicago   Des Moines   Indianapolis   Omaha   St. Louis
Chicago        0         333          184            467     296
Des Moines     333       0            477            134     349
Indianapolis   184       477          0              611     242
Omaha          467       134          611            0       432
St. Louis      296       349          242            432     0

A) x1 + x3 + x5 ≥ 1 B) x1 + x3 + x5 ≤ 1 C) x2 + x4 ≥ 1 D) x2 + x4 ≤ 1

42) An integer programming model that involves selecting investments is classified as what type of problem?

A) clustering problem B) transportation problem C) integer problem D) capital budgeting problem

43) In a transportation model, when constructing the constraints, the points of demand are classified as __________. A) transportation B) destinations C) origin D) goal


Answer Key
Test name: Chap 18_2e_Jaggia

1) TRUE The general process for modeling an integer programming optimization problem is similar to that of linear programming, including the four essential components: an objective function, decision variables, constraints, and parameters. It also includes a constraint to require the decision variables to be integers.

2) FALSE Rounding optimal values from a linear programming result typically leads to a suboptimal or even infeasible solution.

3) TRUE

4) FALSE In a typical capital budgeting problem, a decision maker tries to choose from a number of potential projects, such as building new factories or developing new drugs. These situations are often referred to as a “go vs. no-go” decision. In these situations, it does not make sense to partially fund individual projects.

5) FALSE Integer programming restricts the decision variables to integer values, whereas linear programming (LP) allows fractional values. Rounding an LP solution tends to lead to a suboptimal solution.

6) TRUE In theory, a transportation problem can be formulated as an LP or IP model. In an LP model, if the demand and supply both have integer values, the transportation model usually produces an integer solution.

7) TRUE An assignment problem may be formulated as a maximization (e.g., to increase the overall productivity or efficiency) or minimization problem (e.g., to reduce the completion time of a project).

8) FALSE A scheduling problem often has multiple optimal solutions that achieve the same value of the objective function.

9) TRUE A facility location problem usually seeks to minimize the number of facilities and therefore the total cost, while satisfying the needs of all constituents. Examples include locations of emergency shelters, distribution centers, and medical facilities.

10) FALSE Unlike a linear programming model, where the optimal solution is found at the corner or boundary of the feasibility region, the optimal solution to a nonlinear programming problem can be found anywhere, making a nonlinear programming model more difficult to solve.

11) B Distribution of electricity through a large network of power plants, transformers, and transmission lines is usually modeled using nonlinear programming techniques rather than linear or integer programming techniques.

12) A One of the most common applications in optimization involves a manager assigning employees or team members to individual tasks or roles. Decision variables are usually formulated as binary (assigned or not assigned) and are not typically modeled as nonlinear programming applications.

13) B all.int = TRUE is used to ensure all decision variables are integers when solving an integer programming problem.


14) D The optimal solution to a nonlinear programming problem can be found anywhere, making a nonlinear programming model more difficult to solve. Nonlinear functions are also more difficult to model, as they can take on many forms and structures (e.g., quadratic functions, or other polynomial functions with a higher degree).

15) A Derivative-based (also called gradient-based) methods usually require that the nonlinear functions have a derivative or are differentiable (e.g., a function x^2 has a derivative of 2x). As such, derivative-based algorithms work well with a problem where nonlinear functions are continuous or smooth.

16) D If Williams LLC can fund at least two (i.e., greater than or equal to 2) but no more than three (i.e., less than or equal to 3) platforms, we need to add both conditions as constraints: the number of funded platforms must be greater than or equal to 2 and less than or equal to 3.

17) C Since Creighton would not create a fraction of a gown or scrubs, the constraint needs to include $B$10:$C$10 = integer to limit the decision variables to integer values. And since this is a profit maximization problem, Creighton needs to solve for when materials used is less than or equal to the quantities available ($B$13:$B$14 <= $D$13:$D$14).

18) A We define aij as the cash investment required for project i in Year j; therefore, the constraint for the integer programming formulation in Year 1 is \(\sum_{i=1}^{4} a_{i1} x_i \le 75{,}000\).

19) B The objective is to minimize the total shipping cost. Note that a transportation problem seeks the minimum cost, not the maximum as with a capital project.

20) C We define our decision variables xij as the number of pallets to be shipped from warehouse i to grocery store j, where i = 1 and 2, and j = 1, 2, and 3. The warehouse capacity constraints indicate that the number of pallets shipped from each warehouse cannot exceed its capacity.

21) A We define the decision variables as xij representing employee i being assigned to task j. For the four employees, we associate Kristy, Michelle, Mike, and Ronaldo with i = 1, 2, 3, and 4, respectively. Similarly, we associate the tasks milling, cutting, finishing, and assembly with j = 1, 2, 3, and 4, respectively.

22) D Assignment problems can be formulated as a maximization or minimization problem based on the problem’s end goal. In this case, the cabinet shop wants to minimize the total completion time.
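As a cross-check on answers 21 and 22, the assignment formulation can be sketched in Python by brute force. This is only an illustration, not the Solver or R approach the questions refer to. The names `hours`, `total_time`, and `best` are hypothetical, and the matrix is the question's table transposed so that rows are employees (i) and columns are tasks (j), matching xij.

```python
from itertools import permutations

# Average completion times in hours, copied from the table in questions 21 and 22.
# Rows are employees (i) and columns are tasks (j), matching x_ij in answer 21.
employees = ["Kristy", "Michelle", "Mike", "Ronaldo"]
tasks = ["Milling", "Cutting", "Finishing", "Assembly"]
hours = [
    [5.65, 4.05, 4.73, 3.41],  # Kristy
    [6.05, 3.91, 4.20, 5.57],  # Michelle
    [5.19, 5.27, 4.55, 6.16],  # Mike
    [5.48, 5.99, 4.62, 4.82],  # Ronaldo
]


def total_time(assignment):
    """Objective value: sum of hours[i][j] over the chosen pairs.

    assignment[i] is the task index j given to employee i, so x_ij = 1
    exactly for those pairs and 0 elsewhere.
    """
    return sum(hours[i][j] for i, j in enumerate(assignment))


# Every permutation gives each employee exactly one task and each task exactly
# one employee, so the assignment constraints hold by construction.
best = min(permutations(range(len(tasks))), key=total_time)
best_total = total_time(best)

for i, j in enumerate(best):
    print(f"{employees[i]} -> {tasks[j]} ({hours[i][j]:.2f} h)")
print(f"Total completion time: {best_total:.2f} h")
```

Enumerating all 24 permutations is practical only because the problem is tiny; a solver is the appropriate tool for larger assignment problems.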

23) C For this assignment problem, we want to maximize instructor evaluations. Using Solver, the overall evaluation is maximized at a total score of 19, an average of 4.75. Ingrid will teach intro accounting, Uday will teach advanced accounting, Mary will teach finance, and Mark will teach data analytics.

24) D For this assignment problem, we want to maximize instructor evaluations. Using Solver, the overall evaluation is maximized at a total score of 19, an average of 4.75. Ingrid will teach intro accounting, Uday will teach advanced accounting, Mary will teach finance, and Mark will teach data analytics.

25) C The purpose of the scheduling model is to minimize the number of employees that need to be hired; therefore, the objective function is Minimize: Total number of employees = x1 + x2 + x3 + x4 + x5 + x6 + x7, where xi is the number of employees who start their work week on day i (Monday = x1).

26) B The five-day work schedule determines how many employees work each day at the café. The number of employees working on Wednesday = x1 + x2 + x3 + x6 + x7, because these employees start their work week on Monday (x1), Tuesday (x2), Wednesday (x3), Saturday (x6), or Sunday (x7).
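The Wednesday constraint in answer 26 and the staffing minimum in answer 27 can be checked with a small Python sketch. It assumes, as the questions state, that every employee works five consecutive days followed by two days off. The function names `on_duty`, `is_feasible`, and `min_staff` are hypothetical, and the search is plain enumeration rather than the Solver method used in the text.

```python
from itertools import combinations_with_replacement

# Workers needed per day, Monday (index 0) through Sunday (index 6).
needed = [6, 6, 5, 8, 10, 10, 5]


def on_duty(starts, day):
    """Employees working on `day`.

    starts[d] is the number of employees whose work week begins on day d.
    Someone is on duty if their week began on `day` or on one of the four
    days before it (five consecutive working days).
    """
    return sum(starts[(day - k) % 7] for k in range(5))


def is_feasible(starts):
    """True if every day has at least the required number of workers."""
    return all(on_duty(starts, d) >= needed[d] for d in range(7))


def min_staff(max_staff=20):
    """Smallest head count with a feasible schedule, plus one such schedule."""
    for total in range(max_staff + 1):
        # Each multiset of `total` start days is one way to split the staff.
        for combo in combinations_with_replacement(range(7), total):
            starts = [0] * 7
            for d in combo:
                starts[d] += 1
            if is_feasible(starts):
                return total, starts
    return None


total, starts = min_staff()
print("Minimum employees:", total)
print("Start-day counts (Mon..Sun):", starts)
```

For Wednesday (index 2), `on_duty` sums the start counts for Wednesday, Tuesday, Monday, Sunday, and Saturday, which is the same set of variables as x1 + x2 + x3 + x6 + x7. Because scheduling problems often have several optimal solutions, the start-day split printed here need not match the one Solver reports.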


27) D For this scheduling problem, we want to minimize the number of employees. Using Solver, the total number of employees is minimized at 11 employees.

28) B We define aij as the cash investment required for project i in Year j; therefore, the constraint for the integer programming formulation in Year 1 is \(\sum_{i=1}^{4} a_{i1} x_i \le 60{,}000\).
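A minimal Python sketch of the budget constraints in question 28 follows. It assumes that the cash investment figures in the table are in thousands of dollars (so the budgets become 60 and 45) and that the $45,000 applies to each of Years 2 through 4. Neither assumption is stated explicitly in the question. The names `cash`, `within_budget`, and `best` are hypothetical.

```python
from itertools import product

# Cash investment by project (rows) and year (columns), from question 28.
# ASSUMPTION: table values are in thousands of dollars.
cash = [
    [50, 10, 10, 0],   # Project 1
    [45, 10, 10, 10],  # Project 2
    [30, 10, 10, 0],   # Project 3
    [40, 10, 10, 10],  # Project 4
]
expected_return = [350_000, 400_000, 385_000, 450_000]

# ASSUMPTION: 60 (thousand) for Year 1 and 45 (thousand) for each later year.
budget = [60, 45, 45, 45]


def within_budget(x):
    """Check sum_i a_ij * x_i <= budget_j for every year j (x_i is 0 or 1)."""
    return all(
        sum(cash[i][j] * x[i] for i in range(4)) <= budget[j]
        for j in range(4)
    )


def total_return(x):
    """Objective value: sum_i c_i * x_i."""
    return sum(c * xi for c, xi in zip(expected_return, x))


# Enumerate all 2^4 fund / do-not-fund combinations and keep the feasible ones.
feasible = [x for x in product((0, 1), repeat=4) if within_budget(x)]
best = max(feasible, key=total_return)

print("Feasible selections:", feasible)
print("Best selection:", best, "with expected return", total_return(best))
```

Under these assumptions the Year 1 constraint is the binding one: the two cheapest projects together need 70 in Year 1, so no pair of projects fits within 60.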

29) B We define aij as the cash investment required for project i in Year j; therefore, the constraint for the integer programming formulation in Year 1 is \(\sum_{i=1}^{4} a_{i1} x_i \le 90{,}000\).

30) A Based on the information provided, the only correct formula for the objective function is Expected Return = \(\sum_{i=1}^{4} c_i x_i\).

31) A Based on the information provided, the only correct formula for the objective function is Expected Return = \(\sum_{i=1}^{4} c_i x_i\).

32) D The first is B10:E10 = binary. This sets the funding approval as a binary result. The second is F2:F5 <= G2:G5. This requires that cash available be greater than or equal to cash spent.

33) B =SUMPRODUCT(B8:E8, B10:E10) is the correct formula to place in G8. The formula allows for only the selected projects in B10:E10 to be summed in cell G8 from amounts in B8:E8.

34) A The objective function is synonymous with the manager’s goal.

35) A The objective is to minimize the total shipping cost. Note that a transportation problem seeks the minimum cost, not the maximum as with a capital project.
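The transportation objective in question 35 can be sketched in Python as a sum of unit cost times pallets shipped over every warehouse and location pair. The variable `x[i][j]` (pallets from warehouse i to location j) and the helper names are hypothetical. The candidate plan below simply serves each location from its cheaper warehouse, which is a shortcut rather than a general solution method: it is optimal here only if the resulting plan also respects both warehouse capacities, which the code checks.

```python
# Unit shipping costs from the table in question 35.
unit_cost = [
    [2.85, 3.32, 4.85, 6.25],  # Watkins 1 to Harmons 1..4
    [3.10, 2.90, 3.48, 4.90],  # Watkins 2 to Harmons 1..4
]
capacity = [180, 300]      # weekly pallets available at each warehouse
demand = [75, 50, 50, 50]  # weekly pallets ordered by each location


def total_cost(x):
    """Objective to minimize: sum over i, j of unit_cost[i][j] * x[i][j]."""
    return sum(unit_cost[i][j] * x[i][j] for i in range(2) for j in range(4))


def is_feasible(x):
    """Capacity constraints per warehouse and demand constraints per location."""
    supply_ok = all(sum(x[i]) <= capacity[i] for i in range(2))
    demand_ok = all(
        sum(x[i][j] for i in range(2)) >= demand[j] for j in range(4)
    )
    return supply_ok and demand_ok


# Candidate plan: ship each location's full demand from its cheaper warehouse.
plan = [[0] * 4 for _ in range(2)]
for j in range(4):
    cheapest = min(range(2), key=lambda i: unit_cost[i][j])
    plan[cheapest][j] = demand[j]

print("Plan:", plan)
print("Feasible:", is_feasible(plan))
print("Total shipping cost:", round(total_cost(plan), 2))
```

No plan can cost less than each location's demand priced at its cheapest unit cost, so when this candidate passes the capacity check it attains the minimum. When capacities bind, a solver is needed instead.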

36) C >unit.costs <- matrix(c(2.93, 3.26, 3.60, 3.00), nrow=2, byrow=TRUE) is the line that stores the transportation costs as a matrix.

37) C >unit.costs <- matrix(c(2.85, 3.32, 3.10, 2.90), nrow=2, byrow=TRUE) is the line that stores the transportation costs as a matrix.

38) C For this scheduling problem, we want to minimize the number of employees. Using Solver, the total number of employees is minimized at 11 employees, with 2 of those employees starting their work week on Wednesday.

39) C For this facility location problem, we want to minimize the number of plants based on a distance of less than 300 miles. Using Solver, the total number of locations is minimized at 2 plants, with one in Chicago and one in Des Moines.

40) A The purpose of the model is to minimize the number of plant sites that need to be built; therefore, the objective function is Minimize: Total number of plant sites = x1 + x2 + x3 + x4 + x5, where xi is 1 if a plant is built in city i and 0 otherwise.
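The facility location model in questions 39 through 41 can be sketched in Python as a small set-covering search. The names `covers`, `is_cover`, and `smallest_cover` are hypothetical, the distances are copied from the table in the questions, and the exhaustive search stands in for Solver only because there are five candidate sites.

```python
from itertools import combinations

cities = ["Chicago", "Des Moines", "Indianapolis", "Omaha", "St. Louis"]

# Distance table from questions 39 to 41 (miles), rows and columns in the
# same order as `cities`.
miles = [
    [0, 333, 184, 467, 296],
    [333, 0, 477, 134, 349],
    [184, 477, 0, 611, 242],
    [467, 134, 611, 0, 432],
    [296, 349, 242, 432, 0],
]
LIMIT = 300  # each metropolitan area must be less than 300 miles from a plant

# covers[j] lists the candidate sites i that can serve city j. Each list is
# one covering constraint: the sum of x_i over covers[j] must be at least 1.
covers = [[i for i in range(5) if miles[i][j] < LIMIT] for j in range(5)]


def is_cover(sites):
    """True if every city has at least one chosen site within the limit."""
    return all(any(i in sites for i in covers[j]) for j in range(5))


def smallest_cover():
    """Try 1 site, then 2, and so on; return the first set that covers all."""
    for k in range(1, 6):
        for sites in combinations(range(5), k):
            if is_cover(sites):
                return sites
    return None


sites = smallest_cover()
print("Indianapolis can be served from:", [cities[i] for i in covers[2]])
print("Plants needed:", len(sites), "at", [cities[i] for i in sites])
```

The list `covers[2]` contains Chicago, Indianapolis, and St. Louis, which corresponds to the Indianapolis constraint x1 + x3 + x5 ≥ 1.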

41) A We need to ensure that each metropolitan area is less than 300 miles away from a plant. Therefore, the mileage constraint for Indianapolis is x1 + x3 + x5 ≥ 1, where xi is 0 or 1. The Indianapolis constraint ensures that there will be at least one plant in Chicago, Indianapolis, and/or St. Louis.

42) D Because each decision must be an integer (a project is either selected or not, never a fraction) and the goal is project selection, capital budgeting problem is the right classification.

43) B When constructing the constraints in a transportation model, the points of demand are the destinations. Demand is what is requested from the origin, whereas the origin is subject to capacity constraints.
