Editorial Note: Polygon is MDC Hialeah's academic journal. It is a multi-disciplinary online publication whose purpose is to display the academic work produced by faculty and staff. We, the editorial committee of Polygon, are pleased to publish the 2015 Spring Issue of Polygon, the ninth consecutive issue. It includes ten regular papers, presenting work from a diverse array of fields written by faculty from across the college. The editorial committee of Polygon is thankful to the Miami Dade College President, Dr. Eduardo J. Padrón, the Miami Dade College District Board of Trustees, the President of Hialeah Campus, Dr. Mattie Roig-Watnik, Academic Dean, Dr. Ana Maria Bradley-Hess, Chair of the Liberal Arts and Sciences Department, Dr. Cary Castro, Director of Hialeah Campus Administration, Ms. Andrea M. Forero, and all staff and faculty of Hialeah Campus and Miami Dade College in general, for their continued support and cooperation in the publication of Polygon.

Sincerely,
The Editorial Committee of Polygon: Dr. M. Shakil (Mathematics), Dr. Jaime Bestard (Mathematics), and Professor Victor Calderin (English)

Patrons:
Dr. Mattie Roig-Watnik, Hialeah Campus President
Dr. Ana Maria Bradley-Hess, Dean of Academic and Student Affairs
Dr. Caridad Castro, Chair of Liberal Arts and Sciences
Dr. Jon Mcglone, Chair of World Language

Miami Dade College District Board of Trustees:
Helen Aguirre Ferré, Chair
Armando J. Bucelo Jr.
Benjamin León III
Marili Cancio
Jose K. Fuentes
Armando J. Olivera
Bernie Navarro
Eduardo J. Padrón, College President

Mission of Miami Dade College: The mission of the College is to provide accessible, affordable, high-quality education that keeps the learner's needs at the center of the decision-making process.
CONTENTS

A LITTLE BIT OF KNOWLEDGE IS A DANGEROUS THINK 1-4
Dr. Jack Alexander
Imaginary Raised to the Power of the Imaginary is REAL ( i^i = Re )
5-7
Dr. Jack Alexander
The De-Mystification of the Normal Distribution
8-12
Dr. Jack Alexander
Probability and Statistics have the Potential to Save Lives
13-17
Dr. Jack Alexander
YOU CAN’T BEAT THE HOUSE
18-23
Dr. Jack Alexander
The Results of College Algebra in the Mathematics Discipline Learning Outcomes Assessment
24-29
Dr. Jaime Bestard
A Statistical Analysis of Students’ Opinions towards Project-Based and Problem-Based Learning Approaches of Instructions in Some Mathematics Courses
30-44
Dr. M. Shakil
Teaching Statistical Methods (STA2023) Using EXCEL and STATDISK
45-63
Dr. M. Shakil
Testing the Goodness of Fit of Some Continuous Probability Distributions
64-78
Dr. M. Shakil
Review on Some Indices to Measure the Impact of Multicollinearity in a General Linear Regression Model
79-98
Dr. M. Shakil
Comments about Polygon (http://www.mdc.edu/hialeah/Polygon2013/docs2013b/Comments_About_Polygon.pdf)
Previous Editions
Polygon, 2008 (http://issuu.com/polygon5/docs/polygon2008) Polygon, 2009 (http://issuu.com/polygon5/docs/polygon2009) Polygon, 2010 (http://issuu.com/polygon5/docs/polygon_2010) Polygon, 2011 (http://issuu.com/polygon5/docs/polygon_2011) Polygon, 2012 (http://issuu.com/polygon5/docs/polygon_2012) Polygon, 2013 (http://issuu.com/polygon5/docs/polygon2013) Polygon, 2014 (http://issuu.com/polygon5/docs/polygon_2014)
Disclaimer: The views and perspectives presented in these articles do not represent those of Miami Dade College.
A LITTLE BIT OF KNOWLEDGE IS A DANGEROUS THINK

By Dr. Jack Alexander
Miami Dade College, North Campus

ABSTRACT: When a person books an airline flight, there is a .9005 probability that they will show up for the flight. The airlines, not wanting to have too many empty seats, typically overbook their flights. If too many people show up for a specific flight, the airline usually offers free tickets to those people who have confirmed reservations but cannot get on the flight. Obviously, the airline cannot afford to do this for too many prospective passengers. An over-zealous ticket agent decides to book 236 passengers for a Boeing 767-300, which has a seating capacity of 213. He figures that since the probability that a confirmed passenger actually shows up is .9005, the average number of people who actually show up is .9005 × 236 = 212.5. While on the surface this seems reasonable enough, the 212.5 is only an average, which means that on any two consecutive days 150 could show up on day one but 275 could show up on day two. The average would still be 212.5 ((150 + 275)/2 = 212.5). The situation would be fine on day one, but a disaster on day two. When statistical studies are conducted and the results published, other researchers will typically attempt to replicate the study to see if they get the same results. However, it is unlikely that they get exactly the same results. Hypothesis testing is employed to see if there is a significant difference between parallel studies. The first part of this paper illustrates a way to determine the number of people to book for the Boeing 767-300 that would accommodate a relatively high percentage of passengers while remaining financially feasible. The second part illustrates how to test a claim of age discrimination leveled by unsuccessful applicants for a promotion.

KEY WORDS: Probability, Average, Normal Distribution, Binomial Distribution, Bookings, Hypothesis Testing, Critical Value, Significance

AMS Subject Classification 2010: 62-07

Part 1: When a person books a flight, they either show up or, for whatever reason, they do not. Given there are only two possibilities, this is a binomial situation. In the case where the ticket agent booked 236 people for an airplane that carries only 213 passengers, we must calculate the probability that more than 213 confirmed people show up for the flight. The best way to do this calculation is to employ a normal approximation to the binomial distribution. Since the binomial distribution is discrete (whole numbers) while the normal distribution is continuous, we must use what is called the continuity correction (x + .5 and x − .5).
This means that 213 will be considered any real number between and including 212.5 and 213.5. Hence, any number greater than 213.5 will be considered greater than 213. The conversion formula used to determine probabilities from the normal table is z = (x − μ)/σ, where z is the number of standard deviations from the mean, μ is the population mean and σ is the population standard deviation. Since we are dealing with a binomial situation, calculating the mean and standard deviation is relatively simple:

μ = np = 236 × .9005 = 212.5
σ = √(npq) = √(236 × .9005 × .0995) = 4.6
Using these parameters, Z = (213.5 − 212.5)/4.6 = .22. If we look up .22 in the positive z-score table we obtain .5871. Therefore, the probability that more than 213 people would show up for a flight that is booked for 236 is given by 1 − .5871 = .4129. Since this is more than 40%, it would be far too expensive for the airline to give out that many "free" tickets. Clearly, the number of bookings for the plane that has 213 seats must be reduced. If that number is reduced to 230 instead of 236, we can recalculate to determine the probability that too many people show up. In this scenario, μ = 230 × .9005 = 207.115 and σ = √(230 × .9005 × .0995) = 4.54, so Z = (213.5 − 207.115)/4.54 = 1.41. If we look this up in the positive z-score table we obtain .9207. Therefore, the probability of excessive passengers is given by 1 − .9207 = .0793, or about 8%. This is far more feasible. However, 8% is still relatively large and expensive considering giving out free tickets. In statistical analysis, it is customary to endeavor to accommodate anywhere from 95 to 97 percent of the population in question. It turns out that if 228 people are booked for the Boeing 767-300, the probability of overbooking is just a little over 3.5%. The required calculations are given below:

μ = 228 × .9005 = 205.314
σ = √(228 × .9005 × .0995) = 4.52
Z = (213.5 − 205.314)/4.52 = 1.81

If we look this value up in the positive z-score table, we obtain .9649. Therefore, the desired probability is 1 − .9649 = .0351, which is approximately 3.5%.
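These overbooking probabilities are easy to check numerically. The short JAVA program below is only a sketch (the class name Overbooking and the helper areaFromMean are ours, not part of the original paper): it applies the same normal approximation with the continuity correction for bookings of 236, 230 and 228, computing the normal areas from a Taylor-series method of the kind developed in a later paper in this issue, so the last decimals may differ slightly from the hand-rounded table lookups above.

//Overbooking: normal approximation to the binomial, with continuity correction
public class Overbooking {

    // Area under the standard normal curve from 0 to z (Taylor series).
    static double areaFromMean(double z) {
        double term = z;   // current term of z - z^3/6 + z^5/40 - ...
        double sum = z;
        for (int k = 1; k <= 40; k++) {
            term *= -z * z * (2 * k - 1) / (2.0 * k * (2 * k + 1));
            sum += term;
        }
        return sum / Math.sqrt(2 * Math.PI);
    }

    public static void main(String[] args) {
        double p = 0.9005;      // probability a confirmed passenger shows up
        double seats = 213;     // capacity of the Boeing 767-300
        int[] bookings = {236, 230, 228};
        for (int n : bookings) {
            double mu = n * p;
            double sigma = Math.sqrt(n * p * (1 - p));
            double z = (seats + 0.5 - mu) / sigma;      // continuity correction
            double probOver = 0.5 - areaFromMean(z);    // P(more than 213 show up)
            System.out.printf("book %d: mu = %.3f, sigma = %.2f, z = %.2f, P = %.4f%n",
                    n, mu, sigma, z, probOver);
        }
    }
}

For example, with 228 bookings the program reports a probability of about .035, matching the 3.5% figure above.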
Part 2: The table below gives the results of 53 people who applied for a promotion at their place of work.

                          Number    Mean Age    Standard Deviation
Unsuccessful applicants     23        47.0             7.2
Successful applicants       30        43.9             5.9
The unsuccessful applicants are screaming and claiming AGE discrimination. They feel that the only reason they did not get the promotion is that they were older than those who did get the promotion. The "disgruntled" filed their complaint with the Human Resources Department, which analyzed the data and used a hypothesis test to determine whether the unsuccessful candidates are significantly older than the successful candidates. The hypothesis setup is given below:

Null:   H0: μ1 ≤ μ2
Claim:  H1: μ1 > μ2
Studies of this type are typically done at the α = .05 level. In other words, the researcher wants to be 95% sure of the results.
Since we have calculated standard deviations from independent samples, we focus on the smaller sample of 23, which means we look up 22 degrees of freedom in the t distribution table to obtain the adjusted critical value. At the .05 significance level, the desired value is 1.717.
The test statistic for this type of test is:

t = [(x̄1 − x̄2) − (μ1 − μ2)] / √(s1²/n1 + s2²/n2)
  = [(47.0 − 43.9) − 0] / √(7.2²/23 + 5.9²/30)
  = 1.68
The indication from this calculation is that we must not reject the null hypothesis, because 1.68 is not greater than the critical value of 1.717. To the disappointment of the people who did not get promoted, a polite letter will be sent stating that the information provided does not support their claim of age discrimination.
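For readers who want to reproduce the computation, the JAVA sketch below carries out the same two-sample t test from the summary statistics (the class name AgeDiscriminationTest is ours, and the critical value 1.717 is copied from the t table for 22 degrees of freedom rather than computed).

//Two-sample t test for the age-discrimination claim
public class AgeDiscriminationTest {
    public static void main(String[] args) {
        double mean1 = 47.0, s1 = 7.2; int n1 = 23;   // unsuccessful applicants
        double mean2 = 43.9, s2 = 5.9; int n2 = 30;   // successful applicants

        // Test statistic with hypothesized difference (mu1 - mu2) = 0,
        // variances not pooled.
        double se = Math.sqrt(s1 * s1 / n1 + s2 * s2 / n2);
        double t = (mean1 - mean2) / se;

        int df = Math.min(n1, n2) - 1;   // conservative df: smaller n minus 1
        double critical = 1.717;         // t table, df = 22, alpha = .05, one tail

        System.out.printf("t = %.2f with df = %d (critical value %.3f)%n", t, df, critical);
        System.out.println(t > critical ? "Reject the null hypothesis."
                                        : "Fail to reject the null hypothesis.");
    }
}

The program prints t = 1.68, agreeing with the hand calculation above.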
CONCLUSION: The ticket agent who booked 236 passengers for the airplane that has 213 seats is probably not working for that airline any more. It is important to be able to calculate things like averages. However, averages can be misleading and do not apply to all situations. This paper illustrates that a more in-depth type of analysis is necessary in this situation. A Little Bit of Knowledge is a Dangerous Think. The important lesson to be learned from Part 2 is that statistical analysis provides us with a vehicle to determine whether values are significantly different. It is true that 47.0 is greater than 43.9. However, the hypothesis test indicates that these values are not significantly different. If one is going to be a strong and effective advocate for some cause or claim of discrimination, one needs to know how to do hypothesis testing. Again, A Little Bit of Knowledge is a Dangerous Think.
Imaginary Raised to the Power of the Imaginary is REAL ( i^i = Re )

By Dr. Jack Alexander
Miami Dade College, North Campus

ABSTRACT: While the title of this paper sounds rather philosophical, mathematically it is actually true that i^i is a real number. Those of us who have studied mathematics are familiar with the strange and surprising equation e^(πi) = −1. We will first show how the Taylor series expansion of e^x allows us to substantiate this strange equation. From that derivation it is an easy matter to show that i^i is a real number.

KEY WORDS: Taylor's Series, Natural Numbers, Integers, Imaginary Numbers, Real Numbers, Rational Numbers, Irrational Numbers, Convergent Series

AMS Subject Classification 2010: 00-01, 00A05, 11-01

NARRATIVE: Throughout the history of mathematics, it has been necessary to expand number systems to accommodate the solution of problems. The most fundamental number system is the set of Natural Numbers (N = {1, 2, 3, 4, 5, ...}). This system works fine if all we need to do is count. However, it will not allow us to solve problems like 5 − 5 or 2 − 4, whose answers are 0 and −2. Therefore, we needed a more expanded system, the set of Integers (I = {..., −3, −2, −1, 0, 1, 2, 3, ...}). Furthermore, problems like 2/5 or −9/4 have no solutions if we are restricted to the Natural Numbers or the Integers. This prompted the invention of the Rational Numbers, which are defined to be ratios of two integers (a/b, where b ≠ 0). There also turn out to be numbers that cannot be expressed as the quotient of two integers. The most familiar numbers of this sort are π = 3.141592654..., the ratio of the circumference of a circle to its diameter, and e = 2.718281828..., the base of the natural logarithms. These kinds of numbers are called irrationals. The set of numbers that includes the natural numbers, the integers, the rationals and the irrationals is called the Real Number system. The equation x² + 1 = 0 has no solution in the Real Number system. This motivated mathematicians to invent an "imaginary" solution in the following manner. From x² + 1 = 0, we can write x² = −1, which yields x = √(−1). This result is designated the imaginary unit i. That is, i = √(−1).
The Taylor series expansion of e^x is given by:

e^x = Σ (n = 0 to ∞) xⁿ/n! = 1 + x + x²/2! + x³/3! + x⁴/4! + x⁵/5! + x⁶/6! + ···

If we let x = 1 and use only the first seven terms of the expansion, we obtain a value quite close to the familiar value of e. That is:

e¹ ≈ 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! + 1/6! = 1 + 1 + 1/2 + 1/6 + 1/24 + 1/120 + 1/720 = 2.718 rounded to three decimal places.

If we let x = πi, the expansion above becomes:

e^(πi) = 1 + πi + (πi)²/2! + (πi)³/3! + (πi)⁴/4! + (πi)⁵/5! + ···

Now i² = −1, i³ = −i, i⁴ = 1 and i⁵ = i, and these four values are all that are possible for powers of i. Hence, we can write the expansion as a double alternating series as indicated below:

e^(πi) = 1 + πi − π²/2! − π³i/3! + π⁴/4! + π⁵i/5! − π⁶/6! − π⁷i/7! + π⁸/8! + π⁹i/9! + ···

Note that some of the terms include i and some do not. If we group all of the terms that contain i and factor out the i, we can rewrite the expansion as indicated below:

e^(πi) = i(π − π³/3! + π⁵/5! − π⁷/7! + π⁹/9! − π¹¹/11! + ···) + 1 − π²/2! + π⁴/4! − π⁶/6! + π⁸/8! − π¹⁰/10! + ···

If we sum just the indicated terms that are multiplied by i, we get approximately .0086, and if we sum the indicated terms that do not contain i, we obtain approximately −.97602. This indicates that the alternating series that is multiplied by i converges to 0 and the alternating series not including i converges to −1. Hence, e^(πi) = −1. Q.E.D. (quod erat demonstrandum), which from the Latin means "which was to be demonstrated."

Now we are in a position to show that i^i is, in fact, a real number. If we take the square root of both sides of the equation e^(πi) = −1, we obtain √(e^(πi)) = √(−1). This simplifies to e^(πi/2) = i. If we now raise both sides of the resulting equation to the power of i, we obtain e^(−π/2) = i^i. Hence, i^i = .20788 rounded to five decimal places. This is clearly a REAL number. However, unlike e^(πi), which is equal to the rational number −1, i^i is irrational.
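The convergence of the two alternating series can be checked numerically. The JAVA program below is a sketch (the class name EulerIdentity is ours): it accumulates the terms of the expansion of e^(πi), routing each one either to the real part or to the coefficient of i, and then evaluates e^(−π/2) directly. With twenty terms, the two sums agree with −1 and 0 to about eight decimal places.

//Numerical check that e^(pi i) = -1 and that i^i is real
public class EulerIdentity {
    public static void main(String[] args) {
        double pi = Math.PI;
        double real = 0, imag = 0;
        double fact = 1;   // holds n! for the current term

        // Accumulate terms (pi i)^n / n!; the powers of i send each term to the
        // real part or to the coefficient of i, with alternating signs.
        for (int n = 0; n <= 20; n++) {
            if (n % 2 == 0) {
                real += Math.pow(-1, n / 2) * Math.pow(pi, n) / fact;
            } else {
                imag += Math.pow(-1, (n - 1) / 2) * Math.pow(pi, n) / fact;
            }
            fact *= (n + 1);
        }
        System.out.println("real part        -> " + real);   // converges to -1
        System.out.println("coefficient of i -> " + imag);   // converges to 0

        // i^i = e^(-pi/2), a real number:
        System.out.println("i^i = " + Math.exp(-pi / 2));    // 0.20787957...
    }
}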
CONCLUSION: While the fact that i^i is equal to a real number is quite fascinating, an important by-product that surfaces in this paper is the progression and expansion of number systems over time. In order for mathematics to keep pace with new and advancing applications, we had to move from the Natural Numbers to the Real Numbers to the Imaginary Numbers. This evolution is also fascinating.
The De-Mystification of the Normal Distribution

By

Dr. Jack Alexander
Department of Mathematics
Miami Dade College, North Campus

ABSTRACT: Those of us who teach courses in statistics, at the elementary and advanced levels, will most likely cover the Normal Distribution. This distribution has the familiar "bell-shaped" curve. It is a fact that many large data sets are bell-shaped. For example, the heights of adult males in this or any country would fit this distribution. If we were able to record all of the heights, we would find that a few men would be very short; a few would be very tall; but most heights would tend toward the mean.

[Figure: the bell-shaped curve of the standard normal distribution, plotted from z = −4 to z = 4]
The German mathematician Carl Friedrich Gauss (1777 – 1855) is given credit for the work he did with the normal distribution. In fact, it is sometimes referred to as the "Gaussian" distribution. However, the discovery of the equation for a normal distribution can be traced back to the French mathematician Abraham DeMoivre (1667 – 1754) in 1733. That formula, which is still in use today, is given below:

y = (1/(σ√(2π))) e^(−.5((x − µ)/σ)²)

Key Words: Normal Distribution, Mean, Standard Deviation, Taylor's Series, Empirical Rule

AMS Subject Classification 2010: 62-07
NARRATIVE: In elementary and intermediate courses in statistics we do not actually use the normal distribution formula, other than to look up probabilities for given standard deviations in tables that have been prepared using the formula. Only in a course on the "theory" of statistics would we actually use DeMoivre's formula. This paper demonstrates how the values in the table are calculated. The Taylor series expansion of the exponential function e^x is given by:

e^x = 1 + x + x²/2! + x³/3! + x⁴/4! + ···

The values typically presented in a normal table are calculated with the mean set to 0 and the standard deviation set equal to 1. That is, µ = 0 and σ = 1. This allows for a simpler expression of the normal distribution formula:

y = (1/√(2π)) e^(−.5x²)

If we substitute −.5x² for x in the series above, we can write:

e^(−.5x²) = 1 − x²/2 + x⁴/8 − x⁶/48 + x⁸/384 − ···
Integrating the individual terms of the series provides a numerical scheme that is a good approximation of the integral below:

∫ e^(−.5x²) dx = x − x³/6 + x⁵/40 − x⁷/336 + x⁹/3456 − ···

Once this value is determined, it can be divided by √(2π) to obtain the area under the curve from the mean out to a given number of standard deviations. Since these calculations are rather tedious, a JAVA program was developed to determine areas (probabilities) under the normal curve from the mean to given standard deviations. A copy of that program is presented below.
//Normal Distribution
import java.io.*;

public class Normal {
    public static void main(String[] args) throws IOException {
        InputStreamReader reader = new InputStreamReader(System.in);
        BufferedReader input = new BufferedReader(reader);
        System.out.print("Input the desired standard deviation:");
        String text = input.readLine();
        double x = Double.parseDouble(text);

        double s = 0;    // running sum of the series x - x^3/6 + x^5/40 - ...
        double f = 1;    // running factorial m!, where m = k - 1
        double v;        // final area value
        int k = 2;
        int m;           // term index
        int n;           // odd power of x in the current term
        do {
            m = k - 1;
            n = 2 * k - 1;
            // m-th term of the integrated Taylor series: (-1)^m x^n / (n * 2^m * m!)
            s = s + (Math.pow(-1, m) * Math.pow(x, n)) / (n * Math.pow(2, m) * f);
            f = f * k;
            ++k;
        } while (k <= 25);
        s = s + x;                        // add the leading term x
        v = s / Math.sqrt(2 * Math.PI);   // scale by 1/sqrt(2*pi)
        System.out.println("The area under the curve between the mean and z = " + (float) v);
    }
}
The JAVA program is interactive. If 1 is input, the area calculated is .34134474. Since the normal distribution is symmetric, the probability that a normal variable is within plus or minus one standard deviation of the mean is given by 2 × .34134474 = .68268948. If 2 is input to the program we obtain .47724986, which means that the probability that a normal variable is within two standard deviations above or below the mean is 2 × .47724986 = .95449972. Lastly, if 3 is input to the program we obtain .4986501, which means that the probability that a normal variable is within three standard deviations above or below the mean is 2 × .4986501 = .9973002.
These calculations are consistent with what is usually given in elementary statistics textbooks when the Empirical Rule is quoted. That rule states that approximately 68% of a normal distribution is within plus or minus one standard deviation of the mean; approximately 95% of the distribution is within plus or minus two standard deviations of the mean; and approximately 99.7% of the distribution is within plus or minus three standard deviations of the mean.

Typical Applications (above, below and in between)

Statistical studies indicate that the mean IQ of the adult population is 100 with a standard deviation of 15. If a person is selected at random, what is the probability that the person will have an IQ greater than 110? Since the normal table found in most statistics texts, as well as the JAVA program presented in this paper, gives probabilities for mean µ = 0 and standard deviation σ = 1, we use the conversion formula given below:

z = (x − µ)/σ

where z = the number of standard deviations above the mean, µ = the population mean and σ = the population standard deviation. Hence, we can write z = (110 − 100)/15 = .67. Now we can input .67 in the JAVA program. The program yields .2486 rounded to 4 decimal places. Since this is the probability from the mean to 110, to get the probability above 110 we subtract that value from .5. That is, .5 − .2486 = .2514, which is the desired answer to the question. If we wanted to determine the probability that the person selected had an IQ less than 110, we would simply add the value that we obtained to .5. This would give .5 + .2486 = .7486. Suppose we ask the question: What is the probability that the selected person has an IQ between 90 and 110? Since the normal distribution is symmetric, we could obtain this answer simply by adding .2486 to .2486, or just multiplying .2486 by 2. In either case, we obtain .4972.

zα Notation: The expression zα denotes the z score with an area α to its right. The "high"-minded society Mensa requires that a person have an IQ in the top 1.79%. This would be z.0179. What would be the minimum IQ required to be eligible for Mensa?
Again we will use the same conversion formula employed above; however, this time we will determine z first and then solve for x. From the discussion above we found that the area between 0 and 2 standard deviations is .4772 if we round to four decimal places. This indicates that the area above 2 standard deviations is approximately .5 − .4772 = .0228, that is, z.0228. This is close to the Mensa requirement. If we input a value close to 2, like 2.1, in the JAVA program, we obtain .4821. This gives an area above 2.1 of .5 − .4821 = .0179, which is the Mensa requirement of 1.79%. To obtain the requisite IQ, we let z = 2.1 in the above conversion formula and, as indicated, simply solve the equation for x: 2.1 = (x − 100)/15, so x = 2.1(15) + 100 = 131.5. In other words, a person must have an IQ of at least 131.5, which is 2.1 standard deviations above the mean of 100, in order to be eligible for Mensa.
CONCLUSION: The “De-Mystification” of the normal distribution table calculations becomes apparent when it is realized that the calculations employ Taylor’s series. That series is used as an integral ingredient in the development of a JAVA program that will generate even more accurate values than are typically presented in tables that are given in statistics text books. The Empirical Rule or the 68-95-99.7 rule is also apparent from the program once we are able to determine probabilities for plus or minus 1, 2 and 3 standard deviations from the mean. The examples concerning above, below and in between as well as the Mensa Society question are typical normal distribution applications.
Probability and Statistics have the Potential to Save Lives

By Dr. Jack Alexander
Miami Dade College, North Campus

ABSTRACT: Twenty years ago several people died when a water taxi sank in Baltimore's Inner Harbor. The safety load for the taxi was 3500 pounds. The seating capacity of the taxi was 20, which means that the mean weight of passengers should not have exceeded 175 pounds. This paper illustrates how the Central Limit Theorem, an application of the Normal Distribution, could have been used to avoid this unfortunate catastrophe. The Normal Distribution can also be applied when it is necessary to change standards because of new populations and developments. For example, the weight requirements for ejection seats needed to be re-evaluated when women were allowed to become pilots of fighter jets. Moreover, regulatory agencies do large-scale studies to inform the public about safety activities and equipment in automobiles, trucks, buses, trains and airplanes. They typically employ hypothesis tests to strengthen the results of their studies.

Key Words: Normal Distribution, Mean, Standard Deviation, Standard Error, Z-Score, Probability, Hypothesis Testing, Critical Value

AMS Subject Classification 2010: 62-07

NARRATIVE: At the time of the Baltimore Harbor incident, statistical analysis suggested that the mean weight of adult men was 172 pounds with a standard deviation of 29 pounds. If just one man shows up to board the taxi, what is the probability that he weighs more than 175 pounds? The conversion formula that is used so that probabilities can be read from the Standard Normal Z-Score table is:

z = (x − μ)/σ

Hence, we can write: z = (175 − 172)/29 = .10
If .10 is looked up in the positive Z-Score table we obtain .5398. Since the table gives the cumulative area to the left of z, the probability that a given man weighs more than 175 pounds is 1 − .5398 = .4602. Since this value is close to 50%, this calculation alone should have been a warning signal that not too many men should be allowed to board the water taxi at one time. We can use the Central Limit Theorem to calculate the probability that the mean weight of 20 men exceeds 175 pounds. This theorem gives the following:

a) μ_x̄ = μ
b) σ_x̄ = σ/√n

Quantity b is called the Standard Error of the Mean. The conversion formula for sample means is therefore adjusted as follows:

z = (x̄ − μ)/(σ/√n)

Hence, if 20 adult men show up to board the water taxi, the probability that their mean weight is greater than 175 pounds is determined from:

Z175 = (175 − 172)/(29/√20) = .46

If we look up .46 in the positive Z-Score table, we obtain .6772. This means that the probability that the mean weight of 20 men is greater than 175 pounds is 1 − .6772 = .3228, that is, more than a 30% chance. Had this calculation been done, water taxi operators could have been instructed not to allow as many as 20 adult men to board at one time, and these unfortunate drownings could have been avoided.
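A short JAVA sketch can confirm both probabilities (the class name WaterTaxiLoad is ours; areaFromMean is the Taylor-series construction of the z table described in the previous paper). Because the program does not first round z to two decimal places, its output, about .46 and .32, differs from the table lookups above only in the last digits.

//Water taxi: chance that passenger weight exceeds the 175-pound design mean
public class WaterTaxiLoad {

    // Area under the standard normal curve from 0 to z (Taylor series).
    static double areaFromMean(double z) {
        double term = z, sum = z;
        for (int k = 1; k <= 40; k++) {
            term *= -z * z * (2 * k - 1) / (2.0 * k * (2 * k + 1));
            sum += term;
        }
        return sum / Math.sqrt(2 * Math.PI);
    }

    public static void main(String[] args) {
        double mu = 172, sigma = 29, limit = 175;
        int n = 20;   // seating capacity of the taxi

        double zOne  = (limit - mu) / sigma;                  // one man
        double zMean = (limit - mu) / (sigma / Math.sqrt(n)); // mean of 20 men

        System.out.printf("P(one man over 175 lb)    = %.4f%n", 0.5 - areaFromMean(zOne));
        System.out.printf("P(mean of 20 over 175 lb) = %.4f%n", 0.5 - areaFromMean(zMean));
    }
}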
As indicated in the abstract, it became necessary to redesign ejection seats for women pilots. The original design was for men only, specifically men weighing between 140 and 211 pounds. The weights of women are normally distributed with a mean of 143 pounds and a standard deviation of 29 pounds. If a woman is selected at random, we can determine the probability that her weight is between 140 and 211 pounds. To do this we must use the conversion formula given above twice (z = (x − µ)/σ):

Z140 = (140 − 143)/29 = −.10 and Z211 = (211 − 143)/29 = 2.34

If we look these values up in the negative and positive Z-score tables we obtain .4602 for −.10 and .9904 for 2.34. Therefore, the probability that a woman pilot would be safely ejected under these standards is .9904 − .4602 = .5302. This would mean that 46.98% of women pilots would be in an unsafe situation. This is clearly untenable, and the ejection seats had to be redesigned.

The Department of Transportation (DOT) regularly does studies on safety procedures and equipment installed in automobiles. In a recent study of car accidents, it gathered the information in the table below on cars that had airbags installed as opposed to cars that did not.
                          Occupant Fatalities    Number of Occupants
Airbags Installed                  41                  11,541
No Airbags Installed               52                   9,853
From this table we calculate two trial proportions, as indicated below:

p̂1 = 41/11,541 = .00355        q̂1 = 1 − p̂1 = .99645
p̂2 = 52/9,853  = .00528        q̂2 = 1 − p̂2 = .99472
These calculations clearly indicate that the fatality rate for automobiles with airbags is lower than the rate for vehicles that do not have airbags. To strengthen this argument, a hypothesis test is done to determine whether there is a significant difference between the two rates. Studies of this nature are typically done at the α = .05 level; in other words, the researchers want to be 95% sure of the results. We begin the process by calculating what is called a pooled proportion, which melds the two trial proportions together.
The pooled proportion:

p̄ = (41 + 52)/(11,541 + 9,853) = .00435
q̄ = 1 − p̄ = .99565
This pooled proportion is used in the test statistic so that we can test the hypothesis that the fatality rate for automobiles that have airbags is lower than the fatality rate for vehicles that do not have airbags.

The test statistic:

Z = [(p̂1 − p̂2) − (P1 − P2)] / √(p̄q̄/n1 + p̄q̄/n2)
  = [(.00355 − .00528) − 0] / √(.00435(.99565)/11,541 + .00435(.99565)/9,853)
  = −1.91

The hypothesis setup is as indicated below:
Null:   H0: P1 ≥ P2
Claim:  H1: P1 < P2
Since this is a one-tailed test on the lower end and the significance level is α = .05, the critical value is −1.645. We therefore reject the null hypothesis, since −1.91 is less than the critical value of −1.645. Rejecting H0 means that we accept the claim that the fatality rate is lower for those cars equipped with airbags. In other words, we are 95% sure that airbags provide a safer situation.
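The same conclusion can be reached programmatically. The JAVA sketch below (the class name AirbagTest is ours) recomputes the trial proportions, the pooled proportion and the z statistic directly from the raw counts.

//Two-proportion z test: airbag vs. no-airbag fatality rates
public class AirbagTest {
    public static void main(String[] args) {
        double x1 = 41, n1 = 11541;   // fatalities, occupants: airbags installed
        double x2 = 52, n2 = 9853;    // fatalities, occupants: no airbags

        double p1 = x1 / n1;                   // trial proportion, airbags
        double p2 = x2 / n2;                   // trial proportion, no airbags
        double pBar = (x1 + x2) / (n1 + n2);   // pooled proportion
        double qBar = 1 - pBar;

        double se = Math.sqrt(pBar * qBar / n1 + pBar * qBar / n2);
        double z = (p1 - p2) / se;             // hypothesized difference is 0
        double critical = -1.645;              // lower-tail test, alpha = .05

        System.out.printf("p1 = %.5f, p2 = %.5f, pooled = %.5f%n", p1, p2, pBar);
        System.out.printf("z = %.2f (critical value %.3f)%n", z, critical);
        System.out.println(z < critical
                ? "Reject H0: the airbag fatality rate is significantly lower."
                : "Fail to reject H0.");
    }
}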
CONCLUSION: The illustrations given in this paper make it clear that knowledge of probability and statistics enables us to both avoid catastrophes and assess the dangers and safety features of certain activities and equipment. This knowledge can truly save lives.
YOU CAN'T BEAT THE HOUSE

By Dr. Jack Alexander
Department of Mathematics
Miami Dade College, North Campus
Miami FL, 33167, USA
E-Mail: jalexan2@mdc.edu

ABSTRACT: Gambling, or "games of chance," is as old as mankind itself. Most of the modern games have their roots as far back as the 1500s, when many of these games were invented by French mathematicians. They developed number games, lotteries and casino games like Roulette for the French aristocracy, who had more money than they had sense. Needless to say, the inventors and other soothsayers got rich tricking the willing players out of their funds. This paper details how to calculate the mathematical expectation and probability of winning for some of the games most commonly played in today's arena.

KEY WORDS: Probability, Mathematical Expectation, Combinations, Permutations

AMS Subject Classification 2010: 62-07

INTRODUCTION: In the 1920s, pick-three number games were prevalent in the United States and abroad. In this country the Stock Exchange was used to determine a three-digit number. A player could play as little as a dollar and win $100 if their number came out in a prescribed order. For $5.00 a player could "box" their number, meaning they would win $500 if they had the right three digits in any order. For example, if the winning number was 1 2 3, there are six different arrangements that win (123, 132, 213, 231, 312, 321). These number games were illegal. Infamous criminals like "Bumpy" Johnson, "Dutch" Schultz, "Lucky" Luciano, Al Capone, and a woman named Stephanie St. Claire, known as the "Queen" of Harlem, made fortunes on the number games. The reason the game was so profitable is that the probability of winning with the number in a prescribed order is 1 chance in 1000, and 6 chances in 1000 if the number is boxed. This is easy to see because there are 10 possibilities (0 through 9) for each of the three digits. State and federal governments soon realized the potential for fund raising. That is why, at this point, every state in the union, the federal government and most countries have "legal" number games. In fact, a person can now play either Pick 3 or Pick 4. The probability of winning Pick 4 is 1 chance in 10,000.
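The 1-in-1000 and 6-in-1000 figures are easy to verify by brute force. The JAVA sketch below (the class name PickThree is ours) enumerates all 1000 equally likely outcomes and counts the straight and boxed wins for the winning number 123.

//Brute-force check of the Pick 3 odds
public class PickThree {
    public static void main(String[] args) {
        String winning = "123";
        int straight = 0, boxed = 0;
        // Enumerate all 1000 equally likely outcomes, 000 through 999.
        for (int i = 0; i < 1000; i++) {
            String draw = String.format("%03d", i);
            if (draw.equals(winning)) straight++;                      // prescribed order
            if (sortDigits(draw).equals(sortDigits(winning))) boxed++; // any order
        }
        System.out.println("P(prescribed order) = " + straight + "/1000");
        System.out.println("P(boxed)            = " + boxed + "/1000");
    }

    // Sorting the digits makes any two rearrangements of a number compare equal.
    static String sortDigits(String s) {
        char[] c = s.toCharArray();
        java.util.Arrays.sort(c);
        return new String(c);
    }
}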
Mathematical expectation E(x) is calculated by the formula E(x) = Σ xP(x). Table 1 below illustrates how to calculate the expectation for the player who plays $1.00 on a prescribed order and for the $5.00 boxing of a number played in Pick 3.
TABLE 1

Prescribed Order:
          x           P(x)          xP(x)
Win      $99         1/1000        99/1000
Lose    −$1         999/1000      −999/1000
                    E(x)      =   −900/1000

Boxing:
          x           P(x)          xP(x)
Win     $495         6/1000       2970/1000
Lose    −$5         994/1000     −4970/1000
                    E(x)      =  −2000/1000
What the table indicates is that the player's long-term expectation for selecting a prescribed order is negative 90 cents, while the long-term expectation for boxing is negative $2.00.

LOTTERIES

At this point in time, virtually every state in the union and many countries have lotteries. Ireland has had a national lottery for more than one hundred years. Lotteries are real money makers. They all promise an extremely large jackpot; however, the probability of winning is infinitesimally small. In Florida, you are allowed to pick 6 numbers in the range of 1 to 53. The chosen numbers are not required to be in any prescribed order; therefore, this is a combination problem. To determine the probability we need to calculate 53C6. Combinations of this sort are calculated by the formula nCr = n!/[(n − r)!r!]. In this case we calculate 53!/[(53 − 6)!6!] = 53!/(47!6!) = (53×52×51×50×49×48)/(6×5×4×3×2×1) = 22,957,480. What this means is that the probability of winning the jackpot is 1 chance in 22,957,480. If the winning number had to be given in a prescribed order, this number of combinations would have to be multiplied by 6!, which is equal to 720, and the chances of winning would be 1 in more than 16.5 billion. The prescribed order could also be calculated directly by the permutation formula 53P6 = 53!/47! = 16.5293856 × 10⁹.
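These counts are easy to reproduce. The JAVA sketch below (the class name LotteryOdds is ours) computes the combinations iteratively, which avoids overflowing factorials, and prints the Florida Lotto odds, the ordered (permutation) version, and the Power Ball odds discussed in the next section.

//Lottery odds via combinations and permutations
public class LotteryOdds {

    // nCr computed iteratively; exact in long arithmetic for these sizes.
    static long choose(int n, int r) {
        long result = 1;
        for (int i = 1; i <= r; i++) {
            result = result * (n - r + i) / i;   // stays an integer at each step
        }
        return result;
    }

    public static void main(String[] args) {
        long florida = choose(53, 6);
        System.out.println("Florida Lotto (any order):  1 in " + florida);       // 22,957,480
        System.out.println("Prescribed order (53P6):    1 in " + florida * 720); // 16,529,385,600
        long powerBall = choose(59, 5) * 39;   // 5 of 59, then 1 of 39
        System.out.println("Power Ball jackpot:         1 in " + powerBall);     // 195,249,054
    }
}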
The lottery administrators know that such odds would be totally ridiculous: players would seldom if ever win, which would, of course, reduce participation in the lottery.

POWER BALL

As if regular lotteries were not "mean" enough, Power Ball lotteries are even worse. Winning the jackpot requires that the player select the correct five numbers in the range of 1 to 59 and, in a separate drawing, the correct single number in the range of 1 to 39. To calculate the probability of winning, you must first calculate 1/59C5 and multiply that result by 1/39. The calculations are: 1/5,006,386 × 1/39 = 1/195,249,054. With the chances of winning approximately 1 in 200 million, it is no wonder that, in many cases, no one wins. This is called a "roll over." Once a roll over occurs, the jackpot value is increased; psychology then stimulates more of the public to participate in the Power Ball game. This is just sophisticated stealing. Currently, more than 40 states have Power Ball lotteries.

ROULETTE

In 1931 the state of Nevada legalized casino gaming, and the notorious mobster "Bugsy" Siegel opened the famous Flamingo in 1947. One of the oldest and most seductive games of chance at a casino is Roulette. They draw you in by announcing that if you win, they will pay you 35 to 1. For example, if you place a $5 bet on a particular number and the wheel stops on the selected number, you will make 5 × 35 = $175; not only that, they will give you back your original $5 bet. The wheel has the numbers 1 through 36, a 0 and a 00, which means that the ball can stop in 38 different places. Table 2 below illustrates a way to calculate the mathematical expectation of placing a $5 bet on a single number.
TABLE 2
          x          P(x)         xP(x)
Win      175         1/38        175/38
Lose      −5        37/38       −185/38
                    E(x)    =    −10/38
Table 2 indicates that, even though the casino will pay 35 to 1, the long-term expectation for the player is negative: −10/38 = −.2632, or about negative 26 cents. Players think they can increase their chances of winning by placing their bet on the line between two numbers on the board. The casino will allow this. However, 2 will not divide evenly into 35; it divides into 34 seventeen times. Hence, they will pay 17 to 1 if the ball stops on either of the two numbers. This means that if you bet $5 and the ball stops on one of the two numbers, they will pay you 5 × 17 = $85. Table 3 below illustrates the calculations for the expectation in this scenario.
TABLE 3
          x          P(x)         xP(x)
Win       85         2/38        170/38
Lose      −5        36/38       −180/38
                    E(x)    =    −10/38

Notice that you get exactly the same negative expectation. Another strategy employed by players is to place the bet at the intersection of 4 numbers on the board. The casino will also permit this type of bet. However, 4 will not divide evenly into 35 either; it divides into 32 eight times. Therefore, if the $5 chip touches 4 numbers and the ball stops on any one of those numbers, the casino will pay 5 × 8 = $40. Table 4 below illustrates the calculations for the expectation in this case.
TABLE 4
          x          P(x)         xP(x)
Win       40         4/38        160/38
Lose      −5        34/38       −170/38
                    E(x)    =    −10/38
Again, the expectation is the same negative value of about 26 cents. A final strategy that players try is to play the outside of the board, selecting only odd numbers or only even numbers. The casino will permit this; however, only the numbers 1 to 36 are considered: 0 and 00 are counted as neither odd nor even. This means that the probability of winning with either an odd or an even bet is 18/38. For bets outside the board, the casino will pay only the amount of the bet. So, if $5 is played on odd or even and you win, the casino will pay you $5 and return your original $5. Table 5 below gives the calculations for this situation.
TABLE 5
          x          P(x)         xP(x)
Win        5        18/38         90/38
Lose      −5        20/38       −100/38
                    E(x)    =    −10/38

Note that the expectation is still exactly the same. You can't beat the House!!
CONCLUSION: Each year about 60 billion dollars are spent on gambling. Much of this money is spent by people who cannot afford it. This is the classic case of the rich getting richer and the poor getting poorer. Even the simplest of gambling games, like Pick 3, are losing propositions for the players and highly profitable for the game administrators. The analysis of each of the games touched on above indicates a high probability of losing and a negative mathematical expectation for the participant.
The Results of College Algebra in the Mathematics Discipline Learning Outcomes Assessment

Jaime Bestard, Ph.D.
Professor, Mathematics and Statistics
MDC - Hialeah Campus
E-mail: jbestard@mdc.edu

Theme: Teaching and Learning Assessment, Learning Outcomes
Keywords: Quantitative Reasoning, Critical Thinking Assessment
Abstract
This paper provides an analysis of the results of the mathematics discipline assessment in a group of sections selected by the author as a campus sample. The results show that the teaching and learning process needs to be re-shaped toward procedural work, which is characteristic of the daily activity of the adult learners who make up most of the college's student population in the early third millennium. A shift to procedural course-work in the subject is recommended, given the evidence of significant differences in student performance that favor the integration of learning through the solution of problems of professional interest to the graduate.
1. Introduction: It is well known and expected that students perform lower in College Algebra than in other courses, sometimes as a trait, and sometimes simply as the result of the poor motivation the instructional process produces in them. Because the College Algebra competencies at most U.S. higher learning institutions are intended to promote the application of basic, integrated algebraic principles to problem solving, success in this high-risk course stands as a hurdle to students' completion and, even more, to their understanding of the specific principles of their path-major field of expertise. Most of the current literature and online resources used in teaching this course address the content and competencies as simple, one-direction problems with a specific objective. This approach suits General Education in the K-12 system, where students are not yet adults and their maturity is built up on the road to exiting high school; but there is uncertainty about the effectiveness of this approach at the higher learning level, which produces the basic disciplinary foundation for the profession, and whose students are young adults carrying all the responsibilities and accountability of the times we live in, together with the social impact with which associate and baccalaureate graduates exit to the practice of a profession. Last but not least, the Florida state law recently approved, which allows high school graduates to enter bridge courses, and sometimes College Algebra directly, makes it imperative to analyze instructional techniques and student learning in the outcomes that shape most professional fields of the future graduate: quantitative reasoning and critical thinking. These are the engines and arguments that led the author to use the interpretation of the assessment in such courses of the mathematics discipline to propose a change in the approach to teaching and learning, enhancing what is offered to students in class activity and providing integration as a high-impact teaching and learning technique (3).
2. Methods: In the teaching and learning of College Algebra, the themes that shape the competencies revolve around the integration of algebraic techniques to manipulate rational, radical, polynomial, exponential and logarithmic expressions, counting on prerequisites, such as factoring, covered in the MAT1033 Intermediate Algebra course. The mastery of such techniques has been explained systematically and across the board through specific exercises, but very seldom in the applied, integrative problem-solving format in which the literature in use (1) should traditionally present it to college students. The software in use added a significant level of practice but little integration as well (2). Vertical articulation across the curriculum was in the spotlight of national organizations in the last decade, with initiatives like MAC^3 in which MDC participated actively, but in the last three academic years such interest has ceded the protagonist role to the use of technology. It is then the role of assessment to unveil the difficulties our students face in achieving what must become two of their main outcomes in the pursuit of the associate degree. The author is a member of the assessment team in the mathematics discipline and participated in the design of artifacts to assess learning in mathematics; as the result of committee work, the decision was made to assess learning in College Algebra through the solution of an elementary optimization problem using the characteristics of the quadratic function (4). A specific integrative instructional technique was designed in advance for an experimental unit (section reference 814641), consisting of a taxonomic dissection of the characteristics of one
function according to ten important properties: from demonstrating that an equation, graph or list is a function, to calculating the function's average rate of change, passing through the protocol to obtain domain, range, intercepts, asymptotes, behavior, turning points, local and absolute extremes, and function values. All these topics were set in the context of procedural work, with students providing written arguments for their conclusive statements. (Students were observed to struggle with the symbolic mathematical representation of these conclusions, which is why the experiment was conducted at the written level only.) The rest of the sections targeted by the author in this research were taught in a traditional style, following the textbook and the online resources that deliver practice on one or two characteristics of functions in specific exercises. By Week XI of the Fall 2014 semester, a discipline assessment was conducted college-wide in MAC1105 College Algebra sections, with an instrument previously approved by the discipline, as proposed by the Learning Outcomes Committee. The assessment consists of a practical application in which students must construct a simple function model in the context of a specific situation that relates the variables and results in a quadratic model, for which they are asked to determine the values of the variables that maximize the function and the maximum value of the function. This situation is integrative enough to produce significant learning that is applicable to further courses and, eventually, to professional applications in many fields of our students' path majors. The sample of sections is a representative random sample of the population of students taking the course on campus. The analysis targeted the sections and the assessment scores in five categories: Exemplary, Proficient, Developing, Emerging, and No Effort/Evidence. Descriptive statistics are used, and an analysis of variance is produced under the assumption that the observations of the independent groups are normally distributed. The experimental unit is embedded in the data and can easily be identified by its reference number; compared with the rest of the placebo sections, the results are presented through descriptive arguments that target central tendency, variability, the shape of the distribution and the presence of outliers. The data appraisal is conducted both in the table presentation and in the side-by-side box-and-whisker display shown. The intent of the analysis is to convince readers that infusing procedural work into the instructional process may produce improved learning gains in this high-risk course.
3. Results: The statistical results of the scores of the assessment artifacts are shown in Table 1.

TABLE 1. Descriptive Statistics

Section     Mean    SE Mean  StDev   Variance  CoefVar  Sum of Sq.   Q1      Median
Ref844577   1.400   0.163    0.516   0.267     36.89     22.000      1.000   1.000
Ref823715   1.200   0.145    0.561   0.314     46.72     26.000      1.000   1.000
Ref814549   1.471   0.151    0.624   0.390     42.45     43.000      1.000   1.000
Ref814641   2.138   0.163    0.875   0.766     40.94    154.000      1.500   2.000
Ref817804   1.190   0.088    0.402   0.162     33.80     33.000      1.000   1.000
Ref814546   1.316   0.154    0.671   0.450     51.00     41.000      1.000   1.000
Ref817805   1.400   0.152    0.681   0.463     48.61     48.000      1.000   1.000

Section     Q3      IQR     Skewness  Kurtosis  MSSD
Ref844577   2.000   1.000    0.48     -2.28     0.056
Ref823715   2.000   1.000    0.11      0.38     0.071
Ref814549   2.000   1.000    1.00      0.20     0.063
Ref814641   3.000   1.500    0.40     -0.36     0.054
Ref817804   1.000   0.000    1.70      0.98     0.025
Ref814546   2.000   1.000    0.77      1.12     0.083
Ref817805   2.000   1.000    0.40      0.36     0.079
Table 2. One-way ANOVA (un-stacked): Ref844577, Ref823715, Ref814549, Ref814641, Ref817804, Ref814546, Ref817805

Source   DF    SS       MS      F      P
Factor    6    16.121   2.687   6.10   0.000
Error   124    54.627   0.441
Total   130    70.748

S = 0.6637   R-Sq = 22.79%   R-Sq(adj) = 19.05%

Individual 95% CIs for the means are based on the pooled standard deviation:

Level       N    Mean     StDev
Ref844577   10   1.4000   0.5164
Ref823715   15   1.2000   0.5606
Ref814549   17   1.4706   0.6243
Ref814641   29   2.1379   0.8752
Ref817804   21   1.1905   0.4024
Ref814546   19   1.3158   0.6710
Ref817805   20   1.4000   0.6806

Pooled StDev = 0.6637
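The ANOVA table can be reproduced directly from the summary statistics in Table 1, since the error sum of squares is Σ(ni − 1)si² and the factor sum of squares is Σni(x̄i − grand mean)². The JAVA sketch below (the class name AssessmentAnova is ours) recovers the same sums of squares and F ratio as the Minitab output.

//One-way ANOVA reconstructed from per-section summary statistics
public class AssessmentAnova {
    public static void main(String[] args) {
        int[]    n    = {10, 15, 17, 29, 21, 19, 20};
        double[] mean = {1.4000, 1.2000, 1.4706, 2.1379, 1.1905, 1.3158, 1.4000};
        double[] sd   = {0.5164, 0.5606, 0.6243, 0.8752, 0.4024, 0.6710, 0.6806};

        int k = n.length, total = 0;
        double grandSum = 0;
        for (int i = 0; i < k; i++) {
            total += n[i];
            grandSum += n[i] * mean[i];
        }
        double grandMean = grandSum / total;

        double ssFactor = 0, ssError = 0;
        for (int i = 0; i < k; i++) {
            ssFactor += n[i] * Math.pow(mean[i] - grandMean, 2);  // between groups
            ssError  += (n[i] - 1) * sd[i] * sd[i];               // within groups
        }
        int dfFactor = k - 1, dfError = total - k;
        double f = (ssFactor / dfFactor) / (ssError / dfError);

        System.out.printf("SS factor = %.3f (df = %d)%n", ssFactor, dfFactor);
        System.out.printf("SS error  = %.3f (df = %d)%n", ssError, dfError);
        System.out.printf("F = %.2f%n", f);
    }
}

Running it prints SS factor = 16.12, SS error = 54.63 and F = 6.10, matching Table 2.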
[Figure 1: side-by-side box-and-whisker plots of the assessment scores (scale 0 to 4) for sections Ref844577, Ref823715, Ref814549, Ref814641, Ref817804, Ref814546 and Ref817805]
Figure 1 shows the side-by-side box-and-whisker plots, in which the experimental unit appears to differ from the rest in measures of center, while the variability of the groups appears similar; a pooled standard deviation is therefore accepted, and this result is confirmed by the un-stacked one-way ANOVA, with p-value < 0.05 at the typical significance level.
4. Conclusions: There are significant (p-value < 0.05; n = 131) differences in the results of the students in section Ref # 814641, the experimental unit, which received integrative instruction in the analysis of the characteristics of a function, when exposed to an assessment instrument that uses the integrative problem-solving approach, compared with the rest of the sections, which performed lower, and significantly differently, after being exposed to the traditional methods concurrent with the textbook and online materials content approach. This interpretation of the assessment results draws attention to the importance and effectiveness of instructional procedural work in overcoming lack of motivation and improving performance and learning gains, since procedural work is the logical daily behavior of adult students and fosters their maturity and professional commitment to learn via applications, integrating the curriculum and producing a high-impact learning experience.
References

1. Bestard, J., Campus Textbook Revision Notes and Recommendations, MDC Math Discipline, Fall 2014-01.
2. Bestard, J., Campus Notes on Software Revision (ALEKS and MyMathLab), MDC Math Discipline, Fall 2014-01.
3. Fink, L. D., A Self-Directed Guide to Designing Courses for Significant Learning, Jossey-Bass, San Francisco, 2003.
4. Mathematics Discipline Committee, DLO Assessment Instrument Design, Spring 2014.
A Statistical Analysis of Students' Opinions towards Project-Based and Problem-Based Learning Approaches of Instructions in Some Mathematics Courses

M. Shakil, Ph.D.
Professor of Mathematics
Liberal Arts & Sciences Department
Miami Dade College, Hialeah Campus
FL 33012, USA
E-mail: mshakil@mdc.edu
Abstract

Classroom assessment is one of the most significant teaching strategies, and it is a major component of classroom research at present. Classroom Assessment Techniques are designed to help teachers measure the effectiveness of their teaching by finding out what students are learning in the classroom and how well they are learning it. In recent years, there has been a great emphasis on student-centered and cooperative learning approaches of instruction in various disciplines. This paper deals with a statistical analysis of students' opinions towards the project-based and problem-based learning approaches of instruction implemented in some mathematics courses during the Fall 2014 term.
2010 Mathematics Subject Classifications: 97C40, 97C70, 97D40, 97D50. Keywords: Classroom assessment techniques, Cooperative learning, Problem-based learning, Project-based learning.
1. INTRODUCTION

As observed by Angelo and Cross (1993), "the goals of college teachers differ, depending on their disciplines, the specific content of their courses, their students, and their own personal philosophies about the purposes of higher education. All faculty, however, are interested in promoting the cognitive growth and academic skills of their students." Assessing accomplishments in the cognitive domain has long occupied educational psychologists (see, for example, Angelo and Cross (1993), and references therein). Many researchers have worked on and developed useful theories and taxonomies on the assessment of academic skills, intellectual development and cognitive abilities, from both the analytical and the quantitative point of view. The development of the general theory of measuring cognitive abilities began with the work of Bloom et al. (1956), known as "Bloom's Taxonomy." Further developments continued with the contributions of Ausubel (1968), Bloom et al. (1971), McKeachie et al. (1986), and Angelo and Cross (1993), among others. "Active engagement in higher learning implies and requires self-awareness and self-direction," which cognitive psychologists define as "metacognition." As pointed out in "Education and Research Policy (2000)," Flinders University of South Australia (http://www.flinders.edu.au/teach/teach/home.html), no matter what our topic design, classroom strategies, assessment practices and interactions with students may be, a teacher is
expected to uphold the following principles for effective teaching and learning in all classes; that is, teaching should:

- focus on desired learning outcomes for students, in the form of knowledge, understanding, skills and attitudes;
- assist students in forming broad conceptual understandings while gaining depth of knowledge;
- encourage informed and critical questioning of accepted theories and views;
- develop an awareness of the limited and provisional nature of much of current knowledge in all fields;
- see how understanding evolves and is subject to challenge and revision;
- engage students as active participants in the learning process, while acknowledging that all learning must involve a complex interplay of active and receptive processes;
- engage students in discussion of ways in which study tasks can be undertaken;
- respect students' right to express views and opinions;
- incorporate a concern for the welfare and progress of individual students;
- proceed from an understanding of students' knowledge, capabilities and backgrounds;
- encompass a range of perspectives from groups of different ethnic background, socioeconomic status and sex;
- acknowledge and attempt to meet the demands of students with disabilities;
- encourage an awareness of the ethical dimensions of problems and issues;
- utilize instructional strategies and tools to enable many different styles of learning; and
- adopt assessment methods and tasks appropriate to the desired learning outcomes of the course and topic and to the capabilities of the student.
There are various classroom assessment techniques developed by Angelo and Cross (1993) which lead to better learning and more effective teaching. The following are some of the objectives of the Classroom Assessment Techniques (CATs):

- to assess how well students are learning the content of the particular subject or topic they are studying;
- to give teachers information that will help them improve their course materials and assignments;
- to require students to think more carefully about the course work and its relationship to their learning.
The kind of learning task or stage of learning assessed by these CATs is defined by Norman (1980, p. 46) as accretion, the "accumulation of knowledge into already established structures." According to Greive (2003, p. 48), "classroom assessment is an ongoing sophisticated feedback mechanism that carries with it specific implications in terms of learning and teaching." Greive (2003) further observes that "the classroom assessment techniques emphasize the principles of active learning as well as student-centered learning." In recent years, there has been a great emphasis on student-centered and cooperative learning, for example, cooperative group problem-based and project-based learning approaches of instruction in various disciplines. According to Dr. David Moursund, Emeritus Professor, University of Oregon (see www.uoregon.edu/~moursund/Math/pbl.htm), "while Project-Based
Learning and Problem-Based Learning share much in common, they are two distinct approaches to learning. Project-Based Learning is an individual or group activity that goes on over a period of time, resulting in a product, presentation, or performance. It typically has a time line and milestones, and other aspects of formative evaluation as the project proceeds. The project may or may not address a specific problem. In Problem-Based Learning, a specific problem is specified by the course instructor. Students work individually or in teams over a period of time to develop solutions to this problem"; for details, see www.uoregon.edu/~moursund/Math/pbl.htm. According to BIE (see http://bie.org/about/what_pbl), project-based learning is defined as "a systematic teaching method that engages students in learning knowledge and skills through an extended inquiry process structured around complex, authentic questions and carefully designed products and tasks." This process can last for varying time periods and can extend over multiple content areas. On the other hand, problem-based learning, according to Wikipedia (see http://en.wikipedia.org/wiki/Problem-based_learning), "is a student-centered pedagogy in which students learn about a subject through the experience of solving a problem. Students learn both thinking strategies and domain knowledge. The PBL format originated from the medical school of thought, and is now used in other schools of thought too. It was developed at the McMaster University Medical School in Canada in the 1960s and has since spread around the world. The goals of PBL are to help the students develop flexible knowledge, effective problem solving skills, self-directed learning, effective collaboration skills and intrinsic motivation. Problem-based learning is a style of active learning." According to Dr. De Gallow of the Problem-Based Learning Faculty Institute, University of California, Irvine (see http://www.pbl.uci.edu/whatispbl.html), "one of the primary features of Problem-Based Learning is that it is student-centered. 'Student-centered' refers to learning opportunities that are relevant to the students, the goals of which are at least partly determined by the students themselves. This does not mean that the teacher abdicates her authority for making judgments regarding what might be important for students to learn; rather, this feature places partial and explicit responsibility on the students' shoulders for their own learning. Creating assignments and activities that require student input presumably also increases the likelihood of students being motivated to learn." It is evident from the above theories that both problem-based and project-based learning approaches of instruction are student-centered, and focus on cooperative learning. For details on student-centered and cooperative learning approaches of instruction, the interested readers are referred to Slavin (1995), Oon-Seng (2003), Timpson and Bendel-Simso (2003), Savin-Baden and Major (2004), Johnson et al. (2008), and Eggen and Kauchak (2010), among others. The effects of problem-based and project-based learning approaches of instruction on students' learning have been studied and analyzed by many researchers; see, for example, Kaw and Yalcin (2008), Shakil (2008), Gok and Silay (2010), and Khairiree and Kurusatian (2009), among others.
Motivated by the importance of the problem-based and project-based learning approaches of instruction, in this paper we present a statistical survey of students' opinions towards the project-based and problem-based learning approaches implemented in some mathematics courses during the fall 2014 term.
2. DESCRIPTION OF THE PROBLEMS USED FOR THE STUDY AND THE METHODOLOGY

Education is a purposeful activity. The discipline of mathematics taught at Miami Dade College has its own identity, importance and educational values. Besides the utilitarian, disciplinary, and cultural values and aims of the teaching of mathematics, another important objective is to develop in our students the quantitative analytic, critical thinking, communication and technological skills to evaluate and process numerical data, which belong to the following General Education Learning Outcomes as defined by Miami Dade College:
- Communicate effectively using listening, speaking, reading, and writing skills.
- Use quantitative analytical skills to evaluate and process numerical data.
- Solve problems using critical and creative thinking and scientific reasoning.
- Formulate strategies to locate, evaluate, and apply information.
- Use computer and emerging technologies effectively.
The objective of developing the quantitative analytic, critical thinking, communication and technological skills varies for different levels of mathematics classes. However, for any level of mathematics class, some of the common objectives of the above General Education Learning Outcomes can be specified as follows:
- To develop the power of analytical reasoning and logical thinking.
- To develop speed and accuracy in computing.
- To develop the skills of quantitative reasoning and analysis of real world data.
- To develop the competency of analyzing real world problems.
- To enable the students to develop mathematical models to solve different real world problems.
Besides student success, our concern is also "how to incorporate the general education (learning) outcomes into a course." In order to achieve these goals, it is understood that all the math instructors at Miami Dade College have been incorporating the various general education learning outcomes into their math classes in one way or another. In view of the importance of the above-stated general education learning outcomes, and in order to achieve these goals, I have incorporated the problem-based and project-based learning approaches of instruction into some of my math classes. For example, during the fall 2014 term, the students of my MAC1105, MAC1147, MAC2233, MAC2311, and two STA2023 classes (sections A and B) were provided cooperative group problem-based and project-based (computer and writing) learning assignments as a means of assessment to evaluate their performance in all six of these math courses taught by me. These assignments required students' group projects reflecting their quantitative analytic, critical thinking, communication and technological skills, that is, mapping the general education learning outcomes stated above. These cooperative group project learning assignments varied in each course, and were divided into three categories, with the same theme, as given below:

A) Project (Computer) Based Learning Assignment
B) Problem Based Learning Assignment
C) Project (Writing) Based Learning Assignment
After the completion of the above projects, the students of the above six math courses were asked to answer the following survey questions about their opinions towards the problem-based and project-based learning approaches of instruction. The survey questions were the same for each class, and were posted online in the ANGEL Course Management System.

Question 1: Which Learning Assignments did you like most?
a) Problem Based Learning Assignments
b) Project (Computer) Based Learning Assignments
c) Project (Writing) Based Learning Assignments
d) All of the above
e) None of the above

Question 2: Which Learning Assignments did you find difficult?
a) Problem Based Learning Assignments
b) Project (Computer) Based Learning Assignments
c) Project (Writing) Based Learning Assignments
d) All of the above
e) None of the above

Question 3: Which Learning Assignments did you find easy?
a) Problem Based Learning Assignments
b) Project (Computer) Based Learning Assignments
c) Project (Writing) Based Learning Assignments
d) All of the above
e) None of the above

Question 4: Did you find the above Learning Assignments Approach of Teaching helpful in your field of studies?
a) Yes
b) No

Question 5: Choose one of the following rating scales for the above Learning Assignments Approach of Teaching Mathematics Courses.
a) Excellent
b) Good
c) Average
d) None of the above

The students' responses to the above five survey questions and the corresponding data analysis are provided in Section 3 below.
3. DISCUSSIONS OF RESULTS

As pointed out above, during the fall 2014 term, the students of my six classes were provided cooperative group problem-based and project-based (computer and writing) learning assignments. After the completion of these projects, the students were asked to complete the online survey questions, as stated in Section 2 above, about their opinions towards the problem-based and project-based learning approaches of instruction. At the time of the survey, there were 152 students in my six classes. They all participated in the survey and completed it online in the ANGEL Course Management System within the specified time without any difficulty. The students' responses to the above five survey questions and the data analysis are provided in Figures 1-3 and Tables 1-6 below. The two-way ANOVA analyses of the students' responses versus Questions (# 1, 2, 3), Question (# 4), and Question (# 5), including interaction plots (data means), are also provided; see Tables 7, 8, and 9, and Figures 4, 5, and 6, respectively, below. These tables and figures are self-explanatory. One can easily draw inferences from the statistical analysis of students' opinions towards the project-based and problem-based learning approaches of instruction implemented in some mathematics courses during the fall 2014 term. From the interaction plots provided below (Figures 4, 5, and 6), one can also observe some interaction between the different factors.

TABLE 1
STUDENTS' RESPONSE versus QUESTIONS (# 1, 2, 3)
QUESTION                                                Problem Based   Project (Computer)   Project (Writing)   All of the above   None of the above
1. Which Learning Assignments did you like most?             49                37                   30                  34                  2
2. Which Learning Assignments did you find difficult?        18                60                   27                   3                 44
3. Which Learning Assignments did you find easy?             43                30                   52                  17                 10

(The column headings refer to Problem Based, Project (Computer) Based, and Project (Writing) Based Learning Assignments.)
TABLE 2
Q. 1: Which Learning Assignments did you like most?

COURSE       Problem Based   Project (Computer)   Project (Writing)   All of the above   None of the above   TOTAL
STA2023-A         10                 8                    5                   3                  0              26
MAC2311           11                 6                    3                   6                  0              26
MAC1105            6                 3                    9                   0                  0              18
MAC2233            9                 8                    3                   8                  1              29
MAC1147            7                 3                    6                  12                  0              28
STA2023-B          6                 9                    4                   5                  1              25
TOTAL             49                37                   30                  34                  2             152
TABLE 3
Q. 2: Which Learning Assignments did you find difficult?

COURSE       Problem Based   Project (Computer)   Project (Writing)   All of the above   None of the above   TOTAL
STA2023-A          4                11                    6                   0                  5              26
MAC2311            3                11                    4                   0                  8              26
MAC1105            2                 9                    2                   0                  5              18
MAC2233            4                 7                    8                   0                 10              29
MAC1147            0                15                    4                   2                  7              28
STA2023-B          5                 7                    3                   1                  9              25
TOTAL             18                60                   27                   3                 44             152
TABLE 4
Q. 3: Which Learning Assignments did you find easy?

COURSE       Problem Based   Project (Computer)   Project (Writing)   All of the above   None of the above   TOTAL
STA2023-A          8                 7                    6                   3                  2              26
MAC2311            6                 9                    9                   1                  1              26
MAC1105            5                 5                    7                   1                  0              18
MAC2233            7                 5                   10                   4                  3              29
MAC1147            8                 2                   13                   5                  0              28
STA2023-B          9                 2                    7                   3                  4              25
TOTAL             43                30                   52                  17                 10             152
TABLE 5
Q. 4: Did you find the above Learning Assignments Approach of Teaching helpful in your field of studies?

COURSE       Yes   No   TOTAL
STA2023-A     23    3      26
MAC2311       21    5      26
MAC1105       17    1      18
MAC2233       28    1      29
MAC1147       27    1      28
STA2023-B     21    4      25
TOTAL        137   15     152

TABLE 6
Q. 5: Choose one of the following rating scales for the above Learning Assignments Approach of Teaching Mathematics Courses.

COURSE       Excellent   Good   Average   None of the above   TOTAL
STA2023-A        20        6       0              0              26
MAC2311          14        8       4              0              26
MAC1105          13        5       0              0              18
MAC2233          20        7       2              0              29
MAC1147          27        1       0              0              28
STA2023-B        11       13       0              1              25
TOTAL           105       40       6              1             152
FIGURE 1: Percentage of Students’ Response towards Questions # (1, 2, 3)
FIGURE 2: Percentage of Students’ Response towards Question # (4)
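Figures 1-3 are rendered as charts in the original; the percentages they display follow directly from the counts in Tables 1, 5, and 6. The sketch below (the use of Python/pandas is my illustration, not part of the original analysis) recomputes the Figure 1 percentages:

```python
# Recompute the Figure 1 percentages from the Table 1 counts.
# Python/pandas is used only for illustration here.
import pandas as pd

table1 = pd.DataFrame(
    {"Problem Based":      [49, 18, 43],
     "Project (Computer)": [37, 60, 30],
     "Project (Writing)":  [30, 27, 52],
     "All of the above":   [34,  3, 17],
     "None of the above":  [ 2, 44, 10]},
    index=["Q1: liked most", "Q2: found difficult", "Q3: found easy"])

# Each row sums to the 152 respondents; convert counts to row percentages.
percentages = table1.div(table1.sum(axis=1), axis=0) * 100
print(percentages.round(1))
```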
FIGURE 3: Percentage of Students' Response towards Question # (5)

TABLE 7
Two-way ANOVA: RESPONSE versus QUESTIONS (1, 2, 3), OPINION

Source     DF       SS       MS     F      P
QUESTION    2     0.00    0.000  0.00  1.000
OPINION     4  1524.93  381.233  1.06  0.436
Error       8  2882.67  360.333
Total      14  4407.60

S = 18.98   R-Sq = 34.60%   R-Sq(adj) = 0.00%

QUESTION   Mean
Q. 1       30.4
Q. 2       30.4
Q. 3       30.4

OPINION                     Mean
All of the above         18.0000
None of the above        18.6667
Problem Based            36.6667
Project (Computer)       42.3333
Project (Writing)        36.3333

(The individual 95% confidence interval plots based on the pooled standard deviation are omitted.)
FIGURE 4: Interaction Plot (data means) for RESPONSE - QUESTIONS (# 1, 2, 3)

(A) Hypothesis Test: As a consequence of the results in Table 7 and Figure 4, using the P-value approach, we can draw the following conclusion from the hypothesis test among the means for the main effect due to the Survey Questions. Conclusion:
Since the P-value (1.000) > 0.05, we fail to reject the null hypothesis. That is, at the 5% level of significance, we cannot conclude that the mean student responses differ significantly across the three survey questions.
(B) Hypothesis Test: As a consequence of the results in Table 7 and Figure 4, using the P-value approach, we can draw the following conclusion from the hypothesis test among the means for the main effect due to the Students' Opinions. Conclusion:
Since the P-value (0.436) > 0.05, we fail to reject the null hypothesis. That is, at the 5% level of significance, we cannot conclude that the mean student responses differ significantly across the five opinion categories.
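The output above appears to come from a standard statistics package. As a hedged cross-check, the Table 7 analysis (a two-way ANOVA without replication on the fifteen cell counts of Table 1) can be reproduced with Python's statsmodels; the variable names below are mine, introduced only for this sketch:

```python
# Reproduce the Table 7 two-way ANOVA (no replication: one count per
# question-opinion cell) from the Table 1 data, using statsmodels.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

opinions = ["Problem Based", "Project (Computer)", "Project (Writing)",
            "All of the above", "None of the above"]
counts = {"Q1": [49, 37, 30, 34, 2],
          "Q2": [18, 60, 27, 3, 44],
          "Q3": [43, 30, 52, 17, 10]}

rows = [(q, o, r) for q, resp in counts.items() for o, r in zip(opinions, resp)]
df = pd.DataFrame(rows, columns=["question", "opinion", "response"])

model = ols("response ~ C(question) + C(opinion)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
# Expected, as in Table 7: F = 0.00, P = 1.000 for question; F = 1.06, P = 0.436 for opinion.
```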
TABLE 8
Two-way ANOVA: RESPONSE versus QUESTION (4), OPINION

Source     DF       SS       MS      F      P
COURSE      5    37.67     7.53   0.60  0.705
OPINION     1  1240.33  1240.33  98.96  0.000
Error       5    62.67    12.53
Total      11  1340.67

S = 3.540   R-Sq = 95.33%   R-Sq(adj) = 89.72%

COURSE      Mean
MAC1105      9.0
MAC1147     14.0
MAC2233     14.5
MAC2311     13.0
STA2023-A   13.0
STA2023-B   12.5

OPINION      Mean
NO         2.5000
YES       22.8333

(The individual 95% confidence interval plots based on the pooled standard deviation are omitted.)
FIGURE 5: Interaction Plot (data means) for RESPONSE - QUESTION (# 4)
TABLE 9
Two-way ANOVA: RESPONSE versus QUESTION (5), OPINION

Source     DF       SS       MS      F      P
COURSE      5    18.83    3.767   0.23  0.946
OPINION     3  1147.67  382.556  22.88  0.000
Error      15   250.83   16.722
Total      23  1417.33

S = 4.089   R-Sq = 82.30%   R-Sq(adj) = 72.86%

COURSE      Mean
MAC1105     4.50
MAC1147     7.00
MAC2233     7.25
MAC2311     6.50
STA2023-A   6.50
STA2023-B   6.25

OPINION               Mean
Average             1.0000
Excellent          17.5000
Good                6.6667
None of the above   0.1667

(The individual 95% confidence interval plots based on the pooled standard deviation are omitted.)
FIGURE 6: Interaction Plot (data means) for RESPONSE - QUESTION (# 5)
Note (Hypothesis Tests): Tables 8 and 9 and Figures 5 and 6 are self-explanatory, similar to Table 7 and Figure 4. As a consequence of the results in Tables 8 and 9 and Figures 5 and 6, using the P-value approach, one can easily draw conclusions from the hypothesis tests among the means for the main effects due to the courses and the students' opinions, similar to those drawn for Table 7.
4. CONCLUDING REMARKS

Based on our observations and analysis, it is clear that cooperative group problem-based and project-based learning approaches of instruction are among the most important classroom assessment techniques. Incorporating these approaches of instruction in classes can help teachers measure the effectiveness of their teaching by finding out what students are learning in the classroom and how well they are learning it. In addition, these techniques can provide an efficient avenue of input and a high information return to the instructor without requiring much time and energy. It is recommended that, in the future, more group problem-based and project-based learning approaches of instruction be developed and implemented in other mathematics classes for better learning and more effective teaching.
Acknowledgment The author would like to thank the Editorial Committee of Polygon for accepting this paper for publication in Polygon. Also, the author is thankful to his institution, Miami Dade College, for providing him an opportunity to serve it, and allowing him to take a Graduate Course in “Analysis of Teaching”, which helped him to write this research paper. Further, the author would like to thank his wife for her patience and perseverance for the period during which this paper was prepared. Lastly, the author would like to dedicate this paper to his late parents.
References

Angelo, T. A., and Cross, K. P. (1993). Classroom Assessment Techniques - A Handbook for College Teachers. Jossey-Bass, San Francisco.
Ausubel, D. P. (1968). Educational Psychology: A Cognitive View. Holt, Rinehart & Winston, New York.
BIE. http://bie.org/about/what_pbl.
Bloom, B. S., Hastings, J. T., and Madaus, G. F. (1971). Handbook on Formative and Summative Evaluation of Student Learning. McGraw-Hill, New York.
Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., and Krathwohl, D. R. (1956). Taxonomy of educational objectives: The classification of educational goals. Handbook I: Cognitive domain. David McKay Company, New York.
Eggen, P., and Kauchak, D. (2010). Educational Psychology: Windows on Classrooms, 8th Edition. Pearson Education, Inc., Upper Saddle River, NJ.
Flinders University of South Australia (2000). Education and Research Policy, http://www.flinders.edu.au/teach/teach/home.html.
Gallow, D. http://www.pbl.uci.edu/whatispbl.html.
Gök, T., and Sılay, I. (2010). The Effects of Problem Solving Strategies on Students' Achievement, Attitude and Motivation. Latin-American Journal of Physics Education, 4(1), 7-20.
Greive, D. (2003). A Handbook for Adjunct/Part-Time Faculty and Teachers of Adults, 5th Edition. The Adjunct Advocate, Ann Arbor.
Johnson, D. W., Johnson, R. T., and Holubec, E. J. (2008). Cooperation in the Classroom, 8th Edition. Interaction Book Co., Edina, MN.
Kaw, A. K., and Yalcin, A. (2008). Problem-centered approach in a numerical methods course. Journal of Professional Issues in Engineering Education and Practice, 134(4), 359-364.
Khairiree, K., and Kurusatian, P. (2009). Enhancing Students' Understanding Statistics with TinkerPlots: Problem-Based Learning Approach. Electronic Proceedings of the Fourteenth Asian Technology Conference in Mathematics, Beijing, China.
McKeachie, W. J., Pintrich, P. R., Lin, Yi-Guang, and Smith, D. A. F. (1986). Teaching and Learning in the College Classroom: A Review of the Research Literature. National Center for Research to Improve Postsecondary Teaching and Learning, University of Michigan, Ann Arbor.
Moursund, D. www.uoregon.edu/~moursund/Math/pbl.htm.
Norman, D. A. (1980). What Goes On in the Mind of the Learner. In W. J. McKeachie (ed.), Learning, Cognition, and College Teaching, New Directions for Teaching and Learning, No. 2. Jossey-Bass, New York.
Oon-Seng, T. (2003). Problem-based Learning Innovation: Using Problems to Power Learning in the 21st Century. Thomson Learning Asia, Singapore.
Savin-Baden, M., and Major, C. H. (2004). Foundations of Problem-Based Learning, 1st Edition. Open University Press, Maidenhead.
Shakil, M. (2008). Classroom Assessment Techniques and their Implementation in a Mathematics Class. Polygon, Vol. II, 1-21.
Slavin, R. E. (1995). Cooperative Learning: Theory, Research and Practice. Allyn & Bacon, Boston.
Timpson, W., and Bendel-Simso, P. (2003). Concepts and Choices for Teaching. Atwood Publishing, Madison, WI.
Wikipedia. http://en.wikipedia.org/wiki/Problem-based_learning.
Teaching Statistical Methods (STA2023) Using EXCEL and STATDISK

M. Shakil, Ph.D.
Professor of Mathematics
Liberal Arts & Sciences Department
Miami Dade College, Hialeah Campus
FL 33012, USA
E-mail: mshakil@mdc.edu
Abstract

There is a great emphasis on statistical literacy and critical thinking in education these days. An introductory course in statistics such as Statistical Methods (STA 2023) at Miami Dade College can easily provide such avenues to our students. In any Statistical Methods (STA 2023) class, the students are first taught frequency distributions, statistical graphs, and basic descriptive statistics, that is, center, variation, distribution, and outliers, which are important tools and techniques for describing, exploring, summarizing, and comparing data sets. Given the tremendous development of computers and other technological resources, the importance of statistical software for data analysis in a Statistical Methods (STA 2023) class cannot be overemphasized or ignored. In this paper, the uses of Excel and STATDISK in teaching Statistical Methods (STA 2023) courses are presented.

2010 Mathematics Subject Classifications: 97C40, 97C70, 97D40, 97D50.
Keywords: EXCEL, STATDISK, Statistical Methods, Teaching.
1. Introduction

Statistics is one of the most important sciences at present. We cannot underestimate its role, use and importance in summarizing any type of data in the modern world. There is a great emphasis on statistical literacy and critical thinking in education these days. An introductory course in statistics such as Statistical Methods (STA 2023) at Miami Dade College can easily provide such avenues to our students. In any Statistical Methods (STA 2023) class, the students are first taught frequency distributions, statistical graphs, and basic descriptive statistics, that is, center, variation, distribution, and outliers, which are important tools and techniques for describing, exploring, summarizing, and comparing data sets. Given the tremendous development of computers and other technological resources, the importance of statistical software for data analysis in a Statistical Methods (STA 2023) class cannot be overemphasized or ignored. According to Professor Mario F. Triola, the author of the "Elementary Statistics" textbooks, "Statistical Software is now a common technology choice used in introductory statistics courses. An important reason many educators choose Statistical Software is its extensive use in corporate America. The world of business and industry has embraced the spreadsheet as an efficient and effective tool for the analysis of data, and many Statistical Software such as SPSS, SAS, Excel, MINITAB, STATDISK, among others, have become the premier spreadsheet programs. Motivated by a desire to better serve their students by better preparing them for professional careers, many instructors now include a Statistical Software as the technology tool throughout the statistics course. This marriage of statistics concepts and spreadsheet applications is giving birth to a generation of students who can enter professional careers armed with knowledge and skills that were once desired, but are now demanded" (http://www.statdisk.org/). In this paper, the uses of Excel and STATDISK in teaching Statistical Methods (STA 2023) courses are presented. For details on STATDISK, see Triola (2010), http://www.statdisk.org/, and https://www.youtube.com/playlist?list=PLiuxxNbKiuJ5YgOqcJmowweivKgTIJzda, among others. For EXCEL demonstrations with examples, the interested readers are referred to Bluman (2013), and Triola (2014), among others. The organization of this paper is as follows. In Section 2, the use of Excel in teaching the Statistical Methods (STA 2023) course is presented. Section 3 contains the use of STATDISK in teaching Statistical Methods (STA 2023) courses.
2. Uses of Excel: This is an Excel project. It is expected that the students have already learned about the following topics in the class:
- Measures of central tendency (namely, mean, median, and mode) are used to indicate the "typical" value in a distribution. A comparison between the median and the mean is used to determine the shape of the distribution, while the mode gives the most frequently occurring value.
- Measures of dispersion or variation (namely, range, standard deviation, and variance) are used to determine the "spread" of the data.
- Some statistical graphs, for example, histograms, can also be used to describe the shape of a distribution.
- The five-number summary, namely, the minimum value, the 1st quartile (i.e., the 25th percentile), the median (i.e., the 2nd quartile or the 50th percentile), the 3rd quartile (i.e., the 75th percentile), and the maximum value, is used to construct the box-plot.
In what follows, we will discuss some special features of Excel and its use for constructing frequency distributions, histograms, and descriptive statistics.
2.1. Some Special Features of Excel: These are described below.

Statistical Analysis Tools: Microsoft Excel provides a set of data analysis tools, called the Analysis ToolPak, that you can use to save steps when you develop complex statistical or engineering analyses. You provide the data and parameters for each analysis; the tool uses the appropriate statistical or engineering macro functions and then displays the results in an output table. Some tools generate charts in addition to output tables.

Related Worksheet Functions: Excel provides many other statistical, financial, and engineering worksheet functions. Some of the statistical functions are built-in, and others become available when you install the Analysis ToolPak.

Accessing the Data Analysis Tools: The Analysis ToolPak includes the tools described below. To access these tools, click Data Analysis on the Tools menu. If the Data Analysis command is not available, you need to load the Analysis ToolPak add-in program.

Descriptive Statistics Analysis Tool: The Descriptive Statistics analysis tool generates a report of univariate statistics for data in the input range, providing information about the central tendency and variability of your data.

Histogram Analysis Tool: The Histogram analysis tool calculates individual and cumulative frequencies for a cell range of data and data bins. This tool generates data for the number of occurrences of a value in a data set.
2.2. How to Perform a Statistical Analysis: These steps are described below.

1. On the Tools menu, click Data Analysis. If Data Analysis is not available, load the Analysis ToolPak as follows:
   - On the Tools menu, click Add-Ins.
   - In the Add-Ins available list, select the Analysis ToolPak box, and then click OK.
   - If necessary, follow the instructions in the setup program.
2. In the Data Analysis dialog box, click the name of the analysis tool you want to use, then click OK.
3. In the dialog box for the tool you selected, set the analysis options you want.
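For readers working outside Excel, the two ToolPak tools used in this project can be mirrored in a few lines of Python. This is only an illustrative sketch; the data values below are placeholders, not the carbon monoxide measurements analyzed later in this section:

```python
# Mirror Excel's Descriptive Statistics and Histogram tools in Python.
# The sample values are placeholders for illustration only.
import numpy as np
import pandas as pd

data = pd.Series([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 7.2, 3.3, 5.0, 4.4])

# Descriptive Statistics tool: univariate summary of the input range.
print(data.describe())
print("variance:", data.var(ddof=1), "  range:", data.max() - data.min())

# Histogram tool: frequencies for user-supplied bins ("data bins" in Excel).
bins = [3, 4, 5, 6, 7, 8]
freq, edges = np.histogram(data, bins=bins)
for lo, hi, f in zip(edges[:-1], edges[1:], freq):
    print(f"[{lo}, {hi}): {f}")
```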
2.3. Applications of Excel in the Statistical Analysis of Sustainability Data: Sustainability of our environment is an important issue of international significance. Since the mid-1980s, considerable research into various aspects of environmental sustainability has been conducted. For example, the concentration of 'greenhouse' gases (namely, carbon dioxide, carbon monoxide, methane, nitrous oxide, ozone, and chlorofluorocarbons) in the earth's atmosphere, resulting in a gradual increase in temperatures at the earth's surface, is an important area of research. Emissions of greenhouse gases worldwide resulting from human activities are expected to contribute to future climate changes. There are a number of potential impacts on the sustainability of ecosystems, including warmer temperatures and rising sea levels, changes in rainfall patterns, and increased storm and cyclone intensity due to climate change. Consequently, these impacts present many challenges for our world's environment, community and economy. Many researchers have examined and analyzed different air pollutant concentrations, both statistically and experimentally. A majority of statistical studies, whether experimental or observational, are comparative in nature. The simplest type of comparative study compares two populations based on samples drawn from them. As greenhouse gases and climate change are fundamental issues of environmental sustainability, this project aims at studying and conducting some statistical analysis of the carbon monoxide data from 1974 to 1996 at the two monitoring cities of Houston and San Antonio, Texas. To access these data, the interested readers are referred to the following sources:

Texas Environmental Center Websites:
- http://www.tec.org/
- http://www.texascenter.org/almanac/Air/AIRCH6P3.HTML

NASA Goddard Institute for Space Studies Website:
- http://data.giss.nasa.gov/, http://eosweb.larc.nasa.gov/PRODOCS/narsto/table_narsto.html

University of California Scripps Institution of Oceanography Website:
- http://cdiac.esd.ornl.gov/trends/trends.htm

Microsoft Excel is an efficient and effective tool for analyzing such data. As such, we shall use Excel for constructing frequency distributions, generating histograms, and finding basic descriptive statistics for the analysis of the said carbon monoxide data, as presented in the following paragraphs.
(a) Excel for Accessing the Data Analysis Tools: See Figure 1 below.
Figure 1
(b) Excel for Constructing a Frequency Table or Frequency Distribution: See Figure 2 below.
Figure 2
(c) Excel for Generating a Histogram: See Figure 3 below.
Figure 3
(d) Excel for Descriptive Statistics: See Figure 4 below.
Figure 4
(e) Excel Output of Statistical Analysis of Carbon Monoxide Data, Houston and San Antonio, Texas, 1974 – 1996: See Figure 5 below. FREQUENCY DISTRIBUTIONS
Figure 5
(f) HISTOGRAMS: See Figure 6 below.
Figure 6
(g) DESCRIPTIVE STATISTICS: See Figure 7 below.
Figure 7
3. Uses of Statdisk

This is a Statdisk project. Statdisk is a full-featured statistical analysis package. It includes over 70 functions and tests, dozens of built-in datasets, and graphing. Statdisk is free to users of any of the Pearson Education Triola Statistics Series textbooks.
3.1. Some Special Features of Statdisk: These are described below, covering the help overview, the main menu bar, and the Sample Editor / Data Window.
Many individual modules include their own Help comments. Here we provide some comments about the STATDISK main menu bar at the top.
File: Click on File to open an existing file or to save a file that has been created in the STATDISK Data Window. The "Open" and "Save As" features require that you select the location of the file to be opened or saved. If the default location displayed is not what you want, click on the small box to the right of the default location and proceed to select the desired location.
Edit: Click on Edit to Copy a STATDISK file to another application or to Paste a file from another application.
Analysis: Click on Analysis to access many of the STATDISK modules, including those related to such features as confidence intervals and hypothesis testing.
Data: Click on Data to access STATDISK features such as those related to descriptive statistics, histograms, and boxplots.
Data Sets: Click on Datasets to access the list of data sets in Appendix B of the textbook.
Window: The Window menu lists all of the windows currently open in your STATDISK session. You can click a window (or use its hotkey combination) to bring it to the front.
Help: The help menu will open help from this site for the various STATDISK windows. There are also links to additional resources such as the STATDISK Workbook and the Triola Statistics website.
The STATDISK Sample Editor serves as a basic starting point for using STATDISK. Many of the modules in STATDISK require raw data in order to perform a statistical analysis. The STATDISK Data Window allows you to manually enter lists of raw data.
3.2. Applications of Statdisk: In the following examples, the uses of Statdisk are demonstrated. (a) Descriptive Statistics: See Figure 8 below.
Figure 8: THE ATMOSPHERIC CARBON DIOXIDE CONCENTRATIONS IN PARTS PER MILLION (PPM) DATA AT MAUNA LOA AND CAPE KUMUKAHI, HAWAII (http://cdiac.esd.ornl.gov/trends/trends.htm, 13 June 2002)
(b) Histogram and Frequency Distribution: See Figure 9 below.
Figure 9: THE ATMOSPHERIC CARBON DIOXIDE CONCENTRATIONS IN PARTS PER MILLION (PPM) DATA AT MAUNA LOA AND CAPE KUMUKAHI, HAWAII (http://cdiac.esd.ornl.gov/trends/trends.htm, 13 June 2002)
(c) Boxplots: See Figure 10 below.
Figure 10: THE ATMOSPHERIC CARBON DIOXIDE CONCENTRATIONS IN PARTS PER MILLION (PPM) DATA AT MAUNA LOA AND CAPE KUMUKAHI, HAWAII (http://cdiac.esd.ornl.gov/trends/trends.htm, 13 June 2002)
(d) Normal Quantile Plots: See Figure 11 below.
Figure 11: THE ATMOSPHERIC CARBON DIOXIDE CONCENTRATIONS IN PARTS PER MILLION (PPM) DATA AT MAUNA LOA AND CAPE KUMUKAHI, HAWAII (http://cdiac.esd.ornl.gov/trends/trends.htm, 13 June 2002) (e) Confidence Interval Estimates and Hypothesis Test Mean – Two Independent Samples: See Figure 12 below.
Figure 12: THE ATMOSPHERIC CARBON DIOXIDE CONCENTRATIONS IN PARTS PER MILLION (PPM) DATA AT MAUNA LOA AND CAPE KUMUKAHI, HAWAII (http://cdiac.esd.ornl.gov/trends/trends.htm, 13 June 2002)
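The analysis shown in Figure 12 (a confidence interval and hypothesis test for the difference between two independent means) can also be sketched outside Statdisk. The following is a hedged illustration using scipy with placeholder values, not the actual Mauna Loa / Cape Kumukahi measurements:

```python
# Two independent samples: Welch t test and 95% CI for the mean difference,
# as in Statdisk's "Mean - Two Independent Samples" module (Figure 12).
# The ppm values below are placeholders, not the actual data set.
import numpy as np
from scipy import stats

mauna_loa = np.array([370.1, 371.0, 372.4, 373.1, 374.9])
kumukahi  = np.array([369.5, 370.2, 371.8, 372.0, 373.6])

t, p = stats.ttest_ind(mauna_loa, kumukahi, equal_var=False)
print(f"Welch t = {t:.3f}, two-sided P-value = {p:.3f}")

# Welch degrees of freedom and the corresponding 95% CI.
v1 = mauna_loa.var(ddof=1) / len(mauna_loa)
v2 = kumukahi.var(ddof=1) / len(kumukahi)
se = np.sqrt(v1 + v2)
df_w = (v1 + v2) ** 2 / (v1**2 / (len(mauna_loa) - 1) + v2**2 / (len(kumukahi) - 1))
diff = mauna_loa.mean() - kumukahi.mean()
lo, hi = stats.t.interval(0.95, df_w, loc=diff, scale=se)
print(f"95% CI for the mean difference: ({lo:.2f}, {hi:.2f})")
```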
(f) Pie Charts: See Figure 13 below.

Figure 13: Blood Types of the 25 Army Inductees (SOURCE: BLUMAN) - Frequency Distribution and Relative Frequency Distribution (%)
(g) Regression: See Figure 14 below.
Figure 14: Number of Absences and Final Grades (%) (SOURCE: BLUMAN) Scatterplot, Correlation and Regression Results
4. Concluding Remarks: In this paper, the uses of Excel and Statdisk are demonstrated through some examples. It is hoped that this paper will be helpful in teaching any introductory course in statistics, such as Statistical Methods (STA 2023), using Excel and Statdisk. Further, as there is a great emphasis on statistical literacy and critical thinking in education these days, it is hoped that, with the help of Excel and Statdisk, the students will be able to conduct statistical research projects in their STA2023 courses, and will be able to achieve the following:

I. Search or web-search any real world data.
II. Analyze the data statistically using Statdisk, that is:
   - Compute descriptive statistics for any real world data;
   - Draw histograms and other statistical graphs for data sets;
   - Discuss the distributions of data sets;
   - Perform other statistical analyses.
III. Write a statistical research project or report by incorporating the above findings.
IV. Present the research project.
Acknowledgment The author would like to thank the Editorial Committee of Polygon for accepting this paper for publication in Polygon. Also, the author is thankful to his institution, Miami Dade College, for providing him an opportunity to serve it. Further, the author would like to thank his wife for her patience and perseverance for the period during which this paper was prepared. Lastly, the author would like to dedicate this paper to his late parents.
References 1. Bluman, A. G. (2013). Elementary Statistics, A Brief Version, 6th Edition. McGraw-Hill Co., New York. 2. Triola, M. F. (2010). Elementary Statistics, 11th Edition. Addison-Wesley, New York. 3. Triola, M. F. (2014). Elementary Statistics Using Excel, 5th Edition. Addison-Wesley, New York. 4. http://www.statdisk.org/ 5. https://www.youtube.com/playlist?list=PLiuxxNbKiuJ5YgOqcJmowweivKgTIJzda
6. http://www.tec.org/
7. http://www.texascenter.org/almanac/Air/AIRCH6P3.HTML
8. http://data.giss.nasa.gov/, http://eosweb.larc.nasa.gov/PRODOCS/narsto/table_narsto.html
9. http://cdiac.esd.ornl.gov/trends/trends.htm
Testing the Goodness of Fit of Some Continuous Probability Distributions M. Shakil, Ph.D. Professor of Mathematics Liberal Arts & Sciences Department Miami Dade College, Hialeah Campus FL 33012, USA E-mail: mshakil@mdc.edu
Abstract

In this paper, we have tested the goodness of fit of the Burr (4P), Dagum (4P), Frechet (3P), Gamma (3P), Generalized Extreme Value, Inverse Gaussian (3P), Log-Logistic (3P), Lognormal, Lognormal (3P) and Weibull probability distributions to the maximum 24-hour precipitation levels recorded by the U.S. Weather Bureau for thirty-six inland hurricanes during the time they were over the mountains (from 1900 to 1969), as reported in Larsen & Marx (2006). It was found that the Burr (4P) distribution was the best fit amongst the ten continuous probability distributions for the maximum 24-hour precipitation data based on both the Kolmogorov-Smirnov and Anderson-Darling goodness of fit tests. On the other hand, the Dagum (4P) distribution was found to fit the maximum 24-hour precipitation data best based on the Chi-Squared goodness of fit test. Since fitting a probability distribution to precipitation data may be helpful in predicting the probability or forecasting the frequency of occurrence of precipitation during hurricanes, and in planning beforehand, it is hoped that this study will be useful in many problems of hydrological processes and designs, and other applied research.
2010 Mathematics Subject Classifications: 62C12, 62F03, 62N02, 62N03, 62-07.
Keywords: Goodness of fit test, Hurricane, Precipitation, Probability distribution.

1. Introduction

A hurricane is a type of tropical storm with strong wind gusts and heavy rainfall. Besides rainfall, we may also have other kinds of precipitation during a hurricane, such as snow, hail, sleet, or freezing rain. The tropical storms which form in the Northwest Pacific (close to Japan) are called typhoons, whereas South Pacific or Indian Ocean storms are known as cyclones. The tropical storms which form in the Atlantic and the Northeast Pacific are known as hurricanes. The eastern and southern coastal regions of the United States are therefore frequently in danger from hurricanes. Sometimes these hurricanes also enter inland areas
before they disappear. The rainfall or other types of precipitation produced by hurricanes cause widespread flooding in the affected areas, due to which people face a great deal of damage and destruction of their property, including loss of life, resulting in serious socio-economic problems. The statistical analysis of precipitation data during hurricanes is therefore very crucial, and plays an important role in many studies of hydrological processes and designs. Many researchers have investigated precipitation analysis both mathematically and statistically; see, for example, Phien and Ajirajah (1984), Husak et al. (2007), Jacob and Koblinsky (2007), Kwaku and Duke (2007), Hanson and Vogel (2008), Olofintoye et al. (2009), Mahdavi et al. (2010), Sharma and Singh (2010), Gonzalez et al. (2012), Khudri and Sadia (2013), and Muralee and Muthuchamy (2014), and references therein. Further, fitting a probability distribution to precipitation data may be helpful in predicting the probability or forecasting the frequency of occurrence of precipitation during hurricanes, and in planning beforehand. For example, as pointed out by Larsen & Marx (2006), the gamma distribution is frequently used to describe the variation in precipitation levels. As such, Larsen & Marx (2006) have discussed the applicability of the two-parameter gamma probability density function to the maximum 24-hour precipitation levels recorded by the U.S. Weather Bureau for thirty-six inland hurricanes during the time they were over the mountains (from 1900 to 1969). However, since the distribution of the observed frequencies of precipitation data during hurricanes depends on many factors, such as the force, speed and direction of the wind, the temperature of the air, the level of humidity, atmospheric pressure, and the latitude and altitude of the location, among others, it is difficult to predict them exactly. Therefore, the statistical analysis of precipitation data during hurricanes is very necessary and important. Motivated by the importance of the study of precipitation during hurricanes in many problems of hydrological processes and designs, in this paper we have investigated the fitting of the commonly used continuous probability distributions, namely, the Burr (4P), Dagum (4P), Frechet (3P), Gamma (3P), Generalized Extreme Value, Inverse Gaussian (3P), Log-Logistic (3P), Lognormal, Lognormal (3P) and Weibull probability distributions, to the maximum 24-hour precipitation levels reported in Larsen & Marx (2006), to determine their applicability and best fit to these data based on the goodness of fit (GOF) tests, namely, the Kolmogorov-Smirnov, Anderson-Darling, and Chi-Squared goodness-of-fit tests (see Massey (1951), Stephens (1974), Conover (1999), Blischke and Murthy (2000), Hogg and Tanis (2006), and Ahsanullah et al. (2014), among others). The organization of this paper is as follows. Section 2 contains the description of the maximum 24-hour precipitation data. Also, in Section 2, we have provided the continuous probability distributions considered in this paper, namely, the Burr (4P), Dagum (4P), Frechet (3P), Gamma (3P), Generalized Extreme Value, Inverse Gaussian (3P), Log-Logistic (3P), Lognormal, Lognormal (3P) and Weibull distributions. In Section 3, we have presented the results and discussions of our findings. Some concluding remarks are given in Section 4.
2. Methodology

In this section, we will test the goodness of fit of the Burr (4P), Dagum (4P), Frechet (3P), Gamma (3P), Generalized Extreme Value, Inverse Gaussian (3P), Log-Logistic (3P), Lognormal, Lognormal (3P) and Weibull probability distributions. To illustrate the performance of these
distributions, we have considered the maximum 24-hour precipitation levels as reported in Larsen & Marx (2006), and determined their best fit. For the sake of completeness, these maximum 24-hour precipitation data are provided in Table 1 below. In Table 2, we have provided the probability density functions and parameters of the Burr (4P), Dagum (4P), Frechet (3P), Gamma (3P), Generalized Extreme Value, Inverse Gaussian (3P), Log-Logistic (3P), Lognormal, Lognormal (3P) and Weibull distributions considered in this paper.

Table 1 (Source: Larsen & Marx, An Int. to Math. Stat., Page 360, 2006)

Year   Name      Location               Maximum Precipitation (in.)
1969   Camille   Tye River, VA          31.00
1968   Candy     Hickley, NY            2.82
1965   Betsy     Haywood Gap, NC        3.98
1960   Brenda    Cairo, NY              4.02
1959   Gracie    Big Meadows, VA        9.50
1957   Audrey    Russels Point, OH      4.50
1955   Connie    Slide Mt., NY          11.40
1954   Hazel     Big Meadows, VA        10.71
1954   Carol     Eagles Mere, PA        6.31
1952   Abel      Bloserville 1-N, PA    4.95
1949   —         North Ford # 1, NC     5.64
1945   —         Crossnore, NC          5.51
1942   —         Big Meadows, VA        13.40
1940   —         Rhodhiss Dam, NC       9.72
1939   —         Caesars Head, SC       6.47
1938   —         Hubbardston, MA        10.16
1934   —         Balcony Falls, VA      4.21
1933   —         Peekamoose, NY         11.60
1932   —         Ceasars Head, SC       4.75
1932   —         Rockhouse, NC          6.85
1929   —         Rockhouse, NC          6.25
1928   —         Roanoke, VA            3.42
1928   —         Caesars Head, SC       11.80
1923   —         Mohonk Lake, NY        0.80
1923   —         Wappingers Falls, NY   3.69
1920   —         Landrum, SC            3.10
1916   —         Altapass, NC           22.22
1916   —         Highlands, NC          7.43
1915   —         Lookout Mt., TN        5.00
1915   —         Highlands, NC          4.58
1912   —         Norcross, GA           4.46
1906   —         Horse Cove, NC         8.00
1902   —         Sewanee, TN            3.73
1901   —         Linville, NC           3.50
1900   —         Marrobone, KY          6.20
1900   —         St. Johnsbury, VT      0.67
Table 2 (Continuous Probability Distributions Commonly Used in Precipitation Analysis)

1. Burr (4P): $f(x)=\dfrac{\alpha k\left(\frac{x-\gamma}{\beta}\right)^{\alpha-1}}{\beta\left[1+\left(\frac{x-\gamma}{\beta}\right)^{\alpha}\right]^{k+1}}$;
   $k>0$, $\alpha>0$: shape parameters; $\beta>0$: scale parameter; $\gamma$ (real): location parameter, where $x\ge\gamma$.

2. Dagum (4P): $f(x)=\dfrac{\alpha k\left(\frac{x-\gamma}{\beta}\right)^{\alpha k-1}}{\beta\left[1+\left(\frac{x-\gamma}{\beta}\right)^{\alpha}\right]^{k+1}}$;
   $k>0$, $\alpha>0$: shape parameters; $\beta>0$: scale parameter; $\gamma$ (real): location parameter, where $x\ge\gamma$.

3. Frechet (3P): $f(x)=\dfrac{\alpha}{\beta}\left(\dfrac{\beta}{x-\gamma}\right)^{\alpha+1}\exp\left[-\left(\dfrac{\beta}{x-\gamma}\right)^{\alpha}\right]$;
   $\alpha>0$: shape parameter; $\beta>0$: scale parameter; $\gamma$ (real): location parameter, where $x>\gamma$.

4. Gamma (3P): $f(x)=\dfrac{(x-\gamma)^{\alpha-1}}{\beta^{\alpha}\,\Gamma(\alpha)}\exp\left(-\dfrac{x-\gamma}{\beta}\right)$;
   $\alpha>0$: shape parameter; $\beta>0$: scale parameter; $\gamma$ (real): location parameter, where $x>\gamma$.

5. Generalized Extreme Value: with $z=(x-\mu)/\sigma$,
   $f(x)=\dfrac{1}{\sigma}\exp\left[-(1+kz)^{-1/k}\right](1+kz)^{-1-1/k}$ for $k\ne 0$, and
   $f(x)=\dfrac{1}{\sigma}\exp\left(-z-e^{-z}\right)$ for $k=0$;
   $k$ (real): shape parameter; $\sigma>0$: scale parameter; $\mu$ (real): location parameter, where $1+kz>0$ for $k\ne 0$.

6. Inverse Gaussian (3P): $f(x)=\sqrt{\dfrac{\lambda}{2\pi(x-\gamma)^{3}}}\exp\left[-\dfrac{\lambda\,(x-\gamma-\mu)^{2}}{2\mu^{2}(x-\gamma)}\right]$;
   $\lambda>0$: scale parameter; $\mu>0$: the mean; $\gamma$ (real): location parameter, where $x>\gamma$.

7. Log-Logistic (3P): $f(x)=\dfrac{\alpha}{\beta}\left(\dfrac{x-\gamma}{\beta}\right)^{\alpha-1}\left[1+\left(\dfrac{x-\gamma}{\beta}\right)^{\alpha}\right]^{-2}$;
   $\alpha>0$: shape parameter; $\beta>0$: scale parameter; $\gamma$ (real): location parameter, where $x>\gamma$.

8. Lognormal: $f(x)=\dfrac{1}{x\sigma\sqrt{2\pi}}\exp\left[-\dfrac{(\ln x-\mu)^{2}}{2\sigma^{2}}\right]$;
   $\sigma>0$: scale parameter; $\mu$ (real): location parameter, where $x>0$.

9. Lognormal (3P): $f(x)=\dfrac{1}{(x-\gamma)\sigma\sqrt{2\pi}}\exp\left[-\dfrac{\left(\ln(x-\gamma)-\mu\right)^{2}}{2\sigma^{2}}\right]$;
   $\sigma>0$: scale parameter; $\mu$ (real), $\gamma$ (real): location parameters, where $x>\gamma$.

10. Weibull: $f(x)=\dfrac{\alpha}{\beta}\left(\dfrac{x}{\beta}\right)^{\alpha-1}\exp\left[-\left(\dfrac{x}{\beta}\right)^{\alpha}\right]$;
    $\alpha>0$: shape parameter; $\beta>0$: scale parameter, where $x\ge 0$.
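As a quick sanity check on the Table 2 parameterizations, a density such as the Gamma (3P) can be evaluated both from the formula above and with scipy (shape α, location γ, scale β). The scipy usage is my illustration, not part of the original EasyFit workflow; the parameter values are the Gamma (3P) estimates reported in Table 5 below:

```python
# Evaluate the Gamma (3P) pdf of Table 2 directly and via scipy; both should agree.
import math
from scipy import stats

alpha, beta, gamma_loc = 1.9818, 3.565, 0.22225   # Gamma (3P) estimates from Table 5
x = 5.0

manual = ((x - gamma_loc) ** (alpha - 1)
          / (beta ** alpha * math.gamma(alpha))
          * math.exp(-(x - gamma_loc) / beta))
print(manual, stats.gamma.pdf(x, alpha, loc=gamma_loc, scale=beta))
```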
Fitting of the above-said distributions to the precipitation data is carried out as follows. As a first step, using the EasyFit software, we have computed the descriptive statistics of the precipitation data as given in Table 3 below.
Table 3 (Descriptive Statistics)

Statistic             Value       Percentile       Value
Sample Size           36          Min              0.67
Range                 30.33       5%               0.7805
Mean                  7.2875      10%              3.016
Variance              33.41       25% (Q1)         3.99
Std. Deviation        5.7801      50% (Median)     5.575
Coef. of Variation    0.79316     75% (Q3)         9.665
Std. Error            0.96335     90%              12.28
Skewness              2.518       95%              23.537
Excess Kurtosis       8.1006      Max              31
Further, using the software Statdisk and Minitab, we have tested the normality of the maximum 24-hour precipitation levels by the Ryan-Joiner test (similar to the Shapiro-Wilk test), along with drawing a histogram of the data; these are given in Figure 1 (a, b) and Table 4 below.

Figure 1 (a): Normality Assessment (histogram of Maximum Precipitation (in.))
Figure 1 (b): Normality Assessment (normal probability plot; Mean = 7.288, StDev = 5.780, N = 36, RJ = 0.856, P-Value < 0.010)
Table 4 (Ryan-Joiner Test of Normality Assessment)

Ryan-Joiner Test (Similar to Shapiro-Wilk Test)
Test statistic, Rp: 0.8564
Critical value for 0.05 significance level: 0.9689
Critical value for 0.01 significance level: 0.9561
Reject normality with a 0.05 significance level.
Reject normality with a 0.01 significance level.

Possible Outliers
Number of data values below Q1 by more than 1.5 IQR: 0
Number of data values above Q3 by more than 1.5 IQR: 2
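Scipy does not implement the Ryan-Joiner statistic, but the closely related Shapiro-Wilk test (to which Table 4 itself compares it) reaches the same conclusion on the Table 1 values; a self-contained sketch:

```python
# Shapiro-Wilk normality test on the Table 1 precipitation values: a stand-in
# for the Ryan-Joiner test, which scipy does not provide.
import numpy as np
from scipy import stats

precip = np.array([31.00, 2.82, 3.98, 4.02, 9.50, 4.50, 11.40, 10.71, 6.31,
                   4.95, 5.64, 5.51, 13.40, 9.72, 6.47, 10.16, 4.21, 11.60,
                   4.75, 6.85, 6.25, 3.42, 11.80, 0.80, 3.69, 3.10, 22.22,
                   7.43, 5.00, 4.58, 4.46, 8.00, 3.73, 3.50, 6.20, 0.67])

w, p = stats.shapiro(precip)
print(f"Shapiro-Wilk W = {w:.4f}, P-value = {p:.2e}")  # small p: reject normality
```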
From Table 4 and Figure 1 (histogram) of the Ryan-Joiner test of normality assessment of the maximum 24-hour precipitation levels associated with inland hurricanes, as shown above, it is obvious that the shape of the precipitation data is skewed to the right. This is also confirmed by the skewness of the precipitation data as computed in Table 3. Since fitting a probability distribution to precipitation data may be helpful in predicting the probability or forecasting the frequency of occurrence of precipitation during hurricanes, this suggests that y, the maximum 24-hour precipitation level associated with inland hurricanes, could possibly be modeled by some skewed distribution. As such, we have tested the fitting of the Burr (4P), Dagum (4P), Frechet (3P), Gamma (3P), Generalized Extreme Value, Inverse Gaussian (3P), Log-Logistic (3P), Lognormal, Lognormal (3P), and Weibull probability distributions based on their goodness of fit to the precipitation data recorded during inland hurricanes (as given in Table 1). For this, we have used the EasyFit software for estimating the parameters of these distributions and for the goodness of fit (GOF) tests, namely, the Kolmogorov-Smirnov, Anderson-Darling, and Chi-Squared tests, which are provided in Tables 5 and 6 below. For the parameters estimated in Table 5 below, the Burr (4P), Dagum (4P), Frechet (3P), Gamma (3P), Generalized Extreme Value, Inverse Gaussian (3P), Log-Logistic (3P), Lognormal, Lognormal (3P), and Weibull probability density functions, respectively, have been superimposed on the histogram of the precipitation data recorded during inland hurricanes (as given in Table 1), which is provided in Figure 2 below. For these distributions, we have provided the cumulative distribution function, survival function, hazard function, cumulative hazard function, P-P plot, Q-Q plot and probability difference in Figures 3 – 9, respectively, as given below.

Table 5 (Fitting Results)
#    Distribution                 Parameters
1    Burr (4P)                    k = 0.30163, α = 13.534, β = 12.7, γ = -9.1501
2    Dagum (4P)                   k = 1.3521, α = 3.185, β = 6.535, γ = -1.5483
3    Frechet (3P)                 α = 5.1325, β = 15.556, γ = -10.782
4    Gamma (3P)                   α = 1.9818, β = 3.565, γ = 0.22225
5    Generalized Extreme Value    k = 0.30369, σ = 2.7022, μ = 4.583
6    Inverse Gaussian (3P)        λ = 27.277, μ = 8.9022, γ = -1.6147
7    Log-Logistic (3P)            α = 3.0766, β = 6.6246, γ = -0.76363
8    Lognormal                    σ = 0.73133, μ = 1.7405
9    Lognormal (3P)               σ = 0.55886, μ = 1.9765, γ = -1.2191
10   Weibull                      α = 1.5873, β = 7.6331
Table 6 (Goodness of Fit – Summary)

                                  Kolmogorov-Smirnov     Anderson-Darling       Chi-Squared
#    Distribution                 Statistic     Rank     Statistic     Rank     Statistic     Rank
1    Burr (4P)                    0.07623         1      0.27484         1      2.3822         10
2    Dagum (4P)                   0.08048         2      0.36359         2      0.16443         1
3    Frechet (3P)                 0.09163         4      0.43862         4      0.21089         3
4    Gamma (3P)                   0.12003         8      0.80068         9      1.3276          8
5    Generalized Extreme Value    0.09362         5      0.52054         5      0.48312         5
6    Inverse Gaussian (3P)        0.10639         7      0.62851         7      0.71039         6
7    Log-Logistic (3P)            0.08449         3      0.3813          3      0.16489         2
8    Lognormal                    0.13132         9      0.78121         8      2.0608          9
9    Lognormal (3P)               0.10343         6      0.55514         6      0.34253         4
10   Weibull                      0.1328         10      0.91012        10      0.71243         7
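The EasyFit computations behind Tables 5 and 6 can be approximated with scipy's maximum-likelihood fitting. The hedged sketch below fits four of the ten candidates (scipy's burr12 corresponds to the Burr XII family behind EasyFit's Burr (4P)) and recomputes the Kolmogorov-Smirnov statistics; note that KS P-values computed with estimated parameters are only approximate:

```python
# Fit several candidate distributions to the Table 1 precipitation data by MLE
# and compute the Kolmogorov-Smirnov statistic for each fit.
import numpy as np
from scipy import stats

precip = np.array([31.00, 2.82, 3.98, 4.02, 9.50, 4.50, 11.40, 10.71, 6.31,
                   4.95, 5.64, 5.51, 13.40, 9.72, 6.47, 10.16, 4.21, 11.60,
                   4.75, 6.85, 6.25, 3.42, 11.80, 0.80, 3.69, 3.10, 22.22,
                   7.43, 5.00, 4.58, 4.46, 8.00, 3.73, 3.50, 6.20, 0.67])

for dist in (stats.burr12, stats.gamma, stats.lognorm, stats.weibull_min):
    params = dist.fit(precip)               # MLE: shape(s), then loc, then scale
    d, p = stats.kstest(precip, dist.name, args=params)
    print(f"{dist.name:12s} KS D = {d:.4f}   approx. P = {p:.3f}")
```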
Figure 2: Fitting of Probability Density Functions to the Precipitation Data
Figure 3: Fitting of Cumulative Distribution Functions to the Precipitation Data
Figure 4: Survival Functions of Distributions for the Precipitation Data
Figure 5: Hazard Functions of Distributions for the Precipitation Data
Figure 6: Cumulative Hazard Functions of Distributions for the Precipitation Data
Figure 7: P-P Plot of Distributions for the Precipitation Data
Quantile (Model)
20 18 16 14 12 10 8 6 4 2
2
4
6
8
10
12
14
16
18
20
22
24
26
28
x Burr (4P) Gen. Extreme Value
Dagum (4P)
Frechet (3P)
Gamma (3P)
Inv. Gaussian (3P)
Log-Logistic (3P)
Lognormal
Lognormal (3P)
Figure 8: Q-Q Plot of Distributions for the Precipitation Data
Figure 9: Probability Differences of Distributions for the Precipitation Data
3. Results and Discussions

The descriptive statistics of the maximum 24-hour precipitation levels associated with inland hurricanes (Table 1) are provided in Table 3 above. The estimates of the parameters of the Burr (4P), Dagum (4P), Frechet (3P), Gamma (3P), Generalized Extreme Value, Inverse Gaussian (3P), Log-Logistic (3P), Lognormal, Lognormal (3P), and Weibull probability distributions for the precipitation data are given in Table 5. From Table 4 and Figure 1 (histogram) of the Ryan-Joiner test of normality assessment of the precipitation data, it is obvious that the shape of the precipitation data is skewed to the right. This is also confirmed by the skewness of the precipitation data as computed in Table 3. For the parameters estimated in Table 5, the probability density functions of the Burr (4P), Dagum (4P), Frechet (3P), Gamma (3P), Generalized Extreme Value, Inverse Gaussian (3P), Log-Logistic (3P), Lognormal, Lognormal (3P), and Weibull distributions, respectively, have been superimposed on the histogram of the precipitation data, which is provided in Figure 2. The goodness of fit (GOF) of these ten distributions to the precipitation data by the Kolmogorov-Smirnov, Anderson-Darling, and Chi-Squared GOF tests is summarized in Table 6 above. Further, for these distributions, we have provided the cumulative distribution function, survival function, hazard function, cumulative hazard function, P-P plot, Q-Q plot and probability difference in Figures 3 – 9, respectively, as given above.
From the Kolmogorov-Smirnov and Anderson-Darling GOF tests as provided in Table 6 and Figure 2 above, we observed that the Burr (4P) distribution is the best fit amongst the ten continuous probability distributions to the maximum 24-hour precipitation levels associated with inland hurricanes (Table 1). On the other hand, the Dagum (4P) distribution was found to be the best fit for the maximum 24-hour precipitation data by the Chi-Squared goodness of fit test (Table 6). The graphs of the cumulative distribution function, survival function, hazard function, cumulative hazard function, P-P plot, Q-Q plot and probability difference, as provided in Figures 3 – 9 respectively, also confirm these results.
4. Concluding Remarks

In many problems of hydrological processes and designs, fitting a probability distribution to precipitation data may be helpful in predicting the probability or forecasting the frequency of occurrence of precipitation during hurricanes, and in planning beforehand. Motivated by the importance of the study of precipitation during hurricanes, in this paper we have tested the goodness of fit of the Burr (4P), Dagum (4P), Frechet (3P), Gamma (3P), Generalized Extreme Value, Inverse Gaussian (3P), Log-Logistic (3P), Lognormal, Lognormal (3P) and Weibull probability distributions to the maximum 24-hour precipitation levels recorded by the U.S. Weather Bureau for thirty-six inland hurricanes during the time they were over the mountains (from 1900 to 1969), as reported in Larsen & Marx (2006). It was found that the Burr (4P) distribution was the best fit for the maximum 24-hour precipitation data by both the Kolmogorov-Smirnov and Anderson-Darling goodness of fit tests, whereas the Dagum (4P) distribution was found to be the best fit by the Chi-Squared goodness of fit test. It is hoped that this study will be helpful in many problems of hydrological research.
Acknowledgment The author would like to thank Professor M. Ahsanullah, Rider University, New Jersey, USA, and Professor B. M. Golam Kibria, FIU, Miami, USA, for their helpful suggestions, which improved the quality and presentation of the paper. The author would like to thank the Editorial Committee of Polygon for accepting this paper for publication in Polygon. Also, the author is thankful to his wife for her patience and perseverance for the period during which this paper was prepared. Lastly, the author would like to dedicate this paper to his late parents.
References

Ahsanullah, M., Kibria, B. M. G., and Shakil, M. (2014). Normal and Student's t Distributions and Their Applications. Atlantis Press, Paris, France.
Blischke, W. R., and Murthy, D. N. P. (2000). Reliability, Modeling, Prediction, and Optimization. John Wiley & Sons, New York.
Conover, W. J. (1999). Practical Nonparametric Statistics. John Wiley & Sons, New York.
Hanson, L. S., and Vogel, R. (2008). The probability distribution of daily rainfall in the United States. In Conference Proceedings, World Environmental and Water Resources Congress.
Hogg, R. V., and Tanis, E. A. (2006). Probability and Statistical Inference. Pearson/Prentice Hall, NJ.
Husak, G. J., Michaelsen, J., and Funk, C. (2007). Use of the gamma distribution to represent monthly rainfall in Africa for drought monitoring applications. International Journal of Climatology, 27, 7, 935 – 944.
Kwaku, X. S., and Duke, O. (2007). Characterization and frequency analysis of one day annual maximum and two to five consecutive days' maximum rainfall of Accra, Ghana. ARPN J. Eng. Appl. Sci., 2, 5, 27 – 31.
Larsen, R. J., and Marx, M. L. (2006). An Introduction to Mathematical Statistics and Its Applications, 4th Edition. Pearson Prentice Hall, NJ.
Massey, F. J. (1951). The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association, 46, 68 – 78.
Mahdavi, M., Osati, K., Sadeghi, S. A. N., Karimi, B., and Mobaraki, J. (2010). Determining suitable probability distribution models for annual precipitation data (a case study of Mazandaran and Golestan provinces). Journal of Sustainable Development, 3, 1, 159 – 168.
Muralee, M. G., and Muthuchamy, I. (2014). Use of 'EasyFit' software for probability analysis of rainfall of Lower Bhavani Project command, Tamil Nadu. Trends in Biosciences, 7, 19, 3053 – 3056.
Phien, H. N., and Ajirajah, T. J. (1984). Applications of the log Pearson type-3 distribution in hydrology. Journal of Hydrology, 73, 3, 359 – 372.
Sharma, M. A., and Singh, J. B. (2010). Use of probability distribution in rainfall analysis. New York Science Journal, 3, 9, 40 – 49.
Stephens, M. A. (1974). EDF statistics for goodness-of-fit and some comparisons. Journal of the American Statistical Association, 69, 730 – 737.
Review on Some Indices to Measure the Impact of Multicollinearity in a General Linear Regression Model M. Shakil, Ph.D. Professor of Mathematics Liberal Arts & Sciences Department Miami Dade College, Hialeah Campus FL 33012, USA E-mail: mshakil@mdc.edu
ABSTRACT
The general multiple linear regression model is defined by Y = Xβ + ε, with the usual assumptions, where Y is an n × 1 vector of response variable, X is an n × k (n > k) matrix of regressors, β is a k × 1 vector of regressor parameters to be estimated, and ε is an n × 1 vector of uncorrelated error terms generated from N(0, σ²), σ² > 0; given β and σ², Y ~ N(Xβ, σ²I_n), where I_n denotes the n × n identity matrix. There exist a vast number of methods for detecting influential observations and measuring their effects on various aspects of the analysis of regression problems. Among others, the influence measures may also be based on the eigenstructure of the regressor matrix X, which can change substantially when a row is added to or omitted from X. In this paper, we have reviewed the uses of some indices, such as the condition number, condition index, collinearity index, and information index, to measure the impact of the eigenstructure and multicollinearity of the regressor matrix in a general linear regression model. It is hoped that the proposed study will be useful in many areas of applied research.

2010 Mathematics Subject Classification: 65F35, 15A12, 15A04, 62J05.
Keywords: Condition number, Condition index, Collinearity index, Entropy, Hat matrix, Information index, Regressor matrix.
1. INTRODUCTION
The multiple linear regression model is given by Y = Xβ + ε, where Y is an n × 1 vector of response variable, X is an n × k (n > k) matrix of predictors (also called regressor variables or explanatory variables), β is a k × 1 vector of regressor parameters to be estimated, and ε is an n × 1 vector of uncorrelated error terms, ε ~ N(0, σ²), σ² > 0. Then, given β and σ², Y ~ N(Xβ, σ²I_n), where I_n denotes the n × n identity matrix. It is assumed that the data (or observations) meet the usual assumptions of linear regression, that is, linear relationships between the predictors and the response variable (linearity), normality of the errors, homogeneity of variance (homoscedasticity), independence of the errors associated with the observations, absence of errors in the predictor variables, and correct model specification. It is known that the prediction matrix or hat matrix P = X(XᵀX)⁻¹Xᵀ determines many of the standard least squares results. The results of a least squares fit of the general linear regression model to a given data set can be substantially influenced by the addition or deletion of one or a few observations. There exist a vast number of methods for detecting influential observations and measuring their effects on various aspects of the analysis of regression problems. Among others, the influence measures may also be based on the eigenstructure of the regressor matrix X, which can change substantially when a row is added to or omitted from X. In this paper, we have reviewed and examined the effects of the eigenstructure of the regressor matrix X on its condition number, collinearity indices, and information indices. The organization of this paper is as follows. In Section 2, some preliminaries on the condition number of a matrix are presented. Section 3 contains the problems of multicollinearity in multiple linear regression models based on the condition number and collinearity indices of the regressor matrix X. In Section 4, we present the information quantity as a measure of goodness of fit of the multiple linear regression model, and also discuss the problems of multicollinearity in multiple linear regression models based on Shannon's entropy, employing the eigenstructure of the regressor matrix X. A real-life data set is presented in Section 5 to illustrate the theory. Some concluding remarks are given in Section 6.
2. SOME PRELIMINARIES ON THE CONDITION NUMBER OF A MATRIX
In what follows, some preliminaries on the condition number of a matrix are presented. The condition number of a matrix is one of the most widely used matrix features and has significant applications in many areas of numerical analysis, linear algebra, statistics, physics, and engineering. The idea of the condition number of a matrix was first introduced by Turing (1948). Further developments continued with the contributions of Rice (1966), Skeel (1979), Geurts (1982), Demmel (1987), Edelman (1988), Trefethen and Bau (1997), Bottcher and Grudsky (1998), Trefethen and Viswanath (1998), and Higham (2002), among others. For recent work on condition numbers, see, for example, Hargreaves (2004), Xu and Zhang (2004), Acosta et al. (2006), and Ern and Guermond (2006), and references therein. This section discusses some notions about the condition number of a matrix that are relevant to the accuracy and stability of numerically solving a linear system A Y = C. The interest in the condition number of the coefficient matrix A of the linear system A Y = C lies in the accuracy of computations, since it gives a bound for the propagation of the relative error in the data when the linear system is solved numerically. It is basically a measure of the stability or sensitivity of the matrix A to numerical operations. When solving the linear system A Y = C, its coefficient matrix A is said to be ill-conditioned if small errors or perturbations in the coefficient matrix A or the right-hand side C correspond to large errors or perturbations in the solution Y. Linear systems which are extremely ill-conditioned may be impossible to solve accurately. Some well-known examples of ill-conditioned matrices are the Hilbert matrix, the Vandermonde matrix, the Pei matrix, and the Wilkinson bidiagonal matrix of order 20; see, for example, Gautschi (1990), among others. On the other hand, if the corresponding changes in the solution are also small, then the linear system A Y = C is said to be well-conditioned. A numerical scale for measuring the ill-conditioning of the linear system A Y = C is provided by the condition number of the coefficient matrix A, which is defined below.
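To make the notion of ill-conditioning concrete, the following minimal numpy/scipy sketch solves A Y = C for a Hilbert matrix, one of the classic ill-conditioned examples named above; all numerical choices here (order 10, the size of the perturbation) are illustrative assumptions.

    import numpy as np
    from scipy.linalg import hilbert

    n = 10
    A = hilbert(n)                  # Hilbert matrix of order 10
    C = A @ np.ones(n)              # exact solution Y is the all-ones vector

    Y = np.linalg.solve(A, C)
    C_pert = C + 1e-10 * np.random.default_rng(0).standard_normal(n)
    Y_pert = np.linalg.solve(A, C_pert)

    print("cond(A) =", np.linalg.cond(A))   # roughly 1.6e13 for n = 10
    print("relative change in C:", np.linalg.norm(C_pert - C) / np.linalg.norm(C))
    print("relative change in Y:", np.linalg.norm(Y_pert - Y) / np.linalg.norm(Y))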
2.1. Condition Number of a Matrix:
Definition: For a given non-singular matrix A ∈ ℝ^(n×n) and a matrix norm ‖·‖, the non-negative quantity

κ(A) = ‖A‖ ‖A⁻¹‖    (1)

is defined as the condition number of A and is denoted by Cond(A); see, for example, Datta (1995), and Leon (2006), among others. Since the condition number is defined in terms of a particular matrix norm, many different matrix norms may be chosen. Consequently, the actual value of the condition number will vary depending on the norm chosen. Nevertheless, the general rule that large condition numbers indicate sensitivity holds true no matter what norm is chosen. The most frequently used norms are the Frobenius norm, defined as

‖A‖_F = ( Σ_{i,j=1}^{n} |a_ij|² )^(1/2),

and the p-norms, defined as

‖A‖_p = max_{Y ≠ 0} ‖AY‖_p / ‖Y‖_p   (p ≥ 1).

Some commonly used p-norms are the maximum column-sum norm,

‖A‖₁ = max_{1 ≤ j ≤ n} Σ_{i=1}^{m} |a_ij|,

the maximum row-sum norm,

‖A‖_∞ = max_{1 ≤ i ≤ m} Σ_{j=1}^{n} |a_ij|,

and the Hilbert or spectral norm,

‖A‖₂ = max_{Y ≠ 0} ‖AY‖₂ / ‖Y‖₂.

It can be easily shown that ‖A‖₂ = (maximum eigenvalue of AᵀA)^(1/2), and that if A is Hermitian or real and symmetric, then ‖A‖₂ = ρ(A), where ρ(A) denotes the spectral radius of A, defined as ρ(A) = max_i |λ_i|, the λ_i's being the eigenvalues of A. It can also be easily shown that ρ(A) ≤ ‖A‖.

The condition number corresponding to the Frobenius norm will be denoted by κ_F(A), and the condition number corresponding to the p-norm will be denoted by κ_p(A). For a singular square matrix A ∈ ℝ^(n×n), that is, if A is not invertible, the condition number of A is infinite: κ(A) = ∞. Although it is evident from the above definitions that the condition number of A is norm-dependent, there exist some relationships among the condition numbers of A with respect to different norms.
2.2. Some Useful Properties of the Condition Number of a Matrix:
For the sake of completeness, some useful and important properties of the condition number of a matrix A are given below; for details on these, see, for example, Datta (1995), among others. It can be easily shown that if A ∈ ℝ^(n×n), then:
(i) κ_p(A) ≥ 1, p ≥ 1.
(ii) κ(I) = 1, where I is the identity matrix.
(iii) If A is an orthogonal matrix, then κ₂(A) = 1.
(iv) κ(AᵀA) = [κ(A)]².
(v) κ(A) = κ(Aᵀ).
(vi) 1/n ≤ κ₁(A)/κ₂(A) ≤ n.
(vii) 1/n ≤ κ₂(A)/κ_∞(A) ≤ n.
(viii) 1/n² ≤ κ_∞(A)/κ₁(A) ≤ n².
(ix) Accuracy of the solution of a linear system based on the condition number: the following well-known result,

min { ‖A − S‖ / ‖A‖ : S singular } = 1 / κ(A),    (2)

can be used to measure how close A is to being singular. Thus, in view of (2), A can be well approximated by a singular matrix S if and only if κ(A) is large.
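As a quick numerical sanity check, properties (iv) and (vi)-(viii) above can be verified for a random non-singular matrix; the sketch below uses numpy's built-in condition numbers (with the spectral norm for property (iv)) and is illustrative only.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 5
    A = rng.standard_normal((n, n))

    k1 = np.linalg.cond(A, 1)          # 1-norm condition number
    k2 = np.linalg.cond(A, 2)          # spectral-norm condition number
    kinf = np.linalg.cond(A, np.inf)   # infinity-norm condition number

    print("(iv)  :", np.isclose(np.linalg.cond(A.T @ A, 2), k2 ** 2))
    print("(vi)  :", 1 / n <= k1 / k2 <= n)
    print("(vii) :", 1 / n <= k2 / kinf <= n)
    print("(viii):", 1 / n ** 2 <= kinf / k1 <= n ** 2)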
2.3. Condition Number Based on Spectral Norm: Let ‖·‖₂ denote the spectral norm, and let A be a nonsingular n × n matrix.

Definition 1. Let σ₁, σ₂, …, σ_n denote the n singular values of A. Then the condition number κ₂(A) of A is defined by

κ₂(A) = ‖A‖₂ ‖A⁻¹‖₂ = max_{1≤i≤n} σ_i / min_{1≤i≤n} σ_i,    (3)

where max σ_i and min σ_i denote the largest and smallest singular values of A, respectively.

Since the singular values of A are the positive square roots of the eigenvalues of the matrix A*A, where A* denotes the conjugate transpose of A, it follows from (3) that the condition number of A can also be defined by

κ₂(A) = ‖A‖₂ ‖A⁻¹‖₂ = ( max_{1≤i≤n} λ_i / min_{1≤i≤n} λ_i )^(1/2),    (4)

where max λ_i and min λ_i are, respectively, the largest and smallest eigenvalues of A*A.

Definition 2. Let ‖·‖₂ denote the spectral norm, and let A be a nonsingular n × n matrix. If A is Hermitian or real and symmetric, then the condition number κ₂(A) of A is defined by

κ₂(A) = ‖A‖₂ ‖A⁻¹‖₂ = max_{1≤i≤n} |λ_i| / min_{1≤i≤n} |λ_i|,    (5)

where max |λ_i| and min |λ_i| are, respectively, the largest and smallest eigenvalues in modulus of A.

Note that if the columns of A are orthogonal, the condition number κ₂(A) of A attains its minimum bound, that is, κ₂(A) = 1 (see, for example, Golub and Van Loan (1996), among others). If κ₂(A) is large, then A is said to be an ill-conditioned matrix; if κ₂(A) is near unity, then A is said to be a well-conditioned matrix. Matrices with small condition numbers are said to be well-conditioned.
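A short sketch of Definition 1: κ₂(A) computed directly from the singular values of A, checked against numpy's built-in spectral condition number (the random test matrix is an arbitrary assumption).

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((6, 6))

    s = np.linalg.svd(A, compute_uv=False)   # singular values of A
    kappa2 = s.max() / s.min()               # equation (3)

    print("kappa_2 from SVD :", kappa2)
    print("numpy cond(A, 2) :", np.linalg.cond(A, 2))   # should agree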
3. MULTICOLLINEARITY BASED ON THE EIGENSTRUCTURE OF THE REGRESSOR MATRIX X
In this section, the problems of multicollinearity in multiple linear regression models are discussed, based on the condition number and collinearity indices of the regressor matrix X.
3.1. MULTICOLLINEARITY: As noted earlier, the omission or addition of one or a few observations can influence the results of a least squares fit of the multiple linear regression model Y = Xβ + ε to the given data set. These observations are called influential observations, and they can also cause a multicollinearity problem in fitting the multiple linear regression model, among other numerical and statistical difficulties. The least-squares estimates of the regression coefficients in a multiple linear regression model Y = Xβ + ε cannot be uniquely computed when there is an exact or approximately perfect linear relationship among the predictor variables. This is described by the term collinearity when two variables are involved; for more than two variables, it is known as multicollinearity. Thus, if the columns of the regressor matrix X (that is, the predictor variables) are exactly or approximately linearly related (that is, linearly dependent), they are said to be multicollinear, which is sometimes apparent from high correlations among the predictor variables. The presence of multicollinearities in the regressor matrix X can have serious effects on the least-squares estimates, β̂, of the regression coefficients in the multiple linear regression model Y = Xβ + ε. This can be noted from the fact that β̂ is the solution of the normal equations (XᵀX)β̂ = Xᵀy, given by β̂ = (XᵀX)⁻¹Xᵀy. This implies that β̂ will be unique and computable if (XᵀX) is invertible (that is, non-singular). Thus, if the columns of X (that is, the predictor variables) are approximately linearly related (that is, linearly dependent), then (XᵀX) is nearly singular. Consequently, the least-squares estimates, β̂, of the regression coefficients become numerically unstable. Further, note that Var(β̂) = σ²(XᵀX)⁻¹, where σ² is the error variance. This implies that, when the columns of X are multicollinear, large variances associated with the elements of β̂ can be expected. This means that β̂_j will be statistically nonsignificant when Var(β̂_j) = σ² v_jj is large, where β̂_j is the least-squares estimate of β_j, v_jj is the jth diagonal term of the matrix (XᵀX)⁻¹, and j = 1, 2, …, k. Thus it is obvious that multicollinearity in the columns of the regressor matrix X causes β̂ to become an unreliable estimate of β. For some good and useful expositions of multicollinearity in multiple linear regression problems, both from the numerical and the probabilistic point of view, the interested reader is referred to Neter et al. (1996), Draper and Smith (1998), and Belsley et al. (2005), among others.
3.2. CONDITION NUMBER OF X AS A MEASURE OF MULTICOLLINEARITY: Several measures for detecting multicollinearity in multiple linear regression analysis problems, and remedies for it, have been proposed. The main concern in the least squares fit of the multiple linear regression model Y = Xβ + ε to a given data set is the amount of multicollinearity among the columns of X. As the multicollinearity of the columns of X increases, the least-squares estimates, β̂, of the regression coefficients in the multiple linear regression model Y = Xβ + ε become unstable, and the standard errors of the regression coefficients become inflated. Note that the degree of multicollinearity can be judged by the extent of singularity of the matrix (XᵀX). Some of these measures of multicollinearity, with their cut-off values, are summarized in Table 3.2.1 below.
TABLE 3.2.1 Measures of Multicollinearity

Measure: R, the correlation matrix of the x_j's; R = (1/(n − 1)) (XᵀX) if the x_j's are suitably standardized.
Cut-off value: |R| = 0 if R is singular; |R| = 1 if all correlations are zero, so that R is an identity matrix.

Measure: VIF_j = Variance Inflation Factor = 1/(1 − r_j²), j = 1, 2, …, k, where r_j² is the coefficient of multiple determination when x_j is regressed on the remaining (k − 1) predictor variables.
Cut-off value: (i) (r_j² → 1) implies (VIF_j → ∞), if x_j is approximately dependent on the other predictor variables; (ii) VIF_j > 10, corresponding to r_j² > 0.9, is unacceptable and so needs further investigation.

Measure: Tol_j = Tolerance = 1/VIF_j.
Cut-off value: Tol_j < 0.1 is unacceptable and so needs further investigation.
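The VIF_j and Tol_j measures of Table 3.2.1 are easily computed by auxiliary regressions; the sketch below does this with plain numpy least squares on a small hypothetical design in which the second column is nearly collinear with the first (the data and column names are illustrative assumptions).

    import numpy as np

    rng = np.random.default_rng(3)
    n = 50
    x1 = rng.standard_normal(n)
    x2 = x1 + 0.05 * rng.standard_normal(n)   # nearly collinear with x1
    x3 = rng.standard_normal(n)
    X = np.column_stack([x1, x2, x3])

    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])   # intercept + rest
        beta, *_ = np.linalg.lstsq(design, X[:, j], rcond=None)
        resid = X[:, j] - design @ beta
        r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        vif = 1.0 / (1.0 - r2)
        print(f"x{j + 1}: r_j^2 = {r2:.4f}, VIF = {vif:.1f}, Tol = {1 / vif:.4f}")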
3.3. CONDITION NUMBER AND COLLINEARITY INDICES OF X: Overall measures of multicollinearity in multiple linear regression analysis problems, which take all predictor variables (that is, the regressor variables) into account at the same time, can be obtained by using the eigenstructure of X, namely its condition number and collinearity indices; see Chatterjee and Hadi (1988), and Belsley et al. (2005), among others.
CONDITION NUMBER OF X: Def 3.3.1. The condition number or the condition index of X, denoted by κ(X), is defined by

κ(X) = (λ_max / λ_min)^(1/2),

where λ_max and λ_min are, respectively, the largest and smallest eigenvalues of the matrix (XᵀX), the columns of X having been normalized to unit length.
COLLINEARITY INDEX OF X: Def 3.3.2. Let X_j be the jth column of the matrix X, and let M_j be the jth column of the matrix M = X(XᵀX)⁻¹, where M is also known as the Moore-Penrose inverse of Xᵀ. Then, following Stewart (1987), and Chatterjee and Hadi (1986, 1988), the jth collinearity index or jth condition index, denoted by κ_j, is defined by

κ_j = ‖X_j‖ ‖M_j‖ = ‖X_j‖ / ‖e_j‖,  j = 1, 2, …, k,

where e_j = (I − P_(j)) X_j is the residual vector when X_j is regressed on X_(j), the matrix X with its jth column X_j removed, and P_(j) = X_(j)(X_(j)ᵀX_(j))⁻¹X_(j)ᵀ is the hat or prediction matrix for X_(j).

The condition number of X is also related to its singular values, which easily follows from the singular-value decomposition (SVD) of X; see Belsley et al. (2005), among others. Consider an n × k (n > k) matrix X of rank r, where r ≤ k. Then the singular-value decomposition of X is given by X = UDVᵀ, where D is an r × r diagonal matrix whose diagonal elements d₁, d₂, …, d_r are the singular values of X. The singular values of X are the positive square roots of the nonzero eigenvalues of the matrices (XᵀX) and (XXᵀ). U is an n × r matrix whose columns are the corresponding normalized eigenvectors of the matrix (XXᵀ), while V is a k × r matrix whose columns are the corresponding normalized eigenvectors of the matrix (XᵀX). Now, since, in the multiple linear regression model Y = Xβ + ε, the regressor matrix X, an n × k (n > k) matrix of predictor variables, is assumed to be of full column rank, we have r = k. Hence the SVD of X is given by X_(n×k) = U_(n×k) D_(k×k) Vᵀ_(k×k), such that UᵀU = VᵀV = VVᵀ = I, an identity matrix; but UUᵀ ≠ I, because U is an n × k matrix having k orthogonal columns. It follows from the above that, since the singular values of X are the positive square roots of the nonzero eigenvalues of (XᵀX), the condition number or the condition index of X can also be defined by

κ(X) = d_max / d_min,

where d_max and d_min are, respectively, the largest and smallest singular values of the matrix X. If the columns of X are orthogonal, the condition number or condition index, κ(X), of X attains its minimum bound, that is, κ(X) = 1; see Golub and Van Loan (1996). If κ(X) is large, then X is said to be an ill-conditioned matrix; matrices with small condition numbers are said to be well-conditioned. Thus the condition number or condition index of X can provide useful information for detecting multicollinearity in multiple linear regression analysis problems, cut-off values for which are summarized in Table 3.3.1 below.
TABLE 3.3.1 CONDITION NUMBER, κ(X), OF X AS A MEASURE OF MULTICOLLINEARITY

Cut-off Value of κ(X)      Nature of Multicollinearity
κ(X) < 10                  Acceptable; no serious problem with multicollinearity.
10 < κ(X) < 30             Moderate to strong multicollinearity (acceptable).
κ(X) > 30                  Severe multicollinearity (unacceptable and so needs further investigation).

Several other measures have also been proposed, based on the eigenstructure of X, for assessing the influence of the ith observation of X on the condition number of X; see, for example, Chatterjee and Hadi (1988), among others, for details.
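The following sketch applies Table 3.3.1: it computes κ(X) = d_max/d_min for a hypothetical regressor matrix with columns scaled to unit length and classifies the result by the cut-offs above (the design matrix is an illustrative assumption).

    import numpy as np

    def condition_number(X):
        Xn = X / np.linalg.norm(X, axis=0)   # columns normalized to unit length
        d = np.linalg.svd(Xn, compute_uv=False)
        return d.max() / d.min()

    rng = np.random.default_rng(4)
    n = 30
    x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
    x3 = x1 + 0.02 * rng.standard_normal(n)  # nearly dependent on x1
    X = np.column_stack([x1, x2, x3])

    kappa = condition_number(X)
    if kappa < 10:
        verdict = "acceptable"
    elif kappa < 30:
        verdict = "moderate to strong multicollinearity (acceptable)"
    else:
        verdict = "severe multicollinearity (needs further investigation)"
    print(f"kappa(X) = {kappa:.1f} -> {verdict}")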
4. ENTROPY AS A MEASURE OF GOODNESS OF FIT OF A LINEAR REGRESSION MODEL
In this section, the information quantity as a measure of goodness of fit of the multiple linear regression model is presented. The problem of multicollinearity in multiple linear regression models based on Shannon's entropy, employing the eigenstructure of the regressor matrix X, is also discussed.
4.1. The Concept of Entropy: Historically, the concept of entropy was first introduced by Clausius in about 1850 in his study of the second law of thermodynamics. A scale for the uncertainty in the outcome of a probability experiment with a discrete and finite number of outcomes was first formulated by Shannon (1948) in his studies of communication engineering. Entropy, as a matter of fact, arises from an important concept known as information, which is also sometimes called self-information. Thus, the notion of entropy as developed by Shannon (1948), and by many others later, can be used to provide some powerful descriptive and inferential statistical methods. In particular, entropy provides an excellent tool to quantify the amount of information (or uncertainty) contained in a random observation regarding its parent distribution (population). A large value of the entropy implies greater uncertainty in the data; in other words, the larger the entropy, the smaller the information. This follows from the fact that information may be measured as the negative of entropy. Development of the general theory of information entropy began with the concept of entropy as propounded by Shannon, as described below.
4.2. ENTROPY OF A DISCRETE PROBABILITY DISTRIBUTION: Def. Initially, the concept of entropy was introduced for a discrete random variable with n (countable) possible outcomes E_i, where P(E_i) = p_i. In the discrete case, let P = (p₁, p₂, …, p_n); then the entropy, H_n(P), is defined as

H_n(P) = − Σ_{i=1}^{n} p_i ln p_i.

For further historical discussion of H_n(P), see Kapur (1989), and Suhir (1997). It can be easily verified that the entropy H_n(P) satisfies the following conditions:
(i) H_n(P) is maximum when p₁ = p₂ = … = p_n = 1/n.
(ii) H_n(P) is minimum when p_i = 1 and p_j = 0 for j ≠ i, i = 1, 2, …, n, that is, when one of the probabilities is unity and all others are zero.
(iii) From (i) and (ii), it follows that 0 ≤ H_n(P) ≤ ln(n).
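A tiny Python sketch of H_n(P) and the bounds in (i)-(iii): the uniform distribution attains the maximum ln(n), and a degenerate distribution attains the minimum 0.

    import numpy as np

    def entropy(p):
        # Shannon entropy H_n(P) = -sum p_i ln p_i, with 0 ln 0 taken as 0.
        p = np.asarray(p, dtype=float)
        nz = p > 0
        return -np.sum(p[nz] * np.log(p[nz]))

    n = 4
    print("uniform    :", entropy(np.full(n, 1 / n)), "= ln(4) =", np.log(n))
    print("degenerate :", entropy([1.0, 0.0, 0.0, 0.0]))
    print("in between :", entropy([0.7, 0.1, 0.1, 0.1]))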
4.3. ENTROPY OF A CONTINUOUS PROBABILITY DISTRIBUTION: Def. The entropy of an absolutely continuous random variable X, with probability density function f(x), is defined as

H_X[f(x)] = − ∫_{−∞}^{+∞} f(x) ln f(x) dx,    (6)

where the logarithms are taken to the base e. The above definition of entropy can easily be extended to the entropy of a continuous multivariate probability distribution; see Kapur (1989). Further development of Shannon's concept of entropy continued with the contributions of Kullback and Leibler (1951), Lindley (1956), and Jaynes (1957a, b), among others. Many researchers have also studied the comparison of entropy and variance for some continuous distributions; see, for example, Kapur (1989), and Ebrahimi et al. (1999), among others. Note that entropy is always non-negative in the case of a discrete random variable X. Also, when the random variable X is discrete, H_X[p(x)] is invariant under one-to-one transformations of the discrete random variable X. When the random variable X is continuous, entropy is not invariant under one-to-one transformations of the continuous random variable X. The entropy of a continuous random variable X takes values in [−∞, +∞], and E(X²) < ∞ implies H_X[p(x)] < ∞, but the converse may not hold. Further, Wyner and Ziv (1969) have shown that, given E|X|^k < ∞,

H_X[p(x)] ≤ (1/k) ln[ 2^k e Γ^k(1/k) E|X|^k / k^(k−1) ],  k > 0,    (7)

where |X| denotes the absolute value of X. The equality in (7) is attained by the maximum entropy distribution with density given by

f*(x) = C(λ) exp(−λ|x|^k),    (8)

where the model parameter λ is obtained as the Lagrange multiplier for satisfying the constraint E|X|^k < ∞, and C(λ) is the normalizing constant. When k = 2, the inequality (7) can be simplified to the following inequality:

exp(2 H_X[p(x)]) / (2πe) ≤ Var(X).    (9)

The ratio on the left-hand side of (9) is the entropy power proposed by Shannon (1948) for the comparison of continuous random variables; the equality in (9) holds if and only if p(x) is a normal density. Some generalizations of Shannon's entropy and other measures of entropy have also been proposed and studied by many researchers; see, for example, Kapur (1989), and Soofi and Retzer (2002), among others. The entropies of various commonly used absolutely continuous probability distributions have been tabulated by many authors; see, for example, Johnson and Kotz (1970), and Lazo and Rathie (1978), among others. Ebrahimi et al. (1999) have tabulated the entropies by classifying them according to three parametric families of probability distributions:
(i) the location-scale family of probability distributions;
(ii) the shape-scale family of probability distributions; and
(iii) the Student-t, F, and beta family of probability distributions.
Ebrahimi et al. (2003) have explored the properties of entropy, Kullback-Leibler information, and mutual information for order statistics. The entropy of record-value distributions obtained from some commonly used continuous probability models has been studied by Zahedi and Shakil (2006), among others. On the use of entropy in the inferential statistical analysis of multiple linear regression problems, the interested reader is referred to Soofi (1990), among others. In what follows, motivated by the work of Soofi (1990), we describe the use of Shannon's entropy as a multicollinearity diagnostic in multiple linear regression analysis problems based on the eigenstructure of the regressor matrix X.
4.4. ENTROPY AS A MEASURE OF MULTICOLLINEARITY
4.4.1. Let us consider the multiple linear regression model Y = Xβ + ε, with the usual assumptions, where Y is an n × 1 vector of response variable, X is an n × k (n > k) matrix of regressors, β is a k × 1 vector of regressor parameters to be estimated, and ε is an n × 1 vector of uncorrelated error terms generated from N(0, σ²), σ² > 0. Then, given β and σ², Y follows the n-variate normal distribution f(y|β, σ²) = N(Xβ, σ²I_n), where I_n denotes the n × n identity matrix. We are interested in the effects of multicollinearity of the columns of X on the entropy of the posterior distribution of β (that is, the entropy of the distribution of β after we observe y). The posterior distribution of β is given by

f(β|y) = N(β̂, σ²(XᵀX)⁻¹),

where β̂ = (XᵀX)⁻¹Xᵀy is the least-squares estimate of the regression coefficients in the multiple linear regression model Y = Xβ + ε. Thus, using the expression for the entropy of the multivariate normal probability distribution (see, for example, Kapur (1989), among others), the entropy of the posterior distribution of β is given by

H[β|y] = − ∫ f(β|y) ln f(β|y) dβ = ln[(2πe)^(k/2) |V|^(1/2)],    (10)

where V = σ²(XᵀX)⁻¹ is the variance-covariance matrix and σ² is the error variance. From (10), we have

H[β|y] = (k/2) ln(2πe) − (1/2) ln|(XᵀX)/σ²|
       = (k/2) ln(2πe) − (1/2) Σ_{j=1}^{k} ln(λ_j/σ²),    (11)

where the λ_j's are the eigenvalues of the matrix (XᵀX), ordered so that λ₁ ≥ λ₂ ≥ … ≥ λ_k. The amount of uncertainty in the posterior distribution f(β|y) of the parameter β can thus be quantified by the entropy H[β|y] in (11). Furthermore, the prior entropy of β is a constant. Hence, a large value of the entropy H[β|y] indicates that only a small amount of information about the parameter β is contained in the data.
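A short numerical sketch of (11): the posterior entropy H[β|y] computed from the eigenvalues of (XᵀX) for a hypothetical X with one nearly collinear column and σ² = 1 (all numerical choices are illustrative assumptions); as λ_min is driven toward zero, H[β|y] grows.

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.standard_normal((30, 3))
    X[:, 2] = X[:, 1] + 0.01 * rng.standard_normal(30)  # near-collinear column
    sigma2 = 1.0

    k = X.shape[1]
    lam = np.linalg.eigvalsh(X.T @ X)                   # eigenvalues lambda_j
    H = (k / 2) * np.log(2 * np.pi * np.e) - 0.5 * np.sum(np.log(lam / sigma2))
    print("H[beta|y] =", H)   # large when collinearity makes lambda_min small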
4.4.2. SHANNON'S INFORMATION MEASURE, INFORMATION NUMBER, AND INFORMATION INDEX

Def. 4.4.2.1 SHANNON'S INFORMATION MEASURE: For most statistical problems, Shannon's information measure is defined by

I[β|y] = − H[β|y].    (12)

Thus, Shannon's information measure can be used to examine the effects of the eigenstructure of X on the information available about β. From (11) and (12), we have

I[β|y] = − (k/2) ln(2πe) + (1/2) Σ_{j=1}^{k} ln(λ_j/σ²).    (13)
Also, in view of (13), the information in the directions of the principal components may be defined as

I[η_jᵀβ|y] = − (1/2) ln(2πe) + (1/2) ln(λ_j/σ²),    (14)

where η_j is the eigenvector of (XᵀX) corresponding to λ_j. It follows from (13) and (14) that I[β|y] represents the sum of independent information quantities in the directions of the principal components; it may, therefore, be used as an overall measure of information about β.
Def. 4.4.2.2 SHANNON'S INFORMATION NUMBER: Shannon's information number is defined by

ζ(X) ≡ I[η₁ᵀβ|y] − I[η_kᵀβ|y] = ln[κ(X)],

where κ(X) denotes the condition number of X. Thus, Shannon's information number can also provide a useful multicollinearity diagnostic in multiple linear regression analysis problems based on the eigenstructure of the regressor matrix X.
Def. 4.4.2.3 SHANNON'S INFORMATION INDEX: Shannon's information index is defined by

ζ(X_j) ≡ δ_j = I[η₁ᵀβ|y] − I[η_jᵀβ|y],  j = 1, 2, …, k,

which, in view of (12) and the definition of the condition number κ(X) = (λ_max/λ_min)^(1/2), after simplification becomes

δ_j = ln(κ_j),  j = 1, 2, …, k,

where κ_j is the jth collinearity index or jth condition index of the jth column X_j of the matrix X.
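Combining the definitions above, Shannon's information number ζ(X) = ln κ(X) and the information indices δ_j = ln κ_j can be computed directly; the sketch below does so for a hypothetical design matrix with one nearly dependent column (all numerical choices are illustrative assumptions).

    import numpy as np

    rng = np.random.default_rng(6)
    n, k = 30, 4
    X = rng.standard_normal((n, k))
    X[:, 3] = X[:, 0] + 0.01 * rng.standard_normal(n)   # near-dependence

    Xn = X / np.linalg.norm(X, axis=0)                  # unit-length columns
    d = np.linalg.svd(Xn, compute_uv=False)
    print("zeta(X) = ln kappa(X) =", np.log(d.max() / d.min()))

    for j in range(k):
        X_mj = np.delete(X, j, axis=1)                  # X without column j
        P_mj = X_mj @ np.linalg.inv(X_mj.T @ X_mj) @ X_mj.T
        e_j = X[:, j] - P_mj @ X[:, j]                  # residual of X_j on the rest
        kappa_j = np.linalg.norm(X[:, j]) / np.linalg.norm(e_j)
        print(f"delta_{j + 1} = ln(kappa_{j + 1}) = {np.log(kappa_j):.3f}")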
5. APPLICATIONS: In this section, we use the famous cement data of Woods et al. (1932), cf. Chatterjee and Hadi (1988, p. 36), to illustrate the various measures of multicollinearity diagnostics in multiple linear regression analysis problems discussed above. The computations have been performed using S-Plus by fitting the no-intercept multiple linear regression model. We have considered three cases: (i) the full model; (ii) the model when the third row is omitted; and (iii) the model when the tenth row is omitted.
TABLE 4.1 THE CEMENT DATA

Row   X1   X2   X3   X4   X5
 1     6    7   26   60   2.5
 2    15    1   29   52   2.3
 3     8   11   56   20   5.0
 4     8   11   31   47   2.4
 5     6    7   52   33   2.4
 6     9   11   55   22   2.4
 7    17    3   71    6   2.1
 8    22    1   31   44   2.2
 9    18    2   54   22   2.3
10     4   21   47   26   2.5
11    23    1   40   34   2.2
12     9   11   66   12   2.6
13     8   10   68   12   2.4
14    18    1   17   61   2.1
TABLE 4.2 CORRELATION FOR DATA IN: CEMENT

Variable    X1       X2       X3       X4       X5
X1         1.000
X2        -0.837    1.000
X3        -0.245    0.333    1.000
X4         0.145   -0.343   -0.978    1.000
X5        -0.352    0.342    0.214   -0.223    1.000
TABLE 4.3 THE CONDITION NUMBER AND COLLINEARITY INDICES FOR DATA IN: CEMENT

Model                 Condition Number κ(X)   Condition Indices κ_j
                                              κ_1    κ_2    κ_3    κ_4    κ_5
Full Data                      88             3.3    2.7    4.2    2.6    4.0
Third Row Omitted             722             3.8    4.2   15.8   12.0   29.2
Tenth Row Omitted              91             4.0    3.6    4.7    2.7    4.1
TABLE 4.4 THE INFORMATION INDICES FOR DATA IN: CEMENT

Model                 Shannon's Information   Shannon's Information Indices ζ(X_j) ≡ δ_j
                      Number ζ(X)             δ_1     δ_2     δ_3     δ_4     δ_5
Full Data                 4.477               1.194   0.993   1.435   0.955   1.386
Third Row Omitted         6.582               1.335   1.435   2.760   2.485   3.374
Tenth Row Omitted         4.511               1.386   1.281   1.547   0.993   1.411
It is clear from the above data analysis that the condition number κ(X), the condition indices κ_j, Shannon's information number ζ(X), and Shannon's information indices ζ(X_j) ≡ δ_j change very little when the tenth row is omitted from the full model. However, there are substantial changes in all of these quantities when the third row is omitted. Thus the third row is the only point that individually influences the multicollinearity structure of the regressor matrix X in our full model.
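The row-deletion computation behind Tables 4.3 and 4.4 can be sketched directly from the cement data of Table 4.1; the condition numbers below are computed with the columns scaled to unit length, so small differences from the S-Plus values reported above may reflect the exact normalization used there.

    import numpy as np

    # Cement data of Table 4.1, columns X1-X5.
    X = np.column_stack([
        [6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8, 18],
        [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10, 1],
        [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68, 17],
        [60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12, 61],
        [2.5, 2.3, 5.0, 2.4, 2.4, 2.4, 2.1, 2.2, 2.3, 2.5, 2.2, 2.6, 2.4, 2.1],
    ])

    def kappa(M):
        Mn = M / np.linalg.norm(M, axis=0)   # columns normalized to unit length
        d = np.linalg.svd(Mn, compute_uv=False)
        return d.max() / d.min()

    print(f"full data        : kappa(X) = {kappa(X):.1f}")
    print(f"third row omitted: kappa(X) = {kappa(np.delete(X, 2, axis=0)):.1f}")
    print(f"tenth row omitted: kappa(X) = {kappa(np.delete(X, 9, axis=0)):.1f}")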
6. CONCLUDING REMARKS
The purpose of this note was to review and examine the effects of the eigenstructure of the regressor matrix X on its condition number, collinearity indices, and Shannon's information indices, and to investigate these both analytically (when feasible) and numerically. The problems of multicollinearity in multiple linear regression models based on the condition number and collinearity indices of the regressor matrix X were discussed. The concept of Shannon's entropy and the information quantity as a measure of goodness of fit of the multiple linear regression model were also investigated. It is clear from our analysis that the condition number κ(X), the condition indices κ_j, Shannon's information number ζ(X), and Shannon's information indices ζ(X_j) ≡ δ_j can be used effectively to assess the influence of individual observations on the multicollinearity structure of the regressor matrix X in multiple linear regression analysis problems. These results will be useful in applications of Shannon's entropy to the statistical analysis and characterization of multiple linear regression problems, and in quantifying the information contributed by each observed point. For future work, one could develop inferential procedures for the problems of multicollinearity in multiple linear regression models based on the Kullback-Leibler information quantity and Akaike's information criterion (AIC), employing the eigenstructure of the regressor matrix X. One could also apply the indices discussed in this paper to measure the impact of multicollinearity in the analysis of polynomial, ridge, and Poisson regression models.
Acknowledgment: The author would like to thank the Editorial Committee of Polygon for accepting this paper for publication in Polygon. The author would like to thank Professor M. Ahsanullah, Rider University, New Jersey, USA, Professor B. M. Golam Kibria, FIU, Miami, USA, and Professor J. N. Singh, Barry University, Miami, USA, for volunteering to read the manuscript, and for their helpful suggestions, which improved the quality and presentation of the paper. Also, the author is thankful to his teacher, Professor Hassan Zahedi, FIU, Miami, USA, for introducing the concepts of Shannon’s entropy in probability and statistics. Further, the author would like to thank his wife for her patience and perseverance for the period during which this paper was prepared. Lastly, the author would like to dedicate this paper to his late parents.
REFERENCES

1. Acosta, G., Grana, M., and Pinasco, J. P. (2006), "Condition Numbers and Scale Free Graphs," Eur. Phys. J. B, 53, 381 – 385.
2. Belsley, D. A., Kuh, E., and Welsch, R. E. (2005), Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, John Wiley & Sons, New York.
3. Bottcher, A., and Grudsky, S. M. (1998), "On the Condition Numbers of Large Semidefinite Toeplitz Matrices," Linear Algebra and its Applications, 279, 285 – 301.
4. Chatterjee, S., and Hadi, A. S. (1988), Sensitivity Analysis in Linear Regression, John Wiley & Sons, New York.
5. Datta, B. N. (1995), Numerical Linear Algebra and Applications, Brooks/Cole Publishing Company, California.
6. Demmel, J. W. (1987), "On Condition Numbers and the Distance to the Nearest Ill-Posed Problem," Numer. Math., 51, 251 – 289.
7. Draper, N. R., and Smith, H. (1998), Applied Regression Analysis, 3rd Edition, John Wiley & Sons, New York.
8. Ebrahimi, N., Soofi, E. S., and Zahedi, H. (2004), "Information Properties of Order Statistics and Spacings," IEEE Transactions on Information Theory, 50, 1, 177 – 183.
9. Edelman, A. (1988), "Eigenvalues and Condition Numbers of Random Matrices," SIAM J. Matrix Anal. Appl., 9, 543 – 560.
10. Ern, A., and Guermond, J.-L. (2006), "Evaluation of the Condition Number in Linear Systems Arising in Finite Element Approximations," Mathematical Modelling and Numerical Analysis, 40, 1, 29 – 48.
11. Gautschi, W. (1990), "How (un)stable are Vandermonde systems?" in Asymptotic and Computational Analysis, 193 – 210, Lecture Notes in Pure and Appl. Math., 124, Dekker, New York.
12. Geurts, A. J. (1982), "A Contribution to the Theory of Condition," Numer. Math., 39, 85 – 96.
13. Golub, G. H., and Van Loan, C. F. (1996), Matrix Computations, Third Edition, Johns Hopkins University Press, Baltimore.
14. Hargreaves, G. I. (2004), "Computing the Condition Number of Tridiagonal and Diagonal-Plus-Semiseparable Matrices in Linear Time," Numerical Analysis Report 447, Manchester Centre for Computational Mathematics, Manchester, U.K.
15. Higham, N. J. (2002), Accuracy and Stability of Numerical Algorithms, Second Edition, SIAM, Philadelphia.
16. Jaynes, E. T. (1957a), "Information Theory and Statistical Mechanics," Physical Review, 106, 620 – 630.
17. Jaynes, E. T. (1957b), "Information Theory and Statistical Mechanics. II," Physical Review, 108, 171 – 190.
18. Johnson, N. L., and Kotz, S. (1970), Distributions in Statistics: Continuous Univariate Distributions, Vols. I – II, John Wiley & Sons, New York.
19. Kapur, J. N. (1989), Maximum-Entropy Models in Science and Engineering, Wiley Eastern, New Delhi.
20. Kullback, S., and Leibler, R. A. (1951), "On Information and Sufficiency," Annals of Mathematical Statistics, 22, 79 – 86.
21. Lazo, A. V., and Rathie, P. (1978), "On the Entropy of Continuous Probability Distributions," IEEE Trans. Inform. Theory, 24, 120 – 122.
22. Leon, S. J. (2006), Linear Algebra with Applications, Prentice Hall, New Jersey.
23. Lindley, D. V. (1956), "On a Measure of Information Provided by an Experiment," Annals of Mathematical Statistics, 27, 986 – 1005.
24. Maasoumi, E. (1993), "A Compendium to Information Theory in Economics and Econometrics," Econometric Reviews, 12(2), 137 – 181.
25. Mazzuchi, T. A., Soofi, E. S., and Soyer, R. (2000), "Computation of Maximum Entropy Dirichlet for Modeling Lifetime Data," Computational Statistics & Data Analysis, 32, 361 – 378.
26. Neter, J., Kutner, M. H., Nachtsheim, C. J., and Wasserman, W. (1996), Applied Linear Statistical Models, 4th Edition, Richard D. Irwin, Inc., Burr Ridge, Illinois.
27. Rice, J. R. (1966), "A Theory of Condition," SIAM J. Numer. Anal., 3, 287 – 310.
28. Skeel, R. D. (1979), "Scaling for Numerical Stability in Gaussian Elimination," J. Assoc. Comput. Mach., 26, 494 – 526.
29. Shannon, C. E. (1948), "A Mathematical Theory of Communication," Bell System Tech. Journal, 27, 379 – 423, 623 – 656.
30. Soofi, E. S. (1990), "Effects of Collinearity on Information about Regression Coefficients," Journal of Econometrics, 43, 255 – 274.
31. Soofi, E. S. (1994), "Capturing the Intangible Concept of Information," Journal of the American Statistical Association, 89, 428, 1243 – 1254.
32. Soofi, E. S., Ebrahimi, N., and Habibullah, M. (1995), "Information Distinguishability with Application to Analysis of Failure Data," Journal of the American Statistical Association, 90, 430, 657 – 668.
33. Soofi, E. S., and Gokhale, D. V. (1991), "Minimum Discrimination Information Estimator of the Mean with Known Coefficient of Variation," Computational Statistics & Data Analysis, 11, 165 – 177.
34. Soofi, E. S., and Retzer, J. J. (2002), "Information Indices: Unification and Applications," Journal of Econometrics, 107, 17 – 40.
35. Stewart, G. W. (1987), "Collinearity and Least Squares Regression," Statistical Science, 2, 1, 68 – 84.
36. Suhir, E. (1997), Applied Probability for Engineers and Scientists, McGraw-Hill, New York.
37. Trefethen, L. N., and Bau, D., III (1997), Numerical Linear Algebra, SIAM, Philadelphia.
38. Trefethen, L. N., and Viswanath, D. (1998), "Condition Numbers of Random Triangular Matrices," SIAM J. Matrix Anal. Appl., 19, 565 – 581.
39. Turing, A. M. (1948), "Rounding-Off Errors in Matrix Processes," Quart. J. Mech. and Appl. Math., 1, 287 – 308.
40. Verdugo Lazo, A. C. G., and Rathie, P. N. (1978), "On the Entropy of Continuous Probability Distributions," IEEE Transactions on Information Theory, IT-24, 1, 120 – 122.
41. Woods, H., Steinour, H. H., and Starke, H. R. (1932), "Effect of Composition of Portland Cement on Heat Evolved during Hardening," Industrial and Engineering Chemistry, 24, 1207 – 1214.
42. Wyner, A. D., and Ziv, J. (1969), "On Communication of Analog Data from a Bounded Source Space," Bell System Technical Journal, 48, 3139 – 3172.
43. Xu, S., and Zhang, J. (2004), "A New Data Mining Approach to Predicting Matrix Condition Numbers," Communications in Information and Systems, 4, 4, 325 – 340.
44. Zahedi, H., and Shakil, M. (2006), "Properties of Entropies of Record Values in Reliability and Life Testing Context," Communications in Statistics – Theory and Methods, 35, 997 – 1010.