Introduction to Statistics: May 2014

Friday, May 30, 2014

Coefficient of Determination

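The coefficient of determination is the primary measure of the extent, or strength, of the association between two variables, X and Y. Statisticians interpret it as the proportion of the variation in Y that is explained by the regression line. The coefficient of determination is defined by

$$ r^2 = \frac{\text{explained variation}}{\text{total variation}} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} $$

where \hat{y}_i is the value of Y predicted by the regression line for the i-th observation and \bar{y} is the mean of the observed values of Y.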

Thus, we can conclude that the variation in the number of workers (the independent variable X) explains 97.76 percent of the variation in production at the Redwood Falls plant (the dependent variable Y).
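The idea of "variation in Y explained by the regression line" can be sketched in a few lines of Python. This is a minimal sketch, not the calculation behind the 97.76 percent figure: the function name r_squared and the workers/output numbers are hypothetical, since the plant data are not reproduced here. It fits a least-squares line and returns the ratio of the explained sum of squares to the total sum of squares.

```python
def r_squared(x, y):
    """Return explained variation / total variation for the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Least-squares slope and intercept of the regression line y = a + b*x
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Explained (regression) and total sums of squares
    y_hat = [intercept + slope * xi for xi in x]
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return ssr / sst


# Hypothetical data for illustration only (NOT the Redwood Falls figures)
workers = [2, 4, 6, 8, 10]
output = [11, 19, 32, 41, 48]
print(round(r_squared(workers, output), 4))
```

A value close to 1 means the line accounts for nearly all of the variation in Y; a value close to 0 means it accounts for almost none.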

Measures of Variability

The mean alone does not provide a complete or sufficient description of data. In this section we present descriptive numbers that measure the variability, or spread, of the observations about the mean. In particular we include
(i)   Range
(ii)   Interquartile range
(iii)   Variance
(iv)   Standard deviation and
(v)   Coefficient of variation
No two things are exactly alike. This is one of the basic principles of statistical quality control. Variation exists in all areas: the weather varies greatly from day to day, and even from hour to hour; grades on a test differ for students taking the same course with the same instructor; and a person’s blood pressure, pulse, cholesterol level, and caloric intake vary daily.

While two data sets could have the same mean, the individual observations in one set could vary more from the mean than do the observations in the second set. Consider the following two sets of sample data:

Sample A	1	2	1	36
Sample B	8	9	10	13

Although the mean is 10 for both samples, the data in sample A are clearly further from 10 than are the data in sample B. We need descriptive numbers to measure this spread.
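The difference between the two samples can be checked numerically. The sketch below uses Python's standard statistics module; the helper name describe is hypothetical, and it assumes the sample (n - 1) form of the variance because the data are described as sample data.

```python
import statistics

sample_a = [1, 2, 1, 36]
sample_b = [8, 9, 10, 13]


def describe(data):
    """Return (mean, sample variance, sample standard deviation, CV in percent)."""
    mean = statistics.mean(data)
    variance = statistics.variance(data)  # divides by n - 1 (assumed sample form)
    std_dev = statistics.stdev(data)      # square root of the sample variance
    cv = std_dev / mean * 100             # coefficient of variation, in percent
    return mean, variance, std_dev, cv


for name, data in (("A", sample_a), ("B", sample_b)):
    mean, variance, std_dev, cv = describe(data)
    print(f"Sample {name}: mean={mean}, variance={variance:.2f}, "
          f"sd={std_dev:.2f}, CV={cv:.1f}%")
```

Both samples report a mean of 10, but every measure of spread is far larger for sample A.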

Range

Range is the difference between the largest and smallest observations. The greater the spread of the data from the center of the distribution, the larger the range will be. Because the range takes into account only the largest and smallest observations, it is susceptible to considerable distortion if there is an unusual extreme observation. Thus, although the range measures the total spread of the data, it may be an unsatisfactory measure of variability because outliers (either very high or very low observations) influence it. One way to avoid this difficulty is to arrange the data in ascending or descending order, discard a few of the highest and a few of the lowest numbers, and find the range of those remaining.
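As a rough illustration of both ideas, the sketch below computes the ordinary range and a range taken after discarding observations from each end. The function names and the parameter k (how many observations to drop from each end) are hypothetical choices for this example.

```python
def data_range(data):
    """Largest observation minus smallest observation."""
    return max(data) - min(data)


def trimmed_range(data, k=1):
    """Range after discarding the k smallest and k largest observations."""
    if 2 * k >= len(data):
        raise ValueError("k discards every observation")
    kept = sorted(data)[k:len(data) - k]
    return max(kept) - min(kept)


sample_a = [1, 2, 1, 36]
print(data_range(sample_a))        # 35, dominated by the single value 36
print(trimmed_range(sample_a, 1))  # 1, after dropping one value from each end
```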

Interquartile Range

The interquartile range (IQR) measures the spread in the middle 50% of the data; it is the difference between the observation at Q3, the third quartile (or 75th percentile), and the observation at Q1, the first quartile (or 25th percentile). Thus

IQR = Q3 – Q1

where Q3 is located in the 0.75(n + 1)th position when the data are in increasing order and Q1 is located in the 0.25(n + 1)th position when the data are in increasing order.
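The position rule can be sketched as follows. The function names are hypothetical, and the code assumes linear interpolation between neighbouring observations when the position is not a whole number (for example, position 6.5 is taken halfway between the 6th and 7th ordered values).

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    n = len(ordered)
    position = min(max(position, 1), n)  # keep the position inside 1..n
    lower = int(position)                # whole part of the position
    frac = position - lower              # fractional part
    if lower >= n:
        return ordered[-1]
    # Interpolate between the lower-th and (lower + 1)-th ordered values
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])


def quartiles(data):
    """Return (Q1, Q3) using the 0.25(n + 1) and 0.75(n + 1) positions."""
    ordered = sorted(data)
    n = len(ordered)
    q1 = value_at_position(ordered, 0.25 * (n + 1))
    q3 = value_at_position(ordered, 0.75 * (n + 1))
    return q1, q3


sample_b = [8, 9, 10, 13]
q1, q3 = quartiles(sample_b)
print(q1, q3, q3 - q1)  # 8.25 12.25 4.0
```

For sample B (n = 4), Q1 sits at position 1.25 and Q3 at position 3.75, giving an IQR of 4.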

Five-Number Summary
The five-number summary refers to the five descriptive measures: minimum, first quartile, median, third quartile, and maximum. Clearly,

Minimum ≤ Q1 ≤ Median ≤ Q3 ≤ Maximum

Example: Waiting Times at Gilotti’s Grocery
Gilotti’s Grocery advertises that customers wait less than minutes to pay if they go through the Speedy Transaction Aisles. Figure 1 is a stem-and-leaf display for a sample of 25 waiting times (in seconds). Compute the five-number summary.

Figure 1: Waiting times at Gilotti’s Grocery

Frequency	Stem	Leaf
9	1	1	2	4	6	7	8	8	9	9
9	2	1	2	2	2	4	6	8	9	9
5	3	0	1	2	3	4
2	4	0	2

Solution: From the stem-and-leaf display we see that the minimum time is 11 seconds and the maximum time is 42 seconds. The first quartile, Q1, is located in the 0.25(25 + 1)th = 6.5th ordered position; the 6th and 7th ordered observations are both 18, so Q1 = 18 seconds. The third quartile, Q3, is located in the 0.75(25 + 1)th = 19.5th ordered position; it lies halfway between the 19th observation (30) and the 20th observation (31), so Q3 = 30.5 seconds. The median is the 0.5(25 + 1)th = 13th ordered observation, which is 22 seconds. The five-number summary is therefore 11, 18, 22, 30.5, 42. The range is 42 – 11 = 31 seconds and the interquartile range is 30.5 – 18 = 12.5 seconds; that is, the middle 50% of the data have a spread of only 12.5 seconds.
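The hand calculation can be cross-checked with Python's standard library. In this sketch, statistics.quantiles with method="exclusive" is used because it places the quartiles at the p(n + 1) positions and interpolates between neighbours, which matches the rule used here; the dictionary layout for the stem-and-leaf display is just one convenient way to enter the data.

```python
import statistics

# Figure 1 as stem -> leaves (stem = tens digit, leaf = units digit)
stem_and_leaf = {
    1: [1, 2, 4, 6, 7, 8, 8, 9, 9],
    2: [1, 2, 2, 2, 4, 6, 8, 9, 9],
    3: [0, 1, 2, 3, 4],
    4: [0, 2],
}
times = sorted(
    10 * stem + leaf
    for stem, leaves in stem_and_leaf.items()
    for leaf in leaves
)

# Quartiles at the 0.25(n + 1), 0.5(n + 1) and 0.75(n + 1) positions
q1, median, q3 = statistics.quantiles(times, n=4, method="exclusive")

five_number_summary = (min(times), q1, median, q3, max(times))
print(len(times))                        # 25
print(five_number_summary)               # (11, 18.0, 22.0, 30.5, 42)
print(max(times) - min(times), q3 - q1)  # 31 12.5
```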

Calculations of Quartiles from Grouped Observations

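Quartiles:

$$ Q_i = l + \frac{h}{f}\left(\frac{i\,n}{4} - c\right), \qquad i = 1, 2, 3 $$

where the quartile class is the class that contains the (in/4)th observation, and

l = lower limit of the quartile class,

h = width of the quartile class,

f = frequency of the quartile class,

c = cumulative frequency of the class preceding the quartile class,

n = total frequency,

i = 1, 2, 3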

For example, consider the maximum number of days to maturity for the first 25% and the first 75% of 40 short-term investments, that is, Q1 and Q3 of the grouped data in Table 2.

Table 2: Days to maturity of 40 short-term investments.

Class interval	Frequency	Cumulative frequency
30—39	3	3
40—49	1	4
50—59	8	12
60—69	10	22
70—79	7	29
80—89	7	36
90—99	4	40
Total	40
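A minimal sketch of the grouped-data calculation for Table 2 follows. It assumes the usual interpolation form Q_i = l + (h/f)(in/4 - c) built from the symbols defined above, takes l as the stated lower class limit (30, 40, ...), and takes h = 10 as the distance between consecutive lower limits. Using class boundaries such as 49.5 instead would lower each result by 0.5. The function name grouped_quartile is hypothetical.

```python
# (lower class limit, frequency) for each class in Table 2
classes = [(30, 3), (40, 1), (50, 8), (60, 10), (70, 7), (80, 7), (90, 4)]
CLASS_WIDTH = 10  # h: assumed equal for every class


def grouped_quartile(classes, width, i):
    """Return Q_i = l + (h / f) * (i * n / 4 - c) for grouped data."""
    n = sum(freq for _, freq in classes)  # total frequency
    target = i * n / 4                    # position of the i-th quartile
    cumulative = 0                        # c: cumulative frequency before the class
    for lower, freq in classes:
        if cumulative + freq >= target:   # this is the quartile class
            return lower + (width / freq) * (target - cumulative)
        cumulative += freq
    raise ValueError("i must be 1, 2, or 3")


for i in (1, 2, 3):
    print(f"Q{i} = {grouped_quartile(classes, CLASS_WIDTH, i):.2f}")
# Q1 = 57.50
# Q2 = 68.00
# Q3 = 81.43
```

Under these assumptions, the first 25% of the investments mature within about 57.5 days and the first 75% within about 81.4 days.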
