Statistics
13.1 Introduction
In Class IX, you have studied the classification of given data into ungrouped as well as grouped frequency distributions. You have also learnt to represent the data pictorially in the form of various graphs such as bar graphs, histograms (including those of varying widths) and frequency polygons. In fact, you went a step further by studying certain numerical representatives of the ungrouped data, also called measures of central tendency, namely, mean, median and mode. In this chapter, we shall extend the study of these three measures, i.e., mean, median and mode from ungrouped data to that of grouped data. We shall also discuss the concept of cumulative frequency, the cumulative frequency distribution and how to draw cumulative frequency curves, called ogives.
13.2 Mean of Grouped Data
The mean (or average) of observations, as we know, is the sum of the values of all the observations divided by the total number of observations. From Class IX, recall that if $x_{1}, x_{2}, \ldots, x_{n}$ are observations with respective frequencies $f_{1}, f_{2}, \ldots, f_{n}$, then this means observation $x_{1}$ occurs $f_{1}$ times, $x_{2}$ occurs $f_{2}$ times, and so on.
Now, the sum of the values of all the observations $=f_{1} x_{1}+f_{2} x_{2}+\ldots+f_{n} x_{n}$, and the number of observations $=f_{1}+f_{2}+\ldots+f_{n}$.
So, the mean $\bar{x}$ of the data is given by
$$ \bar{x}=\frac{f_{1} x_{1}+f_{2} x_{2}+\cdots+f_{n} x_{n}}{f_{1}+f_{2}+\cdots+f_{n}} $$
Recall that we can write this in short form by using the Greek letter $\Sigma$ (capital sigma) which means summation. That is,
$$ \bar{x}=\frac{\sum_{i=1}^{n} f_{i} x_{i}}{\sum_{i=1}^{n} f_{i}} $$
which, more briefly, is written as $\bar{x}=\frac{\Sigma f_{i} x_{i}}{\Sigma f_{i}}$, if it is understood that $i$ varies from 1 to $n$.
In most of our real life situations, data is usually so large that to make a meaningful study it needs to be condensed as grouped data. So, we need to convert given ungrouped data into grouped data and devise some method to find its mean.
Let us convert the ungrouped data of Example 1 into grouped data by forming class-intervals of width, say, 15. Remember that, while allocating frequencies to each class-interval, students falling on any upper class-limit are counted in the next class, e.g., the 4 students who obtained 40 marks are counted in the class-interval 40-55 and not in 25-40. With this convention in mind, let us form a grouped frequency distribution table (see Table 13.2).
Table 13.2
Class interval | $10-25$ | $25-40$ | $40-55$ | $55-70$ | $70-85$ | $85-100$ |
---|---|---|---|---|---|---|
Number of students | 2 | 3 | 7 | 6 | 6 | 6 |
Now, for each class-interval, we require a point which would serve as the representative of the whole class. It is assumed that the frequency of each class-interval is centred around its mid-point. So the mid-point (or class mark) of each class can be chosen to represent the observations falling in the class. Recall that we find the mid-point of a class (or its class mark) by finding the average of its upper and lower limits. That is,
$$ \text { Class } \text { mark }=\frac{\text { Upper class limit }+ \text { Lower class limit }}{2} $$
With reference to Table 13.2, for the class $10-25$, the class mark is $\frac{10+25}{2}$, i.e., 17.5. Similarly, we can find the class marks of the remaining class intervals. We put them in Table 13.3. These class marks serve as our $x_{i}$'s. Now, in general, for the $i$th class interval, we have the frequency $f_{i}$ corresponding to the class mark $x_{i}$. We can now proceed to compute the mean in the same manner as in Example 1.
Table 13.3
Class interval | Number of students $\left(\boldsymbol{f}_{\boldsymbol{i}}\right)$ | Class mark $\left(\boldsymbol{x}_{\boldsymbol{i}}\right)$ | $\boldsymbol{f}_{\boldsymbol{i}} \boldsymbol{x_i}$ |
---|---|---|---|
$10-25$ | 2 | 17.5 | 35.0 |
$25-40$ | 3 | 32.5 | 97.5 |
$40-55$ | 7 | 47.5 | 332.5 |
$55-70$ | 6 | 62.5 | 375.0 |
$70-85$ | 6 | 77.5 | 465.0 |
$85-100$ | 6 | 92.5 | 555.0 |
Total | $\sum f_{i}=30$ | | $\sum f_{i} x_{i}=1860.0$ |
The sum of the values in the last column gives us $\Sigma f_{i} x_{i}$. So, the mean $\bar{x}$ of the given data is given by
$$ \bar{x}=\frac{\Sigma f_{i} x_{i}}{\Sigma f_{i}}=\frac{1860.0}{30}=62 $$
This new method of finding the mean is known as the Direct Method.
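The direct method can be sketched in a few lines of Python. This is only an illustrative sketch: the name `direct_mean` and the list `classes` are assumptions made for this example, and the data is that of Table 13.2.

```python
# Grouped data of Table 13.2 as (lower limit, upper limit, frequency).
# The names `classes` and `direct_mean` are chosen only for this sketch.
classes = [
    (10, 25, 2),
    (25, 40, 3),
    (40, 55, 7),
    (55, 70, 6),
    (70, 85, 6),
    (85, 100, 6),
]


def direct_mean(classes):
    """Mean of grouped data by the direct method: sum(f_i * x_i) / sum(f_i)."""
    total_f = 0
    total_fx = 0.0
    for lower, upper, f in classes:
        x = (lower + upper) / 2  # class mark of this class-interval
        total_f += f
        total_fx += f * x
    return total_fx / total_f


print(direct_mean(classes))  # 62.0
```

The loop reproduces the columns of Table 13.3 one row at a time: the class mark $x_{i}$, the product $f_{i} x_{i}$, and the two running totals.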
We observe that Tables 13.1 and 13.3 are using the same data and employing the same formula for the calculation of the mean but the results obtained are different. Can you think why this is so, and which one is more accurate? The difference in the two values is because of the mid-point assumption in Table 13.3, 59.3 being the exact mean, while 62 an approximate mean.
Sometimes when the numerical values of $x_{i}$ and $f_{i}$ are large, finding the product of $x_{i}$ and $f_{i}$ becomes tedious and time consuming. So, for such situations, let us think of a method of reducing these calculations.
We can do nothing with the $f_{i}$'s, but we can change each $x_{i}$ to a smaller number so that our calculations become easy. How do we do this? What about subtracting a fixed number from each of these $x_{i}$'s? Let us try this method.
The first step is to choose one among the $x_{i}$'s as the assumed mean, and denote it by $a$. Also, to further reduce our calculation work, we may take $a$ to be that $x_{i}$ which lies in the centre of $x_{1}, x_{2}, \ldots, x_{n}$. So, we can choose $a=47.5$ or $a=62.5$. Let us choose $a=47.5$.
The next step is to find the difference $d_{i}$ between each of the $x_{i}$'s and $a$, that is, the deviation of each $x_{i}$ from $a$.
i.e.,
$$ d_{i}=x_{i}-a=x_{i}-47.5 $$
The third step is to find the product of $d_{i}$ with the corresponding $f_{i}$, and take the sum of all the $f_{i} d_{i}$ ’s. The calculations are shown in Table 13.4.
Table 13.4
Class interval | Number of students $\left(\boldsymbol{f}_{\boldsymbol{i}}\right)$ | Class mark $\left(\boldsymbol{x}_{\boldsymbol{i}}\right)$ | $\boldsymbol{d_i}=\boldsymbol{x}_{\boldsymbol{i}}-\mathbf{4 7 . 5}$ | $\boldsymbol{f}_{\boldsymbol{i}} \boldsymbol{d_i}$ |
---|---|---|---|---|
$10-25$ | 2 | 17.5 | -30 | -60 |
$25-40$ | 3 | 32.5 | -15 | -45 |
$40-55$ | 7 | 47.5 | 0 | 0 |
$55-70$ | 6 | 62.5 | 15 | 90 |
$70-85$ | 6 | 77.5 | 30 | 180 |
$85-100$ | 6 | 92.5 | 45 | 270 |
Total | $\Sigma f_{i}=30$ | | | $\Sigma f_{i} d_{i}=435$ |
So, from Table 13.4, the mean of the deviations, $\bar{d}=\frac{\Sigma f_{i} d_{i}}{\Sigma f_{i}}$.
Now, let us find the relation between $\bar{d}$ and $\bar{x}$.
Since we subtracted $a$ from each $x_{i}$ to obtain $d_{i}$, we need to add $a$ to $\bar{d}$ in order to get the mean $\bar{x}$. This can be explained mathematically as:
$$ \begin{aligned} \text { Mean of deviations, } \quad\quad\quad\quad \bar{d} & =\frac{\Sigma f_{i} d_{i}}{\Sigma f_{i}} \\ \text { So, } \quad\quad\quad\quad \bar{d} & =\frac{\Sigma f_{i}\left(x_{i}-a\right)}{\Sigma f_{i}} \\ & =\frac{\Sigma f_{i} x_{i}}{\Sigma f_{i}}-\frac{\Sigma f_{i} a}{\Sigma f_{i}} \\ & =\bar{x}-a \frac{\Sigma f_{i}}{\Sigma f_{i}} \\ & =\bar{x}-a \\ \text { So, } \quad\quad\quad\quad \bar{x} & =a+\bar{d} \\ \text { i.e., } \quad\quad\quad\quad\bar{x} & =a+\frac{\Sigma f_{i} d_{i}}{\Sigma f_{i}} \end{aligned} $$
Substituting the values of $a, \Sigma f_{i} d_{i}$ and $\Sigma f_{i}$ from Table 13.4, we get
$$ \bar{x}=47.5+\frac{435}{30}=47.5+14.5=62 . $$
Therefore, the mean of the marks obtained by the students is 62.
The method discussed above is called the Assumed Mean Method.
Activity 1 : From Table 13.3, find the mean by taking each of the $x_{i}$'s (i.e., 17.5, 32.5, and so on) as $a$. What do you observe? You will find that the mean determined in each case is the same, i.e., 62. (Why?)
So, we can say that the value of the mean obtained does not depend on the choice of $a$.
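This observation can be checked numerically. The sketch below applies the assumed mean method to the data of Table 13.3, taking each class mark in turn as $a$. The function name `assumed_mean` and the variable names are assumptions made only for this illustration.

```python
# Frequencies and class marks of Table 13.3.
freqs = [2, 3, 7, 6, 6, 6]
marks = [17.5, 32.5, 47.5, 62.5, 77.5, 92.5]


def assumed_mean(freqs, marks, a):
    """Mean by the assumed mean method: a + sum(f_i * d_i) / sum(f_i)."""
    total_f = sum(freqs)
    # d_i = x_i - a is the deviation of each class mark from the assumed mean
    total_fd = sum(f * (x - a) for f, x in zip(freqs, marks))
    return a + total_fd / total_f


# Try every class mark as the assumed mean.
results = [assumed_mean(freqs, marks, a) for a in marks]
print(results)  # [62.0, 62.0, 62.0, 62.0, 62.0, 62.0]
```

Whatever value of $a$ is subtracted from the $x_{i}$'s is added back at the end, which is why every choice gives the same mean.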
Observe that in Table 13.4, the values in Column 4 are all multiples of 15. So, if we divide all the values in Column 4 by 15, we get smaller numbers to multiply with the $f_{i}$'s. (Here, 15 is the class size of each class interval.)
So, let $u_{i}=\frac{x_{i}-a}{h}$, where $a$ is the assumed mean and $h$ is the class size.
Now, we calculate $u_{i}$ in this way and continue as before (i.e., find $f_{i} u_{i}$ and then $\Sigma f_{i} u_{i}$). Taking $h=15$, let us form Table 13.5.
Table 13.5
Class interval | $\boldsymbol{f}_{\boldsymbol{i}}$ | $\boldsymbol{x}_{\boldsymbol{i}}$ | $\boldsymbol{d_i}=\boldsymbol{x}_{\boldsymbol{i}}-\boldsymbol{a}$ | $\boldsymbol{u_i}=\frac{\boldsymbol{x}_{\boldsymbol{i}}-\boldsymbol{a}}{\boldsymbol{h}}$ | $\boldsymbol{f}_{\boldsymbol{i}} \boldsymbol{u_i}$ |
---|---|---|---|---|---|
$10-25$ | 2 | 17.5 | -30 | -2 | -4 |
$25-40$ | 3 | 32.5 | -15 | -1 | -3 |
$40-55$ | 7 | 47.5 | 0 | 0 | 0 |
$55-70$ | 6 | 62.5 | 15 | 1 | 6 |
$70-85$ | 6 | 77.5 | 30 | 2 | 12 |
$85-100$ | 6 | 92.5 | 45 | 3 | 18 |
Total | $\Sigma f_{i}=30$ | | | | $\Sigma f_{i} u_{i}=29$ |
Let
$$ \bar{u}=\frac{\Sigma f_{i} u_{i}}{\Sigma f_{i}} $$
Here, again let us find the relation between $\bar{u}$ and $\bar{x}$.
We have,
$$ u_{i}=\frac{x_{i}-a}{h} $$
Therefore,
$$ \begin{aligned} \bar{u} & =\frac{\Sigma f_{i} \frac{\left(x_{i}-a\right)}{h}}{\Sigma f_{i}}=\frac{1}{h}\left[\frac{\Sigma f_{i} x_{i}-a \Sigma f_{i}}{\Sigma f_{i}}\right] \\ \\ & =\frac{1}{h}\left[\frac{\Sigma f_{i} x_{i}}{\Sigma f_{i}}-a \frac{\Sigma f_{i}}{\Sigma f_{i}}\right] \\ \\ & =\frac{1}{h}[\bar{x}-a] \end{aligned} $$
So,
$$ \begin{aligned} h \bar{u} & =\bar{x}-a \\ \end{aligned} $$
i.e.,
$$\bar{x} =a+h \bar{u}$$
So,
$$ \bar{x}=a+h\left(\frac{\Sigma f_{i} u_{i}}{\Sigma f_{i}}\right) $$
Now, substituting the values of $a, h, \Sigma f_{i} u_{i}$ and $\Sigma f_{i}$ from Table 13.5, we get
$$ \begin{aligned} \bar{x} & =47.5+15 \times\left(\frac{29}{30}\right) \\ \\ & =47.5+14.5=62 \end{aligned} $$
So, the mean of the marks obtained by the students is 62.
The method discussed above is called the Step-deviation method.
We note that :
- The step-deviation method is convenient to apply if all the $d_{i}$'s have a common factor.
- The mean obtained by all the three methods is the same.
- The assumed mean method and step-deviation method are just simplified forms of the direct method.
- The formula $\bar{x}=a+h \bar{u}$ still holds if $a$ and $h$ are not as given above, but are any non-zero numbers such that $u_{i}=\frac{x_{i}-a}{h}$.
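The step-deviation method, and the last remark above, can be illustrated with a short Python sketch. The function name `step_deviation_mean` and the second pair of values ($a=50$, $h=10$) are assumptions chosen only for this example. The data is that of Table 13.5.

```python
# Frequencies and class marks of Table 13.5.
freqs = [2, 3, 7, 6, 6, 6]
marks = [17.5, 32.5, 47.5, 62.5, 77.5, 92.5]


def step_deviation_mean(freqs, marks, a, h):
    """Mean by the step-deviation method: a + h * sum(f_i * u_i) / sum(f_i)."""
    if h == 0:
        raise ValueError("h must be non-zero")
    total_f = sum(freqs)
    # u_i = (x_i - a) / h is the deviation measured in units of h
    total_fu = sum(f * ((x - a) / h) for f, x in zip(freqs, marks))
    return a + h * total_fu / total_f


# The choice used in Table 13.5: a = 47.5 and h = 15 (the class size).
print(step_deviation_mean(freqs, marks, 47.5, 15))  # 62.0

# A different, arbitrary choice of a and h gives the same mean
# (compared with a small tolerance because of floating-point rounding).
other = step_deviation_mean(freqs, marks, 50, 10)
print(abs(other - 62) < 1e-9)  # True
```

With $a=47.5$ and $h=15$ the $u_{i}$'s are the small whole numbers of Table 13.5, which is what makes the method convenient by hand. With other values the formula still holds, but the $u_{i}$'s are no longer simple.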
Activity 2 :
Divide the students of your class into three groups and ask each group to do one of the following activities.
1. Collect the marks obtained by all the students of your class in Mathematics in the latest examination conducted by your school. Form a grouped frequency distribution of the data obtained.
2. Collect the daily maximum temperatures recorded for a period of 30 days in your city. Present this data as a grouped frequency table.
3. Measure the heights of all the students of your class (in cm) and form a grouped frequency distribution table of this data.
After all the groups have collected the data and formed grouped frequency distribution tables, the groups should find the mean in each case by the method which they find appropriate.
13.3 Mode of Grouped Data
Recall from Class IX, a mode is that value among the observations which occurs most often, that is, the value of the observation having the maximum frequency. Further, we discussed finding the mode of ungrouped data. Here, we shall discuss ways of obtaining a mode of grouped data. It is possible that more than one value may have the same maximum frequency. In such situations, the data is said to be multimodal. Though grouped data can also be multimodal, we shall restrict ourselves to problems having a single mode only.
13.4 Median of Grouped Data
As you have studied in Class IX, the median is a measure of central tendency which gives the value of the middle-most observation in the data. Recall that for finding the median of ungrouped data, we first arrange the data values of the observations in ascending order. Then, if $n$ is odd, the median is the $\left(\frac{n+1}{2}\right)$ th observation. And, if $n$ is even, then the median will be the average of the $\frac{n}{2}$ th and the $\left(\frac{n}{2}+1\right)$ th observations.
Suppose, we have to find the median of the following data, which gives the marks, out of 50, obtained by 100 students in a test :
Marks obtained | 20 | 29 | 28 | 33 | 42 | 38 | 43 | 25 |
---|---|---|---|---|---|---|---|---|
Number of students | 6 | 28 | 24 | 15 | 2 | 4 | 1 | 20 |
First, we arrange the marks in ascending order and prepare a frequency table as follows :
Table 13.9
Marks obtained | Number of students (Frequency) |
---|---|
20 | 6 |
25 | 20 |
28 | 24 |
29 | 28 |
33 | 15 |
38 | 4 |
42 | 2 |
43 | 1 |
Total | $\mathbf{1 0 0}$ |
Here $n=100$, which is even. The median will be the average of the $\frac{n}{2}$ th and the $\left(\frac{n}{2}+1\right)$ th observations, i.e., the 50th and 51st observations. To find these observations, we proceed as follows:
Table 13.10
Marks obtained | Number of students |
---|---|
20 | 6 |
upto 25 | $6+20=26$ |
upto 28 | $26+24=50$ |
upto 29 | $50+28=78$ |
upto 33 | $78+15=93$ |
upto 38 | $93+4=97$ |
upto 42 | $97+2=99$ |
upto 43 | $99+1=100$ |
Now we add another column showing this information to the frequency table above and call it the cumulative frequency column.
Table 13.11
Marks obtained | Number of students | Cumulative frequency |
---|---|---|
20 | 6 | 6 |
25 | 20 | 26 |
28 | 24 | 50 |
29 | 28 | 78 |
33 | 15 | 93 |
38 | 4 | 97 |
42 | 2 | 99 |
43 | 1 | 100 |
From the table above, we see that:
50th observation is 28 (Why?)
51st observation is 29
So, $\quad$ Median $=\frac{28+29}{2}=28.5$
Remark : The part of Table 13.11 consisting of Column 1 and Column 3 is known as the Cumulative Frequency Table. The median marks, 28.5, convey the information that about 50% of the students obtained marks less than 28.5 and the other 50% obtained marks more than 28.5.
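The way the 50th and 51st observations are read off the cumulative frequency column can be sketched in Python as follows. The helper name `kth_observation` and the variable names are assumptions made for this illustration. The data is that of Table 13.9.

```python
from itertools import accumulate

# Marks in ascending order with their frequencies (Table 13.9).
marks = [20, 25, 28, 29, 33, 38, 42, 43]
freqs = [6, 20, 24, 28, 15, 4, 2, 1]

# Running totals of the frequencies: the cumulative frequency column.
cumulative = list(accumulate(freqs))
print(cumulative)  # [6, 26, 50, 78, 93, 97, 99, 100]


def kth_observation(k):
    """Value of the k-th observation (counting from 1) in ascending order."""
    for value, cf in zip(marks, cumulative):
        if k <= cf:  # first value whose cumulative frequency reaches k
            return value
    raise ValueError("k exceeds the number of observations")


n = cumulative[-1]
if n % 2 == 1:
    median = kth_observation((n + 1) // 2)
else:
    median = (kth_observation(n // 2) + kth_observation(n // 2 + 1)) / 2
print(median)  # 28.5
```

The cumulative frequency 50 against the mark 28 means that observations 27 to 50 are all equal to 28, so the 50th observation is 28. The 51st observation falls in the next group and is 29.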
Now, let us see how to obtain the median of grouped data, through the following situation.
Consider a grouped frequency distribution of marks obtained, out of 100, by 53 students, in a certain examination, as follows:
Table 13.12
Marks | Number of students |
---|---|
$0-10$ | 5 |
$10-20$ | 3 |
$20-30$ | 4 |
$30-40$ | 3 |
$40-50$ | 3 |
$50-60$ | 4 |
$60-70$ | 7 |
$70-80$ | 9 |
$80-90$ | 7 |
$90-100$ | 8 |
From the table above, try to answer the following questions:
How many students have scored marks less than 10? The answer is clearly 5.
How many students have scored less than 20 marks? Observe that the number of students who have scored less than 20 includes the students who have scored marks from $0-10$ as well as the students who have scored marks from $10-20$. So, the total number of students with marks less than 20 is $5+3$, i.e., 8. We say that the cumulative frequency of the class $10-20$ is 8.
Similarly, we can compute the cumulative frequencies of the other classes, i.e., the number of students with marks less than 30, less than 40, $\ldots$, less than 100. They are given in Table 13.13 below:
Table 13.13
Marks obtained | Number of students (Cumulative frequency) |
---|---|
Less than 10 | 5 |
Less than 20 | $5+3=8$ |
Less than 30 | $8+4=12$ |
Less than 40 | $12+3=15$ |
Less than 50 | $15+3=18$ |
Less than 60 | $18+4=22$ |
Less than 70 | $22+7=29$ |
Less than 80 | $29+9=38$ |
Less than 90 | $38+7=45$ |
Less than 100 | $45+8=53$ |
The distribution given above is called the cumulative frequency distribution of the less than type. Here $10, 20, 30, \ldots, 100$ are the upper limits of the respective class intervals.
We can similarly make the table for the number of students with scores more than or equal to 0, more than or equal to 10, more than or equal to 20, and so on. From Table 13.12, we observe that all 53 students have scored marks more than or equal to 0. Since there are 5 students scoring marks in the interval $0-10$, this means that there are $53-5=48$ students getting more than or equal to 10 marks. Continuing in the same manner, we get the number of students scoring 20 or above as $48-3=45$, 30 or above as $45-4=41$, and so on, as shown in Table 13.14.
Table 13.14
Marks obtained | Number of students (Cumulative frequency) |
---|---|
More than or equal to 0 | 53 |
More than or equal to 10 | $53-5=48$ |
More than or equal to 20 | $48-3=45$ |
More than or equal to 30 | $45-4=41$ |
More than or equal to 40 | $41-3=38$ |
More than or equal to 50 | $38-3=35$ |
More than or equal to 60 | $35-4=31$ |
More than or equal to 70 | $31-7=24$ |
More than or equal to 80 | $24-9=15$ |
More than or equal to 90 | $15-7=8$ |
The table above is called a cumulative frequency distribution of the more than type. Here $0,10,20, \ldots, 90$ give the lower limits of the respective class intervals.
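Both types of cumulative frequency distribution can be computed from the class frequencies, as in the following Python sketch. The variable names are assumptions made for this illustration. The data is that of Table 13.12.

```python
from itertools import accumulate

# Class limits 0, 10, ..., 100 and class frequencies of Table 13.12.
limits = list(range(0, 101, 10))
freqs = [5, 3, 4, 3, 3, 4, 7, 9, 7, 8]

# Less than type: running total of frequencies, listed against upper limits.
less_than = list(accumulate(freqs))
for upper, cf in zip(limits[1:], less_than):
    print(f"Less than {upper}: {cf}")

# More than type: total minus the frequencies of all the earlier classes,
# listed against lower limits.
n = less_than[-1]
more_than = [n - cf for cf in [0] + less_than[:-1]]
for lower, cf in zip(limits[:-1], more_than):
    print(f"More than or equal to {lower}: {cf}")
```

The two lists reproduce the second columns of Tables 13.13 and 13.14. For every class, the less than value at its lower limit and the more than value at the same limit add up to $n=53$.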
Now, to find the median of grouped data, we can make use of any of these cumulative frequency distributions.
Let us combine Tables 13.12 and 13.13 to get Table 13.15 given below:
Table 13.15
Marks | Number of students $(\boldsymbol{f})$ | Cumulative frequency $(\mathbf{c f})$ |
---|---|---|
$0-10$ | 5 | 5 |
$10-20$ | 3 | 8 |
$20-30$ | 4 | 12 |
$30-40$ | 3 | 15 |
$40-50$ | 3 | 18 |
$50-60$ | 4 | 22 |
$60-70$ | 7 | 29 |
$70-80$ | 9 | 38 |
$80-90$ | 7 | 45 |
$90-100$ | 8 | 53 |
Now, in grouped data, we may not be able to find the middle observation by looking at the cumulative frequencies, as the middle observation will be some value in a class interval. It is, therefore, necessary to find the value inside a class that divides the whole distribution into two halves. But which class should this be?
To find this class, we find the cumulative frequencies of all the classes and $\frac{n}{2}$. We now locate the class whose cumulative frequency is greater than (and nearest to) $\frac{n}{2}$. This is called the median class. In the distribution above, $n=53$. So, $\frac{n}{2}=26.5$. Now $60-70$ is the class whose cumulative frequency 29 is greater than (and nearest to) $\frac{n}{2}$, i.e., 26.5.
Therefore, $60-70$ is the median class.
After finding the median class, we use the following formula for calculating the median.
$$ \text { Median }=l+\left(\frac{\frac{n}{2}-\mathrm{cf}}{f}\right) \times h $$
where
$$ \begin{aligned} l & =\text { lower limit of median class, } \\ n & =\text { number of observations, } \\ \mathrm{cf} & =\text { cumulative frequency of class preceding the median class, } \\ f & =\text { frequency of median class, } \\ h & =\text { class size (assuming class size to be equal). } \end{aligned} $$
Substituting the values $\frac{n}{2}=26.5, l=60, \mathrm{cf}=22, f=7, h=10$ in the formula above, we get
$$ \begin{aligned} \text { Median } & =60+\left(\frac{26.5-22}{7}\right) \times 10 \\ & =60+\frac{45}{7} \\ & =66.4 \end{aligned} $$
So, about half the students have scored marks less than 66.4, and the other half have scored marks more than 66.4.
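The two steps, locating the median class and applying the formula, can be put together in a short Python sketch. The function name `grouped_median` is an assumption made for this illustration, and the sketch assumes continuous classes of equal size $h$. The data is that of Table 13.15.

```python
# Lower class limits and frequencies of Table 13.15; class size h = 10.
lower_limits = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
freqs = [5, 3, 4, 3, 3, 4, 7, 9, 7, 8]


def grouped_median(lower_limits, freqs, h):
    """Median of grouped data: l + ((n/2 - cf) / f) * h."""
    n = sum(freqs)
    half = n / 2
    cf = 0  # cumulative frequency of the classes before the current one
    for l, f in zip(lower_limits, freqs):
        # The median class is the first class whose cumulative
        # frequency exceeds n/2.
        if cf + f > half:
            return l + (half - cf) / f * h
        cf += f
    raise ValueError("no median class found (are all frequencies zero?)")


print(round(grouped_median(lower_limits, freqs, 10), 1))  # 66.4
```

When the loop stops at the class $60-70$, the running total `cf` is 22, the cumulative frequency of the preceding class, which is exactly the value of $\mathrm{cf}$ required by the formula.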
The mean is the most frequently used measure of central tendency because it takes into account all the observations, and lies between the extremes, i.e., the largest and the smallest observations of the entire data. It also enables us to compare two or more distributions. For example, by comparing the average (mean) results of students of different schools of a particular examination, we can conclude which school has a better performance.
However, extreme values in the data affect the mean. For example, the mean of classes having frequencies more or less the same is a good representative of the data. But, if one class has frequency, say, 2, and the five others have frequencies 20, 25, 20, 21 and 18, then the mean will certainly not reflect the way the data behaves. So, in such cases, the mean is not a good representative of the data.
In problems where individual observations are not important, and we wish to find out a ’typical’ observation, the median is more appropriate, e.g., finding the typical productivity rate of workers, average wage in a country, etc. These are situations where extreme values may be there. So, rather than the mean, we take the median as a better measure of central tendency.
In situations which require establishing the most frequent value or most popular item, the mode is the best choice, e.g., to find the most popular T.V. programme being watched, the consumer item in greatest demand, the colour of the vehicle used by most of the people, etc.
Remarks:
1. There is an empirical relationship between the three measures of central tendency :
$$ 3 \text { Median }=\text { Mode }+2 \text { Mean } $$
2. The median of grouped data with unequal class sizes can also be calculated. However, we shall not discuss it here.
13.5 Summary
In this chapter, you have studied the following points:
1. The mean for grouped data can be found by :
(i) the direct method : $\bar{x}=\frac{\Sigma f_{i} x_{i}}{\Sigma f_{i}}$
(ii) the assumed mean method : $\bar{x}=a+\frac{\Sigma f_{i} d_{i}}{\Sigma f_{i}}$
(iii) the step deviation method : $\bar{x}=a+\left(\frac{\Sigma f_{i} u_{i}}{\Sigma f_{i}}\right) \times h$,
with the assumption that the frequency of a class is centred at its mid-point, called its class mark.
2. The mode for grouped data can be found by using the formula:
$$ \text { Mode }=l+\left(\frac{f_{1}-f_{0}}{2 f_{1}-f_{0}-f_{2}}\right) \times h $$
where symbols have their usual meanings.
3. The cumulative frequency of a class is the frequency obtained by adding the frequency of the given class to the frequencies of all the classes preceding it.
4. The median for grouped data is found by using the formula:
$$ \text { Median }=l+\left(\frac{\frac{n}{2}-\mathrm{cf}}{f}\right) \times h \text {, } $$
where symbols have their usual meanings.
A NOTE TO THE READER
For calculating the mode and median of grouped data, it should be ensured that the class intervals are continuous before applying the formulae. The same condition also applies to the construction of an ogive. Further, in the case of ogives, the scale may not be the same on both the axes.