Dispersion

Statistics is the branch of mathematics concerned with the collection, organisation, analysis, and interpretation of data. In the DSE compulsory syllabus, we focus on descriptive statistics: summarising a dataset through measures of central tendency and measures of dispersion. This page also covers grouped data techniques and graphical representations such as box-and-whisker plots. These tools are frequently combined with probability concepts in exam questions.

Measures of Central Tendency

A measure of central tendency identifies a single value that is representative of an entire dataset.

Mean

The mean (arithmetic average) of a dataset $\{x_1, x_2, \ldots, x_n\}$ is defined as:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The mean uses every data value, making it sensitive to outliers. It is the only measure of central tendency that lends itself to algebraic manipulation (e.g., combining datasets).

Examples
  • The scores of $5$ students are $72, 85, 90, 68, 80$. The mean is $\bar{x} = \frac{72+85+90+68+80}{5} = \frac{395}{5} = 79$.
  • If every score is increased by $5$ bonus marks, the new mean is $79 + 5 = 84$.

Median

The median is the middle value of an ordered dataset. For $n$ data values sorted in ascending order:

  • If $n$ is odd, the median is the value at position $\dfrac{n+1}{2}$.
  • If $n$ is even, the median is the average of the values at positions $\dfrac{n}{2}$ and $\dfrac{n}{2}+1$.

The median is robust to outliers since it depends only on the position of data points, not their magnitude.

Examples
  • Dataset: $\{3, 7, 1, 9, 5\}$. Sorted: $\{1, 3, 5, 7, 9\}$. Median $= 5$ (position $3$ of $5$).
  • Dataset: $\{2, 4, 6, 8, 10, 12\}$. Median $= \frac{6+8}{2} = 7$ (average of positions $3$ and $4$).
  • Salaries: $\{18000, 20000, 22000, 25000, 150000\}$. Median $= 22000$, which is far more representative than the mean of $47000$.

Mode

The mode is the value that occurs most frequently in a dataset. A dataset may be unimodal (one mode), bimodal (two modes), multimodal, or have no mode at all.

The mode is the only measure of central tendency applicable to nominal (categorical) data.

Examples
  • $\{4, 2, 7, 4, 3, 4, 8\}$: mode $= 4$ (appears $3$ times).
  • $\{5, 5, 8, 8, 10\}$: bimodal, modes are $5$ and $8$.
  • $\{1, 2, 3, 4, 5\}$: no mode.
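The examples above can be checked with Python's standard `statistics` module. This is a minimal sketch: the helper name `modes` and its rule of returning an empty list when no value repeats are assumptions chosen to match the "no mode" convention used on this page.

```python
from collections import Counter
from statistics import mean, median

def modes(data):
    """Return the most frequent value(s), or [] if no value repeats."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:  # every value occurs once: treated here as "no mode"
        return []
    return sorted(v for v, c in counts.items() if c == top)

scores = [72, 85, 90, 68, 80]
print(mean(scores))                   # 79
print(median([3, 7, 1, 9, 5]))        # 5
print(median([2, 4, 6, 8, 10, 12]))   # 7.0
print(modes([4, 2, 7, 4, 3, 4, 8]))   # [4]
print(modes([5, 5, 8, 8, 10]))        # [5, 8]
print(modes([1, 2, 3, 4, 5]))         # []
```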

Comparison of the Three Measures

| Measure | Uses all values | Affected by outliers | Applicable to categorical data | Unique value |
| --- | --- | --- | --- | --- |
| Mean | Yes | Yes | No | Yes |
| Median | No | No | No | Yes |
| Mode | No | No | Yes | No |

Measures of Dispersion

Measures of dispersion (spread) quantify how far individual data values deviate from the centre. Two datasets can share the same mean yet have very different spreads.

Range

$$\mathrm{Range} = \text{Maximum value} - \text{Minimum value}$$

The range is simple to compute but uses only two data points, making it highly sensitive to outliers.

Examples
  • $\{12, 15, 18, 22, 25\}$: range $= 25 - 12 = 13$.
  • $\{5, 10, 10, 10, 10, 100\}$: range $= 95$, heavily distorted by the single outlier.

Interquartile Range (IQR)

The quartiles divide an ordered dataset into four equal parts:

  • $Q_1$ (lower quartile): the median of the lower half.
  • $Q_2$ (median): the middle value.
  • $Q_3$ (upper quartile): the median of the upper half.

When $n$ is odd, the median itself belongs to neither half (as in the example below).

$$\mathrm{IQR} = Q_3 - Q_1$$

The IQR is resistant to outliers because it depends only on the middle $50\%$ of the data, ignoring the lowest $25\%$ and the highest $25\%$.

Examples
  • Dataset: $\{3, 5, 7, 8, 12, 14, 18, 20, 25\}$ ($n=9$, odd).
    • Lower half: $\{3, 5, 7, 8\}$, $Q_1 = \frac{5+7}{2} = 6$.
    • $Q_2 = 12$.
    • Upper half: $\{14, 18, 20, 25\}$, $Q_3 = \frac{18+20}{2} = 19$.
    • IQR $= 19 - 6 = 13$.
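The median-of-halves procedure can be sketched in Python as follows. The function name `quartiles` is hypothetical, and the code assumes the convention that the median is left out of both halves when $n$ is odd. Library routines such as `statistics.quantiles` use interpolation rules and can return slightly different quartiles.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) by the median-of-halves method.

    Assumption: when n is odd, the median is excluded from both halves.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return median(lower), median(xs), median(upper)

q1, q2, q3 = quartiles([3, 5, 7, 8, 12, 14, 18, 20, 25])
print(q1, q2, q3, q3 - q1)   # 6.0 12 19.0 13.0
```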

Variance

Variance measures the average squared deviation from the mean. There are two versions depending on whether the data represents the entire population or a sample drawn from a larger population.

Population variance (divides by $n$):

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Sample variance (divides by $n-1$):

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

An equivalent computational formula for the population variance is:

$$\sigma^2 = \frac{1}{n}\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2\right] = \frac{\sum x_i^2}{n} - \bar{x}^2$$

Why $n$ vs $n-1$? Dividing by $n-1$ (Bessel's correction) provides an unbiased estimator of the population variance when working with a sample. The sample mean $\bar{x}$ is the value that minimises the sum of squared deviations of the sample, so deviations measured from $\bar{x}$ are on average smaller than deviations measured from the true population mean $\mu$, and dividing by $n$ tends to underestimate the true spread. Dividing by $n-1$ compensates for this. In the DSE syllabus, unless the problem explicitly identifies the data as a sample, the population formula (dividing by $n$) is typically expected.

Examples
  • Dataset: $\{2, 4, 4, 4, 5, 5, 7, 9\}$ ($n=8$).
    • $\bar{x} = \frac{40}{8} = 5$.
    • $\sum(x_i - \bar{x})^2 = 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32$.
    • Population variance: $\sigma^2 = \frac{32}{8} = 4$.
    • Sample variance: $s^2 = \frac{32}{7} \approx 4.57$.
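As a hedged illustration, the same dataset can be run through Python's standard `statistics` module, where `pvariance` divides by $n$ and `variance` divides by $n-1$. The variable names are illustrative only; the last line applies the computational formula by hand.

```python
from statistics import pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

pop_var = pvariance(data)    # population variance, divides by n
samp_var = variance(data)    # sample variance, divides by n - 1

# Computational formula: (sum of squares) / n - (mean)^2
shortcut = sum(x * x for x in data) / n - (sum(data) / n) ** 2

print(pop_var, round(samp_var, 2), shortcut)   # 4 4.57 4.0
```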

Standard Deviation

The standard deviation is the positive square root of the variance, restoring the units to match the original data:

$$\sigma = \sqrt{\sigma^2}, \qquad s = \sqrt{s^2}$$

Since the standard deviation is in the same units as the data, it is more interpretable than the variance for comparing spread.

Examples
  • Following the previous example: $\sigma = \sqrt{4} = 2$, $s = \sqrt{\frac{32}{7}} \approx 2.14$.
  • Two machines produce rods of length $10$ cm. Machine A has $\sigma = 0.1$ cm, Machine B has $\sigma = 0.5$ cm. Machine A is more precise because its rod lengths vary less.

Grouped Data

When data is presented in a grouped frequency distribution, individual values are not available. We work with class intervals instead.

Key Definitions

  • Class boundaries: The endpoints of each class interval, with no gaps between consecutive classes. For example, if raw intervals are $10$--$19$ and $20$--$29$, the class boundaries are $9.5$--$19.5$ and $19.5$--$29.5$.
  • Class width: The difference between the upper and lower class boundaries.
  • Class mark (midpoint): $x_i = \dfrac{\text{lower boundary} + \text{upper boundary}}{2}$, used as the representative value for all data in the class.

Mean of Grouped Data

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$$

where $k$ is the number of classes, $f_i$ is the frequency of class $i$, and $x_i$ is the class mark.

Assumed Mean Method (Coding Method)

When class marks are equally spaced, let $h$ be the common class width and $A$ be the class mark of a convenient class (the assumed mean). Define $d_i = \dfrac{x_i - A}{h}$. Then:

$$\bar{x} = A + \frac{\sum_{i=1}^{k} f_i d_i}{\sum_{i=1}^{k} f_i} \times h$$

This method simplifies calculation by working with small integer values of $d_i$.

Examples
  • The following frequency distribution records the marks of $40$ students:

| Class interval | $f_i$ | Class mark $x_i$ | $d_i$ | $f_i d_i$ |
| --- | --- | --- | --- | --- |
| 30 -- 39 | 4 | 34.5 | $-2$ | $-8$ |
| 40 -- 49 | 8 | 44.5 | $-1$ | $-8$ |
| 50 -- 59 | 14 | 54.5 | $0$ | $0$ |
| 60 -- 69 | 10 | 64.5 | $1$ | $10$ |
| 70 -- 79 | 4 | 74.5 | $2$ | $8$ |

Here $A = 54.5$, $h = 10$.

$$\begin{aligned} \bar{x} &= 54.5 + \frac{-8-8+0+10+8}{40} \times 10 \\ &= 54.5 + \frac{2}{40} \times 10 \\ &= 54.5 + 0.5 = 55 \end{aligned}$$
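The coding calculation can be reproduced with a short Python sketch. The function name `coded_mean` and its parameter names are hypothetical; the data are the class marks and frequencies from the table above, and the direct formula is included as a cross-check.

```python
def coded_mean(class_marks, freqs, assumed_mean, width):
    """Mean of grouped data via the assumed mean (coding) method."""
    n = sum(freqs)
    coded = [(x - assumed_mean) / width for x in class_marks]   # d_i
    sum_fd = sum(f * d for f, d in zip(freqs, coded))           # sum of f_i d_i
    return assumed_mean + sum_fd / n * width

marks = [34.5, 44.5, 54.5, 64.5, 74.5]
freqs = [4, 8, 14, 10, 4]

coded = coded_mean(marks, freqs, assumed_mean=54.5, width=10)
direct = sum(f * x for f, x in zip(freqs, marks)) / sum(freqs)
print(coded, direct)   # 55.0 55.0
```

Both routes give the same mean because the coding step is only a linear change of variable.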

Variance of Grouped Data

For grouped data, the population variance is:

$$\sigma^2 = \frac{\sum_{i=1}^{k} f_i(x_i - \bar{x})^2}{\sum_{i=1}^{k} f_i}$$

Or equivalently:

$$\sigma^2 = \frac{1}{n}\left[\sum f_i x_i^2 - \frac{1}{n}\left(\sum f_i x_i\right)^2\right], \quad n = \sum f_i$$
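A minimal Python sketch of the second formula, applied to the marks distribution used in the assumed mean example (class marks $34.5$ to $74.5$, frequencies $4, 8, 14, 10, 4$). The function name `grouped_variance` is an illustrative choice, not part of any library.

```python
def grouped_variance(class_marks, freqs):
    """Population variance of grouped data from class marks and frequencies."""
    n = sum(freqs)
    sum_fx = sum(f * x for f, x in zip(freqs, class_marks))
    sum_fx2 = sum(f * x * x for f, x in zip(freqs, class_marks))
    return (sum_fx2 - sum_fx ** 2 / n) / n

marks = [34.5, 44.5, 54.5, 64.5, 74.5]
freqs = [4, 8, 14, 10, 4]

var = grouped_variance(marks, freqs)
print(round(var, 2), round(var ** 0.5, 2))   # 124.75 11.17
```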

Histogram Estimation

In a histogram, the area of each bar represents the frequency of the corresponding class. If class widths are unequal, the height of each bar is the frequency density:

$$\text{Frequency density} = \frac{\text{Frequency}}{\text{Class width}}$$

The median, quartiles, and other percentiles can be estimated from a cumulative frequency curve (ogive) by linear interpolation within the relevant class.
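The interpolation step can be sketched in Python. The function name `interpolated_median` is hypothetical, and the code assumes equal class widths and that class boundaries (not raw class limits) are supplied. The data are the apple weights from the wrap-up questions on this page.

```python
def interpolated_median(lower_boundaries, width, freqs):
    """Estimate the median of grouped data by linear interpolation.

    Assumes equal class widths and lower class boundaries (e.g. 139.5).
    """
    n = sum(freqs)
    target = n / 2            # position of the median on the cumulative scale
    cumulative = 0
    for lower, f in zip(lower_boundaries, freqs):
        if f > 0 and cumulative + f >= target:
            # fraction of the way through the median class
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("empty distribution")

boundaries = [99.5, 119.5, 139.5, 159.5, 179.5]
freqs = [6, 14, 20, 8, 2]
print(interpolated_median(boundaries, 20, freqs))   # 144.5
```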

Properties of Variance

Linear Transformation

For a dataset $X$ and constants $a, b$:

$$\mathrm{Var}(aX + b) = a^2 \,\mathrm{Var}(X)$$

Adding a constant $b$ shifts all values equally and does not affect spread. Multiplying by $a$ scales the standard deviation by $|a|$ and the variance by $a^2$.

For the mean: $\overline{aX+b} = a\bar{x} + b$.

Examples
  • If $\bar{x} = 50$ and $\sigma^2 = 16$, then for $Y = 3X - 4$: $\bar{y} = 3(50)-4 = 146$ and $\mathrm{Var}(Y) = 9 \times 16 = 144$.
  • Temperatures recorded in Celsius have mean $25$ and standard deviation $3$. In Fahrenheit ($F = 1.8C + 32$): mean $= 1.8(25)+32 = 77$, standard deviation $= 1.8 \times 3 = 5.4$.
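A quick numerical check of both rules, sketched in Python. The dataset is the one from the variance example on this page (mean $5$, variance $4$), and the constants $a = 3$, $b = -4$ are an illustrative choice.

```python
from statistics import mean, pvariance

def transform(data, a, b):
    """Apply y = a*x + b to every value."""
    return [a * x + b for x in data]

x = [2, 4, 4, 4, 5, 5, 7, 9]     # mean 5, population variance 4
y = transform(x, a=3, b=-4)

print(mean(y), 3 * mean(x) - 4)          # 11 11
print(pvariance(y), 9 * pvariance(x))    # 36 36
```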

Combined Variance

Given two datasets $X$ and $Y$ with sizes $n_1$ and $n_2$, means $\bar{x}_1$ and $\bar{x}_2$, and variances $\sigma_1^2$ and $\sigma_2^2$, the combined variance of the pooled dataset is:

$$\sigma_c^2 = \frac{n_1 \sigma_1^2 + n_2 \sigma_2^2 + n_1(\bar{x}_1 - \bar{x}_c)^2 + n_2(\bar{x}_2 - \bar{x}_c)^2}{n_1 + n_2}$$

where the combined mean is:

$$\bar{x}_c = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$

The additional terms $n_1(\bar{x}_1 - \bar{x}_c)^2$ and $n_2(\bar{x}_2 - \bar{x}_c)^2$ account for the between-group variation caused by the difference in means.

Examples
  • Group A: $n_1 = 6$, $\bar{x}_1 = 10$, $\sigma_1^2 = 4$.
  • Group B: $n_2 = 4$, $\bar{x}_2 = 20$, $\sigma_2^2 = 9$.

Combined mean: $\bar{x}_c = \frac{6(10)+4(20)}{10} = 14$.

$$\begin{aligned} \sigma_c^2 &= \frac{6(4) + 4(9) + 6(10-14)^2 + 4(20-14)^2}{10} \\ &= \frac{24 + 36 + 96 + 144}{10} = \frac{300}{10} = 30 \end{aligned}$$
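The pooling formulas translate directly into a small Python function. The name `combine` and its argument order are hypothetical; the variances passed in are population variances, as in the formulas above.

```python
def combine(n1, mean1, var1, n2, mean2, var2):
    """Mean and population variance of two pooled groups."""
    n = n1 + n2
    mean_c = (n1 * mean1 + n2 * mean2) / n
    within = n1 * var1 + n2 * var2                                   # spread inside each group
    between = n1 * (mean1 - mean_c) ** 2 + n2 * (mean2 - mean_c) ** 2  # spread between group means
    return mean_c, (within + between) / n

print(combine(6, 10, 4, 4, 20, 9))   # (14.0, 30.0)
```

Splitting the numerator into `within` and `between` mirrors the two kinds of terms in the combined variance formula.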

Applications

Coefficient of Variation

The coefficient of variation (CV) allows comparison of variability between datasets measured in different units or with vastly different means:

$$\mathrm{CV} = \frac{\sigma}{\bar{x}} \times 100\%$$

A larger CV indicates greater relative dispersion.

Examples
  • Investment A: mean return $= 8\%$, standard deviation $= 2\%$. CV $= \frac{2}{8} \times 100\% = 25\%$.
  • Investment B: mean return $= 15\%$, standard deviation $= 5\%$. CV $= \frac{5}{15} \times 100\% \approx 33.3\%$.
  • Investment A has lower relative risk.
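For completeness, a short Python sketch of the CV calculation. The function name is illustrative, and the guard against a zero mean is an added assumption, since the ratio is undefined in that case.

```python
def coefficient_of_variation(std_dev, mean):
    """Coefficient of variation as a percentage."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean * 100

cv_a = coefficient_of_variation(2, 8)
cv_b = coefficient_of_variation(5, 15)
print(round(cv_a, 1), round(cv_b, 1))   # 25.0 33.3
```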

Box-and-Whisker Plots

A box-and-whisker plot is a standardised graphical display of the five-number summary: minimum, $Q_1$, $Q_2$ (median), $Q_3$, and maximum.

Construction:

  1. Draw a rectangular box from $Q_1$ to $Q_3$.
  2. Draw a line inside the box at $Q_2$.
  3. Extend "whiskers" to the minimum and maximum values.

Identifying outliers: A data point is considered a potential outlier if it falls below $Q_1 - 1.5 \times \mathrm{IQR}$ or above $Q_3 + 1.5 \times \mathrm{IQR}$. When outliers are marked separately, each whisker stops at the most extreme value that is not an outlier.

Examples
  • Dataset: $\{5, 8, 12, 15, 18, 20, 24, 28, 35, 42, 58\}$ ($n=11$).
    • $Q_2 = 20$ (position $6$ of $11$).
    • Lower half: $\{5, 8, 12, 15, 18\}$, $Q_1 = 12$.
    • Upper half: $\{24, 28, 35, 42, 58\}$, $Q_3 = 35$.
    • IQR $= 35 - 12 = 23$.
    • Lower fence: $12 - 1.5(23) = -22.5$.
    • Upper fence: $35 + 1.5(23) = 69.5$.
    • Since every value lies between $-22.5$ and $69.5$, there are no outliers, and the whiskers extend to $5$ and $58$.
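The $1.5 \times \mathrm{IQR}$ rule can be sketched in Python as below. The function names `fences` and `outliers` are hypothetical, and the quartiles use the median-of-halves convention with the median excluded when $n$ is odd (an assumption; other conventions can shift the fences). The two datasets are the light-bulb lifetimes and the weekly wages from the wrap-up questions on this page.

```python
from statistics import median

def fences(data, k=1.5):
    """Lower and upper outlier fences, Q1 - k*IQR and Q3 + k*IQR."""
    xs = sorted(data)
    if len(xs) < 2:
        raise ValueError("need at least two values")
    half = len(xs) // 2
    q1 = median(xs[:half])     # lower half (median excluded if n is odd)
    q3 = median(xs[-half:])    # upper half
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outliers(data, k=1.5):
    low, high = fences(data, k)
    return [x for x in data if x < low or x > high]

bulbs = [820, 790, 810, 780, 830, 800, 795, 815, 805, 855]
wages = [3200, 3500, 3800, 4200, 4500, 4800, 5200, 12000]
print(fences(bulbs), outliers(bulbs))   # (757.5, 857.5) []
print(fences(wages), outliers(wages))   # (1625.0, 7025.0) [12000]
```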

Skewness (DSE awareness)

While not computed algebraically in the compulsory syllabus, students should recognise:

  • Positively skewed: mean $>$ median, the right tail is longer.
  • Negatively skewed: mean $<$ median, the left tail is longer.
  • Symmetrical: mean $=$ median $=$ mode (for unimodal distributions).

Wrap-up Questions
  1. Question: The marks of $7$ students are $56, 62, 45, 78, 83, 71, 65$. Find the mean, median, and mode.
Answer
  • Sorted: $\{45, 56, 62, 65, 71, 78, 83\}$.
  • Mean: $\bar{x} = \frac{460}{7} \approx 65.7$.
  • Median (position $4$ of $7$): $65$.
  • Mode: none (all values are distinct).
  2. Question: A dataset has mean $20$ and variance $36$. Find the mean and variance of the transformed dataset $Y = \dfrac{X - 20}{6}$.
Answer
  • $\bar{y} = \frac{1}{6}(20) - \frac{20}{6} = \frac{20-20}{6} = 0$.
  • $\mathrm{Var}(Y) = \left(\frac{1}{6}\right)^2 \times 36 = \frac{1}{36} \times 36 = 1$.
  3. Question: For the grouped frequency distribution below, find the mean and standard deviation using the coding method.

    | Class | Frequency |
    | --- | --- |
    | 10 -- 19 | 5 |
    | 20 -- 29 | 12 |
    | 30 -- 39 | 18 |
    | 40 -- 49 | 10 |
    | 50 -- 59 | 5 |
Answer
  • Class marks: $14.5, 24.5, 34.5, 44.5, 54.5$. Let $A = 34.5$, $h = 10$.
  • $d_i$: $-2, -1, 0, 1, 2$.
  • $\sum f_i = 50$, $\sum f_i d_i = 5(-2) + 12(-1) + 18(0) + 10(1) + 5(2) = -10 + (-12) + 0 + 10 + 10 = -2$.
  • $\bar{x} = 34.5 + \frac{-2}{50} \times 10 = 34.5 - 0.4 = 34.1$.
  • $\sum f_i d_i^2 = 5(4) + 12(1) + 18(0) + 10(1) + 5(4) = 20 + 12 + 0 + 10 + 20 = 62$.
  • $\sigma_d^2 = \frac{62}{50} - \left(\frac{-2}{50}\right)^2 = 1.24 - 0.0016 = 1.2384$.
  • $\sigma^2 = 1.2384 \times 10^2 = 123.84$, so $\sigma = \sqrt{123.84} \approx 11.13$.
  4. Question: Two classes sat the same test. Class A ($n_1 = 30$, $\bar{x}_1 = 72$, $\sigma_1 = 8$). Class B ($n_2 = 20$, $\bar{x}_2 = 80$, $\sigma_2 = 6$). Find the combined mean and combined standard deviation.
Answer
  • Combined mean: $\bar{x}_c = \frac{30(72)+20(80)}{50} = \frac{2160+1600}{50} = \frac{3760}{50} = 75.2$.
  • Combined variance: $$\begin{aligned} \sigma_c^2 &= \frac{30(64) + 20(36) + 30(72-75.2)^2 + 20(80-75.2)^2}{50} \\ &= \frac{1920 + 720 + 30(10.24) + 20(23.04)}{50} \\ &= \frac{1920 + 720 + 307.2 + 460.8}{50} \\ &= \frac{3408}{50} = 68.16 \end{aligned}$$
  • Combined standard deviation: $\sigma_c = \sqrt{68.16} \approx 8.26$.
  5. Question: The following are the lifetimes (in hours) of $10$ light bulbs: $820, 790, 810, 780, 830, 800, 795, 815, 805, 855$. Determine the range, IQR, and identify any outliers.
Answer
  • Sorted: $\{780, 790, 795, 800, 805, 810, 815, 820, 830, 855\}$.
  • Range $= 855 - 780 = 75$.
  • $Q_2 = \frac{805+810}{2} = 807.5$.
  • Lower half: $\{780, 790, 795, 800, 805\}$, $Q_1 = 795$.
  • Upper half: $\{810, 815, 820, 830, 855\}$, $Q_3 = 820$.
  • IQR $= 820 - 795 = 25$.
  • Lower fence: $795 - 1.5(25) = 757.5$. Upper fence: $820 + 1.5(25) = 857.5$.
  • No outliers (all values lie within $[757.5, 857.5]$).
  6. Question: A farmer records the yields (in kg) of two varieties of wheat over several seasons. Variety A: mean $= 45$, standard deviation $= 5$. Variety B: mean $= 60$, standard deviation $= 9$. Which variety has more consistent yield?
Answer
  • $\mathrm{CV}_A = \frac{5}{45} \times 100\% \approx 11.1\%$.
  • $\mathrm{CV}_B = \frac{9}{60} \times 100\% = 15.0\%$.
  • Since $\mathrm{CV}_A < \mathrm{CV}_B$, Variety A has more consistent (less variable) yield relative to its mean.
  7. Question: Given the dataset $\{a, b, c\}$ with mean $10$ and variance $8$, find the value of $a^2 + b^2 + c^2$.
Answer
  • $\bar{x} = \frac{a+b+c}{3} = 10 \implies a+b+c = 30$.
  • $\sigma^2 = \frac{a^2+b^2+c^2}{3} - \bar{x}^2 = 8$.
  • $\frac{a^2+b^2+c^2}{3} - 100 = 8 \implies a^2+b^2+c^2 = 324$.
  8. Question: A set of $20$ numbers has mean $15$ and standard deviation $3$. If each number is multiplied by $2$ and then $5$ is added, find the new mean and new standard deviation.
Answer
  • New mean: $2(15) + 5 = 35$.
  • New variance: $2^2 \times 3^2 = 36$.
  • New standard deviation: $\sqrt{36} = 6$.
  9. Question: The histogram below (described verbally) shows the distribution of weights of $50$ apples. The class intervals and frequencies are:

    | Weight (g) | Frequency |
    | --- | --- |
    | 100 -- 119 | 6 |
    | 120 -- 139 | 14 |
    | 140 -- 159 | 20 |
    | 160 -- 179 | 8 |
    | 180 -- 199 | 2 |

Estimate the median weight from the cumulative frequency distribution.

Answer
  • Cumulative frequencies: $6, 20, 40, 48, 50$.
  • The median is the $\frac{50}{2} = 25$th value, which lies in the class $140$--$159$ (cumulative $20$ to $40$).
  • Using linear interpolation within the class: $$\begin{aligned} \mathrm{Median} &= 139.5 + \frac{25-20}{40-20} \times (159.5 - 139.5) \\ &= 139.5 + \frac{5}{20} \times 20 \\ &= 139.5 + 5 = 144.5 \text{ g} \end{aligned}$$
  10. Question: For the dataset $\{3, 7, 7, 2, 9, 5, 1, 8, 6, 4\}$, find $\sum x_i$, $\sum x_i^2$, the mean, and the population variance. Verify your variance using both the definition formula and the computational formula.
Answer
  • $\sum x_i = 3+7+7+2+9+5+1+8+6+4 = 52$.
  • $\sum x_i^2 = 9+49+49+4+81+25+1+64+36+16 = 334$.
  • $\bar{x} = \frac{52}{10} = 5.2$.
  • Definition formula: $$\begin{aligned} \sigma^2 &= \frac{(3-5.2)^2 + (7-5.2)^2 + (7-5.2)^2 + (2-5.2)^2 + (9-5.2)^2 + (5-5.2)^2 + (1-5.2)^2 + (8-5.2)^2 + (6-5.2)^2 + (4-5.2)^2}{10} \\ &= \frac{4.84+3.24+3.24+10.24+14.44+0.04+17.64+7.84+0.64+1.44}{10} \\ &= \frac{63.6}{10} = 6.36 \end{aligned}$$
  • Computational formula: $$\sigma^2 = \frac{334}{10} - \left(\frac{52}{10}\right)^2 = 33.4 - 27.04 = 6.36 \quad \checkmark$$
  11. Question: The weekly wages (in dollars) of $8$ workers in a small factory are $3200, 3500, 3800, 4200, 4500, 4800, 5200, 12000$. The factory owner claims the average wage is USD 5150. Is this claim misleading? Explain using an appropriate measure of central tendency and dispersion.
Answer
  • Mean: $\bar{x} = \frac{41200}{8} = 5150$. The owner's figure is arithmetically correct.
  • Sorted: $\{3200, 3500, 3800, 4200, 4500, 4800, 5200, 12000\}$.
  • Median: $\frac{4200+4500}{2} = 4350$.
  • The median ($4350$) is a far more representative measure here. The single extreme value of USD 12000 (likely the owner's own salary or a manager's) pulls the mean USD 800 above the median; without it, the mean of the other seven wages is $\frac{29200}{7} \approx 4171$. The median is resistant to outliers and better reflects what a typical worker earns.
  • The range ($12000 - 3200 = 8800$) is large relative to a typical wage, and the mean lying well above the median indicates positive skew. Both confirm that the mean is a poor choice of summary statistic here.
  12. Question: A set of data has variance $25$ and mean $0$. A new set is formed by removing the value $10$ from the original set. If the original set had $n = 6$ values, find the new mean and new variance.
Answer
  • Original: $\bar{x} = 0$, $\sigma^2 = 25$, $n = 6$.
  • $\sum x_i = 0$, so $\sum x_i^2 = n\sigma^2 + \frac{(\sum x_i)^2}{n} = 6(25) + 0 = 150$.
  • After removing $10$: new sum $= 0 - 10 = -10$, new $n' = 5$.
  • New mean: $\bar{x}' = \frac{-10}{5} = -2$.
  • New sum of squares: $150 - 100 = 50$.
  • New variance: $\sigma'^2 = \frac{50}{5} - (-2)^2 = 10 - 4 = 6$.

For the A-Level treatment of this topic, see Data Representation.


Diagnostic Test

Ready to test your understanding of Dispersion? The diagnostic test contains the hardest questions within the DSE specification for this topic, each with a full worked solution.

Unit tests probe edge cases and common misconceptions. Integration tests combine Dispersion with other DSE mathematics topics to test synthesis under exam conditions.

See Diagnostic Guide for instructions on self-marking and building a personal test matrix.


DSE Exam Technique

Showing Working

For statistics problems in DSE Paper 1:

  1. When computing the mean, show the sum divided by $n$ before giving the decimal.
  2. When computing variance, use the computational formula $\sigma^2 = \dfrac{\sum x_i^2}{n} - \bar{x}^2$ and show both terms.
  3. For grouped data, show the class marks and the coding method clearly in a table.
  4. For the coding method, state the assumed mean $A$ and class width $h$.
  5. For box plots, label all five values (min, $Q_1$, $Q_2$, $Q_3$, max).

Significant Figures

The DSE typically requires answers to be given to 3 significant figures unless the question specifies otherwise. Exact fractions are preferred when they arise naturally.

Common DSE Question Types

  1. Combined mean and variance of two groups.
  2. Grouped data mean and standard deviation using the coding method.
  3. Effect of linear transformations on mean and variance.
  4. Box-and-whisker plots with outlier identification.
  5. Coefficient of variation for comparing relative dispersion.

Additional Worked Examples

Worked Example 13: Effect of adding a data point

A dataset $\{2, 5, 8, 11, 14\}$ has mean $\bar{x} = 8$ and variance $\sigma^2 = 18$. Find the new mean and variance if the value $20$ is added.

Solution

New $n' = 6$. New sum $= 40 + 20 = 60$. New mean $\bar{x}' = \dfrac{60}{6} = 10$.

$\sum x_i^2 = 4 + 25 + 64 + 121 + 196 = 410$ (check: $\frac{410}{5} - 8^2 = 82 - 64 = 18$). New $\sum x_i^2 = 410 + 400 = 810$.

New variance: $\sigma'^2 = \dfrac{810}{6} - 10^2 = 135 - 100 = 35$.
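Problems that add or remove a value only need the running totals $n$, $\sum x_i$ and $\sum x_i^2$. A minimal Python sketch follows; the function name `summary` is hypothetical. The first call adds $20$ to the dataset above, and the second reproduces the wrap-up question in which $10$ is removed from six values with sum $0$ and sum of squares $150$.

```python
def summary(n, total, total_sq):
    """Mean and population variance from n, sum of x, and sum of x^2."""
    mean = total / n
    return mean, total_sq / n - mean ** 2

data = [2, 5, 8, 11, 14]
n, total, total_sq = len(data), sum(data), sum(x * x for x in data)

# Adding the value 20
added = summary(n + 1, total + 20, total_sq + 20 ** 2)
print(added)      # (10.0, 35.0)

# Removing the value 10 from a set with n = 6, sum 0, sum of squares 150
removed = summary(6 - 1, 0 - 10, 150 - 10 ** 2)
print(removed)    # (-2.0, 6.0)
```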

Worked Example 14: Standardised scores

In an exam, the mean is 60 and the standard deviation is 10. Student A scores 75 and Student B scores 55. Express each score as a standardised score (z-score).

Solution

$$z_A = \frac{75 - 60}{10} = 1.5$$

$$z_B = \frac{55 - 60}{10} = -0.5$$

Student A scored 1.5 standard deviations above the mean; Student B scored 0.5 standard deviations below.
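The standardisation step is a one-line function in Python; the name `z_score` is an illustrative choice.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above the mean."""
    return (x - mean) / std_dev

print(z_score(75, 60, 10))   # 1.5
print(z_score(55, 60, 10))   # -0.5
```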

Worked Example 15: Finding data from summary statistics

A dataset of 5 positive integers has mean 6 and variance 4. Find all possible datasets.

Solution

$\sum x_i = 5 \times 6 = 30$ and $\dfrac{\sum x_i^2}{5} - 36 = 4 \implies \sum x_i^2 = 200$.

Work with the deviations $d_i = x_i - 6$. These are integers with $\sum d_i = 0$ and $\sum d_i^2 = 5 \times 4 = 20$.

Since $d_i^2 \le 20$, each $|d_i| \le 4$, so every $d_i^2$ is one of $0, 1, 4, 9, 16$. The only ways to write $20$ as a sum of five such squares are:

  • $16 + 4 + 0 + 0 + 0$: deviations $\pm 4, \pm 2, 0, 0, 0$ cannot sum to $0$. Rejected.
  • $16 + 1 + 1 + 1 + 1$: deviations $\{4, -1, -1, -1, -1\}$ or $\{-4, 1, 1, 1, 1\}$.
  • $9 + 9 + 1 + 1 + 0$: deviations $\{3, -3, 1, -1, 0\}$.
  • $4 + 4 + 4 + 4 + 4$: five deviations of $\pm 2$ sum to $\pm 2$, $\pm 6$ or $\pm 10$, never $0$. Rejected.

Adding $6$ back to each deviation gives three datasets:

  • $\{5, 5, 5, 5, 10\}$: sum $= 30$, $\sum x_i^2 = 25 + 25 + 25 + 25 + 100 = 200$.
  • $\{2, 7, 7, 7, 7\}$: sum $= 30$, $\sum x_i^2 = 4 + 49 + 49 + 49 + 49 = 200$.
  • $\{3, 5, 6, 7, 9\}$: sum $= 30$, $\sum x_i^2 = 9 + 25 + 36 + 49 + 81 = 200$.

Trial datasets such as $\{4, 5, 6, 7, 8\}$ ($\sum x_i^2 = 190$) or $\{4, 4, 6, 8, 8\}$ ($\sum x_i^2 = 196$) have the right sum but the wrong sum of squares, which is why working with deviations is more reliable than guessing.
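A search of this kind is easy to automate. The sketch below is a brute-force check in Python, not an exam method; the function name `datasets` and the bound `max_value=14` (chosen because $15^2 > 200$) are assumptions. It assumes the mean and variance are integers so that the comparisons are exact.

```python
from itertools import combinations_with_replacement

def datasets(n, mean, variance, max_value):
    """All non-decreasing n-tuples of positive integers with the given
    mean and population variance (integer mean and variance assumed)."""
    target_sum = n * mean
    target_sq = n * (variance + mean ** 2)
    return [
        c for c in combinations_with_replacement(range(1, max_value + 1), n)
        if sum(c) == target_sum and sum(x * x for x in c) == target_sq
    ]

print(datasets(5, 6, 4, max_value=14))
# [(2, 7, 7, 7, 7), (3, 5, 6, 7, 9), (5, 5, 5, 5, 10)]
```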

Worked Example 16: Grouped data variance with coding

For the frequency distribution below, find the standard deviation using the assumed mean method.

| Class | Frequency |
| --- | --- |
| 5 -- 9 | 3 |
| 10 -- 14 | 7 |
| 15 -- 19 | 12 |
| 20 -- 24 | 5 |
| 25 -- 29 | 3 |

Solution

Class marks: $7$, $12$, $17$, $22$, $27$. $A = 17$, $h = 5$.

$d_i$: $-2$, $-1$, $0$, $1$, $2$.

| $f_i$ | $d_i$ | $f_i d_i$ | $f_i d_i^2$ |
| --- | --- | --- | --- |
| 3 | $-2$ | $-6$ | 12 |
| 7 | $-1$ | $-7$ | 7 |
| 12 | $0$ | $0$ | 0 |
| 5 | $1$ | $5$ | 5 |
| 3 | $2$ | $6$ | 12 |

$n = 30$, $\sum f_i d_i = -2$, $\sum f_i d_i^2 = 36$.

$\bar{d} = \dfrac{-2}{30} = -\dfrac{1}{15}$.

$\sigma_d^2 = \dfrac{36}{30} - \left(-\dfrac{1}{15}\right)^2 = 1.2 - \dfrac{1}{225} = \dfrac{269}{225}$.

$\sigma^2 = \dfrac{269}{225} \times 25 = \dfrac{269}{9} \approx 29.89$.

$\sigma = \sqrt{\dfrac{269}{9}} = \dfrac{\sqrt{269}}{3} \approx 5.47$.


DSE Exam-Style Questions

DSE Practice 1. Two groups of students took the same test. Group A: $n_1 = 40$, $\bar{x}_1 = 65$, $\sigma_1 = 8$. Group B: $n_2 = 60$, $\bar{x}_2 = 72$, $\sigma_2 = 10$. Find the overall mean and standard deviation.

Solution

Combined mean: $\bar{x}_c = \dfrac{40(65) + 60(72)}{100} = \dfrac{2600 + 4320}{100} = 69.2$.

Combined variance:

$$\begin{aligned} \sigma_c^2 &= \frac{40(64) + 60(100) + 40(65 - 69.2)^2 + 60(72 - 69.2)^2}{100} \\ &= \frac{2560 + 6000 + 40(17.64) + 60(7.84)}{100} \\ &= \frac{8560 + 705.6 + 470.4}{100} = \frac{9736}{100} = 97.36 \end{aligned}$$

$$\sigma_c = \sqrt{97.36} \approx 9.87$$

DSE Practice 2. The heights (in cm) of 8 students are: 158, 162, 165, 168, 170, 172, 175, 180. After converting to feet (1 cm = 0.03281 ft), find the mean and standard deviation in feet.

Solution

Let $Y = 0.03281X$. Then $\bar{y} = 0.03281\bar{x}$ and $\sigma_Y = 0.03281\sigma_X$.

$\bar{x} = \dfrac{158 + 162 + 165 + 168 + 170 + 172 + 175 + 180}{8} = \dfrac{1350}{8} = 168.75$ cm.

$\bar{y} = 0.03281 \times 168.75 \approx 5.537$ ft.

$\sigma_X = \sqrt{\dfrac{\sum x_i^2}{8} - 168.75^2}$.

$\sum x_i^2 = 24964 + 26244 + 27225 + 28224 + 28900 + 29584 + 30625 + 32400 = 228166$.

$\sigma_X^2 = \dfrac{228166}{8} - 28476.5625 = 28520.75 - 28476.5625 = 44.1875$.

$\sigma_X = \sqrt{44.1875} \approx 6.647$ cm.

$\sigma_Y = 0.03281 \times 6.647 \approx 0.2181$ ft.

DSE Practice 3. For the dataset $\{1, 3, 5, 7, 9, 11, 13\}$, find the mean deviation (mean absolute deviation) and compare it with the standard deviation.

Solution

$\bar{x} = \dfrac{49}{7} = 7$.

Mean deviation $= \dfrac{|1-7| + |3-7| + |5-7| + |7-7| + |9-7| + |11-7| + |13-7|}{7} = \dfrac{6 + 4 + 2 + 0 + 2 + 4 + 6}{7} = \dfrac{24}{7} \approx 3.43$.

$\sigma^2 = \dfrac{1 + 9 + 25 + 49 + 81 + 121 + 169}{7} - 49 = \dfrac{455}{7} - 49 = 65 - 49 = 16$.

$\sigma = 4$.

The standard deviation ($4$) is greater than the mean deviation ($3.43$). In general the standard deviation is never smaller than the mean deviation; the two are equal only when every value is the same distance from the mean (for example $\{0, 2\}$, where both equal $1$).
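The comparison can be reproduced with a few lines of Python. The helper name `mean_deviation` is hypothetical; `pstdev` is the population standard deviation from the standard `statistics` module.

```python
from statistics import mean, pstdev

def mean_deviation(data):
    """Average absolute distance of the values from their mean."""
    centre = mean(data)
    return sum(abs(x - centre) for x in data) / len(data)

data = [1, 3, 5, 7, 9, 11, 13]
md = mean_deviation(data)
sd = pstdev(data)
print(round(md, 2), sd)   # 3.43 4.0
```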

DSE Practice 4. A set of data has $\bar{x} = 50$ and $\sigma = 4$. If every value is increased by $k$, the new standard deviation becomes 10. Find $k$ and explain your answer.

Solution

Adding a constant $k$ does not change the standard deviation, so the new standard deviation must still be $\sigma = 4$, not $10$.

No value of $k$ can change the standard deviation from 4 to 10 by addition alone, so the question as stated has no solution. To change the standard deviation, the data must be multiplied by a constant: if $Y = aX + b$, then $\sigma_Y = |a|\sigma_X$. For $\sigma_Y = 10$: $|a| = \dfrac{10}{4} = 2.5$.

The question likely intends a multiplication rather than an addition. If $Y = 2.5X + k$, then $\sigma_Y = 10$ for any $k$.

DSE Practice 5. The table shows the distribution of marks in a test.

| Marks | Frequency |
| --- | --- |
| 0 -- 19 | 4 |
| 20 -- 39 | 10 |
| 40 -- 59 | 22 |
| 60 -- 79 | 14 |
| 80 -- 100 | 5 |

Estimate the mean and standard deviation.

Solution

Class widths: 20, 20, 20, 20, 21. The first four class marks are $9.5$, $29.5$, $49.5$, $69.5$. The last class is slightly wider, so its true class mark is $\frac{80 + 100}{2} = 90$; we take $89.5$ so that the class marks are equally spaced and the coding method applies.

For the coding method with equal class widths (using width 20): $A = 49.5$, $h = 20$.

$d_i$: $-2$, $-1$, $0$, $1$, $2$ (approximately; the last class has width 21).

Using approximate equal widths:

| $f_i$ | $x_i$ | $d_i$ | $f_i d_i$ | $f_i d_i^2$ |
| --- | --- | --- | --- | --- |
| 4 | 9.5 | $-2$ | $-8$ | 16 |
| 10 | 29.5 | $-1$ | $-10$ | 10 |
| 22 | 49.5 | $0$ | $0$ | 0 |
| 14 | 69.5 | $1$ | $14$ | 14 |
| 5 | 89.5 | $2$ | $10$ | 20 |

$n = 55$, $\sum f_i d_i = 6$, $\sum f_i d_i^2 = 60$.

$\bar{x} = 49.5 + \dfrac{6}{55} \times 20 = 49.5 + 2.18 = 51.68 \approx 51.7$.

$\sigma_d^2 = \dfrac{60}{55} - \left(\dfrac{6}{55}\right)^2 = 1.0909 - 0.0119 = 1.079$.

$\sigma^2 = 1.079 \times 400 = 431.6$. $\sigma = \sqrt{431.6} \approx 20.8$.