14 Multinomial data
- Identify the distribution from the study design and/or data.
- Be able to formalize the probability model from which the data were generated.
- Know the basic principle behind the goodness of fit test (\(\chi^2\)-test).
- Be able to perform a goodness of fit test for comparison of categorical data from several groups.
14.1 Reading material
- Multinomial data - in short
- A video going through the \(\chi^2\)-test (goodness of fit test) for frequency tables
- Chapter 7 of Introduction to Statistics by Brockhoff
14.2 Multinomial data - in short
Multinomial data are categorical data with more than two categories. If there are only two categories, the data follow the binomial distribution. Such data are naturally organized in a so-called frequency table or contingency table (see for example Exercise 15.1). In general terms, a table with \(n\) rows and \(k\) columns can be shown as in Table 14.1.
| | Category I | Category II | \(\cdots\) | Category \(k\) |
|---|---|---|---|---|
| Sample I | \(N_{11}\) | \(N_{12}\) | \(\cdots\) | \(N_{1k}\) |
| Sample II | \(N_{21}\) | \(N_{22}\) | \(\cdots\) | \(N_{2k}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) |
| Sample \(n\) | \(N_{n1}\) | \(N_{n2}\) | \(\cdots\) | \(N_{nk}\) |

Table 14.1: General frequency table with \(n\) samples (rows) and \(k\) categories (columns).
Such data can be collected in two ways, following two slightly different models.
One distribution
\(N\) samples that are distributed over \(nk\) categories (the cells of the table). For example, \(256\) people are selected and categorized according to gender and hair color. The model for this case is:
\[ N_{11},N_{12},...,N_{nk} \sim \text{Multinomial}(N,p_{11},p_{12},...,p_{nk}) \]
where
\[ N = N_{11} + N_{12} + ... + N_{nk} \quad \text{and} \quad p_{11} + p_{12} + ... + p_{nk} = 1. \]
i.e. one distribution
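As a minimal sketch of this model, a single multinomial draw over all \(nk\) cells can be simulated with NumPy. The cell probabilities below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# One-distribution model: N = 256 people cross-classified by gender and
# hair color in a single multinomial draw over all n*k = 2*4 cells.
N = 256

# Hypothetical cell probabilities p_11, ..., p_nk; they must sum to 1.
p = np.array([[0.10, 0.15, 0.20, 0.05],
              [0.10, 0.15, 0.15, 0.10]])

# Draw once over all cells, then reshape back into the n x k table.
counts = rng.multinomial(N, p.ravel()).reshape(p.shape)
print(counts)        # a random frequency table
print(counts.sum())  # always 256; row and column sums are random
```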
Several distributions
Several samples of sizes \(N_i\), each distributed over \(k\) categories. In this case the number of samples within each row is fixed by design, and those samples are distributed over the categories. For example, \(100\) men and \(123\) women, distributed according to their hair color. The model for this case is:
\[ N_{i1},N_{i2},...,N_{ik} \sim \text{Multinomial}(N_{i},p_{i1},p_{i2},...,p_{ik}) \]
where
\[ N_{i} = N_{i1} + N_{i2} + ... + N_{ik} \quad \text{and} \quad p_{i1} + p_{i2} + ... + p_{ik} = 1 \quad \text{for } i = 1,...,n. \]
i.e. several (\(n\)) distributions.
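Under this model only the row totals are fixed; each row is a separate multinomial draw. A sketch with made-up row probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)

# Several-distributions model: the row totals (100 men, 123 women) are
# fixed by design; each row is its own multinomial over k = 4 hair colors.
N_i = [100, 123]

# Hypothetical row probabilities; each row sums to 1.
p_i = np.array([[0.40, 0.30, 0.20, 0.10],
                [0.35, 0.30, 0.25, 0.10]])

counts = np.array([rng.multinomial(n, p) for n, p in zip(N_i, p_i)])
print(counts)              # a random frequency table
print(counts.sum(axis=1))  # always [100 123]; only the columns are random
```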
The natural question for both types of data is whether the columns (or rows) are independent. For the example above, that is: is the distribution of hair color the same regardless of gender? However, the null hypothesis is stated differently depending on the model.
- \(H_0\): \(p_{ij} = p_{i \cdot}p_{\cdot j}\) for all \(i\) and \(j\) (one-distribution model), where \(p_{i \cdot}\) refers to the row probability (i.e. the probability of having gender \(i\)) and \(p_{\cdot j}\) refers to the column probability (i.e. the probability of having hair color \(j\)).
- \(H_0\): \(p_{1j} = p_{2j} = ... = p_{nj} = p_{\cdot j}\) for all \(j\) (several-distributions model), i.e. equal probability of each hair color across genders.
In both cases the test is the same: it is based on calculating the expected number of observations under the null hypothesis and comparing those with the observed counts. If the discrepancy is large, there are large differences between the observed and expected counts, and the null hypothesis is rejected.
| | Category I | Category II | \(\cdots\) | Category \(k\) | Row sum |
|---|---|---|---|---|---|
| Sample I | \(N_{11}\) | \(N_{12}\) | \(\cdots\) | \(N_{1k}\) | \(N_{1 \cdot} = \sum_j N_{1j}\) |
| Sample II | \(N_{21}\) | \(N_{22}\) | \(\cdots\) | \(N_{2k}\) | \(N_{2 \cdot} = \sum_j N_{2j}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\vdots\) |
| Sample \(n\) | \(N_{n1}\) | \(N_{n2}\) | \(\cdots\) | \(N_{nk}\) | \(N_{n \cdot} = \sum_j N_{nj}\) |
| Column sum | \(N_{\cdot 1} = \sum_i N_{i1}\) | \(N_{\cdot 2} = \sum_i N_{i2}\) | \(\cdots\) | \(N_{\cdot k} = \sum_i N_{ik}\) | \(N_{\cdot \cdot} = \sum_{ij} N_{ij}\) |

Table 14.2: Frequency table with row sums, column sums, and the total sum.
The expected value \(E_{ij}\) for each cell in the table is calculated as:
\[\begin{equation} E_{ij} = \frac{N_{i \cdot} N_{\cdot j}}{N_{\cdot \cdot}} \end{equation}\]
I.e. the row sum multiplied by the column sum, divided by the total sum (see Table 14.2).
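As a small sketch with a hypothetical \(2 \times 3\) table of observed counts, the expected counts can be computed for all cells at once as an outer product of the margins:

```python
import numpy as np

# Hypothetical observed 2 x 3 frequency table (rows: samples, columns: categories).
obs = np.array([[25, 40, 35],
                [30, 30, 63]])

row_sums = obs.sum(axis=1)  # N_i.
col_sums = obs.sum(axis=0)  # N_.j
total = obs.sum()           # N_..

# E_ij = N_i. * N_.j / N.. for every cell at once.
expected = np.outer(row_sums, col_sums) / total
print(expected)
```

The test statistic \(X^2_{obs}\) is calculated as: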
\[\begin{equation} X^2_{obs} = \sum_{ij}{\frac{(E_{ij} - N_{ij})^2}{E_{ij}}} \end{equation}\]
I.e. the squared discrepancy between the expected (\(E_{ij}\)) and the observed (\(N_{ij}\)) counts, divided by the expected value and summed across all cells.
Under the null hypothesis, \(X^2_{obs}\) follows a so-called chi-squared distribution (\(\chi^2\)) with \(df = (n-1)(k-1)\) degrees of freedom. The P-value is one-sided:
\[\begin{equation} P = P(\chi^2_{df} > X^2_{obs}) \end{equation}\]
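Continuing the hypothetical table from above, the statistic, degrees of freedom, and P-value can be computed directly; SciPy's `chi2.sf` gives the upper-tail probability:

```python
import numpy as np
from scipy import stats

obs = np.array([[25, 40, 35],
                [30, 30, 63]])
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()

# X^2_obs: squared discrepancies between expected and observed counts,
# divided by the expected counts and summed over all cells.
X2_obs = ((obs - expected) ** 2 / expected).sum()

n, k = obs.shape
df = (n - 1) * (k - 1)

# One-sided P-value: the upper tail of the chi-squared distribution.
p_value = stats.chi2.sf(X2_obs, df)
print(X2_obs, df, p_value)
```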
Note: be aware that too low expected values (\(E_{ij}\)) make this test unstable. A rule of thumb is that \(80\%\) of the cells should have expected values above \(5\), and ALL should be above \(1\). When this is violated, cells can be merged by summing both the expected and observed values.
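The whole procedure, including the expected counts, is also bundled in `scipy.stats.chi2_contingency`, which can be used to check a hand calculation and to inspect the rule of thumb (same hypothetical table as above):

```python
import numpy as np
from scipy import stats

obs = np.array([[25, 40, 35],
                [30, 30, 63]])

# correction=False disables Yates' continuity correction (only applied
# to 2 x 2 tables anyway), so the result matches the formulas above.
chi2, p, dof, expected = stats.chi2_contingency(obs, correction=False)
print(chi2, p, dof)

# Rule-of-thumb check on the expected counts.
print((expected > 5).mean() >= 0.80)  # at least 80% of cells above 5?
print((expected > 1).all())           # ALL cells above 1?
```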