3  Descriptive statistics

Learning objectives
  • Understand what univariate, bivariate and multivariate data are.
  • Comprehend the concepts of centrality and dispersion.
  • Be able to compute a range of metrics (numbers) that are informative with respect to centrality and dispersion.
  • Know what correlated variables are, and be able to calculate a correlation coefficient.
  • Know how to use the functions aggregate() and summary() to create overview tables.
  • Know what the very useful functions str(), head(), tail() and View() do.

3.1 Reading material

3.2 Exercises

Exercise 3.1 - Descriptive statistics by hand

Below (table 3.1) is listed a vector of rankings (Liking) of coffee served at 56°C, given by 52 consumers. The data are sorted.

Table 3.1: Liking of coffee served at 56°C as ranked by 52 consumers.
2 2 3 3 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5
5 6 6 6 6 6 6 6 6 6
6 6 7 7 7 7 7 7 7 7
7 7 7 7 7 7 7 8 8 8
8 9

Some useful numbers:

\[\sum{X_i} = 301\]

\[\sum{(X_i - \bar{X})^2} = 122.7\]

Tasks
  1. Calculate the mean, variance, standard deviation, median and interquartile range for this distribution of data (the formulas for mean, variance and standard deviation are recalled below).
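
As a pointer for the hand calculation, the mean and variance follow directly from the two sums given above, with \(n = 52\):

\[\bar{X} = \frac{\sum{X_i}}{n} = \frac{301}{52} \approx 5.79\]

\[s^2 = \frac{\sum{(X_i - \bar{X})^2}}{n - 1} = \frac{122.7}{51} \approx 2.41, \qquad s = \sqrt{s^2} \approx 1.55\]

The median and the interquartile range are instead read off the sorted values in table 3.1 directly.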
Exercise 3.2 - Descriptive statistics

The serving temperature of coffee seems to matter for how this drink is perceived. However, it is not entirely clear what this relation looks like. In order to understand it, studies are conducted on the same type of coffee served at different temperatures. In this exercise we are going to use data from a consumer panel of 52 consumers evaluating coffee served at six different temperatures on a set of sensory descriptors, leading to a total of \(52 \times 6 = 312\) samples.

The results are listed in the dataset. Taking these data from A to Z involves descriptive analysis to understand variation within judges, between judges and between serving temperatures, followed by outlier detection, and finally determination of the structure between the sensory descriptors. In this exercise we only go through some of the initial descriptive steps.

In the table below (table 3.2) a subset of the data is shown.

Table 3.2: A subset of the Results Consumer Test.xlsx data
Sample Temperatur Assessor ServingOrder TemperatureJudgment Liking Intensity Sour Bitter Sweet Male Female
1 31C 31 1 6 2 3 4 3 4 2 1 0
2 31C 31 2 6 2 3 7 5 8 3 1 0
3 31C 31 3 6 1 3 2 1 4 6 0 1
4 31C 31 4 6 1 2 5 6 4 3 1 0
5 31C 31 5 6 2 2 2 3 2 1 1 0
6 31C 31 6 6 2 4 3 4 2 1 1 0
307 62C 62 47 6 8 8 8 2 8 1 0 1
308 62C 62 48 6 6 7 7 4 3 3 0 1
309 62C 62 49 6 5 8 6 6 4 6 1 0
310 62C 62 50 6 6 5 7 7 7 4 0 1
311 62C 62 51 6 7 6 8 6 7 2 1 0
312 62C 62 52 6 6 7 6 6 7 3 1 0
Tasks
  1. Import the data
    • Be aware that the function read_excel() used in listing 3.1 is not part of base R, so you need to install and load the package that provides it (readxl).
  2. Subsample on one temperature.
    • Below (listing 3.1) are two alternatives for doing this.
Listing 3.1: Importing and subsampling based on temperature.
library(readxl)  # read_excel() comes from the readxl package

# Import the data and subsample on one temperature (two alternatives)
Coffee <- read_excel("Results Consumer Test.xlsx")
Coffee_t44_v1 <- Coffee[Coffee$Temperatur == 44, ]
Coffee_t44_v2 <- subset(Coffee, Temperatur == 44)

mean(Coffee_t44_v1$Liking)
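
Before computing any statistics it can be useful to inspect the imported data with the overview functions from the learning objectives. A minimal sketch, assuming the Coffee object created in listing 3.1:

str(Coffee)      # structure: column names, types and a preview of the values
head(Coffee)     # the first rows of the data
tail(Coffee)     # the last rows of the data
summary(Coffee)  # min, quartiles, median and mean for each numeric column
# View(Coffee)   # opens the full dataset in a spreadsheet-like viewer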
Tasks
  1. Calculate the descriptive statistics for centrality (mean and median), dispersion (IQR, standard deviation and range) and extremes (min and max) for this distribution of data points for a single descriptor (e.g. Liking); a minimal sketch is shown after table 3.3.

  2. Now do it for all temperatures.

    • You should get something like the table below (table 3.3).
Table 3.3: Summary table computed in R.
Temp N Mean Median Std Min Max
31 52 3.576923 3 1.649078 1 7
37 52 4.750000 5 1.780890 1 7
44 52 5.826923 6 1.605397 2 9
50 52 5.961538 6 1.596092 2 8
56 52 5.788462 6 1.550920 2 9
62 52 6.173077 6 1.367998 2 8
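
A minimal sketch of the calculations for task 1, assuming the 44°C subset Coffee_t44_v1 from listing 3.1 and the Liking descriptor; repeating it for each temperature gives numbers like those in table 3.3:

x <- Coffee_t44_v1$Liking   # Liking scores for coffee served at 44 degrees C

mean(x)          # centrality: mean
median(x)        # centrality: median
IQR(x)           # dispersion: interquartile range
sd(x)            # dispersion: standard deviation
diff(range(x))   # dispersion: range (max minus min)
min(x)           # extreme: minimum
max(x)           # extreme: maximum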

This can be quite tedious and result in a lot of code. However, the functions summary() and aggregate() are very efficient at producing such results. Check out these functions and see if you can use them to generate the summary statistics. Below (listing 3.2) is some code that does exactly what you want without too many lines.

Listing 3.2: Generating a summary table with aggregate().
# Include only responses
CoffeeDT <- Coffee[,2:10]

# Run aggregate for each type of summary
tmpN  <- aggregate(CoffeeDT, by = list(CoffeeDT$Temperatur), FUN = 'length')
tmpM  <- aggregate(CoffeeDT, by = list(CoffeeDT$Temperatur), FUN = 'mean')
tmpM2 <- aggregate(CoffeeDT, by = list(CoffeeDT$Temperatur), FUN = 'median')
tmpS  <- aggregate(CoffeeDT, by = list(CoffeeDT$Temperatur), FUN = 'sd')
tmpMi <- aggregate(CoffeeDT, by = list(CoffeeDT$Temperatur), FUN = 'min')
tmpMx <- aggregate(CoffeeDT, by = list(CoffeeDT$Temperatur), FUN = 'max')

# Merge these into a dataset
tmp <- cbind(tmpM$Temperatur,tmpN$Liking,tmpM$Liking,tmpM2$Liking,
             tmpS$Liking,tmpMi$Liking,tmpMx$Liking)

# Add a meaningful label for each column
colnames(tmp) <- c('Temp','N','Mean','Median','Std','Min','Max') 
print(tmp)
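
For a quick overview of a single subset, the summary() function mentioned above is a handy base-R alternative; a small sketch, assuming the Coffee_t44_v1 subset from listing 3.1:

# Min, quartiles, median and mean for every column of the 44 degrees subset
summary(Coffee_t44_v1)

# The same overview for a single descriptor
summary(Coffee_t44_v1$Liking)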
Tasks
  1. The above is done for Liking; try to do it for some of the other responses.
    • Hint: This can be done by repeating the code and replacing $Liking with e.g. $Bitter. However, putting this in a for loop is another option (see the sketch at the end of this section).
  2. What have you learned from analysing these data in terms of the importance of serving temperature for the sensory properties as perceived by the consumers?
    • Hint: You can run the code below (listing 3.3) to get a comprehensive overview. This is based on the mean aggregate, but you might just as well check some of the other descriptive metrics. For instance, what does the standard deviation tell you about the consumers in general, and do the type of sensory attribute and the serving temperature make a difference to the spread in the scoring?
Listing 3.3: Code for plotting the results of listing 3.2.
# Plot the aggregated means (columns 6-10 of tmpM) against serving temperature (column 2)
matplot(tmpM[, 2], tmpM[, 6:10], type = 'l', lwd = 3)
# Write the name of each response next to its curve, near the highest temperature
text(cbind(60, t(tmpM[6, 6:10])), colnames(tmpM[, 6:10]))

You might want to fix some of the labels in this figure. Check the documentation by typing ?matplot to see how to add meaningful axis labels and a title to the plot.
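
As mentioned in the hint to task 1 above, the summary table can also be produced for all responses in a for loop rather than by repeating the code by hand. A minimal sketch, assuming the aggregate objects tmpN, tmpM, tmpM2, tmpS, tmpMi and tmpMx from listing 3.2 and the response names from table 3.2:

# The sensory response columns in the consumer test
responses <- c('Liking', 'Intensity', 'Sour', 'Bitter', 'Sweet')

for (resp in responses) {
  # Collect the aggregated summaries for this response into one table
  tmp <- cbind(tmpM$Temperatur, tmpN[[resp]], tmpM[[resp]], tmpM2[[resp]],
               tmpS[[resp]], tmpMi[[resp]], tmpMx[[resp]])
  colnames(tmp) <- c('Temp', 'N', 'Mean', 'Median', 'Std', 'Min', 'Max')
  cat('\n', resp, '\n')
  print(tmp)
}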