1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|
2 | 2 | 3 | 3 | 4 | 4 | 4 | 4 | 4 | 4 |
5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
5 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
6 | 6 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 |
7 | 7 | 7 | 7 | 7 | 7 | 7 | 8 | 8 | 8 |
8 | 9 | NA | NA | NA | NA | NA | NA | NA | NA |
3 Descriptive statistics
- Understand what univariate, bivariate and multivariate data are.
- Comprehend the concepts of centrality and dispersion.
- Be able to compute a range of metrics (numbers) that are informative about centrality and dispersion.
- Know what correlated variables are, and be able to calculate a correlation coefficient.
- Know how to use the functions `aggregate()` and `summary()` to create overview tables.
- Know what the very useful functions `str()`, `head()`, `tail()` and `View()` do.
3.1 Reading material
- Chapter 1 of Introduction to Statistics by Brockhoff
  - Especially sections 1.1 to 1.4.
- Video lecture on central metrics (mean and median).
- Video lecture on dispersion (variance, standard deviation etc.)
- Video lecture on both central metrics and dispersion.
3.2 Exercises
Table 3.1 lists a vector of Liking scores given by 52 consumers to coffee served at 56°C. The data are sorted.
Some useful numbers:
\[\sum_{i=1}^{52} X_i = 301\]
\[\sum_{i=1}^{52} (X_i - \bar{X})^2 = 122.7\]
- Calculate the mean, variance, standard deviation, median and interquartile range for this distribution of data.
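The hand calculation can be checked in R. The sketch below types in the 52 values from table 3.1 and computes each quantity from its definition; the variable names are chosen here for illustration, and `IQR()` uses R's default quantile definition, which may differ slightly from a hand-calculation convention.

```r
# Liking scores from table 3.1 (52 consumers, 56 degrees C), already sorted
liking <- c(2, 2, 3, 3, rep(4, 6), rep(5, 11), rep(6, 11),
            rep(7, 15), rep(8, 4), 9)

n     <- length(liking)            # number of observations
x_bar <- sum(liking) / n           # mean: sum of X divided by n
ss    <- sum((liking - x_bar)^2)   # sum of squared deviations from the mean
s2    <- ss / (n - 1)              # sample variance
s     <- sqrt(s2)                  # sample standard deviation
med   <- median(liking)            # median
iqr   <- IQR(liking)               # interquartile range (R default, type 7)

print(c(n = n, mean = x_bar, variance = s2, sd = s, median = med, IQR = iqr))

# The built-in functions should agree with the manual formulas
print(all.equal(c(x_bar, s2, s), c(mean(liking), var(liking), sd(liking))))
```

The mean and standard deviation obtained this way should agree with the 56°C row of table 3.3.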
Serving temperature seems to matter for how coffee is perceived, but the exact relationship is not clear. To investigate it, studies are conducted on the same type of coffee served at different temperatures. In this exercise we use data from a consumer panel of 52 consumers, who evaluated coffee served at six different temperatures on a set of sensory descriptors, giving a total of \(52 \times 6 = 312\) samples.
The results are listed in the dataset. Taking these data from A to Z involves descriptive analysis to understand the variation within judges, between judges and between temperatures, followed by outlier detection and finally determination of the structure between the sensory descriptors. In this exercise we only go through some of the initial descriptive steps.
A subset of the data is shown in the table below (table 3.2).
Row | Sample | Temperatur | Assessor | ServingOrder | TemperatureJudgment | Liking | Intensity | Sour | Bitter | Sweet | Male | Female |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 31C | 31 | 1 | 6 | 2 | 3 | 4 | 3 | 4 | 2 | 1 | 0 |
2 | 31C | 31 | 2 | 6 | 2 | 3 | 7 | 5 | 8 | 3 | 1 | 0 |
3 | 31C | 31 | 3 | 6 | 1 | 3 | 2 | 1 | 4 | 6 | 0 | 1 |
4 | 31C | 31 | 4 | 6 | 1 | 2 | 5 | 6 | 4 | 3 | 1 | 0 |
5 | 31C | 31 | 5 | 6 | 2 | 2 | 2 | 3 | 2 | 1 | 1 | 0 |
6 | 31C | 31 | 6 | 6 | 2 | 4 | 3 | 4 | 2 | 1 | 1 | 0 |
307 | 62C | 62 | 47 | 6 | 8 | 8 | 8 | 2 | 8 | 1 | 0 | 1 |
308 | 62C | 62 | 48 | 6 | 6 | 7 | 7 | 4 | 3 | 3 | 0 | 1 |
309 | 62C | 62 | 49 | 6 | 5 | 8 | 6 | 6 | 4 | 6 | 1 | 0 |
310 | 62C | 62 | 50 | 6 | 6 | 5 | 7 | 7 | 7 | 4 | 0 | 1 |
311 | 62C | 62 | 51 | 6 | 7 | 6 | 8 | 6 | 7 | 2 | 1 | 0 |
312 | 62C | 62 | 52 | 6 | 6 | 7 | 6 | 6 | 7 | 3 | 1 | 0 |
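A view like table 3.2, showing only the first and last rows of a dataset, is what `head()` and `tail()` produce. The sketch below illustrates these together with `str()` and `summary()`; the data frame `toy` and all its values are made up for illustration and are not the course data.

```r
# Made-up stand-in for the coffee data (values invented for illustration)
toy <- data.frame(
  Sample     = rep(c("31C", "62C"), each = 10),
  Temperatur = rep(c(31, 62), each = 10),
  Liking     = c(2, 3, 1, 4, 2, 3, 5, 2, 3, 4, 8, 6, 5, 6, 7, 6, 7, 8, 6, 7)
)

str(toy)            # compact overview: dimensions, column names and types
print(head(toy))    # first six rows
print(tail(toy))    # last six rows
print(summary(toy)) # per-column summary (quartiles and mean for numeric columns)

# First and last three rows stacked, similar in spirit to table 3.2
toy_ends <- rbind(head(toy, 3), tail(toy, 3))
print(toy_ends)

# View(toy) opens a spreadsheet-style viewer; it needs an interactive session
```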
- Import the data.
  - Be aware that the function `read.xls()` is not in the base library, so you need to install and load the package that provides it (for example gdata).
- Subsample on one temperature.
  - Below (listing 3.1) two alternatives for doing this are listed.
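As a minimal sketch (not necessarily identical to listing 3.1), two common ways of picking out one temperature are logical indexing and `subset()`. The data frame `toy` and its values are made up for illustration; with the real data, the imported data frame takes its place.

```r
# Made-up stand-in for the coffee data (values invented for illustration)
toy <- data.frame(
  Temperatur = rep(c(31, 56, 62), each = 3),
  Liking     = c(2, 3, 1, 6, 7, 5, 8, 6, 7)
)

# Alternative 1: logical indexing on the rows
toy56a <- toy[toy$Temperatur == 56, ]

# Alternative 2: the subset() function
toy56b <- subset(toy, Temperatur == 56)

# Descriptive statistics for one descriptor at this temperature
x <- toy56a$Liking
print(c(mean = mean(x), median = median(x), sd = sd(x),
        IQR = IQR(x), min = min(x), max = max(x)))
```

Both alternatives return the same rows; `subset()` is often easier to read, while logical indexing is the more general mechanism.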
- Calculate the descriptive statistics for centrality (mean and median), dispersion (IQR, standard deviation and range) and extremes (min and max) for the distribution of data points for a single descriptor (e.g. Liking).
- Now do it for all temperatures.
  - You should get something like the table below (table 3.3).
Temp | N | Mean | Median | Std | Min | Max |
---|---|---|---|---|---|---|
31 | 52 | 3.576923 | 3 | 1.649078 | 1 | 7 |
37 | 52 | 4.750000 | 5 | 1.780890 | 1 | 7 |
44 | 52 | 5.826923 | 6 | 1.605397 | 2 | 9 |
50 | 52 | 5.961538 | 6 | 1.596092 | 2 | 8 |
56 | 52 | 5.788462 | 6 | 1.550920 | 2 | 9 |
62 | 52 | 6.173077 | 6 | 1.367998 | 2 | 8 |
This can be quite tedious and result in a lot of code. However, the functions `summary()` and `aggregate()` are very efficient at producing such results. Try to check out these functions and see if you can use them to generate summary statistics. The code below does exactly what you want, without too many lines of code, using `aggregate()`.
```r
# Include only responses (columns 2 to 10 of the imported data)
CoffeeDT <- Coffee[, 2:10]
# Run aggregate() once for each type of summary, grouped by temperature
tmpN  <- aggregate(CoffeeDT, by = list(CoffeeDT$Temperatur), FUN = length)
tmpM  <- aggregate(CoffeeDT, by = list(CoffeeDT$Temperatur), FUN = mean)
tmpM2 <- aggregate(CoffeeDT, by = list(CoffeeDT$Temperatur), FUN = median)
tmpS  <- aggregate(CoffeeDT, by = list(CoffeeDT$Temperatur), FUN = sd)
tmpMi <- aggregate(CoffeeDT, by = list(CoffeeDT$Temperatur), FUN = min)
tmpMx <- aggregate(CoffeeDT, by = list(CoffeeDT$Temperatur), FUN = max)
# Merge the Liking summaries into one table
tmp <- cbind(tmpM$Temperatur, tmpN$Liking, tmpM$Liking, tmpM2$Liking,
             tmpS$Liking, tmpMi$Liking, tmpMx$Liking)
# Add a meaningful label to each column
colnames(tmp) <- c('Temp', 'N', 'Mean', 'Median', 'Std', 'Min', 'Max')
print(tmp)
```
- The above is done for Liking; try to do it for some of the other responses.
  - Hint: This can be done by repeating the code and exchanging `$Liking` with e.g. `$Bitter`. However, putting this in a for loop is another option.
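The for-loop option from the hint could look like the sketch below. It uses `tapply()` instead of repeated `aggregate()` calls so that one helper builds the whole table for a given response. The helper `summarise_response()`, the data frame `toyDT` and all values are made up for illustration; with the real data, `CoffeeDT` would take the place of `toyDT`.

```r
# Made-up stand-in for CoffeeDT (values invented for illustration)
set.seed(42)
toyDT <- data.frame(
  Temperatur = rep(c(31, 44, 62), each = 5),
  Liking     = sample(1:9, 15, replace = TRUE),
  Sour       = sample(1:9, 15, replace = TRUE),
  Bitter     = sample(1:9, 15, replace = TRUE)
)

# Summary table for one response, grouped by serving temperature
summarise_response <- function(data, response) {
  x   <- data[[response]]
  grp <- data$Temperatur
  data.frame(
    Temp   = sort(unique(grp)),
    N      = as.vector(tapply(x, grp, length)),
    Mean   = as.vector(tapply(x, grp, mean)),
    Median = as.vector(tapply(x, grp, median)),
    Std    = as.vector(tapply(x, grp, sd)),
    Min    = as.vector(tapply(x, grp, min)),
    Max    = as.vector(tapply(x, grp, max))
  )
}

# Loop over the responses and store one table per response
results <- list()
for (response in c("Liking", "Sour", "Bitter")) {
  results[[response]] <- summarise_response(toyDT, response)
}
print(results$Bitter)
```

Storing the tables in a named list keeps them available for later comparison, for example `results$Sour` next to `results$Bitter`.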
- What have you learned from analysing these data about the importance of serving temperature for the sensory properties as perceived by consumers?
  - Hint: You can run the code below to get a comprehensive overview. It is based on the means from `aggregate()`, but you might just as well check some of the other descriptive metrics. For instance, what does the standard deviation tell you about consumers in general, and do the type of sensory attribute and the serving temperature make a difference to the spread in scoring?
You might want to fix some of the labels in these figures. Check the documentation by typing `?matplot` and see how to add meaningful labels to the plot.
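As an illustration of the labelling options, the sketch below plots the mean and median Liking from table 3.3 (rounded to two decimals) against serving temperature. The axis titles, main title and legend position are choices made here, not part of the course code.

```r
# Mean and median Liking per serving temperature, taken from table 3.3
temp   <- c(31, 37, 44, 50, 56, 62)
liking <- cbind(Mean   = c(3.58, 4.75, 5.83, 5.96, 5.79, 6.17),
                Median = c(3, 5, 6, 6, 6, 6))

# One line per column of 'liking', with explicit axis labels and a title
matplot(temp, liking, type = "b", pch = 1:2, lty = 1:2, col = 1:2,
        xlab = "Serving temperature (degrees C)",
        ylab = "Liking score",
        main = "Liking versus serving temperature")

# A legend that matches the plotting symbols, line types and colours
legend("bottomright", legend = colnames(liking),
       pch = 1:2, lty = 1:2, col = 1:2)
```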