Final project: Which wine has the higher mean quality?

Abstract

The aim of this project is to determine whether there is a statistically significant difference in quality scores between red and white wines. Two datasets from the UCI Machine Learning Repository are used: one listing qualities of red wine and the other listing qualities of white wine. R and RStudio were used to conduct a two-sample t-test, which determined that white wine’s mean quality score is significantly higher than red wine’s.

For my final project in advanced statistics, I will be conducting inferential analysis on a dataset. The data I will be analyzing is part of a wine quality dataset retrieved from the UCI Machine Learning Repository. There are 12 variables in the set: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality. I’m particularly interested in seeing whether mean quality differs between red and white wine, so that’s what I will be testing.

Hypothesis

H0: Red and white wines have the same mean quality score. μ1 = μ2

H1: White wines have a higher mean quality score than red wines. μ1 > μ2

Samples

Sample 1:  White wine quality scores

Sample 2: Red wine quality scores

Before beginning analysis, I must clean the dataset because it’s not quite workable in its current format. Thankfully, the set is already well organized and will require minimal code to clean up! I used the following code to render the set workable (The full code for this project can be located on my GitHub page!): 

library("tidyr")  # provides separate()
# Imports the datasets; the files are semicolon-delimited, so read.csv()
# (which splits on commas by default) reads each row into a single column
red_wine <- read.csv("winequality-red.csv")
white_wine <- read.csv("winequality-white.csv")
colnames(red_wine) <- "x"
colnames(white_wine) <- "x"
# Cleans the datasets by splitting the single column into the 12 variables
red_wine <- separate(red_wine, x, c("fixed_acidity","volatile_acidity","citric_acid",
                                    "residual_sugar","chlorides","free_sulfur_dioxide",
                                    "total_sulfur_dioxide","density","pH","sulphates",
                                    "alcohol","quality"), sep=';')
white_wine <- separate(white_wine, x, c("fixed_acidity","volatile_acidity","citric_acid",
                                        "residual_sugar","chlorides","free_sulfur_dioxide",
                                        "total_sulfur_dioxide","density","pH","sulphates",
                                        "alcohol","quality"), sep=';')
# separate() returns character columns, so convert quality to numeric
w_wine_quality <- as.numeric(white_wine$quality)
r_wine_quality <- as.numeric(red_wine$quality)
# Initial plots of the quality scores
plot(r_wine_quality)
plot(w_wine_quality)
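As an aside, the splitting step can also be done at read time. The sketch below is not the code used in this project; it assumes the files are plain semicolon-delimited text with a quoted header row, and it reads a tiny in-memory sample (the variable `sample_text` and its values are illustrative only) so that it runs without the CSV files.

```r
# Two rows laid out like a semicolon-delimited file with a quoted header.
# The values are illustrative only, not taken from the analysis.
sample_text <- '"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.1;0.65;0.02;1.8;0.075;12;35;0.9977;3.50;0.55;9.5;5
7.9;0.85;0.01;2.5;0.095;24;66;0.9967;3.21;0.67;9.9;6'

# sep = ";" splits the columns while reading, so no separate() call is needed
wine <- read.csv(text = sample_text, sep = ";")

# read.csv() replaces the spaces in the header with dots, e.g. "fixed.acidity"
names(wine)

# The columns are parsed as numbers directly, so as.numeric() is not needed
str(wine$quality)
```

With the real files, the equivalent call would be read.csv("winequality-red.csv", sep = ";"), and the same for the white wine file.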

Pictured above are the two initial plots for our dataset. For both sets of data, we can see that 5 and 6 are the most frequent quality scores. Let’s move into analysis! Because I’m comparing two independent samples, I will conduct a two-sample t-test using the quality of red wine and the quality of white wine:
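For reference, a two-sample t-test in R is typically run with t.test(). The sketch below is a hedged illustration, not the exact call from this project: it uses simulated scores (the vectors `white_scores` and `red_scores` are made up for the example) so that it runs on its own, and it assumes R's defaults, which give a two-sided Welch test.

```r
set.seed(42)  # make the simulated scores reproducible

# SIMULATED quality scores, not the real wine data
white_scores <- round(rnorm(500, mean = 5.9, sd = 0.9))
red_scores   <- round(rnorm(300, mean = 5.6, sd = 0.8))

# Default call: Welch two-sample t-test (var.equal = FALSE), two-sided
two_sided <- t.test(white_scores, red_scores)

# One-sided version matching H1: mean(white) > mean(red)
one_sided <- t.test(white_scores, red_scores, alternative = "greater")

print(two_sided)

# For a one-sided "greater" test the upper confidence bound is Inf
one_sided$conf.int
```

With the real data, the first argument would be w_wine_quality and the second r_wine_quality. A confidence interval with a finite upper bound, like the one reported in this post, is what the default two-sided call produces; the alternative = "greater" form is the one that corresponds directly to H1.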

As pictured above, we can see that the mean quality of white wine is 5.878 and the mean quality of red wine is 5.636. The 95% confidence interval is (0.1951564, 0.2886173), meaning we can be 95% confident that the difference between the two population means (white minus red) is between 0.1951564 and 0.2886173. Since our p-value is < 2.2e-16, which is less than the significance level of 0.05, we reject H0 and conclude that the difference between the two population means is statistically significant. Finally, here is a box plot of the two datasets:
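A box plot of this kind can be drawn with base R's boxplot(). The sketch below again uses simulated scores rather than the real data (the vectors, labels, and title are illustrative assumptions), and it also shows how to inspect which points are drawn as outliers.

```r
set.seed(1)  # make the simulated scores reproducible

# SIMULATED quality scores standing in for the real quality vectors
white_scores <- round(rnorm(500, mean = 5.9, sd = 0.9))
red_scores   <- round(rnorm(300, mean = 5.6, sd = 0.8))

# Side-by-side box plots; boxplot() invisibly returns the plotted statistics
bp <- boxplot(list(White = white_scores, Red = red_scores),
              ylab = "Quality score", main = "Wine quality by type")

bp$stats  # one column per box: lower whisker, Q1, median, Q3, upper whisker
bp$out    # values drawn as individual outlier points
bp$group  # box index (1 = White, 2 = Red) for each outlier
```

With the real data, the list would be list(White = w_wine_quality, Red = r_wine_quality).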

We can see from the visualization that white wine has a higher outlying quality score than red wine, which may contribute to its larger mean.

Conclusion

There is significant evidence to reject H0. White wines have a higher mean quality score than red wines.


Throughout the semester, I have used t-tests to determine the statistical significance of differences in population means. In this project, I got to use a t-test to conduct an analysis that will be useful in my own life, and next time I’m in the market for wine, I now know that whites have a significantly larger quality mean than reds! 

Full code for this project: https://github.com/mondedumandarines/blog/blob/ed5c133676e7ca5b755d915812d0c7d0eb0292dc/finalproject
