Hello world!
This week in advanced statistics we covered random variables and probability distributions. We were given three questions to answer, the first of which reads as follows:
A. Consider a population consisting of the following values, which represent the number of ice cream purchases during the academic year for each of five housemates.
8, 14, 16, 10, 11
a. Compute the mean of this population.
b. Select a random sample of size 2 out of the five members.
c. Compute the mean and standard deviation of your sample.
d. Compare the mean and standard deviation of your sample with those of the entire population.
In order to solve this problem, I will be using R! Here is the code I generated for this example:
##Establishes ice cream purchase data set for the five housemates
ice_cream <- c(8, 14, 16, 10, 11)
#Finds the mean and standard deviation of the full data set (note: sd() divides by n - 1)
mean_ice_cream <- mean(ice_cream)
sd_ice_cream <- sd(ice_cream)
#Draws one random sample of size 2 from ice_cream (without replacement)
sample_ic <- sample(ice_cream, 2)
#Sets up a data frame for easy plotting
ice_cream_frame <- as.data.frame(rbind(c(mean_ice_cream, sd_ice_cream)))
colnames(ice_cream_frame) <- c("Mean", "Standard Deviation")
#Gets the mean and standard deviation of the sample
mean_sample_ic <- mean(sample_ic)
sd_sample_ic <- sd(sample_ic)
#Sets up a data frame for easy plotting
sample_ic_frame <- as.data.frame(rbind(c(mean_sample_ic, sd_sample_ic)))
colnames(sample_ic_frame) <- c("Mean", "Standard Deviation")
#Plots mean (x) against standard deviation (y); ylim is set so the sample point added below is visible
plot(ice_cream_frame, type = "p", ylim = c(0, 3.5), col = "red")
points(sample_ic_frame, col = "blue")
This code found the mean and standard deviation of the original data set, which were 11.8 and 3.193744 respectively. Note that sd() uses the sample formula (dividing by n - 1); treating the five housemates as the whole population and dividing by n instead gives about 2.857. It then drew a random sample of two values from the data set, which in this case were 11 and 10. The mean and standard deviation of the sample were 10.5 and 0.7071068 respectively. The code then arranged each pair of results into a data frame and plotted them. This was the plot produced:
In this plot, the original data set is red and the sample is blue. The original data set has a much higher standard deviation and a slightly higher mean than the sample. This makes sense, given that the two values selected by the sample() function were 10 and 11, which sit close together and just below the population mean. Let’s move on!
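One footnote on question A: the comparison depends on which standard deviation formula is used, and on which two housemates happened to be drawn. The sketch below is a side check rather than part of the original solution. It computes the population standard deviation by hand and lists every possible sample of size 2 using combn(); the variable names are my own.

```r
ice_cream <- c(8, 14, 16, 10, 11)

# sd() divides by n - 1, i.e. the sample standard deviation
sample_style_sd <- sd(ice_cream)

# Population standard deviation: divide the squared deviations by n instead
population_sd <- sqrt(mean((ice_cream - mean(ice_cream))^2))

# Every possible sample of size 2 (10 combinations), one per column
all_samples <- combn(ice_cream, 2)
all_sample_means <- colMeans(all_samples)

# The average of all possible sample means equals the population mean
mean(all_sample_means)
mean(ice_cream)
```

Any single sample of two can land well above or below 11.8, but averaged over all ten possible samples the sample mean recovers the population mean.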
B. Suppose that the sample size is n = 100 and the population proportion is p = 0.95.
1. Does the sample proportion p̂ have approximately a normal distribution? Explain.
2. What is the smallest value of n for which the sampling distribution of p̂ is approximately normal?
For the first question, the sample proportion does have an approximately normal distribution. We know this because p = 0.95 and q = 1 - p = 0.05, so multiplying by n = 100 gives np = 95 and nq = 5. The rule of thumb used here is that the sampling distribution is approximately normal when both np and nq are greater than or equal to 5, and both values meet that condition (nq only just). For the second question, the smallest value of n is 100: nq = 0.05n must be at least 5, which requires n ≥ 5 / 0.05 = 100, and n = 100 already sits exactly on that boundary.
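The arithmetic above can also be checked in R. This is a minimal sketch, assuming the "np and nq at least 5" rule of thumb from the answer; the variable names, the search range of 1 to 1000, and the small tolerance (there only to guard against floating-point rounding at the boundary) are my own choices.

```r
p <- 0.95
q <- 1 - p
threshold <- 5   # rule-of-thumb cutoff used in the answer above
tol <- 1e-9      # guards against floating-point rounding at the boundary

# Check the condition for n = 100
n <- 100
c(np = n * p, nq = n * q)

# Search for the smallest n where both np and nq reach the threshold
n_candidates <- 1:1000
meets_rule <- (n_candidates * p >= threshold - tol) &
  (n_candidates * q >= threshold - tol)
smallest_n <- min(n_candidates[meets_rule])
smallest_n
```

Raising threshold to a stricter cutoff would push the smallest n up accordingly, since nq is the binding condition here.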
The last question comes from our textbook:
C. Simulated coin-tossing is probably better done using rbinom than using sample. Explain how.
Unless replace = TRUE is set, the sample() function in R draws without replacement, so it never returns the same element twice. That also means it cannot return more values than the vector it samples from contains, which for a coin is two (heads or tails). In addition, sample() describes a coin toss only indirectly: you have to list the outcomes yourself and, for an unfair coin, pass a prob vector. The rbinom() function is a better fit because it draws pseudo-random values directly from a binomial distribution, given the number of values to generate (n), the number of trials per value (size), and the probability of success (prob). Here is a snippet of code that uses rbinom() to simulate 15 coin flips, with 0 = heads and 1 = tails.
rbinom(15, size = 1, prob = .5)
[1] 1 1 1 0 0 1 0 0 1 0 1 1 1 1 1
Here is what happens if we ask sample() for 15 flips without setting replace:
sample(1:2, 15)
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
And this is what it would look like if we added all the extra steps:
sample(0:1, 15, replace = TRUE, prob = c(0.5, 0.5))
[1] 0 1 0 1 0 0 0 1 1 1 1 1 1 1 0
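Another point in favor of rbinom() is that it can return the count of tails directly, rather than a vector of individual flips to add up. The following is a rough sketch of that idea, not something from the textbook; the seed and variable names are arbitrary choices of mine.

```r
set.seed(42)  # arbitrary seed, only so the run is repeatable

n_flips <- 15

# One call: number of 1s (tails, using the coding above) in 15 fair flips
tails_count <- rbinom(1, size = n_flips, prob = 0.5)

# The same quantity via sample() needs an extra summing step
tails_count_sample <- sum(sample(0:1, n_flips, replace = TRUE))

# rbinom() also repeats the whole experiment easily:
# 1000 experiments of 15 flips each, one count per experiment
many_counts <- rbinom(1000, size = n_flips, prob = 0.5)
mean(many_counts)  # should land near n_flips * 0.5 = 7.5
```

Changing prob to something other than 0.5 simulates an unfair coin without touching anything else.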
See you next week!
