Hello world, I hope everyone’s Labor Day weekend was nice and relaxing!
This week in class we explored descriptive statistics. Our assignment was to calculate both central tendency and variation values for two data sets and analyze the difference between them. The two data sets were presented as follows:
set1 <- c(10, 2, 3, 2, 4, 2, 5)
set2 <- c(20, 12, 13, 12, 14, 12, 15)
Both data sets have 7 observations, which makes them easier to compare and contrast. The first thing I did was calculate the central tendency values for both sets. R comes equipped with functions to calculate median and mean but not the statistical mode (its built-in mode() function reports an object’s storage type instead), meaning we need to build our own function to calculate it. I was able to do so without much difficulty, and created a function that built a table of each number in the data set and its frequency. Ever unsatisfied, however, I wanted R to simply give me the most frequent value, which was a bit trickier! After reviewing the section on vector indexing in Norman Matloff’s The Art of R Programming, I was able to devise a solution that made me happy.
calculateMode <- function(x)
{
  sorted <- as.data.frame(table(x))
  m <- which.max(sorted[, 2])
  setMode <- as.numeric(as.character(sorted[m, 1]))
  return(setMode)
}
interquartile <- function(x)
{
  q <- quantile(x)
  iQ <- as.numeric(q[4] - q[2])
  return(iQ)
}
rnge1 <- range(set1)
range1 <- rnge1[2] - rnge1[1]
rnge2 <- range(set2)
range2 <- rnge2[2] - rnge2[1]
The first line of code here creates a function titled calculateMode that takes an input x. The first line of the function body uses R’s table() function to count the frequency of each number in x, then coerces the result into a data frame saved as the variable “sorted.” On the next line, the function which.max() grabs the row index of the highest count in our frequency column (the second column of sorted) and assigns it to the variable m.
This next line here gave me the most trouble by a landslide. Assigning sorted[m, 1] (which selects the number in the row with our largest frequency value) to setMode without any modification creates a factor, which is not what I wanted. In order to get sorted[m, 1] to store as a number, it first must be converted to a character, then to a numeric. The final line of code returns our nice, neat mode. Now, I’m positive this is not the most elegant or simple way to go about generating a mode, but I’m still a beginner and this function got the job done.
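To make the factor issue concrete, here is a small sketch that walks through the intermediate values for set1. The names freq, idx, wrong, and right are only for illustration and are not part of the assignment code.

```r
set1 <- c(10, 2, 3, 2, 4, 2, 5)

# Frequency table coerced to a data frame:
# column 1 holds the values as a factor, column 2 (Freq) holds the counts.
freq <- as.data.frame(table(set1))
idx <- which.max(freq[, 2])   # row index of the largest count

is.factor(freq[idx, 1])       # TRUE: the selected value is a factor level

# Converting the factor directly returns its internal level code,
# not the number it displays.
wrong <- as.numeric(freq[idx, 1])

# Going through character first recovers the actual value.
right <- as.numeric(as.character(freq[idx, 1]))
```

Here wrong comes out as 1 (the position of the level "2" among the factor levels), while right is 2, the actual mode. One thing to keep in mind: which.max() returns only the first maximum, so if two values tie for most frequent, this approach reports just the smaller one.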
The next function calculates the interquartile range of the input using R’s quantile() function, subtracting the first quartile from the third. The final four lines of code are to make my life a little easier, since R’s range() function returns the minimum and maximum values of a set and I wanted the difference of those two numbers. Let’s move on!
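As a side note, base R appears to cover both of these helpers already. The sketch below compares the hand-rolled interquartile calculation with the built-in IQR() function and uses diff() on range() to get the spread in one step. The variable names are mine, chosen only for this example.

```r
set1 <- c(10, 2, 3, 2, 4, 2, 5)

# Manual version: third quartile minus first quartile from quantile().
q <- quantile(set1)
iqrManual <- as.numeric(q[4] - q[2])

# IQR() uses the same default quantile type as quantile(),
# so the two results should agree.
iqrBuiltin <- IQR(set1)

# diff() on the two-element result of range() gives max - min.
spread <- diff(range(set1))
```

For set1 both interquartile calculations give 2.5 and spread is 8, matching the values used in the rest of this post.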
It’s time to generate the central tendency and variation values. I compiled the values for set 1 and set 2 into a central tendency table and a variation table (matrices built with cbind()) for easy comparison using the following code:
# Create central tendency matrix (cbind() of numeric vectors returns a matrix)
set1Central <- c(mean(set1), median(set1), calculateMode(set1))
set2Central <- c(mean(set2), median(set2), calculateMode(set2))
centralTendency <- cbind(set1Central, set2Central)
rownames(centralTendency) <- c("Mean", "Median", "Mode")
# Create variation matrix
set1Variation <- c(range1, interquartile(set1), var(set1), sd(set1))
set2Variation <- c(range2, interquartile(set2), var(set2), sd(set2))
variation <- cbind(set1Variation, set2Variation)
rownames(variation) <- c("Range", "Interquartile", "Variance", "Standard Deviation")
Now we have some nice neat tables to compare our data with! Let’s take a look.
| | set1Central | set2Central |
| Mean | 4 | 14 |
| Median | 3 | 13 |
| Mode | 2 | 12 |
| | set1Variation | set2Variation |
| Range | 8.000000 | 8.000000 |
| Interquartile | 2.500000 | 2.500000 |
| Variance | 8.333333 | 8.333333 |
| Standard Deviation | 2.886751 | 2.886751 |
Looking at the data sets directly, we can see that set 2 contains the same values as set 1 with each value increased by 10. Since the data in set 2 is distributed the same way as the data in set 1, just shifted, all of our variation calculations are identical. The shift does move the mean, median, and mode, however: each is exactly 10 higher than its set 1 counterpart. This week’s assignment was a great way to explore how changes to a data set can impact its central tendency and variation calculations, and it furthered my understanding of how each function operates relative to its data set.
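The relationship between the two sets can also be checked directly in R. This is just a quick sanity-check sketch; shift and the other new variable names are assumptions of mine for the example.

```r
set1 <- c(10, 2, 3, 2, 4, 2, 5)
set2 <- c(20, 12, 13, 12, 14, 12, 15)

# set2 should be set1 with a constant added to every value.
shift <- 10
sameShape <- all(set2 == set1 + shift)

# Measures of central tendency move by the shift...
meanDiff <- mean(set2) - mean(set1)
medianDiff <- median(set2) - median(set1)

# ...while measures of variation stay the same.
sameVar <- isTRUE(all.equal(var(set1), var(set2)))
sameSd <- isTRUE(all.equal(sd(set1), sd(set2)))
```

All three logical checks come back TRUE, and both differences equal 10, which lines up with the tables above.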
See you next week!