Analyzing shape and center of one quantitative variable (2024)

A series of tutorials by Mark Peterson for working in R

4.1 Load today’s data
4.2 Shape of a distribution
- 4.2.1 Histogram
- 4.2.2 Try it out (I)
- 4.2.3 Dot plot
4.3 Measuring the center
- 4.3.1 Pizza price
- 4.3.2 Pizza sales volume
4.4 Try it out (II)
4.5 Output

4.1 Load today’s data

Start by loading today’s data. If you need to download the data, you can do so from these links. Save each to your data directory, and you should be set.

survey.csv
Pizza_Prices.csv

# Set working directory
# Note: copy this from your old script to match your directories
setwd("~/path/to/class/scripts")

# Load data
survey <- read.csv("../data/survey.csv")
pizza <- read.csv("../data/Pizza_Prices.csv")

The pizza data gives the price of pizza and number of slices sold for each week in several cities. The survey data has a small selection of student responses to a class survey (combined over multiple terms). Run head() on each to see what we have to work with. You should get something like this (though not in the same format).

# Look at pizza data
head(pizza)

Week	BAL.Sales.vol	BAL.Price	DAL.Sales.vol	DAL.Price	CHI.Sales.vol	CHI.Price	DEN.Sales.vol	DEN.Price
1/8/1994	27982	2.76	58224	2.55	353412	2.34	58171	2.45
1/15/1994	26951	2.98	47699	2.74	264862	2.61	59348	2.40
1/22/1994	28782	2.78	59578	2.39	204975	2.77	63137	2.41
1/29/1994	32074	2.62	61595	2.49	208763	2.70	61271	2.29
2/5/1994	19765	2.81	64889	2.21	326558	2.45	70480	2.22
2/12/1994	22393	3.02	46388	2.75	176891	2.78	53496	2.48

# Look at survey data
head(survey)

class	height	birthMonth	genderID	pickRandomNumber	trueRandomNumber	sugaryDrink
Sophomore	63	October	Female	77	66	Soda
Freshman	62	June	Female	96	6	Pop
Sophomore	63	February	Female	76	97	Soda
Sophomore	64	June	Female	64	46	Pop
Sophomore	70	March	Male	80	93	Pop
Sophomore	62	October	Female	7	83	Pop

Now that we have an idea of the data we will be working with, we can take a look at some ways of plotting and analyzing the numerical variables in these sets.

4.2 Shape of a distribution

You have all probably heard of a “Bell curve” before, and probably have an intuitive sense that it looks something like this.

This single peak with a symmetrical (balanced) fall in both directions is a common model used to describe data of many kinds. Statisticians (and anyone who analyzes data) are often concerned about whether their data look at least sort of like that bell curve. There are some tests to determine whether your data match this curve, but nothing works quite as well as your eyes. So, here, we will plot a few distributions to develop a sense for this shape.
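If you want to draw the shape yourself, here is a minimal sketch. It assumes the standard normal curve (mean 0, standard deviation 1) as the example bell curve, which is my choice for illustration and not something tied to today’s data.

```r
# Draw a bell curve using the standard normal density (an illustrative choice)
curve(dnorm(x), from = -4, to = 4,
      main = "A bell curve",
      xlab = "Value", ylab = "Density")

# The curve is symmetrical: the height at -1 matches the height at 1
all.equal(dnorm(-1), dnorm(1))
```

The single peak sits at 0 and the curve falls away evenly on both sides, which is the balance we will be looking for in real data.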

4.2.1 Histogram

Perhaps the most common plot to assess shape is the histogram. You have all likely encountered histograms before, though may not have heard them called that. A histogram simply takes a set of numeric data, puts it in bins that capture a range of data, then plots the amount in each bin. This is very similar to the bar plots we made from categorical data, except that the bins are always ordered in a histogram and are selected somewhat arbitrarily.
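To make the idea of “bins” concrete, here is a rough sketch of the counting a histogram does. The values and the bin edges are made up for illustration; cut() and table() stand in for the sorting and counting steps.

```r
# Hypothetical values, made up for illustration
values <- c(2, 3, 5, 7, 8, 8, 9, 12, 14, 19)

# Bin edges chosen by hand: 0-5, 5-10, 10-15, 15-20
edges <- c(0, 5, 10, 15, 20)

# Sort each value into a bin, then count how many landed in each
bins <- cut(values, breaks = edges, include.lowest = TRUE)
manual_counts <- table(bins)
manual_counts

# hist() does the same counting; plot = FALSE returns the numbers without drawing
hist_counts <- hist(values, breaks = edges, plot = FALSE)$counts
hist_counts
```

Both approaches give the same count for each bin. The histogram simply draws those counts as bars.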

In this section, we will make histograms to look at a few distributions. However, instead of just telling you what function you should use, I want to demonstrate the process of finding it.

There are two general ways to search for things that you want to accomplish in R: on the internet and within R. Try searching Google for “plot histogram in R” and see what comes up. One of the top few results will almost always have what you want, though it may take a few tries to get exactly the one you want. When I searched, one of the top hits was a long description of how histograms can be made (Link). It goes into a lot more detail than we need, but is an example of what you can find with just a little digging.

Within R, we can extend the single question mark (?) search approach we used before. If we want to search for a term, but don’t know the function it is associated with, we can use two question marks, and R will search for the term in all its help files. Try it here by typing ??histogram (comment it out after running to make your future easier).

This pulls up a list of functions that are tagged with “histogram” in some way along with a brief description of what the function does. In this list, it looks like the function hist may do what we want. Clicking on it opens the full help page, which describes what it does and the options we can set.
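A related tool, which this chapter does not otherwise use, is apropos(). It searches the names of the functions and objects currently available, not the help files, so it can be a quick way to check a guess at a function name. A small sketch:

```r
# List every available function or object whose name contains "hist"
matches <- apropos("hist")
matches

# Peek at the arguments hist() accepts without opening the help page
args(hist)
```

The list will vary with the packages you have loaded, but hist should be in it.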

4.2.1.1 Histogram of Pizza price

Now, we can use this newfound function to look at the distribution of pizza prices (cost by slice) in Denver. That data is in the column DEN.Price.

# Histogram of Denver pizza price
hist(pizza$DEN.Price)

After you see the plot, change the axis and main label to describe the plot better. It should look something like this:
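The title and axis labels are set with the main =, xlab =, and ylab = arguments of hist(). Here is a minimal sketch that uses a handful of made-up prices instead of the Denver column so that it runs on its own. The label text is only a suggestion.

```r
# Hypothetical slice prices, made up so this example runs on its own
example_price <- c(2.22, 2.29, 2.40, 2.41, 2.45, 2.48, 2.55, 2.61, 2.70)

# main = sets the title; xlab = and ylab = set the axis labels
hist(example_price,
     main = "Example distribution of pizza prices",
     xlab = "Price per slice ($)",
     ylab = "Number of weeks")
```

The same three arguments work when you swap in pizza$DEN.Price.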

4.2.1.2 Denver sales volume

However, what about sales volume? Does the number of slices sold per week look similarly uni-modal (statistician for one-peak)? Let’s look and find out. Modify the above code to make a histogram like this one:

The “e+04” part of the label is shorthand for scientific notation and means “multiply the first number by 10⁴” (a 1 followed by 4 zeroes). So, this graph ranges from 20,000 to 100,000 slices sold per week.
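If the notation is unfamiliar, you can check it directly in R. This is a small sketch; the two tick values are taken from the range described above, and format() is just one way of writing them out in full.

```r
# Scientific notation is just another way of writing the same number
2e+04 == 20000

# Axis tick values like those on the plot, written out in full
ticks <- c(2e+04, 1e+05)
labels <- format(ticks, big.mark = ",", scientific = FALSE, trim = TRUE)
labels
```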

More importantly, does this plot look as symmetrical as the previous plot? We call this difference from symmetry “skew”, and name it based on where the longer “tail” is. So, here, the graph goes much further to the right than the left, giving us a “right-skewed” distribution. Skew like this can cause some problems when we start asking where the center of the data is.

Finally, I am guessing that, even after changing labels, your histogram still looked a little different than mine. As mentioned above, the “bins” that catch the numbers are arbitrary. Changing the size of those bins can often reveal or conceal an interesting pattern. There are lots of ways to change the number of bins, but the easiest is by setting the parameter breaks = to a number close to the number of breaks you want. (R won’t always let you set the exact number of bins because it wants to make sure that the breaks are sensible, e.g., that the breaks are at even 100’s rather than every 12.785 units. R uses the function pretty to pick these breaks, though you can pass in a full list of breaks if you prefer to set an exact number.) Try it with this code, and play with different numbers to see how it affects the look of the histogram. Pick a good value and leave it in your script.

# Play to find a good number of breaks
hist(pizza$DEN.Sales.vol, breaks = 100)
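To see the difference between suggesting a number of breaks and setting them exactly, here is a sketch on simulated data. The data and the 12.5-unit bin width are made up for illustration.

```r
# Hypothetical data, made up for illustration
set.seed(42)
example_sales <- round(runif(200, min = 0, max = 100))

# breaks = as a single number is only a suggestion ...
suggested <- hist(example_sales, breaks = 7, plot = FALSE)
length(suggested$counts)   # the number of bins R actually used

# ... because R picks "pretty" break points
pretty(range(example_sales), n = 7)

# Passing a full vector of break points sets the bins exactly
exact_breaks <- seq(from = 0, to = 100, by = 12.5)
exact <- hist(example_sales, breaks = exact_breaks, plot = FALSE)
exact$breaks
```

When you pass a full vector, the break points must cover the whole range of the data, or hist() will stop with an error.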

4.2.1.3 Student height

As a final histogram example, let’s look at the distribution of student heights. Using the height column in the survey data, make a plot that looks like this:

Does this plot have a single peak, or more than one? Here, it looks like there are two peaks, so we call this plot bi-modal (two modes). Like with skewness, this can cause problems for talking about center.

4.2.2 Try it out (I)

Make a histogram of the pickRandomNumber column, which contains a number that students were asked to select at random, and decide whether or not it is skewed (include your assessment as a comment). Why do you think it might follow this pattern?

4.2.3 Dot plot

Let’s take a look at one more kind of plot for looking at distributions: the dot plot. A dot plot puts a mark (a dot) on the graph for each entry in the data, but it stacks up any that are at the same place. So, it can look a lot like a histogram, but with each value as its own bin. As you hopefully saw above, having more bins can be either really helpful or really problematic. In R, the function to make a dot plot is stripchart, though it is not nearly as pretty a plot as many of the others (without substantial additional arguments). (In my opinion, the lack of a built-in function to make dot plots easily is because dot plots are rarely used any more. They were much more useful when working by hand because they could be built as you went along.) The function is set to work on continuously varying data, so we need to set the method = "stack" argument to make it build up the piles we want.

The dot plot is included here for completeness, but we will only rarely encounter it for the rest of the semester.

4.2.3.1 Student height

Let’s start by looking at the bi-modal student height data. At first, it seems quite odd that something like height should have such sharp peaks, but I think we may see soon why that is the case.

# Make a dotplot
stripchart(survey$height, method = "stack")

Now, what we can see is that there are two specific heights (63 inches and 70 inches) that have a large number of cases. This suggests that students with heights near those heights may have “rounded” to be those heights (5’3" and 5’10" respectively). This may have been difficult to see from the histogram, but gives us a hint that survey data may be less than completely reliable (more about that later).

We can replicate this, however, by setting the number of bins in the histogram to a very large number, which I think looks a little nicer.

# Histogram like dot plot
# Note: a large number of breaks gives (roughly) one bin per height
hist(survey$height, breaks = 100)

However, it is possible with some tweaking to make nice dot plots. The effort is just more than I want to throw at you right now. Throughout the semester we will add these skills, but they are not necessary now.
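The piles at particular heights can also be checked with numbers instead of by eye. Here is a sketch using table(), which counts how often each value occurs. The heights are made up to mimic the rounding pattern described above and are not the survey data.

```r
# Hypothetical heights in inches, made up to mimic rounding to 63 and 70
example_height <- c(61, 62, 63, 63, 63, 63, 64, 65, 66, 68,
                    69, 70, 70, 70, 70, 71, 72)

# Count how many times each height occurs (the height of each stack of dots)
height_counts <- table(example_height)
height_counts

# Show the most common heights first
sort(height_counts, decreasing = TRUE)

# The same stacks as a dot plot
stripchart(example_height, method = "stack")
```

Running table(survey$height) gives the same kind of summary for the real responses.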

4.3 Measuring the center

Now that we have a sense of the shape of distributions, let’s talk a little about measuring their center. For this, let’s create a simple variable that will let us see exactly what is happening.

# Make sample variable
test <- c(1, 2, 3, 6, 7, 8, 9)

When most people think of the center, they think of the “average”: you add up all of the values and divide by the number of values. In notation, this can look like:

\[\text{mean}=\frac{\text{sum of values}}{\text{number of values}}=\frac{\sum\limits_{i=1}^nx_i}{n}\]

So, for our test variable, which has 7 values, we could calculate the mean like this:

# Calculate the mean by "hand"
sum(test) / 7

## [1] 5.142857

However, R already has a function to do that automatically for us: mean().

# Calculate the mean more easily
mean(test)

## [1] 5.142857

It gives the same result, but will do the work for us. The function mean() takes any numeric variable as its argument, making it great to use in a large number of circumstances.

However, there is one other way to think about the “center”. Sometimes we might be interested in finding the middle value of a variable. This is called the “median” and is the number in a variable for which there are just as many values greater than it as there are less than it. In R, we find the median with the function median(), which works just like mean().

# Find the median
median(test)

## [1] 6

Note here that it is our fourth value from test and that there are three numbers larger than it and three numbers smaller than it. As you can see, the mean and median are often similar, but their differences can tell us important things, as we will see below. Each measure of the center is good for slightly different things.
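To see what median() is doing, here is a sketch of finding the middle value by hand. It uses the same numbers as test, shuffled to show that the order you enter them in does not matter; the variable names are made up for this example.

```r
# The same values as test above, deliberately out of order
values <- c(8, 1, 9, 3, 7, 2, 6)

# Step 1: put the values in order
ordered <- sort(values)

# Step 2: with an odd number of values, take the one in the middle position
middle <- (length(ordered) + 1) / 2
ordered[middle]

# With an even number of values there is no single middle value,
# so median() averages the two in the middle
median(c(1, 2, 3, 6))
```

The by-hand answer matches median(values). For the four-value example, the result is halfway between 2 and 3.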

4.3.1 Pizza price

For our first actual example, let’s look again at the price of pizza in Denver. Start by plotting the histogram again so that we have a sense of what we are looking at.

Calculating the mean we see:

# Mean price of pizza
mean(pizza$DEN.Price)

## [1] 2.566026

So, it will cost about $2.57 for a slice of pizza on average. What about the median?

# Median price of pizza
median(pizza$DEN.Price)

## [1] 2.56

This tells us that half of the time we will pay more than $2.56, and half the time less than that. In this case, the mean and median are very similar, which shouldn’t surprise us as the distribution is symmetrical and both are very near the middle. But, what happens when the distribution isn’t symmetrical?

4.3.2 Pizza sales volume

Let’s look again at the sales volume of pizza in Denver.

We can calculate the mean and the median the same way as before:

# Mean sales volume
mean(pizza$DEN.Sales.vol)

## [1] 45742.89

# Median sales volume
median(pizza$DEN.Sales.vol)

## [1] 43158

Note that now they are rather different, off by about 2,600 slices. This is because the very large numbers at the far right of the distribution pull the mean up by a lot, but each counts as only one more value above the median. This becomes clearer if we think about where the values fall on the plot.

So, let’s add the lines. The function abline() can be used to add lines to plots in a lot of ways (many of which we explore in coming chapters). Here, we just want to add a vertical line at the site of the mean and median, so we use the argument v = and pass in the mean or median. We will also use the argument col = to set the color. (There are a lot of colors built into R. We will explore them more soon using the function colors() and apply them to a lot of plots.)

# Add lines to the histogram
abline(v = mean(pizza$DEN.Sales.vol), col = "red")
abline(v = median(pizza$DEN.Sales.vol), col = "blue")

So, we can now see that the median is still right in the biggest peak, while the mean has been pulled away from it, towards the tail. Identifying when the mean and median differ is one big clue towards identifying skew (and the mean usually differs in the direction of the skew).
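The pull of extreme values on the mean is easy to see on a small scale. Here is a sketch using the test variable from earlier along with a copy in which the largest value has been replaced by something extreme. The value 90 is made up for illustration.

```r
# The test values from earlier, and a copy with the largest value made extreme
test <- c(1, 2, 3, 6, 7, 8, 9)
test_extreme <- c(1, 2, 3, 6, 7, 8, 90)

# The mean moves a long way toward the extreme value ...
mean(test)
mean(test_extreme)

# ... but the median does not move at all
median(test)
median(test_extreme)
```

One extreme value is enough to drag the mean well above most of the data, while the median still only cares that the value sits somewhere above the middle.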

4.4 Try it out (II)

Calculate the mean and median of the random numbers chosen by students (the pickRandomNumber column). Report the results and interpret what they mean. Plot the lines over the histogram as well – you should get something like this:

4.5 Output

As with other chapters, click the notebook icon at the top of the screen to compile an HTML output. This will help keep everything together and help you identify any errors in your script.
