How to create scatter plots in R.
In the last few posts, I went over different types of plots: line graphs, bar charts, histograms, and box plots. In this post, I’ll go over the scatter plots lesson that is part of DataQuest’s Data Analyst in R track. The DataQuest lesson is called Scatter Plots for Exploratory Analysis.
The month of March is Women’s History Month and in honor of Women’s History Month, I decided to use the data sets in the Tidy Tuesday repo called Women in the Workplace. These data sets are about women in the workforce, specifically jobs, employment, and earnings. With that being said let’s go into scatter plots.
Scatter plots represent data using points that have the value of one variable determining the position on the x-axis and the value of the other variable determining the position on the y-axis. Unlike the other plots I’ve studied, scatter plots do not require that values of one variable depend on the values of another variable. Instead, I’m looking for any sort of relationship between the variables.
There are different types of relationships between variables. I’ll use variable_x and variable_y as examples here:
I can create a scatter plot similar to how I created plots before. However, the layer geom_point() is what creates a scatter plot. Let’s look at an example from a data set I’m working with. The following code and scatter plot look at female earnings by year. As you can see, the format for the ggplot and aes layers are the same as I did with previous plots. The difference is the geom_point() layer which differentiates the scatter plot from other kinds of plots.
ggplot(data = earnings_female) +
aes(x = Year, y = percent) +
geom_point()
What kind of relationship do the variables have in this example? What can you conclude by looking at this scatter plot?
There are a few layers that I could add that would make my plot easier to interpret. As I discussed in my multiple line graphs post, I could adjust my axis ranges. I can specify a scatter plot’s x- and y-axis ranges by adding a separate layer for each to my plot using xlim() and ylim(). In this example, I adjusted the x-axis to show the years 1981 through 2011 and y-axis to specify ranges of 50 through 100.
You’ll also notice in this block of code the argument alpha = . This argument makes the points on the plot more transparent. Alpha uses values from 0 through 1 to determine transparency with the value of 0 specifying complete transparency and the value of 1 specifying complete opacity. The intermediate values specify varying degrees of transparency. DataQuest says that alpha values of 0.2 to 0.5 usually will help make overlapping points easier to see.
I could create multiple scatter plots by copying and pasting the code block from the second example. However, in programming if you find yourself copying and pasting repeatedly, there is likely a much more efficient solution.
This brings us to functions. I could use what I learned about writing custom functions and visualizing data to write a function to create a scatter plot. I could then use my knowledge of functionals to apply my function to the data to create multiple scatter plots at once.
Let’s look at the code below. I decided to use another data set within the the Women in the Workplace repo that looks at employment for both men and women. However, I decided to focus on the women here.
As you can see the code within the function is similar to what I’ve done previously. However instead of using aes() to map variables to axes, I would use aes_string() which allows me to pass vectors of variable names into my function.
I can then use a functional from the purrr package to apply the create_scatter() function to the variables I want to look at relationships between. In this case, I’m looking for a relationship between year and the percentages of women working part-time and full-time.
The create_scatter() function is a two-variable function so I need to use the functional map2(). Recall from my post on functionals that map2() takes two variables as arguments and returns a list.
So what kind of output am I looking to obtain with this function? I want to create two scatter plots exploring the following relationships:
I assigned the year variable to the x-axis using x_var. The y-axis will consist of variables I want to compare: part_time_female and full_time_female and I assigned these variables to y_var. In this instance, I decided to type out the names of the variables. However, I could do this another way. I could index the data frame to select specific rows and use the names() function to extract row names.
The result of this code are the scatter plots below.
create_scatter = function(x,y) {
ggplot(data = employed_gender) +
aes_string(x = x, y = y) +
geom_point(alpha = 0.3) +
xlim(1970, 2015) +
ylim(20, 100) +
theme(panel.background = element_rect(fill="white"))
}
x_var <- "year"
y_var <-c("part_time_female", "full_time_female")
map2(x_var, y_var, create_scatter)
[[1]]
[[2]]
Well this just about does it for scatter plots. This was another really fun lesson to work on. The next post will be about a project that covers what I learned thus far. Until next time…
For attribution, please cite this work as
Brantley (2020, March 11). Data Sci Dani: Scatter Plots. Retrieved from https://datascidani.com/posts/2020-03-11-scatterplots/
BibTeX citation
@misc{brantley2020scatter, author = {Brantley, Danielle}, title = {Data Sci Dani: Scatter Plots}, url = {https://datascidani.com/posts/2020-03-11-scatterplots/}, year = {2020} }