Working with control structures in R.
This past weekend was the Super Bowl. I thought for this post I would use a data set that was posted on Kaggle that contains data on the Super Bowl games from 1967 to present. This data set was put together by Timo Bozsolik.
In this post, I’m going to get into control structures. DataQuest says that control structures are necessary for repeated application or to execute an action only if a condition is met. Otherwise, writing code would be very tedious and not as fun.
I admit I struggled with control structures in other languages. However with R, I finally feel like I have a grasp on this concept. I feel like something clicked in this lesson. So without further delay, let’s get into it!
I’ll first load the tidyverse and janitor packages. I learned about the janitor package about a few days ago. It has simple functions for examining and cleaning data. This package came in handy for cleaning up the names of the columns in this dataset.
After installing and loading the janitor package, I imported the data. Next, I created a new object called clean_superbowl, that has the clean names. The last line of code gives you a side by side comparison of the column names of the imported data and the cleaned data frame. You can find more information about the janitor package here.
library(tidyverse)
library(janitor)
superbowl <-read_csv('/Users/User/datascidani2/_posts/2020-02-06-control-structures/superbowl.csv')
clean_superbowl <-janitor::clean_names(superbowl)
data.frame(superbowl = colnames(superbowl), clean_superbowl = colnames(clean_superbowl))
superbowl clean_superbowl
1 Date date
2 SB sb
3 Winner winner
4 Winner Pts winner_pts
5 Loser loser
6 Loser Pts loser_pts
7 MVP mvp
8 Stadium stadium
9 City city
10 State state
A conditional statement is a type of control structure that performs different calculations or actions based on whether or not a predefined condition is TRUE or FALSE. You can express conditional statements using comparison operators. One type of conditional statement is called an if statement. Let’s look at the screenshot below. I created a conditional statement that looks at the titles of the Super Bowl games. In this case, the condition is the first element of the Super Bowl title column being greater than 53. The action is the print statement, “Jennifer Lopez and Shakira performed at this Super Bowl.
if (clean_superbowl$sb[1] > 53) {
print("JLo and Shakira performed at this Superbowl.")
}
[1] "JLo and Shakira performed at this Superbowl."
I could add another type of conditional statement called the else statement. I can combine my if and else statements to form what is called an if/else statement. An action is executed if the condition in the if statement is TRUE. If that condition is not TRUE, the action in the else statement is executed.
if (clean_superbowl$sb[1] <= 53) {
print("JLo and Shakira performed at this Superbowl.")
}else{
print("No performance from JLo and Shakira at this Superbowl.")
}
[1] "No performance from JLo and Shakira at this Superbowl."
There is also something called nested else/if statements where an else/if statement is written within another else/if statement like so…
if(clean_superbowl$sb[1] > 53){
print("JLo and Shakira performed at this Superbowl.")
}else if(clean_superbowl$sb[1] < 53) {
print("No performance from JLo and Shakira at this Superbowl.")
}else if (clean_superbowl$sb[1] == 53){
print("Maroon 5 played at the 2019 Superbowl.")
}
[1] "JLo and Shakira performed at this Superbowl."
The problem with conditional statements is that they are inefficient. They are lengthy and if you have a situation where you have multiple conditional statements, it becomes repetitive. Another type of control structure solves the problem of repetition: For-loops.
For-loops perform an operation a given number of times, allowing me to execute a piece of code repeatedly on elements of a sequence. For example, let’s say I want to print all the Super Bowl titles. I could write a for-loop that looks like this:
for(i in clean_superbowl$sb) {
print(i)
}
Let’s examine this for-loop. The index variable i represents the element of a sequence. I can read this as “for every element in the sb column of the clean_superbowl data frame, print the element.” Although not shown here, this for-loop printed every single Super Bowl title until it reached the end of all the titles(in this case Super Bowl I (1)).
I don’t always have to use i. I could use any variable name. The variable name should describe what the variable represents making code more readable. I also would avoid using common names as variables. For example, I would not use sum as a variable name because it can cause problems if I wanted to use the sum() function later on.
When writing a for-loop, the elements I specify can be values, vectors, or data structures. In this next example, I wrote a for-loop to execute an operation on elements that are rows in a data frame. This for-loop calculates how many points the winning team won by for each game.
Let’s break this down further. I’ll start by explaining the first line of code. In the clean_superbowl data frame, each match has its own row. Since I want to perform the subtraction operation for each row of the data frame, the first part of the for-loop will consist of defining i as an element of the sequence of numbers from one to fifty-four (the number of rows in the data frame).
The nrow(clean_superbowl) returns the number of rows in the data frame. The print() function is used to display the results.
As I mentioned previously, executing one or more control structures inside another is called nesting. A for-loop can be used to loop over conditional statements. In this example, I used a for-loop, that for each row in the clean_superbowl data frame, print “Aww it was so close!” if the difference between the winner points and loser points is less than 10 and ” A Total Blowout!” if not.
Though I can print out the output of my for-loop, I ultimately want to store the output of my for-loop in an object so that I can use it. In this next example, I wanted to calculate the total number of points scored in each Super Bowl game. I first created an empty vector, total_points_scored. I then wrote the for-loop to add new elements into the vector. The new elements are the sums of winner_pts and loser_pts for each game.
total_points_scored <-c()
for(i in 1:nrow(clean_superbowl)) {
total_points_scored <-c(total_points_scored, clean_superbowl$winner_pts[i]+ clean_superbowl$loser_pts[i])
}
total_points_scored
[1] 51 16 74 62 34 52 51 65 38 56 48 50 31 46 31 45 61 69 37 41 39 53
[23] 55 56 44 75 43 69 61 39 65 36 52 59 56 54 47 44 47 37 50 66 37 46
[45] 38 22 31 21 27 29 30 23 47 45
After running the for-loop, the total_points_scored vector contains a sum winner_pts and loser_pts for each game.
There are times where my code would need to specify more than two outcomes. This is where selection control statements come in. Selection control statements allow me to specify more than two outcomes by adding else if statements to my code. Let’s say if I wanted to specify three conditions: games that were won by less than or equal to 10 points, games that were won by greater than or equal to 11 points and less than or equal to 20 points, games that were won by greater than 21 points. I can write the following:
superbowl_won_by <-c()
for(i in 1:nrow(clean_superbowl)) {
if(clean_superbowl$winner_pts[i]-clean_superbowl$loser_pts[i] < 10) {
superbowl_won_by <- c(superbowl_won_by, "A Close Game!")
}else if (clean_superbowl$winner_pts[i]-clean_superbowl$loser_pts[i] >= 11 & clean_superbowl$winner_pts[i]-clean_superbowl$loser_pts[i] <= 20){
superbowl_won_by <- c(superbowl_won_by, "A Total Blowout!")
} else if (clean_superbowl$winner_pts[i]-clean_superbowl$loser_pts[i] > 21){
superbowl_won_by <- c(superbowl_won_by, "Go back to the drawing board!")
}
}
superbowl_won_by[1:20]
[1] "A Total Blowout!" "A Close Game!"
[3] "A Close Game!" "A Total Blowout!"
[5] "A Close Game!" "Go back to the drawing board!"
[7] "A Close Game!" "A Close Game!"
[9] "A Close Game!" "A Total Blowout!"
[11] "A Close Game!" "A Close Game!"
[13] "A Total Blowout!" "A Total Blowout!"
[15] "A Close Game!" "A Close Game!"
[17] "Go back to the drawing board!" "A Close Game!"
[19] "Go back to the drawing board!" "A Close Game!"
As you can see, each row printed a statement based on the condition I specified.
Okay, that’s all for this post. I’m signing off for now! Until next time…
For attribution, please cite this work as
Brantley (2020, Feb. 6). Data Sci Dani: Control Structures. Retrieved from https://datascidani.com/posts/2020-02-06-control-structures/
BibTeX citation
@misc{brantley2020control, author = {Brantley, Danielle}, title = {Data Sci Dani: Control Structures}, url = {https://datascidani.com/posts/2020-02-06-control-structures/}, year = {2020} }