In my last post, I went over for-loops. For-loops are a common programming task and it’s important in understanding how programming works. However, in R for-loops are not written as often as it is for other programming languages.
Many of R’s built-in functions contain for-loops. One could say that many R functions serve as an alternative to for-loops. Many R functions are vectorized meaning that you can use them to operate on all the elements of a vector quickly. Some of these functions, like sum() and mean() for example, I’ve covered in previous posts. Using vectorized functions not only makes code easier to understand, but its run time is faster than for-loops. As DataQuest explains, this is because applying the function to the entire vector allows R to interpret the input and pass it to the compiled code once.
This post will go over DataQuest’s Working With Vectorized Functions lesson. I decided to make this post movie-themed in honor of the Academy Awards that took place last night. The data set I will use in this post comes from Kaggle and was composed by a Kaggle member with the user name of kikun1234. It explores award nominations of movies released between 2000-2018. I loaded the data set into a data frame called oscars_2018. So without further delay, let’s get started!
Writing Vectorized If/Else Statements
My last post covered one way to write if/else statements. In this lesson, however, I learned that there is a vectorized solution for this. I could use the if/else() function, which is part of the dplyr package, for applying if/else statements. The if/else function could replace the for-loops I wrote in the previous post. This function requires the following:
- A Vector/Multiple Vectors
- A Condition
- An action to perform if condition is true
- An action to perform if condition is false
Let’s look at an example. I want to see if the movies in my data frame recieved more Golden Globe nominations than Oscar nominations. If a movie recieved more Golden Globe nominations, then the statement “more popular at the Golden Globes” would print. If a movie recieved more Oscar nominations, then the statement, “more popular at the Oscars” would print.
Nested if_else() Functions
Similar to nested if/else statements, I can take a vectorized approach by nesting if_else() functions. When nesting these functions, I would specify a different if_else function to perform an action if the first condition is not met. In the below screenshot, I have three different if_else() functions representing three different conditions:
- If number of Golden Globes nominations are greater than number of Oscar nominations then print “more popular at the Golden Globes”
- If number of Golden Globes nominations are less number of than Oscar nominations then print “more popular at the Oscars”
- If number of Golden Globes nominations are equal to number of Oscar nominations then print “both the Golden Globes and Oscars love you”
For each if_else() function in this chain, if the condition is not met, the R interpreter moves on to the next if_else function.
Each if_else function requires that I specify two action to take place based on whether or not the condition is true or false. However because I only have three cases here, I used a pair of empty quotes as the second action in the last if_else() function.
Grouping and Summarizing Data Frames
One huge takeaway from this lesson is learning how to solve what are known as “split-apply-combine” problems in R. With a split-apply-combine problem, the data is split into groups, a function is performed on each group, and the results are summarized. According to DataQuest, this is useful for solving many data analysis problems that require the calculation of summary statistics. Let’s start by discussing how to group data.
Two functions that are useful for working on split-apply-combine problems are the group_by() function and the summarize() function. Both functions are part of the dplyr package. The group_by() function allows me to group data by variable. The summarize() function lets me apply a function like sum() or n() to each group. The n() function counts the number of data frame rows in each group.
Let’s look at this example. I want to group my oscars_2018 data frame based on the genre variable. This is what the code looks like. As you recall from a previous post, the pipe operator(%>%) is used to chain functions together. This code splits rows of my oscars_2018 data frame into groups based on genre. I then saved this to the genre_wins data frame.
When I print the first few rows of my new genre_wins data frame, I can see the grouping variable and the number of groups specified. The grouping variable is genre and there are 228 genre groups.
Once the data frame is grouped, I can perform operations on each group using the summarize() function.
Take a look at the example below. Let’s say I wanted to calculate how many awards were won in each genre. Once again, I grouped my data frame by genre and then used the summarize() function to add up the awards won in each genre. This was saved into a data frame I call genre_wins_one. When I print this data frame, I end up with the following:
I can see that the Action|Adventure|Fantasy genre has won 73 awards.
Multiple Grouping and Summarizing
I can also group my data frame by multiple variables like the screenshot below. I can see that my data frame is grouped by the genre and certificate variables. I then use the summarize() function to apply the n() function. As I mentioned earlier in the post, the n() function counts counts the number of data frame rows in each group.
Just as I grouped my data frame by multiple variables, I can specify multiple operations using the summarize() function.
Let’s say I want to know the minimum, maximum and average Metacritic (stored in the metascore variable) score for each genre. I would first group my data frame by genre. I then use the summarize() function to apply the following functions: min(), max() and mean() to the metascore variable.
When I’m done with a grouped data frame, it’s best to return it to its previous state by ungrouping it like so:
The Pipe Operator
I mentioned earlier that the pipe operator(%>%) is used to chain functions together. It makes code easier to read and write but how does it work?DataQuest explains it this way:
The pipe originated with a package called maggritR. The tidyverse has adopted the pipe as a key feature of its packages, so loading tidyverse packages loads %>% automatically. The pipe operator allows you to write code so that the output of a function is passed to the next function from left to right. Most R functions do a lot of computing work for you behind the scenes, and you as a user interact with a “wrapper”. The pipe is a good example of this.
I think this is best demonstrated with an example. Let’s say I wanted to add up the total number of reviews for the movie, The Lord of the Rings: The Fellowship of the Ring. I would first filter the data to retain information about the movie. Then, I would add a new column called total_reviews, containing the total number of reviews (adding user_reviews + critics_reviews variables). I store this in a data frame called lord_rings_reviews.
When I type View(lord_rings_reviews), a file opens up containing all the data on this Lord of the Rings movie. The column total_reviews has been added in as the last column in the data frame.
Whew, that just about covers it for vectorized functions. Until next time…