## Fundamentals of String Manipulation

In previous posts, I’ve worked with variables containing numerical data. This post focuses on working with variables containing character data. The data set I’m using was posted on the Tidy Tuesday GitHub repository. It comes from Spotify and provides information about popular songs and their genres. So let’s get started!

### Subsetting a String

In this lesson, I worked with a tidyverse package called stringr. The stringr package contains tools for combining and splitting strings, adding and removing spaces, and performing other useful string manipulations. Functions in the stringr package all begin with the prefix str_, and RStudio’s autocomplete feature makes this package especially fun to work with!

I’ll first discuss a function for subsetting strings, str_sub(). The str_sub() function takes a string, subsets it based on the positions of characters within the string, and returns a new string containing only the characters from the start position through the end position. Both endpoints are inclusive, and every character counts toward a position, including spaces.

Let’s look at an example. I am working with the data frame called spotify_songs, and I want to subset the track_artist variable and return a new vector consisting of the first nine characters of each artist’s name. I indexed the track_artist variable to include only the first twenty artists. I then subset the variable from left to right by position, with 1 and 9 as the start and end positions.
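Here is a minimal sketch of that call. Because the full spotify_songs data frame isn’t loaded here, I’m assuming a small stand-in vector called artists (a hypothetical name) in place of spotify_songs$track_artist[1:20].

```r
library(stringr)

# Hypothetical stand-in for spotify_songs$track_artist[1:20]
artists <- c("Ed Sheeran", "Maroon 5", "Zara Larsson", "The Chainsmokers")

# Keep characters 1 through 9 of each name (spaces count as characters)
first_nine <- str_sub(artists, 1, 9)
first_nine
# "Ed Sheera" "Maroon 5"  "Zara Lars" "The Chain"
```

A name shorter than nine characters, like "Maroon 5", comes back unchanged.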

The output shows that the function returned strings containing only the characters between the specified positions.

To subset the same track_artist variable from right to left by position, I use a minus sign (-) before the position number, like so:
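A sketch of what that might look like, again using a hypothetical artists vector as a stand-in for the real column. I’m assuming the goal is the last nine characters of each name; negative positions count backward from the end of the string, so -1 is the final character.

```r
library(stringr)

# Hypothetical stand-in for spotify_songs$track_artist
artists <- c("Ed Sheeran", "Zara Larsson", "The Chainsmokers")

# Keep the ninth-from-last character through the last character
last_nine <- str_sub(artists, -9, -1)
last_nine
# "d Sheeran" "a Larsson" "insmokers"
```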

Note that the str_sub() function is vectorized. When I apply it to a vector, it returns a new vector with one result per element.

### Splitting a String

Another technique for subsetting strings is splitting them with the str_split() function. Unlike str_sub(), str_split() does not depend on position. Instead, it splits a string into pieces wherever a delimiter occurs. The delimiter can be a space, a comma, or any other character or sequence of characters.

Let’s look at what happens when I use the str_split() function to split the strings listed in the first example.
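Here is a small sketch of the split, assuming a space as the delimiter and a hypothetical artists vector standing in for the first twenty values of track_artist.

```r
library(stringr)

# Hypothetical stand-in for spotify_songs$track_artist[1:20]
artists <- c("Ed Sheeran", "Maroon 5", "The Chainsmokers")

# Split each name wherever a space occurs
pieces <- str_split(artists, " ")

class(pieces)   # "list"
pieces[[1]]     # "Ed" "Sheeran"
```

Each element of the result is a character vector holding the pieces of one original string.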

From the output, I can see that each string was split wherever there is a space. I can also see that by default the output is a list.

I can use the simplify = TRUE argument to simplify the output into a matrix, like so:
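A sketch of the simplified version, using a hypothetical artists vector. Each row of the matrix corresponds to one input string, and shorter rows are filled with empty strings.

```r
library(stringr)

# Hypothetical stand-in for a few values of spotify_songs$track_artist
artists <- c("Ed Sheeran", "Twenty One Pilots")

# simplify = TRUE returns a character matrix instead of a list
split_matrix <- str_split(artists, " ", simplify = TRUE)

dim(split_matrix)   # 2 rows, 3 columns
split_matrix[1, ]   # "Ed" "Sheeran" ""
```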

### Combining Strings

The function for combining strings is called str_c().

I’ll use str_c() to combine multiple strings from my spotify_songs data frame into a single variable.

Let’s see what happens when I combine the variables track_name and track_artist into one variable.
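A minimal sketch of that combination. The songs data frame below is a hypothetical two-row stand-in for spotify_songs, and the new column name track_combined is my own assumption.

```r
library(stringr)

# Hypothetical two-row stand-in for spotify_songs
songs <- data.frame(
  track_name   = c("Shape of You", "Memories"),
  track_artist = c("Ed Sheeran", "Maroon 5")
)

# With no sep argument, the strings are joined with nothing in between
songs$track_combined <- str_c(songs$track_name, songs$track_artist)
songs$track_combined
# "Shape of YouEd Sheeran" "MemoriesMaroon 5"
```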

I can use the sep= argument within the str_c() function to specify characters to place between the strings I’m combining.
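For example, here is a sketch that places " by " between the two strings. The vectors are hypothetical stand-ins for the track_name and track_artist columns, and the separator text is just one choice.

```r
library(stringr)

# Hypothetical stand-ins for spotify_songs$track_name and $track_artist
track_name   <- c("Shape of You", "Memories")
track_artist <- c("Ed Sheeran", "Maroon 5")

# sep is inserted between each pair of strings being combined
combined <- str_c(track_name, track_artist, sep = " by ")
combined
# "Shape of You by Ed Sheeran" "Memories by Maroon 5"
```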

This is where I deviated from DataQuest a bit. After reading this section of the lesson, I was a bit confused. So I decided to do some research and came across this tidyverse documentation describing how to pad a string.

The stringr function str_pad() lets me add characters to an existing string so that it reaches a specified minimum width.
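As a quick sketch of the idea, here are two calls on a hypothetical string. The first relies on the defaults (pad on the left with spaces); the second sets the side and pad arguments explicitly.

```r
library(stringr)

# Hypothetical example string
artist <- "Lizzo"

# Default: pad on the left with spaces up to a width of 8
padded_left <- str_pad(artist, width = 8)
padded_left    # "   Lizzo"

# Pad on the right with asterisks instead
padded_right <- str_pad(artist, width = 8, side = "right", pad = "*")
padded_right   # "Lizzo***"
```

A string that is already at least as wide as the requested width is returned unchanged.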

The function takes as arguments:

• The string you’re working with
• The minimum width of padded strings