Data Cleaning With R

I have now reached the Data Cleaning in R section of DataQuest’s Data Analyst in R track. I had some difficulty finding a messy data set to clean, and in the end I decided to practice on two data sets.

I’ll start with the first data set, which comes from the UCI Machine Learning Repository. There I found the Purchasing Intention Data Set, which explores the intentions of online shoppers using metrics like Bounce Rate and Traffic Type.

The first thing I wanted to do was convert the Weekend and Revenue columns from logical (TRUE/FALSE) columns to numeric columns. There are a few ways to do this, as shown in the screenshot below.

Converting boolean columns to numeric columns

I decided to use the second script to convert the Weekend and Revenue columns. The first screenshot shows the columns before the conversion and the second shows them after they became numeric.
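For reference, here is a minimal sketch of two common ways to do this kind of conversion. It uses a small made-up data frame that I’m calling shoppers (a hypothetical name and made-up values, not the real data), and it is not necessarily identical to the scripts in my screenshot.

```r
library(dplyr)

# Toy stand-in for the shoppers data; the values are made up.
shoppers <- data.frame(
  Weekend = c(TRUE, FALSE, FALSE),
  Revenue = c(FALSE, FALSE, TRUE)
)

# Option 1: base R, one column at a time.
by_column <- shoppers
by_column$Weekend <- as.numeric(by_column$Weekend)
by_column$Revenue <- as.numeric(by_column$Revenue)

# Option 2: dplyr, both columns inside a single mutate() call.
with_mutate <- shoppers %>%
  mutate(
    Weekend = as.numeric(Weekend),
    Revenue = as.numeric(Revenue)
  )

str(with_mutate)
```

Either way, TRUE becomes 1 and FALSE becomes 0.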

The next thing I decided to do was filter the data frame to keep only the rows where ProductRelated is greater than 15. The ProductRelated column refers to the product pages of a shopping site. I filtered the data using the script shown below.

Filtering data by Product Related pages

This is what the column looked like after I filtered it:

Filtering data by Product Related pages
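A filter like this can be sketched with dplyr’s filter() function. The data frame below is a made-up stand-in, and the names shoppers and many_product_pages are hypothetical.

```r
library(dplyr)

# Toy stand-in for the shoppers data; the values are made up.
shoppers <- data.frame(
  ProductRelated = c(3, 16, 15, 42),
  Month = c("Feb", "Mar", "Mar", "Nov")
)

# Keep only rows with more than 15 product-related pages.
# The comparison is strict, so a value of exactly 15 is dropped.
many_product_pages <- shoppers %>%
  filter(ProductRelated > 15)

many_product_pages
```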

I then decided to group the data by Month and Visitor Type and sum up the Informational and Informational Duration columns within each group.

Using group_by function to group data by columns.
Using mutate function to sum up columns.

The results are shown below.

Using group_by and mutate functions to group and sum up columns
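Here is a rough sketch of that pattern on made-up data. The column names Month, VisitorType, Informational and Informational_Duration are my assumption about how the columns are spelled in the data set, and the total_* column names are hypothetical.

```r
library(dplyr)

# Toy stand-in for the shoppers data; the values are made up.
shoppers <- data.frame(
  Month = c("Feb", "Feb", "Mar", "Mar"),
  VisitorType = c("New_Visitor", "New_Visitor",
                  "Returning_Visitor", "Returning_Visitor"),
  Informational = c(1, 2, 0, 4),
  Informational_Duration = c(10, 20, 0, 40)
)

# group_by() defines the groups; mutate() then adds the group totals
# as new columns while keeping every original row.
info_totals <- shoppers %>%
  group_by(Month, VisitorType) %>%
  mutate(
    total_informational = sum(Informational),
    total_informational_duration = sum(Informational_Duration)
  ) %>%
  ungroup()

info_totals
```

Because mutate() keeps all rows, each group total is repeated on every row of its group. Using summarise() instead would collapse the result to one row per group.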

Next, I wanted to combine filter() and select() to keep only certain rows and certain columns of the data frame.

Using filter and select functions to view data.
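As an illustration only, combining the two functions can look like the sketch below. The filter conditions and the selected columns here are hypothetical choices on made-up data, not necessarily the ones in my screenshots.

```r
library(dplyr)

# Toy stand-in for the shoppers data; the values are made up.
shoppers <- data.frame(
  Month = c("Feb", "Nov", "Nov"),
  VisitorType = c("New_Visitor", "Returning_Visitor", "New_Visitor"),
  BounceRates = c(0.20, 0.00, 0.05),
  Revenue = c(0, 1, 1)
)

# filter() picks rows, select() picks columns.
november_buyers <- shoppers %>%
  filter(Month == "Nov", Revenue == 1) %>%
  select(Month, VisitorType, BounceRates)

november_buyers
```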

I decided to change the name of the SpecialDay column to Holiday using the rename() function.

Using rename function to rename columns.

This is what the SpecialDay column looked like after I changed its name to Holiday.

Using rename function to rename columns.
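A minimal sketch of the rename, again on a made-up data frame (the name shoppers is hypothetical). In dplyr’s rename(), the new name goes on the left and the old name on the right.

```r
library(dplyr)

# Toy stand-in for the shoppers data; the values are made up.
shoppers <- data.frame(
  SpecialDay = c(0, 0.4, 0.8),
  Month = c("Feb", "Feb", "May")
)

# rename(new_name = old_name)
renamed <- shoppers %>%
  rename(Holiday = SpecialDay)

names(renamed)
```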

Lastly, I decided to look for duplicate values in this data frame.

Viewing duplicate values.

The problem with this approach is that the output of duplicated() is a logical vector with one element per row, so I’d have to search it for the values that are TRUE, or use it to index the data frame, to get the actual duplicates. This method is not ideal, especially if I’m working with multiple data frames.
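To make that concrete, here is a small base R sketch on made-up data (the names shoppers, is_dup and dup_rows are hypothetical). It shows what duplicated() returns and the extra indexing step needed to see the duplicate rows themselves.

```r
# Toy data frame; rows 3 and 5 repeat earlier rows.
shoppers <- data.frame(
  TrafficType = c(1, 2, 2, 3, 1),
  Region = c(1, 4, 4, 2, 1)
)

# duplicated() returns one TRUE/FALSE per row, not the rows themselves.
is_dup <- duplicated(shoppers)
is_dup
# FALSE FALSE TRUE FALSE TRUE

# Indexing the data frame with the logical vector returns the duplicates.
dup_rows <- shoppers[is_dup, ]

# which() gives the row positions instead.
which(is_dup)
# 3 5
```

Note that duplicated() only flags the second and later occurrences, so the first copy of each repeated row is not marked.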

Another Method of Finding Duplicates

There is another way to look for duplicated values. I can combine the duplicated() function with the purrr functionals and dplyr to look for duplicated values.

For this example, I’ll use two data frames I created myself.

The first step is to create a list of the data frames so I can use a functional to perform the same operation on each data frame.

Using purrr and dplyr to find duplicate values.

I’ll then use the map() functional and the mutate() function to add a new column to each data frame that holds the logical output of duplicated(). This allows me to filter each data frame to return only the rows where the value of the duplicated column is TRUE.

Using purrr and dplyr to find duplicate values.

When I call dup_traffic, you can see here that duplicates have been identified in the TrafficType column.

Using purrr and dplyr to find duplicated values.
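Putting the steps together, a minimal sketch might look like the following. The two data frames and the names traffic_a, traffic_b and traffic_list are made up for illustration; only dup_traffic and the TrafficType column come from my own example, and the real values differ.

```r
library(dplyr)
library(purrr)

# Two made-up data frames (hypothetical values).
traffic_a <- data.frame(TrafficType = c(1, 2, 2, 3))
traffic_b <- data.frame(TrafficType = c(4, 4, 5, 4))

# Step 1: put the data frames in a list so map() can visit each one.
traffic_list <- list(a = traffic_a, b = traffic_b)

# Step 2: for each data frame, add a logical "duplicated" column,
# then keep only the rows where it is TRUE.
dup_traffic <- traffic_list %>%
  map(function(df) {
    df %>%
      mutate(duplicated = duplicated(TrafficType)) %>%
      filter(duplicated == TRUE)
  })

dup_traffic
```

The result is a list with one filtered data frame per input, so the same check runs on every data frame without repeating the code.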

I have to admit this was a pretty difficult section but I’m glad I’m learning it! That’s all for Data Cleaning for now. Until next time…