Unlocking the Power of the 'janitor' Package for Data Cleaning
Written on
Chapter 1: Introduction to the janitor Package
The janitor package, developed by a team including Sam Firke, Bill Denney, Chris Haid, Ryan Knight, Malte Grosser, and Jonathan Zadra, is a valuable tool for data cleaning and exploration in R. While its renowned function, clean_names(), will be discussed later, the package offers a suite of functionalities that enhance data preparation. Seamlessly compatible with the tidyverse, janitor can be easily integrated into various data workflows. For comprehensive overviews of its functions, refer to these resources here and here. Additionally, all relevant data and code can be found in this GitHub repository.
Chapter 1.1: Source of the Data
In this tutorial, we'll utilize data from the Geographic Names Information System (GNIS), a resource compiled by the U.S. Board on Geographic Names. Featured in Jeremy Singer-Vine's Data is Plural newsletter, this dataset serves as a searchable catalog of place names across the United States.
To begin, navigate to the Query Form to download the data. For this tutorial, I selected place names from Berkshire County, Massachusetts, my hometown. To conduct this search, choose "Massachusetts" from the state drop-down menu and "Berkshire" from the county menu, then click "Send Query":
The results will display as shown below. To save the data locally, select 'Save as pipe "|" delimited file':
The dataset will download as a .csv file, with features separated by "|". Save this file into a "data" folder within a new R project. Next, we’ll load the data into R, separate the columns, and make some adjustments to prepare for our exploration with the janitor package. Load the tidyverse and janitor packages in a new R Markdown file and use the read.csv() function to import the data as "place_names":
library(tidyverse)
library(janitor)
place_names = read.csv("./data/GNIS Query Result.csv")
The data will appear similar to its format in Excel, with one extensive column containing all the information.
This paragraph will result in an indented block of text, typically used for quoting other text.
Chapter 1.2: Preparing the Data
We'll begin by assigning the name "columns" to this single column to simplify our code. Using the separate() function, we will divide this column into its respective components. We’ll also filter the data to focus solely on Berkshire County, as a few entries from other counties were inadvertently included in our query. To simulate some data imperfections, we’ll create a duplicate ID in the dataset and add an additional column filled with NAs:
colnames(place_names) = "columns"
place_names =
place_names %>%
separate(columns, c("Feature Name", "ID", "Class", "County", "State", "Latitude", "Longitude", "Ele(ft)", "Map", "BGN Date", "Entry Date"), sep = "[|]") %>%
filter(County == "Berkshire") %>%
mutate(
ID = str_replace(ID, "598673", "598712"),
extra_column = NA
)
Before proceeding, we will create a secondary dataset called "non_ma_names" that contains entries not from Berkshire County. We will again read the "GNIS Query Result.csv" and separate the column names. The clean_names() function from the janitor package will be applied, followed by converting our ele_ft variable to numeric and map variable to a factor:
non_ma_names = read.csv("./data/GNIS Query Result.csv")
colnames(non_ma_names) = "columns"
non_ma_names =
non_ma_names %>%
separate(columns, c("Feature Name", "ID", "Class", "County", "State", "Latitude", "Longitude", "Ele(ft)", "Map", "BGN Date", "Entry Date"), sep = "[|]") %>%
filter(County != "Berkshire") %>%
clean_names() %>%
mutate(
ele_ft = as.numeric(ele_ft),
map = as.factor(map)
)
Now let's explore the capabilities of the janitor package!
Chapter 2: Key Functions in janitor
The first video, "R package reviews | janitor | clean your data!" provides a thorough overview of the janitor package's capabilities.
Chapter 2.1: Utilizing row_to_names()
You may often encounter data files with leading rows that contain unnecessary information or corporate logos. When loaded into R, these rows can inadvertently become column headers. The row_to_names() function from janitor allows you to specify the row containing the actual column names, removing all preceding rows:
test_names = row_to_names(place_names, 3, remove_row = TRUE, remove_rows_above = TRUE)
Although the GNIS data already has correctly positioned column names, this function proves helpful with other datasets.
Chapter 2.2: Cleaning Column Names
The clean_names() function is one I frequently employ upon loading new datasets. If you haven't yet incorporated it into your workflow, I highly recommend doing so. The function adapts column names to snake case, aligning with the tidyverse style guide:
place_names = clean_names(place_names)
The result is a tidy dataset with neatly formatted column names.
Chapter 2.3: Removing Empty Columns
The remove_empty() function, as its name suggests, eliminates any empty columns. Given that we added an empty column in our "place_names" dataset, we can see it in action:
place_names =
place_names %>%
remove_empty()
The empty_column will be removed, leaving only columns with data.
Chapter 2.4: Eliminating Constant Columns
The remove_constant() function removes columns where all values are identical. Since we filtered to only include Berkshire County data, the county and state columns contain the same values throughout, rendering them unnecessary:
place_names =
place_names %>%
remove_constant()
As a result, the county and state columns will be eliminated.
Chapter 2.5: Comparing Data Frames
Have you ever encountered an error while attempting to stack two data frames with rbind()? The compare_df_cols() function directly compares columns from two data frames, streamlining the troubleshooting process. Let’s compare our "place_names" dataset with "non_ma_names":
compare_df_cols(place_names, non_ma_names)
This output provides a convenient comparison table, highlighting any discrepancies in column types.
Chapter 2.6: Detecting Duplicates
In numerous projects, particularly those involving unique identifiers, ensuring no duplicates exist is crucial. The get_dupes() function identifies duplicated values in a specified column:
get_dupes(place_names, ID)
This will filter the data frame to show rows with duplicate IDs, allowing for easy investigation.
Chapter 2.7: Using tabyl() for Data Exploration
The tabyl() function offers a tidyverse-compatible alternative to the base table() function, making it invaluable for data exploration. For instance, to count schools in each town in Berkshire County, we can filter the class variable and utilize tabyl():
place_names %>%
filter(class %in% "School") %>%
tabyl(map) %>%
knitr::kable()
This produces a neatly formatted output table displaying school counts and percentages.
Next, let’s create a cross-tabulation of landmark types by town:
place_names %>%
tabyl(map, class) %>%
knitr::kable()
To analyze column percentages, we can easily adapt our code:
place_names %>%
tabyl(map, class) %>%
adorn_percentages("col") %>%
knitr::kable()
For improved readability, we can format the percentages:
place_names %>%
tabyl(map, class) %>%
adorn_percentages("col") %>%
adorn_pct_formatting(digits = 0) %>%
knitr::kable()
This creates a clearer table for understanding the data.
Chapter 2.8: Additional Functions
While this article highlights some of the janitor package's most useful functions, it is not an exhaustive list. Other noteworthy functions include:
- excel_numeric_to_date(): Converts various Excel date formats to date variables.
- round_to_fraction(): Rounds decimal numbers to specified fractional values.
- top_levels(): Collapses categorical variables into high, middle, and low levels.
Chapter 3: Conclusion
Data analysts and scientists often spend significant time on data cleaning and exploration. The janitor package offers numerous tools to streamline these processes. Whether you're a novice or seasoned user, I hope this article has introduced you to valuable functions that can enhance your data science toolkit.
The second video, "Use These Data Cleaning Helpers for R from the janitor package," explores additional helpers in the janitor package to further enhance your data cleaning workflow.