Categorical Variables using TidyVerse

Dealing with categorical variables using TidyVerse

Whenever I do data science in R, I reach for the tidyverse. The tidyverse is a collection of R packages for data manipulation, exploration, and visualization that share a common design philosophy, grammar, and data structures. It was created by Hadley Wickham, the chief scientist at RStudio, and its packages are intended to make data scientists more productive. I plan to write several blog posts, each using a tidyverse package to tackle a problem I ran into while working on my master's degree in data science.

This semester we worked heavily with regression analysis. One of the problems I encountered was dealing with categorical variables. R uses factors to handle categorical variables, that is, variables that have a fixed and known set of possible values.
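
As a quick refresher on what a factor is, here is a minimal base R sketch. The `sizes` vector is made up for illustration and is not part of any dataset used in this post.

```r
# Hypothetical responses, invented for this example
sizes <- c("medium", "small", "large", "small", "medium")

# Without explicit levels, R sorts the levels alphabetically
f_default <- factor(sizes)
levels(f_default)

# Declaring the levels fixes both the allowed values and their order
f_declared <- factor(sizes, levels = c("small", "medium", "large"))
levels(f_declared)

# A value outside the declared levels becomes NA
f_unknown <- factor(c("small", "huge"), levels = c("small", "medium", "large"))
is.na(f_unknown)
```

The level order is what plots, tables, and model contrasts use, which is why the forcats helpers below all work by changing the levels rather than the data.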

Here are some of the ways to deal with categorical variables using the forcats package from the tidyverse.

Reordering a factor by another variable

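fct_reorder() reorders the levels of a factor by the values of another variable, which often makes plots much easier to read. Below, plot a uses the default alphabetical order of countries and plot b orders them by life expectancy.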

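library(tidyverse)  # dplyr, ggplot2, forcats
library(gapminder)  # gapminder dataset
library(ggpubr)     # assumed source of ggarrange()

# a: countries in default (alphabetical) order
a <- gapminder %>% 
  filter(year == 2002, continent == "Asia") %>% 
  ggplot(aes(x = lifeExp, y = country)) +
  geom_point()

# b: countries ordered by life expectancy
b <- gapminder %>% 
  filter(year == 2002, continent == "Asia") %>% 
  ggplot(aes(x = lifeExp, y = fct_reorder(country, lifeExp))) +
  geom_point()

# show both plots side by side
ggarrange(a, b, ncol = 2, nrow = 1)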
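
To see what fct_reorder() does to the levels without drawing a plot, here is a small sketch on made-up data. The `group` and `score` vectors are hypothetical. By default the levels are ordered by the median of the second variable within each level.

```r
library(forcats)

# Hypothetical data: three groups with two scores each
group <- factor(c("a", "b", "c", "a", "b", "c"))
score <- c(10, 1, 5, 12, 3, 7)

levels(group)                        # alphabetical order

# Medians are a = 11, b = 2, c = 6, so the new order is b, c, a
by_score <- fct_reorder(group, score)
levels(by_score)

# .desc = TRUE sorts from the largest median to the smallest
by_score_desc <- fct_reorder(group, score, .desc = TRUE)
levels(by_score_desc)
```

Only the level order changes; the values stored in the factor are untouched, which is why it is safe to call inside aes().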

Reordering a factor by the frequency of values

fct_infreq() reorders the levels of a factor by how often each value occurs, most frequent first.

starwars %>% 
  filter(!is.na(hair_color)) %>% 
  ggplot(aes(x = fct_infreq(hair_color))) +
  geom_bar() +
  labs(title = "Most Common Hair Color", x = "Types of Hair Color") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
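
The same reordering can be checked on a small vector. This is a sketch with invented values in `hair`. It also shows fct_rev(), which flips the order if you want the least frequent value first.

```r
library(forcats)

# Hypothetical values: brown x3, black x2, blond x1
hair <- factor(c("brown", "black", "brown", "blond", "brown", "black"))

levels(hair)                  # alphabetical order

by_freq <- fct_infreq(hair)
levels(by_freq)               # most frequent first

levels(fct_rev(by_freq))      # least frequent first
```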

Collapsing the least/most frequent values of a factor into “Other”

fct_lump() makes it easy to plot or view a variable with too many levels: it keeps the most common levels and lumps the rest into a single “Other” level.

# kable() comes from knitr and kable_styling() from kableExtra
starwars %>% 
  mutate(skin_color = fct_lump(skin_color, n = 5)) %>% 
  count(skin_color, sort = TRUE) %>% 
  kable() %>% 
  kable_styling(full_width = FALSE)
skin_color    n
Other        41
fair         17
light        11
dark          6
green         6
grey          6
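
Here is a minimal sketch of the lumping step on its own, using invented values in `skin`. It uses fct_lump_n(), which recent forcats releases offer as the explicit "keep the n most frequent levels" variant of fct_lump(). I am assuming forcats 0.5.0 or later is installed.

```r
library(forcats)

# Hypothetical values: two common levels and three rare ones
skin <- factor(c(rep("fair", 5), rep("light", 3), "green", "blue", "red"))

# Keep the 2 most frequent levels and lump everything else together
lumped <- fct_lump_n(skin, n = 2)
levels(lumped)
table(lumped)

# other_level renames the catch-all level
renamed <- fct_lump_n(skin, n = 2, other_level = "rare")
levels(renamed)
```

The catch-all level is placed last, so it stays out of the way when the factor is used on a plot axis.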

Changing the order of a factor by hand

Use fct_relevel() when you need to put factor levels in a specific order by hand. Here the offenses are reordered manually instead of alphabetically.

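# crime is assumed to be the Houston crime dataset shipped with ggmap

# default: levels in alphabetical order
before <- crime %>% 
  as_tibble() %>% 
  distinct(offense) %>% 
  arrange(offense)

# after relevel: levels in the order given by hand
after <- crime %>%
  as_tibble() %>% 
  distinct(offense) %>%
  mutate(offense = fct_relevel(offense, c("theft", "auto theft", "robbery", "burglary", "aggravated assault", "rape", "murder"))) %>%
  arrange(offense)

# show both tables side by side
kable(list(before, after)) %>% 
  kable_styling(full_width = FALSE)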
Default order         After fct_relevel()
offense               offense
aggravated assault    theft
auto theft            auto theft
burglary              robbery
murder                burglary
rape                  aggravated assault
robbery               rape
theft                 murder
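
You do not have to list every level. In this sketch, built on a made-up `offense` vector, naming only some levels moves them to the front and leaves the rest in their existing order, while the `after` argument controls where the moved levels land.

```r
library(forcats)

# Hypothetical factor with four of the offense labels
offense <- factor(c("theft", "murder", "robbery", "burglary"))
levels(offense)                      # alphabetical order

# Named levels move to the front; the others keep their order
front <- fct_relevel(offense, "theft", "robbery")
levels(front)

# after = Inf moves a level to the end instead
last <- fct_relevel(offense, "burglary", after = Inf)
levels(last)
```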

I hope some of these functions from the forcats package help you visualize and understand categorical data as much as they helped me.

Data Scientist

Saayed Alam creates machine learning products and occasionally gets philosophical.
