Categorical Variables using TidyVerse
Dealing with categorical variables using TidyVerse
Whenever I work on data science in R, I use the tidyverse. The tidyverse is a collection of R packages for data manipulation, exploration, and visualization that share a common design philosophy, grammar, and data structures. It was created by Hadley Wickham, chief scientist at RStudio, and its packages are intended to make data scientists more productive. I intend to write several blog posts, each using a tidyverse package to deal with a problem I encountered while working on my Master of Science in Data Science.
This semester we worked heavily with regression analysis. One of the problems I encountered was dealing with categorical variables. R uses factors to handle categorical variables, that is, variables that have a fixed and known set of possible values.
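To make the idea of a factor concrete, here is a minimal sketch in base R. The `sizes` vector is made up for illustration and is not part of any data set used in this post.

```r
# A character vector with a small, known set of values (hypothetical data)
sizes <- c("medium", "small", "large", "small", "medium")

# By default, factor() sorts the levels alphabetically
f_default <- factor(sizes)
levels(f_default)   # "large" "medium" "small"

# Supplying levels explicitly fixes both the allowed values and their order
f_ordered <- factor(sizes, levels = c("small", "medium", "large"))
levels(f_ordered)   # "small" "medium" "large"

# Values outside the declared levels become NA
f_strict <- factor(c("small", "huge"), levels = c("small", "medium", "large"))
is.na(f_strict)     # FALSE TRUE
```

The level order is what drives axis order in plots and the reference category in regression, which is why the functions below focus on changing it.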
Here are some of the ways to deal with categorical variables using the forcats package from the tidyverse.
Reordering a factor by another variable
fct_reorder() reorders the levels of a factor by the values of another variable, which often makes plots much easier to read.
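library(tidyverse)   # dplyr, ggplot2, forcats
library(gapminder)   # gapminder data set
library(ggpubr)      # ggarrange()
# Default: countries appear in alphabetical order
a <- gapminder %>%
  filter(year == 2002, continent == "Asia") %>%
  ggplot(aes(x = lifeExp, y = country)) +
  geom_point()
# Reordered: countries sorted by life expectancy
b <- gapminder %>%
  filter(year == 2002, continent == "Asia") %>%
  ggplot(aes(x = lifeExp, y = fct_reorder(country, lifeExp))) +
  geom_point()
# Show both plots side by side
ggarrange(a, b, ncol = 2, nrow = 1)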
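If you want to see what fct_reorder() does without drawing a plot, you can inspect the levels directly. This is a small sketch on made-up data; `group` and `score` are hypothetical names, not columns from gapminder.

```r
library(forcats)

# Hypothetical toy data: three groups with one numeric score each
group <- factor(c("a", "b", "c"))
score <- c(30, 10, 20)

levels(group)                       # "a" "b" "c" (alphabetical)

# Levels are sorted by the summary of score within each level
# (the median by default)
reordered <- fct_reorder(group, score)
levels(reordered)                   # "b" "c" "a"

# .desc = TRUE sorts from largest to smallest instead
reordered_desc <- fct_reorder(group, score, .desc = TRUE)
levels(reordered_desc)              # "a" "c" "b"
```

Only the level order changes. The values stored in the factor stay the same, so the data itself is not altered.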
Reordering a factor by the frequency of values
fct_infreq() reorders the levels of a factor by how often each value occurs, most frequent first.
starwars %>%
  filter(!is.na(hair_color)) %>%
  ggplot(aes(x = fct_infreq(hair_color))) +
  geom_bar() +
  labs(title = "Most Common Hair Color", x = "Types of Hair Color") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
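The same reordering can be checked on a small vector. The sketch below uses made-up hair colors rather than the starwars data, and also shows fct_rev(), which I find handy when I want the least frequent value first.

```r
library(forcats)

# Hypothetical toy data: brown appears 3 times, black 2, blond 1
colors <- factor(c("brown", "black", "brown", "blond", "brown", "black"))

levels(colors)                 # "black" "blond" "brown" (alphabetical)

# Most frequent level first
by_freq <- fct_infreq(colors)
levels(by_freq)                # "brown" "black" "blond"

# fct_rev() reverses the level order, giving least frequent first
by_freq_rev <- fct_rev(by_freq)
levels(by_freq_rev)            # "blond" "black" "brown"
```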
Collapsing the least/most frequent values of a factor into “Other”
fct_lump() collapses the least (or most) frequent levels into a single “Other” level, which makes it easy to plot or view a variable with too many levels.
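library(knitr)        # kable()
library(kableExtra)   # kable_styling()
# Keep the 5 most common skin colors and lump the rest into "Other"
starwars %>%
  mutate(skin_color = fct_lump(skin_color, n = 5)) %>%
  count(skin_color, sort = TRUE) %>%
  kable() %>%
  kable_styling(full_width = FALSE)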
| skin_color | n |
|---|---|
| Other | 41 |
| fair | 17 |
| light | 11 |
| dark | 6 |
| green | 6 |
| grey | 6 |
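Newer versions of forcats (0.5.0 and later) also provide more specific variants of fct_lump(), such as fct_lump_n() and fct_lump_min(). Here is a rough sketch on made-up data, assuming one of those versions is installed; the vector `x` and the label "rare" are hypothetical.

```r
library(forcats)

# Hypothetical toy data: fair x5, light x3, dark x2, green x1, grey x1
x <- factor(c(rep("fair", 5), rep("light", 3), rep("dark", 2), "green", "grey"))

# Keep the 2 most frequent levels and lump everything else together
lumped_n <- fct_lump_n(x, n = 2)
levels(lumped_n)       # "fair" "light" "Other"
table(lumped_n)        # fair 5, light 3, Other 4

# Keep only levels that appear at least 3 times
lumped_min <- fct_lump_min(x, min = 3)
levels(lumped_min)     # "fair" "light" "Other"

# other_level changes the name of the lumped level
lumped_named <- fct_lump_n(x, n = 2, other_level = "rare")
levels(lumped_named)   # "fair" "light" "rare"
```

Which variant fits best depends on whether you think in terms of a number of levels to keep or a minimum count per level.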
Changing the order of a factor by hand
fct_relevel() is useful when we need to reorder factor levels manually.
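library(ggmap)   # assumption: crime is the Houston crime data set shipped with ggmap
# Default level order
default_order <- crime %>%
  as_tibble() %>%
  distinct(offense) %>%
  arrange(offense)
# After releveling by hand
releveled <- crime %>%
  as_tibble() %>%
  distinct(offense) %>%
  mutate(offense = fct_relevel(
    offense,
    c("theft", "auto theft", "robbery", "burglary",
      "aggravated assault", "rape", "murder")
  )) %>%
  arrange(offense)
# Show both orderings side by side
kable(list(default_order, releveled)) %>%
  kable_styling(full_width = FALSE)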
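You do not have to list every level when releveling. As a small sketch on a made-up factor (not the crime data), fct_relevel() moves only the levels you name and leaves the rest in their existing order, and its `after` argument controls where the named levels land.

```r
library(forcats)

# Hypothetical toy factor with four offense types
offense <- factor(c("theft", "murder", "robbery", "burglary"))
levels(offense)        # "burglary" "murder" "robbery" "theft"

# Named levels move to the front; the others keep their order
front <- fct_relevel(offense, "theft", "robbery")
levels(front)          # "theft" "robbery" "burglary" "murder"

# after = Inf moves the named level to the end instead
last <- fct_relevel(offense, "burglary", after = Inf)
levels(last)           # "murder" "robbery" "theft" "burglary"
```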
I hope some of these functions from the forcats package help you visualize and understand categorical data as they helped me.