An R Package for Tidying Statistical Models
This is the third in a series of blog posts I have been writing about the tidyverse family of packages. The idea for the series came to me repeatedly while doing research for data science projects. I often noticed that there are several ways to write the same code in R, which creates disparities in readability and makes code harder to understand. It’s easier to follow someone’s work when there is a common language. I often saw different packages or functions used to produce similar output, which slows down anyone trying to understand the code. This is where the tidyverse comes in: its packages share an underlying design philosophy, grammar, and data structures. My goal is to introduce more data scientists to the common language of the tidyverse, and hopefully more people will adopt it.
For this post, I will talk about the broom package. broom is not one of the core tidyverse packages, so library(tidyverse) does not attach it and it has to be loaded on its own. However, it uses the same grammar and philosophy. broom summarizes key information about models in tidy tibbles. Hadley Wickham (the man behind the tidyverse) states the problem convincingly in his paper: “While model inputs usually require tidy inputs, such attention to detail doesn’t carry over to model outputs. Outputs such as predictions and estimated coefficients are not always tidy. This makes it more difficult to combine results from multiple models. … This knocks you out of the flow of analysis and makes it harder to combine the results from multiple models. I’m not currently aware of any packages that resolve this problem.”
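To make the problem in the quote concrete, here is a minimal sketch comparing base R's coefficient output with broom's. The model (mpg ~ wt + qsec on mtcars) is only an illustrative choice, and the object names are my own.

```r
library(broom)

# Illustrative model; any lm() fit would show the same thing
fit <- lm(mpg ~ wt + qsec, data = mtcars)

# Base R: the coefficient table is a numeric matrix, and the term names
# are stored as row names rather than in a column of their own
base_coefs <- summary(fit)$coefficients
is.matrix(base_coefs)
rownames(base_coefs)

# broom: the same estimates as a tibble with an explicit `term` column
tidy_coefs <- tidy(fit)
tidy_coefs
```

Because the term names are an ordinary column in the tidy version, the coefficient tables of several models can be stacked without losing track of which estimate belongs to which term.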
broom provides three verbs to make it convenient to interact with model objects:
tidy() summarizes information about model components
glance() reports information about the entire model
augment() adds information about observations to a dataset
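As a rough illustration of the three verbs, the sketch below applies each of them to a single linear model. The model and the variable names (coefs, metrics, obs) are assumptions made for the example, not part of broom itself.

```r
library(broom)

# Illustrative model fit on the built-in mtcars data
fit <- lm(mpg ~ wt + qsec, data = mtcars)

# tidy(): one row per model component (here, one per coefficient)
coefs <- tidy(fit)

# glance(): a single row of model-level statistics
metrics <- glance(fit)

# augment(): the model data with per-observation columns added,
# such as fitted values (.fitted) and residuals (.resid)
obs <- augment(fit)

coefs
metrics
obs
```

Each call returns a tibble, so the results can be passed straight into the usual dplyr and ggplot2 workflows.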
I specifically want to talk about the glance() function. It returns a tibble with exactly one row of goodness-of-fit measures and related statistics. This is useful when we have several models and want to compare their metrics side by side.
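library(broom)       # glance()
library(dplyr)       # bind_rows(), select(), %>%
library(tibble)      # add_column()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

# fit two different models
fit1 <- lm(Sepal.Width ~ Petal.Length + Petal.Width, iris)
fit2 <- lm(mpg ~ wt + qsec, mtcars)

# compare both model metrics: stack the one-row glance() outputs,
# label each row, and move the label column to the front
bind_rows(glance(fit1), glance(fit2)) %>%
  add_column(" " = c("Model 1", "Model 2")) %>%
  select(" ", everything()) %>%
  kable() %>%
  kable_styling()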
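The same idea can be sketched for more than two models by keeping them in a named list. The three candidate models and the choice of metrics below are hypothetical, picked only to illustrate the pattern.

```r
library(broom)
library(dplyr)

# Hypothetical candidate models, all fit to the same data
models <- list(
  wt_only    = lm(mpg ~ wt, data = mtcars),
  wt_qsec    = lm(mpg ~ wt + qsec, data = mtcars),
  wt_qsec_hp = lm(mpg ~ wt + qsec + hp, data = mtcars)
)

# glance() every model, then stack the one-row results;
# .id = "model" stores each list name in a `model` column
comparison <- lapply(models, glance) %>%
  bind_rows(.id = "model") %>%
  select(model, r.squared, adj.r.squared, AIC, BIC)

comparison
```

Since these models share the same response and data, metrics such as AIC and BIC can be compared directly across rows. That would not hold for models fit to different datasets.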