An R Package for Neat and Quick Descriptive Statistical Analysis
Exploratory data analysis is an important step before any data modeling. Without descriptive statistics or appropriate plots, it's hard to build the right model. There are several base functions and packages for this kind of analysis and visualization. In several data science papers, I have noticed either too much detail, which can cause a reader to lose interest, or poorly rendered plots, which do not say much about the data. In this blog post, I will look at the summarytools package, which provides a few functions to neatly summarize data. This is especially useful in an RMarkdown report, as the summary outputs can be rendered directly to HTML and displayed in the application or report.
summarytools provides four main functions for summarizing data:
dfSummary
is used to summarize an entire dataset. For each variable, it reports the type, descriptive statistics, frequencies, and the number of missing values, and it automatically creates a plot showing the distribution of the data. The output can be controlled using various arguments.
library(summarytools)  # provides dfSummary() and the tobacco example data
print(dfSummary(tobacco), method = "render")
Data Frame Summary
tobacco
Dimensions: 1000 x 9
Duplicates: 2
No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing
---|---|---|---|---|---|---
1 | gender [factor] | 1. F 2. M | | | 978 (97.8%) | 22 (2.2%)
2 | age [numeric] | Mean (sd): 49.6 (18.3); min < med < max: 18 < 50 < 80; IQR (CV): 32 (0.4) | 63 distinct values | | 975 (97.5%) | 25 (2.5%)
3 | age.gr [factor] | 1. 18-34 2. 35-50 3. 51-70 4. 71 + | | | 975 (97.5%) | 25 (2.5%)
4 | BMI [numeric] | Mean (sd): 25.7 (4.5); min < med < max: 8.8 < 25.6 < 39.4; IQR (CV): 5.7 (0.2) | 974 distinct values | | 974 (97.4%) | 26 (2.6%)
5 | smoker [factor] | 1. Yes 2. No | | | 1000 (100%) | 0 (0%)
6 | cigs.per.day [numeric] | Mean (sd): 6.8 (11.9); min < med < max: 0 < 0 < 40; IQR (CV): 11 (1.8) | 37 distinct values | | 965 (96.5%) | 35 (3.5%)
7 | diseased [factor] | 1. Yes 2. No | | | 1000 (100%) | 0 (0%)
8 | disease [character] | 1. Hypertension 2. Cancer 3. Cholesterol 4. Heart 5. Pulmonary 6. Musculoskeletal 7. Diabetes 8. Hearing 9. Digestive 10. Hypotension [ 3 others ] | | | 222 (22.2%) | 778 (77.8%)
9 | samp.wgts [numeric] | Mean (sd): 1 (0.1); min < med < max: 0.9 < 1 < 1.1; IQR (CV): 0.2 (0.1) | | | 1000 (100%) | 0 (0%)
Generated by summarytools 0.9.6 (R version 4.0.0)
2020-05-02
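As a rough sketch of how arguments can control the dfSummary output, the call below drops a few columns and limits how many distinct values are listed per variable. The particular combination of arguments is only an illustration chosen for this example, not a recommended setting.

```r
library(summarytools)

# tobacco is an example dataset shipped with summarytools
data("tobacco")

# Build a more compact summary by switching off some columns
compact_summary <- dfSummary(
  tobacco,
  varnumbers = FALSE,       # hide the "No" column
  valid.col  = FALSE,       # hide the "Valid" column
  graph.col  = FALSE,       # hide the "Graph" column
  max.distinct.values = 5   # list at most 5 distinct values per variable
)

# Print as plain text in the console
print(compact_summary)
```

In an RMarkdown document, the same object can be passed to print() with method = "render", as in the call shown earlier; this usually requires the chunk option results = "asis" so that the generated HTML is not escaped.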
descr
is used to obtain more detailed statistics on numerical variables.
descr(tobacco$age)
## Descriptive Statistics
## tobacco$age
## N: 1000
##
## age
## ----------------- --------
## Mean 49.60
## Std.Dev 18.29
## Min 18.00
## Q1 34.00
## Median 50.00
## Q3 66.00
## Max 80.00
## MAD 23.72
## IQR 32.00
## CV 0.37
## Skewness -0.04
## SE.Skewness 0.08
## Kurtosis -1.26
## N.Valid 975.00
## Pct.Valid 97.50
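descr can also be applied to a whole data frame and restricted to a subset of statistics. The following is a minimal sketch under that assumption; the chosen statistics and the variable name numeric_summary are just for illustration.

```r
library(summarytools)
data("tobacco")

# Summarize every numeric column at once, keeping only a few statistics.
# Non-numeric columns are skipped by descr().
numeric_summary <- descr(
  tobacco,
  stats     = c("mean", "sd", "min", "med", "max"),
  transpose = TRUE   # one row per variable, one column per statistic
)

print(numeric_summary)
```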
freq
is used to obtain more detailed statistics on categorical variables.
freq(tobacco$smoker)
## Frequencies
## tobacco$smoker
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## Yes 298 29.80 29.80 29.80 29.80
## No 702 70.20 100.00 70.20 100.00
## <NA> 0 0.00 100.00
## Total 1000 100.00 100.00 100.00 100.00
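The frequency table can be adjusted in a similar way. As a small sketch, the call below sorts the age groups by count and hides the missing-value and cumulative columns; the choice of variable and options is only an example.

```r
library(summarytools)
data("tobacco")

# Frequency table for the age groups, most common category first
age_group_freq <- freq(
  tobacco$age.gr,
  order      = "freq",   # sort rows by count instead of factor level order
  report.nas = FALSE,    # hide the <NA> row and the "% Total" columns
  cumul      = FALSE     # omit the cumulative percentage columns
)

print(age_group_freq)
```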
ctable
is used to cross-tabulate frequencies for a pair of categorical variables.
print(ctable(tobacco$disease, tobacco$gender), method = "render")
Cross-Tabulation, Row Proportions
disease * gender
Data Frame: tobacco
disease | F | M | <NA> | Total
---|---|---|---|---
Cancer | 16 (47.1%) | 18 (52.9%) | 0 (0.0%) | 34 (100.0%)
Cholesterol | 10 (47.6%) | 11 (52.4%) | 0 (0.0%) | 21 (100.0%)
Diabetes | 8 (57.1%) | 5 (35.7%) | 1 (7.1%) | 14 (100.0%)
Digestive | 5 (41.7%) | 7 (58.3%) | 0 (0.0%) | 12 (100.0%)
Hearing | 5 (35.7%) | 9 (64.3%) | 0 (0.0%) | 14 (100.0%)
Heart | 9 (45.0%) | 11 (55.0%) | 0 (0.0%) | 20 (100.0%)
Hypertension | 18 (50.0%) | 17 (47.2%) | 1 (2.8%) | 36 (100.0%)
Hypotension | 7 (63.6%) | 4 (36.4%) | 0 (0.0%) | 11 (100.0%)
Musculoskeletal | 8 (42.1%) | 10 (52.6%) | 1 (5.3%) | 19 (100.0%)
Neurological | 7 (70.0%) | 3 (30.0%) | 0 (0.0%) | 10 (100.0%)
Other | 1 (50.0%) | 1 (50.0%) | 0 (0.0%) | 2 (100.0%)
Pulmonary | 9 (45.0%) | 11 (55.0%) | 0 (0.0%) | 20 (100.0%)
Vision | 6 (66.7%) | 3 (33.3%) | 0 (0.0%) | 9 (100.0%)
<NA> | 380 (48.8%) | 379 (48.7%) | 19 (2.4%) | 778 (100.0%)
Total | 489 (48.9%) | 489 (48.9%) | 22 (2.2%) | 1000 (100.0%)
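The cross-tabulation above reports row proportions, which is the default. As a hedged sketch, the call below switches to column proportions and leaves missing values out; the pair of variables (smoker and diseased) is picked purely as an example.

```r
library(summarytools)
data("tobacco")

# Cross-tabulate smoking status against disease status
smoker_by_disease <- ctable(
  x     = tobacco$smoker,
  y     = tobacco$diseased,
  prop  = "c",    # column proportions instead of the default row proportions
  useNA = "no"    # exclude missing values from the table
)

print(smoker_by_disease)
```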
There were times when I opened a dataset and did not know the appropriate way to explore it. I hope this blog post gives a starting point to those who face a similar problem. Once this package has given you an idea of what type of dataset you are dealing with, you can choose an appropriate data visualization.