An R Package for Neat and Quick Descriptive Statistical Analysis


Exploratory data analysis is an important step before any data modeling. Without looking at descriptive statistics or appropriate plots, it's hard to build the right model. There are several base functions and packages for such analysis and visualization. In several data science papers, I have noticed either too much detail, which can cause a reader to lose interest, or poorly rendered plots that do not say much about the data. In this blog post, I will look at the summarytools package, which provides a few functions to neatly summarize data. It is especially useful in an R Markdown report because it can render the summary output directly to HTML for display in the application or report.

summarytools provides four main functions for summarizing data:

dfSummary is used to summarize an entire dataset. For each variable it reports the type, summary statistics, frequencies, and the number of valid and missing values, and it automatically creates a plot showing the distribution of the data. The output can be controlled using various arguments.

library(summarytools)
print(dfSummary(tobacco), method = "render")

Data Frame Summary

tobacco

Dimensions: 1000 x 9
Duplicates: 2
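| No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
|----|----------|----------------|--------------------|-------|---------|
| 1 | gender [factor] | 1. F | 489 (50.0%) | 978 (97.8%) | 22 (2.2%) |
| | | 2. M | 489 (50.0%) | | |
| 2 | age [numeric] | Mean (sd): 49.6 (18.3) | 63 distinct values | 975 (97.5%) | 25 (2.5%) |
| | | min < med < max: 18 < 50 < 80 | | | |
| | | IQR (CV): 32 (0.4) | | | |
| 3 | age.gr [factor] | 1. 18-34 | 258 (26.5%) | 975 (97.5%) | 25 (2.5%) |
| | | 2. 35-50 | 241 (24.7%) | | |
| | | 3. 51-70 | 317 (32.5%) | | |
| | | 4. 71 + | 159 (16.3%) | | |
| 4 | BMI [numeric] | Mean (sd): 25.7 (4.5) | 974 distinct values | 974 (97.4%) | 26 (2.6%) |
| | | min < med < max: 8.8 < 25.6 < 39.4 | | | |
| | | IQR (CV): 5.7 (0.2) | | | |
| 5 | smoker [factor] | 1. Yes | 298 (29.8%) | 1000 (100%) | 0 (0%) |
| | | 2. No | 702 (70.2%) | | |
| 6 | cigs.per.day [numeric] | Mean (sd): 6.8 (11.9) | 37 distinct values | 965 (96.5%) | 35 (3.5%) |
| | | min < med < max: 0 < 0 < 40 | | | |
| | | IQR (CV): 11 (1.8) | | | |
| 7 | diseased [factor] | 1. Yes | 224 (22.4%) | 1000 (100%) | 0 (0%) |
| | | 2. No | 776 (77.6%) | | |
| 8 | disease [character] | 1. Hypertension | 36 (16.2%) | 222 (22.2%) | 778 (77.8%) |
| | | 2. Cancer | 34 (15.3%) | | |
| | | 3. Cholesterol | 21 (9.5%) | | |
| | | 4. Heart | 20 (9.0%) | | |
| | | 5. Pulmonary | 20 (9.0%) | | |
| | | 6. Musculoskeletal | 19 (8.6%) | | |
| | | 7. Diabetes | 14 (6.3%) | | |
| | | 8. Hearing | 14 (6.3%) | | |
| | | 9. Digestive | 12 (5.4%) | | |
| | | 10. Hypotension | 11 (5.0%) | | |
| | | [ 3 others ] | 21 (9.5%) | | |
| 9 | samp.wgts [numeric] | Mean (sd): 1 (0.1) | 0.86!: 267 (26.7%) | 1000 (100%) | 0 (0%) |
| | | min < med < max: 0.9 < 1 < 1.1 | 1.04!: 249 (24.9%) | | |
| | | IQR (CV): 0.2 (0.1) | 1.05!: 324 (32.4%) | | |
| | | | 1.06!: 160 (16.0%) | | |
| | | | ! rounded | | |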

Generated by summarytools 0.9.6 (R version 4.0.0)
2020-05-02
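As a rough illustration of controlling the output through arguments, the sketch below asks dfSummary for a more compact table. It assumes the summarytools package is installed (the code is skipped otherwise), and the particular combination of arguments is only an example, not a recommended default.

```r
# Sketch only: runs when summarytools is installed, otherwise does nothing.
if (requireNamespace("summarytools", quietly = TRUE)) {
  library(summarytools)

  # Load the example dataset that ships with the package.
  data("tobacco", package = "summarytools")

  # Drop the graph, valid-count and row-number columns, and list at most
  # five distinct values per categorical variable.
  compact <- dfSummary(
    tobacco,
    graph.col = FALSE,
    valid.col = FALSE,
    varnumbers = FALSE,
    max.distinct.values = 5
  )
  print(compact)
}
```

Leaving out the graph column can be handy when the summary is printed to the console, where the plots are only rough text approximations.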

descr is used to obtain more detailed statistics on numerical variables.

descr(tobacco$age)
## Descriptive Statistics  
## tobacco$age  
## N: 1000  
## 
##                        age
## ----------------- --------
##              Mean    49.60
##           Std.Dev    18.29
##               Min    18.00
##                Q1    34.00
##            Median    50.00
##                Q3    66.00
##               Max    80.00
##               MAD    23.72
##               IQR    32.00
##                CV     0.37
##          Skewness    -0.04
##       SE.Skewness     0.08
##          Kurtosis    -1.26
##           N.Valid   975.00
##         Pct.Valid    97.50
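To make a few of these rows more concrete, here is a minimal base R sketch that computes similar statistics by hand. The vector is a hypothetical toy example, not the tobacco data, and the code uses base R defaults, which may not match every convention descr uses internally. The CV row in the output above is consistent with the standard deviation divided by the mean (18.29 / 49.60 ≈ 0.37).

```r
# Hypothetical toy vector (not the tobacco data) with one missing value.
age <- c(18, 25, 34, 50, 66, 80, NA)
valid <- age[!is.na(age)]

stats <- c(
  Mean      = mean(valid),
  Std.Dev   = sd(valid),
  Median    = median(valid),
  MAD       = mad(valid),               # scaled median absolute deviation
  IQR       = IQR(valid),               # Q3 - Q1, base R default quantiles
  CV        = sd(valid) / mean(valid),  # coefficient of variation
  N.Valid   = length(valid),            # number of non-missing values
  Pct.Valid = 100 * length(valid) / length(age)
)
print(round(stats, 2))
```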

freq is used to obtain more detailed statistics on categorical variables.

freq(tobacco$smoker)
## Frequencies  
## tobacco$smoker  
## Type: Factor  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##         Yes    298     29.80          29.80     29.80          29.80
##          No    702     70.20         100.00     70.20         100.00
##        <NA>      0                               0.00         100.00
##       Total   1000    100.00         100.00    100.00         100.00
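The difference between the "% Valid" and "% Total" columns only shows up when a variable has missing values, which smoker does not. The base R sketch below uses a hypothetical toy factor with one NA to show how the two percentages diverge. It is an illustration of the idea, not the code freq runs internally.

```r
# Hypothetical toy factor (not the tobacco data) with one missing value.
smoker <- factor(c("Yes", "No", "No", NA, "No"), levels = c("Yes", "No"))

counts <- table(smoker, useNA = "always")
n <- as.vector(counts)
is_valid <- !is.na(names(counts))

freq_table <- data.frame(
  value     = ifelse(is_valid, names(counts), "<NA>"),
  freq      = n,
  # % Valid ignores missing values in the denominator.
  pct_valid = ifelse(is_valid, 100 * n / sum(n[is_valid]), NA),
  # % Total divides by all observations, missing ones included.
  pct_total = 100 * n / sum(n)
)
print(freq_table)
```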

ctable is used to cross-tabulate frequencies for a pair of categorical variables.

print(ctable(tobacco$disease, tobacco$gender), method = "render")

Cross-Tabulation, Row Proportions

disease * gender

Data Frame: tobacco
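| disease | gender: F | gender: M | gender: <NA> | Total |
|---------|-----------|-----------|--------------|-------|
| Cancer | 16 (47.1%) | 18 (52.9%) | 0 (0.0%) | 34 (100.0%) |
| Cholesterol | 10 (47.6%) | 11 (52.4%) | 0 (0.0%) | 21 (100.0%) |
| Diabetes | 8 (57.1%) | 5 (35.7%) | 1 (7.1%) | 14 (100.0%) |
| Digestive | 5 (41.7%) | 7 (58.3%) | 0 (0.0%) | 12 (100.0%) |
| Hearing | 5 (35.7%) | 9 (64.3%) | 0 (0.0%) | 14 (100.0%) |
| Heart | 9 (45.0%) | 11 (55.0%) | 0 (0.0%) | 20 (100.0%) |
| Hypertension | 18 (50.0%) | 17 (47.2%) | 1 (2.8%) | 36 (100.0%) |
| Hypotension | 7 (63.6%) | 4 (36.4%) | 0 (0.0%) | 11 (100.0%) |
| Musculoskeletal | 8 (42.1%) | 10 (52.6%) | 1 (5.3%) | 19 (100.0%) |
| Neurological | 7 (70.0%) | 3 (30.0%) | 0 (0.0%) | 10 (100.0%) |
| Other | 1 (50.0%) | 1 (50.0%) | 0 (0.0%) | 2 (100.0%) |
| Pulmonary | 9 (45.0%) | 11 (55.0%) | 0 (0.0%) | 20 (100.0%) |
| Vision | 6 (66.7%) | 3 (33.3%) | 0 (0.0%) | 9 (100.0%) |
| <NA> | 380 (48.8%) | 379 (48.7%) | 19 (2.4%) | 778 (100.0%) |
| Total | 489 (48.9%) | 489 (48.9%) | 22 (2.2%) | 1000 (100.0%) |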

Generated by summarytools 0.9.6 (R version 4.0.0)
2020-05-02
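The percentages in this table are row proportions: each cell is divided by its row total, so every row sums to 100%. A minimal base R sketch of the same idea, using hypothetical toy vectors rather than the tobacco data, might look like this. ctable also has a prop argument for choosing row, column, or total proportions, but the sketch below sticks to base R.

```r
# Hypothetical toy vectors (not the tobacco data).
disease <- c("Cancer", "Cancer", "Heart", "Heart", "Heart", NA)
gender  <- c("F", "M", "F", "F", "M", "M")

# Cross-tabulate counts, keeping missing values as their own category.
counts <- table(disease, gender, useNA = "ifany")

# margin = 1 divides each cell by its row total (row proportions).
row_props <- prop.table(counts, margin = 1)
print(round(100 * row_props, 1))
```

Switching to margin = 2 would give column proportions instead, which answers a different question: the share of each disease within a gender rather than the gender split within a disease.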

There were times when I opened a dataset and did not know the appropriate way to explore it. I hope this blog post gives a starting point to those who face a similar problem. Once this package has given you an idea of what kind of dataset you are dealing with, you can choose appropriate data visualizations.

Data Scientist

Saayed Alam creates machine learning products and occasionally gets philosophical.
