Some time ago, I wrote a Better summary function in R. Here is its multivariate extension:
    msumm = function(y){ # multivariate summary
      usumm = function(x, h = 2){ # h is the number of values to print in the head and tail functions.
        if (!require(moments)) {
          stop("The function requires the moments package. To install it, run 'install.packages(\"moments\")'.\n")
        }
        # One-row data.frame of summary statistics, ignoring NA values
        a1 = suppressWarnings(data.frame(min = min(x, na.rm = TRUE), med = median(x, na.rm = TRUE),
                                         mean = mean(x, na.rm = TRUE), max = max(x, na.rm = TRUE),
                                         sd = stats::sd(x, na.rm = TRUE),
                                         skew = skewness(x, na.rm = TRUE), kurt = kurtosis(x, na.rm = TRUE)))
        headh = head(x, h) ; tailh = tail(x, h)
        naat = which(is.na(x)) # positions of the NA values
        # Set the flag before naat is replaced by a message, otherwise it is always "Yes"
        missing = ifelse(length(naat) == 0, "No", "Yes")
        if (length(naat) == 0) naat = c("No NA's in the series")
        l = list(summary.stat = a1, na.at = naat, Head = headh, Tail = tailh,
                 Length = length(x), missing = missing)
        return(l)
      }
      l1 = apply(y, 2, usumm) # one list per column, named by column
      stats = NULL ; missing = NULL
      for (i in seq_along(l1)){
        stats = rbind(stats, l1[[i]]$summary.stat)
        missing = cbind(missing, l1[[i]]$missing)
      }
      stats = cbind(stats, missing = t(missing))
      rownames(stats) <- names(l1)
      print(stats)
      return(l1)
    }
Generate some fictitious data and see the results. The results are pretty much self-explanatory. The last column indicates whether the variable has missing values.
    d = data.frame(rnormal = rnorm(10), mat = matrix(c(1:8,NA,10), nrow = 10, ncol = 1))
    m = msumm(d)
    #### Here is the print:
    #               min          med       mean        max        sd       skew     kurt missing
    # rnormal -1.671465 -0.001411399 -0.1643362  0.9381707 0.7642856 -0.5742236 2.742984      No
    # mat      1.000000  5.000000000  5.1111111 10.0000000 2.9344695  0.1995093 2.000760     Yes
    names(m) # the names of the variables in the matrix/data.frame:
    # [1] "rnormal" "mat"
    #### Zoom in on the "mat" variable:
    m$mat
    # $summary.stat
    #   min med     mean max       sd      skew    kurt
    # 1   1   5 5.111111  10 2.934469 0.1995093 2.00076
    #
    # $na.at # indicates the location of the NA values:
    # [1] 9
    #
    # $Head
    # [1] 1 2
    #
    # $Tail
    # [1] NA 10
    #
    # $Length
    # [1] 10
    #
    # $missing
    # [1] "Yes"
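The skew and kurt columns can be checked by hand. Here is a minimal sketch in base R, assuming the moment-based definitions (central moments divided by n, with no excess-kurtosis correction), which reproduce the values printed for “mat”. The helper names central_moment, skew_by_hand and kurt_by_hand are made up for this illustration and are not part of msumm.

```r
# Hypothetical helpers for illustration: moment-based skewness and kurtosis.
central_moment = function(x, k) {
  x = x[!is.na(x)]        # drop missing values, like na.rm = TRUE
  mean((x - mean(x))^k)   # k-th central moment, divided by n (not n - 1)
}
skew_by_hand = function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
kurt_by_hand = function(x) central_moment(x, 4) / central_moment(x, 2)^2

mat = c(1:8, NA, 10)      # the "mat" column from the example
skew_by_hand(mat)         # about 0.1995
kurt_by_hand(mat)         # about 2.0008
```

On this scale a normal sample has kurtosis of roughly 3, which is why the rnormal row shows a value near 3 rather than near 0.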
Notes:
1. This function uses the “kurtosis” and “skewness” functions. Both can be found in package “e1071” and in package “moments”. The code above loads “moments”, so that is the one you need installed. If you switch to “e1071”, keep in mind that its kurtosis reports excess kurtosis (3 is subtracted), so the kurt column will differ.
2. I use “suppressWarnings” since the function “sd” produces a meaningless warning which I want to ignore.
3. The function is designed to handle numeric data. It is straightforward to extend it to other class types. (In fact, I have no idea if it’s “straightforward”, but it is common (bad) practice to phrase it as such when you have no time to actually do it. Another alternative is: “for the sake of brevity I refer the interested reader to… and skip it here”.)
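On the third note: one cheap way to cope with mixed data is not to extend the function at all, but to drop the non-numeric columns before calling it. The sketch below assumes that is acceptable for your use case. The helper name numeric_only and the toy data frame d2 are invented for the example.

```r
# Hypothetical helper: keep only the numeric columns of a data.frame.
numeric_only = function(d) {
  keep = vapply(d, is.numeric, logical(1))   # TRUE for numeric/integer columns
  dropped = names(d)[!keep]
  if (length(dropped) > 0) {
    message("Skipping non-numeric columns: ", paste(dropped, collapse = ", "))
  }
  d[, keep, drop = FALSE]                    # drop = FALSE keeps a data.frame
}

d2 = data.frame(x = c(1.5, 2.5, NA), g = c("a", "b", "c"), n = 1:3)
is.character(as.matrix(d2))   # TRUE: apply() would see a character matrix
d2num = numeric_only(d2)
names(d2num)                  # "x" "n"
# m2 = msumm(d2num)           # msumm as defined earlier in the post
```

The filter matters because apply() converts a data.frame to a matrix first, and a single character column turns the whole matrix into character, at which point mean and friends no longer work.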