tidyverse | Routine data analysis operations: grouped summaries (summarise + group_by)

| This article was first published on "生信补给站" https://mp.weixin.qq.com/s/tQt0ezYJj3H7x3aWZmKVEQ

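Earlier posts on simple data processing with tidyverse:

A tour of Tidyverse | Selecting columns with select: mastering column operations

A tour of Tidyverse | Whatever you need, we have it: filtering rows with filter

Tidyverse | Splitting and merging data columns: one into many, many into one

Tidyverse | XX_join: the various joins between multiple data tables (files)

This post covers summarising variables and grouped summaries.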

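1 Summarising with summarise

The summary function summarise() (summarize() is an equivalent spelling) collapses a data frame into a single row. It is most often used together with group_by().

1.1 summarise: summarising specified variables

Compute the mean, standard deviation, minimum, row count, and a logical test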

library(dplyr)
iris %>%
    summarise(mean(Petal.Length),                           # unnamed
              sd_pet_len = sd(Petal.Length, na.rm = TRUE),  # named
              min_pet_len = min(Petal.Length),
              n = n(),
              any(Sepal.Length > 5))
#  mean(Petal.Length) sd_pet_len min_pet_len   n any(Sepal.Length > 5)
#1              3.758   1.765298           1 150                  TRUE

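Commonly used functions:

Center (measures of location): mean(), median()

Spread (measures of dispersion): sd(), IQR(), mad()

Range (measures of rank): min(), max(), quantile()

Position (measures of position): first(), last(), nth()

Count: n(), n_distinct()

Logical (tests on logical values): any(), all(); sum() and mean() give the count and proportion of TRUE values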
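As a minimal sketch, here is one helper from each family in the list, applied to the same iris column. The output column names (med_pet_len and so on) are arbitrary labels chosen for this illustration, not names from the original post.

```r
library(dplyr)

# One helper from each family, applied to iris$Petal.Length.
# Column names are illustrative labels only.
helpers <- iris %>%
    summarise(med_pet_len    = median(Petal.Length),                          # center
              iqr_pet_len    = IQR(Petal.Length),                             # spread
              q25_pet_len    = quantile(Petal.Length, 0.25, names = FALSE),   # rank
              second_pet_len = nth(Petal.Length, 2),                          # position
              last_pet_len   = last(Petal.Length),                            # position
              uniq_pet_len   = n_distinct(Petal.Length),                      # count
              all_positive   = all(Petal.Length > 0))                         # logical
helpers
```

Each expression returns a single value, so the result is still a one-row data frame.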

1.2 summarise_if: summarising one class of variables

iris %>%
    summarise_if(is.numeric, ~ mean(., na.rm = TRUE))
#  Sepal.Length Sepal.Width Petal.Length Petal.Width
#1     5.843333    3.057333        3.758    1.199333
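In dplyr 1.0.0 and later, the scoped verbs (summarise_if, summarise_at, summarise_all) are superseded by across(). Assuming such a version is installed, the same summary could be sketched as follows; the object name num_means is only an illustrative choice.

```r
library(dplyr)

# Assumes dplyr >= 1.0.0, where across() is available.
# where(is.numeric) selects every numeric column inside across().
num_means <- iris %>%
    summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
num_means
```

The scoped verbs still work, so existing code does not need to change; across() mainly keeps the column selection and the summary function inside a single summarise() call.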

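1.3 summarise_at: summarising selected variables

Combined with vars(), summarise_at() offers a more flexible way to select the columns that match a condition and then summarise them.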

iris %>%
    summarise_at(vars(ends_with("Length"),Petal.Width),
    list(~mean(.), ~median(.)))
#  Sepal.Length_mean Petal.Length_mean Petal.Width_mean Sepal.Length_median Petal.Length_median
#1          5.843333             3.758         1.199333                 5.8                4.35
#  Petal.Width_median
#1                1.3
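A rough across() equivalent, again assuming dplyr >= 1.0.0, is sketched below. Naming the list elements makes the _mean and _median suffixes explicit instead of relying on names inferred from the lambdas. The object name len_stats is hypothetical.

```r
library(dplyr)

# Assumes dplyr >= 1.0.0. The named list supplies the suffix for each
# output column (column name, underscore, function name).
len_stats <- iris %>%
    summarise(across(c(ends_with("Length"), Petal.Width),
                     list(mean = mean, median = median)))
names(len_stats)
```

With across(), the output columns are ordered variable by variable rather than function by function, so the column order differs from the summarise_at() result even though the values are the same.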

2 Summarising with group_by

Combining group_by() with summarise() gives one of the most frequently used operations in dplyr: the grouped summary.

2.1 Grouping by Species and summarising variables

iris %>%
    group_by(Species) %>%
    summarise(avg_pet_len = mean(Petal.Length),
              sd_pet_len = sd(Petal.Length),
              min_pet_len = min(Petal.Length),
              first_pet_len = first(Petal.Length),
              n_pet_len = n())
# A tibble: 3 x 6
#  Species    avg_pet_len sd_pet_len min_pet_len first_pet_len n_pet_len
#  <fct>            <dbl>      <dbl>       <dbl>         <dbl>     <int>
#1 setosa            1.46      0.174         1             1.4        50
#2 versicolor        4.26      0.470         3             4.7        50
#3 virginica         5.55      0.552         4.5           6          50
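The column-wise summaries from section 1 also work per group. As a sketch, assuming dplyr >= 1.0.0 for across(), the mean of every numeric column for each species could be computed like this (species_means is an illustrative name):

```r
library(dplyr)

# Assumes dplyr >= 1.0.0 for across().
# One row per species, one column per numeric variable.
species_means <- iris %>%
    group_by(Species) %>%
    summarise(across(where(is.numeric), mean))
species_means
```

The grouping column Species is kept automatically and is not passed to mean().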

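2.2 Counting

n(): takes no arguments and returns the size of the current group;

sum(!is.na(x)): returns the number of non-missing values;

n_distinct(x): returns the number of unique values.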

iris %>%
    group_by(Species) %>%
    summarise(n_pet_len = n(),
              noNA_n_pet_len = sum(!is.na(Petal.Length)),
              Petal.Length_uniq_n = n_distinct(Petal.Length))
# A tibble: 3 x 4
#  Species    n_pet_len noNA_n_pet_len Petal.Length_uniq_n
#  <fct>          <int>          <int>               <int>
#1 setosa            50             50                   9
#2 versicolor        50             50                  19
#3 virginica         50             50                  20
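Because iris has no missing values, n() and sum(!is.na(x)) agree in the output above. The difference shows up on data with NAs. The sketch below uses a small hypothetical data set (toy, not from the original post) to illustrate it.

```r
library(dplyr)

# Hypothetical toy data with one missing value in group "a".
toy <- tibble(group = c("a", "a", "a", "b", "b"),
              x     = c(1, NA, 1, 2, 3))

toy_counts <- toy %>%
    group_by(group) %>%
    summarise(n_rows       = n(),                          # all rows in the group
              n_non_na     = sum(!is.na(x)),               # rows where x is present
              n_uniq       = n_distinct(x),                # NA counts as a distinct value
              n_uniq_no_na = n_distinct(x, na.rm = TRUE))  # NA ignored
toy_counts
```

For group "a" this gives 3 rows, 2 non-missing values, 2 distinct values when NA is counted, and 1 distinct value when it is not. For group "b" all four counts are 2.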

Alternatively, dplyr's count() function can do the counting:

iris %>%
    count(Species)
# A tibble: 3 x 2
#  Species        n
#  <fct>      <int>
#1 setosa        50
#2 versicolor    50
#3 virginica     50
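As a quick check, count() should report the same group sizes as the group_by() plus summarise(n = n()) combination. The object names below are illustrative.

```r
library(dplyr)

# Two routes to the same per-group row counts.
by_count <- iris %>%
    count(Species)

by_summarise <- iris %>%
    group_by(Species) %>%
    summarise(n = n())

# TRUE when both approaches agree on the group sizes.
identical(by_count$n, by_summarise$n)
```

count() is the shorter form when only the group sizes are needed; the longer form is useful when other summaries are computed in the same call.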

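2.3 Counts and proportions of logical values

When used with numeric functions, TRUE is converted to 1 and FALSE to 0.

This makes sum() and mean() well suited to logical values: sum(x) gives the number of TRUE values in x, and mean(x) gives their proportion.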

iris %>%
    group_by(Species) %>%
    summarise(n_pet_len = n(),
              noNA_n_pet_len = sum(!is.na(Petal.Length)),
              Petal.Length_uniq_n = n_distinct(Petal.Length),
              # the comparison yields one logical per group, so sum() returns 0 or 1
              Petal.Length_uniq_n2 = sum(n_distinct(Petal.Length) >= 20))
# A tibble: 3 x 5
#  Species    n_pet_len noNA_n_pet_len Petal.Length_uniq_n Petal.Length_uniq_n2
#  <fct>          <int>          <int>               <int>                <int>
#1 setosa            50             50                   9                    0
#2 versicolor        50             50                  19                    0
#3 virginica         50             50                  20                    1
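The example above applies sum() to a single logical value per group. To show both sum() and mean() on a row-wise condition, here is a complementary sketch. The threshold 4.5 and the column names are arbitrary choices for illustration.

```r
library(dplyr)

# Count and proportion of flowers per species whose petal is longer
# than 4.5 (an arbitrary illustrative threshold).
long_petals <- iris %>%
    group_by(Species) %>%
    summarise(n_long    = sum(Petal.Length > 4.5),    # number of TRUE values
              prop_long = mean(Petal.Length > 4.5))   # proportion of TRUE values
long_petals
```

Since every species has 50 rows, prop_long equals n_long divided by 50 in each group.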

References:

https://r4ds.had.co.nz/

Book: R for Data Science (《R 数据科学》)

Source: https://www.cnblogs.com/Mao1518202/p/13258175.html