R语言包_dplyr_1
有5个基础的函数:
- filter
- select
- arrange
- mutate
- summarise
- group_by (plus)
可以和databases以及data tables中的数据打交道。
plyr包的特点
其基础函数有以下特点:
- 第一个参数df
- 返回df
- 没有数据更改in place
正是因为有这些特点,才可以使用%>%操作符,方便逻辑式编程。
载入数据
library(plyr)
library(dplyr)
# load packages
suppressMessages(library(dplyr))
install.packages("hflights")
library(hflights)
# explore data
data(hflights)
head(hflights)
# convert to local data frame
flights <- tbl_df(hflights)
# printing only shows 10 rows and as many columns as can fit on your screen
flights
# you can specify that you want to see more rows
print(flights, n=20)
# convert to a normal data frame to see all of the columns
data.frame(head(flights))
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
filter
keep rows matching criteria
# base R approach to view all flights on January 1
flights[flights$Month==1 & flights$DayofMonth==1, ]
# dplyr approach
# note: you can use comma or ampersand to represent AND condition
filter(flights, Month==1, DayofMonth==1)
# use pipe for OR condition
filter(flights, UniqueCarrier=="AA" | UniqueCarrier=="UA")
# you can also use %in% operator
filter(flights, UniqueCarrier %in% c("AA", "UA"))
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
select
pick columns by name
# base R approach to select DepTime, ArrTime, and FlightNum columns
flights[, c("DepTime", "ArrTime", "FlightNum")]
# dplyr approach
select(flights, DepTime, ArrTime, FlightNum)
# use colon to select multiple contiguous columns, and use `contains` to match columns by name
# note: `starts_with`, `ends_with`, and `matches` (for regular expressions) can also be used to match columns by name
select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
“chaining” or “pipelining”
# nesting method to select UniqueCarrier and DepDelay columns and filter for delays over 60 minutes
filter(select(flights, UniqueCarrier, DepDelay), DepDelay > 60)
# chaining method
flights %>%
select(UniqueCarrier, DepDelay) %>%
filter(DepDelay > 60)
# create two vectors and calculate Euclidian distance between them
x1 <- 1:5; x2 <- 2:6
sqrt(sum((x1-x2)^2))
# chaining method
(x1-x2)^2 %>% sum() %>% sqrt()
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
arrange
reorder rows
# base R approach to select UniqueCarrier and DepDelay columns and sort by DepDelay
flights[order(flights$DepDelay), c("UniqueCarrier", "DepDelay")]
# dplyr approach
flights %>%
select(UniqueCarrier, DepDelay) %>%
arrange(DepDelay)
# use `desc` for descending
flights %>%
select(UniqueCarrier, DepDelay) %>%
arrange(desc(DepDelay))
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
mutate
add new variable
create new variables that are functions of exciting variables
which is d
ifferent formtransform
# base R approach to create a new variable Speed (in mph)
flights$Speed <- flights$Distance / flights$AirTime*60
flights[, c("Distance", "AirTime", "Speed")]
# dplyr approach (prints the new variable but does not store it)
flights %>%
select(Distance, AirTime) %>%
mutate(Speed = Distance/AirTime*60)
# store the new variable
flights <- flights %>% mutate(Speed = Distance/AirTime*60)
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
summarise
reduce variables to values
# base R approaches to calculate the average arrival delay to each destination
head(with(flights, tapply(ArrDelay, Dest, mean, na.rm=TRUE)))
head(aggregate(ArrDelay ~ Dest, flights, mean))
# dplyr approach: create a table grouped by Dest, and then summarise each group by taking the mean of ArrDelay
flights %>%
group_by(Dest) %>%
summarise(avg_delay = mean(ArrDelay, na.rm=TRUE))
#summarise_each allows you to apply the same summary function to multiple columns at once
#Note: mutate_each is also available
# for each carrier, calculate the percentage of flights cancelled or diverted
flights %>%
group_by(UniqueCarrier) %>%
summarise_each(funs(mean), Cancelled, Diverted)
# for each carrier, calculate the minimum and maximum arrival and departure delays
flights %>%
group_by(UniqueCarrier) %>%
summarise_each(funs(min(., na.rm=TRUE), max(., na.rm=TRUE)), matches("Delay"))
#Helper function n() counts the number of rows in a group
#Helper function n_distinct(vector) counts the number of unique items in that vector
# for each day of the year, count the total number of flights and sort in descending order
flights %>%
group_by(Month, DayofMonth) %>%
summarise(flight_count = n()) %>%
arrange(desc(flight_count))
# rewrite more simply with the `tally` function
flights %>%
group_by(Month, DayofMonth) %>%
tally(sort = TRUE)
# for each destination, count the total number of flights and the number of distinct planes that flew there
flights %>%
group_by(Dest) %>%
summarise(flight_count = n(), plane_count = n_distinct(TailNum))
# Grouping can sometimes be useful without summarising
# for each destination, show the number of cancelled and not cancelled flights
flights %>%
group_by(Dest) %>%
select(Cancelled) %>%
table() %>%
head()
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
- 23
- 24
- 25
- 26
- 27
- 28
- 29
- 30
- 31
- 32
- 33
- 34
- 35
- 36
- 37
- 38
- 39
- 40
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
- 23
- 24
- 25
- 26
- 27
- 28
- 29
- 30
- 31
- 32
- 33
- 34
- 35
- 36
- 37
- 38
- 39
- 40
Window Functions
- Aggregation function (like mean) takes n inputs and returns 1 value
- Window function takes n inputs and returns n values
Includes ranking and ordering functions (like min_rank), offset functions (lead and lag), and cumulative aggregates (like cummean).
# for each carrier, calculate which two days of the year they had their longest departure delays
# note: smallest (not largest) value is ranked as 1, so you have to use `desc` to rank by largest value
flights %>%
group_by(UniqueCarrier) %>%
select(Month, DayofMonth, DepDelay) %>%
filter(min_rank(desc(DepDelay)) <= 2) %>%
arrange(UniqueCarrier, desc(DepDelay))
# rewrite more simply with the `top_n` function
flights %>%
group_by(UniqueCarrier) %>%
select(Month, DayofMonth, DepDelay) %>%
top_n(2,DepDelay) %>%
arrange(UniqueCarrier, desc(DepDelay))
# for each month, calculate the number of flights and the change from the previous month
flights %>%
group_by(Month) %>%
summarise(flight_count = n()) %>%
mutate(change = flight_count - lag(flight_count))
# rewrite more simply with the `tally` function
flights %>%
group_by(Month) %>%
tally() %>%
mutate(change = n - lag(n))
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
- 23
- 24
- 25
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
- 23
- 24
- 25
Other functions
# randomly sample a fixed number of rows, without replacement
flights %>% sample_n(5)
# randomly sample a fraction of rows, with replacement
flights %>% sample_frac(0.25, replace=TRUE)
# base R approach to view the structure of an object
str(flights)
# dplyr approach: better formatting, and adapts to your screen width
glimpse(flights)
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
Connecting Databases
- dplyr can connect to a database as if the data was loaded into a data frame
- Use the same syntax for local data frames and databases
- Only generates SELECT statements
- Currently supports SQLite, PostgreSQL/Redshift, MySQL/MariaDB, BigQuery, MonetDB
- Example below is based upon an SQLite database containing the hflights data
- Instructions for creating this database are in the databases vignette
# connect to an SQLite database containing the hflights data
my_db <- src_sqlite("my_db.sqlite3")
# connect to the "hflights" table in that database
flights_tbl <- tbl(my_db, "hflights")
# example query with our data frame
flights %>%
select(UniqueCarrier, DepDelay) %>%
arrange(desc(DepDelay))
# identical query using the database
flights_tbl %>%
select(UniqueCarrier, DepDelay) %>%
arrange(desc(DepDelay))
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
You can write the SQL commands yourself
dplyr can tell you the SQL it plans to run and the query execution plan
# send SQL commands to the database
tbl(my_db, sql("SELECT * FROM hflights LIMIT 100"))
# ask dplyr for the SQL commands
flights_tbl %>%
select(UniqueCarrier, DepDelay) %>%
arrange(desc(DepDelay)) %>%
explain()
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
参考资料
R语言包_dplyr_1的更多相关文章
- R语言包在linux上的安装等知识
有关install.packages()函数的详见:R包 package 的安装(install.packages函数详解) R的包(package)通常有两种:1 binary package:这种 ...
- R语言 包
R语言包 R语言的包是R函数,编译代码和样本数据的集合. 它们存储在R语言环境中名为"library"的目录下. 默认情况下,R语言在安装期间安装一组软件包. 随后添加更多包,当它 ...
- R语言——包的添加和使用
R是开源的软件工具,很多R语言用户和爱好者都会扩展R的功能模块,我们把这些模块称为包.我们可以通过下载安装这些已经写好的包来完成我们需要的任务工作. 包下载地址:https://cran.r-proj ...
- R语言包的安装
pheatmap包的安装 1: 首先R语言的安装路径里面最好不要有中文路径 2: 在安装其他依存的scales和colorspace包时候要关闭防火墙 错误提示: 试开URL'https://mirr ...
- Windows下使用Rtools编译R语言包
使用devtools安装github中的R源代码时,经常会出各种错误,索性搜了一下怎么在Windows下直接打包,网上的资料也是参差不齐,以下是自己验证通过的. 一.下载Rtools 下载地址:htt ...
- r语言 包说明
[在实际工作中,每个数据科学项目各不相同,但基本都遵循一定的通用流程.具体如下] [下面列出每个步骤最有用的一些R包] 1.数据导入以下R包主要用于数据导入和保存数据:feather:一种快速,轻 ...
- R语言包相关命令
R的包(package)通常有两种:1 binary package:这种包属于即得即用型(ready-to-use),但是依赖与平台,即Win和Linux平台下不同.2 Source package ...
- R语言包翻译
Shiny-cheatsheet 作者:周彦通 1.安装 install.packages("shinydashboard") 2.基础知识 仪表盘有三个部分:标题.侧边栏,身体 ...
- R语言包翻译——翻译
Shiny-cheatsheet ...
随机推荐
- 【Unity】11.8 关节
分类:Unity.C#.VS2015 创建日期:2016-05-02 一.简介 Unity提供了下面的关节组件:铰链关节(Hinge Joint).固定关节(Fixed Joint).弹簧关节(Spr ...
- 最新Windows下c++读写锁SRWLock介绍
https://blog.csdn.net/MoreWindows/article/details/7650574 https://blog.csdn.net/chenzhjlf/article/de ...
- nginx.conf中关于nginx-rtmp-module配置指令详解
译序:截至 Jul 8th,2013 官方公布的最新 Nginx RTMP 模块 nginx-rtmp-module 指令详解.指令Corertmp语法:rtmp { ... }上下文:根描述:保存所 ...
- Android xUtils3源代码解析之网络模块
本文已授权微信公众号<非著名程序猿>原创首发,转载请务必注明出处. xUtils3源代码解析系列 一. Android xUtils3源代码解析之网络模块 二. Android xUtil ...
- hdu1542 Atlantis (线段树+扫描线+离散化)
Atlantis Time Limit: 2000/1000 MS (Java/Others) Memory Limit: 65536/32768 K (Java/Others) Total S ...
- 进程process与线程thread
进程:process是一个外理过程,即然是外理过程,那么它就有生命周期,从进程的启动,运行,直到运行结束,进程终止.进程是程序的执行实例,即运行中的程序,同时也是程序的一个副本,程序是放置于磁盘的,而 ...
- 菜鸟学Java(十四)——Java反射机制(一)
说到反射,相信有过编程经验的人都不会陌生.反射机制让Java变得更加的灵活.反射机制在Java的众多特性中是非常重要的一个.下面就让我们一点一点了解它是怎么一回事. 什么是反射 在运行状态中,对于任意 ...
- 菜鸟学Java(七)——Ajax+Servlet实现无刷新下拉联动
下拉联动的功能可以说非常的常用,例如在选择省.市等信息的时候:或者在选择大类.小类的时候.总之,下拉联动很常用.今天就跟大家分享一个简单的二级下拉联动的功能. 大类下拉框:页面加载的时候就初始化大类的 ...
- CCShatteredTiles3D
CCSprite* pImgBg = CCSprite::create("1.png"); pImgBg->setPosition(ccp(CCDirector::share ...
- scala future
https://docs.scala-lang.org/overviews/core/futures.html https://docs.scala-lang.org/overviews/index. ...