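Characteristics of the dplyr package

Its basic functions share the following properties:

The first argument is a data frame
The return value is a data frame
Data is never modified in place

These properties are what make the %>% operator usable: the output of one call can be passed straight in as the input of the next, so the logic reads as a sequence of steps.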
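A minimal sketch of these three properties. The data frame `df` and its columns are made up for illustration so that the snippet runs without `hflights`:

```r
library(dplyr)

# hypothetical toy data, used only for illustration
df <- data.frame(carrier = c("AA", "UA", "AA"), delay = c(10, 75, 90))

# the data frame goes in as the first argument and a data frame comes back
late <- filter(df, delay > 60)

# the input is untouched: df still has all three rows
nrow(df)
nrow(late)

# because input and output are both data frames, calls can be chained
late_aa <- df %>% filter(delay > 60) %>% filter(carrier == "AA")
late_aa
```

To keep a result you have to assign it, as with `late` above; calling `filter(df, ...)` on its own only prints.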

Loading the data

# load packages (suppressMessages hides dplyr's startup messages)

suppressMessages(library(dplyr))

# install hflights once; comment this line out on later runs

install.packages("hflights")

library(hflights)

# explore data

data(hflights)

head(hflights)

# convert to a tibble (tbl_df is deprecated in current dplyr; as_tibble is its replacement)

flights <- as_tibble(hflights)

# printing only shows 10 rows and as many columns as can fit on your screen

flights

# you can specify that you want to see more rows

print(flights, n=20)

# convert to a normal data frame to see all of the columns

data.frame(head(flights))

filter

keep rows matching criteria

# base R approach to view all flights on January 1

flights[flights$Month==1 & flights$DayofMonth==1, ]

# dplyr approach

# note: you can use comma or ampersand to represent AND condition

filter(flights, Month==1, DayofMonth==1)

# use the vertical bar (|) for OR condition

filter(flights, UniqueCarrier=="AA" | UniqueCarrier=="UA")

# you can also use %in% operator

filter(flights, UniqueCarrier %in% c("AA", "UA"))

select

pick columns by name

# base R approach to select DepTime, ArrTime, and FlightNum columns

flights[, c("DepTime", "ArrTime", "FlightNum")]

# dplyr approach

select(flights, DepTime, ArrTime, FlightNum)

# use colon to select multiple contiguous columns, and use `contains` to match columns by name

# note: `starts_with`, `ends_with`, and `matches` (for regular expressions) can also be used to match columns by name

select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))
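The other helpers named in the comment above work the same way as `contains`. A small sketch, assuming a made-up one-row data frame `toy` whose column names mimic a few `hflights` columns:

```r
library(dplyr)

# hypothetical toy data; only the column names matter here
toy <- data.frame(Year = 2011, Month = 1, DepDelay = 5, ArrDelay = -2,
                  TaxiIn = 7, TaxiOut = 12)

# columns whose names start with "Taxi"
taxi_cols <- names(select(toy, starts_with("Taxi")))

# columns whose names end with "Delay"
delay_cols <- names(select(toy, ends_with("Delay")))

# columns whose names match a regular expression: start with "Dep" or "Arr"
regex_cols <- names(select(toy, matches("^(Dep|Arr)")))

taxi_cols
delay_cols
regex_cols
```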

“chaining” or “pipelining”

# nesting method to select UniqueCarrier and DepDelay columns and filter for delays over 60 minutes

filter(select(flights, UniqueCarrier, DepDelay), DepDelay > 60)

# chaining method

flights %>%

    select(UniqueCarrier, DepDelay) %>%

    filter(DepDelay > 60)

# create two vectors and calculate Euclidean distance between them

x1 <- 1:5; x2 <- 2:6

sqrt(sum((x1-x2)^2))

# chaining method

(x1-x2)^2 %>% sum() %>% sqrt()
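The rule behind chaining is that `x %>% f(y)` is evaluated as `f(x, y)`. The sketch below checks that the nested and chained forms give the same number; the extra `round` step is added here only to show a call that takes a second argument:

```r
library(dplyr)

x1 <- 1:5; x2 <- 2:6

# nested form: read from the inside out
nested <- round(sqrt(sum((x1 - x2)^2)), 2)

# chained form: read left to right
# note: ^ binds more tightly than %>%, so the whole squared vector is piped into sum()
piped <- (x1 - x2)^2 %>% sum() %>% sqrt() %>% round(2)

identical(nested, piped)
```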

arrange

reorder rows

# base R approach to select UniqueCarrier and DepDelay columns and sort by DepDelay

flights[order(flights$DepDelay), c("UniqueCarrier", "DepDelay")]

# dplyr approach

flights %>%

    select(UniqueCarrier, DepDelay) %>%

    arrange(DepDelay)

# use `desc` for descending

flights %>%

    select(UniqueCarrier, DepDelay) %>%

    arrange(desc(DepDelay))

mutate

add new variables
create new variables that are functions of existing variables
unlike base R's transform, mutate can refer to variables created earlier in the same call

# base R approach to create a new variable Speed (in mph)

flights$Speed <- flights$Distance / flights$AirTime*60

flights[, c("Distance", "AirTime", "Speed")]

# dplyr approach (prints the new variable but does not store it)

flights %>%

    select(Distance, AirTime) %>%

    mutate(Speed = Distance/AirTime*60)

# store the new variable

flights <- flights %>% mutate(Speed = Distance/AirTime*60)
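One practical difference from base R's `transform` is that `mutate` can use a variable it has just created within the same call. A small sketch, assuming a made-up data frame `toy` and an illustrative `Fast` column, neither of which is in the original data, and assuming no object named `Speed` exists in the global environment:

```r
library(dplyr)

# hypothetical toy data: distance in miles, air time in minutes
toy <- data.frame(Distance = c(300, 600), AirTime = c(60, 90))

# mutate can use Speed immediately after defining it in the same call
out <- toy %>%
    mutate(Speed = Distance / AirTime * 60,
           Fast  = Speed > 350)
out

# transform evaluates every expression against the original columns,
# so the same call fails because Speed does not exist yet
res <- try(transform(toy, Speed = Distance / AirTime * 60, Fast = Speed > 350),
           silent = TRUE)
inherits(res, "try-error")
```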

summarise

reduce variables to values

# base R approaches to calculate the average arrival delay to each destination

head(with(flights, tapply(ArrDelay, Dest, mean, na.rm=TRUE)))

head(aggregate(ArrDelay ~ Dest, flights, mean))

# dplyr approach: create a table grouped by Dest, and then summarise each group by taking the mean of ArrDelay

flights %>%

    group_by(Dest) %>%

    summarise(avg_delay = mean(ArrDelay, na.rm=TRUE))

# across() inside summarise applies the same summary function to multiple columns at once

# (it replaces summarise_each and funs, which are deprecated in current dplyr)

# Note: mutate(across(...)) is the counterpart of the deprecated mutate_each

# for each carrier, calculate the proportion of flights cancelled or diverted

flights %>%

    group_by(UniqueCarrier) %>%

    summarise(across(c(Cancelled, Diverted), mean))

# for each carrier, calculate the minimum and maximum arrival and departure delays

flights %>%

    group_by(UniqueCarrier) %>%

    summarise(across(matches("Delay"), list(min = ~ min(.x, na.rm=TRUE), max = ~ max(.x, na.rm=TRUE))))

# Helper function n() counts the number of rows in a group

# Helper function n_distinct(vector) counts the number of unique items in that vector

# for each day of the year, count the total number of flights and sort in descending order

flights %>%

    group_by(Month, DayofMonth) %>%

    summarise(flight_count = n()) %>%

    arrange(desc(flight_count))

# rewrite more simply with the `tally` function

flights %>%

    group_by(Month, DayofMonth) %>%

    tally(sort = TRUE)

# for each destination, count the total number of flights and the number of distinct planes that flew there

flights %>%

    group_by(Dest) %>%

    summarise(flight_count = n(), plane_count = n_distinct(TailNum))
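To see what `n()` and `n_distinct()` return, here is the same pattern on a made-up data frame `toy` that is small enough to check by hand:

```r
library(dplyr)

# hypothetical toy data: four flights flown by three distinct planes
toy <- data.frame(Dest    = c("DFW", "DFW", "DFW", "MIA"),
                  TailNum = c("N1", "N1", "N2", "N3"))

counts <- toy %>%
    group_by(Dest) %>%
    summarise(flight_count = n(), plane_count = n_distinct(TailNum))

# DFW has three flights but only two planes; MIA has one of each
counts
```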

# Grouping can sometimes be useful without summarising

# for each destination, show the number of cancelled and not cancelled flights

flights %>%

    group_by(Dest) %>%

    select(Cancelled) %>%

    table() %>%

    head()

Window Functions

An aggregation function (like mean) takes n inputs and returns 1 value
A window function takes n inputs and returns n values
Window functions include ranking and ordering functions (like min_rank), offset functions (lead and lag), and cumulative aggregates (like cummean).
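The difference is easiest to see on a short vector. A minimal sketch, using a made-up vector `delays`:

```r
library(dplyr)

delays <- c(30, 10, 20)   # hypothetical delays in minutes

avg    <- mean(delays)             # aggregation: three inputs, one value
ranks  <- min_rank(desc(delays))   # ranking: the largest delay gets rank 1
before <- lag(delays)              # offset: previous element, NA at the start
after  <- lead(delays)             # offset: next element, NA at the end
run    <- cummean(delays)          # cumulative aggregate: running mean

avg
ranks
before
after
run
```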

# for each carrier, calculate which two days of the year they had their longest departure delays

# note: smallest (not largest) value is ranked as 1, so you have to use `desc` to rank by largest value

flights %>%

    group_by(UniqueCarrier) %>%

    select(Month, DayofMonth, DepDelay) %>%

    filter(min_rank(desc(DepDelay)) <= 2) %>%

    arrange(UniqueCarrier, desc(DepDelay))

# rewrite more simply with the `top_n` function

flights %>%

    group_by(UniqueCarrier) %>%

    select(Month, DayofMonth, DepDelay) %>%

    top_n(2,DepDelay) %>%

    arrange(UniqueCarrier, desc(DepDelay))

# for each month, calculate the number of flights and the change from the previous month

flights %>%

    group_by(Month) %>%

    summarise(flight_count = n()) %>%

    mutate(change = flight_count - lag(flight_count))

# rewrite more simply with the `tally` function

flights %>%

    group_by(Month) %>%

    tally() %>%

    mutate(change = n - lag(n))

Other functions

# randomly sample a fixed number of rows, without replacement

flights %>% sample_n(5)

# randomly sample a fraction of rows, with replacement

flights %>% sample_frac(0.25, replace=TRUE)

# base R approach to view the structure of an object

str(flights)

# dplyr approach: better formatting, and adapts to your screen width

glimpse(flights)

Connecting Databases

dplyr can connect to a database as if the data was loaded into a data frame
Use the same syntax for local data frames and databases
Only generates SELECT statements
Currently supports SQLite, PostgreSQL/Redshift, MySQL/MariaDB, BigQuery, MonetDB
Example below is based upon an SQLite database containing the hflights data
Instructions for creating this database are in the databases vignette

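# connect to an SQLite database containing the hflights data

# (src_sqlite is deprecated in current dplyr; a DBI connection is used instead, which needs the DBI, RSQLite, and dbplyr packages)

my_db <- DBI::dbConnect(RSQLite::SQLite(), "my_db.sqlite3")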

# connect to the "hflights" table in that database

flights_tbl <- tbl(my_db, "hflights")

# example query with our data frame

flights %>%

    select(UniqueCarrier, DepDelay) %>%

    arrange(desc(DepDelay))

# identical query using the database

flights_tbl %>%

    select(UniqueCarrier, DepDelay) %>%

    arrange(desc(DepDelay))

You can write the SQL commands yourself
dplyr can tell you the SQL it plans to run and the query execution plan

# send SQL commands to the database

tbl(my_db, sql("SELECT * FROM hflights LIMIT 100"))

# ask dplyr for the SQL commands

flights_tbl %>%

    select(UniqueCarrier, DepDelay) %>%

    arrange(desc(DepDelay)) %>%

    explain()
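If you do not have the SQLite file at hand, the same workflow can be tried against an in-memory database. This is a sketch under assumptions: the table `toy_flights` and its rows are made up, and the DBI, RSQLite, and dbplyr packages must be installed.

```r
library(dplyr)

# hypothetical toy table standing in for hflights
toy <- data.frame(UniqueCarrier = c("AA", "UA", "WN"),
                  DepDelay      = c(5, 120, 30))

# in-memory SQLite database; it disappears when the connection is closed
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, toy, "toy_flights")

# building the query does not fetch any rows yet
query <- tbl(con, "toy_flights") %>%
    select(UniqueCarrier, DepDelay) %>%
    arrange(desc(DepDelay))

# print the SQL that dplyr plans to send
show_query(query)

# collect() runs the query and returns a local data frame
result <- collect(query)
result

DBI::dbDisconnect(con)
```

`show_query()` prints only the SQL, while `explain()` also asks the database for its execution plan.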
