dplyr 0.4.0

January 9, 2015 in Uncategorized

I’m very pleased to announce that dplyr 0.4.0 is now available from CRAN. Get the latest version by running:

```
install.packages("dplyr")
```

dplyr 0.4.0 includes over 80 minor improvements and bug fixes, which are described in detail in the release notes. Here I want to draw your attention to two areas that have particularly improved since dplyr 0.3: two-table verbs and data frame support.

Two-table verbs

dplyr now has full support for all two-table verbs provided by SQL:

Mutating joins, which add new variables to one table from matching rows in another: inner_join(), left_join(), right_join(), full_join(). (Support for non-equi joins is planned for dplyr 0.5.0.)
Filtering joins, which filter observations from one table based on whether or not they match an observation in the other table: semi_join(), anti_join().
Set operations, which combine the observations in two data sets as if they were set elements: intersect(), union(), setdiff().
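
As a rough illustration of the three families above, here is a minimal sketch on two small made-up tables. The tables `flights` and `airlines`, their columns, and the result names are invented for this example and do not come from dplyr or its vignettes. Row order of the results is not relied on.

```r
library(dplyr)

# Hypothetical toy tables, invented for illustration only.
flights <- data.frame(
  carrier = c("AA", "UA", "ZZ"),
  delay = c(5, 12, 40),
  stringsAsFactors = FALSE
)
airlines <- data.frame(
  carrier = c("AA", "UA", "DL"),
  name = c("American", "United", "Delta"),
  stringsAsFactors = FALSE
)

# Mutating joins: add the `name` column from airlines to flights.
inner <- inner_join(flights, airlines, by = "carrier")  # only carriers present in both
left  <- left_join(flights, airlines, by = "carrier")   # every flight; name is NA for "ZZ"
full  <- full_join(flights, airlines, by = "carrier")   # every carrier from either table

# Filtering joins: keep or drop flights, without adding any columns.
matched   <- semi_join(flights, airlines, by = "carrier")  # flights with a known airline
unmatched <- anti_join(flights, airlines, by = "carrier")  # flights without one

# Set operations: both inputs must have the same columns.
a <- data.frame(x = c(1, 2, 3))
b <- data.frame(x = c(2, 3, 4))
both   <- intersect(a, b)  # rows found in a and in b
either <- union(a, b)      # rows found in a or in b, duplicates removed
only_a <- setdiff(a, b)    # rows found in a but not in b
```

A quick way to choose between the families: reach for a mutating join when you want new columns, a filtering join when you only want to keep or drop rows, and a set operation when both tables have identical columns and you care about whole rows.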

Together, these verbs should allow you to solve 95% of data manipulation problems that involve multiple tables. If any of the concepts are unfamiliar to you, I highly recommend reading the two-table vignette (and if you still don’t understand, please let me know so I can make it better).

Data frames

dplyr wraps data frames in a tbl_df class. These objects are structured in exactly the same way as regular data frames, but their behaviour has been tweaked a little to make them easier to work with. The new data_frames vignette describes how dplyr works with data frames in general, and below I highlight some of the features new in 0.4.0.

PRINTING

The biggest difference is printing: print.tbl_df() doesn’t try and print 10,000 rows! Printing got a lot of love in dplyr 0.4 and now:

All print() methods invisibly return their input, so you can interleave print() statements into a pipeline to see interim results.
If you’ve managed to produce a 0-row data frame, dplyr won’t try to print the data, but will tell you the column names and types:
```
data_frame(x = numeric(), y = character())
#> Source: local data frame [0 x 2]
#>
#> Variables not shown: x (dbl), y (chr)
```
dplyr never prints row names since no dplyr method is guaranteed to preserve them:
```
df <- data.frame(x = c(a = 1, b = 2, c = 3))
df
#>   x
#> a 1
#> b 2
#> c 3

df %>% tbl_df()
#> Source: local data frame [3 x 1]
#>
#>   x
#> 1 1
#> 2 2
#> 3 3
```
I don’t think using row names is a good idea because it violates one of the principles of tidy data: every variable should be stored in the same way.

To make life a bit easier if you do have row names, you can use the new add_rownames() to turn your row names into a proper variable:
```
df %>%
  add_rownames()
#>   rowname x
#> 1       a 1
#> 2       b 2
#> 3       c 3
```
(But you’re better off never creating them in the first place.)
options(dplyr.print_max) is now 20, so dplyr will never print more than 20 rows of data (previously it was 100). The best way to see more rows of data is to use View().
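
The first point in the list above, print() returning its input invisibly, can be sketched as follows. The data frame and its column names are made up for illustration. The example uses a plain data frame, whose base print method returns its input in the same way, so that it does not depend on how a particular dplyr version wraps data frames.

```r
library(dplyr)

# A small made-up data frame; the column names are illustrative only.
df <- data.frame(
  g = c("a", "a", "b"),
  x = c(1, 2, 3),
  stringsAsFactors = FALSE
)

result <- df %>%
  filter(x > 1) %>%
  print() %>%                  # shows the filtered rows, then passes them on unchanged
  group_by(g) %>%
  summarise(total = sum(x))    # continues from the same filtered rows
```

Because print() hands its input straight through, removing the print() line leaves `result` unchanged.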

COERCING LISTS TO DATA FRAMES

When you have a list of vectors of equal length that you want to turn into a data frame, dplyr provides as_data_frame() as a simple alternative to as.data.frame(). as_data_frame() is considerably faster than as.data.frame() because it does much less:

```
l <- replicate(26, sample(100), simplify = FALSE)
names(l) <- letters
microbenchmark::microbenchmark(
  as_data_frame(l),
  as.data.frame(l)
)
#> Unit: microseconds
#>              expr      min        lq   median        uq      max neval
#>  as_data_frame(l)  101.856  112.0615  124.855  143.0965  254.193   100
#>  as.data.frame(l) 1402.075 1466.6365 1511.644 1635.1205 3007.299   100
```

It’s difficult to precisely describe what as.data.frame(x) does, but it’s similar to do.call(cbind, lapply(x, data.frame)): it coerces each component to a data frame and then cbind()s them all together.

The speed of as.data.frame() is not usually a bottleneck in interactive use, but it can be a problem when combining thousands of lists into one tidy data frame (this is common when working with data stored in JSON or XML).
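
To give a feel for how little work a conversion strictly needs when the input is already a named list of equal-length vectors, here is a minimal base R sketch. `quick_df()` is a hypothetical helper written for this illustration. It is not how dplyr implements as_data_frame(), and it skips the checks and coercions a real converter would need.

```r
# quick_df() is a hypothetical helper for illustration only; it is not
# dplyr's implementation of as_data_frame().
quick_df <- function(l) {
  stopifnot(is.list(l), !is.null(names(l)))
  n <- unique(vapply(l, length, integer(1)))
  stopifnot(length(n) == 1)  # every component must have the same length

  # A data frame is a named list of equal-length vectors that carries
  # row names and the "data.frame" class, so only those two are added.
  attr(l, "row.names") <- .set_row_names(n)
  class(l) <- "data.frame"
  l
}

l <- list(a = 1:3, b = c(0.5, 1.5, 2.5))
df <- quick_df(l)
df
```

Nothing is copied column by column and nothing is coerced, which is the sense in which a narrowly scoped converter can "do much less" than as.data.frame().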

BINDING ROWS AND COLUMNS

dplyr now provides bind_rows() and bind_cols() for binding data frames together. Compared to rbind() and cbind(), the functions:

Accept either individual data frames, or a list of data frames:

```
a <- data_frame(x = 1:5)
b <- data_frame(x = 6:10)

bind_rows(a, b)
#> Source: local data frame [10 x 1]
#>
#>    x
#> 1  1
#> 2  2
#> 3  3
#> 4  4
#> 5  5
#> .. .

bind_rows(list(a, b))
#> Source: local data frame [10 x 1]
#>
#>    x
#> 1  1
#> 2  2
#> 3  3
#> 4  4
#> 5  5
#> .. .
```

If x is a list of data frames, bind_rows(x) is equivalent to do.call(rbind, x).

Are much faster:

```
dfs <- replicate(100, data_frame(x = runif(100)), simplify = FALSE)
microbenchmark::microbenchmark(
  do.call("rbind", dfs),
  bind_rows(dfs)
)
#> Unit: microseconds
#>                   expr      min        lq   median        uq       max neval
#>  do.call("rbind", dfs) 5344.660 6605.3805 6964.236 7693.8465 43457.061   100
#>         bind_rows(dfs)  240.342  262.0845  317.582  346.6465  2345.832   100
```

(Generally you should avoid bind_cols() in favour of a join; otherwise check carefully that the rows are in a compatible order).
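
The warning about bind_cols() can be made concrete with a small sketch. The tables `people` and `scores` and their columns are invented for this example. bind_cols() pairs rows purely by position, so when the two tables are sorted differently it silently attaches values to the wrong rows, whereas a join matches on the key.

```r
library(dplyr)

# Hypothetical tables describing the same three ids, stored in different row orders.
people <- data.frame(
  id = c(1, 2, 3),
  name = c("ann", "bob", "cy"),
  stringsAsFactors = FALSE
)
scores <- data.frame(
  id = c(3, 1, 2),
  score = c(30, 10, 20)
)

# Positional binding: row 1 of people is glued to row 1 of scores, and so on,
# so "ann" (id 1) ends up next to the score that belongs to id 3.
by_position <- bind_cols(people, select(scores, score))

# Key-based join: each person receives the score recorded for their own id.
by_key <- left_join(people, scores, by = "id")

by_position
by_key
```

If you do need bind_cols(), sorting both tables by the same key first (for example with arrange()) is one way to make the row orders compatible.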

LIST-VARIABLES

Data frames are usually made up of a list of atomic vectors that all have the same length. However, it’s also possible to have a variable that’s a list, which I call a list-variable. Because of data.frame()’s complex coercion rules, the easiest way to create a data frame containing a list-variable is with data_frame():

```
data_frame(x = 1, y = list(1), z = list(list(1:5, "a", "b")))
#> Source: local data frame [1 x 3]
#>
#>   x        y         z
#> 1 1 <dbl[1]> <list[3]>
```

Note how list-variables are printed: a list-variable could contain a lot of data, so dplyr only shows a brief summary of the contents. List-variables are useful for:

Working with summary functions that return more than one value:

```
qs <- mtcars %>%
  group_by(cyl) %>%
  summarise(y = list(quantile(mpg)))

# Unnest input to collapse into rows
qs %>% tidyr::unnest(y)
#> Source: local data frame [15 x 2]
#>
#>    cyl    y
#> 1    4 21.4
#> 2    4 22.8
#> 3    4 26.0
#> 4    4 30.4
#> 5    4 33.9
#> .. ...  ...

# To extract individual elements into columns, wrap the result in rowwise()
# then use summarise()
qs %>%
  rowwise() %>%
  summarise(q25 = y[2], q75 = y[4])
#> Source: local data frame [3 x 2]
#>
#>     q25   q75
#> 1 22.80 30.40
#> 2 18.65 21.00
#> 3 14.40 16.25
```

Keeping associated data frames and models together:

```
by_cyl <- split(mtcars, mtcars$cyl)
models <- lapply(by_cyl, lm, formula = mpg ~ wt)
data_frame(cyl = c(4, 6, 8), data = by_cyl, model = models)
#> Source: local data frame [3 x 3]
#>
#>   cyl            data   model
#> 1   4 <S3:data.frame> <S3:lm>
#> 2   6 <S3:data.frame> <S3:lm>
#> 3   8 <S3:data.frame> <S3:lm>
```

dplyr’s support for list-variables continues to mature. In 0.4.0, you can join and row bind list-variables and you can create them in summarise and mutate.
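
A minimal sketch of those three operations (creating a list-variable in mutate(), joining a table that has one, and row-binding it) might look like this. The tables and column names are invented for illustration, and the exact printed form of list-variables differs between dplyr versions, so no output is shown.

```r
library(dplyr)

# Hypothetical starting table with one row per group.
groups <- data.frame(g = c("a", "b"), stringsAsFactors = FALSE)

# Create a list-variable in mutate(): one vector per row, of differing lengths.
with_list <- groups %>%
  mutate(v = list(1:3, 4:9))

# Join a table carrying a list-variable to an ordinary lookup table.
labels <- data.frame(
  g = c("a", "b"),
  label = c("first", "second"),
  stringsAsFactors = FALSE
)
joined <- left_join(with_list, labels, by = "g")

# Row-bind two tables that both contain the list-variable.
stacked <- bind_rows(with_list, with_list)

# Each element of the list-variable keeps its own length.
sapply(stacked$v, length)
```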

My vision of list-variables is still partial and incomplete, but I’m convinced that they will make pipeable APIs for modelling much easier. See the draft lowliner package for more explorations in this direction.

Bonus

My colleague, Garrett, helped me make a cheat sheet that summarizes the data wrangling features of dplyr 0.4.0. You can download it from RStudio’s new gallery of R cheat sheets.
