http://blog.csdn.net/pipisorry/article/details/53320669

pyspark.sql.SQLContext

Main entry point for DataFrame and SQL functionality.

[pyspark.sql.SQLContext]

皮皮blog

pyspark.sql.DataFrame

A distributed collection of data grouped into named columns.

spark df和pandas df

spark df的操作基本和pandas df操作一样的[Pandas小记(6) ]

相互转换

从pandas_df转换：

spark_df = SQLContext.createDataFrame(pandas_df)

sc = SparkContext(master='local[8]', appName='kmeans')
sql_ctx = SQLContext(sc)
lldf_rdd = sql_ctx.createDataFrame(lldf)

另外，createDataFrame支持从list转换spark_df，其中list元素可以为tuple，dict，rdd

从spark_df转换：

pandas_df = spark_df.toPandas()

toPandas()

Returns the contents of this DataFrame as Pandas pandas.DataFrame.

Note that this method should only be used if the resulting Pandas’s DataFrame is expectedto be small, as all the data is loaded into the driver’s memory.

This is only available if Pandas is installed and available.

>>> df.toPandas()
   age   name
0    2  Alice
1    5    Bob

[Spark与Pandas中DataFrame对比（详细）]

spark df方法

rdd: Returns the content as an pyspark.RDD of Row.

rollup(*cols)

Create a multi-dimensional rollup for the current DataFrame usingthe specified columns, so we can run aggregation on them.

>>> df.rollup("name", df.age).count().orderBy("name", "age").show()
+-----+----+-----+
| name| age|count|
+-----+----+-----+
| null|null|    2|
|Alice|null|    1|
|Alice|   2|    1|
|  Bob|null|    1|
|  Bob|   5|    1|
+-----+----+-----+

select(*cols)

Projects a set of expressions and returns a new DataFrame.

Parameters:	cols – list of column names (string) or expressions (Column).If one of the column names is ‘*’, that column is expanded to include all columnsin the current DataFrame.

>>> df.select('*').collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
>>> df.select('name', 'age').collect()
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
>>> df.select(df.name, (df.age + 10).alias('age')).collect()
[Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]

selectExpr(*expr)

Projects a set of SQL expressions and returns a new DataFrame.

This is a variant of select() that accepts SQL expressions.

>>> df.selectExpr("age * 2", "abs(age)").collect()
[Row((age * 2)=4, abs(age)=2), Row((age * 2)=10, abs(age)=5)]

toDF(*cols) Returns a new class:DataFrame that with new specified column names Parameters: cols – list of new column names (string) >>> df.toDF('f1', 'f2').collect() [Row(f1=2, f2=u'Alice'), Row(f1=5, f2=u'Bob')]
persist(storageLevel=StorageLevel(False, True, False, False, 1))¶: Sets the storage level to persist its values across operationsafter the first time it is computed. This can only be used to assigna new storage level if the RDD does not have a storage level set yet.If no storage level is specified defaults to (MEMORY_ONLY).

Parameters:	cols – list of new column names (string)

[pyspark.sql.DataFrame]

from: http://blog.csdn.net/pipisorry/article/details/53320669

ref:

Spark核心类：SQLContext和DataFrame的更多相关文章

Spark核心类：弹性分布式数据集RDD及其转换和操作pyspark.RDD
http://blog.csdn.net/pipisorry/article/details/53257188 弹性分布式数据集RDD(Resilient Distributed Dataset) 术 ...
Spark 核心篇-SparkContext
本章内容: 1.功能描述本篇文章就要根据源码分析SparkContext所做的一些事情,用过Spark的开发者都知道SparkContext是编写Spark程序用到的第一个类,足以说明SparkCo ...
Spark SQL初始化和创建DataFrame的几种方式
一.前述 1.SparkSQL介绍 Hive是Shark的前身,Shark是SparkSQL的前身,SparkSQL产生的根本原因是其完全脱离了Hive的限制. SparkSQL支持查询原 ...
【转载】Spark SQL 1.3.0 DataFrame介绍、使用
http://www.aboutyun.com/forum.php?mod=viewthread&tid=12358&page=1 1.DataFrame是什么?2.如何创建DataF ...
[Spark][Python]spark 从 avro 文件获取 Dataframe 的例子
[Spark][Python]spark 从 avro 文件获取 Dataframe 的例子从如下地址获取文件: https://github.com/databricks/spark-avro/r ...
Spark 核心篇-SparkEnv
本章内容: 1.功能概述 SparkEnv是Spark的执行环境对象,其中包括与众多Executor执行相关的对象.Spark 对任务的计算都依托于 Executor 的能力,所有的 Executor ...
科普Spark，Spark核心是什么，如何使用Spark（1）
科普Spark,Spark是什么,如何使用Spark(1)转自:http://www.aboutyun.com/thread-6849-1-1.html 阅读本文章可以带着下面问题:1.Spark基于 ...
【二】Spark 核心
spark 核心 spark core RDD创建 >>> RDD转换 >>> RDD缓存 >>> RDD行动 >>> RDD输 ...
大数据体系概览Spark、Spark核心原理、架构原理、Spark特点
大数据体系概览Spark.Spark核心原理.架构原理.Spark特点大数据体系概览(Spark的地位) 什么是Spark? Spark整体架构 Spark的特点 Spark核心原理 Spark架构 ...

随机推荐

Mac安装opencv指南
最近接到了新的调研任务.主要是和人脸,各种所谓'AI'相关的.因为这里要处理视频和图像.于是在网上看到很多资料都是关于opencv的所以准备用opencv来开发这些东西.既然要用到opencv.那 ...
Java知IO
---恢复内容开始--- Java将IO(文件.网络.终端)封装成非常多的类,看似繁杂,其实每个类的具有独特的功能. 按照存取的对象是二进制还是文本,java使用字节流和字符流实现IO. 流是java ...
[LeetCode] Maximum Average Subarray I 子数组的最大平均值
Given an array consisting of n integers, find the contiguous subarray of given length k that has the ...
xcode7,AFN不能使用的问题
今天手贱立刻升级了Xcode7,结果AFN报错,且不能用了,解决办法如下第一步:升级AFN到2.6.0 完成之后,运行,结果请求都失败,提示 The resource could not be lo ...
洛谷P3980：[NOI2008]志愿者招募
线性规划: #include<cstdio> #include<cstdlib> #include<algorithm> #include<cstring&g ...
●BZOJ 2149 拆迁队
题链: http://www.lydsy.com/JudgeOnline/problem.php?id=2149 题解: 斜率优化DP,栈维护凸包,LIS,分治(我也不晓得是不是CDQ分治...) 一 ...
●BZOJ 3831 [Poi2014]Little Bird
题链: http://www.lydsy.com/JudgeOnline/problem.php?id=3831 题解: 单调队列优化DP 定义 F[i] 为到达第i课树的疲劳值. 显然最暴力的转移就 ...
●BZOJ 2006 NOI 2010 超级钢琴
题链: http://www.lydsy.com/JudgeOnline/problem.php?id=2006 题解: RMQ + 优先队列 (+ 前缀) 记得在一两个月前,一次考试考了这个题目的简 ...
[BZOJ]1050 旅行comf(HAOI2006)
图论一直是小C的弱项,相比其它题型,图论的花样通常会更多一点,套路也更难捉摸. Description 给你一个无向图,N(N<=500)个顶点, M(M<=5000)条边,每条边有一个权 ...
如何在Linux上编译c++文件
1. 打开Linux客户端,新建一个c++文件 2. 写如下代码,退出保存 3.对.cpp文件进行编译并输出结果.

Spark核心类：SQLContext和DataFrame