Sparkling Water
1

2 It provides a way to initialize H2O services on each node of the Spark cluster and to access data stored in Spark and H2O data structures.
3 The internal backend is the easiest to deploy; however, when Spark or YARN kills an executor - which is not unusual - the entire H2O cluster goes down, because H2O does not support high availability.
4 The internal backend is the default behavior for Sparkling Water. Another way to change the backend type is to call the setExternalClusterMode() or setInternalClusterMode() method on the H2OConf class. H2OConf is a simple wrapper around SparkConf and inherits all properties of the Spark configuration; see the sketch after this list.
5 Installing Sparkling Water seems to pull in pyspark and H2O along with it: pip install h2o_pysparkling_2.3
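
A minimal sketch of what item 4 describes, assuming the pysparkling 2.3 Python API (where the Scala setInternalClusterMode()/setExternalClusterMode() calls are exposed as set_internal_cluster_mode()/set_external_cluster_mode(); the exact method names may differ between Sparkling Water versions, and the mode can also be set with the spark.ext.h2o.backend.cluster.mode property via --conf):

from pyspark.sql import SparkSession
from pysparkling import H2OConf, H2OContext

spark = SparkSession.builder.appName("BackendDemo").getOrCreate()

# H2OConf wraps SparkConf, so it also sees everything passed with --conf
conf = H2OConf(spark)
conf.set_internal_cluster_mode()   # default; use set_external_cluster_mode() for the external backend
hc = H2OContext.getOrCreate(spark, conf)
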
=======================
1 Start Spark: ./sbin/start-master.sh ./sbin/start-slave.sh spark://zcy-VirtualBox:7077
2 You can first run a very simple script to check whether the environment is ready. To get it to run successfully, the VM memory needs to be increased (I changed it to 2 GB):

from pysparkling import *
from pyspark.sql import SparkSession
import h2o

# Initiate SparkSession
spark = SparkSession.builder.appName("App name").getOrCreate()

# Initiate H2OContext
hc = H2OContext.getOrCreate(spark)

# Stop H2O and Spark services
h2o.cluster().shutdown()
spark.stop()
print ""
./bin/spark-submit --master spark://zcy-VirtualBox:7077 --conf "spark.executor.memory=1g" /home/zcy/working/tst.py
The result is as follows:

3 Run a slightly more complex script:
import h2o
from datetime import datetime
from pyspark import SparkConf, SparkFiles
from pyspark.sql import Row, SparkSession
import os
from pysparkling import *

# Refine date column
def refine_date_col(data, col):
    data["Day"] = data[col].day()
    data["Month"] = data[col].month()
    data["Year"] = data[col].year()
    data["WeekNum"] = data[col].week()
    data["WeekDay"] = data[col].dayOfWeek()
    data["HourOfDay"] = data[col].hour()

    # Create weekend and season cols
    # Spring = Mar, Apr, May. Summer = Jun, Jul, Aug. Autumn = Sep, Oct. Winter = Nov, Dec, Jan, Feb.
    # data["Weekend"] = [1 if x in ("Sun", "Sat") else 0 for x in data["WeekDay"]]
    data["Weekend"] = ((data["WeekDay"] == "Sun") | (data["WeekDay"] == "Sat"))
    # Bin boundaries chosen to match the season comment above
    data["Season"] = data["Month"].cut([0, 2, 5, 8, 10, 12], ["Winter", "Spring", "Summer", "Autumn", "Winter"])

# This is just a helper function returning the path to data files
def _locate(file_name):
    if os.path.isfile("/home/zcy/working/data_tst/" + file_name):
        return "/home/zcy/working/data_tst/" + file_name
    else:
        print "eeeeeeeeeeee"

spark = SparkSession.builder.appName("ChicagoCrimeTest").getOrCreate()

# Start H2O services
h2oContext = H2OContext.getOrCreate(spark)
# Define file names
chicagoAllWeather = "chicagoAllWeather.csv"
chicagoCensus = "chicagoCensus.csv"
chicagoCrimes10k = "chicagoCrimes10k.csv.zip"

# h2o.import_file expects cluster-relative path
f_weather = h2o.upload_file(_locate(chicagoAllWeather))
f_census = h2o.upload_file(_locate(chicagoCensus))
f_crimes = h2o.upload_file(_locate(chicagoCrimes10k))
print ""

# Transform weather table
# Remove 1st column (date)
f_weather = f_weather[1:]

# Transform census table
# Remove all spaces from column names (causing problems in Spark SQL)
col_names = list(map(lambda s: s.strip().replace(' ', '_').replace('+', '_'), f_census.col_names))
# Update column names in the table
# f_weather.names = col_names
f_census.names = col_names

# Transform crimes table
# Drop useless columns (the leading ID columns)
f_crimes = f_crimes[2:]

# Set time zone to UTC for date manipulation
h2o.cluster().timezone = "Etc/UTC"

# Replace ' ' by '_' in column names
col_names = list(map(lambda s: s.replace(' ', '_'), f_crimes.col_names))
f_crimes.names = col_names
refine_date_col(f_crimes, "Date")
f_crimes = f_crimes.drop("Date")

# Expose H2O frames as Spark DataFrame
print ""
df_weather = h2oContext.as_spark_frame(f_weather)
df_census = h2oContext.as_spark_frame(f_census)
df_crimes = h2oContext.as_spark_frame(f_crimes)

# Register DataFrames as tables
df_weather.createOrReplaceTempView("chicagoWeather")
df_census.createOrReplaceTempView("chicagoCensus")
df_crimes.createOrReplaceTempView("chicagoCrime")

crimeWithWeather = spark.sql("""SELECT
a.Year, a.Month, a.Day, a.WeekNum, a.HourOfDay, a.Weekend, a.Season, a.WeekDay,
a.IUCR, a.Primary_Type, a.Location_Description, a.Community_Area, a.District,
a.Arrest, a.Domestic, a.Beat, a.Ward, a.FBI_Code,
b.minTemp, b.maxTemp, b.meanTemp,
c.PERCENT_AGED_UNDER_18_OR_OVER_64, c.PER_CAPITA_INCOME, c.HARDSHIP_INDEX,
c.PERCENT_OF_HOUSING_CROWDED, c.PERCENT_HOUSEHOLDS_BELOW_POVERTY,
c.PERCENT_AGED_16__UNEMPLOYED, c.PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA
FROM chicagoCrime a
JOIN chicagoWeather b
ON a.Year = b.year AND a.Month = b.month AND a.Day = b.day
JOIN chicagoCensus c
ON a.Community_Area = c.Community_Area_Number""")

# Publish Spark DataFrame as H2OFrame with given name
crimeWithWeatherHF = h2oContext.as_h2o_frame(crimeWithWeather, "crimeWithWeatherTable")
print ""
# Transform selected String columns to categoricals
cat_cols = ["Arrest", "Season", "WeekDay", "Primary_Type", "Location_Description", "Domestic"]
for col in cat_cols:
    crimeWithWeatherHF[col] = crimeWithWeatherHF[col].asfactor()

# Split frame into two - we use one as the training frame and the second one as the validation frame
splits = crimeWithWeatherHF.split_frame(ratios=[0.8])
train = splits[0]
test = splits[1]
print ""
h2o.download_csv(train, '/home/zcy/working/data_tst/ret/train.csv')
h2o.download_csv(test, '/home/zcy/working/data_tst/ret/test.csv')

# Stop H2O and Spark services
h2o.cluster().shutdown()
spark.stop()
4 Run the script:
./bin/spark-submit --master spark://zcy-VirtualBox:7077 --conf "spark.executor.memory=1g" /home/zcy/working/sparkH2O.py
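
A quick sanity check I like to run on the exported files afterwards (assuming pandas is available on the driver machine; this is not part of the script above). Because split_frame(ratios=[0.8]) samples randomly, the two CSVs should hold roughly 80% and 20% of the rows:

import pandas as pd

train = pd.read_csv("/home/zcy/working/data_tst/ret/train.csv")
test = pd.read_csv("/home/zcy/working/data_tst/ret/test.csv")

# Expect roughly an 80/20 row split; exact counts vary because the split is random
total = len(train) + len(test)
print len(train), len(test), float(len(train)) / total
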


