sparking water
1

2 It provides a way to initialize H2O services on each node in the Spark cluster and to access data stored in data structures of Spark and H2O.
3 Internal Backend is easiest to deploy; however when Spark or YARN kills the executor - which is not an unusual case - the entire H2O cluster goes down because H2O does not support high availability.
4 The internal backend is the default for behavior for Sparkling Water. Another way to change type of backend is by calling the setExternalClusterMode() or setInternalClusterMode() method on the H2OConf class. H2OConf is simple wrapper around SparkConf and inherits all properties in the Spark configuration.
5 好像在安装sparkingwater时,就会把pyspark和H2O装好: pip install h2o_pysparkling_2.3
=======================
1 启动spark : ./sbin/start-master.sh ./sbin/start-slave.sh spark://zcy-VirtualBox:7077
2 可以先运行一个很简单的脚本,看环境是否ready ,为了运行成功,需要把虚拟机内存调大(我改成了2g)

from pysparkling import *
from pyspark.sql import SparkSession
import h2o # Initiate SparkSession
spark = SparkSession.builder.appName("App name").getOrCreate() # Initiate H2OContext
hc = H2OContext.getOrCreate(spark) # Stop H2O and Spark services
h2o.cluster().shutdown()
spark.stop()
print ""
./bin/spark-submit --master spark://zcy-VirtualBox:7077 --conf "spark.executor.memory=1g" /home/zcy/working/tst.py
结果如下

3 运行一个稍微复杂的脚本:
import h2o
from datetime import datetime from pyspark import SparkConf, SparkFiles
from pyspark.sql import Row, SparkSession
import os
from pysparkling import * # Refine date column
def refine_date_col(data, col):
data["Day"] = data[col].day()
data["Month"] = data[col].month()
data["Year"] = data[col].year()
data["WeekNum"] = data[col].week()
data["WeekDay"] = data[col].dayOfWeek()
data["HourOfDay"] = data[col].hour() # Create weekend and season cols
# Spring = Mar, Apr, May. Summer = Jun, Jul, Aug. Autumn = Sep, Oct. Winter = Nov, Dec, Jan, Feb.
# data["Weekend"] = [ if x in ("Sun", "Sat") else for x in data["WeekDay"]]
data["Weekend"] = ((data["WeekDay"] == "Sun") | (data["WeekDay"] == "Sat"))
data["Season"] = data["Month"].cut([, , , , , ], ["Winter", "Spring", "Summer", "Autumn", "Winter"]) # This is just helper function returning path to data-files
def _locate(file_name):
if os.path.isfile("/home/zcy/working/data_tst/" + file_name):
return "/home/zcy/working/data_tst/" + file_name
else:
print "eeeeeeeeeeee" spark = SparkSession.builder.appName("ChicagoCrimeTest").getOrCreate()
# Start H2O services
h2oContext = H2OContext.getOrCreate(spark)
# Define file names
chicagoAllWeather = "chicagoAllWeather.csv"
chicagoCensus = "chicagoCensus.csv"
chicagoCrimes10k = "chicagoCrimes10k.csv.zip" # h2o.import_file expects cluster-relative path
f_weather = h2o.upload_file(_locate(chicagoAllWeather))
f_census = h2o.upload_file(_locate(chicagoCensus))
f_crimes = h2o.upload_file(_locate(chicagoCrimes10k))
print "" # Transform weather table
# Remove 1st column (date)
f_weather = f_weather[:] # Transform census table
# Remove all spaces from column names (causing problems in Spark SQL)
col_names = list(map(lambda s: s.strip().replace(' ', '_').replace('+', '_'), f_census.col_names)) # Update column names in the table
# f_weather.names = col_names
f_census.names = col_names # Transform crimes table
# Drop useless columns
f_crimes = f_crimes[:] # Set time zone to UTC for date manipulation
h2o.cluster().timezone = "Etc/UTC" # Replace ' ' by '_' in column names
col_names = list(map(lambda s: s.replace(' ', '_'), f_crimes.col_names))
f_crimes.names = col_names
refine_date_col(f_crimes, "Date")
f_crimes = f_crimes.drop("Date") # Expose H2O frames as Spark DataFrame
print ""
df_weather = h2oContext.as_spark_frame(f_weather)
df_census = h2oContext.as_spark_frame(f_census)
df_crimes = h2oContext.as_spark_frame(f_crimes) # Register DataFrames as tables
df_weather.createOrReplaceTempView("chicagoWeather")
df_census.createOrReplaceTempView("chicagoCensus")
df_crimes.createOrReplaceTempView("chicagoCrime") crimeWithWeather = spark.sql("""SELECT
a.Year, a.Month, a.Day, a.WeekNum, a.HourOfDay, a.Weekend, a.Season, a.WeekDay,
a.IUCR, a.Primary_Type, a.Location_Description, a.Community_Area, a.District,
a.Arrest, a.Domestic, a.Beat, a.Ward, a.FBI_Code,
b.minTemp, b.maxTemp, b.meanTemp,
c.PERCENT_AGED_UNDER_18_OR_OVER_64, c.PER_CAPITA_INCOME, c.HARDSHIP_INDEX,
c.PERCENT_OF_HOUSING_CROWDED, c.PERCENT_HOUSEHOLDS_BELOW_POVERTY,
c.PERCENT_AGED_16__UNEMPLOYED, c.PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA
FROM chicagoCrime a
JOIN chicagoWeather b
ON a.Year = b.year AND a.Month = b.month AND a.Day = b.day
JOIN chicagoCensus c
ON a.Community_Area = c.Community_Area_Number""") # Publish Spark DataFrame as H2OFrame with given name
crimeWithWeatherHF = h2oContext.as_h2o_frame(crimeWithWeather, "crimeWithWeatherTable")
print ""
# Transform selected String columns to categoricals
cat_cols = ["Arrest", "Season", "WeekDay", "Primary_Type", "Location_Description", "Domestic"]
for col in cat_cols :
crimeWithWeatherHF[col] = crimeWithWeatherHF[col].asfactor() # Split frame into two - we use one as the training frame and the second one as the validation frame
splits = crimeWithWeatherHF.split_frame(ratios=[0.8])
train = splits[]
test = splits[]
print ""
h2o.download_csv(train,'/home/zcy/working/data_tst/ret/train.csv')
h2o.download_csv(test,'/home/zcy/working/data_tst/ret/test.csv') # stop H2O and Spark services
h2o.cluster().shutdown()
spark.stop()
3 运行脚本,
./bin/spark-submit --master spark://zcy-VirtualBox:7077 --conf "spark.executor.memory=1g" /home/zcy/working/sparkH2O.py



sparking water的更多相关文章
- [LeetCode] Pacific Atlantic Water Flow 太平洋大西洋水流
Given an m x n matrix of non-negative integers representing the height of each unit cell in a contin ...
- [LeetCode] Trapping Rain Water II 收集雨水之二
Given an m x n matrix of positive integers representing the height of each unit cell in a 2D elevati ...
- [LeetCode] Water and Jug Problem 水罐问题
You are given two jugs with capacities x and y litres. There is an infinite amount of water supply a ...
- [LeetCode] Trapping Rain Water 收集雨水
Given n non-negative integers representing an elevation map where the width of each bar is 1, comput ...
- [LeetCode] Container With Most Water 装最多水的容器
Given n non-negative integers a1, a2, ..., an, where each represents a point at coordinate (i, ai). ...
- 如何装最多的水? — leetcode 11. Container With Most Water
炎炎夏日,还是呆在空调房里切切题吧. Container With Most Water,题意其实有点噱头,简化下就是,给一个数组,恩,就叫 height 吧,从中任选两项 i 和 j(i <= ...
- 【leetcode】Container With Most Water
题目描述: Given n non-negative integers a1, a2, ..., an, where each represents a point at coordinate (i, ...
- [LintCode] Trapping Rain Water 收集雨水
Given n non-negative integers representing an elevation map where the width of each bar is 1, comput ...
- [LintCode] Container With Most Water 装最多水的容器
Given n non-negative integers a1, a2, ..., an, where each represents a point at coordinate (i, ai). ...
随机推荐
- 解决百度云推送通知,不显示默认Notification
问题:百度云推送通知,不显示默认Notification 描述:采用推送消息的方式,可以在onMessage方法里面获取到推送的消息.另外推送通知也有获取到内容,后台日志也有show private ...
- Java知多少(43)异常处理基础
Java异常是一个描述在代码段中发生的异常(也就是出错)情况的对象.当异常情况发生,一个代表该异常的对象被创建并且在导致该错误的方法中被抛出(throw).该方法可以选择自己处理异常或传递该异常.两种 ...
- 【转】燃烧吧,TestMice!
...当我们几个人碰面的时候,就感觉应该做点测试业内的实事. 记得当时的17站出了一些QTP辅件,给了我一些灵感.2008年做了一整年的QTP企业级实施,从方案到最后的收尾支持,得到最大的教训就是,当 ...
- Spark学习笔记——构建基于Spark的推荐引擎
推荐模型 推荐模型的种类分为: 1.基于内容的过滤:基于内容的过滤利用物品的内容或是属性信息以及某些相似度定义,来求出与该物品类似的物品. 2.协同过滤:协同过滤是一种借助众包智慧的途径.它利用大量已 ...
- [OpenCV] Samples 14: kalman filter
Ref: http://blog.csdn.net/gdfsg/article/details/50904811 #include "opencv2/video/tracking.hpp&q ...
- ios开发之--给WebView加载进度条
不是新东西,就是在项目里面用到H5页面的时候,中间加载延迟的时候,在最上面加载一个进度条,代码如下: // 获取屏幕 宽度.高度 bounds就是屏幕的全部区域 #define KDeviceWidt ...
- Win10系统安装过程小记
1.网上下载ghost系统http://win10.jysmac.cn/win1064.html 2.使用系统自带的激活工具激活 3.到windows官网下载更新工具更新系统,重新安装https:// ...
- mem 0908
taglib http://blog.csdn.net/zyujie/article/details/8735730 dozer: Dozer可以在JavaBean到JavaBean之间进行递归数据复 ...
- VS设置DLL所在的调试目录
如果一个项目依赖的DLL不想写在Path中,可以在 配置属性-调试-环境中添加 PATH=D:/OSG/bin;$(PATH)
- NUC972---Linux驱动开发
驱动开发是嵌入式 Linux 产品开发的重要组成部分,驱动是将芯片底层与Linux应用连接起来的桥梁.驱动程序的好坏直接影响和决定着产品的稳定性,稳定的驱动程序是产品可靠性的基石. 编写 Linux ...