1 Sparkling Water provides a way to initialize H2O services on each node of a Spark cluster and to access data stored in the data structures of both Spark and H2O (see the first sketch after this list).

2 The internal backend is the easiest to deploy; however, when Spark or YARN kills an executor - which is not unusual - the entire H2O cluster goes down, because H2O does not support high availability.

3 The internal backend is the default behavior for Sparkling Water. The backend is selected by the spark.ext.h2o.backend.cluster.mode property; another way to change the backend type is to call the setExternalClusterMode() or setInternalClusterMode() method on the H2OConf class (see the second sketch after this list). H2OConf is a simple wrapper around SparkConf and inherits all properties of the Spark configuration.

4 Installing Sparkling Water seems to install pyspark and H2O along with it: pip install h2o_pysparkling_2.3
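A minimal sketch of that data exchange, assuming a local Spark session (the tiny DataFrame and the "demoTable" name are made up for illustration; the same as_h2o_frame / as_spark_frame calls appear in the larger script below):

from pyspark.sql import SparkSession
from pysparkling import H2OContext

spark = SparkSession.builder.appName("exchange-demo").getOrCreate()
hc = H2OContext.getOrCreate(spark)

# Spark DataFrame -> H2OFrame
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
hf = hc.as_h2o_frame(df, "demoTable")

# H2OFrame -> Spark DataFrame
df_back = hc.as_spark_frame(hf)
df_back.show()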
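And a sketch of picking the backend explicitly through H2OConf - this assumes the snake_case setters of the PySparkling 2.x Python API (set_internal_cluster_mode / set_external_cluster_mode); check the docs of your exact version:

from pyspark.sql import SparkSession
from pysparkling import H2OConf, H2OContext

spark = SparkSession.builder.appName("backend-demo").getOrCreate()

# Wrap the Spark configuration and explicitly select the (default) internal backend;
# set_external_cluster_mode() would switch to the external backend instead
conf = H2OConf(spark).set_internal_cluster_mode()
hc = H2OContext.getOrCreate(spark, conf)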

=======================

1 Start Spark: ./sbin/start-master.sh and then ./sbin/start-slave.sh spark://zcy-VirtualBox:7077

2 You can first run a very simple script to see whether the environment is ready. To make it run successfully, the virtual machine memory needs to be increased (I changed it to 2g).

from pysparkling import *
from pyspark.sql import SparkSession
import h2o

# Initiate SparkSession
spark = SparkSession.builder.appName("App name").getOrCreate()

# Initiate H2OContext
hc = H2OContext.getOrCreate(spark)

# Stop H2O and Spark services
h2o.cluster().shutdown()
spark.stop()
print("")

./bin/spark-submit --master spark://zcy-VirtualBox:7077  --conf "spark.executor.memory=1g" /home/zcy/working/tst.py

The result is as follows.

3 Run a slightly more complex script:

import h2o
from datetime import datetime
from pyspark import SparkConf, SparkFiles
from pyspark.sql import Row, SparkSession
import os
from pysparkling import *

# Refine date column
def refine_date_col(data, col):
    data["Day"] = data[col].day()
    data["Month"] = data[col].month()
    data["Year"] = data[col].year()
    data["WeekNum"] = data[col].week()
    data["WeekDay"] = data[col].dayOfWeek()
    data["HourOfDay"] = data[col].hour()

    # Create weekend and season cols
    # Spring = Mar, Apr, May. Summer = Jun, Jul, Aug. Autumn = Sep, Oct. Winter = Nov, Dec, Jan, Feb.
    # data["Weekend"] = [1 if x in ("Sun", "Sat") else 0 for x in data["WeekDay"]]
    data["Weekend"] = ((data["WeekDay"] == "Sun") | (data["WeekDay"] == "Sat"))
    # Cut months into seasons: (0,2] Winter, (2,5] Spring, (5,7] Summer, (7,10] Autumn, (10,12] Winter
    data["Season"] = data["Month"].cut([0, 2, 5, 7, 10, 12], ["Winter", "Spring", "Summer", "Autumn", "Winter"])

# This is just a helper function returning the path to data files
def _locate(file_name):
    if os.path.isfile("/home/zcy/working/data_tst/" + file_name):
        return "/home/zcy/working/data_tst/" + file_name
    else:
        print("eeeeeeeeeeee")

spark = SparkSession.builder.appName("ChicagoCrimeTest").getOrCreate()
# Start H2O services
h2oContext = H2OContext.getOrCreate(spark)

# Define file names
chicagoAllWeather = "chicagoAllWeather.csv"
chicagoCensus = "chicagoCensus.csv"
chicagoCrimes10k = "chicagoCrimes10k.csv.zip"

# h2o.import_file expects a cluster-relative path, so upload local files instead
f_weather = h2o.upload_file(_locate(chicagoAllWeather))
f_census = h2o.upload_file(_locate(chicagoCensus))
f_crimes = h2o.upload_file(_locate(chicagoCrimes10k))
print("")

# Transform weather table
# Remove 1st column (date)
f_weather = f_weather[1:]

# Transform census table
# Remove all spaces from column names (causing problems in Spark SQL)
col_names = list(map(lambda s: s.strip().replace(' ', '_').replace('+', '_'), f_census.col_names))
# Update column names in the table
# f_weather.names = col_names
f_census.names = col_names

# Transform crimes table
# Drop useless columns (the first two)
f_crimes = f_crimes[2:]
# Set time zone to UTC for date manipulation
h2o.cluster().timezone = "Etc/UTC"
# Replace ' ' by '_' in column names
col_names = list(map(lambda s: s.replace(' ', '_'), f_crimes.col_names))
f_crimes.names = col_names
refine_date_col(f_crimes, "Date")
f_crimes = f_crimes.drop("Date")
print ""
df_weather = h2oContext.as_spark_frame(f_weather)
df_census = h2oContext.as_spark_frame(f_census)
df_crimes = h2oContext.as_spark_frame(f_crimes) # Register DataFrames as tables
df_weather.createOrReplaceTempView("chicagoWeather")
df_census.createOrReplaceTempView("chicagoCensus")
df_crimes.createOrReplaceTempView("chicagoCrime") crimeWithWeather = spark.sql("""SELECT
a.Year, a.Month, a.Day, a.WeekNum, a.HourOfDay, a.Weekend, a.Season, a.WeekDay,
a.IUCR, a.Primary_Type, a.Location_Description, a.Community_Area, a.District,
a.Arrest, a.Domestic, a.Beat, a.Ward, a.FBI_Code,
b.minTemp, b.maxTemp, b.meanTemp,
c.PERCENT_AGED_UNDER_18_OR_OVER_64, c.PER_CAPITA_INCOME, c.HARDSHIP_INDEX,
c.PERCENT_OF_HOUSING_CROWDED, c.PERCENT_HOUSEHOLDS_BELOW_POVERTY,
c.PERCENT_AGED_16__UNEMPLOYED, c.PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA
FROM chicagoCrime a
JOIN chicagoWeather b
ON a.Year = b.year AND a.Month = b.month AND a.Day = b.day
JOIN chicagoCensus c
ON a.Community_Area = c.Community_Area_Number""") # Publish Spark DataFrame as H2OFrame with given name
crimeWithWeatherHF = h2oContext.as_h2o_frame(crimeWithWeather, "crimeWithWeatherTable")
print ""
# Transform selected String columns to categoricals
cat_cols = ["Arrest", "Season", "WeekDay", "Primary_Type", "Location_Description", "Domestic"]
for col in cat_cols:
    crimeWithWeatherHF[col] = crimeWithWeatherHF[col].asfactor()

# Split the frame in two - one part as the training frame, the other as the validation frame
splits = crimeWithWeatherHF.split_frame(ratios=[0.8])
train = splits[0]
test = splits[1]
print("")
h2o.download_csv(train, '/home/zcy/working/data_tst/ret/train.csv')
h2o.download_csv(test, '/home/zcy/working/data_tst/ret/test.csv')

# Stop H2O and Spark services
h2o.cluster().shutdown()
spark.stop()

4 Run the script:

./bin/spark-submit --master spark://zcy-VirtualBox:7077  --conf "spark.executor.memory=1g" /home/zcy/working/sparkH2O.py
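The script stops at writing the two splits to CSV. To actually fit a model on them, something like the following sketch could go right before the h2o.cluster().shutdown() call (predicting the Arrest column with a GBM is just an illustrative choice, borrowed from H2O's Chicago crime demo, not something the script above does):

from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Train a GBM to predict whether an arrest was made,
# validating on the held-out 20% split
gbm = H2OGradientBoostingEstimator(ntrees=50)
gbm.train(y="Arrest", training_frame=train, validation_frame=test)
print(gbm.auc(valid=True))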
