Sparkling Water
1

2 Sparkling Water provides a way to initialize H2O services on each node in the Spark cluster and to access data stored in the data structures of both Spark and H2O.
3 The internal backend is the easiest to deploy; however, when Spark or YARN kills an executor (which is not unusual), the entire H2O cluster goes down because H2O does not support high availability.
4 The internal backend is the default for Sparkling Water. Another way to change the backend type is to call the setExternalClusterMode() or setInternalClusterMode() method on the H2OConf class. H2OConf is a simple wrapper around SparkConf and inherits all properties in the Spark configuration.
5 Installing Sparkling Water appears to install pyspark and H2O as well: pip install h2o_pysparkling_2.3
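Besides the H2OConf setters mentioned in item 4, the backend type can also be chosen with a Spark property at submit time. The sketch below only assembles a spark-submit argument list and does not start anything. It assumes the Sparkling Water property `spark.ext.h2o.backend.cluster.mode` with the values `internal` and `external`; the helper name `build_submit_args` and its parameters are hypothetical.

```python
def build_submit_args(master, script, backend="internal", executor_memory="1g"):
    """Assemble a spark-submit argument list (hypothetical helper)."""
    if backend not in ("internal", "external"):
        raise ValueError("backend must be 'internal' or 'external'")
    return [
        "./bin/spark-submit",
        "--master", master,
        "--conf", "spark.executor.memory=" + executor_memory,
        # Assumed Sparkling Water property that selects the backend type.
        "--conf", "spark.ext.h2o.backend.cluster.mode=" + backend,
        script,
    ]


args = build_submit_args("spark://zcy-VirtualBox:7077", "/home/zcy/working/tst.py")
print(" ".join(args))
```

An external backend additionally needs the location of an already running H2O cluster, which this sketch does not cover.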
=======================
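1 Start Spark:
./sbin/start-master.sh
./sbin/start-slave.sh spark://zcy-VirtualBox:7077
2 First run a very simple script to check that the environment is ready. For it to run successfully, the virtual machine's memory has to be increased (I set it to 2 GB).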

from pysparkling import *
from pyspark.sql import SparkSession
import h2o

# Initiate SparkSession
spark = SparkSession.builder.appName("App name").getOrCreate()
# Initiate H2OContext
hc = H2OContext.getOrCreate(spark)
# Stop H2O and Spark services
h2o.cluster().shutdown()
spark.stop()
print("")
./bin/spark-submit --master spark://zcy-VirtualBox:7077 --conf "spark.executor.memory=1g" /home/zcy/working/tst.py
The result is as follows:

3 Run a slightly more complex script:
import h2o
from datetime import datetime
from pyspark import SparkConf, SparkFiles
from pyspark.sql import Row, SparkSession
import os
from pysparkling import *


# Refine date column
def refine_date_col(data, col):
    data["Day"] = data[col].day()
    data["Month"] = data[col].month()
    data["Year"] = data[col].year()
    data["WeekNum"] = data[col].week()
    data["WeekDay"] = data[col].dayOfWeek()
    data["HourOfDay"] = data[col].hour()
    # Create weekend and season cols
    # Spring = Mar, Apr, May. Summer = Jun, Jul, Aug. Autumn = Sep, Oct. Winter = Nov, Dec, Jan, Feb.
    # data["Weekend"] = [1 if x in ("Sun", "Sat") else 0 for x in data["WeekDay"]]
    data["Weekend"] = ((data["WeekDay"] == "Sun") | (data["WeekDay"] == "Sat"))
    # Breakpoints are assumed from the season definitions in the comment above
    data["Season"] = data["Month"].cut([0, 2, 5, 8, 10, 12], ["Winter", "Spring", "Summer", "Autumn", "Winter"])


# This is just a helper function returning the path to the data files
def _locate(file_name):
    path = "/home/zcy/working/data_tst/" + file_name
    if os.path.isfile(path):
        return path
    else:
        # Fail early instead of passing None on to h2o.upload_file
        raise IOError("Data file not found: " + path)


spark = SparkSession.builder.appName("ChicagoCrimeTest").getOrCreate()
# Start H2O services
h2oContext = H2OContext.getOrCreate(spark)
# Define file names
chicagoAllWeather = "chicagoAllWeather.csv"
chicagoCensus = "chicagoCensus.csv"
chicagoCrimes10k = "chicagoCrimes10k.csv.zip"
# h2o.import_file expects cluster-relative path
f_weather = h2o.upload_file(_locate(chicagoAllWeather))
f_census = h2o.upload_file(_locate(chicagoCensus))
f_crimes = h2o.upload_file(_locate(chicagoCrimes10k))
print("")
# Transform weather table
# Remove 1st column (date)
f_weather = f_weather[1:]
# Transform census table
# Remove all spaces from column names (causing problems in Spark SQL)
col_names = list(map(lambda s: s.strip().replace(' ', '_').replace('+', '_'), f_census.col_names))
# Update column names in the table
# f_weather.names = col_names
f_census.names = col_names
# Transform crimes table
# Drop useless columns
f_crimes = f_crimes[:]
# Set time zone to UTC for date manipulation
h2o.cluster().timezone = "Etc/UTC"
# Replace ' ' by '_' in column names
col_names = list(map(lambda s: s.replace(' ', '_'), f_crimes.col_names))
f_crimes.names = col_names
refine_date_col(f_crimes, "Date")
f_crimes = f_crimes.drop("Date")
# Expose H2O frames as Spark DataFrame
print("")
df_weather = h2oContext.as_spark_frame(f_weather)
df_census = h2oContext.as_spark_frame(f_census)
df_crimes = h2oContext.as_spark_frame(f_crimes)
# Register DataFrames as tables
df_weather.createOrReplaceTempView("chicagoWeather")
df_census.createOrReplaceTempView("chicagoCensus")
df_crimes.createOrReplaceTempView("chicagoCrime")

crimeWithWeather = spark.sql("""SELECT
a.Year, a.Month, a.Day, a.WeekNum, a.HourOfDay, a.Weekend, a.Season, a.WeekDay,
a.IUCR, a.Primary_Type, a.Location_Description, a.Community_Area, a.District,
a.Arrest, a.Domestic, a.Beat, a.Ward, a.FBI_Code,
b.minTemp, b.maxTemp, b.meanTemp,
c.PERCENT_AGED_UNDER_18_OR_OVER_64, c.PER_CAPITA_INCOME, c.HARDSHIP_INDEX,
c.PERCENT_OF_HOUSING_CROWDED, c.PERCENT_HOUSEHOLDS_BELOW_POVERTY,
c.PERCENT_AGED_16__UNEMPLOYED, c.PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA
FROM chicagoCrime a
JOIN chicagoWeather b
ON a.Year = b.year AND a.Month = b.month AND a.Day = b.day
JOIN chicagoCensus c
ON a.Community_Area = c.Community_Area_Number""")
# Publish Spark DataFrame as H2OFrame with given name
crimeWithWeatherHF = h2oContext.as_h2o_frame(crimeWithWeather, "crimeWithWeatherTable")
print("")
# Transform selected String columns to categoricals
cat_cols = ["Arrest", "Season", "WeekDay", "Primary_Type", "Location_Description", "Domestic"]
for col in cat_cols:
    crimeWithWeatherHF[col] = crimeWithWeatherHF[col].asfactor()
# Split frame into two - we use one as the training frame and the second one as the validation frame
splits = crimeWithWeatherHF.split_frame(ratios=[0.8])
train = splits[0]
test = splits[1]
print("")
h2o.download_csv(train, '/home/zcy/working/data_tst/ret/train.csv')
h2o.download_csv(test, '/home/zcy/working/data_tst/ret/test.csv')
# Stop H2O and Spark services
h2o.cluster().shutdown()
spark.stop()
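The Season column in the script is built by cutting the month number into intervals. As a rough illustration of that binning in plain Python (no H2O needed), the sketch below assumes right-closed intervals (lo, hi], which appears to be the default for cut, and assumes breakpoints that match the season comment in the script. `BREAKS`, `LABELS` and `season_of` are names made up for this example.

```python
from bisect import bisect_left

# Breakpoints assumed from the season comment in the script:
# Jan-Feb Winter, Mar-May Spring, Jun-Aug Summer, Sep-Oct Autumn, Nov-Dec Winter.
BREAKS = [0, 2, 5, 8, 10, 12]
LABELS = ["Winter", "Spring", "Summer", "Autumn", "Winter"]


def season_of(month):
    """Return the label of the right-closed interval (lo, hi] containing month."""
    if not BREAKS[0] < month <= BREAKS[-1]:
        raise ValueError("month must be in 1..12")
    # bisect_left finds the first break >= month; the interval index is one less.
    return LABELS[bisect_left(BREAKS, month) - 1]


for m in range(1, 13):
    print(m, season_of(m))
```

Six breakpoints define five intervals, which is why the label list has five entries and "Winter" appears at both ends.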
4 Run the script:
./bin/spark-submit --master spark://zcy-VirtualBox:7077 --conf "spark.executor.memory=1g" /home/zcy/working/sparkH2O.py
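The double underscore in names such as PERCENT_AGED_16__UNEMPLOYED in the SQL query comes from the census renaming step in the script, which replaces both spaces and '+' with '_'. A small stand-alone sketch of that transformation follows; the raw headers are hypothetical, inferred backwards from the sanitized names used in the query, and `sanitize` is a name chosen for this example.

```python
def sanitize(name):
    """Apply the same renaming the script uses for census column names."""
    return name.strip().replace(' ', '_').replace('+', '_')


# Hypothetical raw headers, inferred from the sanitized names in the SQL query.
raw = [" PERCENT AGED 16+ UNEMPLOYED", "PER CAPITA INCOME ", "HARDSHIP INDEX"]
clean = [sanitize(s) for s in raw]
print(clean)
```

Stripping first matters: without it, a leading or trailing space would turn into a stray underscore and the column would no longer match the name used in the query.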


