Sparkling Water

2 Sparkling Water provides a way to initialize H2O services on each node in the Spark cluster and to access data stored in the data structures of both Spark and H2O.

3 The internal backend is the easiest to deploy; however, when Spark or YARN kills an executor, which is not unusual, the entire H2O cluster goes down because H2O does not support high availability.

4 The internal backend is the default for Sparkling Water. Another way to change the backend type is to call the setExternalClusterMode() or setInternalClusterMode() method on the H2OConf class. H2OConf is a simple wrapper around SparkConf and inherits all properties in the Spark configuration.

5 Installing Sparkling Water appears to install pyspark and H2O as well: pip install h2o_pysparkling_2.3
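The backend choice described in note 4 can also be made through the Spark property spark.ext.h2o.backend.cluster.mode, which takes the value internal or external. The sketch below only builds the matching --conf argument for spark-submit; backend_conf_args is a hypothetical helper name used for illustration and is not part of Sparkling Water.

```python
def backend_conf_args(mode="internal"):
    """Build spark-submit arguments that select the Sparkling Water backend.

    backend_conf_args is a hypothetical helper for illustration only;
    the property name is the one read from the Spark configuration.
    """
    if mode not in ("internal", "external"):
        raise ValueError("mode must be 'internal' or 'external', got %r" % (mode,))
    return ["--conf", "spark.ext.h2o.backend.cluster.mode=" + mode]


if __name__ == "__main__":
    # Print the arguments to append to a spark-submit command line.
    print(" ".join(backend_conf_args("external")))
```

The returned arguments can be appended to a spark-submit command next to other --conf settings such as spark.executor.memory.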

=======================

1 Start Spark:

./sbin/start-master.sh

./sbin/start-slave.sh spark://zcy-VirtualBox:7077

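2 Run a very simple script first to check that the environment is ready. For it to run successfully, the virtual machine's memory has to be increased (I set it to 2 GB).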

from pysparkling import *

from pyspark.sql import SparkSession

import h2o

# Initiate SparkSession

spark = SparkSession.builder.appName("App name").getOrCreate()

# Initiate H2OContext

hc = H2OContext.getOrCreate(spark)

# Stop H2O and Spark services

h2o.cluster().shutdown()

spark.stop()

print("")

./bin/spark-submit --master spark://zcy-VirtualBox:7077 --conf "spark.executor.memory=1g" /home/zcy/working/tst.py

The result is as follows:

3 Run a slightly more complex script:

import h2o

from datetime import datetime

from pyspark import SparkConf, SparkFiles

from pyspark.sql import Row, SparkSession

import os

from pysparkling import *

# Refine date column

def refine_date_col(data, col):

    data["Day"] = data[col].day()

    data["Month"] = data[col].month()

    data["Year"] = data[col].year()

    data["WeekNum"] = data[col].week()

    data["WeekDay"] = data[col].dayOfWeek()

    data["HourOfDay"] = data[col].hour()

    # Create weekend and season cols

    # Spring = Mar, Apr, May. Summer = Jun, Jul, Aug. Autumn = Sep, Oct. Winter = Nov, Dec, Jan, Feb.

    # data["Weekend"]   = [1 if x in ("Sun", "Sat") else 0 for x in data["WeekDay"]]

    data["Weekend"] = ((data["WeekDay"] == "Sun") | (data["WeekDay"] == "Sat"))

    # Break points derived from the season comment above; cut() intervals are right-closed by default

    data["Season"] = data["Month"].cut([0, 2, 5, 8, 10, 12], ["Winter", "Spring", "Summer", "Autumn", "Winter"])

# Helper function returning the path to a data file

def _locate(file_name):

    path = "/home/zcy/working/data_tst/" + file_name

    if os.path.isfile(path):

        return path

    # Fail loudly instead of returning None, which would break h2o.upload_file later

    raise IOError("Data file not found: " + path)

spark = SparkSession.builder.appName("ChicagoCrimeTest").getOrCreate()

# Start H2O services

h2oContext = H2OContext.getOrCreate(spark)

# Define file names

chicagoAllWeather = "chicagoAllWeather.csv"

chicagoCensus = "chicagoCensus.csv"

chicagoCrimes10k = "chicagoCrimes10k.csv.zip"

# h2o.import_file expects a cluster-relative path, so the local files are uploaded with h2o.upload_file instead

f_weather = h2o.upload_file(_locate(chicagoAllWeather))

f_census = h2o.upload_file(_locate(chicagoCensus))

f_crimes = h2o.upload_file(_locate(chicagoCrimes10k))

print("")

# Transform weather table

# Remove 1st column (date)

f_weather = f_weather[1:]

# Transform census table

# Remove all spaces from column names (causing problems in Spark SQL)

col_names = list(map(lambda s: s.strip().replace(' ', '_').replace('+', '_'), f_census.col_names))

# Update column names in the table

# f_weather.names = col_names

f_census.names = col_names

# Transform crimes table

# Drop useless columns

f_crimes = f_crimes[:]

# Set time zone to UTC for date manipulation

h2o.cluster().timezone = "Etc/UTC"

# Replace ' ' by '_' in column names

col_names = list(map(lambda s: s.replace(' ', '_'), f_crimes.col_names))

f_crimes.names = col_names

refine_date_col(f_crimes, "Date")

f_crimes = f_crimes.drop("Date")

# Expose H2O frames as Spark DataFrame

print("")

df_weather = h2oContext.as_spark_frame(f_weather)

df_census = h2oContext.as_spark_frame(f_census)

df_crimes = h2oContext.as_spark_frame(f_crimes)

# Register DataFrames as tables

df_weather.createOrReplaceTempView("chicagoWeather")

df_census.createOrReplaceTempView("chicagoCensus")

df_crimes.createOrReplaceTempView("chicagoCrime")

crimeWithWeather = spark.sql("""SELECT

a.Year, a.Month, a.Day, a.WeekNum, a.HourOfDay, a.Weekend, a.Season, a.WeekDay,

a.IUCR, a.Primary_Type, a.Location_Description, a.Community_Area, a.District,

a.Arrest, a.Domestic, a.Beat, a.Ward, a.FBI_Code,

b.minTemp, b.maxTemp, b.meanTemp,

c.PERCENT_AGED_UNDER_18_OR_OVER_64, c.PER_CAPITA_INCOME, c.HARDSHIP_INDEX,

c.PERCENT_OF_HOUSING_CROWDED, c.PERCENT_HOUSEHOLDS_BELOW_POVERTY,

c.PERCENT_AGED_16__UNEMPLOYED, c.PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA

FROM chicagoCrime a

JOIN chicagoWeather b

ON a.Year = b.year AND a.Month = b.month AND a.Day = b.day

JOIN chicagoCensus c

ON a.Community_Area = c.Community_Area_Number""")

# Publish Spark DataFrame as H2OFrame with given name

crimeWithWeatherHF = h2oContext.as_h2o_frame(crimeWithWeather, "crimeWithWeatherTable")

print("")

# Transform selected String columns to categoricals

cat_cols = ["Arrest", "Season", "WeekDay", "Primary_Type", "Location_Description", "Domestic"]

for col in cat_cols :

    crimeWithWeatherHF[col] = crimeWithWeatherHF[col].asfactor()

# Split frame into two - we use one as the training frame and the second one as the validation frame

splits = crimeWithWeatherHF.split_frame(ratios=[0.8])

train = splits[0]

test = splits[1]

print("")

h2o.download_csv(train,'/home/zcy/working/data_tst/ret/train.csv')

h2o.download_csv(test,'/home/zcy/working/data_tst/ret/test.csv')

# stop H2O and Spark services

h2o.cluster().shutdown()

spark.stop()
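The season comment in refine_date_col assigns Mar-May to Spring, Jun-Aug to Summer, Sep-Oct to Autumn and the remaining months to Winter. Assuming cut() uses right-closed intervals (its default, right=True), the same binning can be sketched in plain Python and checked without a cluster. The break points and the name season_of are assumptions made for this illustration.

```python
from bisect import bisect_left

# Upper edges of the right-closed month intervals:
# (0,2], (2,5], (5,8], (8,10], (10,12].
# Assumed break points, chosen to match the season comment in the script.
BREAKS = [2, 5, 8, 10, 12]
LABELS = ["Winter", "Spring", "Summer", "Autumn", "Winter"]


def season_of(month):
    """Return the season label for a month number in 1..12."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12, got %r" % (month,))
    # bisect_left finds the index of the first upper edge that is >= month.
    return LABELS[bisect_left(BREAKS, month)]


if __name__ == "__main__":
    for m in range(1, 13):
        print("%2d %s" % (m, season_of(m)))
```

Because the intervals are right-closed, month 2 still falls in the first Winter bin and month 3 starts Spring.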

4 Run the script:

./bin/spark-submit --master spark://zcy-VirtualBox:7077 --conf "spark.executor.memory=1g" /home/zcy/working/sparkH2O.py
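One detail of the script is worth checking before submitting it: the SQL refers to census columns such as PERCENT_AGED_16__UNEMPLOYED, with a double underscore. That follows from the renaming step, which replaces both spaces and '+' with '_'. A minimal sketch of that step in plain Python; the sample header strings are assumptions about the census CSV, and sql_safe is a hypothetical name.

```python
def sql_safe(name):
    """Strip surrounding whitespace, then replace spaces and '+' with '_'."""
    return name.strip().replace(" ", "_").replace("+", "_")


if __name__ == "__main__":
    # Sample headers are assumed for illustration, not read from the CSV.
    for raw in ["Community Area Number", " PERCENT AGED 16+ UNEMPLOYED"]:
        print("%r -> %s" % (raw, sql_safe(raw)))
```

If a column name in the SELECT list does not match the renamed header exactly, Spark SQL will fail to resolve it, so printing the renamed names first is a cheap check.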
