Spark: Computing the Location of an IP Address

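This exercise is mainly about using broadcast variables:

1. Store the data to be broadcast, the IP rules, on HDFS. (Once a value has been broadcast it cannot be changed. If the rules need to change in real time, they can be kept in Redis instead.)

2. Load the rules into an RDD in Spark, then collect them to the Driver.

3. Broadcast the IP rules to the Executors. How does the broadcast reference created on the Driver reach the Executors? Tasks are generated on the Driver, and the broadcast reference is serialized with each Task and sent to the Executor along with it. The reference does not point back to HDFS; it identifies the broadcast data, which the Executor fetches and caches locally the first time the value is read.

4. When an Executor runs its assigned Tasks, it reads the IP rules from its local copy of the broadcast and uses them in the computation.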
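Because a broadcast value is read-only, the usual way to pick up changed rules inside Spark is to broadcast a new value. The following is only a minimal sketch of that idea, assuming spark-core is on the classpath and a local master; `BroadcastSketch`, `countMatches`, `run` and the numeric ranges are hypothetical names and data introduced for illustration, not part of the original job.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {

  // Counts how many of the given numbers fall inside any of the broadcast ranges.
  def countMatches(sc: SparkContext,
                   rules: Broadcast[Array[(Long, Long)]],
                   nums: Seq[Long]): Long =
    sc.parallelize(nums).filter { n =>
      // This closure runs on the Executor; .value returns the Executor-local copy.
      rules.value.exists { case (start, end) => n >= start && n <= end }
    }.count()

  def run(): (Long, Long) = {
    val conf = new SparkConf().setAppName("BroadcastSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val nums = Seq(5L, 15L, 25L)

      // First version of the (hypothetical) rules.
      val v1 = sc.broadcast(Array((0L, 10L)))
      val before = countMatches(sc, v1, nums)

      // The old value cannot be edited in place: drop its cached copies on the
      // Executors and broadcast a new value instead.
      v1.unpersist()
      val v2 = sc.broadcast(Array((0L, 10L), (20L, 30L)))
      val after = countMatches(sc, v2, nums)

      (before, after)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Tasks that were built with the old reference keep seeing the old rules, so re-broadcasting only affects jobs submitted afterwards. For rules that must change while a job is running, an external store such as Redis is the option the list above mentions.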

package com.rz.spark.base

import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object IpLocation2 {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Read the IP rules from HDFS (args(0): path of the rules file)
    val rulesLine: RDD[String] = sc.textFile(args(0))

    // Parse each rule line into (startNum, endNum, province).
    // NOTE: the column indices were lost from the original listing. 2, 3 and 6 are
    // assumptions about the rules file layout; adjust them to your data.
    val ipRulesRDD: RDD[(Long, Long, String)] = rulesLine.map(line => {
      val fields = line.split("[|]")
      val startNum = fields(2).toLong
      val endNum = fields(3).toLong
      val province = fields(6)
      (startNum, endNum, province)
    })

    // Collect the rule fragments scattered across the Executors to the Driver.
    // The binary search below requires the rules to be sorted by startNum.
    val rulesInDriver: Array[(Long, Long, String)] = ipRulesRDD.collect()

    // Broadcast the Driver-side data to the Executors by calling broadcast on sc.
    // The returned reference still lives on the Driver at this point.
    val broadcastRef: Broadcast[Array[(Long, Long, String)]] = sc.broadcast(rulesInDriver)

    // Create an RDD from the access log (args(1): path of the access log)
    val accessLines: RDD[String] = sc.textFile(args(1))

    // Map each log line to (province, 1)
    val provinceAndOne: RDD[(String, Int)] = accessLines.map(log => {
      // Split the log line into fields
      val fields = log.split("[|]")
      // NOTE: index 1 for the IP column is an assumption; the original index was lost.
      val ip = fields(1)
      // Convert the dotted IP string to its numeric value
      val ipNum = MyUtils.ip2Long(ip)
      // This function body runs on the Executor. The broadcast reference was
      // serialized with the Task on the Driver and shipped to the Executor with it;
      // calling .value returns the copy of the IP rules held by this Executor.
      val rulesInExecutor: Array[(Long, Long, String)] = broadcastRef.value
      // Binary search for the rule that contains this IP
      var province = "未知" // "unknown"
      val index = MyUtils.binarySearch(rulesInExecutor, ipNum)
      if (index != -1) {
        province = rulesInExecutor(index)._3
      }
      (province, 1)
    })

    // Aggregate the counts per province
    val reduced: RDD[(String, Int)] = provinceAndOne.reduceByKey(_ + _)

    // Print the result (for debugging)
    // val result = reduced.collect()
    // println(result.toBuffer)

    // Write the result to MySQL, one partition at a time:
    // one connection per partition, one insert per record.
    reduced.foreachPartition(it => {
      val conn: Connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/bigdata?characterEncoding=utf-8", "root", "root")
      try {
        val pstm: PreparedStatement = conn.prepareStatement("insert into access_log values(?,?)")
        try {
          it.foreach(tp => {
            pstm.setString(1, tp._1)
            pstm.setInt(2, tp._2)
            pstm.executeUpdate()
          })
        } finally {
          pstm.close()
        }
      } finally {
        conn.close()
      }
    })

    sc.stop()
  }
}

Utility class

package com.rz.spark.base

import java.sql
import java.sql.{DriverManager, PreparedStatement}

import scala.io.{BufferedSource, Source}

object MyUtils {

  // Convert a dotted-decimal IPv4 string to its numeric value:
  // shift the accumulated value left by 8 bits, then OR in the next octet.
  def ip2Long(ip: String): Long = {
    val fragments = ip.split("[.]")
    var ipNum = 0L
    for (i <- 0 until fragments.length) {
      ipNum = fragments(i).toLong | (ipNum << 8L)
    }
    ipNum
  }

  // Read the IP rules from a local file into memory
  def readRules(path: String): Array[(Long, Long, String)] = {
    val bf: BufferedSource = Source.fromFile(path)
    try {
      val lines: Iterator[String] = bf.getLines()
      // Parse each rule line into (startNum, endNum, province).
      // NOTE: the column indices were lost from the original listing. 2, 3 and 6 are
      // assumptions about the rules file layout; adjust them to your data.
      val rules: Array[(Long, Long, String)] = lines.map(line => {
        val fields = line.split("[|]")
        val startNum = fields(2).toLong
        val endNum = fields(3).toLong
        val province = fields(6)
        (startNum, endNum, province)
      }).toArray
      rules
    } finally {
      bf.close()
    }
  }

  // Binary search over rules sorted by start value.
  // Returns the index of the rule whose range contains ip, or -1 if none does.
  def binarySearch(lines: Array[(Long, Long, String)], ip: Long): Int = {
    var low = 0
    var high = lines.length - 1
    while (low <= high) {
      val middle = (low + high) / 2
      if ((ip >= lines(middle)._1) && (ip <= lines(middle)._2))
        return middle
      if (ip < lines(middle)._1)
        high = middle - 1
      else {
        low = middle + 1
      }
    }
    -1
  }

  def data2MySQL(it: Iterator[(String, Int)]): Unit = {
    // One iterator represents one partition, which holds many records.
    // Open one JDBC connection for the whole partition.
    val conn: sql.Connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/bigdata?characterEncoding=UTF-8", "root", "")
    try {
      val pstm: PreparedStatement = conn.prepareStatement("INSERT INTO access_log VALUES (?, ?)")
      try {
        // Write the records of the partition to MySQL one by one
        it.foreach(tp => {
          pstm.setString(1, tp._1)
          pstm.setInt(2, tp._2)
          pstm.executeUpdate()
        })
      } finally {
        // Close the statement once the whole partition has been written
        pstm.close()
      }
    } finally {
      conn.close()
    }
  }
}
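To make the conversion and lookup concrete, here is a small self-contained sketch that can be run without Spark. It repeats the same shift-and-OR conversion and the same binary search as the utility class; `IpLookupDemo`, `sampleRules`, `lookup` and the province names are hypothetical and exist only for this illustration.

```scala
object IpLookupDemo {

  // Same idea as MyUtils.ip2Long: shift left by 8 bits, then OR in the next octet.
  def ip2Long(ip: String): Long =
    ip.split("[.]").foldLeft(0L)((acc, part) => (acc << 8) | part.toLong)

  // Same idea as MyUtils.binarySearch; the rules must be sorted and non-overlapping.
  def binarySearch(rules: Array[(Long, Long, String)], ip: Long): Int = {
    var low = 0
    var high = rules.length - 1
    while (low <= high) {
      val middle = (low + high) / 2
      val (startNum, endNum, _) = rules(middle)
      if (ip >= startNum && ip <= endNum) return middle
      if (ip < startNum) high = middle - 1 else low = middle + 1
    }
    -1
  }

  // Hypothetical rules: (range start, range end, province), sorted by range start.
  val sampleRules: Array[(Long, Long, String)] = Array(
    (ip2Long("1.0.1.0"), ip2Long("1.0.3.255"), "ProvinceA"),
    (ip2Long("1.0.8.0"), ip2Long("1.0.15.255"), "ProvinceB")
  )

  // Returns the province for an IP, or "unknown" when no rule contains it.
  def lookup(ip: String): String = {
    val index = binarySearch(sampleRules, ip2Long(ip))
    if (index != -1) sampleRules(index)._3 else "unknown"
  }

  def main(args: Array[String]): Unit = {
    println(ip2Long("1.0.1.0")) // 1 * 2^24 + 0 * 2^16 + 1 * 2^8 + 0 = 16777472
    println(lookup("1.0.2.7"))  // inside the first range: ProvinceA
    println(lookup("1.0.5.1"))  // falls in the gap between the two ranges: unknown
  }
}
```

Because the search only compares against range boundaries, each lookup costs O(log n) in the number of rules, which is why the rules have to stay sorted after they are collected to the Driver.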

POM file

<properties>

        <maven.compiler.source>1.8</maven.compiler.source>

        <maven.compiler.target>1.8</maven.compiler.target>

        <scala.version>2.11.</scala.version>

        <spark.version>2.2.</spark.version>

        <hadoop.version>2.6.</hadoop.version>

        <encoding>UTF-8</encoding>

    </properties>

    <dependencies>

        <!-- Scala dependency -->

        <dependency>

            <groupId>org.scala-lang</groupId>

            <artifactId>scala-library</artifactId>

            <version>${scala.version}</version>

        </dependency>

        <!-- Spark dependency -->

        <dependency>

            <groupId>org.apache.spark</groupId>

            <artifactId>spark-core_2.11</artifactId>

            <version>${spark.version}</version>

        </dependency>

        <!-- Pin the version of the hadoop-client API -->

        <dependency>

            <groupId>org.apache.hadoop</groupId>

            <artifactId>hadoop-client</artifactId>

            <version>${hadoop.version}</version>

        </dependency>

    </dependencies>

    <build>

        <pluginManagement>

            <plugins>

                <!-- Plugin for compiling Scala -->

                <plugin>

                    <groupId>net.alchim31.maven</groupId>

                    <artifactId>scala-maven-plugin</artifactId>

                    <version>3.2.</version>

                </plugin>

                <!-- Plugin for compiling Java -->

                <plugin>

                    <groupId>org.apache.maven.plugins</groupId>

                    <artifactId>maven-compiler-plugin</artifactId>

                    <version>3.5.</version>

                </plugin>

            </plugins>

        </pluginManagement>

        <plugins>

            <plugin>

                <groupId>net.alchim31.maven</groupId>

                <artifactId>scala-maven-plugin</artifactId>

                <executions>

                    <execution>

                        <id>scala-compile-first</id>

                        <phase>process-resources</phase>

                        <goals>

                            <goal>add-source</goal>

                            <goal>compile</goal>

                        </goals>

                    </execution>

                    <execution>

                        <id>scala-test-compile</id>

                        <phase>process-test-resources</phase>

                        <goals>

                            <goal>testCompile</goal>

                        </goals>

                    </execution>

                </executions>

            </plugin>

            <plugin>

                <groupId>org.apache.maven.plugins</groupId>

                <artifactId>maven-compiler-plugin</artifactId>

                <executions>

                    <execution>

                        <phase>compile</phase>

                        <goals>

                            <goal>compile</goal>

                        </goals>

                    </execution>

                </executions>

            </plugin>

            <!-- Plugin for building the jar -->

            <plugin>

                <groupId>org.apache.maven.plugins</groupId>

                <artifactId>maven-shade-plugin</artifactId>

                <version>2.4.</version>

                <executions>

                    <execution>

                        <phase>package</phase>

                        <goals>

                            <goal>shade</goal>

                        </goals>

                        <configuration>

                            <filters>

                                <filter>

                                    <artifact>*:*</artifact>

                                    <excludes>

                                        <exclude>META-INF/*.SF</exclude>

                                        <exclude>META-INF/*.DSA</exclude>

                                        <exclude>META-INF/*.RSA</exclude>

                                    </excludes>

                                </filter>

                            </filters>

                        </configuration>

                    </execution>

                </executions>

            </plugin>

        </plugins>

    </build>
