Spark: computing the location of an IP address

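This exercise is mainly about using broadcast variables:

1. Store the data to be broadcast, the IP rules, on HDFS. (Broadcast content cannot be changed once it has been broadcast; if the rules need to change in real time, they can be kept in Redis instead.)

2. Load the rules into an RDD in Spark, then collect them to the Driver.

3. Broadcast the IP rules to the Executors. How does the broadcast reference created on the Driver reach an Executor? Tasks are generated on the Driver, and the broadcast reference is serialized along with each Task and sent to the Executor. On the Executor, the reference resolves to the Executor-local copy of the broadcast data, not to the file on HDFS.

4. When an Executor runs the Tasks assigned to it, it reads the IP rules from its local copy and does the lookup.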
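The four steps above can be sketched in a stripped-down form. This is only an illustration under stated assumptions: the object name `BroadcastSketch`, the function `countByRegion`, and the toy rule table with `regionA`/`regionB` are hypothetical, the rules are created in memory rather than read from HDFS, and a linear `find` stands in for the binary search used in the full program.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

object BroadcastSketch {
  // Counts hits per region for IPs that are already in numeric form.
  def countByRegion(sc: SparkContext,
                    rules: Array[(Long, Long, String)],
                    ipNums: Seq[Long]): Map[String, Int] = {
    // Created on the Driver; the closure below captures only this small handle.
    val rulesRef: Broadcast[Array[(Long, Long, String)]] = sc.broadcast(rules)
    sc.parallelize(ipNums)
      .map { ip =>
        // Runs on an Executor: .value reads the Executor-local copy of the rules.
        val region = rulesRef.value
          .find { case (start, end, _) => ip >= start && ip <= end }
          .map(_._3)
          .getOrElse("unknown")
        (region, 1)
      }
      .reduceByKey(_ + _)
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BroadcastSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical rule table: (start, end, region)
      val rules = Array((0L, 99L, "regionA"), (100L, 199L, "regionB"))
      println(countByRegion(sc, rules, Seq(5L, 150L, 42L, 500L)))
    } finally {
      sc.stop()
    }
  }
}
```

The point of the sketch is that the rule array itself is not serialized into every Task; each Task carries only `rulesRef`, and each Executor fetches the data once and reuses it.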

package com.rz.spark.base
 
import java.sql.{Connection, DriverManager, PreparedStatement}
 
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
 
object IpLocation2 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[2]")
    val sc = new SparkContext(conf)
 
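    // Read the IP rules from HDFS (path passed as the first argument)
    val rulesLine: RDD[String] = sc.textFile(args(0))
 
    // Parse each rule line into (startNum, endNum, province).
    // The column indices are an assumption about the rule file layout
    // (startIp|endIp|startNum|endNum|continent|country|province|...); adjust them to your data.
    val ipRulesRDD: RDD[(Long, Long, String)] = rulesLine.map(line => {
      val fields = line.split("[|]")
      val startNum = fields(2).toLong
      val endNum = fields(3).toLong
      val province = fields(6)
      (startNum, endNum, province)
    })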
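    // Collect the rule data, which is spread across the Executors, to the Driver.
    // The binary search used later expects the rules to be sorted by start address.
    val rulesInDriver: Array[(Long, Long, String)] = ipRulesRDD.collect()
 
    // Broadcast the Driver-side data to the Executors by calling broadcast on sc.
    // The returned broadcast reference still lives on the Driver at this point.
    val broadcastRef: Broadcast[Array[(Long, Long, String)]] = sc.broadcast(rulesInDriver)
 
    // Create an RDD from the access log (path passed as the second argument)
    val accessLines: RDD[String] = sc.textFile(args(1))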
 
    // Map each log line to (province, 1)
    val provinceAndOne: RDD[(String, Int)] = accessLines.map(log => {
      // Split the log line; the IP is assumed to be the second field
      val fields = log.split("[|]")
      val ip = fields(1)
      // Convert the IP to its decimal (Long) form
      val ipNum = MyUtils.ip2Long(ip)
      // This function body runs on the Executor. The broadcast reference was
      // serialized with the Task on the Driver and sent along with it, so
      // .value returns the copy of the IP rules held by the current Executor.
      val rulesInExecutor: Array[(Long, Long, String)] = broadcastRef.value
      // Binary search for the range that contains the IP
      var province = "未知" // "unknown"
      val index = MyUtils.binarySearch(rulesInExecutor, ipNum)
      if (index != -1) {
        province = rulesInExecutor(index)._3
      }
      (province, 1)
    })
    // Aggregate the counts per province
    val reduced: RDD[(String, Int)] = provinceAndOne.reduceByKey(_+_)
    // Print the result
//    val result = reduced.collect()
//    println(result.toBuffer)
 
    // Write the result to MySQL.
    // Handle one partition at a time so that each partition opens a single connection.
    reduced.foreachPartition(it=>{
      val conn: Connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/bigdata?characterEncoding=utf-8","root","root")
      val pstm: PreparedStatement = conn.prepareStatement("insert into access_log values(?,?)")
 
      it.foreach(tp=>{
        pstm.setString(1, tp._1)
        pstm.setInt(2, tp._2)
        pstm.executeUpdate()
      })
      pstm.close()
      conn.close()
    })
 
    sc.stop()
  }
}

Utility class

package com.rz.spark.base
 
import java.sql
import java.sql.{DriverManager, PreparedStatement}
 
import scala.io.{BufferedSource, Source}
 
object MyUtils {
 
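  // Convert a dotted IPv4 string to its numeric (Long) value
  def ip2Long(ip: String): Long = {
    val fragments = ip.split("[.]")
    var ipNum = 0L
    for (i <- 0 until fragments.length){
      // Shift the accumulated value left by one byte, then append the next octet
      ipNum = fragments(i).toLong | (ipNum << 8)
    }
    ipNum
  }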
 
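  def readRules(path: String): Array[(Long, Long, String)] = {
    // Read the IP rules from a local file
    val bf: BufferedSource = Source.fromFile(path)
    try {
      // Parse the rules and load them into memory.
      // The column indices are an assumption about the rule file layout
      // (startNum, endNum and province at indices 2, 3 and 6); adjust them to your data.
      val rules: Array[(Long, Long, String)] = bf.getLines().map(line => {
        val fields = line.split("[|]")
        val startNum = fields(2).toLong
        val endNum = fields(3).toLong
        val province = fields(6)
        (startNum, endNum, province)
      }).toArray
      rules
    } finally {
      // Close the file once every line has been read
      bf.close()
    }
  }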
 
  // Binary search over rules sorted by start address.
  // Returns the index of the range that contains ip, or -1 if there is none.
  def binarySearch(lines: Array[(Long, Long, String)], ip: Long) : Int = {
    var low = 0
    var high = lines.length - 1
    while (low <= high) {
      val middle = (low + high) / 2
      if ((ip >= lines(middle)._1) && (ip <= lines(middle)._2))
        return middle
      if (ip < lines(middle)._1)
        high = middle - 1
      else {
        low = middle + 1
      }
    }
    -1
  }
 
  def data2MySQL(it: Iterator[(String, Int)]): Unit = {
    // One iterator represents one partition, and a partition holds many records.
    // First obtain a JDBC connection.
    val conn: sql.Connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/bigdata?characterEncoding=UTF-8", "root", "")
    // Write the data to the database through the connection
    val pstm: PreparedStatement = conn.prepareStatement("INSERT INTO access_log VALUES (?, ?)")
    // Write the records of the partition to MySQL one by one
    it.foreach(tp => {
      pstm.setString(1, tp._1)
      pstm.setInt(2, tp._2)
      pstm.executeUpdate()
    })
    // Close the statement and the connection once the whole partition has been written
    if(pstm != null) {
      pstm.close()
    }
    if (conn != null) {
      conn.close()
    }
  }
}
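As a quick check of the conversion and lookup logic, here is a small worked example. It is a sketch under stated assumptions: `IpLookupDemo` and `sampleRules` are hypothetical, the two ranges and the names `regionA`/`regionB` are made up for illustration, and the two functions are local copies of the logic above so that the block runs on its own without Spark.

```scala
object IpLookupDemo {
  // Local copy of the conversion: fold the four octets into one Long, one byte at a time.
  def ip2Long(ip: String): Long =
    ip.split("[.]").foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong)

  // Local copy of the range lookup; the rules must be sorted by start address.
  def binarySearch(rules: Array[(Long, Long, String)], ip: Long): Int = {
    var low = 0
    var high = rules.length - 1
    while (low <= high) {
      val middle = (low + high) / 2
      val (start, end, _) = rules(middle)
      if (ip >= start && ip <= end) return middle
      if (ip < start) high = middle - 1 else low = middle + 1
    }
    -1
  }

  // Hypothetical rule table: two made-up ranges with placeholder region names.
  val sampleRules: Array[(Long, Long, String)] = Array(
    (ip2Long("1.0.1.0"), ip2Long("1.0.3.255"), "regionA"),
    (ip2Long("1.0.8.0"), ip2Long("1.0.15.255"), "regionB")
  )

  def main(args: Array[String]): Unit = {
    // 1 * 2^24 + 0 * 2^16 + 1 * 2^8 + 0 = 16777472
    println(ip2Long("1.0.1.0"))
    // Inside the first range: index 0
    println(binarySearch(sampleRules, ip2Long("1.0.2.7")))
    // Inside the second range: index 1
    println(binarySearch(sampleRules, ip2Long("1.0.9.1")))
    // Falls in the gap between the two ranges: -1
    println(binarySearch(sampleRules, ip2Long("1.0.5.0")))
  }
}
```

The lookup returns the index of the matching rule rather than the region itself, so the caller decides what to do when nothing matches, as the main program does with its default value.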

pom file

<properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <scala.version>2.11.</scala.version>
        <spark.version>2.2.</spark.version>
        <hadoop.version>2.6.</hadoop.version>
        <encoding>UTF-8</encoding>
    </properties>
 
    <dependencies>
        <!-- Scala dependency -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
 
        <!-- Spark dependency -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
 
        <!-- Pin the version of the hadoop-client API -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
 
    </dependencies>
 
    <build>
        <pluginManagement>
            <plugins>
                <!-- Plugin that compiles Scala -->
                <plugin>
                    <groupId>net.alchim31.maven</groupId>
                    <artifactId>scala-maven-plugin</artifactId>
                    <version>3.2.</version>
                </plugin>
                <!-- Plugin that compiles Java -->
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.5.</version>
                </plugin>
            </plugins>
        </pluginManagement>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
 
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <executions>
                    <execution>
                        <phase>compile</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
 
            <!-- Plugin that builds the jar -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
