MaxMind GeoIP2 Usage Notes

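The customer's requirement was as follows: for each IP address in the nginx access logs, resolve the corresponding country, province, and city. They gave me a MaxMind link as a reference.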

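While researching, I found approaches that package this as a Hive UDF. Our project has always used waterdrop for data processing, so I decided to develop a waterdrop plugin instead.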

For this feature, waterdrop itself offers two commercial components: geoip2 (which also uses MaxMind) and ipipnet, a provider based in China.

If you are not familiar with waterdrop, see https://interestinglab.github.io/waterdrop/#/zh-cn/quick-start

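The plugin is written in Scala. Once development is done, build it with mvn clean package. The resulting jar does not include its dependencies, so remember to put them on the Spark classpath before using it.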

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.student</groupId>
    <artifactId>GeoIP2</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <scala.version>2.11.8</scala.version>
        <scala.binary.version>2.11</scala.binary.version>
        <spark.version>2.4.0</spark.version>
        <waterdrop.version>1.4.0</waterdrop.version>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>
    <dependencies>
        <dependency>
            <groupId>io.github.interestinglab.waterdrop</groupId>
            <artifactId>waterdrop-apis_2.11</artifactId>
            <version>${waterdrop.version}</version>
        </dependency>
        <dependency>
            <groupId>com.typesafe</groupId>
            <artifactId>config</artifactId>
            <version>1.3.1</version>
        </dependency>
        <dependency>
            <groupId>com.maxmind.db</groupId>
            <artifactId>maxmind-db</artifactId>
            <version>1.1.0</version>
        </dependency>
        <dependency>
            <groupId>com.maxmind.geoip2</groupId>
            <artifactId>geoip2</artifactId>
            <version>2.6.0</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.0.2</version>
                <configuration>
                    <source>${maven.compiler.source}</source>
                    <target>${maven.compiler.target}</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

There is only one main source file, GeoIP2:

package com.student

import java.io.File
import java.net.InetAddress

import scala.collection.JavaConverters._

import com.maxmind.db.CHMCache
import com.maxmind.geoip2.DatabaseReader
import com.typesafe.config.{Config, ConfigFactory}
import io.github.interestinglab.waterdrop.apis.BaseFilter
import org.apache.spark.SparkFiles
import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Holds the DatabaseReader in a singleton so that each JVM builds it lazily
// on first use, instead of Spark trying to serialize it with the task.
object ReaderWrapper extends Serializable {
  @transient lazy val reader: DatabaseReader = {
    // The file must be distributed through spark.files for SparkFiles.get to find it.
    val geoIPFile = "GeoLite2-City.mmdb"
    val database = new File(SparkFiles.get(geoIPFile))
    new DatabaseReader.Builder(database)
      //.fileMode(com.maxmind.db.Reader.FileMode.MEMORY)
      .fileMode(com.maxmind.db.Reader.FileMode.MEMORY_MAPPED)
      .withCache(new CHMCache())
      .build()
  }
}

class GeoIP2 extends BaseFilter {

  var config: Config = ConfigFactory.empty()

  /**
    * Set Config.
    **/
  override def setConfig(config: Config): Unit = {
    this.config = config
  }

  /**
    * Get Config.
    **/
  override def getConfig(): Config = {
    this.config
  }

  // Reports which required options are missing from the configuration.
  override def checkConfig(): (Boolean, String) = {
    val requiredOptions = List("source_field")
    val missingOptions = requiredOptions.filterNot(name => config.hasPath(name))
    if (missingOptions.isEmpty) {
      (true, "")
    } else {
      (false, "please specify " + missingOptions.mkString(", ") + " as non-empty string")
    }
  }

  // Fills in defaults for any option the user did not set.
  override def prepare(spark: SparkSession): Unit = {
    val defaultConfig = ConfigFactory.parseMap(
      Map(
        "source_field" -> "raw_message",
        "target_field" -> "__ROOT__"
      ).asJava
    )
    config = config.withFallback(defaultConfig)
  }

  override def process(spark: SparkSession, df: Dataset[Row]): Dataset[Row] = {
    val srcField = config.getString("source_field")
    // Only used by the commented-out single-column variant below.
    val func = udf { ip: String => ip2Location(ip) }
    val ip2Country = udf { ip: String => ip2Location2(ip, 1) }
    val ip2Province = udf { ip: String => ip2Location2(ip, 2) }
    val ip2City = udf { ip: String => ip2Location2(ip, 3) }
    //df.withColumn(config.getString("target_field"), func(col(srcField)))
    df.withColumn("__country__", ip2Country(col(srcField)))
      .withColumn("__province__", ip2Province(col(srcField)))
      .withColumn("__city__", ip2City(col(srcField)))
  }

  // Returns (country, province, city) in Chinese, or empty strings if the lookup fails.
  def ip2Location(ip: String) = {
    try {
      val reader = ReaderWrapper.reader
      val ipAddress = InetAddress.getByName(ip)
      val response = reader.city(ipAddress)
      val country = response.getCountry()
      val subdivision = response.getMostSpecificSubdivision()
      val city = response.getCity()
      (country.getNames().get("zh-CN"), subdivision.getNames().get("zh-CN"), city.getNames().get("zh-CN"))
    } catch {
      case ex: Exception =>
        ex.printStackTrace()
        ("", "", "")
    }
  }

  // Returns one part of the location: 1 = country, 2 = province, 3 = city.
  def ip2Location2(ip: String, index: Int) = {
    try {
      val reader = ReaderWrapper.reader
      val ipAddress = InetAddress.getByName(ip)
      val response = reader.city(ipAddress)
      index match {
        case 1 => response.getCountry().getNames().get("zh-CN")
        case 2 => response.getMostSpecificSubdivision().getNames().get("zh-CN")
        case 3 => response.getCity().getNames().get("zh-CN")
        case _ => ""
      }
    } catch {
      case ex: Exception =>
        ex.printStackTrace()
        ""
    }
  }
}
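The lookups above read getNames().get("zh-CN") directly, which returns null when a record has no Chinese name. As a minimal sketch of one way to handle that, assuming an English fallback is acceptable, the helper below picks the first available locale. LocalizedName and pick are hypothetical names used only for illustration, and the maps in main are hand-built stand-ins for what getNames() returns.

```scala
import java.util.{HashMap => JHashMap, Map => JMap}

// Hypothetical helper, not part of the plugin or of the MaxMind API.
object LocalizedName {
  // Returns the first non-null name among the preferred locales, or "" when none exists.
  def pick(names: JMap[String, String], locales: Seq[String] = Seq("zh-CN", "en")): String =
    locales.iterator
      .map(locale => names.get(locale))
      .find(_ != null)
      .getOrElse("")

  def main(args: Array[String]): Unit = {
    // Hand-built stand-ins for the locale-to-name maps of a lookup result.
    val full = new JHashMap[String, String]()
    full.put("en", "Nanjing")
    full.put("zh-CN", "南京")
    val englishOnly = new JHashMap[String, String]()
    englishOnly.put("en", "Nanjing")

    println(pick(full))                            // 南京
    println(pick(englishOnly))                     // Nanjing
    println(pick(new JHashMap[String, String]()))  // prints an empty line
  }
}
```

In the plugin this could wrap each getNames() call, for example LocalizedName.pick(response.getCity().getNames()), so that the output columns hold an empty string or an English name instead of null.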

The test class is as follows:

package com.student

import com.typesafe.config._
import org.apache.spark.sql.{DataFrame, SparkSession}

object TestIt {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("demo").master("local[1]")
      .config("spark.files", "/Users/student.yao/code/sparkshell/GeoLite2-City.mmdb")
      //.enableHiveSupport()
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    // Create the conf and assign it to the plugin instance
    val firstConf: Config = ConfigFactory.empty()

    // Instantiate the plugin
    val pluginInstance: GeoIP2 = new GeoIP2
    pluginInstance.setConfig(firstConf)
    pluginInstance.prepare(spark)

    // Make up some sample data
    import spark.implicits._
    val sourceFile: DataFrame = Seq(
      (1, "221.131.74.138"),
      (2, "112.25.215.84"),
      (3, "103.231.164.15"),
      (4, "36.99.136.137"),
      (5, "223.107.54.102"),
      (6, "117.136.118.125")
    ).toDF("id", "raw_message")
    sourceFile.show(false)

    val df2 = pluginInstance.process(spark, sourceFile)
    df2.show(false)
  }
}

Problems encountered

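1. At first I put GeoLite2-City.mmdb in the resources folder so that it would be packaged into the jar, and loaded it in the project with getClass().getResourceAsStream("/GeoLite2-City.mmdb").

At runtime this turned out to be very slow, that is, slow compared with SparkFiles.get, so I switched to SparkFiles. First upload the file to HDFS, then add this entry to the Spark job configuration:

spark.files="hdfs://nameservicesxxx/path/to/GeoLite2-City.mmdb"

After that, SparkFiles.get("GeoLite2-City.mmdb") returns the local path of the file and it can be used directly.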

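2. While tuning performance, moving the reader out into the prepare method was the natural idea. Local testing showed no problem, because everything ran on a single machine, but in production the job failed with a not-serializable exception.

In fact the exception can be reproduced in IDEA by calling spark.sparkContext.broadcast(reader). The usual fix is to create the object inside foreachPartition or mapPartitions on the DataFrame or RDD, which avoids the error, and using a singleton on top of that works even better. A plugin cannot do this, however, so I thought of wrapping the reader in a shell so that it is never serialized.

That is the reason for the ReaderWrapper extends Serializable with @transient lazy val part of the code.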
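The wrapper idea can be reproduced without Spark or MaxMind. The following is a minimal sketch using plain Java serialization. FakeReader, FakeReaderHolder, CapturingLookup, and HolderLookup are hypothetical stand-ins, not part of the plugin: FakeReader plays the role of the non-serializable DatabaseReader, and the two lookup classes play the role of the function that Spark has to ship to executors.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for DatabaseReader: deliberately not Serializable.
class FakeReader {
  def lookup(ip: String): String = s"location-of-$ip"
}

// Same shape as ReaderWrapper: the reader lives in a singleton object, so it is
// created lazily in each JVM and is never part of a serialized function.
object FakeReaderHolder extends Serializable {
  @transient lazy val reader: FakeReader = new FakeReader
}

// Keeps the reader as a field, so serializing this object fails.
class CapturingLookup(reader: FakeReader) extends Serializable {
  def apply(ip: String): String = reader.lookup(ip)
}

// Reaches the reader through the holder, so nothing non-serializable is captured.
class HolderLookup extends Serializable {
  def apply(ip: String): String = FakeReaderHolder.reader.lookup(ip)
}

object SerializationDemo {
  // Tries Java serialization and reports whether it succeeded.
  def canSerialize(value: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(value)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    println(canSerialize(new CapturingLookup(new FakeReader))) // false
    println(canSerialize(new HolderLookup))                    // true
    println(new HolderLookup().apply("1.2.3.4"))               // location-of-1.2.3.4
  }
}
```

The point of the design is that the function only refers to the holder by name. Each JVM that runs it resolves the singleton locally and builds its own reader on first use.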

3. Enable caching when initializing the reader.

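After these changes, performance improved. In testing with 4 GB of memory and 4 cores, at one batch every 5 seconds, 20,000 records were processed in 3 seconds. This covers the whole kafka -> waterdrop-json -> es pipeline.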
