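Big Data in Practice Handbook, Development Volume: Spark Case Study on Real-Time Log Analysis

2.6 Spark Case Study: Real-Time Log Analysis

2.6.1 Interaction Flow Diagram

2.6.2 Client Listener (Java)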

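	// Requires: import java.nio.charset.StandardCharsets;
	/**
	 * Streams the lines appended to the log file since the last saved offset
	 * (pointer) to the connected socket, then pauses for two seconds.
	 */
	private void handleSocket() {
		lock.lock();
		Writer writer = null;
		RandomAccessFile raf = null;
		try {
			File file = new File(filepath);
			raf = new RandomAccessFile(file, "r");
			// Resume from where the previous call stopped.
			raf.seek(pointer);
			writer = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8);
			String line;
			while ((line = raf.readLine()) != null) {
				if (Strings.isBlank(line)) {
					continue;
				}
				// RandomAccessFile.readLine() maps each byte to one char (as ISO-8859-1 would),
				// so re-decode the raw bytes as UTF-8 to recover non-ASCII text.
				line = new String(line.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
				writer.write(line.concat("\n"));
				writer.flush();
				logger.info("Thread: {}----start offset: {}----line read\n{} :", Thread.currentThread().getName(), pointer, line);
				pointer = raf.getFilePointer();
			}
			// sleep() is static, so call it on Thread rather than on an instance.
			Thread.sleep(2000);
		} catch (Exception e) {
			// Pass the exception itself so the logger records the stack trace.
			logger.error(e.getMessage(), e);
		} finally {
			lock.unlock();
			fclose(writer, raf);
		}
	}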
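The offset-based tailing used by the listener can be sketched in a few lines of Python. This is only an illustration of the idea, not the author's client: the function name `read_new_lines` is hypothetical, and unlike the Java method it deliberately stops at a partially written last line so that a half-flushed record is not forwarded.

```python
import os
import tempfile


def read_new_lines(path, offset):
    """Return (lines, new_offset) for complete lines appended after `offset`."""
    lines = []
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            raw = f.readline()
            # Stop at EOF or at a line that has no trailing newline yet.
            if not raw or not raw.endswith(b"\n"):
                break
            offset = f.tell()
            text = raw.decode("utf-8").strip()
            if text:  # skip blank lines, as the Java listener does
                lines.append(text)
    return lines, offset


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "app.log")
        with open(log, "wb") as f:
            f.write(b"first\n\nsecond\npart")
        lines, pos = read_new_lines(log, 0)
        print(lines, pos)
        with open(log, "ab") as f:
            f.write(b"ial\n")
        print(read_new_lines(log, pos))
```

Keeping the offset outside the function mirrors the `pointer` field in the Java code: each call resumes exactly where the previous one stopped.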

2.6.3 Spark Streaming: Receiving Real-Time Data (Python)

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = SparkConf()
conf.setAppName("HIS real-time log analysis")
conf.setMaster('yarn')  # run on YARN
conf.set('spark.executor.instances', '8')  # number of executors on YARN
conf.set('spark.executor.memory', '1g')
conf.set('spark.executor.cores', '1')
# conf.set('spark.cores.max', '2')
# conf.set('spark.logConf', True)
# interval (in ms) at which received data is chunked into blocks
conf.set('spark.streaming.blockInterval', str(1000 * 4))
sc = SparkContext(conf=conf)
sc.setLogLevel('ERROR')
sc.setCheckpointDir('hdfs://hadoop01:9000/hadoop/upload/checkpoint/')
ssc = StreamingContext(sc, 30)  # batch interval: one RDD per 30 seconds of data
# ip and port are defined elsewhere in the complete program
lines = ssc.socketTextStream(str(ip), int(port))
# lines.pprint()
lines.foreachRDD(requestLog)
lines.foreachRDD(errorLog)
ssc.start()
ssc.awaitTermination()
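The two intervals above interact. According to the Spark Streaming tuning guidance, each receiver produces roughly (batch interval / block interval) blocks per batch, and each block becomes one task when the batch is processed. The helper below is a hypothetical name used only to work through that arithmetic for the values in this configuration.

```python
def blocks_per_batch(batch_interval_s, block_interval_ms):
    """Approximate number of blocks (and tasks) per receiver per batch."""
    return (batch_interval_s * 1000) / block_interval_ms


if __name__ == "__main__":
    # Values used in this section: 30 s batches, 4000 ms blocks.
    print(blocks_per_batch(30, 4000))
    # Spark's default block interval is 200 ms.
    print(blocks_per_batch(30, 200))
```

With a 30-second batch and a 4000 ms block interval, one receiver yields about 7.5 blocks per batch, which is close to the 8 single-core executors requested above. If the resulting parallelism turns out to be too low, repartitioning the stream is another option.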

2.6.4 Spark SQL, RDD Computation, Structured Queries, and Structured Storage in MongoDB (Python)

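import base64
import logging
import re
import sys
import traceback
from datetime import datetime

from pyspark import StorageLevel
from pyspark.sql import Row, SparkSession

# URI_MODULES, REQUEST_TABLE and ERROR_TABLE are constants defined elsewhere in the complete program.

def getSparkSessionInstance(sparkConf):
    '''
    :@desc Share one SparkSession across all RDD batches (lazy singleton).
     MongoDB connector settings can also be added to the builder, for example:
     .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.coll") \
     .config("spark.mongodb.output.uri", "mongodb://adxkj:123456@192.168.0.252:27017/") \
    :param sparkConf: SparkConf of the running SparkContext
    :return: the shared SparkSession
    '''
    if 'sparkSessionSingletonInstance' not in globals():
        globals()['sparkSessionSingletonInstance'] = SparkSession \
            .builder \
            .config(conf=sparkConf) \
            .getOrCreate()
    return globals()['sparkSessionSingletonInstance']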

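def timeFomate(x):
    '''
    :@desc Merge the date and time fields into one datetime and strip punctuation
    :param x: list of fields from one log line
    :return: the field list with x[0] replaced by a datetime, or None if x is not a list
    '''
    if not isinstance(x, list):
        return None
    # merge the date (x[0]) and time (x[1]) into a single field at index 0
    x.insert(0, ' '.join(x[0:2]))
    x.pop(1)
    x.pop(1)
    # strip '[', ']', "'" and ',' from every field
    rx = re.compile(r"[\[\]',]")
    x = [rx.sub('', field) for field in x]
    # drop the fractional seconds; split() also copes with a timestamp that has no '.'
    x[0] = x[0].split('.')[0]
    # string to datetime
    x[0] = datetime.strptime(x[0], '%Y-%m-%d %H:%M:%S')
    return x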

def sqlMysql(sqlResult, table, url="jdbc:mysql://192.168.0.252:3306/hisLog", user='root', password=""):
    '''
    :@desc Append the SQL result to a MySQL table over JDBC
    :param sqlResult: DataFrame to save
    :param table: target table name
    :param url: JDBC URL
    :param user: database user
    :param password: database password
    :return:
    '''
    try:
        sqlResult.write \
            .mode('append') \
            .format("jdbc") \
            .option("url", url) \
            .option("dbtable", table) \
            .option("user", user) \
            .option("password", password) \
            .save()
    except Exception:
        # report the failure without stopping the streaming job
        traceback.print_exc(limit=3)

def sqlMongodb(sqlResult, table):
    '''
    :@desc Append the SQL result to a MongoDB collection
    :param sqlResult: DataFrame to save
    :param table: target collection name
    :return:
    '''
    try:
        sqlResult.\
            write.\
            format("com.mongodb.spark.sql.DefaultSource"). \
            options(uri="mongodb://adxkj:123456@192.168.0.252:27017/hislog",
                    database="hislog", collection=table, user="adxkj", password="123456").\
            mode("append").\
            save()
    except Exception:
        # report the failure without stopping the streaming job
        traceback.print_exc(limit=3)
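The connection URI above embeds the user name and password directly in the source. One common alternative, sketched here as an assumption rather than as the author's setup, is to read the credentials from the environment and percent-encode them before building the URI. The function name `mongo_uri` and the `MONGO_*` environment variable names are hypothetical.

```python
import os
from urllib.parse import quote_plus


def mongo_uri(host, port, database, user=None, password=None):
    """Build a MongoDB connection URI, percent-encoding the credentials."""
    auth = ""
    if user:
        auth = quote_plus(user)
        if password:
            auth += ":" + quote_plus(password)
        auth += "@"
    return "mongodb://{}{}:{}/{}".format(auth, host, port, database)


if __name__ == "__main__":
    # Hypothetical environment variable names; adjust to your deployment.
    uri = mongo_uri(
        host=os.environ.get("MONGO_HOST", "127.0.0.1"),
        port=int(os.environ.get("MONGO_PORT", "27017")),
        database=os.environ.get("MONGO_DB", "hislog"),
        user=os.environ.get("MONGO_USER"),
        password=os.environ.get("MONGO_PASSWORD"),
    )
    print(uri)
```

The resulting string could then be passed as the `uri` option in place of the literal. Encoding matters because characters such as `@` or `/` in a password would otherwise be read as URI delimiters.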

def decodeStr(x):
    '''
    :@desc Base64-decode the request (x[9]), response (x[11]) and, if present, stack (x[12]) fields
    :param x: list of fields from one log line
    :return: the field list with the decoded values
    '''
    try:
        if x[9].strip() != '':
            x[9] = base64.b64decode(x[9].encode("utf-8")).decode("utf-8")
            # x[9] = x[9][:5000] #mysql
        if x[11].strip() != '':
            x[11] = base64.b64decode(x[11].encode("utf-8")).decode("utf-8")
            # x[11] = x[11][:5000] #mysql
        if len(x) > 12 and x[12].strip() != '':
            x[12] = base64.b64decode(x[12].encode("utf-8")).decode("utf-8")
    except Exception as e:
        print("Cannot decode:", x, e)
    return x

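def analyMod(x):
    '''
    :@desc Match the URI (x[6]) against URI_MODULES and append the module name
    :param x: list of fields from one log line
    :return: the field list with the module appended, or None if the URI is empty
    '''
    # strip() never returns ' ', so test for an empty string instead
    if x[6].strip() == '':
        return None
    hasMatch = False
    for k, v in URI_MODULES.items():
        if x[6].strip().startswith('/' + k):
            hasMatch = True
            x.append(v)
            break  # append exactly one module so the field positions stay fixed
    if not hasMatch:
        x.append('公共模块')  # "common module" fallback
    return x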

def requestLog(time, rdd):
    '''
    :@desc Analyze request logs
    :param time: batch time
    :param rdd: RDD of raw log lines for this batch
    :return:
    '''
    logging.info("+++++handle request log: length: %d++++++++++", rdd.count())
    if rdd.isEmpty():
        return None
    logging.info("++++++++++++++++++++++processing requestLog+++++++++++++++++++++++++++++++")
    reqrdd = rdd.map(lambda x: x.split(' ')).\
        filter(lambda x: len(x) > 12 and 'http-nio-' in x[4] and x[2].strip() == 'INFO').\
        filter(lambda x: x[8].strip().upper().startswith(('POST', 'GET'))).\
        map(timeFomate).\
        map(decodeStr).\
        map(analyMod).\
        filter(lambda x: x is not None)  # drop rows that a mapper rejected
    reqrdd.cache()
    reqrdd.checkpoint()  # cache before checkpointing so the RDD is not computed twice; the checkpoint truncates its lineage
    sqlRdd = reqrdd.map(lambda x: Row(time=x[0], level=x[1], clz=x[2], thread=x[3], user=x[4], depart=x[5],
                          uri=x[6], method=x[7], ip=x[8], request=x[9], oplen=x[10],
                          respone=x[11], mod=x[12]))
    # cache() only uses StorageLevel.MEMORY_ONLY; a disk-backed level lowers memory pressure
    # reqrdd.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)
    if reqrdd.isEmpty():
        return None
    spark = getSparkSessionInstance(rdd.context.getConf())
    df = spark.createDataFrame(sqlRdd)
    df.createOrReplaceTempView(REQUEST_TABLE)
    # query the data once it is structured
    sqlresult = spark.sql("SELECT * FROM " + REQUEST_TABLE)
    sqlresult.show()
    # save
    sqlMongodb(sqlresult, REQUEST_TABLE)

def errorLog(time, rdd):
    '''
    :@desc Analyze error logs
    :param time: batch time
    :param rdd: RDD of raw log lines for this batch
    :return:
    '''
    logging.info("+++++handle error log: length: %d++++++++++", rdd.count())
    if rdd.isEmpty():
        return None
    logging.info("++++++++++++++++++++++processing errorLog+++++++++++++++++++++++++++++++")
    errorrdd = rdd.map(lambda x: x.split(' ')). \
        filter(lambda x: len(x) > 13 and x[2].strip().upper().startswith('ERROR')). \
        map(timeFomate). \
        map(decodeStr). \
        map(analyMod). \
        filter(lambda x: x is not None). \
        map(lambda x: Row(time=x[0], level=x[1], clz=x[2], thread=x[3], user=x[4], depart=x[5],
                          uri=x[6], method=x[7], ip=x[8], request=x[9], oplen=x[10],
                          respone=x[11], stack=x[12], mod=x[13]))
    # persist the RDD to lower memory pressure; PySpark always stores data serialized,
    # so MEMORY_AND_DISK behaves like the older MEMORY_AND_DISK_SER level
    errorrdd.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)
    if errorrdd.isEmpty():
        return None
    spark = getSparkSessionInstance(rdd.context.getConf())
    df = spark.createDataFrame(errorrdd)
    df.createOrReplaceTempView(ERROR_TABLE)
    # query the data once it is structured
    sqlresult = spark.sql("SELECT * FROM " + ERROR_TABLE)
    sqlresult.show()
    # save
    sqlMongodb(sqlresult, ERROR_TABLE)
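To make the field positions easier to follow, here is a minimal Spark-free sketch that pushes one request line through the same steps: split, merge the timestamp, strip punctuation, decode Base64, and match a module. The sample line, the `URI_MODULES` mapping, and the helper names `b64` and `parse_request_line` are all hypothetical; the real log layout is only inferred from the indexes used above.

```python
import base64
import re
from datetime import datetime

URI_MODULES = {"patient": "patient management"}  # hypothetical mapping

FIELDS = ["time", "level", "clz", "thread", "user", "depart", "uri",
          "method", "ip", "request", "oplen", "respone", "mod"]


def b64(text):
    """Encode text the way the sample log is assumed to store payloads."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def parse_request_line(line):
    """Turn one raw request line into a dict, or None if it has too few fields."""
    x = line.split(" ")
    if len(x) <= 12:
        return None
    x[0:2] = [" ".join(x[0:2])]                    # merge date and time
    x = [re.sub(r"[\[\]',]", "", f) for f in x]    # strip [ ] ' ,
    x[0] = datetime.strptime(x[0].split(".")[0], "%Y-%m-%d %H:%M:%S")
    for i in (9, 11):                              # request and response
        if x[i].strip():
            x[i] = base64.b64decode(x[i]).decode("utf-8")
    mod = next((v for k, v in URI_MODULES.items()
                if x[6].startswith("/" + k)), "common module")
    x.append(mod)
    return dict(zip(FIELDS, x))


if __name__ == "__main__":
    sample = " ".join([
        "2018-06-01", "10:15:30.123", "INFO", "c.a.LogAspect",
        "[http-nio-8080-exec-1]", "zhangsan", "cardiology", "/patient/list",
        "GET", "192.168.0.10", b64('{"page":1}'), "42", b64('{"code":0}'),
    ])
    for key, value in parse_request_line(sample).items():
        print(key, "=", value)
```

Note how merging the date and time shifts every later field one position to the left, which is why the filters run on the raw indexes (for example `x[8]` for the HTTP method) while the `Row` construction uses the shifted ones (`method=x[7]`).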

Note: for the complete code, please contact the author @狼.
