Reposted from: https://elasticstack.blog.csdn.net/article/details/114374804

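In this article, I show how to import a CSV file using the file input together with the multiline codec. I covered multiline in an earlier article, "Analyzing Spring Boot microservice logs with the Elastic Stack (Part 1)". I have also written two other examples of importing CSV with Logstash:

    Logstash: Hands-on - loading a CSV document into Elasticsearch

    Logstash: Importing a zipcode CSV file and trying out Geo Search

We can also use Filebeat to parse CSV files. If you are interested, see:

    Beats: Analyzing and visualizing COVID-19 data with the Elastic Stack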

Preparing the data

In today's exercise, we use the following test data:

multiline.csv

    INV-12402400071,05/31/2018,2595,Hy-Vee Wine and Spirits / Denison,"1620  4th Ave, South",Denison,51442,"1620 4th Ave, South Denison 51442(42.012395, -95.348601)",24,CRAWFORD,1011100,Blended Whiskies,260,DIAGEO AMERICAS,25608,Seagrams 7 Crown Bl Whiskey,6,1750,11.96,17.94,1,107.64,1.75,0.46
    S29195400002,11/21/2015,2205,Ding's Honk And Holler,900 E WASHINGTON,CLARINDA,51632,"900 E WASHINGTON
    CLARINDA 51632
    (40.739238, -95.02756)",73,Page,,,255,Wilson Daniels Ltd.,297,Templeton Rye w/Flask,6,750,18.09,27.14,12,325.68,9.00,2.38
    S29198800001,11/20/2015,2191,Keokuk Spirits,1013 MAIN,KEOKUK,52632,"1013 MAIN
    KEOKUK 52632
    (40.39978, -91.387531)",56,Lee,,,255,Wilson Daniels Ltd.,297,Templeton Rye w/Flask,6,750,18.09,27.14,6,162.84,4.50,1.19
    S29198800001,11/20/2015,2191,Keokuk Spirits,1013 MAIN,KEOKUK,52632,"1013 MAIN
    KEOKUK 52632
    (40.39978, -91.387531)",56,Lee,,,255,Wilson Daniels Ltd.,297,Templeton Rye w/Flask,6,750,18.09,27.14,6,162.84,4.50,1.19

This data comes from https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy/data. Some records contain extra newline characters ("\n") and therefore span several lines, although this is fairly rare. The sample above contains the following three invoice numbers (the S29198800001 record appears twice):

    INV-12402400071
    S29195400002
    S29198800001

The S29195400002 and S29198800001 records each span three lines, unlike the first record. How do we handle this? First, note that every record begins on a line that starts with INV- or S.

In general, a Logstash pipeline consists of an input, followed by zero or more filters, and finally an output. For our case, we can use the file input together with the multiline codec, and then pass the data through the csv, mutate, and grok filters.

First, we create a file called logstash_csv.conf:

logstash_csv.conf

    input {
      # Read the CSV file with the multiline codec. Any line that does not start
      # with S or INV- belongs to the previous line, because some addresses
      # contain line breaks.
      file {
        start_position => "beginning"
        path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
        sincedb_path => "/dev/null"
        codec => multiline {
          pattern => "^(S|INV-)[0-9][0-9]"
          negate => true
          what => "previous"
        }
      }
    }

    output {
      stdout {
        codec => rubydebug
      }
    }

Above, we use the file input to read multiline.csv from the given location, with the following codec:

    codec => multiline {
      pattern => "^(S|INV-)[0-9][0-9]"
      negate => true
      what => "previous"
    }

The pattern matches lines that start with S or INV- followed immediately by two digits. With negate set to true, every line that does not match is appended to the previous matching line, so that together they form one document. If this is still unclear, see the description in the earlier article "Beats: Shipping multi-line logs with Filebeat".

We run Logstash with this configuration file:

    sudo ./bin/logstash -f logstash_csv.conf

In the output, a record that was split across three lines is still correctly recognized as a single document. The document contains \n characters, which we need to remove in a later step.

Next, we parse the records with the csv filter:

logstash_csv.conf

    input {
      # Read the CSV file with the multiline codec. Any line that does not start
      # with S or INV- belongs to the previous line, because some addresses
      # contain line breaks.
      file {
        start_position => "beginning"
        path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
        sincedb_path => "/dev/null"
        codec => multiline {
          pattern => "^(S|INV-)[0-9][0-9]"
          negate => true
          what => "previous"
        }
      }
    }

    filter {
      # Parse the CSV values and define fields as integers and floats
      csv {
        columns => [
          "InvoiceItemNumber", "Date", "StoreNumber", "StoreName", "Address",
          "City", "ZipCode", "StoreLocation", "CountyNumber", "County",
          "Category", "CategoryName", "VendorNumber", "VendorName",
          "ItemNumber", "ItemDescription", "Pack", "BottleVolumeml",
          "StateBottleCost", "StateBottleRetail", "BottlesSold", "SaleDollars",
          "VolumeSoldLiters", "VolumeSoldGallons"
        ]
        convert => {
          "StoreNumber" => "integer"
          "ItemNumber" => "integer"
          "Category" => "integer"
          "CountyNumber" => "integer"
          "VendorNumber" => "integer"
          "Pack" => "integer"
          "SaleDollars" => "float"
          "StateBottleCost" => "float"
          "StateBottleRetail" => "float"
          "BottleVolumeml" => "float"
          "BottlesSold" => "float"
          "VolumeSoldLiters" => "float"
          "VolumeSoldGallons" => "float"
        }
        remove_field => ["message"]
      }
    }

    output {
      stdout {
        codec => rubydebug
      }
    }

Above, we parse the items of each CSV record into separate fields. We also use convert to turn the numeric fields into numeric types so that they are easier to analyze, and we remove the message field.

Running Logstash again and checking the result, we see that some County and City values are in uppercase (for example CRAWFORD and CLARINDA), and we want to convert them to lowercase. We also find \n characters in StoreLocation. We add a mutate filter to the filter section to handle both:

logstash_csv.conf

    input {
      # Read the CSV file with the multiline codec. Any line that does not start
      # with S or INV- belongs to the previous line, because some addresses
      # contain line breaks.
      file {
        start_position => "beginning"
        path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
        sincedb_path => "/dev/null"
        codec => multiline {
          pattern => "^(S|INV-)[0-9][0-9]"
          negate => true
          what => "previous"
        }
      }
    }

    filter {
      # Parse the CSV values and define fields as integers and floats
      csv {
        columns => [
          "InvoiceItemNumber", "Date", "StoreNumber", "StoreName", "Address",
          "City", "ZipCode", "StoreLocation", "CountyNumber", "County",
          "Category", "CategoryName", "VendorNumber", "VendorName",
          "ItemNumber", "ItemDescription", "Pack", "BottleVolumeml",
          "StateBottleCost", "StateBottleRetail", "BottlesSold", "SaleDollars",
          "VolumeSoldLiters", "VolumeSoldGallons"
        ]
        convert => {
          "StoreNumber" => "integer"
          "ItemNumber" => "integer"
          "Category" => "integer"
          "CountyNumber" => "integer"
          "VendorNumber" => "integer"
          "Pack" => "integer"
          "SaleDollars" => "float"
          "StateBottleCost" => "float"
          "StateBottleRetail" => "float"
          "BottleVolumeml" => "float"
          "BottlesSold" => "float"
          "VolumeSoldLiters" => "float"
          "VolumeSoldGallons" => "float"
        }
        remove_field => ["message"]
      }

      # Replace the line breaks in the location with spaces, and lowercase the
      # city and county because their casing varies in the source file
      mutate {
        gsub => [ "StoreLocation", "\n", " " ]
        lowercase => [ "County", "City" ]
      }
    }

    output {
      stdout {
        codec => rubydebug
      }
    }

Running Logstash again, we see that the County and City values are now lowercase and that StoreLocation no longer contains \n characters.

Next, we want to extract the location information from StoreLocation, which contains a coordinate pair (latitude and longitude). We can match it with the grok filter:

logstash_csv.conf

    input {
      # Read the CSV file with the multiline codec. Any line that does not start
      # with S or INV- belongs to the previous line, because some addresses
      # contain line breaks.
      file {
        start_position => "beginning"
        path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
        sincedb_path => "/dev/null"
        codec => multiline {
          pattern => "^(S|INV-)[0-9][0-9]"
          negate => true
          what => "previous"
        }
      }
    }

    filter {
      # Parse the CSV values and define fields as integers and floats
      csv {
        columns => [
          "InvoiceItemNumber", "Date", "StoreNumber", "StoreName", "Address",
          "City", "ZipCode", "StoreLocation", "CountyNumber", "County",
          "Category", "CategoryName", "VendorNumber", "VendorName",
          "ItemNumber", "ItemDescription", "Pack", "BottleVolumeml",
          "StateBottleCost", "StateBottleRetail", "BottlesSold", "SaleDollars",
          "VolumeSoldLiters", "VolumeSoldGallons"
        ]
        convert => {
          "StoreNumber" => "integer"
          "ItemNumber" => "integer"
          "Category" => "integer"
          "CountyNumber" => "integer"
          "VendorNumber" => "integer"
          "Pack" => "integer"
          "SaleDollars" => "float"
          "StateBottleCost" => "float"
          "StateBottleRetail" => "float"
          "BottleVolumeml" => "float"
          "BottlesSold" => "float"
          "VolumeSoldLiters" => "float"
          "VolumeSoldGallons" => "float"
        }
        remove_field => ["message"]
      }

      # Replace the line breaks in the location with spaces, and lowercase the
      # city and county because their casing varies in the source file
      mutate {
        gsub => [ "StoreLocation", "\n", " " ]
        lowercase => [ "County", "City" ]
      }

      # Get the lat/lon if the location contains (numbers,numbers) data
      grok {
        match => { "StoreLocation" => "\((?<location>[-,.0-9 ]*)\)" }
      }
    }

    output {
      stdout {
        codec => rubydebug
      }
    }

We match the content inside the parentheses () in StoreLocation and assign it to location. The character class allows a hyphen, a comma, a period, the digits 0-9, and spaces. Running Logstash again, we can see that location has been extracted from StoreLocation.

Next, we set the time of each document from the data itself. At the moment, @timestamp is not the time in the document's Date field, so we add a date filter:

logstash_csv.conf

    input {
      # Read the CSV file with the multiline codec. Any line that does not start
      # with S or INV- belongs to the previous line, because some addresses
      # contain line breaks.
      file {
        start_position => "beginning"
        path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
        sincedb_path => "/dev/null"
        codec => multiline {
          pattern => "^(S|INV-)[0-9][0-9]"
          negate => true
          what => "previous"
        }
      }
    }

    filter {
      # Parse the CSV values and define fields as integers and floats
      csv {
        columns => [
          "InvoiceItemNumber", "Date", "StoreNumber", "StoreName", "Address",
          "City", "ZipCode", "StoreLocation", "CountyNumber", "County",
          "Category", "CategoryName", "VendorNumber", "VendorName",
          "ItemNumber", "ItemDescription", "Pack", "BottleVolumeml",
          "StateBottleCost", "StateBottleRetail", "BottlesSold", "SaleDollars",
          "VolumeSoldLiters", "VolumeSoldGallons"
        ]
        convert => {
          "StoreNumber" => "integer"
          "ItemNumber" => "integer"
          "Category" => "integer"
          "CountyNumber" => "integer"
          "VendorNumber" => "integer"
          "Pack" => "integer"
          "SaleDollars" => "float"
          "StateBottleCost" => "float"
          "StateBottleRetail" => "float"
          "BottleVolumeml" => "float"
          "BottlesSold" => "float"
          "VolumeSoldLiters" => "float"
          "VolumeSoldGallons" => "float"
        }
        remove_field => ["message"]
      }

      # Replace the line breaks in the location with spaces, and lowercase the
      # city and county because their casing varies in the source file
      mutate {
        gsub => [ "StoreLocation", "\n", " " ]
        lowercase => [ "County", "City" ]
      }

      # Get the lat/lon if the location contains (numbers,numbers) data
      grok {
        match => { "StoreLocation" => "\((?<location>[-,.0-9 ]*)\)" }
      }

      # Parse the date (day resolution) in the correct timezone
      date {
        match => [ "Date", "MM/dd/yyyy" ]
        timezone => "America/Chicago"
      }
    }

    output {
      stdout {
        codec => rubydebug
      }
    }

Running Logstash again, @timestamp now comes from the time in the document.

Finally, we can add an output to Elasticsearch:

logstash_csv.conf

    input {
      # Read the CSV file with the multiline codec. Any line that does not start
      # with S or INV- belongs to the previous line, because some addresses
      # contain line breaks.
      file {
        start_position => "beginning"
        path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
        sincedb_path => "/dev/null"
        codec => multiline {
          pattern => "^(S|INV-)[0-9][0-9]"
          negate => true
          what => "previous"
        }
      }
    }

    filter {
      # Parse the CSV values and define fields as integers and floats
      csv {
        columns => [
          "InvoiceItemNumber", "Date", "StoreNumber", "StoreName", "Address",
          "City", "ZipCode", "StoreLocation", "CountyNumber", "County",
          "Category", "CategoryName", "VendorNumber", "VendorName",
          "ItemNumber", "ItemDescription", "Pack", "BottleVolumeml",
          "StateBottleCost", "StateBottleRetail", "BottlesSold", "SaleDollars",
          "VolumeSoldLiters", "VolumeSoldGallons"
        ]
        convert => {
          "StoreNumber" => "integer"
          "ItemNumber" => "integer"
          "Category" => "integer"
          "CountyNumber" => "integer"
          "VendorNumber" => "integer"
          "Pack" => "integer"
          "SaleDollars" => "float"
          "StateBottleCost" => "float"
          "StateBottleRetail" => "float"
          "BottleVolumeml" => "float"
          "BottlesSold" => "float"
          "VolumeSoldLiters" => "float"
          "VolumeSoldGallons" => "float"
        }
        remove_field => ["message"]
      }

      # Replace the line breaks in the location with spaces, and lowercase the
      # city and county because their casing varies in the source file
      mutate {
        gsub => [ "StoreLocation", "\n", " " ]
        lowercase => [ "County", "City" ]
      }

      # Get the lat/lon if the location contains (numbers,numbers) data
      grok {
        match => { "StoreLocation" => "\((?<location>[-,.0-9 ]*)\)" }
      }

      # Parse the date (day resolution) in the correct timezone
      date {
        match => [ "Date", "MM/dd/yyyy" ]
        timezone => "America/Chicago"
      }
    }

    output {
      elasticsearch {
        hosts => ["https://your.cluster.here:9243"]
        index => "iowa-liquor"
        user => "elastic"
        password => "redacted"
        manage_template => false
      }

      # Output dots while we process
      stdout { codec => "dots" }

      # If we saw a date parse failure, dump it to the screen for review
      if "_dateparsefailure" in [tags] {
        stdout { codec => "rubydebug" }
      }
    }
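
The multiline rule used in the configurations above (negate set to true, what set to "previous") can be tried out in plain Python before running Logstash. This is only a rough sketch of the grouping logic, not how the codec itself is implemented. The function name group_multiline is hypothetical, and the sample lines are shortened versions of the records in multiline.csv.

```python
import re

# Same pattern as in the Logstash config: a record starts with S or INV-
# followed by two digits.
RECORD_START = re.compile(r"^(S|INV-)[0-9][0-9]")


def group_multiline(lines):
    """Append every non-matching line to the preceding matching line.

    Mirrors negate => true / what => "previous": a line that does NOT match
    the pattern is treated as a continuation of the previous event.
    """
    events = []
    for line in lines:
        if RECORD_START.search(line) or not events:
            events.append(line)          # a new event starts here
        else:
            events[-1] += "\n" + line    # continuation of the previous event
    return events


# Shortened sample lines (assumption: trimmed from the records shown earlier).
sample = [
    'INV-12402400071,05/31/2018,2595,"1620 4th Ave, South Denison 51442(42.012395, -95.348601)",24',
    'S29195400002,11/21/2015,2205,"900 E WASHINGTON',
    'CLARINDA 51632',
    '(40.739238, -95.02756)",73',
]

events = group_multiline(sample)
for event in events:
    print(repr(event))
```

The four input lines come out as two events, and the second event keeps its two embedded line breaks, which is why the later mutate step replaces \n in StoreLocation.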

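The filter chain in this article (csv, mutate, grok, date) can also be approximated in Python to check the expected field values for one record. This is a minimal sketch under stated assumptions, not the Logstash implementation: the function name apply_filters is hypothetical, only two numeric conversions are shown, and the timezone handling of the date filter is left out, so the timestamp is a naive date.

```python
import csv
import io
import re
from datetime import datetime

COLUMNS = [
    "InvoiceItemNumber", "Date", "StoreNumber", "StoreName", "Address",
    "City", "ZipCode", "StoreLocation", "CountyNumber", "County",
    "Category", "CategoryName", "VendorNumber", "VendorName",
    "ItemNumber", "ItemDescription", "Pack", "BottleVolumeml",
    "StateBottleCost", "StateBottleRetail", "BottlesSold", "SaleDollars",
    "VolumeSoldLiters", "VolumeSoldGallons",
]

# One record from multiline.csv, already joined into a single event
# (the result of the multiline codec), with its embedded line breaks.
EVENT = (
    'S29195400002,11/21/2015,2205,Ding\'s Honk And Holler,900 E WASHINGTON,'
    'CLARINDA,51632,"900 E WASHINGTON\nCLARINDA 51632\n'
    '(40.739238, -95.02756)",73,Page,,,255,Wilson Daniels Ltd.,297,'
    'Templeton Rye w/Flask,6,750,18.09,27.14,12,325.68,9.00,2.38'
)


def apply_filters(event):
    """Approximate the csv, mutate, grok and date steps for one event."""
    # csv: split into named fields; quoted fields may contain line breaks
    row = next(csv.reader(io.StringIO(event, newline="")))
    doc = dict(zip(COLUMNS, row))
    # convert: only two fields shown here for brevity
    doc["StoreNumber"] = int(doc["StoreNumber"])
    doc["SaleDollars"] = float(doc["SaleDollars"])
    # mutate: replace line breaks with spaces, lowercase County and City
    doc["StoreLocation"] = doc["StoreLocation"].replace("\n", " ")
    doc["County"] = doc["County"].lower()
    doc["City"] = doc["City"].lower()
    # grok: capture the text inside the parentheses as location
    match = re.search(r"\((?P<location>[-,.0-9 ]*)\)", doc["StoreLocation"])
    if match:
        doc["location"] = match.group("location")
    # date: parse MM/dd/yyyy (no timezone applied in this sketch)
    doc["timestamp"] = datetime.strptime(doc["Date"], "%m/%d/%Y")
    return doc


doc = apply_filters(EVENT)
for key in ("City", "County", "StoreLocation", "location", "timestamp"):
    print(key, "=", doc[key])
```

Comparing these values with the rubydebug output of Logstash is a quick way to confirm that the column order and the grok pattern behave as intended.
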