Logstash: Example of importing a CSV file with Logstash
Reposted from: https://elasticstack.blog.csdn.net/article/details/114374804

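In today's article, I show how to import a CSV file using the file input together with the multiline codec. I covered multiline earlier in the article "Analyzing Spring Boot microservice logs with the Elastic Stack (Part 1)". I also have two other examples of importing CSV files with Logstash:
Logstash: Hands-on practice - loading CSV documents into Elasticsearch
Logstash: Importing a zipcode CSV file and trying out Geo Search
For CSV imports, we can also use Filebeat to parse the CSV file. If you are interested, see:
Beats: Analyzing and visualizing COVID-19 data with the Elastic Stack
Preparing the data
In today's exercise, we have the following test data: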
multiline.csv
INV-12402400071,05/31/2018,2595,Hy-Vee Wine and Spirits / Denison,"1620 4th Ave, South",Denison,51442,"1620 4th Ave, South Denison 51442(42.012395, -95.348601)",24,CRAWFORD,1011100,Blended Whiskies,260,DIAGEO AMERICAS,25608,Seagrams 7 Crown Bl Whiskey,6,1750,11.96,17.94,1,107.64,1.75,0.46
S29195400002,11/21/2015,2205,Ding's Honk And Holler,900 E WASHINGTON,CLARINDA,51632,"900 E WASHINGTON
CLARINDA 51632
(40.739238, -95.02756)",73,Page,,,255,Wilson Daniels Ltd.,297,Templeton Rye w/Flask,6,750,18.09,27.14,12,325.68,9.00,2.38
S29198800001,11/20/2015,2191,Keokuk Spirits,1013 MAIN,KEOKUK,52632,"1013 MAIN
KEOKUK 52632
(40.39978, -91.387531)",56,Lee,,,255,Wilson Daniels Ltd.,297,Templeton Rye w/Flask,6,750,18.09,27.14,6,162.84,4.50,1.19
S29198800001,11/20/2015,2191,Keokuk Spirits,1013 MAIN,KEOKUK,52632,"1013 MAIN
KEOKUK 52632
(40.39978, -91.387531)",56,Lee,,,255,Wilson Daniels Ltd.,297,Templeton Rye w/Flask,6,750,18.09,27.14,6,162.84,4.50,1.19
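This data comes from https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy/data. Some of its records contain embedded newline characters ("\n"), so a single record can be spread over several lines, although this is fairly rare. In the sample above we can see the following three distinct documents (the last one appears twice):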
INV-12402400071
S29195400002
S29198800001
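The documents S29195400002 and S29198800001 each span three lines, which clearly differs from the first document. How do we handle this? First, notice that every document begins on a line that starts with INV- or S. In general, a Logstash pipeline has an input, passes events through zero or more filters, and finally writes them to an output.
For our case, we can use a file input with the multiline codec and then pass the data to the csv, mutate, and grok filters for processing.
First, we create a file called logstash_csv.conf: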
logstash_csv.conf
input {
# Read the csv file. also use the multiline codec, everything that does not start with S or INV- is part of the prior line due to addresses having line breaks
file {
start_position => "beginning"
path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
sincedb_path => "/dev/null"
codec => multiline {
pattern => "^(S|INV-)[0-9][0-9]"
negate => "true"
what => "previous"
}
}
}
output {
stdout {
codec => rubydebug
}
}
Above, we use the file input to read multiline.csv from the specified location. We use the following codec:
codec => multiline {
pattern => "^(S|INV-)[0-9][0-9]"
negate => "true"
what => "previous"
}
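The pattern matches lines that begin with S or INV-, immediately followed by two digits in the range 0-9. Setting negate to true means that lines which do not match the pattern are appended to the previous matching line, so that together they form one document. If this is still unclear, see the description in the earlier article "Beats: Shipping multiline logs with Filebeat".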
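To make the grouping rule concrete, here is a minimal Python sketch of the same idea. It is only an illustration, not how Logstash implements the codec: the function name group_multiline is hypothetical, the sample lines are shortened versions of the records above, and the sketch ignores the codec's buffering and flush behavior.

```python
import re

# Same pattern as the multiline codec in the config above.
START = re.compile(r"^(S|INV-)[0-9][0-9]")


def group_multiline(lines):
    """Merge lines that do not match START into the previous event,
    mimicking negate => true combined with what => previous."""
    events = []
    for line in lines:
        if START.match(line) or not events:
            events.append(line)        # a new record starts here
        else:
            events[-1] += "\n" + line  # continuation of the prior record
    return events


# Shortened sample lines (the real records have 24 columns).
raw = [
    'INV-12402400071,05/31/2018,2595,Hy-Vee Wine and Spirits / Denison',
    'S29195400002,11/21/2015,2205,Ding\'s Honk And Holler,"900 E WASHINGTON',
    'CLARINDA 51632',
    '(40.739238, -95.02756)",73,Page',
]
events = group_multiline(raw)
for event in events:
    print(repr(event))
```

The two continuation lines do not start with S or INV- plus two digits, so they are folded into the S29195400002 record, newline characters included.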
We run Logstash with the configuration above:
sudo ./bin/logstash -f logstash_csv.conf
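In the output, we can see that although a document is split across three lines in the file, it is still correctly recognized as a single document. The document also contains \n characters, which we need to remove in a later step.
Next, we use the csv filter to process the events: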
logstash_csv.conf
input {
# Read the csv file. also use the multiline codec, everything that does not start with S or INV- is part of the prior line due to addresses having line breaks
file {
start_position => "beginning"
path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
sincedb_path => "/dev/null"
codec => multiline {
pattern => "^(S|INV-)[0-9][0-9]"
negate => "true"
what => "previous"
}
}
}
filter {
# Parse the csv values and define fields as integers and floats
csv {
columns => ["InvoiceItemNumber","Date","StoreNumber","StoreName","Address","City","ZipCode","StoreLocation","CountyNumber","County","Category","CategoryName","VendorNumber","VendorName","ItemNumber","ItemDescription","Pack","BottleVolumeml","StateBottleCost","StateBottleRetail","BottlesSold","SaleDollars","VolumeSoldLiters","VolumeSoldGallons"]
convert => { "StoreNumber" => "integer" "ItemNumber" => "integer" "Category" => "integer" "CountyNumber" => "integer" "VendorNumber" => "integer" "Pack" => "integer" "SaleDollars" => "float" "StateBottleCost" => "float" "StateBottleRetail" => "float" "BottleVolumeml" => "float" "BottlesSold" => "float" "VolumeSoldLiters" => "float" "VolumeSoldGallons" => "float"}
remove_field => ["message"]
}
}
output {
stdout {
codec => rubydebug
}
}
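Above, we parse the items of each CSV record into separate fields. We also use convert to turn the numeric fields into numeric types so that they are easier to analyze, and we remove the message field.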
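The effect of the csv filter can be sketched in Python with the standard csv module. This is only an analogy under stated assumptions: parse_event is a hypothetical helper, COLUMNS is cut down to the first nine of the 24 columns, and the sample message is a shortened version of the S29195400002 record.

```python
import csv
import io

# Assumption: only the first 9 of the 24 columns, to keep the example short.
COLUMNS = ["InvoiceItemNumber", "Date", "StoreNumber", "StoreName", "Address",
           "City", "ZipCode", "StoreLocation", "CountyNumber"]
# Counterpart of the convert option: field name -> target type.
CONVERT = {"StoreNumber": int, "CountyNumber": int}


def parse_event(message):
    """Split one (possibly multi-line) CSV record into named fields."""
    # newline="" keeps the embedded line breaks inside quoted fields intact.
    row = next(csv.reader(io.StringIO(message, newline="")))
    event = dict(zip(COLUMNS, row))
    for field, cast in CONVERT.items():
        if event.get(field):  # leave empty values untouched
            event[field] = cast(event[field])
    return event


message = ('S29195400002,11/21/2015,2205,Ding\'s Honk And Holler,'
           '900 E WASHINGTON,CLARINDA,51632,'
           '"900 E WASHINGTON\nCLARINDA 51632\n(40.739238, -95.02756)",73')
event = parse_event(message)
print(event["StoreNumber"], repr(event["StoreLocation"]))
```

Because the address is wrapped in double quotes, the embedded commas and line breaks stay inside the single StoreLocation field instead of producing extra columns.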
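Re-run Logstash and check the result.
In the output, the casing of County and City is inconsistent: some values are all uppercase, so we want to convert them to lowercase. We also find \n characters in StoreLocation. We add a mutate filter to the filter section to handle both: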
logstash_csv.conf
input {
# Read the csv file. also use the multiline codec, everything that does not start with S or INV- is part of the prior line due to addresses having line breaks
file {
start_position => "beginning"
path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
sincedb_path => "/dev/null"
codec => multiline {
pattern => "^(S|INV-)[0-9][0-9]"
negate => "true"
what => "previous"
}
}
}
filter {
# Parse the csv values and define fields as integers and floats
csv {
columns => ["InvoiceItemNumber","Date","StoreNumber","StoreName","Address","City","ZipCode","StoreLocation","CountyNumber","County","Category","CategoryName","VendorNumber","VendorName","ItemNumber","ItemDescription","Pack","BottleVolumeml","StateBottleCost","StateBottleRetail","BottlesSold","SaleDollars","VolumeSoldLiters","VolumeSoldGallons"]
convert => { "StoreNumber" => "integer" "ItemNumber" => "integer" "Category" => "integer" "CountyNumber" => "integer" "VendorNumber" => "integer" "Pack" => "integer" "SaleDollars" => "float" "StateBottleCost" => "float" "StateBottleRetail" => "float" "BottleVolumeml" => "float" "BottlesSold" => "float" "VolumeSoldLiters" => "float" "VolumeSoldGallons" => "float"}
remove_field => ["message"]
}
# Take the linebreaks out of the location and convert to spaces and lowercase the city and county as they change in the source file
mutate {
gsub => [ "StoreLocation", "\n", " " ]
lowercase => [ "County", "City" ]
}
}
output {
stdout {
codec => rubydebug
}
}
Re-run Logstash and check the output.
We see that the County and City values are now all lowercase, and that StoreLocation no longer contains any \n characters.
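As a rough Python equivalent of this mutate step (a sketch only; the mutate function below is a hypothetical stand-in, not the Logstash filter itself), the two operations amount to a regex substitution and a lowercase conversion:

```python
import re


def mutate(event):
    """Replace line breaks in StoreLocation with spaces and lowercase
    County and City, returning a new dict."""
    out = dict(event)
    out["StoreLocation"] = re.sub(r"\n", " ", out["StoreLocation"])
    for field in ("County", "City"):
        if isinstance(out.get(field), str):
            out[field] = out[field].lower()
    return out


event = {
    "City": "CLARINDA",
    "County": "Page",
    "StoreLocation": "900 E WASHINGTON\nCLARINDA 51632\n(40.739238, -95.02756)",
}
cleaned = mutate(event)
print(cleaned)
```

Lowercasing both fields means that values such as CRAWFORD and Page end up in one consistent form, which matters later when aggregating by county or city.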
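Next, we want to extract the location information from StoreLocation. We can see that it contains a coordinate (latitude and longitude) in parentheses. We can use the grok filter to match it: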
logstash_csv.conf
input {
# Read the csv file. also use the multiline codec, everything that does not start with S or INV- is part of the prior line due to addresses having line breaks
file {
start_position => "beginning"
path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
sincedb_path => "/dev/null"
codec => multiline {
pattern => "^(S|INV-)[0-9][0-9]"
negate => "true"
what => "previous"
}
}
}
filter {
# Parse the csv values and define fields as integers and floats
csv {
columns => ["InvoiceItemNumber","Date","StoreNumber","StoreName","Address","City","ZipCode","StoreLocation","CountyNumber","County","Category","CategoryName","VendorNumber","VendorName","ItemNumber","ItemDescription","Pack","BottleVolumeml","StateBottleCost","StateBottleRetail","BottlesSold","SaleDollars","VolumeSoldLiters","VolumeSoldGallons"]
convert => { "StoreNumber" => "integer" "ItemNumber" => "integer" "Category" => "integer" "CountyNumber" => "integer" "VendorNumber" => "integer" "Pack" => "integer" "SaleDollars" => "float" "StateBottleCost" => "float" "StateBottleRetail" => "float" "BottleVolumeml" => "float" "BottlesSold" => "float" "VolumeSoldLiters" => "float" "VolumeSoldGallons" => "float"}
remove_field => ["message"]
}
# Take the linebreaks out of the location and convert to spaces and lowercase the city and county as they change in the source file
mutate {
gsub => [ "StoreLocation", "\n", " " ]
lowercase => [ "County", "City" ]
}
# Get the lat/lon if there is a (numbers,numbers) data in the location
grok {
match => { "StoreLocation" => "\((?<location>[-,.0-9 ]*)\)" }
}
}
output {
stdout {
codec => rubydebug
}
}
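We match the content inside the parentheses () in StoreLocation and assign it to location. The character class allows a minus sign, a comma, a period, the digits 0-9, and spaces.
After re-running Logstash, we can see that location has been extracted from StoreLocation.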
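The same expression can be tried out in Python as a quick check. This is a sketch under assumptions: extract_location is a hypothetical helper, and Python spells the named group as (?P<location>...) where grok uses (?<location>...).

```python
import re

# Python counterpart of the grok expression; only the named-group syntax differs.
LOCATION = re.compile(r"\((?P<location>[-,.0-9 ]*)\)")


def extract_location(store_location):
    """Return the text between the parentheses, or None if there is no match."""
    match = LOCATION.search(store_location)
    return match.group("location") if match else None


print(extract_location("900 E WASHINGTON CLARINDA 51632 (40.739238, -95.02756)"))
print(extract_location("1620 4th Ave, South Denison 51442(42.012395, -95.348601)"))
print(extract_location("1013 MAIN KEOKUK 52632"))
```

When StoreLocation has no coordinate, the sketch returns None; in Logstash, a grok filter that fails to match tags the event with _grokparsefailure instead.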
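Next, we set the event time from the time stored in the document. At the moment, @timestamp is the time at which Logstash processed the event, not the time in the document's Date field.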
logstash_csv.conf
input {
# Read the csv file. also use the multiline codec, everything that does not start with S or INV- is part of the prior line due to addresses having line breaks
file {
start_position => "beginning"
path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
sincedb_path => "/dev/null"
codec => multiline {
pattern => "^(S|INV-)[0-9][0-9]"
negate => "true"
what => "previous"
}
}
}
filter {
# Parse the csv values and define fields as integers and floats
csv {
columns => ["InvoiceItemNumber","Date","StoreNumber","StoreName","Address","City","ZipCode","StoreLocation","CountyNumber","County","Category","CategoryName","VendorNumber","VendorName","ItemNumber","ItemDescription","Pack","BottleVolumeml","StateBottleCost","StateBottleRetail","BottlesSold","SaleDollars","VolumeSoldLiters","VolumeSoldGallons"]
convert => { "StoreNumber" => "integer" "ItemNumber" => "integer" "Category" => "integer" "CountyNumber" => "integer" "VendorNumber" => "integer" "Pack" => "integer" "SaleDollars" => "float" "StateBottleCost" => "float" "StateBottleRetail" => "float" "BottleVolumeml" => "float" "BottlesSold" => "float" "VolumeSoldLiters" => "float" "VolumeSoldGallons" => "float"}
remove_field => ["message"]
}
# Take the linebreaks out of the location and convert to spaces and lowercase the city and county as they change in the source file
mutate {
gsub => [ "StoreLocation", "\n", " " ]
lowercase => [ "County", "City" ]
}
# Get the lat/lon if there is a (numbers,numbers) data in the location
grok {
match => { "StoreLocation" => "\((?<location>[-,.0-9 ]*)\)" }
}
# Match the date to just daily and the correct timezone
date {
"match" => [ "Date", "MM/dd/YYYY" ]
"timezone" => "America/Chicago"
}
}
output {
stdout {
codec => rubydebug
}
}
Run Logstash again.
Now @timestamp clearly comes from the time in the document.
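The date handling can be sketched in Python as follows. This is an illustration rather than the date filter's implementation: to_timestamp is a hypothetical helper, the strptime format "%m/%d/%Y" stands in for the MM/dd/YYYY pattern, and zoneinfo (Python 3.9+) needs time zone data to be available on the system.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_timestamp(date_text, tz_name="America/Chicago"):
    """Read a date such as 05/31/2018 as local midnight in tz_name
    and convert it to UTC, the zone in which @timestamp is stored."""
    local = datetime.strptime(date_text, "%m/%d/%Y").replace(tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)


print(to_timestamp("05/31/2018").isoformat())  # 2018-05-31T05:00:00+00:00
print(to_timestamp("11/21/2015").isoformat())  # 2015-11-21T06:00:00+00:00
```

The Date column carries no time of day, so each record lands at local midnight. The UTC offset differs between the two samples because the first date falls in daylight saving time (UTC-5) and the second in standard time (UTC-6).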
Next, we can add an output to Elasticsearch:
logstash_csv.conf
input {
# Read the csv file. also use the multiline codec, everything that does not start with S or INV- is part of the prior line due to addresses having line breaks
file {
start_position => "beginning"
path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
sincedb_path => "/dev/null"
codec => multiline {
pattern => "^(S|INV-)[0-9][0-9]"
negate => "true"
what => "previous"
}
}
}
filter {
# Parse the csv values and define fields as integers and floats
csv {
columns => ["InvoiceItemNumber","Date","StoreNumber","StoreName","Address","City","ZipCode","StoreLocation","CountyNumber","County","Category","CategoryName","VendorNumber","VendorName","ItemNumber","ItemDescription","Pack","BottleVolumeml","StateBottleCost","StateBottleRetail","BottlesSold","SaleDollars","VolumeSoldLiters","VolumeSoldGallons"]
convert => { "StoreNumber" => "integer" "ItemNumber" => "integer" "Category" => "integer" "CountyNumber" => "integer" "VendorNumber" => "integer" "Pack" => "integer" "SaleDollars" => "float" "StateBottleCost" => "float" "StateBottleRetail" => "float" "BottleVolumeml" => "float" "BottlesSold" => "float" "VolumeSoldLiters" => "float" "VolumeSoldGallons" => "float"}
remove_field => ["message"]
}
# Take the linebreaks out of the location and convert to spaces and lowercase the city and county as they change in the source file
mutate {
gsub => [ "StoreLocation", "\n", " " ]
lowercase => [ "County", "City" ]
}
# Get the lat/lon if there is a (numbers,numbers) data in the location
grok {
match => { "StoreLocation" => "\((?<location>[-,.0-9 ]*)\)" }
}
# Match the date to just daily and the correct timezone
date {
"match" => [ "Date", "MM/dd/YYYY" ]
"timezone" => "America/Chicago"
}
}
output {
elasticsearch {
hosts => ["https://your.cluster.here:9243"]
index => "iowa-liquor"
user => "elastic"
password => "redacted"
manage_template => false
}
#output dots while we process
stdout { codec => "dots" }
#if we saw a date parse failure, dump it to screen to review
if "_dateparsefailure" in [tags] {
stdout { codec => "rubydebug" }
}
}