Reposted from: https://elasticstack.blog.csdn.net/article/details/114374804

In today's article, I will show how to use the file input combined with the multiline codec to import a CSV file. I covered multiline before in the article "Analyzing Spring Boot microservice logs with the Elastic Stack (part 1)". I also have two earlier examples of importing CSV with Logstash:

    Logstash: hands-on practice - loading CSV documents into Elasticsearch

    Logstash: importing a zipcode CSV file and trying out Geo Search

For CSV imports, we can also use Filebeat to parse the CSV file. If you are interested, please refer to:

    Beats: using the Elastic Stack to analyze and visualize COVID-19 data

Preparing the data

In today's exercise, we have the following test data:

multiline.csv

    INV-12402400071,05/31/2018,2595,Hy-Vee Wine and Spirits / Denison,"1620  4th Ave, South",Denison,51442,"1620 4th Ave, South Denison 51442(42.012395, -95.348601)",24,CRAWFORD,1011100,Blended Whiskies,260,DIAGEO AMERICAS,25608,Seagrams 7 Crown Bl Whiskey,6,1750,11.96,17.94,1,107.64,1.75,0.46
S29195400002,11/21/2015,2205,Ding's Honk And Holler,900 E WASHINGTON,CLARINDA,51632,"900 E WASHINGTON
CLARINDA 51632
(40.739238, -95.02756)",73,Page,,,255,Wilson Daniels Ltd.,297,Templeton Rye w/Flask,6,750,18.09,27.14,12,325.68,9.00,2.38
S29198800001,11/20/2015,2191,Keokuk Spirits,1013 MAIN,KEOKUK,52632,"1013 MAIN
KEOKUK 52632
(40.39978, -91.387531)",56,Lee,,,255,Wilson Daniels Ltd.,297,Templeton Rye w/Flask,6,750,18.09,27.14,6,162.84,4.50,1.19
S29198800001,11/20/2015,2191,Keokuk Spirits,1013 MAIN,KEOKUK,52632,"1013 MAIN
KEOKUK 52632
(40.39978, -91.387531)",56,Lee,,,255,Wilson Daniels Ltd.,297,Templeton Rye w/Flask,6,750,18.09,27.14,6,162.84,4.50,1.19

This data comes from https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy/data. Some of the records contain extra newline characters ("\n"), so a single record ends up spread across several lines, although this situation is relatively rare. In the data above, we can see the following three documents:

    INV-12402400071
    S29195400002
    S29198800001

The contents of the two documents S29195400002 and S29198800001 each span three lines, which is clearly different from the first document. So how do we handle this situation? First, notice that every document starts with a line beginning with INV- or S. In general, a Logstash pipeline contains an input, followed by zero or more filters, and finally an output. For our case, we can use the file input together with the multiline codec, and then pass the data through the csv, mutate, and grok filters for processing.

First, we create a file called logstash_csv.conf:

logstash_csv.conf

input {
  # Read the csv file. Also use the multiline codec: everything that does not
  # start with S or INV- is part of the prior line, due to addresses having line breaks.
  file {
    start_position => "beginning"
    path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "^(S|INV-)[0-9][0-9]"
      negate => "true"
      what => "previous"
    }
  }
}

output {
  stdout {
    codec => rubydebug
  }
}
In the above, we use the file input to read multiline.csv from the specified location. We used the following codec:

codec => multiline {
  pattern => "^(S|INV-)[0-9][0-9]"
  negate => "true"
  what => "previous"
}

It matches lines that begin with S or INV-, immediately followed by two digits. Setting negate to true means that every line that does not match the pattern is appended to the previous matching line, so that together they form one document. If this is still not entirely clear to you, please refer to the earlier description in "Beats: shipping multiline logs with Filebeat".

We run Logstash with the above configuration file:

sudo ./bin/logstash -f logstash_csv.conf

In the output, we can see that even though a document is split across three lines, it is still correctly recognized as a single document. We can also see \n characters inside the document; in the processing that follows, we need to remove them.
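For reference, the merged event for the S29198800001 record looks roughly like the following rubydebug output (a hand-abridged sketch; the exact set of metadata fields depends on your Logstash version):

{
    "message" => "S29198800001,11/20/2015,2191,Keokuk Spirits,1013 MAIN,KEOKUK,52632,\"1013 MAIN\nKEOKUK 52632\n(40.39978, -91.387531)\",56,Lee,,,255,Wilson Daniels Ltd.,297,Templeton Rye w/Flask,6,750,18.09,27.14,6,162.84,4.50,1.19",
       "tags" => [
        [0] "multiline"
    ],
       "path" => "/Users/liuxg/data/logstash_multiline/multiline.csv"
}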
Next, we use the csv filter for further processing:

logstash_csv.conf

input {
  # Read the csv file. Also use the multiline codec: everything that does not
  # start with S or INV- is part of the prior line, due to addresses having line breaks.
  file {
    start_position => "beginning"
    path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "^(S|INV-)[0-9][0-9]"
      negate => "true"
      what => "previous"
    }
  }
}

filter {
  # Parse the csv values, defining fields as integers and floats
  csv {
    columns => ["InvoiceItemNumber","Date","StoreNumber","StoreName","Address","City","ZipCode","StoreLocation","CountyNumber","County","Category","CategoryName","VendorNumber","VendorName","ItemNumber","ItemDescription","Pack","BottleVolumeml","StateBottleCost","StateBottleRetail","BottlesSold","SaleDollars","VolumeSoldLiters","VolumeSoldGallons"]
    convert => {
      "StoreNumber" => "integer"
      "ItemNumber" => "integer"
      "Category" => "integer"
      "CountyNumber" => "integer"
      "VendorNumber" => "integer"
      "Pack" => "integer"
      "SaleDollars" => "float"
      "StateBottleCost" => "float"
      "StateBottleRetail" => "float"
      "BottleVolumeml" => "float"
      "BottlesSold" => "float"
      "VolumeSoldLiters" => "float"
      "VolumeSoldGallons" => "float"
    }
    remove_field => ["message"]
  }
}

output {
  stdout {
    codec => rubydebug
  }
}

In the above, we parse the items in the CSV document into individual fields. We also use convert to turn the numeric fields into numeric types for easier analysis, and we delete the message field. Rerun Logstash and examine the results: County and City are entirely in uppercase letters, and we would like to convert them to lowercase. In addition, the StoreLocation field still contains \n characters.
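For the same S29198800001 record, the parsed event now carries fields like these (abridged; note the converted numeric types):

{
    "InvoiceItemNumber" => "S29198800001",
                 "Date" => "11/20/2015",
          "StoreNumber" => 2191,
            "StoreName" => "Keokuk Spirits",
                 "City" => "KEOKUK",
               "County" => "Lee",
        "StoreLocation" => "1013 MAIN\nKEOKUK 52632\n(40.39978, -91.387531)",
          "BottlesSold" => 6.0,
          "SaleDollars" => 162.84
}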
We add a mutate filter to the filter section to handle both issues:

logstash_csv.conf

input {
  # Read the csv file. Also use the multiline codec: everything that does not
  # start with S or INV- is part of the prior line, due to addresses having line breaks.
  file {
    start_position => "beginning"
    path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "^(S|INV-)[0-9][0-9]"
      negate => "true"
      what => "previous"
    }
  }
}

filter {
  # Parse the csv values, defining fields as integers and floats
  csv {
    columns => ["InvoiceItemNumber","Date","StoreNumber","StoreName","Address","City","ZipCode","StoreLocation","CountyNumber","County","Category","CategoryName","VendorNumber","VendorName","ItemNumber","ItemDescription","Pack","BottleVolumeml","StateBottleCost","StateBottleRetail","BottlesSold","SaleDollars","VolumeSoldLiters","VolumeSoldGallons"]
    convert => {
      "StoreNumber" => "integer"
      "ItemNumber" => "integer"
      "Category" => "integer"
      "CountyNumber" => "integer"
      "VendorNumber" => "integer"
      "Pack" => "integer"
      "SaleDollars" => "float"
      "StateBottleCost" => "float"
      "StateBottleRetail" => "float"
      "BottleVolumeml" => "float"
      "BottlesSold" => "float"
      "VolumeSoldLiters" => "float"
      "VolumeSoldGallons" => "float"
    }
    remove_field => ["message"]
  }
  # Take the linebreaks out of the location, converting them to spaces, and
  # lowercase the city and county, since their case varies in the source file
  mutate {
    gsub => [ "StoreLocation", "\n", " " ]
    lowercase => [ "County", "City" ]
  }
}

output {
  stdout {
    codec => rubydebug
  }
}

Rerun Logstash and check the output: the letters of County and City are now all lowercase, and StoreLocation no longer contains any \n characters.
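On our example record, the affected fields now look like this (abridged):

{
    "StoreLocation" => "1013 MAIN KEOKUK 52632 (40.39978, -91.387531)",
           "County" => "lee",
             "City" => "keokuk"
}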
Next, we want to extract the position information inside StoreLocation. We can see that it contains a coordinate (latitude and longitude). We can use the grok filter to match it:

logstash_csv.conf

input {
  # Read the csv file. Also use the multiline codec: everything that does not
  # start with S or INV- is part of the prior line, due to addresses having line breaks.
  file {
    start_position => "beginning"
    path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "^(S|INV-)[0-9][0-9]"
      negate => "true"
      what => "previous"
    }
  }
}

filter {
  # Parse the csv values, defining fields as integers and floats
  csv {
    columns => ["InvoiceItemNumber","Date","StoreNumber","StoreName","Address","City","ZipCode","StoreLocation","CountyNumber","County","Category","CategoryName","VendorNumber","VendorName","ItemNumber","ItemDescription","Pack","BottleVolumeml","StateBottleCost","StateBottleRetail","BottlesSold","SaleDollars","VolumeSoldLiters","VolumeSoldGallons"]
    convert => {
      "StoreNumber" => "integer"
      "ItemNumber" => "integer"
      "Category" => "integer"
      "CountyNumber" => "integer"
      "VendorNumber" => "integer"
      "Pack" => "integer"
      "SaleDollars" => "float"
      "StateBottleCost" => "float"
      "StateBottleRetail" => "float"
      "BottleVolumeml" => "float"
      "BottlesSold" => "float"
      "VolumeSoldLiters" => "float"
      "VolumeSoldGallons" => "float"
    }
    remove_field => ["message"]
  }
  # Take the linebreaks out of the location, converting them to spaces, and
  # lowercase the city and county, since their case varies in the source file
  mutate {
    gsub => [ "StoreLocation", "\n", " " ]
    lowercase => [ "County", "City" ]
  }
  # Get the lat/lon if there is (numbers,numbers) data in the location
  grok {
    match => { "StoreLocation" => "\((?<location>[-,.0-9 ]*)\)" }
  }
}

output {
  stdout {
    codec => rubydebug
  }
}

Here we match the content inside the parentheses () of StoreLocation and assign it to the field location; the allowed characters are the minus sign, the comma, the period, the digits 0-9, and the space. Rerun Logstash: we can see that location has been extracted from StoreLocation.
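On our example record, the grok match yields (abridged):

{
    "StoreLocation" => "1013 MAIN KEOKUK 52632 (40.39978, -91.387531)",
         "location" => "40.39978, -91.387531"
}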
Next, let's change the document's timestamp to the time contained in the document itself. At the moment, @timestamp is not the time from the document's Date field. We add a date filter (note that we use "yyyy" rather than "YYYY": the latter is the week-based year in Joda time and can produce wrong results around year boundaries):

logstash_csv.conf

input {
  # Read the csv file. Also use the multiline codec: everything that does not
  # start with S or INV- is part of the prior line, due to addresses having line breaks.
  file {
    start_position => "beginning"
    path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "^(S|INV-)[0-9][0-9]"
      negate => "true"
      what => "previous"
    }
  }
}

filter {
  # Parse the csv values, defining fields as integers and floats
  csv {
    columns => ["InvoiceItemNumber","Date","StoreNumber","StoreName","Address","City","ZipCode","StoreLocation","CountyNumber","County","Category","CategoryName","VendorNumber","VendorName","ItemNumber","ItemDescription","Pack","BottleVolumeml","StateBottleCost","StateBottleRetail","BottlesSold","SaleDollars","VolumeSoldLiters","VolumeSoldGallons"]
    convert => {
      "StoreNumber" => "integer"
      "ItemNumber" => "integer"
      "Category" => "integer"
      "CountyNumber" => "integer"
      "VendorNumber" => "integer"
      "Pack" => "integer"
      "SaleDollars" => "float"
      "StateBottleCost" => "float"
      "StateBottleRetail" => "float"
      "BottleVolumeml" => "float"
      "BottlesSold" => "float"
      "VolumeSoldLiters" => "float"
      "VolumeSoldGallons" => "float"
    }
    remove_field => ["message"]
  }
  # Take the linebreaks out of the location, converting them to spaces, and
  # lowercase the city and county, since their case varies in the source file
  mutate {
    gsub => [ "StoreLocation", "\n", " " ]
    lowercase => [ "County", "City" ]
  }
  # Get the lat/lon if there is (numbers,numbers) data in the location
  grok {
    match => { "StoreLocation" => "\((?<location>[-,.0-9 ]*)\)" }
  }
  # Match the date, at daily granularity, in the correct timezone
  date {
    match => [ "Date", "MM/dd/yyyy" ]
    timezone => "America/Chicago"
  }
}

output {
  stdout {
    codec => rubydebug
  }
}

Run Logstash again: clearly, @timestamp now comes from the time in the document.
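For the S29198800001 record (Date is 11/20/2015), the timestamp fields now look roughly like this (abridged; America/Chicago is UTC-6 in November, so local midnight becomes 06:00 UTC):

{
    "@timestamp" => 2015-11-20T06:00:00.000Z,
          "Date" => "11/20/2015"
}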
Next, we can add an output to Elasticsearch:

logstash_csv.conf

input {
  # Read the csv file. Also use the multiline codec: everything that does not
  # start with S or INV- is part of the prior line, due to addresses having line breaks.
  file {
    start_position => "beginning"
    path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "^(S|INV-)[0-9][0-9]"
      negate => "true"
      what => "previous"
    }
  }
}

filter {
  # Parse the csv values, defining fields as integers and floats
  csv {
    columns => ["InvoiceItemNumber","Date","StoreNumber","StoreName","Address","City","ZipCode","StoreLocation","CountyNumber","County","Category","CategoryName","VendorNumber","VendorName","ItemNumber","ItemDescription","Pack","BottleVolumeml","StateBottleCost","StateBottleRetail","BottlesSold","SaleDollars","VolumeSoldLiters","VolumeSoldGallons"]
    convert => {
      "StoreNumber" => "integer"
      "ItemNumber" => "integer"
      "Category" => "integer"
      "CountyNumber" => "integer"
      "VendorNumber" => "integer"
      "Pack" => "integer"
      "SaleDollars" => "float"
      "StateBottleCost" => "float"
      "StateBottleRetail" => "float"
      "BottleVolumeml" => "float"
      "BottlesSold" => "float"
      "VolumeSoldLiters" => "float"
      "VolumeSoldGallons" => "float"
    }
    remove_field => ["message"]
  }
  # Take the linebreaks out of the location, converting them to spaces, and
  # lowercase the city and county, since their case varies in the source file
  mutate {
    gsub => [ "StoreLocation", "\n", " " ]
    lowercase => [ "County", "City" ]
  }
  # Get the lat/lon if there is (numbers,numbers) data in the location
  grok {
    match => { "StoreLocation" => "\((?<location>[-,.0-9 ]*)\)" }
  }
  # Match the date, at daily granularity, in the correct timezone
  date {
    match => [ "Date", "MM/dd/yyyy" ]
    timezone => "America/Chicago"
  }
}

output {
  elasticsearch {
    hosts => ["https://your.cluster.here:9243"]
    index => "iowa-liquor"
    user => "elastic"
    password => "redacted"
    manage_template => false
  }
  # Output dots while we process
  stdout { codec => "dots" }
  # If we saw a date parse failure, dump it to the screen for review
  if "_dateparsefailure" in [tags] {
    stdout { codec => "rubydebug" }
  }
}
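Because manage_template is set to false, Logstash will not install any index template, so the mapping of the iowa-liquor index is left to us. If we want the extracted location field to be usable on a Kibana map, we can create the index with a geo_point mapping before ingesting. The following is a minimal, illustrative sketch (run in Kibana Dev Tools; the field list is not a complete mapping). The "lat, lon" string captured by our grok pattern is one of the string formats geo_point accepts; if your Elasticsearch version rejects the space after the comma, strip it with an extra mutate gsub:

PUT iowa-liquor
{
  "mappings": {
    "properties": {
      "location": { "type": "geo_point" },
      "Date": { "type": "date", "format": "MM/dd/yyyy" },
      "SaleDollars": { "type": "float" }
    }
  }
}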
