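Using singer tap-minio-csv
With tap-minio-csv we can take CSV files stored in S3 and write them to different systems through Singer targets. The tap works with any S3-compatible storage.
The following is an integration test with minio that imports CSV data from minio into PostgreSQL (pg).
Environment setup
- docker-compose file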
version: "3"
services:
  s3:
    image: minio/minio
    command: server /data
    ports:
      - "9000:9000"
    environment:
      - "MINIO_ACCESS_KEY=dalongapp"
      - "MINIO_SECRET_KEY=dalongapp"
    volumes:
      - "./data:/data"
  target:
    image: postgres:9.6.11
    ports:
      - "5432:5432"
    environment:
      - "POSTGRES_PASSWORD=dalong"
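- Create the bucket and upload the files
Create a bucket named demo in minio and upload both CSV files under the exports/ prefix, matching the bucket and search_prefix values used in the tap configuration.

- File format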
my_table.csv
id,username,userage2,classinfo
1,"dalong",11,"v1"
2,"rong",29,"v2"
3,"appdemo",30,"v3"
4,"tetst",30,"v4"
my_table2.csv
id2,username2,userage3,classinfo2
7,"dalong",11,"v1"
8,"rong",29,"v2"
9,"appdemo",30,"v3"
10,"tetst",30,"v4"
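The bucket creation and upload step can also be scripted. The following is a minimal sketch, assuming the third-party boto3 package is installed (pip install boto3) and that minio is reachable at http://localhost:9000 with the credentials from the docker-compose file. The function names write_samples and upload_samples and the sample-csv directory are illustrative only and not part of tap-minio-csv.

```python
import csv
from pathlib import Path

# Sample rows copied from the two files shown above.
SAMPLES = {
    "my_table.csv": [
        ["id", "username", "userage2", "classinfo"],
        [1, "dalong", 11, "v1"],
        [2, "rong", 29, "v2"],
        [3, "appdemo", 30, "v3"],
        [4, "tetst", 30, "v4"],
    ],
    "my_table2.csv": [
        ["id2", "username2", "userage3", "classinfo2"],
        [7, "dalong", 11, "v1"],
        [8, "rong", 29, "v2"],
        [9, "appdemo", 30, "v3"],
        [10, "tetst", 30, "v4"],
    ],
}


def write_samples(directory):
    """Write the sample CSV files into directory and return their paths."""
    out = Path(directory)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, rows in SAMPLES.items():
        path = out / name
        with path.open("w", newline="") as handle:
            csv.writer(handle).writerows(rows)
        paths.append(path)
    return paths


def upload_samples(paths, bucket="demo", prefix="exports",
                   endpoint_url="http://localhost:9000",
                   access_key="dalongapp", secret_key="dalongapp"):
    """Create the bucket if it is missing and upload each file under prefix/."""
    # boto3 is imported lazily so write_samples works without it installed.
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    try:
        s3.head_bucket(Bucket=bucket)
    except ClientError:
        s3.create_bucket(Bucket=bucket)
    for path in paths:
        s3.upload_file(str(path), bucket, f"{prefix}/{path.name}")


# Example (requires the minio container to be running):
# upload_samples(write_samples("sample-csv"))
```

The files land at exports/my_table.csv and exports/my_table2.csv in the demo bucket, which is where the tap configuration in the next section looks for them.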
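Tap environment setup
- Initialize the venv
mkdir s3-tap
python3 -m venv s3-tap/venv
- Activate the virtual environment
source s3-tap/venv/bin/activate
- Install the tap
pip install tap-minio-csv
- Add the tap configuration (saved as s3-tap/tap-config.json)
It mainly covers how files are located and the S3 connection settings; for details see https://github.com/rongfengliang/tap-minio-csv/blob/master/README.md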
{
"start_date": "2017-11-02T00:00:00Z",
"bucket": "demo",
"aws_access_key_id":"dalongapp",
"aws_secret_access_key":"dalongapp",
"endpoint_url":"http://localhost:9000",
"tables": "[{\"search_prefix\":\"exports\",\"search_pattern\":\"my_table.csv\",\"table_name\":\"my_table\",\"key_properties\":\"id\",\"delimiter\":\",\"},{\"search_prefix\":\"exports\",\"search_pattern\":\"my_table2.csv\",\"table_name\":\"my_table2\",\"key_properties\":\"id2\",\"delimiter\":\",\"}]"
}
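- Brief explanation
The tables setting defines how files are located and how each CSV is parsed: search_prefix and search_pattern select the objects in the bucket, table_name names the resulting stream, key_properties gives the primary key column, and delimiter sets the field separator. Note that tables is itself a JSON-encoded string, which is why its inner quotes are escaped.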
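Because tables is a JSON string nested inside JSON, escaping it by hand is error-prone. The following is a minimal sketch that generates the same configuration with json.dumps; the helper names build_tap_config and write_tap_config are illustrative, and the values are copied from the configuration above.

```python
import json

# One entry per CSV "table"; values mirror the configuration shown above.
TABLES = [
    {
        "search_prefix": "exports",
        "search_pattern": "my_table.csv",
        "table_name": "my_table",
        "key_properties": "id",
        "delimiter": ",",
    },
    {
        "search_prefix": "exports",
        "search_pattern": "my_table2.csv",
        "table_name": "my_table2",
        "key_properties": "id2",
        "delimiter": ",",
    },
]


def build_tap_config(tables):
    """Return the tap config dict with tables encoded as a JSON string."""
    return {
        "start_date": "2017-11-02T00:00:00Z",
        "bucket": "demo",
        "aws_access_key_id": "dalongapp",
        "aws_secret_access_key": "dalongapp",
        "endpoint_url": "http://localhost:9000",
        # As in the config above, tables is a string, not a nested list.
        "tables": json.dumps(tables),
    }


def write_tap_config(path, tables=TABLES):
    """Write the config file; json.dump handles the inner escaping."""
    with open(path, "w") as handle:
        json.dump(build_tap_config(tables), handle, indent=2)


# Example: write_tap_config("s3-tap/tap-config.json")
```

Adding another CSV file then only means appending a dictionary to TABLES and regenerating the file.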
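Target configuration
- Initialize the venv
mkdir pg-target
python3 -m venv pg-target/venv
- Activate the virtual environment
source pg-target/venv/bin/activate
- Install the target
pip install target_postgres
- Configure the target (saved as pg-target/target.json)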
{
"host": "localhost",
"port": 5432,
"dbname": "postgres",
"user": "postgres",
"password": "dalong",
"schema": "copy"
}
Putting it together
- Schema discovery
s3-tap/venv/bin/tap-minio-csv -c s3-tap/tap-config.json -d > catalog.json
Output:
WARNING I have direct access to the bucket without assuming the configured role.
INFO Starting discover
INFO Sampling records to determine table schema.
INFO Sampling files (max files: 5)
INFO Checking bucket "demo" for keys matching "my_table.csv"
INFO exports/my_table.csv
INFO 2019-08-22 05:27:52+00:00
INFO Will download key "exports/my_table.csv" as it was last modified 2019-08-22 05:27:52+00:00
INFO Sampling exports/my_table.csv (max records: 1000, sample rate: 5)
INFO Sampled 1 rows from exports/my_table.csv
INFO exports/my_table2.csv
INFO 2019-08-22 09:01:45+00:00
INFO Found 2 files.
INFO Sampling records to determine table schema.
INFO Sampling files (max files: 5)
INFO Checking bucket "demo" for keys matching "my_table2.csv"
INFO exports/my_table.csv
INFO 2019-08-22 05:27:52+00:00
INFO exports/my_table2.csv
INFO 2019-08-22 09:01:45+00:00
INFO Will download key "exports/my_table2.csv" as it was last modified 2019-08-22 09:01:45+00:00
INFO Sampling exports/my_table2.csv (max records: 1000, sample rate: 5)
INFO Sampled 1 rows from exports/my_table2.csv
INFO Found 2 files.
INFO Finished discover
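- Enable the streams for sync
This mainly means adding "selected": true to the top-level metadata entry (the one with an empty breadcrumb) of each stream in catalog.json. The line marked with + is the addition: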
"metadata": [
{
"breadcrumb": [],
"metadata": {
"table-key-properties": [
"id"
],
+ "selected":true
}
}
The complete catalog is as follows:
{
"streams": [
{
"stream": "my_table",
"tap_stream_id": "my_table",
"schema": {
"type": "object",
"properties": {
"id": {
"type": [
"null",
"integer",
"string"
]
},
"username": {
"type": [
"null",
"string"
]
},
"userage2": {
"type": [
"null",
"integer",
"string"
]
},
"classinfo": {
"type": [
"null",
"string"
]
},
"_sdc_source_bucket": {
"type": "string"
},
"_sdc_source_file": {
"type": "string"
},
"_sdc_source_lineno": {
"type": "integer"
},
"_sdc_extra": {
"type": "array",
"items": {
"type": "string"
}
}
}
},
"metadata": [
{
"breadcrumb": [],
"metadata": {
"table-key-properties": [
"id"
],
"selected":true
}
},
{
"breadcrumb": [
"properties",
"id"
],
"metadata": {
"inclusion": "automatic"
}
},
{
"breadcrumb": [
"properties",
"username"
],
"metadata": {
"inclusion": "available"
}
},
{
"breadcrumb": [
"properties",
"userage2"
],
"metadata": {
"inclusion": "available"
}
},
{
"breadcrumb": [
"properties",
"classinfo"
],
"metadata": {
"inclusion": "available"
}
},
{
"breadcrumb": [
"properties",
"_sdc_source_bucket"
],
"metadata": {
"inclusion": "available"
}
},
{
"breadcrumb": [
"properties",
"_sdc_source_file"
],
"metadata": {
"inclusion": "available"
}
},
{
"breadcrumb": [
"properties",
"_sdc_source_lineno"
],
"metadata": {
"inclusion": "available"
}
},
{
"breadcrumb": [
"properties",
"_sdc_extra"
],
"metadata": {
"inclusion": "available"
}
}
]
},
{
"stream": "my_table2",
"tap_stream_id": "my_table2",
"schema": {
"type": "object",
"properties": {
"id2": {
"type": [
"null",
"integer",
"string"
]
},
"username2": {
"type": [
"null",
"string"
]
},
"userage3": {
"type": [
"null",
"integer",
"string"
]
},
"classinfo2": {
"type": [
"null",
"string"
]
},
"_sdc_source_bucket": {
"type": "string"
},
"_sdc_source_file": {
"type": "string"
},
"_sdc_source_lineno": {
"type": "integer"
},
"_sdc_extra": {
"type": "array",
"items": {
"type": "string"
}
}
}
},
"metadata": [
{
"breadcrumb": [],
"metadata": {
"table-key-properties": [
"id2"
],
"selected":true
}
},
{
"breadcrumb": [
"properties",
"id2"
],
"metadata": {
"inclusion": "automatic"
}
},
{
"breadcrumb": [
"properties",
"username2"
],
"metadata": {
"inclusion": "available"
}
},
{
"breadcrumb": [
"properties",
"userage3"
],
"metadata": {
"inclusion": "available"
}
},
{
"breadcrumb": [
"properties",
"classinfo2"
],
"metadata": {
"inclusion": "available"
}
},
{
"breadcrumb": [
"properties",
"_sdc_source_bucket"
],
"metadata": {
"inclusion": "available"
}
},
{
"breadcrumb": [
"properties",
"_sdc_source_file"
],
"metadata": {
"inclusion": "available"
}
},
{
"breadcrumb": [
"properties",
"_sdc_source_lineno"
],
"metadata": {
"inclusion": "available"
}
},
{
"breadcrumb": [
"properties",
"_sdc_extra"
],
"metadata": {
"inclusion": "available"
}
}
]
}
]
}
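Editing catalog.json by hand gets tedious as the number of streams grows. The following is a minimal sketch that sets selected on every stream's top-level metadata entry, assuming the catalog layout shown above; the function names select_all_streams and select_in_file are illustrative.

```python
import json


def select_all_streams(catalog):
    """Set selected=True on the stream-level metadata entry of every stream."""
    for stream in catalog.get("streams", []):
        for entry in stream.get("metadata", []):
            # The stream-level entry is the one with an empty breadcrumb.
            if entry.get("breadcrumb") == []:
                entry.setdefault("metadata", {})["selected"] = True
    return catalog


def select_in_file(path="catalog.json"):
    """Load a catalog file, select all streams, and write it back in place."""
    with open(path) as handle:
        catalog = json.load(handle)
    select_all_streams(catalog)
    with open(path, "w") as handle:
        json.dump(catalog, handle, indent=2)


# Example: select_in_file("catalog.json")
```

Column-level entries (those whose breadcrumb starts with "properties") are left untouched, so their inclusion values stay as discovered.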
- Run the sync
s3-tap/venv/bin/tap-minio-csv -c s3-tap/tap-config.json -p catalog.json | pg-target/venv/bin/target-postgres -c pg-target/target.json
Output:
WARNING I have direct access to the bucket without assuming the configured role.
INFO Starting sync.
INFO my_table: Starting sync
INFO Syncing table "my_table".
INFO Getting files modified since 2017-11-02 00:00:00+00:00.
INFO Checking bucket "demo" for keys matching "my_table.csv"
INFO Table 'my_table' does not exist. Creating... CREATE TABLE copy.my_table ("_sdc_extra" jsonb, "_sdc_source_bucket" character varying, "_sdc_source_file" character varying, "_sdc_source_lineno" bigint, "classinfo" character varying, "id" character varying, "userage2" character varying, "username" character varying, PRIMARY KEY ("id"))
INFO exports/my_table.csv
INFO 2019-08-22 05:27:52+00:00
INFO Will download key "exports/my_table.csv" as it was last modified 2019-08-22 05:27:52+00:00
INFO exports/my_table2.csv
INFO 2019-08-22 09:01:45+00:00
INFO Found 2 files.
INFO Syncing file "exports/my_table.csv".
INFO Wrote 4 records for table "my_table".
INFO my_table: Completed sync (4 rows)
INFO my_table2: Starting sync
INFO Syncing table "my_table2".
INFO Getting files modified since 2017-11-02 00:00:00+00:00.
INFO Checking bucket "demo" for keys matching "my_table2.csv"
INFO Table 'my_table2' does not exist. Creating... CREATE TABLE copy.my_table2 ("_sdc_extra" jsonb, "_sdc_source_bucket" character varying, "_sdc_source_file" character varying, "_sdc_source_lineno" bigint, "classinfo2" character varying, "id2" character varying, "userage3" character varying, "username2" character varying, PRIMARY KEY ("id2"))
INFO exports/my_table.csv
INFO 2019-08-22 05:27:52+00:00
INFO exports/my_table2.csv
INFO 2019-08-22 09:01:45+00:00
INFO Will download key "exports/my_table2.csv" as it was last modified 2019-08-22 09:01:45+00:00
INFO Found 2 files.
INFO Syncing file "exports/my_table2.csv".
INFO Wrote 4 records for table "my_table2".
INFO my_table2: Completed sync (4 rows)
INFO Done syncing.
INFO Loading 4 rows into 'my_table'
INFO COPY my_table_temp ("_sdc_extra", "_sdc_source_bucket", "_sdc_source_file", "_sdc_source_lineno", "classinfo", "id", "userage2", "username") FROM STDIN WITH (FORMAT CSV, ESCAPE '\')
INFO UPDATE 0
INFO INSERT 0 4
INFO Loading 4 rows into 'my_table2'
INFO COPY my_table2_temp ("_sdc_extra", "_sdc_source_bucket", "_sdc_source_file", "_sdc_source_lineno", "classinfo2", "id2", "userage3", "username2") FROM STDIN WITH (FORMAT CSV, ESCAPE '\')
INFO UPDATE 0
INFO INSERT 0 4
{"bookmarks": {"my_table": {"modified_since": "2019-08-22T05:27:52+00:00"}, "my_table2": {"modified_since": "2019-08-22T09:01:45+00:00"}}}
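The last line of the output is the state emitted at the end of the run, holding a modified_since bookmark per table. The following is a minimal sketch for capturing it so that it can be kept between runs. The helper names last_state and save_state and the state.json file name are illustrative; passing the saved file back to the tap with a --state option follows general Singer convention and is an assumption to check against the tap's README.

```python
import json


def last_state(lines):
    """Return the last line that parses as a JSON object, or None."""
    state = None
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip log lines such as "INFO ..."
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            state = parsed
    return state


def save_state(lines, path="state.json"):
    """Write the final state found in lines to path and return it."""
    state = last_state(lines)
    if state is not None:
        with open(path, "w") as handle:
            json.dump(state, handle)
    return state


# Example: pipe the target's stdout into a script that calls
# save_state(sys.stdin) after importing sys.
```

With the output shown above, the saved bookmarks would record 2019-08-22T05:27:52+00:00 for my_table and 2019-08-22T09:01:45+00:00 for my_table2.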
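Data in pg
After the sync, the copy schema contains the tables my_table and my_table2, each loaded with 4 rows.
Notes
The above is a simple walkthrough; for detailed usage see https://github.com/rongfengliang/tap-minio-csv/blob/master/README.md
References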
https://github.com/rongfengliang/tap-minio-csv/blob/master/README.md
https://github.com/rongfengliang/tap-minio-csv-demo