Singer: modifying tap-s3-csv to support MinIO connections

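The Singer team publishes an official tap, tap-s3-csv, but it is not very convenient for anyone who does not use AWS. I therefore made a small change to the source
so that it can process CSV files on any S3-compatible store, and published the result to the official pip repository so that others can use it.
Below is a brief description of the code changes, followed by how to publish the pip package.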

Notes on the changes

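The changes are mostly about the S3 connection. tap-s3-csv uses boto3, so what has to change is the part where boto3 connects to S3:
it needs to be given aws_access_key_id, aws_secret_access_key, and endpoint_url.
A custom S3 connection takes the following form: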

s3_client = boto3.session.Session().client(
    service_name='s3',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    endpoint_url=endpoint_url,
)
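
As a rough sketch of how these three settings might be collected from a tap config before they are handed to boto3, the helper below builds the keyword arguments in one place. The helper name `s3_client_kwargs` and the sample values are hypothetical and not part of tap-s3-csv; the boto3 call itself appears only in a comment so the snippet runs without boto3 installed.

```python
def s3_client_kwargs(config):
    """Collect the boto3 client arguments from a tap config dict.

    Hypothetical helper for illustration; not part of tap-s3-csv.
    """
    return {
        'service_name': 's3',
        'aws_access_key_id': config['aws_access_key_id'],
        'aws_secret_access_key': config['aws_secret_access_key'],
        # For MinIO this is the server address rather than an AWS endpoint.
        'endpoint_url': config['endpoint_url'],
    }


# Placeholder values; replace them with your own MinIO settings.
sample_config = {
    'aws_access_key_id': 'my-access-key',
    'aws_secret_access_key': 'my-secret-key',
    'endpoint_url': 'http://localhost:9000',
}

kwargs = s3_client_kwargs(sample_config)
# With boto3 installed, the client would then be created as:
#   s3_client = boto3.session.Session().client(**kwargs)
```

Keeping the arguments in one dict means the same settings can be reused wherever a client or resource is created.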

Parts that need to change

s3.py
get_input_files_for_table: mainly passes the new parameters through.
The modified version:

def get_input_files_for_table(config, table_spec, modified_since=None):
    bucket = config['bucket']
    # Connection settings for the S3-compatible store, read from the tap config
    aws_access_key_id = config['aws_access_key_id']
    aws_secret_access_key = config['aws_secret_access_key']
    endpoint_url = config['endpoint_url']

    to_return = []
    pattern = table_spec['search_pattern']
    try:
        matcher = re.compile(pattern)
    except re.error as e:
        raise ValueError(
            ("search_pattern for table `{}` is not a valid regular "
             "expression. See "
             "https://docs.python.org/3.5/library/re.html#regular-expression-syntax").format(table_spec['table_name']),
            pattern) from e

    LOGGER.info(
        'Checking bucket "%s" for keys matching "%s"', bucket, pattern)

    matched_files_count = 0
    unmatched_files_count = 0
    max_files_before_log = 30000

    # The connection settings are passed on to list_files_in_bucket
    for s3_object in list_files_in_bucket(bucket, aws_access_key_id, aws_secret_access_key,
                                          endpoint_url, table_spec.get('search_prefix')):
        key = s3_object['Key']
        last_modified = s3_object['LastModified']
        LOGGER.info(key)
        LOGGER.info(last_modified)

        if s3_object['Size'] == 0:
            LOGGER.info('Skipping matched file "%s" as it is empty', key)
            unmatched_files_count += 1
            continue

        if matcher.search(key):
            matched_files_count += 1
            if modified_since is None or modified_since < last_modified:
                LOGGER.info('Will download key "%s" as it was last modified %s',
                            key,
                            last_modified)
                yield {'key': key, 'last_modified': last_modified}
        else:
            unmatched_files_count += 1

        if (unmatched_files_count + matched_files_count) % max_files_before_log == 0:
            # Are we skipping greater than 50% of the files?
            if 0.5 < (unmatched_files_count / (matched_files_count + unmatched_files_count)):
                # LOGGER.warn is a deprecated alias of LOGGER.warning
                LOGGER.warning(("Found %s matching files and %s non-matching files. "
                                "You should consider adding a `search_prefix` to the config "
                                "or removing non-matching files from the bucket."),
                               matched_files_count, unmatched_files_count)
            else:
                LOGGER.info("Found %s matching files and %s non-matching files",
                            matched_files_count, unmatched_files_count)

    if 0 == matched_files_count:
        raise Exception("No files found matching pattern {}".format(pattern))
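
To make the selection rule in this function easier to see, here is a minimal sketch that applies the same three checks (skip empty files, match the key against the pattern, keep only files modified after `modified_since`) to an in-memory listing. The function name `select_keys` and the sample listing are made up for illustration; the logging and the counters of the real function are left out.

```python
import re
from datetime import datetime, timezone


def select_keys(s3_objects, pattern, modified_since=None):
    """Yield the keys that are non-empty, match the pattern, and are new enough."""
    matcher = re.compile(pattern)
    for s3_object in s3_objects:
        if s3_object['Size'] == 0:
            continue  # empty files are skipped
        if not matcher.search(s3_object['Key']):
            continue  # key does not match search_pattern
        if modified_since is None or modified_since < s3_object['LastModified']:
            yield s3_object['Key']


# Made-up listing entries in the shape used by the function above.
listing = [
    {'Key': 'exports/users.csv', 'Size': 120,
     'LastModified': datetime(2019, 3, 1, tzinfo=timezone.utc)},
    {'Key': 'exports/empty.csv', 'Size': 0,
     'LastModified': datetime(2019, 3, 2, tzinfo=timezone.utc)},
    {'Key': 'exports/readme.txt', 'Size': 10,
     'LastModified': datetime(2019, 3, 2, tzinfo=timezone.utc)},
    {'Key': 'exports/old.csv', 'Size': 50,
     'LastModified': datetime(2018, 1, 1, tzinfo=timezone.utc)},
]

selected = list(select_keys(listing, r'\.csv$',
                            modified_since=datetime(2019, 1, 1, tzinfo=timezone.utc)))
print(selected)
```

Only `exports/users.csv` passes all three checks here: the other entries are empty, do not match the pattern, or are older than the cutoff.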

list_files_in_bucket: the core change, which sets up the S3 connection.
The modified version:

@retry_pattern()
def list_files_in_bucket(bucket, aws_access_key_id, aws_secret_access_key, endpoint_url, search_prefix=None):
    # Connect with explicit credentials and a custom endpoint (e.g. MinIO)
    s3_client = boto3.session.Session().client(
        service_name='s3',
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
        endpoint_url=endpoint_url,
    )

    s3_object_count = 0
    max_results = 1000
    args = {
        'Bucket': bucket,
        'MaxKeys': max_results,
    }

    if search_prefix is not None:
        args['Prefix'] = search_prefix

    paginator = s3_client.get_paginator('list_objects_v2')
    pages = 0
    for page in paginator.paginate(**args):
        pages += 1
        LOGGER.debug("On page %s", pages)
        # An empty listing has no 'Contents' key, so fall back to an empty list
        contents = page.get('Contents', [])
        s3_object_count += len(contents)
        yield from contents

    if 0 < s3_object_count:
        LOGGER.info("Found %s files.", s3_object_count)
    else:
        LOGGER.warning('Found no files for bucket "%s" that match prefix "%s"', bucket, search_prefix)
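
One detail of the page loop deserves care: when a bucket or prefix holds no objects, a `list_objects_v2` response generally carries no `Contents` key, so indexing `page['Contents']` raises `KeyError` before the "found no files" warning can be logged. The sketch below shows the page-handling pattern on its own, with made-up pages standing in for `paginator.paginate(**args)`; the name `iter_objects` is hypothetical.

```python
def iter_objects(pages):
    """Yield every object from list_objects_v2-style pages."""
    for page in pages:
        # A page for an empty bucket or prefix may carry no 'Contents' key.
        yield from page.get('Contents', [])


# Made-up pages standing in for paginator.paginate(**args).
fake_pages = [
    {'KeyCount': 2, 'Contents': [{'Key': 'a.csv'}, {'Key': 'b.csv'}]},
    {'KeyCount': 0},
]

keys = [obj['Key'] for obj in iter_objects(fake_pages)]
print(keys)
```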

get_file_handle: fetches the contents of an S3 object.

@retry_pattern()
def get_file_handle(config, s3_path):
    bucket = config['bucket']
    aws_access_key_id = config['aws_access_key_id']
    aws_secret_access_key = config['aws_secret_access_key']
    endpoint_url = config['endpoint_url']

    # Note: despite its name, s3_client is a boto3 resource here, not a client
    s3_client = boto3.resource(
        service_name="s3",
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
        endpoint_url=endpoint_url)

    s3_bucket = s3_client.Bucket(bucket)
    s3_object = s3_bucket.Object(s3_path)
    return s3_object.get()['Body']
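
The value returned by get_file_handle is a file-like body that can be read as bytes. As an illustration only (this is not how the tap itself parses files), the sketch below reads such a body and parses it as CSV with the standard library. `io.BytesIO` stands in for the real S3 body, and both UTF-8 encoding and a header row are assumptions; the name `read_csv_rows` is hypothetical.

```python
import csv
import io


def read_csv_rows(body):
    """Decode a file-like body and parse it as CSV with a header row."""
    text = body.read().decode('utf-8')  # assumes UTF-8 content
    return list(csv.DictReader(io.StringIO(text)))


# io.BytesIO stands in for the body returned by get_file_handle.
fake_body = io.BytesIO(b'id,name\n1,alice\n2,bob\n')
rows = read_csv_rows(fake_body)
print(rows[0]['name'])
```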

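__init__.py
This file handles the tap's commands, such as schema discovery, running a sync, and validating parameters.
The required-parameter check is changed to the keys our configuration needs: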

REQUIRED_CONFIG_KEYS = ["start_date", "bucket", "aws_access_key_id", "aws_secret_access_key", "endpoint_url"]
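
As a small sketch of what this list implies for the config file, the snippet below parses a sample JSON config and reports which required keys are absent. All values are placeholders, the table settings read by validate_table_config are omitted, and `missing_keys` is a hypothetical helper (the tap itself relies on singer.utils.parse_args for this check).

```python
import json

REQUIRED_CONFIG_KEYS = ["start_date", "bucket", "aws_access_key_id",
                        "aws_secret_access_key", "endpoint_url"]


def missing_keys(config):
    """Return the required keys that are absent from the config."""
    return [key for key in REQUIRED_CONFIG_KEYS if key not in config]


# Placeholder values only; endpoint_url is left out on purpose.
config_text = """
{
  "start_date": "2019-01-01T00:00:00Z",
  "bucket": "my-bucket",
  "aws_access_key_id": "my-access-key",
  "aws_secret_access_key": "my-secret-key"
}
"""

config = json.loads(config_text)
missing = missing_keys(config)
print(missing)
```

Adding an `endpoint_url` entry that points at the MinIO server makes the list of missing keys empty.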

The main function:

@singer.utils.handle_top_exception(LOGGER)
def main():
    args = singer.utils.parse_args(REQUIRED_CONFIG_KEYS)
    config = args.config

    bucket = config['bucket']
    aws_access_key_id = config['aws_access_key_id']
    aws_secret_access_key = config['aws_secret_access_key']
    endpoint_url = config['endpoint_url']

    config['tables'] = validate_table_config(config)

    # Connectivity check: try to fetch the first listing entry, then stop
    try:
        for page in s3.list_files_in_bucket(bucket, aws_access_key_id, aws_secret_access_key, endpoint_url):
            break
        LOGGER.warning("I have direct access to the bucket without assuming the configured role.")
    except Exception:
        # Catch Exception rather than using a bare except, so Ctrl-C still works
        LOGGER.error("can't connect to s3 storage")

    if args.discover:
        do_discover(args.config)
    elif args.properties:
        do_sync(config, args.properties, args.state)

pip package conventions
To avoid a clash with the official package, the package is given a new name.
setup.py:

#!/usr/bin/env python

from setuptools import setup

setup(name='tap-minio-csv',
      version='1.2.2',
      description='Singer.io tap for extracting CSV files from minio',
      author='rongfengliang',
      url='https://github.com/rongfengliang/tap-minio-csv',
      classifiers=['Programming Language :: Python :: 3 :: Only'],
      py_modules=['tap_minio_csv'],
      install_requires=[
          'backoff==1.3.2',
          'boto3==1.9.57',
          'singer-encodings==0.0.3',
          'singer-python==5.1.5',
          'voluptuous==0.10.5'
      ],
      extras_require={
          'dev': [
              'ipdb==0.11'
          ]
      },
      entry_points='''
          [console_scripts]
          tap-minio-csv=tap_minio_csv:main
      ''',
      packages=['tap_minio_csv'])

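Project package name

The package directory is named tap_minio_csv, matching the packages entry in setup.py.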

Publishing the pip package

Install the tools

python3 -m pip install --user --upgrade setuptools wheel twine

Build the distribution files

python3 setup.py sdist bdist_wheel

Upload
You need to register an account first. Then run the following command and enter your account details when prompted.

twine upload dist/*

pip package

The package is published under the name tap-minio-csv, as set in setup.py.

Notes

The above is only a brief overview. For the full code, see https://github.com/rongfengliang/tap-minio-csv

References

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/session.html
https://github.com/singer-io/tap-s3-csv
https://github.com/rongfengliang/tap-minio-csv
