KingabseES例程-事实数据与规则的匹配校验

背景

使用规则，对数据进行校验，比如电商的用户购物订单，是否合法。这就需要订单的多维度，如用户、地区、物流、支付手段、供应商等各类信息，进行动态组合条件的关联匹配。

通常的匹配模式

针对某个订单

一种是根据某个订单，与全部规则进行匹配。

select a.*

from rul04_val a,

     rul04 b

where 1 = 1

  and (a.c1 = b.c1 or b.c1 is null)

  and (a.c2 = b.c2 or b.c2 is null)

  and (a.c3 = b.c3 or b.c3 is null)

  and (a.c4 = b.c4 or b.c4 is null)

  and a.id = 1000;

Nested Loop  (cost=0.43..151.15 rows=12 width=26)

  Join Filter: (((a.c1 = b.c1) OR (b.c1 IS NULL)) AND ((a.c2 = b.c2) OR (b.c2 IS NULL)) AND ((a.c3 = b.c3) OR (b.c3 IS NULL)) AND ((a.c4 = b.c4) OR (b.c4 IS NULL)))

  ->  Index Scan using rul04_val_pkey on rul04_val a  (cost=0.43..8.45 rows=1 width=26)

        Index Cond: (id = 1000)

  ->  Seq Scan on rul04 b  (cost=0.00..62.90 rows=3990 width=16)

针对某个规则

一种是，根据某个规则，校验所有订单，

select a.*

from rul04_val a,

     rul04 b

where 1 = 1

  and (a.c1 = b.c1 or b.c1 is null)

  and (a.c2 = b.c2 or b.c2 is null)

  and (a.c3 = b.c3 or b.c3 is null)

  and (a.c4 = b.c4 or b.c4 is null)

  and b.id = 1000;

Nested Loop  (cost=0.28..374083.30 rows=30134 width=26)

  Join Filter: (((a.c1 = b.c1) OR (b.c1 IS NULL)) AND ((a.c2 = b.c2) OR (b.c2 IS NULL)) AND ((a.c3 = b.c3) OR (b.c3 IS NULL)) AND ((a.c4 = b.c4) OR (b.c4 IS NULL)))

  ->  Index Scan using rul04_pkey on rul04 b  (cost=0.28..8.30 rows=1 width=16)

        Index Cond: (id = 1000)

  ->  Seq Scan on rul04_val a  (cost=0.00..174075.00 rows=10000000 width=26)

全部订单和全部规则

如果需要对全部或大量的订单和规则，进行匹配运算，这将是海量的计算。

select a.*

from rul04_val a,

     rul04 b

where 1 = 1

  and (a.c1 = b.c1 or b.c1 is null)

  and (a.c2 = b.c2 or b.c2 is null)

  and (a.c3 = b.c3 or b.c3 is null)

  and (a.c4 = b.c4 or b.c4 is null) ;

                                                                                 QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 Gather  (cost=1000.00..368873734.60 rows=120236596 width=46) (actual time=5.172..879507.928 rows=386411 loops=1)

   Workers Planned: 4

   Workers Launched: 4

   ->  Nested Loop  (cost=0.00..356849075.00 rows=30059149 width=46) (actual time=12.698..879456.332 rows=77282 loops=5)

         Join Filter: (((a.c1 = b.c1) OR (b.c1 IS NULL)) AND ((a.c2 = b.c2) OR (b.c2 IS NULL)) AND ((a.c3 = b.c3) OR (b.c3 IS NULL)) AND ((a.c4 = b.c4) OR (b.c4 IS NULL)))

         Rows Removed by Join Filter: 7979922718

         ->  Parallel Seq Scan on rul04_val a  (cost=0.00..99075.00 rows=2500000 width=26) (actual time=0.018..151.852 rows=2000000 loops=5)

         ->  Seq Scan on rul04 b  (cost=0.00..62.90 rows=3990 width=20) (actual time=0.001..0.162 rows=3990 loops=10000000)

 Planning Time: 0.289 ms

 Execution Time: 879521.346 ms

分解匹配规则

普通模式的查询，不能使用事实表的索引。这是因为关联条件中，AND运算包含OR运算，这就使得关联运算是volatile属性，这里需要进行SQL调优。

此处的SQL，是多列存在匹配需求，或者不需匹配的全集。这里可以利用对bit数据的类比，将匹配和全集的表达式，理解成1与0，反之亦可。从而得知，需要分解成2^4=16个，不含OR运算的查询，然后将16个查询结果合并。

--16个查询，组合。每个查询，都能使用较适合的索引。

select *

from rul04_val a, rul04 b

where b.c1 is null and b.c2 is null and b.c3 is null and b.c4 is null

union all

select *

from rul04_val a, rul04 b

where b.c1 is null and b.c2 is null and b.c3 is null and a.c4 = b.c4

union all

...

union all

from rul04_val a, rul04 b

where a.c1 = b.c1 and a.c2 = b.c2 and a.c3 = b.c3 and a.c4 = b.c4

;

Append  (cost=5.24..458866.31 rows=10410804 width=26) (actual time=0.432..1059.117 rows=386411 loops=1)

  ->  Result  (cost=5.24..174080.24 rows=10000000 width=26) (actual time=0.233..0.234 rows=0 loops=1)

        One-Time Filter: $0

        InitPlan 1 (returns $0)

          ->  Seq Scan on rul04 b_15  (cost=0.00..62.90 rows=12 width=0) (actual time=0.230..0.230 rows=0 loops=1)

                Filter: ((c1 IS NULL) AND (c2 IS NULL) AND (c3 IS NULL) AND (c4 IS NULL))

                Rows Removed by Filter: 3990

        ->  Seq Scan on rul04_val a  (cost=5.24..174080.24 rows=10000000 width=26) (never executed)

...

  ->  Nested Loop  (cost=112.08..5417.64 rows=1 width=26) (actual time=9.831..9.832 rows=0 loops=1)

        ->  HashAggregate  (cost=102.80..106.79 rows=399 width=16) (actual time=0.896..1.413 rows=3990 loops=1)

              Group Key: b_14.c1, b_14.c2, b_14.c3, b_14.c4

              ->  Seq Scan on rul04 b_14  (cost=0.00..62.90 rows=3990 width=16) (actual time=0.005..0.158 rows=3990 loops=1)

        ->  Bitmap Heap Scan on rul04_val a_15  (cost=9.28..13.30 rows=1 width=26) (actual time=0.002..0.002 rows=0 loops=3990)

              Recheck Cond: ((c3 = b_14.c3) AND (c4 = b_14.c4) AND (c1 = b_14.c1) AND (c2 = b_14.c2))

              ->  BitmapAnd  (cost=9.28..9.28 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=3990)

                    ->  Bitmap Index Scan on idx_rul04_val_c34  (cost=0.00..4.51 rows=10 width=0) (actual time=0.001..0.001 rows=5 loops=3990)

                          Index Cond: ((c3 = b_14.c3) AND (c4 = b_14.c4))

                    ->  Bitmap Index Scan on idx_rul04_val_c12  (cost=0.00..4.51 rows=10 width=0) (actual time=0.001..0.001 rows=4 loops=2240)

                          Index Cond: ((c1 = b_14.c1) AND (c2 = b_14.c2))

Planning Time: 1.763 ms

Execution Time: 1068.627 ms

由于需要关联的列数是不确定，而且优化后的查询语句，是大量的局部查询语句组合而成，这给开发人员带来巨大的工作量。我们需要一种，通用语句解决这个复杂且不确定的需求。

数据准备

规则表与事实表关联的列，建议列名相同，或使用相同的别名。

规则表

--规则表(4个条件列)

--插入多种（2^4）组合的条件数据

drop table rul04;

create table rul04

(

    id int primary key,

    c1 int,

    c2 int,

    c3 int,

    c4 int

);

--基于条件列，创建单列索引，这些索引未必使用。

create index idx_rul04_c1 on rul04 (c1);

create index idx_rul04_c2 on rul04 (c2);

create index idx_rul04_c3 on rul04 (c3);

create index idx_rul04_c4 on rul04 (c4);

insert into rul04

select row_number() over (), c1, c2, c3, c4

from (

         select null c1, null c2, null c3, null c4

         from generate_series(1, 1000) id

         union

         select null, null, null, (random() * 9)::int

         from generate_series(1, 1000)

         union

         select null, null, (random() * 9)::int, null

         from generate_series(1, 1000)

         union

         select null, (random() * 9)::int, null, null

         from generate_series(1, 1000)

         union

         select (random() * 9)::int, null, null, null

         from generate_series(1, 1000)

         union

         select null, (random() * 9)::int, null, (random() * 9)::int

         from generate_series(1, 1000)

         union

         select null, (random() * 9)::int, (random() * 9)::int, null

         from generate_series(1, 1000)

         union

         select null, null, (random() * 9)::int, (random() * 9)::int

         from generate_series(1, 1000)

         union

         select (random() * 9)::int, null, null, (random() * 9)::int

         from generate_series(1, 1000)

         union

         select (random() * 9)::int, null, (random() * 9)::int, null

         from generate_series(1, 1000)

         union

         select (random() * 9)::int, (random() * 9)::int, null, null

         from generate_series(1, 1000)

         union

         select (random() * 9)::int, null, (random() * 9)::int, (random() * 9)::int

         from generate_series(1, 1000)

         union

         select null, (random() * 9)::int, (random() * 9)::int, (random() * 9)::int

         from generate_series(1, 1000)

         union

         select (random() * 9)::int, (random() * 9)::int, null, (random() * 9)::int

         from generate_series(1, 1000)

         union

         select (random() * 9)::int, (random() * 9)::int, (random() * 9)::int, null

         from generate_series(1, 1000)

         union

         select (random() * 9)::int, (random() * 9)::int, (random() * 9)::int, (random() * 9)::int

         from generate_series(1, 1000)

     ) t

;

事实表

--事实表(4个条件列)

create table rul04_val

(

    id  int,

    c1  int,

    c2  int,

    c3  int,

    c4  int,

    val numeric

);

insert into rul04_val

select id,

       (random() * 999)::int,

       (random() * 999)::int,

       (random() * 999)::int,

       (random() * 999)::int,

       round(random() * 10000, 2)

from generate_series(1, 10000000) id;

alter table rul04_val

    add primary key (id);

--单维度列的索引

create index idx_rul04_val_c1 on rul04_val (c1);

create index idx_rul04_val_c2 on rul04_val (c2);

create index idx_rul04_val_c3 on rul04_val (c3);

create index idx_rul04_val_c4 on rul04_val (c4);

--按规则表的数据分布规律，创建2个及以上列的复合索引

create index idx_rul04_val_c12 on rul04_val (c1,c2);

create index idx_rul04_val_c13 on rul04_val (c3,c4);

并行执行动态查询

根据条件的组合关系，动态生成全部16个查询语句，使用dblink的异步查询函数，并行执行这些查询语句，最后结果组合。

创建多个dblink连接

建议数据的socket连接，允许免密。

create extension dblink ;

select connam,

       dblink_connect(connam,

                      format('host=%s port=%s dbname=%s user=%s ',

                             current_setting('unix_socket_directories', true),

                             current_setting('port', true),

                             current_database(),

                             current_user))::text as reault

from (select 'dblink_' || id as connam

      from generate_series(1000, 1000 - 1 + 100) id) t;

事实表与规则表的全量匹配运算

事实表名、规则表名、条件维度列，已经并行度，都可以自定义。

如果并行执行，用时最长的查询，决定整体执行时长。

在CTE中，会话连接（conn）记录数，决定并行度。

为了语句的通用性，可以使用主键列或唯一列，代替CTID列。

--dattab: 事实表名； rultab: 规则表名； colname: 关联的条件维度列名

with val as (select 'rul04_val' dattab, 'rul04' rultab, ('{c1,c2,c3,c4}'::text[]) colname),

     sta as

         (select id,

                 format('select a.ctid atid,b.ctid btid from %I a , %I b where %s ', dattab, rultab, whr) sqlsta

          from val,

               lateral array_length(colname, 1) colnum,

               generate_series(0, 2 ^ (array_length(colname, 1)) - 1) id,

              lateral ( select string_agg(case

                                              when get_bit(id::bit(10), sn + 9 - colnum) = 1

                                                  then format('a.%1$I=b.%1$I', colname[sn])

                                              else format('b.%1$I is null', colname[sn]) end, ' and ') whr

                        from generate_series(1, colnum) sn)),

     conn as

         (select connam,

                 row_number() over () sn

          from unnest(dblink_get_connections()) connam

          where dblink_is_busy(connam) = 0

          limit 4),

     qursta as

         (select connam, string_agg(sqlsta, ' union all ') as qursta

          from conn,

               sta,

               (select count(*) connum from conn) t

          where conn.sn = sta.id % connum + 1

          group by connam),

     sedqur as

         (select array_agg(connam) connams

          from qursta

             , lateral dblink_send_query(connam, qursta) sq),

     res as (

         select res.*

         from sedqur

            , lateral unnest(connams) connam

            , lateral dblink_get_result(connam) as res(atid tid, btid tid))

select a.*,b.*

from res

         join rul04_val a on a.ctid = atid

         join rul04 b on b.ctid = btid;

;

Hash Join  (cost=343.84..40906.33 rows=10000 width=46) (actual time=144.807..1444.010 rows=386411 loops=1)

  Hash Cond: (res.btid = b.ctid)

  CTE conn

    ->  Limit  (cost=0.01..0.17 rows=1 width=40) (actual time=0.046..0.173 rows=20 loops=1)

          ->  WindowAgg  (cost=0.01..0.17 rows=1 width=40) (actual time=0.046..0.171 rows=20 loops=1)

                ->  Function Scan on unnest connam  (cost=0.01..0.15 rows=1 width=32) (actual time=0.042..0.162 rows=20 loops=1)

                      Filter: (dblink_is_busy(connam) = 0)

                      Rows Removed by Filter: 2

  CTE sedqur

    ->  Aggregate  (cost=30.66..30.67 rows=1 width=32) (actual time=0.447..0.454 rows=1 loops=1)

          ->  Nested Loop  (cost=30.60..30.65 rows=1 width=32) (actual time=0.308..0.447 rows=16 loops=1)

                ->  GroupAggregate  (cost=30.59..30.62 rows=1 width=64) (actual time=0.296..0.318 rows=16 loops=1)

                      Group Key: conn.connam

                      ->  Sort  (cost=30.59..30.60 rows=1 width=64) (actual time=0.293..0.298 rows=16 loops=1)

                            Sort Key: conn.connam

                            Sort Method: quicksort  Memory: 27kB

                            ->  Nested Loop  (cost=30.06..30.58 rows=1 width=64) (actual time=0.220..0.276 rows=16 loops=1)

                                  ->  Function Scan on colnum  (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=1)

                                  ->  Nested Loop  (cost=30.06..30.56 rows=1 width=64) (actual time=0.216..0.268 rows=16 loops=1)

                                        ->  Hash Join  (cost=0.06..0.53 rows=1 width=36) (actual time=0.202..0.211 rows=16 loops=1)

                                              Hash Cond: ((((id.id)::bigint % (count(*))) + 1) = conn.sn)

                                              ->  Nested Loop  (cost=0.02..0.36 rows=16 width=12) (actual time=0.011..0.015 rows=16 loops=1)

                                                    ->  Aggregate  (cost=0.02..0.03 rows=1 width=8) (actual time=0.006..0.007 rows=1 loops=1)

                                                          ->  CTE Scan on conn conn_1  (cost=0.00..0.02 rows=1 width=0) (actual time=0.000..0.002 rows=20 loops=1)

                                                    ->  Function Scan on generate_series id  (cost=0.00..0.16 rows=16 width=4) (actual time=0.004..0.005 rows=16 loops=1)

                                              ->  Hash  (cost=0.02..0.02 rows=1 width=40) (actual time=0.186..0.186 rows=20 loops=1)

                                                    Buckets: 1024  Batches: 1  Memory Usage: 10kB

                                                    ->  CTE Scan on conn  (cost=0.00..0.02 rows=1 width=40) (actual time=0.047..0.179 rows=20 loops=1)

                                        ->  Aggregate  (cost=30.00..30.02 rows=1 width=32) (actual time=0.003..0.003 rows=1 loops=16)

                                              ->  Function Scan on generate_series sn  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.000..0.000 rows=4 loops=16)

                ->  Function Scan on dblink_send_query sq  (cost=0.00..0.01 rows=1 width=0) (actual time=0.008..0.008 rows=1 loops=16)

  CTE res

    ->  Nested Loop  (cost=0.01..200.22 rows=10000 width=12) (actual time=143.473..973.506 rows=386411 loops=1)

          ->  Nested Loop  (cost=0.00..0.22 rows=10 width=32) (actual time=0.455..0.470 rows=16 loops=1)

                ->  CTE Scan on sedqur  (cost=0.00..0.02 rows=1 width=32) (actual time=0.449..0.449 rows=1 loops=1)

                ->  Function Scan on unnest connam_1  (cost=0.00..0.10 rows=10 width=32) (actual time=0.005..0.012 rows=16 loops=1)

          ->  Function Scan on dblink_get_result res_1  (cost=0.00..10.00 rows=1000 width=12) (actual time=57.747..59.090 rows=24151 loops=16)

  ->  Nested Loop  (cost=0.00..40425.00 rows=10000 width=32) (actual time=143.488..1377.323 rows=386411 loops=1)

        ->  CTE Scan on res  (cost=0.00..200.00 rows=10000 width=12) (actual time=143.475..1029.676 rows=386411 loops=1)

        ->  Tid Scan on rul04_val a  (cost=0.00..4.01 rows=1 width=32) (actual time=0.001..0.001 rows=1 loops=386411)

              TID Cond: (ctid = res.atid)

  ->  Hash  (cost=62.90..62.90 rows=3990 width=26) (actual time=1.308..1.308 rows=3990 loops=1)

        Buckets: 4096  Batches: 1  Memory Usage: 244kB

        ->  Seq Scan on rul04 b  (cost=0.00..62.90 rows=3990 width=26) (actual time=0.007..0.565 rows=3990 loops=1)

Planning Time: 0.479 ms

Execution Time: 1456.717 ms

断开所有连接

select *

from unnest(dblink_get_connections()) connam

   , lateral dblink_disconnect(connam);

总结

全量匹配，可以从几十分钟，减少到几秒。关键要在索引和组合条件之间，搞平衡。本质就是空间与时间的转换。

优化后的查询语句，仰赖事实表的多列索引，与规则数据的适配程度。由于分解重组后的查询语句较多，需要在索引与查询语句之间，进行取舍平衡。

KingabseES例程-事实数据与规则的匹配校验的更多相关文章

oracle xmltype导入并解析Excel数据 (二)规则说明
规则表字段说明如下: 其中RULE_FUNC_CUSTOMIZE表示,用户自己写函数,去判断数据是否合法,存储的是函数的名字此函数的参数只有一个,该列的值,字段类型是Varchar2, 校验失败的话 ...
ElasticSearch(十)Elasticsearch检索出的数据列表按字段匹配的优先顺序及搜索单词拼音一部分搜不到数据
检索出的数据列表按字段匹配的优先顺序一.举例比如,发布一篇文章,文章包括基本的字段包括标题.发布时间.点击率.关键字.内容.当在页面中输入“教育”搜索关键词,会检索出指定字段包括“教育”的所有数据 ...
Spring_数据校验和自定义检验规则和分组校验
@Validated :绑定需要校验的数据. 数据校验规则:为数据绑定校验的规则 private Long booId;@NotNull(message = "不能为空")pri ...
R合并数据框有重复匹配时只保留第一行
前言合并数据框有重复匹配时通常会返回所有的匹配,如何只保留匹配的第一行呢?其实这个需求也很常见.如芯片探针ID和基因ID往往多对一,要合并ID对应矩阵和芯片表达矩阵时. 数据例子 data = da ...
【SpringMVC】SpringMVC系列12之数据类型转换、格式化、校验
12.数据类型转换.格式化.校验 12.1.数据绑定流程 Spring MVC 主框架将 ServletRequest 对象及目标方法的入参实例传递给 WebDataBinderFacto ...
python读取两个csv文件数据，进行查找匹配出现次数
现有需求表1 表2 需要拿表1中的编码去表2中的门票编码列匹配,统计出现的次数,由于表2编码列是区域间,而且列不是固定的,代码如下 #encoding:utf-8 ##导入两个CSV进行比对 imp ...
Oracle GoldenGate 四、数据过滤和数据项匹配
写在开始前从两周前我花了大量的业余时间阅读GoldenGate官方文档,并根据文档实践和进一步学习了解GoldenGate,以下便是根据官方文档理解总结的GoldenGate学习内容: Oracle ...
web.xml中的url-pattern写法规则及匹配过程
servlet和filter在javaEE开发中很常用,因此有必要知道web.xml文件映射的规则 1. 写法 ①完全匹配:以“/”开头,以字母(非“*”)结束如:<url-patte ...
用Excel做数据分析常用函数（数据清理、关联匹配……）
本文总结在使用Excel进行数据分析时,最常用的功能和函数. Excel的功能和函数非常多,用进废退,除了学习基本的函数和功能,最重要的是遇到问题可以快速的搜索并解决. 首先Excel可以处理的数据量 ...
laravel 数据验证规则
return [ 'accepted' => '必须为yes,on,1,true', 'active_url' => '是否是一个合法的url,基于PHP的checkdnsrr函数,因此也 ...

随机推荐

radware 相关（alteon产品）
目录术语组网配置思路 IP 规划负载均衡器的IP规划服务器的IP规划登陆设备简单命令行配置 Main Menu 管理IP 确认设备版本 telnet/sshd/http enable 网 ...
建设数字工厂：MRP物料需求计划的逻辑原理与配置方法
本文分享自华为云社区<数字工厂深入浅出系列(七): MRP物料需求计划的逻辑原理与配置方法>,作者:云起MAE. MRP是生产制造企业"管好"物料的核心工具方法,基本思 ...
八万乌云漏洞库——Ubuntu本地Docker搭建环境
环境准备 ubuntu16.04虚拟机换国内源 vi /etc/apt/sources.list 阿里源 deb http://mirrors.aliyun.com/ubuntu/ xenial m ...
自定义javascript中call、bind、apply方法
call.bind.apply都是Function原型上的方法,用于改变this的指向自定义函数 js中的call.bind.apply是用c++代码实现的,我们这里使用js代码做一个模式,没有把所 ...
js高级之内存管理与闭包
javacript中的内存管理 javascript中不需要我们手动去分配内存,当我们创建变量的时候,会自动给我们分配内存. 创建基本数据类型时,会在栈内存中开辟空间存放变量创建引用数据类型时,会在 ...
SQL 注入学习手册【笔记】
SQL 注入基础 [若本文有问题请指正] 有回显回显正常基本步骤 1. 判断注入类型数字型 or 字符型数字型[示例]:?id=1 字符型[示例]:?id=1' 这也是在尝试闭合原来的 sql ...
Sealos 国内集群正式上线，可一键运行 LLama2 中文版大模型！
2023 年 7 月 19 日,MetaAI 宣布开源旗下的 LLama2 大模型,Meta 首席科学家.图灵奖得主 Yann LeCun 在推特上表示 Meta 此举可能将改变大模型行业的竞争格局. ...
三维模型OBJ格式轻量化压缩变形现象分析
三维模型OBJ格式轻量化压缩变形现象分析三维模型的OBJ格式轻量化压缩是一种常见的处理方法,它可以减小模型文件的体积,提高加载和渲染效率.然而,在进行轻量化压缩过程中,有时会出现模型变形的现象,即压 ...
T-SQL——关于数据合并(Merge)
目录 0. 背景说明及测试数据 1. 直接清空,重新插入 2. 单条记录执行插入.更新操作 3. Merge函数 3.1 准备测试数据 3.2 测试Merge 3.3 关于Merge 4.参考 sha ...
文心一言 VS 讯飞星火 VS chatgpt （95）-- 算法导论9.2 4题
四.用go语言,假设用RANDOMIZED-SELECT 去选择数组 A=(3,2,9,0,7,5,4,8,6,1)的最小元素,给出能够导致 RANDOMIZED-SELECT最坏情况发生的一个划分序 ...