KingabseES例程-事实数据与规则的匹配校验
KingabseES例程-事实数据与规则的匹配校验
背景
使用规则,对数据进行校验,比如电商的用户购物订单,是否合法。这就需要订单的多维度,如 用户、地区、物流、支付手段、供应商 等各类信息,进行动态组合条件的关联匹配。
通常的匹配模式
针对某个订单
一种是根据某个订单,与全部规则进行匹配。
select a.*
from rul04_val a,
rul04 b
where 1 = 1
and (a.c1 = b.c1 or b.c1 is null)
and (a.c2 = b.c2 or b.c2 is null)
and (a.c3 = b.c3 or b.c3 is null)
and (a.c4 = b.c4 or b.c4 is null)
and a.id = 1000;
Nested Loop (cost=0.43..151.15 rows=12 width=26)
Join Filter: (((a.c1 = b.c1) OR (b.c1 IS NULL)) AND ((a.c2 = b.c2) OR (b.c2 IS NULL)) AND ((a.c3 = b.c3) OR (b.c3 IS NULL)) AND ((a.c4 = b.c4) OR (b.c4 IS NULL)))
-> Index Scan using rul04_val_pkey on rul04_val a (cost=0.43..8.45 rows=1 width=26)
Index Cond: (id = 1000)
-> Seq Scan on rul04 b (cost=0.00..62.90 rows=3990 width=16)
针对某个规则
一种是,根据某个规则,校验所有订单,
select a.*
from rul04_val a,
rul04 b
where 1 = 1
and (a.c1 = b.c1 or b.c1 is null)
and (a.c2 = b.c2 or b.c2 is null)
and (a.c3 = b.c3 or b.c3 is null)
and (a.c4 = b.c4 or b.c4 is null)
and b.id = 1000;
Nested Loop (cost=0.28..374083.30 rows=30134 width=26)
Join Filter: (((a.c1 = b.c1) OR (b.c1 IS NULL)) AND ((a.c2 = b.c2) OR (b.c2 IS NULL)) AND ((a.c3 = b.c3) OR (b.c3 IS NULL)) AND ((a.c4 = b.c4) OR (b.c4 IS NULL)))
-> Index Scan using rul04_pkey on rul04 b (cost=0.28..8.30 rows=1 width=16)
Index Cond: (id = 1000)
-> Seq Scan on rul04_val a (cost=0.00..174075.00 rows=10000000 width=26)
全部订单和全部规则
如果需要对全部或大量的订单和规则,进行匹配运算,这将是海量的计算。
select a.*
from rul04_val a,
rul04 b
where 1 = 1
and (a.c1 = b.c1 or b.c1 is null)
and (a.c2 = b.c2 or b.c2 is null)
and (a.c3 = b.c3 or b.c3 is null)
and (a.c4 = b.c4 or b.c4 is null) ;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..368873734.60 rows=120236596 width=46) (actual time=5.172..879507.928 rows=386411 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Nested Loop (cost=0.00..356849075.00 rows=30059149 width=46) (actual time=12.698..879456.332 rows=77282 loops=5)
Join Filter: (((a.c1 = b.c1) OR (b.c1 IS NULL)) AND ((a.c2 = b.c2) OR (b.c2 IS NULL)) AND ((a.c3 = b.c3) OR (b.c3 IS NULL)) AND ((a.c4 = b.c4) OR (b.c4 IS NULL)))
Rows Removed by Join Filter: 7979922718
-> Parallel Seq Scan on rul04_val a (cost=0.00..99075.00 rows=2500000 width=26) (actual time=0.018..151.852 rows=2000000 loops=5)
-> Seq Scan on rul04 b (cost=0.00..62.90 rows=3990 width=20) (actual time=0.001..0.162 rows=3990 loops=10000000)
Planning Time: 0.289 ms
Execution Time: 879521.346 ms
分解匹配规则
普通模式的查询,不能使用事实表的索引。这是因为关联条件中,AND运算包含OR运算,这就使得关联运算是volatile属性,这里需要进行SQL调优。
此处的SQL,是多列存在匹配需求,或者不需匹配的全集。这里可以利用对bit数据的类比,将匹配和全集的表达式,理解成1与0,反之亦可。从而得知,需要分解成2^4=16个,不含OR运算的查询,然后将16个查询结果合并。
--16个查询,组合。每个查询,都能使用较适合的索引。
select *
from rul04_val a, rul04 b
where b.c1 is null and b.c2 is null and b.c3 is null and b.c4 is null
union all
select *
from rul04_val a, rul04 b
where b.c1 is null and b.c2 is null and b.c3 is null and a.c4 = b.c4
union all
...
union all
from rul04_val a, rul04 b
where a.c1 = b.c1 and a.c2 = b.c2 and a.c3 = b.c3 and a.c4 = b.c4
;
Append (cost=5.24..458866.31 rows=10410804 width=26) (actual time=0.432..1059.117 rows=386411 loops=1)
-> Result (cost=5.24..174080.24 rows=10000000 width=26) (actual time=0.233..0.234 rows=0 loops=1)
One-Time Filter: $0
InitPlan 1 (returns $0)
-> Seq Scan on rul04 b_15 (cost=0.00..62.90 rows=12 width=0) (actual time=0.230..0.230 rows=0 loops=1)
Filter: ((c1 IS NULL) AND (c2 IS NULL) AND (c3 IS NULL) AND (c4 IS NULL))
Rows Removed by Filter: 3990
-> Seq Scan on rul04_val a (cost=5.24..174080.24 rows=10000000 width=26) (never executed)
...
-> Nested Loop (cost=112.08..5417.64 rows=1 width=26) (actual time=9.831..9.832 rows=0 loops=1)
-> HashAggregate (cost=102.80..106.79 rows=399 width=16) (actual time=0.896..1.413 rows=3990 loops=1)
Group Key: b_14.c1, b_14.c2, b_14.c3, b_14.c4
-> Seq Scan on rul04 b_14 (cost=0.00..62.90 rows=3990 width=16) (actual time=0.005..0.158 rows=3990 loops=1)
-> Bitmap Heap Scan on rul04_val a_15 (cost=9.28..13.30 rows=1 width=26) (actual time=0.002..0.002 rows=0 loops=3990)
Recheck Cond: ((c3 = b_14.c3) AND (c4 = b_14.c4) AND (c1 = b_14.c1) AND (c2 = b_14.c2))
-> BitmapAnd (cost=9.28..9.28 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=3990)
-> Bitmap Index Scan on idx_rul04_val_c34 (cost=0.00..4.51 rows=10 width=0) (actual time=0.001..0.001 rows=5 loops=3990)
Index Cond: ((c3 = b_14.c3) AND (c4 = b_14.c4))
-> Bitmap Index Scan on idx_rul04_val_c12 (cost=0.00..4.51 rows=10 width=0) (actual time=0.001..0.001 rows=4 loops=2240)
Index Cond: ((c1 = b_14.c1) AND (c2 = b_14.c2))
Planning Time: 1.763 ms
Execution Time: 1068.627 ms
由于需要关联的列数是不确定,而且优化后的查询语句,是大量的局部查询语句组合而成,这给开发人员带来巨大的工作量。我们需要一种,通用语句解决这个复杂且不确定的需求。
数据准备
规则表与事实表关联的列,建议列名相同,或使用相同的别名。
规则表
--规则表(4个条件列)
--插入多种(2^4)组合的条件数据
drop table rul04;
create table rul04
(
id int primary key,
c1 int,
c2 int,
c3 int,
c4 int
);
--基于条件列,创建单列索引,这些索引未必使用。
create index idx_rul04_c1 on rul04 (c1);
create index idx_rul04_c2 on rul04 (c2);
create index idx_rul04_c3 on rul04 (c3);
create index idx_rul04_c4 on rul04 (c4);
insert into rul04
select row_number() over (), c1, c2, c3, c4
from (
select null c1, null c2, null c3, null c4
from generate_series(1, 1000) id
union
select null, null, null, (random() * 9)::int
from generate_series(1, 1000)
union
select null, null, (random() * 9)::int, null
from generate_series(1, 1000)
union
select null, (random() * 9)::int, null, null
from generate_series(1, 1000)
union
select (random() * 9)::int, null, null, null
from generate_series(1, 1000)
union
select null, (random() * 9)::int, null, (random() * 9)::int
from generate_series(1, 1000)
union
select null, (random() * 9)::int, (random() * 9)::int, null
from generate_series(1, 1000)
union
select null, null, (random() * 9)::int, (random() * 9)::int
from generate_series(1, 1000)
union
select (random() * 9)::int, null, null, (random() * 9)::int
from generate_series(1, 1000)
union
select (random() * 9)::int, null, (random() * 9)::int, null
from generate_series(1, 1000)
union
select (random() * 9)::int, (random() * 9)::int, null, null
from generate_series(1, 1000)
union
select (random() * 9)::int, null, (random() * 9)::int, (random() * 9)::int
from generate_series(1, 1000)
union
select null, (random() * 9)::int, (random() * 9)::int, (random() * 9)::int
from generate_series(1, 1000)
union
select (random() * 9)::int, (random() * 9)::int, null, (random() * 9)::int
from generate_series(1, 1000)
union
select (random() * 9)::int, (random() * 9)::int, (random() * 9)::int, null
from generate_series(1, 1000)
union
select (random() * 9)::int, (random() * 9)::int, (random() * 9)::int, (random() * 9)::int
from generate_series(1, 1000)
) t
;
事实表
--事实表(4个条件列)
create table rul04_val
(
id int,
c1 int,
c2 int,
c3 int,
c4 int,
val numeric
);
insert into rul04_val
select id,
(random() * 999)::int,
(random() * 999)::int,
(random() * 999)::int,
(random() * 999)::int,
round(random() * 10000, 2)
from generate_series(1, 10000000) id;
alter table rul04_val
add primary key (id);
--单维度列的索引
create index idx_rul04_val_c1 on rul04_val (c1);
create index idx_rul04_val_c2 on rul04_val (c2);
create index idx_rul04_val_c3 on rul04_val (c3);
create index idx_rul04_val_c4 on rul04_val (c4);
--按规则表的数据分布规律,创建2个及以上列的复合索引
create index idx_rul04_val_c12 on rul04_val (c1,c2);
create index idx_rul04_val_c13 on rul04_val (c3,c4);
并行执行动态查询
根据条件的组合关系,动态生成全部16个查询语句,使用dblink的异步查询函数,并行执行这些查询语句,最后结果组合。
创建多个dblink连接
建议数据的socket连接,允许免密。
create extension dblink ;
select connam,
dblink_connect(connam,
format('host=%s port=%s dbname=%s user=%s ',
current_setting('unix_socket_directories', true),
current_setting('port', true),
current_database(),
current_user))::text as reault
from (select 'dblink_' || id as connam
from generate_series(1000, 1000 - 1 + 100) id) t;
事实表与规则表的全量匹配运算
事实表名、规则表名、条件维度列,已经并行度,都可以自定义。
如果并行执行,用时最长的查询,决定整体执行时长。
在CTE中,会话连接(conn)记录数,决定并行度。
为了语句的通用性,可以使用主键列或唯一列,代替CTID列。
--dattab: 事实表名; rultab: 规则表名; colname: 关联的条件维度列名
with val as (select 'rul04_val' dattab, 'rul04' rultab, ('{c1,c2,c3,c4}'::text[]) colname),
sta as
(select id,
format('select a.ctid atid,b.ctid btid from %I a , %I b where %s ', dattab, rultab, whr) sqlsta
from val,
lateral array_length(colname, 1) colnum,
generate_series(0, 2 ^ (array_length(colname, 1)) - 1) id,
lateral ( select string_agg(case
when get_bit(id::bit(10), sn + 9 - colnum) = 1
then format('a.%1$I=b.%1$I', colname[sn])
else format('b.%1$I is null', colname[sn]) end, ' and ') whr
from generate_series(1, colnum) sn)),
conn as
(select connam,
row_number() over () sn
from unnest(dblink_get_connections()) connam
where dblink_is_busy(connam) = 0
limit 4),
qursta as
(select connam, string_agg(sqlsta, ' union all ') as qursta
from conn,
sta,
(select count(*) connum from conn) t
where conn.sn = sta.id % connum + 1
group by connam),
sedqur as
(select array_agg(connam) connams
from qursta
, lateral dblink_send_query(connam, qursta) sq),
res as (
select res.*
from sedqur
, lateral unnest(connams) connam
, lateral dblink_get_result(connam) as res(atid tid, btid tid))
select a.*,b.*
from res
join rul04_val a on a.ctid = atid
join rul04 b on b.ctid = btid;
;
Hash Join (cost=343.84..40906.33 rows=10000 width=46) (actual time=144.807..1444.010 rows=386411 loops=1)
Hash Cond: (res.btid = b.ctid)
CTE conn
-> Limit (cost=0.01..0.17 rows=1 width=40) (actual time=0.046..0.173 rows=20 loops=1)
-> WindowAgg (cost=0.01..0.17 rows=1 width=40) (actual time=0.046..0.171 rows=20 loops=1)
-> Function Scan on unnest connam (cost=0.01..0.15 rows=1 width=32) (actual time=0.042..0.162 rows=20 loops=1)
Filter: (dblink_is_busy(connam) = 0)
Rows Removed by Filter: 2
CTE sedqur
-> Aggregate (cost=30.66..30.67 rows=1 width=32) (actual time=0.447..0.454 rows=1 loops=1)
-> Nested Loop (cost=30.60..30.65 rows=1 width=32) (actual time=0.308..0.447 rows=16 loops=1)
-> GroupAggregate (cost=30.59..30.62 rows=1 width=64) (actual time=0.296..0.318 rows=16 loops=1)
Group Key: conn.connam
-> Sort (cost=30.59..30.60 rows=1 width=64) (actual time=0.293..0.298 rows=16 loops=1)
Sort Key: conn.connam
Sort Method: quicksort Memory: 27kB
-> Nested Loop (cost=30.06..30.58 rows=1 width=64) (actual time=0.220..0.276 rows=16 loops=1)
-> Function Scan on colnum (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=1)
-> Nested Loop (cost=30.06..30.56 rows=1 width=64) (actual time=0.216..0.268 rows=16 loops=1)
-> Hash Join (cost=0.06..0.53 rows=1 width=36) (actual time=0.202..0.211 rows=16 loops=1)
Hash Cond: ((((id.id)::bigint % (count(*))) + 1) = conn.sn)
-> Nested Loop (cost=0.02..0.36 rows=16 width=12) (actual time=0.011..0.015 rows=16 loops=1)
-> Aggregate (cost=0.02..0.03 rows=1 width=8) (actual time=0.006..0.007 rows=1 loops=1)
-> CTE Scan on conn conn_1 (cost=0.00..0.02 rows=1 width=0) (actual time=0.000..0.002 rows=20 loops=1)
-> Function Scan on generate_series id (cost=0.00..0.16 rows=16 width=4) (actual time=0.004..0.005 rows=16 loops=1)
-> Hash (cost=0.02..0.02 rows=1 width=40) (actual time=0.186..0.186 rows=20 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> CTE Scan on conn (cost=0.00..0.02 rows=1 width=40) (actual time=0.047..0.179 rows=20 loops=1)
-> Aggregate (cost=30.00..30.02 rows=1 width=32) (actual time=0.003..0.003 rows=1 loops=16)
-> Function Scan on generate_series sn (cost=0.00..10.00 rows=1000 width=4) (actual time=0.000..0.000 rows=4 loops=16)
-> Function Scan on dblink_send_query sq (cost=0.00..0.01 rows=1 width=0) (actual time=0.008..0.008 rows=1 loops=16)
CTE res
-> Nested Loop (cost=0.01..200.22 rows=10000 width=12) (actual time=143.473..973.506 rows=386411 loops=1)
-> Nested Loop (cost=0.00..0.22 rows=10 width=32) (actual time=0.455..0.470 rows=16 loops=1)
-> CTE Scan on sedqur (cost=0.00..0.02 rows=1 width=32) (actual time=0.449..0.449 rows=1 loops=1)
-> Function Scan on unnest connam_1 (cost=0.00..0.10 rows=10 width=32) (actual time=0.005..0.012 rows=16 loops=1)
-> Function Scan on dblink_get_result res_1 (cost=0.00..10.00 rows=1000 width=12) (actual time=57.747..59.090 rows=24151 loops=16)
-> Nested Loop (cost=0.00..40425.00 rows=10000 width=32) (actual time=143.488..1377.323 rows=386411 loops=1)
-> CTE Scan on res (cost=0.00..200.00 rows=10000 width=12) (actual time=143.475..1029.676 rows=386411 loops=1)
-> Tid Scan on rul04_val a (cost=0.00..4.01 rows=1 width=32) (actual time=0.001..0.001 rows=1 loops=386411)
TID Cond: (ctid = res.atid)
-> Hash (cost=62.90..62.90 rows=3990 width=26) (actual time=1.308..1.308 rows=3990 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 244kB
-> Seq Scan on rul04 b (cost=0.00..62.90 rows=3990 width=26) (actual time=0.007..0.565 rows=3990 loops=1)
Planning Time: 0.479 ms
Execution Time: 1456.717 ms
断开所有连接
select *
from unnest(dblink_get_connections()) connam
, lateral dblink_disconnect(connam);
总结
全量匹配,可以从几十分钟,减少到几秒。关键要在索引和组合条件之间,搞平衡。本质就是空间与时间的转换。
优化后的查询语句,仰赖事实表的多列索引,与规则数据的适配程度。由于分解重组后的查询语句较多,需要在索引与查询语句之间,进行取舍平衡。
KingabseES例程-事实数据与规则的匹配校验的更多相关文章
- oracle xmltype导入并解析Excel数据 (二)规则说明
规则表字段说明如下: 其中RULE_FUNC_CUSTOMIZE表示,用户自己写函数,去判断数据是否合法,存储的是函数的名字 此函数的参数只有一个,该列的值,字段类型是Varchar2, 校验失败的话 ...
- ElasticSearch(十)Elasticsearch检索出的数据列表按字段匹配的优先顺序及搜索单词拼音一部分搜不到数据
检索出的数据列表按字段匹配的优先顺序 一.举例 比如,发布一篇文章,文章包括基本的字段包括标题.发布时间.点击率.关键字.内容.当在页面中输入“教育”搜索关键词,会检索出指定字段包括“教育”的所有数据 ...
- Spring_数据校验和自定义检验规则和分组校验
@Validated :绑定需要校验的数据. 数据校验规则:为数据绑定校验的规则 private Long booId;@NotNull(message = "不能为空")pri ...
- R合并数据框有重复匹配时只保留第一行
前言 合并数据框有重复匹配时通常会返回所有的匹配,如何只保留匹配的第一行呢?其实这个需求也很常见.如芯片探针ID和基因ID往往多对一,要合并ID对应矩阵和芯片表达矩阵时. 数据例子 data = da ...
- 【SpringMVC】SpringMVC系列12之数据类型转换、格式化、校验
12.数据类型转换.格式化.校验 12.1.数据绑定流程 Spring MVC 主框架将 ServletRequest 对象及目标方法的入参实例传递给 WebDataBinderFacto ...
- python读取两个csv文件数据,进行查找匹配出现次数
现有需求 表1 表2 需要拿表1中的编码去表2中的门票编码列匹配,统计出现的次数,由于表2编码列是区域间,而且列不是固定的,代码如下 #encoding:utf-8 ##导入两个CSV进行比对 imp ...
- Oracle GoldenGate 四、数据过滤和数据项匹配
写在开始前 从两周前我花了大量的业余时间阅读GoldenGate官方文档,并根据文档实践和进一步学习了解GoldenGate,以下便是根据官方文档理解总结的GoldenGate学习内容: Oracle ...
- web.xml中的url-pattern写法规则及匹配过程
servlet和filter在javaEE开发中很常用,因此有必要知道web.xml文件映射的规则 1. 写法 ①完全匹配:以“/”开头,以字母(非“*”)结束 如:<url-patte ...
- 用Excel做数据分析常用函数(数据清理、关联匹配……)
本文总结在使用Excel进行数据分析时,最常用的功能和函数. Excel的功能和函数非常多,用进废退,除了学习基本的函数和功能,最重要的是遇到问题可以快速的搜索并解决. 首先Excel可以处理的数据量 ...
- laravel 数据验证规则
return [ 'accepted' => '必须为yes,on,1,true', 'active_url' => '是否是一个合法的url,基于PHP的checkdnsrr函数,因此也 ...
随机推荐
- 【MAUI Blazor踩坑日记】4.只在特定平台上引用包
其实这个并不是MAUI特有的问题,只是在MAUI中可能会遇到. 例如某个包只在Windows上有用,但打包的时候,安卓.ios也会把这个包计算在内 所以需要在不是特定平台时把它排除. 万幸从微软文档找 ...
- 2021-3-29 Enter按下事件
先在构造器中添加keydown事件 tBoxPsw.KeyDown += TBoxPsw_KeyDown; 在事件中添加按下enter按钮所触发的方法 private void TBoxPsw_Key ...
- 从0开发属于自己的nestjs框架的mini 版 —— ioc篇
如今,nodejs的框架也是层出不穷,偏向向底层的有 express.koa. Fastify,偏向于上层有阿里的 Egg.thinkjs .还有国外的 nestjs. 在这里我更喜欢 nestjs, ...
- 手写promise之分步解析
promise是es6推出适用于异步请求的构造函数,帮助解决回调地狱的问题,以下内容将自定义实现promise,只包括基本使用,所以一些边界情况考虑没有在内. 如果对promise用法还不熟悉的朋友可 ...
- jQuery项目的小技巧
1.返回顶部按钮 你可以利用 animate 和 scrollTop 来实现返回顶部的动画,而不需要使用其他插件. // Back to top // Back to top $('a.top').c ...
- 部署安装kafka集群
准备 zookeeper节点: 172.50.13.103 172.50.13.104 172.50.13.105 kafka版本: 2.13-2.7.0 安装步骤 部署安装zookeeper集群.参 ...
- 【技术积累】Linux中的命令行【理论篇】【六】
as命令 命令介绍 在Linux中,as命令是一个汇编器,用于将汇编语言源代码转换为可执行的目标文件.它是GNU Binutils软件包的一部分,提供了一系列用于处理二进制文件的工具. 命令说明 as ...
- AI绘画:StableDiffusion炼丹Lora攻略-实战萌宠图片生成
写在前面的话 近期在小红书发现了许多极其可爱.美观的萌宠图片,对这些美妙的图像深深着迷 于是想着看看利用AI绘画StableDiffusion以下简称(SD)做出来. 以下是详细实操的全过程,包括所有 ...
- 1Nginx基础及编译安装
1Nginx基础 1.Nginx概述 Nginx 功能介绍 Nginx(发音为"engine-x")是一个开源的高性能 HTTP和反向代理服务器.它具有以下功能: 1.静态文件服务 ...
- 基于bert-base-chinese训练bert模型(最后附上整体代码)
目录: 一.bert-base-chinese模型下载 二.数据集的介绍 三.完成类的代码 四.写训练方法 五.总源码及源码参考出处 一.bert-base-chinese模型下载 对于已经预训练好的 ...