openGauss Column-Store Table PSort Index


Overview

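A PSort (partial sort) index is a clustered index built on columns of a column-store table. CUDesc already records the min and max values of each CU, but when the data is widely scattered, filtering CUs by min and max causes many unnecessary CU reads: if every CU spans a wide min-max range, query efficiency approaches that of a full table scan. In the scenario shown in the figure below, for example, query 2 hits almost every CU, so the lookup is close to a full table scan.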

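A PSort index sorts the data within a partial range (which usually covers the rows of several CUs) by the index key, so that the value ranges of different CUs overlap as little as possible and queries become more efficient.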
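The effect of min/max filtering can be illustrated with a small sketch. It is not openGauss code: `CuRange`, `BuildCuRanges`, `CountCusToRead`, and `SortedCopy` are hypothetical names, keys are plain integers, and the whole input is treated as a single partial-sort range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the min/max pair that CUDesc keeps per CU.
struct CuRange {
    int min;
    int max;
};

// Cut the values into CUs of cuRows rows each and record min/max per CU.
std::vector<CuRange> BuildCuRanges(const std::vector<int>& values, std::size_t cuRows)
{
    assert(cuRows > 0);
    std::vector<CuRange> ranges;
    for (std::size_t start = 0; start < values.size(); start += cuRows) {
        std::size_t end = std::min(start + cuRows, values.size());
        auto first = values.begin() + static_cast<std::ptrdiff_t>(start);
        auto last = values.begin() + static_cast<std::ptrdiff_t>(end);
        auto mm = std::minmax_element(first, last);
        ranges.push_back({*mm.first, *mm.second});
    }
    return ranges;
}

// Count the CUs that cannot be skipped for the predicate lo <= v <= hi.
std::size_t CountCusToRead(const std::vector<CuRange>& ranges, int lo, int hi)
{
    std::size_t hits = 0;
    for (const CuRange& r : ranges) {
        if (r.max >= lo && r.min <= hi) {
            ++hits;
        }
    }
    return hits;
}

// Sort a copy of the values; the sorted copy plays the role of the index data.
std::vector<int> SortedCopy(std::vector<int> values)
{
    std::sort(values.begin(), values.end());
    return values;
}
```

With the values {1, 90, 5, 80, 3, 70, 2, 60} and two rows per CU, the predicate 4 <= v <= 6 has to read all four CUs in insertion order, but only one CU once the values are sorted.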

Using PSort Indexes

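During a bulk insert into a column-store table, if a PSort index exists, the batch is sorted first. A PSort index is itself organized as a cstore table (its CUDesc is an astore table) whose columns are the index key columns plus a row-number (TID) column. During insertion, a certain number of rows are sorted by the PSort index key, assembled together with the TID column into a vector array, and then inserted into the index's cstore table. PSort index data therefore has one more column than the index key; the extra column stores the position of the record in the data cstore storage.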

// Construct the index data while building a PSort index
inline void ProjectToIndexVector(VectorBatch *scanBatch, VectorBatch *outBatch, IndexInfo *indexInfo)
{
    Assert(scanBatch && outBatch && indexInfo);
    int numAttrs = indexInfo->ii_NumIndexAttrs;
    AttrNumber *attrNumbers = indexInfo->ii_KeyAttrNumbers;
    Assert(outBatch->m_cols == (numAttrs + 1));

    // index column
    for (int i = 0; i < numAttrs; i++) {
        AttrNumber attno = attrNumbers[i];
        Assert(attno > 0 && attno <= scanBatch->m_cols);

        // shallow copy
        outBatch->m_arr[i].copy(&scanBatch->m_arr[attno - 1]);
    }

    // ctid column
    // the last column is the tid
    outBatch->m_arr[numAttrs].copy(scanBatch->GetSysVector(-1));
    outBatch->m_rows = scanBatch->m_rows;
}
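The layout of the index data described above (key columns plus one trailing TID column, ordered by key) can be sketched in a simplified form. This is an illustration only: `Tid`, `IndexRow`, and `BuildSortedIndexRows` are hypothetical names, the key is a single integer, and offsets are 0-based here purely for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical position of a row in the data cstore: CU id plus offset inside the CU.
struct Tid {
    std::uint32_t cuId;
    std::uint32_t offset;
};

// One row of index data: the key value plus the extra TID column.
struct IndexRow {
    int key;
    Tid tid;
};

// Pair every key of one CU with its position, then order the rows by key.
std::vector<IndexRow> BuildSortedIndexRows(const std::vector<int>& keys, std::uint32_t cuId)
{
    assert(keys.size() <= std::numeric_limits<std::uint32_t>::max());
    std::vector<IndexRow> rows;
    rows.reserve(keys.size());
    for (std::uint32_t i = 0; i < keys.size(); ++i) {
        rows.push_back({keys[i], {cuId, i}});
    }
    // stable_sort keeps rows with equal keys in their original order.
    std::stable_sort(rows.begin(), rows.end(),
                     [](const IndexRow& a, const IndexRow& b) { return a.key < b.key; });
    return rows;
}
```

Because each row carries its own TID, sorting by key does not lose the link back to the original position in the data table.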

When a cstore table executes an insert and a PSort index exists, the data is first put into the sort queue:

void CStoreInsert::BatchInsert(in VectorBatch* pBatch, in int options)
{
    Assert(pBatch || IsEnd());

    /* keep memory space from leaking during bulk-insert */
    MemoryContext oldCnxt = MemoryContextSwitchTo(m_tmpMemCnxt);

    // Step 1: relation has partial cluster key
    // We need put data into sorter container, and then do
    // batchinsert data
    if (NeedPartialSort()) {
        Assert(m_tmpBatchRows);

        if (pBatch) {
            Assert(pBatch->m_cols == m_relation->rd_att->natts);
            m_sorter->PutVecBatch(m_relation, pBatch); // put into the partial sort queue
        }

        if (m_sorter->IsFull() || IsEnd()) { // the sort queue is full, or the input has ended
            m_sorter->RunSort(); // sort by the index key

            /* reset and fetch next batch of values */
            DoBatchInsert(options);
            m_sorter->Reset(IsEnd());

            /* reset and free all memory blocks */
            m_tmpBatchRows->reset(false);
        }
    }
    // Step 2: relation doesn't have partial cluster key
    // We need cache data until batchrows is full
    else {
        Assert(m_bufferedBatchRows);

        // If batch row is full, we can do batchinsert now
        if (IsEnd()) {
            if (ENABLE_DELTA(m_bufferedBatchRows)) {
                InsertDeltaTable(m_bufferedBatchRows, options);
            } else {
                BatchInsertCommon(m_bufferedBatchRows, options);
            }
            m_bufferedBatchRows->reset(true);
        }

        // we need cache data until batchrows is full
        if (pBatch) {
            Assert(pBatch->m_rows <= BatchMaxSize);
            Assert(pBatch->m_cols && m_relation->rd_att->natts);
            Assert(m_bufferedBatchRows->m_rows_maxnum > 0);
            Assert(m_bufferedBatchRows->m_rows_maxnum % BatchMaxSize == 0);

            int startIdx = 0;
            while (m_bufferedBatchRows->append_one_vector(
                       RelationGetDescr(m_relation), pBatch, &startIdx, m_cstorInsertMem)) {
                BatchInsertCommon(m_bufferedBatchRows, options);
                m_bufferedBatchRows->reset(true);
            }
            Assert(startIdx == pBatch->m_rows);
        }
    }

    // Step 3: We must update index data for this batch data
    // if end of batchInsert
    FlushIndexDataIfNeed();

    MemoryContextReset(m_tmpMemCnxt);
    (void)MemoryContextSwitchTo(oldCnxt);
}

Figure 1: cstore table insert flow
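The buffering in Step 1 of `BatchInsert` (collect batches, sort once the queue is full or the input ends, then flush) can be sketched as follows. This is a simplified model rather than the openGauss sorter: `PartialSortBuffer` and its members are hypothetical, rows are single integers, and a "flush" just appends a sorted run to a list.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model of a partial-sort queue with a fixed row capacity.
class PartialSortBuffer {
public:
    explicit PartialSortBuffer(std::size_t capacity) : m_capacity(capacity)
    {
        assert(capacity > 0);
    }

    // Buffer one batch; sort and flush when the buffer is full or the input has ended.
    void BatchInsert(const std::vector<int>& batch, bool isEnd)
    {
        m_buffer.insert(m_buffer.end(), batch.begin(), batch.end());
        if (m_buffer.size() >= m_capacity || isEnd) {
            Flush();
        }
    }

    // Each run is sorted on its own; runs are not merged with each other.
    const std::vector<std::vector<int>>& Runs() const
    {
        return m_runs;
    }

private:
    void Flush()
    {
        if (m_buffer.empty()) {
            return;
        }
        std::sort(m_buffer.begin(), m_buffer.end());
        m_runs.push_back(std::move(m_buffer));
        m_buffer.clear(); // put the moved-from buffer back into a known empty state
    }

    std::size_t m_capacity;
    std::vector<int> m_buffer;
    std::vector<std::vector<int>> m_runs;
};
```

The sort is "partial" in exactly this sense: every flushed run is ordered internally, while no global order across runs is established.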

The code that updates the index data during the insert flow:

void CStoreInsert::InsertIdxTableIfNeed(bulkload_rows* batchRowPtr, uint32 cuId)
{
    Assert(batchRowPtr);

    if (relation_has_indexes(m_resultRelInfo)) {
        /* form all tids */
        bulkload_indexbatch_set_tids(m_idxBatchRow, cuId, batchRowPtr->m_rows_curnum);

        for (int indice = 0; indice < m_resultRelInfo->ri_NumIndices; ++indice) {
            /* form index-keys data for index relation */
            for (int key = 0; key < m_idxKeyNum[indice]; ++key) {
                bulkload_indexbatch_copy(m_idxBatchRow, key, batchRowPtr, m_idxKeyAttr[indice][key]);
            }

            /* form tid-keys data for index relation */
            bulkload_indexbatch_copy_tids(m_idxBatchRow, m_idxKeyNum[indice]);

            /* update the actual number of used attributes */
            m_idxBatchRow->m_attr_num = m_idxKeyNum[indice] + 1;

            if (m_idxInsert[indice] != NULL) {
                /* insert into the PSort index */
                m_idxInsert[indice]->BatchInsert(m_idxBatchRow, 0);
            } else {
                /* insert into the cbtree/cgin index */
                CStoreInsert::InsertNotPsortIdx(indice);
            }
        }
    }
}

The index insert flow is the same as a normal cstore data insert.

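When a query uses a PSort index, the data inside each CU of the index is already sorted, so a binary search can quickly locate the row number of the matching data in the PSort index. The tid column of that row is the row number of the data in the data cstore.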
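A minimal sketch of this lookup over the sorted data of one index CU is shown below. `IndexEntry` and `LookupTids` are hypothetical names, the key is a single integer, and the tid is reduced to a plain number; the real index works on column vectors rather than on an array of structs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical index row: key value plus the row number in the data cstore.
struct IndexEntry {
    int key;
    std::uint32_t tid;
};

// Binary-search the first entry with a matching key, then collect the tids of
// all consecutive entries that share the key.
std::vector<std::uint32_t> LookupTids(const std::vector<IndexEntry>& sortedEntries, int key)
{
    // Precondition: entries are ordered by key, as they are inside a PSort index CU.
    assert(std::is_sorted(sortedEntries.begin(), sortedEntries.end(),
                          [](const IndexEntry& a, const IndexEntry& b) { return a.key < b.key; }));

    auto it = std::lower_bound(sortedEntries.begin(), sortedEntries.end(), key,
                               [](const IndexEntry& e, int k) { return e.key < k; });

    std::vector<std::uint32_t> tids;
    for (; it != sortedEntries.end() && it->key == key; ++it) {
        tids.push_back(it->tid);
    }
    return tids;
}
```

The returned tids are what a scan would then use to fetch the matching rows from the data cstore.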

Figure 2: PSort index query
