The ESTIMATE_PERCENT parameter in DBMS_STATS.GATHER_*_STATS procedures controls the percentage of rows to sample when gathering optimizer statistics. What percentage of rows should you sample to achieve accurate statistics? 100% will ensure that statistics are accurate, but it could take a long time. A 1% sample will finish much more quickly but it could result in poor statistics. It’s not an easy question to answer, which is why it is best practice to use the default: AUTO_SAMPLE_SIZE.

 

In this post, I’ll cover how the AUTO_SAMPLE_SIZE algorithm works in Oracle Database 12c and how it affects the accuracy of the statistics being gathered. If you want to learn more of the history prior to Oracle Database 12c, then this post on Oracle Database 11g is a good place to look. I will indicate below where there are differences between Oracle Database 11g and Oracle Database 12c.

It’s not always appreciated that (in general) a large proportion of the time and resource cost required to gather statistics is associated with evaluating the number of distinct values (NDVs) for each column. Calculating NDV using an exact algorithm can be expensive because the database needs to record and sort column values while statistics are being gathered. If the NDV is high, retaining and sorting column values can become resource-intensive, especially if the sort spills to TEMP. Auto sample size instead uses an approximate (but accurate) algorithm to calculate NDV that avoids the need to sort column data or spill to TEMP. In return for this saving, the database can afford to use a full table scan to ensure that the other basic column statistics are accurate.

Similarly, it can be resource-intensive to generate histograms but the Oracle Database mitigates this cost as follows:

  • Frequency and top frequency histograms are created as the database gathers basic column statistics (such as NDV, MIN, MAX) from the full table scan mentioned above. This is new to Oracle Database 12c.
  • If a frequency or top frequency histogram is not feasible, then the database will collect hybrid histograms using a sample of the column data. Top frequency is only feasible when the top 254 values constitute more than 99% of the entire non null column values and frequency is only feasible if NDV is 254 or less.
  • When the user has specified 'SIZE AUTO' in the METHOD_OPT clause for automatic histogram creation, the Oracle Database chooses which columns to consider for histogram creation based column usage data that’s gathered by the optimizer. Columns that are not used in WHERE-clause predicates or joins are not considered for histograms.

Both Oracle Database 11g and Oracle Database 12c use the following query to gather basic column statistics (it is a simplified here for illustrative purposes).

SELECT COUNT(c1), MIN(c1), MAX(c1)
FROM  t;

The query reads the table (T) and scans all rows (rather than using a sample). The database also needs to calculate the number of distinct values (NDV) for each column but the query does not use COUNT(DISTINCT c1) and so on, but instead, during execution,  a special statistics gathering row source is injected into the query. The statistics gathering row source uses a one-pass, hash-based distinct algorithm to gather NDV. The algorithm requires a full scan of the data, uses a bounded amount of memory and yields a highly accurate NDV that is nearly identical to a 100 percent sampling (a fact that can be proven mathematically). The statistics gathering row source also gathers the number of rows, number of nulls and average column length. Since a full scan is used, the number of rows, average column length, minimum and maximum values are 100% accurate.

Effect of auto sample size on histogram gathering

Hybrid histogram gathering is decoupled from basic column statistics gathering and uses a sampleof column values. This technique was used in Oracle Database 11g to build height-balanced histograms. More information on this can be found in this blog post. Oracle Database 12c replaced height-balanced histograms with hybrid histograms.

Effect of auto sample size on index stats gathering

AUTO_SAMPLE_SIZE affects how index statistics are gathered. Index statistics gathering is sample-based and it can potentially go through several iterations if the sample contains too few blocks or the sample size was too small to properly gather number of distinct keys (NDKs). The algorithm has not changed since Oracle Database 11g, so I’ve left it to the previous blog to go more detail. There one other thing to note:

At the time of writing, there are some cases where index sampling can lead to NDV mis-estimates for composite indexes. The best work-around is to create a column group on the relevant columns and use gather_table_stats. Alternatively, there is a one-off fix - 27268249. This patch changes the way NDV is calculated for indexes on large tables (and no column group is required). It is available for 12.2.0.1 at the moment, but note that it cannot be backported. As you might guess, it's significantly slower than index block sampling, but it's still very fast. At the time of writing, if you find a case where index NDV is causing an issue with a query plan, then the recommended approach is to add a column group rather than attempting to apply this patch.

Summary:

Note that top frequency and hybrid histograms are new to Oracle Database 12c. Oracle Database 11g had frequency and height-balanced histograms only. Hybrid histograms replaced height-balanced histograms.

  1. The auto sample size algorithm uses a full table scan (a 100% sample) to gather basic column statistics.
  2. The cost of a full table scan (verses row sampling) is mitigated by the approximate NDV algorithm, which eliminates the need to sort column data.
  3. The approximate NDV gathered by AUTO_SAMPLE_SIZE is close to the accuracy of a 100% sample.
  4. Other basic column statistics, such as the number of nulls, average column length, minimal and maximal values have an accuracy equivalent to 100% sampling.
  5. Frequency and top frequency histograms are created using a 100%* sample of column values and are created when basic column statistics are gathered. This is different to Oracle Database 11g, which decoupled frequency histogram creation from basic column statistics gathering (and used a sample of column values).
  6. Hybrid histograms are created using a sample of column values. Internally, this step is decoupled from basic column statistics gathering.
  7. Index statistics are gathered using a sample of column values. The sample size is determined automatically.

*There is an exception to case 5, above. Frequency histograms are created using a sample if OPTIONS=>'GATHER AUTO' is used after a bulk load where statistics have been gathered using online statistics gathering.

 

oracle 12c AUTO_SAMPLE_SIZE动态采用工作机制的更多相关文章

  1. oracle 11g AUTO_SAMPLE_SIZE动态采用工作机制

    Note that if you're interested in learning about Oracle Database 12c, there's an updated version of ...

  2. 2014年2月5日 Oracle ORACLE的工作机制[转]

      网上看到一篇描写ORACLE工作机制的文章,觉得很不错!特摘录了下来.   ORACLE的工作机制-1 (by xyf_tck) 我们从一个用户请求开始讲,ORACLE的简要的工作机制是怎样的,首 ...

  3. java开发连接Oracle 12c采用PDB遇到问题记录

    今天初次使用java连接Oracle 12c,遇到各种问题,为方便后续查询,在汇总了问题记录及解决方案如下. ORA-28040: No matching authentication protoco ...

  4. 从一个简单的main方法执行谈谈JVM工作机制

    本来JVM的工作原理浅到可以泛泛而谈,但如果真的想把JVM工作机制弄清楚,实在是很难,涉及到的知识领域太多.所以,本文通过简单的mian方法执行,浅谈JVM工作原理,看看JVM里面都发生了什么. 先上 ...

  5. JavaScript工作机制:V8 引擎内部机制及如何编写优化代码的5个诀窍

    概述 JavaScript引擎是一个执行JavaScript代码的程序或解释器.JavaScript引擎可以被实现为标准解释器,或者实现为以某种形式将JavaScript编译为字节码的即时编译器. 下 ...

  6. oracle 12c 列式存储 ( In Memory 理论)

    随着Oracle 12c推出了in memory组件,使得Oracle数据库具有了双模式数据存放方式,从而能够实现对混合类型应用的支持:传统的以行形式保存的数据满足OLTP应用:列形式保存的数据满足以 ...

  7. Httpd服务入门知识-http协议版本,工作机制及http服务器应用扫盲篇

    Httpd服务入门知识-http协议版本,工作机制及http服务器应用扫盲篇 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.Internet与中国 Internet最早来源于美 ...

  8. malloc 函数工作机制(转)

    malloc()工作机制 malloc函数的实质体现在,它有一个将可用的内存块连接为一个长长的列表的所谓空闲链表.调用malloc函数时,它沿连接表寻找一个大到足以满足用户请求所需要的内存块.然后,将 ...

  9. Oracle 12C RAC的optimizer_adaptive_features造成数据插入超时

    问题分析 使用10046事件追踪方式,直接生成上传时的数据库事件日志进行分析,发现主要区别在于以下两条sql语句在每次长时间上传时都有出现,并且执行用户不是上传用户,而是数据库SYS用户. ***** ...

随机推荐

  1. statement对象与sql语句(新手)

    本篇介绍读上篇代码中的疑惑点 实现简单网页上对数据内容进行增删改查,需要用到三个部分:jsp网页前端部分+java后台程序+数据库表 一.创建一个Statement (用于在已经建立数据库连接的基础上 ...

  2. lambda 表达式拼接

    类库: using System; using System.Collections.Generic; using System.Linq; using System.Linq.Expressions ...

  3. numpy.loadtxt用法

    numpy.loadtxt(fname, dtype=, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None ...

  4. SWF加载器控件 SWFLoaderControl

    SWF加载器控件 书:165 <?xml version="1.0" encoding="utf-8"?> <s:Application xm ...

  5. 下拉列表控件实例 ComboBoxControl

    下拉列表控件实例 书:151页 <?xml version="1.0" encoding="utf-8"?> <s:Application x ...

  6. binTreepreorderTraversal二叉树前序遍历

    原题 Given a binary tree, return the preorder traversal of its nodes' values. For example: Given binar ...

  7. cache基础

    cache是系统中的一块快速SRAM,价格高,但是访问速度快,可以减少CPU到main memory的latency. cache中的术语有: 1) Cache hits,表示可以在cache中,查找 ...

  8. linux打包压缩与搜索命令

    1.tar命令 tar命令用于对文件进行打包压缩或解压,格式为“tar [选项] [文件]”.  tar命令的参数及其作用 参数 作用 -c 创建压缩文件 -x 解开压缩文件 -t 查看压缩包内有哪些 ...

  9. STL算法中函数对象和谓词

    函数对象和谓词定义 函数对象: 重载函数调用操作符的类,其对象常称为函数对象(function object),即它们是行为类似函数的对象.一个类对象,表现出一个函数的特征,就是通过“对象名+(参数列 ...

  10. mysql主从配置,读写分离

    Mysql主从配置,实现读写分离 大型网站为了软解大量的并发访问,除了在网站实现分布式负载均衡,远远不够.到了数据业务层.数据访问层,如果还是传统的数据结构,或者只是单单靠一台服务器扛,如此多的数据库 ...