Statistics Enhancements Teradata 14

Page 1

Teradata 14 Statistics Enhancement

TERADATA 14 STATISTICS ENHANCEMENT

By: Smarak Das Employee Id: 391485 Project: Kaiser Permanente Role: Teradata DBA 1


Teradata 14 Statistics Enhancement

1. INTRODUCTION The essential task of the optimizer is to produce the optimal execution plan (i.e., with lowest cost) among many possible plans. The estimations in the Teradata Database are derived primarily from statistics on tables, columns and indexes. But, collecting statistics involves time and resources. So, it is important for the users of the Teradata Database to understand the new statistics enhancements and make best use of them to keep the cost of statistics collection and maintenance to the minimum.

2. AUDIENCE The targeted audience for this document is experienced Teradata Database administrators and those with a background in query tuning and physical database design. However, the content is designed to be readily understood by any reader with a reasonable background in database technologies. This Document has been prepared by referencing several Teradata Guides or Orange Books.

2


Teradata 14 Statistics Enhancement

3. NEW STATISTICS COLLECTION OPTIONS 3.1 COLLECT SUMMARY Statistics A new option called SUMMARY is introduced to collect only the table-level statistical information such as row count, average block size, average row size, etc. without the histogram detail. This option can be used to provide up-to-date summary information to the optimizer in a quick and efficient way. This runs very quickly with negligible impact to the system resources. Note that SUMMARY statistics are different from regular column or index statistics and doesn’t contain any histogram. When SUMMARY option is specified in a collect statistics statement, no column or index specification is allowed. The SUMMARY statistics are automatically collected when statistics are collected or recollected on any column or index. Therefore, it is not required to explicitly submit a request to collect summary statistical information when statistics are being collected or recollected for individual columns or indexes. The optimizer depends on the summary statistical information to estimate the table row count and cost to do a full table scan of the table. For a row count estimate, it depended on the primary index and/or PARTITION statistics in releases prior to Teradata Database 14.0.

SYNTAX: COLLECT SUMMARY STATISTICS ON Orders;

3.2 SAMPLE STATISTICS The enhanced SAMPLE option allows the users to customize sample percentage for different columns. The recommended approach to collect sample statistics is to use the system determined sample percentage by specifying the SYSTEM SAMPLE option. If the final statistics are not reasonably accurate with the system-determined sample percentage, this new capability can be used to customize the sample percentage. The sampling options are remembered when explicitly specified and are applied during recollections.

3


Teradata 14 Statistics Enhancement

The following are the different variations of the sampling options:

SYNTAX: COLLECT STATISTICS USING SYSTEM SAMPLE COLUMN EmployeeID ON Employee; COLLECT STATISTICS USING SAMPLE 20 PERCENT COLUMN EmployeeID ON Employee;

3.3 MAX INTERVALS The maximum number of intervals of a histogram can be customized using the MAXINTERVALS option. If this option is not specified, the system uses a default maximum number of intervals which is defined as 250. A greater number of intervals increases the granularity of the data in the histogram and helps to get better single-table and join selectivity estimations for non-uniform data. However, this should be done selectively as needed (for columns involved in a predicate exhibiting over or under estimations) as it increases the size of the histogram which can increase the query optimization time.

4


Teradata 14 Statistics Enhancement 3.4 MAXVALUELENGTH Histogram records values such as min, max, mode and biased values. The MAXVALUELENGTH refers to the maximum value length that can be used to build these values. If the value length is larger than the system-determined or specified maximum, it gets truncated. Note that in prior releases of the Teradata Database 14.0, the maximum value length could not exceed 16 bytes. Increase the maximum value length for columns that require more detailed information in the histogram to improve single-table predicate selectivity estimations. However, this should be done selectively as needed as it increases the size of the histogram which can increase the query optimization time. For example, if a UserId column has the same characters for the first 25 characters for all users, the default value length truncates the data to the first 25 characters which makes the histogram unreliable for estimations. Increasing the value length to the length which can uniquely identify each UserId greatly improves the selectivity for the single-table predicates based on this column. If this option is not specified, the system uses a default maximum value length which is defined as 25. For single-column statistics of non-LATIN type, the default is used as 25 Unicode characters which translate to 50 bytes. In all other cases, it is used as 25 bytes.

SYNTAX:   

COLLECT STATISTICS COLUMN EmployeeID ON Employee; COLLECT STATISTICS USING SYSTEM MAXVALUELENGTH COLUMN EmployeeID ON Employee; COLLECT STATISTICS USING MAXVALUELENGTH 50 COLUMN EmployeeID ON Employee;

3.5 COLUMN ORDERING FOR MULTICOLUMN STATISTICS The column ordering specified in the collect statistics statement is honored with the exception of the statistics collected with INDEX specification. Column ordering is useful when collecting multicolumn statistics on columns having single-table and join predicates or group by clause. It is recommended to use columns having single-table predicates as leading columns. In other cases, the ordering of the columns does not have any relevance. 5


Teradata 14 Statistics Enhancement

COLLECT STATISTICS COLUMN (EmployeeID, EmployeeName) ON EMPLOYEE;

COLLECT STATISTICS COLUMN (EmployeeName, EmployeeID) ON EMPLOYEE; [Not Allowed as ”Column” Type Stats Collection]

COLLECT STATISTICS INDEX (EmployeeID, EmployeeName) ON EMPLOYEE;

COLLECT STATISTICS INDEX (EmployeeName, EmployeeID) ON EMPLOYEE; [Allowed as “INDEX” Type Stats Collection]

3.6 COLLECTING PARTITION STATISTICS Statistics can be collected on the system-derived column PARTITION as before. When statistics are collected on PARTITION on row and column partitioned tables, the system gathers the following two histograms. (1) A row partitioning histogram with combined partition number and the corresponding number of active rows. (2) A column partitioning histogram with the column partition number and the corresponding compression ratio for each column partition. The collection of PARTITION statistics is highly optimized and runs quickly with minimal impact to the system resources. The collection of PARTITION statistics is highly recommended on all partitioned tables. PARTITION statistics can also be collected on non-partitioned tables. All the rows are considered to be in a single partition (with PARTITION number of 0) for nonpartitioned tables. This also results in updating the table-level demographics (as noted before, collecting statistics on any column automatically updates the table-level demographics). The only advantage of this operation is to update the table-level demographics indirectly. In Teradata Database 14.0, to provide up-to-date table row count information to the optimizer, use SUMMARY statistics collection instead of PARTITION statistics on non-partitioned tables.

6


Teradata 14 Statistics Enhancement

4. DROP STATISTICS When the statistics are dropped on the table by explicit column or index specification, the summary statistical information is not dropped. Only the table-level drop statement can drop the summary statistical information. This is done to retain the summary statistical information of the table which is useful to the optimizer even when no column or index statistics are present

SYNTAX: 

DROP STATS ON customer;

5. HELP STATISTICS The HELP STATISTICS statement displays the summary of statistical information. It is enhanced to display the summary statistical information with column name “*”. A new CURRENT clause is introduced to display the extrapolated statistical information.

7


Teradata 14 Statistics Enhancement

6. SHOW STATISTICS The SHOW STATISTICS statement displays the data definition statement more recently used to collect the statistics with the most recent USING options in effect.

8


Teradata 14 Statistics Enhancement 7. STATISTICS COLLECTION PERFORMANCE IMPROVEMENTS Collecting statistics is resource intensive in terms of CPU and IO. The statistics collection process has been enhanced with various optimization techniques to reduce the resource consumption. The new optimizer logic analyzes all the applicable access paths to read the base table data during statistics collection and chooses the optimal one based on the cost. These optimizations are automatic and are cost-based. When statistics are collected the first time, some of the optimizations may not take place as the optimizer may not have the required statistical information to do the costing. The optimizer will be more aggressive in considering all the applicable optimizations during statistics recollections.

(a) ROLLUP AGGREGATIONS When statistics are requested on both a single-column and a multicolumn on the same set of columns, the aggregation result used to produce multicolumn statistics is reused and rolled up to produce the single-column statistics. For example, if statistics are requested on (Employee_ID, Employee_Name), (Employee_Name) and (Employee_ID), the optimizer performs an aggregation based on (Employee_ID, Employee_Name) first and uses it to roll up to individual columns Employee_ID and Employee_Name.

(b) PRE-AGGREGATIONS When multiple statistics are requested in a single-statement, the optimizer considers the option of pre-aggregations followed by a rollup to the specified column combinations. This avoids reading the base table multiple times and also helps the subsequent aggregations as they read smaller data set since the initial pre-aggregations collapses the data based on the pre-aggregation grouping columns.

9


Teradata 14 Statistics Enhancement

8. STATISTICS CACHE In pre-Teradata Database 14.0 releases, statistics are cached in the general purpose dictionary cache. Since it is expected there will be an increase in histogram size with the new options such as increased number of intervals and value length, a new dedicated cache is implemented for statistics to keep the query parsing time low.

9. STALE STATISTICS EXTRAPOLATIONS Statistics extrapolation functionality is designed to reduce the frequency of statistics recollections. Even though recollecting the statistics at regular intervals is still required to get optimal plans, this feature is designed to eliminate the need to do expensive statistics recollections on a daily basis particularly on DATE and TIMESTAMP columns. Understanding how the optimizer detects stale statistics and does extrapolations helps Teradata Database users to take full advantage of this feature. This is especially important because extrapolations are predictions based on what is known to and assumed by the optimizer about the newly added data. Even though extrapolations provide better estimates for a majority of the scenarios, they may not help in all scenarios. The optimizer attempts to extrapolate the following statistical information. (1) Table row count (2) Stale Histograms (a) Number of distinct values (b) Number of nulls (c) High mode frequency (d) Maximum value of the histogram Note that the extrapolation works for only upward data growth. If the table size (number of rows) is constant or smaller, the extrapolations may not happen.

10


Teradata 14 Statistics Enhancement

10.

GUIDELINES FOR USE

Use the following guidelines to optimize the collect statistics processing and to simplify the use of various collect statistics operations. When multiple statistics need to be collected, it is recommended to group the statistics into a single collect statistics statement. For first time collections, group all the statistics that have the same USING options. For recollections, use table-level collect statistics command without any column or index specification. This enables the optimizer to apply global optimizations such as pre-aggregation and aggregation rollups to speed to up the collection process. Use SHOW STATISTICS statement to export the statistics from one system and import them in another system in dual active environments. This allows the system resources to be consumed in only one system. Do not copy or transfer SUMMARY statistics between tables within a system or across the systems. It is recommended to recollect them natively. Do not copy or transfer the PARTITION statistics between tables within a system or across the systems when their partitioning expressions don’t match or when the database issues a warning (in some cases the internal PARTITION numbers can differ even with the same partitioning expression; database identifies these scenarios and issues a warning when attempted to copy within a system). It is recommended to recollect the PARTITION statistics natively in these scenarios which runs very quickly. When a pattern of rows within a column changes completely (an example can be a new set of product codes in the beginning of the year) or when the data has sudden highs or lows (not following the normal pattern), the history records from the histogram may not correctly predict the nature of the newly added data and can potentially cause unreasonable extrapolations. It is recommended to drop the statistics (which also gets rid of the history records), and recollect the statistics. Recollect SUMMARY statistics after the data loads. This command runs quickly with negligible impact to the system resources and provides up-to-date row count to the optimizer. The accurate row count is a crucial piece of information for the optimizer to identify stale statistics and to enable extrapolations.

11


Turn static files into dynamic content formats.

Create a flipbook
Issuu converts static files into: digital portfolios, online yearbooks, online catalogs, digital photo albums and more. Sign up and create your flipbook.