Number of Distribution Keys (disKeyCount)
|
If there is no data skew, use no more than four distribution keys.
Generally, if you use many distribution keys, data can be evenly distributed in a cluster and thus avoid data skew. However, if too many distribution keys are used, the storage performance and joint query performance may deteriorate. You are advised to configure no more than four distribution keys.
- Storage performance issue:
When data is added, the hash function calculates the result of each distribution column, aggregates the results, and then determines where to distribute data. A large number of distribution keys require time-consuming, complex calculation.
- Union query performance issue:
During multi-table join query, if all the columns of the distribution key are involved in the join condition, data does not need to be redistributed in the execution plan. If a large number of distribution keys are used, some of them may not be the columns involved in the join condition, and data redistribution may occur, which consumes many resources and takes long.
|
Number of Index Columns/PCKs (indexKeyOrPckCount)
|
It is recommended that the number of partial cluster keys (PCKs)/columns of an index be less than or equal to 4.
- A large number of index columns require many resources to maintain index data, and are likely to contain duplicate indexes.
- While column-store data is imported, PCK columns are compared and calculated to determine CU division. A large number of PCKs will consume many resources and much time, affecting performance. To efficiently filter CUs in a query, the prefixes of the columns involved in the query conditions must be PCK columns. (For example, if the PCK columns are a, b, and c, the query criteria must be a>? and b>? and c>?.) Otherwise, all the CUs must be traversed, and data clustering does not contribute to query acceleration.
|
Invalid PCKs (invalidPck)
|
Do not create invalid PCK columns.
In 8.1.1 and later versions, the cluster can filter and compare data of the char, int8, int2, int4, text, bpchar, varchar, date, time, timestamp and timestamptz types. If a column of an unsupported data type is used as a PCK, the column is an invalid PCK column. It does not take effect during CU filtering and will consume resources for its maintenance.
|
numeric Data Usage (validityOfNumeric)
|
To facilitate calculation, integers are recommended if you need to use numeric types. If a value does not need to be highly precise, use a fixed-length floating data type for it.
If you have to use variable-length numeric types, you are advised to set their maximum length to 38 bits. The system will try to convert the calculation of such values to the calculation between int and bigint values to improve efficiency.
In 8.1.1 and later versions, if no precision is specified, up to 131,072 digits can be placed before the decimal point and up to 16,383 digits can be placed after the decimal point. That is, the maximum scale and precision are used. In this case, the calculation performance will deteriorate.
|
Index Column Width (widthOfIndexKey)
|
Generally, wide index columns are character string columns, which do not involve compare operations and will lead to large indexes that consume unnecessary space. Specify a value smaller than 64 bytes.
|
Replication Table Size (sizeOfCopyTable)
|
Tables whose storage space of a single DN is greater than the threshold (100 MB) are identified. Use common associated columns as distribution keys (generally with one primary key).
The cluster supports replication tables. A replication table maintains a full copy of data on each node and is mainly used to store data of enumerated types. A large replication table occupies much space and may slow down union queries, because the full table data on each node is traversed, which is time consuming. A union query on distribution tables may take less time, because the data traversed on each node is less than that of a replication table.
|
Skew Detection for Single-Distribution-Key Tables (recognitionOfDataSkew)
|
Data skew of single-distribution-key tables is detected by statistics. This audit applies only to tables with one distribution key.
|
Distribution Key Usage (validityOfDiskey)
|
In a cluster, you are not advised to use a column of the Boolean or date type as a distribution column, because it may cause data skew.
|
Number of Cached Sequence Values (cacheSizeOfSequence)
|
Specify a number greater than 100.
If a table column uses sequences, its next_value is obtained from the cached value in the local node. If cached sequence values are used up, a request will be sent asking GTM to obtain the value again. If a large amount of data is added but only a few values are cached, GTM will receive many requests, and may get overloaded and even break down. To avoid this problem, you are advised to set the cache value to a value greater than 100 when creating a sequence.
|
Optimizable Indexes (optimizableIndexKey)
|
Scenarios where indexes can be optimized:
- The index column of an index is the first N columns of another index.
- The index columns of two indexes are the same, but the orders are different.
|