Table Design

GaussDB(DWS) uses a distributed architecture in which table data is spread across data nodes (DNs). Comply with the following principles to properly design a table:

Selecting a Storage Mode

[Proposal] Selecting a storage mode is the first step in defining a table. The storage mode mainly depends on the user's service type. For details, see Table 1.

Table 1 Table storage modes and scenarios

| Storage Mode | Benefit | Drawback | Application Scenarios |
| --- | --- | --- | --- |
| Row storage | Data is stored by row. When you query a row of data, you can quickly locate the target row. | All columns of a queried row are read even when only a few columns are needed. | 1. The number of columns in the table is small, and most fields in the table are queried. 2. Point queries (simple index-based queries that return only a few records) are performed. 3. Insert, delete, update, and query operations on entire rows are performed frequently. |
| Column storage | 1. Only the columns that a query needs are read. 2. The homogeneity of data within a column facilitates efficient compression. | It is not suitable for INSERT or UPDATE operations on a small amount of data. | 1. Queries read a few columns from a table that contains a large number of columns. 2. Statistical analysis queries (requiring a large number of association and grouping operations) are performed. 3. Ad hoc queries (using uncertain query conditions and unable to utilize indexes to scan row-store tables) are performed. |
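As a rough illustration of the two storage modes, the sketch below creates one row-store table and one column-store table. The table and column names (customer_row, sales_col, and their columns) are hypothetical, and the sketch assumes that the ORIENTATION storage parameter of CREATE TABLE selects the storage mode. Check the syntax reference for your cluster version before relying on it.

```sql
-- Hypothetical lookup table: few columns, point queries,
-- and frequent changes to entire rows, so row storage fits.
CREATE TABLE customer_row
(
  customer_id  INTEGER NOT NULL,
  name         VARCHAR(50),
  phone        VARCHAR(20)
)
WITH (ORIENTATION = ROW);

-- Hypothetical analytic table: queries usually aggregate a few
-- columns over many rows, so column storage fits.
CREATE TABLE sales_col
(
  sale_id      BIGINT NOT NULL,
  customer_id  INTEGER,
  sale_date    DATE,
  amount       NUMERIC(12,2),
  region       VARCHAR(20)
)
WITH (ORIENTATION = COLUMN);
```

A query such as SELECT region, sum(amount) FROM sales_col GROUP BY region would then need to read only two of the five columns.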

Selecting a Distribution Mode

[Proposal] Comply with the following rules to distribute table data.

Table 2 Table distribution modes and scenarios

| Distribution Mode | Description | Application Scenarios |
| --- | --- | --- |
| Hash | Table data is distributed on all DNs in a cluster by hash. | Fact tables containing a large amount of data |
| Replication | Full data in a table is stored on every DN in a cluster. | Dimension tables and fact tables containing a small amount of data |
| Round-robin | Rows of the table are sent to the DNs in turn, one row per DN at a time. Therefore, data is evenly distributed on each DN. | Fact tables that contain a large amount of data and for which no suitable distribution column can be found for hash mode |
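The three distribution modes could be declared roughly as follows. This is a minimal sketch: the table and column names are hypothetical, it assumes the DISTRIBUTE BY clause of CREATE TABLE, and the availability of ROUNDROBIN may depend on the cluster version.

```sql
-- Hypothetical large fact table: rows are spread across all DNs
-- by hashing the distribution column order_id.
CREATE TABLE fact_orders
(
  order_id   BIGINT NOT NULL,
  region_id  INTEGER,
  amount     NUMERIC(12,2)
)
DISTRIBUTE BY HASH (order_id);

-- Hypothetical small dimension table: every DN keeps a full copy,
-- so joins against it need no data redistribution.
CREATE TABLE dim_region
(
  region_id    INTEGER NOT NULL,
  region_name  VARCHAR(40)
)
DISTRIBUTE BY REPLICATION;

-- Hypothetical large table with no suitable hash column:
-- rows are handed to the DNs in turn.
CREATE TABLE fact_events
(
  event_time  TIMESTAMP,
  payload     VARCHAR(200)
)
DISTRIBUTE BY ROUNDROBIN;
```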

Selecting a Partitioning Mode

Comply with the following rules to partition a table containing a large amount of data:

An example of a partitioned table definition is as follows:

CREATE TABLE staffS_p1
(
  staff_ID       NUMBER(6) not null,
  FIRST_NAME     VARCHAR2(20),
  LAST_NAME      VARCHAR2(25),
  EMAIL          VARCHAR2(25),
  PHONE_NUMBER   VARCHAR2(20),
  HIRE_DATE      DATE,
  employment_ID  VARCHAR2(10),
  SALARY         NUMBER(8,2),
  COMMISSION_PCT NUMBER(4,2),
  MANAGER_ID     NUMBER(6),
  section_ID     NUMBER(4)
)
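-- Range-partition the rows by hire date.
PARTITION BY RANGE (HIRE_DATE)
(
   -- Rows with a hire date earlier than 1995-05-01.
   PARTITION HIRE_19950501 VALUES LESS THAN ('1995-05-01 00:00:00'),
   -- Rows with a hire date on 1995-05-01.
   PARTITION HIRE_19950502 VALUES LESS THAN ('1995-05-02 00:00:00'),
   -- Catch-all partition for every later hire date.
   PARTITION HIRE_maxvalue VALUES LESS THAN (MAXVALUE)
);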
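
To show how rows map onto the partitions defined above, the following sketch inserts three sample rows and then reads them back. The sample values are made up for illustration, and the PARTITION (partition_name) clause in SELECT is assumed to be available in your cluster version.

```sql
-- Each row is routed to a partition by its HIRE_DATE value.
INSERT INTO staffS_p1 (staff_ID, HIRE_DATE)
VALUES (1, '1995-04-30 00:00:00');  -- stored in HIRE_19950501
INSERT INTO staffS_p1 (staff_ID, HIRE_DATE)
VALUES (2, '1995-05-01 00:00:00');  -- stored in HIRE_19950502
INSERT INTO staffS_p1 (staff_ID, HIRE_DATE)
VALUES (3, '2001-01-01 00:00:00');  -- stored in HIRE_maxvalue

-- Read a single partition by name.
SELECT staff_ID, HIRE_DATE
FROM staffS_p1 PARTITION (HIRE_19950502);

-- A filter on the partition key can let the planner skip
-- partitions that cannot contain matching rows.
SELECT count(*)
FROM staffS_p1
WHERE HIRE_DATE < '1995-05-01 00:00:00';
```

Note that the upper bound of each partition is exclusive, so a hire date of exactly 1995-05-01 00:00:00 falls into HIRE_19950502 rather than HIRE_19950501.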

Selecting a Distribution Key

Selecting a distribution key is important for a hash-distributed table. An improper distribution key may cause data skew, which concentrates the I/O load on a few DNs and degrades overall query performance. After you select a distribution policy for a hash-distributed table, check for data skew to ensure that data is evenly distributed. Comply with the following rules to select a distribution key: