doc-exports/docs/dws/dev/dws_04_0079.html
Lu, Huayi ef0ada5a59 DWS DEV 20240716 version
Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com>
Co-authored-by: Lu, Huayi <luhuayi@huawei.com>
Co-committed-by: Lu, Huayi <luhuayi@huawei.com>
2024-11-02 09:07:47 +00:00


<a name="EN-US_TOPIC_0000001510522473"></a><a name="EN-US_TOPIC_0000001510522473"></a>
<h1 class="topictitle1">GaussDB(DWS) Table Design Rules</h1>
<div id="body1528162486098"><p id="EN-US_TOPIC_0000001510522473__p27908691"><span id="EN-US_TOPIC_0000001510522473__text440718349">GaussDB(DWS)</span> uses a distributed architecture in which table data is spread across the data nodes (DNs). Follow these principles to design tables properly:</p>
<ul id="EN-US_TOPIC_0000001510522473__ul721556195614"><li id="EN-US_TOPIC_0000001510522473__li121556105617">[Notice] Distribute data evenly across the DNs to prevent data skew. If most of the data is stored on only a few DNs, the effective capacity of the cluster decreases. Select a suitable distribution column to avoid data skew.</li><li id="EN-US_TOPIC_0000001510522473__li103881958125612">[Notice] Make sure that queries scan the DNs evenly. Otherwise, the most frequently scanned DNs become the performance bottleneck. For example, when you use equality filter conditions on a fact table, the nodes may not be scanned evenly.</li><li id="EN-US_TOPIC_0000001510522473__li109624020578">[Notice] Reduce the amount of data to be scanned, for example, by using the partition pruning mechanism of a partitioned table.</li><li id="EN-US_TOPIC_0000001510522473__li1156812495712">[Notice] Minimize random I/O. Clustering or local clustering stores hot data sequentially, which converts random I/O into sequential I/O and reduces the I/O cost of scans.</li><li id="EN-US_TOPIC_0000001510522473__li579110645713">[Notice] Avoid data shuffling where possible. Shuffling physically transfers data from one node to another and consumes network resources unnecessarily. Using proper join and grouping conditions minimizes shuffling, so that data is processed locally, network pressure is reduced, and cluster performance and concurrency improve.</li></ul>
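<p>The local clustering mentioned above can be sketched with a partial cluster key on a column-store table. This is a minimal illustration, not a recommended schema: the table <strong>sales_fact</strong> and all of its columns are hypothetical, and the choice of <strong>sale_date</strong> as the clustering column assumes that queries usually filter on that column.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre>-- Hypothetical column-store fact table; all names are illustrative.
CREATE TABLE sales_fact
(
    sale_id     BIGINT NOT NULL,
    customer_id INTEGER,
    sale_date   DATE,
    amount      NUMERIC(12,2),
    -- Local clustering: rows are sorted by sale_date within each load batch,
    -- so scans that filter on sale_date can read data more sequentially.
    PARTIAL CLUSTER KEY (sale_date)
)
WITH (ORIENTATION = COLUMN)
DISTRIBUTE BY HASH (sale_id);
</pre></div></td></tr></table></div>
</div>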
<div class="section" id="EN-US_TOPIC_0000001510522473__section189043059150"><h4 class="sectiontitle">Selecting a Storage Mode</h4><p id="EN-US_TOPIC_0000001510522473__p46309534">[Proposal] Selecting a storage mode is the first step in defining a table. The storage mode mainly depends on the user's service type. For details, see <a href="#EN-US_TOPIC_0000001510522473__table3891877">Table 1</a>.</p>
<div class="tablenoborder"><a name="EN-US_TOPIC_0000001510522473__table3891877"></a><a name="table3891877"></a><table cellpadding="4" cellspacing="0" summary="" id="EN-US_TOPIC_0000001510522473__table3891877" frame="border" border="1" rules="all"><caption><b>Table 1 </b>Table storage modes and scenarios</caption><thead align="left"><tr id="EN-US_TOPIC_0000001510522473__row12104456"><th align="left" class="cellrowborder" valign="top" width="19.73%" id="mcps1.3.3.3.2.3.1.1"><p id="EN-US_TOPIC_0000001510522473__p40936856">Storage Mode</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="80.27%" id="mcps1.3.3.3.2.3.1.2"><p id="EN-US_TOPIC_0000001510522473__p46632853">Application Scenarios</p>
</th>
</tr>
</thead>
<tbody><tr id="EN-US_TOPIC_0000001510522473__row38265132"><td class="cellrowborder" valign="top" width="19.73%" headers="mcps1.3.3.3.2.3.1.1 "><p id="EN-US_TOPIC_0000001510522473__p12468015">Row storage</p>
</td>
<td class="cellrowborder" valign="top" width="80.27%" headers="mcps1.3.3.3.2.3.1.2 "><ul id="EN-US_TOPIC_0000001510522473__ul61112063105242"><li id="EN-US_TOPIC_0000001510522473__li55036904105242">Point queries (simple index-based queries that only return a few records)</li><li id="EN-US_TOPIC_0000001510522473__li46472609105247">Scenarios requiring frequent addition, deletion, and modification</li></ul>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001510522473__row64051613"><td class="cellrowborder" valign="top" width="19.73%" headers="mcps1.3.3.3.2.3.1.1 "><p id="EN-US_TOPIC_0000001510522473__p20798169">Column storage</p>
</td>
<td class="cellrowborder" valign="top" width="80.27%" headers="mcps1.3.3.3.2.3.1.2 "><ul id="EN-US_TOPIC_0000001510522473__ul38359637105253"><li id="EN-US_TOPIC_0000001510522473__li50687390105253">Statistical analysis queries (requiring a large number of join and grouping operations)</li><li id="EN-US_TOPIC_0000001510522473__li60756677105259">Ad hoc queries (whose conditions are not known in advance, so that indexes cannot be used to scan row-store tables)</li></ul>
</td>
</tr>
</tbody>
</table>
</div>
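<p>As a rough sketch of the two storage modes in Table 1, the statements below create one row-store table and one column-store table. The table and column names are hypothetical and chosen only for illustration; the storage mode is set through the <strong>ORIENTATION</strong> storage parameter.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre>-- Row storage: suits point queries and frequent INSERT/UPDATE/DELETE.
-- (customer_row and its columns are hypothetical.)
CREATE TABLE customer_row
(
    customer_id INTEGER NOT NULL,
    name        VARCHAR(50),
    city        VARCHAR(30)
)
WITH (ORIENTATION = ROW)
DISTRIBUTE BY HASH (customer_id);

-- Column storage: suits statistical analysis and ad hoc queries
-- that read a few columns of many rows.
-- (sales_col and its columns are hypothetical.)
CREATE TABLE sales_col
(
    sale_id     BIGINT NOT NULL,
    customer_id INTEGER,
    sale_date   DATE,
    amount      NUMERIC(12,2)
)
WITH (ORIENTATION = COLUMN)
DISTRIBUTE BY HASH (sale_id);
</pre></div></td></tr></table></div>
</div>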
</div>
<div class="section" id="EN-US_TOPIC_0000001510522473__section4953718391536"><h4 class="sectiontitle">Selecting a Distribution Mode</h4><div class="p" id="EN-US_TOPIC_0000001510522473__p15437549133613">[Proposal] Comply with the following rules to distribute table data.
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="EN-US_TOPIC_0000001510522473__table56061421" frame="border" border="1" rules="all"><caption><b>Table 2 </b>Table distribution modes and scenarios</caption><thead align="left"><tr id="EN-US_TOPIC_0000001510522473__row28830064"><th align="left" class="cellrowborder" valign="top" width="19.99%" id="mcps1.3.4.2.1.2.4.1.1"><p id="EN-US_TOPIC_0000001510522473__p1734838511855">Distribution Mode</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="38.79%" id="mcps1.3.4.2.1.2.4.1.2"><p id="EN-US_TOPIC_0000001510522473__p5210823411855">Description</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="41.22%" id="mcps1.3.4.2.1.2.4.1.3"><p id="EN-US_TOPIC_0000001510522473__p5583513911855">Application Scenarios</p>
</th>
</tr>
</thead>
<tbody><tr id="EN-US_TOPIC_0000001510522473__row4741815"><td class="cellrowborder" valign="top" width="19.99%" headers="mcps1.3.4.2.1.2.4.1.1 "><p id="EN-US_TOPIC_0000001510522473__p48542757">Hash</p>
</td>
<td class="cellrowborder" valign="top" width="38.79%" headers="mcps1.3.4.2.1.2.4.1.2 "><p id="EN-US_TOPIC_0000001510522473__p39649219">Table data is distributed on all DNs in a cluster by hash.</p>
</td>
<td class="cellrowborder" valign="top" width="41.22%" headers="mcps1.3.4.2.1.2.4.1.3 "><p id="EN-US_TOPIC_0000001510522473__p57470137">Fact tables containing a large amount of data</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001510522473__row47469189"><td class="cellrowborder" valign="top" width="19.99%" headers="mcps1.3.4.2.1.2.4.1.1 "><p id="EN-US_TOPIC_0000001510522473__p19799091">Replication</p>
</td>
<td class="cellrowborder" valign="top" width="38.79%" headers="mcps1.3.4.2.1.2.4.1.2 "><p id="EN-US_TOPIC_0000001510522473__p1648503511217">Full data in a table is stored on every DN in a cluster.</p>
</td>
<td class="cellrowborder" valign="top" width="41.22%" headers="mcps1.3.4.2.1.2.4.1.3 "><p id="EN-US_TOPIC_0000001510522473__p46187709">Dimension tables and fact tables containing a small amount of data</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001510522473__row8298203115311"><td class="cellrowborder" valign="top" width="19.99%" headers="mcps1.3.4.2.1.2.4.1.1 "><p id="EN-US_TOPIC_0000001510522473__p16298193173114">Round-robin</p>
</td>
<td class="cellrowborder" valign="top" width="38.79%" headers="mcps1.3.4.2.1.2.4.1.2 "><p id="EN-US_TOPIC_0000001510522473__p92981631113114">Rows of the table are sent to the DNs in turn, one row per DN at a time. Therefore, data is evenly distributed across the DNs.</p>
</td>
<td class="cellrowborder" valign="top" width="41.22%" headers="mcps1.3.4.2.1.2.4.1.3 "><p id="EN-US_TOPIC_0000001510522473__p19298931113113">Fact tables that contain a large amount of data and have no suitable distribution column for hash mode</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
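<p>The three distribution modes in Table 2 map to the <strong>DISTRIBUTE BY</strong> clause of <strong>CREATE TABLE</strong>. The following is a minimal sketch: all table and column names are hypothetical, and the round-robin statement assumes a cluster version that supports that mode.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre>-- Hash: large fact table, distributed by a column with many distinct values.
CREATE TABLE orders_fact
(
    order_id    BIGINT NOT NULL,
    customer_id INTEGER,
    amount      NUMERIC(12,2)
)
DISTRIBUTE BY HASH (order_id);

-- Replication: small dimension table, full copy stored on every DN.
CREATE TABLE region_dim
(
    region_id   INTEGER NOT NULL,
    region_name VARCHAR(40)
)
DISTRIBUTE BY REPLICATION;

-- Round-robin: large table with no suitable hash distribution column.
CREATE TABLE event_log
(
    event_time  TIMESTAMP,
    message     VARCHAR(200)
)
DISTRIBUTE BY ROUNDROBIN;
</pre></div></td></tr></table></div>
</div>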
</div>
<div class="section" id="EN-US_TOPIC_0000001510522473__section3098621691543"><h4 class="sectiontitle">Selecting a Partitioning Mode</h4><p id="EN-US_TOPIC_0000001510522473__p9149005">Comply with the following rules to partition a table containing a large amount of data:</p>
<ul id="EN-US_TOPIC_0000001510522473__ul493112515714"><li id="EN-US_TOPIC_0000001510522473__li189320255571">[Proposal] Create partitions on columns whose values fall into clear ranges, such as dates and regions.</li><li id="EN-US_TOPIC_0000001510522473__li649113276571">[Proposal] A partition name should reflect the data characteristics of the partition. For example, use the format Keyword+Range characteristics.</li><li id="EN-US_TOPIC_0000001510522473__li6867122905711">[Proposal] Set the upper limit of the last partition to <strong id="EN-US_TOPIC_0000001510522473__b842352706185941">MAXVALUE</strong> so that rows with values beyond the defined ranges can still be inserted.</li></ul>
<p id="EN-US_TOPIC_0000001510522473__p2871939">The following is an example of a partitioned table definition:</p>
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001510522473__screen31427626105856"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">staffS_p1</span>
<span class="p">(</span>
<span class="w"> </span><span class="n">staff_ID</span><span class="w"> </span><span class="nb">NUMBER</span><span class="p">(</span><span class="mi">6</span><span class="p">)</span><span class="w"> </span><span class="k">not</span><span class="w"> </span><span class="k">null</span><span class="p">,</span>
<span class="w"> </span><span class="n">FIRST_NAME</span><span class="w"> </span><span class="n">VARCHAR2</span><span class="p">(</span><span class="mi">20</span><span class="p">),</span>
<span class="w"> </span><span class="n">LAST_NAME</span><span class="w"> </span><span class="n">VARCHAR2</span><span class="p">(</span><span class="mi">25</span><span class="p">),</span>
<span class="w"> </span><span class="n">EMAIL</span><span class="w"> </span><span class="n">VARCHAR2</span><span class="p">(</span><span class="mi">25</span><span class="p">),</span>
<span class="w"> </span><span class="n">PHONE_NUMBER</span><span class="w"> </span><span class="n">VARCHAR2</span><span class="p">(</span><span class="mi">20</span><span class="p">),</span>
<span class="w"> </span><span class="n">HIRE_DATE</span><span class="w"> </span><span class="nb">DATE</span><span class="p">,</span>
<span class="w"> </span><span class="n">employment_ID</span><span class="w"> </span><span class="n">VARCHAR2</span><span class="p">(</span><span class="mi">10</span><span class="p">),</span>
<span class="w"> </span><span class="n">SALARY</span><span class="w"> </span><span class="nb">NUMBER</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span>
<span class="w"> </span><span class="n">COMMISSION_PCT</span><span class="w"> </span><span class="nb">NUMBER</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span><span class="mi">2</span><span class="p">),</span>
<span class="w"> </span><span class="n">MANAGER_ID</span><span class="w"> </span><span class="nb">NUMBER</span><span class="p">(</span><span class="mi">6</span><span class="p">),</span>
<span class="w"> </span><span class="n">section_ID</span><span class="w"> </span><span class="nb">NUMBER</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span>
<span class="p">)</span>
<span class="n">PARTITION</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">RANGE</span><span class="w"> </span><span class="p">(</span><span class="n">HIRE_DATE</span><span class="p">)</span>
<span class="p">(</span><span class="w"> </span>
<span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">HIRE_19950501</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">LESS</span><span class="w"> </span><span class="k">THAN</span><span class="w"> </span><span class="p">(</span><span class="s1">'1995-05-01 00:00:00'</span><span class="p">),</span>
<span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">HIRE_19950502</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">LESS</span><span class="w"> </span><span class="k">THAN</span><span class="w"> </span><span class="p">(</span><span class="s1">'1995-05-02 00:00:00'</span><span class="p">),</span>
<span class="w"> </span><span class="n">PARTITION</span><span class="w"> </span><span class="n">HIRE_maxvalue</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="k">LESS</span><span class="w"> </span><span class="k">THAN</span><span class="w"> </span><span class="p">(</span><span class="k">MAXVALUE</span><span class="p">)</span>
<span class="p">);</span>
</pre></div></td></tr></table></div>
</div>
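<p>To see how partition pruning reduces the amount of scanned data, you can inspect the execution plan of a query that filters on the partition key. The query below is only a sketch against the <strong>staffS_p1</strong> table defined above. Given the partition bounds, only partition <strong>HIRE_19950502</strong> can contain matching rows, so the plan is expected to skip the other partitions. The exact plan output depends on the cluster version and is not shown here.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre>-- Filter on the partition key HIRE_DATE so that the optimizer can prune partitions.
-- The range below matches the bounds of partition HIRE_19950502.
EXPLAIN
SELECT staff_ID, LAST_NAME, HIRE_DATE
FROM staffS_p1
WHERE HIRE_DATE &gt;= '1995-05-01 00:00:00'
  AND HIRE_DATE &lt;  '1995-05-02 00:00:00';
</pre></div></td></tr></table></div>
</div>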
</div>
<div class="section" id="EN-US_TOPIC_0000001510522473__section1304242791554"><h4 class="sectiontitle">Selecting a Distribution Key</h4><p id="EN-US_TOPIC_0000001510522473__p64693751">Selecting a distribution key is important for a hash table. An improper distribution key may cause data skew. As a result, the I/O load is concentrated on a few DNs, which degrades overall query performance. After you select a distribution key for a hash table, check for data skew to ensure that data is evenly distributed. Comply with the following rules to select a distribution key:</p>
<ul id="EN-US_TOPIC_0000001510522473__ul16936634195712"><li id="EN-US_TOPIC_0000001510522473__li793617341577">[Proposal] Select a column whose values are highly discrete as the distribution key, so that data can be evenly distributed across the DNs. If a single column is not discrete enough, consider using multiple columns as the distribution key. You can select the primary key of a table as the distribution key. For example, in an employee information table, select the certificate number column as the distribution key.</li><li id="EN-US_TOPIC_0000001510522473__li173441037125714">[Proposal] Provided that the first rule is met, do not select a column that is filtered by a constant condition as the distribution key. For example, if queries on the <strong id="EN-US_TOPIC_0000001510522473__b6552105162316">dwcjk</strong> table filter the <strong id="EN-US_TOPIC_0000001510522473__b1855212514234">zqdh</strong> column with the constant condition <strong id="EN-US_TOPIC_0000001510522473__b14552135120238">zqdh='000001'</strong>, avoid selecting the <strong id="EN-US_TOPIC_0000001510522473__b20552105152319">zqdh</strong> column as the distribution key.</li><li id="EN-US_TOPIC_0000001510522473__li1996693912574">[Proposal] Provided that the first and second rules are met, select the columns used in the join conditions of queries as the distribution key. When a join column is the distribution key, the rows to be joined are stored on the same DN, which greatly reduces the cost of moving data between DNs.</li></ul>
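<p>The rules above can be sketched as follows. Two hypothetical tables, <strong>customers</strong> and <strong>orders</strong>, are both distributed on the join column <strong>customer_id</strong>, so matching rows are stored on the same DN. The last query is one possible way to check for data skew: it counts rows per DN using the <strong>xc_node_id</strong> system column. All table and column names are assumptions made for this example.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre>-- Both tables use the join column as the distribution key (hypothetical schema).
CREATE TABLE customers
(
    customer_id INTEGER NOT NULL,
    name        VARCHAR(50)
)
DISTRIBUTE BY HASH (customer_id);

CREATE TABLE orders
(
    order_id    BIGINT NOT NULL,
    customer_id INTEGER,
    amount      NUMERIC(12,2)
)
DISTRIBUTE BY HASH (customer_id);

-- The join and the grouping both use customer_id, so each DN can
-- process its own rows without redistributing data.
SELECT c.customer_id, SUM(o.amount) AS total_amount
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id;

-- Skew check: count the rows stored on each DN.
-- Large differences between DNs suggest that the distribution key is unsuitable.
SELECT xc_node_id, COUNT(*) AS row_count
FROM orders
GROUP BY xc_node_id
ORDER BY row_count DESC;
</pre></div></td></tr></table></div>
</div>
<p>If a few customers account for most of the orders, distributing <strong>orders</strong> on <strong>customer_id</strong> could itself cause skew. In that case the first rule takes priority over co-locating the join.</p>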
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="dws_04_0077.html">GaussDB(DWS) Database Object Design</a></div>
</div>
</div>