forked from docs/doc-exports
Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Lu, Huayi <luhuayi@huawei.com> Co-committed-by: Lu, Huayi <luhuayi@huawei.com>
298 lines
56 KiB
HTML
298 lines
56 KiB
HTML
<a name="EN-US_TOPIC_0000001188642150"></a><a name="EN-US_TOPIC_0000001188642150"></a>
|
|
|
|
<h1 class="topictitle1">Optimizing Data Skew</h1>
|
|
<div id="body8662426"><p id="EN-US_TOPIC_0000001188642150__p1567616298220">Data skew breaks the balance among nodes in the distributed MPP architecture. If the amount of data stored or processed by a node is much greater than that by other nodes, the following problems may occur:</p>
|
|
<ul id="EN-US_TOPIC_0000001188642150__ul10903133918227"><li id="EN-US_TOPIC_0000001188642150__li19033392222">Storage skew severely limits the system capacity. The skew on a single node hinders system storage utilization.</li><li id="EN-US_TOPIC_0000001188642150__li1790317398220">Computing skew severely affects performance. The data to be processed on the skew node is much more than that on other nodes, deteriorating overall system performance.</li><li id="EN-US_TOPIC_0000001188642150__li142637597507">Data skew severely affects the scalability of the MPP architecture. During storage or computing, data with the same values is often placed on the same node. Therefore, even if we add nodes after a data skew occurs, the skew data (data with the same values) is still placed on a single node, which become the capacity and performance bottleneck of the entire system.</li></ul>
|
|
<p id="EN-US_TOPIC_0000001188642150__p1345517311972"><span id="EN-US_TOPIC_0000001188642150__text1902512283">GaussDB(DWS)</span> provides a complete solution for data skew, including storage and computing skew.</p>
|
|
<div class="section" id="EN-US_TOPIC_0000001188642150__sca2958c039fa46ed8b2e765b60168c44"><h4 class="sectiontitle">Data Skew in the Storage Layer</h4><p id="EN-US_TOPIC_0000001188642150__ac016d56d630847f58a7aaaf91a515bd8">In the <span id="EN-US_TOPIC_0000001188642150__text1815116370">GaussDB(DWS)</span> database, data is distributed and stored on each DN. You can improve the query efficiency by using distributed execution. However, if data skew occurs, bottlenecks exist on some DNs during distribution execution, affecting the query performance. This is because the distribution column is not properly selected. This can be solved by adjusting the distribution column.</p>
|
|
<p id="EN-US_TOPIC_0000001188642150__a10cb3acc5f8d48919bceacd80c89c693">For example:</p>
|
|
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001188642150__s8cb529f293de4bfd92099d23e4db3a79"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
|
|
<span class="normal"> 2</span>
|
|
<span class="normal"> 3</span>
|
|
<span class="normal"> 4</span>
|
|
<span class="normal"> 5</span>
|
|
<span class="normal"> 6</span>
|
|
<span class="normal"> 7</span>
|
|
<span class="normal"> 8</span>
|
|
<span class="normal"> 9</span>
|
|
<span class="normal">10</span>
|
|
<span class="normal">11</span>
|
|
<span class="normal">12</span>
|
|
<span class="normal">13</span>
|
|
<span class="normal">14</span>
|
|
<span class="normal">15</span>
|
|
<span class="normal">16</span>
|
|
<span class="normal">17</span>
|
|
<span class="normal">18</span>
|
|
<span class="normal">19</span>
|
|
<span class="normal">20</span></pre></div></td><td class="code"><div><pre><span></span><span class="k">explain</span><span class="w"> </span><span class="n">performance</span><span class="w"> </span><span class="k">select</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">from</span><span class="w"> </span><span class="n">inventory</span><span class="p">;</span>
|
|
<span class="mi">5</span><span class="w"> </span><span class="c1">--CStore Scan on lmz.inventory</span>
|
|
<span class="w"> </span><span class="n">dn_6001_6002</span><span class="w"> </span><span class="p">(</span><span class="n">actual</span><span class="w"> </span><span class="k">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">444</span><span class="p">..</span><span class="mi">83</span><span class="p">.</span><span class="mi">127</span><span class="w"> </span><span class="k">rows</span><span class="o">=</span><span class="mi">42000000</span><span class="w"> </span><span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="n">dn_6003_6004</span><span class="w"> </span><span class="p">(</span><span class="n">actual</span><span class="w"> </span><span class="k">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">512</span><span class="p">..</span><span class="mi">63</span><span class="p">.</span><span class="mi">554</span><span class="w"> </span><span class="k">rows</span><span class="o">=</span><span class="mi">27000000</span><span class="w"> </span><span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="n">dn_6005_6006</span><span class="w"> </span><span class="p">(</span><span class="n">actual</span><span class="w"> </span><span class="k">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">722</span><span class="p">..</span><span class="mi">99</span><span class="p">.</span><span class="mi">033</span><span class="w"> </span><span class="k">rows</span><span class="o">=</span><span class="mi">45000000</span><span class="w"> </span><span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="n">dn_6007_6008</span><span class="w"> </span><span class="p">(</span><span class="n">actual</span><span class="w"> </span><span class="k">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">529</span><span class="p">..</span><span class="mi">100</span><span class="p">.</span><span class="mi">379</span><span class="w"> </span><span class="k">rows</span><span class="o">=</span><span class="mi">51000000</span><span class="w"> </span><span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="n">dn_6009_6010</span><span class="w"> </span><span class="p">(</span><span class="n">actual</span><span class="w"> </span><span class="k">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">382</span><span class="p">..</span><span class="mi">71</span><span class="p">.</span><span class="mi">341</span><span class="w"> </span><span class="k">rows</span><span class="o">=</span><span class="mi">36000000</span><span class="w"> </span><span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="n">dn_6011_6012</span><span class="w"> </span><span class="p">(</span><span class="n">actual</span><span class="w"> </span><span class="k">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">547</span><span class="p">..</span><span class="mi">100</span><span class="p">.</span><span class="mi">274</span><span class="w"> </span><span class="k">rows</span><span class="o">=</span><span class="mi">51000000</span><span class="w"> </span><span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="n">dn_6013_6014</span><span class="w"> </span><span class="p">(</span><span class="n">actual</span><span class="w"> </span><span class="k">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">596</span><span class="p">..</span><span class="mi">118</span><span class="p">.</span><span class="mi">289</span><span class="w"> </span><span class="k">rows</span><span class="o">=</span><span class="mi">60000000</span><span class="w"> </span><span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="n">dn_6015_6016</span><span class="w"> </span><span class="p">(</span><span class="n">actual</span><span class="w"> </span><span class="k">time</span><span class="o">=</span><span class="mi">1</span><span class="p">.</span><span class="mi">057</span><span class="p">..</span><span class="mi">132</span><span class="p">.</span><span class="mi">346</span><span class="w"> </span><span class="k">rows</span><span class="o">=</span><span class="mi">63000000</span><span class="w"> </span><span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="n">dn_6017_6018</span><span class="w"> </span><span class="p">(</span><span class="n">actual</span><span class="w"> </span><span class="k">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">940</span><span class="p">..</span><span class="mi">110</span><span class="p">.</span><span class="mi">310</span><span class="w"> </span><span class="k">rows</span><span class="o">=</span><span class="mi">54000000</span><span class="w"> </span><span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="n">dn_6019_6020</span><span class="w"> </span><span class="p">(</span><span class="n">actual</span><span class="w"> </span><span class="k">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">231</span><span class="p">..</span><span class="mi">41</span><span class="p">.</span><span class="mi">198</span><span class="w"> </span><span class="k">rows</span><span class="o">=</span><span class="mi">21000000</span><span class="w"> </span><span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="n">dn_6021_6022</span><span class="w"> </span><span class="p">(</span><span class="n">actual</span><span class="w"> </span><span class="k">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">927</span><span class="p">..</span><span class="mi">114</span><span class="p">.</span><span class="mi">538</span><span class="w"> </span><span class="k">rows</span><span class="o">=</span><span class="mi">54000000</span><span class="w"> </span><span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="n">dn_6023_6024</span><span class="w"> </span><span class="p">(</span><span class="n">actual</span><span class="w"> </span><span class="k">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">637</span><span class="p">..</span><span class="mi">118</span><span class="p">.</span><span class="mi">385</span><span class="w"> </span><span class="k">rows</span><span class="o">=</span><span class="mi">60000000</span><span class="w"> </span><span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="n">dn_6025_6026</span><span class="w"> </span><span class="p">(</span><span class="n">actual</span><span class="w"> </span><span class="k">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">288</span><span class="p">..</span><span class="mi">32</span><span class="p">.</span><span class="mi">240</span><span class="w"> </span><span class="k">rows</span><span class="o">=</span><span class="mi">15000000</span><span class="w"> </span><span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="n">dn_6027_6028</span><span class="w"> </span><span class="p">(</span><span class="n">actual</span><span class="w"> </span><span class="k">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">566</span><span class="p">..</span><span class="mi">118</span><span class="p">.</span><span class="mi">096</span><span class="w"> </span><span class="k">rows</span><span class="o">=</span><span class="mi">60000000</span><span class="w"> </span><span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="n">dn_6029_6030</span><span class="w"> </span><span class="p">(</span><span class="n">actual</span><span class="w"> </span><span class="k">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">423</span><span class="p">..</span><span class="mi">82</span><span class="p">.</span><span class="mi">913</span><span class="w"> </span><span class="k">rows</span><span class="o">=</span><span class="mi">42000000</span><span class="w"> </span><span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="n">dn_6031_6032</span><span class="w"> </span><span class="p">(</span><span class="n">actual</span><span class="w"> </span><span class="k">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">395</span><span class="p">..</span><span class="mi">78</span><span class="p">.</span><span class="mi">103</span><span class="w"> </span><span class="k">rows</span><span class="o">=</span><span class="mi">39000000</span><span class="w"> </span><span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="n">dn_6033_6034</span><span class="w"> </span><span class="p">(</span><span class="n">actual</span><span class="w"> </span><span class="k">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">376</span><span class="p">..</span><span class="mi">51</span><span class="p">.</span><span class="mi">052</span><span class="w"> </span><span class="k">rows</span><span class="o">=</span><span class="mi">24000000</span><span class="w"> </span><span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="n">dn_6035_6036</span><span class="w"> </span><span class="p">(</span><span class="n">actual</span><span class="w"> </span><span class="k">time</span><span class="o">=</span><span class="mi">0</span><span class="p">.</span><span class="mi">569</span><span class="p">..</span><span class="mi">79</span><span class="p">.</span><span class="mi">463</span><span class="w"> </span><span class="k">rows</span><span class="o">=</span><span class="mi">39000000</span><span class="w"> </span><span class="n">loops</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
|
|
</pre></div></td></tr></table></div>
|
|
|
|
</div>
|
|
<p id="EN-US_TOPIC_0000001188642150__a9e880d94ddf34045a6ceba962e6941db">In the performance information, you can view the number of scan rows of each DN in the inventory table. The number of rows of each DN differs a lot, the biggest is 63000000 and the smallest value is 15000000. This value difference on the performance of data scan is acceptable, but if the join operator exists in the upper-layer, the impact on the performance cannot be ignored.</p>
|
|
<p id="EN-US_TOPIC_0000001188642150__af3dcac4765574e6c89b0d4610e99f063">Generally, the data table is hash distributed on each DN; therefore, it is important to choose a proper distribution column. Run table_skewness() to view data skew of each DN in the inventory table. The query result is as follows:</p>
|
|
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001188642150__s0b0a40cfc88b4eb4aab8e76ec8ae1513"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
|
|
<span class="normal"> 2</span>
|
|
<span class="normal"> 3</span>
|
|
<span class="normal"> 4</span>
|
|
<span class="normal"> 5</span>
|
|
<span class="normal"> 6</span>
|
|
<span class="normal"> 7</span>
|
|
<span class="normal"> 8</span>
|
|
<span class="normal"> 9</span>
|
|
<span class="normal">10</span>
|
|
<span class="normal">11</span>
|
|
<span class="normal">12</span>
|
|
<span class="normal">13</span>
|
|
<span class="normal">14</span>
|
|
<span class="normal">15</span>
|
|
<span class="normal">16</span>
|
|
<span class="normal">17</span>
|
|
<span class="normal">18</span>
|
|
<span class="normal">19</span>
|
|
<span class="normal">20</span>
|
|
<span class="normal">21</span>
|
|
<span class="normal">22</span></pre></div></td><td class="code"><div><pre><span></span><span class="k">select</span><span class="w"> </span><span class="n">table_skewness</span><span class="p">(</span><span class="s1">'inventory'</span><span class="p">);</span>
|
|
<span class="w"> </span><span class="n">table_skewness</span><span class="w"> </span>
|
|
<span class="c1">------------------------------------------</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6015_6016 "</span><span class="p">,</span><span class="mi">63000000</span><span class="p">,</span><span class="mi">8</span><span class="p">.</span><span class="mi">046</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6013_6014 "</span><span class="p">,</span><span class="mi">60000000</span><span class="p">,</span><span class="mi">7</span><span class="p">.</span><span class="mi">663</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6023_6024 "</span><span class="p">,</span><span class="mi">60000000</span><span class="p">,</span><span class="mi">7</span><span class="p">.</span><span class="mi">663</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6027_6028 "</span><span class="p">,</span><span class="mi">60000000</span><span class="p">,</span><span class="mi">7</span><span class="p">.</span><span class="mi">663</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6017_6018 "</span><span class="p">,</span><span class="mi">54000000</span><span class="p">,</span><span class="mi">6</span><span class="p">.</span><span class="mi">897</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6021_6022 "</span><span class="p">,</span><span class="mi">54000000</span><span class="p">,</span><span class="mi">6</span><span class="p">.</span><span class="mi">897</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6007_6008 "</span><span class="p">,</span><span class="mi">51000000</span><span class="p">,</span><span class="mi">6</span><span class="p">.</span><span class="mi">513</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6011_6012 "</span><span class="p">,</span><span class="mi">51000000</span><span class="p">,</span><span class="mi">6</span><span class="p">.</span><span class="mi">513</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6005_6006 "</span><span class="p">,</span><span class="mi">45000000</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">747</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6001_6002 "</span><span class="p">,</span><span class="mi">42000000</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">364</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6029_6030 "</span><span class="p">,</span><span class="mi">42000000</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">364</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6031_6032 "</span><span class="p">,</span><span class="mi">39000000</span><span class="p">,</span><span class="mi">4</span><span class="p">.</span><span class="mi">981</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6035_6036 "</span><span class="p">,</span><span class="mi">39000000</span><span class="p">,</span><span class="mi">4</span><span class="p">.</span><span class="mi">981</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6009_6010 "</span><span class="p">,</span><span class="mi">36000000</span><span class="p">,</span><span class="mi">4</span><span class="p">.</span><span class="mi">598</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6003_6004 "</span><span class="p">,</span><span class="mi">27000000</span><span class="p">,</span><span class="mi">3</span><span class="p">.</span><span class="mi">448</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6033_6034 "</span><span class="p">,</span><span class="mi">24000000</span><span class="p">,</span><span class="mi">3</span><span class="p">.</span><span class="mi">065</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6019_6020 "</span><span class="p">,</span><span class="mi">21000000</span><span class="p">,</span><span class="mi">2</span><span class="p">.</span><span class="mi">682</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6025_6026 "</span><span class="p">,</span><span class="mi">15000000</span><span class="p">,</span><span class="mi">1</span><span class="p">.</span><span class="mi">916</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="p">(</span><span class="mi">18</span><span class="w"> </span><span class="k">rows</span><span class="p">)</span>
|
|
</pre></div></td></tr></table></div>
|
|
|
|
</div>
|
|
<p id="EN-US_TOPIC_0000001188642150__a9efde4ed96024b19926476bddf665eee">The table definition indicates that the table uses the <strong id="EN-US_TOPIC_0000001188642150__b43021826208">inv_date_sk</strong> column as the distribution column, which causes a data skew. Based on the data distribution of each column, change the distribution column to <strong id="EN-US_TOPIC_0000001188642150__b1111520233112">inv_item_sk</strong>. The skew status is as follows:</p>
|
|
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001188642150__s774bc7a2c4eb41a498c8cfe7d4c05d63"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
|
|
<span class="normal"> 2</span>
|
|
<span class="normal"> 3</span>
|
|
<span class="normal"> 4</span>
|
|
<span class="normal"> 5</span>
|
|
<span class="normal"> 6</span>
|
|
<span class="normal"> 7</span>
|
|
<span class="normal"> 8</span>
|
|
<span class="normal"> 9</span>
|
|
<span class="normal">10</span>
|
|
<span class="normal">11</span>
|
|
<span class="normal">12</span>
|
|
<span class="normal">13</span>
|
|
<span class="normal">14</span>
|
|
<span class="normal">15</span>
|
|
<span class="normal">16</span>
|
|
<span class="normal">17</span>
|
|
<span class="normal">18</span>
|
|
<span class="normal">19</span>
|
|
<span class="normal">20</span>
|
|
<span class="normal">21</span>
|
|
<span class="normal">22</span></pre></div></td><td class="code"><div><pre><span></span><span class="k">select</span><span class="w"> </span><span class="n">table_skewness</span><span class="p">(</span><span class="s1">'inventory'</span><span class="p">);</span>
|
|
<span class="w"> </span><span class="n">table_skewness</span><span class="w"> </span>
|
|
<span class="c1">------------------------------------------</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6001_6002 "</span><span class="p">,</span><span class="mi">43934200</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">611</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6007_6008 "</span><span class="p">,</span><span class="mi">43829420</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">598</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6003_6004 "</span><span class="p">,</span><span class="mi">43781960</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">592</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6031_6032 "</span><span class="p">,</span><span class="mi">43773880</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">591</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6033_6034 "</span><span class="p">,</span><span class="mi">43763280</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">589</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6011_6012 "</span><span class="p">,</span><span class="mi">43683600</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">579</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6013_6014 "</span><span class="p">,</span><span class="mi">43551660</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">562</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6027_6028 "</span><span class="p">,</span><span class="mi">43546340</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">561</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6009_6010 "</span><span class="p">,</span><span class="mi">43508700</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">557</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6023_6024 "</span><span class="p">,</span><span class="mi">43484540</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">554</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6019_6020 "</span><span class="p">,</span><span class="mi">43466800</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">551</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6021_6022 "</span><span class="p">,</span><span class="mi">43458500</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">550</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6017_6018 "</span><span class="p">,</span><span class="mi">43448040</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">549</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6015_6016 "</span><span class="p">,</span><span class="mi">43247700</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">523</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6005_6006 "</span><span class="p">,</span><span class="mi">43200240</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">517</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6029_6030 "</span><span class="p">,</span><span class="mi">43181360</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">515</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6025_6026 "</span><span class="p">,</span><span class="mi">43179700</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">515</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="w"> </span><span class="p">(</span><span class="ss">"dn_6035_6036 "</span><span class="p">,</span><span class="mi">42960080</span><span class="p">,</span><span class="mi">5</span><span class="p">.</span><span class="mi">487</span><span class="o">%</span><span class="p">)</span>
|
|
<span class="p">(</span><span class="mi">18</span><span class="w"> </span><span class="k">rows</span><span class="p">)</span>
|
|
</pre></div></td></tr></table></div>
|
|
|
|
</div>
|
|
<p id="EN-US_TOPIC_0000001188642150__a93859e15870344418f2554211042ef59">Data skew is solved.</p>
|
|
<p id="EN-US_TOPIC_0000001188642150__p1011111372918">In addition to the <strong id="EN-US_TOPIC_0000001188642150__b847381105217">table_skewness()</strong> view, you can use the <strong id="EN-US_TOPIC_0000001188642150__b1423181595218">table_distribution</strong> function and the <a href="dws_04_0805.html">PGXC_GET_TABLE_SKEWNESS</a> view to efficiently query the data skew of each table.</p>
|
|
</div>
|
|
<div class="section" id="EN-US_TOPIC_0000001188642150__sa1f042c3016c439389a4245d02069d39"><h4 class="sectiontitle">Data Skew in the Computing Layer</h4><p id="EN-US_TOPIC_0000001188642150__p284184441110">Even if data is balanced across nodes after you change the distribution key of a table, data skew may still occur during a query. If data skew occurs in the result set of an operator on a DN, skew will also occur during the computing that involves the operator. Generally, this is caused by data redistribution during the execution.</p>
|
|
<p id="EN-US_TOPIC_0000001188642150__p1184144491117">During a query, JOIN keys and GROUP BY keys are not used as distribution columns. Data is redistributed among DNs based on the hash values of data on the keys. The redistribution is implemented using the Redistribute operator in an execution plan. Data skew in redistribution columns can lead to data skew during system operation. After the redistribution, some nodes will have much more data, process more data, and will have much lower performance than others.</p>
|
|
<p id="EN-US_TOPIC_0000001188642150__p1584114412112">In the following example, the <strong id="EN-US_TOPIC_0000001188642150__b1239412258214">s</strong> and <strong id="EN-US_TOPIC_0000001188642150__b82135162120">t</strong> tables are joined, and <strong id="EN-US_TOPIC_0000001188642150__b69311843112118">s.x</strong> and <strong id="EN-US_TOPIC_0000001188642150__b18628144516211">t.x</strong> columns in the join condition are not their distribution keys. Table data is redistributed using the <strong id="EN-US_TOPIC_0000001188642150__b1386934542219">REDISTRIBUTE</strong> operator. Data skew occurs in the <strong id="EN-US_TOPIC_0000001188642150__b1241532382320">s.x</strong> column and not in the <strong id="EN-US_TOPIC_0000001188642150__b158851318232">t.x</strong> column. The result set of the <strong id="EN-US_TOPIC_0000001188642150__b41382913247">Streaming</strong> operator (<strong id="EN-US_TOPIC_0000001188642150__b584819280272">id</strong> being <strong id="EN-US_TOPIC_0000001188642150__b285042882717">6</strong>) on datanode2 has data three times that of other DNs and causes a skew.</p>
|
|
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001188642150__s5ef4a77ad42743f2969d7d50b5842d5a"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><span class="k">select</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">from</span><span class="w"> </span><span class="n">skew</span><span class="w"> </span><span class="n">s</span><span class="p">,</span><span class="n">test</span><span class="w"> </span><span class="n">t</span><span class="w"> </span><span class="k">where</span><span class="w"> </span><span class="n">s</span><span class="p">.</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">t</span><span class="p">.</span><span class="n">x</span><span class="w"> </span><span class="k">order</span><span class="w"> </span><span class="k">by</span><span class="w"> </span><span class="n">s</span><span class="p">.</span><span class="n">a</span><span class="w"> </span><span class="k">limit</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span>
|
|
</pre></div></td></tr></table></div>
|
|
|
|
</div>
|
|
<pre class="screen" id="EN-US_TOPIC_0000001188642150__screen16515115161410"> id | operation | A-time
|
|
----+-----------------------------------------------------+-----------------------
|
|
1 | -> Limit | 52622.382
|
|
2 | -> Streaming (type: GATHER) | 52622.374
|
|
3 | -> Limit | [30138.494,52598.994]
|
|
4 | -> Sort | [30138.486,52598.986]
|
|
5 | -> Hash Join (6,8) | [30127.013,41483.275]
|
|
6 | -> Streaming(type: REDISTRIBUTE) | [11365.110,22024.845]
|
|
7 | -> Seq Scan on public.skew s | [2019.168,2175.369]
|
|
8 | -> Hash | [2460.108,2499.850]
|
|
9 | -> Streaming(type: REDISTRIBUTE) | [1056.214,1121.887]
|
|
10 | -> Seq Scan on public.test t | [310.848,325.569]
|
|
|
|
6 --Streaming(type: REDISTRIBUTE)
|
|
datanode1 (rows=5050368)
|
|
datanode2 (rows=15276032)
|
|
datanode3 (rows=5174272)
|
|
datanode4 (rows=5219328)</pre>
|
|
<p id="EN-US_TOPIC_0000001188642150__p7903936161617">It is more difficult to detect skew in computing than in storage. To solve skew in computing, GaussDB provides the Runtime Load Balance Technology (RLBT) solution controlled by the <a href="dws_04_0909.html#EN-US_TOPIC_0000001188482092__section1211182712176">skew_option</a> parameter. The RLBT solution addresses how to detect and solve data skew.</p>
|
|
<ol id="EN-US_TOPIC_0000001188642150__ol74761026123416"><li id="EN-US_TOPIC_0000001188642150__li3476142611349">Detect data skew.<p id="EN-US_TOPIC_0000001188642150__p118322022193616"><a name="EN-US_TOPIC_0000001188642150__li3476142611349"></a><a name="li3476142611349"></a>The solution first checks whether skew data exists in redistribution columns used for computing. RLBT can detect data skew based on statistics, specified hints, or rules.</p>
|
|
<ul id="EN-US_TOPIC_0000001188642150__ul690487193720"><li id="EN-US_TOPIC_0000001188642150__li109041177376">Detection based on statistics<p id="EN-US_TOPIC_0000001188642150__p270410121375"><a name="EN-US_TOPIC_0000001188642150__li109041177376"></a><a name="li109041177376"></a>Run the <strong id="EN-US_TOPIC_0000001188642150__b6256451204515">ANALYZE</strong> statement to collect statistics on tables. The optimizer will automatically identify skew data on redistribution keys based on the statistics and generate optimization plans for queries having potential skew. When the redistribution key has multiple columns, statistics information can be used for identification only when all columns belong to the same base table.</p>
|
|
<p id="EN-US_TOPIC_0000001188642150__p192433157373">The statistics information can only provide the skew of the base table. If a column in the base table is skewed, or other columns have filtering conditions, or after the join of other tables, we cannot determine whether the skewed data still exists on the skewed column. If <a href="dws_04_0909.html#EN-US_TOPIC_0000001188482092__section1211182712176">skew_option</a> is set to <strong id="EN-US_TOPIC_0000001188642150__b877942724814">normal</strong>, it indicates that data skew persists and the base tables will be optimized to solve the skew. If <a href="dws_04_0909.html#EN-US_TOPIC_0000001188482092__section1211182712176">skew_option</a> is set to <strong id="EN-US_TOPIC_0000001188642150__b103161016174911">lazy</strong>, it indicates that data skew is solved and the optimization will stop.</p>
|
|
</li><li id="EN-US_TOPIC_0000001188642150__li29047783711">Detection based on specified hints<p id="EN-US_TOPIC_0000001188642150__p931621618376"><a name="EN-US_TOPIC_0000001188642150__li29047783711"></a><a name="li29047783711"></a>The intermediate results of complex queries are difficult to estimate based on statistics. In this case, you can specify hints to provide the skew information, based on which the optimizer optimizes queries. For details about the syntax of hints, see <a href="dws_04_0462.html">Skew Hints</a>.</p>
|
|
</li><li id="EN-US_TOPIC_0000001188642150__li13904376372">Detection based on rules<p id="EN-US_TOPIC_0000001188642150__p125857175373"><a name="EN-US_TOPIC_0000001188642150__li13904376372"></a><a name="li13904376372"></a>In a business intelligence (BI) system, a large number of SQL statements having outer joins (including left joins, right joins, and full joins) are generated, and many NULL values will be generated in empty columns that have no match for outer joins. If JOIN or GROUP BY operations are performed on the columns, data skew will occur. RLBT can automatically identify this scenario and generate an optimization plan for NULL value skew.</p>
|
|
</li></ul>
|
|
</li><li id="EN-US_TOPIC_0000001188642150__li144701652183518">Solve computing skew.<div class="p" id="EN-US_TOPIC_0000001188642150__p16298408196"><a name="EN-US_TOPIC_0000001188642150__li144701652183518"></a><a name="li144701652183518"></a><strong id="EN-US_TOPIC_0000001188642150__b1242394681818">Join</strong> and <strong id="EN-US_TOPIC_0000001188642150__b472104801814">Aggregate</strong> operators are optimized to solve skew.<ul id="EN-US_TOPIC_0000001188642150__ul1521811414381"><li id="EN-US_TOPIC_0000001188642150__li12181341193814"><strong id="EN-US_TOPIC_0000001188642150__b556945615187">Join</strong> optimization</li></ul>
|
|
</div>
|
|
<p id="EN-US_TOPIC_0000001188642150__p1783223742516">Skew and non-skew data is separately processed. Details are as follows:</p>
|
|
<ol type="a" id="EN-US_TOPIC_0000001188642150__ol18789740188"><li id="EN-US_TOPIC_0000001188642150__li109992041797">When redistribution is required on both sides of a join:<p id="EN-US_TOPIC_0000001188642150__p7944135113911"><a name="EN-US_TOPIC_0000001188642150__li109992041797"></a><a name="li109992041797"></a>Use <strong id="EN-US_TOPIC_0000001188642150__b153081146131315">PART_REDISTRIBUTE_PART_ROUNDROBIN</strong> on the side with skew. Specifically, perform round-robin on skew data and redistribution on non-skew data.</p>
|
|
<p id="EN-US_TOPIC_0000001188642150__p42378111123">Use <strong id="EN-US_TOPIC_0000001188642150__b183191147201316">PART_REDISTRIBUTE_PART_BROADCAST</strong> on the side with no skew. Specifically, perform broadcast on skew data and redistribution on non-skew data.</p>
|
|
</li><li id="EN-US_TOPIC_0000001188642150__li1287162212914">When redistribution is required on only one side of a join:<p id="EN-US_TOPIC_0000001188642150__p131547131792"><a name="EN-US_TOPIC_0000001188642150__li1287162212914"></a><a name="li1287162212914"></a>Use <strong id="EN-US_TOPIC_0000001188642150__b142411649151317">PART_REDISTRIBUTE_PART_ROUNDROBIN</strong> on the side where redistribution is required.</p>
|
|
<p id="EN-US_TOPIC_0000001188642150__p1452244691417">Use <strong id="EN-US_TOPIC_0000001188642150__b1825155181311">PART_LOCAL_PART_BROADCAST</strong> on the side where redistribution is not required. Specifically, perform broadcast on skew data and retain other data locally.</p>
|
|
</li><li id="EN-US_TOPIC_0000001188642150__li67891840187">When a table has <strong id="EN-US_TOPIC_0000001188642150__b154261652111313">NULL</strong> values padded:<p id="EN-US_TOPIC_0000001188642150__p1382516243178">Use <strong id="EN-US_TOPIC_0000001188642150__b8827125317134">PART_REDISTRIBUTE_PART_LOCAL</strong> on the table. Specifically, retain the <strong id="EN-US_TOPIC_0000001188642150__b2828753201312">NULL</strong> values locally and perform redistribution on other data.</p>
|
|
</li></ol>
|
|
<p id="EN-US_TOPIC_0000001188642150__p15494529151920">In the example query, the <strong id="EN-US_TOPIC_0000001188642150__b12753151773">s.x</strong> column contains skewed data and its value is <strong id="EN-US_TOPIC_0000001188642150__b1734662911711">0</strong>. The optimizer identifies the skew data in statistics and generates the following optimization plan:</p>
|
|
<pre class="screen" id="EN-US_TOPIC_0000001188642150__screen143584582195"> id | operation | A-time
|
|
----+-------------------------------------------------------------------------+-----------------------
|
|
1 | -> Limit | 23642.049
|
|
2 | -> Streaming (type: GATHER) | 23642.041
|
|
3 | -> Limit | [23310.768,23618.021]
|
|
4 | -> Sort | [23310.761,23618.012]
|
|
5 | -> Hash Join (6,8) | [20898.341,21115.272]
|
|
6 | -> Streaming(type: PART REDISTRIBUTE PART ROUNDROBIN) | [7125.834,7472.111]
|
|
7 | -> Seq Scan on public.skew s | [1837.079,1911.025]
|
|
8 | -> Hash | [2612.484,2640.572]
|
|
9 | -> Streaming(type: PART REDISTRIBUTE PART BROADCAST) | [1193.548,1297.894]
|
|
10 | -> Seq Scan on public.test t | [314.343,328.707]
|
|
|
|
5 --Vector Hash Join (6,8)
|
|
Hash Cond: s.x = t.x
|
|
Skew Join Optimizated by Statistic
|
|
6 --Streaming(type: PART REDISTRIBUTE PART ROUNDROBIN)
|
|
datanode1 (rows=7635968)
|
|
datanode2 (rows=7517184)
|
|
datanode3 (rows=7748608)
|
|
datanode4 (rows=7818240)</pre>
|
|
<p id="EN-US_TOPIC_0000001188642150__p12045218208">In the preceding execution plan, <strong id="EN-US_TOPIC_0000001188642150__b1895211642911">Skew Join Optimized by Statistic</strong> indicates that this is an optimized plan used for handling data skew. The <strong id="EN-US_TOPIC_0000001188642150__b63055302310">Statistic</strong> keyword indicates that the plan optimization is based on statistics; <strong id="EN-US_TOPIC_0000001188642150__b256918372328">Hint</strong> indicates that the optimization is based on hints; <strong id="EN-US_TOPIC_0000001188642150__b17920191183319">Rule</strong> indicates that the optimization is based on rules. In this plan, skew and non-skew data is separately processed. Non-skew data in the <strong id="EN-US_TOPIC_0000001188642150__b15985119161413">s</strong> table is redistributed based on its hash values, and skew data (whose value is <strong id="EN-US_TOPIC_0000001188642150__b198370184104">0</strong>) is evenly distributed on all nodes in round-robin mode. In this way, data skew is solved.</p>
|
|
<p id="EN-US_TOPIC_0000001188642150__p12025219207">To ensure result correctness, the <strong id="EN-US_TOPIC_0000001188642150__b38721756171119">t</strong> table also needs to be processed. In the <strong id="EN-US_TOPIC_0000001188642150__b2857143518121">t</strong> table, the data whose value is <strong id="EN-US_TOPIC_0000001188642150__b158761911217">0</strong> (skew value in the <strong id="EN-US_TOPIC_0000001188642150__b1718752614121">s.x</strong> table) is broadcast and other data is redistributed based on its hash values.</p>
|
|
<p id="EN-US_TOPIC_0000001188642150__p1811952132011">In this way, data skew in JOIN operations is solved. The above result shows that the output of the <strong id="EN-US_TOPIC_0000001188642150__b5430053141414">Streaming</strong> operator (<strong id="EN-US_TOPIC_0000001188642150__b1731522201511">id</strong> being <strong id="EN-US_TOPIC_0000001188642150__b37382217151">6</strong>) is balanced and the end-to-end performance of the query is doubled.</p>
|
|
<p id="EN-US_TOPIC_0000001188642150__p6591124684711">If the stream operator type in the execution plan is <strong id="EN-US_TOPIC_0000001188642150__b17282194984315">HYBRID</strong>, the stream mode varies depending on the skew data. The following plan is an example:</p>
|
|
<pre class="screen" id="EN-US_TOPIC_0000001188642150__screen195235521841">EXPLAIN (nodes OFF, costs OFF) SELECT COUNT(*) FROM skew_scol s, skew_scol1 s1 WHERE s.b = s1.c;
|
|
QUERY PLAN
|
|
------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
|
id | operation
|
|
----+-----------------------------------------------------------------------------------------------------------------------------------------------------------
|
|
1 | -> Aggregate
|
|
2 | -> Streaming (type: GATHER)
|
|
3 | -> Aggregate
|
|
4 | -> Hash Join (5,7)
|
|
5 | -> Streaming(type: HYBRID)
|
|
6 | -> Seq Scan on skew_scol s
|
|
7 | -> Hash
|
|
8 | -> Streaming(type: HYBRID)
|
|
9 | -> Seq Scan on skew_scol1 s1
|
|
|
|
Predicate Information (identified by plan id)
|
|
--------------------------------------------------------------------------------------------------------------------------------------------
|
|
4 --Hash Join (5,7)
|
|
Hash Cond: (s.b = s1.c)
|
|
Skew Join Optimized by Statistic
|
|
5 --Streaming(type: HYBRID)
|
|
Skew Filter: (b = 1)
|
|
Skew Filter: (b = 0)
|
|
8 --Streaming(type: HYBRID)
|
|
Skew Filter: (c = 0)
|
|
Skew Filter: (c = 1)</pre>
|
|
<p id="EN-US_TOPIC_0000001188642150__p859115460479">Data 1 has skew in the <strong id="EN-US_TOPIC_0000001188642150__b1397092916413">skew_scol</strong> table. Perform <strong id="EN-US_TOPIC_0000001188642150__b144631046164114">ROUNDROBIN</strong> on skew data and <strong id="EN-US_TOPIC_0000001188642150__b58331423134210">REDISTRIBUTE</strong> on non-skew data.</p>
|
|
<p id="EN-US_TOPIC_0000001188642150__p15911746124714">Data 0 is the side with no skew in the <strong id="EN-US_TOPIC_0000001188642150__b7890123784213">skew_scol</strong> table. Perform <strong id="EN-US_TOPIC_0000001188642150__b15891337114218">BROADCAST</strong> on skew data and <strong id="EN-US_TOPIC_0000001188642150__b2891537124218">REDISTRIBUTE</strong> on non-skew data.</p>
|
|
<p id="EN-US_TOPIC_0000001188642150__p85918462478">As shown in the preceding figure, the two stream types are <strong id="EN-US_TOPIC_0000001188642150__b16244201519449">PART REDISTRIBUTE PART ROUNDROBIN</strong> and <strong id="EN-US_TOPIC_0000001188642150__b1638115273445">PART REDISTRIBUTE PART BROADCAST</strong>. In this example, the stream type is <strong id="EN-US_TOPIC_0000001188642150__b27487541443">HYBRID</strong>.</p>
|
|
<ul id="EN-US_TOPIC_0000001188642150__ul11443172113917"><li id="EN-US_TOPIC_0000001188642150__li1144322118392"><strong id="EN-US_TOPIC_0000001188642150__b650479121919">Aggregate</strong> optimization</li></ul>
|
|
<p id="EN-US_TOPIC_0000001188642150__p1473194514497">For aggregation, data on each DN is deduplicated based on the <strong id="EN-US_TOPIC_0000001188642150__b1487151013202">GROUP BY</strong> key and then redistributed. After the deduplication on DNs, the global occurrences of each value will not be greater than the number of DNs. Therefore, no serious data skew will occur. Take the following query as an example:</p>
|
|
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001188642150__screen20940348184912"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><span class="k">select</span><span class="w"> </span><span class="n">c1</span><span class="p">,</span><span class="w"> </span><span class="n">c2</span><span class="p">,</span><span class="w"> </span><span class="n">c3</span><span class="p">,</span><span class="w"> </span><span class="n">c4</span><span class="p">,</span><span class="w"> </span><span class="n">c5</span><span class="p">,</span><span class="w"> </span><span class="n">c6</span><span class="p">,</span><span class="w"> </span><span class="n">c7</span><span class="p">,</span><span class="w"> </span><span class="n">c8</span><span class="p">,</span><span class="w"> </span><span class="n">c9</span><span class="p">,</span><span class="w"> </span><span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span><span class="w"> </span><span class="k">from</span><span class="w"> </span><span class="n">t</span><span class="w"> </span><span class="k">group</span><span class="w"> </span><span class="k">by</span><span class="w"> </span><span class="n">c1</span><span class="p">,</span><span class="w"> </span><span class="n">c2</span><span class="p">,</span><span class="w"> </span><span class="n">c3</span><span class="p">,</span><span class="w"> </span><span class="n">c4</span><span class="p">,</span><span class="w"> </span><span class="n">c5</span><span class="p">,</span><span class="w"> </span><span class="n">c6</span><span class="p">,</span><span class="w"> </span><span class="n">c7</span><span class="p">,</span><span class="w"> </span><span class="n">c8</span><span class="p">,</span><span class="w"> </span><span class="n">c9</span><span class="w"> </span><span class="k">limit</span><span class="w"> </span><span class="mi">10</span><span class="p">;</span>
|
|
</pre></div></td></tr></table></div>
|
|
|
|
</div>
|
|
<p id="EN-US_TOPIC_0000001188642150__p423191413501">The command output is as follows:</p>
|
|
<pre class="screen" id="EN-US_TOPIC_0000001188642150__screen1246505310505"> id | operation | A-time | A-rows
|
|
----+--------------------------------------------+------------------------+----------
|
|
1 | -> Streaming (type: GATHER) | 130621.783 | 12
|
|
2 | -> GroupAggregate | [85499.711,130432.341] | 12
|
|
3 | -> Sort | [85499.509,103145.632] | 36679237
|
|
4 | -> Streaming(type: REDISTRIBUTE) | [25668.897,85499.050] | 36679237
|
|
5 | -> Seq Scan on public.t | [9835.069,10416.388] | 36679237
|
|
|
|
4 --Streaming(type: REDISTRIBUTE)
|
|
datanode1 (rows=36678837)
|
|
datanode2 (rows=100)
|
|
datanode3 (rows=100)
|
|
datanode4 (rows=200)</pre>
|
|
<p id="EN-US_TOPIC_0000001188642150__p1411723035119">A large amount of skew data exists. As a result, after data is redistributed based on its <strong id="EN-US_TOPIC_0000001188642150__b370104915225">GROUP BY</strong> key, the data volume of datanode1 is hundreds of thousands of times that of others. After optimization, a GROUP BY operation is performed on the DN to deduplicate data. After redistribution, no data skew occurs.</p>
|
|
<pre class="screen" id="EN-US_TOPIC_0000001188642150__screen1192275205112"> id | operation | A-time
|
|
----+--------------------------------------------+-----------------------
|
|
1 | -> Streaming (type: GATHER) | 10961.337
|
|
2 | -> HashAggregate | [10953.014,10953.705]
|
|
3 | -> HashAggregate | [10952.957,10953.632]
|
|
4 | -> Streaming(type: REDISTRIBUTE) | [10952.859,10953.502]
|
|
5 | -> HashAggregate | [10084.280,10947.139]
|
|
6 | -> Seq Scan on public.t | [4757.031,5201.168]
|
|
|
|
Predicate Information (identified by plan id)
|
|
-----------------------------------------------
|
|
3 --HashAggregate
|
|
Skew Agg Optimized by Statistic
|
|
|
|
4 --Streaming(type: REDISTRIBUTE)
|
|
datanode1 (rows=17)
|
|
datanode2 (rows=8)
|
|
datanode3 (rows=8)
|
|
datanode4 (rows=14)</pre>
|
|
<p id="EN-US_TOPIC_0000001188642150__p12028142252">Applicable scope</p>
|
|
<ul id="EN-US_TOPIC_0000001188642150__ul174611613194"><li id="EN-US_TOPIC_0000001188642150__li15689124742119"><strong id="EN-US_TOPIC_0000001188642150__b8654526132418">Join</strong> operator<ul id="EN-US_TOPIC_0000001188642150__ul16965847123010"><li id="EN-US_TOPIC_0000001188642150__li7965247153014"><strong id="EN-US_TOPIC_0000001188642150__b989744514264">nest loop</strong>, <strong id="EN-US_TOPIC_0000001188642150__b203131647132612">merge join</strong>, and <strong id="EN-US_TOPIC_0000001188642150__b136991948132612">hash join</strong> can be optimized.</li><li id="EN-US_TOPIC_0000001188642150__li29651447123013">If skew data is on the left to the join, <strong id="EN-US_TOPIC_0000001188642150__b830162715273">inner join</strong>, <strong id="EN-US_TOPIC_0000001188642150__b1258602816271">left join</strong>, <strong id="EN-US_TOPIC_0000001188642150__b68717296271">semi join</strong>, and <strong id="EN-US_TOPIC_0000001188642150__b29328333279">anti join</strong> are supported. If skew data is on the right to the join, <strong id="EN-US_TOPIC_0000001188642150__b263165892712">inner join</strong>, <strong id="EN-US_TOPIC_0000001188642150__b18895135915276">right join</strong>, <strong id="EN-US_TOPIC_0000001188642150__b193651182814">right semi join</strong>, and <strong id="EN-US_TOPIC_0000001188642150__b102499418289">right anti join</strong> are supported.</li><li id="EN-US_TOPIC_0000001188642150__li2965174783019">For an optimization plan generated based on statistics, the optimizer checks whether it is optimal by estimating its cost. Optimization plans based on hints or rules are forcibly generated.</li></ul>
|
|
</li><li id="EN-US_TOPIC_0000001188642150__li6384195852110"><strong id="EN-US_TOPIC_0000001188642150__b1794942618513">Aggregate</strong> operator<ul id="EN-US_TOPIC_0000001188642150__ul1623719537308"><li id="EN-US_TOPIC_0000001188642150__li4237653193016"><strong id="EN-US_TOPIC_0000001188642150__b176754283265">array_agg</strong>, <strong id="EN-US_TOPIC_0000001188642150__b17988312263">string_agg</strong>, and <strong id="EN-US_TOPIC_0000001188642150__b168471633122610">subplan in agg qual</strong> cannot be optimized.</li><li id="EN-US_TOPIC_0000001188642150__li11237453163016">A plan generated based on statistics is affected by its cost, the <strong id="EN-US_TOPIC_0000001188642150__b4229155212305">plan_mode_seed</strong> parameter, and the <strong id="EN-US_TOPIC_0000001188642150__b1221355733010">best_agg_plan</strong> parameter. A plan generated based on hints or rules are not affected by them.</li></ul>
|
|
</li></ul>
|
|
</li></ol>
|
|
</div>
|
|
</div>
|
|
<div>
|
|
<div class="familylinks">
|
|
<div class="parentlink"><strong>Parent topic:</strong> <a href="dws_04_0445.html">Typical SQL Optimization Methods</a></div>
|
|
</div>
|
|
</div>
|
|
|