Yang, Tong 6182f91ba8 MRS component operation guide_normal 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-09 14:55:21 +00:00


<a name="mrs_01_0980"></a>
<h1 class="topictitle1">Optimizing Group By</h1>
<div id="body1590395285869"><div class="section" id="mrs_01_0980__sbe95d3fca8ab4f4fb5a7d44ea588dcfd"><h4 class="sectiontitle">Scenario</h4><p class="msonormal" id="mrs_01_0980__en-us_topic_0116526945_p60811181">Optimize Group By statements to speed up query execution.</p>
<p class="msonormal" id="mrs_01_0980__en-us_topic_0116526945_p10429724">During a Group By operation, Map tasks group the data and distribute the groups to Reduce tasks, which then group the data again. Enabling Map-side aggregation optimizes Group By because part of the aggregation is done before the shuffle, which reduces the volume of Map output data.</p>
</div>
<div class="section" id="mrs_01_0980__sd2566fa77945498dbbfc885b158460d8"><h4 class="sectiontitle">Procedure</h4><p class="msonormal" id="mrs_01_0980__en-us_topic_0116526945_p39501348">On a Hive client, set the following parameter:</p>
<pre class="screen" id="mrs_01_0980__sad455a5d148042b8a7716f6eda647d45">set hive.map.aggr=true;</pre>
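<p class="msonormal">As a minimal sketch, the parameter can be set in the same session as the aggregation query it is meant to speed up. The table user_log and its column user_id are hypothetical names used only for illustration.</p>
<pre class="screen">-- Enable Map-side aggregation for the current session.
set hive.map.aggr=true;
-- Hypothetical table: Map tasks pre-aggregate the counts per user_id
-- before the data is sent to Reduce tasks.
SELECT user_id, count(*) AS cnt
FROM user_log
GROUP BY user_id;</pre>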
</div>
<div class="section" id="mrs_01_0980__se282d7f385494aa6934d085f4482d184"><h4 class="sectiontitle">Precautions</h4><p class="msonormal" id="mrs_01_0980__en-us_topic_0116526945_p6780253"><strong id="mrs_01_0980__en-us_topic_0116526945_b61022284">Group By Data Skew</strong></p>
<p class="msonormal" id="mrs_01_0980__en-us_topic_0116526945_p12329649">Group By operations can suffer from data skew. When hive.groupby.skewindata is set to true, the generated query plan contains two MapReduce jobs. In the first job, the Map output is randomly distributed to Reduce tasks, and each Reduce task performs a partial aggregation and outputs the result. Because of this random distribution, rows with the same Group By key may be sent to different Reduce tasks, which balances the load. The second job then distributes the pre-aggregated results to Reduce tasks by Group By key to complete the final aggregation.</p>
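<p class="msonormal">The two-job plan described above can be requested for a single session as sketched below. The table user_log and its column user_id are hypothetical, and the sketch assumes a small number of user_id values account for most of the rows.</p>
<pre class="screen">-- Ask Hive to generate the two-job plan for skewed Group By keys.
set hive.groupby.skewindata=true;
-- Hypothetical skewed query: job 1 spreads rows randomly and pre-aggregates,
-- job 2 groups the partial results by user_id and finishes the aggregation.
SELECT user_id, count(*) AS cnt
FROM user_log
GROUP BY user_id;</pre>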
<p class="msonormal" id="mrs_01_0980__en-us_topic_0116526945_p43857981"><strong id="mrs_01_0980__en-us_topic_0116526945_b59177513">Count Distinct Aggregation Problem</strong></p>
<p class="msonormal" id="mrs_01_0980__en-us_topic_0116526945_p62835574">When the aggregation function count distinct is used for deduplicated counting, severe Reduce-side data skew can occur if a large share of the processed values are empty, because all empty values are sent to the same Reduce task. Empty values can be processed separately. If only count distinct is used, exclude the empty values with a WHERE clause and add 1 to the final count distinct result. If there are other computing operations, process the empty values separately and then combine the result with the other computing results.</p>
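<p class="msonormal">Both approaches might look like the following sketch. It rests on several assumptions that are not part of the original text: the table user_log and its column user_id are hypothetical, "empty value" is taken to mean the empty string '', the column contains no NULL values, and for the first query at least one empty value exists in the data (otherwise adding 1 overcounts).</p>
<pre class="screen">-- Count distinct only: filter out empty values, then add 1 for the empty value.
SELECT count(DISTINCT user_id) + 1 AS uv
FROM user_log
WHERE user_id != '';

-- With other computations: process empty and non-empty rows separately,
-- then combine the partial results.
SELECT sum(pv) AS pv, sum(uv) AS uv
FROM (
  SELECT count(*) AS pv, count(DISTINCT user_id) AS uv
  FROM user_log
  WHERE user_id != ''
  UNION ALL
  -- The empty value counts as one distinct value only if it occurs at all.
  SELECT count(*) AS pv, if(count(*) &gt; 0, 1, 0) AS uv
  FROM user_log
  WHERE user_id = ''
) t;</pre>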
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_0977.html">Hive Performance Tuning</a></div>
</div>
</div>