doc-exports/docs/dws/dev/dws_06_0325.html
Lu, Huayi e6fa411af0 DWS DEV 830.201 version
Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com>
Co-authored-by: Lu, Huayi <luhuayi@huawei.com>
Co-committed-by: Lu, Huayi <luhuayi@huawei.com>
2024-05-16 07:24:04 +00:00

<a name="EN-US_TOPIC_0000001495466261"></a>
<h1 class="topictitle1">Use Cases</h1>
<div id="body0000001495466261"><div class="section" id="EN-US_TOPIC_0000001495466261__section573310419439"><h4 class="sectiontitle">Background</h4><p id="EN-US_TOPIC_0000001495466261__p10605510184316">Real-time precision marketing is widely needed in the Internet, education, and gaming industries. User profiling makes it possible to select users by combined criteria. For example:</p>
<ul id="EN-US_TOPIC_0000001495466261__ul1560591013431"><li id="EN-US_TOPIC_0000001495466261__li146051010154318">Before launching a sales promotion, e-commerce companies need to send the promotion message to a selected batch of users with specific features.</li><li id="EN-US_TOPIC_0000001495466261__li16605171020432">In education, exercise questions need to be pushed based on students' weaknesses.</li><li id="EN-US_TOPIC_0000001495466261__li360511012434">On search, video, and portal websites, different content is pushed to users based on their interests.</li></ul>
<p id="EN-US_TOPIC_0000001495466261__p6605510114313">These use cases have the following characteristics in common:</p>
<ul id="EN-US_TOPIC_0000001495466261__ul136057101437"><li id="EN-US_TOPIC_0000001495466261__li260551084319">The data volume and the computing workload are both huge.</li><li id="EN-US_TOPIC_0000001495466261__li1060581044311">There are many users with many labels and fields, which occupy a large amount of storage space.</li><li id="EN-US_TOPIC_0000001495466261__li1360621014314">The selection conditions vary widely, so it is difficult to define a fixed index. Indexing every field would occupy too much storage space.</li><li id="EN-US_TOPIC_0000001495466261__li13606131020434">High performance is required because real-time marketing requires responses within seconds.</li><li id="EN-US_TOPIC_0000001495466261__li18606410144313">Data must be updated promptly, and user profiles need to be updated in real time.</li></ul>
<p id="EN-US_TOPIC_0000001495466261__p1960615102435">Roaring bitmaps in GaussDB(DWS) can efficiently generate, compress, and parse bitmap data, and support the most common bitmap aggregation operations (AND, OR, NOT, and XOR). This feature meets the requirements of real-time precision marketing and quick user selection on large data sets with hundreds of millions of users and tens of millions of labels.</p>
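<p>The set operations mentioned above can be tried on small literal sets. The following is a minimal sketch, not an authoritative reference: it assumes that the functions <strong>rb_build</strong>, <strong>rb_to_array</strong>, <strong>rb_and</strong>, <strong>rb_or</strong>, <strong>rb_xor</strong>, and <strong>rb_andnot</strong> are available in your cluster version (check the parent topic for the exact list), and the integer values are purely illustrative.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Illustrative values only; function availability is an assumption to verify.
-- rb_build turns an int array into a roaring bitmap; rb_to_array turns it back.

-- AND: values present in both sets (here 2 and 3)
SELECT rb_to_array(rb_and(rb_build(ARRAY[1,2,3]), rb_build(ARRAY[2,3,4])));

-- OR: values present in either set (here 1, 2, 3, and 4)
SELECT rb_to_array(rb_or(rb_build(ARRAY[1,2,3]), rb_build(ARRAY[2,3,4])));

-- XOR: values present in exactly one set (here 1 and 4)
SELECT rb_to_array(rb_xor(rb_build(ARRAY[1,2,3]), rb_build(ARRAY[2,3,4])));

-- AND NOT: values in the first set but not in the second (here 1)
SELECT rb_to_array(rb_andnot(rb_build(ARRAY[1,2,3]), rb_build(ARRAY[2,3,4])));
</pre></div></div>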
</div>
<div class="section" id="EN-US_TOPIC_0000001495466261__section1464915497436"><h4 class="sectiontitle"><strong id="EN-US_TOPIC_0000001495466261__b20648154914310">Example of Using roaringbitmap</strong></h4><p id="EN-US_TOPIC_0000001495466261__p63597234419">Assume that there is a web page browsing information table <strong id="EN-US_TOPIC_0000001495466261__b18510285919360">userinfo</strong>. The fields in the table are as follows:</p>
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001495466261__screen1635972164420"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span>
<span class="normal">6</span>
<span class="normal">7</span></pre></div></td><td class="code"><div><pre><span></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">userinfo</span>
<span class="p">(</span><span class="n">userid</span><span class="w"> </span><span class="nb">int</span><span class="p">,</span>
<span class="n">age</span><span class="w"> </span><span class="nb">int</span><span class="p">,</span>
<span class="n">gender</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span>
<span class="n">salary</span><span class="w"> </span><span class="nb">int</span><span class="p">,</span>
<span class="n">hobby</span><span class="w"> </span><span class="nb">text</span>
<span class="p">)</span><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="n">orientation</span><span class="o">=</span><span class="k">column</span><span class="p">);</span>
</pre></div></td></tr></table></div>
</div>
<p id="EN-US_TOPIC_0000001495466261__p559611562117">The amount of data in the <strong id="EN-US_TOPIC_0000001495466261__b11970217349360">userinfo</strong> table grows as user information changes. For example, a user with multiple hobbies has multiple records in the <strong id="EN-US_TOPIC_0000001495466261__b12875314209360">userinfo</strong> table.</p>
<p id="EN-US_TOPIC_0000001495466261__p659674132112">Suppose you want to select males whose income is greater than CNY10,000, whose age is greater than 30, and whose hobby is fishing, and then push specific messages to this target group.</p>
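<p>To make the table layout concrete, the following sketch loads a few hypothetical rows. The user IDs and attribute values are invented for illustration and are not part of the original example. User 1 has two hobbies and therefore appears in two records.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Hypothetical sample data: one record per (user, hobby) combination.
INSERT INTO userinfo (userid, age, gender, salary, hobby) VALUES
(1, 35, 'm', 12000, 'fishing'),
(1, 35, 'm', 12000, 'hiking'),   -- same user, second hobby
(2, 28, 'f',  9000, 'reading'),
(3, 41, 'm', 15000, 'fishing');
</pre></div></div>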
<p id="EN-US_TOPIC_0000001495466261__p1285515052310">The traditional method is to directly query the original table. The statement is as follows:</p>
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001495466261__screen1082181713176"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><span class="k">SELECT</span><span class="w"> </span><span class="k">distinct</span><span class="w"> </span><span class="n">userid</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">userinfo</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">salary</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">10000</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">30</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">gender</span><span class="w"> </span><span class="o">=</span><span class="s1">'m'</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">hobby</span><span class="w"> </span><span class="o">=</span><span class="s1">'fishing'</span><span class="p">;</span>
</pre></div></td></tr></table></div>
</div>
</div>
<p id="EN-US_TOPIC_0000001495466261__p417531810261">If the <strong id="EN-US_TOPIC_0000001495466261__b13609565139360">userinfo</strong> table contains a small amount of data, creating indexes on the salary, age, gender, and hobby columns is enough to meet the query requirements. However, if the <strong id="EN-US_TOPIC_0000001495466261__b15244725589360">userinfo</strong> table contains a large amount of data and a large number of labels, this approach no longer meets the requirements for the following reasons:</p>
<ul id="EN-US_TOPIC_0000001495466261__ul055315423287"><li id="EN-US_TOPIC_0000001495466261__li12553134215282">A large number of indexes need to be created.</li><li id="EN-US_TOPIC_0000001495466261__li510823831915">Deduplication with DISTINCT or count(distinct) performs poorly.</li></ul>
<p id="EN-US_TOPIC_0000001495466261__p16901650161912"><strong id="EN-US_TOPIC_0000001495466261__b35321510172018">Roaring bitmaps perform better in this case.</strong></p>
<ol id="EN-US_TOPIC_0000001495466261__ol94751317132018"><li id="EN-US_TOPIC_0000001495466261__li247571712205">Create a RoaringBitmap table.<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001495466261__screen570455041916"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span>
<span class="normal">6</span>
<span class="normal">7</span>
<span class="normal">8</span></pre></div></td><td class="code"><div><pre><span></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">userinfoset</span>
<span class="p">(</span><span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="nb">int</span><span class="p">,</span>
<span class="n">gender</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span>
<span class="n">salary</span><span class="w"> </span><span class="nb">int</span><span class="p">,</span>
<span class="n">hobby</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span>
<span class="n">userset</span><span class="w"> </span><span class="n">roaringbitmap</span><span class="p">,</span>
<span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">(</span><span class="n">age</span><span class="p">,</span><span class="n">gender</span><span class="p">,</span><span class="n">salary</span><span class="p">,</span><span class="n">hobby</span><span class="p">)</span>
<span class="p">)</span><span class="k">WITH</span><span class="w"> </span><span class="p">(</span><span class="n">orientation</span><span class="o">=</span><span class="k">column</span><span class="p">);</span>
</pre></div></td></tr></table></div>
</div>
</li><li id="EN-US_TOPIC_0000001495466261__li147171923102012">Aggregate the data in the <strong id="EN-US_TOPIC_0000001495466261__b17630758749360">userinfo</strong> table into the <strong id="EN-US_TOPIC_0000001495466261__b20002170559360">userinfoset</strong> table by grouping on the label columns. You can run the following statement to aggregate all data at once. Alternatively, you can aggregate only incremental data: the set of users with the same labels is stored in one table record, and new users are merged into that record by using the UPSERT function. Because frequent updates may generate a large amount of dirty data, you are advised to create the <strong id="EN-US_TOPIC_0000001495466261__b19734457399360">userinfoset</strong> table as a row-store table if you aggregate incremental data.<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001495466261__screen1514619501182"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span></pre></div></td><td class="code"><div><pre><span></span><span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">userinfoset</span>
<span class="k">SELECT</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">gender</span><span class="p">,</span><span class="w"> </span><span class="n">salary</span><span class="p">,</span><span class="w"> </span><span class="n">hobby</span><span class="p">,</span><span class="w"> </span><span class="n">rb_build_agg</span><span class="p">(</span><span class="n">userid</span><span class="p">)</span>
<span class="k">FROM</span><span class="w"> </span>
<span class="n">userinfo</span>
<span class="k">GROUP</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">gender</span><span class="p">,</span><span class="w"> </span><span class="n">salary</span><span class="p">,</span><span class="w"> </span><span class="n">hobby</span><span class="p">;</span>
</pre></div></td></tr></table></div>
</div>
</li><li id="EN-US_TOPIC_0000001495466261__li128441829102019">Query the <strong id="EN-US_TOPIC_0000001495466261__b9011596629360">userinfoset</strong> table for the selected user information.<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001495466261__screen186919582189"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><span class="k">SELECT</span><span class="w"> </span><span class="n">rb_iterate</span><span class="p">(</span><span class="n">rb_or_agg</span><span class="p">(</span><span class="n">userset</span><span class="p">))</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">userinfoset</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">salary</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">10000</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">age</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">30</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">gender</span><span class="w"> </span><span class="o">=</span><span class="s1">'m'</span><span class="w"> </span><span class="k">AND</span><span class="w"> </span><span class="n">hobby</span><span class="w"> </span><span class="o">=</span><span class="s1">'fishing'</span><span class="p">;</span>
</pre></div></td></tr></table></div>
</div>
</li></ol>
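<p>For the incremental aggregation described in step 2, a minimal sketch might look as follows. It rests on several assumptions that you should verify against the UPSERT and Roaring bitmap references for your cluster version: <strong>userinfoset_row</strong> is a hypothetical row-store table with the same columns and primary key as <strong>userinfoset</strong>, <strong>userinfo_delta</strong> is a hypothetical staging table that holds only the new records, and both the <strong>ON CONFLICT ... DO UPDATE</strong> form of UPSERT and the <strong>rb_or</strong> function are available.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Sketch only: table names userinfoset_row and userinfo_delta are hypothetical.
-- Build one bitmap per label combination from the new records, then merge it
-- into the existing record (if any) with a bitmap OR.
INSERT INTO userinfoset_row
SELECT age, gender, salary, hobby, rb_build_agg(userid)
FROM userinfo_delta
GROUP BY age, gender, salary, hobby
ON CONFLICT (age, gender, salary, hobby)
DO UPDATE SET userset = rb_or(userinfoset_row.userset, EXCLUDED.userset);
</pre></div></div>
<p>Because the GROUP BY columns match the primary key, each statement produces at most one candidate row per key, so a single run does not update the same record twice.</p>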
<p id="EN-US_TOPIC_0000001495466261__p138132555918">After aggregation, the <strong id="EN-US_TOPIC_0000001495466261__b20004302629360">userinfoset</strong> table holds far less data than the source table, so scanning it is much faster. In addition, Roaring bitmaps make the <strong id="EN-US_TOPIC_0000001495466261__b4193092649360">rb_or_agg</strong> and <strong id="EN-US_TOPIC_0000001495466261__b21298001669360">rb_iterate</strong> calculations efficient. As a result, performance is significantly better than with the traditional method.</p>
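<p>If only the size of the target group is needed, for example to estimate the reach of a campaign before sending messages, the bitmap does not have to be expanded into individual user IDs. The following sketch assumes that the <strong>rb_cardinality</strong> function is available in your cluster version. It reuses the filter conditions from the example above.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Count the matching users without listing them.
-- rb_or_agg merges the bitmaps of all matching label combinations;
-- rb_cardinality (assumed available) returns the number of distinct user IDs.
SELECT rb_cardinality(rb_or_agg(userset)) AS target_users
FROM userinfoset
WHERE salary &gt; 10000 AND age &gt; 30 AND gender = 'm' AND hobby = 'fishing';
</pre></div></div>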
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="dws_06_0991.html">Roaring Bitmap Functions and Operators</a></div>
</div>
</div>