Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Su, Xiaomeng <suxiaomeng1@huawei.com> Co-committed-by: Su, Xiaomeng <suxiaomeng1@huawei.com>
<a name="dli_08_0328"></a>
<h1 class="topictitle1">Deduplication</h1>
<div id="body8662426"><div class="section" id="dli_08_0328__en-us_topic_0000001119232092_en-us_topic_0000001132426601_section105585025918"><h4 class="sectiontitle">Function</h4><p id="dli_08_0328__en-us_topic_0000001119232092_en-us_topic_0000001132426601_p205386514595">Deduplication removes rows that are duplicated over a set of columns, keeping only the first or the last row.</p>
</div>
<div class="section" id="dli_08_0328__en-us_topic_0000001119232092_en-us_topic_0000001132426601_section155961853135910"><h4 class="sectiontitle">Syntax</h4><pre class="screen" id="dli_08_0328__en-us_topic_0000001119232092_en-us_topic_0000001132426601_screen152822010020">SELECT [column_list]
FROM (
   SELECT [column_list],
     ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]]
       ORDER BY time_attr [asc|desc]) AS rownum
   FROM table_name)
WHERE rownum = 1</pre>
</div>
<div class="section" id="dli_08_0328__en-us_topic_0000001119232092_en-us_topic_0000001132426601_section1627916611011"><h4 class="sectiontitle">Description</h4><ul id="dli_08_0328__en-us_topic_0000001119232092_en-us_topic_0000001132426601_ul12220145108"><li id="dli_08_0328__en-us_topic_0000001119232092_en-us_topic_0000001132426601_li82201045703">ROW_NUMBER(): Assigns a unique, sequential number to each row within its partition, starting from 1.</li>
<li id="dli_08_0328__en-us_topic_0000001119232092_en-us_topic_0000001132426601_li152201045504">PARTITION BY col1[, col2...]: Specifies the partition columns, that is, the deduplication key.</li>
<li id="dli_08_0328__en-us_topic_0000001119232092_en-us_topic_0000001132426601_li1422020451403">ORDER BY time_attr [asc|desc]: Specifies the ordering column, which must be a time attribute. Currently, Flink supports only proctime. Ordering by ASC keeps the first row; ordering by DESC keeps the last row.</li>
<li id="dli_08_0328__en-us_topic_0000001119232092_en-us_topic_0000001132426601_li1922012451800">WHERE rownum = 1: The condition rownum = 1 is required for Flink to recognize the query as a deduplication query.</li></ul>
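<p>The ordering column must be a time attribute of the source table. As a minimal sketch, assuming open-source Flink SQL DDL, a processing time attribute can be declared as a computed column. The table name, column types, and the datagen connector below are illustrative assumptions, not taken from this topic; replace them with the source table definition used in your job.</p>
<pre class="screen">-- Hypothetical source table; column types and connector are assumptions.
CREATE TABLE Orders (
  order_id STRING,
  `user`   STRING,
  product  STRING,
  `number` INT,
  -- Computed column that exposes processing time as a time attribute
  proctime AS PROCTIME()
) WITH (
  'connector' = 'datagen'
);</pre>
<p>With such a definition, proctime can be used as time_attr in the ORDER BY clause of the deduplication query.</p>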
</div>
<div class="section" id="dli_08_0328__en-us_topic_0000001119232092_en-us_topic_0000001132426601_section17171113913120"><h4 class="sectiontitle">Precautions</h4><p id="dli_08_0328__en-us_topic_0000001119232092_en-us_topic_0000001132426601_p10805114812117">None</p>
</div>
<div class="section" id="dli_08_0328__en-us_topic_0000001119232092_en-us_topic_0000001132426601_section1085612512115"><h4 class="sectiontitle">Example</h4><p id="dli_08_0328__en-us_topic_0000001119232092_en-us_topic_0000001132426601_p10785617548">The following example shows how to remove duplicate rows on <strong id="dli_08_0328__en-us_topic_0000001119232092_en-us_topic_0000001132426601_b258511581542">order_id</strong>, keeping the first row for each order. The proctime column is a processing time attribute.</p>
<pre class="screen" id="dli_08_0328__en-us_topic_0000001119232092_en-us_topic_0000001132426601_screen13863297215">SELECT order_id, user, product, number
FROM (
   SELECT *,
     ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY proctime ASC) AS row_num
   FROM Orders)
WHERE row_num = 1;</pre>
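<p>To keep the last row for each order_id instead, the same query can order by proctime in descending order. This is a sketch that assumes the same Orders table as in the preceding example; only the sort direction changes.</p>
<pre class="screen">-- Variant of the preceding example: DESC keeps the last row per order_id.
SELECT order_id, user, product, number
FROM (
   SELECT *,
     ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY proctime DESC) AS row_num
   FROM Orders)
WHERE row_num = 1;</pre>
<p>When the last row is kept, the result for an order_id is likely to change each time a newer row arrives, so the downstream sink should be able to handle updates.</p>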
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="dli_08_0321.html">Data Manipulation Language (DML)</a></div>
</div>
</div>