doc-exports/docs/dws/dev/dws_06_0103.html
Lu, Huayi e6fa411af0 DWS DEV 830.201 version
Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com>
Co-authored-by: Lu, Huayi <luhuayi@huawei.com>
Co-committed-by: Lu, Huayi <luhuayi@huawei.com>
2024-05-16 07:24:04 +00:00


<a name="EN-US_TOPIC_0000001188588966"></a><a name="EN-US_TOPIC_0000001188588966"></a>
<h1 class="topictitle1">Overview</h1>
<div id="body1561692316765"><p id="EN-US_TOPIC_0000001188588966__p10985850134520">A dictionary is used to define stop words, that is, words to be ignored in full-text retrieval.</p>
<p id="EN-US_TOPIC_0000001188588966__p178581220134616">A dictionary can also be used to normalize words so that different derived forms of the same word will match. A normalized word is called a lexeme.</p>
<p id="EN-US_TOPIC_0000001188588966__p5687281260">In addition to improving retrieval quality, normalization and removal of stop words reduce the size of the <strong id="EN-US_TOPIC_0000001188588966__b1923103320572">tsvector</strong> representation of a document, thereby improving performance. Normalization and stop word removal do not always carry linguistic meaning. You can define the normalization and removal rules in dictionary definition files to suit the application environment.</p>
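<p>As a minimal sketch of this effect, assuming the predefined <strong>english</strong> text search configuration is available, the following query converts a short sentence to a <strong>tsvector</strong>. Stop words such as "The" and "are" should be dropped and the remaining words reduced to lexemes, so the stored representation is smaller than the input. The sample sentence is illustrative only.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Assumes the predefined english configuration; the sentence is a made-up sample.
-- Stop words are removed and the remaining words are normalized to lexemes.
SELECT to_tsvector('english', 'The stars are shining');
</pre></div></div>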
<p id="EN-US_TOPIC_0000001188588966__p1048084019528">A dictionary is a program that receives a token as input and returns:</p>
<ul id="EN-US_TOPIC_0000001188588966__ul448014095218"><li id="EN-US_TOPIC_0000001188588966__li14480204095216"><p id="EN-US_TOPIC_0000001188588966__p9480840125215"><a name="EN-US_TOPIC_0000001188588966__li14480204095216"></a><a name="li14480204095216"></a>An array of lexemes if the input token is known to the dictionary (note that one token can produce more than one lexeme).</p>
</li><li id="EN-US_TOPIC_0000001188588966__li1561353414222">A single lexeme that replaces the original token with a new token, which is then passed to subsequent dictionaries (a dictionary that does this is called a filtering dictionary).</li><li id="EN-US_TOPIC_0000001188588966__li1148124085212"><p id="EN-US_TOPIC_0000001188588966__p114812040175216"><a name="EN-US_TOPIC_0000001188588966__li1148124085212"></a><a name="li1148124085212"></a>An empty array if the input token is known to the dictionary but is a stop word.</p>
</li><li id="EN-US_TOPIC_0000001188588966__li24818403529"><p id="EN-US_TOPIC_0000001188588966__p448104019526"><a name="EN-US_TOPIC_0000001188588966__li24818403529"></a><a name="li24818403529"></a><strong id="EN-US_TOPIC_0000001188588966__b84235270611419">NULL</strong> if the dictionary does not recognize the token.</p>
</li></ul>
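<p>The return values listed above can be observed with the <strong>ts_lexize</strong> function, which passes a single token to a single dictionary. The sketch below assumes that the predefined <strong>english_stem</strong> dictionary exists in your cluster. Because a Snowball stemmer recognizes every token, the <strong>NULL</strong> case does not occur with this dictionary and is not shown.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Known word: the dictionary is expected to return an array of lexemes.
SELECT ts_lexize('english_stem', 'stars');

-- Stop word: the dictionary is expected to return an empty array.
SELECT ts_lexize('english_stem', 'a');
</pre></div></div>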
<p id="EN-US_TOPIC_0000001188588966__p194245415293"><span id="EN-US_TOPIC_0000001188588966__text417321870">GaussDB(DWS)</span> provides predefined dictionaries for many languages and also provides five predefined dictionary templates, <strong id="EN-US_TOPIC_0000001188588966__b1194175223014">Simple</strong>, <strong id="EN-US_TOPIC_0000001188588966__b7992205318307">Synonym</strong>, <strong id="EN-US_TOPIC_0000001188588966__b8166185623015">Thesaurus</strong>, <strong id="EN-US_TOPIC_0000001188588966__b93980582306">Ispell</strong>, and <strong id="EN-US_TOPIC_0000001188588966__b426530153117">Snowball</strong>. These templates can be used to create new dictionaries with custom parameters.</p>
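<p>For illustration, a new dictionary might be created from the <strong>Simple</strong> template as sketched below. The dictionary name <strong>my_simple_dict</strong> is hypothetical, and the example assumes that an English stop word file is available to the <strong>STOPWORDS</strong> parameter. Check the template-specific topics for the parameters that each template actually accepts.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Hypothetical dictionary name; assumes an English stop word file is available.
CREATE TEXT SEARCH DICTIONARY public.my_simple_dict (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english
);

-- Pass one token to the new dictionary to check how it is handled.
SELECT ts_lexize('public.my_simple_dict', 'YeS');
</pre></div></div>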
<p id="EN-US_TOPIC_0000001188588966__p9394422131513">When configuring full-text retrieval, note the following:</p>
<ul id="EN-US_TOPIC_0000001188588966__ul1419424661519"><li id="EN-US_TOPIC_0000001188588966__li219417461151">In the text search configuration, configure a parser together with a set of dictionaries that process the parser's output tokens. The configuration specifies a separate list of dictionaries for each token type that the parser can return. When the parser finds a token of that type, each dictionary in the list is consulted in turn until one recognizes the token as a known word. If the token is identified as a stop word, or if no dictionary recognizes it, the token is discarded and is neither indexed nor searched for. Generally, the first dictionary that returns a non-<strong id="EN-US_TOPIC_0000001188588966__b9690185334013">NULL</strong> output determines the result, and the remaining dictionaries are not consulted. However, a filtering dictionary can replace the input token with a modified one, which is then passed to subsequent dictionaries.</li><li id="EN-US_TOPIC_0000001188588966__li4678848121518">As a general rule, place the narrowest, most specific dictionary first, then the more general dictionaries, and finish with a very general dictionary that recognizes everything, such as a <strong id="EN-US_TOPIC_0000001188588966__b28181044411">Snowball</strong> stemmer dictionary or a <strong id="EN-US_TOPIC_0000001188588966__b949154564410">Simple</strong> dictionary. 
In the following example, for an astronomy-specific search (<strong id="EN-US_TOPIC_0000001188588966__b149991648144619">astro_en</strong> configuration), you can configure the token type <strong id="EN-US_TOPIC_0000001188588966__b123511658144916">asciiword</strong> (ASCII word) with a <strong id="EN-US_TOPIC_0000001188588966__b731164245011">Synonym</strong> dictionary of astronomical terms, a general English <strong id="EN-US_TOPIC_0000001188588966__b259411213515">Ispell</strong> dictionary, and a <strong id="EN-US_TOPIC_0000001188588966__b151454261513">Snowball</strong> English stemmer dictionary:<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001188588966__s9d5364b9b03845398238f2042156f96e"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="k">ALTER</span><span class="w"> </span><span class="nb">TEXT</span><span class="w"> </span><span class="k">SEARCH</span><span class="w"> </span><span class="n">CONFIGURATION</span><span class="w"> </span><span class="n">astro_en</span>
<span class="w"> </span><span class="k">ADD</span><span class="w"> </span><span class="n">MAPPING</span><span class="w"> </span><span class="k">FOR</span><span class="w"> </span><span class="n">asciiword</span><span class="w"> </span><span class="k">WITH</span><span class="w"> </span><span class="n">astro_syn</span><span class="p">,</span><span class="w"> </span><span class="n">english_ispell</span><span class="p">,</span><span class="w"> </span><span class="n">english_stem</span><span class="p">;</span>
</pre></div></td></tr></table></div>
</div>
<p id="EN-US_TOPIC_0000001188588966__p15825123017353">A filtering dictionary can be placed anywhere in the list except at the end, where it would be useless. Filtering dictionaries are useful for partially normalizing words, which simplifies the task of later dictionaries.</p>
</li></ul>
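<p>To see how a configuration routes each token through its dictionary list, you can inspect it with the <strong>ts_debug</strong> function. The sketch below uses the predefined <strong>english</strong> configuration, an assumption chosen so that the query does not depend on custom dictionaries, and a made-up sample text. For each token, the <strong>dictionaries</strong> column lists the dictionaries mapped to its token type, the <strong>dictionary</strong> column names the one that recognized the token, and the <strong>lexemes</strong> column shows what that dictionary returned.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Assumes the predefined english configuration; the input text is a sample.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('english', 'The stars');
</pre></div></div>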
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="dws_06_0102.html">Dictionaries</a></div>
</div>
</div>