<a name="EN-US_TOPIC_0000001233430163"></a>
<h1 class="topictitle1">Full-Text Retrieval</h1>
<div id="body1536648829943"><p id="EN-US_TOPIC_0000001233430163__en-us_topic_0059778292_p987013511802">Full text searching (or just text search) provides the capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query. The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query.</p>
<p id="EN-US_TOPIC_0000001233430163__en-us_topic_0059778075_p171221247206">Textual search operators have been used in databases for years. The <span id="EN-US_TOPIC_0000001233430163__text508424830">GaussDB(DWS)</span> has <strong id="EN-US_TOPIC_0000001233430163__b164191451520">~</strong>, <strong id="EN-US_TOPIC_0000001233430163__b486812313157">~*</strong>, <strong id="EN-US_TOPIC_0000001233430163__b48211728101518">LIKE</strong>, and <strong id="EN-US_TOPIC_0000001233430163__b294803291512">ILIKE </strong>operators for textual data types, but they lack many essential properties required by modern information systems. This problem can be solved by using indexes and dictionaries.</p>
<div class="note" id="EN-US_TOPIC_0000001233430163__note122379223317"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="EN-US_TOPIC_0000001233430163__p7237822631">The hybrid data warehouse (standalone) does not support full-text search.</p>
</div></div>
<div class="p" id="EN-US_TOPIC_0000001233430163__p1863381020555">Text search lacks the following essential properties required by information systems:<ul id="EN-US_TOPIC_0000001233430163__u30ef35a242e9483b9d58a1694c18fde2"><li id="EN-US_TOPIC_0000001233430163__ld665074884df490585e2f866e0886177">There is no linguistic support, even for English.<p id="EN-US_TOPIC_0000001233430163__a797d5480556d499ba43d9251d312ab07"><a name="EN-US_TOPIC_0000001233430163__ld665074884df490585e2f866e0886177"></a><a name="ld665074884df490585e2f866e0886177"></a>Regular expressions are not sufficient because they cannot easily handle derived words. For example, you might miss documents that contain <strong id="EN-US_TOPIC_0000001233430163__b842352706135932">satisfies</strong>, although you probably would like to find them when searching for <strong id="EN-US_TOPIC_0000001233430163__b842352706135937">satisfy</strong>. It is possible to use <strong id="EN-US_TOPIC_0000001233430163__b84235270614018">OR</strong> to search for multiple derived forms, but this is tedious and error-prone, because some words can have several thousand derivatives.</p>
</li></ul>
<ul id="EN-US_TOPIC_0000001233430163__u67b2a99fe9b34b4eab2d0c1c2b433229"><li id="EN-US_TOPIC_0000001233430163__l1c8cf3ae85724be6942230002611b839">They provide no ordering (ranking) of search results, which makes them ineffective when thousands of matching documents are found.</li></ul>
<ul id="EN-US_TOPIC_0000001233430163__ub08722a8f8ec4595a5afd043b659c4ba"><li id="EN-US_TOPIC_0000001233430163__l9ebdba5f182d41669e840bc6e2e7c466">They tend to be slow because there is no index support, so they must process all documents for every search.</li></ul>
</div>
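<p>The lack of linguistic support is easy to demonstrate. In the following sketch, which assumes the standard <strong>english</strong> text search configuration, a substring match misses the derived form, while full-text matching normalizes both the document and the query to the same lexeme:</p>
<pre class="screen">SELECT 'A design that satisfies the query' LIKE '%satisfy%';      -- false: the derived form is missed
SELECT to_tsvector('english', 'A design that satisfies the query')
       @@ to_tsquery('english', 'satisfy');                        -- true: both sides stem to 'satisfi'</pre>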
<div class="p" id="EN-US_TOPIC_0000001233430163__p16634121075520">Full text indexing allows documents to be preprocessed and an index is saved for later rapid searching. Preprocessing includes:<ul id="EN-US_TOPIC_0000001233430163__uc754cfc5ca784f14a84160ffbc8e5660"><li id="EN-US_TOPIC_0000001233430163__l610e4eeea3364b62b8f6547b1047cd0a">Parsing documents into tokens<p id="EN-US_TOPIC_0000001233430163__a114e4c225e3b4166adc7dadb1c775342"><a name="EN-US_TOPIC_0000001233430163__l610e4eeea3364b62b8f6547b1047cd0a"></a><a name="l610e4eeea3364b62b8f6547b1047cd0a"></a>It is useful to identify various classes of tokens, for example, numbers, words, complex words, and email addresses, so that they can be processed differently. In principle, token classes depend on the specific application, but for most purposes it is adequate to use a predefined set of classes.</p>
</li><li id="EN-US_TOPIC_0000001233430163__lc68e5f0b5beb43b7ae6637af1d600944">Converting tokens into lexemes<p id="EN-US_TOPIC_0000001233430163__a0b8469e6b01844ca8fe3fc603145d581"><a name="EN-US_TOPIC_0000001233430163__lc68e5f0b5beb43b7ae6637af1d600944"></a><a name="lc68e5f0b5beb43b7ae6637af1d600944"></a>A lexeme is a string, just like a token, but it has been normalized so that different forms of the same word are made alike. For example, normalization almost always includes folding upper-case letters to lower-case, and often involves removal of suffixes (such as <strong id="EN-US_TOPIC_0000001233430163__b84235270614196">s</strong> or <strong id="EN-US_TOPIC_0000001233430163__b84235270614199">es</strong> in English) This allows searches to find variant forms of the same word, without tediously entering all the possible variants. Also, this step typically eliminates stop words, which are words that are so common that they are useless for searching. (In short, tokens are raw fragments of the document text, while lexemes are words that are believed useful for indexing and searching.) <span id="EN-US_TOPIC_0000001233430163__text2095272401">GaussDB(DWS)</span> uses dictionaries to perform this step and provides various standard dictionaries.</p>
</li></ul>
<ul id="EN-US_TOPIC_0000001233430163__u1ceb893ebe6a4082ae5362b7923baee6"><li id="EN-US_TOPIC_0000001233430163__l6f8db66110e641dc9f8e132075e87921">Storing preprocessed documents optimized for searching<p id="EN-US_TOPIC_0000001233430163__aaf5fc5f5cce14437abecebaa8bb2502f"><a name="EN-US_TOPIC_0000001233430163__l6f8db66110e641dc9f8e132075e87921"></a><a name="l6f8db66110e641dc9f8e132075e87921"></a>For example, each document can be represented as a sorted array of normalized lexemes. Along with the lexemes, it is often desirable to store positional information for proximity ranking. Therefore, a document that contains a more "dense" region of query words is assigned with a higher rank than the one with scattered query words.</p>
</li></ul>
</div>
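<p>The effect of these preprocessing steps can be seen by inspecting a <strong>tsvector</strong> value directly. The following is an illustrative sketch assuming the standard <strong>english</strong> configuration; the exact lexemes and stop words depend on the dictionaries in use:</p>
<pre class="screen">SELECT to_tsvector('english', 'The quick brown foxes jumped over the lazy dogs');
                       to_tsvector
--------------------------------------------------------
 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2
(1 row)</pre>
<p>Stop words such as <strong>the</strong> and <strong>over</strong> are discarded, the remaining tokens are normalized to lexemes (for example, <strong>foxes</strong> becomes <strong>fox</strong>), and each lexeme keeps its position in the original document.</p>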
<p id="EN-US_TOPIC_0000001233430163__ac7804bf2cbb444369ac0d90a59155619">Dictionaries allow fine-grained control over how tokens are normalized. With appropriate dictionaries, you can define stop words that should not be indexed.</p>
<p id="EN-US_TOPIC_0000001233430163__ab77272e964d249fab03bf7b13e584cbf">A data type <strong id="EN-US_TOPIC_0000001233430163__b1143918599134">tsvector</strong> is provided for storing preprocessed documents, along with a type <strong id="EN-US_TOPIC_0000001233430163__b443911596135">tsquery</strong> for storing query conditions. For details, see <a href="dws_06_0018.html">Text Search Types</a>. For details about the functions and operators provided for the tsvector data type, see <a href="dws_06_0039.html">Text Search Functions and Operators</a>. The matching operator <strong id="EN-US_TOPIC_0000001233430163__b209244971617">@@</strong> is the most important. For details, see <a href="dws_06_0085.html">Basic Text Matching</a>.</p>
<p id="EN-US_TOPIC_0000001233430163__p8060118"></p>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="dws_06_0082.html">Introduction</a></div>
</div>
</div>