Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Lu, Huayi <luhuayi@huawei.com> Co-committed-by: Lu, Huayi <luhuayi@huawei.com>
<a name="EN-US_TOPIC_0000001188588980"></a>
<h1 class="topictitle1">Parsing Documents</h1>
<div id="body8662426"><p id="EN-US_TOPIC_0000001188588980__en-us_topic_0059777460_p178011048913"><span id="EN-US_TOPIC_0000001188588980__text814909004">GaussDB(DWS)</span> provides the <strong id="EN-US_TOPIC_0000001188588980__en-us_topic_0058965789_b842352706113116">to_tsvector</strong> function for converting a document to the <strong id="EN-US_TOPIC_0000001188588980__en-us_topic_0058965789_b842352706113120">tsvector</strong> data type.</p>
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001188588980__sa05326a2a4134a78a09ed28b350d4348"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><span class="n">to_tsvector</span><span class="p">([</span><span class="w"> </span><span class="n">config</span><span class="w"> </span><span class="n">regconfig</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="n">document</span><span class="w"> </span><span class="nb">text</span><span class="p">)</span><span class="w"> </span><span class="k">returns</span><span class="w"> </span><span class="n">tsvector</span>
</pre></div></td></tr></table></div>
</div>
<p id="EN-US_TOPIC_0000001188588980__a81066ad06feb468a9cb4b6b02d2d44f7"><strong id="EN-US_TOPIC_0000001188588980__b842352706113258">to_tsvector</strong> parses a textual document into tokens, reduces the tokens to lexemes, and returns a <strong id="EN-US_TOPIC_0000001188588980__b84235270611339">tsvector</strong>, which lists the lexemes together with their positions in the document. The document is processed according to the specified or default text search configuration. Here is a simple example:</p>
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001188588980__s435603da859d4598b14cde83534c5b70"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span></pre></div></td><td class="code"><div><pre><span></span><span class="k">SELECT</span><span class="w"> </span><span class="n">to_tsvector</span><span class="p">(</span><span class="s1">'english'</span><span class="p">,</span><span class="w"> </span><span class="s1">'a fat cat sat on a mat - it ate a fat rats'</span><span class="p">);</span>
<span class="w"> </span><span class="n">to_tsvector</span>
<span class="c1">-----------------------------------------------------</span>
<span class="w"> </span><span class="s1">'ate'</span><span class="p">:</span><span class="mi">9</span><span class="w"> </span><span class="s1">'cat'</span><span class="p">:</span><span class="mi">3</span><span class="w"> </span><span class="s1">'fat'</span><span class="p">:</span><span class="mi">2</span><span class="p">,</span><span class="mi">11</span><span class="w"> </span><span class="s1">'mat'</span><span class="p">:</span><span class="mi">7</span><span class="w"> </span><span class="s1">'rat'</span><span class="p">:</span><span class="mi">12</span><span class="w"> </span><span class="s1">'sat'</span><span class="p">:</span><span class="mi">4</span>
</pre></div></td></tr></table></div>
</div>
<p id="EN-US_TOPIC_0000001188588980__aa0c3087d0e944b499b045b439b079fcb">In the preceding example, the resulting <strong id="EN-US_TOPIC_0000001188588980__b842352706113411">tsvector</strong> does not contain the words <strong id="EN-US_TOPIC_0000001188588980__b842352706113421">a</strong>, <strong id="EN-US_TOPIC_0000001188588980__b842352706113423">on</strong>, or <strong id="EN-US_TOPIC_0000001188588980__b842352706113425">it</strong>; the word <strong id="EN-US_TOPIC_0000001188588980__b842352706113427">rats</strong> became <strong id="EN-US_TOPIC_0000001188588980__b842352706113434">rat</strong>; and the punctuation sign (-) was ignored.</p>
<p id="EN-US_TOPIC_0000001188588980__a1581d7873189487b82b18a211a8f1409">The <strong id="EN-US_TOPIC_0000001188588980__b84235270611354">to_tsvector</strong> function internally calls a parser, which breaks the document text into tokens and assigns a type to each token. For each token, a list of dictionaries is consulted, where the list can vary depending on the token type. The first dictionary that recognizes the token emits one or more normalized lexemes to represent the token. For example:</p>
<ul id="EN-US_TOPIC_0000001188588980__u533ec720e85945dc8ecd461655200d9d"><li id="EN-US_TOPIC_0000001188588980__lf03cdc6c0dd14ad5bb76e85cb2799b77"><strong id="EN-US_TOPIC_0000001188588980__b842352706113818">rats</strong> became <strong id="EN-US_TOPIC_0000001188588980__b842352706113820">rat</strong> because one of the dictionaries recognized that the word <strong id="EN-US_TOPIC_0000001188588980__b842352706113822">rats</strong> is a plural form of <strong id="EN-US_TOPIC_0000001188588980__b842352706113824">rat</strong>.</li><li id="EN-US_TOPIC_0000001188588980__l838773caaa5f49f180ab70b0900ec515">Some words are recognized as stop words (see <a href="dws_06_0104.html">Stop Words</a>) and are ignored because they occur too frequently to be useful in searching. In our example, these are <strong id="EN-US_TOPIC_0000001188588980__b842352706113922">a</strong>, <strong id="EN-US_TOPIC_0000001188588980__b842352706113924">on</strong>, and <strong id="EN-US_TOPIC_0000001188588980__b842352706113926">it</strong>.</li><li id="EN-US_TOPIC_0000001188588980__lb0250fb85d4446088d9bdacaa98e866a">If no dictionary in the list recognizes the token, it is also ignored. In this example, that happened to the punctuation sign (-) because no dictionaries are assigned to its token type (<strong id="EN-US_TOPIC_0000001188588980__b84235270611408">Space symbols</strong>), so space tokens are never indexed.</li></ul>
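<p>The token-by-token processing described above can be inspected with a query such as the following. This is a minimal sketch, assuming that the <strong>ts_debug</strong> function is available in your cluster and behaves as it does in PostgreSQL; the input string is made up for illustration.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Assumption: ts_debug is available and follows PostgreSQL semantics.
-- Each row describes one token: its type (alias), the dictionaries consulted
-- for that type, the dictionary that recognized the token, and the lexemes emitted.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('english', 'a fat rats - it');
</pre></div></div>
<p>Under these assumptions, a stop word would typically appear with an empty lexeme list, and a token whose type has no dictionaries assigned would appear with an empty dictionary list.</p>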
<p id="EN-US_TOPIC_0000001188588980__a3cad884675854abbb4732f9874c84754">The choices of parser, dictionaries and which types of tokens to index are determined by the selected text search configuration. It is possible to have many different configurations in the same database, and predefined configurations are available for various languages. In our example we used the default configuration <strong id="EN-US_TOPIC_0000001188588980__b842352706114218">english</strong> for the English language.</p>
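<p>To illustrate how the configuration changes the result, the sketch below runs the same sentence through two configurations. It assumes that the predefined <strong>simple</strong> configuration exists in your database and that <strong>default_text_search_config</strong> can be inspected with <strong>SHOW</strong>, as in PostgreSQL.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Show which configuration is used when the config argument is omitted.
SHOW default_text_search_config;

-- english: stop words are dropped and words are stemmed.
SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');

-- simple (assumed to be available): tokens are lowercased only, so stop words
-- such as "a", "on", and "it" are expected to be kept and "rats" left unstemmed.
SELECT to_tsvector('simple', 'a fat cat sat on a mat - it ate a fat rats');
</pre></div></div>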
<p id="EN-US_TOPIC_0000001188588980__a430440cadc084e12baf50f0949258419">The function <strong id="EN-US_TOPIC_0000001188588980__b842352706114242">setweight</strong> can be used to label the entries of a <strong id="EN-US_TOPIC_0000001188588980__b842352706114246">tsvector</strong> with a given weight, where a weight is one of the letters <strong id="EN-US_TOPIC_0000001188588980__b842352706114253">A</strong>, <strong id="EN-US_TOPIC_0000001188588980__b842352706114255">B</strong>, <strong id="EN-US_TOPIC_0000001188588980__b842352706114257">C</strong>, or <strong id="EN-US_TOPIC_0000001188588980__b842352706114258">D</strong>. This is typically used to mark entries coming from different parts of a document, such as title versus body. Later, this information can be used for ranking of search results.</p>
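<p>As a small sketch of the labeling described above, the following query assigns weight <strong>A</strong> to every entry of a short vector. The input string is illustrative only.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Label all lexeme positions of the vector with weight A.
SELECT setweight(to_tsvector('english', 'fat cats'), 'A');
</pre></div></div>
<p>With PostgreSQL-compatible behavior, each position in the result carries the weight letter as a suffix, for example <strong>'cat':2A 'fat':1A</strong>.</p>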
<p id="EN-US_TOPIC_0000001188588980__a955d4d70cb544ba188bc48a9db0924db">Because <strong id="EN-US_TOPIC_0000001188588980__b842352706114329">to_tsvector(NULL)</strong> returns <strong id="EN-US_TOPIC_0000001188588980__b842352706114332">NULL</strong>, you are advised to use <strong id="EN-US_TOPIC_0000001188588980__b842352706114337">coalesce</strong> whenever a column might be <strong id="EN-US_TOPIC_0000001188588980__b11831241397">NULL</strong>. Here is the recommended method for creating a <strong id="EN-US_TOPIC_0000001188588980__b842352706114350">tsvector</strong> from a structured document:</p>
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001188588980__s20ca89f86327427c85cc8b5d3f9f5a75"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span></pre></div></td><td class="code"><div><pre><span></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">tsearch</span><span class="p">.</span><span class="n">tt</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="w"> </span><span class="nb">int</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"> </span><span class="n">keyword</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"> </span><span class="n">abstract</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"> </span><span class="n">body</span><span class="w"> </span><span class="nb">text</span><span class="p">,</span><span class="w"> </span><span class="n">ti</span><span class="w"> </span><span class="n">tsvector</span><span class="p">);</span>

<span class="k">INSERT</span><span class="w"> </span><span class="k">INTO</span><span class="w"> </span><span class="n">tsearch</span><span class="p">.</span><span class="n">tt</span><span class="p">(</span><span class="n">id</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">keyword</span><span class="p">,</span><span class="w"> </span><span class="n">abstract</span><span class="p">,</span><span class="w"> </span><span class="n">body</span><span class="p">)</span><span class="w"> </span><span class="k">VALUES</span><span class="w"> </span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="s1">'book'</span><span class="p">,</span><span class="w"> </span><span class="s1">'literature'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Ancient poetry'</span><span class="p">,</span><span class="s1">'Tang poem Song iambic verse'</span><span class="p">);</span>

<span class="k">UPDATE</span><span class="w"> </span><span class="n">tsearch</span><span class="p">.</span><span class="n">tt</span><span class="w"> </span><span class="k">SET</span><span class="w"> </span><span class="n">ti</span><span class="w"> </span><span class="o">=</span>
<span class="w"> </span><span class="n">setweight</span><span class="p">(</span><span class="n">to_tsvector</span><span class="p">(</span><span class="k">coalesce</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="s1">''</span><span class="p">)),</span><span class="w"> </span><span class="s1">'A'</span><span class="p">)</span><span class="w"> </span><span class="o">||</span>
<span class="w"> </span><span class="n">setweight</span><span class="p">(</span><span class="n">to_tsvector</span><span class="p">(</span><span class="k">coalesce</span><span class="p">(</span><span class="n">keyword</span><span class="p">,</span><span class="s1">''</span><span class="p">)),</span><span class="w"> </span><span class="s1">'B'</span><span class="p">)</span><span class="w"> </span><span class="o">||</span>
<span class="w"> </span><span class="n">setweight</span><span class="p">(</span><span class="n">to_tsvector</span><span class="p">(</span><span class="k">coalesce</span><span class="p">(</span><span class="n">abstract</span><span class="p">,</span><span class="s1">''</span><span class="p">)),</span><span class="w"> </span><span class="s1">'C'</span><span class="p">)</span><span class="w"> </span><span class="o">||</span>
<span class="w"> </span><span class="n">setweight</span><span class="p">(</span><span class="n">to_tsvector</span><span class="p">(</span><span class="k">coalesce</span><span class="p">(</span><span class="n">body</span><span class="p">,</span><span class="s1">''</span><span class="p">)),</span><span class="w"> </span><span class="s1">'D'</span><span class="p">);</span>
<span class="k">DROP</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">tsearch</span><span class="p">.</span><span class="n">tt</span><span class="p">;</span>
</pre></div></td></tr></table></div>
</div>
<p id="EN-US_TOPIC_0000001188588980__a9c567cdcc78040e98883cc2b4f87962f">Here we have used <strong id="EN-US_TOPIC_0000001188588980__b1662172713610">setweight</strong> to label the source of each lexeme in the finished <strong id="EN-US_TOPIC_0000001188588980__b116222733616">tsvector</strong>, and then merged the labeled <strong id="EN-US_TOPIC_0000001188588980__b126232719367">tsvector</strong> values using the tsvector concatenation operator <strong id="EN-US_TOPIC_0000001188588980__b1262927173612">||</strong>. For details about these operations, see <a href="dws_06_0097.html">Manipulating tsvector</a>.</p>
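<p>The reason for wrapping each column in <strong>coalesce</strong> can be checked with a query like the one below. This is a sketch that assumes the <strong>||</strong> operator propagates <strong>NULL</strong> as it does in PostgreSQL; the literal values stand in for table columns.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Assumption: tsvector concatenation returns NULL if either operand is NULL.
-- Without coalesce, one NULL field would make the whole combined vector NULL.
SELECT (to_tsvector('english', 'book') ||
        to_tsvector('english', NULL::text)) IS NULL AS without_coalesce,
       (to_tsvector('english', 'book') ||
        to_tsvector('english', coalesce(NULL::text, ''))) IS NULL AS with_coalesce;
</pre></div></div>
<p>Under that assumption, <strong>without_coalesce</strong> is expected to be true and <strong>with_coalesce</strong> false, which is why a single missing field should be replaced with an empty string before conversion.</p>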
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="dws_06_0091.html">Controlling Text Search</a></div>
</div>
</div>