Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Lu, Huayi <luhuayi@huawei.com> Co-committed-by: Lu, Huayi <luhuayi@huawei.com>
<a name="EN-US_TOPIC_0000001233430177"></a><a name="EN-US_TOPIC_0000001233430177"></a>
<h1 class="topictitle1">Text Search Parser</h1>
<div id="body8662426"><p id="EN-US_TOPIC_0000001233430177__en-us_topic_0059778480_p196571551946">Text search parsers are responsible for splitting raw document text into tokens and identifying each token's type, where the set of types is defined by the parser itself. Note that a parser does not modify the text at all — it simply identifies plausible word boundaries. Because of this limited scope, there is less need for application-specific custom parsers than there is for custom dictionaries.</p>
<p id="EN-US_TOPIC_0000001233430177__a8318ee8074694179a27cecbc1b31c5b8">Currently, <span id="EN-US_TOPIC_0000001233430177__text1225663870">GaussDB(DWS)</span> provides the following built-in parsers: pg_catalog.default for English configurations, and pg_catalog.ngram, pg_catalog.zhparser, and pg_catalog.pound for full-text search in text that contains Chinese, or both Chinese and English.</p>
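<p>As an illustration, the parsers available in a database can usually be listed from the system catalog. The query below is a minimal sketch that assumes the <strong>pg_ts_parser</strong> and <strong>pg_namespace</strong> catalogs are exposed as they are in PostgreSQL; the rows returned depend on the cluster version.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- List every text search parser together with the schema it belongs to.
-- Assumption: pg_ts_parser and pg_namespace are available as in PostgreSQL.
SELECT n.nspname AS schema_name, p.prsname AS parser_name
FROM pg_ts_parser p
JOIN pg_namespace n ON n.oid = p.prsnamespace
ORDER BY parser_name;
</pre></div></div>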
<p id="EN-US_TOPIC_0000001233430177__a877c6e31601c4610a0298d653eec5b44">The default parser is named <strong id="EN-US_TOPIC_0000001233430177__b842352706101936">pg_catalog.default</strong>. It recognizes 23 token types, shown in <a href="#EN-US_TOPIC_0000001233430177__tfcaeb83ea7fb42de882258f647b03890">Table 1</a>.</p>
<div class="tablenoborder"><a name="EN-US_TOPIC_0000001233430177__tfcaeb83ea7fb42de882258f647b03890"></a><a name="tfcaeb83ea7fb42de882258f647b03890"></a><table cellpadding="4" cellspacing="0" summary="" id="EN-US_TOPIC_0000001233430177__tfcaeb83ea7fb42de882258f647b03890" frame="border" border="1" rules="all"><caption><b>Table 1 </b>Default parser's token types</caption><thead align="left"><tr id="EN-US_TOPIC_0000001233430177__r8f351bfa363645c09f7ee2d49d135a59"><th align="left" class="cellrowborder" valign="top" width="25.25%" id="mcps1.3.4.2.4.1.1"><p id="EN-US_TOPIC_0000001233430177__a9b3ba65eee004ea6a506b1eec78ee02f">Alias</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="35.78%" id="mcps1.3.4.2.4.1.2"><p id="EN-US_TOPIC_0000001233430177__a03b03deeea7a4cd7becfbb4ebd1d8df7">Description</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="38.97%" id="mcps1.3.4.2.4.1.3"><p id="EN-US_TOPIC_0000001233430177__ac786391cad864779bee2ec4fbcab73a5">Examples</p>
</th>
</tr>
</thead>
<tbody><tr id="EN-US_TOPIC_0000001233430177__r37db00922e8044e88159f8b4274df191"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__a2dc8420bd19647188577ded2b8636c7a">asciiword</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__a9484580033844afe9646580a24843d79">Word, all ASCII letters</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__a586850a8988b4f678ab710c4ecd72dea">elephant</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__rd9abcbec3d8446adbc64474538b60bac"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__a9bcacae5a94643f88d8781f78b3ff373">word</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__a2b6de966bff04dd9956d3738fc76883a">Word, all letters</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__a49f00f90d4754d638d57fe8409e68a7f">mañana</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__rbda17c6200174b66acfb3cf7146ba3ba"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__a8bd02f0681004a83a6958c6ff30d1db0">numword</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__aaf0566377c1d40bf9aa3b0bb4823a070">Word, letters and digits</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__a7e1b7862fc764c87942c643a842f2610">beta1</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__r27fe468869644a44bff68b98c7980091"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__a8985a7e6f3a84f838d3835e39bf7c6dc">asciihword</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__ad9964428965b4ae99c00c20aba71add9">Hyphenated word, all ASCII</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__a8c3a74e22d604bb1b18537511d0199e4">up-to-date</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__rfd25833b1a0d4f85ba1df8abe78e7e23"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__a1b1d48044aa249648ede118f4b37aebc">hword</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__a3ba551ad27a641169d0c38671cf1f7c6">Hyphenated word, all letters</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__afac61762969a4d548bf661533d865133">lógico-matemática</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__rb4ab6af7a49c442792736735cd0eaac3"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__ae12037e303944af0aa586378c76d71ef">numhword</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__acadcfef1df984b4782955974e3df7fdd">Hyphenated word, letters and digits</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__a3a7120b97210410c9963d9ad46ccde05">postgresql-beta1</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__r176dcfeb080b49a5844e670122a27df9"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__ac749ee2ab7d645a18ad1acb3ad0a680f">hword_asciipart</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__a8fd9120cf6b240f190ab3f46d5093bcc">Hyphenated word part, all ASCII</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__a74e3f5d5f8e34dce92804d15654b45e1">postgresql in the context postgresql-beta1</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__r130faba6482b4e01b3dd92fe09d02e29"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__a4fea307e1a8f463a8411ac48c5fe2089">hword_part</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__a96a9a7adde894b5781c97afa3a64526a">Hyphenated word part, all letters</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__a23e3bbb5f6a44f81bbfc8e3ee35b6991">lógico or matemática in the context lógico-matemática</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__rc7691cc089d14d438f81e67a2e9ee01a"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__ad15abf868b2c49b1a7c1e410e40c8cef">hword_numpart</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__a6ae8114b2b51485491de8f7ab5eb5e3f">Hyphenated word part, letters and digits</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__aaaa9663e96d54b70a11a370158f8f734">beta1 in the context postgresql-beta1</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__r43e776b3ef8e46c690aab5b16188b846"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__ae9e9bac5f1de4cf9bbd28bd8aed9f56f">email</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__ad89c0fd54fd9473080ff92a859a5e319">Email address</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__a57d81d2fd38c4b06bc51d77d3f45e73e">foo@example.com</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__rfe7b85d387d049bc94c4c1a6e435c963"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__a66c042ffbd164a6083f8542b1760e7c7">protocol</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__a9d37881e42cd468bb6e742e7e8b3d760">Protocol head</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__a6733b401854e4e3f8efe7f0b9f61a69b">http://</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__r3257634c68f5453e92c36fe718e2fdd3"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__a8126f979132646c59f206b78af8c8371">url</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__a82c6fd3e17db43a49f6205f13cbddd4d">URL</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__ad948cca5b64749c0b563453e424b7c96">example.com/stuff/index.html</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__r526df87ac619456f95f7fc6cd7a2bb73"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__a72d8eea46d954e2ea12aad53634859be">host</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__aeb745dbae67a4c9aa043104125c40ee7">Host</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__af71175f690ce4d77a40cc2530511a393">example.com</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__r1e7d19c8aa9a43f5aacb8b9722ca99b9"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__a2db5030ea6ae4f98b34731641debd502">url_path</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__aee08ce84ab3b4dc7bdf4b25f6da35426">URL path</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__af9afadb0296d409a8399cfdef631872d">/stuff/index.html, in the context of a URL</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__r23f81a80a9fd4deba355d8c606db4d34"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__a51229c9bc49249029e645a8b72e2e134">file</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__a3d086b1238094cc99a6660ab8ce79027">File or path name</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__a3f605251e8884390a773bcbf6f786992">/usr/local/foo.txt, if not within a URL</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__rc4f2a9efbda540869b9a29ec1d1fc3a9"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__a54313823495f48d8bc406c11f19b600d">sfloat</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__a150389a16f384ab59cca8e4a27572350">Scientific notation</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__a7f2f2d5729254043b89075ee8463dabf">-1.23E+56</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__r51018adcce49423dbe0bbfaa52b8d01a"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__a06bb420b731d463eae4fce1d35111e8c">float</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__a8fbaa950bb684366826e9e3394614962">Decimal notation</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__ad0062f8ab09849429ad48bb8665e118f">-1.234</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__r19caea8c2a2c4570a1fcd493cbc8f3fd"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__a5119d3cc1ba142668d4d9aea872305d4">int</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__a89909a9eda3b415aba6ae27ba46a35ef">Signed integer</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__ad761ac638f0a4abbaaa54c134a353393">-1234</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__r8efcea8ac401459ea175011d6f3089ea"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__a084a5a3968fd4fcb9b75a262966eeb12">uint</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__a0efc08499b514c538c162eaf53c3d32b">Unsigned integer</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__a9d9c7c2101bf41cd80980385be3c4394">1234</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__r23dbb89c0374473fa21d4f77de783a1a"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__a5621007f66cc48dcbe7dc37c3e28036c">version</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__ad3fda73a78a44fa8a1e20133d127a38c">Version number</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__ac7b24637bd1e4706ba718d03f4475714">8.3.0</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__r37eea6bac64c461ea5aca4fccf8763d1"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__a0e915600603d4f9d9dfe1f46f7f9fa48">tag</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__a9bb77e166118473e9ca385ef9c28c668">XML tag</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__a647e808f050d47909d225e951a286e8f">&lt;a href="dictionaries.html"&gt;</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__rbff43d4a61c34b7ababf249e27b4c669"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__a396c85488a0d4037a2871edcd53d240c">entity</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__ae1260a3df5cf43f0bc26dc20e7aac235">XML entity</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__adc129279138f4f68a199605c6a8f6a7a">&amp;amp;</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__r3d2c64d7167845f88b4047066f4abf9c"><td class="cellrowborder" valign="top" width="25.25%" headers="mcps1.3.4.2.4.1.1 "><p id="EN-US_TOPIC_0000001233430177__a2a1a475c0c9249a0be68f3b7bb397343">blank</p>
</td>
<td class="cellrowborder" valign="top" width="35.78%" headers="mcps1.3.4.2.4.1.2 "><p id="EN-US_TOPIC_0000001233430177__a2ba64b14e6f74139a0d4129f01531b00">Space symbols</p>
</td>
<td class="cellrowborder" valign="top" width="38.97%" headers="mcps1.3.4.2.4.1.3 "><p id="EN-US_TOPIC_0000001233430177__a72ba92cb4a5c4b52bd29a2134a137e2e">(any whitespace or punctuation not otherwise recognized)</p>
</td>
</tr>
</tbody>
</table>
</div>
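<p>The token types in Table 1 can also be read from the database itself. The following is a minimal sketch that assumes the <strong>ts_token_type</strong> function accepts a parser name, as it does in PostgreSQL.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Return one row per token type (tokid, alias, description) of the default parser.
-- Assumption: ts_token_type(parser_name) behaves as in PostgreSQL.
SELECT * FROM ts_token_type('default');
</pre></div></div>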
<p id="EN-US_TOPIC_0000001233430177__a0833028bb5494f59b35dd6ae2bf2c399">Note: The parser's notion of a "letter" is determined by the database's locale setting, specifically <strong id="EN-US_TOPIC_0000001233430177__b84235270610434">lc_ctype</strong>. Words containing only the basic ASCII letters are reported as a separate token type, since it is sometimes useful to distinguish them. In most European languages, token types word and asciiword should be treated alike. </p>
<p id="EN-US_TOPIC_0000001233430177__aa72d3e36e40d4bd49c57e6d26d43ffc2"><strong id="EN-US_TOPIC_0000001233430177__b842352706104958">email</strong> does not support all valid email characters as defined by RFC 5322. Specifically, the only non-alphanumeric characters supported for email user names are period, dash, and underscore. </p>
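<p>One way to check how a particular address is tokenized is to run it through <strong>ts_debug</strong>. The addresses below are made up for illustration, and the output is omitted because it may vary with the parser version.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- User name uses only period, underscore, and dash, which the text above lists as supported.
SELECT alias, token FROM ts_debug('english', 'first.last_name-01@example.com');

-- User name contains a plus sign, which is outside the supported set,
-- so the address may not be reported as a single email token.
SELECT alias, token FROM ts_debug('english', 'first+tag@example.com');
</pre></div></div>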
<p id="EN-US_TOPIC_0000001233430177__a5f10701346114143baa2ce943e6452be">It is possible for the parser to identify overlapping tokens in the same piece of text. As an example, a hyphenated word will be reported both as the entire word and as each component: </p>
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001233430177__s3dc0d5bee8ab4880b71237d8524f5e7c"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="k">SELECT</span><span class="w"> </span><span class="k">alias</span><span class="p">,</span><span class="w"> </span><span class="n">description</span><span class="p">,</span><span class="w"> </span><span class="n">token</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">ts_debug</span><span class="p">(</span><span class="s1">'english'</span><span class="p">,</span><span class="s1">'foo-bar-beta1'</span><span class="p">);</span>
      alias      |               description                |     token     
-----------------+------------------------------------------+---------------
 numhword        | Hyphenated word, letters and digits      | foo-bar-beta1
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_asciipart | Hyphenated word part, all ASCII          | bar
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | beta1
(6 rows)
</pre></div></td></tr></table></div>
</div>
<p id="EN-US_TOPIC_0000001233430177__a8762d3369f8e46b998b5e47dc964bc86">This behavior is desirable since it allows searches to work for both the whole compound word and for components. Here is another instructive example: </p>
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001233430177__sa487b9c814f74bd481e21cdcab289569"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="k">SELECT</span><span class="w"> </span><span class="k">alias</span><span class="p">,</span><span class="w"> </span><span class="n">description</span><span class="p">,</span><span class="w"> </span><span class="n">token</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">ts_debug</span><span class="p">(</span><span class="s1">'english'</span><span class="p">,</span><span class="s1">'http://example.com/stuff/index.html'</span><span class="p">);</span>
  alias   |  description  |            token             
----------+---------------+------------------------------
 protocol | Protocol head | http://
 url      | URL           | example.com/stuff/index.html
 host     | Host          | example.com
 url_path | URL path      | /stuff/index.html
(4 rows)
</pre></div></td></tr></table></div>
</div>
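<p>To see why overlapping tokens help searching, the hyphenated example can be matched against both a component and the whole compound. This is a sketch only: it assumes the <strong>english</strong> configuration used above and the standard <strong>to_tsvector</strong>, <strong>to_tsquery</strong>, and <strong>@@</strong> operators. If the compound and its parts are all indexed as described, both columns are expected to be true.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Match the same document against one component and against the full compound word.
SELECT to_tsvector('english', 'foo-bar-beta1') @@ to_tsquery('english', 'bar') AS matches_component,
       to_tsvector('english', 'foo-bar-beta1') @@ to_tsquery('english', 'foo-bar-beta1') AS matches_compound;
</pre></div></div>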
<p id="EN-US_TOPIC_0000001233430177__a2570b3157206487a8773242f0ec8a6cf">N-gram is a mechanical word segmentation method, suited to Chinese segmentation scenarios that do not involve semantics. The N-gram parser supports the GBK and UTF-8 Chinese encodings. Its six built-in token types are shown in <a href="#EN-US_TOPIC_0000001233430177__t7682dee3b51a4bdbac3572a7d5621298">Table 2</a>.</p>
<div class="tablenoborder"><a name="EN-US_TOPIC_0000001233430177__t7682dee3b51a4bdbac3572a7d5621298"></a><a name="t7682dee3b51a4bdbac3572a7d5621298"></a><table cellpadding="4" cellspacing="0" summary="" id="EN-US_TOPIC_0000001233430177__t7682dee3b51a4bdbac3572a7d5621298" frame="border" border="1" rules="all"><caption><b>Table 2 </b>Token types</caption><thead align="left"><tr id="EN-US_TOPIC_0000001233430177__rde9913114995419da7640cd18f3f3351"><th align="left" class="cellrowborder" valign="top" width="33.08%" id="mcps1.3.12.2.3.1.1"><p id="EN-US_TOPIC_0000001233430177__a2270461e915c4c5b93260072395533ec">Alias</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="66.92%" id="mcps1.3.12.2.3.1.2"><p id="EN-US_TOPIC_0000001233430177__afafbe53b377a45e4b088efb3ffc8b45f">Description</p>
</th>
</tr>
</thead>
<tbody><tr id="EN-US_TOPIC_0000001233430177__r62a75d14583e4641b15984a2f591a609"><td class="cellrowborder" valign="top" width="33.08%" headers="mcps1.3.12.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a1a231e5ada174cf8b7bef84b620a7c81">zh_words</p>
</td>
<td class="cellrowborder" valign="top" width="66.92%" headers="mcps1.3.12.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__a585b43228ceb4f6da2abb101b880f484">chinese words</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__r2767bd9d0fe845879be65ec02631889c"><td class="cellrowborder" valign="top" width="33.08%" headers="mcps1.3.12.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a3af923f84cc445b6aba7d9fbd2e79bab">en_word</p>
</td>
<td class="cellrowborder" valign="top" width="66.92%" headers="mcps1.3.12.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__ae83158f35f524b7e8a073bd75d7dfa57">english word</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__r56e0c5bcd1074a6f83c88fa88431773f"><td class="cellrowborder" valign="top" width="33.08%" headers="mcps1.3.12.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__ab8532b9414d64e08b6fcc3be85e11ea6">numeric</p>
</td>
<td class="cellrowborder" valign="top" width="66.92%" headers="mcps1.3.12.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__aba89ebb33e5045598cfb09568e84b33b">numeric data</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__rce620c074745430cac3fb832e9d7a369"><td class="cellrowborder" valign="top" width="33.08%" headers="mcps1.3.12.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__ada1fcb4361e54709b61c9ddd74ad3093">alnum</p>
</td>
<td class="cellrowborder" valign="top" width="66.92%" headers="mcps1.3.12.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__a434e9a3f80834c32802e419c09172266">alnum string</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__r9f33a75762fb49b0bfce7aa53d341f46"><td class="cellrowborder" valign="top" width="33.08%" headers="mcps1.3.12.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__acc5e6a417a604d53a04f0c0139abcff7">grapsymbol</p>
</td>
<td class="cellrowborder" valign="top" width="66.92%" headers="mcps1.3.12.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__a633aa6309c05421abdc88956d66cbd66">graphic symbol</p>
</td>
</tr>
<tr id="EN-US_TOPIC_0000001233430177__rfc067f1b948040d5b85e18c10d593153"><td class="cellrowborder" valign="top" width="33.08%" headers="mcps1.3.12.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a33e54ef9f8df47db8046b93313a5b268">multisymbol</p>
</td>
<td class="cellrowborder" valign="top" width="66.92%" headers="mcps1.3.12.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__ae9c54ddd2209480c9f646e576e1c183f">multiple symbol</p>
</td>
</tr>
</tbody>
</table>
</div>
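<p>The token types in Table 2 come into play when a text search configuration is built on the N-gram parser. The statements below are a hedged sketch: the configuration name <strong>ngram_demo</strong> and the sample text are hypothetical, and mapping every token type to the <strong>simple</strong> dictionary is just one possible choice. Output is omitted because it depends on the parser settings and the database encoding.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Create a configuration that uses the built-in ngram parser (ngram_demo is a made-up name).
CREATE TEXT SEARCH CONFIGURATION ngram_demo (PARSER = ngram);

-- Map all six ngram token types to the simple dictionary.
ALTER TEXT SEARCH CONFIGURATION ngram_demo
    ADD MAPPING FOR zh_words, en_word, numeric, alnum, grapsymbol, multisymbol
    WITH simple;

-- Inspect how mixed Chinese and English text is split into tokens.
SELECT alias, token FROM ts_debug('ngram_demo', 'GaussDB 数据库 2024');

-- Remove the demonstration configuration.
DROP TEXT SEARCH CONFIGURATION ngram_demo;
</pre></div></div>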
<p id="EN-US_TOPIC_0000001233430177__a77d69a44efbc4d3bbf0cb329b4f97369">Zhparser is a dictionary-based semantic word segmentation method. Its underlying layer calls the Simple Chinese Word Segmentation (SCWS) algorithm (https://github.com/hightman/scws), which is suited to Chinese segmentation scenarios. SCWS is a mechanical Chinese word segmentation engine based on term frequency and a dictionary. It can split a whole paragraph of Chinese text into words. Two Chinese encodings, GBK and UTF-8, are supported. The 26 built-in token types are shown in <a href="#EN-US_TOPIC_0000001233430177__t2c6fd8cdf6bd48f3abda6e5b4273303f">Table 3</a>.</p>
<div class="tablenoborder"><a name="EN-US_TOPIC_0000001233430177__t2c6fd8cdf6bd48f3abda6e5b4273303f"></a><a name="t2c6fd8cdf6bd48f3abda6e5b4273303f"></a><table cellpadding="4" cellspacing="0" summary="" id="EN-US_TOPIC_0000001233430177__t2c6fd8cdf6bd48f3abda6e5b4273303f" frame="border" border="1" rules="all"><caption><b>Table 3 </b>Token types</caption><thead align="left"><tr id="EN-US_TOPIC_0000001233430177__r029bd3129c174d639b63c73a43919ea1"><th align="left" class="cellrowborder" valign="top" width="25%" id="mcps1.3.14.2.3.1.1"><p id="EN-US_TOPIC_0000001233430177__a6abb2a0dc82046efb8032a69012047e6">Alias</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="75%" id="mcps1.3.14.2.3.1.2"><p id="EN-US_TOPIC_0000001233430177__a06b62b0904ad48708387b5fada06d10c">Description</p>
|
|
</th>
|
|
</tr>
|
|
</thead>
|
|
<tbody><tr id="EN-US_TOPIC_0000001233430177__r60bd0000a5da46d9b12b4de0291ef6fa"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__ad9b74021818144b1ac85289adfa5cf8c">A</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__a4a1c35fbfd7c4f278cd1e0d2908b96a4">Adjective</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__rd9be165d251342e79d8ef0f4757ab27a"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a602423cfb2e94f20ae1d239c819e23fa">B</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__a1d7785bb339e41b7bce898382d5da669">Differentiation</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__r09b5a61ea05948b08c087fba32256940"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__acd34aba29e614b16b51bcfae81caa633">C</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__a3e1b3c904b1c4a66a4acb300342a8195">Conjunction</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__r73694604eae1463b828aafb03182b5f6"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__af0f9cd7825d74868b360d26e92fc7777">D</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__aba79eb1380364f7c9a32279fb456c95a">Adverb</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__r89c7c27a854d48b18e06d6ed80fbf69a"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a670f11f9d2c642deb46878861fa82e96">E</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__a6a36e13b7dee4f77b5aedbff5e661cd9">Exclamation</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__r9c04127ac93b4f699b3859cd0c117469"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a4493529e51a0403a90917b49bb27fa16">F</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__a1e00e74330974bd29219f3373ca9cd94">Position</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__r3721feec581441358298b0efa39fd62b"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a9cc0b096ca5d4597aeb6ce4051be85a3">G</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__a71d8d6d5fcb44216a7089bcb49eb27f6">Lexeme</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__rb1cd8ce82b254a25b99c4db3b723a5e0"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a8aa8865bfc3747068d84d7a4d049a70f">H</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__af07af6a18ac64ce89a2b28bd3b295785">Preceding element</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__r62f69233d9ed4ed3b2af411f1351ed49"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a8375080f46f145fc9db377cfb2b4ec12">I</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__a123dacfa74814d6a9162f780273b6822">Idiom</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__rb707989cba3b41969c836c2dc2c923a0"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a5dcad5b76d2e4e23b62547402d127a3b">J</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__a2ed33f2a8a224c8c803c4e521f85414f">Acronyms and abbreviations</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__re0818f2b49c24b9cbf1af2b2054ec618"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a4e093ad6e3a04ab294b560b3180ef9eb">K</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__a4dea93addfbf4f91a3a528d3864670b1">Subsequent element</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__r832ff09b977d48479da60eb67225d3ad"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a07fff6086d5743e7bd601e7c5330e8b2">L</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__a6cd8e9886a6f4f9a998513c7d61aabc1">Common words</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__rf021adec4a464949ba83cc87d938a11d"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a03f7171f179a460cb2cfc8dbd04637b3">M</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__af4ee3845de6c40e487f96e3d23a32543">Numeral</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__rac9de1216b48438490393c4dbfa93c96"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__ae9ff756cbe574e21a0b6bd7759728ca8">N</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__ab97ba26c6b024db9bbbda22847413feb">Noun</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__r145f90fcf56d41c7bb13cae399c215b5"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__aac3fb67803694e0595dc856f0f2e21ca">O</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__ad3e1d711b1c445fab25d896fca99c5b4">Onomatopoeia</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__r61b1719ac8144367997596c93f211f40"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a0fc45eef67104d8db022a580a5c5fd1e">P</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__aafb885f7622642f89471f40dd15c6f6b">Preposition</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__r269d8f2f73e24154b7e7a48d7fd3d462"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__aafaa04b9979a41fb8402ed5c8bcdbdff">Q</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__ae34c015405504654b066ecfd1f6b9fec">Quantifiers</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__re13f2e5677d843e28feb516d66d598c1"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a841c86f7025b4e828c5727636bb9279c">R</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__a18156d4bcf0645b489110cfd2d410367">Pronoun</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__rcfeed00466664a059063578f7c12b672"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__aca6048eb962449db9b998c08015a424e">S</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__a969fbd7292164b9a9d9ed54c3ce5e63a">Space</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__r3cdc391a5f244c988292ebb534253012"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__aa594f702428641eca82fda828770f193">T</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__af6690c83a33f4d9a96ade0e2aaa285f5">Time</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__r3b50ca8e7dd049af950469c837a5d912"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a1b67033bfabb494895011b9831bf416f">U</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__a85fb79947ce940a0b90631029b895245">Auxiliary word</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__r871e73e3ff6f4187be03aac3cd12c00b"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a9ac0965841e146229dd413f38361e7ee">V</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__ad6c3156c12a4479e86bea047801dc73c">Verb</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__r9532bf4e559c4bf4b11d5deb6699a98b"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a0a1fc168e9b84e36a27ef33394c6ee38">W</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__ad379bcf4357b48438c38d92f775a25e3">Punctuation</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__r1b052f63ae0c4c248db57693c4c90994"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a692a802fdbe1421b8dd9d8d12e0f1b37">X</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__adb112f6de7504788910cc7e0f30fbc66">Unknown</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__r98e44ce07436471cb09470742089b663"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__aec73c6a762fb4b6584f542fc2ad98144">Y</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__a52cc3a78641b483eae2ff06e31c540d0">Interjection</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__r90a0e9e523f4486d89cafe410bc49ceb"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.14.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__a103ca19595084afeb899eb5589e5b651">Z</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="75%" headers="mcps1.3.14.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__aba9f92eaace54aac82b8e9098afb88fb">Status words</p>
|
|
</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
</div>
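<p>As a rough sketch of how the zhparser token types can be inspected, the following statements use the standard text search test functions <strong>ts_token_type</strong> and <strong>ts_parse</strong>. The sample text is an assumption chosen purely for illustration, the database encoding is assumed to be UTF-8 or GBK, and the actual segmentation depends on the SCWS dictionary in use.</p>
<pre class="screen">-- List the token types that zhparser reports (compare with Table 3).
SELECT * FROM ts_token_type('zhparser');

-- Split a sample Chinese string ("data warehouse service") into tokens.
-- The text is illustrative only; results vary with the dictionary.
SELECT * FROM ts_parse('zhparser', '数据仓库服务');</pre>
<p>Each row returned by <strong>ts_parse</strong> carries a token type ID that can be matched against the IDs returned by <strong>ts_token_type</strong>.</p>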
|
|
<p id="EN-US_TOPIC_0000001233430177__p646511115378">Pound segments words in a fixed format. It is used to segment Chinese and English text that carries no semantic meaning and whose words are separated by a fixed separator. It supports Chinese encodings, including GBK and UTF8, and English encodings, including ASCII. Pound has six pre-configured token types (as listed in <a href="#EN-US_TOPIC_0000001233430177__table18356541133518">Table 4</a>) and supports five separators (as listed in <a href="#EN-US_TOPIC_0000001233430177__table14245115444310">Table 5</a>). The default separator is <strong id="EN-US_TOPIC_0000001233430177__b14683184119329">#</strong>. The maximum length of a token is 256 characters.</p>
|
|
|
|
<div class="tablenoborder"><a name="EN-US_TOPIC_0000001233430177__table18356541133518"></a><a name="table18356541133518"></a><table cellpadding="4" cellspacing="0" summary="" id="EN-US_TOPIC_0000001233430177__table18356541133518" frame="border" border="1" rules="all"><caption><b>Table 4 </b>Pound token types</caption><thead align="left"><tr id="EN-US_TOPIC_0000001233430177__row2035613418358"><th align="left" class="cellrowborder" valign="top" width="33%" id="mcps1.3.16.2.3.1.1"><p id="EN-US_TOPIC_0000001233430177__p1198305217355">Alias</p>
|
|
</th>
|
|
<th align="left" class="cellrowborder" valign="top" width="67%" id="mcps1.3.16.2.3.1.2"><p id="EN-US_TOPIC_0000001233430177__p298875213353">Description</p>
|
|
</th>
|
|
</tr>
|
|
</thead>
|
|
<tbody><tr id="EN-US_TOPIC_0000001233430177__row203561341183514"><td class="cellrowborder" valign="top" width="33%" headers="mcps1.3.16.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__p3994352133519">zh_words</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="67%" headers="mcps1.3.16.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__p1299775243510">chinese words</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__row1135611412359"><td class="cellrowborder" valign="top" width="33%" headers="mcps1.3.16.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__p111115363519">en_word</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="67%" headers="mcps1.3.16.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__p17335323516">english word</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__row7356341113510"><td class="cellrowborder" valign="top" width="33%" headers="mcps1.3.16.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__p13617533350">numeric</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="67%" headers="mcps1.3.16.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__p1393533359">numeric data</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__row8356641163510"><td class="cellrowborder" valign="top" width="33%" headers="mcps1.3.16.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__p912165312357">alnum</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="67%" headers="mcps1.3.16.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__p16152535351">alnum string</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__row2356341183518"><td class="cellrowborder" valign="top" width="33%" headers="mcps1.3.16.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__p619135353514">grapsymbol</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="67%" headers="mcps1.3.16.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__p3215534359">graphic symbol</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__row1035684133514"><td class="cellrowborder" valign="top" width="33%" headers="mcps1.3.16.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__p92595323518">multisymbol</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="67%" headers="mcps1.3.16.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__p2027953113519">multiple symbol</p>
|
|
</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
</div>
|
|
|
|
<div class="tablenoborder"><a name="EN-US_TOPIC_0000001233430177__table14245115444310"></a><a name="table14245115444310"></a><table cellpadding="4" cellspacing="0" summary="" id="EN-US_TOPIC_0000001233430177__table14245115444310" frame="border" border="1" rules="all"><caption><b>Table 5 </b>Separator types</caption><thead align="left"><tr id="EN-US_TOPIC_0000001233430177__row13245145420435"><th align="left" class="cellrowborder" valign="top" width="34%" id="mcps1.3.17.2.3.1.1"><p id="EN-US_TOPIC_0000001233430177__p17245155411436">Separator</p>
|
|
</th>
|
|
<th align="left" class="cellrowborder" valign="top" width="66%" id="mcps1.3.17.2.3.1.2"><p id="EN-US_TOPIC_0000001233430177__p1943018349445">Description</p>
|
|
</th>
|
|
</tr>
|
|
</thead>
|
|
<tbody><tr id="EN-US_TOPIC_0000001233430177__row9245165416438"><td class="cellrowborder" valign="top" width="34%" headers="mcps1.3.17.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__p1577178104417">@</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="66%" headers="mcps1.3.17.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__p359820140491">Special character</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__row424512543432"><td class="cellrowborder" valign="top" width="34%" headers="mcps1.3.17.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__p177752844415">#</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="66%" headers="mcps1.3.17.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__p8245854154315">Special character</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__row62467541434"><td class="cellrowborder" valign="top" width="34%" headers="mcps1.3.17.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__p1377958114410">$</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="66%" headers="mcps1.3.17.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__p1246754104311">Special character</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__row524635434319"><td class="cellrowborder" valign="top" width="34%" headers="mcps1.3.17.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__p678218154410">%</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="66%" headers="mcps1.3.17.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__p1024616544438">Special character</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="EN-US_TOPIC_0000001233430177__row1324655444310"><td class="cellrowborder" valign="top" width="34%" headers="mcps1.3.17.2.3.1.1 "><p id="EN-US_TOPIC_0000001233430177__p127865812449">/</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="66%" headers="mcps1.3.17.2.3.1.2 "><p id="EN-US_TOPIC_0000001233430177__p2024614546435">Special character</p>
|
|
</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
</div>
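<p>The following is a minimal sketch of a pound-based text search configuration. The configuration name <strong>pound_cfg</strong> and the sample text are hypothetical, and <strong>split_flag</strong> is assumed to be the configuration option that selects one of the separators in Table 5. Check the CREATE TEXT SEARCH CONFIGURATION reference for your version before relying on it.</p>
<pre class="screen">-- Hypothetical configuration that uses the pound parser with "/" as the
-- separator instead of the default "#". split_flag is an assumed option name.
CREATE TEXT SEARCH CONFIGURATION pound_cfg (PARSER = pound) WITH (split_flag = '/');

-- Map all six pound token types (Table 4) to the simple dictionary.
ALTER TEXT SEARCH CONFIGURATION pound_cfg
    ADD MAPPING FOR zh_words, en_word, numeric, alnum, grapsymbol, multisymbol
    WITH simple;

-- Segment an illustrative string whose fields are separated by "/".
SELECT to_tsvector('pound_cfg', 'east/west/2023/a1b2');</pre>
<p>Because pound splits only on the configured separator, the input should already use that separator consistently, and each resulting token must stay within the 256-character limit.</p>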
|
|
</div>
|
|
|
|
|