Yang, Tong 48706b7552 MRS COMP-LTS 320-lts.1 version
Reviewed-by: Kacur, Michal <michal.kacur@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2024-04-12 12:51:10 +00:00


<a name="mrs_01_24580"></a>
<h1 class="topictitle1">CsvBulkLoadTool Supports Parsing User-Defined Delimiters in Data Files</h1>
<div id="body0000001583420345"><div class="section" id="mrs_01_24580__section3541137175419"><h4 class="sectiontitle">Scenario</h4><p id="mrs_01_24580__p9664056135415">Phoenix provides CsvBulkLoadTool, a batch data import tool. The tool can import data files that use user-defined delimiters: any sequence of visible characters within the length limit can serve as the delimiter.</p>
<div class="note" id="mrs_01_24580__note157161915598"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="mrs_01_24580__p7716149125917">This section applies only to MRS 3.2.0 or later.</p>
</div></div>
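<p>To make the idea of a multi-character delimiter concrete, the following minimal sketch splits one line on the delimiter <strong>|^[</strong>, treating it as a literal string rather than a regular expression. The <strong>split_line</strong> function and the sample line are hypothetical and only illustrate the file layout; they are not part of Phoenix or MRS.</p>

```shell
#!/usr/bin/env bash
# Illustrative only: split one line on a multi-character delimiter,
# treating the delimiter as a literal string (not a regular expression).
split_line() {
    local line=$1 delim=$2
    while [[ $line == *"$delim"* ]]; do
        printf '%s\n' "${line%%"$delim"*}"   # text before the first delimiter
        line=${line#*"$delim"}               # drop that field and the delimiter
    done
    printf '%s\n' "$line"                    # last field
}

# Hypothetical sample line; the real data file content is not reproduced here.
split_line '1|^[Alice|^[30' '|^['
```

<p>Each field is printed on its own line, so the sample line yields three fields.</p>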
</div>
<div class="section" id="mrs_01_24580__section1546794115527"><h4 class="sectiontitle">Constraints</h4><ul id="mrs_01_24580__ul175791312561"><li id="mrs_01_24580__li175791315562">A user-defined delimiter cannot be an empty string.</li><li id="mrs_01_24580__li663419349567">A user-defined delimiter can contain a maximum of 16 characters.<div class="note" id="mrs_01_24580__note101801614144519"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="mrs_01_24580__p151811314124513">A long delimiter reduces parsing efficiency, slows down data import, lowers the proportion of valid data, and results in larger files. Use delimiters that are as short as possible.</p>
</div></div>
</li><li id="mrs_01_24580__li114031657115611">A user-defined delimiter must consist of visible characters.<div class="note" id="mrs_01_24580__note171862919477"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="mrs_01_24580__p81011311310">A whitelist of delimiter characters can be configured to prevent potential injection issues. Currently, a delimiter can contain the following characters: letters, digits, and special characters (`~!@#$%^&amp;*()\\-_=+\\[\\]{}\\\\|;:'\",&lt;&gt;./?).</p>
</div></div>
</li><li id="mrs_01_24580__li23761438125814">The start and end of a user-defined delimiter cannot be the same.</li></ul>
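<p>The constraints above can be checked before a delimiter is passed to the tool. The following is a minimal sketch, not the tool's own validation: <strong>check_delimiter</strong> is a hypothetical helper, "letters" is assumed to mean ASCII letters, and the last constraint is assumed to mean that the first and last characters must differ (applied only to delimiters longer than one character).</p>

```shell
#!/usr/bin/env bash
# Hypothetical pre-check, not part of Phoenix or MRS.
check_delimiter() {
    local delim=$1 max_len=16 i ch
    local alnum='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
    # Special characters from the whitelist note; the trailing "'" adds the single quote.
    local specials='`~!@#$%^&*()-_=+[]{}\|;:",<>./?'"'"
    local allowed=$alnum$specials

    if [[ -z $delim ]]; then
        echo "delimiter must not be empty" >&2; return 1
    fi
    if (( ${#delim} > max_len )); then
        echo "delimiter is longer than $max_len characters" >&2; return 1
    fi
    for (( i = 0; i < ${#delim}; i++ )); do
        ch=${delim:i:1}
        if [[ $allowed != *"$ch"* ]]; then
            echo "character not in whitelist: '$ch'" >&2; return 1
        fi
    done
    # Assumption: "start and end cannot be the same" is read as first character
    # != last character, and is skipped for single-character delimiters.
    if (( ${#delim} > 1 )) && [[ ${delim:0:1} == "${delim: -1}" ]]; then
        echo "delimiter starts and ends with the same character" >&2; return 1
    fi
    return 0
}

check_delimiter '|^[' && echo "delimiter OK"
```

<p>The function returns a non-zero status and writes the reason to standard error when a check fails, so it can guard an import script.</p>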
</div>
<div class="section" id="mrs_01_24580__section131861316539"><h4 class="sectiontitle">Description of New Parameters</h4><p id="mrs_01_24580__p181281546105219">The following two parameters are added on top of the open-source CsvBulkLoadTool:</p>
<ul id="mrs_01_24580__ul0874192215320"><li id="mrs_01_24580__li8874822195318"><strong id="mrs_01_24580__b199613401864">--multiple-delimiter (-md)</strong><p id="mrs_01_24580__p191119297565">This parameter specifies the user-defined delimiter. If it is specified, it takes precedence over the <strong id="mrs_01_24580__b116521537115011">-d</strong> parameter of the original command and overrides it.</p>
</li><li id="mrs_01_24580__li179541320165518"><strong id="mrs_01_24580__b1461243363">--multiple-delimiter-skip-check (-mdsc)</strong><p id="mrs_01_24580__p11401143125811">This parameter skips the delimiter length and whitelist checks. Using it is not recommended.</p>
</li></ul>
</div>
<div class="section" id="mrs_01_24580__section174059441703"><h4 class="sectiontitle">Procedure</h4><ol id="mrs_01_24580__ol1157713512212"><li id="mrs_01_24580__li1033704115320"><a name="mrs_01_24580__li1033704115320"></a><a name="li1033704115320"></a><span>Upload the data file to the node where the client is deployed. For example, upload the <strong id="mrs_01_24580__b104084820531">data.csv</strong> file to the <strong id="mrs_01_24580__b1222210121530">/opt/test</strong> directory on the target node. The delimiter is <strong id="mrs_01_24580__b48420318123">|^[</strong>. The file content is as follows:</span><p><p id="mrs_01_24580__p19522119201219"><span><img id="mrs_01_24580__image7811184341512" src="en-us_image_0000001532503042.png"></span></p>
</p></li><li id="mrs_01_24580__li10695204119102"><span>Log in to the node where the client is installed as the client installation user.</span></li><li id="mrs_01_24580__li136952418109"><span>Run the following command to go to the client directory:</span><p><p id="mrs_01_24580__p1269510411103"><strong id="mrs_01_24580__b869512411101">cd </strong><em id="mrs_01_24580__i1171225419542">Client installation directory</em></p>
</p></li><li id="mrs_01_24580__li19695641171013"><span>Run the following command to configure environment variables:</span><p><p id="mrs_01_24580__p196957418108"><strong id="mrs_01_24580__b069513417106">source bigdata_env</strong></p>
</p></li><li id="mrs_01_24580__li11617433171917"><span>If Kerberos authentication is enabled for the cluster, run the following command to authenticate the current user. The user must have permissions to create HBase tables and perform operations on HDFS.</span><p><p id="mrs_01_24580__p36890922817"><strong id="mrs_01_24580__b16154182722416">kinit</strong> <em id="mrs_01_24580__i315432710242">Component service user</em></p>
<p id="mrs_01_24580__p1858645214304">If Kerberos authentication is not enabled for the cluster, run the following command to set the Hadoop username:</p>
<p id="mrs_01_24580__p121554623314"><strong id="mrs_01_24580__b863416369335">export HADOOP_USER_NAME=hbase</strong></p>
</p></li><li id="mrs_01_24580__li656359145413"><span>Run the following command to upload the data file <strong id="mrs_01_24580__b2092101820459">data.csv</strong> in <a href="#mrs_01_24580__li1033704115320">1</a> to an HDFS directory, for example, <strong id="mrs_01_24580__b626511307455">/tmp</strong>:</span><p><p id="mrs_01_24580__p1262185985418"><strong id="mrs_01_24580__b146217595547">hdfs dfs -put /opt/test/data.csv /tmp</strong></p>
</p></li><li id="mrs_01_24580__li8487905327"><span>Run the following command to start the Phoenix client:</span><p><p id="mrs_01_24580__p156628346324"><strong id="mrs_01_24580__b6662113415320">sqlline.py</strong></p>
</p></li><li id="mrs_01_24580__li7557203619328"><span>Run the following command to create the <strong id="mrs_01_24580__b479217574">TEST</strong> table:</span><p><p id="mrs_01_24580__p55627360326"><strong id="mrs_01_24580__b656210362328">CREATE TABLE TEST (ID INTEGER NOT NULL PRIMARY KEY, NAME VARCHAR, AGE INTEGER, ADDRESS VARCHAR, GENDER BOOLEAN, A DECIMAL, B DECIMAL) SPLIT ON (1, 2, 3, 4, 5, 6, 7, 8, 9);</strong></p>
<p id="mrs_01_24580__p25621936113214">After the table is created, run the <strong id="mrs_01_24580__b195621236113213">!quit</strong> command to exit the Phoenix CLI.</p>
</p></li><li id="mrs_01_24580__li7220612316"><span>Run the following import command:</span><p><p id="mrs_01_24580__p1359420497514"><strong id="mrs_01_24580__b99781117145619">hbase org.apache.phoenix.mapreduce.CsvBulkLoadTool -md '</strong><em id="mrs_01_24580__i7584133284713">User-defined delimiter</em><strong id="mrs_01_24580__b181378263569">' -t </strong><em id="mrs_01_24580__i18539144219476">Table name</em><strong id="mrs_01_24580__b1229433114560"> -i </strong><em id="mrs_01_24580__i6121157164713">Data path</em></p>
<p id="mrs_01_24580__p3784102719170">For example, to import the <strong id="mrs_01_24580__b34871422134819">data.csv</strong> file to the <strong id="mrs_01_24580__b8459172412579">TEST</strong> table, run the following command:</p>
<p id="mrs_01_24580__p23859258185"><strong id="mrs_01_24580__b19385132511819">hbase org.apache.phoenix.mapreduce.CsvBulkLoadTool -md '</strong><strong id="mrs_01_24580__b562712614191">|^[</strong><strong id="mrs_01_24580__b116271364194">' -t </strong><strong id="mrs_01_24580__b4862207151914">TEST</strong><strong id="mrs_01_24580__b06279610197"> -i </strong><strong id="mrs_01_24580__b1286210717198">/tmp/data.csv</strong></p>
</p></li><li id="mrs_01_24580__li47061251172417"><span>Run the following commands to view the data imported to the <strong id="mrs_01_24580__b173844111113">TEST</strong> table:</span><p><p id="mrs_01_24580__p92591517195920"><strong id="mrs_01_24580__b202591417195916">sqlline.py</strong></p>
<p id="mrs_01_24580__p1529941152515"><strong id="mrs_01_24580__b418964242619">SELECT * FROM TEST LIMIT 10;</strong></p>
<p id="mrs_01_24580__p731718132611"><span><img id="mrs_01_24580__image7866310101417" src="en-us_image_0000001583182157.png"></span></p>
</p></li></ol>
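<p>The upload and import commands from the procedure can be chained in a small script. The following is a sketch under stated assumptions: the <strong>run</strong> and <strong>import_csv</strong> functions and the <strong>DRY_RUN</strong> variable are hypothetical, the environment variables are assumed to be already configured, and authentication is assumed to be already completed. With <strong>DRY_RUN=1</strong> (the default in this sketch), the commands are only printed.</p>

```shell
#!/usr/bin/env bash
# Sketch only: chains the upload and import commands from the procedure.
# DRY_RUN and the function names are hypothetical; they are not part of MRS.

# Print the command when DRY_RUN=1 (the default here); execute it otherwise.
run() {
    if [[ ${DRY_RUN:-1} == 1 ]]; then
        printf '%s\n' "$*"      # note: printed without shell quoting
    else
        "$@"
    fi
}

# Usage: import_csv LOCAL_FILE HDFS_DIR TABLE DELIMITER
import_csv() {
    local local_file=$1 hdfs_dir=$2 table=$3 delim=$4
    local hdfs_file="$hdfs_dir/$(basename "$local_file")"

    # Upload the data file to HDFS, then import it with the user-defined delimiter.
    run hdfs dfs -put "$local_file" "$hdfs_dir" || return 1
    run hbase org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        -md "$delim" -t "$table" -i "$hdfs_file"
}

import_csv /opt/test/data.csv /tmp TEST '|^['
```

<p>To execute the commands instead of printing them, set <strong>DRY_RUN=0</strong> before calling <strong>import_csv</strong>. Passing the delimiter as a quoted variable keeps the shell from interpreting characters such as <strong>|</strong>.</p>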
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_24579.html">In-House Enhanced Phoenix</a></div>
</div>
</div>