Phoenix provides CsvBulkLoadTool, a batch data import tool that supports user-defined delimiters. Users can specify any visible characters, within a limited length, as the delimiter when importing data files.
This section applies only to MRS 3.2.0 or later.
A long delimiter reduces parsing efficiency, slows down data import, lowers the proportion of valid data, and results in large files. Use a delimiter that is as short as possible.
A whitelist of user-defined delimiters can be configured to prevent injection issues. Currently, the following delimiter characters are supported: letters, digits, and the special characters (`~!@#$%^&*()-_=+[]{}\|;:'",<>./?).
The following two parameters are added on top of the open-source CsvBulkLoadTool:
-md: specifies the user-defined delimiter. If this parameter is set, it takes precedence and overrides the -d parameter of the original command (see the example after this list).
The second added parameter skips the delimiter length and whitelist verification. Using it is not recommended.
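For illustration, if both -d and -md appear in one command, the -md delimiter takes effect and the -d delimiter is ignored. A minimal sketch, assuming the TEST table and data path used later in this section:

hbase org.apache.phoenix.mapreduce.CsvBulkLoadTool -d ',' -md '|^[' -t TEST -i /tmp/data.csv

Here ',' is the single-character delimiter of the original -d option, and '|^[' is the user-defined delimiter that actually takes effect.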
cd Client installation directory
source bigdata_env
kinit Component service user
If Kerberos authentication is not enabled for the current cluster, run the following command to set the Hadoop username:
export HADOOP_USER_NAME=hbase
hdfs dfs -put /opt/test/data.csv /tmp
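For reference, a data.csv matching the TEST table created in the next step could look as follows. The values are illustrative; each field is separated by the multi-character delimiter '|^[' used in the import example below:

1|^[Jack|^[23|^[city1|^[true|^[1.1|^[2.2
2|^[Rose|^[30|^[city2|^[false|^[3.3|^[4.4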
sqlline.py
CREATE TABLE TEST (ID INTEGER NOT NULL PRIMARY KEY, NAME VARCHAR, AGE INTEGER, ADDRESS VARCHAR, GENDER BOOLEAN, A DECIMAL, B DECIMAL) SPLIT ON (1, 2, 3, 4, 5, 6, 7, 8, 9);
After the table is created, run the !quit command to exit the Phoenix CLI.
hbase org.apache.phoenix.mapreduce.CsvBulkLoadTool -md 'User-defined delimiter' -t Table name -i Data path
For example, to import the data.csv file to the TEST table, run the following command:
hbase org.apache.phoenix.mapreduce.CsvBulkLoadTool -md '|^[' -t TEST -i /tmp/data.csv
sqlline.py
SELECT * FROM TEST LIMIT 10;
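If the query returns the expected rows, the import succeeded. As an additional plain-SQL check, the total row count can be compared with the number of lines in the source file:

SELECT COUNT(*) FROM TEST;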