CsvBulkLoadTool Supports Parsing User-Defined Delimiters in Data Files

Scenario

Phoenix provides CsvBulkLoadTool, a batch data import tool. The tool supports user-defined delimiters: any visible characters within the specified length limit can be used as the delimiter when importing data files.

Constraints

This section applies only to MRS 3.2.0 or later.

Description of New Parameters

Two parameters are added based on the open-source CsvBulkLoadTool. Of these, -md, which specifies the user-defined delimiter, is used in the import command in 9.

Procedure

  1. Upload the data file to the node where the client is deployed. For example, upload the data.csv file to the /opt/test directory on the target node. The delimiter is |^[. The file content is as follows:
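
    The sample content below is illustrative: each record carries the seven fields of the TEST table created in 8 (ID, NAME, AGE, ADDRESS, GENDER, A, B), separated by the |^[ delimiter; the values themselves are placeholders.

    1|^[Jim|^[20|^[cityA|^[true|^[1.1|^[1.2
    2|^[Tom|^[30|^[cityB|^[false|^[2.1|^[2.2
    3|^[Lucy|^[25|^[cityC|^[true|^[3.1|^[3.2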

  2. Log in to the node where the client is installed as the client installation user.
  3. Run the following command to go to the client directory:

    cd Client installation directory
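
    For example, if the client is installed in /opt/client (an assumed path; replace it with the actual installation directory):

    cd /opt/client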

  4. Run the following command to configure environment variables:

    source bigdata_env

  5. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. The user must have permissions to create HBase tables and to perform HDFS operations.

    kinit Component service user
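
    For example, assuming a component service user named hbaseuser (a placeholder name; use your actual service user):

    kinit hbaseuser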

    If Kerberos authentication is not enabled for the current cluster, run the following command to set the Hadoop username:

    export HADOOP_USER_NAME=hbase

  6. Run the following command to upload the data.csv file in 1 to an HDFS directory, for example, /tmp:

    hdfs dfs -put /opt/test/data.csv /tmp
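
    To confirm the upload, you can list the file:

    hdfs dfs -ls /tmp/data.csv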

  7. Run the following command to start the Phoenix CLI:

    sqlline.py

  8. Run the following command to create the TEST table:

    CREATE TABLE TEST (ID INTEGER NOT NULL PRIMARY KEY, NAME VARCHAR, AGE INTEGER, ADDRESS VARCHAR, GENDER BOOLEAN, A DECIMAL, B DECIMAL) SPLIT ON (1, 2, 3, 4, 5, 6, 7, 8, 9);
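
    The SPLIT ON clause pre-splits the table into multiple regions at the listed row key boundaries, which lets the subsequent bulk load write to several regions in parallel.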

    After the table is created, run the !quit command to exit the Phoenix CLI.

  9. Run the following import command:

    hbase org.apache.phoenix.mapreduce.CsvBulkLoadTool -md 'User-defined delimiter' -t Table name -i Data path

    For example, to import the data.csv file to the TEST table, run the following command:

    hbase org.apache.phoenix.mapreduce.CsvBulkLoadTool -md '|^[' -t TEST -i /tmp/data.csv
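
    This command starts a MapReduce job that converts the CSV records into HFiles and loads them into the TEST table. Wait for the job to complete before querying the data.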

  10. Run the following command to view data imported to the TEST table:

    sqlline.py

    SELECT * FROM TEST LIMIT 10;
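
    If the import succeeded, the query returns the records from data.csv. Run the !quit command to exit the Phoenix CLI when you are done.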