This command is used to load user data of a particular type, so that CarbonData can provide good query performance.
Only the raw data on HDFS can be loaded.
LOAD DATA INPATH 'folder_path' INTO TABLE [db_name.]table_name OPTIONS(property_name=property_value, ...);
Parameter |
Description |
---|---|
folder_path |
Path of the file or folder used for storing the raw CSV data. |
db_name |
Database name. If this parameter is not specified, the current database is used. |
table_name |
Name of a table in a database. |
The following configuration items are involved during data loading:
OPTIONS('DELIMITER'=',' , 'QUOTECHAR'='"')
You can use 'DELIMITER'='\t' to separate CSV data using tabs.
OPTIONS('DELIMITER'='\t')
CarbonData also supports \001 and \017 as delimiters.
When the delimiter of CSV data is a single quotation mark ('), the single quotation mark must be enclosed in double quotation marks (" "). For example, 'DELIMITER'= "'".
OPTIONS('DELIMITER'=',' , 'QUOTECHAR'='"')
OPTIONS('COMMENTCHAR'='#')
OPTIONS('FILEHEADER'='column1,column2')
OPTIONS('ESCAPECHAR'='\')
Enter ESCAPECHAR in the CSV data. ESCAPECHAR must be enclosed in double quotation marks (" "). For example, "a\b".
In order for the data processing application to provide benefits, certain data integration is required. In most cases, data quality problems are caused by data sources.
Methods of handling bad records are as follows:
There are many options for clearing source data during CarbonData data loading, as listed in Table 2.
Configuration Item |
Default Value |
Description |
---|---|---|
BAD_RECORDS_LOGGER_ENABLE |
false |
Whether to create logs with details about bad records |
BAD_RECORDS_ACTION |
FAIL |
The four types of actions for bad records are as follows:
|
IS_EMPTY_DATA_BAD_RECORD |
false |
Whether empty data of a column to be considered as bad record or not. If this parameter is set to false, empty data ("",', or,) is not considered as bad records. If this parameter is set to true, empty data is considered as bad records. |
BAD_RECORD_PATH |
- |
HDFS path where bad records are stored. The default value is Null. If bad records logging or bad records operation redirection is enabled, the path must be configured by the user. |
Example:
LOAD DATA INPATH 'filepath.csv' INTO TABLE tablename OPTIONS('BAD_RECORDS_LOGGER_ENABLE'='true', 'BAD_RECORD_PATH'='hdfs://hacluster/tmp/carbon', 'BAD_RECORDS_ACTION'='REDIRECT', 'IS_EMPTY_DATA_BAD_RECORD'='false');
If REDIRECT is used, CarbonData will add all bad records into a separate CSV file. However, this file must not be used for subsequent data loading because the content may not exactly match the source record. You must clean up the source record for further data ingestion. This option is used to remind you which records are bad.
OPTIONS('MAXCOLUMNS'='400')
Name of the Optional Parameter |
Default Value |
Maximum Value |
---|---|---|
MAXCOLUMNS |
2000 |
20000 |
MAXCOLUMNS Value |
Number of Columns in the File Header |
Final Value Considered |
---|---|---|
Not specified in Load options |
5 |
2000 |
Not specified in Load options |
6000 |
6000 |
40 |
7 |
Max (column count of file header, MAXCOLUMNS value) |
22000 |
40 |
20000 |
60 |
Not specified in Load options |
Max (Number of columns in the first line of the CSV file, MAXCOLUMNS value) |
There must be sufficient executor memory for setting the maximum value of MAXCOLUMNS Option. Otherwise, data loading will fail.
OPTIONS('GLOBAL_SORT_PARTITIONS'='2')
To increase the number of partitions, you may need to increase the value of spark.driver.maxResultSize, as the sampling data collected in the driver increases with the number of partitions.
OPTIONS('DATEFORMAT'='dateFormat')
Date formats are specified by date pattern strings. The date pattern letters in Carbon are same as in JAVA.
OPTIONS('SKIP_EMPTY_LINE'='TRUE/FALSE')
splitSize = max(blocklet_size, (block_size - blocklet_size)) * scale_factor numPartitions = total size of input data / splitSize
The default value is 3. The value ranges from 1 to 300.
OPTIONS('SCALE_FACTOR'='10')
To load a CSV file to a CarbonData table, run the following statement:
LOAD DATA INPATH 'folder path' INTO TABLE tablename OPTIONS(property_name=property_value, ...);
The data in the data.csv file is as follows:
ID,date,country,name,phonetype,serialname,salary 4,2014-01-21 00:00:00,city1,aaa4,phone2435,ASD66902,15003 5,2014-01-22 00:00:00,city1,aaa5,phone2441,ASD90633,15004 6,2014-03-07 00:00:00,city1,aaa6,phone294,ASD59961,15005
CREATE TABLE carbontable(ID int, date Timestamp, country String, name String, phonetype String, serialname String,salary int) STORED AS carbondata;
LOAD DATA inpath 'hdfs://hacluster/tmp/data.csv' INTO table carbontable
options('DELIMITER'=',');
Success or failure will be recorded in the driver logs.