This section provides suggestions based on more than 50 test cases to help you create CarbonData tables with higher query performance.
Column name |
Data type |
Cardinality |
Attribution |
---|---|---|---|
msisdn |
String |
30 million |
dimension |
BEGIN_TIME |
bigint |
10,000 |
dimension |
host |
String |
1 million |
dimension |
dime_1 |
String |
1,000 |
dimension |
dime_2 |
String |
500 |
dimension |
dime_3 |
String |
800 |
dimension |
counter_1 |
numeric(20,0) |
NA |
measure |
... |
... |
NA |
measure |
counter_100 |
numeric(20,0) |
NA |
measure |
implement optimization as follows:
Place this column in the first column of sort_columns.
For example, if msisdn is the most frequently used filter criterion in a query, it is placed in the first column. Run the following command to create a table. The query performance is good if msisdn is used as the filter condition.
create table carbondata_table( msisdn String, ... )STORED AS carbondata TBLPROPERTIES ('SORT_COLUMS'='msisdn');
implement optimization as follows:
Create an index for the columns.
For example, if msisdn, host, and dime_1 are frequently used columns, the sort_columns column sequence is "dime_1-> host-> msisdn..." based on cardinality. Run the following command to create a table. The following command can improve the filtering performance of dime_1, host, and msisdn.
create table carbondata_table( dime_1 String, host String, msisdn String, dime_2 String, dime_3 String, ... )STORED AS carbondata TBLPROPERTIES ('SORT_COLUMS'='dime_1,host,msisdn');
implement optimization as follows:
sort_columns is sorted in ascending order of cardinality.
Run the following command to create a table:
create table carbondata_table( Dime_1 String, BEGIN_TIME bigint, HOST String, MSISDN String, ... )STORED AS carbondata TBLPROPERTIES ('SORT_COLUMS'='dime_2,dime_3,dime_1, BEGIN_TIME,host,msisdn');
create index carbondata_table_index_msidn on tablecarbondata_table ( MSISDN String) as 'carbondata' PROPERTIES ('table_blocksize'='128'); create index carbondata_table_index_host on tablecarbondata_table ( host String) as 'carbondata' PROPERTIES ('table_blocksize'='128');
The result of performance analysis of test-case shows reduction in query execution time from 15 to 3 seconds, thereby improving performance by nearly 5 times. The command for creating a table is as follows:
create table carbondata_table( Dime_1 String, BEGIN_TIME bigint, HOST String, MSISDN String, counter_1 double, counter_2 double, ... counter_100 double, )STORED AS carbondata ;
For example, if data is loaded to CarbonData every day, start_time is incremental for each load. In this case, it is recommended that the start_time column be put at the end of sort_columns, because incremental values are efficient in using min/max index. The command for creating a table is as follows:
create table carbondata_table( Dime_1 String, HOST String, MSISDN String, counter_1 double, counter_2 double, BEGIN_TIME bigint, ... counter_100 double, )STORED AS carbondata TBLPROPERTIES ( 'SORT_COLUMS'='dime_2,dime_3,dime_1..BEGIN_TIME');