HDFS Colocation is the data location control function provided by HDFS. The HDFS Colocation API stores associated data or data on which associated operations are performed on the same storage node. Hive supports the HDFS Colocation function. When Hive tables are created, after the locator information is set for table files, data files of related tables are stored on the same storage node when data is inserted into tables using the insert statement (other data import modes are not supported). This ensures convenient and efficient data computing among associated tables. The supported table formats are only TextFile and RCFile.
This section applies to MRS 3.x or later.
cd /opt/client
source bigdata_env
kinit MRS username
hdfs colocationadmin -createGroup -groupId <groupid> -locatorIds <locatorid1>,<locatorid2>,<locatorid3>
In the preceding command, <groupid> indicates the name of the created group. The group created in this example contains three locators. You can define the number of locators as required.
For details about group ID creation and HDFS Colocation, see HDFS description.
beeline
Assume that table_name1 and table_name2 are associated with each other. Run the following statements to create them:
CREATE TABLE <[db_name.]table_name1>[(col_name data_type , ...)] [ROW FORMAT <row_format>] [STORED AS <file_format>] TBLPROPERTIES("groupId"=" <group> ","locatorId"="<locator1>");
CREATE TABLE <[db_name.]table_name2> [(col_name data_type , ...)] [ROW FORMAT <row_format>] [STORED AS <file_format>] TBLPROPERTIES("groupId"=" <group> ","locatorId"="<locator1>");
After data is inserted into table_name1 and table_name2 using the insert statement, data files of table_name1 and table_name2 are distributed to the same storage position in the HDFS, facilitating associated operations among the two tables.