Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com> Co-authored-by: Yang, Tong <yangtong2@huawei.com> Co-committed-by: Yang, Tong <yangtong2@huawei.com>
7.0 KiB
Using HDFS Colocation to Store Hive Tables
Scenario
HDFS Colocation is the data location control function provided by HDFS. The HDFS Colocation API stores associated data or data on which associated operations are performed on the same storage node. Hive supports the HDFS Colocation function. When Hive tables are created, after the locator information is set for table files, data files of related tables are stored on the same storage node when data is inserted into tables using the insert statement (other data import modes are not supported). This ensures convenient and efficient data computing among associated tables. The supported table formats are only TextFile and RCFile.
Procedure
- Log in to the node where the client is installed as a client installation user.
- Run the following command to switch to the client installation directory, for example, opt/client:
cd /opt/client
- Run the following command to configure environment variables:
source bigdata_env
- If the cluster is in security mode, run the following command to authenticate the user:
kinit MRS username
- Create the groupid through the HDFS API.
hdfs colocationadmin -createGroup -groupId <groupid> -locatorIds <locatorid1>,<locatorid2>,<locatorid3>
- Run the following command to log in to the Hive client:
beeline
- Enable Hive to use colocation.
Assume that table_name1 and table_name2 are associated with each other. Run the following statements to create them:
CREATE TABLE <[db_name.]table_name1>[(col_name data_type , ...)] [ROW FORMAT <row_format>] [STORED AS <file_format>] TBLPROPERTIES("groupId"=" <group> ","locatorId"="<locator1>");
CREATE TABLE <[db_name.]table_name2> [(col_name data_type , ...)] [ROW FORMAT <row_format>] [STORED AS <file_format>] TBLPROPERTIES("groupId"=" <group> ","locatorId"="<locator1>");
After data is inserted into table_name1 and table_name2 using the insert statement, data files of table_name1 and table_name2 are distributed to the same storage position in the HDFS, facilitating associated operations among the two tables.