This section describes how to create a Hadoop foreign table in the GaussDB(DWS) database to access the Hadoop structured data stored on MRS HDFS. A Hadoop foreign table is read-only. It can only be queried using SELECT.
For details, see Preparing Data in an MRS Cluster.
For details, see "Managing MRS Data Sources > Creating an MRS Data Source Connection" in the Data Warehouse Service User Guide.
There are two methods for you to obtain the HDFS path.
For Hive data, log in to the Hive client of MRS (see 2), run the following command to view the detailed information about the table, and record the data storage path in the location parameter:
use <database_name>; desc formatted <table_name>;
For example, if the value of the location parameter in the returned result is hdfs://hacluster/user/hive/warehouse/demo.db/product_info_orc/, the HDFS path is /user/hive/warehouse/demo.db/product_info_orc/.
Determine whether to use a common user to create a foreign table in the customized database based on requirements.
\c mydatabase dbuser;
Enter your password as prompted.
When you create an MRS data source connection on the GaussDB(DWS) management console, the database administrator dbadmin automatically creates a foreign server in the default database postgres. If you create a foreign table in the default database postgres as the database administrator dbadmin, you need to connect to the database using the database client tool provided by GaussDB(DWS). For example, use the gsql client to connect to the database by running the following command:
gsql -d postgres -h 192.168.2.30 -U dbadmin -p 8000 -W password -r
Enter your password as prompted.
SELECT * FROM pg_foreign_server;
You can also run the \desc+ command to view the information about the foreign server.
The returned result is as follows:
srvname | srvowner | srvfdw | srvtype | srvversion | srvacl | srvoptions --------------------------------------------------+----------+--------+---------+------------+--------+--------------------------------------------------------------------------------------------------------------------- gsmpp_server | 10 | 13673 | | | | gsmpp_errorinfo_server | 10 | 13678 | | | | hdfs_server_8f79ada0_d998_4026_9020_80d6de2692ca | 16476 | 13685 | | | | {"address=192.168.1.245:25000,192.168.1.218:25000",hdfscfgpath=/MRS/8f79ada0-d998-4026-9020-80d6de2692ca,type=hdfs} (3 rows)
In the query result, each row contains the information about a foreign server. The foreign server associated with the MRS data source connection contains the following information:
You can find the foreign server you want based on the above information and record the values of its srvname and srvoptions.
After Obtaining Information About the Foreign Server Connected to the MRS Data Source and Obtaining the HDFS Path of the MRS Data Source are completed, you can create a foreign table to read data from the MRS data source.
The syntax for creating a foreign table is as follows. For details, see the syntax CREATE FOREIGN TABLE (SQL on Hadoop or OBS).
CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name ( [ { column_name type_name [ { [CONSTRAINT constraint_name] NULL | [CONSTRAINT constraint_name] NOT NULL | column_constraint [...]} ] | table_constraint [, ...]} [, ...] ] ) SERVER dfs_server OPTIONS ( { option_name ' value ' } [, ...] ) DISTRIBUTE BY {ROUNDROBIN | REPLICATION} [ PARTITION BY ( column_name ) [ AUTOMAPPED ] ] ;
For example, when creating a foreign table named foreign_product_info, set parameters in the syntax as follows:
Mandatory. This parameter specifies the name of the foreign table to be created.
Multiple columns are separate by commas (,).
The number of columns and column types in the foreign table must be the same as those in the data stored on MRS. Learn Data Type Conversion before defining column data types.
This parameter specifies the foreign server name of the foreign table. This server must exist. The foreign table can read data from an MRS cluster by configuring the foreign server and connecting to the MRS data source.
Enter the value of the srvname field queried in Obtaining Information About the Foreign Server Connected to the MRS Data Source.
These are parameters associated with the foreign table. The key parameters are as follows:
If the MRS analysis cluster has enabled Kerberos authentication, ensure that the MRS user having the MRS data source connection has the read and write permissions for the directory.
Follow the steps in Obtaining the HDFS Path of the MRS Data Source to obtain the HDFS path, which is the value of parameter foldername.
Other parameters are optional. You can set them as required. In this example, you do not need to set these parameters.
Based on the above settings, the foreign table is created using the following statements:
DROP FOREIGN TABLE IF EXISTS foreign_product_info; CREATE FOREIGN TABLE foreign_product_info ( product_price integer , product_id char(30) , product_time date , product_level char(10) , product_name varchar(200) , product_type1 varchar(20) , product_type2 char(10) , product_monthly_sales_cnt integer , product_comment_time date , product_comment_num integer , product_comment_content varchar(200) ) SERVER hdfs_server_8f79ada0_d998_4026_9020_80d6de2692ca OPTIONS ( format 'orc', encoding 'utf8', foldername '/user/hive/warehouse/demo.db/product_info_orc/' ) DISTRIBUTE BY ROUNDROBIN;
Data is imported to Hive/Spark and then stored on HDFS in ORC format. Actually, GaussDB(DWS) reads ORC files on HDFS, and queries and analyzes data in these files.
Data types supported by Hive/Spark are different from those supported by GaussDB(DWS). Therefore, you need to learn the mapping between them. Table 1 describes the mapping in detail.
Type |
Column Type Supported by an HDFS/OBS Foreign Table of GaussDB(DWS) |
Column Type Supported by a Hive Table |
Column Type Supported by a Spark Table |
---|---|---|---|
Integer in two bytes |
SMALLINT |
SMALLINT |
SMALLINT |
Integer in four bytes |
INTEGER |
INT |
INT |
Integer in eight bytes |
BIGINT |
BIGINT |
BIGINT |
Single-precision floating point number |
FLOAT4 (REAL) |
FLOAT |
FLOAT |
Double-precision floating point number |
FLOAT8(DOUBLE PRECISION) |
DOUBLE |
FLOAT |
Scientific data type |
DECIMAL[p (,s)] The maximum precision can reach up to 38. |
DECIMAL The maximum precision can reach up to 38 (Hive 0.11). |
DECIMAL |
Date type |
DATE |
DATE |
DATE |
Time type |
TIMESTAMP |
TIMESTAMP |
TIMESTAMP |
BOOLEAN type |
BOOLEAN |
BOOLEAN |
BOOLEAN |
CHAR type |
CHAR(n) |
CHAR (n) |
STRING |
VARCHAR type |
VARCHAR(n) |
VARCHAR (n) |
VARCHAR (n) |
String |
TEXT(CLOB) |
STRING |
STRING |