Creating a Foreign Table

This section describes how to create a Hadoop foreign table in the GaussDB(DWS) database to access structured Hadoop data stored in MRS HDFS. A Hadoop foreign table is read-only and can be used only in SELECT queries.

Prerequisites

Obtaining the HDFS Path of the MRS Data Source

There are two methods for you to obtain the HDFS path.

Obtaining Information About the Foreign Server Connected to the MRS Data Source

  1. Connect to the corresponding database as the user who created the foreign server.

    Based on your requirements, determine whether to create the foreign table in a user-defined database as a common user.

    • Yes
      1. Ensure that you have created the common user dbuser and its database mydatabase, and manually created a foreign server in mydatabase by following the steps in Manually Creating a Foreign Server.
      2. Connect to the database mydatabase as user dbuser through the database client tool provided by GaussDB(DWS).
        If you have connected to the database using the gsql client, run the following command to switch the user and database:
        \c mydatabase dbuser;

        Enter your password as prompted.

    • No

      When you create an MRS data source connection on the GaussDB(DWS) management console, the database administrator dbadmin automatically creates a foreign server in the default database postgres. To create a foreign table in the default database postgres as dbadmin, connect to the database using the database client tool provided by GaussDB(DWS). For example, connect using the gsql client with the following command:

      gsql -d postgres -h 192.168.2.30 -U dbadmin -p 8000 -W password -r

      Enter your password as prompted.

  2. Run the following command to view the information about the created foreign server connected to the MRS data source:

    SELECT * FROM pg_foreign_server;

    You can also run the \desc+ command to view the information about the foreign server.

    The returned result is as follows:

                         srvname                      | srvowner | srvfdw | srvtype | srvversion | srvacl |                                                     srvoptions
    --------------------------------------------------+----------+--------+---------+------------+--------+---------------------------------------------------------------------------------------------------------------------
     gsmpp_server                                     |       10 |  13673 |         |            |        |
     gsmpp_errorinfo_server                           |       10 |  13678 |         |            |        |
     hdfs_server_8f79ada0_d998_4026_9020_80d6de2692ca |    16476 |  13685 |         |            |        | {"address=192.168.1.245:25000,192.168.1.218:25000",hdfscfgpath=/MRS/8f79ada0-d998-4026-9020-80d6de2692ca,type=hdfs}
    (3 rows)

    In the query result, each row contains information about a foreign server. The foreign server associated with the MRS data source connection has the following characteristics:

    • The value of srvname contains hdfs_server and the MRS cluster ID, which is the same as the cluster ID shown in the cluster list on the MRS management console.
    • The address parameter in the srvoptions field contains the IP addresses and ports of the active and standby nodes in the MRS cluster.

    Based on this information, locate the foreign server you need and record the values of its srvname and srvoptions. You can also narrow the output with a filtered query such as the sketch below.
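
    The following query is a minimal sketch for listing only MRS foreign servers. It assumes that the srvname prefix is hdfs_server, as in the output above:

    -- List only the foreign servers created for MRS data source connections,
    -- whose names start with the hdfs_server prefix.
    SELECT srvname, srvoptions
    FROM pg_foreign_server
    WHERE srvname LIKE 'hdfs_server%';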

Creating a Foreign Table

After completing Obtaining Information About the Foreign Server Connected to the MRS Data Source and Obtaining the HDFS Path of the MRS Data Source, you can create a foreign table to read data from the MRS data source.

The syntax for creating a foreign table is as follows. For details, see the syntax CREATE FOREIGN TABLE (SQL on Hadoop or OBS).

CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name 
( [ { column_name type_name 
    [ { [CONSTRAINT constraint_name] NULL |
    [CONSTRAINT constraint_name] NOT NULL |
      column_constraint [...]} ] |
      table_constraint [, ...]} [, ...] ] ) 
    SERVER dfs_server 
    OPTIONS ( { option_name 'value' } [, ...] ) 
    DISTRIBUTE BY {ROUNDROBIN | REPLICATION}
    [ PARTITION BY ( column_name ) [ AUTOMAPPED ] ] ;

For example, when creating a foreign table named foreign_product_info, set the parameters in the syntax as follows:

• table_name: name of the foreign table, foreign_product_info in this example.
• Table columns: column names and types that match the structure of the data stored on HDFS (see Data Type Conversion below).
• SERVER: name of the foreign server recorded earlier, hdfs_server_8f79ada0_d998_4026_9020_80d6de2692ca in this example.
• format: 'orc', the format of the data files stored on HDFS.
• foldername: HDFS path of the data, /user/hive/warehouse/demo.db/product_info_orc/ in this example.
• encoding: 'utf8', the encoding of the source data.
• DISTRIBUTE BY: ROUNDROBIN.

Based on the above settings, the foreign table is created using the following statements:

DROP FOREIGN TABLE IF EXISTS foreign_product_info;

CREATE FOREIGN TABLE foreign_product_info
(
    product_price                integer        ,
    product_id                   char(30)       ,
    product_time                 date           ,
    product_level                char(10)       ,
    product_name                 varchar(200)   ,
    product_type1                varchar(20)    ,
    product_type2                char(10)       ,
    product_monthly_sales_cnt    integer        ,
    product_comment_time         date           ,
    product_comment_num          integer        ,
    product_comment_content      varchar(200)                      
) SERVER hdfs_server_8f79ada0_d998_4026_9020_80d6de2692ca 
OPTIONS (
format 'orc', 
encoding 'utf8',
foldername '/user/hive/warehouse/demo.db/product_info_orc/'
) 
DISTRIBUTE BY ROUNDROBIN;
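
After the foreign table is created, a quick way to confirm that GaussDB(DWS) can read the ORC data is to query the table. The statements below are a minimal sketch; they assume that the foldername and foreign server defined above exist and contain data:

-- Count the rows visible through the foreign table.
SELECT count(*) FROM foreign_product_info;

-- Preview a few rows. Hadoop foreign tables are read-only, so only SELECT is supported.
SELECT * FROM foreign_product_info LIMIT 10;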

Data Type Conversion

Data imported to Hive/Spark is stored on HDFS in ORC format. GaussDB(DWS) actually reads the ORC files on HDFS, and queries and analyzes the data in these files.

The data types supported by Hive/Spark differ from those supported by GaussDB(DWS), so you need to know how they map to each other. Table 1 describes the mapping in detail.

Table 1 Data type mapping

Type                                   | GaussDB(DWS) HDFS/OBS Foreign Table Column Type | Hive Table Column Type                    | Spark Table Column Type
---------------------------------------+-------------------------------------------------+-------------------------------------------+------------------------
Integer in two bytes                   | SMALLINT                                        | SMALLINT                                  | SMALLINT
Integer in four bytes                  | INTEGER                                         | INT                                       | INT
Integer in eight bytes                 | BIGINT                                          | BIGINT                                    | BIGINT
Single-precision floating point number | FLOAT4 (REAL)                                   | FLOAT                                     | FLOAT
Double-precision floating point number | FLOAT8(DOUBLE PRECISION)                        | DOUBLE                                    | FLOAT
Scientific data type                   | DECIMAL[p (,s)], maximum precision 38           | DECIMAL, maximum precision 38 (Hive 0.11) | DECIMAL
Date type                              | DATE                                            | DATE                                      | DATE
Time type                              | TIMESTAMP                                       | TIMESTAMP                                 | TIMESTAMP
BOOLEAN type                           | BOOLEAN                                         | BOOLEAN                                   | BOOLEAN
CHAR type                              | CHAR(n)                                         | CHAR (n)                                  | STRING
VARCHAR type                           | VARCHAR(n)                                      | VARCHAR (n)                               | VARCHAR (n)
String                                 | TEXT(CLOB)                                      | STRING                                    | STRING
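
For example, the following hypothetical Hive table and the matching GaussDB(DWS) foreign table column definitions illustrate the mapping in Table 1. The table names and the HDFS path are assumptions for illustration only; the server name is reused from the example above:

-- Hypothetical Hive table whose data is stored as ORC on HDFS.
CREATE TABLE demo.sample_orders (
    order_id     INT,        -- maps to INTEGER in GaussDB(DWS)
    order_note   STRING,     -- maps to TEXT in GaussDB(DWS)
    order_amount DOUBLE      -- maps to FLOAT8 in GaussDB(DWS)
) STORED AS ORC;

-- Matching GaussDB(DWS) foreign table, with column types chosen per Table 1.
-- The foldername is an assumed path for this hypothetical table.
CREATE FOREIGN TABLE foreign_sample_orders
(
    order_id     integer,
    order_note   text,
    order_amount float8
) SERVER hdfs_server_8f79ada0_d998_4026_9020_80d6de2692ca
OPTIONS (
format 'orc',
encoding 'utf8',
foldername '/user/hive/warehouse/demo.db/sample_orders/'
)
DISTRIBUTE BY ROUNDROBIN;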