CREATE FOREIGN TABLE (SQL on OBS or Hadoop)

Function

CREATE FOREIGN TABLE creates an HDFS or OBS foreign table in the current database to access structured data stored on HDFS or OBS. You can also export data in ORC format to HDFS or OBS.

Data stored in OBS: Data storage is decoupled from compute. The cluster storage cost is low, and storage capacity is not limited. Clusters can be deleted at any time. However, the computing performance depends on the OBS access performance and is lower than that of HDFS. OBS is recommended for applications that do not demand a lot of computation.

Data stored in HDFS: Data storage is not decoupled from compute. The cluster storage cost is high, and storage capacity is limited. The computing performance is high. You must export data before you delete clusters. HDFS is recommended for computing-intensive scenarios.

The hybrid data warehouse (standalone) does not support OBS and HDFS foreign table import and export.

Precautions

Syntax

Create a foreign table,

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name 
( [ { column_name type_name 
    [ { [CONSTRAINT constraint_name] NULL |
    [CONSTRAINT constraint_name] NOT NULL |
      column_constraint [...]} ] |
      table_constraint [, ...]} [, ...] ] ) 
    SERVER server_name 
    OPTIONS ( { option_name ' value ' } [, ...] ) 
    [ {WRITE ONLY | READ ONLY}]
    DISTRIBUTE BY {ROUNDROBIN | REPLICATION}
   
    [ PARTITION BY ( column_name ) [ AUTOMAPPED ] ] ;

Parameter Description

Informational Constraint

In GaussDB(DWS), the use of data constraints depend on users. If users can make data sources strictly comply with certain constraints, the query on data with such constraints can be accelerated. Foreign tables do not support Index. Informational constraint is used for optimizing query plans.

The constraints of creating informational constraints for a foreign table are as follows:

Example 1

Example 1: In HDFS, import the TPC-H benchmark test tables part and region using Hive. The path of the part table is /user/hive/warehouse/partition.db/part_4, and that of the region table is /user/hive/warehouse/gauss.db/region_orc11_64stripe/.

  1. Establish HDFS_Server, with HDFS_FDW or DFS_FDW as the foreign data wrapper.
    1
    CREATE SERVER hdfs_server FOREIGN DATA WRAPPER HDFS_FDW OPTIONS (address '10.10.0.100:25000,10.10.0.101:25000',hdfscfgpath '/opt/hadoop_client/HDFS/hadoop/etc/hadoop',type'HDFS');
    
    • The IP addresses and port numbers of HDFS NameNodes are specified in OPTIONS. For details about the port number, search for dfs.namenode.rpc.port in the MRS-HDFS service configuration. In this example the port number is 25000.
    • 10.10.0.100:25000,10.10.0.101:25000 indicates the IP addresses and port numbers of the primary and standby HDFS NameNodes. It is the recommended format. Two groups of parameters are separated by commas (,).
  2. Create an HDFS foreign table. The HDFS server associated with the table is hdfs_server, the corresponding file format of the ft_region table on the HDFS server is 'orc', and the file directory in the HDFS file system is '/user/hive/warehouse/gauss.db/region_orc11_64stripe/'.
    • Create an HDFS foreign table without partition keys.
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      DROP FOREIGN TABLE IF EXISTS ft_region;
      CREATE FOREIGN TABLE ft_region
      (
          R_REGIONKEY INT4,
          R_NAME TEXT,
          R_COMMENT TEXT
      )
      SERVER
          hdfs_server
      OPTIONS
      (
          FORMAT 'orc',
          encoding 'utf8',
          FOLDERNAME '/user/hive/warehouse/gauss.db/region_orc11_64stripe/'
      )
      DISTRIBUTE BY 
           roundrobin;
      
    • Create an HDFS foreign table with partition keys.
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      22
      23
      24
      CREATE FOREIGN TABLE ft_part 
      (
           p_partkey int, 
           p_name text, 
           p_mfgr text, 
           p_brand text, 
           p_type text, 
           p_size int, 
           p_container text, 
           p_retailprice float8, 
           p_comment text
      )
      SERVER
           hdfs_server
      OPTIONS
      (
           FORMAT 'orc',
           encoding 'utf8',
           FOLDERNAME '/user/hive/warehouse/partition.db/part_4'
      )
      DISTRIBUTE BY 
           roundrobin
      PARTITION BY 
           (p_mfgr) AUTOMAPPED;
      

      GaussDB(DWS) allows you to specify files using the keyword filenames or foldername. The latter is recommended. The key word distribute specifies the storage distribution mode of the ft_region table.

  3. View the created server and foreign table.
    1
    2
    SELECT * FROM pg_foreign_table WHERE ftrelid='ft_region'::regclass;
    SELECT * FROM pg_foreign_table WHERE ftrelid='ft_part'::regclass;
    

Example 2

Export data from the TPC-H benchmark test table region table to the /user/hive/warehouse/gauss.db/regin_orc/ directory of the HDFS file system through the HDFS write-only foreign table.

  1. Create an HDFS foreign table. The corresponding foreign data wrapper is HDFS_FDW or DFS_FDW, which is the same as that in Example 1.
  2. Create a write-only HDFS foreign table.
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    CREATE FOREIGN TABLE ft_wo_region
    (
        R_REGIONKEY INT4,
        R_NAME TEXT,
        R_COMMENT TEXT
    )
    SERVER
        hdfs_server
    OPTIONS
    (
        FORMAT 'orc',
        encoding 'utf8',
        FOLDERNAME '/user/hive/warehouse/gauss.db/regin_orc/'
    )
    WRITE ONLY;
    
  3. Writes data to the HDFS file system through a write-only foreign table.
    1
    INSERT INTO ft_wo_region SELECT * FROM region;
    

Example 3

Perform operations on an HDFS foreign table that includes informational constraints.

Example 4

Read json data stored in OBS using a foreign table.

  1. The following JSON files are on OBS. The JSON objects contain nesting and arrays. Some objects have lost columns, and some object names are duplicate.
    {"A" : "simple1", "B" : {"C" : "nesting1"}, "D" : ["array", 2, {"E" : "complicated"}]}
    {"A" : "simple2", "D" : ["array", 2, {"E" : "complicated"}]}
    {"A" : "simple3", "B" : {"C" : "nesting3"}, "D" : ["array", 2, {"E" : "complicated3"}]}
    {"B" : {"C" : "nesting4"},"A" : "simple4",  "D" : ["array", 2, {"E" : "complicated4"}]}
    {"A" : "simple5", "B" : {"C" : "nesting5"}, "D" : ["array", 2, {"E" : "complicated5"}]}
  2. Create obs_server, with DFS_FDW as the foreign data wrapper.
    1
    2
    3
    4
    5
    6
    CREATE SERVER obs_server FOREIGN DATA WRAPPER DFS_FDW OPTIONS ( 
      ADDRESS 'obs.xxx.xxx.com', 
      ACCESS_KEY 'xxxxxxxxx', 
      SECRET_ACCESS_KEY 'yyyyyyyyyyyyy', 
      TYPE 'OBS'
    );
    
    • ADDRESS is the endpoint of OBS. Replace it with the actual endpoint. You can find the domain name by searching for the value of regionCode in the region_map file.
    • ACCESS_KEY and SECRET_ACCESS_KEY are access keys for the cloud account system. Replace the values as needed.
    • Hard-coded or plaintext AK and SK are risky. For security purposes, encrypt your AK and SK and store them in the configuration file or environment variables.
    • TYPE indicates the server type. Retain the value OBS.
  3. Create the OBS foreign table json_f and define the column names. For example, d#2_e indicates that the column is object e nested in the 2nd element of array d. The OBS server associated with the table is obs_server. foldername indicates the data source directory of the foreign table, that is, the OBS directory.

    // Hard-coded or plaintext AK and SK are risky. For security purposes, encrypt your AK and SK and store them in the configuration file or environment variables.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    CREATE FOREIGN TABLE json_f (
      a VARCHAR(10),
      b_c TEXT,
      d#1 INTEGER,
      d#2_e VARCHAR(30)
    )SERVER obs_server OPTIONS (
        foldername '/xxx/xxx/',
        format 'json',
        encoding 'utf8',
        force_mapping 'true'
    )distribute by roundrobin;
    
  4. Query the foreign table json_f. The fault tolerance parameter force_mapping is enabled by default. If a column is missing in a JSON object, NULL is filled in. If a JSON object name is duplicate, the last name prevails.
    1
    2
    3
    4
    5
    6
    7
    8
    9
    SELECT * FROM json_f;
        a    |   b_c    | d#1 |    d#2_e
    ---------+----------+-----+--------------
     simple1 | nesting1 |   2 | complicated1
     simple2 |          |   2 | complicated2
     simple3 | nesting3 |   2 | complicated3
     simple4 | nesting4 |   2 | complicated4
     repeat  | nesting5 |   2 | complicated5
    (5 rows)
    

Example 5

Read a DLI multi-version foreign table using a foreign table. Only DLI 8.1.1 and later support the multi-version foreign table example.

  1. Create dli_server, with DFS_FDW as the foreign data wrapper.
    1
    2
    3
    4
    5
    6
    7
    8
    9
    CREATE SERVER dli_server FOREIGN DATA WRAPPER DFS_FDW OPTIONS ( 
      ADDRESS 'obs.xxx.xxx.com', 
      ACCESS_KEY 'xxxxxxxxx', 
      SECRET_ACCESS_KEY 'yyyyyyyyyyyyy', 
      TYPE 'DLI',
      DLI_ADDRESS 'dli.xxx.xxx.com',
      DLI_ACCESS_KEY 'xxxxxxxxx',
      DLI_SECRET_ACCESS_KEY 'yyyyyyyyyyyyy'
    );
    
    • ADDRESS is the endpoint of OBS. DLI_ADDRESS is the endpoint of DLI. Replace it with the actual endpoint.
    • ACCESS_KEY and SECRET_ACCESS_KEY are access keys for the cloud account system to access OBS. Use the actual value.
    • DLI_ACCESS_KEY and DLI_SECRET_ACCESS_KEY are access keys for the cloud account system to access DLI. Use the actual value.
    • Hard-coded or plaintext AK and SK are risky. For security purposes, encrypt your AK and SK and store them in the configuration file or environment variables.
    • TYPE indicates the server type. Retain the value DLI.
  2. Create the OBS foreign table customer_address for accessing DLI. The table does not contain partition columns, and the DLI server associated with the table is dli_server. Where, the project_id is xxxxxxxxxxxxxxx, the database_name on DLI is database123, and the table_name of the table to be accessed is table456. Replace them based on the actual requirements.

    // Hard-coded or plaintext AK and SK are risky. For security purposes, encrypt your AK and SK and store them in the configuration file or environment variables.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    CREATE FOREIGN TABLE customer_address
    (
        ca_address_sk             integer               not null,
        ca_address_id             char(16)              not null,
        ca_street_number          char(10)                      ,   
        ca_street_name            varchar(60)                   ,   
        ca_street_type            char(15)                      ,   
        ca_suite_number           char(10)                      ,   
        ca_city                   varchar(60)                   ,   
        ca_county                 varchar(30)                   ,   
        ca_state                  char(2)                       ,   
        ca_zip                    char(10)                      ,   
        ca_country                varchar(20)                   ,   
        ca_gmt_offset             decimal(36,33)                  ,   
        ca_location_type          char(20)    
    ) 
    SERVER dli_server OPTIONS (
        FORMAT 'ORC',
        ENCODING 'utf8',
        DLI_PROJECT_ID 'xxxxxxxxxxxxxxx',
        DLI_DATABASE_NAME 'database123',
        DLI_TABLE_NAME 'table456'
    )
    DISTRIBUTE BY roundrobin;
    
  3. Query data in a DLI multi-version table using a foreign table.
    1
    2
    3
    4
    5
    SELECT COUNT(*) FROM customer_address;
     count 
    -------
        20
    (1 row)
    

Helpful Links

ALTER FOREIGN TABLE (for HDFS or OBS), DROP FOREIGN TABLE