CREATE FOREIGN TABLE creates an HDFS or OBS foreign table in the current database to access structured data stored on HDFS or OBS. You can also export data in ORC format to HDFS or OBS.
Data stored in OBS: Data storage is decoupled from compute. The cluster storage cost is low, and storage capacity is not limited. Clusters can be deleted at any time. However, the computing performance depends on the OBS access performance and is lower than that of HDFS. OBS is recommended for applications that do not demand a lot of computation.
Data stored in HDFS: Data storage is not decoupled from compute. The cluster storage cost is high, and storage capacity is limited. The computing performance is high. You must export data before you delete clusters. HDFS is recommended for computing-intensive scenarios.
The hybrid data warehouse (standalone) does not support OBS and HDFS foreign table import and export.
Syntax for creating a foreign table:
CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name
( [ { column_name type_name
      [ { [CONSTRAINT constraint_name] NULL |
          [CONSTRAINT constraint_name] NOT NULL |
          column_constraint [...] } ] |
    table_constraint [, ...] } [, ...] ] )
SERVER server_name
OPTIONS ( { option_name 'value' } [, ...] )
[ { WRITE ONLY | READ ONLY } ]
DISTRIBUTE BY { ROUNDROBIN | REPLICATION }
[ PARTITION BY ( column_name ) [ AUTOMAPPED ] ] ;
The syntax of column_constraint is as follows:

[CONSTRAINT constraint_name]
{PRIMARY KEY | UNIQUE}
[NOT ENFORCED [ENABLE QUERY OPTIMIZATION | DISABLE QUERY OPTIMIZATION] | ENFORCED]
The syntax of table_constraint is as follows:

[CONSTRAINT constraint_name]
{PRIMARY KEY | UNIQUE} (column_name)
[NOT ENFORCED [ENABLE QUERY OPTIMIZATION | DISABLE QUERY OPTIMIZATION] | ENFORCED]
Does not throw an error if a table with the same name already exists; a notice is issued instead.
Specifies the name of the foreign table to be created.
Value range: a string. It must comply with the naming convention.
Specifies the name of a column in the foreign table. Columns are separated by commas (,).
Value range: a string. It must comply with the naming convention.
A JSON object consists of nested or parallel name-value pairs, and the order of the pairs is irrelevant. When data in JSON format is imported, each foreign table column is matched to a value by automatically mapping the column name to the name of a name-value pair. Define column names carefully; otherwise, the import result may not be what you expect. The rules for automatic mapping between column names and the names of name-value pairs are as follows:
For example, to import each element of the {"A": "simple", "B": {"C": "nesting"}, "D": ["array", 2, {"E": "complicated"}]} object, the columns of the foreign table must be named a, b, b_c, d, d#0, d#1, d#2, and d#2_e. The order in which the columns are defined does not affect the import result.
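Based on these mapping rules, a foreign table for the object above could be sketched as follows. This is an illustrative sketch only: the server name obs_server and the foldername path are placeholder values, and only some of the mapped columns are declared.

```sql
-- Sketch only: obs_server and the foldername path are placeholders.
CREATE FOREIGN TABLE json_mapping_demo
(
    a     TEXT,   -- matches name "A"
    b_c   TEXT,   -- matches "C" nested inside object "B"
    d#0   TEXT,   -- first element of array "D"
    d#2_e TEXT    -- "E" inside the object at index 2 of "D"
) SERVER obs_server
OPTIONS (
    format 'json',
    encoding 'utf8',
    foldername '/bucket/folder/'
)
DISTRIBUTE BY ROUNDROBIN;
```

Columns that are not needed (such as b or d here) can simply be omitted; unmatched name-value pairs in the source are then ignored.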
Specifies the data type of the column.
Specifies the name of a constraint for the foreign table.
Specifies whether the column allows NULL.
When you create a foreign table, the database cannot verify whether the data in HDFS is NULL or NOT NULL. Data consistency is guaranteed by users, who must decide whether a column is NULL or NOT NULL. (The optimizer uses the NULL/NOT NULL information to generate a better plan.)
Specifies the server name of the foreign table. Users can customize its name.
Value range: a string indicating an existing server. It must comply with the naming convention.
Specifies whether a data file contains a table header. header is available only for CSV files.
If header is on, the first row of the data file is treated as the header and skipped. If header is off, the first row is treated as a data row.
Value range: true, on, false, and off. The default value is false or off.
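For instance, a CSV foreign table that skips a header row can set the option as sketched below (the server name obs_server and the OBS location are placeholder values):

```sql
-- Sketch only: obs_server and the location are placeholders.
CREATE FOREIGN TABLE csv_with_header
(
    id   INT,
    name TEXT
) SERVER obs_server
OPTIONS (
    format 'csv',
    encoding 'utf8',
    location 'obs://bucket1/folder/',
    header 'on'          -- first row is a header and is skipped
)
DISTRIBUTE BY ROUNDROBIN;
```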
Specifies the quotation mark for the CSV format. The default value is a double quotation mark (").
The quote value cannot be the same as the delimiter or null value.
The quote value must be a single-byte character.
Invisible characters are recommended as quote values, such as 0x07, 0x08, and 0x1b.
Specifies an escape character for a CSV file. The value must be a single-byte character.
The default value is a double quotation mark ("). If the value is the same as the quote value, it will be replaced with \0.
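As an illustration, the delimiter, quote, and escape characters of a CSV foreign table can be overridden together. The sketch below uses placeholder server and path names; the option values shown are ordinary printable characters rather than the recommended invisible characters:

```sql
-- Sketch only: obs_server and the location are placeholders.
CREATE FOREIGN TABLE csv_quoted
(
    id      INT,
    comment TEXT
) SERVER obs_server
OPTIONS (
    format 'csv',
    location 'obs://bucket1/folder/',
    delimiter ',',   -- column separator
    quote '"',       -- field quoting character (the default)
    escape '\'       -- escape character inside quoted fields
)
DISTRIBUTE BY ROUNDROBIN;
```

Remember the constraints above: quote must be a single byte and must differ from the delimiter and null values.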
Specifies the file path on OBS. This is an OBS foreign table parameter. The data sources of multiple buckets are separated by vertical bars (|), for example, LOCATION 'obs://bucket1/folder/ | obs://bucket2/'. The database scans all objects in the specified folders.
When accessing a DLI multi-version table, you do not need to specify the location parameter.
Only JSON objects (enclosed in {}) can be imported. JSON arrays (enclosed in []) cannot be imported directly. However, arrays inside a JSON object can be imported.
When accessing a DLI multi-version table, you do not need to specify the foldername parameter.
Specifies the column delimiter of data, and uses the default delimiter if it is not set. The default delimiter of TEXT is a tab.
Valid value: a multi-character delimiter whose length is less than or equal to 10 bytes.
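For example, a TEXT foreign table can declare a multi-character delimiter, as in the sketch below (the server name hdfs_server, the foldername path, and the delimiter string are illustrative placeholders):

```sql
-- Sketch only: hdfs_server, the foldername, and the delimiter are placeholders.
CREATE FOREIGN TABLE text_multichar_delim
(
    id   INT,
    name TEXT
) SERVER hdfs_server
OPTIONS (
    format 'text',
    foldername '/user/hive/warehouse/demo/',
    delimiter '$^&'   -- multi-character delimiter, at most 10 bytes
)
DISTRIBUTE BY ROUNDROBIN;
```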
Specifies the newline character style of the imported data file.
Value range: multi-character newline characters within 10 bytes. Common newline characters include \r (0x0D), \n (0x0A), and \r\n (0x0D0A). Special newline characters include $ and #.
Valid value: the default value is \N for the TEXT format.
Specifies whether, in TEXT format, to escape the backslash (\) and the characters following it.
noescaping is available only for TEXT source data files.
Value range: true, on, false, and off. The default value is false or off.
Specifies whether to generate an error message when the last column in a row of the source file is missing during data loading.
Value range: true, on, false, and off. The default value is false or off.
If the parameter is off or false and the last column of a row is missing, an error similar to the following is reported during import: missing data for column "tt"
Specifies whether to ignore excessive columns when the number of data source files exceeds the number of foreign table columns. This parameter is available only during data importing.
Value range: true, on, false, and off. The default value is false or off.
If the parameter is off or false and a row contains more columns than the foreign table, an error similar to the following is reported during import: extra data after last expected column
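Both fault-tolerance options can be combined, as in the sketch below (the server name obs_server and the location are placeholder values):

```sql
-- Sketch only: obs_server and the location are placeholders.
CREATE FOREIGN TABLE tolerant_csv
(
    id   INT,
    name TEXT,
    age  INT
) SERVER obs_server
OPTIONS (
    format 'csv',
    location 'obs://bucket1/folder/',
    fill_missing_fields 'on',  -- missing trailing columns are filled with NULL
    ignore_extra_data 'on'     -- extra columns beyond the table definition are dropped
)
DISTRIBUTE BY ROUNDROBIN;
```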
Specifies the DATE format for data import. This syntax is available only for READ ONLY foreign tables.
Value range: any valid DATE value. For details, see Date and Time Processing Functions and Operators.
Specifies the TIME format for data import. This syntax is available only for READ ONLY foreign tables.
Value range: a valid TIME value. Time zones cannot be used. For details, see Date and Time Processing Functions and Operators.
time_format is available only for TEXT and CSV source data files.
Specifies the TIMESTAMP format for data import. This syntax is available only for READ ONLY foreign tables.
Value range: any valid TIMESTAMP value. Time zones are not supported. For details, see Date and Time Processing Functions and Operators.
timestamp_format is available only for TEXT and CSV source data files.
Specifies the SMALLDATETIME format for data import. This syntax is available only for READ ONLY foreign tables.
Value range: a valid SMALLDATETIME value. For details, see Date and Time Processing Functions and Operators.
smalldatetime_format is available only for TEXT and CSV source data files.
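For example, a read-only TEXT foreign table can declare the input formats of its date-like columns together, as sketched below (the server name, location, and format strings are illustrative; see Date and Time Processing Functions and Operators for valid format patterns):

```sql
-- Sketch only: obs_server, the location, and the format strings are placeholders.
CREATE FOREIGN TABLE fmt_demo
(
    d  DATE,
    t  TIME,
    ts TIMESTAMP
) SERVER obs_server
OPTIONS (
    format 'text',
    location 'obs://bucket1/folder/',
    date_format 'yyyy-mm-dd',
    time_format 'hh24:mi:ss',
    timestamp_format 'yyyy-mm-dd hh24:mi:ss'
)
DISTRIBUTE BY ROUNDROBIN;
```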
Specifies the encoding of the exported data when it differs from the database encoding, for example, when the database encoding is Latin-1 but the exported data is in UTF-8. This parameter is optional; if it is not specified, the database encoding is used by default. This syntax is valid only for write-only HDFS foreign tables.
Value range: data code types supported by the database encoding
The dataencoding parameter is valid only for the ORC-formatted write-only HDFS foreign table.
Specifies the file size of a write-only foreign table. This parameter is optional. If this parameter is not specified, the file size in the distributed file system configuration is used by default. This syntax is available only for the write-only foreign table.
Value range: an integer ranging from 1 to 1024
The filesize parameter is valid only for the ORC-formatted write-only HDFS foreign table.
Specifies the compression mode of ORC files. This parameter is optional. This syntax is available only for the write-only foreign table.
Value range: zlib, snappy, and lz4. The default value is snappy.
Specifies the ORC version number. This parameter is optional. This syntax is available only for the write-only foreign table.
Value range: Only 0.12 is supported. The default value is 0.12.
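Putting the write-only options together, an ORC export table might be sketched as follows (the server name hdfs_server, the foldername path, and the option values are illustrative placeholders):

```sql
-- Sketch only: hdfs_server, the foldername, and option values are placeholders.
CREATE FOREIGN TABLE orc_export
(
    id   INT,
    name TEXT
) SERVER hdfs_server
OPTIONS (
    format 'orc',
    foldername '/user/hive/warehouse/demo_export/',
    dataencoding 'utf8',  -- encoding of the exported files
    filesize '32',        -- target file size for exported files
    compression 'zlib',   -- ORC compression mode
    version '0.12'        -- ORC version; only 0.12 is supported
) WRITE ONLY;
```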
Specifies the project ID corresponding to DLI. You can obtain the project ID from the management console. This parameter is available only when the server type is DLI. This feature is supported only in 8.1.1 or later.
Specifies the name of the database where the DLI multi-version table to be accessed is located. This parameter is available only when the server type is DLI. This feature is supported only in 8.1.1 or later.
Specifies the name of the DLI multi-version table to be accessed. This parameter is available only when the server type is DLI. This feature is supported only in 8.1.1 or later.
Specifies whether to check the character encoding.
In TEXT format, the rules of error tolerance for invalid imported characters are as follows:
In ORC format, the rules of error tolerance for invalid imported characters are as follows:
Indicates how a foreign table column in JSON format is handled when no matching name-value pair is found for it.
The value can be true or false. Default value: true
There are no restrictions on JSON objects, but the definition of foreign table columns must comply with GaussDB(DWS) identifier specifications (such as length and character restrictions). Therefore, this import method may cause exceptions; for example, a JSON name cannot be correctly identified, or a column is repeatedly defined. You are advised to use the fault tolerance option force_mapping or JSON operators (for details, see JSON/JSONB Functions and Operators).
For JSON format, SELECT COUNT(*) does not parse specific fields. Therefore, no error is reported when a field is missing or the format is incorrect.
Parameters supported by OBS foreign tables:

| Parameter | TEXT (READ ONLY) | CSV (READ ONLY) | JSON (READ ONLY) | ORC (READ ONLY) | ORC (WRITE ONLY) | CARBONDATA (READ ONLY) | PARQUET (READ ONLY) |
|---|---|---|---|---|---|---|---|
| location | √ | √ | √ | √ | × | √ | √ |
| format | √ | √ | √ | √ | √ | √ | √ |
| header | × | √ | × | × | × | × | × |
| delimiter | √ | √ | × | × | × | × | × |
| quote | × | √ | × | × | × | × | × |
| escape | × | √ | × | × | × | × | × |
| null | √ | √ | × | × | × | × | × |
| noescaping | √ | × | × | × | × | × | × |
| encoding | √ | √ | √ | √ | √ | √ | √ |
| fill_missing_fields | √ | √ | × | × | × | × | × |
| ignore_extra_data | √ | √ | × | × | × | × | × |
| date_format | √ | √ | √ | × | × | × | × |
| time_format | √ | √ | √ | × | × | × | × |
| timestamp_format | √ | √ | √ | × | × | × | × |
| smalldatetime_format | √ | √ | √ | × | × | × | × |
| chunksize | √ | √ | √ | × | × | × | × |
| filenames | × | × | × | × | × | × | × |
| foldername | √ | √ | √ | √ | √ | √ | √ |
| dataencoding | × | × | × | × | × | × | × |
| filesize | × | × | × | × | × | × | × |
| compression | × | × | × | × | √ | × | × |
| version | × | × | × | × | √ | × | × |
| checkencoding | √ | √ | √ | √ | × | √ | √ |
| totalrows | √ | √ | √ | √ | × | × | × |
| force_mapping | × | × | √ | × | × | × | × |
Parameters supported by HDFS foreign tables:

| Parameter | TEXT (READ ONLY) | CSV (READ ONLY) | JSON (READ ONLY) | ORC (READ ONLY) | ORC (WRITE ONLY) | PARQUET (READ ONLY) |
|---|---|---|---|---|---|---|
| location | × | × | × | × | × | × |
| format | √ | √ | √ | √ | √ | √ |
| header | × | √ | × | × | × | × |
| delimiter | √ | √ | × | × | × | × |
| quote | × | √ | × | × | × | × |
| escape | × | √ | × | × | × | × |
| null | √ | √ | × | × | × | × |
| noescaping | √ | × | × | × | × | × |
| encoding | √ | √ | √ | √ | √ | √ |
| fill_missing_fields | √ | √ | × | × | × | × |
| ignore_extra_data | √ | √ | × | × | × | × |
| date_format | √ | √ | √ | × | × | × |
| time_format | √ | √ | √ | × | × | × |
| timestamp_format | √ | √ | √ | × | × | × |
| smalldatetime_format | √ | √ | √ | × | × | × |
| chunksize | √ | √ | √ | × | × | × |
| filenames | √ | √ | √ | √ | × | √ |
| foldername | √ | √ | √ | √ | √ | √ |
| dataencoding | × | × | × | × | √ | × |
| filesize | × | × | × | × | √ | × |
| compression | × | × | × | × | √ | × |
| version | × | × | × | × | √ | × |
| checkencoding | √ | √ | √ | √ | √ | √ |
| totalrows | × | × | × | × | × | × |
| force_mapping | × | × | √ | × | × | × |
WRITE ONLY creates a write-only HDFS/OBS foreign table.
READ ONLY creates a read-only HDFS/OBS foreign table.
If the foreign table type is not specified, a read-only foreign table is created by default.
Specifies ROUNDROBIN as the distribution mode for the HDFS/OBS foreign table.
Specifies REPLICATION as the distribution mode for the HDFS foreign table.
column_name specifies the partition column. AUTOMAPPED indicates that the partition columns specified for the HDFS partitioned foreign table are automatically mapped to the partition directory information in HDFS. The prerequisite is that the order of the partition columns specified in the HDFS foreign table is the same as that of the directories. This function is applicable only to read-only foreign tables.
Specifies the name of informational constraint of the foreign table.
Value range: a string. It must comply with the naming convention.
The primary key constraint specifies that one or more columns of a table must contain unique (non-duplicate) and non-null values. Only one primary key can be specified for a table.
Specifies that a group of one or more columns of a table must contain unique values. For the purpose of a unique constraint, NULL is not considered equal.
Specifies the constraint to be an informational constraint. This constraint is guaranteed by the user instead of the database.
The default value is ENFORCED. ENFORCED is a reserved parameter and is currently not supported.
Specifies the informational constraint on column_name.
Value range: a string. It must comply with the naming convention, and the value of column_name must exist.
Optimizes an execution plan using an informational constraint.
Disables the optimization of an execution plan using an informational constraint.
In GaussDB(DWS), the enforcement of data constraints is up to users. If you can ensure that a data source strictly complies with certain constraints, queries on data with such constraints can be accelerated. Foreign tables do not support indexes, so informational constraints are used to optimize query plans.
The constraints of creating informational constraints for a foreign table are as follows:
Example 1: In HDFS, import the TPC-H benchmark test tables part and region using Hive. The path of the part table is /user/hive/warehouse/partition.db/part_4, and that of the region table is /user/hive/warehouse/gauss.db/region_orc11_64stripe/.
CREATE SERVER hdfs_server FOREIGN DATA WRAPPER HDFS_FDW OPTIONS
(
    address '10.10.0.100:25000,10.10.0.101:25000',
    hdfscfgpath '/opt/hadoop_client/HDFS/hadoop/etc/hadoop',
    type 'HDFS'
);
DROP FOREIGN TABLE IF EXISTS ft_region;

CREATE FOREIGN TABLE ft_region
(
    R_REGIONKEY INT4,
    R_NAME TEXT,
    R_COMMENT TEXT
)
SERVER hdfs_server
OPTIONS
(
    FORMAT 'orc',
    encoding 'utf8',
    FOLDERNAME '/user/hive/warehouse/gauss.db/region_orc11_64stripe/'
)
DISTRIBUTE BY roundrobin;
CREATE FOREIGN TABLE ft_part
(
    p_partkey int,
    p_name text,
    p_mfgr text,
    p_brand text,
    p_type text,
    p_size int,
    p_container text,
    p_retailprice float8,
    p_comment text
)
SERVER hdfs_server
OPTIONS
(
    FORMAT 'orc',
    encoding 'utf8',
    FOLDERNAME '/user/hive/warehouse/partition.db/part_4'
)
DISTRIBUTE BY roundrobin
PARTITION BY (p_mfgr) AUTOMAPPED;
GaussDB(DWS) allows you to specify files using the keyword filenames or foldername. The latter is recommended. The keyword distribute specifies the storage distribution mode of the ft_region table.
SELECT * FROM pg_foreign_table WHERE ftrelid='ft_region'::regclass;
SELECT * FROM pg_foreign_table WHERE ftrelid='ft_part'::regclass;
Export data from the TPC-H benchmark test table region to the /user/hive/warehouse/gauss.db/regin_orc/ directory of the HDFS file system through an HDFS write-only foreign table.
CREATE FOREIGN TABLE ft_wo_region
(
    R_REGIONKEY INT4,
    R_NAME TEXT,
    R_COMMENT TEXT
)
SERVER hdfs_server
OPTIONS
(
    FORMAT 'orc',
    encoding 'utf8',
    FOLDERNAME '/user/hive/warehouse/gauss.db/regin_orc/'
)
WRITE ONLY;
INSERT INTO ft_wo_region SELECT * FROM region;
Perform operations on an HDFS foreign table that includes informational constraints.
CREATE FOREIGN TABLE ft_region
(
    R_REGIONKEY int,
    R_NAME TEXT,
    R_COMMENT TEXT,
    primary key (R_REGIONKEY) not enforced
)
SERVER hdfs_server
OPTIONS
(
    format 'orc',
    encoding 'utf8',
    foldername '/user/hive/warehouse/gauss.db/region_orc11_64stripe'
)
DISTRIBUTE BY roundrobin;
SELECT relname,relhasindex FROM pg_class WHERE oid='ft_region'::regclass;
  relname  | relhasindex
-----------+-------------
 ft_region | f
(1 row)

SELECT conname, contype, consoft, conopt, conindid, conkey FROM pg_constraint WHERE conname ='ft_region_pkey';
    conname     | contype | consoft | conopt | conindid | conkey
----------------+---------+---------+--------+----------+--------
 ft_region_pkey | p       | t       | t      |        0 | {1}
(1 row)
ALTER FOREIGN TABLE ft_region DROP CONSTRAINT ft_region_pkey RESTRICT;

SELECT conname, contype, consoft, conindid, conkey FROM pg_constraint WHERE conname ='ft_region_pkey';
 conname | contype | consoft | conindid | conkey
---------+---------+---------+----------+--------
(0 rows)
ALTER FOREIGN TABLE ft_region ADD CONSTRAINT constr_unique UNIQUE(R_REGIONKEY) NOT ENFORCED;
ALTER FOREIGN TABLE ft_region DROP CONSTRAINT constr_unique RESTRICT;

SELECT conname, contype, consoft, conindid, conkey FROM pg_constraint WHERE conname ='constr_unique';
ALTER FOREIGN TABLE ft_region ADD CONSTRAINT constr_unique UNIQUE(R_REGIONKEY) NOT ENFORCED disable query optimization;

SELECT relname,relhasindex FROM pg_class WHERE oid='ft_region'::regclass;
ALTER FOREIGN TABLE ft_region DROP CONSTRAINT constr_unique CASCADE;
Read JSON data stored in OBS using a foreign table.
{"A" : "simple1", "B" : {"C" : "nesting1"}, "D" : ["array", 2, {"E" : "complicated"}]}
{"A" : "simple2", "D" : ["array", 2, {"E" : "complicated"}]}
{"A" : "simple3", "B" : {"C" : "nesting3"}, "D" : ["array", 2, {"E" : "complicated3"}]}
{"B" : {"C" : "nesting4"},"A" : "simple4", "D" : ["array", 2, {"E" : "complicated4"}]}
{"A" : "simple5", "B" : {"C" : "nesting5"}, "D" : ["array", 2, {"E" : "complicated5"}]}
CREATE SERVER obs_server FOREIGN DATA WRAPPER DFS_FDW OPTIONS
(
    ADDRESS 'obs.xxx.xxx.com',
    ACCESS_KEY 'xxxxxxxxx',
    SECRET_ACCESS_KEY 'yyyyyyyyyyyyy',
    TYPE 'OBS'
);
// Hard-coded or plaintext AK and SK are risky. For security purposes, encrypt your AK and SK and store them in the configuration file or environment variables.
CREATE FOREIGN TABLE json_f
(
    a VARCHAR(10),
    b_c TEXT,
    d#1 INTEGER,
    d#2_e VARCHAR(30)
)
SERVER obs_server
OPTIONS
(
    foldername '/xxx/xxx/',
    format 'json',
    encoding 'utf8',
    force_mapping 'true'
)
DISTRIBUTE BY roundrobin;
SELECT * FROM json_f;
    a    |   b_c    | d#1 |    d#2_e
---------+----------+-----+--------------
 simple1 | nesting1 |   2 | complicated1
 simple2 |          |   2 | complicated2
 simple3 | nesting3 |   2 | complicated3
 simple4 | nesting4 |   2 | complicated4
 repeat  | nesting5 |   2 | complicated5
(5 rows)
Read a DLI multi-version table using a foreign table. This example is supported only in cluster version 8.1.1 or later.
CREATE SERVER dli_server FOREIGN DATA WRAPPER DFS_FDW OPTIONS
(
    ADDRESS 'obs.xxx.xxx.com',
    ACCESS_KEY 'xxxxxxxxx',
    SECRET_ACCESS_KEY 'yyyyyyyyyyyyy',
    TYPE 'DLI',
    DLI_ADDRESS 'dli.xxx.xxx.com',
    DLI_ACCESS_KEY 'xxxxxxxxx',
    DLI_SECRET_ACCESS_KEY 'yyyyyyyyyyyyy'
);
// Hard-coded or plaintext AK and SK are risky. For security purposes, encrypt your AK and SK and store them in the configuration file or environment variables.
CREATE FOREIGN TABLE customer_address
(
    ca_address_sk    integer        not null,
    ca_address_id    char(16)       not null,
    ca_street_number char(10),
    ca_street_name   varchar(60),
    ca_street_type   char(15),
    ca_suite_number  char(10),
    ca_city          varchar(60),
    ca_county        varchar(30),
    ca_state         char(2),
    ca_zip           char(10),
    ca_country       varchar(20),
    ca_gmt_offset    decimal(36,33),
    ca_location_type char(20)
)
SERVER dli_server
OPTIONS
(
    FORMAT 'ORC',
    ENCODING 'utf8',
    DLI_PROJECT_ID 'xxxxxxxxxxxxxxx',
    DLI_DATABASE_NAME 'database123',
    DLI_TABLE_NAME 'table456'
)
DISTRIBUTE BY roundrobin;
SELECT COUNT(*) FROM customer_address;
 count
-------
    20
(1 row)