Creating an OBS Table Using the DataSource Syntax

Function

Create an OBS table using the DataSource syntax.

The main differences between the DataSource syntax and the Hive syntax lie in the supported data formats and the number of supported partitions. For details, see Syntax and Precautions.

You are advised to use the OBS parallel file system for storage. A parallel file system is a high-performance file system that provides millisecond-level latency, TB/s-level bandwidth, and millions of IOPS, which makes it suitable for interactive big data analysis scenarios.

Precautions

Syntax

CREATE TABLE [IF NOT EXISTS] [db_name.]table_name 
  [(col_name1 col_type1 [COMMENT col_comment1], ...)]
  USING file_format 
  [OPTIONS (path 'obs_path', key1=val1, key2=val2, ...)] 
  [PARTITIONED BY (col_name1, col_name2, ...)]
  [COMMENT table_comment]
  [AS select_statement]

Keywords

Parameters

Table 1 Parameters

| Parameter | Mandatory | Description |
| --- | --- | --- |
| db_name | No | Database name. The value can contain letters, numbers, and underscores (_), but it cannot contain only numbers or start with a number or underscore (_). |
| table_name | Yes | Name of the table to be created in the database. The value can contain letters, numbers, and underscores (_), but it cannot contain only numbers or start with a number or underscore (_). The matching rule is ^(?!_)(?![0-9]+$)[A-Za-z0-9_$]*$. Special characters must be enclosed in single quotation marks (''). The table name is case insensitive. |
| col_name | Yes | Column names with data types, separated by commas (,). The column name can contain letters, numbers, and underscores (_), but it cannot contain only numbers and must contain at least one letter. The column name is case insensitive. |
| col_type | Yes | Data type of a column, which must be a primitive type. |
| col_comment | No | Column description, which must be a string constant. |
| file_format | Yes | Format of the table to be created, which can be orc, parquet, json, csv, or avro. |
| path | Yes | OBS storage path where data files are stored. You are advised to use an OBS parallel file system for storage. Format: obs://bucketName/tblPath, where bucketName is the bucket name and tblPath is the directory name. You do not need to specify the file name following the directory. Refer to Table 2 for details about property names and values during table creation. Refer to Table 2 and Table 3 for details about the table property names and values when file_format is set to csv. If the OBS directory contains a folder and a file with the same name, the path of the OBS table points to the file rather than the folder. |
| table_comment | No | Table description, which must be a string constant. |
| select_statement | No | Used in the CREATE TABLE AS statement to insert the SELECT query results of the source table, or a data record, into a table newly created in the OBS bucket. |

Table 2 OPTIONS parameters

| Parameter | Mandatory | Description |
| --- | --- | --- |
| path | No | Path where the table is stored, which currently can only be an OBS directory. |
| multiLevelDirEnable | No | Whether data in nested subdirectories is queried recursively. When this parameter is set to true, all files in the table path, including files in subdirectories, are read when the table is queried. Default value: false |
| dataDelegated | No | Whether data in the path is cleared when a table or partition is deleted. Default value: false |
| compression | No | Compression format. This parameter is typically specified for Parquet files, in which case it is set to zstd. |

When file_format is set to csv, you can set the following OPTIONS parameters:

Table 3 OPTIONS parameters of the CSV data format

| Parameter | Mandatory | Description |
| --- | --- | --- |
| delimiter | No | Data separator. Default value: comma (,) |
| quote | No | Quotation character. Default value: double quotation mark (") |
| escape | No | Escape character. Default value: backslash (\) |
| multiLine | No | Whether the column data contains carriage return or line break characters. The value true indicates yes and the value false indicates no. Default value: false |
| dateFormat | No | Date format of the date field in a CSV file. Default value: yyyy-MM-dd |
| timestampFormat | No | Format of the timestamp field in a CSV file. Default value: yyyy-MM-dd HH:mm:ss |
| mode | No | Mode for parsing CSV files. Default value: PERMISSIVE. The options are as follows. PERMISSIVE: permissive mode; if an incorrect field is encountered, the line is set to null. DROPMALFORMED: if an incorrect field is encountered, the entire line is discarded. FAILFAST: error mode; if an error occurs, it is reported automatically. |
| header | No | Whether the CSV file contains header information. The value true indicates that the file contains a table header, and the value false indicates that it does not. Default value: false |
| nullValue | No | Character that represents the null value. For example, nullValue="nl" indicates that nl represents the null value. |
| comment | No | Character that indicates the beginning of a comment. For example, comment='#' indicates that a line starting with # is a comment. |
| compression | No | Data compression format. Currently, gzip, bzip2, and deflate are supported. If you do not want to compress data, enter none. Default value: none |
| encoding | No | Data encoding format. Available values are utf-8, gb2312, and gbk. If this parameter is left empty, utf-8 is used. Default value: utf-8 |

Example 1: Creating an OBS Non-Partitioned Table

Example description: Create an OBS non-partitioned table named table1 and use the USING keyword to set the storage format of the table to orc.

You can also store OBS tables in parquet, json, csv, or avro format.

CREATE TABLE IF NOT EXISTS table1 (
    col_1   STRING,
    col_2   INT)
USING orc
OPTIONS (path 'obs://bucketName/filePath');

Example 2: Creating an OBS Partitioned Table

Example description: Create a partitioned table named student that is partitioned by faculty number (facultyNo) and class number (classNo).

In practice, you can select a proper partitioning field and add it to the parentheses following the PARTITIONED BY keyword.

CREATE TABLE IF NOT EXISTS student (
    Name        STRING,
    facultyNo   INT,
    classNo     INT)
USING csv
OPTIONS (path 'obs://bucketName/filePath')
PARTITIONED BY (facultyNo, classNo);

Example 3: Using CTAS to Create an OBS Non-Partitioned Table Using All or Part of the Data in the Source Table

Example description: Based on the OBS table table1 created in Example 1: Creating an OBS Non-Partitioned Table, use the CTAS syntax to copy data from table1 to table1_ctas.

When using CTAS to create a table, you can ignore the syntax used to create the table being copied. This means that regardless of the syntax used to create table1, you can use the DataSource syntax to create table1_ctas.

In addition, in this example, table1 is stored in orc format, while table1_ctas can be stored in parquet format. This means that the storage format of a table created by CTAS can differ from that of the source table.

Use the SELECT statement following the AS keyword to select required data and insert the data to table1_ctas.

The SELECT syntax is as follows: SELECT <Column name> FROM <Table name> WHERE <Related filter criteria>.
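The copy described in this example can be sketched as follows. This is a minimal illustration, not a definitive statement: the OBS path is a placeholder that you replace with your own directory, and the filter value 'Ann' is a hypothetical value for col_1 that is used only to show how part of the data can be selected.

```sql
-- Copy all data from table1 into a new parquet table.
-- The path is a placeholder; point it at your own OBS directory.
CREATE TABLE IF NOT EXISTS table1_ctas
USING parquet
OPTIONS (path 'obs://bucketName/filePath')
AS
SELECT  *
FROM    table1;

-- To copy only part of the data, add a filter to the SELECT statement
-- in place of the query above, for example:
--   SELECT * FROM table1 WHERE col_1 = 'Ann';
-- 'Ann' is a hypothetical value chosen for illustration.
```

No column list is given for table1_ctas because the columns are taken from the result of the SELECT statement.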

Example 4: Creating an OBS Non-Partitioned Table and Customizing the Data Type of a Column Field

Example description: Create an OBS non-partitioned table named table2. You can customize the native data types of column fields based on service requirements.

For details, see "Data Types" > "Primitive Data Types".

CREATE TABLE IF NOT EXISTS table2 (
    col_01  STRING,
    col_02  CHAR (2),
    col_03  VARCHAR (32),
    col_04  TIMESTAMP,
    col_05  DATE,
    col_06  INT,
    col_07  SMALLINT,
    col_08  BIGINT,
    col_09  TINYINT,
    col_10  FLOAT,
    col_11  DOUBLE,
    col_12  DECIMAL (10, 3),
    col_13  BOOLEAN
)
USING parquet
OPTIONS (path 'obs://bucketName/filePath');

Example 5: Creating an OBS Partitioned Table and Customizing OPTIONS Parameters

Example description: When creating an OBS table, you can customize property names and values. For details about OPTIONS parameters, see Table 2.

In this example, an OBS partitioned table named table3 is created and partitioned based on col_2. Configure path, multiLevelDirEnable, dataDelegated, and compression in OPTIONS.

CREATE TABLE IF NOT EXISTS table3 (
    col_1   STRING,
    col_2   INT
)
USING parquet
PARTITIONED BY (col_2)
OPTIONS (
    path 'obs://bucketName/filePath',
    multiLevelDirEnable = true,
    dataDelegated = true,
    compression = 'zstd'
);

Example 6: Creating an OBS Non-Partitioned Table and Customizing OPTIONS Parameters

Example description: CSV is a plain-text file format that uses commas to separate data values. It is commonly used for storing and sharing data, but because it lacks structured data concepts, it is not ideal for complex data types. For this reason, more OPTIONS parameters can be configured when file_format is set to csv. For details, see Table 3.

In this example, a non-partitioned table named table4 is created with a csv storage format, and additional OPTIONS parameters are used to constrain the data.

CREATE TABLE IF NOT EXISTS table4 (
    col_1 STRING,
    col_2 INT
)
USING csv
OPTIONS (
    path 'obs://bucketName/filePath',
    delimiter       = ',',
    quote            = '#',
    escape           = '|',
    multiLine        = false,
    dateFormat       = 'yyyy-MM-dd',
    timestampFormat  = 'yyyy-MM-dd HH:mm:ss',
    mode             = 'failfast',
    header           = true,
    nullValue        = 'null',
    comment          = '*',
    compression      = 'deflate',
    encoding         = 'utf-8'
);