Creating a Dataset
Function
This API is used to create a dataset.
URI
POST /v2/{project_id}/datasets
Table 1 Path Parameters
Parameter | Mandatory | Type | Description |
project_id | Yes | String | Project ID. For details about how to obtain the project ID, see Obtaining a Project ID <modelarts_03_0147>. |
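As a minimal sketch of how the path parameter slots into the URI, the helper below builds the full request URL. The function name `create_dataset_url` and the endpoint host are hypothetical; the real endpoint depends on your region and is not given in this section.

```python
from urllib.parse import quote


def create_dataset_url(endpoint, project_id):
    """Return the full URL for POST /v2/{project_id}/datasets.

    `endpoint` is an assumed base URL such as "https://modelarts.example.com".
    """
    if not project_id:
        raise ValueError("project_id is mandatory")
    # Percent-encode the project ID so it is safe to embed in the path.
    return f"{endpoint.rstrip('/')}/v2/{quote(project_id, safe='')}/datasets"
```

For example, `create_dataset_url("https://modelarts.example.com/", "abc123")` yields a URL ending in `/v2/abc123/datasets`.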
Request Parameters
Table 2 Request body parameters
Parameter | Mandatory | Type | Description |
data_format | No | String | Data format. The options are as follows: - Default: default format - CarbonData: CarbonData (supported only by table datasets) |
data_sources | No | Array of DataSource <createdataset__request_datasource> objects | Input dataset path, which is used to synchronize source data (such as images, text files, and audio files) in the directory and its subdirectories to the dataset. For a table dataset, this parameter indicates the import directory. The work directory of a table dataset cannot be an OBS path in a KMS-encrypted bucket. |
dataset_name | Yes | String | Dataset name. The value contains 1 to 100 characters. Only letters, digits, underscores (_), and hyphens (-) are allowed, for example, dataset-9f3b. |
dataset_type | No | Integer | Dataset type. The options are as follows: - 0: image classification - 1: object detection - 100: text classification - 101: named entity recognition - 102: text triplet - 200: sound classification - 201: speech content - 202: speech paragraph labeling - 400: table dataset - 600: video labeling - 900: custom format |
description | No | String | Dataset description. The value is empty by default. The description contains 0 to 256 characters and does not support the following special characters: ^!<>=&"' |
import_annotations | No | Boolean | Whether to automatically import the labeling information in the input directory. This is supported for object detection, image classification, and text classification. The options are as follows: - true: Import labeling information in the input directory. (Default value) - false: Do not import labeling information in the input directory. |
import_data | No | Boolean | Whether to import data. This parameter is used only for table datasets. The options are as follows: - true: Import data when creating a dataset. - false: Do not import data when creating a dataset. (Default value) |
label_format | No | LabelFormat <createdataset__request_labelformat> object | Label format information. This parameter is used only for text datasets. |
labels | No | Array of Label <createdataset__request_label> objects | Dataset label list. |
managed | No | Boolean | Whether to host a dataset. The options are as follows: - true: Host a dataset. - false: Do not host a dataset. (Default value) |
schema | No | Array of Field <createdataset__request_field> objects | Schema list. |
work_path | Yes | String | Output dataset path, which is used to store output files such as label files. - The format is /Bucket name/File path, for example, /obs-bucket/flower/rose/. (The directory is used as the path.) - A bucket cannot be directly used as a path. - The output dataset path must differ from the input dataset path and must not be one of its subdirectories. - The value contains 3 to 700 characters. |
work_path_type | Yes | Integer | Type of the dataset output path. The options are as follows: - 0: OBS bucket (default value) |
workforce_information | No | WorkforceInformation <createdataset__request_workforceinformation> object | Team labeling information. |
workspace_id | No | String | Workspace ID. If no workspace is created, the default value is 0. If a workspace is created and used, use the actual value. |
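The constraints in Table 2 can be checked on the client before the request is sent. The following is a rough sketch only: `validate_create_request` is a hypothetical helper, it assumes that "letters" means ASCII letters, and the service may apply additional rules that are not listed here.

```python
import re

# Assumption: "letters" means ASCII letters; the table does not say.
_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,100}")
_FORBIDDEN_IN_DESCRIPTION = set("^!<>=&\"'")


def validate_create_request(body):
    """Return a list of problems found in a dataset-creation request body."""
    errors = []

    name = body.get("dataset_name")
    if not isinstance(name, str) or not _NAME_RE.fullmatch(name):
        errors.append("dataset_name: 1 to 100 letters, digits, '_' or '-'")

    work_path = body.get("work_path")
    if not isinstance(work_path, str) or not 3 <= len(work_path) <= 700:
        errors.append("work_path: required, 3 to 700 characters")
    else:
        # A bare bucket such as "/obs-bucket/" has only one path segment.
        segments = [s for s in work_path.split("/") if s]
        if not work_path.startswith("/") or len(segments) < 2:
            errors.append("work_path: expected /bucket-name/file-path/")
        normalized = work_path.rstrip("/") + "/"
        for source in body.get("data_sources", []):
            data_path = source.get("data_path")
            if data_path and normalized.startswith(data_path.rstrip("/") + "/"):
                errors.append("work_path: must not be the input path or inside it")

    if "work_path_type" not in body:
        errors.append("work_path_type: required")

    description = body.get("description", "")
    if len(description) > 256 or _FORBIDDEN_IN_DESCRIPTION & set(description):
        errors.append("description: at most 256 characters, none of ^!<>=&\"'")

    return errors
```

An empty list means that no problem was found by these checks; it does not guarantee that the service accepts the request.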
Table 3 DataSource
Parameter | Mandatory | Type | Description |
data_path | No | String | Data source path. |
data_type | No | Integer | Data type. The options are as follows: - 0: OBS bucket (default value) - 1: GaussDB(DWS) - 2: DLI - 3: RDS - 4: MRS - 5: AI Gallery - 6: Inference service |
schema_maps | No | Array of SchemaMap <createdataset__request_schemamap> objects | Schema mapping information corresponding to the table data. |
source_info | No | SourceInfo <createdataset__request_sourceinfo> object | Information required for importing a table data source. |
with_column_header | No | Boolean | Whether the first row in the file is a column name. This field is valid for the table dataset. The options are as follows: - true: The first row in the file is the column name. - false: The first row in the file is not the column name. |
Table 4 SchemaMap
Parameter | Mandatory | Type | Description |
dest_name | No | String | Name of the destination column. |
src_name | No | String | Name of the source column. |
Table 5 SourceInfo
Parameter | Mandatory | Type | Description |
cluster_id | No | String | ID of an MRS cluster. |
cluster_mode | No | String | Running mode of an MRS cluster. The options are as follows: - 0: normal cluster - 1: security cluster |
cluster_name | No | String | Name of an MRS cluster. |
database_name | No | String | Name of the database to which the table dataset is imported. |
input | No | String | HDFS path of a table dataset. |
ip | No | String | IP address of your GaussDB(DWS) cluster. |
port | No | String | Port number of your GaussDB(DWS) cluster. |
queue_name | No | String | DLI queue name of a table dataset. |
subnet_id | No | String | Subnet ID of an MRS cluster. |
table_name | No | String | Name of the table to which a table dataset is imported. |
user_name | No | String | Username, which is mandatory for GaussDB(DWS) data. |
user_password | No | String | User password, which is mandatory for GaussDB(DWS) data. |
vpc_id | No | String | ID of the VPC where an MRS cluster resides. |
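To illustrate how Table 3 and Table 5 fit together, here is a sketch of a DataSource entry for a GaussDB(DWS) source. The helper `make_dws_data_source` is hypothetical, and which SourceInfo fields the service actually needs for a given source type is an assumption; the tables only state that the username and password are mandatory for GaussDB(DWS) data.

```python
def make_dws_data_source(ip, port, database_name, table_name, user_name, user_password):
    """Build one data_sources entry for a GaussDB(DWS) table source."""
    if not user_name or not user_password:
        raise ValueError("user_name and user_password are mandatory for GaussDB(DWS) data")
    return {
        "data_type": 1,  # 1: GaussDB(DWS), per Table 3
        "source_info": {
            "ip": ip,
            "port": str(port),  # Table 5 types the port as String
            "database_name": database_name,
            "table_name": table_name,
            "user_name": user_name,
            "user_password": user_password,
        },
    }
```

In practice the password would normally come from an environment variable or a secret store rather than from source code.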
Table 6 LabelFormat
Parameter | Mandatory | Type | Description |
label_type | No | String | Label type of text classification. The options are as follows: - 0: The label is separated from the text, and they are distinguished by the fixed suffix _result. For example, the text file is abc.txt, and the label file is abc_result.txt. - 1: Default value. Labels and texts are stored in the same file and separated by separators. You can use text_sample_separator to specify the separator between the text and label and text_label_separator to specify the separator between labels. |
text_label_separator | No | String | Separator between labels. By default, a comma (,) is used as the separator. The separator needs to be escaped. The separator can contain only one character, such as a letter, a digit, or any of the following special characters: !@#$%^&*_=|?/':.;, |
text_sample_separator | No | String | Separator between the text and label. By default, a tab character is used as the separator. The separator needs to be escaped. The separator can contain only one character, such as a letter, a digit, or any of the following special characters: !@#$%^&*_=|?/':.;, |
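The two label_type layouts can be pictured with a short sketch. Both helpers are hypothetical, and the second assumes that the labels follow the text on each line, which the table does not state explicitly.

```python
from pathlib import PurePosixPath


def result_file_for(text_file):
    """label_type 0: map a text file name to its label file (abc.txt -> abc_result.txt)."""
    path = PurePosixPath(text_file)
    return str(path.with_name(path.stem + "_result" + path.suffix))


def parse_labeled_line(line, text_sample_separator="\t", text_label_separator=","):
    """label_type 1: split one line into (text, [labels]).

    Assumption: the labels come after the last text_sample_separator.
    """
    line = line.rstrip("\n")
    text, sep, label_part = line.rpartition(text_sample_separator)
    if not sep:
        # No separator found: treat the whole line as unlabeled text.
        return line, []
    labels = [label for label in label_part.split(text_label_separator) if label]
    return text, labels
```

With the default separators, a line such as `great movie<TAB>positive,film` splits into the text `great movie` and the labels `positive` and `film`.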
Table 7 Label
Parameter | Mandatory | Type | Description |
attributes | No | Array of LabelAttribute <createdataset__request_labelattribute> objects | Multi-dimensional attribute of a label. For example, if the label is music, attributes such as style and artist may be included. |
name | No | String | Label name. |
property | No | LabelProperty <createdataset__request_labelproperty> object | Basic attribute key-value pair of a label, such as color and shortcut keys. |
type | No | Integer | Label type. The options are as follows: - 0: image classification - 1: object detection - 100: text classification - 101: named entity recognition - 102: text triplet relationship - 103: text triplet entity - 200: speech classification - 201: speech content - 202: speech paragraph labeling - 600: video classification |
Table 8 LabelAttribute
Parameter | Mandatory | Type | Description |
default_value | No | String | Default value of a label attribute. |
id | No | String | Label attribute ID. |
name | No | String | Label attribute name. |
type | No | String | Label attribute type. The options are as follows: - text: text - select: single-choice drop-down list |
values | No | Array of LabelAttributeValue <createdataset__request_labelattributevalue> objects | List of label attribute values. |
Table 9 LabelAttributeValue
Parameter | Mandatory | Type | Description |
id | No | String | Label attribute value ID. |
value | No | String | Label attribute value. |
Table 10 LabelProperty
Parameter | Mandatory | Type | Description |
@modelarts:color | No | String | Default attribute: Label color, which is a hexadecimal code of the color. By default, this parameter is left blank. Example: #FFFFF0. |
@modelarts:default_shape | No | String | Default attribute: Default shape of an object detection label (dedicated attribute). By default, this parameter is left blank. The options are as follows: - bndbox: rectangle - polygon: polygon - circle: circle - line: straight line - dashed: dashed line - point: point - polyline: polyline |
@modelarts:from_type | No | String | Default attribute: Type of the head entity in the triplet relationship label. This attribute must be specified when a relationship label is created. This parameter is used only for the text triplet dataset. |
@modelarts:rename_to | No | String | Default attribute: The new name of the label. |
@modelarts:shortcut | No | String | Default attribute: Label shortcut key. By default, this parameter is left blank. For example: D. |
@modelarts:to_type | No | String | Default attribute: Type of the tail entity in the triplet relationship label. This attribute must be specified when a relationship label is created. This parameter is used only for the text triplet dataset. |
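A Label object with its LabelProperty keys might be assembled as in the sketch below. `make_label` is a hypothetical helper; the six-digit `#RRGGBB` check is an assumption drawn from the `#FFFFF0` example, since the table only says "hexadecimal code".

```python
import re

# Assumption: colors use the six-digit #RRGGBB form shown in the examples.
_HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")
_SHAPES = {"bndbox", "polygon", "circle", "line", "dashed", "point", "polyline"}


def make_label(name, label_type, color=None, default_shape=None, shortcut=None):
    """Build one entry for the `labels` array of the request body."""
    prop = {}
    if color is not None:
        if not _HEX_COLOR.fullmatch(color):
            raise ValueError(f"not a hexadecimal color code: {color!r}")
        prop["@modelarts:color"] = color
    if default_shape is not None:
        if default_shape not in _SHAPES:
            raise ValueError(f"unknown shape: {default_shape!r}")
        prop["@modelarts:default_shape"] = default_shape
    if shortcut is not None:
        prop["@modelarts:shortcut"] = shortcut

    label = {"name": name, "type": label_type}
    if prop:
        # Omit `property` entirely when no attribute was requested.
        label["property"] = prop
    return label
```

For instance, `make_label("Cat", 1, color="#3399ff", default_shape="bndbox")` produces an object detection label drawn as a rectangle.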
Table 11 Field
Parameter | Mandatory | Type | Description |
description | No | String | Schema description. |
name | No | String | Schema name. |
schema_id | No | Integer | Schema ID. |
type | No | String | Schema value type. |
Table 12 WorkforceInformation
Parameter | Mandatory | Type | Description |
data_sync_type | No | Integer | Synchronization type. The options are as follows: - 0: not to be synchronized - 1: data to be synchronized - 2: label to be synchronized - 3: data and label to be synchronized |
repetition | No | Integer | Number of persons who label each sample. The minimum value is 1. |
synchronize_auto_labeling_data | No | Boolean | Whether to synchronously update auto labeling data. The options are as follows: - true: Update auto labeling data synchronously. - false: Do not update auto labeling data synchronously. |
synchronize_data | No | Boolean | Whether to synchronize updated data, such as uploading files, synchronizing data sources, and assigning imported unlabeled files to team members. The options are as follows: - true: Synchronize updated data to team members. - false: Do not synchronize updated data to team members. |
task_id | No | String | ID of a team labeling task. |
task_name | Yes | String | Name of a team labeling task. The value contains 1 to 64 characters, including only letters, digits, underscores (_), and hyphens (-). |
workforces_config | No | WorkforcesConfig <createdataset__request_workforcesconfig> object | Manpower assignment of a team labeling task. You can delegate the administrator to assign the manpower or do it by yourself. |
Table 13 WorkforcesConfig
Parameter | Mandatory | Type | Description |
agency | No | String | Team administrator. |
workforces | No | Array of WorkforceConfig <createdataset__request_workforceconfig> objects | List of teams that execute labeling tasks. |
Table 14 WorkforceConfig
Parameter | Mandatory | Type | Description |
workers | No | Array of Worker <createdataset__request_worker> objects | List of labeling team members. |
workforce_id | No | String | ID of a labeling team. |
workforce_name | No | String | Name of a labeling team. The value contains 0 to 1024 characters and does not support the following special characters: !<>=&"' |
Table 15 Worker
Parameter | Mandatory | Type | Description |
create_time | No | Long | Creation time. |
description | No | String | Labeling team member description. The value contains 0 to 256 characters and does not support the following special characters: ^!<>=&"' |
email | No | String | Email address of a labeling team member. |
role | No | Integer | Role. The options are as follows: - 0: labeling personnel - 1: reviewer - 2: team administrator - 3: dataset owner |
status | No | Integer | Current login status of a labeling team member. The options are as follows: - 0: The invitation email has not been sent. - 1: The invitation email has been sent but the user has not logged in. - 2: The user has logged in. - 3: The labeling team member has been deleted. |
update_time | No | Long | Update time. |
worker_id | No | String | ID of a labeling team member. |
workforce_id | No | String | ID of a labeling team. |
Response Parameters
Status code: 201
Table 16 Response body parameters
Parameter | Type | Description |
dataset_id | String | Dataset ID. |
error_code | String | Error code. |
error_msg | String | Error message. |
import_task_id | String | ID of an import task. |
Example Requests
Creating an Image Classification Dataset
{
  "workspace_id" : "0",
  "dataset_name" : "dataset-457f",
  "dataset_type" : 0,
  "data_sources" : [ {
    "data_type" : 0,
    "data_path" : "/test-obs/classify/input/cat-dog/"
  } ],
  "description" : "",
  "work_path" : "/test-obs/classify/output/",
  "work_path_type" : 0,
  "labels" : [ {
    "name" : "Cat",
    "type" : 0,
    "property" : {
      "@modelarts:color" : "#3399ff"
    }
  }, {
    "name" : "Dog",
    "type" : 0,
    "property" : {
      "@modelarts:color" : "#3399ff"
    }
  } ]
}
Creating an Object Detection Dataset
{
  "workspace_id" : "0",
  "dataset_name" : "dataset-95a6",
  "dataset_type" : 1,
  "data_sources" : [ {
    "data_type" : 0,
    "data_path" : "/test-obs/detect/input/cat-dog/"
  } ],
  "description" : "",
  "work_path" : "/test-obs/detect/output/",
  "work_path_type" : 0,
  "labels" : [ {
    "name" : "Cat",
    "type" : 1,
    "property" : {
      "@modelarts:color" : "#3399ff"
    }
  }, {
    "name" : "Dog",
    "type" : 1,
    "property" : {
      "@modelarts:color" : "#3399ff"
    }
  } ]
}
Creating a Table Dataset
{
  "workspace_id" : "0",
  "dataset_name" : "dataset-de83",
  "dataset_type" : 400,
  "data_sources" : [ {
    "data_type" : 0,
    "data_path" : "/test-obs/table/input/",
    "with_column_header" : true
  } ],
  "description" : "",
  "work_path" : "/test-obs/table/output/",
  "work_path_type" : 0,
  "schema" : [ {
    "schema_id" : 1,
    "name" : "150",
    "type" : "STRING"
  }, {
    "schema_id" : 2,
    "name" : "4",
    "type" : "STRING"
  }, {
    "schema_id" : 3,
    "name" : "setosa",
    "type" : "STRING"
  }, {
    "schema_id" : 4,
    "name" : "versicolor",
    "type" : "STRING"
  }, {
    "schema_id" : 5,
    "name" : "virginica",
    "type" : "STRING"
  } ],
  "import_data" : true
}
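Any of the example bodies above could be wrapped in an HTTP request along these lines. This sketch uses only the Python standard library; `build_create_request` is a hypothetical helper, the endpoint host is a placeholder, and the `X-Auth-Token` header is an assumption about the authentication scheme, which this section does not describe.

```python
import json
import urllib.request


def build_create_request(endpoint, project_id, token, body):
    """Prepare (but do not send) the POST /v2/{project_id}/datasets request."""
    url = f"{endpoint.rstrip('/')}/v2/{project_id}/datasets"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name, not stated in this section
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request; a 201 status indicates that the dataset was created.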
Example Responses
Status code: 201
Created
{
  "dataset_id" : "WxCREuCkBSAlQr9xrde"
}
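A caller might handle the response roughly as follows. `parse_create_response` is a hypothetical helper; it assumes that error responses carry the `error_code` and `error_msg` fields listed in Table 16, which the section does not confirm for every status code.

```python
import json


def parse_create_response(status, payload):
    """Return the created dataset's IDs, or raise on any status other than 201."""
    data = json.loads(payload) if payload else {}
    if status == 201:
        return {
            "dataset_id": data["dataset_id"],
            # Absent from the example response above, so treated as optional.
            "import_task_id": data.get("import_task_id"),
        }
    raise RuntimeError(
        f"dataset creation failed with status {status}: "
        f"{data.get('error_code')} {data.get('error_msg')}"
    )
```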
Status Codes
Status Code | Description |
201 | Created |
401 | Unauthorized |
403 | Forbidden |
404 | Not Found |
Error Codes
See Error Codes <modelarts_03_0095>.