Lai, Weijian 2f0818cf3d ModelArts API 22.3.0 version-20240311

Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com>
Co-authored-by: Lai, Weijian <laiweijian4@huawei.com>
Co-committed-by: Lai, Weijian <laiweijian4@huawei.com>

2024-04-05 09:35:42 +00:00

Creating a Training Job

Function

This API is used to create a training job.

URI

POST /v2/{project_id}/training-jobs

**Table 1** Path Parameters
Parameter	Mandatory	Type	Description
project_id	Yes	String	Project ID. For details, see Obtaining a Project ID.

Request Parameters

**Table 2** Request body parameters
Parameter	Mandatory	Type	Description
kind	Yes	String	Training job type. The default value is job. Options: job (training job), hetero_job (heterogeneous job), autosearch_job (auto search job), mrs_job (MRS job), and edge_job (edge job).
metadata	Yes	JobMetadata object	Metadata of a training job.
algorithm	No	JobAlgorithm object	Algorithm for training jobs. The following formats are supported: id: Only the algorithm ID is used. subscription_id+item_version_id: The subscription ID and version ID of the algorithm are used. code_dir+boot_file: The code directory and boot file of a training job are used.
tasks	No	Array of Task objects	List of tasks in heterogeneous training jobs. If this parameter is specified, leave the spec parameter blank.
spec	No	spec object	Specifications of a training job. If this parameter is specified, leave the tasks parameter blank.
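
The constraints in Table 2 can be checked on the client before the request is sent. The following is a minimal sketch, not part of the API: the function name `check_top_level` and the wording of the messages are hypothetical, and the set of kinds is copied from the description of kind above.

```python
# Hypothetical client-side check of the top-level request body (Table 2).
ALLOWED_KINDS = {"job", "hetero_job", "autosearch_job", "mrs_job", "edge_job"}


def check_top_level(body: dict) -> list:
    """Return a list of problems found in the request body (empty if none)."""
    problems = []
    if body.get("kind") not in ALLOWED_KINDS:
        problems.append("kind must be one of: " + ", ".join(sorted(ALLOWED_KINDS)))
    if "metadata" not in body:
        problems.append("metadata is mandatory")
    # tasks and spec exclude each other: if one is specified, leave the other blank.
    if body.get("tasks") and body.get("spec"):
        problems.append("specify either tasks or spec, not both")
    return problems


ok_body = {"kind": "job", "metadata": {"name": "TestModelArtsJob"}}
bad_body = {
    "kind": "job",
    "metadata": {"name": "TestModelArtsJob"},
    "tasks": [{"role": "worker"}],
    "spec": {"resource": {"flavor_id": "modelarts.p3.large.public.free"}},
}
print(check_top_level(ok_body))
print(check_top_level(bad_body))
```

The service remains the authority on what it accepts; a check like this only catches obvious mistakes early.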

**Table 3** JobMetadata
Parameter	Mandatory	Type	Description
id	No	String	Training job ID, which is generated and returned by ModelArts after the training job is created.
name	Yes	String	Name of a training job. The value must contain 1 to 64 characters consisting of only digits, letters, underscores (_), and hyphens (-).
workspace_id	No	String	Workspace where a job is located. The default value is 0.
description	No	String	Training job description. The value must contain 0 to 256 characters. The default value is NULL.
create_time	No	Long	Timestamp when a training job is created, in milliseconds. The value is generated and returned by ModelArts after the job is created.
user_name	No	String	Username for creating a training job. The username is generated and returned by ModelArts after the training job is created.
annotations	No	Map<String,String>	Declaration template of a training job. For heterogeneous jobs, the default value of job_template is Template RL. For other jobs, the default value is Template DL.
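
The limits on name and description in Table 3 translate directly into a small validator. This is a sketch under one assumption: "letters" is read as ASCII letters only, which the table does not state explicitly. The helper names are hypothetical.

```python
import re

# 1 to 64 characters: digits, letters (assumed ASCII), underscores, hyphens.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_job_name(name: str) -> bool:
    """Check a training job name against the rule in Table 3."""
    return NAME_PATTERN.fullmatch(name) is not None


def is_valid_description(description) -> bool:
    """A description is optional and may contain 0 to 256 characters."""
    return description is None or len(description) <= 256


print(is_valid_job_name("TestModelArtsJob"))
print(is_valid_job_name("bad name"))
```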

**Table 4** JobAlgorithm
Parameter	Mandatory	Type	Description
id	No	String	Algorithm ID.
name	No	String	Algorithm name. Leave it blank.
subscription_id	No	String	Subscription ID of the subscription algorithm. This parameter must be used together with item_version_id.
item_version_id	No	String	Version ID of the subscription algorithm. This parameter must be used together with subscription_id.
code_dir	No	String	Code directory of a training job, for example, /usr/app/. This parameter must be used together with boot_file. If id or subscription_id+item_version_id is set, leave it blank.
boot_file	No	String	Boot file of a training job, which needs to be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used together with code_dir. If id or subscription_id+item_version_id is set, leave it blank.
autosearch_config_path	No	String	YAML configuration path of auto search jobs. An OBS URL is required.
autosearch_framework_path	No	String	Framework code directory of auto search jobs. An OBS URL is required.
command	No	String	Command for starting the container of the custom image of a training job in the custom image scenario.
parameters	No	Array of parameters objects	Running parameters of a training job.
policies	No	policies object	Policies supported by jobs, which are used for hyperparameter search.
inputs	No	Array of Input objects	Input of a training job.
outputs	No	Array of Output objects	Output of a training job.
engine	No	engine object	Engine of a training job. Leave this parameter blank if the job is created using id of the algorithm management API or subscription_id+item_version_id of the subscription algorithm API.
environments	No	Array of Map<String,String> objects	Environment variables of a training job. The format is key: value. Leave this parameter blank.
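
Table 4 describes three ways to identify the algorithm, and two of them require a pair of fields. The sketch below detects which format a given algorithm object uses. It assumes the three formats are mutually exclusive, which the table implies for code_dir and boot_file but does not state for every combination. The function name is hypothetical.

```python
from typing import Optional


def algorithm_format(algorithm: dict) -> Optional[str]:
    """Return the format used by an algorithm object, or None if none is set."""
    has_id = bool(algorithm.get("id"))
    has_sub = bool(algorithm.get("subscription_id"))
    has_ver = bool(algorithm.get("item_version_id"))
    has_dir = bool(algorithm.get("code_dir"))
    has_boot = bool(algorithm.get("boot_file"))

    # Paired fields must be used together.
    if has_sub != has_ver:
        raise ValueError("subscription_id and item_version_id must be used together")
    if has_dir != has_boot:
        raise ValueError("code_dir and boot_file must be used together")

    chosen = [
        label
        for label, present in (
            ("id", has_id),
            ("subscription_id+item_version_id", has_sub),
            ("code_dir+boot_file", has_dir),
        )
        if present
    ]
    if len(chosen) > 1:
        raise ValueError("more than one algorithm format is set: " + ", ".join(chosen))
    return chosen[0] if chosen else None


print(algorithm_format({"id": "3f5d6706-7b67-408d-8ba0-ec08048c45ed"}))
print(algorithm_format({"code_dir": "/usr/app/", "boot_file": "/usr/app/boot.py"}))
```

The function returns None rather than raising when no format is set, because a request that uses a custom image through engine.image_url and command sets none of these fields.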

**Table 5** parameters
Parameter	Mandatory	Type	Description
name	No	String	Parameter name.
value	No	String	Parameter value.
description	No	String	Parameter description.
constraint	No	constraint object	Parameter constraint.
i18n_description	No	i18n_description object	Internationalization description.

**Table 6** constraint
Parameter	Mandatory	Type	Description
type	No	String	Parameter type.
editable	No	Boolean	Whether the parameter is editable.
required	No	Boolean	Whether the parameter is mandatory.
sensitive	No	Boolean	Whether the parameter is sensitive.
valid_type	No	String	Valid type.
valid_range	No	Array of strings	Valid range.

**Table 7** i18n_description
Parameter	Mandatory	Type	Description
language	No	String	Internationalization language.
description	No	String	Description.

**Table 8** policies
Parameter	Mandatory	Type	Description
auto_search	No	auto_search object	Hyperparameter search configuration.

**Table 9** auto_search
Parameter	Mandatory	Type	Description
skip_search_params	No	String	Hyperparameters to skip during the search.
reward_attrs	No	Array of reward_attrs objects	List of search metrics.
search_params	No	Array of search_params objects	Search parameters.
algo_configs	No	Array of algo_configs objects	Search algorithm configurations.

**Table 10** reward_attrs
Parameter	Mandatory	Type	Description
name	No	String	Metric name.
mode	No	String	Search direction. max: A larger metric value indicates better performance. min: A smaller metric value indicates better performance.
regex	No	String	Regular expression of a metric.

**Table 11** search_params
Parameter	Mandatory	Type	Description
name	No	String	Hyperparameter name.
param_type	No	String	Parameter type. continuous: The parameter is a continuous value. discrete: The parameter is a discrete value.
lower_bound	No	String	Lower bound of the hyperparameter.
upper_bound	No	String	Upper bound of the hyperparameter.
discrete_points_num	No	String	Number of discrete points of a continuous hyperparameter.
discrete_values	No	Array of strings	List of discrete hyperparameter values.
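
Tables 9 to 11 define bounds and point counts as strings, which is easy to get wrong when the values start out as numbers. The following sketch assembles a policies object with one search metric and one continuous hyperparameter. The metric name, the regular expression, and the hyperparameter name are made-up illustrations, and the helper names are hypothetical.

```python
import json


def reward_attr(name: str, mode: str, regex: str) -> dict:
    """Build one reward_attrs entry (Table 10); mode is 'max' or 'min'."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    return {"name": name, "mode": mode, "regex": regex}


def continuous_param(name: str, lower, upper, points: int) -> dict:
    """Build one continuous search_params entry (Table 11).

    The table types the bounds and the point count as String, so numbers
    are converted before serialization.
    """
    return {
        "name": name,
        "param_type": "continuous",
        "lower_bound": str(lower),
        "upper_bound": str(upper),
        "discrete_points_num": str(points),
    }


policies = {
    "auto_search": {
        # Illustrative values only.
        "reward_attrs": [reward_attr("accuracy", "max", r"accuracy=(\d+\.\d+)")],
        "search_params": [continuous_param("learning_rate", 0.001, 0.1, 4)],
    }
}
print(json.dumps(policies, indent=2))
```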

**Table 12** algo_configs
Parameter	Mandatory	Type	Description
name	No	String	Name of the search algorithm.
params	No	Array of AutoSearchAlgoConfigParameter objects	Search algorithm parameters.

**Table 13** AutoSearchAlgoConfigParameter
Parameter	Mandatory	Type	Description
key	No	String	Parameter key.
value	No	String	Parameter value.
type	No	String	Parameter type.

**Table 14** engine
Parameter	Mandatory	Type	Description
engine_id	No	String	ID of the engine selected for a training job. Specify the engine using engine_id, engine_name + engine_version, or image_url.
engine_name	No	String	Name of the engine selected for a training job. If engine_id is set, leave this parameter blank.
engine_version	No	String	Name of the engine version selected for a training job. If engine_id is set, leave this parameter blank.
image_url	No	String	Custom image URL selected for a training job.

**Table 15** Task
Parameter	Mandatory	Type	Description
role	No	String	Role of a heterogeneous training job. Options: learner: supports GPUs or CPUs. worker: supports CPUs.
algorithm	No	algorithm object	Algorithm management and configuration.
task_resource	No	task_resource object	Resource flavors of a training job.

**Table 16** algorithm
Parameter	Mandatory	Type	Description
job_config	No	job_config object	Algorithm configuration, such as the boot file.
code_dir	No	String	Algorithm code directory, for example, /usr/app/. This parameter must be used together with boot_file.
boot_file	No	String	Code boot file of the algorithm, which needs to be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used together with code_dir.
engine	No	engine object	Engine of a heterogeneous job algorithm.
inputs	No	Array of inputs objects	Data input of an algorithm.
outputs	No	Array of outputs objects	Data output of an algorithm.

**Table 17** job_config
Parameter	Mandatory	Type	Description
parameters	No	Array of Parameter objects	Running parameter of an algorithm.
inputs	No	Array of Input objects	Data input of an algorithm.
outputs	No	Array of Output objects	Data output of an algorithm.
engine	No	engine object	Algorithm engine.

**Table 18** Parameter
Parameter	Mandatory	Type	Description
name	No	String	Parameter name.
value	No	String	Parameter value.
description	No	String	Parameter description.
constraint	No	constraint object	Parameter constraint.
i18n_description	No	i18n_description object	Internationalization description.

**Table 19** constraint
Parameter	Mandatory	Type	Description
type	No	String	Parameter type.
editable	No	Boolean	Whether the parameter is editable.
required	No	Boolean	Whether the parameter is mandatory.
sensitive	No	Boolean	Whether the parameter is sensitive.
valid_type	No	String	Valid type.
valid_range	No	Array of strings	Valid range.

**Table 20** i18n_description
Parameter	Mandatory	Type	Description
language	No	String	Internationalization language.
description	No	String	Description.

**Table 21** Input
Parameter	Mandatory	Type	Description
name	Yes	String	Name of the data input channel.
description	No	String	Description of the data input channel.
local_dir	No	String	Local directory of the container to which the data input channel is mapped.
remote	Yes	InputDataInfo object	Data input. Options: dataset: Dataset as the data input obs: OBS path as the data input
remote_constraint	No	Array of remote_constraint objects	Data input constraint.

**Table 22** InputDataInfo
Parameter	Mandatory	Type	Description
dataset	No	dataset object	Dataset as the data input.
obs	No	obs object	OBS location in which the input and output data is stored.

**Table 23** dataset
Parameter	Mandatory	Type	Description
id	Yes	String	Dataset ID of a training job.
version_id	Yes	String	Dataset version ID of a training job.
obs_url	No	String	OBS URL of the dataset required by a training job. ModelArts automatically parses and generates the URL based on the dataset and dataset version IDs. For example, /usr/data/.

**Table 24** obs
Parameter	Mandatory	Type	Description
obs_url	Yes	String	OBS URL of the dataset required by a training job. For example, /usr/data/.

**Table 25** remote_constraint
Parameter	Mandatory	Type	Description
data_type	No	String	Data input type, including the data storage location and dataset.
attributes	No	String	Attributes if a dataset is used as the data input. Options: data_format: Data format data_segmentation: Data segmentation dataset_type: Labeling type

**Table 26** Output
Parameter	Mandatory	Type	Description
name	Yes	String	Name of the data output channel.
description	No	String	Description of the data output channel.
local_dir	No	String	Local directory of the container to which the data output channel is mapped.
remote	Yes	remote object	Description of the actual data output.

**Table 27** remote
Parameter	Mandatory	Type	Description
obs	Yes	obs object	OBS to which data is actually exported.

**Table 28** obs
Parameter	Mandatory	Type	Description
obs_url	Yes	String	OBS URL to which data is actually exported.

**Table 29** engine
Parameter	Mandatory	Type	Description
engine_id	No	String	Engine ID selected for an algorithm.
engine_name	No	String	Engine name selected for an algorithm. If engine_id is specified, leave this parameter blank.
engine_version	No	String	Engine version name selected for an algorithm. If engine_id is specified, leave this parameter blank.
image_url	No	String	Custom image URL selected by an algorithm.

**Table 30** engine
Parameter	Mandatory	Type	Description
engine_id	No	String	Engine ID of a heterogeneous job, for example, caffe-1.0.0-python2.7.
engine_name	No	String	Engine name of a heterogeneous job, for example, Caffe.
engine_version	No	String	Engine version of a heterogeneous job.
image_url	No	String	Custom image URL selected by an algorithm.

**Table 31** inputs
Parameter	Mandatory	Type	Description
name	Yes	String	Name of the data input channel.
description	No	String	Description of the data input channel.
local_dir	No	String	Local directory of the container to which the data input channel is mapped.
remote	Yes	remote object	Data input. Options: dataset: Dataset as the data input obs: OBS path as the data input

**Table 32** remote
Parameter	Mandatory	Type	Description
obs	No	obs object	OBS location in which the input and output data is stored.

**Table 33** obs
Parameter	Mandatory	Type	Description
obs_url	Yes	String	OBS URL of the dataset required by a training job. For example, /usr/data/.

**Table 34** outputs
Parameter	Mandatory	Type	Description
name	Yes	String	Name of the data output channel.
description	No	String	Description of the data output channel.
local_dir	No	String	Local directory of the container to which the data output channel is mapped.
remote	Yes	remote object	Description of the actual data output.

**Table 35** remote
Parameter	Mandatory	Type	Description
obs	Yes	obs object	OBS to which data is actually exported.

**Table 36** obs
Parameter	Mandatory	Type	Description
obs_url	Yes	String	OBS URL to which data is actually exported.

**Table 37** task_resource
Parameter	Mandatory	Type	Description
flavor_id	No	String	Resource flavor ID of a training job.
node_count	Yes	Integer	Number of resource replicas selected for a training job. Minimum: 1
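
For heterogeneous jobs, each entry in tasks combines a role from Table 15 with a task_resource from Table 37. The sketch below checks the two rules stated in those tables: role is learner or worker, and node_count is mandatory in task_resource with a minimum of 1. The function name and the flavor ID in the usage lines are hypothetical.

```python
VALID_ROLES = {"learner", "worker"}


def check_task(task: dict) -> list:
    """Return a list of problems found in one Task object (empty if none)."""
    problems = []
    role = task.get("role")
    if role is not None and role not in VALID_ROLES:
        problems.append("role must be learner or worker")
    # task_resource is optional, but node_count is mandatory inside it.
    resource = task.get("task_resource")
    if resource is not None:
        node_count = resource.get("node_count")
        if not isinstance(node_count, int) or node_count < 1:
            problems.append("task_resource.node_count must be an integer >= 1")
    return problems


# The flavor ID below is a placeholder, not a real flavor.
print(check_task({"role": "learner", "task_resource": {"flavor_id": "example.flavor", "node_count": 1}}))
print(check_task({"role": "server", "task_resource": {"node_count": 0}}))
```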

**Table 38** spec
Parameter	Mandatory	Type	Description
resource	No	resource object	Resource flavors of a training job. Select either flavor_id or pool_id+[flavor_id].
volumes	No	Array of volumes objects	Volumes attached to a training job.
log_export_path	No	log_export_path object	Export path of training job logs.

**Table 39** resource
Parameter	Mandatory	Type	Description
flavor_id	Yes	String	ID of the resource flavors selected for a training job.
node_count	No	Integer	Number of nodes used for creating a training job in a pool. By default, a single node is used.

**Table 40** volumes
Parameter	Mandatory	Type	Description
nfs	No	nfs object	Volumes attached in NFS mode.

**Table 41** nfs
Parameter	Mandatory	Type	Description
nfs_server_path	No	String	NFS server path.
local_path	No	String	Path for attaching volumes to the training container.
read_only	No	Boolean	Whether the volumes attached to the container in NFS mode are read-only.

**Table 42** log_export_path
Parameter	Mandatory	Type	Description
obs_url	No	String	OBS URL for storing training job logs.
host_path	No	String	Path of the host where training job logs are stored.

Response Parameters

Status code: 201

**Table 43** Response body parameters
Parameter	Type	Description
kind	String	Training job type. The default value is job. Options: job (training job), hetero_job (heterogeneous job), autosearch_job (auto search job), mrs_job (MRS job), and edge_job (edge job).
metadata	JobMetadata object	Metadata of a training job.
status	Status object	Status of a training job. You do not need to set this parameter when creating a job.
algorithm	JobAlgorithmResponse object	Algorithm for training jobs. The following formats are supported: id: Only the algorithm ID is used. subscription_id+item_version_id: The subscription ID and version ID of the algorithm are used. code_dir+boot_file: The code directory and boot file of a training job are used.
tasks	Array of TaskResponse objects	List of tasks in heterogeneous training jobs.
spec	spec object	Specifications of a training job.

**Table 44** JobMetadata
Parameter	Type	Description
id	String	Training job ID, which is generated and returned by ModelArts after the training job is created.
name	String	Name of a training job. The value must contain 1 to 64 characters consisting of only digits, letters, underscores (_), and hyphens (-).
workspace_id	String	Workspace where a job is located. The default value is 0.
description	String	Training job description. The value must contain 0 to 256 characters. The default value is NULL.
create_time	Long	Timestamp when a training job is created, in milliseconds. The value is generated and returned by ModelArts after the job is created.
user_name	String	Username for creating a training job. The username is generated and returned by ModelArts after the training job is created.
annotations	Map<String,String>	Declaration template of a training job. For heterogeneous jobs, the default value of job_template is Template RL. For other jobs, the default value is Template DL.
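
Because create_time is a millisecond timestamp, it has to be scaled before it is passed to most date libraries. A minimal sketch, using the create_time value from the example response in this section; the function name is hypothetical, and the timestamp is assumed to count from the Unix epoch.

```python
from datetime import datetime, timedelta, timezone


def create_time_to_utc(create_time_ms: int) -> datetime:
    """Convert a millisecond timestamp (assumed Unix epoch) to a UTC datetime."""
    # Split into whole seconds and milliseconds to avoid float rounding.
    seconds, millis = divmod(create_time_ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


print(create_time_to_utc(1637045545982).isoformat())
# prints 2021-11-16T06:52:25.982000+00:00
```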

**Table 45** Status
Parameter	Type	Description
phase	String	Level-1 status of a training job. The value is stable. Options: Creating, Pending, Running, Failed, Completed, Terminating, Terminated, and Abnormal.
secondary_phase	String	Level-2 status of a training job. The value is unstable. Options: Creating, Queuing, Running, Failed, Completed, Terminating, Terminated, CreateFailed, TerminatedFailed, Unknown, and Lost.
duration	Long	Running duration of a training job, in milliseconds.
node_count_metrics	Array<Array<Integer>>	Node count changes during the training job running period.
tasks	Array of strings	Tasks of a training job.
start_time	String	Start time of a training job. The value is in timestamp format.
task_statuses	Array of task_statuses objects	Status of a training job task.
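
A client that waits for a job to finish would typically branch on phase, because Table 45 describes the level-1 status as stable and the level-2 status as unstable. The sketch below is one way to do that. Which phases count as final is an assumption made here (Completed, Failed, Terminated, and Abnormal); the table lists the values but does not classify them.

```python
# Assumption: these level-1 phases mean the job will not change state again.
TERMINAL_PHASES = frozenset({"Completed", "Failed", "Terminated", "Abnormal"})


def is_finished(status: dict) -> bool:
    """Return True if the Status object reports a phase assumed to be final."""
    return status.get("phase") in TERMINAL_PHASES


print(is_finished({"phase": "Creating", "secondary_phase": "Creating"}))
print(is_finished({"phase": "Completed", "secondary_phase": "Completed"}))
```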

**Table 46** task_statuses
Parameter	Type	Description
task	String	Name of a training job task.
exit_code	Integer	Exit code of a training job task.
message	String	Error message of a training job task.

**Table 47** JobAlgorithmResponse
Parameter	Type	Description
id	String	Algorithm ID.
name	String	Algorithm name.
subscription_id	String	Subscription ID of the subscription algorithm. This parameter must be used together with item_version_id.
item_version_id	String	Version ID of the subscription algorithm. This parameter must be used together with subscription_id.
code_dir	String	Code directory of a training job, for example, /usr/app/. This parameter must be used together with boot_file. If id or subscription_id+item_version_id is set, leave it blank.
boot_file	String	Boot file of a training job, which needs to be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used together with code_dir. If id or subscription_id+item_version_id is set, leave it blank.
autosearch_config_path	String	YAML configuration path of auto search jobs. An OBS URL is required.
autosearch_framework_path	String	Framework code directory of auto search jobs. An OBS URL is required.
command	String	Boot command used to start the container of the custom image used by a training job. You can set this parameter to code_dir.
parameters	Array of Parameter objects	Running parameters of a training job.
policies	policies object	Policies supported by jobs.
inputs	Array of Input objects	Input of a training job.
outputs	Array of Output objects	Output of a training job.
engine	engine object	Engine of a training job. Leave this parameter blank if the job is created using id of the algorithm management API or subscription_id+item_version_id of the subscription algorithm API.
environments	Array of Map<String,String> objects	Environment variables of a training job. The format is key: value. Leave this parameter blank.

**Table 48** Parameter
Parameter	Type	Description
name	String	Parameter name.
value	String	Parameter value.
description	String	Parameter description.
constraint	constraint object	Parameter constraint.
i18n_description	i18n_description object	Internationalization description.

**Table 49** constraint
Parameter	Type	Description
type	String	Parameter type.
editable	Boolean	Whether the parameter is editable.
required	Boolean	Whether the parameter is mandatory.
sensitive	Boolean	Whether the parameter is sensitive.
valid_type	String	Valid type.
valid_range	Array of strings	Valid range.

**Table 50** i18n_description
Parameter	Type	Description
language	String	Internationalization language.
description	String	Description.

**Table 51** policies
Parameter	Type	Description
auto_search	auto_search object	Hyperparameter search configuration.

**Table 52** auto_search
Parameter	Type	Description
skip_search_params	String	Hyperparameters to skip during the search.
reward_attrs	Array of reward_attrs objects	List of search metrics.
search_params	Array of search_params objects	Search parameters.
algo_configs	Array of algo_configs objects	Search algorithm configurations.

**Table 53** reward_attrs
Parameter	Type	Description
name	String	Metric name.
mode	String	Search direction. max: A larger metric value indicates better performance. min: A smaller metric value indicates better performance.
regex	String	Regular expression of a metric.

**Table 54** search_params
Parameter	Type	Description
name	String	Hyperparameter name.
param_type	String	Parameter type. continuous: The parameter is a continuous value. discrete: The parameter is a discrete value.
lower_bound	String	Lower bound of the hyperparameter.
upper_bound	String	Upper bound of the hyperparameter.
discrete_points_num	String	Number of discrete points of a continuous hyperparameter.
discrete_values	Array of strings	List of discrete hyperparameter values.

**Table 55** algo_configs
Parameter	Type	Description
name	String	Name of the search algorithm.
params	Array of AutoSearchAlgoConfigParameter objects	Search algorithm parameters.

**Table 56** AutoSearchAlgoConfigParameter
Parameter	Type	Description
key	String	Parameter key.
value	String	Parameter value.
type	String	Parameter type.

**Table 57** Input
Parameter	Type	Description
name	String	Name of the data input channel.
description	String	Description of the data input channel.
local_dir	String	Local directory of the container to which the data input channel is mapped.
remote	InputDataInfo object	Data input. Options: dataset: Dataset as the data input obs: OBS path as the data input
remote_constraint	Array of remote_constraint objects	Data input constraint.

**Table 58** InputDataInfo
Parameter	Type	Description
dataset	dataset object	Dataset as the data input.
obs	obs object	OBS location in which the input and output data is stored.

**Table 59** dataset
Parameter	Type	Description
id	String	Dataset ID of a training job.
version_id	String	Dataset version ID of a training job.
obs_url	String	OBS URL of the dataset required by a training job. ModelArts automatically parses and generates the URL based on the dataset and dataset version IDs. For example, /usr/data/.

**Table 60** obs
Parameter	Type	Description
obs_url	String	OBS URL of the dataset required by a training job. For example, /usr/data/.

**Table 61** remote_constraint
Parameter	Type	Description
data_type	String	Data input type, including the data storage location and dataset.
attributes	String	Attributes if a dataset is used as the data input. Options: data_format: Data format data_segmentation: Data segmentation dataset_type: Labeling type

**Table 62** Output
Parameter	Type	Description
name	String	Name of the data output channel.
description	String	Description of the data output channel.
local_dir	String	Local directory of the container to which the data output channel is mapped.
remote	remote object	Description of the actual data output.

**Table 63** remote
Parameter	Type	Description
obs	obs object	OBS to which data is actually exported.

**Table 64** obs
Parameter	Type	Description
obs_url	String	OBS URL to which data is actually exported.

**Table 65** engine
Parameter	Type	Description
engine_id	String	ID of the engine selected for a training job. Specify the engine using engine_id, engine_name + engine_version, or image_url.
engine_name	String	Name of the engine selected for a training job. If engine_id is set, leave this parameter blank.
engine_version	String	Name of the engine version selected for a training job. If engine_id is set, leave this parameter blank.
image_url	String	Custom image URL selected for a training job.

**Table 66** TaskResponse
Parameter	Type	Description
role	String	Role of a heterogeneous training job. Options: learner: supports GPUs or CPUs. worker: supports CPUs.
algorithm	algorithm object	Algorithm management and configuration.
task_resource	FlavorResponse object	Flavors of a training job or an algorithm.

**Table 67** algorithm
Parameter	Type	Description
code_dir	String	Absolute path of the directory where the algorithm boot file is stored.
boot_file	String	Absolute path of the algorithm boot file.
inputs	inputs object	Algorithm input channel.
outputs	outputs object	Algorithm output channel.
engine	engine object	Engine on which a heterogeneous job depends.

**Table 68** inputs
Parameter	Type	Description
name	String	Name of the data input channel.
local_dir	String	Local path of the container to which the data input and output channels are mapped.
remote	remote object	Actual data input. Heterogeneous jobs support only OBS.

**Table 69** remote
Parameter	Type	Description
obs	obs object	OBS location in which the input and output data is stored.

**Table 70** obs
Parameter	Type	Description
obs_url	String	OBS URL of the dataset required by a training job. For example, /usr/data/.

**Table 71** outputs
Parameter	Type	Description
name	String	Name of the data output channel.
local_dir	String	Local directory of the container to which the data output channel is mapped.
remote	remote object	Description of the actual data output.
mode	String	Data transmission mode. The default value is upload_periodically.
period	String	Data transmission period. The default value is 30s.

**Table 72** remote
Parameter	Type	Description
obs	obs object	OBS to which data is actually exported.

**Table 73** obs
Parameter	Type	Description
obs_url	String	OBS URL to which data is actually exported.

**Table 74** engine
Parameter	Type	Description
engine_id	String	Engine ID of a heterogeneous job, for example, caffe-1.0.0-python2.7.
engine_name	String	Engine name of a heterogeneous job, for example, Caffe.
engine_version	String	Engine version of a heterogeneous job.
v1_compatible	Boolean	Whether the v1 compatibility mode is used.
run_user	String	UID of the user that the engine runs as by default.
image_url	String	Custom image URL selected by an algorithm.

**Table 75** FlavorResponse
Parameter	Type	Description
flavor_id	String	ID of the resource flavor.
flavor_name	String	Name of the resource flavor.
max_num	Integer	Maximum number of nodes in a resource flavor.
flavor_type	String	Resource flavor type. Options: CPU GPU Ascend
billing	billing object	Billing information of a resource flavor.
flavor_info	flavor_info object	Resource flavor details.
attributes	Map<String,String>	Other specification attributes.

**Table 76** billing
Parameter	Type	Description
code	String	Billing code.
unit_num	Integer	Number of billing units.

**Table 77** flavor_info
Parameter	Type	Description
max_num	Integer	Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported.
cpu	cpu object	CPU specifications.
gpu	gpu object	GPU specifications.
npu	npu object	Ascend flavors.
memory	memory object	Memory information.

**Table 78** cpu
Parameter	Type	Description
arch	String	CPU architecture.
core_num	Integer	Number of cores.

**Table 79** gpu
Parameter	Type	Description
unit_num	Integer	Number of GPUs.
product_name	String	Product name.
memory	String	Memory.

**Table 80** npu
Parameter	Type	Description
unit_num	String	Number of NPUs.
product_name	String	Product name.
memory	String	Memory.

**Table 81** memory
Parameter	Type	Description
size	Integer	Memory size.
unit	String	Unit of the memory size.

**Table 82** spec
Parameter	Type	Description
resource	Resource object	Resource flavors of a training job. Select either flavor_id or pool_id+[flavor_id].
volumes	Array of volumes objects	Volumes attached to a training job.
log_export_path	log_export_path object	Export path of training job logs.

**Table 83** Resource
Parameter	Type	Description
policy	String	Resource flavor of a training job. Options: regular
flavor_id	String	Resource flavor ID of a training job.
flavor_name	String	Read-only flavor name returned by ModelArts when flavor_id is used.
node_count	Integer	Number of resource replicas selected for a training job. Minimum: 1
pool_id	String	Resource pool ID selected for a training job.
flavor_detail	flavor_detail object	Flavors of a training job or an algorithm.

**Table 84** flavor_detail
Parameter	Type	Description
flavor_type	String	Resource flavor type. Options: CPU GPU Ascend
billing	billing object	Billing information of a resource flavor.
flavor_info	flavor_info object	Resource flavor details.

**Table 85** billing
Parameter	Type	Description
code	String	Billing code.
unit_num	Integer	Number of billing units.

**Table 86** flavor_info
Parameter	Type	Description
max_num	Integer	Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported.
cpu	cpu object	CPU specifications.
gpu	gpu object	GPU specifications.
npu	npu object	Ascend flavors.
memory	memory object	Memory information.
disk	disk object	Disk information.

**Table 87** cpu
Parameter	Type	Description
arch	String	CPU architecture.
core_num	Integer	Number of cores.

**Table 88** gpu
Parameter	Type	Description
unit_num	Integer	Number of GPUs.
product_name	String	Product name.
memory	String	Memory.

**Table 89** npu
Parameter	Type	Description
unit_num	String	Number of NPUs.
product_name	String	Product name.
memory	String	Memory.

**Table 90** memory
Parameter	Type	Description
size	Integer	Memory size.
unit	String	Unit of the memory size.

**Table 91** disk
Parameter	Type	Description
size	String	Disk size.
unit	String	Unit of the disk size. Generally, the value is GB.

**Table 92** volumes
Parameter	Type	Description
nfs	nfs object	Volumes attached in NFS mode.

**Table 93** nfs
Parameter	Type	Description
nfs_server_path	String	NFS server path.
local_path	String	Path for attaching volumes to the training container.
read_only	Boolean	Whether the volumes attached to the container in NFS mode are read-only.

**Table 94** log_export_path
Parameter	Type	Description
obs_url	String	OBS URL for storing training job logs.
host_path	String	Path of the host where training job logs are stored.

Example Requests

The following shows how to create a training job named TestModelArtsJob with the description "This is a ModelArts job". The job depends on the algorithm with ID 3f5d6706-7b67-408d-8ba0-ec08048c45ed, specifies no input or output, and uses a free GPU flavor.

POST    https://endpoint/v2/{project_id}/training-jobs

{
  "kind" : "job",
  "metadata" : {
    "name" : "TestModelArtsJob",
    "description" : "This is a ModelArts job"
  },
  "algorithm" : {
    "id" : "3f5d6706-7b67-408d-8ba0-ec08048c45ed",
    "parameters" : [ {
      "name" : "input_dir",
      "value" : "obs://xxx/test/moxingtest-dir/"
    }, {
      "name" : "input_file",
      "value" : "obs://xxx/test/moxingtest/"
    }, {
      "name" : "large_file_method",
      "value" : "1"
    } ],
    "inputs" : [ ],
    "outputs" : [ ],
    "policies" : {
      "auto_search" : null
    },
    "environments" : { }
  },
  "spec" : {
    "resource" : {
      "policy" : "regular",
      "flavor_id" : "modelarts.p3.large.public.free",
      "node_count" : 1
    },
    "log_export_path" : {
      "obs_url" : ""
    }
  }
}
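
The request above could be assembled from Python with the standard library alone. The sketch below only builds the request object and does not send it. The endpoint, project ID, and token values are placeholders, the function name is hypothetical, and the X-Auth-Token header is an assumption about the authentication scheme, which this section does not describe.

```python
import json
import urllib.request


def build_create_job_request(endpoint: str, project_id: str, token: str, body: dict):
    """Build (but do not send) a POST /v2/{project_id}/training-jobs request."""
    url = f"https://{endpoint}/v2/{project_id}/training-jobs"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumption: token-based authentication
        },
        method="POST",
    )


body = {
    "kind": "job",
    "metadata": {"name": "TestModelArtsJob", "description": "This is a ModelArts job"},
    "algorithm": {"id": "3f5d6706-7b67-408d-8ba0-ec08048c45ed"},
    "spec": {
        "resource": {
            "policy": "regular",
            "flavor_id": "modelarts.p3.large.public.free",
            "node_count": 1,
        }
    },
}

# Placeholder endpoint, project ID, and token.
request = build_create_job_request("modelarts.example.com", "my-project-id", "my-token", body)
print(request.get_method(), request.full_url)
# Passing the request to urllib.request.urlopen would send it; a successful
# call is documented to return status code 201.
```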

The following shows how to use a custom image to create a training job named TestModelArtsJob2 with the description "This is a ModelArts job2". The job runs in a dedicated resource pool and has an NFS volume attached.

POST    https://endpoint/v2/{project_id}/training-jobs

{
  "kind" : "job",
  "metadata" : {
    "name" : "TestModelArtsJob2",
    "description" : "This is a ModelArts job2"
  },
  "algorithm" : {
    "engine" : {
      "image_url" : "hwstaff_z00424192/fastseq:1.2"
    },
    "command" : "cd /home/ma-user/ddp_demo && sh run_ddp.sh",
    "parameters" : [ ],
    "inputs" : [ ],
    "outputs" : [ ],
    "policies" : {
      "auto_search" : null
    },
    "environments" : {
      "NCCL_DEBUG" : "INFO",
      "NCCL_IB_DISABLE" : "0"
    }
  },
  "spec" : {
    "resource" : {
      "policy" : "regular",
      "flavor_id" : "modelarts.pool.visual.xlarge",
      "node_count" : 1,
      "pool_id" : "poolfaf38d76"
    },
    "log_export_path" : {
      "obs_url" : "/xxx/limou/ddp-demo-log/"
    },
    "volumes" : [ {
      "nfs" : {
        "nfs_server_path" : "192.168.0.82:/",
        "local_path" : "/home/ma-user/nfs/",
        "read_only" : false
      }
    } ]
  }
}
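
The spec portion of the request above can be assembled programmatically. The following Python sketch shows one possible way to build the nfs volume entry (Table 93) and a spec that targets a dedicated resource pool. The helper names nfs_volume and dedicated_pool_spec are hypothetical, and the absolute-path check on local_path is a local sanity check assumed for this sketch, not a rule stated by the API.

```python
def nfs_volume(nfs_server_path, local_path, read_only=False):
    """Return one entry of the "volumes" array for an NFS mount."""
    # Assumption: the mount point inside the container should be an absolute path.
    if not local_path.startswith("/"):
        raise ValueError("local_path should be an absolute container path")
    return {
        "nfs": {
            "nfs_server_path": nfs_server_path,
            "local_path": local_path,
            "read_only": read_only,
        }
    }


def dedicated_pool_spec(flavor_id, pool_id, node_count=1, volumes=None, log_obs_url=None):
    """Return a "spec" object that runs the job in a dedicated resource pool."""
    spec = {
        "resource": {
            "policy": "regular",
            "flavor_id": flavor_id,
            "node_count": node_count,
            "pool_id": pool_id,
        }
    }
    if log_obs_url is not None:
        spec["log_export_path"] = {"obs_url": log_obs_url}
    if volumes:
        spec["volumes"] = list(volumes)
    return spec


# Values taken from the second example request.
spec = dedicated_pool_spec(
    flavor_id="modelarts.pool.visual.xlarge",
    pool_id="poolfaf38d76",
    volumes=[nfs_volume("192.168.0.82:/", "/home/ma-user/nfs/")],
    log_obs_url="/xxx/limou/ddp-demo-log/",
)
```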

Example Responses

Status code: 201

{
  "kind" : "job",
  "metadata" : {
    "id" : "425b7087-83de-49ed-9e40-5bb642be956f",
    "name" : "TestModelArtsJob",
    "description" : "This is a ModelArts job",
    "create_time" : 1637045545982,
    "workspace_id" : "0",
    "ai_project" : "default-ai-project",
    "user_name" : ""
  },
  "status" : {
    "phase" : "Creating",
    "secondary_phase" : "Creating",
    "duration" : 0,
    "start_time" : 0,
    "node_count_metrics" : null,
    "tasks" : [ "worker-0", "server-0" ]
  },
  "algorithm" : {
    "id" : "3f5d6706-7b67-408d-8ba0-ec08048c45ed",
    "name" : "ttt-obs-gpu",
    "code_dir" : "/xxx/test/moxingtest-code/",
    "boot_file" : "/xxx/test/moxingtest-code/test_obs_gpu.py",
    "parameters" : [ {
      "name" : "input_dir",
      "description" : "",
      "i18n_description" : null,
      "value" : "obs://xxx/test/moxingtest-dir/",
      "constraint" : {
        "type" : "String",
        "editable" : true,
        "required" : true,
        "sensitive" : false,
        "valid_type" : "None",
        "valid_range" : [ ]
      }
    }, {
      "name" : "input_file",
      "description" : "",
      "i18n_description" : null,
      "value" : "obs://xxx/test/moxingtest/",
      "constraint" : {
        "type" : "String",
        "editable" : true,
        "required" : true,
        "sensitive" : false,
        "valid_type" : "None",
        "valid_range" : [ ]
      }
    }, {
      "name" : "large_file_method",
      "description" : "",
      "i18n_description" : null,
      "value" : "1",
      "constraint" : {
        "type" : "Integer",
        "editable" : true,
        "required" : true,
        "sensitive" : false,
        "valid_type" : "None",
        "valid_range" : [ ]
      }
    } ],
    "parameters_customization" : false,
    "engine" : {
      "engine_id" : "horovod-cp36-tf-1.16.2",
      "engine_name" : "Horovod",
      "engine_version" : "0.16.2-TF-1.13.1-python3.6",
      "usage" : "training",
      "support_groups" : "public,roma",
      "v1_compatible" : true,
      "run_user" : ""
    },
    "policies" : { }
  },
  "spec" : {
    "resource" : {
      "policy" : "regular",
      "turbo_range" : [ 1, 2 ],
      "flavor_id" : "modelarts.p3.large.public.free",
      "flavor_name" : "Computing GPU(V100) instance",
      "node_count" : 1,
      "flavor_detail" : {
        "flavor_type" : "GPU",
        "billing" : {
          "code" : "modelarts.vm.gpu.free",
          "unit_num" : 1
        },
        "attributes" : {
          "is_free" : "true",
          "max_free_job_count" : "10"
        },
        "flavor_info" : {
          "cpu" : {
            "arch" : "x86",
            "core_num" : 8
          },
          "gpu" : {
            "unit_num" : 1,
            "product_name" : "NVIDIA-V100",
            "memory" : "32GB"
          },
          "memory" : {
            "size" : 64,
            "unit" : "GB"
          }
        }
      }
    },
    "log_export_path" : { },
    "is_hosted_log" : true
  }
}
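
A client usually needs only a few fields from this response, such as the generated job ID and the initial phase. The following Python sketch shows one way to extract them, treating status code 201 as the only success case. The function name summarize_created_job is hypothetical, and the sample string is a trimmed copy of the response above rather than real service output.

```python
import json


def summarize_created_job(status_code, response_text):
    """Extract key fields from a create-training-job response."""
    if status_code != 201:
        raise RuntimeError(f"unexpected status code {status_code}")
    job = json.loads(response_text)
    metadata = job.get("metadata", {})
    status = job.get("status", {})
    resource = job.get("spec", {}).get("resource", {})
    return {
        "id": metadata.get("id"),
        "name": metadata.get("name"),
        "phase": status.get("phase"),
        "flavor_id": resource.get("flavor_id"),
        "node_count": resource.get("node_count"),
    }


# Trimmed copy of the example response, used here as sample input.
sample_response = json.dumps({
    "kind": "job",
    "metadata": {
        "id": "425b7087-83de-49ed-9e40-5bb642be956f",
        "name": "TestModelArtsJob",
    },
    "status": {"phase": "Creating", "secondary_phase": "Creating"},
    "spec": {
        "resource": {
            "flavor_id": "modelarts.p3.large.public.free",
            "node_count": 1,
        }
    },
})

summary = summarize_created_job(201, sample_response)
print(summary["id"], summary["phase"])
```

The returned ID is the value to keep for any later calls that refer to this job.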

Status Codes

Status Code	Description
201	OK. The training job has been created.

Error Codes

See Error Codes.

Parent topic: Training Job Management
