Deploying as a Real-Time Service
After preparing an AI application, you can deploy it as a real-time service and call the service for prediction.
Constraints
Each user can deploy a maximum of 20 real-time services.
Prerequisites
- Data has been prepared. Specifically, you have created an AI application in the Normal state in ModelArts.
Note
Real-time services deployed using the public resource pool also occupy quota resources when the services are Abnormal or Stopped. If the quota is insufficient and no more services can be deployed, delete some abnormal services to release resources.
Quota calculation:
- If a dedicated resource pool is used to deploy real-time services, deployment does not decrease the quota. The quota is increased or decreased only when the dedicated resource pool itself is created, modified, or deleted.
- If a shared resource pool is used to deploy a real-time service, the quota is increased or decreased when you create instances, change the number of instances, or delete instances.
Metering calculation:
- If a real-time service is deployed using a dedicated pool, only the data of the dedicated pool to which the service belongs is metered.
- When a shared pool is used to deploy a real-time service, the specifications used by the service will be metered.
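The quota rules above can be made concrete with a small sketch. The snippet below is illustrative only; the service list, field names, and statuses are hypothetical and do not reflect a ModelArts API. It shows that public-pool instances count against the quota even when a service is Abnormal or Stopped, while dedicated-pool services do not.

```python
# Illustrative only: reasoning about public-pool quota consumption.
# The data layout and field names here are assumptions for this example.
SERVICES = [
    {"name": "svc-a", "pool": "public",    "instances": 2, "status": "Running"},
    {"name": "svc-b", "pool": "public",    "instances": 1, "status": "Stopped"},
    {"name": "svc-c", "pool": "dedicated", "instances": 4, "status": "Running"},
]

def public_pool_quota_used(services) -> int:
    """Instances in public pools count against the quota even when the
    service is Abnormal or Stopped; dedicated-pool services do not."""
    return sum(s["instances"] for s in services if s["pool"] == "public")

print(public_pool_quota_used(SERVICES))  # 3: svc-b still occupies quota
```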
Procedure
- Log in to the ModelArts management console. In the left navigation pane, choose Service Deployment > Real-Time Services. By default, the system switches to the Real-Time Services page.
- In the real-time service list, click Deploy in the upper left corner. The Deploy page is displayed.
- Set parameters for a real-time service.
- Set basic information about model deployment. For details about the parameters, see Table 1.
- Enter key information including the resource pool and AI application configurations. For details, see Table 2.
Table 2 Parameters (columns: Parameter, Sub-Parameter, Description)
Resource Pool
Public resource pools
Instances in the public resource pool can be of the CPU or GPU type.
Dedicated resource pools
Select a specification from the dedicated resource pool specifications.
NOTE:
- The data of old-version dedicated resource pools will be gradually migrated to the new-version dedicated resource pools.
- For new users and the existing users who have migrated data from old-version dedicated resource pools to new ones, there is only one entry to new-version dedicated resource pools on the ModelArts management console.
- For the existing users who have not migrated data from old-version dedicated resource pools to new ones, there are two entries to dedicated resource pools on the ModelArts management console, where the entry marked with New is to the new version.
For more details about the new-version dedicated resource pools, see Comprehensive Upgrades to ModelArts Resource Pool Management Functions.
Multi-Pool Load Balancing
N/A
After this function is enabled, the service will be deployed in two dedicated resource pools, and service traffic will be evenly distributed among the pools through a load balancer. This minimizes the impact on the service after one resource pool fails, improving service reliability.
After this function is enabled, the number of compute nodes must be an even number of at least 2.
NOTE:
- Multi-pool load balancing is supported only when a dedicated resource pool is selected.
- Multi-pool load balancing requires that the compute nodes in the two resource pools have the same specifications.
- Both resource pools must be of the same version, that is, both old-version or both new-version pools.
AI Application and Configuration
AI Application Source
Select My AI Applications based on your requirements.
AI Application and Version
Select the AI application and version that are in the Normal state.
Traffic Ratio (%)
Set the traffic proportion of the current instance node. Service calling requests are allocated to the current version based on this proportion.
If you deploy only one version of an AI application, set this parameter to 100%. If you select multiple versions for a gray launch, ensure that the traffic ratios of all versions sum to 100%, as illustrated in the traffic-split sketch after Table 2.
Specifications
Select available specifications based on the list displayed on the console. The specifications in gray cannot be used in the current environment.
If specifications in the public resource pools are unavailable, no public resource pool is available in the current environment. In this case, use a dedicated resource pool or contact the administrator to create a public resource pool.
NOTE: When the selected flavor is used to deploy the service, necessary system overhead is incurred, so the resources actually occupied by the service are slightly more than the selected flavor.
Compute Nodes
Set the number of instances for the current AI application version. If you set this parameter to 1, the standalone computing mode is used; if you set it to a value greater than 1, the distributed computing mode is used. Select a computing mode based on your actual requirements.
Environment Variable
Set environment variables and inject them to the pod. To ensure data security, do not enter sensitive information in environment variables.
Timeout
Timeout of a single model, including both the deployment and startup time. The default value is 20 minutes. The value must range from 3 to 120 minutes.
Mount Storage
This function will mount a storage volume to compute nodes (compute instances) as a local directory when the service is running. It is recommended when the model or input data is large.
OBS parallel file system
- Source Path: Select the storage path of the parallel file system. A cross-region OBS parallel file system cannot be selected.
- Mount Path: Enter the container mount path, for example, /obs-mount/.
- Select a new directory. If an existing directory is selected, the existing files in it will be overwritten.
- It is a good practice to mount the container to an empty directory. If the directory is not empty, ensure that it contains no files that affect container startup. Otherwise, such files will be replaced, the container will fail to start, and the workload will fail to be created.
- The mount path must start with a slash (/) and can contain a maximum of 1,024 characters, including letters, digits, and the following special characters: _ - .
NOTE:
- A file system can be mounted only once and to only one path. Each mount path must be unique. A maximum of 10 disks can be mounted to an OBS bucket.
- If you mount multiple file systems, do not use identical or nested paths, for example, /obs-mount/ and /obs-mount/tmp/. See the path-check sketch after Table 2.
- When an OBS parallel file system is mounted, a policy is configured for the bucket. Do not delete the policy.
Add AI Application Version and Configuration
If the selected AI application has multiple versions, you can add multiple versions and configure a traffic ratio. You can use gray launch to smoothly upgrade the AI application version.
NOTE: Free compute specifications do not support the gray launch of multiple versions.
WebSocket
N/A
Whether to deploy a real-time service as a WebSocket service. For an example of calling a WebSocket service, see the client sketch at the end of this section.
NOTE:
- This function is supported only if the AI application is WebSocket-compliant and comes from a container image.
- After this function is enabled, Traffic Limit and Data Collection cannot be set.
- This parameter cannot be changed after the service is deployed.
Data Collection
N/A
This function is disabled by default. When enabled, it collects and stores data generated when a real-time service is called based on configured rules.
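As a rough illustration of the Traffic Ratio behavior described in Table 2, the following sketch splits simulated requests between two versions with an 80/20 ratio. The version names and weights are assumptions for the example, not values from ModelArts.

```python
import random

# Hypothetical gray-launch split across two versions of one AI application.
# As required by the Traffic Ratio parameter, the ratios must sum to 100.
VERSION_WEIGHTS = {"v1": 80, "v2": 20}

assert sum(VERSION_WEIGHTS.values()) == 100, "traffic ratios must sum to 100%"

def pick_version() -> str:
    """Route one incoming request to a version according to its traffic ratio."""
    versions = list(VERSION_WEIGHTS)
    weights = list(VERSION_WEIGHTS.values())
    return random.choices(versions, weights=weights, k=1)[0]

# Simulate 10,000 calls; the observed split should be roughly 80/20.
calls = [pick_version() for _ in range(10_000)]
print({v: calls.count(v) for v in VERSION_WEIGHTS})
```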
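The mount path rules in Table 2 can also be checked programmatically. The sketch below encodes them as simple validations; the function names are ours, not a ModelArts API, and assume the character and nesting rules as stated in the table.

```python
import re

# Allowed mount paths: start with "/", at most 1,024 characters, and contain
# only letters, digits, and the special characters _ - . (plus "/").
MOUNT_PATH_RE = re.compile(r"^/[A-Za-z0-9_\-./]*$")

def validate_mount_path(path: str) -> None:
    if not path.startswith("/"):
        raise ValueError("mount path must start with a slash (/)")
    if len(path) > 1024:
        raise ValueError("mount path must not exceed 1,024 characters")
    if not MOUNT_PATH_RE.match(path):
        raise ValueError("only letters, digits, and _ - . are allowed")

def check_no_overlap(paths: list[str]) -> None:
    """Reject mounts where one path is nested inside another,
    e.g. /obs-mount/ and /obs-mount/tmp/."""
    normalized = [p.rstrip("/") + "/" for p in paths]
    for i, a in enumerate(normalized):
        for b in normalized[i + 1:]:
            if a.startswith(b) or b.startswith(a):
                raise ValueError(f"overlapping mount paths: {a} and {b}")

validate_mount_path("/obs-mount/")
check_no_overlap(["/obs-mount/", "/data/"])           # OK
# check_no_overlap(["/obs-mount/", "/obs-mount/tmp/"])  # would raise
```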
- After confirming the entered information, complete service deployment as prompted. Service deployment generally takes some time, from several minutes to tens of minutes, depending on the amount of data and the resources you selected.
In the real-time service list, after the status of the newly deployed service changes from Deploying to Running, the service is deployed successfully.
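Once a WebSocket service is running, it can be called with any WebSocket client. Below is a minimal sketch using the third-party Python websockets package; the endpoint URL and payload format are placeholders, so copy the real values from the details page of your deployed service.

```python
import asyncio
import websockets  # pip install websockets

# Placeholder endpoint; copy the real WebSocket URL from the details page
# of your deployed service. Depending on your configuration, authentication
# (for example, a token header) may also be required.
SERVICE_URL = "wss://example.com/v1/infer"

async def call_service(payload: str) -> str:
    """Send one request over the WebSocket and return the first reply."""
    async with websockets.connect(SERVICE_URL) as ws:
        await ws.send(payload)
        return await ws.recv()

if __name__ == "__main__":
    print(asyncio.run(call_service('{"data": "..."}')))
```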