Symptom
If a training job fails due to out of memory (OOM), the possible symptoms are as follows:
- Error code 137 is returned.
- The log file contains error information with the keyword "killed".
- Error message "RuntimeError: CUDA out of memory." is displayed in logs.
- Error message "Dst tensor is not initialized" is displayed in TensorFlow logs.
Possible Causes
The possible causes are as follows:
- GPU memory is insufficient.
- OOM occurred on certain nodes. This issue is typically caused by a node fault.
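To confirm that GPU memory is the bottleneck, memory usage can be inspected directly from the training code. The following is a minimal sketch, assuming a PyTorch job running on a single CUDA device; the device index is a placeholder and should match the actual job configuration.
import torch

# Assumes a PyTorch job on a single CUDA device (index 0).
device = torch.device("cuda:0")

total = torch.cuda.get_device_properties(device).total_memory
allocated = torch.cuda.memory_allocated(device)   # memory held by live tensors
reserved = torch.cuda.memory_reserved(device)     # memory held by the caching allocator

print(f"GPU memory: {allocated / 1024**2:.0f} MiB allocated, "
      f"{reserved / 1024**2:.0f} MiB reserved, "
      f"{total / 1024**2:.0f} MiB total")
If the allocated or reserved value approaches the total during training, the job is likely to hit OOM rather than a node fault.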
Solution
- Modify hyperparameter settings to reduce GPU memory usage:
- Modify network parameters, such as batch_size, hide_layer, and cell_nums (a sketch showing a reduced batch_size is provided after this list).
- Release unnecessary tensors. For example, in PyTorch:
# Drop the reference to a tensor that is no longer needed,
# then return cached GPU memory to the device.
del tmp_tensor
torch.cuda.empty_cache()
- Use local PyCharm to remotely access the notebook environment for debugging.
- If the fault persists, submit a service ticket to locate the fault or, if necessary, isolate the affected node.
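As referenced in the first solution above, lowering batch_size is usually the quickest way to reduce GPU memory usage. The sketch below illustrates this together with explicit tensor release in a PyTorch training loop; the dataset, model, learning rate, and batch size are placeholders and should be replaced with the actual training code.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset and model; replace with the actual training code.
dataset = TensorDataset(torch.randn(4096, 128), torch.randint(0, 10, (4096,)))
model = torch.nn.Linear(128, 10).cuda()

# Reducing batch_size lowers the activation memory required per step.
loader = DataLoader(dataset, batch_size=32, shuffle=True)  # e.g. reduced from 64

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for inputs, labels in loader:
    inputs, labels = inputs.cuda(), labels.cuda()
    loss = loss_fn(model(inputs), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Release references to large intermediates before the next step,
    # then return cached memory to the GPU.
    del inputs, labels, loss
    torch.cuda.empty_cache()
Note that calling torch.cuda.empty_cache() on every step adds overhead; it is shown here only to illustrate the release pattern.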
Summary and Suggestions
Before creating a training job, debug the training code in the ModelArts development environment to eliminate as many code migration errors as possible.