Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com> Co-authored-by: Lai, Weijian <laiweijian4@huawei.com> Co-committed-by: Lai, Weijian <laiweijian4@huawei.com>
<a name="EN-US_TOPIC_0000001910008640"></a><a name="EN-US_TOPIC_0000001910008640"></a>
<h1 class="topictitle1">Error Message "No such file or directory" Displayed in Training Job Logs</h1>
<div id="body8662426"><div class="section" id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_section139705195459"><h4 class="sectiontitle">Symptom</h4><p id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_p474015238454">When a training job fails, the error message "No such file or directory" is displayed in the logs.</p>
<p id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_p171101331456">If a training input path is unreachable, the error message "No such file or directory" is displayed.</p>
<p id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_p241021033">If a training boot file is unavailable, the error message "No such file or directory" is displayed.</p>
<div class="fignone" id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_fig18575194854619"><span class="figcap"><b>Figure 1 </b>Example log for an unavailable training boot file</span><br><span><img id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_image571373944620" src="figure/en-us_image_0000001909849128.png" width="497.42" height="55.70040000000001" title="Click to enlarge" class="imgResize"></span></div>
</div>
<div class="section" id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_section7150141713473"><h4 class="sectiontitle">Possible Causes</h4><ul id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_ul542062015118"><li id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_li74211720165112">The training input path is unreachable because the path is incorrect. Perform the following operations to locate the fault:<ol id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_ol6168113812417"><li id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_li1476939543"><a href="#EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_section1181277141419">Checking Whether the Affected Path Is an OBS Path</a></li><li id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_li9476173911410"><a href="#EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_section135339269">Checking Whether the Affected Path Is Available</a></li></ol>
</li><li id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_li18421320185110">The training boot file is unavailable because the path in the training job boot command is incorrect. Rectify the fault by referring to <a href="#EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_section1193319591840">Checking the File Boot Path of a Training Job Created Using a Custom Image</a>.</li><li id="EN-US_TOPIC_0000001910008640__li474135361414">Multiple processes or workers read and write the same file concurrently. If SFS is used, check whether multiple nodes write the same file at the same time. Also analyze the code to check whether multiple processes write the same file. It is good practice to prevent multiple processes or nodes from concurrently reading and writing the same file.</li></ul>
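<p>One way to keep several processes from writing the same file is to let each process write to its own temporary file and then rename it into place. The following is a minimal sketch, not a ModelArts API: the function name <strong>write_atomically</strong> is hypothetical, and the sketch assumes that renaming within one directory is atomic on the target file system. That holds for local POSIX file systems, but should be verified for SFS.</p>

```python
import os
import tempfile


def write_atomically(target_path, data):
    """Write data to a process-private temp file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(target_path))
    os.makedirs(directory, exist_ok=True)
    # mkstemp creates a uniquely named file, so concurrent writers never share it.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            tmp_file.write(data)
        # Readers see either the old file or the complete new one.
        os.replace(tmp_path, target_path)
    except BaseException:
        # Clean up the temp file if writing or renaming failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

<p>With this pattern, the last writer wins, but a reader never opens a half-written file or a file that another process has just removed.</p>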
</div>
<div class="section" id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_section1181277141419"><a name="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_section1181277141419"></a><a name="en-us_topic_0000001128983644_section1181277141419"></a><h4 class="sectiontitle">Checking Whether the Affected Path Is an OBS Path</h4><p id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_p104095559422">When you use ModelArts, store your data in an OBS bucket. However, the training code cannot use an OBS path to read data while it is running.</p>
<p id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_p1275316186346">The reason is as follows:</p>
<p id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_p1020654191714">Training performance is poor if the running container reads data directly from OBS. To prevent this issue, the system automatically downloads the training data to a local path in the running container. Therefore, an error occurs if an OBS path is used in the training code. For example, if the OBS path to the training code is <strong id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_b1311885717567">obs://bucket-A/training/</strong>, the training code is automatically downloaded to <strong id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_b88301724135720">${MA_JOB_DIR}/training/</strong>.</p>
<p id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_p4508776209">More generally, suppose the OBS path to the training code is <strong id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_b10983163815514">obs://bucket-A/XXX/{training-project}/</strong>, where <strong id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_b1577954612514">{training-project}</strong> is the name of the folder that stores the training code. During training, the system automatically downloads the contents of the OBS folder <strong id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_b2015815294538">{training-project}</strong> to the local path of the training container (<strong id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_b89851859165310">$MA_JOB_DIR/{training-project}/</strong>).</p>
<p id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_p1528175916346">If the affected path points to the training data, perform the following operations to resolve the issue (see "input and output configurations" for details):</p>
</div>
<ol id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_ol74871258113615"><li id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_li648715817366">When creating an algorithm, set the code path parameter in the input path mapping configuration. The default value is <span class="parmname" id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_parmname13329192183914"><b>data_url</b></span>.</li><li id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_li13736951103913">Add a hyperparameter with the same name (<span class="parmname" id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_parmname3409174611407"><b>data_url</b></span> by default) to the training code. Use the value of <span class="parmname" id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_parmname11105458194010"><b>data_url</b></span> as the local path from which the training data is read.</li></ol>
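<p>The second step can be sketched as follows. This is an illustrative example rather than the official ModelArts template: it assumes that the platform passes the local download directory to the boot file as a <strong>--data_url</strong> command-line argument, and the helper names <strong>parse_args</strong> and <strong>list_training_files</strong> are hypothetical.</p>

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # data_url is the default parameter name mentioned above; the platform is
    # assumed to pass the local download directory through it.
    parser.add_argument("--data_url", type=str, required=True)
    # parse_known_args ignores any extra arguments that may be appended.
    args, _unknown = parser.parse_known_args(argv)
    return args


def list_training_files(data_dir):
    """Return sorted entry names, failing early if the directory is missing."""
    if not os.path.isdir(data_dir):
        raise FileNotFoundError(f"Training data directory not found: {data_dir}")
    return sorted(os.listdir(data_dir))
```

<p>In the boot file, call <strong>parse_args()</strong> once and read the training data from <strong>args.data_url</strong> instead of hard-coding an OBS path.</p>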
<div class="section" id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_section135339269"><a name="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_section135339269"></a><a name="en-us_topic_0000001128983644_section135339269"></a><h4 class="sectiontitle">Checking Whether the Affected Path Is Available</h4><p id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_p631318167509">Code developed locally must be uploaded to the ModelArts backend. After the upload, a path to a dependency file that was set in the training code is likely to be incorrect.</p>
<p id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_p8060118">As a general solution, obtain the absolute path to a dependency file through the OS API.</p>
<p id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_p76191446165112">Example:</p>
<pre class="screen" id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_en-us_topic_0274524271_screen0831354101214">|---project_root                  # Root directory for code
     |---BootfileDirectory        # Directory where the boot file is located
          |---bootfile.py         # Boot file
     |---otherfileDirectory       # Directory where other dependency files are located
          |---otherfile.py        # Other dependency files
</pre>
<p id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_p1199920198215">To obtain the path to a dependency file (<strong id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_b1614154944719">otherfile_path</strong> in this example) in the boot file, do as follows:</p>
<pre class="screen" id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_screen216142010320">import os
current_path = os.path.dirname(os.path.realpath(__file__)) # Directory where the boot file is located
project_root = os.path.dirname(current_path) # Root directory of the project, which is the code directory set on the ModelArts training console
otherfile_path = os.path.join(project_root, "otherfileDirectory", "otherfile.py")</pre>
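<p>The same idea can be combined with an existence check, so that an incorrect path is reported with its full value before training starts. This is a minimal sketch using <strong>pathlib</strong>. The function name <strong>resolve_dependency</strong> is hypothetical, and the project root is assumed to be the parent of the boot file's directory, as in the example layout above.</p>

```python
from pathlib import Path


def resolve_dependency(boot_file, *relative_parts):
    """Build an absolute path relative to the project root and verify it exists.

    The project root is assumed to be the parent of the boot file's directory,
    matching the example layout above.
    """
    project_root = Path(boot_file).resolve().parent.parent
    candidate = project_root.joinpath(*relative_parts)
    if not candidate.exists():
        raise FileNotFoundError(f"Dependency not found: {candidate}")
    return candidate
```

<p>In the boot file, the call would be <strong>resolve_dependency(__file__, "otherfileDirectory", "otherfile.py")</strong>.</p>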
</div>
<div class="section" id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_section1193319591840"><a name="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_section1193319591840"></a><a name="en-us_topic_0000001128983644_section1193319591840"></a><h4 class="sectiontitle">Checking the File Boot Path of a Training Job Created Using a Custom Image</h4><p id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_p161514171852">Take OBS path <strong id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_b61171013579">obs://obs-bucket/training-test/demo-code</strong> as an example. The training code in this path will be automatically downloaded to <strong id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_b1011717045715">${MA_JOB_DIR}/demo-code</strong> in the training container, where <strong id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_b17117160195719">demo-code</strong> is the last-level directory of the OBS path and can be customized.</p>
<p id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_p7171414469">If you use a custom image to create a training job, the system will automatically run the image boot command after the code directory is downloaded. The boot command must comply with the following rules:</p>
<ul id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_ul12179143612"><li id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_li11179147612">If the training startup script is a .py file, <strong id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_b16310233719">train.py</strong> for example, the boot command can be <strong id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_b741923575">python ${MA_JOB_DIR}/demo-code/train.py</strong>.</li><li id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_li81710141365">If the training startup script is an .sh file, <strong id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_b14634211181020">main.sh</strong> for example, the boot command can be <strong id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_b166343114104">bash ${MA_JOB_DIR}/demo-code/main.sh</strong>, where <strong id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_b10449659161019">demo-code</strong> is the last-level directory of the OBS path and can be customized.</li></ul>
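<p>The mapping from the OBS code directory to the local boot file path can be sketched as follows. This is an illustration of the rule described above, not platform code: the function names <strong>local_code_dir</strong> and <strong>check_boot_file</strong> are hypothetical, and <strong>MA_JOB_DIR</strong> is assumed to be available as an environment variable inside the training container.</p>

```python
import os
import posixpath


def local_code_dir(obs_code_path, ma_job_dir):
    """Map an OBS code directory to its assumed local download directory."""
    # The last-level directory of the OBS path becomes the local folder name.
    last_level = posixpath.basename(obs_code_path.rstrip("/"))
    return posixpath.join(ma_job_dir, last_level)


def check_boot_file(obs_code_path, boot_file_name, ma_job_dir=None):
    """Return the expected boot file path, raising if it does not exist."""
    if ma_job_dir is None:
        # Assumption: the container exposes MA_JOB_DIR as an environment variable.
        ma_job_dir = os.environ["MA_JOB_DIR"]
    boot_path = posixpath.join(local_code_dir(obs_code_path, ma_job_dir), boot_file_name)
    if not os.path.isfile(boot_path):
        raise FileNotFoundError(f"Boot file not found: {boot_path}")
    return boot_path
```

<p>For the OBS path <strong>obs://obs-bucket/training-test/demo-code</strong> and the boot file <strong>train.py</strong>, the expected local path is <strong>${MA_JOB_DIR}/demo-code/train.py</strong>, which is the path that the boot command must reference.</p>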
</div>
<div class="section" id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_section169401039184514"><h4 class="sectiontitle">Summary and Suggestions</h4><p id="EN-US_TOPIC_0000001910008640__en-us_topic_0000001128983644_en-us_topic_0000001182115463_en-us_topic_0000001143177048_p1141444753416">Before creating a training job, debug the training code in the ModelArts development environment to eliminate as many code migration errors as possible.</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="modelarts_13_0071.html">In-Cloud Migration Adaptation Issues</a></div>
</div>
</div>
<script language="JavaScript">
<!--
image_size('.imgResize');
var msg_imageMax = "view original image";
var msg_imageClose = "close";
//--></script>