Name | Mandatory | Default | Description |
---|---|---|---|
simlarity_threshold | No | 0.9 | Similarity threshold. When the similarity between two images is greater than the threshold, one of the images is filtered out as a duplicate image. The value ranges from 0 to 1. |
do_validation | No | True | Indicates whether to validate data. The value can be True or False. True indicates that data is validated before deduplication. False indicates that data is deduplicated only. |
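
The effect of the similarity threshold can be illustrated with a minimal sketch. The operator's actual similarity measure and its choice of which duplicate to discard are not documented here, so the cosine similarity on feature vectors, the keep-the-first-seen policy, and the names `deduplicate` and `cosine_similarity` are all assumptions made for illustration only.

```python
import math
from typing import Callable, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (assumed stand-in for image similarity)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate(
    items: List[Sequence[float]],
    similarity: Callable[[Sequence[float], Sequence[float]], float],
    threshold: float = 0.9,
) -> List[Sequence[float]]:
    """Keep an item only if its similarity to every kept item is <= threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in the range 0 to 1")
    kept: List[Sequence[float]] = []
    for item in items:
        # Assumption: the first image seen is kept, later near-duplicates are dropped.
        if all(similarity(item, other) <= threshold for other in kept):
            kept.append(item)
    return kept


# Hypothetical feature vectors: the second is nearly identical to the first.
features = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
unique = deduplicate(features, cosine_similarity, threshold=0.9)
```

With the default threshold of 0.9 the second vector is treated as a duplicate of the first; raising the threshold to 1.0 keeps all three, because no similarity can exceed 1.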
The following two types of operator input are available:
The following shows the directory structure in the image classification scenario. Only single-label scenarios are supported.
input_path/
--label1/
----1.jpg
--label2/
----2.jpg
--../
The following shows the directory structure in the object detection scenario. Images in JPG, JPEG, PNG, and BMP formats are supported. XML files are standard PASCAL VOC files.
input_path/
--1.jpg
--1.xml
--2.jpg
--2.xml
...
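
The two input layouts can be enumerated with a short sketch like the one below. It is not the operator's own loader: the function names `scan_classification` and `scan_detection` are hypothetical, and applying the JPG, JPEG, PNG, and BMP filter to the classification layout is an assumption, since the supported formats are stated only for object detection.

```python
from pathlib import Path
from typing import List, Optional, Tuple

# Image formats listed for the object detection scenario.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def scan_classification(input_path) -> List[Tuple[Path, str]]:
    """Return (image, label) pairs, taking the label from the parent directory name."""
    root = Path(input_path)
    samples = []
    for label_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for image in sorted(label_dir.iterdir()):
            # Assumption: the same image formats apply to classification input.
            if image.is_file() and image.suffix.lower() in IMAGE_SUFFIXES:
                samples.append((image, label_dir.name))
    return samples


def scan_detection(input_path) -> List[Tuple[Path, Optional[Path]]]:
    """Return (image, xml) pairs from a flat directory; xml is None if unlabeled."""
    root = Path(input_path)
    samples = []
    for image in sorted(root.iterdir()):
        if image.is_file() and image.suffix.lower() in IMAGE_SUFFIXES:
            xml = image.with_suffix(".xml")  # label file shares the image's base name
            samples.append((image, xml if xml.exists() else None))
    return samples
```

Calling `scan_classification("input_path")` on the first layout would yield one pair per image with labels such as `label1` and `label2`; `scan_detection("input_path")` pairs `1.jpg` with `1.xml` when the label file is present.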
The output directory structure in the image classification scenario is as follows:
output_path/
--Data/
----class1/ # If the input data has labeling information, the information is also output. class1 indicates the labeling class.
------1.jpg
----class2/
------2.jpg
------3.jpg
--output.manifest
A manifest file example is as follows:
{
  "id": "xss",
  "source": "obs://home/fc8e2688015d4a1784dcbda44d840307_14.jpg",
  "usage": "train",
  "annotation": [
    {
      "name": "Cat",
      "type": "modelarts/image_classification"
    }
  ]
}
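
A manifest entry of this shape can be read with the standard `json` module. The sketch below assumes the manifest stores one JSON object per line, which is not stated in this section; the helper names `read_manifest` and `classification_labels` are hypothetical.

```python
import json
from typing import Dict, List


def read_manifest(text: str) -> List[Dict]:
    """Parse manifest text, assuming one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def classification_labels(entry: Dict) -> List[str]:
    """Collect class names from image classification annotations in one entry."""
    return [
        ann["name"]
        for ann in entry.get("annotation", [])
        if ann.get("type") == "modelarts/image_classification"
    ]


# The manifest example from this section, written on a single line.
sample = (
    '{"id": "xss", '
    '"source": "obs://home/fc8e2688015d4a1784dcbda44d840307_14.jpg", '
    '"usage": "train", '
    '"annotation": [{"name": "Cat", "type": "modelarts/image_classification"}]}'
)

entries = read_manifest(sample)
labels = classification_labels(entries[0])
```

Here `labels` holds the class names attached to the first entry, and `entries[0]["source"]` gives the OBS path of the corresponding image.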
The output directory structure in the object detection scenario is as follows:
output_path/
--Data/
----1.jpg
----1.xml # If the input data has labeling information, the information is also output. xml indicates the label file.
----2.jpg
----3.jpg
--output.manifest
A manifest file example is as follows:
{
  "source": "obs://fake/be462ea9c5abc09f.jpg",
  "annotation": [
    {
      "annotation-loc": "obs://fake/be462ea9c5abc09f.xml",
      "type": "modelarts/object_detection",
      "annotation-format": "PASCAL VOC",
      "annotated-by": "modelarts/hard_example_algo"
    }
  ]
}
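
For object detection, the manifest entry only points to the label file through `annotation-loc`; the bounding boxes live in the PASCAL VOC XML itself. The following sketch shows one way to follow that link, assuming the XML has already been downloaded from OBS as text. The helper names and the sample XML content are hypothetical; only the `object`, `name`, and `bndbox` element layout comes from the PASCAL VOC format.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


def annotation_locations(entry: Dict) -> List[str]:
    """Return the label file paths of object detection annotations in one entry."""
    return [
        ann["annotation-loc"]
        for ann in entry.get("annotation", [])
        if ann.get("type") == "modelarts/object_detection" and "annotation-loc" in ann
    ]


def parse_voc_boxes(xml_text: str) -> List[Tuple[str, int, int, int, int]]:
    """Extract (name, xmin, ymin, xmax, ymax) tuples from PASCAL VOC XML text."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        if box is None:
            continue  # skip objects without a bounding box
        coords = [int(float(box.findtext(tag))) for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((obj.findtext("name"), coords[0], coords[1], coords[2], coords[3]))
    return boxes


# Manifest entry taken from the example in this section.
entry = {
    "source": "obs://fake/be462ea9c5abc09f.jpg",
    "annotation": [
        {
            "annotation-loc": "obs://fake/be462ea9c5abc09f.xml",
            "type": "modelarts/object_detection",
            "annotation-format": "PASCAL VOC",
            "annotated-by": "modelarts/hard_example_algo",
        }
    ],
}

# Hypothetical label file content, for illustration only.
sample_xml = """
<annotation>
  <filename>be462ea9c5abc09f.jpg</filename>
  <object>
    <name>Cat</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox>
  </object>
</annotation>
"""

locations = annotation_locations(entry)
boxes = parse_voc_boxes(sample_xml)
```

Fetching the file named in `locations` from OBS is left out on purpose, since it depends on the storage client in use.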