File Formats

When creating a CDM job, you need to specify File Format in the job parameters of the migration source and destination in some scenarios. This section describes the application scenarios, subparameters, common parameters, and usage examples of the supported file formats.

CSV

To read or write a CSV file, set File Format to CSV. The CSV format can be used in the following scenarios:

After selecting the CSV format, you can also configure the following optional sub-parameters:

1. Line Separator

2. Field Delimiter

3. Encoding Type

4. Use Quote Character

5. Use RE to Separate Fields

6. Use First Row as Header

7. File Size

  1. Line Separator

    Character used to separate lines in a CSV file. The value can be a single character, multiple characters, or special characters. Special characters can be entered using the URL encoded characters. The following table lists the URL encoded characters of commonly used special characters.

    Table 1 URL encoded characters of special characters

    Special Character

    URL Encoded Character

    Space

    %20

    Tab

    %09

    %

    %25

    Enter

    %0d

    Newline character

    %0a

    Start of heading\u0001 (SOH)

    %01

  2. Field Delimiter

    Character used to separate columns in a CSV file. The value can be a single character, multiple characters, or special characters. For details, see Table 1.

  3. Encoding Type

    Encoding type of a CSV file. The default value is UTF-8.

    If this parameter is specified at the migration source, the specified encoding type is used to parse the file. If this parameter is specified at the migration destination, the specified encoding type is used to write data to the file.

  4. Use Quote Character
    • Exporting data from a database or NoSQL to CSV files (configuring Use Quote Character at the migration destination): If a field delimiter appears in the character string of a column of data at the migration source, set Use Quote Character to Yes at the migration destination to quote the character string as a whole and write it into the CSV file. Currently, CDM uses double quotation marks ("") as the quote character only. Figure 1 shows that the value of the name field in the database contains a comma (,).
      Figure 1 Field value containing the field delimiter

      If you do not use the quote character, the exported CSV file is displayed as follows:

      3.hello,world,abc

      If you use the quote character, the exported CSV file is displayed as follows:

      3,"hello,world",abc

      If the data in the database contains double quotation marks ("") and you set Use Quote Character to Yes, the quote character in the exported CSV file is displayed as three double quotation marks ("""). For example, if the value of a field is a"hello,world"c, the exported data is as follows:

      """a"hello,world"c"""
    • Exporting CSV files to a database or NoSQL (configuring Use Quote Character at the migration source): If you want to import the CSV files with quoted values to a database correctly, set Use Quote Character to Yes at the migration source to write the quoted values as a whole.
  5. Use RE to Separate Fields

    This function is used to parse complex semi-structured text, such as log files. For details, see Using Regular Expressions to Separate Semi-structured Text.

  6. Use First Row as Header

    This parameter is used when CSV files are exported to other locations. If this parameter is specified at the migration source, CDM uses the first row as the header when extracting data. When the CSV files are transferred, the headers are skipped. The number of rows extracted from the migration source is more than the number of rows written to the migration destination. The log files will output the information that the header is skipped during the migration.

  7. File Size

    This parameter is used when data is exported from the database to a CSV file. If a table contains a large amount of data, a large CSV file is generated after migration, which is inconvenient to download or view. In this case, you can specify this parameter at the migration destination so that multiple CSV files with the specified size can be generated. The value of this parameter is an integer. The unit is MB.

JSON

The following describes information about the JSON format:
  1. JSON types supported by CDM: JSON object and JSON array
    • JSON object: A JSON file contains a single object or multiple objects separated/merged by rows.
      1. The following is a single JSON object:
        {
             "took" : 190,
             "timed_out" : false,
             "total" : 1000001,
             "max_score" : 1.0
         }
      2. The following are JSON objects separated by rows:
        {"took" : 188, "timed_out" : false, "total" : 1000003, "max_score" : 1.0 }
        {"took" : 189, "timed_out" : false, "total" : 1000004, "max_score" : 1.0 }
      3. The following are merged JSON objects:
        {
            "took": 190,
            "timed_out": false,
            "total": 1000001,
            "max_score": 1.0
         }
         {
            "took": 191,
            "timed_out": false,
            "total": 1000002,
            "max_score": 1.0
         }
    • JSON array: A JSON file is a JSON array consisting of multiple JSON objects.
      [{
            "took" : 190,
            "timed_out" : false,
            "total" : 1000001,
            "max_score" : 1.0
        },
        {
            "took" : 191,
            "timed_out" : false,
            "total" : 1000001,
            "max_score" : 1.0
        }]
  2. JSON Reference Node

    Root node that records data. The data corresponding to the node is a JSON array. CDM extracts data from the array in the same mode. Use periods (.) to separate multi-layer nested JSON nodes.

  3. Copying Data from a JSON File
    1. Example 1: Extract data from multiple objects that are separated or merged. A JSON file contains multiple JSON objects. The following gives an example:
       {
          "took": 190,
          "timed_out": false,
          "total": 1000001,
          "max_score": 1.0
       }
       {
          "took": 191,
          "timed_out": false,
          "total": 1000002,
          "max_score": 1.0
       }
       {
          "took": 192,
          "timed_out": false,
          "total": 1000003,
          "max_score": 1.0
       }
      To extract data from the JSON object and write data to the database in the following formats, set File Format to JSON and JSON Type to JSON object, and then map fields.

      took

      timedOut

      total

      maxScore

      190

      false

      1000001

      1.0

      191

      false

      1000002

      1.0

      192

      false

      1000003

      1.0

    2. Example 2: Extract data from the reference node. A JSON file contains a single JSON object, but the valid data is on a data node. The following gives an example:
      {
          "took": 190,
          "timed_out": false,
          "hits": {
              "total": 1000001,
              "max_score": 1.0,
              "hits": 
               [{
                  "_id": "650612",
                  "_source": {
                      "name": "tom",
                      "books": ["book1","book2","book3"]
                  }
               },
               {
                   "_id": "650616",
                   "_source": {
                       "name": "tom",
                       "books": ["book1","book2","book3"]
                   }
               },
             {
                   "_id": "650618",
                   "_source": {
                       "name": "tom",
                       "books": ["book1","book2","book3"]
                   }
               }]
           }
       }
      To write data to the database in the following formats, set File Format to JSON, JSON Type to JSON object, and JSON Reference Node to hits.hits, and then map fields.

      ID

      SourceName

      SourceBooks

      650612

      tom

      ["book1","book2","book3"]

      650616

      tom

      ["book1","book2","book3"]

      650618

      tom

      ["book1","book2","book3"]

    3. Example 3: Extract data from the JSON array. A JSON file is a JSON array consisting of multiple JSON objects. The following gives an example:
      [{
            "took" : 190,
            "timed_out" : false,
            "total" : 1000001,
            "max_score" : 1.0
        },
        {
            "took" : 191,
            "timed_out" : false,
            "total" : 1000002,
            "max_score" : 1.0
        }]

      To write data to the database in the following formats, set File Format to JSON and JSON Type to JSON array, and then map fields.

      took

      timedOut

      total

      maxScore

      190

      false

      1000001

      1.0

      191

      false

      1000002

      1.0

    4. Example 4: Configure a converter when parsing the JSON file. On the premise of example 2, to add the hits.max_score field to all records, that is, to write the data to the database in the following formats, perform the following operations:

      ID

      SourceName

      SourceBooks

      MaxScore

      650612

      tom

      ["book1","book2","book3"]

      1.0

      650616

      tom

      ["book1","book2","book3"]

      1.0

      650618

      tom

      ["book1","book2","book3"]

      1.0

      Set File Format to JSON, JSON Type to JSON object, and JSON Reference Node to hits.hits, and then create a converter.
      1. Click to add a field.
        Figure 2 Adding a field
      2. Click to create a converter for the new field.
        Figure 3 Creating a field converter
      3. Set Converter to Expression conversion, enter "1.0" in the Expression text box, and click Save.
        Figure 4 Configuring a field converter

Binary

If you want to copy files between file systems, you can select the binary format. The binary format delivers the optimal rate and performance in file transfer, and does not require field mapping.

Common parameters

Solutions to File Format Problems

  1. When data in a database is exported to a CSV file, if the data contains commas (,), the data in the exported CSV file is disordered.

    The following solutions are available:

    1. Specify a field delimiter.

      Use a character that does not exist in the database or a rare non-printable character as the field delimiter. For example, set Field Delimiter at the migration destination to %01. In this way, the exported field delimiter is \u0001. For details, see Table 1.

    2. Use the quote character.

      Set Use Quote Character to Yes at the migration destination. In this way, if the field in the database contains the field delimiter, CDM quotes the field using the quote character and write the field as a whole to the CSV file.

  2. The data in the database contains line separators.

    Scenario: When you use CDM to export a table in the MySQL database (a field value contains the line separator \n) to a CSV file, and then use CDM to import the exported CSV file to MRS HBase, data in the exported CSV file is truncated.

    Solution: Specify a line separator.

    When you use CDM to export MySQL table data to a CSV file, set Line Separator at the migration destination to %01 (ensure that the value does not appear in the field value). In this way, the line separator in the exported CSV file is %01. Then use CDM to import the CSV file to MRS HBase. Set Line Separator at the migration source to %01. This avoids data truncation.