diff --git a/docs/dli/dev/ALL_META.TXT.json b/docs/dli/dev/ALL_META.TXT.json new file mode 100644 index 00000000..43e7a943 --- /dev/null +++ b/docs/dli/dev/ALL_META.TXT.json @@ -0,0 +1,1304 @@ +[ + { + "dockw":"Developer Guide" + }, + { + "uri":"dli_09_0120.html", + "node_id":"dli_09_0120.xml", + "product_code":"dli", + "code":"1", + "des":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "doc_type":"devg", + "kw":"SQL Jobs", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli" + } + ], + "title":"SQL Jobs", + "githuburl":"" + }, + { + "uri":"dli_05_0044.html", + "node_id":"dli_05_0044.xml", + "product_code":"dli", + "code":"2", + "des":"DLI allows you to use data stored on OBS. You can create OBS tables on DLI to access and process data in your OBS bucket.This section describes how to create an OBS table", + "doc_type":"devg", + "kw":"Using Spark SQL Jobs to Analyze OBS Data,SQL Jobs,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Using Spark SQL Jobs to Analyze OBS Data", + "githuburl":"" + }, + { + "uri":"dli_09_0171.html", + "node_id":"dli_09_0171.xml", + "product_code":"dli", + "code":"3", + "des":"DLI allows you to use Hive user-defined functions (UDFs) to query data. UDFs take effect only on a single row of data and are applicable to inserting and deleting a singl", + "doc_type":"devg", + "kw":"Calling UDFs in Spark SQL Jobs,SQL Jobs,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Calling UDFs in Spark SQL Jobs", + "githuburl":"" + }, + { + "uri":"dli_09_0204.html", + "node_id":"dli_09_0204.xml", + "product_code":"dli", + "code":"4", + "des":"You can use Hive User-Defined Table-Generating Functions (UDTF) to customize table-valued functions. Hive UDTFs are used for the one-in-multiple-out data operations. UDTF", + "doc_type":"devg", + "kw":"Calling UDTFs in Spark SQL Jobs,SQL Jobs,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Calling UDTFs in Spark SQL Jobs", + "githuburl":"" + }, + { + "uri":"dli_05_0062.html", + "node_id":"dli_05_0062.xml", + "product_code":"dli", + "code":"5", + "des":"DLI allows you to use a Hive User Defined Aggregation Function (UDAF) to process multiple rows of data. Hive UDAF is usually used together with groupBy. It is equivalent ", + "doc_type":"devg", + "kw":"Calling UDAFs in Spark SQL Jobs,SQL Jobs,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Calling UDAFs in Spark SQL Jobs", + "githuburl":"" + }, + { + "uri":"dli_09_0123.html", + "node_id":"dli_09_0123.xml", + "product_code":"dli", + "code":"6", + "des":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. 
The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "doc_type":"devg", + "kw":"Submitting a Spark SQL Job Using JDBC", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Submitting a Spark SQL Job Using JDBC", + "githuburl":"" + }, + { + "uri":"dli_09_0124.html", + "node_id":"dli_09_0124.xml", + "product_code":"dli", + "code":"7", + "des":"On DLI, you can connect to the server for data query in the Internet environment. In this case, you need to first obtain the connection information, including the endpoin", + "doc_type":"devg", + "kw":"Obtaining the Server Connection Address,Submitting a Spark SQL Job Using JDBC,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Obtaining the Server Connection Address", + "githuburl":"" + }, + { + "uri":"dli_09_0125.html", + "node_id":"dli_09_0125.xml", + "product_code":"dli", + "code":"8", + "des":"To connect to DLI, JDBC is utilized. You can obtain the JDBC installation package from Maven or download the JDBC driver file from the DLI management console.JDBC driver ", + "doc_type":"devg", + "kw":"Downloading the JDBC Driver Package,Submitting a Spark SQL Job Using JDBC,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Downloading the JDBC Driver Package", + "githuburl":"" + }, + { + "uri":"dli_09_0121.html", + "node_id":"dli_09_0121.xml", + "product_code":"dli", + "code":"9", + "des":"You need to be authenticated when using JDBC to create DLI driver connections.Currently, the JDBC supports authentication using the Access Key/Secret Key (AK/SK) or token", + "doc_type":"devg", + "kw":"Performing Authentication,Submitting a Spark SQL Job Using JDBC,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Performing Authentication", + "githuburl":"" + }, + { + "uri":"dli_09_0127.html", + "node_id":"dli_09_0127.xml", + "product_code":"dli", + "code":"10", + "des":"In Linux or Windows, you can connect to the DLI server using JDBC.Jobs submitted to DLI using JDBC are executed on the Spark engine.Once JDBC 2.X has undergone function r", + "doc_type":"devg", + "kw":"Submitting a Job Using JDBC,Submitting a Spark SQL Job Using JDBC,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Submitting a Job Using JDBC", + "githuburl":"" + }, + { + "uri":"dli_09_0129.html", + "node_id":"dli_09_0129.xml", + "product_code":"dli", + "code":"11", + "des":"Relational Database Service (RDS) is a cloud-based web service that is reliable, scalable, easy to manage, and immediately ready for use. 
It can be deployed in single-nod", + "doc_type":"devg", + "kw":"Introduction to RDS,Submitting a Spark SQL Job Using JDBC,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Introduction to RDS", + "githuburl":"" + }, + { + "uri":"dli_09_0006.html", + "node_id":"dli_09_0006.xml", + "product_code":"dli", + "code":"12", + "des":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "doc_type":"devg", + "kw":"Flink OpenSource SQL Jobs", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Flink OpenSource SQL Jobs", + "githuburl":"" + }, + { + "uri":"dli_09_0009.html", + "node_id":"dli_09_0009.xml", + "product_code":"dli", + "code":"13", + "des":"This guide provides reference for Flink 1.12 only.In this example, we aim to query information about top three most-clicked offerings in each hour from a set of real-time", + "doc_type":"devg", + "kw":"Reading Data from Kafka and Writing Data to RDS,Flink OpenSource SQL Jobs,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Reading Data from Kafka and Writing Data to RDS", + "githuburl":"" + }, + { + "uri":"dli_09_0010.html", + "node_id":"dli_09_0010.xml", + "product_code":"dli", + "code":"14", + "des":"This guide provides reference for Flink 1.12 only.This example analyzes real-time vehicle driving data and collects statistics on data results that meet specific conditio", + "doc_type":"devg", + "kw":"Reading Data from Kafka and Writing Data to GaussDB(DWS),Flink OpenSource SQL Jobs,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Reading Data from Kafka and Writing Data to GaussDB(DWS)", + "githuburl":"" + }, + { + "uri":"dli_09_0011.html", + "node_id":"dli_09_0011.xml", + "product_code":"dli", + "code":"15", + "des":"This guide provides reference for Flink 1.12 only.This example analyzes offering purchase data and collects statistics on data results that meet specific conditions. The ", + "doc_type":"devg", + "kw":"Reading Data from Kafka and Writing Data to Elasticsearch,Flink OpenSource SQL Jobs,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Reading Data from Kafka and Writing Data to Elasticsearch", + "githuburl":"" + }, + { + "uri":"dli_09_0012.html", + "node_id":"dli_09_0012.xml", + "product_code":"dli", + "code":"16", + "des":"This guide provides reference for Flink 1.12 only.Change Data Capture (CDC) can synchronize incremental changes from the source database to one or more destinations. 
Duri", + "doc_type":"devg", + "kw":"Reading Data from MySQL CDC and Writing Data to GaussDB(DWS),Flink OpenSource SQL Jobs,Developer Gui", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Reading Data from MySQL CDC and Writing Data to GaussDB(DWS)", + "githuburl":"" + }, + { + "uri":"dli_09_0013.html", + "node_id":"dli_09_0013.xml", + "product_code":"dli", + "code":"17", + "des":"This guide provides reference for Flink 1.12 only.Change Data Capture (CDC) can synchronize incremental changes from the source database to one or more destinations. Duri", + "doc_type":"devg", + "kw":"Reading Data from PostgreSQL CDC and Writing Data to GaussDB(DWS),Flink OpenSource SQL Jobs,Develope", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Reading Data from PostgreSQL CDC and Writing Data to GaussDB(DWS)", + "githuburl":"" + }, + { + "uri":"dli_09_0207.html", + "node_id":"dli_09_0207.xml", + "product_code":"dli", + "code":"18", + "des":"If you need to configure high reliability for a Flink application, you can set the parameters when creating your Flink jobs.Create an SMN topic and add an email address o", + "doc_type":"devg", + "kw":"Configuring High-Reliability Flink Jobs (Automatic Restart upon Exceptions),Flink OpenSource SQL Job", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Configuring High-Reliability Flink Jobs (Automatic Restart upon Exceptions)", + "githuburl":"" + }, + { + "uri":"dli_09_0202.html", + "node_id":"dli_09_0202.xml", + "product_code":"dli", + "code":"19", + "des":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "doc_type":"devg", + "kw":"Flink Jar Jobs", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Flink Jar Jobs", + "githuburl":"" + }, + { + "uri":"dli_09_0162.html", + "node_id":"dli_09_0162.xml", + "product_code":"dli", + "code":"20", + "des":"Built on Flink and Spark, the stream ecosystem is fully compatible with the open-source Flink, Storm, and Spark APIs. 
It is enhanced in features and improved in performan", + "doc_type":"devg", + "kw":"Stream Ecosystem,Flink Jar Jobs,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Stream Ecosystem", + "githuburl":"" + }, + { + "uri":"dli_09_0150.html", + "node_id":"dli_09_0150.xml", + "product_code":"dli", + "code":"21", + "des":"You can perform secondary development based on Flink APIs to build your own Jar packages and submit them to the DLI queues to interact with data sources such as MRS Kafka", + "doc_type":"devg", + "kw":"Flink Jar Job Examples,Flink Jar Jobs,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Flink Jar Job Examples", + "githuburl":"" + }, + { + "uri":"dli_09_0191.html", + "node_id":"dli_09_0191.xml", + "product_code":"dli", + "code":"22", + "des":"DLI allows you to use a custom JAR package to run Flink jobs and write data to OBS. This section describes how to write processed Kafka data to OBS. You need to modify th", + "doc_type":"devg", + "kw":"Writing Data to OBS Using Flink Jar,Flink Jar Jobs,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Writing Data to OBS Using Flink Jar", + "githuburl":"" + }, + { + "uri":"dli_09_0203.html", + "node_id":"dli_09_0203.xml", + "product_code":"dli", + "code":"23", + "des":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "doc_type":"devg", + "kw":"Spark Jar Jobs", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Spark Jar Jobs", + "githuburl":"" + }, + { + "uri":"dli_09_0205.html", + "node_id":"dli_09_0205.xml", + "product_code":"dli", + "code":"24", + "des":"DLI is fully compatible with open-source Apache Spark and allows you to import, query, analyze, and process job data by programming. This section describes how to write a", + "doc_type":"devg", + "kw":"Using Spark Jar Jobs to Read and Query OBS Data,Spark Jar Jobs,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Using Spark Jar Jobs to Read and Query OBS Data", + "githuburl":"" + }, + { + "uri":"dli_09_0176.html", + "node_id":"dli_09_0176.xml", + "product_code":"dli", + "code":"25", + "des":"DLI allows you to develop a program to create Spark jobs for operations related to databases, DLI or OBS tables, and table data. This example demonstrates how to develop ", + "doc_type":"devg", + "kw":"Using the Spark Job to Access DLI Metadata,Spark Jar Jobs,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Using the Spark Job to Access DLI Metadata", + "githuburl":"" + }, + { + "uri":"dli_09_0122.html", + "node_id":"dli_09_0122.xml", + "product_code":"dli", + "code":"26", + "des":"DLI Spark-submit is a command line tool used to submit Spark jobs to the DLI server. 
This tool provides command lines compatible with open-source Spark.Getting authorized", + "doc_type":"devg", + "kw":"Using Spark-submit to Submit a Spark Jar Job,Spark Jar Jobs,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Using Spark-submit to Submit a Spark Jar Job", + "githuburl":"" + }, + { + "uri":"dli_09_0019.html", + "node_id":"dli_09_0019.xml", + "product_code":"dli", + "code":"27", + "des":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "doc_type":"devg", + "kw":"Using Spark Jobs to Access Data Sources of Datasource Connections", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Using Spark Jobs to Access Data Sources of Datasource Connections", + "githuburl":"" + }, + { + "uri":"dli_09_0020.html", + "node_id":"dli_09_0020.xml", + "product_code":"dli", + "code":"28", + "des":"DLI supports the native Spark DataSource capability and other extended capabilities. You can use SQL statements or Spark jobs to access other data storage services and im", + "doc_type":"devg", + "kw":"Overview,Using Spark Jobs to Access Data Sources of Datasource Connections,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Overview", + "githuburl":"" + }, + { + "uri":"dli_09_0089.html", + "node_id":"dli_09_0089.xml", + "product_code":"dli", + "code":"29", + "des":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "doc_type":"devg", + "kw":"Connecting to CSS", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Connecting to CSS", + "githuburl":"" + }, + { + "uri":"dli_09_0189.html", + "node_id":"dli_09_0189.xml", + "product_code":"dli", + "code":"30", + "des":"The Elasticsearch 6.5.4 and later versions provided by CSS provides the security settings. 
Once the function is enabled, CSS provides identity authentication, authorizati", + "doc_type":"devg", + "kw":"CSS Security Cluster Configuration,Connecting to CSS,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"CSS Security Cluster Configuration", + "githuburl":"" + }, + { + "uri":"dli_09_0061.html", + "node_id":"dli_09_0061.xml", + "product_code":"dli", + "code":"31", + "des":"A datasource connection has been created on the DLI management console.Development descriptionConstructing dependency information and creating a Spark sessionImport depen", + "doc_type":"devg", + "kw":"Scala Example Code,Connecting to CSS,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Scala Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0090.html", + "node_id":"dli_09_0090.xml", + "product_code":"dli", + "code":"32", + "des":"A datasource connection has been created on the DLI management console.Development descriptionCode implementationImport dependency packages.from __future__ import print_f", + "doc_type":"devg", + "kw":"PySpark Example Code,Connecting to CSS,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"PySpark Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0190.html", + "node_id":"dli_09_0190.xml", + "product_code":"dli", + "code":"33", + "des":"A datasource connection has been created on the DLI management console.Development descriptionCode implementationConstructing dependency information and creating a Spark ", + "doc_type":"devg", + "kw":"Java Example Code,Connecting to CSS,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Java Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0086.html", + "node_id":"dli_09_0086.xml", + "product_code":"dli", + "code":"34", + "des":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. 
The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "doc_type":"devg", + "kw":"Connecting to GaussDB(DWS)", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Connecting to GaussDB(DWS)", + "githuburl":"" + }, + { + "uri":"dli_09_0069.html", + "node_id":"dli_09_0069.xml", + "product_code":"dli", + "code":"35", + "des":"This section provides Scala example code that demonstrates how to use a Spark job to access data from the GaussDB(DWS) data source.A datasource connection has been create", + "doc_type":"devg", + "kw":"Scala Example Code,Connecting to GaussDB(DWS),Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Scala Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0087.html", + "node_id":"dli_09_0087.xml", + "product_code":"dli", + "code":"36", + "des":"This section provides PySpark example code that demonstrates how to use a Spark job to access data from the GaussDB(DWS) data source.A datasource connection has been crea", + "doc_type":"devg", + "kw":"PySpark Example Code,Connecting to GaussDB(DWS),Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"PySpark Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0199.html", + "node_id":"dli_09_0199.xml", + "product_code":"dli", + "code":"37", + "des":"This section provides Java example code that demonstrates how to use a Spark job to access data from the GaussDB(DWS) data source.A datasource connection has been created", + "doc_type":"devg", + "kw":"Java Example Code,Connecting to GaussDB(DWS),Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Java Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0077.html", + "node_id":"dli_09_0077.xml", + "product_code":"dli", + "code":"38", + "des":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. 
The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "doc_type":"devg", + "kw":"Connecting to HBase", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Connecting to HBase", + "githuburl":"" + }, + { + "uri":"dli_09_0196.html", + "node_id":"dli_09_0196.xml", + "product_code":"dli", + "code":"39", + "des":"Create a datasource connection on the DLI management console.Add the /etc/hosts information of MRS cluster nodes to the host file of the DLI queue.For details, see sectio", + "doc_type":"devg", + "kw":"MRS Configuration,Connecting to HBase,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"MRS Configuration", + "githuburl":"" + }, + { + "uri":"dli_09_0063.html", + "node_id":"dli_09_0063.xml", + "product_code":"dli", + "code":"40", + "des":"The CloudTable HBase and MRS HBase can be connected to DLI as data sources.PrerequisitesA datasource connection has been created on the DLI management console.Hard-coded ", + "doc_type":"devg", + "kw":"Scala Example Code,Connecting to HBase,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Scala Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0078.html", + "node_id":"dli_09_0078.xml", + "product_code":"dli", + "code":"41", + "des":"The CloudTable HBase and MRS HBase can be connected to DLI as data sources.PrerequisitesA datasource connection has been created on the DLI management console.Hard-coded ", + "doc_type":"devg", + "kw":"PySpark Example Code,Connecting to HBase,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"PySpark Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0197.html", + "node_id":"dli_09_0197.xml", + "product_code":"dli", + "code":"42", + "des":"This example applies only to MRS HBase.PrerequisitesA datasource connection has been created and bound to a queue on the DLI management console.Hard-coded or plaintext pa", + "doc_type":"devg", + "kw":"Java Example Code,Connecting to HBase,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Java Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0198.html", + "node_id":"dli_09_0198.xml", + "product_code":"dli", + "code":"43", + "des":"SymptomThe Spark job fails to be executed, and the job log indicates that the Java server connection or container fails to be started.The Spark job fails to be executed, ", + "doc_type":"devg", + "kw":"Troubleshooting,Connecting to HBase,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Troubleshooting", + "githuburl":"" + }, + { + "uri":"dli_09_0080.html", + "node_id":"dli_09_0080.xml", + "product_code":"dli", + "code":"44", + "des":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD 
services. The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "doc_type":"devg", + "kw":"Connecting to OpenTSDB", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Connecting to OpenTSDB", + "githuburl":"" + }, + { + "uri":"dli_09_0065.html", + "node_id":"dli_09_0065.xml", + "product_code":"dli", + "code":"45", + "des":"The CloudTable OpenTSDB and MRS OpenTSDB can be connected to DLI as data sources.PrerequisitesA datasource connection has been created on the DLI management console.Hard-", + "doc_type":"devg", + "kw":"Scala Example Code,Connecting to OpenTSDB,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Scala Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0081.html", + "node_id":"dli_09_0081.xml", + "product_code":"dli", + "code":"46", + "des":"The CloudTable OpenTSDB and MRS OpenTSDB can be connected to DLI as data sources.PrerequisitesA datasource connection has been created on the DLI management console.Hard-", + "doc_type":"devg", + "kw":"PySpark Example Code,Connecting to OpenTSDB,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"PySpark Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0193.html", + "node_id":"dli_09_0193.xml", + "product_code":"dli", + "code":"47", + "des":"This example applies only to MRS OpenTSDB.PrerequisitesA datasource connection has been created and bound to a queue on the DLI management console.Hard-coded or plaintext", + "doc_type":"devg", + "kw":"Java Example Code,Connecting to OpenTSDB,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Java Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0195.html", + "node_id":"dli_09_0195.xml", + "product_code":"dli", + "code":"48", + "des":"SymptomA Spark job fails to be executed and \"No respond\" is displayed in the job log.A Spark job fails to be executed and \"No respond\" is displayed in the job log.Solutio", + "doc_type":"devg", + "kw":"Troubleshooting,Connecting to OpenTSDB,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Troubleshooting", + "githuburl":"" + }, + { + "uri":"dli_09_0083.html", + "node_id":"dli_09_0083.xml", + "product_code":"dli", + "code":"49", + "des":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. 
The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "doc_type":"devg", + "kw":"Connecting to RDS", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Connecting to RDS", + "githuburl":"" + }, + { + "uri":"dli_09_0067.html", + "node_id":"dli_09_0067.xml", + "product_code":"dli", + "code":"50", + "des":"PrerequisitesA datasource connection has been created and bound to a queue on the DLI management console.Hard-coded or plaintext passwords pose significant security risks", + "doc_type":"devg", + "kw":"Scala Example Code,Connecting to RDS,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Scala Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0084.html", + "node_id":"dli_09_0084.xml", + "product_code":"dli", + "code":"51", + "des":"PrerequisitesA datasource connection has been created and bound to a queue on the DLI management console.Hard-coded or plaintext passwords pose significant security risks", + "doc_type":"devg", + "kw":"PySpark Example Code,Connecting to RDS,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"PySpark Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0187.html", + "node_id":"dli_09_0187.xml", + "product_code":"dli", + "code":"52", + "des":"PrerequisitesA datasource connection has been created and bound to a queue on the DLI management console.Hard-coded or plaintext passwords pose significant security risks", + "doc_type":"devg", + "kw":"Java Example Code,Connecting to RDS,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Java Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0093.html", + "node_id":"dli_09_0093.xml", + "product_code":"dli", + "code":"53", + "des":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. 
The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "doc_type":"devg", + "kw":"Connecting to Redis", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Connecting to Redis", + "githuburl":"" + }, + { + "uri":"dli_09_0094.html", + "node_id":"dli_09_0094.xml", + "product_code":"dli", + "code":"54", + "des":"Redis supports only enhanced datasource connections.PrerequisitesAn enhanced datasource connection has been created on the DLI management console and bound to a queue in ", + "doc_type":"devg", + "kw":"Scala Example Code,Connecting to Redis,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Scala Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0097.html", + "node_id":"dli_09_0097.xml", + "product_code":"dli", + "code":"55", + "des":"Redis supports only enhanced datasource connections.PrerequisitesAn enhanced datasource connection has been created on the DLI management console and bound to a queue in ", + "doc_type":"devg", + "kw":"PySpark Example Code,Connecting to Redis,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"PySpark Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0100.html", + "node_id":"dli_09_0100.xml", + "product_code":"dli", + "code":"56", + "des":"Redis supports only enhanced datasource connections.PrerequisitesAn enhanced datasource connection has been created on the DLI management console and bound to a queue in ", + "doc_type":"devg", + "kw":"Java Example Code,Connecting to Redis,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Java Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0188.html", + "node_id":"dli_09_0188.xml", + "product_code":"dli", + "code":"57", + "des":"SymptomAfter the code is directly copied to the .py file, unexpected characters may exist after the backslashes (\\).After the code is directly copied to the .py file, une", + "doc_type":"devg", + "kw":"Troubleshooting,Connecting to Redis,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "opensource":"true", + "IsBot":"Yes", + "IsMulti":"No", + "doc_type":"devg", + "product_code":"dli" + } + ], + "title":"Troubleshooting", + "githuburl":"" + }, + { + "uri":"dli_09_0113.html", + "node_id":"dli_09_0113.xml", + "product_code":"dli", + "code":"58", + "des":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. 
The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "doc_type":"devg", + "kw":"Connecting to Mongo", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Connecting to Mongo", + "githuburl":"" + }, + { + "uri":"dli_09_0114.html", + "node_id":"dli_09_0114.xml", + "product_code":"dli", + "code":"59", + "des":"Mongo can be connected only through enhanced datasource connections.DDS is compatible with the MongoDB protocol.An enhanced datasource connection has been created on the ", + "doc_type":"devg", + "kw":"Scala Example Code,Connecting to Mongo,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Scala Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0117.html", + "node_id":"dli_09_0117.xml", + "product_code":"dli", + "code":"60", + "des":"Mongo can be connected only through enhanced datasource connections.DDS is compatible with the MongoDB protocol.PrerequisitesAn enhanced datasource connection has been cr", + "doc_type":"devg", + "kw":"PySpark Example Code,Connecting to Mongo,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"PySpark Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_0110.html", + "node_id":"dli_09_0110.xml", + "product_code":"dli", + "code":"61", + "des":"Mongo can be connected only through enhanced datasource connections.DDS is compatible with the MongoDB protocol.PrerequisitesAn enhanced datasource connection has been cr", + "doc_type":"devg", + "kw":"Java Example Code,Connecting to Mongo,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli", + "IsMulti":"No", + "IsBot":"Yes" + } + ], + "title":"Java Example Code", + "githuburl":"" + }, + { + "uri":"dli_09_00001.html", + "node_id":"dli_09_00001.xml", + "product_code":"dli", + "code":"62", + "des":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "doc_type":"devg", + "kw":"Change History,Developer Guide", + "search_title":"", + "metedata":[ + { + "documenttype":"devg", + "prodname":"dli" + } + ], + "title":"Change History", + "githuburl":"" + } +] \ No newline at end of file diff --git a/docs/dli/dev/CLASS.TXT.json b/docs/dli/dev/CLASS.TXT.json new file mode 100644 index 00000000..3f293aa1 --- /dev/null +++ b/docs/dli/dev/CLASS.TXT.json @@ -0,0 +1,560 @@ +[ + { + "desc":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "product_code":"dli", + "title":"SQL Jobs", + "uri":"dli_09_0120.html", + "doc_type":"devg", + "p_code":"", + "code":"1" + }, + { + "desc":"DLI allows you to use data stored on OBS. 
You can create OBS tables on DLI to access and process data in your OBS bucket.This section describes how to create an OBS table", + "product_code":"dli", + "title":"Using Spark SQL Jobs to Analyze OBS Data", + "uri":"dli_05_0044.html", + "doc_type":"devg", + "p_code":"1", + "code":"2" + }, + { + "desc":"DLI allows you to use Hive user-defined functions (UDFs) to query data. UDFs take effect only on a single row of data and are applicable to inserting and deleting a singl", + "product_code":"dli", + "title":"Calling UDFs in Spark SQL Jobs", + "uri":"dli_09_0171.html", + "doc_type":"devg", + "p_code":"1", + "code":"3" + }, + { + "desc":"You can use Hive User-Defined Table-Generating Functions (UDTF) to customize table-valued functions. Hive UDTFs are used for the one-in-multiple-out data operations. UDTF", + "product_code":"dli", + "title":"Calling UDTFs in Spark SQL Jobs", + "uri":"dli_09_0204.html", + "doc_type":"devg", + "p_code":"1", + "code":"4" + }, + { + "desc":"DLI allows you to use a Hive User Defined Aggregation Function (UDAF) to process multiple rows of data. Hive UDAF is usually used together with groupBy. It is equivalent ", + "product_code":"dli", + "title":"Calling UDAFs in Spark SQL Jobs", + "uri":"dli_05_0062.html", + "doc_type":"devg", + "p_code":"1", + "code":"5" + }, + { + "desc":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "product_code":"dli", + "title":"Submitting a Spark SQL Job Using JDBC", + "uri":"dli_09_0123.html", + "doc_type":"devg", + "p_code":"1", + "code":"6" + }, + { + "desc":"On DLI, you can connect to the server for data query in the Internet environment. In this case, you need to first obtain the connection information, including the endpoin", + "product_code":"dli", + "title":"Obtaining the Server Connection Address", + "uri":"dli_09_0124.html", + "doc_type":"devg", + "p_code":"6", + "code":"7" + }, + { + "desc":"To connect to DLI, JDBC is utilized. You can obtain the JDBC installation package from Maven or download the JDBC driver file from the DLI management console.JDBC driver ", + "product_code":"dli", + "title":"Downloading the JDBC Driver Package", + "uri":"dli_09_0125.html", + "doc_type":"devg", + "p_code":"6", + "code":"8" + }, + { + "desc":"You need to be authenticated when using JDBC to create DLI driver connections.Currently, the JDBC supports authentication using the Access Key/Secret Key (AK/SK) or token", + "product_code":"dli", + "title":"Performing Authentication", + "uri":"dli_09_0121.html", + "doc_type":"devg", + "p_code":"6", + "code":"9" + }, + { + "desc":"In Linux or Windows, you can connect to the DLI server using JDBC.Jobs submitted to DLI using JDBC are executed on the Spark engine.Once JDBC 2.X has undergone function r", + "product_code":"dli", + "title":"Submitting a Job Using JDBC", + "uri":"dli_09_0127.html", + "doc_type":"devg", + "p_code":"6", + "code":"10" + }, + { + "desc":"Relational Database Service (RDS) is a cloud-based web service that is reliable, scalable, easy to manage, and immediately ready for use. It can be deployed in single-nod", + "product_code":"dli", + "title":"Introduction to RDS", + "uri":"dli_09_0129.html", + "doc_type":"devg", + "p_code":"6", + "code":"11" + }, + { + "desc":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. 
The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "product_code":"dli", + "title":"Flink OpenSource SQL Jobs", + "uri":"dli_09_0006.html", + "doc_type":"devg", + "p_code":"", + "code":"12" + }, + { + "desc":"This guide provides reference for Flink 1.12 only.In this example, we aim to query information about top three most-clicked offerings in each hour from a set of real-time", + "product_code":"dli", + "title":"Reading Data from Kafka and Writing Data to RDS", + "uri":"dli_09_0009.html", + "doc_type":"devg", + "p_code":"12", + "code":"13" + }, + { + "desc":"This guide provides reference for Flink 1.12 only.This example analyzes real-time vehicle driving data and collects statistics on data results that meet specific conditio", + "product_code":"dli", + "title":"Reading Data from Kafka and Writing Data to GaussDB(DWS)", + "uri":"dli_09_0010.html", + "doc_type":"devg", + "p_code":"12", + "code":"14" + }, + { + "desc":"This guide provides reference for Flink 1.12 only.This example analyzes offering purchase data and collects statistics on data results that meet specific conditions. The ", + "product_code":"dli", + "title":"Reading Data from Kafka and Writing Data to Elasticsearch", + "uri":"dli_09_0011.html", + "doc_type":"devg", + "p_code":"12", + "code":"15" + }, + { + "desc":"This guide provides reference for Flink 1.12 only.Change Data Capture (CDC) can synchronize incremental changes from the source database to one or more destinations. Duri", + "product_code":"dli", + "title":"Reading Data from MySQL CDC and Writing Data to GaussDB(DWS)", + "uri":"dli_09_0012.html", + "doc_type":"devg", + "p_code":"12", + "code":"16" + }, + { + "desc":"This guide provides reference for Flink 1.12 only.Change Data Capture (CDC) can synchronize incremental changes from the source database to one or more destinations. Duri", + "product_code":"dli", + "title":"Reading Data from PostgreSQL CDC and Writing Data to GaussDB(DWS)", + "uri":"dli_09_0013.html", + "doc_type":"devg", + "p_code":"12", + "code":"17" + }, + { + "desc":"If you need to configure high reliability for a Flink application, you can set the parameters when creating your Flink jobs.Create an SMN topic and add an email address o", + "product_code":"dli", + "title":"Configuring High-Reliability Flink Jobs (Automatic Restart upon Exceptions)", + "uri":"dli_09_0207.html", + "doc_type":"devg", + "p_code":"12", + "code":"18" + }, + { + "desc":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "product_code":"dli", + "title":"Flink Jar Jobs", + "uri":"dli_09_0202.html", + "doc_type":"devg", + "p_code":"", + "code":"19" + }, + { + "desc":"Built on Flink and Spark, the stream ecosystem is fully compatible with the open-source Flink, Storm, and Spark APIs. 
It is enhanced in features and improved in performan", + "product_code":"dli", + "title":"Stream Ecosystem", + "uri":"dli_09_0162.html", + "doc_type":"devg", + "p_code":"19", + "code":"20" + }, + { + "desc":"You can perform secondary development based on Flink APIs to build your own Jar packages and submit them to the DLI queues to interact with data sources such as MRS Kafka", + "product_code":"dli", + "title":"Flink Jar Job Examples", + "uri":"dli_09_0150.html", + "doc_type":"devg", + "p_code":"19", + "code":"21" + }, + { + "desc":"DLI allows you to use a custom JAR package to run Flink jobs and write data to OBS. This section describes how to write processed Kafka data to OBS. You need to modify th", + "product_code":"dli", + "title":"Writing Data to OBS Using Flink Jar", + "uri":"dli_09_0191.html", + "doc_type":"devg", + "p_code":"19", + "code":"22" + }, + { + "desc":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "product_code":"dli", + "title":"Spark Jar Jobs", + "uri":"dli_09_0203.html", + "doc_type":"devg", + "p_code":"", + "code":"23" + }, + { + "desc":"DLI is fully compatible with open-source Apache Spark and allows you to import, query, analyze, and process job data by programming. This section describes how to write a", + "product_code":"dli", + "title":"Using Spark Jar Jobs to Read and Query OBS Data", + "uri":"dli_09_0205.html", + "doc_type":"devg", + "p_code":"23", + "code":"24" + }, + { + "desc":"DLI allows you to develop a program to create Spark jobs for operations related to databases, DLI or OBS tables, and table data. This example demonstrates how to develop ", + "product_code":"dli", + "title":"Using the Spark Job to Access DLI Metadata", + "uri":"dli_09_0176.html", + "doc_type":"devg", + "p_code":"23", + "code":"25" + }, + { + "desc":"DLI Spark-submit is a command line tool used to submit Spark jobs to the DLI server. This tool provides command lines compatible with open-source Spark.Getting authorized", + "product_code":"dli", + "title":"Using Spark-submit to Submit a Spark Jar Job", + "uri":"dli_09_0122.html", + "doc_type":"devg", + "p_code":"23", + "code":"26" + }, + { + "desc":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "product_code":"dli", + "title":"Using Spark Jobs to Access Data Sources of Datasource Connections", + "uri":"dli_09_0019.html", + "doc_type":"devg", + "p_code":"23", + "code":"27" + }, + { + "desc":"DLI supports the native Spark DataSource capability and other extended capabilities. You can use SQL statements or Spark jobs to access other data storage services and im", + "product_code":"dli", + "title":"Overview", + "uri":"dli_09_0020.html", + "doc_type":"devg", + "p_code":"27", + "code":"28" + }, + { + "desc":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. 
The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "product_code":"dli", + "title":"Connecting to CSS", + "uri":"dli_09_0089.html", + "doc_type":"devg", + "p_code":"27", + "code":"29" + }, + { + "desc":"The Elasticsearch 6.5.4 and later versions provided by CSS provides the security settings. Once the function is enabled, CSS provides identity authentication, authorizati", + "product_code":"dli", + "title":"CSS Security Cluster Configuration", + "uri":"dli_09_0189.html", + "doc_type":"devg", + "p_code":"29", + "code":"30" + }, + { + "desc":"A datasource connection has been created on the DLI management console.Development descriptionConstructing dependency information and creating a Spark sessionImport depen", + "product_code":"dli", + "title":"Scala Example Code", + "uri":"dli_09_0061.html", + "doc_type":"devg", + "p_code":"29", + "code":"31" + }, + { + "desc":"A datasource connection has been created on the DLI management console.Development descriptionCode implementationImport dependency packages.from __future__ import print_f", + "product_code":"dli", + "title":"PySpark Example Code", + "uri":"dli_09_0090.html", + "doc_type":"devg", + "p_code":"29", + "code":"32" + }, + { + "desc":"A datasource connection has been created on the DLI management console.Development descriptionCode implementationConstructing dependency information and creating a Spark ", + "product_code":"dli", + "title":"Java Example Code", + "uri":"dli_09_0190.html", + "doc_type":"devg", + "p_code":"29", + "code":"33" + }, + { + "desc":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "product_code":"dli", + "title":"Connecting to GaussDB(DWS)", + "uri":"dli_09_0086.html", + "doc_type":"devg", + "p_code":"27", + "code":"34" + }, + { + "desc":"This section provides Scala example code that demonstrates how to use a Spark job to access data from the GaussDB(DWS) data source.A datasource connection has been create", + "product_code":"dli", + "title":"Scala Example Code", + "uri":"dli_09_0069.html", + "doc_type":"devg", + "p_code":"34", + "code":"35" + }, + { + "desc":"This section provides PySpark example code that demonstrates how to use a Spark job to access data from the GaussDB(DWS) data source.A datasource connection has been crea", + "product_code":"dli", + "title":"PySpark Example Code", + "uri":"dli_09_0087.html", + "doc_type":"devg", + "p_code":"34", + "code":"36" + }, + { + "desc":"This section provides Java example code that demonstrates how to use a Spark job to access data from the GaussDB(DWS) data source.A datasource connection has been created", + "product_code":"dli", + "title":"Java Example Code", + "uri":"dli_09_0199.html", + "doc_type":"devg", + "p_code":"34", + "code":"37" + }, + { + "desc":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. 
The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "product_code":"dli", + "title":"Connecting to HBase", + "uri":"dli_09_0077.html", + "doc_type":"devg", + "p_code":"27", + "code":"38" + }, + { + "desc":"Create a datasource connection on the DLI management console.Add the /etc/hosts information of MRS cluster nodes to the host file of the DLI queue.For details, see sectio", + "product_code":"dli", + "title":"MRS Configuration", + "uri":"dli_09_0196.html", + "doc_type":"devg", + "p_code":"38", + "code":"39" + }, + { + "desc":"The CloudTable HBase and MRS HBase can be connected to DLI as data sources.PrerequisitesA datasource connection has been created on the DLI management console.Hard-coded ", + "product_code":"dli", + "title":"Scala Example Code", + "uri":"dli_09_0063.html", + "doc_type":"devg", + "p_code":"38", + "code":"40" + }, + { + "desc":"The CloudTable HBase and MRS HBase can be connected to DLI as data sources.PrerequisitesA datasource connection has been created on the DLI management console.Hard-coded ", + "product_code":"dli", + "title":"PySpark Example Code", + "uri":"dli_09_0078.html", + "doc_type":"devg", + "p_code":"38", + "code":"41" + }, + { + "desc":"This example applies only to MRS HBase.PrerequisitesA datasource connection has been created and bound to a queue on the DLI management console.Hard-coded or plaintext pa", + "product_code":"dli", + "title":"Java Example Code", + "uri":"dli_09_0197.html", + "doc_type":"devg", + "p_code":"38", + "code":"42" + }, + { + "desc":"SymptomThe Spark job fails to be executed, and the job log indicates that the Java server connection or container fails to be started.The Spark job fails to be executed, ", + "product_code":"dli", + "title":"Troubleshooting", + "uri":"dli_09_0198.html", + "doc_type":"devg", + "p_code":"38", + "code":"43" + }, + { + "desc":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. 
The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "product_code":"dli", + "title":"Connecting to OpenTSDB", + "uri":"dli_09_0080.html", + "doc_type":"devg", + "p_code":"27", + "code":"44" + }, + { + "desc":"The CloudTable OpenTSDB and MRS OpenTSDB can be connected to DLI as data sources.PrerequisitesA datasource connection has been created on the DLI management console.Hard-", + "product_code":"dli", + "title":"Scala Example Code", + "uri":"dli_09_0065.html", + "doc_type":"devg", + "p_code":"44", + "code":"45" + }, + { + "desc":"The CloudTable OpenTSDB and MRS OpenTSDB can be connected to DLI as data sources.PrerequisitesA datasource connection has been created on the DLI management console.Hard-", + "product_code":"dli", + "title":"PySpark Example Code", + "uri":"dli_09_0081.html", + "doc_type":"devg", + "p_code":"44", + "code":"46" + }, + { + "desc":"This example applies only to MRS OpenTSDB.PrerequisitesA datasource connection has been created and bound to a queue on the DLI management console.Hard-coded or plaintext", + "product_code":"dli", + "title":"Java Example Code", + "uri":"dli_09_0193.html", + "doc_type":"devg", + "p_code":"44", + "code":"47" + }, + { + "desc":"SymptomA Spark job fails to be executed and \"No respond\" is displayed in the job log.A Spark job fails to be executed and \"No respond\" is displayed in the job log.Solutio", + "product_code":"dli", + "title":"Troubleshooting", + "uri":"dli_09_0195.html", + "doc_type":"devg", + "p_code":"44", + "code":"48" + }, + { + "desc":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "product_code":"dli", + "title":"Connecting to RDS", + "uri":"dli_09_0083.html", + "doc_type":"devg", + "p_code":"27", + "code":"49" + }, + { + "desc":"PrerequisitesA datasource connection has been created and bound to a queue on the DLI management console.Hard-coded or plaintext passwords pose significant security risks", + "product_code":"dli", + "title":"Scala Example Code", + "uri":"dli_09_0067.html", + "doc_type":"devg", + "p_code":"49", + "code":"50" + }, + { + "desc":"PrerequisitesA datasource connection has been created and bound to a queue on the DLI management console.Hard-coded or plaintext passwords pose significant security risks", + "product_code":"dli", + "title":"PySpark Example Code", + "uri":"dli_09_0084.html", + "doc_type":"devg", + "p_code":"49", + "code":"51" + }, + { + "desc":"PrerequisitesA datasource connection has been created and bound to a queue on the DLI management console.Hard-coded or plaintext passwords pose significant security risks", + "product_code":"dli", + "title":"Java Example Code", + "uri":"dli_09_0187.html", + "doc_type":"devg", + "p_code":"49", + "code":"52" + }, + { + "desc":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. 
The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "product_code":"dli", + "title":"Connecting to Redis", + "uri":"dli_09_0093.html", + "doc_type":"devg", + "p_code":"27", + "code":"53" + }, + { + "desc":"Redis supports only enhanced datasource connections.PrerequisitesAn enhanced datasource connection has been created on the DLI management console and bound to a queue in ", + "product_code":"dli", + "title":"Scala Example Code", + "uri":"dli_09_0094.html", + "doc_type":"devg", + "p_code":"53", + "code":"54" + }, + { + "desc":"Redis supports only enhanced datasource connections.PrerequisitesAn enhanced datasource connection has been created on the DLI management console and bound to a queue in ", + "product_code":"dli", + "title":"PySpark Example Code", + "uri":"dli_09_0097.html", + "doc_type":"devg", + "p_code":"53", + "code":"55" + }, + { + "desc":"Redis supports only enhanced datasource connections.PrerequisitesAn enhanced datasource connection has been created on the DLI management console and bound to a queue in ", + "product_code":"dli", + "title":"Java Example Code", + "uri":"dli_09_0100.html", + "doc_type":"devg", + "p_code":"53", + "code":"56" + }, + { + "desc":"SymptomAfter the code is directly copied to the .py file, unexpected characters may exist after the backslashes (\\).After the code is directly copied to the .py file, une", + "product_code":"dli", + "title":"Troubleshooting", + "uri":"dli_09_0188.html", + "doc_type":"devg", + "p_code":"53", + "code":"57" + }, + { + "desc":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "product_code":"dli", + "title":"Connecting to Mongo", + "uri":"dli_09_0113.html", + "doc_type":"devg", + "p_code":"27", + "code":"58" + }, + { + "desc":"Mongo can be connected only through enhanced datasource connections.DDS is compatible with the MongoDB protocol.An enhanced datasource connection has been created on the ", + "product_code":"dli", + "title":"Scala Example Code", + "uri":"dli_09_0114.html", + "doc_type":"devg", + "p_code":"58", + "code":"59" + }, + { + "desc":"Mongo can be connected only through enhanced datasource connections.DDS is compatible with the MongoDB protocol.PrerequisitesAn enhanced datasource connection has been cr", + "product_code":"dli", + "title":"PySpark Example Code", + "uri":"dli_09_0117.html", + "doc_type":"devg", + "p_code":"58", + "code":"60" + }, + { + "desc":"Mongo can be connected only through enhanced datasource connections.DDS is compatible with the MongoDB protocol.PrerequisitesAn enhanced datasource connection has been cr", + "product_code":"dli", + "title":"Java Example Code", + "uri":"dli_09_0110.html", + "doc_type":"devg", + "p_code":"58", + "code":"61" + }, + { + "desc":"HUAWEI CLOUD Help Center presents technical documents to help you quickly get started with HUAWEI CLOUD services. 
The technical documents include Service Overview, Price Details, Purchase Guide, User Guide, API Reference, Best Practices, FAQs, and Videos.", + "product_code":"dli", + "title":"Change History", + "uri":"dli_09_00001.html", + "doc_type":"devg", + "p_code":"", + "code":"62" + } +] \ No newline at end of file diff --git a/docs/dli/dev/PARAMETERS.txt b/docs/dli/dev/PARAMETERS.txt new file mode 100644 index 00000000..6da8d5f0 --- /dev/null +++ b/docs/dli/dev/PARAMETERS.txt @@ -0,0 +1,3 @@ +version="" +language="en-us" +type="" \ No newline at end of file diff --git a/docs/dli/dev/dli_05_0044.html b/docs/dli/dev/dli_05_0044.html new file mode 100644 index 00000000..0fc4b9d7 --- /dev/null +++ b/docs/dli/dev/dli_05_0044.html @@ -0,0 +1,175 @@ + + +
DLI allows you to use data stored on OBS. You can create OBS tables on DLI to access and process data in your OBS bucket.
+This section describes how to create an OBS table on DLI, import data to the table, and insert and query table data.
+Creating a Database on DLI
+create database testdb;+
The subsequent operations in this section are performed in the testdb database.
The main difference between DataSource syntax and Hive syntax lies in the range of supported table data storage formats and the maximum number of partitions. For the key differences in creating OBS tables using these two types of syntax, refer to Table 1.
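For instance, the following minimal sketch contrasts how the partition column is declared under the two types of syntax; the table names and OBS paths are placeholders, and Table 1 summarizes the remaining differences.
-- DataSource syntax: the partition column appears in the column list and is referenced in PARTITIONED BY.
-- (ds_example, hive_example, and the OBS paths below are illustrative only.)
CREATE TABLE ds_example (name STRING, score DOUBLE, classNo INT)
USING csv OPTIONS (path "obs://bucket-name/ds_example") PARTITIONED BY (classNo);

-- Hive syntax: the partition column is declared, with its data type, only in PARTITIONED BY.
CREATE TABLE hive_example (name STRING, score DOUBLE)
PARTITIONED BY (classNo INT) STORED AS TEXTFILE LOCATION 'obs://bucket-name/hive_example';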
+ +Syntax + |
+Data Types + |
+Partitioning + |
+Number of Partitions + |
+
---|---|---|---|
DataSource + |
+ORC, PARQUET, JSON, CSV, and AVRO + |
+You need to specify the partitioning column in both CREATE TABLE and PARTITIONED BY statements. For details, see Creating a Single-Partition OBS Table Using DataSource Syntax. + |
+A maximum of 7,000 partitions can be created in a single table. + |
+
Hive + |
+TEXTFILE, AVRO, ORC, SEQUENCEFILE, RCFILE, and PARQUET + |
+Do not specify the partitioning column in the CREATE TABLE statement. Specify the column name and data type in the PARTITIONED BY statement. For details, see Creating an OBS Table Using Hive Syntax. + |
+A maximum of 100,000 partitions can be created in a single table. + |
+
The following describes how to create an OBS table for CSV files. The methods of creating OBS tables for other file formats are similar.
+Jordon,88,23 +Kim,87,25 +Henry,76,26+
CREATE TABLE testcsvdatasource (name STRING, score DOUBLE, classNo INT +) USING csv OPTIONS (path "obs://dli-test-021/test.csv");+
If you create an OBS table from a specific file, you cannot insert data into the table with DLI. The table data always reflects the content of the OBS file.
+select * from testcsvdatasource;+
Jordon,88,23 +Kim,87,25 +Henry,76,26 +Aarn,98,20+
select * from testcsvdatasource;+
CREATE TABLE testcsvdata2source (name STRING, score DOUBLE, classNo INT) USING csv OPTIONS (path "obs://dli-test-021/data");+
insert into testcsvdata2source VALUES('Aarn','98','20');+
select * from testcsvdata2source;+
Jordon,88,23 +Kim,87,25 +Henry,76,26+
CREATE TABLE testcsvdata3source (name STRING, score DOUBLE, classNo INT) USING csv OPTIONS (path "obs://dli-test-021/data2");+
insert into testcsvdata3source VALUES('Aarn','98','20');+
select * from testcsvdata3source;+
CREATE TABLE testcsvdata4source (name STRING, score DOUBLE, classNo INT) USING csv OPTIONS (path "obs://dli-test-021/data3") PARTITIONED BY (classNo);+
Jordon,88,25 +Kim,87,25 +Henry,76,25+
ALTER TABLE + testcsvdata4source +ADD + PARTITION (classNo = 25) LOCATION 'obs://dli-test-021/data3/classNo=25';+
select * from testcsvdata4source where classNo = 25;+
insert into testcsvdata4source VALUES('Aarn','98','25'); +insert into testcsvdata4source VALUES('Adam','68','24');+
When a partitioned table is queried using the where condition, the partition must be specified. Otherwise, the query fails and "DLI.0005: There should be at least one partition pruning predicate on partitioned table" is reported.
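As an illustration, the first query below omits the partition predicate and would fail with the DLI.0005 error, while the queries that follow, which filter on the partition column classNo, succeed.
-- Fails: no partition pruning predicate on the partitioned table.
select * from testcsvdata4source;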
+select * from testcsvdata4source where classNo = 25;+
select * from testcsvdata4source where classNo = 24;+
CREATE TABLE testcsvdata5source (name STRING, score DOUBLE, classNo INT, dt varchar(16)) USING csv OPTIONS (path "obs://dli-test-021/data4") PARTITIONED BY (classNo,dt);+
insert into testcsvdata5source VALUES('Aarn','98','25','2021-07-27'); +insert into testcsvdata5source VALUES('Adam','68','25','2021-07-28');+
select * from testcsvdata5source where classNo = 25;+
select * from testcsvdata5source where dt like '2021-07%';+
Jordon,88,24,2021-07-29 +Kim,87,24,2021-07-29 +Henry,76,24,2021-07-29+
ALTER TABLE + testcsvdata5source +ADD + PARTITION (classNo = 24,dt='2021-07-29') LOCATION 'obs://dli-test-021/data4/classNo=24/dt=2021-07-29';+
select * from testcsvdata5source where classNo = 24;+
select * from testcsvdata5source where dt like '2021-07%';+
The following describes how to create an OBS table for TEXTFILE files. The methods of creating OBS tables for other file formats are similar.
+Jordon,88,23 +Kim,87,25 +Henry,76,26+
CREATE TABLE hiveobstable (name STRING, score DOUBLE, classNo INT) STORED AS TEXTFILE LOCATION 'obs://dli-test-021/data5' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';+
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' indicates that fields within a record are separated by commas (,).
+select * from hiveobstable;+
insert into hiveobstable VALUES('Aarn','98','25'); +insert into hiveobstable VALUES('Adam','68','25');+
select * from hiveobstable;+
Create an OBS Table Containing Data of Multiple Formats
+Jordon,88-22,23:21 +Kim,87-22,25:22 +Henry,76-22,26:23+
CREATE TABLE hiveobstable2 (name STRING, hobbies ARRAY<string>, address map<string,string>) STORED AS TEXTFILE LOCATION 'obs://dli-test-021/data6' +ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' +COLLECTION ITEMS TERMINATED BY '-' +MAP KEYS TERMINATED BY ':';+
select * from hiveobstable2;+
CREATE TABLE IF NOT EXISTS hiveobstable3(name STRING, score DOUBLE) PARTITIONED BY (classNo INT) STORED AS TEXTFILE LOCATION 'obs://dli-test-021/data7' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';+
You can specify the partition key in the PARTITIONED BY statement. Do not specify the partition key in the CREATE TABLE IF NOT EXISTS statement. The following is an incorrect example:
+CREATE TABLE IF NOT EXISTS hiveobstable3(name STRING, score DOUBLE, classNo INT) PARTITIONED BY (classNo) STORED AS TEXTFILE LOCATION 'obs://dli-test-021/data7';
+insert into hiveobstable3 VALUES('Aarn','98','25'); +insert into hiveobstable3 VALUES('Adam','68','25');+
select * from hiveobstable3 where classNo = 25;+
Jordon,88,24 +Kim,87,24 +Henry,76,24+
ALTER TABLE + hiveobstable3 +ADD + PARTITION (classNo = 24) LOCATION 'obs://dli-test-021/data7/classNo=24';+
select * from hiveobstable3 where classNo = 24;+
DLI.0005: There should be at least one partition pruning predicate on partitioned table `xxxx`.`xxxx`.;+
Cause: The partition key is not specified in the query statement of a partitioned table.
+Solution: Ensure that the where condition contains at least one partition key.
+CREATE TABLE testcsvdatasource (name string, id int) USING csv OPTIONS (path "obs://dli-test-021/data/test.csv");+
Cause: Data cannot be inserted if a specific file is used in the table creation statement. For example, the OBS file obs://dli-test-021/data/test.csv is used in the preceding example.
+CREATE TABLE testcsvdatasource (name string, id int) USING csv OPTIONS (path "obs://dli-test-021/data");+
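Solution: Create the table with an OBS directory path, as in the corrected statement above. Data can then be inserted as usual; the values below are purely illustrative.
-- Illustrative values only.
insert into testcsvdatasource VALUES('Aarn', 98);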
CREATE TABLE IF NOT EXISTS testtable(name STRING, score DOUBLE, classNo INT) PARTITIONED BY (classNo) STORED AS TEXTFILE LOCATION 'obs://dli-test-021/data7';+
Cause: The partition key is specified in the column list that follows the table name. Solution: Specify the partition key only in the PARTITIONED BY statement, as shown below.
+CREATE TABLE IF NOT EXISTS testtable(name STRING, score DOUBLE) PARTITIONED BY (classNo INT) STORED AS TEXTFILE LOCATION 'obs://dli-test-021/data7';+
DLI allows you to use a Hive User Defined Aggregation Function (UDAF) to process multiple rows of data. A Hive UDAF is usually used together with GROUP BY and, like SUM() and AVG() in standard SQL, is an aggregation function.
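For example, once such a UDAF has been created (the AvgFilterUDAFDemo function developed later in this section), it can be called together with GROUP BY just like a built-in aggregation function; the table and column names in this sketch are illustrative only.
-- Aggregate the score column for each class with the custom UDAF.
-- (student_scores, classNo, and score are placeholder names.)
select classNo, AvgFilterUDAFDemo(score) AS filtered_value
from student_scores
group by classNo;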
+To grant required permissions, log in to the DLI console and choose Data Management > Package Management. On the displayed page, select your UDAF Jar package and click Manage Permissions in the Operation column. On the permission management page, click Grant Permission in the upper right corner and select the required permissions.
+Before you start, set up the development environment.
+ +Item + |
+Description + |
+
---|---|
OS + |
+Windows 7 or later + |
+
JDK + |
+JDK 1.8 (Java downloads). + |
+
IntelliJ IDEA + |
+IntelliJ IDEA is used for application development. The version of the tool must be 2019.1 or later. + |
+
Maven + |
+Basic configuration of the development environment. For details about how to get started, see Downloading Apache Maven and Installing Apache Maven. Maven is used for project management throughout the lifecycle of software development. + |
+
The following figure shows the process of developing a UDAF.
+No. + |
+Phase + |
+Software Portal + |
+Description + |
+
---|---|---|---|
1 + |
+Create a Maven project and configure the POM file. + |
+IntelliJ IDEA + |
+Compile the UDAF function code by referring to the Procedure description. + |
+
2 + |
Edit the UDAF code. + |
+||
3 + |
+Debug, compile, and pack the code into a Jar package. + |
+||
4 + |
+Upload the Jar package to OBS. + |
+OBS console + |
+Upload the UDAF Jar file to an OBS path. + |
+
5 + |
+Create a DLI package. + |
+DLI console + |
+Select the UDAF Jar file that has been uploaded to OBS for management. + |
+
6 + |
+Create a UDAF on DLI. + |
+DLI console + |
+Create a UDAF on the SQL job management page of the DLI console. + |
+
7 + |
+Verify and use the UDAF. + |
+DLI console + |
+Use the UDAF in your DLI job. + |
+
<dependencies> + <dependency> + <groupId>org.apache.hive</groupId> + <artifactId>hive-exec</artifactId> + <version>1.2.1</version> + </dependency> + </dependencies>+
Create a Java Class file in the package path. In this example, the Java Class file is AvgFilterUDAFDemo.
+For details about how to implement the UDAF, see the following sample code:
+package com.dli.demo; + +import org.apache.hadoop.hive.ql.exec.UDAF; +import org.apache.hadoop.hive.ql.exec.UDAFEvaluator; + +/*** + * @jdk jdk1.8.0 + * @version 1.0 + ***/ +public class AvgFilterUDAFDemo extends UDAF { + + /** + * Defines the static inner class AvgFilter. + */ + public static class PartialResult + { + public Long sum; + } + + public static class VarianceEvaluator implements UDAFEvaluator { + + // Initializes the PartialResult object. + private AvgFilterUDAFDemo.PartialResult partial; + + // Declares a VarianceEvaluator constructor that has no parameters. + public VarianceEvaluator(){ + + this.partial = new AvgFilterUDAFDemo.PartialResult(); + + init(); + } + + /** + * Initializes the UDAF, which is similar to a constructor. + */ + @Override + public void init() { + + // Sets the initial value of sum. + this.partial.sum = 0L; + } + + /** + * Receives input parameters for internal iteration. + * @param x + * @return + */ + public void iterate(Long x) { + if (x == null) { + return; + } + AvgFilterUDAFDemo.PartialResult tmp9_6 = this.partial; + tmp9_6.sum = tmp9_6.sum | x; + } + + /** + * Returns the data obtained after the iterate traversal is complete. + * terminatePartial is similar to Hadoop Combiner. + * @return + */ + public AvgFilterUDAFDemo.PartialResult terminatePartial() + { + return this.partial; + } + + /** + * Receives the return values of terminatePartial and merges the data. + * @param + * @return + */ + public void merge(AvgFilterUDAFDemo.PartialResult pr) + { + if (pr == null) { + return; + } + AvgFilterUDAFDemo.PartialResult tmp9_6 = this.partial; + tmp9_6.sum = tmp9_6.sum | pr.sum; + } + + /** + * Returns the aggregated result. + * @return + */ + public Long terminate() + { + if (this.partial.sum == null) { + return 0L; + } + return this.partial.sum; + } + } +}+
After the compilation is successful, click package.
+The region of the OBS bucket to which the Jar package is uploaded must be the same as the region of the DLI queue. Cross-region operations are not allowed.
If the UDAF reloading function is enabled, the create statement differs: use CREATE OR REPLACE FUNCTION, as shown in the second statement below.
+CREATE FUNCTION AvgFilterUDAFDemo AS 'com.dli.demo.AvgFilterUDAFDemo' using jar 'obs://dli-test-obs01/MyUDAF-1.0-SNAPSHOT.jar';+
Or
+CREATE OR REPLACE FUNCTION AvgFilterUDAFDemo AS 'com.dli.demo.AvgFilterUDAFDemo' using jar 'obs://dli-test-obs01/MyUDAF-1.0-SNAPSHOT.jar';+
Use the UDAF created in step 6 in a query statement:
+select AvgFilterUDAFDemo(real_stock_rate) AS show_rate FROM dw_ad_estimate_real_stock_rate limit 1000;+
If the UDAF is no longer used, run the following statement to delete it:
+Drop FUNCTION AvgFilterUDAFDemo;+
Release Date + |
+What's New + |
+
---|---|
2024-04-30 + |
+Modified the following section: +In Connecting to Mongo, modified "mongo" in the sample table name and added additional clarification that DDS is compatible with the MongoDB protocol. + |
+
2024-02-27 + |
+Modified the following section: +Modified the description of the url parameter in the sample code in Scala Example Code. + |
+
2024-01-05 + |
+This issue is the first official release. + |
+
This guide provides reference for Flink 1.12 only.
This example queries the top three most-clicked offerings in each hour from a stream of real-time click data. The real-time click data is sent to Kafka as the input source, and the analysis result is output to RDS.
+For example, enter the following sample data:
+{"user_id":"0001", "user_name":"Alice", "event_time":"2021-03-24 08:01:00", "product_id":"0002", "product_name":"name1"} +{"user_id":"0002", "user_name":"Bob", "event_time":"2021-03-24 08:02:00", "product_id":"0002", "product_name":"name1"} +{"user_id":"0002", "user_name":"Bob", "event_time":"2021-03-24 08:06:00", "product_id":"0004", "product_name":"name2"} +{"user_id":"0001", "user_name":"Alice", "event_time":"2021-03-24 08:10:00", "product_id":"0003", "product_name":"name3"} +{"user_id":"0003", "user_name":"Cindy", "event_time":"2021-03-24 08:15:00", "product_id":"0005", "product_name":"name4"} +{"user_id":"0003", "user_name":"Cindy", "event_time":"2021-03-24 08:16:00", "product_id":"0005", "product_name":"name4"} +{"user_id":"0001", "user_name":"Alice", "event_time":"2021-03-24 08:56:00", "product_id":"0004", "product_name":"name2"} +{"user_id":"0001", "user_name":"Alice", "event_time":"2021-03-24 09:05:00", "product_id":"0005", "product_name":"name4"} +{"user_id":"0001", "user_name":"Alice", "event_time":"2021-03-24 09:10:00", "product_id":"0006", "product_name":"name5"} +{"user_id":"0002", "user_name":"Bob", "event_time":"2021-03-24 09:13:00", "product_id":"0006", "product_name":"name5"}+
2021-03-24 08:00:00 - 2021-03-24 08:59:59,0002,name1,2 +2021-03-24 08:00:00 - 2021-03-24 08:59:59,0004,name2,2 +2021-03-24 08:00:00 - 2021-03-24 08:59:59,0005,name4,2 +2021-03-24 09:00:00 - 2021-03-24 09:59:59,0006,name5,2 +2021-03-24 09:00:00 - 2021-03-24 09:59:59,0005,name4,1+
Step 3: Create an RDS Database and Table
+ + + +The queue name can contain only digits, letters, and underscores (_), but cannot contain only digits or start with an underscore (_). The name must contain 1 to 128 characters.
+The queue name is case-insensitive. Uppercase letters will be automatically converted to lowercase letters.
+The CIDR block of a queue cannot overlap with the CIDR blocks of DMS Kafka and RDS for MySQL DB instances. Otherwise, datasource connections will fail to be created.
+Retain default values for other parameters.
+CREATE TABLE clicktop ( + `range_time` VARCHAR(64) NOT NULL, + `product_id` VARCHAR(32) NOT NULL, + `product_name` VARCHAR(32), + `event_count` VARCHAR(32), + PRIMARY KEY (`range_time`,`product_id`) +) ENGINE = InnoDB + DEFAULT CHARACTER SET = utf8mb4;+
Click OK. Click the name of the created datasource connection to view its status. You can perform subsequent steps only after the connection status changes to Active.
+Click OK. Click the name of the created datasource connection to view its status. You can perform subsequent steps only after the connection status changes to Active.
In this example, the syntax version of Flink OpenSource SQL is 1.12, the data source is Kafka, and the result data is written to RDS.
create table click_product(
  user_id string, --ID of the user
  user_name string, --Username
  event_time string, --Click time
  product_id string, --Offering ID
  product_name string --Offering name
) with (
  "connector" = "kafka",
  "properties.bootstrap.servers" = "10.128.0.120:9092,10.128.0.89:9092,10.128.0.83:9092",-- Internal network address and port number of the Kafka instance
  "properties.group.id" = "click",
  "topic" = "testkafkatopic",--Name of the created Kafka topic
  "format" = "json",
  "scan.startup.mode" = "latest-offset"
);

--Result table
create table top_product (
  range_time string, --Calculated time range
  product_id string, --Offering ID
  product_name string, --Offering name
  event_count bigint, --Number of clicks
  primary key (range_time, product_id) not enforced
) with (
  "connector" = "jdbc",
  "url" = "jdbc:mysql://192.168.12.148:3306/testrdsdb",--testrdsdb indicates the name of the created RDS database. Replace the IP address and port number with those of the RDS DB instance.
  "table-name" = "clicktop",
  "pwd_auth_name"="xxxxx", -- Name of the datasource authentication of the password type created on DLI. If datasource authentication is used, you do not need to set the username and password for the job.
  "sink.buffer-flush.max-rows" = "1000",
  "sink.buffer-flush.interval" = "1s"
);

create view current_event_view
as
  select product_id, product_name, count(1) as click_count, concat(substring(event_time, 1, 13), ":00:00") as min_event_time, concat(substring(event_time, 1, 13), ":59:59") as max_event_time
  from click_product group by substring(event_time, 1, 13), product_id, product_name;

insert into top_product
  select
    concat(min_event_time, " - ", max_event_time) as range_time,
    product_id,
    product_name,
    click_count
  from (
    select *,
      row_number() over (partition by min_event_time order by click_count desc) as row_num
    from current_event_view
  )
  where row_num <= 3
The sample data is as follows:
+{"user_id":"0001", "user_name":"Alice", "event_time":"2021-03-24 08:01:00", "product_id":"0002", "product_name":"name1"} +{"user_id":"0002", "user_name":"Bob", "event_time":"2021-03-24 08:02:00", "product_id":"0002", "product_name":"name1"} +{"user_id":"0002", "user_name":"Bob", "event_time":"2021-03-24 08:06:00", "product_id":"0004", "product_name":"name2"} +{"user_id":"0001", "user_name":"Alice", "event_time":"2021-03-24 08:10:00", "product_id":"0003", "product_name":"name3"} +{"user_id":"0003", "user_name":"Cindy", "event_time":"2021-03-24 08:15:00", "product_id":"0005", "product_name":"name4"} +{"user_id":"0003", "user_name":"Cindy", "event_time":"2021-03-24 08:16:00", "product_id":"0005", "product_name":"name4"} +{"user_id":"0001", "user_name":"Alice", "event_time":"2021-03-24 08:56:00", "product_id":"0004", "product_name":"name2"} +{"user_id":"0001", "user_name":"Alice", "event_time":"2021-03-24 09:05:00", "product_id":"0005", "product_name":"name4"} +{"user_id":"0001", "user_name":"Alice", "event_time":"2021-03-24 09:10:00", "product_id":"0006", "product_name":"name5"} +{"user_id":"0002", "user_name":"Bob", "event_time":"2021-03-24 09:13:00", "product_id":"0006", "product_name":"name5"}+
select * from `clicktop`;+
This guide provides reference for Flink 1.12 only.
This example analyzes real-time vehicle driving data and collects the records that meet specific conditions. The real-time driving data is read from a Kafka source table, and the analysis result is output to GaussDB(DWS).
+For example, enter the following sample data:
+{"car_id":"3027", "car_owner":"lilei", "car_age":"7", "average_speed":"76", "total_miles":"15000"} +{"car_id":"3028", "car_owner":"hanmeimei", "car_age":"6", "average_speed":"92", "total_miles":"17000"} +{"car_id":"3029", "car_owner":"Ann", "car_age":"10", "average_speed":"81", "total_miles":"230000"}+
{"car_id":"3027", "car_owner":"lilei", "car_age":"7", "average_speed":"76", "total_miles":"15000"}+
When you create the instance, do not enable Kafka SASL_SSL.
+Step 3: Create a GaussDB(DWS) Database and Table
+ + + +The queue name can contain only digits, letters, and underscores (_), but cannot contain only digits or start with an underscore (_). The name must contain 1 to 128 characters.
+The queue name is case-insensitive. Uppercase letters will be automatically converted to lowercase letters.
+The CIDR block of a queue cannot overlap with the CIDR blocks of DMS Kafka and RDS for MySQL DB instances. Otherwise, datasource connections will fail to be created.
+Retain default values for other parameters.
+gsql -d gaussdb -h Connection address of the GaussDB(DWS) cluster -U dbadmin -p 8000 -W password -r+
CREATE DATABASE testdwsdb;+
\q +gsql -d testdwsdb -h Connection address of the GaussDB(DWS) cluster -U dbadmin -p 8000 -W password -r+
create schema test; +set current_schema= test; +drop table if exists qualified_cars; +CREATE TABLE qualified_cars +( + car_id VARCHAR, + car_owner VARCHAR, + car_age INTEGER , + average_speed FLOAT8, + total_miles FLOAT8 +);+
Click OK. Click the name of the created datasource connection to view its status. You can perform subsequent steps only after the connection status changes to Active.
+Click OK. Click the name of the created datasource connection to view its status. You can perform subsequent steps only after the connection status changes to Active.
In this example, the syntax version of Flink OpenSource SQL is 1.12, the data source is Kafka, and the result data is written to GaussDB(DWS).
+create table car_infos( + car_id STRING, + car_owner STRING, + car_age INT, + average_speed DOUBLE, + total_miles DOUBLE +) with ( + "connector" = "kafka", + "properties.bootstrap.servers" = " 10.128.0.120:9092,10.128.0.89:9092,10.128.0.83:9092 ",-- Internal network address and port number of the Kafka instance + "properties.group.id" = "click", + "topic" = " testkafkatopic",--Created Kafka topic + "format" = "json", + "scan.startup.mode" = "latest-offset" +); + +create table qualified_cars ( + car_id STRING, + car_owner STRING, + car_age INT, + average_speed DOUBLE, + total_miles DOUBLE +) +WITH ( + 'connector' = 'gaussdb', + 'driver' = 'com.gauss200.jdbc.Driver', + 'url'='jdbc:gaussdb://192.168.168.16:8000/testdwsdb ', ---192.168.168.16:8000 indicates the internal IP address and port of the GaussDB(DWS) instance. testdwsdb indicates the name of the created GaussDB(DWS) database. + 'table-name' = ' test\".\"qualified_cars', ---test indicates the schema of the created GaussDB(DWS) table, and qualified_cars indicates the GaussDB(DWS) table name. + 'pwd_auth_name'= 'xxxxx', -- Name of the datasource authentication of the password type created on DLI. If datasource authentication is used, you do not need to set the username and password for the job. + 'write.mode' = 'insert' +); + +/** Output information about qualified vehicles **/ +INSERT INTO qualified_cars +SELECT * +FROM car_infos +where average_speed <= 90 and total_miles <= 200000;+
The sample data is as follows:
+{"car_id":"3027", "car_owner":"lilei", "car_age":"7", "average_speed":"76", "total_miles":"15000"} +{"car_id":"3028", "car_owner":"hanmeimei", "car_age":"6", "average_speed":"92", "total_miles":"17000"} +{"car_id":"3029", "car_owner":"Ann", "car_age":"10", "average_speed":"81", "total_miles":"230000"}+
gsql -d testdwsdb -h Connection address of the GaussDB(DWS) cluster -U dbadmin -p 8000 -W password -r+
select * from test.qualified_cars;+
car_id car_owner car_age average_speed total_miles +3027 lilei 7 76.0 15000.0+
This guide provides reference for Flink 1.12 only.
This example analyzes offering purchase data and collects the records that meet specific conditions. The purchase data is read from a Kafka source table, and the analysis result is output to Elasticsearch.
+For example, enter the following sample data:
+{"order_id":"202103241000000001", "order_channel":"webShop", "order_time":"2021-03-24 10:00:00", "pay_amount":"100.00", "real_pay":"100.00", "pay_time":"2021-03-24 10:02:03", "user_id":"0001", "user_name":"Alice", "area_id":"330106"} + +{"order_id":"202103241606060001", "order_channel":"appShop", "order_time":"2021-03-24 16:06:06", "pay_amount":"200.00", "real_pay":"180.00", "pay_time":"2021-03-24 16:10:06", "user_id":"0002", "user_name":"Jason", "area_id":"330106"}+
DLI reads data from Kafka and writes the data to Elasticsearch. You can view the result in Kibana of the Elasticsearch cluster.
+Step 3: Create an Elasticsearch Index
+ + + +The queue name can contain only digits, letters, and underscores (_), but cannot contain only digits or start with an underscore (_). The name must contain 1 to 128 characters.
+The queue name is case-insensitive. Uppercase letters will be automatically converted to lowercase letters.
The CIDR block of a queue cannot overlap with the CIDR blocks of the data source instances used in this example (the DMS for Kafka instance and the CSS cluster). Otherwise, datasource connections will fail to be created.
+Retain default values for other parameters.
+PUT /shoporders +{ + "settings": { + "number_of_shards": 1 + }, + "mappings": { + "properties": { + "order_id": { + "type": "text" + }, + "order_channel": { + "type": "text" + }, + "order_time": { + "type": "text" + }, + "pay_amount": { + "type": "double" + }, + "real_pay": { + "type": "double" + }, + "pay_time": { + "type": "text" + }, + "user_id": { + "type": "text" + }, + "user_name": { + "type": "text" + }, + "area_id": { + "type": "text" + } + } + } +}+
Click OK. Click the name of the created datasource connection to view its status. You can perform subsequent steps only after the connection status changes to Active.
+Click OK. Click the name of the created datasource connection to view its status. You can perform subsequent steps only after the connection status changes to Active.
In this example, the syntax version of Flink OpenSource SQL is 1.12, the data source is Kafka, and the result data is written to Elasticsearch.
+CREATE TABLE kafkaSource ( + order_id string, + order_channel string, + order_time string, + pay_amount double, + real_pay double, + pay_time string, + user_id string, + user_name string, + area_id string +) with ( + "connector" = "kafka", + "properties.bootstrap.servers" = "10.128.0.120:9092,10.128.0.89:9092,10.128.0.83:9092",-- Internal network address and port number of the Kafka instance + "properties.group.id" = "click", + "topic" = "testkafkatopic",--Created Kafka topic + "format" = "json", + "scan.startup.mode" = "latest-offset" +);+
CREATE TABLE elasticsearchSink ( + order_id string, + order_channel string, + order_time string, + pay_amount double, + real_pay double, + pay_time string, + user_id string, + user_name string, + area_id string +) WITH ( + 'connector' = 'elasticsearch-7', + 'hosts' = '192.168.168.125:9200', --Private IP address and port of the CSS cluster + 'index' = 'shoporders' --Created Elasticsearch engine +); +--Write Kafka data to Elasticsearch indexes +insert into + elasticsearchSink +select + * +from + kafkaSource;+
Use the Kafka client to send data to topics created in Step 2: Create a Kafka Topic to simulate real-time data streams.
+The sample data is as follows:
+{"order_id":"202103241000000001", "order_channel":"webShop", "order_time":"2021-03-24 10:00:00", "pay_amount":"100.00", "real_pay":"100.00", "pay_time":"2021-03-24 10:02:03", "user_id":"0001", "user_name":"Alice", "area_id":"330106"} + +{"order_id":"202103241606060001", "order_channel":"appShop", "order_time":"2021-03-24 16:06:06", "pay_amount":"200.00", "real_pay":"180.00", "pay_time":"2021-03-24 16:10:06", "user_id":"0002", "user_name":"Jason", "area_id":"330106"}+
GET shoporders/_search+
{ + "took" : 0, + "timed_out" : false, + "_shards" : { + "total" : 1, + "successful" : 1, + "skipped" : 0, + "failed" : 0 + }, + "hits" : { + "total" : { + "value" : 2, + "relation" : "eq" + }, + "max_score" : 1.0, + "hits" : [ + { + "_index" : "shoporders", + "_type" : "_doc", + "_id" : "6fswzIAByVjqg3_qAyM1", + "_score" : 1.0, + "_source" : { + "order_id" : "202103241000000001", + "order_channel" : "webShop", + "order_time" : "2021-03-24 10:00:00", + "pay_amount" : 100.0, + "real_pay" : 100.0, + "pay_time" : "2021-03-24 10:02:03", + "user_id" : "0001", + "user_name" : "Alice", + "area_id" : "330106" + } + }, + { + "_index" : "shoporders", + "_type" : "_doc", + "_id" : "6vs1zIAByVjqg3_qyyPp", + "_score" : 1.0, + "_source" : { + "order_id" : "202103241606060001", + "order_channel" : "appShop", + "order_time" : "2021-03-24 16:06:06", + "pay_amount" : 200.0, + "real_pay" : 180.0, + "pay_time" : "2021-03-24 16:10:06", + "user_id" : "0002", + "user_name" : "Jason", + "area_id" : "330106" + } + } + ] + } +}+
This guide provides reference for Flink 1.12 only.
Change Data Capture (CDC) can synchronize incremental changes from the source database to one or more destinations. During data synchronization, CDC can also process the data, for example, by grouping it (GROUP BY) or joining multiple tables (JOIN).
+This example creates a MySQL CDC source table to monitor MySQL data changes and insert the changed data into a GaussDB(DWS) database.
+Step 2: Create an RDS MySQL Database and Table
+Step 3: Create a GaussDB(DWS) Database and Table
+ + + +The queue name can contain only digits, letters, and underscores (_), but cannot contain only digits or start with an underscore (_). The name must contain 1 to 128 characters.
+The queue name is case-insensitive. Uppercase letters will be automatically converted to lowercase letters.
The CIDR block of a queue cannot overlap with the CIDR blocks of the data source instances used in this example (the RDS for MySQL instance and the GaussDB(DWS) cluster). Otherwise, datasource connections will fail to be created.
+CREATE TABLE mysqlcdc ( + `order_id` VARCHAR(64) NOT NULL, + `order_channel` VARCHAR(32) NOT NULL, + `order_time` VARCHAR(32), + `pay_amount` DOUBLE, + `real_pay` DOUBLE, + `pay_time` VARCHAR(32), + `user_id` VARCHAR(32), + `user_name` VARCHAR(32), + `area_id` VARCHAR(32) + +) ENGINE = InnoDB + DEFAULT CHARACTER SET = utf8mb4;+
gsql -d gaussdb -h Connection address of the GaussDB(DWS) cluster -U dbadmin -p 8000 -W password -r+
CREATE DATABASE testdwsdb;+
\q +gsql -d testdwsdb -h Connection address of the GaussDB(DWS) cluster -U dbadmin -p 8000 -W password -r+
create schema test;
set current_schema= test;
drop table if exists dwsresult;
CREATE TABLE dwsresult
(
  order_channel VARCHAR,
  pay_amount FLOAT8,
  real_pay FLOAT8
);
Click OK. Click the name of the created datasource connection to view its status. You can perform subsequent steps only after the connection status changes to Active.
+Click OK. Click the name of the created datasource connection to view its status. You can perform subsequent steps only after the connection status changes to Active.
In this example, the syntax version of Flink OpenSource SQL is 1.12, the data source is an RDS for MySQL database read through the MySQL CDC connector, and the result data is written to GaussDB(DWS).
create table mysqlCdcSource(
  order_id string,
  order_channel string,
  order_time string,
  pay_amount double,
  real_pay double,
  pay_time string,
  user_id string,
  user_name string,
  area_id STRING
) with (
  'connector' = 'mysql-cdc',
  'hostname' = '192.168.12.148',--IP address of the RDS MySQL instance
  'port'= '3306',--Port number of the RDS MySQL instance
  'pwd_auth_name'= 'xxxxx', -- Name of the datasource authentication of the password type created on DLI. If datasource authentication is used, you do not need to set the username and password for the job.
  'database-name' = 'testrdsdb',--Database name of the RDS MySQL instance
  'table-name' = 'mysqlcdc'--Name of the target table in the database
);

create table dwsSink(
  order_channel string,
  pay_amount double,
  real_pay double,
  primary key(order_channel) not enforced
) with (
  'connector' = 'gaussdb',
  'driver' = 'com.gauss200.jdbc.Driver',
  'url'='jdbc:gaussdb://192.168.168.16:8000/testdwsdb', ---192.168.168.16:8000 indicates the internal IP address and port of the GaussDB(DWS) instance. testdwsdb indicates the name of the created GaussDB(DWS) database.
  'table-name' = 'test\".\"dwsresult', ---test indicates the schema of the created GaussDB(DWS) table, and dwsresult indicates the GaussDB(DWS) table name.
  'pwd_auth_name'= 'xxxxx', -- Name of the datasource authentication of the password type created on DLI. If datasource authentication is used, you do not need to set the username and password for the job.
  'write.mode' = 'insert'
);

insert into dwsSink select order_channel, sum(pay_amount),sum(real_pay) from mysqlCdcSource group by order_channel;
insert into mysqlcdc values +('202103241000000001','webShop','2021-03-24 10:00:00','100.00','100.00','2021-03-24 10:02:03','0001','Alice','330106'), +('202103241206060001','appShop','2021-03-24 12:06:06','200.00','180.00','2021-03-24 16:10:06','0002','Jason','330106'), +('202103241403000001','webShop','2021-03-24 14:03:00','300.00','100.00','2021-03-24 10:02:03','0003','Lily','330106'), +('202103241636060001','appShop','2021-03-24 16:36:06','200.00','150.00','2021-03-24 16:10:06','0001','Henry','330106');+
gsql -d testdwsdb -h Connection address of the GaussDB(DWS) cluster -U dbadmin -p 8000 -W password -r+
select * from test.dwsresult;+
order_channel pay_amount real_pay +appShop 400.0 330.0 +webShop 400.0 200.0+
This guide provides reference for Flink 1.12 only.
Change Data Capture (CDC) can synchronize incremental changes from the source database to one or more destinations. During data synchronization, CDC can also process the data, for example, by grouping it (GROUP BY) or joining multiple tables (JOIN).
+This example creates a PostgreSQL CDC source table to monitor PostgreSQL data changes and insert the changed data into a GaussDB(DWS) database.
+The version of the RDS PostgreSQL database cannot be earlier than 11.
+Step 2: Create an RDS PostgreSQL Database and Table
+Step 3: Create a GaussDB(DWS) Database and Table
+ + + +The queue name can contain only digits, letters, and underscores (_), but cannot contain only digits or start with an underscore (_). The name must contain 1 to 128 characters.
+The queue name is case-insensitive. Uppercase letters will be automatically converted to lowercase letters.
The CIDR block of a queue cannot overlap with the CIDR blocks of the data source instances used in this example (the RDS for PostgreSQL instance and the GaussDB(DWS) cluster). Otherwise, datasource connections will fail to be created.
+create table test.cdc_order( + order_id VARCHAR, + order_channel VARCHAR, + order_time VARCHAR, + pay_amount FLOAT8, + real_pay FLOAT8, + pay_time VARCHAR, + user_id VARCHAR, + user_name VARCHAR, + area_id VARCHAR, + primary key(order_id));+
ALTER TABLE test.cdc_order REPLICA IDENTITY FULL;+
gsql -d gaussdb -h Connection address of the GaussDB(DWS) cluster -U dbadmin -p 8000 -W password -r+
CREATE DATABASE testdwsdb;+
\q +gsql -d testdwsdb -h Connection address of the GaussDB(DWS) cluster -U dbadmin -p 8000 -W password -r+
create schema test; +set current_schema= test; +drop table if exists dws_order; +CREATE TABLE dws_order +( + order_id VARCHAR, + order_channel VARCHAR, + order_time VARCHAR, + pay_amount FLOAT8, + real_pay FLOAT8, + pay_time VARCHAR, + user_id VARCHAR, + user_name VARCHAR, + area_id VARCHAR +);+
Click OK. Click the name of the created datasource connection to view its status. You can perform subsequent steps only after the connection status changes to Active.
+Click OK. Click the name of the created datasource connection to view its status. You can perform subsequent steps only after the connection status changes to Active.
In this example, the syntax version of Flink OpenSource SQL is 1.12, the data source is an RDS for PostgreSQL database read through the PostgreSQL CDC connector, and the result data is written to GaussDB(DWS).
+Parameter + |
+Description + |
+
---|---|
Queue + |
+A shared queue is selected by default. You can select a CCE queue with dedicated resources and configure the following parameters: +UDF Jar: UDF Jar file. Before selecting such a file, upload the corresponding JAR file to the OBS bucket and choose Data Management > Package Management to create a package. For details, see . +In SQL, you can call a UDF that is inserted into a JAR file. + NOTE:
+When creating a job, a sub-user can only select the queue that has been allocated to the user. +If the remaining capacity of the selected queue cannot meet the job requirements, the system automatically scales up the capacity and you will be billed based on the increased capacity. When a queue is idle, the system automatically scales in its capacity. + |
+
CUs + |
+Sum of the number of compute units and job manager CUs of DLI. One CU equals 1 vCPU and 4 GB. +The value is the number of CUs required for job running and cannot exceed the number of CUs in the bound queue. + |
+
Job Manager CUs + |
+Number of CUs of the management unit. + |
+
Parallelism + |
Maximum number of parallel tasks that the Flink OpenSource SQL job runs at the same time. + NOTE:
+This value cannot be greater than four times the compute units (number of CUs minus the number of JobManager CUs). + |
+
Task Manager Configuration + |
+Whether to set Task Manager resource parameters. +If this option is selected, you need to set the following parameters: +
|
+
OBS Bucket + |
+OBS bucket to store job logs and checkpoint information. If the selected OBS bucket is not authorized, click Authorize. + |
+
Save Job Log + |
+Whether to save job run logs to OBS. The logs are saved in Bucket name/jobs/logs/Directory starting with the job ID. + CAUTION:
+You are advised to configure this parameter. Otherwise, no run log is generated after the job is executed. If the job fails, the run log cannot be obtained for fault locating. +If this option is selected, you need to set the following parameters: +OBS Bucket: Select an OBS bucket to store user job logs. If the selected OBS bucket is not authorized, click Authorize.
+ NOTE:
+If Enable Checkpointing and Save Job Log are both selected, you only need to authorize OBS once. + |
+
Alarm Generation upon Job Exception + |
+Whether to notify users of any job exceptions, such as running exceptions or arrears, via SMS or email. +If this option is selected, you need to set the following parameters: +SMN Topic +Select a user-defined SMN topic. For details about how to create a custom SMN topic, see "Creating a Topic" in Simple Message Notification User Guide. + |
+
Enable Checkpointing + |
+Whether to enable job snapshots. If this function is enabled, jobs can be restored based on checkpoints. +If this option is selected, you need to set the following parameters:
+
|
+
Auto Restart upon Exception + |
+Whether to enable automatic restart. If this function is enabled, jobs will be automatically restarted and restored when exceptions occur. +If this option is selected, you need to set the following parameters: +
|
+
Idle State Retention Time + |
+How long the state of a key is retained without being updated before it is removed in GroupBy or Window. The default value is 1 hour. + |
+
Dirty Data Policy + |
+Policy for processing dirty data. The following policies are supported: Ignore, Trigger a job exception, and Save. +If you set this field to Save, Dirty Data Dump Address must be set. Click the address box to select the OBS path for storing dirty data. + |
+
create table PostgreCdcSource( + order_id string, + order_channel string, + order_time string, + pay_amount double, + real_pay double, + pay_time string, + user_id string, + user_name string, + area_id STRING, + primary key (order_id) not enforced +) with ( + 'connector' = 'postgres-cdc', + 'hostname' = ' 192.168.15.153',--IP address of the PostgreSQL instance + 'port'= ' 5432',--Port number of the PostgreSQL instance + 'pwd_auth_name'= 'xxxxx', -- Name of the datasource authentication of the password type created on DLI. If datasource authentication is used, you do not need to set the username and password for the job. + 'database-name' = ' testrdsdb',--Database name of the PostgreSQL instance + 'schema-name' = ' test',-- Schema in the PostgreSQL database + 'table-name' = ' cdc_order'--Table name in the PostgreSQL database +); + +create table dwsSink( + order_id string, + order_channel string, + order_time string, + pay_amount double, + real_pay double, + pay_time string, + user_id string, + user_name string, + area_id STRING, + primary key(order_id) not enforced +) with ( + 'connector' = 'gaussdb', + 'driver' = 'com.gauss200.jdbc.Driver', + 'url'='jdbc:gaussdb://192.168.168.16:8000/testdwsdb ', ---192.168.168.16:8000 indicates the internal IP address and port of the GaussDB(DWS) instance. testdwsdb indicates the name of the created GaussDB(DWS) database. + 'table-name' = ' test\".\"dws_order', ---test indicates the schema of the created GaussDB(DWS) table, and dws_order indicates the GaussDB(DWS) table name. + 'username' = 'xxxxx',--Username of the GaussDB(DWS) instance + 'password' = 'xxxxx',--Password of the GaussDB(DWS) instance + 'write.mode' = 'insert' +); + +insert into dwsSink select * from PostgreCdcSource where pay_amount > 100; ++
insert into test.cdc_order values +('202103241000000001','webShop','2021-03-24 10:00:00','50.00','100.00','2021-03-24 10:02:03','0001','Alice','330106'), +('202103251606060001','appShop','2021-03-24 12:06:06','200.00','180.00','2021-03-24 16:10:06','0002','Jason','330106'), +('202103261000000001','webShop','2021-03-24 14:03:00','300.00','100.00','2021-03-24 10:02:03','0003','Lily','330106'), +('202103271606060001','appShop','2021-03-24 16:36:06','99.00','150.00','2021-03-24 16:10:06','0001','Henry','330106');+
gsql -d testdwsdb -h Connection address of the GaussDB(DWS) cluster -U dbadmin -p 8000 -W password -r+
select * from test.dws_order;+
order_id order_channel order_time pay_amount real_pay pay_time user_id user_name area_id
202103251606060001 appShop 2021-03-24 12:06:06 200.0 180.0 2021-03-24 16:10:06 0002 Jason 330106
202103261000000001 webShop 2021-03-24 14:03:00 300.0 100.0 2021-03-24 10:02:03 0003 Lily 330106
DLI supports the native Spark DataSource capability and other extended capabilities. You can use SQL statements or Spark jobs to access other data storage services and import, query, analyze, and process data. Currently, DLI supports the following datasource access services: CloudTable, Cloud Search Service (CSS), Distributed Cache Service (DCS), Document Database Service (DDS), GaussDB(DWS), MapReduce Service (MRS), and Relational Database Service (RDS). To use the datasource capability of DLI, you need to create a datasource connection first.
When you use Spark jobs to access other data sources, you can write the job code in Scala, PySpark, or Java.
+ +A datasource connection has been created on the DLI management console.
+1 +2 +3 +4 +5 | <dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency> + |
1 +2 | import org.apache.spark.sql.{Row, SaveMode, SparkSession} +import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType} + |
1 | val sparkSession = SparkSession.builder().getOrCreate() + |
1 +2 +3 +4 | sparkSession.sql("create table css_table(id int, name string) using css options( + 'es.nodes' 'to-css-1174404221-Y2bKVIqY.datasource.com:9200', + 'es.nodes.wan.only'='true', + 'resource' '/mytest/css')") + |
Parameter + |
+Description + |
+
---|---|
es.nodes + |
+CSS connection address. You need to create a datasource connection first. +If you have created an enhanced datasource connection, use the intranet IP address provided by CSS. The address format is IP1:PORT1,IP2:PORT2. + |
+
resource + |
Name of the resource for the CSS datasource connection. You can use /index/type to specify the resource location (for easier understanding, the index may be seen as a database and the type as a table). + NOTE:
+
|
+
pushdown + |
Whether to enable the pushdown function of CSS. The default value is true. For tables with a large number of I/O requests, the pushdown function helps reduce I/O pressure when the where condition is specified. + |
+
strict + |
Whether the CSS pushdown is strict. The default value is false. Strict (exact-match) pushdown can reduce I/O requests further than non-strict pushdown. + |
+
batch.size.entries + |
Maximum number of entries that can be inserted in a batch. The default value is 1000. If individual records are so large that the batch.size.bytes limit is reached before this entry count, buffering stops and the batch is submitted based on the batch.size.bytes parameter. + |
+
batch.size.bytes + |
Maximum amount of data in a single batch. The default value is 1 MB. If individual records are so small that the batch.size.entries limit is reached before this data volume, buffering stops and the batch is submitted based on the batch.size.entries parameter. + |
+
es.nodes.wan.only + |
+Whether to access the Elasticsearch node using only the domain name. The default value is false. If the original internal IP address provided by CSS is used as the es.nodes, you do not need to set this parameter or set it to false. + |
+
es.mapping.id + |
+Document field name that contains the document ID in the Elasticsearch node. + NOTE:
+
|
+
batch.size.entries and batch.size.bytes limit the number of data records and data volume respectively.
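For reference, both options can also be tuned in the OPTIONS clause when the table is created. The following minimal sketch reuses the connection settings from the example above; the batch values are illustrative only.
-- The batch.size values below are illustrative, not recommended settings.
create table css_table_batch(id int, name string) using css options(
  'es.nodes' = 'to-css-1174404221-Y2bKVIqY.datasource.com:9200',
  'es.nodes.wan.only' = 'true',
  'resource' = '/mytest/css',
  'batch.size.entries' = '500',
  'batch.size.bytes' = '1mb');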
+1 | sparkSession.sql("insert into css_table values(13, 'John'),(22, 'Bob')") + |
1 +2 | val dataFrame = sparkSession.sql("select * from css_table") +dataFrame.show() + |
Before data is inserted:
+Response:
+1 | sparkSession.sql("drop table css_table") + |
1 +2 | val resource = "/mytest/css" +val nodes = "to-css-1174405013-Ht7O1tYf.datasource.com:9200" + |
1 +2 | val schema = StructType(Seq(StructField("id", IntegerType, false), StructField("name", StringType, false))) +val rdd = sparkSession.sparkContext.parallelize(Seq(Row(12, "John"),Row(21,"Bob"))) + |
1 +2 +3 +4 +5 +6 +7 | val dataFrame_1 = sparkSession.createDataFrame(rdd, schema) +dataFrame_1.write + .format("css") + .option("resource", resource) + .option("es.nodes", nodes) + .mode(SaveMode.Append) + .save() + |
The value of SaveMode can be ErrorIfExists, Append, Overwrite, or Ignore.
+1 +2 | val dataFrameR = sparkSession.read.format("css").option("resource",resource).option("es.nodes", nodes).load() +dataFrameR.show() + |
Before data is inserted:
+Response:
+spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/css/*
+spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/css/*
+1 +2 +3 +4 +5 | <dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency> + |
1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24 +25 +26 +27 | import org.apache.spark.sql.SparkSession + +object Test_SQL_CSS { + def main(args: Array[String]): Unit = { + // Create a SparkSession session. + val sparkSession = SparkSession.builder().getOrCreate() + + // Create a DLI data table for DLI-associated CSS + sparkSession.sql("create table css_table(id long, name string) using css options( + 'es.nodes' = 'to-css-1174404217-QG2SwbVV.datasource.com:9200', + 'es.nodes.wan.only' = 'true', + 'resource' = '/mytest/css')") + + //*****************************SQL model*********************************** + // Insert data into the DLI data table + sparkSession.sql("insert into css_table values(13, 'John'),(22, 'Bob')") + + // Read data from DLI data table + val dataFrame = sparkSession.sql("select * from css_table") + dataFrame.show() + + // drop table + sparkSession.sql("drop table css_table") + + sparkSession.close() + } +} + |
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object Test_SQL_CSS {
  def main(args: Array[String]): Unit = {
    //Create a SparkSession session.
    val sparkSession = SparkSession.builder().getOrCreate()

    //*****************************DataFrame model***********************************
    // Setting the /index/type of CSS
    val resource = "/mytest/css"

    // Define the cross-origin connection address of the CSS cluster
    val nodes = "to-css-1174405013-Ht7O1tYf.datasource.com:9200"

    //Setting schema
    val schema = StructType(Seq(StructField("id", IntegerType, false), StructField("name", StringType, false)))

    // Construction data
    val rdd = sparkSession.sparkContext.parallelize(Seq(Row(12, "John"), Row(21, "Bob")))

    // Create a DataFrame from RDD and schema
    val dataFrame_1 = sparkSession.createDataFrame(rdd, schema)

    //Write data to the CSS
    dataFrame_1.write
      .format("css")
      .option("resource", resource)
      .option("es.nodes", nodes)
      .mode(SaveMode.Append)
      .save()

    //Read data
    val dataFrameR = sparkSession.read.format("css").option("resource", resource).option("es.nodes", nodes).load()
    dataFrameR.show()

    sparkSession.close()
  }
}
1 +2 +3 +4 +5 | <dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency> + |
1 +2 | import org.apache.spark.sql.{Row, SaveMode, SparkSession} +import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType} + |
Hard-coded or plaintext AK and SK pose significant security risks. To ensure security, encrypt your AK and SK, store them in configuration files or environment variables, and decrypt them when needed.
+1 +2 +3 +4 +5 | val sparkSession = SparkSession.builder().getOrCreate() +sparkSession.conf.set("fs.obs.access.key", ak) +sparkSession.conf.set("fs.obs.secret.key", sk) +sparkSession.conf.set("fs.obs.endpoint", enpoint) +sparkSession.conf.set("fs.obs.connecton.ssl.enabled", "false") + |
1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 | sparkSession.sql("create table css_table(id int, name string) using css options( + 'es.nodes' 'to-css-1174404221-Y2bKVIqY.datasource.com:9200', + 'es.nodes.wan.only'='true', + 'resource'='/mytest/css', + 'es.net.ssl'='true', + 'es.net.ssl.keystore.location'='obs://Bucket name/path/transport-keystore.jks', + 'es.net.ssl.keystore.pass'='***', + 'es.net.ssl.truststore.location'='obs://Bucket name/path/truststore.jks', + 'es.net.ssl.truststore.pass'='***', + 'es.net.http.auth.user'='admin', + 'es.net.http.auth.pass'='***')") + |
Parameter + |
+Description + |
+
---|---|
es.nodes + |
+CSS connection address. You need to create a datasource connection first. +If you have created an enhanced datasource connection, use the intranet IP address provided by CSS. The address format is IP1:PORT1,IP2:PORT2. + |
+
resource + |
Name of the resource for the CSS datasource connection. You can use /index/type to specify the resource location (for easier understanding, the index may be seen as a database and the type as a table). + NOTE:
+1. In Elasticsearch 6.X, a single index supports only one type, and the type name can be customized. +2. In Elasticsearch 7.X, a single index uses _doc as the type name and cannot be customized. To access Elasticsearch 7.X, set this parameter to index. + |
+
pushdown + |
Whether to enable the pushdown function of CSS. The default value is true. For tables with a large number of I/O requests, the pushdown function helps reduce I/O pressure when the where condition is specified. + |
+
strict + |
Whether the CSS pushdown is strict. The default value is false. Strict (exact-match) pushdown can reduce I/O requests further than non-strict pushdown. + |
+
batch.size.entries + |
Maximum number of entries that can be inserted in a batch. The default value is 1000. If individual records are so large that the batch.size.bytes limit is reached before this entry count, buffering stops and the batch is submitted based on the batch.size.bytes parameter. + |
+
batch.size.bytes + |
Maximum amount of data in a single batch. The default value is 1 MB. If individual records are so small that the batch.size.entries limit is reached before this data volume, buffering stops and the batch is submitted based on the batch.size.entries parameter. + |
+
es.nodes.wan.only + |
+Whether to access the Elasticsearch node using only the domain name. The default value is false. If the original internal IP address provided by CSS is used as the es.nodes, you do not need to set this parameter or set it to false. + |
+
es.mapping.id + |
+Document field name that contains the document ID in the Elasticsearch node. + NOTE:
+
|
+
es.net.ssl + |
+Whether to connect to the security CSS cluster. The default value is false. + |
+
es.net.ssl.keystore.location + |
+OBS bucket location of the keystore file generated by the security CSS cluster certificate. + |
+
es.net.ssl.keystore.pass + |
+Password of the keystore file generated by the security CSS cluster certificate. + |
+
es.net.ssl.truststore.location + |
+OBS bucket location of the truststore file generated by the security CSS cluster certificate. + |
+
es.net.ssl.truststore.pass + |
+Password of the truststore file generated by the security CSS cluster certificate. + |
+
es.net.http.auth.user + |
+Username of the security CSS cluster. + |
+
es.net.http.auth.pass + |
+Password of the security CSS cluster. + |
+
batch.size.entries and batch.size.bytes limit the number of data records and data volume respectively.
+1 | sparkSession.sql("insert into css_table values(13, 'John'),(22, 'Bob')") + |
1 +2 | val dataFrame = sparkSession.sql("select * from css_table") +dataFrame.show() + |
Before data is inserted:
+Response:
+1 | sparkSession.sql("drop table css_table") + |
1 +2 | val resource = "/mytest/css" +val nodes = "to-css-1174405013-Ht7O1tYf.datasource.com:9200" + |
1 +2 | val schema = StructType(Seq(StructField("id", IntegerType, false), StructField("name", StringType, false))) +val rdd = sparkSession.sparkContext.parallelize(Seq(Row(12, "John"),Row(21,"Bob"))) + |
1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 | val dataFrame_1 = sparkSession.createDataFrame(rdd, schema) +dataFrame_1.write + .format("css") + .option("resource", resource) + .option("es.nodes", nodes) + .option("es.net.ssl", "true") + .option("es.net.ssl.keystore.location", "obs://Bucket name/path/transport-keystore.jks") + .option("es.net.ssl.keystore.pass", "***") + .option("es.net.ssl.truststore.location", "obs://Bucket name/path/truststore.jks") + .option("es.net.ssl.truststore.pass", "***") + .option("es.net.http.auth.user", "admin") + .option("es.net.http.auth.pass", "***") + .mode(SaveMode.Append) + .save() + |
The value of SaveMode can be ErrorIfExists, Append, Overwrite, or Ignore.
+1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 | val dataFrameR = sparkSession.read.format("css") + .option("resource",resource) + .option("es.nodes", nodes) + .option("es.net.ssl", "true") + .option("es.net.ssl.keystore.location", "obs://Bucket name/path/transport-keystore.jks") + .option("es.net.ssl.keystore.pass", "***") + .option("es.net.ssl.truststore.location", "obs://Bucket name/path/truststore.jks") + .option("es.net.ssl.truststore.pass", "***") + .option("es.net.http.auth.user", "admin") + .option("es.net.http.auth.pass", "***") + .load() +dataFrameR.show() + |
Before data is inserted:
+Response:
+1 +2 +3 +4 +5 | <dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency> + |
1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 | import org.apache.spark.sql.SparkSession + +object csshttpstest { + def main(args: Array[String]): Unit = { + //Create a SparkSession session. + val sparkSession = SparkSession.builder().getOrCreate() + // Create a DLI data table for DLI-associated CSS + sparkSession.sql("create table css_table(id long, name string) using css options('es.nodes' = '192.168.6.204:9200','es.nodes.wan.only' = 'false','resource' = '/mytest','es.net.ssl'='true','es.net.ssl.keystore.location' = 'obs://xietest1/lzq/keystore.jks','es.net.ssl.keystore.pass' = '**','es.net.ssl.truststore.location'='obs://xietest1/lzq/truststore.jks','es.net.ssl.truststore.pass'='**','es.net.http.auth.user'='admin','es.net.http.auth.pass'='**')") + + //*****************************SQL model*********************************** + // Insert data into the DLI data table + sparkSession.sql("insert into css_table values(13, 'John'),(22, 'Bob')") + + // Read data from DLI data table + val dataFrame = sparkSession.sql("select * from css_table") + dataFrame.show() + + // drop table + sparkSession.sql("drop table css_table") + + sparkSession.close() + } +} + |
Hard-coded or plaintext AK and SK pose significant security risks. To ensure security, encrypt your AK and SK, store them in configuration files or environment variables, and decrypt them when needed.
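For example, a minimal sketch of reading the ak and sk values used in the following example from environment variables (the variable names DLI_ACCESS_KEY and DLI_SECRET_KEY are placeholders chosen for this illustration):

// Sketch only: obtain the AK/SK from environment variables instead of hard-coding them.
val ak = sys.env("DLI_ACCESS_KEY")   // placeholder environment variable name
val sk = sys.env("DLI_SECRET_KEY")   // placeholder environment variable name
sparkSession.conf.set("fs.obs.access.key", ak)
sparkSession.conf.set("fs.obs.secret.key", sk)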
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object Test_SQL_CSS {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession session.
    val sparkSession = SparkSession.builder().getOrCreate()
    sparkSession.conf.set("fs.obs.access.key", ak)
    sparkSession.conf.set("fs.obs.secret.key", sk)

    //*****************************DataFrame model***********************************
    // Set the /index/type of CSS
    val resource = "/mytest/css"

    // Define the cross-origin connection address of the CSS cluster
    val nodes = "to-css-1174405013-Ht7O1tYf.datasource.com:9200"

    // Set the schema
    val schema = StructType(Seq(StructField("id", IntegerType, false), StructField("name", StringType, false)))

    // Construct data
    val rdd = sparkSession.sparkContext.parallelize(Seq(Row(12, "John"), Row(21, "Bob")))

    // Create a DataFrame from the RDD and schema
    val dataFrame_1 = sparkSession.createDataFrame(rdd, schema)

    // Write data to CSS
    dataFrame_1.write
      .format("css")
      .option("resource", resource)
      .option("es.nodes", nodes)
      .option("es.net.ssl", "true")
      .option("es.net.ssl.keystore.location", "obs://Bucket name/path/transport-keystore.jks")
      .option("es.net.ssl.keystore.pass", "***")
      .option("es.net.ssl.truststore.location", "obs://Bucket name/path/truststore.jks")
      .option("es.net.ssl.truststore.pass", "***")
      .option("es.net.http.auth.user", "admin")
      .option("es.net.http.auth.pass", "***")
      .mode(SaveMode.Append)
      .save()

    // Read data
    val dataFrameR = sparkSession.read.format("css")
      .option("resource", resource)
      .option("es.nodes", nodes)
      .option("es.net.ssl", "true")
      .option("es.net.ssl.keystore.location", "obs://Bucket name/path/transport-keystore.jks")
      .option("es.net.ssl.keystore.pass", "***")
      .option("es.net.ssl.truststore.location", "obs://Bucket name/path/truststore.jks")
      .option("es.net.ssl.truststore.pass", "***")
      .option("es.net.http.auth.user", "admin")
      .option("es.net.http.auth.pass", "***")
      .load()
    dataFrameR.show()

    sparkSession.close()
  }
}
The CloudTable HBase and MRS HBase can be connected to DLI as data sources.
+A datasource connection has been created on the DLI management console.
+Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.3.2</version>
</dependency>
import scala.collection.mutable
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types._

val sparkSession = SparkSession.builder().getOrCreate()

sparkSession.sql("""CREATE TABLE test_hbase('id' STRING, 'location' STRING, 'city' STRING, 'booleanf' BOOLEAN,
  'shortf' SHORT, 'intf' INT, 'longf' LONG, 'floatf' FLOAT, 'doublef' DOUBLE) using hbase OPTIONS (
  'ZKHost'='cloudtable-cf82-zk3-pa6HnHpf.cloudtable.com:2181,cloudtable-cf82-zk2-weBkIrjI.cloudtable.com:2181,cloudtable-cf82-zk1-WY09px9l.cloudtable.com:2181',
  'TableName'='table_DupRowkey1',
  'RowKey'='id:5,location:6,city:7',
  'Cols'='booleanf:CF1.booleanf,shortf:CF1.shortf,intf:CF1.intf,longf:CF1.longf,floatf:CF1.floatf,doublef:CF1.doublef')""")
If Kerberos authentication is enabled for the cluster, also specify the krb5conf, keytab, and principal options when creating the table:

sparkSession.sql("""CREATE TABLE test_hbase('id' STRING, 'location' STRING, 'city' STRING, 'booleanf' BOOLEAN,
  'shortf' SHORT, 'intf' INT, 'longf' LONG, 'floatf' FLOAT, 'doublef' DOUBLE) using hbase OPTIONS (
  'ZKHost'='cloudtable-cf82-zk3-pa6HnHpf.cloudtable.com:2181,cloudtable-cf82-zk2-weBkIrjI.cloudtable.com:2181,cloudtable-cf82-zk1-WY09px9l.cloudtable.com:2181',
  'TableName'='table_DupRowkey1',
  'RowKey'='id:5,location:6,city:7',
  'Cols'='booleanf:CF1.booleanf,shortf:CF1.shortf,intf:CF1.intf,longf:CF1.longf,floatf:CF1.floatf,doublef:CF1.doublef',
  'krb5conf'='./krb5.conf',
  'keytab' = './user.keytab',
  'principal' = 'krbtest')""")
Parameter | Description
---|---
ZKHost | ZooKeeper IP address of the HBase cluster. You need to create a datasource connection first.
RowKey | Row key field of the table connected to DLI. Single and composite row keys are supported. A single row key can be of the numeric or string type; its length does not need to be specified. A composite row key supports only fixed-length string data, in the format "Attribute name 1:Length,Attribute name 2:Length".
Cols | Mapping between the fields in the DLI table and those in the CloudTable table. The DLI table field is placed before the colon (:) and the CloudTable field after it. A period (.) separates the column family and the column name of the CloudTable table. For example: "DLI table field 1:Column family.CloudTable field 1,DLI table field 2:Column family.CloudTable field 2".
krb5conf | Path of the krb5.conf file. This parameter is required when Kerberos authentication is enabled. The format is './krb5.conf'. For details, see Completing Configurations for Enabling Kerberos Authentication.
keytab | Path of the keytab file. This parameter is required when Kerberos authentication is enabled. The format is './user.keytab'. For details, see Completing Configurations for Enabling Kerberos Authentication.
principal | Username created for Kerberos authentication.
1 | sparkSession.sql("insert into test_hbase values('12345','abc','guiyang',false,null,3,23,2.3,2.34)") + |
1 | sparkSession.sql("select * from test_hbase").show () + |
1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 | val attrId = new StructField("id",StringType) +val location = new StructField("location",StringType) +val city = new StructField("city",StringType) +val booleanf = new StructField("booleanf",BooleanType) +val shortf = new StructField("shortf",ShortType) +val intf = new StructField("intf",IntegerType) +val longf = new StructField("longf",LongType) +val floatf = new StructField("floatf",FloatType) +val doublef = new StructField("doublef",DoubleType) +val attrs = Array(attrId, location,city,booleanf,shortf,intf,longf,floatf,doublef) + |
1 +2 | val mutableRow: Seq[Any] = Seq("12345","abc","city1",false,null,3,23,2.3,2.34) +val rddData: RDD[Row] = sparkSession.sparkContext.parallelize(Array(Row.fromSeq(mutableRow)), 1) + |
1 | sparkSession.createDataFrame(rddData, new StructType(attrs)).write.insertInto("test_hbase") + |
1 +2 +3 +4 +5 +6 +7 +8 | val map = new mutable.HashMap[String, String]() +map("TableName") = "table_DupRowkey1" +map("RowKey") = "id:5,location:6,city:7" +map("Cols") = "booleanf:CF1.booleanf,shortf:CF1.shortf,intf:CF1.intf,longf:CF1.longf,floatf:CF1.floatf,doublef:CF1.doublef" +map("ZKHost")="cloudtable-cf82-zk3-pa6HnHpf.cloudtable.com:2181, + cloudtable-cf82-zk2-weBkIrjI.cloudtable.com:2181, + cloudtable-cf82-zk1-WY09px9l.cloudtable.com:2181" +sparkSession.read.schema(new StructType(attrs)).format("hbase").options(map.toMap).load().show() + |
spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/hbase/*
+spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/hbase/*
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.3.2</version>
</dependency>
import org.apache.spark.sql.SparkSession

object Test_SparkSql_HBase {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession session.
    val sparkSession = SparkSession.builder().getOrCreate()

    /**
     * Create an association table for the DLI-associated HBase table
     */
    sparkSession.sql("""CREATE TABLE test_hbase('id' STRING, 'location' STRING, 'city' STRING, 'booleanf' BOOLEAN,
      'shortf' SHORT, 'intf' INT, 'longf' LONG, 'floatf' FLOAT, 'doublef' DOUBLE) using hbase OPTIONS (
      'ZKHost'='cloudtable-cf82-zk3-pa6HnHpf.cloudtable.com:2181,cloudtable-cf82-zk2-weBkIrjI.cloudtable.com:2181,cloudtable-cf82-zk1-WY09px9l.cloudtable.com:2181',
      'TableName'='table_DupRowkey1',
      'RowKey'='id:5,location:6,city:7',
      'Cols'='booleanf:CF1.booleanf,shortf:CF1.shortf,intf:CF1.intf,longf:CF1.longf,floatf:CF1.floatf,doublef:CF1.doublef')""")

    //*****************************SQL model***********************************
    sparkSession.sql("insert into test_hbase values('12345','abc','city1',false,null,3,23,2.3,2.34)")
    sparkSession.sql("select * from test_hbase").collect()

    sparkSession.close()
  }
}
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

import java.io.{File, FileInputStream, FileOutputStream}

object Test_SparkSql_HBase_Kerberos {

  def copyFile2(Input: String)(OutPut: String): Unit = {
    val fis = new FileInputStream(Input)
    val fos = new FileOutputStream(OutPut)
    val buf = new Array[Byte](1024)
    var len = 0
    while ({len = fis.read(buf); len} != -1) {
      fos.write(buf, 0, len)
    }
    fos.close()
    fis.close()
  }

  def main(args: Array[String]): Unit = {
    // Create a SparkSession session.
    val sparkSession = SparkSession.builder().getOrCreate()
    val sc = sparkSession.sparkContext
    sc.addFile("OBS address of krb5.conf")
    sc.addFile("OBS address of user.keytab")
    Thread.sleep(10)

    val krb5_startfile = new File(SparkFiles.get("krb5.conf"))
    val keytab_startfile = new File(SparkFiles.get("user.keytab"))
    val path_user = System.getProperty("user.dir")
    val keytab_endfile = new File(path_user + "/" + keytab_startfile.getName)
    val krb5_endfile = new File(path_user + "/" + krb5_startfile.getName)
    println(keytab_endfile)
    println(krb5_endfile)

    var krbinput = SparkFiles.get("krb5.conf")
    var krboutput = path_user + "/krb5.conf"
    copyFile2(krbinput)(krboutput)

    var keytabinput = SparkFiles.get("user.keytab")
    var keytaboutput = path_user + "/user.keytab"
    copyFile2(keytabinput)(keytaboutput)
    Thread.sleep(10)

    /**
     * Create an association table for the DLI-associated HBase table
     */
    sparkSession.sql("CREATE TABLE testhbase(id string,booleanf boolean,shortf short,intf int,longf long,floatf float,doublef double) " +
      "using hbase OPTIONS(" +
      "'ZKHost'='10.0.0.146:2181'," +
      "'TableName'='hbtest'," +
      "'RowKey'='id:100'," +
      "'Cols'='booleanf:CF1.booleanf,shortf:CF1.shortf,intf:CF1.intf,longf:CF2.longf,floatf:CF1.floatf,doublef:CF2.doublef'," +
      "'krb5conf'='" + path_user + "/krb5.conf'," +
      "'keytab'='" + path_user + "/user.keytab'," +
      "'principal'='krbtest') ")

    //*****************************SQL model***********************************
    sparkSession.sql("insert into testhbase values('newtest',true,1,2,3,4,5)")
    val result = sparkSession.sql("select * from testhbase")
    result.show()

    sparkSession.close()
  }
}
import scala.collection.mutable

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types._

object Test_SparkSql_HBase {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession session.
    val sparkSession = SparkSession.builder().getOrCreate()

    // Create an association table for the DLI-associated HBase table
    sparkSession.sql("""CREATE TABLE test_hbase('id' STRING, 'location' STRING, 'city' STRING, 'booleanf' BOOLEAN,
      'shortf' SHORT, 'intf' INT, 'longf' LONG, 'floatf' FLOAT, 'doublef' DOUBLE) using hbase OPTIONS (
      'ZKHost'='cloudtable-cf82-zk3-pa6HnHpf.cloudtable.com:2181,cloudtable-cf82-zk2-weBkIrjI.cloudtable.com:2181,cloudtable-cf82-zk1-WY09px9l.cloudtable.com:2181',
      'TableName'='table_DupRowkey1',
      'RowKey'='id:5,location:6,city:7',
      'Cols'='booleanf:CF1.booleanf,shortf:CF1.shortf,intf:CF1.intf,longf:CF1.longf,floatf:CF1.floatf,doublef:CF1.doublef')""")

    //*****************************DataFrame model***********************************
    // Set the schema
    val attrId = new StructField("id", StringType)
    val location = new StructField("location", StringType)
    val city = new StructField("city", StringType)
    val booleanf = new StructField("booleanf", BooleanType)
    val shortf = new StructField("shortf", ShortType)
    val intf = new StructField("intf", IntegerType)
    val longf = new StructField("longf", LongType)
    val floatf = new StructField("floatf", FloatType)
    val doublef = new StructField("doublef", DoubleType)
    val attrs = Array(attrId, location, city, booleanf, shortf, intf, longf, floatf, doublef)

    // Populate data according to the type of the schema
    val mutableRow: Seq[Any] = Seq("12345", "abc", "city1", false, null, 3, 23, 2.3, 2.34)
    val rddData: RDD[Row] = sparkSession.sparkContext.parallelize(Array(Row.fromSeq(mutableRow)), 1)

    // Import the constructed data into HBase
    sparkSession.createDataFrame(rddData, new StructType(attrs)).write.insertInto("test_hbase")

    // Read data from HBase
    val map = new mutable.HashMap[String, String]()
    map("TableName") = "table_DupRowkey1"
    map("RowKey") = "id:5,location:6,city:7"
    map("Cols") = "booleanf:CF1.booleanf,shortf:CF1.shortf,intf:CF1.intf,longf:CF1.longf,floatf:CF1.floatf,doublef:CF1.doublef"
    map("ZKHost") = "cloudtable-cf82-zk3-pa6HnHpf.cloudtable.com:2181,cloudtable-cf82-zk2-weBkIrjI.cloudtable.com:2181,cloudtable-cf82-zk1-WY09px9l.cloudtable.com:2181"
    sparkSession.read.schema(new StructType(attrs)).format("hbase").options(map.toMap).load().collect()

    sparkSession.close()
  }
}
The CloudTable OpenTSDB and MRS OpenTSDB can be connected to DLI as data sources.
+A datasource connection has been created on the DLI management console.
+Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.3.2</version>
</dependency>
import scala.collection.mutable
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types._

val sparkSession = SparkSession.builder().getOrCreate()

sparkSession.sql("""create table opentsdb_test using opentsdb options(
  'Host'='opentsdb-3xcl8dir15m58z3.cloudtable.com:4242',
  'metric'='ctopentsdb',
  'tags'='city,location')""")
Parameter | Description
---|---
Host | OpenTSDB IP address.
metric | Name of the metric in OpenTSDB corresponding to the DLI table to be created.
tags | Tags corresponding to the metric, used for operations such as classification, filtering, and quick search. A maximum of 8 tags, covering all tagk values under the metric, can be specified, separated by commas (,).
1 | sparkSession.sql("insert into opentsdb_test values('futian', 'abc', '1970-01-02 18:17:36', 30.0)") + |
1 | sparkSession.sql("select * from opentsdb_test").show() + |
1 +2 +3 +4 +5 | val attrTag1Location = new StructField("location", StringType) +val attrTag2Name = new StructField("name", StringType) +val attrTimestamp = new StructField("timestamp", LongType) +val attrValue = new StructField("value", DoubleType) +val attrs = Array(attrTag1Location, attrTag2Name, attrTimestamp, attrValue) + |
1 +2 | val mutableRow: Seq[Any] = Seq("aaa", "abc", 123456L, 30.0) +val rddData: RDD[Row] = sparkSession.sparkContext.parallelize(Array(Row.fromSeq(mutableRow)), 1) + |
1 | sparkSession.createDataFrame(rddData, new StructType(attrs)).write.insertInto("opentsdb_test") + |
1 +2 +3 +4 +5 | val map = new mutable.HashMap[String, String]() +map("metric") = "ctopentsdb" +map("tags") = "city,location" +map("Host") = "opentsdb-3xcl8dir15m58z3.cloudtable.com:4242" +sparkSession.read.format("opentsdb").options(map.toMap).load().show() + |
spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/opentsdb/*
+spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/opentsdb/*
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.3.2</version>
</dependency>
import org.apache.spark.sql.SparkSession

object Test_OpenTSDB_CT {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession session.
    val sparkSession = SparkSession.builder().getOrCreate()

    // Create a data table for the DLI-associated OpenTSDB
    sparkSession.sql("""create table opentsdb_test using opentsdb options(
      'Host'='opentsdb-3xcl8dir15m58z3.cloudtable.com:4242',
      'metric'='ctopentsdb',
      'tags'='city,location')""")

    //*****************************SQL module***********************************
    sparkSession.sql("insert into opentsdb_test values('futian', 'abc', '1970-01-02 18:17:36', 30.0)")
    sparkSession.sql("select * from opentsdb_test").show()

    sparkSession.close()
  }
}
import scala.collection.mutable
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types._

object Test_OpenTSDB_CT {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession session.
    val sparkSession = SparkSession.builder().getOrCreate()

    // Create a data table for the DLI-associated OpenTSDB
    sparkSession.sql("""create table opentsdb_test using opentsdb options(
      'Host'='opentsdb-3xcl8dir15m58z3.cloudtable.com:4242',
      'metric'='ctopentsdb',
      'tags'='city,location')""")

    //*****************************DataFrame model***********************************
    // Set the schema
    val attrTag1Location = new StructField("location", StringType)
    val attrTag2Name = new StructField("name", StringType)
    val attrTimestamp = new StructField("timestamp", LongType)
    val attrValue = new StructField("value", DoubleType)
    val attrs = Array(attrTag1Location, attrTag2Name, attrTimestamp, attrValue)

    // Populate data according to the type of the schema
    val mutableRow: Seq[Any] = Seq("aaa", "abc", 123456L, 30.0)
    val rddData: RDD[Row] = sparkSession.sparkContext.parallelize(Array(Row.fromSeq(mutableRow)), 1)

    // Import the constructed data into OpenTSDB
    sparkSession.createDataFrame(rddData, new StructType(attrs)).write.insertInto("opentsdb_test")

    // Read data from OpenTSDB
    val map = new mutable.HashMap[String, String]()
    map("metric") = "ctopentsdb"
    map("tags") = "city,location"
    map("Host") = "opentsdb-3xcl8dir15m58z3.cloudtable.com:4242"
    sparkSession.read.format("opentsdb").options(map.toMap).load().show()

    sparkSession.close()
  }
}
A datasource connection has been created and bound to a queue on the DLI management console.
+Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.3.2</version>
</dependency>
import java.util.Properties
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.SaveMode

val sparkSession = SparkSession.builder().getOrCreate()

sparkSession.sql(
  "CREATE TABLE IF NOT EXISTS dli_to_rds USING JDBC OPTIONS (
     'url'='jdbc:mysql://to-rds-1174404209-cA37siB6.datasource.com:3306',   // Set this parameter to the actual URL.
     'dbtable'='test.customer',
     'user'='root',   // Set this parameter to the actual user.
     'password'='######',   // Set this parameter to the actual password.
     'driver'='com.mysql.jdbc.Driver')")
Parameter | Description
---|---
url | To obtain an RDS IP address, you need to create a datasource connection first. Refer to the Data Lake Insight User Guide for details. If you have created an enhanced datasource connection, use the internal network domain name or internal network address and the database port number provided by RDS. For MySQL, the format is "Protocol header://Internal IP address:Internal network port number". For PostgreSQL, the format is "Protocol header://Internal IP address:Internal network port number/Database name". For example: jdbc:mysql://192.168.0.193:3306 or jdbc:postgresql://192.168.0.193:3306/postgres.
dbtable | To connect to a MySQL cluster, enter "Database name.Table name". To connect to a PostgreSQL cluster, enter "Schema name.Table name". Note: If the database and table do not exist, create them first; otherwise, the system reports an error and the job fails.
user | RDS database username.
password | RDS database password.
driver | JDBC driver class name. To connect to a MySQL cluster, enter com.mysql.jdbc.Driver. To connect to a PostgreSQL cluster, enter org.postgresql.Driver.
partitionColumn | Numeric field used to read data concurrently. Note: This parameter must be set together with lowerBound, upperBound, and numPartitions.
lowerBound | Minimum value of the column specified by partitionColumn. The value is included in the returned result.
upperBound | Maximum value of the column specified by partitionColumn. The value is not included in the returned result.
numPartitions | Number of concurrent read operations. Note: When data is read, lowerBound and upperBound are evenly allocated to each task. Example: 'partitionColumn'='id', 'lowerBound'='0', 'upperBound'='100', 'numPartitions'='2'. Two concurrent tasks are started in DLI: one reads rows whose id is greater than or equal to 0 and less than 50, and the other reads rows whose id is greater than or equal to 50 and less than 100. (A partitioned-read sketch using these options follows this table.)
fetchsize | Number of data records obtained in each batch during data reading. The default value is 1000. A larger value improves performance but occupies more memory and may cause memory overflow.
batchsize | Number of data records written in each batch. The default value is 1000. A larger value improves performance but occupies more memory and may cause memory overflow.
truncate | Whether to clear the table without deleting it when overwrite is executed. The options are true (truncate the existing table instead of dropping it) and false (drop the original table and then create a new one). The default value is false.
isolationLevel | Transaction isolation level. The options are NONE, READ_UNCOMMITTED, READ_COMMITTED, REPEATABLE_READ, and SERIALIZABLE. The default value is READ_UNCOMMITTED.
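The partition-related options above (partitionColumn, lowerBound, upperBound, numPartitions) and fetchsize can also be passed through the standard Spark JDBC read API used later in this section. The following is a minimal sketch only: it reuses the url, username, password, and dbtable values defined in the DataFrame examples below, and the bounds and partition count are illustrative.

// Sketch: read test.customer in two concurrent partitions split on the numeric id column.
// Adjust the bounds and partition count to the actual value range of the column.
val partitionedDF = sparkSession.read.format("jdbc")
  .option("url", url)
  .option("dbtable", dbtable)
  .option("user", username)
  .option("password", password)
  .option("driver", "com.mysql.jdbc.Driver")
  .option("partitionColumn", "id")
  .option("lowerBound", "0")
  .option("upperBound", "100")
  .option("numPartitions", "2")
  .option("fetchsize", "1000")
  .load()
partitionedDF.show()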
1 | sparkSession.sql("insert into dli_to_rds values(1, 'John',24),(2, 'Bob',32)") + |
1 +2 | val dataFrame = sparkSession.sql("select * from dli_to_rds") +dataFrame.show() + |
Before data is inserted
+After data is inserted
+1 | sparkSession.sql("drop table dli_to_rds") + |
1 +2 +3 +4 | val url = "jdbc:mysql://to-rds-1174405057-EA1Kgo8H.datasource.com:3306" +val username = "root" +val password = "######" +val dbtable = "test.customer" + |
1 +2 +3 +4 | var dataFrame_1 = sparkSession.createDataFrame(List((8, "Jack_1", 18))) +val df = dataFrame_1.withColumnRenamed("_1", "id") + .withColumnRenamed("_2", "name") + .withColumnRenamed("_3", "age") + |
1 +2 +3 +4 +5 +6 +7 +8 | df.write.format("jdbc") + .option("url", url) + .option("dbtable", dbtable) + .option("user", username) + .option("password", password) + .option("driver", "com.mysql.jdbc.Driver") + .mode(SaveMode.Append) + .save() + |
The value of SaveMode can be one of the following:

- ErrorIfExists: Default value. An error is reported if the target already contains data.
- Append: The new data is appended to the existing data.
- Overwrite: The existing data is overwritten.
- Ignore: The write is skipped if the target already contains data.
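As an illustration of the truncate option described in the table above (a minimal sketch only, reusing df, url, dbtable, username, and password from the preceding steps), SaveMode.Overwrite can be combined with truncate so that the existing rows are cleared while the table definition is kept:

// Sketch: overwrite the existing rows but keep the table itself by enabling truncate.
df.write.format("jdbc")
  .option("url", url)
  .option("dbtable", dbtable)
  .option("user", username)
  .option("password", password)
  .option("driver", "com.mysql.jdbc.Driver")
  .option("truncate", "true")   // clear the table instead of dropping and re-creating it
  .mode(SaveMode.Overwrite)
  .save()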
val jdbcDF = sparkSession.read.format("jdbc")
  .option("url", url)
  .option("dbtable", dbtable)
  .option("user", username)
  .option("password", password)
  .option("driver", "com.mysql.jdbc.Driver")
  .load()

val properties = new Properties()
properties.put("user", username)
properties.put("password", password)
val jdbcDF2 = sparkSession.read.jdbc(url, dbtable, properties)
Before data is inserted
+After data is inserted
Register the DataFrame read by the read.format() or read.jdbc() method as a temporary table. Then, you can use SQL statements to query data.

jdbcDF.registerTempTable("customer_test")
sparkSession.sql("select * from customer_test where id = 1").show()
Query results
The data created by the createDataFrame() method and the data queried by the read.format() and read.jdbc() methods are all DataFrame objects, so you can query records on them directly. (In the preceding step, the DataFrame data is registered as a temporary table.)
+The where statement can be combined with filter expressions such as AND and OR. The DataFrame object after filtering is returned. The following is an example:
jdbcDF.where("id = 1 or age <=10").show()
The filter statement can be used in the same way as where. The DataFrame object after filtering is returned. The following is an example:
jdbcDF.filter("id = 1 or age <=10").show()
The select statement is used to query the DataFrame object of the specified field. Multiple fields can be queried.
+1 | jdbcDF.select("id").show() + |
1 | jdbcDF.select("id", "name").show() + |
1 | jdbcDF.select("id","name").where("id<4").show() + |
selectExpr is used to perform special processing on a field. For example, the selectExpr function can be used to change the field name. The following is an example:
+If you want to set the name field to name_test and add 1 to the value of age, run the following statement:
jdbcDF.selectExpr("id", "name as name_test", "age+1").show()
col is used to obtain a specified field. Different from select, col returns a Column object, and only one field can be obtained at a time. The following is an example:

val idCol = jdbcDF.col("id")

drop is used to delete a specified field. Specify the field to delete (only one field can be deleted at a time); a DataFrame object that does not contain the field is returned. The following is an example:

jdbcDF.drop("id").show()
spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/rds/*
+spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/rds/*
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.3.2</version>
</dependency>
import java.util.Properties
import org.apache.spark.sql.SparkSession

object Test_SQL_RDS {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession session.
    val sparkSession = SparkSession.builder().getOrCreate()

    // Create a data table for the DLI-associated RDS
    sparkSession.sql("""CREATE TABLE IF NOT EXISTS dli_to_rds USING JDBC OPTIONS (
      'url'='jdbc:mysql://to-rds-1174404209-cA37siB6.datasource.com:3306',
      'dbtable'='test.customer',
      'user'='root',
      'password'='######',
      'driver'='com.mysql.jdbc.Driver')""")

    //*****************************SQL model***********************************
    // Insert data into the DLI data table
    sparkSession.sql("insert into dli_to_rds values(1,'John',24),(2,'Bob',32)")

    // Read data from the DLI data table
    val dataFrame = sparkSession.sql("select * from dli_to_rds")
    dataFrame.show()

    // Drop the table
    sparkSession.sql("drop table dli_to_rds")

    sparkSession.close()
  }
}
import java.util.Properties
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SaveMode

object Test_SQL_RDS {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession session.
    val sparkSession = SparkSession.builder().getOrCreate()

    //*****************************DataFrame model***********************************
    // Set the connection configuration parameters: url, username, password, and dbtable.
    val url = "jdbc:mysql://to-rds-1174404209-cA37siB6.datasource.com:3306"
    val username = "root"
    val password = "######"
    val dbtable = "test.customer"

    // Create a DataFrame and initialize the DataFrame data.
    var dataFrame_1 = sparkSession.createDataFrame(List((1, "Jack", 18)))

    // Rename the fields set by the createDataFrame() method.
    val df = dataFrame_1.withColumnRenamed("_1", "id")
      .withColumnRenamed("_2", "name")
      .withColumnRenamed("_3", "age")

    // Write data to the RDS table
    df.write.format("jdbc")
      .option("url", url)
      .option("dbtable", dbtable)
      .option("user", username)
      .option("password", password)
      .option("driver", "com.mysql.jdbc.Driver")
      .mode(SaveMode.Append)
      .save()

    // DataFrame object for data manipulation
    // Filter out the rows whose id is 1
    var newDF = df.filter("id!=1")
    newDF.show()

    // Drop the id column
    var newDF_1 = df.drop("id")
    newDF_1.show()

    // Read the data of the customer table in the RDS database
    // Way one: Read data from RDS using read.format()
    val jdbcDF = sparkSession.read.format("jdbc")
      .option("url", url)
      .option("dbtable", dbtable)
      .option("user", username)
      .option("password", password)
      .option("driver", "com.mysql.jdbc.Driver")
      .load()
    // Way two: Read data from RDS using read.jdbc()
    val properties = new Properties()
    properties.put("user", username)
    properties.put("password", password)
    val jdbcDF2 = sparkSession.read.jdbc(url, dbtable, properties)

    /**
     * Register the DataFrame read by read.format() or read.jdbc() as a temporary table, and query the data
     * using SQL statements.
     */
    jdbcDF.registerTempTable("customer_test")
    val result = sparkSession.sql("select * from customer_test where id = 1")
    result.show()

    sparkSession.close()
  }
}
// The where() method uses "and" and "or" for condition filtering, returning the filtered DataFrame object
jdbcDF.where("id = 1 or age <=10").show()

// The filter() method is used in the same way as the where() method.
jdbcDF.filter("id = 1 or age <=10").show()

// The select() method passes multiple arguments and returns the DataFrame object of the specified fields.
jdbcDF.select("id").show()
jdbcDF.select("id", "name").show()
jdbcDF.select("id", "name").where("id<4").show()

/**
 * The selectExpr() method implements special handling of fields, such as renaming or increasing and
 * decreasing data values.
 */
jdbcDF.selectExpr("id", "name as name_test", "age+1").show()

// The col() method gets one specified field at a time; the return type is Column.
val idCol = jdbcDF.col("id")

/**
 * The drop() method returns a DataFrame object that does not contain the deleted field; only one field
 * can be deleted at a time.
 */
jdbcDF.drop("id").show()
This section provides Scala example code that demonstrates how to use a Spark job to access data from the GaussDB(DWS) data source.
+A datasource connection has been created and bound to a queue on the DLI management console.
+Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.3.2</version>
</dependency>
import java.util.Properties
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.SaveMode

val sparkSession = SparkSession.builder().getOrCreate()

sparkSession.sql(
  "CREATE TABLE IF NOT EXISTS dli_to_dws USING JDBC OPTIONS (
     'url'='jdbc:postgresql://to-dws-1174404209-cA37siB6.datasource.com:8000/postgres',
     'dbtable'='customer',
     'user'='dbadmin',
     'passwdauth'='######'   // Name of the datasource authentication of the password type created on DLI. If datasource authentication is used, you do not need to set the username and password for the job.
  )"
)
Parameter | Description
---|---
url | To obtain a GaussDB(DWS) IP address, you need to create a datasource connection first. Refer to the Data Lake Insight User Guide for details. After an enhanced datasource connection is created, use the JDBC connection string (intranet) provided by GaussDB(DWS), or the intranet IP address and port number, to connect to GaussDB(DWS). The format is "Protocol header://Internal IP address:Internal network port number/Database name", for example: jdbc:postgresql://192.168.0.77:8000/postgres. For details about how to obtain the value, see the GaussDB(DWS) cluster information. Note: Example: jdbc:postgresql://to-dws-1174405119-ihlUr78j.datasource.com:8000/postgres. To connect to a database created in GaussDB(DWS), change postgres to the corresponding database name in this connection string.
passwdauth | Name of the datasource authentication of the password type created on DLI. If datasource authentication is used, you do not need to set the username and password for jobs.
dbtable | Table in the PostgreSQL database.
partitionColumn | Numeric field used to read data concurrently. Note: This parameter must be set together with lowerBound, upperBound, and numPartitions.
lowerBound | Minimum value of the column specified by partitionColumn. The value is included in the returned result.
upperBound | Maximum value of the column specified by partitionColumn. The value is not included in the returned result.
numPartitions | Number of concurrent read operations. Note: When data is read, lowerBound and upperBound are evenly allocated to each task. Example: 'partitionColumn'='id', 'lowerBound'='0', 'upperBound'='100', 'numPartitions'='2'. Two concurrent tasks are started in DLI: one reads rows whose id is greater than or equal to 0 and less than 50, and the other reads rows whose id is greater than or equal to 50 and less than 100.
fetchsize | Number of data records obtained in each batch during data reading. The default value is 1000. A larger value improves performance but occupies more memory and may cause memory overflow.
batchsize | Number of data records written in each batch. The default value is 1000. A larger value improves performance but occupies more memory and may cause memory overflow.
truncate | Whether to clear the table without deleting it when overwrite is executed. The options are true (truncate the existing table instead of dropping it) and false (drop the original table and then create a new one). The default value is false.
isolationLevel | Transaction isolation level. The options are NONE, READ_UNCOMMITTED, READ_COMMITTED, REPEATABLE_READ, and SERIALIZABLE. The default value is READ_UNCOMMITTED.
1 | sparkSession.sql("insert into dli_to_dws values(1, 'John',24),(2, 'Bob',32)") + |
1 +2 | val dataFrame = sparkSession.sql("select * from dli_to_dws") +dataFrame.show() + |
Before data is inserted:
+Response:
+1 | sparkSession.sql("drop table dli_to_dws") + |
1 +2 +3 +4 | val url = "jdbc:postgresql://to-dws-1174405057-EA1Kgo8H.datasource.com:8000/postgres" +val username = "dbadmin" +val password = "######" +val dbtable = "customer" + |
1 +2 +3 +4 | var dataFrame_1 = sparkSession.createDataFrame(List((8, "Jack_1", 18))) +val df = dataFrame_1.withColumnRenamed("_1", "id") + .withColumnRenamed("_2", "name") + .withColumnRenamed("_3", "age") + |
1 +2 +3 +4 +5 +6 +7 | df.write.format("jdbc") + .option("url", url) + .option("dbtable", dbtable) + .option("user", username) + .option("password", password) + .mode(SaveMode.Append) + .save() + |
The options of SaveMode are ErrorIfExists (default), Append, Overwrite, and Ignore.
val jdbcDF = sparkSession.read.format("jdbc")
  .option("url", url)
  .option("dbtable", dbtable)
  .option("user", username)
  .option("password", password)
  .load()

val properties = new Properties()
properties.put("user", username)
properties.put("password", password)
val jdbcDF2 = sparkSession.read.jdbc(url, dbtable, properties)
Before data is inserted:
+Response:
Register the DataFrame read by the read.format() or read.jdbc() method as a temporary table. Then, you can use SQL statements to query data.

jdbcDF.registerTempTable("customer_test")
sparkSession.sql("select * from customer_test where id = 1").show()
Query results
+The data created by the createDataFrame() method and the data queried by the read.format() method and the read.jdbc() method are all DataFrame objects. You can directly query a single record. (In Accessing a Data Source Using a DataFrame API, the DataFrame data is registered as a temporary table.)
+The where statement can be combined with filter expressions such as AND and OR. The DataFrame object after filtering is returned. The following is an example:
jdbcDF.where("id = 1 or age <=10").show()
The filter statement can be used in the same way as where. The DataFrame object after filtering is returned. The following is an example:
jdbcDF.filter("id = 1 or age <=10").show()
The select statement is used to query the DataFrame object of the specified field. Multiple fields can be queried.
+1 | jdbcDF.select("id").show() + |
1 | jdbcDF.select("id", "name").show() + |
1 | jdbcDF.select("id","name").where("id<4").show() + |
The selectExpr statement is used to perform special processing on a field. For example, it can be used to change the field name. The following is an example:
+If you want to set the name field to name_test and add 1 to the value of age, run the following statement:
jdbcDF.selectExpr("id", "name as name_test", "age+1").show()
col is used to obtain a specified field. Different from select, col returns a Column object, and only one field can be obtained at a time. The following is an example:

val idCol = jdbcDF.col("id")

drop is used to delete a specified field. Specify the field to delete (only one field can be deleted at a time); a DataFrame object that does not contain the field is returned. The following is an example:

jdbcDF.drop("id").show()
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.3.2</version>
</dependency>
Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
+1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24 +25 +26 +27 +28 | import java.util.Properties +import org.apache.spark.sql.SparkSession + +object Test_SQL_DWS { + def main(args: Array[String]): Unit = { + // Create a SparkSession session. + val sparkSession = SparkSession.builder().getOrCreate() + // Create a data table for DLI-associated DWS + sparkSession.sql("CREATE TABLE IF NOT EXISTS dli_to_dws USING JDBC OPTIONS ( + 'url'='jdbc:postgresql://to-dws-1174405057-EA1Kgo8H.datasource.com:8000/postgres', + 'dbtable'='customer', + 'user'='dbadmin', + 'password'='######')") + + //*****************************SQL model*********************************** + //Insert data into the DLI data table + sparkSession.sql("insert into dli_to_dws values(1,'John',24),(2,'Bob',32)") + + //Read data from DLI data table + val dataFrame = sparkSession.sql("select * from dli_to_dws") + dataFrame.show() + + //drop table + sparkSession.sql("drop table dli_to_dws") + + sparkSession.close() + } +} + |
Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
+1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24 +25 +26 +27 +28 +29 +30 +31 +32 +33 +34 +35 +36 +37 +38 +39 +40 +41 +42 +43 +44 +45 +46 +47 +48 +49 +50 +51 +52 +53 +54 +55 +56 +57 +58 +59 +60 +61 +62 +63 +64 +65 +66 +67 +68 | import java.util.Properties +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.SaveMode + +object Test_SQL_DWS { + def main(args: Array[String]): Unit = { + // Create a SparkSession session. + val sparkSession = SparkSession.builder().getOrCreate() + + //*****************************DataFrame model*********************************** + // Set the connection configuration parameters. Contains url, username, password, dbtable. + val url = "jdbc:postgresql://to-dws-1174405057-EA1Kgo8H.datasource.com:8000/postgres" + val username = "dbadmin" + val password = "######" + val dbtable = "customer" + + //Create a DataFrame and initialize the DataFrame data. + var dataFrame_1 = sparkSession.createDataFrame(List((1, "Jack", 18))) + + //Rename the fields set by the createDataFrame() method. + val df = dataFrame_1.withColumnRenamed("_1", "id") + .withColumnRenamed("_2", "name") + .withColumnRenamed("_3", "age") + + //Write data to the dws_table_1 table + df.write.format("jdbc") + .option("url", url) + .option("dbtable", dbtable) + .option("user", username) + .option("password", password) + .mode(SaveMode.Append) + .save() + + // DataFrame object for data manipulation + //Filter users with id=1 + var newDF = df.filter("id!=1") + newDF.show() + + // Filter the id column data + var newDF_1 = df.drop("id") + newDF_1.show() + + // Read the data of the customer table in the RDS database + //Way one: Read data from GaussDB(DWS) using read.format() + val jdbcDF = sparkSession.read.format("jdbc") + .option("url", url) + .option("dbtable", dbtable) + .option("user", username) + .option("password", password) + .option("driver", "org.postgresql.Driver") + .load() + //Way two: Read data from GaussDB(DWS) using read.jdbc() + val properties = new Properties() + properties.put("user", username) + properties.put("password", password) + val jdbcDF2 = sparkSession.read.jdbc(url, dbtable, properties) + + /** + * Register the dateFrame read by read.format() or read.jdbc() as a temporary table, and query the data + * using the sql statement. + */ + jdbcDF.registerTempTable("customer_test") + val result = sparkSession.sql("select * from customer_test where id = 1") + result.show() + + sparkSession.close() + } +} + |
The CloudTable HBase and MRS HBase can be connected to DLI as data sources.
+A datasource connection has been created on the DLI management console.
+Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
from __future__ import print_function
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, BooleanType, ShortType, LongType, FloatType, DoubleType
from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName("datasource-hbase").getOrCreate()

sparkSession.sql(
    "CREATE TABLE testhbase(id STRING, location STRING, city STRING) using hbase OPTIONS (\
    'ZKHost' = '192.168.0.189:2181',\
    'TableName' = 'hbtest',\
    'RowKey' = 'id:5',\
    'Cols' = 'location:info.location,city:detail.city')")

If Kerberos authentication is enabled for the cluster, also specify the krb5conf, keytab, and principal options when creating the table:

sparkSession.sql(
    "CREATE TABLE testhbase(id STRING, location STRING, city STRING) using hbase OPTIONS (\
    'ZKHost' = '192.168.0.189:2181',\
    'TableName' = 'hbtest',\
    'RowKey' = 'id:5',\
    'Cols' = 'location:info.location,city:detail.city',\
    'krb5conf' = './krb5.conf',\
    'keytab'='./user.keytab',\
    'principal' ='krbtest')")
For details about how to obtain the krb5.conf and keytab files, see Completing Configurations for Enabling Kerberos Authentication.
sparkSession.sql("insert into testhbase values('95274','abc','Jinan')")

sparkSession.sql("select * from testhbase").show()
sparkSession.sql(
    "CREATE TABLE test_hbase(id STRING, location STRING, city STRING, booleanf BOOLEAN, shortf SHORT, intf INT, longf LONG, floatf FLOAT, doublef DOUBLE) using hbase OPTIONS (\
    'ZKHost' = 'cloudtable-cf82-zk3-pa6HnHpf.cloudtable.com:2181,\
    cloudtable-cf82-zk2-weBkIrjI.cloudtable.com:2181,\
    cloudtable-cf82-zk1-WY09px9l.cloudtable.com:2181',\
    'TableName' = 'table_DupRowkey1',\
    'RowKey' = 'id:5,location:6,city:7',\
    'Cols' = 'booleanf:CF1.booleanf,shortf:CF1.shortf,intf:CF1.intf,longf:CF1.longf,floatf:CF1.floatf,doublef:CF1.doublef')")
schema = StructType([StructField("id", StringType()),
                     StructField("location", StringType()),
                     StructField("city", StringType()),
                     StructField("booleanf", BooleanType()),
                     StructField("shortf", ShortType()),
                     StructField("intf", IntegerType()),
                     StructField("longf", LongType()),
                     StructField("floatf", FloatType()),
                     StructField("doublef", DoubleType())])

dataList = sparkSession.sparkContext.parallelize([("11111", "aaa", "aaa", False, 4, 3, 23, 2.3, 2.34)])

dataFrame = sparkSession.createDataFrame(dataList, schema)

dataFrame.write.insertInto("test_hbase")
# Set cross-source connection parameters
TableName = "table_DupRowkey1"
RowKey = "id:5,location:6,city:7"
Cols = "booleanf:CF1.booleanf,shortf:CF1.shortf,intf:CF1.intf,longf:CF1.longf,floatf:CF1.floatf,doublef:CF1.doublef"
ZKHost = "cloudtable-cf82-zk3-pa6HnHpf.cloudtable.com:2181,cloudtable-cf82-zk2-weBkIrjI.cloudtable.com:2181,cloudtable-cf82-zk1-WY09px9l.cloudtable.com:2181"

# Read data
jdbcDF = sparkSession.read.schema(schema)\
    .format("hbase")\
    .option("ZKHost", ZKHost)\
    .option("TableName", TableName)\
    .option("RowKey", RowKey)\
    .option("Cols", Cols)\
    .load()
jdbcDF.filter("id = '12333' or id='11111'").show()
The lengths of the id, location, and city fields are limited. When inserting data, set the values to the required lengths; otherwise, an encoding error occurs during the query.
+spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/hbase/*
+spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/hbase/*
# _*_ coding: utf-8 _*_
from __future__ import print_function
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, BooleanType, ShortType, LongType, FloatType, DoubleType
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Create a SparkSession session.
    sparkSession = SparkSession.builder.appName("datasource-hbase").getOrCreate()

    sparkSession.sql(
        "CREATE TABLE testhbase(id STRING, location STRING, city STRING) using hbase OPTIONS (\
        'ZKHost' = '192.168.0.189:2181',\
        'TableName' = 'hbtest',\
        'RowKey' = 'id:5',\
        'Cols' = 'location:info.location,city:detail.city')")

    sparkSession.sql("insert into testhbase values('95274','abc','Jinan')")

    sparkSession.sql("select * from testhbase").show()

    # close session
    sparkSession.stop()
# _*_ coding: utf-8 _*_
from __future__ import print_function
from pyspark import SparkFiles
from pyspark.sql import SparkSession
import shutil
import time
import os

if __name__ == "__main__":
    # Create a SparkSession session.
    sparkSession = SparkSession.builder.appName("Test_HBase_SparkSql_Kerberos").getOrCreate()
    sc = sparkSession.sparkContext
    time.sleep(10)

    krb5_startfile = SparkFiles.get("krb5.conf")
    keytab_startfile = SparkFiles.get("user.keytab")
    path_user = os.getcwd()
    krb5_endfile = path_user + "/" + "krb5.conf"
    keytab_endfile = path_user + "/" + "user.keytab"
    shutil.copy(krb5_startfile, krb5_endfile)
    shutil.copy(keytab_startfile, keytab_endfile)
    time.sleep(20)

    sparkSession.sql(
        "CREATE TABLE testhbase(id string,booleanf boolean,shortf short,intf int,longf long,floatf float,doublef double) " +
        "using hbase OPTIONS(" +
        "'ZKHost'='10.0.0.146:2181'," +
        "'TableName'='hbtest'," +
        "'RowKey'='id:100'," +
        "'Cols'='booleanf:CF1.booleanf,shortf:CF1.shortf,intf:CF1.intf,longf:CF2.longf,floatf:CF1.floatf,doublef:CF2.doublef'," +
        "'krb5conf'='" + path_user + "/krb5.conf'," +
        "'keytab'='" + path_user + "/user.keytab'," +
        "'principal'='krbtest') ")

    sparkSession.sql("insert into testhbase values('newtest',true,1,2,3,4,5)")

    sparkSession.sql("select * from testhbase").show()
    # close session
    sparkSession.stop()
# _*_ coding: utf-8 _*_
from __future__ import print_function
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, BooleanType, ShortType, LongType, FloatType, DoubleType
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Create a SparkSession session.
    sparkSession = SparkSession.builder.appName("datasource-hbase").getOrCreate()

    # Create a data table for the DLI-associated CloudTable
    sparkSession.sql(
        "CREATE TABLE test_hbase(id STRING, location STRING, city STRING, booleanf BOOLEAN, shortf SHORT, intf INT, longf LONG, floatf FLOAT, doublef DOUBLE) using hbase OPTIONS (\
        'ZKHost' = 'cloudtable-cf82-zk3-pa6HnHpf.cloudtable.com:2181,\
        cloudtable-cf82-zk2-weBkIrjI.cloudtable.com:2181,\
        cloudtable-cf82-zk1-WY09px9l.cloudtable.com:2181',\
        'TableName' = 'table_DupRowkey1',\
        'RowKey' = 'id:5,location:6,city:7',\
        'Cols' = 'booleanf:CF1.booleanf,shortf:CF1.shortf,intf:CF1.intf,longf:CF1.longf,floatf:CF1.floatf,doublef:CF1.doublef')")

    # Create a DataFrame and initialize the DataFrame data.
    dataList = sparkSession.sparkContext.parallelize([("11111", "aaa", "aaa", False, 4, 3, 23, 2.3, 2.34)])

    # Setting schema
    schema = StructType([StructField("id", StringType()),
                         StructField("location", StringType()),
                         StructField("city", StringType()),
                         StructField("booleanf", BooleanType()),
                         StructField("shortf", ShortType()),
                         StructField("intf", IntegerType()),
                         StructField("longf", LongType()),
                         StructField("floatf", FloatType()),
                         StructField("doublef", DoubleType())])

    # Create a DataFrame from RDD and schema
    dataFrame = sparkSession.createDataFrame(dataList, schema)

    # Write data to CloudTable-HBase
    dataFrame.write.insertInto("test_hbase")

    # Set cross-source connection parameters
    TableName = "table_DupRowkey1"
    RowKey = "id:5,location:6,city:7"
    Cols = "booleanf:CF1.booleanf,shortf:CF1.shortf,intf:CF1.intf,longf:CF1.longf,floatf:CF1.floatf,doublef:CF1.doublef"
    ZKHost = "cloudtable-cf82-zk3-pa6HnHpf.cloudtable.com:2181,cloudtable-cf82-zk2-weBkIrjI.cloudtable.com:2181,cloudtable-cf82-zk1-WY09px9l.cloudtable.com:2181"

    # Read data from CloudTable-HBase
    jdbcDF = sparkSession.read.schema(schema)\
        .format("hbase")\
        .option("ZKHost", ZKHost)\
        .option("TableName", TableName)\
        .option("RowKey", RowKey)\
        .option("Cols", Cols)\
        .load()
    jdbcDF.filter("id = '12333' or id='11111'").show()

    # close session
    sparkSession.stop()
The CloudTable OpenTSDB and MRS OpenTSDB can be connected to DLI as data sources.
+A datasource connection has been created on the DLI management console.
+Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
from __future__ import print_function
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType
from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName("datasource-opentsdb").getOrCreate()

sparkSession.sql(
    "create table opentsdb_test using opentsdb options(\
    'Host'='opentsdb-3xcl8dir15m58z3.cloudtable.com:4242',\
    'metric'='ct_opentsdb',\
    'tags'='city,location')")
sparkSession.sql("insert into opentsdb_test values('aaa', 'abc', '2021-06-30 18:00:00', 30.0)")+
result = sparkSession.sql("SELECT * FROM opentsdb_test")+
schema = StructType([StructField("location", StringType()),
                     StructField("name", StringType()),
                     StructField("timestamp", LongType()),
                     StructField("value", DoubleType())])

dataList = sparkSession.sparkContext.parallelize([("aaa", "abc", 123456, 30.0)])

dataFrame = sparkSession.createDataFrame(dataList, schema)

dataFrame.write.insertInto("opentsdb_test")
jdbdDF = sparkSession.read\
    .format("opentsdb")\
    .option("Host", "opentsdb-3xcl8dir15m58z3.cloudtable.com:4242")\
    .option("metric", "ctopentsdb")\
    .option("tags", "city,location")\
    .load()
jdbdDF.show()
spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/opentsdb/*
+spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/opentsdb/*
# _*_ coding: utf-8 _*_
from __future__ import print_function
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Create a SparkSession session.
    sparkSession = SparkSession.builder.appName("datasource-opentsdb").getOrCreate()

    # Create a DLI cross-source association OpenTSDB data table
    sparkSession.sql(
        "create table opentsdb_test using opentsdb options(\
        'Host'='10.0.0.171:4242',\
        'metric'='cts_opentsdb',\
        'tags'='city,location')")

    sparkSession.sql("insert into opentsdb_test values('aaa', 'abc', '2021-06-30 18:00:00', 30.0)")

    result = sparkSession.sql("SELECT * FROM opentsdb_test")
    result.show()

    # close session
    sparkSession.stop()
# _*_ coding: utf-8 _*_
from __future__ import print_function
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Create a SparkSession session.
    sparkSession = SparkSession.builder.appName("datasource-opentsdb").getOrCreate()

    # Create a DLI cross-source association OpenTSDB data table
    sparkSession.sql(
        "create table opentsdb_test using opentsdb options(\
        'Host'='opentsdb-3xcl8dir15m58z3.cloudtable.com:4242',\
        'metric'='ct_opentsdb',\
        'tags'='city,location')")

    # Create a DataFrame and initialize the DataFrame data.
    dataList = sparkSession.sparkContext.parallelize([("aaa", "abc", 123456, 30.0)])

    # Setting schema
    schema = StructType([StructField("location", StringType()),
                         StructField("name", StringType()),
                         StructField("timestamp", LongType()),
                         StructField("value", DoubleType())])

    # Create a DataFrame from RDD and schema
    dataFrame = sparkSession.createDataFrame(dataList, schema)

    # Set cross-source connection parameters
    metric = "ctopentsdb"
    tags = "city,location"
    Host = "opentsdb-3xcl8dir15m58z3.cloudtable.com:4242"

    # Write data to CloudTable-OpenTSDB
    dataFrame.write.insertInto("opentsdb_test")
    # ******* OpenTSDB does not currently implement the CTAS method to save data, so the save() method cannot be used.*******
    # dataFrame.write.format("opentsdb").option("Host", Host).option("metric", metric).option("tags", tags).mode("Overwrite").save()

    # Read data from CloudTable-OpenTSDB
    jdbdDF = sparkSession.read\
        .format("opentsdb")\
        .option("Host", Host)\
        .option("metric", metric)\
        .option("tags", tags)\
        .load()
    jdbdDF.show()

    # close session
    sparkSession.stop()
A datasource connection has been created and bound to a queue on the DLI management console.
+Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
+1 +2 +3 | from __future__ import print_function +from pyspark.sql.types import StructType, StructField, IntegerType, StringType +from pyspark.sql import SparkSession + |
1 | sparkSession = SparkSession.builder.appName("datasource-rds").getOrCreate() + |
1 +2 +3 +4 +5 | url = "jdbc:mysql://to-rds-1174404952-ZgPo1nNC.datasource.com:3306" +dbtable = "test.customer" +user = "root" +password = "######" +driver = "com.mysql.jdbc.Driver" + |
For details about the parameters, see Table 1.
+1 | dataList = sparkSession.sparkContext.parallelize([(123, "Katie", 19)]) + |
1 +2 +3 | schema = StructType([StructField("id", IntegerType(), False),\ + StructField("name", StringType(), False),\ + StructField("age", IntegerType(), False)]) + |
1 | dataFrame = sparkSession.createDataFrame(dataList, schema) + |
1 +2 +3 +4 +5 +6 +7 +8 +9 | dataFrame.write \ + .format("jdbc") \ + .option("url", url) \ + .option("dbtable", dbtable) \ + .option("user", user) \ + .option("password", password) \ + .option("driver", driver) \ + .mode("Append") \ + .save() + |
The value of mode can be Overwrite, Append, ErrorIfExists, or Ignore.
+1 +2 +3 +4 +5 +6 +7 +8 +9 | jdbcDF = sparkSession.read \ + .format("jdbc") \ + .option("url", url) \ + .option("dbtable", dbtable) \ + .option("user", user) \ + .option("password", password) \ + .option("driver", driver) \ + .load() +jdbcDF.show() + |
1 +2 +3 +4 +5 +6 +7 | sparkSession.sql( + "CREATE TABLE IF NOT EXISTS dli_to_rds USING JDBC OPTIONS (\ + 'url'='jdbc:mysql://to-rds-1174404952-ZgPo1nNC.datasource.com:3306',\ + 'dbtable'='test.customer',\ + 'user'='root',\ + 'password'='######',\ + 'driver'='com.mysql.jdbc.Driver')") + |
For details about the parameters for creating a table, see Table 1.
+1 | sparkSession.sql("insert into dli_to_rds values(3,'John',24)") + |
1 +2 | jdbcDF_after = sparkSession.sql("select * from dli_to_rds") +jdbcDF_after.show() + |
spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/rds/*
spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/rds/*
If you copy the following sample code directly into a .py file, unexpected characters may appear after the backslashes (\). Delete any spaces or indentation that follow a backslash (\).
+# _*_ coding: utf-8 _*_ +from __future__ import print_function +from pyspark.sql.types import StructType, StructField, IntegerType, StringType +from pyspark.sql import SparkSession +if __name__ == "__main__": + # Create a SparkSession session. + sparkSession = SparkSession.builder.appName("datasource-rds").getOrCreate() + + # Set cross-source connection parameters. + url = "jdbc:mysql://to-rds-1174404952-ZgPo1nNC.datasource.com:3306" + dbtable = "test.customer" + user = "root" + password = "######" + driver = "com.mysql.jdbc.Driver" + + # Create a DataFrame and initialize the DataFrame data. + dataList = sparkSession.sparkContext.parallelize([(123, "Katie", 19)]) + + # Setting schema + schema = StructType([StructField("id", IntegerType(), False),\ + StructField("name", StringType(), False),\ + StructField("age", IntegerType(), False)]) + + # Create a DataFrame from RDD and schema + dataFrame = sparkSession.createDataFrame(dataList, schema) + + # Write data to the RDS. + dataFrame.write \ + .format("jdbc") \ + .option("url", url) \ + .option("dbtable", dbtable) \ + .option("user", user) \ + .option("password", password) \ + .option("driver", driver) \ + .mode("Append") \ + .save() + + # Read data + jdbcDF = sparkSession.read \ + .format("jdbc") \ + .option("url", url) \ + .option("dbtable", dbtable) \ + .option("user", user) \ + .option("password", password) \ + .option("driver", driver) \ + .load() + jdbcDF.show() + + # close session + sparkSession.stop()+
# _*_ coding: utf-8 _*_ +from __future__ import print_function +from pyspark.sql import SparkSession + +if __name__ == "__main__": + # Create a SparkSession session. + sparkSession = SparkSession.builder.appName("datasource-rds").getOrCreate() + + # Create a data table for DLI - associated RDS + sparkSession.sql( + "CREATE TABLE IF NOT EXISTS dli_to_rds USING JDBC OPTIONS (\ + 'url'='jdbc:mysql://to-rds-1174404952-ZgPo1nNC.datasource.com:3306',\ + 'dbtable'='test.customer',\ + 'user'='root',\ + 'password'='######',\ + 'driver'='com.mysql.jdbc.Driver')") + + # Insert data into the DLI data table + sparkSession.sql("insert into dli_to_rds values(3,'John',24)") + + # Read data from DLI data table + jdbcDF = sparkSession.sql("select * from dli_to_rds") + jdbcDF.show() + + # close session + sparkSession.stop()+
This section provides PySpark example code that demonstrates how to use a Spark job to access data from the GaussDB(DWS) data source.
A datasource connection has been created and bound to a queue on the DLI management console.
Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
+1 +2 +3 | from __future__ import print_function +from pyspark.sql.types import StructType, StructField, IntegerType, StringType +from pyspark.sql import SparkSession + |
1 | sparkSession = SparkSession.builder.appName("datasource-dws").getOrCreate() + |
1 +2 +3 +4 +5 | url = "jdbc:postgresql://to-dws-1174404951-W8W4cW8I.datasource.com:8000/postgres" +dbtable = "customer" +user = "dbadmin" +password = "######" +driver = "org.postgresql.Driver" + |
1 | dataList = sparkSession.sparkContext.parallelize([(1, "Katie", 19)]) + |
1 +2 +3 | schema = StructType([StructField("id", IntegerType(), False),\ + StructField("name", StringType(), False),\ + StructField("age", IntegerType(), False)]) + |
1 | dataFrame = sparkSession.createDataFrame(dataList, schema) + |
1 +2 +3 +4 +5 +6 +7 +8 +9 | dataFrame.write \ + .format("jdbc") \ + .option("url", url) \ + .option("dbtable", dbtable) \ + .option("user", user) \ + .option("password", password) \ + .option("driver", driver) \ + .mode("Overwrite") \ + .save() + |
The value of mode can be Overwrite, Append, ErrorIfExists, or Ignore.
+1 +2 +3 +4 +5 +6 +7 +8 +9 | jdbcDF = sparkSession.read \ + .format("jdbc") \ + .option("url", url) \ + .option("dbtable", dbtable) \ + .option("user", user) \ + .option("password", password) \ + .option("driver", driver) \ + .load() +jdbcDF.show() + |
sparkSession.sql(
    "CREATE TABLE IF NOT EXISTS dli_to_dws USING JDBC OPTIONS (\
    'url'='jdbc:postgresql://to-dws-1174404951-W8W4cW8I.datasource.com:8000/postgres',\
    'dbtable'='customer',\
    'user'='dbadmin',\
    'password'='######',\
    'driver'='org.postgresql.Driver')")
1 | sparkSession.sql("insert into dli_to_dws values(2,'John',24)") + |
jdbcDF = sparkSession.sql("select * from dli_to_dws")
jdbcDF.show()
spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/dws/*
spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/dws/*
Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
+1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24 +25 +26 +27 +28 +29 +30 +31 +32 +33 +34 +35 +36 +37 +38 +39 +40 +41 +42 +43 +44 +45 +46 +47 +48 +49 +50 +51 | # _*_ coding: utf-8 _*_ +from __future__ import print_function +from pyspark.sql.types import StructType, StructField, IntegerType, StringType +from pyspark.sql import SparkSession + +if __name__ == "__main__": + # Create a SparkSession session. + sparkSession = SparkSession.builder.appName("datasource-dws").getOrCreate() + + # Set cross-source connection parameters + url = "jdbc:postgresql://to-dws-1174404951-W8W4cW8I.datasource.com:8000/postgres" + dbtable = "customer" + user = "dbadmin" + password = "######" + driver = "org.postgresql.Driver" + + # Create a DataFrame and initialize the DataFrame data. + dataList = sparkSession.sparkContext.parallelize([(1, "Katie", 19)]) + + # Setting schema + schema = StructType([StructField("id", IntegerType(), False),\ + StructField("name", StringType(), False),\ + StructField("age", IntegerType(), False)]) + + # Create a DataFrame from RDD and schema + dataFrame = sparkSession.createDataFrame(dataList, schema) + + # Write data to the DWS table + dataFrame.write \ + .format("jdbc") \ + .option("url", url) \ + .option("dbtable", dbtable) \ + .option("user", user) \ + .option("password", password) \ + .option("driver", driver) \ + .mode("Overwrite") \ + .save() + + # Read data + jdbcDF = sparkSession.read \ + .format("jdbc") \ + .option("url", url) \ + .option("dbtable", dbtable) \ + .option("user", user) \ + .option("password", password) \ + .option("driver", driver) \ + .load() + jdbcDF.show() + + # close session + sparkSession.stop() + |
1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24 +25 | # _*_ coding: utf-8 _*_ +from __future__ import print_function +from pyspark.sql import SparkSession + +if __name__ == "__main__": + # Create a SparkSession session. + sparkSession = SparkSession.builder.appName("datasource-dws").getOrCreate() + + # Create a data table for DLI - associated GaussDB(DWS) + sparkSession.sql( + "CREATE TABLE IF NOT EXISTS dli_to_dws USING JDBC OPTIONS (\ + 'url'='jdbc:postgresql://to-dws-1174404951-W8W4cW8I.datasource.com:8000/postgres',\ + 'dbtable'='customer',\ + 'user'='dbadmin',\ + 'password'='######',\ + 'driver'='org.postgresql.Driver')") + + # Insert data into the DLI data table + sparkSession.sql("insert into dli_to_dws values(2,'John',24)") + + # Read data from DLI data table + jdbcDF = sparkSession.sql("select * from dli_to_dws").show() + + # close session + sparkSession.stop() + |
A datasource connection has been created on the DLI management console.
+1 +2 +3 | from __future__ import print_function +from pyspark.sql.types import StructType, StructField, IntegerType, StringType, Row +from pyspark.sql import SparkSession + |
1 | sparkSession = SparkSession.builder.appName("datasource-css").getOrCreate() + |
1 +2 | resource = "/mytest" +nodes = "to-css-1174404953-hDTx3UPK.datasource.com:9200" + |
resource indicates the name of the resource associated with CSS. You can specify the resource location in /index/type format, where the index is analogous to a database and the type to a table.
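For example, assuming an index named my_index and a type named my_type (both hypothetical names), the resource could be written as follows:

# /index/type format: the index plays the role of a database, the type that of a table.
resource = "/my_index/my_type"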
+1 +2 +3 | schema = StructType([StructField("id", IntegerType(), False), + StructField("name", StringType(), False)]) +rdd = sparkSession.sparkContext.parallelize([Row(1, "John"), Row(2, "Bob")]) + |
1 | dataFrame = sparkSession.createDataFrame(rdd, schema) + |
1 | dataFrame.write.format("css").option("resource", resource).option("es.nodes", nodes).mode("Overwrite").save() + |
The value of mode can be Overwrite, Append, ErrorIfExists, or Ignore.
+1 +2 | jdbcDF = sparkSession.read.format("css").option("resource", resource).option("es.nodes", nodes).load() +jdbcDF.show() + |
sparkSession.sql(
    "create table css_table(id long, name string) using css options(\
    'es.nodes'='to-css-1174404953-hDTx3UPK.datasource.com:9200',\
    'es.nodes.wan.only'='true',\
    'resource'='/mytest')")
1 | sparkSession.sql("insert into css_table values(3,'tom')") + |
1 +2 | jdbcDF = sparkSession.sql("select * from css_table") +jdbcDF.show() + |
spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/css/*
spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/css/*
+1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24 +25 +26 +27 +28 +29 +30 +31 +32 | # _*_ coding: utf-8 _*_ +from __future__ import print_function +from pyspark.sql.types import Row, StructType, StructField, IntegerType, StringType +from pyspark.sql import SparkSession + +if __name__ == "__main__": + # Create a SparkSession session. + sparkSession = SparkSession.builder.appName("datasource-css").getOrCreate() + + # Setting cross-source connection parameters + resource = "/mytest" + nodes = "to-css-1174404953-hDTx3UPK.datasource.com:9200" + + # Setting schema + schema = StructType([StructField("id", IntegerType(), False), + StructField("name", StringType(), False)]) + + # Construction data + rdd = sparkSession.sparkContext.parallelize([Row(1, "John"), Row(2, "Bob")]) + + # Create a DataFrame from RDD and schema + dataFrame = sparkSession.createDataFrame(rdd, schema) + + # Write data to the CSS + dataFrame.write.format("css").option("resource", resource).option("es.nodes", nodes).mode("Overwrite").save() + + # Read data + jdbcDF = sparkSession.read.format("css").option("resource", resource).option("es.nodes", nodes).load() + jdbcDF.show() + + # close session + sparkSession.stop() + |
1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24 | # _*_ coding: utf-8 _*_ +from __future__ import print_function +from pyspark.sql import SparkSession + +if __name__ == "__main__": + # Create a SparkSession session. + sparkSession = SparkSession.builder.appName("datasource-css").getOrCreate() + + # Create a DLI data table for DLI-associated CSS + sparkSession.sql( + "create table css_table(id long, name string) using css options( \ + 'es.nodes'='to-css-1174404953-hDTx3UPK.datasource.com:9200',\ + 'es.nodes.wan.only'='true',\ + 'resource'='/mytest')") + + # Insert data into the DLI data table + sparkSession.sql("insert into css_table values(3,'tom')") + + # Read data from DLI data table + jdbcDF = sparkSession.sql("select * from css_table") + jdbcDF.show() + + # close session + sparkSession.stop() + |
1 +2 +3 | from __future__ import print_function +from pyspark.sql.types import StructType, StructField, IntegerType, StringType, Row +from pyspark.sql import SparkSession + |
Hard-coded or plaintext AK and SK pose significant security risks. To ensure security, encrypt your AK and SK, store them in configuration files or environment variables, and decrypt them when needed.
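The snippet below is a minimal sketch of that approach: it reads the AK, SK, and OBS endpoint used in the next step from environment variables instead of hard-coding them. The variable names DLI_AK, DLI_SK, and OBS_ENDPOINT are placeholders, not fixed names.

import os

# Placeholder environment variable names; adapt them to your own configuration management.
ak = os.environ.get("DLI_AK")
sk = os.environ.get("DLI_SK")
endpoint = os.environ.get("OBS_ENDPOINT")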
+1 +2 +3 +4 +5 | sparkSession = SparkSession.builder.appName("datasource-css").getOrCreate() +sparkSession.conf.set("fs.obs.access.key", ak) +sparkSession.conf.set("fs.obs.secret.key", sk) +sparkSession.conf.set("fs.obs.endpoint", enpoint) +sparkSession.conf.set("fs.obs.connecton.ssl.enabled", "false") + |
1 +2 | resource = "/mytest"; +nodes = "to-css-1174404953-hDTx3UPK.datasource.com:9200" + |
resource indicates the name of the resource associated with CSS. You can specify the resource location in /index/type format, where the index is analogous to a database and the type to a table.
+1 +2 +3 | schema = StructType([StructField("id", IntegerType(), False), + StructField("name", StringType(), False)]) +rdd = sparkSession.sparkContext.parallelize([Row(1, "John"), Row(2, "Bob")]) + |
1 | dataFrame = sparkSession.createDataFrame(rdd, schema) + |
dataFrame.write.format("css")\
    .option("resource", resource)\
    .option("es.nodes", nodes)\
    .option("es.net.ssl", "true")\
    .option("es.net.ssl.keystore.location", "obs://Bucket name/path/transport-keystore.jks")\
    .option("es.net.ssl.keystore.pass", "***")\
    .option("es.net.ssl.truststore.location", "obs://Bucket name/path/truststore.jks")\
    .option("es.net.ssl.truststore.pass", "***")\
    .option("es.net.http.auth.user", "admin")\
    .option("es.net.http.auth.pass", "***")\
    .mode("Overwrite")\
    .save()
The value of mode can be Overwrite, Append, ErrorIfExists, or Ignore.
+1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 | jdbcDF = sparkSession.read.format("css")\ + .option("resource", resource)\ + .option("es.nodes", nodes)\ + .option("es.net.ssl", "true")\ + .option("es.net.ssl.keystore.location", "obs://Bucket name/path/transport-keystore.jks")\ + .option("es.net.ssl.keystore.pass", "***")\ + .option("es.net.ssl.truststore.location", "obs://Bucket name/path/truststore.jks")\ + .option("es.net.ssl.truststore.pass", "***")\ + .option("es.net.http.auth.user", "admin")\ + .option("es.net.http.auth.pass", "***")\ + .load() +jdbcDF.show() + |
1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 | sparkSession.sql( + "create table css_table(id long, name string) using css options(\ + 'es.nodes'='to-css-1174404953-hDTx3UPK.datasource.com:9200',\ + 'es.nodes.wan.only'='true',\ + 'resource'='/mytest',\ + 'es.net.ssl'='true',\ + 'es.net.ssl.keystore.location'='obs://Bucket name/path/transport-keystore.jks',\ + 'es.net.ssl.keystore.pass'='***',\ + 'es.net.ssl.truststore.location'='obs://Bucket name/path/truststore.jks',\ + 'es.net.ssl.truststore.pass'='***',\ + 'es.net.http.auth.user'='admin',\ + 'es.net.http.auth.pass'='***')") + |
1 | sparkSession.sql("insert into css_table values(3,'tom')") + |
1 +2 | jdbcDF = sparkSession.sql("select * from css_table") +jdbcDF.show() + |
Hard-coded or plaintext AK and SK pose significant security risks. To ensure security, encrypt your AK and SK, store them in configuration files or environment variables, and decrypt them when needed.
+1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24 +25 +26 +27 +28 +29 +30 +31 +32 +33 +34 +35 +36 +37 +38 +39 +40 +41 +42 +43 +44 +45 +46 +47 +48 +49 +50 +51 +52 +53 +54 +55 +56 +57 | # _*_ coding: utf-8 _*_ +from __future__ import print_function +from pyspark.sql.types import Row, StructType, StructField, IntegerType, StringType +from pyspark.sql import SparkSession + +if __name__ == "__main__": + # Create a SparkSession session. + sparkSession = SparkSession.builder.appName("datasource-css").getOrCreate() + sparkSession.conf.set("fs.obs.access.key", ak) + sparkSession.conf.set("fs.obs.secret.key", sk) + sparkSession.conf.set("fs.obs.endpoint", enpoint) + sparkSession.conf.set("fs.obs.connecton.ssl.enabled", "false") + + # Setting cross-source connection parameters + resource = "/mytest"; + nodes = "to-css-1174404953-hDTx3UPK.datasource.com:9200" + + # Setting schema + schema = StructType([StructField("id", IntegerType(), False), + StructField("name", StringType(), False)]) + + # Construction data + rdd = sparkSession.sparkContext.parallelize([Row(1, "John"), Row(2, "Bob")]) + + # Create a DataFrame from RDD and schema + dataFrame = sparkSession.createDataFrame(rdd, schema) + + # Write data to the CSS + dataFrame.write.format("css") + .option("resource", resource) + .option("es.nodes", nodes) + .option("es.net.ssl", "true") + .option("es.net.ssl.keystore.location", "obs://Bucket name/path/transport-keystore.jks") + .option("es.net.ssl.keystore.pass", "***") + .option("es.net.ssl.truststore.location", "obs://Bucket name/path/truststore.jks") + .option("es.net.ssl.truststore.pass", "***") + .option("es.net.http.auth.user", "admin") + .option("es.net.http.auth.pass", "***") + .mode("Overwrite") + .save() + + # Read data + jdbcDF = sparkSession.read.format("css")\ + .option("resource", resource)\ + .option("es.nodes", nodes)\ + .option("es.net.ssl", "true")\ + .option("es.net.ssl.keystore.location", "obs://Bucket name/path/transport-keystore.jks")\ + .option("es.net.ssl.keystore.pass", "***")\ + .option("es.net.ssl.truststore.location", "obs://Bucket name/path/truststore.jks") + .option("es.net.ssl.truststore.pass", "***")\ + .option("es.net.http.auth.user", "admin")\ + .option("es.net.http.auth.pass", "***")\ + .load() + jdbcDF.show() + + # close session + sparkSession.stop() + |
1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24 +25 +26 +27 +28 +29 +30 +31 | # _*_ coding: utf-8 _*_ +from __future__ import print_function +from pyspark.sql import SparkSession +import os + +if __name__ == "__main__": + + # Create a SparkSession session. + sparkSession = SparkSession.builder.appName("datasource-css").getOrCreate() + # Create a DLI data table for DLI-associated CSS + sparkSession.sql("create table css_table(id int, name string) using css options(\ + 'es.nodes'='192.168.6.204:9200',\ + 'es.nodes.wan.only'='true',\ + 'resource'='/mytest',\ + 'es.net.ssl'='true',\ + 'es.net.ssl.keystore.location' = 'obs://xietest1/lzq/keystore.jks',\ + 'es.net.ssl.keystore.pass' = '**',\ + 'es.net.ssl.truststore.location'='obs://xietest1/lzq/truststore.jks',\ + 'es.net.ssl.truststore.pass'='**',\ + 'es.net.http.auth.user'='admin',\ + 'es.net.http.auth.pass'='**')") + + # Insert data into the DLI data table + sparkSession.sql("insert into css_table values(3,'tom')") + + # Read data from DLI data table + jdbcDF = sparkSession.sql("select * from css_table") + jdbcDF.show() + + # close session + sparkSession.stop() + |
Redis supports only enhanced datasource connections.
An enhanced datasource connection has been created on the DLI management console and bound to a queue in packages.
Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
+1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 | <dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency> +<dependency> + <groupId>redis.clients</groupId> + <artifactId>jedis</artifactId> + <version>3.1.0</version> +</dependency> +<dependency> + <groupId>com.redislabs</groupId> + <artifactId>spark-redis</artifactId> + <version>2.4.0</version> +</dependency> + |
1 +2 +3 +4 +5 | import org.apache.spark.sql.{Row, SaveMode, SparkSession} +import org.apache.spark.sql.types._ +import com.redislabs.provider.redis._ +import scala.reflect.runtime.universe._ +import org.apache.spark.{SparkConf, SparkContext} + |
1 | val sparkSession = SparkSession.builder().appName("datasource_redis").getOrCreate() + |
1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 | //method one +var schema = StructType(Seq(StructField("name", StringType, false), StructField("age", IntegerType, false))) +var rdd = sparkSession.sparkContext.parallelize(Seq(Row("abc",34),Row("Bob",19))) +var dataFrame = sparkSession.createDataFrame(rdd, schema) +// //method two +// var jdbcDF= sparkSession.createDataFrame(Seq(("Jack",23))) +// val dataFrame = jdbcDF.withColumnRenamed("_1", "name").withColumnRenamed("_2", "age") +// //method three +// case class Person(name: String, age: Int) +// val dataFrame = sparkSession.createDataFrame(Seq(Person("John", 30), Person("Peter", 45))) + |
case class Person(name: String, age: Int) must be written outside the object. For details, see Connecting to data sources through DataFrame APIs.
+1 +2 +3 +4 +5 +6 +7 +8 +9 | dataFrame .write + .format("redis") + .option("host","192.168.4.199") + .option("port","6379") + .option("table","person") + .option("password","******") + .option("key.column","name") + .mode(SaveMode.Overwrite) + .save() + |
Parameter | Description
---|---
host | IP address of the Redis cluster to connect to. To obtain it, log in to the Distributed Cache Service (DCS) for Redis console, choose Cache Manager, and copy the connection address (IP address and port) of the required instance.
port | Access port.
password | Password for the connection. This parameter is optional if no password is required.
table | Key or hash key in Redis.
keys.pattern | Regular expression used to match multiple keys or hash keys. This parameter is used only for query. Either this parameter or table is used to query Redis data.
key.column | Column used as the Redis key. This parameter is optional. If a key column is specified when data is written, the same column must be specified during query; otherwise, the data cannot be loaded correctly.
partitions.number | Number of concurrent tasks during data reading.
scan.count | Number of data records read in each batch. The default value is 100. If the CPU usage of the Redis cluster remains low during data reading, increase this value.
iterator.grouping.size | Number of data records inserted in each batch. The default value is 100. If the CPU usage of the Redis cluster remains low during insertion, increase this value.
timeout | Timeout for connecting to Redis, in milliseconds. The default value is 2000 (2 seconds).
1 +2 +3 +4 +5 +6 +7 +8 +9 | sparkSession.read + .format("redis") + .option("host","192.168.4.199") + .option("port","6379") + .option("table", "person") + .option("password","######") + .option("key.column","name") + .load() + .show() + |
1 +2 +3 +4 +5 +6 | val sparkContext = new SparkContext(new SparkConf() + .setAppName("datasource_redis") + .set("spark.redis.host", "192.168.4.199") + .set("spark.redis.port", "6379") + .set("spark.redis.auth", "######") + .set("spark.driver.allowMultipleContexts","true")) + |
Setting spark.driver.allowMultipleContexts to true ensures that only the current context is used when multiple contexts are started, which prevents conflicts between context invocations.
+1 +2 | val stringRedisData:RDD[(String,String)] = sparkContext.parallelize(Seq[(String,String)](("high","111"), ("together","333"))) +sparkContext.toRedisKV(stringRedisData) + |
1 +2 | val hashRedisData:RDD[(String,String)] = sparkContext.parallelize(Seq[(String,String)](("saprk","123"), ("data","222"))) +sparkContext.toRedisHASH(hashRedisData, "hashRDD") + |
1 +2 +3 | val data = List(("school","112"), ("tom","333")) +val listRedisData:RDD[String] = sparkContext.parallelize(Seq[(String)](data.toString())) +sparkContext.toRedisLIST(listRedisData, "listRDD") + |
1 +2 +3 | val setData = Set(("bob","133"),("kity","322")) +val setRedisData:RDD[(String)] = sparkContext.parallelize(Seq[(String)](setData.mkString)) +sparkContext.toRedisSET(setRedisData, "setRDD") + |
1 +2 | val zsetRedisData:RDD[(String,String)] = sparkContext.parallelize(Seq[(String,String)](("whight","234"), ("bobo","343"))) +sparkContext.toRedisZSET(zsetRedisData, "zsetRDD") + |
1 +2 +3 +4 +5 +6 | val keysRDD = sparkContext.fromRedisKeys(Array("high","together", "hashRDD", "listRDD", "setRDD","zsetRDD"), 6) +keysRDD.getKV().collect().foreach(println) +keysRDD.getHash().collect().foreach(println) +keysRDD.getList().collect().foreach(println) +keysRDD.getSet().collect().foreach(println) +keysRDD.getZSet().collect().foreach(println) + |
1 | sparkContext.fromRedisKV(Array( "high","together")).collect().foreach{println} + |
1 | sparkContext.fromRedisHash(Array("hashRDD")).collect().foreach{println} + |
1 | sparkContext.fromRedisList(Array("listRDD")).collect().foreach{println} + |
1 | sparkContext.fromRedisSet(Array("setRDD")).collect().foreach{println} + |
1 | sparkContext.fromRedisZSet(Array("zsetRDD")).collect().foreach{println} + |
sparkSession.sql(
  """CREATE TEMPORARY VIEW person (name STRING, age INT) USING org.apache.spark.sql.redis OPTIONS (
    |'host' = '192.168.4.199',
    |'port' = '6379',
    |'password' = '######',
    |table 'person')""".stripMargin)
1 | sparkSession.sql("INSERT INTO TABLE person VALUES ('John', 30),('Peter', 45)".stripMargin) + |
1 | sparkSession.sql("SELECT * FROM person".stripMargin).collect().foreach(println) + |
spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/redis/*
spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/redis/*
+1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 | <dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency> +<dependency> + <groupId>redis.clients</groupId> + <artifactId>jedis</artifactId> + <version>3.1.0</version> +</dependency> +<dependency> + <groupId>com.redislabs</groupId> + <artifactId>spark-redis</artifactId> + <version>2.4.0</version> +</dependency> + |
import org.apache.spark.sql.SparkSession

object Test_Redis_SQL {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession session.
    val sparkSession = SparkSession.builder().appName("datasource_redis").getOrCreate()

    sparkSession.sql(
      """CREATE TEMPORARY VIEW person (name STRING, age INT) USING org.apache.spark.sql.redis OPTIONS (
        |'host' = '192.168.4.199', 'port' = '6379', 'password' = '******', table 'person')""".stripMargin)

    sparkSession.sql("INSERT INTO TABLE person VALUES ('John', 30),('Peter', 45)")

    sparkSession.sql("SELECT * FROM person").collect().foreach(println)

    sparkSession.close()
  }
}
1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24 +25 +26 +27 +28 +29 +30 +31 +32 +33 +34 +35 +36 +37 +38 +39 +40 | import org.apache.spark.sql.{Row, SaveMode, SparkSession} +import org.apache.spark.sql.types._ + +object Test_Redis_SparkSql { + def main(args: Array[String]): Unit = { + // Create a SparkSession session. + val sparkSession = SparkSession.builder().appName("datasource_redis").getOrCreate() + + // Set cross-source connection parameters. + val host = "192.168.4.199" + val port = "6379" + val table = "person" + val auth = "######" + val key_column = "name" + + // ******** setting DataFrame ******** + // method one + var schema = StructType(Seq(StructField("name", StringType, false),StructField("age", IntegerType, false))) + var rdd = sparkSession.sparkContext.parallelize(Seq(Row("xxx",34),Row("Bob",19))) + var dataFrame = sparkSession.createDataFrame(rdd, schema) + +// // method two +// var jdbcDF= sparkSession.createDataFrame(Seq(("Jack",23))) +// val dataFrame = jdbcDF.withColumnRenamed("_1", "name").withColumnRenamed("_2", "age") + +// // method three +// val dataFrame = sparkSession.createDataFrame(Seq(Person("John", 30), Person("Peter", 45))) + + // Write data to redis + dataFrame.write.format("redis").option("host",host).option("port",port).option("table", table).option("password",auth).mode(SaveMode.Overwrite).save() + + // Read data from redis + sparkSession.read.format("redis").option("host",host).option("port",port).option("table", table).option("password",auth).load().show() + + // Close session + sparkSession.close() + } +} +// methoe two +// case class Person(name: String, age: Int) + |
1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24 +25 +26 +27 +28 +29 +30 +31 +32 +33 +34 +35 +36 +37 +38 +39 +40 +41 +42 +43 +44 +45 +46 +47 +48 +49 +50 +51 +52 +53 +54 +55 +56 +57 +58 +59 +60 +61 +62 +63 +64 +65 +66 +67 +68 +69 +70 | import com.redislabs.provider.redis._ +import org.apache.spark.rdd.RDD +import org.apache.spark.{SparkConf, SparkContext} + +object Test_Redis_RDD { + def main(args: Array[String]): Unit = { + // Create a SparkSession session. + val sparkContext = new SparkContext(new SparkConf() + .setAppName("datasource_redis") + .set("spark.redis.host", "192.168.4.199") + .set("spark.redis.port", "6379") + .set("spark.redis.auth", "@@@@@@") + .set("spark.driver.allowMultipleContexts","true")) + + //***************** Write data to redis ********************** + // Save String type data + val stringRedisData:RDD[(String,String)] = sparkContext.parallelize(Seq[(String,String)](("high","111"), ("together","333"))) + sparkContext.toRedisKV(stringRedisData) + + // Save Hash type data + val hashRedisData:RDD[(String,String)] = sparkContext.parallelize(Seq[(String,String)](("saprk","123"), ("data","222"))) + sparkContext.toRedisHASH(hashRedisData, "hashRDD") + + // Save List type data + val data = List(("school","112"), ("tom","333")); + val listRedisData:RDD[String] = sparkContext.parallelize(Seq[(String)](data.toString())) + sparkContext.toRedisLIST(listRedisData, "listRDD") + + // Save Set type data + val setData = Set(("bob","133"),("kity","322")) + val setRedisData:RDD[(String)] = sparkContext.parallelize(Seq[(String)](setData.mkString)) + sparkContext.toRedisSET(setRedisData, "setRDD") + + // Save ZSet type data + val zsetRedisData:RDD[(String,String)] = sparkContext.parallelize(Seq[(String,String)](("whight","234"), ("bobo","343"))) + sparkContext.toRedisZSET(zsetRedisData, "zsetRDD") + + // ***************************** Read data from redis ******************************************* + // Traverse the specified key and get the value + val keysRDD = sparkContext.fromRedisKeys(Array("high","together", "hashRDD", "listRDD", "setRDD","zsetRDD"), 6) + keysRDD.getKV().collect().foreach(println) + keysRDD.getHash().collect().foreach(println) + keysRDD.getList().collect().foreach(println) + keysRDD.getSet().collect().foreach(println) + keysRDD.getZSet().collect().foreach(println) + + // Read String type data// + val stringRDD = sparkContext.fromRedisKV("keyPattern *") + sparkContext.fromRedisKV(Array( "high","together")).collect().foreach{println} + + // Read Hash type data// + val hashRDD = sparkContext.fromRedisHash("keyPattern *") + sparkContext.fromRedisHash(Array("hashRDD")).collect().foreach{println} + + // Read List type data// + val listRDD = sparkContext.fromRedisList("keyPattern *") + sparkContext.fromRedisList(Array("listRDD")).collect().foreach{println} + + // Read Set type data// + val setRDD = sparkContext.fromRedisSet("keyPattern *") + sparkContext.fromRedisSet(Array("setRDD")).collect().foreach{println} + + // Read ZSet type data// + val zsetRDD = sparkContext.fromRedisZSet("keyPattern *") + sparkContext.fromRedisZSet(Array("zsetRDD")).collect().foreach{println} + + // close session + sparkContext.stop() + } +} + |
Redis supports only enhanced datasource connections.
An enhanced datasource connection has been created on the DLI management console and bound to a queue in packages.
Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
+1 +2 +3 | from __future__ import print_function +from pyspark.sql.types import StructType, StructField, IntegerType, StringType +from pyspark.sql import SparkSession + |
1 | sparkSession = SparkSession.builder.appName("datasource-redis").getOrCreate() + |
1 +2 +3 +4 | host = "192.168.4.199" +port = "6379" +table = "person" +auth = "@@@@@@" + |
1 +2 +3 +4 +5 | dataList = sparkSession.sparkContext.parallelize([(1, "Katie", 19),(2,"Tom",20)]) +schema = StructType([StructField("id", IntegerType(), False), + StructField("name", StringType(), False), + StructField("age", IntegerType(), False)]) +dataFrame = sparkSession.createDataFrame(dataList, schema) + |
1 +2 | jdbcDF = sparkSession.createDataFrame([(3,"Jack", 23)]) +dataFrame = jdbcDF.withColumnRenamed("_1", "id").withColumnRenamed("_2", "name").withColumnRenamed("_3", "age") + |
dataFrame.write\
    .format("redis")\
    .option("host", host)\
    .option("port", port)\
    .option("table", table)\
    .option("password", auth)\
    .mode("Overwrite")\
    .save()
1 | sparkSession.read.format("redis").option("host", host).option("port", port).option("table", table).option("password", auth).load().show() + |
sparkSession.sql(
    "CREATE TEMPORARY VIEW person (name STRING, age INT) USING org.apache.spark.sql.redis OPTIONS (\
    'host' = '192.168.4.199',\
    'port' = '6379',\
    'password' = '######',\
    'table' = 'person')")

sparkSession.sql("INSERT INTO TABLE person VALUES ('John', 30),('Peter', 45)")

sparkSession.sql("SELECT * FROM person").show()
spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/redis/*
spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/redis/*
+1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24 +25 +26 +27 +28 +29 +30 +31 | # _*_ coding: utf-8 _*_ +from __future__ import print_function +from pyspark.sql.types import StructType, StructField, IntegerType, StringType +from pyspark.sql import SparkSession +if __name__ == "__main__": + # Create a SparkSession session. + sparkSession = SparkSession.builder.appName("datasource-redis").getOrCreate() + + # Set cross-source connection parameters. + host = "192.168.4.199" + port = "6379" + table = "person" + auth = "######" + + # Create a DataFrame and initialize the DataFrame data. + # ******* method noe ********* + dataList = sparkSession.sparkContext.parallelize([(1, "Katie", 19),(2,"Tom",20)]) + schema = StructType([StructField("id", IntegerType(), False),StructField("name", StringType(), False),StructField("age", IntegerType(), False)]) + dataFrame_one = sparkSession.createDataFrame(dataList, schema) + + # ****** method two ****** + # jdbcDF = sparkSession.createDataFrame([(3,"Jack", 23)]) + # dataFrame = jdbcDF.withColumnRenamed("_1", "id").withColumnRenamed("_2", "name").withColumnRenamed("_3", "age") + + # Write data to the redis table + dataFrame.write.format("redis").option("host", host).option("port", port).option("table", table).option("password", auth).mode("Overwrite").save() + # Read data + sparkSession.read.format("redis").option("host", host).option("port", port).option("table", table).option("password", auth).load().show() + + # close session + sparkSession.stop() + |
# _*_ coding: utf-8 _*_
from __future__ import print_function
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Create a SparkSession
    sparkSession = SparkSession.builder.appName("datasource_redis").getOrCreate()

    sparkSession.sql(
        "CREATE TEMPORARY VIEW person (name STRING, age INT) USING org.apache.spark.sql.redis OPTIONS (\
        'host' = '192.168.4.199',\
        'port' = '6379',\
        'password' = '######',\
        'table' = 'person')")

    sparkSession.sql("INSERT INTO TABLE person VALUES ('John', 30),('Peter', 45)")

    sparkSession.sql("SELECT * FROM person").show()

    # close session
    sparkSession.stop()
Redis supports only enhanced datasource connections.
An enhanced datasource connection has been created on the DLI management console and bound to a queue in packages.
Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
+1 +2 +3 +4 +5 | <dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency> + |
1 +2 +3 +4 +5 +6 +7 +8 | import org.apache.spark.SparkConf; +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.sql.*; +import org.apache.spark.sql.types.DataTypes; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; +import java.util.*; + |
1 +2 +3 +4 +5 +6 +7 +8 | SparkConf sparkConf = new SparkConf(); +sparkConf.setAppName("datasource-redis") + .set("spark.redis.host", "192.168.4.199") + .set("spark.redis.port", "6379") + .set("spark.redis.auth", "******") + .set("spark.driver.allowMultipleContexts","true"); +JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf); +SQLContext sqlContext = new SQLContext(javaSparkContext); + |
1 +2 +3 +4 | JavaRDD<String> javaRDD = javaSparkContext.parallelize(Arrays.asList( + "{\"id\":\"1\",\"name\":\"Ann\",\"age\":\"18\"}", + "{\"id\":\"2\",\"name\":\"lisi\",\"age\":\"21\"}")); +Dataset dataFrame = sqlContext.read().json(javaRDD); + |
1 +2 +3 | Map map = new HashMap<String, String>(); +map.put("table","person"); +map.put("key.column","id"); + |
1 | dataFrame.write().format("redis").options(map).mode(SaveMode.Overwrite).save(); + |
1 | sqlContext.read().format("redis").options(map).load().show(); + |
spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/redis/*
spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/redis/*
+1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24 +25 +26 | public class Test_Redis_DaraFrame { + public static void main(String[] args) { + //create a SparkSession session + SparkConf sparkConf = new SparkConf(); + sparkConf.setAppName("datasource-redis") + .set("spark.redis.host", "192.168.4.199") + .set("spark.redis.port", "6379") + .set("spark.redis.auth", "******") + .set("spark.driver.allowMultipleContexts","true"); + JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf); + SQLContext sqlContext = new SQLContext(javaSparkContext); + + //Read RDD in JSON format to create DataFrame + JavaRDD<String> javaRDD = javaSparkContext.parallelize(Arrays.asList( + "{\"id\":\"1\",\"name\":\"Ann\",\"age\":\"18\"}", + "{\"id\":\"2\",\"name\":\"lisi\",\"age\":\"21\"}")); + Dataset dataFrame = sqlContext.read().json(javaRDD); + + Map map = new HashMap<String, String>(); + map.put("table","person"); + map.put("key.column","id"); + dataFrame.write().format("redis").options(map).mode(SaveMode.Overwrite).save(); + sqlContext.read().format("redis").options(map).load().show(); + + } +} + |
Mongo can be connected only through enhanced datasource connections.
DDS is compatible with the MongoDB protocol.
An enhanced datasource connection has been created on the DLI management console and bound to a queue in packages.
Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
+1 +2 +3 +4 +5 | <dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency> + |
import org.apache.spark.SparkConf; +import org.apache.spark.SparkContext; +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SQLContext; +import org.apache.spark.sql.SaveMode;+
1 +2 +3 | SparkContext sparkContext = new SparkContext(new SparkConf().setAppName("datasource-mongo")); +JavaSparkContext javaSparkContext = new JavaSparkContext(sparkContext); +SQLContext sqlContext = new SQLContext(javaSparkContext); + |
JavaRDD<String> javaRDD = javaSparkContext.parallelize(Arrays.asList("{\"id\":\"5\",\"name\":\"Ann\",\"age\":\"23\"}")); +Dataset<Row> dataFrame = sqlContext.read().json(javaRDD);+
String url = "192.168.4.62:8635,192.168.5.134:8635/test?authSource=admin"; +String uri = "mongodb://username:pwd@host:8635/db"; +String user = "rwuser"; +String database = "test"; +String collection = "test"; +String password = "######";+ +
dataFrame.write().format("mongo") + .option("url",url) + .option("uri",uri) + .option("database",database) + .option("collection",collection) + .option("user",user) + .option("password",password) + .mode(SaveMode.Overwrite) + .save();+
1 +2 +3 +4 +5 +6 +7 +8 | sqlContext.read().format("mongo") + .option("url",url) + .option("uri",uri) + .option("database",database) + .option("collection",collection) + .option("user",user) + .option("password",password) + .load().show(); + |
spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/mongo/*
spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/mongo/*
+1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24 +25 +26 +27 +28 +29 +30 +31 +32 +33 +34 +35 +36 +37 +38 +39 +40 +41 +42 +43 +44 +45 +46 +47 +48 +49 +50 +51 +52 | import org.apache.spark.SparkConf; +import org.apache.spark.SparkContext; +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SQLContext; +import org.apache.spark.sql.SaveMode; +import java.util.Arrays; + +public class TestMongoSparkSql { + public static void main(String[] args) { + SparkContext sparkContext = new SparkContext(new SparkConf().setAppName("datasource-mongo")); + JavaSparkContext javaSparkContext = new JavaSparkContext(sparkContext); + SQLContext sqlContext = new SQLContext(javaSparkContext); + +// // Read json file as DataFrame, read csv / parquet file, same as json file distribution +// DataFrame dataFrame = sqlContext.read().format("json").load("filepath"); + + // Read RDD in JSON format to create DataFrame + JavaRDD<String> javaRDD = javaSparkContext.parallelize(Arrays.asList("{\"id\":\"5\",\"name\":\"Ann\",\"age\":\"23\"}")); + Dataset<Row> dataFrame = sqlContext.read().json(javaRDD); + + String url = "192.168.4.62:8635,192.168.5.134:8635/test?authSource=admin"; + String uri = "mongodb://username:pwd@host:8635/db"; + String user = "rwuser"; + String database = "test"; + String collection = "test"; + String password = "######"; + + dataFrame.write().format("mongo") + .option("url",url) + .option("uri",uri) + .option("database",database) + .option("collection",collection) + .option("user",user) + .option("password",password) + .mode(SaveMode.Overwrite) + .save(); + + sqlContext.read().format("mongo") + .option("url",url) + .option("uri",uri) + .option("database",database) + .option("collection",collection) + .option("user",user) + .option("password",password) + .load().show(); + sparkContext.stop(); + javaSparkContext.close(); + } +} + |
Mongo can be connected only through enhanced datasource connections.
DDS is compatible with the MongoDB protocol.
An enhanced datasource connection has been created on the DLI management console and bound to a queue in packages.
Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
+1 +2 +3 +4 +5 | <dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency> + |
import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}+
val sparkSession = SparkSession.builder().appName("datasource-mongo").getOrCreate()+
sparkSession.sql(
  """create table test_dds(id string, name string, age int) using mongo options(
    |'url' = '192.168.4.62:8635,192.168.5.134:8635/test?authSource=admin',
    |'uri' = 'mongodb://username:pwd@host:8635/db',
    |'database' = 'test',
    |'collection' = 'test',
    |'user' = 'rwuser',
    |'password' = '######')""".stripMargin)
sparkSession.sql("insert into test_dds values('3', 'Ann',23)")+
sparkSession.sql("select * from test_dds").show()+
val url = "192.168.4.62:8635,192.168.5.134:8635/test?authSource=admin" +val uri = "mongodb://username:pwd@host:8635/db" +val user = "rwuser" +val database = "test" +val collection = "test" +val password = "######"+
1 | val schema = StructType(List(StructField("id", StringType), StructField("name", StringType), StructField("age", IntegerType))) + |
val rdd = spark.sparkContext.parallelize(Seq(Row("1", "John", 23), Row("2", "Bob", 32))) +val dataFrame = spark.createDataFrame(rdd, schema)+
1 +2 +3 +4 +5 +6 +7 +8 +9 | dataFrame.write.format("mongo") + .option("url", url) + .option("uri", uri) + .option("database", database) + .option("collection", collection) + .option("user", user) + .option("password", password) + .mode(SaveMode.Overwrite) + .save() + |
The value of mode can be Overwrite, Append, ErrorIfExists, or Ignore.
+1 +2 +3 +4 +5 +6 +7 +8 | val jdbcDF = spark.read.format("mongo").schema(schema) + .option("url", url) + .option("uri", uri) + .option("database", database) + .option("collection", collection) + .option("user", user) + .option("password", password) + .load() + |
Operation result
+spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/mongo/*
spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/mongo/*
+1 +2 +3 +4 +5 | <dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency> + |
import org.apache.spark.sql.SparkSession

object TestMongoSql {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder().getOrCreate()
    sparkSession.sql(
      """create table test_dds(id string, name string, age int) using mongo options(
        |'url' = '192.168.4.62:8635,192.168.5.134:8635/test?authSource=admin',
        |'uri' = 'mongodb://username:pwd@host:8635/db',
        |'database' = 'test',
        |'collection' = 'test',
        |'user' = 'rwuser',
        |'password' = '######')""".stripMargin)
    sparkSession.sql("insert into test_dds values('3', 'Ann',23)")
    sparkSession.sql("select * from test_dds").show()
    sparkSession.close()
  }
}
import org.apache.spark.sql.{Row, SaveMode, SparkSession} +import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType} + +object Test_Mongo_SparkSql { + def main(args: Array[String]): Unit = { + // Create a SparkSession session. + val spark = SparkSession.builder().appName("mongodbTest").getOrCreate() + + // Set the connection configuration parameters. + val url = "192.168.4.62:8635,192.168.5.134:8635/test?authSource=admin" + val uri = "mongodb://username:pwd@host:8635/db" + val user = "rwuser" + val database = "test" + val collection = "test" + val password = "######" + + // Setting up the schema + val schema = StructType(List(StructField("id", StringType), StructField("name", StringType), StructField("age", IntegerType))) + + // Setting up the DataFrame + val rdd = spark.sparkContext.parallelize(Seq(Row("1", "John", 23), Row("2", "Bob", 32))) + val dataFrame = spark.createDataFrame(rdd, schema) + + + // Write data to mongo + dataFrame.write.format("mongo") + .option("url", url) + .option("uri", uri) + .option("database", database) + .option("collection", collection) + .option("user", user) + .option("password", password) + .mode(SaveMode.Overwrite) + .save() + + // Reading data from mongo + val jdbcDF = spark.read.format("mongo").schema(schema) + .option("url", url) + .option("uri", uri) + .option("database", database) + .option("collection", collection) + .option("user", user) + .option("password", password) + .load() + jdbcDF.show() + + spark.close() + } +}+
Mongo can be connected only through enhanced datasource connections.
DDS is compatible with the MongoDB protocol.
An enhanced datasource connection has been created on the DLI management console and bound to a queue in packages.
Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
+from __future__ import print_function +from pyspark.sql.types import StructType, StructField, IntegerType, StringType +from pyspark.sql import SparkSession+
1 | sparkSession = SparkSession.builder.appName("datasource-mongo").getOrCreate() + |
1 +2 +3 +4 +5 +6 | url = "192.168.4.62:8635,192.168.5.134:8635/test?authSource=admin" +uri = "mongodb://username:pwd@host:8635/db" +user = "rwuser" +database = "test" +collection = "test" +password = "######" + |
1 +2 +3 +4 +5 | dataList = sparkSession.sparkContext.parallelize([(1, "Katie", 19),(2,"Tom",20)]) +schema = StructType([StructField("id", IntegerType(), False), + StructField("name", StringType(), False), + StructField("age", IntegerType(), False)]) +dataFrame = sparkSession.createDataFrame(dataList, schema) + |
dataFrame.write.format("mongo")\
    .option("url", url)\
    .option("uri", uri)\
    .option("user", user)\
    .option("password", password)\
    .option("database", database)\
    .option("collection", collection)\
    .mode("Overwrite")\
    .save()

jdbcDF = sparkSession.read\
    .format("mongo")\
    .option("url", url)\
    .option("uri", uri)\
    .option("user", user)\
    .option("password", password)\
    .option("database", database)\
    .option("collection", collection)\
    .load()
jdbcDF.show()

sparkSession.sql(
    "create table test_dds(id string, name string, age int) using mongo options(\
    'url' = '192.168.4.62:8635,192.168.5.134:8635/test?authSource=admin',\
    'uri' = 'mongodb://username:pwd@host:8635/db',\
    'database' = 'test',\
    'collection' = 'test',\
    'user' = 'rwuser',\
    'password' = '######')")
1 | sparkSession.sql("insert into test_dds values('3', 'Ann',23)") + |
1 | sparkSession.sql("select * from test_dds").show() + |
spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/mongo/*
spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/mongo/*
from __future__ import print_function
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Create a SparkSession session.
    sparkSession = SparkSession.builder.appName("datasource-mongo").getOrCreate()

    # Create a DataFrame and initialize the DataFrame data.
    dataList = sparkSession.sparkContext.parallelize([("1", "Katie", 19), ("2", "Tom", 20)])

    # Setting schema
    schema = StructType([StructField("id", StringType(), False), StructField("name", StringType(), False), StructField("age", IntegerType(), False)])

    # Create a DataFrame from RDD and schema
    dataFrame = sparkSession.createDataFrame(dataList, schema)

    # Setting connection parameters
    url = "192.168.4.62:8635,192.168.5.134:8635/test?authSource=admin"
    uri = "mongodb://username:pwd@host:8635/db"
    user = "rwuser"
    database = "test"
    collection = "test"
    password = "######"

    # Write data to the mongodb table
    dataFrame.write.format("mongo")\
        .option("url", url)\
        .option("uri", uri)\
        .option("user", user)\
        .option("password", password)\
        .option("database", database)\
        .option("collection", collection)\
        .mode("Overwrite")\
        .save()

    # Read data
    jdbcDF = sparkSession.read.format("mongo")\
        .option("url", url)\
        .option("uri", uri)\
        .option("user", user)\
        .option("password", password)\
        .option("database", database)\
        .option("collection", collection)\
        .load()
    jdbcDF.show()

    # close session
    sparkSession.stop()
from __future__ import print_function +from pyspark.sql import SparkSession + +if __name__ == "__main__": + # Create a SparkSession session. + sparkSession = SparkSession.builder.appName("datasource-mongo").getOrCreate() + + # Create a data table for DLI - associated mongo + sparkSession.sql( + "create table test_dds(id string, name string, age int) using mongo options(\ + 'url' = '192.168.4.62:8635,192.168.5.134:8635/test?authSource=admin',\ + 'uri' = 'mongodb://username:pwd@host:8635/db',\ + 'database' = 'test',\ + 'collection' = 'test', \ + 'user' = 'rwuser', \ + 'password' = '######')") + + # Insert data into the DLI-table + sparkSession.sql("insert into test_dds values('3', 'Ann',23)") + + # Read data from DLI-table + sparkSession.sql("select * from test_dds").show() + + # close session + sparkSession.stop()+
You need to be authenticated when using JDBC to create DLI driver connections.
Currently, JDBC supports authentication using an access key/secret key (AK/SK) pair or a token.
Hard-coding AKs and SKs or storing them in code in plaintext poses significant security risks. You are advised to store them in encrypted form in configuration files or environment variables and decrypt them when needed to ensure security.
When using token authentication, you need to obtain the user token and configure the token information in the JDBC connection parameters. You can obtain the token as follows:
Replace the content in italics in the sample code with the actual values. For details, see .
+{ + "auth": { + "identity": { + "methods": [ + "password" + ], + "password": { + "user": { + "name": "username", + "password": "password", + "domain": { + "name": "domainname" + } + } + } + }, + "scope": { + "project": { + "id": "0aa253a31a2f4cfda30eaa073fee6477" //Assume that project_id is 0aa253a31a2f4cfda30eaa073fee6477. + } + } + } +}+
DLI Spark-submit is a command line tool used to submit Spark jobs to the DLI server. This tool provides command lines compatible with open-source Spark.
DLI uses Identity and Access Management (IAM) to implement fine-grained permissions management for your enterprise-level tenants. IAM provides identity authentication, permissions management, and access control, helping you securely access your resources.
With IAM, you can use your account to create IAM users for your employees, and assign permissions to the users to control their access to specific resource types.
Currently, roles (coarse-grained authorization) and policies (fine-grained authorization) are supported.
If the user who creates the queue is not an administrator, the queue can be used only after being authorized by the administrator. For details about how to assign permissions, see .
You can download the DLI client tool from the DLI management console.
The client tool package is named dli-clientkit-<version>-bin.tar.gz, which can be used in Linux and depends on JDK 1.8 or later.
Ensure that JDK 1.8 or later is installed and environment variables are configured on the computer where spark-submit is used. You are advised to use spark-submit on a computer running Linux.
++
Item + |
+Mandatory + |
+Default Value + |
+Description + |
+
---|---|---|---|
dliEndPont + |
+No + |
+- + |
+Domain name of DLI. +If you leave this parameter empty, the program determines the domain name based on the region. + |
+
obsEndPoint + |
+Yes + |
+- + |
+OBS service domain name. +Obtain the OBS domain name from the administrator. + |
+
bucketName + |
+Yes + |
+- + |
+Name of a bucket on OBS. This bucket is used to store JAR files, Python program files, and configuration files used in Spark programs. + |
+
obsPath + |
+Yes + |
+dli-spark-submit-resources + |
+Directory for storing JAR files, Python program files, and configuration files on OBS. The directory is in the bucket specified by Bucket Name. If the directory does not exist, the program automatically creates it. + |
+
localFilePath + |
+Yes + |
+- + |
+The local directory for storing JAR files, Python program files, and configuration files used in Spark programs. +The program automatically uploads the files on which Spark depends to the OBS path and loads them to the resource package on the DLI server. + |
+
ak + |
+Yes + |
+- + |
+User's Access Key (AK) + |
+
sk + |
+Yes + |
+- + |
+User's Secret Key (SK) + |
+
projectId + |
+Yes + |
+- + |
+Project ID used by a user to access DLI. + |
+
region + |
+Yes + |
+- + |
+Region of interconnected DLI. + |
+
Modify the configuration items in the spark-defaults.conf file based on the Spark application requirements. The configuration items are compatible with the open-source Spark configuration items. For details, see the open-source Spark configuration item description.
+spark-submit [options] <app jar | python file> [app arguments]+
Parameter + |
+Value + |
+Description + |
+
---|---|---|
--class + |
+<CLASS_NAME> + |
+Name of the main class of the submitted Java or Scala application. + |
+
--conf + |
+<PROP=VALUE> + |
+Spark program parameters can be configured in the spark-defaults.conf file in the conf directory. If both the command and the configuration file are configured, the parameter value specified in the command is preferentially used. + NOTE:
+If there are multiple conf files, the format is --conf key1=value1 --conf key2=value2. + |
+
--jars + |
+<JARS> + |
+Name of the JAR file on which the Spark application depends. Use commas (,) to separate multiple names. The JAR file must be stored in the local path specified by localFilePath in the client.properties file in advance. + |
+
--name + |
+<NAME> + |
+Name of a Spark application. + |
+
--queue + |
+<QUEUE_NAME> + |
+Name of the Spark queue on the DLI server. Jobs are submitted to the queue for execution. + |
+
--py-files + |
+<PY_FILES> + |
+Name of the Python program file on which the Spark application depends. Use commas (,) to separate multiple file names. The Python program file must be saved in the local path specified by localFilePath in the client.properties file in advance. + |
+
-s,--skip-upload-resources + |
+<all | app | deps> + |
+Specifies whether to skip uploading the JAR files, Python program files, and configuration files to OBS and loading them to the resource list on the DLI server. Skip this step if the related resource files have already been loaded to the DLI resource list.
If this parameter is not specified, all resource files in the command are uploaded and loaded to DLI by default.
|
+
-h,--help + |
+- + |
+Displays command help information. + |
+
./spark-submit --name <name> --queue <queue_name> --class org.apache.spark.examples.SparkPi spark-examples_2.11-2.1.0.luxor.jar 10 +./spark-submit --name <name> --queue <queue_name> word_count.py+
To use the DLI queue rather than the existing Spark environment, use ./spark-submit instead of spark-submit.
+On DLI, you can connect to the server for data query over the Internet. To do so, first obtain the connection information, including the endpoint and project ID, as described below.
+The format of the address for connecting to DLI is jdbc:dli://<endPoint>/<projectId>. Therefore, you need to obtain the endpoint and project ID.
+Obtain the DLI endpoint from the administrator. To obtain the project ID, log in to the cloud console, click your username, and choose My Credentials from the shortcut menu. The project ID is displayed on the My Credentials page.
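For example, with placeholder values (the endpoint and project ID below are illustrative), the resulting connection address looks like this:
// Illustrative only; use the endpoint provided by your administrator and your own project ID.
String url = "jdbc:dli://<endpoint>/<projectId>";
// Optional connection attributes such as the database and queue can be appended:
String urlWithAttrs = "jdbc:dli://<endpoint>/<projectId>?databasename=db1;queuename=testqueue";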
+To connect to DLI, JDBC is utilized. You can obtain the JDBC installation package from Maven or download the JDBC driver file from the DLI management console.
+In Linux or Windows, you can connect to the DLI server using JDBC.
+DLI Data Type + |
+JDBC Type + |
+Java Type + |
+
---|---|---|
INT + |
+INTEGER + |
+java.lang.Integer + |
+
STRING + |
+VARCHAR + |
+java.lang.String + |
+
FLOAT + |
+FLOAT + |
+java.lang.Float + |
+
DOUBLE + |
+DOUBLE + |
+java.lang.Double + |
+
DECIMAL + |
+DECIMAL + |
+java.math.BigDecimal + |
+
BOOLEAN + |
+BOOLEAN + |
+java.lang.Boolean + |
+
SMALLINT/SHORT + |
+SMALLINT + |
+java.lang.Short + |
+
TINYINT + |
+TINYINT + |
+java.lang.Short + |
+
BIGINT/LONG + |
+BIGINT + |
+java.lang.Long + |
+
TIMESTAMP + |
+TIMESTAMP + |
+java.sql.Timestamp + |
+
CHAR + |
+CHAR + |
+java.lang.Character + |
+
VARCHAR + |
+VARCHAR + |
+java.lang.String + |
+
DATE + |
+DATE + |
+java.sql.Date + |
+
Before using JDBC, perform the following operations:
+DLI uses the Identity and Access Management (IAM) to implement fine-grained permissions for your enterprise-level tenants. IAM provides identity authentication, permissions management, and access control, helping you securely access your resources.
+With IAM, you can use your account to create IAM users for your employees, and assign permissions to the users to control their access to specific resource types.
+Currently, roles (coarse-grained authorization) and policies (fine-grained authorization) are supported.
+If the user who creates the queue is not an administrator, the queue can be used only after being authorized by the administrator. For details about how to assign permissions, see .
+Class.forName("com.dli.jdbc.DliDriver");
+Connection conn = DriverManager.getConnection(String url, Properties info);
+Parameter + |
+Description + |
+
---|---|
url + |
+The URL format is as follows: +jdbc:dli://<endPoint>/<projectId>?<key1>=<val1>;<key2>=<val2>... +
|
+
Info + |
+The Info object passes user-defined configuration items. If Info does not pass any attribute item, you can set it to null. The format is as follows: info.setProperty ("Attribute item", "Attribute value"). + |
+
Item + |
+Mandatory + |
+Default Value + |
+Description + |
+
---|---|---|---|
queuename + |
+Yes + |
+- + |
+Queue name of DLI. + |
+
databasename + |
+No + |
+- + |
+Name of a database. + |
+
authenticationmode + |
+No + |
+token + |
+Authentication mode. Currently, token- and AK/SK-based authentication modes are supported. + |
+
accesskey + |
+Yes + |
+- + |
+AK/SK authentication key. For details about how to obtain the key, see Performing Authentication. + |
+
secretkey + |
+Yes + |
+- + |
+AK/SK authentication key. For details about how to obtain the key, see Performing Authentication. + |
+
servicename + |
+This parameter must be configured if authenticationmode is set to aksk. + |
+- + |
+Indicates the service name, that is, dli. + |
+
token + |
+This parameter must be configured if authenticationmode is set to token. + |
+- + |
+Token authentication. For details about the authentication mode, see Performing Authentication. + |
+
charset + |
+No + |
+UTF-8 + |
+JDBC encoding mode. + |
+
usehttpproxy + |
+No + |
+false + |
+Whether to use the access proxy. + |
+
proxyhost + |
+This parameter must be configured if usehttpproxy is set to true. + |
+- + |
+Access proxy host. + |
+
proxyport + |
+This parameter must be configured if usehttpproxy is set to true. + |
+- + |
+Access proxy port. + |
+
dli.sql.checkNoResultQuery + |
+No + |
+false + |
+Whether to allow invoking the executeQuery API to execute statements (for example, DDL) that do not return results. +
|
+
jobtimeout + |
+No + |
+300 + |
+Timeout interval for job submission. Unit: second + |
+
iam.endpoint + |
+No. By default, the value is automatically combined based on regionName. + |
+- + |
+- + |
+
obs.endpoint + |
+No. By default, the value is automatically combined based on regionName. + |
+- + |
+- + |
+
directfetchthreshold + |
+No + |
+1000 + |
+Check whether the number of returned results exceeds the threshold based on service requirements. +The default threshold is 1000. + |
+
Statement statement = conn.createStatement();
+statement.execute("SET dli.sql.spark.sql.forcePartitionPredicatesOnPartitionedTable.enabled=true");
+statement.execute("select * from tb1");
+ResultSet rs = statement.getResultSet();
+while (rs.next()) { +int a = rs.getInt(1); +int b = rs.getInt(2); +}+
conn.close();
import java.sql.*;
import java.util.Properties;

public class DLIJdbcDriverExample {

    public static void main(String[] args) throws ClassNotFoundException, SQLException {
        Connection conn = null;
        try {
            Class.forName("com.dli.jdbc.DliDriver");
            String url = "jdbc:dli://<endpoint>/<projectId>?databasename=db1;queuename=testqueue";
            Properties info = new Properties();
            info.setProperty("authenticationmode", "aksk");
            info.setProperty("regionname", "<real region name>");
            // Read the AK/SK from environment variables instead of hard coding them.
            info.setProperty("accesskey", System.getenv("AK"));
            info.setProperty("secretkey", System.getenv("SK"));
            conn = DriverManager.getConnection(url, info);
            Statement statement = conn.createStatement();
            statement.execute("select * from tb1");
            ResultSet rs = statement.getResultSet();
            int line = 0;
            while (rs.next()) {
                line ++;
                int a = rs.getInt(1);
                int b = rs.getInt(2);
                System.out.println("Line:" + line + ":" + a + "," + b);
            }
            statement.execute("SET dli.sql.spark.sql.forcePartitionPredicatesOnPartitionedTable.enabled=true");
            statement.execute("describe tb1");
            ResultSet rs1 = statement.getResultSet();
            line = 0;
            while (rs1.next()) {
                line ++;
                String a = rs1.getString(1);
                String b = rs1.getString(2);
                System.out.println("Line:" + line + ":" + a + "," + b);
            }
        } catch (SQLException ex) {
            // Handle the exception as required.
        } finally {
            if (conn != null) {
                conn.close();
            }
        }
    }
}
If the JDBC requery function is enabled, the system automatically requeries when the query operation fails.
+To enable the requery function, add the attributes listed in Table 4 to the Info parameter.
+ +Item + |
+Mandatory + |
+Default Value + |
+Description + |
+
---|---|---|---|
USE_RETRY_KEY + |
+Yes + |
+false + |
+Whether to enable the requery function. If this parameter is set to True, the requery function is enabled. + |
+
RETRY_TIMES_KEY + |
+Yes + |
+3000 + |
+Requery interval (milliseconds). Set this parameter to 30000 ms. + |
+
RETRY_INTERVALS_KEY + |
+Yes + |
+3 + |
+Requery times. Set this parameter to a value from 3 to 5. + |
+
Set JDBC parameters, enable the requery function, and create a link. The following is an example:
+import com.xxx.dli.jdbs.utils.ConnectionResource;// Introduce "ConnectionResource". Change the class name as needed. +import java.sql.*; +import java.util.Properties; + +public class DLIJdbcDriverExample { + + private static final String X_AUTH_TOKEN_VALUE = "<realtoken>"; + public static void main(String[] args) throws ClassNotFoundException, SQLException { + Connection conn = null; + try { + Class.forName("com.dli.jdbc.DliDriver"); + String url = "jdbc:dli://<endpoint>/<projectId>?databasename=db1;queuename=testqueue"; + Properties info = new Properties(); + info.setProperty("token", X_AUTH_TOKEN_VALUE); +info.setProperty(ConnectionResource.USE_RETRY_KEY, "true"); // Enable the requery function. +info.setProperty(ConnectionResource.RETRY_TIMES_KEY, "30000");// Requery interval (ms) +info.setProperty(ConnectionResource.RETRY_INTERVALS_KEY, "5");// Requery Times + conn = DriverManager.getConnection(url, info); + Statement statement = conn.createStatement(); + statement.execute("select * from tb1"); + ResultSet rs = statement.getResultSet(); + int line = 0; + while (rs.next()) { + line ++; + int a = rs.getInt(1); + int b = rs.getInt(2); + System.out.println("Line:" + line + ":" + a + "," + b); + } + statement.execute("describe tb1"); + ResultSet rs1 = statement.getResultSet(); + line = 0; + while (rs1.next()) { + line ++; + String a = rs1.getString(1); + String b = rs1.getString(2); + System.out.println("Line:" + line + ":" + a + "," + b); + } + } catch (SQLException ex) { + } finally { + if (conn != null) { + conn.close(); + } + } + } +}+
Relational Database Service (RDS) is a cloud-based web service that is reliable, scalable, easy to manage, and immediately ready for use. It can be deployed in single-node, active/standby, or cluster mode.
+You can perform secondary development based on Flink APIs to build your own Jar packages and submit them to the DLI queues to interact with data sources such as MRS Kafka, HBase, Hive, HDFS, GaussDB(DWS), and DCS.
+This section describes how to interact with MRS through a custom job.
+The .keytab file of a human-machine account becomes invalid when the user password expires. Use a machine-machine account for configuration.
+For details about how to create an enhanced datasource connection, see Enhanced Datasource Connections in the Data Lake Insight User Guide.
+For details about how to configure security group rules, see "Security Group" in Virtual Private Cloud User Guide.
+For details about how to add an IP-domain mapping, see Modifying the Host Information in the Data Lake Insight User Guide.
+If the Kafka server listens on the port using hostname, you need to add the mapping between the hostname and IP address of the Kafka Broker node to the DLI queue. Contact the Kafka service deployment personnel to obtain the hostname and IP address of the Kafka Broker node.
+DLI does not support the download function. If you need to modify the uploaded data file, edit the local file and upload it again.
++
Parameter + |
+Description + |
+
---|---|
Type + |
+Select Flink Jar. + |
+
Name + |
+Job name, which contains 1 to 57 characters and consists of only letters, digits, hyphens (-), and underscores (_). + NOTE:
+The job name must be globally unique. + |
+
Description + |
+Description of the job, which contains 0 to 512 characters. + |
+
Parameter + |
+Description + |
+
---|---|
Application + |
+User-defined package. Before selecting a package, upload the corresponding JAR package to the OBS bucket and create a package on the Data Management > Package Management page. + |
+
Main Class + |
+Name of the main class of the JAR package to be loaded, for example, KafkaMessageStreaming. +
NOTE:
+When a class belongs to a package, the package path must be carried, for example, packagePath.KafkaMessageStreaming. + |
+
Class Arguments + |
+List of arguments of a specified class. The arguments are separated by spaces. + |
+
JAR Package Dependencies + |
+User-defined dependencies. Before selecting a package, upload the corresponding JAR package to the OBS bucket and create a JAR package on the Data Management > Package Management page. + |
+
Other Dependencies + |
+User-defined dependency files. Before selecting a file, upload the corresponding file to the OBS bucket and create a package of any type on the Data Management > Package Management page. +You can add the following content to the application to access the corresponding dependency file: fileName indicates the name of the file to be accessed, and ClassName indicates the name of the class that needs to access the file. +ClassName.class.getClassLoader().getResource("userData/fileName")+ |
+
Flink Version + |
+Before selecting a Flink version, you need to select the queue to which the Flink version belongs. Currently, the following versions are supported: 1.10. + |
+
+
Parameter + |
+Description + |
+
---|---|
CUs + |
+One CU has one vCPU and 4 GB memory. The number of CUs ranges from 2 to 400. + |
+
Job Manager CUs + |
+Set the number of CUs on a management unit. The value ranges from 1 to 4. The default value is 1. + |
+
Parallelism + |
+Maximum number of parallel operators in a job. + NOTE:
+
|
+
Task Manager Configuration + |
+Whether to set Task Manager resource parameters. +If this option is selected, you need to set the following parameters: +
|
+
Save Job Log + |
+Whether to save the job running logs to OBS. +If this option is selected, you need to set the following parameters: +OBS Bucket: Select an OBS bucket to store user job logs. If the selected OBS bucket is not authorized, click Authorize. + |
+
Alarm Generation upon Job Exception + |
+Whether to report job exceptions, for example, abnormal job running or exceptions due to an insufficient balance, to users via SMS or email. +If this option is selected, you need to set the following parameters: +SMN Topic +Select a user-defined SMN topic. For details about how to customize SMN topics, see "Creating a Topic" in the Simple Message Notification User Guide. + |
+
Auto Restart upon Exception + |
+Whether to enable automatic restart. If this function is enabled, jobs will be automatically restarted and restored when exceptions occur. +If this option is selected, you need to set the following parameters: +
|
+
After the job is started, the system automatically switches to the job management page, and the created job is displayed in the job list. You can view the job status in the status column. After a job is successfully submitted, its status changes from submitting to running. After the execution is complete, the message Completed is displayed.
If the job fails to be submitted or runs abnormally, view the error details shown for the job in the job list and copy them if needed. After handling the fault based on the provided information, resubmit the job.
Other buttons are as follows:
+Save As: Save the created job as a new job.
+List of parameters of a specified class. The parameters are separated by spaces.
+Parameter input format: --Key 1 Value 1 --Key 2 Value 2
+For example, if you enter the following parameters on the console:
+--bootstrap.server 192.168.168.xxx:9092
+The parameters are parsed by ParameterTool as follows:
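The sketch below (not part of the original sample; the class name ArgsDemo is illustrative) shows how Flink's ParameterTool exposes the argument entered above inside the main class:
import org.apache.flink.api.java.utils.ParameterTool;

public class ArgsDemo {
    public static void main(String[] args) {
        // Parses arguments of the form --key value, for example --bootstrap.server 192.168.168.xxx:9092.
        ParameterTool params = ParameterTool.fromArgs(args);
        // Returns "192.168.168.xxx:9092" for the example above; the second argument is a fallback default.
        String bootstrapServer = params.get("bootstrap.server", "localhost:9092");
        System.out.println("bootstrap.server = " + bootstrapServer);
    }
}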
+Only the latest run logs are displayed. For more information, see the OBS bucket that stores logs.
+Built on Flink and Spark, the stream ecosystem is fully compatible with the open-source Flink, Storm, and Spark APIs. It is enhanced in features and improved in performance to provide easy-to-use DLI with low latency and high throughput.
+DLI can be interconnected with other services by using Stream SQLs. You can directly use SQL statements to read and write data from various cloud services, such as Data Ingestion Service (DIS), Object Storage Service (OBS), CloudTable Service (CloudTable), MapReduce Service (MRS), Relational Database Service (RDS), Simple Message Notification (SMN), and Distributed Cache Service (DCS).
+After connections to other VPCs are established through VPC peering connections, you can access all data sources and output targets (such as Kafka, HBase, and Elasticsearch) supported by Flink and Spark in your dedicated DLI queues.
+You can compile code to obtain data from the desired cloud ecosystem or open-source ecosystem as the input data of Flink jobs.
+DLI Flink jobs support the following data formats:
+Avro, Avro_merge, BLOB, CSV, EMAIL, JSON, ORC, Parquet, and XML.
+DLI allows you to use Hive user-defined functions (UDFs) to query data. UDFs take effect only on a single row of data and are applicable to inserting and deleting a single data record.
+Log in to the DLI console and choose Data Management > Package Management. On the displayed page, select your UDF Jar package and click Manage Permissions in the Operation column. On the permission management page, click Grant Permission in the upper right corner and select the required permissions.
+Before you start, set up the development environment.
+ +Item + |
+Description + |
+
---|---|
OS + |
+Windows 7 or later + |
+
JDK + |
+JDK 1.8. + |
+
IntelliJ IDEA + |
+This tool is used for application development. The version of the tool must be 2019.1 or other compatible versions. + |
+
Maven + |
+Basic configurations of the development environment. Maven is used for project management throughout the lifecycle of software development. + |
+
No. + |
+Phase + |
+Software Portal + |
+Description + |
+
---|---|---|---|
1 + |
+Create a Maven project and configure the POM file. + |
+IntelliJ IDEA + |
+
+ Write UDF code by referring to the steps in Procedure. + + |
+
2 + |
+Write UDF code. + |
+||
3 + |
+Debug, compile, and pack the code into a Jar package. + |
+||
4 + |
+Upload the Jar package to OBS. + |
+OBS console + |
+Upload the UDF Jar file to an OBS directory. + |
+
5 + |
+Create the UDF on DLI. + |
+DLI console + |
+Create a UDF on the SQL job management page of the DLI console. + |
+
6 + |
+Verify and use the UDF on DLI. + |
+DLI console + |
+Use the UDF in your DLI job. + |
+
<dependencies> + <dependency> + <groupId>org.apache.hive</groupId> + <artifactId>hive-exec</artifactId> + <version>1.2.1</version> + </dependency> +</dependencies>+
Set the package name as you need. Then, press Enter.
+Create a Java Class file in the package path. In this example, the Java Class file is SumUdfDemo.
+For details about how to implement the UDF, see the following sample code:
+package com.demo; +import org.apache.hadoop.hive.ql.exec.UDF; + public class SumUdfDemo extends UDF { + public int evaluate(int a, int b) { + return a + b; + } + }+
After the compilation is successful, click package.
+The generated JAR package is stored in the target directory. In this example, MyUDF-1.0-SNAPSHOT.jar is stored in D:\DLITest\MyUDF\target.
+The region of the OBS bucket to which the Jar package is uploaded must be the same as the region of the DLI queue. Cross-region operations are not allowed.
+CREATE FUNCTION TestSumUDF AS 'com.demo.SumUdfDemo' using jar 'obs://dli-test-obs01/MyUDF-1.0-SNAPSHOT.jar';+
Use the UDF created in 6 in the SELECT statement as follows:
+select TestSumUDF(1,2);+
If the UDF is no longer used, run the following statement to delete it:
+Drop FUNCTION TestSumUDF;+
DLI allows you to develop a program to create Spark jobs for operations related to databases, DLI or OBS tables, and table data. This example demonstrates how to develop a job by writing a Java program, and use a Spark job to create a database and table and insert table data.
+For example, if the testdb database is created using the SQL editor of DLI, a program package that creates the testTable table in the testdb database does not work after it is submitted to a Spark Jar job.
+Before developing a Spark job to access DLI metadata, set up a development environment that meets the following requirements.
+ +Item + |
+Description + |
+
---|---|
OS + |
+Windows 7 or later + |
+
JDK + |
+JDK 1.8. + |
+
IntelliJ IDEA + |
+This tool is used for application development. The version of the tool must be 2019.1 or other compatible versions. + |
+
Maven + |
+Basic configurations of the development environment. Maven is used for project management throughout the lifecycle of software development. + |
+
No. + |
+Phase + |
+Software Portal + |
+Description + |
+
---|---|---|---|
1 + |
+Create a queue for general use. + |
+DLI console + |
+The DLI queue is created for running your job. + |
+
2 + |
+Configure the OBS file. + |
+OBS console + |
+
|
+
3 + |
+Create a Maven project and configure the POM file. + |
+IntelliJ IDEA + |
+
+ Write a program to create a DLI or OBS table by referring to the sample code. + + |
+
4 + |
+Write code. + |
+||
5 + |
+Debug, compile, and pack the code into a Jar package. + |
+||
6 + |
+Upload the Jar package to OBS and DLI. + |
+OBS console + |
+You can upload the generated Spark Jar package to an OBS directory and DLI program package. + |
+
7 + |
+Create a Spark JAR job. + |
+DLI console + |
+The Spark Jar job is created and submitted on the DLI console. + |
+
8 + |
+Check execution result of the job. + |
+DLI console + |
+You can view the job running status and run logs. + |
+
In this example, the Maven project name is SparkJarMetadata, and the project storage path is D:\DLITest\SparkJarMetadata.
+<dependencies> + <dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> + </dependency> +</dependencies>+
Set the package name as you need. In this example, set Package to com.dli.demo and press Enter.
+Create a Java Class file in the package path. In this example, the Java Class file is DliCatalogTest.
+Write the DliCatalogTest program to create a database, DLI table, and OBS table.
+For the sample code, see Java Example Code.
+import org.apache.spark.sql.SparkSession;+
When you create a SparkSession, you need to specify spark.sql.session.state.builder, spark.sql.catalog.class, and spark.sql.extensions parameters as configured in the following example.
+SparkSession spark = SparkSession + .builder() + .config("spark.sql.session.state.builder", "org.apache.spark.sql.hive.UQueryHiveACLSessionStateBuilder") + .config("spark.sql.catalog.class", "org.apache.spark.sql.hive.UQueryHiveACLExternalCatalog") + .config("spark.sql.extensions","org.apache.spark.sql.DliSparkExtension") + .appName("java_spark_demo") + .getOrCreate();+
SparkSession spark = SparkSession + .builder() + .config("spark.sql.session.state.builder", "org.apache.spark.sql.hive.DliLakeHouseBuilder") + .config("spark.sql.catalog.class", "org.apache.spark.sql.hive.DliLakeHouseCatalog") + .appName("java_spark_demo") + .getOrCreate();+
spark.sql("drop table if exists test_sparkapp.dli_testtable").collect(); +spark.sql("create table test_sparkapp.dli_testtable(id INT, name STRING)").collect(); +spark.sql("insert into test_sparkapp.dli_testtable VALUES (123,'jason')").collect(); +spark.sql("insert into test_sparkapp.dli_testtable VALUES (456,'merry')").collect();+
spark.sql("drop table if exists test_sparkapp.dli_testobstable").collect(); +spark.sql("create table test_sparkapp.dli_testobstable(age INT, name STRING) using csv options (path 'obs://dli-test-obs01/testdata.csv')").collect();+
spark.stop();+
After the compilation is successful, double-click package.
+The generated JAR package is stored in the target directory. In this example, SparkJarMetadata-1.0-SNAPSHOT.jar is stored in D:\DLITest\SparkJarMetadata\target.
+Parameter + |
+Value + |
+
---|---|
Queue + |
+Select the DLI queue created for general purpose. For example, select the queue sparktest created in Step 1: Create a Queue for General Purpose. + |
+
Spark Version + |
+Select a Spark version. Select a supported Spark version from the drop-down list. The latest version is recommended. + |
+
Job Name (--name) + |
+Name of a custom Spark Jar job. For example, SparkTestMeta. + |
+
Application + |
+Select the package uploaded to DLI in Step 6: Upload the JAR Package to OBS and DLI. For example, select SparkJarObs-1.0-SNAPSHOT.jar. + |
+
Main Class (--class) + |
+The format is program package name + class name. + |
+
Spark Arguments (--conf) + |
+spark.dli.metaAccess.enable=true +spark.sql.warehouse.dir=obs://dli-test-obs01/warehousepath + NOTE:
+Set spark.sql.warehouse.dir to the OBS path that is specified in Step 2: Configure the OBS Bucket File. + |
+
Access Metadata + |
+Select Yes. + |
+
Retain default values for other parameters.
+After the fault is rectified, click Edit in the Operation column of the job, modify job parameters, and click Execute to run the job again.
+Call the API for creating a batch processing job. The following table describes the request parameters.
+Configure "spark.sql.warehouse.dir": "obs://bucket/warehousepath" in the CONF file if you need to run the DDL.
+The following example provides a complete API request.
+{ + "queue":"citest", + "file":"SparkJarMetadata-1.0-SNAPSHOT.jar", + "className":"DliCatalogTest", + "conf":{"spark.sql.warehouse.dir": "obs://bucket/warehousepath", + "spark.dli.metaAccess.enable":"true"}, + "sc_type":"A", + "executorCores":1, + "numExecutors":6, + "executorMemory":"4G", + "driverCores":2, + "driverMemory":"7G", + "catalog_name": "dli" +}+
This example uses Java for coding. The complete sample code is as follows:
+package com.dli.demo; + +import org.apache.spark.sql.SparkSession; + +public class DliCatalogTest { + public static void main(String[] args) { + + SparkSession spark = SparkSession + .builder() + .config("spark.sql.session.state.builder", "org.apache.spark.sql.hive.UQueryHiveACLSessionStateBuilder") + .config("spark.sql.catalog.class", "org.apache.spark.sql.hive.UQueryHiveACLExternalCatalog") + .config("spark.sql.extensions","org.apache.spark.sql.DliSparkExtension") + .appName("java_spark_demo") + .getOrCreate(); + + spark.sql("create database if not exists test_sparkapp").collect(); + spark.sql("drop table if exists test_sparkapp.dli_testtable").collect(); + spark.sql("create table test_sparkapp.dli_testtable(id INT, name STRING)").collect(); + spark.sql("insert into test_sparkapp.dli_testtable VALUES (123,'jason')").collect(); + spark.sql("insert into test_sparkapp.dli_testtable VALUES (456,'merry')").collect(); + + spark.sql("drop table if exists test_sparkapp.dli_testobstable").collect(); + spark.sql("create table test_sparkapp.dli_testobstable(age INT, name STRING) using csv options (path 'obs://dli-test-obs01/testdata.csv')").collect(); + + + spark.stop(); + + } +}+
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

import scala.util.Try

object DliCatalogTest {
  def main(args: Array[String]): Unit = {
    val sql = args(0)
    val runDdl = Try(args(1).toBoolean).getOrElse(true)
    System.out.println(s"sql is $sql, runDdl is $runDdl")
    val sparkConf = new SparkConf(true)
    sparkConf
      .set("spark.sql.session.state.builder", "org.apache.spark.sql.hive.UQueryHiveACLSessionStateBuilder")
      .set("spark.sql.catalog.class", "org.apache.spark.sql.hive.UQueryHiveACLExternalCatalog")
    sparkConf.setAppName("dlicatalogtester")

    val spark = SparkSession.builder
      .config(sparkConf)
      .enableHiveSupport()
      .config("spark.sql.extensions", "org.apache.spark.sql.DliSparkExtension")
      .appName("SparkTest")
      .getOrCreate()

    System.out.println("catalog is " + spark.sessionState.catalog.toString)
    if (runDdl) {
      val df = spark.sql(sql).collect()
    } else {
      spark.sql(sql).show()
    }

    spark.close()
  }
}
#!/usr/bin/python +# -*- coding: UTF-8 -*- + +from __future__ import print_function + +import sys + +from pyspark.sql import SparkSession + +if __name__ == "__main__": + url = sys.argv[1] + creatTbl = "CREATE TABLE test_sparkapp.dli_rds USING JDBC OPTIONS ('url'='jdbc:mysql://%s'," \ + "'driver'='com.mysql.jdbc.Driver','dbtable'='test.test'," \ + " 'passwdauth' = 'DatasourceRDSTest_pwd','encryption' = 'true')" % url + + spark = SparkSession \ + .builder \ + .enableHiveSupport() \ +.config("spark.sql.session.state.builder","org.apache.spark.sql.hive.UQueryHiveACLSessionStateBuilder") \ +.config("spark.sql.catalog.class", "org.apache.spark.sql.hive.UQueryHiveACLExternalCatalog") \ +.config("spark.sql.extensions","org.apache.spark.sql.DliSparkExtension") \ + .appName("python Spark test catalog") \ + .getOrCreate() + + spark.sql("CREATE database if not exists test_sparkapp").collect() + spark.sql("drop table if exists test_sparkapp.dli_rds").collect() + spark.sql(creatTbl).collect() + spark.sql("select * from test_sparkapp.dli_rds").show() + spark.sql("insert into table test_sparkapp.dli_rds select 12,'aaa'").collect() + spark.sql("select * from test_sparkapp.dli_rds").show() + spark.sql("insert overwrite table test_sparkapp.dli_rds select 1111,'asasasa'").collect() + spark.sql("select * from test_sparkapp.dli_rds").show() + spark.sql("drop table test_sparkapp.dli_rds").collect() + spark.stop()+
A datasource connection has been created and bound to a queue on the DLI management console.
+Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
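For instance, the following is a minimal sketch of injecting the password at run time instead of hard coding it. The environment variable name RDS_PASSWORD and the RDS endpoint placeholder are assumptions, and sparkSession is the session created in the steps below.
// Sketch: build the table OPTIONS string with a password read from an environment variable.
String rdsPassword = System.getenv("RDS_PASSWORD");  // assumed variable name
String createSql = String.format(
    "CREATE TABLE IF NOT EXISTS dli_to_rds USING JDBC OPTIONS ("
        + "'url'='jdbc:mysql://<rds-endpoint>:3306',"
        + "'dbtable'='test.customer',"
        + "'user'='root',"
        + "'password'='%s',"
        + "'driver'='com.mysql.jdbc.Driver')",
    rdsPassword);
sparkSession.sql(createSql);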
+1 +2 +3 +4 +5 | <dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency> + |
1 | import org.apache.spark.sql.SparkSession; + |
1 | SparkSession sparkSession = SparkSession.builder().appName("datasource-rds").getOrCreate(); + |
1 +2 +3 +4 +5 +6 +7 | sparkSession.sql( + "CREATE TABLE IF NOT EXISTS dli_to_rds USING JDBC OPTIONS ( + 'url'='jdbc:mysql://to-rds-1174404209-cA37siB6.datasource.com:3306', // Set this parameter to the actual URL. + 'dbtable'='test.customer', + 'user'='root', // Set this parameter to the actual user. + 'password'='######', // Set this parameter to the actual password. + 'driver'='com.mysql.jdbc.Driver')") + |
For details about the parameters for creating a table, see Table 1.
+1 | sparkSession.sql("insert into dli_to_rds values (1,'John',24)"); + |
1 | sparkSession.sql("select * from dli_to_rd").show(); + |
Response
+spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/rds/*
+spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/rds/*
+Connecting to data sources through SQL APIs
+1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 | import org.apache.spark.sql.SparkSession; + +public class java_rds { + + public static void main(String[] args) { + SparkSession sparkSession = SparkSession.builder().appName("datasource-rds").getOrCreate(); + + // Create a data table for DLI-associated RDS + sparkSession.sql("CREATE TABLE IF NOT EXISTS dli_to_rds USING JDBC OPTIONS ('url'='jdbc:mysql://192.168.6.150:3306','dbtable'='test.customer','user'='root','password'='**','driver'='com.mysql.jdbc.Driver')"); + + //*****************************SQL model*********************************** + //Insert data into the DLI data table + sparkSession.sql("insert into dli_to_rds values(3,'Liu',21),(4,'Joey',34)"); + + //Read data from DLI data table + sparkSession.sql("select * from dli_to_rds"); + + //drop table + sparkSession.sql("drop table dli_to_rds"); + + sparkSession.close(); + } +} + |
keytool -genkeypair -alias certificatekey -keyalg RSA -keystore transport-keystore.jks+
keytool -list -v -keystore transport-keystore.jks+
After you enter the correct keystore password, the corresponding information is displayed.
+keytool -import -alias certificatekey -file CloudSearchService.cer -keystore truststore.jks +keytool -list -v -keystore truststore.jks+
For details about the parameters, see Table 1. This part describes the precautions for configuring the connection parameters of the CSS security cluster.
+.option("es.net.http.auth.user", "admin") .option("es.net.http.auth.pass", "***")+
The parameters are the identity authentication account and password, which are also the account and password for logging in to Kibana.
+.option("es.net.ssl", "true")+
.option("es.net.ssl.keystore.location", "obs://Bucket name/path/transport-keystore.jks") +.option("es.net.ssl.keystore.pass", "***")+
Set the location of the keystore.jks file and the key for accessing the file. Place the keystore.jks file generated in Preparations in the OBS bucket, and then enter the AK, SK, and location of the keystore.jks file. Enter the key for accessing the file in es.net.ssl.keystore.pass.
+.option("es.net.ssl.truststore.location", "obs://Bucket name/path/truststore.jks") +.option("es.net.ssl.truststore.pass", "***")+
The parameters in the truststore.jks file are basically the same as those in the keystore.jks file. You can refer to the preceding procedure to set parameters.
+A datasource connection has been created on the DLI management console.
+<dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency>+
1 | import org.apache.spark.sql.SparkSession; + |
1 | SparkSession sparkSession = SparkSession.builder().appName("datasource-css").getOrCreate(); + |
sparkSession.sql("create table css_table(id long, name string) using css options( 'es.nodes' = '192.168.9.213:9200', 'es.nodes.wan.only' = 'true','resource' ='/mytest')");+
sparkSession.sql("insert into css_table values(18, 'John'),(28, 'Bob')");+
sparkSession.sql("select * from css_table").show();+
sparkSession.sql("drop table css_table");+
spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/css/*
+spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/css/*
+<dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency>+
1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 | import org.apache.spark.sql.*; + +public class java_css_unsecurity { + + public static void main(String[] args) { + SparkSession sparkSession = SparkSession.builder().appName("datasource-css-unsecurity").getOrCreate(); + + // Create a DLI data table for DLI-associated CSS + sparkSession.sql("create table css_table(id long, name string) using css options( 'es.nodes' = '192.168.15.34:9200', 'es.nodes.wan.only' = 'true', 'resource' = '/mytest')"); + + //*****************************SQL model*********************************** + // Insert data into the DLI data table + sparkSession.sql("insert into css_table values(18, 'John'),(28, 'Bob')"); + + // Read data from DLI data table + sparkSession.sql("select * from css_table").show(); + + // drop table + sparkSession.sql("drop table css_table"); + + sparkSession.close(); + } +} + |
Generate the keystore.jks and truststore.jks files and upload them to the OBS bucket. For details, see CSS Security Cluster Configuration.
+<dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency>+
1 | import org.apache.spark.sql.SparkSession; + |
1 | SparkSession sparkSession = SparkSession.builder().appName("datasource-css").getOrCreate(); + |
1 | sparkSession.sql("create table css_table(id long, name string) using css options( 'es.nodes' = '192.168.9.213:9200', 'es.nodes.wan.only' = 'true', 'resource' = '/mytest','es.net.ssl'='false','es.net.http.auth.user'='admin','es.net.http.auth.pass'='*******')"); + |
1 | sparkSession.sql("insert into css_table values(18, 'John'),(28, 'Bob')"); + |
1 | sparkSession.sql("select * from css_table").show(); + |
sparkSession.sql("drop table css_table");+
<dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency>+
<dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency>+
Import dependency packages.
+1 | import org.apache.spark.sql.SparkSession; + |
1 | SparkSession sparkSession = SparkSession.builder().appName("datasource-css").getOrCreate(); + |
1 +2 +3 | sparkSession.sql("create table css_table(id long, name string) using css options( 'es.nodes' = '192.168.13.189:9200', 'es.nodes.wan.only' = 'true', 'resource' = '/mytest','es.net.ssl'='true','es.net.ssl.keystore.location' = 'obs://Bucket name/Address/transport-keystore.jks','es.net.ssl.keystore.pass' = '**', +'es.net.ssl.truststore.location'='obs://Bucket name/Address/truststore.jks', +'es.net.ssl.truststore.pass'='***','es.net.http.auth.user'='admin','es.net.http.auth.pass'='**')"); + |
1 | sparkSession.sql("insert into css_table values(18, 'John'),(28, 'Bob')"); + |
1 | sparkSession.sql("select * from css_table").show(); + |
sparkSession.sql("drop table css_table");+
<?xml version="1.0" encoding="UTF-8"?> +<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> +<!-- + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. See accompanying LICENSE file. +--> + +<!-- Put site-specific property overrides in this file. --> + +<configuration> +<property> + <name>fs.obs.bucket.Bucket name.access.key</name> + <value>AK</value> + </property> +<property> + <name>fs.obs.bucket.Bucket name.secret.key </name> + <value>SK</value> + </property> +</configuration>+
+In <name>fs.obs.bucket.Bucket name.access.key</name>, the bucket name is included so that the correct bucket can be located. Bucket name is the name of the bucket where the keystore.jks and truststore.jks files are stored.
+<dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency>+
1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 | import org.apache.spark.sql.SparkSession; + +public class java_css_security_httpson { + public static void main(String[] args) { + SparkSession sparkSession = SparkSession.builder().appName("datasource-css").getOrCreate(); + + // Create a DLI data table for DLI-associated CSS + sparkSession.sql("create table css_table(id long, name string) using css options( 'es.nodes' = '192.168.13.189:9200', 'es.nodes.wan.only' = 'true', 'resource' = '/mytest','es.net.ssl'='true','es.net.ssl.keystore.location' = 'obs://Bucket name/Address/transport-keystore.jks','es.net.ssl.keystore.pass' = '**','es.net.ssl.truststore.location'='obs://Bucket name/Address/truststore.jks','es.net.ssl.truststore.pass'='**','es.net.http.auth.user'='admin','es.net.http.auth.pass'='**')"); + + //*****************************SQL model*********************************** + // Insert data into the DLI data table + sparkSession.sql("insert into css_table values(34, 'Yuan'),(28, 'Kids')"); + + // Read data from DLI data table + sparkSession.sql("select * from css_table").show(); + + // drop table + sparkSession.sql("drop table css_table"); + + sparkSession.close(); + } +} + |
DLI allows you to use a custom JAR package to run Flink jobs and write data to OBS. This section describes how to write processed Kafka data to OBS. You need to modify the parameters in the example Java code based on site requirements.
+Development tools such as IntelliJ IDEA, as well as the JDK and Maven, have been installed and configured.
+<?xml version="1.0" encoding="UTF-8"?> +<project xmlns="http://maven.apache.org/POM/4.0.0" + xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" + xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> + <parent> + <artifactId>Flink-demo</artifactId> + <version>1.0-SNAPSHOT</version> + </parent> + <modelVersion>4.0.0</modelVersion> + + <artifactId>flink-kafka-to-obs</artifactId> + + <properties> + <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> + <!--Flink version--> + <flink.version>1.12.2</flink.version> + <!--JDK version --> + <java.version>1.8</java.version> + <!--Scala 2.11 --> + <scala.binary.version>2.11</scala.binary.version> + <slf4j.version>2.13.3</slf4j.version> + <log4j.version>2.10.0</log4j.version> + <maven.compiler.source>8</maven.compiler.source> + <maven.compiler.target>8</maven.compiler.target> + </properties> + + <dependencies> + <!-- flink --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-java</artifactId> + <version>${flink.version}</version> + <scope>provided</scope> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-streaming-java_${scala.binary.version}</artifactId> + <version>${flink.version}</version> + <scope>provided</scope> + </dependency> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-statebackend-rocksdb_2.11</artifactId> + <version>${flink.version}</version> + <scope>provided</scope> + </dependency> + + <!-- kafka --> + <dependency> + <groupId>org.apache.flink</groupId> + <artifactId>flink-connector-kafka_2.11</artifactId> + <version>${flink.version}</version> + </dependency> + + <!-- logging --> + <dependency> + <groupId>org.apache.logging.log4j</groupId> + <artifactId>log4j-slf4j-impl</artifactId> + <version>${slf4j.version}</version> + <scope>provided</scope> + </dependency> + <dependency> + <groupId>org.apache.logging.log4j</groupId> + <artifactId>log4j-api</artifactId> + <version>${log4j.version}</version> + <scope>provided</scope> + </dependency> + <dependency> + <groupId>org.apache.logging.log4j</groupId> + <artifactId>log4j-core</artifactId> + <version>${log4j.version}</version> + <scope>provided</scope> + </dependency> + <dependency> + <groupId>org.apache.logging.log4j</groupId> + <artifactId>log4j-jcl</artifactId> + <version>${log4j.version}</version> + <scope>provided</scope> + </dependency> + </dependencies> + + <build> + <plugins> + <plugin> + <groupId>org.apache.maven.plugins</groupId> + <artifactId>maven-assembly-plugin</artifactId> + <version>3.3.0</version> + <executions> + <execution> + <phase>package</phase> + <goals> + <goal>single</goal> + </goals> + </execution> + </executions> + <configuration> + <archive> + <manifest> + <mainClass>com.dli.FlinkKafkaToObsExample</mainClass> + </manifest> + </archive> + <descriptorRefs> + <descriptorRef>jar-with-dependencies</descriptorRef> + </descriptorRefs> + </configuration> + </plugin> + </plugins> + <resources> + <resource> + <directory>../main/config</directory> + <filtering>true</filtering> + <includes> + <include>**/*.*</include> + </includes> + </resource> + </resources> + </build> +</project>+
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.core.fs.Path;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Properties;

/**
 * @author xxx
 * @date 6/26/21
 */
public class FlinkKafkaToObsExample {
    private static final Logger LOG = LoggerFactory.getLogger(FlinkKafkaToObsExample.class);

    public static void main(String[] args) throws Exception {
        LOG.info("Start Kafka2OBS Flink Streaming Source Java Demo.");
        ParameterTool params = ParameterTool.fromArgs(args);
        LOG.info("Params: " + params.toString());

        // Kafka connection address
        String bootstrapServers;
        // Kafka consumer group
        String kafkaGroup;
        // Kafka topic
        String kafkaTopic;
        // Consumption policy. This policy is used only when the partition does not have a checkpoint or the checkpoint expires.
        // If a valid checkpoint exists, consumption continues from this checkpoint.
        // When the policy is set to LATEST, the consumption starts from the latest data. This policy will ignore the existing data in the stream.
        // When the policy is set to EARLIEST, the consumption starts from the earliest data. This policy will obtain all valid data in the stream.
        String offsetPolicy;
        // OBS file output path, in the format of obs://bucket/path.
        String outputPath;
        // Checkpoint output path, in the format of obs://bucket/path.
        String checkpointPath;

        bootstrapServers = params.get("bootstrap.servers", "xxxx:9092,xxxx:9092,xxxx:9092");
        kafkaGroup = params.get("group.id", "test-group");
        kafkaTopic = params.get("topic", "test-topic");
        offsetPolicy = params.get("offset.policy", "earliest");
        outputPath = params.get("output.path", "obs://bucket/output");
        checkpointPath = params.get("checkpoint.path", "obs://bucket/checkpoint");

        try {
            // Create an execution environment.
            StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
            streamEnv.setParallelism(4);
            // Use RocksDB as the state backend. The commented line is an equivalent way to pass the checkpoint path directly.
            // RocksDBStateBackend rocksDbBackend = new RocksDBStateBackend(checkpointPath, true);
            RocksDBStateBackend rocksDbBackend = new RocksDBStateBackend(new FsStateBackend(checkpointPath), true);
            streamEnv.setStateBackend(rocksDbBackend);
            // Enable Flink checkpointing mechanism. If enabled, the offset information will be synchronized to Kafka.
            streamEnv.enableCheckpointing(300000);
            // Set the minimum interval between two checkpoints.
            streamEnv.getCheckpointConfig().setMinPauseBetweenCheckpoints(60000);
            // Set the checkpoint timeout duration.
            streamEnv.getCheckpointConfig().setCheckpointTimeout(60000);
            // Set the maximum number of concurrent checkpoints.
            streamEnv.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
            // Retain checkpoints when a job is canceled.
            streamEnv.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

            // Source: Connect to the Kafka data source.
            Properties properties = new Properties();
            properties.setProperty("bootstrap.servers", bootstrapServers);
            properties.setProperty("group.id", kafkaGroup);
            properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, offsetPolicy);
            String topic = kafkaTopic;

            // Create a Kafka consumer.
            FlinkKafkaConsumer<String> kafkaConsumer =
                new FlinkKafkaConsumer<>(topic, new SimpleStringSchema(), properties);
            /**
             * Read partitions from the offset submitted by the consumer group (specified by group.id in the consumer attribute) in Kafka brokers.
             * If the partition offset cannot be found, set it by using the auto.offset.reset parameter.
             * For details, see https://ci.apache.org/projects/flink/flink-docs-release-1.13/zh/docs/connectors/datastream/kafka/.
             */
            kafkaConsumer.setStartFromGroupOffsets();

            // Add Kafka to the data source.
            DataStream<String> stream = streamEnv.addSource(kafkaConsumer).setParallelism(3).disableChaining();

            // Create a file output stream.
            final StreamingFileSink<String> sink = StreamingFileSink
                // Specify the file output path and row encoding format.
                .forRowFormat(new Path(outputPath), new SimpleStringEncoder<String>("UTF-8"))
                // Specify the file output path and bulk encoding format. Files are output in parquet format.
                //.forBulkFormat(new Path(outputPath), ParquetAvroWriters.forGenericRecord(schema))
                // Specify a custom bucket assigner.
                .withBucketAssigner(new DateTimeBucketAssigner<>())
                // Specify the rolling policy.
                .withRollingPolicy(OnCheckpointRollingPolicy.build())
                .build();

            // Add sink for DIS Consumer data source
            stream.addSink(sink).disableChaining().name("obs");

            // stream.print();
            streamEnv.execute();
        } catch (Exception e) {
            LOG.error(e.getMessage(), e);
        }
    }
}
Parameter + |
+Description + |
+Example + |
+
---|---|---|
bootstrap.servers + |
+Kafka connection address + |
+IP address of the Kafka service 1:9092, IP address of the Kafka service 2:9092, IP address of the Kafka service 3:9092 + |
+
group.id + |
+Kafka consumer group + |
+test-group + |
+
topic + |
+Kafka consumption topic + |
+test-topic + |
+
offset.policy + |
+Kafka offset policy + |
+earliest + |
+
output.path + |
+OBS path to which data will be written + |
+obs://bucket/output + |
+
checkpoint.path + |
+Checkpoint OBS path + |
+obs://bucket/checkpoint + |
+
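Put together, the class arguments for this job could look like the following sketch; all values are placeholders taken from the defaults in the sample code, so replace them with your own Kafka addresses and OBS paths.
--bootstrap.servers xxxx:9092,xxxx:9092,xxxx:9092 --group.id test-group --topic test-topic --offset.policy earliest --output.path obs://bucket/output --checkpoint.path obs://bucket/checkpoint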
After the application is developed, upload the JAR package to DLI by referring to Flink Jar Job Examples and check whether related data exists in the OBS path.
+This example applies only to MRS OpenTSDB.
+A datasource connection has been created and bound to a queue on the DLI management console.
+Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
+1 +2 +3 +4 +5 | <dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency> + |
1 | import org.apache.spark.sql.SparkSession; + |
1 | sparkSession = SparkSession.builder().appName("datasource-opentsdb").getOrCreate(); + |
1 | sparkSession.sql("create table opentsdb_new_test using opentsdb options('Host'='10.0.0.171:4242','metric'='ctopentsdb','tags'='city,location')"); + |
1 | sparkSession.sql("insert into opentsdb_new_test values('Penglai', 'abc', '2021-06-30 18:00:00', 30.0)"); + |
1 | sparkSession.sql("select * from opentsdb_new_test").show(); + |
spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/opentsdb/*
+spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/opentsdb/*
+<dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency>+
1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 | import org.apache.spark.sql.SparkSession; + +public class java_mrs_opentsdb { + + private static SparkSession sparkSession = null; + + public static void main(String[] args) { + //create a SparkSession session + sparkSession = SparkSession.builder().appName("datasource-opentsdb").getOrCreate(); + + sparkSession.sql("create table opentsdb_new_test using opentsdb options('Host'='10.0.0.171:4242','metric'='ctopentsdb','tags'='city,location')"); + + //*****************************SQL module*********************************** + sparkSession.sql("insert into opentsdb_new_test values('Penglai', 'abc', '2021-06-30 18:00:00', 30.0)"); + System.out.println("Penglai new timestamp"); + sparkSession.sql("select * from opentsdb_new_test").show(); + + sparkSession.close(); + + } +} + |
For details, see section "Modifying the Host Information" in the Data Lake Insight User Guide.
+Before creating an MRS HBase table to be associated with the DLI table, ensure that the HBase table exists. The following provides example code to describe how to create an MRS HBase table:
+describe 'hbtest'+
create 'hbtest', 'info', 'detail'+
In this command, hbtest indicates the table name, and other parameters indicate the column family names.
+This example applies only to MRS HBase.
+A datasource connection has been created and bound to a queue on the DLI management console.
+Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
+1 +2 +3 +4 +5 | <dependency> + <groupId>org.apache.spark</groupId> + <artifactId>spark-sql_2.11</artifactId> + <version>2.3.2</version> +</dependency> + |
1 | import org.apache.spark.sql.SparkSession; + |
1 | SparkSession sparkSession = SparkSession.builder().appName("datasource-HBase-MRS").getOrCreate(); + |
1 | sparkSession.sql("CREATE TABLE testhbase(id STRING, location STRING, city STRING) using hbase OPTIONS('ZKHost'='10.0.0.63:2181','TableName'='hbtest','RowKey'='id:5','Cols'='location:info.location,city:detail.city') "); + |
1 | sparkSession.sql("insert into testhbase values('12345','abc','xxx')"); + |
1 | sparkSession.sql("select * from testhbase").show(); + |
1 | sparkSession.sql("CREATE TABLE testhbase(id STRING, location STRING, city STRING) using hbase OPTIONS('ZKHost'='10.0.0.63:2181','TableName'='hbtest','RowKey'='id:5','Cols'='location:info.location,city:detail.city,'krb5conf'='./krb5.conf','keytab'='./user.keytab','principal'='krbtest') "); + |
For details about how to obtain the krb5.conf and keytab files, see Completing Configurations for Enabling Kerberos Authentication.
+1 | sparkSession.sql("insert into testhbase values('95274','abc','Hongkong')"); + |
1 | sparkSession.sql("select * from testhbase").show(); + |
spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/hbase/*
+spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/hbase/*
+1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 +10 +11 +12 +13 +14 +15 +16 +17 | import org.apache.spark.sql.SparkSession; + +public class java_mrs_hbase { + + public static void main(String[] args) { + //create a SparkSession session + SparkSession sparkSession = SparkSession.builder().appName("datasource-HBase-MRS").getOrCreate(); + + sparkSession.sql("CREATE TABLE testhbase(id STRING, location STRING, city STRING) using hbase OPTIONS('ZKHost'='10.0.0.63:2181','TableName'='hbtest','RowKey'='id:5','Cols'='location:info.location,city:detail.city') "); + + //*****************************SQL model*********************************** + sparkSession.sql("insert into testhbase values('95274','abc','Hongkong')"); + sparkSession.sql("select * from testhbase").show(); + + sparkSession.close(); + } +} + |
import org.apache.spark.SparkContext;
import org.apache.spark.SparkFiles;
import org.apache.spark.sql.SparkSession;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class Test_HBase_SparkSql_Kerberos {

    private static void copyFile(File src, File dst) throws IOException {
        InputStream input = null;
        OutputStream output = null;
        try {
            input = new FileInputStream(src);
            output = new FileOutputStream(dst);
            byte[] buf = new byte[1024];
            int bytesRead;
            while ((bytesRead = input.read(buf)) > 0) {
                output.write(buf, 0, bytesRead);
            }
        } finally {
            if (input != null) {
                input.close();
            }
            if (output != null) {
                output.close();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException, IOException {
        SparkSession sparkSession = SparkSession.builder().appName("Test_HBase_SparkSql_Kerberos").getOrCreate();
        SparkContext sc = sparkSession.sparkContext();
        // Distribute the Kerberos configuration and keytab files stored in OBS
        sc.addFile("obs://xietest1/lzq/krb5.conf");
        sc.addFile("obs://xietest1/lzq/user.keytab");
        Thread.sleep(20);

        // Copy the distributed files to the working directory so that they can be referenced by path
        File krb5_startfile = new File(SparkFiles.get("krb5.conf"));
        File keytab_startfile = new File(SparkFiles.get("user.keytab"));
        String path_user = System.getProperty("user.dir");
        File keytab_endfile = new File(path_user + "/" + keytab_startfile.getName());
        File krb5_endfile = new File(path_user + "/" + krb5_startfile.getName());
        copyFile(krb5_startfile, krb5_endfile);
        copyFile(keytab_startfile, keytab_endfile);
        Thread.sleep(20);

        /**
         * Create a DLI table associated with the HBase table
         */
        sparkSession.sql("CREATE TABLE testhbase(id string,booleanf boolean,shortf short,intf int,longf long,floatf float,doublef double) "
                + "using hbase OPTIONS("
                + "'ZKHost'='10.0.0.146:2181',"
                + "'TableName'='hbtest',"
                + "'RowKey'='id:100',"
                + "'Cols'='booleanf:CF1.booleanf,shortf:CF1.shortf,intf:CF1.intf,longf:CF2.longf,floatf:CF1.floatf,doublef:CF2.doublef',"
                + "'krb5conf'='" + path_user + "/krb5.conf',"
                + "'keytab'='" + path_user + "/user.keytab',"
                + "'principal'='krbtest') ");

        //*****************************SQL model***********************************
        sparkSession.sql("insert into testhbase values('newtest',true,1,2,3,4,5)");
        sparkSession.sql("select * from testhbase").show();
        sparkSession.close();
    }
}
The Spark job fails to execute, and the job log indicates that the Java server connection or the container failed to start.
Check whether the host information of the datasource connection has been modified. If not, modify the host information by referring to Configuring MRS Host Information in DLI Datasource Connection. Then create and submit a Spark job again.
This section provides Java example code that demonstrates how to use a Spark job to access data from the GaussDB(DWS) data source.
A datasource connection has been created and bound to a queue on the DLI management console.
Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
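For example, the connection password can be read from an environment variable at run time instead of being written into the source. The following is a minimal sketch only; the variable name DWS_PASSWORD is a hypothetical choice, and the URL, database, table, and user reuse the values from the example statements that follow:

```java
import org.apache.spark.sql.SparkSession;

// Minimal sketch: read the GaussDB(DWS) password from an environment variable
// (DWS_PASSWORD is a hypothetical variable name) instead of hard-coding it.
public class DwsPasswordFromEnv {
    public static void main(String[] args) {
        String dwsPassword = System.getenv("DWS_PASSWORD");
        if (dwsPassword == null || dwsPassword.isEmpty()) {
            throw new IllegalStateException("DWS_PASSWORD environment variable is not set");
        }

        SparkSession sparkSession = SparkSession.builder().appName("datasource-dws").getOrCreate();
        // Same connection options as in the example below, with the password injected at run time
        sparkSession.sql("CREATE TABLE IF NOT EXISTS dli_to_dws USING JDBC OPTIONS ("
                + "'url'='jdbc:postgresql://10.0.0.233:8000/postgres',"
                + "'dbtable'='test',"
                + "'user'='dbadmin',"
                + "'password'='" + dwsPassword + "')");
        sparkSession.close();
    }
}
```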
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.2</version>
</dependency>
import org.apache.spark.sql.SparkSession;
SparkSession sparkSession = SparkSession.builder().appName("datasource-dws").getOrCreate();
1 | sparkSession.sql("CREATE TABLE IF NOT EXISTS dli_to_dws USING JDBC OPTIONS ('url'='jdbc:postgresql://10.0.0.233:8000/postgres','dbtable'='test','user'='dbadmin','password'='**')"); + |
1 | sparkSession.sql("insert into dli_to_dws values(3,'L'),(4,'X')"); + |
1 | sparkSession.sql("select * from dli_to_dws").show(); + |
spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/dws/*
spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/dws/*
Accessing GaussDB(DWS) tables through SQL APIs
import org.apache.spark.sql.SparkSession;

public class java_dws {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder().appName("datasource-dws").getOrCreate();

        sparkSession.sql("CREATE TABLE IF NOT EXISTS dli_to_dws USING JDBC OPTIONS ('url'='jdbc:postgresql://10.0.0.233:8000/postgres','dbtable'='test','user'='dbadmin','password'='**')");

        //*****************************SQL model***********************************
        //Insert data into the DLI data table
        sparkSession.sql("insert into dli_to_dws values(3,'Liu'),(4,'Xie')");

        //Read data from DLI data table
        sparkSession.sql("select * from dli_to_dws").show();

        //drop table
        sparkSession.sql("drop table dli_to_dws");

        sparkSession.close();
    }
}
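Besides the SQL API shown above, the same GaussDB(DWS) table can also be read with Spark's generic JDBC DataFrame API. The following is a minimal sketch, not part of the original sample; it assumes the same endpoint, database, table, and credentials as the example above, and the driver option may need to be set explicitly depending on the JDBC driver available on the queue:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Minimal sketch: read the DWS table through Spark's generic JDBC data source.
public class java_dws_dataframe {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder().appName("datasource-dws-df").getOrCreate();

        Dataset<Row> df = sparkSession.read()
                .format("jdbc")
                .option("url", "jdbc:postgresql://10.0.0.233:8000/postgres")
                .option("dbtable", "test")
                .option("user", "dbadmin")
                .option("password", "**")   // replace with a securely stored password
                // Depending on the JDBC driver packaged with the queue, you may also
                // need to set .option("driver", "<driver class>").
                .load();

        df.show();

        sparkSession.close();
    }
}
```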
You can use Hive User-Defined Table-Generating Functions (UDTFs) to customize table-valued functions. Hive UDTFs are used for one-in-multiple-out data operations: a UDTF reads a row of data and outputs multiple rows or values.
Log in to the DLI console and choose Data Management > Package Management. On the displayed page, select your UDTF Jar package and click Manage Permissions in the Operation column. On the permission management page, click Grant Permission in the upper right corner and select the required permissions.
Before you start, set up the development environment.
| Item | Description |
| --- | --- |
| OS | Windows 7 or later |
| JDK | JDK 1.8. |
| IntelliJ IDEA | This tool is used for application development. The version of the tool must be 2019.1 or other compatible versions. |
| Maven | Basic configurations of the development environment. Maven is used for project management throughout the lifecycle of software development. |
| No. | Phase | Software Portal | Description |
| --- | --- | --- | --- |
| 1 | Create a Maven project and configure the POM file. | IntelliJ IDEA | Write the UDTF code by referring to the steps in Procedure (steps 1 to 3). |
| 2 | Write UDTF code. | IntelliJ IDEA | |
| 3 | Debug, compile, and pack the code into a JAR package. | IntelliJ IDEA | |
| 4 | Upload the JAR package to OBS. | OBS console | Upload the UDTF JAR file to an OBS directory. |
| 5 | Create the UDTF on DLI. | DLI console | Create a UDTF on the SQL job management page of the DLI console. |
| 6 | Verify and use the UDTF on DLI. | DLI console | Use the UDTF in your DLI job. |
<dependencies>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>1.2.1</version>
    </dependency>
</dependencies>
Set the package name as you need. Then, press Enter.
Create a Java Class file in the package path. In this example, the Java Class file is UDTFSplit.
The UDTF class must extend org.apache.hadoop.hive.ql.udf.generic.GenericUDTF and implement the initialize, process, and close methods.
public void process(Object[] args) throws HiveException {
    if (args.length == 0) {
        return;
    }
    String input = args[0].toString();
    if (StringUtils.isEmpty(input)) {
        return;
    }
    String[] test = input.split(";");
    for (int i = 0; i < test.length; i++) {
        try {
            String[] result = test[i].split(":");
            forward(result);
        } catch (Exception e) {
            continue;
        }
    }
}
After the compilation is successful, click package.
The generated JAR package is stored in the target directory. In this example, MyUDTF-1.0-SNAPSHOT.jar is stored in D:\MyUDTF\target.
The region of the OBS bucket to which the JAR package is uploaded must be the same as the region of the DLI queue. Cross-region operations are not allowed.
CREATE FUNCTION mytestsplit AS 'com.demo.UDTFSplit' using jar 'obs://dli-test-obs01/MyUDTF-1.0-SNAPSHOT.jar';
Use the UDTF created in step 6 in the SELECT statement as follows:
select mytestsplit('abc:123\;efd:567\;utf:890');
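Given the process() implementation shown earlier (each ';'-separated segment is split on ':' and forwarded as one row with columns col1 and col2), the statement above is expected to return rows similar to the following; the exact rendering depends on the SQL editor:

```
col1    col2
abc     123
efd     567
utf     890
```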
If this function is no longer used, run the following statement to delete the function:
DROP FUNCTION mytestsplit;
The complete UDTFSplit.java code is as follows:
import java.util.ArrayList;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class UDTFSplit extends GenericUDTF {

    @Override
    public void close() throws HiveException {
        // Nothing to release in this example
    }

    @Override
    public void process(Object[] args) throws HiveException {
        // Split the input string on ';', then split each segment on ':'
        // and forward the resulting pair as one output row
        if (args.length == 0) {
            return;
        }
        String input = args[0].toString();
        if (StringUtils.isEmpty(input)) {
            return;
        }
        String[] test = input.split(";");
        for (int i = 0; i < test.length; i++) {
            try {
                String[] result = test[i].split(":");
                forward(result);
            } catch (Exception e) {
                continue;
            }
        }
    }

    @Override
    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        // Accept exactly one primitive (string) argument and declare two string output columns
        if (args.length != 1) {
            throw new UDFArgumentLengthException("ExplodeMap takes only one argument");
        }
        if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentException("ExplodeMap takes string as a parameter");
        }

        ArrayList<String> fieldNames = new ArrayList<String>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("col1");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("col2");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }
}
DLI is fully compatible with open-source Apache Spark and allows you to import, query, analyze, and process job data by programming. This section describes how to write a Spark program to read and query OBS data, compile and package the code, and submit it to a Spark Jar job.
Before you start, set up the development environment.
| Item | Description |
| --- | --- |
| OS | Windows 7 or later |
| JDK | JDK 1.8. |
| IntelliJ IDEA | This tool is used for application development. The version of the tool must be 2019.1 or other compatible versions. |
| Maven | Basic configurations of the development environment. Maven is used for project management throughout the lifecycle of software development. |
| No. | Phase | Software Portal | Description |
| --- | --- | --- | --- |
| 1 | Create a queue for general use. | DLI console | The DLI queue is created for running your job. |
| 2 | Upload data to an OBS bucket. | OBS console | The test data needs to be uploaded to your OBS bucket. |
| 3 | Create a Maven project and configure the POM file. | IntelliJ IDEA | Write your code by referring to the sample code for reading data from OBS (steps 3 to 5). |
| 4 | Write code. | IntelliJ IDEA | |
| 5 | Debug, compile, and pack the code into a JAR package. | IntelliJ IDEA | |
| 6 | Upload the JAR package to OBS and DLI. | OBS console | Upload the generated Spark JAR package to an OBS directory and a DLI program package. |
| 7 | Create a Spark Jar job. | DLI console | The Spark Jar job is created and submitted on the DLI console. |
| 8 | Check the execution result of the job. | DLI console | You can view the job running status and run logs. |
{"name":"Michael"} +{"name":"Andy", "age":30} +{"name":"Justin", "age":19}+
In this example, the Maven project name is SparkJarObs, and the project storage path is D:\DLITest\SparkJarObs.
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.3.2</version>
    </dependency>
</dependencies>
Set the package name as you need. Then, press Enter.
Create a Java Class file in the package path. In this example, the Java Class file is SparkDemoObs.
Code the SparkDemoObs program to read the people.json file from the OBS bucket, create the temporary table people, and query data.
For the sample code, see Sample Code.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
SparkSession spark = SparkSession
        .builder()
        .config("spark.hadoop.fs.obs.access.key", "xxx")
        .config("spark.hadoop.fs.obs.secret.key", "yyy")
        .appName("java_spark_demo")
        .getOrCreate();
Dataset<Row> df = spark.read().json("obs://dli-test-obs01/people.json");
df.printSchema();
df.createOrReplaceTempView("people");
Dataset<Row> sqlDF = spark.sql("SELECT * FROM people");
sqlDF.show();
sqlDF.write().mode(SaveMode.Overwrite).parquet("obs://dli-test-obs01/result/parquet");
spark.read().parquet("obs://dli-test-obs01/result/parquet").show();
spark.stop();
After the compilation is successful, double-click package.
The generated JAR package is stored in the target directory. In this example, SparkJarObs-1.0-SNAPSHOT.jar is stored in D:\DLITest\SparkJarObs\target.
You can only set the Application parameter when creating a Spark job and select the required JAR file from OBS.
Upload the JAR file to OBS and DLI.
You do not need to set other parameters.
In the Operation column, click Edit, change the value of Main Class to com.dli.demo.SparkDemoObs (the package and class name used in the sample code), and click Execute to run the job again.
Hard-coded or plaintext access.key and secret.key pose significant security risks. To ensure security, encrypt your AK and SK, store them in configuration files or environment variables, and decrypt them when needed.
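For example, the AK/SK can be read from environment variables when the SparkSession is built instead of being written into the source. The following is a minimal sketch only; the variable names OBS_ACCESS_KEY and OBS_SECRET_KEY are hypothetical and not part of the original sample:

```java
import org.apache.spark.sql.SparkSession;

// Minimal sketch: pull the AK/SK from environment variables
// (OBS_ACCESS_KEY and OBS_SECRET_KEY are hypothetical names) rather than hard-coding them.
public class SparkDemoObsEnvCredentials {
    public static void main(String[] args) {
        String ak = System.getenv("OBS_ACCESS_KEY");
        String sk = System.getenv("OBS_SECRET_KEY");

        SparkSession spark = SparkSession
                .builder()
                .config("spark.hadoop.fs.obs.access.key", ak)
                .config("spark.hadoop.fs.obs.secret.key", sk)
                .appName("java_spark_demo")
                .getOrCreate();

        // ... same processing logic as in the sample code below ...

        spark.stop();
    }
}
```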
package com.dli.demo;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class SparkDemoObs {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .config("spark.hadoop.fs.obs.access.key", "xxx")
                .config("spark.hadoop.fs.obs.secret.key", "yyy")
                .appName("java_spark_demo")
                .getOrCreate();
        // You can also use --conf to set the AK and SK when submitting the application

        // test json data:
        // {"name":"Michael"}
        // {"name":"Andy", "age":30}
        // {"name":"Justin", "age":19}
        Dataset<Row> df = spark.read().json("obs://dli-test-obs01/people.json");
        df.printSchema();
        // root
        // |-- age: long (nullable = true)
        // |-- name: string (nullable = true)

        // Display the content of the DataFrame to stdout
        df.show();
        // +----+-------+
        // | age|   name|
        // +----+-------+
        // |null|Michael|
        // |  30|   Andy|
        // |  19| Justin|
        // +----+-------+

        // Select only the "name" column
        df.select("name").show();
        // +-------+
        // |   name|
        // +-------+
        // |Michael|
        // |   Andy|
        // | Justin|
        // +-------+

        // Select people older than 21
        df.filter(col("age").gt(21)).show();
        // +---+----+
        // |age|name|
        // +---+----+
        // | 30|Andy|
        // +---+----+

        // Count people by age
        df.groupBy("age").count().show();
        // +----+-----+
        // | age|count|
        // +----+-----+
        // |  19|    1|
        // |null|    1|
        // |  30|    1|
        // +----+-----+

        // Register the DataFrame as a SQL temporary view
        df.createOrReplaceTempView("people");

        Dataset<Row> sqlDF = spark.sql("SELECT * FROM people");
        sqlDF.show();
        // +----+-------+
        // | age|   name|
        // +----+-------+
        // |null|Michael|
        // |  30|   Andy|
        // |  19| Justin|
        // +----+-------+

        sqlDF.write().mode(SaveMode.Overwrite).parquet("obs://dli-test-obs01/result/parquet");
        spark.read().parquet("obs://dli-test-obs01/result/parquet").show();

        spark.stop();
    }
}
If you need to configure high reliability for a Flink application, you can set the parameters when creating your Flink jobs.
The reliability configuration of a Flink Jar job is the same as that of a SQL job and is not described separately in this section.
Total number of CUs = Number of manager CUs + (Total number of concurrent operators / Number of slots per TaskManager) x Number of CUs per TaskManager
For example, with a total of 9 CUs (1 manager CU) and a maximum of 16 concurrent jobs, the number of compute-specific CUs is 8.
If you do not configure TaskManager specifications, a TaskManager occupies 1 CU by default and has no slot. To ensure high reliability, set the number of slots per TaskManager to 2 according to the preceding formula, as shown in the check below.
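As a rough check on the example figures above (9 CUs in total, 1 manager CU, 16 concurrent operators, and the default 1 CU per TaskManager), the formula works out as follows:

```
Total CUs = 1 + (16 / 2) x 1 = 1 + 8 = 9
```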
Set the maximum number of concurrent jobs to twice the number of CUs.
DLI provides various monitoring metrics for Flink jobs. You can define alarm rules as required using different monitoring metrics for fine-grained job monitoring.