A datasource connection has been created and bound to a queue on the DLI management console.
Hard-coded or plaintext passwords pose significant security risks. To ensure security, encrypt your passwords, store them in configuration files or environment variables, and decrypt them when needed.
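For example, here is a minimal sketch that reads the credentials from environment variables; the variable names RDS_USER and RDS_PASSWORD are illustrative, not part of DLI:

```scala
// Read the RDS credentials from environment variables instead of hard-coding them.
// RDS_USER and RDS_PASSWORD are example names; add a decryption step here if the
// stored values are encrypted.
val username = sys.env.getOrElse("RDS_USER",
  throw new IllegalStateException("RDS_USER is not set"))
val password = sys.env.getOrElse("RDS_PASSWORD",
  throw new IllegalStateException("RDS_PASSWORD is not set"))
```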
```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.3.2</version>
</dependency>
```
```scala
import java.util.Properties
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.SaveMode
```
```scala
val sparkSession = SparkSession.builder().getOrCreate()
```
```scala
// Set url, user, and password to the actual values.
sparkSession.sql(
  """CREATE TABLE IF NOT EXISTS dli_to_rds USING JDBC OPTIONS (
    |  'url' = 'jdbc:mysql://to-rds-1174404209-cA37siB6.datasource.com:3306',
    |  'dbtable' = 'test.customer',
    |  'user' = 'root',
    |  'password' = '######',
    |  'driver' = 'com.mysql.jdbc.Driver')""".stripMargin)
```
| Parameter | Description |
| --- | --- |
| url | To obtain the RDS IP address, you need to create a datasource connection first; refer to the Data Lake Insight User Guide for details. If you have created an enhanced datasource connection, use the internal network domain name or internal network address and the database port number provided by RDS. For MySQL, the format is protocol header://internal IP address:internal port number. For PostgreSQL, the format is protocol header://internal IP address:internal port number/database name. For example: jdbc:mysql://192.168.0.193:3306 or jdbc:postgresql://192.168.0.193:3306/postgres. |
| dbtable | To connect to a MySQL cluster, enter database name.table name. To connect to a PostgreSQL cluster, enter schema name.table name. NOTE: If the database and table do not exist, create them first; otherwise, the job fails with an error. |
| user | RDS database username. |
| password | RDS database password. |
| driver | JDBC driver class name. To connect to a MySQL cluster, enter com.mysql.jdbc.Driver. To connect to a PostgreSQL cluster, enter org.postgresql.Driver. |
| partitionColumn | Name of a numeric column used to partition concurrent reads. NOTE: If this parameter is set, lowerBound, upperBound, and numPartitions must also be set. |
| lowerBound | Minimum value of the column specified by partitionColumn. The value is included in the returned result. |
| upperBound | Maximum value of the column specified by partitionColumn. The value is not included in the returned result. |
| numPartitions | Number of concurrent read tasks. NOTE: During reading, the range between lowerBound and upperBound is divided evenly among the tasks. For example, with 'partitionColumn'='id', 'lowerBound'='0', 'upperBound'='100', 'numPartitions'='2', DLI starts two concurrent tasks: one reads rows where 0 <= id < 50, the other reads rows where 50 <= id < 100 (a combined example follows this table). |
| fetchsize | Number of records fetched per batch during reading. The default value is 1000. A larger value improves performance but occupies more memory and may cause memory overflow. |
| batchsize | Number of records written per batch. The default value is 1000. A larger value improves performance but occupies more memory and may cause memory overflow. |
| truncate | Whether to truncate the table, instead of dropping it, when an overwrite is executed. Options: true (clear the table data but keep the table definition) and false (drop the original table and create a new one). The default value is false. |
| isolationLevel | Transaction isolation level. Options: NONE, READ_UNCOMMITTED, READ_COMMITTED, REPEATABLE_READ, and SERIALIZABLE. The default value is READ_UNCOMMITTED. |
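Combining the partitioning options above, here is a hedged sketch of a concurrent read; the URL, table, and credentials are the sample values used elsewhere in this section:

```scala
// Read test.customer with two concurrent tasks: one covers 0 <= id < 50,
// the other 50 <= id < 100, as described for numPartitions above.
val partitionedDF = sparkSession.read.format("jdbc")
  .option("url", "jdbc:mysql://to-rds-1174404209-cA37siB6.datasource.com:3306")
  .option("dbtable", "test.customer")
  .option("user", "root")            // Set this to the actual user.
  .option("password", "######")      // Set this to the actual password.
  .option("driver", "com.mysql.jdbc.Driver")
  .option("partitionColumn", "id")
  .option("lowerBound", "0")
  .option("upperBound", "100")
  .option("numPartitions", "2")
  .option("fetchsize", "1000")       // Records fetched per batch; default 1000.
  .load()
partitionedDF.show()
```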
1 | sparkSession.sql("insert into dli_to_rds values(1, 'John',24),(2, 'Bob',32)") |
```scala
val dataFrame = sparkSession.sql("select * from dli_to_rds")
dataFrame.show()
```
Before data is inserted
After data is inserted
1 | sparkSession.sql("drop table dli_to_rds") |
```scala
val url = "jdbc:mysql://to-rds-1174405057-EA1Kgo8H.datasource.com:3306"
val username = "root"
val password = "######"
val dbtable = "test.customer"
```
```scala
val dataFrame_1 = sparkSession.createDataFrame(List((8, "Jack_1", 18)))
val df = dataFrame_1.withColumnRenamed("_1", "id")
  .withColumnRenamed("_2", "name")
  .withColumnRenamed("_3", "age")
```
```scala
df.write.format("jdbc")
  .option("url", url)
  .option("dbtable", dbtable)
  .option("user", username)
  .option("password", password)
  .option("driver", "com.mysql.jdbc.Driver")
  .mode(SaveMode.Append)
  .save()
```
The value of SaveMode can be one of the following:
- ErrorIfExists: the default mode. An exception is thrown if the target table already contains data.
- Append: the new data is appended to the existing data.
- Overwrite: the existing data is overwritten.
- Ignore: the write is skipped and the existing data is left unchanged if the table already contains data.
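For example, here is a minimal sketch, reusing the connection values defined above, of an overwrite that clears the target table without dropping it; the truncate and isolationLevel options are the ones described in the parameter table:

```scala
// Overwrite test.customer: truncate=true clears the rows but keeps the table
// definition, and isolationLevel sets the transaction isolation for the write.
df.write.format("jdbc")
  .option("url", url)
  .option("dbtable", dbtable)
  .option("user", username)
  .option("password", password)
  .option("driver", "com.mysql.jdbc.Driver")
  .option("truncate", "true")
  .option("isolationLevel", "READ_COMMITTED")
  .mode(SaveMode.Overwrite)
  .save()
```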
```scala
val jdbcDF = sparkSession.read.format("jdbc")
  .option("url", url)
  .option("dbtable", dbtable)
  .option("user", username)
  .option("password", password)
  .option("driver", "com.mysql.jdbc.Driver")  // Use org.postgresql.Driver for a PostgreSQL cluster.
  .load()
```
```scala
val properties = new Properties()
properties.put("user", username)
properties.put("password", password)
val jdbcDF2 = sparkSession.read.jdbc(url, dbtable, properties)
```
Before data is inserted
After data is inserted
Register the DataFrame read by the read.format() or read.jdbc() method as a temporary table; you can then query the data with SQL statements.
```scala
jdbcDF.registerTempTable("customer_test")
sparkSession.sql("select * from customer_test where id = 1").show()
```
Query results
The data created by the createDataFrame() method and the data read by the read.format() and read.jdbc() methods are all DataFrame objects, so you can query single records on them directly. (In the preceding step, the DataFrame data was instead registered as a temporary table and queried with SQL.)
The where() method accepts filter conditions combined with AND and OR, and returns the filtered DataFrame object. The following is an example:
```scala
jdbcDF.where("id = 1 or age <= 10").show()
```
The filter() method is used in the same way as where() and also returns the filtered DataFrame object. The following is an example:
```scala
jdbcDF.filter("id = 1 or age <= 10").show()
```
The select() method returns a DataFrame object containing only the specified fields; one or more fields can be passed. The following are examples:
1 | jdbcDF.select("id").show() |
1 | jdbcDF.select("id", "name").show() |
1 | jdbcDF.select("id","name").where("id<4").show() |
The selectExpr() method performs expression-based processing on fields, such as renaming a field or modifying its value. For example, to rename the name field to name_test and add 1 to each age value, run the following statement:
```scala
jdbcDF.selectExpr("id", "name as name_test", "age+1").show()
```
The col() method obtains a specified field. Unlike select(), col() returns a Column object and can obtain only one field at a time. The following is an example:
```scala
val idCol = jdbcDF.col("id")
```
The drop() method deletes a specified field and returns a DataFrame object that does not contain that field. Only one field can be deleted at a time. The following is an example:
```scala
jdbcDF.drop("id").show()
```
```
spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/rds/*
spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/rds/*
```
```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.3.2</version>
</dependency>
```
```scala
import org.apache.spark.sql.SparkSession

object Test_SQL_RDS {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession.
    val sparkSession = SparkSession.builder().getOrCreate()

    // Create a DLI table associated with the RDS table.
    sparkSession.sql(
      """CREATE TABLE IF NOT EXISTS dli_to_rds USING JDBC OPTIONS (
        |  'url' = 'jdbc:mysql://to-rds-1174404209-cA37siB6.datasource.com:3306',
        |  'dbtable' = 'test.customer',
        |  'user' = 'root',
        |  'password' = '######',
        |  'driver' = 'com.mysql.jdbc.Driver')""".stripMargin)

    //***************************** SQL model ***********************************
    // Insert data into the DLI table.
    sparkSession.sql("insert into dli_to_rds values(1, 'John', 24), (2, 'Bob', 32)")

    // Read data from the DLI table.
    val dataFrame = sparkSession.sql("select * from dli_to_rds")
    dataFrame.show()

    // Drop the table.
    sparkSession.sql("drop table dli_to_rds")

    sparkSession.close()
  }
}
```
```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SaveMode

object Test_SQL_RDS {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession.
    val sparkSession = SparkSession.builder().getOrCreate()

    //***************************** DataFrame model ***********************************
    // Set the connection parameters: url, username, password, and dbtable.
    val url = "jdbc:mysql://to-rds-1174404209-cA37siB6.datasource.com:3306"
    val username = "root"
    val password = "######"
    val dbtable = "test.customer"

    // Create a DataFrame and initialize its data.
    val dataFrame_1 = sparkSession.createDataFrame(List((1, "Jack", 18)))

    // Rename the columns generated by createDataFrame().
    val df = dataFrame_1.withColumnRenamed("_1", "id")
      .withColumnRenamed("_2", "name")
      .withColumnRenamed("_3", "age")

    // Write the data to the RDS table.
    df.write.format("jdbc")
      .option("url", url)
      .option("dbtable", dbtable)
      .option("user", username)
      .option("password", password)
      .option("driver", "com.mysql.jdbc.Driver")
      .mode(SaveMode.Append)
      .save()

    // Manipulate the DataFrame object.
    // Keep only the rows whose id is not 1.
    val newDF = df.filter("id != 1")
    newDF.show()

    // Drop the id column.
    val newDF_1 = df.drop("id")
    newDF_1.show()

    // Read the data of the customer table in the RDS database.
    // Way one: read data from RDS using read.format().
    val jdbcDF = sparkSession.read.format("jdbc")
      .option("url", url)
      .option("dbtable", dbtable)
      .option("user", username)
      .option("password", password)
      .option("driver", "com.mysql.jdbc.Driver")
      .load()

    // Way two: read data from RDS using read.jdbc().
    val properties = new Properties()
    properties.put("user", username)
    properties.put("password", password)
    val jdbcDF2 = sparkSession.read.jdbc(url, dbtable, properties)

    // Register the DataFrame read by read.format() or read.jdbc() as a
    // temporary table, and query the data using SQL statements.
    jdbcDF.registerTempTable("customer_test")
    val result = sparkSession.sql("select * from customer_test where id = 1")
    result.show()

    sparkSession.close()
  }
}
```
```scala
// The where() method accepts conditions combined with "and" and "or" and
// returns the filtered DataFrame object.
jdbcDF.where("id = 1 or age <= 10").show()

// The filter() method is used in the same way as the where() method.
jdbcDF.filter("id = 1 or age <= 10").show()

// The select() method takes one or more field names and returns a DataFrame
// object containing only those fields.
jdbcDF.select("id").show()
jdbcDF.select("id", "name").show()
jdbcDF.select("id", "name").where("id < 4").show()

// The selectExpr() method applies expressions to fields, such as renaming a
// field or modifying its value.
jdbcDF.selectExpr("id", "name as name_test", "age+1").show()

// The col() method gets one specified field at a time; the return type is Column.
val idCol = jdbcDF.col("id")

// The drop() method returns a DataFrame object without the deleted field;
// only one field can be deleted at a time.
jdbcDF.drop("id").show()
```