DLI is fully compatible with open-source Apache Spark and lets you import, query, analyze, and process data programmatically. This section describes how to write a Spark program that reads and queries OBS data, compile and package the code, and submit it to DLI as a Spark Jar job.
Before you start, set up the development environment.
| Item | Description |
|---|---|
| OS | Windows 7 or later |
| JDK | JDK 1.8 |
| IntelliJ IDEA | Used for application development. Use version 2019.1 or another compatible version. |
| Maven | Basic configuration of the development environment. Maven is used for project management throughout the software development lifecycle. |
| No. | Phase | Software Portal | Description |
|---|---|---|---|
| 1 | Create a queue for general use. | DLI console | The DLI queue is created for running your job. |
| 2 | Upload data to an OBS bucket. | OBS console | The test data needs to be uploaded to your OBS bucket. |
| 3 | Create a Maven project and configure the POM file. | IntelliJ IDEA | Write your code by referring to the sample code for reading data from OBS. |
| 4 | Write code. | IntelliJ IDEA | Write your code by referring to the sample code for reading data from OBS. |
| 5 | Debug, compile, and pack the code into a JAR package. | IntelliJ IDEA | Write your code by referring to the sample code for reading data from OBS. |
| 6 | Upload the JAR package to OBS and DLI. | OBS console | Upload the generated Spark JAR package to an OBS directory and to DLI package management. |
| 7 | Create a Spark Jar job. | DLI console | The Spark Jar job is created and submitted on the DLI console. |
| 8 | Check the execution result of the job. | DLI console | You can view the job running status and run logs. |
{"name":"Michael"} {"name":"Andy", "age":30} {"name":"Justin", "age":19}
In this example, the Maven project name is SparkJarObs, and the project storage path is D:\DLITest\SparkJarObs.
Add the following dependency to the pom.xml file:

```xml
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.3.2</version>
    </dependency>
</dependencies>
```
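For reference, a minimal complete pom.xml for this example might look as follows. The groupId is an assumption based on the sample code's package name; the artifactId and version match the SparkJarObs-1.0-SNAPSHOT.jar generated later in this example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <!-- Assumed coordinates; the artifactId and version match the
         SparkJarObs-1.0-SNAPSHOT.jar produced in this example. -->
    <groupId>com.dli.demo</groupId>
    <artifactId>SparkJarObs</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <!-- JDK 1.8, per the development environment table above -->
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.3.2</version>
        </dependency>
    </dependencies>
</project>
```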
Set the package name as needed; in this example, it is com.dli.demo, matching the sample code. Then press Enter.
Create a Java Class file in the package path. In this example, the Java Class file is SparkDemoObs.
Code the SparkDemoObs program to read the people.json file from the OBS bucket, create the temporary table people, and query data.
For the sample code, see Sample Code.
Import the dependencies:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
```

Create a SparkSession and set the AK and SK used to access OBS (replace xxx and yyy with your own credentials):

```java
SparkSession spark = SparkSession
        .builder()
        .config("spark.hadoop.fs.obs.access.key", "xxx")
        .config("spark.hadoop.fs.obs.secret.key", "yyy")
        .appName("java_spark_demo")
        .getOrCreate();
```

Read the people.json file from the OBS bucket and print the schema:

```java
Dataset<Row> df = spark.read().json("obs://dli-test-obs01/people.json");
df.printSchema();
```

Register the DataFrame as the temporary view people:

```java
df.createOrReplaceTempView("people");
```

Query the temporary view:

```java
Dataset<Row> sqlDF = spark.sql("SELECT * FROM people");
sqlDF.show();
```

Write the query result to OBS in Parquet format, then read it back to verify:

```java
sqlDF.write().mode(SaveMode.Overwrite).parquet("obs://dli-test-obs01/result/parquet");
spark.read().parquet("obs://dli-test-obs01/result/parquet").show();
```

Stop the SparkSession:

```java
spark.stop();
```
After the compilation is successful, double-click package in the Maven tool window.
The generated JAR package is stored in the target directory. In this example, SparkJarObs-1.0-SNAPSHOT.jar is stored in D:\DLITest\SparkJarObs\target.
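If you prefer the command line, the standard Maven lifecycle command, run from the project root, produces the same JAR in the target directory:

```
mvn clean package
```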
Upload the JAR file to OBS and DLI. When creating the Spark job, you only need to set the Application parameter: select the required JAR file from OBS. You do not need to set other parameters.
In the Operation column, click Edit, change the value of Main Class to com.dli.demo.SparkDemoObs, and click Execute to run the job again.
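As noted in the sample code, the AK and SK can also be supplied as Spark configuration when the application is submitted, instead of being hard-coded in SparkSession.builder(). A minimal sketch, assuming your submission method accepts --conf arguments (replace the placeholders with your own credentials):

```
--conf spark.hadoop.fs.obs.access.key=<your-ak>
--conf spark.hadoop.fs.obs.secret.key=<your-sk>
```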
Hard-coded or plaintext access.key and secret.key pose significant security risks. To ensure security, encrypt your AK and SK, store them in configuration files or environment variables, and decrypt them when needed.
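As one way to follow this advice, here is a minimal sketch that reads the credentials from environment variables instead of hard-coding them. The variable names DLI_ACCESS_KEY and DLI_SECRET_KEY are hypothetical, not a DLI convention:

```java
import org.apache.spark.sql.SparkSession;

public class SecureSessionExample {
    public static void main(String[] args) {
        // Hypothetical environment variable names; set them in the runtime
        // environment rather than committing credentials to source control.
        String ak = System.getenv("DLI_ACCESS_KEY");
        String sk = System.getenv("DLI_SECRET_KEY");
        if (ak == null || sk == null) {
            throw new IllegalStateException(
                    "DLI_ACCESS_KEY and DLI_SECRET_KEY must be set");
        }

        SparkSession spark = SparkSession
                .builder()
                .config("spark.hadoop.fs.obs.access.key", ak)
                .config("spark.hadoop.fs.obs.secret.key", sk)
                .appName("java_spark_demo")
                .getOrCreate();

        // ... job logic as in the sample code below ...

        spark.stop();
    }
}
```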
Sample code:

```java
package com.dli.demo;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class SparkDemoObs {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .config("spark.hadoop.fs.obs.access.key", "xxx")
                .config("spark.hadoop.fs.obs.secret.key", "yyy")
                .appName("java_spark_demo")
                .getOrCreate();
        // The AK and SK can also be set with --conf when submitting the application.

        // Test JSON data:
        // {"name":"Michael"}
        // {"name":"Andy", "age":30}
        // {"name":"Justin", "age":19}
        Dataset<Row> df = spark.read().json("obs://dli-test-obs01/people.json");
        df.printSchema();
        // root
        // |-- age: long (nullable = true)
        // |-- name: string (nullable = true)

        // Displays the content of the DataFrame to stdout
        df.show();
        // +----+-------+
        // | age|   name|
        // +----+-------+
        // |null|Michael|
        // |  30|   Andy|
        // |  19| Justin|
        // +----+-------+

        // Select only the "name" column
        df.select("name").show();
        // +-------+
        // |   name|
        // +-------+
        // |Michael|
        // |   Andy|
        // | Justin|
        // +-------+

        // Select people older than 21
        df.filter(col("age").gt(21)).show();
        // +---+----+
        // |age|name|
        // +---+----+
        // | 30|Andy|
        // +---+----+

        // Count people by age
        df.groupBy("age").count().show();
        // +----+-----+
        // | age|count|
        // +----+-----+
        // |  19|    1|
        // |null|    1|
        // |  30|    1|
        // +----+-----+

        // Register the DataFrame as a SQL temporary view
        df.createOrReplaceTempView("people");

        Dataset<Row> sqlDF = spark.sql("SELECT * FROM people");
        sqlDF.show();
        // +----+-------+
        // | age|   name|
        // +----+-------+
        // |null|Michael|
        // |  30|   Andy|
        // |  19| Justin|
        // +----+-------+

        sqlDF.write().mode(SaveMode.Overwrite).parquet("obs://dli-test-obs01/result/parquet");
        spark.read().parquet("obs://dli-test-obs01/result/parquet").show();

        spark.stop();
    }
}
```