Getting Started

This section describes how to use Spark2x to submit Spark applications, covering Spark Core and Spark SQL. Spark Core is the kernel module of Spark: it executes tasks and provides the APIs used to write Spark applications. Spark SQL is the module that executes SQL statements.

Scenario Description

Develop a Spark application that analyzes the following logs of netizens' dwell time for online shopping over a weekend. Each record contains three comma-separated fields: name, gender, and dwell time.

log1.txt: logs collected on Saturday

LiuYang,female,20
YuanJing,male,10
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60

log2.txt: logs collected on Sunday

LiuYang,female,20
YuanJing,male,10
CaiXuyu,female,50
FangBo,female,50
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
CaiXuyu,female,50
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
FangBo,female,50
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
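As a quick local cross-check of the sample data, the following shell sketch sums the dwell time per female user across both files. The aggregation itself is an assumption, inferred only from the sample class name FemaleInfoCollection used later in this section. The function name female_totals is hypothetical, and the sketch expects the two logs to be saved locally as log1.txt and log2.txt.

```shell
#!/bin/sh
# Sum the third field (dwell time) per name for records whose second field is "female".
# Assumption: this approximates what the sample application computes; the source does not say so.
female_totals() {
    awk -F',' '
        $2 == "female" { total[$1] += $3 }   # accumulate dwell time per name
        END { for (name in total) print name "," total[name] }
    ' "$@" | LC_ALL=C sort                   # awk iteration order is unspecified, so sort the output
}

# Run only when both local copies exist.
if [ -f log1.txt ] && [ -f log2.txt ]; then
    female_totals log1.txt log2.txt
fi
```

With the two sample logs above, this prints CaiXuyu,300, FangBo,320, and LiuYang,80, one per line.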

Prerequisites

Procedure

  1. Obtain the sample project and import it into IntelliJ IDEA. Import the JAR packages on which the sample project depends, and then use IDEA to configure and build the application JAR package.
  2. Prepare the data required by the sample project.

    Save the original log files from the scenario description to HDFS.
    1. Create two text files (input_data1.txt and input_data2.txt) on the local host and copy the content of log1.txt and log2.txt to input_data1.txt and input_data2.txt, respectively.
    2. Create the /tmp/input directory in HDFS, and upload input_data1.txt and input_data2.txt to the /tmp/input directory.
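The directory creation and upload can be sketched with the standard HDFS shell as follows. This assumes that an HDFS client is on PATH and that the current user is already authenticated and may write to /tmp; the wrapper name upload_logs is hypothetical.

```shell
#!/bin/sh
# Create an HDFS directory and upload local files to it.
# Usage: upload_logs <hdfs_dir> <local_file>...
upload_logs() {
    target_dir="$1"
    shift
    [ "$#" -ge 1 ] || { echo "usage: upload_logs <hdfs_dir> <local_file>..." >&2; return 2; }
    for f in "$@"; do
        [ -f "$f" ] || { echo "missing local file: $f" >&2; return 1; }
    done
    hdfs dfs -mkdir -p "$target_dir" &&      # -p: no error if the directory already exists
    hdfs dfs -put "$@" "$target_dir" &&      # fails if a file already exists; add -f to overwrite
    hdfs dfs -ls "$target_dir"               # list the directory to confirm the upload
}

# Run only when an HDFS client and both local files are available.
if command -v hdfs >/dev/null 2>&1 && [ -f input_data1.txt ] && [ -f input_data2.txt ]; then
    upload_logs /tmp/input input_data1.txt input_data2.txt
fi
```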

  3. Upload the generated JAR package to a directory in the Spark2x running environment (Spark2x client), for example, /opt/female.
  4. Go to the client directory, configure the environment variables, and log in to the system.

    source bigdata_env

    source Spark2x/component_env

    kinit <service user for authentication>

  5. Run the following script in the bin directory to submit the Spark application:

    spark-submit --class com.xxx.bigdata.spark.examples.FemaleInfoCollection --master yarn-client /opt/female/FemaleInfoCollection.jar <inputPath>
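For illustration, the submission can be wrapped in a small function. The input path /tmp/input is an assumption based on the upload directory in step 2, and the function name submit_female_job is hypothetical. The sketch passes --master yarn --deploy-mode client, the form that Spark 2.x recommends in place of the deprecated yarn-client master value; both select YARN client mode.

```shell
#!/bin/sh
# Submit the sample application in YARN client mode.
# Usage: submit_female_job <inputPath>
submit_female_job() {
    [ -n "$1" ] || { echo "usage: submit_female_job <inputPath>" >&2; return 2; }
    spark-submit \
        --class com.xxx.bigdata.spark.examples.FemaleInfoCollection \
        --master yarn --deploy-mode client \
        /opt/female/FemaleInfoCollection.jar "$1"
}

# Run only when spark-submit is available; /tmp/input is an assumed input path.
if command -v spark-submit >/dev/null 2>&1; then
    submit_female_job /tmp/input
fi
```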

  6. (Optional) Run the spark-sql or spark-beeline script in the bin directory, and then enter SQL statements directly to perform operations such as queries.

    For example, create a table, insert a record, and then query the table.

    spark-sql> CREATE TABLE TEST(NAME STRING, AGE INT);
    Time taken: 0.348 seconds
    spark-sql> INSERT INTO TEST VALUES('Jack', 20);
    Time taken: 1.13 seconds
    spark-sql> SELECT * FROM TEST;
    Jack      20
    Time taken: 0.18 seconds, Fetched 1 row(s)
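The same statements can also be run non-interactively. The following is a minimal sketch, assuming that the spark-sql script is on PATH and that the TEST table from the example above exists; the wrapper name run_sql is hypothetical. It relies on the -e option of the spark-sql command-line tool, which executes a quoted query string and then exits.

```shell
#!/bin/sh
# Execute a single SQL statement through the spark-sql CLI and exit.
# Usage: run_sql "<statement>"
run_sql() {
    [ -n "$1" ] || { echo "usage: run_sql <statement>" >&2; return 2; }
    spark-sql -e "$1"
}

# Run only when spark-sql is available.
if command -v spark-sql >/dev/null 2>&1; then
    run_sql "SELECT * FROM TEST;"
fi
```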

  7. View the running result of the Spark application.

    • View the running result data in a specified file.

      The storage path and format of the result data are specified by the Spark application.

    • Check the running status on the web page.
      1. Log in to Manager. Select Spark2x from the Service drop-down list.
      2. Go to the Spark2x overview page and click an instance, for example, JobHistory2x(host2).
      3. The History Server UI is displayed.

        Select the part file of an application. The History Server UI displays the status of Spark applications, both completed and incomplete.

        Figure 1 History Server UI
      4. Click an application ID on this page to go to the Spark UI of the application.

        Spark UI: used to display the status of running applications.

        Figure 2 Spark UI
    • View Spark logs to learn application runtime conditions.

      See Spark2x Logs for the application's running status, and adjust the application based on the log information.