This section describes how to develop an MRS Spark Python job on DataArts Factory. Two examples are provided: a wordcount job and a job that prints hello python to the YARN log.
Example 1
Prerequisites
You have the permission to access OBS paths.
Data preparation
Prepare the script file wordcount.py with the following content:

# -*- coding: utf-8 -*-
import sys
from pyspark import SparkConf, SparkContext

def show(x):
    print(x)

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: wordcount <inputPath> <outputPath>")
        sys.exit(-1)
    # Create SparkConf.
    conf = SparkConf().setAppName("wordcount")
    # Create SparkContext and pass the conf=conf parameter.
    sc = SparkContext(conf=conf)
    inputPath = sys.argv[1]
    outputPath = sys.argv[2]
    lines = sc.textFile(name=inputPath)
    # Split each line by spaces to obtain words.
    words = lines.flatMap(lambda line: line.split(" "), True)
    # Map each word into a (word, 1) tuple.
    pairWords = words.map(lambda word: (word, 1), True)
    # Sum the counts for each word.
    result = pairWords.reduceByKey(lambda v1, v2: v1 + v2)
    # Print the result.
    result.foreach(lambda t: show(t))
    # Save the result to a file.
    result.saveAsTextFile(outputPath)
    # Stop SparkContext.
    sc.stop()

Also prepare an input file in.txt containing the words to be counted.
The encoding format must be set to UTF-8. Otherwise, an error will occur during script execution.
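Before uploading, you can verify the encoding with a quick check; the snippet below is a minimal sketch that assumes wordcount.py sits in the current working directory:

# Minimal sketch: verify that wordcount.py is valid UTF-8 before uploading.
# The local path is an assumption; adjust it to where the script is stored.
with open("wordcount.py", "rb") as f:
    f.read().decode("utf-8")  # raises UnicodeDecodeError if the file is not UTF-8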
Procedure
In this example, upload wordcount.py and in.txt to obs://obs-tongji/python/.
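You can upload the files on the OBS console or programmatically. The sketch below uses the OBS Python SDK (esdk-obs-python); the endpoint, credentials, and local file paths are placeholders you must replace, not values from this example:

# Minimal upload sketch using the OBS Python SDK; AK/SK and endpoint are placeholders.
from obs import ObsClient

client = ObsClient(
    access_key_id="YOUR_AK",
    secret_access_key="YOUR_SK",
    server="https://obs.example-region.myhuaweicloud.com",
)
for local_path, key in [("wordcount.py", "python/wordcount.py"),
                        ("in.txt", "python/in.txt")]:
    resp = client.putFile("obs-tongji", key, file_path=local_path)
    if resp.status >= 300:
        print("upload failed:", resp.errorMessage)
client.close()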
Parameter descriptions:
--master yarn --deploy-mode cluster obs://obs-tongji/python/wordcount.py obs://obs-tongji/python/in.txt obs://obs-tongji/python/out
Specifically:
obs://obs-tongji/python/wordcount.py is the path where the script is stored.
obs://obs-tongji/python/in.txt is the path of the input file passed to wordcount.py; it contains the words to be counted.
obs://obs-tongji/python/out is the output directory. It is created in the OBS bucket automatically when the job runs. If the out directory already exists in the OBS bucket, an error will occur.
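Before submitting the job to MRS, you can smoke-test the same logic locally, assuming PySpark is installed on your machine. The local input path below is hypothetical; replace it with your own:

# Minimal local sketch of the wordcount logic; the input path is hypothetical.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("wordcount-local").setMaster("local[2]")
sc = SparkContext(conf=conf)
counts = (sc.textFile("file:///tmp/in.txt")
            .flatMap(lambda line: line.split(" "))
            .map(lambda word: (word, 1))
            .reduceByKey(lambda v1, v2: v1 + v2))
print(counts.collect())
sc.stop()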
The job log shows that the job was successfully executed.
Example 2
Prerequisites
You have the permission to access OBS paths.
Data preparation
Prepare the script file zt_test_sparkPython1.py with the following content:
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("master").setMaster("yarn")
sc = SparkContext(conf=conf)
print("hello python")
sc.stop()
Procedure
Parameter descriptions:
--master yarn --deploy-mode cluster obs://obs-tongji/python/zt_test_sparkPython1.py
obs://obs-tongji/python/zt_test_sparkPython1.py is the path where the script is stored.
Log in to MRS Manager and check that the YARN log contains hello python.
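If you also have shell access to the cluster, the YARN CLI can fetch the same aggregated log; below is a minimal sketch via Python's subprocess module, with a placeholder application ID:

# Minimal sketch: fetch the aggregated YARN log for the job and look for its output.
# The application ID is a placeholder; find the real one on the YARN web UI
# or with `yarn application -list`.
import subprocess

app_id = "application_0000000000000_0000"
log = subprocess.run(["yarn", "logs", "-applicationId", app_id],
                     capture_output=True, text=True).stdout
print("hello python" in log)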