Can I pass arguments to my Python code when submitting a Spark job?


I want to run my Python code on a Spark cluster with spark-submit.

Normally, we run Python code with spark-submit like this:

# Run a Python application on a cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  my_python_code.py

But I want to run my_python_code.py with a few arguments. Is there a clean way to pass them?


Tags: python, apache-spark, cluster-mode


5 answers

Accepted answer · 0 upvotes


          
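Although sys.argv is a good solution, I prefer this more proper way of handling command-line args in my PySpark jobs:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--ngrams", help="some useful description.")
args = parser.parse_args()
# args.ngrams is None when --ngrams is not passed, and a string otherwise
if args.ngrams:
    ngrams = args.ngrams

This way, you can launch your job like this:

spark-submit job.py --ngrams 3

For more on the argparse module, see the Argparse Tutorial.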
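A slightly fuller sketch of the same idea, assuming `--ngrams` should be an integer with a fallback when it is omitted. The default of 2, the `--input` option, and the function name `parse_job_args` are illustrative and not part of the answer above. Taking `argv` as a parameter lets the parser be exercised without going through spark-submit.

```python
import argparse


def parse_job_args(argv=None):
    """Parse the job's own command-line options.

    When argv is None, argparse reads sys.argv[1:], i.e. everything
    spark-submit passes after the script name.
    """
    parser = argparse.ArgumentParser(description="Example PySpark job options")
    # type=int converts the string "3" into the integer 3;
    # the default of 2 is an assumption made for illustration.
    parser.add_argument("--ngrams", type=int, default=2,
                        help="size of the n-grams to build")
    # Hypothetical second option, showing that named options can come in any order.
    parser.add_argument("--input", default=None,
                        help="input path (hypothetical)")
    return parser.parse_args(argv)
```

Inside the job you would call `args = parse_job_args()` once at startup and then read `args.ngrams` and `args.input`.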


           
            
             
Andre Carneiro: Doesn't work! The result says: "[TerminalIPythonApp] CRITICAL | Unrecognized flag: '--ngrams'"


           
            
             
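If you want to send configuration when submitting the job, make sure the configuration comes right after spark-submit, for example: [code snippet missing]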


           
            
             
I haven't tried this approach, but it sounds like the better one because it removes the dependency on argument order.


           
            
             
Does anyone know how to use PySpark with argparse? I keep getting the error "Unrecognized flag --arg1", and it's driving me crazy! (Spark 2.4.4 and Python 3.6)


          
           
            
             
Yes: put this in a file called args.py

import sys
print(sys.argv)
If you run
spark-submit args.py a b c d e 
You will see:
['/spark/args.py', 'a', 'b', 'c', 'd', 'e']


          
           
            
             
              
               
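You can pass arguments on the spark-submit command line and then access them in your code as follows.

sys.argv[1] is the first argument, sys.argv[2] the second, and so on. See the example below.

You can write code like the following to read the arguments passed on the spark-submit command line:

import sys

# First argument: how many table names follow
n = int(sys.argv[1])
# Table names start at sys.argv[2]
a = 2
tables = []
for _ in range(n):
    tables.append(sys.argv[a])
    a += 1
print(tables)

Save the file above as PysparkArg.py and run the following spark-submit command: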
spark-submit PysparkArg.py 3 table1 table2 table3
Output:
['table1', 'table2', 'table3']
This code is useful in PySpark jobs that need to fetch several tables from a database, where the user supplies the number of tables and their names when running the spark-submit command.
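The same count-then-names convention can also be written with a slice and an explicit check, so that a wrong count fails with a clear message instead of an IndexError. This is only a sketch; the function name `parse_tables` is hypothetical.

```python
def parse_tables(argv):
    """Return the table names from argv = [script, n, name1, ..., nameN]."""
    if len(argv) < 2:
        raise ValueError("usage: PysparkArg.py <n> <table1> ... <tableN>")
    n = int(argv[1])
    tables = argv[2:2 + n]
    # A slice silently returns fewer items when argv is too short,
    # so compare the count against n explicitly.
    if len(tables) != n:
        raise ValueError(f"expected {n} table names, got {len(tables)}")
    return tables
```

In the script itself this would be called as `parse_tables(sys.argv)` after `import sys`.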


          
           
            
             
              
               
                
                 
                  
Ah, it is possible.
http://caen.github.io/hadoop/user-spark.html

# --master yarn-client     run this as a Hadoop (YARN) job
# --queue                  run on your_queue
# --num-executors          number of executors, for example 10
# --executor-memory        memory per executor, for example 12 GB
# --executor-cores         CPU cores per executor, for example 2
spark-submit \
    --master yarn-client \
    --queue <your_queue> \
    --num-executors 10 \
    --executor-memory 12g \
    --executor-cores 2 \
    job.py ngrams/input ngrams/output
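The contents of job.py are not shown in this answer. As a sketch of the receiving side, assuming the script expects exactly two positional arguments (an input path and an output path), it could unpack them like this. The helper name `read_paths` is hypothetical.

```python
def read_paths(argv):
    """Split argv = [script, input_path, output_path] into the two paths."""
    if len(argv) != 3:
        # SystemExit with a message prints it to stderr and exits with status 1.
        raise SystemExit("usage: job.py <input_path> <output_path>")
    _, input_path, output_path = argv
    return input_path, output_path
```

In job.py this would be called as `input_path, output_path = read_paths(sys.argv)` after `import sys`, before any Spark work starts.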


           
            
             
              
               
                
                 
                  
                   
                    
I think the question isn't how to pass the arguments in, but how to access them once they have been passed.


          
           
            
             
              
               
                
                 
                  
                   
                    
                    
trevorgrayson, posted on 2020-02-20


          
           
            
             
              
               
                
                 
                  
                   
Aniket Kulkarni's "spark-submit args.py a b c d e" seems to be enough, but it's worth mentioning that we ran into trouble with optional/named args (e.g. --param1).