Apache Spark: error passing arguments to a PySpark script in Azure Data Factory
I am running a PySpark script from Azure Data Factory. I specified the arguments in the given section under Script/Jar; each argument is a key-value pair. The submitted arguments look like this:
--arg '--APP_NAME ABC' --arg '--CONFIG_FILE_PATH wasbs://ABC --arg '--OUTPUT_INFO wasbs://XYZ
When the pipeline executes, I get the following error:
usage: Data.py [-h] --CONFIG_FILE_PATH CONFIG_FILE_PATH --OUTPUT_INFO
OUTPUT_INFO --ACTION_CODE ACTION_CODE --RUN_ID RUN_ID
--APP_NAME APP_NAME --JOB_ID JOB_ID --TASK_ID TASK_ID
--PCS_ID PCS_ID --DAG_ID DAG_ID
Data.py: error: argument --CONFIG_FILE_PATH is required.
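The usage message suggests the entry script parses its arguments with argparse and marks every flag as required. A minimal sketch of what Data.py presumably looks like (reconstructed from the usage text above; the actual script is not shown in the question), demonstrating that the flag and its value must arrive as separate tokens:

```python
import argparse

def build_parser():
    # Hypothetical reconstruction of Data.py's parser, based on the usage message.
    parser = argparse.ArgumentParser(prog="Data.py")
    for name in ("--CONFIG_FILE_PATH", "--OUTPUT_INFO", "--ACTION_CODE",
                 "--RUN_ID", "--APP_NAME", "--JOB_ID", "--TASK_ID",
                 "--PCS_ID", "--DAG_ID"):
        parser.add_argument(name, required=True)
    return parser

# A quoted --arg like '--APP_NAME ABC' reaches the script as ONE token,
# so argparse never sees any of the required flags and exits with the
# "argument --CONFIG_FILE_PATH is required" error shown above.
# Passing flag and value as SEPARATE tokens parses cleanly:
args = build_parser().parse_args([
    "--CONFIG_FILE_PATH", "wasbs://ABC", "--OUTPUT_INFO", "wasbs://XYZ",
    "--ACTION_CODE", "A", "--RUN_ID", "1", "--APP_NAME", "ABC",
    "--JOB_ID", "1", "--TASK_ID", "1", "--PCS_ID", "1", "--DAG_ID", "1",
])
print(args.APP_NAME)
```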
You can pass arguments to a PySpark script in Azure Data Factory like this:
{
    "name": "SparkActivity",
    "properties": {
        "activities": [
            {
                "name": "Spark1",
                "type": "HDInsightSpark",
                "dependsOn": [],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "rootPath": "adftutorial/spark/script",
                    "entryFilePath": "WordCount_Spark.py",
                    "arguments": [
                        "--input-file",
                        "wasb://sampledata@chepra.blob.core.windows.net/data",
                        "--output-file",
                        "wasb://sampledata@chepra.blob.core.windows.net/results"
                    ],
                    "sparkJobLinkedService": {
                        "referenceName": "AzureBlobStorage1",
                        "type": "LinkedServiceReference"
                    }
                },
                "linkedServiceName": {
                    "referenceName": "HDInsight",
                    "type": "LinkedServiceReference"
                }
            }
        ],
        "annotations": []
    },
    "type": "Microsoft.DataFactory/factories/pipelines"
}
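Note the pattern in the "arguments" array above: each flag and its value are separate array entries. Applied to the failing pipeline in the question, a hedged sketch of just the arguments block would be (the wasbs:// values are the question's own placeholders, not verified paths):

```json
"arguments": [
    "--APP_NAME", "ABC",
    "--CONFIG_FILE_PATH", "wasbs://ABC",
    "--OUTPUT_INFO", "wasbs://XYZ"
]
```

This way argparse receives each flag and each value as its own token, instead of one combined string per quoted --arg.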
{
    "name": "SparkSubmit",
    "properties": {
        "description": "Submit a spark job",
        "activities": [
            {
                "type": "HDInsightMapReduce",
                "typeProperties": {
                    "className": "com.adf.spark.SparkJob",
                    "jarFilePath": "libs/spark-adf-job-bin.jar",
                    "jarLinkedService": "StorageLinkedService",
                    "arguments": [
                        "--jarFile",
                        "libs/sparkdemoapp_2.10-1.0.jar",
                        "--jars",
                        "/usr/hdp/current/hadoop-client/hadoop-azure-2.7.1.2.3.3.0-3039.jar,/usr/hdp/current/hadoop-client/lib/azure-storage-2.2.0.jar",
                        "--mainClass",
                        "com.adf.spark.demo.Demo",
                        "--master",
                        "yarn-cluster",
                        "--driverMemory",
                        "2g",
                        "--driverExtraClasspath",
                        "/usr/lib/hdinsight-logging/*",
                        "--executorCores",
                        "1",
                        "--executorMemory",
                        "4g",
                        "--sparkHome",
                        "/usr/hdp/current/spark-client",
                        "--connectionString",
                        "DefaultEndpointsProtocol=https;AccountName=<YOUR_ACCOUNT>;AccountKey=<YOUR_KEY>",
                        "input=wasb://input@<YOUR_ACCOUNT>.blob.core.windows.net/data",
                        "output=wasb://output@<YOUR_ACCOUNT>.blob.core.windows.net/results"
                    ]
                },
                "inputs": [
                    {
                        "name": "input"
                    }
                ],
                "outputs": [
                    {
                        "name": "output"
                    }
                ],
                "policy": {
                    "executionPriorityOrder": "OldestFirst",
                    "timeout": "01:00:00",
                    "concurrency": 1,
                    "retry": 1
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "name": "Spark Launcher",
                "description": "Submits a Spark Job",
                "linkedServiceName": "HDInsightLinkedService"
            }
        ],
        "start": "2015-11-16T00:00:01Z",
        "end": "2015-11-16T23:59:00Z",
        "isPaused": false,
        "pipelineMode": "Scheduled"
    }
}
Did you provide any value for this parameter? @GaurangShah, yes: --APP_NAME is the key and ABC is the value. I removed the values before pasting the image here. It seems only one argument is being passed: --arg '--APP_NAME ABC'
A total of 9 arguments are passed; I only listed a few of them.
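The comment points at the root cause: each quoted --arg value reaches the script as a single sys.argv entry. A minimal Python illustration using the standard shlex module (nothing ADF-specific; it simply shows how the same text splits into the tokens argparse needs):

```python
import shlex

# What the script receives when the flag and value share one quoted --arg:
single_token = ["--APP_NAME ABC"]   # one sys.argv entry; argparse sees no flag

# What argparse actually needs -- flag and value as separate tokens,
# the way a shell would split the unquoted text:
split_tokens = shlex.split("--APP_NAME ABC")
print(split_tokens)                 # ['--APP_NAME', 'ABC']
```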