Apache Spark: error passing arguments to a PySpark script in Azure Data Factory
I am running a PySpark script from Azure Data Factory. I specified the arguments in the given section under Script/Jar; each argument is a key-value pair. The submitted arguments look like this:
--arg '--APP_NAME ABC' --arg '--CONFIG_FILE_PATH wasbs://ABC --arg '--OUTPUT_INFO wasbs://XYZ
When the pipeline executes, I get the following error:
usage: Data.py [-h] --CONFIG_FILE_PATH CONFIG_FILE_PATH --OUTPUT_INFO
OUTPUT_INFO --ACTION_CODE ACTION_CODE --RUN_ID RUN_ID
--APP_NAME APP_NAME --JOB_ID JOB_ID --TASK_ID TASK_ID
--PCS_ID PCS_ID --DAG_ID DAG_ID
Data.py: error: argument --CONFIG_FILE_PATH is required.
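The usage message suggests the entry script parses its arguments with argparse and marks every flag as required. A minimal sketch of what Data.py presumably looks like (reconstructed from the usage text above; the actual script is not shown in the question), demonstrating that the flag and its value must arrive as separate tokens:

```python
import argparse

def build_parser():
    # Hypothetical reconstruction of Data.py's parser, based on the usage message.
    parser = argparse.ArgumentParser(prog="Data.py")
    for name in ("--CONFIG_FILE_PATH", "--OUTPUT_INFO", "--ACTION_CODE",
                 "--RUN_ID", "--APP_NAME", "--JOB_ID", "--TASK_ID",
                 "--PCS_ID", "--DAG_ID"):
        parser.add_argument(name, required=True)
    return parser

# A quoted --arg like '--APP_NAME ABC' reaches the script as ONE token,
# so argparse never sees any of the required flags and exits with the
# "argument --CONFIG_FILE_PATH is required" error shown above.
# Passing flag and value as SEPARATE tokens parses cleanly:
args = build_parser().parse_args([
    "--CONFIG_FILE_PATH", "wasbs://ABC", "--OUTPUT_INFO", "wasbs://XYZ",
    "--ACTION_CODE", "A", "--RUN_ID", "1", "--APP_NAME", "ABC",
    "--JOB_ID", "1", "--TASK_ID", "1", "--PCS_ID", "1", "--DAG_ID", "1",
])
print(args.APP_NAME)
```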
You can pass arguments to a PySpark script in Azure Data Factory like this:
{
    "name": "SparkActivity",
    "properties": {
        "activities": [
            {
                "name": "Spark1",
                "type": "HDInsightSpark",
                "dependsOn": [],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "rootPath": "adftutorial/spark/script",
                    "entryFilePath": "WordCount_Spark.py",
                    "arguments": [
                        "--input-file",
                        "wasb://sampledata@chepra.blob.core.windows.net/data",
                        "--output-file",
                        "wasb://sampledata@chepra.blob.core.windows.net/results"
                    ],
                    "sparkJobLinkedService": {
                        "referenceName": "AzureBlobStorage1",
                        "type": "LinkedServiceReference"
                    }
                },
                "linkedServiceName": {
                    "referenceName": "HDInsight",
                    "type": "LinkedServiceReference"
                }
            }
        ],
        "annotations": []
    },
    "type": "Microsoft.DataFactory/factories/pipelines"
}
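Note the pattern in the "arguments" array above: each flag and its value are separate array entries. Applied to the failing pipeline in the question, a hedged sketch of just the arguments block would be (the wasbs:// values are the question's own placeholders, not verified paths):

```json
"arguments": [
    "--APP_NAME", "ABC",
    "--CONFIG_FILE_PATH", "wasbs://ABC",
    "--OUTPUT_INFO", "wasbs://XYZ"
]
```

This way argparse receives each flag and each value as its own token, instead of one combined string per quoted --arg.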
{
    "name": "SparkSubmit",
    "properties": {
        "description": "Submit a spark job",
        "activities": [
            {
                "type": "HDInsightMapReduce",
                "typeProperties": {
                    "className": "com.adf.spark.SparkJob",
                    "jarFilePath": "libs/spark-adf-job-bin.jar",
                    "jarLinkedService": "StorageLinkedService",
                    "arguments": [
                        "--jarFile",
                        "libs/sparkdemoapp_2.10-1.0.jar",
                        "--jars",
                        "/usr/hdp/current/hadoop-client/hadoop-azure-2.7.1.2.3.3.0-3039.jar,/usr/hdp/current/hadoop-client/lib/azure-storage-2.2.0.jar",
                        "--mainClass",
                        "com.adf.spark.demo.Demo",
                        "--master",
                        "yarn-cluster",
                        "--driverMemory",
                        "2g",
                        "--driverExtraClasspath",
                        "/usr/lib/hdinsight-logging/*",
                        "--executorCores",
                        "1",
                        "--executorMemory",
                        "4g",
                        "--sparkHome",
                        "/usr/hdp/current/spark-client",
                        "--connectionString",
                        "DefaultEndpointsProtocol=https;AccountName=<YOUR_ACCOUNT>;AccountKey=<YOUR_KEY>",
                        "input=wasb://input@<YOUR_ACCOUNT>.blob.core.windows.net/data",
                        "output=wasb://output@<YOUR_ACCOUNT>.blob.core.windows.net/results"
                    ]
                },
                "inputs": [
                    {
                        "name": "input"
                    }
                ],
                "outputs": [
                    {
                        "name": "output"
                    }
                ],
                "policy": {
                    "executionPriorityOrder": "OldestFirst",
                    "timeout": "01:00:00",
                    "concurrency": 1,
                    "retry": 1
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "name": "Spark Launcher",
                "description": "Submits a Spark Job",
                "linkedServiceName": "HDInsightLinkedService"
            }
        ],
        "start": "2015-11-16T00:00:01Z",
        "end": "2015-11-16T23:59:00Z",
        "isPaused": false,
        "pipelineMode": "Scheduled"
    }
}
Did you provide any value for this parameter? @GaurangShah, yes: --APP_NAME is the key and ABC is the value. I removed the values before pasting the image here. It seems only one argument is being passed: --arg '--APP_NAME ABC'
A total of 9 arguments are passed; I only listed a few of them.
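The comment points at the root cause: each quoted --arg value reaches the script as a single sys.argv entry. A minimal Python illustration using the standard shlex module (nothing ADF-specific; it simply shows how the same text splits into the tokens argparse needs):

```python
import shlex

# What the script receives when the flag and value share one quoted --arg:
single_token = ["--APP_NAME ABC"]   # one sys.argv entry; argparse sees no flag

# What argparse actually needs -- flag and value as separate tokens,
# the way a shell would split the unquoted text:
split_tokens = shlex.split("--APP_NAME ABC")
print(split_tokens)                 # ['--APP_NAME', 'ABC']
```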