Spark map operation with a custom function runs on only one executor

apache-spark pyspark

I have a short piece of code:

sales = hive_context.table("inv_opt_test.store_sales")
sales1 = sales.rdd.map(lambda x : (str(x[0])+"-"+str(x[1]),x[3:]))
BaseStockLevel = sales1.groupByKey().map(lambda x: BigFunction(x)).cache()

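After reading from the Hive table, I create the key here:

.map(lambda x: (str(x[0])+"-"+str(x[1]), x[3:]))

and then group the records that share the same key:

sales1.groupByKey()

All of the stages above run across multiple executors, but the map stage that applies the custom function

.map(lambda x: BigFunction(x))

runs on only one executor.

I don't think it should behave this way. What could cause this behavior, or am I missing a setting?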

What is the size of the input?

More than 12 million rows.
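
One possible explanation, which nothing in the question confirms, is that after groupByKey() the grouped records sit in very few partitions, for example because the store-item keys have few distinct values or are heavily skewed. The map over BigFunction runs one task per partition, so a single non-empty partition would keep a single executor busy. The sketch below is plain Python rather than Spark, and only illustrates how hash partitioning by key behaves. The helper names, the sample keys, and the use of zlib.crc32 as a stand-in for Spark's partitioning hash are all assumptions made for illustration.

```python
import zlib


def stable_hash(key):
    # Stand-in for the hash a Spark shuffle applies to keys (an assumption
    # for this sketch); crc32 is used only because it is deterministic.
    return zlib.crc32(key.encode("utf-8"))


def records_per_partition(keys, num_partitions):
    # Assign every record to a partition by hashing its key, the way a
    # key-based shuffle does, and count the records in each partition.
    sizes = [0] * num_partitions
    for key in keys:
        sizes[stable_hash(key) % num_partitions] += 1
    return sizes


def busy_partitions(keys, num_partitions):
    # Number of partitions that receive at least one record.
    return sum(1 for n in records_per_partition(keys, num_partitions) if n > 0)


# Hypothetical data: 1000 rows that all share one "store-item" key.
skewed_keys = ["1-7"] * 1000
# Hypothetical data: 1000 rows, each with a distinct key.
spread_keys = ["{}-{}".format(i % 50, i) for i in range(1000)]

print(busy_partitions(skewed_keys, 8))
print(busy_partitions(spread_keys, 8))
```

With a single distinct key, every record hashes to the same partition no matter how many partitions exist, so only one task has work to do. On the real RDD, sales1.groupByKey().getNumPartitions() reports how many partitions the grouped data has, and groupByKey(numPartitions=...) sets that number explicitly. Whether either applies here depends on the actual key distribution.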