Apache Spark: PySpark - joining two RDDs - join fails - too many values to unpack

Tags: apache-spark, join, pyspark, rdd, cloudera

I have two very simple HDFS files:

test:

test2:

11,Player1,Team1
22,Player1,Team2
32,Player1,Team3
I want to join them (on the Team* column) to get the following output:

Team1,1,11,Player1
Team3,3,32,Player1
To do that, I use the following code:

test = sc.textFile("/user/cloudera/Tests/test")
test_filter = test.filter(lambda a: a.split(",")[1].upper() == "TEAM1" or a.split(",")[1].upper() == "TEAM2")
test_map = test_filter.map(lambda a: a.upper())
test_map = test_map.map(lambda a: (a.split(",")[1]))  # the parentheses do not make a tuple: this yields bare strings
for i in test_map.collect(): print(i)

test2 = sc.textFile("/user/cloudera/Tests/test2")
test2_map = test2.map(lambda a: a.upper())
test2_map = test2_map.map(lambda a: (a.split(",")[2], a.split(",")[1]))  # (team, player) key/value pairs
for i in test2_map.collect(): print(i)

test_join = test_map.join(test2_map)
for i in test_join.collect(): print(i)
But when I try to view the joined RDD, I get the following error:

  File "/usr/lib/spark/python/pyspark/rdd.py", line 1807, in <lambda>
    map_values_fn = lambda (k, v): (k, f(v))
ValueError: too many values to unpack

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
What am I doing wrong?

Thanks

Can you show the result sets of these two statements: for i in test_map.collect(): print(i) and for i in test2_map.collect(): print(i)?

You can also try the following:

   test = sc.textFile("/user/cloudera/Tests/test")
   test_map = test.map(lambda a: a.upper())
   test_map = test_map.map(lambda a: (a.split(",")[1], a.split(",")[0]))  # (team, id) key/value pairs
   for i in test_map.collect(): print(i)

   test2 = sc.textFile("/user/cloudera/Tests/test2")
   test2_map = test2.map(lambda a: a.upper())
   test2_map = test2_map.map(lambda a: (a.split(",")[2], a.split(",")[1]))  # (team, player) key/value pairs
   for i in test2_map.collect(): print(i)

   test_join = test_map.join(test2_map)
   for i in test_join.collect(): print(i)
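
Why this works: RDD.join requires both inputs to be pair RDDs, i.e. every element must be a (key, value) 2-tuple. In the question's code, test_map was built with lambda a: (a.split(",")[1]); the parentheses there do not create a tuple, so the RDD holds bare strings like "TEAM1". When Spark's join machinery then unpacks each element with lambda (k, v) (the rdd.py line 1807 shown in the traceback), Python tries to spread the five characters of "TEAM1" across two names, which is exactly the "too many values to unpack" error. A minimal plain-Python sketch of the failure, no Spark needed:

   record = ("TEAM1", "1")  # a proper (key, value) pair: unpacking succeeds
   k, v = record

   element = "TEAM1"        # a bare string, like the question's test_map produced
   try:
       k, v = element       # tries to unpack 5 characters into 2 names
   except ValueError as e:
       print(e)             # too many values to unpack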
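
One caveat on the fixed code: join returns (key, (left_value, right_value)) tuples, and test2_map above keeps only the player column, so the printed rows will not match the four-field output the question asked for. A sketch of one way to get there, reusing test_map from above but carrying both test2 columns as the value; it assumes test rows look like 1,Team1 (the original post did not show that file), and the .upper() calls mean everything comes out uppercased:

   test2 = sc.textFile("/user/cloudera/Tests/test2")
   test2_pairs = test2.map(lambda a: a.upper()) \
                      .map(lambda a: (a.split(",")[2], (a.split(",")[0], a.split(",")[1])))  # (team, (id, player))
   result = test_map.join(test2_pairs) \
                    .map(lambda kv: ",".join([kv[0], kv[1][0], kv[1][1][0], kv[1][1][1]]))
   for i in result.collect(): print(i)  # e.g. TEAM1,1,11,PLAYER1 -- assuming test held "1,Team1"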