Apache Spark: moving duplicate records to a separate temp table in PySpark

I am using PySpark.

My input data looks like below:

 COL1|COL2
|TYCO|130003|
|EMC |120989|
|VOLVO|102329|
|BMW|130157|
|FORD|503004|
|TYCO|130003|
I have created a DataFrame and am querying for the duplicates like below:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Test") \
    .getOrCreate()

# Assuming the input file is pipe-delimited with a header row,
# pass sep and header so the columns keep their COL1/COL2 names
data = spark.read.csv("filepath", sep="|", header=True)

# createOrReplaceTempView replaces the deprecated registerTempTable
data.createOrReplaceTempView("data")
spark.sql("SELECT count(COL2) AS CNT, COL2 FROM data GROUP BY COL2").show()
This gives the correct result, but can we get the duplicate values into a separate temp table?

Output data in Temp1:

+----+------+
| CNT|  COL2|
+----+------+
|   1|120989|
|   1|102329|
|   1|130157|
|   1|503004|
+----+------+
Output data in Temp2:

+----+------+
| CNT|  COL2|
+----+------+
|   2|130003|
+----+------+
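One way to get there (a minimal sketch, not from the original post, reusing the "data" view registered above; the view names temp1 and temp2 are illustrative): split the grouped counts with HAVING clauses and register each result as its own temp view.

# Non-duplicated COL2 values -> temp1 (illustrative view name)
temp1 = spark.sql(
    "SELECT count(COL2) AS CNT, COL2 FROM data GROUP BY COL2 HAVING count(COL2) = 1")
temp1.createOrReplaceTempView("temp1")

# Duplicated COL2 values -> temp2 (illustrative view name)
temp2 = spark.sql(
    "SELECT count(COL2) AS CNT, COL2 FROM data GROUP BY COL2 HAVING count(COL2) > 1")
temp2.createOrReplaceTempView("temp2")

temp1.show()
temp2.show()

The same split can also be done without SQL through the DataFrame API, e.g. data.groupBy("COL2").count().filter("count > 1") for the duplicates.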