Mapping a result to another file using Python Spark

I am new to Spark and Python and am trying some basic things: printing the count and max of employee data.

from pyspark.sql import Row
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
import pyspark.sql.functions as psf

spark = SparkSession \
    .builder \
    .appName("Hello") \
    .config("World") \
    .getOrCreate()


sc = spark.sparkContext
sqlContext = SQLContext(sc)
df = spark.createDataFrame(
    sc.textFile("employee.txt").map(lambda l: l.split('::')),
    ["employeeid","deptid","salary"]
)
df.registerTempTable("df")

mostEmpDept = sqlContext.sql("""select deptid, cntDept from (
                                            select deptid, count(*) as cntDept, max(count(*)) over () as maxcnt 
                                            from df 
                                            group by deptid) as tmp
                                            where tmp.cntDept = tmp.maxcnt""")

mostEmpDept.show()
The code above gives me the department with the most employees, as shown below:

+-------+--------+                                                              
|deptid |cntDept |
+-------+--------+
|    10 |       7|
+-------+--------+
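Now I have another file that contains every deptid and its name. How can I map this result to that file and print the name for deptid 10? The other file looks like this: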

10::Marketing
20::Finance
30::HumanResource
40::HouseKeeping
Load the department file into a second DataFrame and join it with your result:

sc = spark.sparkContext
sqlContext = SQLContext(sc)
df = spark.createDataFrame(
    sc.textFile("employee.txt").map(lambda l: l.split('::')),
    ["employeeid","deptid","salary"]
)
df.registerTempTable("df")

dept = spark.createDataFrame(
    sc.textFile("dept.txt").map(lambda l: l.split('::')),
    ["deptid","deptname"]
)
dept.registerTempTable("dept")

mostEmpDept = sqlContext.sql("""select deptid, cntDept from (
                                            select deptid, count(*) as cntDept, max(count(*)) over () as maxcnt 
                                            from df 
                                            group by deptid) as tmp
                                            where tmp.cntDept = tmp.maxcnt""")

mostEmpDept.registerTempTable('mostEmpDept')

final_df= sqlContext.sql("select a.deptid, b.deptname from mostEmpDept a inner join dept b on a.deptid=b.deptid")

final_df.show()
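As a sanity check on what the query and join compute, the same count, max, and lookup steps can be sketched in plain Python without a Spark session. The employee rows below are hypothetical sample data (the question does not show employee.txt), and the function name is made up for illustration; the department lines are the ones from the question.

```python
from collections import Counter

# Hypothetical employee.txt lines (employeeid::deptid::salary); not from the original post.
employee_lines = [
    "1::10::5000",
    "2::10::6000",
    "3::20::5500",
    "4::10::4000",
    "5::30::7000",
]

# dept.txt lines as shown in the question (deptid::deptname).
dept_lines = [
    "10::Marketing",
    "20::Finance",
    "30::HumanResource",
    "40::HouseKeeping",
]


def most_populated_departments(employee_lines, dept_lines):
    """Return (deptid, deptname, count) for each department tied for the top headcount."""
    # Mirrors: select deptid, count(*) ... group by deptid
    counts = Counter(line.split("::")[1] for line in employee_lines)
    if not counts:
        return []
    # Mirrors: max(count(*)) over ()
    max_count = max(counts.values())
    # Mirrors the dept table: deptid -> deptname
    names = dict(line.split("::", 1) for line in dept_lines)
    # Mirrors: where cntDept = maxcnt, then the inner join on deptid
    return sorted(
        (deptid, names[deptid], n)
        for deptid, n in counts.items()
        if n == max_count and deptid in names
    )


result = most_populated_departments(employee_lines, dept_lines)
print(result)  # [('10', 'Marketing', 3)]
```

Note that deptid stays a string here, just as it does in the DataFrames above, because the split on '::' produces strings on both sides of the join.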
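If you want to save it, use the DataFrame writer (saveAsTextFile is an RDD method and is not available on a DataFrame):

final_df.write.csv('Location')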