Apache Spark: How do I map each column of a PySpark DataFrame against the other columns?


I created a DataFrame by running the code below:

from pyspark.sql import Row
l = [('Ankit',25,'Ankit','Ankit'),('Jalfaizy',22,'Jalfaizy',"aa"),('saurabh',20,'saurabh',"bb"),('Bala',26,"aa","bb")]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1]),lname=x[2],mname=x[3]))
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.show()
After running the code above, I get the following result:

+---+--------+-----+--------+
|age|   lname|mname|    name|
+---+--------+-----+--------+
| 25|   Ankit|Ankit|   Ankit|
| 22|Jalfaizy|   aa|Jalfaizy|
| 20| saurabh|   bb| saurabh|
| 26|      aa|   bb|    Bala|
+---+--------+-----+--------+
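However, I want to compare the column values within each row and, keyed by the age column, report which other columns hold the same value. My expected result is: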

+---+----------------+-------------------+------------------+
|age| lname_map_same | mname_map_same    |    name_map_same |
+---+----------------+-------------------+------------------+
| 25|  mname,name    |   lname,name      |   lname,mname    |
| 22|    name        |  none             |   lname          |
| 20|    name        |  none             |   lname          |
| 26|    none        |  none             |   none           |
+---+----------------+-------------------+------------------+

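You can solve this with a map function. Have a look at the following code:

df_new = spark.createDataFrame([
(25,"Ankit","Ankit","Ankit"),(22,"Jalfaizy","aa","Jalfaizy"),(26,"aa","bb","Bala")
], ("age", "lname","mname","name"))
# only 3 records are added to the dataset

def find_identical(row):
    labels = ["lname","mname","name"]
    result = [row[0],]  # save the age for the final result
    row = row[1:]  # remove the age from the row
    for i in range(3):
        s = []
        field = row[i]
        if field == row[(i+1)%3]:  # check whether the field equals the next field
            s.append(labels[(i+1)%3])
        if field == row[(i-1)%3]:  # check whether the field equals the previous field
            s.append(labels[(i-1)%3])
        if not s:  # if no identical value was found, use None
            s = None
        result.append(s)
    return result

df_new.rdd.map(find_identical).toDF(["age","lname_map_same","mname_map_same","name_map_same"]).show()
Output: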

+---+--------------+--------------+--------------+
|age|lname_map_same|mname_map_same| name_map_same|
+---+--------------+--------------+--------------+
| 25| [mname, name]| [name, lname]|[lname, mname]|
| 22|        [name]|          null|       [lname]|
| 26|          null|          null|          null|
+---+--------------+--------------+--------------+
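The expected result in the question shows comma-separated column names and the literal `none`, whereas the mapping function above yields lists and null. A minimal sketch of a post-processing step that bridges the two is shown below. The helper name `format_matches` and its `empty` parameter are hypothetical and not part of the original answer.

```python
def format_matches(result, empty="none"):
    """Convert [age, list_or_None, ...] into [age, "a,b" or "none", ...].

    `result` is assumed to have the shape produced by the mapping
    function above: the age first, then one list (or None) per column.
    """
    age, matches = result[0], result[1:]
    # Join each list of matching column names; fall back to `empty` for None.
    return [age] + [",".join(m) if m else empty for m in matches]


row_result = [25, ["mname", "name"], ["name", "lname"], ["lname", "mname"]]
print(format_matches(row_result))
# [25, 'mname,name', 'name,lname', 'lname,mname']
```

With Spark, this could presumably be chained onto the existing map, for example `df_new.rdd.map(lambda r: format_matches(find_identical(r)))`, before calling `toDF` with the same column names.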
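If you want five columns to be considered, you can follow the instructions given in the comments. You have to extend the labels list and add extra if statements. In addition, every modulo operation has to be adjusted to 5, and the for loop has to iterate over 5 elements. You then end up with code like this: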

df_new = spark.createDataFrame([
( 25,"Ankit","Ankit","Ankit","Ankit","Ankit"),( 22,"Jalfaizy","aa","Jalfaizy","Jalfaizy","aa"),( 26,"aa","bb","Bala","cc","dd")
], ("age", "lname","mname","name","n1","n2"))

def find_identical(row):
    labels = ["lname","mname","name","n1","n2"]
    result = [row[0],]  # save the age for the final result
    row = row[1:]  # remove the age from the row
    for i in range(5):
        s = []
        field = row[i]
        # compare the field with each of the other four fields
        if field == row[(i+1)%5]:
            s.append(labels[(i+1)%5])
        if field == row[(i-1)%5]:
            s.append(labels[(i-1)%5])
        if field == row[(i+2)%5]:
            s.append(labels[(i+2)%5])
        if field == row[(i+3)%5]:
            s.append(labels[(i+3)%5])
        if not s:
            s = None
        result.append(s)
    return result

df_new.rdd.map(find_identical).toDF(["age","lname_map_same","mname_map_same","name_map_same","n1_map_same","n2_map_same"]).show(truncate=False)
Output:

+---+---------------------+---------------------+----------------------+------------------------+------------------------+
|age|lname_map_same       |mname_map_same       |name_map_same         |n1_map_same             |n2_map_same             |
+---+---------------------+---------------------+----------------------+------------------------+------------------------+
|25 |[mname, n2, name, n1]|[name, lname, n1, n2]|[n1, mname, n2, lname]|[n2, name, lname, mname]|[lname, n1, mname, name]|
|22 |[name, n1]           |[n2]                 |[n1, lname]           |[name, lname]           |[mname]                 |
|26 |null                 |null                 |null                  |null                    |null                    |
+---+---------------------+---------------------+----------------------+------------------------+------------------------+
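A dynamic approach takes the number of columns as a parameter. In my example, though, this number should be between 1 and 5, because the dataset created here has at most 5 attributes. It could look like this: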

df_new = spark.createDataFrame([
( 25,"Ankit","Ankit","Ankit","Ankit","Ankit"),( 22,"Jalfaizy","aa","Jalfaizy","Jalfaizy","aa"),( 26,"aa","bb","Bala","cc","dd")
], ("age", "n1","n2","n3","n4","n5"))


def find_identical(row,number):
    labels = []
    for n in range(1,number+1):
        labels.append("n"+str(n))   #create labels dynamically
    result = [row[0],]
    row = row[1:]
    for i in range(number):
        s = []
        field = row[i]
        for x in range(1,number):
            if field == row[(i+x)%number]:
                s.append(labels[(i+x)%number]) #check for similarity in all the other fields
        if not s:
            s = None
        result.append(s)
    return result

number=4
colNames=["age",]
for x in range(1,number+1):
    colNames.append("n"+str(x)+"_same") #create the 'nX_same' column names
df_new.rdd.map(lambda r: find_identical(r,number)).toDF(colNames).show(truncate=False)
The output varies with the number parameter; I kept the age column fixed as the first column.

Output:

+---+------------+------------+------------+------------+
|age|n1_same     |n2_same     |n3_same     |n4_same     |
+---+------------+------------+------------+------------+
|25 |[n2, n3, n4]|[n3, n4, n1]|[n4, n1, n2]|[n1, n2, n3]|
|22 |[n3, n4]    |null        |[n4, n1]    |[n1, n3]    |
|26 |null        |null        |null        |null        |
+---+------------+------------+------------+------------+
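The dynamic version above generates the labels `n1` to `nN` itself. As a variation, here is a minimal sketch that accepts the real column names as a list and compares each field with every other field in the row. The function name `find_identical_by_label` is hypothetical. Unlike the modulo-based code, it lists matches in column order rather than cyclic order.

```python
def find_identical_by_label(row, labels):
    """row: (key, v1, ..., vN); labels: the N column names for v1..vN.

    Returns [key, matches_1, ..., matches_N], where each matches_i is the
    list of labels whose value equals v_i, or None if there are none.
    """
    key, values = row[0], list(row[1:])
    result = [key]
    for i, field in enumerate(values):
        # Collect the names of all *other* columns holding the same value.
        same = [labels[j] for j, other in enumerate(values)
                if j != i and other == field]
        result.append(same or None)
    return result


labels = ["lname", "mname", "name", "n1", "n2"]
sample = (22, "Jalfaizy", "aa", "Jalfaizy", "Jalfaizy", "aa")
print(find_identical_by_label(sample, labels))
# [22, ['name', 'n1'], ['n2'], ['lname', 'n1'], ['lname', 'name'], ['mname']]
```

With Spark, the labels could presumably be taken from the DataFrame itself, for example `labels = df_new.columns[1:]`, and the function applied through `df_new.rdd.map(lambda r: find_identical_by_label(r, labels))`. That avoids hard-coding both the column count and the column names.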

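This looks suspicious to me. Are you sure this is the output you want?

Yes. For ease of understanding, the output I gave in the question is correct: it shows which columns are equal within a given row.

Hi gaw, thanks for the quick reply. Right now it is hard-coded for 3 values, and with 4 columns it does not work as expected. Can you help me?

If it is 4 columns (so age + 4 additional columns), you have to add another label to the labels array, replace range(3) with range(4), and change all modulo operators (e.g. replace every %3 with %4); then it should work. You also have to add one more if statement:
if field == row[(i+2)%4]: s.append(labels[(i+2)%4])

I added the following code, "labels = ["lname","mname","name","n1","n2"] result = [row[0],] # save the age for the final result row = row[1:] # remove the age from the row for i in range(5):", but it does not work. It gives a wrong result such as [mname, n2] instead of the 5 matching column headers.

Thanks for the update. This will only work for 5 columns, because the range value is hard-coded. Please suggest a dynamic approach instead of hard-coded values.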