PySpark: mapping multiple DataFrame columns


I need to be able to compare two DataFrames using multiple columns.

My PySpark attempt:

# get the distinct PrimaryLookupAttributeValue values from the reference
# table as a Python list, to compare them against df1

primaryAttributeValue_List = [ p.PrimaryLookupAttributeValue for p in AttributeLookup.select('PrimaryLookupAttributeValue').distinct().collect() ]
primaryAttributeValue_List  # a list of values; varies by the filter applied

Out: ['Archive',
 'Pending Security Deposit',
 'Partially Abandoned',
 'Revision Contract Review',
 'Open',
 'Draft Accounting In Review',
 'Draft Returned']


# compare df1 to PrimaryLookupAttributeValue
output = dataset_standardFalse2.withColumn('ConformedLeaseStatusName', f.when(dataset_standardFalse2['LeaseStatus'].isin(primaryAttributeValue_List), "FOUND").otherwise("TBD"))

display(output)
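The isin/when/otherwise pattern above can be sketched in plain Python (the sample statuses below are illustrative, not from the real dataset):

```python
# Flag each LeaseStatus as "FOUND" when it appears in the reference list,
# otherwise "TBD" -- the same logic as isin(...) + when/otherwise above.
primary_values = {'Archive', 'Open', 'Draft Returned'}  # sample subset
lease_statuses = ['Open', 'Terminated', 'Archive']      # sample df1 column

flags = ['FOUND' if s in primary_values else 'TBD' for s in lease_statuses]
print(flags)  # ['FOUND', 'TBD', 'FOUND']
```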


From what I understand, you can create a mapping from the columns of reference_df (I am assuming it is not a very large dataframe):

and then use this mapping to get the corresponding values in df1:

from itertools import chain
from pyspark.sql.functions import collect_set, array, concat_ws, lit, col, create_map

d = reference_df.agg(collect_set(array(concat_ws('\0','PrimaryLookupAttributeName','PrimaryLookupAttributeValue'), 'OutputItemNameByValue')).alias('m')).first().m
#[['LeaseStatus\x00Abandoned', 'Active'],
# ['LeaseRecoveryType\x00Gross-modified', 'Modified Gross'],
# ['LeaseStatus\x00Archive', 'Expired'],
# ['LeaseStatus\x00Terminated', 'Terminated'],
# ['LeaseRecoveryType\x00Gross w/base year', 'Modified Gross'],
# ['LeaseStatus\x00Draft', 'Pending'],
# ['LeaseRecoveryType\x00Gross', 'Gross']]

mappings = create_map([lit(i) for i in chain.from_iterable(d)])
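As a side note, create_map expects a flat, alternating key/value sequence, which is why chain.from_iterable is applied to the collected pairs. A minimal pure-Python sketch of that flattening, using two of the pairs shown above:

```python
from itertools import chain

# two of the [key, value] pairs collected into `d` above
d = [['LeaseStatus\x00Archive', 'Expired'],
     ['LeaseRecoveryType\x00Gross', 'Gross']]

# create_map wants key1, value1, key2, value2, ... -- flatten the pairs
flat = list(chain.from_iterable(d))
print(flat)
# ['LeaseStatus\x00Archive', 'Expired', 'LeaseRecoveryType\x00Gross', 'Gross']

# the same pairs viewed as a plain Python dict
as_dict = dict(d)
```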

primaryLookupAttributeName_List = ['LeaseType', 'LeaseRecoveryType', 'LeaseStatus']

df1.select("*", *[ mappings[concat_ws('\0', lit(c), col(c))].alias("Matched[{}]OutputItemNameByValue".format(c)) for c in primaryLookupAttributeName_List ]).show()
+----------------+...+---------------------------------------+-----------------------------------------------+-----------------------------------------+
|SourceSystemName|...|Matched[LeaseType]OutputItemNameByValue|Matched[LeaseRecoveryType]OutputItemNameByValue|Matched[LeaseStatus]OutputItemNameByValue|
+----------------+...+---------------------------------------+-----------------------------------------------+-----------------------------------------+
|          ABC123|...|                                   null|                                          Gross|                               Terminated|
|          ABC123|...|                                   null|                                 Modified Gross|                                  Expired|
|          ABC123|...|                                   null|                                 Modified Gross|                                  Pending|
+----------------+...+---------------------------------------+-----------------------------------------------+-----------------------------------------+
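Outside of Spark, the lookup mappings[concat_ws('\0', lit(c), col(c))] behaves like a plain dict lookup on a composite "column\0value" key, returning null (None here) when the pair is absent. A small sketch with sample pairs:

```python
# mapping keys are "ColumnName\0CellValue"; a missing pair yields None,
# which Spark would show as null
mapping = {'LeaseStatus\x00Archive': 'Expired',
           'LeaseRecoveryType\x00Gross': 'Gross'}

def lookup(col_name, value):
    # analogue of mappings[concat_ws('\0', lit(col_name), col(col_name))]
    return mapping.get(col_name + '\x00' + str(value))

print(lookup('LeaseStatus', 'Archive'))  # Expired
print(lookup('LeaseType', 'Net'))        # None
```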
Update: to set the column names based on information retrieved through the reference dataframe:

# map_key   = concat_ws('\0', PrimaryLookupAttributeName, PrimaryLookupAttributeValue)
# map_value = OutputItemNameByValue

# a list of domains to retrieve
primaryLookupAttributeName_List = ['LeaseType', 'LeaseRecoveryType', 'LeaseStatus']

# mapping from domain names to column names: using `reference_df`.`TargetAttributeForName`
NEWprimaryLookupAttributeName_List = dict(reference_df.filter(reference_df['DomainName'].isin(primaryLookupAttributeName_List)).agg(collect_set(array('DomainName', 'TargetAttributeForName')).alias('m')).first().m)
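The dict(...) step above just turns the collected [DomainName, TargetAttributeForName] pairs into a name mapping; a pure-Python sketch with hypothetical target names:

```python
# pairs as returned by collect_set(array('DomainName', 'TargetAttributeForName'));
# the target names here are made up for illustration
pairs = [['LeaseStatus', 'ConformedStatus'],
         ['LeaseRecoveryType', 'ConformedRecoveryType']]

name_map = dict(pairs)
print(name_map['LeaseStatus'])  # ConformedStatus
```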

test = dataset_standardFalse2.select("*",*[ mappings[concat_ws('\0', lit(c), col(c))].alias(c_name) for c,c_name in NEWprimaryLookupAttributeName_List.items()]) 
Note-1: it is better to loop through primaryLookupAttributeName_List so that the column order is preserved, and in case any entry of primaryLookupAttributeName_List is missing from the dictionary, we can set a default column name, i.e. Unknown-&lt;domain&gt;. With the old method, columns with missing entries are simply discarded:

test = dataset_standardFalse2.select("*",*[ mappings[concat_ws('\0', lit(c), col(c))].alias(NEWprimaryLookupAttributeName_List.get(c,"Unknown-{}".format(c))) for c in primaryLookupAttributeName_List])
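The .get(c, "Unknown-{}".format(c)) fallback is what preserves both the column order and the missing domains; a pure-Python sketch (target names are hypothetical):

```python
# hypothetical domain -> output-column mapping with one domain missing
name_map = {'LeaseRecoveryType': 'ConformedRecoveryType',
            'LeaseStatus': 'ConformedStatus'}
domains = ['LeaseType', 'LeaseRecoveryType', 'LeaseStatus']

# .get keeps the list order and substitutes a default for missing domains
out_cols = [name_map.get(c, "Unknown-{}".format(c)) for c in domains]
print(out_cols)
# ['Unknown-LeaseType', 'ConformedRecoveryType', 'ConformedStatus']
```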
Note-2: per the comments, to overwrite the existing column names (untested):

(1) use select:

test = dataset_standardFalse2.select([c for c in dataset_standardFalse2.columns if c not in NEWprimaryLookupAttributeName_List.values()] + [ mappings[concat_ws('\0', lit(c), col(c))].alias(NEWprimaryLookupAttributeName_List.get(c,"Unknown-{}".format(c))) for c in primaryLookupAttributeName_List]).show()
(2) use reduce (not recommended if the list is long):

from functools import reduce

# note: withColumn names the new column itself, so the target name goes in
# its first argument (an alias inside withColumn would be ignored)
df_new = reduce(lambda d, c: d.withColumn(NEWprimaryLookupAttributeName_List.get(c, "Unknown-{}".format(c)), mappings[concat_ws('\0', lit(c), col(c))]), primaryLookupAttributeName_List, dataset_standardFalse2)

Comments:

Are you looking to map df1.LeaseStatus and df1.LeaseRecoveryType based on reference_df.DomainName, reference_df.PrimaryLookupAttributeValue and reference_df.OutputItemNameByValue? — Yes! Except that df1.LeaseStatus and df1.LeaseRecoveryType are mapped based on reference_df.PrimaryLookupAttributeName, which is why I have the dataframe AttributeLookup. @jxc

@jxc This works, it is exactly what I wanted, thank you! I am reading up on the concat_ws and chain methods. Quick question about using \0 as the separator: why?

@jessgtrz You can use any separator as long as it does not create ambiguity: PrimaryLookupAttributeName="A" + PrimaryLookupAttributeValue="B C" (A\0B C) versus PrimaryLookupAttributeName="A B" + PrimaryLookupAttributeValue="C" (A B\0C). I prefer the NUL character \0 as the separator because it is uncommon in normal text. Also, if you script Linux commands often, the NUL char ('\0') is not allowed in Linux filenames, so it became my favorite separator; I use it whenever I need one.

So the logic in my question maps the values between the reference table and the dataset, hence OutputItemNameByValue. Can we also map the output column names between the reference and the dataset? For example, can .alias("Matched[{}]OutputItemNameByValue".format(c)) iterate and match the corresponding output column? Asking because the name may not always be Matched…OutputItemNameByValue. I think I could map DomainName and PrimaryAttributeName (map key) to TargetOutputColumn (map value). Can .alias("Matched[{}]OutputItemNameByValue".format(c)) be replaced by another mapping? @jxc

@jessgtrz I think you can convert the Python list into a dictionary, for example primaryLookupAttributeName_List = {'LeaseType':'Name1', 'LeaseRecoveryType':'Name2', 'LeaseStatus':'Name3'}, and then use *[mappings[concat_ws('\0', lit(c), col(c))].alias(c_name) for c, c_name in primaryLookupAttributeName_List.items()]

Hmm, got it. I am struggling to find the most efficient approach, since building the dictionary for primaryLookupAttributeName_List involves yet another dataframe. I would appreciate another pair of eyes/brains. Thank you so much, I have learned a lot from you! @jxc
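The separator ambiguity discussed in the comments can be demonstrated in a few lines of plain Python: with a space separator, two different (name, value) pairs can collapse to the same composite key, while NUL keeps them distinct:

```python
def space_key(name, value):
    return name + ' ' + value

def nul_key(name, value):
    return name + '\x00' + value

# a space separator is ambiguous: both pairs collapse to 'A B C'
assert space_key('A', 'B C') == space_key('A B', 'C')

# the NUL separator keeps the two pairs distinct
assert nul_key('A', 'B C') != nul_key('A B', 'C')
```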