PySpark: mapping multiple DataFrame columns


I need to be able to compare two DataFrames using multiple columns.

My PySpark attempt:

# get the distinct PrimaryLookupAttributeValue values from the reference
# table as a Python list, to compare them against df1

primaryAttributeValue_List = [ p.PrimaryLookupAttributeValue for p in AttributeLookup.select('PrimaryLookupAttributeValue').distinct().collect() ]
primaryAttributeValue_List  # a list of values; varies by the filter applied

Out: ['Archive',
 'Pending Security Deposit',
 'Partially Abandoned',
 'Revision Contract Review',
 'Open',
 'Draft Accounting In Review',
 'Draft Returned']


# compare df1 to PrimaryLookupAttributeValue
output = dataset_standardFalse2.withColumn('ConformedLeaseStatusName', f.when(dataset_standardFalse2['LeaseStatus'].isin(primaryAttributeValue_List), "FOUND").otherwise("TBD"))

display(output)
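The isin/when/otherwise pattern above can be sketched in plain Python (the sample statuses below are illustrative, not from the real dataset):

```python
# Flag each LeaseStatus as "FOUND" when it appears in the reference list,
# otherwise "TBD" -- the same logic as isin(...) + when/otherwise above.
primary_values = {'Archive', 'Open', 'Draft Returned'}  # sample subset
lease_statuses = ['Open', 'Terminated', 'Archive']      # sample df1 column

flags = ['FOUND' if s in primary_values else 'TBD' for s in lease_statuses]
print(flags)  # ['FOUND', 'TBD', 'FOUND']
```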


From what I understand, you can create a mapping from the columns of reference_df (I am assuming it is not a very large dataframe):

and then use this mapping to get the corresponding values in df1:

from itertools import chain
from pyspark.sql.functions import collect_set, array, concat_ws, lit, col, create_map

d = reference_df.agg(collect_set(array(concat_ws('\0','PrimaryLookupAttributeName','PrimaryLookupAttributeValue'), 'OutputItemNameByValue')).alias('m')).first().m
#[['LeaseStatus\x00Abandoned', 'Active'],
# ['LeaseRecoveryType\x00Gross-modified', 'Modified Gross'],
# ['LeaseStatus\x00Archive', 'Expired'],
# ['LeaseStatus\x00Terminated', 'Terminated'],
# ['LeaseRecoveryType\x00Gross w/base year', 'Modified Gross'],
# ['LeaseStatus\x00Draft', 'Pending'],
# ['LeaseRecoveryType\x00Gross', 'Gross']]

mappings = create_map([lit(i) for i in chain.from_iterable(d)])
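As a side note, create_map expects a flat, alternating key/value sequence, which is why chain.from_iterable is applied to the collected pairs. A minimal pure-Python sketch of that flattening, using two of the pairs shown above:

```python
from itertools import chain

# two of the [key, value] pairs collected into `d` above
d = [['LeaseStatus\x00Archive', 'Expired'],
     ['LeaseRecoveryType\x00Gross', 'Gross']]

# create_map wants key1, value1, key2, value2, ... -- flatten the pairs
flat = list(chain.from_iterable(d))
print(flat)
# ['LeaseStatus\x00Archive', 'Expired', 'LeaseRecoveryType\x00Gross', 'Gross']

# the same pairs viewed as a plain Python dict
as_dict = dict(d)
```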

primaryLookupAttributeName_List = ['LeaseType', 'LeaseRecoveryType', 'LeaseStatus']

df1.select("*", *[ mappings[concat_ws('\0', lit(c), col(c))].alias("Matched[{}]OutputItemNameByValue".format(c)) for c in primaryLookupAttributeName_List ]).show()
+----------------+...+---------------------------------------+-----------------------------------------------+-----------------------------------------+
|SourceSystemName|...|Matched[LeaseType]OutputItemNameByValue|Matched[LeaseRecoveryType]OutputItemNameByValue|Matched[LeaseStatus]OutputItemNameByValue|
+----------------+...+---------------------------------------+-----------------------------------------------+-----------------------------------------+
|          ABC123|...|                                   null|                                          Gross|                               Terminated|
|          ABC123|...|                                   null|                                 Modified Gross|                                  Expired|
|          ABC123|...|                                   null|                                 Modified Gross|                                  Pending|
+----------------+...+---------------------------------------+-----------------------------------------------+-----------------------------------------+
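Outside of Spark, the lookup mappings[concat_ws('\0', lit(c), col(c))] behaves like a plain dict lookup on a composite "column\0value" key, returning null (None here) when the pair is absent. A small sketch with sample pairs:

```python
# mapping keys are "ColumnName\0CellValue"; a missing pair yields None,
# which Spark would show as null
mapping = {'LeaseStatus\x00Archive': 'Expired',
           'LeaseRecoveryType\x00Gross': 'Gross'}

def lookup(col_name, value):
    # analogue of mappings[concat_ws('\0', lit(col_name), col(col_name))]
    return mapping.get(col_name + '\x00' + str(value))

print(lookup('LeaseStatus', 'Archive'))  # Expired
print(lookup('LeaseType', 'Net'))        # None
```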
Update: to set the column names based on information retrieved through the reference dataframe:

# map_key   = concat_ws('\0', PrimaryLookupAttributeName, PrimaryLookupAttributeValue)
# map_value = OutputItemNameByValue

# a list of domains to retrieve
primaryLookupAttributeName_List = ['LeaseType', 'LeaseRecoveryType', 'LeaseStatus']

# mapping from domain names to column names: using `reference_df`.`TargetAttributeForName`
NEWprimaryLookupAttributeName_List = dict(reference_df.filter(reference_df['DomainName'].isin(primaryLookupAttributeName_List)).agg(collect_set(array('DomainName', 'TargetAttributeForName')).alias('m')).first().m)
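The dict(...) step above just turns the collected [DomainName, TargetAttributeForName] pairs into a name mapping; a pure-Python sketch with hypothetical target names:

```python
# pairs as returned by collect_set(array('DomainName', 'TargetAttributeForName'));
# the target names here are made up for illustration
pairs = [['LeaseStatus', 'ConformedStatus'],
         ['LeaseRecoveryType', 'ConformedRecoveryType']]

name_map = dict(pairs)
print(name_map['LeaseStatus'])  # ConformedStatus
```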

test = dataset_standardFalse2.select("*",*[ mappings[concat_ws('\0', lit(c), col(c))].alias(c_name) for c,c_name in NEWprimaryLookupAttributeName_List.items()]) 
Note-1: it is better to loop through primaryLookupAttributeName_List so that the column order is preserved, and in case any entry of primaryLookupAttributeName_List is missing from the dictionary, we can set a default column name, i.e. Unknown-&lt;domain&gt;. With the old method, columns with missing entries are simply discarded:

test = dataset_standardFalse2.select("*",*[ mappings[concat_ws('\0', lit(c), col(c))].alias(NEWprimaryLookupAttributeName_List.get(c,"Unknown-{}".format(c))) for c in primaryLookupAttributeName_List])
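The .get(c, "Unknown-{}".format(c)) fallback is what preserves both the column order and the missing domains; a pure-Python sketch (target names are hypothetical):

```python
# hypothetical domain -> output-column mapping with one domain missing
name_map = {'LeaseRecoveryType': 'ConformedRecoveryType',
            'LeaseStatus': 'ConformedStatus'}
domains = ['LeaseType', 'LeaseRecoveryType', 'LeaseStatus']

# .get keeps the list order and substitutes a default for missing domains
out_cols = [name_map.get(c, "Unknown-{}".format(c)) for c in domains]
print(out_cols)
# ['Unknown-LeaseType', 'ConformedRecoveryType', 'ConformedStatus']
```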
Note-2: per the comments, to overwrite the existing column names (untested):

(1) use select:

test = dataset_standardFalse2.select([c for c in dataset_standardFalse2.columns if c not in NEWprimaryLookupAttributeName_List.values()] + [ mappings[concat_ws('\0', lit(c), col(c))].alias(NEWprimaryLookupAttributeName_List.get(c,"Unknown-{}".format(c))) for c in primaryLookupAttributeName_List]).show()
(2) use reduce (not recommended if the list is long):

from functools import reduce

# note: withColumn names the new column itself, so the target name goes in
# its first argument (an alias inside withColumn would be ignored)
df_new = reduce(lambda d, c: d.withColumn(NEWprimaryLookupAttributeName_List.get(c, "Unknown-{}".format(c)), mappings[concat_ws('\0', lit(c), col(c))]), primaryLookupAttributeName_List, dataset_standardFalse2)

Comments:

Are you looking to map df1.LeaseStatus and df1.LeaseRecoveryType based on reference_df.DomainName, reference_df.PrimaryLookupAttributeValue and reference_df.OutputItemNameByValue? — Yes! Except that df1.LeaseStatus and df1.LeaseRecoveryType are mapped based on reference_df.PrimaryLookupAttributeName, which is why I have the dataframe AttributeLookup. @jxc

@jxc This works, it is exactly what I wanted, thank you! I am reading up on the concat_ws and chain methods. Quick question about using \0 as the separator: why?

@jessgtrz You can use any separator as long as it does not create ambiguity: PrimaryLookupAttributeName="A" + PrimaryLookupAttributeValue="B C" (A\0B C) versus PrimaryLookupAttributeName="A B" + PrimaryLookupAttributeValue="C" (A B\0C). I prefer the NUL character \0 as the separator because it is uncommon in normal text. Also, if you script Linux commands often, the NUL char ('\0') is not allowed in Linux filenames, so it became my favorite separator; I use it whenever I need one.

So the logic in my question maps the values between the reference table and the dataset, hence OutputItemNameByValue. Can we also map the output column names between the reference and the dataset? For example, can .alias("Matched[{}]OutputItemNameByValue".format(c)) iterate and match the corresponding output column? Asking because the name may not always be Matched…OutputItemNameByValue. I think I could map DomainName and PrimaryAttributeName (map key) to TargetOutputColumn (map value). Can .alias("Matched[{}]OutputItemNameByValue".format(c)) be replaced by another mapping? @jxc

@jessgtrz I think you can convert the Python list into a dictionary, for example primaryLookupAttributeName_List = {'LeaseType':'Name1', 'LeaseRecoveryType':'Name2', 'LeaseStatus':'Name3'}, and then use *[mappings[concat_ws('\0', lit(c), col(c))].alias(c_name) for c, c_name in primaryLookupAttributeName_List.items()]

Hmm, got it. I am struggling to find the most efficient approach, since building the dictionary for primaryLookupAttributeName_List involves yet another dataframe. I would appreciate another pair of eyes/brains. Thank you so much, I have learned a lot from you! @jxc
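The separator ambiguity discussed in the comments can be demonstrated in a few lines of plain Python: with a space separator, two different (name, value) pairs can collapse to the same composite key, while NUL keeps them distinct:

```python
def space_key(name, value):
    return name + ' ' + value

def nul_key(name, value):
    return name + '\x00' + value

# a space separator is ambiguous: both pairs collapse to 'A B C'
assert space_key('A', 'B C') == space_key('A B', 'C')

# the NUL separator keeps the two pairs distinct
assert nul_key('A', 'B C') != nul_key('A B', 'C')
```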