Loops: PySpark, iterating over repeated variables
I have code that currently works, but I would like it to be more efficient and to avoid hardcoding:
1) Avoid hardcoding: for NotDefined_filterDomainLookup, when Id=4, reference the default_reference df for the corresponding Code and Name, instead of hardcoding the Code and Name values.
Question 1
Mapping of column names to their corresponding new column names:
test_matchedAttributeName_List =dict(matchedDomains.agg(collect_set(array('DomainName', 'TargetAttributeForName')).alias('m')).first().m)
Output: {'LeaseType': 'ConformedLeaseTypeName', 'LeaseRecoveryType': 'ConformedLeaseRecoveryTypeName', 'LeaseStatus': 'ConformedLeaseStatusName'}
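The code below works, but it hardcodes the replacement values. Specifically, when Id=4, I want to reference the default_reference df for the corresponding Code and Name.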
cond = col('PrimaryLookupAttributeName').isNull() & col('SecondaryLookupAttributeName').isNull()
NotDefined_filterDomainLookup = filterDomainLookup \
.withColumn('OutputItemIdByAttribute', when(cond, lit('4')).otherwise(col('OutputItemIdByAttribute'))) \
.withColumn('OutputItemCodeByAttribute', when(cond, lit('N/D')).otherwise(col('OutputItemCodeByAttribute'))) \
.withColumn('OutputItemNameByAttribute', when(cond, lit('Not Defined')).otherwise(col('OutputItemNameByAttribute')))
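One way to avoid the hardcoded 'N/D' and 'Not Defined' literals is to look the values up by ItemId first and pass the results to lit(). The sketch below shows only the lookup step in plain Python. The sample rows and the helper name lookup_default are assumptions for illustration, and the column names ItemId, ItemCode and ItemName are taken from the discussion of default_reference rather than from a confirmed schema.

```python
# Hypothetical stand-in for the rows collected from default_reference.
# Only the Id=4 / N/D / Not Defined row comes from the question;
# the second row is a made-up placeholder.
default_reference_rows = [
    {"ItemId": "4", "ItemCode": "N/D", "ItemName": "Not Defined"},
    {"ItemId": "5", "ItemCode": "N/A", "ItemName": "Not Applicable"},
]


def lookup_default(rows, item_id):
    """Return (ItemCode, ItemName) for the given ItemId, or raise if absent."""
    for row in rows:
        if row["ItemId"] == item_id:
            return row["ItemCode"], row["ItemName"]
    raise KeyError(f"ItemId {item_id!r} not found in default_reference")


default_code, default_name = lookup_default(default_reference_rows, "4")
print(default_code, default_name)
```

With real data, the two values could then replace the literals, for example lit(default_code) in place of lit('N/D'), so that only the Id itself remains in the code.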
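For Question 2, based on your code, I suggest the following adjustments:
- Set up item_keys, containing Id, Name and Code, and use a list comprehension to consolidate the repeated logic
- Use struct instead of array to implement the logic above
- There is no need to create a Python dictionary for NotDefined_Attribute_List; a list of tuples is sufficient and works better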
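The last point can be illustrated in plain Python. A dictionary keyed on the source column keeps only one target per column, while a list of tuples keeps one entry per (item key, column) pair. The variable name attribute_tuples and the Id target column below are assumptions for illustration; only the Name targets appear in the question's output.

```python
# Hypothetical entries: (item_key, source_column, new_column_name).
# Only the *Name targets appear in the question; the Id target is assumed.
attribute_tuples = [
    ("Id", "LeaseType", "ConformedLeaseTypeId"),
    ("Name", "LeaseType", "ConformedLeaseTypeName"),
    ("Name", "LeaseStatus", "ConformedLeaseStatusName"),
]

# A dict keyed on the source column keeps only the last entry per column,
# so the Id target for LeaseType is silently lost.
as_dict = {c: c_name for _, c, c_name in attribute_tuples}

# The list of tuples keeps every entry and unpacks cleanly in a loop.
for k, c, c_name in attribute_tuples:
    print(k, c, c_name)

print(len(attribute_tuples), len(as_dict))
```

The same three-way unpacking is what a loop such as "for k, c, c_name in ..." relies on when building one output column per tuple.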
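How do we link default_reference.ItemId with the other data? How do we know which ItemId should be applied when filling the nulls? Does default_reference contain other rows with DomainName != 'Default'?
@jxc, using DomainName will not work. I think it can be split into two steps? 1) Identify cond and link NotDefined_filterDomainLookup to Id=4, N/D, etc. Hardcode Id=4 and map filterDomainLookup.OutputItemCodeByAttribute = default_reference.ItemCode? Then do step 2. I updated the post.
So, for Q-1, you just need to look up ItemCode and ItemName automatically by supplying an ItemId, with the mapping coming from default_reference? In your example, your id=4, so N/D and Not Defined should be retrieved from the mapping.
In simpler terms, yes, @jxc, and NotDefined_filterDomainLookup will only deal with Id=4, because it should only take effect when cond applies. I think .otherwise is irrelevant in this case.
If you want NotDefined_Attribute_List sorted by item_keys first, switch the order of item_keys and m2: NotDefined_Attribute_List = [(k, row.domain, row[k]) for k in item_keys for row in m2 if row[k]]
I need to digest this and understand named struct and struct. I will follow up, thanks! Can you explain the third step (NotDefined_Attribute_List = [(k, row.domain, row[k]) ...)?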
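To make the third step concrete: the comprehension walks every collected row and every item key, and emits a tuple only when the row has a non-empty value for that key. Swapping the two for clauses changes only the order of the result, not its contents. A minimal sketch, assuming plain dictionaries as stand-ins for the collected pyspark Row objects; the Id target column names are hypothetical.

```python
item_keys = ["Id", "Name"]

# Hypothetical rows standing in for m2; the Id target names are assumed.
m2 = [
    {"domain": "LeaseType", "Id": "ConformedLeaseTypeId", "Name": "ConformedLeaseTypeName"},
    {"domain": "LeaseStatus", "Id": "ConformedLeaseStatusId", "Name": "ConformedLeaseStatusName"},
    {"domain": "LeaseRecoveryType", "Id": None, "Name": "ConformedLeaseRecoveryTypeName"},
]

# Row-major: all keys of the first row, then all keys of the next row.
by_row = [(k, row["domain"], row[k]) for row in m2 for k in item_keys if row[k]]

# Key-major: all rows for "Id", then all rows for "Name".
by_key = [(k, row["domain"], row[k]) for k in item_keys for row in m2 if row[k]]

for entry in by_row:
    print("by_row:", entry)
for entry in by_key:
    print("by_key:", entry)
```

The row with Id set to None contributes only its Name tuple in both variants, which is the effect of the trailing "if row[k]" filter.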
from itertools import chain
from pyspark.sql.functions import col, concat_ws, create_map, lit
m1 = NotDefined_filterDomainLookup.agg(m1_by_sql_expr).first().item_map
"""create a list of tuples of (map_key, map_value) to create MapType column:
| map_key = concat_ws('\0', item_key, attr_name, attr_value)
| map_value = item_value
"""
testingId = [('\0'.join([k, row.attr_name, row.attr_value]), row[k]) for row in m1 for k in item_keys if row[k]]
#[('Id\x00LeaseRecoveryType\x00Gross w/base year', '18'),
# ('Name\x00LeaseRecoveryType\x00Gross w/base year', 'Modified Gross'),
# ('Id\x00LeaseStatus\x00Abandoned', '10'),
# ('Name\x00LeaseStatus\x00Abandoned', 'Active'),
# ('Id\x00LeaseStatus\x00Draft', '10'),
# ('Name\x00LeaseStatus\x00Draft', 'Pending'),
# ('Id\x00LeaseStatus\x00Archive', '11'),
# ('Name\x00LeaseStatus\x00Archive', 'Expired'),
# ('Id\x00LeaseStatus\x00Terminated', '10'),
# ('Name\x00LeaseStatus\x00Terminated', 'Terminated'),
# ('Id\x00LeaseRecoveryType\x00Gross', '11'),
# ('Name\x00LeaseRecoveryType\x00Gross', 'Gross'),
# ('Id\x00LeaseRecoveryType\x00Gross-modified', '15'),
# ('Name\x00LeaseRecoveryType\x00Gross-modified', 'Modified Gross')]
# this could be a problem for too many entries.
testing_mappings = create_map([lit(i) for i in chain.from_iterable(testingId)])
m2 = matchedDomains.agg(m2_by_func).first().item_map
NotDefined_Attribute_List = [(k, row.domain, row[k]) for row in m2 for k in item_keys if row[k]]
additional_cols = [
    testing_mappings[concat_ws('\0', lit(k), lit(c), col(c))].alias(c_name)
    for k,c,c_name in NotDefined_Attribute_List
]
if count_ND > 0:
    # move code above in (2), (3) and (4) here
    # set up testing_NotDefined
    testing_NotDefined = datasetMatchedPortfolio.select("*", *additional_cols)
else:
    print("no Not Defines exist")
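How the composite keys and the flattened create_map argument list fit together can be checked without Spark. The sketch below reuses two of the (key, value) pairs from the sample testingId output and mimics the per-row lookup with a plain dict standing in for the map column. The helper name lookup and the constant SEP are hypothetical.

```python
from itertools import chain

SEP = "\0"  # same separator as concat_ws('\0', ...) in the answer

# Two pairs copied from the sample testingId output.
testingId = [
    (SEP.join(["Id", "LeaseStatus", "Archive"]), "11"),
    (SEP.join(["Name", "LeaseStatus", "Archive"]), "Expired"),
]

# create_map takes alternating key, value arguments, hence the flattening.
flat = list(chain.from_iterable(testingId))

# Plain-dict stand-in for the map column.
mapping = dict(testingId)


def lookup(item_key, attr_name, attr_value):
    """Mimic indexing the map with the joined (item_key, attr_name, attr_value) key."""
    return mapping.get(SEP.join([item_key, attr_name, attr_value]))


print(flat)
print(lookup("Name", "LeaseStatus", "Archive"))
print(lookup("Name", "LeaseStatus", "Unknown"))
```

Joining on the null character keeps the three parts of the key unambiguous as long as none of the attribute values contains that character. In this sketch an unknown combination returns None.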