Python: conditional list comprehension for tuples with length greater than 1
I have a sentence, along with tuples that mark the positions of locations (countries) or numbers in it:
sample = "In the first 11 months of 2004 Hong Kong 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than 7,734 tons of cargo."
Then:
tokenIDs2number = {(22,): 592.00, (25,): 92630.00, (34,): 7734.00}
tokenIDs2location = {(8,9): "Hong Kong"}
I need to create a separate sentence for each combination of these tuples, which I call slot sentences:
In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.
In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , NUMBER_SLOT passengers , and more than 7,734 tons of cargo.
In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than NUMBER_SLOT tons of cargo.
However, my current code essentially takes combinations of the elements inside each tuple, so I end up with two sentences such as:
In the first 11 months of 2004 LOCATION_SLOT Kong 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.
In the first 11 months of 2004 Hong LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.
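for example.
How can I fix this so that, when a tuple key has len > 1, all the tokens covered by that key are filled in as a single location or number slot, as appropriate?
Current code: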
for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        sentenceDict = {}
        sentenceDict["sentence"] = sample
        sentenceDict["location-value-pair"] = {location:number}
        for locationTokenID in locationTokenIDs:
            for numberTokenID in numberTokenIDs:
                finalTokens = cleanSample.split()
                finalTokens[numberTokenID] = "NUMBER_SLOT"
                finalTokens[locationTokenID] = "LOCATION_SLOT"
                slotSentence = (" ").join(finalTokens)
                sentenceDict["parsedSentence"] = slotSentence
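Note that I have to build a dictionary that also tracks the location-value pair and the original sentence for each slot-sentence combination. The key part is generating the correct slot sentence.
Note that this is just an example: a number could even be 24000000 where the value in the sentence reads "24 million", and the same goes for trillion, million, billion and thousand.
If this is not possible, another option is to fill every slot in the combination: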
In the first 11 months of 2004 LOCATION_SLOT LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.
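and then possibly modify the sentence to remove consecutive slots, but my preference is to do it all in one go.

The code treats each locationTokenID as a slot, whereas the locationTokenIDs actually denote the endpoints of the token slice that should be treated as one slot. So we need to remove the "for locationTokenID in locationTokenIDs:" loop (which iterates over each locationTokenID as though it were a slot of its own) and replace the word slice defined by the pair of locationTokenIDs with a single slot.
The code below solves the problem raised in the OP, but other problems remain (for example, only the last generated slotSentence is kept; I'll leave that for you to fix, since I don't know what data structure you want to store the slot sentences in):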
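A minimal sketch of that fix, not necessarily the answerer's exact code. It assumes 0-based token indices for sample.split(), so the index values below are assumptions chosen to match this tokenization rather than the question's values. Like the description above, it keeps only the last slotSentence and simply prints each one as it goes:

```python
sample = ("In the first 11 months of 2004 Hong Kong 's international airport "
          "at Chek Lap Kok handled daily an average of 592 flights , "
          "92,630 passengers , and more than 7,734 tons of cargo.")

# Assumed 0-based indices into sample.split() (not the question's values).
tokenIDs2number = {(21,): 592.00, (24,): 92630.00, (30,): 7734.00}
tokenIDs2location = {(7, 8): "Hong Kong"}  # inclusive (first, last) endpoints

for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        finalTokens = sample.split()
        # Replace the number first: it sits to the right of the location,
        # so the location indices are still valid afterwards.
        finalTokens[numberTokenIDs[0]] = "NUMBER_SLOT"
        # Replace the whole location slice with a single slot token.
        finalTokens[locationTokenIDs[0]:locationTokenIDs[-1] + 1] = ["LOCATION_SLOT"]
        slotSentence = " ".join(finalTokens)
        print(slotSentence)
```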
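Output:
In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92,630 passengers , and more than 7,734 tons of cargo.
In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , NUMBER_SLOT passengers , and more than 7,734 tons of cargo.
In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92,630 passengers , and more than NUMBER_SLOT tons of cargo.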
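This can be extended to work for locations and numbers that contain any number of spaces. We do this by making both numberTokenIDs and locationTokenIDs 2-length tuples that specify the range of tokens covered by each location or number: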
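sample = "In the first 11 months of 2004 Hong Kong Central 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92 630 passengers , and more than 7 734 tons of cargo."
tokenIDs2number = {(22,22): '592', (25,26): '92 630', (32,33): '7 734'}
tokenIDs2location = {(7,9): 'Hong Kong Central'}
for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        finalTokens = sample.split()
        # Assign a one-element list: assigning a bare string to a slice
        # would insert one list element per character.
        finalTokens[numberTokenIDs[0]:(numberTokenIDs[1]+1)] = ["NUMBER_SLOT"]
        finalTokens[locationTokenIDs[0]:(locationTokenIDs[1]+1)] = ["LOCATION_SLOT"]
        slotSentence = (" ").join(finalTokens)
        print(slotSentence)
Output (the order follows dict iteration order, which can differ between Python versions):
In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , NUMBER_SLOT passengers , and more than 7 734 tons of cargo.
In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of 592 flights , 92 630 passengers , and more than NUMBER_SLOT tons of cargo.
In the first 11 months of 2004 LOCATION_SLOT 's international airport at Chek Lap Kok handled daily an average of NUMBER_SLOT flights , 92 630 passengers , and more than 7 734 tons of cargo.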
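The slice-based idea relies on the number being replaced before the location, because the number lies further right and so does not shift the location's indices. A possible generalization, sketched here under the assumption that spans are inclusive (start, end) pairs that do not overlap, processes the spans from right to left so the order of slots in the sentence no longer matters. The helper name fill_slots and the example sentence are made up for illustration:

```python
def fill_slots(tokens, spans):
    """Return a copy of tokens with each inclusive (start, end) span
    replaced by a single slot label.

    spans maps (start, end) tuples to labels; spans are assumed not to overlap.
    """
    result = list(tokens)
    # Work from right to left so earlier replacements do not shift
    # the indices of spans that are still to be processed.
    for (start, end), label in sorted(spans.items(), reverse=True):
        result[start:end + 1] = [label]
    return result


# Made-up example sentence with a four-token location and a two-token number.
tokens = "The Democratic Republic of Congo exported 24 million tons".split()
spans = {(1, 4): "LOCATION_SLOT", (6, 7): "NUMBER_SLOT"}
print(" ".join(fill_slots(tokens, spans)))
# prints: The LOCATION_SLOT exported NUMBER_SLOT tons
```

Returning a new list keeps the original tokens intact, which is convenient when the same sentence is reused for several location/number combinations.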
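Consider using str.replace() instead of splitting the sentence string. To do that, you need to render the values in tokenIDs2number with thousands separators; for Python 2.7+ you can use format(int, ',').
Addressing @JonClements' comment: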
sample = "In the first 11 months of 2004 Hong Kong 's international airport " + \
         "at Chek Lap Kok handled daily an average of 592 flights , " + \
         "92,630 passengers , and more than 7,734 tons of cargo."
tokenIDs2number = {(22,): 592, (25,): 92630, (34,): 7734}
tokenIDs2location = {(8,9): 'Hong Kong'}
sentenceList = []
# ITERATE ACROSS A LIST COMPREHENSION FOR ALL POSSIBLE COMBINATIONS
for item in [[s,i,j] for s in [sample] \
                     for i in tokenIDs2location.items() \
                     for j in tokenIDs2number.items()]:
    sentenceDict = {}
    sentenceDict["sentence"] = item[0]
    sentenceDict["location-value-pair"] = {item[1][1]: item[2][1]}
    # The token IDs are not used here: slots are located by matching text.
    sentenceDict["parsedSentence"] = sample.replace(item[1][1], 'LOCATION_SLOT').\
        replace(format(item[2][1], ','), 'NUMBER_SLOT')
    sentenceList.append(sentenceDict)
Output (of sentenceList)
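As a small illustration of the same replace-based idea, the sketch below wraps it in a hypothetical helper, to_slot_sentence (a name introduced here, not part of the answer). Since the question's dictionary stores floats such as 92630.00, it rounds to an int before formatting, and it replaces only the first match of each string:

```python
def to_slot_sentence(sentence, location, number):
    """Replace one location string and one number with slot labels."""
    # Round floats such as 92630.0 to an int so the formatted text
    # matches the comma-separated form used in the sentence.
    number_text = format(int(round(number)), ",")
    return (sentence
            .replace(location, "LOCATION_SLOT", 1)
            .replace(number_text, "NUMBER_SLOT", 1))


sample = ("In the first 11 months of 2004 Hong Kong 's international airport "
          "at Chek Lap Kok handled daily an average of 592 flights , "
          "92,630 passengers , and more than 7,734 tons of cargo.")

print(to_slot_sentence(sample, "Hong Kong", 92630.0))
```

Because str.replace() matches plain substrings, a short number such as 2 would also match inside 2004, so this approach is safest when the formatted numbers are distinctive within the sentence.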
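I have solved my use case, but in a roundabout way.
I first build slot sentences that may contain multiple LOCATION_SLOT or NUMBER_SLOT tokens: if one of the tuples in a combination covers two or more tokens, I fill all of them: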
sentences2location2values = []
for locationTokenIDs, location in tokenIDs2location.items():
    for numberTokenIDs, number in tokenIDs2number.items():
        sentenceDict = {}
        sentenceDict["sentence"] = sample
        sentenceDict["location-value-pair"] = {location:number}
        # Assumption: start from a fresh token list for every combination,
        # so slots from earlier combinations do not carry over.
        sampleTokens = sample.split()
        for locationTokenID in locationTokenIDs:
            sampleTokens[locationTokenID] = "LOCATION_SLOT"
        for numberTokenID in numberTokenIDs:
            sampleTokens[numberTokenID] = "NUMBER_SLOT"
        slotSentence = (" ").join(sampleTokens)
        sentenceDict["parsedSentence"] = slotSentence
        sentences2location2values.append(sentenceDict)
Then I rewrite the parsed sentences to remove consecutive location and number slots:
for sentence in sentences2location2values:
    sampleTokens = sentence['parsedSentence'].split()
    newTokens = []
    for i,token in enumerate(sampleTokens):
        # Skip a slot token that repeats the slot immediately before it.
        if i>0 and token in ("LOCATION_SLOT", "NUMBER_SLOT") and token == sampleTokens[i-1]:
            continue
        newTokens.append(token)
    sentence['parsedSentence']=(' ').join(newTokens)
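The same clean-up step can be written more compactly with itertools.groupby, which groups runs of identical consecutive tokens. This is only a sketch: collapse_slots and SLOTS are names introduced here, and like the loop above it would also merge two genuinely separate locations that happen to be adjacent.

```python
from itertools import groupby

SLOTS = {"LOCATION_SLOT", "NUMBER_SLOT"}


def collapse_slots(sentence):
    """Collapse runs of the same slot token into a single token."""
    collapsed = []
    for token, run in groupby(sentence.split()):
        if token in SLOTS:
            collapsed.append(token)   # keep one copy of a repeated slot
        else:
            collapsed.extend(run)     # leave ordinary repeated words alone
    return " ".join(collapsed)


print(collapse_slots("of 2004 LOCATION_SLOT LOCATION_SLOT 's airport handled NUMBER_SLOT flights"))
# prints: of 2004 LOCATION_SLOT 's airport handled NUMBER_SLOT flights
```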
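While Mike DeSimone's recipe is fine... for 2.7+ you can now write that as format(int_value, ',').
@JonClements Does that mean I can replace replace(intWithCommas(item[2][1]), 'NUMBER_SLOT') with replace(format(item[2][1], ','), 'NUMBER_SLOT')?
@JonClements What happens if the tuple values are actually floats? Note that a value in the sentence may even read like "24 million" and be stored as 24000000.00.
@JonClements - many thanks! I didn't know that, and I've credited you now.
@dhruvghuati, if the values are floats, just convert them to the nearest integer with int() or round() before calling format(int, ','). And please advise whether this solution works.
This is a great answer and makes sense logically. Can you explain why the location slot is split on spaces? Also, how can I make this general? Sometimes a slot spans more than two tokens, e.g. "Democratic Republic of Congo", and there can also be multi-token number slots, not just locations. I've been playing around with len(locationTokenIDs), but I'm not masking the necessary countries.
This works for countries with any number of spaces, because the values in locationTokenIDs represent slice endpoints and are treated as slice endpoints in the code. The same logic that applies to locations also applies to numbers. I've updated the answer with code that works for locations and numbers.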