Python: reading complex lines of text into a dictionary
I want to take a large number of terms from a text file and sort them into the following groups: Animal, Art, Buildings, Vehicle, Person, People, Food, Glass, Bottle, Signage, Slogan, DJ, Party. At the moment my tester2.txt file contains these lines:
I like sorbet
I am a man wearing a shirt
Pizza is my favorite meal
formula 1 racing is awesome
steak
Here is my code:
keyword_dictionary = {
'Animal' : ['animal', 'dog', 'cat'],
'Art' : ['art', 'sculpture', 'fearns'],
'Buildings' : ['building', 'architecture', 'gothic', 'skyscraper'],
'Vehicle' : ['car','formula','f-1','f1','f 1','f one','f-one','moped','mo ped','mo-ped','scooter'],
'Person' : ['person','dress','shirt','woman','man','attractive','adult','smiling','sleeveless','halter','spectacles','button','bodycon'],
'People' : ['people','women','men','attractive','adults','smiling','group','two','three','four','five','six','seven','eight','nine','ten','2','3','4','5','6','7','8','9','10'],
'Food' : ['food','plate','chicken','steak','pizza','pasta','meal','asian','beef','cake','candy','food pyramid','spaghetti','curry','lamb','sushi','meatballs','biscuit','apples','meat','mushroom','jelly', 'sorbet','nacho','burrito','taco','cheese'],
'Glass' : ['glass','drink','container','glasses','cup'],
'Bottle' : ['bottle','drink'],
'Signage' : ['sign','martini','ad','advert','card','bottles','logo','mat','chalkboard','blackboard'],
'Slogan' : ['Luck is overrated'],
'DJ' : ['dj','disc','jockey','mixer','instrument','turntable'],
'Party' : ['party']
}
def matcher(keywords, searcher):
    for key, words in keywords.iteritems():
        if searcher in words:
            print key
with open("tester2.txt") as termsdesk:
    for line in termsdesk:
        term = matcher(keyword_dictionary, line.strip())
I would like my output to look like this:
Food
Person
Food
Vehicle
Food
But what I get is:
Food
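I think this is because my code needs to do a "like" match rather than an exact match of the whole line. I'm not sure how to achieve that. Could it be done with an "if" statement?

Have you tried the following?
with open("tester2.txt") as termsdesk:
    for line in termsdesk:
        # matcher now receives a list of words, so it has to loop over that list
        term = matcher(keyword_dictionary, line.split(" "))
Inverting the mapping makes more sense and is more efficient: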
keyword_dictionary = {'mo-ped': 'Vehicle', 'group': 'People', 'spaghetti': 'Food', 'f-1': 'Vehicle', '6': 'People',
'5': 'People', 'five': 'People', 'gothic': 'Buildings', 'seven': 'People', 'adults': 'People',
'burrito': 'Food', 'martini': 'Signage', 'f one': 'Vehicle', 'ten': 'People', 'instrument': 'DJ',
'dress': 'Person', 'drink': 'Bottle', 'mushroom': 'Food', 'cat': 'Animal', 'glass': 'Glass',
'animal': 'Animal', 'pizza': 'Food', 'formula': 'Vehicle', 'meal': 'Food', 'curry': 'Food',
'3': 'People', 'sign': 'Signage', 'f1': 'Vehicle', 'biscuit': 'Food', 'bottles': 'Signage',
'pasta': 'Food', 'card': 'Signage', 'sculpture': 'Art', '8': 'People', 'apples': 'Food', '9':
'People', 'nacho': 'Food', 'mat': 'Signage', 'bottle': 'Bottle', 'shirt': 'Person', 'halter':
'Person', 'jockey': 'DJ', 'six': 'People', 'beef': 'Food', 'party': 'Party', 'container': 'Glass',
'women': 'People', 'four': 'People', '10': 'People', 'attractive': 'Person', 'mo ped': 'Vehicle',
'blackboard': 'Signage', 'two': 'People', 'f-one': 'Vehicle', '4': 'People', 'car': 'Vehicle',
'cheese': 'Food', 'plate': 'Food', 'food': 'Food', 'smiling': 'Person', 'bodycon': 'Person',
'jelly': 'Food', 'button': 'Person', 'men': 'People', 'people': 'People', 'eight': 'People',
'sushi': 'Food', 'chalkboard': 'Signage', 'cake': 'Food', 'sorbet': 'Food', 'turntable': 'DJ',
'2': 'People', 'skyscraper': 'Buildings', 'nine': 'People', 'meatballs': 'Food', '7': 'People',
'art': 'Art', 'building': 'Buildings', 'sleeveless': 'Person', 'lamb': 'Food', 'disc': 'DJ',
'scooter': 'Vehicle', 'asian': 'Food', 'chicken': 'Food', 'food pyramid': 'Food', 'person':
'Person', 'ad': 'Signage', 'spectacles': 'Person', 'glasses': 'Glass', 'dog': 'Animal',
'logo': 'Signage', 'mixer': 'DJ', 'dj': 'DJ', 'architecture': 'Buildings', 'three': 'People',
'fearns': 'Art', 'taco': 'Food', 'f 1': 'Vehicle', 'steak': 'Food', 'cup': 'Glass', 'man':
'Person', 'woman': 'Person', 'advert': 'Signage', 'candy': 'Food', 'meat': 'Food',
'adult': 'Person', 'moped': 'Vehicle', 'Luck is overrated': 'Slogan'}
with open("test.txt") as termsdesk:
    for line in termsdesk:
        for word in line.split():
            if word in keyword_dictionary:
                print(keyword_dictionary[word])
Output:
Food # sorbet
Person # man
Person # shirt
Food # meal
Vehicle # formula
Food # steak
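Rather than typing the inverted dictionary by hand, it could be generated from the grouped one. The following is only a sketch: `grouped` is a cut-down stand-in for the full dictionary, and `invert` and `categories` are names made up for this example. It also lower-cases each word and strips punctuation, which is an assumption about what a "like" match should mean here (it lets "Pizza" match "pizza"). Note that a keyword listed under two groups, such as 'drink', ends up in only one of them.

```python
import string

# A cut-down copy of the grouped dictionary, for illustration only.
grouped = {
    'Vehicle': ['car', 'formula', 'f1', 'scooter'],
    'Person': ['person', 'shirt', 'man', 'woman'],
    'Food': ['food', 'steak', 'pizza', 'meal', 'sorbet'],
}

def invert(groups):
    """Map each keyword to its category; a later category wins on duplicates."""
    return {word.lower(): category
            for category, words in groups.items()
            for word in words}

def categories(line, lookup):
    """Return the category of every word in `line` that is a known keyword."""
    found = []
    for raw in line.split():
        word = raw.strip(string.punctuation).lower()  # "Pizza," -> "pizza"
        if word in lookup:
            found.append(lookup[word])
    return found

lookup = invert(grouped)
print(categories("Pizza is my favorite meal", lookup))   # ['Food', 'Food']
print(categories("I am a man wearing a shirt", lookup))  # ['Person', 'Person']
```

Returning a list instead of printing leaves it up to the caller to decide how to handle several matches on one line.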
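If you want to stick with your own approach, you should turn the lists into sets, and you need to iterate over each word and then over each key/value pair: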
keyword_dictionary = {
'Animal' : {'animal', 'dog', 'cat'},
'Art' : {'art', 'sculpture', 'fearns'},
'Buildings' : {'building', 'architecture', 'gothic', 'skyscraper'},
'Vehicle' : {'car','formula','f-1','f1','f 1','f one','f-one','moped','mo ped','mo-ped','scooter'},
'Person' : {'person','dress','shirt','woman','man','attractive','adult','smiling','sleeveless','halter','spectacles','button','bodycon'},
'People' : {'people','women','men','attractive','adults','smiling','group','two','three','four','five','six','seven','eight','nine','ten','2','3','4','5','6','7','8','9','10'},
'Food' : {'food','plate','chicken','steak','pizza','pasta','meal','asian','beef','cake','candy','food pyramid','spaghetti','curry','lamb','sushi','meatballs','biscuit','apples','meat','mushroom','jelly', 'sorbet','nacho','burrito','taco','cheese'},
'Glass' : {'glass','drink','container','glasses','cup'},
'Bottle' : {'bottle','drink'},
'Signage' : {'sign','martini','ad','advert','card','bottles','logo','mat','chalkboard','blackboard'},
'Slogan' : {'Luck is overrated'},
'DJ' : {'dj','disc','jockey','mixer','instrument','turntable'},
'Party' : {'party'}
}
def matcher(keywords, searcher):
    for word in searcher:
        for key, words in keywords.items():
            if word in words:
                print(key)
                break
with open("test.txt") as termsdesk:
    for line in termsdesk:
        matcher(keyword_dictionary, line.split())
Output:
Food # sorbet
Person # man
Person # shirt
Food # meal
Vehicle # formula
Food # steak
Your function does not return anything, so term = matcher(..) sets term to None.
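If you do want term to hold something, the function has to return it. A minimal sketch, assuming you want a list of the matched categories back; the small `sample` dictionary exists only for this example:

```python
def matcher(keywords, words):
    """Return the categories matched by `words`, in the order they were found."""
    matches = []
    for word in words:
        for key, members in keywords.items():
            if word in members:
                matches.append(key)
                break  # stop at the first category containing this word
    return matches

# Hypothetical two-category dictionary, just to exercise the function.
sample = {'Food': {'steak', 'sorbet'}, 'Person': {'man', 'shirt'}}
term = matcher(sample, "I am a man wearing a shirt".split())
print(term)  # ['Person', 'Person']
```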
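Comparing the logic of using sets as values with that of inverting the mapping:
Your code iterates over every line and every word, then over every key and value in the dict, and then runs another O(n) loop over the list to look for each word in the list of values.
Using sets as values does exactly the same as your own logic, except that it removes that last O(n) search and replaces it with an O(1) set lookup.
The first code simply loops over each line and each word and does a constant-time check of whether the word is a key in the dict, fetching the value if it is, so it is considerably more efficient.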
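If you want to see the list versus set difference for yourself, something like the following rough timing sketch would do. The sizes and the target value are arbitrary choices for the example, and the numbers you get will depend on your machine, so no expected timings are shown.

```python
import timeit

# 10,000 dummy keywords, stored once as a list and once as a set.
words = [str(n) for n in range(10000)]
as_list = list(words)
as_set = set(words)
target = '9999'  # worst case for the list: the very last element

# `in` on a list scans element by element; `in` on a set is a hash lookup.
list_time = timeit.timeit(lambda: target in as_list, number=1000)
set_time = timeit.timeit(lambda: target in as_set, number=1000)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```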
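If you count any number of matches on a line as a single match, you can check that each set of words is not disjoint from the words on the line, using isdisjoint: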
keyword_dictionary = {
'Animal' : {'animal', 'dog', 'cat'},
'Art' : {'art', 'sculpture', 'fearns'},
'Buildings' : {'building', 'architecture', 'gothic', 'skyscraper'},
'Vehicle' : {'car','formula','f-1','f1','f 1','f one','f-one','moped','mo ped','mo-ped','scooter'},
'Person' : {'person','dress','shirt','woman','man','attractive','adult','smiling','sleeveless','halter','spectacles','button','bodycon'},
'People' : {'people','women','men','attractive','adults','smiling','group','two','three','four','five','six','seven','eight','nine','ten','2','3','4','5','6','7','8','9','10'},
'Food' : {'food','plate','chicken','steak','pizza','pasta','meal','asian','beef','cake','candy','food pyramid','spaghetti','curry','lamb','sushi','meatballs','biscuit','apples','meat','mushroom','jelly', 'sorbet','nacho','burrito','taco','cheese'},
'Glass' : {'glass','drink','container','glasses','cup'},
'Bottle' : {'bottle','drink'},
'Signage' : {'sign','martini','ad','advert','card','bottles','logo','mat','chalkboard','blackboard'},
'Slogan' : {'Luck is overrated'},
'DJ' : {'dj','disc','jockey','mixer','instrument','turntable'},
'Party' : {'party'}
}
def matcher(keywords, searcher):
    for key, words in keywords.items():
        if not words.isdisjoint(searcher):
            print(key)
with open("test.txt") as termsdesk:
    for line in termsdesk:
        matcher(keyword_dictionary, line.split())
Output:
Food
Person
Food
Vehicle
Food
To apply the same logic to the inverted-mapping approach, so that you get only one match per line, just add a break:
with open("test.txt") as termsdesk:
    for line in termsdesk:
        for word in line.split():
            if word in keyword_dictionary:
                print(keyword_dictionary[word])
                break
Output:
Food
Person
Food
Vehicle
Food
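One thing to be aware of: every approach above splits the line on whitespace, so multi-word keywords such as 'food pyramid', 'mo ped' or 'Luck is overrated' can never match a single word. If those matter, one possible workaround is to search the whole line with a regular expression per category. This is only a sketch; `compile_patterns` and `matching_categories` are made-up names, `grouped` is a cut-down stand-in for the real dictionary, and matching case-insensitively on word boundaries is an assumption about what you want.

```python
import re

# Cut-down stand-in for the full dictionary, including multi-word keywords.
grouped = {
    'Food': {'food pyramid', 'steak', 'sorbet'},
    'Vehicle': {'formula', 'f 1', 'mo ped'},
    'Slogan': {'Luck is overrated'},
}

def compile_patterns(groups):
    """Build one case-insensitive regex per category from its keywords."""
    patterns = {}
    for category, words in groups.items():
        # Longest alternatives first, so longer phrases are tried before shorter ones.
        alternatives = sorted((re.escape(w) for w in words), key=len, reverse=True)
        patterns[category] = re.compile(
            r'\b(?:' + '|'.join(alternatives) + r')\b', re.IGNORECASE)
    return patterns

def matching_categories(line, patterns):
    """Return every category with at least one keyword or phrase in `line`."""
    return [category for category, pattern in patterns.items()
            if pattern.search(line)]

patterns = compile_patterns(grouped)
print(matching_categories("luck is overrated, said the man on the mo ped", patterns))
# ['Vehicle', 'Slogan']
```

The word boundaries keep 'steak' from matching inside 'steakhouse'. The trade-off is that each line is scanned once per category instead of doing one dict lookup per word.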
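Comments:
You need to use line.split(" ") to split the line into tokens.
I don't understand. Does line.split(" ") go below line 2 of the third block (after my for loop)?
Your function gets defined on every pass through the while loop. Can you move the function outside the loop? And what is the loop for? You only go through it once.
OK, so I removed the loop. It wasn't needed for this to run.
Possible duplicate.
I did that, and it returned the following error: TypeError: 'bool' object is not iterable
@KrishanVadher, what should happen if a line contains words that match, say, both Person and Food?
I think I would need to write something that makes it report "two keywords found". But since shirt and man are on the same line, it should return Person just once.
But what if shirt, man and sorbet are all on the same line?
In that case it should be classed as Person, because there are more Person words than Food words?
If that is the case, you need to keep a count of each category found and return the most common one, but you also have to decide what happens when there is a tie.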
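The "count each category and return the most common" idea from the last comment could look roughly like this. It is a sketch, not a definitive answer: `dominant_category` is a made-up name, `lookup` is a tiny stand-in for the inverted dictionary, and breaking ties in favour of the category seen first on the line is an arbitrary rule picked for the example.

```python
from collections import Counter

# Tiny stand-in for the inverted keyword dictionary.
lookup = {'man': 'Person', 'shirt': 'Person', 'sorbet': 'Food', 'steak': 'Food'}

def dominant_category(line, lookup):
    """Return the most frequent category on the line, or None if nothing matches.

    Ties go to the category that appears first on the line (an arbitrary
    choice for this sketch).
    """
    counts = Counter(lookup[w] for w in line.lower().split() if w in lookup)
    if not counts:
        return None
    best = max(counts.values())
    # Counter keeps first-insertion order (Python 3.7+), so the first
    # category with the top count is the earliest one seen on the line.
    for category, count in counts.items():
        if count == best:
            return category

print(dominant_category("a man in a shirt eating sorbet", lookup))  # Person
print(dominant_category("sorbet for the man", lookup))              # Food
print(dominant_category("nothing here", lookup))                    # None
```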