Matching gazetteer names against text in Python
I have a list of geographic names (city names) that I use to extract place names from a table of text. How can I match multi-part names in the city list (such as "Santa Barbara", "Los Angeles", etc.) against the text? City names that contain more than one word are not recognized. The code I have tried is:
import csv
import time

#import tab-delimited keywords file
f = open('cities_key.txt','r')
allKeywords = f.read().lower().split('\n')
f.close()
#print(len(allKeywords))

allTexts = []
fullRow = []
with open('adrl_title_desc.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        #the full row for each entry, which will be used to recreate the improved CSV file in a moment
        fullRow.append((row['title'], row['description']))
        #the column we want to parse for our keywords
        row = row['description'].lower()
        allTexts.append(row)
        #print(len(row))

#a flag used to keep track of which row is being printed to the CSV file
counter = 0
#use the current date and time to create a unique output filename
timestr = time.strftime("%Y-%m-%d-(%H-%M-%S)")
filename = 'output-' + str(timestr) + '.csv'

#Open the new output CSV file to append ('a') rows one at a time.
with open(filename, 'a') as csvfile:
    #define the column headers and write them to the new file
    fieldnames = ['title', 'description', 'place']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    #define the output for each row and then print to the output csv file
    writer = csv.writer(csvfile)
    #this is the same as before, for currentRow in fullRow:
    for entry in allTexts:
        matches = 0
        storedMatches = []
        #for each entry:
        #HOW TO RESOLVE MULTI-PART NAMES? e.g. Santa Barbara
        allWords = entry.split(' ')
        for words in allWords:
            #remove punctuation that will interfere with matching
            words = words.replace(',', '')
            words = words.replace('.', '')
            words = words.replace(';', '')
            #if a keyword match is found, store the result.
            if words in allKeywords:
                if words in storedMatches:
                    continue
                else:
                    storedMatches.append(words)
                matches += 1
        #send any matches to a new row of the csv file.
        if matches == 0:
            newRow = fullRow[counter]
        else:
            matchTuple = tuple(storedMatches)
            newRow = fullRow[counter] + matchTuple
        #write the result of each row to the csv file
        writer.writerows([newRow])
        counter += 1
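It is good that you put in the effort before asking for help. Here are my modifications to your code. I kept your original lines and commented them out so you can see what I changed. Regular expressions are the best choice in this case. I use the same loop as you do, but I do not split the description into words. Instead, I search the whole description for each city name with the regular expression module. I also do not use a list for storedMatches: a set guarantees that no duplicates are added, so the separate check for whether a city has already been stored is no longer needed. I used Python 3.7, and I use import re to import the regular expression module.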
import csv
import time
#Raj006 import regular expression module
import re

#import tab-delimited keywords file
f = open('cities_key.txt','r')
#Raj006 Not making the keywords lower. Will match with lower using regex
#allKeywords = f.read().lower().split('\n')
allKeywords = f.read().split('\n')
f.close()
#print(len(allKeywords))

allTexts = []
fullRow = []
with open('adrl_title_desc.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        #the full row for each entry, which will be used to recreate the improved CSV file in a moment
        fullRow.append((row['title'], row['description']))
        #the column we want to parse for our keywords
        #row = row['description'].lower()
        #Raj006 not making description lower as regular expression takes care of case-insensitive search.
        row = row['description']
        allTexts.append(row)
        #print(len(row))

#a flag used to keep track of which row is being printed to the CSV file
counter = 0
#use the current date and time to create a unique output filename
timestr = time.strftime("%Y-%m-%d-(%H-%M-%S)")
filename = 'output-' + str(timestr) + '.csv'

#Open the new output CSV file to append ('a') rows one at a time.
with open(filename, 'a') as csvfile:
    #define the column headers and write them to the new file
    fieldnames = ['title', 'description', 'place']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    #define the output for each row and then print to the output csv file
    writer = csv.writer(csvfile)
    #this is the same as before, for currentRow in fullRow:
    for entry in allTexts:
        #matches = 0
        #Raj006 Changed this to set to make sure the list is unique (which is basically the definition of the set)
        storedMatches = set()
        #Raj006 looping through all cities and checking if the city name exists in the description.
        #Raj006 re.search looks for the lookup word in the entire string (re.search(lookupword,string)).
        for eachcity in allKeywords:
            if re.search('\\b'+eachcity+'\\b',entry,re.IGNORECASE):
                #Adding the matched city to the set
                storedMatches.add(eachcity)
        #for each entry:
        #HOW TO RESOLVE MULTI-PART NAMES? e.g. Santa Barbara
        #allWords = entry.split(' ')
        #for words in allWords:
            #remove punctuation that will interfere with matching
            #words = words.replace(',', '')
            #words = words.replace('.', '')
            #words = words.replace(';', '')
            #if a keyword match is found, store the result.
            #if words in allKeywords:
                #if words in storedMatches:
                    #continue
                #else:
                    #storedMatches.append(words)
                #matches += 1
        #send any matches to a new row of the csv file.
        #if matches == 0:
        #Raj006 Just using the length of the set to determine if any matches found. Reducing one more unnecessary check.
        if len(storedMatches)==0:
            newRow = fullRow[counter]
        else:
            matchTuple = tuple(storedMatches)
            newRow = fullRow[counter] + matchTuple
        #write the result of each row to the csv file
        writer.writerows([newRow])
        counter += 1
Update: added re.IGNORECASE to the re.search call.
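The matching step can be tried in isolation with a minimal sketch like the one below. The function name find_cities and the sample strings are hypothetical. The call to re.escape is an added assumption (the answer's code concatenates the city name directly); it is there so that a name containing regex metacharacters, such as "St. Louis", is treated literally.

```python
import re

def find_cities(description, cities):
    """Return the set of city names that occur as whole words in description."""
    found = set()
    for city in cities:
        city = city.strip()
        if not city:
            # skip blank lines from the keyword file
            continue
        # \b on both sides makes "Haiti" match "Haiti" but not "Haitian"
        pattern = r"\b" + re.escape(city) + r"\b"
        if re.search(pattern, description, re.IGNORECASE):
            found.add(city)
    return found

sample = "A road trip from santa barbara to Los Angeles."
print(sorted(find_cities(sample, ["Santa Barbara", "Los Angeles", "San Jose"])))
# prints ['Los Angeles', 'Santa Barbara']
```

Because the whole description is searched, multi-word names need no special handling; the space is simply part of the pattern.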
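I then reworked my first version to remove the unnecessary loops and the confusing variable names. I do not have your source files, so I could not test it against them. If I find any problems, I will update it later.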
import csv
import time
import re

allCities = open('cities_key.txt','r').readlines()
timestr = time.strftime("%Y-%m-%d-(%H-%M-%S)")
with open('adrl_title_desc.csv') as descriptions,open('output-' + str(timestr) + '.csv', 'w', newline='') as output:
    descriptions_reader = csv.DictReader(descriptions)
    fieldnames = ['title', 'description', 'cities']
    output_writer = csv.DictWriter(output, delimiter='|', fieldnames=fieldnames)
    output_writer.writeheader()
    for eachRow in descriptions_reader:
        title = eachRow['title']
        description = eachRow['description']
        citiesFound = set()
        for eachcity in allCities:
            eachcity=eachcity.strip()
            if re.search('\\b'+eachcity+'\\b',description,re.IGNORECASE):
                citiesFound.add(eachcity)
        if len(citiesFound)>0:
            output_writer.writerow({'title': title, 'description': description, 'cities': ", ".join(citiesFound)})
This code sets the CSV delimiter to | instead of , because I use , to separate the names in the cities column.
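As a side note, the csv module quotes any field that contains the delimiter by default (csv.QUOTE_MINIMAL), so a comma-delimited file should also round-trip a comma-separated cities field. The sketch below uses an in-memory buffer rather than a real file, and the helper name round_trip is hypothetical.

```python
import csv
import io

def round_trip(row, delimiter=","):
    """Write one row with csv.writer, then parse it back with csv.reader."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=delimiter).writerow(row)
    raw_text = buffer.getvalue()
    parsed = next(csv.reader(io.StringIO(raw_text), delimiter=delimiter))
    return raw_text, parsed

raw_text, parsed = round_trip(["title5", "some description", "San Jacinto, San Jose"])
print(raw_text)  # the cities field is wrapped in double quotes
print(parsed)    # three fields again after parsing
```

Using | remains a reasonable choice when the file will be read by tools that do not honor CSV quoting.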
Test files:
cities_key.txt
San Francisco
San Gabriel
San Jacinto
San Jose
San Juan Capistrano
Haiti
San Mateo
adrl_title_desc.csv
key,title,description
1,title1,"some description here with San Francisco"
2,title2,"some, more description here with Haitian info"
3,title3,"some city not a wordSan Mateo"
4,title4,"some city San Juan Capistrano just normal"
5,title5,"multiple cities in one San Jacinto,San Jose and San Gabriel end"
Code output:
title|description|cities
title1|some description here with San Francisco|San Francisco
title4|some city San Juan Capistrano just normal|San Juan Capistrano
title5|multiple cities in one San Jacinto,San Jose and San Gabriel end|San Jacinto, San Jose, San Gabriel
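One caveat about the output above: the cities are collected in a set, and set iteration order is not guaranteed, so the order inside the cities column may differ from run to run. A possible variation, sketched under the assumption that order of appearance in the description is what is wanted, compiles all names into a single pattern and walks the matches with finditer. The names build_city_pattern and cities_in_order are hypothetical, and re.escape is an addition not present in the answer's code.

```python
import re

def build_city_pattern(cities):
    """Compile one case-insensitive, whole-word pattern for all city names."""
    cleaned = [c.strip() for c in cities if c.strip()]
    if not cleaned:
        raise ValueError("no city names given")
    # longest names first, so a longer name wins over a shorter prefix
    cleaned.sort(key=len, reverse=True)
    alternation = "|".join(re.escape(c) for c in cleaned)
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    # map lower-cased text back to the spelling used in the keyword file
    canonical = {c.lower(): c for c in cleaned}
    return pattern, canonical

def cities_in_order(pattern, canonical, description):
    """Return matched cities without duplicates, in order of first appearance."""
    seen = {}
    for match in pattern.finditer(description):
        name = canonical[match.group(0).lower()]
        seen.setdefault(name, None)
    return list(seen)

cities = ["San Francisco\n", "San Gabriel\n", "San Jacinto\n", "San Jose\n",
          "San Juan Capistrano\n", "Haiti\n", "San Mateo\n"]
pattern, canonical = build_city_pattern(cities)
print(cities_in_order(pattern, canonical,
                      "multiple cities in one San Jacinto,San Jose and San Gabriel end"))
# prints ['San Jacinto', 'San Jose', 'San Gabriel']
```

Compiling once also avoids running one re.search per city for every row, which may matter for a long keyword list.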
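Now it should work without problems on Python 3.x. I fixed the error in '\\b'+eachcity+'\\b' (a + sign was missing). You could not find any matches because readlines() keeps the line endings; I use strip() to remove them. I also had to pass newline='' to open() because the csv writer was creating an extra blank line after every row. You can see in my example that no city is found for keys 2 and 3, because the city names there are not separated as words from the rest of the text. I updated my search code so that it only finds whole words rather than substrings. See the documentation on the Python site.
The updated code gives an error at '\\b'+eachcity'\\b'.
It runs fine on my machine with two backslashes. What is the actual error message?
OK, I have run this code, and the output file only shows "title | description | cities" in the first cell.
Updated the code and tested it. Test files and output are provided.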
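The readlines() pitfall mentioned in the comments can be reproduced with a small sketch. It uses io.StringIO as a stand-in for cities_key.txt, and the helper name load_cities is hypothetical.

```python
import io
import re

def load_cities(handle):
    """Return non-empty city names with surrounding whitespace removed."""
    return [line.strip() for line in handle if line.strip()]

fake_file = io.StringIO("San Francisco\nSan Jose\n\n")
raw_lines = fake_file.readlines()   # each entry still ends with "\n"
fake_file.seek(0)
clean = load_cities(fake_file)      # line endings and blank lines removed

text = "a visit to San Jose today"
# The raw entry "San Jose\n" requires a newline in the text, so it finds nothing.
raw_hit = re.search(r"\b" + raw_lines[1] + r"\b", text)
clean_hit = re.search(r"\b" + clean[1] + r"\b", text)
print(raw_lines)
print(clean)
print(raw_hit is None, clean_hit is not None)
```

Dropping blank entries also matters because an empty keyword would turn the pattern into just two word boundaries, which is not a meaningful city match.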