Python: identifying two-word cities ('New York') in text [python, nltk]


For this code, I am given a text file that mentions several cities. I want to identify the cities mentioned and print their state and country.

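Requirements: If a mentioned city exists in two or more countries, I ask the user which one they are talking about. Also, if there is a slight typo, I ask the user whether they meant a particular city. For example, if they type 'Dalls' instead of 'Dallas', I need to offer the user an option such as 'Did you mean Dallas instead of Dalls?'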
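As a point of comparison for the typo requirement, here is a minimal sketch that uses the standard library's `difflib.get_close_matches` instead of hand-written character comparisons. The `KNOWN_CITIES` list, the `suggest_city` helper, and the `cutoff` value of 0.8 are illustrative assumptions, not part of the original program.

```python
import difflib

# Hypothetical sample of known city names; in the question these would
# come from the first column of world-cities.csv.
KNOWN_CITIES = ["Dallas", "Sydney", "Tokyo", "Chennai", "New York"]

def suggest_city(word, cities=KNOWN_CITIES, cutoff=0.8):
    """Return the closest known city name, or None if nothing is close enough."""
    # Compare case-insensitively but return the canonical spelling.
    lowered = {c.lower(): c for c in cities}
    matches = difflib.get_close_matches(word.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

for typed in ["Dalls", "Sdney", "and"]:
    guess = suggest_city(typed)
    if guess and guess != typed:
        print(f"Did you mean '{guess}' instead of '{typed}'?")
```

With this cutoff, an ordinary word such as 'and' gets no suggestion, so it would not need to be filtered out beforehand. The cutoff is a tuning knob and would likely need adjusting against the full city list.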

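Problem: So far I have managed to meet these requirements, but my program fails to identify two-word cities such as 'New York' or 'San Francisco'. This is because I am reading the text one word at a time. If you have any suggestions for a better way to read the text, please let me know.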
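One way to read the text without splitting multi-word names apart is to test short runs of consecutive words, longest first. The sketch below assumes a small hypothetical `CITY_NAMES` set and a simple letters-only tokenizer; `find_cities` and `max_words` are illustrative names, not part of the original code.

```python
import re

# Hypothetical city names; the real set would be built from world-cities.csv.
CITY_NAMES = {"New York", "York", "San Francisco", "Tokyo", "Chennai"}

def find_cities(text, city_names=CITY_NAMES, max_words=3):
    """Scan the text left to right, preferring the longest phrase that is a city."""
    tokens = re.findall(r"[A-Za-z']+", text)  # crude tokenizer: letters and apostrophes
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for size in range(min(max_words, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + size])
            if phrase in city_names:
                found.append(phrase)
                i += size  # skip the words consumed by this match
                break
        else:
            i += 1  # no city starts at this word
    return found

print(find_cities("I flew from New York to San Francisco, then on to Tokyo."))
```

Because the two-word phrase is tried before the single word, 'New York' is reported once and 'York' is not reported separately. The tokenizer here drops hyphens and digits, so names containing them would need a different pattern.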

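P.S. (I know the code could be simplified with more advanced Python techniques, but my knowledge of Python is not at that level yet. Still, please tell me what else I could do to simplify my program, because it feels more complicated than it needs to be. Thanks!)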

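File description: I am using three files named 'world-cities.csv', 'TEXT.txt', and 'usa.txt'. 'world-cities.csv' contains many of the world's cities. 'TEXT.txt' contains the sentences I will analyze for cities. 'usa.txt' contains common English words; I compare it against 'TEXT.txt' to remove common words. I had a problem where words like 'and' showed up as typos, so this is a hacky way to get rid of them.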
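A detail worth checking in the common-word filter: if the contents of 'usa.txt' are kept as one long string, `x not in words` is a substring test rather than a whole-word test. The sketch below illustrates the difference with a hypothetical four-word stand-in for the file; `raw_words`, `common_words`, and `is_common` are illustrative names.

```python
# Hypothetical stand-in for the contents of usa.txt (one word per line).
raw_words = "and\nthen\ntoday\nsandal\n"

# Splitting into a set gives exact, whole-word membership tests.
common_words = set(raw_words.split())

def is_common(word, vocabulary=common_words):
    """True if the word, lowercased, is exactly one of the common words."""
    return word.lower() in vocabulary

print("sand" in raw_words)   # True: substring of "sandal", probably not intended
print(is_common("sand"))     # False: not a whole word in the list
print(is_common("And"))      # True
```

A set lookup also avoids rescanning the whole word list for every word in the text.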

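Text file: Today I went to Hyderabad, then I went to Chennai and New York in the USA. Now I am going to Tokyo, and tomorrow I go back to Rochester. Dalls and Sdney are my next destinations.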

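I have tried Geotext, and it works, but it causes a problem with cities such as 'New York'. The part of my program that does not use Geotext reads it as 'York', and when I add Geotext it is read as 'New York'. As a result, my city list contains both 'York' and 'New York'. I have been told that I could use the NLTK package, but I am still looking for an effective approach.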


import pandas as pd
import re


#imported dataset
dataset = pd.read_csv('world-cities.csv')

#assigned certain parts of data set to variable
data = dataset.iloc[:,:-1]
city = dataset.iloc[:,0]
state = dataset.iloc[:,2]
country = dataset.iloc[:,1]


#opened and imported textfile
txtfile = open('TEXT.txt','r')
txtfile = txtfile.read()
words = open('usa.txt','r')
words = words.read()


#getting rid of punctuation
altered = re.sub("[.,:]",'',txtfile)
templist = [] #holds the cities(state and country) info of the places 
final = [] #final array
all_cities = [] #used to check for repeating cities
repeat = {} #contains only city names
repeatinfo = [] #contains all info about repeating cities
stupid = 0
close = 0
typo = []
typodict = {}
typecount = 0
finaltypo = []

#finding out where the talked about cities are 
for x in altered.split():
    count = 0
    zcount = 0
    for y in city:
        if x == y:
            zcount +=1
            templist.append([city[count], state[count], country[count]])
            all_cities.append(city[count])
        count+=1
    if zcount > 1:
        repeat[x] = zcount

#put in all assumed Typos
for x in altered.split():
    if x not in all_cities:
        x = x.lower()
        if x not in words:
            typo.append(x)


#narrow down options of typos
many = 0
for a in typo:
    for b in city:
        b = b.lower()
        if len(a) >= (len(b)-1) and len(a) <= (len(b)+1):
            if a[0] == b[0] or a[-1::] == b[-1::]:
                if a[0:3] == b[0:3] or a[-3::] == b[-3::]:
                    #print(f'{a} vs {b}')
                    many = 0
                    for x in a:
                        if x in b:
                            many+=1
                        if many >= (len(b)-1) and many <= (len(b)+1):
                            typodict[b] = a

#let user choose if it is a typo or not
print('TYPO Checking')
for a in typo:
    p =0
    q = 0
    while(p < len(typo) and q == 0):
        for x,y in typodict.items():
            go2 = True
            while(go2 and q==0):
                if y == a:
                    user2 = input(f" Did you mean to type '{x}' instead of '{y}'? Enter 'y' or 'n': ")
                    user2 = user2.lower()
                    if user2 == 'y':
                        go2 = False
                        finaltypo.append(x)
                        p+=1
                        q+=1
                    elif user2 == 'n':
                        go2 = False
                    else: 
                        print('You have entered a invalid value')
                else:
                    go2 = False



#adding typoed cities into list
for x in finaltypo:
    x = x.capitalize()
    count = 0
    zcount = 0
    for y in city:
        if x == y:
            zcount +=1
            templist.append([city[count], state[count], country[count]])
            all_cities.append(city[count])
        count+=1
    if zcount > 1:
        repeat[x] = zcount

#finding out what cities repeat and adding all their information to repeatinfo
for x in repeat:
    rcount = 0
    for y in city:
        if x == y: 
            repeatinfo.append([city[rcount], state[rcount], country[rcount]])
        rcount +=1

#determining which country they mean when they mentioned repeating cities
print('Which City?')
for x,y in repeat.items():
    i = 0
    e = 0
    while(i < y and e == 0):
        go = True
        for c in repeatinfo: 
            go = True
            while(go and e == 0):
                if x == c[0]:
                    user = input(f'Do you mean {x} in {c[1]},{c[2]} enter y or n: ')
                    user = user.lower()
                    i +=1
                    if user == 'y':
                        final.append(f' {x} in {c[1]}, {c[2]}')
                        go = False
                        i +=1
                        e +=1
                    elif user == 'n':
                        go = False
                        i+=1
                    else:
                        print('You have entered a invalid input')
                else: 
                    go = False


#removing repeating cities from templist
for y in list(templist):
    if y[0] in list(repeat):
        templist.remove(y)

#adding remaining elements of templist to final list
for y in list(templist):
    final.append(f' {y[0]} in {y[1]}, {y[2]}')

#printing final output
print('\n You have entered the following cities:')               
for x in final:
    print(x)
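On the request for simplification: the bookkeeping in `repeat` and `repeatinfo` can be expressed as a single dictionary that maps each city name to every place it occurs. This is only a sketch using the standard library; the sample rows, the header names, and `build_city_index` are hypothetical, and the column order (city, country, state) follows what the code above assumes. Like the original, it treats any name that occurs more than once as ambiguous.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the column order the question's code assumes
# (city, country, state); the real data would come from world-cities.csv.
SAMPLE_CSV = """name,country,subcountry
Rochester,United States,New York
Rochester,United States,Minnesota
Rochester,United Kingdom,England
Tokyo,Japan,Tokyo
"""

def build_city_index(csv_file):
    """Map each city name to the list of (state, country) pairs it appears with."""
    index = defaultdict(list)
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    for row in reader:
        city, country, state = row[0], row[1], row[2]
        index[city].append((state, country))
    return index

index = build_city_index(io.StringIO(SAMPLE_CSV))
for name in ["Tokyo", "Rochester"]:
    places = index.get(name, [])
    if len(places) > 1:
        print(f"{name} is ambiguous; candidates: {places}")
    elif places:
        state, country = places[0]
        print(f"{name} in {state}, {country}")
```

With an index like this, the 'Which City?' prompt only needs to loop over `index[name]` for the names that have more than one entry.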

My guess is that 'New York' appears in your city list.

I think you could do it like this:

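#finding out where the talked about cities are
#(each city name is tested against the whole text, so 'New York' is matched as one name)
zcount = 0 #must be initialised before the loop; here it counts all matching rows
for count,y in enumerate(city):
    if y in altered:
        zcount +=1
        templist.append([city[count], state[count], country[count]])
        all_cities.append(city[count])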

I hope this helps you get the basic idea. If you need more help, let me know.
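One caveat with the substring test `y in altered`: it also matches a short name that sits inside a longer one, so 'York' is still found inside 'New York'. A possible refinement, sketched here with a hypothetical `cities` list and helper name, is a regular expression with word boundaries whose alternatives are ordered longest first.

```python
import re

# Hypothetical city names standing in for the `city` column.
cities = ["York", "New York", "Roche", "Rochester", "Tokyo"]

def mentioned_cities(text, names=cities):
    """Return whole-word city mentions, preferring longer names over shorter ones."""
    # Longest names first so "New York" wins over "York" at the same position.
    ordered = sorted(set(names), key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b"
    return re.findall(pattern, text)

sample = "I went to New York and tomorrow I am back in Rochester"
print([c for c in cities if c in sample])  # substring test
print(mentioned_cities(sample))            # whole-word, longest-match test
```

The first print includes 'York' and 'Roche' because they are substrings of the text, while the second reports only 'New York' and 'Rochester'. With a very large city list, a single alternation like this may be slow, so it is worth timing on the real data.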

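Hi @Susanth Kakarla. If any answer has solved your problem, please consider accepting it by clicking the check mark. This indicates to the wider community that you have found a solution, and it gives some reputation to both the answerer and yourself. There is no obligation to do this.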
