Extracting city names from text using Python


python validation


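I have a dataset where one column has the heading "What is your location and time zone?"

This means we have entries like

Denmark, CET

Location is Devon, England, GMT time zone

Australia. Australian Eastern Standard Time. +10h UTC.

and even

My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone.

For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT)

Is there any way of extracting the city, country and time zone from these entries?

I was thinking of creating an array (from an open-source dataset) with all the country names (including short forms) and the city names and time zones. Then, if any word in the dataset matches a city, country, time zone or short form, it would be filled into a new column in the same dataset and counted.

Is this practical?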
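
The lookup idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the three small name sets are hypothetical stand-ins for an open dataset, and `build_pattern`, `extract` and `count_matches` are made-up helper names, not part of any library.

```python
import re
from collections import Counter

# Tiny hypothetical gazetteers; a real run would load these from an open dataset.
GAZETTEERS = {
    "country": {"Denmark", "England", "Australia", "South Korea",
                "United Kingdom", "Norway", "Israel", "United States"},
    "city": {"Eugene", "Seoul", "London", "Boston"},
    "timezone": {"CET", "GMT", "UTC", "EDT"},
}


def build_pattern(names):
    # Try longer names first so multi-word names are not cut short.
    alternatives = sorted((re.escape(n) for n in names), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")


PATTERNS = {kind: build_pattern(names) for kind, names in GAZETTEERS.items()}


def extract(text):
    """Return the known names found in one free-text entry, grouped by kind."""
    return {kind: pattern.findall(text) for kind, pattern in PATTERNS.items()}


def count_matches(entries):
    """Count how often each known name appears across all entries."""
    counts = Counter()
    for entry in entries:
        for names in extract(entry).values():
            counts.update(names)
    return counts


if __name__ == "__main__":
    print(extract("For the entire May I will be in London, United Kingdom (GMT+1)."))
```

Matching is case-sensitive and limited to whole words, so it only finds names that are spelled exactly as in the lists; misspellings and names missing from the dataset are skipped silently.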

============ UPDATE BASED ON NLTK ANSWER ============

Running the same code as alecxe, I get:

Traceback (most recent call last):
  File "E:\SBTF\ntlk_test.py", line 19, in <module>
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\__init__.py", line 110, in pos_tag
    tagger = PerceptronTagger()
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\perceptron.py", line 141, in __init__
    self.load(AP_MODEL_LOC)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\tag\perceptron.py", line 209, in load
    self.model.weights, self.tagdict, self.classes = load(loc)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\nltk\data.py", line 924, in _open
    return urlopen(resource_url)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 454, in _open
    'unknown_open', req)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Python27\ArcGIS10.4\lib\urllib2.py", line 1265, in unknown_open
    raise URLError('unknown url type: %s' % type)
URLError: <urlopen error unknown url type: c>
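
The last line of the traceback suggests that a Windows path starting with a drive letter reached `urlopen`, which then read `C:` as a URL scheme. The snippet below is a minimal illustration of that parsing behaviour only, written with Python 3 standard-library names (`urllib.parse` rather than the Python 2.7 modules in the traceback). The resource path is a hypothetical example, and nothing here is a statement about how any particular NLTK version resolves its data files.

```python
from pathlib import PureWindowsPath
from urllib.parse import urlparse

# Hypothetical location of a data file on a Windows machine.
resource = r"C:\nltk_data\taggers\model.pickle"

# A URL parser treats everything before the first colon as the scheme,
# so the drive letter is mistaken for a URL scheme.
scheme = urlparse(resource).scheme
print(scheme)

# Converting the path into a proper file: URL removes the ambiguity.
file_url = PureWindowsPath(resource).as_uri()
print(file_url)
```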


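I would use what Natural Language Processing and nltk have to offer to extract entities.

The example tokenizes each line of a file, splits it into chunks, and recursively looks for the NE (named entity) label in every chunk.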
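
The answer's code is not included above, so here is a sketch of the steps it describes. It is a reconstruction, not necessarily the answerer's exact code: the `tagged_sentences` line is taken from the traceback in the question, while the function names `extract_entity_names` and `entities_per_line` are assumptions. It relies on the NLTK calls `sent_tokenize`, `word_tokenize`, `pos_tag` and `ne_chunk_sents`, and assumes the matching NLTK data packages are already installed.

```python
def extract_entity_names(tree):
    """Recursively collect the text of every subtree labelled 'NE'."""
    names = []
    # Leaves are (word, tag) tuples and have no label() method.
    if hasattr(tree, "label"):
        if tree.label() == "NE":
            names.append(" ".join(child[0] for child in tree))
        else:
            for child in tree:
                names.extend(extract_entity_names(child))
    return names


def entities_per_line(path):
    """Yield the list of named entities found on each line of a text file."""
    import nltk  # imported here so the helper above works without NLTK

    with open(path) as f:
        for line in f:
            sentences = nltk.sent_tokenize(line)
            tokenized_sentences = [nltk.word_tokenize(s) for s in sentences]
            tagged_sentences = [nltk.pos_tag(s) for s in tokenized_sentences]
            # binary=True marks every entity as 'NE' rather than by entity type.
            chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)
            entities = []
            for tree in chunked_sentences:
                entities.extend(extract_entity_names(tree))
            yield entities
```

Calling `for entities in entities_per_line('sample.txt'): print(entities)` would then print one list per input line.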

For a sample.txt containing:

Denmark, CET
Location is Devon, England, GMT time zone
Australia. Australian Eastern Standard Time. +10h UTC.
My location is Eugene, Oregon for most of the year or in Seoul, South Korea depending on school holidays. My primary time zone is the Pacific time zone.
For the entire May I will be in London, United Kingdom (GMT+1). For the entire June I will be in either Norway (GMT+2) or Israel (GMT+3) with limited internet access. For the entire July and August I will be in London, United Kingdom (GMT+1). And then from September, 2015, I will be in Boston, United States (EDT)

it prints:

['Denmark', 'CET']
['Location', 'Devon', 'England', 'GMT']
['Australia', 'Australian Eastern Standard Time']
['Eugene', 'Oregon', 'Seoul', 'South Korea', 'Pacific']
['London', 'United Kingdom', 'Norway', 'Israel', 'London', 'United Kingdom', 'Boston', 'United States', 'EDT']

The output is not ideal, but it might be a good start for you.


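@Racialz nltk is often amazing! I am far from being an expert in NLP, but I have tried to add more explanation and links for further reading. Thanks for asking for details!

Brilliant. I didn't know about NLTK. I will experiment with this and then (hopefully) accept the answer :-)

@alecxe I tried running the code exactly as you had it after installing the library and the db. I get {{raise URLError('unknown url type: %s' % type)}} in urllib2.py, but I have no idea why this is even being called! Any ideas on how to get your code working? The traceback is in my edited question.

@GeorgeC looks like this is your problem: . Check it out.

Doesn't work for an address like 10906 woodley ava granada hills CA.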