Python regex to capture one-word and two-word proper nouns


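I've come up with the following. I've narrowed the problem down to not being able to capture both one-word and two-word proper nouns.

(1) It would be great if I could set a condition so that, when choosing between two captures, it defaults to the longer one,

and

(2) if I could tell the regex to consider this only when the string starts with a preposition, such as On or at. I was playing around with something like this, but it isn't working: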

(^On|^at)([A-Z][a-z]{3,15}$|[A-Z][a-z]{3,15}\s{0,1}[A-Z][a-z]{0,5})

How would I do 1 and 2?

My current regex:

r'([A-Z][a-z]{3,15}$|[A-Z][a-z]{3,15}\s{0,1}[A-Z][a-z]{0,15})'

I want to capture Ashoka, Shift Series, Compass Partners, and Kenneth Cole:

#'On its 25th anniversary, Ashoka',

#'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole',

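In natural language processing, what you are trying to do is called "named entity recognition". If you really want an approach that finds proper nouns, you may want to step up to named entity recognition. Thankfully, the nltk library has some easy-to-use functions for this: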

import nltk
s2 = 'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole'
tokens2 = nltk.word_tokenize(s2)
tags = nltk.pos_tag(tokens2)
res = nltk.ne_chunk(tags)

Result:

res.productions()
Out[8]: 
[S -> ('at', 'IN') ('the', 'DT') ORGANIZATION ('national', 'JJ') ('conference', 'NN') (',', ',') ORGANIZATION ('and', 'CC') ('fashion', 'NN') ('designer', 'NN') PERSON,
 ORGANIZATION -> ('Shift', 'NNP') ('Series', 'NNP'),
 ORGANIZATION -> ('Compass', 'NNP') ('Partners', 'NNPS'),
 PERSON -> ('Kenneth', 'NNP') ('Cole', 'NNP')]

This isn't entirely correct, but it matches most of what you're looking for; the exception is 'On', which it also picks up:
import re

# Sample text from the question
text = """
#'On its 25th anniversary, Ashoka',

#'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole',
"""
# One capitalized word, optionally followed by a second capitalized word
proper_noun_regex = r'([A-Z]{1}[a-z]{1,}(\s[A-Z]{1}[a-z]{1,})?)'
p = re.compile(proper_noun_regex)
matches = p.findall(text)

print(matches)

Output:
[('On', ''), ('Ashoka', ''), ('Shift Series', ' Series'), ('Compass Partners', ' Partners'), ('Kenneth Cole', ' Cole')]

Then perhaps you could implement a filter to go through this list:
def filter_false_positive(unfiltered_matches):
    filtered_matches = []
    black_list = ["an","on","in","foo","bar"] #etc
    for match in unfiltered_matches:
        if match.lower() not in black_list:
            filtered_matches.append(match)
    return filtered_matches

Or, because Python is cool:
def filter_false_positive(unfiltered_matches):
    black_list = ["an","on","in","foo","bar"] #etc
    # Iterate over the function's argument, not an undefined name
    return [match for match in unfiltered_matches if match.lower() not in black_list]

You can use it like this:
# CONTINUED FROM THE CODE ABOVE
matches = [i[0] for i in matches]  # keep only the full match from each tuple
matches = filter_false_positive(matches)
print(matches)

Giving the final output:
['Ashoka', 'Shift Series', 'Compass Partners', 'Kenneth Cole']
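
As a side note on the first part of the question: findall returns tuples above only because the pattern has two capturing groups. The sketch below is a variant, not code from the thread, that uses a non-capturing group (?:...) so findall returns plain strings. Because the optional second word is greedy, the two-word form is preferred whenever it is available. The names PROPER_NOUN, BLACK_LIST, and find_proper_nouns are made up for illustration, and the stop list is an assumption to be extended.

```python
import re

# Non-capturing group (?:...) so findall returns plain strings, not tuples.
# The optional second word is greedy, so the two-word form wins when present.
PROPER_NOUN = re.compile(r'[A-Z][a-z]+(?:\s[A-Z][a-z]+)?')

# Hypothetical stop list of capitalized non-names; extend as needed.
BLACK_LIST = {"an", "on", "in", "at", "the"}

def find_proper_nouns(text):
    """Return capitalized one- or two-word matches that are not in BLACK_LIST."""
    return [m for m in PROPER_NOUN.findall(text) if m.lower() not in BLACK_LIST]

samples = [
    'On its 25th anniversary, Ashoka',
    'at the Shift Series national conference, Compass Partners and fashion designer Kenneth Cole',
]
for s in samples:
    print(find_proper_nouns(s))
# ['Ashoka']
# ['Shift Series', 'Compass Partners', 'Kenneth Cole']
```

A set is used for the stop list so that membership checks stay cheap as the list grows.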

Determining whether a word is capitalized because it starts a sentence or because it is a proper noun is not that trivial:
'Kenneth Cole is a brand name.' v.s. 'Can I eat something now?' v.s. 'An English man had tea'

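Cases like these are rather difficult, so without some other criterion for proper nouns (a blacklist, a database, etc.) this won't be easy. regex is great, but I don't think it can interpret English at a grammatical level in any trivial way.

Nevertheless, good luck!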
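
The second part of the question, considering a string only when it starts with a preposition, is a purely positional check and is within reach of a regex. The pattern in the question fails because (^On|^at) is followed directly by [A-Z], so it requires the proper noun to begin immediately after the preposition. A minimal sketch, assuming the prepositions of interest are just On and at, is to test the start of the string first and then search the remainder. The function name and the preposition list are hypothetical.

```python
import re

# Hypothetical list of leading prepositions; the question mentions On and at.
STARTS_WITH_PREPOSITION = re.compile(r'(?:On|at)\b')
PROPER_NOUN = re.compile(r'[A-Z][a-z]+(?:\s[A-Z][a-z]+)?')

def proper_nouns_after_preposition(text):
    """Return proper-noun candidates only if text starts with a listed preposition."""
    gate = STARTS_WITH_PREPOSITION.match(text)  # match() anchors at position 0
    if gate is None:
        return []
    # Search only after the preposition, so a leading 'On' is not itself returned.
    return PROPER_NOUN.findall(text, gate.end())

print(proper_nouns_after_preposition('On its 25th anniversary, Ashoka'))
# ['Ashoka']
print(proper_nouns_after_preposition('Kenneth Cole is a brand name.'))
# []
```

Splitting the gate from the search keeps each pattern simple, and it does nothing to solve the sentence-initial capitalization problem described above.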

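I would use an NLP tool; the most popular one for Python seems to be nltk. Regular expressions really aren't the right approach here. There is an example on the front page of the nltk website, which the previous answer links to; here it is, copied and pasted: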

import nltk
sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)
print(tokens)
# ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
#  'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
tagged = nltk.pos_tag(tokens)
entities = nltk.chunk.ne_chunk(tagged)

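entities now contains the words with their tags.

Ah, I see. This is very useful; I'll consider doing this. How do you grab the organizations and persons? I'm not clear on how to work with this tree format.

I figured it out:

tree = []
# t.label() is the NLTK 3 spelling; older releases used t.node
for subtree in res.subtrees(filter=lambda t: t.label() == 'ORGANIZATION'):
    subtree_l = []
    for leaf in subtree.leaves():
        subtree_l.append(leaf[0])  # each leaf is a (word, tag) pair
    sub = ' '.join(subtree_l)
    tree.append(sub)
print(tree)

Do you know how to capture proper nouns that contain lowercase prepositions? For example: the David Eccles School of Business at the University of Utah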
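
For that last question, one possible regex sketch (not from the thread) is to allow a short, fixed list of lowercase connector words between capitalized words. The connector list here is an assumption, and all names in the code are made up for illustration.

```python
import re

# Hypothetical set of lowercase connector words allowed inside a name.
CONNECTORS = r'(?:of|at|the|for)'
CAP_WORD = r'[A-Z][a-z]+'

# A capitalized word, then any number of further capitalized words,
# each optionally preceded by one or more connector words.
NAME_WITH_CONNECTORS = re.compile(
    rf'{CAP_WORD}(?:\s(?:{CONNECTORS}\s)*{CAP_WORD})*'
)

sentence = 'a talk at the David Eccles School of Business at the University of Utah'
names = NAME_WITH_CONNECTORS.findall(sentence)
print(names)
# ['David Eccles School of Business at the University of Utah']
```

The trade-off is that every connector in the list also glues together names that are really separate, so the list should stay as small as the data allows.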