Python: Combining nltk.RegexpParser grammars

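As the next step in learning NLP, I am trying to implement a simple heuristic that improves on the results of plain n-grams.

According to the Stanford collocations PDF linked below, passing "candidate phrases" through a part-of-speech filter that only lets through patterns likely to be "phrases" gives better results than simply taking the most frequent bigrams. Source: Collocations, pages 143-144:

The table on page 144 lists 7 tag patterns. In order, the NLTK POS tag equivalents are: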

JJ NN

NN NN

JJ JJ NN

JJ NN NN

NN JJ NN

NN NN NN

NN IN NN
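
As an illustration of the tag-pattern filter described above, here is a minimal sketch. It is not the code from the PDF: the helper name `candidate_phrases` and the hand-tagged sentence are made up for this example, and it assumes Penn Treebank tags of the kind `nltk.pos_tag` returns, without normalizing plural or proper-noun tags.

```python
# Allowed tag sequences, taken from the seven patterns listed above.
ALLOWED_PATTERNS = {
    ("JJ", "NN"),
    ("NN", "NN"),
    ("JJ", "JJ", "NN"),
    ("JJ", "NN", "NN"),
    ("NN", "JJ", "NN"),
    ("NN", "NN", "NN"),
    ("NN", "IN", "NN"),
}


def candidate_phrases(tagged_tokens):
    """Hypothetical helper: keep bigrams and trigrams whose tags match a pattern."""
    phrases = []
    for n in (2, 3):
        for i in range(len(tagged_tokens) - n + 1):
            window = tagged_tokens[i:i + n]
            tags = tuple(tag for _word, tag in window)
            if tags in ALLOWED_PATTERNS:
                phrases.append(" ".join(word for word, _tag in window))
    return phrases


# Hand-tagged input (an assumption), so the sketch runs without a tagger model.
tagged = [("the", "DT"), ("cumulative", "JJ"), ("distribution", "NN"),
          ("function", "NN"), ("is", "VBZ"), ("monotone", "JJ")]

print(candidate_phrases(tagged))
```

Unlike a chunker, this filter keeps overlapping candidates (both the bigrams and the trigram that contains them), which is what you want if the next step is to count frequencies.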

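In the code below, when I apply each of the grammars on its own, I get the expected results. But when I try to combine the same grammars, I do not.

In my code you can see that I uncomment one sentence, uncomment the matching grammar, run it, and check the result.

I should be able to combine all the sentences, run them through the combined grammar (only 3 rules in the code below), and get the desired results.

My question is: how do I combine the grammars correctly?

I am assuming that combining grammars works like an "OR": find this pattern, or that pattern.

Thanks in advance.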

import nltk

# The following sentences are correctly grouped with <JJ>*<NN>+. 
# Should see: 'linear function', 'regression coefficient', 'Gaussian random variable' and 
# 'cumulative distribution function'
SampleSentence = "In mathematics, the term linear function refers to two distinct, although related, notions"
#SampleSentence = "The regression coefficient is the slope of the line of the regression equation."
#SampleSentence = "In probability theory, Gaussian random variable is a very common continuous probability distribution."
#SampleSentence = "In probability theory and statistics, the cumulative distribution function (CDF) of a real-valued random variable X, or just distribution function of X, evaluated at x, is the probability that X will take a value less than or equal to x."

# The following sentences are correctly grouped with <NN.?>*<V.*>*<NN>
# Should see 'mean squared error' and 'class probability function'.
#SampleSentence = "In statistics, the mean squared error (MSE) of an estimator measures the average of the squares of the errors, that is, the difference between the estimator and what is estimated."
#SampleSentence = "The class probability function is interesting"

# The sentence below is correctly grouped with <NN.?>*<IN>*<NN.?>*. 
# should see 'degrees of freedom'.
#SampleSentence = "In statistics, the degrees of freedom is the number of values in the final calculation of a statistic that are free to vary."

SampleSentence = SampleSentence.lower()

print("\nFull sentence: ", SampleSentence, "\n")

tokens = nltk.word_tokenize(SampleSentence)
textTokens = nltk.Text(tokens)    

# Determine the POS tags.
POStagList = nltk.pos_tag(textTokens)    

# The following grammars work well *independently*
grammar = "NP: {<JJ>*<NN>+}"
#grammar = "NP: {<NN.?>*<V.*>*<NN>}"    
#grammar = "NP: {<NN.?>*<IN>*<NN.?>*}"


# Merge several grammars above into a single one below. 
# Note that all 3 correct grammars above are included below. 

'''
grammar = """
            NP: 
                {<JJ>*<NN>+}
                {<NN.?>*<V.*>*<NN>}
                {<NN.?>*<IN>*<NN.?>*}
        """
'''

cp = nltk.RegexpParser(grammar)

result = cp.parse(POStagList)

for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
    print("NP Subtree:", subtree)

If what I described in my comment is what you want, then here is the answer:

grammar = """
            NP: 
                {<JJ>*<NN.?>*<V.|IN>*<NN.?>*}"""
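
To see why a single rule can behave differently from several rules listed under one label, here is a small sketch. As far as I understand `nltk.RegexpParser`, the rules under a label are applied one after another, and tokens that an earlier rule has already chunked are not available to later rules, so the list is not a plain "OR". The helper `np_chunks` and the hand-tagged tokens are assumptions made for this example, and the reordered grammar is only an illustration, not the grammar from the answer.

```python
import nltk


def np_chunks(grammar, tagged_tokens):
    """Hypothetical helper: return the words of every NP chunk the grammar finds."""
    tree = nltk.RegexpParser(grammar).parse(tagged_tokens)
    return [
        [word for word, _tag in subtree.leaves()]
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
    ]


# Hand-tagged tokens (an assumption), so no tagger model is needed to run this.
tagged = [("the", "DT"), ("degrees", "NNS"), ("of", "IN"),
          ("freedom", "NN"), ("is", "VBZ"), ("large", "JJ")]

# The IN pattern from the question, on its own.
in_rule_alone = "NP: {<NN.?>*<IN>*<NN.?>*}"

# Narrow rule first: it claims 'freedom' before the IN rule gets to run.
narrow_rule_first = """
    NP:
        {<JJ>*<NN>+}
        {<NN.?>*<IN>*<NN.?>*}
"""

# Same two rules, broad rule first.
broad_rule_first = """
    NP:
        {<NN.?>*<IN>*<NN.?>*}
        {<JJ>*<NN>+}
"""

alone = np_chunks(in_rule_alone, tagged)
narrow_first = np_chunks(narrow_rule_first, tagged)
broad_first = np_chunks(broad_rule_first, tagged)

print("IN rule alone:    ", alone)
print("narrow rule first:", narrow_first)
print("broad rule first: ", broad_first)
```

With the narrow rule first I would expect 'degrees of' and 'freedom' to come out as two separate chunks, while the other two grammars keep 'degrees of freedom' together. So either merging the patterns into one rule, as in the answer, or ordering the rules from broadest to narrowest seems to be a way around the problem.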

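Help me understand a bit more: you don't want to write the grammar as 3 rules, like grammar = "NP: {<JJ>*<NN>+} {<NN.?>*<V.*>*<NN>} {<NN.?>*<IN>*<NN.?>*}". Instead, you want a single-line regex pattern that covers all 3 patterns?

Hi Rahul. I want to combine the 3 regex patterns in some way so that together they produce the same results that each one produces on its own. I have no preference as to whether it is written on 1, 2, or 3+ lines. I will try the code below over the next few days. Thanks.

Sure, go ahead! I have tried it on several cases and it works. If you run into other problems, come back and ask.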