How to access the word preceding a verb in a sentence using a loop, with spaCy?

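I use spaCy to locate the verbs in a sentence via POS tags and then try to manipulate the verb. The manipulation depends on a condition, for example on the word preceding the verb. For example, I might want to transform this sentence, which contains three verbs (does, hurt, run):

(1) "Why does it hurt to run very fast."

into this sentence:

(2) "It hurts to run very fast."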

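This looked straightforward to me. However, my function somehow runs into trouble when it encounters the same POS tag twice in one sentence. In that case the IF clause (line 13 below) does not seem to be updated, so it evaluates to False when it should be True. I can't figure out what I'm overlooking or how to fix it. Here is my code: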

import pandas as pd
import spacy
nlp = spacy.load('en')

s = "Why does it hurt to run very fast."
df = pd.DataFrame({'sentence':[s]})
k = df['sentence']

1  def marking(row):
2      L = row
3      verblst = [('VB'), ('VBZ'), ('VBP')]  # list of verb POS tags to focus on
4      chunks = []
5      pos = []
6      for token in nlp(L):
7          pos.append(token.tag_)  # Just to check if POS tags are handled well
8      print(pos)
9      if "Why" in L:
10         for token in nlp(L):
11             if token.tag_ in verblst:
                   # This line checks the POS tag of the word preceding the verb:
12                 print(pos[pos.index(token.tag_)-1])
13                 if pos[pos.index(token.tag_)-1] == 'TO':  # Here things go wrong
14                     chunks.append(token.text + token.whitespace_)
15                 elif pos[pos.index(token.tag_)-1] == 'WRB':
16                     chunks.append(token.text + token.whitespace_)
17                 else:
18                     chunks.append(token.text + 's' + token.whitespace_)
19             else:
20                 chunks.append(token.text_with_ws)
           L = chunks
           L.pop(0)
           L.pop(0)
           L = [L[0].capitalize()] + L[1:]
       L = "".join(L)
       return L

x = k.apply(marking)
print(x)

This produces the following result:

"It hurts to runs very fast."  # The 's' after run should not be there

                  0      1      2      3     4     5    6     7     8
POS list of s: ['WRB', 'VBZ', 'PRP', 'VB', 'TO', 'VB', 'RB', 'RB', '.']
sentence s:     "Why    does   it     hurt  to    run   very  fast  ."

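The problem arises because 'VB' is found at both index 3 and index 5. It looks like the index in line 13 is not updated after the first 'VB', which I expected to happen automatically. So for the second 'VB', line 13 looks at index 2 rather than index 4. As a result, the condition in line 13 is not met and the second VB is handled in line 18, which produces the error. I'm puzzled as to why this happens. What am I not seeing, and how can I fix it?

Any help is much appreciated.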

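The problem here seems to be that you're only looking up the index of the token.tag_ string value in your pre-compiled list of part-of-speech tag strings. That lookup always returns the first match. So in the case of "run", your script doesn't actually check the POS before index 5 (which would be TO), but the POS before index 3 (which is PRP).

Consider the following abstract example: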
test = ['a', 'b', 'c', 'a', 'd']
for value in test:
    print(test.index(value))  # this will print 0, 1, 2, 0, 4
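
The same effect can be reproduced with the tag list from the question, without spaCy at all. The following is only a small sketch (the variable names are made up for illustration): it contrasts looking a tag up by value with tracking its position via enumerate().

```python
# POS tags for "Why does it hurt to run very fast." as printed in the question.
tags = ['WRB', 'VBZ', 'PRP', 'VB', 'TO', 'VB', 'RB', 'RB', '.']

# list.index() returns the first match, so both 'VB' entries map to position 3.
by_lookup = [tags.index(tag) for tag in tags if tag == 'VB']

# enumerate() yields each element's own position instead.
by_position = [i for i, tag in enumerate(tags) if tag == 'VB']

# Tag immediately before each 'VB', using the true positions.
preceding = [tags[i - 1] for i in by_position if i > 0]

print(by_lookup)    # [3, 3]
print(by_position)  # [3, 5]
print(preceding)    # ['PRP', 'TO']
```

With the true positions, the second 'VB' is correctly seen to follow 'TO'.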

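A better (and potentially simpler) solution would be to just iterate over the Token objects and use the token.i attribute, which returns the token's index in the parent document. Ideally, you want to process the text only once, store the Doc, and index into it later when you need to. For example: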
chunks = []
doc = nlp("Why does it hurt to run very fast.")

if doc[0].text == 'Why':  # the first token's text is "Why"
    for token in doc:
        if token.tag_ in ['VB', 'VBZ', 'VBP']:
            token_index = token.i  # this is the token index in the document
            prev_token = doc[token_index - 1]  # the previous token in the document
            if prev_token.tag_ == 'TO':
                chunks.append(token.text_with_ws)  # token text + whitespace
            # and so on
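
To show how the remaining branches from the question might fit together, here is one possible complete version. It is a sketch under assumptions, not the only way to do it: FakeToken, VERB_TAGS and rewrite_why_question are hypothetical names, and FakeToken is a stand-in that mimics only the i, text, tag_ and whitespace_ attributes of a spaCy Token so the example runs without a model. With spaCy you would pass the Doc returned by nlp(...) instead of the hand-built list.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Token (assumption: only these four
# attributes are needed), so the sketch runs without loading a model.
FakeToken = namedtuple("FakeToken", ["i", "text", "tag_", "whitespace_"])

VERB_TAGS = {"VB", "VBZ", "VBP"}


def rewrite_why_question(doc):
    """Turn 'Why does it hurt ...' into 'It hurts ...' using token positions."""
    if len(doc) == 0 or doc[0].text != "Why":
        return "".join(t.text + t.whitespace_ for t in doc)
    chunks = []
    for token in doc:
        # Guard token.i == 0: doc[-1] would silently wrap around to the last token.
        prev_tag = doc[token.i - 1].tag_ if token.i > 0 else None
        if token.tag_ in VERB_TAGS and prev_tag not in ("TO", "WRB"):
            chunks.append(token.text + "s" + token.whitespace_)
        else:
            chunks.append(token.text + token.whitespace_)
    chunks = chunks[2:]  # drop "Why " and "does ", as the question's code does
    if chunks:
        chunks[0] = chunks[0].capitalize()
    return "".join(chunks)


# Tags copied from the question's printed output; whitespace filled in by hand.
words = ["Why", "does", "it", "hurt", "to", "run", "very", "fast", "."]
tags = ["WRB", "VBZ", "PRP", "VB", "TO", "VB", "RB", "RB", "."]
spaces = [" ", " ", " ", " ", " ", " ", " ", "", ""]
doc = [FakeToken(i, w, t, s) for i, (w, t, s) in enumerate(zip(words, tags, spaces))]

print(rewrite_why_question(doc))  # It hurts to run very fast.
```

Because the previous token is found by position rather than by searching for its tag, a repeated 'VB' no longer causes trouble.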

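Ideally, you always want to convert spaCy's output to plain text as late as possible. Most of the problems you were trying to solve in your code are things spaCy already does for you. For example, it gives you the Doc object and its views Span and Token, which are performant, let you index into them and iterate over tokens anywhere, and, more importantly, never destroy any information available in the original text. Once your output is a single string of text plus whitespace plus other characters you've added, you won't be able to recover the original tokens very easily. You also won't know which token has whitespace attached, or how the individual tokens relate to each other.

For more details on the Doc, Token and Span objects, see the spaCy documentation and API reference, which list the available attributes for each object.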
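
I did find it odd that I only got the first match when going through the list of POS tags... but your solution works well. It's also much simpler, and your reasoning is very clear. Thanks a lot!

Glad it worked! Yes, this is simply how the list.index() method works in Python: "Return zero-based index in the list of the first item whose value is x". So if your list contains the same string two or more times and you look up its index, you'll get the index of the first match. This is also why you often want more flexible data structures that let you express complex relationships and keep exact references (especially when working with text).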