Python: Tagging all named entities with spaCy


I wrote a function that uses spaCy to replace every named entity in a text with its entity label:

def tag_ne(content):
    doc = nlp(content)
    text = doc.text
    for ent in doc.ents:
        text = re.sub(ent.text, ent.label_, text)
    return text

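It works when I apply it to a series of unicode strings. However, when I apply it to my whole dataset I get an error, apparently triggered by one particular observation. I can't tell which observation causes it, and I can't share my dataset, but the error is as follows: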

---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-56-274bc594a3e7> in <module>()
----> 1 emails.content.apply(tag_ne)

/vol1/home/ccostello/.conda/envs/chris_/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
   3190             else:
   3191                 values = self.astype(object).values
-> 3192                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   3193 
   3194         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()

<ipython-input-46-6900d0e291db> in tag_ne(content)
      3     text = doc.text
      4     for ent in doc.ents:
----> 5         text = re.sub(ent.text, ent.label_, text)
      6     return text

/vol1/home/ccostello/.conda/envs/chris_/lib64/python2.7/re.pyc in sub(pattern, repl, string, count, flags)
    149     a callable, it's passed the match object and must return
    150     a replacement string to be used."""
--> 151     return _compile(pattern, flags).sub(repl, string, count)
    152 
    153 def subn(pattern, repl, string, count=0, flags=0):

/vol1/home/ccostello/.conda/envs/chris_/lib64/python2.7/re.pyc in _compile(*key)
    240         p = sre_compile.compile(pattern, flags)
    241     except error, v:
--> 242         raise error, v # invalid expression
    243     if len(_cache) >= _MAXCACHE:
    244         _cache.clear()

error: unbalanced parenthesis


Is there an alternative way to tag all named entities that would let me avoid this error? Otherwise, how can I fix it?

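You can certainly find out which row causes this error. Just add a try/except statement around the substitution so that the offending entity text is printed instead of aborting the whole apply: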

def tag_ne(content):
    doc = nlp(content)
    text = doc.text
    for ent in doc.ents:
        try:
            text = re.sub(ent.text, ent.label_, text)
        except Exception as e:
            print(ent.text, ent.label_, '\n', e)
    return text
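
The reason the try/except helps is that re.sub treats its first argument as a regular expression, so an entity text containing an unmatched parenthesis fails to compile. A minimal sketch of that failure mode, without spaCy: the sample strings are hypothetical stand-ins for ent.text values, and find_bad_patterns is an illustrative helper, not part of any library.

```python
import re

def find_bad_patterns(candidates):
    """Return (string, error message) pairs for strings that are not valid regexes."""
    bad = []
    for s in candidates:
        try:
            re.compile(s)  # the same compilation step re.sub performs internally
        except re.error as exc:
            bad.append((s, str(exc)))
    return bad

# Hypothetical entity texts; in the question these would come from ent.text
samples = ["Acme Corp", "Acme (UK", "J. Smith)"]
problems = find_bad_patterns(samples)
for s, msg in problems:
    print(repr(s), "->", msg)
```

Only the two strings with an unmatched parenthesis are reported; the exact wording of the message depends on the Python version.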

If ent.text is meant to be matched literally rather than as a regular expression pattern, escape it first:

text = re.sub(re.escape(ent.text), ent.label_, text)
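
As an alternative that avoids regular expressions altogether, the entities can be replaced by character offsets. The sketch below is an assumption about how this could be done, not the answerer's method: replace_spans is a hypothetical helper that takes (start, end, label) tuples, which in spaCy would correspond to ent.start_char, ent.end_char and ent.label_ for each ent in doc.ents. The sample text and offsets are made up for illustration, and the spans are assumed not to overlap.

```python
import re

def replace_spans(text, spans):
    """Replace each non-overlapping (start, end, label) character span with its label."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(label)               # the label in place of the entity
        cursor = end
    pieces.append(text[cursor:])           # remainder after the last entity
    return "".join(pieces)

# Hypothetical input; with spaCy the spans would be built as
# [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
text = "Call Acme (UK) Ltd or Bob."
spans = [(5, 18, "ORG"), (22, 25, "PERSON")]
tagged = replace_spans(text, spans)

# For comparison: the escaped-regex approach on a single entity text
escaped = re.sub(re.escape("Acme (UK) Ltd"), "ORG", text)
```

One difference in behavior: re.sub replaces every occurrence of the entity text anywhere in the string, whereas the offset version only touches the spans that were actually tagged.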
