Python NLTK: POS-tagging every row of a column for natural language processing
== Using a Jupyter notebook == I can get NLTK to process a single text string:
Text = 'Hey. I got some text here'

def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent

sent = preprocess(Text)
sent
Output:
[('Hey', 'NNP'),
 ('.', '.'),
 ('I', 'PRP'),
 ('got', 'VBD'),
 ('some', 'DT'),
 ('text', 'NN'),
 ('here', 'RB')]
That's fine, but not very useful, since I want to run this automatically over many rows in a DataFrame.
Basically, I want to tag the words while keeping the index key, so I can reassemble the tags I want into a new field. For example, I'm looking for person names in a particular Excel column with 1000+ rows.
When I try this on a DataFrame, here is the problem I run into:
print(desdf)
Description
0 some text here John
1 Other cool text
2 John Paul
Running the code with this DataFrame, I get TypeError: expected string or bytes-like object:
def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent

sent = preprocess(desdf)
sent
Is this impossible, or do I need to run some conversion command first? Thanks for your help.
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-23-b7b2a604215b> in <module>
3 sent = nltk.pos_tag(sent)
4 return sent
----> 5 sent = preprocess(desdf)
6 sent
<ipython-input-23-b7b2a604215b> in preprocess(sent)
1 def preprocess(sent):
----> 2 sent = nltk.word_tokenize(sent)
3 sent = nltk.pos_tag(sent)
4 return sent
5 sent = preprocess(desdf)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\__init__.py in word_tokenize(text, language, preserve_line)
142 :type preserve_line: bool
143 """
--> 144 sentences = [text] if preserve_line else sent_tokenize(text, language)
145 return [
146 token for sent in sentences for token in _treebank_word_tokenizer.tokenize(sent)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\__init__.py in sent_tokenize(text, language)
104 """
105 tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
--> 106 return tokenizer.tokenize(text)
107
108
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in tokenize(self, text, realign_boundaries)
1275 Given a text, returns a list of the sentences in that text.
1276 """
-> 1277 return list(self.sentences_from_text(text, realign_boundaries))
1278
1279 def debug_decisions(self, text):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in sentences_from_text(self, text, realign_boundaries)
1329 follows the period.
1330 """
-> 1331 return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
1332
1333 def _slices_from_text(self, text):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in <listcomp>(.0)
1329 follows the period.
1330 """
-> 1331 return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
1332
1333 def _slices_from_text(self, text):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in span_tokenize(self, text, realign_boundaries)
1319 if realign_boundaries:
1320 slices = self._realign_boundaries(text, slices)
-> 1321 for sl in slices:
1322 yield (sl.start, sl.stop)
1323
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _realign_boundaries(self, text, slices)
1360 """
1361 realign = 0
-> 1362 for sl1, sl2 in _pair_iter(slices):
1363 sl1 = slice(sl1.start + realign, sl1.stop)
1364 if not sl2:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _pair_iter(it)
316 it = iter(it)
317 try:
--> 318 prev = next(it)
319 except StopIteration:
320 return
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text)
1333 def _slices_from_text(self, text):
1334 last_break = 0
-> 1335 for match in self._lang_vars.period_context_re().finditer(text):
1336 context = match.group() + match.group('after_tok')
1337 if self.text_contains_sentbreak(context):
TypeError: expected string or bytes-like object
Select the column and use apply to run the processing function on each of its rows:
sent = desdf['Description'].apply(preprocess)