Python NLTK: POS-tagging every row of a column for natural language processing
== Using a Jupyter notebook == I can get NLTK to process a single text string:
Text = 'Hey. I got some text here'

def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent

sent = preprocess(Text)
sent
Output:
[('Hey', 'NNP'),
 ('.', '.'),
 ('I', 'PRP'),
 ('got', 'VBD'),
 ('some', 'DT'),
 ('text', 'NN'),
 ('here', 'RB')]
That's fine, but not very useful, since I want to run this automatically over many rows in a DataFrame.
Basically, I want to tag the words while keeping the index key, so I can reassemble the tags I want into a new field. For example, I'm looking for person names in a particular Excel column with 1000+ rows.
When I try this on a DataFrame, here is the problem I run into:
print(desdf)
Description
0 some text here John
1 Other cool text
2 John Paul
Running the code with this DataFrame, I get TypeError: expected string or bytes-like object:
def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent

sent = preprocess(desdf)
sent
Is this impossible, or do I need to run some conversion command first? Thanks for your help.
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-23-b7b2a604215b> in <module>
3 sent = nltk.pos_tag(sent)
4 return sent
----> 5 sent = preprocess(desdf)
6 sent
<ipython-input-23-b7b2a604215b> in preprocess(sent)
1 def preprocess(sent):
----> 2 sent = nltk.word_tokenize(sent)
3 sent = nltk.pos_tag(sent)
4 return sent
5 sent = preprocess(desdf)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\__init__.py in word_tokenize(text, language, preserve_line)
142 :type preserve_line: bool
143 """
--> 144 sentences = [text] if preserve_line else sent_tokenize(text, language)
145 return [
146 token for sent in sentences for token in _treebank_word_tokenizer.tokenize(sent)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\__init__.py in sent_tokenize(text, language)
104 """
105 tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
--> 106 return tokenizer.tokenize(text)
107
108
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in tokenize(self, text, realign_boundaries)
1275 Given a text, returns a list of the sentences in that text.
1276 """
-> 1277 return list(self.sentences_from_text(text, realign_boundaries))
1278
1279 def debug_decisions(self, text):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in sentences_from_text(self, text, realign_boundaries)
1329 follows the period.
1330 """
-> 1331 return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
1332
1333 def _slices_from_text(self, text):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in <listcomp>(.0)
1329 follows the period.
1330 """
-> 1331 return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
1332
1333 def _slices_from_text(self, text):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in span_tokenize(self, text, realign_boundaries)
1319 if realign_boundaries:
1320 slices = self._realign_boundaries(text, slices)
-> 1321 for sl in slices:
1322 yield (sl.start, sl.stop)
1323
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _realign_boundaries(self, text, slices)
1360 """
1361 realign = 0
-> 1362 for sl1, sl2 in _pair_iter(slices):
1363 sl1 = slice(sl1.start + realign, sl1.stop)
1364 if not sl2:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _pair_iter(it)
316 it = iter(it)
317 try:
--> 318 prev = next(it)
319 except StopIteration:
320 return
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self, text)
1333 def _slices_from_text(self, text):
1334 last_break = 0
-> 1335 for match in self._lang_vars.period_context_re().finditer(text):
1336 context = match.group() + match.group('after_tok')
1337 if self.text_contains_sentbreak(context):
TypeError: expected string or bytes-like object
Select the column and use apply to run the processing function on each of its rows:
sent = desdf['Description'].apply(preprocess)