
Strange behavior of the Python NLTK sentence tokenizer with special characters


I am seeing some strange behavior when using the sent_tokenizer on German text.

Sample code:

# -*- coding: utf-8 -*-
import nltk

sent_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
for sent in sent_tokenizer.tokenize("Super Qualität. Tolles Teil."):
    print sent
This fails with the following error:

Traceback (most recent call last):
for sent in sent_tokenize("Super Qualität. Tolles Teil."):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 82, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1270, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1318, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1309, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1348, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
    prev = next(it)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1324, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1369, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1504, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter
    prev = next(it)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 621, in _annotate_first_pass
    for aug_tok in tokens:
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
Whereas:

sent_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
for sent in sent_tokenizer.tokenize("Super Qualität des Produktes. Tolles Teil."):
    print sent

works perfectly.

I found the solution in the NLTK documentation:

Caution: when tokenizing a Unicode string, make sure you are not using an encoded version of the string (it may be necessary to decode it first, e.g. with s.decode("utf8")).
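Concretely, in Python 2 terms the distinction the docs are pointing at looks roughly like this (an illustrative sketch; the variable names are mine):

# -*- coding: utf-8 -*-
# Illustrative sketch of the str/unicode distinction in Python 2.
s = "Super Qualität. Tolles Teil."   # byte string: raw UTF-8 bytes
u = s.decode('utf8')                 # unicode string: decoded text

print type(s), len(s)   # <type 'str'> 29  -- the umlaut occupies two bytes
print type(u), len(u)   # <type 'unicode'> 28

The tokenizer expects the decoded form; handing it the byte string makes Python fall back to the implicit ASCII codec, which fails on the 0xc3 byte of the umlaut, exactly the UnicodeDecodeError shown above.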

So the following:

text = "Super Qualität. Tolles Teil."
sent_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
for sent in sent_tokenizer.tokenize(text.decode('utf8')):
    print sent

works like a charm.
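A small variation on the same fix (assuming the sent_tokenizer loaded above): a unicode literal is already decoded, so no explicit decode call is needed:

for sent in sent_tokenizer.tokenize(u"Super Qualität. Tolles Teil."):
    print sent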

Are you missing an 'r' at the end of the function name? for sent in sent_tokenize("Super Qualität. Tolles Teil."):

@Mr.polywhill Just a typo in the question :-). That's not the problem. The problem is with sentences whose last word contains non-ASCII characters, but I don't know why that happens.

Your "Super Qualität. Tolles Teil." works if you write it this way. What is the encoding of the text? It works if you pass the text in decoded:
text = "Super Qualität. Tolles Teil."
sent_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
for sent in sent_tokenizer.tokenize(text.decode('utf8')):
      print sent
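If the text actually comes from a file (an assumption about the setup; reviews.txt is a hypothetical name), io.open with an explicit encoding returns already-decoded unicode in Python 2, so the manual decode step disappears:

# Sketch under the assumption that the reviews live in a UTF-8 text file.
import io
import nltk

sent_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')

with io.open('reviews.txt', encoding='utf-8') as f:   # yields unicode lines
    for line in f:
        for sent in sent_tokenizer.tokenize(line.strip()):
            print sent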