"TypeError: expected string or buffer" when parsing documents in Python with tika

Tags: python, parsing, apache-tika, named-entity-recognition, named-entity-extraction

I am trying to parse some documents (of the file types listed below) with Apache Tika. This is my Python code:

import json
import urllib2
from tika import parser
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

# url, user and password are assumed to be defined earlier in the script
auth = urllib2.HTTPPasswordMgrWithDefaultRealm()
auth.add_password(None, url, user, password)
urllib2.install_opener(urllib2.build_opener(urllib2.HTTPBasicAuthHandler(auth)))

# Fetch the JSON listing and flatten its values into one list of file names
outpage = urllib2.urlopen(url)
data = json.loads(outpage.read().decode('utf-8'))
dictitems = data.values()
flattened_list = [y for x in dictitems for y in x]

filetypes = [".pdf", ".doc", ".docx", ".txt"]

def tikiparse(fi):
    for i in filetypes:
        if fi.endswith(i):
            # Send the file to the Tika server and pull out the plain text
            text = parser.from_file(fi, "http://localhost:9998/")
            extractedcontent = text["content"]

            # Tokenize, POS-tag, and NE-chunk the extracted text
            chunked = ne_chunk(pos_tag(word_tokenize(extractedcontent)))
            current_chunk = []
            cont_chunk = []

            # Collect contiguous NE subtrees into multi-word entity strings
            for j in chunked:
                if type(j) == Tree:
                    current_chunk.append(" ".join([token for token, pos in j.leaves()]))
                elif current_chunk:
                    named_entity = " ".join(current_chunk)
                    if named_entity not in cont_chunk:
                        cont_chunk.append(named_entity)
                        current_chunk = []
                else:
                    continue
            return cont_chunk
The loop runs fine for a while and extracts named entities from several documents, then suddenly fails with the error below. What is wrong with the code?

Traceback (most recent call last):
  File "C:/Users/Kalapala/PycharmProjects/Attachments/DownloadFiles.py", line 74, in <module>
    tikiparse(f)
  File "C:/Users/Kalapala/PycharmProjects/Attachments/DownloadFiles.py", line 57, in tikiparse
    chunked = ne_chunk(pos_tag(word_tokenize(extractedcontent)))
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 130, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 97, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1235, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1283, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1274, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1314, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 312, in _pair_iter
    prev = next(it)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1287, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer

Process finished with exit code 1

The problem you are running into is that word_tokenize() expects a string, but you are passing something else to the method. You have to make sure that extractedcontent is of string type.
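
For illustration, a minimal sketch of such a guard. The function name safe_chunk and the decision to skip empty documents are assumptions, not part of the question's code; the endpoint and variable names are taken from it. tika-python's parser.from_file() can return None for "content" when nothing could be extracted, which matches the error above.

from tika import parser
from nltk import ne_chunk, pos_tag, word_tokenize

def safe_chunk(fi):
    text = parser.from_file(fi, "http://localhost:9998/")
    extractedcontent = text["content"]
    # "content" is None when Tika extracts no text; word_tokenize()
    # would then raise "TypeError: expected string or buffer"
    if not isinstance(extractedcontent, basestring):
        return None  # assumed policy: skip this document
    return ne_chunk(pos_tag(word_tokenize(extractedcontent)))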

Based on your UnicodeDecodeError comment, the value in the text dictionary contains some characters that cannot be encoded/decoded. You can call encode('utf-8').strip() on the value, for example extractedcontent.encode('utf-8').strip(), to deal with it.
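
In Python 2 terms, that turns the unicode value into a UTF-8 encoded str; a short sketch, assuming extractedcontent is a unicode object as the comments below suggest:

if isinstance(extractedcontent, unicode):
    extractedcontent = extractedcontent.encode('utf-8').strip()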


Hope that helps.

Can you post the code related to parser.from_file()?

That's the parser.from_file() call in the example I posted. The lines following the function are: for f in os.listdir(""): tikiparse(f) (see the sketch after this thread)

Did you check the type of extractedcontent? It should be a string, of type 'unicode'.

When I try to convert it to str, I get UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 9: unexpected end of data

Now you know where the error comes from; see the details in my post.

This is the error I get: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 9: ordinal not in range(128)

There is a typo there, it should be 'utf-8'.

Yes, I noticed that and corrected it, but the problem remains :(

@alapalak Sorry to hear that; try googling how to handle special characters in Python.