"TypeError: expected string or buffer" when parsing documents in Python with tika

Tags: python, parsing, apache-tika, named-entity-recognition, named-entity-extraction

I am trying to parse some documents (of the file types listed below) with Apache Tika. This is my Python code:

import json
import urllib2
from tika import parser
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

# url, user and password are assumed to be defined earlier in the script
auth = urllib2.HTTPPasswordMgrWithDefaultRealm()
auth.add_password(None, url, user, password)
urllib2.install_opener(urllib2.build_opener(urllib2.HTTPBasicAuthHandler(auth)))

# Fetch the JSON listing and flatten its values into one list of file names
outpage = urllib2.urlopen(url)
data = json.loads(outpage.read().decode('utf-8'))
dictitems = data.values()
flattened_list = [y for x in dictitems for y in x]

filetypes = [".pdf", ".doc", ".docx", ".txt"]

def tikiparse(fi):
    for i in filetypes:
        if fi.endswith(i):
            # Send the file to the Tika server and pull out the plain text
            text = parser.from_file(fi, "http://localhost:9998/")
            extractedcontent = text["content"]

            # Tokenize, POS-tag, and NE-chunk the extracted text
            chunked = ne_chunk(pos_tag(word_tokenize(extractedcontent)))
            current_chunk = []
            cont_chunk = []

            # Collect contiguous NE subtrees into multi-word entity strings
            for j in chunked:
                if type(j) == Tree:
                    current_chunk.append(" ".join([token for token, pos in j.leaves()]))
                elif current_chunk:
                    named_entity = " ".join(current_chunk)
                    if named_entity not in cont_chunk:
                        cont_chunk.append(named_entity)
                        current_chunk = []
                else:
                    continue
            return cont_chunk
The loop runs fine for a while and extracts named entities from several documents, then suddenly fails with the error below. What is wrong with the code?

Traceback (most recent call last):
  File "C:/Users/Kalapala/PycharmProjects/Attachments/DownloadFiles.py", line 74, in <module>
    tikiparse(f)
  File "C:/Users/Kalapala/PycharmProjects/Attachments/DownloadFiles.py", line 57, in tikiparse
    chunked = ne_chunk(pos_tag(word_tokenize(extractedcontent)))
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 130, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 97, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1235, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1283, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1274, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1314, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 312, in _pair_iter
    prev = next(it)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1287, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer

Process finished with exit code 1

The problem you are running into is that word_tokenize() expects a string, but you are passing something else to the method. You have to make sure that extractedcontent is of string type.
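
For illustration, a minimal sketch of such a guard. The function name safe_chunk and the decision to skip empty documents are assumptions, not part of the question's code; the endpoint and variable names are taken from it. tika-python's parser.from_file() can return None for "content" when nothing could be extracted, which matches the error above.

from tika import parser
from nltk import ne_chunk, pos_tag, word_tokenize

def safe_chunk(fi):
    text = parser.from_file(fi, "http://localhost:9998/")
    extractedcontent = text["content"]
    # "content" is None when Tika extracts no text; word_tokenize()
    # would then raise "TypeError: expected string or buffer"
    if not isinstance(extractedcontent, basestring):
        return None  # assumed policy: skip this document
    return ne_chunk(pos_tag(word_tokenize(extractedcontent)))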

Based on your UnicodeDecodeError comment, the value in the text dictionary contains some characters that cannot be encoded/decoded. You can call encode('utf-8').strip() on the value, for example extractedcontent.encode('utf-8').strip(), to deal with it.
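
In Python 2 terms, that turns the unicode value into a UTF-8 encoded str; a short sketch, assuming extractedcontent is a unicode object as the comments below suggest:

if isinstance(extractedcontent, unicode):
    extractedcontent = extractedcontent.encode('utf-8').strip()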


Hope that helps.

Can you post the code related to parser.from_file()?

That's the parser.from_file() call in the example I posted. The lines following the function are: for f in os.listdir(""): tikiparse(f) (see the sketch after this thread)

Did you check the type of extractedcontent? It should be a string, of type 'unicode'.

When I try to convert it to str, I get UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 9: unexpected end of data

Now you know where the error comes from; see the details in my post.

This is the error I get: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 9: ordinal not in range(128)

There is a typo there, it should be 'utf-8'.

Yes, I noticed that and corrected it, but the problem remains :(

@alapalak Sorry to hear that; try googling how to handle special characters in Python.