
Checking the mimetype of stored data with Python


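Question: I have extracted content blocks from a WARC file. Before saving a content block to a file, I am writing a filter that checks its mimetype. In particular, I am only interested in the application/pdf type. The first few lines of the content look like this: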

HTTP/1.1 200 OK^M
Date: Wed, 26 Jun 2013 02:18:37 GMT^M
Server: Apache^M
Last-Modified: Thu, 02 Dec 2010 22:54:07 GMT^M
ETag: "9002f-41fc8-4c94c1c0"^M
Accept-Ranges: bytes^M
Content-Length: 270280^M
Connection: close^M
Content-Type: application/pdf^M
^M
%PDF-1.4
%ÐÔÅØ
1 0 obj
<< /S /GoTo /D [2 0 R  /Fit ] >>
endobj
7 0 obj <<
/Length 297
/Filter /FlateDecode
>>
stream
(2) The magic package

import magic
print magic.from_buffer(content)

It prints `ASCII text, with CRLF, LF line terminators`.
(3) subprocess.Popen()

The output is an error message:

Traceback (most recent call last):
  File "warc_extract_pdf.py", line 123, in <module>
    run()
  File "warc_extract_pdf.py", line 102, in run
    sys.exit(main(argvs))
  File "warc_extract_pdf.py", line 35, in main
    if extract_pdf(offset,record,outdir,outlog): 
  File "warc_extract_pdf.py", line 61, in extract_pdf
    if not mimetype(record,'application/pdf'): return False
  File "warc_extract_pdf.py", line 75, in mimetype
    p = Popen('file --mime-type', stdin=PIPE, stdout=PIPE, stderr=STDOUT)
  File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1234, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
Please help.

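warc is a Python library for parsing WARC files and getting information out of them. Until you parse the file into HTTP requests, it is just text. Going by their examples, your use case would look like this: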

import warc
f = warc.open("test.warc")
for record in f:
    print record.get("Content-Type","text/html")

This is an old question, but I figured I could still answer it.

python-magic will work here. Just use .from_buffer(buffer, mime=True).
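The `content` shown in the question begins with the HTTP response headers, which may be why magic described it as ASCII text rather than as a PDF. The following is a minimal sketch, assuming `content` is a bytes string holding one complete HTTP response; the helper names `split_http_response` and `looks_like_pdf` are hypothetical and come from neither the question nor any library. It separates the headers from the body at the first blank line, then checks the declared Content-Type and the `%PDF-` signature at the start of the body.

```python
def split_http_response(content):
    """Split raw HTTP response bytes into (headers dict, body bytes)."""
    # The header section ends at the first blank line (CRLF CRLF).
    head, sep, body = content.partition(b"\r\n\r\n")
    if not sep:
        # No header/body separator found: treat everything as body.
        return {}, content
    headers = {}
    # Skip the status line ("HTTP/1.1 200 OK") and parse "Name: value" lines.
    for line in head.split(b"\r\n")[1:]:
        name, colon, value = line.partition(b":")
        if colon:
            key = name.strip().lower().decode("latin-1")
            headers[key] = value.strip().decode("latin-1")
    return headers, body


def looks_like_pdf(content):
    """True if the declared type or the body's leading bytes indicate a PDF."""
    headers, body = split_http_response(content)
    # Drop any parameters such as "; charset=..." before comparing.
    declared = headers.get("content-type", "").split(";")[0].strip().lower()
    return declared == "application/pdf" or body.startswith(b"%PDF-")


# Made-up sample modelled on the headers shown in the question.
sample = (b"HTTP/1.1 200 OK\r\n"
          b"Content-Length: 270280\r\n"
          b"Content-Type: application/pdf\r\n"
          b"\r\n"
          b"%PDF-1.4\n1 0 obj\n")
headers, body = split_http_response(sample)
print(headers["content-type"])  # application/pdf
print(looks_like_pdf(sample))   # True
```

With the headers removed, `body` rather than `content` would be the buffer to hand to `magic.from_buffer(body, mime=True)` or to write out as the PDF file.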


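What is content? A path to a file? Please post its declaration. Popen needs a sequence, like:
['file', '--mime-type']

As I posted, the "content" variable contains a string that starts with "HTTP/1.1 200". Changing 'file --mime-type' to ['file', '--mime-type'] did not help. It produced an error message saying that I was not using the "file" command correctly.

That is correct. However, these WARC files were generated by a web crawler called Heritrix. Heritrix does not record the content type well. Most values are just "application/http;msgtype=request", which is useless. That is why I have to check the mime type myself. Also, I don't think the warc package can retrieve the content for me.

OK, then that is beyond my depth. Good luck.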
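The two errors in this exchange have separate causes. Without shell=True, Popen treats the string 'file --mime-type' as the name of a single executable, hence the OSError. Once the arguments are passed as a list, `file` still complains because it was given no file to inspect. Below is a minimal sketch, assuming Python 3 and that the `file` utility is installed; the function name `mime_type_via_file` is hypothetical. It passes `-` as the file name so that `file` reads the data from standard input.

```python
import shutil
from subprocess import PIPE, Popen


def mime_type_via_file(data):
    """Return the MIME type that the `file` utility reports for raw bytes.

    Returns None when `file` is not available on this system.
    """
    if shutil.which("file") is None:
        return None
    # Arguments go in a list, one item per argument. The trailing "-"
    # tells `file` to read from stdin; --brief omits the file name
    # from the output.
    p = Popen(["file", "--brief", "--mime-type", "-"], stdin=PIPE, stdout=PIPE)
    out, _ = p.communicate(data)
    return out.decode("ascii", "replace").strip()


if __name__ == "__main__":
    print(mime_type_via_file(b"%PDF-1.4\n1 0 obj\n"))
```

As with magic, the bytes passed in should be the response body only. If the HTTP headers are still attached, `file` will most likely classify the data as text.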
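Example for the python-magic answer (.from_buffer with mime=True):

import io
import magic

# Read the file as bytes so the PDF data is passed to magic unchanged.
msg_part_io_str = io.BytesIO()
with open('./Downloads/test123123.pdf', 'rb') as f:
    msg_part_io_str.write(f.read())

# mime=True returns the MIME type instead of a textual description.
d = magic.from_buffer(msg_part_io_str.getvalue(), mime=True)

print(d)
# Output: application/pdf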