Python: How do I split a WARC file?

My goal is to split a WARC file from Common Crawl into its individual records and sort them. Example file:

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2020-08-04T01:43:40Z
WARC-Record-ID: <urn:uuid:959ea654-33fd-466b-b1bf-f08aa8abe774>
Content-Length: 500
Content-Type: application/warc-fields
WARC-Filename: CC-MAIN-20200804014340-20200804044340-00045.warc.gz

isPartOf: CC-MAIN-2020-34
publisher: Common Crawl
description: Wide crawl of the web for August 2020
operator: Common Crawl Admin (info@commoncrawl.org)
hostname: ip-10-67-67-22.ec2.internal
software: Apache Nutch 1.17 (modified, https://github.com/commoncrawl/nutch/)
robots: checked via crawler-commons 1.2-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
format: WARC File Format 1.1
conformsTo: http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/


WARC/1.0
WARC-Type: request
WARC-Date: 2020-08-04T03:25:25Z
WARC-Record-ID: <urn:uuid:6c0b749a-4d02-4a77-ab93-9bc4ba094cdc>
Content-Length: 303
Content-Type: application/http; msgtype=request
WARC-Warcinfo-ID: <urn:uuid:959ea654-33fd-466b-b1bf-f08aa8abe774>
WARC-IP-Address: 104.254.66.40
WARC-Target-URI: http://00.auto.sohu.com/d/details?cityCode=450100&planId=1450&trimId=145372

How can I split the file into separate records at each "WARC/1.0" line?
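
One way to do exactly what is asked is to walk the file line by line and start a new record whenever a line consists of `WARC/1.0`. The following is only a minimal sketch, assuming the file has already been decompressed to text; the function name `split_warc_records` and the `sample` string are made up for illustration. Note that it would split in the wrong place if a payload happened to contain a line that is exactly `WARC/1.0`.

```python
from typing import Iterable, Iterator, List


def split_warc_records(
    lines: Iterable[str], marker: str = "WARC/1.0"
) -> Iterator[List[str]]:
    """Yield each record as a list of lines.

    A new record starts at every line that equals `marker`
    (ignoring the trailing line ending).
    """
    current: List[str] = []
    for line in lines:
        # Close the previous record when the next version line appears.
        if line.rstrip("\r\n") == marker and current:
            yield current
            current = []
        current.append(line)
    if current:
        yield current


# Shortened, hypothetical stand-in for the example file in the question.
sample = (
    "WARC/1.0\r\n"
    "WARC-Type: warcinfo\r\n"
    "Content-Length: 500\r\n"
    "\r\n"
    "isPartOf: CC-MAIN-2020-34\r\n"
    "\r\n"
    "\r\n"
    "WARC/1.0\r\n"
    "WARC-Type: request\r\n"
    "Content-Length: 303\r\n"
)

records = list(split_warc_records(sample.splitlines(keepends=True)))
print(len(records))
```

Each element of `records` is the list of lines belonging to one record, so sorting by a header such as `WARC-Type` can be done afterwards on that list.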

See
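
A more robust alternative to splitting on the `WARC/1.0` text is to rely on each record's `Content-Length` header, which states how many bytes of payload follow the blank line after the headers. The sketch below works on bytes and uses only the standard library; `iter_warc_records` and `make_record` are hypothetical helper names, and the demo data is synthetic rather than taken from a real crawl.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Read exactly the declared number of payload bytes.
        length = int(headers.get("Content-Length", 0))
        yield headers, stream.read(length)


def make_record(warc_type: str, body: bytes) -> bytes:
    """Build a tiny synthetic record for the demo (not real crawl data)."""
    head = f"WARC/1.0\r\nWARC-Type: {warc_type}\r\nContent-Length: {len(body)}\r\n\r\n"
    return head.encode() + body + b"\r\n\r\n"


demo = io.BytesIO(
    make_record("warcinfo", b"isPartOf: CC-MAIN-2020-34\r\n")
    # The second payload deliberately contains a "WARC/1.0" line.
    + make_record("request", b"GET /d/details HTTP/1.1\r\n\r\nWARC/1.0\r\n")
)
parsed = list(iter_warc_records(demo))
print([headers["WARC-Type"] for headers, _ in parsed])
```

For a `.warc.gz` file such as the one named in the example, the same function should work on the object returned by `gzip.open(path, "rb")`, since Python's `gzip` module reads concatenated gzip members as one stream. Because the payload is read by length, a `WARC/1.0` line inside a payload does not start a new record.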