Getting an "EOFError: End of stream already reached" error when trying to download and decompress a large file on the fly with Python and smart_open

Tags: python, python-3.x, amazon-s3, tar, tarfile

I am trying to download and decompress a set of files from a remote Apache server. I provide a list of .tbz (tar.bz2) files to be downloaded and decompressed on the fly. The goal is to stream them from the remote Apache server, through a tar decompressor, into my Amazon AWS S3 bucket. I do it this way because the files can be as large as 30 GB.

I use the "smart_open" Python library to abstract away the HTTPS and S3 handling.

The code I provide here works for small files. When I try the same thing with larger files (more than 8 MB), I get the following error:

"EOFError: End of stream already reached"
Here is the traceback:

Traceback (most recent call last):
  File "./script.py", line 28, in <module>
    download_file(fileName)
  File "./script.py", line 21, in download_file
    for line in tfext:
  File "/.../lib/python3.7/tarfile.py", line 706, in readinto
    buf = self.read(len(b))
  File "/.../lib/python3.7/tarfile.py", line 695, in read
    b = self.fileobj.read(length)
  File "/.../lib/python3.7/tarfile.py", line 537, in read
    buf = self._read(size)
  File "/.../lib/python3.7/tarfile.py", line 554, in _read
    buf = self.cmp.decompress(buf)
EOFError: End of stream already reached

I would expect to be able to process large files in exactly the same way as small ones.
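The question's own code is not reproduced above, so purely as an illustration, here is a rough sketch of the kind of on-the-fly pipeline it describes, assuming smart_open handles both the HTTPS source and the S3 destination and tarfile is used in its streaming ('r|bz2') mode; the URL and bucket name are placeholders, not taken from the original question:

from smart_open import open
import tarfile

# Placeholder source URL and destination prefix -- not from the original question.
SOURCE_URL = 'https://example.com/files/test.tar.bz2'
DEST_PREFIX = 's3://bucketname/extracted/'

def stream_extract(url):
    # Read the remote .tar.bz2 over HTTPS as a non-seekable stream.
    with open(url, 'rb') as fileobj:
        # 'r|bz2' is tarfile's streaming mode: members are read strictly
        # in order and no seeking on fileobj is attempted.
        with tarfile.open(fileobj=fileobj, mode='r|bz2') as tf:
            for member in tf:
                if not member.isfile():
                    continue
                extracted = tf.extractfile(member)
                # Copy the member's bytes straight into the S3 bucket.
                with open(DEST_PREFIX + member.name, 'wb') as out:
                    while True:
                        chunk = extracted.read(1024 * 1024)
                        if not chunk:
                            break
                        out.write(chunk)

if __name__ == '__main__':
    stream_extract(SOURCE_URL)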

Extracting a compressed tar requires seeking within the file, which may not be possible with the virtual file descriptor that smart_open creates. An alternative is to download the data to block storage before processing it:

from smart_open import open
import tarfile
import boto3

filenames = ['test.tar.bz2']

def download_file(fileName):
    # Download the archive from S3 to local block storage first,
    # so that tarfile can seek within it.
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('bucketname')
    obj = bucket.Object(fileName)
    local_filename = '/tmp/{}'.format(fileName)
    obj.download_file(local_filename)

    # Extract each member and verify its size by reading it back.
    with tarfile.open(local_filename, 'r:bz2') as tf:
        for member in tf.getmembers():
            tf.extract(member)
            with open(member.name, 'rb') as fd:
                print(member, len(fd.read()))

if __name__ == '__main__':
    for f in filenames:
        download_file(f)
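The design point here is that a file on local block storage supports seek(), so tarfile can be opened in its random-access 'r:bz2' mode and jump between members as needed, whereas the HTTPS/S3 stream wrapped by smart_open can only be read sequentially.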
