Why is reading data from a Google Cloud Storage bucket so slow in Python?


python pyspark jupyter-notebook google-cloud-storage

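I ran into a problem reading data from a Google Cloud Storage bucket in a Jupyter notebook on Dataproc. My bucket, named stb_data, has a folder called data that contains 815 folders, each of which contains text files. From each of these 815 folders I need to read the contents of one file with a specific name. Here is what I am doing now: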

from google.cloud import storage

storage_client = storage.Client()
bucket_name = 'stb_data'
bucket = storage_client.bucket(bucket_name)

data = []  # lines collected from all the files read below

# extract everything contained inside stb_data/data
blobs = storage_client.list_blobs(bucket_name, prefix='data/', delimiter='/')
[_ for _ in blobs]  # don't know why but I have to iterate over blobs to make it possible to use prefixes

# get all folders with files in stb_data/data
folders = list(blobs.prefixes)

for folder in folders:        
    # get all the files in the folder
    files = [blob.name for blob in storage_client.list_blobs(bucket_name, prefix=folder, delimiter='/')]
    
    # find the file I need (it must start with a letter 'n')
    filename = [file for file in files if file.split('/')[-1].startswith('n')][0]  

    contents = spark.read.text(f'gs://{bucket_name}/{filename}')  # read the file
    data += [line.value for line in contents.collect()]  # get the contents

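This approach works, but it is very slow. None of the text files is particularly large, and doing the same thing on my local machine is faster (I can't use my local machine, because the data will be processed further later on).

What am I doing wrong? Is there a better way?

Thanks in advance for your help.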

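It looks like you are making a lot of calls over the network, which slows down execution. I would suggest using wildcards in your calls. I think you can do that with the prefix. It might help. Another suggestion is to find out which calls add the latency by timing each call programmatically.

@GaurangiSaxena Do I understand correctly that gsutil and wildcards would let me copy the files I need into the Jupyter directory, where I could access them faster?

1) This line reads the name of every object in the bucket:

[_ for _ in blobs]

2) Then you repeat a similar process over and over:

storage_client.list_blobs(bucket_name, prefix=folder, delimiter='/')

To speed things up, read the entire bucket listing into memory once. Then use string patterns to find the object names you want to process. Remember that Cloud Storage is a flat namespace. There are no directories.

@JohnHanley By "read the entire bucket listing into memory once", do you mean running this command

blobs = storage_client.list_blobs(bucket_name, prefix='data/')

without the delimiter parameter, and then looking for the necessary files in the output?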
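
A minimal sketch of what the comments above suggest, not a tested solution for this bucket: list everything under data/ once without a delimiter, pick the object names by string matching, and then hand all the paths to Spark in one read instead of starting one job per file. The helper names (pick_one_per_folder, list_object_names, read_lines) are made up for illustration. The sketch assumes the wanted files sit directly inside each of the 815 folders and that, as in the question, the first name starting with 'n' is the one to read.

```python
def pick_one_per_folder(object_names, root="data/", first_char="n"):
    """Return one matching object name per immediate subfolder of `root`.

    Object names are plain strings such as 'data/folder_a/notes.txt';
    the "folders" are only shared name prefixes.
    """
    chosen = {}
    for name in sorted(object_names):
        if not name.startswith(root):
            continue
        parts = name[len(root):].split("/")
        # Keep only objects of the form <folder>/<file> below the root.
        if len(parts) != 2 or not parts[1].startswith(first_char):
            continue
        chosen.setdefault(parts[0], name)  # first match per folder wins
    return list(chosen.values())


def list_object_names(bucket_name, prefix):
    """List every object under `prefix` with a single listing call (no delimiter)."""
    # Imported here so that pick_one_per_folder can be used without the library.
    from google.cloud import storage

    client = storage.Client()
    return [blob.name for blob in client.list_blobs(bucket_name, prefix=prefix)]


def read_lines(spark, bucket_name, object_names):
    """Read all selected files in one Spark read instead of one read per file."""
    paths = [f"gs://{bucket_name}/{name}" for name in object_names]
    if not paths:
        return []
    # spark.read.text accepts a list of paths as well as a single path.
    return [row.value for row in spark.read.text(paths).collect()]
```

With the names from the question, this would be used roughly as follows: names = list_object_names('stb_data', 'data/'), then wanted = pick_one_per_folder(names), then data = read_lines(spark, 'stb_data', wanted). This replaces the 815 separate list_blobs calls and 815 separate Spark reads with one listing and one read. Whether that removes most of the delay here is an assumption worth checking by timing each step.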