Python: retrieving API data into a dataframe with the multithreading module

python pandas concurrent.futures

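I am using a third-party API to retrieve 10-minute data for different tags over a large number of days. Depending on the number of days and tags, the current extraction can take several minutes. I am therefore trying multithreading, which I understand is well suited to heavy I/O operations.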

The API call looks like this (I have replaced the actual API name):

import numpy as N 
import requests as r 
import json 
import pandas as pd
from datetime import datetime 
import concurrent.futures

  
class pyGeneric: 
  
    def __init__(self, serverName, apiKey, rootApiUrl='/Generic.Services/api'): 
        """ 
        Initialize a connection to server, and return a pyGeneric server object 
        """ 
        self.baseUrl = serverName + rootApiUrl 
        self.apiKey = apiKey 
        self.bearer = 'Bearer ' + apiKey 
        self.header = {'mediaType':'application/json','Authorization':self.bearer} 
  
    def getRawMeasurementsJson(self, tag, start, end):
        apiQuery = '/measurements/' + tag + '/from/' + start + '/to/' + end + '?format=json' 
        dataresponse = r.get(self.baseUrl+apiQuery, headers=self.header) 
        data = json.loads(dataresponse.text) 
        return data 
                                                               
                                
    def getAggregatesPandas(self, tags, start, end):
        """        
        Return tag(s) in a pandas dataFrame
        """
        df = pd.DataFrame()
        if type(tags) == str:
            tags = [tags]
        for tag in tags:
            tempJson =  self.getRawMeasurementsJson(tag, start, end)
            tempDf = pd.DataFrame(tempJson['timeSeriesList'][0]['timeSeries'])
            name = tempJson['timeSeriesList'][0]['measurementName']
            df['TimeUtc'] = [datetime.fromtimestamp(i/1000) for i in tempDf['t']]
            df['TimeUtc'] = df['TimeUtc'].dt.round('min')
            df[name] = tempDf['v']
        return df
    

gener = pyGeneric('https://api.generic.com', 'auth_keymlkj9789878686')

An example call to the API looks like this: gener_df = gener.getAggregatesPandas('tag1.10m.SQL', '*-10d', '*')

This works for a single tag, but for a list of tags it takes much longer, which is why I have been trying the following:

tags = ['tag1.10m.SQL',
'tag2.10m.SQL',
'tag3.10m.SQL',
'tag4.10m.SQL',
'tag5.10m.SQL',
'tag6.10m.SQL',
'tag7.10m.SQL',
'tag8.10m.SQL',
'tag9.10m.SQL',
'tag10.10m.SQL']

startdate = "*-150d"
enddate = '*'

final_df = pd.DataFrame

with concurrent.futures.ThreadPoolExecutor() as executor:
    args = ((i,startdate, enddate) for i in tags)
    executor.map(lambda p: gener.getAggregatesPandas(*p), args)

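However, I have no way of checking whether gener.getAggregatesPandas executed correctly. Ultimately I would like the results in a dataframe called final_df, but I am not sure how to proceed. I read in this post that calling append inside the context manager causes quadratic copying of the dataframe, which would end up slowing things down.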

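You can try the approach below. Provided the server can handle the load, it makes it easy to issue a large number of requests in parallel.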

# it's just a wrapper around concurrent.futures ThreadPoolExecutor with a nice tqdm progress bar!
from tqdm.contrib.concurrent import thread_map, process_map  # for multi-threading and multi-processing, respectively

def chunk_list(lst, size):
    """
    From SO only; 
    Yield successive n-sized chunks from list.
    """
    for i in range(0, len(lst), size):
        yield lst[i:i + size]

for idx, my_chunk in enumerate(chunk_list(huge_list, size=2**12)):
    for response in thread_map(<which_func_to_call>, my_chunk, max_workers=your_cpu_cores+6):
        # which_func_to_call -> wrap the returned response json obj in this, etc
        # do something with the response now..
        # make sure to cache the chunk results as well

Now we can use this function instead. NB: my_new_func now takes a single argument.
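A single-argument wrapper of this kind could be built with `functools.partial`, roughly as sketched here. This is only an illustration under stated assumptions: `fetch_tag` is a hypothetical stand-in for `gener.getAggregatesPandas`, and the standard-library `ThreadPoolExecutor` is used in place of `thread_map` so the snippet runs without tqdm installed.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def fetch_tag(tag, start, end):
    """Hypothetical stand-in for gener.getAggregatesPandas(tag, start, end)."""
    return {"tag": tag, "start": start, "end": end}


# Freeze the constant start/end arguments; the resulting callable
# only needs the tag, so it can be passed straight to map().
my_new_func = partial(fetch_tag, start="*-150d", end="*")

tags = ["tag1.10m.SQL", "tag2.10m.SQL", "tag3.10m.SQL"]

with ThreadPoolExecutor(max_workers=4) as executor:
    # map() yields results in the same order as `tags`.
    responses = list(executor.map(my_new_func, tags))
```

Because `thread_map` also takes a function and an iterable, `my_new_func` should be usable there in the same position.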

Edit 2:

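For caching, I would suggest using the csv module and writing the responses you want to a CSV file rather than going through pandas. Alternatively, you can dump the JSON responses as needed. Sample code for a JSON/dict-like response looks like this: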

import csv
import os

with open(OUTPUT_FILE_NAME, "a+", newline="") as csvfile:
    # fieldnames = [your_headers_list]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    # Make sure you write the header only once as we are opening the file in append mode (writer.writeheader())
    for idx, my_chunk in enumerate(chunk_list(<huge_list>, size=CHUNK_SIZE)):
        for response in thread_map(
            <my_partial_wrapped_func>, my_chunk, max_workers=min(32, os.cpu_count() + 6)
        ):
            # .......
            # .......
            writer.writerow(<row_of_the_csv_as_a_dict_with_fieldnames_as_keys>)
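To make the caching idea concrete, here is a minimal runnable sketch of the same pattern with the placeholders filled in by made-up pieces. `fetch_row`, `cache_to_csv`, and the `tag`/`value` columns are assumptions for illustration, not part of the real API, and the standard-library `ThreadPoolExecutor` stands in for `thread_map`.

```python
import csv
import os
from concurrent.futures import ThreadPoolExecutor


def chunk_list(lst, size):
    """Yield successive size-sized chunks from lst."""
    for i in range(0, len(lst), size):
        yield lst[i:i + size]


def fetch_row(tag):
    """Hypothetical stand-in for one API call; returns one CSV row as a dict."""
    return {"tag": tag, "value": len(tag)}


def cache_to_csv(tags, path, chunk_size=2):
    fieldnames = ["tag", "value"]
    # The file is opened in append mode, so only write the header
    # when the file does not exist yet or is still empty.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    workers = min(32, (os.cpu_count() or 1) + 4)
    with open(path, "a", newline="") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        with ThreadPoolExecutor(max_workers=workers) as executor:
            for chunk in chunk_list(tags, chunk_size):
                for row in executor.map(fetch_row, chunk):
                    writer.writerow(row)
                # Push the finished chunk out of Python's write buffer
                # before starting the next one.
                csvfile.flush()
```

Calling `cache_to_csv` a second time with the same path appends rows without repeating the header, so an interrupted run can be resumed from the tags that are not yet in the file.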

If I understand correctly, you need to know whether getAggregatesPandas executed correctly.

You can do something like the following:

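with concurrent.futures.ThreadPoolExecutor() as executor:
    args = ((i, startdate, enddate) for i in tags)
    # map() yields results in input order; an exception raised in a worker
    # is re-raised here when its result is consumed
    results = executor.map(lambda p: gener.getAggregatesPandas(*p), args)
    frames = list(results)
# DataFrame.append returns a new frame (and was removed in pandas 2.0),
# so combine the per-tag frames once with pd.concat instead
final_df = pd.concat(frames, ignore_index=False)
# another approach: executor.submit() each call and loop over
# concurrent.futures.as_completed(futures), calling f.result() on each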
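The `as_completed` variant can also report which tags succeeded and which failed, which is one way to check that each call ran correctly. The sketch below assumes a hypothetical `fetch_tag` in place of `gener.getAggregatesPandas`, and makes one tag fail on purpose to show the error path.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_tag(tag, start, end):
    """Hypothetical stand-in for gener.getAggregatesPandas; fails for one tag."""
    if tag == "bad.10m.SQL":
        raise ValueError(f"no data for {tag}")
    return [(tag, start, end)]


tags = ["tag1.10m.SQL", "bad.10m.SQL", "tag2.10m.SQL"]
results, errors = {}, {}

with ThreadPoolExecutor() as executor:
    # submit() returns one Future per call; remember which tag it belongs to.
    future_to_tag = {
        executor.submit(fetch_tag, tag, "*-150d", "*"): tag for tag in tags
    }
    # as_completed() yields futures as they finish, not in submission order.
    for future in as_completed(future_to_tag):
        tag = future_to_tag[future]
        try:
            # result() re-raises any exception raised inside the worker.
            results[tag] = future.result()
        except Exception as exc:
            errors[tag] = exc

print("succeeded:", sorted(results))
print("failed:", {tag: str(exc) for tag, exc in errors.items()})
```

Keeping failures in a separate dict lets the successful frames be combined while the failed tags are retried or logged.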

Reference video:

Thanks for your reply and for the tqdm library, which I didn't know about! I'm not entirely sure how to adapt your code to my problem. I understand that chunk_list is a generator that processes the input list, but there are actually three arguments and I don't know how to pass them in here. Finally, how should I cache the chunk results?

@amphinomos If your start and end dates are constant, you can also use functools.partial to create a pseudo-function and use that instead. I have updated the answer, take a look!

Thanks again for the effort of writing the code. I understand it leans more toward parallel processing than multithreading. Honestly, it goes over my head, and I haven't been able to make it work with my existing code.

I was able to get it working with final_df = pd.DataFrame(columns=['TimeUtc']) before the context manager and then final_df = pd.merge(final_df, result_df, on='TimeUtc', how='outer') inside the for loop.

@amphinomos That is exactly what you want to do, but it will lead to quadratic copying, as you are already aware. I think I misunderstood your requirement; on re-reading your question, that answer fits what you asked for. Rather than appending the results with pandas one at a time, collect them in a list and combine them in a single call. The answer I shared helps you make parallel requests to the API faster and write the results to a CSV file instead of using pandas, because the slow part is not the merging of the dataframes but the response time. That was the point of my answer.
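The "collect in a list, then combine once" suggestion from the comments could look roughly like this with pandas. It is a sketch under assumptions: `fake_frame` is a hypothetical helper that produces frames shaped like the output of getAggregatesPandas (a TimeUtc column plus one value column per tag), and `pd.concat` along the columns is swapped in for the repeated `pd.merge` calls.

```python
import pandas as pd


def fake_frame(name, values):
    """Hypothetical per-tag result: a TimeUtc column plus one value column."""
    times = pd.date_range("2021-01-01", periods=len(values), freq="10min")
    return pd.DataFrame({"TimeUtc": times, name: values})


# In the real code these would be the frames returned by the worker threads.
frames = [
    fake_frame("tag1", [1.0, 2.0, 3.0]),
    fake_frame("tag2", [10.0, 20.0]),
]

# Index every frame by time, then align all of them in one concat call
# (an outer join on the index) instead of merging inside the loop.
final_df = (
    pd.concat([f.set_index("TimeUtc") for f in frames], axis=1, join="outer")
    .rename_axis("TimeUtc")
    .reset_index()
)
```

Timestamps missing from a tag show up as NaN in that tag's column, which matches what an outer merge on TimeUtc would give.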