How to use update_id in an API call in Python and iterate until all data is collected

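I am using the Anomali Threatstream API, which returns at most 1000 rows per call. However, I am trying to pull all of the results for my query. The data comes back as JSON, which is easy to work with and converts nicely into a dataframe. The documentation recommends using update_id and iterating.

The API documentation says, under "Using update_id to retrieve large intelligence datasets with the Intelligence API": if the total number of results is greater than 10,000, Anomali recommends using update_id to return the complete dataset by iterating through API calls. Using the update_id method ensures that large datasets are retrieved without impacting performance.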

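This method involves appending the following to the API call: update_id__gt=0&order_by=update_id

After making the first call, find the update_id of the last result returned. In the next API call, use that value for the update_id__gt variable. Repeat this process until no more results are returned.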

My API call looks like this:

response = requests.get("https://api.threatstream.com/api/v2/intelligence/?username=<username>&api_key=<api_key>&created_ts__gte=2019-01-01T00:00:00.000Z&created_ts__lte=2019-02-28T23:59:59.999Z&tags.name=ingestedemails")

My current code looks like this:

import requests
import json
import pandas as pd

#API call
response = requests.get("https://api.threatstream.com/api/v2/intelligence/?username=<username>&api_key=<api_key>&created_ts__gte=2019-01-01T00:00:00.000Z&created_ts__lte=2019-02-28T23:59:59.999Z&tags.name=ingestedemails")

#Load data(json format) from API request
data = json.loads(response.text)
values = data['objects']

#Convert from json format to pandas dataframe
df = pd.DataFrame.from_dict(values, orient='columns')
df = df[['created_ts','value','source']]

I assume this calls for a loop. What would that look like?

If you pass the parameters in as a dict, you can let requests.get itself build the query string.
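To make the dict idea concrete, here is a minimal sketch of how the paging fields can be added to a parameter dict and turned into a query string. It uses the standard library's urlencode purely for illustration; requests does its own encoding when given params=, and the helper name with_cursor is hypothetical, not part of any library. The parameter values are the placeholders from the question.

```python
from urllib.parse import urlencode

# Placeholder values copied from the question; replace with real credentials.
parameters = {
    "username": "<username>",
    "api_key": "<api_key>",
    "created_ts__gte": "2019-01-01T00:00:00.000Z",
    "tags.name": "ingestedemails",
}


def with_cursor(params, last_update_id):
    """Hypothetical helper: return a copy of params with the paging fields set."""
    paged = dict(params)  # copy, so the caller's dict is left untouched
    paged["update_id__gt"] = last_update_id
    paged["order_by"] = "update_id"
    return paged


# Build the query string for the first page (cursor starts at 0)
query = urlencode(with_cursor(parameters, 0))
print(query)
```

Keeping the parameters in a dict means only one value, update_id__gt, has to change between calls, instead of rebuilding the whole URL by hand.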

A method to get the data could look a bit like this:

import pandas as pd
import requests


def get_data(url, parameters):
    # Work on a copy so the caller's dict is not modified
    parameters = parameters.copy()
    parameters["update_id__gt"] = 0
    parameters["order_by"] = "update_id"
    while True:
        response = requests.get(url, params=parameters)
        response.raise_for_status()

        values = response.json()["objects"]
        if not values:  # empty "objects" list, or another sign there are no further results
            return

        df = pd.DataFrame(values)

        # copy() the relevant piece, otherwise pandas might keep a reference to the whole dataframe
        yield df[["created_ts", "value", "source"]].copy()

        # Continue after the last update_id seen on this page
        parameters["update_id__gt"] = df["update_id"].iloc[-1]

It can then be called like this, using pandas.concat to combine the partial results:

if __name__ == "__main__":
    url = "https://api.threatstream.com/api/v2/intelligence/"
    parameters = {
        "username": "<username>",
        "api_key": "<api_key>",
        "created_ts__gte": "2019-01-01T00:00:00.000Z",
        "created_ts__lte": "2019-02-28T23:59:59.999Z",
        "tags.name": "ingestedemails",
    }
    all_data = pd.concat(get_data(url, parameters))

This code is untested, so it may need some adjustments.

Here is another approach:

import pandas as pd
import requests

# Variables that feed the API call
url = 'https://api.threatstream.com/api/v2/intelligence/'
username = '<username>'
api_key = '<api_key>'
start_date_time = '2019-01-01T00:00:00.000Z'
end_date_time = '2019-02-28T23:59:59.999Z'
source = '28'

params = {
    'username': username,
    'api_key': api_key,
    'created_ts__gte': start_date_time,
    'created_ts__lte': end_date_time,
    'trustedcircles': source,
    'update_id__gt': 0,  # start from the beginning
    'order_by': 'update_id',
}

frames = []  # one dataframe per page of results

# Run until a page comes back with fewer than 1000 rows
while True:
    response = requests.get(url, params=params)
    values = response.json()['objects']  # list of result objects
    if not values:  # nothing left to fetch
        break
    df2 = pd.DataFrame(values)  # convert the new data to a dataframe
    frames.append(df2)
    if len(df2) < 1000:  # a short page is the last page
        break
    # update_id__gt is a strict "greater than" filter, so pass the last
    # update_id unchanged; adding 1 would skip the next record
    params['update_id__gt'] = df2['update_id'].iloc[-1]

# DataFrame.append was removed in pandas 2.0, so combine the pages with concat
df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

Thank you @Maarten Fabré, I know you were helping me blind.
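
Since neither version above could be tested against the real API, it may help to check the loop logic offline first. The following is a minimal sketch, not the Threatstream client: iter_pages and make_fake_fetch are hypothetical names, and the fake fetch function simply imitates an endpoint that honours update_id__gt and order_by=update_id with a fixed page size.

```python
def iter_pages(fetch, start_id=0):
    """Yield pages of records until fetch returns an empty list.

    fetch(update_id_gt) is assumed to return records sorted by update_id
    whose update_id is strictly greater than the argument.
    """
    last_id = start_id
    while True:
        page = fetch(last_id)
        if not page:  # no more results: stop iterating
            return
        yield page
        # The last record on the page becomes the cursor for the next call
        last_id = page[-1]["update_id"]


def make_fake_fetch(records, page_size):
    """Hypothetical in-memory stand-in for the API, for testing only."""
    ordered = sorted(records, key=lambda r: r["update_id"])

    def fetch(update_id_gt):
        newer = [r for r in ordered if r["update_id"] > update_id_gt]
        return newer[:page_size]

    return fetch


records = [{"update_id": i, "value": f"item-{i}"} for i in range(1, 8)]
pages = list(iter_pages(make_fake_fetch(records, page_size=3)))
all_rows = [row for page in pages for row in page]
print([len(p) for p in pages])  # [3, 3, 1]
print(len(all_rows))  # 7
```

In real use, fetch would wrap requests.get with the parameter dict and return the "objects" list from the response; the loop itself stays the same.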