Python: How to use update_id in an API call and iterate until all data is collected

python python-3.x pandas

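I am using the Anomali ThreatStream API, which returns at most 1000 rows per call. However, I am trying to pull all of the results for my call. The data comes back as JSON, which is easy to work with and converts nicely to a pandas DataFrame. The documentation suggests using update_id and iterating.

The API documentation, under "Using update_id to retrieve large intelligence data sets from the Intelligence API", says: if the total number of results is greater than 10,000, Anomali recommends using update_id to return the complete data set by iterating API calls. Using the update_id method ensures that large data sets are retrieved without affecting performance.

This method involves appending the following to the API call: update_id__gt=0&order_by=update_id

After making the first call, find the update_id of the last result returned. In the next API call, use this value for the update_id__gt variable. Repeat this process until no more results are returned.

My API call looks like this: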

response = requests.get("https://api.threatstream.com/api/v2/intelligence/?username=<username>&api_key=<api_key>&created_ts__gte=2019-01-01T00:00:00.000Z&created_ts__lte=2019-02-28T23:59:59.999Z&tags.name=ingestedemails")

My current code looks like this:

import requests
import json
from pandas.io.json import json_normalize
import pandas as pd

#API call
response = requests.get("https://api.threatstream.com/api/v2/intelligence/?username=<username>&api_key=<api_key>&created_ts__gte=2019-01-01T00:00:00.000Z&created_ts__lte=2019-02-28T23:59:59.999Z&tags.name=ingestedemails")

#Load data(json format) from API request
data = json.loads(response.text)
values = data['objects']

#Convert from json format to pandas dataframe
df = pd.DataFrame.from_dict(values, orient='columns')
df = df[['created_ts','value','source']]

I assume this calls for a loop. What would that look like?

If you pass the parameters in as a dict, requests.get can build the query string itself.
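To illustrate what a parameter dict turns into, the query string can be previewed without sending a request. This is only a sketch: it uses the standard library's urllib.parse.urlencode as a stand-in for the encoding requests applies to params, which should give an equivalent result for simple values like these. The helper name paging_query and the update_id value 12345 are made up for the example.

```python
from urllib.parse import urlencode


def paging_query(filters, last_update_id=0):
    """Return the query string for one page (hypothetical helper)."""
    params = dict(filters)  # copy so the caller's dict is not modified
    params["update_id__gt"] = last_update_id  # only rows newer than this id
    params["order_by"] = "update_id"  # keep pages in update_id order
    return urlencode(params)


# First call starts at 0; later calls pass the last update_id seen.
first = paging_query({"tags.name": "ingestedemails"})
later = paging_query({"tags.name": "ingestedemails"}, last_update_id=12345)
print(first)
print(later)
```

Only the update_id__gt value changes between calls, which is why keeping the parameters in a dict is convenient for the loop.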

A method to get the data could look a bit like this:

import requests
import pandas as pd


def get_data(url, parameters):
    """Yield one DataFrame per page, paging on update_id until no results remain."""
    parameters = parameters.copy()  # do not modify the caller's dict
    parameters["update_id__gt"] = 0
    parameters["order_by"] = "update_id"
    while True:
        response = requests.get(url, params=parameters)
        response.raise_for_status()  # stop on HTTP errors
        values = response.json()["objects"]
        if not values:  # assumption: an empty "objects" list means no further results
            return

        df = pd.DataFrame(values)

        # copy() the relevant piece, otherwise pandas might keep a reference to the whole dataframe
        yield df[["created_ts", "value", "source"]].copy()

        # the next call asks only for rows after the last update_id of this page
        parameters["update_id__gt"] = df["update_id"].iloc[-1]

You can then call it like this, using pandas.concat to combine the partial results:

if __name__ == "__main__":
    url = "https://api.threatstream.com/api/v2/intelligence/"
    parameters = {
        "username": "<username>",
        "api_key": "<api_key>",
        "created_ts__gte": "2019-01-01T00:00:00.000Z",
        "created_ts__lte": "2019-02-28T23:59:59.999Z",
        "tags.name": "ingestedemails",
    }
    all_data = pd.concat(get_data(url, parameters))

This code is untested, so it may need some adjustments.

Here is another approach:

import pandas as pd
import requests

#Variables that feed the API call
url = 'https://api.threatstream.com/api/v2/intelligence/?username='
username = '<username>'
api_key = '<api_key>'
start_date_time = '2019-01-01T00:00:00.000Z'
end_date_time = '2019-02-28T23:59:59.999Z'
source = '28'

last_update_id = '0' #Start with update_id__gt=0
pages = [] #One dataframe per API call

#Run until a call returns fewer than 1000 rows
while True:
    apiCall = url+username+'&api_key='+api_key+'&created_ts__gte='+start_date_time+'&created_ts__lte='+end_date_time+'&trustedcircles='+source+'&update_id__gt='+last_update_id+'&order_by=update_id'
    response = requests.get(apiCall)
    response.raise_for_status() #Stop on HTTP errors instead of looping forever
    values = response.json()["objects"] #List of result objects
    if not values: #No more results
        break
    df2 = pd.DataFrame(values) #Convert new data to dataframe
    pages.append(df2)
    if len(df2) < 1000: #A short page is the last page
        break
    #update_id__gt means strictly greater than, so reuse the last update_id without adding 1
    last_update_id = str(df2['update_id'].iloc[-1])

#Combine all pages (DataFrame.append was removed in pandas 2.0)
df = pd.concat(pages, ignore_index=True) if pages else pd.DataFrame()
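Both versions above need network access, which makes the stopping logic hard to check. The sketch below separates the loop from the HTTP call so it can be run offline. The names collect_all, fetch_page and fake_fetch are hypothetical, and the fake data only assumes that every result carries an update_id and that an exhausted query returns an empty list.

```python
def collect_all(fetch_page):
    """Call fetch_page(last_update_id) repeatedly until it returns no rows."""
    rows = []
    last_update_id = 0
    while True:
        page = fetch_page(last_update_id)
        if not page:  # empty page: nothing newer than last_update_id
            return rows
        rows.extend(page)
        # pages are ordered by update_id, so the last row has the highest id
        last_update_id = page[-1]["update_id"]


# Stand-in for the real API: 25 fake records, served 10 at a time.
FAKE_ROWS = [{"update_id": i, "value": f"item-{i}"} for i in range(1, 26)]


def fake_fetch(last_update_id, page_size=10):
    """Return up to page_size fake rows with update_id > last_update_id."""
    newer = [r for r in FAKE_ROWS if r["update_id"] > last_update_id]
    return newer[:page_size]


result = collect_all(fake_fetch)
print(len(result))  # 25
```

In real use, fetch_page would wrap requests.get with the update_id__gt parameter and return the "objects" list, and the collected list of dicts could be passed to pd.DataFrame.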

Thank you @Maarten Fabré, I know you were helping me blind.