Python: my program runs very slowly, and it also gets slower as it runs


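I'm pulling data from the Microsoft Academic Knowledge API and then using the JSON response as a dictionary to extract the information I need. As I do this, I add the information to a numpy array, and at the end I convert it to a pandas DataFrame for export. The program works fine, but it takes a very long time to run. It also seems to slow down as it runs: the first few passes through the loop take only a few seconds each, but later passes take minutes.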

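I've simplified the if-else statements as much as I can, which helped a little, but not enough to make a big difference. I've also reduced the number of queries to the API as far as possible. Each query can return at most 1000 results, but I need about 35000 results.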

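import numpy as np
import requests as req  # assumed: the post only shows req.get(), which matches the requests library

# loops and total_res are defined earlier in the script
rel_info = np.array([("Title", "Author_Name", "Journal_Published_In", "Date")])

for l in range(0, loops):  # loops is defined above to be 35
    offset = 1000 * l
    # keep track of progress
    print("Progress:" + str(round((offset/total_res)*100, 2)) + "%")
    # get data with request to MAK. 1000 is the max count
    url = "https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate?expr=And(Composite(AA.AfN=='brigham young university'),Y>=1908)&model=latest&count=1000&offset="+str(offset)+"&attributes=Ti,D,AA.DAfN,AA.DAuN,J.JN"
    response = req.get(url + '&subscription-key={key}')
    data = response.json()

    for i in range(0, len(data["entities"])):
        new_data = data["entities"][i]
        # get new data
        new_title = new_data["Ti"]                 # get title

        if 'J' not in new_data:                    # get journal; account for keys missing from the dictionary
            new_journ = ""
        else:
            new_journ = new_data["J"]["JN"] or ""

        new_date = new_data["D"]                   # get date

        new_auth = ""                              # get only authors affiliated with BYU; account for missing keys
        for j in range(0, len(new_data["AA"])):
            if 'DAfN' not in new_data["AA"][j]:
                new_auth = new_auth + ""
            else:
                if new_data["AA"][j]["DAfN"] == "Brigham Young University" and new_auth == "":     # possibly combine conditionals to make less complex
                    new_auth = new_data["AA"][j]["DAuN"]
                elif new_data["AA"][j]["DAfN"] == "Brigham Young University" and new_auth != "":
                    new_auth = new_auth + ", " + new_data["AA"][j]["DAuN"]

        # keep adding new data to the whole array
        new_info = np.array([(new_title, new_auth, new_journ, new_date)])
        rel_info = np.vstack((rel_info, new_info))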

Try fetching the results in a pool of worker threads, along these lines:

import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor() as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
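Applied to the paginated requests in the question, the thread-pool idea might look like the sketch below. The names `fetch_all_pages`, `fetch_page` and `fake_fetch_page` are hypothetical and not from the original post; the fetcher is passed in as an argument so the real HTTP call can be replaced by a stub. This only helps when the time is spent waiting on the network, and a remote API may throttle concurrent requests.

```python
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 1000  # maximum count per query, as stated in the question


def fetch_all_pages(fetch_page, n_pages, max_workers=4):
    """Call fetch_page(offset) for every page offset in a thread pool.

    fetch_page is a hypothetical callable that takes an offset and returns
    the list of entities for that page. Entities come back in page order.
    """
    offsets = [PAGE_SIZE * page for page in range(n_pages)]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the inputs
        pages = list(executor.map(fetch_page, offsets))
    # Flatten the per-page lists into one list of entities
    return [entity for page in pages for entity in page]


def fake_fetch_page(offset):
    # Stand-in for the real HTTP request: two dummy entities per page
    return [{"Ti": f"paper {offset}"}, {"Ti": f"paper {offset + 1}"}]


entities = fetch_all_pages(fake_fetch_page, n_pages=3)
print(len(entities))
```

In a real run, `fake_fetch_page` would be swapped for a function that issues the request for one offset and returns `response.json()["entities"]`.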

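I ended up solving this by changing how I add rows to the large array of collected data. Instead of appending one row per iteration, I build a temporary array that holds the 1000 rows from one request, and then append that temporary array to the full array. This brought the run time down to about a minute, from 43 minutes before.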

rel_info = np.array([("Title", "Author_Name", "Journal_Published_In", "Date")])

for req_num in range(0, loops):
    offset = 1000 * req_num
    # keep track of progress
    print("Progress:" + str(round((offset/total_res)*100, 2)) + "%")
    # get data with request to MAK. 1000 is the max count
    url = "https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate?expr=And(Composite(AA.AfN=='brigham young university'),Y>=1908)&model=latest&count=1000&offset="+str(offset)+"&attributes=Ti,D,AA.DAfN,AA.DAuN,J.JN"
    response = req.get(url + '&subscription-key={key}')

    data = response.json()

    for i in range(0, len(data["entities"])):
        new_data = data["entities"][i]
        # get new data
        new_title = new_data["Ti"]                 # get title

        if 'J' not in new_data:                    # get journal; account for keys missing from the dictionary
            new_journ = ""
        else:
            new_journ = new_data["J"]["JN"] or ""

        new_date = new_data["D"]                   # get date

        new_auth = ""                              # get only authors affiliated with BYU; account for missing keys
        for j in range(0, len(new_data["AA"])):
            if 'DAfN' not in new_data["AA"][j]:
                new_auth = new_auth + ""
            else:
                if new_data["AA"][j]["DAfN"] == "Brigham Young University" and new_auth == "":     # possibly combine conditionals to make less complex
                    new_auth = new_data["AA"][j]["DAuN"]
                elif new_data["AA"][j]["DAfN"] == "Brigham Young University" and new_auth != "":
                    new_auth = new_auth + ", " + new_data["AA"][j]["DAuN"]

        # here are the changes
        # keep adding to a temporary array for 1000 entities
        new_info = np.array([(new_title, new_auth, new_journ, new_date)])
        if i == 0:
            work_stack = new_info
        else:
            work_stack = np.vstack((work_stack, new_info))

    # add temporary array to whole array (this is to speed up the program)
    # skip empty pages so a stale work_stack is never added twice
    if data["entities"]:
        rel_info = np.vstack((rel_info, work_stack))
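The per-entity extraction in the loop above could also be written more compactly. The following is only a sketch, not the author's code: `extract_row` is a hypothetical helper, the sample entity is made up, and unlike the original it silently tolerates a missing "JN" or "AA" key instead of raising `KeyError`.

```python
def extract_row(entity, affiliation="Brigham Young University"):
    """Turn one API entity (a dict) into (title, authors, journal, date)."""
    title = entity["Ti"]
    # "J" may be absent, and "JN" may be absent or empty
    journal = entity.get("J", {}).get("JN") or ""
    date = entity["D"]
    # Keep only authors whose displayed affiliation matches, comma-separated
    authors = ", ".join(
        author["DAuN"]
        for author in entity.get("AA", [])
        if author.get("DAfN") == affiliation
    )
    return (title, authors, journal, date)


# Made-up entity with the attribute keys used in the question
sample = {
    "Ti": "a sample title",
    "D": "2001-05-01",
    "J": {"JN": "sample journal"},
    "AA": [
        {"DAuN": "A. Author", "DAfN": "Brigham Young University"},
        {"DAuN": "B. Author", "DAfN": "Another University"},
        {"DAuN": "C. Author"},
        {"DAuN": "D. Author", "DAfN": "Brigham Young University"},
    ],
}
row = extract_row(sample)
print(row)
```

Using `", ".join(...)` removes the need to track whether `new_auth` is still empty, which is what the two nearly identical `if`/`elif` branches were doing.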

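Can you show in your question exactly where the slowdown happens? Maybe the remote API is getting annoyed with you and throttling your requests?

I went back, printed some timings and watched it run, and found that the increase comes from my use of numpy's vstack function. As the array gets larger, stacking takes longer and longer. But I still don't know how to get around this, since I still need to append any newly extracted information to the larger array.

Don't use an array. It doesn't make sense in this case; it looks like you are working with strings? Just use a plain list and .append, which gives you linear time, whereas using numpy arrays with vstack makes this algorithm quadratic.

What do you mean by "export with pandas"? In any case, you can build a pandas DataFrame from a list of lists, but are you only using pandas to dump to CSV?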
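The list-based approach suggested in the comments could look roughly like this. It is a sketch under assumptions: `collect_rows` and `rows_to_csv` are hypothetical names, the page data is made up, and the CSV is written with the standard library rather than pandas to illustrate the last comment's point.

```python
import csv
import io

HEADER = ("Title", "Author_Name", "Journal_Published_In", "Date")


def collect_rows(pages):
    """Gather every row from every page into one plain list."""
    rows = [HEADER]
    for page in pages:
        for row in page:
            # list.append is amortized constant time; earlier rows are not copied
            rows.append(row)
    return rows


def rows_to_csv(rows):
    """Write the rows as CSV text using only the standard library."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up pages standing in for the per-request results
pages = [
    [("t1", "a1", "j1", "2000-01-01"), ("t2", "", "", "2001-01-01")],
    [("t3", "a3", "j3", "2002-01-01")],
]
rows = collect_rows(pages)
csv_text = rows_to_csv(rows)
print(csv_text)

# With pandas installed, the same list can be turned into a DataFrame:
# pd.DataFrame(rows[1:], columns=rows[0])
```

Each `np.vstack` call copies the whole accumulated array, so appending n rows one at a time costs on the order of n² row copies, while the list version does on the order of n.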