Python: is there a way to make this code faster?

python pandas algorithm performance dataframe

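I have a pandas DataFrame with details about some papers, roughly 4 million records. I need to find the top 50 authors with the most publications in this dataset. The data comes in two files, so I have to read both into DataFrames and append them to get a single DataFrame. I only use the author column, because the other 32 columns are not needed.

So far I have come up with the solution below. This is an algorithms assignment, so I cannot use any built-in algorithms. Currently I use a dictionary to store the publication count of each author, and then I loop over the dict to get the authors with the most publications. A row can also contain several authors, like "Auth 1 | Auth 2 | Auth 3 |", which is why I split the string.

I would like to know whether there is a faster way to do all of this. Is there some way to find the maximum during the loop over the DataFrame? Again, I am not allowed to use built-in algorithms for searching or sorting. Any suggestions would help.

Thanks, everyone.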

import timeit as ti
from collections import defaultdict

import pandas as pd

start_time = ti.default_timer()

only_authors_article = pd.DataFrame(articles['author'])
only_authors_inproceedings = pd.DataFrame(proceedings['author'])
# pd.concat is used here because DataFrame.append was removed in pandas 2.0
all_authors = pd.concat([only_authors_article, only_authors_inproceedings], ignore_index = True)
all_authors = all_authors.dropna(how = 'any')

auth_dict = defaultdict(int)

for auth_list in zip(all_authors['author']):
    auth_list = auth_list[0]

    if '|' in auth_list:
        auths = auth_list.split('|')

        for auth in auths:
            auth_dict[auth] += 1
    else:
        auth_dict[auth_list] += 1


most_pub_authors = []

for i in range(0, 50):
    max_pub_count = 0
    max_pub_auth = None

    for author, pub_count in auth_dict.items(): 
        if pub_count > max_pub_count:
            max_pub_count = pub_count
            max_pub_auth = author

    most_pub_authors.append( (max_pub_auth, max_pub_count) ) 
    del auth_dict[max_pub_auth]

print(most_pub_authors) 


elapsed_time = ti.default_timer() - start_time
print("Total time taken: " + str(elapsed_time))

Edit 1: some sample data from all_authors

    author
0   Sanjeev Saxena
1   Hans Ulrich Simon
2   Nathan Goodman|Oded Shmueli
3   Norbert Blum
4   Arnold Schönhage
5   Juha Honkala
6   Christian Lengauer|Chua-Huang Huang
7   Alain Finkel|Annie Choquet
8   Joachim Biskup
9   George Rahonis|Symeon Bozapalidis|Zoltán Fülöp...
10  Alex Kondratyev|Maciej Koutny|Victor Khomenko|...
11  Wim H. Hesselink
12  Christian Ronse
13  Carol Critchlow|Prakash Panangaden
14  Fatemeh Ghassemi|Ramtin Khosravi|Rosa Abbasi
15  Robin Milner
16  John Darlington
17  Giuseppe Serazzi|M. Italiani|Maria Calzarossa
18  Vincent Vajnovszki
19  Christian Stahl|Richard Müller 0001|Walter Vogler
20  Luc Devroye
21  K. C. Tan|T. C. Hu
22  William R. Franta
23  Ekkart Kindler
24  Demetres D. Kouvatsos
25  Christian Lengauer|Sergei Gorlatch
26  Roland Meyer
27  Stefan Reisch
28  Erzsébet Csuhaj-Varjú|Victor Mitrana
29  Lila Kari|Manasi S. Kulkarni

The dictionary-building loop in the question is a complicated way of writing:

auth_dict = defaultdict(int)

for auth_list in all_authors['author']:
    for auth in auth_list.split('|'):
        auth_dict[auth] += 1

This may be faster:

Counter(itertools.chain.from_iterable(
    auth_list.split('|') for auth_list in all_authors['author']))

Here itertools is imported with "import itertools" and Counter with "from collections import Counter".
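To make the Counter suggestion above concrete, here is a minimal self-contained sketch. The rows list is made-up sample data in the same "A|B|C" format as the author column, and count_authors is a hypothetical helper name. Dropping empty names is an assumption on my part, added because the question mentions rows that end with a trailing "|".

```python
import itertools
from collections import Counter

# Hypothetical sample rows in the same 'A|B|C' format as the author column.
rows = [
    "Nathan Goodman|Oded Shmueli",
    "Christian Lengauer|Chua-Huang Huang",
    "Christian Lengauer|Sergei Gorlatch",
    "Robin Milner",
]


def count_authors(author_rows):
    """Count how often each author name appears across all rows."""
    names = itertools.chain.from_iterable(
        row.split("|") for row in author_rows
    )
    # Skip empty strings, which a trailing '|' would otherwise produce.
    return Counter(name for name in names if name)


auth_counts = count_authors(rows)
print(auth_counts["Christian Lengauer"])  # 2
```

On the real data you would pass all_authors['author'] instead of rows. Whether Counter counts as a "built-in algorithm" for the assignment is something to check with whoever set it.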

The top-50 loop goes through the whole dictionary many times. Try a single pass:

most_pub_authors = heapq.nlargest(50, auth_dict.items(), key=itemgetter(1))

where itemgetter comes from "from operator import itemgetter" (and heapq needs "import heapq").
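A small runnable illustration of that call, using a toy dictionary in place of the real auth_dict (the names and counts are invented for the example):

```python
import heapq
from operator import itemgetter

# Hypothetical counts standing in for auth_dict.
auth_dict = {"A": 5, "B": 12, "C": 7, "D": 1}

# Keep the 2 pairs with the largest count (the second element of each pair).
top_two = heapq.nlargest(2, auth_dict.items(), key=itemgetter(1))
print(top_two)  # [('B', 12), ('C', 7)]
```

heapq.nlargest returns the pairs ordered from largest to smallest, so no extra sorting step is needed afterwards.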

This part is the problem:

for i in range(0, 50):
    . . .
    for author, pub_count in auth_dict.items(): 
        . . .

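It iterates over the entire dataset 50 times.

Instead, you can use an accumulator approach: create a list for the top 50 authors, fill it with the first 50 authors you encounter, then iterate over auth_dict once and, whenever you find an element with a higher count than the lowest one in the list, replace that lowest element.

Something like this: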

top_authors = []
lowest_pub_count = 0
top_n = 50
for author, pub_count in auth_dict.items():
    if pub_count > lowest_pub_count:        # found element that is larger than the smallest in top-N so far
        if len(top_authors) < top_n:        # not reached N yet - just append to the list
            top_authors.append([author, pub_count])
            if len(top_authors) < top_n:    # keep lowest_pub_count at 0 until N is reached
                continue
        else:                               # replace the lowest element with the found one
            for i in range(len(top_authors)):
                if top_authors[i][1] == lowest_pub_count:
                    top_authors[i] = [author, pub_count]
                    break
        lowest_pub_count = pub_count
        for i in range(len(top_authors)):   # find the new lowest element
            if top_authors[i][1] < lowest_pub_count:
                lowest_pub_count = top_authors[i][1]
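Note that the loop above leaves top_authors in no particular order. If the result has to be listed from most to fewest publications and built-in sorting is off limits, a hand-written insertion sort over the 50 entries is cheap. The sketch below is one way to do that; sort_by_count_desc is a hypothetical helper name, and the sample list is invented.

```python
def sort_by_count_desc(pairs):
    """Insertion sort of [author, pub_count] pairs, highest count first."""
    ordered = list(pairs)  # copy so the caller's list is left untouched
    for i in range(1, len(ordered)):
        current = ordered[i]
        j = i - 1
        # Shift entries with a smaller count one slot to the right.
        while j >= 0 and ordered[j][1] < current[1]:
            ordered[j + 1] = ordered[j]
            j -= 1
        ordered[j + 1] = current
    return ordered


# Hypothetical output of the accumulator loop with top_n = 3.
top_authors = [["C", 7], ["B", 12], ["A", 5]]
print(sort_by_count_desc(top_authors))
# [['B', 12], ['C', 7], ['A', 5]]
```

Insertion sort is quadratic in the list length, but with only 50 entries that cost is negligible next to the pass over the full dictionary.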



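According to the answers in the linked question, iterating with .items() seems to be the slowest approach.

I tried "for author, pub_count in list(auth_dict.items())", but it took about 2 seconds longer than before. Thanks for the suggestion, though.

Can you post a sample of your data?

Which of the two parts takes the most time?

The first for loop, where I build the dictionary, takes longer.

"I am not allowed to use built-in algorithms for searching or sorting. Any suggestions would help." Then reimplement heapq.nlargest with a heap.

I think I can try that. Thanks for your help!

I tried this and it definitely runs faster! Thank you very much!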
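One comment suggests reimplementing heapq.nlargest with a heap. The following is a rough sketch of what that could look like without the heapq module, not the actual heapq implementation. It keeps a min-heap of at most n pairs, so the weakest current candidate is always at index 0, and a dictionary with M authors costs about O(M log n) instead of O(M * n). All function names here (_sift_down, _sift_up, top_n_authors) are made up for the example.

```python
def _sift_down(heap, pos):
    """Move heap[pos] down until both children have counts >= its own."""
    size = len(heap)
    while True:
        left = 2 * pos + 1
        right = left + 1
        smallest = pos
        if left < size and heap[left][1] < heap[smallest][1]:
            smallest = left
        if right < size and heap[right][1] < heap[smallest][1]:
            smallest = right
        if smallest == pos:
            return
        heap[pos], heap[smallest] = heap[smallest], heap[pos]
        pos = smallest


def _sift_up(heap, pos):
    """Move heap[pos] up while its count is below its parent's."""
    while pos > 0:
        parent = (pos - 1) // 2
        if heap[pos][1] >= heap[parent][1]:
            return
        heap[pos], heap[parent] = heap[parent], heap[pos]
        pos = parent


def top_n_authors(auth_dict, n):
    """Return the n (author, count) pairs with the highest counts, descending."""
    if n <= 0:
        return []
    heap = []  # min-heap on count; heap[0] is the weakest of the current top n
    for author, count in auth_dict.items():
        if len(heap) < n:
            heap.append((author, count))
            _sift_up(heap, len(heap) - 1)
        elif count > heap[0][1]:
            # Better than the weakest candidate: replace it and restore order.
            heap[0] = (author, count)
            _sift_down(heap, 0)
    # Pop the minimum repeatedly; filling from the back gives descending order.
    result = [None] * len(heap)
    for i in range(len(heap) - 1, -1, -1):
        result[i] = heap[0]
        last = heap.pop()
        if heap:
            heap[0] = last
            _sift_down(heap, 0)
    return result


# Hypothetical counts standing in for auth_dict.
counts = {"A": 5, "B": 12, "C": 7, "D": 1, "E": 9}
print(top_n_authors(counts, 3))  # [('B', 12), ('E', 9), ('C', 7)]
```

Because the final loop pops the smallest entry each time, the result comes out already ordered, which avoids a separate sorting step.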