Reading and processing a large CSV with Python


I have a question similar in spirit to this one. Still, I can't seem to find a suitable solution.

Input: I have CSV data like the following

id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500
Output: from this I just need to output, for each drug, the total cost of all its prescriptions, and I need to count the number of unique prescribers.

drug_name,num_prescriber,total_cost
AMBIEN,2,300.0
CHLORPROMAZINE,2,3000.0
BENZTROPINE MESYLATE,1,1500.0
I can do this easily in Python. However, when I try to run the code against a larger (1 GB) input, it does not finish in a reasonable amount of time.

import sys, csv

def duplicate_id(id, id_list):
    if id in id_list:
        return True
    else:
        return False

def write_file(d, output):
    path = output
    # path = './output/top_cost_drug.txt'
    with open(path, 'w', newline='') as csvfile:
        fieldnames = ['drug_name', 'num_prescriber', 'total_cost']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for key, value in d.items():
            print(key, value)
            writer.writerow({'drug_name': key, 'num_prescriber': len(value[0]), 'total_cost': sum(value[1])})

def read_file(data):
    # TODO: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3
    # drug_name -> ([prescriber_ids], [drug_costs]) accumulated per drug
    drug_info = {}
    with open(data) as csvfile:
        readCSV = csv.reader(csvfile, delimiter=',')
        next(readCSV)
        for row in readCSV:
            prescriber_id = row[0]
            prescribed_drug = row[3]
            prescribed_drug_cost = float(row[4])

            if prescribed_drug not in drug_info:
                drug_info[prescribed_drug] = ([prescriber_id], [prescribed_drug_cost])
            else:
                if not duplicate_id(prescriber_id, drug_info[prescribed_drug][0]):
                    drug_info[prescribed_drug][0].append(prescriber_id)
                    drug_info[prescribed_drug][1].append(prescribed_drug_cost)
                else:
                    drug_info[prescribed_drug][1].append(prescribed_drug_cost)
    return(drug_info)

def main():
    data = sys.argv[1]
    output = sys.argv[2]
    drug_info = read_file(data)
    write_file(drug_info, output)

if __name__ == "__main__":
    main()

I'm having trouble figuring out how to refactor this to handle larger inputs, and I'm hoping someone can take a look and give me some suggestions on how to approach the problem.

Lists are inefficient for membership testing, especially once they are thousands of entries long, because each check costs O(n). Use a set to store your prescriber IDs instead, which brings the membership test down to O(1).
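A minimal sketch of what that change might look like (my own illustration, not code from the question): drug_info maps each drug name to a set of prescriber IDs plus a running cost total, so there is no per-row list scan and no final sum().

import csv

def read_file(data):
    # drug_name -> [set of prescriber ids, running total cost]
    drug_info = {}
    with open(data, newline='') as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # skip the header row
        for row in reader:
            prescriber_id = row[0]
            drug = row[3]
            cost = float(row[4])
            if drug not in drug_info:
                drug_info[drug] = [set(), 0.0]
            drug_info[drug][0].add(prescriber_id)  # set.add() is O(1) on average
            drug_info[drug][1] += cost             # accumulate instead of storing every cost
    return drug_info

write_file would then write len(value[0]) and value[1] directly, instead of len(value[0]) and sum(value[1]).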


If you can use pandas, try the following approach. Pandas reads your file and stores it in a DataFrame, which is much faster than processing the file by hand with iterators.

import pandas as pd

# Read the whole file into a DataFrame and keep only the columns we need
df = pd.read_csv('sample_data.txt')

columns = ['id', 'drug_name', 'drug_cost']
df1 = df[columns]

# One group per drug: count() gives rows per group, sum() gives column totals
gd = df1.groupby('drug_name')
cnt = gd.count()
s = gd.sum()

# Join the sums and the counts side by side (lsuffix marks the sum columns)
out = s.join(cnt, lsuffix='x')
out['total_cost'] = out['drug_costx']     # summed drug_cost per drug
out['num_prescriber'] = out['drug_cost']  # row count per drug
fout = out[['num_prescriber', 'total_cost']]

fout.to_csv('out_data.csv')

I get the following output

drug_name,num_prescriber,total_cost
AMBIEN,2,300
BENZTROPINE MESYLATE,1,1500
CHLORPROMAZINE,2,3000

Hope this helps.
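As a side note (my own sketch, not part of the answer above): the same result can be produced with a single named aggregation, and using nunique on the id column keeps the prescriber count correct even if the same prescriber appears more than once for a drug. The file names are just the ones used above.

import pandas as pd

df = pd.read_csv('sample_data.txt')

# One pass per drug: count distinct prescriber ids and sum the costs
result = df.groupby('drug_name').agg(
    num_prescriber=('id', 'nunique'),
    total_cost=('drug_cost', 'sum'),
)

result.to_csv('out_data.csv')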

I'd suggest profiling your code, but just from looking at it I suspect most of the time is spent in duplicate_id(). Either way, profile it.
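For completeness, one way to profile it (my own sketch; it assumes these lines are dropped into the question's script so read_file is defined, and the input file name is a placeholder):

import cProfile
import pstats

# Run read_file under the profiler and write the raw stats to a file
cProfile.run("read_file('sample_data.txt')", 'read_stats')

# Print the ten entries with the largest cumulative time
pstats.Stats('read_stats').sort_stats('cumulative').print_stats(10)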