Need to compare very large files (around 1.5 GB) in Python

Tags: python, csv, numpy, pandas, large-data-volumes

Here is some sample data. The data is sorted by email address, and the file is very large, around 1.5 GB.

I want output in another CSV file, something like this:

"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
"Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
"DF","0000650000@YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
"Bus","00009.GAURAV@GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
"Rail","0000.ANU@GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
"Rail","0000MANNU@GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
"Rail","0000PRANNOY0000@GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
"DF","0000PRANNOY0000@YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
"Rail","0000RAHUL@GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
"DF","0000SS0@GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
"DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
"DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
"Rail","000AYUSH@GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
"Rail","000AYUSH@GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"

i.e. when an entry occurs for the first time I need to append 1, and if it occurs a second time I need to append 2; in other words, I need to count the occurrences of each email address in the file, and when an email appears two or more times I also want the difference between the dates. Keep in mind that the dates are not sorted, so they also have to be sorted for each particular email address. I am looking for a Python solution, using numpy, pandas or any other library, that can handle this kind of volume without running out of memory. I have CentOS 6.3 with a dual-core processor and 4 GB of RAM.

Use a built-in database: you can insert the data, then sort and group it as needed, and working with files larger than the available RAM is not a problem.
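A minimal sketch of that built-in database suggestion, using Python's standard sqlite3 module (input.csv, output.csv, transactions.db and the column names are hypothetical; dates are stored in ISO form so the database can sort them, and the count plus the day difference to the previous occurrence of the same email, as in the example output, are computed while streaming the sorted rows back out):

    import csv
    import sqlite3
    from datetime import datetime

    conn = sqlite3.connect('transactions.db')
    conn.execute("CREATE TABLE IF NOT EXISTS booking "
                 "(kind TEXT, email TEXT, ref TEXT, date TEXT, chan TEXT, amount TEXT)")

    # load the CSV; convert 26JUL2010 -> 2010-07-26 so ORDER BY sorts chronologically
    with open('input.csv', newline='') as f:
        rows = ((r[0], r[1], r[2],
                 datetime.strptime(r[3], '%d%b%Y').strftime('%Y-%m-%d'),
                 r[4], r[5]) for r in csv.reader(f))
        conn.executemany("INSERT INTO booking VALUES (?,?,?,?,?,?)", rows)
    conn.commit()

    # stream the rows back sorted by email and date; the occurrence count and the
    # day difference to the previous occurrence are computed on the fly
    with open('output.csv', 'w', newline='') as out:
        writer = csv.writer(out, quoting=csv.QUOTE_ALL)
        prev_email, prev_date, count = None, None, 0
        for kind, email, ref, date, chan, amount in conn.execute(
                "SELECT * FROM booking ORDER BY email, date"):
            d = datetime.strptime(date, '%Y-%m-%d')
            count = 1 if email != prev_email else count + 1
            diff = 0 if count == 1 else (d - prev_date).days
            prev_email, prev_date = email, d
            writer.writerow([kind, email, ref, date, chan, amount,
                             count, '%d days' % diff])

The database does the sorting and grouping on disk; only one row at a time is held in Python, so a file larger than RAM is fine.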

Make sure you have pandas 0.11, and read the HDFStore docs and the cookbook recipes (in particular "merging millions of rows").

Here is a solution that seems to work. This is the workflow:

1) Read the data from the CSV in chunks and append it to an HDFStore.
2) Iterate over that store, creating another store that applies the combiner.

Essentially, we take a chunk from the table and combine it with a chunk from every other part of the file. The combiner function does not reduce; instead it computes the function (the difference in days) between all elements in that chunk, eliminating duplicates along the way and keeping the latest data after each loop. Rather like a recursive reduce.

This should be O(number of chunks**2) in memory and computation time. In your case the chunksize could be 1 million rows (or more).

"DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2",1,0 days
"Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","B2C","2025",1,0 days
"DF","0000650000@YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792",1,0 days
"Bus","00009.GAURAV@GMAIL.COM","NU27012932319739","26JAN2013","B2C","800",1,0 days
"Rail","0000.ANU@GMAIL.COM","NR251764697526","24JUN2011","B2C","595",1,0 days
"Rail","0000MANNU@GMAIL.COM","NR251277005737","29OCT2011","B2C","957",1,0 days
"Rail","0000PRANNOY0000@GMAIL.COM","NR251297862893","21NOV2011","B2C","212",1,0 days
"DF","0000PRANNOY0000@YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080",1,0 days
"Rail","0000RAHUL@GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731",1,0 days
"DF","0000SS0@GMAIL.COM","NF251355775967","10MAY2011","B2C","2000",1,0 days
"DF","0001HARISH@GMAIL.COM","NF251352240086","09DEC2010","B2C","4006",1,0 days
"DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000",2,3 days
"DF","0001HARISH@GMAIL.COM","NF252022031180","22DEC2010","B2C","3439",3,10 days
"Rail","000AYUSH@GMAIL.COM","NR2151213260036","28NOV2012","B2C","41",1,0 days
"Rail","000AYUSH@GMAIL.COM","NR2151313264432","29NOV2012","B2C","96",2,1 days
"Rail","000AYUSH@GMAIL.COM","NR2151413266728","29NOV2012","B2C","96",3,0 days
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96",4,9 days
"Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96",5,0 days
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96",6,4 days
"Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96",7,0 days
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136",8,44 days
"Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136",9,0 days
Another possible (sysadmin-style) way, avoiding databases and SQL queries as well as heavy demands on runtime processes and hardware resources:

Update 20/04: added more code and a simplified approach:

  • Sort the file with UNIX sort, using the email and the new timestamp field
    (i.e. sort -k2 -k4 -n -t, output_file). The timestamp field is the one
    added by the gawk command shown at the bottom of this answer.
  • Initialize 3 variables: EMAIL, PREV_TIME and COUNT.
  • Iterate over each line; whenever a new email is encountered, append "1,0 days"
    and update PREV_TIME=timestamp, COUNT=1, EMAIL=new_email.
    (A Python sketch of this single pass follows the list.)
  • For the next line there are 3 possible cases:
    • a) same email, different timestamp: calculate the days, increment COUNT by 1,
      update PREV_TIME, append "count,diff of days"
    • b) same email, same timestamp: increment COUNT, append "count,0 days"
    • c) new email: start again from step 3
  • An alternative to step 1 is to add the new timestamp field and remove it when
    printing out the line.

    Note: if 1.5 GB is too large to sort in one go, split the file into smaller
    chunks, using the email as the split point. You can run those chunks in
    parallel on different machines.
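A minimal Python sketch of that single pass, assuming the file has already been sorted by email and date (for example with the gawk + sort command at the bottom of this answer); sorted.csv and annotated.csv are hypothetical file names, and the original date field is parsed directly rather than the added timestamp:

    import csv
    from datetime import datetime

    prev_email, prev_date, count = None, None, 0
    with open('sorted.csv', newline='') as src, \
         open('annotated.csv', 'w', newline='') as dst:
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
        for row in csv.reader(src):
            # row[1] is the email, row[3] the original date field like 26JUL2010
            email, date = row[1], datetime.strptime(row[3], '%d%b%Y')
            if email != prev_email:                  # case c) a new email address
                count, diff = 1, 0
            elif date == prev_date:                  # case b) same email, same timestamp
                count, diff = count + 1, 0
            else:                                    # case a) same email, later timestamp
                count, diff = count + 1, (date - prev_date).days
            prev_email, prev_date = email, date
            writer.writerow(row + [count, '%d days' % diff])

Only the current line is kept in memory, so this runs in O(n) regardless of file size. (The "processing [...]" lines, the table and the code that follow are the output and implementation of the pandas/HDFStore solution described earlier, not of this shell approach.)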

    processing [0] [datastore.h5]
    processing [1] [datastore_0.h5]
        count                date  diff                        email
    4       1 2011-06-24 00:00:00     0           0000.ANU@GMAIL.COM
    1       1 2011-06-24 00:00:00     0          00000.POO@GMAIL.COM
    0       1 2010-07-26 00:00:00     0           00000000@11111.COM
    2       1 2013-01-01 00:00:00     0         0000650000@YAHOO.COM
    3       1 2013-01-26 00:00:00     0       00009.GAURAV@GMAIL.COM
    5       1 2011-10-29 00:00:00     0          0000MANNU@GMAIL.COM
    6       1 2011-11-21 00:00:00     0    0000PRANNOY0000@GMAIL.COM
    7       1 2011-06-26 00:00:00     0  0000PRANNOY0000@YAHOO.CO.IN
    8       1 2012-10-25 00:00:00     0          0000RAHUL@GMAIL.COM
    9       1 2011-05-10 00:00:00     0            0000SS0@GMAIL.COM
    12      1 2010-12-09 00:00:00     0         0001HARISH@GMAIL.COM
    11      2 2010-12-12 00:00:00     3         0001HARISH@GMAIL.COM
    10      3 2010-12-22 00:00:00    13         0001HARISH@GMAIL.COM
    14      1 2012-11-28 00:00:00     0           000AYUSH@GMAIL.COM
    15      2 2012-11-29 00:00:00     1           000AYUSH@GMAIL.COM
    17      3 2012-12-08 00:00:00    10           000AYUSH@GMAIL.COM
    18      4 2012-12-12 00:00:00    14           000AYUSH@GMAIL.COM
    13      5 2013-01-25 00:00:00    58           000AYUSH@GMAIL.COM
    import pandas as pd
    import StringIO
    import numpy as np
    from time import strptime
    from datetime import datetime
    
    # your data
    data = """
    "DF","00000000@11111.COM","FLTINT1000130394756","26JUL2010","B2C","6799.2"
    "Rail","00000.POO@GMAIL.COM","NR251764697478","24JUN2011","B2C","2025"
    "DF","0000650000@YAHOO.COM","NF2513521438550","01JAN2013","B2C","6792"
    "Bus","00009.GAURAV@GMAIL.COM","NU27012932319739","26JAN2013","B2C","800"
    "Rail","0000.ANU@GMAIL.COM","NR251764697526","24JUN2011","B2C","595"
    "Rail","0000MANNU@GMAIL.COM","NR251277005737","29OCT2011","B2C","957"
    "Rail","0000PRANNOY0000@GMAIL.COM","NR251297862893","21NOV2011","B2C","212"
    "DF","0000PRANNOY0000@YAHOO.CO.IN","NF251327485543","26JUN2011","B2C","17080"
    "Rail","0000RAHUL@GMAIL.COM","NR2512012069809","25OCT2012","B2C","5731"
    "DF","0000SS0@GMAIL.COM","NF251355775967","10MAY2011","B2C","2000"
    "DF","0001HARISH@GMAIL.COM","NF251352240086","22DEC2010","B2C","4006"
    "DF","0001HARISH@GMAIL.COM","NF251742087846","12DEC2010","B2C","1000"
    "DF","0001HARISH@GMAIL.COM","NF252022031180","09DEC2010","B2C","3439"
    "Rail","000AYUSH@GMAIL.COM","NR2151120122283","25JAN2013","B2C","136"
    "Rail","000AYUSH@GMAIL.COM","NR2151213260036","28NOV2012","B2C","41"
    "Rail","000AYUSH@GMAIL.COM","NR2151313264432","29NOV2012","B2C","96"
    "Rail","000AYUSH@GMAIL.COM","NR2151413266728","29NOV2012","B2C","96"
    "Rail","000AYUSH@GMAIL.COM","NR2512912359037","08DEC2012","B2C","96"
    "Rail","000AYUSH@GMAIL.COM","NR2517612385569","12DEC2012","B2C","96"
    """
    
    
    # read in and create the store
    data_store_file = 'datastore.h5'
    store = pd.HDFStore(data_store_file,'w')
    
    def dp(x, **kwargs):
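        # parse strings like 26JUL2010 into datetime objects for read_csv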
        return [ datetime(*strptime(v,'%d%b%Y')[0:3]) for v in x ]
    
    chunksize=5
    reader = pd.read_csv(StringIO.StringIO(data),names=['x1','email','x2','date','x3','x4'],
                         header=0,usecols=['email','date'],parse_dates=['date'],
                         date_parser=dp, chunksize=chunksize)
    
    for i, chunk in enumerate(reader):
        chunk['indexer'] = chunk.index + i*chunksize
    
        # create the global index, and keep it in the frame too
        df = chunk.set_index('indexer')
    
        # need to set a minimum size for the email column
        store.append('data',df,min_itemsize={'email' : 100})
    
    store.close()
    
    # define the combiner function
    def combiner(x):
    
        # given a group of emails (the same), return a combination
        # with the new data
    
        # sort by the date
        y = x.sort('date')
    
        # calc the diff in days (an integer)
        y['diff'] = (y['date']-y['date'].iloc[0]).apply(lambda d: float(d.item().days))
        y['count'] = pd.Series(range(1,len(y)+1),index=y.index,dtype='float64')  
    
        return y
    
    # reduce the store (and create a new one by chunks)
    in_store_file = data_store_file
    in_store1 = pd.HDFStore(in_store_file)
    
    # iter on the store 1
    for chunki, df1 in enumerate(in_store1.select('data',chunksize=2*chunksize)):
        print "processing [%s] [%s]" % (chunki,in_store_file)
    
        out_store_file = 'datastore_%s.h5' % chunki
        out_store = pd.HDFStore(out_store_file,'w')
    
        # iter on store 2
        in_store2 = pd.HDFStore(in_store_file)
        for df2 in in_store2.select('data',chunksize=chunksize):
    
            # concat & drop dups
            df = pd.concat([df1,df2]).drop_duplicates(['email','date'])
    
            # group and combine
            result = df.groupby('email').apply(combiner)
    
            # remove the mi (that we created in the groupby)
            result = result.reset_index('email',drop=True)
    
            # only store those rows which are in df2!
            result = result.reindex(index=df2.index).dropna()
    
            # store to the out_store
            out_store.append('data',result,min_itemsize={'email' : 100})
        in_store2.close()
        out_store.close()
        in_store_file = out_store_file
    
    in_store1.close()
    
    # show the reduced store
    print pd.read_hdf(out_store_file,'data').sort(['email','diff'])
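
The script above only prints the reduced store. To produce the CSV file the question asks for, something along these lines should work (a sketch; final_output.csv is a hypothetical name, and on current pandas the sort call is spelled sort_values):

    # write the reduced result back out as a CSV
    final = pd.read_hdf(out_store_file, 'data')
    final = final.sort_values(['email', 'diff'])   # .sort(['email','diff']) on pandas 0.11
    final.to_csv('final_output.csv', index=False)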
    
Comments:

Put them into a database. Sort by name, then by date.

This sounds like something you would do with a Map Reduce approach.

My first thought was also to put the data into a database, but that alone won't work. I also need to track the first occurrence of each email address and count the occurrences, so how do I get the difference between the dates? What is the workaround?

@Jeff my file has more than 20 million rows and more than 5 million unique values. If I start comparing those I get a complexity of 5 million * 20 million, which could take months. How do I reduce the complexity from n*n so that I can process such a large volume?

When an email occurs for the first time, is its date the reference for the later occurrences? For the second occurrence it is easy: the count, and the days as the difference in days. What about the third occurrence: are the days the difference between the third and the first date, or does the previous occurrence come into it (perhaps the maximum of the two)?

This performs the count and day calculation on my already sorted output. Calculation done, thanks! I applied the count with a split on the fields:

    [root@amanka Desktop]# gawk 'BEGIN{OFS=","; COUNT=0; PREV_TIME=0; EMAIL=0;
        while((getline line) > 0){
            split(line, a, ",");
            if(EMAIL != a[2]){ EMAIL=a[2]; COUNT=1; PREV_TIME=a[7]; print line, "1,0 days" }
            else {
                if(PREV_TIME == a[7]){ COUNT=COUNT+1; print line, COUNT, "0 days"; }
                else { DAYS=((a[7]-PREV_TIME)/(60*60*24)); PREV_TIME=a[7]; COUNT=COUNT+1; print line, COUNT, DAYS " days"; }
            }
        }}'

You're welcome. I would love to know 1) how much memory gawk + sort used and 2) how much time it took on the 1.5 GB file.

I am not sure about the memory, but the time spent was actually very small: sorting the output took around 10-12 minutes, about 15 minutes in total, and it is unbelievably faster than solutions in other languages. Even a single O(n) pass over the file in Python took about 35 minutes, and the shell script cut that time roughly in half. I have more questions about the same file; if you can help me, you can ...

For reference, the date-conversion and sort command referenced in the update above:
    /usr/bin/gawk -F'","' ' { 
        split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " "); 
        for (i=1; i<=12; i++) mdigit[month[i]]=i; 
        print $0 "," mktime(substr($4,6,4) " " mdigit[substr($4,3,3)] " " substr($4,1,2) " 00 00 00"
    )}' < input.txt |  /usr/bin/sort -k2 -k7 -n -t, > output_file.txt
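
For clarity, this is what the gawk mktime(...) call above constructs, expressed in Python (the 1-based substr offsets become 0-based slices):

    import time

    # "26JUL2010" -> "2010 7 26 00 00 00" -> seconds since the epoch,
    # i.e. the numeric key that sort -k7 -n orders on
    months = {m: i + 1 for i, m in enumerate(
        "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split())}

    def to_epoch(d):                        # d looks like 26JUL2010
        return int(time.mktime((int(d[5:9]), months[d[2:5]], int(d[0:2]),
                                0, 0, 0, 0, 0, -1)))

    print(to_epoch("26JUL2010"))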