NumPy/Python particle simulator: out-of-core processing

numpy pandas pytables h5py blaze

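I am writing a Monte Carlo particle simulator (Brownian motion and photon emission) in Python/NumPy. I need to save the simulation output (>>10 GB) to a file and process the data in a second step. Compatibility with both Windows and Linux is important.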

import numpy as np
import pandas as pd

n_particles = 10
chunk_size = 1000

# 1) create a new emission file, compressing as we go
# (complevel must be > 0, otherwise pandas leaves the data uncompressed)
emission = pd.HDFStore('emission.hdf', mode='w', complib='blosc', complevel=9)

# generate simulated data
for i in range(10):

    df = pd.DataFrame(np.abs(np.random.randn(chunk_size, n_particles)), dtype='float32')

    # create a globally unique index (time)
    # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of-data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397
    # the table does not exist before the first append, so start counting at 0
    nrows = emission.get_storer('df').nrows if 'df' in emission else 0

    df.index = df.index + nrows
    emission.append('df', df)

emission.close()

# 2) create counts
cs = pd.HDFStore('counts.hdf', mode='w', complib='blosc', complevel=9)

# this is an iterator, can be any size
for df in pd.read_hdf('emission.hdf', 'df', chunksize=200):

    counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8))

    # set the index as the same
    counts.index = df.index

    # store the sum across groups of particles (as most are zero this will be
    # a nice sub-selector)
    # better maybe to have multiple of these sums that divide the particle space
    # you don't have to do this but prob more efficient
    # you can do this in another file if you want/need
    # slices are half-open: 0:5 covers particles 0-4, 5:10 covers particles 5-9
    counts['particles_0_4'] = counts.iloc[:, 0:5].sum(axis=1)
    counts['particles_5_9'] = counts.iloc[:, 5:10].sum(axis=1)

    # make the group-sum columns indexable (queryable with where=...)
    cs.append('df', counts, data_columns=['particles_0_4', 'particles_5_9'])

cs.close()

# 3) find interesting counts
print(pd.read_hdf('counts.hdf', 'df', where='particles_0_4>0'))
print(pd.read_hdf('counts.hdf', 'df', where='particles_5_9>0'))
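
The group-sum columns in step 2 act as a cheap pre-filter: a time step is only interesting if at least one particle in the group emitted a photon. The following is a minimal in-memory sketch of that idea using NumPy only, for a single chunk. The constant rate of 0.01 and the seeded generator are assumptions made for illustration and are not part of the code above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

n_particles = 10
chunk_size = 200

# Assumed low, constant emission rate so that most Poisson draws are zero.
rates = np.full((chunk_size, n_particles), 0.01, dtype=np.float32)
counts = rng.poisson(lam=rates).astype(np.uint8)

# Half-open slices: 0:5 covers particles 0-4, 5:10 covers particles 5-9.
group_0_4 = counts[:, 0:5].sum(axis=1)
group_5_9 = counts[:, 5:10].sum(axis=1)

# In-memory counterpart of where='particles_0_4>0':
# indices of the time steps in which particles 0-4 emitted at least one photon.
hits_0_4 = np.flatnonzero(group_0_4 > 0)
```

Because the two slices together cover every particle exactly once, the two group sums add up to the per-row total, which is a quick way to check that no particle was dropped by an off-by-one slice.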

The number of particles (n_particles) is 10-100. The number of time steps (time_size) is about 10^9.
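
To see why this does not fit in RAM, here is a rough size estimate for the emission array. It assumes one float32 value (4 bytes) per particle per time step, matching the dtype used in the code on this page; the helper name `emission_array_gb` is made up for this sketch.

```python
time_size = 10**9        # number of time steps, as stated above
bytes_per_value = 4      # float32

def emission_array_gb(n_particles):
    """Uncompressed size of an (n_particles x time_size) float32 array, in GB (10^9 bytes)."""
    return n_particles * time_size * bytes_per_value / 1e9

for n in (10, 100):
    print(f"{n} particles: {emission_array_gb(n):.0f} GB")
```

This gives 40 GB for 10 particles and 400 GB for 100 particles before compression, so the data has to be written and processed in chunks.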

The simulation has 3 steps (the code below is for an all-in-RAM version):

Simulate (and store) an emission rate array (it contains many elements that are almost 0):

Shape (`n_particles` x `time_size`), float32, size 80GB
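
The quoted size follows directly from the array shape and dtype. The sketch below reproduces the arithmetic under assumed dimensions: 20 particles and 10^9 time steps are illustrative values chosen only because they give 80GB for float32, and the helper name `array_size_gb` is made up for this example.

```python
import numpy as np

# Assumed dimensions (illustrative only): chosen so that the float32
# array comes out at the quoted 80GB.
n_particles = 20
time_size = 10**9


def array_size_gb(shape, dtype):
    """Return the uncompressed size of an array in GB (1 GB = 1e9 bytes)."""
    n_elements = int(np.prod(shape, dtype=np.int64))
    return n_elements * np.dtype(dtype).itemsize / 1e9


emission_gb = array_size_gb((n_particles, time_size), np.float32)
counts_gb = array_size_gb((n_particles, time_size), np.uint8)
print(emission_gb, counts_gb)
```

An array of the same shape stored as uint8 takes a quarter of the float32 size, since each element needs 1 byte instead of 4. Either way the data is far larger than typical RAM, which is why it has to be written and read in chunks.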

Compute the `counts` array (random values drawn from a Poisson process with the previously computed rates):

Shape (`n_particles` x `time_size`), uint8, size 20GB

counts = np.random.poisson(lam=emission).astype(np.uint8)
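
The one-liner above needs the whole `emission` array in RAM. A minimal sketch of the same computation done one block of time steps at a time could look like the following. It is an illustration, not the method from the answer: the function name `poisson_counts_chunked`, the chunk size and the small stand-in arrays are all assumptions. It also clips the draws at 255 before casting, because `astype(np.uint8)` on its own wraps around for larger values.

```python
import numpy as np


def poisson_counts_chunked(emission, counts_out, chunk_size, rng):
    """Fill counts_out with Poisson draws, one block of time steps at a time.

    emission   : 2-D array-like (n_particles x time_size) of rates
    counts_out : 2-D uint8 array-like with the same shape, written in place
    """
    n_particles, time_size = emission.shape
    for start in range(0, time_size, chunk_size):
        stop = min(start + chunk_size, time_size)
        # Only this block of rates is held in memory.
        rates = np.asarray(emission[:, start:stop])
        draws = rng.poisson(lam=rates)
        # Clip before casting so values above 255 do not wrap around.
        counts_out[:, start:stop] = np.minimum(draws, 255).astype(np.uint8)
    return counts_out


# Small in-memory stand-ins for the large arrays (hypothetical sizes).
rng = np.random.default_rng(0)
emission = (rng.random((3, 10_000)) * 0.01).astype(np.float32)
counts = np.zeros(emission.shape, dtype=np.uint8)
poisson_counts_chunked(emission, counts, chunk_size=1_000, rng=rng)
print(counts.shape, counts.dtype, int(counts.sum()))
```

Because the function only uses 2-D slicing for reads and writes, the in-memory arrays could be swapped for on-disk ones such as `np.memmap` objects. The draws are generated chunk by chunk, so for a given seed they will not match what a single call over the whole array would produce.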


        # make the non_zero column indexable                                                                                                                                                                                                                                                                          
        cs.append('df',counts,data_columns=['particles_0_4','particles_5_9'])                                                                                                                                                                                                                                         

    cs.close()                                                                                                                                                                                                                                                                                                        

    # 3) find interesting counts                                                                                                                                                                                                                                                                                      
    print pd.read_hdf('counts.hdf','df',where='particles_0_4>0')                                                                                                                                                                                                                                                      
    print pd.read_hdf('counts.hdf','df',where='particles_5_9>0')

Find the timestamps (or indices) of the counts. Counts are almost always 0, so the timestamp arrays will fit in RAM.

# Loop across the particles
timestamps = [np.nonzero(c) for c in counts]
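
The one-liner above needs all of `counts` in memory. A minimal sketch of a chunked variant follows, assuming `counts` is a 2-D array of shape (n_particles, n_steps) that supports slicing; the function name `nonzero_timestamps`, the chunk size and the test data are illustrative, not part of the original code.

```python
import numpy as np

def nonzero_timestamps(counts, chunk_size=1000):
    """Return, for each particle (row), the time indices where counts > 0.

    `counts` is any 2-D object of shape (n_particles, n_steps) that supports
    slicing (in-memory array, np.memmap, ...). Only one chunk of columns is
    loaded at a time.
    """
    n_particles, n_steps = counts.shape
    per_particle = [[] for _ in range(n_particles)]
    for start in range(0, n_steps, chunk_size):
        chunk = np.asarray(counts[:, start:start + chunk_size])
        for ip in range(n_particles):
            # np.nonzero returns a tuple with one index array per dimension
            idx = np.nonzero(chunk[ip])[0]
            if idx.size:
                # shift chunk-local indices to global time indices
                per_particle[ip].append(idx + start)
    return [np.concatenate(p) if p else np.empty(0, dtype=np.intp)
            for p in per_particle]

# Illustrative data: sparse Poisson counts for 3 particles
rng = np.random.default_rng(0)
counts = rng.poisson(lam=0.01, size=(3, 5000)).astype(np.uint8)
timestamps = nonzero_timestamps(counts, chunk_size=700)
```

Because only the nonzero positions are kept, the result stays small even when `counts` itself does not fit in RAM.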


我执行步骤1一次，然后重复步骤2-3多次（~100次）。将来，我可能需要在计算计数之前，预处理排放量（应用cumsum 或其他函数） import pandas as pd import numpy as np n_particles = 10 chunk_size = 1000 # 1) create a new emission file, compressing as we go emission = pd.HDFStore('emission.hdf',mode='w',complib='blosc') # generate simulated data for i in range(10): df = pd.DataFrame(np.abs(np.random.randn(chunk_size,n_particles)),dtype='float32') # create a globally unique index (time) # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of- data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397 try: nrows = emission.get_storer('df').nrows except: nrows = 0 df.index = pd.Series(df.index) + nrows emission.append('df',df) emission.close() # 2) create counts cs = pd.HDFStore('counts.hdf',mode='w',complib='blosc') # this is an iterator, can be any size for df in pd.read_hdf('emission.hdf','df',chunksize=200): counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8)) # set the index as the same counts.index = df.index # store the sum across all particles (as most are zero this will be a # nice sub-selector # better maybe to have multiple of these sums that divide the particle space # you don't have to do this but prob more efficient # you can do this in another file if you want/need counts['particles_0_4'] = counts.iloc[:,0:4].sum(1) counts['particles_5_9'] = counts.iloc[:,5:9].sum(1) # make the non_zero column indexable cs.append('df',counts,data_columns=['particles_0_4','particles_5_9']) cs.close() # 3) find interesting counts print pd.read_hdf('counts.hdf','df',where='particles_0_4>0') print pd.read_hdf('counts.hdf','df',where='particles_5_9>0') 问题: 我有一个工作在内存中的实现，我试图了解什么是实现一个可以扩展到（更）长模拟的核心外版本的最佳方法 import pandas as pd import numpy as np n_particles = 10 chunk_size = 1000 # 1) create a new emission file, compressing as we go emission = pd.HDFStore('emission.hdf',mode='w',complib='blosc') # generate simulated data for i in range(10): df = 
pd.DataFrame(np.abs(np.random.randn(chunk_size,n_particles)),dtype='float32') # create a globally unique index (time) # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of- data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397 try: nrows = emission.get_storer('df').nrows except: nrows = 0 df.index = pd.Series(df.index) + nrows emission.append('df',df) emission.close() # 2) create counts cs = pd.HDFStore('counts.hdf',mode='w',complib='blosc') # this is an iterator, can be any size for df in pd.read_hdf('emission.hdf','df',chunksize=200): counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8)) # set the index as the same counts.index = df.index # store the sum across all particles (as most are zero this will be a # nice sub-selector # better maybe to have multiple of these sums that divide the particle space # you don't have to do this but prob more efficient # you can do this in another file if you want/need counts['particles_0_4'] = counts.iloc[:,0:4].sum(1) counts['particles_5_9'] = counts.iloc[:,5:9].sum(1) # make the non_zero column indexable cs.append('df',counts,data_columns=['particles_0_4','particles_5_9']) cs.close() # 3) find interesting counts print pd.read_hdf('counts.hdf','df',where='particles_0_4>0') print pd.read_hdf('counts.hdf','df',where='particles_5_9>0') 我希望它存在我需要将阵列保存到一个文件中，并且我希望使用单个文件进行模拟。我还需要一种“简单”的方法来存储和调用模拟参数字典（标量） import pandas as pd import numpy as np n_particles = 10 chunk_size = 1000 # 1) create a new emission file, compressing as we go emission = pd.HDFStore('emission.hdf',mode='w',complib='blosc') # generate simulated data for i in range(10): df = pd.DataFrame(np.abs(np.random.randn(chunk_size,n_particles)),dtype='float32') # create a globally unique index (time) # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of- data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397 try: nrows = emission.get_storer('df').nrows except: nrows = 0 df.index = 
pd.Series(df.index) + nrows emission.append('df',df) emission.close() # 2) create counts cs = pd.HDFStore('counts.hdf',mode='w',complib='blosc') # this is an iterator, can be any size for df in pd.read_hdf('emission.hdf','df',chunksize=200): counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8)) # set the index as the same counts.index = df.index # store the sum across all particles (as most are zero this will be a # nice sub-selector # better maybe to have multiple of these sums that divide the particle space # you don't have to do this but prob more efficient # you can do this in another file if you want/need counts['particles_0_4'] = counts.iloc[:,0:4].sum(1) counts['particles_5_9'] = counts.iloc[:,5:9].sum(1) # make the non_zero column indexable cs.append('df',counts,data_columns=['particles_0_4','particles_5_9']) cs.close() # 3) find interesting counts print pd.read_hdf('counts.hdf','df',where='particles_0_4>0') print pd.read_hdf('counts.hdf','df',where='particles_5_9>0') 理想情况下，我想要一个文件支持的numpy数组，我可以预先分配并填充块。然后，我希望numpy数组方法（max ，cumsum ，…）能够透明地工作，只需要一个chunksize 关键字来指定每次迭代要加载多少数组 import pandas as pd import numpy as np n_particles = 10 chunk_size = 1000 # 1) create a new emission file, compressing as we go emission = pd.HDFStore('emission.hdf',mode='w',complib='blosc') # generate simulated data for i in range(10): df = pd.DataFrame(np.abs(np.random.randn(chunk_size,n_particles)),dtype='float32') # create a globally unique index (time) # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of- data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397 try: nrows = emission.get_storer('df').nrows except: nrows = 0 df.index = pd.Series(df.index) + nrows emission.append('df',df) emission.close() # 2) create counts cs = pd.HDFStore('counts.hdf',mode='w',complib='blosc') # this is an iterator, can be any size for df in pd.read_hdf('emission.hdf','df',chunksize=200): counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8)) 
# set the index as the same counts.index = df.index # store the sum across all particles (as most are zero this will be a # nice sub-selector # better maybe to have multiple of these sums that divide the particle space # you don't have to do this but prob more efficient # you can do this in another file if you want/need counts['particles_0_4'] = counts.iloc[:,0:4].sum(1) counts['particles_5_9'] = counts.iloc[:,5:9].sum(1) # make the non_zero column indexable cs.append('df',counts,data_columns=['particles_0_4','particles_5_9']) cs.close() # 3) find interesting counts print pd.read_hdf('counts.hdf','df',where='particles_0_4>0') print pd.read_hdf('counts.hdf','df',where='particles_5_9>0') 更好的是，我想要一个Numexpr，它不是在缓存和RAM之间运行，而是在RAM和硬盘驱动器之间运行 import pandas as pd import numpy as np n_particles = 10 chunk_size = 1000 # 1) create a new emission file, compressing as we go emission = pd.HDFStore('emission.hdf',mode='w',complib='blosc') # generate simulated data for i in range(10): df = pd.DataFrame(np.abs(np.random.randn(chunk_size,n_particles)),dtype='float32') # create a globally unique index (time) # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of- data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397 try: nrows = emission.get_storer('df').nrows except: nrows = 0 df.index = pd.Series(df.index) + nrows emission.append('df',df) emission.close() # 2) create counts cs = pd.HDFStore('counts.hdf',mode='w',complib='blosc') # this is an iterator, can be any size for df in pd.read_hdf('emission.hdf','df',chunksize=200): counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8)) # set the index as the same counts.index = df.index # store the sum across all particles (as most are zero this will be a # nice sub-selector # better maybe to have multiple of these sums that divide the particle space # you don't have to do this but prob more efficient # you can do this in another file if you want/need counts['particles_0_4'] = 
counts.iloc[:,0:4].sum(1) counts['particles_5_9'] = counts.iloc[:,5:9].sum(1) # make the non_zero column indexable cs.append('df',counts,data_columns=['particles_0_4','particles_5_9']) cs.close() # 3) find interesting counts print pd.read_hdf('counts.hdf','df',where='particles_0_4>0') print pd.read_hdf('counts.hdf','df',where='particles_5_9>0') 有哪些切实可行的选择作为第一选择我开始尝试pyTables，但我对它的复杂性和抽象（与numpy如此不同）不满意。此外，我目前的解决方案（见下文）很难看，效率也不高 import pandas as pd import numpy as np n_particles = 10 chunk_size = 1000 # 1) create a new emission file, compressing as we go emission = pd.HDFStore('emission.hdf',mode='w',complib='blosc') # generate simulated data for i in range(10): df = pd.DataFrame(np.abs(np.random.randn(chunk_size,n_particles)),dtype='float32') # create a globally unique index (time) # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of- data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397 try: nrows = emission.get_storer('df').nrows except: nrows = 0 df.index = pd.Series(df.index) + nrows emission.append('df',df) emission.close() # 2) create counts cs = pd.HDFStore('counts.hdf',mode='w',complib='blosc') # this is an iterator, can be any size for df in pd.read_hdf('emission.hdf','df',chunksize=200): counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8)) # set the index as the same counts.index = df.index # store the sum across all particles (as most are zero this will be a # nice sub-selector # better maybe to have multiple of these sums that divide the particle space # you don't have to do this but prob more efficient # you can do this in another file if you want/need counts['particles_0_4'] = counts.iloc[:,0:4].sum(1) counts['particles_5_9'] = counts.iloc[:,5:9].sum(1) # make the non_zero column indexable cs.append('df',counts,data_columns=['particles_0_4','particles_5_9']) cs.close() # 3) find interesting counts print pd.read_hdf('counts.hdf','df',where='particles_0_4>0') print 
pd.read_hdf('counts.hdf','df',where='particles_5_9>0') 因此，我寻求答案的选择是 import pandas as pd import numpy as np n_particles = 10 chunk_size = 1000 # 1) create a new emission file, compressing as we go emission = pd.HDFStore('emission.hdf',mode='w',complib='blosc') # generate simulated data for i in range(10): df = pd.DataFrame(np.abs(np.random.randn(chunk_size,n_particles)),dtype='float32') # create a globally unique index (time) # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of- data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397 try: nrows = emission.get_storer('df').nrows except: nrows = 0 df.index = pd.Series(df.index) + nrows emission.append('df',df) emission.close() # 2) create counts cs = pd.HDFStore('counts.hdf',mode='w',complib='blosc') # this is an iterator, can be any size for df in pd.read_hdf('emission.hdf','df',chunksize=200): counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8)) # set the index as the same counts.index = df.index # store the sum across all particles (as most are zero this will be a # nice sub-selector # better maybe to have multiple of these sums that divide the particle space # you don't have to do this but prob more efficient # you can do this in another file if you want/need counts['particles_0_4'] = counts.iloc[:,0:4].sum(1) counts['particles_5_9'] = counts.iloc[:,5:9].sum(1) # make the non_zero column indexable cs.append('df',counts,data_columns=['particles_0_4','particles_5_9']) cs.close() # 3) find interesting counts print pd.read_hdf('counts.hdf','df',where='particles_0_4>0') print pd.read_hdf('counts.hdf','df',where='particles_5_9>0') 实现具有所需功能的numpy阵列（如何实现？） import pandas as pd import numpy as np n_particles = 10 chunk_size = 1000 # 1) create a new emission file, compressing as we go emission = pd.HDFStore('emission.hdf',mode='w',complib='blosc') # generate simulated data for i in range(10): df = pd.DataFrame(np.abs(np.random.randn(chunk_size,n_particles)),dtype='float32') # 
create a globally unique index (time) # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of- data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397 try: nrows = emission.get_storer('df').nrows except: nrows = 0 df.index = pd.Series(df.index) + nrows emission.append('df',df) emission.close() # 2) create counts cs = pd.HDFStore('counts.hdf',mode='w',complib='blosc') # this is an iterator, can be any size for df in pd.read_hdf('emission.hdf','df',chunksize=200): counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8)) # set the index as the same counts.index = df.index # store the sum across all particles (as most are zero this will be a # nice sub-selector # better maybe to have multiple of these sums that divide the particle space # you don't have to do this but prob more efficient # you can do this in another file if you want/need counts['particles_0_4'] = counts.iloc[:,0:4].sum(1) counts['particles_5_9'] = counts.iloc[:,5:9].sum(1) # make the non_zero column indexable cs.append('df',counts,data_columns=['particles_0_4','particles_5_9']) cs.close() # 3) find interesting counts print pd.read_hdf('counts.hdf','df',where='particles_0_4>0') print pd.read_hdf('counts.hdf','df',where='particles_5_9>0') 以更智能的方式使用pytable（不同的数据结构/方法） import pandas as pd import numpy as np n_particles = 10 chunk_size = 1000 # 1) create a new emission file, compressing as we go emission = pd.HDFStore('emission.hdf',mode='w',complib='blosc') # generate simulated data for i in range(10): df = pd.DataFrame(np.abs(np.random.randn(chunk_size,n_particles)),dtype='float32') # create a globally unique index (time) # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of- data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397 try: nrows = emission.get_storer('df').nrows except: nrows = 0 df.index = pd.Series(df.index) + nrows emission.append('df',df) emission.close() # 2) create counts cs = 
pd.HDFStore('counts.hdf',mode='w',complib='blosc') # this is an iterator, can be any size for df in pd.read_hdf('emission.hdf','df',chunksize=200): counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8)) # set the index as the same counts.index = df.index # store the sum across all particles (as most are zero this will be a # nice sub-selector # better maybe to have multiple of these sums that divide the particle space # you don't have to do this but prob more efficient # you can do this in another file if you want/need counts['particles_0_4'] = counts.iloc[:,0:4].sum(1) counts['particles_5_9'] = counts.iloc[:,5:9].sum(1) # make the non_zero column indexable cs.append('df',counts,data_columns=['particles_0_4','particles_5_9']) cs.close() # 3) find interesting counts print pd.read_hdf('counts.hdf','df',where='particles_0_4>0') print pd.read_hdf('counts.hdf','df',where='particles_5_9>0') 使用另一个库：h5py，blaze，pandas。。。（到目前为止，我还没有试过） import pandas as pd import numpy as np n_particles = 10 chunk_size = 1000 # 1) create a new emission file, compressing as we go emission = pd.HDFStore('emission.hdf',mode='w',complib='blosc') # generate simulated data for i in range(10): df = pd.DataFrame(np.abs(np.random.randn(chunk_size,n_particles)),dtype='float32') # create a globally unique index (time) # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of- data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397 try: nrows = emission.get_storer('df').nrows except: nrows = 0 df.index = pd.Series(df.index) + nrows emission.append('df',df) emission.close() # 2) create counts cs = pd.HDFStore('counts.hdf',mode='w',complib='blosc') # this is an iterator, can be any size for df in pd.read_hdf('emission.hdf','df',chunksize=200): counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8)) # set the index as the same counts.index = df.index # store the sum across all particles (as most are zero this will be a # nice sub-selector # better maybe to 
have multiple of these sums that divide the particle space # you don't have to do this but prob more efficient # you can do this in another file if you want/need counts['particles_0_4'] = counts.iloc[:,0:4].sum(1) counts['particles_5_9'] = counts.iloc[:,5:9].sum(1) # make the non_zero column indexable cs.append('df',counts,data_columns=['particles_0_4','particles_5_9']) cs.close() # 3) find interesting counts print pd.read_hdf('counts.hdf','df',where='particles_0_4>0') print pd.read_hdf('counts.hdf','df',where='particles_5_9>0') 暂定解决方案（附表）我将模拟参数保存在'/parameters' 组中：每个参数都转换为numpy数组标量。详细的解决方案，但它的工作 import pandas as pd import numpy as np n_particles = 10 chunk_size = 1000 # 1) create a new emission file, compressing as we go emission = pd.HDFStore('emission.hdf',mode='w',complib='blosc') # generate simulated data for i in range(10): df = pd.DataFrame(np.abs(np.random.randn(chunk_size,n_particles)),dtype='float32') # create a globally unique index (time) # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of- data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397 try: nrows = emission.get_storer('df').nrows except: nrows = 0 df.index = pd.Series(df.index) + nrows emission.append('df',df) emission.close() # 2) create counts cs = pd.HDFStore('counts.hdf',mode='w',complib='blosc') # this is an iterator, can be any size for df in pd.read_hdf('emission.hdf','df',chunksize=200): counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8)) # set the index as the same counts.index = df.index # store the sum across all particles (as most are zero this will be a # nice sub-selector # better maybe to have multiple of these sums that divide the particle space # you don't have to do this but prob more efficient # you can do this in another file if you want/need counts['particles_0_4'] = counts.iloc[:,0:4].sum(1) counts['particles_5_9'] = counts.iloc[:,5:9].sum(1) # make the non_zero column indexable 
I save emission as an extensible array (EArray), because I generate the data in chunks and need to append each new chunk (although I do know the final size). Saving counts is more problematic. If I save it as a PyTables array, queries like "counts >= 2" are difficult to perform. Therefore I saved counts as multiple tables (one per particle) [ugly] and query those tables. I'm not sure this is space-efficient, and generating all these tables instead of using a single array clutters the HDF5 file significantly. Moreover, strangely enough, creating these tables requires defining a custom dtype (even for standard numpy dtypes):
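The one-table-per-particle layout described above might look roughly like the sketch below. It is not the original code from the question: the sizes, the file path, the `counts` column name and the Poisson rate are illustrative assumptions, and `get_where_list` is just one PyTables call that can run a "counts >= 2" query.

```python
import os
import tempfile

import numpy as np
import tables

n_particles = 3    # hypothetical sizes, kept tiny for illustration
chunk_size = 500
n_chunks = 4

# Assumption: a one-column row layout. Even a plain uint8 has to be
# declared through a description before the table can be created.
counts_description = {"counts": tables.UInt8Col()}
row_dtype = np.dtype([("counts", np.uint8)])

rng = np.random.default_rng(0)
path = os.path.join(tempfile.mkdtemp(), "counts_tables.h5")
all_counts = []    # in-RAM copy, only used to check this toy example

with tables.open_file(path, mode="w") as h5:
    # One table per particle, kept in a list for easy iteration.
    count_tables = [
        h5.create_table(h5.root, "particle_%d" % ip, counts_description)
        for ip in range(n_particles)
    ]
    for _ in range(n_chunks):
        counts = rng.poisson(lam=0.5, size=(chunk_size, n_particles)).astype(np.uint8)
        all_counts.append(counts)
        for ip, table in enumerate(count_tables):
            rows = np.zeros(chunk_size, dtype=row_dtype)
            rows["counts"] = counts[:, ip]
            table.append(rows)
    for table in count_tables:
        table.flush()
    # Row numbers (time indices) where each particle has at least 2 counts.
    timestamps = [table.get_where_list("counts >= 2") for table in count_tables]

all_counts = np.concatenate(all_counts)
```

The query runs on disk, so only the matching row numbers come back into memory. The cost is one HDF5 node per particle, which is the clutter the question complains about.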
Each particle-counts "table" has a different name (name = "particle_%d" % ip), and I need to put them in a Python list for easy iteration.

Edit: the result of this question is a Brownian motion simulator. On storing parameters in the HDF5 file: they are pickled, so you can store them as you have them; the pickle is limited to 64 kB in size.
Alternatively, you can make each particle a data column and select on each one individually.
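As a minimal sketch of the data-column variant, assuming pandas with PyTables installed: the column names (`particle_0` and so on), the sizes and the file path below are made up for illustration, and the Poisson draw only stands in for the real counts.

```python
import os
import tempfile

import numpy as np
import pandas as pd

n_particles = 4    # hypothetical size for illustration
rng = np.random.default_rng(1)

# String column names, so that they can be used inside a `where` expression.
counts = pd.DataFrame(
    rng.poisson(lam=0.5, size=(2000, n_particles)).astype(np.uint8),
    columns=["particle_%d" % ip for ip in range(n_particles)],
)

path = os.path.join(tempfile.mkdtemp(), "counts_columns.hdf")
with pd.HDFStore(path, mode="w") as store:
    # data_columns=True makes every column queryable on disk.
    store.append("df", counts, data_columns=True)

# Rows where one particular particle has at least 2 counts.
selected = pd.read_hdf(path, "df", where="particle_2 >= 2")
```

With many particles, turning every column into a data column adds storage and write overhead, so listing only the columns that will actually be queried may be the better trade-off.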
And some output (rather active emission in this case):
PyTables solution

Since the functionality provided by pandas is not needed, and processing with it is much slower (see the notebook below), the best approach is to use PyTables or h5py directly. So far I have only tried the PyTables approach.
All tests were performed in this notebook:
Introduction to PyTables data structures

Reference:

PyTables allows data to be stored in HDF5 files in two kinds of format: arrays and tables.
Arrays

There are three types of array: Array, CArray and EArray. All of them allow storing and retrieving (multidimensional) slices with a notation similar to numpy slicing:
# Write data to store (broadcasting works)
array1[:] = 3

# Read data from store
in_ram_array = array1[:]

As an optimization for some use cases, a CArray is saved in "chunks", whose size can be chosen at creation time with chunkshape.
The size of an Array or CArray is fixed at creation time. After creation, however, you can still fill/write the array chunk by chunk. An EArray, by contrast, can be extended with the .append() method.
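The difference between the fixed-size and extensible arrays can be sketched as follows. This is an illustrative example rather than code from the notebook; the node names, sizes and file path are assumptions.

```python
import os
import tempfile

import numpy as np
import tables

n_particles = 3    # hypothetical sizes for illustration
chunk_size = 1000
n_chunks = 5
total_rows = n_chunks * chunk_size

rng = np.random.default_rng(2)
path = os.path.join(tempfile.mkdtemp(), "arrays.h5")

with tables.open_file(path, mode="w") as h5:
    # CArray: full shape fixed at creation; chunkshape sets the on-disk chunks.
    fixed = h5.create_carray(
        h5.root, "emission_fixed", atom=tables.Float32Atom(),
        shape=(total_rows, n_particles), chunkshape=(chunk_size, n_particles))
    # EArray: the dimension of length 0 is the one that grows on append().
    growing = h5.create_earray(
        h5.root, "emission_growing", atom=tables.Float32Atom(),
        shape=(0, n_particles), expectedrows=total_rows)

    for i in range(n_chunks):
        chunk = np.abs(rng.standard_normal((chunk_size, n_particles))).astype(np.float32)
        fixed[i * chunk_size:(i + 1) * chunk_size] = chunk   # fill a slice in place
        growing.append(chunk)                                # extend the array

    fixed_shape = fixed.shape
    growing_shape = growing.shape
    same_data = np.array_equal(fixed[:], growing[:])
```

When the final size is known in advance, as in the question, either type works: the CArray needs the total shape up front, while the EArray only takes `expectedrows` as a sizing hint.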
Tables

A table is quite a different beast. It is basically a "table": you have only a 1D index, and each element is a row. Each row holds the "columns", and each column can have a different data type. If you know record arrays, a table is basically a 1D record array, with each element ha