NumPy Python particle simulator: out-of-core processing

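I am writing a Monte Carlo particle simulator (Brownian motion and photon emission) in python/numpy. I need to save the simulation output (>>10GB) to a file and process the data in a second step. Compatibility with both Windows and Linux is important.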

import pandas as pd
import numpy as np

n_particles = 10
chunk_size = 1000

# 1) create a new emission file, compressing as we go
emission = pd.HDFStore('emission.hdf', mode='w', complib='blosc')

# generate simulated data
for i in range(10):

    df = pd.DataFrame(np.abs(np.random.randn(chunk_size, n_particles)), dtype='float32')

    # create a globally unique index (time)
    # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of-data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397
    try:
        nrows = emission.get_storer('df').nrows
    except (KeyError, AttributeError):
        # the table does not exist yet on the first pass
        nrows = 0

    df.index = pd.Series(df.index) + nrows
    emission.append('df', df)

emission.close()

# 2) create counts
cs = pd.HDFStore('counts.hdf', mode='w', complib='blosc')

# this is an iterator, can be any size
for df in pd.read_hdf('emission.hdf', 'df', chunksize=200):

    counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8))

    # set the index as the same
    counts.index = df.index

    # store the sum across groups of particles (as most are zero this will be a
    # nice sub-selector)
    # better maybe to have multiple of these sums that divide the particle space
    # you don't have to do this but prob more efficient
    # you can do this in another file if you want/need
    # slices are half-open: 0:5 covers particles 0..4, 5:10 covers particles 5..9
    counts['particles_0_4'] = counts.iloc[:, 0:5].sum(1)
    counts['particles_5_9'] = counts.iloc[:, 5:10].sum(1)

    # make the two sum columns queryable with `where`
    cs.append('df', counts, data_columns=['particles_0_4', 'particles_5_9'])

cs.close()

# 3) find interesting counts
print(pd.read_hdf('counts.hdf', 'df', where='particles_0_4>0'))
print(pd.read_hdf('counts.hdf', 'df', where='particles_5_9>0'))
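
The same chunked pattern can also be sketched without pandas, using a disk-backed `np.memmap` array. This is only an illustration, not the method used in the code above: the file name `emission.dat`, the tiny `time_size`, and the accumulator names `total_counts` and `rows_with_counts` are assumptions, and a raw memmap file is uncompressed, unlike the blosc-compressed HDF5 store.

```python
import os
import tempfile

import numpy as np

n_particles = 10
time_size = 10_000   # tiny stand-in for the real number of time steps
chunk_size = 1_000   # number of rows held in RAM at any one time

rng = np.random.default_rng(0)
# hypothetical file name, placed in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "emission.dat")

# 1) write the emission rates chunk by chunk into a disk-backed array
emission = np.memmap(path, dtype=np.float32, mode="w+",
                     shape=(time_size, n_particles))
for start in range(0, time_size, chunk_size):
    stop = min(start + chunk_size, time_size)
    emission[start:stop] = np.abs(
        rng.standard_normal((stop - start, n_particles)))
emission.flush()
del emission  # drop the mapping before reopening the file

# 2) reopen read-only and draw Poisson counts one chunk at a time
emission = np.memmap(path, dtype=np.float32, mode="r",
                     shape=(time_size, n_particles))
total_counts = np.zeros(n_particles, dtype=np.int64)
rows_with_counts = 0
for start in range(0, time_size, chunk_size):
    # copy only the current chunk into RAM
    chunk = np.asarray(emission[start:start + chunk_size])
    counts = rng.poisson(lam=chunk)
    total_counts += counts.sum(axis=0)
    rows_with_counts += int((counts.sum(axis=1) > 0).sum())
del emission

print(total_counts, rows_with_counts)
```

Only one chunk of `chunk_size` rows is resident at a time, so memory use does not depend on `time_size`. The trade-off against the HDF5 approach is that the file has no compression and no queryable columns.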
The number of particles (n_particles) is 10-100. The number of time steps (time_size) is about 10^9.
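
These figures fix the size of the raw emission array. As a rough estimate, assuming float32 storage (4 bytes per element, as in the code on this page) and no compression, the helper below (a hypothetical name, `emission_size_gb`) computes the size in decimal gigabytes:

```python
BYTES_PER_FLOAT32 = 4
time_size = 10**9  # number of time steps stated in the question


def emission_size_gb(n_particles, time_size=time_size):
    """Uncompressed size of an (n_particles x time_size) float32 array, in GB (10^9 bytes)."""
    return n_particles * time_size * BYTES_PER_FLOAT32 / 1e9


for n in (10, 100):
    print(n, "particles:", emission_size_gb(n), "GB")
# 10 particles: 40.0 GB
# 100 particles: 400.0 GB
```

Either end of the range is well beyond typical RAM, which is why the data has to be written and processed in chunks.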

The simulation has 3 steps (the code below is for an all-in-RAM version):

import pandas as pd                                                                                                                                                                                                                                                                                               
import numpy as np                                                                                                                                                                                                                                                                                                

n_particles = 10                                                                                                                                                                                                                                                                                                  
chunk_size = 1000                                                                                                                                                                                                                                                                                                 

# 1) create a new emission file, compressing as we go                                                                                                                                                                                                                                                             
emission = pd.HDFStore('emission.hdf',mode='w',complib='blosc')                                                                                                                                                                                                                                                   

# generate simulated data                                                                                                                                                                                                                                                                                         
for i in range(10):                                                                                                                                                                                                                                                                                               

    df = pd.DataFrame(np.abs(np.random.randn(chunk_size,n_particles)),dtype='float32')                                                                                                                                                                                                                            

    # create a globally unique index (time)                                                                                                                                                                                                                                                                       
    # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of-

    data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397                                                                                                                                                              
        try:                                                                                                                                                                                                                                                                                                          
            nrows = emission.get_storer('df').nrows                                                                                                                                                                                                                                                                   
        except:                                                                                                                                                                                                                                                                                                       
            nrows = 0                                                                                                                                                                                                                                                                                                 

        df.index = pd.Series(df.index) + nrows                                                                                                                                                                                                                                                                        
        emission.append('df',df)                                                                                                                                                                                                                                                                                      

    emission.close()                                                                                                                                                                                                                                                                                                  

    # 2) create counts                                                                                                                                                                                                                                                                                                
    cs = pd.HDFStore('counts.hdf',mode='w',complib='blosc')                                                                                                                                                                                                                                                           

    # this is an iterator, can be any size                                                                                                                                                                                                                                                                            
    for df in pd.read_hdf('emission.hdf','df',chunksize=200):                                                                                                                                                                                                                                                         

        counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8))                                                                                                                                                                                                                                             

        # set the index as the same                                                                                                                                                                                                                                                                                   
        counts.index = df.index                                                                                                                                                                                                                                                                                       

        # store the sum across groups of particles; as most counts are zero,
        # these sums make a nice sub-selector
        # it may be better to have several sums that divide the particle space
        # this is optional, but probably more efficient
        # you can also store the sums in another file if you want/need
        # note: the slice end is exclusive, so 0:5 covers particles 0..4
        counts['particles_0_4'] = counts.iloc[:,0:5].sum(1)
        counts['particles_5_9'] = counts.iloc[:,5:10].sum(1)

        # make the non_zero column indexable                                                                                                                                                                                                                                                                          
        cs.append('df',counts,data_columns=['particles_0_4','particles_5_9'])                                                                                                                                                                                                                                         

    cs.close()                                                                                                                                                                                                                                                                                                        

    # 3) find interesting counts                                                                                                                                                                                                                                                                                      
    print(pd.read_hdf('counts.hdf','df',where='particles_0_4>0'))
    print(pd.read_hdf('counts.hdf','df',where='particles_5_9>0'))
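
The chunk-by-chunk pattern above can also be sketched with NumPy alone, which makes it easy to try on a small in-memory array. This is only an illustration under stated assumptions: `nonzero_timestamps` is a hypothetical helper (not part of pandas or PyTables), the rate array is tiny and held in memory rather than read from an HDF5 file, the seeds and the 0.01 rate scale are arbitrary, and counts are clipped at 255 before the cast to uint8.

```python
import numpy as np


def nonzero_timestamps(rates, chunk_size=200, seed=0):
    """Draw Poisson counts chunk by chunk and keep only the nonzero entries.

    rates: 2-D float array of shape (time_size, n_particles), i.e. the same
           layout as the DataFrames above (rows are time steps).
    Returns three 1-D arrays: global time index, particle index, count.
    """
    rng = np.random.default_rng(seed)
    times, particles, values = [], [], []

    for start in range(0, rates.shape[0], chunk_size):
        chunk = rates[start:start + chunk_size]

        # Poisson draw for this chunk only; clip so the uint8 cast cannot wrap
        counts = np.minimum(rng.poisson(lam=chunk), 255).astype(np.uint8)

        # row/column positions of the nonzero counts inside the chunk
        t, p = np.nonzero(counts)

        times.append(t + start)      # shift to a globally unique time index
        particles.append(p)
        values.append(counts[t, p])

    return (np.concatenate(times),
            np.concatenate(particles),
            np.concatenate(values))


# Hypothetical small input: low emission rates, so most counts are zero.
rates = np.abs(np.random.default_rng(1).standard_normal((1000, 10)))
rates = (rates * 0.01).astype(np.float32)

t, p, c = nonzero_timestamps(rates)
print(len(t), "nonzero counts out of", rates.size)
```

Because only the nonzero entries are kept, the result stays small even when the full counts array would not fit in RAM; in the real workflow each chunk would come from the `pd.read_hdf(..., chunksize=...)` iterator instead of an array slice.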
  • Simulate (and store) the emission rate array (it contains many elements that are almost 0):

    • Shape (n_particles x time_size), float32, size 80GB
  • Compute the counts array (random values from a Poisson process with the previously computed rates):

    • Shape (n_particles x time_size), uint8, size 20GB

      counts = np.random.poisson(lam=emission).astype(np.uint8)
      
              counts['particles_0_4'] = counts.iloc[:,0:4].sum(1)                                                                                                                                                                                                                                                           
              counts['particles_5_9'] = counts.iloc[:,5:9].sum(1)                                                                                                                                                                                                                                                           
      
              # make the non_zero column indexable                                                                                                                                                                                                                                                                          
              cs.append('df',counts,data_columns=['particles_0_4','particles_5_9'])                                                                                                                                                                                                                                         
      
          cs.close()                                                                                                                                                                                                                                                                                                        
      
          # 3) find interesting counts                                                                                                                                                                                                                                                                                      
          print pd.read_hdf('counts.hdf','df',where='particles_0_4>0')                                                                                                                                                                                                                                                      
          print pd.read_hdf('counts.hdf','df',where='particles_5_9>0')         
      
  • Find the timestamps (or indices) of the counts. The counts are almost always 0, so the timestamp arrays will fit in RAM.

    # Loop across the particles
    timestamps = [np.nonzero(c) for c in counts]
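
The same timestamp extraction can be done chunk by chunk when the counts do not fit in RAM. The following is a minimal sketch, assuming each chunk is a 2-D array of shape (n_time, n_particles) and that chunks arrive in time order. The names `iter_count_chunks` and `timestamps_from_chunks` are hypothetical, introduced only for illustration, and the generator stands in for whatever reads the chunks from disk.

```python
import numpy as np

def iter_count_chunks(n_chunks=5, chunk_size=1000, n_particles=10, seed=0):
    """Hypothetical stand-in for reading count chunks from disk."""
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        # A low rate keeps almost all counts at zero, as in the question.
        yield rng.poisson(lam=0.01, size=(chunk_size, n_particles)).astype(np.uint8)

def timestamps_from_chunks(chunks, n_particles):
    """Collect, per particle, the global time indices with a non-zero count."""
    per_particle = [[] for _ in range(n_particles)]
    offset = 0  # global time index of the first row of the current chunk
    for chunk in chunks:
        rows, cols = np.nonzero(chunk)
        for p in range(n_particles):
            # Shift chunk-local row indices to global time indices.
            per_particle[p].append(rows[cols == p] + offset)
        offset += chunk.shape[0]
    return [np.concatenate(parts) for parts in per_particle]

timestamps = timestamps_from_chunks(iter_count_chunks(), n_particles=10)
```

Only the non-zero positions are kept in memory, so the result stays small as long as the counts are sparse. Note that `np.nonzero` records each time bin once, even where the count is greater than 1.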
    
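  • I run step 1 once and then repeat steps 2-3 many times (~100). In the future I may need to
    preprocess the emission (by applying cumsum or other functions) before computing the counts.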
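
If the preprocessing step is a cumulative sum along time, it can be applied chunk by chunk by carrying the running total across chunk boundaries. The following is a minimal sketch under that assumption. `chunked_cumsum` is a hypothetical helper, and the in-memory list of chunks stands in for an on-disk iterator.

```python
import numpy as np

def chunked_cumsum(chunks):
    """Yield the cumulative sum along axis 0 of a stream of 2-D chunks."""
    carry = None  # running total per particle at the end of the previous chunk
    for chunk in chunks:
        out = np.cumsum(chunk, axis=0, dtype=np.float64)
        if carry is not None:
            out += carry  # continue from where the previous chunk stopped
        carry = out[-1].copy()
        yield out

rng = np.random.default_rng(0)
emission = np.abs(rng.standard_normal((3000, 10))).astype(np.float32)
chunks = np.array_split(emission, 3)  # three chunks of 1000 time steps each

result = np.concatenate(list(chunked_cumsum(chunks)), axis=0)
```

The sketch accumulates in float64 rather than float32, which is one way to limit round-off over long runs. Each chunk must contain at least one row, since the last row is used as the carry.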

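    Question: I have a working in-memory implementation, and I am trying to understand the best way to implement an out-of-core version that can scale to (much) longer simulations.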

    import pandas as pd
    import numpy as np

    n_particles = 10
    chunk_size = 1000

    # 1) create a new emission file, compressing as we go
    emission = pd.HDFStore('emission.hdf', mode='w', complib='blosc')

    # generate simulated data
    for i in range(10):

        df = pd.DataFrame(np.abs(np.random.randn(chunk_size, n_particles)), dtype='float32')

        # create a globally unique index (time)
        # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of-data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397
        try:
            nrows = emission.get_storer('df').nrows
        except Exception:
            # the table does not exist yet on the first pass
            nrows = 0

        df.index = pd.Series(df.index) + nrows
        emission.append('df', df)

    emission.close()

    # 2) create counts
    cs = pd.HDFStore('counts.hdf', mode='w', complib='blosc')

    # this is an iterator, can be any size
        for df in pd.read_hdf('emission.hdf','df',chunksize=200):                                                                                                                                                                                                                                                         
    
            counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8))                                                                                                                                                                                                                                             
    
            # set the index as the same                                                                                                                                                                                                                                                                                   
            counts.index = df.index                                                                                                                                                                                                                                                                                       
    
            # store the sum across all particles (as most are zero this will be a 
            # nice sub-selector                                                                                                                                                                                                                       
            # better maybe to have multiple of these sums that divide the particle space                                                                                                                                                                                                                                  
            # you don't have to do this but prob more efficient                                                                                                                                                                                                                                                           
            # you can do this in another file if you want/need                                                                                                                                                                                                                                                               
            counts['particles_0_4'] = counts.iloc[:,0:4].sum(1)                                                                                                                                                                                                                                                           
            counts['particles_5_9'] = counts.iloc[:,5:9].sum(1)                                                                                                                                                                                                                                                           
    
            # make the non_zero column indexable                                                                                                                                                                                                                                                                          
            cs.append('df',counts,data_columns=['particles_0_4','particles_5_9'])                                                                                                                                                                                                                                         
    
        cs.close()                                                                                                                                                                                                                                                                                                        
    
        # 3) find interesting counts                                                                                                                                                                                                                                                                                      
        print pd.read_hdf('counts.hdf','df',where='particles_0_4>0')                                                                                                                                                                                                                                                      
        print pd.read_hdf('counts.hdf','df',where='particles_5_9>0')         
    
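    What I wish existed: I need to save the arrays to a file, and I would like to use a single file per simulation. I also need a "simple" way to store and recall a dictionary of simulation parameters (scalars).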

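    Ideally, I would like a file-backed numpy array that I can preallocate and fill in chunks. I would then like the numpy array methods (max, cumsum, ...) to work transparently, needing only a chunksize keyword to specify how much of the array to load at each iteration.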
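One way to approximate this wish with numpy alone is `numpy.memmap`, which maps a preallocated binary file into an array that can be filled and read slice by slice. The sketch below is only an illustration under assumptions: the file names, the array shape, and the helper functions `chunked_max` and `chunked_cumsum` are hypothetical, and a raw memmap file carries no metadata, so shape and dtype must be recorded elsewhere.

```python
import os
import tempfile

import numpy as np


def chunked_max(arr, chunksize):
    """Global maximum, reading `chunksize` rows from disk at a time."""
    best = -np.inf
    for start in range(0, arr.shape[0], chunksize):
        best = max(best, float(arr[start:start + chunksize].max()))
    return best


def chunked_cumsum(src, dst, chunksize):
    """Cumulative sum along axis 0, written into `dst` chunk by chunk."""
    # `carry` holds the running total at the end of the previous chunk.
    carry = np.zeros(src.shape[1:], dtype=dst.dtype)
    for start in range(0, src.shape[0], chunksize):
        stop = start + chunksize
        block = np.cumsum(src[start:stop], axis=0, dtype=dst.dtype) + carry
        dst[start:stop] = block
        carry = block[-1].copy()
    return dst


# Hypothetical sizes and file locations, chosen only for the demonstration.
n_steps, n_particles, chunksize = 10_000, 10, 1_000
workdir = tempfile.mkdtemp()
shape = (n_steps, n_particles)

# Preallocate the file-backed array, then fill it one chunk at a time.
emission = np.memmap(os.path.join(workdir, "emission.dat"),
                     dtype=np.float32, mode="w+", shape=shape)
rng = np.random.default_rng(0)
for start in range(0, n_steps, chunksize):
    stop = min(start + chunksize, n_steps)
    emission[start:stop] = np.abs(rng.standard_normal((stop - start, n_particles)))
emission.flush()

# Second step: out-of-core reductions over the stored data.
running = np.memmap(os.path.join(workdir, "cumsum.dat"),
                    dtype=np.float64, mode="w+", shape=shape)
chunked_cumsum(emission, running, chunksize)
running.flush()
overall_max = chunked_max(emission, chunksize)
```

Only one chunk of rows is read and processed at a time, so peak memory is governed by `chunksize` rather than by the size of the file. Unlike HDF5, this approach gives no compression and no single-file container for parameters.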

    Even better, I would like a Numexpr that operates not between cache and RAM but between RAM and the hard drive.
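The idea of a disk-level Numexpr can be imitated by applying an elementwise function to file-backed arrays one block of rows at a time. This is a minimal sketch, not Numexpr itself: `evaluate_out_of_core` and `new_array` are hypothetical helpers, the expression is passed as a plain Python callable rather than a string, and the file names and sizes are assumptions made for the demonstration.

```python
import os
import tempfile

import numpy as np


def evaluate_out_of_core(func, inputs, out, chunksize):
    """Apply an elementwise `func` to file-backed inputs, block by block."""
    n_rows = out.shape[0]
    for start in range(0, n_rows, chunksize):
        stop = min(start + chunksize, n_rows)
        # Only this slice of each input is read from disk and held in RAM.
        out[start:stop] = func(*[a[start:stop] for a in inputs])
    return out


workdir = tempfile.mkdtemp()
shape = (5_000, 10)


def new_array(name, dtype=np.float32):
    """Create a preallocated, file-backed array in the working directory."""
    return np.memmap(os.path.join(workdir, name), dtype=dtype, mode="w+", shape=shape)


rng = np.random.default_rng(1)
a = new_array("a.dat")
b = new_array("b.dat")
# Filled in one go here only because the demo arrays are small.
a[:] = rng.random(shape)
b[:] = rng.random(shape)

result = evaluate_out_of_core(lambda x, y: 2 * x + 3 * y,
                              [a, b], new_array("result.dat"), chunksize=512)
result.flush()
```

The same loop could call `numexpr.evaluate` on each block instead of a Python lambda, which would combine disk-level chunking with Numexpr's cache-level blocking.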

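    What are the practical options? As a first option I started experimenting with pyTables, but I am not happy with its complexity and abstractions (so different from numpy). Moreover, my current solution (see below) is ugly and not very efficient.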

    So the options I am seeking answers on are:

    import pandas as pd                                                                                                                                                                                                                                                                                               
    import numpy as np                                                                                                                                                                                                                                                                                                
    
    n_particles = 10                                                                                                                                                                                                                                                                                                  
    chunk_size = 1000                                                                                                                                                                                                                                                                                                 
    
    # 1) create a new emission file, compressing as we go                                                                                                                                                                                                                                                             
    emission = pd.HDFStore('emission.hdf',mode='w',complib='blosc')                                                                                                                                                                                                                                                   
    
    # generate simulated data                                                                                                                                                                                                                                                                                         
    for i in range(10):                                                                                                                                                                                                                                                                                               
    
        df = pd.DataFrame(np.abs(np.random.randn(chunk_size,n_particles)),dtype='float32')                                                                                                                                                                                                                            
    
        # create a globally unique index (time)                                                                                                                                                                                                                                                                       
        # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of-
    
        data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397                                                                                                                                                              
            try:                                                                                                                                                                                                                                                                                                          
                nrows = emission.get_storer('df').nrows                                                                                                                                                                                                                                                                   
            except:                                                                                                                                                                                                                                                                                                       
                nrows = 0                                                                                                                                                                                                                                                                                                 
    
            df.index = pd.Series(df.index) + nrows                                                                                                                                                                                                                                                                        
            emission.append('df',df)                                                                                                                                                                                                                                                                                      
    
        emission.close()                                                                                                                                                                                                                                                                                                  
    
        # 2) create counts                                                                                                                                                                                                                                                                                                
        cs = pd.HDFStore('counts.hdf',mode='w',complib='blosc')                                                                                                                                                                                                                                                           
    
        # this is an iterator, can be any size                                                                                                                                                                                                                                                                            
        for df in pd.read_hdf('emission.hdf','df',chunksize=200):                                                                                                                                                                                                                                                         
    
            counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8))                                                                                                                                                                                                                                             
    
            # set the index as the same                                                                                                                                                                                                                                                                                   
            counts.index = df.index                                                                                                                                                                                                                                                                                       
    
            # store the sum across all particles (as most are zero this will be a 
            # nice sub-selector                                                                                                                                                                                                                       
            # better maybe to have multiple of these sums that divide the particle space                                                                                                                                                                                                                                  
            # you don't have to do this but prob more efficient                                                                                                                                                                                                                                                           
            # you can do this in another file if you want/need                                                                                                                                                                                                                                                               
            counts['particles_0_4'] = counts.iloc[:,0:4].sum(1)                                                                                                                                                                                                                                                           
            counts['particles_5_9'] = counts.iloc[:,5:9].sum(1)                                                                                                                                                                                                                                                           
    
            # make the non_zero column indexable                                                                                                                                                                                                                                                                          
            cs.append('df',counts,data_columns=['particles_0_4','particles_5_9'])                                                                                                                                                                                                                                         
    
        cs.close()                                                                                                                                                                                                                                                                                                        
    
        # 3) find interesting counts                                                                                                                                                                                                                                                                                      
        print(pd.read_hdf('counts.hdf','df',where='particles_0_4>0'))
        print(pd.read_hdf('counts.hdf','df',where='particles_5_9>0'))
    
  • Implement a numpy array with the required functionality (how?)

  • Use pytables in a smarter way (different data structures/methods)

  • Use another library: h5py, blaze, pandas... (I haven't tried any of them so far)
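
As a rough illustration of the first option in the list (a numpy array that lives on disk), here is a minimal sketch using a memory-mapped `.npy` file. It assumes the total number of time steps is known before the simulation starts, because a memmap has a fixed shape and cannot grow by appending. The file name `emission.npy`, the temporary directory and the chunk sizes are placeholders, not part of the original question. Unlike the HDF5 route, this gives no compression and no indexed queries.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

n_particles = 10
chunk_size = 1000
n_chunks = 10
n_rows = n_chunks * chunk_size  # assumption: total length is known up front

# hypothetical output location, used only for this sketch
path = os.path.join(tempfile.mkdtemp(), "emission.npy")

rng = np.random.default_rng(0)

# 1) create the on-disk array and fill it chunk by chunk
emission = open_memmap(path, mode="w+", dtype=np.float32,
                       shape=(n_rows, n_particles))
for i in range(n_chunks):
    start = i * chunk_size
    # stand-in for one chunk of simulated emission rates
    emission[start:start + chunk_size] = np.abs(
        rng.standard_normal((chunk_size, n_particles)))
emission.flush()
del emission  # drop the writable mapping

# 2) reopen read-only and process in blocks that fit in memory
emission = np.load(path, mmap_mode="r")
total_counts = np.zeros(n_particles, dtype=np.int64)
for start in range(0, n_rows, 200):
    block = emission[start:start + 200]   # only this slice is read
    counts = rng.poisson(lam=block)       # photon counts for the block
    total_counts += counts.sum(axis=0)

print(total_counts)
```

Only the slice being read or written is held in memory, so the same loop structure should carry over to arrays much larger than RAM, as long as the file fits on disk.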

  • import pandas as pd                                                                                                                                                                                                                                                                                               
    import numpy as np                                                                                                                                                                                                                                                                                                
    
    n_particles = 10                                                                                                                                                                                                                                                                                                  
    chunk_size = 1000                                                                                                                                                                                                                                                                                                 
    
    # 1) create a new emission file, compressing as we go                                                                                                                                                                                                                                                             
    emission = pd.HDFStore('emission.hdf',mode='w',complib='blosc')                                                                                                                                                                                                                                                   
    
    # generate simulated data                                                                                                                                                                                                                                                                                         
    for i in range(10):                                                                                                                                                                                                                                                                                               
    
        df = pd.DataFrame(np.abs(np.random.randn(chunk_size,n_particles)),dtype='float32')                                                                                                                                                                                                                            
    
        # create a globally unique index (time)                                                                                                                                                                                                                                                                       
        # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of-
    
        data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397                                                                                                                                                              
            try:                                                                                                                                                                                                                                                                                                          
                nrows = emission.get_storer('df').nrows                                                                                                                                                                                                                                                                   
            except:                                                                                                                                                                                                                                                                                                       
                nrows = 0                                                                                                                                                                                                                                                                                                 
    
            df.index = pd.Series(df.index) + nrows                                                                                                                                                                                                                                                                        
            emission.append('df',df)                                                                                                                                                                                                                                                                                      
    
        emission.close()                                                                                                                                                                                                                                                                                                  
    
        # 2) create counts                                                                                                                                                                                                                                                                                                
        cs = pd.HDFStore('counts.hdf',mode='w',complib='blosc')                                                                                                                                                                                                                                                           
    
        # this is an iterator, can be any size                                                                                                                                                                                                                                                                            
        for df in pd.read_hdf('emission.hdf','df',chunksize=200):                                                                                                                                                                                                                                                         
    
            counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8))                                                                                                                                                                                                                                             
    
            # set the index as the same                                                                                                                                                                                                                                                                                   
            counts.index = df.index                                                                                                                                                                                                                                                                                       
    
            # store the sum across all particles (as most are zero, this will be a
            # nice sub-selector)
            # better maybe to have multiple of these sums that divide the particle space
            # you don't have to do this but prob more efficient
            # you can do this in another file if you want/need
            # (iloc slices exclude the end position, so 0:5 covers particles 0 to 4)
            counts['particles_0_4'] = counts.iloc[:,0:5].sum(1)
            counts['particles_5_9'] = counts.iloc[:,5:10].sum(1)
    
            # make the non_zero column indexable                                                                                                                                                                                                                                                                          
            cs.append('df',counts,data_columns=['particles_0_4','particles_5_9'])                                                                                                                                                                                                                                         
    
        cs.close()                                                                                                                                                                                                                                                                                                        
    
        # 3) find interesting counts                                                                                                                                                                                                                                                                                      
        print(pd.read_hdf('counts.hdf','df',where='particles_0_4>0'))
        print(pd.read_hdf('counts.hdf','df',where='particles_5_9>0'))
    
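Tentative solution (PyTables)

I save the simulation parameters in the '/parameters' group: each parameter is converted to a NumPy array scalar. It is a verbose solution, but it works.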
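To illustrate the parameter-saving step described above, here is a minimal sketch that assumes PyTables (`tables`) is the HDF5 library in use. Only the '/parameters' group name comes from the text; the file name, the parameter names and values, and the helper `save_parameters` are hypothetical.

```python
import os
import tempfile

import numpy as np
import tables


def save_parameters(h5file, params):
    """Store each parameter as a scalar array under /parameters (hypothetical helper)."""
    group = h5file.create_group('/', 'parameters', title='Simulation parameters')
    for name, value in params.items():
        # np.array(value) turns a plain Python number into a 0-d NumPy array
        h5file.create_array(group, name, obj=np.array(value))


# Hypothetical parameters, for illustration only
params = {'n_particles': 10, 't_step': 0.5e-6, 'diffusion_coeff': 1.2e-11}

path = os.path.join(tempfile.mkdtemp(), 'simulation.h5')
with tables.open_file(path, mode='w') as h5file:
    save_parameters(h5file, params)

# Read the parameters back into a plain dict of Python scalars
with tables.open_file(path, mode='r') as h5file:
    loaded = {node.name: np.asarray(node.read()).item()
              for node in h5file.list_nodes('/parameters')}
```

Keeping one scalar node per parameter is verbose, as noted above, but each value remains individually addressable by name inside the file.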

    import pandas as pd
    import numpy as np

    n_particles = 10
    chunk_size = 1000

    # 1) create a new emission file, compressing as we go
    emission = pd.HDFStore('emission.hdf', mode='w', complib='blosc')

    # generate simulated data
    for i in range(10):

        df = pd.DataFrame(np.abs(np.random.randn(chunk_size, n_particles)), dtype='float32')

        # create a globally unique index (time)
        # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of-data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397
        try:
            nrows = emission.get_storer('df').nrows
        except (KeyError, AttributeError):
            # 'df' does not exist yet on the first pass
            nrows = 0

        df.index = pd.Series(df.index) + nrows
        emission.append('df', df)

    emission.close()

    # 2) create counts
    cs = pd.HDFStore('counts.hdf', mode='w', complib='blosc')

    # this is an iterator, can be any size
    for df in pd.read_hdf('emission.hdf', 'df', chunksize=200):

        counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8))

        # set the index as the same
        counts.index = df.index

        # store the sum across all particles (as most are zero, this will be a
        # nice sub-selector)
        # better maybe to have multiple of these sums that divide the particle space
        # you don't have to do this but prob more efficient
        # you can do this in another file if you want/need
        # (iloc slices exclude the end position, so 0:5 covers particles 0 to 4)
        counts['particles_0_4'] = counts.iloc[:, 0:5].sum(1)
        counts['particles_5_9'] = counts.iloc[:, 5:10].sum(1)

        # make the sum columns queryable by declaring them as data columns
        cs.append('df', counts, data_columns=['particles_0_4', 'particles_5_9'])

    cs.close()

    # 3) find interesting counts
    print(pd.read_hdf('counts.hdf', 'df', where='particles_0_4>0'))
    print(pd.read_hdf('counts.hdf', 'df', where='particles_5_9>0'))
    
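I save `emission` as an extendable array (`array`), because I generate the data in chunks and need to append each new chunk (I do know the final size, though). Saving `counts` is more problematic. If it is saved as a PyTables array, queries such as "counts >= 2" are hard to perform. I therefore save the counts as multiple tables (one per particle) [ugly] and query those tables instead. I am not sure this is space-efficient, and generating all these tables instead of using a single array clutters the HDF5 file considerably. Moreover, strangely enough, creating these tables requires defining a custom data type (even for standard NumPy dtypes).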
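The chunk-by-chunk appending described above can be sketched as follows. This is an illustration only: it assumes PyTables and its `EArray` type (the text only says "extendable array"), and the file name, node name, and sizes are hypothetical.

```python
import os
import tempfile

import numpy as np
import tables

n_particles = 10
chunk_size = 1000
n_chunks = 5  # hypothetical: the final size is known in advance

path = os.path.join(tempfile.mkdtemp(), 'emission.h5')
with tables.open_file(path, mode='w') as h5file:
    emission = h5file.create_earray(
        '/', 'emission',
        atom=tables.Float32Atom(),
        shape=(0, n_particles),              # the zero-length axis is the one that grows
        expectedrows=n_chunks * chunk_size,  # known final size, used as a sizing hint
    )
    for _ in range(n_chunks):
        # One chunk of simulated emission rates: rows are time steps
        chunk = np.abs(np.random.randn(chunk_size, n_particles)).astype(np.float32)
        emission.append(chunk)
    final_shape = emission.shape
```

Because the final size is known, passing it as `expectedrows` lets the library choose its chunk layout up front instead of guessing.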

    
Each particle-count "table" has a different name (`name = "particle_%d" % ip`), and I need to put them in a Python list for easy iteration.
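A minimal sketch of the per-particle tables described above, assuming PyTables. The single-field structured dtype stands in for the "custom data type" mentioned earlier, and `Table.get_where_list` is assumed as the query call; the file name, group name, sizes, and Poisson rate are hypothetical.

```python
import os
import tempfile

import numpy as np
import tables

n_particles = 3
n_steps = 1000

# A structured dtype with a single field describes the table's one column
counts_dtype = np.dtype([('counts', np.uint8)])

rng = np.random.default_rng(0)
path = os.path.join(tempfile.mkdtemp(), 'counts.h5')
with tables.open_file(path, mode='w') as h5file:
    group = h5file.create_group('/', 'counts')

    # One table per particle, collected in a list for easy iteration
    particle_tables = []
    for ip in range(n_particles):
        name = "particle_%d" % ip
        table = h5file.create_table(group, name, description=counts_dtype,
                                    expectedrows=n_steps)
        data = np.zeros(n_steps, dtype=counts_dtype)
        data['counts'] = rng.poisson(lam=0.5, size=n_steps).astype(np.uint8)
        table.append(data)
        table.flush()
        particle_tables.append(table)

    # Row indices (time steps) where each particle has at least 2 counts
    bright_steps = [t.get_where_list('counts >= 2') for t in particle_tables]
    table_names = [t.name for t in particle_tables]
```

Each entry of `bright_steps` is an array of row indices for one particle, so the list mirrors the one-table-per-particle layout.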

    import pandas as pd                                                                                                                                                                                                                                                                                               
    import numpy as np                                                                                                                                                                                                                                                                                                
    
    n_particles = 10                                                                                                                                                                                                                                                                                                  
    chunk_size = 1000                                                                                                                                                                                                                                                                                                 
    
    # 1) create a new emission file, compressing as we go                                                                                                                                                                                                                                                             
    emission = pd.HDFStore('emission.hdf',mode='w',complib='blosc')                                                                                                                                                                                                                                                   
    
    # generate simulated data                                                                                                                                                                                                                                                                                         
    for i in range(10):                                                                                                                                                                                                                                                                                               
    
        df = pd.DataFrame(np.abs(np.random.randn(chunk_size,n_particles)),dtype='float32')                                                                                                                                                                                                                            
    
        # create a globally unique index (time)                                                                                                                                                                                                                                                                       
        # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of-data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397
        try:
            nrows = emission.get_storer('df').nrows
        except:
            nrows = 0

        df.index = pd.Series(df.index) + nrows
        emission.append('df',df)
    
        emission.close()                                                                                                                                                                                                                                                                                                  
    
        # 2) create counts                                                                                                                                                                                                                                                                                                
        cs = pd.HDFStore('counts.hdf',mode='w',complib='blosc')                                                                                                                                                                                                                                                           
    
        # this is an iterator, can be any size                                                                                                                                                                                                                                                                            
        for df in pd.read_hdf('emission.hdf','df',chunksize=200):                                                                                                                                                                                                                                                         
    
            counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8))                                                                                                                                                                                                                                             
    
            # set the index as the same                                                                                                                                                                                                                                                                                   
            counts.index = df.index                                                                                                                                                                                                                                                                                       
    
            # store the sum across all particles (as most are zero this will be a 
            # nice sub-selector                                                                                                                                                                                                                       
            # better maybe to have multiple of these sums that divide the particle space                                                                                                                                                                                                                                  
            # you don't have to do this but prob more efficient                                                                                                                                                                                                                                                           
            # you can do this in another file if you want/need                                                                                                                                                                                                                                                               
            counts['particles_0_4'] = counts.iloc[:,0:4].sum(1)                                                                                                                                                                                                                                                           
            counts['particles_5_9'] = counts.iloc[:,5:9].sum(1)                                                                                                                                                                                                                                                           
    
            # make the non_zero column indexable                                                                                                                                                                                                                                                                          
            cs.append('df',counts,data_columns=['particles_0_4','particles_5_9'])                                                                                                                                                                                                                                         
    
        cs.close()                                                                                                                                                                                                                                                                                                        
    
        # 3) find interesting counts                                                                                                                                                                                                                                                                                      
        print pd.read_hdf('counts.hdf','df',where='particles_0_4>0')                                                                                                                                                                                                                                                      
        print pd.read_hdf('counts.hdf','df',where='particles_5_9>0')         
    
Edit: this question eventually resulted in a Brownian motion simulator.

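The simulation parameters can be stored in the HDF5 file as well (they are pickled, so you can store them in whatever form you already have them; the pickle is limited to 64 kB in size).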
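A minimal sketch of one way to keep the parameters next to the data, assuming the node attributes exposed by `HDFStore.get_storer(...).attrs`. The parameter names, the attribute name `sim_params` and the file name are made up for illustration; a Python object assigned to such an attribute is pickled when it is written.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical simulation parameters; any picklable object would do
params = {"n_particles": 10, "chunk_size": 1000, "time_step": 5e-7}

# Write to a temporary location so the sketch does not touch real data
path = os.path.join(tempfile.mkdtemp(), "emission_with_params.hdf")

with pd.HDFStore(path, mode="w") as store:
    df = pd.DataFrame(np.abs(np.random.randn(5, 3)).astype("float32"))
    store.append("df", df)
    # Attach the dict to the 'df' node; it is stored as a pickled attribute
    store.get_storer("df").attrs.sim_params = params

with pd.HDFStore(path, mode="r") as store:
    # Reading the attribute back unpickles the dict
    restored = store.get_storer("df").attrs.sim_params

print(restored)
```

Because the attribute lives on the same node as the table, the parameters travel with the data file and a second processing step can read them without a separate configuration file.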

    import pandas as pd
    import numpy as np

    n_particles = 10
    chunk_size = 1000

    # 1) create a new emission file, compressing as we go
    emission = pd.HDFStore('emission.hdf', mode='w', complib='blosc')

    # generate simulated data
    for i in range(10):

        df = pd.DataFrame(np.abs(np.random.randn(chunk_size, n_particles)), dtype='float32')

        # create a globally unique index (time)
        # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of-data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397
        try:
            nrows = emission.get_storer('df').nrows
        except (KeyError, AttributeError):
            # 'df' does not exist yet on the first pass
            nrows = 0

        df.index = pd.Series(df.index) + nrows
        emission.append('df', df)

    emission.close()

    # 2) create counts
    cs = pd.HDFStore('counts.hdf', mode='w', complib='blosc')

    # this is an iterator, can be any size
    for df in pd.read_hdf('emission.hdf', 'df', chunksize=200):

        counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8))

        # set the index as the same
        counts.index = df.index

        # store the sum across all particles (as most are zero this will be a
        # nice sub-selector)
        # better maybe to have multiple of these sums that divide the particle space
        # you don't have to do this but prob more efficient
        # you can do this in another file if you want/need
        # iloc slices exclude the end, so 0:5 covers particles 0..4 and 5:10 covers 5..9
        # the sums are cast to int64, a dtype that where-queries handle
        counts['particles_0_4'] = counts.iloc[:, 0:5].sum(1).astype(np.int64)
        counts['particles_5_9'] = counts.iloc[:, 5:10].sum(1).astype(np.int64)

        # make the sum columns queryable (data columns)
        cs.append('df', counts, data_columns=['particles_0_4', 'particles_5_9'])

    cs.close()

    # 3) find interesting counts
    print(pd.read_hdf('counts.hdf', 'df', where='particles_0_4>0'))
    print(pd.read_hdf('counts.hdf', 'df', where='particles_5_9>0'))
    
Alternatively, you can make each particle a data column and select on each of them individually.
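A minimal sketch of the per-particle variant, assuming `data_columns=True` in `HDFStore.append`, which makes every column queryable. The column names `p0` to `p9`, the Poisson rate and the file name are hypothetical and only serve the illustration; this is not the original code of the answer.

```python
import os
import tempfile

import numpy as np
import pandas as pd

n_particles = 10
rng = np.random.default_rng(0)

# Hypothetical string column names keep the `where` expressions readable
columns = [f"p{i}" for i in range(n_particles)]
counts = pd.DataFrame(
    rng.poisson(lam=0.05, size=(1000, n_particles)),
    columns=columns,
)

# Write to a temporary location so the sketch does not touch real data
path = os.path.join(tempfile.mkdtemp(), "counts_by_particle.hdf")

with pd.HDFStore(path, mode="w", complib="blosc") as cs:
    # data_columns=True turns every particle column into a data column
    cs.append("df", counts, data_columns=True)

# Select only the time steps in which particle 3 has at least one count
active_p3 = pd.read_hdf(path, "df", where="p3 > 0")
print(len(active_p3), "of", len(counts), "rows have counts for p3")
```

The trade-off is that every data column is stored and indexed separately, so with many particles the grouped sums used in the previous example may stay cheaper to write.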

    
And some output (a fairly active emission in this case :)

    import pandas as pd                                                                                                                                                                                                                                                                                               
    import numpy as np                                                                                                                                                                                                                                                                                                
    
    n_particles = 10                                                                                                                                                                                                                                                                                                  
    chunk_size = 1000                                                                                                                                                                                                                                                                                                 
    
    # 1) create a new emission file, compressing as we go                                                                                                                                                                                                                                                             
    emission = pd.HDFStore('emission.hdf',mode='w',complib='blosc')                                                                                                                                                                                                                                                   
    
    # generate simulated data                                                                                                                                                                                                                                                                                         
    for i in range(10):                                                                                                                                                                                                                                                                                               
    
        df = pd.DataFrame(np.abs(np.random.randn(chunk_size,n_particles)),dtype='float32')                                                                                                                                                                                                                            
    
        # create a globally unique index (time)                                                                                                                                                                                                                                                                       
        # see http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of-data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397
        try:
            nrows = emission.get_storer('df').nrows
        except Exception:
            # 'df' does not exist yet on the first pass through the loop
            nrows = 0

        df.index = pd.Series(df.index) + nrows
        emission.append('df',df)

    emission.close()

    # 2) create counts
    cs = pd.HDFStore('counts.hdf',mode='w',complib='blosc')

    # this is an iterator, can be any size
    for df in pd.read_hdf('emission.hdf','df',chunksize=200):

        counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8))

        # set the index as the same
        counts.index = df.index

        # store the sum across each group of particles (as most are zero this
        # will be a nice sub-selector)
        # better maybe to have multiple of these sums that divide the particle space
        # you don't have to do this but prob more efficient
        # you can do this in another file if you want/need
        # note: iloc slices exclude the end, so 0:5 covers particles 0-4 and 5:10 covers 5-9
        counts['particles_0_4'] = counts.iloc[:,0:5].sum(1)
        counts['particles_5_9'] = counts.iloc[:,5:10].sum(1)

        # make the sum columns indexable, so they can be used in where= queries
        cs.append('df',counts,data_columns=['particles_0_4','particles_5_9'])

    cs.close()

    # 3) find interesting counts
    print(pd.read_hdf('counts.hdf','df',where='particles_0_4>0'))
    print(pd.read_hdf('counts.hdf','df',where='particles_5_9>0'))
    
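    Solution: since the functionality Pandas provides is not needed here, and processing with it is much slower (see the notebook below), the best approach is to use PyTables or h5py directly. So far I have only tried the PyTables approach.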
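
A minimal sketch of what the direct PyTables route could look like, assuming the same toy dimensions as the pandas example (10 particles, chunks of 1000 time steps). This is not the code from the notebook: the file name `emission_pytables.h5`, the node names `emission` and `counts`, and the read size of 200 rows are all made up for illustration. It writes the emission rates into an extendable, blosc-compressed array and then reads them back slice by slice to draw Poisson counts, so neither array has to fit in memory at once.

```python
import os
import tempfile

import numpy as np
import tables

n_particles = 10
chunk_size = 1000
n_chunks = 10

# Hypothetical file name, placed in a temporary directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'emission_pytables.h5')
filters = tables.Filters(complevel=5, complib='blosc')

# 1) write the emission rates chunk by chunk into an extendable array
with tables.open_file(path, mode='w') as h5:
    emission = h5.create_earray(
        h5.root, 'emission',
        atom=tables.Float32Atom(),
        shape=(0, n_particles),   # the first dimension (time) can grow
        filters=filters,
        expectedrows=n_chunks * chunk_size,
    )
    for _ in range(n_chunks):
        chunk = np.abs(np.random.randn(chunk_size, n_particles))
        emission.append(chunk.astype('float32'))

# 2) read the rates back in slices and store Poisson counts next to them
with tables.open_file(path, mode='a') as h5:
    emission = h5.root.emission
    counts = h5.create_earray(
        h5.root, 'counts',
        atom=tables.UInt8Atom(),
        shape=(0, n_particles),
        filters=filters,
        expectedrows=int(emission.nrows),
    )
    read_size = 200
    for start in range(0, int(emission.nrows), read_size):
        rates = emission[start:start + read_size]   # only this slice is loaded
        counts.append(np.random.poisson(lam=rates).astype(np.uint8))

    total_rows = int(counts.nrows)
    # Loading everything is fine for this small demo, not for >10 GB of data.
    total_counts = int(counts[:].sum())
```

An `EArray` fits here because every row has the same dtype and only the time axis grows. If the second step needs queries such as "rows where any particle emitted", a PyTables `Table` with indexed columns may be the better container, at the cost of a more involved schema.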

    All tests were carried out in this notebook:

    Introduction to PyTables data structures. Reference:

    
    PyTables can store data in an HDF5 file in two formats: arrays and tables.

    
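    Arrays

    There are three types of arrays: Array, CArray and EArray. All of them let you store and retrieve (multidimensional) slices with a notation similar to numpy slicing.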

    import pandas as pd                                                                                                                                                                                                                                                                                               
    import numpy as np                                                                                                                                                                                                                                                                                                
    
    n_particles = 10                                                                                                                                                                                                                                                                                                  
    chunk_size = 1000                                                                                                                                                                                                                                                                                                 
    
    # 1) create a new emission file, compressing as we go                                                                                                                                                                                                                                                             
    emission = pd.HDFStore('emission.hdf',mode='w',complib='blosc')                                                                                                                                                                                                                                                   
    
    # generate simulated data                                                                                                                                                                                                                                                                                         
    for i in range(10):                                                                                                                                                                                                                                                                                               
    
        df = pd.DataFrame(np.abs(np.random.randn(chunk_size,n_particles)),dtype='float32')                                                                                                                                                                                                                            
    
        # create a globally unique index (time)                                                                                                                                                                                                                                                                       
        # http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of-
    
        data-to-a-pandas-hdfstore-and-get-a-natural/16999397#16999397                                                                                                                                                              
            try:                                                                                                                                                                                                                                                                                                          
                nrows = emission.get_storer('df').nrows                                                                                                                                                                                                                                                                   
            except:                                                                                                                                                                                                                                                                                                       
                nrows = 0                                                                                                                                                                                                                                                                                                 
    
            df.index = pd.Series(df.index) + nrows                                                                                                                                                                                                                                                                        
            emission.append('df',df)                                                                                                                                                                                                                                                                                      
    
        emission.close()                                                                                                                                                                                                                                                                                                  
    
        # 2) create counts                                                                                                                                                                                                                                                                                                
        cs = pd.HDFStore('counts.hdf',mode='w',complib='blosc')                                                                                                                                                                                                                                                           
    
        # this is an iterator, can be any size                                                                                                                                                                                                                                                                            
        for df in pd.read_hdf('emission.hdf','df',chunksize=200):                                                                                                                                                                                                                                                         
    
            counts = pd.DataFrame(np.random.poisson(lam=df).astype(np.uint8))                                                                                                                                                                                                                                             
    
            # set the index as the same                                                                                                                                                                                                                                                                                   
            counts.index = df.index                                                                                                                                                                                                                                                                                       
    
            # store the sum across all particles (as most are zero this will be a 
            # nice sub-selector                                                                                                                                                                                                                       
            # better maybe to have multiple of these sums that divide the particle space                                                                                                                                                                                                                                  
            # you don't have to do this but prob more efficient                                                                                                                                                                                                                                                           
            # you can do this in another file if you want/need                                                                                                                                                                                                                                                               
            counts['particles_0_4'] = counts.iloc[:,0:4].sum(1)                                                                                                                                                                                                                                                           
            counts['particles_5_9'] = counts.iloc[:,5:9].sum(1)                                                                                                                                                                                                                                                           
    
            # make the non_zero column indexable                                                                                                                                                                                                                                                                          
            cs.append('df',counts,data_columns=['particles_0_4','particles_5_9'])                                                                                                                                                                                                                                         
    
        cs.close()                                                                                                                                                                                                                                                                                                        
    
        # 3) find interesting counts                                                                                                                                                                                                                                                                                      
        print(pd.read_hdf('counts.hdf', 'df', where='particles_0_4>0'))
        print(pd.read_hdf('counts.hdf', 'df', where='particles_5_9>0'))
    
    # Write data to store (broadcasting works)
    array1[:]  = 3
    
    # Read data from store
    in_ram_array = array1[:]
    
    As an optimization for some use cases, a CArray is stored in "chunks", whose size can be chosen at creation time with chunkshape.
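The chunked storage described above can be sketched as follows. This is a minimal illustration rather than the original author's code: it assumes PyTables 3.x (`tables.open_file`, `create_carray`), and the function name `carray_roundtrip`, the node name `array1`, and the shapes are hypothetical choices made for the example.

```python
import os
import tempfile

import numpy as np
import tables


def carray_roundtrip(path, shape=(1000, 10), chunkshape=(100, 10)):
    """Create a chunked CArray on disk, fill it, and read it back into RAM.

    The shape is fixed at creation; chunkshape sets the size of the
    on-disk chunks (both values here are illustrative assumptions).
    """
    with tables.open_file(path, mode="w") as h5:
        array1 = h5.create_carray(
            h5.root,
            "array1",
            atom=tables.Float32Atom(),
            shape=shape,
            chunkshape=chunkshape,
        )
        # Write data to the store (a scalar is broadcast over the slice)
        array1[:] = 3
        # Overwrite only the first chunk's worth of rows with zeros
        array1[0:chunkshape[0], :] = np.zeros(
            (chunkshape[0], shape[1]), dtype=np.float32
        )
        # Read data from the store into an in-memory numpy array
        in_ram_array = array1[:]
        stored_chunkshape = tuple(array1.chunkshape)
    return in_ram_array, stored_chunkshape


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "carray_demo.h5")
    data, chunks = carray_roundtrip(demo_path)
    print(data.shape, chunks)
```

Choosing a chunk shape that matches the access pattern (here, blocks of whole rows) tends to reduce the number of chunks touched per read or write; the best value depends on the workload.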

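    The size of an Array or CArray is fixed at creation time. After creation, however, you can still fill/write the array chunk by chunk. An EArray, by contrast, can be extended with the .append() method.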
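A minimal sketch of growing an on-disk array with `.append()`, assuming PyTables 3.x and its extendable array type, `EArray`. The function name `append_chunks`, the node name `emission`, and the random data are hypothetical stand-ins for the simulator's real output.

```python
import os
import tempfile

import numpy as np
import tables


def append_chunks(path, n_chunks=10, chunk_size=1000, n_particles=10, seed=0):
    """Append simulated chunks to an EArray and return the final row count."""
    rng = np.random.default_rng(seed)
    with tables.open_file(path, mode="w") as h5:
        # A 0 in the shape marks the dimension that is allowed to grow
        emission = h5.create_earray(
            h5.root,
            "emission",
            atom=tables.Float32Atom(),
            shape=(0, n_particles),
            expectedrows=n_chunks * chunk_size,
        )
        for _ in range(n_chunks):
            # Placeholder data: one block of time steps for all particles
            chunk = np.abs(rng.standard_normal((chunk_size, n_particles)))
            emission.append(chunk.astype(np.float32))
        return int(emission.nrows)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "earray_demo.h5")
    print(append_chunks(demo_path))
```

Because only one chunk is held in memory at a time, the total size of the file is limited by disk space rather than RAM.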

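    A Table is a quite different beast. It is basically a "table": you have only 1D indexing, and each element is a row. Each row holds the "columns", and each column can have a different data type. If you are familiar with numpy record arrays, a table is basically a 1D record array, with each element ha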