Python: iterating over hundreds of thousands of CSV files with pandas

Tags: python, multithreading, python-3.x, pandas, multiprocessing


I am currently iterating over a large number of CSV files with concurrent.futures.ProcessPoolExecutor, like this:

import concurrent.futures
import pandas as pd

def readcsv(file):
    df = pd.read_csv(file, delimiter=r"\s+", names=[headers], comment="#")
    # DOING SOME OTHER STUFF TO IT
    full.append(df)

if __name__ == "__main__":
    full = []
    files = "glob2 path to files"  # placeholder: a list of file paths
    with concurrent.futures.ProcessPoolExecutor(max_workers=45) as proc:
        proc.map(readcsv, files)
    full = pd.concat(full)
This does not work as written: the last line raises a ValueError, "No objects to concatenate". How can I iterate over these files, append them to a list and then merge them, or load them directly into a DataFrame, as fast as possible? Available resources are 64 GB of RAM and 46 cores in a VM.

Comment: Have you looked at dask? It would help you here: df = dask.dataframe.read_csv('*.csv').compute(). If you leave off .compute() you can also operate on the data while it is being read, and if you don't need everything in memory at once and only want, say, the sum of one column, dask will process the files in chunks for you.

Answer: map actually returns the results of the function. So you only need to return df:
import concurrent.futures
import pandas as pd

def readcsv(file):
    df = pd.read_csv(file, delimiter=r"\s+", names=[headers], comment="#")
    # DOING SOME OTHER STUFF TO IT
    return df

if __name__ == "__main__":
    files = "glob2 path to files"  # placeholder: a list of file paths
    with concurrent.futures.ProcessPoolExecutor(max_workers=45) as proc:
        full = pd.concat(proc.map(readcsv, files))
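The return-and-concat pattern can be exercised end to end. A self-contained sketch follows; it uses ThreadPoolExecutor so it runs without the pickling and start-method caveats of a process pool demo (the Executor.map API is identical), and the temp files, column names, and worker counts are all made up for illustration:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

headers = ["a", "b"]  # hypothetical column names

def readcsv(path):
    # whitespace-delimited files with '#' comment lines, as in the question
    return pd.read_csv(path, sep=r"\s+", names=headers, comment="#")

# create a few small whitespace-delimited CSV files to stand in for the
# hundreds of thousands of real ones
tmpdir = tempfile.mkdtemp()
files = []
for i in range(3):
    path = os.path.join(tmpdir, f"part{i}.csv")
    with open(path, "w") as f:
        f.write("# comment line\n1 2\n3 4\n")
    files.append(path)

# map returns the DataFrames in input order; concat merges them in one pass
with ThreadPoolExecutor(max_workers=4) as pool:
    full = pd.concat(pool.map(readcsv, files), ignore_index=True)
```

With three 2-row files this yields a single 6-row, 2-column DataFrame. For the real workload, swapping ThreadPoolExecutor back to ProcessPoolExecutor keeps the same code shape while parallelising the CPU-bound parsing across cores.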