Python: how to read a small percentage of the lines of a very large CSV (pandas, time series, large dataset)

python pandas

I have a time series in a large text file. The file is over 4 GB.

Since it is a time series, I only want to read 1% of the lines.

Ideal minimal example:

df = pandas.read_csv('super_size_file.log',
                      load_line_percentage = 1)
print(df)

Desired output:

>line_number, value
 0,           654564
 100,         54654654
 200,         54
 300,         46546
 ...

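I cannot resample after loading, because loading the file in the first place takes too much memory.

I could load it chunk by chunk and resample each chunk, but that seems inefficient to me.

Any ideas are welcome.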

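When using the read_csv function, you can pass the number of rows to read. Note that nrows reads the first rows of the file, not a sample spread across it. Here is what you can do: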

import pandas as pd
# Select file
infile = 'path/file'
# x is a placeholder: the total number of lines in the file
number_of_lines = x
# Use nrows to choose the number of rows; it must be an integer
data = pd.read_csv(infile, nrows=int(number_of_lines * 0.01))
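Since nrows only returns the first rows, one possible way to get one row in every 100 across the whole file is to pass a callable to the skiprows parameter of read_csv. The sketch below is an illustration, not part of the original answer: the helper name read_every_nth_row and the in-memory demo data are assumptions, and the callable assumes the header is on line 0.

```python
import io
import pandas as pd

def read_every_nth_row(source, n=100):
    """Keep the header (line 0) and every n-th data row, starting with the first."""
    # The callable receives the 0-based line index and returns True to skip that line.
    return pd.read_csv(source, skiprows=lambda i: i > 0 and (i - 1) % n != 0)

# Small in-memory stand-in for the large file (hypothetical data).
demo = io.StringIO(
    "line_number,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000))
)
sample = read_every_nth_row(demo, n=100)
print(sample)
```

The file is still scanned from start to finish, but only the kept rows are parsed into the DataFrame, so memory use stays close to the size of the sample.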

If you want to read the data chunk by chunk, as you mentioned, you can also use the chunksize option:

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)

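Have a look at this. It contains an elegant description of how to read a CSV file in chunks.

The basic idea is to pass the chunksize parameter (the number of rows per chunk).

Then, in a loop, you can read the file chunk by chunk.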
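The chunked approach can be combined with the 1-in-100 sampling the question asks for. The following is a minimal sketch under stated assumptions: sample_in_chunks is a hypothetical helper, the demo data is made up, and it relies on read_csv keeping a running default row index across chunks.

```python
import io
import pandas as pd

def sample_in_chunks(source, n=100, chunksize=250):
    """Read the CSV chunk by chunk and keep only rows whose global index is a multiple of n."""
    parts = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # The default index continues across chunks (0, 1, 2, ...),
        # so this filter selects global rows 0, n, 2n, ...
        parts.append(chunk[chunk.index % n == 0])
    return pd.concat(parts)

# Small in-memory stand-in for the large file (hypothetical data).
demo = io.StringIO(
    "line_number,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000))
)
sampled = sample_in_chunks(demo, n=100, chunksize=250)
print(sampled)
```

Only one chunk plus the accumulated sample is held in memory at a time, at the cost of parsing every row once.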

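Whenever I have to deal with a very large file, I ask "What would Dask do?"

Load the large file as a dask.DataFrame, convert the index to a column (a workaround, since full index control is not available), and then filter on that new column.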

import dask.dataframe as dd
import pandas as pd

nth_row = 100  # grab every nth row from the larger DataFrame
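dask_df = dd.read_csv('super_size_file.log')  # assuming this file can be read by pd.read_csv
# Note: with dd.read_csv the index restarts at 0 in each partition, so the filter
# below keeps every nth row within each partition rather than across the whole file
dask_df['df_index'] = dask_df.index
dask_df_smaller = dask_df[dask_df['df_index'] % nth_row == 0]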

df_smaller = dask_df_smaller.compute()  # to execute the operations and return a pandas DataFrame

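This gives you rows 0, 100, 200, etc. from the larger file. If you want to trim the DataFrame down to specific columns, do so before calling compute, i.e.

dask_df_smaller = dask_df_smaller[['Signal_1', 'Signal_2']]

You can also call compute with the

scheduler='processes'

option to use all the cores on your CPU.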

This should do what you need:

# Select all rows from a CSV file that match the search criteria

import csv
# Ask the user for comma-separated search criteria
search_parts = input("Enter search criteria:\n").split(",")
# Open the CSV data file; the with block closes it when done
with open("C:\\your_path\\test.csv", newline="") as f:
    # Go over each row and print it if it contains every search term
    for row in csv.reader(f):
        if all(x in row for x in search_parts):
            print(row)

# If you only want to read rows 1,000,000 ... 1,999,999 (that is 1,000,000 rows)
# Note: an integer skiprows also skips the header line
read_csv(..., skiprows=1000000, nrows=1000000)
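To read a window of rows while keeping the header, skiprows can be given a range that starts at line 1 instead of a plain integer. This is a small sketch, assuming the header sits on line 0; read_row_range and the demo data are hypothetical names used only for illustration.

```python
import io
import pandas as pd

def read_row_range(source, start, count):
    """Read `count` data rows beginning at 0-based data row `start`, keeping the header."""
    # File line 0 is the header, so data row k sits on file line k + 1.
    # Skipping lines 1..start drops data rows 0..start-1 and leaves the header in place.
    return pd.read_csv(source, skiprows=range(1, start + 1), nrows=count)

# Small in-memory stand-in for the large file (hypothetical data).
demo = io.StringIO(
    "line_number,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000))
)
window = read_row_range(demo, start=100, count=50)
print(window.head())
```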

read_csv has an nrows argument and a chunksize argument; have you tried either of these?

You can run the Linux head command and read its output:

head super_size_file.log > small_sample.log

or

head -n 1000 super_size_file.log > small_sample.log
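head keeps only the first lines of the file. If the goal is one line in every 100 across the whole file, a streaming filter written with the standard library is one possible alternative for producing the smaller file. The sketch below assumes the first line is a header; write_every_nth_line and the in-memory streams are illustrative stand-ins for real file handles.

```python
import io
from itertools import islice

def write_every_nth_line(src, dst, n=100):
    """Copy the header plus every n-th data line from src to dst, streaming line by line."""
    header = next(src, None)
    if header is None:
        return 0  # empty input, nothing to copy
    dst.write(header)
    kept = 0
    # islice walks the remaining lines lazily and yields lines 0, n, 2n, ...
    for line in islice(src, 0, None, n):
        dst.write(line)
        kept += 1
    return kept

# In-memory stand-ins for open(...) file handles (hypothetical data).
src = io.StringIO(
    "line_number,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000))
)
dst = io.StringIO()
kept = write_every_nth_line(src, dst, n=100)
print(kept, "data lines kept")
```

With real files, src and dst would be handles from open(); the resulting small file can then be loaded with pd.read_csv as usual.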

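@EdChum: nrows loads the first n rows. I want to load from the whole file, but only one line in every 100... chunksize is fine, but loading each chunk takes time (and I do not want 99% of it). It is definitely my plan B, though. @sh.jeon: the Linux "head" command seems to do the same as nrows. (Interesting, by the way, but from my point of view it is the same thing.)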