Python: Is there a way to read and modify the contents of a huge CSV file in PyCharm?
I am trying to write a program that reads a CSV, checks whether one column of each row contains a substring, and, if the substring is not present, writes certain columns out to a new CSV. I have the code below, but the CSV I need to process has over 3 million rows. I use PyCharm, which currently cannot handle that much data: it can only open the CSV read-only and will not let me work with it. I know pandas has a chunksize feature, but I don't know how to integrate it with the rest of my code.
import csv
import pandas as pd

def reading(csv_input):
    originalLength = 0
    rowCount = 0
    with open(f'Web Report {csv_input}', 'w') as file:
        writer = csv.writer(file)
        writer.writerow(['Index', 'URL Category', 'User IP', 'URL'])
        dropCount = 0
        data = pd.read_csv(csv_input, chunksize=100000)
        df = pd.DataFrame(data,
                          columns=['Line', 'Date', 'Hour', 'User Name', 'User IP', 'Site Name',
                                   'URL Category', 'Action', 'Action Description'])
        originalLength = len(df.index)
        for line in range(originalLength):
            dataLine = df.loc[line]
            x = dataLine.get(key='Action')
            if x == 0:
                siteName = dataLine.get(key='Site Name')
                if 'dbk' in siteName:
                    dropCount = dropCount + 1
                elif 'ptc' in siteName:
                    dropCount = dropCount + 1
                elif 'wcf' in siteName:
                    dropCount = dropCount + 1
                elif 'google' in siteName:
                    dropCount = dropCount + 1
                else:
                    writer.writerow([line,  # Original Index
                                     df.loc[line].get(key='URL Category'),  # Original URL Category
                                     df.loc[line].get(key='User IP'),  # Original User IP
                                     df.loc[line].get(key='Site Name')])  # Original Site Name
                    rowCount = rowCount + 1
            else:
                dropCount = dropCount + 1
    print("Input: " + str(csv_input))
    print("Output: " + str(file.name))
    print("Original Length: " + str(originalLength))
    print("Current Length: " + str(rowCount))
    print("Drop Count: " + str(dropCount) + "\n")
    return df
If you use the csv module to write the file, then you can also use it to read the input line by line:
import csv

with open('input.csv') as infile, open('output.csv', 'w') as outfile:
    csv_reader = csv.reader(infile)
    csv_writer = csv.writer(outfile)

    # copy headers
    headers = next(csv_reader)
    csv_writer.writerow(headers)

    # process rows
    for row in csv_reader:  # read row by row
        # keep only rows with even index
        if int(row[0]) % 2 == 0:
            print('--- row ---')
            print(row)
            csv_writer.writerow(row)
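The same csv-module pattern can be applied to the question's actual filter: keep a row only when Action is 0 and Site Name contains none of the substrings. A sketch; the column names and the 'dbk'/'ptc'/'wcf'/'google' substrings are taken from the question's code. Note that the csv module yields every field as a string, so Action is compared against '0':

```python
import csv

# Substrings from the question's code; rows whose Site Name contains any
# of them are dropped.
BLOCKED = ('dbk', 'ptc', 'wcf', 'google')

def filter_report(csv_input, csv_output):
    kept = dropped = 0
    with open(csv_input, newline='') as infile, \
         open(csv_output, 'w', newline='') as outfile:
        reader = csv.DictReader(infile)   # access fields by header name
        writer = csv.writer(outfile)
        writer.writerow(['Index', 'URL Category', 'User IP', 'URL'])
        for index, row in enumerate(reader):
            site = row['Site Name']
            # csv gives strings, so Action is compared as '0', not 0
            if row['Action'] == '0' and not any(s in site for s in BLOCKED):
                writer.writerow([index, row['URL Category'],
                                 row['User IP'], site])
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Because this never holds more than one row in memory, it handles 3 million rows without any chunking logic.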
If you want to use pandas with chunks, then you should do it with a for-loop. When writing with pandas you need to append to the file, without headers:
import pandas as pd

first = True

for df in pd.read_csv('input.csv', chunksize=1):  # read row by row
    # keep only rows with even index
    if df.index % 2 == 0:
        print('--- row ---')
        print(df)
        if first:
            # create new file with headers
            df.to_csv('output.csv', mode='w')
            first = False
        else:
            # append to existing file without headers
            df.to_csv('output.csv', mode='a', header=False)
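chunksize=1 is only for demonstration; with millions of rows you would use a large chunk size and filter each chunk with a vectorized boolean mask instead of an if per row. A sketch of the same even-index filter with a more realistic chunk size:

```python
import pandas as pd

def filter_even_rows(csv_input, csv_output, chunksize=100_000):
    """Keep only rows with an even index, processing `chunksize` rows at a time."""
    first = True
    for chunk in pd.read_csv(csv_input, chunksize=chunksize):
        # read_csv continues the row index across chunks, so chunk.index
        # holds the global row numbers.
        kept = chunk[chunk.index % 2 == 0]
        # Write headers and create the file only on the first chunk,
        # then append without headers.
        kept.to_csv(csv_output, mode='w' if first else 'a',
                    header=first, index=False)
        first = False
```

Memory use is bounded by the chunk size, not the file size, so 100,000-row chunks work fine for a 3-million-row file.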
Minimal working code:
import pandas as pd
import csv

# --- create some data ---

data = {
    'A': range(0, 10),
    'B': range(10, 20),
    'C': range(20, 30),
}  # columns

df = pd.DataFrame(data)
df.to_csv('input.csv', index=False)

# --- read and write with `pandas` ---

first = True

for df in pd.read_csv('input.csv', chunksize=1):  # read row by row
    # keep only rows with even index
    if df.index % 2 == 0:
        print('--- row ---')
        print(df)
        if first:
            # create new file with headers
            df.to_csv('output_pandas.csv', mode='w')
            first = False
        else:
            # append to existing file without headers
            df.to_csv('output_pandas.csv', mode='a', header=False)

# --- read and write with `csv` ---

with open('input.csv') as infile, open('output.csv', 'w') as outfile:
    csv_reader = csv.reader(infile)
    csv_writer = csv.writer(outfile)

    # copy headers
    headers = next(csv_reader)
    csv_writer.writerow(headers)

    # process rows
    for row in csv_reader:
        # keep only rows with even index
        if int(row[0]) % 2 == 0:
            print('--- row ---')
            print(row)
            csv_writer.writerow(row)
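Applying the same chunked pattern to the question's data, the substring test can be done with str.contains and a regex alternation, so each chunk is filtered in one vectorized step. A sketch; the column names and substrings are assumptions taken from the question's code:

```python
import pandas as pd

# Substrings from the question, joined into one regex alternation.
BLOCKED = 'dbk|ptc|wcf|google'

def filter_sites(csv_input, csv_output, chunksize=100_000):
    """Keep rows where Action == 0 and Site Name matches no blocked substring."""
    first = True
    for chunk in pd.read_csv(csv_input, chunksize=chunksize):
        mask = (chunk['Action'] == 0) & \
               ~chunk['Site Name'].str.contains(BLOCKED, na=False)
        # Keep only the columns the report needs; the index preserves the
        # original row number across chunks.
        kept = chunk.loc[mask, ['URL Category', 'User IP', 'Site Name']]
        kept.to_csv(csv_output, mode='w' if first else 'a', header=first)
        first = False
```

na=False treats missing Site Name values as non-matching rather than raising on NaN.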
Comments:
Before using the function, have you tried functools' cache, with the @cache decorator?
If you write the new file with the csv module, then you can use the same csv module to read it line by line, without pandas.
If you use chunksize, then you can use the iterator with a for-loop, like: for df in pd.read_csv(csv_input, chunksize=100000): ...code...