Python: delete a row from a csv file without copying the file
There are multiple SO questions that deal with some form of this topic, but they all seem terribly inefficient for removing just a single row from a csv file (they usually involve copying the entire file). If I have a csv in this format:
fname,lname,age,sex
John,Doe,28,m
Sarah,Smith,27,f
Xavier,Moore,19,m
What would be the most efficient way to delete the row with Sarah? If possible, I would like to avoid copying the entire file.

This might help:
with open("sample.csv", "r") as f:
    for line in f:
        if line.startswith("Sarah"):  # note: the match is case-sensitive
            continue
        print(line, end="")
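A raw startswith() check is brittle: it would also skip a row for someone named "Sarahann", and it bypasses CSV quoting rules entirely. A short sketch that filters on the actual fname column using the csv module instead (drop_by_fname is an illustrative name, not from the answer above):

```python
import csv
import io

def drop_by_fname(text, fname):
    """Return the CSV text with every row whose `fname` column equals `fname` removed."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(row for row in reader if row["fname"] != fname)
    return out.getvalue()

data = "fname,lname,age,sex\nJohn,Doe,28,m\nSarah,Smith,27,f\nXavier,Moore,19,m\n"
print(drop_by_fname(data, "Sarah"))
```

This still rewrites the data rather than editing in place, but it matches on the parsed field instead of the raw line.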
This is one way to do it. You do need to load the rest of the file into a buffer, but it's the best I can think of in Python:
with open('afile', 'r+') as fd:
    delLine = 4
    for i in range(delLine):
        pos = fd.tell()
        fd.readline()
    rest = fd.read()
    fd.seek(pos)
    fd.truncate()
    fd.write(rest)
    # no fd.close() needed; the with block closes the file
I solved this as if you know the line number. If you want to check the text instead, then in place of the loop above use:
pos = fd.tell()
while not fd.readline().startswith('Sarah'):
    pos = fd.tell()
You will get an exception if 'Sarah' is not found.

This may be more efficient if the line you are deleting is closer to the end, but I'm not sure that reading everything, dropping the line in memory and dumping it back will save much compared to user time (considering this is a Tk application). This also only needs to open and flush the file once, so unless the files are extremely long and Sarah is really far down, it is unlikely to be noticeable.
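The two snippets above combine naturally into a small helper; delete_csv_line below is a hypothetical name, and the approach is the same read-buffer-seek-truncate dance:

```python
def delete_csv_line(path, lineno):
    """Delete the 1-based line `lineno` in place: read up to it, buffer the
    rest of the file, seek back, write the tail and truncate the leftover."""
    with open(path, "r+") as fd:
        for _ in range(lineno - 1):
            fd.readline()
        pos = fd.tell()       # position right before the doomed line
        fd.readline()         # skip the line being deleted
        rest = fd.read()      # buffer everything after it
        fd.seek(pos)
        fd.write(rest)
        fd.truncate()         # drop the now-duplicated tail

# usage sketch against the sample data from the question
with open("demo.csv", "w") as f:
    f.write("fname,lname,age,sex\nJohn,Doe,28,m\nSarah,Smith,27,f\nXavier,Moore,19,m\n")
delete_csv_line("demo.csv", 3)  # the Sarah row is line 3
```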
What would be the most efficient way to delete the row with Sarah? If possible, I would like to avoid copying the entire file.
The most efficient way is to overwrite that row with something the csv parser ignores; that way you avoid moving the rows that follow the removed one. If your csv parser can ignore empty lines, overwrite the row with \n characters. Otherwise, if your parser strips whitespace from values, overwrite the row with spaces.

There's a fundamental problem here: no current filesystem (that I'm aware of) provides a facility to remove a chunk of bytes from the middle of a file. You can overwrite existing bytes, or write a new file. So, your options are:
- Create a copy of the file without the offending line, delete the old one, and rename the new file in place. (This is the option you want to avoid.)
- Overwrite the bytes of the line with something that will be ignored. Depending on exactly what is going to read the file, a comment character may work, or spaces may work (or possibly even \0). If you want to be fully generic though, this is not an option with CSV files, since there is no defined comment character.
- As a last desperate measure, you could:
  - read up to the line you want to remove
  - read the rest of the file into memory
  - overwrite the line and all subsequent lines with the data you want to keep
  - truncate the file at the final position (filesystems usually allow this)
The last option obviously doesn't help much if you're trying to remove the first line (but it is handy when you want to remove a line near the end). It is also horribly vulnerable to crashing in the middle of the process.

You could do it using pandas. If your data is saved under data.csv, the following should help:
import pandas as pd
df = pd.read_csv('data.csv')
df = df[df.fname != 'Sarah']
df.to_csv('data.csv', index=False)
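Going back to the overwrite-in-place idea from the earlier answer: a minimal sketch, assuming the consumer strips whitespace or skips blank rows (blank_out_line is an illustrative name):

```python
def blank_out_line(path, predicate):
    """Overwrite the first matching line with spaces (keeping its newline),
    so none of the data after it ever has to move. Sketch only: it assumes
    whatever reads the file ignores whitespace-only rows."""
    with open(path, "r+b") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break  # EOF, no match found
            if predicate(line):
                body = line.rstrip(b"\r\n")
                f.seek(pos)
                f.write(b" " * len(body))  # the newline bytes stay untouched
                break

# usage sketch against the sample data from the question
with open("people.csv", "wb") as f:
    f.write(b"fname,lname,age,sex\nJohn,Doe,28,m\nSarah,Smith,27,f\nXavier,Moore,19,m\n")
blank_out_line("people.csv", lambda l: l.startswith(b"Sarah"))
```

The file size is unchanged afterwards; only the matched row's bytes are replaced.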
Editing a file in place is a task riddled with gotchas (much like modifying an iterable while iterating over it) and is usually not worth the trouble. In most cases, writing to a temporary file (or to working memory, depending on whether you have more storage space or RAM), then deleting the source file and replacing it with the temporary file, will be just as performant as attempting to do the same thing in place.

But if you insist, here is a generalized solution:
import os

def remove_line(path, comp):
    with open(path, "r+b") as f:  # open the file in rw mode
        mod_lines = 0  # hold the overwrite offset
        while True:
            last_pos = f.tell()  # keep the last line position
            line = f.readline()  # read the next line
            if not line:  # EOF
                break
            if mod_lines:  # we've already encountered what we search for
                f.seek(last_pos - mod_lines)  # move back to the beginning of the gap
                f.write(line)  # fill the gap with the current line
                f.seek(mod_lines, os.SEEK_CUR)  # move forward til the next line start
            elif comp(line):  # search for our data
                mod_lines = len(line)  # store the offset when found to create a gap
        f.seek(last_pos - mod_lines)  # seek back the extra removed characters
        f.truncate()  # truncate the rest
This will remove only the line matching the comparison function you provide, and then iterate over the rest of the file, shifting the data over the 'removed' line. It doesn't need to load the rest of the file into working memory either. To test it, create test.csv containing:
fname,lname,age,sex
John,Doe,28,m
Sarah,Smith,27,f
Xavier,Moore,19,m
You will end up with test.csv that has the original Sarah line removed in place:
fname,lname,age,sex
John,Doe,28,m
Xavier,Moore,19,m
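For completeness, here is the same remove_line function together with a runnable call producing the before/after shown above; note that the comparison function receives raw bytes, since the file is opened in binary mode:

```python
import os

def remove_line(path, comp):  # the in-place streaming remover from above
    with open(path, "r+b") as f:
        mod_lines = 0
        while True:
            last_pos = f.tell()
            line = f.readline()
            if not line:  # EOF
                break
            if mod_lines:  # shift each following line back over the gap
                f.seek(last_pos - mod_lines)
                f.write(line)
                f.seek(mod_lines, os.SEEK_CUR)
            elif comp(line):  # found the line to remove
                mod_lines = len(line)
        f.seek(last_pos - mod_lines)
        f.truncate()

with open("test.csv", "wb") as f:
    f.write(b"fname,lname,age,sex\nJohn,Doe,28,m\nSarah,Smith,27,f\nXavier,Moore,19,m\n")

# the predicate gets bytes, so match against a bytes prefix
remove_line("test.csv", lambda line: line.startswith(b"Sarah"))
```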
The next step was to build a tester that would run each of these functions in as isolated an environment as possible, trying to get a fair benchmark for each of them. My test is structured as follows:
- Three sample-data CSVs are generated as 1Mx10 matrices of random numbers (~200MB files), with a recognizable line placed at the beginning, the middle and the end respectively, producing test cases for three extreme scenarios
- The main sample-data file is copied to a temporary file before each test (since the line removal is destructive)
- Various file-sync and cache-clearing methods are used to ensure the buffers are cleared before each test begins
- The tests are run with the highest priority (chrt -f 99) through /usr/bin/time, since Python cannot be trusted to measure its own performance accurately in scenarios like these
- Each test is performed at least three times to smooth out unpredictable fluctuations
- The tests are also run in both Python 2.7 and Python 3.6 (CPython) to see whether performance is consistent across versions
- All benchmark data is collected and saved as a CSV for future analysis
#!/usr/bin/env python

import collections
import os
import random
import shutil
import subprocess
import sys
import time

try:
    range = xrange  # cover Python 2.x
except NameError:
    pass

try:
    DEV_NULL = subprocess.DEVNULL
except AttributeError:
    DEV_NULL = open(os.devnull, "wb")  # cover Python 2.x

SAMPLE_ROWS = 10**6  # 1M lines
TEST_LOOPS = 3
CALL_SCRIPT = os.path.join(os.getcwd(), "remove_line.py")  # the above script

def get_temporary_path(path):
    folder, filename = os.path.split(path)
    return os.path.join(folder, "~$" + filename)

def generate_samples(path, data="LINE", rows=10**6, columns=10):  # 1Mx10 default matrix
    sample_beginning = os.path.join(path, "sample_beg.csv")
    sample_middle = os.path.join(path, "sample_mid.csv")
    sample_end = os.path.join(path, "sample_end.csv")
    separator = os.linesep
    middle_row = rows // 2
    with open(sample_beginning, "w") as f_b, \
            open(sample_middle, "w") as f_m, \
            open(sample_end, "w") as f_e:
        f_b.write(data)
        f_b.write(separator)
        for i in range(rows):
            if not i % middle_row:
                f_m.write(data)
                f_m.write(separator)
            for t in (f_b, f_m, f_e):
                t.write(",".join((str(random.random()) for _ in range(columns))))
                t.write(separator)
        f_e.write(data)
        f_e.write(separator)
    return ("beginning", sample_beginning), ("middle", sample_middle), ("end", sample_end)

def normalize_field(field):
    field = field.lower()
    while True:
        s_index = field.find('(')
        e_index = field.find(')')
        if s_index == -1 or e_index == -1:
            break
        field = field[:s_index] + field[e_index + 1:]
    return "_".join(field.split())

def encode_csv_field(field):
    if isinstance(field, (int, float)):
        field = str(field)
    escape = False
    if '"' in field:
        escape = True
        field = field.replace('"', '""')
    elif "," in field or "\n" in field:
        escape = True
    if escape:
        return ('"' + field + '"').encode("utf-8")
    return field.encode("utf-8")

if __name__ == "__main__":
    print("Generating sample data...")
    start_time = time.time()
    samples = generate_samples(os.getcwd(), "REMOVE THIS LINE", SAMPLE_ROWS)
    print("Done, generation took: {:2} seconds.".format(time.time() - start_time))
    print("Beginning tests...")
    search_string = "REMOVE"
    header = None
    results = []
    for f in ("temp_file_stream", "temp_file_wm",
              "in_place_stream", "in_place_wm", "in_place_mmap"):
        for s, path in samples:
            for test in range(TEST_LOOPS):
                result = collections.OrderedDict((("function", f), ("sample", s),
                                                  ("test", test)))
                print("Running {function} test, {sample} #{test}...".format(**result))
                temp_sample = get_temporary_path(path)
                shutil.copy(path, temp_sample)
                print("  Clearing caches...")
                subprocess.call(["sudo", "/usr/bin/sync"], stdout=DEV_NULL)
                with open("/proc/sys/vm/drop_caches", "w") as dc:
                    dc.write("3\n")  # free pagecache, inodes, dentries...
                # you can add more cache clearing/invalidating calls here...
                print("  Removing a line starting with `{}`...".format(search_string))
                out = subprocess.check_output(["sudo", "chrt", "-f", "99",
                                               "/usr/bin/time", "--verbose",
                                               sys.executable, CALL_SCRIPT, temp_sample,
                                               search_string, f], stderr=subprocess.STDOUT)
                print("  Cleaning up...")
                os.remove(temp_sample)
                for line in out.decode("utf-8").split("\n"):
                    pair = line.strip().rsplit(": ", 1)
                    if len(pair) >= 2:
                        result[normalize_field(pair[0].strip())] = pair[1].strip()
                results.append(result)
                if not header:  # store the header for later reference
                    header = result.keys()
    print("Cleaning up sample data...")
    for s, path in samples:
        os.remove(path)
    output_file = sys.argv[1] if len(sys.argv) > 1 else "results.csv"
    output_results = os.path.join(os.getcwd(), output_file)
    print("All tests completed, writing results to: " + output_results)
    with open(output_results, "wb") as f:
        f.write(b",".join(encode_csv_field(k) for k in header) + b"\n")
        for result in results:
            f.write(b",".join(encode_csv_field(v) for v in result.values()) + b"\n")
    print("All done.")
Finally (and TL;DR): here are my results - I extracted only the best time and memory data from the result set, but you can get the full result sets here: and
Based on the data I gathered, a few final tips:
- If working memory is a concern (when working with extra-large files, etc.), only the *_stream functions are usable
For reference, here is remove_line.py with all of the tested functions:

#!/usr/bin/env python

import mmap
import os
import shutil
import sys
import time

def get_temporary_path(path):  # use tempfile facilities in production
    folder, filename = os.path.split(path)
    return os.path.join(folder, "~$" + filename)

def temp_file_wm(path, comp):
    path_out = get_temporary_path(path)
    with open(path, "rb") as f_in, open(path_out, "wb") as f_out:
        while True:
            line = f_in.readline()
            if not line:  # EOF
                break
            if comp(line):
                f_out.write(f_in.read())
                break
            else:
                f_out.write(line)
        f_out.flush()
        os.fsync(f_out.fileno())
    shutil.move(path_out, path)

def temp_file_stream(path, comp):
    path_out = get_temporary_path(path)
    not_found = True  # a flag to stop comparison after the first match, for fairness
    with open(path, "rb") as f_in, open(path_out, "wb") as f_out:
        while True:
            line = f_in.readline()
            if not line:  # EOF
                break
            if not_found and comp(line):
                not_found = False  # skip only the first matched line
                continue
            f_out.write(line)
        f_out.flush()
        os.fsync(f_out.fileno())
    shutil.move(path_out, path)

def in_place_wm(path, comp):
    with open(path, "r+b") as f:
        while True:
            last_pos = f.tell()
            line = f.readline()
            if not line:  # EOF
                break
            if comp(line):
                rest = f.read()
                f.seek(last_pos)
                f.write(rest)
                break
        f.truncate()
        f.flush()
        os.fsync(f.fileno())

def in_place_stream(path, comp):
    with open(path, "r+b") as f:
        mod_lines = 0
        while True:
            last_pos = f.tell()
            line = f.readline()
            if not line:  # EOF
                break
            if mod_lines:
                f.seek(last_pos - mod_lines)
                f.write(line)
                f.seek(mod_lines, os.SEEK_CUR)
            elif comp(line):
                mod_lines = len(line)
        f.seek(last_pos - mod_lines)
        f.truncate()
        f.flush()
        os.fsync(f.fileno())

def in_place_mmap(path, comp):
    with open(path, "r+b") as f:
        stream = mmap.mmap(f.fileno(), 0)
        total_size = len(stream)
        while True:
            last_pos = stream.tell()
            line = stream.readline()
            if not line:  # EOF
                break
            if comp(line):
                current_pos = stream.tell()
                stream.move(last_pos, current_pos, total_size - current_pos)
                total_size -= len(line)
                break
        stream.flush()
        stream.close()
        f.truncate(total_size)
        f.flush()
        os.fsync(f.fileno())

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: {} target_file.ext <search_string> [function_name]".format(__file__))
        exit(1)
    target_file = sys.argv[1]
    search_func = globals().get(sys.argv[3] if len(sys.argv) > 3 else None, in_place_wm)
    start_time = time.time()
    search_func(target_file, lambda x: x.startswith(sys.argv[2].encode("utf-8")))
    # some info for the test runner...
    print("python_version: " + sys.version.split()[0])
    print("python_time: {:.2f}".format(time.time() - start_time))
You can also do it with sed:

sed -i -e "/Sarah/d" your_file

Bear in mind that sed -i still rewrites the whole file behind the scenes: it writes the result to a temporary file and renames it over the original.