Python 从csv中删除一行而不复制文件_Python_Python 3.x_Csv

Python 从csv中删除一行而不复制文件

python python-3.x csv

Python 从csv中删除一行而不复制文件,python,python-3.x,csv,Python,Python 3.x,Csv,有多个SO问题可以解决这个主题的某些形式，但是它们都似乎非常低效，无法从csv文件中仅删除一行（通常涉及复制整个文件）。如果我有这样的csv格式： fname,lname,age,sex John,Doe,28,m Sarah,Smith,27,f Xavier,Moore,19,m 删除Sarah的行最有效的方法是什么？如果可能，我希望避免复制整个文件。这可能会有帮助： with open("sample.csv",'r') as f: for line in f:

有多个SO问题可以解决这个主题的某些形式，但是它们都似乎非常低效，无法从csv文件中仅删除一行（通常涉及复制整个文件）。如果我有这样的csv格式：

fname,lname,age,sex
John,Doe,28,m
Sarah,Smith,27,f
Xavier,Moore,19,m

删除Sarah的行最有效的方法是什么？如果可能，我希望避免复制整个文件。

这可能会有帮助：

with open("sample.csv",'r') as f:
    for line in f:
        if line.startswith('sarah'):continue
        print(line)

这是一种方式。您确实需要将文件的其余部分加载到缓冲区，但这是我在Python中能想到的最好方法：

with open('afile','r+') as fd:
    delLine = 4
    for i in range(delLine):
        pos = fd.tell()
        fd.readline()
    rest = fd.read()
    fd.seek(pos)
    fd.truncate()
    fd.write(rest)
    fd.close()

我解决了这个问题，就好像你知道行号一样。如果要检查文本，则不使用上述循环：

pos = fd.tell()
while fd.readline().startswith('Sarah'): pos = fd.tell()

如果找不到“Sarah”，则会出现异常

如果你要删除的行接近尾声，这可能会更有效，但我不确定阅读所有内容、删除行并将其转储回会比用户时间节省很多（考虑到这是一个Tk应用程序）。这也只需要打开并刷新一次文件，所以除非文件非常长，而且Sarah是真实的，否则它可能不会被注意到

删除Sarah的行最有效的方法是什么？如果可能，我希望避免复制整个文件

最有效的方法是用csv解析器忽略的内容覆盖该行。这样就避免了在删除行之后移动行

如果csv解析器可以忽略空行，请使用

\n

符号覆盖该行。否则，如果解析器从值中去除空白，则使用

（空格）符号覆盖该行

这里有一个基本问题。目前没有一个文件系统（据我所知）提供从文件中间删除大量字节的功能。您可以覆盖现有字节，或写入新文件。因此，您的选择是：

创建文件的副本，但不包含有问题的行，删除旧文件，并在适当位置重命名新文件。（这是您要避免的选项）
用将被忽略的内容覆盖行的字节。根据要读取文件的确切内容，注释字符可能有效，空格也可能有效（甚至可能是
```
\0
```
）。但是，如果希望完全通用，这不是CSV文件的选项，因为没有定义注释字符
作为最后一个绝望的措施，你可以：
- 读到要删除的行
- 将文件的其余部分读入内存
- 并用要保留的数据覆盖该行和所有后续行
- 截断文件作为最终位置（文件系统通常允许这样做）

如果要删除第一行，最后一个选项显然没有多大帮助（但是如果要删除靠近末尾的一行，它很方便）。在这个过程中，它也很容易崩溃。

< P>你可以用熊猫来做。如果您的数据保存在data.csv下，以下内容应该会有所帮助：

import pandas as pd

df = pd.read_csv('data.csv')
df = df[df.fname != 'Sarah' ]
df.to_csv('data.csv', index=False)

就地编辑文件是一项充满陷阱的任务（很像在迭代过程中修改iterable），通常不值得费心。在大多数情况下，写入临时文件（或工作内存，取决于您拥有的更多存储空间或RAM），然后删除源文件并用临时文件替换源文件，其性能与尝试在适当位置执行相同的操作相同

但是，如果你坚持，这里有一个通用的解决方案：

import os

def remove_line(path, comp):
    with open(path, "r+b") as f:  # open the file in rw mode
        mod_lines = 0  # hold the overwrite offset
        while True:
            last_pos = f.tell()  # keep the last line position
            line = f.readline()  # read the next line
            if not line:  # EOF
                break
            if mod_lines:  # we've already encountered what we search for
                f.seek(last_pos - mod_lines)  # move back to the beginning of the gap
                f.write(line)  # fill the gap with the current line
                f.seek(mod_lines, os.SEEK_CUR)  # move forward til the next line start
            elif comp(line):  # search for our data
                mod_lines = len(line)  # store the offset when found to create a gap
        f.seek(last_pos - mod_lines)  # seek back the extra removed characters
        f.truncate()  # truncate the rest

这将只删除与提供的比较函数匹配的行，然后在文件的其余部分上迭代，将数据移到“已删除”行上。您也不需要将文件的其余部分加载到工作内存中。要测试它，请使用包含以下内容的

test.csv

：

fname,lname,age,sex John,Doe,28,m Sarah,Smith,27,f Xavier,Moore,19,m 您将获得

test.csv

，其中删除了原有的

Sarah

行：

fname,lname,age,sex John,Doe,28,m Xavier,Moore,19,m 下一步是构建一个测试人员，该测试人员将在尽可能隔离的环境中运行这些功能，并尝试为每个功能获得一个公平的基准。我的测试结构如下：

三个样本数据CSV生成为1Mx10个随机数矩阵（~200MB文件），在其开头、中间和结尾分别放置一条可识别的线，从而生成三个极端场景的测试用例
在每次测试之前，将主样本数据文件复制为临时文件（因为线删除是破坏性的）
采用各种文件同步和缓存清除方法，以确保在每次测试开始之前清除缓冲区
测试是使用最高优先级（
```
chrt-f99
```
）通过
```
/usr/bin/time
```
来运行的，因为在这样的场景中，Python无法准确测量其性能
每次试验至少进行三次，以消除不可预测的波动
测试也在Python2.7和Python3.6（CPython）中运行，以查看版本之间是否存在性能一致性
收集所有基准数据并保存为CSV，以备将来分析

不幸的是，我手头没有一个可以完全隔离运行测试的系统，因此我的数字是通过在虚拟机监控程序中运行它获得的。这意味着I/O性能可能非常不稳定，但它同样会影响仍然提供可比数据的所有测试。无论哪种方式，都欢迎您在自己的系统上运行此测试，以获得与之相关的结果

我已将执行上述场景的测试脚本设置为：

#!/usr/bin/env python

import collections
import os
import random
import shutil
import subprocess
import sys
import time

try:
    range = xrange  # cover Python 2.x
except NameError:
    pass

try:
    DEV_NULL = subprocess.DEVNULL
except AttributeError:
    DEV_NULL = open(os.devnull, "wb")  # cover Python 2.x

SAMPLE_ROWS = 10**6  # 1M lines
TEST_LOOPS = 3
CALL_SCRIPT = os.path.join(os.getcwd(), "remove_line.py")  # the above script

def get_temporary_path(path):
    folder, filename = os.path.split(path)
    return os.path.join(folder, "~$" + filename)

def generate_samples(path, data="LINE", rows=10**6, columns=10):  # 1Mx10 default matrix
    sample_beginning = os.path.join(path, "sample_beg.csv")
    sample_middle = os.path.join(path, "sample_mid.csv")
    sample_end = os.path.join(path, "sample_end.csv")
    separator = os.linesep
    middle_row = rows // 2
    with open(sample_beginning, "w") as f_b, \
            open(sample_middle, "w") as f_m, \
            open(sample_end, "w") as f_e:
        f_b.write(data)
        f_b.write(separator)
        for i in range(rows):
            if not i % middle_row:
                f_m.write(data)
                f_m.write(separator)
            for t in (f_b, f_m, f_e):
                t.write(",".join((str(random.random()) for _ in range(columns))))
                t.write(separator)
        f_e.write(data)
        f_e.write(separator)
    return ("beginning", sample_beginning), ("middle", sample_middle), ("end", sample_end)

def normalize_field(field):
    field = field.lower()
    while True:
        s_index = field.find('(')
        e_index = field.find(')')
        if s_index == -1 or e_index == -1:
            break
        field = field[:s_index] + field[e_index + 1:]
    return "_".join(field.split())

def encode_csv_field(field):
    if isinstance(field, (int, float)):
        field = str(field)
    escape = False
    if '"' in field:
        escape = True
        field = field.replace('"', '""')
    elif "," in field or "\n" in field:
        escape = True
    if escape:
        return ('"' + field + '"').encode("utf-8")
    return field.encode("utf-8")

if __name__ == "__main__":
    print("Generating sample data...")
    start_time = time.time()
    samples = generate_samples(os.getcwd(), "REMOVE THIS LINE", SAMPLE_ROWS)
    print("Done, generation took: {:2} seconds.".format(time.time() - start_time))
    print("Beginning tests...")
    search_string = "REMOVE"
    header = None
    results = []
    for f in ("temp_file_stream", "temp_file_wm",
              "in_place_stream", "in_place_wm", "in_place_mmap"):
        for s, path in samples:
            for test in range(TEST_LOOPS):
                result = collections.OrderedDict((("function", f), ("sample", s),
                                                  ("test", test)))
                print("Running {function} test, {sample} #{test}...".format(**result))
                temp_sample = get_temporary_path(path)
                shutil.copy(path, temp_sample)
                print("  Clearing caches...")
                subprocess.call(["sudo", "/usr/bin/sync"], stdout=DEV_NULL)
                with open("/proc/sys/vm/drop_caches", "w") as dc:
                    dc.write("3\n")  # free pagecache, inodes, dentries...
                # you can add more cache clearing/invalidating calls here...
                print("  Removing a line starting with `{}`...".format(search_string))
                out = subprocess.check_output(["sudo", "chrt", "-f", "99",
                                               "/usr/bin/time", "--verbose",
                                               sys.executable, CALL_SCRIPT, temp_sample,
                                               search_string, f], stderr=subprocess.STDOUT)
                print("  Cleaning up...")
                os.remove(temp_sample)
                for line in out.decode("utf-8").split("\n"):
                    pair = line.strip().rsplit(": ", 1)
                    if len(pair) >= 2:
                        result[normalize_field(pair[0].strip())] = pair[1].strip()
                results.append(result)
                if not header:  # store the header for later reference
                    header = result.keys()
    print("Cleaning up sample data...")
    for s, path in samples:
        os.remove(path)
    output_file = sys.argv[1] if len(sys.argv) > 1 else "results.csv"
    output_results = os.path.join(os.getcwd(), output_file)
    print("All tests completed, writing results to: " + output_results)
    with open(output_results, "wb") as f:
        f.write(b",".join(encode_csv_field(k) for k in header) + b"\n")
        for result in results:
            f.write(b",".join(encode_csv_field(v) for v in result.values()) + b"\n")
    print("All done.")

最后（和TL；DR）：以下是我的结果-我仅从结果集中提取最佳时间和内存数据，但您可以在此处获得完整的结果集：和

根据我收集的数据，最后有几点提示：

如果工作内存存在问题（处理超大文件等），则只有

*\u流

起作用

#!/usr/bin/env python

import mmap
import os
import shutil
import sys
import time

def get_temporary_path(path):  # use tempfile facilities in production
    folder, filename = os.path.split(path)
    return os.path.join(folder, "~$" + filename)

def temp_file_wm(path, comp):
    path_out = get_temporary_path(path)
    with open(path, "rb") as f_in, open(path_out, "wb") as f_out:
        while True:
            line = f_in.readline()
            if not line:
                break
            if comp(line):
                f_out.write(f_in.read())
                break
            else:
                f_out.write(line)
        f_out.flush()
        os.fsync(f_out.fileno())
    shutil.move(path_out, path)

def temp_file_stream(path, comp):
    path_out = get_temporary_path(path)
    not_found = True  # a flag to stop comparison after the first match, for fairness
    with open(path, "rb") as f_in, open(path_out, "wb") as f_out:
        while True:
            line = f_in.readline()
            if not line:
                break
            if not_found and comp(line):
                continue
            f_out.write(line)
        f_out.flush()
        os.fsync(f_out.fileno())
    shutil.move(path_out, path)

def in_place_wm(path, comp):
    with open(path, "r+b") as f:
        while True:
            last_pos = f.tell()
            line = f.readline()
            if not line:
                break
            if comp(line):
                rest = f.read()
                f.seek(last_pos)
                f.write(rest)
                break
        f.truncate()
        f.flush()
        os.fsync(f.fileno())

def in_place_stream(path, comp):
    with open(path, "r+b") as f:
        mod_lines = 0
        while True:
            last_pos = f.tell()
            line = f.readline()
            if not line:
                break
            if mod_lines:
                f.seek(last_pos - mod_lines)
                f.write(line)
                f.seek(mod_lines, os.SEEK_CUR)
            elif comp(line):
                mod_lines = len(line)
        f.seek(last_pos - mod_lines)
        f.truncate()
        f.flush()
        os.fsync(f.fileno())

def in_place_mmap(path, comp):
    with open(path, "r+b") as f:
        stream = mmap.mmap(f.fileno(), 0)
        total_size = len(stream)
        while True:
            last_pos = stream.tell()
            line = stream.readline()
            if not line:
                break
            if comp(line):
                current_pos = stream.tell()
                stream.move(last_pos, current_pos, total_size - current_pos)
                total_size -= len(line)
                break
        stream.flush()
        stream.close()
        f.truncate(total_size)
        f.flush()
        os.fsync(f.fileno())

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: {} target_file.ext <search_string> [function_name]".format(__file__))
        exit(1)
    target_file = sys.argv[1]
    search_func = globals().get(sys.argv[3] if len(sys.argv) > 3 else None, in_place_wm)
    start_time = time.time()
    search_func(target_file, lambda x: x.startswith(sys.argv[2].encode("utf-8")))
    # some info for the test runner...
    print("python_version: " + sys.version.split()[0])
    print("python_time: {:.2f}".format(time.time() - start_time))

#!/usr/bin/env python

import collections
import os
import random
import shutil
import subprocess
import sys
import time

try:
    range = xrange  # cover Python 2.x
except NameError:
    pass

try:
    DEV_NULL = subprocess.DEVNULL
except AttributeError:
    DEV_NULL = open(os.devnull, "wb")  # cover Python 2.x

SAMPLE_ROWS = 10**6  # 1M lines
TEST_LOOPS = 3
CALL_SCRIPT = os.path.join(os.getcwd(), "remove_line.py")  # the above script

def get_temporary_path(path):
    folder, filename = os.path.split(path)
    return os.path.join(folder, "~$" + filename)

def generate_samples(path, data="LINE", rows=10**6, columns=10):  # 1Mx10 default matrix
    sample_beginning = os.path.join(path, "sample_beg.csv")
    sample_middle = os.path.join(path, "sample_mid.csv")
    sample_end = os.path.join(path, "sample_end.csv")
    separator = os.linesep
    middle_row = rows // 2
    with open(sample_beginning, "w") as f_b, \
            open(sample_middle, "w") as f_m, \
            open(sample_end, "w") as f_e:
        f_b.write(data)
        f_b.write(separator)
        for i in range(rows):
            if not i % middle_row:
                f_m.write(data)
                f_m.write(separator)
            for t in (f_b, f_m, f_e):
                t.write(",".join((str(random.random()) for _ in range(columns))))
                t.write(separator)
        f_e.write(data)
        f_e.write(separator)
    return ("beginning", sample_beginning), ("middle", sample_middle), ("end", sample_end)

def normalize_field(field):
    field = field.lower()
    while True:
        s_index = field.find('(')
        e_index = field.find(')')
        if s_index == -1 or e_index == -1:
            break
        field = field[:s_index] + field[e_index + 1:]
    return "_".join(field.split())

def encode_csv_field(field):
    if isinstance(field, (int, float)):
        field = str(field)
    escape = False
    if '"' in field:
        escape = True
        field = field.replace('"', '""')
    elif "," in field or "\n" in field:
        escape = True
    if escape:
        return ('"' + field + '"').encode("utf-8")
    return field.encode("utf-8")

if __name__ == "__main__":
    print("Generating sample data...")
    start_time = time.time()
    samples = generate_samples(os.getcwd(), "REMOVE THIS LINE", SAMPLE_ROWS)
    print("Done, generation took: {:2} seconds.".format(time.time() - start_time))
    print("Beginning tests...")
    search_string = "REMOVE"
    header = None
    results = []
    for f in ("temp_file_stream", "temp_file_wm",
              "in_place_stream", "in_place_wm", "in_place_mmap"):
        for s, path in samples:
            for test in range(TEST_LOOPS):
                result = collections.OrderedDict((("function", f), ("sample", s),
                                                  ("test", test)))
                print("Running {function} test, {sample} #{test}...".format(**result))
                temp_sample = get_temporary_path(path)
                shutil.copy(path, temp_sample)
                print("  Clearing caches...")
                subprocess.call(["sudo", "/usr/bin/sync"], stdout=DEV_NULL)
                with open("/proc/sys/vm/drop_caches", "w") as dc:
                    dc.write("3\n")  # free pagecache, inodes, dentries...
                # you can add more cache clearing/invalidating calls here...
                print("  Removing a line starting with `{}`...".format(search_string))
                out = subprocess.check_output(["sudo", "chrt", "-f", "99",
                                               "/usr/bin/time", "--verbose",
                                               sys.executable, CALL_SCRIPT, temp_sample,
                                               search_string, f], stderr=subprocess.STDOUT)
                print("  Cleaning up...")
                os.remove(temp_sample)
                for line in out.decode("utf-8").split("\n"):
                    pair = line.strip().rsplit(": ", 1)
                    if len(pair) >= 2:
                        result[normalize_field(pair[0].strip())] = pair[1].strip()
                results.append(result)
                if not header:  # store the header for later reference
                    header = result.keys()
    print("Cleaning up sample data...")
    for s, path in samples:
        os.remove(path)
    output_file = sys.argv[1] if len(sys.argv) > 1 else "results.csv"
    output_results = os.path.join(os.getcwd(), output_file)
    print("All tests completed, writing results to: " + output_results)
    with open(output_results, "wb") as f:
        f.write(b",".join(encode_csv_field(k) for k in header) + b"\n")
        for result in results:
            f.write(b",".join(encode_csv_field(v) for v in result.values()) + b"\n")
    print("All done.")

sed -ie "/Sahra/d" your_file