Python: parsing a very large CSV file. Need to split one field into many smaller rows & keep the ID with each row.

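I have a large CSV that consists of an "ID" column and a "History" column.

The ID is simple, just an integer.

The History, however, is a single cell made up of hundreds of entries, which are separated in the text area by *** NOTE ***.

I want to parse this with Python and the csv module, reading the data and exporting it as a new CSV as shown below.

Existing data structure: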

ID,History

56457827, "*** NOTE ***
2014-02-25
Long note here.  This is just a stand in to give you an idea
*** NOTE ***
2014-02-20
Another example.
This one has carriage returns.

Demonstrates they're all a bit different, though are really just text based"
56457896, "*** NOTE ***
2015-03-26
Another example of a note here.  This is the text portion.
*** NOTE ***
2015-05-24
Another example yet again."

Desired data structure:

ID, Date, History

56457827, 2014-02-25, "Long note here.  This is just a stand in to give you an idea"
56457827, 2014-02-20, "Another example.
This one has carriage returns.

Demonstrates they're all a bit different, though are really just text based"
56457896, 2015-03-26, "Another example of a note here.  This is the text portion."
56457896, 2015-05-24, "Another example yet again."
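So there are a few commands I need to get to grips with. I'm guessing a loop will bring the data in as something I can manage, but I then need to analyse it.

I believe I need to:

  • 1. Begin looping through the CSV structure
  • 2. Note the first ID
  • 3. Search the History field for *** NOTE ***
  • 4. Somehow grab the date string and note it
  • 5. Add all the string data found after the date string to a variable (let's call it "HistoryShapper") until...
  • 6. ... until I find the next *** NOTE ***
  • 7. Remove any *** NOTE *** from the new variable "HistoryShapper"
  • 8. Write the ID, the date and "HistoryShapper" to a new row in a new CSV file
  • 9. Repeat steps 2-8 until the end of the History field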
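
Steps 3 to 7 could be sketched for a single History value roughly as follows. This is only an illustration under assumptions: the helper name `split_history` is hypothetical, every note is assumed to start with `*** NOTE ***` followed by a `YYYY-MM-DD` date, and a regular expression is used here in place of a manual search through the text.

```python
import re

# Hypothetical helper: assumes each note is "*** NOTE ***", then a
# YYYY-MM-DD date, then free text up to the next marker.
NOTE_RE = re.compile(
    r"\*\*\* NOTE \*\*\*\s*"       # the separator itself (step 3)
    r"(\d{4}-\d{2}-\d{2})\s*"      # the date string (step 4)
    r"(.*?)"                       # the note text (step 5)
    r"(?=\*\*\* NOTE \*\*\*|\Z)",  # stop at the next marker or the end (step 6)
    re.DOTALL,
)

def split_history(history):
    """Return a list of (date, text) pairs found in one History cell."""
    return [(date, text.strip()) for date, text in NOTE_RE.findall(history)]

sample = (
    "*** NOTE ***\n2015-03-26\n"
    "Another example of a note here.  This is the text portion.\n"
    "*** NOTE ***\n2015-05-24\n"
    "Another example yet again."
)
pairs = split_history(sample)
for date, text in pairs:
    print(date, "|", text)
```

Because the marker is matched but never captured, step 7 (removing the marker from the text) needs no separate pass in this sketch.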

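The file is about 5MB. Is this the best approach? I'm fairly new to programming and data handling, so I'm open to any constructive criticism before I open the laptop tonight and dig into this.

Many thanks, all feedback is much appreciated.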

  • Enjoy:

    with open('data.csv') as f:
        header = f.readline()        # skip the header line
        blank_line = f.readline()    # skip the blank line after the header

        current_record = None
        for s in f:
            if not current_record:
                current_record = s   # first line of a new record
            else:
                current_record += s
                # a line ending with a quote closes the record
                if s.rstrip().endswith('"'):
                    # Replace line breaks with spaces
                    current_record = current_record.replace('\r', ' ').replace('\n', ' ')
                    # Get ID and history
                    ID, history = current_record.split(',', 1)
                    # dequote history
                    history = history.strip(' "')
                    # split history into [date, message] items
                    items = [note.strip().split(' ', 1)
                             for note in history.split('*** NOTE ***') if note.strip()]
                    for date, message in items:
                        print('{}, {}, {}'.format(ID, date, message))

                    current_record = None
    

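    OK, you can easily parse the input file with the csv module, but you need to set skipinitialspace, because your file has spaces after the commas. I also assume the blank line after the header should not be there.

    You should then split the History column on '*** NOTE ***'. The first line of each note's text should be the date, and the rest is the actual history. The code could be: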

    import csv

    # input_file_name and output_file_name are placeholders for your own paths
    with open(input_file_name, newline='') as fd, \
         open(output_file_name, "w", newline='') as fdout:
        rd = csv.reader(fd, skipinitialspace=True)
        ID, Hist = next(rd)    # read the header line
        wr = csv.writer(fdout)
        wr.writerow((ID, 'Date', Hist))  # write header of output file
        for row in rd:
            # print(row)      # uncomment for debug traces
            if not row:       # skip blank lines, if any
                continue
            hists = row[1].split('*** NOTE ***')
            for h in hists:
                h = h.strip()
                if len(h) == 0:     # skip initial empty note
                    continue
                # each note should begin with a date line
                date, h2 = h.split('\n', 1)
                wr.writerow((row[0], date.strip(), h2.strip()))
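
To try the same idea without touching any files, the logic can be wrapped in a function and run on in-memory text. This is a minimal sketch, not the answer's exact code: the function name `explode_history` and the shortened sample data are assumptions made for illustration, and `str.partition` is swapped in so that a note consisting of a date only does not raise an error.

```python
import csv
import io

def explode_history(fd, fdout, marker="*** NOTE ***"):
    """Read ID/History rows from fd and write ID/Date/History rows to fdout."""
    rd = csv.reader(fd, skipinitialspace=True)
    wr = csv.writer(fdout)
    header = next(rd)
    wr.writerow((header[0], "Date", header[1]))
    for row in rd:
        if not row:              # tolerate blank lines
            continue
        for note in row[1].split(marker):
            note = note.strip()
            if not note:         # the text before the first marker is empty
                continue
            # first line is the date, the remainder is the note text
            date, _, text = note.partition("\n")
            wr.writerow((row[0], date.strip(), text.strip()))

# Sample input shaped like the data in the question (shortened).
source = io.StringIO(
    'ID,History\n'
    '\n'
    '56457896, "*** NOTE ***\n'
    '2015-03-26\n'
    'Another example of a note here.\n'
    '*** NOTE ***\n'
    '2015-05-24\n'
    'Another example yet again."\n',
    newline="",
)
target = io.StringIO(newline="")
explode_history(source, target)

# Parse the generated CSV back to inspect the rows.
result = list(csv.reader(io.StringIO(target.getvalue(), newline="")))
print(result)
```

Passing open file objects instead of the `io.StringIO` buffers should give the file-based behaviour, provided both files are opened with `newline=''` as the csv module documentation recommends.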
    

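    Are there line breaks inside the history entries? Why remove the line breaks?

    Only so that it looks better in the console. It is up to you whether to remove that line.

    Thank you. When I try to run it, I get this error (I have of course set the input file name etc.):
    open(output_file_name, "w", newline='') as fdout, ^ SyntaxError: invalid syntax
    with the ^ pointing at the end of fdout in the terminal.

    Thanks. I realise I never came back to express my gratitude. Thanks again.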