Python: parsing a very large CSV file. Need to split one field into many smaller rows & keep the ID in each row.
I have a large CSV that consists of an "ID" column and a "History" column. The ID is simple, just an integer. The History, however, is a single cell made up of hundreds of entries, separated within the text area by *** NOTE ***. I want to parse this with Python and the csv module, reading the data in and exporting it as a new CSV as shown below.
Existing data structure:
ID,History
56457827, "*** NOTE ***
2014-02-25
Long note here. This is just a stand in to give you an idea
*** NOTE ***
2014-02-20
Another example.
This one has carriage returns.
Demonstrates they're all a bit different, though are really just text based"
56457896, "*** NOTE ***
2015-03-26
Another example of a note here. This is the text portion.
*** NOTE ***
2015-05-24
Another example yet again."
Desired data structure:
ID, Date, History
56457827, 2014-02-25, "Long note here. This is just a stand in to give you an idea"
56457827, 2014-02-20, "Another example.
This one has carriage returns.
Demonstrates they're all a bit different, though are really just text based"
56457896, 2015-03-26, "Another example of a note here. This is the text portion."
56457896, 2015-05-24, "Another example yet again."
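So I need to get a handle on a few commands. I'm guessing a loop will bring the data in so that I can manage it, but I still need to parse the data.
I believe I need to:
- 1. Start looping through the CSV structure
- 2. Note the first ID
- 3. Search the History field for *** NOTE ***
- 4. Somehow grab the date string and note it
- 5. Add all the string data that follows the date string to a variable (let's call it "HistoryShapper") until
- 6. ... until I find the next *** NOTE ***
- 7. Remove every *** NOTE *** from the new variable "HistoryShapper"
- 8. Write the ID and "HistoryShapper" to a new row in the new CSV file
- 9. Repeat steps 2-8 until the end of the History field
The file is about 5MB. Is this the best approach? I'm fairly new to programming and data handling, so I'm open to any constructive criticism before I open the laptop tonight and dig in.
Many thanks, all feedback is much appreciated.
Cheers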
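The steps listed above can be sketched roughly as follows for a single History cell. This is only an illustration, not code from the thread: explode_history, NOTE_MARKER and DATE_RE are made-up names, and the sketch assumes every note starts with a YYYY-MM-DD date right after the marker.

```python
import re

NOTE_MARKER = "*** NOTE ***"                 # separator used in the sample data
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")   # assumption: dates look like 2014-02-25


def explode_history(record_id, history):
    """Yield (record_id, date, text) for each note in one History cell."""
    for chunk in history.split(NOTE_MARKER):   # steps 3 and 6: cut at each marker
        chunk = chunk.strip()
        if not chunk:                          # empty piece before the first marker
            continue
        match = DATE_RE.match(chunk)           # step 4: the date leads the note
        if match is None:
            continue                           # assumption: skip notes with no leading date
        text = chunk[match.end():].strip()     # steps 5 and 7: the rest, marker-free
        yield record_id, match.group(0), text  # step 8: the caller writes this row


sample = (
    "*** NOTE ***\n2014-02-25\nLong note here.\n"
    "*** NOTE ***\n2014-02-20\nAnother example.\nThis one has carriage returns."
)
rows = list(explode_history("56457827", sample))
for row in rows:
    print(row)
```

Each yielded tuple could be passed straight to a csv.writer's writerow, which quotes text containing line breaks on its own.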
with open('data.csv') as f:
    header = f.readline()      # skip the header line
    blank_line = f.readline()  # skip the blank line after the header
    current_record = None
    s = f.readline()           # first line of the first record
    while s:
        # Accumulate physical lines until the record's closing quote
        if not current_record:
            current_record = s
        else:
            current_record += s
        if s.rstrip().endswith('"'):
            # Remove line breaks
            current_record = current_record.replace('\r', ' ').replace('\n', ' ')
            # Get ID and history
            ID, history = current_record.split(',', 1)
            # dequote history
            history = history.strip(' "')
            # split history into [date, message] items, skipping empty pieces
            items = [note.strip().split(' ', 1) for note in history.split('*** NOTE ***') if note.strip()]
            for date, message in items:
                print('{}, {}, {}'.format(ID, date, message))
            current_record = None
        s = f.readline()
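OK, you can easily parse the input file with the csv module, but you need to set skipinitialspace, because your file has spaces after the commas. I am also assuming that the blank line after the header should not be there.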
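To illustrate why skipinitialspace matters here, a small sketch with an in-memory file (the one-record sample string is made up for the illustration). Without the option, the space after the comma means the quote is not seen as opening a quoted field, so the multi-line cell is cut at its line break.

```python
import csv
import io

# Made-up sample: a space follows the comma, and the quoted field
# contains a line break, as in the question's data.
sample = '1, "a\nb"\n'

rows_plain = list(csv.reader(io.StringIO(sample)))
rows_skip = list(csv.reader(io.StringIO(sample), skipinitialspace=True))

print(rows_plain)  # the quoted field is broken across two rows
print(rows_skip)   # the quoted field stays one cell, newline included
```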
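Then you should split the History column on '*** NOTE ***'. The first line of each note's text should be the date, and the rest is the actual history. The code could be: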
import csv

# input_file_name and output_file_name must be set to your own paths
with open(input_file_name, newline='') as fd, \
        open(output_file_name, "w", newline='') as fdout:
    rd = csv.reader(fd, skipinitialspace=True)
    ID, Hist = next(rd)                      # read the header line
    wr = csv.writer(fdout)
    _ = wr.writerow((ID, 'Date', Hist))      # write header of output file
    for row in rd:
        # print(row)                         # uncomment for debug traces
        hists = row[1].split('*** NOTE ***')
        for h in hists:
            h = h.strip()
            if len(h) == 0:                  # skip initial empty note
                continue
            # should begin with a date line
            date, h2 = h.split('\n', 1)
            _ = wr.writerow((row[0], date.strip(), h2.strip()))
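One edge case in the date/text split: if a note holds only a date line and no text, h.split('\n', 1) returns a single element and the two-name unpacking raises ValueError. A minimal sketch of a more tolerant split using str.partition, which always returns three parts; split_note is a hypothetical helper name, not part of the answer.

```python
def split_note(note):
    """Split one note into (date, text); text is '' if only a date line exists."""
    date, _, text = note.strip().partition('\n')
    return date.strip(), text.strip()


print(split_note('2015-05-24\nAnother example yet again.'))
print(split_note('2015-05-24'))
```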
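Are there line breaks inside the history rows? Why remove the line breaks?
Just so that it looks better in the console. It is up to you whether to remove that line.
Thank you. When I try to run it, I get this error. (Of course I have set the input file name etc.)
open(output_file_name, "w", newline='') as fdout, ^ SyntaxError: invalid syntax
The ^ points to the end of fdout in the terminal.
Thanks. I realize I never came back to express my gratitude. Thanks again.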