Python: parsing data from a text file
I have built a contact form that sends me an email for every user registration. My question is mostly about parsing some text data into CSV format. I have received the details of several users in my mailbox and copied them into a text file. The data looks like this:
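
Name: testuser2
Email: testuser2@gmail.com
Cluster Name: o b
Contact No.: 12346971239
Coming: Yes

Name: testuser3
Email: testuser3@gmail.com
Cluster Name: Mediternea
Contact No.: 9121319107
Coming: Yes

Name: testuser4
Email: tuser4@yahoo.com
Cluster Name: Mediterranea
Contact No.: 7892174896
Coming: Yes

Name: tuser5
Email: tuserner5@gmail.com
Cluster Name: River Retreat A
Contact No.: 7583450912
Coming: Yes
Members Participating: 2

Name: Test User
Email: testuser@yahoo.co.in
Cluster Name: RD
Contact No.: 09833123445
Coming: Yes
Members Participating: 2
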
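As shown above, the data contains some fields that are common to every record and some that are not always present. I am looking for a solution or suggestions on how to parse this data so that, under the heading "Name", I collect the name information in that column, and likewise for the other fields. For the data under the heading "Members Participating", I can pick up the number and add it under the same heading in the Excel sheet; if a user does not have this information, the cell can simply be left blank.

The following program might satisfy your requirements. The general strategy:
- First read in all of the email files, parsing the data "by hand", and
- then write the data to a CSV file using csv.DictWriter.writerows()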
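
The two steps of that strategy can be sketched roughly as follows. This is only a minimal illustration, not the answerer's program: the helper names parse_records and to_csv, the shortened SAMPLE text, and the use of an in-memory buffer instead of real files are all assumptions made for the example.

```python
import csv
import io


def parse_records(text):
    """Split text into blank-line-separated records and parse 'Key: value' lines."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            # partition() splits at the first colon only
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
        records.append(record)
    return records


def to_csv(records, fieldnames):
    """Return the records as CSV text; fields absent from a record become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical, shortened sample in the same shape as the question's data
SAMPLE = """Name: testuser2
Coming: Yes

Name: tuser5
Coming: Yes
Members Participating: 2
"""

print(to_csv(parse_records(SAMPLE), ["Name", "Coming", "Members Participating"]))
```

With real input, the text would come from reading the saved email file(s) and the buffer would be replaced by an open output file.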
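You can use the blank lines between records to signal the end of a record. Then process the input file line by line and build up a list of dictionaries. Finally, write the dictionaries out to a CSV file.

from csv import DictWriter
from collections import OrderedDict

with open('input') as infile:
    registrations = []
    fields = OrderedDict()   # remembers field names in first-seen order
    d = {}
    for line in infile:
        line = line.strip()
        if line:
            # split on the first colon only, so values may contain colons
            key, value = [s.strip() for s in line.split(':', 1)]
            d[key] = value
            fields[key] = None
        else:
            # a blank line ends the current record
            if d:
                registrations.append(d)
                d = {}
    else:
        if d:    # handle EOF: the last record may not be followed by a blank line
            registrations.append(d)

# To force a specific column order, use this list instead of the line below it:
# fieldnames = ['Name', 'Email', 'Cluster Name', 'Contact No.', 'Coming', 'Members Participating']
fieldnames = list(fields)

with open('registrations.csv', 'w', newline='') as outfile:
    writer = DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(registrations)

This code attempts to collect the field names automatically, and it uses the order in which each unique key is first seen in the input. If you need a specific field order in the output, use the commented-out fieldnames list instead of the automatically collected one.

Running this code on your sample input produces the following:
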
Name,Email,Cluster Name,Contact No.,Coming,Members Participating
testuser2,testuser2@gmail.com,o b,12346971239,Yes,
testuser3,testuser3@gmail.com,Mediternea,9121319107,Yes,
testuser4,tuser4@yahoo.com,Mediterranea,7892174896,Yes,
tuser5,tuserner5@gmail.com,River Retreat A,7583450912,Yes,2
Test User,testuser@yahoo.co.in,RD,09833123445,Yes,2
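
The first-seen ordering of field names described above can also be obtained from a plain dict, since dictionaries preserve insertion order in Python 3.7 and later. The following is a small sketch under that assumption; collect_fieldnames and the two sample records are hypothetical names and data used only for illustration.

```python
def collect_fieldnames(records):
    """Return every key that occurs in any record, in first-seen order."""
    seen = {}
    for record in records:
        for key in record:
            # setdefault keeps the position of the first occurrence
            seen.setdefault(key, None)
    return list(seen)


records = [
    {"Name": "testuser2", "Coming": "Yes"},
    {"Name": "tuser5", "Coming": "Yes", "Members Participating": "2"},
]
print(collect_fieldnames(records))
# ['Name', 'Coming', 'Members Participating']
```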

Let's decompose the problem into smaller sub-problems:

DATA = '''
Name: testuser2
Email: testuser2@gmail.com
Cluster Name: o b
Contact No.: 12346971239
Coming: Yes

Name: testuser3
Email: testuser3@gmail.com
Cluster Name: Mediternea
Contact No.: 9121319107
Coming: Yes
'''

def parse_registrations(data):
    # records are separated by a blank line, i.e. two consecutive newlines
    data = data.strip()
    return data.split('\n\n')

This function gives us a list with one string per registration:

>>> regs = parse_registrations(DATA)
>>> regs
['Name: testuser2\nEmail: testuser2@gmail.com\nCluster Name: o b\nContact No.: 12346971239\nComing: Yes', 'Name: testuser3\nEmail: testuser3@gmail.com\nCluster Name: Mediternea\nContact No.: 9121319107\nComing: Yes']
>>> regs[0]
'Name: testuser2\nEmail: testuser2@gmail.com\nCluster Name: o b\nContact No.: 12346971239\nComing: Yes'
>>> regs[1]
'Name: testuser3\nEmail: testuser3@gmail.com\nCluster Name: Mediternea\nContact No.: 9121319107\nComing: Yes'
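
Text pasted from emails does not always use exactly one clean blank line between records. As a hedged variant of the split shown above, a regular expression can tolerate Windows line endings and "blank" lines that contain stray spaces. The name split_registrations and the messy sample string are assumptions for this sketch.

```python
import re


def split_registrations(text):
    """Split text into records separated by one or more blank lines."""
    # normalise Windows line endings, then drop leading/trailing whitespace
    text = text.replace("\r\n", "\n").strip()
    # a separator is a newline, optional whitespace, then another newline
    return [block for block in re.split(r"\n\s*\n", text) if block]


messy = "Name: a\r\nComing: Yes\r\n \r\n\r\nName: b\r\nComing: Yes\r\n"
print(split_registrations(messy))
# ['Name: a\nComing: Yes', 'Name: b\nComing: Yes']
```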
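
Next, we can turn each of these substrings into a list of (key, value) pairs by splitting every line at the first ': '. The dict() function can then convert such a list of (key, value) pairs into a dictionary:
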
>>> dict(field.split(': ', 1) for field in regs[0].split('\n'))
{'Coming': 'Yes', 'Cluster Name': 'o b', 'Name': 'testuser2', 'Contact No.': '12346971239', 'Email': 'testuser2@gmail.com'}
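
To make the intermediate (key, value) pairs visible, here is a small sketch. It uses str.partition instead of split(': ', 1), which is a substitution on my part: a line such as "Cluster Name:" with nothing after the colon yields a one-element list with split(': ', 1), and dict() cannot build an entry from that. The function name parse_fields and the sample block are hypothetical.

```python
def parse_fields(block):
    """Turn one record's text into a list of (key, value) tuples."""
    pairs = []
    for line in block.split("\n"):
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that contain no colon at all
        pairs.append((key.strip(), value.strip()))
    return pairs


block = "Name: testuser2\nEmail: testuser2@gmail.com\nCluster Name:"
pairs = parse_fields(block)
print(pairs)
# [('Name', 'testuser2'), ('Email', 'testuser2@gmail.com'), ('Cluster Name', '')]
print(dict(pairs))
# {'Name': 'testuser2', 'Email': 'testuser2@gmail.com', 'Cluster Name': ''}
```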
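
We can pass these dictionaries to a csv.DictWriter to write the records out as CSV, with a default value used for any missing fields:

>>> import csv
>>> w = csv.DictWriter(open("/tmp/foo.csv", "w"), fieldnames=["Name", "Email", "Cluster Name", "Contact No.", "Coming", "Members Participating"])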
>>> w.writeheader()
>>> w.writerow({'Name': 'Steve'})
12
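
The default used for missing values is controlled by the restval parameter of csv.DictWriter (an empty string unless you set it), and extrasaction decides what happens when a dictionary contains a key that is not in fieldnames. A short sketch, writing to an in-memory buffer; the "N/A" placeholder and the extra "Nickname" key are made up for the example.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["Name", "Coming", "Members Participating"],
    restval="N/A",            # written for fields missing from a row
    extrasaction="ignore",    # silently drop keys not listed in fieldnames
    lineterminator="\n",
)
writer.writeheader()
writer.writerow({"Name": "Steve", "Coming": "Yes", "Nickname": "S"})
print(buffer.getvalue())
# Name,Coming,Members Participating
# Steve,Yes,N/A
```

With the default extrasaction="raise", the unexpected "Nickname" key would cause a ValueError instead.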

Now, let's put it all together:

import csv

DATA = '''
Name: testuser2
Email: testuser2@gmail.com
Cluster Name: o b
Contact No.: 12346971239
Coming: Yes

Name: tuser5
Email: tuserner5@gmail.com
Cluster Name: River Retreat A
Contact No.: 7583450912
Coming: Yes
Members Participating: 2
'''

COLUMNS = ["Name", "Email", "Cluster Name", "Contact No.", "Coming", "Members Participating"]

def parse_registration(reg):
    # one "Key: value" line per field; split at the first ': ' only
    return dict(field.split(': ', 1) for field in reg.split('\n'))

def parse_registrations(data):
    # registrations are separated by blank lines
    data = data.strip()
    regs = data.split('\n\n')
    return [parse_registration(r) for r in regs]

def write_csv(data, filename):
    regs = parse_registrations(data)
    with open(filename, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(regs)

if __name__ == '__main__':
    write_csv(DATA, "/tmp/test.csv")

Output:

$ python3 write_csv.py
$ cat /tmp/test.csv
Name,Email,Cluster Name,Contact No.,Coming,Members Participating
testuser2,testuser2@gmail.com,o b,12346971239,Yes,
tuser5,tuserner5@gmail.com,River Retreat A,7583450912,Yes,2
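
The following will automatically convert your input text file into a CSV file. The headings are generated automatically from the entry with the most fields.

import csv
import re

with open("input.txt", "r") as f_input, open("output.csv", "w", newline="") as f_output:
    csv_output = csv.writer(f_output)
    # Each entry starts at a "Name: " line and runs up to the next blank line or the end of the input.
    # strip() removes a trailing newline that would otherwise leave an empty line in the last entry.
    entries = re.findall(r"^(Name: .*?)(?:\n\n|\Z)", f_input.read().strip(), re.M | re.S)

    # Determine the entry with the most fields for the CSV headers
    headings = []
    for entry in entries:
        headings = max(headings, [line.split(":")[0] for line in entry.split("\n")], key=len)
    csv_output.writerow(headings)

    # Write the entries (split at the first colon only, so values may contain colons)
    for entry in entries:
        csv_output.writerow([line.split(":", 1)[1].strip() for line in entry.split("\n")])

This produces a CSV text file that can be opened in Excel, as follows:
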
Name,Email,Cluster Name,Contact No.,Coming,Members Participating
testuser2,testuser2@gmail.com,o b,12346971239,Yes
testuser3,testuser3@gmail.com,Mediternea,9121319107,Yes
testuser4,tuser4@yahoo.com,Mediterranea,7892174896,Yes
tuser5,tuserner5@gmail.com,River Retreat A,7583450912,Yes,2
Test User,testuser@yahoo.co.in,RD,09833123445,Yes,2
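
Note that the rows without "Members Participating" are simply shorter than the header row. As a quick way to check how such a file reads back, the sketch below uses csv.DictReader, which fills columns missing from a short row with None by default. The shortened CSV_TEXT is an assumption standing in for the real output file.

```python
import csv
import io

# Hypothetical, shortened stand-in for output.csv
CSV_TEXT = """Name,Coming,Members Participating
testuser2,Yes
tuser5,Yes,2
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
for row in rows:
    print(row["Name"], row["Members Participating"])
# testuser2 None
# tuser5 2
```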
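
In the future, you may want to store the registration information in a database, or even directly in a CSV file on the server (if appropriate).
Actually, I have already started doing that, but for some users the information is still in this format, and I just want to convert this data to CSV so that nothing is lost. :)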
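
For the "directly in a CSV file on the server" suggestion, one possible shape is a helper that appends a single registration each time the form is submitted. This is only a sketch under assumptions: append_registration is a hypothetical name, the column list is taken from the sample data in the question, and concurrent submissions (which would need file locking) are not handled.

```python
import csv
import os

COLUMNS = ["Name", "Email", "Cluster Name", "Contact No.", "Coming", "Members Participating"]


def append_registration(path, registration):
    """Append one registration dict to the CSV file at path.

    The header row is written only when the file does not exist yet or is empty.
    """
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        if is_new:
            writer.writeheader()
        # fields missing from the registration are written as empty cells
        writer.writerow(registration)
```

A form handler could then call append_registration("registrations.csv", {"Name": "testuser2", "Coming": "Yes"}) for each submission, where the file name is again just an example.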