Python: parsing data from a text file


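I have built a contact form that sends me an email for every user registration. My question is really about parsing some text data into CSV format. I have received registrations from several users in my mailbox and copied them into a text file. The data looks like this: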

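Name: testuser2
Email: testuser2@gmail.com
Cluster Name: o b
Contact No.: 12346971239
Coming: Yes

Name: testuser3
Email: testuser3@gmail.com
Cluster Name: Mediternea
Contact No.: 9121319107
Coming: Yes

Name: testuser4
Email: tuser4@yahoo.com
Cluster Name: Mediterranea
Contact No.: 7892174896
Coming: Yes

Name: tuser5
Email: tuserner5@gmail.com
Cluster Name: River Retreat A
Contact No.: 7583450912
Coming: Yes
Members Participating: 2

Name: Test User
Email: testuser@yahoo.co.in
Cluster Name: RD
Contact No.: 09833123445
Coming: Yes
Members Participating: 2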

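As shown, the data has some fields that are common to every record and some that are not always present. I am looking for a solution or suggestion on how to parse this data so that, under the column header "Name", I collect the names, and likewise for the other fields. For the header "Members Participating", I can pick up the number and add it under that column in the Excel sheet; if a user has no such information, the cell can simply be left blank.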

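A program that follows this general strategy may meet your requirements:

  • First read in all of the email files and parse the data "by hand", then
  • use csv.DictWriter.writerows() to write the data to a CSV file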
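
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the answerer's original program: the `SAMPLE` text, the `parse_records` helper and the shortened column list are assumptions made for the example, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

# Hypothetical sample; in practice this text would be read from the saved email file(s).
SAMPLE = """\
Name: testuser2
Email: testuser2@gmail.com
Coming: Yes

Name: tuser5
Email: tuserner5@gmail.com
Coming: Yes
Members Participating: 2
"""

def parse_records(text):
    """Parse 'Key: value' lines by hand; a blank line ends a record."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so values may contain ':'
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:  # last record when the text does not end with a blank line
        records.append(current)
    return records

records = parse_records(SAMPLE)
fieldnames = ["Name", "Email", "Coming", "Members Participating"]

out = io.StringIO()  # stands in for open("registrations.csv", "w", newline="")
writer = csv.DictWriter(out, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(records)  # keys missing from a record are written as empty cells
csv_text = out.getvalue()
print(csv_text)
```

Because `DictWriter` fills in missing keys with an empty string by default, records without "Members Participating" simply get a blank cell in that column.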


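You can use the blank line between records to signal the end of a record. Then process the input file line by line and build a list of dictionaries. Finally, write the dictionaries out to a CSV file.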

from csv import DictWriter
from collections import OrderedDict

with open('input') as infile:
    registrations = []
    fields = OrderedDict()
    d = {}
    for line in infile:
        line = line.strip()
        if line:
            key, value = [s.strip() for s in line.split(':', 1)]
            d[key] = value
            fields[key] = None
        else:
            if d:
                registrations.append(d)
                d = {}
    else:
        if d:    # handle EOF
            registrations.append(d)


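# To force a specific column order, uncomment the next line and remove the one after it
# fieldnames = ['Name', 'Email', 'Cluster Name', 'Contact No.', 'Coming', 'Members Participating']
fieldnames = list(fields)

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('registrations.csv', 'w', newline='') as outfile:
    writer = DictWriter(outfile, fieldnames=fieldnames)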
    writer.writeheader()
    writer.writerows(registrations)
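
This code tries to collect the field names automatically, and uses the order in which each unique key is first seen in the input. If you need a specific field order in the output, you can fix it by uncommenting the corresponding line.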
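
As a side note, on Python 3.7 and later a plain `dict` also preserves insertion order, so the first-seen ordering can be collected without `OrderedDict`. A small sketch, where the `records` list is a made-up stand-in for the parsed registrations:

```python
# Hypothetical parsed records; the second one carries an extra field.
records = [
    {"Name": "testuser2", "Coming": "Yes"},
    {"Name": "tuser5", "Coming": "Yes", "Members Participating": "2"},
]

# dict.fromkeys keeps only the first occurrence of each key, in first-seen order.
fieldnames = list(dict.fromkeys(key for record in records for key in record))
print(fieldnames)
```

The resulting list can be passed directly as `fieldnames` to `csv.DictWriter`.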

Running this code on the sample input produces the following:

Name,Email,Cluster Name,Contact No.,Coming,Members Participating
testuser2,testuser2@gmail.com,o b,12346971239,Yes,
testuser3,testuser3@gmail.com,Mediternea,9121319107,Yes,
testuser4,tuser4@yahoo.com,Mediterranea,7892174896,Yes,
tuser5,tuserner5@gmail.com,River Retreat A,7583450912,Yes,2
Test User,testuser@yahoo.co.in,RD,09833123445,Yes,2

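Let's break the problem into smaller sub-problems:

  • Split the big block of text into separate registrations
  • Convert each registration into a dictionary
  • Write the list of dictionaries to CSV

First, let's break the block of registration data into its individual elements: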

    DATA = '''
    Name: testuser2
    Email: testuser2@gmail.com
    Cluster Name: o  b
    Contact No.: 12346971239
    Coming: Yes
    
    Name: testuser3
    Email: testuser3@gmail.com
    Cluster Name: Mediternea
    Contact No.: 9121319107
    Coming: Yes
    '''
    
    def parse_registrations(data):
        data = data.strip()
        return data.split('\n\n')
    
This function gives us a list with one string per registration:

    >>> regs = parse_registrations(DATA)
    >>> regs
    ['Name: testuser2\nEmail: testuser2@gmail.com\nCluster Name: o  b\nContact No.: 12346971239\nComing: Yes', 'Name: testuser3\nEmail: testuser3@gmail.com\nCluster Name: Mediternea\nContact No.: 9121319107\nComing: Yes']
    >>> regs[0]
    'Name: testuser2\nEmail: testuser2@gmail.com\nCluster Name: o  b\nContact No.: 12346971239\nComing: Yes'
    >>> regs[1]
    'Name: testuser3\nEmail: testuser3@gmail.com\nCluster Name: Mediternea\nContact No.: 9121319107\nComing: Yes'
    
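Next, we can turn each of these substrings into a list of (key, value) pairs. The dict() function can then convert a list of (key, value) pairs into a dictionary: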

    >>> dict(field.split(': ', 1) for field in regs[0].split('\n'))
    {'Coming': 'Yes', 'Cluster Name': 'o  b', 'Name': 'testuser2', 'Contact No.': '12346971239', 'Email': 'testuser2@gmail.com'}
    
We can pass these dictionaries to csv.DictWriter to write the records as CSV, with a default used for any missing values:

    >>> w = csv.DictWriter(open("/tmp/foo.csv", "w"), fieldnames=["Name", "Email", "Cluster Name", "Contact No.", "Coming", "Members Participating"])
    >>> w.writeheader()
    >>> w.writerow({'Name': 'Steve'})
    12
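
The default for missing values is an empty string, and it can be changed through the `restval` parameter of `csv.DictWriter`. A minimal sketch, using an in-memory buffer instead of a file and a shortened, illustrative column list:

```python
import csv
import io

out = io.StringIO()  # stands in for a real file opened with newline=""
writer = csv.DictWriter(
    out,
    fieldnames=["Name", "Email", "Members Participating"],
    restval="N/A",  # written for every key missing from a row (default is "")
)
writer.writeheader()
writer.writerow({"Name": "Steve"})

rows = out.getvalue().splitlines()
print(rows)
```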
    
Now, let's put it all together:

    import csv
    
    DATA = '''
    Name: testuser2
    Email: testuser2@gmail.com
    Cluster Name: o  b
    Contact No.: 12346971239
    Coming: Yes
    
    Name: tuser5
    Email: tuserner5@gmail.com
    Cluster Name: River Retreat A
    Contact No.: 7583450912
    Coming: Yes
    Members Participating: 2
    '''
    
    COLUMNS = ["Name", "Email", "Cluster Name", "Contact No.", "Coming", "Members Participating"]
    
    def parse_registration(reg):
        return dict(field.split(': ', 1) for field in reg.split('\n'))
    
    def parse_registrations(data):
        data = data.strip()
        regs = data.split('\n\n')
        return [parse_registration(r) for r in regs]
    
    def write_csv(data, filename):
        regs = parse_registrations(data)
        with open(filename, 'w', newline='') as f:  # newline='' as recommended by the csv docs
            writer = csv.DictWriter(f, fieldnames=COLUMNS)
            writer.writeheader()
            writer.writerows(regs)
    
    if __name__ == '__main__':
        write_csv(DATA, "/tmp/test.csv")
    
Output:

    $ python3 write_csv.py
    
    $ cat /tmp/test.csv
    Name,Email,Cluster Name,Contact No.,Coming,Members Participating
    testuser2,testuser2@gmail.com,o  b,12346971239,Yes,
    tuser5,tuserner5@gmail.com,River Retreat A,7583450912,Yes,2
    

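The following automatically converts the input text file into a CSV file. The headings are generated automatically from the longest entry.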

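    import csv
    import re
    
    # Python 3: open the CSV in text mode; newline="" lets the csv module control line endings
    with open("input.txt", "r") as f_input, open("output.csv", "w", newline="") as f_output:
        csv_output = csv.writer(f_output)
        # An entry starts at "Name: " and runs up to a blank line or the end of the text;
        # strip() keeps a trailing newline from leaving an empty last line in the final entry
        entries = re.findall(r"^(Name: .*?)(?:\n\n|\Z)", f_input.read().strip(), re.M | re.S)
    
        # Determine the entry with the most fields for the CSV headers
        headings = []
        for entry in entries:
            headings = max(headings, [line.split(":")[0] for line in entry.split("\n")], key=len)
        csv_output.writerow(headings)
    
        # Write the entries; split on the first colon only so values containing ":" survive
        for entry in entries:
            csv_output.writerow([line.split(":", 1)[1].strip() for line in entry.split("\n")])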
    
This produces a CSV text file that can be opened in Excel, like this:

    Name,Email,Cluster Name,Contact No.,Coming,Members Participating
    testuser2,testuser2@gmail.com,o  b,12346971239,Yes
    testuser3,testuser3@gmail.com,Mediternea,9121319107,Yes
    testuser4,tuser4@yahoo.com,Mediterranea,7892174896,Yes
    tuser5,tuserner5@gmail.com,River Retreat A,7583450912,Yes,2
    Test User,testuser@yahoo.co.in,RD,09833123445,Yes,2
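
One thing to keep in mind with positional rows: they line up with the headings only when the optional fields come last, as they do in the sample data. If an entry could lack a field in the middle, a variant that writes by key may be safer. The sketch below is an illustration under that assumption; the `TEXT` sample is made up, and an in-memory buffer stands in for the output file.

```python
import csv
import io
import re

# Hypothetical input: the second entry has no "Email" line.
TEXT = """Name: a
Email: a@example.com
Coming: Yes

Name: b
Coming: No
"""

entries = re.findall(r"^(Name: .*?)(?:\n\n|\Z)", TEXT.strip(), re.M | re.S)
records = [dict(line.split(": ", 1) for line in entry.split("\n")) for entry in entries]

# Use the entry with the most fields for the headings, as in the code above.
headings = list(max(records, key=len))

out = io.StringIO()  # stands in for open("output.csv", "w", newline="")
writer = csv.DictWriter(out, fieldnames=headings)
writer.writeheader()
writer.writerows(records)  # values are placed under their own heading

lines = out.getvalue().splitlines()
print(lines)
```

Note that `DictWriter` raises `ValueError` by default if a record contains a key that is not in `fieldnames`, so this variant assumes the longest entry contains every field that occurs.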
    


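In the future, you may want to store the registration information in a database, or even directly in a CSV file on the server (if appropriate).

Actually, I have already started doing that, but for some users the information is still in this format, and I just want to convert that data to CSV so that no information is lost. :)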