用python处理大型文本文件_Python_Text Files

用python处理大型文本文件

python

用python处理大型文本文件,python,text-files,Python,Text Files,我有一个非常大的文件（3.8G），它是从我学校的一个系统中提取的用户。我需要重新处理该文件，以便它只包含他们的ID和电子邮件地址，逗号分隔我对这方面的经验很少，希望将其用作Python的学习练习该文件的条目如下所示： dn: uid=123456789012345,ou=Students,o=system.edu,o=system LoginId: 0099886 mail: fflintstone@system.edu dn: uid=543210987654321,ou=Student

我有一个非常大的文件（3.8G），它是从我学校的一个系统中提取的用户。我需要重新处理该文件，以便它只包含他们的ID和电子邮件地址，逗号分隔

我对这方面的经验很少，希望将其用作Python的学习练习

该文件的条目如下所示：

dn: uid=123456789012345,ou=Students,o=system.edu,o=system
LoginId: 0099886
mail: fflintstone@system.edu

dn: uid=543210987654321,ou=Students,o=system.edu,o=system
LoginId: 0083156
mail: brubble@system.edu

import csv

# Open the file
f = open("/path/to/large.file", "r")
# Create an output file
output_file = open("/desired/path/to/final/file", "w")

# Use the CSV module to make use of existing functionality.
final_file = csv.writer(output_file)

# Write the header row - can be skipped if headers not needed.
final_file.writerow(["LoginID","EmailAddress"])

# Set up our temporary cache for a user
current_user = []

# Iterate over the large file
# Note that we are avoiding loading the entire file into memory
for line in f:
    if line.startswith("LoginID"):
        current_user.append(line[9:].strip())
    # If more information is desired, simply add it to the conditions here
    # (additional elif's should do)
    # and add it to the current user.

    elif line.startswith("mail"):
        current_user.append(line[6:].strip())
        # Once you know you have reached the end of a user entry
        # write the row to the final file
        # and clear your temporary list.
        final_file.writerow(current_user)
        current_user = []

    # Skip lines that aren't interesting.
    else:
        continue

for line in f:
    if "uid=" in line:
        # Parse the user id out by grabbing the substring between the first = and ,
        uid = line[line.find("=")+1:line.find(",")]
    elif "mail:" in line:
        # Parse the email out by grabbing everything from the : to the end (removing the newline character)
        email = line[line.find(": ")+2:-1]
        # Given the formatting you've provided, this comes second so we can make an entry into the dict here
        data[uid] = email

writer = csv.writer(<filename>)
writer.writerow("User, Email")
for id, mail in data.iteritems:
    writer.writerow(id + "," + mail)

import csv
import ldap

con = ldap.initialize('ldap://server.fqdn.system.edu')
# if you're LDAP directory requires authentication
# con.bind_s(username, password)

try:
    with open('output_file', 'wb') as output:
        csv_writer = csv.writer(output)
        csv_writer.writerow(['LoginId', 'Mail'])

        for dn, attrs in con.search_s('ou=Students,o=system.edu,o=system', ldap.SCOPE_SUBTREE, attrlist = ['LoginId','mail']:
            csv_writer.writerow([attrs['LoginId'], attrs['mail']])
finally:
    # even if you don't have credentials, it's usually good to unbind
    con.unbind_s()

我正在尝试获取一个如下所示的文件：

0099886,fflintstone@system.edu
0083156,brubble@system.edu

任何提示或代码？

假设每个条目的结构始终相同，只需执行以下操作：

dn: uid=123456789012345,ou=Students,o=system.edu,o=system
LoginId: 0099886
mail: fflintstone@system.edu

dn: uid=543210987654321,ou=Students,o=system.edu,o=system
LoginId: 0083156
mail: brubble@system.edu

import csv

# Open the file
f = open("/path/to/large.file", "r")
# Create an output file
output_file = open("/desired/path/to/final/file", "w")

# Use the CSV module to make use of existing functionality.
final_file = csv.writer(output_file)

# Write the header row - can be skipped if headers not needed.
final_file.writerow(["LoginID","EmailAddress"])

# Set up our temporary cache for a user
current_user = []

# Iterate over the large file
# Note that we are avoiding loading the entire file into memory
for line in f:
    if line.startswith("LoginID"):
        current_user.append(line[9:].strip())
    # If more information is desired, simply add it to the conditions here
    # (additional elif's should do)
    # and add it to the current user.

    elif line.startswith("mail"):
        current_user.append(line[6:].strip())
        # Once you know you have reached the end of a user entry
        # write the row to the final file
        # and clear your temporary list.
        final_file.writerow(current_user)
        current_user = []

    # Skip lines that aren't interesting.
    else:
        continue

for line in f:
    if "uid=" in line:
        # Parse the user id out by grabbing the substring between the first = and ,
        uid = line[line.find("=")+1:line.find(",")]
    elif "mail:" in line:
        # Parse the email out by grabbing everything from the : to the end (removing the newline character)
        email = line[line.find(": ")+2:-1]
        # Given the formatting you've provided, this comes second so we can make an entry into the dict here
        data[uid] = email

writer = csv.writer(<filename>)
writer.writerow("User, Email")
for id, mail in data.iteritems:
    writer.writerow(id + "," + mail)

import csv
import ldap

con = ldap.initialize('ldap://server.fqdn.system.edu')
# if you're LDAP directory requires authentication
# con.bind_s(username, password)

try:
    with open('output_file', 'wb') as output:
        csv_writer = csv.writer(output)
        csv_writer.writerow(['LoginId', 'Mail'])

        for dn, attrs in con.search_s('ou=Students,o=system.edu,o=system', ldap.SCOPE_SUBTREE, attrlist = ['LoginId','mail']:
            csv_writer.writerow([attrs['LoginId'], attrs['mail']])
finally:
    # even if you don't have credentials, it's usually good to unbind
    con.unbind_s()

要打开文件，您需要使用类似于

with

关键字的东西，以确保即使出现问题，文件也能正确关闭：

with open(<your_file>, "r") as f:
   # Do stuff

要实际解析文件（文件打开时运行的内容），可以执行以下操作：

dn: uid=123456789012345,ou=Students,o=system.edu,o=system
LoginId: 0099886
mail: fflintstone@system.edu

dn: uid=543210987654321,ou=Students,o=system.edu,o=system
LoginId: 0083156
mail: brubble@system.edu

import csv

# Open the file
f = open("/path/to/large.file", "r")
# Create an output file
output_file = open("/desired/path/to/final/file", "w")

# Use the CSV module to make use of existing functionality.
final_file = csv.writer(output_file)

# Write the header row - can be skipped if headers not needed.
final_file.writerow(["LoginID","EmailAddress"])

# Set up our temporary cache for a user
current_user = []

# Iterate over the large file
# Note that we are avoiding loading the entire file into memory
for line in f:
    if line.startswith("LoginID"):
        current_user.append(line[9:].strip())
    # If more information is desired, simply add it to the conditions here
    # (additional elif's should do)
    # and add it to the current user.

    elif line.startswith("mail"):
        current_user.append(line[6:].strip())
        # Once you know you have reached the end of a user entry
        # write the row to the final file
        # and clear your temporary list.
        final_file.writerow(current_user)
        current_user = []

    # Skip lines that aren't interesting.
    else:
        continue

for line in f:
    if "uid=" in line:
        # Parse the user id out by grabbing the substring between the first = and ,
        uid = line[line.find("=")+1:line.find(",")]
    elif "mail:" in line:
        # Parse the email out by grabbing everything from the : to the end (removing the newline character)
        email = line[line.find(": ")+2:-1]
        # Given the formatting you've provided, this comes second so we can make an entry into the dict here
        data[uid] = email

writer = csv.writer(<filename>)
writer.writerow("User, Email")
for id, mail in data.iteritems:
    writer.writerow(id + "," + mail)

import csv
import ldap

con = ldap.initialize('ldap://server.fqdn.system.edu')
# if you're LDAP directory requires authentication
# con.bind_s(username, password)

try:
    with open('output_file', 'wb') as output:
        csv_writer = csv.writer(output)
        csv_writer.writerow(['LoginId', 'Mail'])

        for dn, attrs in con.search_s('ou=Students,o=system.edu,o=system', ldap.SCOPE_SUBTREE, attrlist = ['LoginId','mail']:
            csv_writer.writerow([attrs['LoginId'], attrs['mail']])
finally:
    # even if you don't have credentials, it's usually good to unbind
    con.unbind_s()

使用CSV编写器（请记住在文件开头导入CSV），我们可以这样输出：

dn: uid=123456789012345,ou=Students,o=system.edu,o=system
LoginId: 0099886
mail: fflintstone@system.edu

dn: uid=543210987654321,ou=Students,o=system.edu,o=system
LoginId: 0083156
mail: brubble@system.edu

import csv

# Open the file
f = open("/path/to/large.file", "r")
# Create an output file
output_file = open("/desired/path/to/final/file", "w")

# Use the CSV module to make use of existing functionality.
final_file = csv.writer(output_file)

# Write the header row - can be skipped if headers not needed.
final_file.writerow(["LoginID","EmailAddress"])

# Set up our temporary cache for a user
current_user = []

# Iterate over the large file
# Note that we are avoiding loading the entire file into memory
for line in f:
    if line.startswith("LoginID"):
        current_user.append(line[9:].strip())
    # If more information is desired, simply add it to the conditions here
    # (additional elif's should do)
    # and add it to the current user.

    elif line.startswith("mail"):
        current_user.append(line[6:].strip())
        # Once you know you have reached the end of a user entry
        # write the row to the final file
        # and clear your temporary list.
        final_file.writerow(current_user)
        current_user = []

    # Skip lines that aren't interesting.
    else:
        continue

for line in f:
    if "uid=" in line:
        # Parse the user id out by grabbing the substring between the first = and ,
        uid = line[line.find("=")+1:line.find(",")]
    elif "mail:" in line:
        # Parse the email out by grabbing everything from the : to the end (removing the newline character)
        email = line[line.find(": ")+2:-1]
        # Given the formatting you've provided, this comes second so we can make an entry into the dict here
        data[uid] = email

writer = csv.writer(<filename>)
writer.writerow("User, Email")
for id, mail in data.iteritems:
    writer.writerow(id + "," + mail)

import csv
import ldap

con = ldap.initialize('ldap://server.fqdn.system.edu')
# if you're LDAP directory requires authentication
# con.bind_s(username, password)

try:
    with open('output_file', 'wb') as output:
        csv_writer = csv.writer(output)
        csv_writer.writerow(['LoginId', 'Mail'])

        for dn, attrs in con.search_s('ou=Students,o=system.edu,o=system', ldap.SCOPE_SUBTREE, attrlist = ['LoginId','mail']:
            csv_writer.writerow([attrs['LoginId'], attrs['mail']])
finally:
    # even if you don't have credentials, it's usually good to unbind
    con.unbind_s()

writer=csv.writer（）
writer.writerow（“用户，电子邮件”）
对于id，mail-in-data.iteritems：
writer.writerow（id+，“+邮件）

另一个选项是在文件之前打开writer，写入头，然后在写入CSV的同时读取文件中的行。这避免了将信息转储到内存中，这可能是非常可取的。所以把所有这些放在一起

writer = csv.writer(<filename>)
writer.writerow("User, Email")
with open(<your_file>, "r") as f:
    for line in f:
        if "uid=" in line:
            # Parse the user id out by grabbing the substring between the first = and ,
            uid = line[line.find("=")+1:line.find(",")]
        elif "mail:" in line:
            # Parse the email out by grabbing everything from the : to the end (removing the newline character)
            email = line[line.find(": ")+2:-1]
            # Given the formatting you've provided, this comes second so we can make an entry into the dict here
            writer.writerow(iid + "," + email)

writer=csv.writer（）
writer.writerow（“用户，电子邮件”）
以开放（，“r”）作为f：
对于f中的行：
如果“uid=”在第行：
#通过抓取第一个=和之间的子字符串来解析用户id，
uid=line[line.find（“=”）+1:line.find（“，”）]
elif“邮件：”行中：
#通过抓取从：到结尾的所有内容（删除换行符）来解析电子邮件
email=line[line.find（“：”）+2:-1]
#根据您提供的格式，这是第二个，因此我们可以在这里输入dict
writer.writerow（iid+，“+电子邮件）

再次假设您的文件格式正确：

with open(inputfilename) as inputfile, with open(outputfilename) as outputfile:
    mail = loginid = ''
    for line in inputfile:
        line = inputfile.split(':')
        if line[0] not in ('LoginId', 'mail'):
            continue
        if line[0] == 'LoginId':
            loginid = line[1].strip()
        if line[0] == 'mail':
            mail = line[1].strip()
        if mail and loginid:
            output.write(loginid + ',' + mail + '\n')
            mail = loginid = ''

本质上等同于其他方法。

在我看来，它实际上像一个文件。该库有一个纯Python LDIF处理库，如果您的文件具有LDIF中可能出现的一些令人讨厌的问题，例如Base64编码值、条目折叠等，该库将提供帮助

你可以这样使用它：

import csv
import ldif

class ParseRecords(ldif.LDIFParser):
   def __init__(self, csv_writer):
       self.csv_writer = csv_writer
   def handle(self, dn, entry):
       self.csv_writer.writerow([entry['LoginId'], entry['mail']])

with open('/path/to/large_file') as input, with open('output_file', 'wb') as output:
    csv_writer = csv.writer(output)
    csv_writer.writerow(['LoginId', 'Mail'])
    ParseRecords(input, csv_writer).parse()

编辑

因此，要从实时LDAP目录中提取，请使用库执行以下操作：

dn: uid=123456789012345,ou=Students,o=system.edu,o=system
LoginId: 0099886
mail: fflintstone@system.edu

dn: uid=543210987654321,ou=Students,o=system.edu,o=system
LoginId: 0083156
mail: brubble@system.edu

import csv

# Open the file
f = open("/path/to/large.file", "r")
# Create an output file
output_file = open("/desired/path/to/final/file", "w")

# Use the CSV module to make use of existing functionality.
final_file = csv.writer(output_file)

# Write the header row - can be skipped if headers not needed.
final_file.writerow(["LoginID","EmailAddress"])

# Set up our temporary cache for a user
current_user = []

# Iterate over the large file
# Note that we are avoiding loading the entire file into memory
for line in f:
    if line.startswith("LoginID"):
        current_user.append(line[9:].strip())
    # If more information is desired, simply add it to the conditions here
    # (additional elif's should do)
    # and add it to the current user.

    elif line.startswith("mail"):
        current_user.append(line[6:].strip())
        # Once you know you have reached the end of a user entry
        # write the row to the final file
        # and clear your temporary list.
        final_file.writerow(current_user)
        current_user = []

    # Skip lines that aren't interesting.
    else:
        continue

for line in f:
    if "uid=" in line:
        # Parse the user id out by grabbing the substring between the first = and ,
        uid = line[line.find("=")+1:line.find(",")]
    elif "mail:" in line:
        # Parse the email out by grabbing everything from the : to the end (removing the newline character)
        email = line[line.find(": ")+2:-1]
        # Given the formatting you've provided, this comes second so we can make an entry into the dict here
        data[uid] = email

writer = csv.writer(<filename>)
writer.writerow("User, Email")
for id, mail in data.iteritems:
    writer.writerow(id + "," + mail)

import csv
import ldap

con = ldap.initialize('ldap://server.fqdn.system.edu')
# if you're LDAP directory requires authentication
# con.bind_s(username, password)

try:
    with open('output_file', 'wb') as output:
        csv_writer = csv.writer(output)
        csv_writer.writerow(['LoginId', 'Mail'])

        for dn, attrs in con.search_s('ou=Students,o=system.edu,o=system', ldap.SCOPE_SUBTREE, attrlist = ['LoginId','mail']:
            csv_writer.writerow([attrs['LoginId'], attrs['mail']])
finally:
    # even if you don't have credentials, it's usually good to unbind
    con.unbind_s()

这本书可能值得一读，尤其是《圣经》

请注意，在上面的示例中，我完全跳过了提供过滤器，您可能希望在生产中这样做。LDAP中的筛选器类似于SQL语句中的

WHERE

子句；它限制返回的对象。LDAP筛选器的规范参考是

类似地，如果即使在应用了适当的过滤器之后，仍然可能有数千个条目，那么您可能需要查看这些条目，尽管使用这些条目会使示例更加复杂。希望这足以让你开始，但如果有什么事情发生，请随意提问或提出一个新问题

祝你好运。

看起来像是LDAP中的某些东西。我的小费只是你需要的东西；）哇，非常感谢。它奏效了，所有的评论都非常有用。我不得不在下面一行的末尾添加一个冒号：

if line.startswith（“LoginID”）

好的，这对我来说非常有趣。在我处理完文件并转交给我的同事后，他们问我是否可以每晚这样做。我可以使用Python直接查询LDAP并生成ID为MAIL的CSV文件吗？可以，尽管LDAP的API有点不同。我将用天真的例子来编辑。我真的很感谢你的帮助，尤其是LDAP更新！我将试一试，看看我能想出什么。+1用于LDAP——我假设Alistair没有访问权限。我希望我能给你另一个+1的链接！好的，我得到一个错误：

以open（'id-file.csv'，wb'）作为输出：SyntaxError:无效语法

，插入符号指向单词

open

@Alistair中的字母n没有问题。不过我才意识到我使用了错误的注释符号。应该是

而不是

。最近用Java的时间太多了。。。