How to compare line strings from two different files using Python?

Tags: python, python-3.x

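I want to find the emails that match between two files and get the sent date by comparing the emails from both files. I have two files: 1) maillog.txt (a postfix mail log) and 2) testmail.txt (email addresses separated by newlines). I used re to extract the email and sent date from maillog.txt, which looks like this: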

Nov  3 10:08:43 server postfix/smtp[150754]: 78FA8209EDEF: to=<adamson@example.com>, relay=aspmx.l.google.com[74.125.24.26]:25, delay=3.2, delays=0.1/0/1.6/1.5, dsn=2.0.0, status=sent (250 2.0.0 OK 1509718076 m11si5060862pls.447 - gsmtp)
Nov  3 10:10:45 server postfix/smtp[150754]: 7C42A209EDEF: to=<addison@linux.com>, relay=mxa-000f9e01.gslb.pphosted.com[67.231.152.217]:25, delay=5.4, delays=0.1/0/3.8/1.5, dsn=2.0.0, status=sent (250 2.0.0 2dvkvt5tgc-1 Message accepted for delivery)
Nov  3 10:15:45 server postfix/smtp[150754]: 83533209EDE8: to=<johndoe@carchcoal.com>, relay=mxa-000f9e01.gslb.pphosted.com[67.231.144.222]:25, delay=4.8, delays=0.1/0/3.3/1.5, dsn=2.0.0, status=sent (250 2.0.0 2dvm8yww64-1 Message accepted for delivery)
Nov  3 10:16:42 server postfix/smtp[150754]: 83A5E209EDEF: to=<jackn@alphanr.com>, relay=aspmx.l.google.com[74.125.200.27]:25, delay=1.6, delays=0.1/0/0.82/0.69, dsn=2.0.0, status=sent (250 2.0.0 OK 1509718555 j186si6198120pgc.455 - gsmtp)
Nov  3 10:17:44 server postfix/smtp[150754]: 8636D209EDEF: to=<sbins@archcoal.com>, relay=mxa-000f9e01.gslb.pphosted.com[67.231.144.222]:25, delay=4.1, delays=0.11/0/2.6/1.4, dsn=2.0.0, status=sent (250 2.0.0 2dvm8ywwdh-1 Message accepted for delivery)
Nov  3 10:18:42 server postfix/smtp[150754]: 8A014209EDEF: to=<leo@adalphanr.com>, relay=aspmx.l.google.com[74.125.200.27]:25, delay=1.9, delays=0.1/0/0.72/1.1, dsn=2.0.0, status=sent (250 2.0.0 OK 1509718675 o2si6032950pgp.46 - gsmtp)
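Below is what I tried. It works, but I would like to know whether there is a more efficient way to handle large mail logs and a large number of email addresses.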

import re
pattern=r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}(?P<ts>\d+:\d+:\d+).*to=<(?P<email>([\w\.-]+)@([\w\.-]+))'
with open("testmail.txt") as fh1:
    for addr in fh1:
        if addr:
            with open("maillog.txt") as fh:
                for line in fh:
                    if line:
                        match=re.finditer(pattern,line)
                        for obj in match:
                            addr=addr.strip()
                            addr2=obj.group('email').strip()
                            if addr == addr2:
                                print(obj.groupdict('email'))
Here is my solution:

In [1]: import re

In [2]: pat = r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}(?P<ts>\d+:\d+:\d+).*to=<(?P<email>([\w\.-]+)@([\w\.-]+))'

In [3]: emails = set()

In [4]: date_email = {}

In [6]: with open('maillog.txt', mode='r') as f:
   ...:     for line in f:
   ...:         m = re.search(pat, line)
   ...:         if m is None:  # skip log lines without a to=<...> field
   ...:             continue
   ...:         month, day, ts, email = m.group('month', 'day', 'ts', 'email')
   ...:         date_email[email] = (month, day, ts)
   ...:         

In [7]: date_email
Out[7]: 
{'adamson@example.com': ('Nov', '3', '10:08:43'),
 'addison@linux.com': ('Nov', '3', '10:10:45'),
 'jackn@alphanr.com': ('Nov', '3', '10:16:42'),
 'johndoe@carchcoal.com': ('Nov', '3', '10:15:45'),
 'leo@adalphanr.com': ('Nov', '3', '10:18:42'),
 'sbins@archcoal.com': ('Nov', '3', '10:17:44')}

In [11]: with open('testmail.txt', mode='r') as f:
    ...:     for line in f:
    ...:         emails.add(line.strip())
    ...:         

In [12]: emails
Out[12]: {'adamson@example.com', 'jdswson@gmail.com'}

In [15]: for email in emails:
    ...:     if email in date_email:
    ...:         print(email, date_email[email])
    ...:         
adamson@example.com ('Nov', '3', '10:08:43')
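
One caveat with the dict above: `date_email[email] = ...` keeps only the last log entry seen for each address. If an address can appear several times in the log, a minimal sketch along the following lines collects every timestamp instead. The helper name `collect_dates` and the in-memory sample lines are assumptions for illustration; in practice the lines would come from iterating over maillog.txt.

```python
import re
from collections import defaultdict

PAT = re.compile(
    r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}'
    r'(?P<ts>\d+:\d+:\d+).*to=<(?P<email>[\w.-]+@[\w.-]+)'
)

def collect_dates(log_lines):
    """Map each recipient to a list of (month, day, ts) tuples (hypothetical helper)."""
    dates = defaultdict(list)
    for line in log_lines:
        m = PAT.search(line)
        if m is None:  # the line has no to=<...> field
            continue
        dates[m.group('email')].append(m.group('month', 'day', 'ts'))
    return dates

# Made-up sample lines in the same shape as the log excerpt
sample_log = [
    "Nov  3 10:08:43 server postfix/smtp[150754]: 78FA8209EDEF: to=<adamson@example.com>, relay=x, status=sent",
    "Nov  3 10:09:00 server postfix/qmgr[1234]: 78FA8209EDEF: removed",
    "Nov  4 11:00:01 server postfix/smtp[150754]: 9ABC8209EDEF: to=<adamson@example.com>, relay=x, status=sent",
]

dates = collect_dates(sample_log)
wanted = {'adamson@example.com', 'jdswson@gmail.com'}
for email in sorted(wanted):
    if email in dates:
        print(email, dates[email])
```

Whether to keep the first, the last, or all deliveries depends on what "sent date" should mean when an address was mailed more than once.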

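You can try using a regex with capture groups.

Let's break the solution into three steps:

Step 1: capture all the email addresses from emails.txt.

Step 2: capture the required data from data.txt.

Step 3: now that we have all the data, check whether each email is in our data list, and if so add that group of information to a dict.

The full code and notes on the regex are listed further down the page.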


Quick and untested, but conceptually simple enough: compile one big whoppin' regex that contains all the addresses.

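import re

with open("testmail.txt") as fh1:
    # Escape each address so characters such as "." match literally;
    # skip blank lines, which would otherwise add an empty alternative.
    emails = [re.escape(addr.strip()) for addr in fh1 if addr.strip()]

# The closing ">" keeps an address from matching as a prefix of a longer one.
pattern = re.compile(
    r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}(?P<ts>\d+:\d+:\d+).*to=<(?P<email>%s)>' %
        '|'.join(emails))

with open("maillog.txt") as fh:
    for line in fh:
        for match in pattern.finditer(line):
            print(match.groupdict())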
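
As a small illustration of the single-regex idea, the sketch below builds the alternation from an in-memory list and runs it over two made-up log lines. `build_pattern` is a hypothetical helper, and the sample addresses and lines are assumptions. The closing `>` after the group keeps an address from matching as a prefix of a longer one.

```python
import re

def build_pattern(addresses):
    """Compile one regex matching log lines sent to any of `addresses` (hypothetical helper)."""
    # re.escape makes "." and similar characters match literally.
    escaped = [re.escape(a.strip()) for a in addresses if a.strip()]
    if not escaped:
        raise ValueError("no addresses given")
    return re.compile(
        r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}'
        r'(?P<ts>\d+:\d+:\d+).*to=<(?P<email>%s)>' % '|'.join(escaped)
    )

# Lines as they would come from testmail.txt, including a blank one
pattern = build_pattern(["adamson@example.com\n", "\n", "leo@adalphanr.com\n"])

lines = [
    "Nov  3 10:08:43 server postfix/smtp[1]: AAA: to=<adamson@example.com>, status=sent",
    "Nov  3 10:09:43 server postfix/smtp[1]: BBB: to=<adamson@example.community>, status=sent",
]
matches = [m.groupdict() for line in lines for m in pattern.finditer(line)]
print(matches)
```

Only the first line matches; the second address merely starts with a listed one. How well one very large alternation performs compared with a set lookup is something to measure on real data rather than assume.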

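My suggestion is to store all the emails from testmail.txt in a set, compile the regex, then iterate over the lines of maillog.txt and look each email up in the set. That way only the shorter file has to reside in memory, the regex pattern is compiled only once, and the lookups happen in a set, which is optimized for that kind of access: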

import re
pattern=r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}(?P<ts>\d+:\d+:\d+).*to=<(?P<email>([\w\.-]+)@([\w\.-]+))'

# load the testmail file into a set
mails = set()
with open('testmail.txt') as fd:
    for line in fd:
        mails.add(line.strip())

#compile the regex once
rx = re.compile(pattern)

#process the maillog file:
with open('maillog.txt') as fd:
    for line in fd:
        m = rx.match(line)
        if m is not None and m.groupdict()['email'] in mails:
            print(m.groupdict())
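
To try the set-based approach without touching real files, the same logic can be wrapped in a function that accepts any iterable of lines. This is only a sketch: `find_deliveries` and the sample data are assumptions, `io.StringIO` stands in for the opened files, and the unnamed inner groups of the pattern are dropped because only the named groups are used.

```python
import io
import re

RX = re.compile(
    r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}'
    r'(?P<ts>\d+:\d+:\d+).*to=<(?P<email>[\w.-]+@[\w.-]+)'
)

def find_deliveries(address_lines, log_lines):
    """Yield the groupdict of every log line whose recipient is in address_lines."""
    # The smaller input is held in memory as a set for fast membership tests.
    wanted = {line.strip() for line in address_lines if line.strip()}
    # The larger input is streamed one line at a time.
    for line in log_lines:
        m = RX.match(line)
        if m is not None and m.group('email') in wanted:
            yield m.groupdict()

# io.StringIO objects stand in for open('testmail.txt') and open('maillog.txt')
testmail = io.StringIO("adamson@example.com\njdswson@gmail.com\n")
maillog = io.StringIO(
    "Nov  3 10:08:43 server postfix/smtp[150754]: 78FA8209EDEF: to=<adamson@example.com>, status=sent\n"
    "Nov  3 10:10:45 server postfix/smtp[150754]: 7C42A209EDEF: to=<addison@linux.com>, status=sent\n"
)

results = list(find_deliveries(testmail, maillog))
for r in results:
    print(r)
```

Because `find_deliveries` is a generator, the log is never loaded as a whole; passing real file objects in place of the `StringIO` stand-ins should work the same way.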

Comments:

This might be better suited to Code Review. But in general, you don't want to read the mail log over and over for every test mail. Instead, read the test mails into a set, then scan the mail log once, testing whether each line contains a mail that is in the set.

Which is bigger, maillog.txt or testmail? The common approach is to load the smaller file into memory (if possible into a dict or set for faster lookups), and then scan the bigger one a line at a time.

@SergeBallesta The file "testmail.txt" is the smaller one. I think I should put that file in a set and compare against the large mail log file while reading it line by line.

IMHO, since the files are large, it's better to load only one of them into memory and scan the other...
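Full code for the three-step answer:

import re

# Raw strings keep the backslashes intact for the regex engine.
pattern = r'^(\w{0,3})\s.(\d)\s(\d.+?\s)|<(\w+[@]\w+[.]\w+)>'
email_pattern = r'\w+[@]\w+[.]\w+'

# Step 1: collect the addresses from emails.txt.
emails = []
with open('emails.txt', 'r') as f:
    for line in f:
        found = re.search(email_pattern, line)
        if found:  # skip blank or malformed lines
            emails.append(found.group())

# Step 2: for each log line, collect [[month, day, ts], email].
with open('data.txt', 'r') as f:
    month_day = [[find.group(4) if find.group(4) is not None
                  else [find.group(1), find.group(2), find.group(3)]
                  for find in re.finditer(pattern, line)]
                 for line in f]

# Step 3: print the entries whose address is in the list.
for item in month_day:
    final_dict = {}
    # Only lines that produced both a date match and an address match are usable.
    if len(item) == 2 and item[1] in emails:
        final_dict['month'] = item[0][0]
        final_dict['day'] = item[0][1]
        final_dict['ts'] = item[0][2]
        final_dict['email'] = item[1]
    if final_dict:
        print(final_dict)

Output:

{'ts': '10:08:43 ', 'month': 'Nov', 'email': 'adamson@example.com', 'day': '3'}

Regex information:

^ asserts position at the start of a line
\w{0,3} matches between zero and three word characters (\w is equal to [a-zA-Z0-9_])
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\d matches a digit (equal to [0-9])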
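
The notes above can be attached to the pattern itself with `re.VERBOSE`. The sketch below rewrites the three-step answer's pattern with inline comments and shows what `finditer` yields for a log line. The sample lines are shortened, and the comments on what each group is meant to capture are an interpretation, not something the answer states.

```python
import re

PATTERN = re.compile(r'''
    ^(\w{0,3})            # group 1: up to three word characters at the start (month)
    \s.                   # one whitespace character, then any single character
    (\d)                  # group 2: exactly one digit (day)
    \s                    # one whitespace character
    (\d.+?\s)             # group 3: a digit, then as little as possible up to whitespace (time)
    |                     # OR
    <(\w+[@]\w+[.]\w+)>   # group 4: an address between angle brackets
''', re.VERBOSE)

line = "Nov  3 10:08:43 server postfix/smtp[150754]: 78FA8209EDEF: to=<adamson@example.com>, status=sent"

# Each line yields two matches: one for the date part, one for the address.
found = [m.groups() for m in PATTERN.finditer(line)]
for groups in found:
    print(groups)

# A two-digit day: the "." consumes the first digit of the day.
two_digit = PATTERN.match("Nov 13 10:08:43 server").groups()
print(two_digit)
```

The second call shows a limitation: `\s.(\d)` relies on the day being a single digit padded with two spaces, so for a line starting with `Nov 13` the `.` consumes the `1` and group 2 captures only `3`. The named-group pattern used in the question, with `\s{1,3}(?P<day>\d{1,2})`, does not have this problem.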