Python: extract and count domains from email addresses
I have a list of email addresses. I want to extract only the domain names and count how many times each domain appears.
Emails:
best@yahoo.com
hello@gmail.com
everybody@gmail.com
bye@gmail.com
day@yahoo.com
table.blue@gmail.com
life@yahoo.com
Script:
import re
from collections import Counter
with open("mails.txt", "r") as f:
    texte = f.read().split('\n')
for line in texte:
    newline = re.search("@[\w.]+", line)
    newmail = newline.group()
    mails_value = Counter(newmail).most_common()
    print (mails_value)
Output:
[('g',1),('g',1),('6',1),('5',1),('f',1),('r',1)]
Traceback (most recent call last):
  File "counting.py", line 10, in <module>
    newmail = newline.group()
AttributeError: 'NoneType' object has no attribute 'group'
Expected output:
@yahoo.com 3
@gmail.com 4
You can use split:
texte = "life@yahoo.com"
texte.split("@")
['life', 'yahoo.com']
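To go from one address to a whole list, something like the following might work. It is only a sketch: the helper name domain_of and the sample list are made up for illustration, and rpartition is swapped in for split so that only the last @ is used and lines without an @ can be skipped.

```python
from collections import Counter

def domain_of(address):
    """Return the text after the last '@', or None if there is no '@'."""
    local, sep, domain = address.strip().rpartition("@")
    return domain if sep and domain else None

# Illustrative sample (assumption), including a blank and a malformed entry.
emails = ["best@yahoo.com", "hello@gmail.com", "bye@gmail.com", "", "not-an-email"]

# Keep only the entries that actually contained a domain, then count them.
domains = [d for d in map(domain_of, emails) if d is not None]
counts = Counter(domains)
print(counts.most_common())  # [('gmail.com', 2), ('yahoo.com', 1)]
```

Returning None for bad input keeps the counting step free of special cases.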
You were very close. There is no need to split the file into lines: just use re.findall with re.MULTILINE and a pattern.
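A rough sketch of the re.findall idea follows. The pattern @([\w.-]+)$ is my own assumption, since no pattern is given here, and the inline string simply stands in for the contents of mails.txt.

```python
import re
from collections import Counter

# Inline sample standing in for the contents of mails.txt (assumption).
text = """best@yahoo.com
hello@gmail.com
everybody@gmail.com
bye@gmail.com
day@yahoo.com
table.blue@gmail.com
life@yahoo.com
"""

# With re.MULTILINE, "$" matches at the end of every line, so findall
# returns one domain per address line and skips blank lines.
# The pattern itself is an assumption, not taken from the original answer.
domains = re.findall(r"@([\w.-]+)$", text, flags=re.MULTILINE)
counts = Counter(domains)

# Print in the "@domain count" shape the question asks for.
for domain, n in counts.most_common():
    print(f"@{domain} {n}")
```

Because findall never returns None, the AttributeError from the question cannot occur on this path.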
Do two splits: the first on newlines, the second on @. Then append the last item of each split to a list and apply Counter to that list.
import re
from collections import Counter
with open("mails.txt", "r") as f:
    texte = f.read().split('\n')
domains = []
for line in texte:
    line = line.split('@')
    if line[-1] != "":
        domains.append(line[-1])
mails_value = Counter(domains).most_common()
print(mails_value)
[('gmail.com',4),('yahoo.com',3)]
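You don't need a regular expression. If you can trust that every input is a well-formed email address, this is enough:
from collections import defaultdict
domain_count = defaultdict(lambda: 0)
with open("mails.txt", "r") as f:
    texte = f.readlines()
for line in texte:
    line = line.strip()  # readlines() keeps the trailing newline
    if line:  # skip blank lines
        domain = line.split('@')[-1]
        domain_count[domain] += 1
print(dict(domain_count))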
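A regular expression saves you from creating unnecessary lists:
import re
from collections import Counter
# The lookbehind/lookahead capture only the text between '@' and the
# first dot, e.g. 'yahoo' rather than 'yahoo.com'.
p = re.compile(r"(?<=@)[^.]+(?=\.)")
with open("mails.txt", "r") as f:
    texte = f.read().split('\n')
l = []
for line in texte:
    newline = p.search(line)
    if newline:  # lines without a match give None
        l.append(newline.group(0))
print(Counter(l))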
You are not counting the extracted matches. Collect them first, then find the most common ones. See [\w.] vs. ...
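Applied to the script in the question, "collect first, then count" could look roughly like this. The function name count_domains and the sample list are illustrative assumptions; the regex is the one from the question, written as a raw string.

```python
import re
from collections import Counter

def count_domains(lines):
    """Collect every '@domain' match first, then count the whole list once."""
    found = []
    for line in lines:
        match = re.search(r"@[\w.]+", line)
        if match is not None:  # blank or malformed lines give None, skip them
            found.append(match.group())
    return Counter(found)

# Illustrative sample (assumption); the trailing "" mimics the empty string
# that f.read().split('\n') leaves after a final newline.
sample = ["best@yahoo.com", "hello@gmail.com", "bye@gmail.com", ""]
print(count_domains(sample).most_common())
# [('@gmail.com', 2), ('@yahoo.com', 1)]
```

The None check is what avoids the AttributeError, and calling Counter on the list rather than on a single string is what stops it from counting individual characters.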
defaultdict(lambda: 0) can be written as defaultdict(int).
Counter, as used in the OP, is better suited to this task anyway.
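For comparison, a small sketch of both approaches on a made-up list of domains (the list is an assumption for illustration). They end up with the same counts, but Counter also gives you most_common for free.

```python
from collections import Counter, defaultdict

# Illustrative input (assumption).
domains = ["yahoo.com", "gmail.com", "gmail.com"]

# Manual counting: int() returns 0, so missing keys start at zero.
by_hand = defaultdict(int)
for d in domains:
    by_hand[d] += 1

# Counter does the same counting in one call.
by_counter = Counter(domains)

print(dict(by_hand) == dict(by_counter))  # True
print(by_counter.most_common(1))  # [('gmail.com', 2)]
```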