Python: extract and count domain names from email addresses

I have a list of emails and want to extract only the domain names, then count how many times each one occurs:

Emails:

best@yahoo.com

hello@gmail.com

everybody@gmail.com

bye@gmail.com

day@yahoo.com

table.blue@gmail.com

life@yahoo.com

Script:

import re
from collections import Counter

with open("mails.txt", "r") as f:
    texte = f.read().split('\n')

    for line in texte:
        newline = re.search("@[\w.]+", line)
        newmail = newline.group()

        mails_value = Counter(newmail).most_common()

        print (mails_value)
Output:

[('g',1),('g',1),('6',1),('5',1),('f',1),('r',1)]

Traceback (most recent call last):
  File "counting.py", line 10, in <module>
    newmail = newline.group()
AttributeError: 'NoneType' object has no attribute 'group'

Desired output:

@yahoo.com 3

@gmail.com 4

You can use split:

texte = "life@yahoo.com"
texte.split("@")
['life', 'yahoo.com']

You are very close. There is no need to split the file into lines; just use re.findall with re.MULTILINE and a suitable pattern.
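
The re.findall suggestion above can be sketched as follows. This is only an illustration: the pattern is an assumption (the comment does not give one), and the text variable is a stand-in for the contents of mails.txt, assumed to hold one address per line.

```python
import re
from collections import Counter

# Stand-in for the contents of mails.txt (assumed: one address per line).
text = """best@yahoo.com
hello@gmail.com
everybody@gmail.com
bye@gmail.com
day@yahoo.com
table.blue@gmail.com
life@yahoo.com
"""

# Assumed pattern: "@" followed by word characters, dots, or hyphens,
# running up to the end of a line. re.MULTILINE makes "$" match at the
# end of every line instead of only at the end of the whole string.
DOMAIN_RE = re.compile(r"@[\w.-]+$", re.MULTILINE)

# findall scans the whole text at once, so blank lines simply produce
# no match and never reach a .group() call.
domains = DOMAIN_RE.findall(text)
counts = Counter(domains)

for domain, n in counts.most_common():
    print(domain, n)
```

Because the domains are collected into one list before Counter sees them, the tally is per domain rather than per character.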


Do two splits: first on newlines, then on @. Append the last item of each split to a list and apply Counter to that list.

import re
from collections import Counter

with open("mails.txt", "r") as f:
    texte = f.read().split('\n')

    domains = []

    for line in texte:
        line = line.split('@')
        if line[-1] != "":
            domains.append(line[-1])

mails_value = Counter(domains).most_common()

print(mails_value)   

[('gmail.com',4),('yahoo.com',3)]
You don't need a regex. If you can trust that every input line is a well-formed email address, this is enough:

from collections import defaultdict

domain_count = defaultdict(lambda: 0)

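with open("mails.txt", "r") as f:
    texte = f.readlines()

    for line in texte:
        # readlines() keeps the trailing newline; strip it so that
        # "gmail.com\n" and "gmail.com" are not counted as different keys.
        line = line.strip()
        if not line:
            continue  # skip blank lines
        domain = line.split('@')[-1]
        domain_count[domain] += 1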

print (domain_count)

A regex saves you from creating unnecessary lists:

import re
from collections import Counter

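# Compile once, outside the loop. The lookarounds keep only the part
# between "@" and the next dot, e.g. "gmail" rather than "gmail.com".
p = re.compile(r"(?<=@)[^.]+(?=\.)")
l = []

with open("mails.txt", "r") as f:
    for line in f.read().split('\n'):
        newline = p.search(line)
        if newline:  # blank lines give no match
            l.append(newline.group(0))

print(Counter(l))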

You are not counting the extracted matches. Collect them first, then find the most common ones.
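
The point about collecting matches first can be sketched like this. It reuses the pattern from the question; the lines list is a hypothetical in-memory stand-in for mails.txt that includes a blank line, the kind of input that returns None from re.search and triggers the AttributeError shown in the question.

```python
import re
from collections import Counter

# Hypothetical stand-in for the lines of mails.txt, including a blank line.
lines = ["best@yahoo.com", "hello@gmail.com", "", "day@yahoo.com"]

pattern = re.compile(r"@[\w.]+")

found = []
for line in lines:
    match = pattern.search(line)
    if match is None:
        # Blank or malformed lines yield no match; skipping them
        # avoids calling .group() on None.
        continue
    found.append(match.group())

# Counter receives a list of whole domains. Passing a single string such
# as "@gmail.com" would count its individual characters instead.
mails_value = Counter(found).most_common()
print(mails_value)  # [('@yahoo.com', 2), ('@gmail.com', 1)]
```

The counting happens once, after the loop, which is the main difference from the script in the question.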
[\w.]
defaultdict(lambda: 0) can be written as defaultdict(int). Counter, as used in the OP, is better suited to this task anyway.
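
As a small illustration of that last comment, the sketch below tallies the same hypothetical list of already-extracted domains with defaultdict(int) and with Counter. The sample data is made up for the example.

```python
from collections import Counter, defaultdict

# Hypothetical sample of already-extracted domains.
domains = ["yahoo.com", "gmail.com", "gmail.com", "gmail.com", "yahoo.com"]

# defaultdict(int): int() returns 0, so a missing key starts at zero,
# exactly like defaultdict(lambda: 0).
domain_count = defaultdict(int)
for d in domains:
    domain_count[d] += 1

# Counter builds the same tally in one call and adds most_common().
counter = Counter(domains)

print(dict(domain_count) == dict(counter))  # True
print(counter.most_common(1))               # [('gmail.com', 3)]
```

Both produce the same counts; Counter mainly saves the explicit loop and provides sorting by frequency.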