Python: Parsing emails from a variety of different text files


python regex email


Here are 3 snippets from 3 different emails:

1)
Subject: NEFS 11 and 12 fish for lease

Greetings,

NEFS 11 has the following fish for lease:
up to 4,000 lbs live wt GOM cod @ 1.40 lbs
NEFS 12 has the following fish for lease:
2,000 lbs American plaice @ .45 lbs

Please let me know if you're interested in either,


2)
Subject: NEFS 11 fish for lease

2,000 lbs Grey sole @ 1.20 or best offer
1,000 lbs dabs @ .55 or best offer

thanks,


3)
Subject: NEFS 11 fish for lease

-GOM Cod up to 5,000 lbs (live wt) @ 1.40 lbs
-American Plaice 2,000 lbs      .60 lbs or best offer

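My question is: what is the most efficient way to parse the sector (NEFS 11, 12), species (GOM cod, grey sole), pounds (4,000 lbs, 2,000 lbs), and price (1.40/lb, 0.55/lb) information out of these emails?

My first thought was to use regular expressions, but I'm not sure that is the best approach, because my code currently captures too much. For example, when I grab the weight data I also grab the price data, because both sit next to "lbs". When I try to capture the sector data, I capture the entire subject line.

Here is a piece of my code that parses the species data from the emails: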

import os
import re

# path is the directory that holds the email text files
for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r') as f:
            sector_result = []
            pattern = re.compile("Available Quota | CC Yellowtail Flounder | GOM Yellowtail Flounder | GB Cod East | GB Cod West | GB Haddock East | GB Haddock West | GB Winter Flounder | GB Yellowtail Flounder | GOM Cod | GOM Haddock | GOM Winter Flounder | Plaice | Pollock | Redfish | SNE Winter Flounder | ME Winter Flounder | SNE Yellowtail Flounder | ME Yellowtail Flounder | White Hake | Witch Flounder", re.IGNORECASE)
            for linenum, line in enumerate(f):
                if pattern.search(line) != None:
                    sector_result.append((linenum, line.rstrip('\n')))
                    for linenum, line in sector_result:
                        print ("Fish Species:", line)

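I search for every species that might appear in the emails. Ideally (for example 3) I would produce "Fish Species: GOM Cod, American Plaice", but what I actually get is:

Fish Species: -American Plaice 2,000 lbs      .60 lbs or best offer

I'm no expert with regular expressions, so I'm hoping you can help me fix my regex code, or suggest another way to parse these emails and the many more like them. Thanks, everyone.

An additional email: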
NEFS 5 has the following fish available for lease/trade:

GB EAST cod: 954 lbs @ $0.83
GB EAST cod: 1,046 lbs to trade for 1,830 lbs GB WEST cod
GB blackback: 30,000 lbs @ $0.07
GOM blackback: 800 lbs @ $0.03
white hake: 6,322 lbs @ $0.13
pollock: 22,000 lbs @ $0.015
redfish: 14,000 lbs @ $0.015
GB yt: 1,873 lbs @ $1.13
GB yt: 5,127 lbs to trade for 10,254 lbs SNE yt

To get just the different fish types:
with open(file_path, 'r') as f:
    pattern = re.compile(r"Available Quota|CC Yellowtail Flounder|GOM Yellowtail Flounder|GB Cod East|GB Cod West|GB Haddock East|GB Haddock West|GB Winter Flounder|GB Yellowtail Flounder|GOM Cod|GOM Haddock|GOM Winter Flounder|Plaice|Pollock|Redfish|SNE Winter Flounder|ME Winter Flounder|SNE Yellowtail Flounder|ME Yellowtail Flounder|White Hake|Witch Flounder", re.IGNORECASE)
    email = f.read()
    fish_types = pattern.findall(email)
    if fish_types:
        print("Fish Species:", " ".join(fish_types))
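
The difference from the pattern in the question is the whitespace around each `|`: inside a regular expression those spaces are literal characters, so an alternative such as " GOM Cod " only matches when a space appears on both sides. A minimal sketch of the effect, using a shortened species list chosen here purely for illustration:

```python
import re

line = "-GOM Cod up to 5,000 lbs (live wt) @ 1.40 lbs"

# Spaces around "|" are part of each alternative, so the middle one is
# literally " GOM Cod " (space before and after).
spaced = re.compile("GOM Haddock | GOM Cod | Plaice", re.IGNORECASE)

# Without the spaces, each alternative is just the bare species name.
tight = re.compile(r"GOM Haddock|GOM Cod|Plaice", re.IGNORECASE)

spaced_hits = spaced.findall(line)
tight_hits = tight.findall(line)

print(spaced_hits)  # [] because "-GOM Cod" has a dash, not a space, before it
print(tight_hits)   # ['GOM Cod']
```

Because `findall` returns only the matched text rather than the whole line, it also avoids printing the weight and price that share the line with the species name.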

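Comments:

Your question seems to need more information for an ideal solution. For example 3, what exactly would you want the output to be? Also, if there is no standardization of any kind in the emails, you are looking at a machine learning problem rather than a regex one.

As I said above, for email #3 the ideal output would be: Fish Species: GOM Cod, American Plaice. You're also right that there is no standardization; these emails are each written and sent by individual people. Does that mean it is impossible to parse the desired data out of hundreds of emails?

To avoid machine learning, I think the best bet is to identify the most common ways the information is presented and use regular expressions on those. You will then get most of the data, but not all of it.

This works almost perfectly, and I think I know why it isn't perfect. It only printed "Fish Species: Plaice". Does it fail to recognize "GOM Cod" and "American Plaice" because they are adjacent to a dash in the email, as in "-GOM Cod" instead of "GOM Cod"?

Most likely it is because of the spaces in the regex. Let me fix that.

The regex was trying to match " GOM Cod " (with surrounding spaces) rather than just "GOM Cod". That makes sense. I really don't want to flood you with questions, but just quickly: until now, when I tried to capture the pounds, for example, I would regex-search for "lbs|lb" and then get stuck. I have been capturing the whole line, which is obviously too much extra material. I know non-capturing groups exist, but I'm not sure whether they can be used here. Is there a way to capture just the number directly before the regex match, as in "2,000 lbs"?

If you are very confident the weights will always look like that, it may be simpler to use re.compile(r'\d+,\d+'), which matches every string made up of digits, then a comma, then digits.

Wow, that did successfully capture the correct 5,000 and 2,000 quantities. So should the same approach be used to get the prices, except with the comma replaced by a period for values like "1.20 lbs" and "0.55 lbs"?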
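
One way to pull the number out without the rest of the line is a capturing group placed directly before "lbs", plus a separate pattern for prices. The following is only a sketch under stated assumptions, not a general parser: it assumes weights never contain a decimal point, prices always do, and sector numbers follow the word "NEFS". The names `WEIGHT`, `PRICE`, `SECTOR`, `parse_line`, and `parse_sectors` are hypothetical helpers introduced for this illustration.

```python
import re

# Assumption: a weight is digits (with optional thousands commas) directly before "lbs".
# The lookbehind keeps the match from starting inside a price such as "1.40 lbs".
WEIGHT = re.compile(r"(?<![\d.,])(\d[\d,]*)\s*lbs", re.IGNORECASE)

# Assumption: a price always contains a decimal point (1.40, .45, $0.015).
PRICE = re.compile(r"\$?(\d*\.\d+)")

# Sector numbers after "NEFS", e.g. "NEFS 11" or "NEFS 11 and 12".
SECTOR = re.compile(r"NEFS\s+(\d+(?:\s+and\s+\d+)*)", re.IGNORECASE)


def parse_line(line):
    """Return (weight_lbs, price) for one offer line; either may be None."""
    weight = WEIGHT.search(line)
    price = PRICE.search(line)
    return (
        int(weight.group(1).replace(",", "")) if weight else None,
        float(price.group(1)) if price else None,
    )


def parse_sectors(subject):
    """Return the sector numbers found after "NEFS" as a list of ints."""
    match = SECTOR.search(subject)
    return [int(n) for n in re.findall(r"\d+", match.group(1))] if match else []


print(parse_line("up to 4,000 lbs live wt GOM cod @ 1.40 lbs"))
print(parse_line("-American Plaice 2,000 lbs      .60 lbs or best offer"))
print(parse_sectors("Subject: NEFS 11 and 12 fish for lease"))
```

Lines that break these assumptions, such as the "to trade for" offers that carry two weights and no price, would still need their own handling; here `parse_line` simply returns the first weight and `None` for the price.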