Python: stripping unwanted characters that are breaking readline()

python regex email quoted-printable

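I'm writing a small script to run through large folders of copyright notice emails and find the relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes the IP and timestamp are on different lines, sometimes on the same line, sometimes in different places, timestamps come in 4 different formats, etc.).

I've run into a strange problem: some of the files I'm parsing spit out odd characters in the middle of a line, which breaks my parsing of what readline() returns. When read in a text editor the line in question looks normal, but readline() reads an '=' and two '\n' characters right in the middle of the IP.

Any idea how I can deal with this? I don't really have control over whatever is causing it; I just need to handle it without things getting too crazy.

Relevant function, for reference (I know it's messy):

Solved: if anyone else has a similar problem, save each line as a string, join them together, and then re.sub() the characters out, remembering both the \r and \n characters. My solution is a bit spaghetti, but it avoids running unnecessary regexes on every file: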

import codecs
import re

IP_RE = re.compile(r'[0-9]+(?:\.[0-9]+){3}')

def getIP(em):
    # The with-block closes the file on every return path; the original
    # ce.close() calls sat after return statements and never ran.
    with codecs.open(em, encoding='latin1') as ce:
        iplabel = ""
        while "Torrent Hash Value: " not in iplabel:
            iplabel = ce.readline()
            if not iplabel:                     # end of file: label never found
                return "No IP found in: " + em

        ipraw = ce.readline()
        if "File Size" in ipraw:
            ipraw = ce.readline()

        ip = IP_RE.findall(ipraw)
        if ip:
            return ip[0]

        ipraw2 = ce.readline()                  # made this a new var
        ip = IP_RE.findall(ipraw2)
        if ip:
            return ip[0]

        # Added this section: join both lines, then drop the "=" + line ending.
        ipraw = ipraw + ipraw2
        ipraw = re.sub(r'=\r*\n', '', ipraw)
        ip = IP_RE.findall(ipraw)
        if ip:
            return ip[0]
        return "No IP found in: " + ipraw

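It looks like at least some of the emails you are processing have been encoded as quoted-printable.

This encoding is used to make 8-bit character data transportable over 7-bit (ASCII-only) systems, but it also enforces a maximum line length of 76 characters. Longer lines are wrapped by inserting a soft line break, which consists of an "=" followed by the end-of-line marker.

Python provides the quopri module, which handles encoding and decoding of quoted-printable data. Decoding quoted-printable data removes these soft line breaks.

As an example, let's use the first paragraph of your question.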

>>> import quopri
>>> s = """I'm writing a small script to run through large folders of copyright notice emails and finding relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes IP and TS are on different lines, sometimes on same, sometimes in different places, timestamps come in 4 different formats, etc.)."""

>>> # Encode to latin-1 as quopri deals with bytes, not strings.
>>> bs = s.encode('latin-1')

>>> # Encode
>>> encoded = quopri.encodestring(bs)
>>> # Observe the "=\n" inserted into the text.
>>> encoded
b"I'm writing a small script to run through large folders of copyright notice=\n emails and finding relevant information (IP and timestamp). I've already f=\nound ways around a few little formatting hurdles (sometimes IP and TS are o=\nn different lines, sometimes on same, sometimes in different places, timest=\namps come in 4 different formats, etc.)."

>>> # Printing without decoding from quoted-printable shows the "=".
>>> print(encoded.decode('latin-1'))
I'm writing a small script to run through large folders of copyright notice=
 emails and finding relevant information (IP and timestamp). I've already f=
ound ways around a few little formatting hurdles (sometimes IP and TS are o=
n different lines, sometimes on same, sometimes in different places, timest=
amps come in 4 different formats, etc.).

>>> # Decode from quoted-printable to remove soft line breaks.
>>> print(quopri.decodestring(encoded).decode('latin-1'))
I'm writing a small script to run through large folders of copyright notice emails and finding relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes IP and TS are on different lines, sometimes on same, sometimes in different places, timestamps come in 4 different formats, etc.).
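
Applied to the situation in the question, a minimal sketch might look like the following. The sample line and the address (taken from the documentation range 203.0.113.0/24) are made up for illustration, and `find_ip` is a hypothetical helper name; only the regular expression comes from the question.

```python
import quopri
import re

IP_RE = re.compile(r'[0-9]+(?:\.[0-9]+){3}')

# Made-up sample: an address interrupted by a soft line break ("=" + CRLF).
raw = b"IP Address: 203.0.1=\r\n13.45\r\n"

def find_ip(data):
    """Decode quoted-printable bytes, then return the first IP-like match or None."""
    text = quopri.decodestring(data).decode('latin-1')
    match = IP_RE.search(text)
    return match.group(0) if match else None

# Without decoding, the pattern cannot match across the soft line break.
print(IP_RE.search(raw.decode('latin-1')))
# After decoding, the two halves of the address are joined again.
print(find_ip(raw))
```

Decoding first means the regular expression never has to know about "=", "\r" or "\n" at all.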

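To decode correctly, the whole message body has to be processed, which conflicts with your readline-based approach. One way around this is to load the decoded string into a buffer: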

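import io
import quopri

def getIP(em):
    # Read raw bytes: quopri works on bytes, not str.
    with open(em, 'rb') as f:
        bs = f.read()
    # Decoding removes the soft line breaks ("=" + end of line).
    decoded = quopri.decodestring(bs).decode('latin-1')

    # StringIO gives the decoded text a file-like readline() interface.
    ce = io.StringIO(decoded)
    iplabel = ""
    while "Torrent Hash Value: " not in iplabel:
        iplabel = ce.readline()
        ...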

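If your files contain complete emails (including headers), then using the tools in the email module will handle this decoding automatically: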

import email
from email import policy

with open('message.eml') as f:
    s = f.read()
msg = email.message_from_string(s, policy=policy.default)
body = msg.get_content()
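
As a slightly fuller sketch of the same idea, the message can be parsed from bytes and the plain-text part picked out with `get_body()`, which also covers multipart messages. The message below is invented for illustration (sender, subject, hash value and address are all made up), and `body_text` is a hypothetical helper name.

```python
import email
import re
from email import policy

IP_RE = re.compile(r'[0-9]+(?:\.[0-9]+){3}')

# Invented sample message; the IP is split by a soft line break.
RAW_MESSAGE = (
    b"From: notices@example.com\r\n"
    b"Subject: Notice\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Torrent Hash Value: abc123\r\n"
    b"IP Address: 203.0.1=\r\n"
    b"13.45\r\n"
)

def body_text(raw_bytes):
    """Return the decoded plain-text body, or '' if there is none."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    part = msg.get_body(preferencelist=('plain',))
    return part.get_content() if part is not None else ""

text = body_text(RAW_MESSAGE)
match = IP_RE.search(text)
print(match.group(0) if match else "No IP found")
```

Parsing from bytes leaves the choice of character set to the message's own Content-Type header instead of the platform default used by a text-mode `open()`.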

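Are you sure there is only a single = character before the two \n? Do other IPs have other characters, and possibly more than one? If all you have is =\n\n, you can account for it by writing the regex for the IP with (?:=\n*)? before the last IP part .xxx.

The problem is that I only apply the regex after reading a line into a string, and the newline characters split the string apart. My first instinct was to read 3 lines, concatenate them, and then run the regex, but that would be a fairly large extra load if done on every run of the script, and it would be very convoluted code if I just plugged it in as another else: at the end, since I need to save the line position and go back to it if the "normal" search doesn't work.

If your data is split across multiple lines, I suggest always working on a string that combines at least two lines: at each step read one more line, discard the first, merge the second with the next, and iterate that way. Otherwise, capturing the correct pattern will be difficult.

In the end I just had to save the previously read lines, combine them, and then use re.sub to remove (=\r*\n), and then it worked (it turns out there was also a \r character between the = and the \n, which was confusing for a minute). Thanks for the help.

If you've solved the problem, please add it as an answer and accept it, rather than putting the solution in the question.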
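
The line-joining idea discussed in these comments can be sketched as a small generator that glues soft-wrapped lines back together before any regex runs. This is only an illustration under assumptions: `logical_lines` is a hypothetical helper name, the sample lines are made up, and it handles soft line breaks only (not `=XX` escapes, which a full quoted-printable decoder would also translate).

```python
import re

# "=" followed by an optional "\r" and a "\n" at the end of a line.
SOFT_BREAK = re.compile(r'=\r*\n$')

def logical_lines(lines):
    """Yield lines with quoted-printable soft line breaks joined."""
    pending = ""
    for line in lines:
        if SOFT_BREAK.search(line):
            # Line continues: drop the soft break and keep accumulating.
            pending += SOFT_BREAK.sub("", line)
        else:
            yield pending + line
            pending = ""
    if pending:
        # Input ended right after a soft break.
        yield pending

# Made-up sample lines, as readline() might return them.
sample = [
    "Torrent Hash Value: abc123\r\n",
    "IP Address: 203.0.1=\r\n",
    "13.45\r\n",
    "File Size: 1 MB\r\n",
]
for line in logical_lines(sample):
    print(repr(line))
```

Because a file object is itself an iterable of lines, the same generator could wrap an open file so the rest of a line-by-line parser stays unchanged.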
