Removing URLs, blank lines, and Unicode characters in Python

python regex unicode

I need to use Python to remove URLs, blank lines, and lines containing Unicode characters from a large text file (500 MiB).

This is my file:

https://removethis1.com
http://removethis2.com foobar1
http://removethis3.com foobar2
foobar3 http://removethis4.com
www.removethis5.com


foobar4 www.removethis6.com foobar5
foobar6 foobar7
foobar8 www.removethis7.com

After applying the regex, it should look like this:

foobar1
foobar2
foobar3 
foobar4 foobar5
foobar6 foobar7
foobar8

The code I have is:

    file = open(file_path, encoding="utf8")
    self.rawFile = file.read()
    rep = re.compile(r"""
                        http[s]?://.*?\s 
                        |www.*?\s  
                        |(\n){2,}  
                        """, re.X)
    self.processedFile = rep.sub('', self.rawFile)

But the output is not correct:

foobar3 foobar4 foobar5
foobar6 foobar7
foobar8 www.removethis7.com

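I also need to remove every line that contains at least one non-ASCII character, but I could not come up with a regex for that task.

This will remove all links: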

(?:http|www).*?(?=\s|$)

Explanation

(?:            # non-capturing group
    http|www   # match "http" OR "www"
)
.*?            # lazily match anything until...
(?=            # ...a positive lookahead succeeds:
    \s|$       # the next character is whitespace, or we are at the end of the line
)
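To see what this pattern does in practice, here is a minimal sketch that applies it with `re.sub`. The function name `strip_links` and the sample string are made up for illustration. Note that the lookahead does not consume the whitespace next to the link, so stray spaces are left behind and still have to be trimmed.

```python
import re

# Pattern from the answer above. The lookahead (?=\s|$) stops the match
# right before the whitespace (or end of string) without consuming it.
URL = re.compile(r"(?:http|www).*?(?=\s|$)")


def strip_links(text: str) -> str:
    """Remove every link; whitespace around the link is left untouched."""
    return URL.sub("", text)


# Hypothetical sample built from a few lines of the question's file.
sample = (
    "http://removethis2.com foobar1\n"
    "foobar3 http://removethis4.com\n"
    "foobar8 www.removethis7.com"
)

# repr() makes the leftover leading/trailing spaces visible.
print(repr(strip_links(sample)))
```

Because the pattern is not anchored on a word boundary, it would also start matching in the middle of a word that happens to contain "http" or "www", which may or may not matter for your data.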

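Replace the whitespace left next to each link (\s) with a newline (\n), and then strip all the blank lines afterwards.

Alternatively, depending on how closely you want the result to match your sample text, you can use: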

( +)?\b(?:http|www)[^\s]*(?(1)|( +)?)|\n{2,}

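This bit of magic looks for leading spaces and captures them if there are any. It then looks for the http or www part, followed by [^\s]* (which I used instead of simply \S* in case you want to add more characters to exclude). After that, it uses a regex conditional to check whether any leading spaces were captured. If not, it tries to capture any trailing spaces instead, so that you do not remove too much from between foobar4 www.removethis6.com foobar5. Otherwise, it looks for two or more consecutive newlines.

If you replace every match with nothing (an empty string), it gives the same output you asked for.

Now, this regex is fairly rigid and there are probably plenty of edge cases where it will not work. It works for the OP's sample, but if it needs to be more flexible you may need to provide more details.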
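As a rough check of the conditional pattern, the sketch below runs it over a shortened copy of the sample input using Python's `re` module, which supports the `(?(1)...|...)` conditional syntax. The function name `clean` and the trimmed sample are assumptions made for this example.

```python
import re

# Pattern from the answer above: optional leading spaces (group 1), the link,
# then trailing spaces only if group 1 did not match; or 2+ newlines.
PATTERN = re.compile(r"( +)?\b(?:http|www)[^\s]*(?(1)|( +)?)|\n{2,}")


def clean(text: str) -> str:
    """Replace every match of PATTERN with an empty string."""
    return PATTERN.sub("", text)


# Shortened version of the question's file (hypothetical test data).
sample = (
    "https://removethis1.com\n"
    "http://removethis2.com foobar1\n"
    "foobar3 http://removethis4.com\n"
    "www.removethis5.com\n"
    "\n"
    "\n"
    "foobar4 www.removethis6.com foobar5\n"
    "foobar8 www.removethis7.com\n"
)

print(clean(sample))
```

One edge case worth knowing about: a line that holds nothing but a link and is followed by a single newline leaves an empty line behind, because `\n{2,}` only matches runs of newlines that already exist in the input. With this sample that happens on the first line, so a final pass that drops empty lines may still be needed.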

You can try encoding each line to ASCII to catch the non-ASCII lines, which I think is what you want:

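import re

rep = re.compile(r"""
                    https?://\S*\s?   # URL with a scheme, plus one trailing whitespace
                    |www\.\S*\s?      # bare www. URL, plus one trailing whitespace
                    """, re.X)

with open("test.txt", encoding="utf-8") as f:
    for line in f:
        try:
            # Raises UnicodeEncodeError if the line has any non-ASCII character
            line.encode("ascii")
        except UnicodeEncodeError:
            continue
        # Remove every URL on the line (not just the first match), then trim
        line = rep.sub("", line).strip()
        if line:
            print(line)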

Input:

https://removethis1.com
http://removethis2.com foobar1
http://removethis3.com foobar2
foobar3 http://removethis4.com
www.removethis5.com

1234 ā
5678 字
foobar4 www.removethis6.com foobar5
foobar6 foobar7
foobar8 www.removethis7.com

Output:

foobar1
foobar2
foobar3
foobar4 foobar5
foobar6 foobar7
foobar8

Or use a regex to match any non-ASCII character:

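import re

rep = re.compile(r"""
                    https?://\S*\s?   # URL with a scheme, plus one trailing whitespace
                    |www\.\S*\s?      # bare www. URL, plus one trailing whitespace
                    """, re.X)
# Any character outside the 7-bit ASCII range
non_asc = re.compile(r"[^\x00-\x7F]")

with open("test.txt", encoding="utf-8") as f:
    for line in f:
        # Skip lines that contain at least one non-ASCII character
        if non_asc.search(line):
            continue
        # Remove every URL on the line, then trim surrounding whitespace
        line = rep.sub("", line).strip()
        if line:
            print(line)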

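The output is the same as above. You cannot combine the two regexes, because one of them drops the whole line whenever it matches, while the other only replaces the part of the line that matched.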
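Since the result has to end up in a separate file and the input is large, one option is to stream line by line and write as you go, so that memory use stays flat. This is only a sketch under a few assumptions: the names `clean_lines` and `clean_file`, the URL pattern, and the file paths are all illustrative, and `str.isascii()` needs Python 3.7 or newer.

```python
import re

# Assumed URL pattern: scheme or "www." prefix, the rest of the token,
# and at most one trailing whitespace character.
URL = re.compile(r"(?:https?://|www\.)\S*\s?")


def clean_lines(lines):
    """Yield cleaned lines: skip non-ASCII lines, strip URLs, drop blanks."""
    for line in lines:
        if not line.isascii():  # str.isascii() requires Python 3.7+
            continue
        line = URL.sub("", line).strip()
        if line:
            yield line


def clean_file(src_path, dst_path):
    """Stream src_path through clean_lines and write the result to dst_path."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="ascii", newline="\n") as dst:
        for line in clean_lines(src):
            dst.write(line + "\n")


# Example call with hypothetical file names:
# clean_file("input.txt", "output.txt")
```

Keeping the filtering in a generator separates the cleaning logic from the file handling, which makes it easy to try on a short list of strings before running it over the full 500 MiB file.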

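Don't do everything at once; process the file line by line.

@PadraicCunningham I tried that, but it was very slow.

Do you want to change the contents of the original file, or create a new file?

I need to save the output in another file.

What exactly do you mean by Unicode characters? Every character (even an ASCII one) is part of Unicode.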