How do I remove duplicates from the output of a Python script?

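I'm using a regex search to filter a text file for .js names. It gives me about 16 results, some of which are duplicates. I want to remove the duplicates from that output and either print it to the console or redirect it to a file. I tried using set and dict.fromkeys, with no success! Here is what I have at the moment. Thanks in advance: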

#!/usr/bin/python

import re
import sys

pattern = re.compile("[^/]*\.js")

for i, line in enumerate(open('access_log.txt')):
    for match in re.findall(pattern, line):
        x = str(match)
        print x

Why didn't set work? What went wrong there? Have you tried the following:

import re
import sys

pattern = re.compile("[^/]*\.js")
results = set()

for i, line in enumerate(open('access_log.txt')):
    for match in re.findall(pattern, line):
        results.add(str(match))
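
The snippet above only fills `results`; nothing is printed yet. A minimal sketch of the missing output step follows. It assumes Python 3 syntax (a single-argument `print(name)` behaves the same on Python 2.7) and uses a hypothetical in-memory list, `sample_lines`, in place of `access_log.txt`:

```python
import re

pattern = re.compile(r"[^/]*\.js")

# Hypothetical stand-in for the lines of access_log.txt.
sample_lines = [
    "GET /static/app.js HTTP/1.1\n",
    "GET /static/lib.js HTTP/1.1\n",
    "GET /static/app.js HTTP/1.1\n",
]

results = set()
for line in sample_lines:
    for match in pattern.findall(line):
        results.add(match)  # adding an existing value is a no-op

# A set has no defined order, so sort before printing for stable output.
unique_sorted = sorted(results)
for name in unique_sorted:
    print(name)
```

Sorting is a design choice here, not a requirement; it just makes the output reproducible from run to run.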

Use a set to eliminate the duplicates:

#!/usr/bin/python

import re

pattern = re.compile("[^/]*\.js")

matches = set()
with open('access_log.txt') as f:
    for line in f:
        for match in re.findall(pattern, line):
            # match is already a string, so no str() conversion is needed
            if match not in matches:
                print(match)
                matches.add(match)
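
Since the question also asks about redirecting the output to a file, here is a hedged sketch of the same first-seen-order idea that writes to any file-like object. The function name `write_unique_matches`, the `out` parameter, and the sample lines are hypothetical, and the code assumes Python 3:

```python
import io
import re
import sys

pattern = re.compile(r"[^/]*\.js")

def write_unique_matches(lines, out=sys.stdout):
    """Write each distinct match once, in first-seen order; return the count."""
    seen = set()
    for line in lines:
        for match in pattern.findall(line):
            if match not in seen:
                seen.add(match)
                out.write(match + "\n")
    return len(seen)

# Hypothetical sample input; an in-memory buffer stands in for a real file.
sample_lines = ["/js/b.js\n", "/js/a.js\n", "/js/b.js\n"]
buffer = io.StringIO()
count = write_unique_matches(sample_lines, out=buffer)
written = buffer.getvalue()
```

To write to disk instead, the same function could be called inside `with open('unique_js.txt', 'w') as out:` (the file name is only an example), or left at its default to print to the console.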

But I question your regex:

[^/]*\.js

You are doing a findall on each line, which suggests that a line may contain several hits, for example:

file1.js file2.js file3.js

But with your regex:

[^/]*\.js

[^/]* matches greedily, so it returns only a single match: the whole line.

If you make the match non-greedy, i.e. [^/]*?, you get 3 matches:

'file1.js'
' file2.js'
' file3.js'
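
The difference between the two quantifiers can be checked directly on the sample line. This is a small sketch assuming Python 3:

```python
import re

line = "file1.js file2.js file3.js"

# Greedy: [^/]* grabs the whole line, then backs off only to the last ".js".
greedy = re.findall(r"[^/]*\.js", line)

# Non-greedy: [^/]*? stops at the first ".js" it can reach each time.
lazy = re.findall(r"[^/]*?\.js", line)

print(greedy)  # ['file1.js file2.js file3.js']
print(lazy)    # ['file1.js', ' file2.js', ' file3.js']
```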

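But this highlights another potential issue. For these particular cases, do you really want the leading spaces in the second and third matches? Perhaps in the case of /abc/ def.js you would want to keep the leading space that follows /abc/.

So I would suggest: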

#!/usr/bin/python

import re

pattern = re.compile(r"""
    (?:             # first alternative:
        (?<=/)      # positive lookbehind assertion: preceded by '/'
        [^/]*?      # matches non-greedily 0 or more non-'/' characters
    |               # second alternative:
        (?<!/)      # negative lookbehind assertion: not preceded by '/'
        [^/\s]*?    # matches non-greedily 0 or more characters that are neither '/' nor whitespace
    )
    \.js            # matches '.js'
    """, re.VERBOSE)  # verbose mode as a flag; an inline (?x) must come first in the pattern on Python 3.11+

matches = set()
with open('access_log.txt') as f:
    for line in f:
        for match in pattern.findall(line):
            if match not in matches:
                print(match)
                matches.add(match)
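
To see what the suggested pattern does with the two cases discussed above, here is a short check. It assumes Python 3 and passes `re.VERBOSE` as a flag; the variable names `spaced` and `after_slash` are only illustrative:

```python
import re

pattern = re.compile(r"""
    (?:
        (?<=/)      # preceded by '/'
        [^/]*?      # lazily, any characters except '/'
    |
        (?<!/)      # not preceded by '/'
        [^/\s]*?    # lazily, any characters except '/' and whitespace
    )
    \.js
    """, re.VERBOSE)

# Space-separated names: the leading spaces are no longer part of the matches.
spaced = pattern.findall("file1.js file2.js file3.js")

# A name directly after '/': the leading space is kept.
after_slash = pattern.findall("/abc/ def.js")

print(spaced)       # ['file1.js', 'file2.js', 'file3.js']
print(after_slash)  # [' def.js']
```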

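Also, you are using Python 2, which is at end of life. If you can, move to Python 3.

Welcome to SO! SO is not a code-writing service, so please post your best attempt, even if it doesn't work.

By the way, Python 2 reaches end of life in January, so unless you need it for work or something similar, stop learning it and learn Python 3 instead. Python 3 is much better.

Just put the matches in a list, then deduplicate the list.

I'm restricted to Python 2. Unfortunately, I'm working on a VM created by someone else, with an explicit request to use Python 2.

Using a set is the right approach. For adding values to a collection and testing whether a value is already in it, a set is the right data structure. One more piece of advice: when reading a file, you usually want the with open(...) pattern. It ensures the file is closed when you are done with it, even if an error occurs.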
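
The point about `with open(...)` closing the file even when an error occurs can be illustrated with a throwaway file. This is a sketch assuming Python 3; the temporary file and its contents are made up for the demonstration:

```python
import os
import tempfile

# Create a small temporary file to read back (hypothetical content).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as tmp:
    tmp.write("/js/a.js\n/js/a.js\n")

# Normal case: the file is closed as soon as the with block ends.
with open(path) as f:
    lines = f.readlines()
closed_after = f.closed

# Error case: the file is still closed when the block raises.
try:
    with open(path) as g:
        raise ValueError("simulated failure while reading")
except ValueError:
    pass
closed_after_error = g.closed

os.remove(path)
print(closed_after, closed_after_error)
```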