Comparing lists in Python

python python-3.x

I have a text file that looks like this:

0010000110
1111010111
0000110111

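I want to import these lines into Python as lists, then compare each element of a list with the corresponding element of another list, and do this for every combination of lists. If both elements are 1, a counter is increased by 1, and at the end the counter is divided by the length of the list. I tried to write the code, but it does not work properly: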

with open("D:/test/Vector.txt", "r") as f1:
   for a in f1:
      with open("D:/test/Vector.txt", "r") as f2:
         for b in f2:
            for i in range(10):
                result = 0;
                counter = 0;
                if int(a[i]) == int(b[i]) == 1:
                    counter = counter+1
            result = counter / 10;
            print(a, b, result)

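Edit: when I create the text file with Python, each entry ends with \n so that the next one starts on a new line, but I don't know how to remove it.

Expected output: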

0010000110 0010000110 1
0010000110 1111010111 0.3
0010000110 0000110111 0.2
1111010111 0010000110 0.3
1111010111 1111010111 1
1111010111 0000110111 0.4
0000110111 0010000110 0.2
0000110111 1111010111 0.4
0000110111 0000110111 1

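Before counting anything, check whether the two strings are equal: the expected output reports 1 for identical strings, so that case is handled separately. Here is a basic solution to your problem that produces the expected output: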

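# Read every line of the file, dropping the trailing newline
with open("Vector.txt", 'r') as f:
    l1 = [s.strip('\n') for s in f]

l2 = [s for s in l1]  # copy of l1, used for the inner loop

for a in l1:
    for b in l2:
        if a == b:
            # Identical strings are reported as 1, as in the expected output
            result = 1
        else:
            # Count the positions where both strings have a 1
            counter = 0
            for i in range(len(a)):
                if int(a[i]) == int(b[i]) == 1:
                    counter += 1
            result = counter / len(a)
        print(a, b, result)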

This runs fine in Python 3 and gives the following result:

0010000110 0010000110 1
0010000110 1111010111 0.3
0010000110 0000110111 0.2
1111010111 0010000110 0.3
1111010111 1111010111 1
1111010111 0000110111 0.4
0000110111 0010000110 0.2
0000110111 1111010111 0.4
0000110111 0000110111 1

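Edit: you don't have to use two lists. You can use a single list (called l below) and iterate over it twice. If you prefer to work with indices, you can loop over index ranges instead of over the strings themselves: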

for a in range(0, len(l)):
   for b in range(0, len(l)):

To access the individual characters of a string by index, you can then do the following:

for i in range(len(l[a])):
    if (int(l[a][i]) == int(l[b][i]) == 1):
        counter += 1

The final statement then becomes:

print((a + 1), (b + 1), result)
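
Put together, the index-based fragments above might look like the following sketch. It assumes the strings are already in a list named l (hard-coded here instead of read from a file) and that every string has the same length; the function name compare_by_index is only illustrative.

```python
def compare_by_index(l):
    """Return (row, column, ratio) for every ordered pair, using 1-based indices."""
    rows = []
    for a in range(len(l)):
        for b in range(len(l)):
            counter = 0
            for i in range(len(l[a])):
                # Count positions where both strings hold a 1
                if int(l[a][i]) == int(l[b][i]) == 1:
                    counter += 1
            result = counter / len(l[a])
            rows.append((a + 1, b + 1, result))
    return rows


l = ['0010000110', '1111010111', '0000110111']  # sample data from the question
for row in compare_by_index(l):
    print(*row)
```

Unlike the first solution, this sketch does not special-case identical strings, so a string compared with itself yields its own share of 1s (0.3 for the first sample string) rather than 1.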

To get rid of the tedious string handling, you can refer to …

Edit:

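To address the efficiency question you raised in the comments, here is a solution that uses threads and does fewer comparisons: each pair of strings is compared only once, whereas before every string was compared with every other string in both orders (pure quadratic complexity). This solution assumes that all strings in a file have the same length, and it does not compare files against each other. If that is not your situation, I am sure you can work out a solution from this basic approach.

Each comparison is then written to a file named sourcefile_compared.txt, with the values on each line separated by commas. Because we work with files and start several threads, the algorithm makes heavy use of exception handling. Since I don't know your server, I suggest you try this on your own machine first and set the file paths yourself.

If you want something close to linear complexity, you will have to make some choices, because what you are actually asking for is a computation of every string relative to every other one.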

import os
import threading


class ListComparator(threading.Thread):

    def __init__(self, file):

        threading.Thread.__init__(self)
        self.file = file
        self._stopevent = threading.Event()

    def run(self):
        name, extension = os.path.splitext(self.file)

        if (extension == '.txt'):

            print('comparing strings in file ' + name)

            try:
                # Use the path stored on this thread, not the global loop variable
                with open(self.file, 'r') as f:
                    l = [s.strip('\n') for s in f]

            except Exception:
                print('unable to open file ' + self.file)
                l = None

            if (l != None):

                try :

                    target = open(name + '_compared.txt', 'w')

                except Exception as e:
                    print(e)
                    target = None

                if (target != None):
                    for i in range(0, len(l) - 1):
                        for j in range(i + 1, len(l)):
                            result = 0
                            counter = 0

                            for k in range(len(l[i])):
                                if (int(l[i][k]) == int(l[j][k]) == 1):
                                    counter += 1

                            result = counter / len(l[i])
                            s = l[i] + ', ' + l[j] + ', ' + str(result) + '\n'

                            target.write(s)

                    target.close()

                    print(name + ' compared')
                else:
                    print(name + ' not compared')

    def stop(self):
        self._stopevent.set()


current_dir = os.getcwd()

for subdir, dirs, files in os.walk(current_dir):

    for file in files:

        try :
            comp = ListComparator(file)
            comp.start()

        except Exception as e:
            print(e)

Here is the console output:

comparing strings in file v
comparing strings in file Vector
Vector compared
v compared

Here is the data written to Vector_compared.txt:

0010000110, 1111010111, 0.3
0010000110, 0000110111, 0.2
1111010111, 0000110111, 0.4
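
For the larger inputs mentioned in the comments, one possible speed-up (a suggestion added here, not part of the threaded solution above) is to parse each line as a binary integer once and let a bitwise AND find the positions where both strings have a 1. The sketch below visits each unordered pair once, like the threaded version; both_ones_ratio and the hard-coded lines list are illustrative names, and all lines are assumed to have the same length.

```python
from itertools import combinations


def both_ones_ratio(x, y, length):
    """Share of positions where both bit patterns have a 1."""
    # x & y keeps only the bits set in both numbers; count them
    return bin(x & y).count('1') / length


lines = ['0010000110', '1111010111', '0000110111']  # sample data
length = len(lines[0])                 # assumes all lines are equally long
numbers = [int(s, 2) for s in lines]   # parse each line once as base 2

results = []
for (s1, n1), (s2, n2) in combinations(zip(lines, numbers), 2):
    results.append((s1, s2, both_ones_ratio(n1, n2, length)))

for s1, s2, ratio in results:
    print(s1 + ', ' + s2 + ', ' + str(ratio))
```

The number of comparisons is still n(n-1)/2, but each comparison becomes a single integer operation instead of a character-by-character loop.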

Use the str.strip method to remove the whitespace from the strings.

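If you have multiple sequences and want to do something with the corresponding elements of each one, you can use zip to aggregate those elements together. (Note that zip returns an iterator, so list is used in the examples to display its result.)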
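
As a small illustration with two hypothetical sequences (the names first and second are not from the question):

```python
first = 'abc'
second = [1, 2, 3]

# zip pairs up the elements at the same position in each sequence
pairs = list(zip(first, second))
print(pairs)  # [('a', 1), ('b', 2), ('c', 3)]
```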

If the sequences are inside a container, you need to unpack it with * to use it with zip:

>>> a
['0010000110', '1111010111']
>>> list(zip(*a))
[('0', '1'), ('0', '1'), ('1', '1'), ('0', '1'), ('0', '0'), ('0', '1'), ('0', '0'), ('1', '1'), ('1', '1'), ('0', '1')]
>>>

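Once the corresponding elements are grouped together, they are easy to work with: you can pass them to a function or, in your case, simply compare them: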

>>> [x == y for x,y in zip(*a)]
[False, False, True, False, True, False, True, True, True, False]
>>>

sum will consume an iterator/iterable and count all the True values, because True has a value of 1 and False has a value of zero:

>>> sum(x == y for x,y in zip(*a))
5
>>>

By the way: you can assign the result of zip to a name and use that. It can make things easier to read:

>>> my_groups = zip(*a)
>>> my_groups
<zip object at 0x000000000308A9C8>
>>> sum(x == y for x,y in my_groups)
5
>>>

Using zip, sum, and itertools, you can write something that does what you want:

>>> import itertools
>>> data = ['0010000110', '1111010111', '0000110111']
>>> for combination in itertools.combinations_with_replacement(data, 2):
    print(combination, sum(x == y for x,y in zip(*combination)))


('0010000110', '0010000110') 10
('0010000110', '1111010111') 5
('0010000110', '0000110111') 6
('1111010111', '1111010111') 10
('1111010111', '0000110111') 5
('0000110111', '0000110111') 10
>>> 

>>> for a,b in itertools.combinations_with_replacement(data, 2):
    total = sum(x == y for x,y in zip(a, b))
    ratio = total / len(a)
    print(a, b, total, ratio)


0010000110 0010000110 10 1.0
0010000110 1111010111 5 0.5
0010000110 0000110111 6 0.6
1111010111 1111010111 10 1.0
1111010111 0000110111 5 0.5
0000110111 0000110111 10 1.0
>>>
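
The examples above count positions where the two characters are equal. If you want to count only positions where both are 1, as described in the question, a sketch along the same lines could be the following; the helper name both_ones is an assumption for illustration.

```python
import itertools

data = ['0010000110', '1111010111', '0000110111']


def both_ones(a, b):
    """Count the positions where both strings contain the character '1'."""
    return sum(x == y == '1' for x, y in zip(a, b))


ratios = {}
for a, b in itertools.combinations_with_replacement(data, 2):
    ratios[(a, b)] = both_ones(a, b) / len(a)
    print(a, b, ratios[(a, b)])
```

With this rule, a string compared with itself gives the fraction of 1s it contains (for example 0.3 for 0010000110), not 1 as in the expected output in the question.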

I like to format what I print using str.format:

>>> s = 'combination: {} {}\ttotal: {}\tratio: {}'
>>> for a,b in itertools.combinations_with_replacement(data, 2):
    total = sum(x == y for x,y in zip(a, b))
    ratio = total / len(a)
    print(s.format(a, b, total, ratio))


combination: 0010000110 0010000110  total: 10   ratio: 1.0
combination: 0010000110 1111010111  total: 5    ratio: 0.5
combination: 0010000110 0000110111  total: 6    ratio: 0.6
combination: 1111010111 1111010111  total: 10   ratio: 1.0
combination: 1111010111 0000110111  total: 5    ratio: 0.5
combination: 0000110111 0000110111  total: 10   ratio: 1.0
>>>

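I have been using generator expressions and list comprehensions; they are a concise way to write for loops, and many people prefer them once they are used to them (as long as they don't get too complicated):

>>> data = [1,2,3]
>>> for x in data:
    print(x+2)

3
4
5

This can be written in shorthand as a list comprehension:

>>> [x + 2 for x in data]
[3, 4, 5]
>>>

What is the expected output for your example file?

You can iterate with zip(a, b), writing for suba, subb in zip(a, b). That way you don't have to deal with a[i]; you can just use suba. Also, semicolons are not needed at the end of a line. In Python, either a newline or a semicolon means "next statement". Using both is redundant.

In the first example, how did you get 1? You said to increment by 1 if both are 1. Shouldn't it be 0.3?

You reset the counter to zero before you ever actually use it. The line counter = 0 is inside the inner loop, while the line result = counter / 10 is outside it. I suspect you want counter = 0 outside the for i in range(10) loop.

To get rid of the unwanted newlines in the output, use print(a.rstrip('\n'), b.rstrip('\n'), result). (Or strip them at any earlier point, since the rest of the logic doesn't need them either.) Note that some_string.rstrip(…) does not modify some_string. (str is immutable.) Instead, it returns the stripped string.

Thank you very much, this is exactly what I needed. Is it possible to optimize the computation? I have 4723 lists with 10000 entities each. Computing the result takes a very long time; maybe another approach would be better? Would it speed things up if I saved directly to a text file instead of printing?

Do you need to compare every string, even when they are identical? If not, you can build a queue to avoid that and end up with a complexity that is an arithmetic series: n-1 + n-2 + … + 1. If you don't need to compare every file with every other file, maybe you can use multithreading. Otherwise it is possible too, but you have to protect the files when opening and closing them. And you are right that printing the results has a cost, but saving data to the hard drive (SSD, PCI bus, central memory, etc.) may be slower. That's an interesting question :)

The server I'm running the tests on uses an SSD. The server has 24 cores, but only one of them is doing the computation. What I can't understand is this: if the file is 45 MB, then after importing it into Python and comparing every element it should grow dramatically and be held in memory during the computation, rather than RAM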
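
Taken together, the comments suggest resetting the counter once per pair, stripping the newlines, and iterating with zip. A minimal sketch of the question's loop with those changes follows; io.StringIO stands in for the real file (an assumption so the example is self-contained), and the lines are read once instead of reopening the file for every row.

```python
import io

# Stand-in for open("D:/test/Vector.txt", "r"); assumption for illustration
f1 = io.StringIO("0010000110\n1111010111\n0000110111\n")

# rstrip returns a new string; the original line is left unchanged
vectors = [line.rstrip('\n') for line in f1]

results = []
for a in vectors:
    for b in vectors:
        counter = 0  # reset once per pair, before the inner loop
        for suba, subb in zip(a, b):
            if suba == subb == '1':
                counter += 1
        result = counter / len(a)
        results.append((a, b, result))
        print(a, b, result)
```

Following the rule as stated in the question, identical strings give their own share of 1s here; to report 1 for them as in the expected output, the a == b case would need to be handled separately.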