Python: best way to compare two strings and record statistics on the serial position of a particular item?

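I'm working with two files whose lines look like this:

This is an example

One of the files contains the line above. The corresponding line in the other file is the same, but may have the "||" items in different positions:

This is an example

I just need to collect how often a "||" in the second file falls in the "correct" position (we assume the first file is always "correct"), how often a "||" falls in a position where the first file has no "||", and the difference in the overall number of "||" markers for that particular line.

I know I could do this on my own, but I was wondering whether you brilliant people know some incredibly easy way to do it. The basics (like reading in the files) are all things I'm familiar with. I'm really just looking for advice on how to do the actual comparison of the lines and collect the statistics.

Best,
Georgina

Is this what you're looking for?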

This code assumes every line has the same format as the one in the example.

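fileOne = open('theCorrectFile', 'r')
fileTwo = open('theSecondFile', 'r')

for correctLine in fileOne:
    otherLine = fileTwo.readline()
    # Split each line once instead of re-splitting on every access.
    correctChunks = correctLine.rstrip('\n').split("||")
    otherChunks = otherLine.rstrip('\n').split("||")
    # Reset the counters for every pair of lines.
    count = 0
    wrongPlacement = 0
    for i in range(len(correctChunks)):
        # A chunk counts as correct when the other line has the same chunk at index i.
        if len(otherChunks) >= i + 1 and correctChunks[i] == otherChunks[i]:
            count += 1
        else:
            wrongPlacement += 1
    # Report once per line; the parenthesized form works in Python 2 and 3.
    print('there are %d out of %d "||" in the correct places and %d in the wrong places'
          % (count, len(correctChunks), wrongPlacement))

fileOne.close()
fileTwo.close()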

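I'm not sure how easy this is, since it does use some more advanced concepts such as generators, but it is at least robust and well documented. The actual code is at the bottom and is fairly concise.

The basic idea is that the function iter_delim_sets returns an iterator of tuples. Each tuple contains the line number, the set of indices at which the delimiter was found in the "expected" string, and a similar set for the "actual" string. One such tuple is generated for each (expected, actual) pair of lines. These tuples are formalized concisely as a collections.namedtuple type called DelimLocations.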
The function analyze then simply returns higher-level information based on one such tuple, stored in a DelimAnalysis namedtuple. This is done using basic set algebra.
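
As a quick illustration of that set algebra on a single pair of lines, here is a minimal sketch written for Python 3. The helper name delimiter_positions is made up for this example, and the two strings are borrowed from the test data in the docstring that follows.

```python
import re

DELIMITER = "||"  # the marker from the question


def delimiter_positions(line, delimiter=DELIMITER):
    """Return the set of indices at which ``delimiter`` starts in ``line``.

    Hypothetical helper for illustration only.
    """
    return {m.start() for m in re.finditer(re.escape(delimiter), line)}


expected = delimiter_positions("one||fish||two||fish")
actual = delimiter_positions("red||fish||blue||fish")

correct = len(expected & actual)          # markers at positions both lines share
incorrect = len(actual - expected)        # markers only the second line has
count_diff = len(actual) - len(expected)  # signed difference in marker count

print(correct, incorrect, count_diff)  # 2 1 0
```

Intersection gives the markers in the "correct" places, set difference gives the ones that landed where the first line has none, and the difference in set sizes gives the difference in marker counts.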

"""Compare two sequences of strings.

Test data:
>>> from pprint import pprint
>>> delimiter = '||'
>>> expected = (
...     delimiter.join(("one", "fish", "two", "fish")),
...     delimiter.join(("red", "fish", "blue", "fish")),
...     delimiter.join(("I do not like them", "Sam I am")),
...     delimiter.join(("I do not like green eggs and ham.",)))
>>> actual = (
...     delimiter.join(("red", "fish", "blue", "fish")),
...     delimiter.join(("one", "fish", "two", "fish")),
...     delimiter.join(("I do not like spam", "Sam I am")),
...     delimiter.join(("I do not like", "green eggs and ham.")))

The results:
>>> pprint([analyze(v) for v in iter_delim_sets(delimiter, expected, actual)])
[DelimAnalysis(index=0, correct=2, incorrect=1, count_diff=0),
 DelimAnalysis(index=1, correct=2, incorrect=1, count_diff=0),
 DelimAnalysis(index=2, correct=1, incorrect=0, count_diff=0),
 DelimAnalysis(index=3, correct=0, incorrect=1, count_diff=1)]

What they mean:
>>> pprint(delim_analysis_doc)
(('index',
  ('The number of the lines from expected and actual',
   'used to perform this analysis.')),
 ('correct',
  ('The number of delimiter placements in ``actual``',
   'which were correctly placed.')),
 ('incorrect', ('The number of incorrect delimiters in ``actual``.',)),
 ('count_diff',
  ('The difference between the number of delimiters',
   'in ``expected`` and ``actual`` for this line.')))

And a trace of the processing stages:
>>> def dump_it(it):
...     '''Wraps an iterator in code that dumps its values to stdout.'''
...     for v in it:
...         print v
...         yield v

>>> for v in iter_delim_sets(delimiter,
...                          dump_it(expected), dump_it(actual)):
...     print v
...     print analyze(v)
...     print '======'
one||fish||two||fish
red||fish||blue||fish
DelimLocations(index=0, expected=set([9, 3, 14]), actual=set([9, 3, 15]))
DelimAnalysis(index=0, correct=2, incorrect=1, count_diff=0)
======
red||fish||blue||fish
one||fish||two||fish
DelimLocations(index=1, expected=set([9, 3, 15]), actual=set([9, 3, 14]))
DelimAnalysis(index=1, correct=2, incorrect=1, count_diff=0)
======
I do not like them||Sam I am
I do not like spam||Sam I am
DelimLocations(index=2, expected=set([18]), actual=set([18]))
DelimAnalysis(index=2, correct=1, incorrect=0, count_diff=0)
======
I do not like green eggs and ham.
I do not like||green eggs and ham.
DelimLocations(index=3, expected=set([]), actual=set([13]))
DelimAnalysis(index=3, correct=0, incorrect=1, count_diff=1)
======
"""
from collections import namedtuple


# Data types

## Here ``expected`` and ``actual`` are sets
DelimLocations = namedtuple('DelimLocations', 'index expected actual')

DelimAnalysis = namedtuple('DelimAnalysis',
                           'index correct incorrect count_diff')
## Explanation of the elements of DelimAnalysis.
## There's no real convenient way to add a docstring to a variable.
delim_analysis_doc = (
    ('index', ("The number of the lines from expected and actual",
               "used to perform this analysis.")),
    ('correct', ("The number of delimiter placements in ``actual``",
                 "which were correctly placed.")),
    ('incorrect', ("The number of incorrect delimiters in ``actual``.",)),
    ('count_diff', ("The difference between the number of delimiters",
                    "in ``expected`` and ``actual`` for this line.")))


# Actual functionality

def iter_delim_sets(delimiter, expected, actual):
    """Yields a DelimLocations tuple for each pair of strings.

    ``expected`` and ``actual`` are sequences of strings.
    """
    from re import escape, compile as compile_
    from itertools import count, izip
    index = count()

    re = compile_(escape(delimiter))
    def delimiter_locations(string):
        """Set of the locations of matches of ``re`` in ``string``."""
        return set(match.start() for match in re.finditer(string))

    string_pairs = izip(expected, actual)

    return (DelimLocations(index=index.next(),
                           expected=delimiter_locations(e),
                           actual=delimiter_locations(a))
            for e, a in string_pairs)

def analyze(locations):
    """Returns an analysis of a DelimLocations tuple.

    ``locations.expected`` and ``locations.actual`` are sets.
    """
    return DelimAnalysis(
        index=locations.index,
        correct=len(locations.expected & locations.actual),
        incorrect=len(locations.actual - locations.expected),
        count_diff=(len(locations.actual) - len(locations.expected)))
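
If only file-wide totals are needed rather than one result per line, the same idea could be sketched for Python 3 roughly as follows. The names total_stats and compare_files are hypothetical and not part of the code above, and the sketch assumes both files list their lines in the same order so that zip pairs them correctly.

```python
import re


def delimiter_positions(line, delimiter="||"):
    """Return the set of indices at which ``delimiter`` starts in ``line``."""
    return {m.start() for m in re.finditer(re.escape(delimiter), line)}


def total_stats(expected_lines, actual_lines, delimiter="||"):
    """Sum the per-line statistics over paired lines (hypothetical helper).

    Both arguments may be any iterables of strings, including open files.
    """
    totals = {"correct": 0, "incorrect": 0, "count_diff": 0}
    for exp_line, act_line in zip(expected_lines, actual_lines):
        exp = delimiter_positions(exp_line.rstrip("\n"), delimiter)
        act = delimiter_positions(act_line.rstrip("\n"), delimiter)
        totals["correct"] += len(exp & act)
        totals["incorrect"] += len(act - exp)
        # Signed sum: surpluses and deficits on different lines cancel out.
        totals["count_diff"] += len(act) - len(exp)
    return totals


def compare_files(expected_path, actual_path, delimiter="||"):
    """Open both files and total the statistics over all line pairs."""
    with open(expected_path) as exp_file, open(actual_path) as act_file:
        return total_stats(exp_file, act_file, delimiter)
```

Because file objects iterate over their lines lazily, compare_files never loads either file into memory as a whole.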

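How would I count the number of times the other line has a "||" marker in a place where the first line has none (rather than just counting the times it has a marker in the same position)?

Wow! You were fast. Nice work.

@Sebastian: You're absolutely right! I don't know how I missed that. Fixing it now.

@Georgina: I don't know why @Sebastian wanted to store the parts of the line in a new variable (chunks), but my guess is that he wanted to reduce the number of function calls for efficiency.

As additional information, you should provide the exact output you expect for this example line. It isn't clear when a "||" counts as correct (when all the surrounding words are equal? only when the previous/next word is equal?).

As for "you brilliant people", you certainly know how to flatter a programmer's ego.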