Jaccard index between timestamps in Python

python machine-learning

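I have UNIX timestamps that I converted to strings, plus a given input of time strings, and I need to compute the Jaccard index between the two. The data is stored as time intervals in two-dimensional arrays: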

unix_converted = [['00:00:00', '00:00:03'], ['00:00:03', '00:00:06'], ['00:00:12', '00:00:15']]
input_timestamps = [['00:00:00', '00:00:03'], ['00:00:03', '00:00:06'], ['00:00:06', '00:00:09']]

def jaccard_index(s1, s2):
    raise NotImplementedError

Do I have to convert these time intervals to datetime objects, or is there a simpler way? And how do I compute the index itself?

You can use Python's native support for sets to compute your Jaccard index:

unix_converted = [['00:00:00', '00:00:03'], ['00:00:03', '00:00:06'], ['00:00:12', '00:00:15']]
input_timestamps = [['00:00:00', '00:00:03'], ['00:00:03', '00:00:06'], ['00:00:06', '00:00:09']]

def jaccard_index(s1, s2):
    # Turn each [start, end] pair into one hashable string such as '00:00:00-00:00:03'
    s1 = {'-'.join(each) for each in s1}
    s2 = {'-'.join(each) for each in s2}
    return len(s1.intersection(s2)) / len(s1.union(s2))

print(jaccard_index(unix_converted, input_timestamps))  # outputs 0.5

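Edit: I am assuming that by Jaccard index you mean Jaccard similarity, i.e. the size of the intersection of the given lists divided by the size of their union.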
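For reference, here is the usual set definition together with the arithmetic behind the 0.5 result. The two lists share two intervals, and four distinct intervals appear overall:

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
\qquad
J(\text{unix\_converted}, \text{input\_timestamps}) = \frac{2}{4} = 0.5
```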

This code computes the Jaccard similarity when the timestamps do not necessarily fall on the same intervals. Its time complexity is O(len(s1)^2 + len(s2)^2).

unix_converted = [(1, 3), (6, 10), (11, 12)]
input_timestamps = [(1, 3), (4, 7)]


def jaccard_index(s1, s2):

    def _set_sum(start1, end1, start2, end2):
        """ returns sum if there is an overlap and None otherwise """
        if start2 <= start1 <= end2:
            return start2, max(end1, end2)
        if start1 <= start2 <= end1:
            return start1, max(end1, end2)
        return None  # separate sets

    def _set_intersection(start1, end1, start2, end2):
        """ returns intersection if there is an overlap and None otherwise """
        if start2 <= start1 <= end2:
            return start1, min(end1, end2)
        if start1 <= start2 <= end1:
            return start2, min(end1, end2)
        return None  # separate sets

    # Calculate A u B
    sum = []
    for x, y in s1 + s2:
        matched_elem = False
        for i, (x2, y2) in enumerate(sum):
            set_sum = _set_sum(x, y, x2, y2)
            if set_sum is not None:
                sum[i] = set_sum
                matched_elem = True
                break
        if not matched_elem:
            sum.append((x, y))

    # join overlapping timestamps
    element_is_joined = [False for _ in sum]
    for i, (x, y) in enumerate(sum):
        if not element_is_joined[i]:
            for j, (x2, y2) in enumerate(sum):
                if element_is_joined[j] or i == j:
                    continue
                set_sum = _set_sum(x, y, x2, y2)
                if set_sum is not None:  # overlap is found
                    sum[j] = set_sum
                    element_is_joined[i] = True
                    break

    sum_ = 0
    for (x, y), is_joined in zip(sum, element_is_joined):
        if not is_joined:
            sum_ += y - x

    if sum_ == 0:
        raise ValueError('Division by zero')

    # calculate A ^ B
    intersection = 0
    for x, y in s1:
        for x2, y2 in s2:
            set_intersection = _set_intersection(x, y, x2, y2)
            if set_intersection is not None:
                intersection += set_intersection[1] - set_intersection[0]

    return intersection / sum_


print(jaccard_index(unix_converted, input_timestamps)) #outputs 0.333333
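The function above takes numeric tuples, while the question stores 'HH:MM:SS' strings. Below is a minimal sketch of one way to bridge the two. It swaps in a different technique (sort and merge, then inclusion-exclusion on covered length) rather than reproducing the code above. The names to_seconds, merge_intervals, total_length, interval_jaccard and rows_to_seconds are hypothetical helpers introduced for illustration, and the sketch assumes every interval lies within a single day.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(':'))
    return hours * 3600 + minutes * 60 + seconds


def rows_to_seconds(rows):
    """Convert [['HH:MM:SS', 'HH:MM:SS'], ...] to [(start, end), ...] in seconds."""
    return [(to_seconds(start), to_seconds(end)) for start, end in rows]


def merge_intervals(intervals):
    """Sort the intervals and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def total_length(intervals):
    """Total length covered by the intervals, counting overlaps once."""
    return sum(end - start for start, end in merge_intervals(intervals))


def interval_jaccard(s1, s2):
    """Jaccard similarity of two interval lists, measured by covered length."""
    union = total_length(list(s1) + list(s2))
    if union == 0:
        raise ValueError('both interval lists are empty or have zero length')
    # Inclusion-exclusion: |A n B| = |A| + |B| - |A u B|
    intersection = total_length(s1) + total_length(s2) - union
    return intersection / union


# Same numeric data as the answer above: intersection 3, union 9
print(interval_jaccard([(1, 3), (6, 10), (11, 12)], [(1, 3), (4, 7)]))

# The string data from the question, converted to seconds first
unix_converted = [['00:00:00', '00:00:03'], ['00:00:03', '00:00:06'], ['00:00:12', '00:00:15']]
input_timestamps = [['00:00:00', '00:00:03'], ['00:00:03', '00:00:06'], ['00:00:06', '00:00:09']]
print(interval_jaccard(rows_to_seconds(unix_converted), rows_to_seconds(input_timestamps)))  # 0.5
```

Sorting first means each list is merged in a single pass, so the sketch runs in O(n log n) for n intervals in total. Because the lengths are computed after merging, overlaps inside a single list are counted only once.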

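Please consider explaining what the Jaccard index is, and actually showing an attempt at solving the problem.

Between which two lists do you want to compute the Jaccard index? I assume the arguments of the function jaccard_index(s1, s2) are expected to be unix_converted and input_timestamps, eh? Please also provide the expected output.

Note: this calculation works if the timestamps in s1 and s2 use the same intervals.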