Jaccard index between timestamps in Python


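I have UNIX timestamps converted to strings, and a given input of time strings, and I need to get the Jaccard index between the two. The data below is stored as time intervals in two-dimensional arrays: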

unix_converted = [['00:00:00', '00:00:03'], ['00:00:03', '00:00:06'], ['00:00:12', '00:00:15']]
input_timestamps = [['00:00:00', '00:00:03'], ['00:00:03', '00:00:06'], ['00:00:06', '00:00:09']]

def jaccard_index(s1, s2):
    raise NotImplementedError


Do I have to convert these time intervals to datetime objects, or is there a simpler way? And how do I get the index itself?

You can take advantage of Python's native support for sets to compute your Jaccard index:

unix_converted = [['00:00:00', '00:00:03'], ['00:00:03', '00:00:06'], ['00:00:12', '00:00:15']]
input_timestamps = [['00:00:00', '00:00:03'], ['00:00:03', '00:00:06'], ['00:00:06', '00:00:09']]

def jaccard_index(s1, s2):
    # Join each [start, end] pair into one hashable string such as '00:00:00-00:00:03'
    s1 = {'-'.join(each) for each in s1}
    s2 = {'-'.join(each) for each in s2}
    return len(s1.intersection(s2)) / len(s1.union(s2))

print(jaccard_index(unix_converted, input_timestamps))  # outputs 0.5

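Edit: I am assuming that by Jaccard index you mean the Jaccard similarity, i.e. the size of the intersection of the given lists divided by the size of their union.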
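As a small variation on the same idea, the sketch below turns each interval into a tuple instead of a joined string and guards against an empty union. The name `jaccard_similarity` is made up for this illustration and is not part of the original answer. Like the code above, it only counts intervals that match exactly.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two lists of [start, end] pairs, compared as whole items."""
    # Lists are not hashable, so convert each pair to a tuple before building the sets.
    set_a = {tuple(item) for item in a}
    set_b = {tuple(item) for item in b}
    union = set_a | set_b
    if not union:
        raise ValueError('Both inputs are empty, the similarity is undefined')
    return len(set_a & set_b) / len(union)


unix_converted = [['00:00:00', '00:00:03'], ['00:00:03', '00:00:06'], ['00:00:12', '00:00:15']]
input_timestamps = [['00:00:00', '00:00:03'], ['00:00:03', '00:00:06'], ['00:00:06', '00:00:09']]

print(jaccard_similarity(unix_converted, input_timestamps))  # 2 shared intervals out of 4 distinct ones
```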

This code computes the Jaccard similarity when the timestamps are not necessarily measured on the same intervals. Its time complexity is O(len(s1)^2 + len(s2)^2).

unix_converted = [(1, 3), (6, 10), (11, 12)]
input_timestamps = [(1, 3), (4, 7)]


def jaccard_index(s1, s2):

    def _set_sum(start1, end1, start2, end2):
        """ returns sum if there is an overlap and None otherwise """
        if start2 <= start1 <= end2:
            return start2, max(end1, end2)
        if start1 <= start2 <= end1:
            return start1, max(end1, end2)
        return None  # separate sets

    def _set_intersection(start1, end1, start2, end2):
        """ returns intersection if there is an overlap and None otherwise """
        if start2 <= start1 <= end2:
            return start1, min(end1, end2)
        if start1 <= start2 <= end1:
            return start2, min(end1, end2)
        return None  # separate sets

    # Calculate A u B
    sum = []
    for x, y in s1 + s2:
        matched_elem = False
        for i, (x2, y2) in enumerate(sum):
            set_sum = _set_sum(x, y, x2, y2)
            if set_sum is not None:
                sum[i] = set_sum
                matched_elem = True
                break
        if not matched_elem:
            sum.append((x, y))

    # join overlapping timestamps
    element_is_joined = [False for _ in sum]
    for i, (x, y) in enumerate(sum):
        if not element_is_joined[i]:
            for j, (x2, y2) in enumerate(sum):
                if element_is_joined[j] or i == j:
                    continue
                set_sum = _set_sum(x, y, x2, y2)
                if set_sum is not None:  # overlap is found
                    sum[j] = set_sum
                    element_is_joined[i] = True
                    break

    sum_ = 0
    for (x, y), is_joined in zip(sum, element_is_joined):
        if not is_joined:
            sum_ += y - x

    if sum_ == 0:
        raise ValueError('Division by zero')

    # calculate A ^ B
    intersection = 0
    for x, y in s1:
        for x2, y2 in s2:
            set_intersection = _set_intersection(x, y, x2, y2)
            if set_intersection is not None:
                intersection += set_intersection[1] - set_intersection[0]

    return intersection / sum_


print(jaccard_index(unix_converted, input_timestamps)) #outputs 0.333333
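To apply the interval-based idea to the 'HH:MM:SS' strings from the question, one option is to convert each string to seconds first. The sketch below does that and, as an alternative to the nested loops above, merges overlapping intervals after sorting them. The helper names (`to_seconds`, `merge_intervals`, `total_length`, `interval_jaccard`) are made up for this illustration, and it assumes every string has exactly three colon-separated integer fields and that each interval has start <= end.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to an integer number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(':'))
    return hours * 3600 + minutes * 60 + seconds


def merge_intervals(intervals):
    """Sort intervals by start and merge those that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the last merged interval instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals):
    """Total covered length of a list of disjoint intervals."""
    return sum(end - start for start, end in intervals)


def interval_jaccard(s1, s2):
    """Covered length of the intersection divided by covered length of the union."""
    a = merge_intervals(s1)
    b = merge_intervals(s2)
    union = total_length(merge_intervals(a + b))
    if union == 0:
        raise ValueError('Union has zero length, the similarity is undefined')
    # a and b are each disjoint after merging, so |A n B| = |A| + |B| - |A u B|.
    intersection = total_length(a) + total_length(b) - union
    return intersection / union


unix_converted = [['00:00:00', '00:00:03'], ['00:00:03', '00:00:06'], ['00:00:12', '00:00:15']]
input_timestamps = [['00:00:00', '00:00:03'], ['00:00:03', '00:00:06'], ['00:00:06', '00:00:09']]

as_seconds_1 = [(to_seconds(start), to_seconds(end)) for start, end in unix_converted]
as_seconds_2 = [(to_seconds(start), to_seconds(end)) for start, end in input_timestamps]

print(interval_jaccard(as_seconds_1, as_seconds_2))  # 6 shared seconds out of 12 covered
```

Sorting dominates the cost here, so this runs in O(n log n) for n intervals in total. Whether touching intervals such as 00:00:03 to 00:00:06 and 00:00:06 to 00:00:09 should be merged is a design choice, but it does not change the measured lengths.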
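Please consider explaining what the Jaccard index is, and actually providing an attempt at solving the problem.

I assume you want to compute the Jaccard index between two lists, i.e. the arguments of the function jaccard_index(s1, s2). Are the expected arguments unix_converted and input_timestamps? Please also provide the expected output.

Note: this calculation works if the timestamps in s1 and s2 use the same intervals.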