Algorithm: Best algorithm for comparing huge amounts of data

algorithm performance python-3.x

I have a large CSV dataset (334 MB) that looks like this:

month, output
1,"['23482394','4358309','098903284'....(total 2.5 million entries)]"
2,"['92438545','23482394','323103404'....(total 2.2 million entries)]"
3,"[...continue

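Now I need to compute, for each month, the percentage of its output that overlaps with the previous month's output.

For example, when I compare month 1 and month 2, I want a result such as "Month 2's output has a 90% overlap with month 1", then "Month 3 has an 88% overlap with month 2".

What is the best way to solve this in Python 3?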

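You can use the set intersection method to extract the elements common to two arrays or lists. Set intersection runs in O(min(len(a), len(b))) time on average; building each set from a list is itself linear in the list's length.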

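With your actual data you will likely see a higher overlap between monthly entries than with the randomly generated data used in this answer.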
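The set-intersection idea can be sketched on the three sample values from each row of the question. This is a minimal illustration, not the answer's own code: the helper name `overlap_percent` is hypothetical, and the IDs are kept as strings on the assumption that a leading zero (as in '098903284') should be preserved.

```python
# Toy data modelled on the question's sample rows (IDs kept as strings
# so that a leading zero such as '098903284' is preserved).
month1 = ['23482394', '4358309', '098903284']
month2 = ['92438545', '23482394', '323103404']


def overlap_percent(previous, current):
    """Percentage of entries in `current` that also appear in `previous`."""
    current_set = set(current)
    if not current_set:
        return 0.0  # avoid dividing by zero for an empty month
    common = set(previous) & current_set
    return 100 * len(common) / len(current_set)


print(round(overlap_percent(month1, month2), 2))  # 33.33
```

Only '23482394' appears in both lists, so one of month 2's three entries overlaps with month 1.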

Are the values for each specific month unique, and are they always integers?

334 MB will fit into the RAM of an average computer, so make sure you do not over-engineer this.

Please define this overlap: are these always integers? Does the '0' prefix matter? Are they unique? Is the order relevant? Please add some code showing how you would compare two short example strings in Python. That will make things easier.

@IvanSivak The values for each month are unique, and they are always integers.

If the values are unique, what do you mean by "overlap"?

@MBo I mean that there is no overlap within a single month's list, so the values are unique in that sense. When you compare the values of month1 and month2, for example, there will be overlap.

Hi, is there any reason to use np.random.choice instead of random.sample? Thanks.

# generate random numpy array with unique elements
import numpy as np

month1 = np.random.choice(range(10**5, 10**7), size=25*10**5, replace=False)
month2 = np.random.choice(range(10**5, 10**7), size=22*10**5, replace=False)
month3 = np.random.choice(range(10**5, 10**7), size=21*10**5, replace=False)

print('Month 1, 2, and 3 contains {}, {}, and {} elements respectively'.format(len(month1), len(month2), len(month3)))

Month 1, 2, and 3 contains 2500000, 2200000, and 2100000 elements respectively

# Compare month arrays for overlap

import time

startTime = time.time()
# Elements shared by consecutive months (set intersection, not union)
common_m1m2 = set(month1).intersection(month2)
common_m2m3 = set(month2).intersection(month3)

# Overlap is expressed as a percentage of the later month's entries
print('Percent of elements in both month 1 & 2: {}%'.format(round(100*len(common_m1m2)/len(month2),2)))
print('Percent of elements in both month 2 & 3: {}%'.format(round(100*len(common_m2m3)/len(month3),2)))

print('Process time:{:.2f}s'.format(time.time()-startTime))

Percent of elements in both month 1 & 2: 25.3%
Percent of elements in both month 2 & 3: 22.24%
Process time:2.46s
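The benchmark above uses generated arrays. To apply the same set intersection to a CSV laid out like the one in the question, something along these lines may work. It is a sketch under assumptions, not part of the original answer: the helpers `parse_ids` and `monthly_overlaps` and the file name `data.csv` are hypothetical, and the parser assumes the IDs contain no commas or quote characters. The `csv` module's default field size limit is raised because a single cell holding millions of IDs would otherwise be rejected.

```python
import csv
import io

# A cell holding millions of IDs exceeds the csv module's default field limit.
csv.field_size_limit(2**31 - 1)


def parse_ids(field):
    """Turn "['1','2']" into {'1', '2'}; assumes IDs contain no commas or quotes."""
    items = (item.strip().strip("'\"") for item in field.strip().strip("[]").split(","))
    return {item for item in items if item}


def monthly_overlaps(lines):
    """Yield (month, previous_month, percent of month's IDs also in previous_month)."""
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader)  # skip the "month, output" header row
    prev_month, prev_ids = None, None
    for month, field in reader:
        ids = parse_ids(field)
        if prev_ids is not None and ids:
            yield month, prev_month, 100 * len(ids & prev_ids) / len(ids)
        prev_month, prev_ids = month, ids


# Tiny in-memory stand-in for the real file; with the real data, pass
# open("data.csv", newline="") instead ("data.csv" is a hypothetical name).
sample = io.StringIO(
    "month, output\n"
    "1,\"['23482394','4358309','098903284']\"\n"
    "2,\"['92438545','23482394','323103404']\"\n"
    "3,\"['92438545','23482394','555']\"\n"
)
for month, prev_month, pct in monthly_overlaps(sample):
    print(f"Month {month} has {pct:.2f}% overlap with month {prev_month}")
```

Only two months' sets are held in memory at a time, which keeps the footprint small even though the whole 334 MB file would fit in RAM anyway.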