Python: what is eating so much memory?
I have a simple piece of code that reads a CSV file, finds duplicates based on the first 2 columns, writes the duplicates to another CSV, and keeps the unique values in a third CSV. I am using a set:
import csv

def my_func():
    area = "W09"
    inf = r'f:\JDo\Cleaned\_merged\\' + area + '.csv'
    out = r'f:\JDo\Cleaned\_merged\no_duplicates\\' + area + '_no_duplicates.csv'
    out2 = r'f:\JDo\Cleaned\_merged\duplicates\\' + area + "_duplicates.csv"

    #i = 0
    seen = set()
    with open(inf, 'r') as infile, open(out, 'w') as outfile1, open(out2, 'w') as outfile2:
        reader = csv.reader(infile, delimiter=" ")
        writer1 = csv.writer(outfile1, delimiter=" ")
        writer2 = csv.writer(outfile2, delimiter=" ")
        for row in reader:
            x, y = row[0], row[1]
            x = float(x)
            y = float(y)
            if (x, y) in seen:
                writer2.writerow(row)
                continue
            seen.add((x, y))
            writer1.writerow(row)
    seen.clear()
I thought a set would be the best choice here, but the set grows to seven times the size of the input file? (The input files range from 140 MB to 50 GB of CSV, and RAM usage goes from 1 GB to nearly 400 GB; the server I am using has 768 GB of RAM.)
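For a rough sense of where a factor like seven can come from, here is a small sketch (my own estimate, not taken from the question) comparing the in-memory cost of one parsed (x, y) tuple with the raw CSV line it came from:

```python
import sys

# Per-point memory cost of storing (x, y) float tuples in a set,
# versus the CSV line they were parsed from (exact numbers vary by
# CPython version and platform).
point = (475596.0, 101832.0)                       # one parsed coordinate pair
tuple_size = sys.getsizeof(point)                  # the tuple object itself
float_size = sum(sys.getsizeof(v) for v in point)  # the two float objects
per_entry = tuple_size + float_size                # set hash-table slot not included

line_size = len("475596 101832 4926\n")            # the same point as raw CSV text

print(per_entry, line_size)  # on 64-bit CPython, roughly 100+ bytes vs ~19
```

Each entry additionally occupies a slot in the set's internal hash table, which is kept sparse, so the real overhead per point is higher still.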
I also ran the memory profiler on a small sample:
Line #    Mem usage    Increment   Line Contents
================================================
     8   21.289 MiB   21.289 MiB   @profile
     9                             def my_func():
    10   21.293 MiB    0.004 MiB       area = "W10"
    11
    12   21.293 MiB    0.000 MiB       inf = r'f:\JDo\Cleaned\_merged\\' + area + '.csv'
    13   21.293 MiB    0.000 MiB       out = r'f:\JDo\Cleaned\_merged\no_duplicates\\' + area + '_no_duplicates.csv'
    14   21.297 MiB    0.004 MiB       out2 = r'f:\JDo\Cleaned\_merged\duplicates\\' + area + "_duplicates.csv"
    15
    16
    17
    18                                 #i = 0
    19   21.297 MiB    0.000 MiB       seen = set()
    20
    21   21.297 MiB    0.000 MiB       with open(inf, 'r') as infile, open(out, 'w') as outfile1, open(out2, 'w') as outfile2:
    22   21.297 MiB    0.000 MiB           reader = csv.reader(infile, delimiter=" ")
    23   21.297 MiB    0.000 MiB           writer1 = csv.writer(outfile1, delimiter=" ")
    24   21.297 MiB    0.000 MiB           writer2 = csv.writer(outfile2, delimiter=" ")
    25 1089.914 MiB   -9.008 MiB           for row in reader:
    26 1089.914 MiB   -7.977 MiB               x, y = row[0], row[1]
    27
    28 1089.914 MiB   -6.898 MiB               x = float(x)
    29 1089.914 MiB  167.375 MiB               y = float(y)
    30
    31 1089.914 MiB  166.086 MiB               if (x, y) in seen:
    32                                             #z = line.split(" ",3)[-1]
    33                                             #if z == "5284":
    34                                             #    print X, Y, z
    35
    36 1089.914 MiB    0.004 MiB                   writer2.writerow(row)
    37 1089.914 MiB    0.000 MiB                   continue
    38 1089.914 MiB  714.102 MiB               seen.add((x, y))
    39 1089.914 MiB   -9.301 MiB               writer1.writerow(row)
    40
    41
    42
    43  690.426 MiB -399.488 MiB       seen.clear()
What could the problem be? Is there a faster way to filter the results, or at least one that uses less RAM?
CSV sample:
These are GeoTIFFs converted to CSV files, so each row is X Y value.
475596 101832 4926
475626 101832 4926
475656 101832 4926
475686 101832 4926
475716 101832 4926
475536 101802 4926
475566 101802 4926
475596 101802 4926
475626 101802 4926
475656 101802 4926
475686 101802 4926
475716 101802 4926
475746 101802 4926
475776 101802 4926
475506 101772 4926
475536 101772 4926
475566 101772 4926
475596 101772 4926
475626 101772 4926
475656 101772 4926
475686 101772 4926
475716 101772 4926
475746 101772 4926
475776 101772 4926
475806 101772 4926
475836 101772 4926
475476 101742 4926
475506 101742 4926
EDIT:
So I tried the solution Jean suggested.
The result: on my small set (the 140 MB CSV), the size of the set is now halved, which is a nice improvement, so I will run it on the bigger data and see what it does. I could not link it to the profiler, because the profiler increases the execution time enormously.
Line #    Mem usage    Increment   Line Contents
================================================
     8   21.273 MiB   21.273 MiB   @profile
     9                             def my_func():
    10   21.277 MiB    0.004 MiB       area = "W10"
    11
    12   21.277 MiB    0.000 MiB       inf = r'f:\JDo\Cleaned\_merged\\' + area + '.csv'
    13   21.277 MiB    0.000 MiB       out = r'f:\JDo\Cleaned\_merged\no_duplicates\\' + area + '_no_duplicates.csv'
    14   21.277 MiB    0.000 MiB       out2 = r'f:\JDo\Cleaned\_merged\duplicates\\' + area + "_duplicates.csv"
    15
    16
    17   21.277 MiB    0.000 MiB       seen = set()
    18
    19   21.277 MiB    0.000 MiB       with open(inf, 'r') as infile, open(out, 'w') as outfile1, open(out2, 'w') as outfile2:
    20   21.277 MiB    0.000 MiB           reader = csv.reader(infile, delimiter=" ")
    21   21.277 MiB    0.000 MiB           writer1 = csv.writer(outfile1, delimiter=" ")
    22   21.277 MiB    0.000 MiB           writer2 = csv.writer(outfile2, delimiter=" ")
    23  451.078 MiB -140.355 MiB           for row in reader:
    24  451.078 MiB -140.613 MiB               hash = float(row[0])*10**7 + float(row[1])
    25                                         #x, y = row[0], row[1]
    26
    27                                         #x = float(x)
    28                                         #y = float(y)
    29
    30                                         #if (x, y) in seen:
    31  451.078 MiB   32.242 MiB               if hash in seen:
    32  451.078 MiB    0.000 MiB                   writer2.writerow(row)
    33  451.078 MiB    0.000 MiB                   continue
    34  451.078 MiB   78.500 MiB               seen.add((hash))
    35  451.078 MiB -178.168 MiB               writer1.writerow(row)
    36
    37  195.074 MiB -256.004 MiB       seen.clear()
You could create your own hash function to avoid storing a tuple of floats, and instead combine the floats into a single float value in a unique way.
Assuming the coordinates cannot exceed 10 million (maybe you could even go down to 1 million), you could do:
hash = x*10**7 + y
(this performs a kind of logical "or" on the floats; since the values are bounded, there is no mixing between x and y)
Then store hash in the set instead of a tuple of floats. With 10**14 there is no risk of the added float being absorbed, so it is worth a try:
>>> 10**14+1.5
100000000000001.5
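As a quick sanity check (my own sketch; unpack is not part of the original answer), the combined value can be split back into its two coordinates, which shows the packing is lossless as long as x*10**7 + y stays below 2**53, the range where doubles represent every integer exactly:

```python
def pack(x, y):
    # combine two bounded coordinates into one float, as in the answer;
    # assumes 0 <= y < 10**7 so x and y cannot mix
    return x * 10**7 + y

def unpack(h):
    # hypothetical inverse, only to demonstrate that no information is lost
    y = h % 10**7
    x = (h - y) // 10**7
    return x, y

h = pack(475596.0, 101832.0)   # a pair from the sample data
print(h, unpack(h))            # 4755960101832.0 (475596.0, 101832.0)
```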
Then the loop becomes:
for row in reader:
    hash = float(row[0])*10**7 + float(row[1])
    if hash in seen:
        writer2.writerow(row)
        continue
    seen.add(hash)
    writer1.writerow(row)
A single float, even a big one (a float has a fixed size), takes at least 2 to 3 times less memory than a tuple of two floats. On my machine:
>>> sys.getsizeof((0.44,0.2))
64
>>> sys.getsizeof(14252362*10**7+35454555.0)
24
Comments:
- Why do you convert the first two columns to float? Would they be better as int?
- Because some of the files have decimal coordinate values, such as 2345641.5.
- Possible duplicate. You could save a lot of time by adding seen.add((x, y)) in an else statement: a value should only be added to the set when it is not already in it, and adding a value to a set is a costly operation.
- @Gautamagarwal The add is effectively in an else already, because of the continue. Also, the problem here is not time but memory.
- The coordinates are between -3 million and +3 million in both directions. I will try this idea now and post the results. Your solution works well, I was able to halve the set size; editing the question.