Algorithm: most efficient way to calculate a file's uniqueness (as a %) against several other large files

I have about 30 files of roughly 500MB each, with one word per line. I wrote a script in pseudo-bash:
for i in *; do
    : > everythingButI   # truncate (the original echo "" left a stray blank line)
    for j in *; do
        # skip the file under test and our own scratch files
        case "$j" in "$i"|everythingButI|tmp|uniqueInI) continue ;; esac
        cat "$j" >> everythingButI
        sort everythingButI | uniq > tmp
        mv tmp everythingButI
    done
    comm -2 -3 "$i" everythingButI > uniqueInI
    percentUnique=$(echo "scale=2; $(wc -l < uniqueInI) * 100 / $(wc -l < "$i")" | bc)
    echo "$i is $percentUnique% unique"
done
It computes the "uniqueness" of each file (the files are already sorted, and each one is internally de-duplicated).

If I have the files:
file1  file2  file3
a      b      1
c      c      c
d      e      e
f      g
       h
then file1 is 75% unique (because 1/4 of its lines are found in another file), file2 is 60% unique, and file3 is 33.33% unique. But with 30 files at 500MB apiece, it takes a long time to run.
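To make that definition concrete, here is a throwaway sanity check with sets that reproduces those percentages (my own sketch for the toy files above; a set-based approach obviously would not fit 30 x 500MB files in 2GB of RAM):

files = {
    "file1": {"a", "c", "d", "f"},
    "file2": {"b", "c", "e", "g", "h"},
    "file3": {"1", "c", "e"},
}
for name, words in files.items():
    # union of every other file's words
    others = set().union(*(w for n, w in files.items() if n != name))
    print("{0} is {1:.2%} unique".format(name, len(words - others) / len(words)))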
I'd like to write a Python script that does this much faster, but I'm wondering what the fastest algorithm for it would be. (I also only have 2GB of RAM on the PC.)
Does anyone have opinions about algorithms, or know of a faster way to do it?

Edit: Since every input is already internally sorted and de-duplicated, what you actually want for this is an n-way merge, and the hash-building exercise in the previous version of this post was rather pointless. The n-way merge is kind of intricate if you're not careful. Basically, it works something like this:
- Read in the first line of each file, and initialize its unique-line counter and total-line counter to 0.
- Execute this loop body:
- Find the minimum value among the lines read.
- If that value is not the same as the value from any of the other files, increment that file's unique-line counter.
- For each file, if the minimum value equals the last value read, read in the next line and increment that file's total-line counter. If you hit the end of a file, you are done with it: remove it from further consideration.
- Loop until you have no files left under consideration. At that point you have an accurate unique-line counter and total-line counter for each file, and the percentage is a simple multiply and divide.
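Traced by hand on the three toy files above, the merge goes like this (current lines on the left, action on the right):

a b 1  ->  min "1" is only in file3: unique for file3; advance file3
a b c  ->  min "a" is only in file1: unique for file1; advance file1
c b c  ->  min "b" is only in file2: unique for file2; advance file2
c c c  ->  min "c" is in all three: shared; advance all three
d e e  ->  min "d" is only in file1: unique for file1; advance file1
f e e  ->  min "e" is in file2 and file3: shared; advance both (file3 is exhausted)
f g    ->  "f", "g", "h" are each unique; file1 and then file2 are exhausted

That gives file1 3/4 unique, file2 3/5, and file3 1/3, matching the percentages above. In Python 3: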
import os
import itertools

# see: http://docs.python.org/dev/library/itertools.html#itertools-recipes
# modified for 3.x and eager lists
def partition(pred, iterable):
    t1, t2 = itertools.tee(iterable)
    return list(itertools.filterfalse(pred, t1)), list(filter(pred, t2))

# all files here
base = "C:/code/temp"
names = os.listdir(base)
for n in names:
    print("analyzing {0}".format(n))

# {name => file}
# files are removed from here as they are exhausted
files = dict([n, open(os.path.join(base, n))] for n in names)
# {name => number of shared items in any other list}
shared_counts = {}
# {name => total items this list}
total_counts = {}
for n in names:
    shared_counts[n] = 0
    total_counts[n] = 0

# [name, currentvalue] -- remains mostly sorted and is
# always a very small n so sorting should be lickity-split
vals = []
for n, f in files.items():
    # assumes no files are empty
    vals.append([n, str.strip(f.readline())])
    total_counts[n] += 1

while len(vals):
    vals = sorted(vals, key=lambda x: x[1])
    # if two low values are the same then the value is not-unique
    # adjust the logic based on definition of unique, etc.
    low_value = vals[0][1]
    lows, highs = partition(lambda x: x[1] > low_value, vals)
    if len(lows) > 1:
        for lname, _ in lows:
            shared_counts[lname] += 1
    # all lowest items discarded and refetched
    vals = highs
    for name, _ in lows:
        f = files[name]
        val = f.readline()
        if val != "":
            vals.append([name, str.strip(val)])
            total_counts[name] += 1
        else:
            # close files as we go. eventually we'll
            # dry-up the 'vals' and quit this mess :p
            f.close()
            del files[name]

# and what we want...
for n in names:
    unique = 1 - (shared_counts[n] / total_counts[n])
    print("{0} is {1:.2%} unique!".format(n, unique))
Looking back on it, I can already see a flaw! :-) The sorting of vals is there for a legacy reason that no longer applies; in practice just a min would work fine here (and would probably be better for any relatively small set of files).

Here is some really ugly pseudo-code that does the n-way merge:
#!/usr/bin/python

import sys, os, commands

def findmin(linesread):
    # return the smallest current line and the indexes of every file holding it
    min = ""
    indexes = []
    for i in range(len(linesread)):
        if linesread[i] != "":
            min = linesread[i]
            indexes.append(i)
            break
    for i in range(indexes[0] + 1, len(linesread)):
        if linesread[i] < min and linesread[i] != "":
            min = linesread[i]
            indexes = [i]
        elif linesread[i] == min:
            indexes.append(i)
    return min, indexes

def genUniqueness(path):
    wordlists = []
    linecount = []

    log = open(path + ".fastuniqueness", 'w')

    # collect the word lists to compare
    for root, dirs, files in os.walk(path):
        if root.find(".git") > -1 or root == ".":
            continue
        if root.find("onlyuppercase") > -1:
            continue
        for i in files:
            if i.find('lvl') >= 0 or i.find('trimmed') >= 0:
                wordlists.append(root + "/" + i)
                linecount.append(int(commands.getoutput("cat " + root + "/" + i + " | wc -l")))
                print root + "/" + i

    whandles = []
    linesread = []
    numlines = []
    uniquelines = []
    for w in wordlists:
        whandles.append(open(w, 'r'))
        linesread.append("")
        numlines.append(0)
        uniquelines.append(0)

    # prime the merge with the first line of every file
    count = range(len(whandles))
    for i in count:
        linesread[i] = whandles[i].readline().strip()
        numlines[i] += 1

    while True:
        (min, indexes) = findmin(linesread)
        if len(indexes) == 1:
            uniquelines[indexes[0]] += 1
        for i in indexes:
            linesread[i] = whandles[i].readline().strip()
            numlines[i] += 1
            if linesread[i] == "":
                numlines[i] -= 1
                whandles[i] = 0
                print "Expiring ", wordlists[i]
        if not any(linesread):
            break

    for i in count:
        log.write(wordlists[i] + "," + str(uniquelines[i]) + "," + str(numlines[i]) + "\n")
        print wordlists[i], uniquelines[i], numlines[i]
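For comparison, the same n-way merge can be written much more compactly on top of the standard library (a Python 3 sketch of my own; heapq.merge does the k-way merge while keeping only one pending line per file in memory, and it assumes the inputs are sorted and internally de-duplicated, as stated in the question):

import heapq
import itertools

def uniqueness(paths):
    def tagged(path, idx):
        # yield (word, file-index) pairs; each stream is sorted because the file is
        with open(path) as f:
            for line in f:
                yield (line.rstrip("\n"), idx)

    shared = [0] * len(paths)
    total = [0] * len(paths)
    streams = [tagged(p, i) for i, p in enumerate(paths)]
    # group equal words together across all files
    for word, group in itertools.groupby(heapq.merge(*streams), key=lambda pair: pair[0]):
        owners = [i for _, i in group]
        for i in owners:
            total[i] += 1
        if len(owners) > 1:  # the word appears in at least two files
            for i in owners:
                shared[i] += 1
    for p, s, t in zip(paths, shared, total):
        print("{0} is {1:.2%} unique".format(p, 1 - s / t))

Tagging each line with its file index keeps the merged tuples comparable and records which file every word came from, so no separate key function is needed.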
Just a thought: why not try the diff -s command? It reports when two files are identical. See the diff man page.

I don't see how that would help me... none of the files are identical.

How do you define uniqueness? Do they contain exactly the same words, but the order doesn't matter?

My bash is rusty; are you trying