Python: grouping files in the same directory by content
I have a question about grouping files that have the same content but different file names. I looked at filecmp.cmp(), but it only compares two files at a time.

The goal is to turn a situation like this:

file1: [a,b,c,d,e,f,g,h,i]
file2: [a,b,c,d,e,f,g,h,i]
file3: [a,b,c,d,e,f,g,h,i]
file4: [a,b,c,d,e,f,g,h]
file5: [a,b,c,d,e,f,g,h]
file6: [a,b,c,d,e]

into groups of files with identical content.

I think I have about 1,800 .txt files, but only about 20 of them are unique. I would like to create a list, a dictionary, or a data frame that shows the grouping.
Any help is appreciated. Thanks.

You can use a hash function such as SHA-1 to find files with the same content. Here is an excerpt:

import hashlib

BLOCKSIZE = 65536

def hash_value_for(file_name):
    # Read the file in binary mode, one block at a time, so a large file
    # never has to fit in memory all at once.
    hasher = hashlib.sha1()
    with open(file_name, 'rb') as afile:
        buf = afile.read(BLOCKSIZE)
        while len(buf) > 0:
            hasher.update(buf)
            buf = afile.read(BLOCKSIZE)
    return hasher.hexdigest()

Given a file name, the function above returns the hash of that file's contents. For example, with these files:

file1.txt: This is a test
file2.txt: This is a trial
file3.txt: This is a test

Output:

print(hash_value_for("file1.txt"))
> 0828324174b10cc867b7255a84a8155cf89e1b8b
print(hash_value_for("file2.txt"))
> cc4bc53ee478380f385721b45247107338a9cec3
print(hash_value_for("file3.txt"))
> 0828324174b10cc867b7255a84a8155cf89e1b8b
Now, back to your original example.

Files:
Suppose we have the following files, each with the contents shown:
file1: [a,b,c,d,e,f,g,h,i]
file2: [a,b,c,d,e,f,g,h,i]
file3: [a,b,c,d,e,f,g,h,i]
file4: [a,b,c,d,e,f,g,h]
file5: [a,b,c,d,e,f,g,h]
file6: [a,b,c,d,e]
Code:
file_names = ["file1.txt", "file2.txt", "file3.txt",
              "file4.txt", "file5.txt", "file6.txt"]

# Map each file name to the hash of its contents.
file_names_with_hash_values = {}
for file_name in file_names:
    file_names_with_hash_values[file_name] = hash_value_for(file_name)

# Invert the mapping: each hash collects the names of the files that share it.
result = {}
for key, value in sorted(file_names_with_hash_values.items()):
    result.setdefault(value, []).append(key)

print(result)
Output:
{'e99a894b164a9274e7dabc1b77b41f4148860d96': ['file1.txt', 'file2.txt', 'file3.txt'],
'bf141159c6499f26f46c7bdc28914417ff66aa15': ['file4.txt', 'file5.txt'],
'a019bdc760a550cdc55de1343d4ebbcff1ba49c3': ['file6.txt']}
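
The file list above is hard-coded, while the question involves roughly 1,800 .txt files in one directory. The following is a minimal sketch of how the same hash-based grouping might be applied to a whole directory. The function name group_by_content and its pattern parameter are illustrative assumptions, not part of the original answer.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

BLOCKSIZE = 65536


def hash_value_for(path):
    """Return the SHA-1 hex digest of the file's bytes, read block by block."""
    hasher = hashlib.sha1()
    with open(path, "rb") as afile:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for buf in iter(lambda: afile.read(BLOCKSIZE), b""):
            hasher.update(buf)
    return hasher.hexdigest()


def group_by_content(directory, pattern="*.txt"):
    """Map each content hash to the names of the files that share it.

    Hypothetical helper: only files directly inside `directory` that
    match `pattern` are considered; subdirectories are not searched.
    """
    groups = defaultdict(list)
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            groups[hash_value_for(path)].append(path.name)
    return dict(groups)
```

Calling group_by_content(".") on the current directory would return a dictionary shaped like the output above. If only the groups matter, list(group_by_content(".").values()) drops the hashes.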
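This is just an example; you can adapt the code to suit your needs (and to get the output you want).

A classic approach is to use a dictionary. First compile a list of all the file names in the directory and store it in a list named file_names. Then: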
filedict = {}
for name in file_names:
    # The with block closes each file once it has been read.
    with open(name, "r") as file:
        filecontents = file.read()
    # setdefault creates the list the first time a given content is seen.
    filedict.setdefault(filecontents, []).append(name)
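Each value in this dictionary is a list of files that have the same text content. The keys of the dictionary are the file contents themselves, as strings.

Thank you all for your help. Sorry, I just realized I never responded to your answers and suggestions. I ended up using the dictionary approach suggested by Abe Binder. Thanks again for your help.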
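
For reference, the dictionary approach can be wrapped in a small function that returns only the groups. This is a sketch under assumptions: the name group_by_text is made up for illustration, and it reads files as bytes rather than text, which is a deliberate change so that encoding and newline handling cannot affect the comparison.

```python
from collections import defaultdict


def group_by_text(file_names):
    """Return lists of file names whose contents are byte-for-byte identical."""
    groups = defaultdict(list)
    for name in file_names:
        # Binary mode is an assumption of this sketch: it compares raw bytes.
        with open(name, "rb") as f:
            groups[f.read()].append(name)
    # The keys (full file contents) are discarded; only the groups are returned.
    return list(groups.values())
```

Because every distinct file's full contents are held in memory as a key, this suits a case like the one described (about 20 unique small text files). For large files, hashing keeps memory use lower.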