Python: combining gigabytes of text into one file, sorted by number of occurrences


The purpose of this script is to take a folder full of text files, capture every line across all the files, and output a single file containing every line, sorted in descending order of frequency.

It doesn't just find the unique lines; it also finds how often each unique line appears across all the files.

This script needs to handle a lot of text - at least around 2 GB - so I need it to be memory-efficient. So far I haven't reached that goal.

import os, sys #needed for looking into a directory
from sys import argv #allows passing of arguments from command line, where I call the script
from collections import Counter #allows the lists to be sorted by number of occurrences

#Pass argument containing Directory of files to be combined
dir_string = str((argv[1]))

filenames=[]  

#Get name of files in directory, add them to a list
for file in os.listdir(dir_string):
    if file.endswith(".txt"):
        filenames.append(os.path.join(dir_string, file)) #add names of files to a list

#Declare name of file to be written
out_file_name = dir_string+".txt"

#Create output file
outfile = open(out_file_name, "w")

#Declare list to be filled with lines seen
lines_seen = []

#Parse All Lines in all files
for fname in filenames: #for all files in list
    with open(fname) as infile: #open a given file
        for line in infile: #for all lines in current file, read one by one
            #Here's the problem.
            lines_seen.append(str(line).strip('\n')) #add line to list of lines seen,
                                                     #removing the endline

#Organizes the list by number of occurrences, but produces a list that contains
# [(item a, # of a occurrences ), (item b, # of b occurrences)...]
lines_seen = Counter(lines_seen).most_common()

#Write the list line by line to the output file
for item in lines_seen: outfile.write(str(item[0])+"\n")

outfile.close()
When I do get an error message, it is about the line

lines_seen.append(str(line).strip('\n'))

I first tried appending the line without converting it to a string and stripping it, but that left a visible '\n' in the string, which was not acceptable to me. For smaller lists, converting to string and stripping didn't take up too much memory. I couldn't figure out a more efficient way of getting rid of the end-of-line character.

On my PC this results in a MemoryError, and on my Mac it gets Killed: 9 - I haven't tried it on Linux yet.

Do I need to convert to binary, assemble my ordered list, and then convert back? How else can this be done?
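For a rough sense of why the list exhausts memory: every Python str object carries a few dozen bytes of overhead on top of its characters, and the list adds a pointer per entry, so 2 GB of short lines can balloon to several times that in RAM. A quick check (numbers assume a typical 64-bit CPython 3 build and will vary):

import sys

line = "example line of text"  # 20 characters
print(sys.getsizeof(line))     # ~69 bytes: ~49 bytes of object overhead plus 1 byte per character

# 2 GB of ~20-byte lines is roughly 100 million strings; at ~69 bytes each
# plus an 8-byte list slot, the list alone needs on the order of 7-8 GB.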

EDIT - It has become clear that, for my purposes, the best overall approach is to use unix commands:

cd DirectoryWithFiles
cat *.txt | sort | uniq -c | sort -n -r > wordlist_with_count.txt
cut  -c6- wordlist_with_count.txt > wordlist_sorted.txt
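For reference, the uniq -c step prepends a count column and cut -c6- strips it again (the column width is implementation-dependent, so -c6- may need adjusting). A streaming Python equivalent of the pipeline might look like this (a sketch: it keeps one copy of each distinct line plus a count, rather than every occurrence; the output filename is assumed to match the pipeline's):

from collections import Counter
import glob, os, sys

counts = Counter()
for path in glob.glob(os.path.join(sys.argv[1], "*.txt")):  # directory passed on the command line
    with open(path) as infile:
        for line in infile:
            counts[line.rstrip('\n')] += 1

with open("wordlist_sorted.txt", "w") as outfile:
    for text, count in counts.most_common():  # descending by count, like sort -n -r
        outfile.write(text + "\n")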

Your problem is obviously a lack of memory.

You could eliminate the redundancy of storing duplicate lines in lines_seen as you process, which may help:

from collections import Counter
lines_seen = Counter()

# in the for loop :
lines_seen[ str(line).strip('\n') ] += 1

# at the end:
for item in lines_seen.most_common():
    outfile.write(str(item[0])+"\n")
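Assembled into a complete loop (a sketch that reuses the filenames list and out_file_name from the question):

from collections import Counter

lines_seen = Counter()

for fname in filenames:                        # filenames built as in the question
    with open(fname) as infile:
        for line in infile:
            lines_seen[line.strip('\n')] += 1  # count each line; only distinct lines are stored

with open(out_file_name, "w") as outfile:
    for item in lines_seen.most_common():      # most frequent lines first
        outfile.write(str(item[0]) + "\n")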

EDIT

As mentioned in the comments, another solution is:

from collections import Counter
lines_seen = Counter()

# get the files names

for fname in filenames: #for all files in list
    with open(fname) as infile: #open a given file
        lines_seen.update(infile.read().split('\n'))

for item in lines_seen.most_common():
    print( item[0], file=outfile )
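Note that infile.read() still loads each whole file into memory at once (and split('\n') briefly doubles it); a variant that stays line-by-line would be (a sketch):

for fname in filenames:
    with open(fname) as infile:
        lines_seen.update(line.rstrip('\n') for line in infile)  # stream lines, never holding a whole file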

I would solve the problem like this:

import os, sys #needed for looking into a directory
from sys import argv #allows passing of arguments from command line, where I call the script
from collections import Counter #allows the lists to be sorted by number of occurrences

#Pass argument containing Directory of files to be combined
dir_string = str((argv[1]))


#Get name of files in directory, add them to a list
filenames = []
for file in os.listdir(dir_string):
    if file.endswith(".txt"):
        filenames.append(os.path.join(dir_string, file)) #add names of files to a list


#Declare name of file to be written
out_file_name = os.path.join(dir_string, 'out.txt')


# write all the files to a single file instead of a list
with open(out_file_name, "w") as outfile:
    for fname in filenames: #for all files in list
        with open(fname) as infile: #open a given file
            for line in infile: #for all lines in current file, read one by one
                outfile.write(line)

# create a counter object from outfile
with open(out_file_name, "r") as outfile:
    c = Counter(outfile)  # note: each key keeps its trailing '\n', hence the rstrip() when writing out



print "sorted by line alphabhitically"
from operator import itemgetter   
print sorted(c.items(),key=itemgetter(0))

print "sorted by count"
print sorted(c.items(), key=itemgetter(1))


def index_in_file(unique_line):
    with open(out_file_name, "r") as outfile:
        for num, line in enumerate(outfile, 1):
            if unique_line[0] in line:
                return num

print "sorted by apperance of line in the outfile"
s= sorted(c.items(),key=index_in_file)
print s

# Once you decide what kind of sort you want, write the sorted elements into a outfile.
with open(out_file_name, "w") as outfile:
    for ss in s:
        outfile.write(ss[0].rstrip()+':'+str(ss[1])+'\n')
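Since the question asks for descending frequency, the sort by count above would need reverse=True; a minimal adjustment (a sketch):

s = sorted(c.items(), key=itemgetter(1), reverse=True)  # most frequent lines first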

This is the approach I suggested in the comments under another answer for reducing memory consumption:

import collections

lines_seen = collections.Counter()

for filename in filenames:
    with open(filename, 'r') as file:
        for line in file:
            line = line.strip('\n')
            if line:
                lines_seen.update([line])

with open(out_file_name, "w") as outfile:
    for line, count in lines_seen.most_common():
        outfile.write('{}, {}\n'.format(line, count))

Note that line.strip('\n') looks for newline characters at both ends of each line, but a line read from a file can only have one at the very end, so line.rstrip('\n') would be slightly more efficient. You may also want to use line.strip() to remove leading and trailing whitespace; getting rid of stored (and possibly substantial) whitespace will reduce memory usage further.
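A quick illustration of the difference (a sketch):

line = "  indented line\n"
print(repr(line.strip('\n')))   # '  indented line'  - newline gone, leading spaces kept
print(repr(line.rstrip('\n')))  # '  indented line'  - same result, scanning only the right end
print(repr(line.strip()))       # 'indented line'    - all surrounding whitespace removed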

Instead of keeping the list in memory, why not write the lines out to a temporary file?

I don't know how to sort that file without reading it back into a list or set, which brings me back to the same problem.

Per this thread: if the file is larger than the roughly 2 GB you can read at once, as you mention in your post, you are better off splitting it into chunks of files or smaller list chunks, sorting each chunk separately, and then writing them into one master output file.

I could, but that might defeat the purpose of getting total occurrence counts across the whole directory.
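If chunking does become necessary, one way to keep memory flat while still getting directory-wide totals is an on-disk count table instead of an in-memory Counter. A sketch using only the standard library (the UPSERT syntax requires SQLite 3.24+; filenames are illustrative):

import glob, os, sqlite3, sys

db = sqlite3.connect("counts.db")  # counts live on disk, not in RAM
db.execute("CREATE TABLE IF NOT EXISTS counts (line TEXT PRIMARY KEY, n INTEGER)")

for path in glob.glob(os.path.join(sys.argv[1], "*.txt")):
    with open(path) as infile:
        for line in infile:
            db.execute(
                "INSERT INTO counts VALUES (?, 1) "
                "ON CONFLICT(line) DO UPDATE SET n = n + 1",  # increment existing rows
                (line.rstrip('\n'),))
    db.commit()

with open("wordlist_sorted.txt", "w") as outfile:
    for text, n in db.execute("SELECT line, n FROM counts ORDER BY n DESC"):
        outfile.write(text + "\n")
db.close()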