Python 3.x: fastest way to filter out non-frequent words in lists of words


I have a dataset containing lists of tokens in csv format, like this:

song, tokens
aaa,"['everyon', 'pict', 'becom', 'somebody', 'know']"
bbb,"['tak', 'money', 'tak', 'prid', 'tak', 'littl']"
First, I want to find all the words that appear in the corpus at least a certain number of times, say 5, which is easy to do:

import pandas as pd

# converters simply reconstruct the string of tokens into a list of tokens
lyrics = pd.read_csv('dataset.csv',
                     converters={'tokens': lambda x: x.strip("[]").replace("'", "").split(", ")})

# List of all words
allwords = [word for tokens in lyrics['tokens'] for word in tokens]
allwords = pd.DataFrame(allwords, columns=['word'])

more5 = allwords[allwords.groupby("word")["word"].transform('size') >= 5]
more5 = set(more5['word'])
frequentwords = [token.strip() for token in more5]
frequentwords.sort()
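
As a side note, the same frequentwords list can also be built directly with value_counts; a minimal sketch, assuming the allwords frame from above:

# Equivalent way to build frequentwords: count each word once,
# keep those seen at least 5 times, strip and sort them
counts = allwords['word'].value_counts()
frequentwords = sorted(word.strip() for word in counts[counts >= 5].index)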
Now I want to remove, from every list of tokens, the words that do not appear in frequentwords. For this I use the following code:

import numpy as np
from multiprocessing import Pool


def remove_non_frequent(x):
    global frequentwords
    output = []

    for token in x:
        if token in frequentwords:
            output.append(token)

    return output

def remove_on_chunk(df):
    df['tokens'] = df.apply(lambda x: remove_non_frequent(x['tokens']), axis=1)

    return df


def parallelize_dataframe(df, func, n_split=10, n_cores=4):
    df_split = np.array_split(df, n_split)
    pool = Pool(n_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

lyrics_reconstructed = parallelize_dataframe(lyrics, remove_on_chunk)
The non-multiprocessing version takes about 2.5 to 3 hours to run, while this one takes about 1 hour.

Of course it is a slow process, since I have to look up roughly 130 million tokens in a list of about 30k elements, but I'm fairly sure my code is not particularly good.
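
The expensive part is the membership test `if token in frequentwords`, which scans a Python list on every lookup; keeping frequentwords as a set (as it briefly is, before being turned back into a list above) makes each lookup roughly constant time. A minimal sketch of that change, assuming the same frequentwords list:

frequentwords_set = set(frequentwords)   # O(1) membership tests instead of scanning a list

def remove_non_frequent(x):
    # same filter as above, but checking against a set instead of a sorted list
    return [token for token in x if token in frequentwords_set]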

Is there a faster and better way to achieve this?

Go for set operations. I have saved your example data to a file called tt1, so this should work. Also, if you are somehow generating the data yourself, do yourself a favour and drop the quotes and square brackets; it will save you time in preprocessing.
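
For illustration, a minimal sketch of what such a cleaner export could look like (the file name tt1_clean and the space-separated layout are assumptions, not part of the original data):

# Hypothetical cleaner export: tokens space-separated, no quotes or brackets,
# so no regex cleanup is needed when reading the file back
rows = [('aaa', ['everyon', 'pict', 'becom', 'somebody', 'know']),
        ('bbb', ['tak', 'money', 'tak', 'prid', 'tak', 'littl'])]

with open('tt1_clean', 'w') as out:
    out.write('song,tokens\n')
    for song, toks in rows:
        out.write(song + ',' + ' '.join(toks) + '\n')

with open('tt1_clean') as o:
    o.readline()                          # skip the header
    for line in o:
        song, toks = line.rstrip('\n').split(',', 1)
        tokens = toks.split(' ')          # plain split, nothing to strip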

from collections import Counter
import re

rgx = re.compile(r"[\[\]\"' \n]")     # data cleanup

# load and pre-process the data
counter = Counter()
data = []
with open('tt1', 'r') as o:
    o.readline()
    for line in o:
        parts = line.split(',')
        clean_parts = {re.sub(rgx, "", i) for i in parts[1:]}
        counter.update(clean_parts)
        data.append((parts[0], clean_parts))


n = 2                         # <- here set the threshold for number of occurrences
common_words = {i[0] for i in counter.items() if i[1] > n}

# process the data: drop the common words from every song's token set
clean_data = []
for s, r in data:
    clean_data.append((s, r - common_words))
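
If the result needs to go back into pandas, as in the question, a minimal sketch (assuming pandas is available as pd):

import pandas as pd

# clean_data is a list of (song, remaining_tokens) pairs
result = pd.DataFrame(clean_data, columns=['song', 'tokens'])
result.to_csv('filtered.csv', index=False)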

It has been a while, but I will post the correct solution to the question, since it is only a small modification of his code. The plain sets he uses cannot handle duplicates, so the obvious idea is to reuse the same code with multisets; I used the multiset package for the implementation.
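
A quick sketch of why a multiset helps: intersection keeps the lower multiplicity of each element, so repeated tokens inside a document survive the filtering (using the multiset package that the code below imports):

from multiset import Multiset

doc = Multiset(['tak', 'money', 'tak', 'prid', 'tak', 'littl'])  # 'tak' appears 3 times
keep = Multiset()
keep.add('tak', 100)      # artificially high multiplicity, as in the code below
keep.add('littl', 100)

filtered = doc.intersection(keep)
print(sorted(filtered))   # ['littl', 'tak', 'tak', 'tak'] -- duplicates preserved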


This code extracts the most frequent words and filters every document against that list; on a 110 MB dataset it does the job in less than 2 minutes.

Works great, thank you very much. It takes about 2 minutes! One more small thing: I would also like to keep repeated words (the sets drop duplicates), so I am using your code but with multisets:

from collections import Counter
import re

from multiset import Multiset

rgx = re.compile(r"[\[\]\"' \n]")     # data cleanup

# load and pre-process the data
counter = Counter()
data = []
with open('tt1', 'r') as o:
    o.readline()
    for line in o:
        parts = line.split(',')
        clean_parts = [re.sub(rgx, "", i) for i in parts[1:]]
        counter.update(clean_parts)
        ms = Multiset()
        for word in clean_parts:
            ms.add(word)
        data.append([parts[0], ms])


n = 2                         # <- here set the threshold for number of occurrences
common_words = Multiset()

# I'm using intersection with the most common words since
# common_words is way smaller than uncommon_words.
# Intersection returns the lowest count of each element between two multisets,
# e.g. ('sky', 10) and ('sky', 1) will produce ('sky', 1).
# I want the number of repeated words in my documents, so I set the
# common words counter to be very high.
for item in counter.items():
    if item[1] >= n:
        common_words.add(item[0], 100)

# process the data
clean_data = []
for s, r in data:
    clean_data.append((s, r.intersection(common_words)))

output_data = []
for s, ms in clean_data:
    tokens = []
    for item in ms.items():
        for i in range(0, item[1]):
            tokens.append(item[0])
    output_data.append([s] + [tokens])
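
One thing to keep in mind with this approach: the fixed multiplicity of 100 acts as a cap, because intersection keeps the lower count of each element, so a token repeated more than 100 times in a single song would be truncated to 100. If that can happen in the data, the constant needs to be raised accordingly.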