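I added an answer with an example on random data: converting the stopwords to a set once at the start is much faster than converting the pandas Series to a NumPy array on every ngram check.
@JanMusil Yes, you are right. Thank you for your help!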
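import re
from multiprocessing import Pool

# `stopwords` and `df` are pandas DataFrames defined elsewhere.
def clean_ngram(ng):
    # normalise spelled-out symbols
    if 'percent' in ng:
        ng = ng.replace('percent', '%')
    if 'point' in ng:
        ng = ng.replace('point', '.')
    # keep the ngram only if it neither starts nor ends with a stopword
    # and is not purely numeric; otherwise fall through and return None
    if ng.split(' ')[0] not in stopwords['First'].dropna().values \
            and ng.split(' ')[-1] not in stopwords['Last '].dropna().values \
            and not re.match(r"^[0-9.% ]+$", ng):
        return ng

# serial version
df['Word'] = df['Word'].apply(clean_ngram)

# multiprocessing version with two worker processes
p = Pool(processes=2)
df['Word'] = p.map(clean_ngram, df['Word'])
p.close()
p.join()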
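Applying the suggestion from the comments to the function above could look roughly like the sketch below. It builds the two lookup sets once, so each call no longer rebuilds a NumPy array with `dropna().values`. The small `stopwords` and `df` frames and the names `FIRST_STOPWORDS`, `LAST_STOPWORDS` and `NUMERIC_ONLY` are made up for illustration; only the column names `'First'`, `'Last '` and `'Word'` come from the question.

```python
import re

import pandas as pd

# Hypothetical stand-ins for the question's data; only the column names are from the question.
stopwords = pd.DataFrame({
    'First': ['the', 'a', None],
    'Last ': ['of', None, 'and'],
})
df = pd.DataFrame({'Word': ['the cat', 'cat of', '5 percent', 'big cat']})

# Build the lookup structures once, outside the per-ngram function.
FIRST_STOPWORDS = set(stopwords['First'].dropna())
LAST_STOPWORDS = set(stopwords['Last '].dropna())
NUMERIC_ONLY = re.compile(r"^[0-9.% ]+$")


def clean_ngram(ng):
    """Return the normalised ngram, or None if it should be filtered out."""
    ng = ng.replace('percent', '%').replace('point', '.')
    words = ng.split(' ')
    if (words[0] not in FIRST_STOPWORDS
            and words[-1] not in LAST_STOPWORDS
            and not NUMERIC_ONLY.match(ng)):
        return ng
    return None


df['Word'] = df['Word'].apply(clean_ngram)
print(df['Word'].tolist())
```

With the sets built up front, each membership test is a hash lookup instead of a scan over a freshly created array, so the serial `apply` may already be fast enough without a `Pool`.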
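# imports used throughout this session
>>> import numpy as np
>>> import pandas as pd
>>> from time import time
# first generate 450 stopwords plus 50 NaNs
# (np.array coerces the NaNs to the string 'nan', so dropna() later has nothing to drop; the timing comparison is unaffected)
>>> stopwords = np.array(['word_num'+str(i) for i in range(450)]+[np.nan for _ in range(50)])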
# shuffle the stopwords and print some of them
>>> stopwords = pd.Series(stopwords).sample(frac=1)
>>> stopwords
304    word_num304
84      word_num84
215    word_num215
438    word_num438
276    word_num276
          ...
217    word_num217
280    word_num280
69      word_num69
365    word_num365
404    word_num404
Length: 500, dtype: object
# generate random words to check against the stopwords
>>> ngrams = ['word_num{}'.format(int(np.random.rand()*1000)) for _ in range(20000)]
>>> ngrams = np.array(ngrams)
>>> ngrams
array(['word_num642', 'word_num729', 'word_num901', ..., 'word_num940',
       'word_num616', 'word_num58'], dtype='<U11')
# define a function that checks each word against the stopwords pd.Series (the same way you did)
# it also returns the time the check took
>>> def check_ngrams_Series(ngrams, stopwords):
...     func = lambda ng: ng in stopwords.dropna().values
...     time_begin = time()
...     result = list(map(func, ngrams))
...     time_end = time()
...     return np.array(result), time_end - time_begin
...
# define a function that checks each word against the stopwords converted to a set
# it also returns the time the check took
>>> def check_ngrams_set(ngrams, stopwords):
...     func = lambda ng: ng in stopwords
...     time_begin = time()
...     result = list(map(func, ngrams))
...     time_end = time()
...     return np.array(result), time_end - time_begin
...
# run both functions
>>> series_out = check_ngrams_Series(ngrams, stopwords)
>>> sets_out = check_ngrams_set(ngrams, set(stopwords))
# check that their first output (word presence in stopwords) is the same
>>> np.all(sets_out[0] == series_out[0])
True
# time taken by the set-based function, in seconds
>>> sets_out[1]
0.008014917373657227
# time taken by the pd.Series-based function, in seconds
>>> series_out[1]
15.30849814414978
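To see where the time in the session above goes, the rough sketch below times three variants of a single membership check with the standard-library `timeit` module: rebuilding the array with `dropna()` on every check, scanning an array built once, and looking up in a set built once. The variable names and the probe word are arbitrary choices for illustration, and the absolute numbers will vary by machine and library version.

```python
import timeit

import pandas as pd

# Same shape of data as the session above: 450 stopwords plus 50 missing values.
stopwords = pd.Series(
    ['word_num{}'.format(i) for i in range(450)] + [None] * 50,
    dtype=object,
)

values = stopwords.dropna().values   # NumPy array, built once
lookup = set(values)                 # hash set, built once
probe = 'word_num999'                # not a stopword, so an array scan must visit every element

timings = {
    'dropna() on every check': timeit.timeit(
        lambda: probe in stopwords.dropna().values, number=200),
    'array built once': timeit.timeit(lambda: probe in values, number=200),
    'set built once': timeit.timeit(lambda: probe in lookup, number=200),
}
for label, seconds in timings.items():
    print('{:<26}{:.6f} s'.format(label, seconds))
```

The first variant pays for `dropna()` and a linear scan on every call, the second only for the scan, and the third for one hash lookup, which is why building the set up front wins by such a wide margin.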