I added an answer below, where I show an example with random data: converting your stopwords to a set at the start is much faster than converting the pandas Series to numpy each time an ngram is checked. @JanMusil yes, you are right. Thank you for your help!!!
import re
from multiprocessing import Pool

def clean_ngram(ng):
    # normalize 'percent' and 'point' to their symbols
    if 'percent' in ng:
        ng = ng.replace('percent', '%')
    if 'point' in ng:
        ng = ng.replace('point', '.')
    # keep the ngram only if it does not start or end with a stopword
    # and is not made up solely of digits, dots, percent signs and spaces
    if ng.split(' ')[0] not in stopwords['First'].dropna().values \
            and ng.split(' ')[-1] not in stopwords['Last '].dropna().values \
            and not re.match(r"^[0-9.% ]+$", ng):
        return ng

# serial version
df['Word'] = df['Word'].apply(clean_ngram)

# parallel version with two worker processes
p = Pool(processes=2)
df['Word'] = p.map(clean_ngram, df['Word'])
p.close()
p.join()
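
The same idea the benchmark below demonstrates can be applied to clean_ngram itself: build the stopword lookup sets once, instead of calling dropna() and scanning the Series values for every single ngram. This is only a sketch; the names first_stop, last_stop and clean_ngram_fast are mine, and it assumes stopwords has the same 'First' and 'Last ' columns used above.

import re

# build the lookup sets once, outside the function (hypothetical names)
first_stop = set(stopwords['First'].dropna())
last_stop = set(stopwords['Last '].dropna())

def clean_ngram_fast(ng):
    if 'percent' in ng:
        ng = ng.replace('percent', '%')
    if 'point' in ng:
        ng = ng.replace('point', '.')
    words = ng.split(' ')
    # set membership is a hash lookup instead of a scan over the array
    if words[0] not in first_stop \
            and words[-1] not in last_stop \
            and not re.match(r"^[0-9.% ]+$", ng):
        return ng

df['Word'] = df['Word'].apply(clean_ngram_fast)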
#imports needed for the example below
>>> import numpy as np
>>> import pandas as pd
>>> from time import time

#first generate 450 random stopwords plus 50 NaNs
>>> stopwords = np.array(['word_num'+str(i) for i in range(450)]+[np.nan for _ in range(50)])

#shuffle the stopwords and print some of them
>>> stopwords = pd.Series(stopwords).sample(frac=1)
>>> stopwords
304    word_num304
84      word_num84
215    word_num215
438    word_num438
276    word_num276
          ...     
217    word_num217
280    word_num280
69      word_num69
365    word_num365
404    word_num404
Length: 500, dtype: object

#generate random words to check against the stopwords
>>> ngrams = ['word_num{}'.format(int(np.random.rand()*1000)) for _ in range(20000)]
>>> ngrams = np.array(ngrams)
>>> ngrams
array(['word_num642', 'word_num729', 'word_num901', ..., 'word_num940',
       'word_num616', 'word_num58'], dtype='<U11')

#define a function that checks word presence in the stopwords pd.Series (the same way you did)
#this function also returns the time it took to run
>>> def check_ngrams_Series(ngrams,stopwords):
...     func = lambda ng: ng in stopwords.dropna().values
...     time_begin = time()
...     result = list(map(func,ngrams))
...     time_end = time()
...     return np.array(result), time_end-time_begin

#define a function that checks word presence in the stopwords converted to a set
#this function also returns the time it took to run
>>> def check_ngrams_set(ngrams,stopwords):
...     func = lambda ng: ng in stopwords
...     time_begin = time()
...     result = list(map(func,ngrams))
...     time_end = time()
...     return np.array(result), time_end-time_begin

#try to run both functions
>>> series_out = check_ngrams_Series(ngrams,stopwords)
>>> sets_out = check_ngrams_set(ngrams,set(stopwords))

#check that their first outputs (word presence in stopwords) are the same
>>> np.all(sets_out[0] == series_out[0])
True

#show how long the function that uses the set took to run
>>> sets_out[1]
0.008014917373657227

#show how long the function that uses the pd.Series took to run
>>> series_out[1]
15.30849814414978
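
In other words, on these 20,000 ngrams the set version is roughly three orders of magnitude faster (about 0.008 s versus 15.3 s here). The gap comes from two things: membership in a set is an average O(1) hash lookup, while "ng in stopwords.dropna().values" scans the whole array, and the Series version additionally rebuilds the dropna() array for every single ngram instead of once up front.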