Python: Can I parallelize code that uses pandas?

python pandas parallel-processing nlp

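I am doing NLP and need to clean my data. I wrote three functions to 1) clean the data, 2) check whether the data is on topic, and 3) check whether the data is in English.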

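I have about 8 million rows of data, and most of the computations do not depend on each other. I have considered using Pool to parallelize the code, but I am not sure whether that is wise, since all the data is stored in a pandas DataFrame (I know numba does not work well with DataFrames).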

Can I parallelize my code with Pool? Is it as simple as the code I found in the documentation? Is Pool the right library?

I should note that I am running this on Mac OS X. Here is my code for reference:

import pandas as pd
import re
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
import enchant
import numpy as np

sf = pd.read_csv('timeandtweet.csv')

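def clean_tweet(x):
    # Strip HTML markup, then keep letters only.
    cleaning = BeautifulSoup(x, "lxml")
    letters_only = re.sub("[^a-zA-Z]", " ", cleaning.get_text())
    words = letters_only.lower().split()
    # Build the stopword set once per call instead of once per word.
    stops = set(stopwords.words("english") + [u'rt'])
    words = [w for w in words if w not in stops]
    return " ".join(words)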



def on_topic(x):

    topics = [u'measles',u'mmr',u'vaccine',u'vaccines']

    if any(j in topics for j in x.split()):
        return 1
    else:
        return -1

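def is_english(x):
    lang = enchant.Dict('en_US')
    tokens = x.split()
    # A cleaned tweet can be empty; treating it as not English here is an
    # assumption made to avoid dividing by zero.
    if not tokens:
        return -1
    # Fraction of tokens that the dictionary recognizes.
    words = [lang.check(i) for i in tokens]
    if float(sum(words)) / len(tokens) < 0.6:
        return -1
    else:
        return 1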





sf['Clean Tweet'] = np.zeros_like(sf.Tweet)

sf['English-Topic'] = np.zeros_like(sf.Tweet)

for i in range(len(sf)):  # Loop instead of df.apply for speed?
    if (i + 1) % 1000 == 0:
        print("Review %d of %d\n" % (i + 1, len(sf)))

    sf['Clean Tweet'][i] = clean_tweet(sf.Tweet[i])

    sf['English-Topic'][i] = (on_topic(sf['Clean Tweet'][i]), is_english(sf['Clean Tweet'][i])  )


sf.to_csv('cleaned_processed.csv', index = False)

But I keep getting a ValueError:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

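On which line do you get the ValueError?

"ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()" usually occurs when no index value is given for the DataFrame. On exactly which line do you get the error?

Python points to

answer1 = result.get()

as the source of the error.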
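A note on why the traceback can point at result.get(): with apply_async, an exception raised inside the worker is stored and re-raised in the parent when get() is called, so the line reported is not necessarily where the problem started. The sketch below reproduces that behaviour under assumptions: check() is a hypothetical worker function, not the asker's code, and ThreadPool is used only so the snippet runs without a `__main__` guard (a process Pool re-raises in the same way). The actual cause of the asker's error is not shown in the question.

```python
from multiprocessing.pool import ThreadPool

import numpy as np


def check(values):
    # Hypothetical worker: comparing an array to a scalar gives an array of
    # booleans, and using that array as an `if` condition is ambiguous.
    if values == 0:
        return "all zero"
    return "non-zero"


with ThreadPool(1) as pool:
    result = pool.apply_async(check, (np.zeros(3),))
    try:
        # The ValueError raised inside check() surfaces here, in the caller.
        result.get()
        caught = None
    except ValueError as exc:
        caught = str(exc)

print(caught)
```

If the worker meant "every element is zero", writing `if (values == 0).all():` removes the ambiguity.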

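On the original question of whether Pool can be used with a DataFrame: one common pattern is to split the frame into row chunks, process each chunk in a worker, and concatenate the results. The following is a minimal sketch of that pattern, not a drop-in fix. clean_text is a hypothetical stand-in for clean_tweet (it avoids the BeautifulSoup and nltk dependencies), and split_frame, process_chunk, parallel_apply, and the "On Topic" column are illustrative names that do not appear in the question.

```python
import multiprocessing as mp

import pandas as pd


def clean_text(text):
    # Hypothetical stand-in for clean_tweet: lowercase, keep alphabetic tokens.
    return " ".join(w for w in text.lower().split() if w.isalpha())


def on_topic(text):
    topics = {"measles", "mmr", "vaccine", "vaccines"}
    return 1 if any(w in topics for w in text.split()) else -1


def process_chunk(chunk):
    # Runs inside a worker process on one slice of the DataFrame.
    out = chunk.copy()
    out["Clean Tweet"] = out["Tweet"].apply(clean_text)
    out["On Topic"] = out["Clean Tweet"].apply(on_topic)
    return out


def split_frame(df, n_chunks):
    # Ceiling division so every row lands in exactly one chunk.
    size = max(1, -(-len(df) // n_chunks))
    return [df.iloc[i:i + size] for i in range(0, len(df), size)]


def parallel_apply(df, n_workers=4):
    # Each chunk is pickled, sent to a worker, and the result is sent back.
    chunks = split_frame(df, n_workers)
    with mp.Pool(n_workers) as pool:
        results = pool.map(process_chunk, chunks)
    return pd.concat(results)


if __name__ == "__main__":
    demo = pd.DataFrame({"Tweet": ["Measles outbreak reported",
                                   "RT hello world 123",
                                   "MMR vaccine news",
                                   "nothing here"]})
    print(parallel_apply(demo, n_workers=2))
```

The `if __name__ == "__main__":` guard matters on Mac OS X with recent Python versions, where worker processes are started with the spawn method and re-import the main module. Because each chunk has to be pickled in both directions, the gain depends on the per-row work being expensive relative to that transfer, which seems plausible for the dictionary and stopword checks here but is worth measuring on a sample first.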