Python: optimizing the performance of processing a DataFrame

python pandas csv cython large-data

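I have a DataFrame with two columns: Time (a string) and Tweet (also a string).

I have written some code that cleans tweets for a machine learning application, and I want to apply that cleaning function to the entire Tweet column.

My problem is that the data is fairly large: about 8,500,000 rows.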

    Time                                        Tweet
0   """Tue Apr 07 10:40:33 +0000 2015"""    " ""@caiittlliinnx @RoscoStraughan man flu is the worst though so I have to back Ross up here like"""
1   """Tue Apr 07 10:40:35 +0000 2015"""    " ""RT @CTVLondon: Two farms near Woodstock have been placed under quarantined after avian influenza was found at a turkey farm."""
2   """Tue Apr 07 10:40:38 +0000 2015"""    " ""@DREWXAVECAO @EXFLOP Flu fez as pazes com Fla, mas coleciona desafetos dentro e fora do futebol (07\/04\/15-06h00):"""
3   """Tue Apr 07 10:40:42 +0000 2015"""    " ""3-0 hahahahaha going up mmr ;)"""
4   """Tue Apr 07 10:40:42 +0000 2015"""    " ""Flu itu sesuatu :\/"""

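I have tried three approaches:

Loop: I tried a simple loop that applies the cleaning function to each element of the df.Tweet Series. This works, but it takes forever.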

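import numpy as np

# Create the column with item assignment; attribute assignment (df.cleaned = ...)
# does not add a new column to the DataFrame.
df['cleaned'] = np.zeros_like(df.Tweet)
for i in range(len(df)):
    df.loc[i, 'cleaned'] = clean_tweet(df.Tweet[i])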

I tried the df.apply method, but that also takes a very long time and, worse, I cannot monitor how much of the data has been cleaned.
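One way to get the missing progress feedback is to do the iteration yourself and print a counter every so often. The following is only a minimal sketch: `apply_with_progress`, `report_every` and `out` are hypothetical names introduced for illustration, not part of pandas.

```python
import sys
import time

def apply_with_progress(func, values, report_every=100_000, out=sys.stderr):
    """Apply func to each item of values, reporting progress periodically."""
    total = len(values)
    results = []
    start = time.perf_counter()
    for done, value in enumerate(values, start=1):
        results.append(func(value))
        # Report every `report_every` items and once more at the very end.
        if done % report_every == 0 or done == total:
            elapsed = time.perf_counter() - start
            print(f"{done}/{total} rows cleaned in {elapsed:.1f}s", file=out)
    return results
```

With the DataFrame from the question this would presumably be called as `df['cleaned'] = apply_with_progress(clean_tweet, df.Tweet)`. It does not make the cleaning itself any faster; it only shows how far along the run is.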

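I tried using Pool:

from multiprocessing import Pool
import pandas as pd

pool = Pool(4)
chunksize = 1000000
reader = pd.read_csv('filename', chunksize=chunksize)

flist = []
for chunk in reader:
    # Note: each chunk is a DataFrame, while clean_tweet expects a single string
    f = pool.apply_async(clean_tweet, [chunk])
    flist.append(f)

# Wait for every chunk to finish, then stitch the results back together
alist = [j.get() for j in flist]
df = pd.concat(alist)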
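For comparison, here is a hedged sketch of a Pool variant that hands individual strings to the workers rather than whole DataFrame chunks. `clean_text` is a hypothetical stand-in for `clean_tweet`, and `clean_in_parallel` is a name made up for this example.

```python
from multiprocessing import Pool

def clean_text(text):
    # Hypothetical stand-in for clean_tweet. It is defined at module level
    # so that worker processes can look it up by name.
    return " ".join(text.lower().split())

def clean_in_parallel(texts, workers=4, chunksize=10_000):
    """Clean an iterable of strings with a pool of worker processes."""
    with Pool(workers) as pool:
        # imap preserves the input order and ships work to the workers in
        # batches of `chunksize` items to keep inter-process overhead low.
        return list(pool.imap(clean_text, texts, chunksize=chunksize))
```

A call such as `df['cleaned'] = clean_in_parallel(df.Tweet)` would then go in the main script, under an `if __name__ == "__main__":` guard on platforms that spawn rather than fork worker processes. Whether this beats a single process depends on how expensive the per-tweet work is relative to the cost of pickling the strings.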

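I am also not sure whether this works. I cannot let my script run for days, so I am looking for something that can finish overnight at the latest.

Can anyone recommend an approach? Maybe a way to simplify the script? I am open to suggestions.

Here is the code for clean_tweet: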

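import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def clean_tweet(x):
    # Strip any HTML markup
    cleaning = BeautifulSoup(x, "lxml")
    # Replace everything that is not a letter with a space
    letters_only = re.sub("[^a-zA-Z]", " ", cleaning.get_text())
    words = letters_only.lower().split()
    # Drop English stop words and the retweet marker 'rt'
    words = [w for w in words if w not in (stopwords.words("english") + [u'rt'])]
    almost = " ".join(words)
    # Remove any remaining single-character words
    return re.sub(r'\b\w{1}\b', '', almost)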
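Note that the list comprehension above evaluates `stopwords.words("english") + [u'rt']` once for every word of every tweet, and then searches the resulting list linearly. A likely cheap win is to build the stop-word collection once, as a set, and to precompile the regular expression. The sketch below makes several assumptions: a tiny hard-coded `STOP_WORDS` set stands in for the NLTK list so that it runs without NLTK, the BeautifulSoup step is left out, and single-letter words are filtered directly instead of with a second `re.sub`, so the whitespace in the output differs slightly from the original.

```python
import re

# Assumption: a tiny hard-coded stop-word set stands in for
# set(stopwords.words("english")) | {"rt"}; it is built once, not per word.
STOP_WORDS = frozenset(["a", "i", "is", "the", "to", "so", "have", "rt"])

# Compiled once at import time instead of on every call.
NON_LETTERS = re.compile(r"[^a-zA-Z]")

def clean_tweet_fast(text):
    """Keep letters only, lowercase, drop stop words and 1-letter words."""
    words = NON_LETTERS.sub(" ", text).lower().split()
    # Set membership is a constant-time lookup, unlike scanning a list.
    return " ".join(w for w in words if len(w) > 1 and w not in STOP_WORDS)
```

For example, `clean_tweet_fast("RT @CTVLondon: Flu is the worst!")` gives `'ctvlondon flu worst'` with the stand-in stop-word set.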

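What is the code for clean_tweet? The overhead may be there.

@DeepSpace You may be right. See the edit.

I would start by using str methods instead of the two re.sub calls, or at least the first one. re.sub is a resource hog compared to str.replace.

How about doing some profiling first? Use cProfile. It comes with the standard library.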
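As a rough illustration of the profiling suggestion, the sketch below runs a cleaning function over a batch of strings under cProfile and returns the report as text. `clean_sample` is a hypothetical stand-in for `clean_tweet`, and `profile_cleaning` is a name invented for this example.

```python
import cProfile
import io
import pstats
import re

def clean_sample(text):
    # Hypothetical stand-in for clean_tweet, kept small for illustration.
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    return " ".join(letters_only.lower().split())

def profile_cleaning(texts, top=10):
    """Run clean_sample over texts under cProfile and return a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    for text in texts:
        clean_sample(text)
    profiler.disable()

    # Write the statistics into a string buffer instead of stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()
```

Profiling a sample of a few thousand rows, for instance `print(profile_cleaning(df.Tweet.head(10000)))`, is usually enough to see which call dominates before committing to a run over all 8,500,000 rows.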
