Python: How to optimize nested-loop code that processes DataFrames

python pandas optimization nlp

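I'm new to optimization and need help improving the run time of my code. It accomplishes my task, but it takes forever. Are there any suggestions for making it run faster?

Here is the code: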

def probabilistic_word_weighting(df, lookup):

    # instantiate new place holder for class weights for each text sequence in the df
    class_probabilities = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    for index, row in lookup.iterrows():
        if row.word in df.words.split():
            class_proba_ = row.class_proba.strip('][').split(', ')
            class_proba_ = [float(i) for i in class_proba_]
            class_probabilities = [a + b for a, b in zip(class_probabilities, class_proba_)]

    return class_probabilities

The two input DataFrames look like this:

lookup

index    word                               class_proba
6231    been    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
8965    havent  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
3270    derive  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
7817    a       [0.0, 0.0, 7.451379, 6.552, 0.0, 0.0, 0.0, 0.0]
3452    hello   [0.0, 0.0, 0.0, 0.0, 0.000155327, 0.0, 0.0, 0.0]
5112    they    [0.0, 0.0, 0.00032289312, 0.0, 0.0, 0.0, 0.0, 0.0]
1012    time    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
7468    some    [0.000193199, 0.0, 0.0, 0.000212947, 0.0, 0.0, 0.0, 0.0]
6428    people  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
5537    scuba   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]

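What it basically does is iterate over every row in lookup, each of which holds a word and its relative class weights. If the word is found in a text sequence in df.words, the class_proba of that lookup word is added to the class_probabilities variable assigned to that sequence. In other words, for every row in df it loops over every row of lookup.

How can this be done faster?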

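IIUC, you are applying your function with df.apply, but you can do it as follows instead. The idea is not to redo the operations on the rows of lookup every time a matching word is found, but to do them once, and to reshape df so that the operations can be vectorized.
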
1: Use str.split, stack and to_frame to reshape the words column of df so that each word gets its own row:
s_df = df['words'].str.split(expand=True).stack().to_frame(name='split_word')
print (s_df.head(8))
    split_word
0 0          i
  1     havent
  2       been
  3       back
1 0        but
  1        its
2 0       they
  1       used

2: Reshape lookup with set_index on the word column, str.strip, str.split and astype to get a DataFrame with the words as the index and each value of class_proba in its own column:
split_lookup = lookup.set_index('word')['class_proba'].str.strip('][')\
                     .str.split(', ', expand=True).astype(float)
print (split_lookup.head())
          0    1         2      3         4    5    6         7
word                                                           
been    0.0  0.0  0.000000  0.000  0.000000  0.0  0.0  5.278995
havent  0.0  0.0  0.000000  0.000  0.000000  0.0  0.0  5.278995
derive  0.0  0.0  0.000000  0.000  0.000000  0.0  0.0  5.278995
a       0.0  0.0  7.451379  6.552  0.000000  0.0  0.0  0.000000
hello   0.0  0.0  0.000000  0.000  0.000155  0.0  0.0  0.000000

3: Merge the two, drop the unnecessary column, then groupby level=0, which is the original index of df, and sum:
df_proba = s_df.merge(split_lookup, how='left',
                      left_on='split_word', right_index=True)\
               .drop('split_word', axis=1)\
               .groupby(level=0).sum()
print (df_proba.head())
          0    1         2         3    4    5    6         7
0  0.000000  0.0  0.000000  0.000000  0.0  0.0  0.0  10.55799
1  0.000000  0.0  0.000000  0.000000  0.0  0.0  0.0   0.00000
2  0.000000  0.0  0.000323  0.000000  0.0  0.0  0.0   0.00000
3  0.000000  0.0  0.000000  0.000000  0.0  0.0  0.0   0.00000
4  0.000193  0.0  7.451379  6.552213  0.0  0.0  0.0   0.00000

4: Finally, convert the result to a list of lists with to_numpy and tolist, and assign it back to the original df:
df['class_proba'] = df_proba.to_numpy().tolist()
print (df.head())
                                           words  \
0                          i  havent  been  back   
1                                       but  its   
2              they  used  to  get  more  closer   
3                                        no  way   
4  when  we  have  some  type  of  a  thing  for   

                                         class_proba  
0   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.55798974]  
1           [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  
2  [0.0, 0.0, 0.00032289312, 0.0, 0.0, 0.0, 0.0, ...  
3           [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  
4  [0.000193199, 0.0, 7.451379, 6.552212946999999...  
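
The four steps above can be sketched end to end as a single function. This is only a sketch under a few assumptions: the name vectorized_word_weighting and the count_repeats flag are hypothetical, the sample data is modeled on the rows shown above, the index of df is assumed to be unique, and explode is used in place of split(expand=True) plus stack so that a text with no words still keeps its row. Note that the merge counts a word every time it occurs in a text, whereas the `in` test of the original loop counts it once; count_repeats=False is one way to get the loop's behavior back.

```python
import pandas as pd


def vectorized_word_weighting(df, lookup, count_repeats=True):
    """Return one summed class_proba list per row of df (hypothetical helper)."""
    # One row per word; explode() keeps the original row label as the index,
    # and a text with no words still yields one (NaN) row.
    words = df["words"].str.split().explode().to_frame(name="split_word")
    if not count_repeats:
        # Mimic the `in` test of the original loop: keep each word once per text.
        repeated = words.reset_index().duplicated().to_numpy()
        words = words[~repeated]

    # Parse the "[0.0, 0.0, ...]" strings once into float columns.
    proba = (lookup.set_index("word")["class_proba"]
                   .str.strip("][")
                   .str.split(", ", expand=True)
                   .astype(float))

    # Attach the weights to every word, then add them up per original row.
    # Words missing from lookup contribute NaN, which sum() skips.
    summed = (words.merge(proba, how="left",
                          left_on="split_word", right_index=True)
                   .drop(columns="split_word")
                   .groupby(level=0).sum()
                   .reindex(df.index))
    return summed.to_numpy().tolist()


# Sample data modeled on the rows shown above.
df = pd.DataFrame({"words": ["i havent been back", "but its",
                             "they used to get more closer", "no way",
                             "when we have some type of a thing for"]})
lookup = pd.DataFrame({
    "word": ["been", "havent", "a", "they", "some"],
    "class_proba": [
        "[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]",
        "[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]",
        "[0.0, 0.0, 7.451379, 6.552, 0.0, 0.0, 0.0, 0.0]",
        "[0.0, 0.0, 0.00032289312, 0.0, 0.0, 0.0, 0.0, 0.0]",
        "[0.000193199, 0.0, 0.0, 0.000212947, 0.0, 0.0, 0.0, 0.0]",
    ],
})
df["class_proba"] = vectorized_word_weighting(df, lookup)
print(df.head())
```

The reindex call at the end is there because groupby sorts its result by index label, so it realigns the sums with the row order of df before they are turned into a list.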

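This will mildly optimize the for loop: switch to for row in lookup.itertuples(): instead of for index, row in lookup.iterrows():, since itertuples is faster than iterrows. Also make sure to use the in operator on a set rather than a list, because membership testing is faster on sets: instead of if row.word in df.words.split():, switch to word_set = set(df.words.split()) (defined once, outside the for loop) and then use if row.word in word_set.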
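
Put together, those two changes might look like the sketch below. It keeps the structure of the function from the question, only renaming the first parameter to row to make clear that it receives a single row of df. The sample data and the list comprehension that calls the function are assumptions for illustration; the question presumably calls it through df.apply.

```python
import pandas as pd


def probabilistic_word_weighting(row, lookup):
    """Loop-based version using itertuples() and a set for membership tests."""
    class_probabilities = [0.0] * 8
    # Build the set once, outside the loop; `in` on a set is O(1) on average,
    # while `in` on a list scans the whole list.
    word_set = set(row.words.split())
    for entry in lookup.itertuples(index=False):
        if entry.word in word_set:
            class_proba_ = [float(x)
                            for x in entry.class_proba.strip("][").split(", ")]
            class_probabilities = [a + b for a, b
                                   in zip(class_probabilities, class_proba_)]
    return class_probabilities


# Sample data modeled on the rows shown in the question.
df = pd.DataFrame({"words": ["i havent been back", "but its",
                             "they used to get more closer"]})
lookup = pd.DataFrame({
    "word": ["been", "havent", "they"],
    "class_proba": [
        "[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]",
        "[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]",
        "[0.0, 0.0, 0.00032289312, 0.0, 0.0, 0.0, 0.0, 0.0]",
    ],
})
df["class_proba"] = [probabilistic_word_weighting(row, lookup)
                     for row in df.itertuples(index=False)]
print(df)
```

This still visits every row of lookup for every row of df, so the gain is a constant factor rather than a change in how the work scales.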
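
This is magic, thank you. Not only did it solve my problem, it also made the optimization easier to implement.
@connor449 glad it helped :) I did some timing: on the given input it is almost twice as fast, but if you increase the size of df by a factor of 10, it becomes more than 14 times faster!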