New DataFrame column as a generic function of other rows (pandas)

python pandas dataframe vectorization

What is the fastest (and most efficient) way to create a new column in a DataFrame that is a function of other rows in pandas?
Consider the following example:
import pandas as pd

d = {
    'id': [1, 2, 3, 4, 5, 6],
    'word': ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant']
}
pandas_df = pd.DataFrame(d)

This produces:
   id word
0   1  cat
1   2  hat
2   3  hag
3   4  hog
4   5  dog
5   6  elephant

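Suppose I want to create a new column, bar, whose value is based on the output of a function foo that compares the word in the current row with the words in the other rows of the DataFrame: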
def foo(word1, word2):
    # do some calculation
    return foobar  # in this example, the return type is numeric

threshold = some_threshold

for index, _id, word in pandas_df.itertuples():
    value = sum(
        pandas_df[pandas_df['word'] != word].apply(
            lambda x: foo(x['word'], word),
            axis=1
        ) < threshold
    )
    pandas_df.loc[index, 'bar'] = value
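
To make the loop above concrete, here is a minimal runnable sketch. It assumes, purely for illustration, that foo is the Levenshtein (edit) distance and that threshold is 2; neither choice is fixed by the question, and the pure-Python implementation of foo is only a stand-in.

```python
import pandas as pd


def foo(word1, word2):
    """Assumed example for foo: Levenshtein distance (dynamic programming)."""
    previous = list(range(len(word2) + 1))
    for i, c1 in enumerate(word1, start=1):
        current = [i]
        for j, c2 in enumerate(word2, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (c1 != c2),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


threshold = 2  # assumed value, for illustration only

pandas_df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'word': ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant'],
})

# Same loop as in the question: one apply() over the other rows per row
for index, _id, word in pandas_df.itertuples():
    others = pandas_df[pandas_df['word'] != word]
    value = sum(others.apply(lambda x: foo(x['word'], word), axis=1) < threshold)
    pandas_df.loc[index, 'bar'] = value

print(pandas_df)
```

With these assumptions, bar counts how many other words are within one edit of the current word, for example 1 for 'cat' (only 'hat') and 0 for 'elephant'.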

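I also have a version of this question for Spark DataFrames. I thought it made sense to split the two into separate questions so that neither would be too broad. However, I have generally found that solutions to similar pandas problems can sometimes be modified to work for Spark.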


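Inspired by my answer to the Spark version of this question, I tried using a Cartesian product in pandas. My speed tests show that this is slightly faster (although I suspect that may vary with the size of the data). Unfortunately, I still cannot get around calling apply().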


Example code:
import numpy as np
from nltk.metrics.distance import edit_distance as edit_dist

pandas_df2 = pd.DataFrame(d)

i, j = np.where(np.ones((len(pandas_df2), len(pandas_df2))))
cart = pandas_df2.iloc[i].reset_index(drop=True).join(
    pandas_df2.iloc[j].reset_index(drop=True), rsuffix='_r'
)

cart['dist'] = cart.apply(lambda x: edit_dist(x['word'], x['word_r']), axis=1)
pandas_df2 = (
    cart[cart['dist'] < 2].groupby(['id', 'word']).count()['dist'] - 1
).reset_index()
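
As a possible variation on the Cartesian product above, the sketch below builds the pairs with merge(how='cross') (available in pandas 1.2 and later) and replaces the row-wise apply() with a list comprehension over the two word columns. The edit_dist function here is a pure-Python stand-in for nltk's edit_distance so that the example runs without nltk; the similarity function is still called once per pair, so this only trims overhead.

```python
import pandas as pd


def edit_dist(s1, s2):
    """Stand-in for nltk's edit_distance: plain Levenshtein distance."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (c1 != c2)))
        previous = current
    return previous[-1]


pandas_df2 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'word': ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant'],
})

# Cartesian product of the frame with itself; right-hand columns get the '_r' suffix
cart = pandas_df2.merge(pandas_df2, how='cross', suffixes=('', '_r'))

# One Python-level call per pair, but without the per-row overhead of DataFrame.apply
cart['dist'] = [edit_dist(a, b) for a, b in zip(cart['word'], cart['word_r'])]

# Keep close pairs, excluding each row paired with itself, and count them per id
close = cart[(cart['dist'] < 2) & (cart['id'] != cart['id_r'])]
counts = close.groupby('id').size()

# Rows with no close neighbour are missing from counts, so fill them with 0
result = pandas_df2.assign(bar=pandas_df2['id'].map(counts).fillna(0).astype(int))
print(result)
```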

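Let's try to analyze the problem:
If you have n rows, then you have n*n "pairs" to consider in your similarity function. In the general case, evaluating all of them is unavoidable (this sounds very reasonable, but I cannot prove it). Hence, you have at least O(n^2) time complexity.
What you can do, however, is try to reduce the constant factors of that time complexity.
The possible options I found are: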

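1. Parallelization:
Since you have some large DataFrames, parallelizing the processing is the most obvious choice. It does not change the O(n^2) complexity, but it gives you an (almost) linear improvement in running time: with 16 workers you get an (almost) 16x speedup.
For example, we can partition the rows of the df into disjoint parts, process each part individually, and then combine the results.
A very basic parallel version might look like this: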
from multiprocessing import cpu_count, Pool

import numpy as np
import pandas as pd

def work(part):
    """
    Args:
        part (DataFrame) : a part (collection of rows) of the whole DataFrame.

    Returns:
        DataFrame: the same part, with the desired property calculated and added as a new column
    """
    # Note that we are using the original df (pandas_df) as a global variable,
    # but changes made in this function will not be global (a side effect of using multiprocessing).
    part = part.copy()  # avoid assigning into a slice of the original DataFrame
    for index, _id, word in part.itertuples():  # iterate over the "part" tuples
        value = sum(
            pandas_df[pandas_df['word'] != word].apply(  # calculate the desired function using the whole original df
                lambda x: foo(x['word'], word),
                axis=1
            ) < threshold
        )
        part.loc[index, 'bar'] = value
    return part

# New code starts here ...

# pandas_df, foo and threshold must be defined at module level so the workers can see them
if __name__ == '__main__':  # needed on platforms that spawn workers instead of forking
    cores = cpu_count()  # number of CPU cores on your system

    # Split the row positions into chunks, then slice the DataFrame by position
    chunks = np.array_split(np.arange(len(pandas_df)), cores)
    data_split = [pandas_df.iloc[chunk] for chunk in chunks if len(chunk)]

    with Pool(cores) as pool:  # a pool of worker processes (not threads)
        # apply `work` to each part; this gives you a list of the new parts
        new_parts = pool.map(work, data_split)

    new_df = pd.concat(new_parts)  # concatenate the new parts


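Note: I have tried to keep the code as close to the OP's code as possible. This is only basic demonstration code, and many better alternatives exist.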

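2. "Low-level" optimizations:
Another option is to optimize the computation of the similarity function itself and the iteration/mapping around it. I don't think this will give you much of a speedup compared with the previous option or the next one.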
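
As one illustration of what such a low-level optimization could look like, the sketch below drops DataFrame.apply in favour of plain Python lists and evaluates each unordered pair only once. This relies on the assumption that foo is symmetric, i.e. foo(a, b) == foo(b, a), which the question does not state; foo is again taken to be Levenshtein distance and threshold to be 2, for illustration only.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def foo(word1, word2):
    """Assumed symmetric similarity function: Levenshtein distance."""
    previous = list(range(len(word2) + 1))
    for i, c1 in enumerate(word1, start=1):
        current = [i]
        for j, c2 in enumerate(word2, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (c1 != c2)))
        previous = current
    return previous[-1]


threshold = 2  # assumed value

pandas_df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'word': ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant'],
})

words = pandas_df['word'].tolist()
counts = np.zeros(len(words), dtype=int)
calls = 0

# Visit each unordered pair (i, j) with i < j exactly once
for i, j in combinations(range(len(words)), 2):
    if words[i] == words[j]:
        continue  # the question skips rows holding the same word
    calls += 1
    if foo(words[i], words[j]) < threshold:
        # By symmetry, a close pair counts for both rows
        counts[i] += 1
        counts[j] += 1

pandas_df['bar'] = counts
print(pandas_df)
print(calls)  # 15 evaluations instead of 30 for 6 distinct words
```

This still performs on the order of n^2 evaluations; it only halves the number of calls to foo and removes the per-row pandas overhead.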

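3. Function-dependent pruning:
The last thing you can try is improvements that depend on the similarity function itself. This does not work in the general case, but it can be very effective if you are able to analyze the similarity function. For example:
Assuming you are using Levenshtein distance (LD), you can observe that the distance between any two strings is at least the difference between their lengths, i.e. LD(s1, s2) >= abs(len(s1) - len(s2)).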

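You can use this observation to prune the pairs that need to be evaluated: for each string of length l1, compare it only with strings of length l2 for which abs(l1 - l2) is below the distance threshold, since no other pair can be close enough.

Idea on preprocessing (groupby)
Because you are looking for edit distances less than 2, you can first group the strings by length. If the lengths of two groups differ by 2 or more, there is no need to compare them. (This part is very similar to section 3 of Qusai Alothman's answer.)
So, the first thing to do is group by string length:
df["length"] = df.word.str.len()
df.groupby("length")[["id", "word"]]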

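Then compute the edit distances within each group and between every two consecutive groups whose lengths differ by less than 2. This is not directly related to your question, but I hope it helps.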
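
The grouping idea can be sketched as follows. This is a minimal illustration, assuming the words are unique, that the similarity function is Levenshtein distance (edit_dist here is a pure-Python stand-in), and that the cutoff is an edit distance below 2, as in the question's second example.

```python
import pandas as pd


def edit_dist(s1, s2):
    """Stand-in Levenshtein distance (assumed similarity function)."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (c1 != c2)))
        previous = current
    return previous[-1]


df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'word': ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant'],
})
df['length'] = df['word'].str.len()

# Map each string length to the list of words having that length
groups = {length: group['word'].tolist() for length, group in df.groupby('length')}

counts = dict.fromkeys(df['word'], 0)  # assumes unique words
evaluated = 0

for l1, words1 in groups.items():
    for l2, words2 in groups.items():
        # LD(s1, s2) >= abs(len(s1) - len(s2)), so these pairs cannot be closer than 2
        if abs(l1 - l2) >= 2:
            continue
        for w1 in words1:
            for w2 in words2:
                if w1 != w2:
                    evaluated += 1
                    if edit_dist(w1, w2) < 2:
                        counts[w1] += 1

df['bar'] = df['word'].map(counts)
print(df)
print(evaluated)  # 20 distance evaluations instead of 30 without pruning
```

On this toy data the only saving comes from never comparing 'elephant' with the three-letter words; the benefit grows when string lengths are more spread out.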
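Potential vectorization (after groupby)
After that, you may also try to vectorize the calculation by splitting each string into characters. Note that if the cost of splitting is greater than the benefit vectorization brings, you should not do this. Alternatively, when you create the DataFrame, build one that holds characters rather than words in the first place.
We will use the answer to another question to split each string into characters: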
df["length"] = df.word.str.len()
df.groupby("length")[["id", "word"]]

# assuming we had grouped the df.
df_len_3 = pd.DataFrame({"word": ['cat', 'hat', 'hag', 'hog', 'dog']})
# turn it into chars
splitted = df_len_3.word.apply(lambda x: pd.Series(list(x)))

    0   1   2
0   c   a   t
1   h   a   t
2   h   a   g
3   h   o   g
4   d   o   g

splitted.loc[0] == splitted # compare one word to all words

    0       1       2
0   True    True    True  -> comparing to itself is always all true.
1   False   True    True
2   False   True    False
3   False   False   False
4   False   False   False


splitted.apply(lambda x: (x == splitted).sum(axis=1).ge(len(x)-1), axis=1).sum(axis=1) - 1

0    1
1    2
2    2
3    2
4    1
dtype: int64
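
The same per-character comparison can also be written with NumPy broadcasting, which avoids the row-wise apply entirely. The sketch below is one possible formulation, not part of the original answer. It relies on the fact that for two strings of equal length, an edit distance below 2 is equivalent to at most one mismatching position, because a single insertion or deletion would change the length.

```python
import numpy as np
import pandas as pd

df_len_3 = pd.DataFrame({"word": ['cat', 'hat', 'hag', 'hog', 'dog']})

# Character matrix of shape (n_words, word_length); all words have equal length
chars = np.array([list(word) for word in df_len_3['word']])

# Broadcast to shape (n, n, length) and count mismatching positions per pair
mismatches = (chars[:, None, :] != chars[None, :, :]).sum(axis=2)

# Count words with at most one mismatch, minus 1 for each word matching itself
df_len_3['bar'] = (mismatches <= 1).sum(axis=1) - 1
print(df_len_3)
```

The intermediate boolean array has n * n * length elements, so for large groups it may be necessary to process the rows in chunks.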