Python: How to avoid looping over a categorical variable in pandas to view/manipulate DataFrame slices/subsets

python pandas dataframe loops slice


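I have a large DataFrame with a categorical variable. For each value of the categorical variable, I want to extract values from the corresponding subset of the DataFrame and save them as a collection of lists (used to build sparse vectors in the code example I provide).

My current approach iterates over each value of the categorical variable, selects the rows with that value, and then extracts lists from that sub-DataFrame. It is quite slow, and I think that is due to two things: looping over the DataFrame and creating the sub-DataFrames.

I would like to speed this up and find a way to avoid this kind of loop over temporary DataFrames (I find myself doing this a lot in my code). To give a sense of scale for my current project, I have about 7k categories across 5 million observations. I include code below to demonstrate my current workflow:

DataFrame setup: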

import pandas as pd

c1=['a','b','c','d','e']*5
c2=[4,8,3,5,6]*6
c3=list(range(1,11))*3

df=pd.DataFrame(list(zip(c1,c2,c3)),columns=['catvar','weight','loc'])

Function that loops over subsets of the DataFrame:

from scipy.sparse import csr_matrix

def make_sparse_vectors(df,
                        loc_colname='loc',
                        weighting_colname='weight',
                        cat_colname='catvar',
                       ):
    # create list of ids:
    id_list=list(df[cat_colname].unique())

    # length of sparse vector:
    vlength=max(df[loc_colname])+1

    # loop to create sparse vectors:
    sparse_vector_dict={}
    for i in id_list:
        df_temp=df[df[cat_colname]==i]

        temp_loc_list=df_temp[loc_colname].tolist()
        temp_weight=df_temp[weighting_colname].tolist()
        temp_row_list=[0]*len(temp_loc_list)

        sparse_vector_dict[i]=csr_matrix((temp_weight,(temp_row_list,temp_loc_list)),shape=(1,vlength))
    
    return sparse_vector_dict

make_sparse_vectors(df)

The loop I am concerned about:

for i in id_list:
    df_temp=df[df[cat_colname]==i]

Output of make_sparse_vectors(df):

{'a': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>,
 'b': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>,
 'c': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>,
 'd': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>,
 'e': <1x11 sparse matrix of type '<class 'numpy.intc'>'
    with 2 stored elements in Compressed Sparse Row format>}

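Some thoughts:

Pandas' groupby() function seems like a natural fit, but from the documentation it appears to be used mainly for reducing the dimensionality of a DataFrame. While useful in some cases, it does not fit this problem (since the lists I am extracting have, in aggregate, the same dimensions as the DataFrame).
Masking might help, but I cannot think of a mask that would get me there without a loop.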

I'm not sure what you want returned, but you should use groupby. Here is how I would do it:

loc_colname='loc'
weighting_colname='weight'
cat_colname='catvar'
vlength = max(df[loc_colname]+1)

new_df = df.groupby(cat_colname).apply(create_sparse_vectors)

To get a dict:

df_dict = new_df.to_dict()
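For readers who want to run the idea end to end, here is a minimal self-contained sketch. It is a variant rather than the answer's exact code: it iterates over the groups produced by `groupby` instead of calling `apply`, so that pandas splits the rows once rather than building a boolean mask per category. The helper name `group_to_sparse_vector` is hypothetical, and the data is the toy DataFrame from the question.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy data from the question; zip() truncates to the shortest list (25 rows).
c1 = ['a', 'b', 'c', 'd', 'e'] * 5
c2 = [4, 8, 3, 5, 6] * 6
c3 = list(range(1, 11)) * 3
df = pd.DataFrame(list(zip(c1, c2, c3)), columns=['catvar', 'weight', 'loc'])

# Length of every sparse vector: largest location index plus one.
vlength = int(df['loc'].max()) + 1


def group_to_sparse_vector(group):
    """Build a 1 x vlength CSR row from one category's rows (hypothetical helper)."""
    locs = group['loc'].to_numpy()
    weights = group['weight'].to_numpy()
    rows = np.zeros(len(locs), dtype=int)  # every entry lives in row 0
    return csr_matrix((weights, (rows, locs)), shape=(1, vlength))


# groupby splits the frame once; each iteration yields (category value, sub-frame).
sparse_vector_dict = {
    name: group_to_sparse_vector(group)
    for name, group in df.groupby('catvar')
}
```

The result has the same shape as the dictionary returned by `make_sparse_vectors(df)` in the question: one 1 x 11 sparse matrix per category.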

You can also speed this process up considerably with swifter. However, if the overhead is too large, it may end up slower.

fast_df = df.groupby(cat_colname).swifter.apply(create_sparse_vectors)

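What is csr_matrix doing?

@Kenan It is creating a sparse matrix for each value of the categorical variable (using the weights and locations/indices extracted from the DataFrame). That is what the lists I create in the loop I am concerned about are used for. I see that I somehow missed copying and pasting the line containing the import; I will add it to the code in the question.

I see, so apply() lets me use groupby() with a custom function. The key difference here is that it returns a pandas object containing the information, whereas my previous approach returned a dictionary containing the same information. Based on running this toy example 1000 times with timeit, your approach is about 15% faster.

Updated for a multiprocessing approach and for converting the df to a dict.

Reading the link you posted about swifter/dask, it does look very useful. However, I get an error when trying to run the example code:

AttributeError: 'DataFrameGroupBy' object has no attribute 'swifter'

Based on the thread, using swifter with groupby() has not been implemented yet.

Oh, I see. If you are happy with my solution, perhaps give it a try. Please don't forget to accept the answer so that this question can be closed.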

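# Helper passed to groupby(...).apply(...) in the answer. It relies on the globals
# loc_colname, weighting_colname, and vlength defined there, and on
# csr_matrix imported from scipy.sparse.
def create_sparse_vectors(df_temp):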
    temp_loc_list=df_temp[loc_colname].tolist()
    temp_weight=df_temp[weighting_colname].tolist()
    temp_row_list=[0]*len(temp_loc_list)

    return csr_matrix((temp_weight,(temp_row_list,temp_loc_list)),shape=(1,vlength))
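
If a dictionary of separate 1 x N matrices is not strictly required, one possible loop-free alternative (not part of the original answer) is to build a single CSR matrix with one row per category. This is a minimal sketch, assuming the question's toy data and assuming `pd.factorize` is an acceptable way to map category values to row indices. The names `codes`, `categories`, and `matrix` are illustrative.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy data from the question; zip() truncates to the shortest list (25 rows).
c1 = ['a', 'b', 'c', 'd', 'e'] * 5
c2 = [4, 8, 3, 5, 6] * 6
c3 = list(range(1, 11)) * 3
df = pd.DataFrame(list(zip(c1, c2, c3)), columns=['catvar', 'weight', 'loc'])

# Map each category value to an integer row index, in order of first appearance.
codes, categories = pd.factorize(df['catvar'])

# Number of columns: largest location index plus one.
vlength = int(df['loc'].max()) + 1

# One matrix for all categories: row = category code, column = loc, value = weight.
# Entries that share the same (row, column) pair are summed by csr_matrix.
matrix = csr_matrix(
    (df['weight'].to_numpy(), (codes, df['loc'].to_numpy())),
    shape=(len(categories), vlength),
)

# Optional: recover a per-category dictionary of 1 x vlength rows.
sparse_vector_dict = {cat: matrix[i] for i, cat in enumerate(categories)}
```

Row `i` of `matrix` corresponds to `categories[i]`. Building the matrix involves no Python-level loop over categories; only the optional dictionary step iterates, and it slices rows of an already built matrix rather than filtering the DataFrame.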