Python: speeding up an iterrows() query that uses a mask

python pandas

I have a large dataset similar in content to this:

import numpy as np
import pandas as pd

test = pd.DataFrame({'date':['2018-08-01','2018-08-01','2018-08-02','2018-08-03','2019-09-01','2019-09-02','2019-09-03','2020-01-02','2020-01-03','2020-01-04','2020-10-04','2020-10-05'],
                 'account':['a','a','a','a','b','b','b','c','c','c','d','e']})

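For each account, I am trying to create a column that marks the rows with the earliest date as 'Yes' (including every row that shares that earliest date if it is repeated) and all other rows as 'No'. I am using the code below. It works fine on a smaller subset of this data, but it cannot cope with the full, larger dataset.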

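# Earliest date per account (one row per account, indexed by account)
first_date = test.groupby('account').agg({'date': 'min'})

test['first_date'] = 'No'
# Builds a full-length boolean mask for every account, which is the slow part
for account, row in first_date.iterrows():
    mask = (test.account == account) & (test.date == row['date'])
    test.loc[mask, 'first_date'] = 'Yes'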

Any ideas for improvement? I am fairly new to Python and am already running into runtime problems with larger datasets in pandas DataFrames. Thanks in advance.

In general, when working with pandas or numpy, we want to avoid iterating over the data and use the provided vectorized methods instead.

Use groupby.transform to get, for every row, the minimum date of that row's account, then use np.where to create the conditional column:

m = test['date'] == test.groupby('account')['date'].transform('min')
test['first_date'] = np.where(m, 'Yes', 'No')


          date account first_date
0   2018-08-01       a        Yes
1   2018-08-01       a        Yes
2   2018-08-02       a         No
3   2018-08-03       a         No
4   2019-09-01       b        Yes
5   2019-09-02       b         No
6   2019-09-03       b         No
7   2020-01-02       c        Yes
8   2020-01-03       c         No
9   2020-01-04       c         No
10  2020-10-04       d        Yes
11  2020-10-05       e        Yes
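
The comparison above works on the date strings as they are, which is fine here because zero-padded ISO dates sort the same way as the dates they represent. If the real data might hold dates in another format, a variant that parses them first could look like the sketch below. The helper name flag_first_dates and its parameters are made up for illustration and are not part of pandas; the sketch assumes every value in the date column can be parsed by pd.to_datetime.

```python
import numpy as np
import pandas as pd


def flag_first_dates(df, group_col="account", date_col="date"):
    """Return a 'Yes'/'No' Series marking rows that hold their group's earliest date.

    Hypothetical helper for illustration only; not a pandas API.
    """
    # Parse to datetime64 so the comparison is chronological rather than
    # lexicographic (assumes all values are parseable dates).
    dates = pd.to_datetime(df[date_col])
    # transform('min') broadcasts each group's minimum back onto its rows,
    # so the result lines up with the original index.
    earliest = dates.groupby(df[group_col]).transform("min")
    return pd.Series(np.where(dates == earliest, "Yes", "No"), index=df.index)


test = pd.DataFrame({
    "date": ["2018-08-01", "2018-08-01", "2018-08-02", "2018-08-03",
             "2019-09-01", "2019-09-02", "2019-09-03", "2020-01-02",
             "2020-01-03", "2020-01-04", "2020-10-04", "2020-10-05"],
    "account": ["a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "e"],
})
test["first_date"] = flag_first_dates(test)
print(test)
```

This still does a single grouped pass instead of one mask per account, and on the sample data it should give the same Yes/No column as the table above.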
