Python 3.x: How to update a dataframe from another dataframe using sets

python-3.x pandas function numpy dataframe


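I am trying to update df2 with each row of df1. If unique_value matches, the price in df2's Price_array should be updated according to the Status in df1; if it does not match, the row should be appended to df2 and assigned a new ID.

This is part 2 of the question from:

Note:
active and new: add
suspended and inactive: remove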
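
The rule in the note can be sketched on plain Python sets. This is only an illustration: the helper name `apply_status` and the two constants are hypothetical and do not appear in the original post.

```python
# Hypothetical helper illustrating the add/remove rule from the note.
ADD_STATUSES = {"active", "new"}
REMOVE_STATUSES = {"suspended", "inactive"}


def apply_status(prices, status, price):
    """Return a new set with `price` added or removed according to `status`."""
    updated = set(prices)  # copy, so the caller's set is left untouched
    if status in ADD_STATUSES:
        updated.add(price)
    elif status in REMOVE_STATUSES:
        updated.discard(price)  # discard does not raise if the price is absent
    return updated


# Replay the three df1 rows for xyz123, starting from its df2 price set.
prices = {4.55}
for status, price in [("active", 6.67), ("new", 7.55), ("inactive", 4.55)]:
    prices = apply_status(prices, status, price)
print(prices)
```

Using `discard` rather than `remove` is a design choice here: it tolerates a "remove" row whose price was never in the set.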

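df1 (no ID column):

        unique_value        Status            Price
0       xyz123              active            6.67
1       eff987              suspended         1.75
2       efg125              active            5.77
3       xyz123              new               7.55
4       xyz123              inactive          4.55
5       eff987              new               5.55

df2:

        unique_value        Price_array       ID
0       xyz123              {4.55}            1000
1       xyz985              {1.31}            1001
2       abc987              {4.56}            1002
3       eff987              {1.75}            1003
4       asd541              {8.85}            1004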

Expected output for the updated df2:

        unique_value        Price_array       ID
0       xyz123              {6.67,7.55}       1000    <- updated (added 6.67, added 7.55, removed 4.55)
1       xyz985              {1.31}            1001    
2       abc987              {4.56}            1002
3       eff987              {5.55}            1003    <- updated (removed 1.75, added 5.55)
4       asd541              {8.85}            1004
5       efg125              {5.77}            1005    <- appended and new ID assigned

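Is there a way to update df2's Price_array with the prices from df1, based on the Status in df1? My idea is similar to this (the "Status" column is dropped from the broadcasting part of the code):

But I get the following error: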

ValueError                                Traceback (most recent call last)
<ipython-input-156-6ff78c7a4a9a> in <module>()
     46     if mask[i]:
     47         # Broadcast refresh table into the matched rows in historical
---> 48         df2.loc[df2["unique_value"] == uv1, ["unique_value", "Price"]] = df1.iloc[i, :].values.reshape((1,3))
     49 

/anaconda/envs/pyfull36/lib/python3.6/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
    192             key = com._apply_if_callable(key, self.obj)
    193         indexer = self._get_setitem_indexer(key)
--> 194         self._setitem_with_indexer(indexer, value)
    195 
    196     def _has_valid_type(self, k, axis):

/anaconda/envs/pyfull36/lib/python3.6/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
    581                     value = np.array(value, dtype=object)
    582                     if len(labels) != value.shape[1]:
--> 583                         raise ValueError('Must have equal len keys and value '
    584                                          'when setting with an ndarray')
    585 

ValueError: Must have equal len keys and value when setting with an ndarray
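
The message points at a width mismatch: two target columns (the "keys") are selected, but the assigned array has been reshaped to (1, 3). A minimal reproduction, assuming small made-up frames shaped like df1 and df2 (the values are illustrative only):

```python
import pandas as pd

# Illustrative frames; the values are made up for this sketch.
df1 = pd.DataFrame({"unique_value": ["xyz123"], "Status": ["active"], "Price": [6.67]})
df2 = pd.DataFrame({"unique_value": ["xyz123"], "Price": [4.55], "ID": [1000]})

row = df1.iloc[0, :].values.reshape((1, 3))  # three values: unique_value, Status, Price
target_cols = ["unique_value", "Price"]      # but only two target columns

raised = False
try:
    df2.loc[df2["unique_value"] == "xyz123", target_cols] = row
except ValueError as err:
    raised = True
    print("ValueError:", err)

# Assigning one scalar to one column avoids the width mismatch entirely.
df2.loc[df2["unique_value"] == "xyz123", "Price"] = df1.at[0, "Price"]
print(df2)
```

In general, the number of columns on the left of the assignment has to match the second dimension of the array on the right.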

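The following code consists of 3 main steps:

Set up the dataframes, then .join them.

Use np.where and set math to update 'Price_array'. Older versions of pandas reportedly raise a TypeError when aggregating with set; this is not an issue in pandas 1.1.2.

Use .update to fill in any missing ID values.
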

Set up the dataframes

This is what they look like at the start:

import numpy as np
import pandas as pd

# set up the dataframes
df1 = pd.DataFrame({'unique_value': ['xyz123', 'eff987', 'efg125', 'xyz123', 'xyz123', 'eff987'],
                    'Status': ['active', 'suspended', 'active', 'new', 'inactive', 'new'],
                    'Price': [6.67, 1.75, 5.77, 7.55, 4.55, 5.55]})
df2 = pd.DataFrame({'unique_value': ['xyz123', 'xyz985', 'abc987', 'eff987', 'asd541'],
                    'Price_array': [{4.55}, {1.31}, {4.56}, {1.75}, {8.85}],
                    'ID': [1000, 1001, 1002, 1003, 1004]})

# df1
  unique_value     Status  Price
0       xyz123     active   6.67
1       eff987  suspended   1.75
2       efg125     active   5.77
3       xyz123        new   7.55
4       xyz123   inactive   4.55
5       eff987        new   5.55

# df2
  unique_value Price_array    ID
0       xyz123      {4.55}  1000
1       xyz985      {1.31}  1001
2       abc987      {4.56}  1002
3       eff987      {1.75}  1003
4       asd541      {8.85}  1004

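Set up the dataframes for joining

# for df2, set unique_value as the index
df2.set_index('unique_value', inplace=True)

# for df1, group by unique_value and aggregate Price into a set
df1g = df1.groupby('unique_value').agg({'Price': set})

# join df2 and df1g
dfj = df2.join(df1g, how='outer')

# replace NaN with an empty string '', then convert '' to an empty set;
# NaN cannot be replaced with a set directly
# (on pandas 2.1+, DataFrame.map is the newer name for applymap)
dfj[['Price_array', 'Price']] = dfj[['Price_array', 'Price']].fillna('').applymap(set)

# dfj
             Price_array      ID               Price
unique_value
abc987            {4.56}  1002.0                  {}
asd541            {8.85}  1004.0                  {}
eff987            {1.75}  1003.0        {1.75, 5.55}
efg125                {}     NaN              {5.77}
xyz123            {4.55}  1000.0  {4.55, 6.67, 7.55}
xyz985            {1.31}  1001.0                  {}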
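
The fillna('') step works because set('') is an empty set. As an alternative sketch that skips the intermediate empty string, each cell can be mapped explicitly. The helper name `to_set` is hypothetical, and the small frame below only mimics the joined dfj.

```python
import numpy as np
import pandas as pd

# A small stand-in for the joined frame, with NaN where one side had no match.
dfj = pd.DataFrame(
    {
        "Price_array": [{4.56}, np.nan, {4.55}],
        "Price": [np.nan, {5.77}, {4.55, 6.67, 7.55}],
    },
    index=pd.Index(["abc987", "efg125", "xyz123"], name="unique_value"),
)


def to_set(value):
    """Hypothetical helper: keep sets as they are, turn anything else (NaN) into an empty set."""
    return value if isinstance(value, set) else set()


# Series.map applies the helper to every cell, including the NaN ones.
for col in ["Price_array", "Price"]:
    dfj[col] = dfj[col].map(to_set)

print(dfj)
```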

Use np.where and set math to update 'Price_array'

If 'Price' is not an empty set, use x.Price - x.Price_array; otherwise, keep x.Price_array.

The order of the set math matters:

{4.56} - set() is {4.56}
set() - {4.56} is set()
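
Set difference is not commutative, which is why the operand order matters. A short illustration on plain Python sets (the variable names are chosen for this sketch only):

```python
a = {4.56}
b = set()

print(a - b)  # elements of a that are not in b
print(b - a)  # elements of b that are not in a

# The same operation applied to the xyz123 row: prices seen in df1 minus the old price set.
new_prices = {4.55, 6.67, 7.55}
old_prices = {4.55}
print(new_prices - old_prices)
```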



# use np.where and set math to update Price_array
dfj['Price_array'] = dfj[['Price_array', 'Price']].apply(
    lambda x: np.where(len(x.Price) > 0, x.Price - x.Price_array, x.Price_array),
    axis=1)

# drop the Price column
dfj.drop(columns=['Price'], inplace=True)

# reset the index
dfj.reset_index(inplace=True)

# dfj
  unique_value   Price_array      ID
0       abc987        {4.56}  1002.0
1       asd541        {8.85}  1004.0
2       eff987        {5.55}  1003.0
3       efg125        {5.77}     NaN
4       xyz123  {6.67, 7.55}  1000.0
5       xyz985        {1.31}  1001.0
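
Because the condition is evaluated one row at a time, a plain conditional expression expresses the same choice. np.where returns a NumPy array rather than the set itself, so this variant may be preferable if later code calls set methods on the cells. A sketch under that assumption, using a small stand-in for dfj:

```python
import pandas as pd

# Stand-in for the joined frame after NaN has been replaced by empty sets.
dfj = pd.DataFrame(
    {
        "Price_array": [{4.56}, {1.75}, set()],
        "Price": [set(), {1.75, 5.55}, {5.77}],
    },
    index=pd.Index(["abc987", "eff987", "efg125"], name="unique_value"),
)

# An empty set is falsy, so `if price` plays the role of len(x.Price) > 0.
dfj["Price_array"] = [
    price - price_array if price else price_array
    for price_array, price in zip(dfj["Price_array"], dfj["Price"])
]
print(dfj["Price_array"])
```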

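Fill in any missing 'ID' values

# extract all rows with a missing ID
dfjna = dfj.loc[dfj.ID.isna()].copy()

# get the max ID value from the ID column
idm = int(dfj.ID.max())

# assign a range starting at idm + 1 to all missing ID values
dfjna.ID = range(idm + 1, idm + len(dfjna) + 1)

# use dfjna to update the missing ID values in dfj
dfj.update(dfjna)

# set the ID column to int
dfj.ID = dfj.ID.astype(int)

# display(dfj)
  unique_value   Price_array    ID
0       abc987        {4.56}  1002
1       asd541        {8.85}  1004
2       eff987        {5.55}  1003
3       efg125        {5.77}  1005
4       xyz123  {6.67, 7.55}  1000
5       xyz985        {1.31}  1001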
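
The same ID fill can also be sketched with a boolean mask and a single .loc assignment instead of a copy plus .update. The frame below is a reduced stand-in for dfj, and the names `missing` and `next_id` are introduced for this illustration only.

```python
import numpy as np
import pandas as pd

# Reduced stand-in for dfj: one row still has no ID after the join.
dfj = pd.DataFrame(
    {
        "unique_value": ["abc987", "efg125", "xyz123"],
        "ID": [1002.0, np.nan, 1000.0],
    }
)

missing = dfj["ID"].isna()             # rows that still need an ID
next_id = int(dfj["ID"].max()) + 1     # first unused ID
dfj.loc[missing, "ID"] = np.arange(next_id, next_id + missing.sum())
dfj["ID"] = dfj["ID"].astype(int)      # NaN is gone, so int conversion is safe
print(dfj)
```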
Thanks.

The question's attempted code:

# additional state variables
# 1. for the ID to be added
current_max_id = df2["ID"].max()
# 2. for matching unique_values, avoiding searching df2["unique_value"] every time
current_value_set = set(df2["unique_value"].values)

# match unique_value's using the state variable instead of `df2`
mask = df1["unique_value"].isin(current_value_set)

for i in range(len(df1)):
    
    # current unique_value from df1
    uv1 = df1["unique_value"][i]
    
    # 1. update existing
    if mask[i]:
        
        # broadcast df1 into the matched rows in df2 (mind the shape)
        df2.loc[df2["unique_value"] == uv1, ["unique_value", "Status", "Price"]] = df1.iloc[i, :].values.reshape((1, 3))
        
        #UPDATE PRICE with PRICE_ARRAY
        ...see below

    # 2. append new
    else:
        # update state variables
        current_max_id += 1
        current_value_set.add(uv1)
        # append the row (assumes df2.index=[0,1,2,3,...])
        df2.loc[len(df2), :] = [df1.iloc[i, 0], df1.iloc[i, 1], df1.iloc[i, 2], current_max_id]

        curr_price=df1.iloc[i,df1.columns.get_loc('Price')]
        if df1.iloc[i,df1.columns.get_loc('Status')] in ('inactive', 'suspended'):
            df2.loc[df2["unique_value"] == uv1,'Price_array'].discard(curr_price)
        else:
            df2.loc[df2["unique_value"] == uv1,'Price_array'].add(curr_price)  
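
One likely problem with the last lines of the attempt: `df2.loc[..., 'Price_array']` returns a Series, not a set. A Series has no `discard` method, and `Series.add` is element-wise addition rather than set insertion. A minimal status-aware sketch that works on plain dictionaries and rebuilds the frame afterwards is shown here; the names `prices`, `ids`, `next_id` and `result` are introduced for this illustration, and giving an unseen value an empty set before applying its status is an assumption.

```python
import pandas as pd

df1 = pd.DataFrame({
    "unique_value": ["xyz123", "eff987", "efg125", "xyz123", "xyz123", "eff987"],
    "Status": ["active", "suspended", "active", "new", "inactive", "new"],
    "Price": [6.67, 1.75, 5.77, 7.55, 4.55, 5.55],
})
df2 = pd.DataFrame({
    "unique_value": ["xyz123", "xyz985", "abc987", "eff987", "asd541"],
    "Price_array": [{4.55}, {1.31}, {4.56}, {1.75}, {8.85}],
    "ID": [1000, 1001, 1002, 1003, 1004],
})

# Work on plain dicts keyed by unique_value, then rebuild the frame.
prices = {uv: set(p) for uv, p in zip(df2["unique_value"], df2["Price_array"])}
ids = dict(zip(df2["unique_value"], df2["ID"]))
next_id = int(df2["ID"].max()) + 1

for uv, status, price in df1.itertuples(index=False, name=None):
    if uv not in prices:  # unseen value: append it and assign a new ID
        prices[uv] = set()
        ids[uv] = next_id
        next_id += 1
    if status in ("active", "new"):
        prices[uv].add(price)
    elif status in ("suspended", "inactive"):
        prices[uv].discard(price)

result = pd.DataFrame({
    "unique_value": list(prices),
    "Price_array": list(prices.values()),
    "ID": [ids[uv] for uv in prices],
})
print(result)
```

Unlike the groupby approach, this version reads the Status of every df1 row, so the outcome depends on the order of the rows in df1.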
