Python: generating a change log from a DataFrame
I crawl a set of pages every day and want to track what changes between crawls. Each day I extract the content I need from the pages and write it to a history table. Then, for each URL, I build a DataFrame from its history. So far I have been able to get this:
                          from            to
crawl_id
20190609 price              50           100
20190613 price             100           140
         vdp_url  www.url1.com  www.url2.com
20190614 vdp_url  www.url2.com  www.url1.com
20190616 vdp_url  www.url1.com  www.url3.com
I need to generate something like this:
[{"date": "20190609", "from": 50, "to": 100, "field": "price"}, {"date": "20190613", "from": 100, "to": 140, "field": "price"},{"date": "20190613", "from": "www.url1.com", "to": "www.url2.com", "field": "vdp_url"}, {"date": "20190614", "from": "www.url2.com", "to": "www.url1.com", "field": "vdp_url"}, {"date": "20190616", "from": "www.url1.com", "to": "www.url3.com", "field": "vdp_url"}]
This is the code I used to generate the DataFrame above:
import pandas as pd

# One row per crawl; None marks a value that was missing in that crawl.
history_rows = [{'crawl_id': '20190606', 'vdp_url': 'www.url1.com', 'price': None},
                {'crawl_id': '20190607', 'vdp_url': 'www.url1.com', 'price': None},
                {'crawl_id': '20190609', 'vdp_url': 'www.url1.com', 'price': 50},
                {'crawl_id': '20190613', 'vdp_url': 'www.url1.com', 'price': 100},
                {'crawl_id': '20190614', 'vdp_url': 'www.url2.com', 'price': 140},
                {'crawl_id': '20190615', 'vdp_url': 'www.url1.com', 'price': None},
                {'crawl_id': '20190616', 'vdp_url': 'www.url1.com', 'price': 140},
                {'crawl_id': '20190617', 'vdp_url': 'www.url3.com', 'price': 140}]
histories_df = pd.DataFrame(history_rows)
trimmed_histories = histories_df.set_index('crawl_id')
# Despite the name, shift(-1) lines each row up with the NEXT crawl's values.
histories_df_prev = trimmed_histories.shift(-1)
# True where a non-missing value differs from the value in the next crawl.
diff_bool = trimmed_histories.where(trimmed_histories.values != histories_df_prev.values).notna().stack()
# Pair each changed value with its successor; dropna() discards pairs with a missing side.
difference = pd.concat([trimmed_histories.stack()[diff_bool], histories_df_prev.stack()[diff_bool]], axis=1).dropna()
difference.columns = ['from', 'to']
For hours I have been trying plain for loops, iterrows, indexing, groupby, and anything else I could find to get there, with no luck.
Thanks.
Ummm, use to_dict:
difference.rename_axis(['date','field']).reset_index().to_dict('records')
Out[128]:
[{'date': '20190609', 'field': 'price', 'from': 50.0, 'to': 100.0},
{'date': '20190613', 'field': 'price', 'from': 100.0, 'to': 140.0},
{'date': '20190613',
'field': 'vdp_url',
'from': 'www.url1.com',
'to': 'www.url2.com'},
{'date': '20190614',
'field': 'vdp_url',
'from': 'www.url2.com',
'to': 'www.url1.com'},
{'date': '20190616',
'field': 'vdp_url',
'from': 'www.url1.com',
'to': 'www.url3.com'}]
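To make the three chained calls easier to follow, here is a minimal sketch that applies them one at a time. The small frame below is a hand-built stand-in for difference (only three of its rows, typed in by hand), and the intermediate names named, flat and records are illustrative, not part of the original answer.

```python
import pandas as pd

# Hand-built stand-in for `difference`: a two-level index
# (crawl_id, field name) with 'from' and 'to' columns.
index = pd.MultiIndex.from_tuples(
    [("20190609", "price"), ("20190613", "price"), ("20190613", "vdp_url")],
    names=["crawl_id", None],  # the second level is unnamed, as after stack()
)
difference = pd.DataFrame(
    {"from": [50, 100, "www.url1.com"], "to": [100, 140, "www.url2.com"]},
    index=index,
)

# 1. Give both index levels the names wanted in the output.
named = difference.rename_axis(["date", "field"])
# 2. Turn the index levels into ordinary columns.
flat = named.reset_index()
# 3. Emit one dict per row.
records = flat.to_dict("records")
print(records)
```

Without the rename_axis step, reset_index would still work, but the resulting columns would carry the old or auto-generated level names instead of date and field.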
Wow! This is exactly what I needed, thank you so much! I think the key part I was missing was rename_axis(['date','field']).reset_index().
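For readers who prefer the plain-loop route the question mentions, the following is a rough sketch of the same comparison without stack() or boolean indexing. It is an alternative, not the approach from the answer; the function name build_changelog and the variable names are made up for illustration, and it assumes (as the original code does) that each crawl is compared with the next one and that pairs with a missing value on either side are skipped.

```python
import pandas as pd


def build_changelog(history):
    """Return one record per field whose value differs in the next crawl."""
    following = history.shift(-1)  # row i now holds the values of crawl i+1
    records = []
    for field in history.columns:
        pairs = zip(history.index, history[field], following[field])
        for crawl_id, old, new in pairs:
            # Skip pairs where either side is missing, like the dropna() step.
            if pd.notna(old) and pd.notna(new) and old != new:
                records.append(
                    {"date": crawl_id, "field": field, "from": old, "to": new}
                )
    # Order by crawl, then by field name, to match the desired output.
    return sorted(records, key=lambda r: (r["date"], r["field"]))


rows = [
    {"crawl_id": "20190606", "vdp_url": "www.url1.com", "price": None},
    {"crawl_id": "20190607", "vdp_url": "www.url1.com", "price": None},
    {"crawl_id": "20190609", "vdp_url": "www.url1.com", "price": 50},
    {"crawl_id": "20190613", "vdp_url": "www.url1.com", "price": 100},
    {"crawl_id": "20190614", "vdp_url": "www.url2.com", "price": 140},
    {"crawl_id": "20190615", "vdp_url": "www.url1.com", "price": None},
    {"crawl_id": "20190616", "vdp_url": "www.url1.com", "price": 140},
    {"crawl_id": "20190617", "vdp_url": "www.url3.com", "price": 140},
]
history = pd.DataFrame(rows).set_index("crawl_id")
changelog = build_changelog(history)
print(changelog)
```

The loop is slower than the vectorized version on large histories, but each step is explicit, which may make it easier to adapt, for example to record changes to or from a missing value.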