Python 3 DataFrame KeyError problem
I have a DataFrame named crawls. When I run this code:
crawl_stats = (
    crawls['updated']
    .groupby(crawls.index.get_level_values('url'))
    .agg({
        'number of crawls': 'count',
        'proportion of updates': 'mean',
        'number of updates': 'sum'
    })
)
it raises this error:
KeyError Traceback (most recent call last)
<ipython-input-62-180f1041465d> in <module>
8 crawl_stats = (
9 crawls['updated']
---> 10 .groupby(crawls.index.get_level_values('url'))
11 # .groupby('url')
12 .agg({
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/base.py in _get_level_values(self, level)
3155 """
3156
-> 3157 self._validate_index_level(level)
3158 return self
3159
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/base.py in _validate_index_level(self, level)
1942 elif level != self.name:
1943 raise KeyError('Level %s must be same as name (%s)' %
-> 1944 (level, self.name))
1945
1946 def _get_level_number(self, level):
KeyError: 'Level url must be same as name (None)'
With .groupby('url') instead, it also raises an error:
KeyError Traceback (most recent call last)
<ipython-input-63-8c5f0f6f7c86> in <module>
9 crawls['updated']
10 # .groupby(crawls.index.get_level_values('url'))
---> 11 .groupby('url')
12 .agg({
13 'number of crawls': 'count',
3293 # Add key to exclusions
KeyError: 'url'
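Both tracebacks can be reproduced on a small frame. The sketch below is only an illustration: the sample data is hypothetical (the real crawls frame is not shown as text), and it assumes url is an ordinary column with a default, unnamed index. Under that assumption, the first call fails because the index has no level named url, and the second fails because crawls['updated'] is a Series, which no longer carries the url column.

```python
import pandas as pd

# Hypothetical sample data; the real `crawls` frame is not shown in the question.
crawls = pd.DataFrame({
    'url': ['a', 'a', 'b'],
    'updated': [0, 1, 1],
})

# Inspect what the frame actually contains before grouping.
print(crawls.index.names)       # names of the index levels
print(crawls.columns.tolist())  # names of the columns

errors = []

# 1) The default index has no level called 'url'.
try:
    crawls.index.get_level_values('url')
except KeyError as exc:
    errors.append(('index level', exc))

# 2) A Series selected with crawls['updated'] has no 'url' key to group by.
try:
    crawls['updated'].groupby('url').count()
except KeyError as exc:
    errors.append(('series groupby', exc))

for label, exc in errors:
    print(label, '->', repr(exc))

# Grouping the DataFrame first, then selecting the column, works.
counts = crawls.groupby('url')['updated'].count()
print(counts)
```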
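You need to replace this:
.groupby(crawls.index.get_level_values('url'))
with:
.groupby('url')
because the DataFrame has no index level named url.
There are two problems: you need to group by the column url, and you also need to define a list of tuples that pairs each new column name with its aggregation function: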
import pandas as pd

crawls = pd.DataFrame({
    'url': ['a','a','a','a','b','b','b'],
    'updated': list(range(7))
})
print(crawls)
  url  updated
0   a        0
1   a        1
2   a        2
3   a        3
4   b        4
5   b        5
6   b        6
# Each tuple is (new column name, aggregation function)
d = [('number of crawls', 'count'),
     ('proportion of updates', 'mean'),
     ('number of updates', 'sum')]
crawl_stats = crawls.groupby('url')['updated'].agg(d)
print(crawl_stats)
     number of crawls  proportion of updates  number of updates
url
a                   4                    1.5                  6
b                   3                    5.0                 15
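As an alternative to the list of tuples, the same result can probably be obtained with named aggregation. This is a sketch, not part of the original answer, and it assumes pandas 0.25 or later. Because the output names contain spaces, they are passed through ** unpacking of a dict rather than as plain keyword arguments.

```python
import pandas as pd

crawls = pd.DataFrame({
    'url': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
    'updated': list(range(7)),
})

# Named aggregation: output column name -> (input column, aggregation function).
# Assumes pandas >= 0.25; the dict is unpacked because the names contain spaces.
crawl_stats = crawls.groupby('url').agg(**{
    'number of crawls': ('updated', 'count'),
    'proportion of updates': ('updated', 'mean'),
    'number of updates': ('updated', 'sum'),
})
print(crawl_stats)
```

This form names the input column explicitly, so it extends naturally if more columns need to be aggregated later.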
Edit:
The problem is that the columns are non-numeric, because the data was first converted to a NumPy array, which stores mixed string and numeric inputs under one common dtype. It is better to create a dict and pass it to the DataFrame constructor.
Change:
columns = ['url','hour','updated']
data = np.array((url,hour,updated)).T
df = pd.DataFrame(data=data, columns=columns)
to:
columns = ['url','hour','updated']
df = pd.DataFrame({'url':url, 'hour':hour,'updated':updated}, columns=columns)
It is always recommended to post samples as text rather than as images. Please edit your post and let us know.
What is print(crawls.columns.tolist())?
I modified my code following your guidance, but it raises the error shown in my edited post. Could you help me?
Check the comments under the question: what is print(crawls.columns.tolist())? The KeyError means there is no column url.
I modified my code following your guidance, but it raises the error shown in my edited post. DataError: No numeric types to aggregate
@pandalai - It seems your updated column is not numeric. Try crawls['updated'] = crawls['updated'].astype(float) before my solution.
Thanks for your help @jezrael, it finally works!
@pandalai - You're welcome! If my answer was helpful, don't forget to accept it. Thanks.
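The dtype problem discussed in the edit and the comments can be illustrated as follows. This is a minimal sketch: the url, hour and updated lists are hypothetical stand-ins for the asker's real data, which is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs standing in for the asker's real lists.
url = ['a', 'a', 'b']
hour = [0, 1, 0]
updated = [1, 0, 1]
columns = ['url', 'hour', 'updated']

# Going through a single NumPy array forces one common dtype for all values,
# so the numeric columns end up as strings.
data = np.array((url, hour, updated)).T
df_bad = pd.DataFrame(data=data, columns=columns)
print(df_bad.dtypes)

# Building from a dict lets each column keep its own dtype.
df_good = pd.DataFrame({'url': url, 'hour': hour, 'updated': updated},
                       columns=columns)
print(df_good.dtypes)

# If the frame already exists, converting the column is another option,
# as suggested in the comments.
df_fixed = df_bad.copy()
df_fixed['updated'] = df_fixed['updated'].astype(float)
means = df_fixed.groupby('url')['updated'].mean()
print(means)
```

Checking df.dtypes before aggregating is a quick way to spot this kind of problem.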