Python Pandas - extract unique column combinations and count them in another table
Tags: python, pandas, dictionary
Task 1: I have a table like this:
+----------+------------+----------+------------+----------+------------+-------+
| a_name_0 | id_qname_0 | a_name_1 | id_qname_1 | a_name_2 | id_qname_2 | count |
+----------+------------+----------+------------+----------+------------+-------+
| country | 1 | NAN | NAN | NAN | NAN | 100 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 2 | city | 8 | NAN | NAN | 20 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 2 | city | 9 | NAN | NAN | 80 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 3 | age | 4 | sex | 6 | 40 |
+----------+------------+----------+------------+----------+------------+-------+
| region | 3 | age | 5 | sex | 7 | 60 |
+----------+------------+----------+------------+----------+------------+-------+
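I need to concatenate each row, drop the NaN values, and convert the resulting series into a dictionary of variable size. For example, the first three dicts would look like this: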
{'a_name_0':'country','id_qname_0':1}
{'a_name_0':'region','id_qname_0':2, 'a_name_1':'city','id_qname_1':8}
{'a_name_0':'region','id_qname_0':2, 'a_name_1':'city','id_qname_1':9}
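Each of these dictionaries should then be stored in a list.
Task 2.
Using the table below, I have to count how often the columns from the dicts of the previous step occur: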
+----------+------------+----------+------------+----------+
| id | country | city | age | sex |
+----------+------------+----------+------------+----------+
| 1 | 1 | NAN | NAN | NAN |
+----------+------------+----------+------------+----------+
| 2 | 1 | 8 | NAN | NAN |
+----------+------------+----------+------------+----------+
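If there is a faster mapping solution, please advise, because what I am about to do is likely to get very messy.
This answer did not help me, because I need an iterator that extracts the parameters and also counts their occurrences.

You can drop the count column and convert all rows to a list of dicts with to_dict and orient='records', then filter out the missing values in a dictionary comprehension:
import pandas as pd

# df is the first table from the question
# orient='records' (older pandas versions also accepted the abbreviation 'r')
L = [{k: v for k, v in x.items() if pd.notna(v)}
     for x in df.drop(columns='count').to_dict('records')]
print(L)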
[{'a_name_0': 'country', 'id_qname_0': 1},
{'a_name_0': 'region', 'id_qname_0': 2, 'a_name_1': 'city', 'id_qname_1': 8.0},
{'a_name_0': 'region', 'id_qname_0': 2, 'a_name_1': 'city', 'id_qname_1': 9.0},
{'a_name_0': 'region', 'id_qname_0': 3, 'a_name_1': 'age',
'id_qname_1': 4.0, 'a_name_2': 'sex', 'id_qname_2': 6.0},
{'a_name_0': 'region', 'id_qname_0': 3, 'a_name_1': 'age',
'id_qname_1': 5.0, 'a_name_2': 'sex', 'id_qname_2': 7.0}]
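To check the step above end to end, it can be reproduced on the sample table. This is only a sketch under stated assumptions: the DataFrame literal is typed in from the question's first table, and `row_to_dict` is a hypothetical helper name, not part of the original answer. It also casts whole-number floats back to `int`, because columns that contain NaN are stored as floats and would otherwise give `8.0` where the question asks for `8`.

```python
import numpy as np
import pandas as pd

# Sample data typed in from the first table of the question.
df = pd.DataFrame({
    "a_name_0": ["country", "region", "region", "region", "region"],
    "id_qname_0": [1, 2, 2, 3, 3],
    "a_name_1": [np.nan, "city", "city", "age", "age"],
    "id_qname_1": [np.nan, 8, 9, 4, 5],
    "a_name_2": [np.nan, np.nan, np.nan, "sex", "sex"],
    "id_qname_2": [np.nan, np.nan, np.nan, 6, 7],
    "count": [100, 20, 80, 40, 60],
})


def row_to_dict(record):
    """Drop missing values and turn whole-number floats (8.0) into ints (8)."""
    out = {}
    for key, value in record.items():
        if pd.isna(value):
            continue  # skip NaN cells so the dict has variable size
        if isinstance(value, float) and value.is_integer():
            value = int(value)
        out[key] = value
    return out


# One dict per row, without the count column.
records = [row_to_dict(r) for r in df.drop(columns="count").to_dict("records")]
print(records[0])
print(records[1])
```

The integer cast is optional. Without it the result matches the output printed above, float ids included.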
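I am not 100% sure what the second DataFrame should look like, but the names can be paired with their ids like this:
# the values of each dict alternate name, id, name, id, ...; pair each name with the id after it
L1 = [dict(zip(list(x.values())[::2], list(x.values())[1::2])) for x in L]
df = pd.DataFrame(L1)
print(df)
   country  region  city  age  sex
0      1.0     NaN   NaN  NaN  NaN
1      NaN     2.0   8.0  NaN  NaN
2      NaN     2.0   9.0  NaN  NaN
3      NaN     3.0   NaN  4.0  6.0
4      NaN     3.0   NaN  5.0  7.0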
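The answer stops before the counting asked for in Task 2. Below is one possible sketch of it, not a confirmed solution. It assumes that a combination "occurs" in a row of the second table when every column it names holds the given id, and that names which are not columns of that table (such as region) are simply skipped. Both rules are guesses at the intended behaviour. `other`, `combos` and `count_matches` are hypothetical names, and `combos` is the content of L1 typed in by hand.

```python
import numpy as np
import pandas as pd

# Second table from the question (only the two rows shown there).
other = pd.DataFrame({
    "id": [1, 2],
    "country": [1, 1],
    "city": [np.nan, 8],
    "age": [np.nan, np.nan],
    "sex": [np.nan, np.nan],
})

# Name -> id mappings, i.e. the content of L1 from the answer.
combos = [
    {"country": 1},
    {"region": 2, "city": 8},
    {"region": 2, "city": 9},
    {"region": 3, "age": 4, "sex": 6},
    {"region": 3, "age": 5, "sex": 7},
]


def count_matches(table, combo):
    """Count rows of `table` whose columns equal every value in `combo`.

    Assumption: keys that are not columns of `table` are skipped; a combo
    with no usable key counts as zero matches.
    """
    usable = {k: v for k, v in combo.items() if k in table.columns}
    if not usable:
        return 0
    mask = pd.Series(True, index=table.index)
    for column, value in usable.items():
        mask &= table[column].eq(value)  # NaN never equals a value
    return int(mask.sum())


counts = [count_matches(other, c) for c in combos]
print(counts)  # [2, 1, 0, 0, 0]
```

If a missing column should instead rule out a match, return 0 as soon as a key of `combo` is absent from `table.columns`.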
Nice, much better than my solution: for index, row in df.drop('count', axis=1).iterrows(): list_.append(row.dropna().to_json(orient='index'))
@datanoveler A list comprehension is a good approach, but I would go with any solution that works, since this is not a big-data case. Thank you, my friend jezrael! You did it again ;)
How do I extract the id of the second DataFrame?
@jezrael The 'id' is irrelevant here. I only need to check whether the columns from the dict are present in a row, and count them.