Pandas DataFrames in Python: how do I flatten data without losing rows that are missing some of the data? (json_normalize, where function)


python pandas dataframe


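I've just started learning Pandas and DataFrames in Python. My real use case is flattening EC2 EBS volume data to CSV, where some volumes have attachment data and some don't, and I'd like to do this entirely in Pandas (rather than iterating over the data beforehand to add dummy values for the missing entries).

Here is the simplest example I could come up with: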

import pandas as pd

states = [{'state': 'Florida',
           'shortname': 'FL',
           'counties': [{'name': 'Dade', 'population': 12345},
                        {'name': 'Broward', 'population': 40000},
                        {'name': 'Palm Beach', 'population': 60000}]},
          {'state': 'Ohio',
           'shortname': 'OH',
           'counties': [{'name': 'Summit', 'population': 1234},
                        {'name': 'Cuyahoga', 'population': 1337}]},
          {'state': 'New York',
           'shortname': 'NY',
           'counties': []}]

counties_normalized_data = pd.json_normalize(data=states, record_path='counties', record_prefix='county.', meta=['state', 'shortname'])

print(counties_normalized_data)

This results in:

  county.name  county.population    state shortname
0        Dade              12345  Florida        FL
1     Broward              40000  Florida        FL
2  Palm Beach              60000  Florida        FL
3      Summit               1234     Ohio        OH
4    Cuyahoga               1337     Ohio        OH

While this makes sense, I don't want to lose New York entirely. Instead, I'd like to keep New York and have county.name and county.population set to "N/A".

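So I started playing with the where function on the DataFrame, but as you may already know, it won't work unless all states have exactly the same number of counties. For example: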

import pandas as pd

states = [{'state': 'Florida',
           'shortname': 'FL',
           'counties': [{'name': 'Dade', 'population': 12345},
                        {'name': 'Broward', 'population': 40000},
                        {'name': 'Palm Beach', 'population': 60000}]},
          {'state': 'Ohio',
           'shortname': 'OH',
           'counties': [{'name': 'Summit', 'population': 1234},
                        {'name': 'Cuyahoga', 'population': 1337}]},
          {'state': 'New York',
           'shortname': 'NY',
           'counties': []}]

df = pd.DataFrame(states)


df['counties'] = df['counties'].where(df['counties'].str.len() > 0, [{'name': 'Westchester', 'population': 3456}, {'name': 'Putnam', 'population': 1000}])

This raises the following exception:

ValueError: operands could not be broadcast together with shapes (3,) (3,) (2,)
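
One possible reading of this error is that the list passed as the second argument to where is treated as an array-like of length 2 that has to line up with the 3-row Series, rather than as a single replacement value. A minimal sketch of a workaround, assuming a per-element replacement with Series.map is acceptable, is shown below. The placeholder record and the variable names are hypothetical and chosen only for illustration.

```python
import pandas as pd

states = [{'state': 'Florida',
           'shortname': 'FL',
           'counties': [{'name': 'Dade', 'population': 12345},
                        {'name': 'Broward', 'population': 40000},
                        {'name': 'Palm Beach', 'population': 60000}]},
          {'state': 'Ohio',
           'shortname': 'OH',
           'counties': [{'name': 'Summit', 'population': 1234},
                        {'name': 'Cuyahoga', 'population': 1337}]},
          {'state': 'New York',
           'shortname': 'NY',
           'counties': []}]

# Hypothetical placeholder used for states that have no counties.
placeholder = [{'name': 'N/A', 'population': 'N/A'}]

df = pd.DataFrame(states)

# Replace each empty list element by element, so no broadcasting is involved.
df['counties'] = df['counties'].map(lambda c: c if len(c) > 0 else placeholder)

# Flatten as before; every state now contributes at least one record.
flat = pd.json_normalize(data=df.to_dict('records'),
                         record_path='counties',
                         record_prefix='county.',
                         meta=['state', 'shortname'])

print(flat)
```

Note that mixing the string 'N/A' with integers turns county.population into an object column, which may or may not matter for the CSV export.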

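So what I'm trying to work out is the proper way to normalize data when some subsets of it sometimes have no values.

I've read about normalizing only a subset of the data and then merging it back, but I haven't yet come across a concrete example that I could follow.

Thanks in advance for any guidance.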
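
For reference, the "normalize a subset, then merge" idea might look roughly like the sketch below. It assumes that the pair (state, shortname) identifies each record uniquely and that a left merge is what is wanted; the variable names are illustrative only.

```python
import pandas as pd

states = [{'state': 'Florida',
           'shortname': 'FL',
           'counties': [{'name': 'Dade', 'population': 12345},
                        {'name': 'Broward', 'population': 40000},
                        {'name': 'Palm Beach', 'population': 60000}]},
          {'state': 'Ohio',
           'shortname': 'OH',
           'counties': [{'name': 'Summit', 'population': 1234},
                        {'name': 'Cuyahoga', 'population': 1337}]},
          {'state': 'New York',
           'shortname': 'NY',
           'counties': []}]

# Step 1: flatten only the nested part; New York is dropped here.
counties = pd.json_normalize(data=states,
                             record_path='counties',
                             record_prefix='county.',
                             meta=['state', 'shortname'])

# Step 2: build a frame with one row per state, including New York.
all_states = pd.DataFrame(states)[['state', 'shortname']]

# Step 3: a left merge keeps every state; unmatched rows get missing values.
merged = all_states.merge(counties, on=['state', 'shortname'], how='left')

# The merge turns the population into floats; a nullable integer type
# keeps whole numbers while still allowing a missing value.
merged['county.population'] = merged['county.population'].astype('Int64')

# Missing values can be written as "N/A" at export time.
csv_text = merged.to_csv(index=False, na_rep='N/A')
print(csv_text)
```

Keeping the missing values as real NaN/NA inside the DataFrame and only rendering them as "N/A" when writing the CSV leaves the numeric column usable for further calculations.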