Python: How do I convert multiple nested values into categorical variables with Pandas?
I am working with the Yelp dataset. For businesses, this is the first line of JSON from yelp_academic_dataset_business.json. Subsequent lines match this schema:
{
"business_id":"0DI8Dt2PJp07XkVvIElIcQ",
"name":"Innovative Vapors",
"neighborhood":"",
"address":"227 E Baseline Rd, Ste J2",
"city":"Tempe",
"state":"AZ",
"postal_code":"85283",
"latitude":33.3782141,
"longitude":-111.936102,
"stars":4.5,
"review_count":17,
"is_open":0,
"attributes":[
"BikeParking: True",
"BusinessAcceptsBitcoin: False",
"BusinessAcceptsCreditCards: True",
"BusinessParking: {
'garage': False,
'street': False,
'validated': False,
'lot': True,
'valet': False
}",
"DogsAllowed: False",
"RestaurantsPriceRange2: 2",
"WheelchairAccessible: True"
],
"categories": [
"Tobacco Shops",
"Nightlife",
"Vape Shops",
"Shopping"
],
"hours":[
"Monday 11:0-21:0",
"Tuesday 11:0-21:0",
"Wednesday 11:0-21:0",
"Thursday 11:0-21:0",
"Friday 11:0-22:0",
"Saturday 10:0-22:0",
"Sunday 11:0-18:0"
],
"type":"business"
}
I tried parsing the JSON into a CSV and importing the CSV with pd.read_csv, which gives me the following DataFrame:
+---+-----------------------------------------------------------------+
|idx| attributes |
+---+-----------------------------------------------------------------+
| 0 | BikeParking: True, BusinessAcceptsBitcoin: False, |
| | BusinessAcceptsCreditCards: True, ,DogsAllowed: False, |
| | RestaurantsPriceRange2: 2, WheelchairAccessible: True, |
| | BusinessParking: {'garage': False, |
| | 'street': False, |
| | 'validated': False, |
| | 'lot': True, |
| | 'valet': False} |
+---+-----------------------------------------------------------------+
But what I really want is:
+----+-----------------------------------+-----------------------------------+
| id | attributes_BusinessParking_garage | attributes_BusinessParking_lot |
+----+-----------------------------------+-----------------------------------+
| 0 | 0 | 1 |
+----+-----------------------------------+-----------------------------------+
I know there is pd.get_dummies, but because each cell is treated as a string, I don't get nice flat categorical columns.
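Note: for simplicity, I have left the other columns out of the example.

Have you tried using a mapping function to split out the attributes? You may need to initialize the new columns to empty strings, or to whatever data type you need, and then do something like this: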
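def split_attributes(row):
    # Assumes row["attributes"] already holds a dict, not a string.
    for k, v in row["attributes"].items():
        row[k] = v
    return row  # apply() needs the modified row back

df = df.apply(split_attributes, axis=1)  # axis=1: call once per row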
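
To make the idea above concrete, here is a minimal sketch on a toy DataFrame. It assumes every attributes cell already holds a dict, which is not the case for the CSV import in the question. The business_id values and the names expanded and result are made up for illustration, and the new columns are built with the DataFrame constructor rather than a row-wise apply.

```python
import pandas as pd

# Toy data (invented): each "attributes" cell is already a dict.
df = pd.DataFrame({
    "business_id": ["a1", "b2"],
    "attributes": [
        {"BikeParking": True, "DogsAllowed": False},
        {"BikeParking": False, "WheelchairAccessible": True},
    ],
})

# One column per dict key; keys missing from a row become NaN.
expanded = pd.DataFrame(df["attributes"].tolist(), index=df.index)
expanded = expanded.add_prefix("attributes_")

# Swap the original dict column for the flat columns.
result = df.drop(columns="attributes").join(expanded)
print(result)
```

Once the values sit in separate columns like this, pd.get_dummies or a plain astype can be applied column by column.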
Edit
Based on your updated question: have you tried using pd.read_json?
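
As a rough sketch of what that could look like: the file has one JSON object per line, so pd.read_json needs lines=True. The two records below are invented and trimmed, and they are read from an in-memory buffer so the snippet runs on its own. With the real data you would pass the path to yelp_academic_dataset_business.json instead.

```python
import io
import pandas as pd

# Two invented, trimmed records in JSON Lines form (one object per line).
raw = io.StringIO(
    '{"business_id": "a1", "stars": 4.5, '
    '"attributes": ["BikeParking: True", "DogsAllowed: False"]}\n'
    '{"business_id": "b2", "stars": 3.0, '
    '"attributes": ["BikeParking: False"]}\n'
)

# lines=True tells pandas to parse each line as a separate JSON object.
df = pd.read_json(raw, lines=True)

# Nested JSON arrays arrive as Python lists, not as one flattened string.
print(type(df.loc[0, "attributes"]))
print(df)
```

Reading the JSON directly skips the CSV round trip, so the attributes column keeps its list structure. The individual "Key: value" entries are still strings, though, and need to be parsed separately.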
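In this case, pandas does not treat the JSON as a dictionary. It is stored as a string. I also have to deal with the nested structure.

Why not convert the string to a dictionary? Are you familiar with the ast module? I realize it is not the most time-efficient solution, but if it gets you where you want to be...

Does the data come in this way, or does it look like this only after your import? Please show the original data source and the import.

This is how the data comes in. Sorry, Yelp does not hand out pandas DataFrames.

Does it originate from JSON? CSV? XML?

Oh. It is distributed as a JSON file, which I parsed into a CSV.

Please post a sample of the raw JSON, because pandas has I/O methods for importing these files that you could use instead of the linked code.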
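
The ast suggestion from the comments could be sketched as follows. This is only one possible approach, assuming each attributes cell is a list of "Key: value" strings as in the sample record. The helper flatten_attributes and its prefix parameter are hypothetical names, and the attribute list is shortened from the question.

```python
import ast
import pandas as pd

def flatten_attributes(items, prefix="attributes"):
    """Turn ['Key: value', ...] into a flat {column_name: value} dict."""
    flat = {}
    for item in items:
        key, _, raw = item.partition(": ")
        try:
            # Safely evaluates Python literals such as True, 2 or {...}.
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            value = raw  # leave text that is not a literal unchanged
        if isinstance(value, dict):
            # Nested dict: one column per inner key.
            for sub_key, sub_value in value.items():
                flat[f"{prefix}_{key}_{sub_key}"] = sub_value
        else:
            flat[f"{prefix}_{key}"] = value
    return flat

# Shortened attribute list from the sample record in the question.
attributes = [
    "BikeParking: True",
    "BusinessParking: {'garage': False, 'street': False, "
    "'validated': False, 'lot': True, 'valet': False}",
    "RestaurantsPriceRange2: 2",
]
df = pd.DataFrame({"attributes": [attributes]})

flat = pd.DataFrame(
    [flatten_attributes(a) for a in df["attributes"]], index=df.index
)

# Convert True/False columns to 1/0, leaving other columns untouched.
bool_cols = flat.select_dtypes(include="bool").columns
flat = flat.astype({col: int for col in bool_cols})
print(flat.T)
```

For this record, the lot column comes out as 1 and the garage column as 0, matching the values in the raw JSON. Businesses with different attribute sets would produce NaN in the columns they lack, so the boolean conversion may need adjusting on the full dataset.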