Efficiently merging CSV files into one MongoDB collection in Python
My data is split across two CSV files, and I'm trying to merge them and store them as a single collection of embedded documents in MongoDB. I managed to do this with Pandas DataFrames, but the groupby step is very slow because it uses the .apply method. Is there a faster/better way? Perhaps loading the CSV files into MongoDB first and then merging the data there? My code using Pandas:
#Load CSV
import pandas as pd
from pymongo import MongoClient

db = MongoClient()['mydb']  # placeholder connection; substitute your own database

df_inspection = pd.read_csv('inspection.csv')
df_stock = pd.read_csv('stock.csv')

#Group stock rows by inspection key
grouped = ( df_stock.groupby(['INSPECTION_KEY'])
            .apply(lambda x: x[['ITEM_CODE','ITEM_QTY']].to_dict('records'))
            .reset_index()
            .rename(columns={0:'ITEMS'}) )

#Merge DFs
df = df_inspection.merge(grouped, on='INSPECTION_KEY', how='left')

#Insert data
j_data = df.to_dict(orient='records')
db.collection.insert_many(j_data)
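One way to avoid the slow groupby/.apply step is to build the ITEMS lists with a plain dict in a single pass over the stock rows. A minimal sketch, assuming the column names from the CSVs below (the inline sample frames here just stand in for the real files):

```python
from collections import defaultdict

import pandas as pd

# Small frames standing in for pd.read_csv('inspection.csv') / pd.read_csv('stock.csv')
df_inspection = pd.DataFrame({
    'INSPECTION_KEY': [1, 2],
    'INSPECTION_DATE': [20181211, 20160224],
    'STORE_ID': ['A', 'A'],
    'STORE_NAME': ['AAA SHOP', 'AAA SHOP'],
})
df_stock = pd.DataFrame({
    'INSPECTION_KEY': [1, 1, 2],
    'ITEM_CODE': ['001', '002', '006'],
    'ITEM_QTY': [5, 7, 2],
})

# Group stock rows by key in one pass instead of groupby().apply()
items = defaultdict(list)
for row in df_stock.to_dict('records'):
    key = row.pop('INSPECTION_KEY')
    items[key].append(row)   # row now holds only ITEM_CODE / ITEM_QTY

# Attach the grouped items to each inspection document
docs = df_inspection.to_dict('records')
for doc in docs:
    doc['ITEMS'] = items.get(doc['INSPECTION_KEY'], [])

# docs is now ready for db.collection.insert_many(docs)
```

This trades the per-group Python callback of .apply for one linear scan, which tends to scale better on 100k+ rows.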
Desired document scheme:

[
    {
        INSPECTION_KEY,
        INSPECTION_DATE,
        STORE_ID,
        STORE_NAME,
        ITEMS: [
            {ITEM_CODE, ITEM_QTY}
        ]
    }
]
Desired collection:

[
    {
        INSPECTION_KEY: 1,
        INSPECTION_DATE: 20181211,
        STORE_ID: 'A',
        STORE_NAME: 'AAA SHOP',
        ITEMS: [
            {
                ITEM_CODE: 001,
                ITEM_QTY: 5
            },
            {
                ITEM_CODE: 002,
                ITEM_QTY: 7
            }
        ]
    },
    {
        INSPECTION_KEY: 2,
        INSPECTION_DATE: 20160224,
        STORE_ID: 'A',
        STORE_NAME: 'AAA SHOP',
        ITEMS: [
            {
                ITEM_CODE: 006,
                ITEM_QTY: 2
            },
            {
                ITEM_CODE: 002,
                ITEM_QTY: 3
            }
        ]
    },
    {
        INSPECTION_KEY: 3,
        INSPECTION_DATE: 20171013,
        STORE_ID: 'B',
        STORE_NAME: 'STORE BB',
        ITEMS: [
            {
                ITEM_CODE: 005,
                ITEM_QTY: 8
            },
            {
                ITEM_CODE: 001,
                ITEM_QTY: 2
            }
        ]
    }
]
A subset of the CSV files (the actual files have more than 100k rows):
inspection.csv
INSPECTION_KEY, INSPECTION_DATE, STORE_ID, STORE_NAME
1, 20181211, 'A', 'AAA SHOP'
2, 20160224, 'A', 'AAA SHOP'
3, 20171013, 'B', 'STORE BB'
stock.csv
INSPECTION_KEY, ITEM_CODE, ITEM_QTY
1, 001, 5
1, 002, 7
2, 006, 2
2, 002, 3
3, 005, 8
3, 001, 2
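As the question suggests, the merge can also be pushed into MongoDB itself: bulk-load each CSV into a staging collection (e.g. with mongoimport --type csv --headerline) and let a $lookup/$out aggregation embed the stock rows. A sketch of such a pipeline; the collection names 'stock' and 'inspections_merged' are assumptions:

```python
# Aggregation pipeline to run against the staging 'inspection' collection.
pipeline = [
    # Pull the matching stock rows into each inspection document
    {'$lookup': {
        'from': 'stock',                    # assumed staging collection name
        'localField': 'INSPECTION_KEY',
        'foreignField': 'INSPECTION_KEY',
        'as': 'ITEMS',
    }},
    # Keep only ITEM_CODE / ITEM_QTY inside the embedded array
    {'$project': {
        'INSPECTION_KEY': 1,
        'INSPECTION_DATE': 1,
        'STORE_ID': 1,
        'STORE_NAME': 1,
        'ITEMS.ITEM_CODE': 1,
        'ITEMS.ITEM_QTY': 1,
    }},
    # Write the merged documents to the target collection
    {'$out': 'inspections_merged'},         # assumed target collection name
]

# db.inspection.aggregate(pipeline)  # requires a live MongoDB connection
```

This keeps the join on the server, so nothing larger than the two raw CSVs ever passes through Python.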