
Efficiently merging CSV files into one MongoDB collection in Python


My data is split across two CSV files, and I am trying to merge them and store them in MongoDB as a single collection of embedded documents. I managed to do this with Pandas DataFrames, but the groupby step is very slow because of the .apply method.

Is there a faster/better way? Perhaps loading the CSV files into MongoDB first and then merging the data there?

My code using Pandas:

import pandas as pd

# load CSV
df_inspection = pd.read_csv('inspection.csv')
df_stock = pd.read_csv('stock.csv')

# group stock rows by inspection key into a list of item dicts
# ('records' spelled out; the old 'r' shorthand is deprecated)
grouped = ( df_stock.groupby('INSPECTION_KEY')
                       .apply(lambda x: x[['ITEM_CODE','ITEM_QTY']].to_dict('records'))
                       .reset_index()
                       .rename(columns={0:'ITEMS'}) )

# merge DFs
df = df_inspection.merge(grouped, on='INSPECTION_KEY', how='left')

# insert data (db is an existing pymongo Database handle)
j_data = df.to_dict(orient='records')
db.collection.insert_many(j_data)
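Since the cost is in `groupby(...).apply(...)`, one alternative worth noting (a sketch, not a benchmarked drop-in) is to skip pandas entirely and build the `ITEMS` arrays in a single pass with the stdlib `csv` module and a `defaultdict`. The demo below runs on in-memory copies of the sample rows from the question; with the real files you would pass `open('stock.csv', newline='')` instead.

```python
import csv
import io
from collections import defaultdict

def merge_csvs(inspection_file, stock_file):
    """Merge stock rows into inspection rows as an embedded ITEMS array.

    One pass over each file; no groupby/apply involved.
    """
    # INSPECTION_KEY -> list of {'ITEM_CODE': ..., 'ITEM_QTY': ...} dicts
    items_by_key = defaultdict(list)
    for row in csv.DictReader(stock_file, skipinitialspace=True):
        items_by_key[row['INSPECTION_KEY']].append(
            {'ITEM_CODE': row['ITEM_CODE'], 'ITEM_QTY': int(row['ITEM_QTY'])})

    # attach each inspection's items; empty list if it has no stock rows
    docs = []
    for row in csv.DictReader(inspection_file, skipinitialspace=True):
        row['ITEMS'] = items_by_key.get(row['INSPECTION_KEY'], [])
        docs.append(row)
    return docs

# Demo with in-memory sample data (use open(...) for the real files):
inspection = io.StringIO("INSPECTION_KEY,INSPECTION_DATE,STORE_ID,STORE_NAME\n"
                         "1,20181211,A,AAA SHOP\n")
stock = io.StringIO("INSPECTION_KEY,ITEM_CODE,ITEM_QTY\n"
                    "1,001,5\n"
                    "1,002,7\n")
docs = merge_csvs(inspection, stock)
# db.collection.insert_many(docs)  # the insert step is unchanged
```

Note that `csv` keeps every field as a string (so `ITEM_CODE` retains leading zeros like `001`); only `ITEM_QTY` is converted explicitly here.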
Subset of the CSV files (the real files have over 100k rows):

inspection.csv

INSPECTION_KEY, INSPECTION_DATE, STORE_ID, STORE_NAME
   1,            20181211,        'A',     'AAA SHOP'
   2,            20160224,        'A',     'AAA SHOP'
   3,            20171013,        'B',     'STORE BB'
stock.csv

INSPECTION_KEY, ITEM_CODE, ITEM_QTY
  1,               001,     5
  1,               002,     7
  2,               006,     2
  2,               002,     3
  3,               005,     8
  3,               001,     2
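As shown above, the sample rows are padded with spaces, and `ITEM_CODE` has leading zeros that a default `read_csv` would collapse to plain integers. A small sketch of read options that handle both (column names and values here come from the sample data; `quotechar="'"` would additionally be needed for the single-quoted strings in inspection.csv):

```python
import io
import pandas as pd

# In-memory copy of a few stock.csv rows, padded like the sample above.
sample = io.StringIO(
    "INSPECTION_KEY, ITEM_CODE, ITEM_QTY\n"
    "  1,               001,     5\n"
    "  1,               002,     7\n")

# skipinitialspace drops the padding after each comma;
# dtype=str on ITEM_CODE keeps 001 from becoming the integer 1.
df_stock = pd.read_csv(sample, skipinitialspace=True,
                       dtype={'ITEM_CODE': str})
```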
Desired document scheme:

[
 {
  INSPECTION_KEY,
  INSPECTION_DATE,
  STORE_ID,
  STORE_NAME,
  ITEMS: [
    {ITEM_CODE, ITEM_QTY}
  ]
 }
]
Desired collection:

[
 {
  INSPECTION_KEY: 1,
  INSPECTION_DATE: 20181211,
  STORE_ID: 'A',
  STORE_NAME: 'AAA SHOP',
  ITEMS: [
    {
      ITEM_CODE: 001, 
      ITEM_QTY: 5
    },
    {
      ITEM_CODE: 002, 
      ITEM_QTY: 7
    }
  ]
 },
 {
  INSPECTION_KEY: 2,
  INSPECTION_DATE: 20160224,
  STORE_ID: 'A',
  STORE_NAME: 'AAA SHOP',
  ITEMS: [
    {
      ITEM_CODE: 006, 
      ITEM_QTY: 2
    },
    {
      ITEM_CODE: 002, 
      ITEM_QTY: 3
    }
  ]
 },
 {
  INSPECTION_KEY: 3,
  INSPECTION_DATE: 20171013,
  STORE_ID: 'B',
  STORE_NAME: 'STORE BB',
  ITEMS: [
    {
      ITEM_CODE: 005, 
      ITEM_QTY: 8
    },
    {
      ITEM_CODE: 001, 
      ITEM_QTY: 2
    }
  ]
 }
]
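The "load the CSVs into MongoDB first" idea from the question could look like this: import inspection.csv and stock.csv into two collections (for instance with mongoimport), then let the server do the merge with a `$lookup` aggregation. This is only a sketch; the collection names (`inspections`, `stock`, `merged`) are placeholders, not anything from the original post.

```python
# Aggregation pipeline run against the inspections collection:
# embed matching stock rows as ITEMS, then write out a new collection.
pipeline = [
    {'$lookup': {
        'from': 'stock',                  # one document per stock.csv row
        'localField': 'INSPECTION_KEY',
        'foreignField': 'INSPECTION_KEY',
        'as': 'ITEMS',
    }},
    # drop the duplicated join key from each embedded item
    {'$project': {'ITEMS.INSPECTION_KEY': 0}},
    {'$out': 'merged'},                   # materialize the result
]
# db.inspections.aggregate(pipeline)  # run server-side via pymongo
```

Because the join runs inside MongoDB, nothing but the pipeline crosses the wire, which can help at the 100k-row scale mentioned above.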