How to load data from a mongodb collection into a pandas DataFrame?

mongodb pandas pymongo

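I am new to pandas (well, to all things "programming"), but I have been encouraged to give it a try. I have a mongodb database called "test" with a collection called "tweets". I access the database in ipython: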

import pymongo
from pymongo import MongoClient

# Connection was removed in pymongo 3; MongoClient is its replacement
client = MongoClient()
db = client.test
tweets = db.tweets

The structure of the documents in tweets is as follows:

{u'entities': {u'hashtags': [],
  u'symbols': [],
  u'urls': [],
  u'user_mentions': []},
 u'favorite_count': 0,
 u'favorited': False,
 u'filter_level': u'medium',
 u'geo': {u'coordinates': [placeholder coordinate, -placeholder coordinate], u'type': u'Point'},
 u'id': 349223842700472320L,
 u'id_str': u'349223842700472320',
 u'in_reply_to_screen_name': None,
 u'in_reply_to_status_id': None,
 u'in_reply_to_status_id_str': None,
 u'in_reply_to_user_id': None,
 u'in_reply_to_user_id_str': None,
 u'lang': u'en',
 u'place': {u'attributes': {},
  u'bounding_box': {u'coordinates': [[[placeholder coordinate, placeholder coordinate],
     [-placeholder coordinate, placeholder coordinate],
     [-placeholder coordinate, placeholder coordinate],
     [-placeholder coordinate, placeholder coordinate]]],
   u'type': u'Polygon'},
  u'country': u'placeholder country',
  u'country_code': u'example',
  u'full_name': u'name, xx',
  u'id': u'user id',
  u'name': u'name',
  u'place_type': u'city',
  u'url': u'http://api.twitter.com/1/geo/id/1820d77fb3f65055.json'},
 u'retweet_count': 0,
 u'retweeted': False,
 u'source': u'<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 u'text': u'example text',
 u'truncated': False,
 u'user': {u'contributors_enabled': False,
  u'created_at': u'Sat Jan 22 13:42:59 +0000 2011',
  u'default_profile': False,
  u'default_profile_image': False,
  u'description': u'example description',
  u'favourites_count': 100,
  u'follow_request_sent': None,
  u'followers_count': 100,
  u'following': None,
  u'friends_count': 100,
  u'geo_enabled': True,
  u'id': placeholder_id,
  u'id_str': u'placeholder_id',
  u'is_translator': False,
  u'lang': u'en',
  u'listed_count': 0,
  u'location': u'example place',
  u'name': u'example name',
  u'notifications': None,
  u'profile_background_color': u'000000',
  u'profile_background_image_url': u'http://a0.twimg.com/images/themes/theme19/bg.gif',
  u'profile_background_image_url_https': u'https://si0.twimg.com/images/themes/theme19/bg.gif',
  u'profile_background_tile': False,
  u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/241527685/1363314054',
  u'profile_image_url':       u'http://a0.twimg.com/profile_images/378800000038841219/8a71d0776da0c48dcc4ef6fee9f78880_normal.jpeg',
  u'profile_image_url_https':     u'https://si0.twimg.com/profile_images/378800000038841219/8a71d0776da0c48dcc4ef6fee9f78880_normal.jpeg', 
  u'profile_link_color': u'000000',
  u'profile_sidebar_border_color': u'FFFFFF',
  u'profile_sidebar_fill_color': u'000000',
  u'profile_text_color': u'000000',
  u'profile_use_background_image': False,
  u'protected': False,
  u'screen_name': u'placeholder screen_name',
  u'statuses_count': xxxx,
  u'time_zone': u'placeholder time_zone',
  u'url': None,
  u'utc_offset': -21600,
  u'verified': False}}

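Now, as far as I understand, pandas' main data structure, a spreadsheet-like table, is called a DataFrame. How can I load the data from my "tweets" collection into a pandas DataFrame? And how can I query for subdocuments within the database?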

Convert the cursor you get from MongoDB into a list before passing it to DataFrame:

import pandas as pd
df = pd.DataFrame(list(tweets.find()))
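
A cursor is simply an iterable of dict-like documents, so the pattern above can be wrapped in a small helper. The following is a minimal sketch rather than part of the original answer: `docs_to_frame` is a hypothetical name, and the sample documents are made up to stand in for what `tweets.find()` would yield, so the code runs without a database.

```python
import pandas as pd


def docs_to_frame(docs, drop_id=True):
    """Build a DataFrame from any iterable of dict-like documents.

    `docs` can be a pymongo cursor such as tweets.find(); a plain list
    is used below so the sketch runs without a database.
    """
    df = pd.DataFrame(list(docs))
    # MongoDB adds an "_id" field to every document; drop it if unwanted.
    if drop_id and "_id" in df.columns:
        df = df.drop(columns="_id")
    return df


# Stand-in for tweets.find(); the field values are made up.
sample_docs = [
    {"_id": 1, "lang": "en", "retweet_count": 0},
    {"_id": 2, "lang": "en", "retweet_count": 3},
]
df = docs_to_frame(sample_docs)
print(df)
```

With a live collection the call would be `docs_to_frame(tweets.find())`. Passing a projection as the second argument of `find()`, for example `tweets.find({}, {"text": 1, "lang": 1})`, limits the fields that are fetched, which can help when the documents are as wide as the tweets shown in the question.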

You can load MongoDB data into a pandas DataFrame with the following code. It works for me, and I hope it works for you too.

import pymongo
import pandas as pd
from pymongo import MongoClient

# Connection was removed in pymongo 3; MongoClient is its replacement
client = MongoClient()
db = client.database_name
input_data = db.collection_name
data = pd.DataFrame(list(input_data.find()))

If you have data in MongoDB like this:

[
    {
        "name": "Adam", 
        "age": 27, 
        "address":{
            "number": 4, 
            "street": "Main Road", 
            "city": "Oxford"
        }
     },
     {
        "name": "Steve", 
        "age": 32, 
        "address":{
            "number": 78, 
            "street": "High Street", 
            "city": "Cambridge"
        }
     }
]

You can put the data directly into a data frame as follows:

from pandas import DataFrame

df = DataFrame(list(db.collection_name.find({})))

You will get the following output:

df.head()

|    | name    | age  | address                                                      |
|----|---------|------|--------------------------------------------------------------|
| 0  | "Adam"  | 27   | {"number": 4, "street": "Main Road", "city": "Oxford"}       |
| 1  | "Steve" | 32   | {"number": 78, "street": "High Street", "city": "Cambridge"} |

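However, the subdocuments will appear as JSON in the subdocument cells. If you want to flatten the objects so that the subdocument properties appear as individual cells, you can use json_normalize without any parameters.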

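import pandas as pd

datapoints = list(db.collection_name.find({}))

# pd.json_normalize is the top-level name in pandas 1.0+;
# older versions expose it as pandas.io.json.json_normalize
df = pd.json_normalize(datapoints)

df.head()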

This gives a data frame in the following format:

|    | name    | age  | address.number | address.street | address.city |
|----|---------|------|----------------|----------------|--------------|
| 0  | "Adam"  | 27   |     4          | "Main Road"    | "Oxford"     |
| 1  | "Steve" | 32   |     78         | "High Street"  | "Cambridge"  |
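
The same function can also expand list-valued subdocuments, which is one way to get at the `entities` / `hashtags` field from the question. The sketch below is an illustration under assumptions: the tweets are made up, and the shape of each hashtag entry (`text`, `indices`) is assumed, since the documents shown in the question have empty hashtag lists.

```python
import pandas as pd

# Made-up tweets; the shape of each hashtag entry ("text", "indices") is
# an assumption, since the question's sample documents have no hashtags.
tweets_docs = [
    {
        "id_str": "1",
        "entities": {
            "hashtags": [
                {"text": "python", "indices": [0, 7]},
                {"text": "pandas", "indices": [8, 15]},
            ]
        },
    },
    {"id_str": "2", "entities": {"hashtags": []}},
    {
        "id_str": "3",
        "entities": {"hashtags": [{"text": "mongodb", "indices": [0, 8]}]},
    },
]

# One output row per hashtag; id_str is carried along from the parent tweet.
hashtags = pd.json_normalize(
    tweets_docs,
    record_path=["entities", "hashtags"],
    meta=["id_str"],
)
print(hashtags[["id_str", "text"]])
```

Tweets without hashtags contribute no rows. To filter on the MongoDB side instead, dot notation can be used in the query itself, for example `tweets.find({"entities.hashtags.text": "python"})`.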

Use:

df = pd.DataFrame.from_dict(collection)

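Great, by passing "df" the documents of the collection show up in the data columns. However, I need to query the subdocument "hashtags.text" inside "entities" in one of the documents. Do you know how I can do that from within pandas?

Can you show some examples of your documents so that I can help you? What do you need? The hashtags field?

Yes, I am interested in the hashtags field.

I have a collection with 283,000 rows, each with 10 columns (5 doubles, 2 longs, 2 strings and 1 ISODate). It takes 3-5 seconds to give me the data frame. I expected it to take roughly zero seconds. I found that list() takes most of the time. Is this expected, or do I have some bad configuration? (FYI, I am reading the entire collection, i.e. using find().)

There should be a way to do this with read_json, which would be more efficient (especially for large datasets).

Here we mention the collection name. If we don't want to mention the collection name, how can we achieve this? How do we get the data of all collections?

Does this work for multiple GB of data from MongoDB? Or does the pandas data frame suffer, so that we need to try another approach? For example, I have nearly 15 GB of tweets JSON data imported into MongoDB and I am trying to get it into a CSV.

Traceback: File "C:\DEV\Python\lib\site-packages\pymongo\network.py", line 235, in _receive_data_on_socket buf = bytearray(length) MemoryError

``result_df = pd.json_normalize( # data=json.loads(raw_json_line_text) data=pymongo_collection.find() # data=tuple(pymongo_collection.find()) )`` works fine without converting the pymongo cursor to a list or tuple.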
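
For collections too large to hold in a single data frame, one option is to pull documents from the cursor in fixed-size chunks and append each chunk to the CSV. The following is a minimal sketch under assumptions, not a tested fix for the MemoryError quoted above (that error was raised inside pymongo, which this code does not touch): `export_in_chunks` is a hypothetical helper, the documents are generated in place of a real cursor, and the column list is fixed up front so every chunk is written with the same layout.

```python
import io
from itertools import islice

import pandas as pd


def export_in_chunks(docs, out, columns, chunk_size=1000):
    """Write an iterable of documents to CSV one chunk at a time.

    `docs` can be a pymongo cursor; `out` is an open text file handle.
    Returns the number of rows written.
    """
    it = iter(docs)
    total = 0
    while True:
        # Pull at most chunk_size documents from the iterator.
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        # Flatten subdocuments and force a consistent column layout.
        frame = pd.json_normalize(chunk).reindex(columns=columns)
        # Only the first chunk writes the header row.
        frame.to_csv(out, header=(total == 0), index=False)
        total += len(frame)
    return total


# Stand-in for collection.find(); the documents are made up.
docs = ({"name": f"user{i}", "address": {"city": "Oxford"}} for i in range(5))
buf = io.StringIO()
rows = export_in_chunks(docs, buf, columns=["name", "address.city"], chunk_size=2)
print(rows)
```

Only one chunk is held as a data frame at any time, so peak memory in pandas depends on `chunk_size` rather than on the size of the collection. With a real file, `out` would be something like `open("tweets.csv", "w", newline="")`.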