
Python: Why does fetching 2,700 records (320 KB each) take 30 seconds?


I have 2,700 records in MongoDB. Each document is about 320 KB, the storage engine is wiredTiger, and the total collection size is about 885 MB.

My MongoDB configuration is:

systemLog:
  destination: file
  path: /usr/local/var/log/mongodb/mongo.log
  logAppend: true
storage:
  dbPath: /usr/local/var/mongodb
  engine: wiredTiger
  wiredTiger:
      engineConfig:
         cacheSizeGB: 1
         statisticsLogDelaySecs: 0
         journalCompressor: snappy
      collectionConfig:
         blockCompressor: snappy
      indexConfig:
         prefixCompression: false
net:
  bindIp: 127.0.0.1
My connection is over a Unix domain socket:

mongo_client = MongoClient('/tmp/mongodb-27017.sock')
The collection stats show the following:

db.mycol.stats()
{
    "ns" : "bi.mycol",
    "count" : 2776,
    "size" : 885388544,
    "avgObjSize" : 318944,
    "storageSize" : 972476416,
    "capped" : false,
    "wiredTiger" : {
        "metadata" : {
            "formatVersion" : 1
        },
        "creationString" : "allocation_size=4KB,app_metadata=(formatVersion=1),block_allocation=best,block_compressor=snappy,cache_resident=0,checkpoint=(WiredTigerCheckpoint.9=(addr=\"01e30275da81e4b9e99f78e30275db81e4c61d1e01e30275dc81e40fab67d5808080e439f6afc0e41e80bfc0\",order=9,time=1444566832,size=511762432,write_gen=13289)),checkpoint_lsn=(24,52054144),checksum=uncompressed,collator=,columns=,dictionary=0,format=btree,huffman_key=,huffman_value=,id=5,internal_item_max=0,internal_key_max=0,internal_key_truncate=,internal_page_max=4KB,key_format=q,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=1MB,memory_page_max=10m,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=0,prefix_compression_min=4,split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,value_format=u,version=(major=1,minor=1)",
        "type" : "file",
        "uri" : "statistics:table:collection-0-6630292038312816605",
        "LSM" : {
            "bloom filters in the LSM tree" : 0,
            "bloom filter false positives" : 0,
            "bloom filter hits" : 0,
            "bloom filter misses" : 0,
            "bloom filter pages evicted from cache" : 0,
            "bloom filter pages read into cache" : 0,
            "total size of bloom filters" : 0,
            "sleep for LSM checkpoint throttle" : 0,
            "chunks in the LSM tree" : 0,
            "highest merge generation in the LSM tree" : 0,
            "queries that could have benefited from a Bloom filter that did not exist" : 0,
            "sleep for LSM merge throttle" : 0
        },
        "block-manager" : {
            "file allocation unit size" : 4096,
            "blocks allocated" : 0,
            "checkpoint size" : 511762432,
            "allocations requiring file extension" : 0,
            "blocks freed" : 0,
            "file magic number" : 120897,
            "file major version number" : 1,
            "minor version number" : 0,
            "file bytes available for reuse" : 460734464,
            "file size in bytes" : 972476416
        },
        "btree" : {
            "column-store variable-size deleted values" : 0,
            "column-store fixed-size leaf pages" : 0,
            "column-store internal pages" : 0,
            "column-store variable-size leaf pages" : 0,
            "pages rewritten by compaction" : 0,
            "number of key/value pairs" : 0,
            "fixed-record size" : 0,
            "maximum tree depth" : 4,
            "maximum internal page key size" : 368,
            "maximum internal page size" : 4096,
            "maximum leaf page key size" : 3276,
            "maximum leaf page size" : 32768,
            "maximum leaf page value size" : 1048576,
            "overflow pages" : 0,
            "row-store internal pages" : 0,
            "row-store leaf pages" : 0
        },
        "cache" : {
            "bytes read into cache" : 3351066029,
            "bytes written from cache" : 0,
            "checkpoint blocked page eviction" : 0,
            "unmodified pages evicted" : 8039,
            "page split during eviction deepened the tree" : 0,
            "modified pages evicted" : 0,
            "data source pages selected for eviction unable to be evicted" : 1,
            "hazard pointer blocked page eviction" : 1,
            "internal pages evicted" : 0,
            "pages split during eviction" : 0,
            "in-memory page splits" : 0,
            "overflow values cached in memory" : 0,
            "pages read into cache" : 10519,
            "overflow pages read into cache" : 0,
            "pages written from cache" : 0
        },
        "compression" : {
            "raw compression call failed, no additional data available" : 0,
            "raw compression call failed, additional data available" : 0,
            "raw compression call succeeded" : 0,
            "compressed pages read" : 10505,
            "compressed pages written" : 0,
            "page written failed to compress" : 0,
            "page written was too small to compress" : 0
        },
        "cursor" : {
            "create calls" : 7,
            "insert calls" : 0,
            "bulk-loaded cursor-insert calls" : 0,
            "cursor-insert key and value bytes inserted" : 0,
            "next calls" : 0,
            "prev calls" : 2777,
            "remove calls" : 0,
            "cursor-remove key bytes removed" : 0,
            "reset calls" : 16657,
            "search calls" : 16656,
            "search near calls" : 0,
            "update calls" : 0,
            "cursor-update value bytes updated" : 0
        },
        "reconciliation" : {
            "dictionary matches" : 0,
            "internal page multi-block writes" : 0,
            "leaf page multi-block writes" : 0,
            "maximum blocks required for a page" : 0,
            "internal-page overflow keys" : 0,
            "leaf-page overflow keys" : 0,
            "overflow values written" : 0,
            "pages deleted" : 0,
            "page checksum matches" : 0,
            "page reconciliation calls" : 0,
            "page reconciliation calls for eviction" : 0,
            "leaf page key bytes discarded using prefix compression" : 0,
            "internal page key bytes discarded using suffix compression" : 0
        },
        "session" : {
            "object compaction" : 0,
            "open cursor count" : 7
        },
        "transaction" : {
            "update conflicts" : 0
        }
    },
    "nindexes" : 2,
    "totalIndexSize" : 208896,
    "indexSizes" : {
        "_id_" : 143360,
        "date_1" : 65536
    },
    "ok" : 1
}

How can I tell whether MongoDB is hitting swap? How can I work out where the bottleneck actually is?
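
For reference, one way to check mongod's own memory use from pymongo is to look at serverStatus; this is only a sketch, using the standard mem and WiredTiger cache counters that serverStatus reports in MongoDB 3.0:

from pymongo import MongoClient

# Ask mongod directly about its memory and cache usage instead of
# guessing from OS-level swap numbers.
client = MongoClient('/tmp/mongodb-27017.sock')
status = client.admin.command('serverStatus')

print('resident MB :', status['mem']['resident'])   # RAM actually held by mongod
print('virtual  MB :', status['mem']['virtual'])
cache = status['wiredTiger']['cache']
print('WT cache used      :', cache['bytes currently in the cache'])
print('WT cache configured:', cache['maximum bytes configured'])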


Edit:

OS: Mac osX Yosemite
MongoDB version: 3.0.0
Total RAM: 8G
Filesystem: Mac OS Extended (Journaled)
The way I fetch the data in Python is:

records = []
for doc in mycol.find({'date': {"$lte": '2016-12-12', '$gte': '2012-09-09'}}, {'_id': False}):
    doc['uids'] = set(doc['uids'])
    records.append(doc)
The date field is indexed.

Edit 2: here is the resource usage while the data is being fetched:

CPU core1: ~65%
CPU core2: ~65%
CPU core3: ~65%
CPU core4: ~65%
RAM: 7190/8190MB
swap: 1140/2048MB

Edit 3:
The MongoDB log during the fetch is:

2015-10-11T17:25:08.317+0330 I NETWORK  [initandlisten] connection accepted from anonymous unix socket #18 (2 connections now open)
2015-10-11T17:25:08.321+0330 I NETWORK  [initandlisten] connection accepted from anonymous unix socket #19 (3 connections now open)
2015-10-11T17:25:36.501+0330 I QUERY    [conn19] getmore bi.mycol cursorid:10267473126 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:3 nreturned:14 reslen:4464998 locks:{} 199ms
2015-10-11T17:25:37.665+0330 I QUERY    [conn19] getmore bi.mycol cursorid:10267473126 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:5 nreturned:14 reslen:4464998 locks:{} 281ms
2015-10-11T17:25:50.331+0330 I NETWORK  [conn19] end connection anonymous unix socket (2 connections now open)
2015-10-11T17:25:50.363+0330 I NETWORK  [conn18] end connection anonymous unix socket (1 connection now open)

Edit 4:
A sample document looks like this:

{"date": "2012-09-12", "uids": [1,2,3,4,...,30000]}
Note: there are about 30,000 uids in the uids field of each document.


Edit 5:

Explaining the query shows that it uses an IXSCAN stage:

$ db.mycol.find({'date': {"$lte": '2018-11-27', '$gte': '2011-04-23'}}, {'_id': 0}).explain("executionStats")
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "bi.mycol",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "$and" : [
                {
                    "date" : {
                        "$lte" : "2018-11-27"
                    }
                },
                {
                    "date" : {
                        "$gte" : "2011-04-23"
                    }
                }
            ]
        },
        "winningPlan" : {
            "stage" : "PROJECTION",
            "transformBy" : {
                "_id" : 0
            },
            "inputStage" : {
                "stage" : "FETCH",
                "inputStage" : {
                    "stage" : "IXSCAN",
                    "keyPattern" : {
                        "date" : 1
                    },
                    "indexName" : "date_1",
                    "isMultiKey" : false,
                    "direction" : "forward",
                    "indexBounds" : {
                        "date" : [
                            "[\"2011-04-23\", \"2018-11-27\"]"
                        ]
                    }
                }
            }
        },
        "rejectedPlans" : [ ]
    },
    "executionStats" : {
        "executionSuccess" : true,
        "nReturned" : 2776,
        "executionTimeMillis" : 2312,
        "totalKeysExamined" : 2776,
        "totalDocsExamined" : 2776,
        "executionStages" : {
            "stage" : "PROJECTION",
            "nReturned" : 2776,
            "executionTimeMillisEstimate" : 540,
            "works" : 2777,
            "advanced" : 2776,
            "needTime" : 0,
            "needFetch" : 0,
            "saveState" : 31,
            "restoreState" : 31,
            "isEOF" : 1,
            "invalidates" : 0,
            "transformBy" : {
                "_id" : 0
            },
            "inputStage" : {
                "stage" : "FETCH",
                "nReturned" : 2776,
                "executionTimeMillisEstimate" : 470,
                "works" : 2777,
                "advanced" : 2776,
                "needTime" : 0,
                "needFetch" : 0,
                "saveState" : 31,
                "restoreState" : 31,
                "isEOF" : 1,
                "invalidates" : 0,
                "docsExamined" : 2776,
                "alreadyHasObj" : 0,
                "inputStage" : {
                    "stage" : "IXSCAN",
                    "nReturned" : 2776,
                    "executionTimeMillisEstimate" : 0,
                    "works" : 2776,
                    "advanced" : 2776,
                    "needTime" : 0,
                    "needFetch" : 0,
                    "saveState" : 31,
                    "restoreState" : 31,
                    "isEOF" : 1,
                    "invalidates" : 0,
                    "keyPattern" : {
                        "date" : 1
                    },
                    "indexName" : "date_1",
                    "isMultiKey" : false,
                    "direction" : "forward",
                    "indexBounds" : {
                        "date" : [
                            "[\"2011-04-23\", \"2018-11-27\"]"
                        ]
                    },
                    "keysExamined" : 2776,
                    "dupsTested" : 0,
                    "dupsDropped" : 0,
                    "seenInvalidated" : 0,
                    "matchTested" : 0
                }
            }
        }
    },
    "serverInfo" : {
        "host" : "MySys.local",
        "port" : 27017,
        "version" : "3.0.0",
        "gitVersion" : "nogitversion"
    },
    "ok" : 1
}


Answer:

There are two issues here:

  • Using ISODate instead of string dates makes index lookups faster, because string dates are compared lexicographically while ISODates are compared numerically. Since your total record count is low, the type of the index should not be a big issue here; the problem is more likely the size of the documents, their network transfer, and deserialization.

  • Try the query without selecting the uids field, i.e.

    for doc in mycol.find({'date': {"$lte": '2016-12-12', '$gte': '2012-09-09'}}, {'_id': False, 'uids': False}):
    

  • Your query time should drop dramatically. Then investigate the transfer time between the application and the mongodb server, and benchmark a single-document fetch with find_one() to see how much time deserialization takes; see the timing sketch below.
The methods I used to improve performance:

  • First, instead of iterating over the query with a for loop and building a big list object in Python, I hand the cursor straight to Pandas:

    cursor = mycol.find({'date': {"$lte": end_date, '$gte': start_date}}, {'_id': False})
    df = pandas.DataFrame(list(cursor))
    
    Performance improved a lot: it now takes at most 10 seconds instead of 30.

  • Instead of doc['uids'] = set(doc['uids']), which by itself took about 6 seconds, I keep the plain lists and handle the duplicates with the DataFrame itself; see the sketch below.
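
A rough sketch of what de-duplicating inside the DataFrame could look like, assuming the uids column from the sample document above (my reading of the approach, not the asker's exact code):

import pandas
from pymongo import MongoClient

client = MongoClient('/tmp/mongodb-27017.sock')
mycol = client.bi.mycol

cursor = mycol.find({'date': {'$lte': '2016-12-12', '$gte': '2012-09-09'}}, {'_id': False})
df = pandas.DataFrame(list(cursor))

# De-duplicate each per-document 'uids' list inside the DataFrame instead of
# converting every list to a Python set in a loop.
df['uids'] = df['uids'].apply(pandas.unique)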


Comments:

  • "How can I work out where the bottleneck actually is?" By measuring all the relevant factors: CPU, memory, disk, and network utilization, on both the server and the client. Memory can be a bottleneck not only in terms of free space but also in terms of maximum available bandwidth. Do you plan to leave twenty small comments here, or do you plan to do the homework and come back with complete results (by updating your question)?
  • @KarolyHorvath The question has been updated. Thanks for your time.
  • 1. Do you have an index on the date field? 2. Can you keep the query outside the for loop and then fetch the results?
  • 1. Yes. 2. What do you mean? I need a cursor to iterate over it and fetch the data.
  • We had ISODate at first, but because we use Jalali dates alongside Gregorian and MongoDB adds a time offset to inserted dates, we cannot use it now given the complexity of accounting for that. Your second suggestion does not solve the problem, because I need the uids in my application. I am sure that if I drop uids the query will be faster by an order of magnitude. Would horizontal scaling solve my problem?
  • @AlirezaHos Removing uids is not the solution I mentioned; it is a way to track down the real problem. If the document size is the real issue, you probably need to benchmark your code for network time and deserialization time.
  • With uids it takes 39 seconds, without uids it takes 1 second. Could you explain more about how to benchmark the code for network time and deserialization time?
  • @AlirezaHos Try sharding the data. That may help you. Can you mention the time taken by the first read versus subsequent reads?