MongoDB shard balancing is not working properly and reports a large number of moveChunk errors

Tags: mongodb, sharding, balance, database, nosql

We have a MongoDB cluster with 3 shards; each shard is a replica set with 3 nodes, and the MongoDB version we use is 3.2.6. We have a large database of about 230 GB that contains about 5,500 collections. We found that about 2,300 of the collections are not balanced, while the other 3,200 collections are distributed evenly across the 3 shards.


The collection "normal_20160913" is not balanced; I post the getShardDistribution() result for this collection below:

mongos> db.normal_20160913.getShardDistribution()

Shard shard2 at shard2/10.25.2.6:27018,10.25.8.178:27018
 data : 4.77GiB docs : 203776 chunks : 118
 estimated data per chunk : 41.43MiB
 estimated docs per chunk : 1726

Totals
 data : 4.77GiB docs : 203776 chunks : 118
 Shard shard2 contains 100% data, 100% docs in cluster, avg obj size on shard : 24KiB
The balancer process is in running status and the chunk size is the default (64 MB).

Below is the result of sh.status() (the whole result is too big, so I am only posting part of it):

mongos> sh.status()
--- Sharding Status --- 
  sharding version: {
    "_id" : 1,
    "minCompatibleVersion" : 5,
    "currentVersion" : 6,
    "clusterId" : ObjectId("57557345fa5a196a00b7c77a")
}
  shards:
    {  "_id" : "shard1",  "host" : "shard1/10.25.8.151:27018,10.25.8.159:27018" }
    {  "_id" : "shard2",  "host" : "shard2/10.25.2.6:27018,10.25.8.178:27018" }
    {  "_id" : "shard3",  "host" : "shard3/10.25.2.19:27018,10.47.102.176:27018" }
  active mongoses:
    "3.2.6" : 1
  balancer:
    Currently enabled:  yes
    Currently running:  yes
        Balancer lock taken at Sat Sep 03 2016 09:58:58 GMT+0800 (CST) by iZ23vbzyrjiZ:27017:1467949335:-2109714153:Balancer
    Collections with active migrations: 
        bdtt.normal_20131017 started at Sun Sep 18 2016 17:03:11 GMT+0800 (CST)
    Failed balancer rounds in last 5 attempts:  0
    Migration Results for the last 24 hours: 
        1490 : Failed with error 'aborted', from shard2 to shard3
        1490 : Failed with error 'aborted', from shard2 to shard1
        14 : Failed with error 'data transfer error', from shard2 to shard1
  databases:
    {  "_id" : "bdtt",  "primary" : "shard2",  "partitioned" : true }
      bdtt.normal_20160908
            shard key: { "_id" : "hashed" }
            unique: false
            balancing: true
            chunks:
                shard2  142
            too many chunks to print, use verbose if you want to force print
        bdtt.normal_20160909
            shard key: { "_id" : "hashed" }
            unique: false
            balancing: true
            chunks:
                shard1  36
                shard2  42
                shard3  46
            too many chunks to print, use verbose if you want to force print
        bdtt.normal_20160910
            shard key: { "_id" : "hashed" }
            unique: false
            balancing: true
            chunks:
                shard1  34
                shard2  32
                shard3  32
            too many chunks to print, use verbose if you want to force print
        bdtt.normal_20160911
            shard key: { "_id" : "hashed" }
            unique: false
            balancing: true
            chunks:
                shard1  30
                shard2  32
                shard3  32
            too many chunks to print, use verbose if you want to force print
        bdtt.normal_20160912
            shard key: { "_id" : "hashed" }
            unique: false
            balancing: true
            chunks:
                shard2  126
            too many chunks to print, use verbose if you want to force print
        bdtt.normal_20160913
            shard key: { "_id" : "hashed" }
            unique: false
            balancing: true
            chunks:
                shard2  118
            too many chunks to print, use verbose if you want to force print
    }

I found a lot of moveChunk errors in the mongos log, which may be the reason why some collections are not balanced. Here is the latest part of them:

2016-09-19T14:25:25.427+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:25:59.620+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:25:59.644+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:35:02.701+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:35:02.728+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:42:18.232+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:42:18.256+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:42:27.101+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:42:27.112+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
2016-09-19T14:43:41.889+0800 I SHARDING [conn37136926] moveChunk result: { ok: 0.0, errmsg: "Not starting chunk migration because another migration is already in progress", code: 117 }
I tried the moveChunk command manually and it returned the same error:

mongos> sh.moveChunk("bdtt.normal_20160913", {_id:ObjectId("57d6d107edac9244b6048e65")}, "shard3")
{
    "cause" : {
        "ok" : 0,
        "errmsg" : "Not starting chunk migration because another migration is already in progress",
        "code" : 117
    },
    "code" : 117,
    "ok" : 0,
    "errmsg" : "move failed"
}
I am not sure whether we have created too many collections and that is overwhelming the migrations? About 60-80 new collections are created every day.

I need some help here to answer the questions below; any hints would be great:

  • Why are some of the collections not balanced? Is this related to the large number of newly created collections?
  • Is there any command that can show the details of the migration jobs being processed? I got a lot of error logs showing that some migration job is running, but I cannot find which one is running.

  • I would like to speculate, but my guess is that your collections are very unbalanced and are currently being balanced by chunk migration (which can take a long time). Hence your manual chunk migration is queued up but not executed right away.

    A few points that might clarify things further:

    • MongoDB chunk migration happens through a queue mechanism, and only one chunk is migrated at a time.
    • The balancer lock information may give you more insight into what is currently being migrated. You should also be able to see log entries for chunk migrations in your mongos log files (see the sketch below for one way to query this metadata).
    You could choose to do some pre-splitting on your collections. The pre-splitting process basically configures an empty collection to start out balanced and avoids it becoming unbalanced in the first place, because once collections get unbalanced, the chunk migration process may not be your friend (a minimal pre-splitting sketch appears at the end of this answer).
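
    To make the balancer-lock point above concrete, here is a minimal mongo shell sketch of where that information lives. Run it against a mongos; it only reads sharding metadata from the standard config database and is meant as an illustration of where to look, not a definitive diagnostic procedure:

    // Is the balancer enabled, and is a balancing round in progress right now?
    sh.getBalancerState()
    sh.isBalancerRunning()

    var cfg = db.getSiblingDB("config")

    // In 3.2 each chunk migration holds a distributed lock on the collection
    // namespace, so the locks currently held (state: 2) show what is migrating.
    cfg.locks.find({ state: 2 }).pretty()

    // The changelog records moveChunk.start / moveChunk.commit / moveChunk.from /
    // moveChunk.to events, i.e. recent (including failed) migrations.
    cfg.changelog.find({ what: /moveChunk/ }).sort({ time: -1 }).limit(10).pretty()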

    Additionally, you may want to revisit your shard key. You may have made a mistake with the shard key, and that can cause a lot of imbalance.
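
    One quick way to check how a given collection's chunks are actually spread out is to count them per shard straight from the config metadata. A minimal sketch, using the "bdtt.normal_20160913" collection name from the question:

    // Count chunks per shard for one collection (config.chunks stores one
    // document per chunk, including the shard that currently owns it).
    db.getSiblingDB("config").chunks.aggregate([
        { $match: { ns: "bdtt.normal_20160913" } },
        { $group: { _id: "$shard", nChunks: { $sum: 1 } } },
        { $sort: { nChunks: -1 } }
    ])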

    Also, your data size does not look too large to me to warrant a sharded configuration. Remember, never go for a sharded configuration unless you are forced to by your data size / working set size. Sharding is not free (you are probably feeling the pain already).
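
    To illustrate the pre-splitting suggestion above: with a hashed shard key (which these collections already use, { "_id" : "hashed" }), you can ask MongoDB to create and distribute a number of empty chunks at shardCollection time via numInitialChunks, so each new daily collection starts out balanced. A minimal sketch; the collection name "bdtt.normal_20160920" and the chunk count are hypothetical:

    // Shard a NEW, still-empty collection and pre-split it into 12 chunks.
    // numInitialChunks is only honoured for hashed shard keys on empty
    // collections; the initial chunks are spread evenly across the shards.
    db.adminCommand({
        shardCollection: "bdtt.normal_20160920",
        key: { _id: "hashed" },
        numInitialChunks: 12
    })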

    Answering my own question: we finally found the root cause. It is exactly the same as this "issue" and was caused by an abnormal replica set configuration. When this problem happened, our replica set configuration looked like this:

    shard1:PRIMARY> rs.conf()
    {
        "_id" : "shard1",
        "version" : 3,
        "protocolVersion" : NumberLong(1),
        "members" : [
            {
                "_id" : 0,
                "host" : "10.25.8.151:27018",
                "arbiterOnly" : false,
                "buildIndexes" : true,
                "hidden" : false,
                "priority" : 1,
                "tags" : {
    
                },
                "slaveDelay" : NumberLong(0),
                "votes" : 1
            },
            {
                "_id" : 1,
                "host" : "10.25.8.159:27018",
                "arbiterOnly" : false,
                "buildIndexes" : true,
                "hidden" : false,
                "priority" : 1,
                "tags" : {
    
                },
                "slaveDelay" : NumberLong(0),
                "votes" : 1
            },
            {
                "_id" : 2,
                "host" : "10.25.2.6:37018",
                "arbiterOnly" : true,
                "buildIndexes" : true,
                "hidden" : false,
                "priority" : 1,
                "tags" : {
    
                },
                "slaveDelay" : NumberLong(0),
                "votes" : 1
            },
            {
                "_id" : 3,
                "host" : "10.47.114.174:27018",
                "arbiterOnly" : false,
                "buildIndexes" : true,
                "hidden" : true,
                "priority" : 0,
                "tags" : {
    
                },
                "slaveDelay" : NumberLong(86400),
                "votes" : 1
            }
        ],
        "settings" : {
            "chainingAllowed" : true,
            "heartbeatIntervalMillis" : 2000,
            "heartbeatTimeoutSecs" : 10,
            "electionTimeoutMillis" : 10000,
            "getLastErrorModes" : {
    
            },
            "getLastErrorDefaults" : {
                "w" : 1,
                "wtimeout" : 0
            },
            "replicaSetId" : ObjectId("5755464f789c6cd79746ad62")
        }
    }
    
    There are 4 nodes in the replica set: one primary, one slave, one arbiter, and one slave delayed by 24 hours. That makes 3 nodes the majority, and since the arbiter holds no data, the balancer has to wait for the delayed slave to satisfy the write concern (to make sure the receiving shard has received the chunk).


    There are several ways to solve this problem. We simply removed the arbiter, and the balancer now works fine.
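
    For reference, a minimal sketch of one way to remove the arbiter, run on the shard's primary; the host:port is taken from the rs.conf() output above. An alternative would be to keep the arbiter and instead reconfigure votes/priority via rs.reconfig() so that a majority no longer depends on the delayed member.

    // Remove the arbiter so that the primary plus the non-delayed secondary can
    // satisfy the majority write concern on their own during chunk migrations.
    rs.remove("10.25.2.6:37018")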

    The migrations report a "data transfer error", which may indicate a problem at the network layer. Also, could you provide more details about each shard? I see each shard only contains two nodes (at least three data-bearing nodes are recommended). Is this intentional?