Maintaining a MongoDB Replica Set

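Every MongoDB instance (version 3.2.9 here) has a local database, named local, which stores information used by the replication process along with instance-local data. Data and collections in the local database are never replicated to other MongoDB instances. If an instance holds collections or data that should not be replicated to other members, store them in the local database.

The MongoDB shell provides a global variable, rs, a wrapper around the database commands that is used to maintain a replica set.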

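I. Replica set configuration

1. View the replica set configuration

MongoDB stores the replica set configuration in the local.system.replset collection. Every member of the same replica set holds the same local.system.replset document. Do not modify this collection directly: initialize it with rs.initiate() and change it with rs.reconfig(). Adding or removing members updates the configuration accordingly.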

rs.config()

use local
db.system.replset.find()

The two most important parts of the configuration are the replica set's _id and the members array:

{_id: "replica set name",
members: [
    {
      _id: <int>,
      host: "host:port",
      arbiterOnly: <boolean>,
      buildIndexes: <boolean>,
      hidden: <boolean>,
      priority: <number>,
      slaveDelay: <int>,
      votes: <number>
    },
    ...
  ],
...
}

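Fields of a member configuration document:

priority: the relative priority with which a member is elected primary. The default is 1 and the valid range is 0 to 1000. A priority of 0 has a special meaning: the member can never become primary. Provided it is otherwise eligible, the member with the highest priority is preferred in elections.

hidden: makes the member a hidden member; requires priority 0. Hidden members are invisible to client applications, so clients do not send requests to them.

slaveDelay: in seconds. Configures a secondary as a delayed member; requires priority 0. The member applies writes from the primary only after the specified delay. To keep reads consistent, also set hidden to true on a delayed member so that users never read clearly stale data. Delayed members maintain a copy of the data that reflects the state of the data at some time in the past.

votes: valid values are 0 and 1, default 1. A member with votes set to 1 is a voting member and may vote in elections for primary. A replica set can have at most 7 voting members.

arbiterOnly: marks the member as an arbiter. An arbiter exists only to vote in elections: it has 1 vote, holds no data, and does not serve client requests.

buildIndexes: whether indexes are built on the member. This field cannot be changed later; it can only be set when the member is added. If a member serves purely as a backup and receives no client requests, disabling index builds on it can make data synchronization more efficient. A member with buildIndexes set to false must also have priority 0.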
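
The constraints between these fields can be checked before a configuration is submitted. The following is a minimal sketch in plain JavaScript, not part of the mongo shell: validateMember, countVoters, and MAX_PRIORITY are hypothetical names, and the upper bound of the priority range is kept in a constant (set to 1000 here, as in the MongoDB 3.x manual) so it can be adjusted for other versions.

```javascript
// Upper bound of the priority range (assumption: 1000, adjust per version).
const MAX_PRIORITY = 1000;

// Hypothetical helper: returns the list of rule violations for one
// member configuration document. An empty list means no rule was broken.
function validateMember(member) {
  // Fill in the documented defaults for omitted fields.
  const m = Object.assign(
    { priority: 1, votes: 1, hidden: false, slaveDelay: 0,
      arbiterOnly: false, buildIndexes: true },
    member
  );
  const errors = [];
  if (m.priority < 0 || m.priority > MAX_PRIORITY) {
    errors.push("priority out of range");
  }
  if (m.votes !== 0 && m.votes !== 1) {
    errors.push("votes must be 0 or 1");
  }
  if (m.hidden && m.priority !== 0) {
    errors.push("hidden requires priority 0");
  }
  if (m.slaveDelay > 0 && m.priority !== 0) {
    errors.push("slaveDelay requires priority 0");
  }
  if (!m.buildIndexes && m.priority !== 0) {
    errors.push("buildIndexes: false requires priority 0");
  }
  if (m.arbiterOnly && m.votes !== 1) {
    errors.push("an arbiter must have 1 vote");
  }
  return errors;
}

// Hypothetical helper: counts voting members (a set may hold at most 7).
function countVoters(members) {
  return members.filter((m) => (m.votes === undefined ? 1 : m.votes) === 1).length;
}
```

In the shell, a check like this could be run over rs.conf().members before calling rs.reconfig(); the server performs its own validation regardless.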

2. Reconfigure the replica set

Reconfiguration must be run while connected to the primary. If no member has been elected primary, you can use the force option, rs.reconfig(config, {force: true}), to force a reconfiguration from a secondary.

The force parameter allows a reconfiguration command to be issued to a non-primary node. If set as {force: true}, this forces the replica set to accept the new configuration even if a majority of the members are not accessible. Use with caution, as this can lead to rollback situations.

Example: on the primary, change the priority of the members.

cfg = rs.conf()
cfg.members[0].priority = 1
cfg.members[1].priority = 1
cfg.members[2].priority = 5
rs.reconfig(cfg)

3. Add members

3.1 Add a member with the default configuration

// Add a data-bearing member
rs.add("host:port")

// Add an arbiter, which only votes in elections
rs.add("host:port", true)

3.2 Add a member with a configuration document

Example: add a hidden, delayed member that lags the primary by 1 hour, does not vote, and does not build indexes; it serves purely as a data backup.

rs.add({ _id: 4, host: "host:port", priority: 0, hidden: true, slaveDelay: 3600, votes: 0, buildIndexes: false, arbiterOnly: false })

4. Remove a member

rs.remove("host:port")

5. A single reconfiguration can add members, remove members, and modify member settings at the same time
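
To illustrate combining all three kinds of change in one reconfiguration, here is a rough sketch in plain JavaScript. planReconfig is a hypothetical helper, not a shell function: it only builds a modified copy of a configuration document, which would then be passed to rs.reconfig() on the primary. The host names in the usage are placeholders.

```javascript
// Hypothetical helper: returns a modified copy of a replica set
// configuration document and leaves the original untouched.
function planReconfig(cfg, changes) {
  const remove = changes.remove || [];   // host strings to drop
  const update = changes.update || {};   // host -> fields to overwrite
  const add = changes.add || [];         // new member documents

  const members = cfg.members
    .filter((m) => !remove.includes(m.host))                  // remove members
    .map((m) => Object.assign({}, m, update[m.host] || {}));  // modify members

  for (const m of add) {
    members.push(Object.assign({}, m));                       // add members
  }
  return Object.assign({}, cfg, { members });
}

// Usage: drop one member, raise another's priority, add a new one.
const current = {
  _id: "rs0",
  members: [
    { _id: 0, host: "a:27017", priority: 1 },
    { _id: 1, host: "b:27017", priority: 1 },
    { _id: 2, host: "c:27017", priority: 1 },
  ],
};
const planned = planReconfig(current, {
  remove: ["b:27017"],
  update: { "c:27017": { priority: 5 } },
  add: [{ _id: 3, host: "d:27017" }],
});
```

Each new member still needs an _id that is unique within the set; the sketch does not check this.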

II. Operations on replica set members

1. Freeze the current member so that it cannot become primary for the given time and therefore stays a secondary

Makes the current replica set member ineligible to become primary for the period specified.

rs.freeze(seconds)

2. Force the primary to step down to secondary

rs.stepDown() makes the current primary step down to secondary and triggers an election for a new primary. The former primary cannot become primary again for the specified period. If an eligible secondary exists during that period, it is elected primary; if none does, the former primary can stand for election again once the period has passed.

Forces the primary of the replica set to become a secondary, triggering an election for primary. The method steps down the primary for a specified number of seconds; during this period, the stepdown member is ineligible from becoming primary.

rs.stepDown(stepDownSecs, secondaryCatchUpPeriodSecs)

3. Force the current member to sync from a specified member

rs.syncFrom("host:port");

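4. Allow reads on the current secondary

By default, a secondary rejects read operations. rs.slaveOk() allows reads on the secondary for the current shell connection.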

rs.slaveOk()

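III. View the replica set status

The status document returned by rs.status() includes these fields:

set: the name of the replica set

stateStr: a description of the member's state

name: the member's host and port

syncingTo: the member that this member syncs from. Use rs.syncFrom() to override the sync source and sync from a specified target member.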

{"set" : "rs0",
    "myState" : 1,
    "heartbeatIntervalMillis" : NumberLong(2000),
    "members" : [ 
        {"_id" : 0,
            "name" : "cia-sh-05:40004",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 240973,
            "optime" : {"ts" : Timestamp(1473336939, 1),
                "t" : NumberLong(5)
            },
            "optimeDate" : ISODate("2016-09-08T12:15:39.000Z"),
            "lastHeartbeat" : ISODate("2016-09-10T04:39:55.041Z"),
            "lastHeartbeatRecv" : ISODate("2016-09-10T04:39:56.356Z"),
            "pingMs" : NumberLong(0),
            "syncingTo" : "cia-sh-06:40001"
        }, .....
    ]
}
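
A status document like this can be condensed into the few facts usually of interest: which member is primary, and how far each secondary lags behind it. The following is a minimal sketch in plain JavaScript; summarizeStatus is a hypothetical helper, and it assumes optimeDate values are Date objects, which is what ISODate produces in the shell.

```javascript
// Hypothetical helper: summarizes a document shaped like the status output.
function summarizeStatus(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY") || null;
  const secondaries = status.members
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => ({
      name: m.name,
      healthy: m.health === 1,
      // Seconds between the primary's and this member's last applied
      // operation; null when no primary is visible.
      lagSecs: primary ? (primary.optimeDate - m.optimeDate) / 1000 : null,
    }));
  return {
    set: status.set,
    primary: primary ? primary.name : null,
    secondaries: secondaries,
  };
}
```

In the shell this would be called as summarizeStatus(rs.status()).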

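IV. The replica set oplog

MongoDB replication is based on an operation log (oplog). The process works as follows: the primary records each write operation it performs in its oplog collection; secondaries read the primary's oplog and replay the recorded writes; eventually all members of the replica set converge to the same data.

Operations in the oplog are recorded per document. If a command affects a single document, one operation is inserted into the oplog. If a command affects multiple documents, it is split into several equivalent operations, each affecting exactly one document, and all of them are inserted into the oplog.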
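
The per-document splitting can be pictured with a toy model in plain JavaScript. This is only an illustration under stated assumptions: toOplogEntries is a hypothetical function, and the entry layout is simplified to the op, ns, o2, and o fields (real oplog entries carry more, such as ts and h).

```javascript
// Toy model: expand one multi-document update into per-document,
// oplog-style entries. Not how the server is implemented.
function toOplogEntries(ns, docs, matches, setFields) {
  return docs
    .filter(matches)
    .map((doc) => ({
      op: "u",                                    // update operation
      ns: ns,                                     // namespace: database.collection
      o2: { _id: doc._id },                       // the single document affected
      o: { $set: Object.assign({}, setFields) },  // the change to apply
    }));
}

// Usage: one update matching two documents yields two entries.
const items = [
  { _id: 1, qty: 5 },
  { _id: 2, qty: 15 },
  { _id: 3, qty: 25 },
];
const entries = toOplogEntries("test.items", items, (d) => d.qty > 10, { flagged: true });
```

Because every entry names exactly one document by _id, a secondary can replay each entry on its own without re-evaluating the original query.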

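1. The oplog collection

The oplog is a special capped collection that stores the operation log. Every member of the replica set keeps its own copy, local.oplog.rs, in its local database. Because each member records the operations it applies in its own oplog, a member can sync from the oplog of any other member.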

The oplog (operations log) is a special capped collection that keeps a rolling record of all operations that modify the data stored in your databases. MongoDB applies database operations on the primary and then records the operations on the primary’s oplog. The secondary members then copy and apply these operations in an asynchronous process. All replica set members contain a copy of the oplog, in the local.oplog.rs collection, which allows them to maintain the current state of the database.
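
The "rolling record" behaviour of a capped collection can be sketched with a small class in plain JavaScript. This is a toy under a loud assumption: CappedLog is a hypothetical name, and it is bounded by entry count, whereas a real capped collection is bounded by size in bytes.

```javascript
// Toy capped collection: keeps at most `capacity` entries and discards
// the oldest entry once full.
class CappedLog {
  constructor(capacity) {
    this.capacity = capacity;
    this.entries = [];
  }

  insert(entry) {
    this.entries.push(entry);
    if (this.entries.length > this.capacity) {
      this.entries.shift(); // drop the oldest entry
    }
  }

  // Oldest retained entry.
  first() {
    return this.entries[0];
  }

  // Newest entry.
  last() {
    return this.entries[this.entries.length - 1];
  }
}

// Usage: after five inserts into a log of capacity 3, entries 1 and 2 are gone.
const log = new CappedLog(3);
for (let ts = 1; ts <= 5; ts++) {
  log.insert({ ts: ts });
}
```

The span between first() and last() is what limits how far behind a secondary may fall and still catch up from this log.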

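2. Oplog size

The oplog is a capped collection, so its size is fixed. MongoDB creates an oplog of the default size when a replica set member first starts. In MongoDB 3.2.9 the default storage engine is WiredTiger, for which the default oplog size is typically 5% of the free disk space on the disk holding the data files, with a lower bound of 990 MB and an upper bound of 50 GB.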
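
The sizing rule amounts to a clamp, which the following plain JavaScript sketch spells out. defaultOplogSizeBytes is a hypothetical helper that only reproduces the arithmetic stated above; it assumes 1 MB = 1024 * 1024 bytes and does not query a real server or disk.

```javascript
const MB = 1024 * 1024;
const GB = 1024 * MB;

// 5% of free disk space, clamped to the range [990 MB, 50 GB].
function defaultOplogSizeBytes(freeDiskBytes) {
  const fivePercent = Math.floor(freeDiskBytes / 20);
  return Math.min(Math.max(fivePercent, 990 * MB), 50 * GB);
}

// Usage: small disks hit the lower bound, very large disks the upper bound.
const small = defaultOplogSizeBytes(10 * GB);    // 5% is 512 MB, raised to 990 MB
const medium = defaultOplogSizeBytes(100 * GB);  // 5% is 5 GB, within bounds
const large = defaultOplogSizeBytes(2000 * GB);  // 5% is 100 GB, capped at 50 GB
```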

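3. Change the oplog size

The procedure has three main steps:

Restart mongod in standalone mode
Recreate the oplog, keeping the last entry as a seed
Restart mongod as a replica set member

In detail:

Step 1: Restart mongod in standalone mode

On the primary, first call stepDown to force it to become a secondary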

rs.stepDown()

On a secondary, call shutdownServer() to stop mongod

use admin 
db.shutdownServer()

Start the mongod instance without the replSet option

mongod --port 37017 --dbpath /srv/mongodb

Step 2: Create the new oplog

To be safe, back up the existing oplog first

mongodump --db local --collection 'oplog.rs' --port 37017

Save the last valid oplog entry to a temp collection; it will seed the new oplog

use local

db.temp.drop()
db.temp.save(db.oplog.rs.find( {}, {ts: 1, h: 1 } ).sort({$natural : -1} ).limit(1).next())

db.oplog.rs.drop()

Recreate the oplog collection and save the entry from the temp collection into it. The size is given in bytes.

db.runCommand({ create: "oplog.rs", capped: true, size: (2 * 1024 * 1024 * 1024) } )
db.oplog.rs.save(db.temp.findOne() )

Step 3: Restart mongod as a replica set member; the replSet option must name the correct replica set

use admin
db.shutdownServer()
mongod --replSet rs0 --dbpath /srv/mongodb

V. View the mongod startup log

The local.startup_log collection stores a record of each mongod startup

Permanent link to this article: http://www.linuxidc.com/Linux/2016-09/135649.htm
