The Corosync cluster engine provides reliable communication between the cluster nodes. It keeps the cluster configuration in sync across the nodes at all times, maintains the cluster membership, and sends notifications when quorum is achieved or lost. In effect, it is the messaging layer the cluster uses to manage system and resource availability. In Veritas Cluster Server, this functionality is provided by LLT + GAB (Low Latency Transport + Group Membership Services/Atomic Broadcast). Unlike Veritas, which relies on dedicated heartbeat links, Corosync uses the existing network interfaces to communicate with the other cluster nodes.
Why do we need redundant Corosync links?
By default, we configure network bonding by aggregating a couple of physical network interfaces for the primary node IP, and in the default configuration Corosync uses this interface as its heartbeat link. If a network problem causes the two nodes to lose connectivity with each other, the cluster can end up in a split-brain situation. To avoid split brain, we configure an additional network link. This link should run through a different network switch, or we can use a direct network cable between the two nodes.
Note: For tutorial simplicity, we will use unicast (not multicast) for Corosync. The unicast method is fine for two-node clusters.
Configuring additional Corosync links is an online activity and can be done without impacting the services.
Let’s explore the existing configuration:
1. View the Corosync configuration using the pcs command.
[root@UA-HA ~]# pcs cluster corosync
totem {
version: 2
secauth: off
cluster_name: UABLR
transport: udpu
}
nodelist {
node {
ring0_addr: UA-HA
nodeid: 1
}
node {
ring0_addr: UA-HA2
nodeid: 2
}
}
quorum {
provider: corosync_votequorum
two_node: 1
}
logging {
to_logfile: yes
logfile: /var/log/cluster/corosync.log
to_syslog: yes
}
[root@UA-HA ~]#
2. Corosync uses two UDP ports: mcastport (for mcast receives) and mcastport - 1 (for mcast sends).
- mcast receives: 5405
- mcast sends: 5404
[root@UA-HA ~]# netstat -plantu | grep 54 | grep corosync
udp        0      0 192.168.203.134:5405    0.0.0.0:*      34363/corosync
[root@UA-HA ~]#
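If firewalld is enabled on the nodes, these UDP ports (plus TCP 2224 for pcsd) must be allowed between the nodes or the heartbeat traffic will be blocked. A minimal sketch, assuming the high-availability service definition shipped with the RHEL/CentOS 7 HA packages is available (otherwise the ports can be opened individually with --add-port=5404-5405/udp):
[root@UA-HA ~]# firewall-cmd --permanent --add-service=high-availability
[root@UA-HA ~]# firewall-cmd --reload
[root@UA-HA ~]# ssh UA-HA2 "firewall-cmd --permanent --add-service=high-availability && firewall-cmd --reload"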
3. The Corosync configuration file is located in /etc/corosync.
[root@UA-HA ~]# cat /etc/corosync/corosync.conf
totem {
version: 2
secauth: off
cluster_name: UABLR
transport: udpu
}
nodelist {
node {
ring0_addr: UA-HA
nodeid: 1
}
node {
ring0_addr: UA-HA2
nodeid: 2
}
}
quorum {
provider: corosync_votequorum
two_node: 1
}
logging {
to_logfile: yes
logfile: /var/log/cluster/corosync.log
to_syslog: yes
}
[root@UA-HA ~]#
4. Verify the current ring status using corosync-cfgtool.
[root@UA-HA ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id = 192.168.203.134
status = ring 0 active with no faults
[root@UA-HA ~]# ssh UA-HA2 corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
id = 192.168.203.131
status = ring 0 active with no faults
[root@UA-HA ~]#
As we can see, only one ring has been configured for Corosync, and it uses the following interface on each node.
[root@UA-HA ~]# ifconfig br0
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.203.134 netmask 255.255.255.0 broadcast 192.168.203.255
[root@UA-HA ~]# ssh UA-HA2 ifconfig br0
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.203.131 netmask 255.255.255.0 broadcast 192.168.203.255
[root@UA-HA ~]#
Configure a new ring:
5. To add redundancy to the Corosync links, we will use the following interface on both nodes.
[root@UA-HA ~]# ifconfig eno33554984
eno33554984: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.16.0.3 netmask 255.255.255.0 broadcast 172.16.0.255
[root@UA-HA ~]# ssh UA-HA2 ifconfig eno33554984
eno33554984: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.16.0.2 netmask 255.255.255.0 broadcast 172.16.0.255
[root@UA-HA ~]#
Dedicated private addresses for the Corosync links:
172.16.0.3 – UA-HA-HB2
172.16.0.2 – UA-HA2-HB2
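If eno33554984 does not yet carry these addresses, they can be set with nmcli before proceeding. A minimal sketch for UA-HA, assuming the NetworkManager connection name matches the device name (adjust the connection name and the address for UA-HA2):
[root@UA-HA ~]# nmcli connection modify eno33554984 ipv4.method manual ipv4.addresses 172.16.0.3/24
[root@UA-HA ~]# nmcli connection up eno33554984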
6. Before making changes to the Corosync configuration, we need to move the cluster into maintenance mode.
[root@UA-HA ~]# pcs property set maintenance-mode=true
[root@UA-HA ~]# pcs property show maintenance-mode
Cluster Properties:
 maintenance-mode: true
[root@UA-HA ~]#
This puts the resources into an unmanaged state.
[root@UA-HA ~]# pcs resource
Resource Group: WEBRG1
vgres (ocf::heartbeat:LVM): Started UA-HA (unmanaged)
webvolfs (ocf::heartbeat:Filesystem): Started UA-HA (unmanaged)
ClusterIP (ocf::heartbeat:IPaddr2): Started UA-HA (unmanaged)
webres (ocf::heartbeat:apache): Started UA-HA (unmanaged)
Resource Group: UAKVM2
UAKVM2_res (ocf::heartbeat:VirtualDomain): Started UA-HA2 (unmanaged)
[root@UA-HA ~]#
7. Update /etc/hosts with the following entries on both nodes.
[root@UA-HA corosync]# cat /etc/hosts | grep HB2
172.16.0.3    UA-HA-HB2
172.16.0.2    UA-HA2-HB2
[root@UA-HA corosync]#
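If the entries are not in place yet, they can be appended from the first node as shown below (a sketch only; it assumes passwordless ssh to UA-HA2 and that the lines are not already present):
[root@UA-HA ~]# printf "172.16.0.3 UA-HA-HB2\n172.16.0.2 UA-HA2-HB2\n" >> /etc/hosts
[root@UA-HA ~]# ssh UA-HA2 'printf "172.16.0.3 UA-HA-HB2\n172.16.0.2 UA-HA2-HB2\n" >> /etc/hosts'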
8. Update corosync.conf with rrp_mode and a ring1_addr for each node.
[root@UA-HA corosync]# cat corosync.conf
totem {
version: 2
secauth: off
cluster_name: UABLR
transport: udpu
rrp_mode: active
}
nodelist {
node {
ring0_addr: UA-HA
ring1_addr: UA-HA-HB2
nodeid: 1
}
node {
ring0_addr: UA-HA2
ring1_addr: UA-HA2-HB2
nodeid: 2
}
}
quorum {
provider: corosync_votequorum
two_node: 1
}
logging {
to_logfile: yes
logfile: /var/log/cluster/corosync.log
to_syslog: yes
}
[root@UA-HA corosync]#
Here is the difference between the new configuration file and the previous one.
[root@UA-HA corosync]# sdiff -s corosync.conf corosync.conf_back
rrp_mode: active <
ring1_addr: UA-HA-HB2 <
ring1_addr: UA-HA2-HB2 <
[root@UA-HA corosync]#
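Note that rrp_mode accepts both passive and active; Red Hat's documentation lists passive as the supported RRP mode, so prefer passive on a supported RHEL cluster. Also, corosync.conf is not replicated automatically, so the updated file must be copied to the second node before restarting the service (a sketch; recent pcs versions can also push it with "pcs cluster sync"):
[root@UA-HA corosync]# scp /etc/corosync/corosync.conf UA-HA2:/etc/corosync/corosync.conf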
9. Restart the corosync service on both nodes.
[root@UA-HA ~]# systemctl restart corosync
[root@UA-HA ~]# ssh UA-HA2 systemctl restart corosync
10. Check the corosync service status.
[root@UA-HA ~]# systemctl status corosync
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2015-10-19 02:38:16 EDT; 16s ago
Process: 36462 ExecStop=/usr/share/corosync/corosync stop (code=exited, status=0/SUCCESS)
Process: 36470 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
Main PID: 36477 (corosync)
CGroup: /system.slice/corosync.service
└─36477 corosync
Oct 19 02:38:15 UA-HA corosync[36477]: [QUORUM] Members[2]: 2 1
Oct 19 02:38:15 UA-HA corosync[36477]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 19 02:38:16 UA-HA systemd[1]: Started Corosync Cluster Engine.
Oct 19 02:38:16 UA-HA corosync[36470]: Starting Corosync Cluster Engine (corosync): [ OK ]
Oct 19 02:38:24 UA-HA corosync[36477]: [TOTEM ] A new membership (192.168.203.134:3244) was formed. Members left: 2
Oct 19 02:38:24 UA-HA corosync[36477]: [QUORUM] Members[1]: 1
Oct 19 02:38:24 UA-HA corosync[36477]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 19 02:38:25 UA-HA corosync[36477]: [TOTEM ] A new membership (192.168.203.131:3248) was formed. Members joined: 2
Oct 19 02:38:26 UA-HA corosync[36477]: [QUORUM] Members[2]: 2 1
Oct 19 02:38:26 UA-HA corosync[36477]: [MAIN ] Completed service synchronization, ready to provide service.
[root@UA-HA ~]#
11. Verify the Corosync configuration using the pcs command.
[root@UA-HA ~]# pcs cluster corosync
totem {
version: 2
secauth: off
cluster_name: UABLR
transport: udpu
rrp_mode: active
}
nodelist {
node {
ring0_addr: UA-HA
ring1_addr: UA-HA-HB2
nodeid: 1
}
node {
ring0_addr: UA-HA2
ring1_addr: UA-HA2-HB2
nodeid: 2
}
}
quorum {
provider: corosync_votequorum
two_node: 1
}
logging {
to_logfile: yes
logfile: /var/log/cluster/corosync.log
to_syslog: yes
}
[root@UA-HA ~]#
12. Verify the ring status.
[root@UA-HA ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id = 192.168.203.134
status = ring 0 active with no faults
RING ID 1
id = 172.16.0.3
status = ring 1 active with no faults
[root@UA-HA ~]# ssh UA-HA2 corosync-cfgtool -s
Printing ring status.
Local node ID 2
RING ID 0
id = 192.168.203.131
status = ring 0 active with no faults
RING ID 1
id = 172.16.0.2
status = ring 1 active with no faults
[root@UA-HA ~]#
You can also check the ring status using the following command.
[root@UA-HA ~]# corosync-cmapctl | grep member
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.203.134) r(1) ip(172.16.0.3)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.203.131) r(1) ip(172.16.0.2)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
[root@UA-HA ~]#
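corosync-quorumtool gives another quick sanity check after the change; for this cluster it should report two nodes, two total votes and "Quorate: Yes":
[root@UA-HA ~]# corosync-quorumtool -s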
We have successfully configured redundant rings for Corosync.
13. Clear the cluster maintenance mode.
[root@UA-HA ~]# pcs property unset maintenance-mode
or
[root@UA-HA ~]# pcs property set maintenance-mode=false
[root@UA-HA ~]# pcs resource
Resource Group: WEBRG1
vgres (ocf::heartbeat:LVM): Started UA-HA
webvolfs (ocf::heartbeat:Filesystem): Started UA-HA
ClusterIP (ocf::heartbeat:IPaddr2): Started UA-HA
webres (ocf::heartbeat:apache): Started UA-HA
Resource Group: UAKVM2
UAKVM2_res (ocf::heartbeat:VirtualDomain): Started UA-HA2
[root@UA-HA ~]#
Let’s break it !!
You can easily test the rrp_mode by pulling the network cable from one of the configured interfaces. I have simply used the "ifconfig br0 down" command on the UA-HA2 node to simulate this, assuming that the application/DB uses a different interface.
[root@UA-HA ~]# ping UA-HA2
PING UA-HA2 (192.168.203.131) 56(84) bytes of data.
^C
--- UA-HA2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1002ms
[root@UA-HA ~]#
Check the ring status. We can see that ring 0 has been marked as faulty.
[root@UA-HA ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id = 192.168.203.134
status = Marking ringid 0 interface 192.168.203.134 FAULTY
RING ID 1
id = 172.16.0.3
status = ring 1 active with no faults
[root@UA-HA ~]#
You can see that the cluster is running perfectly without any issues.
[root@UA-HA ~]# pcs resource
Resource Group: WEBRG1
vgres (ocf::heartbeat:LVM): Started UA-HA
webvolfs (ocf::heartbeat:Filesystem): Started UA-HA
ClusterIP (ocf::heartbeat:IPaddr2): Started UA-HA
webres (ocf::heartbeat:apache): Started UA-HA
Resource Group: UAKVM2
UAKVM2_res (ocf::heartbeat:VirtualDomain): Started UA-HA2
[root@UA-HA ~]#
Bring the br0 interface back up using "ifconfig br0 up". Ring 0 comes back online.
[root@UA-HA ~]# corosync-cfgtool -s
Printing ring status.
Local node ID 1
RING ID 0
id = 192.168.203.134
status = ring 0 active with no faults
RING ID 1
id = 172.16.0.3
status = ring 1 active with no faults
[root@UA-HA ~]#
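If a ring ever stays marked FAULTY after the underlying link is restored, the redundant ring state can be re-enabled cluster-wide with the -r option of corosync-cfgtool:
[root@UA-HA ~]# corosync-cfgtool -r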
Hope this article is informative to you. Share it! Comment it!! Be Sociable!!!