

SystemMirror, building clusters and fighting with the CAA services

I am expecting something good to come my way soon. Why? I have suffered for a few days trying to get a very simple cluster to work; it consistently refused all of my efforts, insisting on not letting the first node start the HA services and join the cluster. This is not an extraordinary cluster; there is nothing special about it: just two nodes and one resource group, that is all.

But first, this is how the cluster was created.

# clmgr add cluster lawmsmpa1 \
            nodes=lawmsmpa1c1,lawmsmpa1c2 \
            type=NSC \
            heartbeat_type=unicast \
            repositories=hdisk2

Note that hdisk2 has an identical PVID on both nodes and that its reserve_policy is set to no_reserve on both of them.

# lsattr -El hdisk2 | grep reserve_policy
reserve_policy no_reserve 
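
On both nodes, lspv confirms the shared PVID, and if the reserve policy had been different it could have been fixed up front with chdev (shown here only as a sketch, assuming the disk is not yet in use):

# lspv | grep hdisk2
# chdev -l hdisk2 -a reserve_policy=no_reserve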

Next, I defined the application controller, i.e. the entity identifying the highly available application's start and stop scripts.

# clmgr add application_controller lawmsmp \
            startscript=/usr/es/sbin/cluster/scripts/start_cluster.ksh \
            stopscript=/usr/es/sbin/cluster/scripts/stop_cluster.ksh
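
The scripts themselves are of course site-specific; purely as a hypothetical ksh sketch (the application command below is an assumption, not the one used on this cluster), a start script needs little more than this:

#!/usr/bin/ksh
# /usr/es/sbin/cluster/scripts/start_cluster.ksh - hypothetical skeleton
/usr/local/bin/start_lawson.sh    # assumed application start command
rc=$?
exit $rc                          # a non-zero exit is reported as a start failure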

The service address was defined next.

# clmgr add service_ip 10.27.45.1 \
            netmask=255.255.255.0 \
            network=net_ether_01
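
A service address must be resolvable by name on both nodes before the cluster will use it, so the label referenced later in the resource group belongs in /etc/hosts up front; a quick sanity check:

# grep -w lawmsmpa1 /etc/hosts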

Finally, the cluster resource group is defined as follows.

# clmgr add resource_group lawmsmpRG \
            nodes=lawmsmpa1c1,lawmsmpa1c2 \
            startup=OHN fallover=FNPN fallback=NFB \
            service_label=lawmsmpa1 \
            applications=lawmsmp \
            volume_group=lawson_vg \
            fs_before_ipaddr=true 
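
Before synchronizing, the freshly defined objects can be reviewed with clmgr query, which works for every object class used above, for example:

# clmgr query resource_group lawmsmpRG
# clmgr query application_controller lawmsmp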

At this time, we should have the caavg_private volume group present on hdisk2 on both nodes… Well, this was not the case. It was present on the second node (lawmsmpa1c2) but not on the first one, which, by the way, is the primary node of this cluster (it was declared first while defining the cluster). A quick lspv check on each node, shown below, made the difference obvious.
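
Run on both nodes (on lawmsmpa1c2 hdisk2 showed up in caavg_private; on lawmsmpa1c1 it did not):

# lspv | grep hdisk2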
Rebooting both nodes definitely did not help; the situation did not change. Every attempt to start cluster services on the primary node failed with one of two messages –

lawmsmpa1c1: rc.cluster: Error: CAA cluster services are not active on this node.

or

RSCT cluster services (cthags) are not active on this node

And the clconfd subsystem was not running.

# lssrc -g caa
Subsystem         Group            PID          Status
  clcomd           caa              13041908     active
  clconfd          caa              failed
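
In a healthier situation a plain SRC restart of the subsystem might have been worth a try (standard SRC usage, not from the original session); here it would not have helped, since the real problem was elsewhere:

# startsrc -s clconfd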

Indeed, it does not work. The cluster services and the resource group could be started and brought online on the second node but not on the first one. Looking at /etc/services on lawmsmpa1c1, I noticed that one HA entry was missing. The missing entry was

caa_cfg         6181/tcp
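
Comparing both nodes and appending the missing line is trivial (appending with echo is shown only as a sketch; editing the file by hand works just as well):

# grep caa /etc/services
# echo "caa_cfg         6181/tcp" >> /etc/services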

Well, this could explain at least some of the reasons behind this disaster, but not all of them. The cluster sync that followed the update to the services file did not help. The repository was still mangled and the caavg_private volume group was still present only on the second node. It was time for some scrubbing!

Executed on both cluster nodes:

# export CAA_FORCE_ENABLED=1      # allow forced CAA operations
# rmcluster -f -r hdisk2          # wipe the CAA metadata off the repository disk
# lsattr -El cluster0             # confirm the cluster0 pseudo-device is still defined
# rmdev -dl cluster0              # and remove it from the ODM

On lawmsmpa1c1

# mkvg -f -y scrubvg hdisk2       # write a fresh VGDA over whatever was left on the disk
# varyoffvg scrubvg

On lawmsmpa1c2

# importvg -f -y scrubvg hdisk2   # a successful import means node 2 can read what node 1 wrote
# varyoffvg scrubvg
# exportvg scrubvg

On lawmsmpa1c1

# exportvg scrubvg

Yes, we validated it – the disk for the repository volume group is accessible from both nodes.
On both nodes

# shutdown -Fr

After the reboot, the next command, executed on both nodes, showed that the cluster definition was still present (do not be surprised, we only cleaned the repository disk!).

# odmget -q name=cluster0 CuAt
CuAt:
        name = "cluster0"
        attribute = "node_uuid"
        value = "749fb3f2-05f9-11e5-95f1-56bdbf443b02"
        type = "R"
        generic = "DU"
        rep = "s"
        nls_index = 3
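
Had a truly clean slate been needed, this leftover could have been removed with standard ODM tooling (not necessary here, since the coming synchronization happily re-uses it):

# odmdelete -o CuAt -q name=cluster0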

I crossed my fingers and asked to synchronize the cluster, which, if everything works as intended, will create the caavg_private volume group on both nodes – the way it was designed to be!

# clmgr sync cluster
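
Whether the repository is finally healthy on both nodes can be confirmed with lscluster and lspv:

# lscluster -m
# lspv | grep caavg_private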

It worked like a charm and the cluster started like nothing bad ever happened. Thank God it is Friday! 🙂

By the way, PowerHA 7.1.3 has the following requirement:

CAAnodename  =   hostname   =  COMMUNICATION_PATH

Because the hostname and the COMMUNICATION_PATH are both in the short-name format, the contents of /etc/hosts also followed this formula.

100.127.145.2      lawmsmpa1c1
100.127.145.3      lawmsmpa1c2
100.127.145.1      lawmsmpa1   
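
Each of the three values can be checked directly; the odmget query below against the HACMPnode class is standard PowerHA ODM usage, shown as a sketch:

# hostname
# lscluster -m | grep -i "node name"
# odmget -q "object=COMMUNICATION_PATH" HACMPnode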




One Response


  1. Miroslav Pilat says

    Hi, I was fighting a bit with cluster sync lately. IBM support suggested changing the hostname to the short name, but that did not work until I read in your article that CAAnodename = hostname = COMMUNICATION_PATH. I updated the COMMPATH of the nodes via the smit menus and finally got the cluster sync to work. Thanks.







