A cluster that just stopped working…

Following IBM recommendations, I upgraded the PowerHA code on both nodes. Executing halevel generated the same output on both of them.

law1:RDC:/root>/usr/es/sbin/cluster/utilities/halevel
6.1.0
law1:RDC:/root>/usr/es/sbin/cluster/utilities/halevel -s
6.1.0 SP6

After the reboot, I am still looking at the same misery: the “offending” node is not able to communicate with the other node and vice versa.
I also noticed that the “offending” node has intermittent ping issues; sometimes the pings returned and sometimes they did not.

Executing netstat -in on both nodes generated different output. The Network column showed the correct netmask (255.255.255.0) on one node and a wrong netmask (255.0.0.0) on the other. This is wrong; the output should be the same on both nodes, and in my case the subnet mask should be 255.255.255.0. Below, the first listing (from law002) shows the “correct” netmask value.

law002:RDC:/usr/es/sbin/cluster/etc>netstat -in
Name  Mtu   Network     Address            Ipkts Ierrs Opkts Oerrs  Coll
en0   1500  link#2      c2.88.dd.76.11.b     3753 0 4077   0     0
en0   1500  10.254.245  10.254.245.60        3753 0 4077   0     0
en0   1500  10.19.81    10.19.81.17          3753 0 4077   0     0
lo0   16896 link#1                           2897 0 2897   0     0
lo0   16896 127         127.0.0.1            2897 0 2897   0     0
lo0   16896 ::1%1                            2897 0 2897   0     0
law002:RDC:/usr/es/sbin/cluster/etc>

The output from law001 proves that the interface is operating with a different netmask (255.0.0.0) than law002:

law001:TechPark:/usr/es/sbin/cluster/etc>netstat -in
Name  Mtu   Network     Address            Ipkts Ierrs Opkts Oerrs  Coll
en0   1500  link#2      b2.6.4c.ee.c1.b      3706 0 3661   0     0
en0   1500  10          10.19.81.16          3706 0 3661   0     0
en0   1500  10          10.254.245.61        3706 0 3661   0     0
lo0   16896 link#1                            994 0  994   0     0
lo0   16896 127         127.0.0.1             994 0  994   0     0
lo0   16896 ::1%1                             994 0  994   0     0
law001:TechPark:/usr/es/sbin/cluster/etc>

Executing the next command, ifconfig -a, confirmed the same facts about the current netmask values on both machines.

law002:RDC:/usr/es/sbin/cluster/etc>ifconfig -a
en0: flags=1e080863,480
inet 10.254.245.60 netmask 0xffffff00 broadcast 10.254.245.255
inet 10.19.81.17 netmask 0xffffff00 broadcast 10.19.81.255
tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1
lo0: flags=e08084b,c0
inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
inet6 ::1%1/0
tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1

law001:TechPark:/usr/es/sbin/cluster/etc>ifconfig -a
en0: flags=1e080863,480
inet 10.19.81.16 netmask 0xff000000 broadcast 10.255.255.255
inet 10.254.245.61 netmask 0xff000000 broadcast 10.255.255.255
tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1
lo0: flags=e08084b,c0
inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
inet6 ::1%1/0
tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1

To complicate the matter even further, executing smitty chinet shows the correct netmask. So why does smitty “see” the same interface (en0) differently than netstat and ifconfig do? It is time to query the ODM.

law001:TechPark:/etc>odmget CuAt | grep -p en0
CuAt:
name = "en0"
attribute = "netaddr"
value = "10.254.245.61"
type = "R"
generic = "DU"
rep = "s"
nls_index = 4

CuAt:
name = "en0"
attribute = "netmask"
value = "255.255.255.0"
type = "R"
generic = "DU"
rep = "s"
nls_index = 8

CuAt:
name = "en0"
attribute = "state"
value = "up"
type = "R"
generic = "DU"
rep = "sl"
nls_index = 5

CuAt:
name = "en0"
attribute = "alias4"
value = "10.254.245.61,255.255.255.0"
type = "R"
generic = "DU"
rep = "s"
nls_index = 0

CuAt:
name = "en0"
attribute = "alias4"
value = "10.19.81.16,255.255.255.0"
type = "R"
generic = "DU"
rep = "s"
nls_index = 0

The last output exposed the source of my issues, but I have to admit that it took me a moment to understand what I was looking at and what was looking back at me. Why don’t you pause and take another look before reading further?
The listing above says that two IP aliases are present (configured) on the network interface en0! Oops. How did this happen?
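
If all you care about are the alias entries, the same information can be pulled with a narrower ODM query instead of grep. A minimal sketch, assuming only that the interface in question is en0:

# list just the IPv4 alias attributes recorded for en0
odmget -q "name=en0 and attribute=alias4" CuAt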

For the last many years, I have always configured cluster nodes using pretty much the same approach. Each node gets one IP address that is activated at boot time. This address is on a specially designated network which is not routable; the network is not used for anything but the boot addresses.
Next, each node gets its “proper” IP address, which is on the selected public network, using the IP alias mechanism. Regardless of their networks, both addresses employ the same netmask (an HA rule). The cluster service address is selected from the same public network as these aliases.
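
On AIX, that layout would be put in place with something along these lines. This is only an illustrative sketch built from law001’s addresses; the hostname argument and the choice to add the alias manually with chdev (rather than through the smitty hacmp menus mentioned below) are my assumptions:

# boot address on the non-routable network, activated at boot time
mktcpip -h law001 -a 10.254.245.61 -m 255.255.255.0 -i en0

# the “proper” public-network address, added as a single IPv4 alias (same netmask)
chdev -l en0 -a alias4=10.19.81.16,255.255.255.0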

Well, looking at the output above, we see that en0 does not have one alias but two!!! Not to mention that they both have the correct (in this case) netmask. This situation requires correction: both aliases have to be removed, and then the proper IP address attached to en0, followed by its IP alias. By the way, I think that the official IBM AIX way has changed and what I just said could be wrong. Check it, but I think that currently IBM advises configuring IP aliases via the smitty hacmp menus instead.

Before removing the IP aliases, I log out from the host and log back in through its HMC console connection, since removing the aliases from en0 will disable network access to the host.
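
If the console has to be opened from the HMC command line, it looks roughly like this; the managed system and LPAR names are placeholders:

# pick the partition from an interactive menu
vtmenu

# or open a virtual terminal for a specific LPAR directly
mkvterm -m <managed_system> -p <lpar_name>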

There are a few ways to remove an alias from a network interface. One is of course the smitty inetalias way, and another is the chdev way.

So, one alias at a time, remembering that these are IPv4 and not IPv6 aliases:

chdev -l en0 -a delalias4=10.254.245.61,255.255.255.0
chdev -l en0 -a delalias4=10.19.81.16,255.255.255.0
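
Before re-adding anything, it is worth a quick look at what is actually left on the running interface and in the interface table:

# see what en0 is left with after the removals
ifconfig en0
netstat -in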

I execute odmget to verify that all is clean and there are no leftovers, and then I apply the IP address followed by one alias. Finally, I verify that odmget shows what I have just done.
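
“Apply the IP address followed by one alias” can again be done with chdev; a minimal sketch using this node’s values (smitty chinet and smitty inetalias would accomplish the same):

# re-apply the base (boot) address and the correct netmask on en0
chdev -l en0 -a netaddr=10.254.245.61 -a netmask=255.255.255.0 -a state=up

# add back the single public-network alias
chdev -l en0 -a alias4=10.19.81.16,255.255.255.0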

law001:TechPark:/root>odmget CuAt | grep -p en0
CuAt:
name = "en0"
attribute = "netaddr"
value = "10.254.245.61"
type = "R"
generic = "DU"
rep = "s"
nls_index = 4

CuAt:
name = "en0"
attribute = "netmask"
value = "255.255.255.0"
type = "R"
generic = "DU"
rep = "s"
nls_index = 8

CuAt:
name = "en0"
attribute = "state"
value = "up"
type = "R"
generic = "DU"
rep = "sl"
nls_index = 5

CuAt:
name = "en0"
attribute = "alias4"
value = "10.19.81.16,255.255.255.0"
type = "R"
generic = "DU"
rep = "s"
nls_index = 0

There is only one alias! Most likely out of nothing but superstition (or maybe I was told to do it in the past and now do not remember when and why), I execute the next two commands (the rootvg is mirrored).

law001:TechPark:/etc>bootlist -m normal -o
hdisk0 blv=hd5 pathid=0
hdisk1 blv=hd5 pathid=0

law001:TechPark:/etc>bosboot -ad /dev/hdisk0
bosboot: Boot image is 45732 512 byte blocks.

law001:TechPark:/etc>bosboot -ad /dev/hdisk1
bosboot: Boot image is 45732 512 byte blocks

I reboot the node, check its ODM, verify and synchronize the cluster, and finally start it up and hand it over to the application team to continue their tasks…
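
For the record, that wrap-up looks more or less like the following; the exact smitty menu picks depend on the PowerHA level, so treat the paths in the comments as approximations:

# after the reboot, confirm the ODM still holds exactly one alias4 entry for en0
odmget -q "name=en0 and attribute=alias4" CuAt

# verify and synchronize the cluster definition, then start cluster services
smitty hacmp      # Extended Configuration -> Extended Verification and Synchronization
smitty clstart

# confirm the node joined and the resource groups are where they belong
lssrc -ls clstrmgrES | grep -i state
/usr/es/sbin/cluster/utilities/clRGinfo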
