Replacing a Defective UCS 6324 Fabric Interconnect

February 28th, 2018
Occasionally, UCS hardware does go bad and recently I had the opportunity to replace a defective UCS 6324 Fabric Interconnect.

First, some background on the hardware…

The UCS 6324 Fabric Interconnect (FI) is a core component of Cisco's UCS mini system and extends the UCS architecture into smaller domains while providing the same unified server and networking capability as in the full-scale UCS solution.  The UCS 6324 Fabric-Interconnect provides embedded connectivity in a small UCS domain for up to 20 servers.

Unlike the traditional Fabric Interconnects, the UCS 6324 Fabric Interconnects are installed directly into the UCS 5108 chassis, in the slots where the IO modules would otherwise be installed.  (see diagram below)

Because the 6324 Fabric Interconnects are installed directly into the chassis as part of the UCS mini design, no IO modules are required, and there is no need for FEX cabling or L1/L2 cables between the FIs.

Before we begin to remove the defective hardware (in our case Fabric-Interconnect A), if there are QSFP licenses installed on the defective FI, we should contact TAC to have them transfer the port-licenses to the replacement Fabric Interconnect.  Since our 6324 Fabric Interconnect was not purchased with additional port-licenses, we're good here.

Other considerations before we begin:

• Cabling connected to the ports on the 6324 Fabric Interconnects should be labeled.

• Obtain a full-state backup and a configuration backup of the 6324 Fabric Interconnects.

• Verify the cluster state of the Fabric Interconnects.

 

To verify the cluster state of the existing Fabric Interconnects, SSH to UCS Manager (UCSM) and run the command "show cluster extended-state".  We're looking for the "HA Ready" message between the 2 Fabric Interconnects.

 

PA-UCS-6324-A# show cluster extended-state 

Cluster Id: 0xe144f0c272fc11e7-0x9678002a6a88c401

Start time: Tue Nov 21 22:08:26 2017

Last election time: Tue Nov 21 22:29:48 2017

A: UP, PRIMARY

B: UP, SUBORDINATE

 

A: memb state UP, lead state PRIMARY, mgmt services state: UP

B: memb state UP, lead state SUBORDINATE, mgmt services state: UP

   heartbeat state PRIMARY_OK

INTERNAL NETWORK INTERFACES:

eth1, UP

eth2, UP

HA READY

Detailed state of the device selected for HA storage:

Chassis 1, serial: FOX1743H1ZN, state: active

PA-UCS-6324-A#
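If you capture this output from a script rather than reading it by eye, the check can be automated.  The following is a minimal Python sketch, not a Cisco-provided tool: the function name parse_cluster_state and the SAMPLE text are my own, and it assumes the output keeps the "A: UP, PRIMARY" and "HA READY" line formats shown above.

```python
import re

def parse_cluster_state(output: str) -> dict:
    """Extract each fabric's role and the HA flag from 'show cluster state'
    or 'show cluster extended-state' text (format assumed from the sample)."""
    roles = {}
    lines = output.splitlines()
    for line in lines:
        # Matches only the short summary lines such as "A: UP, PRIMARY".
        m = re.match(r"^\s*([AB]):\s+(\w+),\s+(\w+)\s*$", line)
        if m:
            roles[m.group(1)] = {"state": m.group(2), "role": m.group(3)}
    # The cluster is ready only if a line reads exactly "HA READY".
    ha_ready = any(line.strip() == "HA READY" for line in lines)
    return {"roles": roles, "ha_ready": ha_ready}

SAMPLE = """\
A: UP, PRIMARY
B: UP, SUBORDINATE
A: memb state UP, lead state PRIMARY, mgmt services state: UP
B: memb state UP, lead state SUBORDINATE, mgmt services state: UP
HA READY
"""

if __name__ == "__main__":
    print(parse_cluster_state(SAMPLE))
```

Anything other than an exact "HA READY" line is treated as not ready, which errs on the side of stopping before any hardware is pulled.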

 

Once we have verified and performed the tasks listed above, we are ready to start the replacement of the defective Fabric Interconnect.

Step #1:  

The next task is to perform a fabric evacuation on the defective Fabric Interconnect.  The caveat is that a fabric evacuation can ONLY be performed on the subordinate Fabric Interconnect, so if the defective Fabric Interconnect is currently the primary, we first need to move the cluster lead over to the subordinate Fabric Interconnect.  This can be accomplished by connecting to local management and running the command "cluster lead b".  Afterwards, run the "show cluster state" command and verify that the cluster state is "HA Ready" before proceeding.

 

PA-UCS-6324-A# sh cluster state

Cluster Id: 0xe144f0c272fc11e7-0x9678002a6a88c401

 

A: UP, PRIMARY

B: UP, SUBORDINATE

HA READY

PA-UCS-6324-A# connect local

Cisco Nexus Operating System (NX-OS) Software

TAC support: http://www.cisco.com/tac

Copyright (c) 2009, Cisco Systems, Inc. All rights reserved.

The copyrights to certain works contained in this software are

owned by other third parties and used and distributed under

license. Certain components of this software are licensed under

the GNU General Public License (GPL) version 2.0 or the GNU

Lesser General Public License (LGPL) Version 2.1. A copy of each

such license is available at

http://www.opensource.org/licenses/gpl-2.0.php and

http://www.opensource.org/licenses/lgpl-2.1.php

PA-UCS-6324-A(local-mgmt)#

PA-UCS-6324-A(local-mgmt)# cluster lead b

If the system is at 'infrastructure firmware' auto-install 'pending user Ack' stage,this action will result in upgrading and rebooting current primary. Please check the outstanding faults (scope monitoring <enter> show new-faults) and make sure the data-paths on FI-B are established properly before making it primary to ensure there is no data outage.

Do you want to continue? (yes/no):yes

Cluster Id: 0xe144f0c272fc11e7-0x9678002a6a88c401

PA-UCS-6324-A(local-mgmt)# sh cluster state

Cluster Id: 0xe144f0c272fc11e7-0x9678002a6a88c401

A: UP, SUBORDINATE

B: UP, PRIMARY

HA READY

PA-UCS-6324-A(local-mgmt)#
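The decision made in this step (switch the lead only if the defective FI is currently primary) is simple enough to express in a few lines.  This is a hedged Python sketch: the function name lead_switch_command and the roles dictionary are illustrative, and the only CLI command it produces is the "cluster lead" command shown above.

```python
def lead_switch_command(roles: dict, defective: str):
    """Return the local-mgmt command to run before evacuation, or None.

    roles maps a fabric ID ("A" or "B") to "PRIMARY" or "SUBORDINATE";
    defective is the ID of the FI that will be replaced.
    """
    if defective not in roles:
        raise ValueError(f"unknown fabric ID: {defective!r}")
    if roles[defective] != "PRIMARY":
        # Already subordinate, so evacuation can start right away.
        return None
    # Hand the lead to the other fabric in the cluster.
    peer = next(fabric for fabric in roles if fabric != defective)
    return f"cluster lead {peer.lower()}"

if __name__ == "__main__":
    print(lead_switch_command({"A": "PRIMARY", "B": "SUBORDINATE"}, "A"))
```

In our case Fabric Interconnect A was both defective and primary, so the sketch yields the same "cluster lead b" command used in the transcript.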

 

To evacuate the vEths and vHBAs from the defective Fabric Interconnect, navigate to Equipment tab -> Fabric Interconnects/Fabric-Interconnect A and set "Admin Evac Mode" to "on".

Note that "Admin Evac Mode" can be enabled even though the subordinate FI is about to be replaced/removed, because the UCSM configuration is kept on both FIs as a cluster.

Once the Fabric-Interconnect A has been evacuated, and the vEths and vHBAs have been moved over to Fabric-Interconnect B, we can now safely remove the defective Fabric-Interconnect.

Step #2:

The next step is to physically swap the defective Fabric Interconnect with the replacement FI.  Once the replacement FI has been inserted into the slot, it takes roughly 2 minutes for the FI to power up.

Step #3:

After the replacement Fabric Interconnect has been powered up, the FI will boot and run POST.  Once this is completed, the Fabric Interconnect will run through a series of configuration dialog prompts.

The FI should then detect the presence of a peer Fabric-Interconnect and will prompt us to add the replacement FI to the cluster.  This process also copies the configuration details over to the replacement FI.  The setup process then compares the firmware version of the replacement FI to that of the active Fabric-Interconnect.  If the firmware versions are not identical, the configuration dialog will prompt for an upgrade to match the peer FI version.  This process takes around 5-10 minutes.

 

Enter the configuration method. (console/gui) ? 

  Enter the configuration method. (console/gui) ? console

  Installer has detected the presence of a peer Fabric interconnect. This Fabric interconnect will be added to the cluster. Continue (y/n) ? yes

  Enter the admin password of the peer Fabric interconnect: 

    Connecting to peer Fabric interconnect... done

    Retrieving config from peer Fabric interconnect... done

    Installer has determined that the peer Fabric Interconnect is running a different firmware version than the local Fabric. Cannot join cluster.

    Local Fabric Interconnect

      UCSM version     : 3.0(2c)

      Kernel version   : 5.0(3)N2(3.02c)

      System version   : 5.0(3)N2(3.02c)

      local_model_no   : Mini

    Peer Fabric Interconnect

      UCSM version     : 3.2(2b)

      Kernel version   : 5.0(3)N2(3.22a)

      System version   : 5.0(3)N2(3.22a)

      peer_model_no    : Mini

  Do you wish to update firmware on this Fabric Interconnect to the Peer's version? (y/n): y

Updating firmware of Fabric Interconnect....... [ Please don't press Ctrl+c while updating firmware ]

 Updating images 

 Please wait for firmware update to complete.... 

 Checking the Compatibility of new Firmware..... [ Please don't Press ctrl+c ]. 

Verifying image bootflash:/installables/switch/ucs-mini-k9-kickstart.5.0.3.N2.3.22a.bin for boot variable "kickstart".

[####################] 100% -- SUCCESS

Verifying image bootflash:/installables/switch/ucs-mini-k9-system.5.0.3.N2.3.22a.bin for boot variable "system".

[####################] 100% -- SUCCESS

Verifying image type.

[####################] 100% -- SUCCESS

Extracting "system" version from image bootflash:/installables/switch/ucs-mini-k9-system.5.0.3.N2.3.22a.bin.

[####################] 100% -- SUCCESS

Extracting "kickstart" version from image bootflash:/installables/switch/ucs-mini-k9-kickstart.5.0.3.N2.3.22a.bin.

[####################] 100% -- SUCCESS

Extracting "bios" version from image bootflash:/installables/switch/ucs-mini-k9-system.5.0.3.N2.3.22a.bin.

[####################] 100% -- SUCCESS

Performing module support checks.

[####################] 100% -- SUCCESS

Notifying services about system upgrade.

[####################] 100% -- SUCCESS

Compatibility check is done:

Module  bootable          Impact  Install-type  Reason

------  --------  --------------  ------------  ------

     1       yes      disruptive         reset  Incompatible image

Images will be upgraded according to following table:

Module             Image         Running-Version             New-Version  Upg-Required

------  ----------------  ----------------------  ----------------------  ------------

     1            system         5.0(3)N2(3.02c)         5.0(3)N2(3.22a)           yes

     1         kickstart         5.0(3)N2(3.02c)         5.0(3)N2(3.22a)           yes

     1              bios                v1.022.0                v1.022.0            no

     1         power-seq                    v1.0                    v1.0            no

Switch will be reloaded for disruptive upgrade.

Install is in progress, please wait.

Performing runtime checks.

[####################] 100% -- SUCCESS

Setting boot variables.

[####################] 100% -- SUCCESS

Performing configuration copy.

[####################] 100% -- SUCCESS

Install has been successful.
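If you save the console log, the version comparison the installer performs can be reproduced offline, for example to document which versions were involved.  The Python sketch below is my own illustration and assumes the "Local Fabric Interconnect" / "Peer Fabric Interconnect" block keeps the layout shown above; the names parse_fi_versions and needs_upgrade are hypothetical.

```python
import re

def parse_fi_versions(dialog: str) -> dict:
    """Collect UCSM/Kernel/System versions for the local and peer FI."""
    versions = {}
    section = None
    for line in dialog.splitlines():
        text = line.strip()
        if text.startswith("Local Fabric Interconnect"):
            section = "local"
        elif text.startswith("Peer Fabric Interconnect"):
            section = "peer"
        else:
            m = re.match(r"^(UCSM|Kernel|System) version\s*:\s*(\S+)$", text)
            if m and section:
                versions.setdefault(section, {})[m.group(1).lower()] = m.group(2)
    return versions

def needs_upgrade(versions: dict) -> bool:
    """True when any local version differs from the peer's."""
    return versions.get("local") != versions.get("peer")

SAMPLE = """\
    Local Fabric Interconnect
      UCSM version     : 3.0(2c)
      Kernel version   : 5.0(3)N2(3.02c)
      System version   : 5.0(3)N2(3.02c)
    Peer Fabric Interconnect
      UCSM version     : 3.2(2b)
      Kernel version   : 5.0(3)N2(3.22a)
      System version   : 5.0(3)N2(3.22a)
"""

if __name__ == "__main__":
    parsed = parse_fi_versions(SAMPLE)
    print(parsed, needs_upgrade(parsed))
```

The comparison is a plain equality test, mirroring the "not identical" rule described above, rather than an attempt to order the version strings.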

 

 

Step #4:

Once the replacement FI has been upgraded, it will now prompt us to enter the management IP information.  After the management IP information has been configured on the replacement FI, the Fabric Interconnect will reboot two more times and will come online in approximately 10 minutes.

 

 

 Firmware Updation Successfully Completed. Please wait to enter the IP address 

    Peer Fabric interconnect Mgmt0 IPv4 Address: 172.17.10.72

    Peer Fabric interconnect Mgmt0 IPv4 Netmask: 255.255.255.0

    Cluster IPv4 address          : 172.17.10.70

    Peer FI is IPv4 Cluster enabled. Please Provide Local Fabric Interconnect Mgmt0 IPv4 Address  

  Physical Switch Mgmt0 IP address : 172.17.10.71

  Apply and save the configuration (select 'no' if you want to re-enter)? (yes/no): yes

  Applying configuration. Please wait.

  Configuration file - Ok

 FI will now reboot after which you can access the switch.

WARNING: There is unsaved configuration!!!

WARNING: This command will reboot the system
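Before typing the management address at the prompt, it can help to sanity-check the addressing plan.  The sketch below uses Python's standard ipaddress module and assumes, as in this example, that the local FI, the peer FI, and the cluster IP all sit in the peer's management subnet; the function name check_mgmt_addressing is hypothetical.

```python
import ipaddress

def check_mgmt_addressing(local_ip, peer_ip, cluster_ip, netmask):
    """Return a list of problems; an empty list means the plan looks sane."""
    # Derive the management subnet from the peer's address and netmask.
    network = ipaddress.ip_network(f"{peer_ip}/{netmask}", strict=False)
    addresses = {
        "local": ipaddress.ip_address(local_ip),
        "peer": ipaddress.ip_address(peer_ip),
        "cluster": ipaddress.ip_address(cluster_ip),
    }
    problems = []
    for name, addr in addresses.items():
        if addr not in network:
            problems.append(f"{name} address {addr} is outside {network}")
    # All three addresses must be distinct.
    if len(set(addresses.values())) != len(addresses):
        problems.append("two or more addresses are identical")
    return problems

if __name__ == "__main__":
    print(check_mgmt_addressing("172.17.10.71", "172.17.10.72",
                                "172.17.10.70", "255.255.255.0"))
```

With the values from the transcript the list comes back empty, meaning no conflicts were found.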

 

After the reboots have been completed, the replacement FI should be good to go.  To verify IP reachability, console into the replacement FI and verify that it can reach the peer Fabric-Interconnect and its gateway.

 

PA-UCS-6324-A(local-mgmt)# ping 172.17.10.72

PING 172.17.10.72 (172.17.10.72) from 172.17.10.71 : 56(84) bytes of data.

64 bytes from 172.17.10.72: icmp_seq=1 ttl=64 time=5.94 ms

64 bytes from 172.17.10.72: icmp_seq=2 ttl=64 time=0.108 ms

64 bytes from 172.17.10.72: icmp_seq=3 ttl=64 time=0.109 ms

64 bytes from 172.17.10.72: icmp_seq=4 ttl=64 time=0.105 ms

64 bytes from 172.17.10.72: icmp_seq=5 ttl=64 time=0.111 ms

64 bytes from 172.17.10.72: icmp_seq=6 ttl=64 time=0.103 ms

^C

--- 172.17.10.72 ping statistics ---

6 packets transmitted, 6 received, 0% packet loss, time 5015ms

rtt min/avg/max/mdev = 0.103/1.079/5.942/2.174 ms

PA-UCS-6324-A(local-mgmt)# ping 172.17.10.1

PING 172.17.10.1 (172.17.10.1) from 172.17.10.71 : 56(84) bytes of data.

64 bytes from 172.17.10.1: icmp_seq=1 ttl=255 time=0.541 ms

64 bytes from 172.17.10.1: icmp_seq=2 ttl=255 time=0.581 ms

64 bytes from 172.17.10.1: icmp_seq=3 ttl=255 time=0.569 ms

64 bytes from 172.17.10.1: icmp_seq=4 ttl=255 time=0.648 ms

64 bytes from 172.17.10.1: icmp_seq=5 ttl=255 time=0.578 ms

64 bytes from 172.17.10.1: icmp_seq=6 ttl=255 time=0.567 ms

^C

--- 172.17.10.1 ping statistics ---

6 packets transmitted, 6 received, 0% packet loss, time 4996ms

rtt min/avg/max/mdev = 0.541/0.580/0.648/0.042 ms

PA-UCS-6324-A(local-mgmt)#
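If the ping output is collected by a script, the summary line is the part worth checking.  Here is a small Python sketch, assuming the statistics line keeps the "N packets transmitted, N received, N% packet loss" format shown above; the function name ping_loss_percent is my own.

```python
import re

def ping_loss_percent(output: str) -> float:
    """Return the packet-loss percentage from a ping statistics summary."""
    m = re.search(
        r"(\d+) packets transmitted, (\d+) received, "
        r"(\d+(?:\.\d+)?)% packet loss",
        output,
    )
    if not m:
        raise ValueError("no ping statistics line found")
    return float(m.group(3))

SAMPLE = """\
--- 172.17.10.72 ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5015ms
rtt min/avg/max/mdev = 0.103/1.079/5.942/2.174 ms
"""

if __name__ == "__main__":
    print(ping_loss_percent(SAMPLE))
```

Any non-zero loss toward the peer FI or the gateway is worth investigating before moving on to the next step.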

 

Step #5:

Once UCS Manager is up, open UCSM in a web browser and set the "Admin Evac Mode" to "off".  This can be configured under Equipment tab -> Fabric Interconnects/Fabric-Interconnect A.

 

Step #6:

As a final step, switch the cluster lead to Fabric-Interconnect A by running the following commands:

 

connect local-mgmt b

show cluster state

cluster lead a

 

PA-UCS-6324-A# connect local-mgmt b

Cisco Nexus Operating System (NX-OS) Software

TAC support: http://www.cisco.com/tac

Copyright (c) 2009, Cisco Systems, Inc. All rights reserved.

The copyrights to certain works contained in this software are

owned by other third parties and used and distributed under

license. Certain components of this software are licensed under

the GNU General Public License (GPL) version 2.0 or the GNU

Lesser General Public License (LGPL) Version 2.1. A copy of each

such license is available at

http://www.opensource.org/licenses/gpl-2.0.php and

http://www.opensource.org/licenses/lgpl-2.1.php

PA-UCS-6324-B(local-mgmt)# show cluster state

Cluster Id: 0xe144f0c272fc11e7-0x9678002a6a88c401

B: UP, PRIMARY

A: UP, SUBORDINATE

HA READY

PA-UCS-6324-B(local-mgmt)# cluster lead a

If the system is at 'infrastructure firmware' auto-install 'pending user Ack' stage,this action will result in upgrading and rebooting current primary. Please check the outstanding faults (scope monitoring <enter> show new-faults) and make sure the data-paths on FI-A are established properly before making it primary to ensure there is no data outage.

Do you want to continue? (yes/no):yes

Cluster Id: 0xe144f0c272fc11e7-0x9678002a6a88c401

PA-UCS-6324-B(local-mgmt)#

PA-UCS-6324-B(local-mgmt)# sh cluster state

Cluster Id: 0xe144f0c272fc11e7-0x9678002a6a88c401

B: UP, SUBORDINATE

A: UP, PRIMARY

HA READY

PA-UCS-6324-B(local-mgmt)#

 

Congratulations!  You have successfully replaced a defective UCS 6324 Fabric Interconnect.
