Thursday, September 27, 2012

IOS Upgrade on Cisco WS-C4507R Chassis with Dual Supervisor V Engines

Today we will upgrade the IOS version on both WS-X4516 Supervisor Engine V modules in a WS-C4507R chassis. This blog post assumes that the supervisor engine in your 4507R chassis is already reachable on the network so that you can SSH into it.

First, go to the Cisco support site and download the latest IOS version (you need a Cisco support contract to have access to new IOS images). Place this image on your TFTP server. In this example, the TFTP server is a CentOS Linux machine called alice.company.com.


scp ~/Downloads/cat4500-entservicesk9-mz.150-2.SG5.bin alice.company.com:/tmp
ssh alice.company.com

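If you don't have a TFTP server installed, then set one up. When you edit the xinetd service file, set disable to no and make sure server_args points at /tftpboot.

sudo yum -y install tftp-server
sudo vi /etc/xinetd.d/tftp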
sudo chkconfig xinetd on
chkconfig --list xinetd
sudo mkdir -p /tftpboot/ios
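
The tftp-server package usually ships its xinetd service file with the service disabled, so the edit in the vi step is what turns it on. Here is a minimal sketch of a check, assuming the file uses the usual xinetd "disable = yes|no" syntax. The function name tftp_enabled is mine and not part of any package.

```shell
#!/bin/sh
# tftp_enabled FILE: succeed only if the xinetd service file FILE explicitly
# says "disable = no". Assumption: standard xinetd "disable = yes|no" syntax.
tftp_enabled() {
    # Take the value of the last uncommented "disable" line.
    value=$(sed -n 's/^[[:space:]]*disable[[:space:]]*=[[:space:]]*\([a-z]*\).*/\1/p' "$1" | tail -n 1)
    [ "$value" = "no" ]
}

# Usage: pass the service file you edited in the vi step, for example
#   tftp_enabled /path/to/service/file && echo "tftp is enabled in xinetd"
```

If the check fails, set disable to no in the file and restart xinetd before going any further.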

Now move the IOS image into the /tftpboot/ios directory.

sudo mv /tmp/cat4500-entservicesk9-mz.150-2.SG5.bin /tftpboot/ios

(Optional) Create a symbolic link to the image so that you can remember which hardware it's for. When your site has several different Cisco models, those symbolic links can be handy!

sudo ln -s /tftpboot/ios/cat4500-entservicesk9-mz.150-2.SG5.bin /tftpboot/ios/ws-c4507r

Check the file size of the new IOS image.

du -sb /tftpboot/ios/cat4500-entservicesk9-mz.150-2.SG5.bin
19458176 /tftpboot/ios/cat4500-entservicesk9-mz.150-2.SG5.bin

That means we need at least 19458176 bytes of free space on the internal flash storage of both supervisor engines. To check this, we must connect to the chassis's IP address and check the remaining space.

ssh 172.16.1.1
switch> enable
switch# show version | inc bytes of memory

cisco WS-C4507R (MPC8245) processor (revision 4) with 524288K bytes of memory.

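Note that this line reports the supervisor's DRAM, not its flash. The free flash space is printed on the last line of the dir bootflash: and dir slavebootflash: output. If there isn't enough, we can look at the flash content and erase some old IOS images. For example, let's pretend the old cat4000-i9s-mz.122-20.EW3.bin IOS file is still on the bootflash. We could remove it like this: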

switch# delete bootflash:cat4000-i9s-mz.122-20.EW3.bin
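
To make the space check less error-prone, the comparison can be scripted on the Linux side. This is only a sketch: it assumes you paste the summary line that dir bootflash: prints ("N bytes total (M bytes free)"), and the function names flash_free_bytes and image_fits are made up for this example.

```shell
#!/bin/sh
# flash_free_bytes LINE: print the free byte count found in the summary line
# of a "dir" listing, e.g. "59244544 bytes total (26308040 bytes free)".
flash_free_bytes() {
    printf '%s\n' "$1" | sed -n 's/.*(\([0-9][0-9]*\) bytes free).*/\1/p'
}

# image_fits IMAGE_BYTES LINE: succeed if the image fits in the free space.
image_fits() {
    free=$(flash_free_bytes "$2")
    [ -n "$free" ] && [ "$1" -le "$free" ]
}

# Example: image size from du, summary line pasted from the switch.
if image_fits 19458176 '59244544 bytes total (26308040 bytes free)'; then
    echo "image fits"
else
    echo "free up some flash first"
fi
```

Run the same check against the summary line of dir slavebootflash:, since the standby supervisor needs the space too.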

Check the redundancy status of both supervisor engines.

switch# show redundancy states
       my state = 13 -ACTIVE 
     peer state = 4  -STANDBY COLD 
           Mode = Duplex
           Unit = Primary
        Unit ID = 1

Redundancy Mode (Operational) = RPR
Redundancy Mode (Configured)  = RPR
Redundancy State              = RPR
Maintenance Mode = Disabled
    Manual Swact = enabled
  Communications = Up

   client count = 35
 client_notification_TMR = 240000 milliseconds
          keep_alive TMR = 9000 milliseconds
        keep_alive count = 1 
    keep_alive threshold = 18 
           RF debug mask = 0x0   

As we can see, we are running in RPR redundancy mode (or Route Processor Redundancy mode). This means that if the active supervisor engine fails or reloads, all ports will lose connection for several minutes while they synchronize with the other supervisor engine. That is not super duper.

Fortunately, there is another redundancy mode which offers a faster switchover. This mode is called Stateful SwitchOver, or SSO for short. When the supervisors are running in SSO redundancy mode, the switch keeps working fine at layer 2 during a supervisor switchover, and layer 3 connections lose their neighbor relationships for only about 50 milliseconds. The switchover is that quick because the two supervisors are kept synchronized while they are running.

So let's change our redundancy mode from RPR to SSO.

switch# conf terminal
switch(config)# redundancy 
switch(config-red)# mode sso
Changing to sso mode will reset the standby. Do you want to continue?[confirm]

As you can see, when we do this, the standby supervisor engine will reload. So be sure to check this supervisor's console output. Once the standby engine is back online, double-check the redundancy mode again.

switch# sh redundancy states
       my state = 13 -ACTIVE 
     peer state = 4  -STANDBY COLD 
           Mode = Duplex
           Unit = Primary
        Unit ID = 1

Redundancy Mode (Operational) = RPR
Redundancy Mode (Configured)  = Stateful Switchover
Redundancy State              = RPR
Maintenance Mode = Disabled
    Manual Swact = enabled
  Communications = Up

   client count = 35
 client_notification_TMR = 240000 milliseconds
          keep_alive TMR = 9000 milliseconds
        keep_alive count = 1 
    keep_alive threshold = 18 
           RF debug mask = 0x0   

Ok, so now only the standby supervisor engine is running in SSO mode. That's because we can't be in SSO mode on both supervisor engines without a reload of the active one, which means that the standby has to take over. During that supervisor switchover, there will be a layer 2 downtime and a layer 3 downtime of about 1 to 3 minutes, depending on the number of configured ports.
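
If you save the show redundancy states output to a file on your workstation, the gap between the configured and the operational mode can be spotted with a few lines of shell. This is a sketch that assumes the output format shown in this post. The function names redundancy_mode and modes_match are hypothetical.

```shell
#!/bin/sh
# redundancy_mode KIND: read "show redundancy states" output on stdin and print
# the mode for KIND, where KIND is "Operational" or "Configured".
redundancy_mode() {
    awk -v kind="$1" -F'= *' '
        index($0, "Redundancy Mode (" kind ")") > 0 {
            sub(/[ \t\r]+$/, "", $2)   # drop trailing whitespace
            print $2
            exit
        }
    '
}

# modes_match: read the same output on stdin and succeed when the operational
# mode equals the configured mode.
modes_match() {
    states=$(cat)
    op=$(printf '%s\n' "$states" | redundancy_mode Operational)
    cf=$(printf '%s\n' "$states" | redundancy_mode Configured)
    [ -n "$op" ] && [ "$op" = "$cf" ]
}
```

For example, modes_match < states.txt || echo "SSO is configured but not operational yet" would print the warning for the output above, where states.txt is whatever file you saved the output to.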

So, in order to continue with the IOS upgrade, you need to schedule a network maintenance!

Download the new IOS to both supervisor engines.

switch# copy tftp:/ios/cat4500-entservicesk9-mz.150-2.SG5.bin bootflash:

Address or name of remote host [172.16.1.33]? 
Source filename [/ios/cat4500-entservicesk9-mz.150-2.SG5.bin]? 
Destination filename [cat4500-entservicesk9-mz.150-2.SG5.bin]? 

The file will be transferred from the TFTP server into the active supervisor engine's internal flash storage. You will see a series of exclamation points (one for each packet) and then a few capital C letters. The C letters are shown while IOS verifies the new IOS image. You can do it manually with the verify command if you prefer.
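
For the manual check, many IOS releases accept verify /md5 bootflash:<image> and print an MD5 digest. I am assuming your release supports that form. The digest can then be compared with the one computed on the TFTP server, or with the one published on the Cisco download page. A minimal sketch for the Linux side, using md5sum and a hypothetical helper named md5_matches:

```shell
#!/bin/sh
# md5_matches FILE EXPECTED: succeed if the MD5 digest of FILE equals EXPECTED.
# The comparison ignores case, since digests are sometimes shown in upper case.
md5_matches() {
    actual=$(md5sum "$1" | awk '{ print $1 }')
    want=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ -n "$actual" ] && [ "$actual" = "$want" ]
}

# Usage, with the digest copied from the switch or from the download page:
#   md5_matches /tftpboot/ios/cat4500-entservicesk9-mz.150-2.SG5.bin <digest>
```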

While we're working with our TFTP server, we should make a backup of our configuration.

switch# copy run start
switch# copy start tftp

Next, copy the IOS to the standby supervisor engine.

switch# copy bootflash:cat4500-entservicesk9-mz.150-2.SG5.bin slavebootflash:

Now check that both supervisor engines have the new IOS image.

switch# dir bootflash:

Directory of bootflash:/

    2  -rwx    19458176  Sep 24 2012 13:08:32 -04:00  cat4500-entservicesk9-mz.150-2.SG5.bin

59244544 bytes total (26308040 bytes free)

switch# dir slavebootflash:
Directory of slavebootflash:/

    1  -rwx    13478072  Jun 29 2006 11:02:01 -04:00  cat4500-entservicesk9-mz.122-31.SG.bin
    3  -rwx    19458176  Sep 24 2012 13:52:19 -04:00  cat4500-entservicesk9-mz.150-2.SG5.bin

59244544 bytes total (6849736 bytes free)

We must make sure that configuration changes from the active supervisor engine are properly transferred to the standby one.

switch# conf terminal
switch(config)# redundancy
switch(config-red)# main-cpu
switch(config-r-mc)# auto-sync standard
switch(config-r-mc)# end

Now manually force a resynchronization of the supervisor engines.

switch# copy run start

Check your syslog output. When the synchronization occurs, you should see lines like these :

Sep 24 14:13:04 c4507r 230: 000258: Sep 24 14:13:03: %C4K_REDUNDANCY-5-CONFIGSYNC: The bootvar has been successfully synchronized to the standby supervisor
Sep 24 14:13:04 c4507r 231: 000259: Sep 24 14:13:03: %C4K_REDUNDANCY-5-CONFIGSYNC: The config-reg has been successfully synchronized to the standby supervisor
Sep 24 14:13:04 c4507r 232: 000260: Sep 24 14:13:03: %C4K_REDUNDANCY-5-CONFIGSYNC: The startup-config has been successfully synchronized to the standby supervisor
Sep 24 14:13:04 c4507r 233: 000261: Sep 24 14:13:03: %C4K_REDUNDANCY-5-CONFIGSYNC: The private-config has been successfully synchronized to the standby supervisor
Sep 24 14:13:04 c4507r 234: 000262: Sep 24 14:13:03: %C4K_REDUNDANCY-5-CONFIGSYNC_RATELIMIT: The vlan database has been successfully synchronized to the standby supervisor
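
Rather than eyeballing the syslog, you can check that all five items were synchronized. This sketch assumes the message wording shown above and that the switch's messages end up in a file you can read. The function name missing_sync_items is made up.

```shell
#!/bin/sh
# missing_sync_items: read syslog lines on stdin and print every item that has
# no "successfully synchronized" message. Prints nothing when all are present.
missing_sync_items() {
    log=$(cat)
    for item in bootvar config-reg startup-config private-config "vlan database"; do
        printf '%s\n' "$log" |
            grep -q "The $item has been successfully synchronized" ||
            echo "$item"
    done
}

# Usage, with whatever file your syslog server writes the switch messages to:
#   missing_sync_items < /path/to/switch.log
```

An empty result means the standby has received the boot variable, the config register, both configuration files and the VLAN database.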

Prepare the switch to boot the new IOS image.

switch# config terminal
switch(config)# config-register 0x2
switch(config)# boot system flash bootflash:cat4500-entservicesk9-mz.150-2.SG5.bin
switch(config)# end
switch# copy running-config startup-config

We can now update the IOS image on the standby supervisor engine. This can be done anytime because there is no traffic disruption. To see the standby supervisor upgrade messages, make sure to connect to the standby supervisor's console port.

switch# redundancy reload peer

This will reload the standby supervisor engine. You will see these syslog messages from the active one :

Sep 24 14:11:07 c4507r 224: 000249: Sep 24 14:11:06: %C4K_REDUNDANCY-5-CONFIGSYNC: The startup-config has been successfully synchronized to the standby supervisor
Sep 24 14:11:07 c4507r 225: 000250: Sep 24 14:11:07: %C4K_REDUNDANCY-5-CONFIGSYNC: The private-config has been successfully synchronized to the standby supervisor
Sep 24 14:11:17 c4507r 226: 000251: Sep 24 14:11:16: %C4K_REDUNDANCY-3-COMMUNICATION: Communication with the peer Supervisor has been lost
Sep 24 14:11:17 c4507r 227: 000252: Sep 24 14:11:16: %C4K_REDUNDANCY-3-SIMPLEX_MODE: The peer Supervisor has been lost

At the standby supervisor's console, you will see the upgrade messages. Once the new IOS is up and running on the standby supervisor, you should see this in your syslog server :

Sep 24 14:13:03 c4507r 228: 000256: Sep 24 14:13:02: %C4K_REDUNDANCY-2-IOS_VERSION_CHECK_FAIL: IOS version mismatch. Active supervisor version is 12.2(31)SG,. Standby supervisor version is 15.0(2)SG5,. Redundancy feature may not work as expected.
Sep 24 14:13:03 c4507r 229: 000257: Sep 24 14:13:03: %C4K_REDUNDANCY-3-COMMUNICATION: Communication with the peer Supervisor has been established

As the messages tell you, the IOS version is not the same on both supervisor engines. That's normal because we haven't updated the active one yet. We can check the version of both engines with this :

switch# show module

M MAC addresses                    Hw  Fw           Sw               Status
--+--------------------------------+---+------------+----------------+---------
 1 0017.e0fa.58c0 to 0017.e0fa.58c1 4.0 12.2(20r)EW1 12.2(31)SG       Ok       
 2 0017.e0fa.58c2 to 0017.e0fa.58c3 4.0 12.2(20r)EW1 15.0(2)SG5       Ok 

We can see that the module in slot 2 (the standby supervisor) is running version 15.0(2)SG5 while the active module in slot 1 is running version 12.2(31)SG.
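
The same comparison can be done on saved show module output. This is a sketch that assumes the column layout shown above, where supervisor rows carry Hw, Fw, Sw and Status columns. The function names sup_versions and versions_match are hypothetical.

```shell
#!/bin/sh
# sup_versions: read the MAC/Hw/Fw/Sw table of "show module" on stdin and print
# "slot version" for each row that has a software version column.
sup_versions() {
    awk '$3 == "to" && NF >= 8 { print $1, $7 }'
}

# versions_match: succeed when every row found runs the same software version.
versions_match() {
    count=$(sup_versions | awk '{ print $2 }' | sort -u | wc -l)
    [ "$((count))" -eq 1 ]
}
```

On the table above, sup_versions prints one line per supervisor and versions_match fails until the active supervisor has been upgraded as well.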

To upgrade the active supervisor engine, we must issue the following. Make sure you're connected to the active supervisor's console port when you issue this command. THIS COMMAND WILL CAUSE A LAYER 2 AND LAYER 3 NETWORK OUTAGE! So make sure you do this on a scheduled maintenance.

switch# redundancy force-switchover

The active supervisor will reload, forcing a supervisor engine switchover. This will make the standby supervisor engine take control of the chassis. During the switchover, the supervisor that used to be the active one reloads into the new IOS version. After a few minutes, we can see that they are both running the same IOS version :

ssh 172.16.1.1
switch> enable

switch# sh mod | inc 15.0
 1 0017.e0fa.58c0 to 0017.e0fa.58c1 4.0 12.2(20r)EW1 15.0(2)SG5       Ok       
 2 0017.e0fa.58c2 to 0017.e0fa.58c3 4.0 12.2(20r)EW1 15.0(2)SG5       Ok      

We can also see that our redundancy has changed for both engines from RPR to SSO.


switch# sh redundancy states
       my state = 13 -ACTIVE 
     peer state = 8  -STANDBY HOT 
           Mode = Duplex
           Unit = Secondary
        Unit ID = 2

Redundancy Mode (Operational) = Stateful Switchover
Redundancy Mode (Configured)  = Stateful Switchover
Redundancy State              = Stateful Switchover
     Maintenance Mode = Disabled
    Manual Swact = enabled
 Communications = Up

   client count = 60
 client_notification_TMR = 240000 milliseconds
          keep_alive TMR = 9000 milliseconds
        keep_alive count = 1 
    keep_alive threshold = 18 
           RF debug mask = 0x0


That's it! We now have both supervisor engines running the new IOS version and the stateful switchover redundancy mode.

Troubleshooting


Standby Supervisor Engine Reload Loop


Sometimes the standby supervisor might go into a reload loop because of this message :

Current BOOT file is --- flash:cat4500-entservicesk9-mz.150-2.SG5.bin
Invalid filename flash:cat4500-entservicesk9-mz.150-2.SG5.bin. It must begin with device name.

You are then presented with a ROMMON prompt. Hit Ctrl-C to prevent a reload. Then, at the prompt, issue a boot command to load the IOS.

rommon 1 > boot bootflash:cat4500-entservicesk9-mz.150-2.SG5.bin

The standby supervisor engine will boot the IOS. You can then check your configuration and fix the problem.

switch# show running-config | inc boot system
boot system flash flash:cat4500-entservicesk9-mz.150-2.SG5.bin
boot system flash bootflash:cat4500-entservicesk9-mz.150-2.SG5.bin

The first line is our problem. It says to boot from flash: instead of bootflash: like the second line. So to fix our problem, we only need to remove that first line.

switch# conf terminal
switch(config)# no boot system flash flash:cat4500-entservicesk9-mz.150-2.SG5.bin
switch(config)# end
switch# copy run start
switch# sh run | inc boot system
boot system flash bootflash:cat4500-entservicesk9-mz.150-2.SG5.bin
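
To catch this kind of line before the reload, you can filter saved show running-config output on your workstation. This sketch assumes, as in this post, that every image lives on bootflash:. Adjust the pattern if you boot from another device. The function name bad_boot_lines is made up.

```shell
#!/bin/sh
# bad_boot_lines: read configuration lines on stdin and print every
# "boot system" line whose image path does not start with "bootflash:".
bad_boot_lines() {
    awk '$1 == "boot" && $2 == "system" && $NF !~ /^bootflash:/ { print }'
}

# Usage, with a file holding the saved output of "show running-config":
#   bad_boot_lines < running-config.txt
```

Any line it prints is a candidate for the no boot system command.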

There, that should do it. Now issue the standby supervisor reload again and see how it goes.

switch# redundancy reload peer

TFTP Configuration Backup Error


When we back up the switch's configuration to our TFTP server, we might get an access denied error. That's because our TFTP server doesn't have the right to write into the /tftpboot directory. A simple, yet not very secure, way to fix this is to change the permissions right before we do the config backup, then put them back the way they were.

sudo chmod o+rwx /tftpboot
switch# copy start tftp
sudo chmod o-w /tftpboot
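
A narrower alternative is to open up only the file that will receive the backup. As far as I know, the in.tftpd daemon from the tftp-server package only accepts uploads to files that already exist and are world-writable unless it is started with the -c option, so pre-creating the file is often enough. This is a sketch with made-up helper names. The file name is whatever you answer at the destination filename prompt.

```shell
#!/bin/sh
# prepare_upload DIR NAME: create an empty, world-writable file DIR/NAME so the
# TFTP daemon can overwrite it. Run it with sudo when DIR is owned by root.
prepare_upload() {
    touch "$1/$2" && chmod 666 "$1/$2"
}

# lock_upload DIR NAME: remove write access for group and others once the
# backup has been copied.
lock_upload() {
    chmod go-w "$1/$2"
}

# Usage around the backup:
#   prepare_upload /tftpboot switch-confg
#   (on the switch) copy start tftp
#   lock_upload /tftpboot switch-confg
```

This way the /tftpboot directory itself never becomes world-writable.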

13 comments:

  1. Thank you so much David.

    Following your tutorial, I was able to successfully replace a SUP and everything worked as a charm.

    Keep up the good work :)

    Cheers,
    Chris

    1. Hi Chris,

      My pleasure! I'm glad I could help :)

      DA+

  2. Thanks for this guide. I found it very useful and it really helped me upgrade of two (old) 4500's

    Brian

  3. Hello David,
    I have a question. My 4507 currently is running version 12.2(50)SG8. I see version 15.0.2SG3 with the feature that I need. I was wondering if I could go and install this version. Since it is 15.X do I have to have a licence to activate it? Your comments and help will be greatly appreciated.

    Best Regards,
    Robert

    1. Hello Robert,

      I'm not a Cisco IOS License expert. I think you pay by features rather than by specific versions. But don't quote me on that!

      One way to find out if you can have version 15.x is simply to try and download it from your Cisco account. If you can't, then it's probably that your Cisco contract doesn't cover it. If you can, then you're in luck.

      In the end, I've always found it was easier to say « I'm sorry » than to ask for permission ;)

      HTH,

      DA+

      P.S. Of course, a good way to get a better answer than mine is to call your friendly Cisco account manager.

    2. Thanks for answering David.
      Best Regards,
      Robert

  4. Thanks you for documentation !

    Best Regards,

  5. Replies
    1. So i have question plz:
      when i am done with the upgrading and the configuration shouldn't i put the config-register back to be 0x2102 or leave it as 0x2 before shipping it to the customer?
      Thanks

    2. Hey Ahmed,

      I would think so, yes. But please double-check with your Cisco support team to be 100 % positive on this one.

      HTH,

      David

  6. Hi David. Many thanks for this article. Very well written. I had a query regarding upgrading the ROMMON and IOS at the same time for a 4510 with SSO. Have you come across that scenario before, & is it possible to reload the peer and have it come back up with the new ROMMON and IOS and then do a redundancy switch-over? And is this truly a "hitless" change? I am quite sure that there has to be a layer 2 and or layer 3 disconnection. But I keep reading "zero downtime" upgrades which confuses me.
