Saturday, March 2, 2013

Cisco Out Of Memory

Stardate: 90772.3

Today (well yesterday) our primary router ran out of memory. We haven't fixed the problem yet, I hope that will be the subject of a follow up post, but for right now I want to take you through detection, characterization, and mitigation.

Detection. The way I found out about the problem was via ssh.

Attempting to ssh into the router running out of memory.

(nibz@observer:~)
> ssh multiplexor.seas
nibz@multiplexor.seas's password:
Permission denied, please try again.
nibz@multiplexor.seas's password:
Connection closed by 2610:10:0:2::210
For anyone familiar with sshing into ciscos this is not how it normally looks. Usually you get three attempts with just 'Password' and one with your user visible.

Attempting to ssh into a router not running out of memory.

(nibz@observer:~)
> ssh nibz@wopr.seas
Password:
Password:
Password:
nibz@wopr.seas's password:
Connection closed by 131.252.211.3
I verified that it wasn't a knowing-the-password problem by using another account on the router. I connected a serial port to the router. Immediately found out of memory logs.

Console logs on the router.

10w0d: %AAA-3-ACCT_LOW_MEM_UID_FAIL: AAA unable to create UID for incoming calls due to insufficient processor memory

Logs sent to syslog.

Mar  2 00:43:52 multiplexor 4309463: 10w1d: %SYS-2-MALLOCFAIL: Memory allocation of 128768 bytes failed from 0x1A8C110, alignment 0 
Mar  2 00:44:24 multiplexor 4309499: 10w1d: %SYS-2-MALLOCFAIL: Memory allocation of 128768 bytes failed from 0x1A8C110, alignment 0 
Mar  2 00:47:37 multiplexor 4309643: 10w1d: %SYS-2-MALLOCFAIL: Memory allocation of 395648 bytes failed from 0x1AA03FC, alignment 0 
Mar  2 02:18:33 multiplexor 4313756: 10w1d: %SYS-2-MALLOCFAIL: Memory allocation of 395648 bytes failed from 0x1AA03FC, alignment 
I ran the 'show proc mem' command on the router to get a picture of the memory use of the router.

Show proc mem.

multiplexor#show proc mem
Processor Pool Total:  177300444 Used:  174845504 Free:    2454940
      I/O Pool Total:   16777216 Used:   13261296 Free:    3515920
Driver te Pool Total:    4194304 Used:         40 Free:    4194264
 
 PID TTY  Allocated      Freed    Holding    Getbufs    Retbufs Process
   0   0  108150192   43169524   58720200          0          0 *Init*
   0   0      12492    2712616      12492          0          0 *Sched*
   0   0  399177972  389135628    8911036   14228691    1490354 *Dead*
   0   0          0          0  102305848          0          0 *MallocLite*
   1   0  973921416  973821100     224768          0          0 Chunk Manager
   2   0        232        232       4160          0          0 Load Meter
   3   0          0          0       7076          0          0 DHCPD Timer
   4   0       4712       6732      11692          0          0 Check heaps
   5   0    7862444   49770056      13540    6270020   28190703 Pool Manager
   6   0          0          0       7160          0          0 DiscardQ Backgro
   7   0        232        232       7160          0          0 Timers
   8   0          0          0       4160          0          0 WATCH_AFS
   9   0        284        728       7160          0          0 License Client N
  10   0 2332421068 2332422156       7168          0          0 Licensing Auto U
  11   0    1482732    1483016       7160          0          0 Image License br
  12   0 2344349400 3601318192     169876     157356          0 ARP Input
  13   0  550320160  550382256       7160          0          0 ARP Background
  14   0          0          0       7168          0          0 CEF MIB API
  15   0          0          0       7160          0          0 AAA_SERVER_DEADT
This shows that the router is indeed running very low on memory. How did we get here? Monitoring + SNMP + RRDtool to the rescue!

Doing some quick estimation on this it looks like it loses about a MB of free ram every 18 hours. RRDtool isn't the best, and getting the big picture graph is hard to do, but basically it has been losing free ram at this rate for a couple of weeks.

Finally we get a show tech-support off of this thing.

multiplexor# show tech-support | redirect tftp://10.0.21.2/cisco/mux-tech-support-mar1
The redirect to tftp is a really cool pattern for getting information off of a cisco device. The tech-support run was about 50000 lines.

I will do a follow up post when I figure out whats going on.

Update 3-7-13:

The router ran completely out of memory. Even on console all I got was:

%% Low on memory; try again later

%% Low on memory; try again later

%% Low on memory; try again later
It was happily switching and routing at this point, however. We rebooted it because it was Saturday evening and better to have it happen at a time of our choosing than to break iscsi unexpectedly later in the week. Upon reboot, the system returned to full functionality, but we can tell from the zenoss graphs that it is still leaking memory at a rate of 1Mb every 18hrs. At this rate it will need to be rebooted again in 10 weeks. We have opened a case with TAC. I will update again if anything comes from this.

Update 5-17-13:

We still have not fixed the problem. The router can go about 10 weeks before it reboots. This is in an educational setting where there are 12 week terms, meaning we need to reboot our core router a least once a term. Wheeee. We've been on the horn with Cisco who has had numerous techs look at it, and has even replaced the hardware, but the problem remains. Anyone with some ideas is welcome to contact me privately.

4 comments:

  1. Hi,

    we have the same issue right now, on console we've got:
    %% Low on memory; try again later

    Our router is 3825.

    You wrote: "The router can go about 10 weeks before it reboots.". Does this mean that router reboot itself or you had to reboot it by yourselves?

    ReplyDelete
  2. Hi,

    We resolved the issue through an update provided by Cisco, I should update this post.

    The router can go 10 weeks before it reboots because of the total memory on the router and the rate of decay.

    I'll trace through our service request and figure out what else I can paste in here to help wayward travelers.

    ReplyDelete
    Replies
    1. Hi blindscientist,
      do you have any news?
      I have the same problem.

      Thank you

      Delete
    2. Hi Salvo. As I said, we upgraded the 3750x to the latest available version from Cisco. The memory problem went away.

      Delete