Today (well yesterday) our primary router ran out of memory. We haven't fixed the problem yet, I hope that will be the subject of a follow up post, but for right now I want to take you through detection, characterization, and mitigation.
Detection. The way I found out about the problem was via ssh.
Attempting to ssh into the router running out of memory.
(nibz@observer:~) > ssh multiplexor.seas nibz@multiplexor.seas's password: Permission denied, please try again. nibz@multiplexor.seas's password: Connection closed by 2610:10:0:2::210For anyone familiar with sshing into ciscos this is not how it normally looks. Usually you get three attempts with just 'Password' and one with your user visible.
Attempting to ssh into a router not running out of memory.
(nibz@observer:~) > ssh nibz@wopr.seas Password: Password: Password: nibz@wopr.seas's password: Connection closed by 131.252.211.3I verified that it wasn't a knowing-the-password problem by using another account on the router. I connected a serial port to the router. Immediately found out of memory logs.
Console logs on the router.
10w0d: %AAA-3-ACCT_LOW_MEM_UID_FAIL: AAA unable to create UID for incoming calls due to insufficient processor memory
Logs sent to syslog.
Mar 2 00:43:52 multiplexor 4309463: 10w1d: %SYS-2-MALLOCFAIL: Memory allocation of 128768 bytes failed from 0x1A8C110, alignment 0 Mar 2 00:44:24 multiplexor 4309499: 10w1d: %SYS-2-MALLOCFAIL: Memory allocation of 128768 bytes failed from 0x1A8C110, alignment 0 Mar 2 00:47:37 multiplexor 4309643: 10w1d: %SYS-2-MALLOCFAIL: Memory allocation of 395648 bytes failed from 0x1AA03FC, alignment 0 Mar 2 02:18:33 multiplexor 4313756: 10w1d: %SYS-2-MALLOCFAIL: Memory allocation of 395648 bytes failed from 0x1AA03FC, alignmentI ran the 'show proc mem' command on the router to get a picture of the memory use of the router.
Show proc mem.
multiplexor#show proc mem Processor Pool Total: 177300444 Used: 174845504 Free: 2454940 I/O Pool Total: 16777216 Used: 13261296 Free: 3515920 Driver te Pool Total: 4194304 Used: 40 Free: 4194264 PID TTY Allocated Freed Holding Getbufs Retbufs Process 0 0 108150192 43169524 58720200 0 0 *Init* 0 0 12492 2712616 12492 0 0 *Sched* 0 0 399177972 389135628 8911036 14228691 1490354 *Dead* 0 0 0 0 102305848 0 0 *MallocLite* 1 0 973921416 973821100 224768 0 0 Chunk Manager 2 0 232 232 4160 0 0 Load Meter 3 0 0 0 7076 0 0 DHCPD Timer 4 0 4712 6732 11692 0 0 Check heaps 5 0 7862444 49770056 13540 6270020 28190703 Pool Manager 6 0 0 0 7160 0 0 DiscardQ Backgro 7 0 232 232 7160 0 0 Timers 8 0 0 0 4160 0 0 WATCH_AFS 9 0 284 728 7160 0 0 License Client N 10 0 2332421068 2332422156 7168 0 0 Licensing Auto U 11 0 1482732 1483016 7160 0 0 Image License br 12 0 2344349400 3601318192 169876 157356 0 ARP Input 13 0 550320160 550382256 7160 0 0 ARP Background 14 0 0 0 7168 0 0 CEF MIB API 15 0 0 0 7160 0 0 AAA_SERVER_DEADTThis shows that the router is indeed running very low on memory. How did we get here? Monitoring + SNMP + RRDtool to the rescue!
Doing some quick estimation on this it looks like it loses about a MB of free ram every 18 hours. RRDtool isn't the best, and getting the big picture graph is hard to do, but basically it has been losing free ram at this rate for a couple of weeks.
Finally we get a show tech-support off of this thing.
multiplexor# show tech-support | redirect tftp://10.0.21.2/cisco/mux-tech-support-mar1The redirect to tftp is a really cool pattern for getting information off of a cisco device. The tech-support run was about 50000 lines.
I will do a follow up post when I figure out whats going on.
Update 3-7-13:
The router ran completely out of memory. Even on console all I got was:
%% Low on memory; try again later %% Low on memory; try again later %% Low on memory; try again laterIt was happily switching and routing at this point, however. We rebooted it because it was Saturday evening and better to have it happen at a time of our choosing than to break iscsi unexpectedly later in the week. Upon reboot, the system returned to full functionality, but we can tell from the zenoss graphs that it is still leaking memory at a rate of 1Mb every 18hrs. At this rate it will need to be rebooted again in 10 weeks. We have opened a case with TAC. I will update again if anything comes from this.
Update 5-17-13:
We still have not fixed the problem. The router can go about 10 weeks before it reboots. This is in an educational setting where there are 12 week terms, meaning we need to reboot our core router a least once a term. Wheeee. We've been on the horn with Cisco who has had numerous techs look at it, and has even replaced the hardware, but the problem remains. Anyone with some ideas is welcome to contact me privately.
Hi,
ReplyDeletewe have the same issue right now, on console we've got:
%% Low on memory; try again later
Our router is 3825.
You wrote: "The router can go about 10 weeks before it reboots.". Does this mean that router reboot itself or you had to reboot it by yourselves?
Hi,
ReplyDeleteWe resolved the issue through an update provided by Cisco, I should update this post.
The router can go 10 weeks before it reboots because of the total memory on the router and the rate of decay.
I'll trace through our service request and figure out what else I can paste in here to help wayward travelers.
Hi blindscientist,
Deletedo you have any news?
I have the same problem.
Thank you
Hi Salvo. As I said, we upgraded the 3750x to the latest available version from Cisco. The memory problem went away.
Delete