Thursday, November 14, 2013

CentOS Fast Puppet Master

Ever been on CentOS 6 and wanted a Puppet master fast?

Try this:

curl -L | sh

Wednesday, November 13, 2013

Puppet Bootstrap One-liners


Enterprise Linux 5:

# rpm -Uvh

Enterprise Linux 6:

# rpm -Uvh

Puppet Labs

Enterprise Linux 5:

# rpm -ivh

Enterprise Linux 6:

# rpm -ivh


Debian Squeeze:
# wget
# dpkg -i puppetlabs-release-squeeze.deb
# apt-get update
Ubuntu Precise:
# wget
# dpkg -i puppetlabs-release-precise.deb
# apt-get update

Tuesday, September 24, 2013

Puppet Time through abuse of inline_template

Stardate 91334.3

I was asked a pretty reasonable question about puppet:

Can I get access to the time in Puppet without resorting to writing a fact?

This seemingly reasonable task is not so easy. Puppet does not define the time for you. However, we can use the inline_template() function for great good.

For the uninitiated, inline_template() calls out to the ERB templating processor without requiring a template file. This is very useful for simple file resources, or in conjunction with the file_line resource from the Puppet Labs stdlib module.

file { '/etc/motd':
   ensure  => file,
   content => inline_template("Welcome to <%= @hostname %>"),
}
However, we can abuse this to do anything in ERB that we can do in Ruby. A friend of mine, familiar with the Jinja templating system from Python, remarked: 'so it's like PHP, and I'm not even being derogatory.' This means we can access the time using Ruby's built-in Time class.

$time = inline_template('<%= Time.now %>')
However, this is being evaluated on the Puppet master, not the node. So if the two are in different timezones, what then? The first way to improve this is to use UTC.
$time = inline_template('<%= Time.now.utc %>')
But we can actually go further and define two variables, one for time in UTC of catalog compilation and one for local time for the checking in node. While we don't have a fact for the time on the node, we do have a fact for its timezone.
$time_utc = inline_template('<%= Time.now.utc %>')
$time_local = inline_template("<%= (@time + Time.zone_offset(@timezone)).strftime('%c') %>")
We use the strftime('%c') to strip the UTC timezone label off of the time.
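Outside of Puppet, the same UTC-versus-local bookkeeping can be sketched with GNU date (a quick illustration of mine, not from the original manifests; the timezone name is an arbitrary example and `-d` assumes GNU coreutils):

```shell
#!/bin/sh
# Epoch seconds are timezone-independent; formatting is where the
# timezone matters, which is why the manifest reaches for strftime.
now_epoch=$(date -u +%s)

# The same instant rendered in UTC and in a node's local zone.
utc_time=$(date -u -d "@$now_epoch" '+%c')
local_time=$(TZ='America/Los_Angeles' date -d "@$now_epoch" '+%c')

echo "UTC:   $utc_time"
echo "Local: $local_time"
```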

Going further:

We can take this a step further by using the inline_template() function to do time comparisons:

$go_time = "2013-09-24 08:00:00 UTC" # We could seed this in hiera!

$time = inline_template("<%= Time.now.utc %>")

if str2bool(inline_template("<%= Time.now.utc > Time.parse(@go_time) %>")) {

    notify { "GO GO GO: Make the changes!": }
}

What the above code does is gate changes based on time. This allows us to 'light the fuses' and only make changes to production after a certain time, after a downtime window begins for instance. Note that the Puppet clients are still going to check in on their own schedule, but since we know what time our downtime is starting, we can use Puppet to make sure they kick off a Puppet run at the right time.

$go_time   = "2013-09-24 08:00:00 UTC" # We could seed this in hiera!
$done_time = "2013-09-24 12:00:00 UTC" # We could seed this in hiera, too!

$time = inline_template("<%= Time.now.utc %>")

if str2bool(inline_template("<%= Time.now.utc > Time.parse(@go_time) %>")) {

  notify { "GO GO GO: Make the changes!": }

  cron { 'fire off the puppet run':
     ensure  => 'absent',
     command => 'puppet agent --no-daemonize',
     day     => '24', # we can seed the date here in hiera, albeit more verbosely
     hour    => '8',
     minute  => '1',
     user    => 'root',
  }

} else {

  cron { 'fire off the puppet run':
     ensure  => 'present',
     command => 'puppet agent --no-daemonize',
     day     => '24', # we can seed the date here in hiera, albeit more verbosely
     hour    => '8',
     minute  => '1',
     user    => 'root',
  }
}
What is this doing? We put this code out at 4:00 pm, get some dinner, log in at 7:30 pm, and wait for our 8:00 pm downtime. In all Puppet runs before 8:00 pm, a cronjob is installed that will effect a Puppet run precisely one minute after the downtime begins, and the potentially hazardous resources are passed over. In all Puppet runs after 8:00 pm, the new state is ensured and the cronjob is removed. Then this code, which should define the new state of the system, can be hoisted into regular classes and defined types.
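The gate itself is just an epoch comparison. Here is a rough shell equivalent (my sketch, assuming GNU date; the window start is the same hypothetical value used above):

```shell
#!/bin/sh
# Hypothetical downtime window start, as in the manifest above.
go_time='2013-09-24 08:00:00 UTC'

# Convert both sides to epoch seconds and compare.
go_epoch=$(date -d "$go_time" +%s)
now_epoch=$(date -u +%s)

if [ "$now_epoch" -gt "$go_epoch" ]; then
  echo "GO GO GO: Make the changes!"
else
  echo "Not yet: waiting for the downtime window."
fi
```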

Monday, September 16, 2013

Custom categories with Puppet data in modules

In the old way of doing things, we would have a hierarchy in our hiera.yaml that looked something like this:

  - defaults
  - %{clientcert}
  - %{environment}
  - global

In the new way the hierarchy has been renamed categories, and each level of it is a category.

We can define category precedence in the system wide hiera.yaml, the module specific hiera.yaml, and the binder_config.yaml

The following binder_config.yaml will effectively insert the species category into the category listing:

version: 1
layers:
  [{name: site, include: 'confdir-hiera:/'},
   {name: modules, include: ['module-hiera:/*/', 'module:/*::default'] }
  ]
categories:
  [['node', '${fqdn}'],
   ['environment', '${environment}'],
   ['species', '${species}'],
   ['osfamily', '${osfamily}'],
   ['common', 'true']
  ]

This means we can use the species category if one is defined in a module. An example hiera.yaml from such a module is:
version: 2
hierarchy:
  [['osfamily', '$osfamily', 'data/osfamily/$osfamily'],
   ['species', '$species', 'data/species/$species'],
   ['environment', '$environment', 'data/env/$environment'],
   ['common', 'true', 'data/common']
  ]
Which means when we run Puppet...

root@hiera-2:/etc/puppet# FACTER_species='human' puppet apply modules/startrek/tests/init.pp 
Notice: Compiled catalog for in environment production in 1.07 seconds
Notice: janeway commands the voyager
Notice: /Stage[main]/Startrek/Notify[janeway commands the voyager]/message: defined 'message' as 'janeway commands the voyager'
Notice: janeway is always wary of the section 31
Notice: /Stage[main]/Startrek/Notify[janeway is always wary of the section 31]/message: defined 'message' as 'janeway is always wary of the section 31'
Notice: Finished catalog run in 0.11 seconds

You can see full example code in the startrek module.

You can pre-order my book, Pro Puppet 2nd Ed, here.

Puppet 3.3 and Data in Modules

Puppet 3.3 was released last week. As part of that release, hiera2 and Puppet data in modules are available in testing mode. I have been working with William van Hevelingen to build an example module/tutorial on Puppet data in modules. Our example module is available at: We hope to expand this further as the community learns more about how to use this new feature.

Saturday, July 20, 2013

Puppet Case insensitivity

I was helping my friend roll out some new services today and a section of his Puppet code caught my eye. This is what I saw:
  case $::kernel {
    'linux': {
This caught my eye because I had been explicitly capitalizing the 'L' in the $::kernel fact for years. I thought to myself "Is the fact capitalized?"
zeratul:~# facter -p kernel
What's going on here? Is the case operator insensitive?
case $::kernel {
  'sunos': { notify { $::kernel: } }
}
notice: SunOS
notice: /Stage[main]//Notify[SunOS]/message: defined 'message' as 'SunOS'
Wow. Is the '==' operator in Puppet case-insensitive as well?
if $::kernel == 'sunos' {
  notify { 'lasers': }
}
notice: lasers
notice: /Stage[main]//Notify[lasers]/message: defined 'message' as 'lasers'
Is this a problem with facter or puppet?
if "YES" == "yes" {
  notify { "false is true": }
}
notice: false is true
notice: /Stage[main]//Notify[false is true]/message: defined 'message' as 'false is true'
Seriously? Yep. Turns out the '==' operator is case-insensitive. The '=~' is case-sensitive, but you have to use regular expression syntax in order to use it:
if "YES" =~ /^yes$/ {
  notify { "false is true": }
}
notice: Finished catalog run in 1.30 seconds
Note that we should use '^' and '$' to anchor the string so we don't accidentally get a substring match.

Tested on Puppet 2.7.x and 3.2.x
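As an aside, the same substring pitfall exists anywhere regexes are used. A quick grep illustration (my own, not Puppet-specific):

```shell
#!/bin/sh
# Without anchors, 'yes' matches as a substring of 'yessir'.
echo 'yessir' | grep -q 'yes'    && echo 'substring match'

# With ^...$ the pattern must match the whole line.
echo 'yessir' | grep -q '^yes$'  || echo 'no anchored match'

# And grep is case-sensitive by default, unlike Puppet's == operator.
echo 'YES'    | grep -q '^yes$'  || echo 'case matters'
```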

Saturday, June 8, 2013

Simple OpenSSL recipe

The openssl command has always been very opaque to me. Whenever I am doing cert operations I feel like a monk in the Middle Ages, copying scrolls I cannot read. Last night I learned a simple command for inspecting x509 certificate files that is short enough to commit to memory. I encourage everyone to use this command whenever they encounter .pem files, and I encourage you to memorize it as well.

The command syntax is:

 openssl x509 -in <certfile> -text

A full example:

nibz@host $ openssl x509 -in /etc/ssl/certs/Verisign_Class_1_Public_Primary_Certification_Authority.pem  -text
Certificate:
    Data:
        Version: 1 (0x0)
        Serial Number:
        Signature Algorithm: sha1WithRSAEncryption
        Issuer: C=US, O=VeriSign, Inc., OU=Class 1 Public Primary Certification Authority
        Validity
            Not Before: Jan 29 00:00:00 1996 GMT
            Not After : Aug  2 23:59:59 2028 GMT
        Subject: C=US, O=VeriSign, Inc., OU=Class 1 Public Primary Certification Authority
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (1024 bit)
                Exponent: 65537 (0x10001)
    Signature Algorithm: sha1WithRSAEncryption
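If you want to try the command without hunting down a system certificate, you can generate a throwaway self-signed one first. This is my sketch; the filenames and subject are made up, and `-noout -subject`/`-enddate` are handy variants that print a single field:

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d)

# Generate a throwaway self-signed certificate (names are arbitrary).
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout "$tmp/key.pem" -out "$tmp/cert.pem" \
  -days 1 -subj '/C=US/O=Example/CN=example.test' 2>/dev/null

# The command from the post: dump the certificate as text.
openssl x509 -in "$tmp/cert.pem" -text -noout | head -n 12

# Variants that print just one field each:
openssl x509 -in "$tmp/cert.pem" -noout -subject
openssl x509 -in "$tmp/cert.pem" -noout -enddate

rm -r "$tmp"
```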

Thursday, May 30, 2013

Cisco IOS 15 Public-key authentication

With the release of IOS 15 from Cisco, we can now use ssh public keys to authenticate to Cisco devices. It's the technology of the late nineties, today!

First create yourself a rather small key:

ssh-keygen -t rsa -b 1024
It will ask you some questions, hopefully you've seen this dialog before. If you need help please feel free to comment or privately message me.

After the key has been created, copy the public string into your copybuffer.

> cat .ssh/
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAAgQDMuKvC5ZVRuQw6YF5xnMZLopBVbQv5jxgHcR6BWfws3lTaqfSrKUlp3BulxA7P2snphcavf4TS+bNHFd9PKGRVpoQ8ERZtXn1+f008XUN3cxYMZXLB18ae7kfm8Sxk/bO4xWGaQAKc7jkIQY4OLIE0TsKTZGux241N6BNeLGmuLQ==
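For scripting this, the keygen step can be done non-interactively. A sketch of mine; the output path is arbitrary and `-N ''` sets an empty passphrase:

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d)

# Generate a 1024-bit RSA key without the interactive dialog.
ssh-keygen -t rsa -b 1024 -N '' -f "$tmp/cisco_key" >/dev/null

# The public half is what gets pasted at the (conf-ssh-pubkey-data) prompt.
cat "$tmp/cisco_key.pub"

rm -r "$tmp"
```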

Now add the key to the Cisco device. This assumes the user has already been created properly. It also assumes you are running the following version of IOS:

*    1 52    WS-C3750G-48TS     15.0(2)SE             C3750-IPBASEK9-M
I have tried this on a 15.0(1) and it didn't work. Configuration commands:
fab6017a#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
fab6017a(config)#ip ssh pubkey
fab6017a(config)#ip ssh pubkey-chain 
fab6017a(conf-ssh-pubkey)#username nibz
Some notes on the above: paste the whole public key once you get the (conf-ssh-pubkey-data) prompt, including the 'ssh-rsa' header and the comment footer. Use the exit keyword on the (conf-ssh-pubkey-data) line; any other word will be sandwiched onto the end of the key. You can use this behavior to split your key into multiple lines and input it that way. After this, the switch will hash your key and the configuration will look like:

  username nibz
   key-hash ssh-rsa 2F33A5AE2F505B42203276F9B2313138
This configuration can be put into other Cisco configs elsewhere in your infrastructure. Happy hacking. This was performed on a Cisco 3750G running IOS 15.0(2)SE.

Friday, May 17, 2013

Ganeti Migration

It is common to want to do a 'Texas two-step' with your ganeti nodes. The idea being that you migrate everything off of a node, reboot it for patches, then migrate everything back, and do this around your cluster until you've patched all your hypervisors. Easier said than done!

The fast way:

# move primaries off node
gnt-node migrate $node

# move secondaries off node
gnt-node evacuate --secondary-only $node
Sometimes nodes fail to migrate; I don't know why. I just use the gnt-instance failover command to move them, which requires a reboot. Sometimes the secondary disks don't move either. This is more annoying because hail crashes before making a plan for you. I have written the following script to, stupidly, move all the secondaries off of one node onto other nodes, so you can reboot a node in your cluster without fear.

#!/bin/bash

evac_node=$1

if [ -z "$evac_node" ]; then
    echo "Please specify a node to evac"
    exit 1
fi

echo "Evacuating secondaries from $evac_node"

HYPERVISORS=`wget --no-check-certificate -O- https://localhost:5080/2/nodes 2>/dev/null | grep id | awk '{print $NF}' | tr -d \",`

for instance in `gnt-instance list -o name,snodes | grep $evac_node | cut -d " " -f 1`
do
        current_primary=`gnt-instance list -o name,pnode | grep $instance | awk '{print $NF}'`
        target_node=`echo $HYPERVISORS | sed 's/ /\n/g' | grep -v $evac_node | grep -v $current_primary | sort -R | head -n 1`
        echo "gnt-instance replace-disks -n $target_node $instance"
        gnt-instance replace-disks -n $target_node $instance
done
Happy hacking!

Tuesday, April 23, 2013

Ops School: The coolest thing since sliced bread

Stardate 90915.3

I work at an organization that teaches people new to computers how to administer them. I've given a talk with blkperl, my boss, many times, called Zero to Root. Teaching in our organization is very personal and doesn't really translate to other environments. We can't say 'Here, use our curriculum, it's baws.'

Ops School strives to be the curriculum for people who want to become operations people. It doesn't assume any previous knowledge, but doesn't hide the details from the users either. It's kind of like a for this generation of hackers. I have been slowing down in my blog postings because some of the energy I use to write things down for other people has been siphoned off into Ops School.

I encourage everyone out there to contribute, there are huge chunks of this document that still need to be written.

Wednesday, March 27, 2013

Solaris redundant nfs mounts

Stardate 90840.8

NFS servers with lots of disks, sharing files out to multiple clients, are a pattern followed everywhere. Eventually your users begin not only to use these remote files, but to put NFS directories into their PATH variable. This causes problems whenever you need to patch or reboot the NFS server, because all shells launched by users with PATHs that look into the NFS directories will hang forever (this assumes you are mounting hard).

You can minimize the effect of this by using redundant mount information.


/usr/local -ro,hard,intr,suid
/usr/local -ro,hard,intr,suid,
It is best to maintain the idea of a primary and a secondary, at least for administration. Modify only the primary, and rsync to the secondary. Mount read-only. This mount appears in ``mount`` like this:
/usr/local on, remote/read only/setuid/devices/rstchown/hard/intr/xattr/dev=5a0000e on Wed Mar 27 15:35:21 2013
Note that this is mounted from both servers. Packets get sent to both servers, and the first to respond with valid information is reported to the system. This can make for some bizarre weirdness if you use read-write mounts.

It is totally possible to use something like drbd between NFS servers (not on Solaris, obviously) to make this doable with read-write mounts. I have not done this personally.

Monday, March 18, 2013

Cascadia IT Conf

Stardate 90814.44

This weekend we attended CasItConf13. I had a blast and met a lot of really cool people. I attended presentations on Logstash, IPv6, Chef and more. Jordan Sissel, in particular, did a great job of presenting Logstash. After his talk we met up and had a neat conversation. He showed me an app he had created called fingerpoken. It's a bit out of date and we had to do some hacks, but I was able to get it up and running in a half-hour lunch break and still have time to demolish some tasty lunch provided by the wonderful folks over at Puppet Labs. Fingerpoken is an app that lets you send mouse and keyboard events to a computer from a smartphone.

And that's really what it's all about. Is the tool simple and easy enough that you can get it going in a crunch? Are all the nonintuitive parts ripped out and replaced with sane defaults so the tool just 'goes'? In fingerpoken's case, not really. We had to do some:

sudo ln -s /usr/lib/ /usr/lib/
But what is the point of having the author of your tool nearby if not to tell you to do that? And yes, the ABI is evidently close enough to just work in that case.

I am very impressed that I was able to get such high-level functionality out of a tool in a short period of time and under pressure. If your tool passes the 'setup at lunch at a conference' test, you're doing pretty dang good. If it doesn't, look for places to streamline it. I'm happy to test your random tool, please let me know.

My talk, on the Computer Action Team/Braindump, is available on my github and you can download the pdf from here.

In other news, it seems that github no longer allows you to download the raw files out of repositories if they are above a certain size. Possibly more on that later.

Thursday, March 7, 2013

Debian packaging

Stardate: 90784.6

Git-sync is a Ruby script we use at work for managing git repos; it is covered in an earlier post. I got tired of ensuring it as a file in Puppet and decided to make a Debian package. Here is a summary of how to make a simple Debian package containing just a single file. Note that the answer to this Stack Overflow question is the source of most of my knowledge, so this will just be annotations and extensions to that.

Debian/Ubuntu packaging (on an ubuntu system) required me to install a single package: devscripts.

At a high level, Debian packaging involves creating a 'debian' folder in your source tree and putting several metadata files in it. Figuring out the precise contents of these files is the challenge of packaging. I recommend you use the 'apt-get source git' command to get the source of a working package (git in this case) to compare with your own metadata files.

Debian/Ubuntu packaging using debuild creates files one level above your current working directory (wtf, Debian). So the first step is to make a build directory:

cd ~/devel
mkdir git-sync-build
Procure the source:
nibz@darktemplar:~/devel/git-sync-build$ git clone
nibz@darktemplar:~/devel/git-sync-build$ ls
nibz@darktemplar:~/devel/git-sync-build$ cd git-sync
nibz@darktemplar:~/devel/git-sync-build/git-sync$ mkdir debian

All of the metadata files that debuild, the utility that will actually build the .deb, needs are going to be in the debian directory.

The first file to create is the debian/changelog file. This file is created with the dch utility. Run it from the git-sync directory. It will open vim with a template that looks like the following; many fields here need to be changed.

dch --create

PACKAGE (VERSION) RELEASE; urgency=low

  * Initial release. (Closes: #XXXXXX)

 -- Spencer Krum   Thu, 07 Mar 2013 01:40:18 -0800
PACKAGE refers to the name of the package. Replace the word PACKAGE with the name you want your package to register itself as; in my case I will use 'git-sync'. The package name must be lower case. The VERSION must be replaced with a version number. I'm using 1.0.0 for this, since it is the second release of git-sync but the changes are very minor; there are long articles on the internet about version numbering, and it's not my place to comment here. The RELEASE variable needs to be replaced with a Debian or Ubuntu codename such as 'precise' or 'wheezy'. I have no idea what urgency is, but setting it to low doesn't seem to hurt anything; maybe this is how you tell apt/dpkg about security updates. The initial release stuff is fine.

The name is a bit tricky. Later on we will gpg-sign the package. Make sure the name and email in the changelog match exactly the name and email on your gpg key, or else the debuild utility won't attempt to have you sign it at all. My changelog looks like this:
git-sync (1.0.0) precise; urgency=low

  * Initial release. 

 -- Spencer Krum   Wed, 06 Mar 2013 16:46:14 -0800

Next create a debian/copyright file:
Upstream-Name: myScript
Upstream-Contact: Name, 

Files: *
Copyright: 2011, Name, 
License: (GPL-2+ | LGPL-2 | GPL-3 | whatever)
 Full text of licence.
 Unless there is a it can be found in /usr/share/common-licenses
I elected for the Apache 2 license and used the two-paragraph version of that license. I also gave credit where it was due here. Fill out this file as you see fit.

Next create a debian/compat file:

nibz@darktemplar:~/devel/git-sync-build/git-sync/debian$ echo 7 > compat
Next create the rules file. This file seems to be the work-doer in Debian packaging. It is evaluated by make, which is picky, so make sure the indented line is a real tab (copying from my blog will probably fail). The --with python is... well, I have no idea. I traced it to a python.pm (.pm is a Perlism) deep within /usr. Since I am packaging a Ruby script I just removed it.

Example from Stack Overflow:

#!/usr/bin/make -f

%:
	dh $@ --with python2

git-sync version:

#!/usr/bin/make -f

%:
	dh $@
Next make the control file. Make the natural substitutions here. I guessed on section and it just sorta worked.
nibz@darktemplar:~/devel/git-sync-build/git-sync/debian$ cat control 
Source: git-sync
Section: ruby
Priority: optional
Maintainer: Spencer Krum, 
Build-Depends: debhelper (>= 7),
               ruby (>= 1.8.7)
Standards-Version: 3.9.2
X-Ruby-Version: >= 1.8.7

Package: git-sync
Architecture: all
Section: ruby
Depends: ruby, ${misc:Depends}, ${python:Depends}
Description: Git syncing script, pull based
  Git-sync allows git repositories to be kept in sync via git
  hooks or other means. Pull based, able to handle force pushes
  and submodules
Next make the install file. I went with the default in the Stack Overflow post. I attempted some simple modifications to it (installing the file to /usr/local/bin instead) and that made it fail, so this file is evidently pretty finicky.
nibz@darktemplar:~/devel/git-sync-build/git-sync$ cat debian/install 
git-sync usr/bin

Now you can build the debian package.

nibz@darktemplar:~/devel/git-sync-build/git-sync$ debuild --no-tgz-check
If all went well, it should ask you to decrypt your gpg key twice and build a package in the directory one level up.
nibz@darktemplar:~/devel/git-sync-build/git-sync$ ls ..
git-sync          git-sync_1.0.0.dsc
git-sync_1.0.0_all.deb  git-sync_1.0.0_amd64.changes  git-sync_1.0.0.tar.gz
You now have a shiny .deb file that can be installed with dpkg -i git-sync_1.0.0_all.deb
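It is worth sanity-checking what actually ended up in the package. A sketch using dpkg-deb; since I can't assume you have my git-sync .deb handy, it builds a throwaway package first (the package name, version, and paths are all made up):

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d)

# Build a minimal throwaway package so the inspection commands below
# have something to chew on.
mkdir -p "$tmp/demo/DEBIAN" "$tmp/demo/usr/bin"
cat > "$tmp/demo/DEBIAN/control" <<'EOF'
Package: demo
Version: 1.0.0
Architecture: all
Maintainer: Nobody <nobody@example.test>
Description: throwaway package
EOF
echo '#!/bin/sh' > "$tmp/demo/usr/bin/demo"
chmod 755 "$tmp/demo/usr/bin/demo"
dpkg-deb --build "$tmp/demo" "$tmp/demo_1.0.0_all.deb" >/dev/null

# Inspect the metadata and the file list of the resulting .deb.
dpkg-deb --info "$tmp/demo_1.0.0_all.deb"
dpkg-deb --contents "$tmp/demo_1.0.0_all.deb"

rm -r "$tmp"
```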

It is easy to put this in a Launchpad PPA if you have a Launchpad account. From your Launchpad homepage (a shortcut is if you are signed in), press the "Create new PPA" button and fill out the form.

Next build a source package. Launchpad PPAs take source packages and build binary packages on launchpad servers. Build it with:

nibz@darktemplar:~/devel/git-sync-build/git-sync$ debuild -S
It should go through the gpg motions again and build a source package. Then you should be able to run something like (with your Launchpad username and name of PPA):
dput ppa:krum-spencer/git-sync-ppa git-sync_1.0.0_source.changes

Happy Packaging!


Stardate: 90784.4428183

Where I work (read as: play) we use a lot of git. As operators we often have a service running with its configs in git. A common pattern we use is to have a post-receive hook on the git repository set up to update the git checkout on a remote server. We accomplish this through a post-receive hook that sshes into the remote server and calls a script called git-sync with some options. The git-sync script is a GitHub project forked from the puppet-sync script project, which we use specifically for Puppet dynamic git environments. Hunner <3 More dynamic git environments with puppet. Finch <3.

A hook for a project goes in the hooks/post-receive file of the git server's bare repo. Let's look at one now:

Example git post-receive hook

## File: awkwardly incorrect

REPONAME=`basename $PWD | sed 's/.git$//'`
SSH_ARGS="-i /shadow/home/git/.ssh/"

while read oldrev newrev refname
do
  BRANCH=`echo $refname | sed -n 's/^refs\/heads\///p'`
  if [ $BRANCH != "master" ]; then
    echo "Branch is not master, therefore not pushing to nagios"
    exit 0
  fi
  [ "$newrev" -eq 0 ] 2> /dev/null && DELETE='--delete' || DELETE=''
    --branch "$BRANCH" \
    --repository "$REPO" \
    --deploy "$DEPLOY" \
    $DELETE
done


ssh '/etc/init.d/nagios3 reload'

The hook will exit before doing anything if the branch is not master, and if it is, it will run the git-sync script remotely on the nagios host, then go back in to bounce the nagios service.

The git-sync script essentially performs a

git fetch; git checkout HEAD
It doesn't worry itself with merging, and it is submodule aware.
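A rough sketch of what that pull-based sync boils down to, using throwaway repos (this is my illustration, not the git-sync source; I use reset --hard to move the checkout to the remote head, and all paths and names are made up):

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d)

# An "origin" repo and a first commit.
git init -q -b master "$tmp/origin"
git -C "$tmp/origin" -c user.name=a -c user.email=a@b \
    commit -q --allow-empty -m 'v1'

# The deployed checkout that git-sync would keep in sync.
git clone -q "$tmp/origin" "$tmp/deploy"

# Upstream moves forward...
git -C "$tmp/origin" -c user.name=a -c user.email=a@b \
    commit -q --allow-empty -m 'v2'

# ...and the sync is a fetch plus a hard move to the remote head: no merge.
git -C "$tmp/deploy" fetch -q origin
git -C "$tmp/deploy" reset -q --hard origin/master
git -C "$tmp/deploy" log --oneline -1

rm -rf "$tmp"
```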

A file, .git-sync-stamp, must be created by the administrator of the system; this is how git-sync knows it is in charge of managing the repository. It is definitely not recommended that you add this file to git, though that should more or less work if you never want to think about it. I also wrote this Puppet defined type to manage the stamp, initial vcsrepo, and public key file for you.

A Puppet defined type to initialize git-sync managed folders

define gitsync::gitsync(
  # parameter list reconstructed; names are inferred from the resource bodies below
  $user,
  $deploy,
  $source,
  $public_key,
  $public_key_type = 'ssh-rsa',
) {

  ssh_authorized_key { "${user}-${name}-gitsync":
    ensure => present,
    user   => $user,
    type   => $public_key_type,
    key    => $public_key,
  }

  vcsrepo { $deploy:
    ensure   => present,
    provider => git,
    user     => $user,
    source   => $source,
    require  => Ssh_authorized_key["${user}-${name}-gitsync"],
  }

  file { "${deploy}/.git-sync-stamp":
    ensure  => present,
    owner   => $user,
    mode    => '0644',
    require => Vcsrepo[$deploy],
  }
}

The last thing to note is that I didn't write git-sync. I've modified it, but it was mostly written by Reid Vandewielle and others. Marut <3 Enjoy!

Saturday, March 2, 2013

Cisco Out Of Memory

Stardate: 90772.3

Today (well, yesterday) our primary router ran out of memory. We haven't fixed the problem yet (I hope that will be the subject of a follow-up post), but for right now I want to take you through detection, characterization, and mitigation.

Detection. The way I found out about the problem was via ssh.

Attempting to ssh into the router running out of memory.

> ssh multiplexor.seas
nibz@multiplexor.seas's password:
Permission denied, please try again.
nibz@multiplexor.seas's password:
Connection closed by 2610:10:0:2::210
For anyone familiar with sshing into Ciscos, this is not how it normally looks. Usually you get three attempts with just 'Password' and one with your user visible.

Attempting to ssh into a router not running out of memory.

> ssh nibz@wopr.seas
nibz@wopr.seas's password:
Connection closed by
I verified that it wasn't a knowing-the-password problem by using another account on the router. Then I connected a serial port to the router and immediately found out-of-memory logs.

Console logs on the router.

10w0d: %AAA-3-ACCT_LOW_MEM_UID_FAIL: AAA unable to create UID for incoming calls due to insufficient processor memory

Logs sent to syslog.

Mar  2 00:43:52 multiplexor 4309463: 10w1d: %SYS-2-MALLOCFAIL: Memory allocation of 128768 bytes failed from 0x1A8C110, alignment 0 
Mar  2 00:44:24 multiplexor 4309499: 10w1d: %SYS-2-MALLOCFAIL: Memory allocation of 128768 bytes failed from 0x1A8C110, alignment 0 
Mar  2 00:47:37 multiplexor 4309643: 10w1d: %SYS-2-MALLOCFAIL: Memory allocation of 395648 bytes failed from 0x1AA03FC, alignment 0 
Mar  2 02:18:33 multiplexor 4313756: 10w1d: %SYS-2-MALLOCFAIL: Memory allocation of 395648 bytes failed from 0x1AA03FC, alignment 
I ran the 'show proc mem' command on the router to get a picture of the memory use of the router.

Show proc mem.

multiplexor#show proc mem
Processor Pool Total:  177300444 Used:  174845504 Free:    2454940
      I/O Pool Total:   16777216 Used:   13261296 Free:    3515920
Driver te Pool Total:    4194304 Used:         40 Free:    4194264
 PID TTY  Allocated      Freed    Holding    Getbufs    Retbufs Process
   0   0  108150192   43169524   58720200          0          0 *Init*
   0   0      12492    2712616      12492          0          0 *Sched*
   0   0  399177972  389135628    8911036   14228691    1490354 *Dead*
   0   0          0          0  102305848          0          0 *MallocLite*
   1   0  973921416  973821100     224768          0          0 Chunk Manager
   2   0        232        232       4160          0          0 Load Meter
   3   0          0          0       7076          0          0 DHCPD Timer
   4   0       4712       6732      11692          0          0 Check heaps
   5   0    7862444   49770056      13540    6270020   28190703 Pool Manager
   6   0          0          0       7160          0          0 DiscardQ Backgro
   7   0        232        232       7160          0          0 Timers
   8   0          0          0       4160          0          0 WATCH_AFS
   9   0        284        728       7160          0          0 License Client N
  10   0 2332421068 2332422156       7168          0          0 Licensing Auto U
  11   0    1482732    1483016       7160          0          0 Image License br
  12   0 2344349400 3601318192     169876     157356          0 ARP Input
  13   0  550320160  550382256       7160          0          0 ARP Background
  14   0          0          0       7168          0          0 CEF MIB API
  15   0          0          0       7160          0          0 AAA_SERVER_DEADT
This shows that the router is indeed running very low on memory. How did we get here? Monitoring + SNMP + RRDtool to the rescue!

Doing some quick estimation, it looks like it loses about a MB of free RAM every 18 hours. RRDtool isn't the best, and getting the big-picture graph is hard to do, but basically it has been losing free RAM at this rate for a couple of weeks.

Finally we get a show tech-support off of this thing.

multiplexor# show tech-support | redirect tftp://
The redirect to tftp is a really cool pattern for getting information off of a Cisco device. The tech-support dump was about 50,000 lines.

I will do a follow-up post when I figure out what's going on.

Update 3-7-13:

The router ran completely out of memory. Even on console all I got was:

%% Low on memory; try again later

%% Low on memory; try again later

%% Low on memory; try again later
It was happily switching and routing at this point, however. We rebooted it because it was Saturday evening, and better to have it happen at a time of our choosing than to break iSCSI unexpectedly later in the week. Upon reboot, the system returned to full functionality, but we can tell from the Zenoss graphs that it is still leaking memory at a rate of 1 MB every 18 hours. At this rate it will need to be rebooted again in 10 weeks. We have opened a case with TAC. I will update again if anything comes of this.
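As a sanity check on that 10-week figure: the post doesn't state the usable headroom after a reboot, so the ~93 MB below is my back-of-the-envelope assumption, but the arithmetic works out:

```shell
#!/bin/sh
# Leak rate observed in the graphs: ~1 MB of free RAM lost every 18 hours.
# The 93 MB of usable headroom after a reboot is an assumption of mine,
# back-derived from the 10-week figure; it is not stated in the post.
free_mb=93
hours_per_mb=18

hours=$((free_mb * hours_per_mb))   # hours of headroom
weeks=$(( (hours + 84) / 168 ))     # 168 hours per week, rounded to nearest
echo "~$weeks weeks until the router is out of memory again"
```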

Update 5-17-13:

We still have not fixed the problem. The router can go about 10 weeks before it needs a reboot. This is in an educational setting with 12-week terms, meaning we need to reboot our core router at least once a term. Wheeee. We've been on the horn with Cisco, who have had numerous techs look at it and have even replaced the hardware, but the problem remains. Anyone with ideas is welcome to contact me privately.

Friday, February 22, 2013

Stardates Updated

Stardate: 90749.1018235
I've been using this website for calculating the current stardate whenever I make a blog post. I like the stardates; I want to keep them, and to generally Trekify the rest of this blog. But entering the current date into HTML forms every time sucks. So I've reverse engineered their algorithm and written a bit of Python to do the calculations for me. The Python (and an attempt at preserving my math) is available here. In the future I want the blog to know when I published and calculate and display the stardate based on that.

Monday, February 11, 2013

Nagios, maintenance windows and puppet

Stardate: 90721.31
Nagios is an excellent network monitoring service and is used in production where I work. We wanted to be able to create maintenance windows using the web GUI. (Actually we wanted a command line utility, but that's a separate post.) Turning off the default 'readonly' mode turned out to be a real pain and is poorly documented. I ended up following the recommendations of this blog as well as some of the comments by readers. I also took the time to create a snippet of Puppet code you can put in your manifests to make it easy to turn the command mode on. Note that the Puppet code uses the default 'nagiosadmin' user and that it uses file_line from the Puppet Labs stdlib module. The code is available with syntax highlighting here.
  # the following is for enabling write access to the web gui
  if $readonly_web == false {

    file_line {
      'nagios_external_commands':
        line    => "check_external_commands=1",
        path    => '/etc/nagios3/nagios.cfg',
        notify  => Service['nagios3'],
        require => Package['nagios3'];
      'nagios_all_host_commands':
        line    => "authorized_for_all_host_commands=nagiosadmin",
        path    => '/etc/nagios3/cgi.cfg',
        notify  => Service['nagios3'],
        require => Package['nagios3'];
      'nagios_all_service_commands':
        line    => "authorized_for_all_service_commands=nagiosadmin",
        path    => '/etc/nagios3/cgi.cfg',
        notify  => Service['nagios3'],
        require => Package['nagios3'];
    }

    user { 'nagios':
      groups     => ['nagios', 'www-data'],
      membership => minimum,
      require    => Package['nagios3'],
    }

    file { '/var/lib/nagios3/rw':
      ensure  => directory,
      owner   => 'nagios',
      group   => 'www-data',
      mode    => '2710',
      require => Package['nagios3'],
    }

    file { '/var/lib/nagios3':
      ensure  => directory,
      owner   => 'nagios',
      group   => 'nagios',
      mode    => '0751',
      require => Package['nagios3'],
    }
  }

Nanog 4

Stardate: 90718.8
Nanog ended Wednesday. :( After much traveling I am home. It was an incredible experience and I learned a lot. My notes on the third day are here: notes. I want to join the NANOG organizers and thank the sponsors of the conference for making it possible: CyrusOne, NTT, Google, Verisign, and Netflix. I met a lot of people and learned a lot (especially about the configuration and purpose of Internet exchanges).
For any other students out there, NANOG is going to do something to improve their system for students. They already give you a significantly discounted ticket price if you are a student, but the process for coming to NANOG is still hard to navigate. My advice to you now is to sign up for a ticket and check the 'student' box. You will immediately be given access to the student price. You will eventually have to verify your student status with the administrators of the conference, but you can do that later by email. For the next NANOG, they are going to redo the website and streamline the student ticketing/verification process. They are also preparing to launch a scholarship program. If you want to go to NANOG 58 in New Orleans and this stuff has not yet appeared on the NANOG website, you can email the administrators; they are very approachable. You could even email me and I would love to help you.
Happy Hacking.

Wednesday, February 6, 2013

Nanog 3

Yesterday was day 2 of NANOG57. Like the previous day, I had a blast.
I was pretty tired but took notes on some of the presentations I attended here.
This second day was a lot more social for me. I met individuals from companies all over the networking spectrum. Big shoutout to Joe from Google, Jeremy from, Charles from, David from Windstream, and Paul from Jive (Go Portland!), which is somewhat ironic since, while I think of Jive as being a Portland company, something like half of their workers (including Paul) don't work in Portland. These guys are all network admins for their companies/ISPs and I learned a lot from just talking to them.
I also want to make a big shoutout to Imtech, a UK company that sent Dave and another to NANOG. Dave and his friend are really cool people and welcomed me in right from the start.
I also want to thank the guys whose names I did not get from Comcast, Level 3, and Microsoft who welcomed me with open arms at the Beer 'n Gear. That was a lot of fun and I learned a lot.
Time for breakfast, will report in with more later.

Monday, February 4, 2013

Nanog 2

Wow! What a day. Day 1 of NANOG 57 was a rush and a blast. A brief summary before I collapse from exhaustion.
My brief and unedited notes are here. Another adventurous user has been keeping notes here. These notes only go slightly beyond what is presented in the abstracts for the presentations. A big shoutout and thank you to everyone who presented today. It was awesome.
I met a *ton* of cool people today. Among them were two network designers from the U.K. and three professional network admins for three different internet exchanges in Germany (including DE-CIX, the largest IX in the world). I met several engineers from other companies at the "Newcomers Lunch," and met several people from a major backbone provider after that.
I learned a lot and am looking forward to going back again tomorrow.

Nanog 1

I have an updated wpa_supplicant.conf for NANOG. The initial configuration didn't work in practice; these are the settings that did:
     pairwise=CCMP TKIP
     group=CCMP TKIP
     eap=TTLS PEAP TLS
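For completeness, these settings live inside a full network block in wpa_supplicant.conf. Here is the shape of such a block for a WPA-Enterprise setup; the ssid, identity, and password values below are placeholders, not the real NANOG credentials:

```conf
network={
    # ssid and credentials below are placeholders
    ssid="example-ssid"
    key_mgmt=WPA-EAP
    eap=TTLS PEAP TLS
    pairwise=CCMP TKIP
    group=CCMP TKIP
    identity="example-user"
    password="example-password"
}
```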

Sunday, February 3, 2013

Graphite: Vim vs Emacs

Stardate: 90697.76
Graph all the things! Graphite is a real time graphing tool. It allows sysadmins like myself to visualize statistics about our environment and how they change over time. We can also, using the graphite dashboard, easily map different data sources onto each other to try to find correlations, or just to look at differences in use.
Nightfly, a co-worker of mine, has developed a script to run against our college's general login boxes. These boxes are used by the CS, ECE, and other departments. It provides a good picture of what people are using against time and against each other. Obviously the first order of business is to prove which editor is more popular:
The program to collect and submit this data is on github. The botnet element of this is hacked together with cron and ssh.
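For reference, getting data into Graphite is almost embarrassingly easy: the carbon daemon accepts a plaintext protocol of `metric.path value unix_timestamp` lines, by default on TCP port 2003. A minimal sketch (the graphite host name and metric paths here are made up):

```python
import socket
import time

def format_metric(path, value, timestamp=None):
    """Render one line of Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return '%s %s %d\n' % (path, value, timestamp)

def send_metric(path, value, host='graphite.example.com', port=2003):
    """Send a single datapoint to a carbon-cache listener (hypothetical host)."""
    line = format_metric(path, value)
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(line.encode('ascii'))
    finally:
        sock.close()
```

A cron job counting editor processes on each login box and calling something like send_metric('editors.vim', count) is all the "botnet" really needs.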
Big thanks to Nightfly for making the tech behind this post.

Nanog 0

Stardate: 90697.69
NANOG (North American Network Operators Group) is meeting in Orlando for NANOG 57. I am in Orlando and will attend.

You can see the agenda here. From the first day, I'm looking forward to the 'Newcomers Lunch,' BCOP(Best current operational practices), and the panel on the impacts of Super Storm Sandy.
I'm very excited about the wireless. From Nanog's page on the subject:
For the duration of the meeting conference, NANOG provides a dual-stack IPv4/v6 meeting network. IP address allocation is available by DHCP for IPv4 and neighbor discovery for IPv6. No NAT or translation protocols are utilized, in addition local NANOG DNS servers offer DNSSEC capability.
I believe the following wpa_supplicant.conf configuration will work for me:
        pairwise=CCMP TKIP
        group=CCMP TKIP
I will report on the wireless at NANOG tomorrow. I'll also see if I can find the certificate used by the access points.

Tuesday, January 29, 2013

Puppet Fact Fix after Ruby 1.9 upgrade

Stardate: 90683.2
We recently upgraded our entire infrastructure to Puppet 3. As if that wasn't ambitious enough (though I suppose Puppet 3.1 has an rc), we are slowly bringing ruby to 1.9 from 1.8.7. Surprisingly, puppet actually works under ruby 1.9. This blows my mind since I remember having to do some seriously ugly hacks to get puppet working under ruby 1.9. Unfortunately, and embarrassingly, one of my custom facts was not forward compatible.
Info: Loading facts in /var/lib/puppet/lib/facter/nvidia_graphics_device.rb
Could not retrieve dns_ip_6: undefined method `each' for " has address\n":String
Could not retrieve dns_ip_6: undefined method `each' for " has address\n":String
The unfortunate cause of this (other than that I didn't write 1.9-compatible code) is that String#each has been removed as of Ruby 1.9; the replacement is String#each_line. :(
For those of you unfamiliar with puppet hacking, most testing should be done through git dynamic environments. Unfortunately, there is a bug in puppet that prevents facts from being tested on any branch but production. You can add the facts to git and push them to the environments directory on the puppet master, but unless they are in the production branch you won't see them get filebucketed or run.
So what to do? The answer is to develop on the box itself, usually as root, though sudo is an option. (Hopefully I can get a friend of mine to guest post on why sudo is the correct way to attain root privileges for administration, and I can counterpost on why su - is the correct way.) On Ubuntu (as of 12.04, anyway) the puppet configuration dir is /etc/puppet, but the puppet var dir is /var/lib/puppet. Facts live in /var/lib/puppet/lib/facter. The best way to get information on a current puppet installation and configuration is through:
puppet config print all | grep vardir
vardir = /var/lib/puppet
Change directory to the puppet vardir and modify the facts in place. Then run the facter utility with the '-p' argument. The '-p' argument tells facter to run all its normal facts as well as facts loaded in from puppet.

root@yermom:/var/lib/puppet/lib/facter# facter -p | grep dns
dns_ip_4 => [""]
dns_ip_6 => []

Fantastic. All is well again.

Monday, January 28, 2013

PuppetDB/Storeconfigs Cache expiry

Stardate: 90682.98
After a couple of weeks of getting frustrated with puppet's Storeconfigs/puppetdb features, I have emerged victorious. PuppetDB is the newer, better, postgressier backend for puppet Storeconfigs. PuppetDB sports some really nice features including a fancy status/metrics web dashboard:
As you can see this is some interesting and potentially beneficial feedback. It is updated live and is mobile browser compatible. Personally, I'm happy to get graphs of this data any way I can, but I would prefer not to be locked into their dashboard. I would rather be able to get these data out of an often-updated file or a UDP port so that I could send them to graphite for real time graphing and correlation with other metrics. I also don't see the point of having it be mobile friendly, since most everyone will have their puppetmaster/puppetdb server firewalled heavily and mobile devices have no business on the internal network. Some of the metrics can lead to actual tuning and performance boosts: mostly increasing the number of threads and the max JVM heap size.
The punchline here is that with
storeconfigs = true
in puppet.conf you can do exported/collected resource magics. When doing this with nagios resources I've been able to export and collect resources flawlessly. The problems came up when I tried to modify a resource. Since we make heavy use of dynamic git environments with puppet I was running something like
 puppet agent --test --environment=nagios 
on a host at random and
 puppet agent --test --environment=nagios 
on the nagios server, hoping to collect exported resources. The problem was they were not changing. As it turns out, puppetdb can cache old exported resources for up to an hour. My advice for others having problems getting nagios or other exported resources to change or purge is to give it time. Run a big ssh for loop or use mcollective to hit all your boxes, then hit the coffee cart for a quick pick-me-up. Chances are good you just need to give it time.
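For reference, the export/collect pattern in play here looks roughly like this; the resource title and parameters are illustrative, not our actual manifests:

```puppet
# On every monitored node: export a service check (note the @@).
@@nagios_service { "check_ssh_${::fqdn}":
  use                 => 'generic-service',
  host_name           => $::fqdn,
  check_command       => 'check_ssh',
  service_description => 'SSH',
}

# On the nagios server: collect every check the other nodes exported.
Nagios_service <<| |>>
```

The exported resources land in puppetdb when each node checks in, which is exactly why the collecting side can lag behind by up to the cache window.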

Thursday, January 24, 2013

Irc Bots in Twisted with Invite-only Channels

Stardate: 90672.08

I'm kind of obsessed with writing irc bots in python using twisted.words.protocols. A longer example of how to do that may come later, but for now I want to show you one of the best ways to debug your twisted irc bot, plus a vector to get really cool behavior not intended by the twisted developers. On an IRC server I frequent, some channels are secured by forcing users to first log in with NickServ and then ask ChanServ for an invite. The problem is you must join the channel after you receive the invitation from ChanServ. My solution to this problem is below, using irc_unknown and the example in the twisted.words documentation:

class BeerBot(irc.IRCClient):
    """A logging IRC bot that also does beer things."""

    def signedOn(self):
        """Called when bot has successfully signed on to server."""
        self.logger.log('identifying to nickserv')
        self.msg("NickServ", "identify %s" % config.password)
        self.logger.log('requesting channel invite')
        # channel name kept in the same config module as the password
        self.msg("ChanServ", "invite %s" % config.channel)

    def irc_unknown(self, prefix, command, params):
        self.logger.log("{0}, {1}, {2}".format(prefix, command, params))
        if command == "INVITE":
            # params is (invitee, channel); join once ChanServ invites us
            self.logger.log('joining %s' % params[1])
            self.join(params[1])
irc_unknown is great because it simulates the 'window 1' on most irc clients (well, most command line irc clients[and by that I mean weechat and irssi{and by that I mean irssi-for-life!}]). You can add if statements to grab the other 'named' irc control messages. The rest are numbered and you can split those out as well. One of the bad things about irc is that different irc servers behave differently. It must be a frustrating and thankless task for the maintainers of irssi/weechat/t.p.w to provide such a universal interface to all irc servers. (lol jk, irssi hasn't had an update since like 2010.) [but no really, thank you irssi devs *internet hug*]

The source code for beerbot can be found at my github.

Blacklisting Usernames in Charybdis

Stardate: 90671.98

For our ircd needs we use patched versions of Charybdis and Atheme. I discovered the other day that one of our users had been trying to use the nickname 'help'. It was discovered he was just a beginner trying to find the help for the /nick command. The interesting thing was that this was tripping alarms for another user. Nickserv will warn you when someone tries to use your nick. Another user had messaged me that someone was attempting to take their nick. After doing some digging I realized that the second user had registered the nick 'help' with NickServ.

Allowing users to use nicks like 'help' and 'support' opens the door to social engineering attacks. I set out to block them at a services/ircd level. To my surprise, this is done at the ircd level, not the services level. Big shoutout to 'grawity' on #atheme on

Make sure you are logged in as an oper and that you have OperServ enabled. Get help on the command:

/msg OperServ SQLINE help
Add a sqline with:
/msg OperServ SQLINE add help !P abuse
The !P means permanent (you can use !T followed by a duration in minutes for a temporary SQLINE instead).

Tuesday, January 22, 2013

Cisco diagnostics

Stardate: 90666.54
Cisco switches and routers (Catalyst 3750 series in this example) support some really cool diagnostics. These diagnostics come in handy when trying to determine whether a Layer 1 fault may be involved and where it is. This technology, known as time-domain reflectometry (TDR), is available on all Cisco 3750 models including the new Catalyst 3750X.
An initial example: this shows normal use and a no-fault return.

fab20a#test cable-diagnostics tdr interface Gi1/0/25
TDR test started on interface Gi1/0/25
A TDR test can take a few seconds to run on an interface
Use 'show cable-diagnostics tdr' to read the TDR results.
fab20a#show cable-diagnostics tdr int Gi1/0/25
TDR test last run on: January 22 20:41:52

Interface Speed Local pair Pair length        Remote pair Pair status
--------- ----- ---------- ------------------ ----------- --------------------
Gi1/0/25  1000M Pair A     49   +/- 4  meters Pair A      Normal
                Pair B     45   +/- 4  meters Pair B      Normal
                Pair C     48   +/- 4  meters Pair C      Normal
                Pair D     45   +/- 4  meters Pair D      Normal
Another example: this shows normal use and an open circuit (most likely meaning no host is on the other side).
fab60a#test cable-diagnostics tdr interface GigabitEthernet 1/0/31
TDR test started on interface Gi1/0/31
A TDR test can take a few seconds to run on an interface
Use 'show cable-diagnostics tdr' to read the TDR results.

fab60a#show cable-diagnostics tdr int GigabitEthernet 1/0/31
TDR test last run on: January 08 14:45:40

Interface Speed Local pair Pair length        Remote pair Pair status
--------- ----- ---------- ------------------ ----------- --------------------
Gi1/0/31  auto  Pair A     3    +/- 4  meters N/A         Open
                Pair B     2    +/- 4  meters N/A         Open
                Pair C     0    +/- 4  meters N/A         Open
                Pair D     3    +/- 4  meters N/A         Open
Note that you must specify 'int' between tdr and the interface identifier. Presumably so you could shoot electrons at something that isn't an interface, like the door or something. An example of a broken pair:

fab20a#test cable-diagnostics tdr interface Gi1/0/25
TDR test started on interface Gi1/0/25
A TDR test can take a few seconds to run on an interface
Use 'show cable-diagnostics tdr' to read the TDR results.
fab20a#show cable-diagnostics tdr int Gi1/0/25
TDR test last run on: January 22 18:33:07

Interface Speed Local pair Pair length        Remote pair Pair status
--------- ----- ---------- ------------------ ----------- --------------------
Gi1/0/25   100M Pair A     49   +/- 4  meters Pair A      Normal
                Pair B     45   +/- 4  meters Pair B      Normal
                Pair C     48   +/- 4  meters Pair C      Normal
                Pair D     0    +/- 4  meters Pair D      Open
Here the "D" pair is broken. You can see from the 'pair length' column that it is broken at the beginning of the cable. This means we got lucky: we were able to replace the patch cable instead of having a contractor re-run the wire in the conduit. Notice that with pair D broken, the link is still up, but only at 100 Mbit; 100BASE-TX needs just two pairs, while gigabit needs all four.
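The tabular TDR output is also easy to post-process if you want to sweep a whole switch for cable faults. A quick sketch in Python (not production code) that pulls each local pair's estimated length and status out of 3750-style output like the examples above:

```python
def parse_tdr(output):
    """Parse 'show cable-diagnostics tdr' output (3750-style) into
    {local_pair_letter: (estimated_length_meters, status)}."""
    pairs = {}
    for line in output.splitlines():
        fields = line.split()
        # data rows always contain a 'Pair <letter> <len> +/- <err> meters' run;
        # the header mentions 'Pair' but never 'meters', so it is skipped
        if 'Pair' not in fields or 'meters' not in fields:
            continue
        i = fields.index('Pair')
        local = fields[i + 1]          # e.g. 'A'
        length = int(fields[i + 2])    # estimated meters to fault/cable end
        status = fields[-1]            # Normal, Open, ...
        pairs[local] = (length, status)
    return pairs
```

Feeding it the broken-pair output above flags pair D as Open at 0 meters, i.e. right at the near end of the cable.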

Sunday, January 20, 2013

Stardate: Unknown

Stardates are how I would like to keep time for this blog, but Star Trek doesn't have a consistent scheme for translating real time into stardates. During and after The Next Generation a certain amount of sanity appeared, but nothing that can be rolled backwards to figure out what 2013 would have been. The most consistent and repeatable way to represent current time in stardates is to use the stardate calendar for Star Trek Online, which has a direct mapping from the current age into stardates of the future (which actually takes place after the events of ST: Nemesis). Current stardate: 90660.85.

Calculator is here.