BMC Nagios

From Secure Computing Wiki
Revision as of 17:38, 12 March 2008 by Ecrist

Overview

We just purchased a Dell PowerEdge 2950. Incidentally, we use FreeBSD on everything but most of our desktops. Our new, super-fast 2950 has the Dell PERC 5/i RAID On MotherBoard (ROMB) card, for which there are no userland management utilities on FreeBSD. I was tasked with coming up with a way to monitor our RAID health, along with the other sensors that are available on the motherboard.

In addition to the ROMB health check, I've written five other Nagios plugins to monitor ambient temperature, system voltages (go/no-go), system fan speed, chassis intrusion detection, and power supply health. I'll cover some general setup requirements for the BMC (Baseboard Management Controller) and for getting our Nagios installation to talk to it.

These scripts now include performance data output. The voltage monitoring script has no usable data to graph, so it outputs no performance data.
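For readers unfamiliar with performance data, here is a minimal sketch of the output line a Nagios plugin prints, with the optional data after a pipe character. The function name and the sample label are hypothetical and are not taken from the scripts on this page; the exit codes are the standard Nagios ones.

```python
# Standard Nagios plugin exit codes.
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def format_plugin_output(status, message, perfdata=None):
    """Build the single line a plugin prints: 'STATUS - message | perfdata'.

    perfdata is an optional dict of label -> value. Labels are wrapped in
    single quotes so that labels containing spaces stay valid.
    """
    line = f"{status} - {message}"
    if perfdata:
        pairs = " ".join(f"'{label}'={value}" for label, value in perfdata.items())
        line += f" | {pairs}"
    return line


# Typical use at the end of a plugin (hypothetical values):
#   print(format_plugin_output("OK", "4 fans running", {"FAN 1 RPM": 4800}))
#   sys.exit(EXIT_CODES["OK"])
```

A check with nothing to graph, like the voltage one, would simply pass no perfdata and print only the status text.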

Also, if you need a script or plugin written, send an email to ecrist@secure-computing.net and I'll do my best to help you out.

BMC Configuration

To set up our Dell's BMC for communication with FreeBSD, we had to enter the BIOS on the BMC by pressing Ctrl-E during system boot. From the menu, there is an option to enable LAN access. You'll also need to set up an IP address that is reachable from your Nagios monitoring station. Lastly, set a user name and password for your BMC. For our tests here, we set them to root and 1234. There are ways to add additional users and privilege levels to your BMC, but that is outside the scope of this page.

Under the network section, enable the 'Shared' option, unless you have the DRAC card, in which case you will have a dedicated network card. FreeBSD works fine with the Shared option enabled. This allows the BMC to use the primary network interface on your server. The BMC's interface has its own MAC address, so there's little conflict. Apparently, some higher-end switches have a problem with this, as they see multiple MAC addresses on a single port. You should be able to allow this by setting the appropriate policy on the switch.

Because these checks use IPMI, your monitoring system will require ipmitool!
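As a rough illustration of how a check can call ipmitool from the monitoring station, here is a minimal Python sketch. It is not the code of the plugins on this page. The function names are hypothetical, and the choice of the `lan` interface is an assumption (some BMCs want `lanplus` instead).

```python
import subprocess


def build_ipmitool_cmd(host, user, password, sdr_type):
    """Return the argv list for a remote 'sdr type' query."""
    return [
        "ipmitool",
        "-I", "lan",       # IPMI over LAN (assumption; may need "lanplus")
        "-H", host,        # BMC IP address or hostname
        "-U", user,
        "-P", password,
        "sdr", "type", sdr_type,
    ]


def query_sdr(host, user, password, sdr_type, timeout=30):
    """Run ipmitool and return its text output.

    Raises subprocess.CalledProcessError on a non-zero exit and
    subprocess.TimeoutExpired if the BMC does not answer in time.
    """
    result = subprocess.run(
        build_ipmitool_cmd(host, user, password, sdr_type),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout


# Example call, using the test credentials from this page:
#   print(query_sdr("192.168.1.92", "root", "1234", "Fan"))
```

Passing the arguments as a list avoids shell quoting problems with sensor types that contain spaces, such as Drive Slot / Bay.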

Script Descriptions

check_bmc_dell_raid

This check queries the sdr type Drive Slot / Bay and looks at the text output. You could also do this by parsing the hex codes. The table below shows the general logic of the script, with an additional column listing the known hex code for each RAID state.

Nagios Level    IPMI Text                                      IPMI Hex Code
Normal          Drive Present                                  0x0180
Warning         Drive Present, Parity Check In Progress        0x1180
Critical        Drive Present, In Critical Array               0xA180

This command takes the following arguments: IP/HOSTNAME USERNAME PASSWORD.
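The text-matching logic in the table can be sketched roughly as follows. This is an illustration in Python, not the actual plugin. The function name is hypothetical, and treating output with no recognized drive line as UNKNOWN is my assumption.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_raid(sdr_output):
    """Return the worst Nagios state found across all drive slot lines."""
    worst = None
    for line in sdr_output.splitlines():
        # Test the most severe text first, since every state line
        # also contains "Drive Present".
        if "In Critical Array" in line:
            state = CRITICAL
        elif "Parity Check In Progress" in line:
            state = WARNING
        elif "Drive Present" in line:
            state = OK
        else:
            continue  # not a drive state line
        worst = state if worst is None else max(worst, state)
    # Assumption: no recognizable drive line at all means UNKNOWN.
    return UNKNOWN if worst is None else worst
```

One degraded drive is enough to raise the overall state, which is why the sketch keeps the maximum across lines instead of stopping at the first match.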

Sample Nagios Config:

define service{
        host_name                       hostname
        service_description             Check PERC 5/i RAID
        use                             generic-service
        check_command                   check_bmc_dell_raid!192.168.1.92!root!1234
}

Power Supply Check

This check queries the sdr type Power Supply and looks at the text output. The following table shows each Nagios state and how it relates to the output from IPMI:

Nagios State    IPMI Output
Normal          Presence detected
Normal          Fully Redundant
Critical        Failure detected, Power Supply AC lost
Critical        Presence detected, Failure detected
Critical        Presence detected, Failure detected, Power Supply AC lost

Command line options: IP/HOSTNAME USERNAME PASSWORD
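Reading the table as "any failure text is Critical, otherwise Normal", the mapping could be sketched like this in Python. The function name is hypothetical, and returning UNKNOWN when no known text appears is an assumption on my part, not something the table states.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_power_supply(sdr_output):
    """Map 'sdr type Power Supply' text output to a Nagios state."""
    saw_ok = False
    for line in sdr_output.splitlines():
        # Any failure text wins, even when presence is also reported.
        if "Failure detected" in line or "Power Supply AC lost" in line:
            return CRITICAL
        if "Presence detected" in line or "Fully Redundant" in line:
            saw_ok = True
    # Assumption: output with none of the known strings is UNKNOWN.
    return OK if saw_ok else UNKNOWN
```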

Example Nagios configuration:

define service {
        host_name                       hostname
        service_description             Power Supply
        use                             generic-service
        check_command                   check_bmc_ps!192.168.1.92!root!1234
}

Intrusion Sensor

Many modern server cases have a magnetic reed switch that detects whether the case is open or closed. There are a couple of advantages to having this sensor.

  1. Detect when someone opens the chassis cover.
  2. When the cover is open, adjust fan speed and airflow to compensate.

Command line options: IP/HOSTNAME USERNAME PASSWORD

This is a pretty straightforward script. The case is either open or not. Here's the example Nagios config:

define service{
        host_name                       hostname
        service_description             Check Intrusion
        use                             generic-service
        check_command                   check_bmc_intrusion!192.168.1.92!root!1234
}

Fan Speeds

On the Dell BMC we have, there are 6 fan sensors but only 4 fans in the system, not counting the fans within the power supplies. When we query the BMC, the 2 other sensors are shown as disabled. This script should work OK on most other Dell systems, as it queries all of the fan sensors, displays their data, and reports the number of disabled sensors.

This script could use a little cleaning up, and I'll work on it as I play with our system more. One thing that should be added is a couple of command line options to set a high/low threshold for fan speed. Here is how the Nagios state relates to the sensor output:

Nagios State    IPMI Output
Normal          ### RPM (where ### > 0)
Normal          Disabled
Critical        0 RPM
Critical        Redundancy Lost

Command options: IP/HOSTNAME USERNAME PASSWORD
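The fan logic in the table could be sketched as below. This is an illustrative Python version, not the plugin itself. It assumes the sensor reading is the last pipe-separated field of each output line, and the function name is hypothetical.

```python
import re

# Standard Nagios plugin exit codes used by this check.
OK, CRITICAL = 0, 2

# Matches a reading such as "4800 RPM" at the end of the field.
RPM_RE = re.compile(r"(\d+)\s*RPM\s*$")


def check_fans(sdr_output):
    """Return (state, speeds, disabled_count) for 'sdr type Fan' output."""
    state, speeds, disabled = OK, [], 0
    for line in sdr_output.splitlines():
        # Assumption: the reading is the last '|' separated field. This
        # also keeps sensor names containing "RPM" from being matched.
        reading = line.split("|")[-1].strip()
        match = RPM_RE.search(reading)
        if match:
            rpm = int(match.group(1))
            speeds.append(rpm)
            if rpm == 0:
                state = CRITICAL  # a present fan that is not spinning
        elif "Disabled" in reading:
            disabled += 1  # unpopulated sensor, counted but not an error
        elif "Redundancy Lost" in reading:
            state = CRITICAL
    return state, speeds, disabled
```

The returned list of speeds is what you would feed into performance data, and the disabled count goes into the status text.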

Sample Nagios Config:

define service{
        host_name                       hostname
        service_description             Check IPMI Fan Status
        use                             generic-service
        check_command                   check_bmc_fan_status!192.168.1.92!root!1234
}

Voltage (go/no-go)

Voltage sensors on the 2950 are a little weird. At the office, we've decided the system hardware is smarter than us, so we're OK with the way it does these. Actual voltage values are not displayed, only an Asserted/Deasserted state. Asserted is bad. It means there's an error. Deasserted is good. Everything is OK.

Command options: IP/HOSTNAME USERNAME PASSWORD
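A go/no-go check on these states could look roughly like the following Python sketch. The function name is hypothetical and the UNKNOWN fallback is an assumption. Note that "Deasserted" contains the letters "asserted", so the sketch matches whole words only.

```python
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Whole-word matches, so "Deasserted" never counts as "Asserted".
ASSERTED_RE = re.compile(r"\basserted\b", re.IGNORECASE)
DEASSERTED_RE = re.compile(r"\bdeasserted\b", re.IGNORECASE)


def classify_voltage(sdr_output):
    """Return CRITICAL if any voltage sensor reports an Asserted state."""
    saw_good = False
    for line in sdr_output.splitlines():
        if ASSERTED_RE.search(line):
            return CRITICAL  # Asserted means the sensor flags an error
        if DEASSERTED_RE.search(line):
            saw_good = True
    # Assumption: no state text at all is treated as UNKNOWN.
    return OK if saw_good else UNKNOWN
```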

Sample Nagios config:

define service{
        host_name                       hostname
        service_description             BMC Voltages
        use                             generic-service
        check_command                   check_bmc_voltage!192.168.1.92!root!1234
}

Ambient Temperature

So, this script is pretty specific to the Dell 2950. As it turns out, the 1950s have 4 more temperature sensors within their chassis, whereas the 2950 only has the ambient air temperature. I guess something is better than nothing.

This is probably the most dynamic of the set, as you can set the high and low temperatures, as well as a buffer threshold, before you get a warning from Nagios.

Command options: IP/HOSTNAME USERNAME PASSWORD LOW HIGH THRESHOLD

You can enter either Celsius or Fahrenheit degrees, as it's pretty easy to figure out which you mean, since there's a relatively small sane set of values here. The system reports Celsius, and we convert to Fahrenheit when needed. The script outputs in whichever unit you used for input.

Sample Nagios config:

define service {
        host_name                       hostname
        service_description             Ambient Temperature
        use                             generic-service
        check_command                   check_bmc_temp!192.168.1.92!root!1234!50!85!2
}

In the above config sample, a temperature lower than 50F or higher than 85F would trigger a CRITICAL alert, and a temperature within 2 degrees of either limit (between 50 and 52, or between 83 and 85) would trigger a WARNING.
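Here is a minimal Python sketch of that threshold logic, under two loudly stated assumptions: the buffer applies just inside the LOW/HIGH limits, and a HIGH value above 45 means the limits were entered in Fahrenheit. The 45 degree cutoff and the function name are hypothetical, and the real script's unit detection may work differently.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_temp(reading_c, low, high, buffer, fahrenheit_cutoff=45):
    """Classify a Celsius BMC reading against user-supplied limits.

    Returns (state, value, unit), where value is the reading converted
    to the same unit the limits were given in.
    """
    # Assumption: a HIGH limit above the cutoff can only be Fahrenheit.
    use_f = high > fahrenheit_cutoff
    value = reading_c * 9 / 5 + 32 if use_f else reading_c
    if value < low or value > high:
        state = CRITICAL
    elif value < low + buffer or value > high - buffer:
        # Assumption: the warning band sits just inside each limit.
        state = WARNING
    else:
        state = OK
    return state, value, "F" if use_f else "C"


# With the sample config (50, 85, 2), a BMC reading of 20 C converts to
# 68 F, which sits comfortably inside the allowed range:
#   classify_temp(20, 50, 85, 2)
```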

Download

You can download the files in one .zip here, or select an individual file below: