Improving UNIX/Linux Heartbeat Monitor

Anyone running a SCOM infrastructure that does cross-platform (UNIX/Linux) monitoring is probably very familiar with the UNIX/Linux Heartbeat Monitor as implemented by the UNIX/Linux Core Library MP.

For quick reference, here is the link to the Unit Monitor Type for this monitor (as implemented in version 7.5.1042.0 of the management pack): http://systemcentercore.com/?GetElement=Microsoft.Unix.WSMan.Heartbeat.MonitorType&Type=UnitMonitorType&ManagementPack=Microsoft.Unix.Library&Version=7.5.1042.0.

In my experience, this monitor implementation is quite dry: it’s a 2-state monitor that throws a rather laconic alert, “Heartbeat failed”, with the alert description “The System is not responding to heartbeats”. An alert notification like this reaching the Unix support personnel raises more questions than it answers about what’s wrong with the Unix system. Of course, one can have a look at the associated Knowledge, which is verbose and a good starting point for investigation, but who opens Health Explorer in the middle of the night? Health Explorer can also show the outcome of the Diagnostic Task(s) and the Recovery (if enabled), but this brings back the question of usability.

With a little work the monitor implementation can be improved in a few areas:

– first, let’s have a separate monitor that checks whether the system responds to ICMP. I’m sure that Unix support personnel will be grateful to know up front whether the system is dead or not. This way the “UNIX/Linux WS-Management Heartbeat ICMP Diagnostic” is no longer needed.

Here is how I implemented such a monitor:

		<UnitMonitor ID="Unix.ICMPMonitor.HostIsICMPResponsive" Accessibility="Public" Enabled="true" Target="Unix!Microsoft.Unix.Computer" ParentMonitorID="SystemHealth!System.Health.AvailabilityState" Remotable="true" Priority="Normal" TypeID="NetworkMonitoring!System.NetworkManagement.ICMPMonitorType" ConfirmDelivery="false">
			<Category>AvailabilityHealth</Category>
			<AlertSettings AlertMessage="Unix.ICMPMonitor.HostIsICMPResponsive.AlertMessage">
				<AlertOnState>Error</AlertOnState>
				<AutoResolve>true</AutoResolve>
				<AlertPriority>Normal</AlertPriority>
				<AlertSeverity>Error</AlertSeverity>
			</AlertSettings>
			<OperationalStates>
				<OperationalState ID="Responding" MonitorTypeStateID="ICMPResponding" HealthState="Success" />
				<OperationalState ID="NotResponding" MonitorTypeStateID="ICMPNotResponding" HealthState="Error" />
			</OperationalStates>
			<Configuration>
				<IP>$Target/Property[Type="Unix!Microsoft.Unix.Computer"]/NetworkName$</IP>
				<Interval>180</Interval>
				<NoOfRetries>3</NoOfRetries>
				<NumberOfSamples>3</NumberOfSamples>
				<Timeout>1000</Timeout>
				<PacketSizeBytes>32</PacketSizeBytes>
			</Configuration>
		</UnitMonitor>
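
The fragment above relies on the aliases Unix, SystemHealth and NetworkMonitoring, so the management pack that hosts it needs matching references in its manifest. A minimal sketch, assuming a hypothetical MP ID of Unix.Heartbeat.AddOn; apart from the Microsoft.Unix.Library version quoted earlier, the version numbers are assumptions and should be replaced with the versions actually imported in your management group:

		<Manifest>
			<Identity>
				<!-- Hypothetical ID and version for the add-on MP -->
				<ID>Unix.Heartbeat.AddOn</ID>
				<Version>1.0.0.0</Version>
			</Identity>
			<Name>Unix.Heartbeat.AddOn</Name>
			<References>
				<!-- Versions below are assumptions unless noted; use the ones from your environment -->
				<Reference Alias="System">
					<ID>System.Library</ID>
					<Version>7.5.8501.0</Version>
					<PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
				</Reference>
				<Reference Alias="SystemHealth">
					<ID>System.Health.Library</ID>
					<Version>7.0.8433.0</Version>
					<PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
				</Reference>
				<!-- Version taken from the link at the top of the article -->
				<Reference Alias="Unix">
					<ID>Microsoft.Unix.Library</ID>
					<Version>7.5.1042.0</Version>
					<PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
				</Reference>
				<!-- MP that defines System.NetworkManagement.ICMPMonitorType -->
				<Reference Alias="NetworkMonitoring">
					<ID>System.NetworkManagement.Monitoring</ID>
					<Version>7.1.10226.0</Version>
					<PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
				</Reference>
			</References>
		</Manifest>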

– second, let’s morph the Unit Monitor Type into a 3-state one: the Warning state gives a recovery action the opportunity to run, and the Error state fires the alert stating that the system is not monitored.

Here is how I suggest implementing the Unit Monitor Type:

      <UnitMonitorType ID="Unix.WSMan.Heartbeat.MonitorType" Accessibility="Public">
        <MonitorTypeStates>
          <MonitorTypeState ID="Available" NoDetection="false" />
          <MonitorTypeState ID="NeedsRecovery" NoDetection="false" />
          <MonitorTypeState ID="NotAvailable" NoDetection="false" />
        </MonitorTypeStates>
        <Configuration>
          <xsd:element name="Interval" type="xsd:int" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
          <xsd:element name="SyncTime" type="xsd:string" minOccurs="0" maxOccurs="1" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
          <xsd:element name="CorrelateWindowSeconds" type="xsd:integer" minOccurs="0" default="323" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
          <xsd:element name="MissedHeartbeats" type="xsd:integer" minOccurs="0" default="2" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
          <xsd:element name="MissedWindowSeconds" type="xsd:integer" minOccurs="0" default="623" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
        </Configuration>
        <OverrideableParameters>
          <OverrideableParameter ID="Interval" Selector="$Config/Interval$" ParameterType="int" />
          <OverrideableParameter ID="SyncTime" Selector="$Config/SyncTime$" ParameterType="string" />
          <OverrideableParameter ID="CorrelateWindowSeconds" Selector="$Config/CorrelateWindowSeconds$" ParameterType="int" />
          <OverrideableParameter ID="MissedHeartbeats" Selector="$Config/MissedHeartbeats$" ParameterType="int" />
          <OverrideableParameter ID="MissedWindowSeconds" Selector="$Config/MissedWindowSeconds$" ParameterType="int" />
        </OverrideableParameters>
        <MonitorImplementation>
          <MemberModules>
            <DataSource ID="DS" TypeID="Unix!Microsoft.Unix.WSMan.TimedEnumerator">
              <TargetSystem>$Target/Property[Type="Unix!Microsoft.Unix.Computer"]/NetworkName$</TargetSystem>
              <Uri>http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_Agent?__cimnamespace=root/scx</Uri>
              <Filter />
              <OutputErrorIfAny>true</OutputErrorIfAny>
              <SplitItems>false</SplitItems>
              <Interval>$Config/Interval$</Interval>
              <SyncTime>$Config/SyncTime$</SyncTime>
            </DataSource>
            <ProbeAction ID="EnableMonitoring" TypeID="Unix!Microsoft.Unix.EnableInstanceMonitoringOverrideAction">
              <ManagedEntityId>$Target/Id$</ManagedEntityId>
              <Value>true</Value>
            </ProbeAction>
            <ProbeAction ID="DisableMonitoring" TypeID="Unix!Microsoft.Unix.EnableInstanceMonitoringOverrideAction">
              <ManagedEntityId>$Target/Id$</ManagedEntityId>
              <Value>false</Value>
            </ProbeAction>
            <ConditionDetection ID="RepeatEventCondition" TypeID="System!System.ConsolidatorCondition">
              <Consolidator>
                <ConsolidationProperties />
                <TimeControl>
                  <WithinTimeSchedule>
                    <Interval>$Config/MissedWindowSeconds$</Interval>
                  </WithinTimeSchedule>
                </TimeControl>
                <CountingCondition>
                  <Count>$Config/MissedHeartbeats$</Count>
                  <CountMode>OnNewItemTestOutputRestart_OnTimerRestart</CountMode>
                </CountingCondition>
              </Consolidator>
            </ConditionDetection>
            <ConditionDetection ID="ErrorFilter" TypeID="System!System.ExpressionFilter">
              <Expression>
                <Exists>
                  <ValueExpression>
                    <XPathQuery Type="String">//ErrorCode</XPathQuery>
                  </ValueExpression>
                </Exists>
              </Expression>
            </ConditionDetection>
            <ConditionDetection ID="SuccessFilter" TypeID="System!System.ExpressionFilter">
              <Expression>
                <Not>
                  <Expression>
                    <Exists>
                      <ValueExpression>
                        <XPathQuery Type="String">//ErrorCode</XPathQuery>
                      </ValueExpression>
                    </Exists>
                  </Expression>
                </Not>
              </Expression>
            </ConditionDetection>
            <ConditionDetection ID="CorrelatedDataCondition" TypeID="System!System.CorrelatorAutoCondition">
              <Correlator>
                <CorrelationExpression>
                  <Expression />
                </CorrelationExpression>
                <Count>1</Count>
                <Interval>$Config/CorrelateWindowSeconds$</Interval>
                <CorrelationOrder>InSequence</CorrelationOrder>
                <CorrelationItemPolicy>First</CorrelationItemPolicy>
              </Correlator>
            </ConditionDetection>
          </MemberModules>
          <RegularDetections>
            <RegularDetection MonitorTypeStateID="Available">
              <Node ID="EnableMonitoring">
                <Node ID="SuccessFilter">
                  <Node ID="DS" />
                </Node>
              </Node>
            </RegularDetection>
            <RegularDetection MonitorTypeStateID="NeedsRecovery">
              <Node ID="CorrelatedDataCondition">
                <Node ID="SuccessFilter">
                  <Node ID="DS" />
                </Node>
                <Node ID="ErrorFilter">
                  <Node ID="DS" />
                </Node>
              </Node>
            </RegularDetection>
            <RegularDetection MonitorTypeStateID="NotAvailable">
              <Node ID="DisableMonitoring">
                <Node ID="RepeatEventCondition">
                  <Node ID="ErrorFilter">
                    <Node ID="DS" />
                  </Node>
                </Node>
              </Node>
            </RegularDetection>
          </RegularDetections>
        </MonitorImplementation>
      </UnitMonitorType>

Here is what the new monitor looks like:

		<UnitMonitor ID="Unix.HostIsNotMonitoredByAgent.Monitor" Accessibility="Public" Enabled="true" Target="Unix!Microsoft.Unix.Computer" ParentMonitorID="SystemHealth!System.Health.AvailabilityState" Remotable="true" Priority="Normal" TypeID="Unix.WSMan.Heartbeat.MonitorType" ConfirmDelivery="false">
			<Category>AvailabilityHealth</Category>
			<AlertSettings AlertMessage="Unix.HostIsNotMonitoredByAgent.AlertMessage">
				<AlertOnState>Error</AlertOnState>
				<AutoResolve>true</AutoResolve>
				<AlertPriority>Normal</AlertPriority>
				<AlertSeverity>Error</AlertSeverity>
				<AlertParameters>
				</AlertParameters>
			</AlertSettings>
			<OperationalStates>
			  <OperationalState ID="Available" MonitorTypeStateID="Available" HealthState="Success" />
			  <OperationalState ID="NeedsRecovery" MonitorTypeStateID="NeedsRecovery" HealthState="Warning" />
			  <OperationalState ID="NotAvailable" MonitorTypeStateID="NotAvailable" HealthState="Error" />
			</OperationalStates>
			<Configuration>
			  <Interval>300</Interval>
			  <SyncTime></SyncTime>
			  <CorrelateWindowSeconds>323</CorrelateWindowSeconds>
			  <MissedHeartbeats>2</MissedHeartbeats>
			  <MissedWindowSeconds>623</MissedWindowSeconds>
			</Configuration>
		</UnitMonitor>
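
Because the monitor type exposes its thresholds as overrideable parameters, they can be tuned with overrides rather than by editing the monitor. A minimal sketch, assuming the overrides live in the same MP as the monitor; the override IDs are hypothetical, and the value 923 only follows the apparent pattern of the defaults (623 = 2 × 300 + 23), which is an inference rather than a documented rule:

		<Overrides>
			<!-- Hypothetical example: require three missed heartbeats before going to Error -->
			<MonitorConfigurationOverride ID="Unix.HostIsNotMonitoredByAgent.MissedHeartbeats.Override" Context="Unix!Microsoft.Unix.Computer" Enforced="false" Monitor="Unix.HostIsNotMonitoredByAgent.Monitor" Parameter="MissedHeartbeats">
				<Value>3</Value>
			</MonitorConfigurationOverride>
			<!-- Widen the counting window to match: 3 intervals of 300 seconds plus the same 23 seconds of slack -->
			<MonitorConfigurationOverride ID="Unix.HostIsNotMonitoredByAgent.MissedWindowSeconds.Override" Context="Unix!Microsoft.Unix.Computer" Enforced="false" Monitor="Unix.HostIsNotMonitoredByAgent.Monitor" Parameter="MissedWindowSeconds">
				<Value>923</Value>
			</MonitorConfigurationOverride>
		</Overrides>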

I will leave editing the Alert Message to the reader, so that it is clear what is going on. It can say “The host is not monitored by agent”, followed by some steps to try in order to fix the agent.
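
As a starting point, the alert message could be declared roughly like this. It is a sketch only: the wording of the name and description is a placeholder to adapt, and it assumes an ENU language pack in the same MP.

		<Presentation>
			<StringResources>
				<!-- Must match the AlertMessage attribute of the monitor's AlertSettings -->
				<StringResource ID="Unix.HostIsNotMonitoredByAgent.AlertMessage" />
			</StringResources>
		</Presentation>
		<LanguagePacks>
			<LanguagePack ID="ENU" IsDefault="true">
				<DisplayStrings>
					<DisplayString ElementID="Unix.HostIsNotMonitoredByAgent.AlertMessage">
						<!-- Placeholder text; replace with the steps your support team should follow -->
						<Name>The host is not monitored by agent</Name>
						<Description>The agent did not answer the heartbeat query and the automatic restart did not bring it back. Check the agent services on the host and restart them if needed.</Description>
					</DisplayString>
				</DisplayStrings>
			</LanguagePack>
		</LanguagePacks>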

And here is the Recovery action associated with the Warning state:

		<Recovery ID="Unix.SCX.Restart.Recovery" Accessibility="Public" Enabled="true" Target="Unix!Microsoft.Unix.Computer" Monitor="Unix.HostIsNotMonitoredByAgent.Monitor" ResetMonitor="false" ExecuteOnState="Warning" Remotable="true" Timeout="300">
			<Category>Maintenance</Category>
			<WriteAction ID="SSHCommand" TypeID="Unix!Microsoft.Unix.SSHCommand.WriteAction">
			  <Host>$Target/Property[Type="Unix!Microsoft.Unix.Computer"]/PrincipalName$</Host>
			  <Port>$Target/Property[Type="Unix!Microsoft.Unix.Computer"]/SSHPort$</Port>
			  <UserName>$RunAs[Name="Unix!Microsoft.Unix.AgentMaintenanceAccount"]/UserName$</UserName>
			  <Password>$RunAs[Name="Unix!Microsoft.Unix.AgentMaintenanceAccount"]/Password$</Password>
			  <Command>/opt/microsoft/scx/bin/tools/scxadmin -stop provider; /opt/microsoft/scx/bin/tools/scxadmin -stop cimom; /opt/microsoft/scx/bin/tools/scxadmin -start all</Command>
			  <TimeoutSeconds>60</TimeoutSeconds>
			</WriteAction>
		</Recovery>

– third, let’s have the original “UNIX/Linux Heartbeat Monitor” disabled using an override.
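
Such an override might look like the sketch below. The override ID is hypothetical, and the monitor ID Microsoft.Unix.WSMan.Heartbeat.Monitor is an assumption based on the name of its monitor type, so verify it against the Microsoft.Unix.Library MP before importing:

		<Overrides>
			<!-- Assumed monitor ID; check it in the UNIX/Linux Core Library MP -->
			<MonitorPropertyOverride ID="Unix.Heartbeat.Original.Disable.Override" Context="Unix!Microsoft.Unix.Computer" Enforced="false" Monitor="Unix!Microsoft.Unix.WSMan.Heartbeat.Monitor" Property="Enabled">
				<Value>false</Value>
			</MonitorPropertyOverride>
		</Overrides>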

Hope this helps.

Update: a Management Pack that implements all the above changes is available for download on TechNet Gallery: https://gallery.technet.microsoft.com/AddOn-Unix-Heartbeat-3fc2a296.

8 responses to “Improving UNIX/Linux Heartbeat Monitor”

  1. NBM October 19, 2016 at 10:56 am

    Hello,

    I’m interested in your solution but I can’t seem to figure out how to create this unit monitor type within SCOM 2012 (we’re on UR9). I’m mainly interested in the Unix.ICMPMonitor.HostIsICMPResponsive monitor. Is this something I need to make in a separate management pack and then import?

    Thank you for your help.

  2. spostea October 19, 2016 at 2:03 pm

    Yes, that’s right. You need to create another MP to introduce the Unit Monitor in question and then import the MP. Please note that it’s not a unit monitor type, it’s a unit monitor of System.NetworkManagement.ICMPMonitorType type (defined in System.NetworkManagement.Monitoring MP).

  3. sunny November 8, 2016 at 1:51 pm

    Is it possible for you to create an MP for us, so that we can simply import and implement it out of the box? I am confused about whether I need to create 1 or 3 different monitors.

  4. spostea November 8, 2016 at 2:44 pm

    It looks like there is interest in this solution, so yes, I will soon (this week) upload to TechNet Gallery a Management Pack that covers all the changes explained in the article. Thanks for the feedback and stay tuned for updates.
