Monthly Archives: June 2015

Improving UNIX/Linux Heartbeat Monitor

Anyone running a SCOM infrastructure that does Cross-Platform (UNIX/Linux) monitoring is probably very familiar with the UNIX/Linux Heartbeat Monitor as implemented by the UNIX/Linux Core Library MP.

For quick reference here is the link to the Unit Monitor Type for this monitor (as it is implemented in version 7.5.1042.0 of the management pack): http://systemcentercore.com/?GetElement=Microsoft.Unix.WSMan.Heartbeat.MonitorType&Type=UnitMonitorType&ManagementPack=Microsoft.Unix.Library&Version=7.5.1042.0.

In my experience, this monitor implementation is quite dry – it’s a 2-state monitor that raises a rather laconic alert: “Heartbeat failed”, with the alert description “The System is not responding to heartbeats”. An alert notification like this reaching the Unix Support personnel raises more questions than it answers about what’s wrong with the Unix system. Of course, one can look at the associated Knowledge, which is verbose and a good starting point for investigation, but who opens Health Explorer in the middle of the night? Health Explorer can also show the outcome of the Diagnostic Task(s) and the Recovery (if enabled), but this brings back the same question of usability.

With a little work the monitor implementation can be improved in a few areas:

– first, let’s have a separate monitor that checks whether the system responds to ICMP (ping) – I’m sure that Unix Support personnel will be grateful to know up front whether the system is dead or not. This way the “UNIX/Linux WS-Management Heartbeat ICMP Diagnostic” is no longer needed.

Here is how I implemented such a monitor:

		<UnitMonitor ID="Unix.ICMPMonitor.HostIsICMPResponsive" Accessibility="Public" Enabled="true" Target="Unix!Microsoft.Unix.Computer" ParentMonitorID="SystemHealth!System.Health.AvailabilityState" Remotable="true" Priority="Normal" TypeID="NetworkMonitoring!System.NetworkManagement.ICMPMonitorType" ConfirmDelivery="false">
			<Category>AvailabilityHealth</Category>
			<AlertSettings AlertMessage="Unix.ICMPMonitor.HostIsICMPResponsive.AlertMessage">
				<AlertOnState>Error</AlertOnState>
				<AutoResolve>true</AutoResolve>
				<AlertPriority>Normal</AlertPriority>
				<AlertSeverity>Error</AlertSeverity>
			</AlertSettings>
			<OperationalStates>
				<OperationalState ID="Responding" MonitorTypeStateID="ICMPResponding" HealthState="Success" />
				<OperationalState ID="NotResponding" MonitorTypeStateID="ICMPNotResponding" HealthState="Error" />
			</OperationalStates>
			<Configuration>
				<IP>$Target/Property[Type="Unix!Microsoft.Unix.Computer"]/NetworkName$</IP>
				<Interval>180</Interval>
				<NoOfRetries>3</NoOfRetries>
				<NumberOfSamples>3</NumberOfSamples>
				<Timeout>1000</Timeout>
				<PacketSizeBytes>32</PacketSizeBytes>
			</Configuration>
		</UnitMonitor>

– second, let’s morph the Unit Monitor Type into a 3-state one: the Warning state gives a recovery action the opportunity to run, and the Error state fires the alert stating that the system is not monitored.

Here is how I suggest implementing the Unit Monitor Type:

      <UnitMonitorType ID="Unix.WSMan.Heartbeat.MonitorType" Accessibility="Public">
        <MonitorTypeStates>
          <MonitorTypeState ID="Available" NoDetection="false" />
          <MonitorTypeState ID="NeedsRecovery" NoDetection="false" />
          <MonitorTypeState ID="NotAvailable" NoDetection="false" />
        </MonitorTypeStates>
        <Configuration>
          <xsd:element name="Interval" type="xsd:int" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
          <xsd:element name="SyncTime" type="xsd:string" minOccurs="0" maxOccurs="1" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
          <xsd:element name="CorrelateWindowSeconds" type="xsd:integer" minOccurs="0" default="323" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
          <xsd:element name="MissedHeartbeats" type="xsd:integer" minOccurs="0" default="2" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
          <xsd:element name="MissedWindowSeconds" type="xsd:integer" minOccurs="0" default="623" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
        </Configuration>
        <OverrideableParameters>
          <OverrideableParameter ID="Interval" Selector="$Config/Interval$" ParameterType="int" />
          <OverrideableParameter ID="SyncTime" Selector="$Config/SyncTime$" ParameterType="string" />
          <OverrideableParameter ID="CorrelateWindowSeconds" Selector="$Config/CorrelateWindowSeconds$" ParameterType="int" />
          <OverrideableParameter ID="MissedHeartbeats" Selector="$Config/MissedHeartbeats$" ParameterType="int" />
          <OverrideableParameter ID="MissedWindowSeconds" Selector="$Config/MissedWindowSeconds$" ParameterType="int" />
        </OverrideableParameters>
        <MonitorImplementation>
          <MemberModules>
            <DataSource ID="DS" TypeID="Unix!Microsoft.Unix.WSMan.TimedEnumerator">
              <TargetSystem>$Target/Property[Type="Unix!Microsoft.Unix.Computer"]/NetworkName$</TargetSystem>
              <Uri>http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_Agent?__cimnamespace=root/scx</Uri>
              <Filter />
              <OutputErrorIfAny>true</OutputErrorIfAny>
              <SplitItems>false</SplitItems>
              <Interval>$Config/Interval$</Interval>
              <SyncTime>$Config/SyncTime$</SyncTime>
            </DataSource>
            <ProbeAction ID="EnableMonitoring" TypeID="Unix!Microsoft.Unix.EnableInstanceMonitoringOverrideAction">
              <ManagedEntityId>$Target/Id$</ManagedEntityId>
              <Value>true</Value>
            </ProbeAction>
            <ProbeAction ID="DisableMonitoring" TypeID="Unix!Microsoft.Unix.EnableInstanceMonitoringOverrideAction">
              <ManagedEntityId>$Target/Id$</ManagedEntityId>
              <Value>false</Value>
            </ProbeAction>
            <ConditionDetection ID="RepeatEventCondition" TypeID="System!System.ConsolidatorCondition">
              <Consolidator>
                <ConsolidationProperties />
                <TimeControl>
                  <WithinTimeSchedule>
                    <Interval>$Config/MissedWindowSeconds$</Interval>
                  </WithinTimeSchedule>
                </TimeControl>
                <CountingCondition>
                  <Count>$Config/MissedHeartbeats$</Count>
                  <CountMode>OnNewItemTestOutputRestart_OnTimerRestart</CountMode>
                </CountingCondition>
              </Consolidator>
            </ConditionDetection>
            <ConditionDetection ID="ErrorFilter" TypeID="System!System.ExpressionFilter">
              <Expression>
                <Exists>
                  <ValueExpression>
                    <XPathQuery Type="String">//ErrorCode</XPathQuery>
                  </ValueExpression>
                </Exists>
              </Expression>
            </ConditionDetection>
            <ConditionDetection ID="SuccessFilter" TypeID="System!System.ExpressionFilter">
              <Expression>
                <Not>
                  <Expression>
                    <Exists>
                      <ValueExpression>
                        <XPathQuery Type="String">//ErrorCode</XPathQuery>
                      </ValueExpression>
                    </Exists>
                  </Expression>
                </Not>
              </Expression>
            </ConditionDetection>
            <ConditionDetection ID="CorrelatedDataCondition" TypeID="System!System.CorrelatorAutoCondition">
              <Correlator>
                <CorrelationExpression>
                  <Expression />
                </CorrelationExpression>
                <Count>1</Count>
                <Interval>$Config/CorrelateWindowSeconds$</Interval>
                <CorrelationOrder>InSequence</CorrelationOrder>
                <CorrelationItemPolicy>First</CorrelationItemPolicy>
              </Correlator>
            </ConditionDetection>
          </MemberModules>
          <RegularDetections>
            <RegularDetection MonitorTypeStateID="Available">
              <Node ID="EnableMonitoring">
                <Node ID="SuccessFilter">
                  <Node ID="DS" />
                </Node>
              </Node>
            </RegularDetection>
            <RegularDetection MonitorTypeStateID="NeedsRecovery">
              <Node ID="CorrelatedDataCondition">
                <Node ID="SuccessFilter">
                  <Node ID="DS" />
                </Node>
                <Node ID="ErrorFilter">
                  <Node ID="DS" />
                </Node>
              </Node>
            </RegularDetection>
            <RegularDetection MonitorTypeStateID="NotAvailable">
              <Node ID="DisableMonitoring">
                <Node ID="RepeatEventCondition">
                  <Node ID="ErrorFilter">
                    <Node ID="DS" />
                  </Node>
                </Node>
              </Node>
            </RegularDetection>
          </RegularDetections>
        </MonitorImplementation>
      </UnitMonitorType>

Here is what the new Monitor looks like:

		<UnitMonitor ID="Unix.HostIsNotMonitoredByAgent.Monitor" Accessibility="Public" Enabled="true" Target="Unix!Microsoft.Unix.Computer" ParentMonitorID="SystemHealth!System.Health.AvailabilityState" Remotable="true" Priority="Normal" TypeID="Unix.WSMan.Heartbeat.MonitorType" ConfirmDelivery="false">
			<Category>AvailabilityHealth</Category>
			<AlertSettings AlertMessage="Unix.HostIsNotMonitoredByAgent.AlertMessage">
				<AlertOnState>Error</AlertOnState>
				<AutoResolve>true</AutoResolve>
				<AlertPriority>Normal</AlertPriority>
				<AlertSeverity>Error</AlertSeverity>
				<AlertParameters>
				</AlertParameters>
			</AlertSettings>
			<OperationalStates>
			  <OperationalState ID="Available" MonitorTypeStateID="Available" HealthState="Success" />
			  <OperationalState ID="NeedsRecovery" MonitorTypeStateID="NeedsRecovery" HealthState="Warning" />
			  <OperationalState ID="NotAvailable" MonitorTypeStateID="NotAvailable" HealthState="Error" />
			</OperationalStates>
			<Configuration>
			  <Interval>300</Interval>
			  <SyncTime></SyncTime>
			  <CorrelateWindowSeconds>323</CorrelateWindowSeconds>
			  <MissedHeartbeats>2</MissedHeartbeats>
			  <MissedWindowSeconds>623</MissedWindowSeconds>
			</Configuration>
		</UnitMonitor>
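
A note on the timing values: the defaults look as if CorrelateWindowSeconds (323) is one Interval (300) plus a little slack, and MissedWindowSeconds (623) is two Intervals plus the same slack. That relationship is an assumption inferred from the numbers, not a documented rule. If it holds, changing the Interval probably means scaling the two windows with it. Here is a minimal sketch of such overrides for a 600-second heartbeat; the override IDs and the values 600/623/1223 are hypothetical and only illustrate the pattern:

```xml
<!-- Sketch only: goes in the <Overrides> section of the management pack. -->
<!-- Override IDs are hypothetical; values assume "window = n * Interval + 23 seconds". -->
<MonitorConfigurationOverride ID="Unix.HostIsNotMonitoredByAgent.Interval.Override" Context="Unix!Microsoft.Unix.Computer" Enforced="false" Monitor="Unix.HostIsNotMonitoredByAgent.Monitor" Parameter="Interval">
  <Value>600</Value>
</MonitorConfigurationOverride>
<!-- One interval plus slack, so a failure can be correlated with the preceding success -->
<MonitorConfigurationOverride ID="Unix.HostIsNotMonitoredByAgent.CorrelateWindow.Override" Context="Unix!Microsoft.Unix.Computer" Enforced="false" Monitor="Unix.HostIsNotMonitoredByAgent.Monitor" Parameter="CorrelateWindowSeconds">
  <Value>623</Value>
</MonitorConfigurationOverride>
<!-- Two intervals plus slack, so MissedHeartbeats (2) failures fit into the window -->
<MonitorConfigurationOverride ID="Unix.HostIsNotMonitoredByAgent.MissedWindow.Override" Context="Unix!Microsoft.Unix.Computer" Enforced="false" Monitor="Unix.HostIsNotMonitoredByAgent.Monitor" Parameter="MissedWindowSeconds">
  <Value>1223</Value>
</MonitorConfigurationOverride>
```

The same three parameters can also be changed from the console, since the Unit Monitor Type exposes them as overrideable parameters.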

I will leave editing the Alert Message to the reader, so that it’s clear what is going on. It can include “The host is not monitored by agent”, followed by some steps to try in order to fix the agent.
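
As a starting point, here is a minimal sketch of how the alert message could be wired up. The string resource ID matches the AlertMessage attribute of the monitor above; the wording of the name and description is only a placeholder, and passing the host name as AlertParameter1 is my own suggestion rather than part of the monitor shown above:

```xml
<!-- Sketch only: these fragments belong to different sections of the management pack. -->

<!-- 1. Inside the monitor's <AlertSettings>: pass the host name as the first alert parameter -->
<AlertParameters>
  <AlertParameter1>$Target/Property[Type="Unix!Microsoft.Unix.Computer"]/NetworkName$</AlertParameter1>
</AlertParameters>

<!-- 2. Inside <Presentation>: declare the string resource the monitor refers to -->
<StringResources>
  <StringResource ID="Unix.HostIsNotMonitoredByAgent.AlertMessage" />
</StringResources>

<!-- 3. Inside <LanguagePacks>: the text itself; {0} is replaced with AlertParameter1 -->
<LanguagePack ID="ENU" IsDefault="true">
  <DisplayStrings>
    <DisplayString ElementID="Unix.HostIsNotMonitoredByAgent.AlertMessage">
      <Name>The host is not monitored by agent</Name>
      <!-- Placeholder wording: replace with the steps that apply in your environment -->
      <Description>The SCOM agent on {0} did not respond to heartbeats and an automatic agent restart did not help. Log on to the host, check that the agent is running and try "/opt/microsoft/scx/bin/tools/scxadmin -start all".</Description>
    </DisplayString>
  </DisplayStrings>
</LanguagePack>
```

Keeping the host name in the description means the notification is readable on its own, without opening Health Explorer.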

And here is the Recovery action associated with the Warning state:

		<Recovery ID="Unix.SCX.Restart.Recovery" Accessibility="Public" Enabled="true" Target="Unix!Microsoft.Unix.Computer" Monitor="Unix.HostIsNotMonitoredByAgent.Monitor" ResetMonitor="false" ExecuteOnState="Warning" Remotable="true" Timeout="300">
			<Category>Maintenance</Category>
			<WriteAction ID="SSHCommand" TypeID="Unix!Microsoft.Unix.SSHCommand.WriteAction">
			  <Host>$Target/Property[Type="Unix!Microsoft.Unix.Computer"]/PrincipalName$</Host>
			  <Port>$Target/Property[Type="Unix!Microsoft.Unix.Computer"]/SSHPort$</Port>
			  <UserName>$RunAs[Name="Unix!Microsoft.Unix.AgentMaintenanceAccount"]/UserName$</UserName>
			  <Password>$RunAs[Name="Unix!Microsoft.Unix.AgentMaintenanceAccount"]/Password$</Password>
			  <Command>/opt/microsoft/scx/bin/tools/scxadmin -stop provider; /opt/microsoft/scx/bin/tools/scxadmin -stop cimom; /opt/microsoft/scx/bin/tools/scxadmin -start all</Command>
			  <TimeoutSeconds>60</TimeoutSeconds>
			</WriteAction>
		</Recovery>

– third, let’s have the original “UNIX/Linux Heartbeat Monitor” disabled using an override.
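
A minimal sketch of such an override is below. The override ID is hypothetical, and the monitor ID Microsoft.Unix.WSMan.Heartbeat.Monitor is an assumption on my side (derived from the name of the Unit Monitor Type), so please verify it against the UNIX/Linux Core Library MP before using it:

```xml
<!-- Sketch only: goes in the <Overrides> section of an unsealed management pack. -->
<!-- Assumption: the original monitor ID is Microsoft.Unix.WSMan.Heartbeat.Monitor. -->
<MonitorPropertyOverride ID="Unix.DisableOriginalHeartbeat.Override" Context="Unix!Microsoft.Unix.Computer" Enforced="false" Monitor="Unix!Microsoft.Unix.WSMan.Heartbeat.Monitor" Property="Enabled">
  <Value>false</Value>
</MonitorPropertyOverride>
```

The same result can be achieved from the console by overriding the Enabled parameter of the monitor for all objects of class UNIX/Linux Computer.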

Hope this helps.

Update: a Management Pack that implements all the above changes is available for download on TechNet Gallery: https://gallery.technet.microsoft.com/AddOn-Unix-Heartbeat-3fc2a296.
