Monitoring HP Proliant Servers


Summary
Device Type Object Identifier Mibs
HP Proliant Servers

Inherits Compaq Server properties

What to Monitor

HP Proliant servers are preferred by administrators in datacenters and in growing small and medium enterprises because they address power and space constraints. The family has a few variants, including the BladeSystem.

The hardware health of HP Proliant systems depends on the temperature sensors, the fan status and speed, and the proper functioning of the power module, in addition to the usual suspects: memory, processor, and disk. Besides these hardware health monitors, the systems provide a handle to monitor other metrics specific to system status, including AMP (Advanced Memory Protection), drive array status, and network interface status.

  • Temperature: The server temperature must be maintained within the operating range. An increased temperature may result in a server shutdown, which means downtime. HP Proliant systems can have many temperature sensors. Monitoring the temperature against its thresholds is a proactive way of keeping it in check. You can also monitor the system-generated logs as an added measure. The fan adjusts its speed to bring down the temperature, which warrants monitoring the fan speed and status too.

  • Fan: As mentioned above, the fan cools the server when the temperature increases. Some systems have variable-speed fans that adjust their speed according to the server temperature. A malfunctioning fan can let the temperature overshoot the threshold and, in the absence of a redundant cooling fan, can cause a server shutdown.

  • Power: HP Proliant servers that come with a fault-tolerant power supply architecture distribute the load equally across supplies. In the event of a supply failure, the backup power keeps the supply uninterrupted. The power supply status, the state of power redundancy, and the used power capacity together reflect the health of the power module.

  • Advanced Memory Protection and Memory Monitoring: Besides the regular resource utilization monitors, these systems have what is called Advanced Memory Protection (AMP), a resilient memory subsystem that increases fault tolerance and helps ensure high availability of the services hosted on these servers. It comprises features such as online spare memory, mirrored memory, and hot replace. Monitoring the overall health status of this module helps administrators make informed decisions.

  • CPU Monitoring: The processor status, the CPU utilization, and the CPU speed are some of the metrics that need monitoring to avert degradation of server health caused by poor performance of this resource.
Monitor SNMP OID Details
CPU Monitors
Temperature .1.3.6.1.4.1.232.6.2.6.8.1.4

"This is the current temperature sensor reading in degrees
Celsius.

If this value cannot be determined by software, then a value
of -1 will be returned."
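
As a minimal sketch of how a poller might interpret this reading, the function below treats -1 as "not available" rather than as a real temperature and compares any other value against a threshold. The function name, the result labels, and the threshold value are hypothetical; real thresholds depend on the sensor and the server model.

```python
# OID of the temperature reading, as listed in the table above.
TEMPERATURE_OID = ".1.3.6.1.4.1.232.6.2.6.8.1.4"


def classify_temperature(reading_c: int, threshold_c: int) -> str:
    """Classify one sensor reading (degrees Celsius) against a threshold.

    The agent returns -1 when software cannot determine the value, so that
    case is reported as "unknown" instead of being compared to the threshold.
    The labels returned here are illustrative, not part of any MIB.
    """
    if reading_c == -1:
        return "unknown"
    if reading_c >= threshold_c:
        return "critical"
    return "ok"


if __name__ == "__main__":
    # 70 is an assumed threshold used only for this example.
    for value in (42, 75, -1):
        print(value, classify_temperature(value, threshold_c=70))
```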

Fan speed .1.3.6.1.4.1.232.6.2.6.7.1.6

"This specifies the speed of the fan. This value will be set
if the fan type is tachOutput."

It returns one of these results when queried: other(1), normal(2), high(3)

CPU speed .1.3.6.1.4.1.232.1.2.2.1.1.4

This is the internal speed of the processor in megahertz. Zero is returned if this value is not available.

Resource Utilization

  • CPU Utilization: .1.3.6.1.2.1.25.3.3.1.2
  • Memory Utilization: .1.3.6.1.2.1.25.5.1.1.2
  • Disk Utilization: .1.3.6.1.2.1.25.2.3.1.6

The utilization of system resources is monitored if the Host Resources MIB is implemented on the system. The host physical and paging memory can be monitored by querying the relevant OIDs from the CPQHOST MIB.
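
The disk OID listed above (hrStorageUsed in the Host Resources MIB) reports used space in allocation units, not as a percentage. The sketch below shows one way a percentage could be derived, assuming the poller also reads the companion hrStorageSize column (.1.3.6.1.2.1.25.2.3.1.5) for the same row. That companion OID is not listed in this template and the function name is hypothetical.

```python
def storage_utilization_percent(used_units: int, size_units: int) -> float:
    """Return storage utilization as a percentage.

    Both arguments are in the same allocation units (hrStorageUsed and
    hrStorageSize for one hrStorageTable row), so the unit size cancels
    out and does not need to be fetched for a ratio.
    """
    if size_units <= 0:
        raise ValueError("size_units must be positive")
    return 100.0 * used_units / size_units


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(storage_utilization_percent(used_units=50, size_units=200))
```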
PSU Redundancy .1.3.6.1.4.1.232.6.2.9.3.1.9

The system supports redundant power supplies for efficiency and high availability. Monitoring this variable gives the state of the power supply redundancy.

It returns:

  • other(1): The redundancy state could not be determined.
  • notRedundant(2): The power supply is not operating in a redundant state.
  • redundant(3): The power supply is operating in a redundant state.
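
A poller could turn the returned integer into an alert decision along the following lines. This is only a sketch: the mapping mirrors the three values listed above, while the function name and the choice to alert on anything other than redundant(3) are assumptions.

```python
# Values as listed for the PSU Redundancy OID above.
PSU_REDUNDANCY = {1: "other", 2: "notRedundant", 3: "redundant"}


def psu_redundancy_alert(value: int) -> bool:
    """Return True when redundancy is lost or cannot be confirmed.

    Unrecognized integers are treated like other(1), which is an
    assumption of this sketch rather than documented agent behavior.
    """
    return PSU_REDUNDANCY.get(value, "other") != "redundant"


if __name__ == "__main__":
    for value in (1, 2, 3):
        print(PSU_REDUNDANCY[value], psu_redundancy_alert(value))
```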

Drive Size

.1.3.6.1.4.1.232.3.2.5.1.1.45

This monitor gives the physical drive size in MB.

Status Monitors
Advanced Memory Protection (AMP)

.1.3.6.1.4.1.232.6.2.14.4

This monitor gives the current condition of the Advanced Memory Protection subsystem. It returns one of the following results when queried for status:

  • other(1): The system does not support fault tolerant memory or the state cannot be determined by the Management Agent.
  • ok(2): This system is operating normally.
  • degraded(3): The system is running in a degraded state because Advanced Memory Protection has been engaged.

Automatic Server Recovery (ASR)

.1.3.6.1.4.1.232.6.2.5.17

It's a heartbeat monitor. Querying this periodically returns the status of the ASR system. It can be one of the following:

  • other(1): The system does not support the 'heartbeat' feature.
  • ok(2): The system is operating normally.
  • degraded(3): The system is operating poorly.
  • failed(4): The subsystem has failed.

 

Drive Array Status .1.3.6.1.4.1.232.3.1.3

This monitor checks the status of the drive array on the system and returns one of the following:

  • other(1): The array status could not be determined.
  • ok(2): The array is operating normally.
  • degraded(3): The array is operating in a degraded state.
  • failed(4): The array has failed.

 

Thermal CPU Fan Status .1.3.6.1.4.1.232.6.2.6.5

This monitor gives the status of the processor fan(s) in the system. This value will be one of the following:

  • other(1): Fan status detection is not supported by this system or driver.
  • ok(2): All fans are operating properly.
  • failed(4): A fan is not operating properly.

The system will be shut down if the failed(4) condition occurs.
Network Interface Status .1.3.6.1.4.1.232.18.1.3

Querying this variable from the CPQNIC MIB gives the status of the NIC on HP Proliant servers. The status it returns can be one of the following:

unknown(1), ok(2), degraded(3), or failed(4)
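
To act on this status, a monitoring script typically maps the integer to a name and then to an alarm severity. The sketch below uses the four values listed above. The severity labels and the function name are hypothetical and should be replaced with whatever the monitoring system in use expects.

```python
# Status values as listed for the Network Interface Status OID above.
NIC_CONDITION = {1: "unknown", 2: "ok", 3: "degraded", 4: "failed"}

# Hypothetical severity labels, chosen only for this illustration.
SEVERITY = {
    "unknown": "attention",
    "ok": "clear",
    "degraded": "trouble",
    "failed": "critical",
}


def nic_severity(value: int) -> str:
    """Map a returned status integer to an alarm severity label.

    Integers outside the listed set fall back to "unknown".
    """
    name = NIC_CONDITION.get(value, "unknown")
    return SEVERITY[name]


if __name__ == "__main__":
    for value in (1, 2, 3, 4):
        print(NIC_CONDITION[value], "->", nic_severity(value))
```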

 

Power Supply Status .1.3.6.1.4.1.232.6.2.9.2

The value returned on querying this variable specifies the status of the fault-tolerant power supply.
Traffic Monitors
Rx Traffic .1.3.6.1.2.1.2.2.1.10 Rx utilization is the percentage of the network bandwidth currently used by received traffic. Consistently high utilization indicates a bottleneck in the network and needs further troubleshooting.
Tx Traffic .1.3.6.1.2.1.2.2.1.16 Tx utilization is the percentage of the network bandwidth used by transmitted traffic. Again, high utilization indicates a network performance bottleneck. In-depth traffic analysis using the Netflow module helps identify the cause and free up bandwidth quickly.
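
The two OIDs above are the standard IF-MIB octet counters (ifInOctets and ifOutOctets), so a utilization percentage has to be derived from two successive samples. The sketch below shows one common way to do that, assuming 32-bit counters that wrap at most once between polls and an interface speed in bits per second taken from ifSpeed (.1.3.6.1.2.1.2.2.1.5). The ifSpeed OID is not part of this template and the function name is hypothetical.

```python
COUNTER32_MODULUS = 2 ** 32  # ifInOctets / ifOutOctets are 32-bit counters


def utilization_percent(prev_octets: int, curr_octets: int,
                        interval_s: float, if_speed_bps: int) -> float:
    """Estimate link utilization from two octet-counter samples.

    The modulo handles a single counter wrap between the two polls;
    more than one wrap within the interval cannot be detected this way.
    """
    if interval_s <= 0 or if_speed_bps <= 0:
        raise ValueError("interval_s and if_speed_bps must be positive")
    delta_octets = (curr_octets - prev_octets) % COUNTER32_MODULUS
    bits_per_second = delta_octets * 8 / interval_s
    return 100.0 * bits_per_second / if_speed_bps


if __name__ == "__main__":
    # Made-up samples: 1,250,000 octets in 10 s on a 10 Mbit/s link.
    print(utilization_percent(0, 1_250_000, 10, 10_000_000))
```

The same function works for Rx and Tx; only the counter OID being sampled differs.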
Rx/Tx Errors

Rx: .1.3.6.1.2.1.2.2.1.14
Tx: .1.3.6.1.2.1.2.2.1.20

The number of inbound (Rx) or outbound (Tx) packets that contained errors preventing them from being delivered to the next-layer protocol.