Performance and Availability monitoring.
Putting all the infrastructure, application, and other monitoring into a Service Context for Service Monitoring.
Faster, more efficient, and better views for operators; a more centralized approach to managing the infrastructure; and improved application visibility features.
TrueSight Operations Manager is a combination of different components (applications) such as Presentation Server, Impact Manager, App Visibility Manager, and IT Data Analytics, but it provides seamless integration and a holistic view with Application and Infrastructure Health views.
It provides common administration and a single sign-on platform with RBAC, which eases cross-launching between multiple tools, removes the need to configure users separately for each component, and improves the monitoring views.
There are no broad areas for improvement; it varies from environment to environment. As such, there are no outstanding bugs or defects that are not already documented.
No. TSOM 10.7 is quite stable provided it is installed following the vendor's recommendations, which are drawn from customer experience and the complexity of the environment.
No issues with scalability. The customers where I have implemented this have ranged from small to very large, and I have never faced any deployment challenges in any of these cases.
Excellent support from the vendor. Support technicians and developers are all available to help if there is an issue. Support cases are tracked, and resolution is pushed to happen faster.
No, I have always worked with BMC Solutions for infrastructure and application monitoring.
Yes, the setup after the design of the solution was pretty straightforward. The vendor has a lot of free Webinars where they will explain the best practices to design a solution and the best ways to implement it. These guidelines can be used to build custom guidelines for the customer.
Implemented with an in-house team; we have also been interacting with the vendor team, which has excellent expertise in TrueSight Operations Manager.
I have not dealt with the pricing or licensing, so I cannot comment.
Not applicable.
It is quite an efficient tool. There are continuous improvements being performed to satisfy the customer needs, but like any other tool or automation, it has some issues.
TrueSight offers a global solution with the possibility of end-to-end integration.
We use it to scan and monitor our server environment. This allows us to monitor devices as they are spun up, verify that there are no unknown devices, and then check uptimes as well as patching as another way of keeping devices in compliance.
Allows reliable access to server hardware information, uptime status, current patching, and much more. This helps us keep an updated inventory, as we feed this into our inventory system along with information from Atrium CMDB.
The ability to pull hosts together to show what processes are running, so it can be used for change management.
More modules for less popular applications and better documentation. Documentation can be great at times, but lacking in other areas.
I have used the BMC product in two separate instances. One was as a monitor of monitors for an ops bridge, giving a single view of all monitoring tools reporting into one source; this worked extremely well.
The other instance was as a managed service looking after multiple different customers across South Africa.
I believe that the ease of use and UI is great. The ability to fulfill the role as a manager of managers is fantastic. We integrated a number of other monitoring tools into BMC.
I think the ease of deployment needs to be looked at. It would be great if the deployment was faster and easier.
We experienced no issues with stability on both BMC and HP.
The only issue we experienced with scalability was that the maximum growth needs to be catered for in the initial build, so planning needs to be done carefully.
Technical support from BMC was good, although we sometimes had to wait a little longer for a response, which complicated things with the client.
The companies I worked for were BMC shops from start to finish and made use of Remedy, BCO, Control-M, etc. The companies wanted best of breed.
The setups were not complex, but a large amount of pre-deployment work and planning went into the solutions.
The solutions are not the cheapest but are robust and stable. The license model is rather complex, and BMC do often change the model.
Other products were evaluated, such as HP and IBM, as well as various open-source solutions.
My advice would be: do not cut the planning time or the testing time (UAT, SIT, and FIT).
Also make sure that you have the correct infrastructure in place and cater for the intended growth.
This article is a review of BMC ProactiveNet Performance Manager (BPPM) version 8.6 and its key sub-components.
The main key sub-components include:
> ProactiveNet Analytics
> ProactiveNet Event Management (formerly Mastercell)
> ProactiveNet Performance Manager (i.e. PATROL)
Component | Version
BPPM Event Manager | 8.6
BPPM Analytics | 8.6
PATROL Central | 7.8.10
PATROL Central Operator – Web Edition | 7.8.10
PATROL Agent | 3.9.00.1i
PATROL for UNIX Servers | 9.10.00.02
BPPM Event Management (previously known as Mastercell or BEM) is the component that replaces PATROL Enterprise Manager or PEM (previously known as CommandPost).
BPPM introduces a programming language called MRL. MRL is not as flexible as PERL or REX which can both be used in PEM, but MRL does include many in-built features such as policies that make the design of rules slightly easier.
PEM used to perform event management using up to 5 transformers or scripts written in PERL. PEM was effectively a tool box whereby all the intelligence is provided by the PERL scripts which enrich the events using a number of lookup files.
Which product is better, PEM or BPPM? BPPM is arguably the better event management platform. Although MRL is frustrating to work with, the in-built capabilities mean that you don’t have to develop everything from scratch. BPPM is generally a good event management platform.
PATROL Configuration Manager (PCM) is one of the best threshold management tools in the industry. The threshold management capabilities on BPPM (aka ProactiveNet) are poor in comparison. BMC state that they will include PCM functionality on the next release of BPPM.
The limitations of Threshold management in BPPM are numerous:
On the plus side, the different types of thresholds in BPPM are very powerful. BPPM has Absolute, Intelligent, Signature and Predictive thresholds. These thresholds are statistically based and will generate events when a statistical anomaly is detected. The product will automatically calculate trends using linear regression and variations based upon hourly, daily or weekly patterns. However, the statistics will not eliminate threshold management as BMC have sometimes claimed. Many thresholds are Boolean in nature – either good or bad – and are therefore not appropriate for statistical analysis. Statistical analysis is only appropriate for about 20% to 30% of thresholds, and the analysis consumes a lot of CPU cycles.
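As a rough illustration of what an hourly baseline threshold does (this is not BMC's algorithm, just a minimal sketch with made-up CPU samples), the following Python snippet flags a value as anomalous when it falls outside a band of three standard deviations around the mean observed for that hour of day:

```python
from statistics import mean, stdev

def build_baseline(history):
    """history: list of (hour_of_day, value) samples collected over several days."""
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)
    # Baseline per hour: mean and standard deviation of the past samples.
    return {h: (mean(vals), stdev(vals)) for h, vals in by_hour.items() if len(vals) > 1}

def is_anomaly(baseline, hour, value, sigmas=3.0):
    """Flag a value that falls outside mean +/- sigmas * stdev for that hour."""
    if hour not in baseline:
        return False          # no history for this hour, so no judgement possible
    mu, sd = baseline[hour]
    return abs(value - mu) > sigmas * sd

# Example: CPU utilisation samples for hour 9 over five days, then a spike.
history = [(9, v) for v in (42, 45, 40, 44, 43)]
baseline = build_baseline(history)
print(is_anomaly(baseline, 9, 44))   # False - within the normal band
print(is_anomaly(baseline, 9, 95))   # True  - statistical anomaly
```

The sketch also shows why this approach suits continuously varying metrics such as CPU utilisation or response time, and adds nothing for Boolean checks such as process up or process down.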
BPPM is undeniably a complex product. Far too complex in my opinion. There are many other much simpler solutions such as HP SiteScope or CA Nimsoft which can be implemented much faster. In addition, the BMC Product Set has gradually got more and more complex over the years. The solution is really three products bundled together:
MasterCell is a great event management product. ProactiveNet has perhaps been oversold by BMC – and the value is overstated. The autonomous thresholds can only be applied to 20% to 30% of parameters anyway. PATROL was originally a great product – but has become bloated and complex after years of poor product management.
As an illustration of how complex the BPPM solution has become, consider the following table:
Component / Feature | Old Solution with PEM | New BPPM Solution (version 8.6)
Number of Servers | 3 (DEV, DR and PROD) | 11 (3 DEV, 3 TEST, 5 PROD)
Number of Connections to the Agents | 2 (PEM and RT Server) | 3 (BIIP3, BPPM Adaptor, RT Server)
Number of Adaptors | 1 (RT Server) | 3 (RT Server, BPPM Adaptor, BIIP3)
Dynamic Policy Files (for Rules) | 5 Rule Files | 12 Rule Files
Forms for Threshold Management | 1 (PCM) | 2 (TEST and PROD BPPM Servers)
The PATROL agent has always been very extensible. There is a rich API and many different ways to write an interface. PATROL Central has no API and therefore cannot be extended. Both BPPM and PEM are very extensible and can be extended through a variety of scripting languages such as PHP or PERL.
BMC has never provided a web form that allows staff in the Operations Bridge to black out servers or services for upcoming outages due to planned maintenance. The customer mentioned in this review had to write its own web GUI for blackout: an Apache and PHP solution that allows the shift operators to configure blackouts. It required 25 days of development to alter the blackout web form and migrate this functionality from PEM to BPPM.
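The blackout check itself is trivial; the effort goes into the operator-facing form around it. As a minimal sketch (in Python rather than the PHP actually used, and with a hypothetical record layout), suppressing an event during a blackout window amounts to a lookup like this:

```python
from datetime import datetime

# Hypothetical blackout records as an operator form might store them:
# one entry per host with the start and end of the planned maintenance window.
blackouts = [
    {"host": "dbhost01", "start": datetime(2012, 6, 28, 22, 0), "end": datetime(2012, 6, 29, 2, 0)},
]

def in_blackout(host, event_time, records=blackouts):
    """Return True if an event from `host` at `event_time` falls inside a blackout window."""
    return any(r["host"] == host and r["start"] <= event_time <= r["end"] for r in records)

# An event from dbhost01 at 23:30 on 28 June would be suppressed:
print(in_blackout("dbhost01", datetime(2012, 6, 28, 23, 30)))   # True
print(in_blackout("dbhost01", datetime(2012, 6, 29, 8, 0)))     # False
```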
For an environment of 500 Agents, BPPM requires from 0.5 to 1 FTE to keep the lights on - depending on the experience of the person. Typical daily tasks include the following:
The Agent commissioning process for configuring monitoring for a new server consists of the steps shown below:
Step Number | Step | Description
1 | Ping Host | Ping the host to verify that the hostname is correct.
2 | Install Agent | Install the Agent using the Solaris package.
3 | Update Event Rules | Edit the BPPM enrichment file abc_host.csv (a sketch of this step follows below).
4 | Apply to PROD Cell | Import abc_host.csv into the PROD cell.
5 | Apply to TEST Cell | Import abc_host.csv into the TEST cell.
6 | Update PING Test (primary) | Update the PING test configuration on the primary server to ensure the host is up.
7 | Update PING Test (secondary) | Update the PING test configuration on the secondary server to ensure the host is up.
8 | Configure UNIX KM | Use PCM to give the Agent the standard configuration for the UNIX KM.
9 | Update BIIP3 | Update the BIIP3 configuration so that the Agent can talk to the Event Management Cell.
10 | Agent Restart | Restart the Agent to ensure that the Agent configuration takes effect.
11 | Update PCO Web Console | Update the PCO Web Console so that the Agent appears in the PATROL console.
12 | Update Work Request | Update the work request to indicate the job is complete.
If additional monitoring is required for ORACLE, WEBLOGIC, or some other application, then additional configuration steps are required.
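Steps 3 to 5 of the commissioning table above (edit abc_host.csv, then import it into the PROD and TEST cells) lend themselves to a small helper script. The sketch below assumes a simple HostName, Location, HostType column layout, similar to the Host.csv enrichment file described later; the real abc_host.csv may differ.

```python
import csv
import os

def add_host(csv_path, hostname, location, host_type):
    """Append a new host entry to the enrichment CSV, skipping duplicates."""
    existing = set()
    if os.path.exists(csv_path):
        with open(csv_path, newline="") as f:
            existing = {row[0] for row in csv.reader(f) if row}
    if hostname in existing:
        return False                      # host already commissioned
    with open(csv_path, "a", newline="") as f:
        csv.writer(f).writerow([hostname, location, host_type])
    return True

# Example: commission a new production host, then import the updated file
# into the PROD and TEST cells (steps 4 and 5) using the normal import procedure.
add_host("abc_host.csv", "apphost042", "LondonDC", "PROD")
```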
There are two languages to learn with BPPM.
Administration of BPPM is overly complex. The product has evolved over the course of the last 20 years. As new components have been added via acquisition, the product has become increasingly complex and time-consuming to administer.
Any Solution Design for BPPM should consider the following key questions:
Question | Details
How does the design allow for rule tracing? | Using the trace log is not practical due to the volume of events. A good solution is to assign a unique ID to each rule and then configure each rule to add an entry to a new slot called "matching_rules" (see the sketch after this table).
How does the design specify rule execution order? | It is often difficult to design rules because of confusion about rule execution order. It is good practice to split all mrl files into mrl files for new rules and mrl files for refine rules, so you get new_mcxp.mrl and refine_mcxp.mrl. The files should then be grouped in the .load file by stage, so you have refine rules followed by new rules, and so on.
Does the DEV environment have the same number of cells as the TEST and OAT environments? | Don’t be tempted to have fewer cells in the DEV environment in order to limit the number of zones (servers) required. This is a mistake. Rule execution order is greatly affected by the propagation (or not) of slots between cells and the configuration of mcell.propagate.
Does the design specify the configuration of mcell.propagate? | The design should specify the configuration of all mcell config files, including mcell.propagate, mcell.dir, etc.
Is BIIP3 included in the design? | BIIP3 is essential in order to forward PATROL events to the cells for any events that are not event class 11 or 39. These events are explicitly generated by the PSL event_trigger() function. It is impossible for BPPM Analytics (ProactiveNet) to collect these events because they have no associated metric.
Threshold Management | If thresholds are being migrated from PCM to BPPM, how will the thresholds be migrated from one BPPM server to another? Has the export/import process been thoroughly tested (because it has serious issues)? I would advise migrating the thresholds to BPPM as a Phase II activity, or waiting for BPPM v9.
Export Thresholds from PCM | Does the design specify a tool for extracting all the thresholds from PCM into a spreadsheet? (I have a PERL tool to do this.)
Testing | Does the design provide for at least a month of end-to-end testing once the rules have been completed?
Monitoring the Monitoring | Does the design incorporate monitoring of the monitoring? Will an event be generated if the BIIP3 Adapter fails?
Event Storm | If the BIIP3 Adaptor loses connection to multiple agents every half an hour and then regains the connection 30 seconds later, this will create 200 new AGENT_DOWN events (mc_adapter_control). The de-dup rule will not work because the AGENT_UP event closes the AGENT_DOWN event. What rule is going to prevent this event storm?
Time-out Policies | Does the design specify timeout policies for all the main top-level event classes, such as the MC_CELL classes and EVENT? Does the cell start reasonably quickly with 2,000 events? What about 20,000 events?
DDE Enrichment | Does the design fully specify the enrichment files that will be used?
DDE Synchronization | Are the DDE config files pulled or pushed into the cells? How are the DDE cfg files synchronized between cells?
Blackout | Has a web site been included in the design for blackout by the Operations Bridge? BPPM does have a "Schedule downtime" facility, but this is entirely inappropriate for operators and does not account for BIIP3 events.
Blackout Dev | If a blackout GUI is a requirement, has a month of development been allocated (using something like Apache and PHP)?
BPPM Analytics | Does the design discuss the possibility of implementing BPPM Analytics as a second phase?
Reporting | Does the design include event reporting to drive continuous improvement? Key reports show total events grouped by the main enrichment fields (host, service, application, and so on).
Reporting Dev | If reporting is a requirement, does the design include time to implement the BMC reporting tool, or two weeks of development using PHP and mquery?
AIG | Does the design include Automatic Incident Generation (AIG)? Semi-automatic incident generation is an option, whereby an operator creates a ticket by right-clicking on an event. Is this option considered and discussed in the design?
Failover | Is failover considered? How is the configuration replicated? Replicated disk?
Training | Does the project plan include time for training the staff in the Operations Bridge? What about 2nd-level support?
Go-live | Is the go-live big bang or phased? Phased is preferred for risk mitigation but will require operators to run two consoles in parallel.
Audible Alarm | Is an audible alarm a requirement? If so, then this will require a few days of development to configure a web page that uses a sound file and "mquery -s COUNT".
BPPM Classes
BPPM has a number of event classes, as shown below, which all inherit from the CORE_EVENT class.
CORE_EVENT
Mastercell Rule Language (MRL)
Mastercell Rule Language (or MRL) is the language used to develop event management rules within BPPM. The administrator can develop 11 different types of rules, as shown in the table in the section "Rule Phases" below. The language is simple and relatively easy to learn in terms of both the syntax and the in-built functions. The most difficult concept to grasp is the execution order, as explained below. One of the most common problems with the rules is to misunderstand the execution order and find that the rules are not executing in the desired sequence. The other cause of frustration is the lack of common statements such as looping structures (do, while, for, until) which one takes for granted in other languages. It is possible to iterate over a list structure using the listwalk() function call, and the New rule phase also has limited capability to loop over events using the Updates clause. Fortunately, the need to loop is fairly rare, but at times the lack of standard statements can be a cause of frustration.
The biggest problem with MRL is the slow cycling speed when debugging code. Compared to PHP or PERL, it takes at least ten times as long to stop, compile and restart, so debugging cycles are ten times as long and productivity is similarly affected. True, it is not necessary to write pages and pages of code – but typically one will write about 8-15 pages of MRL for each project. Eight pages of PHP (tested and debugged) takes 1 to 2 days; eight pages of MRL (tested and debugged) takes 2-4 weeks. In addition, one should allow for an additional month of end-to-end testing before production go-live to test the rules with real events, to allow for all possible scenarios to play out and for all the bugs to emerge. These rules of thumb apply for companies of 5,000 to 10,000 employees. For larger organizations, you should allow more time.
Rules are executed in the order shown below.
Execution Order | Rule Phase | Description
1 | Refine | A Refine rule verifies the validity of incoming events and collects additional data for an event before it is sent through the remaining rule phases where further processing takes place.
2 | Filter | Filter rules limit the number of incoming events by discarding those events that need no additional processing or analysis. Filter rules compare incoming events to the event condition formulas (ECFs) contained in the rule to determine if an event is discarded or proceeds to further processing. An incoming event is processed through each Filter rule until a Filter rule discards the event, or all Filter rules are exhausted. An event must match all the Filter rules to be accepted.
3 | Regulate | Use Regulate rules to handle time-frequency accumulations of events or repetitive occurrences of events. An event is considered a repetition of another if the event has the same values for all the slots that are defined with the dup_detect=yes facet in the BAROC definition of its event class.
4 | New | Use New rules to execute an action when a new event is received, for example increasing the severity level for an event or updating an existing event with new event data. New rules determine if an event becomes permanent and is placed in the repository.
5 | Abstract | Abstract rules create high-level, or abstract, events based on low-level events. A new event starts at the New rules phase, skipping the Filter and Regulate rule phases. With Abstract rules, you can keep low-level events with cells in the lower level of the cell hierarchy, abstract the data from low-level events into high-level events, and propagate them to a higher-level cell. A high-level cell in the hierarchy can consolidate abstract events from several low-level cells and prevent a large number of abstracted technical events for which no consolidating rules apply.
6 | Correlate | Correlate rules build an effect-to-cause relationship between an event that occurs as a result of another event. Correlate rules execute whenever a cause or an effect event is received. The relationship between correlated events can be broken.
7 | Execute | The Execute rule performs a specified action when a slot value has changed in the repository. The specified action, which is either internal to the cell or running an external executable, is based on the characteristics of one or more events.
8 | Threshold | The Threshold rule counts the number of events that match the criteria you specify; if the number of these events exceeds the amount allowed within a time frame, the Threshold rule executes. An event is considered a repetition of another if the event has the same values for all the slots that are defined with the dup_detect=yes facet in the BAROC definition of its event class.
9 | Propagate | A cell uses Propagate rules to forward events or messages to one or more destination cells or gateways. For example, a Propagate rule can escalate an event from a lower-level cell to a higher-level cell in an environment.
10 | Timer | Use Timer rules to create timed triggers to call a rule. Timer rules are evaluated when a timer expires.
11 | Delete | The purpose of Delete rules is to perform actions before an event is discarded from the repository, such as a rule that suppresses data that has no meaning without an event instance. Delete rules are evaluated whenever an event is deleted from the repository or when events are deleted using the Delete flag in the mposter command.
PATROL Configuration Manager (PCM) is a configuration tool used for PATROL agents. The tool is mainly used for configuring Thresholds and is very effective at this task.
PCM is similar in concept to the Windows registry editor. The main form consists of two TreeView panes, as shown below. The left TreeView is used to configure hosts, which are arranged in groups such as ORACLE (shown below). The right-hand TreeView is used to manage the rules, which can also be arranged into groups. The RuleSets are linked to the hosts by dragging RuleSets from right to left: the RuleSets are dragged and dropped onto the leaves marked "LinkedRuleSets", and the user then invokes a command called "Apply RuleSets". The RuleSets are applied to each Agent in the same order as they appear in the hierarchy on the left. RuleSets linked to lower-level nodes take precedence and "override" higher-level group RuleSets.
The use of PCM typically follows a three step process. Administrators must perform the following:
The key weaknesses of this configuration process are the following:
The key benefit of PCM is that it can be used to manage a desired state for each Agent: if you apply the configuration once or a thousand times, the result is exactly the same. The hierarchy allows one to set global or default configuration using the higher nodes in the left TreeView and then to override the configuration with local (host-specific) configuration using the lower nodes. This hierarchy works extremely well.
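The desired-state behaviour is essentially an ordered merge, with host-specific RuleSets overriding group-level ones. The following Python sketch is not PCM itself, and the threshold variable names are invented, but it shows the precedence model and why applying the configuration once or a thousand times gives the same result:

```python
def effective_config(*rulesets):
    """Merge RuleSets in hierarchy order: later (more specific) entries override earlier ones."""
    merged = {}
    for ruleset in rulesets:           # e.g. global group first, host-specific last
        merged.update(ruleset)
    return merged

# Hypothetical threshold variables for an ORACLE group and one host-specific override.
global_oracle = {"/FSCapacity/alarm": 90, "/CPUUtil/alarm": 95}
dbhost01_only = {"/FSCapacity/alarm": 80}        # tighter threshold for one host

# Idempotent: applying the merge any number of times yields the same desired state.
print(effective_config(global_oracle, dbhost01_only))
# {'/FSCapacity/alarm': 80, '/CPUUtil/alarm': 95}
```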
Policies
The Policies feature within BPPM Event Management is generally a well-executed feature within the product and has sufficient flexibility to meet most customers' needs. The Dynamic Data Enrichment (DDE) policies allow the user to manage the rules externally using Comma-Separated Value (CSV) files.
The key thing to keep in mind is that the DDE policies match based on Best Fit and not First Match. So, for example, if you have an entry for the hostname pattern "fred*" (the star is a wildcard), an exact entry for frederick will match before fred*, even if fred* appears first in the CSV file. The rules are loaded into a hash memory structure within the product. The benefit of Best Fit is that the execution time for finding a match is predictable, irrespective of the number of lines in the CSV file (and there could be thousands). The disadvantage of Best Fit is that the matching can be out of sequence and counter-intuitive. Best practice in this case is to keep the CSV files simple, and each enrichment file should have only one purpose. For example, the customer used in this review originally started with 5 enrichment files in their old PATROL Enterprise Manager (PEM) environment. After implementing BPPM, the customer ended up with 11 DDE enrichment files. The total number of lines was lower, but the number of files was higher.
When migrating from PEM to BPPM, the enrichment files should be "Normalized" - by minimizing the number of lookup columns in order to reduce the probability of out-of-order rule matching.
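The difference between First Match and Best Fit is worth a concrete example. In the sketch below (plain Python with fnmatch wildcards, not the cell's actual hashing scheme), First Match returns whichever row appears first in the CSV, while Best Fit prefers the most specific pattern regardless of file order, which is why a host called frederick can hit its exact entry even when fred* is listed first:

```python
from fnmatch import fnmatch

# Rows as they might appear in a DDE enrichment CSV: (hostname pattern, location).
rows = [
    ("fred*",     "LondonDC"),
    ("frederick", "ParisDC"),
]

def first_match(host):
    """PEM-style matching: take the first row whose pattern matches."""
    return next((loc for pat, loc in rows if fnmatch(host, pat)), None)

def best_fit(host):
    """DDE-style matching: prefer the most specific (fewest wildcards, longest) pattern."""
    candidates = [(pat, loc) for pat, loc in rows if fnmatch(host, pat)]
    if not candidates:
        return None
    pat, loc = max(candidates, key=lambda c: (-c[0].count("*"), len(c[0])))
    return loc

print(first_match("frederick"))   # LondonDC - file order wins
print(best_fit("frederick"))      # ParisDC  - the exact entry wins, regardless of order
```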
Policy | Description
Closure | A Closure policy closes a specified event when a separate specified event is received.
Blackout Policy | A Blackout policy might be used during a maintenance window or holiday period.
Component Based Enrichment | Enriches the definition of an event associated with a component by assigning selected component slot definitions to the event slots.
Enrichment | Enriches the definition of an event associated with a component by assigning selected component slot definitions to the event slots.
Correlation | Correlation relates one or more cause events to an effect event, and can close the effect event. The cell maintains the association between these cause-and-effect events.
Escalation | Escalation raises or lowers the priority level of an event after a specified period of time. A specified number of event recurrences can also trigger escalation of an event. For example, if the abnormally high temperature of a storage device goes unchecked for 10 minutes or if a cell receives more than five high-temperature warning events in 25 minutes, an escalation event management policy might increase the priority level of the event to critical.
Notification | Notification sends a request to an external service to notify a user or group of users of the event. A notification event management policy might notify a system administrator by means of a pager about the imminent unavailability of a mission-critical piece of storage hardware.
Propagation | Propagation forwards events to other cells or to integrations with other products.
Recurrence | Recurrence combines duplicate events into one event that maintains a counter of the number of duplicates.
Remote | Remote action automatically calls a specified action rule provided the incoming event satisfies the remote execution policy’s event criteria.
Suppression | Suppression specifies which events the receiving cell should delete. Unlike a blackout event management policy, the suppression event management policy maintains no record of the deleted event.
Threshold | Threshold specifies a minimum number of duplicate events that must occur within a specific period of time before the cell accepts the event. For events allowed to pass through to the cell, the event severity can be escalated or de-escalated a relative number of levels or set to a specific level. If the event occurrence rate falls below a specified level, the cell can take action against the event, such as changing the event to closed or acknowledged status.
Timeout | Timeout changes an event status to closed after a specified period of time elapses.
Component Based Blackout | Specifies which events the receiving cell should classify as unimportant and therefore not process. The events are logged for reporting purposes. A Component Based Blackout event management policy might specify that the cell ignore events generated from a component or device based on the component selection criteria for this policy.
CSV File Name | Description | Lookup Columns | Data Columns
Host.csv | Assign Location and HostType (DEV, TEST or PROD) based on host name | HostName | Location, Physical Server, HostType
HostSuppress.csv | Filter out events based on hostname (e.g. when a new Agent is installed) | HostName | HostSuppress (YES, NO)
Application.csv | Assign an application name to each event | ApplicationClass, Parameter | Application
ObjectSuppress.csv | Filter out troublesome parameters based on event class | ApplicationClass, Parameter, EventClass | ObjectSuppress (YES, NO)
ApplicationSupress.csv | Filter out events based on application | Application | ApplicationSuppress (YES, NO)
HostBlackout.csv | Blackout hosts for planned outages based on timeframe | HostName, PhysicalServer, Location | TimeFrame
Service.csv | Assign a service name to all events | Host, Instance, HostType | Service, SupportGroup
ServiceSuppress.csv | Filter out events based on service | Service | ServiceSuppress (YES, NO)
ServiceBlackout.csv | Blackout services for planned outages during a particular time frame | Service | TimeFrame
ServiceDowngrade.csv | Downgrade severity for particular services | Service | SeverityCode (e.g. 12333)
TextMessage.csv | Change message text for certain parameters | ApplicationName, Parameter, EventClass | NewMessage
Note: Severitycode of 12333 downgrades MAJOR (4) and CRITICAL (5) to MINOR (3).
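My reading of the SeverityCode is that it is a positional mapping: digit N of the code gives the new severity for original severity N. A minimal sketch of that interpretation (illustrative only, not the cell's internal logic):

```python
# Severity numbering as used in the review: 3 = MINOR, 4 = MAJOR, 5 = CRITICAL.
def apply_severity_code(severity, code="12333"):
    """Digit N of the code (1-based) is the new severity for original severity N."""
    return int(code[severity - 1])

print(apply_severity_code(4))   # 3 - MAJOR downgraded to MINOR
print(apply_severity_code(5))   # 3 - CRITICAL downgraded to MINOR
print(apply_severity_code(3))   # 3 - MINOR unchanged
```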
Issues
PATROL Agent Restart
If the PATROL agent's configuration is changed, the agent usually requires a restart. Unfortunately, the PATROL Agent regenerates all active events (any parameter that exceeds a threshold) when it is restarted. This means that an agent must be blacked out whenever it is restarted.
PATROL Agent History Corruption
The Agent history file will always get corrupted if it exceeds 4 GB; there is a 4 GB file size limit on Solaris. The history file will frequently exceed this limit on busy servers running messaging services such as Tuxedo or MQ (simply because there is a lot to monitor). The history file may also get corrupted for other reasons. When the history gets corrupted, the Agent will generate an event for every attempt to store a parameter value. This problem can generate hundreds of events every few minutes from just one host. This number of events can easily overload a cell and a BIIP3 Adaptor (see BIIP3 Corruption below).
With 500 UNIX Agents, you should expect one agent to get corrupt history about every 2 weeks.
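A cheap mitigation is to watch the size of the history files and trim them before they reach the limit. The sketch below uses a hypothetical history directory; the actual path depends on the PATROL installation:

```python
import os

LIMIT = 4 * 1024 ** 3          # the 4 GB ceiling that triggers corruption on Solaris

def oversized_history_files(root="/opt/patrol/history"):   # hypothetical path
    """Yield (path, size) for any history file approaching the 4 GB limit."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > 0.9 * LIMIT:          # warn at 90% so the file can be trimmed in time
                yield path, size

for path, size in oversized_history_files():
    print(f"WARNING: {path} is {size / 1024 ** 3:.1f} GB - trim or reset before it corrupts")
```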
BIIP3 Corruption
If the BIIP3 cache file is corrupted, the BIIP3 can get stuck on one event and keep generating it. I have seen 4 million repeated events in a cell due to this problem.
BIIP3 cache file corruption may be caused by overload (see PATROL Agent History Corruption above).
I have seen this problem occur twice within 3 months.
The workaround is to clear the cache file and restart the BIIP3 Adaptor.
In certain situations, the BIIP3 Adaptor may lose connection with all the agents every half an hour, and the agents then regain the connection almost immediately. This causes a flapping AGENT_DOWN and AGENT_UP condition that is not de-duplicated, because the AGENT_UP clears the AGENT_DOWN event. This issue can generate thousands of events and thousands of new incidents (assuming Automatic Incident Generation is implemented).
The best workaround is to create a new rule for MC_ADAPTER_CONTROL (AGENT_DOWN) events and set them initially to severity INFO. If the agent is truly down, then the second agent-down event (which occurs 3 minutes later) should be configured in the rule to set the severity back to WARNING or ALARM.
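The intent of that rule is a simple debounce: the first AGENT_DOWN is parked at INFO, and only a repeat that arrives while the first is still unresolved is escalated. A rough Python model of the logic (the real fix is an MRL rule on MC_ADAPTER_CONTROL; this just shows the state machine):

```python
open_down = {}   # host -> time of an unresolved AGENT_DOWN, if any

def on_agent_down(host, now):
    """First AGENT_DOWN stays INFO; a second one while the first is still open escalates."""
    if host in open_down:
        return "ALARM"          # the agent did not come back - treat it as really down
    open_down[host] = now
    return "INFO"               # probably just the flapping adaptor; wait for confirmation

def on_agent_up(host):
    open_down.pop(host, None)   # the flap resolved itself; nothing to escalate

print(on_agent_down("apphost042", 0.0))     # INFO  - first sighting, parked
on_agent_up("apphost042")                   # flap: agent reconnected 30 seconds later
print(on_agent_down("apphost042", 1800.0))  # INFO  - a new, independent first sighting
print(on_agent_down("apphost042", 1980.0))  # ALARM - still down three minutes on
```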
The problem is also solved by restarting the BIIP3 Adapter. I therefore suggest that all customers schedule a restart of the BIIP3 adaptors once per day. No events are lost, because the BIIP3 adapter (and the PATROL Agent) caches all events.
I have seen this problem about once per month with a population of 500 agents.
The migration of both global and local thresholds from one BPPM Analytics instance to another must be performed by hand. There is an export/import mechanism for global thresholds, but as of July 2012 this mechanism is unreliable. There is no import/export mechanism for local (host-specific) thresholds.
BPPM Analytics does not support instance-specific thresholds. In other words, you cannot set a default threshold for FSCapacity across all file systems, then set an instance-specific threshold that applies only to the root file system, and then apply this instance-specific threshold to all hosts. The instance-specific threshold must be individually defined on every host. If there are 500 hosts, this becomes unfeasible. There is no script or API that can be used to automate this task.
With this release of BPPM, the PATROL Agents are connected to BPPM Analytics using the BPPM Adaptor. When you use the graphing facility to graph parameters in BPPM, some of the hosts do not appear, even though they are connected via the Adapter. At the time of this writing, this case is open with BMC and is unresolved.
PATROL Events that are triggered using the event_trigger() PSL function are not supported by BPPM Analytics (ProactiveNet). This forces all customers (who use PATROL agents) to implement both the BIIP3 Adapter (for event_trigger() events) and the BPPM Adapter for all standard PATROL metrics (that have an underlying parameter).
This means that the adapter layer with a BPPM implementation is quite complex. There are three Adapters attached to every agent on three separate ports. The Adapters are the RTServer, the BIIP3 Adapter, and the BPPM Adapter.
This complexity means that the implementation becomes fragile, complex to administer and fundamentally unreliable.
It is difficult to define catch-all rules using the standard BMC log monitoring KM. For example, it is possible to create a catch-all rule that triggers on the search string "ALARM". You then give this definition a custom origin, which might be something like "LOG.BANKING_app_log.alarm". You then create a custom event message that inserts the line from the log file into the text of the message; this can be done with the syntax "%1-". The problem occurs at the event management layer: all events that match this rule get rolled up into one event as duplicates, despite the fact that each event represents a different line from the log file and a different problem.
The workaround is to change the de-duplication rules at the event management layer. Be careful: if the rules are improperly defined, you can make the product vulnerable to an event storm, which may only manifest itself a month or two later.
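One way to see both the problem and the workaround is to look at what goes into the duplicate-detection key. The sketch below is a simplified model, not the cell's actual dup_detect machinery: keying only on the custom origin collapses every ALARM line from the log into one event, while adding the message text keeps genuinely different problems separate:

```python
from collections import Counter

events = [
    {"origin": "LOG.BANKING_app_log.alarm", "msg": "ALARM: payment queue depth 9000"},
    {"origin": "LOG.BANKING_app_log.alarm", "msg": "ALARM: settlement feed timed out"},
    {"origin": "LOG.BANKING_app_log.alarm", "msg": "ALARM: payment queue depth 9000"},
]

def dedup(events, key_slots):
    """Count events per duplicate-detection key built from the chosen slots."""
    return Counter(tuple(e[s] for s in key_slots) for e in events)

print(len(dedup(events, ["origin"])))          # 1 - different problems rolled into one event
print(len(dedup(events, ["origin", "msg"])))   # 2 - distinct log lines stay distinct
```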
Monitoring of the monitoring is insufficient.
Typical Project
The review was conducted after an upgrade project in which every component within an old PATROL environment was upgraded. The project was driven by the customer's internal audit organization, which reviewed the company's products and determined that PATROL Enterprise Manager (PEM) was no longer supported and that the whole environment should therefore be upgraded.
The project consisted of a number of separate projects which could have been undertaken individually. The customer chose to perform all three projects simultaneously, which increased the risk, complexity and length of the overall project.
Phase | Description
Phase 1 | Solution Design
Phase 2 | Upgrade of the PATROL Agents and Knowledge Modules
Phase 3 | Replacement of PEM with BPPM Event Manager
Phase 4 | Introduction of BPPM Analytics
The Solution Design phase was conducted in late 2011 and the implementation was started immediately after the New Year in 2012. Phase 3 of the solution was finally put into production on Thursday 28th June 2012.
Phase 4 of the project has not yet been completed. Phase 4 was removed from the project scope when the customer fell behind on delivery. Currently, there are no plans to complete this phase of the project.
The customer contracted several months of consultancy from BMC Software. BMC performed the initial solution Design and much of the initial configuration of the event management rules.
The resources assigned to the project, consisted of the following:
Resource | Time Allocation
BMC Consultant | ~3 months
Customer SME | 7 months full time
Independent Consultant | 4 months
Customer UNIX Engineers (2 engineers) | 4 months
Customer Infrastructure Architect | 1 month
Customer Project Manager | 2 months
Customer Delivery Manager | 2 months
Management Involvement (Project Sponsor + Resource Manager) | 1 month
Total | 24 months
Lessons Learned
The project overran initial estimates, in terms of both time and cost. The following issues were encountered:
Issue | Description
Solution Design | The event management rules had to be completely redesigned, which delayed the project by about a month. The customer's old rules used First Match, whereas BPPM only supports Best Fit. The complexity of the customer's rules was not properly analysed or understood during the design phase.
Documentation | The design of the event management rules was not properly documented. When it became evident that the design had to be changed, the lack of documentation slowed understanding and meant that some thinking had to be repeated and the design documented properly.
Thresholds | The customer spent over a month trying to migrate their thresholds from PATROL to BPPM. This task was complex due to the different format of the thresholds. The customer also experienced many issues with the migration tools, which did not work properly. Managing thresholds in BPPM is not as easy as managing thresholds in PATROL (using PATROL Configuration Manager). In the end, the customer abandoned the attempt to introduce BPPM Analytics. The autonomous alerts only covered 20% of the thresholds anyway, so the benefit of BPPM Analytics was not compelling.
Testing | The customer underestimated the time required for comprehensive testing. Testing should have been planned earlier, started earlier and resourced appropriately. At least a full month of end-to-end testing was required.
Technical Lead | Technical leadership was lacking through some parts of the project. Initially, the BMC Consultant was the technical lead; towards the end, an independent consultant was the technical lead. There were issues of continuity.
Project Phases | The project consisted of four project phases. Phase 2 and Phase 4 were optional and were not required in order for the customer to meet its audit deadline. In the end, Phase 4 was abandoned.
Summary and Conclusion
BMC ProactiveNet Performance Manager (BPPM) is really three products bundled into one suite, so it still makes sense to rate each component individually.
Product | Summary | Score (1-5)
BMC BPPM v8.6 Analytics (formerly ProactiveNet) | The product appears to have reasonably good quality control. The graphing is good. The threshold management features are poor, but BMC says this is being fixed in the next release. I am not convinced by the whole concept of using statistics. Statistical analysis uses a lot of CPU, which makes scalability an issue. Only about 30% of monitored metrics are appropriate for statistical analysis. BMC's claim that this product removes the need for threshold management is an exaggeration, and 70% of thresholds will still need to be managed using absolute-value (i.e. standard) thresholds. | 3
BMC BPPM v8.6 Event Mgmt (formerly Mastercell) | This product is one of the strongest event management products around. There are challenges with using the MRL rule language, but generally this product works well. I question BMC's bundling of this product with ProactiveNet and would like to see the product available as a stand-alone component. Developing and debugging rules is time-consuming and difficult. Only time will tell if this product continues to be a good event management platform. | 3
BMC PATROL 7.8.10 | Twenty years ago, PATROL was the best monitoring solution of its type. Since then the product has become bloated and overly complex. PCM was a great addition and makes the management of thresholds relatively easy and repeatable. The product has not changed much in about 8 years. Four years ago, BMC were going to retire the product; today PATROL is an integral part of BMC's BPPM strategy. The KMs and the breadth of monitoring save this product from a lower rating. | 3
Component/Capability | Previous Version (with PEM) | Latest Version (BPPM v8.6)
Event Management | 3 | 4
Threshold Management | 5 | 2
Analytics / Graphs | 3 | 5
Ease of Implementation | 3 | 2
Extensibility / Interfaces | 4 | 4
Operator Form for Blackout | 1 | 1
Average Score | 3.2 | 3.0
Components | PATROL and associated KMs, PATROL Central Operator, PATROL Enterprise Manager (PEM) | PATROL and associated KMs, PATROL Central Operator, BPPM Event Management, BPPM Analytics (ProactiveNet)
The score for BPPM has not improved with this revision. The product is more complex and more difficult to implement, and thresholds are more difficult to administer. The improvement in capability associated with anomaly detection is not convincing, was not proven for this customer, and is only relevant for 30% of parameters. BMC must work hard to improve administration and ease of implementation.
The combination of BPPM Analytics (ProactiveNet), BPPM Event Management (Mastercell) and PATROL has the potential to be a market beating product. However, the investment required is significant. Time will tell if BMC delivers on this vision.
It helps to minimize downtime of applications by enabling proactive monitoring.
Deployment requires lots of resources (servers). It has too many consoles. Pricing is very high.
No stability issues. We can make it stable by allocating enough resources.
No issues with scalability. We can increase resources vertically, according to growth in infrastructure.
Support quality is good.
Have not used any other product.
It's a bit complex.
Its use depends on the speed of installation.
It's good if you understand the systems and issues well.
The pricing could be better.
It is a stable solution.
It is a scalable solution. We have about five administrators using BMC TrueSight Operations Management.
The technical support is satisfactory.
The initial setup was straightforward. It took us about eight months to implement.
I rate this solution a seven out of ten, and I would recommend it to others. In addition, the solution is deployed on-premises.
I like the deep-dive detail and end-user metrics data. The synthetic monitor is the best one. The best point of the new one is that there's no need for configuration: you can inject the JavaScript and start to track major developments in the application. This is a good approach, and we received all the data using this.
I would like them to improve the deep-dive details, tracing, and data agents in this product. We have EUEM, an end-user experience monitoring appliance. It is quicker than the current one, and the reporting and filtering sides of the current one are very weak.
There are many details we can look at and explain from the information we receive in the current one, but we cannot get historical data like we can with EUEM. We do not have a powerful way to look for specific traffic from a specific application and a specific browser; we don't have that in the new one. The current BMC also needs to add version control.
I have been using BMC TrueSight Operations Management for six years.
BMC support is very good, and they always find solutions. They can give you a release or a patch if somebody needs assistance, which is a good thing about BMC support. You can get the full journey if you have the full solution.
BMC TrueSight Operations Management has a module covering all the applications and search, the website hardware utilization, and the traffic and storage. You can get all the details, and it is even better if you have different websites. You have the full journey with no need to go outside the tool itself.
The initial setup is straightforward and easy to implement quickly. You can receive all the data, and you can have a lot of dashboards covering all the details, like how much traffic, the OS, the federal ID, client ID, and location. With all of this in one place, you can create tables and dashboards by server ID and client ID.
I would advise potential users to get TrueSight Infrastructure Monitoring and the synthetic monitor when implementing BMC; then they will have a very powerful solution. The main point is that it is a manager of managers, and that is worth highlighting: BMC can integrate with all monitoring solutions, so they will have only one screen showing everything from multiple monitoring applications.
On a scale from one to ten, I would give BMC TrueSight Operations Management an eight.
Benefits
Key Capabilities
I believe BMC is going to lead in the Enterprise customer base. As mentioned above, BMC has the ability to simply integrate everything and view it in the Manager of Managers. This is critical in environments where there are multiple existing and legacy toolsets. A single view is key to empowering the resources that need only critical information displayed, so that a fast and effective response is guaranteed.