Where is your WebSphere Commerce black box?
While analyzing a site outage after the fact, we sometimes find that diagnostic data was not collected, or that information in the logs was lost. To avoid the dreaded "Reproduce the problem", every site should have a documented (and automated) procedure to ensure that key data is collected and preserved.
The daily site operations should be documented in the runbook. This includes alerts and the procedures to follow when site availability is impacted, such as enabling redundancies and/or the site maintenance page, gathering diagnostics, and restoring operations. The key to being able to troubleshoot and react to an outage is to ensure that site recovery includes planned steps to gather diagnostics and, a very common oversight, to preserve the logs for the period. This is what I call "The black box".
What's in the black box?
Store all the data that may be required for troubleshooting in your site's "black box", including the timeline, diagnostic and system logs, monitoring reports, and system dumps. The black box should be as self-contained as possible. If it is useful, it's there. This makes the analysis easier when multiple teams are providing their perspective.
The ship's log
Imagine spending an hour analyzing a drop in CPU only to realize that was the time when the maintenance page was put on. It's always harder to analyze data without context.
As the site is restored, logs will be written again. Many of these logs rotate automatically, so after a few hours the entries from the event are overwritten. Complete an inventory of the logs on each tier and verify the logging configuration. Ensure the logs are large enough to give you time to save them before the period of the event is rolled over.
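As a minimal sketch of the "save them before they roll over" step, the Python function below copies every file in a log directory that was last written at or after the start of the event. The function name, its parameters, and the use of file modification time as the selection rule are assumptions made for illustration, not part of any WebSphere Commerce tooling.

```python
import shutil
from pathlib import Path


def preserve_logs(log_dir, box_dir, event_start_epoch):
    """Copy files from log_dir that were last written at or after the event start.

    A file modified after the event began may contain entries from the event,
    and that includes rotated copies. Returns the names of the copied files.
    """
    source = Path(log_dir)
    target = Path(box_dir)
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    for entry in sorted(source.iterdir()):
        if entry.is_file() and entry.stat().st_mtime >= event_start_epoch:
            # copy2 keeps the modification time, which helps when rebuilding the timeline
            shutil.copy2(entry, target / entry.name)
            copied.append(entry.name)
    return copied
```

Running something like this from the recovery procedure, once per tier, means the logs are already safe by the time anyone asks for them.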
Standard system logs are useful but often not enough to determine root cause. Some diagnostic data can only be collected while the site is experiencing issues. Of course, it is impossible to anticipate every problem, but there is data that often helps narrow down the scenario, such as database snapshots (AWR in Oracle), and the king of Java troubleshooting: javacores.
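Javacores are most useful as a series taken a short time apart, so that threads can be compared between dumps. On the IBM JVM, sending SIGQUIT (the equivalent of kill -3) to the Java process normally triggers a javacore without stopping the JVM. The sketch below assumes a Unix host and that default behavior; the function name, the count of three, and the 30 second interval are illustrative choices, and where the javacores are written depends on your JVM configuration.

```python
import os
import signal
import time


def request_javacores(pid, count=3, interval_seconds=30, send=os.kill, sleep=time.sleep):
    """Send SIGQUIT to the JVM process `count` times, pausing between requests.

    `send` and `sleep` can be replaced for dry runs. Returns the number of
    signals sent.
    """
    sent = 0
    for attempt in range(count):
        # On an IBM JVM with default settings, SIGQUIT requests a javacore
        send(pid, signal.SIGQUIT)
        sent += 1
        if attempt < count - 1:
            sleep(interval_seconds)
    return sent
```

The javacores still need to be copied into the black box afterwards, together with the logs for the same period.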
It's also important to preserve monitoring data for the site and infrastructure. Monitoring data for the site includes traffic and response time analysis. Infrastructure data includes CPU, IO and network statistics, integration points, etc.
If there is a crash or OutOfMemory condition, core files can be generated. These files tend to be large, and administrators often clear them to free up space. Cores can be very valuable for troubleshooting. Be sure to save them before deleting them from the local file system.
Configuration files do not change often, but for consistency and to help analysis, it makes sense to store the key ones in the black box. Examples include:
Loading the box
The "box" is a directory structure on a shared network drive or FTP location.
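One possible layout, sketched in Python, is a directory per incident with a section for each kind of data mentioned earlier. The section names, the per-host subdirectories, and the incident identifier format are assumptions for illustration; use whatever convention your runbook defines.

```python
from pathlib import Path

# Hypothetical section names, based on the kinds of data described above
SHARED_SECTIONS = ("timeline", "monitoring")
PER_HOST_SECTIONS = ("logs", "dumps", "config")


def create_black_box(root, incident_id, hosts):
    """Create the directory tree for one incident and return its top-level path."""
    box = Path(root) / incident_id
    for section in SHARED_SECTIONS:
        (box / section).mkdir(parents=True, exist_ok=True)
    for section in PER_HOST_SECTIONS:
        for host in hosts:
            # One subdirectory per host keeps files with the same name apart
            (box / section / host).mkdir(parents=True, exist_ok=True)
    return box
```

Creating the tree up front, for example with create_black_box("/mnt/blackbox", "2024-01-15-outage", ["app1", "app2"]), gives every team the same place to drop their data.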
Dealing with an outage shouldn't be ad hoc. Without a plan in place, the rush to recover the site often leaves no time to collect diagnostics or preserve the logs. Review your site's runbook to ensure this is covered.