While helping our clients debug and solve their issues, there are cases where IBM Support will ask a client to enable some traces and collect some logs. It happens quite frequently that we have to go back and ask for the same information several times before we can even begin to help. That becomes frustrating both for the client, who is running into the issue and spending time gathering logs, and for the IBM Support rep, who needs to re-ask for information that they know they already asked for.
The following is a list of items that we typically run into, which may prevent the necessary diagnostic information from getting into the logs, where it would allow you (or Support) to look deeper into a problem:
- Traces and logs are not enabled
- Log files have been overwritten
- The log file is empty or does not contain any new entries
- The problem or error is not re-created
The above list is not exhaustive of everything that can prevent an issue from being captured in your log files, but these items are very easy to double-check and prepare for, so you don't end up having to reproduce your issue several times just to get a single log or trace of the error.
Here are 5 tips on how to avoid the above pitfalls:
1) Validate the traces are enabled before you reproduce the issue
When you enable traces in the WebSphere Administrative console, check your trace.log file to ensure that the traces start to write to the file. At the point where the traces were enabled or the trace specification was changed, you should see an entry in the logs that indicates the changes were applied:
[3/15/13 11:20:36:324 EDT] 00000042 ManagerAdmin I TRAS0018I: The trace state has changed. The new trace state is *=info:com.ibm.websphere.commerce.WC_SERVER=all
Once you enable the traces, be sure to validate the trace.log file for this type of entry to confirm the traces were enabled, before you continue to invest the time in reproducing an issue.
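This check can be scripted so it is never forgotten. Below is a minimal sketch; the log path is an assumption (it varies with your profile and server names), and the sample entry is written locally so the snippet is self-contained:

```shell
# Assumed location -- in a real setup this would be something like
# /opt/IBM/WebSphere/AppServer/profiles/AppSrv01/logs/server1/trace.log
LOG=./trace.log

# Sample of the entry the server writes when the trace state changes:
cat > "$LOG" <<'EOF'
[3/15/13 11:20:36:324 EDT] 00000042 ManagerAdmin I TRAS0018I: The trace state has changed. The new trace state is *=info:com.ibm.websphere.commerce.WC_SERVER=all
EOF

# Confirm the trace-state change was picked up before reproducing the issue.
if grep -q "TRAS0018I" "$LOG"; then
    echo "Trace specification applied -- safe to start the test."
else
    echo "TRAS0018I not found -- re-check the trace settings in the console."
fi
```

If the TRAS0018I entry never appears, stop and re-apply the trace specification before investing any time in the reproduction.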
2) Ensure you have enough space in your logs to capture your issue
It's quite common that if you are tracing a scenario, especially on a production server, it will create a lot of log entries. The log files are configured to only keep a certain size of data, and a certain number of historical files, to facilitate log rotation. Depending on the level of traces you enable, these limits can be exhausted very quickly, which may mean that the default settings for the file size and number of historical log files will be too small to capture your issue.
You will want to review your policy for log file sizes and the number of historical log files you are keeping (this is configured in the WebSphere Administrative console) and ensure that you have enough log files. What is enough? That's not always easy to know, as it will depend on how much traffic you have on the server while you reproduce the issue, as well as the level of tracing that you have enabled. What you can do is enable the trace, and watch how much time it takes for a single log file to be rolled into a historical log file. From there, you can calculate the necessary size of the log and number of historical log files. I would recommend that each log file not be too large, as you will eventually need to open and read it, and a 1 GB file may be too daunting to open in your favourite text editor.
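The calculation above can be sketched in a few lines of shell. The numbers here are purely illustrative; substitute the rollover rate you actually observed and the duration of your own test scenario:

```shell
# Illustrative numbers -- measure your own rollover rate as described above.
FILE_SIZE_MB=20        # maximum size of each log file
ROLLOVER_MINUTES=2     # observed time for one file to roll over under trace
TEST_MINUTES=10        # how long the reproduction scenario takes

# Files needed to hold the whole test window (rounded up),
# plus roughly 50% headroom for unexpected traffic.
NEEDED=$(( (TEST_MINUTES + ROLLOVER_MINUTES - 1) / ROLLOVER_MINUTES ))
WITH_HEADROOM=$(( NEEDED + NEEDED / 2 ))

echo "Keep at least $WITH_HEADROOM historical files of ${FILE_SIZE_MB}MB each."
```

With these sample numbers, a 10-minute test at one 20 MB file every 2 minutes calls for at least 7 historical files once headroom is included.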
3) Don't delete the trace file before you begin your testcase
Another common issue is that the log files do not get any new entries at all. Hopefully you detect this right away, while performing tip #1 above. The Application Server keeps an open file handle to the current log file. A common practice is to delete the log files before performing a test, thinking that this will make it easier to find the error once it's reproduced. The pitfall with this approach is that once you delete the file, the Application Server keeps writing through its open handle to the now-deleted file, so you see no logging at all, and no new file will be created until you restart the server.
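This pitfall is easy to demonstrate outside WebSphere, with a plain shell file descriptor standing in for the server's open handle:

```shell
# Simulate the server's open log handle with shell file descriptor 3.
echo "first entry" > app.log
exec 3>> app.log        # keep an open handle, as the running server does

rm app.log              # "cleaning up" the log while it is still open

echo "second entry" >&3 # the write succeeds -- but goes to the deleted file
exec 3>&-               # close the handle

# No app.log is recreated; the second entry is lost for good.
if [ -f app.log ]; then
    echo "app.log exists"
else
    echo "app.log is gone -- logging has effectively stopped"
fi
```

The same thing happens to the server's trace.log: the writes keep going to the deleted file, where nobody can read them.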
If you want to move the files to make a 'clean' log file before you reproduce a test case, do so while the server is stopped to ensure you do not break the trace file. Another suggestion is, rather than deleting or moving the log files before running a testcase, simply note the time you start your test and end your test. The log files have very clear timestamps, so you can extract exactly the portion of the log that corresponds to your testcase.
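With the timestamp format shown earlier, extracting the test window can be a simple text filter. This is a sketch: the sample lines and the start/end times are placeholders, and the plain string comparison only works because all the lines share the same date and time zone:

```shell
# Sample lines in the WebSphere [M/D/YY HH:MM:SS:mmm TZ] timestamp format.
cat > trace.log <<'EOF'
[3/15/13 11:19:02:100 EDT] 00000042 Servlet      I before the test started
[3/15/13 11:20:36:324 EDT] 00000042 ManagerAdmin I during the test
[3/15/13 11:25:10:001 EDT] 00000042 Servlet      E error reproduced here
[3/15/13 11:31:55:900 EDT] 00000042 Servlet      I after the test ended
EOF

# Split each line on the [ ] brackets so $2 is the timestamp, then keep
# everything between the noted start and end of the test.
awk -F'[][]' '$2 >= "3/15/13 11:20:00" && $2 <= "3/15/13 11:30:00"' trace.log
```

Run against the sample above, this keeps only the two lines from the test window, including the reproduced error.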
Note: Don't get too aggressive with trying to send only the error to Support, typically we like to see the transactions that occur before the error, so sending un-edited logs typically works best. Just let us know what time the test was running, and we'll easily find it.
4) Find the error before sending the logs to IBM Support for analysis
Validate you actually got your data before sending it to a colleague or IBM to help review it. This is a simple thing to do, but it can waste a lot of time if you end up uploading a file, and waiting for somebody else to confirm you did not capture the problem or error. Once you reproduce your problem, open the log, and do a simple search to validate the error re-appears in the log file. When you send in your data, indicate the time and error that you had reproduced to cut down the time to analyze the issue for Support.
It's not uncommon to get an error on your screen as an end user while the error never appears in the logs. There are several reasons this can happen: in a clustered environment, the error may occur on a different server than the one you are tracing, or, if you have some level of page or object caching enabled, the error response may be served directly from the cache and never actually run on the server.
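A quick way to confirm the error actually landed in the files you collected, including every cluster member's logs, is a recursive search. The directory layout and error text below are placeholders standing in for your collected logs and the message you saw on screen:

```shell
# Stand-in layout for logs collected from two cluster members; in a real
# environment this would be each profile's logs directory.
mkdir -p logs/member1 logs/member2
echo "[3/15/13 11:24:00:000 EDT] 00000042 Servlet I normal traffic"       > logs/member1/trace.log
echo "[3/15/13 11:25:10:001 EDT] 00000042 Servlet E simulated error text" > logs/member2/trace.log

# ERROR_TEXT is a placeholder for whatever error or message code you saw
# on screen. If this search prints nothing, you did not capture the issue
# and should not upload the logs yet.
ERROR_TEXT="simulated error text"
grep -r "$ERROR_TEXT" logs/
```

A side benefit of searching this way: the matching file name tells you which cluster member actually served the failing request, which is worth noting when you send the logs in.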
5) The logs are for you too, not just Support
Finally, once you've confirmed that the logs have the error, send them to Support, but also feel free to begin looking at them yourself. Lots of problems are not difficult to solve once you put a critical eye to the error and the data in the logs. You might be surprised how easy problems can be to solve once you have the right data in front of you.
Spending a few extra minutes to validate that the logs captured the issue may save you days of analyzing a log that doesn't contain the issue, plus the time it takes to re-collect the logs over and over again.