IBM Support

EngMonApp fails to read or delete job run event files (with the file extension .dsv)

Troubleshooting


Problem

The EngMonApp process is part of the IBM InfoSphere DataStage and QualityStage Operations Console. Running on the engine tier system, it reads job run event files and writes the data into the operations database. After EngMonApp has processed an event file, it deletes the file because it is no longer required. A failure to read or delete one of these event files can cause problems such as inconsistent data in the operations database, or can even prevent EngMonApp from running at all.

Cause

The most common reason that an event file cannot be read or deleted is that the underlying operating system file permissions do not allow it.

Environment

Engine tier system running operations console

Diagnosing The Problem

The EngMonApp process may show as "Not Running" in the Operations Console (or "STOPPED" if using DSAppWatcher.sh -status). Only severe problems, such as failing to delete an event file, cause EngMonApp to stop. Other problems, such as failures to read an event file, only cause a warning to be logged in the log file. In the most recent EngMonApp log file (InformationServerRoot/Server/DSODB/logs/EngMonApp-***.log) one or more of the following types of warnings and errors may be seen:

2011-10-31 09:16:11,244  WARN com.ibm.datastage.runtime.engmonapp.FileManager.watchForDirectories(FileManager.java:407) - Unable to delete the events directory: 2011102714

or

2011-10-31 12:17:20,508 ERROR com.ibm.datastage.runtime.engmonapp.FileManager.processDirectory(FileManager.java:617) - Unable to delete the event run file: /opt/IBM/InformationServer87/Server/DSODB/events/2011103112/1714_000_1802_00001_1.dsv

or

2011-11-01 15:32:02,201 ERROR com.ibm.datastage.runtime.engmonapp.FileManager.processFile(FileManager.java:673) - Unable to open the file: '/opt/IBM/InformationServer87/Server/DSODB/events/2011110115/1842_000_31458_00005_6.dsv
java.io.FileNotFoundException: /opt/IBM/InformationServer87/Server/DSODB/events/2011110115/1842_000_31458_00005_6.dsv (Permission denied)

or

2011-11-01 15:19:52,099  WARN com.ibm.datastage.runtime.engmonapp.FileManager.processDirectory(FileManager.java:473) - The events subdirectory that previously existed no longer exists: /opt/IBM/InformationServer87/Server/DSODB/events/2011110115
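
As a rough aid for spotting these messages, the sketch below filters a log file for the four message texts shown above. The function name scan_engmon_log is illustrative only and is not part of the product, and the patterns are taken from the sample messages, so they may not match every product version.

```shell
# Sketch: print the file-access warnings and errors found in one log file.
# scan_engmon_log is a hypothetical helper name, not a product command.
scan_engmon_log() {
    # $1 = path to an EngMonApp log file
    grep -E 'Unable to delete the event|Unable to open the file|events subdirectory that previously existed no longer exists' "$1"
}
```

For example, scan_engmon_log "$(ls -t InformationServerRoot/Server/DSODB/logs/EngMonApp-*.log | head -n 1)" would scan the most recent log, with InformationServerRoot replaced by the actual installation directory.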

The event files (and subdirectories when necessary) are created by the operating system user id that is running the job, and reflect the default ownership and umask of that user id. The permissions for the event subdirectories and files mentioned in the log should be checked to confirm if operating system file permissions are the cause of the problem. For example:

dsadm@mk-arronh:/> ls -ld /opt/IBM/InformationServer87/Server/DSODB/events/2011103112
drwxrwxr-x 2 user1 ETLproject1 515 2011-10-31 12:17 /opt/IBM/InformationServer87/Server/DSODB/events/2011103112

The above example shows that the event files in that subdirectory can only be deleted by "user1" itself or by any user id in the "ETLproject1" group.

Once the permissions for the event subdirectories and files are known, the group membership for the user id running the EngMonApp process needs to be examined. The EngMonApp process is typically run by the DataStage Administrator user id (dsadm by default) and the "id" command can be used to verify the group membership for the dsadm user id:

dsadm@mk-arronh:/> id dsadm
uid=1006(dsadm) gid=1005(dstage) groups=16(dialout),33(video),1005(dstage)

From the above we can see that the dsadm user id is not in the group "ETLproject1", so when the EngMonApp process is run by the dsadm user id it does not have sufficient operating system file permissions to delete the event file.
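
The same comparison can be scripted. The following is a minimal sketch, assuming the group list comes from "id -Gn"; the helper name group_in_list is hypothetical.

```shell
# Sketch: succeed if group $2 appears in the space-separated list $1
# (for example, the output of `id -Gn dsadm`).
# group_in_list is a hypothetical helper name.
group_in_list() {
    for g in $1; do
        [ "$g" = "$2" ] && return 0
    done
    return 1
}
```

For example, group_in_list "$(id -Gn dsadm)" ETLproject1 || echo "dsadm is not in ETLproject1" reports the situation described above.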

It is also possible for the umask for "user1" to be set such that group write permission is not granted, so even if the group memberships were set correctly files could not be deleted. In such cases the event subdirectory permissions would look like:

dsadm@mk-arronh:/> ls -ld /opt/IBM/InformationServer87/Server/DSODB/events/2011103112
drwxr-xr-x 1 user1 ETLproject1 515 2011-10-31 12:17 /opt/IBM/InformationServer87/Server/DSODB/events/2011103112
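
To check many event subdirectories at once rather than one at a time with "ls -ld", a sketch along the following lines lists every directory whose group write bit is not set. The function name is hypothetical, and the events directory path must be supplied by the caller.

```shell
# Sketch: print directories under $1 that lack group write permission.
# list_dirs_without_group_write is a hypothetical helper name.
list_dirs_without_group_write() {
    # $1 = top-level events directory
    # -perm -020 matches when the group write bit is set; ! negates it
    find "$1" -type d ! -perm -020
}
```

For example, list_dirs_without_group_write InformationServerRoot/Server/DSODB/events, with InformationServerRoot replaced by the actual installation directory.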

Resolving The Problem

To resolve these kinds of problems the umask and group memberships need to be modified to meet the following conditions:

i) The effective umask for any user id that can run jobs is such that owner and group permission includes read, write, and execute access (that is, the umask must not remove any owner or group permission; for example, a umask of 007 gives u=rwx,g=rwx,o=). Typically this involves setting the umask for all DataStage users by adding the umask command in the dsenv file (InformationServerRoot/Server/DSEngine/dsenv). Alternatively, the umask can be modified by changing or adding the relevant umask entry in either every user's profile or the global profile for all users.

ii) The user id that runs the EngMonApp process (the DataStage Administrator - dsadm by default) must have supplemental group membership of every primary group of each user id that may run DataStage jobs. (Note that the DataStage Administrator's primary group must be left unchanged.) Typically this involves using the "usermod" command to add supplemental group membership to the dsadm user id. In the example above, the dsadm user id needs to have the "ETLproject1" group added to its list of supplemental groups. To elaborate, if the following user ids may run DataStage jobs:

user id    primary group    supplemental groups
user1      ETLproject1      dstage, staff, users, etc
user2      ETLproject2      dstage, staff, users, etc
user3      ETLproject3      dstage, staff, users, etc

Then the user id that runs the EngMonApp process (shown as dsadm below) should have supplemental group membership as shown below (note that the primary group should be left unchanged):

user id    primary group    supplemental groups
dsadm      dstage           ETLproject1, ETLproject2, ETLproject3, staff, users, etc
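
The two conditions above can be sketched in shell as follows. Both function names are hypothetical. The first checks whether a umask value leaves all owner and group bits unmasked. The second only prints a usermod command, using the Linux "-a -G" form that appends supplementary groups; other platforms may need different options or tools, so treat the printed command as an assumption to verify against the local usermod documentation.

```shell
# Sketch: succeed if umask value $1 (octal, for example 007 or 0022)
# masks none of the owner or group permission bits.
umask_allows_group_rwx() {
    # The leading 0 makes the shell read the value as octal
    [ $(( 0$1 & 0770 )) -eq 0 ]
}

# Sketch: print (do not run) a Linux usermod command that appends groups
# to a user's supplementary groups without changing the primary group.
# $1 = user id, remaining arguments = group names
print_usermod_cmd() {
    user=$1
    shift
    list=$(printf '%s,' "$@")
    printf 'usermod -a -G %s %s\n' "${list%,}" "$user"
}
```

For example, umask_allows_group_rwx "$(umask)" tests the current shell's umask, and print_usermod_cmd dsadm ETLproject1 ETLproject2 ETLproject3 prints the command for the example users above so that it can be reviewed before being run as root.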

Even with the above changes made, it may be necessary to change the permissions on any previously created event files and subdirectories to grant the group section read, write, and execute permission. Typically this involves a recursive chmod action across the top level events directory and all its subdirectories (InformationServerRoot/Server/DSODB/events).
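
A minimal sketch of that recursive change is shown below. The function name is hypothetical, and the command must be run by root or by the owner of each file and directory, since other users cannot change permissions.

```shell
# Sketch: grant the group read, write, and execute permission on the
# events directory and everything beneath it.
# grant_group_access is a hypothetical helper name.
grant_group_access() {
    # $1 = top-level events directory
    chmod -R g+rwx "$1"
}
```

For example, grant_group_access InformationServerRoot/Server/DSODB/events, with InformationServerRoot replaced by the actual installation directory.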

If the supplemental groups of the DataStage Administrator user id had to be changed, it is usually necessary to completely log out of that user's session and then log back in again in order for the change in supplemental groups to be recognised by newly created processes. In addition to this, the DSAppWatcher.sh process will then also need to be stopped entirely (by running "./DSAppWatcher.sh -stop") and restarted. Note that it is not sufficient to attempt to start EngMonApp by itself (by running "./DSAppWatcher.sh -start EngMonApp"), as the new EngMonApp process will only inherit the supplemental groups of the existing DSAppWatcher.sh process, which will not include any changes made after DSAppWatcher.sh was originally started.
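
The full restart can be sketched as below, to be run from a fresh login session of the DataStage Administrator user id. This assumes that "-start" with no application name starts DSAppWatcher.sh together with the processes it manages; the function name and its directory argument are hypothetical, so check the options against the product documentation for the installed version.

```shell
# Sketch: stop DSAppWatcher.sh entirely, start it again, then show status.
# restart_appwatcher is a hypothetical helper name.
restart_appwatcher() {
    # $1 = directory that contains DSAppWatcher.sh
    (
        cd "$1" || exit 1
        ./DSAppWatcher.sh -stop      # stop the watcher itself, not just EngMonApp
        ./DSAppWatcher.sh -start     # assumed to start the watcher and its processes
        ./DSAppWatcher.sh -status    # confirm EngMonApp is no longer STOPPED
    )
}
```

Running the commands in a subshell keeps the caller's working directory unchanged.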

Note that once this problem has been resolved and EngMonApp is restarted, a SQL error relating to a unique constraint violation may be seen in the log file. This occurs because the data from an event file that previously could not be deleted is re-read, and EngMonApp attempts to insert it into the database again when it already exists there. This error can be safely ignored and EngMonApp will automatically continue.


Document Information

Modified date:
29 December 2018

UID

swg21570258