Bassi Problem Tracking
The following are known problems being tracked by NERSC
on Bassi. Problems that must be resolved by IBM are
reported and receive an IBM "PMR" tracking number.
The problems listed here are those that have a direct,
noticeable effect on NERSC users; therefore, this is not an
exhaustive list of all system PMRs.
Known Unresolved Problems
Resolved Problems
|
Parallel Job Launch Failures
|
| IBM PMR #: 00925,49R,000 |
| Created |
| June 1, 2006 |
| Status |
|
Fixed with November 15, 2006 upgrade to AIX 5.3 TL5 SP3.
|
| Description |
Intermittent job launch failures are occurring with symptoms similar
to:
ERROR: 0031-024 b0206.nersc.gov: no response; rc = -1
LoadL_starter: The program, /etc/pmdv4, terminated with a signal 11.
LoadL_starter: 2512-902 Unable to set process limits for user.
LoadL_starter: 2539-752 The getpcred system call failed for user ragerber. errno=2
[A file or directory in the path name does not exist.]
The Schedd daemon forced this job into reject state due to an internal
communication error, attempting to communicate with startd on
b1101.nersc.gov.
|
| Cause |
|
Password, group, and security files are corrupt when they
are initially indexed, causing a cascade of job-launch failures.
The files are being indexed because of poor application performance
under AIX 5.3
when large, unindexed files are in place (IBM PMR 76065,49R,000).
The files should not even be necessary under AIX 5.3, but
PAM/LDAP is unable to get group information from LDAP
(IBM PMR 89551,49R,000), so the files are manually created by
a script pulling from LDAP every two hours.
Once PMR 89551 is resolved and a fix is in place, this problem
is expected to be resolved as well.
|
| Resolution |
|
The pulls from LDAP were suspended on the compute nodes as a
temporary work-around.
Fixed with November 15, 2006 upgrade to AIX 5.3 TL5 SP3.
|
|
Degraded parallel performance when reading STDIN
|
| IBM PMR #: 86826,49R,000 |
| Created |
| April 21, 2006 |
| Status |
|
A temporary fix (efix) was applied on
August 2, 2006. A permanent fix is expected to be
incorporated into Parallel Environment 4.2.2.5.
|
| Description |
|
Parallel programs that read from Standard Input (STDIN)
are experiencing very poor performance.
|
| Cause |
|
A change was made in IBM's Parallel Environment (PE) version
4.2 to prevent a possible hang when an application required
redirected input for one or more tasks, the data could not all
be read completely, and a checkpoint was requested.
Previously, such data was left undelivered on the STDIN pipes,
and POE would wait in a read for data which would never get
delivered to the task(s), causing a hang condition.
The fix changed the logic in POE so that it checks for data on
all tasks' STDIN pipes, using a select() system call, and once
data is read, delivers it to the appropriate task.
As a result, POE continually checks for undelivered input, and
the repeated interrupts caused by the select() system calls
degrade application performance. As a workaround, setting
MP_STDINMODE=n, where "n" is the specific task number that will
read data via redirected STDIN, makes POE listen on just that
task's STDIN pipe, greatly reducing the number of select() calls
and associated interrupts. However, MP_STDINMODE helps only when
a single task expects input data from STDIN; when multiple
tasks must read from STDIN it will not help (and can
prevent another task from reading data).
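The workaround described above can be sketched as follows. This is a minimal illustration only: the task number 0, the program name my_app, and the input file input.dat are assumptions made for the example, not values taken from the PMR.

```shell
# Tell POE that only one task reads redirected STDIN.
# Task 0 is an assumption for illustration; use the task that actually reads input.
export MP_STDINMODE=0

# Hypothetical program and input file names.
if command -v poe >/dev/null 2>&1; then
    poe ./my_app < input.dat
else
    # Not on a system with POE installed; just report the setting.
    echo "poe not found; MP_STDINMODE=$MP_STDINMODE is set for the next POE launch"
fi
```

As noted above, this only helps when exactly one task reads from STDIN.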
|
| Resolution |
|
A temporary fix is in place.
|
|
MPI-2 one-sided functions fail in 64-bit (Resolved)
|
| IBM PMR #: 76523,49R,000 |
| Status |
|
This was determined to be a user coding error.
|
| Description |
|
A user reports that MPI-2 one-sided functions give
wrong answers in 64-bit compiles. In 32-bit the answers
are correct.
|
| Cause |
|
Unknown.
|
| Resolution |
|
A pointer address was declared to be of type
INTEGER(KIND=MPI_INTEGER), which fails
in 64-bit. The correct specification is
INTEGER(KIND=MPI_ADDRESS_KIND).
|
|
LAPI RDMA example code fails
|
| IBM PMR #: 76619,49R,000 |
| Created |
| March 7, 2006 |
| Status |
|
Fixed with system upgrade to PE 4.2.2.4 and LoadLeveler 3.3.2.4 on
August 2, 2006.
|
| Description |
The LAPI example from /opt/rsct/lapi/samples/xfer/Hw_xfer.c fails
with:
(LAPI_Util(handle, (lapi_util_t *) &util_pvo)) returns error: 498
(LAPI_Util(handle, (lapi_util_t *) &util_pvo)) returns error: 498
|
| Cause |
|
The RDMA LAPI functionality contained in the sample code
is not supported under LoadLeveler 3.3.0.0, which is
currently on Bassi.
|
| Resolution |
|
Update to LoadLeveler 3.3.2.4 expected to be performed on
August 2, 2006.
After installation on a test system, we found that this
capability must be enabled by the user by setting the
environment variable MP_RDMA_COUNT to a value greater than
zero and using
#@ network.LAPI = sn_all,not_shared,us,,,rcxtblocks=2
in LoadLeveler scripts.
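A LoadLeveler script that enables this capability might look roughly like the sketch below. Only the network.LAPI line and the MP_RDMA_COUNT variable come from the resolution above; the job_type and queue keywords, the value 2, and the program name lapi_app are assumptions added to make the fragment complete.

```shell
# Illustrative LoadLeveler keywords; a real job will need more (nodes, tasks, limits).
#@ job_type = parallel
#@ network.LAPI = sn_all,not_shared,us,,,rcxtblocks=2
#@ queue

# Any value greater than zero enables RDMA; 2 is an assumption for illustration.
export MP_RDMA_COUNT=2

# Hypothetical program name.
if command -v poe >/dev/null 2>&1; then
    poe ./lapi_app
else
    # Not on a system with POE installed; just report the setting.
    echo "poe not found; MP_RDMA_COUNT=$MP_RDMA_COUNT is set for the next POE launch"
fi
```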
|
|