sp-announcements mailing list archive
Potential problems with Seaborg batch jobs

From: David Turner (dpturner_at_lbl_dot_gov)
Date: 04/05/2005

  • Next message: Harsh Anand: "HDF 4.2r1 installed"
    Greetings Seaborg User,
    
    NERSC has identified and fixed a configuration problem on Seaborg that
    may have affected batch jobs submitted between 14:00 March 22 and
    15:45 April 4.  Jobs submitted during this interval may experience
    either of the following problems:
    
    1) Job failure due to insufficient memory for MPI operations.
        Two possible error messages are:
    
        ERROR: 0032-171 Communication subsystem error: Memory is exhausted. in MPI_Isend, task 0
    
        ERROR: 0032-113 Out of memory in MPI_Allreduce, task 51
    
        Whether or not a particular program experiences this type of
        failure depends on the nature of its MPI operations; not all
        MPI codes will encounter this failure.
    
    2) Incorrect reads of large files via stdin (standard input).
        Input files over 1024 bytes in size will not be read correctly.
        Depending on the program's logic, this could result in code
        failure or, more seriously, incorrect results.
    
    Situation 2) requires immediate user attention.  If any batch job
    submitted during the interval in question ran to completion and used
    stdin to read a file larger than 1024 bytes, you should examine its
    results very closely; they may not be correct.  If you have any
    pending batch jobs (status I, NQ, HS, or HU) that were submitted
    during this interval and that will use stdin to read a file larger
    than 1024 bytes, you should cancel those jobs and resubmit them.
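
    As a rough aid for applying the two criteria above (submission
    time inside the affected interval, and a stdin input file larger
    than 1024 bytes), the following Python sketch may help.  It is
    only an illustration, not a NERSC tool: the function names are
    hypothetical, the year 2005 is assumed from the date of this
    message, and you must supply each job's submission time and
    input file path yourself.

```python
import os
from datetime import datetime

# Interval quoted in the announcement; the year 2005 is an assumption
# taken from the date of the message.
WINDOW_START = datetime(2005, 3, 22, 14, 0)
WINDOW_END = datetime(2005, 4, 4, 15, 45)

# Input files over this size were not read correctly via stdin.
STDIN_LIMIT_BYTES = 1024


def submitted_in_window(submitted):
    """Return True if the submission time falls in the affected interval."""
    return WINDOW_START <= submitted <= WINDOW_END


def stdin_file_too_large(size_bytes):
    """Return True if a stdin input file of this size is over the limit."""
    return size_bytes > STDIN_LIMIT_BYTES


def job_needs_review(submitted, stdin_path):
    """Hypothetical helper: True if both criteria apply to a job."""
    size = os.path.getsize(stdin_path)
    return submitted_in_window(submitted) and stdin_file_too_large(size)
```

    For example, job_needs_review(datetime(2005, 3, 30, 9, 0),
    "input.dat") is True only if input.dat (a placeholder name) is
    larger than 1024 bytes.  A True result means the job's output
    deserves a close look, or that a still-pending job should be
    cancelled and resubmitted.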
    
    We apologize for the inconvenience this problem causes for our users.
    NERSC staff are actively working with IBM to prevent this problem in
    the future.
    
    -- 
    Best regards,
    
    David Turner
    User Services Group        email: dpturner_at_lbl_dot_gov
    NERSC Division             phone: (510) 486-4027
    Lawrence Berkeley Lab        fax: (510) 486-4316
    
