One-time crash (so far) with TCO639-jane (pmixp_coll_ring.c:742: collective timeout)
Added by Jan Streffing over 2 years ago
On Friday I created the required initial and remapping files for the highest-resolution FESOM2 ocean mesh (jane, 33 million surface nodes) coupled to a medium-high-resolution OpenIFS (TCO639L137). During MPI init I got:
4225: slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_reset_if_to: l40547 [37]: pmixp_coll_ring.c:742: 0x15553c0078a0: collective timeout seq=0
4225: slurmstepd: error: mpi/pmix_v3: pmixp_coll_log: l40547 [37]: pmixp_coll.c:281: Dumping collective state
4225: slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:760: 0x15553c0078a0: COLL_FENCE_RING state seq=0
4225: slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:762: my peerid: 37:l40547
4225: slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:769: neighbor id: next 0:l10357, prev 36:l40546
4225: slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:779: Context ptr=0x15553c007918, #0, in-use=0
4225: slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:779: Context ptr=0x15553c007950, #1, in-use=0
4225: slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:779: Context ptr=0x15553c007988, #2, in-use=1
4225: slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:790: seq=0 contribs: loc=1/prev=22/fwd=23
4225: slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:792: neighbor contribs [38]:
4225: slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:825: done contrib: l[30377,30600,30604-30605,30611-30612,30618-30620,40406-40409,40513,40539-40546]
4225: slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:827: wait contrib: l[10357,10362,10365,10374-10376,10394-10395,30354,30356-30357,30360,30370-30372]
4225: slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:829: status=PMIXP_COLL_RING_PROGRESS
4225: slurmstepd: error: mpi/pmix_v3: pmixp_coll_ring_log: l40547 [37]: pmixp_coll_ring.c:833: buf (offset/size): 261015/532347
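In case the timeout ever reappears, a quick way to see which nodes were still outstanding is to expand the compressed hostlists from the "done contrib" / "wait contrib" lines above and compare them between crashes. Rough sketch (my own helper, not from the log; it only handles the simple prefix[n,n-m,...] form seen in this dump):

```python
# Rough sketch: expand the compressed SLURM-style hostlists from the
# "done contrib" / "wait contrib" lines of the pmixp dump above, so the
# outstanding nodes can be compared if the timeout ever reappears.
import re

def expand_hostlist(hostlist: str) -> list[str]:
    """'l[10357,10374-10376]' -> ['l10357', 'l10374', 'l10375', 'l10376']."""
    m = re.fullmatch(r"(\w+)\[([\d,\-]+)\]", hostlist)
    if not m:
        return [hostlist]  # plain hostname, nothing to expand
    prefix, ranges = m.groups()
    hosts = []
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        hosts.extend(f"{prefix}{n}" for n in range(int(lo), int(hi or lo) + 1))
    return hosts

wait = expand_hostlist(
    "l[10357,10362,10365,10374-10376,10394-10395,30354,30356-30357,30360,30370-30372]"
)
print(f"{len(wait)} nodes had not contributed yet: {wait}")
```

If I read the dump right, the numbers are at least self-consistent: 22 previous contributions plus the local one give the 23 forwarded, with 15 of the 38 neighbours still waiting.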
See: /work/ab0246/a270092/runtime/awicm3-v3.1/jane_maybe_some_steps_1/log/jane_maybe_some_steps_1_awicm3_compute_20000101-20000101_1092005.log
This is rather peculiar, as the previous experiment, which was used to create the OASIS grids.nc, areas.nc and masks.nc files for the offline remapping tool, did not crash during MPI init, and nothing has changed in the setup since then. The only difference between the runs is that I link in the rmp_* files that I made offline.
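For reference, the linking step is nothing more than putting the precomputed weight files into the work directory before the run, roughly like this (placeholder paths, not the actual ones from this setup):

```python
# Rough illustration (placeholder paths): symlink precomputed OASIS rmp_* weight
# files into the run's work directory so OASIS does not regenerate them online.
from pathlib import Path

rmp_source = Path("/path/to/offline/remapping/jane_TCO639")  # placeholder
work_dir = Path("/path/to/experiment/work")                  # placeholder

for rmp_file in sorted(rmp_source.glob("rmp_*.nc")):
    target = work_dir / rmp_file.name
    if not target.exists():
        target.symlink_to(rmp_file)
        print(f"linked {rmp_file.name}")
```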
Indeed, I ran the model again this morning and no such crash reappeared. So far I have only gotten this error once; in case it appears again, we have it documented already. The error message gets a handful of hits on Google. In particular, https://lists.schedmd.com/pipermail/slurm-users/2019-July/003785.html documents sporadic errors with heterogeneous parallel jobs (which this one is, albeit using tasksets instead).
Cheers, Jan