Original poster | Posted on 2009-8-11 21:09:59
What causes this parallel run to crash?
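Recently, while testing a 3D solar wind model with PARAMESH 4.1, I keep running into errors like the ones below, and I cannot figure out what is going on.
The same executable, xxx.out, run in parallel with 32 processes, gets through only in a few test runs, where it computes for up to 50 hours;
most runs only last 3-5 hours and then stop with an error.
For example, the run below reached time t=2.061914601701584
and then failed. The .out file shows: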
-----------------------------------------------------------------
...
dt= 4.0940862030663244E-003 t= 2.053726520706556
iteration 500 no of blocks = 734
dt= 4.0940556833781927E-003 t= 2.057820576389934
iteration 501 no of blocks = 734
dt= 4.0940253116498702E-003 t= 2.061914601701584
p0_24440: p4_error: net_recv read: probable EOF on socket: 1
Killed by signal 2.
Killed by signal 2.
...
---------------------------------------------------------------------------------
At other times, the following errors appear right after the job is submitted:
------------------------------------------------------------------------------------------------
p7_17982: p4_error: net_recv read: probable EOF on socket: 1
rm_l_7_18146: (288.308594) net_send: could not write to fd=5, errno = 32
p11_668: p4_error: net_recv read: probable EOF on socket: 1
rm_l_11_833: (282.488281) net_send: could not write to fd=5, errno = 32
p8_32619: p4_error: net_recv read: probable EOF on socket: 1
rm_l_8_315: (286.957031) net_send: could not write to fd=5, errno = 32
p19_2437: p4_error: net_recv read: probable EOF on socket: 1
rm_l_19_2601: (270.531250) net_send: could not write to fd=5, errno = 32
p2_8801: p4_error: net_recv read: probable EOF on socket: 1
rm_l_2_8965: (298.335938) net_send: could not write to fd=5, errno = 32
p22_2944: p4_error: net_recv read: probable EOF on socket: 1
rm_l_22_3113: (270.101562) net_send: could not write to fd=5, errno = 32
p14_1173: p4_error: net_recv read: probable EOF on socket: 1
p13_1005: p4_error: net_recv read: probable EOF on socket: 1
rm_l_3_9134: (299.269531) net_send: could not write to fd=6, errno = 9
rm_l_14_1337: (282.023438) net_send: could not write to fd=5, errno = 32
rm_l_13_1169: (283.507812) net_send: could not write to fd=5, errno = 32
p4_error: latest msg from perror: Bad file descriptor
rm_l_3_9134: p4_error: net_send write: -1
p26_27418: p4_error: net_recv read: probable EOF on socket: 1
p20_2605: p4_error: net_recv read: probable EOF on socket: 1
rm_l_26_27582: (266.183594) net_send: could not write to fd=5, errno = 32
p21_2773: p4_error: net_recv read: probable EOF on socket: 1
p25_3475: p4_error: net_recv read: probable EOF on socket: 1
rm_l_20_2769: (275.097656) net_send: could not write to fd=5, errno = 32
rm_l_21_2940: (273.609375) net_send: could not write to fd=5, errno = 32
rm_l_25_3652: (267.730469) net_send: could not write to fd=5, errno = 32
p28_28538: p4_error: net_recv read: probable EOF on socket: 1
rm_l_28_28702: (265.031250) net_send: could not write to fd=5, errno = 32
p17_20272: p4_error: interrupt SIGx: 13
p31_29042: p4_error: net_recv read: probable EOF on socket: 1
rm_l_31_29216: (263.597656) net_send: could not write to fd=5, errno = 32
p18_2269: p4_error: net_recv read: probable EOF on socket: 1
p24_3294: p4_error: net_recv read: probable EOF on socket: 1
rm_l_18_2433: (284.066406) net_send: could not write to fd=5, errno = 32
rm_l_24_3471: (275.207031) net_send: could not write to fd=5, errno = 32
p5_9308: p4_error: interrupt SIGx: 13
p1_8632: p4_error: interrupt SIGx: 13
p4_9139: p4_error: interrupt SIGx: 13
p0_8626: p4_error: net_recv read: probable EOF on socket: 1
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
Killed by signal 2.
p0_8626: (375.433594) net_send: could not write to fd=4, errno = 32
------------------------------------------------------------------------------------
Searching online, I found that errno=9 is "Bad file descriptor" and errno=32 is "Broken pipe".
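The numeric codes in the output can also be looked up locally instead of searching online. The snippet below is a minimal sketch using only the Python standard library; the helper names `describe_errno` and `describe_signal` are made up for illustration, and the mapping is platform-specific, so it should be run on the same kind of system as the cluster nodes. On Linux, signal 13 (the "interrupt SIGx: 13" lines) is SIGPIPE and signal 2 (the "Killed by signal 2." lines) is SIGINT.

```python
import errno
import os
import signal


def describe_errno(code):
    """Return the symbolic name and system message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


def describe_signal(number):
    """Return the symbolic name of a signal number on this platform."""
    return signal.Signals(number).name


if __name__ == "__main__":
    # errno values seen in the net_send lines of the output
    for code in (9, 32):
        print(f"errno {code} -> {describe_errno(code)}")
    # signal numbers seen in "Killed by signal 2." and "interrupt SIGx: 13"
    for number in (2, 13):
        print(f"signal {number} -> {describe_signal(number)}")
```

Both a broken pipe and SIGPIPE mean that a process tried to write to a connection whose other end had already gone away, so these messages are most likely follow-on failures and not the original fault.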
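But these error messages are too generic for me to pin down a specific cause.
To begin with, I cannot yet tell whether the error comes from the program itself or from the parallel environment.
I would appreciate any advice or opinions; I am sure they will help. Thanks!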
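One possible first step is to tally which ranks reported which message, since a rank that printed nothing may be the one that died first and took the others down with it. The sketch below assumes that the prefixes `p<rank>_<pid>` and `rm_l_<rank>_<pid>` in the output encode the process rank; that is inferred from the log above and is not confirmed. The function names are hypothetical, and a silent rank is only a hint, not proof.

```python
import re
from collections import defaultdict

# Matches prefixes such as "p7_17982:" and "rm_l_7_18146:".
# Assumption: the first number is the rank and the second is the PID.
LINE_RE = re.compile(r"^(?:rm_l_|p)(\d+)_\d+:\s*(.*)$")


def tally_by_rank(log_text):
    """Map each rank to the list of messages it reported."""
    messages = defaultdict(list)
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            messages[int(match.group(1))].append(match.group(2))
    return dict(messages)


def silent_ranks(log_text, nprocs):
    """Return the ranks in 0..nprocs-1 that reported no message at all."""
    reported = tally_by_rank(log_text)
    return [rank for rank in range(nprocs) if rank not in reported]


if __name__ == "__main__":
    # A few lines copied from the output above; a real run would read the whole file.
    sample = (
        "p7_17982: p4_error: net_recv read: probable EOF on socket: 1\n"
        "rm_l_7_18146: (288.308594) net_send: could not write to fd=5, errno = 32\n"
        "p4_error: latest msg from perror: Bad file descriptor\n"
        "Killed by signal 2.\n"
    )
    for rank, msgs in sorted(tally_by_rank(sample).items()):
        print(rank, msgs)
    print("silent ranks:", silent_ranks(sample, 8))
```

If the silent ranks always map to the same node, the parallel environment on that node would be worth checking; if they change from run to run, the program itself becomes the more likely suspect.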
[ Last edited by shzhang on 2009-8-11 22:03 ]