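If you write your program in Fortran, add implicit none, especially when the code base is large. It lets the compiler catch many problems, such as misspelled variable names, at compile time.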
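As a minimal sketch of what implicit none buys you (written in free-form Fortran; the program and variable names are made up for illustration):

```fortran
program implicit_none_demo
  implicit none                 ! every variable must be declared explicitly
  integer :: nsteps

  nsteps = 0
  ! Without "implicit none", a typo such as "nstpes = nsteps + 1" would
  ! silently create a new INTEGER variable named nstpes and leave nsteps
  ! unchanged. With "implicit none", the same typo is a compile-time error.
  nsteps = nsteps + 1
  print *, 'nsteps =', nsteps
end program implicit_none_demo
```

Changing the second assignment to the misspelled form and recompiling should make the compiler reject the file instead of producing a program that quietly computes the wrong value.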
1、
[root@c0108 parallel]# mpiexec -n 5 ./simple
aborting job:
Fatal error in MPI_Irecv: Invalid rank, error stack:
MPI_Irecv(143): MPI_Irecv(buf=0x25dab60, count=0, MPI_DOUBLE_PRECISION, src=5, tag=99, MPI_COMM_WORLD, request=0x7fffa02ca86c) failed
MPI_Irecv(95): Invalid rank has value 5 but must be nonnegative and less than 5
rank 4 in job 5 c0108_52041 caused collective abort of all ranks
exit status of rank 4: return code 13
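The message says that rank 5 is invalid: mpiexec -n 5 ./simple starts five processes, with ranks 0 1 2 3 4. Here the log shows rank 4 posting a receive with src=5. So the problem must be in the code itself, though not necessarily in the rank value as such; it could also be, for example, an argument that was not passed correctly. MPI tends to produce plenty of puzzling errors...
My code contains only a few MPI_Irecv calls, so I debugged it by adding print statements to find the line that triggers the error, shown below: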
!     debug marker: print the source rank used by the next call
      print *, myid+1, '111111111111111111'
      call MPI_Irecv(P(1,1,location),IMAX*JMAX*MIN(ITSP, ke-myke),
     &MPI_DOUBLE_PRECISION,MYID+1,RELY,MPI_COMM_WORLD,REQ,IERR)
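The call above receives from MYID+1, which does not exist on the last rank. One common way to avoid this is to replace an out-of-range neighbour with MPI_PROC_NULL, which turns the communication into a no-op. The sketch below is not the original code: the buffer names, the size n and the tag value are assumptions for illustration, and it assumes your MPI installation provides the mpi module (with older builds, include 'mpif.h' plays the same role).

```fortran
program guard_neighbor
  use mpi
  implicit none
  integer, parameter :: n = 4, tag = 99        ! hypothetical size and tag
  integer :: myid, numprocs, left, right, ierr
  integer :: req(2), statuses(MPI_STATUS_SIZE, 2)
  double precision :: sendbuf(n), recvbuf(n)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, numprocs, ierr)

  ! Valid ranks are 0 .. numprocs-1; anything else becomes MPI_PROC_NULL.
  right = myid + 1
  if (right > numprocs - 1) right = MPI_PROC_NULL
  left = myid - 1
  if (left < 0) left = MPI_PROC_NULL

  sendbuf = dble(myid)
  recvbuf = -1.0d0      ! stays -1 on the last rank, which has no right neighbour

  ! Receive from the right neighbour, send to the left neighbour.
  call MPI_Irecv(recvbuf, n, MPI_DOUBLE_PRECISION, right, tag, &
                 MPI_COMM_WORLD, req(1), ierr)
  call MPI_Isend(sendbuf, n, MPI_DOUBLE_PRECISION, left, tag, &
                 MPI_COMM_WORLD, req(2), ierr)
  call MPI_Waitall(2, req, statuses, ierr)

  print *, 'rank', myid, 'received', recvbuf(1)
  call MPI_Finalize(ierr)
end program guard_neighbor
```

An explicit test such as if (myid < numprocs-1) around the MPI_Irecv would work as well; MPI_PROC_NULL just keeps the send and receive code identical on every rank.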
2、
[root@c0109 test]# mpiexec -n 5 ./simple
rank 3 in job 22 c0109_51164 caused collective abort of all ranks
exit status of rank 3: killed by signal 11
[root@c0109 test]#
There are many possible causes. Signal 11, officially known as a "segmentation fault", means that the program accessed a memory location that was not assigned to it. That is usually a bug in the program.
If the message is killed by signal 9 instead, you can try the following two things:
1) Resubmit the calculation and see whether it fails at the same point again.
2) Try the latest version of MPICH2 (1.0.8 at the time of writing). It is hard to say what the problem might be; it could even be a bug in the application.
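For the signal 11 case, an out-of-bounds array index is a frequent culprit in Fortran codes. The sketch below reproduces that kind of bug on purpose; the program is made up for illustration and is not taken from the code discussed here. Assuming gfortran is the underlying compiler, building with -g -fcheck=bounds makes the run stop with a message naming the array and the bad index, instead of corrupting memory or dying with a bare segmentation fault.

```fortran
program bounds_demo
  implicit none
  integer, parameter :: n = 10
  double precision :: a(n)
  integer :: i

  a = 0.0d0
  ! Bug on purpose: the loop runs to n+1, one element past the end of a.
  do i = 1, n + 1
    a(i) = dble(i)
  end do
  print *, 'sum =', sum(a)
end program bounds_demo
```

Without bounds checking, an error like this may or may not crash, and when it does, the crash can appear far from the faulty line, which is why rebuilding with runtime checks is often quicker than adding print statements.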
3、
[root@c0108 test]# mpirun -np 4 ./simple
aborting job:
Fatal error in MPI_Wait: Invalid MPI_Request, error stack:
MPI_Wait(139): MPI_Wait(request=0x7fff1f675228, status0x7fff1f675218) failed
MPI_Wait(75): Invalid MPI_Request
rank 2 in job 24 c0108_52041 caused collective abort of all ranks
exit status of rank 2: return code 13
solution:
Generally this happens because MPI_Test or MPI_Wait was given a request that
is unknown to MPICH (the request was not the one returned by MPICH when
you called Isend/Irecv/Send_init/Recv_init).
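In other words, the MPI_Irecv does not match the MPI_Wait(req,status,IERR): the request handles have been mixed up.
If there are many MPI_Wait() calls, you can comment them out one at a time to pin down the faulty one.
Also, if this is a Fortran program, first check how the status variable is declared. It must be an integer array of size MPI_STATUS_SIZE: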
integer req,status(MPI_STATUS_SIZE),ierr
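A minimal sketch of a correctly paired request and wait follows. The names req_send, req_recv, n and tag are assumptions for illustration, and the mpi module is assumed to be available. Each MPI_Wait receives exactly the handle that the matching MPI_Isend or MPI_Irecv filled in. Handles are initialised to MPI_REQUEST_NULL, so a wait on a rank that never posted the operation returns immediately instead of failing with an invalid request.

```fortran
program wait_pairing
  use mpi
  implicit none
  integer, parameter :: n = 4, tag = 99        ! hypothetical size and tag
  integer :: myid, numprocs, ierr
  integer :: req_send, req_recv
  integer :: status(MPI_STATUS_SIZE)           ! an array, not a scalar
  double precision :: sendbuf(n), recvbuf(n)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, numprocs, ierr)

  ! A null request is always valid input for MPI_Wait.
  req_send = MPI_REQUEST_NULL
  req_recv = MPI_REQUEST_NULL
  sendbuf = dble(myid)
  recvbuf = -1.0d0

  if (numprocs >= 2) then
    if (myid == 0) then
      call MPI_Isend(sendbuf, n, MPI_DOUBLE_PRECISION, 1, tag, &
                     MPI_COMM_WORLD, req_send, ierr)
    else if (myid == 1) then
      call MPI_Irecv(recvbuf, n, MPI_DOUBLE_PRECISION, 0, tag, &
                     MPI_COMM_WORLD, req_recv, ierr)
    end if
  end if

  ! Wait on the same handles that the calls above filled in.
  call MPI_Wait(req_send, status, ierr)
  call MPI_Wait(req_recv, status, ierr)

  if (myid == 1) print *, 'rank 1 received', recvbuf(1)
  call MPI_Finalize(ierr)
end program wait_pairing
```

Giving each outstanding operation its own clearly named request variable (or its own slot in a request array) makes it much harder to pass the wrong handle to MPI_Wait.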
4、
aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(195): Initialization failed
MPID_Init(170): failure during portals initialization
MPIDI_Portals_Init(321): progress_init failed
MPIDI_PortalsI_Progress_init(653): Out of memory
There is not enough memory on the nodes for the program plus MPI buffers to fit.
You can decrease the amount of memory that MPI uses for buffers by setting the MPICH_UNEX_BUFFER_SIZE environment variable.
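This post is reposted from my blog: http://blog.csdn.net/zhuliting/archive/2011/06/18/6553809.aspx
If you are interested in parallel computing, you are welcome to get in touch so we can learn together.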