
A collection of error messages from debugging MPI programs

Posted on 2011-6-23 13:37:19

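If you write your program in Fortran, add implicit none, especially when the code base is large. It lets the compiler catch many problems at compile time, such as misspelled variable names.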

1.

[root@c0108 parallel]# mpiexec -n 5 ./simple  
aborting job:  
Fatal error in MPI_Irecv: Invalid rank, error stack:  
MPI_Irecv(143): MPI_Irecv(buf=0x25dab60, count=0, MPI_DOUBLE_PRECISION, src=5, tag=99, MPI_COMM_WORLD, request=0x7fffa02ca86c) failed  
MPI_Irecv(95): Invalid rank has value 5 but must be nonnegative and less than 5  
rank 4 in job 5  c0108_52041   caused collective abort of all ranks  
  exit status of rank 4: return code 13   


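The message says that rank 5 is invalid: mpiexec -n 5 ./simple starts five processes, ranked 0 1 2 3 4, so no rank 5 exists. The problem is therefore in the code itself, though not necessarily in the rank expression alone; it can also come from, for example, an argument that was not passed correctly. Here the error stack shows src=5 on rank 4, which is what a source of MYID+1 gives on the last rank. MPI tends to produce many errors that look puzzling at first.

My code contains only a few MPI_Irecv calls, so I debugged by adding print statements until I found the line that fails: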

      print *, myid+1,'111111111111111111'   ! debug marker
      call MPI_Irecv(P(1,1,location),IMAX*JMAX*MIN(ITSP, ke-myke),
     &MPI_DOUBLE_PRECISION,MYID+1,RELY,MPI_COMM_WORLD,REQ,IERR)
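
One common way to avoid posting a receive from a rank that does not exist is to replace an out-of-range neighbor with MPI_PROC_NULL, for which the MPI standard defines communication as a no-op. The following is only a minimal sketch, not the original program: it assumes a simple one-dimensional chain in which the last rank has no right neighbor, and the names buf, sendbuf, n, tag, left and right are hypothetical. It also assumes the MPI installation provides the Fortran 90 mpi module.

```fortran
program guard_neighbor
  use mpi                      ! assumption: the F90 "mpi" module is available
  implicit none
  integer, parameter :: n = 4  ! hypothetical message length
  integer, parameter :: tag = 99
  double precision :: buf(n), sendbuf(n)
  integer :: myid, nprocs, left, right, ierr
  integer :: reqs(2)
  integer :: statuses(MPI_STATUS_SIZE, 2)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Valid ranks are 0 .. nprocs-1; anything outside becomes MPI_PROC_NULL.
  right = myid + 1
  if (right > nprocs - 1) right = MPI_PROC_NULL
  left = myid - 1
  if (left < 0) left = MPI_PROC_NULL

  sendbuf = dble(myid)
  buf = -1.0d0

  ! On the last rank this receive has source MPI_PROC_NULL and completes
  ! immediately without touching buf, instead of aborting with "Invalid rank".
  call MPI_Irecv(buf, n, MPI_DOUBLE_PRECISION, right, tag, MPI_COMM_WORLD, reqs(1), ierr)
  call MPI_Isend(sendbuf, n, MPI_DOUBLE_PRECISION, left, tag, MPI_COMM_WORLD, reqs(2), ierr)
  call MPI_Waitall(2, reqs, statuses, ierr)

  print *, 'rank', myid, 'first received value:', buf(1)
  call MPI_Finalize(ierr)
end program guard_neighbor
```

Whether MPI_PROC_NULL or an explicit if around the call is the better fit depends on how the domain is decomposed; the point is only that the source rank must be checked against the communicator size before the receive is posted.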

2.

[root@c0109 test]# mpiexec -n 5 ./simple   
rank 3 in job 22  c0109_51164   caused collective abort of all ranks  
  exit status of rank 3: killed by signal 11   
[root@c0109 test]#   


There are many possible causes. Signal 11, officially known as a "segmentation fault", means that the program accessed a memory location that was not assigned to it. That is usually a bug in the program, for example an array index that runs out of bounds.

If the message is "killed by signal 9" instead, try the following two things:

1) Resubmit the calculation and see whether it fails at the same point again.
2) Try the latest version of MPICH2 (the original advice names 1.0.8). It is hard to say what the problem might be; it could even be a bug in the application.



3.

[root@c0108 test]# mpirun -np 4 ./simple   
aborting job:  
Fatal error in MPI_Wait: Invalid MPI_Request, error stack:  
MPI_Wait(139): MPI_Wait(request=0x7fff1f675228, status0x7fff1f675218) failed  
MPI_Wait(75): Invalid MPI_Request  
rank 2 in job 24  c0108_52041   caused collective abort of all ranks  
  exit status of rank 2: return code 13   


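Solution:

Generally this happens because MPI_Test or MPI_Wait is given a request that is unknown to MPICH (the request was not the one returned by MPICH when you made the Isend/Irecv/send_init/recv_init call).
In other words, the MPI_Irecv is not paired with the right MPI_Wait(req,status,IERR): the request handles are mismatched.
If there are many MPI_Wait() calls, comment them out one at a time to pin down the faulty one.
In addition, for a Fortran program, first check how the status variable is declared: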
integer req,status(MPI_STATUS_SIZE),ierr
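
The pairing described above can be sketched as follows. This is an illustrative example under stated assumptions, not the original code: the names req_recv, req_send, inbuf, outbuf, n and tag are hypothetical, and the mpi module is assumed to be available. Each nonblocking call gets its own request handle, and each MPI_Wait is given exactly the handle that call returned. The handles are initialized to MPI_REQUEST_NULL, because the MPI standard lets MPI_Wait return immediately on a null request, whereas an uninitialized handle may trigger "Invalid MPI_Request".

```fortran
program wait_pairing
  use mpi                      ! assumption: the F90 "mpi" module is available
  implicit none
  integer, parameter :: n = 4, tag = 7   ! hypothetical size and tag
  double precision :: inbuf(n), outbuf(n)
  integer :: myid, nprocs, ierr
  integer :: req_recv, req_send
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Start from a known state so that waiting on an unused handle is harmless.
  req_recv = MPI_REQUEST_NULL
  req_send = MPI_REQUEST_NULL

  outbuf = dble(myid)
  inbuf = 0.0d0

  ! Rank 0 has no left neighbor, the last rank has no right neighbor,
  ! so these calls are skipped there and the handles stay null.
  if (myid > 0) then
     call MPI_Irecv(inbuf, n, MPI_DOUBLE_PRECISION, myid - 1, tag, MPI_COMM_WORLD, req_recv, ierr)
  end if
  if (myid < nprocs - 1) then
     call MPI_Isend(outbuf, n, MPI_DOUBLE_PRECISION, myid + 1, tag, MPI_COMM_WORLD, req_send, ierr)
  end if

  ! Wait on the same handles that the Irecv/Isend calls filled in.
  call MPI_Wait(req_recv, status, ierr)
  call MPI_Wait(req_send, status, ierr)

  call MPI_Finalize(ierr)
end program wait_pairing
```

Giving every outstanding operation a distinctly named handle (or a slot in a request array) makes it harder to pass the wrong one to MPI_Wait by accident.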
4.

aborting job:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(195): Initialization failed
MPID_Init(170): failure during portals initialization
MPIDI_Portals_Init(321): progress_init failed
MPIDI_PortalsI_Progress_init(653): Out of memory


There is not enough memory on the nodes for the program plus MPI buffers to fit.


You can decrease the amount of memory that MPI uses for buffers by setting the MPICH_UNEX_BUFFER_SIZE environment variable.


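This post is reposted from my blog: http://blog.csdn.net/zhuliting/archive/2011/06/18/6553809.aspx
If you are interested in parallel computing, feel free to get in touch so we can learn from each other.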