[Paraview] 3.98 MPI_Finalize out of order in pvbatch

Burlen Loring bloring at lbl.gov
Fri Dec 7 15:13:31 EST 2012


Hi Kyle et al.

Below are stack traces from where PV is hung. I'm stumped by this and 
can't get a foothold. I still have one chance if we can get valgrind to 
run with MPI on Nautilus, but it's a long shot: valgrinding pvbatch on 
my local system throws many hundreds of errors, and I'm not sure which 
of those reports are valid.

PV 3.14.1 doesn't hang in pvbatch, so I'm wondering if anyone knows of 
a change in 3.98 that might account for the new hang?

Burlen

rank 0
#0  0x00002b0762b3f590 in gru_get_next_message () from 
/usr/lib64/libgru.so.0
#1  0x00002b073a2f4bd2 in MPI_SGI_grudev_progress () at grudev.c:1780
#2  0x00002b073a31cc25 in MPI_SGI_progress_devices () at progress.c:93
#3  MPI_SGI_progress () at progress.c:207
#4  0x00002b073a3244eb in MPI_SGI_request_finalize () at req.c:1548
#5  0x00002b073a2b8bee in MPI_SGI_finalize () at adi.c:667
#6  0x00002b073a2e3c04 in PMPI_Finalize () at finalize.c:27
#7  0x00002b073969d96f in vtkProcessModule::Finalize () at 
/sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/ParaViewCore/ClientServerCore/Core/vtkProcessModule.cxx:229
#8  0x00002b0737bb0f9e in vtkInitializationHelper::Finalize () at 
/sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/ParaViewCore/ServerManager/SMApplication/vtkInitializationHelper.cxx:145
#9  0x0000000000403c50 in ParaViewPython::Run (processType=4, argc=2, 
argv=0x7fff06195c88) at 
/sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/CommandLineExecutables/pvpython.h:124
#10 0x0000000000403cd5 in main (argc=2, argv=0x7fff06195c88) at 
/sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/CommandLineExecutables/pvbatch.cxx:21

rank 1
#0  0x00002b07391bde70 in __nanosleep_nocancel () from 
/lib64/libpthread.so.0
#1  0x00002b073a32c898 in MPI_SGI_millisleep (milliseconds=<value 
optimized out>) at sleep.c:34
#2  0x00002b073a326365 in MPI_SGI_slow_request_wait 
(request=0x7fff061959f8, status=0x7fff061959d0, set=0x7fff061959f4, 
gen_rc=0x7fff061959f0) at req.c:1460
#3  0x00002b073a2c6ef3 in MPI_SGI_slow_barrier (comm=1) at barrier.c:275
#4  0x00002b073a2b8bf8 in MPI_SGI_finalize () at adi.c:671
#5  0x00002b073a2e3c04 in PMPI_Finalize () at finalize.c:27
#6  0x00002b073969d96f in vtkProcessModule::Finalize () at 
/sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/ParaViewCore/ClientServerCore/Core/vtkProcessModule.cxx:229
#7  0x00002b0737bb0f9e in vtkInitializationHelper::Finalize () at 
/sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/ParaViewCore/ServerManager/SMApplication/vtkInitializationHelper.cxx:145
#8  0x0000000000403c50 in ParaViewPython::Run (processType=4, argc=2, 
argv=0x7fff06195c88) at 
/sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/CommandLineExecutables/pvpython.h:124
#9  0x0000000000403cd5 in main (argc=2, argv=0x7fff06195c88) at 
/sw/analysis/paraview/3.98/sles11.1_intel11.1.038/ParaView/CommandLineExecutables/pvbatch.cxx:21
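
For reference, both ranks above are inside PMPI_Finalize: rank 0 is 
spinning in the progress engine under MPI_SGI_request_finalize, while 
rank 1 is waiting in finalize's internal barrier (MPI_SGI_slow_barrier). 
The pattern discussed in the quoted thread below boils down to something 
like the following standalone sketch (illustrative only, not ParaView 
code; whether this is really what pvbatch hits is exactly what I can't 
pin down):

// Build with an MPI compiler wrapper, e.g. mpicxx repro.cxx -o repro
#include <mpi.h>

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  // ... normal work ...

  MPI_Finalize();               // first, legitimate finalize

  // Erroneous per the MPI standard: no MPI calls are allowed after
  // MPI_Finalize(). Reportedly this can deadlock on SGI MPT; other
  // implementations typically abort with an error instead.
  MPI_Barrier(MPI_COMM_WORLD);  // barrier after finalize
  MPI_Finalize();               // second finalize

  return 0;
}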


On 12/04/2012 05:15 PM, Burlen Loring wrote:
> Hi Kyle,
>
> I was wrong about MPI_Finalize being invoked twice; I had misread the 
> code. I'm not sure why pvbatch is hanging in MPI_Finalize on 
> Nautilus. I haven't been able to find anything in the debugger. This 
> is new for 3.98.
>
> Burlen
>
> On 12/03/2012 07:36 AM, Kyle Lutz wrote:
>> Hi Burlen,
>>
>> On Thu, Nov 29, 2012 at 1:27 PM, Burlen Loring<bloring at lbl.gov>  wrote:
>>> It looks like pvserver is also impacted, hanging after the GUI 
>>> disconnects.
>>>
>>>
>>> On 11/28/2012 12:53 PM, Burlen Loring wrote:
>>>> Hi All,
>>>>
>>>> Some parallel tests have been failing for some time on Nautilus.
>>>> http://open.cdash.org/viewTest.php?onlyfailed&buildid=2684614
>>>>
>>>> There are MPI calls made after finalize, which cause deadlock 
>>>> issues on SGI MPT. It affects pvbatch for sure. The following 
>>>> snippet shows the bug; the bug report is here: 
>>>> http://paraview.org/Bug/view.php?id=13690
>>>>
>>>>
>>>> //---------------------------------------------------------------------------- 
>>>>
>>>> bool vtkProcessModule::Finalize()
>>>> {
>>>>
>>>>    ...
>>>>
>>>>    vtkProcessModule::GlobalController->Finalize(1); <------- MPI_Finalize called here
>> This shouldn't be calling MPI_Finalize(), since the finalizedExternally
>> argument is 1 and vtkMPIController::Finalize() guards on it:
>>
>>      if (finalizedExternally == 0)
>>        {
>>        MPI_Finalize();
>>        }
>>
>> So my guess is that it's being invoked elsewhere.
>>
>>>>    ...
>>>>
>>>> #ifdef PARAVIEW_USE_MPI
>>>>    if (vtkProcessModule::FinalizeMPI)
>>>>      {
>>>>      MPI_Barrier(MPI_COMM_WORLD); <------- barrier after MPI_Finalize
>>>>      MPI_Finalize();              <------- second MPI_Finalize
>>>>      }
>>>> #endif
>> I've made a patch which should prevent this section of code from ever
>> being called twice by setting the FinalizeMPI flag to false after
>> calling MPI_Finalize(). Can you take a look here:
>> http://review.source.kitware.com/#/t/1808/ and let me know if that
>> helps the issue.
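
(A sketch of what that change presumably looks like; the authoritative 
version is the Gerrit patch linked above. FinalizeMPI is the existing 
static flag in vtkProcessModule that guards this block:)

#ifdef PARAVIEW_USE_MPI
  if (vtkProcessModule::FinalizeMPI)
    {
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    // Clear the flag so a later pass through vtkProcessModule::Finalize()
    // cannot reach MPI_Finalize() a second time.
    vtkProcessModule::FinalizeMPI = false;
    }
#endif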
>>
>> Otherwise, would you be able to set a breakpoint on MPI_Finalize() and
>> get a backtrace of where it gets invoked for the second time? That
>> would be very helpful in tracking down the problem.
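
(If attaching a debugger on Nautilus is awkward, one possible 
alternative, sketched below and not tested on MPT, is to interpose 
MPI_Finalize through the standard MPI profiling interface and dump a 
backtrace from inside it; Linux/glibc only, link it into pvbatch or 
LD_PRELOAD it as a shared library:)

// finalize_trace.cxx: print a backtrace every time MPI_Finalize is
// entered, then forward to the real implementation via PMPI_Finalize.
#include <mpi.h>
#include <execinfo.h>
#include <cstdio>

extern "C" int MPI_Finalize(void)
{
  void* frames[64];
  int n = backtrace(frames, 64);
  std::fprintf(stderr, "MPI_Finalize called from:\n");
  backtrace_symbols_fd(frames, n, 2 /* stderr */);
  return PMPI_Finalize();
}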
>>
>> Thanks,
>> Kyle
>


