SIGSEGV on all nodes in a parallel job

  • Last Post 30 April 2019
  • Topic Is Solved
srwheat posted this 18 April 2019

V192, RSM from a Windows client to a Linux cluster: a parallel job gets a SIGSEGV error on all processes.  Serial jobs behave the same way.

But, if we "update the simulation one iteration" and then submit the job to do the rest of the iterations, it all runs fine, whether parallel or not.

The SIGSEGV seems to happen during initialization ... before the simulation gets going.

The last thing in the .trn file before the faults is the note "Hybrid initialization is done".

We captured stdout.live; the end of it is shown below, and it is identical whether the job fails or succeeds (when starting at iteration 2).

Running Solver : /opt/apps/ansys/v192/fluent/bin/fluent --albion --run --launcher_setting_file "fluentLauncher.txt" --fluent_options "  -gu -driver null -driver null  -workbench-session -i \"SolutionPending.jou\"  -mpi=intel -t4"

/opt/apps/ansys/v192/fluent/fluent19.2.0/bin/fluent -r19.2.0 --albion --run --launcher_setting_file fluentLauncher.txt

jcallery posted this 22 April 2019

Hi Stephen,

 

Were you able to try running the solver manually outside of RSM from the terminal on the cluster?

 

Does this happen with all projects or just this one?

 

Can you download and try running the following workbench project:

https://drive.google.com/file/d/1-bcXljd-BYvnORbq3bYWhLnk_AptnSUf/view?usp=sharing

Please be sure the following are on each compute node:

 

Linux

For ALL 64-bit Linux platforms, the OpenMotif and Mesa libraries should be installed. These libraries are typically installed during a normal Linux installation. You will also need the xpdf package to view the online help.

 

ANSYS products require OpenMotif. After installing your Linux platform, review the tables below and install the appropriate version of OpenMotif. (You may need to use "rpm -iv --force" to install these.)

 

Table 2.1: OpenMotif Versions for SUSE Linux Enterprise

  SUSE Linux Enterprise Release    OpenMotif Version          Zypper Package
  SUSE Linux Enterprise 12 SP2     motif-2.3.4-4.15.x86_64    motif
  SUSE Linux Enterprise 12 SP3     motif-2.3.4-4.15.x86_64    motif

Table 2.2: OpenMotif Versions for Red Hat Enterprise Linux

  Red Hat Enterprise Linux Release    OpenMotif Version
  Red Hat Enterprise Linux 6.x        motif-2.3.4-1
  Red Hat Enterprise Linux 7.x        motif-2.2.4-0

 

 

For more information on OpenMotif libraries for your platform, see the Motif download site.

 

Red Hat Enterprise Linux 6.9 and 7.3 through 7.5 — you need to install the following libraries:

libpng12
libXp.x86_64
xorg-x11-fonts-cyrillic.noarch
xterm.x86_64
openmotif.x86_64
compat-libstdc++-33.x86_64
compat-libstdc++-44.x86_64
libstdc++.x86_64
libstdc++.i686
gcc-c++.x86_64
compat-libstdc++-33.i686
compat-libstdc++-44.i686
libstdc++-devel.x86_64
libstdc++-devel.i686
compat-gcc-34.x86_64
gtk2.i686
libXxf86vm.i686
libSM.i686
libXt.i686
xorg-x11-fonts-ISO8859-1-75dpi.noarch
glibc-2.12-1.166.el6_7.1 (or greater)

 

Red Hat no longer includes the 32-bit libraries in the base configuration, so you must install those separately.

 

For more information on Red Hat Enterprise Linux libraries, see the Red Hat Libraries site.

 

CentOS 7.3 and 7.4 — you need to install the following libraries:

glibc.i686
glib2.i686
bzip2-libs.i686
libpng.i686
libtiff.i686
libXft.i686
libXxf86vm.i686
sssd-client.i686
libpng12
libpng12.i686
libXp
libXp.i686
openmotif
zlib
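To see which of these are already present on a node, a small helper along these lines can be used (my own sketch, not from the ANSYS docs; pass your distro's package-query command):

```shell
# check_pkgs QUERY_CMD PKG...
# Prints each package for which QUERY_CMD fails, e.g.:
#   check_pkgs "rpm -q" openmotif.x86_64 libpng12 libXp.x86_64 zlib
check_pkgs() {
  q="$1"; shift
  for pkg in "$@"; do
    # word-splitting of $q is intentional so "rpm -q" works as a command
    $q "$pkg" >/dev/null 2>&1 || echo "MISSING: $pkg"
  done
}
```

Run it once per compute node (the package list above is the authoritative one; the names in the comment are just an abbreviated example).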

 

Thank you,

Jake

jcallery posted this 22 April 2019

Here are the docs for manually using fluent to run the job on a compute node:

https://drive.google.com/open?id=1bThnS0TXJ6qnmFe_MXsFeUt6DuKCZb8w

 

Thank you,

Jake

srwheat posted this 25 April 2019

Jake,

I have not been able to figure out how to run a job locally; the users are caught up in Workbench and don't know how to give me a journal file to run directly.  My internet searches for an example journal file have been fruitless.  I'm not a Fluent user, so I have no experience with how to configure a run.  Is there a link to a sample .jou file (and the rest of the needed files) that I could use to try this out?

I saw the link for a download of a project, but I'm not sure what to do with that.  I've asked my user to download it and run from workbench; I don't know how to extract from that set of files what I need to run it manually.

Stephen

jcallery posted this 25 April 2019

Hi Stephen,

 

Ok, I have put together a package for you.

https://drive.google.com/open?id=1Yi4nHugidwWKEujbtgPNwjbBcZZ-o5pG

 

Untar it in a location that is shared across nodes.

Then, to run it, cd to the directory containing the .cas and .jou files and run:

/ansys_inc/v193/fluent/bin/fluent 3d -g -t2 -i elbow1-rel-path-no-dat.jou 

Adjust paths as necessary.

That will run it on two cores on the machine you are logged into.
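For reference, a Fluent journal is just a text file of TUI commands; a minimal one looks something like this (a sketch — the file names are placeholders, and the hybrid-initialization step matches what this thread's .trn file reports):

```text
; minimal Fluent journal (sketch): read case, initialize, iterate, save, exit
/file/read-case elbow1.cas
/solve/initialize/hyb-initialization
/solve/iterate 50
/file/write-data elbow1-out.dat
/exit yes
```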

Let's see if that works first.

If it completes without issue you should see something like:

 

Thank you,
Jake

 

srwheat posted this 25 April 2019

Jake,

On my login node, it worked fine, for both -t2 and -t16.  However, on one of my compute nodes, I got the output shown below.  (I'd put this in a file, but I can't see how to attach one here.)  Both nodes mount the same NFS file systems for home directories and for the fluent executables.

Regarding the RPMs you said I need: the login node did not have all of them, and the compute nodes had even fewer.  It seems my next step is to get at least the RPMs that the login node has onto the compute nodes; it appears we're on the way to a solution.  Let me know if you recommend any other step.  I should be able to test-load the RPMs on the one node I am using before changing the image for every compute node.  I will likely get to that today.

Thanks,

Stephen

/ansys_inc/v192/fluent/bin/fluent 3d -g -t2 -i elbow1-rel-path-no-dat.jou 

/ansys_inc/v192/fluent/fluent19.2.0/bin/fluent -r19.2.0 3d -g -t2 -i elbow1-rel-path-no-dat.jou

/ansys_inc/v192/fluent/fluent19.2.0/cortex/lnamd64/cortex.19.2.0 -f fluent -g -i elbow1-rel-path-no-dat.jou (fluent "3d -pshmem  -host -r19.2.0 -t2 -mpi=ibmmpi -path/ansys_inc/v192/fluent -ssh")

/ansys_inc/v192/fluent/fluent19.2.0/bin/fluent -r19.2.0 3d -pshmem -host -t2 -mpi=ibmmpi -path/ansys_inc/v192/fluent -ssh -cx s1n1:432133954

Starting /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/3d_host/fluent.19.2.0 host -cx s1n1:432133954 "(list (rpsetvar (QUOTE parallel/function) "fluent 3d -flux -node -r19.2.0 -t2 -pshmem -mpi=ibmmpi -ssh") (rpsetvar (QUOTE parallel/rhost) "") (rpsetvar (QUOTE parallel/ruser) "") (rpsetvar (QUOTE parallel/nprocs_string) "2") (rpsetvar (QUOTE parallel/auto-spawn?) #t) (rpsetvar (QUOTE parallel/trace-level) 0) (rpsetvar (QUOTE parallel/remote-shell) 1) (rpsetvar (QUOTE parallel/path) "/ansys_inc/v192/fluent") (rpsetvar (QUOTE parallel/hostsfile) "") )"

 

              Welcome to ANSYS Fluent Release 19.2

 

              Copyright 1987-2018 ANSYS, Inc. All Rights Reserved.

              Unauthorized use, distribution or duplication is prohibited.

              This product is subject to U.S. laws governing export and re-export.

              For full Legal Notice, see documentation.

 

Build Time: Aug 08 2018 12:59:03 EDT  Build Id: 10236  

 

 

     --------------------------------------------------------------

     This is an academic version of ANSYS FLUENT. Usage of this product

     license is limited to the terms and conditions specified in your ANSYS

     license form, additional terms section.

     --------------------------------------------------------------

Host spawning Node 0 on machine "s1n1" (unix).

/ansys_inc/v192/fluent/fluent19.2.0/bin/fluent -r19.2.0 3d -flux -node -t2 -pshmem -mpi=ibmmpi -ssh -mport 192.168.1.111:192.168.1.1114230:0

Starting /ansys_inc/v192/fluent/fluent19.2.0/multiport/mpi/lnamd64/ibmmpi/bin/mpirun -e MPI_IBV_NO_FORK_SAFE=1 -e MPI_USE_MALLOPT_MMAP_MAX=0 -np 2 /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/3d_node/fluent_mpi.19.2.0 node -mpiw ibmmpi -pic shmem -mport 192.168.1.111:192.168.1.1114230:0

 

-------------------------------------------------------------------------------

ID    Hostname  Core  O.S.      PID          Vendor                      

-------------------------------------------------------------------------------

n0-1  s1n1      2/24  Linux-64  30181-30182  Intel(R) Xeon(R) E5-2643 v4 

host  s1n1            Linux-64  30004        Intel(R) Xeon(R) E5-2643 v4 

 

MPI Option Selected: ibmmpi

Selected system interconnect: shared-memory

-------------------------------------------------------------------------------

 

Cleanup script file is /home/swheat/fluent/parallel/cleanup-fluent-s1n1-30004.sh

terminate called after throwing an instance of 'std::runtime_error'

  what():  locale::facet::_S_create_c_locale name not valid

 

==============================================================================

Stack backtrace generated for process id 29853 on signal 6 :

1000000: fluent() [0x67f3b9]

1000000: /usr/lib64/libc.so.6(+0x362f0) [0x7fa75da7b2f0]

1000000: /usr/lib64/libc.so.6(gsignal+0x37) [0x7fa75da7b277]

1000000: /usr/lib64/libc.so.6(abort+0x148) [0x7fa75da7c968]

1000000: /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/syslib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d) [0x7fa75e3ba2dd]

1000000: /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/syslib/libstdc++.so.6(+0x8e2b6) [0x7fa75e3b82b6]

1000000: /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/syslib/libstdc++.so.6(+0x8e301) [0x7fa75e3b8301]

1000000: /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/syslib/libstdc++.so.6(+0x8e518) [0x7fa75e3b8518]

1000000: /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/syslib/libstdc++.so.6(_ZSt21__throw_runtime_errorPKc+0x37) [0x7fa75e3e0c07]

1000000: /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/syslib/libstdc++.so.6(+0xb00f4) [0x7fa75e3da0f4]

1000000: /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/syslib/libstdc++.so.6(_ZNSt6locale5_ImplC2EPKcm+0x49) [0x7fa75e3cc269]

1000000: /ansys_inc/v192/fluent/fluent19.2.0/lnamd64/syslib/libstdc++.so.6(_ZNSt6localeC1EPKc+0x88c) [0x7fa75e3cd4dc]

1000000: /ansys_inc/v192/fluent/lib/lnamd64/libApipWrapper.so(+0xd8393) [0x7fa75e782393]

1000000: /ansys_inc/v192/fluent/lib/lnamd64/libApipWrapper.so(_ZN5Ansys10ApipHelper13GetInstallDirEPKw+0x66f) [0x7fa75e7221bf]

1000000: /ansys_inc/v192/fluent/lib/lnamd64/libApipWrapper.so(_ZN5Ansys17ApipConfiguration20initializeForVersionEPKw+0x1d) [0x7fa75e72f08d]

Please include this information with any bug report you file on this issue!

==============================================================================

 

 

No error handler available

 

Error: Cortex received a fatal signal (unrecognized signal).

Error Object: ()

 

version> exit

 

jcallery posted this 25 April 2019

Hi Stephen,

Yes, please get all of the required prerequisites installed on the compute nodes.

Once that is done, please rerun the manual test on the compute nodes.

 

Thank you,

Jake

srwheat posted this 29 April 2019

Jake,

I have finally gotten back to work on this.  I installed the RPMs that my login node had; it still did not work.  So I installed the rest of the required RPMs; it still did not work.  I then tried running fluent as an interactive job on the compute node without first allocating the node via SLURM, and then it did work.  Interactive via SLURM does not work; normal interactive does.  Long story short, I found it was an environment variable: running under SLURM, I inherit LANG=en_US.UTF-8.  Within an interactive SLURM job, if I "unset LANG", it works fine.

It seems that my locale setup on the compute nodes may not be correct, or that fluent doesn't like LANG being set.

I get a bit of an error on the compute node when executing the "locale" command.  I'm going to see if I can clean that up and see if that makes it all work.

Have you seen this issue before?
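For anyone who hits this later: the condition can be checked on a node with something like the following (my own sketch, assuming the GNU locale and grep tools are present):

```shell
# have_locale NAME - succeeds if NAME (or its ".utf8" spelling) is an
# installed locale on this machine.
have_locale() {
  locale -a 2>/dev/null | grep -qix -e "$1" -e "${1%.*}.utf8"
}

# Warn when the inherited LANG cannot be resolved here; Fluent 19.2's
# std::locale construction aborts in exactly that situation.
if ! have_locale "${LANG:-C}"; then
  echo "LANG='${LANG}' is not available on this node;"
  echo "unset LANG (or install the locale) before launching Fluent."
fi
```

Running this inside a SLURM allocation shows immediately whether the LANG the scheduler propagates is resolvable on the compute node.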

srwheat posted this 29 April 2019

Jake,

I have fixed the node configurations to have the "locale" command work properly.

That has resulted in the sample .jou file you gave me running correctly within the SLURM environment.

My student will be trying the remote job submission again this evening.  If it works, I'll be closing this out.

Before I got this fix in place, he did run the project you sent us remotely, and it ran OK without hitting the SIGSEGV issue.  Just to be sure the problem still existed, he tried the original project, and it still had the SIGSEGV error.  Maybe the sample project you sent didn't trigger the locale issue, or maybe we're facing two problems.  At least the interactive job seems to be working just fine now.

Thanks for all of the help!

Stephen

srwheat posted this 30 April 2019

Jake,

Unfortunately, the original project does not work.  Is there some way we could transfer that to you for you to look at?

The project you sent us works fine.  The .jou deck works fine.

What next?

jcallery posted this 30 April 2019

Hi Stephen,

At this point it sounds like things are working from a systems perspective.

Please ask the student to create a post in the Fluids Physics section of the forum for help from an engineer.

 

In the meantime, I would suggest starting with a very simple project from the student and making sure that works, then adding each feature of the problematic project one at a time to see which feature may be causing the issue.

 

Thank you,

Jake

srwheat posted this 30 April 2019

Jake,

Thanks again for all of the help.  That sample .jou file was a life saver.

Stephen
