Error running Fluent on a cluster (improve quality used in setup)

  • Last Post 09 December 2019
  • Topic Is Solved
Isita posted this 29 November 2019

Hello, I am trying to run Fluent on a cluster, using the exact same layout for my script and journal files as in an earlier case. That case was a smaller geometry with about 7 million elements; the current one has 140 million. Another difference is that, since generating the mesh takes so much time, I used the "improve orthogonal quality" feature in the setup.

The error file contains the following:

Cleanup script file is /lustre/eaglefs/scratch/ness/flu10/cleanup-fluent-r1i3n16-12329.sh

cp: cannot stat ‘/scratch/ness/flu10/FFF-Setup-Output.dat’: No such file or directory

For some reason I can't attach files or paste the content here, so here is a link to the files:

https://drive.google.com/open?id=1d0Ab45nilBd6NXhBCJjWLR7BHZxRpyf3

Thanks

Isita posted this 30 November 2019

Hi, here is an update. I increased the number of nodes I was requesting to 20. I did not get a failure message, but I got several lines like this:

[215] MPI startup(): dapl fabric is not available and fallback fabric is not enabled

It does not look like it ran either... I uploaded the updated output as well.

rwoolhou posted this 03 December 2019

How much RAM do the nodes you're requesting have? For 140M cells you'll probably need 250-300 GB of RAM depending on the models used, and more if there are several additional Eulerian phases.

As an aside (and often stated), staff are not permitted to open/download attachments. 

Isita posted this 03 December 2019

The memory is 96 GB

Oh, my bad. I'll try again to post the extracted part...

Isita posted this 03 December 2019

The last line when I request 2 nodes with 32 tasks per node is "auto partitioning mesh by Metis (fast" and then it stops with a fatal error in one of the compute processes.

I also tried with 20 nodes, and the error file contains several lines like

"[363] MPI startup(): dapl fabric is not available and fallback fabric is not enabled".

The output file has many lines of the form

"r1i0m32:UCM:5835c:aaae3000: 49931 us(49931 us): open_hca: dev open failed for mlx5_0, err=Cannot allocate memory" and "open_hca: device mlx4_0 not found"

Sorry for the poor formatting... it is the only way the post seems to be going through.
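
For output this noisy, counting the telltale messages can show which one dominates. The following is a minimal Python sketch, not an Ansys tool: the pattern names and the `triage` function are made up for illustration, and the sample lines are the ones quoted above.

```python
import re
from collections import Counter

# Hypothetical labels mapped to message fragments quoted in this thread.
PATTERNS = {
    "out_of_memory": re.compile(r"Cannot allocate memory"),
    "dapl_unavailable": re.compile(r"dapl fabric is not available"),
    "device_not_found": re.compile(r"open_hca: device \S+ not found"),
}

def triage(lines):
    """Count how many lines match each known message fragment."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

# Sample lines copied from the error and output files described above.
sample = [
    "[363] MPI startup(): dapl fabric is not available and fallback fabric is not enabled",
    "r1i0m32:UCM:5835c:aaae3000: 49931 us(49931 us): open_hca: dev open failed for mlx5_0, err=Cannot allocate memory",
    "open_hca: device mlx4_0 not found",
]

if __name__ == "__main__":
    for name, count in triage(sample).items():
        print(f"{name}: {count}")
```

To scan a real file, pass an open file object instead of `sample`, for example `triage(open(path))` with `path` pointing at the job's output file.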

rwoolhou posted this 04 December 2019

The forum has problems with certain command characters/strings that Fluent outputs. IT are aware, but apparently it's a security feature...

The giveaway, though, is the "Cannot allocate memory" message, together with the fact that you have 140 million cells and only 96 GB of RAM. You either need to find a bigger computer or coarsen the mesh.
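
The arithmetic behind that can be sketched in a few lines of Python. This is a rough estimate only, under two assumptions that are not stated in the thread: that memory scales linearly with cell count from the 250-300 GB figure quoted for 140M cells, and that the 96 GB is per node. The function names are illustrative, not part of Fluent.

```python
import math

# Rule of thumb quoted earlier in the thread: 140M cells -> roughly 250-300 GB.
REF_CELLS = 140e6
REF_RAM_GB = (250.0, 300.0)

def ram_range_gb(cells):
    """Scale the quoted RAM range linearly with cell count (assumption)."""
    return tuple(gb * cells / REF_CELLS for gb in REF_RAM_GB)

def min_nodes(cells, ram_per_node_gb):
    """Fewest nodes whose combined RAM covers the low and high estimates."""
    low, high = ram_range_gb(cells)
    return (math.ceil(low / ram_per_node_gb),
            math.ceil(high / ram_per_node_gb))

if __name__ == "__main__":
    ram_per_node_gb = 96.0  # assumption: the 96 GB quoted above is per node
    for cells in (7e6, 140e6):
        low, high = ram_range_gb(cells)
        n_low, n_high = min_nodes(cells, ram_per_node_gb)
        print(f"{cells / 1e6:.0f}M cells: {low:.1f}-{high:.1f} GB, "
              f"at least {n_low}-{n_high} node(s)")
```

Under these assumptions the earlier 7M-cell case fits comfortably in one 96 GB node, while the 140M-cell case does not. The node count is a lower bound: it ignores how memory is distributed across processes, so a single process can still run out even when the combined total looks sufficient.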

Isita posted this 06 December 2019

Alright... so solving the mesh/RAM incompatibility should make the whole simulation work?

rwoolhou posted this 09 December 2019

Most likely. It'll probably stop that error, anyway. Whether you then find something else is another question.
