HELP: Fluent freezes on HPC Cluster

  • Last Post 11 October 2019
mahereid97 posted this 09 October 2019


I am trying to simulate a wind turbine in Fluent using HPC on a cluster. After solving several time steps (usually fewer than 20), the calculation stops, and the console prints only the following message without starting the next time step. (Screenshot attached)

"Updating solution at time levels N and N-1 using variable time stepping method."


Also, all CPUs are still running at 100%, yet I have no idea what they are calculating.

The cluster works fine with other cases and problems, so it is set up correctly, and this specific case runs normally on a regular PC, so the case setup must be correct.

Any suggestions would be great.



rwoolhou posted this 10 October 2019

Can you check the solver isn't trying to write output files to somewhere that is full or doesn't exist?  

mahereid97 posted this 10 October 2019

Autosave Case and Data is not due on the time step at which Fluent freezes. Is anything written between time steps? In any case, my project is saved in a folder with plenty of space, and other cases work fine.

rwoolhou posted this 10 October 2019

Only if you've set monitors or image export. Can you also check RAM usage, and whether DPM updates or the like are triggered?

If everything else works then it's likely that model, so you'll need to work through mesh and settings to see what the cause is. 

mahereid97 posted this 10 October 2019

The only monitor I am saving is the moment coefficient at each time step. RAM is not the problem because there's plenty, and the only relevant model is LES for turbulence. (I tried k-omega but got the same problem.)

To give you some additional details about the case:
- the mesh has around 33 million elements
- periodic boundaries, sliding mesh, and symmetry are involved
- the cluster is 7 nodes with 16 cores each
- the time step is around 6e-5 s
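As a rough sanity check on those figures, the load per partition can be worked out. This is only a sketch: it assumes all 7 x 16 cores run as compute processes and that the mesh is split evenly, neither of which is stated in the post, and the function name is made up for illustration.

```python
def cells_per_core(total_cells, nodes, cores_per_node):
    """Return (total core count, average cells per core) for an even split."""
    total_cores = nodes * cores_per_node
    return total_cores, total_cells / total_cores

# Figures quoted in the post: ~33 million elements, 7 nodes, 16 cores each.
total_cores, per_core = cells_per_core(33_000_000, 7, 16)
print(f"{total_cores} cores, about {per_core:,.0f} cells per core")
# prints: 112 cores, about 294,643 cells per core
```

On the single 16-core machine the same mesh would be split into far fewer, larger partitions, which is one visible difference between the two runs.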

What's weird is that all the compute cores stay at 100%, while the main management (host) cores drop from about 10% to 4-6% when the problem occurs.

rwoolhou posted this 11 October 2019

Can you run with mesh motion off?  We need to work through to see what's getting stuck, and if nothing else is changing it's a good starting point. 

mahereid97 posted this 11 October 2019

I switched to MRF instead of sliding mesh and the problem is gone (also, "updating solution at time levels N and N-1" is now instantaneous).

But I still need to use the sliding mesh for my simulations.

rwoolhou posted this 11 October 2019

Right, so if the model sticks at a set time (not necessarily number of timesteps) what changes at that point?

mahereid97 posted this 11 October 2019

What do you mean by that?

rwoolhou posted this 11 October 2019

If a model runs well and hasn't diverged but then "sticks", it's often either the hardware or saving (which you've checked), or something changes within the case. If MRF works, it suggests there's a problem with sliding mesh. So, what happens with the mesh at the point the model goes wrong?

mahereid97 posted this 11 October 2019

Oh, when Fluent freezes I cannot check the mesh or case, because the calculation can't be stopped unless I kill the processes, which makes Fluent close suddenly. Would it help if I let Fluent display the mesh at each time step? And do I need to display or monitor anything else at each time step to find the problem?

rwoolhou posted this 11 October 2019

Or run to just before it fails and have a look. In the File menu there's also an option to Write Transcript. That should dump out everything that's written to the TUI and can be quite helpful. 
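Once a transcript file exists, the last time step reached before the stall can be pulled out with a few lines of Python. This is a minimal sketch, not a Fluent feature: it assumes the transcript contains lines of the form "Flow time = 0.00114s, time step = 19", which may differ between versions, so treat the pattern as something to adjust. The function name and the sample lines are invented for illustration.

```python
import re

# Assumed line format; adjust to whatever the transcript actually shows.
STEP_PATTERN = re.compile(r"Flow time = ([0-9.eE+-]+)s?, time step = (\d+)")

def last_time_step(transcript_text):
    """Return (flow_time, step) for the last matching line, or None."""
    matches = STEP_PATTERN.findall(transcript_text)
    if not matches:
        return None
    flow_time, step = matches[-1]
    return float(flow_time), int(step)

# Made-up sample, not real solver output.
sample = """\
Flow time = 0.00108s, time step = 18
Flow time = 0.00114s, time step = 19
Updating solution at time levels N and N-1 using variable time stepping method.
"""
print(last_time_step(sample))
```

Comparing the result across several runs shows whether the stall happens at the same flow time or at the same step count.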

mahereid97 posted this 11 October 2019

The mesh looks normal, as it should at the given time step. What exactly am I looking for? It doesn't look like a normal Fluent crash or failure, because something is still being calculated between the time steps. I guess the mesh is being rotated during this time, so could it be a partitioning problem, in which the host CPU fails to assign moving nodes to CPU cores? (Keep in mind that the same case does not have this problem on a single 16-core machine.)