Hi Win,
This is a follow-up to my previous post titled "Error running single large time simulation on multiple nodes". Sorry for starting a new discussion, but I have not been able to reply or add a post to that thread, and I am not sure why.
Thank you so much for your response. I tried to follow the suggestion in the link you pointed me to, but I am still not able to get MPI to work on multiple nodes. Our queueing system is Torque/Moab and my test job is allocated 2 nodes and 20 cores per node.
When I use 'HFSS/RemoteSpawnCommand'='Scheduler' with this command:
ansysedt -distributed -machinelist num=40 -monitor -ng -batchoptions "'HFSS/EnableGPU'=0 'HFSS/HPCLicenseType'='pool' 'HFSS/MPIVendor'='Intel' 'HFSS/RemoteSpawnCommand'='Scheduler'" -batchsolve HFSSDesign1:Nominal:Setup1 SampleDesign.aedt
I get the error:
"[error] Batch option HFSS/RemoteSpawnCommand has invalid value Scheduler; this value is only allowed when running under an LSF or SGE/GE/UGE scheduler."
Is there a way to use the Scheduler option when using Torque/Moab?
When I instead run ansysedt with 'HFSS/MPIVendor'='Intel' and 'HFSS/RemoteSpawnCommand'='SSH', and first export the environment variable I_MPI_TCP_NETMASK=ib0, as follows (bash shell):
export I_MPI_TCP_NETMASK=ib0
ansysedt -distributed -machinelist list="n278:4:10:90%,n279:4:10:90%" -monitor -ng -batchoptions "'HFSS/EnableGPU'=0 'HFSS/HPCLicenseType'='pool' 'HFSS/MPIVendor'='Intel' 'HFSS/RemoteSpawnCommand'='SSH'" -batchsolve HFSSDesign1:Nominal:Setup1 SampleDesign.aedt
(where n278 and n279 are the two hosts allocated to the job by the queueing system), I still get the error:
Design type: HFSS
Allow off core: True
Using manual settings
Two level: Disabled
Distribution types: Variations, Frequencies, Transient Excitations, Domain Solver
Machines:
n278 [191904 MB]: RAM: 90%, task0 cores, task1 cores, task2:2 cores, task3:2 cores
n279 [191904 MB]: RAM: 90%, task0 cores, task1 cores, task2:2 cores, task3:2 cores
[info] Project:SampleDesign, Design:HFSSDesign1 (DrivenTerminal), Setup1 : Sweep distributing Frequencies (2:505 PM Apr 03, 2019)
[error] Project:SampleDesign, Design:HFSSDesign1 (DrivenTerminal), Could not start the memory inquiry solver: check distributed installations, MPI availability, MPI authentication and firewall settings. -- Simulating on machine: n278 (2:526 PM Apr 03, 2019)
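For reference, the machine list in the command above does not have to be typed by hand; it can be derived from the node file that Torque provides to the job. The following is only a minimal sketch, assuming $PBS_NODEFILE contains one line per allocated core (the usual Torque behavior) and that every host should get the same tasks:cores:RAM values as in the command above. The helper name build_machinelist is hypothetical and not part of ansysedt or Torque.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not an ansysedt or Torque tool): build the value for
# -machinelist list="..." from a Torque node file.
# Usage: build_machinelist NODEFILE TASKS CORES RAM
build_machinelist() {
    local nodefile=$1 tasks=$2 cores=$3 ram=$4
    # The node file repeats each host once per allocated core, so keep only
    # the first occurrence of each host (in order) and skip blank lines.
    awk 'NF && !seen[$1]++ { print $1 }' "$nodefile" |
        sed "s/\$/:${tasks}:${cores}:${ram}/" |   # append :tasks:cores:RAM
        paste -sd, -                              # join hosts with commas
}

# Inside a Torque job, PBS_NODEFILE points at the list of allocated hosts.
if [ -n "${PBS_NODEFILE:-}" ]; then
    echo "list=\"$(build_machinelist "$PBS_NODEFILE" 4 10 '90%')\""
fi
```

With a node file listing n278 and n279, this produces the same n278:4:10:90%,n279:4:10:90% string used in the command above.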
I also tried exporting I_MPI_TCP_NETMASK=ib and I_MPI_TCP_NETMASK=255.255.0.0 (our netmask), but neither worked.
What is the correct way to define the I_MPI_TCP_NETMASK variable? Is exporting it before invoking ansysedt the right approach?
Do you have any other suggestions for getting this MPI capability working?
Can you also suggest a simple command (and arguments) that I can use to test this MPI capability using multiple nodes, if the command I am using is not the most appropriate?
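Since the error message mentions MPI authentication, one check that can be run independently of ansysedt is whether each allocated node accepts a non-interactive SSH login from the node where the job starts. The following is only a sketch under that assumption: check_nodes is a hypothetical helper, and passing this check does not by itself show that the MPI bundled with ansysedt will start.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify non-interactive SSH access to each host.
# Usage: check_nodes HOST [HOST...]
check_nodes() {
    local host rc=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password,
        # which is the condition a remote-spawned solver would hit.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" hostname >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            rc=1
        fi
    done
    return "$rc"
}

# Example: ./check_nodes.sh n278 n279
if [ "$#" -gt 0 ]; then
    check_nodes "$@"
fi
```

If an Intel MPI mpirun is available on the PATH, something like `mpirun -n 2 -ppn 1 -hosts n278,n279 hostname` would be a comparable MPI-level check, though whether that mpirun matches the MPI build ansysedt uses is an assumption.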
Thank you very much again!
Shan-Ho