Configuring RSM from the command line on a Linux server running SLURM/Torque

  • Topic Is Solved
srwheat posted this 3 weeks ago

I am installing RSM on a Linux cluster head node, Titan. I don't have a GUI environment on this head node. For the time being, what I need is the ability to set up a single account for remote users of RSM; call it "fluent_user". I need to configure where the temporary files go, and I need to configure it for PBS/Torque. We run SLURM, but I'm told the SLURM/Torque wrapper works fine with RSM. We are running v193 of RSM and Fluent.

jcallery posted this 3 weeks ago

Hi srwheat,

 

The wrapper for SLURM can sometimes work, but it is certainly not recommended.

In either case, on the submission node, you need to install the RSM Launcher service:

/ansys_inc/v193/RSM/Config/tools/linux/install_daemon -launcher
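
For reference, a quick way to confirm the launcher came up after installing (a sketch; running as root matches the sudo used later in this thread, and the 9193 port for v193 is an assumption based on the 9191-9193 ports opened further down):

sudo /ansys_inc/v193/RSM/Config/tools/linux/install_daemon -launcher   # run as root
ss -ltnp | grep 9193   # assumption: the v193 launcher listens on TCP 9193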

 

Thank you,

Jake

jcallery posted this 3 weeks ago

On the client side, are you submitting from a Windows machine or a Linux machine?

Also, which temp files are you referring to? Log files, or the project staging directory?

 

Thank you,

Jake

 

srwheat posted this 3 weeks ago

jcallery,

If the wrapper is not recommended, then what would be recommended for me to use for my SLURM configuration?

As for the launcher, once I run that command, is the daemon set up to run at every boot?

A friend had suggested I do the following:

   Enable the daemon script to start at reboot. The best way is to add it to root's cron.

 

   @reboot /ansys_inc/v193/RSM/Config/tools/linux/rsmlauncher restart &
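
   For reference, installing that line in root's crontab could look like this (a sketch; the non-interactive append assumes root's crontab is safe to rewrite):

   sudo crontab -e   # interactive: add the @reboot line and save
   # or append it non-interactively:
   ( sudo crontab -l 2>/dev/null; echo '@reboot /ansys_inc/v193/RSM/Config/tools/linux/rsmlauncher restart &' ) | sudo crontab -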

Are the two methods equivalent?

Thanks!

Stephen

srwheat posted this 3 weeks ago

Yes, the jobs are being submitted from Windows.

As for the file sharing, yes, this is for staging. I have successfully gotten the configuration tool to run, and I have set it up to use an NFS directory that is available to all the cluster nodes as the staging area.
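
For reference, a shared staging export might look like this (a sketch; the paths, mount point, and export options are illustrative, not taken from this setup):

# On the head node (titan), /etc/exports:
/export/rsm_staging  *(rw,sync,no_root_squash)

# On each compute node, /etc/fstab:
titan:/export/rsm_staging  /rsm_staging  nfs  defaults  0 0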

I told it to use "Torque for Moab"; now I'm wondering if I should have chosen "PBS Pro", to connect to my wrappers.

Is there a manual somewhere that walks through the configuration that I could follow? Even with just a few selections at each step, the combinatorics are against me getting it to work.

jcallery posted this 3 weeks ago

Hi Stephen,

SLURM is not supported at all with RSM out of the box.

I would recommend using a supported scheduler.

 

Using the command I suggested will create an /etc/init.d entry for you, so that it will start on reboot.
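
For reference, on CentOS 7 you could confirm that entry like this (a sketch; the exact service-script name is an assumption based on the rsmlauncher script mentioned above):

ls /etc/init.d/ | grep -i rsm          # find the launcher's service script
sudo chkconfig --list | grep -i rsm    # confirm it is enabled at boot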

Which version of Linux are you running on the cluster?

 

Thank you,

Jake

srwheat posted this 3 weeks ago

SLURM is mandated ... sigh.

CentOS 7.5, latest kernel.

jcallery posted this 3 weeks ago

When submitting from Windows, I would use the RSM internal transfer mechanism in the File Transfer section and not worry about OS file transfers.

This is where the project files will go on the cluster side.

 

Unfortunately, I don't really know anything about the wrappers, so you will probably need to try both types in the client-side RSM Configuration utility.

 

Thank you,

Jake

srwheat posted this 3 weeks ago

We are trying the internal path, but we are running into the problem that the Windows RSM config tool is v192 whereas we are v193 on the cluster, so it tried to connect via port 9192, where nothing is listening. We then set the HPC Resource tab to use SSH to connect, which forces us to use the external mechanism for file transfer, and we chose SCP via SSH. When we then tried to submit, we got the following error after the "Submission in progress..." line: KEYPATH required for SCP transfer but not defined in environment.

Any hints on that? Can we get v192 to work with v193 using our first approach?

srwheat posted this 3 weeks ago

OK, we found instructions for setting up KEYPATH. We already had the ability to ssh (OpenSSH) from Windows to Linux without a password, so we pointed the KEYPATH variable at the private key file. When we tried plink -i %KEYPATH% user@system pwd, it said it could not use the KEYPATH file, but plink without the KEYPATH parameter worked fine. When we tried to test the queue in RSM Manager on the Windows side, it too said that it could not use the KEYPATH file.

We had the file in the \Users\myname\.ssh folder, named xyz, so in Control Panel we set a user variable KEYPATH to c:\users\myname\.ssh\xyz, and we could see from a new command window that it was set.
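
For what it's worth, one likely cause: plink is a PuTTY tool and accepts only PuTTY-format (.ppk) private keys, not OpenSSH-format ones, which would explain why plink rejects the key while OpenSSH logins work. Converting the key is one thing to try (a sketch; on Windows, the PuTTYgen GUI's Load and Save private key buttons do the same thing):

puttygen xyz -O private -o xyz.ppk   # command-line puttygen, e.g. on the Linux side
# then point KEYPATH at the resulting xyz.ppk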

tsiriaks posted this 3 weeks ago

I will leave the other parts for Jake to comment on, but I know that, regardless of any trick, you must use identical versions on the client (Windows) and the cluster.

Thanks,

Win

jcallery posted this 3 weeks ago

Hi srwheat,

Win is correct.  You will need to have matching versions of the Ansys products on the client machine and the cluster.

So it sounds like you need to install Ansys 19.2 on the cluster and then install the 19.2 version of the RSM Launcher service on the cluster:

sudo /ansys_inc/v192/RSM/Config/tools/linux/install_daemon -launcher

 

Thank you,
Jake

srwheat posted this 2 weeks ago

So, some news. It turns out that I have some clients running v19.1 and some running v19.2. I had already installed v19.3.

I installed v19.2, and that installed OK; I haven't been able to do anything to test it yet, as I'm not a Fluent user, so I need a user to work with me on validating the install.

But as I try to install v19.1, I only get to 65% done; this has happened twice. The second time, I have let it run for a long time; it has now been stuck for more than 30 minutes. I haven't killed it yet.

The last few lines in the detailed log are:

AWPROOTDIR = /opt/apps/ansys/v191

Creating ANSYS RSM Cluster Master Service script ...

Created service script: arcmaster

 

AWPROOTDIR = /opt/apps/ansys/v191

Creating ANSYS RSM Cluster Node Service script ...

Created service script: arcnode

That is where it remains stuck.  I am presuming that I can have several versions installed and running at the same time.  Is that correct?

On my firewall, I have opened up the following tcp ports: 9191 9192 and 9193
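
For reference, on CentOS 7 with firewalld that would be something like (a sketch):

sudo firewall-cmd --permanent --add-port=9191-9193/tcp
sudo firewall-cmd --reload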

Any idea as to why the hang is happening? This thread is getting a bit diverse in topic, as the actions taken to solve the original problem keep creating new questions that must be solved first.

srwheat posted this 2 weeks ago

I tried to add the output of "ps -aef | grep ans" to this post, which shows several v191 processes running, but "add post" would not work with that text.

Attached Files

srwheat posted this 2 weeks ago

I found out how to attach the file above.

srwheat posted this 2 weeks ago

I found these two log files from the v191 install

 

jcallery posted this 2 weeks ago

Hi srwheat,

Regarding the stuck installation, please make sure that libpng12 is installed, then retry the installation.
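
On CentOS 7 that would presumably be (a sketch):

sudo yum install libpng12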

Thank you,

Jake

 

 

srwheat posted this 2 weeks ago

Jake,

Thanks!  That solved the install problem.  I will now be able to turn my attention to the original problem.

 

Stephen

srwheat posted this 2 weeks ago

Jake,

I have installed the client on my Windows laptop. I have the RSM configurator set to point at my cluster. I get the following, endlessly, in my RSM log files:

2019-03-12 142:49 [WARN] Failed to operate on the mutexPath: /tmp/MMFLock_USERPROXY191_ERV_STEPHEN WHEAT_FLUENTUSER.lock : Chmod failed to operate on MMFLock_USERPROXY191_ERV_STEPHEN WHEAT_FLUENTUSER.lock in /tmp : chmod: cannot access 'MMFLock_USERPROXY191_ERV_STEPHEN': No such file or directory

 

I have tried running rsmlauncher as root (by disabling the rsmadmin account) and as rsmadmin, with the same results. However, I think that when I was running as rsmadmin, my client "ready" light was always red. When running as root, the client ready light goes green, and if I stop the server, the client takes notice.

Any ideas?

Stephen

srwheat posted this 2 weeks ago

Also, I have credentials for user "fluentuser" in both the client RSM config and the server RSM config, with the same password on each, both linked to Titan.

I don't know why the server would be using information about my Windows account in the lock file.

jcallery posted this 2 weeks ago

Hi srwheat,

Can you manually delete the MMFLock*.lock file?

 

Then try running the job again.

 

Thank you,

Jake

 

srwheat posted this 2 weeks ago

There is no such file there. But when the client is trying to download the queue, or when I use the queue test button, this file does get created: UserProxyLauncherLock.lock

 

But, it disappears when I restart the service.

jcallery posted this 2 weeks ago

Hi srwheat,

 

When it is created, what are the permissions on it?  Also what are the permissions on /tmp?

Can anyone read/write to and from the /tmp directory?

If you create a file there with a normal user, are you able to change the permissions of that file with chmod as that normal user?
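
For reference, a quick version of that test, run as a regular user (a sketch):

touch /tmp/rsm_perm_test && chmod 777 /tmp/rsm_perm_test && rm /tmp/rsm_perm_test && echo OK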

 

Thank you,
Jake

srwheat posted this 2 weeks ago

/tmp is drwxrwxrwt (mode 1777, sticky bit set)

Verified any user can create/delete files

As it turns out, these two files show up right at the beginning of the transaction, but the first disappears quickly; I just happened to catch it this time:

-rw-r--r-- 1 root root 40 Mar 12 14:50 MMFLock_USERPROXY191_ERV_STEPHEN WHEAT_FLUENTUSER.lock

-rw-rw-rw- 1 root root 40 Mar 12 14:50 UserProxyLauncherLock.lock

 

srwheat posted this 2 weeks ago

Note: is it possible the issue is that there is a space between STEPHEN and WHEAT in the file name?
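
For reference, that would be consistent with the chmod error in the log above: an unquoted path gets split at the space, so chmod sees only the first half of the file name (a sketch):

chmod 600 /tmp/MMFLock_USERPROXY191_ERV_STEPHEN WHEAT_FLUENTUSER.lock     # unquoted: splits into two arguments and fails
chmod 600 "/tmp/MMFLock_USERPROXY191_ERV_STEPHEN WHEAT_FLUENTUSER.lock"   # quoted: works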

jcallery posted this 2 weeks ago

Hi srwheat,

It is a possibility.  

Can you try from a user that does not have a space in the name?

Also, could you please create a file in /tmp as a regular user and then do a chmod 777 on that file?

I want to make sure it doesn't throw an error.

 

Thank you,

Jake

srwheat posted this 2 weeks ago

Yes, I will be working with a user later today or tomorrow. I verified that the file can be created and chmod'd.

srwheat posted this 2 weeks ago

Progress: RSM on the Windows client works with a user name that has no spaces. But we can't get Workbench on that same client to see RSM as a service. The instructions we found on the web explain how to use RSM from Workbench once it is loaded as a service, but we can find no instructions for doing that.

What steps do I need to take in Workbench on the client after RSM is configured and verified on that client?

 

We're close!

Thanks,

Stephen

srwheat posted this 2 weeks ago

Further update: we stumbled onto how to do that. We right-clicked in the schematic space and found Properties, where we could change the "update option". Then the job ran, but it failed with the following:

 

find: error while loading shared libraries: libSM.so.6: cannot open shared object file: No such file or directory

/opt/apps/ansys/v192/Tools/mono/Linux64/bin/mono: error while loading shared libraries: libSM.so.6: cannot open shared object file: No such file or directory

Seems like we are really close.

srwheat posted this 2 weeks ago

OK, we found that the compute nodes did not have libSM installed; we have fixed that and are rebooting now. I will update this when it is done and we have had a chance to run the job again.
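
For reference, on CentOS 7 that fix is presumably something like (a sketch):

sudo yum install libSM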

srwheat posted this 2 weeks ago

Now we get the following error: terminate called after throwing an instance of 'std::runtime_error'

what(): locale::facet::_S_create_c_locale name not valid

It's almost as if the job is trying to visualize on the compute nodes.

We're missing something.  We'll wait for your insight, as we are without further ideas here.
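
For reference, this libstdc++ error usually means the process was handed a locale name (via LANG or LC_ALL) that is not installed on the node, rather than anything graphical; a couple of hedged checks (a sketch):

locale -a | grep -i en_US   # on a compute node: is the expected locale present?
export LC_ALL=C             # workaround: force a locale every node has before launching the job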

 

Thanks,

Stephen
