Moving cryoSPARC to a different cluster

I need to move cryoSPARC from one cluster to another. The files (the software, the database, researchers’ data) are on a filesystem that is shared between the two clusters, so pathnames will not change. Only two things will be different – the host name of the head node that cryoSPARC is running on, and the OS version: Red Hat Enterprise Linux 8 vs. the old 7.

Do I need to do anything other than stop cryoSPARC on the old head node, and start it on the new one? Do I need to do a re-installation? Is there anything about the installation or database that cares what the hostname is?

Note that I separately asked if I could move a user’s project (and received an answer), but just to be clear, that was an unrelated question.

Thanks
Matthew Cahn

@mcahn Re-installation should not be required. As you mentioned, stopping cryoSPARC on the old head node and starting it on the new one should suffice, with one addition: in between, please update the export CRYOSPARC_MASTER_HOSTNAME= line in cryosparc_master/config.sh.
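A minimal sketch of that edit (the hostnames below are placeholders, not the actual node names):

# in cryosparc_master/config.sh on the new master host
# old line, for reference:
#   export CRYOSPARC_MASTER_HOSTNAME="old-headnode.example.edu"
export CRYOSPARC_MASTER_HOSTNAME="new-headnode.example.edu"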

Hi, I’m finally getting to this move of cryoSPARC from one cluster to another, and I’m having trouble getting the MongoDB database to work. I did “cryosparcm stop” on the old cluster and copied the database files to a new location. I changed config.sh to reflect the new hostname, and changed the base port to 49000 to match a port range that is available on the new cluster.
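Concretely, the steps looked roughly like this (paths and hostname are illustrative, not the real ones):

# on the old cluster's head node
cryosparcm stop
# copy the database directory to its new location (illustrative paths)
rsync -a /old/shared/cryosparc_database/ /new/shared/cryosparc-test/db/
# then, in cryosparc_master/config.sh for the new head node:
#   export CRYOSPARC_MASTER_HOSTNAME="new-headnode.example.edu"
#   export CRYOSPARC_DB_PATH="/new/shared/cryosparc-test/db"
#   export CRYOSPARC_BASE_PORT=49000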

Now when I do “cryosparcm start” on the new cluster, I get this error:

# cryosparcm start
Starting cryoSPARC System master process..
CryoSPARC is not already running.
database: started
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/projects/MOLBIO/local/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1915, in list_database_names
    for doc in self.list_databases(session, nameOnly=True)]
  File "/projects/MOLBIO/local/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1896, in list_databases
    res = admin._retryable_read_command(cmd, session=session)
  File "/projects/MOLBIO/local/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/database.py", line 756, in _retryable_read_command
    _cmd, read_preference, session)
  File "/projects/MOLBIO/local/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1461, in _retryable_read
    read_pref, session, address=address)
  File "/projects/MOLBIO/local/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1278, in _select_server
    server = topology.select_server(server_selector)
  File "/projects/MOLBIO/local/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/topology.py", line 243, in select_server
    address))
  File "/projects/MOLBIO/local/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/topology.py", line 200, in select_servers
    selector, server_timeout, address)
  File "/projects/MOLBIO/local/cryosparc/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.7/site-packages/pymongo/topology.py", line 217, in _select_servers_loop
    (self._error_message(selector), timeout, self.description))
pymongo.errors.ServerSelectionTimeoutError: localhost:49001: timed out, Timeout: 20.0s, Topology Description: <TopologyDescription id: 62cc444c48e266eafaa72b78, topology_type: Single, servers: [<ServerDescription ('localhost', 49001) server_type: Unknown, rtt: None, error=NetworkTimeout('localhost:49001: timed out')>]>

I can see that mongodb is running:

# ps -ef | grep mongo
cryoem    243721 3676540  0 12:06 pts/49   00:00:00 grep --color=auto mongo
cryoem   3926046 3925998  0 11:43 ?        00:00:03 mongod --dbpath /tigress/MOLBIO/local/cryosparc-della-test/db --port 49001 --oplogSize 64 --replSet meteor --nojournal --wiredTigerCacheSizeGB 4

and it’s listening on port 49001:

# ss -nape   | grep 49001
u_str LISTEN     0      128              /tmp/mongodb-49001.sock -282685985                * 0           users:(("mongod",pid=3926046,fd=7)) <-> ino:3073 dev:0/64773 peers:       
tcp   LISTEN     0      128               0.0.0.0:49001               0.0.0.0:*           users:(("mongod",pid=3926046,fd=6)) uid:126619 ino:4012281310 sk:131e4 <->
tcp   TIME-WAIT  0      0                 28.112.172.234:52846       128.112.172.234:49001       timer:(timewait,9.646ms,0) ino:0 sk:b253    

If I talk to the database with the mongo shell, I get “not master and slaveOk=false”:

# ./cryosparc2_master/deps/external/mongodb/bin/mongo --port 49001 --host 128.112.172.234 --shell
show dbs
2022-07-11T12:00:21.137-0400 E QUERY    [thread1] Error: listDatabases failed:{
	"ok" : 0,
	"errmsg" : "not master and slaveOk=false",
	"code" : 13435,
	"codeName" : "NotMasterNoSlaveOk"
} :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
Mongo.prototype.getDBs@src/mongo/shell/mongo.js:62:1
shellHelper.show@src/mongo/shell/utils.js:781:19
shellHelper@src/mongo/shell/utils.js:671:15
@(shellhelp2):1:1

Any advice appreciated.

– Matthew

Hi Matthew,
A few questions:

  1. What version is your cryoSPARC instance?
  2. While the database is listening on port 49001 (as shown by the ss output above), what is output by curl localhost:49001?
  3. Did you run cryosparcm changeport 49000 or cryosparcm fixdbport (guide)?
  4. What is the output of cryosparcm log database?
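A possible explanation to keep in mind: the cryoSPARC database runs as a single-member replica set, and the replica set configuration stored inside the database records a host:port pair. If the base port was changed in config.sh without running cryosparcm fixdbport, that stored port would no longer match the port mongod now listens on, which would be consistent with the “not master” error above. One way to inspect it (a sketch using the bundled mongo shell and the port from this thread; rs.conf() may itself error if the node cannot read its configuration):

# print the stored replica set configuration and look at members[*].host;
# after a base port change it may still reference the old port
./cryosparc2_master/deps/external/mongodb/bin/mongo --port 49001 --eval 'printjson(rs.conf())'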

Thanks very much. cryosparcm fixdbport solved the problem.

The answers are probably moot now, but for the record:

  1. Version 3.3.2
  2. curl localhost:49001 returns:
        It looks like you are trying to access MongoDB over HTTP on the native driver port.
  3. I had not run either of those commands before; now I’ve run fixdbport.
  4. The only thing I see in the database log that seems suspicious is:
        2022-07-11T11:33:52.171-0400 I REPL     [replExecDBWorker-0] This node is not a member of the config

but it’s no longer giving that message.

– Matthew

I have yet another problem moving CryoSPARC to a different cluster. We have two clusters – Tiger is running production CryoSPARC, and Della is running my test installation. The two instances are running on different head nodes (and on different ports on those head nodes). Each instance has its own database. The only thing that overlaps is that the two clusters can see the same storage, although they have separate installation directories. With both instances running, users see this message in the production installation:

Token is invalid. Another cryoSPARC instance is running with the same license ID.

Could that be because I neglected to change this line in cluster_info.json:

    "worker_bin_path" : "/tigress/MOLBIO/local/cryosparc/cryosparc2_worker/bin/cryosparcw",

so that the test instance is pointing at the production installation directory?
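For reference, the corrected line would presumably look like this (the test-instance path is the one that appears elsewhere in this thread; verify against the actual installation):

    "worker_bin_path" : "/projects/MOLBIO/local/cryosparc-della-test/cryosparc2_worker/bin/cryosparcw",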

I’d like not to start the test instance again until I understand the problem, so I don’t break production again.

Thanks,
Matthew

Do Della and Tiger have their own, distinct $CRYOSPARC_LICENSE_IDs?
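For reference, each instance's license ID lives in that instance's own config files; a placeholder sketch:

# in cryosparc_master/config.sh (and cryosparc_worker/config.sh) of each instance
export CRYOSPARC_LICENSE_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"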

Another caution (relevant once the current problem is resolved): cryoSPARC project directories must not be shared between instances.

No, I was using the same license for both installations. I’ve requested another license.

I’ll take care not to share project directories between the two instances.

Thanks for your help,
Matthew

Hi, next problem (or two) moving to another cluster. Running the tutorial, on the Patch Motion Correction (multi) step, I get this error (full traceback below):

[command: nvcc --preprocess -arch sm_80 -I/projects/MOLBIO/local/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/cuda /tmp/tmpu4irjcs8.cu --compiler-options -P]
[stderr:
b'cc1plus: fatal error: cuda_runtime.h: No such file or directory\ncompilation terminated.\n']

The include path does not seem like the right place. The include file is really in /usr/local/cuda-11.7/include. I tried copying cuda_runtime.h from there to the place that nvcc is looking for it, but that just led to the next missing include file. I could copy them all, but that doesn’t seem like the right solution.

I tried setting NVCC_PREPEND_FLAGS and CPATH in cluster_script.sh, and I can see in the logs that those variables are being set, but they do not seem to affect the nvcc command that’s being run.
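What I added looked roughly like this (a sketch; it assumes CUDA 11.7 under /usr/local/cuda-11.7 as above):

# added to cluster_script.sh before the cryosparcw invocation
export CPATH=/usr/local/cuda-11.7/include${CPATH:+:$CPATH}
export NVCC_PREPEND_FLAGS="-I/usr/local/cuda-11.7/include"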

I also wonder why that path, /projects/MOLBIO/local/cryosparc, still points at my production installation rather than this test installation. I have separate installation directories, separate databases, separate licenses, and I’ve set the worker path in cluster_info.json. The only place I can think of that might retain the path is the database – my test instance is using a copy of the production database. Could you tell me where that include path is coming from?

Also note that I had to add to the list of libcufft versions in cryosparc2_worker/cryosparc_compute/skcuda_internal/cufft.py. Under RHEL 8, none of the listed versions match the versions that are installed, because the list contains major.minor versions (e.g. 10.1), while the installed libraries carry, for example, either the major version only (10) or the full version (10.7.2.50). RHEL 7 had 10.1. Adding “10” to the list seems to work – at least libcufftw.so is found.

/usr/local/cuda-11.7/lib64/libcufftw.so.10
/usr/local/cuda-11.7/lib64/libcufftw.so.10.7.2.50
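The edit amounted to adding a bare major version to the suffix list; roughly like this (this follows the scikit-cuda pattern that file is based on, not necessarily the exact contents of cryoSPARC's copy):

# candidate shared-library names are built from a list of version suffixes;
# adding the bare major version 10 lets the loader find libcufft.so.10 / libcufftw.so.10
_version_list = [10, 10.1, 10.0, 9.2, 9.1, 9.0]    # 10 added for the CUDA 11.7 layout on RHEL 8
_libcufft_libname_list = ['libcufft.so'] + ['libcufft.so.%s' % v for v in _version_list]
_libcufftw_libname_list = ['libcufftw.so'] + ['libcufftw.so.%s' % v for v in _version_list]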

Your help is greatly appreciated.
Matthew

Here’s the whole traceback from the failure to find the include file:

[CPU: 216.7 MB]  Error occurred while processing J1/imported/017013418492253161062_14sep05c_00024sq_00003hl_00005es.frames.tif
Traceback (most recent call last):
  File "/projects/MOLBIO/local/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/tools.py", line 429, in context_dependent_memoize
    return ctx_dict[cur_ctx][args]
KeyError: <pycuda._driver.Context object at 0x148254eeb210>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/projects/MOLBIO/local/cryosparc-della-test/cryosparc2_worker/cryosparc_compute/jobs/pipeline.py", line 60, in exec
    return self.process(item)
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 190, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 193, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/run_patch.py", line 195, in cryosparc_compute.jobs.motioncorrection.run_patch.run_patch_motion_correction_multi.motionworker.process
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 255, in cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 264, in cryosparc_compute.jobs.motioncorrection.patchmotion.unbend_motion_correction
  File "cryosparc_worker/cryosparc_compute/jobs/motioncorrection/patchmotion.py", line 121, in cryosparc_compute.jobs.motioncorrection.patchmotion.prepare_movie_for_processing
  File "/projects/MOLBIO/local/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/gpuarray.py", line 549, in fill
    func = elementwise.get_fill_kernel(self.dtype)
  File "<decorator-gen-120>", line 2, in get_fill_kernel
  File "/projects/MOLBIO/local/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/tools.py", line 433, in context_dependent_memoize
    result = func(*args)
  File "/projects/MOLBIO/local/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/elementwise.py", line 498, in get_fill_kernel
    "fill")
  File "/projects/MOLBIO/local/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/elementwise.py", line 163, in get_elwise_kernel
    arguments, operation, name, keep, options, **kwargs)
  File "/projects/MOLBIO/local/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/elementwise.py", line 149, in get_elwise_kernel_and_types
    keep, options, **kwargs)
  File "/projects/MOLBIO/local/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/elementwise.py", line 76, in get_elwise_module
    options=options, keep=keep, no_extern_c=True)
  File "/projects/MOLBIO/local/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/compiler.py", line 291, in __init__
    arch, code, cache_dir, include_dirs)
  File "/projects/MOLBIO/local/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/compiler.py", line 254, in compile
    return compile_plain(source, options, keep, nvcc, cache_dir, target)
  File "/projects/MOLBIO/local/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/compiler.py", line 78, in compile_plain
    checksum.update(preprocess_source(source, options, nvcc).encode("utf-8"))
  File "/projects/MOLBIO/local/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/compiler.py", line 55, in preprocess_source
    cmdline, stderr=stderr)
pycuda.driver.CompileError: nvcc preprocessing of /tmp/tmp2hdc51cy.cu failed
[command: nvcc --preprocess -arch sm_80 -I/projects/MOLBIO/local/cryosparc/cryosparc2_worker/deps/anaconda/envs/cryosparc_worker_env/lib/python3.7/site-packages/pycuda/cuda /tmp/tmp2hdc51cy.cu --compiler-options -P]
[stderr:
b'cc1plus: fatal error: cuda_runtime.h: No such file or directory\ncompilation terminated.\n']

Marking J1/imported/017013418492253161062_14sep05c_00024sq_00003hl_00005es.frames.tif as incomplete and continuing...

I see that the production installation directory path gets built into a number of files, so the way I installed my test instance – by copying the production installation to another location – probably can’t be expected to work. I’ll try a clean installation. Perhaps that will fix the include path problem too.
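A rough way to see how widespread that is, using the paths from this thread:

# list files under the copied worker tree that still embed the production path
grep -rl "/projects/MOLBIO/local/cryosparc/cryosparc2_worker" \
    /projects/MOLBIO/local/cryosparc-della-test/cryosparc2_worker 2>/dev/null | head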

– Matthew

If cluster_info.json is used, as intended, with cryosparcm cluster connect, the worker_bin_path will find its way into the database. This can be confirmed with the command
cryosparcm cli "get_scheduler_targets()"
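A sketch of that workflow (the directory location is illustrative):

# run from the directory that contains the updated cluster_info.json and cluster_script.sh
cd /path/to/della-cluster-config
cryosparcm cluster connect
# verify the worker_bin_path now recorded in the database
cryosparcm cli "get_scheduler_targets()"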


I believe that the installation path winds up in some source files, and not just the database. Anyway, I have done a clean installation and now have CryoSPARC running on the new cluster. Thanks for all your help. There were a couple of things that I needed to fix. Since this thread is rather long and messy, I’ll open separate ones.

Best,
Matthew