Master instance migration | cluster setup

Dear CryoSPARC Team,

We have recently migrated our master instance to a new physical machine according to the documentation.
Everything worked well except for some strange behaviour when submitting jobs.
Our setup consists of an independent master instance and workers that are part of an HPC cluster, where all instances have access to a common storage server. After the migration, we updated the submission template files cluster_info.json and cluster_script.sh. Jobs submitted afterwards were launched on the HPC compute nodes as before. However, according to the GUI, the jobs remain stuck in the launch state. Checking on the HPC cluster with a simple squeue -j <jobid> shows that the job is running and actually producing results in the respective directories, but this information is not recognized by the master.
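
For reference, the new key path is what the edited templates now reference; a quick way to double-check the files on disk is a grep like the one below (the path is a placeholder for ours):

# Show every line of the edited templates that references an ssh identity file;
# useful to confirm that the files on disk only mention the new key path.
grep -n 'ssh -i' /path/to/lane/cluster_info.json /path/to/lane/cluster_script.sh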

The master instance has rwx access to the directories with the results.
Jobs seem to be submitted.
The ssh commands also work when executed manually. So why is the master unable to read the necessary information about job status and logs?
When checking the command_core log, the master is still looking for the old SSH key name and path, which had been changed before the jobs were submitted to the cluster.
Is the information about the SSH key and its path somehow stored by the CryoSPARC master besides the cluster_info.json file of the specific lane? And is it trying to use this old key to obtain the necessary information about a job that was successfully submitted via the new key?

I would be glad for any ideas or help.

Best
Max

Did you subsequently run
cryosparcm cluster connect ?
What is the output of the command

cryosparcm cli "get_scheduler_targets()"

?
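
If the target stored in the database still embeds the old key path, that could explain why the status checks fail. A quick, hedged way to inspect this, assuming the key is referenced via ssh -i inside your cluster command templates:

# Extract any "ssh -i <keyfile>" fragments from the stored scheduler targets;
# an old path showing up here would mean the lane's cached templates were not refreshed.
cryosparcm cli "get_scheduler_targets()" | grep -o 'ssh -i [^ ]*' | sort -u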

Thank you for your fast reply.

Yes, I did, and the lane was successfully added. Otherwise job submission would probably not have worked, right?

I sent you the output as a private message.

I wonder whether CryoSPARC correctly extracts the cluster job ID from the sbatch output. Could you please create a script test_job_123.slurm (substituting an actual partition in the #SBATCH -p line):

#! /bin/sh
#SBATCH --job-name this_is_a_test
#SBATCH -n 1
#SBATCH --gres=gpu:1
#SBATCH -p yourpartition
#SBATCH --mem=1G
#SBATCH -t 12:00:00
hostname -f

and run

sbatch test_job_123.slurm > test_sub_123.out 2> test_sub_123.err

and post the contents of test_sub_123.out and test_sub_123.err.
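
For context, what would need to be extracted from that file is the numeric job ID in SLURM's standard confirmation line; by hand, that would look roughly like this:

# sbatch prints "Submitted batch job <id>" on stdout; the ID is the last field.
awk '/Submitted batch job/ {print $NF}' test_sub_123.out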

Hm, this is strange. I executed the script on the cluster directly to make sure it works as it should, which gave the feedback:

Submitted batch job 17389905

Here is the content of test_sub_123.out:

Submitted batch job 17389906

test_sub_123.err was empty.
In addition, our cluster produces a SLURM job report for each job for diagnosis, which you can find here:

r12n34.palma.wwu
################################# JOB REPORT ##################################
Job ID: 17389905
Cluster: palma2
User/Group: XXX/XXX
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:00:01 core-walltime
Job Wall-clock time: 00:00:01
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 1.00 GB (1.00 GB/node)
###############################################################################

So far so good. Executing it from the CryoSPARC master node (user information is redacted as XXX) via
ssh -i ~/.ssh/XXX XXX@palma.uni-muenster.de sbatch /scratch/tmp/XXX/test/queue_sub_script.sh
gives the feedback
Submitted batch job 17390054
Neither the .out nor the .err file is created. The job also did not submit an additional job as it did before. The only indication that the job actually existed is the SLURM job report:

r12n34.palma.wwu
################################# JOB REPORT ##################################
Job ID: 17390054
Cluster: palma2
User/Group: XXX/XXX
State: FAILED (exit code 1)
Cores: 1
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:00:02 core-walltime
Job Wall-clock time: 00:00:02
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 1.00 GB (1.00 GB/node)
###############################################################################

To summarize: jobs are submitted as they should be when the script is submitted directly on the cluster. Submitting the script remotely produces positive feedback, but the code is not actually executed. The permissions of the submission file are the same as before (-rwxr--r--).
I currently have no explanation for this behavior.
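
One more thing I could try for diagnosis (a sketch; the remote_sub.* file names are just placeholders): capture the remote submission's stderr explicitly and pin the working directory, since by default SLURM uses the directory sbatch is invoked from as the job's working directory, which differs between a shell in /scratch and a non-interactive ssh session starting in $HOME:

# Submit from the intended directory so SLURM's default output lands there,
# and keep stdout/stderr of the remote sbatch call for inspection.
ssh -i ~/.ssh/XXX XXX@palma.uni-muenster.de \
    'cd /scratch/tmp/XXX/test && sbatch queue_sub_script.sh' \
    > remote_sub.out 2> remote_sub.err
cat remote_sub.out remote_sub.err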

Also, when I submitted a normal sharpening job from the CryoSPARC master node, I got new, interesting output which may help you to find the problem. At first glance everything looks like it is working fine, except that the output is not forwarded to the event log of the web interface. However, SLURM marks the job as failed on the cluster. Here is the feedback of the sharpening job submission from the CryoSPARC event log:



-------- Submission command: 
ssh -i ~/.ssh/xxx xxx@palma.uni-muenster.de sbatch /scratch/tmp/xxx/Projects/CS-xxx/J156/queue_sub_script.sh

-------- Cluster Job ID: 
17390059

-------- Queued on cluster at 2023-08-07 08:23:22.304913

-------- Cluster job status at 2023-08-07 08:28:05.794706 (25 retries)
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

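(The empty squeue listing presumably just means the job had already left the queue at that point; its final state can still be looked up in SLURM's accounting database, e.g.:)

# Look up the finished job's final state and exit code in the accounting records.
sacct -j 17390059 --format=JobID,State,ExitCode,Elapsed
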
The output of ll ./J156:

total 67
-rwxr--r-- 1 xxx xxx    18 Aug  7 08:23 events.bson
drwxr-xr-x 2 xxx xxx  4096 Aug  7 08:21 gridfs_data
-rwxr--r-- 1 xxx xxx 13239 Aug  7 08:23 job.json
-rw-r--r-- 1 xxx xxx  5240 Aug  7 08:24 job.log
-rw-r--r-- 1 xxx xxx    90 Aug  7 08:22 P60_J156.err
-rw-r--r-- 1 xxx xxx   425 Aug  7 08:24 P60_J156.out
-rwxr--r-- 1 xxx xxx  2807 Aug  7 08:23 queue_sub_script.sh

And some interesting output from job.log (xxx stands for user information; the full output was sent via private message):

================= xxxCW =======  2023-08-07 08:22:20.481898  =========
Project P60 Job J156
Master xxx.uni-muenster.de Port 39002
===========================================================================
========= monitor process now starting main process at 2023-08-07 08:22:20.482012
MAINPROCESS PID 21734
*** CommandClient: (http://xxx.uni-muenster.de:39002/api) URL Error [Errno 110] Connection timed out
Process Process-1:
Traceback (most recent call last):
  File "/home/c/xxx/xxx_worker/download/xxx_worker/xxxc_tools/xxxc/command.py", line 104, in func
    with make_json_request(self, "/api", data=data) as request:
  File "/home/c/xxx/xxx_worker/download/xxx_worker/deps/anaconda/envs/xxx_worker_env/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/c/xxx/xxx_worker/download/xxx_worker/xxxc_tools/xxxc/command.py", line 191, in make_request
    raise CommandClient.Error(client, error_reason, url=url)
xxxc_tools.xxxc.command.CommandClient.Error: *** CommandClient: (http://xxx.uni-muenster.de:39002/api) URL Error [Errno 110] Connection timed out
... ...
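
My suspicion is that the compute node cannot open a connection back to the master's command_core port. A basic reachability check from a compute node would be something along these lines (hostname and port taken from the log above):

# From an HPC compute node: can we reach the master's command_core port at all?
# A timeout here (rather than an HTTP response or "connection refused") points
# to a network/firewall issue between the cluster and the master.
curl -v --max-time 10 http://xxx.uni-muenster.de:39002
# or, without curl:
nc -zv -w 10 xxx.uni-muenster.de 39002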

The issue is solved with two changes. Apparently, there were global ACL rules within the subnet of our servers that had been set by the university IT security team without our knowledge. These ACLs have now been adjusted, which resolved the connection issue. The second problem was that the hostname alias was not being resolved; changing the hostname removed this last issue. Sorry for the inconvenience and thank you for your great help.
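
For anyone hitting the hostname part of this later: on a standard install (an assumption on my side; the path below matches the one in our logs), the name the master advertises to workers is the CRYOSPARC_MASTER_HOSTNAME entry in cryosparc_master/config.sh, and the instance needs a restart after changing it:

# Check which hostname the master currently advertises to workers.
grep CRYOSPARC_MASTER_HOSTNAME /home/cryospar/cryosparc_master/config.sh
# After editing it to a name that resolves from both master and workers:
cryosparcm restart
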
Also, while it is not directly an issue because it does not seem to harm the calculations, command_rtp is repeatedly printing the following block of errors (any idea what it could be and how to resolve it?):

2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    | File Engine Scheduler Failed
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    | Traceback (most recent call last):
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |   File "/home/cryospar/cryosparc_master/cryosparc_command/command_rtp/__init__.py", line 109, in background_worker
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |     run_file_engine(file_engine['project_uid'], file_engine['session_uid'])
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |   File "/home/cryospar/cryosparc_master/cryosparc_command/commandcommon.py", line 191, in wrapper
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |     return func(*args, **kwargs)
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |   File "/home/cryospar/cryosparc_master/cryosparc_command/command_rtp/__init__.py", line 637, in run_file_engine
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |     find_new_files( project_uid = project_uid,
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |   File "/home/cryospar/cryosparc_master/cryosparc_command/command_rtp/__init__.py", line 676, in find_new_files
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |     new_files = _filetail_engine(project_uid, session_uid, entities, strategy)
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |   File "/home/cryospar/cryosparc_master/cryosparc_command/command_rtp/__init__.py", line 804, in _filetail_engine
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |     mongo.db['workspaces'].update_one(
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |   File "/home/cryospar/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/collection.py", line 1132, in update_one
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |     self._update_retryable(
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |   File "/home/cryospar/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/collection.py", line 961, in _update_retryable
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |     return self.__database.client._retryable_write(
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |   File "/home/cryospar/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1644, in _retryable_write
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |     return self._retry_with_session(retryable, func, s, None)
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |   File "/home/cryospar/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1532, in _retry_with_session
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |     return self._retry_internal(retryable, func, session, bulk)
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |   File "/home/cryospar/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/mongo_client.py", line 1565, in _retry_internal
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |     return func(session, sock_info, retryable)
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |   File "/home/cryospar/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/collection.py", line 942, in _update
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |     return self._update(
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |   File "/home/cryospar/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/collection.py", line 907, in _update
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |     _check_write_command_response(result)
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |   File "/home/cryospar/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/helpers.py", line 249, in _check_write_command_response
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |     _raise_last_write_error(write_errors)
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |   File "/home/cryospar/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/pymongo/helpers.py", line 222, in _raise_last_write_error
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    |     raise WriteError(error.get("errmsg"), error.get("code"), error)
2023-08-08 12:23:42,448 RTP.BG_WORKER        background_worker    ERROR    | pymongo.errors.WriteError: BSONObj size: 16837738 (0x100EC6A) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: ObjectId('64b4113e697f73145052ff9a'), full error: {'index': 0, 'code': 10334, 'errmsg': "BSONObj size: 16837738 (0x100EC6A) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: ObjectId('64b4113e697f73145052ff9a')"}

@mruetter Thanks for reporting this pymongo.errors.WriteError. The error indicates that a workspace document has grown beyond MongoDB's 16 MB BSON document limit, which can occur when processing a large number (approx. > 10,000) of movies. We do not currently have a workaround but plan to fix this in a future release. Processing results should not be affected.
