Extract from micrographs stops abruptly

Hello,
I have been running into an issue with a particular job. I launch an “Extract from Micrographs” job; it starts correctly, but then stops abruptly after 5-30 minutes, having processed roughly the same number of micrographs each time (around 3500). The job is listed as failed, but there is no error in the event log, and the metadata log shows only the following lines before the stop:

========= sending heartbeat at 2023-07-24 16:23:25.674665
========= main process now complete at 2023-07-24 16:23:37.405248.
========= monitor process now complete at 2023-07-24 16:23:43.162055.

I am at a loss, because an identical job on the other part of the same dataset proceeded smoothly with no issues…
Please don’t hesitate to request additional information.

Please can you post:

  • the five additional job log lines that precede the 16:23:25.674665 heartbeat
  • the final 10 lines of the event log
  • lines from the command_core log (/path/to/cryosparc_master/run/command_core.log or
    cryosparcm log command_core) that pertain to the failed extraction job.
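For example, the command_core entries for the failed job (J208 in this thread) could be filtered with something like the following sketch; the log path is a placeholder for your own installation:

```shell
# Filter command_core log entries that mention the failed job (J208 here).
# The log path below is a placeholder; substitute your installation's path,
# or stream the log instead with: cryosparcm log command_core
grep -n "J208" /path/to/cryosparc_master/run/command_core.log | tail -n 40
```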

You may also want to rule out

  • potential OS-level causes of the failure:
    • exhausted storage capacity or quota on the volume that holds the project directory
    • exhausted RAM on the worker
  • repeatable failure of the job when extracting from specific micrographs
  • repeatable failure of the job on a specific worker node
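On a typical Linux worker, the storage and memory checks above can be sketched as follows (run from, or pointed at, your project directory; the Lustre quota line is an assumption for cluster filesystems):

```shell
# Free space and capacity on the volume holding the current directory
# (run this from the project directory, or pass its path instead of "."):
df -h .

# On Lustre filesystems, common on clusters, also check quotas:
# lfs quota -h .

# Available RAM and swap on this node:
free -h

# Did the kernel's OOM killer recently terminate a process? (may need root)
dmesg -T 2>/dev/null | grep -i "out of memory" | tail -n 5 || true
```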

Here are the logs that you requested.
Job log:

> ========= sending heartbeat at 2023-07-24 16:22:25.361295
> ========= sending heartbeat at 2023-07-24 16:22:37.471547
> ========= sending heartbeat at 2023-07-24 16:22:49.460845
> ========= sending heartbeat at 2023-07-24 16:23:01.063566
> ========= sending heartbeat at 2023-07-24 16:23:15.300495

Event log:

> [CPU:   1.33 GB]
> -- 0.0: (3479 of 17355) processing J150/motioncorrected/010459585116581381098_FoilHole_29404620_Data_29377140_29377142_20230429_164609_EER_patch_aligned_doseweighted.mrc
>         5 particles extracted (5 rejected near edges)
>         Writing to J208/extract/010459585116581381098_FoilHole_29404620_Data_29377140_29377142_20230429_164609_EER_patch_aligned_doseweighted_particles.mrc
>         Total (asynchronous) processing time: 1.64s
> 
> [CPU:   1.39 GB]
> -- 1.0: (3480 of 17355) processing J150/motioncorrected/008853555362268766809_FoilHole_29404620_Data_29377152_29377154_20230429_164557_EER_patch_aligned_doseweighted.mrc
>         2 particles extracted (0 rejected near edges)
>         Writing to J208/extract/008853555362268766809_FoilHole_29404620_Data_29377152_29377154_20230429_164557_EER_patch_aligned_doseweighted_particles.mrc
>         Total (asynchronous) processing time: 1.86s
> 
> [CPU:   1.56 GB]
> -- 0.1: (3481 of 17355) processing J150/motioncorrected/010131049080804289702_FoilHole_29404621_Data_29376983_29376985_20230429_164657_EER_patch_aligned_doseweighted.mrc
> 
> [CPU:   1.56 GB]
> -- 0.1: (3482 of 17355) processing J150/motioncorrected/011507918805205004349_FoilHole_29404621_Data_29377140_29377142_20230429_164703_EER_patch_aligned_doseweighted.mrc
>         1 particles extracted (4 rejected near edges)
>         Writing to J208/extract/011507918805205004349_FoilHole_29404621_Data_29377140_29377142_20230429_164703_EER_patch_aligned_doseweighted_particles.mrc
>         Total (asynchronous) processing time: 1.59s
> 
> [CPU:   1.57 GB]
> -- 1.1: (3483 of 17355) processing J150/motioncorrected/010607545151755570022_FoilHole_29404621_Data_29377152_29377154_20230429_164651_EER_patch_aligned_doseweighted.mrc
> 
> [CPU:   1.57 GB]
> -- 1.1: (3484 of 17355) processing J150/motioncorrected/016344257819808203700_FoilHole_29404622_Data_29376983_29376985_20230429_165133_EER_patch_aligned_doseweighted.mrc
>         1 particles extracted (0 rejected near edges)
>         Writing to J208/extract/016344257819808203700_FoilHole_29404622_Data_29376983_29376985_20230429_165133_EER_patch_aligned_doseweighted_particles.mrc
>         Total (asynchronous) processing time: 1.79s
> 
> [CPU:   1.40 GB]
> -- 0.0: (3485 of 17355) processing J150/motioncorrected/003971386599620482875_FoilHole_29404622_Data_29377140_29377142_20230429_165139_EER_patch_aligned_doseweighted.mrc
>         1 particles extracted (1 rejected near edges)
>         Writing to J208/extract/003971386599620482875_FoilHole_29404622_Data_29377140_29377142_20230429_165139_EER_patch_aligned_doseweighted_particles.mrc
>         Total (asynchronous) processing time: 1.60s
> 
> [CPU:  137.1 MB]
> ====== Job process terminated abnormally.
> 
> [CPU:   1.57 GB]
> -- 1.0: (3486 of 17355) processing J150/motioncorrected/001529413231024034262_FoilHole_29404622_Data_29377152_29377154_20230429_165127_EER_patch_aligned_doseweighted.mrc
>         1 particles extracted (1 rejected near edges)
>         Writing to J208/extract/001529413231024034262_FoilHole_29404622_Data_29377152_29377154_20230429_165127_EER_patch_aligned_doseweighted_particles.mrc
>         Total (asynchronous) processing time: 1.63s
> 
> [CPU:   1.56 GB]
> -- 0.1: (3487 of 17355) processing J150/motioncorrected/013805053250859322033_FoilHole_29404623_Data_29376983_29376985_20230429_165152_EER_patch_aligned_doseweighted.mrc
> 
> [CPU:   1.56 GB]
> -- 0.1: (3488 of 17355) processing J150/motioncorrected/008385003645767961935_FoilHole_29404623_Data_29377140_29377142_20230429_165158_EER_patch_aligned_doseweighted.mrc

Command_core log:

> 2023-07-24 16:23:37,841 COMMAND.MAIN         start                INFO     |  === STARTED === 
> 2023-07-24 16:23:37,842 COMMAND.BG_WORKER    background_worker    INFO     |  === STARTED === 
> 2023-07-24 16:23:37,843 COMMAND.CORE         run                  INFO     | === STARTED TASKS WORKER ===
>  * Serving Flask app "command_core" (lazy loading)
>  * Environment: production
>    WARNING: This is a development server. Do not use it in a production deployment.
>    Use a production WSGI server instead.
>  * Debug mode: off
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/cryosparc_command/command_core/__init__.py", line 213, in start
>     app.run(host="0.0.0.0", port=port, threaded=True, passthrough_errors=False)
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/flask/app.py", line 990, in run
>     run_simple(host, port, self, **options)
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/werkzeug/serving.py", line 1052, in run_simple
>     inner()
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/werkzeug/serving.py", line 996, in inner
>     srv = make_server(
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/werkzeug/serving.py", line 847, in make_server
>     return ThreadedWSGIServer(
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/werkzeug/serving.py", line 740, in __init__
>     HTTPServer.__init__(self, server_address, handler)
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/socketserver.py", line 452, in __init__
>     self.server_bind()
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/http/server.py", line 138, in server_bind
>     socketserver.TCPServer.server_bind(self)
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/socketserver.py", line 466, in server_bind
>     self.socket.bind(self.server_address)
> OSError: [Errno 98] Address already in use
> 2023-07-24 16:23:40,197 COMMAND.JOBS         set_job_status       INFO     | Status changed for P6.J208 from running to failed
> 2023-07-24 16:23:50,415 COMMAND.MAIN         start                INFO     |  === STARTED === 
> 2023-07-24 16:23:50,416 COMMAND.BG_WORKER    background_worker    INFO     |  === STARTED === 
> 2023-07-24 16:23:50,416 COMMAND.CORE         run                  INFO     | === STARTED TASKS WORKER ===
>  * Serving Flask app "command_core" (lazy loading)
>  * Environment: production
>    WARNING: This is a development server. Do not use it in a production deployment.
>    Use a production WSGI server instead.
>  * Debug mode: off
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/cryosparc_command/command_core/__init__.py", line 213, in start
>     app.run(host="0.0.0.0", port=port, threaded=True, passthrough_errors=False)
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/flask/app.py", line 990, in run
>     run_simple(host, port, self, **options)
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/werkzeug/serving.py", line 1052, in run_simple
>     inner()
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/werkzeug/serving.py", line 996, in inner
>     srv = make_server(
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/werkzeug/serving.py", line 847, in make_server
>     return ThreadedWSGIServer(
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/werkzeug/serving.py", line 740, in __init__
>     HTTPServer.__init__(self, server_address, handler)
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/socketserver.py", line 452, in __init__
>     self.server_bind()
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/http/server.py", line 138, in server_bind
>     socketserver.TCPServer.server_bind(self)
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/socketserver.py", line 466, in server_bind
>     self.socket.bind(self.server_address)
> OSError: [Errno 98] Address already in use
> 2023-07-24 16:23:57,873 COMMAND.MAIN         start                INFO     |  === STARTED === 
> 2023-07-24 16:23:57,874 COMMAND.BG_WORKER    background_worker    INFO     |  === STARTED === 
> 2023-07-24 16:23:57,874 COMMAND.CORE         run                  INFO     | === STARTED TASKS WORKER ===
>  * Serving Flask app "command_core" (lazy loading)
>  * Environment: production
>    WARNING: This is a development server. Do not use it in a production deployment.
>    Use a production WSGI server instead.
>  * Debug mode: off
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/cryosparc_command/command_core/__init__.py", line 213, in start
>     app.run(host="0.0.0.0", port=port, threaded=True, passthrough_errors=False)
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/flask/app.py", line 990, in run
>     run_simple(host, port, self, **options)
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/werkzeug/serving.py", line 1052, in run_simple
>     inner()
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/werkzeug/serving.py", line 996, in inner
>     srv = make_server(
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/werkzeug/serving.py", line 847, in make_server
>     return ThreadedWSGIServer(
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/site-packages/werkzeug/serving.py", line 740, in __init__
>     HTTPServer.__init__(self, server_address, handler)
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/socketserver.py", line 452, in __init__
>     self.server_bind()
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/http/server.py", line 138, in server_bind
>     socketserver.TCPServer.server_bind(self)
>   File "/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.8/socketserver.py", line 466, in server_bind
>     self.socket.bind(self.server_address)

As for the other potential causes that you’ve mentioned:

  • Storage capacity and RAM shouldn’t be an issue: this is running on a cluster, and I am granted as much as possible
  • The job fails on different micrographs each time
  • As previously stated, the same job on the other part of the same dataset completed without issues on the same worker, so that should rule out a worker failure?

Thank you for taking the time to help me solve this!!

Was there a deliberate attempt to restart CryoSPARC while the job was running? Jobs running during a CryoSPARC restart may fail.

This error indicates a failure of the CryoSPARC startup. Potential reasons:

  • an incomplete preceding shutdown of CryoSPARC
  • $CRYOSPARC_BASE_PORT and/or subsequent 9 ports overlap with the master host’s ephemeral port range
    cat /proc/sys/net/ipv4/ip_local_port_range
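The overlap check could be sketched as below; the BASE value is an example, so substitute your own $CRYOSPARC_BASE_PORT:

```shell
# Compare CryoSPARC's port range (base port plus the 9 following ports)
# with the kernel's ephemeral port range. BASE is an example value;
# substitute your actual CRYOSPARC_BASE_PORT.
BASE=40400
read LOW HIGH < /proc/sys/net/ipv4/ip_local_port_range
if [ $((BASE + 9)) -ge "$LOW" ] && [ "$BASE" -le "$HIGH" ]; then
    echo "overlap: ports $BASE-$((BASE + 9)) fall inside ephemeral range $LOW-$HIGH"
else
    echo "no overlap: ports $BASE-$((BASE + 9)) vs ephemeral range $LOW-$HIGH"
fi
```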

Hello again,
Thank you for clarifying some of the situation.

I have not attempted to restart CryoSPARC while this job was running; that is true for all of the attempts.

As far as an incomplete shutdown goes, I can see how that could be the case, but what would I need to confirm it?

Here is the output of the cryosparcm status command, as well as the command that you specified:


CryoSPARC System master node installed at
/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC/cryosparc_master
Current cryoSPARC version: v4.2.1

CryoSPARC process status:

app EXITED Jul 24 03:17 PM
app_api RUNNING pid 1513896, uptime 0:00:49
app_api_dev STOPPED Not started
app_legacy STOPPED Not started
app_legacy_dev STOPPED Not started
command_core RUNNING pid 1513274, uptime 0:00:51
command_rtp RUNNING pid 1517912, uptime 0:00:35
command_vis RUNNING pid 1498636, uptime 0:01:45
database RUNNING pid 1694942, uptime 1 day, 19:31:52


License is valid

global config variables:
export CRYOSPARC_LICENSE_ID="XXXXXXXXXXXXX"
export CRYOSPARC_MASTER_HOSTNAME="puhti-login12.bullx"
export CRYOSPARC_DB_PATH="/projappl/project_2006450/usrappl/kyrybisi/cryoSPARC_database/"
export CRYOSPARC_BASE_PORT=40400
export CRYOSPARC_DB_CONNECTION_TIMEOUT_MS=20000
export CRYOSPARC_INSECURE=false
export CRYOSPARC_DB_ENABLE_AUTH=true
export CRYOSPARC_CLUSTER_JOB_MONITOR_INTERVAL=10
export CRYOSPARC_CLUSTER_JOB_MONITOR_MAX_RETRIES=1000000
export CRYOSPARC_PROJECT_DIR_PREFIX='CS-'
export CRYOSPARC_DEVELOP=false
export CRYOSPARC_CLICK_WRAP=true

[kyrybisi@puhti-login12 cryosparc_worker]$ cat /proc/sys/net/ipv4/ip_local_port_range
32768 60999

What can I do to facilitate a proper shutdown and startup?

Thank you again for your time that you are putting into this!!!

After running cryosparcm stop, you will want to confirm the absence of “zombie” CryoSPARC-related processes:

ps -eopid,ppid,start,cmd | grep -e cryosparc -e mongo

Be aware that some systems are configured to host more than one CryoSPARC instance (with strict requirements such as non-overlapping port ranges). In this case you may see CryoSPARC-related processes even if the shutdown of the intended CryoSPARC instance was successful.
For reference, processes of a running CryoSPARC instance with no CryoSPARC jobs currently running:

2094338       1 14:22:46 python /u/wtempel/sw/cryosparc-prod/v4/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/bin/supervisord -c /u/wtempel/sw/cryosparc-prod/v4/cryosparc_master/supervisord.conf
2096575 2094338 14:22:50 mongod --auth --dbpath /u/wtempel/sw/cryosparc-prod/v4/cryosparc_db_61561 --port 61561 --oplogSize 64 --replSet meteor --nojournal --wiredTigerCacheSizeGB 4 --bind_ip_all
2097703 2094338 14:22:53 python -c import cryosparc_command.command_core as serv; serv.start(port=61562)
2100050 2094338 14:22:59 python -c import cryosparc_command.command_vis as serv; serv.start(port=61563)
2100757 2094338 14:23:00 python -c import cryosparc_command.command_rtp as serv; serv.start(port=61565)
2103594 2094338 14:23:07 /u/wtempel/sw/cryosparc-prod/v4/cryosparc_master/cryosparc_app/api/nodejs/bin/node ./bundle/main.js

Note the common supervisord ancestor of various CryoSPARC-related server processes.

The discrepancy in uptimes between the database and other processes is unusual.
You may want to try a “clean” restart of CryoSPARC:

  1. cryosparcm stop
  2. check for “zombie” processes (see ps command above)
  3. kill zombie processes if there are any, using plain kill (SIGTERM). Do not use kill -9 (SIGKILL), which would prevent mongod from shutting down cleanly.
  4. cryosparcm start
  5. observe terminal output for startup errors. In case of errors, examine the instance logs using the command
    cryosparcm log ...
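The steps above could be sketched as a shell session. The instance-specific commands are commented out here, since they act on your particular installation; review each step before running it, and inspect the process list manually rather than killing anything indiscriminately:

```shell
# 1. Stop CryoSPARC:
# cryosparcm stop

# 2. Check for leftover CryoSPARC/MongoDB processes:
ps -eo pid,ppid,start,cmd | grep -e cryosparc -e mongo | grep -v grep || true

# 3. Terminate any leftovers with plain SIGTERM (not kill -9/SIGKILL,
#    which would prevent mongod from shutting down cleanly).
#    <PID> is a placeholder for a process id from step 2:
# kill <PID>

# 4. Restart and watch the terminal for startup errors:
# cryosparcm start

# 5. If errors appear, inspect the instance logs, e.g.:
# cryosparcm log command_core
```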

Hi,
After trying this full restart several times, it seems that I have finally been able to resolve the issue!
Thank you so much!!
