SSD cache : cache successfully synced in_use

Rajiv-Singh · September 25, 2023, 9:22pm

Data analysis does not progress after the message "SSD cache : cache successfully synced in_use".
What could be the reason?

[CPU: 196.3 MB Avail: 40.79 GB]
Running on lane default

[CPU: 196.3 MB Avail: 40.79 GB]
Resources allocated:

[CPU: 196.3 MB Avail: 40.79 GB]
Worker: BIOCHEM-7048GR-TR

[CPU: 196.3 MB Avail: 40.78 GB]
CPU : [0, 1, 2, 3]

[CPU: 196.3 MB Avail: 40.78 GB]
GPU : [0]

[CPU: 196.3 MB Avail: 40.78 GB]
RAM : [0, 1, 2]

[CPU: 196.3 MB Avail: 40.78 GB]
SSD : True

[CPU: 196.3 MB Avail: 40.78 GB]

[CPU: 196.3 MB Avail: 40.78 GB]
Importing job module for job type nonuniform_refine_new…

[CPU: 261.7 MB Avail: 40.83 GB]
Job ready to run

[CPU: 261.7 MB Avail: 40.83 GB]

[CPU: 333.9 MB Avail: 40.74 GB]
Using random seed of 1156566140

[CPU: 340.2 MB Avail: 40.73 GB]
Loading a ParticleStack with 77496 items…

[CPU: 340.4 MB Avail: 40.65 GB]
SSD cache : cache successfully synced in_use

I waited almost an hour, then killed the job, restarted CryoSPARC, and reran the job; still no progress.

wtempel · September 26, 2023, 4:31pm

Please can you provide additional information:

CryoSPARC version and patch
number of particles
number of particle stacks
project directory storage:
- network or local storage
- if network storage, rated speed of network: 1 Gbps? 10 Gbps?

Rajiv-Singh · September 26, 2023, 5:07pm

Hi @wtempel,

Our instance runs locally on our workstation, with CryoSPARC version v4.3.0+230816.

I have run the same job previously and it ran smoothly. This time it is not progressing, and it still shows the same status now.

Storage is local only, on the same workstation. The number of particles for this job is 77496, after ab initio reconstruction. Last week I ran it with 216000 particles from the same data set, and it worked.

Please let me know if you need further information.

wtempel · September 29, 2023, 4:14pm

@Rajiv-Singh Do you see any useful information in the job log (Metadata|Log) for this job?

Rajiv-Singh · September 29, 2023, 10:49pm

The job did not progress so I killed it.

wtempel · October 2, 2023, 7:43pm

Unless the killed job has also been cleared or deleted, it should still have information under Metadata|Log. What information is shown therein?

Rajiv-Singh · October 11, 2023, 12:15am

Hi, the job was cleared.

However, a 2D classification job run with SSD cache enabled is again not progressing. I guess there is some issue with copying particles to the cache. I can send you the log files. It seems that no job using the SSD cache is progressing.

wtempel · October 11, 2023, 1:37pm

For this job, please post the

end of the Event Log
the job log under Metadata|Log.

What kind of device (connection type, size) do you use for caching?

Rajiv-Singh · October 11, 2023, 2:40pm

End of the event log as follows:

[2023-10-10 12:51:22.19] License is valid.
[2023-10-10 12:51:22.19] Launching job on lane default target BIOCHEM-7048GR-TR …
[2023-10-10 12:51:22.21] Running job on master node hostname BIOCHEM-7048GR-TR
[2023-10-10 12:51:30.41] [CPU: 196.3 MB Avail: 41.07 GB] Job J159 Started
[2023-10-10 12:51:30.42] [CPU: 196.3 MB Avail: 41.06 GB] Master running v4.3.0+230816, worker running v4.3.0+230816
[2023-10-10 12:51:30.44] [CPU: 196.3 MB Avail: 41.04 GB] Working in directory:
/data/EmilyC/20230605_EC_33_1/CS-20230612-
ter/J159
[2023-10-10 12:51:30.45] [CPU: 196.3 MB Avail: 41.04 GB] Running on lane default
[2023-10-10 12:51:30.45] [CPU: 196.3 MB Avail: 41.04 GB] Resources allocated:
[2023-10-10 12:51:30.46] [CPU: 196.3 MB Avail: 41.03 GB] Worker: BIOCHEM-7048GR-TR
[2023-10-10 12:51:30.46] [CPU: 196.3 MB Avail: 41.02 GB] CPU : [0, 1]
[2023-10-10 12:51:30.46] [CPU: 196.3 MB Avail: 41.02 GB] GPU : [0, 1, 2, 3]
[2023-10-10 12:51:30.47] [CPU: 196.3 MB Avail: 41.02 GB] RAM : [0, 1, 2]
[2023-10-10 12:51:30.47] [CPU: 196.3 MB Avail: 41.02 GB] SSD : True
[2023-10-10 12:51:30.47] [CPU: 196.3 MB Avail: 41.01 GB] --------------------------------------------------------------
[2023-10-10 12:51:30.48] [CPU: 196.3 MB Avail: 41.00 GB] Importing job module for job type class_2D…
[2023-10-10 12:51:31.37] [CPU: 224.7 MB Avail: 41.51 GB] Job ready to run
[2023-10-10 12:51:31.38] [CPU: 224.8 MB Avail: 41.51 GB] ***************************************************************
[2023-10-10 12:51:42.75] [CPU: 768.2 MB Avail: 41.16 GB] Using random seed of 1670268754
[2023-10-10 12:51:42.92] [CPU: 872.2 MB Avail: 41.00 GB] Loading a ParticleStack with 960576 items…
[2023-10-10 12:51:48.93] [CPU: 872.3 MB Avail: 40.14 GB] SSD cache : cache successfully synced in_use

This is a Linux workstation
Linux BIOCHEM-7048GR-TR 4.15.0-39-generic #42~16.04.1-Ubuntu SMP Wed Oct 24 17:09:54 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Target 1: BIOCHEM-7048GR-TR (node)
Cores: 32
Memory: 128 GB
GPUs: 4
Worker bin path: /opt/cryosparc2/cryosparc_worker/bin/cryosparcw
Hostname: BIOCHEM-7048GR-TR
Name: BIOCHEM-7048GR-TR
Cache Path: /ssd/cryosparc2_scratch/
SSH String: spuser@BIOCHEM-7048GR-TR
Cache Reserve (MB): 10000
Filesystem Size Used Avail Use% Mounted on
ssd 861G 235G 626G 28% /ssd

lscpu | grep "cache"
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K

I am not able to upload the log file.

wtempel · October 11, 2023, 4:55pm

Please can you paste the contents in this forum topic.

Rajiv-Singh · October 11, 2023, 5:43pm

This gives an error as follows:
An error occurred: Body is limited to 32000 characters; you entered 67908

Rajiv-Singh · October 11, 2023, 6:04pm

================= CRYOSPARCW ======= 2023-10-10 12:51:25.250715 =========
Project P22 Job J159
Master BIOCHEM-7048GR-TR Port 39002

========= monitor process now starting main process at 2023-10-10 12:51:25.250774
MAINPROCESS PID 26534
========= monitor process now waiting for main process
MAIN PID 26534
class2D.run cryosparc_compute.jobs.jobregister
========= sending heartbeat at 2023-10-10 12:51:40.497384
========= sending heartbeat at 2023-10-10 12:51:50.517348
========= sending heartbeat at 2023-10-10 12:52:00.644494
========= sending heartbeat at 2023-10-10 12:52:10.854909
========= sending heartbeat at 2023-10-10 12:52:20.909381
========= sending heartbeat at 2023-10-10 12:52:30.929378
========= sending heartbeat at 2023-10-10 12:52:41.788302

Many more of these heartbeat lines follow.

The last part of the log reads as follows:
========= sending heartbeat at 2023-10-10 16:14:56.075678
========= sending heartbeat at 2023-10-10 16:15:06.095005
========= sending heartbeat at 2023-10-10 16:15:16.116277
========= sending heartbeat at 2023-10-10 16:15:26.133141
========= sending heartbeat at 2023-10-10 16:15:36.150743
========= sending heartbeat at 2023-10-10 16:15:46.167556
========= sending heartbeat at 2023-10-10 16:15:56.184891
========= sending heartbeat at 2023-10-10 16:16:06.203710
========= sending heartbeat at 2023-10-10 16:16:16.221898
========= sending heartbeat at 2023-10-10 16:16:26.239514
========= sending heartbeat at 2023-10-10 16:16:36.258213
========= sending heartbeat at 2023-10-10 16:16:46.276202
========= sending heartbeat at 2023-10-10 16:16:56.293377
========= sending heartbeat at 2023-10-10 16:17:06.313209
========= sending heartbeat at 2023-10-10 16:17:16.331930
========= sending heartbeat at 2023-10-10 16:17:26.349241
========= sending heartbeat at 2023-10-10 16:17:36.366142
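
The heartbeat lines on their own show that the job process stayed alive without producing any other output. As a rough illustration, the span between the first and last heartbeat can be computed with a few lines of Python. This is only a sketch: the function name `heartbeat_span` is made up here, and the line format is assumed to match the excerpt above.

```python
import re
from datetime import datetime

# Matches lines such as:
# ========= sending heartbeat at 2023-10-10 12:51:40.497384
HEARTBEAT = re.compile(
    r"sending heartbeat at (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)"
)


def heartbeat_span(log_text):
    """Return (first, last, last - first) for the heartbeats in log_text.

    Returns None if the text contains no heartbeat lines.
    """
    stamps = [
        datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f")
        for s in HEARTBEAT.findall(log_text)
    ]
    if not stamps:
        return None
    return stamps[0], stamps[-1], stamps[-1] - stamps[0]
```

Applied to the first and last heartbeat quoted above (12:51:40 and 16:17:36), this gives a span of roughly three and a half hours during which the job log contains nothing but heartbeats.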

JMB · April 24, 2024, 7:17am

Hi there,

I just had the same problem (stalled on "SSD cache : cache successfully synced in_use").

I then realised that my scratch disk had somehow stopped responding: I could see it with "df" and could "cd" into it, but not even "ls" worked.

So I just rebooted the system, and the scratch disk was accessible again and CryoSPARC also works properly again.
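
A hung scratch disk like this can be detected without blocking your own shell by listing the directory in a child process with a timeout. The following is a minimal Python sketch, not anything CryoSPARC provides: the function name `cache_dir_responds` and the 10-second default are assumptions chosen for illustration.

```python
import subprocess
import sys


def cache_dir_responds(path, timeout_s=10.0):
    """Return True if listing `path` completes within `timeout_s` seconds."""
    # List the directory in a child process, so that a hung filesystem
    # blocks the child rather than this script.
    cmd = [sys.executable, "-c", "import os, sys; os.listdir(sys.argv[1])", path]
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        # Exit code 0 means the listing succeeded in time.
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in uninterruptible I/O
        # may ignore the kill until the device responds again.
        proc.kill()
        return False
```

For example, `cache_dir_responds("/ssd/cryosparc2_scratch/")` would probe the cache path quoted earlier in this thread; substitute your own cache path. The sketch deliberately does not wait for the child after the timeout, because waiting on a process stuck in disk I/O could hang the probe itself.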

rbs_sci · April 24, 2024, 8:01am

Check your logs for why it wasn’t readable - normally if a drive has issues it will be remounted read-only automatically; if even that failed I’d be concerned about a failing or corrupt disk. That said, SSDs tend to fail hard and without warning, so it might be something else…

JMB · April 25, 2024, 7:03am

@rbs_sci
Thanks for the advice. Using
sudo journalctl -k | grep -i name_of_ssd_disk
or
sudo grep -i name_of_ssd_disk /var/log/messages

I could not find any error (but in "/var/log/messages" in particular, only entries from after the reboot are visible).
Is there any other way to "check logs for why it wasn’t readable"?
I appreciate your help.

rbs_sci · April 25, 2024, 7:44am

/var/log/syslog might have something. In the past, a failing drive has always spewed into dmesg for me, so when I’m diagnosing a system I set up a cron job to dump dmesg to a text file every 24 hours.

It might also be worth checking dmesg for SATA (controller-related) errors. I once had a bad SATA cable (it came with the motherboard…!): the disk was detected, but writing to it would make the SATA controller drop off the PCI-E bus.

Out of curiosity, what model is the SSD? I have a vague memory of buggy SSD firmware years ago (OCZ or Corsair, IIRC?) where the drives wouldn’t come out of sleep properly. For the record, I don’t think this issue is that, as it was pretty early in the days of SSDs being available, but I mention it for completeness.

edit: Also check SMART with smartmontools (smartctl) or similar (some UEFI BIOSes have a SMART check), or with Gnome Disks, and see if anything looks suspicious there…?
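
Since controller errors are usually logged under an ATA port name rather than the disk name, grepping only for the disk name can miss them. Below is a small Python sketch that filters kernel log text for a few common storage error phrases. The pattern list is illustrative and far from exhaustive, and the name `suspicious_lines` is made up for this example.

```python
import re

# Illustrative phrases only; real kernel messages vary by driver and kernel version.
STORAGE_ERROR = re.compile(
    r"I/O error|Remounting filesystem read-only|failed command|hard resetting link",
    re.IGNORECASE,
)


def suspicious_lines(kernel_log_text):
    """Return the lines of kernel_log_text that match a storage error phrase."""
    return [
        line
        for line in kernel_log_text.splitlines()
        if STORAGE_ERROR.search(line)
    ]
```

The text to scan could come from the output of `dmesg` (which may need root) or from a saved dump. On systems where the journal is persistent, `sudo journalctl -k -b -1` prints the kernel messages of the previous boot, which is where errors from before a reboot would be found.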
