Hard disk drive read-only file system

Hi!

Has anyone ever faced a problem where the hard drive to which cryoSPARC writes job info and outputs suddenly turns into a ‘read-only file system’ and then (obviously) the job fails?
I have never seen such a situation reported in the forum, but we have now hit this problem on two of our workstations.

When we first saw this happening on our other workstation, we thought that it was a hard drive problem. We sent it back for service and they couldn’t detect any problem with the hard drive. In fact, we have had it running for several weeks without cryoSPARC started and nothing goes wrong. From the moment we turn cryoSPARC on and run something, it is a matter of a few hours until it happens. On our most recent workstation, this happened during a Motion Correction job (after it had processed more than 1500 micrographs; all default values) and it is reproducible every time I restart the job or configure a new Motion Correction job. On the other workstation, I saw it happening during different 3D refinement jobs.

Can anyone offer a clue? =)

Thank you,
André

Regarding the Motion Correction job on the recent workstation:

Traceback: [screenshot not preserved]

Last part of cryosparcm log command_core output: [screenshot not preserved]

Last part of cryosparcm joblog P1 J14 output: [screenshot not preserved]

Let me know if any other information is needed.

What is the output of dmesg | tail -n 20 right after the failure?

I did not try that, but I will reproduce the problem once again and run that command immediately afterwards, so you will be able to help me further.

I will get back to you soon. Thank you for your help! :slight_smile:

André

Here I am again with some more detail!

It took me longer, since yesterday the workstation was busy running other software.
About 260 lines are written to dmesg between the moment something unusual starts to happen and the moment cryoSPARC marks the job as ‘Failed’.

Some of the last lines of dmesg -T are posted below. Since you only asked for the last 20, I didn’t dare to share all 260 lines, but they might be useful. I await your feedback. Thank you! :grinning:

[tor jul 2 15:35:59 2020] EXT4-fs warning (device sda): ext4_end_bio:323: I/O error 10 writing to inode 144661592 (offset 8388608 size 8388608 starting block 483275264)
[tor jul 2 15:35:59 2020] EXT4-fs warning (device sda): ext4_end_bio:323: I/O error 10 writing to inode 144661592 (offset 0 size 8388608 starting block 483274752)
[tor jul 2 15:35:59 2020] EXT4-fs warning (device sda): ext4_end_bio:323: I/O error 10 writing to inode 144661592 (offset 0 size 8388608 starting block 483274240)
[tor jul 2 15:35:59 2020] EXT4-fs warning (device sda): ext4_end_bio:323: I/O error 10 writing to inode 144661592 (offset 0 size 8388608 starting block 483273728)
[tor jul 2 15:35:59 2020] EXT4-fs warning (device sda): ext4_end_bio:323: I/O error 10 writing to inode 144661592 (offset 0 size 8388608 starting block 483273216)
[tor jul 2 15:35:59 2020] EXT4-fs warning (device sda): ext4_end_bio:323: I/O error 10 writing to inode 144661596 (offset 0 size 8388608 starting block 483281408)
[tor jul 2 15:35:59 2020] sd 0:0:0:0: [sda] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[tor jul 2 15:35:59 2020] sd 0:0:0:0: [sda] Sense not available.
[tor jul 2 15:35:59 2020] sd 0:0:0:0: [sda] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[tor jul 2 15:35:59 2020] sd 0:0:0:0: [sda] Sense not available.
[tor jul 2 15:35:59 2020] sd 0:0:0:0: [sda] 0 512-byte logical blocks: (0 B/0 B)
[tor jul 2 15:35:59 2020] sd 0:0:0:0: [sda] 4096-byte physical blocks
[tor jul 2 15:35:59 2020] sd 0:0:0:0: [sda] Write Protect is on
[tor jul 2 15:35:59 2020] sd 0:0:0:0: [sda] Mode Sense: 00 a4 58 dd
[tor jul 2 15:35:59 2020] sda: detected capacity change from 10000831348736 to 0
[tor jul 2 15:35:59 2020] JBD2: Detected IO errors while flushing file data on sda-8
[tor jul 2 15:35:59 2020] Aborting journal on device sda-8.
[tor jul 2 15:35:59 2020] JBD2: Error -5 detected when updating journal superblock for sda-8.
[tor jul 2 15:35:59 2020] EXT4-fs (sda): Delayed block allocation failed for inode 144661603 at logical offset 0 with max blocks 2048 with error 30
[tor jul 2 15:35:59 2020] EXT4-fs (sda): This should not happen!! Data will be lost
[tor jul 2 15:35:59 2020] EXT4-fs error (device sda) in ext4_writepages:2915: Journal has aborted
[tor jul 2 15:35:59 2020] EXT4-fs error (device sda): ext4_journal_check_start:61: Detected aborted journal
[tor jul 2 15:35:59 2020] EXT4-fs (sda): Remounting filesystem read-only
[tor jul 2 15:35:59 2020] EXT4-fs (sda): ext4_writepages: jbd2_start: 7168 pages, ino 144661604; err -30
[tor jul 2 15:35:59 2020] JBD2: Detected IO errors while flushing file data on sda-8

That looks exactly like a hardware issue. ‘sda’ is a hard disk that suddenly disappeared from the Linux system (the kernel reports its capacity changing to 0), so ext4 aborted the journal and remounted the filesystem read-only to protect your data. Is this an external drive of some sort?
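
To make that sequence easier to spot in a long dmesg dump, here is a minimal Python sketch that counts the relevant kernel messages per device. The regular expressions are written only against the lines quoted in this thread, and the names `PATTERNS` and `summarize` are made up for illustration; other kernel versions may word these messages differently.

```python
import re

# Patterns derived from the kernel messages quoted in this thread only.
# Other kernels may phrase these differently (assumption).
PATTERNS = {
    "io_error": re.compile(r"EXT4-fs warning \(device (\w+)\).*I/O error"),
    "device_lost": re.compile(r"(\w+): detected capacity change from \d+ to 0\b"),
    "journal_abort": re.compile(r"Aborting journal on device (\w+?)(?:-\d+)?\."),
    "remount_ro": re.compile(r"EXT4-fs \((\w+)\): Remounting filesystem read-only"),
}


def summarize(lines):
    """Count matching events, keyed by (device, event_name)."""
    counts = {}
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                key = (match.group(1), name)
                counts[key] = counts.get(key, 0) + 1
    return counts
```

Passing it the lines of a saved `dmesg -T` output (for example `summarize(open("dmesg.txt"))`, where `dmesg.txt` is a hypothetical file name) gives a quick view of whether the I/O errors, the disappearing device, and the read-only remount all involve the same disk.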

Yes, without any doubt ‘sda’ is the hard drive that cryoSPARC is supposed to write to; it gets disconnected and that makes the job fail. It is an internal hard drive (Seagate IronWolf 10 TB 3.5" NAS HDD). The other system has an internal 8 TB HDD; I don’t know the brand and model right now.

Do you think it is a matter of a defective hard drive? I find that strange, not only because the HDD is newly installed, but also because it only happens while using cryoSPARC.

Why would the drive disappear? Any clue? :slight_smile:

I would suspect connection issues or heat issues. Can you check that all the cables are plugged in firmly to the drive and the motherboard?

Those were the first things that we suspected and they were checked several times. Cables have been replaced and nothing changed… the only thing that was not replaced was the hard drive. One of the workstations even went back for service and they decided not to change the drive because they didn’t detect any sign of it being broken. Nevertheless, we have already ordered a new hard drive to see if that solves the problem for that workstation. Still, I think it is very strange.

Let me know if there are any other tips :slight_smile:

Hi @AndreGraca,

What happens when you try to read that file (/mnt/storagesda/cryosparc_projects/P1/J1/imported/FoilHole_3621484...) manually on the system? Are you able to read it properly?
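
One way to run that manual read test without an image viewer is to read the file sequentially and report any OS-level error. This is only a sketch: `try_read` is a made-up helper name and the 8 MiB chunk size is an arbitrary choice. Note that a recently written file may be served from the page cache, so a clean read here does not prove the disk itself is healthy.

```python
import errno


def try_read(path, chunk_size=8 * 1024 * 1024):
    """Read a file start to end; return (bytes_read, error_name_or_None)."""
    total = 0
    try:
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except OSError as exc:
        # Translate the errno (e.g. 5) into its symbolic name (e.g. "EIO").
        return total, errno.errorcode.get(exc.errno, str(exc.errno))
    return total, None
```

A result such as `(n, "EIO")` would point at the storage layer rather than at the file contents, while `(size_of_file, None)` means the read completed.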

Also, what version of cryoSPARC are you running? What OS?

How many other SATA drives do you have connected to the motherboard? Was this a custom built computer, or a prebuilt from an assembler?

Hi @stephan! Thank you for checking out this topic :slight_smile:

Both workstations are custom built, so we don’t have the ‘privilege’ of a warranty covering a replacement unit that would work for sure. They are pretty different workstations anyway. One is already 2 or 3 years old, and the other was assembled just 2 weeks ago.
The newest workstation is running the latest version of cryoSPARC (v2.15) and the oldest machine is running version 2.13, but if I remember correctly it has been upgraded more than once since the trouble started (months ago).
Both workstations are running Ubuntu 16.04 LTS.

The newest machine has:

  • two M.2 PCIe Gen 3.0 x4, NVMe 1.3 drives: one for the OS, where the cryoSPARC master and worker are also installed; the other for scratch
  • two 10 TB SATA 6 Gb/s HDDs: one of them is the drive that disconnects (cryoSPARC outputs are being written there); the other one continues fine, but nothing has been written to it so far.
    I am tempted to redirect the cryosparc_projects directory to the other hard drive and see if the behaviour is similar… so far I have not tried that, but it would be good to rule out some options.

I believe the other workstation has only one SSD and one HDD.

The screenshots and output that I have posted here are from the newest machine. There I was also working with a dataset that never ran on the other machine, so the problem doesn’t seem to be project or data related.
By ‘read manually’, do you mean displaying the tif file with an image viewer? If so, I can report an interesting fact: any movie (.tif file) appears as a multipage image file with all black pages (using Ubuntu’s integrated ‘image viewer’, ‘document viewer’ or ‘tiffgt’), but the cryoSPARC web interface correctly shows some examples of the imported micrographs of J1 without any problem.

I hope that some of this info helps! We are a bit desperate, eheh

Many thanks,
André

UPDATE

Hi, just an update from our side now that we have figured out what the problems were, after struggling with this for months and being reluctant to change the hard drive.

  • Newest workstation: for our most recent machine we thought the new hard drive was unlikely to be the problem. After digging around I realised that (although the manufacturer barely mentions it) half of the motherboard’s SATA ports are controlled by the main chipset and the other half by an ASMedia controller. Apparently this controller does not cope well with sustained, heavy data transfer.
  • The older workstation got fixed by changing the hard drive.
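
For anyone who wants to check which controller a given disk hangs off, here is a rough Python sketch. It assumes a Linux system where `/sys/block/<disk>` resolves to a path containing the PCI addresses of the devices above the disk; the function names and the sample path in the comment are hypothetical and not taken from our machines.

```python
import os
import re

# PCI addresses look like 0000:05:00.0 (domain:bus:device.function).
PCI_ADDR = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")


def controller_pci_address(sysfs_path):
    """Return the last PCI address found in a resolved sysfs path, or None."""
    matches = PCI_ADDR.findall(sysfs_path)
    return matches[-1] if matches else None


def controller_for_disk(disk="sda"):
    """Resolve /sys/block/<disk> and extract the controller's PCI address."""
    # Hypothetical example of a resolved path:
    # /sys/devices/pci0000:00/0000:00:1c.4/0000:05:00.0/ata7/host6/.../block/sda
    return controller_pci_address(os.path.realpath(f"/sys/block/{disk}"))
```

The returned address can then be passed to `lspci -s <address>` to see the controller's vendor, which should show whether the disk is attached to the chipset ports or to an add-on controller.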

Thank you,
André

@AndreGraca
Thank you so much for posting the update. We are really glad this was resolved!!