CryoSPARC crashes when there is high network-in traffic

Hi team.
We are currently using CryoSPARC on AWS ParallelCluster, and it occasionally crashes. When I checked the monitoring data, I noticed a peak in network-in traffic of approximately 35 GB/s around the time of the crashes. During these episodes, the database log shows the entries below (one way to pull the corresponding network metric is sketched after the log excerpt).

2025-03-19T22:27:49.275+0000 I NETWORK  [conn5669] end connection :43206 (184 connections now open)
2025-03-19T22:27:49.275+0000 I NETWORK  [conn5667] end connection :43190 (183 connections now open)
2025-03-19T22:29:42.117+0000 I STORAGE  [WT RecordStoreThread: local.oplog.rs] WiredTiger record store oplog truncation finished in: 7ms
2025-03-19T22:31:45.585+0000 I STORAGE  [WT RecordStoreThread: local.oplog.rs] WiredTiger record store oplog truncation finished in: 5ms
2025-03-19T22:34:00.337+0000 I NETWORK  [listener] connection accepted from  #5673 (184 connections now open)
2025-03-19T22:34:00.337+0000 I NETWORK  [listener] connection accepted from :49304 #5674 (185 connections now open)
2025-03-19T22:34:00.337+0000 I NETWORK  [conn5673] received client metadata from :49292 conn5673: { driver: { name: "PyMongo", version: "4.8.0" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "5.15.0-1062-aws" }, platform: "CPython 3.10.14.final.0" }
2025-03-19T22:34:00.337+0000 I NETWORK  [conn5674] received client metadata from :49304 conn5674: { driver: { name: "PyMongo", version: "4.8.0" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "5.15.0-1062-aws" }, platform: "CPython 3.10.14.final.0" }
2025-03-19T22:34:00.338+0000 I NETWORK  [listener] connection accepted from :49314 #5675 (186 connections now open)
2025-03-19T22:34:00.338+0000 I NETWORK  [listener] connection accepted from :49320 #5676 (187 connections now open)
2025-03-19T22:34:00.338+0000 I NETWORK  [conn5675] received client metadata from :49314 conn5675: { driver: { name: "PyMongo", version: "4.8.0" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "5.15.0-1062-aws" }, platform: "CPython 3.10.14.final.0" }
2025-03-19T22:34:00.338+0000 I NETWORK  [conn5676] received client metadata from :49320 conn5676: { driver: { name: "PyMongo", version: "4.8.0" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "5.15.0-1062-aws" }, platform: "CPython 3.10.14.final.0" }
2025-03-19T22:34:00.342+0000 I ACCESS   [conn5675] Successfully authenticated as principal cryosparc_user on admin from client :49314
2025-03-19T22:34:00.342+0000 I ACCESS   [conn5676] Successfully authenticated as principal cryosparc_user on admin from client :49320
2025-03-19T22:34:12.624+0000 I STORAGE  [WT RecordStoreThread: local.oplog.rs] WiredTiger record store oplog truncation finished in: 4ms
2025-03-19T22:34:27.600+0000 I NETWORK  [listener] connection accepted from :37404 #5677 (188 connections now open)
2025-03-19T22:34:27.600+0000 I NETWORK  [listener] connection accepted from :37418 #5678 (189 connections now open)
2025-03-19T22:34:27.601+0000 I NETWORK  [conn5677] received client metadata from :37404 conn5677: { driver: { name: "PyMongo", version: "4.8.0" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "5.15.0-1062-aws" }, platform: "CPython 3.10.14.final.0" }
2025-03-19T22:34:27.601+0000 I NETWORK  [conn5678] received client metadata from :37418 conn5678: { driver: { name: "PyMongo", version: "4.8.0" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "5.15.0-1062-aws" }, platform: "CPython 3.10.14.final.0" }
2025-03-19T22:34:27.601+0000 I NETWORK  [listener] connection accepted from :37434 #5679 (190 connections now open)
2025-03-19T22:34:27.601+0000 I NETWORK  [listener] connection accepted from :37446 #5680 (191 connections now open)
2025-03-19T22:34:27.601+0000 I NETWORK  [conn5679] received client metadata from :37434 conn5679: { driver: { name: "PyMongo", version: "4.8.0" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "5.15.0-1062-aws" }, platform: "CPython 3.10.14.final.0" }
2025-03-19T22:34:27.601+0000 I NETWORK  [conn5680] received client metadata from :37446 conn5680: { driver: { name: "PyMongo", version: "4.8.0" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "5.15.0-1062-aws" }, platform: "CPython 3.10.14.final.0" }
2025-03-19T22:34:27.605+0000 I ACCESS   [conn5679] Successfully authenticated as principal cryosparc_user on admin from client :37434
2025-03-19T22:34:27.605+0000 I ACCESS   [conn5680] Successfully authenticated as principal cryosparc_user on admin from client :37446
2025-03-19T22:34:48.709+0000 I STORAGE  [WT RecordStoreThread: local.oplog.rs] WiredTiger record store oplog truncation finished in: 2ms
2025-03-19T22:35:24.369+0000 I STORAGE  [WT RecordStoreThread: local.oplog.rs] WiredTiger record store oplog truncation finished in: 5ms
2025-03-19T22:35:41.732+0000 I STORAGE  [WT RecordStoreThread: local.oplog.rs] WiredTiger record store oplog truncation finished in: 6ms
2025-03-19T22:35:59.853+0000 I STORAGE  [WT RecordStoreThread: local.oplog.rs] WiredTiger record store oplog truncation finished in: 4ms
2025-03-19T22:36:16.984+0000 I NETWORK  [listener] connection accepted from :42606 #5681 (192 connections now open)
2025-03-19T22:36:16.986+0000 I NETWORK  [conn5681] received client metadata from :42606 conn5681: { driver: { name: "PyMongo", version: "4.8.0" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "5.15.0-1062-aws" }, platform: "CPython 3.10.14.final.0" }
2025-03-19T22:36:16.987+0000 I NETWORK  [listener] connection accepted from :42620 #5682 (193 connections now open)
2025-03-19T22:36:16.987+0000 I NETWORK  [conn5682] received client metadata from :42620 conn5682: { driver: { name: "PyMongo", version: "4.8.0" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "5.15.0-1062-aws" }, platform: "CPython 3.10.14.final.0" }
2025-03-19T22:36:16.997+0000 I ACCESS   [conn5682] Successfully authenticated as principal cryosparc_user on admin from client :42620
2025-03-19T22:36:17.575+0000 I NETWORK  [conn5682] end connection :42620 (191 connections now open)
2025-03-19T22:36:17.575+0000 I NETWORK  [conn5681] end connection :42606 (192 connections now open)
2025-03-19T22:37:43.292+0000 I STORAGE  [WT RecordStoreThread: local.oplog.rs] WiredTiger record store oplog truncation finished in: 8ms
2025-03-19T22:39:24.280+0000 I COMMAND  [conn45] command meteor.workspaces command: find { find: "workspaces", filter: { project_uid: "P76", session_uid: "S3" }, projection: { exposure_groups: 1, file_engine_last_run: 1 }, limit: 1, singleBatch: true, lsid: { id: UUID("6d5362ab-f538-437d-8ebd-58942377ac40") }, $clusterTime: { clusterTime: Timestamp(1742423964, 3), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } planSummary: IXSCAN { project_uid: 1, session_uid: 1 } keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:880 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_msg 140ms
2025-03-19T22:39:25.604+0000 I COMMAND  [conn5633] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423958, 4), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 142ms
2025-03-19T22:39:27.642+0000 I COMMAND  [conn30] command local.oplog.rs command: getMore { getMore: 8615417076, collection: "oplog.rs", batchSize: 1000, lsid: { id: UUID("") }, $clusterTime: { clusterTime: Timestamp(1742423963, 46), signature: { hash: BinData(0, ), keyId:  } }, $db: "local" } originatingCommand: { find: "oplog.rs", filter: { ns: /^(?:meteor\.|admin\.\$cmd)/, $or: [ { op: { $in: [ "i", "u", "d" ] } }, { op: "c", o.drop: { $exists: true } }, { op: "c", o.dropDatabase: 1 }, { op: "c", o.applyOps: { $exists: true } } ], ts: { $gt: Timestamp(1739901706, 11) } }, tailable: true, oplogReplay: true, awaitData: true, lsid: { id: UUID("838be17a-cf97-4841-843f-872a5f3052e6") }, $clusterTime: { clusterTime: Timestamp(1739901710, 80), signature: { hash: BinData(0, 64C3AFDB895F18A43778BFAE149236D529255727), keyId:  } }, $db: "local" } planSummary: COLLSCAN cursorid:8615417076 keysExamined:0 docsExamined:8 numYields:2 nreturned:8 reslen:4344 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, oplog: { acquireCount: { r: 3 } } } protocol:op_msg 1053ms
2025-03-19T22:39:27.735+0000 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 12, after asserts: 12, after backgroundFlushing: 12, after connections: 12, after dur: 12, after extra_info: 22, after globalLock: 32, after locks: 42, after logicalSessionRecordCache: 42, after network: 52, after opLatencies: 72, after opReadConcernCounters: 72, after opcounters: 72, after opcountersRepl: 72, after oplogTruncation: 127, after repl: 258, after storageEngine: 641, after tcmalloc: 1160, after transactions: 1379, after wiredTiger: 2312, at end: 2425 }
2025-03-19T22:39:27.875+0000 I COMMAND  [conn5576] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423956, 6), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 159ms
2025-03-19T22:39:27.875+0000 I COMMAND  [conn5575] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423964, 3), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 185ms
2025-03-19T22:39:28.500+0000 I COMMAND  [conn2017] command meteor.sched_queued command: find { find: "sched_queued", filter: {}, projection: { queued_job_hash: 1, last_scheduled_at: 1 }, lsid: { id: UUID("e52e100a-4951-4451-9596-0f10c946b735") }, $clusterTime: { clusterTime: Timestamp(1742423963, 35), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } planSummary: COLLSCAN keysExamined:0 docsExamined:0 cursorExhausted:1 numYields:1 nreturned:0 reslen:217 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_msg 364ms
2025-03-19T22:39:28.964+0000 I COMMAND  [conn5675] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 5000, $clusterTime: { clusterTime: Timestamp(1742423964, 8), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 3535ms
2025-03-19T22:39:28.964+0000 I WRITE    [conn5638] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 0 THR 1 NUM 3500 TOTAL 39.232657 ELAPSED 218.80523 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 numYields:2 locks:{ Global: { acquireCount: { r: 5, w: 5 } }, Database: { acquireCount: { w: 5 } }, Collection: { acquireCount: { w: 4 } }, oplog: { acquireCount: { w: 1 } } } 3560ms
2025-03-19T22:39:28.965+0000 I COMMAND  [conn5679] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 5274, $clusterTime: { clusterTime: Timestamp(1742423964, 8), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 3535ms
2025-03-19T22:39:30.761+0000 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 15, after asserts: 70, after backgroundFlushing: 81, after connections: 81, after dur: 81, after extra_info: 91, after globalLock: 101, after locks: 111, after logicalSessionRecordCache: 157, after network: 303, after opLatencies: 352, after opReadConcernCounters: 352, after opcounters: 379, after opcountersRepl: 379, after oplogTruncation: 389, after repl: 527, after storageEngine: 537, after tcmalloc: 547, after transactions: 547, after wiredTiger: 601, at end: 2091 }
2025-03-19T22:39:30.876+0000 I COMMAND  [conn5678] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423958, 10), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 1250ms
2025-03-19T22:39:30.876+0000 I COMMAND  [conn5580] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423963, 11), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 1250ms
2025-03-19T22:39:30.876+0000 I COMMAND  [conn5408] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423957, 8), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 1315ms
2025-03-19T22:39:30.876+0000 I COMMAND  [conn5673] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423964, 8), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 1315ms
2025-03-19T22:39:31.024+0000 I COMMAND  [conn5638] command meteor.$cmd command: update { update: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 6612, $clusterTime: { clusterTime: Timestamp(1742423958, 4), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 5, w: 5 } }, Database: { acquireCount: { w: 5 } }, Collection: { acquireCount: { w: 4 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 5555ms
2025-03-19T22:39:31.024+0000 I COMMAND  [conn5581] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 50097, $clusterTime: { clusterTime: Timestamp(1742423963, 11), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 3121ms
2025-03-19T22:39:31.157+0000 I WRITE    [conn5664] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 1 THR 0 NUM 2500 TOTAL 28.977865 ELAPSED 214.20418 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 numYields:2 locks:{ Global: { acquireCount: { r: 5, w: 5 } }, Database: { acquireCount: { w: 5 } }, Collection: { acquireCount: { w: 4 } }, oplog: { acquireCount: { w: 1 } } } 3288ms
2025-03-19T22:39:32.761+0000 I COMMAND  [conn5632] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423959, 139), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 276ms
2025-03-19T22:39:32.761+0000 I COMMAND  [conn5579] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423958, 16), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 276ms
2025-03-19T22:39:33.957+0000 I COMMAND  [conn5664] command meteor.$cmd command: update { update: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 258, $clusterTime: { clusterTime: Timestamp(1742423964, 8), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 5, w: 5 } }, Database: { acquireCount: { w: 5 } }, Collection: { acquireCount: { w: 4 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 5775ms
2025-03-19T22:39:34.148+0000 I COMMAND  [conn5659] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423964, 8), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 104ms
2025-03-19T22:39:34.148+0000 I COMMAND  [conn2001] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423963, 35), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 585ms
2025-03-19T22:39:34.148+0000 I COMMAND  [conn5674] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423961, 6), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 211ms
2025-03-19T22:39:34.206+0000 I WRITE    [conn5635] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 1 THR 1 NUM 3500 TOTAL 39.447960 ELAPSED 218.77935 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 writeConflicts:2 numYields:4 locks:{ Global: { acquireCount: { r: 7, w: 7 } }, Database: { acquireCount: { w: 7 } }, Collection: { acquireCount: { w: 6 } }, oplog: { acquireCount: { w: 1 } } } 8845ms
2025-03-19T22:39:34.240+0000 I WRITE    [conn5660] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 0 THR 1 NUM 2500 TOTAL 27.690359 ELAPSED 214.20617 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 writeConflicts:1 numYields:3 locks:{ Global: { acquireCount: { r: 6, w: 6 } }, Database: { acquireCount: { w: 6 } }, Collection: { acquireCount: { w: 5 } }, oplog: { acquireCount: { w: 1 } } } 6462ms
2025-03-19T22:39:34.271+0000 I WRITE    [conn5638] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 0 THR 1 NUM 4000 TOTAL 41.080067 ELAPSED 227.69644 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 numYields:1 locks:{ Global: { acquireCount: { r: 4, w: 4 } }, Database: { acquireCount: { w: 4 } }, Collection: { acquireCount: { w: 3 } }, oplog: { acquireCount: { w: 1 } } } 143ms
2025-03-19T22:39:34.271+0000 I WRITE    [conn5663] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 1 THR 1 NUM 3000 TOTAL 27.802154 ELAPSED 216.42389 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 writeConflicts:2 numYields:4 locks:{ Global: { acquireCount: { r: 7, w: 7 } }, Database: { acquireCount: { w: 7 } }, Collection: { acquireCount: { w: 6 } }, oplog: { acquireCount: { w: 1 } } } 3427ms
2025-03-19T22:39:34.271+0000 I WRITE    [conn5662] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 0 THR 0 NUM 2500 TOTAL 28.242600 ELAPSED 214.18415 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 writeConflicts:4 numYields:6 locks:{ Global: { acquireCount: { r: 9, w: 9 } }, Database: { acquireCount: { w: 9 } }, Collection: { acquireCount: { w: 8 } }, oplog: { acquireCount: { w: 1 } } } 6493ms
2025-03-19T22:39:34.271+0000 I WRITE    [conn5637] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 0 THR 0 NUM 4000 TOTAL 38.821593 ELAPSED 218.79305 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 writeConflicts:4 numYields:6 locks:{ Global: { acquireCount: { r: 9, w: 9 } }, Database: { acquireCount: { w: 9 } }, Collection: { acquireCount: { w: 8 } }, oplog: { acquireCount: { w: 1 } } } 8918ms
2025-03-19T22:39:34.271+0000 I WRITE    [conn5636] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 1 THR 0 NUM 3500 TOTAL 37.079320 ELAPSED 218.80479 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 writeConflicts:5 numYields:7 locks:{ Global: { acquireCount: { r: 10, w: 10 } }, Database: { acquireCount: { w: 10 } }, Collection: { acquireCount: { w: 9 } }, oplog: { acquireCount: { w: 1 } } } 8918ms
2025-03-19T22:39:34.283+0000 I COMMAND  [conn5638] command meteor.$cmd command: update { update: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 6613, $clusterTime: { clusterTime: Timestamp(1742423967, 3), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 4, w: 4 } }, Database: { acquireCount: { w: 4 } }, Collection: { acquireCount: { w: 3 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 166ms
2025-03-19T22:39:34.283+0000 I COMMAND  [conn5577] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 50504, $clusterTime: { clusterTime: Timestamp(1742423967, 1), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 5344ms
2025-03-19T22:39:34.283+0000 I COMMAND  [conn5636] command meteor.$cmd command: update { update: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 8531, $clusterTime: { clusterTime: Timestamp(1742423958, 4), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 10, w: 10 } }, Database: { acquireCount: { w: 10 } }, Collection: { acquireCount: { w: 9 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 8965ms
2025-03-19T22:39:34.283+0000 I COMMAND  [conn5635] command meteor.$cmd command: update { update: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 6254, $clusterTime: { clusterTime: Timestamp(1742423958, 4), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 7, w: 7 } }, Database: { acquireCount: { w: 7 } }, Collection: { acquireCount: { w: 6 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 8965ms
2025-03-19T22:39:34.283+0000 I COMMAND  [conn5662] command meteor.$cmd command: update { update: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 154, $clusterTime: { clusterTime: Timestamp(1742423964, 8), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 9, w: 9 } }, Database: { acquireCount: { w: 9 } }, Collection: { acquireCount: { w: 8 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 6518ms
2025-03-19T22:39:34.283+0000 I COMMAND  [conn5663] command meteor.$cmd command: update { update: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 716, $clusterTime: { clusterTime: Timestamp(1742423964, 8), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 7, w: 7 } }, Database: { acquireCount: { w: 7 } }, Collection: { acquireCount: { w: 6 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 5344ms
2025-03-19T22:39:34.283+0000 I COMMAND  [conn5637] command meteor.$cmd command: update { update: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 10660, $clusterTime: { clusterTime: Timestamp(1742423958, 4), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 9, w: 9 } }, Database: { acquireCount: { w: 9 } }, Collection: { acquireCount: { w: 8 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 8965ms
2025-03-19T22:39:34.283+0000 I COMMAND  [conn5660] command meteor.$cmd command: update { update: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 2140, $clusterTime: { clusterTime: Timestamp(1742423964, 8), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 6, w: 6 } }, Database: { acquireCount: { w: 6 } }, Collection: { acquireCount: { w: 5 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 6518ms
2025-03-19T22:39:34.283+0000 I COMMAND  [conn5675] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 5001, $clusterTime: { clusterTime: Timestamp(1742423967, 3), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 3159ms
2025-03-19T22:39:34.283+0000 I COMMAND  [conn5679] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 5275, $clusterTime: { clusterTime: Timestamp(1742423967, 3), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 3159ms
2025-03-19T22:39:34.284+0000 I COMMAND  [conn45] command meteor.workspaces command: find { find: "workspaces", filter: { $or: [ { rtp_childs: { $exists: true, $ne: [] } }, { rtp_workers: { $exists: true, $ne: {} } } ] }, projection: { project_uid: 1, session_uid: 1, rtp_childs: 1, rtp_workers: 1 }, lsid: { id: UUID("") }, $clusterTime: { clusterTime: Timestamp(1742423964, 8), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } planSummary: IXSCAN { rtp_childs: 1 }, IXSCAN { rtp_workers: 1 } keysExamined:376 docsExamined:374 cursorExhausted:1 numYields:10 nreturned:18 reslen:2823 locks:{ Global: { acquireCount: { r: 22 } }, Database: { acquireCount: { r: 11 } }, Collection: { acquireCount: { r: 11 } } } protocol:op_msg 9871ms
2025-03-19T22:39:34.298+0000 I COMMAND  [conn10] command admin.$cmd command: isMaster { ismaster: true, $clusterTime: { clusterTime: Timestamp(1742423956, 6), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 130ms
2025-03-19T22:39:34.435+0000 I NETWORK  [listener] connection accepted from 127.0.0.1:49824 #5683 (192 connections now open)
2025-03-19T22:39:34.486+0000 I NETWORK  [conn5683] received client metadata from 127.0.0.1:49824 conn5683: { driver: { name: "PyMongo", version: "4.8.0" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "5.15.0-1062-aws" }, platform: "CPython 3.10.14.final.0" }
2025-03-19T22:39:34.512+0000 I COMMAND  [conn39] command meteor.exposures command: aggregate { aggregate: "exposures", pipeline: [ { $match: { project_uid: "P76", session_uid: "S3", stage: { $in: [ "thumbs", "ctf", "pick", "extract", "extract_manual", "ready" ] } } }, { $project: { _id: 1 } }, { $count: "total_thumbs" } ], cursor: {}, lsid: { id: UUID("") }, $clusterTime: { clusterTime: Timestamp(1742423974, 28), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } planSummary: IXSCAN { project_uid: 1, session_uid: 1 } keysExamined:1909 docsExamined:1909 cursorExhausted:1 numYields:15 nreturned:1 reslen:240 locks:{ Global: { acquireCount: { r: 34 } }, Database: { acquireCount: { r: 17 } }, Collection: { acquireCount: { r: 17 } } } protocol:op_msg 141ms
2025-03-19T22:39:34.613+0000 I ACCESS   [conn5683] Successfully authenticated as principal cryosparc_user on admin from client 127.0.0.1:49824
2025-03-19T22:39:34.683+0000 I STORAGE  [WT RecordStoreThread: local.oplog.rs] WiredTiger record store oplog truncation finished in: 28ms
2025-03-19T22:39:34.690+0000 I COMMAND  [conn39] command meteor.exposures command: aggregate { aggregate: "exposures", pipeline: [ { $match: { project_uid: "P76", session_uid: "S3", in_progress: true } }, { $project: { _id: 1 } }, { $count: "in_progress" } ], cursor: {}, lsid: { id: UUID("") }, $clusterTime: { clusterTime: Timestamp(1742423974, 92), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } planSummary: IXSCAN { project_uid: 1, session_uid: 1 } keysExamined:1909 docsExamined:1909 cursorExhausted:1 numYields:14 nreturned:1 reslen:239 locks:{ Global: { acquireCount: { r: 32 } }, Database: { acquireCount: { r: 16 } }, Collection: { acquireCount: { r: 16 } } } protocol:op_msg 128ms
2025-03-19T22:39:39.333+0000 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 97, after asserts: 333, after backgroundFlushing: 486, after connections: 701, after dur: 895, after extra_info: 1102, after globalLock: 1291, after locks: 1519, after logicalSessionRecordCache: 1567, after network: 1711, after opLatencies: 1991, after opReadConcernCounters: 2115, after opcounters: 2322, after opcountersRepl: 2424, after oplogTruncation: 3124, after repl: 3815, after storageEngine: 3882, after tcmalloc: 3933, after transactions: 3943, after wiredTiger: 3980, at end: 4062 }
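
For anyone who wants to check the same metric, a CloudWatch query along the following lines should show the network-in peak, assuming standard EC2 monitoring is what the dashboard is reporting. The instance ID, region, and time window are placeholders:

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name NetworkIn \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2025-03-19T22:00:00Z \
  --end-time 2025-03-19T23:00:00Z \
  --period 300 \
  --statistics Maximum \
  --region us-east-1

Note that NetworkIn is reported in bytes per period, so the returned value has to be divided by the period length (300 s here) before comparing it with a GB/s figure.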

Has anyone encountered similar issues, or does anyone have a solution for this problem?

Welcome to the forum @htqn.

Please can you provide additional details:

  1. What are the specs of the CryoSPARC master node?
  2. Is the CryoSPARC master node also the head node of the ParallelCluster?
  3. Does the CryoSPARC master node also serve as a GPU compute node?
  4. How many (approx) concurrent users and concurrent CryoSPARC jobs does the CryoSPARC instance have?
  5. Does the CryoSPARC master host run any non-CryoSPARC tasks?
  6. What is the output of these commands on the CryoSPARC master in a fresh shell:
    nproc
    free -h
    eval $(cryosparcm env)
    echo $CRYOSPARC_DB_PATH
    df -Th $CRYOSPARC_DB_PATH
    du -sh $CRYOSPARC_DB_PATH
    
  7. Please can you post some database log entries that immediately follow the "serverStatus was very slow" entries (one way to extract them is sketched below this list).
  8. What are the signs of the occasional crashes you reported?
  9. Are there any revealing entries in the command_core, command_vis, app, database, or supervisord logs that coincide with the crashes?
  10. How did you recover from the crashes?
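
For item 7, something along these lines should pull each of those entries together with the lines that immediately follow it; the path is a placeholder for wherever cryosparc_master/run/database.log lives in your installation, and the number of trailing lines is arbitrary:

    grep -A 20 "serverStatus was very slow" /path/to/cryosparc_master/run/database.log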

Hi @wtempel, below are some details I can provide now. Please let me know if you need more information.

1. The master node is a c6a.8xlarge instance with 32 vCPUs and 64 GiB of memory.
2. Yes, the master node also serves as the head node.
3. No, the CryoSPARC master node does not serve as a GPU compute node.
5. Yes, it sometimes runs SLURM jobs.
6.

~$ nproc
32
~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           61Gi        19Gi        13Gi       406Mi        28Gi        40Gi
Swap:            0B          0B          0B
~$ eval $(cryosparcm env)
~$ echo $CRYOSPARC_DB_PATH
/shared/cryosparc/cryosparc_db
~$ df -Th $CRYOSPARC_DB_PATH
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/nvme1n1   ext4  921G  712G  167G  82% /shared
~$ du -sh $CRYOSPARC_DB_PATH
666G    /shared/cryosparc/cryosparc_db

7.
2025-03-19T22:39:39.333+0000 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 97, after asserts: 333, after backgroundFlushing: 486, after connections: 701, after dur: 895, after extra_info: 1102, after globalLock: 1291, after locks: 1519, after logicalSessionRecordCache: 1567, after network: 1711, after opLatencies: 1991, after opReadConcernCounters: 2115, after opcounters: 2322, after opcountersRepl: 2424, after oplogTruncation: 3124, after repl: 3815, after storageEngine: 3882, after tcmalloc: 3933, after transactions: 3943, after wiredTiger: 3980, at end: 4062 }
2025-03-19T22:39:39.359+0000 I COMMAND  [conn5575] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423974, 120), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 157ms
2025-03-19T22:39:39.359+0000 I COMMAND  [conn5576] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423967, 1), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 157ms
2025-03-19T22:39:39.359+0000 I COMMAND  [conn5633] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423974, 5), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 157ms
2025-03-19T22:39:39.395+0000 I COMMAND  [conn5679] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 5277, $clusterTime: { clusterTime: Timestamp(1742423974, 9), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 4241ms
2025-03-19T22:39:39.400+0000 I COMMAND  [conn5581] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 50099, $clusterTime: { clusterTime: Timestamp(1742423974, 17), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 4246ms
2025-03-19T22:39:39.400+0000 I COMMAND  [conn5675] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 5003, $clusterTime: { clusterTime: Timestamp(1742423974, 9), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 4246ms
2025-03-19T22:39:39.512+0000 I WRITE    [conn5636] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 0 THR 0 NUM 4500 TOTAL 43.111552 ELAPSED 233.58522 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 writeConflicts:4 numYields:4 locks:{ Global: { acquireCount: { r: 7, w: 7 } }, Database: { acquireCount: { w: 7 } }, Collection: { acquireCount: { w: 6 } }, oplog: { acquireCount: { w: 1 } } } 110ms
2025-03-19T22:39:39.512+0000 I WRITE    [conn5635] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 0 THR 1 NUM 4500 TOTAL 45.369204 ELAPSED 233.58597 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 writeConflicts:5 numYields:5 locks:{ Global: { acquireCount: { r: 8, w: 8 } }, Database: { acquireCount: { w: 8 } }, Collection: { acquireCount: { w: 7 } }, oplog: { acquireCount: { w: 1 } } } 111ms
2025-03-19T22:39:39.512+0000 I WRITE    [conn5637] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 1 THR 1 NUM 4000 TOTAL 42.012550 ELAPSED 233.58692 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 writeConflicts:1 numYields:1 locks:{ Global: { acquireCount: { r: 4, w: 4 } }, Database: { acquireCount: { w: 4 } }, Collection: { acquireCount: { w: 3 } }, oplog: { acquireCount: { w: 1 } } } 108ms
2025-03-19T22:39:39.512+0000 I COMMAND  [conn5635] command meteor.$cmd command: update { update: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 8532, $clusterTime: { clusterTime: Timestamp(1742423974, 5), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 8, w: 8 } }, Database: { acquireCount: { w: 8 } }, Collection: { acquireCount: { w: 7 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 131ms
2025-03-19T22:39:39.512+0000 I COMMAND  [conn5636] command meteor.$cmd command: update { update: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 6255, $clusterTime: { clusterTime: Timestamp(1742423974, 5), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 7, w: 7 } }, Database: { acquireCount: { w: 7 } }, Collection: { acquireCount: { w: 6 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 131ms
2025-03-19T22:39:39.512+0000 I COMMAND  [conn5637] command meteor.$cmd command: update { update: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 10661, $clusterTime: { clusterTime: Timestamp(1742423976, 3), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 4, w: 4 } }, Database: { acquireCount: { w: 4 } }, Collection: { acquireCount: { w: 3 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 131ms
2025-03-19T22:39:41.805+0000 I COMMAND  [conn30] command local.oplog.rs command: getMore { getMore: 8615417076, collection: "oplog.rs", batchSize: 1000, lsid: { id: UUID("") }, $clusterTime: { clusterTime: Timestamp(1742423980, 105), signature: { hash: BinData(0, ), keyId:  } }, $db: "local" } originatingCommand: { find: "oplog.rs", filter: { ns: /^(?:meteor\.|admin\.\$cmd)/, $or: [ { op: { $in: [ "i", "u", "d" ] } }, { op: "c", o.drop: { $exists: true } }, { op: "c", o.dropDatabase: 1 }, { op: "c", o.applyOps: { $exists: true } } ], ts: { $gt: Timestamp(1739901706, 11) } }, tailable: true, oplogReplay: true, awaitData: true, lsid: { id: UUID("") }, $clusterTime: { clusterTime: Timestamp(1739901710, 80), signature: { hash: BinData(0, 64C3AFDB895F18A43778BFAE149236D529255727), keyId:  } }, $db: "local" } planSummary: COLLSCAN cursorid:8615417076 keysExamined:0 docsExamined:2 numYields:2 nreturned:2 reslen:1271 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, oplog: { acquireCount: { r: 3 } } } protocol:op_msg 348ms
2025-03-19T22:39:44.507+0000 I COMMAND  [conn5679] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 5288, $clusterTime: { clusterTime: Timestamp(1742423981, 2), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 2403ms
2025-03-19T22:39:44.507+0000 I COMMAND  [conn5675] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 5014, $clusterTime: { clusterTime: Timestamp(1742423981, 2), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 2403ms
2025-03-19T22:39:44.651+0000 I COMMAND  [conn5408] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423967, 3), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 1057ms
2025-03-19T22:39:45.034+0000 I COMMAND  [conn5579] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423971, 2), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 271ms
2025-03-19T22:39:45.034+0000 I COMMAND  [conn5677] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423981, 2), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 271ms
2025-03-19T22:39:45.034+0000 I COMMAND  [conn5674] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423973, 1), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 271ms
2025-03-19T22:39:45.034+0000 I COMMAND  [conn5632] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423971, 2), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 271ms
2025-03-19T22:39:45.034+0000 I COMMAND  [conn5658] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423973, 1), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 271ms
2025-03-19T22:39:45.531+0000 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 117, after asserts: 317, after backgroundFlushing: 535, after connections: 759, after dur: 885, after extra_info: 1045, after globalLock: 1122, after locks: 1234, after logicalSessionRecordCache: 1444, after network: 1725, after opLatencies: 1970, after opReadConcernCounters: 2063, after opcounters: 2063, after opcountersRepl: 2063, after oplogTruncation: 2195, after repl: 2271, after storageEngine: 2340, after tcmalloc: 2382, after transactions: 2392, after wiredTiger: 2524, at end: 3104 }
2025-03-19T22:39:45.537+0000 I COMMAND  [conn2001] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423980, 103), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 626ms
2025-03-19T22:39:45.631+0000 I COMMAND  [conn39] command meteor.workspaces command: find { find: "workspaces", filter: { status: "running", file_engine_status: "running", deleted: false }, projection: { project_uid: 1, session_uid: 1, file_engine_last_run: 1 }, lsid: { id: UUID("") }, $clusterTime: { clusterTime: Timestamp(1742423980, 85), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } planSummary: IXSCAN { status: 1 } keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:311 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_msg 663ms
2025-03-19T22:39:45.631+0000 I COMMAND  [conn2015] command meteor.jobs command: find { find: "jobs", filter: { status: { $in: [ "launched", "started", "running", "waiting" ] }, heartbeat_at: { $lt: new Date(1742423802506) } }, projection: { project_uid: 1, uid: 1 }, lsid: { id: UUID("64aeaaa2-3b28-43f5-8678-866fcaf2dc31") }, $clusterTime: { clusterTime: Timestamp(1742423980, 103), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } planSummary: IXSCAN { status: 1, completed_at: -1 } keysExamined:7 docsExamined:6 cursorExhausted:1 numYields:0 nreturned:0 reslen:209 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_msg 663ms
2025-03-19T22:39:45.636+0000 I COMMAND  [conn5577] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 50533, $clusterTime: { clusterTime: Timestamp(1742423979, 33), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 1035ms
2025-03-19T22:39:45.636+0000 I COMMAND  [conn5581] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 50100, $clusterTime: { clusterTime: Timestamp(1742423980, 103), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 986ms
2025-03-19T22:39:45.639+0000 I WRITE    [conn5636] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 1 THR 0 NUM 4500 TOTAL 46.626699 ELAPSED 238.67780 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 numYields:2 locks:{ Global: { acquireCount: { r: 5, w: 5 } }, Database: { acquireCount: { w: 5 } }, Collection: { acquireCount: { w: 4 } }, oplog: { acquireCount: { w: 1 } } } 949ms
2025-03-19T22:39:45.639+0000 I COMMAND  [conn5636] command meteor.$cmd command: update { update: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 6256, $clusterTime: { clusterTime: Timestamp(1742423979, 31), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 5, w: 5 } }, Database: { acquireCount: { w: 5 } }, Collection: { acquireCount: { w: 4 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 980ms
2025-03-19T22:39:51.274+0000 I COMMAND  [conn5575] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423986, 99), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 266ms
2025-03-19T22:39:51.274+0000 I WRITE    [conn39] update meteor.exposures command: { q: { project_uid: "P76", session_uid: "S3", uid: 1911 }, u: { $set: { in_progress: true, worker_juid: "J76" } }, multi: false, upsert: false } planSummary: IXSCAN { project_uid: 1, session_uid: 1, uid: 1 } keysExamined:1 docsExamined:1 nMatched:1 nModified:1 numYields:2 locks:{ Global: { acquireCount: { r: 5, w: 5 } }, Database: { acquireCount: { w: 5 } }, Collection: { acquireCount: { w: 4 } }, oplog: { acquireCount: { w: 1 } } } 4263ms
2025-03-19T22:39:51.274+0000 I COMMAND  [conn5576] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423976, 3), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 266ms
2025-03-19T22:39:51.308+0000 I COMMAND  [conn39] command meteor.$cmd command: update { update: "exposures", ordered: true, lsid: { id: UUID("") }, txnNumber: 7058, $clusterTime: { clusterTime: Timestamp(1742423986, 149), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 5, w: 5 } }, Database: { acquireCount: { w: 5 } }, Collection: { acquireCount: { w: 4 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 4319ms
2025-03-19T22:39:51.321+0000 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 146, after asserts: 293, after backgroundFlushing: 388, after connections: 494, after dur: 552, after extra_info: 745, after globalLock: 982, after locks: 1222, after logicalSessionRecordCache: 1538, after network: 1962, after opLatencies: 2195, after opReadConcernCounters: 2271, after opcounters: 2514, after opcountersRepl: 2555, after oplogTruncation: 2692, after repl: 2737, after storageEngine: 2747, after tcmalloc: 2767, after transactions: 2767, after wiredTiger: 2787, at end: 2819 }
2025-03-19T22:39:51.340+0000 I COMMAND  [conn5675] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 5024, $clusterTime: { clusterTime: Timestamp(1742423986, 50), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 3636ms
2025-03-19T22:39:51.340+0000 I COMMAND  [conn5581] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 50129, $clusterTime: { clusterTime: Timestamp(1742423986, 34), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 154ms
2025-03-19T22:39:51.364+0000 I COMMAND  [conn2016] command meteor.jobs command: find { find: "jobs", filter: { status: { $in: [ "launched", "started", "running", "waiting" ] }, heartbeat_at: { $lt: new Date(1742423807995) } }, projection: { project_uid: 1, uid: 1 }, lsid: { id: UUID("") }, $clusterTime: { clusterTime: Timestamp(1742423986, 149), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } planSummary: IXSCAN { status: 1, completed_at: -1 } keysExamined:7 docsExamined:6 cursorExhausted:1 numYields:1 nreturned:0 reslen:209 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_msg 112ms
2025-03-19T22:39:51.364+0000 I COMMAND  [conn5679] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 5298, $clusterTime: { clusterTime: Timestamp(1742423986, 51), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 3660ms
2025-03-19T22:39:51.364+0000 I COMMAND  [conn46] command meteor.workspaces command: find { find: "workspaces", filter: { status: "running", file_engine_status: "running", deleted: false }, projection: { project_uid: 1, session_uid: 1, file_engine_last_run: 1 }, lsid: { id: UUID("") }, $clusterTime: { clusterTime: Timestamp(1742423986, 149), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } planSummary: IXSCAN { status: 1 } keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:311 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_msg 174ms
2025-03-19T22:39:57.586+0000 I COMMAND  [conn5632] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423984, 1), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 254ms
2025-03-19T22:39:57.586+0000 I COMMAND  [conn5677] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423991, 38), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 254ms
2025-03-19T22:39:57.586+0000 I COMMAND  [conn5579] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423984, 1), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 254ms
2025-03-19T22:39:57.586+0000 I COMMAND  [conn5658] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423984, 1), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 254ms
2025-03-19T22:39:57.586+0000 I COMMAND  [conn5659] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423992, 1), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 124ms
2025-03-19T22:39:57.586+0000 I COMMAND  [conn5674] command admin.$cmd command: isMaster { ismaster: 1, helloOk: true, $clusterTime: { clusterTime: Timestamp(1742423984, 1), signature: { hash: BinData(0, ), keyId:  } }, $db: "admin" } numYields:0 reslen:639 locks:{} protocol:op_msg 254ms
2025-03-19T22:39:57.624+0000 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 113, after asserts: 385, after backgroundFlushing: 598, after connections: 804, after dur: 943, after extra_info: 1120, after globalLock: 1328, after locks: 1841, after logicalSessionRecordCache: 2332, after network: 2977, after opLatencies: 3116, after opReadConcernCounters: 3315, after opcounters: 3564, after opcountersRepl: 3703, after oplogTruncation: 3893, after repl: 4016, after storageEngine: 4052, after tcmalloc: 4072, after transactions: 4072, after wiredTiger: 4102, at end: 4152 }
2025-03-19T22:39:57.647+0000 I COMMAND  [conn5581] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 50138, $clusterTime: { clusterTime: Timestamp(1742423991, 32), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 5319ms
2025-03-19T22:39:57.673+0000 I COMMAND  [conn2016] command meteor.sched_queued command: find { find: "sched_queued", filter: {}, projection: { queued_job_hash: 1, last_scheduled_at: 1 }, lsid: { id: UUID("") }, $clusterTime: { clusterTime: Timestamp(1742423991, 2), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } planSummary: COLLSCAN keysExamined:0 docsExamined:0 cursorExhausted:1 numYields:0 nreturned:0 reslen:217 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_msg 389ms
2025-03-19T22:39:57.673+0000 I COMMAND  [conn5675] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 5030, $clusterTime: { clusterTime: Timestamp(1742423991, 36), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 3084ms
2025-03-19T22:39:57.673+0000 I COMMAND  [conn5679] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 5304, $clusterTime: { clusterTime: Timestamp(1742423991, 38), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 3084ms
2025-03-19T22:39:57.673+0000 I COMMAND  [conn5577] command meteor.events command: insert { insert: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 50555, $clusterTime: { clusterTime: Timestamp(1742423991, 24), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } ninserted:1 keysInserted:2 numYields:0 reslen:214 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 203ms
2025-03-19T22:39:57.693+0000 I WRITE    [conn5662] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 0 THR 1 NUM 4000 TOTAL 44.453264 ELAPSED 241.99182 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 numYields:1 locks:{ Global: { acquireCount: { r: 4, w: 4 } }, Database: { acquireCount: { w: 4 } }, Collection: { acquireCount: { w: 3 } }, oplog: { acquireCount: { w: 1 } } } 281ms
2025-03-19T22:39:57.693+0000 I WRITE    [conn5638] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 0 THR 1 NUM 5500 TOTAL 58.285771 ELAPSED 247.52044 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 numYields:1 locks:{ Global: { acquireCount: { r: 4, w: 4 } }, Database: { acquireCount: { w: 4 } }, Collection: { acquireCount: { w: 3 } }, oplog: { acquireCount: { w: 1 } } } 281ms
2025-03-19T22:39:57.693+0000 I COMMAND  [conn5662] command meteor.$cmd command: update { update: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 2144, $clusterTime: { clusterTime: Timestamp(1742423992, 1), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 4, w: 4 } }, Database: { acquireCount: { w: 4 } }, Collection: { acquireCount: { w: 3 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 477ms
2025-03-19T22:39:57.693+0000 I COMMAND  [conn5638] command meteor.$cmd command: update { update: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 8533, $clusterTime: { clusterTime: Timestamp(1742423988, 2), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 4, w: 4 } }, Database: { acquireCount: { w: 4 } }, Collection: { acquireCount: { w: 3 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 477ms
2025-03-19T22:39:57.717+0000 I COMMAND  [conn46] command meteor.exposures command: find { find: "exposures", filter: { project_uid: "P76", session_uid: "S3", exp_group_id: 1 }, projection: { abs_file_path: 1 }, lsid: { id: UUID("") }, $clusterTime: { clusterTime: Timestamp(1742423991, 30), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } planSummary: IXSCAN { project_uid: 1, session_uid: 1 } cursorid:101829008025 keysExamined:101 docsExamined:101 numYields:1 nreturned:101 reslen:23486 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_msg 104ms
2025-03-19T22:39:57.719+0000 I WRITE    [conn5664] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 1 THR 0 NUM 4500 TOTAL 44.069677 ELAPSED 244.23091 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 writeConflicts:1 numYields:1 locks:{ Global: { acquireCount: { r: 4, w: 4 } }, Database: { acquireCount: { w: 4 } }, Collection: { acquireCount: { w: 3 } }, oplog: { acquireCount: { w: 1 } } } 236ms
2025-03-19T22:39:57.719+0000 I WRITE    [conn5637] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 0 THR 0 NUM 5500 TOTAL 53.889285 ELAPSED 247.50847 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 writeConflicts:1 numYields:1 locks:{ Global: { acquireCount: { r: 4, w: 4 } }, Database: { acquireCount: { w: 4 } }, Collection: { acquireCount: { w: 3 } }, oplog: { acquireCount: { w: 1 } } } 309ms
2025-03-19T22:39:57.719+0000 I COMMAND  [conn5664] command meteor.$cmd command: update { update: "events", ordered: true, lsid: { id: UUID("8ce97d4f-16f6-4c0b-887a-113eb9a2f454") }, txnNumber: 259, $clusterTime: { clusterTime: Timestamp(1742423992, 1), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 4, w: 4 } }, Database: { acquireCount: { w: 4 } }, Collection: { acquireCount: { w: 3 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 249ms
2025-03-19T22:39:57.719+0000 I COMMAND  [conn5637] command meteor.$cmd command: update { update: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 10663, $clusterTime: { clusterTime: Timestamp(1742423988, 2), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 4, w: 4 } }, Database: { acquireCount: { w: 4 } }, Collection: { acquireCount: { w: 3 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 503ms
2025-03-19T22:39:57.719+0000 I WRITE    [conn5660] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 0 THR 0 NUM 4000 TOTAL 46.303193 ELAPSED 241.97694 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 writeConflicts:3 numYields:3 locks:{ Global: { acquireCount: { r: 6, w: 6 } }, Database: { acquireCount: { w: 6 } }, Collection: { acquireCount: { w: 5 } }, oplog: { acquireCount: { w: 1 } } } 309ms
2025-03-19T22:39:57.719+0000 I COMMAND  [conn5660] command meteor.$cmd command: update { update: "events", ordered: true, lsid: { id: UUID("") }, txnNumber: 158, $clusterTime: { clusterTime: Timestamp(1742423992, 1), signature: { hash: BinData(0, ), keyId:  } }, $db: "meteor" } numYields:0 reslen:229 locks:{ Global: { acquireCount: { r: 6, w: 6 } }, Database: { acquireCount: { w: 6 } }, Collection: { acquireCount: { w: 5 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_msg 503ms
2025-03-19T22:39:57.720+0000 I WRITE    [conn5663] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 1 THR 1 NUM 4500 TOTAL 42.988605 ELAPSED 242.08429 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 writeConflicts:4 numYields:4 locks:{ Global: { acquireCount: { r: 7, w: 7 } }, Database: { acquireCount: { w: 7 } }, Collection: { acquireCount: { w: 6 } }, oplog: { acquireCount: { w: 1 } } } 310ms
2025-03-19T22:39:57.720+0000 I WRITE    [conn5636] update meteor.events command: { q: { _id: ObjectId('') }, u: { $set: { text: "-- DEV 1 THR 0 NUM 5000 TOTAL 53.034732 ELAPSED 247.51999 --\n" } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 writeConflicts:4 numYields:4 locks:{ Global: { acquireCount: { r: 7, w: 7 } }, Database: { acquireCount: { w: 7 } }, Collection: { acquireCount: { w: 6 } }, oplog: { acquireCount: { w: 1 } } } 310ms

  1. Users cannot SSH to the head node during the network-in peak; some CryoSPARC jobs are running at that time.
  2. Relevant command_core log excerpts (note the name-resolution and scancel/munge errors; a few quick checks for these are sketched after this list):

```
2025-03-19 22:39:34,643 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:39:40,684 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:39:40,708 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:39:57,802 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:39:57,820 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:40:27,893 heartbeat_manager    ERROR    | HTTPSConnectionPool(host='get.cryosparc.com', port=443): Max retries exceeded with url: /heartbeat/77ce0c9e-c26c-11ed-b764-471579cffcb4 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5cde15d720>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
2025-03-19 22:40:29,029 heartbeat_manager    WARNING  | Error connecting to cryoSPARC license server during instance heartbeat.
2025-03-19 22:40:29,297 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:40:29,329 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:40:54,592 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:40:54,610 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:41:24,352 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:41:24,373 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:42:07,361 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:42:14,319 heartbeat_manager    ERROR    | HTTPSConnectionPool(host='get.cryosparc.com', port=443): Max retries exceeded with url: /heartbeat/77ce0c9e-c26c-11ed-b764-471579cffcb4 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5cde53a170>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
2025-03-19 22:42:36,914 heartbeat_manager    WARNING  | Error connecting to cryoSPARC license server during instance heartbeat.
2025-03-19 22:44:01,914 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:47:31 critical             CRITICAL | WORKER TIMEOUT (pid:540372)
Received SIGABRT (addr=000003e80000224b)
/shared/cryosparc/cryosparc_master/cryosparc_compute/ioengine/core.so(traceback_signal_handler+0x113)[0x7f5d1f5049f3]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f5d28e6a420]
/lib/x86_64-linux-gnu/libpthread.so.0(pthread_cond_timedwait+0x271)[0x7f5d28e657d1]
python(+0x11b958)[0x5648ecdc6958]
python(_PyEval_EvalFrameDefault+0x68a0)[0x5648ecde5dd0]
python(_PyFunction_Vectorcall+0x6c)[0x5648ecdefa2c]
python(_PyEval_EvalFrameDefault+0x72c)[0x5648ecddfc5c]
python(_PyFunction_Vectorcall+0x6c)[0x5648ecdefa2c]
python(_PyEval_EvalFrameDefault+0x72c)[0x5648ecddfc5c]
python(+0x150582)[0x5648ecdfb582]
python(_PyEval_EvalFrameDefault+0x4c12)[0x5648ecde4142]
python(_PyFunction_Vectorcall+0x6c)[0x5648ecdefa2c]
python(_PyEval_EvalFrameDefault+0x72c)[0x5648ecddfc5c]
python(_PyFunction_Vectorcall+0x6c)[0x5648ecdefa2c]
python(_PyEval_EvalFrameDefault+0x72c)[0x5648ecddfc5c]
python(_PyFunction_Vectorcall+0x6c)[0x5648ecdefa2c]
python(_PyEval_EvalFrameDefault+0x72c)[0x5648ecddfc5c]
python(_PyFunction_Vectorcall+0x6c)[0x5648ecdefa2c]
python(_PyEval_EvalFrameDefault+0x72c)[0x5648ecddfc5c]
python(_PyFunction_Vectorcall+0x6c)[0x5648ecdefa2c]
python(_PyEval_EvalFrameDefault+0x72c)[0x5648ecddfc5c]
python(+0x150582)[0x5648ecdfb582]
python(_PyEval_EvalFrameDefault+0x4c12)[0x5648ecde4142]
python(_PyFunction_Vectorcall+0x6c)[0x5648ecdefa2c]
python(_PyEval_EvalFrameDefault+0x72c)[0x5648ecddfc5c]
python(_PyFunction_Vectorcall+0x6c)[0x5648ecdefa2c]
python(_PyEval_EvalFrameDefault+0x320)[0x5648ecddf850]
python(+0x1d7c60)[0x5648ece82c60]
python(PyEval_EvalCode+0x87)[0x5648ece82ba7]
python(+0x20812a)[0x5648eceb312a]
python(+0x203523)[0x5648eceae523]
python(+0x9a6f5)[0x5648ecd456f5]
python(_PyRun_SimpleFileObject+0x1ae)[0x5648ecea89fe]
python(_PyRun_AnyFileObject+0x44)[0x5648ecea8594]
python(Py_RunMain+0x38b)[0x5648ecea578b]
python(Py_BytesMain+0x37)[0x5648ece761f7]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f5d28b2e083]
python(+0x1cb0f1)[0x5648ece760f1]


2025-03-19 22:47:59 error                ERROR    | Worker (pid:540372) exited with code 1
2025-03-19 22:47:59 error                ERROR    | Worker (pid:540372) exited with code 1.
2025-03-19 22:47:59 info                 INFO     | Booting worker with pid: 1941729
2025-03-19 22:48:01,151 start                INFO     |  === STARTED === 
2025-03-19 22:48:01,153 background_worker    INFO     |  === STARTED === 
2025-03-19 22:48:01,154 run                  INFO     | === STARTED TASKS WORKER ===
2025-03-19 22:48:01,231 set_job_status       INFO     | Status changed for P76.J84 from running to failed
2025-03-19 22:48:01,238 app_stats_refresh    INFO     | Calling app stats refresh url http://x.x.x.x:45000/api/actions/stats/refresh_job for project_uid P76, workspace_uid None, job_uid J84 with body {'projectUid': 'P76', 'jobUid': 'J84'}
2025-03-19 22:48:01,240 set_job_status       INFO     | Status changed for P76.J84 from failed to failed
2025-03-19 22:48:01,240 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:48:01,241 set_job_status       INFO     | Status changed for P76.J84 from failed to failed
2025-03-19 22:48:01,244 app_stats_refresh    INFO     | Calling app stats refresh url http://x.x.x.x:45000/api/actions/stats/refresh_job for project_uid P76, workspace_uid None, job_uid J84 with body {'projectUid': 'P76', 'jobUid': 'J84'}
2025-03-19 22:48:01,248 set_job_status       INFO     | Status changed for P76.J84 from running to failed
2025-03-19 22:48:01,249 app_stats_refresh    INFO     | Calling app stats refresh url http://x.x.x.x:45000/api/actions/stats/refresh_job for project_uid P76, workspace_uid None, job_uid J84 with body {'projectUid': 'P76', 'jobUid': 'J84'}
2025-03-19 22:48:01,250 set_job_status       INFO     | Status changed for P76.J80 from running to failed
2025-03-19 22:48:01,261 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:48:01,261 app_stats_refresh    INFO     | Calling app stats refresh url http://x.x.x.x:45000/api/actions/stats/refresh_job for project_uid P76, workspace_uid None, job_uid J84 with body {'projectUid': 'P76', 'jobUid': 'J84'}
2025-03-19 22:48:01,262 app_stats_refresh    INFO     | Calling app stats refresh url http://x.x.x.x:45000/api/actions/stats/refresh_job for project_uid P76, workspace_uid None, job_uid J80 with body {'projectUid': 'P76', 'jobUid': 'J80'}
2025-03-19 22:48:01,335 app_stats_refresh    INFO     | code 200, text {"success":true}
2025-03-19 22:48:01,358 app_stats_refresh    INFO     | code 200, text {"success":true}
2025-03-19 22:48:01,362 app_stats_refresh    INFO     | code 200, text {"success":true}
2025-03-19 22:48:01,364 app_stats_refresh    INFO     | code 200, text {"success":true}
2025-03-19 22:48:01,366 app_stats_refresh    INFO     | code 200, text {"success":true}
2025-03-19 22:48:03,351 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:48:03,361 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:48:14,256 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:48:14,265 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:48:24,709 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:48:24,719 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:48:35,365 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:48:35,375 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:48:46,052 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:48:46,061 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:48:56,437 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:48:56,446 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:49:07,305 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:49:07,315 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:49:18,089 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:49:18,098 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:49:28,725 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:49:28,735 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:49:39,515 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:49:39,524 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:49:50,232 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:49:50,242 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:50:01,386 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:50:01,396 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:50:12,909 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:50:12,922 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:50:24,399 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:50:24,410 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:50:36,480 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:50:36,496 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:50:55,541 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:50:55,557 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:51:01,938 kill_job             INFO     | ---- Killing P76 J87
2025-03-19 22:54:30,950 kill_job             ERROR    | Delete cluster job failed with exit code 104
2025-03-19 22:54:30,950 kill_job             ERROR    | Traceback (most recent call last):
2025-03-19 22:54:30,950 kill_job             ERROR    |   File "/shared/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 3017, in kill_job
2025-03-19 22:54:30,950 kill_job             ERROR    |     res = cluster.delete_cluster_job(target, job_doc['cluster_job_id'], template_args)
2025-03-19 22:54:30,950 kill_job             ERROR    |   File "/shared/cryosparc/cryosparc_master/cryosparc_compute/cluster.py", line 207, in delete_cluster_job
2025-03-19 22:54:30,950 kill_job             ERROR    |     res = subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT).decode()
2025-03-19 22:54:30,950 kill_job             ERROR    |   File "/shared/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.10/subprocess.py", line 421, in check_output
2025-03-19 22:54:30,950 kill_job             ERROR    |     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
2025-03-19 22:54:30,950 kill_job             ERROR    |   File "/shared/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.10/subprocess.py", line 526, in run
2025-03-19 22:54:30,950 kill_job             ERROR    |     raise CalledProcessError(retcode, process.args,
2025-03-19 22:54:30,950 kill_job             ERROR    | subprocess.CalledProcessError: Command '['scancel', '7163']' returned non-zero exit status 104.
2025-03-19 22:54:30,983 kill_job             ERROR    | scancel: error: Kill job error on job id 7163: Connection reset by peer
2025-03-19 22:54:30,983 kill_job             ERROR    | Traceback (most recent call last):
2025-03-19 22:54:30,983 kill_job             ERROR    |   File "/shared/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 3017, in kill_job
2025-03-19 22:54:30,983 kill_job             ERROR    |     res = cluster.delete_cluster_job(target, job_doc['cluster_job_id'], template_args)
2025-03-19 22:54:30,983 kill_job             ERROR    |   File "/shared/cryosparc/cryosparc_master/cryosparc_compute/cluster.py", line 207, in delete_cluster_job
2025-03-19 22:54:30,983 kill_job             ERROR    |     res = subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT).decode()
2025-03-19 22:54:30,983 kill_job             ERROR    |   File "/shared/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.10/subprocess.py", line 421, in check_output
2025-03-19 22:54:30,983 kill_job             ERROR    |     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
2025-03-19 22:54:30,983 kill_job             ERROR    |   File "/shared/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.10/subprocess.py", line 526, in run
2025-03-19 22:54:30,983 kill_job             ERROR    |     raise CalledProcessError(retcode, process.args,
2025-03-19 22:54:30,983 kill_job             ERROR    | subprocess.CalledProcessError: Command '['scancel', '7163']' returned non-zero exit status 104.
2025-03-19 22:54:48,680 set_job_status       INFO     | Status changed for P76.J87 from running to killed
2025-03-19 22:54:58,127 app_stats_refresh    INFO     | Calling app stats refresh url http://x.x.x.x:45000/api/actions/stats/refresh_job for project_uid P76, workspace_uid None, job_uid J87 with body {'projectUid': 'P76', 'jobUid': 'J87'}
2025-03-19 22:55:02,923 app_stats_refresh    WARNING  | Failed to call stats refresh endpoint for P76 J87: HTTPConnectionPool(host='x.x.x.x', port=45000): Read timed out. (read timeout=2)
2025-03-19 22:56:39,140 run                  WARNING  | Failed to connect link: <urlopen error [Errno -3] Temporary failure in name resolution>
2025-03-19 22:57:21,578 check_heartbeats     WARNING  | Marking P76 J87 as failed...
2025-03-19 22:57:21,589 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:57:21,589 run                  INFO     | Received task dump_job_database with 2 args and 0 kwargs
2025-03-19 22:57:21,589 dump_job_database    INFO     | Request to export P76 J87
2025-03-19 22:57:21,594 set_job_status       INFO     | Status changed for P76.J87 from killed to failed
2025-03-19 22:57:21,595 dump_job_database    INFO     |    Exporting job to /fsx/....
2025-03-19 22:57:21,596 dump_job_database    INFO     |    Exporting all of job's images in the database to /fsx/....
2025-03-19 22:57:21,645 app_stats_refresh    INFO     | Calling app stats refresh url http://x.x.x.x:45000/api/actions/stats/refresh_job for project_uid P76, workspace_uid None, job_uid J87 with body {'projectUid': 'P76', 'jobUid': 'J87'}
2025-03-19 22:57:21,659 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:57:21,660 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:57:21,673 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:57:21,723 dump_job_database    INFO     |    Writing 43 database images to /fsx/....
2025-03-19 22:57:21,723 dump_job_database    INFO     |    Done. Exported 43 images in 0.13s
2025-03-19 22:57:21,723 dump_job_database    INFO     |    Exporting all job's streamlog events...
2025-03-19 22:57:21,732 dump_job_database    INFO     |    Done. Exported 1 files in 0.01s
2025-03-19 22:57:21,733 dump_job_database    INFO     |    Exporting job metafile...
2025-03-19 22:57:21,737 dump_job_database    INFO     |    Done. Exported in 0.00s
2025-03-19 22:57:21,737 dump_job_database    INFO     |    Updating job manifest...
2025-03-19 22:57:21,743 dump_job_database    INFO     |    Done. Updated in 0.01s
2025-03-19 22:57:21,743 dump_job_database    INFO     | Exported P76 J87 in 0.15s
2025-03-19 22:57:21,744 run                  INFO     | Completed task in 0.15519285202026367 seconds
2025-03-19 22:57:21,792 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 22:57:21,801 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 22:57:22,107 app_stats_refresh    INFO     | code 200, text {"success":true}
2025-03-19 22:57:22,137 check_heartbeats     WARNING  | Marked P76 J87 as failed
2025-03-19 22:57:22,171 kill_job             INFO     | ---- Killing P76 J88
2025-03-19 22:57:58,247 run                  WARNING  | Failed to connect link: <urlopen error [Errno -3] Temporary failure in name resolution>
2025-03-19 23:43:50,873 kill_job             ERROR    | Delete cluster job failed with exit code 245
2025-03-19 23:43:50,873 kill_job             ERROR    | Traceback (most recent call last):
2025-03-19 23:43:50,873 kill_job             ERROR    |   File "/shared/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 3017, in kill_job
2025-03-19 23:43:50,873 kill_job             ERROR    |     res = cluster.delete_cluster_job(target, job_doc['cluster_job_id'], template_args)
2025-03-19 23:43:50,873 kill_job             ERROR    |   File "/shared/cryosparc/cryosparc_master/cryosparc_compute/cluster.py", line 207, in delete_cluster_job
2025-03-19 23:43:50,873 kill_job             ERROR    |     res = subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT).decode()
2025-03-19 23:43:50,873 kill_job             ERROR    |   File "/shared/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.10/subprocess.py", line 421, in check_output
2025-03-19 23:43:50,873 kill_job             ERROR    |     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
2025-03-19 23:43:50,873 kill_job             ERROR    |   File "/shared/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.10/subprocess.py", line 526, in run
2025-03-19 23:43:50,873 kill_job             ERROR    |     raise CalledProcessError(retcode, process.args,
2025-03-19 23:43:50,873 kill_job             ERROR    | subprocess.CalledProcessError: Command '['scancel', '7164']' returned non-zero exit status 245.
2025-03-19 23:43:50,879 kill_job             ERROR    | scancel: error: If munged is up, restart with --num-threads=10
2025-03-19 23:43:50,879 kill_job             ERROR    | scancel: error: Munge encode failed: Failed to receive message header: Timed-out
2025-03-19 23:43:50,879 kill_job             ERROR    | scancel: error: slurm_send_node_msg: [socket:[128325139]] slurm_bufs_sendto(msg_type=REQUEST_KILL_JOB) failed: Unexpected missing socket error
2025-03-19 23:43:50,879 kill_job             ERROR    | scancel: error: Kill job error on job id 7164: Unexpected missing socket error
2025-03-19 23:43:50,879 kill_job             ERROR    | Traceback (most recent call last):
2025-03-19 23:43:50,879 kill_job             ERROR    |   File "/shared/cryosparc/cryosparc_master/cryosparc_command/command_core/__init__.py", line 3017, in kill_job
2025-03-19 23:43:50,879 kill_job             ERROR    |     res = cluster.delete_cluster_job(target, job_doc['cluster_job_id'], template_args)
2025-03-19 23:43:50,879 kill_job             ERROR    |   File "/shared/cryosparc/cryosparc_master/cryosparc_compute/cluster.py", line 207, in delete_cluster_job
2025-03-19 23:43:50,879 kill_job             ERROR    |     res = subprocess.check_output(shlex.split(cmd), stderr=subprocess.STDOUT).decode()
2025-03-19 23:43:50,879 kill_job             ERROR    |   File "/shared/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.10/subprocess.py", line 421, in check_output
2025-03-19 23:43:50,879 kill_job             ERROR    |     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
2025-03-19 23:43:50,879 kill_job             ERROR    |   File "/shared/cryosparc/cryosparc_master/deps/anaconda/envs/cryosparc_master_env/lib/python3.10/subprocess.py", line 526, in run
2025-03-19 23:43:50,879 kill_job             ERROR    |     raise CalledProcessError(retcode, process.args,
2025-03-19 23:43:50,879 kill_job             ERROR    | subprocess.CalledProcessError: Command '['scancel', '7164']' returned non-zero exit status 245.
2025-03-19 23:43:50,916 set_job_status       INFO     | Status changed for P76.J88 from running to killed
2025-03-19 23:43:50,917 app_stats_refresh    INFO     | Calling app stats refresh url http://x.x.x.x:45000/api/actions/stats/refresh_job for project_uid P76, workspace_uid None, job_uid J88 with body {'projectUid': 'P76', 'jobUid': 'J88'}
2025-03-19 23:43:51,104 app_stats_refresh    INFO     | code 200, text {"success":true}
2025-03-19 23:43:51,174 check_heartbeats     WARNING  | Marking P76 J88 as failed...
2025-03-19 23:43:51,175 run                  INFO     | Received task dump_job_database with 2 args and 0 kwargs
2025-03-19 23:43:51,175 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 23:43:51,175 dump_job_database    INFO     | Request to export P76 J88
2025-03-19 23:43:51,181 set_job_status       INFO     | Status changed for P76.J88 from killed to failed
2025-03-19 23:43:51,181 dump_job_database    INFO     |    Exporting job to /fsx/....
2025-03-19 23:43:51,245 dump_job_database    INFO     |    Exporting all of job's images in the database to /fsx/....
2025-03-19 23:43:51,252 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 23:43:51,256 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 23:43:51,256 app_stats_refresh    INFO     | Calling app stats refresh url http://x.x.x.x:45000/api/actions/stats/refresh_job for project_uid P76, workspace_uid None, job_uid J88 with body {'projectUid': 'P76', 'jobUid': 'J88'}
2025-03-19 23:43:51,275 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 23:43:51,276 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 23:43:51,286 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 23:43:51,287 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 23:43:51,296 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 23:43:51,297 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 23:43:51,312 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 23:43:51,313 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 23:43:51,324 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 23:43:51,325 dump_workspaces      INFO     | Exporting all workspaces in P76...
2025-03-19 23:43:51,328 app_stats_refresh    INFO     | code 200, text {"success":true}
2025-03-19 23:43:51,333 check_heartbeats     WARNING  | Marked P76 J88 as failed
2025-03-19 23:43:51,344 dump_workspaces      INFO     | Exported all workspaces in P76 to /fsx/....
2025-03-19 23:43:51,367 dump_job_database    INFO     |    Writing 43 database images to /fsx/....
2025-03-19 23:43:51,367 dump_job_database    INFO     |    Done. Exported 43 images in 0.12s
2025-03-19 23:43:51,367 dump_job_database    INFO     |    Exporting all job's streamlog events...
2025-03-19 23:43:51,390 dump_job_database    INFO     |    Done. Exported 1 files in 0.02s
2025-03-19 23:43:51,390 dump_job_database    INFO     |    Exporting job metafile...
```

  3. I restarted the instance to recover.
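For the "Temporary failure in name resolution" and scancel/munge errors in the log above, a minimal sketch of checks that could be run on the head node during an incident (service names and tool availability are assumptions and may differ on a ParallelCluster deployment):

```
# DNS health -- the heartbeat errors report "Temporary failure in name resolution"
cat /etc/resolv.conf                         # which resolvers are configured?
getent hosts get.cryosparc.com               # does name resolution work right now?

# Slurm / munge health -- the scancel errors point at munged and slurmctld
systemctl status munge slurmctld --no-pager  # are both daemons running?
munge -n | unmunge                           # round-trip a credential through munged
scontrol ping                                # is slurmctld reachable?
sinfo -R                                     # any nodes down or drained, and why?
# the scancel error itself suggests restarting munged with --num-threads=10 if it is up
```
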
  1. Do you observe the crashes more often while Slurm jobs are running on the head node?
  2. What is the output of the command
    cat /sys/kernel/mm/transparent_hugepage/enabled 
    
    on the head node?
  3. What is the scope of the Network In plot in your first post: a single host (which one?) or the whole network?
  4. Do you know the source of the network traffic shown in the plot? Is it Slurm- or CryoSPARC-workload dependent? (A few ways to identify the top talkers during a spike are sketched below.)
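
If it helps with 3. and 4., a minimal sketch of commands for spotting the top talkers during a spike (assumes the head node's primary interface is eth0; sysstat, iftop and nethogs may need to be installed first):

```
sar -n DEV 5 3                        # per-interface throughput over 15 s (sysstat)
ss -tni state established | head -50  # established TCP flows with byte counters
sudo iftop -i eth0 -t -s 30           # top remote hosts by bandwidth, text mode
sudo nethogs -t eth0                  # per-process bandwidth
```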

Hi @wtempel,

The plot shows all incoming network traffic on the EC2 master node. Read operations (Ops/s) also peaked at approximately 3000 during this period.
I do not yet know the source of this network traffic. The crashes most often occur while CryoSPARC jobs are running.

Output of the command: always [madvise] never
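
(Side note, in case it is relevant: madvise is the active setting above, while MongoDB's production notes generally recommend never. A minimal sketch for trying that, assuming the usual sysfs toggle, which does not persist across reboots:)

```
# Disable transparent huge pages until the next reboot (make persistent via a
# systemd unit or tuned profile if it turns out to help).
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
```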

I also observed a large number of database operations during the peak traffic period (see the mongod log above) that are not present in earlier logs. Is it normal to see such a high volume of database commands?
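
If useful, I can capture more detail on those operations the next time the traffic peaks; a minimal sketch of what I would run (assumes cryosparcm mongo is available on the head node and accepts a script on stdin; otherwise the three commands can be pasted into the interactive shell):

```
cryosparcm mongo <<'EOF'
db.serverStatus().connections                              // how many clients are connected
db.serverStatus().opcounters                               // cumulative insert/query/update counts
db.currentOp({ active: true, secs_running: { $gt: 1 } })   // operations running longer than 1 s
EOF
```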