4.4 Queueing jobs doesn't work

Dear all

We occasionally experience that jobs can't be queued.

You create a job, build it, and queue it; after selecting the lane or a specific GPU, you are sent back to the building state without any message.

With Edge I can see the following:

The enqueue_job request stays “pending” forever.

Perhaps related: in the app_api log I can see this:

app_api log

cryoSPARC Application API server running
getJobError P86 J1711

<--- Last few GCs --->

[28400:0x65a5640] 1476002597 ms: Scavenge 528.1 (547.1) -> 523.5 (552.8) MB, 65.8 / 0.0 ms (average mu = 0.998, current mu = 0.949) allocation failure
[28400:0x65a5640] 1476002732 ms: Scavenge 534.1 (553.1) -> 529.3 (558.8) MB, 69.7 / 0.0 ms (average mu = 0.998, current mu = 0.949) allocation failure
[28400:0x65a5640] 1476002870 ms: Scavenge 540.2 (559.3) -> 535.5 (564.8) MB, 71.8 / 0.0 ms (average mu = 0.998, current mu = 0.949) allocation failure

<--- JS stacktrace --->

FATAL ERROR: Zone Allocation failed - process out of memory
1: 0xa3aaf0 node::Abort() [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
2: 0x970199 node::FatalError(char const*, char const*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
3: 0xbba42e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
4: 0xbba7a7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
5: 0x126068e v8::internal::Zone::NewExpand(unsigned long) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
6: 0xe45145 void std::vector<unsigned char, v8::internal::ZoneAllocator >::emplace_back(unsigned char&&) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
7: 0xe451d6 v8::internal::interpreter::BytecodeArrayWriter::EmitBytecode(v8::internal::interpreter::BytecodeNode const*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
8: 0xe3dec5 v8::internal::interpreter::BytecodeArrayBuilder::ToString() [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
9: 0xe54df8 v8::internal::interpreter::BytecodeGenerator::VisitTemplateLiteral(v8::internal::TemplateLiteral*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
10: 0xe53068 v8::internal::interpreter::BytecodeGenerator::VisitForAccumulatorValue(v8::internal::Expression*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
11: 0xe530d7 v8::internal::interpreter::BytecodeGenerator::VisitReturnStatement(v8::internal::ReturnStatement*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
12: 0xe51856 v8::internal::interpreter::BytecodeGenerator::VisitIfStatement(v8::internal::IfStatement*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
13: 0xe51d04 v8::internal::interpreter::BytecodeGenerator::VisitStatements(v8::internal::ZoneList<v8::internal::Statement*> const*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
14: 0xe51e5b v8::internal::interpreter::BytecodeGenerator::VisitBlockDeclarationsAndStatements(v8::internal::Block*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
15: 0xe51ed7 v8::internal::interpreter::BytecodeGenerator::VisitBlock(v8::internal::Block*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
16: 0xe51856 v8::internal::interpreter::BytecodeGenerator::VisitIfStatement(v8::internal::IfStatement*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
17: 0xe5181b v8::internal::interpreter::BytecodeGenerator::VisitIfStatement(v8::internal::IfStatement*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
18: 0xe51d04 v8::internal::interpreter::BytecodeGenerator::VisitStatements(v8::internal::ZoneList<v8::internal::Statement*> const*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
19: 0xe51e5b v8::internal::interpreter::BytecodeGenerator::VisitBlockDeclarationsAndStatements(v8::internal::Block*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
20: 0xe51ed7 v8::internal::interpreter::BytecodeGenerator::VisitBlock(v8::internal::Block*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
21: 0xe51856 v8::internal::interpreter::BytecodeGenerator::VisitIfStatement(v8::internal::IfStatement*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
22: 0xe51d04 v8::internal::interpreter::BytecodeGenerator::VisitStatements(v8::internal::ZoneList<v8::internal::Statement*> const*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
23: 0xe5249a v8::internal::interpreter::BytecodeGenerator::GenerateBytecodeBody() [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
24: 0xe527c9 v8::internal::interpreter::BytecodeGenerator::GenerateBytecode(unsigned long) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
25: 0xe64ee0 v8::internal::interpreter::InterpreterCompilationJob::ExecuteJobImpl() [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
26: 0xc8238b [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
27: 0xc88598 [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
28: 0xc8b59e v8::internal::Compiler::Compile(v8::internal::Handle<v8::internal::SharedFunctionInfo>, v8::internal::Compiler::ClearExceptionFlag, v8::internal::IsCompiledScope*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
29: 0xc8d6bc v8::internal::Compiler::Compile(v8::internal::Handle<v8::internal::JSFunction>, v8::internal::Compiler::ClearExceptionFlag, v8::internal::IsCompiledScope*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
30: 0x108658a v8::internal::Runtime_CompileLazy(int, unsigned long*, v8::internal::Isolate*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
31: 0x1448df9 [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/api/nodejs/bin/node]
cryoSPARC Application API server running
cryoSPARC Application API server running
cryoSPARC Application API server running
cryoSPARC Application API server running
cryoSPARC Application API server running
cryoSPARC Application API server running
cryoSPARC Application API server running
cryoSPARC Application API server running
cryoSPARC Application API server running
cryoSPARC Application API server running
cryoSPARC Application API server running
cryoSPARC Application API server running

<--- Last few GCs --->

[9574:0x4c933e0] 1656316547 ms: Scavenge 628.8 (647.9) -> 624.1 (653.7) MB, 59.4 / 0.0 ms (average mu = 0.976, current mu = 0.963) allocation failure
[9574:0x4c933e0] 1656316674 ms: Scavenge 634.7 (653.9) -> 630.0 (659.2) MB, 62.7 / 0.0 ms (average mu = 0.976, current mu = 0.963) allocation failure
[9574:0x4c933e0] 1656316803 ms: Scavenge 640.7 (659.7) -> 636.0 (665.4) MB, 64.2 / 0.0 ms (average mu = 0.976, current mu = 0.963) allocation failure

<--- JS stacktrace --->

FATAL ERROR: Zone Allocation failed - process out of memory
1: 0xa3ad50 node::Abort() [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/nodejs/bin/node]
2: 0x970199 node::FatalError(char const*, char const*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/nodejs/bin/node]
3: 0xbba90e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/nodejs/bin/node]
4: 0xbbac87 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/nodejs/bin/node]
5: 0x1260bfe v8::internal::Zone::NewExpand(unsigned long) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/nodejs/bin/node]
6: 0xc030ce v8::internal::AstValueFactory::NewConsString() [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/nodejs/bin/node]
7: 0xfc116e v8::internal::ParseInfo::GetOrCreateAstValueFactory() [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/nodejs/bin/node]
8: 0xfc3473 v8::internal::Parser::Parser(v8::internal::ParseInfo*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/nodejs/bin/node]
9: 0xfe6e7c v8::internal::parsing::ParseFunction(v8::internal::ParseInfo*, v8::internal::Handle<v8::internal::SharedFunctionInfo>, v8::internal::Isolate*, v8::internal::parsing::ReportErrorsAndStatisticsMode) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/nodejs/bin/node]
10: 0xfe72b5 v8::internal::parsing::ParseAny(v8::internal::ParseInfo*, v8::internal::Handle<v8::internal::SharedFunctionInfo>, v8::internal::Isolate*, v8::internal::parsing::ReportErrorsAndStatisticsMode) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/nodejs/bin/node]
11: 0xc8ba30 v8::internal::Compiler::Compile(v8::internal::Handle<v8::internal::SharedFunctionInfo>, v8::internal::Compiler::ClearExceptionFlag, v8::internal::IsCompiledScope*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/nodejs/bin/node]
12: 0xc8db9c v8::internal::Compiler::Compile(v8::internal::Handle<v8::internal::JSFunction>, v8::internal::Compiler::ClearExceptionFlag, v8::internal::IsCompiledScope*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/nodejs/bin/node]
13: 0x1086afa v8::internal::Runtime_CompileLazy(int, unsigned long*, v8::internal::Isolate*) [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/nodejs/bin/node]
14: 0x1449379 [/opt/cryoSPARC2/cryosparc2_master/cryosparc_app/nodejs/bin/node]
cryoSPARC Application API server running
cryoSPARC Application API server running
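
For reference, the output above is from the app_api service log; a minimal way to follow it live on the master (assuming a standard v4 installation where app_api is one of the services listed by cryosparcm status) is:

cryosparcm status        # confirm the app_api service is listed and running
cryosparcm log app_api   # follow the Application API log (Ctrl+C to stop)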

Is there a way to recover from that state without restarting the complete cryoSPARC instance?
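
Ideally, assuming the individual service names reported by cryosparcm status (app, app_api, command_core, …) are accepted by the restart subcommand, restarting only the web application services would be enough, along the lines of:

cryosparcm status            # see which services are down or hung
cryosparcm restart app_api   # restart only the Application API service
cryosparcm restart app       # restart only the web application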

It may be linked to the fact that when one host is in a “strange” state (ping works but ssh does not respond) and a job is sent to that host, the webapp also hangs.
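
A quick way to spot a worker in that state from the master, using only standard tools (the worker hostnames below are placeholders), would be something like:

# Compare ping reachability with ssh responsiveness for each worker.
for h in worker1 worker2 worker3 worker4 worker5; do
    ping -c 1 -W 2 "$h" >/dev/null && p=ok || p=FAIL
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$h" true && s=ok || s=FAIL
    echo "$h: ping=$p ssh=$s"
done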

Did the host run out of RAM?

What version of CryoSPARC do you use?

Are you referring to a CryoSPARC master or worker host? How many hosts are there?
What are the outputs of the commands in a fresh shell:

free -g
cat /sys/kernel/mm/transparent_hugepage/enabled
sudo dmesg -T | grep -i oom
ps -eo user,pid,ppid,start,rsz,vsz,command | grep -e cryosparc_ -e mongo
eval $(cryosparcm env) # load CryoSPARC environment
host $CRYOSPARC_MASTER_HOSTNAME
time curl ${CRYOSPARC_MASTER_HOSTNAME}:$CRYOSPARC_COMMAND_CORE_PORT

You may want to exit the shell after having recorded the commands’ outputs to avoid inadvertently running general commands inside the CryoSPARC environment.
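
One way to do that is to run the environment-dependent commands in a subshell, so the CryoSPARC environment is discarded automatically when it exits, for example:

(
  eval $(cryosparcm env)   # load CryoSPARC environment inside the subshell only
  host $CRYOSPARC_MASTER_HOSTNAME
  time curl ${CRYOSPARC_MASTER_HOSTNAME}:$CRYOSPARC_COMMAND_CORE_PORT
)   # the environment variables vanish when the subshell exits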

Did the host run out of RAM?

I don't think so, or at least I did not see it after I was informed about the problem.

What version of CryoSPARC do you use?

Current cryoSPARC version: v4.4.1+240110

Are you referring to a CryoSPARC master or worker host? How many hosts are there?

The host in the strange state was a worker. Sorry, normally I write better reports :confused:
5 workers are available.

Output of the commands

[cryosparc_rd@krypton cryoSPARC2]$ free -g
total used free shared buff/cache available
Mem: 125 13 2 0 109 111
Swap: 15 0 15

[cryosparc_rd@krypton cryoSPARC2]$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

[cryosparc_rd@krypton cryoSPARC2]$ dmesg -T | grep -i oom

[cryosparc_rd@krypton cryoSPARC2]$ ps -eo user,pid,ppid,start,rsz,vsz,command | grep -e cryosparc_ -e mongo
cryospa+ 2600 1 Apr 05 17372 148764 python /opt/cryoSPARC2/cryosparc2_master/deps/anaconda/envs/cryosparc_master_env/bin/supervisord -c /opt/cryoSPARC2/cryosparc2_master/supervisord.conf
cryospa+ 2729 2600 Apr 05 5044496 8488024 mongod --auth --dbpath /opt/cryoSPARC2/cryosparc2_database --port 39001 --oplogSize 64 --replSet meteor --wiredTigerCacheSizeGB 4 --bind_ip_all
cryospa+ 2851 2600 Apr 05 3687032 4509648 python -c import cryosparc_command.command_core as serv; serv.start(port=39002)
cryospa+ 2909 2600 Apr 05 1041392 2606584 python -c import cryosparc_command.command_vis as serv; serv.start(port=39003)
cryospa+ 2914 2600 Apr 05 989696 1958156 python -c import cryosparc_command.command_rtp as serv; serv.start(port=39005)
cryospa+ 2978 2600 Apr 05 887276 1871876 /opt/cryoSPARC2/cryosparc2_master/cryosparc_app/nodejs/bin/node ./bundle/main.js
root 14149 14126 10:34:21 2472 215564 su - cryosparc_rd
cryospa+ 14701 14150 10:36:59 984 112816 grep --color=auto -e cryosparc_ -e mongo
cryospa+ 16025 2851 23:06:54 186480 688016 python -c import cryosparc_compute.run as run; run.run() --project P82 --job J8552 --master_hostname krypton.lnx.local.tld --master_command_core_port 39002
cryospa+ 16062 16025 23:06:54 748304 2075456 python -c import cryosparc_compute.run as run; run.run() --project P82 --job J8552 --master_hostname krypton.lnx.local.tld --master_command_core_port 39002

[cryosparc_rd@krypton cryoSPARC2]$ eval $(cryosparcm env)

[cryosparc_rd@krypton cryoSPARC2]$ host $CRYOSPARC_MASTER_HOSTNAME
krypton.lnx.local.tld has address 192.168.136.145

[cryosparc_rd@krypton cryoSPARC2]$ time curl ${CRYOSPARC_MASTER_HOSTNAME}:$CRYOSPARC_COMMAND_CORE_POR
curl: (7) Failed connect to krypton.lnx.local.tld:80; Connection refused

real 0m0.020s
user 0m0.008s
sys 0m0.005s
[cryosparc_rd@krypton cryoSPARC2]$ time curl ${CRYOSPARC_MASTER_HOSTNAME}:39000

CryoSPARC
You need to enable JavaScript to run this app.

real 0m0.017s
user 0m0.006s
sys 0m0.006s

Would it be possible for you, given the required privileges on the server and its potential other uses, to test whether disabling transparent_hugepage has an effect on the queuing disruptions?
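
A minimal sketch for a temporary test, assuming root access on the master and that nothing else on the machine relies on transparent hugepages being set to always, would be:

echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled   # takes effect immediately, not persistent across reboots
cat /sys/kernel/mm/transparent_hugepage/enabled                     # should now show [never]

Making the change persistent would typically be done with the transparent_hugepage=never kernel boot parameter or a small systemd unit that writes the value at boot.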

A character T may be missing at the end of that curl command ($CRYOSPARC_COMMAND_CORE_POR instead of $CRYOSPARC_COMMAND_CORE_PORT). Could you please try again:

eval $(cryosparcm env)
time curl ${CRYOSPARC_MASTER_HOSTNAME}:$CRYOSPARC_COMMAND_CORE_PORT

Record the output and then exit the shell to avoid inadvertently running general commands inside the CryoSPARC environment.

Currently the problem does not occur. I rebooted the worker that was in the strange state on the 5th of April and it has been stable since, which leads me to believe the issue is linked to communication problems with the workers.

command

[cryosparc_rd@krypton cryoSPARC2]$ time curl {$CRYOSPARC_MASTER_HOSTNAME}:$CRYOSPARC_COMMAND_CORE_PORT
Hello World from cryosparc command core.

real 0m0.016s
user 0m0.006s
sys 0m0.005s