My 3D Flex Reconstruction job keeps stalling at iteration 0.
It shows “job process terminated abnormally” and stalls just before iteration 0 completes.
I tried reducing the number of particles used for the reconstruction, but it still fails at the same step.
It would be great if somebody could help me out.
Welcome to the forum @Justus. Please post any error messages (like those in your screenshot) as text.
Does the job.log file inside the job’s directory contain any additional hints about the job’s failure?
This is probably the same issue that @Flow and I describe here: 3D Flex Reconstruc fails at iteration 0 - #5 by Flow. There is nothing useful in the logs.
Hi @bsobol,
Could you please trigger this bug and then have a look at dmesg? Depending on your OS distro and version, there may be some additional information about why the crash happened.
Specifically, I’d be interested in a dmesg entry that looks vaguely like this:
[76771.355512] python[39968]: segfault at 854 ip 00007f66deb53b65 sp 00007f66beffb1b0 error 4 in blobio_native.so[7f66deb4a000+2f000]
[76771.355533] Code: 00 00 0f 29 9c 24 90 00 00 00 0f 29 a4 24 a0 00 00 00 0f 29 ac 24 b0 00 00 00 0f 29 b4 24 c0 00 00 00 0f 29 bc 24 d0 00 00 00 <48> 63 bb 54 08 00 00 4c 8d 63 54 48 8d 84 24 10 01 00 00 c7 04 24
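If the dmesg output is long, a small script can pick out entries of this shape. The following is only a sketch: the regular expression is written against the example line above, the function name `find_segfaults` is made up for illustration, and other kernels or distros may format the message differently.

```python
import re

# Pattern modelled on the example entry above; the "in <module>[base+size]"
# part is optional because the kernel does not always report a module.
SEGFAULT = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+(?P<comm>[^\[\s]+)\[(?P<pid>\d+)\]: "
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<error>\d+)"
    r"(?: in (?P<module>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)

def find_segfaults(text):
    """Return one dict of named fields per segfault entry found in text."""
    return [m.groupdict() for m in SEGFAULT.finditer(text)]

# Example entry copied from the post above.
sample = (
    "[76771.355512] python[39968]: segfault at 854 ip 00007f66deb53b65 "
    "sp 00007f66beffb1b0 error 4 in blobio_native.so[7f66deb4a000+2f000]"
)

for entry in find_segfaults(sample):
    print(entry["comm"], entry["pid"], entry["module"])
```

To use it on a real system, the saved output of dmesg can be passed to `find_segfaults` as a string. The `pid` field can then be compared with the MAIN PID line in job.log to confirm that the crash belongs to the job in question.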
Hi @wtempel and @hsnyder
My job log file looks like this…
================= CRYOSPARCW ======= 2023-03-07 17:10:04.817857 =========
Project P4 Job J583
Master csir Port 39002
========= monitor process now starting main process at 2023-03-07 17:10:04.817899
MAINPROCESS PID 58221
MAIN PID 58221
flex_refine.run_highres cryosparc_compute.jobs.jobregister
========= monitor process now waiting for main process
========= sending heartbeat at 2023-03-07 17:10:18.206409
========= sending heartbeat at 2023-03-07 17:10:28.250628
========= sending heartbeat at 2023-03-07 17:10:38.269944
========= sending heartbeat at 2023-03-07 17:10:48.289628
========= sending heartbeat at 2023-03-07 17:10:58.309331
========= sending heartbeat at 2023-03-07 17:11:08.328144
========= sending heartbeat at 2023-03-07 17:11:18.347817
========= sending heartbeat at 2023-03-07 17:11:28.366976
========= sending heartbeat at 2023-03-07 17:11:38.386573
========= sending heartbeat at 2023-03-07 17:11:48.406711
========= sending heartbeat at 2023-03-07 17:11:58.426201
========= sending heartbeat at 2023-03-07 17:12:08.445632
========= sending heartbeat at 2023-03-07 17:12:18.465494
========= sending heartbeat at 2023-03-07 17:12:28.483415
========= sending heartbeat at 2023-03-07 17:12:38.502939
========= sending heartbeat at 2023-03-07 17:12:48.522522
========= sending heartbeat at 2023-03-07 17:12:58.541806
========= sending heartbeat at 2023-03-07 17:13:08.561357
========= sending heartbeat at 2023-03-07 17:13:18.580471
========= sending heartbeat at 2023-03-07 17:13:28.599939
========= sending heartbeat at 2023-03-07 17:13:38.619642
========= sending heartbeat at 2023-03-07 17:13:48.638557
========= sending heartbeat at 2023-03-07 17:13:58.658072
========= sending heartbeat at 2023-03-07 17:14:08.677265
========= sending heartbeat at 2023-03-07 17:14:18.696897
========= sending heartbeat at 2023-03-07 17:14:28.716216
========= sending heartbeat at 2023-03-07 17:14:38.735152
========= sending heartbeat at 2023-03-07 17:14:48.754832
========= sending heartbeat at 2023-03-07 17:14:58.774575
========= sending heartbeat at 2023-03-07 17:15:08.792497
========= sending heartbeat at 2023-03-07 17:15:18.811966
========= sending heartbeat at 2023-03-07 17:15:28.831184
========= sending heartbeat at 2023-03-07 17:15:38.850861
========= sending heartbeat at 2023-03-07 17:15:48.870341
========= sending heartbeat at 2023-03-07 17:15:58.888923
========= sending heartbeat at 2023-03-07 17:16:08.908471
========= sending heartbeat at 2023-03-07 17:16:18.927831
========= sending heartbeat at 2023-03-07 17:16:28.949852
========= sending heartbeat at 2023-03-07 17:16:38.969392
========= sending heartbeat at 2023-03-07 17:16:48.987903
========= sending heartbeat at 2023-03-07 17:16:59.007366
========= sending heartbeat at 2023-03-07 17:17:09.026713
========= sending heartbeat at 2023-03-07 17:17:19.045956
========= sending heartbeat at 2023-03-07 17:17:29.065569
========= sending heartbeat at 2023-03-07 17:17:39.084251
========= sending heartbeat at 2023-03-07 17:17:49.103967
========= sending heartbeat at 2023-03-07 17:17:59.127503
========= sending heartbeat at 2023-03-07 17:18:09.146900
========= sending heartbeat at 2023-03-07 17:18:19.167214
========= sending heartbeat at 2023-03-07 17:18:29.186675
========= sending heartbeat at 2023-03-07 17:18:39.204050
========= sending heartbeat at 2023-03-07 17:18:49.222803
========= sending heartbeat at 2023-03-07 17:18:59.241755
========= sending heartbeat at 2023-03-07 17:19:09.260283
========= sending heartbeat at 2023-03-07 17:19:19.279170
========= sending heartbeat at 2023-03-07 17:19:29.298099
========= sending heartbeat at 2023-03-07 17:19:39.316821
========= sending heartbeat at 2023-03-07 17:19:49.336750
========= sending heartbeat at 2023-03-07 17:19:59.357680
========= sending heartbeat at 2023-03-07 17:20:09.377122
========= sending heartbeat at 2023-03-07 17:20:19.398057
========= sending heartbeat at 2023-03-07 17:20:29.410627
========= sending heartbeat at 2023-03-07 17:20:39.430742
========= sending heartbeat at 2023-03-07 17:20:49.450155
========= sending heartbeat at 2023-03-07 17:20:59.469832
========= sending heartbeat at 2023-03-07 17:21:09.489566
========= sending heartbeat at 2023-03-07 17:21:19.511450
========= sending heartbeat at 2023-03-07 17:21:29.530981
========= sending heartbeat at 2023-03-07 17:21:39.550584
========= sending heartbeat at 2023-03-07 17:21:49.560344
========= sending heartbeat at 2023-03-07 17:21:59.578905
========= sending heartbeat at 2023-03-07 17:22:09.598467
========= sending heartbeat at 2023-03-07 17:22:19.617388
========= sending heartbeat at 2023-03-07 17:22:29.638622
========= sending heartbeat at 2023-03-07 17:22:39.657272
========= sending heartbeat at 2023-03-07 17:22:49.677209
========= sending heartbeat at 2023-03-07 17:22:59.697509
========= sending heartbeat at 2023-03-07 17:23:09.718399
========= sending heartbeat at 2023-03-07 17:23:19.736916
========= sending heartbeat at 2023-03-07 17:23:29.767839
========= sending heartbeat at 2023-03-07 17:23:39.787753
========= sending heartbeat at 2023-03-07 17:23:49.808114
========= sending heartbeat at 2023-03-07 17:23:59.834213
========= sending heartbeat at 2023-03-07 17:24:09.852565
========= sending heartbeat at 2023-03-07 17:24:19.871538
========= sending heartbeat at 2023-03-07 17:24:29.899195
========= sending heartbeat at 2023-03-07 17:24:39.917097
========= sending heartbeat at 2023-03-07 17:24:49.936630
========= sending heartbeat at 2023-03-07 17:24:59.956419
========= sending heartbeat at 2023-03-07 17:25:10.008030
========= sending heartbeat at 2023-03-07 17:25:20.027601
========= sending heartbeat at 2023-03-07 17:25:30.058146
========= sending heartbeat at 2023-03-07 17:25:40.077335
========= sending heartbeat at 2023-03-07 17:25:50.114601
========= sending heartbeat at 2023-03-07 17:26:00.134531
========= sending heartbeat at 2023-03-07 17:26:10.191224
========= sending heartbeat at 2023-03-07 17:26:20.212472
========= sending heartbeat at 2023-03-07 17:26:30.247825
========= sending heartbeat at 2023-03-07 17:26:40.267190
========= sending heartbeat at 2023-03-07 17:26:50.299234
========= sending heartbeat at 2023-03-07 17:27:00.310700
========= sending heartbeat at 2023-03-07 17:27:10.348891
========= sending heartbeat at 2023-03-07 17:27:20.369498
========= sending heartbeat at 2023-03-07 17:27:30.405754
========= sending heartbeat at 2023-03-07 17:27:40.425335
========= sending heartbeat at 2023-03-07 17:27:50.455254
========= sending heartbeat at 2023-03-07 17:28:00.473219
========= sending heartbeat at 2023-03-07 17:28:10.491421
========= sending heartbeat at 2023-03-07 17:28:20.510282
========= sending heartbeat at 2023-03-07 17:28:30.544708
========= sending heartbeat at 2023-03-07 17:28:40.564607
========= sending heartbeat at 2023-03-07 17:28:50.596674
========= sending heartbeat at 2023-03-07 17:29:00.615914
========= sending heartbeat at 2023-03-07 17:29:10.645354
========= sending heartbeat at 2023-03-07 17:29:20.664575
========= sending heartbeat at 2023-03-07 17:29:30.684000
========= sending heartbeat at 2023-03-07 17:29:40.703416
========= sending heartbeat at 2023-03-07 17:29:50.722337
========= sending heartbeat at 2023-03-07 17:30:00.741321
========= sending heartbeat at 2023-03-07 17:30:10.758570
========= sending heartbeat at 2023-03-07 17:30:20.776725
========= sending heartbeat at 2023-03-07 17:30:30.796285
========= sending heartbeat at 2023-03-07 17:30:40.815539
========= sending heartbeat at 2023-03-07 17:30:50.856850
========= sending heartbeat at 2023-03-07 17:31:00.878079
Running job J583 of type flex_highres
Running job on hostname %s csir
Allocated Resources : {‘fixed’: {‘SSD’: False}, ‘hostname’: ‘csir’, ‘lane’: ‘default’, ‘lane_type’: ‘node’, ‘license’: True, ‘licenses_acquired’: 1, ‘slots’: {‘CPU’: [0, 1, 2, 3], ‘GPU’: [0], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7]}, ‘target’: {‘cache_path’: ‘/home/scratch’, ‘cache_quota_mb’: None, ‘cache_reserve_mb’: 10000, ‘desc’: None, ‘gpus’: [{‘id’: 0, ‘mem’: 25434324992, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 1, ‘mem’: 25434324992, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 2, ‘mem’: 25434324992, ‘name’: ‘NVIDIA RTX A5000’}, {‘id’: 3, ‘mem’: 25434324992, ‘name’: ‘NVIDIA RTX A5000’}], ‘hostname’: ‘csir’, ‘lane’: ‘default’, ‘monitor_port’: None, ‘name’: ‘csir’, ‘resource_fixed’: {‘SSD’: True}, ‘resource_slots’: {‘CPU’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111], ‘GPU’: [0, 1, 2, 3], ‘RAM’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]}, ‘ssh_str’: ‘cryosparcuser@csir’, ‘title’: ‘Worker node csir’, ‘type’: ‘node’, ‘worker_bin_path’: ‘/home/cryosparcuser/software/cryosparc/cryosparc_worker/bin/cryosparcw’}}
========= sending heartbeat at 2023-03-07 17:31:10.897306
========= sending heartbeat at 2023-03-07 17:31:20.915110
========= sending heartbeat at 2023-03-07 17:31:30.935954
========= sending heartbeat at 2023-03-07 17:31:40.953789
========= sending heartbeat at 2023-03-07 17:31:50.970527
========= sending heartbeat at 2023-03-07 17:32:00.990508
========= sending heartbeat at 2023-03-07 17:32:11.009385
========= sending heartbeat at 2023-03-07 17:32:21.028438
========= sending heartbeat at 2023-03-07 17:32:31.043106
========= sending heartbeat at 2023-03-07 17:32:41.062436
========= sending heartbeat at 2023-03-07 17:32:51.081353
========= sending heartbeat at 2023-03-07 17:33:01.101028
========= sending heartbeat at 2023-03-07 17:33:11.111506
========= sending heartbeat at 2023-03-07 17:33:21.130824
========= sending heartbeat at 2023-03-07 17:33:31.149827
========= sending heartbeat at 2023-03-07 17:33:41.168829
========= sending heartbeat at 2023-03-07 17:33:51.188067
========= sending heartbeat at 2023-03-07 17:34:01.206616
========= sending heartbeat at 2023-03-07 17:34:11.265377
========= sending heartbeat at 2023-03-07 17:34:21.337108
========= sending heartbeat at 2023-03-07 17:34:31.356937
========= sending heartbeat at 2023-03-07 17:34:41.376524
========= sending heartbeat at 2023-03-07 17:34:51.395858
========= sending heartbeat at 2023-03-07 17:35:01.415576
========= sending heartbeat at 2023-03-07 17:35:11.435576
========= sending heartbeat at 2023-03-07 17:35:21.459402
========= sending heartbeat at 2023-03-07 17:35:31.477519
========= sending heartbeat at 2023-03-07 17:35:41.496825
========= sending heartbeat at 2023-03-07 17:35:51.515903
========= sending heartbeat at 2023-03-07 17:36:01.534551
========= sending heartbeat at 2023-03-07 17:36:11.618282
========= sending heartbeat at 2023-03-07 17:36:21.634625
========= sending heartbeat at 2023-03-07 17:36:31.680260
========= sending heartbeat at 2023-03-07 17:36:41.699473
========= sending heartbeat at 2023-03-07 17:36:51.719056
========= sending heartbeat at 2023-03-07 17:37:01.737165
========= sending heartbeat at 2023-03-07 17:37:11.756474
========= sending heartbeat at 2023-03-07 17:37:21.775084
========= sending heartbeat at 2023-03-07 17:37:31.794470
========= sending heartbeat at 2023-03-07 17:37:41.813419
========= main process now complete at 2023-03-07 17:37:48.711001.
========= monitor process now complete at 2023-03-07 17:37:48.920223.
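Most of the log above consists of heartbeat lines, which makes the few informative lines hard to spot. As a rough sketch (the function name `non_heartbeat_lines` is made up here and is not part of CryoSPARC), the heartbeats could be filtered out like this, assuming they always start with a run of "=" followed by "sending heartbeat at":

```python
import re

# Matches the periodic monitor lines, e.g.
# "========= sending heartbeat at 2023-03-07 17:10:18.206409"
HEARTBEAT = re.compile(r"^=+ sending heartbeat at ")

def non_heartbeat_lines(text):
    """Return the non-empty log lines that are not heartbeat messages."""
    return [
        line for line in text.splitlines()
        if line.strip() and not HEARTBEAT.match(line)
    ]

# Small excerpt copied from the log above.
sample = """\
========= monitor process now waiting for main process
========= sending heartbeat at 2023-03-07 17:10:18.206409
Running job J583 of type flex_highres
========= sending heartbeat at 2023-03-07 17:37:41.813419
========= main process now complete at 2023-03-07 17:37:48.711001.
"""

for line in non_heartbeat_lines(sample):
    print(line)
```

For a real job, the contents of job.log can be read into a string and passed to the same function. In this log that leaves only the startup lines, the resource allocation, and the two completion lines, with no traceback, which is why the dmesg output is the next place to look.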
Hi @wtempel and @hsnyder
Running dmesg gives me the following entry for python, along with its “Code” line:
[16290.813729] python[58221]: segfault at 7eec718d7e70 ip 00007f71f9b74ff8 sp 00007ffdb4ff73b0 error 6 in _lbfgsb.cpython-38-x86_64-linux-gnu.so[7f71f9b65000+17000]
[16290.813744] Code: 48 8b bc 24 c8 00 00 00 8b 07 66 44 0f 28 da 4c 8b 4c 24 08 4c 8b 2c 24 66 44 0f 28 d2 f2 44 0f 59 da 66 44 0f 57 d7 49 63 39 45 0f 11 54 cd f8 f2 41 0f 5c cb 85 ff 0f 8e 04 02 00 00 48 8b
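Two things can be read directly from the segfault entry above. The PID in `python[58221]` matches the MAIN PID 58221 in the job log, so this crash belongs to job J583. The module name `_lbfgsb.cpython-38-x86_64-linux-gnu.so` matches the file name of SciPy's L-BFGS-B extension module, although the entry by itself does not say why the crash happened there. The sketch below does the remaining arithmetic: the offset of the instruction pointer inside the module, and the meaning of the `error` value, assuming the standard x86 page-fault error-code bits. The helper names are made up for illustration.

```python
def module_offset(ip_hex, base_hex):
    """Offset of the faulting instruction from the module's load address."""
    return int(ip_hex, 16) - int(base_hex, 16)

def decode_page_fault(error):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "page_present": bool(error & 1),  # 0: page not mapped, 1: protection fault
        "write": bool(error & 2),         # 0: read access, 1: write access
        "user_mode": bool(error & 4),     # 1: fault happened in user space
    }

# Values copied from the dmesg entry above.
offset = module_offset("00007f71f9b74ff8", "7f71f9b65000")
print(hex(offset))
print(decode_page_fault(6))
```

For this entry the offset is 0xfff8, which lies inside the reported module size of 0x17000, and error 6 decodes to a user-mode write to a page that is not mapped. The offset may be useful for locating the faulting instruction in the shared library with a disassembler, if the developers ask for it.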