University of Maryland Mike P. Cummings  
Center for Bioinformatics and Computational Biology

Forum Thread

Checkpointing and Progress Bar

Adam Bazinet
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 18 Feb 05
Posts: 1448
Credit: 334,567
RAC: 1
Message 3811 - Posted: 10 May 2010, 16:22:39 UTC
Last modified: 22 May 2010, 15:39:17 UTC

It has been made clear to me that for long-running GARLI jobs, it is possible for the program to stay in both its initial phase (1-5%) and final phase (95-100%) for an extended period of time. Moreover, these phases can amount to significantly more than 10% of total runtime. During these phases the progress bar will not move and the program will not checkpoint! It is a situation we will work on remedying, but will have to live with for the time being. Thank you for your patience and understanding.

fellie
Joined: 25 Mar 08
Posts: 7
Credit: 112,387
RAC: 0
Message 3815 - Posted: 11 May 2010, 11:25:31 UTC - in response to Message 3811.

See this thread http://boinc.umiacs.umd.edu/forum_thread.php?id=551

Initial checkpoints are happening between ~23 and ~43 hours.

BobCat13
Joined: 5 Jun 07
Posts: 60
Credit: 326,029
RAC: 0
Message 3826 - Posted: 12 May 2010, 15:03:28 UTC - in response to Message 3811.

Checkpointing starts once the task passes 1% and then stops again at the 95% mark; at least, those are my observations so far. My task has been at 95% for 28 hours and the last checkpoint was also 28 hours ago.

Adam Bazinet
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 18 Feb 05
Posts: 1448
Credit: 334,567
RAC: 1
Message 3827 - Posted: 12 May 2010, 18:42:56 UTC - in response to Message 3826.

That is correct - I have confirmation that the app won't checkpoint from 95% to 100%, either. We'll be working on this.
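
For readers wondering what checkpointing in every phase could look like, here is a minimal Python sketch. It is not GARLI's code and it does not use the BOINC API: the phase names, the JSON state file, and the `run`, `save_checkpoint`, and `load_checkpoint` helpers are all hypothetical. The idea is a timer-driven checkpoint that fires regardless of which phase the program is in, written atomically so that a crash mid-write cannot corrupt the last good checkpoint.

```python
import json
import os
import tempfile
import time

PHASES = ["initial", "search", "final"]  # hypothetical phase names


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"phase": PHASES[0], "step": 0}


def run(path, steps_per_phase, interval_seconds, clock=time.monotonic):
    """Run all phases, checkpointing on a timer no matter the phase."""
    state = load_checkpoint(path)
    if state["phase"] == "done":
        return
    last_saved = clock()
    # Resume from the saved phase; later phases start from step 0.
    for phase in PHASES[PHASES.index(state["phase"]):]:
        if state["phase"] != phase:
            state = {"phase": phase, "step": 0}
        while state["step"] < steps_per_phase:
            state["step"] += 1  # placeholder for one unit of real work
            if clock() - last_saved >= interval_seconds:
                save_checkpoint(path, state)
                last_saved = clock()
    save_checkpoint(path, {"phase": "done", "step": 0})
```

With this shape, a restart loses at most `interval_seconds` of work in any phase, including the first and last ones.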

-Adam

pirogue
Joined: 5 Mar 10
Posts: 34
Credit: 492,757
RAC: 0
Message 3828 - Posted: 12 May 2010, 20:31:23 UTC - in response to Message 3827.

Is there any way to even guess at how long these are going to run?

I don't really mind the fact that the checkpoints don't work, since I don't reboot my machines or stop BOINC very often. What's bugging me is the bogus progress bar.

It took 72 hours to get to 95% and it's been 30+ hours trying to finish the last 5%. According to the progress bar, these should have been done around 27 hours ago. :(
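
The "27 hours ago" figure follows from reading the progress bar as a linear estimate. A small Python sketch of that arithmetic, using the numbers from this post (the function name is made up for illustration, and 30 hours stands in for "30+"):

```python
def projected_total_hours(elapsed_hours, fraction_done):
    """Linear extrapolation: assume progress is proportional to elapsed time."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


total = projected_total_hours(72, 0.95)  # about 75.8 hours
remaining_at_95 = total - 72             # about 3.8 hours
overdue = 30 - remaining_at_95           # about 26.2 hours past the estimate
print(round(total, 1), round(remaining_at_95, 1), round(overdue, 1))
```

With a bit more than 30 hours spent in the last 5%, the task is roughly 27 hours past where a linear reading of the bar says it should have finished.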


Adam Bazinet
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 18 Feb 05
Posts: 1448
Credit: 334,567
RAC: 1
Message 3829 - Posted: 12 May 2010, 20:36:40 UTC - in response to Message 3828.

I've been told by the author that it won't be hard to make progress increment from 1% -> 5% - I'll ask him about 95% -> 100%.

pirogue
Joined: 5 Mar 10
Posts: 34
Credit: 492,757
RAC: 0
Message 3830 - Posted: 12 May 2010, 20:53:55 UTC - in response to Message 3829.

While you're at it, can you ask him to make it closer to reality? :)

MacRonin
Joined: 23 Apr 08
Posts: 25
Credit: 5,990
RAC: 0
Message 3831 - Posted: 12 May 2010, 21:03:14 UTC - in response to Message 3830.

I'd appreciate realistic estimates also.

I don't mind running very long tasks (I also run climate ones, which estimate 200-500 hours but allow months to finish and take checkpoints), but I wish there were realistic time estimates. Right now GARLI kills access for other well-behaved apps that give realistic estimates. And while I would love to run GARLI WUs (when they exist), I don't want to punish others just for the opportunity to do so.

Adam Bazinet
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 18 Feb 05
Posts: 1448
Credit: 334,567
RAC: 1
Message 3832 - Posted: 12 May 2010, 21:04:30 UTC - in response to Message 3830.

Yeah... he's tried. And the current version is a LOT better than it used to be. Since it's a combinatorial optimization problem, heuristics (such as a GA, i.e. a genetic algorithm) must be employed. With this type of stochastic algorithm it can be difficult to know how close to "done" the program is... it could be very near returning a solution and then suddenly find a better one, so it decides to explore the search space around that better solution a while longer. Hence the variance in GARLI runtimes!
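
To illustrate why this kind of stopping rule makes runtimes hard to predict, here is a toy Python sketch. It is plain random search with a "stop after `patience` steps without improvement" rule, not a genetic algorithm and not GARLI's actual termination logic; the function name, the Gaussian scoring, and the `patience` parameter are assumptions made for illustration.

```python
import random


def stochastic_search(seed, patience, max_steps=1_000_000):
    """Stop only after `patience` consecutive steps without a better score."""
    rng = random.Random(seed)
    best = float("-inf")
    since_improvement = 0
    steps = 0
    while since_improvement < patience and steps < max_steps:
        steps += 1
        candidate = rng.gauss(0.0, 1.0)  # toy stand-in for scoring a solution
        if candidate > best:
            best = candidate
            since_improvement = 0  # a late improvement resets the countdown
        else:
            since_improvement += 1
    return steps, best


# Same settings, different random seeds: the step counts can differ widely.
for seed in range(5):
    steps, best = stochastic_search(seed, patience=200)
    print(seed, steps, round(best, 3))
```

Because every improvement resets the countdown, a run that looks nearly finished can keep going for a long time, which is the behavior described above.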

Adam Bazinet
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 18 Feb 05
Posts: 1448
Credit: 334,567
RAC: 1
Message 3833 - Posted: 12 May 2010, 21:08:10 UTC - in response to Message 3831.

For realistic estimates, I'm working on a machine learning solution that I'll train on various attributes of the program input data, along with configuration settings and the resulting runtime. Once I have this black box, I'll be able to feed it the input from a job it has never run before and it'll give me back a runtime estimate. I plan to have that very soon, but it's a work in progress, so in the beginning it'll be a little rough. With this batch, actually, I've double-checked the estimate I put in and it's nowhere near the absurd values people are reporting from the client... eventually I'll have to figure out where the disconnect is.
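
As a rough picture of what such a black box might look like, here is a minimal Python sketch. The post does not say which model is planned, so the k-nearest-neighbour averaging below is a swapped-in stand-in, and the two features and all numbers in `history` are made up purely for illustration.

```python
import math


def estimate_runtime(history, query, k=3):
    """Average the runtimes of the k past jobs closest to the query features."""
    if not history:
        raise ValueError("need at least one past job")
    nearest = sorted(history, key=lambda job: math.dist(job[0], query))[:k]
    return sum(runtime for _, runtime in nearest) / len(nearest)


# Made-up rows: ((feature_a, feature_b), runtime_hours). Not real job data.
history = [
    ((10, 500), 2.0),
    ((12, 600), 3.0),
    ((50, 2000), 40.0),
    ((55, 2100), 48.0),
]

# A new job whose features sit between the first two rows.
print(estimate_runtime(history, (11, 550), k=2))
```

A real predictor would need feature scaling and far more training data; the point is only the shape of the interface: job attributes in, runtime estimate out.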

B-Man
Joined: 27 Dec 08
Posts: 119
Credit: 43,669
RAC: 0
Message 3835 - Posted: 13 May 2010, 2:21:12 UTC

I had BOINC lose its heartbeat from the GARLI app and exit. I lost 80 h of crunch time and the task reset back to zero. It was within 1 hour of its first checkpoint.

Adam Bazinet
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 18 Feb 05
Posts: 1448
Credit: 334,567
RAC: 1
Message 3837 - Posted: 13 May 2010, 2:24:17 UTC - in response to Message 3835.

Sorry about that, B-man.

MacRonin
Joined: 23 Apr 08
Posts: 25
Credit: 5,990
RAC: 0
Message 3838 - Posted: 13 May 2010, 2:37:50 UTC - in response to Message 3833.

One of my WUs has finally decided to share CPU time with some Milkyway tasks that are due this Sunday, so it's slowing down a bit. But here are the current stats. BTW, they originally started at the same time.

WU #1 - 53:45 hours CPU time and 49.5 % completed

WU #2 - 48:43 hours CPU time and 63 % completed

Patrick Harnett*
Joined: 16 Apr 10
Posts: 4
Credit: 316,034
RAC: 0
Message 3839 - Posted: 13 May 2010, 7:10:46 UTC

Finally found a clear explanation for a problem. My computer sometimes freezes and has to be restarted. Having joined a month ago and passed 100 elapsed hours on several occasions, I have never had a checkpoint. Probably wasted 1000 CPU hours.

Aborted and no-new-tasks. :(

Cyph3r
Joined: 23 Aug 08
Posts: 7
Credit: 1,166,159
RAC: 0
Message 3842 - Posted: 13 May 2010, 22:52:49 UTC - in response to Message 3839.

I have 6 WUs that have been stuck at 95% for more than 48 hrs, now at 120-130 hrs total.
I also have 6 suspended at 90%-93% with 67-75 hrs (I had to finish other jobs from other projects, so I suspended work before the 95% checkpoint).
I will change the hosts to NNW (no new work) overnight, to prevent other projects from interfering with the current WUs. And wait to see what happens...

BobCat13
Joined: 5 Jun 07
Posts: 60
Credit: 326,029
RAC: 0
Message 3845 - Posted: 14 May 2010, 16:52:44 UTC - in response to Message 3826.

Well, this is nice. After 76 hours from the 95% mark and more than 139 hours total, the task gets to 100% and receives a computation error because it didn't create one of the output files.

http://boinc.umiacs.umd.edu/result.php?resultid=2975450


5-14-2010 11:01:09 AM The Lattice Project Computation for task 334402240.7554501772541058.2_1 finished
5-14-2010 11:01:10 AM The Lattice Project Output file 334402240.7554501772541058.2_1_4 for task 334402240.7554501772541058.2_1 absent


<core_client_version>6.10.43</core_client_version>
<![CDATA[
<stderr_txt>

</stderr_txt>
<message>
<file_xfer_error>
<file_name>334402240.7554501772541058.2_1_4</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>

Adam Bazinet
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 18 Feb 05
Posts: 1448
Credit: 334,567
RAC: 1
Message 3846 - Posted: 14 May 2010, 16:59:46 UTC - in response to Message 3845.

BobCat - did BOINC already clean up your working directory? (if not, we might be able to learn something from the program logs...)

BobCat13
Joined: 5 Jun 07
Posts: 60
Credit: 326,029
RAC: 0
Message 3848 - Posted: 14 May 2010, 17:34:50 UTC - in response to Message 3846.

BobCat - did BOINC already clean up your working directory? (if not, we might be able to learn something from the program logs...)

Yes, unfortunately the slot directory is empty.

Here are the last few entries in stdoutdae.txt regarding Lattice Project

11-May-2010 07:00:44 [The Lattice Project] [checkpoint_debug] result 334402240.7554501772541058.2_1 checkpointed
14-May-2010 11:01:09 [The Lattice Project] Computation for task 334402240.7554501772541058.2_1 finished
14-May-2010 11:01:10 [The Lattice Project] Output file 334402240.7554501772541058.2_1_4 for task 334402240.7554501772541058.2_1 absent
14-May-2010 12:45:48 [The Lattice Project] Started upload of 334402240.7554501772541058.2_1_0
14-May-2010 12:45:50 [The Lattice Project] Finished upload of 334402240.7554501772541058.2_1_0
14-May-2010 12:45:50 [The Lattice Project] Started upload of 334402240.7554501772541058.2_1_1
14-May-2010 12:45:52 [The Lattice Project] Finished upload of 334402240.7554501772541058.2_1_1
14-May-2010 12:45:52 [The Lattice Project] Started upload of 334402240.7554501772541058.2_1_2
14-May-2010 12:45:53 [The Lattice Project] Finished upload of 334402240.7554501772541058.2_1_2
14-May-2010 12:45:53 [The Lattice Project] Started upload of 334402240.7554501772541058.2_1_3
14-May-2010 12:45:59 [The Lattice Project] Finished upload of 334402240.7554501772541058.2_1_3

I have checked all of the other std*.txt logs in the data directory and they contain nothing about this task.
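
For anyone who wants to pull the same facts out of their own stdoutdae.txt, here is a small Python sketch that finds the last checkpoint, the finish time, and any output files reported absent. The regular expression and function names are my own and assume the line format shown above; the three sample lines are copied from this post.

```python
import re
from datetime import datetime

# Assumed layout: "DD-Mon-YYYY HH:MM:SS [project name] message"
LINE = re.compile(r"^(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}) \[(.+?)\] (.*)$")


def parse_line(line):
    """Return (timestamp, message), or None if the line does not match."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    # %b expects English month abbreviations in the default C locale.
    stamp = datetime.strptime(match.group(1), "%d-%b-%Y %H:%M:%S")
    return stamp, match.group(3)


def summarize(lines):
    """Return (hours from last checkpoint to finish, list of absent files)."""
    last_checkpoint = None
    finished = None
    absent = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, message = parsed
        if message.startswith("[checkpoint_debug]") and message.endswith("checkpointed"):
            last_checkpoint = stamp
        elif message.startswith("Computation for task") and message.endswith("finished"):
            finished = stamp
        elif message.startswith("Output file") and message.endswith("absent"):
            absent.append(message.split()[2])
    gap_hours = None
    if last_checkpoint is not None and finished is not None:
        gap_hours = (finished - last_checkpoint).total_seconds() / 3600
    return gap_hours, absent


sample = [
    "11-May-2010 07:00:44 [The Lattice Project] [checkpoint_debug] result 334402240.7554501772541058.2_1 checkpointed",
    "14-May-2010 11:01:09 [The Lattice Project] Computation for task 334402240.7554501772541058.2_1 finished",
    "14-May-2010 11:01:10 [The Lattice Project] Output file 334402240.7554501772541058.2_1_4 for task 334402240.7554501772541058.2_1 absent",
]
gap_hours, absent = summarize(sample)
print(round(gap_hours, 1), absent)
```

On these three lines the gap between the last checkpoint and the finish comes out at about 76 hours, which matches the time spent at the 95% mark reported earlier in the thread.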

[SG aktiv] Nullinger
Joined: 24 Aug 08
Posts: 6
Credit: 1,296,964
RAC: 169
Message 3851 - Posted: 14 May 2010, 21:47:58 UTC
Last modified: 14 May 2010, 21:56:14 UTC

Hello Adam, my first WU also ended with an error, after 138 h:

http://boinc.umiacs.umd.edu/result.php?resultid=2974815

14.05.2010 21:32:13 The Lattice Project Computation for task 234594970.40009717460018646.5_0 finished
14.05.2010 21:32:13 The Lattice Project Output file 234594970.40009717460018646.5_0_4 for task 234594970.40009717460018646.5_0 absent
14.05.2010 21:32:14 The Lattice Project Started upload of 234594970.40009717460018646.5_0_0
14.05.2010 21:32:14 The Lattice Project Started upload of 234594970.40009717460018646.5_0_1
14.05.2010 21:32:17 The Lattice Project Finished upload of 234594970.40009717460018646.5_0_0
14.05.2010 21:32:17 The Lattice Project Finished upload of 234594970.40009717460018646.5_0_1
14.05.2010 21:32:17 The Lattice Project Started upload of 234594970.40009717460018646.5_0_2
14.05.2010 21:32:17 The Lattice Project Started upload of 234594970.40009717460018646.5_0_3
14.05.2010 21:32:18 The Lattice Project Finished upload of 234594970.40009717460018646.5_0_2
14.05.2010 21:33:09 The Lattice Project Finished upload of 234594970.40009717460018646.5_0_3

Adam Bazinet
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 18 Feb 05
Posts: 1448
Credit: 334,567
RAC: 1
Message 3853 - Posted: 14 May 2010, 23:13:49 UTC - in response to Message 3851.

OK, this is definitely a problem. Basically, GARLI is not producing an output file that it is supposed to. I've stopped the project for the time being until I can figure it out. For the (very small) handful of you that have returned what I believe is a valid result (minus this one file that frankly isn't super-important), don't worry - I'll grant your credit manually. And for the rest of you, I'll try to find some way to fix this - but it might not be pretty.

Copyright © 2017 The Lattice Project
Direct questions and comments to Lattice Admin