University of Maryland Mike P. Cummings  
Center for Bioinformatics and Computational Biology

The Lattice Project

Newest GARLI workunits take some time to show progress
Adam Bazinet
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 18 Feb 05
Posts: 1448
Credit: 334,567
RAC: 1
Message 5417 - Posted: 18 Jul 2013, 18:12:37 UTC
Last modified: 18 Jul 2013, 18:13:26 UTC

Workunits in the newest batches of GARLI workunits released won't be using a feature that speeds up input file processing since it potentially introduces a bug. So, if your workunit doesn't show signs of progress for the first couple of hours, don't be alarmed and let it continue! Unfortunately, GARLI won't checkpoint during this phase.

Roger Merkl
Joined: 19 Jun 12
Posts: 2
Credit: 167,594
RAC: 0
Message 5428 - Posted: 9 Aug 2013, 3:02:10 UTC - in response to Message 5417.

Hi Adam, the GARLI 5.02 task I am working on is making slow progress: 132 hrs so far and 80% done. Using a 6-core processor did not speed things up.

wbblakemore
Joined: 21 Dec 07
Posts: 7
Credit: 23,465
RAC: 0
Message 5429 - Posted: 9 Aug 2013, 5:16:24 UTC - in response to Message 5428.

Hi, Roger ...

I'm pretty sure that all BOINC apps only use one processor per job. Having multiple cores only means that you can run more of them simultaneously, but the speed for each job will remain the same, unless you've switched to a faster CPU.
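
To illustrate the point above, here is a minimal Python sketch. It assumes every task is single-threaded and takes the same time; the function name is hypothetical and this is not BOINC code. The 132 hours is simply the elapsed time mentioned earlier in the thread, used as an illustrative number.

```python
import math

def batch_wall_hours(num_tasks: int, cores: int, hours_per_task: float) -> float:
    """Wall-clock hours to finish a batch when each task occupies exactly one core."""
    if num_tasks < 0 or cores < 1:
        raise ValueError("need num_tasks >= 0 and cores >= 1")
    # Tasks run in "waves" of at most `cores` tasks at a time.
    waves = math.ceil(num_tasks / cores)
    return waves * hours_per_task

# A single task takes just as long on 6 cores as on 1 core...
single_1 = batch_wall_hours(1, 1, 132.0)
single_6 = batch_wall_hours(1, 6, 132.0)
# ...but six tasks finish in the time of one when six cores are available.
six_1 = batch_wall_hours(6, 1, 132.0)
six_6 = batch_wall_hours(6, 6, 132.0)
print(single_1, single_6, six_1, six_6)
```

In other words, extra cores raise throughput (tasks finished per day) but leave the runtime of any one single-threaded task unchanged.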

Roger Merkl
Joined: 19 Jun 12
Posts: 2
Credit: 167,594
RAC: 0
Message 5430 - Posted: 9 Aug 2013, 15:52:57 UTC - in response to Message 5429.

Good day. You can go into each project and set how many CPUs it can use. Some larger projects will use more than one CPU, depending on how you set the program. It also depends on how you have the BOINC Manager main program set; I have it set to use a max of 16 CPUs. For example, if you are doing only The Lattice Project and the program is set to use the max number of CPUs (200% share), it gives you 6 projects to do; as each one is completed, the CPU is shared with the other 5, and so on as each is completed. I forgot one thing: you have to set the program to not accept any new projects. Experiment with it and see what happens.

Gundolf Jahn
Joined: 24 Aug 08
Posts: 126
Credit: 1,112
RAC: 0
Message 5431 - Posted: 9 Aug 2013, 20:03:18 UTC - in response to Message 5430.

There are only a few (fewer than 5) BOINC projects that provide multithreaded applications. All of the many dozens of other projects have applications that use exactly one core per task, and The Lattice Project is one of the latter!

Regards,
Gundolf
____________
Computers aren't everything in life. (Just kidding)

BobCat13
Joined: 5 Jun 07
Posts: 60
Credit: 326,029
RAC: 0
Message 5469 - Posted: 19 Oct 2013, 2:03:04 UTC - in response to Message 5417.

> Workunits in the newest batches of GARLI workunits released won't be using a feature that speeds up input file processing since it potentially introduces a bug. So, if your workunit doesn't show signs of progress for the first couple of hours, don't be alarmed and let it continue! Unfortunately, GARLI won't checkpoint during this phase.

Exactly how long is "first couple of hours" supposed to last? Running this task for over 12 hours now and no checkpoint has occurred.

Several other things seem different about this task as well:

1. Usually the _2 input file is several MBs in size, but this one is only 6,443 bytes.

2. In the conf file, availablememory was set to 2. Only 2 MB to run a GARLI task? I noticed that early, stopped the client, and changed the setting to 1536 (1.5 GB of RAM). I needed to change <rsc_memory_bound> in client_state.xml as well, since that was also set to a 2 MB maximum. It really didn't seem to make a difference, as the task is only using 3.21 MB of RAM.

3. In the garli.screen.log file, there are these lines:

    For this dataset:
     Mem level    availablememory setting
     great        >= 1 MB
     good         approx 0 MB to 1 MB
     low          approx 0 MB to 1 MB
     very low     approx 0 MB to 1 MB
    the minimum required availablememory is 1 MB

    You specified that Garli should use at most 1536.0 MB of memory.

    Garli will actually use approx. 0.0 MB of memory
    **Your memory level is: great (you don't need to change anything)**

Umm, it's going to use 0.0 MB of memory?

4. When the task first starts, the progress immediately jumps to 2.50% done instead of the usual less than 1% that works its way to 1% prior to checkpointing.
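
For anyone who wants to check the two memory settings from item 2 without opening the files by hand, a rough Python sketch follows. The sample file contents are made up for illustration, the helper names are hypothetical, and two things are assumptions to verify against your own files: that availablememory sits in a `[general]` section of the conf file, and that `<rsc_memory_bound>` is stored in bytes. Real files contain much more than this and may need extra handling.

```python
import configparser
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the real files (assumed layout, not copied from a task).
SAMPLE_CONF = """
[general]
availablememory = 2
"""

SAMPLE_CLIENT_STATE = """
<client_state>
  <workunit>
    <name>example_wu</name>
    <rsc_memory_bound>2000000.000000</rsc_memory_bound>
  </workunit>
</client_state>
"""

def conf_available_memory_mb(conf_text: str) -> float:
    """Read availablememory (in MB) from INI-style conf text."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getfloat("general", "availablememory")

def memory_bounds_mb(client_state_text: str) -> dict:
    """Map workunit name -> rsc_memory_bound converted to MB (assuming bytes)."""
    root = ET.fromstring(client_state_text.strip())
    bounds = {}
    for workunit in root.iter("workunit"):
        name = workunit.findtext("name", default="(unnamed)")
        bound = workunit.findtext("rsc_memory_bound")
        if bound is not None:
            bounds[name] = float(bound) / 1e6  # bytes -> MB (decimal), an assumption
    return bounds

print(conf_available_memory_mb(SAMPLE_CONF))
print(memory_bounds_mb(SAMPLE_CLIENT_STATE))
```

If both numbers come out at around 2 MB, as described above, the task was issued with the same tiny memory limit in both places.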

ChertseyAl
Joined: 10 Jun 07
Posts: 17
Credit: 126,348
RAC: 0
Message 5470 - Posted: 19 Oct 2013, 16:33:06 UTC - in response to Message 5469.


> Exactly how long is "first couple of hours" supposed to last? Running this task for over 12 hours now and no checkpoint has occurred.


There are 41 'toxic' tasks left rattling around in the system. That's one of them. Abort it, or it will run for weeks and end with a computation error and then get sent out to another sucker. The problem is that these tasks have a 'max error' of 20, so they take months to fail that many times.

It's a pity that the admins don't kill these tasks rather than leaving them to trap the unwary. I got caught by one the other day. Luckily I only wasted 30+ hours before I noticed it.

Cheers,

Al.

BobCat13
Joined: 5 Jun 07
Posts: 60
Credit: 326,029
RAC: 0
Message 5471 - Posted: 20 Oct 2013, 2:39:46 UTC - in response to Message 5470.

Thanks for the heads up on the "toxic" tasks, Al.

I suspended it shortly after posting the message last night, and have now aborted it.

Overtonesinger
Joined: 13 Jan 13
Posts: 14
Credit: 62,350
RAC: 0
Message 5472 - Posted: 22 Oct 2013, 19:33:22 UTC

Is it one of the 'toxic' tasks?

It took about 32 hours to reach the first checkpoint, and at 12.5 percent it stopped showing progress for about 100 hours. Now it seems healthy, running at 21.165 percent and taking 1.2 gigabytes of RAM.
Maybe it will fail at 1 million seconds, just like the one reported on 9/9/2013 at 23:05.

Shall I let it compute and see if a miracle happens and it completes normally? (The deadline was 2013-10-19!)

http://boinc.umiacs.umd.edu/workunit.php?wuid=1983765
(my computer has this number: 92795)

____________

Overtonesinger
Joined: 13 Jan 13
Posts: 14
Credit: 62,350
RAC: 0
Message 5478 - Posted: 2 Nov 2013, 8:31:21 UTC

http://boinc.umiacs.umd.edu/workunit.php?wuid=1983765
(my computer has this number: 92795)

This probably *toxic* task is now at 47.382% after 276:42:40 of computing on a virtual (HT) core of a Core i7 720QM at 1.73 GHz.

It will soon reach the magic milestone of 1 million seconds... and I think it will finally ERROR out. :O

Shall I abort it, or wait and see what happens?
____________

ChertseyAl
Joined: 10 Jun 07
Posts: 17
Credit: 126,348
RAC: 0
Message 5479 - Posted: 2 Nov 2013, 18:13:44 UTC - in response to Message 5478.

I suspect it will die with "exceeded elapsed maximum time" shortly.

There are 35 tasks still rattling around. If they were all on their last replication before exceeding the maximum error count, that's over a year of CPU time that's going to be wasted in total. Assuming an average of 10 remaining replications, that's 10 years of crunching time wasted.

Nice.

Cheers,

Al.
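
The estimate in the post above can be reproduced with some back-of-the-envelope arithmetic. This Python sketch assumes each toxic task burns roughly one million seconds per run before erroring out; that is the figure floated earlier in this thread, not a confirmed limit. The 35 tasks and the average of 10 remaining replications are taken from the post; the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def wasted_cpu_years(tasks: int, seconds_per_run: float, runs_per_task: int) -> float:
    """Total CPU time, in years, spent on runs that end in an error."""
    return tasks * seconds_per_run * runs_per_task / SECONDS_PER_YEAR

# Every task on its last replication: one wasted run each.
last_run_only = wasted_cpu_years(35, 1_000_000, 1)
# An average of 10 replications still to go per task.
ten_runs_each = wasted_cpu_years(35, 1_000_000, 10)

print(f"last replication only: {last_run_only:.2f} CPU-years")
print(f"10 replications each:  {ten_runs_each:.1f} CPU-years")
```

Under that assumption the totals come out at a little over one CPU-year and roughly eleven CPU-years, which lines up with the "over a year" and "10 years" figures. A task that runs longer than a million seconds before failing would push both numbers up.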

Overtonesinger
Joined: 13 Jan 13
Posts: 14
Credit: 62,350
RAC: 0
Message 5480 - Posted: 9 Nov 2013, 11:47:54 UTC - in response to Message 5479.

Thanks for the info. 10 YEARS??? Really that much? Jesus!

OK, that WU with id 29920140.1892445912291928.3_8 did NOT error out at 1 million seconds!

So, what is the maximum execution time? :O

Now it is at 47.442%
and it has been running for 1 million 335 thousand SECONDS!
:)

Adam Bazinet
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 18 Feb 05
Posts: 1448
Credit: 334,567
RAC: 1
Message 5493 - Posted: 12 Nov 2013, 21:03:04 UTC

I just saw some of the recent posts in this thread; sorry about that.

It's a bit of a catch-22 for me. There is huge variability in runtime in some sets of GARLI submissions. It's difficult for me to tell which jobs are going to take a long time and which will complete quickly. Add to this the fact that there is the occasional "toxic job", as you guys have been calling them, that seems to run indefinitely -- this would be considered a bug in GARLI.

In the past I have killed jobs that seemingly run forever, have lots of failures, etc. However, because it's difficult to tell jobs apart, I've upset people by canceling jobs that eventually finish (and thus no credit is granted). So, I've taken to just leaving jobs alone.

Perhaps the 20 failed jobs per WU is too many -- I'll consider lowering that.

Overtonesinger
Joined: 13 Jan 13
Posts: 14
Credit: 62,350
RAC: 0
Message 5511 - Posted: 21 Nov 2013, 21:13:01 UTC - in response to Message 5493.

The WU with id 29920140.1892445912291928.3_8 has NOT errored out yet, after 1 million 958 thousand seconds...

Shall I abort it? It now shows 47.469%.
____________

Adam Bazinet
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 18 Feb 05
Posts: 1448
Credit: 334,567
RAC: 1
Message 5512 - Posted: 21 Nov 2013, 21:22:59 UTC - in response to Message 5511.

That workunit in particular can be safely aborted at this point. Sorry about that.

robertmiles
Joined: 16 Apr 09
Posts: 216
Credit: 321,210
RAC: 74
Message 5515 - Posted: 23 Nov 2013, 6:32:55 UTC

Task http://boinc.umiacs.umd.edu/result.php?resultid=4240997 is running at 18.975% progress, 80:28:27 elapsed, and 12:22:30 estimated remaining. The deadline is in about 13 hours. Should I assume it is in an endless loop and abort it, or should you extend the deadline?

robertmiles
Joined: 16 Apr 09
Posts: 216
Credit: 321,210
RAC: 74
Message 5517 - Posted: 23 Nov 2013, 13:24:35 UTC - in response to Message 5515.

> Task http://boinc.umiacs.umd.edu/result.php?resultid=4240997 is running at 18.975% progress, 80:28:27 elapsed, and 12:22:30 estimated remaining. The deadline is in about 13 hours. Should I assume it is in an endless loop and abort it, or should you extend the deadline?


It's now at 18.979% progress, 87:26:44 elapsed, 13:26:53 estimated remaining. Still a little progress, but very slow progress.

Workunit id: 166335540.4098547105960646.47
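
As a rough sanity check on the "estimated remaining" figures above, here is a small Python sketch that extrapolates linearly from the reported progress and elapsed time. The function names are hypothetical, and the linear model is an assumption: nothing in this thread says GARLI's progress percentage advances at a constant rate, so treat the result as a ballpark only.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def linear_remaining_hours(progress_percent: float, elapsed_hms: str) -> float:
    """Hours left if the task keeps its average rate of progress so far."""
    if not 0 < progress_percent <= 100:
        raise ValueError("progress_percent must be in (0, 100]")
    elapsed = hms_to_seconds(elapsed_hms)
    projected_total = elapsed / (progress_percent / 100.0)
    return (projected_total - elapsed) / 3600.0

# The two progress reports for this task, as posted above.
first = linear_remaining_hours(18.975, "80:28:27")
second = linear_remaining_hours(18.979, "87:26:44")
print(f"{first:.0f} h, then {second:.0f} h remaining by linear extrapolation")
```

Both extrapolations land in the hundreds of hours, far above the 12 to 13 hours shown as estimated remaining, and the second is larger than the first because progress barely moved between the two reports.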

Adam Bazinet
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 18 Feb 05
Posts: 1448
Credit: 334,567
RAC: 1
Message 5518 - Posted: 23 Nov 2013, 14:26:26 UTC - in response to Message 5517.

That's one we still need results for. I do see that it's a challenging job; we only have one other job back from that set and it took 293 hours.

Overtonesinger
Joined: 13 Jan 13
Posts: 14
Credit: 62,350
RAC: 0
Message 5519 - Posted: 23 Nov 2013, 18:47:35 UTC - in response to Message 5518.

Oh, good. 293 hours... Hmm. On what CPU did it take 293 hours, please? Can you look it up? And how many "Dhrystones" and "Whetstones" does BOINC benchmark that CPU at? :) On average, PCs that run Lattice have 3.2 GHz, so I think on my 1.6 GHz i7 it will take 2 x 293 hours = approximately 600 hours. :O
Thanks.
P.S. Now it is at 47.471% with a runtime of 561 hours. :-)))
And YES, it is a challenging job, because it has no successful result yet.
____________

Overtonesinger
Joined: 13 Jan 13
Posts: 14
Credit: 62,350
RAC: 0
Message 5520 - Posted: 23 Nov 2013, 18:55:15 UTC - in response to Message 5512.

Oh NO! :(
Have you already closed the phase of the project that this WU belongs to? :-(

Are you ABSOLUTELY sure that there is NO scientific or debugging benefit if it somehow finishes without being aborted (i.e. ends on some interesting unexpected error that would help you to know what happened there)?
Thanks.
____________
