Re: bupc timing on VMs

From: Nikita Andreev (lestat_at_kemsu.ru)
Date: Wed Apr 14 2010 - 08:11:31 PDT

  • Next message: Nikita Andreev: "Re: upc_all_reduce behaviour"
    Paul,
    
    The problem is rather complicated, and I realize that it would take a lot 
    of time to understand and resolve. And VMs are definitely not on your 
    team's TODO list.
    
    Anyway, everything is fine on real hardware, so I can do development on 
    VMs and take the actual measurements on a physical cluster later. At least 
    you know that the problem exists, and when you have time you can look at 
    it more closely.
    
    Regards,
    Nikita
    
    ----- Original Message ----- 
    From: "Paul H. Hargrove" <PHHargrove_at_lbl_dot_gov>
    To: "Nikita Andreev" <lestat@kemsu.ru>
    Cc: "upc-users" <upc-users_at_lbl_dot_gov>
    Sent: Wednesday, April 14, 2010 10:03 PM
    Subject: Re: bupc timing on VMs
    
    
    > Nikita,
    >
    > I don't have time now to look at your problem in detail.  However, I 
    > thought I'd take a moment to let you know that I am generally distrustful 
    > of timing in VMs.
    >
    > I recall you are using VMWare, which I have not used in many years. 
    > However, my experience with Xen is that under heavy load the guest kernel 
    > is sometimes not even capable of keeping an accurate clock.  The problem 
    > is bad enough that ntpd is unable to correct for the problems.  So, I 
    > think that any work related to performance measurement should be done only 
    > on real hardware.
    >
    > -Paul
    >
    > Nikita Andreev wrote:
    >> Hello,
    >>  I'm doing some research on a homemade 2-node cluster. Actually, each 
    >> node is a 2-way virtual machine, and both run on one host system's 
    >> dual-core processor.
    >>  I'm testing a time synchronization algorithm originally developed by 
    >> the PPW team (thanks to them for their support). This code (see 
    >> attachment) works perfectly on a physical cluster. When I run it on VMs 
    >> it shows wrong results. In the attached application I sync all threads 
    >> to thread 0 twice. But sometimes it turns out that the time on a syncing 
    >> thread (one that was placed on a different node than thread 0) has gone 
    >> ahead of master thread 0.
    >>  One of the results:
    >> UPCR: UPC thread 0 of 4 on node1 (process 0 of 4, pid=13836)
    >> UPCR: UPC thread 3 of 4 on node2 (process 3 of 4, pid=24125)
    >> UPCR: UPC thread 2 of 4 on node2 (process 2 of 4, pid=24119)
    >> UPCR: UPC thread 1 of 4 on node1 (process 1 of 4, pid=13839)
    >> #1 local 10.550069 remote 10.550072
    >> #3 local 14.693299 remote 10.515528
    >> #0 local 0.000000 remote 0.000000
    >> #2 local 14.659530 remote 10.440920
    >>  As you can see, the time elapsed between time measurements on thread #3 
    >> is 14.7 sec, but on the master thread it is 10.5 sec. These measurements 
    >> (the mt and et variables) are taken at the same moment and must be equal. 
    >> The timings for thread 2 are also wrong, while thread 1 is fine since it 
    >> is on the same node as thread 0.
    >>  I can't comprehend why this is happening. Maybe processor virtualization 
    >> breaks timers?
    >>  I would greatly appreciate any suggestions and I'm ready to do any tests 
    >> to find out the source of the problem.
    >>  Regards,
    >> Nikita
    >
    >
    > -- 
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group                 Tel: +1-510-495-2352
    > HPC Research Department                   Fax: +1-510-486-6900
    > Lawrence Berkeley National Laboratory
    > 
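
The consistency check described in the quoted report (the local and remote elapsed times on each thread should match) can be sketched as below. This is only a minimal illustration and not the attached PPW-derived code: the function names `parse_results` and `disagreeing_threads` and the 1 ms tolerance are assumptions introduced here, and the sample input is copied from the result lines quoted above.

```python
import re

# Hypothetical tolerance (an assumption, not from the original code):
# elapsed times that differ by more than this many seconds are flagged.
TOLERANCE_SEC = 0.001

# Matches result lines of the form "#3 local 14.693299 remote 10.515528".
LINE_RE = re.compile(r"#(\d+) local ([\d.]+) remote ([\d.]+)")


def parse_results(text):
    """Return {thread: (local_elapsed, remote_elapsed)} for every result line found."""
    results = {}
    for match in LINE_RE.finditer(text):
        thread, local, remote = match.groups()
        results[int(thread)] = (float(local), float(remote))
    return results


def disagreeing_threads(results, tolerance=TOLERANCE_SEC):
    """Return sorted thread ids whose local and remote elapsed times differ by more than tolerance."""
    return sorted(
        thread
        for thread, (local, remote) in results.items()
        if abs(local - remote) > tolerance
    )


# Result lines as quoted in the report.
SAMPLE = """\
#1 local 10.550069 remote 10.550072
#3 local 14.693299 remote 10.515528
#0 local 0.000000 remote 0.000000
#2 local 14.659530 remote 10.440920
"""

results = parse_results(SAMPLE)
for thread in sorted(results):
    local, remote = results[thread]
    print(f"thread {thread}: local - remote = {local - remote:+.6f} s")
print("disagreeing threads:", disagreeing_threads(results))
```

On the quoted data this flags threads 2 and 3, the two threads on node2, while threads 0 and 1 on node1 agree to within a few microseconds.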
    
