Re: Expense of BUPC timer functions

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Mar 23 2010 - 13:31:45 PDT

  • Next message: Yaakoub El Khamra: "Re: Question regarding blocksize"
    Nikita,
    
    The performance and internal representation of the ticks-related code 
    are platform dependent.  If you are only (or just primarily) interested 
    in Linux on x86 and x86-64 then I would agree with you that conversion 
    is cheap.  However, there do exist other platforms where a "tick" is a 
    structure of two 32-bit fields (AIX and IRIX) and conversion is more 
    costly as a result.
    
    I also agree that storing "raw" ticks vs "double"s uses the same storage 
    (BTW: a 64-bit IEEE double has 15 decimal digits of precision, even if 
    the default printf format shows only 6).
    
    So, I think it would be fine if you want to perform the conversion at 
    runtime rather than via post processing as I had suggested previously.  
    While not doing the conversion is cheaper than doing it, you get simpler 
    more portable/maintainable code if you perform the conversion at runtime.
    
    I am surprised that VMware has such a high average case cost for the 
    query.  The code you attached previously runs for 1 million iteration.  
    Perhaps with 1 million iterations your process (or even the entire VM) 
    is getting de-scheduled and affecting your measurement.  Do you still 
    see results in the 2500ns range if you lower the iteration count to 
    something like 10000?
    
    -Paul
    
    Nikita Andreev wrote:
    > Paul,
    >
    > I've ran this test several times and here what I've got:
    > Get 1 tick: 2797ns. 1 convert: 2ns
    > Get 1 tick: 2383ns. 1 convert: 4ns
    > Get 1 tick: 82ns. 1 convert: 2ns
    > Get 1 tick: 2137ns. 1 convert: 2ns
    > Get 1 tick: 2861ns. 1 convert: 2ns
    > Get 1 tick: 2773ns. 1 convert: 2ns
    > Get 1 tick: 90ns. 1 convert: 3ns
    > Get 1 tick: 2595ns. 1 convert: 3ns
    > Get 1 tick: 2881ns. 1 convert: 2ns
    > Get 1 tick: 2849ns. 1 convert: 3ns
    >
    > I'm running CentOS 5 on VMware Guest OS which is being run under VMware ESX
    > Server 3 with Intel Xeon E5345 @ 2.33GHz CPUs on board.
    >
    > Physical CentOS 5 host with Intel Xeon E5410 @ 2.33GHz CPUs produces
    > following results:
    > Get 1 tick: 38ns. 1 convert: 2ns
    > Get 1 tick: 38ns. 1 convert: 2ns
    > Get 1 tick: 38ns. 1 convert: 2ns
    > Get 1 tick: 38ns. 1 convert: 2ns
    > Get 1 tick: 38ns. 1 convert: 2ns
    > Get 1 tick: 38ns. 1 convert: 2ns
    > Get 1 tick: 38ns. 1 convert: 2ns
    > Get 1 tick: 39ns. 1 convert: 2ns
    > Get 1 tick: 38ns. 1 convert: 2ns
    > Get 1 tick: 38ns. 1 convert: 2ns
    >
    >
    > Anyway, the reason why I'm asking is the thing you said in one of the former
    > letters (from 27th of February):
    > "... If you are concerned about the costs (time for conversion or space for
    > storage) then it may be possible that one could record "raw" ticks in a
    > trace file plus enough additional information to allow ticks-to-ns
    > conversion to be performed outside of the UPC code (e.g. by a tool that
    > processes the trace file).  ..."
    >
    > Are we agreed upon the fact that tick to sec conversion produces nearly no
    > overhead and there is no need to write timestamps in ticks? Let's say we are
    > using doubles to store converted ticks to sec value. Double has 6 digits
    > after the point. It means that digits less than 1 microsecond are discarded.
    > Hence the only target of perturbation are VMs. And moreover the cause of
    > this perturbation is not the conversion.
    >
    > The next point is about the space for storage. bupc_ticks_t is basically a
    > uin64_t. double is also a 64 bit type.
    >
    > So it seems that tick to sec conversion in first place is simple and
    > reliable, isn't it?
    >
    > Nikita
    >
    >
    > -----Original Message-----
    > From: owner-upc-users_at_lbl_dot_gov [mailto:owner-upc-users_at_lbl_dot_gov] On Behalf Of
    > Paul H. Hargrove
    > Sent: Tuesday, March 23, 2010 2:48 PM
    > To: Nikita Andreev
    > Cc: upc-users_at_lbl_dot_gov
    > Subject: Re: Expense of BUPC timer functions
    >
    > Hello, Nikita.
    >
    > The fact that you are using a virtual machine is probably not perturbing 
    > the cost of bupc_ticks_now(), at least not to the extent you report 
    > seeing. Assuming a modern AMD or Intel CPU, this function is using the 
    > RDTCS instruction on most OSes, which should be quite cheap.
    >
    > The observation that the first call to bupc_ticks_to_ns() is more 
    > expensive than later calls is to be expected. The first instance parses 
    > /proc/cpuinfo to get the clock rate and stores the value for reuse in 
    > subsequent calls.
    >
    > Running on a 2.3GHz Intel Xeon E5410 from a Xen Dom0 kernel, the output 
    > from your attached program is
    > Get 1 tick: 33ns. 1 convert: 2ns
    > I tried on a Xen HVM DomU running on the same machine:
    > Get 1 tick: 34ns. 1 convert: 2ns
    > And on a Xen PV DomU on the same machine:
    > Get 1 tick: 33ns. 1 convert: 2ns
    > So virtualization is probably not a significant factor, at least under Xen
    >
    > On an older 2.2GHz Opteron which is not running Xen I see
    > Get 1 tick: 6ns. 1 convert: 2ns
    > And an old 2.8Ghz Pentium-4 yields
    > Get 1 tick: 82ns. 1 convert: 4ns
    > So, there can be significant variation among platforms, and I don't know 
    > if the 33-vs-6 difference is Xen related or not.
    >
    > So, I will agree with you that the relative cost of query and conversion 
    > are not ordered as the documentation suggests.
    > However, I can't reproduce the 2384ns query overhead.
    >
    >
    > If you can tell me more about the platform you are running on perhaps I 
    > could understand the extraordinarily high query cost you report.
    >
    > -Paul
    >
    > Nikita Andreev wrote:
    >   
    >> Hello Paul and all,
    >>
    >> I'm measuring the overhead of bupc_ticks_now() and bupc_ticks_to_ns() 
    >> and results doesn't look like I expected. Find the test attached.
    >>
    >> I've made 1 million iterations and have got the following:
    >>
    >> bupc_ticks_now: 2383ns
    >>
    >> bupc_ticks_to_ns: 4ns
    >>
    >> So the conversion is made lot faster than query. But documentation 
    >> says: The bupc_ticks_to_{us,ns}() conversion calls can be 
    >> significantly more expensive than the bupc_ticks_now() tick query.
    >>
    >> What I also noticed is that first bupc_tick_to_ns call in a loop is 
    >> very slow. It can take even 1,5 milliseconds. And almost all others 
    >> are very fast.
    >>
    >> What am I missing here?
    >>
    >> P.S. I've performed this test on a virtual machine if it makes any 
    >> difference.
    >>
    >> Regards,
    >>
    >> Nikita Andreev
    >>
    >>     
    >
    >
    >   
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
    

  • Next message: Yaakoub El Khamra: "Re: Question regarding blocksize"