Re: Expense of BUPC timer functions

Date view	Thread view	Subject view	Author view	Attachment view
From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Mar 23 2010 - 13:31:45 PDT
Next message: Yaakoub El Khamra: "Re: Question regarding blocksize"
Previous message: Paul H. Hargrove: "Re: Question regarding blocksize"
In reply to: Nikita Andreev: "RE: Expense of BUPC timer functions"
Nikita,

The performance and internal representation of the ticks-related code 
are platform dependent.  If you are only (or just primarily) interested 
in Linux on x86 and x86-64 then I would agree with you that conversion 
is cheap.  However, there do exist other platforms where a "tick" is a 
structure of two 32-bit fields (AIX and IRIX) and conversion is more 
costly as a result.

I also agree that storing "raw" ticks vs "double"s uses the same storage 
(BTW: a 64-bit IEEE double has 15 decimal digits of precision, even if 
the default printf format shows only 6).

So, I think it would be fine if you want to perform the conversion at 
runtime rather than via post processing as I had suggested previously.  
While not doing the conversion is cheaper than doing it, you get simpler 
more portable/maintainable code if you perform the conversion at runtime.

I am surprised that VMware has such a high average case cost for the 
query.  The code you attached previously runs for 1 million iteration.  
Perhaps with 1 million iterations your process (or even the entire VM) 
is getting de-scheduled and affecting your measurement.  Do you still 
see results in the 2500ns range if you lower the iteration count to 
something like 10000?

-Paul

Nikita Andreev wrote:
> Paul,
>
> I've ran this test several times and here what I've got:
> Get 1 tick: 2797ns. 1 convert: 2ns
> Get 1 tick: 2383ns. 1 convert: 4ns
> Get 1 tick: 82ns. 1 convert: 2ns
> Get 1 tick: 2137ns. 1 convert: 2ns
> Get 1 tick: 2861ns. 1 convert: 2ns
> Get 1 tick: 2773ns. 1 convert: 2ns
> Get 1 tick: 90ns. 1 convert: 3ns
> Get 1 tick: 2595ns. 1 convert: 3ns
> Get 1 tick: 2881ns. 1 convert: 2ns
> Get 1 tick: 2849ns. 1 convert: 3ns
>
> I'm running CentOS 5 on VMware Guest OS which is being run under VMware ESX
> Server 3 with Intel Xeon E5345 @ 2.33GHz CPUs on board.
>
> Physical CentOS 5 host with Intel Xeon E5410 @ 2.33GHz CPUs produces
> following results:
> Get 1 tick: 38ns. 1 convert: 2ns
> Get 1 tick: 38ns. 1 convert: 2ns
> Get 1 tick: 38ns. 1 convert: 2ns
> Get 1 tick: 38ns. 1 convert: 2ns
> Get 1 tick: 38ns. 1 convert: 2ns
> Get 1 tick: 38ns. 1 convert: 2ns
> Get 1 tick: 38ns. 1 convert: 2ns
> Get 1 tick: 39ns. 1 convert: 2ns
> Get 1 tick: 38ns. 1 convert: 2ns
> Get 1 tick: 38ns. 1 convert: 2ns
>
>
> Anyway, the reason why I'm asking is the thing you said in one of the former
> letters (from 27th of February):
> "... If you are concerned about the costs (time for conversion or space for
> storage) then it may be possible that one could record "raw" ticks in a
> trace file plus enough additional information to allow ticks-to-ns
> conversion to be performed outside of the UPC code (e.g. by a tool that
> processes the trace file).  ..."
>
> Are we agreed upon the fact that tick to sec conversion produces nearly no
> overhead and there is no need to write timestamps in ticks? Let's say we are
> using doubles to store converted ticks to sec value. Double has 6 digits
> after the point. It means that digits less than 1 microsecond are discarded.
> Hence the only target of perturbation are VMs. And moreover the cause of
> this perturbation is not the conversion.
>
> The next point is about the space for storage. bupc_ticks_t is basically a
> uin64_t. double is also a 64 bit type.
>
> So it seems that tick to sec conversion in first place is simple and
> reliable, isn't it?
>
> Nikita
>
>
> -----Original Message-----
> From: owner-upc-users_at_lbl_dot_gov [mailto:owner-upc-users_at_lbl_dot_gov] On Behalf Of
> Paul H. Hargrove
> Sent: Tuesday, March 23, 2010 2:48 PM
> To: Nikita Andreev
> Cc: upc-users_at_lbl_dot_gov
> Subject: Re: Expense of BUPC timer functions
>
> Hello, Nikita.
>
> The fact that you are using a virtual machine is probably not perturbing 
> the cost of bupc_ticks_now(), at least not to the extent you report 
> seeing. Assuming a modern AMD or Intel CPU, this function is using the 
> RDTCS instruction on most OSes, which should be quite cheap.
>
> The observation that the first call to bupc_ticks_to_ns() is more 
> expensive than later calls is to be expected. The first instance parses 
> /proc/cpuinfo to get the clock rate and stores the value for reuse in 
> subsequent calls.
>
> Running on a 2.3GHz Intel Xeon E5410 from a Xen Dom0 kernel, the output 
> from your attached program is
> Get 1 tick: 33ns. 1 convert: 2ns
> I tried on a Xen HVM DomU running on the same machine:
> Get 1 tick: 34ns. 1 convert: 2ns
> And on a Xen PV DomU on the same machine:
> Get 1 tick: 33ns. 1 convert: 2ns
> So virtualization is probably not a significant factor, at least under Xen
>
> On an older 2.2GHz Opteron which is not running Xen I see
> Get 1 tick: 6ns. 1 convert: 2ns
> And an old 2.8Ghz Pentium-4 yields
> Get 1 tick: 82ns. 1 convert: 4ns
> So, there can be significant variation among platforms, and I don't know 
> if the 33-vs-6 difference is Xen related or not.
>
> So, I will agree with you that the relative cost of query and conversion 
> are not ordered as the documentation suggests.
> However, I can't reproduce the 2384ns query overhead.
>
>
> If you can tell me more about the platform you are running on perhaps I 
> could understand the extraordinarily high query cost you report.
>
> -Paul
>
> Nikita Andreev wrote:
>   
>> Hello Paul and all,
>>
>> I'm measuring the overhead of bupc_ticks_now() and bupc_ticks_to_ns() 
>> and results doesn't look like I expected. Find the test attached.
>>
>> I've made 1 million iterations and have got the following:
>>
>> bupc_ticks_now: 2383ns
>>
>> bupc_ticks_to_ns: 4ns
>>
>> So the conversion is made lot faster than query. But documentation 
>> says: The bupc_ticks_to_{us,ns}() conversion calls can be 
>> significantly more expensive than the bupc_ticks_now() tick query.
>>
>> What I also noticed is that first bupc_tick_to_ns call in a loop is 
>> very slow. It can take even 1,5 milliseconds. And almost all others 
>> are very fast.
>>
>> What am I missing here?
>>
>> P.S. I've performed this test on a virtual machine if it makes any 
>> difference.
>>
>> Regards,
>>
>> Nikita Andreev
>>
>>     
>
>
>   


-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
Next message: Yaakoub El Khamra: "Re: Question regarding blocksize"
Previous message: Paul H. Hargrove: "Re: Question regarding blocksize"
In reply to: Nikita Andreev: "RE: Expense of BUPC timer functions"
Date view	Thread view	Subject view	Author view	Attachment view