From: sainath l (ls.sainath_at_gmail_dot_com)
Date: Sat Jul 25 2009 - 04:50:29 PDT
Hi guys,
@ Paul
I am using BUPC 2.8.0.
@ Gary
An alternative might be to increase the number of
iterations until the total time taken exceeds some threshold,
say 10 seconds. Then for any reasonable implementation of
upc_barrier you can assume that its impact on the total time
is not significant. Something like this:
#define MIN_TEST_TIME 10.0
while (flag)
{
    upc_barrier;
    start = get_time ();
    for (i = 0; i < iter; i++)
    {
        upc_barrier;
        upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
                           UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
        upc_barrier;
    }
    T = get_time () - start;
    upc_barrier;
    if (MYTHREAD == 0)
    {
        /* [...] */
        if (T < MIN_TEST_TIME)
        {
            iter = iter * 2;
        }
So, if I add the following after the while loop:
    start = get_time();
    for (i = 0; i < iter; i++)
    {
        upc_barrier;
        upc_barrier;
    }
    temp = get_time() - start;
I should get a more accurate answer, right, since the time taken by the
barriers alone would not be greater than T?
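The calibrate-then-subtract idea can also be tried outside UPC. Below is a minimal sketch in plain C, under several assumptions: full_body() and barrier_only_body() are hypothetical stand-ins for the barrier + broadcast + barrier body and the two-barrier body, the helper names are made up for illustration, and the timer uses POSIX clock_gettime() rather than the get_time() from this thread.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Wall-clock seconds from a monotonic clock (POSIX clock_gettime). */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double) ts.tv_sec + (double) ts.tv_nsec * 1.0e-9;
}

/* Hypothetical stand-ins for the two loop bodies being timed. */
static volatile unsigned long sink;

static void barrier_only_body(void)
{
    sink += 1;                      /* plays the role of two barriers */
}

static void full_body(void)
{
    int k;
    for (k = 0; k < 1000; k++)      /* plays the role of the broadcast */
        sink += (unsigned long) k;
    sink += 1;                      /* plus the same "barrier" cost */
}

/* Time 'iter' back-to-back calls of 'body'. */
static double time_loop(void (*body)(void), long iter)
{
    long i;
    double start = now_seconds();
    for (i = 0; i < iter; i++)
        body();
    return now_seconds() - start;
}

/* Double iter until the full loop runs for at least min_time seconds. */
static long calibrate_iter(void (*full)(void), double min_time)
{
    long iter = 1;
    assert(min_time > 0.0);
    while (iter < (1L << 30) && time_loop(full, iter) < min_time)
        iter *= 2;
    return iter;
}

/* Per-iteration time of the full body with the overhead loop subtracted. */
static double net_time_per_iter(void (*full)(void), void (*overhead)(void),
                                long iter)
{
    double T, temp;
    assert(iter > 0);
    T = time_loop(full, iter);
    temp = time_loop(overhead, iter);
    return (T - temp) / (double) iter;
}
```

Calling calibrate_iter(full_body, 10.0) and then net_time_per_iter(full_body, barrier_only_body, iter) mirrors the MIN_TEST_TIME loop followed by the barrier-only loop. The subtraction is only meaningful if both loops run with the same final value of iter.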
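This is my get_time() in gettime.h:
-----------------------------------
#include <stddef.h>     /* NULL */
#include <sys/time.h>   /* gettimeofday(), struct timeval */

/* Wall-clock seconds elapsed since the first call to get_time(). */
double get_time()
{
    static int Fcall = 1;
    static double Init_time;    /* time origin, in whole seconds */
    double Time;
    struct timeval Tp;
    if (Fcall == 1)
    {
        /* The first call records the origin so later values stay small. */
        gettimeofday(&Tp, NULL);
        Init_time = (double) Tp.tv_sec;
        Fcall = 0;
    }
    gettimeofday(&Tp, NULL);
    Time = (double) Tp.tv_sec - Init_time + (double) Tp.tv_usec * 1.0e-6;
    return Time;
}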
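The seconds-plus-microseconds arithmetic above can be checked in isolation. The following is a small sketch, assuming a hypothetical helper named elapsed_seconds() that is not part of gettime.h; it subtracts two struct timeval values field by field, so a microsecond field that wraps between the two readings is handled correctly.

```c
#include <assert.h>
#include <sys/time.h>

/* Hypothetical helper: seconds elapsed from 'start' to 'end'. */
static double elapsed_seconds(struct timeval start, struct timeval end)
{
    double sec, usec;

    /* gettimeofday() keeps tv_usec in the range [0, 1000000). */
    assert(start.tv_usec >= 0 && start.tv_usec < 1000000);
    assert(end.tv_usec >= 0 && end.tv_usec < 1000000);

    /* Subtract the integer fields first so that large tv_sec values
       cancel before the conversion to double. */
    sec = (double) (end.tv_sec - start.tv_sec);
    usec = (double) (end.tv_usec - start.tv_usec);  /* may be negative */
    return sec + usec * 1.0e-6;
}
```

For example, a start of 10 s + 900000 us and an end of 11 s + 100000 us are 0.2 seconds apart: the seconds differ by 1 and the microseconds by -800000.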
Thank you very much for the suggestions and help.
Cheers,
Sainath
On Sat, Jul 25, 2009 at 8:46 AM, Gary Funck <gary_at_intrepid_dot_com> wrote:
>
> On 07/25/09 05:37:20, sainath l wrote:
> > Hi,
> >
> > Thank you very much for answering my questions Paul. And extremely sorry
> > for not providing the "gettime.h" file. Will make sure that I provide all
> > the related files from next time.
>
>
> I used this simple implementation:
>
> #include <time.h>
>
> double
> get_time()
> {
> clock_t t = clock();
> return (double) t / (double) CLOCKS_PER_SEC;
> }
>
> I'm uncertain as to whether clock() will return the sum of
> the processor time of all currently running processes in
> the UPC program, or just the time of the calling process. I think
> only the calling process. Things may become more problematic
> if pthreads are in play.
>
> What I've done in the past for this sort of thing is to declare
> a shared array:
>
> shared strict double cpu_times[THREADS];
>
> and then have each thread write the current iteration's
> per-thread time into cpu_times[MYTHREAD]. Thread 0 must
> then sum up all the cpu_times[] in order to arrive at the
> cpu time for the entire UPC program. As noted, another approach
> would likely have to be taken if pthread-ed UPC threads are
> used. In a mixed process/pthreads, distributed setting, things
> become even more interesting.
>
> >
> > The code is running fine on an X4600 SMP node with 16 procs,
> > but it is not running on the XT4.
> > When I run it on the XT4, the code breaks during the first iteration: the
> > first iteration does not complete, and the printf after the upc_free(B)
> > command does not execute.
>
> Some things that I noticed in the program:
>
> This section of code is apparently trying to find a value
> of 'iter' for which the execution time of upc_all_broadcast()
> will exceed the overhead of two back-to-back barrier calls
> and the for loop overhead.
>
>
> while (flag)
> {
>     upc_barrier;
>     start = get_time ();
>     for (i = 0; i < iter; i++)
>     {
>         upc_barrier;
>         upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
>                            UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
>         upc_barrier;
>     }
>     T = get_time () - start;
>     upc_barrier;
>
>     start = get_time ();
>     for (i = 0; i < iter; i++)
>     {
>         upc_barrier;
>         upc_barrier;
>     }
>     temp = get_time () - start;
>     upc_barrier;
>
>     if (MYTHREAD == 0)
>     {
>         for (i = 0; i < THREADS; i++)
>         {
>             for (j = 0; j < mess_size; j++)
>             {
>                 printf ("%d ", B[i].y[j]);
>             }
>             printf ("\n");
>         }
>         printf ("\n%lf %d %d \n", (T - temp), iter, mess_size);
>
>         if ((T - temp) < 0.1)
>         {
>             iter = iter * 2;
>         }
>
> [...]
>
> 1. Note that thread 0 is basing its idea of execution time upon
> its call to get_time(). As pointed out earlier, what is probably intended
> here is that thread 0 would work with the total cputime across all threads.
> This might not be necessary if the only goal is to tune 'iter', but is
> most likely necessary if the idea is to find the cpu time across the entire
> program used by the upc_all_broadcast() call at various message sizes.
>
> 2. The value of time T above is the time taken to execute a number
> of upc_all_broadcast() calls determined by 'iter', along with
> two upc_barrier's for each iteration. The value 'temp' is the time
> taken to execute 2*iter upc_barrier's (plus some loop overhead, which
> is likely not significant in comparison). The value of 'iter' will
> be continuously doubled as long as T exceeds temp by less than 0.1.
> The motivation for the test is clear: to increase iter until the
> cost of the upc_all_broadcast() calls exceeds the loop overhead by
> at least 0.1 seconds. The problem in the logic, however, is that if the
> cost of upc_all_broadcast() (at low message sizes, in particular)
> is always less than the cost of two barrier calls, this loop will
> keep doubling 'iter' ad infinitum. That's what happens when
> I try to run this code, compiled with GCC/UPC on an SMP-based
> system. An alternative might be to increase the number of
> iterations until the total time taken exceeds some threshold,
> say 10 seconds. Then for any reasonable implementation of
> upc_barrier you can assume that its impact on the total time
> is not significant. Something like this:
>
> #define MIN_TEST_TIME 10.0
>
> while (flag)
> {
>     upc_barrier;
>     start = get_time ();
>     for (i = 0; i < iter; i++)
>     {
>         upc_barrier;
>         upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
>                            UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
>         upc_barrier;
>     }
>     T = get_time () - start;
>     upc_barrier;
>
>     if (MYTHREAD == 0)
>     {
>         /* [...] */
>
>         if (T < MIN_TEST_TIME)
>         {
>             iter = iter * 2;
>         }
>
> 3. This code worries me a bit:
>
> for (i = 0; i < iter; i++)
> {
>     upc_barrier;
>     upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
>                        UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
>     upc_barrier;
> }
>
> - The upc_all_broadcast() call above is being executed concurrently
> by all threads. That is, they are all attempting to distribute A
> across B at the same time. This is not a realistic use of broadcast.
>
> The following implementation ensures that only one thread executes
> a broadcast at a given time:
>
> int i, t;
> for (i = 0; i < iter; i++)
> {
>     for (t = 0; t < THREADS; ++t)
>     {
>         upc_barrier;
>         if (t == MYTHREAD)
>         {
>             upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
>                                UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
>         }
>         upc_barrier;
>     }
> }
>
> You might need to normalize your results by dividing by the number of
> threads at the end of each test run, if you're interested in
> upc_all_broadcast() times as a function of message size only.
>
> - The test declares A as a vector dynamically allocated on thread 0.
> Thus, the broadcast above is always copying from thread 0's shared space
> into all the others' shared space. More typically, A would have
> affinity to the calling thread. If you declare A as being local
> to a thread (dropping the "* shared" in the current implementation):
>
> shared[] int *A;
>
> and then make this call in each thread, rather than just thread 0:
>
> if (MYTHREAD == 0)
> {
>     flag = 1;
>
>     B = upc_global_alloc (THREADS, mess_size * sizeof (int));
>
> }
> /* All threads allocate their own 'A' */
> A = (shared [] int *) upc_alloc (mess_size * sizeof (int));
> for (i = 0; i < mess_size; i++)
> {
>     A[i] = i + 1;
> }
> upc_barrier;
>
> this will be a more typical use of broadcast.
>
> - This can be simplified:
> upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
> UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
>
> to:
> upc_all_broadcast (B, A, mess_size * sizeof (int),
> UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
>
>
> Hopefully, incorporation of some/all of the suggestions above will lead
> to a more robust test.
>
> - Gary
>