From: Gary Funck (gary_at_intrepid_dot_com)
Date: Sat Jul 25 2009 - 00:46:21 PDT
On 07/25/09 05:37:20, sainath l wrote:
> Hi,
>
> Thank you very much for answering my questions Paul. And extremely sorry
> for not providing the "gettime.h" file. Will make sure that I provide all
> the related files from next time.
I used this simple implementation:
#include <time.h>

double
get_time ()
{
  clock_t t = clock ();
  return (double) t / (double) CLOCKS_PER_SEC;
}
I'm uncertain as to whether clock() will return the sum of
the processor time of all currently running processes in a
UPC program, or just the time of the calling process. I think
it returns only the calling process's time. Things may become
more problematic if pthreads are in play.
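As a side note, on a POSIX system (this is an assumption about the
target platform, not something from the original test), clock_gettime()
with CLOCK_PROCESS_CPUTIME_ID reports the CPU time accumulated by all
threads of the calling process, which may help in the pthreads case:

#include <time.h>

/* Sketch of a POSIX alternative to the clock()-based timer above.
   CLOCK_PROCESS_CPUTIME_ID accumulates the CPU time of every thread
   in the calling process (assumes _POSIX_CPUTIME support).  */
double
get_cpu_time ()
{
  struct timespec ts;
  clock_gettime (CLOCK_PROCESS_CPUTIME_ID, &ts);
  return (double) ts.tv_sec + (double) ts.tv_nsec * 1.0e-9;
}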
What I've done in the past for this sort of thing is to declare
a shared array:
shared strict double cpu_times[THREADS];
and then have each thread write the current iteration's
per-thread time into cpu_times[MYTHREAD]. Thread 0 must
then sum up all the cpu_times[] entries in order to arrive at the
cpu time for the entire UPC program. As noted, another approach
would likely have to be taken if pthread-ed UPC threads are
used. In a mixed process/pthreads, distributed setting, things
become even more interesting.
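A minimal sketch of that approach might look like the following
(the 'total' variable and the surrounding use of 'start' are
illustrative, not from the original test):

shared strict double cpu_times[THREADS];

  /* ... each thread times its own work ... */
  cpu_times[MYTHREAD] = get_time () - start;
  upc_barrier;

  /* Thread 0 sums the per-thread values to estimate the cpu
     time consumed by the whole UPC program.  */
  if (MYTHREAD == 0)
    {
      double total = 0.0;
      int t;
      for (t = 0; t < THREADS; t++)
        total += cpu_times[t];
      printf ("total cpu time: %lf\n", total);
    }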
>
> The code is running fine on an X4600 SMP node with 16 procs,
> but it is not running on the XT4.
> When I run it on the XT4, the code breaks during the first iteration; the
> first iteration does not complete. The printf after the upc_free(B)
> call does not execute.
Some things that I noticed in the program:
This section of code is apparently trying to find a value
of 'iter' for which the execution time of upc_all_broadcast()
will exceed the overhead of two back-to-back barrier calls
and the for loop overhead.
while (flag)
  {
    upc_barrier;
    start = get_time ();
    for (i = 0; i < iter; i++)
      {
        upc_barrier;
        upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
                           UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
        upc_barrier;
      }
    T = get_time () - start;
    upc_barrier;
    start = get_time ();
    for (i = 0; i < iter; i++)
      {
        upc_barrier;
        upc_barrier;
      }
    temp = get_time () - start;
    upc_barrier;
    if (MYTHREAD == 0)
      {
        for (i = 0; i < THREADS; i++)
          {
            for (j = 0; j < mess_size; j++)
              {
                printf ("%d ", B[i].y[j]);
              }
            printf ("\n");
          }
        printf ("\n%lf %d %d \n", (T - temp), iter, mess_size);
        if ((T - temp) < 0.1)
          {
            iter = iter * 2;
          }
[...]
1. Note that thread 0 is basing its idea of execution time upon
its own call to get_time(). As pointed out earlier, what is probably
intended here is that thread 0 would work with the total cpu time across
all threads. This might not be necessary if the only goal is to tune
'iter', but it is most likely necessary if the idea is to find the cpu
time, across the entire program, used by the upc_all_broadcast() call at
various message sizes.
2. The value of time T above is the time taken to execute a number
of upc_all_broadcast() calls determined by 'iter', along with
two upc_barrier's for each iteration. The value 'temp' is the time
taken to execute 2*iter upc_barrier's (plus some loop overhead, which
is likely not significant in comparison). The value of 'iter' will
be continuously doubled as long as T never exceeds temp by more than 0.1.
The motivation for the test is clear: to increase 'iter' until the time
attributable to the upc_all_broadcast() calls alone (T minus the barrier
overhead, temp) is at least 0.1 seconds. The problem in the logic,
however, is that if the cost of upc_all_broadcast() (at low message
sizes, in particular) is always less than the cost of two barrier calls,
this loop will keep doubling 'iter' ad infinitum. That's what happens
when I try to run this code, compiled with GCC/UPC on an SMP-based
system. An alternative might be to increase the number of iterations
until the total time taken exceeds some threshold, say 10 seconds.
Then for any reasonable implementation of upc_barrier you can assume
that its impact on the total time is not significant. Something like this:
#define MIN_TEST_TIME 10.0

while (flag)
  {
    upc_barrier;
    start = get_time ();
    for (i = 0; i < iter; i++)
      {
        upc_barrier;
        upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
                           UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
        upc_barrier;
      }
    T = get_time () - start;
    upc_barrier;
    if (MYTHREAD == 0)
      {
        /* [...] */
        if (T < MIN_TEST_TIME)
          {
            iter = iter * 2;
          }
3. This code worries me a bit:
for (i = 0; i < iter; i++)
  {
    upc_barrier;
    upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
                       UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
    upc_barrier;
  }
- The upc_all_broadcast() call above is being executed concurrently
by all threads. That is, they are all attempting to distribute A
across B at the same time. This is not a realistic use of broadcast.
The following implementation ensures that only one thread executes
a broadcast at a given time:
int i, t;
for (i = 0; i < iter; i++)
  {
    for (t = 0; t < THREADS; ++t)
      {
        upc_barrier;
        if (t == MYTHREAD)
          {
            upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
                               UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
          }
        upc_barrier;
      }
  }
You might need to normalize your results by dividing by the number of
threads at the end of each test run, if you're interested in
upc_all_broadcast() times as a function of message size only.
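For example, with the serialized loop above each of the 'iter'
iterations performs THREADS broadcasts, so one per-broadcast estimate
(just a sketch; the exact formula depends on what T is meant to
capture) would be:

  /* Per-broadcast time estimate for the serialized-broadcast loop;
     'time_per_broadcast' is an illustrative name.  */
  double time_per_broadcast = T / ((double) iter * THREADS);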
- The test declares A as a vector dynamically allocated on thread 0.
Thus, the broadcast above is always copying from thread 0's shared space
into all the others' shared space. More typically, A would have
affinity to the calling thread. If you declare A as being local
to a thread (dropping the "* shared" in the current implementation):
shared[] int *A;
and then make this call in each thread, rather than just thread 0:
if (MYTHREAD == 0)
  {
    flag = 1;
    B = upc_global_alloc (THREADS, mess_size * sizeof (int));
  }
/* All threads allocate their own 'A' */
A = (shared [] int *) upc_alloc (mess_size * sizeof (int));
for (i = 0; i < mess_size; i++)
  {
    A[i] = i + 1;
  }
upc_barrier;
this will be a more typical use of broadcast.
- This can be simplified:
upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
                   UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
to:
upc_all_broadcast (B, A, mess_size * sizeof (int),
                   UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
Hopefully, incorporation of some/all of the suggestions above will lead
to a more robust test.
- Gary