From: sainath l (ls.sainath_at_gmail_dot_com)
Date: Sat Jul 25 2009 - 04:50:29 PDT
Hi guys,

@ Paul: I am using BUPC 2.8.0.

@ Gary:

  An alternative might be to increase the number of iterations until
  the total time taken exceeds some threshold, say 10 seconds.  Then
  for any reasonable implementation of upc_barrier you can assume that
  its impact on the total time is not significant.  Something like this:

  #define MIN_TEST_TIME 10.0

  while (flag)
    {
      upc_barrier;
      start = get_time ();
      for (i = 0; i < iter; i++)
        {
          upc_barrier;
          upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
                             UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
          upc_barrier;
        }
      T = get_time () - start;
      upc_barrier;

      if (MYTHREAD == 0)
        {
          /* [...] */

          if (T < MIN_TEST_TIME)
            {
              iter = iter * 2;
            }

So after the while loop, if I add

  start = get_time ();
  for (i = 0; i < iter; i++)
    {
      upc_barrier;
      upc_barrier;
    }
  temp = get_time () - start;

I should get a more accurate answer, right?  The time taken by the
barriers would not be greater than T.

This is my get_time() in gettime.h:
-----------------------------------
double
get_time ()
{
  static int Fcall = 1;
  static double Init_time;  /* double, to match the (double) tv_sec it stores */
  int err;
  double Time;
  struct timeval Tp;
  if (Fcall == 1)
    {
      err = gettimeofday (&Tp, NULL);
      Init_time = (double) Tp.tv_sec;
      Fcall = 0;
    }
  err = gettimeofday (&Tp, NULL);
  Time = (double) Tp.tv_sec - Init_time + (double) Tp.tv_usec * 1.0e-6;
  return Time;
}

Thank you very much for the suggestions and help.

Cheers,
Sainath

On Sat, Jul 25, 2009 at 8:46 AM, Gary Funck <gary_at_intrepid_dot_com> wrote:
>
> On 07/25/09 05:37:20, sainath l wrote:
> > Hi,
> >
> > Thank you very much for answering my questions Paul. And extremely
> > sorry for not providing the "gettime.h" file. Will make sure that I
> > provide all the related files from next time.
>
> I used this simple implementation:
>
> #include <time.h>
>
> double
> get_time ()
> {
>   clock_t t = clock ();
>   return (double) t / (double) CLOCKS_PER_SEC;
> }
>
> I'm uncertain as to whether clock() will return the sum of the
> processor time of all currently running processes in a UPC program,
> or just the time of the calling process.  I think only the calling
> process.  Things may become more problematic if pthreads are in play.
>
> What I've done in the past for this sort of thing is to declare a
> shared array:
>
>   shared strict double cpu_times[THREADS];
>
> and then have each thread write the current iteration's per-thread
> time into cpu_times[MYTHREAD].  Thread 0 must then sum up all the
> cpu_times[] in order to arrive at the cpu time for the entire UPC
> program.  As noted, another approach would likely have to be taken if
> pthread-ed UPC threads are used.  In a mixed process/pthreads,
> distributed setting, things become even more interesting.
>
> > The code is running fine on an X4600 SMP node with 16 procs.
> > But it is not running on the XT4.
> > When I run it on the XT4, the code breaks during the first
> > iteration.  The first iteration does not complete; the printf
> > after the upc_free(B) call does not execute.
>
> Some things that I noticed in the program:
>
> This section of code is apparently trying to find a value of 'iter'
> for which the execution time of upc_all_broadcast() will exceed the
> overhead of two back-to-back barrier calls and the for loop overhead.
>
> while (flag)
>   {
>     upc_barrier;
>     start = get_time ();
>     for (i = 0; i < iter; i++)
>       {
>         upc_barrier;
>         upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
>                            UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
>         upc_barrier;
>       }
>     T = get_time () - start;
>     upc_barrier;
>
>     start = get_time ();
>     for (i = 0; i < iter; i++)
>       {
>         upc_barrier;
>         upc_barrier;
>       }
>     temp = get_time () - start;
>     upc_barrier;
>
>     if (MYTHREAD == 0)
>       {
>         for (i = 0; i < THREADS; i++)
>           {
>             for (j = 0; j < mess_size; j++)
>               {
>                 printf ("%d ", B[i].y[j]);
>               }
>             printf ("\n");
>           }
>         printf ("\n%lf %d %d \n", (T - temp), iter, mess_size);
>
>         if ((T - temp) < 0.1)
>           {
>             iter = iter * 2;
>           }
>
> [...]
>
> 1. Note that thread 0 is basing its idea of execution time upon its
> call to get_time().  As pointed out earlier, what is probably
> intended here is that thread 0 would work with the total cpu time
> across all threads.  This might not be necessary if the only goal is
> to tune 'iter', but is most likely necessary if the idea is to find
> the cpu time, across the entire program, used by the
> upc_all_broadcast() call at various message sizes.
>
> 2. The value of time T above is the time taken to execute a number
> of upc_all_broadcast() calls determined by 'iter', along with two
> upc_barrier's for each iteration.  The value 'temp' is the time
> taken to execute 2*iter upc_barrier's (plus some loop overhead,
> which is likely not significant in comparison).  The value of 'iter'
> will be continuously doubled as long as T never exceeds temp by more
> than 0.1.  The motivation for the test is clear: to increase 'iter'
> until the cost of the upc_all_broadcast() calls exceeds the loop
> overhead by at least 0.1.  The problem in the logic, however, is
> that if the cost of upc_all_broadcast() (at low message sizes, in
> particular) is always less than the cost of two barrier calls, this
> loop will keep incrementing 'iter' ad infinitum.
> That's what happens when I try to run this code, compiled with
> GCC/UPC on an SMP-based system.  An alternative might be to increase
> the number of iterations until the total time taken exceeds some
> threshold, say 10 seconds.  Then for any reasonable implementation
> of upc_barrier you can assume that its impact on the total time is
> not significant.  Something like this:
>
> #define MIN_TEST_TIME 10.0
>
> while (flag)
>   {
>     upc_barrier;
>     start = get_time ();
>     for (i = 0; i < iter; i++)
>       {
>         upc_barrier;
>         upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
>                            UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
>         upc_barrier;
>       }
>     T = get_time () - start;
>     upc_barrier;
>
>     if (MYTHREAD == 0)
>       {
>         /* [...] */
>
>         if (T < MIN_TEST_TIME)
>           {
>             iter = iter * 2;
>           }
>
> 3. This code worries me a bit:
>
> for (i = 0; i < iter; i++)
>   {
>     upc_barrier;
>     upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
>                        UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
>     upc_barrier;
>   }
>
> - The upc_all_broadcast() call above is being executed concurrently
> by all threads.  That is, they are all attempting to distribute A
> across B at the same time.  This is not a realistic use of broadcast.
>
> The following implementation ensures that only one thread executes a
> broadcast at a given time:
>
> int i, t;
> for (i = 0; i < iter; i++)
>   {
>     for (t = 0; t < THREADS; ++t)
>       {
>         upc_barrier;
>         if (t == MYTHREAD)
>           {
>             upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
>                                UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
>           }
>         upc_barrier;
>       }
>   }
>
> You might need to normalize your results by dividing by the number
> of threads at the end of each test run, if you're interested in
> upc_all_broadcast() times as a function of message size only.
>
> - The test declares A as a vector dynamically allocated on thread 0.
> Thus, the broadcast above is always copying from thread 0's shared
> space into all the others' shared space.  More typically, A would
> have affinity to the calling thread.
> If you declare A as being local to a thread (dropping the "* shared"
> in the current implementation):
>
> shared [] int *A;
>
> and then make this call in each thread, rather than just thread 0:
>
> if (MYTHREAD == 0)
>   {
>     flag = 1;
>     B = upc_global_alloc (THREADS, mess_size * sizeof (int));
>   }
> /* All threads allocate their own 'A' */
> A = (shared [] int *) upc_alloc (mess_size * sizeof (int));
> for (i = 0; i < mess_size; i++)
>   {
>     A[i] = i + 1;
>   }
> upc_barrier;
>
> this will be a more typical use of broadcast.
>
> - This can be simplified:
>
> upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
>                    UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
>
> to:
>
> upc_all_broadcast (B, A, mess_size * sizeof (int),
>                    UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
>
> Hopefully, incorporation of some/all of the suggestions above will
> lead to a more robust test.
>
> - Gary