From: Gary Funck (gary_at_intrepid_dot_com)
Date: Sat Jul 25 2009 - 00:46:21 PDT
On 07/25/09 05:37:20, sainath l wrote:
> Hi,
>
> Thank you very much for answering my questions Paul. And extremely sorry
> for not providing the "gettime.h" file. Will make sure that I provide all
> the related files from next time.

I used this simple implementation:

    #include <time.h>

    double get_time ()
    {
      clock_t t = clock ();
      return (double) t / (double) CLOCKS_PER_SEC;
    }

I'm uncertain as to whether clock() will return the sum of the processor
time of all currently running processes in a UPC program, or just the time
of the calling process. I think only the calling process. Things may
become more problematic if pthreads are in play.

What I've done in the past for this sort of thing is to declare a shared
array:

    shared strict double cpu_times[THREADS];

and then have each thread write the current iteration's per-thread time
into cpu_times[MYTHREAD]. Thread 0 must then sum up all the cpu_times[]
entries in order to arrive at the CPU time for the entire UPC program
(there's a short sketch of this further below). As noted, another approach
would likely have to be taken if pthread-ed UPC threads are used. In a
mixed process/pthreads, distributed setting, things become even more
interesting.

> The code is running fine in an smp X4600 SMP node with 16 procs.
> But it is not running in XT 4.
> when I run it in XT 4 the code breaks during the first iteration. the
> first iteration does not complete. the printf after the upc_free(B)
> command does not execute.

Some things that I noticed in the program:

This section of code is apparently trying to find a value of 'iter' for
which the execution time of upc_all_broadcast() will exceed the overhead
of two back-to-back barrier calls and the for loop overhead.

    while (flag) {
      upc_barrier;
      start = get_time ();
      for (i = 0; i < iter; i++) {
        upc_barrier;
        upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
                           UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
        upc_barrier;
      }
      T = get_time () - start;
      upc_barrier;
      start = get_time ();
      for (i = 0; i < iter; i++) {
        upc_barrier;
        upc_barrier;
      }
      temp = get_time () - start;
      upc_barrier;
      if (MYTHREAD == 0) {
        for (i = 0; i < THREADS; i++) {
          for (j = 0; j < mess_size; j++) {
            printf ("%d ", B[i].y[j]);
          }
          printf ("\n");
        }
        printf ("\n%lf %d %d \n", (T - temp), iter, mess_size);
        if ((T - temp) < 0.1) {
          iter = iter * 2;
        }
    [...]

1. Note that thread 0 is basing its idea of execution time upon its own
call to get_time(). As pointed out earlier, what is probably intended here
is that thread 0 would work with the total CPU time across all threads.
This might not be necessary if the only goal is to tune 'iter', but is
most likely necessary if the idea is to find the CPU time used across the
entire program by the upc_all_broadcast() call at various message sizes.

2. The value of time T above is the time taken to execute a number of
upc_all_broadcast() calls determined by 'iter', along with two
upc_barrier's for each iteration. The value 'temp' is the time taken to
execute 2*iter upc_barrier's (plus some loop overhead, which is likely not
significant in comparison). The value of 'iter' will be continuously
doubled as long as T never exceeds temp by more than 0.1. The motivation
for the test is clear: to increase 'iter' until the cost of the
upc_all_broadcast() calls exceeds the barrier/loop overhead by at least
0.1 seconds. The problem in the logic, however, is that if the cost of
upc_all_broadcast() (at low message sizes, in particular) is always less
than the cost of two barrier calls, this loop will keep incrementing
'iter' ad infinitum. That's what happens when I try to run this code,
compiled with GCC/UPC on an SMP-based system.
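Here is the per-thread timing sketch I mentioned above. It is untested and
only illustrative; it assumes get_time() returns the calling thread's CPU
time (as with the clock()-based version above), and the variable names are
just placeholders:

    /* file scope: one timing slot per thread */
    shared strict double cpu_times[THREADS];

    /* ... inside the test ... */
    double t0, total_cpu;
    int t;

    t0 = get_time ();
    /* ... the timed section runs here on every thread ... */
    cpu_times[MYTHREAD] = get_time () - t0;
    upc_barrier;

    if (MYTHREAD == 0) {
      total_cpu = 0.0;
      for (t = 0; t < THREADS; t++)
        total_cpu += cpu_times[t];
      printf ("total CPU time across all threads: %f sec\n", total_cpu);
    }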
An alternative to the (T - temp) test might be to increase the number of
iterations until the total time taken exceeds some threshold, say 10
seconds. Then, for any reasonable implementation of upc_barrier, you can
assume that its impact on the total time is not significant. Something
like this:

    #define MIN_TEST_TIME 10.0

    while (flag) {
      upc_barrier;
      start = get_time ();
      for (i = 0; i < iter; i++) {
        upc_barrier;
        upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
                           UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
        upc_barrier;
      }
      T = get_time () - start;
      upc_barrier;
      if (MYTHREAD == 0) {
        /* [...] */
        if (T < MIN_TEST_TIME) {
          iter = iter * 2;
        }

3. This code worries me a bit:

    for (i = 0; i < iter; i++) {
      upc_barrier;
      upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
                         UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
      upc_barrier;
    }

- The upc_all_broadcast() call above is being executed concurrently by all
threads. That is, they are all attempting to distribute A across B at the
same time. This is not a realistic use of broadcast. The following
implementation ensures that only one thread executes a broadcast at a
given time:

    int i, t;
    for (i = 0; i < iter; i++) {
      for (t = 0; t < THREADS; ++t) {
        upc_barrier;
        if (t == MYTHREAD) {
          upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
                             UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
        }
        upc_barrier;
      }
    }

You might need to normalize your results by dividing by the number of
threads at the end of each test run, if you're interested in
upc_all_broadcast() times as a function of message size only.

- The test declares A as a vector dynamically allocated on thread 0. Thus,
the broadcast above is always copying from thread 0's shared space into
all the others' shared space. More typically, A would have affinity to the
calling thread. If you declare A as being local to a thread (dropping the
"* shared" in the current implementation):

    shared [] int *A;

and then make this call in each thread, rather than just thread 0:

    if (MYTHREAD == 0) {
      flag = 1;
      B = upc_global_alloc (THREADS, mess_size * sizeof (int));
    }
    /* All threads allocate their own 'A' */
    A = (shared [] int *) upc_alloc (mess_size * sizeof (int));
    for (i = 0; i < mess_size; i++) {
      A[i] = i + 1;
    }
    upc_barrier;

this will be a more typical use of broadcast.

- This can be simplified:

    upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
                       UPC_IN_NOSYNC | UPC_OUT_NOSYNC);

to:

    upc_all_broadcast (B, A, mess_size * sizeof (int),
                       UPC_IN_NOSYNC | UPC_OUT_NOSYNC);

Hopefully, incorporation of some/all of the suggestions above will lead to
a more robust test.

- Gary
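P.S. For what it's worth, here is a rough, untested sketch that folds
together the MIN_TEST_TIME termination, the one-broadcast-at-a-time loop,
and the per-call normalization described above. It assumes 'flag' and
'iter' are shared variables declared at file scope (updated only by thread
0), and that A, B, mess_size, and get_time() are as in your program:

    /* file scope, updated by thread 0 only */
    shared int flag;
    shared int iter;

    /* ... inside the test ... */
    double start, T;
    int i, t;

    if (MYTHREAD == 0) {
      flag = 1;
      iter = 1;
    }
    upc_barrier;

    while (flag) {
      upc_barrier;
      start = get_time ();
      for (i = 0; i < iter; i++) {
        for (t = 0; t < THREADS; ++t) {
          upc_barrier;
          if (t == MYTHREAD) {
            upc_all_broadcast (B, A, mess_size * sizeof (int),
                               UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
          }
          upc_barrier;
        }
      }
      T = get_time () - start;
      upc_barrier;
      if (MYTHREAD == 0) {
        if (T < MIN_TEST_TIME) {
          iter = iter * 2;
        } else {
          /* normalize: average time per broadcast call */
          printf ("%d ints: %g sec per broadcast (iter = %d)\n",
                  mess_size, T / ((double) iter * THREADS), iter);
          flag = 0;
        }
      }
      upc_barrier;   /* all threads see thread 0's update of flag/iter */
    }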