Re: Defining block size during runtime

From: Gary Funck (gary_at_intrepid_dot_com)
Date: Sat Jul 25 2009 - 00:46:21 PDT

    On 07/25/09 05:37:20, sainath l wrote:
    >    Hi,
    > 
    >    Thank you very much for answering my questions Paul. And extremely sorry
    >    for not providing the "gettime.h" file. Will make sure that I provide all
    >    the related files from next time.
    
    
    I used this simple implementation:
    
    #include <time.h>
    
    double
    get_time()
    {
       clock_t t = clock();
       return (double) t / (double) CLOCKS_PER_SEC;
    }
    
    I'm uncertain as to whether clock() will return the sum of
    the processor time of all currently running processes in the
    UPC program, or just the time of the calling process.  I think
    only the calling process.  Things may become more problematic
    if pthreads are in play.
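
A minimal sketch of the distinction, in plain C rather than UPC. The C standard describes clock() as the processor time used by the program, which on typical POSIX systems means the calling process only (including all of its pthreads). The wall-clock variant below uses C11 timespec_get(); the names get_cpu_time() and get_wall_time() are hypothetical and not part of the original test.

```c
#include <time.h>

/* CPU seconds consumed so far by the calling process (same idea as
   get_time() above).  Other processes of a UPC job are not included. */
double get_cpu_time(void)
{
    return (double) clock() / (double) CLOCKS_PER_SEC;
}

/* Wall-clock seconds since the epoch, using C11 timespec_get().
   Returns -1.0 if the clock is unavailable. */
double get_wall_time(void)
{
    struct timespec ts;

    if (timespec_get(&ts, TIME_UTC) != TIME_UTC)
        return -1.0;
    return (double) ts.tv_sec + (double) ts.tv_nsec / 1e9;
}
```

Differences of get_wall_time() include time spent blocked in barriers or communication, which CPU time may not reflect; which of the two is appropriate depends on what the benchmark is meant to report.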
    
    What I've done in the past for this sort of thing is to declare
    a shared array:
    
    shared strict double cpu_times[THREADS];
    
    and then have each thread write the current iteration's
    per-thread time into cpu_times[MYTHREAD].  Thread 0 must
    then sum up all the cpu_times[] in order to arrive at the
    cpu time for the entire UPC program.  As noted, another approach
    would likely have to be taken if pthread-ed UPC threads are
    used.  In a mixed process/pthreads, distributed setting, things
    become even more interesting.
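
The reduction that thread 0 performs can be sketched in plain C as follows. The ordinary array argument stands in for the shared cpu_times[THREADS] array, and nthreads for THREADS. The function names are hypothetical, and max_cpu_time() is an optional extra not mentioned above, included only because the slowest thread is often of interest as well.

```c
#include <stddef.h>

/* Sum of the per-thread CPU times: the CPU time used by the whole
   program for the measured region. */
double total_cpu_time(const double *times, size_t nthreads)
{
    double sum = 0.0;

    for (size_t i = 0; i < nthreads; i++)
        sum += times[i];
    return sum;
}

/* Largest per-thread CPU time (0.0 for an empty array). */
double max_cpu_time(const double *times, size_t nthreads)
{
    double max = 0.0;

    for (size_t i = 0; i < nthreads; i++)
        if (times[i] > max)
            max = times[i];
    return max;
}
```

In the UPC program, a upc_barrier would be needed between the per-thread writes and the loop on thread 0, so that every element has been written before it is read.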
    
    > 
    >    The code is running fine in an smp X4600 SMP node with 16 procs.
    >    But it is not running in XT 4.
    >    when I run it in XT 4 the code breaks during the first iteration. the
    >    first iteration does not complete. the printf after the upc_free(B)
    >    command does not execute.
    
    Some things that I noticed in the program:
    
    This section of code is apparently trying to find a value
    of 'iter' for which the execution time of upc_all_broadcast()
    will exceed the overhead of two back-to-back barrier calls
    and the for loop overhead.
    
    
          while (flag)
            {
              upc_barrier;
              start = get_time ();
              for (i = 0; i < iter; i++)
                {
                  upc_barrier;
                  upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
                                     UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
                  upc_barrier;
                }
              T = get_time () - start;
              upc_barrier;
    
              start = get_time ();
              for (i = 0; i < iter; i++)
                {
                  upc_barrier;
                  upc_barrier;
                }
              temp = get_time () - start;
              upc_barrier;
    
              if (MYTHREAD == 0)
                {
                  for (i = 0; i < THREADS; i++)
                    {
                      for (j = 0; j < mess_size; j++)
                        {
                          printf ("%d ", B[i].y[j]);
                        }
                      printf ("\n");
                    }
                  printf ("\n%lf %d %d \n", (T - temp), iter, mess_size);
    
                  if ((T - temp) < 0.1)
                    {
                      iter = iter * 2;
                    }
    
                  [...]
    
    1. Note that thread 0 is basing its idea of execution time upon
    its call to get_time().  As pointed out earlier, what is probably intended
    here is that thread 0 would work with the total cpu time across all threads.
    This might not be necessary if the only goal is to tune 'iter', but is
    most likely necessary if the idea is to find the cpu time across the entire
    program used by the upc_all_broadcast() call at various message sizes.
    
    2. The value of time T above is the time taken to execute a number
    of upc_all_broadcast() calls determined by 'iter', along with
    two upc_barrier's for each iteration.  The value 'temp' is the time
    taken to execute 2*iter upc_barrier's (plus some loop overhead, which
    is likely not significant in comparison).  The value of 'iter' will
    be continuously doubled as long as T never exceeds temp by at least
    0.1 seconds.  The motivation for the test is clear: to increase iter
    until the measured time exceeds the barrier and loop overhead by at
    least 0.1 seconds, that is, until the upc_all_broadcast() calls
    themselves account for at least 0.1 seconds.  The problem in the
    logic, however, is that if the cost of upc_all_broadcast() (at low
    message sizes, in particular) is always less than the cost of two
    barrier calls, this loop will keep doubling 'iter' ad infinitum.
    That's what happens when I try to run this code, compiled with
    GCC/UPC on an SMP-based system.  An alternative might be to increase
    the number of iterations until the total time taken exceeds some
    threshold, say 10 seconds.  Then for any reasonable implementation of
    upc_barrier you can assume that its impact on the total time
    is not significant.  Something like this:
    
    #define MIN_TEST_TIME 10.0
    
          while (flag)
            {
              upc_barrier;
              start = get_time ();
              for (i = 0; i < iter; i++)
                {
                  upc_barrier;
                  upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
                                     UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
                  upc_barrier;
                }
              T = get_time () - start;
              upc_barrier;
    
              if (MYTHREAD == 0)
                {
                  /* [...] */
    
                  if (T < MIN_TEST_TIME)
                    {
                      iter = iter * 2;
                    }
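
The doubling strategy above can be sketched in plain C, independent of UPC. This is only an illustration under stated assumptions: calibrate_iters(), the timed_run_fn callback type, the max_iters cap, and the fake_run() stand-in workload are all hypothetical names introduced here. The cap is an addition that guarantees termination even if the threshold is never reached.

```c
/* A timed workload: runs `iters` repetitions and returns elapsed seconds. */
typedef double (*timed_run_fn)(long iters, void *ctx);

/* Double the iteration count until one timed run lasts at least
   min_seconds, or until doubling again would exceed max_iters. */
long calibrate_iters(timed_run_fn run, void *ctx,
                     double min_seconds, long max_iters)
{
    long iters = 1;

    for (;;) {
        double elapsed = run(iters, ctx);

        if (elapsed >= min_seconds)
            return iters;       /* run is long enough */
        if (iters > max_iters / 2)
            return iters;       /* cap reached, stop doubling */
        iters *= 2;
    }
}

/* Stand-in workload for illustration only: pretends that every
   iteration costs *ctx seconds instead of timing real work. */
double fake_run(long iters, void *ctx)
{
    const double *cost = ctx;

    return (double) iters * *cost;
}
```

For example, with a pretend cost of 0.001 seconds per iteration and a 0.1 second threshold, calibrate_iters(fake_run, &cost, 0.1, 1L << 20) returns 128. In the UPC test, the callback would wrap the barrier/broadcast loop, and all threads would have to agree on the resulting count before the next round.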
    
    3. This code worries me a bit:
    
              for (i = 0; i < iter; i++)
                {
                  upc_barrier;
                  upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
                                     UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
                  upc_barrier;
                }
    
    - The upc_all_broadcast() call above is being executed concurrently
    by all threads.  That is, they are all attempting to distribute A
    across B at the same time.  This is not a realistic use of broadcast.
    
    The following implementation ensures that only one thread executes
    a broadcast at a given time:
    
              int i, t;
              for (i = 0; i < iter; i++)
                {
                  for (t = 0; t < THREADS; ++t)
                    {
                      upc_barrier;
                      if (t == MYTHREAD)
                        {
                          upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
                                             UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
                        }
                      upc_barrier;
                    }
                }
    
    You might need to normalize your results by dividing by the number of
    threads at the end of each test run, if you're interested in
    upc_all_broadcast() times as a function of message size only.
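
As a small worked example of that normalization, assuming the serialized loop above (each outer iteration performs THREADS broadcast steps): the average time per step is the total time divided by iter * THREADS. The helper name per_broadcast_seconds() is hypothetical.

```c
/* Average seconds per serialized broadcast step, including the two
   barriers around it.  Returns -1.0 for non-positive counts. */
double per_broadcast_seconds(double total_seconds, long iters, int threads)
{
    if (iters <= 0 || threads <= 0)
        return -1.0;
    return total_seconds / ((double) iters * (double) threads);
}
```

With a total of 8.0 seconds, 100 iterations and 4 threads, this gives 8.0 / 400 = 0.02 seconds per step.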
    
    - The test declares A as a vector dynamically allocated on thread 0.
    Thus, the broadcast above is always copying from thread 0's shared space
    into all the others' shared space.  More typically, A would have
    affinity to the calling thread.  If you declare A as being local
    to a thread (dropping the "* shared" in the current implementation):
    
    shared[] int *A;
    
    and then make this call in each thread, rather than just thread 0:
    
          if (MYTHREAD == 0)
            {
              flag = 1;
    
              B = upc_global_alloc (THREADS, mess_size * sizeof (int));
    
            }
          /* All threads allocate their own 'A' */
          A = (shared [] int *) upc_alloc (mess_size * sizeof (int));
          for (i = 0; i < mess_size; i++)
            {
              A[i] = i + 1;
            }
          upc_barrier;
    
    this will be a more typical use of broadcast.
    
    - This can be simplified:
                  upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
                                     UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
    
    to:
                  upc_all_broadcast (B, A, mess_size * sizeof (int),
                                     UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
    
    
    Hopefully, incorporation of some/all of the suggestions above will lead
    to a more robust test.
    
    - Gary
    
