Re: Defining block size during runtime

From: Gary Funck (gary_at_intrepid_dot_com)
Date: Sat Jul 25 2009 - 06:25:27 PDT

  • Next message: Paul H. Hargrove: "Re: ping pong in UPC"
    On 07/25/09 12:50:29, sainath l wrote:
    >> Gary wrote:
    >>    An alternative, might be to increase the number of
    >>    iterations until the total time taken exceeds some threshold,
    >>    say 10 seconds.  Then for any reasonable implementation of
    >>    upc_barrier you can assume that its impact on the total time
    >>    is not significant.  Something like this:
    >> [...]
    > 
    >    So after the while loop, if I add
    > 
    >    start = get_time();
    >    for(i = 0; i < iter; i++)
    >    {
    >            upc_barrier;
    >            upc_barrier;
    >    }
    >    temp = get_time() - start;
    
    I think you should keep the overhead calculation
    (the value of 'temp' above) where you had it, because
    you're printing out timing results each time through
    the loop rather than at the end.
    
    You just don't want 'temp' involved in the
    "if (T < MIN_TEST_TIME)" test, because that test is there just
    to make sure that the benchmark has run long enough
    to overcome noise caused by cache behavior and so on.
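
    Roughly, the shape I have in mind is something like this (a sketch
    only, reusing the variable names from your code):

    T = get_time () - start;       /* broadcast loop, as before */

    start = get_time ();
    /* ... 2*iter back-to-back barriers, as in your code ... */
    temp = get_time () - start;
    upc_barrier;

    if (MYTHREAD == 0)
      {
        /* 'temp' only corrects the reported result ...  */
        printf ("%lf %d %d \n", (T - temp), iter, mess_size);
        /* ... while T alone decides whether to keep doubling.  */
        if (T < MIN_TEST_TIME)
          iter = iter * 2;
      }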
    
    > 
    >    I should get a more accurate answer, right? As the time taken by the barrier
    >    would not be greater than T.
    > 
    >    This is my
    >    get_time() in gettime.h
    >    -----------------------------------
    > 
    >    double get_time()
    >    {
    >            static int Fcall = 1;
    >            static int Init_time;
    >            int err;
    >            double Time;
    >            struct timeval Tp;
    >            if(Fcall == 1)
    >            {
    >                    err = gettimeofday(&Tp,NULL);
    >                    Init_time = (double)Tp.tv_sec;
    >                    Fcall = 0;
    >            }
    >            err = gettimeofday(&Tp,NULL);
    >            Time = (double)(Tp.tv_sec) - Init_time + (double) Tp.tv_usec *
    >    1.0e-6;
    >            return Time;
    >    }
    
    First, although you're saving off the initial number of epoch
    seconds when this function is first called, you're not incorporating
    microseconds into that initial value (and 'Init_time', declared as
    an int, couldn't hold the fractional part anyway).  And, to simplify
    the code, you don't need two static vars, because the cached init
    value will work fine as a sentinel.
    
    #include <sys/time.h>

    double
    get_time ()
    {
      static double Init_time = 0.0;
      double Time;
      struct timeval Tp;
      int err;
      if (!Init_time)
        {
          err = gettimeofday (&Tp, NULL);
          Init_time = (double) (Tp.tv_sec) + (double) Tp.tv_usec * 1.0e-6;
        }
      err = gettimeofday (&Tp, NULL);
      Time = ((double) (Tp.tv_sec) + (double) Tp.tv_usec * 1.0e-6) - Init_time;
      return Time;
    }
    
    I don't think that Init_time is needed at all, however, since your
    timing code already uses differences between timing points.
    Thus, this simpler version should work:
    
    double
    get_time ()
    {
      double Time;
      struct timeval Tp;
      (void) gettimeofday (&Tp, NULL);
      Time = (double) (Tp.tv_sec) + (double) Tp.tv_usec * 1.0e-6;
      return Time;
    }
    
    Also, since you're looking at elapsed times,
    it is only necessary that thread 0 do the timing
    calculation, but you can probably leave it as is,
    letting each thread duplicate some effort.
    
    Also, I think the code would be easier to understand if the loop
    that calculates 'iter' were factored out of the main loop, rather
    than using 'flag' to drive things.
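
    For instance, a rough sketch of what I mean (the 'calibrate_iter'
    name and the shared 'bcast_done' flag are my own placeholders, not
    anything in your code):

    /* All threads call this; thread 0 decides when the timed loop has
       run long enough and publishes that decision so that every thread
       agrees on the final 'iter'.  */
    shared int bcast_done;

    int
    calibrate_iter (int mess_size)
    {
      int iter = 1;
      int i;
      double start, T;
      for (;;)
        {
          upc_barrier;
          start = get_time ();
          for (i = 0; i < iter; i++)
            {
              upc_barrier;
              upc_all_broadcast (B, A, mess_size * sizeof (int),
                                 UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
              upc_barrier;
            }
          T = get_time () - start;
          if (MYTHREAD == 0)
            bcast_done = (T >= MIN_TEST_TIME);
          upc_barrier;
          if (bcast_done)
            return iter;
          iter *= 2;
        }
    }

    The main per-message-size loop then just calls
    'iter = calibrate_iter (mess_size);' once, and runs the timed,
    reported test with that fixed 'iter'.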
    
    > 
    >    Thank you very much for the suggestions and help.
    > 
    >    Cheers,
    >    Sainath
    > 
    >    On Sat, Jul 25, 2009 at 8:46 AM, Gary Funck <[1]gary_at_intrepid_dot_com> wrote:
    > 
    >      On 07/25/09 05:37:20, sainath l wrote:
    >      >    Hi,
    >      >
    >      >    Thank you very much for answering my questions Paul. And extremely
    >      sorry
    >      >    for not providing the "gettime.h" file. Will make sure that I
    >      provide all
    >      >    the related files from next time.
    > 
    >      I used this simple implementation:
    > 
    >      #include <time.h>
    > 
    >      double
    >      get_time()
    >      {
    >        clock_t t = clock();
    >        return (double) t / (double) CLOCKS_PER_SEC;
    >      }
    > 
    >      I'm uncertain as to whether clock() will return the sum of
    >      the processor time of all currently running processes in a
    >      UPC program, or just the time of the calling process.  I think
    >      only the calling process.  Things may become more problematic
    >      if pthreads are in play.
    > 
    >      What I've done in the past for this sort of thing is to declare
    >      a shared array:
    > 
    >      shared strict double cpu_times[THREADS];
    > 
    >      and then have each thread write the current iteration's
    >      per-thread time into cpu_times[MYTHREAD].  Thread 0 must
    >      then sum up all the cpu_times[] in order to arrive at the
    >      cpu time for the entire UPC program.  As noted, another approach
    >      would likely have to be taken if pthread-ed UPC threads are
    >      used.  In mixed process/pthread, distributed settings, things
    >      become even more interesting.
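
    To make the per-thread timing idea above concrete, a rough sketch
    (where 'my_elapsed_time' is just a placeholder for whatever each
    thread measured for the current run):

    shared strict double cpu_times[THREADS];

    /* Each thread records its own time for this run ...  */
    cpu_times[MYTHREAD] = my_elapsed_time;
    upc_barrier;
    /* ... and thread 0 sums them into a program-wide total.  */
    if (MYTHREAD == 0)
      {
        double total = 0.0;
        int t;
        for (t = 0; t < THREADS; t++)
          total += cpu_times[t];
        printf ("total cpu time across threads: %lf\n", total);
      }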
    >      >
    >      >    The code is running fine on an X4600 SMP node with 16 procs,
    >      >    but it is not running on the XT4.
    >      >    When I run it on the XT4 the code breaks during the first
    >      >    iteration: the first iteration does not complete, and the
    >      >    printf after the upc_free(B) call does not execute.
    > 
    >      Some things that I noticed in the program:
    > 
    >      This section of code is apparently trying to find a value
    >      of 'iter' for which the execution time of upc_all_broadcast()
    >      will exceed the overhead of two back-to-back barrier calls
    >      and the for loop overhead.
    > 
    >           while (flag)
    >             {
    >               upc_barrier;
    >               start = get_time ();
    >               for (i = 0; i < iter; i++)
    >                 {
    >                   upc_barrier;
    >                   upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
    >                                      UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
    >                   upc_barrier;
    >                 }
    >               T = get_time () - start;
    >               upc_barrier;
    > 
    >               start = get_time ();
    >               for (i = 0; i < iter; i++)
    >                 {
    >                   upc_barrier;
    >                   upc_barrier;
    >                 }
    >               temp = get_time () - start;
    >               upc_barrier;
    > 
    >               if (MYTHREAD == 0)
    >                 {
    >                   for (i = 0; i < THREADS; i++)
    >                     {
    >                       for (j = 0; j < mess_size; j++)
    >                         {
    >                           printf ("%d ", B[i].y[j]);
    >                         }
    >                       printf ("\n");
    >                     }
    >                   printf ("\n%lf %d %d \n", (T - temp), iter, mess_size);
    > 
    >                   if ((T - temp) < 0.1)
    >                     {
    >                       iter = iter * 2;
    >                     }
    > 
    >                   [...]
    > 
    >      1. Note that thread 0 is basing its idea of execution time upon
    >      its call to get_time().  As pointed out earlier, what is probably
    >      intended here is that thread 0 would work with the total cpu time
    >      across all threads.  This might not be necessary if the only goal
    >      is to tune 'iter', but is most likely necessary if the idea is to
    >      find the cpu time across the entire program used by the
    >      upc_all_broadcast() call at various message sizes.
    > 
    >      2. The value of time T above is the time taken to execute a number
    >      of upc_all_broadcast() calls determined by 'iter', along with
    >      two upc_barrier's for each iteration.  The value 'temp' is the time
    >      taken to execute 2*iter upc_barrier's (plus some loop overhead, which
    >      is likely not significant in comparison).  The value of 'iter' will
    >      be continuously doubled as long as T never exceeds temp by more than 0.1.
    >      The motivation for the test is clear: to increase iter until the
    >      cost of the upc_all_broadcast() calls exceeds the loop overhead by
    >      at least 0.1 seconds.  The problem in the logic, however, is that if
    >      the cost of upc_all_broadcast() (at low message sizes, in particular)
    >      is always less than the cost of two barrier calls, this loop will
    >      keep incrementing 'iter' ad infinitum.  That's what happens when
    >      I try to run this code, compiled with GCC/UPC on an SMP-based
    >      system.  An alternative might be to increase the number of
    >      iterations until the total time taken exceeds some threshold,
    >      say 10 seconds.  Then for any reasonable implementation of
    >      upc_barrier you can assume that its impact on the total time
    >      is not significant.  Something like this:
    > 
    >      #define MIN_TEST_TIME 10.0
    > 
    >           while (flag)
    >             {
    >               upc_barrier;
    >               start = get_time ();
    >               for (i = 0; i < iter; i++)
    >                 {
    >                   upc_barrier;
    >                   upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
    >                                      UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
    >                   upc_barrier;
    >                 }
    >               T = get_time () - start;
    >               upc_barrier;
    > 
    >               if (MYTHREAD == 0)
    >                 {
    >                   /* [...] */
    > 
    >                   if (T < MIN_TEST_TIME)
    >                     {
    >                       iter = iter * 2;
    >                     }
    > 
    >      3. This code worries me a bit:
    > 
    >               for (i = 0; i < iter; i++)
    >                 {
    >                   upc_barrier;
    >                   upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
    >                                      UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
    >                   upc_barrier;
    >                 }
    > 
    >      - The upc_all_broadcast() call above is being executed concurrently
    >      by all threads.  That is, they are all attempting to distribute A
    >      across B at the same time.  This is not a realistic use of broadcast.
    > 
    >      The following implementation ensures that only one thread executes
    >      a broadcast at a given time:
    > 
    >               int i, t;
    >               for (i = 0; i < iter; i++)
    >                 {
    >                   for (t = 0; t < THREADS; ++t)
    >                     {
    >                       upc_barrier;
    >                       if (t == MYTHREAD)
    >                         {
    >                           upc_all_broadcast (&B[0].y[0], A,
    >                                              mess_size * sizeof (int),
    >                                              UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
    >                         }
    >                       upc_barrier;
    >                     }
    >                 }
    > 
    >      You might need to normalize your results by dividing by the number of
    >      threads at the end of each test run, if you're interested in
    >      upc_all_broadcast() times as a function of message size only.
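
    That is, something along these lines (a sketch; 'T_normalized' is
    just an illustrative name):

    /* Each pass over 'iter' above performs THREADS broadcasts, so divide
       the measured time by THREADS to keep it comparable to a
       single-broadcast-per-iteration run.  */
    T_normalized = T / THREADS;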
    > 
    >      - The test declares A as a vector dynamically allocated on thread 0.
    >      Thus, the broadcast above is always copying from thread 0's shared
    >      space into all the others' shared space.  More typically, A would
    >      have affinity to the calling thread.  If you declare A as being local
    >      to a thread (dropping the "* shared" in the current implementation):
    > 
    >      shared[] int *A;
    > 
    >      and then make this call in each thread, rather than just thread 0:
    > 
    >           if (MYTHREAD == 0)
    >             {
    >               flag = 1;
    > 
    >               B = upc_global_alloc (THREADS, mess_size * sizeof (int));
    > 
    >             }
    >           /* All threads allocate their own 'A' */
    >           A = (shared [] int *) upc_alloc (mess_size * sizeof (int));
    >           for (i = 0; i < mess_size; i++)
    >             {
    >               A[i] = i + 1;
    >             }
    >           upc_barrier;
    > 
    >      this will be a more typical use of broadcast.
    > 
    >      - This can be simplified:
    >                   upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
    >                                      UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
    > 
    >      to:
    >                   upc_all_broadcast (B, A, mess_size * sizeof (int),
    >                                      UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
    > 
    >      Hopefully, incorporating some or all of the suggestions above will
    >      lead to a more robust test.
    >      - Gary
    > 
    > References
    > 
    >    Visible links
    >    1. mailto:gary_at_intrepid_dot_com
    
    -- 
    Gary Funck
    
