Re: Defining block size during runtime

From: sainath l (ls.sainath_at_gmail_dot_com)
Date: Sat Jul 25 2009 - 04:50:29 PDT

  • Next message: Gary Funck: "Re: Defining block size during runtime"
    Hi guys,
    
    
    
    @ Paul
    I am using BUPC 2.8.0.
    
    
    @ Gary
    
    
    An alternative might be to increase the number of
    iterations until the total time taken exceeds some threshold,
    say 10 seconds.  Then for any reasonable implementation of
    upc_barrier you can assume that its impact on the total time
    is not significant.  Something like this:
    #define MIN_TEST_TIME 10.0
         while (flag)
           {
             upc_barrier;
             start = get_time ();
             for (i = 0; i < iter; i++)
               {
                 upc_barrier;
                 upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
                                    UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
                 upc_barrier;
               }
             T = get_time () - start;
             upc_barrier;
             if (MYTHREAD == 0)
               {
                 /* [...] */
                 if (T < MIN_TEST_TIME)
                   {
                     iter = iter * 2;
                   }
    
    So after the while loop, if I add
    
    start = get_time();
    for(i = 0; i < iter; i++)
    {
            upc_barrier;
            upc_barrier;
    }
    temp = get_time() - start;
    
    I should get a more accurate answer, right? The time taken by the
    barriers alone would not be greater than T.
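
The arithmetic behind this idea can be sketched in plain C. This is only a minimal sketch under stated assumptions, not code from the benchmark: the helper names next_iter and per_call_seconds are hypothetical, T is taken to be the time for iter rounds of barrier + broadcast + barrier, and temp the time for iter rounds of two barriers.

```c
#include <assert.h>

/* Hypothetical helper: double the iteration count until one timed run
   lasts at least min_time seconds (the MIN_TEST_TIME idea above). */
long next_iter(double T, double min_time, long iter)
{
    assert(iter > 0);
    return (T < min_time) ? iter * 2 : iter;
}

/* Hypothetical helper: estimated seconds per broadcast once the
   barrier-only time temp is subtracted from the combined time T.
   Timer noise can make T - temp slightly negative, so clamp at zero. */
double per_call_seconds(double T, double temp, long iter)
{
    double net;

    assert(iter > 0);
    net = T - temp;
    if (net < 0.0)
        net = 0.0;
    return net / (double) iter;
}
```

For example, with T = 12.0 s, temp = 4.0 s and iter = 1000, per_call_seconds gives 0.008 s per broadcast.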
    
    
    This is my get_time() in gettime.h:
    -----------------------------------
    
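    #include <stddef.h>     /* NULL */
    #include <sys/time.h>   /* gettimeofday, struct timeval */

    /* Wall-clock seconds elapsed since the second in which the
       first call was made. */
    double get_time(void)
    {
            static int Fcall = 1;
            static time_t Init_time;  /* whole seconds at the first call */
            double Time;
            struct timeval Tp;

            gettimeofday(&Tp, NULL);
            if (Fcall == 1)
            {
                    Init_time = Tp.tv_sec;
                    Fcall = 0;
            }
            Time = (double)(Tp.tv_sec - Init_time)
                 + (double)Tp.tv_usec * 1.0e-6;
            return Time;
    }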
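
As an alternative to gettimeofday(), which follows the system clock and can jump if that clock is adjusted during a run, POSIX clock_gettime() with CLOCK_MONOTONIC could be used. The following is a minimal sketch, assuming a POSIX system that provides CLOCK_MONOTONIC; the name wall_seconds is hypothetical and is not part of gettime.h.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime() in strict ISO C modes */
#endif
#include <time.h>

/* Hypothetical replacement for get_time(): seconds elapsed since the
   first call, read from the monotonic clock. */
double wall_seconds(void)
{
    static int first_call = 1;
    static struct timespec t0;
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    if (first_call)
    {
        t0 = now;          /* remember the starting point */
        first_call = 0;
    }
    return (double) (now.tv_sec - t0.tv_sec)
         + (double) (now.tv_nsec - t0.tv_nsec) * 1.0e-9;
}
```

On some older C libraries this may need linking with -lrt.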
    
    
    Thank you very much for the suggestions and help.
    
    Cheers,
    Sainath
    
    
    On Sat, Jul 25, 2009 at 8:46 AM, Gary Funck <gary_at_intrepid_dot_com> wrote:
    
    >
    > On 07/25/09 05:37:20, sainath l wrote:
    > >    Hi,
    > >
    > >    Thank you very much for answering my questions Paul. And extremely
    > >    sorry for not providing the "gettime.h" file. Will make sure that I
    > >    provide all the related files from next time.
    >
    >
    > I used this simple implementation:
    >
    > #include <time.h>
    >
    > double
    > get_time()
    > {
    >   clock_t t = clock();
    >   return (double) t / (double) CLOCKS_PER_SEC;
    > }
    >
    > I'm uncertain as to whether clock() will return the sum of
    > the processor time of all currently running processes in
    > the UPC program, or just the time of the calling process.  I think
    > only the calling process.  Things may become more problematic
    > if pthreads are in play.
    >
    > What I've done in the past for this sort of thing is to declare
    > a shared array:
    >
    > shared strict double cpu_times[THREADS];
    >
    > and then have each thread write the current iteration's
    > per-thread time into cpu_times[MYTHREAD].  Thread 0 must
    > then sum up all the cpu_times[] in order to arrive at the
    > cpu time for the entire UPC program.  As noted, another approach
    > would likely have to be taken if pthread-ed UPC threads are
    > used. In a mixed process/pthreads, distributed setting, things
    > become even more interesting.
    >
    > >
    > >    The code is running fine on an X4600 SMP node with 16 procs.
    > >    But it is not running on the XT4.
    > >    When I run it on the XT4 the code breaks during the first iteration:
    > >    the first iteration does not complete, and the printf after the
    > >    upc_free(B) command does not execute.
    >
    > Some things that I noticed in the program:
    >
    > This section of code is apparently trying to find a value
    > of 'iter' for which the execution time of upc_all_broadcast()
    > will exceed the overhead of two back-to-back barrier calls
    > and the for loop overhead.
    >
    >
    >      while (flag)
    >        {
    >          upc_barrier;
    >          start = get_time ();
    >          for (i = 0; i < iter; i++)
    >            {
    >              upc_barrier;
    >              upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
    >                                 UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
    >              upc_barrier;
    >            }
    >          T = get_time () - start;
    >          upc_barrier;
    >
    >          start = get_time ();
    >          for (i = 0; i < iter; i++)
    >            {
    >              upc_barrier;
    >              upc_barrier;
    >            }
    >          temp = get_time () - start;
    >          upc_barrier;
    >
    >          if (MYTHREAD == 0)
    >            {
    >              for (i = 0; i < THREADS; i++)
    >                {
    >                  for (j = 0; j < mess_size; j++)
    >                    {
    >                      printf ("%d ", B[i].y[j]);
    >                    }
    >                  printf ("\n");
    >                }
    >              printf ("\n%lf %d %d \n", (T - temp), iter, mess_size);
    >
    >              if ((T - temp) < 0.1)
    >                {
    >                  iter = iter * 2;
    >                }
    >
    >              [...]
    >
    > 1. Note that thread 0 is basing its idea of execution time upon
    > its call to get_time().  As pointed out earlier, what is probably intended
    > here is that thread 0 would work with the total cputime across all threads.
    > This might not be necessary if the only goal is to tune 'iter', but is
    > most likely necessary if the idea is to find the cpu time across the entire
    > program used by the upc_all_broadcast() call at various message sizes.
    >
    > 2. The value of time T above is the time taken to execute a number
    > of upc_all_broadcast() calls determined by 'iter', along with
    > two upc_barrier's for each iteration.  The value 'temp' is the time
    > taken to execute 2*iter upc_barrier's (plus some loop overhead, which
    > is likely not significant in comparison).  The value of 'iter' will
    > keep being doubled as long as T does not exceed temp by at least 0.1.
    > The motivation for the test is clear: to increase iter until the
    > accumulated cost of the upc_all_broadcast() calls exceeds the barrier
    > and loop overhead by at least 0.1.  The problem in the logic, however,
    > is that if the cost of upc_all_broadcast() (at low message sizes, in
    > particular) is always less than the cost of two barrier calls, this
    > loop will keep doubling 'iter' ad infinitum.  That's what happens when
    > I try to run this code, compiled with GCC/UPC on an SMP-based
    > system.  An alternative might be to increase the number of
    > iterations until the total time taken exceeds some threshold,
    > say 10 seconds.  Then for any reasonable implementation of
    > upc_barrier you can assume that its impact on the total time
    > is not significant.  Something like this:
    >
    > #define MIN_TEST_TIME 10.0
    >
    >      while (flag)
    >        {
    >          upc_barrier;
    >          start = get_time ();
    >          for (i = 0; i < iter; i++)
    >            {
    >              upc_barrier;
    >              upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
    >                                 UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
    >              upc_barrier;
    >            }
    >          T = get_time () - start;
    >          upc_barrier;
    >
    >          if (MYTHREAD == 0)
    >            {
    >              /* [...] */
    >
    >              if (T < MIN_TEST_TIME)
    >                {
    >                  iter = iter * 2;
    >                }
    >
    > 3. This code worries me a bit:
    >
    >          for (i = 0; i < iter; i++)
    >            {
    >              upc_barrier;
    >              upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
    >                                 UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
    >              upc_barrier;
    >            }
    >
    > - The upc_all_broadcast() call above is being executed concurrently
    > by all threads.  That is, they are all attempting to distribute A
    > across B at the same time.  This is not a realistic use of broadcast.
    >
    > The following implementation ensures that only one thread executes
    > a broadcast at a given time:
    >
    >          int i, t;
    >          for (i = 0; i < iter; i++)
    >            {
    >              for (t = 0; t < THREADS; ++t)
    >                {
    >                  upc_barrier;
    >                  if (t == MYTHREAD)
    >                    {
    >                      upc_all_broadcast (&B[0].y[0], A,
    >                                         mess_size * sizeof (int),
    >                                         UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
    >                    }
    >                  upc_barrier;
    >                }
    >            }
    >
    > You might need to normalize your results by dividing by the number of
    > threads at the end of each test run, if you're interested in
    > upc_all_broadcast() times as a function of message size only.
    >
    > - The test declares A as a vector dynamically allocated on thread 0.
    > Thus, the broadcast above is always copying from thread 0's shared space
    > into all the others' shared space.  More typically, A would have
    > affinity to the calling thread.  If you declare A as being local
    > to a thread (dropping the "* shared" in the current implementation):
    >
    > shared[] int *A;
    >
    > and then make this call in each thread, rather than just thread 0:
    >
    >      if (MYTHREAD == 0)
    >        {
    >          flag = 1;
    >
    >          B = upc_global_alloc (THREADS, mess_size * sizeof (int));
    >
    >        }
    >      /* All threads allocate their own 'A' */
    >      A = (shared [] int *) upc_alloc (mess_size * sizeof (int));
    >      for (i = 0; i < mess_size; i++)
    >        {
    >          A[i] = i + 1;
    >        }
    >      upc_barrier;
    >
    > this will be a more typical use of broadcast.
    >
    > - This can be simplified:
    >              upc_all_broadcast (&B[0].y[0], A, mess_size * sizeof (int),
    >                                 UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
    >
    > to:
    >              upc_all_broadcast (B, A, mess_size * sizeof (int),
    >                                 UPC_IN_NOSYNC | UPC_OUT_NOSYNC);
    >
    >
    > Hopefully, incorporation of some/all of the suggestions above will lead
    > to a more robust test.
    >
    > - Gary
    >
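
The two bookkeeping steps suggested above (thread 0 summing the per-thread entries of cpu_times[], and dividing by the number of threads when each thread takes a turn per iteration) can be sketched in plain C. This is only a sketch under stated assumptions: an ordinary array stands in for the shared cpu_times[THREADS] array, and the helper names sum_times and per_broadcast_seconds are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for thread 0 adding up cpu_times[0..THREADS-1]. */
double sum_times(const double *times, size_t nthreads)
{
    double total = 0.0;
    size_t i;

    for (i = 0; i < nthreads; i++)
        total += times[i];
    return total;
}

/* Seconds per broadcast when every one of nthreads takes one turn in
   each of iter iterations, i.e. iter * nthreads turns in total. */
double per_broadcast_seconds(double total, long iter, size_t nthreads)
{
    assert(iter > 0 && nthreads > 0);
    return total / ((double) iter * (double) nthreads);
}
```

For example, a measured total of 8.0 s over iter = 2 with 4 threads works out to 1.0 s per broadcast.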
    
