Berkeley UPC - Unified Parallel C



UPC Task Library


  1. Download task-0.6.1.tgz
  2. Unpack the tar file (eg. tar xvzf task-0.6.1.tgz)
  3. Compile - there are Makefile examples for each test examples under the apps directory.
    When using Makefile in the apps directory, set the UPCC_FLAGS and CC variables accordingly.


Please send your comments and questions to

User Manual

A. Overview

UPC Task Library is a simple and effective way of expressing task parallelism in UPC. It is a library that helps users perform dynamic load balancing in a partitioned global address space model. This document provides a high-level API that abstracts concurrent task management details and dynamic load balancing mechanism.

A.1 Task Programming

A parallel loop and recursive divide-and-conquer program with potential load imbalance are good candidates for dynamic tasking. Transforming a parallel loop into a tasking form is easy (see section C.1 for more information). Recursive divide-and-conquer program (a.k.a. Fork/Join parallelism) can be written in blocking style. Figure 1 shows a Fibonacci example written using UPC task library. A task function, FIB, spawns two child tasks in line 11 and 12, respectively, and waits until these child tasks complete at line 13. In line 20, the main function allocates and initializes a global task queue and upc_barrier in line 21 ensures that all threads have allocated task queue before any thread uses it. In line 23, taskq_all_run executes tasks in the task queue.

  01: #include "upc.h"
  02: #include "upc_task.h"
  04: taskq_t *taskq;
  06: void FIB( int *n, int *out ) {
  07:   int n1 = *n-1;
  08:   int n2 = *n-2;
  09:   int x, y;
  10:   if (*n < 2){ *out = *n; return;}  /* termination condition */
  11:   taskq_put(taskq, FIB, &n1, &x);
  12:   taskq_put(taskq, FIB, &n2, &y);
  13:   taskq_wait(taskq);
  14:   *out = x + y;
  15: }
  17: int main(int argc, char *argv[]) {
  18:   int N, result;
  19:   N = (int) atoi(argv[1]);
  20:   taskq = taskq_all_alloc(1, FIB, sizeof(int), sizeof(int));
  21:   upc_barrier;    
  22:   if (MYTHREAD==0) taskq_put(taskq, FIB, &N, &result);  
  23:   taskq_all_run(taskq);  /* executes tasks in the taskq */
  24:   if (MYTHREAD==0) printf("Fibonacci(%d) result = %d\n", N, result);
  25:   taskq_all_free(taskq);
  26: } 
Figure 1. Fibonacci task example

A.2 Task and Task Function

A task is a unit of executable code, which is defined as a task function with pointers to its input and output data as arguments. A task function is declared in the user code as follows:

void task_func(void *input, void *output);

Input and output data are stored in contiguous memory and the sizes of input and output are specified when a task queue is allocated (see Section B.1). The input data is copied into the task library space when a task is created and travels with a task whenever a task migrates between threads. Data in the input and output region can be either a value or a reference. However, if such a data is a reference type, it should be a pointer to shared data because pointers to private data are meaningless when they move.

Within its function body, a task function can read its local variables, its input pointer, pointers to shared data, and file scope const variables and can write to its local variables, its output pointer, and pointers to shared data.

User programs cannot directly call task functions, instead task functions are put into the task queue and then executed via taskq_execute library function. (taskq_all_run can also execute tasks in the task queue since it will internally call taskq_execute function.)

A.3 Terms and Definitions

collective - a constraint placed on some language operations, which requires evaluation of such operations to be matched across all threads. The behavior of collective operations is undefined unless all threads execute the same sequence of collective operations.

single-valued - an operand to a collective operation, which has the same value on every thread. This behavior of the operation is otherwise undefined.

B. Task Library API

B.1 Task Queue Allocation and Free

taskq_t *taskq_all_alloc(int n, void *func1, int input_sz1, int output_sz1, ..., void *funcn, int input_szn, int output_szn);

Allocates a global task queue, which is a distributed data structure to implement a task pool in UPC, and returns a pointer to each thread's local portion of the global task queue - we will call this local portion of the global task queue a "local task queue" in this document. The taskq_all_alloc is a collective function. The first input argument, n, is the number of different task functions that can be put into this task queue and n triplets are followed. Each triplet describes a task function, which consists of a task function address, an input data size, and an output data size. The function arguments are single-valued, except the task function address, func, which can have different memory addresses among different threads. Therefore, the task library internally maps a task function address into an integer task function ID so that when a task is stolen by a thief thread, a thief thread can invoke the right task function using the task function ID.

void taskq_all_free(taskq_t *taskq);

Frees distributed task queue memory. It is a collective function. The input, taskq, should be the return value of the same taskq_all_alloc function call.

B.2 Task Queue Data Operation

The functions in this section are not collective functions and the arguments of these functions are not single-valued.

int taskq_put(taskq_t *taskq, void *func, void *in, void *out);

Creates a task using the input arguments - a task function, an input pointer, and an output pointer - and puts it into a task queue. Returns 1 if successfully added to the queue, otherwise, 0.

int taskq_execute(taskq_t *taskq);

Removes a task from the top of the local task queue and executes it. Returns 1 for a successful execution, otherwise returns 0.

int taskq_steal(taskq_t *taskq);

Attempts to steal tasks from random victim threads. On a successful steal, it returns the number of tasks from successful stealing or 0 if it fails.

int taskq_isEmpty(taskq_t *taskq);

Returns 1 if the local task queue is empty, otherwise, returns 0.

int taskq_all_isEmpty(taskq_t *taskq);

Returns 1 if the task queue is globally empty, otherwise, returns 0. It is a collective function.

void taskq_all_run(taskq_t *taskq);

Executes tasks in the task queue and returns when there is no tasks to execute in the task queue. It is a collective function and is equivalent to the following statements:
    do {
       while ( taskq_execute(taskq) ) { ; }   // executes all the tasks in the local task queue
       if ( taskq_steal(taskq) ) continue;    // tries to steal tasks from random victim threads
    } while ( !taskq_all_isEmpty(taskq) );    // checks the global termination condition

B.3 Taskq Synchronization

The synchronization functions are not collective functions and are called within a task function.

void taskq_wait (taskq_t *taskq);

It is a blocking operation; the taskq_wait function waits the tasks that are spawned before it to complete. It waits only the tasks that are spawned in the same task function where taskq_wait is called. The taskq_wait function internally calls the taskq_execute function to make progress while it is waiting child tasks.

void taskq_fence (taskq_t *taskq);

It is a non-blocking operation; the taskq_fence function insures that any task after it will not be scheduled for execution until all the tasks, which are spawned before it, are complete. This dependency relationship is only applied to the tasks that are spawned in the same function where the taskq_fence function is called. (See section C.2.2. for example)

B.4 Task Queue Parameter Setting Functions

Figure 2. Global task queue distributed on four UPC threads

This section describes APIs to control the behavior of a global task queue. Figure 2 shows the block diagram of a global task queue allocated on four UPC threads. Each thread manages its local task queue, which is further split into the shared region and the private region. The shared region is subject to concurrent accesses, thus operations to the shared region are serialized through a lock, but only the local thread can access the private region. From the user's point of view, the global task queue behaves as a stack because tasks are accessed from the top of the local task queue. On the other hand, on task stealing, a thief thread steals tasks from the bottom of victim thread's local task queue.

The following 4 functions are not collective functions and the arguments are not single-valued.

void taskq_set_steal_size(taskq_t *taskq, int k);

Set the number of tasks that are moved per one successful steal operation.

void taskq_set_chunk_size (taskq_t *taskq, int k);

Set the number of tasks that move between private region and shared region of the local task queue

void taskq_set_threshold (taskq_t *taskq, int k);

Set the threshold number of tasks; when the number of tasks in the private region of the local task queue exceeds the threshold value, the chunk size number of tasks are released from the private region to the shared region of the local task queue. It is recommended to set the threshold value greater than or equal to the chunk size.

C. Task Scheduling

C.1 Independent Task Execution - Parallel Loop

	01:  taskq_t *taskq; 
	02:  shared int A[10000]; 
	03:  void foo(int *index) { // a task function 
	04:    A[*index] = /* do some work */;  
	05:  } 
	06:  int main() { 
	07:    taskq = taskq_all_alloc(1, foo, sizeof(int), 0); 
	08:    upc_forall(int i=0; i < 10000; i++; &A[i]) {   
	09:      taskq_put(taskq, foo, &i); 
	10:    } 
	11:    do { 
	12:      while(taskq_execute(taskq)) {;}   
	13:      if ( taskq_steal(taskq) ) continue; 
	14:    } while ( !taskq_all_isEmpty(taskq) );   
	15:  } 
Figure 3. Parallel loop tasking example - independent task execution

Scheduling independent tasks is simple, while dependent tasks require proper synchronizations among tasks. Figure 3 shows an example that executes independent tasks with parallel-for parallelism. In lines 3-5, the task function, foo , is outlined from the parallel-for loop body and lines 8-10 show how a parallel-for loop is transformed into a tasking form. The nested while-loop, from lines 11 through 14, is the pattern that executes tasks in the task queue and can be replaced with a call to taskq_all_run. The inner while-loop executes tasks in the local task queue. Once the task queue becomes empty, taskq_steal function attempts to steal tasks from other thread's task queue. The outer while-loop checks if the task queue is globally empty.

C.2 Dependent Task Graph Execution - Fork/Join Parallelism

C.2.1. Blocking style

Figure 4. Blocking style fork/join parallelism (the running tasks at each step are shaded)

The parent task blocks until its children complete, as shown in Figure 4. The fibonacci example in Figure 1 is written in blocking style.

D. Publications


This page last modified on Saturday, 24-Sep-2016 20:41:40 PDT