From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Feb 12 2010 - 10:24:44 PST
Dorian Krause wrote:
>
>> There are two functions for allocating a UPC lock (static definitions
>> are prohibited):
>> upc_all_lock_alloc() is called collectively and all UPC threads
>> receive the same pointer.
>> upc_global_lock_alloc() is called by a single thread.
>>
>> In Berkeley UPC we do allocate the locks from the UPC shared heap:
>> In the case of the collective upc_all_lock_alloc() all such
>> allocations ARE from thread 0.
>> In the case of the non-collective upc_global_lock_alloc() the
>> allocation is local to the calling thread.
>
> Nikita, Paul,
>
> thanks a lot. Changing calls from upc_all_lock_alloc() to
> upc_global_lock_alloc() indeed makes a difference (not in run-time,
> though ...).
>
> Still I see that upc_lock_attempt and upc_lock require active
> participation by the thread "hosting" a lock. To show what I mean, I
> attached a test program which lets a master thread sleep while a slave
> tries to acquire a lock. On my system (Opteron, ibv) I see the output:
>
> pvfs2-compute-2-13% upcrun -n 2 ./test_upc
> UPCR: UPC thread 0 of 2 on pvfs2-compute-2-13.local (process 0 of 2, pid=8829)
> UPCR: UPC thread 1 of 2 on pvfs2-compute-2-13.local (process 1 of 2, pid=8830)
> [0]: Done sleeping.
> [1]: Got the lock; took me 7.462001e-02
> [0]: Done sleeping.
> [1]: Got the lock; took me 1.491880e-01
>
> I don't quite understand these results. Can't the lock simply be
> implemented using e.g. bupc_atomic* functions without this "two-sided"
> behavior?
>
> Dorian

Dorian,

I am sorry I hadn't answered the second part of your question - the part
about progress. You are correct that the Berkeley UPC runtime library
relies on the process with affinity to the lock making entries into the
library in order to make progress.

As for implementing locks via atomics, it is not that simple. First,
there is the simple matter that the locks were implemented years earlier
than the atomics. Second, the likely implementation of locks via atomics
would require polling across the network: a upc_lock() call that found
the lock held by another thread would keep issuing atomic operations over
and over until it acquired the lock, creating a potential "storm" of
network traffic (there are possible solutions to that, but they are
complex). Finally, on networks without RDMA (such as UDP and MPI) the
atomics use the same Active Message mechanism as the locks and thus
suffer from the same progress problems; an atomics-based implementation
would potentially require MORE attentiveness from the remote threads than
the current implementation does.

I agree that it would be nice if we could do better than we do now.

-Paul

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
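
The attached test_upc program is not preserved in the archive. Below is a
minimal sketch of the kind of test Dorian describes, assuming the lock is
allocated collectively (so, per Paul's note above, its storage in Berkeley
UPC comes from thread 0) and that thread 0 simply sleeps without entering
the UPC runtime while thread 1 times its upc_lock() call. The two-second
sleep and the gettimeofday()-based timing are illustrative choices, not
the contents of the original attachment.

#include <upc.h>        /* upc_lock_t, upc_all_lock_alloc, upc_lock, ... */
#include <stdio.h>
#include <unistd.h>     /* sleep() */
#include <sys/time.h>   /* gettimeofday() */

int main(void)
{
    /* Collective allocation: every thread receives the same pointer;
     * in Berkeley UPC the lock's storage has affinity to thread 0. */
    upc_lock_t *lock = upc_all_lock_alloc();
    upc_barrier;

    if (MYTHREAD == 0) {
        /* Thread 0 sleeps without entering the UPC runtime, so it is not
         * servicing the Active Messages behind remote lock requests. */
        sleep(2);
        printf("[%d]: Done sleeping.\n", MYTHREAD);
    } else if (MYTHREAD == 1) {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        upc_lock(lock);      /* may stall until thread 0 re-enters the runtime */
        gettimeofday(&t1, NULL);
        upc_unlock(lock);
        printf("[%d]: Got the lock; took me %e\n", MYTHREAD,
               (t1.tv_sec - t0.tv_sec) + 1e-6 * (t1.tv_usec - t0.tv_usec));
    }

    upc_barrier;
    if (MYTHREAD == 0)
        upc_lock_free(lock);
    return 0;
}

Compiled with upcc and launched with "upcrun -n 2" as in the transcript
above, the time reported by thread 1 reflects how long it waits for the
thread with affinity to the lock to re-enter the runtime, which is the
progress behavior Paul describes.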