From: Eric Frederich (eric.frederich_at_gmail_dot_com)
Date: Wed Nov 23 2005 - 14:35:17 PST
Hooray. That fixed it.

I am pretty sure that before, my /etc/hosts file looked like

127.0.0.1 localhost penguin27.tuxnetwork penguin27
192.168.1.208 myth.tuxnetwork myth

now it looks like

127.0.0.1 localhost
192.168.1.207 penguin27.tuxnetwork penguin27
192.168.1.208 myth.tuxnetwork myth

and I get the following results ;-)

eric@penguin27 build $ ./upcrun -n 2 hello
UPCR: UPC thread 0 of 2 on penguin27 (process 0 of 2, pid=4762)
UPCR: UPC thread 1 of 2 on myth (process 1 of 2, pid=11362)
Hello World from thread 1 of 2 ! !
Hello World from thread 2 of 2 ! !

Thanks a lot. Now that it is working, I am going away for the holiday weekend. Hopefully the power will stay alive and I'll be able to ssh in if I get bored and want to play around.

Thanks again,
~Eric

On 11/23/05, Dan Bonachea <bonachea_at_cs_dot_berkeley_dot_edu> wrote:
>
> At 05:42 PM 11/22/2005, Eric Frederich wrote:
> >Dan,
> > First of all, thanks for your quick correspondence. Attached is a file
> > with a list of commands I ran and their outputs. Please let me know if
> > there is anything else I can tell you about my set up.
> >
> >Thanks,
> >~Eric
>
> Hi Eric - the problem is shown in the log snippet below - it appears that one
> of the nodes (the one local to the spawning console) is binding to the
> localhost (loopback) ethernet interface (127.0.0.1) instead of to the real
> external IP interface, and consequently the compute node processes cannot
> reach each other.
>
> I suspect the hostname 'penguin27' is incorrectly resolving to 127.0.0.1 when
> queried from penguin27, instead of resolving to the external IP address on the
> LAN shared by both compute nodes, as it should. You can confirm this DNS
> misconfiguration by typing 'ping penguin27' on the penguin27 machine - it
> should resolve to pinging 192.168.1.207, but I suspect it will instead ping
> 127.0.0.1.
>
> There are several possible solutions to try:
>
> 1. fix DNS resolution on penguin27 to resolve to the external interface
>    (check /etc/hosts)
> 2. spawn jobs from a console on a third node, which should force both compute
>    nodes to bind to an external interface in order to reach the spawning
>    console
> 3. change USE_NUMERIC_MASTER_ADDR to 1 in gasnet/other/amudp/amudp_internal.h
>    and recompile the UPC runtime.
>
> Hope this helps..
> Dan
>
> system(ssh -f -o 'StrictHostKeyChecking no' -o 'FallBackToRsh no' 192.168.1.207 " echo connected to \$HOST... ; cd '/home/eric/UPC/build' ; './hello' '__AMUDP_SLAVE_PROCESS_VERBOSE__' 'penguin27:33197' " || ( echo "connection to 192.168.1.207 failed." ; kill 4249 ) &)
> system(ssh -f -o 'StrictHostKeyChecking no' -o 'FallBackToRsh no' 192.168.1.208 " echo connected to \$HOST... ; cd '/home/eric/UPC/build' ; './hello' '__AMUDP_SLAVE_PROCESS_VERBOSE__' 'penguin27:33197' " || ( echo "connection to 192.168.1.208 failed." ; kill 4249 ) &)
> connected to ...
> slave connecting to 192.168.1.207:33197
> connected to ...
> Endpoint table (nproc=2):
> P#0: (192.168.1.208:32795) tag: 0x7f00000100001099
> P#1: (127.0.0.1:32795) tag: 0x7f00000100011099
> Slave 1/2 starting (tag=0x7f00000100011099)...
> UDP recv buffer successfully set to 139704 bytes
> slave connecting to 127.0.0.1:33197
>

--
------------------------
Eric L. Frederich
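
P.S. For anyone who finds this thread later: the difference between the two /etc/hosts layouts can be checked with a small script. What follows is only a rough sketch, not part of UPC or GASNet. The function names `lookup_in_hosts` and `binds_to_loopback` are made up for illustration, and the sketch assumes the first line listing a name wins, which is a simplification of what the real resolver does (it also consults DNS and other sources depending on system configuration).

```python
import ipaddress

# The two /etc/hosts layouts quoted in this thread.
BEFORE = """\
127.0.0.1 localhost penguin27.tuxnetwork penguin27
192.168.1.208 myth.tuxnetwork myth
"""

AFTER = """\
127.0.0.1 localhost
192.168.1.207 penguin27.tuxnetwork penguin27
192.168.1.208 myth.tuxnetwork myth
"""


def lookup_in_hosts(hosts_text, name):
    """Return the address on the first line that lists `name`, or None.

    Hypothetical helper: assumes first match wins and ignores DNS.
    """
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if name in names:
            return address
    return None


def binds_to_loopback(hosts_text, name):
    """True if `name` maps to a loopback address such as 127.0.0.1."""
    address = lookup_in_hosts(hosts_text, name)
    return address is not None and ipaddress.ip_address(address).is_loopback


if __name__ == "__main__":
    for label, text in (("before", BEFORE), ("after", AFTER)):
        print(label, lookup_in_hosts(text, "penguin27"),
              binds_to_loopback(text, "penguin27"))
```

With the old file, penguin27 maps to the loopback address, which matches the 127.0.0.1 entry in the endpoint table quoted above. With the new file it maps to 192.168.1.207. To check a live machine, the same functions can be fed the contents of the real /etc/hosts.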