From: Dan Bonachea (bonachea_at_cs_dot_berkeley_dot_edu)
Date: Tue Nov 22 2005 - 23:16:44 PST
At 05:42 PM 11/22/2005, Eric Frederich wrote:
>Dan,
> First of all, thanks for your quick correspondence. Attached is a file
> with a list of commands I ran and their outputs. Please let me know if
> there is anything else I can tell you about my set up.
>
>Thanks,
>~Eric

Hi Eric - the problem is shown in the log snippet below - it appears that one of the nodes (the one local to the spawning console) is binding to the localhost (loopback) ethernet interface (127.0.0.1) instead of to the real external IP interface, and consequently the compute node processes cannot reach each other.

I suspect the hostname 'penguin27' is incorrectly resolving to 127.0.0.1 when queried from penguin27, instead of resolving to the external IP address on the LAN shared by both compute nodes, as it should. You can confirm this DNS misconfiguration by typing 'ping penguin27' on the penguin27 machine - it should resolve to pinging 192.168.1.207, but I suspect it will instead ping 127.0.0.1.

There are several possible solutions to try:

1. fix DNS resolution on penguin27 to resolve to the external interface (check /etc/hosts)
2. spawn jobs from a console on a third node, which should force both compute nodes to bind to an external interface in order to reach the spawning console
3. change USE_NUMERIC_MASTER_ADDR to 1 in gasnet/other/amudp/amudp_internal.h and recompile the UPC runtime.

Hope this helps..
Dan

system(ssh -f -o 'StrictHostKeyChecking no' -o 'FallBackToRsh no' 192.168.1.207 " echo connected to \$HOST... ; cd '/home/eric/UPC/build' ; './hello' '__AMUDP_SLAVE_PROCESS_VERBOSE__' 'penguin27:33197' " || ( echo "connection to 192.168.1.207 failed." ; kill 4249 ) &)
system(ssh -f -o 'StrictHostKeyChecking no' -o 'FallBackToRsh no' 192.168.1.208 " echo connected to \$HOST... ; cd '/home/eric/UPC/build' ; './hello' '__AMUDP_SLAVE_PROCESS_VERBOSE__' 'penguin27:33197' " || ( echo "connection to 192.168.1.208 failed." ; kill 4249 ) &)
connected to ...
slave connecting to 192.168.1.207:33197
connected to ...
Endpoint table (nproc=2):
P#0: (192.168.1.208:32795) tag: 0x7f00000100001099
P#1: (127.0.0.1:32795) tag: 0x7f00000100011099
Slave 1/2 starting (tag=0x7f00000100011099)...
UDP recv buffer successfully set to 139704 bytes
slave connecting to 127.0.0.1:33197
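
The hostname check described in the message (does 'penguin27' resolve to a loopback address when queried on penguin27 itself?) can also be scripted instead of read off the ping output. The following is only a minimal sketch in Python, not part of GASNet or AMUDP. The helper names is_loopback and check_hostname are made up for illustration, and it assumes that the system resolver used by socket.gethostbyname (which normally consults /etc/hosts) reflects what the UPC runtime sees on that node.

```python
import ipaddress
import socket


def is_loopback(addr: str) -> bool:
    """Return True if addr is a loopback IP address such as 127.0.0.1."""
    return ipaddress.ip_address(addr).is_loopback


def check_hostname(name: str) -> str:
    """Resolve name through the system resolver and describe the result.

    Hypothetical helper: it only reports what the local resolver returns;
    it does not inspect or modify /etc/hosts.
    """
    try:
        addr = socket.gethostbyname(name)
    except socket.gaierror as exc:
        return f"{name}: resolution failed ({exc})"
    if is_loopback(addr):
        return f"{name} -> {addr} (loopback; other nodes cannot reach this address)"
    return f"{name} -> {addr} (not loopback)"


if __name__ == "__main__":
    # Run this on the node being checked, e.g. on penguin27 itself.
    print(check_hostname(socket.gethostname()))
```

Running it on each compute node and looking for a "loopback" result would point at the same misconfiguration that 'ping penguin27' reveals. A result of "not loopback" only says that the name does not map to 127.x.x.x; it does not confirm that the address is the one on the LAN shared by the compute nodes.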