Re: problems running UPC programs

From: Dan Bonachea (bonachea_at_cs_dot_berkeley_dot_edu)
Date: Tue Nov 22 2005 - 23:16:44 PST

  • Next message: Eric Frederich: "Re: problems running UPC programs"
    At 05:42 PM 11/22/2005, Eric Frederich wrote:
    >Dan,
    >      First of all, thanks for your quick correspondence.  Attached is a file 
    > with a list of commands I ran and their outputs.  Please let me know if 
    > there is anything else I can tell you about my set up.
    >
    >Thanks,
    >~Eric
    
    Hi Eric - the problem is shown in the log snippet below. It appears that one 
    of the nodes (the one local to the spawning console) is binding to the 
    loopback interface (127.0.0.1) instead of the real external IP interface, 
    and consequently the compute node processes cannot reach each other.
    
    I suspect the hostname 'penguin27' is incorrectly resolving to 127.0.0.1 when 
    queried from penguin27, instead of resolving to its external IP address on the 
    LAN shared by both compute nodes, as it should. You can confirm this name 
    resolution misconfiguration by typing 'ping penguin27' on the penguin27 
    machine - it should ping 192.168.1.207, but I suspect it will instead ping 
    127.0.0.1.
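
The same check can be scripted instead of read off ping output. This is only a minimal sketch in Python, not part of the UPC runtime or GASNet; the helper names `is_loopback` and `check_hostname` are made up for illustration. It asks the system resolver (which normally consults /etc/hosts) for the address of the local hostname and reports whether it falls in the loopback range.

```python
import ipaddress
import socket


def is_loopback(address: str) -> bool:
    """Return True if the address is a loopback address (127.0.0.0/8 or ::1)."""
    return ipaddress.ip_address(address).is_loopback


def check_hostname(hostname: str) -> str:
    """Resolve hostname through the system resolver and describe the result."""
    address = socket.gethostbyname(hostname)  # IPv4 lookup
    if is_loopback(address):
        return f"{hostname} -> {address} (loopback: other nodes cannot reach this)"
    return f"{hostname} -> {address} (not loopback)"


if __name__ == "__main__":
    # Run this on penguin27 itself; gethostname() returns the local hostname.
    print(check_hostname(socket.gethostname()))
```

If the script reports a loopback address when run on penguin27, that matches the suspected misconfiguration.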
    
    There are several possible solutions to try:
    
    1. fix hostname resolution on penguin27 so that 'penguin27' resolves to the 
    external interface (check /etc/hosts)
    2. spawn jobs from a console on a third node, which should force both compute 
    nodes to bind to an external interface in order to reach the spawning console
    3. change USE_NUMERIC_MASTER_ADDR to 1 in gasnet/other/amudp/amudp_internal.h 
    and recompile the UPC runtime.
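
For option 1, the usual culprit is an /etc/hosts line that lists the machine's own name next to 127.0.0.1. The sketch below is an illustration only: the file contents are hypothetical (I have not seen the actual /etc/hosts on penguin27, and the name 'penguin28' for the second node is an assumption), and `loopback_names` is a made-up helper. It scans hosts-style text and reports which names map to a loopback address.

```python
import ipaddress


def loopback_names(hosts_text: str) -> set[str]:
    """Return hostnames that /etc/hosts-style text maps to a loopback address."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        address, hostnames = fields[0], fields[1:]
        try:
            if ipaddress.ip_address(address).is_loopback:
                names.update(hostnames)
        except ValueError:
            continue  # first field is not a plain IP address; skip the line
    return names


# Hypothetical contents illustrating the suspected problem.
SUSPECT_HOSTS = """\
127.0.0.1       localhost penguin27
192.168.1.208   penguin28   # second node name is assumed
"""

# Hypothetical corrected version: penguin27 maps to its LAN address.
FIXED_HOSTS = """\
127.0.0.1       localhost
192.168.1.207   penguin27
192.168.1.208   penguin28
"""

if __name__ == "__main__":
    print(sorted(loopback_names(SUSPECT_HOSTS)))
    print(sorted(loopback_names(FIXED_HOSTS)))
```

To check the real file, pass `open("/etc/hosts").read()` to `loopback_names`; only 'localhost' (and similar aliases) should come back, not the machine's own hostname.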
    
    Hope this helps.
    Dan
    
    system(ssh -f -o 'StrictHostKeyChecking no' -o 'FallBackToRsh no'  192.168.1.207 " echo connected to \$HOST... ; cd '/home/eric/UPC/build' ; './hello' '__AMUDP_SLAVE_PROCESS_VERBOSE__' 'penguin27:33197' "  || ( echo "connection to 192.168.1.207 failed." ; kill 4249 ) &)
    system(ssh -f -o 'StrictHostKeyChecking no' -o 'FallBackToRsh no'  192.168.1.208 " echo connected to \$HOST... ; cd '/home/eric/UPC/build' ; './hello' '__AMUDP_SLAVE_PROCESS_VERBOSE__' 'penguin27:33197' "  || ( echo "connection to 192.168.1.208 failed." ; kill 4249 ) &)
    connected to ...
    slave connecting to 192.168.1.207:33197
    connected to ...
    Endpoint table (nproc=2):
      P#0:   (192.168.1.208:32795)   tag: 0x7f00000100001099
      P#1:   (127.0.0.1:32795)       tag: 0x7f00000100011099
    Slave 1/2 starting (tag=0x7f00000100011099)...
    UDP recv buffer successfully set to 139704 bytes
    slave connecting to 127.0.0.1:33197
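
The telltale entry in the log above is P#1 in the endpoint table, which advertises 127.0.0.1 to the other node. As a rough sketch, the following Python scans verbose output for such entries. The regular expression is an assumption based solely on the table format visible in this one log, and `loopback_endpoints` is a made-up helper, not something provided by AMUDP.

```python
import ipaddress
import re

# Matches lines such as "P#1:   (127.0.0.1:32795)"; format inferred from the log.
ENDPOINT_RE = re.compile(r"P#(\d+):\s+\((\d+\.\d+\.\d+\.\d+):(\d+)\)")


def loopback_endpoints(log_text: str) -> list[int]:
    """Return process numbers whose endpoint address is a loopback address."""
    flagged = []
    for match in ENDPOINT_RE.finditer(log_text):
        proc, address = int(match.group(1)), match.group(2)
        if ipaddress.ip_address(address).is_loopback:
            flagged.append(proc)
    return flagged


# Endpoint table copied from the log excerpt.
LOG = """\
Endpoint table (nproc=2):
  P#0:   (192.168.1.208:32795)   tag: 0x7f00000100001099
  P#1:   (127.0.0.1:32795)       tag: 0x7f00000100011099
"""

if __name__ == "__main__":
    print(loopback_endpoints(LOG))
```

Any process number reported here belongs to a node that bound to the loopback interface and will be unreachable from the other nodes.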
    
