GridFTP.Bio-Mirror.net

About

GridFTP.Bio-Mirror.net mirrors public biology data sets. The Bio-Mirror.net project has worked since 1999 to provide up-to-date public biology data to the world.
"GridFTP" means a new, faster data transfer method is in place, now in test mode (Aug 2010). This software, from the Globus project, has recently been improved to offer UDP-based file transport (UDT), with long-distance speed improvements of 3x to 10x over the usual TCP-based file transport.

-- Don Gilbert, August 2010, gilbertd At bio-mirror.net

GridFTP Documents and Installation help

READ_ME: http://www.globus.org/toolkit/docs/5.0/5.0.2/data/gridftp/
Early adopters of this will need to read the docs from Globus. We will later provide a how-to document suited to Bio-mirror.net.

As of this writing you will need to fetch and compile the GridFTP software with UDT support. Globus Version 5.0.2 or later is required.
Globus Toolkit 5.0.2+ Download

GridFTP + UDT will compile and run on Linux with these steps:

export GLOBUS_LOCATION=/usr/local/globus5
./configure --prefix=$GLOBUS_LOCATION 
make gridftp udt install > log.mks1 2>&1 &

cd $GLOBUS_LOCATION
cp -p bin/XXXpthr/shared/globus-url-copy  bin/ 
This last step, copying the threaded build's shared/globus-url-copy into bin/, is documented but not obvious at first. UDT works only with the threaded build.
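
The XXX in the copy step appears to be a placeholder for a build-specific prefix, so the exact directory name will vary. As a minimal sketch, assuming the threaded build directory is the one under bin/ whose name ends in pthr (an assumption drawn from the path in the steps, not from the Globus documentation), a hypothetical helper find_threaded_guc can list the candidate binaries:

```shell
# Hypothetical helper, not part of Globus: list candidate threaded
# globus-url-copy binaries under an install prefix.
# $1 = install prefix (defaults to $GLOBUS_LOCATION)
find_threaded_guc() {
  prefix=${1:-$GLOBUS_LOCATION}
  # Assumption: threaded build directories have names ending in "pthr".
  for f in "$prefix"/bin/*pthr/shared/globus-url-copy; do
    # An unmatched glob stays literal, so keep only paths that exist.
    [ -e "$f" ] && printf '%s\n' "$f"
  done
  return 0
}
```

If it prints exactly one path, that path is the file to copy into $GLOBUS_LOCATION/bin/ with cp -p.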

GridFTP + UDT will also compile and run on Mac OS X and Solaris 10 (my preference) if you modify the Globus build process for UDT to enable the compile options for UNIX (Solaris 10) or Mac OS X. I also needed to modify udt4/src/channel.cpp for Solaris 10.

GridFTP runs in anonymous FTP mode at bio-mirror.net, which requires a few server source changes for better anonymous FTP support. globus-url-copy also needs a patch so that it preserves file timestamps, which matters especially with the new -sync option.

The file gt5.0.2_patches.txt contains my patches, which can be applied to gt5.0.2-all-source-installer/:

  • udt4: src/channel.cpp for Solaris10; configure.ac: fix for *solaris*, *darwin*
  • gridftp server: anonymous ftp directories: limit to anonymous root
  • gridftp server: log_transfers, UDT client IP was 0.0.0.0
  • globus-url-copy: set dest file timestamp to match source time

    Trial runs

    A small repository for testing is available at port 2899 of this server. Please use it for trial runs; no registration is needed.

    List server

    globus-url-copy -list \
      ftp://gridftp.bio-mirror.net:2899/biomirror/
    

    Copy tiny data set

    using TCP: 
      time globus-url-copy -sync -cd \
      ftp://gridftp.bio-mirror.net:2899/biomirror/rebase/ \
      rebase/
    using UDT: add -udt
    

    Copy larger data set

    A useful 3GB data set of NCBI Blast NR protein data
    using TCP:  
      time globus-url-copy -sync -cd \
      'ftp://gridftp.bio-mirror.net:2899/biomirror/blast/nr.*.tar.gz' \
      blast/
    using UDT:  add -udt
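
The TCP and UDT variants of these trial transfers differ only in the -udt flag. As a minimal sketch, the hypothetical helper guc_command below (not part of Globus) prints the full command for either transport without running it. The flags and URLs are the ones used on this page, and placing -udt before the other options is an assumption.

```shell
# Hypothetical helper: print the globus-url-copy command for one trial transfer.
# $1 = transport ("tcp" or "udt"), $2 = source URL, $3 = destination directory
guc_command() {
  guc_transport=$1 guc_src=$2 guc_dest=$3
  case "$guc_transport" in
    udt) guc_flags="-udt -sync -cd" ;;   # UDT: same options plus -udt
    tcp) guc_flags="-sync -cd" ;;        # TCP is the default transport
    *) echo "unknown transport: $guc_transport" >&2; return 1 ;;
  esac
  # Single-quote the URL so a pattern such as nr.*.tar.gz is not expanded
  # by the local shell when the printed line is pasted back in.
  printf "globus-url-copy %s '%s' %s\n" "$guc_flags" "$guc_src" "$guc_dest"
}

guc_command tcp 'ftp://gridftp.bio-mirror.net:2899/biomirror/blast/nr.*.tar.gz' blast/
guc_command udt 'ftp://gridftp.bio-mirror.net:2899/biomirror/blast/nr.*.tar.gz' blast/
```

Prefix either printed line with time to compare the two transports on your own connection.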
    

    Standard FTP comparison

    The same repository is available over standard FTP for comparison, at
    ftp://gridftp.bio-mirror.net/biomirror/
    This will be the same data as at ftp://bio-mirror.net/biomirror/ (by end of August 2010).

    Please use the hostname gridftp.bio-mirror.net rather than IP address, as we plan to change the address.

    Register for usage

    The full bio-mirror data repository is accessible on the standard GridFTP port 2811, after you register your computer IP address and contact info.
    globus-url-copy -list \
     ftp://gridftp.bio-mirror.net:2811/biomirror/
    
    We ask you to register your computer IP address for full access to GridFTP.Bio-Mirror.net because this is still in a trial stage, and we need to be able to assess problems and contact you about any that arise. Early tests match other reports: the server CPU and memory load is higher than with regular FTP, but not greatly so. GridFTP/UDT appears to consume less CPU than rsync, as well as being more useful.

    Test Cases

    GridFTP TCP and UDT transfer times for 113 GB from
    gridftp.bio-mirror.net/biomirror/blast/ (Indiana USA)
           Ping  Time(min)  TCP/  Distance
    Site    RTT  TCP   UDT  UDT    Km      Network Route
    --------------------------------------------------------------
    NCSA    10   139   138   1    200  Indiana - U of Illinois - NCSA
                 14    14              Megabytes/sec
    Purdue  17   125   125   1    500  In. - Chicago - Purdue, Indiana
                 15    15              MB/s
    ORNL    25   361   120   3   1200  In. - Chi. - Nashv., Tennessee - ORNL
                  5.3  16              MB/s
    TACC    37   616   120   5   2000  In. - Chi. - Houston, Texas - TACC
                  3.1  16              MB/s
    SDSC    65   750   475   1.6 3300  In. - Chi. - LA, California - SDSC
                  2.5   4.0            MB/s 
    
    CSTNET 274  3722*  304  12  12000  In. - Internet2 - Korea - Beijing, China
                  0.5   6.3            MB/s; * est. from partial TCP result
    --------------------------------------------------------------
    
    For each site, the first row gives transfer times in minutes for TCP and UDT and the TCP/UDT time ratio; the second row gives the corresponding speeds in megabytes/second. NCSA, Purdue, ORNL, TACC and SDSC are Teragrid.org sites in the USA. Land/sea line distance is given in km. RTT is the network distance, measured as average round-trip ping time in ms. TCP and UDT transfers were run simultaneously from each site. The TCP buffer setting was -tcp-bs 500000.
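
The speed figures and the ratio column in the table follow from the transfer times and the 113 GB data size. A minimal sketch of that arithmetic, using two hypothetical helper functions and assuming 1 GB = 1024 MB (an assumption, chosen because it reproduces the table's figures to within rounding):

```shell
# Average speed in MB/s for a transfer of $1 gigabytes taking $2 minutes.
# Assumption: 1 GB = 1024 MB.
mb_per_sec() {
  awk -v gb="$1" -v min="$2" 'BEGIN { printf "%.1f\n", gb * 1024 / (min * 60) }'
}

# TCP/UDT time ratio: $1 = TCP minutes, $2 = UDT minutes.
time_ratio() {
  awk -v tcp="$1" -v udt="$2" 'BEGIN { printf "%.1f\n", tcp / udt }'
}

mb_per_sec 113 361   # ORNL over TCP
mb_per_sec 113 120   # ORNL over UDT
time_ratio 361 120   # ORNL TCP/UDT ratio
```

For the ORNL row this prints 5.3, 16.1 and 3.0, against the table's 5.3, 16 and 3.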

    Resource use by client globus-url-copy was higher for UDT.
    Resource use per gridftp server process

      UDT:  1.0% - 3.0% CPU; 40 MB Memory
      TCP:  0.1% - 0.6% CPU;  6 MB Memory
    Script for Testing


    Report on GridFTP qualities

    UDT as an Alternative Transport Protocol for GridFTP
    John Bresnahan, Michael Link, Rajkumar Kettimuthu, and Ian Foster
    Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439

    "We compare the performance of Iperf, scp, bbcp, GridFTP over TCP (both single and multiple streams), GridFTP over UDT, and raw UDT on four different networks: a wide-area network between Argonne National Laboratory (ANL) and the University of Auckland, New Zealand (NZ), with a round-trip time of 204 ms; a wide-area network between ANL and Los Angeles US (ISI), with a round-trip time of 60 ms; a wide-area network between the Ohio State University US (BMI) and JA site in Japan, which is a part of the Japan Gigabit Network II project, with a round-trip time of 193 ms; and a wide-area network between the JA site and Oak Ridge National Laboratory, Tennessee US (ORNL), with a round-trip time of 194 ms. To the best of our knowledge, all the pairs of the sites used in the experiments have 1 Gbit/s (maximum possible bandwidth)."

    Table 1: Throughput (in Mbit/s) achieved when transferring 1 GB of data over two wide-area networks, using various mechanisms.

    Mechanism       ANL/NZ  ANL/ISI BMI/JA  JA/ORNL 
    scp               2       9       3       3     
    bbcp             --      35       5     112     
    Iperf            19      74      59     110     
    GridFTP.TCP      16      59      73     113     
    GridFTP.UDT     187     418     220     380     
    UDT             174     398     211     374     
      # for data on disk, 1 transport stream (see paper)
    
    "In these experiments, 1 GB of data was transferred between the end points. Table 1 shows the throughput achieved in megabits per second. We noted that the performance of GridFTP over TCP is comparable to the performance of iperf and is significantly better than scp and bbcp. GridFTP over UDT outperforms the best possible throughput obtained with TCP by a factor of 4 on two testbeds (ANL-NZ and ANL-ISI). GridFTP over UDT outperforms GridFTP over TCP (single stream) by a factor of 3 on the other two testbeds (BMI-Japan and Japan-ORNL)."