Large datasets synchronization

Data synchronization by network

Résif-DC offers an on-demand synchronization service (rsync server), with access granted per network. Synchronization of data from embargoed networks requires the approval of the PI of the experiment.

Summary

Rsync is a popular utility that provides fast incremental file transfer. It is usually shipped with modern Unix-like systems (e.g. Linux, macOS). Ports exist for Windows [1].

Rsync is a convenient way to download and keep up to date large datasets of continuous data, whereas other data services are more suitable for tailored data requests (time-windowed, quality-filtered, etc.) and smaller amounts of data. Résif allows end users to download datasets via rsync upon specific request. If you wish to download a particular dataset, contact us and describe your needs and goals. Résif-DC examines requests on a case-by-case basis, according to its data policy.

Example usage

Once your request is approved, datacentre operators will provide you with an rsync module name and a temporary login and password. You may then use the following commands at a shell prompt. These commands are compatible with most Unix/Linux systems. Depending on your shell variant, you may need to use a different syntax.

Note: the machine on which you run rsync must be allowed to reach the server rsync.resif.fr on TCP port 873 (check with your IT service).
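
Before asking your IT service, you can test the connection yourself. The following is a minimal sketch, not an official Résif tool: check_port is a hypothetical helper name, and it assumes a bash built with /dev/tcp support and the GNU coreutils "timeout" command (not installed by default on macOS).

```shell
#!/usr/bin/env bash
# Sketch: test whether a TCP port is reachable, using bash's /dev/tcp feature.
# check_port is a hypothetical helper name, not part of rsync.
# usage: check_port HOST PORT [WAIT_SECONDS]
check_port() {
    local host="$1" port="$2" wait="${3:-5}"
    # 'timeout' aborts the connection attempt after $wait seconds.
    # Inside the child shell, $0 is the host and $1 is the port.
    if timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
        echo "reachable: $host:$port"
    else
        echo "NOT reachable: $host:$port"
        return 1
    fi
}
```

For example, "check_port rsync.resif.fr 873" reports whether the rsync port can be reached from your machine. A failure usually points to a local firewall rule.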

# STEP 1
# enter your local destination directory (create this directory before running):
export DESTINATION="/my/local/directory"

# STEP 2
# enter your credentials to access the data. These are provided by RESIF (*do not disclose!*)

export MODULE="xxxx"
export LOGNAME="xxxx"
export RSYNC_PASSWORD="xxxx"

# STEP 3
# datacenter specific parameters

export SERVER="rsync.resif.fr"
export OPTS="-rltvh --compress-level=1"
export DRYRUN="-n --stats"

# STEP 4
# launch a trial transfer (recommended)

rsync $OPTS $DRYRUN rsync://$LOGNAME@$SERVER/$MODULE/ $DESTINATION

# You are now ready to transfer -------------

# STEP 5
# launch full transfer.
# note : running this command many times will update your destination directory with new/modified files since last transfer.
# This will not delete any files on your side that don't exist anymore on the datacentre side.

rsync $OPTS rsync://$LOGNAME@$SERVER/$MODULE/ $DESTINATION

# Tips -------------------------------------
# ask us for more complex usages, or read rsync manpage.
# http://rsync.samba.org/ftp/rsync/rsync.html
#

# listing remote contents (like 'ls -l') without transferring:

rsync $OPTS rsync://$LOGNAME@$SERVER/$MODULE/

# using shell-style wildcards to transfer only the files you want
# (quoted so that the server, not your local shell, expands them):

rsync $OPTS "rsync://$LOGNAME@$SERVER/$MODULE/2012/KES0*" $DESTINATION
rsync $OPTS "rsync://$LOGNAME@$SERVER/$MODULE/2012/*/HHZ.D/*.???" $DESTINATION

# setting an option to remove files in your destination directory that no longer exist on the datacentre side (be careful!):

export OPTS="$OPTS --delete"
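
Large transfers can be interrupted. Since running the transfer command again only fetches new or modified files, it can be wrapped in a simple retry loop. The sketch below is an illustration only: retry_cmd is a hypothetical helper name, and the number of attempts and the pause are arbitrary values to adapt.

```shell
#!/usr/bin/env bash
# retry_cmd is a hypothetical helper: run a command up to MAX times,
# pausing PAUSE seconds between failed attempts.
# usage: retry_cmd MAX PAUSE command [args...]
retry_cmd() {
    local max="$1" pause="$2" attempt=1
    shift 2
    # loop until the command succeeds or the attempts are exhausted
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${pause}s" >&2
        attempt=$((attempt + 1))
        sleep "$pause"
    done
}
```

With the variables defined in the steps above, a full transfer could then be launched as "retry_cmd 5 60 rsync $OPTS --partial rsync://$LOGNAME@$SERVER/$MODULE/ $DESTINATION". The --partial option asks rsync to keep partially transferred files, so an interrupted file does not have to restart from zero.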

Known bugs and limitations

The data is delivered as daily miniSEED files, organized in an SDS file hierarchy. Finer-grained time-window selection is not possible. Rsync access is granted to end users for a limited time. Bandwidth and service availability may be adjusted depending on the overall load on the datacentre computing infrastructure.
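
To help locate a given daily file, here is a minimal sketch that builds a relative path following the standard SDS (SeisComP Data Structure) layout, YEAR/NET/STA/CHAN.D/NET.STA.LOC.CHAN.D.YEAR.DAY. The function name sds_path and the codes used in the example are illustrative only. The layout inside your rsync module may differ (the wildcard examples above suggest that station directories sit directly under the year), so check it with a listing first.

```shell
#!/usr/bin/env bash
# sds_path is a hypothetical helper that prints the relative path of one
# daily file, assuming the standard SDS layout with data type "D" (waveforms).
# usage: sds_path NET STA LOC CHAN YEAR DAY_OF_YEAR
sds_path() {
    local net="$1" sta="$2" loc="$3" chan="$4" year="$5" doy="$6"
    # 10# forces base 10, so inputs such as "045" or "08" are not read as octal;
    # %03d pads the day of year to three digits.
    printf '%s/%s/%s/%s.D/%s.%s.%s.%s.D.%s.%03d\n' \
        "$year" "$net" "$sta" "$chan" \
        "$net" "$sta" "$loc" "$chan" "$year" "$((10#$doy))"
}

# Example with placeholder codes (XX and KES01 are not real identifiers):
sds_path XX KES01 00 HHZ 2012 45
```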

Metadata, small datasets (typically under 100 GB of data), and fine-grained time-window selections should be downloaded via FDSN web services.
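
As an illustration of such a request, the sketch below builds a query URL for the FDSN dataselect web service, using the standard parameters network, station, channel, starttime and endtime. The helper name dataselect_url is hypothetical, and the base address must be passed explicitly: the address used in the example is an assumption to verify against the Résif web services documentation.

```shell
#!/usr/bin/env bash
# dataselect_url is a hypothetical helper that prints an FDSN dataselect
# query URL. It only builds the string and performs no network access.
# usage: dataselect_url BASE NET STA CHAN START END
dataselect_url() {
    local base="$1" net="$2" sta="$3" cha="$4" start="$5" end="$6"
    printf '%s/fdsnws/dataselect/1/query?network=%s&station=%s&channel=%s&starttime=%s&endtime=%s\n' \
        "$base" "$net" "$sta" "$cha" "$start" "$end"
}

# Example: one hour of data (base address and codes are assumptions/placeholders).
dataselect_url "https://ws.resif.fr" XX KES01 HHZ 2012-02-14T00:00:00 2012-02-14T01:00:00
```

The resulting URL can be passed to a download tool, for instance curl -o data.mseed "URL", keeping the quotes so that the shell does not interpret the & characters.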


[1] For example, Cygwin provides a large collection of tools that give Windows users Unix-like features. Ask your local IT support.
