Large datasets synchronization

Data synchronization by network

The Epos-France seismological data center (SDC) offers an on-demand synchronization service (rsync server) that can be accessed at the network level. Data synchronization of embargoed networks is subject to approval by the PI of the experiment.

Summary

Rsync is a popular utility that provides fast incremental file transfer. It usually ships with modern Unix-like systems (e.g. Linux, macOS). Ports exist for Windows [1].

Rsync is a convenient way to download and update large datasets of continuous data, whereas other data services are better suited to tailored data requests (time-windowed, quality-filtered, etc.) and smaller amounts of data. Epos-France allows end users to download datasets via rsync upon specific request. If you wish to download a particular dataset, contact us and describe your needs and goals. The Epos-France seismological datacenter examines requests on a case-by-case basis, according to its data policy.

Usage instructions for open data in PH5 format

To download a complete open dataset in PH5 format, you first need the name of the rsync module. If available, it is advertised on the presentation page of the network (in the comments section).

Listing the remote files:

rsync rsync://rsync.resif.fr/NETWORK_MODULE_NAME

To download the dataset, add a destination directory and some options to the above command:

rsync -rltvh --compress-level=1 rsync://rsync.resif.fr/NETWORK_MODULE_NAME /data

Usage instructions for restricted data

Once your request is approved, datacentre operators will provide you with an rsync module name and a temporary login and password. You can then run the following commands at a shell prompt. They are compatible with most Unix/Linux systems; depending on your shell variant, you may need to adapt the syntax.

Note: the machine on which you run rsync must be allowed to reach the server rsync.resif.fr on TCP port 873 (check with your IT service).
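The port requirement in the note above can be checked before requesting access. The following is a minimal sketch, assuming that bash and the coreutils timeout command are installed (neither is guaranteed on every system); the function name check_rsync_port is hypothetical and only used for illustration.

```shell
# Hypothetical helper: return 0 if a TCP connection to HOST:PORT can be opened.
# Assumes bash (for its /dev/tcp redirection) and coreutils "timeout".
check_rsync_port() {
    host=${1:-rsync.resif.fr}
    port=${2:-873}
    # Inside the bash -c script, $0 is the host and $1 is the port.
    timeout 10 bash -c ': < "/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example use:
# if check_rsync_port; then
#     echo "TCP port 873 on rsync.resif.fr is reachable"
# else
#     echo "connection failed: ask your IT service about outgoing TCP port 873"
# fi
```

A failure can also mean that the server is temporarily unavailable, so it is worth retrying before contacting your IT service.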

Note: access to open data does not require a prior request. Simply set RSYNC_MODULE in the following steps.

STEP 1

Enter your local destination directory (create this directory before running):

DESTINATION="/my/local/directory"
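Since the destination directory has to exist before the transfer starts, it can be created and checked from the shell. This is a small sketch; the function name prepare_destination is hypothetical.

```shell
# Hypothetical helper: create the destination directory if needed and
# confirm that it is a writable directory.
prepare_destination() {
    dest=$1
    mkdir -p "$dest" || return 1
    [ -d "$dest" ] && [ -w "$dest" ]
}

# Example use:
# prepare_destination "$DESTINATION" || echo "cannot write to $DESTINATION" >&2
```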

STEP 2

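Enter your credentials to access the data. These are provided by Epos-France (do not disclose!). Rsync reads the password from the RSYNC_PASSWORD environment variable, so this variable must be exported.

RSYNC_MODULE="xxxx"
RSYNC_USERNAME="xxxx"
export RSYNC_PASSWORD="xxxx"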

Set the datacenter-specific parameters:

RSYNC_SERVER="rsync.resif.fr"
RSYNC_OPTS="-rltvh --compress-level=1"
DRYRUN="-n --stats"
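As an alternative to keeping the password in a shell variable, rsync can read it from a file through its --password-file option; rsync refuses a password file that other users can read. The sketch below writes such a file with restrictive permissions. The function name write_password_file and the path $HOME/.rsync_password are illustrative assumptions, not something prescribed by Epos-France.

```shell
# Hypothetical helper: store a password in a file readable only by its owner.
write_password_file() {
    pw_file=$1
    pw_value=$2
    # The subshell keeps the restrictive umask from leaking into the caller.
    ( umask 077 && printf '%s\n' "$pw_value" > "$pw_file" ) || return 1
    # Also tighten permissions in case the file already existed.
    chmod 600 "$pw_file"
}

# Example use, with the placeholder values from STEP 2:
# write_password_file "$HOME/.rsync_password" "xxxx"
# rsync $RSYNC_OPTS --password-file="$HOME/.rsync_password" \
#     "rsync://$RSYNC_USERNAME@$RSYNC_SERVER/$RSYNC_MODULE/" "$DESTINATION"
```

Remember to delete the file once the temporary access has expired.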

STEP 3

Launch a trial transfer (recommended)

rsync $RSYNC_OPTS $DRYRUN "rsync://$RSYNC_USERNAME@$RSYNC_SERVER/$RSYNC_MODULE" "$DESTINATION"

You are now ready to transfer!

STEP 4

Launch full transfer.

Note: running this command again updates your destination directory with files that are new or modified since the last transfer. It does not delete files on your side that no longer exist on the datacentre side.

rsync $RSYNC_OPTS "rsync://$RSYNC_USERNAME@$RSYNC_SERVER/$RSYNC_MODULE/" "$DESTINATION"
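Because re-running the command only fetches what is still missing, a long transfer that gets interrupted can simply be relaunched. The sketch below wraps any command in a retry loop; the function name retry, the number of attempts and the delay are illustrative assumptions and not a recommendation from the datacentre.

```shell
# Hypothetical helper: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Plain POSIX sh has no local variables, so these names are global.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        status=0
        "$@" || status=$?
        [ "$status" -eq 0 ] && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts (exit status $status)" >&2
            return "$status"
        fi
        echo "attempt $attempt failed (exit status $status), retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Example use: up to 5 attempts, 60 seconds apart.
# retry 5 60 rsync $RSYNC_OPTS \
#     "rsync://$RSYNC_USERNAME@$RSYNC_SERVER/$RSYNC_MODULE/" "$DESTINATION"
```

Keep the delay generous so that repeated attempts do not add load to the server.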

Other usage examples

Ask us about more complex use cases, or read the rsync man page: http://rsync.samba.org/ftp/rsync/rsync.html

Listing remote contents (like ‘ls -l’) without transferring:

rsync $RSYNC_OPTS rsync://$RSYNC_USERNAME@$RSYNC_SERVER/$RSYNC_MODULE/

Using shell-style wildcards to transfer only the files you want (quote the source so that your local shell leaves the wildcards for the server to expand):

rsync $RSYNC_OPTS "rsync://$RSYNC_USERNAME@$RSYNC_SERVER/$RSYNC_MODULE/2012/KES0*" "$DESTINATION"
rsync $RSYNC_OPTS "rsync://$RSYNC_USERNAME@$RSYNC_SERVER/$RSYNC_MODULE/2012/*/HHZ.D/*.???" "$DESTINATION"
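Wildcards in the source copy the matching files directly into the destination. If you would rather keep the directory hierarchy while selecting a single channel, rsync filter options (--include, --exclude, --prune-empty-dirs) are another way to do it. The sketch below assumes that file names contain the channel code followed by ".D." (as in SDS-style names such as NET.STA.LOC.HHZ.D.YEAR.DAY); check this against the listing of your module first. The function name sync_channel is hypothetical.

```shell
# Hypothetical helper: sync_channel SOURCE DESTINATION CHANNEL
# Copies only files whose name contains ".CHANNEL.D." and keeps the
# directory hierarchy; directories left empty by the filter are skipped.
sync_channel() {
    rsync -rlt --prune-empty-dirs \
        --include='*/' --include="*.$3.D.*" --exclude='*' \
        "$1" "$2"
}

# Example use:
# sync_channel "rsync://$RSYNC_USERNAME@$RSYNC_SERVER/$RSYNC_MODULE/" "$DESTINATION" HHZ
```

The order of the filter options matters: directories are included first so that rsync can descend into them, then the wanted files, then everything else is excluded.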

Setting options to remove files in your destination directory that no longer exist on the datacentre side (be careful!):

RSYNC_OPTS="$RSYNC_OPTS --delete"
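Before enabling --delete for real, it can be useful to see which local files would be removed. The sketch below combines --delete with --dry-run and --itemize-changes, which marks pending deletions with "*deleting", and prints only those paths. The function name preview_deletions is hypothetical.

```shell
# Hypothetical helper: preview_deletions SOURCE DESTINATION
# Prints the paths that a transfer with --delete would remove from
# DESTINATION. Nothing is transferred or deleted (--dry-run).
# Note: rsync's own exit status is not propagated through the pipe.
preview_deletions() {
    rsync -rlt --dry-run --delete --itemize-changes "$1" "$2" |
        sed -n 's/^\*deleting *//p'
}

# Example use:
# preview_deletions "rsync://$RSYNC_USERNAME@$RSYNC_SERVER/$RSYNC_MODULE/" "$DESTINATION"
```

An empty output means that no local file would be deleted, provided the rsync command itself ran without error.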

Known bugs and limitations

The data is delivered as daily miniSEED files, as an SDS file hierarchy, or as PH5 archives. Finer-grained time-window selection is not possible. Rsync access to restricted data is granted to end users for a limited time. Bandwidth and service availability may be adjusted depending on the overall load on the datacentre computing infrastructure.

Metadata, small datasets (typically under 100 GB of data), and data selected by fine-grained time windows should be downloaded via the FDSN web services.


[1] For example, Cygwin provides a large collection of tools that bring Unix-like features to Windows users. Ask your local IT support.