2025-02-19 - Best Practices for Moving Data to Your New Project Folder

Created by Bent Petersen, Modified on Wed, 19 Feb at 2:11 PM by Bent Petersen

Dear all, 

Many of you are asking me how to most efficiently move data from the old project directory to your new one, and how to organize yourself as a group in the new project directory. 

Understanding the Project Folder Structure

Each project folder follows this structure:

/projects/{project_name}/
├── apps/      shared software and environments
├── data/      shared project data
├── people/    personal folders, one per user (people/ku-ID/)
└── scratch/   temporary files (not backed up)

1. Where to Store Your Data?
  • Shared software & environments → Store shared Conda environments, custom scripts, and compiled software in /projects/{project_name}/apps/.
  • Shared project data → Use /projects/{project_name}/data/ for large datasets, reference files, or results that multiple users need access to, so you avoid keeping multiple copies.
  • Private user data → Keep your personal working files and data inside /projects/{project_name}/people/ku-ID/.
  • Temporary data (for short-term use) → Use /projects/{project_name}/scratch/ for intermediate files that do not require long-term storage. Scratch data is not backed up, and older files may be deleted automatically to free space.
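
Before deciding what goes where, it helps to see where your space is actually used. A small sketch with standard tools (the temp-dir setup below only stands in for your real folder, e.g. under /projects/mjolnir1/people/):

```shell
# Demo tree in a temp dir; on the cluster, point SRC at your real folder instead.
SRC=$(mktemp -d)
mkdir -p "$SRC/raw" "$SRC/results"
head -c 1048576 /dev/zero > "$SRC/raw/reads.fastq"   # 1 MiB dummy file
head -c 4096 /dev/zero > "$SRC/results/table.tsv"    # 4 KiB dummy file

# Disk usage per subdirectory, largest first
du -sh "$SRC"/* | sort -rh

# Files above a size threshold: candidates for compression or cleanup
find "$SRC" -type f -size +500k
```

Running the `du` and `find` lines on your real directory shows at a glance which data is worth compressing or deleting before the move.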


2. Optimizing Storage: Compressing Large Data Efficiently

Before transferring large datasets, it is best to compress old, uncompressed data to save space, reduce storage costs, and shorten transfer time.

Instead of using standard gzip, which runs on a single CPU core, use pigz (parallel gzip) for faster compression by utilizing multiple threads. 

Remember to run multithreaded processes only by submitting a job to the queue, never directly on the login node.


2.1. Compressing a Single File with pigz

pigz -p 8 -9 large_file.txt

    •    -p 8 → Uses 8 CPU cores (adjust based on need).

    •    -9 → Uses the highest compression level.

This creates large_file.txt.gz.
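
A quick round trip on a throwaway file (assuming pigz is on your PATH; like gzip, pigz replaces the input file with the compressed one):

```shell
cd "$(mktemp -d)"
printf 'hello\n' > large_file.txt

pigz -p 8 -9 large_file.txt    # creates large_file.txt.gz, removes the original
pigz -t large_file.txt.gz      # -t tests archive integrity (silent on success)
pigz -d large_file.txt.gz      # decompresses back to large_file.txt
```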


2.2. Compressing Entire Directories with tar + pigz

For directories, create a tarball and compress it using pigz:

tar cf - large_directory/ | pigz -p 8 -9 > large_directory.tar.gz

  • This method compresses multiple files at once while using multiple threads.
  • The output is a compressed tarball (.tar.gz).
  • Creating a tarball takes time, especially if you have many directories and files.

To extract later, use:

tar xvf large_directory.tar.gz
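
pigz can also handle the decompression step, which is typically faster than plain gzip. A self-contained round-trip sketch on a demo directory (substitute your real paths):

```shell
WORK=$(mktemp -d); cd "$WORK"
mkdir large_directory
echo "example data" > large_directory/sample.txt

# Compress as above, then remove the original directory
tar cf - large_directory/ | pigz -p 8 -9 > large_directory.tar.gz
rm -r large_directory

# Extract, with pigz doing the decompression and tar the unpacking
pigz -dc large_directory.tar.gz | tar xf -
```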


2.3. Submitting Compression as a SLURM Job

Since compression can be CPU-intensive, you should submit it as a job instead of running it on the login node.
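
A minimal job script might look like the sketch below. The job name, CPU count, memory, and wall time are placeholders; adjust them to your data, and add partition or account options if your cluster requires them.

```shell
#!/bin/bash
#SBATCH --job-name=compress
#SBATCH --cpus-per-task=8
#SBATCH --mem=8G
#SBATCH --time=04:00:00

# Use exactly as many pigz threads as the job was granted
tar cf - large_directory/ | pigz -p "$SLURM_CPUS_PER_TASK" -9 > large_directory.tar.gz
```

Save it as, for example, compress.sbatch and submit it with sbatch compress.sbatch.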


3. Best Practices for Transferring Data

To ensure a smooth transition, DO NOT use mv, scp, or cp, as they lack error checking and cannot resume interrupted transfers. 

Instead, use rsync, which offers better control, efficiency, and reliability.


3.1. Moving Data and Deleting Old Files Automatically

To move your data and remove the original files after transfer, use:

rsync -avh --progress --remove-source-files /projects/mjolnir1/people/KU-ID/yourdata /projects/{project_name}/people/KU-ID/yourdata

    •    -a → Archive mode: preserves file permissions, timestamps, symbolic links, etc.

    •    -v → Verbose mode: lists each file as it is transferred.

    •    -h → Displays human-readable file sizes.

    •    --progress → Shows real-time transfer progress.

If a transfer gets interrupted, rerun the same rsync command—it will only copy missing or incomplete files instead of restarting from scratch.


Important Notes:

    •    This rsync command removes files from the original location only after a successful transfer.

    •    Directories are NOT deleted, so you may need to clean them up manually:

find /projects/mjolnir1/people/ku-ID/your_data/ -type d -empty -delete

The command above deletes all empty directories left behind after the file transfer.


Important Note: Do NOT Store Project Data in Your Home Directory (/home/ku-ID/)

Your home directory (/home/ku-ID/) has a strict 100GB quota and is NOT meant for storing project data.

Your home directory should only be used for:

  • Personal scripts or configurations (e.g., .bashrc, .vimrc).
  • Small temporary files (but not large datasets).
  • Software environments that don’t belong in a shared project folder. Note that software environments can grow large, so it is recommended to store them in your /projects/{project_name}/people/ku-ID/ folder instead.

Summary of Best Practices

  • Use rsync instead of mv, scp, or cp to ensure error checking and resuming capabilities.
  • Delete unnecessary files before moving data to save storage and backup costs.
  • Compress large files before transferring using pigz for multi-threaded compression.
  • Use /data/ for shared project data, /people/ for personal files, and /scratch/ for temporary files.
  • Submit CPU-intensive compression jobs to SLURM instead of running them on the login node.
  • After moving data, clean up empty directories with find -type d -empty -delete.


I hope this helps. 

Best regards,
Bent
