
I have implemented an rsync-based system to move files between different environments.

The problem I'm facing now is that sometimes there are files with the same name but different paths and contents.

I want to make rsync rename duplicated files (if possible), because I need and use the --no-relative option.

Duplicated files can occur in two ways:

  • There was already a file with the same name in the destination directory.
  • The same rsync execution transfers files with the same name from different locations, e.g. dir1/file.txt and dir2/file.txt.

Adding the -b --suffix options lets me survive at least one repetition, but only for the first type of duplicate mentioned.

    A minimal example (for Linux-based systems):

    mkdir sourceDir1 sourceDir2 sourceDir3 destDir;
    echo "1" >> sourceDir1/file.txt;
    echo "2" >> sourceDir2/file.txt;
    echo "3" >> sourceDir3/file.txt;
    rsync --no-relative sourceDir1/file.txt destDir
    rsync --no-relative -b --suffix="_old" sourceDir2/file.txt sourceDir3/file.txt destDir
    
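    Whatever rsync does with the in-run duplicate, there is only one backup slot per name, so destDir ends up with two files for three different contents; one version is silently lost:

    $ ls destDir
    file.txt  file.txt_old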

    Is there any way to meet these requirements?

    @tripleee I also think so, but the requirement is clear ("I need and use the --no-relative option"), so I thought of a work-around and posted it. – Fravadona Sep 20, 2022 at 8:08

    Yep, it's a hard requirement. The system creates 1M files per day in a large hierarchical structure which must stay private. – Ray Sep 20, 2022 at 10:32

    I don't think that you can do it directly with rsync.

    Here's a work-around in bash that does some preparation work with find and GNU awk and then calls rsync afterwards.

    The idea is to categorize the input files by "copy number" (for example, sourceDir1/file.txt would be copy #1 of file.txt, sourceDir2/file.txt copy #2, and sourceDir3/file.txt copy #3) and to generate one list file per "copy number" containing all the files in that category. Then you just have to launch one rsync per category with --files-from and a customized --suffix.
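    For the three-directory example from the question, the preparation step would produce something like this (a hypothetical sketch of the temporary directory created in step 1; the real list files are NUL-delimited):

    $tmpdir/1   contains   sourceDir1/file.txt
    $tmpdir/2   contains   sourceDir2/file.txt
    $tmpdir/3   contains   sourceDir3/file.txt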

    Pros:

  • fast: incomparably faster than firing one rsync per file.
  • safe: it won't ever overwrite a file (see step #3 below).
  • robust: handles any filename, even ones with newlines in them.

    Cons:

  • the destination directory has to be empty (or else it might overwrite a few files).
  • the code is a little long (and I made it longer by using a few process substitutions and by splitting the awk call in two).

    Here are the steps:

    0)   Use a correct shebang for bash on your system.

    #!/usr/bin/env bash
    

    1)   Create a directory for storing the temporary files.

    tmpdir=$( mktemp -d ) || exit 1
    
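    Optionally (a small sketch using standard bash, not part of the original steps), you can register the clean-up from step 5 right away, so the temporary directory is removed even if a later step fails:

    trap 'rm -rf "$tmpdir"' EXIT    # run the step-5 clean-up on any exit, normal or not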

    2)   Categorize the input files by "duplicate number", generate the list files for rsync --files-from (one per dup category), and get the total number of categories.

    read filesCount < <(
        find sourceDir* -type f -print0 |
        LANG=C gawk -F '/' '
            BEGIN {
                RS = ORS = "\0"
                tmpdir = ARGV[2]
                delete ARGV[2]
            }
            {
                # $NF is the basename; the 1st occurrence of a basename goes
                # to list 1, the 2nd occurrence to list 2, etc.
                id = ++seen[$NF]
                if ( ! (id in outFiles) ) {
                    outFilesCount++
                    outFiles[id] = tmpdir "/" id
                }
                print $0 > outFiles[id]
            }
            END {
                printf "%d\n", outFilesCount
            }
        ' - "$tmpdir"
    )
    
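    To sanity-check the generated lists, you can dump them in a readable form (a debugging sketch; tr turns the NUL separators into newlines):

    for f in "$tmpdir"/*; do
        printf '%s:\n' "$f"
        tr '\0' '\n' < "$f"
    done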

    3)   Find a unique suffix (generated using a given set of chars) for rsync --suffix: the string from this step gets appended to _old<number>_ in step 4, and it is chosen so that no transferred filename ends with it. Note: you can skip this step if you know for sure that no existing filename ends with _old plus a number.

    (( filesCount > 0 )) && IFS='' read -r -d '' suffix < <(
        LANG=C gawk -F '/' '
            BEGIN {
                RS = ORS = "\0"
                charsCount = split( ARGV[2], chars, "/" )
                delete ARGV[2]
                for ( i = 1; i <= 255; i++ )
                    ord[ sprintf( "%c", i ) ] = i
            }
            {
                # if this basename ends with the suffix built so far, prepend
                # a char that differs from the one preceding it (substr and
                # the array from split() are 1-based, hence the +1s)
                l0 = length($NF)
                l1 = length(suffix)
                if ( substr( $NF, l0 - l1 + 1, l1 ) == suffix ) {
                    n = ord[ substr( $NF, l0 - l1, 1 ) ]
                    suffix = chars[ (n + 1) % charsCount + 1 ] suffix
                }
            }
            END {
                print suffix
            }
        ' "$tmpdir/1" '0/1/2/3/4/5/6/7/8/9/a/b/c/d/e/f'
    )
    
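    If you want to see what was found (a trivial check; the variable may be empty when no filename conflicted):

    printf 'suffix: %s\n' "$suffix"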

    4)   Run the rsync(s).

    for (( i = filesCount; i > 0; i-- ))
    do
        fromFile=$tmpdir/$i
        rsync --no-R -b --suffix="_old${i}_$suffix" -0 --files-from="$fromFile" ./ destDir/
    done
    
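    With the three-directory example from the question, and assuming find listed sourceDir1 first, destDir should end up looking like this, <suffix> being whatever step 3 computed:

    $ ls destDir
    file.txt  file.txt_old1_<suffix>  file.txt_old2_<suffix>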

    5)   Clean up the temporary directory.

    rm -rf "$tmpdir"

    It's everything I wanted. I'm not used to working with bash or awk and couldn't research properly to make this on my own... I had to make some adaptations in order to merge it into my system. Maybe I'll introduce a final step to remove the hash introduced in step 3 and recalculate the duplicate id in each file. – Ray Sep 20, 2022 at 10:47
    

    I guess it's not possible with rsync alone. You have to make a list of the files first and analyze it to work around dupes. Take a look at this command:

    $ rsync --no-implied-dirs --relative --dry-run --verbose sourceDir*/* dst/
    sourceDir1/file.txt
    sourceDir2/file.txt
    sourceDir3/file.txt
    sent 167 bytes  received 21 bytes  376.00 bytes/sec
    total size is 6  speedup is 0.03 (DRY RUN)
    

    Let's use it to create a list of the source files:

    mapfile -t list < <(rsync --no-implied-dirs --relative --dry-run --verbose sourceDir*/* dst/)
    

    Now we can loop through this list with something like this:

    declare -A count
    for item in "${list[@]}"; {
        [[ $item =~ ^sent.*bytes/sec$ ]] && break   # stop at rsync's summary line
        [[ $item ]] || break                        # the file list ends at the first blank line
        fname=$(basename "$item")
        echo "$item dst/$fname${count[$fname]}"
        ((count[$fname]++))
    }

    Output:

    sourceDir1/file.txt dst/file.txt
    sourceDir2/file.txt dst/file.txt1
    sourceDir3/file.txt dst/file.txt2
    

    Change the echo to an rsync call and that's it.
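    For instance, the echo line could become something like this (a sketch; note that it fires one rsync per file, which is the performance drawback mentioned in the comment below):

    rsync "$item" "dst/$fname${count[$fname]}"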

    Thanks for taking the time to write an answer. I cannot use that implementation because it means one rsync per file, and my system moves large amounts of files, so performance would degrade severely. – Ray Sep 19, 2022 at 14:53
