Crosspost from /r/debian: link; someone suggested trying here too, i guess for a broader audience.
basic setup
somewhere in the cloud, i have an enormous bittorrent seedbox i use to download and share (legal) recordings of live concerts of my favorite band. for an example tracker that allows this, see http://bt.etree.org. the seedbox runs a debian derivative. i have root access but do not get to pick which distro is used.
here at home, i maintain an enormous RAID6 array that holds my collection. presently, the organized-and-filed portion of the collection runs to 4.3TB. the RAID server runs jessie. obviously, i have root here too. an upgrade to stretch is acceptable.
problem i'm trying to solve
recordings appear on the seedbox automagically thanks to an RSS feed scraper picking up and downloading stuff. however, there is presently no way for me to check whether stuff that has appeared on the seedbox already exists in my collection.
if a duplicate appears, i want to skip downloading said duplicate from the seedbox to the local server. i am not trying to weed out duplicate torrents at the RSS stage. i only want to weed out duplicates at the download-to-home step.
the file sizes (and my shitty net connection) are such that computing and comparing hashes before download is preferable to comparing them after download. that is, it would be faster to do the math and then download only what's new than to download everything and then do the math.
also, file names and paths are not consistent even for identical files, so they can't be relied on to weed out duplicates. even a complicated date-aware script that understands my date-based filing system would not be sufficient if it still relied on rsync's filename matching.
solution ideas
i'd like a tool that builds an index of hashes of the archived files -- sha3, for example -- and then compares those hashes against hashes of the incoming files on the seedbox. files without a match would be downloaded, for example via an include or exclude list passed by script to rsync.
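something like this is what i have in mind for the index-building side -- just a sketch, using sha256sum from coreutils as a stand-in for sha3 (i don't think stock jessie ships a sha3 tool), with made-up paths and file extensions:

    #!/bin/bash
    # build the hash index of the collection's sound files (paths are placeholders)
    COLLECTION=/srv/raid/concerts
    INDEX=/var/local/collection-hashes.txt

    # hash only the audio files; the accompanying text files get mangled in transit
    find "$COLLECTION" -type f \( -iname '*.flac' -o -iname '*.shn' \) -print0 \
        | xargs -0 -r sha256sum > "$INDEX"

    # park a copy of the index in a fixed location on the seedbox
    scp "$INDEX" seedbox:/home/me/collection-hashes.txt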
the index would ideally be updatable as stuff is added to or reorganized in the collection, such as via a daily cron job. the cron job would create/update the index and then dump the hash file into a fixed location on the seedbox. i would then manually invoke the comparison/download script when i'm ready to do a download batch. complete recreation of the hash file at every run is seriously sub-optimal; hashing 4+ TB of data on a regular basis is computationally intensive and will flog the crap out of my drives.
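to avoid re-hashing the whole 4.3TB every night, i imagine the cron job could hash only paths not yet in the index and prune entries for files that have moved or gone away -- again just a sketch with the same made-up paths as above; a reorganized file gets re-hashed once under its new path:

    #!/bin/bash
    # incremental index update, suitable for a daily cron job (paths are placeholders)
    COLLECTION=/srv/raid/concerts
    INDEX=/var/local/collection-hashes.txt
    touch "$INDEX"

    # sound files currently on disk vs. paths already in the index
    find "$COLLECTION" -type f \( -iname '*.flac' -o -iname '*.shn' \) | sort > /tmp/on-disk.txt
    cut -d' ' -f3- "$INDEX" | sort > /tmp/indexed.txt

    # hash only the paths that are new to the index
    comm -23 /tmp/on-disk.txt /tmp/indexed.txt | tr '\n' '\0' \
        | xargs -0 -r sha256sum >> "$INDEX"

    # prune entries whose files have moved or vanished
    # (sha256sum lines are 64 hex chars plus two separator chars, so ${line:66} is the path)
    while IFS= read -r line; do
        [ -e "${line:66}" ] && printf '%s\n' "$line"
    done < "$INDEX" > "$INDEX.new" && mv "$INDEX.new" "$INDEX"

    # push the refreshed index to the seedbox
    scp "$INDEX" seedbox:/home/me/collection-hashes.txt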
to give you a feel for the scale, the collection is 4.3TB and its du -ha output runs over 108,000 lines. a data transfer from the seedbox might cover 400GB, with du -ha output in excess of 10,000 lines.
the question
can anybody point me to a tool that might help me accomplish at least the heavy lifting here -- the hashing and the database/sorting side? i can kinda fumble my way through very simple bash scripting to put it together, and i'm not afraid of remote execution of ssh commands in this context.
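for what it's worth, here's roughly the shape i picture the comparison/download step taking, assuming the index built above and made-up hostnames/paths -- ssh does the hashing on the seedbox, awk drops anything whose hash is already in the index, and the survivors go to rsync as a files-from list:

    #!/bin/bash
    # manual comparison/download step -- run from home when ready to pull a batch
    SEEDBOX=seedbox                          # placeholder hostname
    REMOTE_DIR=/home/me/finished             # where the RSS scraper drops completed torrents
    INDEX=/var/local/collection-hashes.txt   # local copy of the collection index
    DEST=/srv/raid/incoming                  # placeholder landing directory

    # hash the candidate sound files on the seedbox (cheap next to downloading them)
    ssh "$SEEDBOX" "cd $REMOTE_DIR && find . -type f \( -iname '*.flac' -o -iname '*.shn' \) -print0 \
        | xargs -0 -r sha256sum" > /tmp/remote-hashes.txt

    # keep only remote files whose hash is NOT already in the collection index
    awk 'NR==FNR { seen[$1]=1; next } !($1 in seen)' "$INDEX" /tmp/remote-hashes.txt \
        | sed 's/^[0-9a-f]\{64\}  //' > /tmp/wanted.txt

    # pull just those files
    rsync -av --files-from=/tmp/wanted.txt "$SEEDBOX:$REMOTE_DIR/" "$DEST/"

since only the sound files get checked, one refinement would be to collapse /tmp/wanted.txt down to its parent directories (dirname plus sort -u) and feed those to rsync instead, so the accompanying text files come along with any show that gets pulled.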
thanks!
edit: after thinking further, it occurs to me that a by-directory hash is not a solution. sometimes the text files that accompany the sound get mangled, such as CR/LF pairs being stripped to CR only. this would produce different fingerprints; the only reliable check is against the sound files individually.