I've played around quite a bit with an OS X utility called "FileVacuum," which is essentially wget wrapped in an OS X interface. I've concluded that there are both practical and ethical (but not absolute) obstacles to tracking UD with sufficient coverage to reliably capture obliviated threads.
One reliable way of doing this would be to capture successive snapshots of UD frequently enough to catch most changes. Previous copies would need to be renamed/grandfathered and retained. When obliviation is detected, we either refer to our most recent copy or, if that copy already reflects the obliviation, go back a few hours and consult an earlier snapshot.
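A rough sketch of the bookkeeping this would involve, in Python. It is only an illustration under assumptions: the function names and the "snapshot-<timestamp>" naming are made up for the example, and it treats a thread as obliviated simply when a file that existed in an older snapshot is missing from a newer one.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional, Set


def rotate_snapshot(current: Path, archive_root: Path, stamp: Optional[str] = None) -> Path:
    """Grandfather the current snapshot by moving it under a timestamped name."""
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    archive_root.mkdir(parents=True, exist_ok=True)
    target = archive_root / f"snapshot-{stamp}"  # hypothetical naming scheme
    shutil.move(str(current), str(target))
    return target


def list_files(snapshot: Path) -> Set[str]:
    """Relative paths of every regular file inside a snapshot directory."""
    return {p.relative_to(snapshot).as_posix() for p in snapshot.rglob("*") if p.is_file()}


def vanished_files(older: Path, newer: Path) -> List[str]:
    """Files present in the older snapshot but missing from the newer one."""
    return sorted(list_files(older) - list_files(newer))
```

Anything returned by vanished_files(older, newer) is a candidate obliviation, and the copy to consult is the one still sitting in the older snapshot directory.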
The obvious problem with this is that capturing the entire site, which is over 300 MB and thousands of files, takes hours even on a relatively fast cable connection and probably imposes a burden on their bandwidth that I find ethically unacceptable.
Alternatively, after capturing a complete snapshot of UD once, I envision using wget to detect and download only new UD files (passing over files for which local copies exist), working against a copy of the original archive. That newly updated copy is then stored and never updated further. The process would be repeated, say, twice daily, but each time against a fresh copy of the same stored, unchanged original archive. Relative to that archive, any thread created since the archive was grabbed is again "new" and hence downloaded again, this time including whatever comments have been added in the interim. Manually grabbing new copies of the indexes for the current month may also be desirable: an index is a "new file" only once, even as references to new threads are added to it, so it has to be refreshed by hand. This process would catch early copies of each thread, and the successive updated copies of UD are retained and consulted when needed. †
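For what it's worth, here is roughly how one pass of that scheme might look in Python. Treat it as a sketch under assumptions, not a tested recipe: SITE_URL is a placeholder, the function names and the "update-<stamp>" naming are invented for the example, and it assumes the baseline was originally grabbed with the same wget directory layout so that existing files line up. The wget options used (--recursive, --no-parent, --no-clobber, --wait, --directory-prefix) are standard ones; --no-clobber is what makes wget skip files that already exist locally.

```python
import shutil
from pathlib import Path
from typing import List

SITE_URL = "https://example.org/"  # hypothetical placeholder, not the real site address


def prepare_working_copy(baseline: Path, updates_root: Path, stamp: str) -> Path:
    """Copy the never-modified baseline; wget adds new files to the copy only."""
    working = updates_root / f"update-{stamp}"  # hypothetical naming scheme
    shutil.copytree(baseline, working)
    return working


def wget_new_files_command(working: Path, url: str = SITE_URL, wait_seconds: int = 2) -> List[str]:
    """Build (but do not run) a wget command that fetches only files missing locally."""
    return [
        "wget",
        "--recursive",                    # follow links through the site
        "--no-parent",                    # stay below the starting URL
        "--no-clobber",                   # skip files that already exist locally
        f"--wait={wait_seconds}",         # pause between requests to go easier on the server
        f"--directory-prefix={working}",  # download into the working copy
        url,
    ]
```

The returned list can be handed to subprocess.run(cmd, check=True). Because the baseline itself is never touched, every pass starts from the same state, which is what makes threads created since the original grab count as "new" each time.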
However, I found that even the process of using wget to compare local and server copies of their site and downloading only new files, while using much less bandwidth, is very time consuming - again taking several hours. That repeated interrogation of UD - which requires that every file on their server be checked against the local copy (with no transfer when a local copy is found) - still must be very demanding upon their bandwidth and disk activity, and I wonder if that doesn't go over the line ethically.
That said, I don't know enough about the resources that process demands to make that judgement. Others may also draw their ethical lines in different places, so I'd be interested in your thoughts on either topic (or on other approaches that did not occur to me).
Really, the whole idea seems like more trouble than it is worth.
Myth: Something that never was true, and always will be.
"The truth will set you free. But not until it is finished with you."
- David Foster Wallace
"Here's a clue. Snarky banalities are not a substitute for saying something intelligent. Write that down."
- Barry Arrington