Recently I needed to sift through a client’s website to figure out which of their media files were used and which weren’t. They had a lot of unused media they wanted to get rid of, but they didn’t want to mistakenly delete something that was still being used somewhere.
So, I had the idea to use wget for this. With wget you can crawl an entire website and save the output HTML into a neat little folder, which can then be grepped. Sounded like a good solution to me.
Specifically, I used
wget -m -e robots=off http://site.com
The -m option stands for ‘mirror’; it’s shorthand for -r -N -l inf --no-remove-listing, a set of options intended to crawl and copy an entire website. I also turned off robots.txt handling with -e robots=off. wget adheres to robots.txt by default, but for my purposes I needed an as-complete-as-possible download of the site, so I disabled it.
From that point I wrote some bash scripts to do the actual grepping, and that was that.
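The scripts themselves aren’t shown here, but the core idea can be sketched in a few lines of bash: for each media filename in a list, grep the mirrored HTML and flag anything with no match as a deletion candidate. The paths and filenames below are made up for illustration; the demo builds its own tiny “mirror” so it runs anywhere.

```shell
#!/bin/sh
# Build a tiny stand-in for the wget mirror (hypothetical paths):
# one HTML page referencing logo.png, and a list of two media files.
mkdir -p demo/site.com
printf '<img src="/uploads/logo.png">\n' > demo/site.com/index.html
printf 'logo.png\nold-banner.jpg\n' > demo/media-list.txt

# For each filename, search the mirror recursively (-r), quietly (-q),
# treating the name as a fixed string (-F) rather than a regex.
while read -r f; do
  if grep -rqF "$f" demo/site.com; then
    echo "USED:   $f"
  else
    echo "UNUSED: $f"
  fi
done < demo/media-list.txt
```

Running this prints `USED:` for logo.png and `UNUSED:` for old-banner.jpg; the `UNUSED:` lines are the ones safe to review for deletion. One caveat worth noting: grep only finds names that appear literally in the saved HTML, so media referenced from CSS, JavaScript, or database-driven code that wget never rendered can still slip through.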