Wget an entire FTP folder from its index (RegEx Introduction)

Hi, folks!

Just a couple of hours ago I was trying to download all the files in a folder on the OSUOSL FTP Slackware mirror with wget, and all I kept getting was the index.html file from the page, so I decided to write a little script to download any file linked in the index. I'm sure there are tools which can do this far more succinctly, but I thought this would be a good way to begin to explain the incredibly useful nature of regular expressions. Here's how my script turned out...

for i in $(wget ftp://ftp.mirrorservice.org/sites/ftp.slackware.com/pub/slackware/slackware-13.0/source/a/tar/ -O - | grep "ftp://" | sed 's/^.*href=\"//g' | sed 's/\".*$//g'); do
    wget "$i"
done

Now let me break it down... The first command I wanted was one to download the index.html file and extract the necessary link data from its content. To download a file and stream its contents to another command instead of saving it to disk, use the wget syntax:

wget {URL} -O - | {COMMAND}

First of all I piped the file's contents to grep, which drops any line that does not contain the phrase "ftp://". This ensures that we are only working with lines which contain a hyperlink to a file, ignoring all the extraneous HTML tags. The next step was to remove the surrounding HTML from the links. Each link in this index is preceded by <a href=". To remove this part, I used sed. There are other tools which would work in a similar manner, but I find sed to be a great way of learning regular expressions, and I find its syntax very easy to understand. The command to remove anything up to and including the href=" is as follows:

sed 's/^.*href=\"//g'

To anybody who doesn't understand regular expression syntax, this looks like a jumble of characters. I'll explain briefly how it works... Sed's syntax for basic search and replace is as follows:

sed 's/{REG EXP OR TEXT TO SEARCH FOR}/{TEXT TO REPLACE WITH}/g'
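
As a quick illustration of that syntax, here is a minimal sketch that swaps one word for another. The function name replace_cat and the sample sentence are made up for this example and are not part of my script.

```shell
# Replace every occurrence of "cat" with "dog" on each input line.
# The trailing g means "all matches on the line", not just the first.
replace_cat() {
    sed 's/cat/dog/g'
}

echo "cat sat next to another cat" | replace_cat
# prints: dog sat next to another dog
```

Without the trailing g, only the first "cat" on the line would be replaced.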

The regexp to match in our example is ^.*href=\"

^ means "From the beginning of the line".
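.* is a wildcard: . matches any single character and * repeats it zero or more times, so together they match any sequence of characters, as long a run as possible.
`href=\"` describes the exact text string we want to have as the final characters in the match. The `\` is an escape character to force the `"` to be treated as a literal character. sed does not give `"` any special meaning, so the match works the same without it.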

In our command, the second part of the sed command is empty. This means that any text which matches the regexp will just be removed. On any line which contains href=", the regexp will match from the beginning of the line up to and including the href=" itself.
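
To see this step on its own, here is a small sketch run against canned input rather than a live download. The host example.org, the sample lines and the function name strip_prefix are placeholders I have assumed for the demonstration.

```shell
# A made-up line in the style of the index page wget builds for an FTP
# directory; the host, path and size are placeholders.
line='  File  <a href="ftp://example.org:21/pub/tar/rmt.8.gz">rmt.8.gz</a> (2429 bytes)'

# Delete everything from the start of the line up to and including href="
strip_prefix() {
    sed 's/^.*href=\"//'
}

printf '%s\n' "$line" | strip_prefix
# prints: ftp://example.org:21/pub/tar/rmt.8.gz">rmt.8.gz</a> (2429 bytes)

# .* grabs as much as it can, so with two links on one line the match
# runs up to the LAST href=" and only the second link survives.
two_links='<a href="a">x</a> <a href="b">y</a>'
printf '%s\n' "$two_links" | strip_prefix
# prints: b">y</a>
```

The second case is worth keeping in mind: this approach only suits pages with one link per line, which is what the index in this post looks like.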

Now each line processed will read something along the lines of:

ftp://ftp.mirrorservice.org:21/sites/ftp.slackware.com/pub/slackware/slackware-13.0/source/a/tar/rmt.8.gz">rmt.8.gz</a> (2429 bytes)

The second use of sed will remove anything occurring after the filename. It reads like so:

sed 's/\".*$//g'

This is used in the same way as the previous use of sed. The regexp to match is \".*$.

$ means "End of line", so this matches everything from the first occurrence of " up to the end of each line. The output should now be nothing but a list of links. The final part is to wrap the output in a loop, and hand each line to wget.
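
Putting the pieces together, here is a rough dry-run sketch of the whole pipeline that can be tried without touching the network. The sample_index function stands in for the wget -O - call, and its host and file names are placeholders, not a real mirror. extract_links is a name I have made up for the grep and sed chain. The loop prints each wget command instead of running it.

```shell
# Stand-in for the index page; in real use this text would come from
# wget {URL} -O - instead. Host and file names are placeholders.
sample_index() {
    cat <<'EOF'
<html><body><h1>Index of /pub/tar</h1>
<pre>
  File  <a href="ftp://example.org:21/pub/tar/rmt.8.gz">rmt.8.gz</a> (2429 bytes)
  File  <a href="ftp://example.org:21/pub/tar/tar.SlackBuild">tar.SlackBuild</a> (4017 bytes)
</pre></body></html>
EOF
}

# grep keeps the link lines; the two sed calls trim each line to the URL.
extract_links() {
    grep 'ftp://' | sed 's/^.*href=\"//' | sed 's/\".*$//'
}

# Read one URL per line and print the command rather than running it.
sample_index | extract_links | while IFS= read -r url; do
    echo wget "$url"
done
# prints:
# wget ftp://example.org:21/pub/tar/rmt.8.gz
# wget ftp://example.org:21/pub/tar/tar.SlackBuild
```

Reading the list with while read handles one URL per line and keeps each one intact, which is a little more forgiving than splitting the output of a command substitution in a for loop. Dropping the echo turns the dry run into a real download.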

Anyway, I hope this has been informative, and I'll no doubt post some more soon!

n00b