lvrfy: An HTML Link Verifier

Version 1.6

28 November 1995

lvrfy is a script that verifies all the internal links in HTML pages on your server. Its operation is rather simple: it starts with one page, parses out all the links (including inline images), and then recursively checks each of them.

Its greatest shortcoming is that it is slow. It averaged 7.5 seconds per file on our server; with 1584 files, that adds up to over 3 hours. I was told that it can process 10000 links to 4000 pages in about 1.5 hours on a Sparc 1000 with dual 75MHz CPUs.

Source

This is a regular shell script. Just make it executable, and you're ready to go. It assumes that the following programs are in your path: sed, awk, csh, touch, and rm.
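For example, assuming the script is saved as 'lvrfy' in the current directory:

        chmod +x lvrfy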

Note the copyright. If you make major changes, or wish to distribute a modified version, you need to contact me.

Command-Line Execution

	lvrfy startURL fromURL OKfile BADfile OFFSITEfile
The parameters, in order, are as follows; a complete example invocation follows the descriptions.
startURL
The page to verify. This is a partial URL, in that the server name and protocol are not listed. If you wanted to start with "http://www.cs.dartmouth.edu/~crow/" you would use "/~crow/" as the first parameter.
fromURL
This is used in the recursive call. When you invoke the script yourself with a valid startURL, any placeholder text will do.
OKfile
This is the file in which pages that have been successfully found are listed. Any pages listed in this file are assumed to have already been processed. Experienced users may edit this file to prune portions of the server's pages from verification. Inexperienced users should ensure that this file does not initially exist, to ensure a complete scan of the document tree. You may not specify /dev/null for this parameter.
This file may also be used to find files that are not reachable from the starting page, by comparing it with the output of "find ROOT -name '*.*' -print". Sort both lists, grep out user directories, and then 'diff' them (a sketch appears under Results below). Symbolic links may confuse the results.
BADfile
This is the file in which broken links are recorded. If it already exists, the results will be appended, so in general it should not initially exist. Note that upon completion, some entries may be duplicated if there are multiple copies of the bad link. You may specify /dev/null for this parameter, though that would be silly.
OFFSITEfile
This is the file in which HTTP links to other servers are recorded. No verification of these links is attempted, though it would be trivial to 'awk' this file into a script that runs lynx to attempt loading each page (a sketch appears under Results below). Note that entries are not necessarily unique; the number of entries for a URL indicates the number of links to it. You may specify /dev/null for this parameter.
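For example, a complete run starting from "/~crow/" as in the description above, recording results in hypothetical files under /tmp (any writable locations will do):

        rm -f /tmp/lvrfy.ok        # make sure the scan starts from scratch
        ./lvrfy /~crow/ start /tmp/lvrfy.ok /tmp/lvrfy.bad /tmp/lvrfy.offsite

Here 'start' is only a placeholder for fromURL, as described above.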

Customization

You will probably want to customize several variables within the script, such as the 'SLASH' and 'PUBLIC' variables referenced under Results below, which control how URLs are mapped onto filesystem paths.
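As a rough illustration only (the variable names come from the sample OKfile entries below; the script's actual set of variables and its syntax may differ), the sample output suggests settings along these lines:

        # Hypothetical values inferred from the sample OKfile entries below;
        # check the comments at the top of the script for the real variables.
        SLASH=/usr/moosilauke/www/docs    # filesystem path corresponding to "/"
        PUBLIC=public_html                # per-user directory behind "~user" URLs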

Results

If everything works correctly, the script produces no output on stdout or stderr, so watch for any output that indicates an error. The output files may contain the following types of entries:

OKfile

/usr/moosilauke/www/docs/index.html /~crow/
/usr/moosilauke/www/docs/images/dartgreeting3a.gif /
/usr/toe/grad/crow/public_html/lego/empire.html /~crow/
These entries reflect the files /, /images/dartgreeting3a.gif, and /~crow/lego/empire.html, given correct values for the 'SLASH' and 'PUBLIC' variables. The second field on each line is the URL of the page from which that file was first found.
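As a sketch of the unreachable-file comparison mentioned under OKfile above (file names follow the hypothetical invocation in the Command-Line Execution section; adjust ROOT and the grep pattern for your site):

        # Everything under the document root, versus everything lvrfy reached.
        ROOT=/usr/moosilauke/www/docs
        find "$ROOT" -name '*.*' -print | sort > /tmp/all.files
        # The first field of each OKfile line is the filesystem path that was
        # checked; grep out user directories, which live outside ROOT.
        awk '{print $1}' /tmp/lvrfy.ok | grep -v public_html | sort -u > /tmp/seen.files
        diff /tmp/all.files /tmp/seen.files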

BADfile

Link to non-existant page /~samr/CS68.95W/Homework/ from /~samr/CS68.95W/
Link to non-existant page /TR/Search/ from /tr/home.html
Link to unreadable page /~cowen/schedule.html from /~cowen/homepage.html
Link to unreadable page /~cowen/schedule.html from /~cowen/
Link to server-generated index page /~rajendra/News/oldnews/ from /~rajendra/News/News.html
Here, we have five bad links. The first one really is bad. The second one is due to the failure of lvrfy to recognize that /TR/ is an aliased directory, so the link is valid. The third entry indicates that the file wasn't readable by the user who ran lvrfy (you should run it as the same user that your server executes as, if possible)--in this case it was intentional. The fourth entry indicates that 'homepage.html' is the same as 'index.html,' thanks to the magic of symbolic links. The fifth entry is more of a warning than an error.

OFFSITEfile

http://www.dartmouth.edu/ /usr/windsor2/www/docs/phd_program.html
http://www.house.gov/ /u/crow/public_html/~crow/index.html
http://legowww.homepages.com/text/FAQ.html /u/crow/public_html/~crow/index.html
These are just standard URLs and the file in which each was found.
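As suggested under OFFSITEfile above, this file can be fed to lynx for a crude off-site check. A minimal sketch, assuming the hypothetical /tmp/lvrfy.offsite from the earlier example; note that lynx only fails visibly on hard connection errors, so a server that returns an error document will still appear to load:

        # Try to fetch each unique off-site URL and report the ones that
        # lynx could not retrieve at all.
        awk '{print $1}' /tmp/lvrfy.offsite | sort -u |
        while read url
        do
            lynx -dump "$url" > /dev/null || echo "Failed: $url"
        done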

How it Works

The code is rather hard to read--it really should have been a single perl program, but I haven't learned perl yet. Instead, it mostly uses sed to parse each HTML document and extract all the links. The extracted links are eventually converted into a shell script, which is run to check each one. This results in a depth-first search, leaving behind a sleeping process for each level in the search.
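The sed in the script itself is considerably more involved, but the core of the extraction step looks something like this simplified sketch (not the script's own code):

        # Pull HREF and SRC attributes out of a page, one link per line.
        # Real pages are messier (mixed case, attributes split across lines,
        # several links on one line), which is why the real sed is hard to read.
        sed -n \
            -e 's/.*[Hh][Rr][Ee][Ff]="\([^"]*\)".*/\1/p' \
            -e 's/.*[Ss][Rr][Cc]="\([^"]*\)".*/\1/p' \
            page.html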

The recursion is halted if the nesting level gets too large. The pending checks are then processed by the top level. I believe that setting the maximum nesting level to 1 would result in a breadth-first search. I haven't studied the performance impact of doing this.
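The details live inside the generated scripts, but the guard amounts to something like the following (hypothetical variable names; the real script carries its state differently):

        # If the recursion is already too deep, queue the link for the top
        # level to check later instead of descending into it now.
        if [ "$LEVEL" -ge "$MAXLEVEL" ]
        then
            echo "$link $page" >> "$PENDING"
        else
            # Recurse; the parent shell sleeps until this child finishes,
            # which is where the per-level sleeping processes come from.
            ./lvrfy "$link" "$page" "$OKFILE" "$BADFILE" "$OFFSITEFILE"
        fi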

Known Bugs

Despite its problems, I've found the script quite useful.

Warning: This script isn't secure, and shouldn't be run as root.

I'm not sure whether a carefully constructed pathological case could misdirect the script, causing unexpected or dangerous side effects.
Created by Preston Crow