This article expresses the author's opinion at the time of writing. There are no guarantees of correctness, originality or current relevance. Copying the whole article is forbidden. Transcription of selected parts is allowed provided that author and this source are mentioned.
It seems that recursive HTML5 validators are still hard to find, or at least I did not find one when I needed it.
The W3C validator does check HTML5 but it is not recursive. So I took a Python script from Jérome Jaglale (maestric.com) that implements recursive crawling. Since I adopted the script, I have implemented many improvements. The final result can be found here.
I also tried to install the W3C validator itself (which is open source and written in Perl) on both Mac and Linux, but without success. Perl and CPAN always spell trouble, at least for me. I will try again someday, so I can run the validator crawler at full speed without DoSing W3C.
Almost every improvement aims to speed up re-validation of the site, using as little time and bandwidth as possible and avoiding the W3C public service unless absolutely necessary.
The script dumps the W3C validator output for invalid pages into the logs/ folder (one file per invalid page). Note that the script needs two preexisting folders to function: datacache/ and logs/.
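As an illustration of the one-file-per-invalid-page idea, here is a minimal sketch. It is not the actual w3cval code: the function names and the hash-based file naming are hypothetical, and unlike the real script this sketch creates the folder itself if it is missing.

```python
import hashlib
import os


def log_path_for(url, log_dir="logs"):
    """Map a URL to a stable, filesystem-safe log file name (hypothetical scheme)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return os.path.join(log_dir, digest + ".log")


def write_validator_log(url, validator_output, log_dir="logs"):
    """Write the validator output for one invalid page to its own file."""
    # Assumption: create the folder on demand; the real script expects it to exist.
    os.makedirs(log_dir, exist_ok=True)
    path = log_path_for(url, log_dir)
    with open(path, "w", encoding="utf-8") as handle:
        # The first line records the URL, so the log is readable without the hash.
        handle.write(url + "\n" + validator_output)
    return path
```

Hashing the URL avoids problems with slashes and query strings in file names, at the cost of less readable names.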
Some techniques (like the HTTP HEAD trick) may be individually useful for other projects, so feel free to look into the w3cval code and copy them.
The script works best when the HTTP server adds Last-Modified to the response headers. It works without this header, but then it needs to fetch the whole page to determine whether it differs from the cached version. Static content gets this header automatically, but "dynamic" content like PHP pages does not, because the generated data could change at every execution.
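The Last-Modified check described above could look roughly like the following. This is a sketch under assumptions, not the w3cval implementation: head_last_modified and is_unchanged are hypothetical names, and the cached header value is assumed to be stored as the raw string received on the previous run.

```python
import urllib.request
from email.utils import parsedate_to_datetime


def head_last_modified(url, timeout=10):
    """Issue an HTTP HEAD request and return the Last-Modified header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Last-Modified")


def is_unchanged(cached_value, current_value):
    """Return True only if both dates parse and the server copy is not newer."""
    if not cached_value or not current_value:
        # No header available: the caller must fetch the page and compare content.
        return False
    try:
        cached = parsedate_to_datetime(cached_value)
        current = parsedate_to_datetime(current_value)
        return current <= cached
    except (TypeError, ValueError):
        # Unparseable or incomparable dates: be conservative and re-fetch.
        return False
```

A crawler would call head_last_modified(url) first and skip both the download and the validation when is_unchanged(...) returns True. Any doubt (missing or malformed header) falls back to fetching the whole page.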
I found this PHP trick that copies the modification date of the PHP source file into the Last-Modified header. Like any code that calls header(), it must run before any code that "prints" something as data, including blank lines.
// Format the script's modification time as an HTTP date in GMT,
// e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
$modified = gmdate("D, d M Y H:i:s", getlastmod());
header("Last-Modified: $modified GMT");
This approach was perfect for my use case, but it is probably too simple for non-trivial sites. Changes in included PHP modules are not reflected in getlastmod(), although it is not difficult to get the latest modification date among all included files. Some pages may be dynamic in other ways, or simply too dynamic to be worth caching.