After updating BBQueue for the third time to keep it working with changes to Blockbuster’s site, I decided that there’s got to be a better way than regular expressions for pulling the data from their pages. Regular expressions are super handy and great for a lot of things, but handling HTML whose structure might change at any time without notice isn’t one of them. What I really needed was an HTML parser, and having been greatly impressed with Hpricot for Ruby, I set out to find a similar library for JavaScript. However, I came up empty-handed.
In the process, though, I came across any number of JavaScript XML parsers, and it dawned on me that if I’m lucky and Blockbuster’s using XHTML, well, that would work just fine. In fact, the widget engine itself has had a built-in XML parser and DOM objects since not too long ago. Sure enough, Blockbuster’s doctype proudly proclaimed that it was XHTML 1.0.
However, once I tried to parse it, my plans were quickly dampened. At first I was getting errors about a few unsupported entities, like &amp;nbsp;. A little massaging of the HTML cleared that right up, and that’s when things got ugly: mismatched end tags all over the place. A quick pass through the validator confirmed it: 305 errors. I guess I should have figured that if they can’t be bothered to provide RSS feeds of your movie queue in the first place, valid XHTML wouldn’t be high on their priorities either. (In the interest of full disclosure, though, it’s not high on mine either: 22 errors. The shame of it all.)
So now it’s back to the original plan. Anyone know a good HTML parser written in JavaScript?
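For anyone curious, the entity "massaging" mentioned above could look roughly like the sketch below. This is not the actual BBQueue code: the function name `replaceNamedEntities` and the small `NAMED_ENTITIES` table are made up for illustration, and a real page would probably need more entries. The idea is just to swap HTML named entities for numeric character references, which a plain XML parser will accept without a DTD.

```javascript
// Hypothetical helper, not taken from BBQueue. The table covers only a few
// common HTML entities; extend it with whatever the page actually uses.
var NAMED_ENTITIES = {
  nbsp: 160,
  copy: 169,
  laquo: 171,
  reg: 174,
  raquo: 187
};

function replaceNamedEntities(html) {
  // XML predefines only &amp; &lt; &gt; &quot; and &apos;, so those are not
  // in the table and pass through unchanged.
  return html.replace(/&([a-zA-Z][a-zA-Z0-9]*);/g, function (match, name) {
    if (Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, name)) {
      // Numeric character references need no DTD to resolve.
      return "&#" + NAMED_ENTITIES[name] + ";";
    }
    return match; // unknown or XML-predefined entity: leave it alone
  });
}
```

Running the page source through something like this before handing it to the XML parser gets rid of the entity errors, but it does nothing for mismatched end tags, which is where the approach fell over.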
Comments
Comment by Neal on 2007-07-19 09:52:55 +0000
You know, I’ve seen a lot of anti-XHTML hate out there that boils down to an “if you can’t do it right, don’t do it” sentiment. Your trouble with Blockbuster’s “XHTML” is a good illustration of the problem. Bad XHTML poisons the waters, so to speak.
By the way, do you have “Mastering Regular Expressions”? Isn’t that a great book? It’s one of those books where you have to pause after every paragraph and think, “this information changes my entire understanding of the universe!”
Comment by Will on 2007-07-19 22:18:11 +0000
Poisons the waters and ruins my evening, that’s for sure. I mean, how hard is it to close your friggin’ img tags?
I don’t actually have that book; I’ve picked up my limited regex skills from http://www.regular-expressions.info/
I’ve come close to picking up that book a couple of times, though, and if it’ll change my understanding of the universe, it sounds like I would be wise to do so.
Comment by Ben Lewis on 2007-07-21 10:57:17 +0000
Would it be wise, or would there be drastic consequences for us all?