So, does anyone know how to make an HTML regex parser?
November 15, 2009 1:31 PM

 
(Apologies to regular people, this is one for the nerds.)
posted by spiderskull at 1:32 PM on November 15, 2009


Oh what the fuck ever. Regular expressions are a great tool for parsing HTML.
posted by xmutex at 1:34 PM on November 15, 2009


xmutex--

You've never seen a regex parsing system descend into special case madness, then.

Jesus. It's a DOM. Parse it as one.
posted by effugas at 1:35 PM on November 15, 2009 [11 favorites]


I'm pretty sure that you can parse anything with regular expressions. I don't know if I can but someone can.
posted by octothorpe at 1:36 PM on November 15, 2009 [2 favorites]


It should be a DOM, but it often isn't.
posted by boo_radley at 1:36 PM on November 15, 2009


Jesus. It's a DOM. Parse it as one.

Tons and tons of the HTML on the web ain't no fucking valid DOM.
posted by xmutex at 1:37 PM on November 15, 2009 [5 favorites]


<ia>
  <cthulhu src="r'lyeh" alt="Unimaginable horror" height="larger than your mere human mind can comprehend" width="400">
</ia>
posted by infinitewindow at 1:38 PM on November 15, 2009 [50 favorites]


BeautifulSoup?
posted by kuatto at 1:39 PM on November 15, 2009 [1 favorite]


BeautifulSoup?

Not the latest build. Not if you have to parse invalid HTML. Which of course you do. Because the internet, like the world, is a horrible, ugly place, which speaks highly of standards, then when pressed for time does whatever awful thing is necessary.
posted by xmutex at 1:42 PM on November 15, 2009


Folks, the stackoverflow poster is right. You need a proper parser that can accommodate nondeterminism to handle the clusterfuck of special cases (i.e. resolve state transitions after the fact ala GLR parsers, which come to think of it would be a bounding disaster, so really, I have no easy solution).

I'm pretty sure that you can parse anything with regular expressions. I don't know if I can but someone can.

Give me your regular expression HTML parser and I will give you valid HTML that will break it. The problem with regexes is they're not amenable to variations in error handling -- you either accept or you don't. Recovery has to be explicit, which leads to state space explosion and a right mess of code.
posted by spiderskull at 1:42 PM on November 15, 2009 [8 favorites]


Re: BeautifulSoup, it's fine for data mining, but if you have to render that shit, then Zod help you. It ain't a coincidence that browsers are such unwieldy projects.
posted by spiderskull at 1:44 PM on November 15, 2009


xmutex--

I didn't say it was a valid DOM.

Here's the deal. There are two kinds of HTML parsers.

Rejectors must only pass known good content. That means you never push through raw HTML, instead you parse what you can, decide you like it, and then emit *from the reconstructed DOM* what you like. If you want to prevent arbitrary HTML from showing up in a post (for example, if you want to allow bold but not <script>) then this is the only way to do it safely.
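
A minimal sketch of the rejector idea, assuming Python's standard-library html.parser as the parser. The ALLOWED_TAGS whitelist and the sanitize() helper are made up for illustration. Nothing from the raw input is copied through: allowed tags are re-emitted from scratch without attributes, and all text is escaped.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"b", "i", "em", "strong"}  # hypothetical whitelist, adjust to taste


class WhitelistSanitizer(HTMLParser):
    """Re-emit only whitelisted tags; drop every other tag, escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Attributes are deliberately discarded, so onclick= and friends never survive.
        if tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is escaped on the way out, so it can never reintroduce markup.
        self.out.append(escape(data))


def sanitize(raw_html):
    """Parse raw_html and rebuild a string from the whitelisted pieces only."""
    parser = WhitelistSanitizer()
    parser.feed(raw_html)
    parser.close()
    return "".join(parser.out)


print(sanitize('<b onclick="x()">hi</b><script>alert(1)</script>'))
```

This sketch does not balance tags (a stray closing tag from the whitelist is passed along as-is), so a real rejector would also track open elements and close them itself.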

Acceptors by contrast need to do whatever it takes -- the screen must be scraped. In this case, you don't use a regex, you man up and instrument a browser because that's the only thing that's going to get it right all the time. Warning, you'll get owned regularly as you do this.

The holy quote is, never bring a regular expression knife to a turing complete gunfight. It's so, so true.
posted by effugas at 1:47 PM on November 15, 2009 [26 favorites]


My preferred method is to use a 3D printer to output html on individual starch pellets, which I've trained a raven to sort through. The raven eats filters tags that don't match and drops the ones that do onto a flatbed scanner, where OCR returns the results to the application. It worked swimmingly until the raven pecked out my eyes and now I'm having a hard time validating the results. You don't want to know what I use for unit testing, but it involves chimps and printing text onto gnats.
posted by furtive at 1:48 PM on November 15, 2009 [31 favorites]


I'm pretty sure that you can parse anything with regular expressions. I don't know if I can but someone can.

Regular expressions can parse strings written in a regular language. Things written in a context-free language (e.g., most computer languages) cannot be parsed by a regular expression, to say nothing of more complex language types, of which there are several.

Bring Me Your Regexs! I Will Create HTML To Break Them!
posted by jedicus at 1:49 PM on November 15, 2009 [11 favorites]


lxml ftw
posted by i_am_a_Jedi at 1:56 PM on November 15, 2009


Perl's regexps are Turing-complete, inasmuch as you can embed arbitrary Perl in them. So if you were prone to perversity (and what Perl hacker isn't, just a little?), you really could parse HTML with a Perl regexp if you really, really wanted to. (The article's point that that way lies madness remains.)
posted by Zed at 1:58 PM on November 15, 2009 [5 favorites]


I'm pretty sure that you can parse anything with regular expressions. I don't know if I can but someone can.

Nope. To use an obvious example, you can't write a regular expression to extract the contents of all <div> elements from arbitrary well-formed XML documents, because XML is what's known as a context-free language, a superset of regular languages. You could write one to do that as long as the level of tag nesting is bounded by some fixed depth, but not if it's unbounded. As the link notes, even the addition of back-references won't let you do this.
posted by gsteff at 2:00 PM on November 15, 2009 [2 favorites]


Things written in a context-free language (e.g., most computer languages) cannot be parsed by a regular expression, to say nothing of more complex language types, of which there are several.

Sorry, yea. I was being glib.
posted by octothorpe at 2:00 PM on November 15, 2009


The original question doesn't say a damn thing about parsing. If it'd been here, that answer would have rightly got the chop. What if the guy's just trying to do a complicated find-and-replace?

Course, this is stack bloody overflow, so all bets are off. The questions could have read completely differently earlier, or there could be a comment on a comment somewhere in which the OP clarifies that he's trying to write a parser, but perhaps I don't have enough points to see that yet. Either way: there's no "parsing" in the question, and lots of it in the answers. Gah, it's such a counter-productive system they have there.

("The <center> cannot hold" is genius, tho)
posted by bonaldi at 2:02 PM on November 15, 2009 [6 favorites]


You can also parse valid email addresses according to RFC2822 with a regular expression, but it is a doozy.
posted by autopilot at 2:10 PM on November 15, 2009


Get over yourself and make this open source already
posted by effluvia at 2:12 PM on November 15, 2009


Who left the <cthulu> tag open!!!?
posted by xorry at 2:15 PM on November 15, 2009 [16 favorites]


I'm pretty sure that you can parse anything with regular expressions.

I'm pretty sure you can't parse a language defined thus:

1. if s is in the language, then "("s")" is in the language
2. if s and r are in the language, then sr is in the language
3. "." is in the language
4. nothing not constructable via the above rules is in the language

This gets you, e.g., "((((.)(.)(((.)))(.))))".

You can't balance parentheses with regular expressions. (Well, I believe new Perls, or Perl 6, or something, has extensions that let you do this, but those aren't real regular expressions.)
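
For contrast, here is a rough sketch of how little code a recognizer for that four-rule language needs once recursion is allowed. The function names are invented for illustration. The grammar is read as: a sequence is one or more items, and an item is either "." or a parenthesised sequence.

```python
def parse_item(s, i):
    """Parse one '.' or one parenthesised group starting at s[i].

    Returns the index just past the item, or -1 if no item starts here.
    """
    if i < len(s) and s[i] == ".":
        return i + 1
    if i < len(s) and s[i] == "(":
        j = parse_seq(s, i + 1)
        if j != -1 and j < len(s) and s[j] == ")":
            return j + 1
    return -1


def parse_seq(s, i):
    """Parse one or more items; return the index past the last one, or -1."""
    j = parse_item(s, i)
    if j == -1:
        return -1
    while True:
        k = parse_item(s, j)
        if k == -1:
            return j
        j = k


def in_language(s):
    """True if the whole string is derivable from the four rules."""
    return parse_seq(s, 0) == len(s)


print(in_language("((((.)(.)(((.)))(.))))"))
```

The recursion is what a classical regular expression lacks: each open parenthesis is remembered on the call stack until its partner shows up.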
posted by kenko at 2:16 PM on November 15, 2009 [4 favorites]


... Even Jon Skeet cannot parse HTML using regular expressions ...
awesome
posted by memebake at 2:17 PM on November 15, 2009


Another good answer from the same page:
While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML.
posted by memebake at 2:24 PM on November 15, 2009


Get over yourself and make this open source already

I'm confused by your comment.
posted by spiderskull at 2:32 PM on November 15, 2009 [2 favorites]


Nerd fight!

xmutex: I just use the old BeautifulSoup, as I am still on py26 and have no intention of switching until django does. In the meantime, have you looked into html5lib?

Parsing HTML with regular expressions is insanity, and as mentioned, not even possible to do completely by definition. Sure if you are screen-scraping a small amount of data from a page using a known formatting, it's a shortcut you can use, but a damn ugly and fragile one.
posted by cj_ at 2:36 PM on November 15, 2009


No, it's never OK to parse HTML with regex. This is pure programmer Not-Invented-Here syndrome. It's ridiculously easy to use a library that can take in XPATH, people just think they have big dicks when they can write their own code. Stop being a little bitch. Use a library, and read the fucking documentation, it will save everyone a ton of time.
posted by amuseDetachment at 2:48 PM on November 15, 2009 [4 favorites]


Man! A year or so I've been posting sometimes-helpful, occasionally-considered answers at SO, and it's *this* silly post-pub outburst in frustration at the endless stream of “How do I parse this HTML with regex?” questions that gets all the attention!

There is something deceptive about string processing in general and regex in particular that makes people think they understand how to use it, when all the parsing and escaping issues make it actually quite difficult. It's the same root problem that gives us cross-site-scripting flaws in 90% of websites.
posted by BobInce at 2:49 PM on November 15, 2009 [6 favorites]


I pity the foo who tries to use Regex on this code!

(LOL computors)
posted by Potomac Avenue at 2:53 PM on November 15, 2009 [3 favorites]


What on earth is this doing on the front page of MetaFilter? Didn't the amount of stupidity added to the internet when this was posted to reddit suffice? Not sure you guys are doing that much of a better job, to be honest.

(The reason those of us who work with stuff like this tell people to avoid hacking together regular expressions to solve their HTML/XML problems is not that you cannot use regexps for processing HTML and XML if you know what you're doing; it's that the average programmer never knows what he's doing. And that every programmer thinks he's above average. Especially when posting to the internet.)
posted by effbot at 2:56 PM on November 15, 2009 [5 favorites]


I'm pretty sure that you can parse anything with regular expressions.

Regular expressions can only describe patterns that can be recognized by finite state machines, so a regular expression can't describe a^n b^n, for all positive integers n (a famous example that requires an infinite state machine).
posted by twoleftfeet at 3:01 PM on November 15, 2009


Really though, the proper way to herald the death of regexps on Metafilter is to use a single period.

But that could stand for anything.
posted by twoleftfeet at 3:03 PM on November 15, 2009 [11 favorites]


Regular expressions can only describe patterns that can be recognized by finite state machines

The problem is that these days when most folks say regex they mean Perl (or PCRE) regex, which are Turing Complete, rather than the quaint, old fashioned regular expressions of Kleene et al. So, while it is not a good idea to "bring a regular expression knife to a turing complete gunfight", the /e and /ee extensions (not to speak of the /@{[Z҉A҉L҉G҉O]}/ possibilities) remove the limitation of FSM-based regex engines.

(I hear that Z҉A҉L҉G҉O will be supported in Perl6.66 as a built-in)
posted by autopilot at 3:10 PM on November 15, 2009 [2 favorites]


I'm pretty sure that you can parse anything with regular expressions. I don't know if I can but someone can.

No, you can only parse regular languages with regular expressions. That means no nested elements (basically). And if you want to do a finite number of nested elements, you have to write one regex for each level.
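
The "one regex for each level" point can be sketched like this, using parentheses as a stand-in for nested tags. The helper names are hypothetical. Each extra level of nesting wraps the previous pattern once more, so the pattern has to be rebuilt for every depth you want to allow.

```python
import re


def nested_paren_pattern(max_depth):
    """Build regex source for balanced parentheses nested at most max_depth deep."""
    pattern = r"[^()]*"  # depth 0: no parentheses at all
    for _ in range(max_depth):
        # One more level: plain characters, or a group containing the previous level.
        pattern = r"(?:[^()]|\(" + pattern + r"\))*"
    return pattern


def matches_up_to(text, max_depth):
    """True if text is balanced and never nests deeper than max_depth."""
    return re.fullmatch(nested_paren_pattern(max_depth), text) is not None


print(matches_up_to("a(b)c", 1))
print(matches_up_to("((x))", 1))
print(matches_up_to("((x))", 2))
```

Whatever depth you pick, input nested one level deeper falls outside the pattern, which is the whole limitation in miniature.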

That said, you can use regular expressions to find tokens to build a more complex parser. If you just want to find a tag, and not an entire element, regular expressions can work for that. I have some code that goes through HTML files and extracts hyperlinks, and I think it works more quickly than trying to build a whole DOM tree. And if it misses a few links, it's not that important.
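
To make the trade-off concrete, here is a sketch (not the code described above, just an illustration) that pulls links out of the same snippet two ways: with a single tag-level regex, and with Python's standard-library html.parser. HREF_RE, SAMPLE and the function names are invented for this example.

```python
import re
from html.parser import HTMLParser

# Token-level regex: finds an <a ...> start tag with a quoted href. Nothing more.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']*)["\']', re.IGNORECASE)


def links_by_regex(html):
    return HREF_RE.findall(html)


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_by_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


SAMPLE = (
    '<a href="one.html">1</a> '
    "<A HREF='two.html'>2</A> "
    "<a href=three.html>3</a> "
    '<!-- <a href="ghost.html">old</a> -->'
)

print(links_by_regex(SAMPLE))   # misses the unquoted href, picks up the commented-out one
print(links_by_parser(SAMPLE))  # sees the unquoted href, skips the comment
```

Whether the misses matter depends on the job; for a rough crawl they may not.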
posted by delmoi at 3:14 PM on November 15, 2009


My job actually currently involves a lot of parsing XML and HTML. We use XSLT wherever possible, cleaning up the input first if necessary to make it well-formed, but when we know the schema and provenance of the input, I sometimes use regex to pull out a string or two. I'd never use it to parse arbitrary webpages though, and actually recently had to dissuade a developer from doing just that.
posted by gsteff at 3:24 PM on November 15, 2009


I've used this as an interview question in the past. Generally in the form "how would you get this data from this chunk of HTML" and then keep upping the ante… A lot of folks don't understand the limitations of regular expressions and believe them to be do anything swiss army knives.
posted by schwa at 3:31 PM on November 15, 2009 [1 favorite]


The problem is that these days when most folks say regex they mean Perl (or PCRE) regex, which are Turing Complete, rather than the quaint, old fashioned regular expressions of Kleene et al.

I concede the point.
posted by twoleftfeet at 3:34 PM on November 15, 2009 [2 favorites]


I'm not real sure what any of this means, but I'm glad that this thread is going precisely according to prophecy.
posted by cmoj at 3:42 PM on November 15, 2009 [3 favorites]


This topic has crossed into the pointless region of my nerdometer, or maybe I don't get the OP's true issue.

When a 3rd party content provider can't or won't provide a proper interface, I've stooped to web page-scraping to pull out what's required. I've also done my share of processing content from myriad sources, including regex searches, and regex searches of regex searches. All of this usually cooked in a tasty Perl broth.

So, agreed - if you're wanting browser-strength parsing, use a library already.

I would about now start moaning about how stupid browser-makers accepted and even promoted the use of malformed, nonstandard elements, and then turning it into a rant on IE, but frankly I'm more interested in going out to dinner shortly.
posted by Artful Codger at 4:05 PM on November 15, 2009


"do anything swiss army knives"

Ah, but that's a very good analogy. A swiss army knife, or a multitool, can deal with lots of problems easily and as well as a specialised tool, maybe better, in that you don't have to crack open the tool case, blah blah... and yet sometimes your Leatherman is not in fact up to the task.

A few weeks ago one of my younger colleagues asked on our internal IRC how he could fix a regex that was choking on some nested HTML tags (said code was in an open source app from outside our organisation, it turns out). We all said "You can't. Use a parser." And he was all "why are you guys being so unhelpful? All I want to do is use a regex to process this chunk of HTML." To which we replied "You can't have what you want." Eventually he saw the light.
posted by i_am_joe's_spleen at 4:06 PM on November 15, 2009 [3 favorites]


I don't want to be that guy, but please can somebody explain all this, starting with HTML regex parser, all the way through to zalgo. I saw this on reddit and didn't understand it then
posted by Petrot at 4:06 PM on November 15, 2009 [1 favorite]


What would be really nice is if programmers didn't feel the need to tell other programmers how they must use their tools, or that using a tool in a certain way is always incorrect, or, my favorite, insult all programmers who might ever use a tool in a manner which they personally disapprove.

But it's been going on for at least 30 years, so I guess it's not gonna stop. Someone probably told Grace Hopper that the way she coded her Cobol showed she was a little bitch. And if that happened, I hope she smacked the guy (and yes, in the early '60s it was almost certainly a guy).
posted by mdevore at 4:07 PM on November 15, 2009




What would be really nice is if programmers didn't feel the need to tell other programmers how they must use their tools, or that using a tool in a certain way is always incorrect...

I don't think they're saying "don't use regexes to 'parse' HTML because I find it distasteful," they're saying "don't use regexes to 'parse' HTML because it's a brittle 'solution' and you are going to end up with a horrible codebase full of hacks and half-broken workarounds in the long run if you do so, and you are never going to be able to stop adding more hacks and workarounds, ever." And it's true.

Consider the parable of the nail, the shoe, and the bottle. Different technical question, same issue of tools.
posted by letourneau at 4:15 PM on November 15, 2009 [4 favorites]


When I first wanted to run search/replace patterns on HTML content, the people on the IRC channels devoted to PHP that I visit suggested that I use Regex, or (more to the point) simply appropriate some preexisting library for that purpose. Instead, I used PHP's own xmlparser to code my own, custom solutions, and have found them entirely compatible with my purposes thus far.

Private Data Redaction: Source Code [github.com], Sample Output [confessor.org]
HTML Input Validation: Source Code [github.com], Sample [confessor.org]
posted by The Confessor at 4:16 PM on November 15, 2009 [1 favorite]


And it's true.

No, it's not. Many programming solutions are for given boundary conditions. Within those conditions, "brittle" concerns need not apply.

If people could ease up on the gross overgeneralizations (e.g. "never") for when a tool might or might not be appropriate, something closer to an accord might be reached.
posted by mdevore at 4:20 PM on November 15, 2009


mdevore: in a lot of cases though, experience shows that the alleged boundaries often get extended. Someone says "oh, this will only ever have to deal with this restricted subset of markup" and then a little later, it turns out that you're getting some more complex input and your regex chokes.

Considering how little effort it is to use the lovely simple interfaces to parsers which so many platforms offer these days, it seems like common sense to head that particular danger off at the pass by using a parser in the first place. Yeah, there will be the odd case where it really truly isn't worth it. But I would think those cases are actually quite rare.
posted by i_am_joe's_spleen at 4:32 PM on November 15, 2009 [1 favorite]


it turns out that you're getting some more complex input and your regex chokes

Well, for a spec'ed program or environment, that could easily be interpreted as moving the goalposts. Depending upon the ramped complexity, the estimated costs to support more complex input could either forestall further liberalization of allowed input or fund a new approach.

But here again, there is an overgeneralization. Many programs will neither need to, nor should, anticipate the entire universe of HTML when parsing. In fact, designing a program to do so may well be a completely wrong-headed approach, if it changes a program from a few-line script to a bullet-proof commercial-grade program which takes orders of magnitude more time to properly code and support. Resource expenditure should always (oops, make that very frequently, I about overgeneralized, too) be considered when creating a programmatic solution.

In addition, hard boundaries need not always be an HTML-oriented limitation. They can be as simple as "work with these sets of web pages", "work with this data from these web pages", "work for 98% of users 98% of the time", or even "OMG, we have to make this work in the next hour, do the best you can".
posted by mdevore at 5:06 PM on November 15, 2009 [1 favorite]


"The other fellow first."
==
"Assume subsequent developers will have your home address."
posted by butterstick at 5:09 PM on November 15, 2009 [1 favorite]


So let me think, if someone says "regular expressions" to me, should I think of

i) something well-documented in the literature, something every CS student spends hours on proofs of; or
ii) some bastardised "embrace-and-extend" scripting kludge that resembles line noise* and without a single unifying principle that could possibly be called design?

--
* line noise: archaic term from archean times i.e. Before Broadband; the exact meaning is lost to history.
posted by phliar at 5:12 PM on November 15, 2009 [4 favorites]


What would be really nice is if programmers didn't feel the need to tell other programmers how they must use their tools, or that using a tool in a certain way is always incorrect

It is incorrect in the mathematical sense. There is nothing subjective about this question. It is impossible to represent the HTML grammar as a regular expression. By "impossible", I do not mean "really hard", or even "so hard nobody has ever managed it"; I mean that regular expressions are defined in such a way that they can be proven, without a doubt, to be incapable of parsing HTML.

One can either choose an appropriate tool, or redefine the job until it fits the constraints of the tools at hand, but the fact remains that one cannot parse HTML with regular expressions.
posted by Mars Saxman at 5:15 PM on November 15, 2009 [6 favorites]


* ah, there's a third option: by redefining the terms of the discussion, you can use the phrase "parsing HTML with regular expressions" to describe something entirely different, which may actually be possible; the price is that this makes you sound like an idiot, until you get done explaining what your new meanings for those terms are, at which point the listener is likely to wonder why you didn't just say what you actually meant in the first place.
posted by Mars Saxman at 5:19 PM on November 15, 2009 [1 favorite]


mdevore, to a certain degree professionals imparting their wisdom on one another is a nice, collegial thing. Not all fields are as prone to it as software developers, and you might miss it if it wasn't there, occasional incorrect blowhard aside.
posted by ~ at 5:32 PM on November 15, 2009 [2 favorites]


ah, there's a third option: by redefining the terms of the discussion

I choose the option where someone such as yourself doesn't get to, by fiat, declare that the world of HTML grammar must necessarily objectively define the role of parsing HTML for programmers. That there is no role for the subjective in the original statement "Every time you attempt to parse HTML with regular expressions" at the very first sentence of the post. But no, you knew what they were really saying, right, and are here to enlighten us and readjust our interpretations of the post properly. That isn't what they really meant.

Your comments here sound no more accurate than saying that, for example, since Newtonian equations don't properly model or encompass the universe, you cannot properly use Newtonian equations to solve problems in the real world. Things might go near the speed of light or a strong gravitational field, ya know, and then where would you be?

But hey, if you call people idiots, you win the discussion and perhaps avoid all those messy questions over who exactly redefined what.
posted by mdevore at 5:38 PM on November 15, 2009 [1 favorite]


to a certain degree professionals imparting their wisdom on one another is a nice, collegial thing

Absolutely. People learn and are informed by online discussions on programming topics, and have been for a very long time. It is a critical feature of communal discussions that the value is immeasurable. People are probably not so informed by discussions where insults fly or people become so rigid in their proper thinking as not to allow for dissenting views unless they, I dunno, try to belittle those who advance them.
posted by mdevore at 5:43 PM on November 15, 2009


Things might go near the speed of light or a strong gravitational field, ya know, and then where would you be?

OK, I'll grant that we don't know the exact details of the HTML or HTML-like slurry the poster in the Stack Overflow thread was intending to parse.

However, I think the anti-regex people here have been bitten too many times by "it'll always be formatted like this" and "don't worry about those crazy edge cases" and "the third-party vendor says they're going to follow our spec" (afterwards followed by "OMG, why did the script break?!") to believe that even the original poster at Stack Overflow really knows what his script is going to be handed as input. In practice, for nearly any program that's going to be used for more than a quick afternoon hack (and even those scripts have a nasty tendency to "burrow" into business processes under the radar), you don't want to use regexes for HTML "parsing."

I hope you'll grant that crazy busted-ass HTML is more common in most programmers' daily experience than travel at relativistic speeds and around spacetime singularities.
posted by letourneau at 5:59 PM on November 15, 2009


mdevore, I don't understand the point you're trying to make here. There are "dissenting views" on the ability of regular expressions to parse HTML in the same sense that there are dissenting views on the value of pi.
posted by Mars Saxman at 6:04 PM on November 15, 2009 [2 favorites]


OK, pull back for a second guys.

It's possible to parse HTML with a RegEx.

I mean, it is. It's also (as a friend of mine likes to say) possible to build a computer out of macaroni and cheese, that does not mean it is a good idea.

Put another way, there are many problems that can be mathematically shown to be effectively insolvable with present computers and technology. They're called NP-complete. Tetris, it turns out, is NP-Complete. But sometimes you don't need to solve something -- sometimes you just need "good enough".

RegEx appears good enough, mathematical theory be damned.

Only, it turns out the math guys were right here. It's not just that RegEx isn't good enough in the extreme case. It's that it's not good enough in far, *far* too many real world cases. Stupid things will destroy you -- an extra carriage return, an unexpected attribute in an unexpected place, Unicode, overwide Unicode, even upper case. The problem with RegEx is:

1) You discover all these exceptions as you go, and you have to keep revisiting the code
2) As you keep revisiting, your Regular Expression gets uglier and more complicated, and the expense of modifying it steadily increases.
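
A small sketch of that revisiting cycle, using a title tag as the target. NAIVE, PATCHED, VARIANTS and STILL_WRONG are names made up for this illustration. The first pattern passes the test page and fails on three harmless variations; the patched pattern handles those and is still fooled by a commented-out tag.

```python
import re

# First attempt: works on the page you tested with.
NAIVE = re.compile(r"<title>(.*)</title>")

# After three bug reports: attributes, case and newlines are now tolerated.
PATCHED = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)

VARIANTS = [
    "<title>Home</title>",            # the original test case
    "<TITLE>Home</TITLE>",            # upper case
    "<title>Home\n</title>",          # stray line break
    '<title lang="en">Home</title>',  # unexpected attribute
]

# The next exception waiting to be discovered: a commented-out old title.
STILL_WRONG = "<!-- <title>Old</title> --><title>Home</title>"


def survey(pattern):
    """Report, per variant, whether the pattern finds a title at all."""
    return [pattern.search(v) is not None for v in VARIANTS]


print(survey(NAIVE))
print(survey(PATCHED))
print(PATCHED.search(STILL_WRONG).group(1))
```

Each fix makes the pattern longer and harder to read, and the last line shows there is always another case behind it.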

Everyone making the math argument is right, but you're sort of bringing Science to an Engineering debate. The reality -- coming from someone who has fought in these very trenches -- is that regular expressions are a very, VERY poor tool for the task of extracting context from HTML.

If you need security, build a DOM and reconstruct what you parse.
If you need correctness, spawn a browser and extract what it sees.
If you need a codebase to babysit, by all means, use a RegEx.
posted by effugas at 6:22 PM on November 15, 2009 [3 favorites]


Petrot: "I don't want to be that guy, but please can somebody explain all this, starting with HTML regex parser, all the way through to zalgo. I saw this on reddit and didn't understand it then"

This is why it probably doesn't belong on MeFi, but since it is, I'll accommodate. The important thing to start with is not HTML regex parsers, but regular expressions (regex) in general. They're a tool to do pattern matching in text, so long as the text has specific structured constraints: patterns can happen a specific number of times, or an unpredictable number of times, but no two patterns within can be tied together.

You can mix and match these patterns to build "regular expressions" like "<table>.* </table>"; which will return the longest string in the source text that starts with <table> and ends with </table>. The .* can match anything inside. Including another </table>. When you have nested tags like that, you want to ensure you've matched an equal number. This violates the constraint; you can't tie regular expression patterns to equal numbers like that. This gets complicated fast, trying to grab all tables with a specific attribute is a nightmare if you can't make a lot of assumptions about the source text. So HTML, in the general sense, cannot be parsed by regular expressions; its patterns cannot be represented.
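
A quick sketch of that behaviour with Python's re module (the sample strings are invented, and the pattern is written without the space before the closing tag). The greedy version swallows everything between the first opening tag and the last closing tag; the non-greedy version fixes sibling tables but cuts a nested table off at the first closing tag.

```python
import re

SIBLINGS = "<table>A</table><p>x</p><table>B</table>"
NESTED = "<table>outer<table>inner</table>tail</table>"

GREEDY = re.compile(r"<table>.*</table>")
LAZY = re.compile(r"<table>.*?</table>")

# Greedy: one match spanning both tables and the paragraph between them.
print(GREEDY.search(SIBLINGS).group(0))

# Non-greedy: the two sibling tables come out separately...
print(LAZY.findall(SIBLINGS))

# ...but a nested table is truncated at the first </table> it meets.
print(LAZY.search(NESTED).group(0))
```

Neither variant can count opening tags against closing tags, which is the constraint described above.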

The ideal solution is to man up and build an object tree (a "parse") of the data; each HTML tag becomes an object potentially containing other HTML tag objects. This lets you have tables that contain other tables, and is generally a much easier form to process once completed. Then you can operate on the tree structure of the data quite easily. For XHTML or RSS this is incredibly simple; I do it with my feed reader to ditch specific authors from Freakonomics' blog, for example.
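
As a rough illustration of the object-tree idea, here is a sketch built on Python's standard-library html.parser. The Node class, TreeBuilder, build_tree() and find_all() are made-up names, and the sketch assumes well-formed, XHTML-style input where every element is explicitly closed; real tag soup needs a more forgiving library.

```python
from html.parser import HTMLParser


class Node:
    """One element: its tag, attributes, text fragments and child elements."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.text = []
        self.children = []


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]  # innermost open element is last

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumption: well-formed input, so the closing tag matches the innermost element.
        if len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].text.append(data.strip())


def build_tree(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def find_all(node, tag):
    """Pre-order list of node and its descendants whose tag matches."""
    found = [node] if node.tag == tag else []
    for child in node.children:
        found.extend(find_all(child, tag))
    return found


MARKUP = "<table><tr><td>outer<table><tr><td>inner</td></tr></table></td></tr></table>"
tables = find_all(build_tree(MARKUP), "table")
print(len(tables), find_all(tables[1], "td")[0].text)
```

Once the tree exists, "the table inside the table" is just a child lookup rather than a pattern-matching problem.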

There's a number of reasons people try HTML regex parsers:

Firstly, because they are ignorant of the subject; regex can work on occasion, and are very handy for the large majority of structured text that does meet the regular expression constraints. Which might include the last HTML source text you worked with.

Secondly, because it's a hell of a lot harder; writing a decent parser requires you to write down the grammar, but a lot of programming languages support regexs so pervasively that you're functionally illiterate if you don't know how they work. However we have free parsers for XML, so that argument is again from ignorance.

Thirdly, when you control the output; you can write the data out to XML then parse it with regular expressions. This infuriates people who expect your parser to accept all valid XML, say for WordPress imports.

Finally, when the source text doesn't even meet the requirements for XML, you're forced to figure something else out. Generally what happens is an unholy mess of regular expressions and source code, and when it fails it's usually unclear why. Systems like BeautifulSoup and StoneSoup attempt to formalize this process of dealing with "tag soup," without having you the programmer handle regex directly.

The author of the post is driven crazy by the number of people who fall into the second category rather than the fourth. Zalgo is a Lovecraftian meme about things so horrific your mind simply breaks. The author reflects this in his writing slowly drifting from "yet another you can't do that" into insanity, mixing the two over time.

Frankly, my explanation is a bit like explaining Marmaduke comics. An exercise in futility, and valuable only for its comical dedication to the premise, which is even then still a worthless and overdone contribution.
posted by pwnguin at 6:51 PM on November 15, 2009 [7 favorites]


I'm pretty sure that you can parse anything with regular expressions.

Yeah, you'd be wrong.
posted by chunking express at 6:56 PM on November 15, 2009 [1 favorite]


There are "dissenting views" on the ability of regular expressions to parse HTML in the same sense that there are dissenting views on the value of pi.

Migawd! This statement is so utterly wrong I don't even know where to begin. Are you really so locked into your theoretical position that you deny basic reality?

I have parsed lots of things, HTML included, with regular expressions. Logic 101: I (and millions of others who have done similar tasks) provide a counterexample to your assertion that you cannot parse HTML with regular expressions. You remember your logic classes when they taught about counterexamples and disproving an assertion, right?

You might also review logical fallacies such as the Bare assertion fallacy. There's another logical fallacy where you introduce a true statement which is unrelated to the assertion as somehow supporting it by proximity, but my memory fails as to what it is called. Perhaps you remember its name, and you can teach us all something useful. That would be nice. Anyway, you did it with that "value of pi" trick. Good try, though.
posted by mdevore at 7:03 PM on November 15, 2009


The problem is that these days when most folks say regex they mean Perl (or PCRE) regex
True. Perl's regexps haven't been regexps for a long while. And it is, in fact, possible to parse HTML fairly reliably using a perl regexp, as in the following example:
$foo =~ m/(?{ require HTML::Parser; HTML::Parser->new()->parse($_)->eof; })/;
posted by hattifattener at 7:12 PM on November 15, 2009 [14 favorites]


mdevore--

OK, let's talk about context. What that guy on Stack Overflow was saying:

"Stop asking me how to handle mixed case in RegEx. Parse a DOM."
"Stop asking me how to handle random carriage returns in RegEx. Parse a DOM."
"Stop asking me how to deal with nested tables. Parse a DOM."

...and so on. What we're all basically saying here is that RegEx works really well for your test set, but has an unacceptable rot rate (for acceptance purposes) and bypass rate (for rejection purposes). HTML RegEx based systems just rot, faster and more severely than literally anything else I can imagine in modern computing.
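
A minimal Python sketch of that rot, assuming nothing beyond the standard library. The sample markup and the names `NAIVE_LINK`, `LinkCollector`, `regex_links` and `parser_links` are made up for illustration. The regex is tuned to the tidy form it was tested on; the standard-library parser (a lenient stand-in, not a browser) lowercases tag and attribute names and tolerates a line break inside the tag.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: one link in shouty case with a line break, one tidy link.
SAMPLE = '<A\nHREF="/one">first</A> <a href="/two">second</a>'

# A pattern that only knows the lowercase, single-line form.
NAIVE_LINK = re.compile(r'<a href="([^"]*)">')


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names already lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


def regex_links(html):
    return NAIVE_LINK.findall(html)


def parser_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs
```

On `SAMPLE` the regex finds only the second link, while the parser reports both. You can patch the pattern for case and whitespace, but each patch is another special case to maintain.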
posted by effugas at 7:22 PM on November 15, 2009 [1 favorite]


Who left the <cthulu> tag open!!!?

Hate to break it to you, but plenty of tags in the HTML spec (the real spec, not that bullshit XHTML 1.0 crap) are supposed to be left open. <INPUT> tags, for example. Or <CTHULU> tags. It's all right there in the spec. I sure-as-fuck wouldn't want to be responsible for trying to close a CTHULU tag.

I'm not even going to bother linking to the spec in question because linking to CTHULU enrages CTHULU.

I don't want to be that guy, but please can somebody explain all this, starting with HTML regex parser, all the way through to zalgo.

Regular expressions (regex) are textual patterns. Regex parsers use these patterns to do things with the text, or sections of the text. For instance, most English sentences end with punctuation marks. Most English sentences use words that are separated by spaces. If I wanted to count how many words were in a document, or I wanted to take every sentence in a document and do something to it, regular expressions can do this with relative ease.
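
For the curious, a rough Python sketch of those two jobs. The function names are invented for this example, and the rules (words are runs of non-whitespace, sentences end at `.`, `!` or `?` followed by whitespace) are simplifying assumptions that real prose will violate.

```python
import re


def count_words(text):
    # Treat every run of non-whitespace characters as one word.
    return len(re.findall(r"\S+", text))


def split_sentences(text):
    # Split on whitespace that directly follows ., ! or ?
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]
```

Fed "Regex is handy. Is it a parser? No!", this counts eight words and splits three sentences. Feed it text with no punctuation at all and the sentence splitter hands back one giant sentence, which is the whole problem in miniature.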

The problem is, much like English, HTML is a bastard language. In English, you don't have to end every sentence with a punctuation mark. I mean, you should, certainly, but what happens if you don't? Well, some people will probably reply to you saying something to the effect of ASSHOLE DON'T DO THAT but they will still understand 99% of what you wrote. HTML parsers are the same way. There is a specification. There are rules. But if you enforce those rules with an iron fist, you'll break a huge percentage of the web (the underlying inference, naturally, is that a huge percentage of web authors don't know what they're doing).

Allowing regular Joe user to give you HTML that you plan on displaying is like allowing Fark commentators to post to the Encyclopedia Britannica. You need all sorts of rules that you have to pre-program into your lexical parser. Formoronswho spacestrangely. foridiots that dontuse capitalsorapostrophes. forfucktards that dontevenbother wth punctuation

So getting back to the original post: some web person asked the interweb for a regular expression to parse HTML content. Which is about as sensible as asking for a regular expression to parse YouTube comments. There's simply no end to the level of inanity you'll be presented with.

I'm not sure how all of this morphed into a discussion about Zalgo, though.
posted by Civil_Disobedient at 7:24 PM on November 15, 2009


I feel the need to make two points here:

1. The question wasn't about "parsing HTML." The question was about matching all and only strings that start with a <>, and do not end with />. I agree with "you cannot (fully and reliably) parse html with a regex," because it's a plain old fact. But on the other hand, I can think of lots of problem spaces where actually FULLY PARSING the html the questioner had to deal with would be insane overbuilding. Nothing in the question clarifies whether that's the case or not. Dude might just be trying to pull some stuff out of a log file. Who knows.

2. This post's broad and multi-field opacity amuses the hell out of me.
posted by rusty at 7:28 PM on November 15, 2009 [1 favorite]


MeFi dropped ", end with a " from between my carets above there. It's probably trying to clean up tags with a regexp. Ha ha.
posted by rusty at 7:29 PM on November 15, 2009


What we're all basically saying here is that RegEx works really well for your test set

I only have two problems with this. First, that's not what everybody is saying here. It's what a lot of people are saying, but certainly not all. I'm not arguing regex is a good idea in many, even the majority, of cases. But to be blunt, when people start saying that it is never OK to do this, and that people who do this are Bad Programmers or even Bad People, it chaps my ass. It sounds just so much like a bunch of fresh-faced newly-scrubbed CS students regurgitating their indoctrination of how naughty it is, terribly naughty, for programmers to do this, that, and the other thing. Baloney.

Second, it's not just test sets. There is an infinite number of potential situations, and a nontrivial number of situations in the real world, where a full-blown parser is not a good solution. Everybody loves the thought of an elegant program, but you know what? Sometimes elegance is too expensive, sometimes it takes too much time, sometimes it takes too many resources, and sometimes it is just plain unnecessary.

You don't like all those under- and poorly designed programs that break? Neither do I, and I doubt you'll find many people here who do. But something which fails to support the universe is not necessarily underdesigned for a particular task.

Actually there's a third problem. The first five sentences of the original post (and I read it until it devolved into the random "it's made me insane" cliche) are as follows: "You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML."

Five sentences, every single one of them demonstrably false for situations in the real world of programming for specific tasks.
posted by mdevore at 7:42 PM on November 15, 2009


So, maybe there are times when a regex is the right tool, but I'm going to say, if you needed to go the internet for help to debug a regex as simple as <([a-z]+) *[^/]*?> then you are probably ill equipped to decide for yourself when one of those times is.
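
A quick Python check of that pattern as quoted, assuming the goal was "opening tags that are not self-closing". The names `OPEN_TAG` and `opening_tag_names` and the test inputs are made up for this sketch.

```python
import re

# The pattern exactly as quoted above.
OPEN_TAG = re.compile(r"<([a-z]+) *[^/]*?>")


def opening_tag_names(html):
    # With a single capture group, findall returns just the tag names.
    return OPEN_TAG.findall(html)
```

It correctly skips `<br />` and picks `p` and `a` out of `<p><br /><a href="x">`. It also silently misses `<a href="/home">`, because `[^/]` cannot step over the slash inside the attribute value, and it misses `<P>` because only lowercase names are allowed.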
posted by adamt at 7:59 PM on November 15, 2009 [2 favorites]


mdevore--

It is a flat statement of fact that you will not stop an attacker from putting HTML on your site, with a RegEx based defense. If you are parsing text and looking to allow some tags but not others, RegEx will fail, and you will be pwned thoroughly.

I will go so far as to say, yes, it is never OK to use a RegEx to parse HTML for security purposes. It will never work. Ever.
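
One illustrative failure in Python, not a proof and certainly not a catalogue of attacks. The filter below is a hypothetical single-pass "strip script tags" regex of the kind people keep writing; `SCRIPT_TAG` and `naive_strip_scripts` are names invented for this sketch.

```python
import re

# Hypothetical blacklist: anything that looks like <script ...> or </script>.
SCRIPT_TAG = re.compile(r"</?script[^>]*>", re.IGNORECASE)


def naive_strip_scripts(html):
    # One pass over the input, deleting every match.
    return SCRIPT_TAG.sub("", html)
```

Give it `<scr<script>ipt>alert(1)</scr</script>ipt>` and the deletions stitch the outer fragments back together into `<script>alert(1)</script>`. Markup that never uses a script tag, such as `<img src=x onerror=alert(1)>`, passes through untouched.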

Now, for data extraction, there's only one right way to do things: Run an actual browser, and then interrogate its DOM. Everything else, from DOM generation with BeautifulSoup to RegEx, fails some portion of the time for different datasets. There is a canonical truth, That Which The Browser Sees.

I'll agree that being an absolutist in the extraction context is silly, but you have to concede BeautifulSoup-based solutions will survive a mutating web app much longer than a RegEx will.
posted by effugas at 8:00 PM on November 15, 2009


Yes you can extract data doing something like /<table>(.*?)</table>/, but THAT IS NOT PARSING HTML. You are parsing a string. That it is HTML is incidental. You are literally saying "give me what's between these two strings." If what you actually want to say is "give me the first table, as it is defined in the HTML specification" you simply cannot do this with a regular expression. You can't. It's not even up for debate whether you can or not, before you even get to whether you should or not. I could invent valid HTML that breaks whatever regex you tried to give me.
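
A small Python sketch of the nested-table case, assuming the standard-library `html.parser` as a lenient stand-in for a real HTML parser (it is not a spec-compliant one). The sample markup and the names `NESTED`, `regex_first_table`, `TableDepth` and `parser_first_table_text` are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: a table nested inside the first table, then more content.
NESTED = ("<table><tr><td><table><tr><td>inner</td></tr></table>"
          "outer tail</td></tr></table>")


def regex_first_table(html):
    # "Give me what's between these two strings."
    match = re.search(r"<table>(.*?)</table>", html, re.DOTALL)
    return match.group(1) if match else None


class TableDepth(HTMLParser):
    """Collects text inside the first top-level table by tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.first_done = False
        self.first_table_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.first_done = True

    def handle_data(self, data):
        if self.depth > 0 and not self.first_done:
            self.first_table_text.append(data)


def parser_first_table_text(html):
    parser = TableDepth()
    parser.feed(html)
    parser.close()
    return parser.first_table_text
```

The non-greedy regex stops at the inner `</table>`, so its capture never reaches "outer tail". The depth counter keeps going until the outer table actually closes, and counting depth is exactly the part a true regular expression cannot do.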

As for whether you should scrape data in that manner is a separate issue. I am inclined to say it's nearly always a bad idea, simply because it's ugly and fragile. A complete nightmare to maintain in the event anything in the source changes. When you have to account for different cases (will it always have the same attributes? Will they always use quotes? Will they have escaped quotes in them?) it rapidly becomes absurd to even try when there are parsers that actually do this correctly.

You mention "real world programming." What does this even mean? Parsers exist in the "real world." And the code utilizing them is a lot easier to read and maintain.

But hey, if you want to dig yourself into that hole, by all means do so. Not saying you can't. I concede this sort of naïve string parsing is handy in one-off hacks, especially if you haven't acquainted yourself with an actual parser library. As long as you're OK with it being a hack.
posted by cj_ at 8:00 PM on November 15, 2009 [2 favorites]


I think I was on the same page as you mdevore midway through the thread, but unless I'm mistaken a full-blown parser which you correctly maintain is not a good solution is exactly what everyone is saying you shouldn't try to do with regex.

I've got plenty of little scripts that have been running for years that I was thinking (while reading this thread) 'parsed HTML with regex' but thinking further I realize that the first thing I usually do is drop all the tags (except maybe a select few) and treat the document as text. So you might be on the same side of things as everyone else (boorish as they can be), but there are just communication issues?
posted by xorry at 8:06 PM on November 15, 2009


Ugh, on post what cj_ said.
posted by xorry at 8:06 PM on November 15, 2009


cj_: "Yes you can extract data doing something like /<table>(.*?)</table>/, but THAT IS NOT PARSING HTML."

I feel this is directed at me, so apologies if not. I generally subscribe to the use a DOM approach. It's awesome and fun and everyone should try it. Yes, you can invent HTML that doesn't pass a regex (regex's don't parse, but that's irrelevant). And I can cite you HTML that doesn't parse. One inversion of HTML tags and suddenly it's a violation of the DTD's well-formedness and your XSLT processor does the right thing and refuses. I can point to data in the real world, from data ironically itself produced by web services. But I'd rather not because they're almost certainly producing malformed data intentionally and I would rather not give the authors a heads up. It's an ugly arms race =(

The question is, how do you cope in that situation? Personally, I gave up for a while, but now I'm thinking a quick application of sed will fix things up enough to pass on to tidy and html5lib.
posted by pwnguin at 8:20 PM on November 15, 2009


Parse me arse. No clue what this is all about, in case that wasn't obvious
posted by Abiezer at 8:24 PM on November 15, 2009


pwnguin--

Yup. You've pretty much hit on why the only truly-correct-for-data-extraction approach to extracting content from HTML is to spin up an entire browser, then muck around its internal object hierarchy.

What you basically have the Beautiful Soup and (my choice) HTMLAgilityPack guys doing is trying to emulate the browser's rather forgiving parser. On balance, you get far more situations in the real world where they get it right where regex's wouldn't, rather than they get it wrong when regex's would.
posted by effugas at 8:36 PM on November 15, 2009


var MacroPattern = /(@@|\?\?|\$\$)|\$([a-z])|(?:(@)|(\?)|\$)\{(?:(?:(?:'((?:[^\\']|\\(?:.|\s))*)'|([_a-z]+)(?:\[([0-9]+)\]|<([0-9]+)>)?)(?:\/((?:[^\\\/]|\\(?:.|\s))+)\/([i]*)|#([_a-z]+))?(\?))?(?:(?:'((?:[^\\']|\\(?:.|\s))*)'|([_a-z]+)(?:\[([0-9]+)\]|<([0-9]+)>)?)(?:\/((?:[^\\\/]|\\(?:.|\s))+)\/((?:[^\\\/]|\\(?:.|\s))*)\/((?:[^\\\/]|\\(?:.|\s))*\/)?([gi]*)|#([_a-z]+))?)?(?:(:)(?:'((?:[^\\']|\\(?:.|\s))*)'|([_a-z]+)(?:\[([0-9]+)\]|<([0-9]+)>)?)(?:\/((?:[^\\\/]|\\(?:.|\s))+)\/((?:[^\\\/]|\\(?:.|\s))*)\/((?:[^\\\/]|\\(?:.|\s))*\/)?([gi]*)|#([_a-z]+))?)?|(?:(?:'((?:[^\\']|\\(?:.|\s))*)'|([_a-z]+)(?:\[([0-9]+)\]|<([0-9]+)>)?)(?:#([_a-z]+))?(\+)(?:'((?:[^\\']|\\(?:.|\s))*)'|([_a-z]+)(?:\[([0-9]+)\]|<([0-9]+)>)?)(?:#([_a-z]+))?))(?:##([_a-z]+))?\}/g;

This is the most complex / beautiful regex I've ever written for production use. It is part of my personal web bookmarks system. It is not toy code or something purposely written to be obscure. I use this software every single day, dozens of times a day.

This regex scans strings for shell-style variable substitutions like ${var}. Besides simple substitutions, it also processes more complex ones such as ${var/pattern/replacement/flags}. It supports substitutions with embedded conditionals, internal regexes, binary operators, literals, vectors, and function calls. It's about as close as you can get to parsing a programming language with a single regex.
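
For readers who want the idea without the full monster, here is a hedged Python sketch covering only the two simplest forms: `$$` as a literal dollar sign and `${name}` as a lookup that falls back to an empty string. `SIMPLE_VAR` and `expand` are invented names, and this is in no way the production pattern above.

```python
import re

# Only "$$" and "${name}"; none of the conditionals, regexes or operators.
SIMPLE_VAR = re.compile(r"\$\$|\$\{([_a-z]+)\}")


def expand(template, variables):
    def replace(match):
        if match.group(0) == "$$":
            return "$"
        # Undefined names expand to the empty string.
        return variables.get(match.group(1), "")

    return SIMPLE_VAR.sub(replace, template)
```

For example, `expand("${greeting}, ${who}!", {"greeting": "Hi", "who": "you"})` gives `Hi, you!`. Everything the big regex adds on top of this is where the extra 1,400 characters come from.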

(I had to write another regex just to escape the big regex so it wouldn't get mangled by Mefi's input validator. I ♥ regexes.)
posted by ryanrs at 8:50 PM on November 15, 2009 [7 favorites]


ryanrs--

I am sure you are proud of your creation, however, a mere glimpse of the world of reg​ex parsers for HTML will ins​tantly transport a programmer's consciousness into a world of ceaseless screaming.

And that was more than a mere glimpse.

QED
posted by effugas at 8:56 PM on November 15, 2009


Show me, I want to see.
posted by ryanrs at 9:00 PM on November 15, 2009


ryanrs--

It is impossible for you to see, you are already on the wrong side of the looking glass, but you do realize that's like Brainfuck's Big Brother put into production right?
posted by effugas at 9:08 PM on November 15, 2009 [1 favorite]


full-blown parser which you correctly maintain is not a good solution is exactly what everyone is saying you shouldn't try to do with regex.

I read the complaints about regex as people saying it can't do the job. I maintain that it can do the job depending on what the job is, even when the job involves some parsing of HTML.

Perhaps my use of full-blown was ill-advised. In context, I meant use of a parser which would support not only all expected input, but any rational input from HTML. (Full-blown parser or no, junk input could be rejected.) Basically, it served as shorthand for a fully spec-compliant XML or HTML parser. Such a beast is not needed or desirable for all tasks related to parsing HTML and other data.

Without equivocation, my use of elegant was ill-advised, because the meaning is open to personal judgement. Here, I meant elegant as in able to gracefully and seamlessly handle all input, no matter what is thrown at it. I don't think of small useful scripts as elegant, more on the order of neat or clever, but I can see how many people would also describe them as elegant for the task they do. That's as valid as my own terms, on a wild day I might even describe them as "elegant" myself.

Fundamentally, one of the problems might be one of resource expectations for different developers. In an era of gigabytes of memory and CPU speeds with multiple cores, frequently little thought is given to the overhead involved with library functions and memory images, occasionally to extremes which remain detrimental to user and operating efficiency. If you come from a background of hand-assembly on low K memory and CPU speed machines, or embedded devices, today's profligate waste of memory and CPU cycles can seem beyond the pale. And there is still something very nice about creating an unadorned tight script of a few lines to manage a task. Greasemonkey scripts seem to do quite well here, for example, although given their plug-in dependence, they are not exactly the same concept.

I am pleased to see the large increase in programming for the iPhone and other smart phones, as it should refocus some community effort to programming efficiencies and designing for a limited device and environment. Of course, it has a built-in XML/DOM parser for this case, but the overall development mindset requires a return to resource-based decisions with similar trade-offs.
posted by mdevore at 9:08 PM on November 15, 2009


s/<Humanity>(.+)</Humanity>/<Great Old Ones>($1)/is;

Me and my fellow regexp fans hope to put this one into production some day soon.
posted by benzenedream at 9:23 PM on November 15, 2009 [2 favorites]


I'll have to take your word on that, effugas. It's my understanding that this "boundary" you speak of is only clearly visible to those who haven't crossed it. When I look, I only see \b.
posted by ryanrs at 9:26 PM on November 15, 2009


You know, I just realized my regex does not match \w{3}. Is that what you guys meant by "unfit for human eyes"?

*sigh* I suppose anything that is 91.55% [[:punct:]] has a bit of blackness in its heart.
posted by ryanrs at 9:35 PM on November 15, 2009


I once used regular expressions to build a perpetual motion machine. If it weren't for the ivory-tower academics' ignorance of the four-corners of my magnificent regular expression perpetual motion machine, we'd be living in a new world of cheap, clean, energy. All provided by regular expressions.

If computer science has cranks, does that mean that it's finally a proper science? If so, it's time to dust off the typewriter and start cranking out 30-page screeds. I bet of all the crazy mail that Chomsky gets, not nearly enough of it is about how the Chomsky hierarchy is educating us stupid. I don't mean to claim that anyone here is a crank. Here, I just think people are talking past each other/don't know the theory that underlies our tools. However the idea of sending crank mail to computer scientists is really growing on me.
posted by Llama-Lime at 9:39 PM on November 15, 2009 [1 favorite]


mdevore--

RegEx *cannot do the job* for security. *CANNOT*.

ryanrs--

You looked at that and realized anything?
posted by effugas at 9:40 PM on November 15, 2009


Yeah. Then I looked at benzenedream's comment and realized it has an unescaped slash.
posted by ryanrs at 9:44 PM on November 15, 2009 [1 favorite]


Yeah, sorry, regex is a no-go. What happens when someone changes their markup in one small, indistinguishable way? Your regex is f'd. Your DOM parser still works.

Headless copy of the Gecko rendering engine, with a Greasemonkey script applied to get the valid piece of data and echo it out to the console. Call that via a shell script, and you are done.
posted by mark242 at 9:48 PM on November 15, 2009


pure programmer Not-Invented-Here syndrome. It's ridiculously easy to use a library that can take in XPATH, people just think they have big dicks when they can write their own code.
for data extraction, there's only one right way to do things: Run an actual browser, and then interrogate its DOM.

I think effugas has 1-up'd amuseDetachment. Not only should you not be using regular expressions, you shouldn't even be writing an SGML/XML parser, or even using an SGML/XML library. You should be communicating with another application.

That is some serious anti-NIH. I don't think you can get more anti-NIH without suggesting you hire a webkit, gecko, or trident dev to actually write the rest of your code for you.
posted by weston at 10:04 PM on November 15, 2009


ryanrs--

I remember this feeling. This is the feeling I got when I learned Andrew Tridgell could read SMB.

In its original hex.
posted by effugas at 10:08 PM on November 15, 2009 [1 favorite]


This is why it probably doesn't belong on MeFi

Heh, I can definitely see where you're coming from, but the diversity of topics is one of the coolest things about this place. One of the consequences of that is excluding large groups from interest. However, since there are quite a lot of fellow computer science nerds here, it's not like there aren't people here who wouldn't appreciate this. Besides, it's nice to see some debate to this effect (which, before you snark, I realize isn't the point of Mefi, just an interesting side effect).
posted by spiderskull at 10:20 PM on November 15, 2009


I'm pretty sure that you can parse anything with regular expressions.

Godel already answered this question back in the 1930s.

As for XML, no regex expects the Spanish Inquisition.
posted by Twang at 10:23 PM on November 15, 2009


I hope in your actual production code, ryanrs, that regex is commented.
posted by kenko at 10:23 PM on November 15, 2009 [1 favorite]


RegEx *cannot do the job* for security. *CANNOT*.

So what. I think I heard a gasp, so let me repeat that to assure everyone I actually said it.

So. What.

Although too many programs have bad security, that doesn't mean every program on the planet that parses HTML has to be security aware. Not every layer of every piece of a program which has access to any tiny bit of data needs to concern itself with security. Not today, nor even tomorrow.

If I write a script using regex to parse your latest rating off of Hot-Or-Not, as a dumb example, I don't care about regex security in the pull script. Now, I care a whole lot about my firewall, my connection security, my browser, my interpreter, my OS security, and so on, but the actual script? Whatever. As long as it's not going to run external input in protected space. But here, its sole task is to grab a value which can be nailed down, easily, to a simple binary yes or no value with defaults on junk input, and then tell me you're a beautiful guy. Extend the idea where safe and more useful.

You can safely sanitize a script to execute different commands based upon script values too, which probably also caused another collective gasp. However, if you limit the execution to known and safe command values based on fixed matching parsed input, rather than raw passed input, it is no different than manipulating a known value you received from any input, including a spec-compliant XML parser. Just don't do stuff like shell the damn thing or set it free to blow the stack. Easy enough circumstances to avoid for basic data manipulation.
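
A sketch of that "fixed matching" idea in Python. The markup, the action names and the identifiers `ALLOWED_ACTIONS`, `ACTION_FIELD` and `choose_action` are all hypothetical. The scraped text is only ever used as a key into a fixed table; it is never executed or passed to a shell.

```python
import re

# Hypothetical fixed table of known-safe actions.
ALLOWED_ACTIONS = {
    "start": lambda: "starting",
    "stop": lambda: "stopping",
}

# Hypothetical markup this one-off script expects to scrape.
ACTION_FIELD = re.compile(r'<span class="action">([a-z]+)</span>')


def choose_action(html, default="stop"):
    match = ACTION_FIELD.search(html)
    name = match.group(1) if match else default
    if name not in ALLOWED_ACTIONS:
        # Junk or unexpected input falls back to the default.
        name = default
    return ALLOWED_ACTIONS[name]()
```

Whatever the page says, the script can only ever do one of the things listed in the table.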
posted by mdevore at 10:25 PM on November 15, 2009


@ryanrs:

/(@@|\?\?|\$\$)|\$([a-z])|(?:(@)|(\?)|\$)\{(?:(?:(?:'((?:[^\\']|\\(?:.|\s))*)'|([_a-z]+)(?:\[([0-9]+)\]|<>)?)(?:\/((?:[^\\\/]|\\(?:.|\s))+)\/([i]*)|#...

I don't know any regex, to me that looks like you're trying to revive Fluxus
posted by Twang at 10:28 PM on November 15, 2009


Yeah. Then I looked at benzenedream's comment and realized it has an unescaped slash.

Urk! I should have escaped Humanity since you cannot escape the Great Old Ones.
posted by benzenedream at 10:29 PM on November 15, 2009 [2 favorites]


mdevore--

I'm the guy in the security space who reminds people there's a world outside of our own. So I hear you.

When someone wants to use RegEx to parse HTML, they want to do one of two things: Either extract data, or reject malicious HTML. RegEx's cannot do the latter, and yes, people keep getting this wrong and thus keep getting pwned by Russian Hackers. I cannot emphasize enough, this is a really common thing to try, at least equally common as data extraction.

Regexes can do the former, but it tends to be fragile, much more fragile (and complicated) than just loading in a loose parser and walking the DOM it renders. Not to mention the added cost of maintenance every time you have to go back. Can you imagine anyone in the world going back to ryanrs's regex and parsing it?

Regarding your safe execution model, yes, if you basically have a trivial case statement, and input can only select one of n known OK values, you can survive. However one only finds that in production when people got burned horrifyingly by -- wait for it -- that damn regex. Never, ever before.
posted by effugas at 10:35 PM on November 15, 2009


Alright, let me take a different tack here then. Why wouldn't you parse the DOM? Are you maintaining that a regular expression is easier to read, more simple to code, faster, or what?

I'll buy that in certain situations it is less computationally expensive (BeautifulSoup is notoriously slow when swallowing large, complex documents), but you'd have to make a case that this is important enough to give up readability, flexibility, and ease of development. There aren't many real world cases where you are scraping HTML in a tight loop.
posted by cj_ at 11:10 PM on November 15, 2009


I hope in your actual production code, ryanrs, that regex is commented.

Like the bright colors on a poisonous snake, the triple-backslashes warn of danger.
posted by ryanrs at 11:13 PM on November 15, 2009 [5 favorites]


Just kidding.

////////////////////////////////////////////////////
// Available macro substitutions:
//
//  Simple substitutions:
//      $$              Literal dollar sign
//
//      ${name}         Vars[name] if defined, else "".
//      $a              Same, but for single letter names only.
//
//      ${name[ix]}     Vars[name].split( '|')[ix]
//      ${name<jx>}     Vars[name].split('\n')[jx]
//
//      ${name#func}    func(Vars[name])
//      ${name##func}   Same
//
//
//  Regex substitutions:
//      ${name/pat/repl/flags}          Vars[name].replace(/regex/,replacement)
//      ${name/pat/repl/sep/flags}      Like gnu grep -o.  Non-matching text is replaced with sep.
//
//      ${name/pat/repl/flags##func}        func(regex result)
//      ${name/pat/repl/sep/flags##func}    func(regex result)
//
//      A literal 'str' can be used in place of name in the above substitutions.
//
//
//  Conditional selection:
//      ${cond?clause1:clause2}            If cond is true, use clause1, else clause2.
//      ${cond?clause1:clause2##func}      func(conditional result)
//
//      Cond:
//          'str'               Literal string
//          name                True if Vars[name] is not one of { 0, "", null, undefined, etc. }
//          name#func           True if func(Vars[name]) is not zero, etc.
//          name/pat/flags      True if Vars[name] matches regex (actually, searches)
//
//      Clause:
//          'str'                       Literal string
//          name                        Vars[name]
//          name#func                   func(clause); compare to ##func which applies to whole expression
//          name/pat/repl/flags         clause.replace(regex,repl)
//
//
//      Shorthand notation:
//          ${cond?clause1}         Same as ${cond?clause1:''}
//          ${cond?:clause2}        Same as ${cond?'':clause2}
//          ${clause1:clause2}      Same as ${clause1?clause1:clause2}
//
//
//  Binary operators:
//      ${op1+op2}          String concatenation.
//
//      Operands:
//          'str'       Literal string
//          'str'#func  func('str')
//
//          name        Vars[name]
//          name#func   func(Vars[name])
//
//
//  Home/search conditionals:
//      @{substitution}         if (!Vars.q) then substitution, else ''
//      ?{substitution}         if ( Vars.q) then substitution, else ''
//
//      Substitution can be a simple substitution, regex substitution,
//      conditional selection, or binary operation.
//
//
//                            var MacroPattern =
// esc                            /(@@|\?\?|\$\$)|
// gvar                            \$([a-z])|
// home, search                    (?:(@)|(\?)|\$)
//                                   \{
//                                     (?:
//                                         (?:
// cstr,cvar,cix,cjx                         (?:'((?:[^\\']|\\(?:.|\s))*)'|([_a-z]+)(?:\[([0-9]+)\]|<([0-9]+)>)?)
// cpat,cflag,cfnc                           (?:\/((?:[^\\\/]|\\(?:.|\s))+)\/([i]*)|#([_a-z]+))?
// qmark                                     (\?)
//                                         )?
//                                         (?:
// astr,avar,aix,ajx                         (?:'((?:[^\\']|\\(?:.|\s))*)'|([_a-z]+)(?:\[([0-9]+)\]|<([0-9]+)>)?)
// apat,asub,asep,aflag,afnc                 (?:\/((?:[^\\\/]|\\(?:.|\s))+)\/((?:[^\\\/]|\\(?:.|\s))*)\/((?:[^\\\/]|\\(?:.|\s))*\/)?([gi]*)|#([_a-z]+))?
//                                         )?
//                                         (?:
// colon                                     (:)
// bstr,bvar,bix,bjx                         (?:'((?:[^\\']|\\(?:.|\s))*)'|([_a-z]+)(?:\[([0-9]+)\]|<([0-9]+)>)?)
// bpat,bsub,bsep,bflag,bfnc                 (?:\/((?:[^\\\/]|\\(?:.|\s))+)\/((?:[^\\\/]|\\(?:.|\s))*)\/((?:[^\\\/]|\\(?:.|\s))*\/)?([gi]*)|#([_a-z]+))?
//                                         )?
//                                       |
// NOTE: only '+' is implemented           (?:
// xstr,xvar,xix,xjx,xfnc                    (?:'((?:[^\\']|\\(?:.|\s))*)'|([_a-z]+)(?:\[([0-9]+)\]|<([0-9]+)>)?)(?:#([_a-z]+))?
// binop                                     (\+|%%?|##?)
// ystr,yvar,yix,yjx,yfnc                    (?:'((?:[^\\']|\\(?:.|\s))*)'|([_a-z]+)(?:\[([0-9]+)\]|<([0-9]+)>)?)(?:#([_a-z]+))?
//                                         )
//                                     )
// gfnc                                (?:##([_a-z]+))?
//                                   \}
//                                /g;
//

posted by ryanrs at 11:13 PM on November 15, 2009 [5 favorites]


cj_: "Alright, let me take a different tack here then. Why wouldn't you parse the DOM? Are you maintaining that a regular expression is easier to read, more simple to code, faster, or what?"

Are you still arguing with me? Because I don't see why you're arguing against someone who agrees with you. "Parsing the DOM" doesn't make much sense; either the DOM is an OO version of a grammar, or an explicit instance of it (aka Abstract Syntax Tree). In the former, parsing grammars is kinda what parser-generators do but not relevant here. In the latter case, you've already parsed it into an AST, why do it again? You want to process the AST if you have it, not parse it. While regular expressions are faster and consume less RAM than say context free grammar parsers, regex is not generally appropriate for the task at hand.

My nuanced position isn't that hard: if you can generate a DOM do so. When you can't, fuck around with mixing Turing complete languages and regex and any other tool at your disposal, and join a cult dedicated to the resurrection of a dead god.
posted by pwnguin at 11:39 PM on November 15, 2009 [1 favorite]


Can you imagine anyone in the world going back to ryanrs's regex and parsing it?

It's really not so bad. You make your changes to the commented version, where it's broken down into sections. Then you use cut-n-paste to join the parts together and replace the live copy in the code. This process has the nice benefit of keeping the comments up-to-date.
posted by ryanrs at 11:43 PM on November 15, 2009


ryanrs: "Then you use cut-n-paste to join the parts together and replace the live copy in the code."

I'm surprised you don't have a sed script to do that for you, considering the circumstance.
posted by pwnguin at 11:53 PM on November 15, 2009 [1 favorite]


You mean this sed script?

sed -Ene $'s/^.* {4}//g;H;${x;s/\\n//gp;}'
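
For anyone who does not read sed: on each line it deletes everything up to and including the last run of four spaces, piles the lines into the hold space, and at the end removes the newlines and prints the result. A rough Python equivalent follows; `join_commented_regex` is a made-up name, and the sketch assumes you feed it only the annotated pattern lines.

```python
import re


def join_commented_regex(lines):
    pieces = []
    for line in lines:
        line = line.rstrip("\n")
        # Greedy .* means the LAST run of four spaces wins, as in the sed script.
        pieces.append(re.sub(r"^.* {4}", "", line))
    # Concatenate with no separators, like deleting the newlines in sed.
    return "".join(pieces)
```

Lines with no run of four spaces pass through unchanged, which matches what the sed substitution does when it fails to match.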
posted by ryanrs at 12:06 AM on November 16, 2009 [3 favorites]


Ok, this is getting ridiculous. Do you know how easy it is to use XPATH? Here. Let me show you. The original poster wanted to find all the <p> tags and <a> tags.

Here's the code to find all the content inside those tags in ruby using nokogiri, only using the code listed in their example synopsis:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

###Load Google homepage
doc = Nokogiri::HTML(URI.open('http://google.com/'))

###Get all the links and output the text
doc.xpath("//a").each { |data| puts data.content }
###Which will output the following...
#Images
#Videos
#Maps
#....

###Get all the "P" tags
doc.xpath("//p").each { |data| puts data.content }
###Which will output the following...
#©2009 - Privacy
How friggin hard could it possibly be? It's like 6 lines of code, and retard easy to use, dead obvious to maintain, with documentation on the homepage.
posted by amuseDetachment at 12:18 AM on November 16, 2009 [1 favorite]


> Are you still arguing with me?

Um, nothing I've said was directed at you in the first place. It wasn't directed at anyone in particular, but if I had to pick someone it would be mdevore, I guess.

I just want to know why people defending the practice of scraping HTML via regex feel this is a better solution than using a parser library. Do you think regular expressions make for more maintainable code? Or make it more readable? Faster to develop? Faster to execute? Safer?

Surely there's a reason to plant your flag on this hill if you are going to argue about it.
posted by cj_ at 12:36 AM on November 16, 2009


You really need an explanation? Fine. I admit, I use regexes to scrape porn sites. The regexes get me off, and the pics of naked ladies make it feel 'normal'. Don't judge me.
posted by ryanrs at 12:47 AM on November 16, 2009 [8 favorites]


I just want to know why people defending the practice of scraping HTML via regex feel this is a better solution than using a parser library.

I mostly see people with problems with the sweeping statements and dire pronouncements. I agree with a lot of the criticisms here, but I just can't get all that worked up about it, except maybe a little for sanitization. Scraping HTML via regex is not crossing the streams, it's a dirty solution that will probably work fine for many cases. Maybe that's good enough.

Faster to develop?

If you already know regular expressions, and if you are not already familiar with a good parser, this probably seems true, even when it's not. I'm probably as intellectually curious as the next guy, but I'm familiar with the weight of the feeling that I'm going to have to spend some unknown amount of time becoming acquainted with new material in order to get something done.

Particularly if you're working in PHP, in which you tend to come upon all sorts of little failures and weird gotchas as a matter of course.
posted by weston at 2:29 AM on November 16, 2009


Faster to develop?

This, for me. I can, or could once, create a regex in seconds for a quick task. I haven't in quite a while because Greasemonkey+XPath serves quite well for scraping and processing of web page content grabs, but runs second in speed and packaging convenience. Other programmers will have different development priorities for their tasks and targets, possibly causing them to use a regex.

For comparison, an XPath parsing approach isn't automatically superior to regex for programming, again depending on the target data and environment. Problems with XPath can be opaque and browser implementations have been buggy. As recently as FF 3.0, XPath didn't always do things like I think it should have based on the W3C docs, whereas I can always eventually figure out a regex issue. Maturity, stability, and a wealth of documentation shouldn't be completely discounted. Plus, once you get some experience with regex, it doesn't look so much like line noise.

Anyway, did I ever mention that I like Greasemonkey? Heckuva handy tool for manipulating the DOM in the browser. I'm not sure I've used regex for HTML parsing since it came out. Maybe. Maybe not. But I don't feel guilty or idiotic about it if I did.
posted by mdevore at 3:18 AM on November 16, 2009


Although too many programs have bad security, that doesn't mean every program on the planet that parses HTML has to be security aware.

Bobby? Little Bobby Tables - is that you?
posted by bashos_frog at 5:01 AM on November 16, 2009 [1 favorite]


I have parsed lots of things, HTML included, with regular expressions. Logic 101: I (and millions of others who have done similar tasks) provide a counterexample to your assertion that you cannot parse HTML with regular expressions.

No. What you, and millions of others, have done is parse a subset of HTML; calling something that can always be parsed by a regular expression HTML does not make it so. As has already been pointed out, HTML is a context-free grammar, which cannot be parsed by a regular expression.

Claiming code that uses a regular expression is able to handle HTML is asking for trouble when someone decides to use the W3C specification and not your personal regular grammar version of it. If you aren't actually using the W3C version why bother calling it HTML at all?
posted by prak at 6:51 AM on November 16, 2009 [1 favorite]
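
A minimal sketch of the nesting problem described above, using only Python's standard library. The sample markup and the class name OuterDivText are made up for illustration; the point is just that a plain regex has no way to count nesting depth, while even a simple event-based parser can.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: a <div> nested inside another <div>.
SAMPLE = "<div>outer <div>inner</div> tail</div>"

# Non-greedy regex: it stops at the first </div>, so the outer element is cut short.
regex_result = re.search(r"<div>(.*?)</div>", SAMPLE).group(1)


class OuterDivText(HTMLParser):
    """Collect all text inside <div> elements, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while we are inside at least one <div>.
        if self.depth > 0:
            self.chunks.append(data)


parser = OuterDivText()
parser.feed(SAMPLE)
parser.close()
parser_result = "".join(parser.chunks)

print(repr(regex_result))   # 'outer <div>inner'
print(repr(parser_result))  # 'outer inner tail'
```

A greedy pattern fixes this one sample and breaks on two sibling divs instead, which is the usual start of the special-case spiral.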


Right on. I told Microsoft the same thing.
posted by ryanrs at 8:19 AM on November 16, 2009


Unreadable RegEx FTW!
posted by chunking express at 8:59 AM on November 16, 2009


My RegEx is self-hosting.
posted by blue_beetle at 12:06 PM on November 16, 2009


prak--

To be fair, nothing implements W3C. The whole point of HTML is that it's a loose-y goose-y best effort language, where all players agree on this totally undocumented subset of error renderings because it's better than crashing.

This is actually the revolutionary invention of HTML: With every other parsed language, an error leads to parsing stopping. Only HTML keeps going, doing the best it can. This was, and is, revolutionary, and is unseen quite literally anywhere else. It was a huge part of why HTML won.
posted by effugas at 2:53 PM on November 16, 2009 [1 favorite]
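
The "keeps going" behaviour can be seen with two parsers from Python's standard library. This is only a sketch: the broken sample string and the helper names are made up, and html.parser is a tolerant tokenizer, not an implementation of any browser's error-recovery rules.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical tag soup: nothing here is ever closed.
BROKEN = "<p>one<p>two<b>bold"


def strict_accepts(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Record every run of text the tolerant parser reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def lenient_text(markup):
    """Feed markup to the tolerant parser and return whatever text it found."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


print(strict_accepts(BROKEN))  # False: the XML parser stops at the first error
print(lenient_text(BROKEN))    # onetwobold: the HTML parser carries on
```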


I've been thinking about the sanitization issue, and I'm not so sure that regexs can't do a decent job of implementing a scorched earth complete removal.

Something like this:

s/<[^>]+(\s+\w*=('[^']+'|\w+)|\s+\w*=("[^"]+"|\w+))*\s*>//g

as an attempt at stripping out good faith markup plus something like this:

s/<//g
s/>//g

would seem to mean that it's pretty hard to feed anything through which is going to be parsed as markup by a browser.

How would someone attack this?
posted by weston at 3:34 PM on November 16, 2009
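
One hedged way to poke at the question above: a direct Python translation of the three substitutions (the function name scorch and the sample inputs are made up). No angle brackets survive, but a payload that never needed angle brackets passes through untouched, so whether the result is safe depends on where the output ends up, for example inside an attribute value.

```python
import re

# Translation of the sed-style tag-stripping pattern from the comment above.
TAG_RE = re.compile(r"""<[^>]+(\s+\w*=('[^']+'|\w+)|\s+\w*=("[^"]+"|\w+))*\s*>""")


def scorch(text):
    """Strip tag-like runs, then delete any angle brackets that are left."""
    text = TAG_RE.sub("", text)
    return text.replace("<", "").replace(">", "")


# Ordinary markup is removed; note that the script body text stays behind.
markup = "<b onclick='x'>hi</b> <script>alert(1)</script>"
print(scorch(markup))  # hi alert(1)

# A payload with no angle brackets is returned unchanged. If the caller later
# writes it into something like <input value="...">, the quote closes the
# attribute and the rest becomes a new attribute.
attr_payload = '" onmouseover="alert(1)'
print(scorch(attr_payload) == attr_payload)  # True
```

So the scorched-earth pass looks sufficient only when the output lands in plain text content; attribute, URL, and script contexts each need their own escaping.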


Jeff Atwood disapproves of this thread.
posted by gsteff at 3:59 PM on November 16, 2009


No wait, the other one. I didn't read his entire post before linking it here and, having now done so, would say that he actually agrees with those who have said there's no need to be dogmatic.
posted by gsteff at 4:04 PM on November 16, 2009


This is actually the revolutionary invention of HTML: With every other parsed language, an error leads to parsing stopping. Only HTML keeps going, doing the best it can

No, this makes no sense. Applications that need to accept HTML keep going because the rendering must go on, or whatever, but it can hardly be part of the language HTML that loosey-goosiness is ok (if it were then all those loosey-goosey pages wouldn't actually be loosey-goosey but rather strictly correct).
posted by kenko at 6:36 PM on November 16, 2009


I don't know Jeff Atwood and he doesn't know me, but the next time he pulls a multi-line quote from something I posted here, or that anybody posted here, he might consider letting the person he quoted know. It's the right thing to do.

Even if he doesn't like what I said.
posted by mdevore at 7:31 PM on November 16, 2009


It's the right thing to do.

I have his home address if you'd like to sue.
posted by pwnguin at 8:48 PM on November 16, 2009


Nope.

Oh wait, you were implying that it isn't the right thing to do. That was clever of you, you almost had me.
posted by mdevore at 9:01 PM on November 16, 2009


Kenko, the CSS and HTML5 standards require renderers to behave in specific ways when encountering errors in a document. In these standards, the requirements for a conforming renderer are quite different than the requirements for a conforming document. A renderer that correctly handles every possible conforming document, but pukes on broken documents, is not a conforming renderer according to the standard. In other words, broken documents do not give the renderer a free pass to just do whatever it wants.
posted by ryanrs at 12:12 AM on November 17, 2009


To be fair, nothing implements W3C.

That doesn't mean you can claim to have done it with tools that are computationally incapable of managing it.

Many compilers don't seem to manage to implement ANSI C exactly as defined by the specification but I hope no one would disagree with it being madness to even consider one that attempts it using regular expressions.
posted by prak at 5:09 AM on November 17, 2009


You know, I once owned one of those lame battery-powered lawnmowers. Didn't mow very well, but it worked for my then-lawn. It wouldn't have worked on a lot of other lawns, didn't have the power or the endurance, and it certainly would pass no tests on the Manly Arts Standard Compliance Exam.

Lawnmowing purists may insist that it was madness to attempt to use the lawnmower because it didn't mow lawns per the Manly Arts National Lawn Society standards, and even if my lawnmower did work for many lawns, it was merely a grass-cutting hacking tool and not a "real" lawnmower.

But I bet you most people would still call it a lawnmower.

I do think it's interesting how the hand-wringing here over ZOMG! People! Using! RegEx! To! Parse! HTML! has magically morphed what is a blog post complaining about people using regex to parse HTML, a task done every day, into an academic treatise on the impure subsets and mathematically incomplete aspects of using regex to parse HTML. It's like people hate the idea so much, so very very much, they deny the simple reality of it.
posted by mdevore at 10:23 AM on November 17, 2009


You are pushing a cardboard mockup of a lawnmower over AstroTurf and claiming that because the AstroTurf never gets overgrown you are mowing a lawn.
posted by prak at 10:35 AM on November 17, 2009 [2 favorites]


You are pushing a cardboard mockup of a lawnmower over AstroTurf

Well, since we're into pronouncements...You are denying that the lawnmower both runs and actually mows. Or to reiterate, you deny the simple reality of it.
posted by mdevore at 10:42 AM on November 17, 2009


You are denying that the lawnmower both runs and actually mows.

It looks damn good in photographs and is light enough not to tire you out pushing it around an all day photo session.

AstroTurf and grass may be similar in many ways and even look similar on a cursory examination but they are not the same thing. Tools used successfully with AstroTurf are not therefore imbued with the power to deal with grass. If you can't guarantee you will only be dealing with AstroTurf you need tools that are designed to handle grass. If you can guarantee always having AstroTurf; don't call it grass.
posted by prak at 11:04 AM on November 17, 2009 [1 favorite]


mdevore--

You are sticking your fingers in your ears on this thread.

Nobody is denying that, given a body of text that conforms to HTML and a RegEx designed to parse that text, data can be extracted. Of course it can be. This is stipulated repeatedly, and it's a little tiresome having you repeatedly straw man the opposition.

What we are saying is that using a regular expression in this manner is:

1) More fragile in the face of real world variations in HTML
2) More difficult to code for -- jesus, with a library, it's basically "for each item inside an a tag, print it".
3) More difficult to debug
4) Impossible to secure
5) Impossible to stabilize -- you will always be poking at the damn thing when it misses something

I can write a production HTML parser in Brainf*ck too (it won't look much different than ryanrs's code). But it's a bad idea. The problem with regexes is people think "Oh, it's so easy to just match all text within b tags" and whip out a few characters. But then they realize it's a little harder, and a little harder, and a little harder.

And then they end up ryanrs.

I do like the lawn/astroturf analogy. It's much easier to mow a lawn that does not grow.
posted by effugas at 11:31 AM on November 17, 2009 [3 favorites]
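
For what it's worth, the "for each item inside an a tag, print it" version from point 2 might look like this with nothing but Python's standard library. The class name LinkText and the sample markup are made up; a fuller library (BeautifulSoup, lxml, nokogiri) would make it shorter still.

```python
from html.parser import HTMLParser


class LinkText(HTMLParser):
    """Collect the text content of every <a> element."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.current = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":  # tag names arrive lowercased
            self.in_link = True
            self.current = []

    def handle_data(self, data):
        if self.in_link:
            self.current.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.links.append("".join(self.current).strip())
            self.in_link = False


# Hypothetical sample with mixed case, mixed quoting, and a nested <b>.
SAMPLE = """<p>See <a href="/maps">Maps</a> and
<A HREF='/img' class=x><b>Images</b></A>.</p>"""

parser = LinkText()
parser.feed(SAMPLE)
parser.close()

for text in parser.links:
    print(text)
```

The case, quoting, and nested-tag variations in the sample are the sort of thing that each cost a regex another special case.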


AstroTurf and grass may be similar in many ways and even look similar on a cursory examination but they are not the same thing.

Good golly Miss Molly, you are absolutely right. It's not grass I was mowing to avoid being cited by the city. It was a horrible abomination masquerading as real grass. (Given my lawn's condition, that might actually have been true.) Anyway, good thing nobody was using the telephoto in the photoshoot. Of course, I didn't see the paparazzi, they were probably lurking in the bushes.

You must live in a fascinating world. You drive over to the grocery store and it has every grocery item in the world. It has to, otherwise it couldn't be called a grocery store. You come home and sit down in front of a computer which can run every program ever written. You type messages on a keyboard which supports the full character set of anything you might ever want to type, including, perhaps, Zalgo. All while sitting in a chair which supports any human being who cares to sit in it, even if Manuel Uribe comes by for a visit.

Clearly, the language requirement that a subset must always contain the full set confers major advantages.
posted by mdevore at 11:44 AM on November 17, 2009


You are sticking your fingers in your ears on this thread.

Not true, I read every message even when they, like you, repeat the same arguments made before. Now that's tiresome. But:
Once more unto the breach, dear friends, once more;
Or close the wall up with our English dead.
In peace there's nothing so becomes a man
As modest stillness and humility:
But when the blast of war blows in our ears,
Then imitate the action of the tiger;
Meow. So let's try round 'n', where 'n' is becoming a large number.

1) More fragile in the face of real world variations in HTML

What real world variations? If the person running the script has known input, there is no unexpected variation. Despite all the howls that this circumstance does not occur, well, I'm sorry, but yes it does.

2) More difficult to code for -- jesus, with a library, it's basically "for each item inside an a tag, print it".

More difficult for everything? I have to doubt your omniscience. Again, a quick regex to parse a small set of data is about as dirt simple as you can make it. Well, I guess you have to know regex, I'll give you a quarter point on that one, on the assumption that most people don't know it.

3) More difficult to debug

Related to point #2, how difficult is it to debug a basic regex which parses a small well-defined set of data with no variations?

4) Impossible to secure

As previously pointed out, security is not always a factor in parsing HTML data. Frequently not, I'd wager.

5) Impossible to stabilize -- you will always be poking at the damn thing when it misses something

Stabilize what? If I work with a known data set, do you think it has to grow uncontrollably? That it is an incontrovertible law of the universe that input must always change in ways that cannot be anticipated? Sorry, there is no such law. For example, I do exert full control on HTML on several web pages and data. Hardly a unique situation.

It's funny you bring up a strawman. Several of you have run over to this poor defenseless scarecrow that's propped up close by and walloped the hell out of the thing while I watched in bemusement. I mean, sure it's a better-looking target, but Michael it ain't.

So let me make this as clear as I can: You do not get to declare what the universe of HTML parsing consists of for every programmer that programs on this planet. You do not get to declare what will happen with their input. You do not get to state that their available resources are such that a particular choice they make is always wrong.

You DO get to say what best practices are for a majority, but not all, of situations and circumstances. And boy have you. Funny thing, though, I don't recall anyone challenging you there.

What you can also do is express how much you dislike the practice by bitching about it to me as if a) I represent all the programmers who do it and b) you're going to make any significant headway in changing the practice. OK, you did. Now maybe you could move on to something else, like debating whether the go to statement is harmful. Dijkstra beat you to that one by 40 years, but I'm sure that battle, too, still rages somewhere in the bowels of the Internet.
posted by mdevore at 3:01 PM on November 17, 2009


You must live in a fascinating world. You drive over to the grocery store and it has every grocery item in the world. It has to, otherwise it couldn't be called a grocery store.

You make the claim to be able to shop at grocery stores equipped with only American dollars; this works fine so long as you can be sure that you will never shop outside of America. However, if you are taken to Europe you are likely to run into trouble. Even if you manage to get it to work in foreign grocery stores some of the time, you do not really have the ability to shop at any grocery store with what you have.

While you may feel distinguishing between your ability to shop at American grocery stores and all grocery stores is not important to you it may be to someone else depending on you to do their shopping. Claiming to be able to shop at any grocery store while at the same time not revealing you don't even have a passport is just foolish; you can't be sure what assumptions others will make about your abilities.

Just say American grocery stores if that is what you mean. You may not have the same selection but you also won't be asked to pay in euros.
posted by prak at 7:58 PM on November 17, 2009


Alright, dudes. Time for an example. Here's how I used regexes to parse some HTML and download the source code for my OS (the open source bits, anyway). This is something I wanted to do, and regexes seemed like a fine way to do it.

First I went to Mac OS X 10.6.2 Source.

Then I viewed the source, selected the guts of the big table, and copied it to the clipboard.

Then I ran the following from the terminal:
for x in $(
    pbpaste |
    egrep -o '<a href="/tarballs/[^"]+' |
    cut -c 10-
); do
    echo $x
    curl -O http://www.opensource.apple.com$x 2>/dev/null
done
It worked great. It took less than five minutes to write. I didn't even need to open an editor, I just hammered it out on the command line. The first time I ran it, I substituted echo for curl. The commands looked fine, so I ran it for real.

I do this kind of thing fairly often, nearly always from the command line. Are there problems? Sure, but not the ones that have been mentioned. My biggest annoyance is the crappy broken grep Apple ships. It's a really old version of GNU Grep that does not support combining -o and -i, and does not support non-greedy repetition. I suppose the latter is POSIX's fault, but the former is just a bug.

I realize this is a bit of a strawman in that everyone else is talking about code with more persistence than a bash one-liner. I wouldn't put this code in a script. But I also wouldn't try to use a Python HTML parser from the command line.

(Actually, I did try. Python's whitespace rules made it a pain in the ass.)
posted by ryanrs at 3:33 AM on November 19, 2009
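
For comparison, a sketch of the same extraction done both ways in Python. The table rows below are invented, not copied from the real page, and the second row (with a class attribute ahead of href) is contrived to show where the two approaches would diverge; on the actual page the one-liner reportedly worked fine.

```python
import re
from html.parser import HTMLParser

# Hypothetical stand-in for the copied table source.
SAMPLE = """
<tr><td><a href="/tarballs/foo/foo-1.tar.gz">foo</a></td></tr>
<tr><td><a class="dl" href="/tarballs/bar/bar-2.tar.gz">bar</a></td></tr>
<tr><td><a href="/source/baz/">baz</a></td></tr>
"""

# Same idea as the egrep | cut pipeline: match the literal prefix, keep the path.
regex_hrefs = re.findall(r'<a href="(/tarballs/[^"]+)', SAMPLE)


class TarballLinks(HTMLParser):
    """Collect href values that point under /tarballs/."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith("/tarballs/"):
            self.hrefs.append(href)


parser = TarballLinks()
parser.feed(SAMPLE)
parser.close()
parser_hrefs = parser.hrefs

print(regex_hrefs)   # only the first row: the pattern expects href right after <a
print(parser_hrefs)  # both tarball rows, regardless of attribute order
```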


It's a matter of taste I guess, whether you use regex or xpath:

xgrep -x "//td[@class='project-downloads']/a/@href"

The main advantage is that I don't have a pbpaste that I know of on Ubuntu, so XPath lets me narrow down the selection from the source. Which is easy on the example page (yay CSS). You can probably throw in a quick trailing regex to throw away the href= part before piping into wget -i - -B http://www.opensource.apple.com.

Ideally, xgrep would give some obvious output options to convert nodesets to text and we could discard sed. Anyways, I need to excuse myself from this thread as I do have work to accomplish this week!
posted by pwnguin at 10:28 AM on November 19, 2009
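
A rough Python equivalent of the XPath route, as a sketch only. It assumes the table is well-formed XML (real pages often aren't, in which case a tolerant HTML parser would be needed first), and the snippet below is invented apart from the project-downloads class name taken from the XPath above. The standard library's ElementTree supports only a subset of XPath, so the trailing /@href step becomes a .get("href") call.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE = "http://www.opensource.apple.com"

# Hypothetical, well-formed stand-in for the downloads table.
SNIPPET = """<table>
  <tr>
    <td class="project-name">foo</td>
    <td class="project-downloads"><a href="/tarballs/foo/foo-1.tar.gz">tar</a></td>
  </tr>
  <tr>
    <td class="project-name">bar</td>
    <td class="project-downloads"><a href="/tarballs/bar/bar-2.tar.gz">tar</a></td>
  </tr>
</table>"""

root = ET.fromstring(SNIPPET)

# Select the <a> children of the download cells, then read their href attribute.
links = root.findall(".//td[@class='project-downloads']/a")
urls = [urljoin(BASE, a.get("href")) for a in links]

for url in urls:
    print(url)
```

This also takes care of the base-URL step that wget -B handles in the pipeline above.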




This thread has been archived and is closed to new comments