
wondering

Posted: Sat May 06, 2006 2:01 am
by demond
just out of curiosity:

does anyone know of a webscript which features flexible HTML parsing, utilizing XPath or a similar technique?

e.g. as soon as the following page code:

Code: Select all

...
<tr id=foo ...
   <td id=bar ...
      stuff
   ...
   </td>
</tr>
is changed to:

Code: Select all

...
<tr id=foo ...
   <td id=moo ...
      <table ...
         <td id=bar ...
            stuff 
...
you are screwed if you use a script that gets to "stuff" by using regexp/regsub to locate the <td> tag with id=bar (which is pretty much every script known to me)

naturally, XPath is not a panacea against web page changes, but I'd imagine it could provide a far greater degree of flexibility
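to illustrate the point (a hypothetical sketch, assuming the tdom Tcl extension is installed; the markup is a condensed version of the snippets above): the same XPath query keeps working against both the old and the new page, because // matches at any depth:

```tcl
package require tdom

# old and new versions of the page, condensed from the examples above
set old {<table><tr id="foo"><td id="bar">stuff</td></tr></table>}
set new {<table><tr id="foo"><td id="moo">
           <table><tr><td id="bar">stuff</td></tr></table>
         </td></tr></table>}

foreach html [list $old $new] {
    set doc  [dom parse -html $html]
    set root [$doc documentElement]
    # // searches at any depth, so the query survives the extra nesting
    set node [$root selectNodes {//tr[@id='foo']//td[@id='bar']}]
    puts [string trim [$node text]]    ;# stuff, both times
    $doc delete
}
```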

Posted: Sat May 06, 2006 6:04 am
by De Kus
are you talking about web script as in PHP? There are tDOM and XML (or was tDOM the XML module?!) modules for PHP, but I have no idea how to use them. Maybe their manuals give you some hints, if they support your intended way of manipulation.

PS: the advantage of a regular expression over a scan(f) expression is its flexibility. Using \t+ or \s+ instead of a specific number of chars should give it a certain flexibility.
I mean, something like "<td .*?id=bar.*?>\n\t+(.+?)\n\t*</td>" should be flexible in the way you just showed.
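for instance (the page fragment here is made up; it's only the \t+ and \t* quantifiers doing the work), the pattern tolerates any depth of tab indentation:

```tcl
# hypothetical page fragment with tab indentation
set body "<tr id=foo>\n\t<td id=bar>\n\t\tstuff\n\t</td>\n</tr>"

# \t+ and \t* absorb however many tabs the page happens to use
regexp {<td .*?id=bar.*?>\n\t+(.+?)\n\t*</td>} $body -> stuff
puts $stuff    ;# stuff
```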

Posted: Sat May 06, 2006 6:28 am
by demond
we aren't on a PHP forum, so:

nope, I meant eggdrop scripts that fetch info from webpages

and no, you can't compensate for webpage changes with regexps alone; they're nowhere near XPath's ability to do that

Posted: Sat May 06, 2006 6:32 am
by Sir_Fz
I don't think there is any script (at least a public one) which features flexible HTML parsing, but it would be nice to see it implemented in future scripts. It would also be easier for the scripter, since he won't have to keep following the website's changes.

Posted: Sat May 06, 2006 7:15 am
by De Kus
well, doesn't tDOM (mentioned in the link you posted) support that flexibility by using something like this?
set node [$root selectNodes {//tr[@id=foo]/td[@id=bar]}]

though I am still confident about the regexp :P.
Give us an example where

Code: Select all

regexp {(?i)<tr .*?id=foo.*?>.*?<td .*?id=bar.*?>(?:\n\s*|)(.+?)(?:\n\s*|)</td>} $body {} stuff
doesn't find the wanted piece from your example (though I don't want to talk about the speed of such an expression on a 50kb HTML file; however, I have successfully used string first and string range to limit the actual string regexp parses). The given example should still work even if all the \n and \t are truncated, or spaces are used instead of \t. (?:\n\s*|) should match as much as possible (or nothing), and thereby "eat" the input. An alternative (and probably faster) way would be to use string trim $stuff " \t\n" on stuff :D.
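For what it's worth, here is a quick test of that expression against a condensed version of the "changed" page from the first post (the HTML string is made up for the test):

```tcl
# condensed "changed" page: <td id=bar> is now nested one table deeper
set body {<tr id=foo><td id=moo><table><tr><td id=bar>stuff</td></tr></table></td></tr>}

# the lazy .*? runs let the pattern skip across the new <td id=moo><table> wrapper
regexp {(?i)<tr .*?id=foo.*?>.*?<td .*?id=bar.*?>(?:\n\s*|)(.+?)(?:\n\s*|)</td>} $body {} stuff
puts $stuff    ;# stuff
```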

You could even go so far as to "regsub -all {<!--.*?-->} $stuff {} stuff" to remove any comments (or apply it to body, to remove them before looking for matches) :D.

Posted: Tue May 09, 2006 3:00 am
by demond
no no, you misunderstood that; perhaps my example was bad

basically, if you locate the info you need using XPath positional predicates like for example //foo[@bar='moo'][5], your script will continue to work even if they add tons of stuff under nodes #1 to #4; you can't do that with regexps - there is no notion of expression position
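a sketch of what that buys you (assuming the tdom extension is installed; the XML document is invented for the example): the fifth matching node stays the fifth node, no matter how much content gets added inside the earlier ones:

```tcl
package require tdom

# made-up document with five <foo bar='moo'> siblings; note the extra
# junk added inside node #2 does not disturb the positional predicate
set xml {<root>
    <foo bar='moo'>first</foo>
    <foo bar='moo'>second<blah>tons of new stuff</blah></foo>
    <foo bar='moo'>third</foo>
    <foo bar='moo'>fourth</foo>
    <foo bar='moo'>fifth</foo>
</root>}

set doc  [dom parse $xml]
# [5] selects the fifth node among those passing the @bar='moo' filter
set node [[$doc documentElement] selectNodes {//foo[@bar='moo'][5]}]
puts [$node text]    ;# fifth
$doc delete
```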

in general, XPath is vastly superior to using regexps for parsing webpages; the problem is, users need to install some XML parser extension for it to work, and most eggdrop users are too lame to do that

Posted: Tue May 09, 2006 7:26 am
by Kappa007
Just an idea but maybe using tclperl with XML::XPath works?

Posted: Tue May 09, 2006 11:07 am
by De Kus
demond wrote:no no, you misunderstood that; perhaps my example was bad
basically, if you locate the info you need using XPath positional predicates like for example //foo[@bar='moo'][5], your script will continue to work even if they add tons of stuff under nodes #1 to #4; you can't do that with regexps - there is no notion of expression position
So you want the regexp to "intelligently" skip the uninteresting first 4 <foo bar='moo'> tags and begin to really parse the whole regexp from there on? Depending on the complexity of the expressions, you could use my suggestion of string first and either give the index of the 4th match to regexp, or string range it together by using string first to find a logical end. However, this will never be exactly the same as XPath.
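That suggestion might look like this (a rough sketch; the sample data and the bar='moo' markup are invented for the demonstration):

```tcl
# five <foo bar='moo'> tags; skip the first four with string first,
# then hand that offset to regexp so it picks up the fifth match
set body {<foo bar='moo'>a</foo><foo bar='moo'>b</foo><foo bar='moo'>c</foo><foo bar='moo'>d</foo><foo bar='moo'>e</foo>}

set idx 0
for {set n 0} {$n < 4} {incr n} {
    set idx [string first "<foo " $body $idx]
    incr idx    ;# step one char past this match so the next search moves on
}
regexp -start $idx {<foo [^>]*bar='moo'[^>]*>(.*?)</foo>} $body -> stuff
puts $stuff    ;# e
```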
You could even create it as a module, but then again, people who are unable to compile the bot might not be able to use it.

Posted: Tue May 09, 2006 11:59 am
by demond
either you are a regexp fanatic, or you don't get my point since you don't know XPath

Posted: Tue May 09, 2006 12:52 pm
by De Kus
Well, you are asking for XPath without using XPath. XPath has been in development for almost 10 years now (referring to the links you gave). Do you believe you can write some fast Tcl script to emulate it? I am offering alternatives for how to achieve similar results without developing a module worth years of work.

Posted: Wed May 10, 2006 12:04 am
by demond
what you seem to be unable to comprehend is that any regexp emulation of XPath's predicates would be ridiculously complicated and hard to read/understand

it's like doing numerical analysis in Roman numerals - if you know what I mean

Posted: Wed May 10, 2006 5:25 am
by De Kus
so you are looking for something like

Code: Select all

set goal 5
set id 0
set num 0
while {$num < $goal} {
  # note: string first takes the needle first, then the haystack
  if {[set id [string first "<foo " $body $id]] == -1} {
    return -1
  } else {
    if {[string first ">" $body $id] > [set t [string first "bar='id'" $body $id]] && $t != -1} {
      # bar='id' occurs inside this tag: count it as a match
      set id $t
      incr num
    } elseif {$t == -1} {
      return -1
    } else {
      # this <foo tag doesn't carry bar='id'; skip past it
      incr id [string length "<foo "]
    }
  }
}
...
continue with each condition... to find the end of "stuff", keep finding the index (you can also count the <foo opens to know which </foo> belongs to the wanted opening tag) and string range the stuff between the > and the <. Though I REALLY doubt this is any easier than regexp (however, I am sure it would be faster).
However, you will now run into trouble with case sensitivity, and will have trouble treating bar=id, bar='id' and bar="id" as equal. You could of course temporarily convert all " to ' and check for ' (according to the W3C, bar=id is invalid syntax anyway).

or do you want to split the XML tree into a multidimensional array? but then I wonder how to *match* parameters. I don't know if an endless sublist of tag and data would be possible. Maybe parents would just contain a list of direct children as "data". And even then, searching would be difficult in this non-linear list tree.

Posted: Wed May 10, 2006 12:13 pm
by demond
dude, do you actually read my posts???

I am NOT asking you - or anyone else for that matter - to show how to parse webpages via regexps or otherwise, and/or to elaborate on that; I am simply advocating XPath - which you obviously don't know and haven't used - as a superior tool for the job

I was asking if anyone knows of a script using XPath - and I already gathered you don't - so let's put this to rest and move on

Posted: Wed May 10, 2006 8:26 pm
by De Kus
The more I understand your question, the less I understand what you are actually asking :D.

So you want to know if someone has used XPath via tDOM in a script, and managed to use a more flexible statement than your example in the FAQ?