wondering

demond
Revered One
Posts: 3073
Joined: Sat Jun 12, 2004 9:58 am
Location: San Francisco, CA

Post by demond »

just out of curiosity:

does anyone know of a web script which features flexible HTML parsing, using XPath or a similar technique?

e.g. as soon as the following page code:

Code:

...
<tr id=foo ...
   <td id=bar ...
      stuff
   ...
   </td>
</tr>
is changed to:

Code:

...
<tr id=foo ...
   <td id=moo ...
      <table ...
         <td id=bar ...
            stuff 
...
you are screwed if you use a script that gets to "stuff" using regexp/regsub to locate the <td> tag with id=bar (which is pretty much every script known to me)

naturally, XPath is not a panacea against web page changes, but I'd imagine it could provide a far greater degree of flexibility
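
A minimal sketch of the idea in Tcl, assuming the tDOM extension is installed (package require tdom). The two page fragments and the proc name extract_stuff are made up for illustration and are not taken from any existing script. The same XPath expression, //td[@id='bar'], is used for both layouts, since it does not depend on how deeply the cell is nested:

```tcl
package require tdom

# Hypothetical page fragments, modelled on the two layouts above.
set old_page {<html><body><table>
<tr id="foo">
   <td id="bar">stuff</td>
</tr>
</table></body></html>}

set new_page {<html><body><table>
<tr id="foo">
   <td id="moo">
      <table><tr>
         <td id="bar">stuff</td>
      </tr></table>
   </td>
</tr>
</table></body></html>}

# Return the trimmed text of the first <td id="bar">, or "" if none is found.
# extract_stuff is an illustrative name, not part of tDOM.
proc extract_stuff {html} {
    set doc  [dom parse -html $html]
    set root [$doc documentElement]
    set node [$root selectNodes {//td[@id='bar']}]
    set text ""
    if {[llength $node] > 0} {
        set text [string trim [[lindex $node 0] text]]
    }
    $doc delete
    return $text
}

puts [extract_stuff $old_page]
puts [extract_stuff $new_page]
```

An eggdrop script would pass in the fetched page body instead of the hard-coded fragments.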
connection, sharing, dcc problems? click <here>
before asking for scripting help, read <this>
use [code] tag when posting logs, code
De Kus
Revered One
Posts: 1361
Joined: Sun Dec 15, 2002 11:41 am
Location: Germany

Post by De Kus »

Are you talking about web scripts as in PHP? There are tDOM and XML (or was tDOM the XML module?!) modules for PHP, but I have no idea how to use them. Maybe their manuals will give you some hints about whether they support your intended way of manipulation.

PS: the advantage of a regular expression over a scan(f) expression is its flexibility. Using \t+ or \s+ instead of a specific number of characters should give a certain flexibility.
I mean something like "<td .*?id=bar.*?>\n\t+(.+?)\n\t*</td>" should be flexible in the way you just showed.
De Kus
StarZ|De_Kus, De_Kus or DeKus on IRC
Copyright © 2005-2009 by De Kus - published under The MIT License
Love hurts, love strengthens...

Post by demond »

we aren't on a PHP forum, so:

nope, I meant eggdrop scripts that fetch info from webpages

and no, you can't compensate for webpage changes with regexps alone; they are nowhere near XPath's ability to do that
Sir_Fz
Revered One
Posts: 3794
Joined: Sun Apr 27, 2003 3:10 pm
Location: Lebanon

Post by Sir_Fz »

I don't think there is any script (at least not a public one) which features flexible HTML parsing, but it would be nice to see it implemented in future scripts. It would also be easier for the scripter, since he wouldn't have to keep following the website's changes.

Post by De Kus »

well, doesn't tDOM, mentioned in the link you posted yourself, support that flexibility by using something like this?
set node [$root selectNodes {//tr[@id='foo']/td[@id='bar']}]

though I am still confident about the regexp :P.
Give us an example where

Code:

regexp {(?i)<tr .*?id=foo.*?>.*?<td .*?id=bar.*?>(?:\n\s*|)(.+?)(?:\n\s*|)</td>} $body {} stuff
doesn't find the wanted piece from your example (though I don't want to talk about the speed of such an expression on a 50 KB HTML file; however, I have successfully used string first and string range to limit the string that regexp actually parses). The given example should still work even if all the \n and \t are removed, or if spaces are used instead of \t. (?:\n\s*|) should match as much as possible (or nothing) and thereby "eat" the input. An alternative (and probably faster) way would be to use string trim $stuff " \t\n" on stuff :D.

You could even go so far as to use "regsub -all {<!--.*?-->} $stuff {} stuff" to remove any comments (or run it on body, to remove them before looking for matches) :D.
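
A rough sketch of how these pieces could fit together in plain Tcl, with no extensions needed. The sample body and the proc name grab_stuff are invented for illustration. The pattern is a simplified variant of the one above: it leaves out the (?:\n\s*|) groups and relies on string trim instead. Comments are stripped first so that a commented-out cell cannot be matched by accident:

```tcl
# Hypothetical page body: the changed layout plus a commented-out decoy cell.
set body {<tr id=foo class=x>
   <!-- <td id=bar>old value</td> -->
   <td id=moo>
      <table>
         <td id=bar>
            stuff
         </td>
      </table>
   </td>
</tr>}

# Return the trimmed contents of the first <td ... id=bar ...> found after
# <tr ... id=foo ...>, or "" when the pattern does not match.
# grab_stuff is an illustrative name only.
proc grab_stuff {body} {
    # regsub argument order: pattern, input string, replacement, result variable.
    regsub -all {<!--.*?-->} $body {} body
    # Every quantifier is non-greedy, so the match stops at the first </td>.
    if {![regexp {(?i)<tr .*?id=foo.*?>.*?<td .*?id=bar.*?>(.+?)</td>} $body -> stuff]} {
        return ""
    }
    # Drop the newlines and indentation around the captured text.
    return [string trim $stuff " \t\r\n"]
}

puts [grab_stuff $body]
```

Without the regsub step, the first id=bar the pattern reaches would be the one inside the comment.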

Post by demond »

no no, you misunderstood that; perhaps my example was bad

basically, if you locate the info you need using XPath positional predicates like for example //foo[@bar='moo'][5], your script will continue to work even if they add tons of stuff under nodes #1 to #4; you can't do that with regexps - there is no notion of expression position

in general, XPath is vastly superior to using regexps for parsing webpages; the problem is, users need to install some XML parser extension for it to work, and most eggdrop users are too lame to do that
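
To make the positional predicate concrete, here is a small sketch in Tcl, again assuming tDOM is available. The XML document and the proc name fifth_moo are hypothetical. The first four matching nodes carry extra markup of varying depth, and the expression still selects the fifth one:

```tcl
package require tdom

# Hypothetical document: five <foo bar="moo"> siblings under <root>.
# The first four have extra markup added underneath them; the last one
# holds the wanted text. One <foo> has a different attribute value.
set xml {<root>
  <foo bar="moo"><extra><more>noise</more></extra></foo>
  <foo bar="moo"><extra>noise</extra></foo>
  <foo bar="other">skipped by the attribute test</foo>
  <foo bar="moo"><extra/></foo>
  <foo bar="moo"><extra><foo bar="moo">nested noise</foo></extra></foo>
  <foo bar="moo">wanted</foo>
</root>}

# Return the text of the node selected by //foo[@bar='moo'][5], or "".
# fifth_moo is an illustrative name only.
proc fifth_moo {xml} {
    set doc  [dom parse $xml]
    set root [$doc documentElement]
    # [5] counts only the <foo bar="moo"> elements that share one parent.
    set nodes [$root selectNodes {//foo[@bar='moo'][5]}]
    set text ""
    if {[llength $nodes] > 0} {
        set text [[lindex $nodes 0] text]
    }
    $doc delete
    return $text
}

puts [fifth_moo $xml]
```

Note that the position is counted among siblings of the same parent. Writing (//foo[@bar='moo'])[5] instead would count matches in document order across the whole tree, which is a different selection.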
Kappa007
Voice
Posts: 38
Joined: Tue Jul 26, 2005 9:53 pm

Post by Kappa007 »

Just an idea but maybe using tclperl with XML::XPath works?

Post by De Kus »

demond wrote: no no, you misunderstood that; perhaps my example was bad
basically, if you locate the info you need using XPath positional predicates like for example //foo[@bar='moo'][5], your script will continue to work even if they add tons of stuff under nodes #1 to #4; you can't do that with regexps - there is no notion of expression position
So you want the regexp to "intelligently" skip the first 4 uninteresting <foo bar='moo'> and begin to really parse the whole regexp from there on? Depending on the complexity of the expressions, you could use my suggestion of string first and either give the index of the 4th match to regexp, or cut the text with string range, using string first to find a logical end. However, this will never be exactly the same as XPath.
However, you could create it as a module, but then again, people who are unable to compile the bot might not be able to use it.

Post by demond »

either you are a regexp fanatic, or you don't get my point since you don't know XPath

Post by De Kus »

Well, you are asking for XPath without using XPath. XPath has been in development for almost 10 years now (referring to the links you gave). Do you believe you can write some fast Tcl script to emulate it? I am offering alternatives for achieving similar results without developing a module worth years of work.

Post by demond »

what you seem to be unable to comprehend is that any regexp emulation of XPath's predicates would be ridiculously complicated and hard to read/understand

it's like doing numerical analysis in Roman numerals - if you know what I mean

Post by De Kus »

so you are looking for something like

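Code:

set goal 5
set id 0
set num 0
# walk through the <foo ...> tags until $goal of them carry bar='id'
while {$num < $goal} {
  # string first takes the needle first, then the string to search
  if {[set id [string first "<foo " $body $id]] == -1} {
    return -1
  } else {
    if {[string first ">" $body $id] > [set t [string first "bar='id'" $body $id]] && $t != -1} {
      # bar='id' sits before the closing > of this tag: count it
      set id $t
      incr num
    } elseif {$t == -1} {
      return -1
    } else {
      # the attribute belongs to a later tag: step past this "<foo "
      incr id [string length "<foo "]
    }
  }
}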
...
continue with each condition... to find the end of "stuff", keep looking for the index (you can also count the <foo opens to know which </foo> belongs to the wanted open tag) and string range the stuff between the > and <. Though I REALLY doubt this is any easier than regexp (however, I am sure it would be faster).
However, you will now run into trouble with case sensitivity, and will have trouble treating bar=id, bar='id' and bar="id" as equal. You could of course temporarily convert all " to ' and check for ' (according to the W3C, bar=id is wrong syntax anyway).

or do you want to split the XML tree into a multidimensional array? But then I wonder how to *match* parameters. I don't know if an endless sublist with tag and data would be possible. Maybe parents would just contain a list of their direct children as "data". And even then, searching would be difficult in this non-linear list tree.

Post by demond »

dude, do you actually read my posts???

I am NOT asking you - or anyone else for that matter - to show parsing webpages via regexps or otherwise or/and to elaborate on that; I am simply advocating XPath - which you obviously don't know and haven't used - as a superior tool for the job

I was asking if anyone knows a script using XPath - and I already gathered you don't - so let's put this to rest and move on

Post by De Kus »

The more I understand your question, the less I understand what you are actually asking :D.

So you want to know if someone has used XPath via tDOM in a script and managed to use a more flexible statement than your example in the FAQ?