you are screwed if you use a script that gets to "stuff" using regexp/regsub to locate the <td> tag with id=bar (which is pretty much every script known to me)
naturally, XPath is not a panacea against web page changes, but I'd imagine it could provide a far greater degree of flexibility
are you talking about web script as in PHP? There are tDOM and XML (or was tDOM the XML module?!) modules for PHP, but no idea how to use them. Maybe their manuals give you some hints, if they support your intended way of manipulation.
PS: the advantage of a regular expression over a scan(f) expression is the flexibility. Using \t+ or \s+ instead of a specific number of chars should give a certain flexibility.
I mean something like "<td .*?id=bar.*?>\n\t+(.+?)\n\t*</td>" should be flexible in the way you just showed.
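for illustration, a minimal sketch of that regexp in use, with \s* in place of the literal \n\t runs; $html and stuff are placeholder names, not taken from any particular script:

# locate the cell by its id attribute and capture its body into "stuff";
# \s* tolerates any mix of newlines, tabs and spaces around the contents
if {[regexp {<td .*?id=bar.*?>\s*(.+?)\s*</td>} $html -> stuff]} {
    puts "found: $stuff"
} else {
    puts "no match - the markup around the cell probably changed"
}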
I don't think there is any script (at least a public one) which features flexible HTML parsing, but it would be nice to see it implemented in future scripts; it would also be easier for the scripter, since he won't have to keep following the website changes.
well, doesn't tDOM, mentioned in the link you posted yourself, support that flexibility by using something like this?
set node [$root selectNodes {//tr[@id='foo']/td[@id='bar']}]
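for context, a fuller tDOM sketch along those lines, assuming the fetched page text is already in $html (the variable names are placeholders):

package require tdom

# parse the fetched page in HTML mode, which tolerates real-world markup
set doc  [dom parse -html $html]
set root [$doc documentElement]

# address the cell by its id attributes instead of the surrounding text
set nodes [$root selectNodes {//tr[@id='foo']/td[@id='bar']}]
if {[llength $nodes]} {
    puts "cell contents: [[lindex $nodes 0] asText]"
}
$doc delete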
though I am still confident about the regexp.
Give us an example where my regexp doesn't find the wanted piece from your example (though I don't want to talk about the speed of such an expression on a 50 KB HTML file; however, I have successfully used string first and string range to limit the actual string the regexp parses). The given example should still work even if all the \n and \t are truncated, or if spaces are used instead of \t. (?:\n\s*|) should match as much as possible (or nothing), and thereby "eat" the input. An alternative (probably faster) way would be to use string trim $stuff " \t\n" on stuff.
You could even go so far as to regsub -all {<!--.*?-->} $stuff {} stuff to remove any comments (or run it on the body, to remove them before looking for matches).
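to sketch those two clean-up steps ($body holds the page, $stuff a captured cell body; both names are assumptions):

# strip HTML comments first, so commented-out markup cannot confuse the match
regsub -all {<!--.*?-->} $body {} body
# after capturing a cell body, throw away the surrounding whitespace
set stuff [string trim $stuff " \t\n"]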
no no, you misunderstood that; perhaps my example was bad
basically, if you locate the info you need using XPath positional predicates like for example //foo[@bar='moo'][5], your script will continue to work even if they add tons of stuff under nodes #1 to #4; you can't do that with regexps - there is no notion of expression position
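a hedged sketch of that positional predicate with tDOM, reusing the element and attribute names from the example (and again assuming the page is in $html):

package require tdom

set doc  [dom parse -html $html]
set root [$doc documentElement]

# the positional predicate from the example: content added inside the earlier
# matching nodes does not shift which element this selects
set nodes [$root selectNodes {//foo[@bar='moo'][5]}]
if {[llength $nodes]} {
    puts [[lindex $nodes 0] asText]
}
$doc delete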
in general, XPath is vastly superior to using regexps for parsing webpages; the problem is, users need to install some XML parser extension for it to work, and most eggdrop users are too lame to do that
demond wrote:no no, you misunderstood that; perhaps my example was bad
basically, if you locate the info you need using XPath positional predicates like for example //foo[@bar='moo'][5], your script will continue to work even if they add tons of stuff under nodes #1 to #4; you can't do that with regexps - there is no notion of expression position
So you want the regexp to "intelligently" skip the uninteresting first 4 <foo bar='moo'> and begin to really parse the whole regexp from there on? Depending on the complexity of the expressions, you could use my suggestion of string first and either give the index of the 4th match to regexp, or string range it together by using string first to find a logical end. However, this will never be exactly the same as XPath.
However, you could create it as a module, but then again people who are unable to compile the bot might not be able to use it.
Well, you are asking for XPath without using XPath. XPath has been developed for almost 10 years now (referring to the links you gave). Do you believe you can write some fast Tcl script to emulate it? I am offering alternatives for how to achieve similar results without developing a module worth years of work.
what you seem to be unable to comprehend is that any regexp emulation of XPath's predicates would be ridiculously complicated and hard to read/understand
it's like doing numerical analysis in Roman numerals - if you know what I mean
set goal 5    ;# we want the 5th matching <foo ...> tag
set id 0
set num 0
while {$num < $goal} {
    # find the next "<foo " open tag; bail out if there are none left
    if {[set id [string first "<foo " $body $id]] == -1} {
        return -1
    } else {
        # the bar='id' attribute must occur before the tag's closing ">"
        if {[string first ">" $body $id] > [set t [string first "bar='id'" $body $id]] && $t != -1} {
            set id $t
            incr num
        } elseif {$t == -1} {
            # the attribute never occurs again, so the wanted match cannot exist
            return -1
        } else {
            # this <foo tag lacks the attribute; skip past it and keep looking
            incr id [string length "<foo "]
        }
    }
}
...
continue with each condition... to find the end of "stuff", keep finding indices (you can also count the <foo opens to know which </foo> belongs to the wanted open tag, as sketched below) and string range the stuff between the > and the <. Though I REALLY doubt this is any easier than regexp (however I am sure it would be faster).
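a sketch of the "count the opens" part, continuing from the loop above (so $body and $id are assumed to already hold the page and a position inside the wanted <foo tag, as the loop leaves them):

# skip past the ">" that closes the wanted open tag; the cell body starts there
set start [expr {[string first ">" $body $id] + 1}]
# walk forward keeping a nesting depth, so nested <foo elements
# do not end the match too early
set depth 1
set pos   $start
while {$depth > 0} {
    set open  [string first "<foo "  $body $pos]
    set close [string first "</foo>" $body $pos]
    if {$close == -1} { return -1 }    ;# unbalanced markup, give up
    if {$open != -1 && $open < $close} {
        incr depth
        set pos [expr {$open + 5}]
    } else {
        incr depth -1
        set pos [expr {$close + 6}]
    }
}
# $pos is now just past the matching </foo>; everything before it is "stuff"
set stuff [string range $body $start [expr {$pos - 7}]]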
However, you will now run into trouble with case sensitivity, and will have trouble treating bar=id, bar='id' and bar="id" as equal. You could of course temporarily convert all " to ' and check for ' (according to the W3C, bar=id is wrong syntax anyway).
Or do you want to split the XML tree into a multidimensional array? But then I wonder how to *match* parameters. I don't know if an endless sublist with tag and data would be possible. Maybe parents would just contain a list of direct children as "data". And even then, searching would be difficult in this non-linear list tree.
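a one-line sketch of that normalization; note it also lower-cases the cell contents, so you may want to apply it only to a copy used for locating indices:

# fold case and turn double quotes into single quotes, so bar='id' and bar="id"
# compare equal when searching with string first
set body [string map [list \" '] [string tolower $body]]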
I am NOT asking you - or anyone else for that matter - to show how to parse webpages via regexps or otherwise, or to elaborate on that; I am simply advocating XPath - which you obviously don't know and haven't used - as a superior tool for the job
I was asking if anyone knows a script using XPath - and I already gathered you don't - so let's put this to rest and move on