Retrieving text from messy strings, regarding -regexp

Help for those learning Tcl or writing their own scripts.
kenneal
Voice
Posts: 10
Joined: Sun Mar 11, 2007 8:21 am

Post by kenneal »

I have some HTML code here from which I would like to extract the .jpg URL.
It can be generated in many different forms.

E.g.

<img src='http://123.com/12345.jpg' border=0>
<code='http://123.com/12345.jpg'>
<a code=1 file='http://123.com/12345.jpg' form=value >

Is there a reliable way to always get http://123.com/12345.jpg out of the random, garbled HTML tag? I do understand regexp; the only problem is that, since the URL can appear in many forms, I'm not sure what the best approach is.
Last edited by kenneal on Fri May 04, 2007 10:56 am, edited 1 time in total.
rosc2112
Revered One
Posts: 1454
Joined: Sun Feb 19, 2006 8:36 pm
Location: Northeast Pennsylvania

Post by rosc2112 »

Based on your examples:

Code:

regexp {<.*?'(.*?\.jpg).*?>} $htmlInput fullstringmatchvar exactmatchvar

Or:

regexp {<.*?'(http://.*?\.jpg).*?>} $htmlInput fullstringmatchvar exactmatchvar

I would have thought HTML would use double quotes (") though, not single quotes (').
Change the above examples to use " if necessary.

exactmatchvar is where your http.*jpg string will be stored.
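
As a rough check, the first pattern above can be run against the three sample tags from the original post. This is only a sketch: the helper name extractJpg is made up for illustration, and puts is used instead of an Eggdrop command so it runs in a plain tclsh.

```tcl
# Sample tags copied from the first post.
set samples {
    {<img src='http://123.com/12345.jpg' border=0>}
    {<code='http://123.com/12345.jpg'>}
    {<a code=1 file='http://123.com/12345.jpg' form=value >}
}

# extractJpg is a hypothetical helper name, not part of Tcl or Eggdrop.
# It returns the captured .jpg URL, or an empty string if nothing matched.
proc extractJpg {htmlInput} {
    if {[regexp {<.*?'(.*?\.jpg).*?>} $htmlInput fullstringmatchvar exactmatchvar]} {
        return $exactmatchvar
    }
    return ""
}

foreach line $samples {
    puts [extractJpg $line]
}
```

regexp returns 1 when the pattern matches and 0 otherwise, so testing its return value avoids reading exactmatchvar when there was no match.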
kenneal
Voice
Posts: 10
Joined: Sun Mar 11, 2007 8:21 am

Post by kenneal »

regexp {<.*?'(.*?\.jpg).*?>} $htmlInput fullstringmatchvar exactmatchvar

I tried messing around with it and it works fine. I still have a few questions.
Let's say I want to get a certain string that could be

www.abc.com/link.php?ref=123zxc

or

www.abc.com/link.php?ref=34zxz

or

www.abc.com/link.php?ref=SDASDASDASd

They are embedded in code like:

<img src='www.abc.com/link.php?ref=SDASDASDASd' border=0>
<code='www.abc.com/link.php?ref=SDASDASDASd'>
<a code=1 file='www.abc.com/link.php?ref=SDASDASDASd' form=value >

1) How can I retrieve them, given that the last part is always random?
2) What if the string also contains another www.abc.com/link.php?ref=ZXCZXC? regexp only extracts one match; is there a way to extract the others?

I tried messing around with regexp, but it still does not seem to work.

I'd appreciate your help here.
user
Posts: 1452
Joined: Tue Mar 18, 2003 9:58 pm
Location: Norway

Post by user »

Code:

{='([^']+)'}

To get all the matches, I'd use the -all and -inline options (check the regexp manual page) and a foreach loop.
BTW: your examples are not html.
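
To make the -all -inline behaviour concrete, here is a small sketch that only inspects the list regexp returns. The data string is an assumption, pieced together from the sample tags earlier in the thread.

```tcl
# Assumed input: two of the sample tags from the thread on one line.
set data {<img src='www.abc.com/link.php?ref=123zxc' border=0> <code='www.abc.com/link.php?ref=ZXCZXC'>}

# With -inline, regexp returns a list instead of setting variables.
# With -all, every match contributes its full text followed by each
# captured group, so one capture group gives two items per match.
set matches [regexp -all -inline {='([^']+)'} $data]

puts [llength $matches]
foreach item $matches {
    puts $item
}
```

With two matches and one capture group, the list holds four items: each full match (including the =' and the closing quote) followed by the text captured inside the quotes.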
Have you ever read "The Manual"?
kenneal
Voice
Posts: 10
Joined: Sun Mar 11, 2007 8:21 am

Post by kenneal »

Code:

if {[string match *.jpg* $line]} {
    foreach jpg [regexp -all -inline {<.*?'(http://.*?\.jpg).*?>} $line] {
        puthelp "PRIVMSG #jtest :$jpg"
    }
}

Oddly, it returns both the text before extraction and the extracted result.

[00:29:06] <@PALL> <br><br><img src='http://hotimg2.fotki.com/b/222_88/97_98/25000.jpg ' border=0 onclick="window.open('http://www2.lookpipe.com/get.php?filepa ... /25000.jpg ');" onload='if(this.width>

[00:29:08] <@PALL> http://hotimg2.fotki.com/b/222_88/97_98/25000.jpg

[00:29:10] <@PALL> <br><br><img src='http://hotimg1.fotki.com/a/222_88/97_98/3353.jpg ' border=0 onclick="window.open('http://www2.lookpipe.com/get.php?filepa ... 8/3353.jpg ');" onload='if(this.width>

[00:29:12] <@PALL> http://hotimg1.fotki.com/a/222_88/97_98/3353.jpg


I'm quite sure these results come from the foreach line. If I remove the PRIVMSG line, nothing comes out.

Edit:
I get the impression that -inline returns the results in this order, so should I add a check so that the 1st line is skipped, the 2nd is used, the 3rd is skipped, the 4th is used, and so on? Just want to confirm.

Also, user: yes, these results are parsed out of a PHP forum, so it's not entirely HTML. I tried regexp {='([^']+)'} but am still unable to extract the link.php URLs from the garbled text as I have with the .jpgs.
user
Posts: 1452
Joined: Tue Mar 18, 2003 9:58 pm
Location: Norway

Post by user »

You still get the entire match first (like you did without -inline), so add another variable in the loop to weed it out. Like this:

Code:

foreach {garbage jpg} [regexp -all -inline {='(http[^']+)'} $data] {
    puthelp "PRIVMSG #chan :$jpg"
}

If you want to capture those scheme-less URLs starting with "www" too, replace "http" in the above rule with "(?:www|http)".
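
Putting the two-variable loop and the "(?:www|http)" substitution together might look like the sketch below. The data string is an assumed input built from the thread's sample tags, and lappend/puts stand in for puthelp so the snippet runs outside Eggdrop.

```tcl
# Assumed input: one http URL and one www URL, taken from the sample tags.
set data {<img src='http://123.com/12345.jpg' border=0> <a code=1 file='www.abc.com/link.php?ref=SDASDASDASd' form=value >}

set urls {}
# (?:...) groups the alternation without capturing, so each match still
# yields exactly two items: the full match (garbage) and the URL.
foreach {garbage url} [regexp -all -inline {='((?:www|http)[^']+)'} $data] {
    lappend urls $url
}

puts $urls
```

Because the pattern requires =' immediately before the URL, attributes such as border=0 or code=1 are skipped, as is any quoted URL not directly preceded by an equals sign.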
Have you ever read "The Manual"?
kenneal
Voice
Posts: 10
Joined: Sun Mar 11, 2007 8:21 am

Post by kenneal »

I'm testing it and everything seems good, thanks a lot!
Last edited by kenneal on Fri May 04, 2007 1:55 pm, edited 1 time in total.
kenneal
Voice
Posts: 10
Joined: Sun Mar 11, 2007 8:21 am

Post by kenneal »

Thx!