regexp prob

Post by iRoc »

I've written a script, but it's not working.
There seems to be a problem with the regexp.
Can anyone help me with that?

HTML Source

Code: Select all

<tr>
 <td width="30" align="right"><p class="commsText">0.4</p></td>

 <td width="100%">
  <p class="commsText">Anderson to Watson,
2 runs,
another good cover drive, although this one has clunked a bit and hasn't been timed properly, it's chased down and they pick up two  </p>
 </td>
</tr>

Code: Select all

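package require http

bind pub -|- !cric last

proc last {nick uhost hand chan text} {
	set cric(page) http://www.espncricinfo.com/the-ashes-2010-11/engine/match/446966.html
	set agent "Opera/9.10 (Windows NT 5.1; U; ru)"
	::http::config -useragent $agent
	set t [::http::geturl $cric(page) -timeout 30000]
	set data [::http::data $t]
	::http::cleanup $t

	# Each match adds three items to the flat list: full match, over number, commentary text.
	set l [regexp -all -inline -- {<tr>.*?<td width="30" align="right"><p class="commsText">(.*?)</p></td>.*?<td width="100%">.*?<p class="commsText">(.*?)</p>.*?</td>.*?</tr>} $data]
	# Log how many rows matched; zero means the page no longer contains this markup.
	putlog "cric: [expr {[llength $l] / 3}] commentary rows matched"

	foreach {black a b} $l {
		# Collapse newlines and repeated whitespace so each row is sent as one IRC line.
		regsub -all {\s+} [string trim $a] " " a
		regsub -all {\s+} [string trim $b] " " b
		putserv "PRIVMSG $chan :$a $b"
	}
}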

Post by arfer »

Irrespective of the accuracy of your regexp pattern, I find that using the -inline and -all switches is quite messy where you have multiple subexpressions. Each potentially returns a full match and a subexpression match. The returned list is also concatenated at each iteration, yielding a flat list. It would seem from the foreach statement that you expected a list of lists to be returned, with each sublist consisting of three elements. I am doubtful of this.

I likewise have only a limited understanding of regular expression patterns. To help myself, I tend to first scrape all of the data I need and write it to a text file, until I'm satisfied from reading the file manually that I have what I want. No data written would indicate an incorrect pattern. I then set about manipulating and tidying the data until I have exactly what I need, writing to a text file at each stage for confirmation. I can then find some way to iterate through the data to extract the individual items. Once that is complete, I can remove the file opening and writing code.
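The write-to-a-file approach described above could look something like the following. This is only a rough sketch: the helper name dump_to_file, the file names, and the shortened pattern are made up for illustration, and the sample string stands in for whatever ::http::data actually returns.

```tcl
# Hypothetical helper: write text to a file so it can be read by hand.
proc dump_to_file {filename data} {
    set fh [open $filename w]
    puts -nonewline $fh $data
    close $fh
    return [string length $data]
}

# Stand-in for the page body; in the real script this would come from ::http::data.
set data {<tr><td width="30" align="right"><p class="commsText">0.4</p></td></tr>}

# Stage one: keep a copy of exactly what was fetched.
dump_to_file cric_raw.txt $data

# Stage two: keep a copy of what the pattern matched, one list item per line.
# An empty file here points to a pattern that does not fit the page.
set matches [regexp -all -inline -- {<p class="commsText">(.*?)</p>} $data]
dump_to_file cric_matches.txt [join $matches \n]
```

Reading cric_raw.txt first shows whether the expected markup is in the page at all; cric_matches.txt then shows what the pattern picked out of it.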

You are obviously trying to scrape a live sports site. The additional difficulty is knowing whether the data is new, since you would not want to repeat things. You would have to scrape the site at, say, 2-3 minute intervals and compare the current data with the previous data, outputting only the difference.
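One possible way to output only the difference is to remember which rows have already been announced. The sketch below assumes the rows arrive as a flat list of over/text pairs; the array name cric_seen and the proc name new_rows are invented for the example, and the polling itself (for instance from an Eggdrop timer) is left out.

```tcl
# Hypothetical global store: one array key per commentary row already announced.
array set cric_seen {}

# Take a flat {over text over text ...} list and return only the pairs
# that have not been seen on an earlier call, remembering them as we go.
proc new_rows {rows} {
    global cric_seen
    set fresh {}
    foreach {over text} $rows {
        set key "$over|$text"
        if {![info exists cric_seen($key)]} {
            set cric_seen($key) 1
            lappend fresh $over $text
        }
    }
    return $fresh
}

# First poll: everything is new. Second poll: only the 0.5 row is new.
set first  [new_rows {0.3 "no run" 0.4 "2 runs"}]
set second [new_rows {0.3 "no run" 0.4 "2 runs" 0.5 "FOUR"}]
puts $second
```

The array grows for as long as the bot runs, so a real script would probably want to clear it when the match ends.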

Post by nml375 »

arfer;
The foreach syntax is correct; it will pop three values off the list at a time and assign them to "black", "a", and "b". Likewise, the regular expression will (as you stated) return a flat list with "groups" of three items (one for the full match, and two for the two submatch patterns).
foreach will not dig through multi-level lists; it always assigns the list item(s) unaltered.
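To illustrate the point about the flat list, here is a small standalone sketch that runs the pattern from the first post against the HTML sample from the first post (with the commentary text abridged). The variable names full, over and text are chosen only for this example.

```tcl
# Sample row from the HTML posted earlier in the thread, commentary abridged.
set data {<tr>
 <td width="30" align="right"><p class="commsText">0.4</p></td>

 <td width="100%">
  <p class="commsText">Anderson to Watson,
2 runs,
another good cover drive  </p>
 </td>
</tr>}

# Same pattern as in the original script.
set pattern {<tr>.*?<td width="30" align="right"><p class="commsText">(.*?)</p></td>.*?<td width="100%">.*?<p class="commsText">(.*?)</p>.*?</td>.*?</tr>}
set l [regexp -all -inline -- $pattern $data]

# One matching row gives three flat items: full match, first group, second group.
puts [llength $l]

# foreach with three loop variables walks the flat list three items at a time.
foreach {full over text} $l {
    puts "over=[string trim $over]"
    puts "text=[string trim $text]"
}
```

With more rows in the input, the list simply grows by three items per match, and the same foreach picks them up in order.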

Now to the issue: when I pass the supplied text through the regular expression, it works just fine for me. However, I cannot find anything even resembling that text at the URL in your code. I think you'll have to double-check what you're matching against.

Post by arfer »

Thanks for the info nml375.

The site is a live commentary on a cricket match. I would imagine the game has long since finished, so the HTML shown in this thread is unlikely to be there any more. Which, come to think of it, would be a good reason why the script doesn't work.