regexp prob

iRoc · Post by **iRoc** » Sun Jan 30, 2011 1:40 pm

I've written some code but it's not working.
There's some problem with the regexp.
Can anyone help me with that?

HTML Source

<tr>
 <td width="30" align="right"><p class="commsText">0.4</p></td>

 <td width="100%">
  <p class="commsText">Anderson to Watson,
2 runs,
another good cover drive, although this one has clunked a bit and hasn't been timed properly, it's chased down and they pick up two  </p>
 </td>
</tr>

Code:

package require http

bind pub -|- !cric last

proc last {nick uhost hand chan text} {
    set cric(page) http://www.espncricinfo.com/the-ashes-2010-11/engine/match/446966.html
    set agent "Opera/9.10 (Windows NT 5.1; U; ru)"
    ::http::config -useragent $agent

    # Fetch the page and copy the body out before releasing the token
    set t [::http::geturl $cric(page) -timeout 30000]
    set data [::http::data $t]
    ::http::cleanup $t
    putlog "$t"

    # Flat list: full match, ball number, commentary text, repeated for each row
    set l [regexp -all -inline -- {<tr>.*?<td width="30" align="right"><p class="commsText">(.*?)</p></td>.*?<td width="100%">.*?<p class="commsText">(.*?)</p>.*?</td>.*?</tr>} $data]

    foreach {black a b} $l {
        set a [string trim $a]
        # Collapse newlines and runs of whitespace so the message stays on one line
        set b [string trim [regsub -all {\s+} $b " "]]
        putserv "PRIVMSG $chan :$a $b"
    }
}

arfer · Post by **arfer** » Fri Mar 04, 2011 10:07 am

Irrespective of the accuracy of your regexp pattern, I find that using the -inline and -all switches together gets quite messy where you have multiple subexpressions. Each match returns the full match plus one element per subexpression. The returned list is also concatenated at each iteration, yielding a flat list. It would seem from the foreach statement that you expected a list of lists to be returned, with each sublist consisting of three elements. I am doubtful of this.

I likewise have only a limited understanding of regular expression patterns. To help myself, I tend to first scrape all of the data I need and write it to a text file, until I'm satisfied from reading the file manually that I have what I want. If no data is written, the pattern is incorrect. I then set about manipulating and tidying the data until I have exactly what I need, writing to a text file at each stage for confirmation. I can then find some way to iterate through the data to extract the individual items. Once that is complete, I can remove the code that opens and writes the text file.
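
A rough sketch of that write-to-file workflow, using only core Tcl commands. The helper name `dumpToFile`, the file names, and the stand-in page data are all made up for illustration; in the real script the data would come from `::http::data`.

```tcl
# Hypothetical helper: write a debugging snapshot to a text file.
proc dumpToFile {filename content} {
    set fh [open $filename w]
    puts -nonewline $fh $content
    close $fh
}

# Stand-in for the page body (invented sample, not the live page).
set data {<p class="commsText">0.4</p> <p class="commsText">Anderson to Watson, 2 runs</p>}

# Stage 1: everything the pattern returns, untouched.
set raw [regexp -all -inline -- {<p class="commsText">(.*?)</p>} $data]
dumpToFile stage1_raw.txt [join $raw \n]

# Stage 2: keep only the submatches (every second element) and tidy them.
set tidy {}
foreach {full sub} $raw {
    lappend tidy [string trim $sub]
}
dumpToFile stage2_tidy.txt [join $tidy \n]
```

An empty stage1_raw.txt would suggest the pattern is not matching the fetched page at all, which is worth ruling out before touching the tidy-up code.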

You are obviously trying to scrape a live sports site. The additional difficulty is: how do you know the data is new? You would not want to repeat things. You would have to scrape the site at, say, 2-3 minute intervals and compare the current data with the previous data, only outputting the difference.
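
The compare-with-previous idea might look something like the sketch below. The proc name `newEntries` and the sample commentary lines are invented for the example; a real script would keep the previous scrape in a global variable between runs, and in Eggdrop the periodic check could probably be driven from a `bind time` handler.

```tcl
# Hypothetical helper: return the entries of $current that were not in $previous.
proc newEntries {previous current} {
    set fresh {}
    foreach entry $current {
        if {[lsearch -exact $previous $entry] < 0} {
            lappend fresh $entry
        }
    }
    return $fresh
}

# Made-up commentary lines standing in for two consecutive scrapes.
set previous [list "0.3 Anderson to Watson, no run" "0.4 Anderson to Watson, 2 runs"]
set current  [list "0.4 Anderson to Watson, 2 runs" "0.5 Anderson to Watson, no run"]

foreach line [newEntries $previous $current] {
    # The bot would send this with putserv "PRIVMSG $chan :$line" instead.
    puts $line
}
```

Comparing whole lines is the simplest approach. It would repeat a line if the site edits the commentary text after posting it, so comparing on the ball number alone may be more robust.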

nml375 · Post by **nml375** » Fri Mar 04, 2011 3:23 pm

arfer,
The foreach syntax is correct; it will pop three values off the list at a time and assign them to "black", "a", and "b". Likewise, the regular expression will (as you stated) return a flat list with "groups" of three items (one for the full match, and two for the two submatch patterns).
foreach will not dig through multi-level lists; it always assigns the list item(s) unaltered.
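
A small standalone sketch of that behaviour, using the pattern from the first post against two sample rows. The second row's contents and the variable names `full`, `ball`, `text`, and `rows` are invented for the example.

```tcl
# Two sample rows in the same shape as the HTML in the first post
# (the second row is made up for this example).
set data {<tr>
 <td width="30" align="right"><p class="commsText">0.4</p></td>
 <td width="100%">
  <p class="commsText">Anderson to Watson,
2 runs</p>
 </td>
</tr>
<tr>
 <td width="30" align="right"><p class="commsText">0.5</p></td>
 <td width="100%">
  <p class="commsText">Anderson to Watson,
no run</p>
 </td>
</tr>}

set pattern {<tr>.*?<td width="30" align="right"><p class="commsText">(.*?)</p></td>.*?<td width="100%">.*?<p class="commsText">(.*?)</p>.*?</td>.*?</tr>}

# One flat list: full match plus two submatches for every matched row.
set l [regexp -all -inline -- $pattern $data]
puts "elements: [llength $l]"

# foreach with three variables walks the flat list three items at a time.
set rows {}
foreach {full ball text} $l {
    lappend rows [list [string trim $ball] [string trim [regsub -all {\s+} $text " "]]]
}
puts $rows
```

With two rows matched, the flat list holds six elements, and the loop body runs twice.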

Now to the issue: when I pass the supplied text through the regular expression, it works just fine for me. However, I cannot find anything even resembling that text at the URL in your code. I think you'll have to double-check what you're matching against.

arfer · Post by **arfer** » Fri Mar 04, 2011 7:00 pm

Thanks for the info nml375.

The site is a live commentary on a cricket match. I would imagine the game has long since finished, so the HTML shown in this thread is unlikely to be there any more. Which, come to think of it, would be a good reason why the script doesn't work.