Simple parse

TC^ · Post by **TC^** » Sun Nov 10, 2002 12:06 pm

I need help on how to parse information from a simple HTML page...

for example:

This is the HTML-page:

9,1,16,80,9,128,some text

And I need the script to return the first number (9) on !first and "some text" on !second.

Please don't redirect me to some large documentation! I'm still quite a newbie at Tcl scripting...

Hope someone can help...

ppslim · Post by **ppslim** » Sun Nov 10, 2002 2:07 pm

I am afraid we will direct you to large documentation.

However, we will give you pointers on where to look within it.

First, you say it's an HTML page, but what you posted isn't. So I am guessing that the text is embedded somewhere within the HTML page itself.

The first thing I suggest is to download google.tcl, then read through the Tcl docs and see how to download the HTML page. This part is straightforward.

Next, you will need to make the Tcl script parse through the page and locate the text in question.

A series of string commands or regexps should get you to this location, by searching the text for strings that remain in the page even if the page is dynamically generated.

Once you have obtained the text and stored it in a variable, you can use "split" and "lindex" to obtain the values you need.
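
As a minimal sketch of the "split" and "lindex" step, the snippet below works on the sample line from the first post. The variable names are made up for illustration, and it assumes the text always starts at the seventh comma-separated field, which is only a guess based on that one sample.

```tcl
# Sample line copied from the first post; a real script would take it
# from the downloaded page instead.
set line "9,1,16,80,9,128,some text"

# split turns the string into a Tcl list, cutting at every comma.
set fields [split $line ","]

# lindex picks one element by its zero-based position.
set first [lindex $fields 0]

# Assumption: the text is everything from the seventh field onward.
# Rejoining with commas keeps the text whole even if it contains commas.
set text [join [lrange $fields 6 end] ","]

puts "first: $first"
puts "text: $text"
```

Using lrange plus join rather than a single lindex is a design choice: a title with a comma in it would otherwise be cut short.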

If you need any more information, you will need to ask specific questions. Simply saying "I need A from a web page", without giving any other details about the page contents, is a non-starter for us. Only you know what the content is so far.

TC^ · Post by **TC^** » Sun Nov 10, 2002 2:51 pm

Sorry I wasn't more precise in my question..

This is the page I'm talking about: http://213.114.155.110:8000/7.html

But thanks anyway for your fast answer. I'm thankful for that!

If you could give me some more pointers I'll be delighted!

ppslim · Post by **ppslim** » Sun Nov 10, 2002 2:58 pm

A nice simple page.

First off, you should learn how to use the HTTP package for Tcl. As stated, google.tcl uses this, so it is a good place to find an example.
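
For readers who find google.tcl too large to start from, here is a minimal sketch of a download using Tcl's standard http package. The proc name fetch_page and the 10-second default timeout are assumptions made for illustration; only http::geturl, http::status, http::data and http::cleanup come from the package itself.

```tcl
package require http

# fetch_page is a hypothetical helper name, not part of the http package.
# It returns the body of the page at $url and raises an error if the
# transfer does not finish with status "ok".
proc fetch_page {url {timeout_ms 10000}} {
    set token [http::geturl $url -timeout $timeout_ms]
    set status [http::status $token]
    if {$status ne "ok"} {
        http::cleanup $token
        error "download failed: $status"
    }
    set body [http::data $token]
    # Free the state array the package keeps for this request.
    http::cleanup $token
    return $body
}

# Usage (needs network access, so it is left commented out here):
# set html [fetch_page "http://213.114.155.110:8000/7.html"]
```

Inside an eggdrop script it is probably wise to wrap the call in catch, so that a failed download does not abort the bind.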

Using the data it downloads, you can start parsing the contents.

One simple idea would be to locate the end of the <BODY> tag. This can be done using some of the simple commands Tcl provides.

You would then save all content up to the </BODY> tag, again with the provided commands.
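
The two steps above can be sketched with "string first" and "string range". The helper name body_text and the sample page layout are assumptions for illustration, and the sketch only handles a bare <body> tag without attributes on a plain ASCII page.

```tcl
# body_text is a hypothetical helper name. It returns the text between
# <body> and </body>, or an empty string if either tag is missing.
proc body_text {html} {
    # Search a lowercased copy so <BODY> and <body> both match; the
    # character positions are the same as in the original string.
    set lower [string tolower $html]
    set openIdx [string first "<body>" $lower]
    if {$openIdx < 0} { return "" }
    # The content starts right after the end of the opening tag.
    set start [expr {$openIdx + [string length "<body>"]}]
    set closeIdx [string first "</body>" $lower $start]
    if {$closeIdx < 0} { return "" }
    return [string range $html $start [expr {$closeIdx - 1}]]
}

# Assumed page layout, built around the sample line from the first post.
set page "<HTML><head></head><BODY>9,1,16,80,9,128,some text</BODY></HTML>"
puts [body_text $page]   ;# prints: 9,1,16,80,9,128,some text
```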

TC^ · Post by **TC^** » Sun Nov 10, 2002 4:23 pm

I've tried looking through google.tcl, and I must confess, it's a little too complicated for me..

However, I've found another script that is far simpler, but not nearly as advanced, and therefore harder to get to do what I want.

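set radio_url "http://213.114.155.110:8000/7.html"

bind pub - !users users_get

# Fetch the page with lynx and post the listener line to the channel.
proc users_get {nick mask hand chan args} {
    global radio_url
    # Run lynx as a pipeline and read the page source (the page is one line).
    set file [open "|lynx -source $radio_url" r]
    set html [gets $file]
    # Close the pipeline; catch ignores a non-zero exit status from lynx.
    catch {close $file}
    # Strip the header markup, then replace <body> with a bold label.
    regsub "<HTML><meta http-equiv=\"Pragma\" content=\"no-cache\"></head>" $html "" html
    regsub "<body>" $html "\002 Listeners: \002" html
    putchan $chan $html
}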

The webpage is as follows:

<HTML><meta http-equiv="Pragma" content="no-cache"></head><body>4,1,8,80,4,128,Fpu - Racer Car (Voidcom)</body></html>

And the script returns:

Listeners: 4,1,8,80,4,128,Fpu - Racer Car (Voidcom)</body></html>

My question is: can regsub be used in such a way that it ignores all text from the first comma onward, so that it returns this:

Listeners: 4

I hope it is detailed enough this time.
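
One possible way to do this, sketched under the assumption that the page always has the single-line layout shown above: a regsub whose pattern matches the first comma and everything after it, or alternatively a regexp that captures the text between <body> and the first comma. The variable names are made up for illustration.

```tcl
# Output of the script as quoted in the post above.
set html "Listeners: 4,1,8,80,4,128,Fpu - Racer Car (Voidcom)</body></html>"

# The pattern matches the first comma and everything after it, so
# replacing the match with an empty string keeps only what precedes it.
regsub {,.*$} $html "" listeners
puts $listeners   ;# prints: Listeners: 4

# Alternative: work on the raw page and capture the text between
# <body> and the first comma. "->" is just a throwaway variable name
# that receives the whole match.
set page {<HTML><meta http-equiv="Pragma" content="no-cache"></head><body>4,1,8,80,4,128,Fpu - Racer Car (Voidcom)</body></html>}
if {[regexp {<body>([^,]*),} $page -> count]} {
    puts "Listeners: $count"
}
```

The regexp form replaces both regsub calls in the posted script, and its return value tells you whether the page had the expected layout at all.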