Parsing a entire html source page

ComputerTech
Master
Posts: 399
Joined: Sat Feb 22, 2020 10:29 am

Post by ComputerTech »

So I am trying to retrieve the entire HTML source of this page: https://google.com/search?q=lego

Code: Select all

bind PUB - "!test" the:test

package require http
package require tls

proc the:test {nick host hand chan text} {
    http::register https 443 [list ::tls::socket]
    set url "https://www.google.com/search?q=lego"
    set data [::http::data [::http::geturl "$url" -timeout 10000]]
    ::http::config -useragent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0"
    foreach lines2 $data {putserv "PRIVMSG $chan :$lines2"}
    http::unregister https
}
And I am getting this:

Code: Select all

<Tech> <HTML><HEAD><meta
<Tech> http-equiv="content-type"
<Tech> content="text/html;charset=utf-8">
<Tech> <TITLE>302
<Tech> Moved</TITLE></HEAD><BODY>
<Tech> <H1>302
<Tech> Moved</H1>
<Tech> The
<Tech> document
<Tech> has
<Tech> moved
<Tech> <A
<Tech> HREF="https://www.google.com/sorry/index?continue=https://www.google.com/search%3Fq%3Dlego&q=EhAmB1MAAGEA2QAMAAAAAAAAGIDuioMGIhkA8aeDS7Cl4MTYJvxJOGvj5SyvlN0tmGEIMgFy">here</A>.
<Tech> </BODY></HTML>
CrazyCat
Revered One
Posts: 1306
Joined: Sun Jan 13, 2002 8:00 pm
Location: France

Post by CrazyCat »

This is because you didn't account for possible redirections (such as 301 or 302) and you don't check the response status.
Your line:

Code: Select all

set data [::http::data [::http::geturl "$url" -timeout 10000]]
A better way (though not the best):

Code: Select all

set tok [::http::geturl $url]
if {[::http::ncode $tok]==301 || [::http::ncode $tok]==302} {
   # this is a redirection
} else {
   set data [::http::data $tok]
}
You can also use ::http::status and the other accessors to check whether you got the page you expected.

Have a look at https://www.tcl.tk/man/tcl8.4/TclCmd/http.htm
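
To make the status check above a bit more concrete, here is a minimal sketch. The helper names `classify_response` and `classify_code` are hypothetical (they are not part of the http package), and the `in` operator assumes Tcl 8.5 or later. It only uses ::http::status and ::http::ncode on a finished token.

```tcl
package require http

# Hypothetical helper: map a numeric HTTP code to "ok", "redirect" or an error string.
proc classify_code {ncode} {
    if {$ncode in {301 302 303 307 308}} {
        return "redirect"
    } elseif {$ncode >= 200 && $ncode < 300} {
        return "ok"
    }
    return "error: HTTP $ncode"
}

# Hypothetical helper: classify a finished ::http::geturl token.
proc classify_response {tok} {
    # ::http::status is "ok" when the transaction completed;
    # other values include "timeout", "eof" and "error".
    set status [::http::status $tok]
    if {$status ne "ok"} {
        return "error: $status"
    }
    return [classify_code [::http::ncode $tok]]
}
```

In the bot proc you would call `classify_response $tok` right after ::http::geturl and only read ::http::data when the result is "ok".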

Post by ComputerTech »

Thanks CrazyCat, I will try that :wink:

Post by ComputerTech »

I tried your suggestion, CrazyCat:

Code: Select all

bind PUB - "!test" the:test

package require http
package require tls

proc the:test {nick host hand chan text} {
    http::register https 443 [list ::tls::socket]
    set url "https://www.google.com/search?q=lego+ninjago"
    set tok [::http::geturl $url]
    if {[::http::ncode $tok]==301 || [::http::ncode $tok]==302} {
        putserv "PRIVMSG $chan :FAIL"
    } else {
        set data [::http::data $tok]
    }
    ::http::config -useragent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0"
    foreach lines2 $data {putserv "PRIVMSG $chan :$lines2"}
    http::unregister https
}
Results

Code: Select all

<ComputerTech> !test
<Tech> FAIL
Google still thinks I am a bot. Any ideas on how to bypass this?

Post by CrazyCat »

Google doesn't think you're a bot; it redirects you to a version you can read (without JavaScript).

Code: Select all

set tok [::http::geturl $url]
if {[::http::ncode $tok]==301 || [::http::ncode $tok]==302} {
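   # $tok holds the name of the state array; its "meta" element is a
   # flat list of response header names and values
   upvar #0 $tok state
   array set meta $state(meta)
   set data [::http::data [::http::geturl $meta(Location)]]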
} else {
   set data [::http::data $tok]
}
Note that this approach works only if there is a single redirection.
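
If there can be more than one redirection, a loop with a hop limit is one way to handle it. This is only a sketch under some assumptions: the proc names `fetch_following_redirects` and `header_value` are hypothetical, the Location header is assumed to contain an absolute URL, and `ni` and ::http::meta assume Tcl 8.5 or later.

```tcl
package require http

# Hypothetical helper: case-insensitive lookup in a flat
# header-name/value list such as the one ::http::meta returns.
proc header_value {meta name} {
    foreach {key value} $meta {
        if {[string equal -nocase $key $name]} {
            return $value
        }
    }
    return ""
}

# Hypothetical helper: fetch $url, following at most $maxHops redirections.
# Assumes every Location header holds an absolute URL.
proc fetch_following_redirects {url {maxHops 5}} {
    for {set hop 0} {$hop <= $maxHops} {incr hop} {
        set tok [::http::geturl $url -timeout 10000]
        set status [::http::status $tok]
        set ncode [::http::ncode $tok]
        set body [::http::data $tok]
        set location [header_value [::http::meta $tok] Location]
        # Free the token once everything needed has been copied out.
        ::http::cleanup $tok
        if {$status ne "ok"} {
            error "request failed: $status"
        }
        if {$ncode ni {301 302 303 307 308} || $location eq ""} {
            return $body
        }
        set url $location
    }
    error "too many redirections"
}
```

The hop limit keeps the bot from looping forever if two pages redirect to each other.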

Also, I don't understand why you call ::http::config -useragent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0" after you have already used ::http. The ::http::config call must be done when ::http is initialised, before any request is made.
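
A minimal sketch of that ordering, with the setup done once when the script is loaded rather than inside the proc. The helper names `body_lines` and `fetch_lines` are hypothetical, and tls is loaded defensively here only so the sketch still loads where the package is missing (HTTPS URLs need it).

```tcl
package require http

# One-time setup at load time, before any ::http::geturl call.
if {![catch {package require tls}]} {
    ::http::register https 443 [list ::tls::socket]
}
::http::config -useragent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0"

# Hypothetical helper: split a response body into trimmed, non-empty lines.
proc body_lines {body} {
    set lines {}
    foreach line [split $body "\n"] {
        set line [string trim $line]
        if {$line ne ""} {
            lappend lines $line
        }
    }
    return $lines
}

# Hypothetical helper: fetch a page and return its body as a list of lines.
proc fetch_lines {url} {
    set tok [::http::geturl $url -timeout 10000]
    set body [::http::data $tok]
    ::http::cleanup $tok
    return [body_lines $body]
}
```

Inside the bind proc this could be used as `foreach line [fetch_lines $url] {putserv "PRIVMSG $chan :$line"}`. Splitting on newlines matters because `foreach` over the raw body treats it as a Tcl list and breaks it at every whitespace, which is why the first output in this thread came out one word per line.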