A wrapper is needed to simplify this entire process: one with redirect and cookie support that handles errors gracefully, entirely recursive, and straightforward in how it works. Hopefully..
It would be used like so:
s:wget <url-string> <type-string> <referer-string> <cookies-list> <redirects-int> <poison-int>
The only required argument is <url-string>; the rest, if omitted, fall back to defaults. The timeout and user agent are hard-coded to keep things simple as well...
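So the simplest possible call relies on every default. A quick sketch (example.com is just a stand-in for a real site):
Code: Select all
# plain GET: no referer, no cookies, fresh redirect/poison counters
set reply [s:wget "http://example.com/"]
if {$reply != 0} {
    foreach {html cookiejar} $reply { break }
}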
You would interact with the wrapper like so...
Code: Select all
# login and get cookiejar and html to start with
set reply [s:wget "http:://website.com/login?username=user&password=pass&submit=submit%20the%20url" POST]
# was there an error?
if {$reply != 0} {
# no, separate html and cookiejar from reply
foreach {html cookiejar_login} $reply { break }
# get next url, using cookiejar_login obtained from above
set reply [s:wget "http:://website.com/search?text=this&somethingelse=that&submit=Go" GET "http://website.com/login" $cookiejar_login]
# was there an error?
if {$reply != 0} {
# no, get html to parse and create a cookiejar for search
# separate from the one we used for login. we may never
# use cookiejar_search, but if we need it, it's there.
foreach {html cookiejar_search} $reply { break }
# rest of script here..
# etc...
# etc...using cookiejar_login
# if cookiejar_login expires, you will need
# to re-login. you can use a timer, or bind
# to time, to simply fetch a fresh cookiejar
# and keep it in global space for the search
# procedure to use. there are many ways to go
# from here.... (see the sketch after this example)
} else {
# yes, error type is issued in partyline...
}
} else {
# yes, error type is issued in partyline...
}
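As one concrete way to go from there, here is a minimal sketch of that re-login idea using eggdrop's timer command (the login URL and credentials are placeholders, and s:relogin is just a name made up for this example):
Code: Select all
# refresh cookiejar_login every 30 minutes; url/credentials are placeholders
proc s:relogin {} {
    global cookiejar_login
    set reply [s:wget "http://website.com/login?username=user&password=pass&submit=submit" POST]
    if {$reply != 0} {
        # keep the fresh cookiejar in global space for other procs to use
        set cookiejar_login [lindex $reply 1]
    }
    # re-arm for the next refresh
    timer 30 s:relogin
}
s:relogin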
Here is the code for said wrapper:
Code: Select all
# AN EASIER WAY v1.1
# by speechles (c) 2013
# s:wget - simple http wrapper for tcl scripting to access html/cookies
# s:debug - simple debug output for s:wget wrapper
#
# v1.1 - introducing an easier way...
# v1.0 - the beginning...
package require http
# remove the two lines below if you don't care about HTTPS
package require tls
http::register https 443 [list ::tls::socket -require 0 -request 1]
# recursive wget with cookies and referer
proc s:wget { url {type GET} {refer ""} {cookies ""} {re 0} {poison 0} } {
http::config -useragent "Mozilla/EggdropWget"
# if we have cookies, let's use em ;)
if {[string equal -nocase GET $type]} {
if {![string length $cookies]} {
catch {set token [http::geturl $url -binary 1 -timeout 10000]} error
} else {
catch {set token [::http::geturl $url -binary 1 -headers [list "Referer" "$refer" "Cookie" "[string trim [join $cookies {;}] {;}]" ] -timeout 10000]} error
}
} else {
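# POST: split the url at the "?" and send everything after it as the request body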
foreach {url query} [split $url ?] {break}
if {![string length $cookies]} {
catch {set token [http::geturl $url -query $query -binary 1 -timeout 10000]} error
} else {
catch {set token [::http::geturl $url -query $query -binary 1 -headers [list "Referer" "$refer" "Cookie" "[string trim [join $cookies {;}] {;}]" ] -timeout 10000]} error
}
}
# error condition 1, invalid socket or other general error
if {![string match -nocase "::http::*" $error]} {
s:debug "Error: [string totitle [string map {"\n" " | "} $error]] \( $url \)"
return 0
}
# error condition 2, http error
if {![string equal -nocase [::http::status $token] "ok"]} {
s:debug "Http error: [string totitle [::http::status $token]] \( $url \)"
http::cleanup $token
return 0
}
upvar #0 $token state
# iterate through the meta array to grab cookies
foreach {name value} $state(meta) {
# do we have cookies?
if {[regexp -nocase ^Set-Cookie$ $name]} {
# yes, add them to cookie list
lappend ourCookies [lindex [split $value {;}] 0]
}
}
# if no cookies this iteration remember cookies from last
if {![info exists ourCookies] && [string length $cookies]} {
set ourCookies $cookies
}
# recursive redirect support, 300's
# the full gamut of browser support, hopefully ... ;)
if {[string match "*[http::ncode $token]*" "303|302|301" ]} {
foreach {name value} $state(meta) {
if {[regexp -nocase ^location$ $name]} {
if {![string match "http*" $value]} {
# fix our locations if needed
if {![string match "/" [string index $value 0]]} {
set value "[join [lrange [split $url "/"] 0 2] "/"]/$value"
} else {
set value "[join [lrange [split $url "/"] 0 2] "/"]$value"
}
}
# catch redirects to self. There is one rule:
# A url can redirect to itself a few times to attempt to
# gain proper cookies, or referers. This is hard-coded at 2.
# We catch the 3rd time and poison our recursion with it.
# This will stop the madness ;)
if {[string match [string map {" " "%20"} $value] $url]} {
incr poison
if {$poison > 2} {
s:debug "HTTP Error: Redirect error self to self \(3rd instance poisoned\) \( $url \)"
http::cleanup $token
return 0
}
}
# poison any nested recursion over 10 traversals deep. no legitimate
# site needs to do this. EVER!
if {[incr re] > 10} {
s:debug "HTTP Error: Redirect error (>10 too deep) \( $url \)"
http::cleanup $token
return 0
}
http::cleanup $token
# recursive redirect by passing cookies and referer
# this is what makes it now work! :)
if {![info exists ourCookies]} {
return [s:wget [string map {" " "%20"} $value] $type $url "" $re $poison]
} else {
return [s:wget [string map {" " "%20"} $value] $type $url $ourCookies $re $poison]
}
}
}
}
# waaay down here, we finally check the ncode for 400 or 500 codes
if {[string match 4* [http::ncode $token]] || [string match 5* [http::ncode $token]]} {
s:debug "Http resource is not available: [http::ncode $token] \( $url \)"
http::cleanup $token
return 0
}
# --- return reply
set data [http::data $token]
http::cleanup $token
if {[info exists ourCookies]} {
return [list $data $ourCookies]
} else {
return [list $data ""]
}
}
# debug - errors pass thru here
proc s:debug {text} {
putlog "s:wget: $text"
}
# eof
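Since all errors pass through s:debug, that is also the single place to change if you would rather see them somewhere other than the partyline log. A sketch (#channel is a placeholder):
Code: Select all
# variant: echo wrapper errors to a channel as well as the partyline log
proc s:debug {text} {
    putserv "PRIVMSG #channel :s:wget: $text"
    putlog "s:wget: $text"
}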
Testing it from the partyline:
.tcl set cookie [lindex [s:wget http://www.google.com/search?q=hi POST] 1]
> [01:32] s:wget: Http resource is not available: 405 ( http://www.google.com/search )
> Tcl:
.tcl set cookie [lindex [s:wget http://www.google.com/search?q=hi GET] 1]
> Tcl: PREF=ID=5fce19142b49a09b:FF=0:TM=1368865867:LM=1368865867:S=VkNZO3p8OIULztjJ NID=67=JZcQs5ry96mcOry62REFi1cnEuQmFWCfFSez9hifN9Raz5PWwHTcJY1HBkepwwMQgI3dyMdIfqYzjVq6rPpsvRFi61yiloXZpdtPJhVEbOAByVvHRRVKYPaFN1TEWhZS
.tcl llength $cookie
> Tcl: 2
Google doesn't like POST used for GET requests, as you can tell, and it also issues "two" cookies to keep track of... hence the "cookiejar" is always a list.
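If you ever need to look at those cookies individually, each list element is one name=value pair exactly as the server sent it. A quick sketch (not part of the wrapper itself):
Code: Select all
# split each cookie at the FIRST "=" only; values like Google's
# PREF cookie contain "=" characters of their own
foreach cookie $cookiejar {
    set idx [string first = $cookie]
    set name [string range $cookie 0 [expr {$idx - 1}]]
    set value [string range $cookie [expr {$idx + 1}] end]
    putlog "cookie: $name -> $value"
}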
Hopefully, with all the #comments in the code, it is clear what is going on...