
Parsing HTML encoded in US-Ascii

Help for those learning Tcl or writing their own scripts.
Yourmove
Voice
Posts: 2
Joined: Tue Jul 18, 2006 4:07 pm

Post by Yourmove »

I've been trying (for a very long time) to parse a website (http://www.anidb.info) using my eggdrop bot, but for some reason all it returns is gibberish. My other scripts that parse websites work fine. I wasn't sure what was happening at first, but then I realized that my eggdrop doesn't have a *.enc file for us-ascii. I tried to create my own, but it seems I can't change the directory the encoding files are loaded from. So I came here (after searching the forums for an answer) to ask whether anyone has successfully parsed a website encoded in US-ASCII, and if so, what process you used. I read the tutorials on characters and encoding, but that really didn't help me solve the problem. The system is using Tcl 8.4 and I'm using the http package.

Edit: I'm still new to Tcl so please be patient...
De Kus
Revered One
Posts: 1361
Joined: Sun Dec 15, 2002 11:41 am
Location: Germany

Post by De Kus »

I somehow doubt it's a charset problem (since the default charset iso-8859-1 and most others include US-ASCII). I rather believe it's because the server returns a gzipped page. The server sends gzipped content even if you explicitly forbid it in the HTTP request, or even if you use an HTTP version which doesn't support it, and is therefore in violation of the HTTP RFCs (RFC 1945/RFC 2616) in many ways. You will most likely have to hand the content over to gunzip so you can read the uncompressed file.
PuTTY wrote:
GET /perl-bin/animedb.pl HTTP/1.1
Host: anidb.info
Accept-Encoding: chunked;q=1, *;q=0

HTTP/1.1 200 OK
Date: Wed, 19 Jul 2006 07:30:27 GMT
Server: Apache/1.3.36 (Unix) mod_perl/1.29
Set-Cookie: adbuin=1153294273-nVfC; path=/; expires=Sat, 16-Jul-2016 07:31:13 GMT
Cache-control: no-cache
Pragma: no-cache
Content-Type: text/html
Expires: Wed, 19 Jul 2006 07:31:13 GMT
X-Cache: MISS from anidb.info
Content-Encoding: gzip
Content-Length: 8216

PuTTY wrote:
GET /perl-bin/animedb.pl HTTP/1.0
Host: anidb.info

HTTP/1.1 200 OK
Date: Wed, 19 Jul 2006 07:31:57 GMT
Server: Apache/1.3.36 (Unix) mod_perl/1.29
Set-Cookie: adbuin=1153294324-QXPa; path=/; expires=Sat, 16-Jul-2016 07:32:04 GMT
Cache-control: no-cache
Pragma: no-cache
Content-Type: text/html
Expires: Wed, 19 Jul 2006 07:32:04 GMT
X-Cache: MISS from anidb.info
Connection: close
Content-Encoding: gzip
Content-Length: 8216
As you can see, it even ignores the HTTP/1.0 request and sends HTTP/1.1, even if it's not supported. I wonder if you can make Apache do that without hardcoding the header in the Perl scripts, which would be just plain stupid on the scripter's part. Maybe they don't care about people not being able to use gzip (even old IE would choke on that, since it supported only deflate).
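
A minimal sketch of the gunzip route, assuming Tcl 8.4 with the stock http package and a gunzip binary on the PATH. The proc names fetch_page and gunzip_bytes and the /tmp scratch file are made up for illustration, not part of any library. It fetches the body as raw bytes, checks the response headers for Content-Encoding: gzip, and if found runs the body through gunzip -c via a temporary file.

```tcl
package require http

# Hypothetical helper: decompress gzip data by handing it to the external
# gunzip program (assumed to be installed; Tcl 8.4 has no built-in zlib).
proc gunzip_bytes {data} {
    # Scratch file location is an assumption; adjust for your system
    set tmp [file join /tmp "page[pid].gz"]
    set fh [open $tmp w]
    fconfigure $fh -translation binary
    puts -nonewline $fh $data
    close $fh

    # Read gunzip's output as raw bytes so nothing gets re-encoded
    set pipe [open "|[list gunzip -c $tmp]" r]
    fconfigure $pipe -translation binary
    set plain [read $pipe]
    # close raises an error if gunzip exited with a failure status
    set failed [catch {close $pipe} err]
    file delete $tmp
    if {$failed} {
        error "gunzip failed: $err"
    }
    return $plain
}

# Hypothetical helper: fetch a URL and return the (uncompressed) body bytes.
proc fetch_page {url} {
    # -binary 1 keeps the body untouched, which matters for compressed data
    set tok [http::geturl $url -binary 1 -timeout 15000]
    upvar #0 $tok state
    set body [http::data $tok]

    # state(meta) is a flat list of header name/value pairs
    set gzipped 0
    foreach {name value} $state(meta) {
        if {[string equal -nocase $name "Content-Encoding"]
                && [string match -nocase "*gzip*" $value]} {
            set gzipped 1
        }
    }
    http::cleanup $tok

    if {$gzipped} {
        set body [gunzip_bytes $body]
    }
    return $body
}
```

The result is still a string of bytes, so something like `set html [encoding convertfrom iso8859-1 [fetch_page http://anidb.info/perl-bin/animedb.pl]]` would be the next step before running regexps on it. Note that a blocking http::geturl call stalls the bot while it waits. On Tcl 8.6 or later the built-in `zlib gunzip` command should make the temporary file unnecessary.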

Hint: if you want to show the &...; encoded Japanese characters you will most likely have to use UTF-8 or SHIFT-JIS output (and of course find a library that can convert them to a native encoding supported by Tcl).
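
For the numeric form of those entities (&#12354; or &#x3042;), plain Tcl may be enough. The sketch below is an assumption about what the page contains, not something tested against anidb: decode_numeric_entities is a made-up name, and named entities such as &amp; are not handled. It rewrites each numeric reference into a [format %c ...] call and lets subst evaluate it, after escaping anything else subst would interpret.

```tcl
# Hypothetical helper: replace &#NNN; and &#xHHH; with the Unicode
# characters they stand for. Named entities are left untouched.
proc decode_numeric_entities {text} {
    # Escape brackets, dollar signs and backslashes so subst leaves
    # the page's own text alone
    regsub -all {[][$\\]} $text {\\&} text
    # scan avoids leading zeros being read as octal by format
    regsub -all {&#([0-9]+);} $text {[format %c [scan \1 %d]]} text
    regsub -all {&#[xX]([0-9a-fA-F]+);} $text {[format %c [scan \1 %x]]} text
    return [subst -novariables $text]
}

# Example: decode two hiragana characters, then encode them for output
set decoded [decode_numeric_entities "&#12354;&#x3044;"]
set utf8bytes [encoding convertto utf-8 $decoded]
```

For Shift-JIS output the same encoding convertto call should work with the name shiftjis, provided `encoding names` lists it on your install.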
De Kus
StarZ|De_Kus, De_Kus or DeKus on IRC
Copyright © 2005-2009 by De Kus - published under The MIT License
Love hurts, love strengthens...
Yourmove
Voice
Posts: 2
Joined: Tue Jul 18, 2006 4:07 pm

Ok

Post by Yourmove »

Oh, I didn't even notice that part. Thanks for the information. I'll try and see what I can do. I'll report back if I still get problems.

Thanks again.