For everything except Google translations, the script has seamless encoding support (in this new experimental version it is seamless everywhere, meaning it won't generate Tcl errors, though you may see inaccurate text if your bot cannot display the intended encoding). In other words, if the bot couldn't switch to that encoding because it couldn't find it, it won't give an error; the text will just look wrong. That is probably what happened, as it's happened to me before.

MellowB wrote:Updated to the latest version now. Had some slight problems at first (everything that was Unicode before suddenly wasn't) but got it working. Seriously though, I have no clue why it then suddenly started working again. *shrugs*
When your bot doesn't start correctly and loses encodings, leaving only (utf-8, identity, unicode), this is what happens. You can verify this by logging into your partyline and typing .tcl set a [encoding names]; if you see a short list containing only the three encodings above, this is exactly the scenario.
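As a quick way to check for this condition from the partyline, here is a hedged sketch (not part of the script; the charset list is just an example, adjust it to the encodings your channels actually need):

```tcl
# run via .tcl on the partyline: report which of the encodings
# we care about survived the bot's startup
foreach cs {utf-8 shiftjis cp1251 koi8-r iso8859-1} {
    if {[lsearch -exact [encoding names] $cs] != -1} {
        putlog "encoding $cs: available"
    } else {
        putlog "encoding $cs: MISSING"
    }
}
```

If most of these come back MISSING, the bot is in the crippled state described above.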
MellowB wrote:Either way, somehow the bot still is unable to read unicode from the input I give it. This has been a problem with the older versions for me and still is with the current one. (at least this is a problem with Japanese, Korean and similar "complicated" charsets/languages)
So yeah, searching google or wikipedia with some Japanese word is not possible at all - it just searches with some gibberish.
Code:
variable encoding_conversion_input 0
About this, I've added a special function which should help you and others customize things better. First off, Google translations will now react based on the charset Google itself tells the bot the text should be displayed in. This cannot be changed; it will only use the charset encoding Google has told it to use. This works 100% without problems in all my tests using this new script.

MellowB wrote:Also the !trans trigger is not parsed in Unicode either.
It usually seems to use the charset that the translated to pair is using, so if I get some translation to Korean for example I have to set my IRC client to Korean to be able to actually read what the bot gives me as output. Could you maybe do some workaround or whatever to get the output to the channel in UTF-8? There must be some way to force the translation page of google to Unicode, right? :S
I kinda hope so... :>
Now for the new feature, which should help alleviate encoding headaches greatly.
Code:
# THIS IS TO BE USED TO DEVELOP A BETTER LIST FOR USE BELOW.
# To work around certain encodings, it is now necessary to give
# the public a way to troubleshoot some parts of the script on
# their own. Using these features involves the two settings below.
# Set the debug flag and the debug administrator's nick here;
# they are used for debugging purposes only.
#----------
variable debug 1
variable debugnick speechles
At the moment, only Google translations will actually follow the charset given, and there isn't a way to make them do anything else. Everything else in the script disregards what is given in the charset field and strictly uses the encode_strings section. The debug is there to let you correctly match encodings and complete the encode_strings array. It is not possible to correctly follow the charsets given in some Google requests (the bot will mangle UTF-8 into unreadable swiss cheese, unfortunately), which is why the encode_strings array is used rather than automatically trusting the charset Google has reported.

* in channel *
<speechles> !g .co.jp sushi
<sp33chy> Google | YouTube - sushi @ http://www.youtube.com/watch?v=0b75cl4-qRE | YouTube - The Kaiten-Sushi Experience @ http://www.youtube.com/watch?v=ZMUpXGlTyAc | SUSHI-MASTER @ http://sushi-master.com/ | SUSHI-MASTER @ http://sushi-master.com/jpn/index4.html
* whilst in private message, the debug administrator will see this *
<sp33chy> url: http://www.google.co.jp/search?q=sushi& ... all&num=10 charset: shift_jis
* in channel *
<speechles> !tr en@ru this should be russian
<sp33chy> Google says: (en->ru) ÜÔÏ ÄÏÌÖÅÎ ÂÙÔØ ÒÕÓÓËÉÊ
* whilst in private message, the debug administrator will see this *
<sp33chy> url: http://www.google.com/translate_t?text= ... ir=en%7cru charset: koi8-r
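To see how a charset reported in the debug line (shift_jis, koi8-r, and so on) maps onto a name Tcl's [encoding] command accepts, here is a small illustrative sketch using the same string map the script's conversion procs apply; charset2tcl is a hypothetical helper name, not part of the script:

```tcl
# map an IANA-style charset name (as seen in "charset: ..." debug
# output) onto the equivalent Tcl encoding name
proc charset2tcl {charset} {
    string map {"iso-" "iso" "windows-" "cp" "shift_jis" "shiftjis"} $charset
}

puts [charset2tcl shift_jis]     ;# shiftjis
puts [charset2tcl koi8-r]        ;# koi8-r (already a valid Tcl name)
puts [charset2tcl windows-1251]  ;# cp1251
puts [charset2tcl iso-8859-1]    ;# iso8859-1
```

Names that come out of the map but still aren't in [encoding names] are simply left unconverted by the script, which is where the garbled-but-error-free output comes from.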
The more I learn, the more advanced this script has become. You can see this script as the result of my two years or so of scripting Tcl. It has been created with a minimalistic approach: keeping queries basic and simple (without cumbersome -switches) while allowing depth and extensions for more advanced users. You can see this programming style, for example, in the wikipedia/wikimedia commands.

MellowB wrote:Anyway, so far, thanks for the updates and fixes again. It sure is highly appreciated as always. :]
!wiki thing
!wiki .sr thing
!wiki .sr@sr-el thing
!wiki .sr@sr-el thing#anchor
For most users, just having the English Wikipedia available would suffice; to me this is too confining and biased. And merely having Serbian was not enough: it must also do the Serbian-Latin equivalent. And merely displaying an article from wherever the bot felt appropriate was not enough: it must display a table of contents and allow some kind of search by anchor tags. These are small details that not many pick up on right away, and that's 100% intentional. If you bury someone in too many commands, they will feel baffled. If they slowly get their feet wet, eventually they feel comfortable attempting the more advanced commands. I see this often in the channels my bot is in.
Without further ado, here is the experimental version of this script. Only those experiencing difficulty setting up the encode_strings should use it. This feature will be included in all future versions of the script once more testing is done. Enjoy knowing the encodings for all the URLs the bot reads; this should help those having problems finding exact encodings.
Note: this may be important for google translations to function 100%.
Code:
# AUTOMAGIC CHARSET ENCODING SUPPORT
# on-the-fly encoding support
#
# Convert text fetched from the web into Tcl's internal
# representation, using the charset the remote server reported
# (held in the global incithcharset).
proc incithdecode {text} {
	global incithcharset
	# translate IANA-style charset names into their Tcl equivalents,
	# e.g. iso-8859-1 -> iso8859-1, windows-1251 -> cp1251,
	# shift_jis -> shiftjis
	set incithcharset [string map {"iso-" "iso" "windows-" "cp" "shift_jis" "shiftjis"} $incithcharset]
	# only convert if this Tcl actually knows the encoding;
	# otherwise pass the text through untouched (no errors raised)
	if {[lsearch -exact [encoding names] $incithcharset] != -1} {
		set text [encoding convertfrom $incithcharset $text]
	}
	return $text
}

# Counterpart of incithdecode: convert text from Tcl's internal
# representation back into the server's reported charset.
proc incithencode {text} {
	global incithcharset
	set incithcharset [string map {"iso-" "iso" "windows-" "cp" "shift_jis" "shiftjis"} $incithcharset]
	if {[lsearch -exact [encoding names] $incithcharset] != -1} {
		set text [encoding convertto $incithcharset $text]
	}
	return $text
}
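As a usage sketch (not part of the script; the variable values here are hypothetical), the idea is to set incithcharset to whatever charset the page reported, then run the text through the matching proc:

```tcl
# hypothetical example: the debug line reported charset: shift_jis
set incithcharset "shift_jis"
# pretend this is the raw content of a Japanese page ("sushi" in kanji)
set rawtext [encoding convertto shiftjis "\u5bff\u53f8"]
# decode into Tcl's internal representation so it can be parsed
set page [incithdecode $rawtext]
# re-encode outgoing text into the same charset before sending it out
set out [incithencode $page]
```

Note that both procs normalize incithcharset in place, so after the first call the global holds the Tcl name (shiftjis) rather than the IANA one.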