For everything except Google translations, the script has seamless encoding support (in this new experimental version it is seamless everywhere, meaning it won't generate Tcl errors, though you may see inaccurate text if your bot cannot display the intended encoding). In other words, if the bot couldn't switch to that encoding because it couldn't find it, it won't give an error; the text will just look wrong. That is probably what happened, as it's happened to me before.

MellowB wrote:Updated to the latest version now. Had some slight problems at first (everything that was Unicode before suddenly wasn't) but got it working. Seriously though, I have no clue why it then suddenly started working again. *shrugs*
When your bot doesn't start correctly and loses encodings, leaving only (utf-8, identity, unicode), this is what happens. You can verify this by logging into your partyline and typing .tcl set a [encoding names]; if you see a short list containing only the three encodings above, this is exactly the scenario.
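As a quick way to check for this condition from the partyline, here is a hedged sketch (not part of the script; the charset list is just an example, adjust it to the encodings your channels actually need):

```tcl
# run via .tcl on the partyline: report which of the encodings
# we care about survived the bot's startup
foreach cs {utf-8 shiftjis cp1251 koi8-r iso8859-1} {
    if {[lsearch -exact [encoding names] $cs] != -1} {
        putlog "encoding $cs: available"
    } else {
        putlog "encoding $cs: MISSING"
    }
}
```

If most of these come back MISSING, the bot is in the crippled state described above.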
MellowB wrote:Either way, somehow the bot still is unable to read unicode from the input I give it. This has been a problem with the older versions for me and still is with the current one. (at least this is a problem with Japanese, Korean and similar "complicated" charsets/languages)
So yeah, searching google or wikipedia with some Japanese word is not possible at all - it just searches with some gibberish.
Code:
variable encoding_conversion_input 0
About this, I've added a special function which should help you and others customize things better. First off, Google translations will now react based on the charset Google itself tells the bot the text should be displayed in. This cannot be changed; it will only use the charset encoding Google has told it to use. This works 100% without problems in all my tests using this new script.

MellowB wrote:Also the !trans trigger is not parsed in Unicode either.
It usually seems to use the charset that the translated to pair is using, so if I get some translation to Korean for example I have to set my IRC client to Korean to be able to actually read what the bot gives me as output. Could you maybe do some workaround or whatever to get the output to the channel in UTF-8? There must be some way to force the translation page of google to Unicode, right? :S
I kinda hope so... :>
Now for the new feature, which should help alleviate encoding headaches greatly.
Code:
# THIS IS TO BE USED TO DEVELOP A BETTER LIST FOR USE BELOW.
# To work around certain encodings, it is now necessary to give
# the public a way to troubleshoot some parts of the script on
# their own. Using these features involves the two settings below.
# Set the debug flag and the debug administrator's nick here;
# they are used for debugging purposes only.
#----------
variable debug 1
variable debugnick speechles
At the moment, only Google translations will actually follow the charset given, and there isn't a way to make them do anything else. Everything else in the script disregards what is given in the charset field and strictly uses the encode_strings section. The debug is there to let you correctly match encodings and complete the encode_strings array. It is not possible to correctly follow the charsets given in some Google requests (the bot will mangle UTF-8 into unreadable swiss cheese, unfortunately), which is why the encode_strings array is used rather than automatically trusting the charset Google has reported.

* in channel *
<speechles> !g .co.jp sushi
<sp33chy> Google | YouTube - sushi @ http://www.youtube.com/watch?v=0b75cl4-qRE | YouTube - The Kaiten-Sushi Experience @ http://www.youtube.com/watch?v=ZMUpXGlTyAc | SUSHI-MASTER @ http://sushi-master.com/ | SUSHI-MASTER @ http://sushi-master.com/jpn/index4.html
* whilst in private message, the debug administrator will see this *
<sp33chy> url: http://www.google.co.jp/search?q=sushi& ... all&num=10 charset: shift_jis
* in channel *
<speechles> !tr en@ru this should be russian
<sp33chy> Google says: (en->ru) ÜÔÏ ÄÏÌÖÅÎ ÂÙÔØ ÒÕÓÓËÉÊ
* whilst in private message, the debug administrator will see this *
<sp33chy> url: http://www.google.com/translate_t?text= ... ir=en%7cru charset: koi8-r
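To see how a charset reported in the debug line (shift_jis, koi8-r, and so on) maps onto a name Tcl's [encoding] command accepts, here is a small illustrative sketch using the same string map the script's conversion procs apply; charset2tcl is a hypothetical helper name, not part of the script:

```tcl
# map an IANA-style charset name (as seen in "charset: ..." debug
# output) onto the equivalent Tcl encoding name
proc charset2tcl {charset} {
    string map {"iso-" "iso" "windows-" "cp" "shift_jis" "shiftjis"} $charset
}

puts [charset2tcl shift_jis]     ;# shiftjis
puts [charset2tcl koi8-r]        ;# koi8-r (already a valid Tcl name)
puts [charset2tcl windows-1251]  ;# cp1251
puts [charset2tcl iso-8859-1]    ;# iso8859-1
```

Names that come out of the map but still aren't in [encoding names] are simply left unconverted by the script, which is where the garbled-but-error-free output comes from.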
The more I learn, the more advanced this script has become. You can see this script as the result of my two years or so of scripting Tcl. It has been created with a minimalistic approach: keeping queries basic and simple (without cumbersome -switches) while allowing depth and extensions for more advanced users. You can see this programming style, for example, in the wikipedia/wikimedia commands.

MellowB wrote:Anyway, so far, thanks for the updates and fixes again. It sure is highly appreciated as always. :]
!wiki thing
!wiki .sr thing
!wiki .sr@sr-el thing
!wiki .sr@sr-el thing#anchor
For most users, just having the English Wikipedia available would suffice; to me this is too confining and biased. And merely having Serbian was not enough: it must also do the Serbian-Latin equivalent. And merely displaying an article from wherever the bot felt appropriate was not enough: it must display a table of contents and allow some kind of search by anchor tags. These are small details that not many pick up on right away, and that's 100% intentional. If you bury someone in too many commands, they will feel baffled. If they slowly get their feet wet, eventually they feel comfortable attempting the more advanced commands. I see this often in the channels my bot is in.
Without further ado, here is the experimental version of this script. Only those experiencing difficulty setting up the encode_strings should use it. This feature will be included in all future versions of the script once more testing is done. Enjoy knowing the encodings for all the URLs the bot reads; this should help those having problems finding exact encodings.
Note: this may be important for google translations to function 100%.
Code:
# AUTOMAGIC CHARSET ENCODING SUPPORT
# on-the-fly encoding support
#
# Convert text fetched from the web into Tcl's internal
# representation, using the charset the remote server reported
# (held in the global incithcharset).
proc incithdecode {text} {
	global incithcharset
	# translate IANA-style charset names into their Tcl equivalents,
	# e.g. iso-8859-1 -> iso8859-1, windows-1251 -> cp1251,
	# shift_jis -> shiftjis
	set incithcharset [string map {"iso-" "iso" "windows-" "cp" "shift_jis" "shiftjis"} $incithcharset]
	# only convert if this Tcl actually knows the encoding;
	# otherwise pass the text through untouched (no errors raised)
	if {[lsearch -exact [encoding names] $incithcharset] != -1} {
		set text [encoding convertfrom $incithcharset $text]
	}
	return $text
}

# Counterpart of incithdecode: convert text from Tcl's internal
# representation back into the server's reported charset.
proc incithencode {text} {
	global incithcharset
	set incithcharset [string map {"iso-" "iso" "windows-" "cp" "shift_jis" "shiftjis"} $incithcharset]
	if {[lsearch -exact [encoding names] $incithcharset] != -1} {
		set text [encoding convertto $incithcharset $text]
	}
	return $text
}
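As a usage sketch (not part of the script; the variable values here are hypothetical), the idea is to set incithcharset to whatever charset the page reported, then run the text through the matching proc:

```tcl
# hypothetical example: the debug line reported charset: shift_jis
set incithcharset "shift_jis"
# pretend this is the raw content of a Japanese page ("sushi" in kanji)
set rawtext [encoding convertto shiftjis "\u5bff\u53f8"]
# decode into Tcl's internal representation so it can be parsed
set page [incithdecode $rawtext]
# re-encode outgoing text into the same charset before sending it out
set out [incithencode $page]
```

Note that both procs normalize incithcharset in place, so after the first call the global holds the Tcl name (shiftjis) rather than the IANA one.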