
UNOFFICIAL incith-google 2.1x (Nov 30, 2012)

MellowB
Voice
Posts: 24
Joined: Wed Jan 23, 2008 6:02 am
Location: Germany

Post by MellowB »

speechles wrote: This should work... The problem with adding wikimedia sites is that the entire site becomes your 'country'. So to add, say, 'thissite.com/wiki' to the encode strings with UTF-8 encoding, it is as simple as adding the line below to encode_strings:
thissite.com/wiki:utf-8
I did actually try this already, but it is not working.
Using it with wiki.theppn.org - try searching for Utada Hikaru, for example. It's still giving me gibberish for the Japanese name in the output.
speechles wrote: You will be surprised to know it already does this. You have !translate (Google's version of it) included with this script. Type !help translate in a channel where the script is running.
<speechles> !help translate
-sp33chy- --> Bot triggers available:
-sp33chy- !tr,!trans,!translate region@region <text> with 1 results.
<speechles> !tr en@fr hello france, this is english translated to french.
<sp33chy> Google says: (en->fr) bonjour la France, c'est Anglais traduit en français.
Oh gawd, haha, I feel pretty stupid now. I've actually been using this for some time then... lol. Thought it was done by some older script that I still have running; didn't realize that it was actually your script doing this... haha. Well, thanks for clearing that up for me then. xD
On the keyboard of life, always keep one finger on the ESC key.
Renegade
Voice
Posts: 10
Joined: Sat May 24, 2008 12:40 am

Post by Renegade »

speechles wrote:This fixes the problem with calculations/conversions/etc
http://ereader.kiczek.com/incith-google-v1.98e.tcl 269 KB (276,224 bytes)
If problems persist, clear your web cache and re-get this file. There is no version update for this fix since it was so simple.
Thank you very, very much...works fine again :D
speechles wrote:For Google safe_search anomalies, I will need to check all Google-related site queries. There may be spots where I've removed this (for debugging purposes) and forgot to put it back in. I'm thinking video may allow content of a sexual nature even with safe_search on because of this (to debug some sections the query line is temporarily changed). Over this coming weekend I will make safe_search fully compliant on ALL Google-related sites (including YouTube), to address that very issue.
I actually meant the results on Google's side, not the script. Sometimes, Google returns results that are not PG-13 even with safe search on - be it that the query is too scientific, or that certain trigger words are missing on the page. That's not fixable by you, of course, it's just another reason the blacklist might be helpful.

As for the blacklist, I was actually not thinking of doing it based on the result - just refusing service for certain words. Like, if somebody does "!wiki anal sex", it replies "Sorry, I do not return results for that query." Just a blacklist for search terms.
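Just to make the idea concrete, here is a rough sketch of the kind of search-term blacklist I mean (purely hypothetical - the proc name and patterns are made up, this is not code from the script):

Code: Select all

set blacklist {anal* sex*}

# refuse service if any word of the query matches a blacklisted pattern
proc query_blacklisted {query patterns} {
  foreach word [split $query] {
    foreach pat $patterns {
      if {[string match -nocase $pat $word]} { return 1 }
    }
  }
  return 0
}

# query_blacklisted "anal sex" $blacklist -> 1, so the bot would reply
# "Sorry, I do not return results for that query."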

Lastly, I understand your reasons for not implementing the other searches. Of course my reason for requesting them was that I didn't want to install additional scripts, but if you would get in trouble for that, it's just not worth it. It's easy enough to install the other scripts.

Again, thank you very much for your help, and your work in general. The script is a favorite among the users. :D
(Especially the !locate trigger, for some reason...I added an extra trigger !stalk for it now.)
speechles
Revered One
Posts: 1398
Joined: Sat Aug 26, 2006 10:19 pm
Location: emerald triangle, california (coastal redwoods)

Post by speechles »

MellowB wrote:
speechles wrote: This should work... The problem with adding wikimedia sites is that the entire site becomes your 'country'. So to add, say, 'thissite.com/wiki' to the encode strings with UTF-8 encoding, it is as simple as adding the line below to encode_strings:
thissite.com/wiki:utf-8
I did actually try this already, but it is not working.
Using it with wiki.theppn.org - try searching for Utada Hikaru, for example. It's still giving me gibberish for the Japanese name in the output.
<speechles> !wm .wiki.theppn.org Utada Hikaru
<sp33chy> Utada Hikaru | Utada Hikaru (‡0Ò«ë) is one of Japan's most successful artists of all time. Her debut album, First Love, is the best-selling album ever in Japan with over 7.65 million copies sold in Japan alone. She has sold over 41 million records worldwide (with over 34 million in home nation Japan). Moreover, 3 of her albums are in the Top 10 best-selling album of all time in Japan (#1, #4, #8),
<sp33chy> making her one of the most indefinitely successful and popular singers in J-pop history. She is bilingual as she was raised in both New York and Tokyo. Utada Hikaru is also known in the west under her English language project name 'Utada'. Utada also sang the Kingdom Hearts themes, Hikari / Simple and Clean and the theme songs for @ http://wiki.theppn.org/Utada_Hikaru
For me it works (keep in mind nothing is associated with this site in my encode_strings; I'm using eggdrop's standard encoding for this), but for the Japanese I get gibberish. This is unavoidable at the moment, until I learn more about the apparent UTF-8 problem with eggdrop and how to handle multiple transcodings within the same page.

The problem is that the script can only transcode to a single encoding, while UTF-8 (supposedly) supports them all. The work-around for the eggdrop UTF-8 problem is to transcode from/to exact encodings. This works, but it cannot work when pages mix several languages. If someone can enlighten me on how to recognize embedded language changes before transcoding from UTF-8, I could regsub-inject encoding markers, which the bot could then use when rendering output to handle multiple encodings within the same output the way UTF-8 naturally does. If this all sounds very complicated, believe me, it is.
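To illustrate the single-encoding limitation, here is a toy example at a tclsh prompt (not the script's code, just the core encoding commands): decoding the same bytes with the wrong single encoding is exactly what produces that kind of gibberish.

Code: Select all

# toy example: utf-8 bytes decoded with one wrong, fixed encoding
set bytes [encoding convertto utf-8 "ヒカル"]
puts [encoding convertfrom utf-8 $bytes]      ;# ヒカル (correct)
puts [encoding convertfrom iso8859-1 $bytes]  ;# gibberish, the same kind of problem as above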

@Renegade, how about something other than a blacklist, which would simply tell the user 'Please use appropriate language in this channel.'? (Eventually that message would be spammed so often it would be just as bad as what was there before.) How about instead we build an aversion vocabulary? It would consist of something like this:

Code: Select all

# would you like to use vocabulary aversion?
# this will replace swear words with more appropriate words
# and the query returned will be aversion free.
# 0 disables, anything else enables
#----------
variable aversion_vocabulary 1

# set your aversion vocabulary below if desired:
# remember to enable, keep the setting above at 1.
#----------
variable aversion {
fork*:math
anal:true
sex:love
ass:toe
dick:nose
faggot:friend
butthole:ear
bitch:woman
bastard:man
cock:rooster
c--unt:lobster ; #remove the two -- forum puts [censored] if i don't add those, remove this comment and ; as well
etc:etc
lol:lmao
be:creative
etc:etc
}
What would occur is that the script checks this vocabulary against the input, for all triggers. When a word is in the aversion list, it is replaced with its matched word (delimited by the :). So a query of !w anal sex would reveal the wiki entry for 'true love'. This is much more appropriate for children to see and read right after some jackass tries to make a query of anal sex appear. I will also allow wildcards within the aversion list (fork*, if spelt like the real swear word, would catch forking, forked, any variant, and turn it into math), so you won't need to list every variant. Expect this to make the next revision of the script.
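For the curious, the wildcard part needs nothing exotic - Tcl's string match already does glob-style matching, so a check like the one below (just an illustration, not the script's actual code) is enough to catch the variants:

Code: Select all

# glob-style matching catches the variants of a wildcard entry (illustration only)
foreach word {forking forked pitchfork} {
  puts "$word -> [string match -nocase fork* $word]"
}
# forking -> 1, forked -> 1, pitchfork -> 0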
speechles
Revered One
Posts: 1398
Joined: Sat Aug 26, 2006 10:19 pm
Location: emerald triangle, california (coastal redwoods)

Post by speechles »

Code: Select all

    # would you like to use vocabulary aversion?
    # this will replace swear words with more appropriate words
    # and the query returned will be aversion free.
    # 0 disables, anything else enables
    #----------
    variable aversion_vocabulary 1
    
    # set your aversion vocabulary below if desired:
    # remember to enable, keep the setting above at 1.
    #----------
    variable aversion {
      fork:nice
      anal:true
      sex:love
      "analsex:true love"
      analsecks:truelove
    }
<speechles> !g fork
<sp33chy> 38,700,000 Results | NICE Systems - NICE Systems Home Pag @ http://www.nice.com/ | Welcome to the National Institute for He @ http://www.nice.org.uk/ | Nice - Wikipedia, the free encyclopedi @ http://en.wikipedia.org/wiki/Nice | The Nice programming language @ http://nice.sourceforge.net/
<speechles> !w anal sex
<sp33chy> True love | True love may refer to: In fiction: True Love (film). True Love (play), a play by Charles L. Mee. True Love (short story), by Isaac Asimov. True Love (video game). The Truelove, a novel by Patrick O'Brian. In music: True Love (Crystal Gayle album). "True Love" (Elliott Smith song), an unreleased song which was intended to be on the album From a Basement on the Hill. "True Love" (Fumiya
<sp33chy> Fujii song), a 1993 song by Fumiya Fujii. "True Love" (Lil' Romeo song). True Love (Pat Benatar album), an album by Pat Benatar. "True Love" (Soldiers of Jah Army song). "True Love" (song), a 1956 song by Cole Porter from the musical High Society. True Love (Toots & the Maytals album). Retrieved from "http://en.wikipedia.org @ http://en.wikipedia.org/wiki/True_love
<speechles> !w analsex
<sp33chy> True love | True love may refer to: In fiction: True Love (film). True Love (play), a play by Charles L. Mee. True Love (short story), by Isaac Asimov. True Love (video game). The Truelove, a novel by Patrick O'Brian. In music: True Love (Crystal Gayle album). "True Love" (Elliott Smith song), an unreleased song which was intended to be on the album From a Basement on the Hill. "True Love" (Fumiya
<sp33chy> Fujii song), a 1993 song by Fumiya Fujii. "True Love" (Lil' Romeo song). True Love (Pat Benatar album), an album by Pat Benatar. "True Love" (Soldiers of Jah Army song). "True Love" (song), a 1956 song by Cole Porter from the musical High Society. True Love (Toots & the Maytals album). Retrieved from "http://en.wikipedia.org @ http://en.wikipedia.org/wiki/True_love
<speechles> !w analsecks
<sp33chy> Clarissa Oakes | Clarissa Oakes (titled The Truelove in the U.S.A.), (1993) is an historical novel set during the Napoleonic Wars written by Patrick O'Brian. It again features the duo, "Lucky" Captain Jack Aubrey and his friend and companion Stephen Maturin. @ http://en.wikipedia.org/wiki/Clarissa_Oakes
Get the new script here, or use the link on the first page. Vocabulary aversion is fully functional, and it works across the entire engine, meaning all queries. Enjoy. ;)

Note: you also have full wildcard matching and a few other tricks you can perform. Take the below for example:

Code: Select all

faggot:friend
dick*:silly
"hell:a hot place"
*fork*:math
All of these would be valid entries. Hopefully you understand how they work in these combinations. Keep in mind that only the part after the : can be more than one word; this is what the word is replaced with. When using spaces you must use quotes; otherwise they are optional. For those curious, here is how the aversion is made possible.

Code: Select all

    # Vocabulary Aversion
    # This converts swear words into appropriate words for IRC
    # this is rather rudimentary, is probably a better way to do this but meh..
    #
    proc vocabaversion {text} {
      set newtext ""
      foreach element [split $text] {
        set violation 0
        foreach vocabulary $incith::google::aversion {
          set swear [lindex [split $vocabulary :] 0]
          set avert [join [lrange [split $vocabulary :] 1 end]]
          if {[string match -nocase "$swear" $element]} {
            append newtext "$avert "
            set violation 1
            break
          }
        }
        if {$violation == 0} { append newtext "$element " }
      }
      return $newtext
    } 
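And a quick way to try it outside the bot (the namespace and aversion variable are normally provided by the script itself; this standalone snippet just mimics them):

Code: Select all

    # standalone test of vocabaversion, mimicking the script's namespace variable
    namespace eval incith::google {
      variable aversion {
        fork*:math
        sex:love
        "hell:a hot place"
      }
    }
    puts [vocabaversion "what the fork is hell"]
    # -> what the math is a hot place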
Renegade
Voice
Posts: 10
Joined: Sat May 24, 2008 12:40 am

Post by Renegade »

Not what I originally had in mind, but just as good, if not better - thank you very much :D
(Tested and works.)

Code: Select all

variable version "incith:google-1.9.8e"
Just thought I'd point that out ;)


One more thing, though - after outputting the excerpt from a wikimedia page, the script outputs the URL of that page. Can you tell me how that URL is retrieved? 'cause it displays the wrong one for us, and I realized that it's a fault with our configuration - since I don't have admin access to the wiki, though, I can't just search the config for the wrong path - I have to tell the admin which variable is wrong.

P.S.: Love the "anal sex" -> "true love" conversion :lol:
speechles
Revered One
Posts: 1398
Joined: Sat Aug 26, 2006 10:19 pm
Location: emerald triangle, california (coastal redwoods)

Post by speechles »

Renegade wrote:Not what I originally had in mind, but just as good, if not better - thank you very much :D
(Tested and works.)

Code: Select all

variable version "incith:google-1.9.8e"
Just thought I'd point that out ;)
Doh! I thought I changed all of those to an f, but sometimes, humans being human, we fail, heh. I assure you it's 1.9.8f even though it will putlog itself as 1.9.8e.
Renegade wrote:One more thing, though - after outputting the excerpt from a wikimedia page, the script outputs the URL of that page. Can you tell me how that URL is retrieved? 'cause it displays the wrong one for us, and I realized that it's a fault with our configuration - since I don't have admin access to the wiki, though, I can't just search the config for the wrong path - I have to tell the admin which variable is wrong.
The way it arrives at the final URL is pretty simple.

Code: Select all

        set query "http://${country}/index.php?title=Special%3ASearch&search=${input}&fulltext=Search"
The script begins here. Depending on what we get back, it will attempt to go further.

Code: Select all

      # see if our direct result is available and if so, lets take it
      regexp -- {<div id="contentSub"><p>.*?<a href="(.+?)".*?title} $html - match
      if {[string match -nocase "*action=edit*" $match]} { set match "" }
      # otherwise we only care about top result
      # this is the _only_ way to parse mediawiki, sorry.
      if {$match == ""} {
        if {![regexp -- {<li><a href="((?!http).+?)"} $html - match]} { regexp -- {<li style.*?><a href="(.+?)"} $html - match} 
      }
      if {[string match -nocase "*/wiki*" $country]} {
        regsub -- {/wiki} $country {} country
      }
      ... continued below ...
This part here is the direction parser. From here we will either get a direct match, or we will simply use the most relevant top result. There is redirect traversal as well, below this part, but these parts are the primary method for determining the final URL displayed. If these fail, it attempts to read the 'no results found' message to display (see below). If it can't read that, it looks for any server error message displayed. If it can't find anything I've mentioned above, only then will it display a wikimedia error stating that it is unable to parse that URL.

Code: Select all

      ... continued from above ...
      # at this point we can tell if there was any match, so let's not even bother
      # going further if there wasn't a match, this pulls the 'no search etc' found.
      # this can be in any language.
      if {$match == ""} {
        # these are for 'no search results' or similar message
        # these can be in any language.
        if {[regexp -- {</form>.*?<p>(.+?)(<p><b>|</p><hr)} $html - match]} { regsub -all -- {<(.+?)>} $match {} match } 
        if {$match == ""} {
          if {[regexp -- {<div id="contentSub">(.+?)<form id=} $html - match]} {
            regsub -- { <a href="/wiki/Special\:Allpages.*?</a>} $match "." match
            regsub -- {<div.*?/div>} $match "" match
            regsub -- {\[Index\]} $match "" match
            regsub -- {<span.*?/span>} $match "" match
          } 
        }
        # this is our last error catch, this can grab the
        # 'wikimedia cannot search at this time' message
        # this can be in any language.
        if {[string len $match] < 3} { regexp -- {<center><b>(.+?)</b>} $html - match }
        if {$match == ""} {
          regsub -all -- { } $results {_} results
          if {$results != ""} { set results "#${results}" } 
          return "\002Wikimedia Error:\002 Unable to parse for: \002${input}\002 @ ${query}${results}"
        }
        # might be tags since we allowed any language here we cut them out
        regsub -all -- {<(.+?)>} $match {} match
        return "[descdecode ${match}]"
      }
This is the smart error control described above.
Renegade wrote:P.S.: Love the "anal sex" -> "true love" conversion :lol:
I thought you would. This way it baffles those who attempt these types of queries (filth mouths): they get a reply to their query attempt, but not quite the one they expected. This can be fully customized for any language as well; if special characters are used within the aversion array, they simply need to be escaped to prevent Tcl errors.
MellowB
Voice
Posts: 24
Joined: Wed Jan 23, 2008 6:02 am
Location: Germany

Post by MellowB »

speechles wrote:
<speechles> !wm .wiki.theppn.org Utada Hikaru
<sp33chy> Utada Hikaru | Utada Hikaru (‡0Ò«ë) is one of Japan's most successful artists of all time. Her debut album, First Love, is the best-selling album ever in Japan with over 7.65 million copies sold in Japan alone. She has sold over 41 million records worldwide (with over 34 million in home nation Japan). Moreover, 3 of her albums are in the Top 10 best-selling album of all time in Japan (#1, #4, #8),
<sp33chy> making her one of the most indefinitely successful and popular singers in J-pop history. She is bilingual as she was raised in both New York and Tokyo. Utada Hikaru is also known in the west under her English language project name 'Utada'. Utada also sang the Kingdom Hearts themes, Hikari / Simple and Clean and the theme songs for @ http://wiki.theppn.org/Utada_Hikaru
For me it works (keep in mind nothing is associated with this site in my encode_strings; I'm using eggdrop's standard encoding for this), but for the Japanese I get gibberish. This is unavoidable at the moment, until I learn more about the apparent UTF-8 problem with eggdrop and how to handle multiple transcodings within the same page.

The problem is that the script can only transcode to a single encoding, while UTF-8 (supposedly) supports them all. The work-around for the eggdrop UTF-8 problem is to transcode from/to exact encodings. This works, but it cannot work when pages mix several languages. If someone can enlighten me on how to recognize embedded language changes before transcoding from UTF-8, I could regsub-inject encoding markers, which the bot could then use when rendering output to handle multiple encodings within the same output the way UTF-8 naturally does. If this all sounds very complicated, believe me, it is.
Yes well, that's the exact same gibberish that I get, and it is only the case with the !wm trigger. I still believe that the !wm trigger does not really work with the utf-8/encode_strings variables, and I think so because of this:
[05:05:28] <MellowB> !wiki Utada Hikaru
[05:05:31] <Cocco> Hikaru Utada | Hikaru Utada (宇多田 ヒカル, Utada Hikaru^?, born January 19, 1983), also known by her fans as Hikki (ヒッキー, Hikkī^?), is a third culture singer-songwriter, arranger and record producer in Japan and the US. She is well-known internationally for her two theme song contributions to Square Enix's Kingdom Hearts video game series. Utada's first official Japanese album First
^That's what I get if I use the normal !wiki with the English Wikipedia set as default. It works perfectly fine; Japanese text is displayed correctly.
[05:08:50] <MellowB> !wm .en.wikipedia.org Utada Hikaru
[05:08:52] <Cocco> Hikaru Utada | Hikaru Utada (‡0 Ò«ë, Utada Hikaru^?, born January 19, 1983), also known by her fans as Hikki (Òíü, Hikk+^?), is a third culture singer-songwriter, arranger and record producer in Japan and the US. She is well-known internationally for her two theme song contributions to Square Enix's Kingdom Hearts video game series. Utada's first official Japanese album First Love became the
^And this is what I get if I search for the exact same thing, on the exact same page, using the !wm trigger. And yes, I did add en.wikipedia.org to the encode_strings section set to utf-8 (the exact string is "en.wikipedia.org:utf-8"). It shows the same gibberish as when I search on the other page that I want to use.

Really, please look into it again and check whether there really is no problem with the trigger, as I still believe that it is not using the encode_strings settings at all.
Sorry to be so annoying about this, but I would really like to use some other wikis with that !wm trigger, and currently that is only possible in a rather limited way. I would be glad if you could either fix it or enlighten me as to why it will not work.
On the keyboard of life, always keep one finger on the ESC key.
speechles
Revered One
Posts: 1398
Joined: Sat Aug 26, 2006 10:19 pm
Location: emerald triangle, california (coastal redwoods)

Post by speechles »

MellowB wrote:
[05:05:28] <MellowB> !wiki Utada Hikaru
[05:05:31] <Cocco> Hikaru Utada | Hikaru Utada (宇多田 ヒカル, Utada Hikaru^?, born January 19, 1983), also known by her fans as Hikki (ヒッキー, Hikkī^?), is a third culture singer-songwriter, arranger and record producer in Japan and the US. She is well-known internationally for her two theme song contributions to Square Enix's Kingdom Hearts video game series. Utada's first official Japanese album First
^That's what I get if I use the normal !wiki with the English Wikipedia set as default. It works perfectly fine; Japanese text is displayed correctly.
[05:08:50] <MellowB> !wm .en.wikipedia.org Utada Hikaru
[05:08:52] <Cocco> Hikaru Utada | Hikaru Utada (‡0 Ò«ë, Utada Hikaru^?, born January 19, 1983), also known by her fans as Hikki (Òíü, Hikk+^?), is a third culture singer-songwriter, arranger and record producer in Japan and the US. She is well-known internationally for her two theme song contributions to Square Enix's Kingdom Hearts video game series. Utada's first official Japanese album First Love became the
^And this is what I get if I search for the exact same thing, on the exact same page, using the !wm trigger. And yes, I did add en.wikipedia.org to the encode_strings section set to utf-8 (the exact string is "en.wikipedia.org:utf-8"). It shows the same gibberish as when I search on the other page that I want to use.

Really, please look into it again and check whether there really is no problem with the trigger, as I still believe that it is not using the encode_strings settings at all.
Wow, you are 100% absolutely correct. I forgot that long ago, when adding these, I added a special condition to both that was really only disclosed in regard to the Serbian language (Serbian Wikipedia supports a Latin equivalent). But to support this semi-disclosed extended feature (easter egg! yay!), the Wikipedia output encoding was extended, and I forgot to extend Wikimedia's encoding to support the same functionality correctly (it does now though ;)). So let me explain what this easter egg is, and confidently say that if you re-get the v1.9.8f script, Wikipedia/Wikimedia will mirror each other (at least in regard to output; functionality is slightly different, which is why they are separated).
!wiki .sr@sr-el serbia
Notice the @[encoding] after the initial site. This is regional switching. Wikipedia expects a URL configuration to go hand in hand with this (these are special cases where other dialects go together with this country), so it only works when it should as far as Wikipedia goes. You cannot turn the English page into Russian encoding unless Wikipedia supports this.
<speechles> !wiki .sr@sr-el serbia
<sp33chy> ..snipped stuff.. @ http://sr.wikipedia.org/sr-el/Serbia
<speechles> !wiki .sr serbia
<sp33chy> ..snipped stuff.. @ http://sr.wikipedia.org/Serbia
For Wikipedia, the rule is there to protect this URL correlation. But for Wikimedia... this same rule does not apply.
!wm .en.wikipedia.org@ja Utada Hikaru
This will force the page to be interpreted explicitly as Japanese (if you have the encoding for Japanese set for ja in encode_strings), regardless of what your default encoding for that site is set to. Hopefully you understand. I added this functionality to address situations where multiple encodings per page mangle results; it is a quasi work-around. You can add custom encodings this way to reference them outside the normal ones Wikipedia uses.
!wm .yoursite.com/wiki@custom term

Code: Select all

custom:cp1251
Any use of 'custom' as the region with !wikimedia would encode as cp1251 in this case.
!w [.country[@region]] search term[#subtag]
!wm .yoursite.com[/wiki][@region] search term[#subtag]
If a region is used, it overrides the country and yoursite encodings. This should hopefully explain it all.. keke :)

Last of all, as I remember things... heh. You can also embed the default site with a region, so for those in Serbia, or for your website:
variable wikimedia_site "yoursite.com/wiki@custom"
variable wiki_country "sr@sr-el"
This will force the default to override the setting within encode_strings as well, allowing you to get to the exact page and encoding you wish as the default.
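Purely as an illustration of what the @region notation boils down to (this is not the script's actual code; the helper proc and the fallback encoding below are made up), the spec is split on @ and the region is looked up among encode_strings-style site:encoding pairs:

Code: Select all

# hypothetical sketch: resolve a "site@region" spec against site:encoding pairs
set encode_strings {
  en.wikipedia.org:utf-8
  custom:cp1251
}

proc resolve_spec {spec pairs} {
  set parts  [split $spec "@"]
  set site   [lindex $parts 0]
  set region [lindex $parts 1]
  set enc "iso8859-1"  ;# placeholder default, stands in for the site/country setting
  foreach entry $pairs {
    if {$region ne "" && [string equal -nocase [lindex [split $entry ":"] 0] $region]} {
      set enc [lindex [split $entry ":"] 1]
    }
  }
  return [list $site $enc]
}

# resolve_spec "yoursite.com/wiki@custom" $encode_strings -> {yoursite.com/wiki cp1251}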
MellowB
Voice
Posts: 24
Joined: Wed Jan 23, 2008 6:02 am
Location: Germany

Post by MellowB »

Fantastic, thanks for clearing that one up and all the explaining.
Works like a charm now. :D
Glad that you are so supportive with your scripts and update/help out in such a quick and unproblematic way; much appreciated!
On the keyboard of life, always keep one finger on the ESC key.
Renegade
Voice
Posts: 10
Joined: Sat May 24, 2008 12:40 am

Post by Renegade »

Okay, through an army of putlogs, I was able to pinpoint my problem.
I would rather not give up the site's location here, so I'll use placeholders.
The problem is as follows:
Our wiki is at site.tld/wikidir
The script's final link goes to site.tld/wikidir/wikidir/index.php?title=result
As you can see, wikidir is repeated. For the record, I consciously picked "wikidir", not "wiki", because the wiki folder is not /wiki/

The reason for this seems to be as follows:
$wikimedia_site is set to site.tld/wikidir

Now,

Code: Select all

regexp -- {<div id="contentSub"><p>.*?<a href="(.+?)".*?title} $html - match
reads out the anchor URL at the top and puts it into $match - however, that anchor's href is /wikidir/index.php?title=result. Since it begins with a /, it works fine on the web server, resolving from the root of the site. However, in the script, we do

Code: Select all

set country "${incith::google::wikimedia_site}"
setting $country to site.tld/wikidir

So we have
$country = site.tld/wikidir
$match = /wikidir/index.php?title=result


and then do

Code: Select all

      # we assume here we found another page to traverse in our search.
      if {![string match "*http*" $match]} { set query "http://${country}${match}" }
$query = http://site.tld/wikidir/wikidir/index.php?title=result

And finally, we do

Code: Select all

set link $query
and
use $link in $output, before we return $output.



I believe the root of the problem is the assumption that the wikimedia software is installed in the root of the domain, thus not accounting for folders in $wikimedia_site. Would it be possible to implement a fix for that? Maybe something as simple as checking for a slash in $wikimedia_site and then only taking the part before it?


Edit: If you're thinking "wth? he should have an 'unable to parse' error way earlier!", you're probably right. But in preparation for the last update of the mediawiki software, the server admin moved /wikidir to /wikidir/wikidir - in other words, there's an outdated but functional wiki at that location. Searching for everything that existed before the update works as expected. Everything that was added after the update is obviously not in the outdated backup, and gives an "unable to parse" error - that's how I originally found the problem.
Had our beloved :roll: admin not moved the backup there, I probably would've noticed this problem right from the start. -_-
speechles
Revered One
Posts: 1398
Joined: Sat Aug 26, 2006 10:19 pm
Location: emerald triangle, california (coastal redwoods)

Post by speechles »

Renegade wrote:Okay, through an army of putlogs, I was able to pinpoint my problem.
I would rather not give up the site's location here, so I'll use placeholders.
The problem is as follows:
Our wiki is at site.tld/wikidir
The script's final link goes to site.tld/wikidir/wikidir/index.php?title=result
As you can see, wikidir is repeated. For the record, I consciously picked "wikidir", not "wiki", because the wiki folder is not /wiki/

The reason for this seems to be as follows:
$wikimedia_site is set to site.tld/wikidir

Now,

Code: Select all

regexp -- {<div id="contentSub"><p>.*?<a href="(.+?)".*?title} $html - match
reads out the anchor URL at the top and puts it into $match - however, that anchor's href is /wikidir/index.php?title=result. Since it begins with a /, it works fine on the web server, resolving from the root of the site. However, in the script, we do

Code: Select all

set country "${incith::google::wikimedia_site}"
setting $country to site.tld/wikidir

So we have
$country = site.tld/wikidir
$match = /wikidir/index.php?title=result
But you're missing an important step in this logic trace, something that normally catches this.

Code: Select all

      if {[string match -nocase "*/wiki*" $country]} {
        regsub -- {/wiki} $country {} country
      }
Our friend, mr. regsub, normally takes the $country and strips '/wiki' from it, which makes these types of collisions rare (unless the mediawiki page is overly customized, with a custom subdir not named 'wiki'...). But as a work-around, I can also easily split by /, take the last lindex of country, check whether that is the first lindex of match, and if so remove it from country. Conversely, I would add a variable to keep the old behavior for those who don't modify their subdirectory from /wiki. This allows them to keep their entries doubled if they happen to occur like this; explicitly removing duplicates can cause problems depending on how wikimedia is installed on their machine, since some pages require this duality. I'll have something to support this very shortly. :wink:

Code: Select all

# Wikimedia URL detection
# remove double entries from urls? would remove a '/wikisite' from this
# type of url @ http://yoursite.com/wikisite/wikisite/search_term
# if you have issues regarding url capture with wikimedia, enable this.
# /wiki/wiki/ problems are naturally averted, a setting of 0 already
# stops these type.
# --------
variable wiki_domain_detect 1
Renegade
Voice
Posts: 10
Joined: Sat May 24, 2008 12:40 am

Post by Renegade »

Read the part you quoted again - "For the record, I consciously picked "wikidir", not "wiki", because the wiki folder is not /wiki/" ;)

Like I said, I traced the whole process via putlogs, and it definitely progresses the way I outlined. If the directory strip only acts on /wiki, it's absolutely logical it fails on our /wikidir.
speechles
Revered One
Posts: 1398
Joined: Sat Aug 26, 2006 10:19 pm
Location: emerald triangle, california (coastal redwoods)

Post by speechles »

Renegade wrote:Read the part you quoted again - "For the record, I consciously picked "wikidir", not "wiki", because the wiki folder is not /wiki/" ;)

Like I said, I traced the whole process via putlogs, and it definitely progresses the way I outlined. If the directory strip only acts on /wiki, it's absolutely logical it fails on our /wikidir.
Like I said, I wrote the thing. Why do I need to re-read what you said? I completely understand what happens; the logic doesn't need to be traced for me. I have intimate knowledge of what every line does without seeing it run; my mind is the interpreter... heh. And with that, I easily fixed the script. :D
Here it is!

Code: Select all

      # this will strip double domain entries from our country if it exists
      # on our anchor.
      if {$incith::google::wiki_domain_detect != 0} {
        if {[string match -nocase [lindex [split $country "/"] end] [lindex [split $match "/"] 1]]} {
          set country [join [lrange [split $country "/"] 0 end-1] "/"]
        }
      } elseif {[string match -nocase "*/wiki*" $country]} {
       regsub -- {/wiki} $country {} country
      }
This compares the last piece of the URL in $country to the first piece of $match (since match starts with /, using lindex 0 gives us an empty string, which is no good, so we use 1). If they are identical, the script removes the last piece from $country to compensate. Otherwise, if you have disabled domain detection, it simply uses the old behavior, removing doubled /wiki instances.
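To see it with Renegade's placeholder values, tracing the snippet above by hand:

Code: Select all

    # tracing the check with the placeholder values from earlier in the thread
    set country "site.tld/wikidir"
    set match   "/wikidir/index.php?title=result"
    puts [lindex [split $country "/"] end]   ;# wikidir
    puts [lindex [split $match "/"] 1]       ;# wikidir (index 0 is empty because $match starts with /)
    # they are identical, so the doubled piece is dropped from $country
    set country [join [lrange [split $country "/"] 0 end-1] "/"]
    puts "http://${country}${match}"         ;# http://site.tld/wikidir/index.php?title=result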

Note: I'm aware !news and !local do not return results with certain query combinations (rare). Google has recently added another template to their display engine which the script cannot parse, yet.. :)
Renegade
Voice
Posts: 10
Joined: Sat May 24, 2008 12:40 am

Post by Renegade »

I hate to point it out, but...

Code: Select all

variable version "incith:google-1.9.8f"
:wink:

Anyway, tested and works fine - not that I ever doubted that. Thank you very much :)


Independently of that, I was not suggesting you re-read what I said because I doubted your grasp of your own code, but because your post a) implied I made a mistake in my tracing, even though I saw the state of both variables in my partyline and know it went that way, and b) you said "(unless the mediawiki page is overly customized, with a custom subdir not named 'wiki'...)", while my original post readily acknowledged that we have a custom wiki directory - implying you had overlooked that.

I was, at no point, in doubt over your abilities, or questioning your skills - I just wanted to point out that you were basing your rebuttal on the wrong assumption (that I was trying to parse a /wiki/ directory), and that I already admitted the special case of a custom wiki directory applies.

You're doing great work - I have no reason to insult or doubt you.
speechles
Revered One
Posts: 1398
Joined: Sat Aug 26, 2006 10:19 pm
Location: emerald triangle, california (coastal redwoods)

Post by speechles »

Renegade wrote:Independently of that, I was not suggesting you re-read what I said because I doubted your grasp of your own code, but because your post a) implied I made a mistake in my tracing, even though I saw the state of both variables in my partyline and know it went that way, and b) you said "(unless the mediawiki page is overly customized, with a custom subdir not named 'wiki'...)", while my original post readily acknowledged that we have a custom wiki directory - implying you had overlooked that.
Wasn't it obvious that regsub removes a /wiki from country? Why would it do this? Why at that particular point? To cure the exact scenario you describe above, only it will NOT work on custom wiki subdirectories. Now it can do this by checking and stripping rather than by assumption.

What you need to realize as well is that there is no clear documentation on any of this wikipedia/wikimedia stuff; you figure it out as you go along. I had to develop a lot of the routines on my own, without the use of any API or query functions. Browsing is preferred since it doesn't tie the request to a specific user (key-authenticated sessions aren't required), which is much easier in the public's hands. So the script was written as if it were a browser itself, attempting to 'read' the site. It is very difficult to do this elegantly, especially for any language and any wikipedia page. The problem is compounded when you add in the #subtag and table-of-contents abilities. Functions such as the transcode and encode_strings handling can be useful for other scripters as well: if others have problems using eggdrop with UTF-8 and foreign languages, I suggest they have a look at this script and use the functions I've created to handle it. Sharing is caring.
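As a rough illustration of the general technique (this is not the script's actual transcode proc; the helper below and its name are made up for the example), the idea is to fetch the page as raw bytes and convert them from the per-site encoding, so eggdrop ends up with correctly decoded strings:

Code: Select all

# hypothetical sketch of per-site transcoding, not the script's actual code
package require http

proc fetch_as {url enc} {
  # grab the raw bytes, then convert them from the site's encoding
  set tok [http::geturl $url -binary 1]
  set raw [http::data $tok]
  http::cleanup $tok
  return [encoding convertfrom $enc $raw]
}

# usage: set html [fetch_as "http://en.wikipedia.org/wiki/Main_Page" utf-8]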

Note: I've got a horrible flu thing going on, allergies or something. So if I come off upset for some reason, it's not because of you; it's because I feel like 5hit.