Release: SA_urltitle.tcl

madpinger · Post by **madpinger** » Sat Oct 30, 2010 12:01 pm

The purpose of this eggdrop script is to relay the title of a URL posted to
an IRC channel by IRC users. It attempts to identify the correct character
encoding so the title text is preserved, and it replaces HTML entities with
their Unicode counterparts.

http://github.com/madpinger/Eggdrop-URL-title-script

Bash me, use it, abuse it, whatever works. ^.^

Just felt like doing it.

The first URL is utf-8, the second URL is euc-jp.
example of iso8859-1 compiled bot:

example of utf-8 compiled bot:

As you can see, it handles different encodings, though with limits that depend on the system's encoding and the encoding the bot was compiled with.

Updates:
Added speechles's new proc with some notes, as you will need to make changes for it to work, depending on your system and how your bot is compiled. Eventually I'll get around to working out how to account for all the different configurations.

Fixed the white space issue pointed out by spithash. It just never occurred to me as an issue, lol.
Changed how I use http cleanup, so it should no longer lose any tokens.

speechles · Post by **speechles** » Sat Oct 30, 2010 12:22 pm

MOAR scripts are a good thing

This might help your script with completeness and compatibility (patched utf-8 vs not). This procedure is what I presently use within my twitter script. It is a more evolved version of the same procedure within incith-google.

Code:

proc decode_entities {text {char "utf-8"} } {
	# code below is necessary to prevent numerous html markups
	# from appearing in the output (ie, &quot;, &#5671;, etc)
	# stolen (borrowed is a better term) from tcllib's htmlparse ;)
	# works unpatched utf-8 or not, unlike htmlparse::mapEscapes
	# which will only work properly patched....
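	set escapes {
		&nbsp; \xa0 &iexcl; \xa1 &cent; \xa2 &pound; \xa3 &curren; \xa4
		&yen; \xa5 &brvbar; \xa6 &sect; \xa7 &uml; \xa8 &copy; \xa9
		&ordf; \xaa &laquo; \xab &not; \xac &shy; \xad &reg; \xae
		&macr; \xaf &deg; \xb0 &plusmn; \xb1 &sup2; \xb2 &sup3; \xb3
		&acute; \xb4 &micro; \xb5 &para; \xb6 &middot; \xb7 &cedil; \xb8
		&sup1; \xb9 &ordm; \xba &raquo; \xbb &frac14; \xbc &frac12; \xbd
		&frac34; \xbe &iquest; \xbf &Agrave; \xc0 &Aacute; \xc1 &Acirc; \xc2
		&Atilde; \xc3 &Auml; \xc4 &Aring; \xc5 &AElig; \xc6 &Ccedil; \xc7
		&Egrave; \xc8 &Eacute; \xc9 &Ecirc; \xca &Euml; \xcb &Igrave; \xcc
		&Iacute; \xcd &Icirc; \xce &Iuml; \xcf &ETH; \xd0 &Ntilde; \xd1
		&Ograve; \xd2 &Oacute; \xd3 &Ocirc; \xd4 &Otilde; \xd5 &Ouml; \xd6
		&times; \xd7 &Oslash; \xd8 &Ugrave; \xd9 &Uacute; \xda &Ucirc; \xdb
		&Uuml; \xdc &Yacute; \xdd &THORN; \xde &szlig; \xdf &agrave; \xe0
		&aacute; \xe1 &acirc; \xe2 &atilde; \xe3 &auml; \xe4 &aring; \xe5
		&aelig; \xe6 &ccedil; \xe7 &egrave; \xe8 &eacute; \xe9 &ecirc; \xea
		&euml; \xeb &igrave; \xec &iacute; \xed &icirc; \xee &iuml; \xef
		&eth; \xf0 &ntilde; \xf1 &ograve; \xf2 &oacute; \xf3 &ocirc; \xf4
		&otilde; \xf5 &ouml; \xf6 &divide; \xf7 &oslash; \xf8 &ugrave; \xf9
		&uacute; \xfa &ucirc; \xfb &uuml; \xfc &yacute; \xfd &thorn; \xfe
		&yuml; \xff &fnof; \u192 &Alpha; \u391 &Beta; \u392 &Gamma; \u393 &Delta; \u394
		&Epsilon; \u395 &Zeta; \u396 &Eta; \u397 &Theta; \u398 &Iota; \u399
		&Kappa; \u39A &Lambda; \u39B &Mu; \u39C &Nu; \u39D &Xi; \u39E
		&Omicron; \u39F &Pi; \u3A0 &Rho; \u3A1 &Sigma; \u3A3 &Tau; \u3A4
		&Upsilon; \u3A5 &Phi; \u3A6 &Chi; \u3A7 &Psi; \u3A8 &Omega; \u3A9
		&alpha; \u3B1 &beta; \u3B2 &gamma; \u3B3 &delta; \u3B4 &epsilon; \u3B5
		&zeta; \u3B6 &eta; \u3B7 &theta; \u3B8 &iota; \u3B9 &kappa; \u3BA
		&lambda; \u3BB &mu; \u3BC &nu; \u3BD &xi; \u3BE &omicron; \u3BF
		&pi; \u3C0 &rho; \u3C1 &sigmaf; \u3C2 &sigma; \u3C3 &tau; \u3C4
		&upsilon; \u3C5 &phi; \u3C6 &chi; \u3C7 &psi; \u3C8 &omega; \u3C9
		&thetasym; \u3D1 &upsih; \u3D2 &piv; \u3D6 &bull; \u2022
		&hellip; \u2026 &prime; \u2032 &Prime; \u2033 &oline; \u203E
		&frasl; \u2044 &weierp; \u2118 &image; \u2111 &real; \u211C
		&trade; \u2122 &alefsym; \u2135 &larr; \u2190 &uarr; \u2191
		&rarr; \u2192 &darr; \u2193 &harr; \u2194 &crarr; \u21B5
		&lArr; \u21D0 &uArr; \u21D1 &rArr; \u21D2 &dArr; \u21D3 &hArr; \u21D4
		&forall; \u2200 &part; \u2202 &exist; \u2203 &empty; \u2205
		&nabla; \u2207 &isin; \u2208 &notin; \u2209 &ni; \u220B &prod; \u220F
		&sum; \u2211 &minus; \u2212 &lowast; \u2217 &radic; \u221A
		&prop; \u221D &infin; \u221E &ang; \u2220 &and; \u2227 &or; \u2228
		&cap; \u2229 &cup; \u222A &int; \u222B &there4; \u2234 &sim; \u223C
		&cong; \u2245 &asymp; \u2248 &ne; \u2260 &equiv; \u2261 &le; \u2264
		&ge; \u2265 &sub; \u2282 &sup; \u2283 &nsub; \u2284 &sube; \u2286
		&supe; \u2287 &oplus; \u2295 &otimes; \u2297 &perp; \u22A5
		&sdot; \u22C5 &lceil; \u2308 &rceil; \u2309 &lfloor; \u230A
		&rfloor; \u230B &lang; \u2329 &rang; \u232A &loz; \u25CA
		&spades; \u2660 &clubs; \u2663 &hearts; \u2665 &diams; \u2666
		&quot; \x22 &amp; \x26 &lt; \x3C &gt; \x3E &OElig; \u152 &oelig; \u153
		&Scaron; \u160 &scaron; \u161 &Yuml; \u178 &circ; \u2C6
		&tilde; \u2DC &ensp; \u2002 &emsp; \u2003 &thinsp; \u2009
		&zwnj; \u200C &zwj; \u200D &lrm; \u200E &rlm; \u200F &ndash; \u2013
		&mdash; \u2014 &lsquo; \u2018 &rsquo; \u2019 &sbquo; \u201A
		&ldquo; \u201C &rdquo; \u201D &bdquo; \u201E &dagger; \u2020
		&Dagger; \u2021 &permil; \u2030 &lsaquo; \u2039 &rsaquo; \u203A
		&euro; \u20AC &apos; \u0027 &lrm; "" &rlm; ""
	};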
	if {![string equal $char [encoding system]]} { set text [encoding convertfrom $char $text] }
	set text [string map [list "\]" "\\\]" "\[" "\\\[" "\$" "\\\$" "\"" "\\\"" "\\" "\\\\"] [string map $escapes $text]]
	regsub -all -- {&#([[:digit:]]{1,5});} $text {[format %c [string trimleft "\1" "0"]]} text
	regsub -all -- {&#x([[:xdigit:]]{1,4});} $text {[format %c [scan "\1" %x]]} text
	catch { set text "[subst "$text"]" }
	if {![string equal $char [encoding system]]} { set text [encoding convertto $char $text] }
	return "$text"
}

Feel free to steal (borrow) this..
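
For comparison, here is a minimal sketch of just the numeric-reference step (&#NNN; and &#xHHHH;) done without subst, so brackets and dollar signs in the page title never need escaping. This is a swapped-in technique, not speechles's method: the proc name decode_numeric_refs is hypothetical, and it assumes Tcl 8.5 or newer for lassign and {*}.

```tcl
# Hypothetical helper, not part of decode_entities or SA_urltitle.tcl.
# Replaces decimal (&#NNN;) and hexadecimal (&#xHHHH;) character
# references with the character they name, leaving all other text alone.
proc decode_numeric_refs {text} {
    set out ""
    set pos 0
    set re {&#([xX][[:xdigit:]]{1,4}|[[:digit:]]{1,5});}
    while {[regexp -indices -start $pos -- $re $text whole num]} {
        lassign $whole first last
        set ref [string range $text {*}$num]
        # Copy the untouched text that precedes this reference.
        append out [string range $text $pos [expr {$first - 1}]]
        if {[string match -nocase "x*" $ref]} {
            scan [string range $ref 1 end] %x code
        } else {
            scan $ref %d code
        }
        append out [format %c $code]
        set pos [expr {$last + 1}]
    }
    # Copy whatever follows the last reference.
    append out [string range $text $pos end]
    return $out
}

puts [decode_numeric_refs {A&#66;&#x43;D [x] $y}]
```

Named entities would still go through the string map table; this only covers the two regsub lines.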

madpinger · Post by **madpinger** » Sat Oct 30, 2010 12:36 pm

speechles wrote:MOAR scripts are a good thing

This might help your script with completeness and compatibility (patched utf-8 vs not). This procedure is what I presently use within my twitter script. It is a more evolved version of the same procedure within incith-google.
....
Feel free to steal (borrow) this..

Thanks, I'll review its changes for inclusion. Though I think I already have the encoding covered with the convertfrom, which changes the encoding to the system default?

I'm developing on 1.8 CVS patched to be utf-8, though I did a quick test on 1.6.20 without any modification.

*EDIT*
Oh, IC what you did there.

spithash · Post by **spithash** » Sun Oct 31, 2010 3:05 pm

Code:

[20:56:51] <@spithash> http://www.youtube.com/user/spithash
[20:56:55] <@nagger> [Url title:] YouTube        - spithash's Channel

Can anyone tell me why this white space appears there? I have the same problem with another title grab tcl as well.

madpinger · Post by **madpinger** » Mon Nov 01, 2010 12:17 pm

spithash wrote:
Code:
[20:56:51] <@spithash> http://www.youtube.com/user/spithash
[20:56:55] <@nagger> [Url title:] YouTube        - spithash's Channel
Can anyone tell me why this white space appears there? I have the same problem with another title grab tcl as well.

Basically, it's because the title spans more than one line in the HTML being parsed.

Code:

    <title>
    YouTube
        - spithash's Channel
  </title>

I merge multi-line titles in the regexp to deal with this. If it's a real bother, it would be simple enough to add white space stripping to it.

That's the reason in a nutshell.

*EDIT*
Ok, fixed that for you. Here is the result, followed by the change to make:

Code:

[12:31] <madpinger> http://www.youtube.com/user/spithash 
[12:31] <Belkar> [Url title:] YouTube - spithash's Channel

Find:

Code:

                        foreach line [split $data \n] {
                            if {[regexp -nocase {<meta.*charset.(.*?)".*>} $line match charset]} {
                                set charenc $charset
                            }
                            append newdata $line
                        }

Change append newdata $line to append newdata " [string trim $line]" (note the leading space inside the quotes):

Code:

                        foreach line [split $data \n] {
                            if {[regexp -nocase {<meta.*charset.(.*?)".*>} $line match charset]} {
                                set charenc $charset
                            }
                            append newdata " [string trim $line]"
                        }

This keeps at least one space between the two lines, so words don't get joined. I've also updated the GitHub copy with a token cleanup fix. Forgive me for some of the silly stuff I've messed up; I do this half asleep or drunk most times.
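
The same clean-up can also be done on the finished title string in one pass. A minimal sketch, assuming a hypothetical helper named normalize_title (it is not part of SA_urltitle.tcl) that is called on the captured title before it is sent to the channel:

```tcl
# Hypothetical helper, not part of SA_urltitle.tcl.
# Collapses every run of whitespace (spaces, tabs, newlines) into a
# single space, then strips leading and trailing whitespace.
proc normalize_title {title} {
    regsub -all {\s+} $title " " title
    return [string trim $title]
}

puts [normalize_title "\n    YouTube\n        - spithash's Channel\n  "]
```

Unlike trimming each line, this also squeezes repeated spaces or tabs that sit in the middle of a single line.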

SVD · Post by **SVD** » Tue Jan 11, 2011 5:28 pm

Great script! However, it doesn't pick up when someone omits the http://. For example, if I type in www.youtube.com, I would like it to catch that and display the title. Any chance you could add that feature? Thanks in advance.

madpinger · Post by **madpinger** » Fri Jan 14, 2011 6:44 am

Stan wrote:Great script! However, it doesn't pick up when someone omits the http://. For example, if I type in www.youtube.com, I would like it to catch that and display the title. Any chance you could add that feature? Thanks in advance.

Hmm, sure. I'd tell you what to change here, but the match alone isn't enough: the uri has to be prefixed with http:// before it is used, or it has issues. I'll add that in, along with another fix/feature a user requested on github, in a few days ^.^
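
A minimal sketch of that prefixing step, assuming a hypothetical helper named ensure_scheme (not part of SA_urltitle.tcl) that is applied to each candidate word before it is handed to the fetch code:

```tcl
# Hypothetical helper, not part of SA_urltitle.tcl.
# Returns the word with http:// prepended when it starts with "www."
# and carries no scheme; anything else is returned unchanged.
proc ensure_scheme {word} {
    if {[regexp -nocase {^[a-z][a-z0-9+.-]*://} $word]} {
        return $word
    }
    if {[string match -nocase "www.*" $word]} {
        return "http://$word"
    }
    return $word
}

puts [ensure_scheme "www.youtube.com"]
puts [ensure_scheme "http://www.youtube.com"]
```

Both calls print http://www.youtube.com. Words that neither carry a scheme nor start with www. are passed through untouched, so the caller still has to decide whether they are URLs at all.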

cubemon · Post by **cubemon** » Fri May 20, 2011 12:50 pm

speechles wrote:MOAR scripts are a good thing

This might help your script with completeness and compatibility (patched utf-8 vs not). This procedure is what I presently use within my twitter script. It is a more evolved version of the same procedure within incith-google.
Code:
[string map [b]-nocase[/b] $escapes $text]
Feel free to steal (borrow) this..

I admit nicking your script and using it successfully with my bot!

However, if you want &Auml; to correspond to "Ä" and &auml; to "ä" (and make other capital and lowercase umlauts work), you need to remove the -nocase option from the string map clause.
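
The effect is easy to reproduce in isolation. A small sketch with a two-entry map; the variable names and sample text are only for illustration:

```tcl
# Two entries in the same order as the table in decode_entities.
set escapes [list "&Auml;" "\u00C4" "&auml;" "\u00E4"]
set text "&Auml;rger und &auml;rger"

# With -nocase both entities match the first key, so both become U+00C4.
set folded [string map -nocase $escapes $text]

# Without -nocase each entity maps to its own letter.
set exact [string map $escapes $text]

puts [expr {$folded eq "\u00C4rger und \u00C4rger"}]
puts [expr {$exact eq "\u00C4rger und \u00E4rger"}]
```

Both lines print 1. string map takes the first key in the list that matches at each position, so with -nocase the lowercase entry is never reached.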

Thanks for a great conversion script!

kenh83 · Post by **kenh83** » Sat May 28, 2011 1:22 am

This script is no longer on GitHub.. lame.

SVD · Post by **SVD** » Tue Oct 18, 2011 11:02 am

I often see the error "Tcl error [pub_url]: can't read "tok": no such variable" when URLs are posted from certain websites. Is there an update or fix to this script? It's a great script otherwise.
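
That Tcl message means tok was read in a scope where it had never been set. One possible cause, and it is only an assumption since the script's fetch code is not shown in this thread, is that tok is assigned from http::geturl and the call raises an error for some sites before the assignment happens. A minimal sketch of guarding such a call, with safe_geturl as a hypothetical wrapper name:

```tcl
package require http

# Hypothetical wrapper, not taken from SA_urltitle.tcl.
# Returns the http token on success, or an empty string when the
# request could not be started or raised an error.
proc safe_geturl {url} {
    if {[catch {http::geturl $url -timeout 5000} tok]} {
        # Here tok holds the error message, not a token.
        return ""
    }
    return $tok
}

# An unsupported scheme makes http::geturl raise an error immediately.
set tok [safe_geturl "bogus://example.invalid/"]
if {$tok eq ""} {
    puts "fetch failed, skipping title lookup"
} else {
    puts [http::ncode $tok]
    http::cleanup $tok
}
```

The caller then checks for the empty string instead of reading a variable that may not exist.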