Release: SA_urltitle.tcl

madpinger
Voice
Posts: 12
Joined: Sun Oct 03, 2010 3:06 pm

Post by madpinger »

This eggdrop script relays the title of any URL that IRC users post to a
channel. It tries to identify the page's character encoding so the title
survives intact, and it replaces HTML entities with their Unicode
counterparts.


http://github.com/madpinger/Eggdrop-URL-title-script

Bash me, use it, abuse it, whatever works. ^.^

Just felt like doing it.


In both examples, the first URL is UTF-8 and the second is EUC-JP.
Example from an ISO 8859-1 compiled bot:
Image

Example from a UTF-8 compiled bot:
Image

As you can see, it handles different encodings, within limits that depend on the system's encoding and the encoding the bot was compiled with.

Updates:
Added speechles's new proc with some notes; you will need to make changes for it to work, depending on your system and how your bot is compiled. Eventually I'll get around to figuring out how to account for all the different configurations.

Fixed whitespace issues as pointed out by spithash. It just never occurred to me as an issue, lol.
Changed how I use http cleanup, so it should not lose any tokens.
Last edited by madpinger on Mon Nov 01, 2010 1:58 pm, edited 5 times in total.
speechles
Revered One
Posts: 1398
Joined: Sat Aug 26, 2006 10:19 pm
Location: emerald triangle, california (coastal redwoods)

Post by speechles »

MOAR scripts are a good thing ;)

This might help your script with completeness and compatibility (patched utf-8 vs not). This procedure is what I presently use within my twitter script. It is a more evolved version of the same procedure within incith-google.

Code:

proc decode_entities {text {char "utf-8"} } {
	# code below is necessary to prevent numerous html markups
	# from appearing in the output (ie, &quot;, &#5671;, etc)
	# stolen (borrowed is a better term) from tcllib's htmlparse ;)
	# works unpatched utf-8 or not, unlike htmlparse::mapEscapes
	# which will only work properly patched....
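	set escapes {
		&nbsp; \xa0 &iexcl; \xa1 &cent; \xa2 &pound; \xa3 &curren; \xa4
		&yen; \xa5 &brvbar; \xa6 &sect; \xa7 &uml; \xa8 &copy; \xa9
		&ordf; \xaa &laquo; \xab &not; \xac &shy; \xad &reg; \xae
		&macr; \xaf &deg; \xb0 &plusmn; \xb1 &sup2; \xb2 &sup3; \xb3
		&acute; \xb4 &micro; \xb5 &para; \xb6 &middot; \xb7 &cedil; \xb8
		&sup1; \xb9 &ordm; \xba &raquo; \xbb &frac14; \xbc &frac12; \xbd
		&frac34; \xbe &iquest; \xbf &Agrave; \xc0 &Aacute; \xc1 &Acirc; \xc2
		&Atilde; \xc3 &Auml; \xc4 &Aring; \xc5 &AElig; \xc6 &Ccedil; \xc7
		&Egrave; \xc8 &Eacute; \xc9 &Ecirc; \xca &Euml; \xcb &Igrave; \xcc
		&Iacute; \xcd &Icirc; \xce &Iuml; \xcf &ETH; \xd0 &Ntilde; \xd1
		&Ograve; \xd2 &Oacute; \xd3 &Ocirc; \xd4 &Otilde; \xd5 &Ouml; \xd6
		&times; \xd7 &Oslash; \xd8 &Ugrave; \xd9 &Uacute; \xda &Ucirc; \xdb
		&Uuml; \xdc &Yacute; \xdd &THORN; \xde &szlig; \xdf &agrave; \xe0
		&aacute; \xe1 &acirc; \xe2 &atilde; \xe3 &auml; \xe4 &aring; \xe5
		&aelig; \xe6 &ccedil; \xe7 &egrave; \xe8 &eacute; \xe9 &ecirc; \xea
		&euml; \xeb &igrave; \xec &iacute; \xed &icirc; \xee &iuml; \xef
		&eth; \xf0 &ntilde; \xf1 &ograve; \xf2 &oacute; \xf3 &ocirc; \xf4
		&otilde; \xf5 &ouml; \xf6 &divide; \xf7 &oslash; \xf8 &ugrave; \xf9
		&uacute; \xfa &ucirc; \xfb &uuml; \xfc &yacute; \xfd &thorn; \xfe
		&yuml; \xff &fnof; \u192 &Alpha; \u391 &Beta; \u392 &Gamma; \u393 &Delta; \u394
		&Epsilon; \u395 &Zeta; \u396 &Eta; \u397 &Theta; \u398 &Iota; \u399
		&Kappa; \u39A &Lambda; \u39B &Mu; \u39C &Nu; \u39D &Xi; \u39E
		&Omicron; \u39F &Pi; \u3A0 &Rho; \u3A1 &Sigma; \u3A3 &Tau; \u3A4
		&Upsilon; \u3A5 &Phi; \u3A6 &Chi; \u3A7 &Psi; \u3A8 &Omega; \u3A9
		&alpha; \u3B1 &beta; \u3B2 &gamma; \u3B3 &delta; \u3B4 &epsilon; \u3B5
		&zeta; \u3B6 &eta; \u3B7 &theta; \u3B8 &iota; \u3B9 &kappa; \u3BA
		&lambda; \u3BB &mu; \u3BC &nu; \u3BD &xi; \u3BE &omicron; \u3BF
		&pi; \u3C0 &rho; \u3C1 &sigmaf; \u3C2 &sigma; \u3C3 &tau; \u3C4
		&upsilon; \u3C5 &phi; \u3C6 &chi; \u3C7 &psi; \u3C8 &omega; \u3C9
		&thetasym; \u3D1 &upsih; \u3D2 &piv; \u3D6 &bull; \u2022
		&hellip; \u2026 &prime; \u2032 &Prime; \u2033 &oline; \u203E
		&frasl; \u2044 &weierp; \u2118 &image; \u2111 &real; \u211C
		&trade; \u2122 &alefsym; \u2135 &larr; \u2190 &uarr; \u2191
		&rarr; \u2192 &darr; \u2193 &harr; \u2194 &crarr; \u21B5
		&lArr; \u21D0 &uArr; \u21D1 &rArr; \u21D2 &dArr; \u21D3 &hArr; \u21D4
		&forall; \u2200 &part; \u2202 &exist; \u2203 &empty; \u2205
		&nabla; \u2207 &isin; \u2208 &notin; \u2209 &ni; \u220B &prod; \u220F
		&sum; \u2211 &minus; \u2212 &lowast; \u2217 &radic; \u221A
		&prop; \u221D &infin; \u221E &ang; \u2220 &and; \u2227 &or; \u2228
		&cap; \u2229 &cup; \u222A &int; \u222B &there4; \u2234 &sim; \u223C
		&cong; \u2245 &asymp; \u2248 &ne; \u2260 &equiv; \u2261 &le; \u2264
		&ge; \u2265 &sub; \u2282 &sup; \u2283 &nsub; \u2284 &sube; \u2286
		&supe; \u2287 &oplus; \u2295 &otimes; \u2297 &perp; \u22A5
		&sdot; \u22C5 &lceil; \u2308 &rceil; \u2309 &lfloor; \u230A
		&rfloor; \u230B &lang; \u2329 &rang; \u232A &loz; \u25CA
		&spades; \u2660 &clubs; \u2663 &hearts; \u2665 &diams; \u2666
		&quot; \x22 &amp; \x26 &lt; \x3C &gt; \x3E &OElig; \u152 &oelig; \u153
		&Scaron; \u160 &scaron; \u161 &Yuml; \u178 &circ; \u2C6
		&tilde; \u2DC &ensp; \u2002 &emsp; \u2003 &thinsp; \u2009
		&zwnj; \u200C &zwj; \u200D &lrm; \u200E &rlm; \u200F &ndash; \u2013
		&mdash; \u2014 &lsquo; \u2018 &rsquo; \u2019 &sbquo; \u201A
		&ldquo; \u201C &rdquo; \u201D &bdquo; \u201E &dagger; \u2020
		&Dagger; \u2021 &permil; \u2030 &lsaquo; \u2039 &rsaquo; \u203A
		&euro; \u20AC &apos; \u0027 &lrm; "" &rlm; ""
	};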
	if {![string equal $char [encoding system]]} { set text [encoding convertfrom $char $text] }
	set text [string map [list "\]" "\\\]" "\[" "\\\[" "\$" "\\\$" "\"" "\\\"" "\\" "\\\\"] [string map $escapes $text]]
	regsub -all -- {&#([[:digit:]]{1,5});} $text {[format %c [string trimleft "\1" "0"]]} text
	regsub -all -- {&#x([[:xdigit:]]{1,4});} $text {[format %c [scan "\1" %x]]} text
	catch { set text "[subst "$text"]" }
	if {![string equal $char [encoding system]]} { set text [encoding convertto $char $text] }
	return "$text"
}
Feel free to steal (borrow) this.. :)
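
For anyone who would rather not run subst over scraped page text, here is a minimal sketch of decoding only the numeric references (the &#NNN; and &#xHHHH; forms) with a plain loop. The proc name decode_numeric_entities is made up for illustration and is not part of the proc above or of SA_urltitle.tcl; it accepts the same digit ranges as the two regsub lines above.

```tcl
# Hypothetical helper (an assumption, not from the original proc):
# decode &#NNN; and &#xHHHH; references without using subst.
proc decode_numeric_entities {text} {
    set out ""
    set pos 0
    set pattern {&#(x[[:xdigit:]]{1,4}|[[:digit:]]{1,5});}
    while {[regexp -indices -start $pos -- $pattern $text m ref]} {
        foreach {ms me} $m break
        foreach {rs re} $ref break
        # copy the untouched text that precedes this reference
        append out [string range $text $pos [expr {$ms - 1}]]
        set digits [string range $text $rs $re]
        if {[string match x* $digits]} {
            scan [string range $digits 1 end] %x code
        } else {
            scan $digits %d code
        }
        append out [format %c $code]
        set pos [expr {$me + 1}]
    }
    # copy whatever follows the last reference
    append out [string range $text $pos end]
    return $out
}
```

Named entities would still need the string map table; this only covers the numeric forms.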
Last edited by speechles on Sat May 28, 2011 8:44 pm, edited 2 times in total.
madpinger
Voice
Posts: 12
Joined: Sun Oct 03, 2010 3:06 pm

Post by madpinger »

speechles wrote:MOAR scripts are a good thing ;)

This might help you script as for completeness and compatibility (patched utf-8 vs not). This procedure is what I presently use within my twitter script. It is a more evolved version of the same procedure within incith-google.
....
Feel free to steal (borrow) this.. :)
Thanks, I'll review its changes for inclusion. Tho, I think I already have the encoding covered with the convertfrom, which changes the encoding to the system default?

I'm developing on 1.8 cvs patched to be utf-8, tho I did a quick test on 1.6.20 without any mod.

*EDIT*
Oh, IC what you did there. :D
spithash
Master
Posts: 248
Joined: Thu Jul 12, 2007 9:21 am
Location: Libera

Post by spithash »

Code:

[20:56:51] <@spithash> http://www.youtube.com/user/spithash
[20:56:55] <@nagger> [Url title:] YouTube        - spithash's Channel
Can anyone tell me why this white space appears there? I have the same problem with another title grab tcl as well.
madpinger
Voice
Posts: 12
Joined: Sun Oct 03, 2010 3:06 pm

Post by madpinger »

spithash wrote:

Code:

[20:56:51] <@spithash> http://www.youtube.com/user/spithash
[20:56:55] <@nagger> [Url title:] YouTube        - spithash's Channel
Can anyone tell me why this white space appears there? I have the same problem with another title grab tcl as well.
Basically, it's because the title spans more than one line in the HTML being parsed.

Code:

    <title>
    YouTube
        - spithash's Channel
  </title>
I merge multi-line titles in the regexp to deal with this. If it's a real bother, it would be simple enough to add whitespace stripping.

Tho, that's the reason in a nutshell.

*EDIT*
OK, fixed that for you. Here is the result:

Code:

[12:31] <madpinger> http://www.youtube.com/user/spithash 
[12:31] <Belkar> [Url title:] YouTube - spithash's Channel
To make the change yourself, find:

Code:

                        foreach line [split $data \n] {
                            if {[regexp -nocase {<meta.*charset.(.*?)".*>} $line match charset]} {
                                set charenc $charset
                            }
                            append newdata $line
                        }
Change append newdata $line to append newdata " [string trim $line]":

Code:

                        foreach line [split $data \n] {
                            if {[regexp -nocase {<meta.*charset.(.*?)".*>} $line match charset]} {
                                set charenc $charset
                            }
                            append newdata " [string trim $line]"
                        }
This keeps at least one space between the two lines, so words don't get joined. I also updated the GitHub copy with a token cleanup fix. Forgive me for some of the silly stuff I've messed up, I do this half asleep or drunk most times. ;)
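
A different way to get the same result is to clean the extracted title once, instead of trimming line by line. This is only a sketch: normalize_title is a made-up name and is not in SA_urltitle.tcl. It collapses every run of spaces, tabs and newlines into a single space and then trims the ends.

```tcl
# Hypothetical helper (an assumption, not part of SA_urltitle.tcl):
# collapse all runs of whitespace, newlines included, into one space.
proc normalize_title {title} {
    regsub -all -- {\s+} $title { } title
    return [string trim $title]
}

# The multi-line title from the YouTube example above
set raw "\n    YouTube\n        - spithash's Channel\n  "
puts [normalize_title $raw]
```

Run on the title text from the YouTube page above, this prints "YouTube - spithash's Channel".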
SVD
Voice
Posts: 9
Joined: Mon Mar 13, 2006 6:52 pm

Post by SVD »

Great script! However, it doesn't pick up when someone omits the http://. For example, if I type in www.youtube.com, I would like it to catch that and display the title. Any chance you could add that feature? Thanks in advance.
madpinger
Voice
Posts: 12
Joined: Sun Oct 03, 2010 3:06 pm

Post by madpinger »

Stan wrote:Great script! However, it doesn't pick up when someone omits the http://. For example, if I type in www.youtube.com, I would like it to catch that and display the title. Any chance you could add that feature? Thanks in advance.
Hmm, sure. I'd tell you what to change here, but the match also has to be prefixed with http:// before the URI is used, or it has issues. I'll add that in along with another fix/feature a user requested on GitHub in a few days ^.^
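
Until then, a rough sketch of the idea, assuming the script looks at each word of the channel line. The proc names ensure_scheme and find_url are invented for this example and do not exist in SA_urltitle.tcl.

```tcl
# Hypothetical helper: put http:// in front of links that have no scheme.
proc ensure_scheme {url} {
    if {![regexp -nocase -- {^[a-z][a-z0-9+.-]*://} $url]} {
        set url "http://$url"
    }
    return $url
}

# Hypothetical matcher: return the first word of a chat line that looks
# like a URL (http://, https:// or bare www.), or "" if there is none.
proc find_url {text} {
    foreach word [split $text] {
        if {[regexp -nocase -- {^(https?://|www\.)\S+$} $word]} {
            return [ensure_scheme $word]
        }
    }
    return ""
}

puts [find_url "have a look at www.youtube.com sometime"]
```

The last line prints http://www.youtube.com, which can then be passed to the fetch code unchanged.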
cubemon
Voice
Posts: 1
Joined: Fri May 20, 2011 12:44 pm

Post by cubemon »

speechles wrote:MOAR scripts are a good thing ;)

This might help you script as for completeness and compatibility (patched utf-8 vs not). This procedure is what I presently use within my twitter script. It is a more evolved version of the same procedure within incith-google.

Code:

[string map -nocase $escapes $text]
Feel free to steal (borrow) this.. :)
I admit nicking your script and using it successfully with my bot! :)

However, if you want &Auml; to correspond to "Ä" and &auml; to "ä" (and make other capital and lowercase umlauts work), you need to remove the -nocase option from the string map clause.

Thanks for a great conversion script!
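
A small check of the -nocase effect, using a cut-down two-entry table rather than the full escapes list (the variable names here are just for the example):

```tcl
# Two-entry table, same layout as the escapes list in the proc
set escapes {&Auml; \xc4 &auml; \xe4}
set text "&Auml;pfel und &auml;pfel"

# With -nocase, &auml; also matches the earlier &Auml; key,
# so both entities come out as the capital letter.
set with_nocase [string map -nocase $escapes $text]
set without     [string map $escapes $text]

puts [string equal $with_nocase "\u00c4pfel und \u00c4pfel"]
puts [string equal $without "\u00c4pfel und \u00e4pfel"]
```

Both lines print 1: with -nocase the lowercase entity is swallowed by the first matching key in the list, without it each entity keeps its own case.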
kenh83
Halfop
Posts: 61
Joined: Wed Sep 08, 2010 11:22 am

Post by kenh83 »

This script is no longer on GitHub.. lame. :(
SVD
Voice
Posts: 9
Joined: Mon Mar 13, 2006 6:52 pm

Post by SVD »

I often see the error "Tcl error [pub_url]: can't read "tok": no such variable" when URLs are posted from certain websites. Is there an update or fix to this script? It's a great script otherwise.
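
One possible cause, and this is a guess since the failing code is not shown in this thread: the token variable is set inside a catch, geturl throws for that site, and later code still reads $tok. A sketch of a guard under that assumption follows. fetch_body is a made-up name and not the script's own proc; the ::http calls are the standard Tcl http package.

```tcl
package require http

# Hypothetical wrapper (an assumption about the cause, not the script's code):
# return the page body, or "" when the request could not be made.
proc fetch_body {url {timeout 10000}} {
    if {[catch {set tok [::http::geturl $url -timeout $timeout]} err]} {
        # geturl threw (unsupported scheme, bad host, ...), so tok was
        # never set; bail out here instead of touching $tok later.
        return ""
    }
    set body ""
    if {[::http::status $tok] eq "ok"} {
        set body [::http::data $tok]
    }
    # always release the token once it exists
    ::http::cleanup $tok
    return $body
}
```

The point is only that every use of $tok sits after the check that geturl succeeded.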