Code: Select all
proc decode_entities {text {char "utf-8"} } {
# code below is necessary to prevent numerous html markups
# from appearing in the output (ie, &quot;, &#5671;, etc)
# stolen (borrowed is a better term) from tcllib's htmlparse ;)
# works unpatched utf-8 or not, unlike htmlparse::mapEscapes
# which will only work properly patched....
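set escapes {
&nbsp; \xa0 &iexcl; \xa1 &cent; \xa2 &pound; \xa3 &curren; \xa4
&yen; \xa5 &brvbar; \xa6 &sect; \xa7 &uml; \xa8 &copy; \xa9
&ordf; \xaa &laquo; \xab &not; \xac &shy; \xad &reg; \xae
&macr; \xaf &deg; \xb0 &plusmn; \xb1 &sup2; \xb2 &sup3; \xb3
&acute; \xb4 &micro; \xb5 &para; \xb6 &middot; \xb7 &cedil; \xb8
&sup1; \xb9 &ordm; \xba &raquo; \xbb &frac14; \xbc &frac12; \xbd
&frac34; \xbe &iquest; \xbf &Agrave; \xc0 &Aacute; \xc1 &Acirc; \xc2
&Atilde; \xc3 &Auml; \xc4 &Aring; \xc5 &AElig; \xc6 &Ccedil; \xc7
&Egrave; \xc8 &Eacute; \xc9 &Ecirc; \xca &Euml; \xcb &Igrave; \xcc
&Iacute; \xcd &Icirc; \xce &Iuml; \xcf &ETH; \xd0 &Ntilde; \xd1
&Ograve; \xd2 &Oacute; \xd3 &Ocirc; \xd4 &Otilde; \xd5 &Ouml; \xd6
&times; \xd7 &Oslash; \xd8 &Ugrave; \xd9 &Uacute; \xda &Ucirc; \xdb
&Uuml; \xdc &Yacute; \xdd &THORN; \xde &szlig; \xdf &agrave; \xe0
&aacute; \xe1 &acirc; \xe2 &atilde; \xe3 &auml; \xe4 &aring; \xe5
&aelig; \xe6 &ccedil; \xe7 &egrave; \xe8 &eacute; \xe9 &ecirc; \xea
&euml; \xeb &igrave; \xec &iacute; \xed &icirc; \xee &iuml; \xef
&eth; \xf0 &ntilde; \xf1 &ograve; \xf2 &oacute; \xf3 &ocirc; \xf4
&otilde; \xf5 &ouml; \xf6 &divide; \xf7 &oslash; \xf8 &ugrave; \xf9
&uacute; \xfa &ucirc; \xfb &uuml; \xfc &yacute; \xfd &thorn; \xfe
&yuml; \xff &fnof; \u192 &Alpha; \u391 &Beta; \u392 &Gamma; \u393 &Delta; \u394
&Epsilon; \u395 &Zeta; \u396 &Eta; \u397 &Theta; \u398 &Iota; \u399
&Kappa; \u39A &Lambda; \u39B &Mu; \u39C &Nu; \u39D &Xi; \u39E
&Omicron; \u39F &Pi; \u3A0 &Rho; \u3A1 &Sigma; \u3A3 &Tau; \u3A4
&Upsilon; \u3A5 &Phi; \u3A6 &Chi; \u3A7 &Psi; \u3A8 &Omega; \u3A9
&alpha; \u3B1 &beta; \u3B2 &gamma; \u3B3 &delta; \u3B4 &epsilon; \u3B5
&zeta; \u3B6 &eta; \u3B7 &theta; \u3B8 &iota; \u3B9 &kappa; \u3BA
&lambda; \u3BB &mu; \u3BC &nu; \u3BD &xi; \u3BE &omicron; \u3BF
&pi; \u3C0 &rho; \u3C1 &sigmaf; \u3C2 &sigma; \u3C3 &tau; \u3C4
&upsilon; \u3C5 &phi; \u3C6 &chi; \u3C7 &psi; \u3C8 &omega; \u3C9
&thetasym; \u3D1 &upsih; \u3D2 &piv; \u3D6 &bull; \u2022
&hellip; \u2026 &prime; \u2032 &Prime; \u2033 &oline; \u203E
&frasl; \u2044 &weierp; \u2118 &image; \u2111 &real; \u211C
&trade; \u2122 &alefsym; \u2135 &larr; \u2190 &uarr; \u2191
&rarr; \u2192 &darr; \u2193 &harr; \u2194 &crarr; \u21B5
&lArr; \u21D0 &uArr; \u21D1 &rArr; \u21D2 &dArr; \u21D3 &hArr; \u21D4
&forall; \u2200 &part; \u2202 &exist; \u2203 &empty; \u2205
&nabla; \u2207 &isin; \u2208 &notin; \u2209 &ni; \u220B &prod; \u220F
&sum; \u2211 &minus; \u2212 &lowast; \u2217 &radic; \u221A
&prop; \u221D &infin; \u221E &ang; \u2220 &and; \u2227 &or; \u2228
&cap; \u2229 &cup; \u222A &int; \u222B &there4; \u2234 &sim; \u223C
&cong; \u2245 &asymp; \u2248 &ne; \u2260 &equiv; \u2261 &le; \u2264
&ge; \u2265 &sub; \u2282 &sup; \u2283 &nsub; \u2284 &sube; \u2286
&supe; \u2287 &oplus; \u2295 &otimes; \u2297 &perp; \u22A5
&sdot; \u22C5 &lceil; \u2308 &rceil; \u2309 &lfloor; \u230A
&rfloor; \u230B &lang; \u2329 &rang; \u232A &loz; \u25CA
&spades; \u2660 &clubs; \u2663 &hearts; \u2665 &diams; \u2666
&quot; \x22 &amp; \x26 &lt; \x3C &gt; \x3E &OElig; \u152 &oelig; \u153
&Scaron; \u160 &scaron; \u161 &Yuml; \u178 &circ; \u2C6
&tilde; \u2DC &ensp; \u2002 &emsp; \u2003 &thinsp; \u2009
&zwnj; \u200C &zwj; \u200D &lrm; \u200E &rlm; \u200F &ndash; \u2013
&mdash; \u2014 &lsquo; \u2018 &rsquo; \u2019 &sbquo; \u201A
&ldquo; \u201C &rdquo; \u201D &bdquo; \u201E &dagger; \u2020
&Dagger; \u2021 &permil; \u2030 &lsaquo; \u2039 &rsaquo; \u203A
&euro; \u20AC &apos; \u0027 "" ""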
};
if {![string equal $char [encoding system]]} { set text [encoding convertfrom $char $text] }
set text [string map [list "\]" "\\\]" "\[" "\\\[" "\$" "\\\$" "\"" "\\\"" "\\" "\\\\"] [string map $escapes $text]]
regsub -all -- {&#([[:digit:]]{1,5});} $text {[format %c [string trimleft "\1" "0"]]} text
regsub -all -- {&#x([[:xdigit:]]{1,4});} $text {[format %c [scan "\1" %x]]} text
catch { set text "[subst "$text"]" }
if {![string equal $char [encoding system]]} { set text [encoding convertto $char $text] }
return "$text"
}
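The numeric-reference step at the end of the procedure (the two regsub calls followed by subst) can be tried on its own. The following is only a minimal sketch of that idea, not the posted procedure: decode_numeric is a hypothetical helper name, and it uses scan %d in place of string trimleft so that leading zeros are handled without producing an empty argument.

```tcl
# Minimal sketch: decode decimal and hexadecimal numeric character references.
# decode_numeric is a hypothetical name, not part of the posted script.
proc decode_numeric {text} {
    # Escape the characters that subst would otherwise act on.
    set text [string map [list "\\" "\\\\" "\[" "\\\[" "\]" "\\\]" "\$" "\\\$"] $text]
    # Rewrite each reference into a [format %c ...] call.
    regsub -all -- {&#([[:digit:]]{1,5});} $text {[format %c [scan "\1" %d]]} text
    regsub -all -- {&#x([[:xdigit:]]{1,4});} $text {[format %c [scan "\1" %x]]} text
    # Evaluate the generated calls; the escaped characters come back as literals.
    return [subst $text]
}

puts [decode_numeric "caf&#233; &#x263A; costs \$5 \[approx\]"]
```

Escaping first matters: without it, any bracket or dollar sign in the fetched page would be evaluated by subst as Tcl.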
Thanks, I'll review its changes for inclusion. Though I think I already have the encoding covered with the convertfrom, which changes the encoding to the system default?
speechles wrote:MOAR scripts are a good thing
This might help your script as far as completeness and compatibility (patched utf-8 vs not). This procedure is what I presently use within my twitter script. It is a more evolved version of the same procedure within incith-google.
....
Feel free to steal (borrow) this..
Code: Select all
[20:56:51] <@spithash> http://www.youtube.com/user/spithash
[20:56:55] <@nagger> [Url title:] YouTube - spithash's Channel
Basically, it's because the title spans more than one line in the HTML that is parsed.
spithash wrote:can anyone tell me why this white space appears there? I have the same problem with another title grab tcl as well
Code: Select all
<title>
YouTube
- spithash's Channel
</title>
Code: Select all
[12:31] <madpinger> http://www.youtube.com/user/spithash
[12:31] <Belkar> [Url title:] YouTube - spithash's Channel
Code: Select all
foreach line [split $data \n] {
    if {[regexp -nocase {<meta.*charset.(.*?)".*>} $line match charset]} {
        set charenc $charset
    }
    append newdata $line
}
Code: Select all
foreach line [split $data \n] {
    if {[regexp -nocase {<meta.*charset.(.*?)".*>} $line match charset]} {
        set charenc $charset
    }
    append newdata " [string trim $line]"
}
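For a title that spans several lines, the same effect can be had after extraction by collapsing every run of whitespace into one space. This is only a sketch under assumptions: clean_title is a hypothetical helper, and the sample HTML is made up to mirror the YouTube case above.

```tcl
# Hypothetical helper: pull out the <title> text and collapse whitespace runs.
proc clean_title {html} {
    # "." also matches newlines in Tcl regexps unless -line is given.
    if {![regexp -nocase -- {<title>(.*?)</title>} $html -> title]} {
        return ""
    }
    # Newlines, tabs and repeated spaces all become a single space.
    regsub -all -- {\s+} $title " " title
    return [string trim $title]
}

set html "<html><head><title>\n   YouTube\n  - spithash's Channel\n</title></head></html>"
puts [clean_title $html]
```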
Hmm, sure. I'd tell you what to change here, but you have to prefix the URI with http:// before using it, or it has issues. I'll add that in with another fix/feature a user requested on GitHub in a few days ^.^
Stan wrote:Great script! However, it doesn't pick up when someone omits the http://. For example, if I type in www.youtube.com, I would like it to catch that and display the title. Any chance you could add that feature? Thanks in advance.
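The prefixing step could look roughly like the sketch below. It is not the script's actual change: normalize_url is a hypothetical helper, and treating only words that start with "www." as links is an assumption made here for illustration.

```tcl
# Hypothetical helper: return a fetchable URL, or "" if the word is not a link.
proc normalize_url {word} {
    # Already has a scheme: use it unchanged.
    if {[regexp -nocase -- {^https?://} $word]} {
        return $word
    }
    # Bare "www." link: add the scheme before fetching.
    if {[regexp -nocase -- {^www\.} $word]} {
        return "http://$word"
    }
    return ""
}

puts [normalize_url www.youtube.com]
```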
I admit nicking your script and using it successfully with my bot!
speechles wrote:MOAR scripts are a good thing
This might help your script as far as completeness and compatibility (patched utf-8 vs not). This procedure is what I presently use within my twitter script. It is a more evolved version of the same procedure within incith-google.
Feel free to steal (borrow) this..
Code: Select all
[string map -nocase $escapes $text]
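One thing that may be worth checking before adopting -nocase: string map tries keys in list order, and several entity names differ only by case (&Aacute; and &aacute;, for example). A small sketch with a two-entry map (pairs is an illustrative variable, not the full escapes list) shows the effect:

```tcl
# Two entities whose names differ only in case.
set pairs [list "&Aacute;" "\u00c1" "&aacute;" "\u00e1"]

# Case-sensitive: the lowercase entity maps to lowercase a-acute.
puts [string map $pairs "&aacute;"]

# With -nocase the first key in the list matches, giving uppercase A-acute.
puts [string map -nocase $pairs "&aacute;"]
```

So -nocase helps with pages that write &NBSP; or &Quot;, at the cost of lowercase accented letters coming out as whichever case is listed first.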