Webby v1.7 .. Have fun
The config for v1.4 onward has changed slightly; the new look and added features are noted below. This script now works with any language encoding. Even UTF-8 works without patching!
Works on any Eggdrop/Windrop combination you can think of.
Has special coding to detect differing http packages and work around their differing limitations, as well as code to detect patched or unpatched bots. It's magic.
Code: Select all
# ---> start of config; settings begin
# do you want to display header attributes always?
# --- [ 0 no / 1 yes ]
variable webbyheader 0
# do you want to display x-header attributes always?
# --- [ 0 no / 1 yes ]
variable webbyXheader 0
# do you want to display html attributes always?
# --- [ 0 no / 1 yes ]
variable webbydoc 0
# if using the regexp (regular expression) engine do
# you want to still show title and description of webpage?
# --- [ 0 no / 1 yes ]
variable webbyRegShow 0
# max length of each line of your regexp capture?
# --- [ integer ]
variable webbySplit 403
# how many regexp captures is the maximum to issue?
# presently, going above 9 will only result in 9...
# --- [ 1 - 9 ]
variable webbyMaxRegexp 9
# which method should be used when shortening the url?
# (0-3) will only use the one you've chosen.
# (4-5) will use them all.
# 0 --> http://tinyurl.com
# 1 --> http://u.nu
# 2 --> http://is.gd
# 3 --> http://cli.gs
# 4 --> randomly select one of the four above ( 2,0,0,3,1..etc )
# 5 --> cycle through the four above ( 0,1,2,3,0,1..etc )
# --- [ 0 - 5 ]
variable webbyShortType 5
# regexp capture limit
# this is how wide each regexp capture can be, prior to it
# being cleaned up for html elements. this can take a very
# long time to cleanup if the entire html was taken. so this
# variable is there to protect your bot from lagging forever
# when sites reply with tons of html.
# --- [ integer ]
variable regexp_limit 3000
# how should we treat encodings?
# 0 - do nothing, use eggdrop's internal encoding, whatever that may be.
# 1 - use the encoding the website is telling us to use.
# This is the option everyone should use primarily.
# 2 - Force a static encoding for everything. If you use option 2,
# please specify the encoding in the setting below this one.
# --- [ 0 off-internal / 1 automatic / 2 forced ]
variable webbyDynEnc 1
# option 2, force encoding
# if you've chosen this above then you must define what
# encoding we are going to force for all queries...
# --- [ encoding ]
variable webbyEnc "iso8859-1"
# fix http-Package conflicts?
# this will disregard the http package's detected charset if it
# differs from the character set detected within the html's meta
# tags, and vice versa. you can change this behavior
# on-the-fly using the --swap parameter.
# --- [ 0 no / 1 use http-package / 2 use html meta's ]
variable webbyFixDetection 2
# report http-package conflicts?
# this will display http-package conflict errors to the
# channel, and will show corrections made to any encodings.
# --- [ 0 never / 1 when the conflict happens / 2 always ]
variable webbyShowMisdetection 2
# where should files be saved that people have downloaded?
# this should NOT be eggdrop's root, because things CAN and WILL
# be overwritten. this FOLDER MUST exist, otherwise the --file
# command will always fail as well.
# --- [ directory ] ----
variable webbyFile "downloads/"
# when using the --file parameter should webby immediately return
# after saving or should it return results for the query as well?
# --- [ 0 return after saving / 1 do normal query stuff as well ]
variable webbyFileReact 0
# <--- end of config; script begins
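As a note on the webbyShortType setting above: modes 0-3 always use one fixed service, while 4 picks randomly and 5 cycles round-robin. A minimal sketch of that selection logic in plain Tcl (the proc and variable names here are illustrative, not webby's actual internals):

```tcl
# Sketch of webbyShortType service selection: 0-3 fixed, 4 random, 5 round-robin.
set services {http://tinyurl.com http://u.nu http://is.gd http://cli.gs}
set cycleIndex 0

proc pick_service {type} {
    global services cycleIndex
    switch -- $type {
        4 { return [lindex $services [expr {int(rand() * 4)}]] }
        5 {
            ;# round-robin: return the current service, then advance the index
            set s [lindex $services $cycleIndex]
            set cycleIndex [expr {($cycleIndex + 1) % 4}]
            return $s
        }
        default { return [lindex $services $type] }
    }
}
```

With type 5, successive calls walk through tinyurl, u.nu, is.gd, cli.gs and start over, matching the ( 0,1,2,3,0,1..etc ) pattern described in the config comments.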
In its simplest mode, just feeding webby a url, you will get: the title, a shorturl, http code, filetype, encoding, size and traversals (if any), followed by the site description on the second line. This seems the cleanest and best way to display this data. There is an http-package conflict detection and resolution system written into the script which can spot and correctly resolve conflicts when the html's meta tagging and the http package each suggest a different encoding.
<speechles> !webby www.roms-isos.com --regexp <title>(.*?)</title>.*?"description".*?content="(.*?)"--
<sp33chy> regexp capture1 ( ROMS-ISOS: News )
<sp33chy> regexp capture2 ( Emulation, Gaming, Linux, whatever news )
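The conflict detection described above boils down to comparing two charset declarations and preferring one when they disagree. A hedged sketch of the idea in plain Tcl, mirroring the webbyFixDetection setting (1 = trust the http package, 2 = trust the html metas); the proc name and default are assumptions, not webby's actual internals:

```tcl
# Sketch: pick an encoding when the http header and the html metas disagree.
proc pick_charset {headerType html fixMode} {
    set headerCs ""
    regexp -nocase {charset=([a-zA-Z0-9_-]+)} $headerType -> headerCs
    set metaCs ""
    regexp -nocase {<meta[^>]+charset=["']?([a-zA-Z0-9_-]+)} $html -> metaCs
    if {$headerCs ne "" && $metaCs ne "" && ![string equal -nocase $headerCs $metaCs]} {
        ;# conflict: resolve according to the config setting
        return [expr {$fixMode == 1 ? $headerCs : $metaCs}]
    }
    if {$metaCs ne ""} { return $metaCs }
    if {$headerCs ne ""} { return $headerCs }
    return "iso8859-1" ;# nothing declared; fall back to a default
}

puts [pick_charset "text/html; charset=utf-8" {<meta charset="cp1251">} 2]
;# prints: cp1251
```

The --swap parameter mentioned in the config would simply flip which branch wins for a single request.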
<speechles> !webby --html --header www.youtube.com --xheader
<sp33chy> YouTube - Broadcast Yourself. ( http://tinyurl.com/9zza6 )( 200; text/html; No charset defined, assuming iso8859-1; 71200 bytes )
<sp33chy> Server=Apache; Keep-Alive=timeout=300; Expires=Tue, 27 Apr 1971 19:44:06 EST; Date=Sat, 18 Apr 2009 23:50:01 GMT; Content-Type=text/html; charset=utf-8; Content-Length=67768; Connection=Keep-Alive; Cache-Control=no-cache
<sp33chy> X-YouTube-MID=WkFSZzctYUFHdmlHTm5zMkNvVnZmQU9qanZYZG54VndUWFZLcXc2alRIWnZYQmIzZnByM3RR; X-Content-Type-Options=nosniff
<sp33chy> HTML 4.01 Transitional//EN; Metas: description=Share your videos with friends, family, and the world; keywords=video, sharing, camera phone, video phone, free, upload
<speechles> !webby hulu.com
<sp33chy> Hulu - Watch your favorites. Anytime. For free. ( http://u.nu/9fxk )( 200; text/html; utf-8; 69884 bytes; 1 redirects )
<sp33chy> Hulu.com is a free online video service that offers hit TV shows including Family Guy, 30 Rock, NBA and the Daily Show with Jon Stewart, etc. Our extensive library also includes blockbuster movies like Lost in Translation as well as classics like Fiddler on the Roof. Start watching now. No registration or downloads required.
<speechles> !webby www.google.com --regexp <title>([0-)</title>--
<sp33chy> regexp couldn't compile regular expression pattern: invalid character range
<speechles> !webby eggdrop.org.ru
<sp33chy> Новости / eggdrop.org.ru / The Russian Eggdrop Resource / всё об eggdrop/windrop, tcl скриптах, модулях, генераторах irc статистики ( http://tinyurl.com/4vfyun )( 200; text/html; cp1251; 24147 bytes )
<sp33chy> eggdrop.org.ru - The Russian Eggdrop Resource [Новости] (всё об eggdrop/windrop, tcl скриптах, модулях, генераторах irc статистики)
<speechles> !webby http://www.duzheer.com/youqingwenzhang/
<sp33chy> 友情文章 - 散文随笔 - 美文欣赏 - 读者文章阅读网 ( http://is.gd/M57rLr )( 200; text/html; gb2312; 19306 bytes )
<sp33chy> 本站提供精美的友情文章免费在线阅读欣赏
<speechles> !webby http://ar.wikipedia.org/wiki/%D8%A5%D8% ... 8%A7%D8%A8
<sp33chy> إرهاب - ويكيبيديا، الموسوعة الحرة ( http://is.gd/p7zhji )( 200; text/html; utf-8; 89211 bytes )
<speechles> !webby http://feature.jp.msn.com/skill/special ... 15/003.htm
<sp33chy> すし業界の常識を「CHANGE!」した職人 - MSNスキルアップ ( http://is.gd/aMlcPL )( 200; text/html; utf-8; 19657 bytes )
<sp33chy> 江戸前ずしの伝統技術「細工巻きずし」と房総地方の郷土料理「太巻き祭りずし」をもとに研究を重ねて生まれた「飾り巻きずし」。東京すしアカデミーの川澄健校長に芸術ともいえる「飾り巻きずし」の奥の深さを伺いました。
<speechles> !webby http://zh.wikipedia.org/wiki/China
<sp33chy> 中國 - 维基百科,自由的百科全书 ( http://is.gd/EkZvfR )( 200; text/html; utf-8; 599493 bytes )
You have access to a host of (--switches) you can embed anywhere in your !webby requests. All of them are listed below:
At the moment there are no flags controlling any of these behaviors; literally anyone can use any of these --switches within their !webby requests. In the future (v1.x) there will be settings to define flags for each of these extra abilities webby has through these --switches.
--header ( displays the website's header attributes (minus private cookie traffic) )
--xheader ( displays the website's x-header attributes )
--html ( displays the website's html type and meta tagging )
--post ( attempts a post query rather than a get query )
--validate ( displays only header/x-header attributes; does not download the page body )
--gz ( attempts gzipped queries rather than standard text )
--file ( saves the contents of the file given to the chosen directory )
--override ( combined with --regexp below; overrides the regexp_limit variable setting )
--nostrip ( combined with --regexp below; stops the html scrubbing function from cleaning your regexp captures )
--swap ( during conflicts this swaps the encoding used to resolve it. )
--regexp ( explanation below )
Along with all of those comes one that needs a fuller explanation. The --regexp switch and parser require the syntax below:
--regexp REGEXP_GOES_HERE--
so using:
--regexp <title>(.*?)</title>--
Would give you a regexp capture of the title of the website.
using:
--regexp <title>(.*?)</title>.*?"description".*?content="(.*?)"--
Would give you both the title, and meta description of the site.
Your regexp can contain up to 9 capture groups (more are allowed, actually, but they won't be used). The -nocase option is always applied. If a capture is too long it will be split into several lines accordingly. Each capture is scraped of irrelevant html elements, and this can lag your bot tremendously, so the regexp_limit variable is in the config to stop people abusing and lagging your bot. The limit is preset at 3000 chars, but you can set it however you like. There is error detection for your constructed regexp, and any errors found will be reported diligently. There is also full error detection for any connection problems the bot may experience. The entire html the bot receives (minus any newlines, tabs, vertical tabs, and carriage returns) is saved into your eggdrop root as webby.txt. Using this to craft and test regular expressions and learn how they work is the main point of this script.
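To show what the --regexp parser does internally, here is a sketch in plain Tcl: the same title/description pattern from the examples above, run with -nocase against a made-up html snippet (the html here is purely illustrative):

```tcl
# The --regexp examples above translate to an ordinary Tcl regexp call.
# -nocase is always applied, and up to 9 capture groups are reported.
set html {<TITLE>Example Site</TITLE><meta name="description" content="A demo page">}
set pattern {<title>(.*?)</title>.*?"description".*?content="(.*?)"}

if {[regexp -nocase $pattern $html -> capture1 capture2]} {
    puts "regexp capture1 ( $capture1 )"   ;# Example Site
    puts "regexp capture2 ( $capture2 )"   ;# A demo page
}
```

Note the non-greedy (.*?) quantifiers: they stop at the first matching delimiter, which is why the title capture ends at the first </title> rather than swallowing the rest of the page.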
If you have any questions, comments, flames, or problems, feel free to post them here...
PS: the script also supports the --post modifier; when given, the script will attempt to deconstruct the post query to see if it's malformed, and then issue the query using post rather than a simple get. This is for people wishing to scrape the types of sites which won't work properly with standard queries.