Webby v1.7 .. Have fun
The config for v1.4 onward has changed slightly; the new look and added features are noted below. This script now works with any language encoding. Even UTF-8 works without patching!
Works on any Eggdrop/Windrop combination you can think of.
Has special coding to detect differing http packages and work around their differing limitations, as well as code to detect patched or unpatched bots. It's magic.
Code: Select all
# ---> start of config; settings begin
# do you want to display header attributes always?
# --- [ 0 no / 1 yes ]
variable webbyheader 0
# do you want to display x-header attributes always?
# --- [ 0 no / 1 yes ]
variable webbyXheader 0
# do you want to display html attributes always?
# --- [ 0 no / 1 yes ]
variable webbydoc 0
# if using the regexp (regular expression) engine do
# you want to still show title and description of webpage?
# --- [ 0 no / 1 yes ]
variable webbyRegShow 0
# max length of each line of your regexp capture?
# --- [ integer ]
variable webbySplit 403
# how many regexp captures is the maximum to issue?
# presently, going above 9 will only result in 9...
# --- [ 1 - 9 ]
variable webbyMaxRegexp 9
# which method should be used when shortening the url?
# (0-3) will only use the one you've chosen.
# (4-5) will use them all.
# 0 --> http://tinyurl.com
# 1 --> http://u.nu
# 2 --> http://is.gd
# 3 --> http://cli.gs
# 4 --> randomly select one of the four above ( 2,0,0,3,1..etc )
# 5 --> cycle through the four above ( 0,1,2,3,0,1..etc )
# --- [ 0 - 5 ]
variable webbyShortType 5
# regexp capture limit
# this is how wide each regexp capture can be, prior to it
# being cleaned up for html elements. this can take a very
# long time to cleanup if the entire html was taken. so this
# variable is there to protect your bot from lagging forever
# when sites reply with tons of html.
# --- [ integer ]
variable regexp_limit 3000
# how should we treat encodings?
# 0 - do nothing, use eggdrop's internal encoding, whatever that may be.
# 1 - use the encoding the website is telling us to use.
# This is the option everyone should use primarily.
# 2 - Force a static encoding for everything. If you use option 2,
# please specify the encoding in the setting below this one.
# --- [ 0 off-internal / 1 automatic / 2 forced ]
variable webbyDynEnc 1
# option 2, force encoding
# if you've chosen this above then you must define what
# encoding we are going to force for all queries...
# --- [ encoding ]
variable webbyEnc "iso8859-1"
# fix http-Package conflicts?
# this will disregard the http package's detected charset if it
# differs from the character set detected within the html's meta
# tags, and vice versa. you can change this behavior
# on-the-fly using the --swap parameter.
# --- [ 0 no / 1 use http-package / 2 use html meta's ]
variable webbyFixDetection 2
# report http-package conflicts?
# this will display http-package conflict errors to the
# channel, and will show corrections made to any encodings.
# --- [ 0 never / 1 when the conflict happens / 2 always ]
variable webbyShowMisdetection 2
# where should files be saved that people have downloaded?
# this should NOT be eggdrop's root, because things CAN and WILL
# be overwritten. this FOLDER MUST exist, otherwise the --file
# command will always fail as well.
# --- [ directory ] ----
variable webbyFile "downloads/"
# when using the --file parameter should webby immediately return
# after saving or should it return results for the query as well?
# --- [ 0 return after saving / 1 do normal query stuff as well ]
variable webbyFileReact 0
# <--- end of config; script begins
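As a note on the webbyShortType setting above: modes 0-3 always use one fixed service, while 4 picks randomly and 5 cycles round-robin. A minimal sketch of that selection logic in plain Tcl (the proc and variable names here are illustrative, not webby's actual internals):

```tcl
# Sketch of webbyShortType service selection: 0-3 fixed, 4 random, 5 round-robin.
set services {http://tinyurl.com http://u.nu http://is.gd http://cli.gs}
set cycleIndex 0

proc pick_service {type} {
    global services cycleIndex
    switch -- $type {
        4 { return [lindex $services [expr {int(rand() * 4)}]] }
        5 {
            ;# round-robin: return the current service, then advance the index
            set s [lindex $services $cycleIndex]
            set cycleIndex [expr {($cycleIndex + 1) % 4}]
            return $s
        }
        default { return [lindex $services $type] }
    }
}
```

With type 5, successive calls walk through tinyurl, u.nu, is.gd, cli.gs and start over, matching the ( 0,1,2,3,0,1..etc ) pattern described in the config comments.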
In its simplest mode, just feeding webby a url, you will get: the title, a shorturl, http code, filetype, encoding, size and traversals (if any), followed by the site description on the second line. This seems the cleanest and best way to display this data. There is an http-package conflict detection and resolution system written into the script which can spot and correctly resolve conflicts when the html's meta tagging and the http package each suggest a different encoding.
<speechles> !webby www.roms-isos.com --regexp <title>(.*?)</title>.*?"description".*?content="(.*?)"--
<sp33chy> regexp capture1 ( ROMS-ISOS: News )
<sp33chy> regexp capture2 ( Emulation, Gaming, Linux, whatever news )
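The conflict detection described above boils down to comparing two charset declarations and preferring one when they disagree. A hedged sketch of the idea in plain Tcl, mirroring the webbyFixDetection setting (1 = trust the http package, 2 = trust the html metas); the proc name and default are assumptions, not webby's actual internals:

```tcl
# Sketch: pick an encoding when the http header and the html metas disagree.
proc pick_charset {headerType html fixMode} {
    set headerCs ""
    regexp -nocase {charset=([a-zA-Z0-9_-]+)} $headerType -> headerCs
    set metaCs ""
    regexp -nocase {<meta[^>]+charset=["']?([a-zA-Z0-9_-]+)} $html -> metaCs
    if {$headerCs ne "" && $metaCs ne "" && ![string equal -nocase $headerCs $metaCs]} {
        ;# conflict: resolve according to the config setting
        return [expr {$fixMode == 1 ? $headerCs : $metaCs}]
    }
    if {$metaCs ne ""} { return $metaCs }
    if {$headerCs ne ""} { return $headerCs }
    return "iso8859-1" ;# nothing declared; fall back to a default
}

puts [pick_charset "text/html; charset=utf-8" {<meta charset="cp1251">} 2]
;# prints: cp1251
```

The --swap parameter mentioned in the config would simply flip which branch wins for a single request.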
<speechles> !webby --html --header www.youtube.com --xheader
<sp33chy> YouTube - Broadcast Yourself. ( http://tinyurl.com/9zza6 )( 200; text/html; No charset defined, assuming iso8859-1; 71200 bytes )
<sp33chy> Server=Apache; Keep-Alive=timeout=300; Expires=Tue, 27 Apr 1971 19:44:06 EST; Date=Sat, 18 Apr 2009 23:50:01 GMT; Content-Type=text/html; charset=utf-8; Content-Length=67768; Connection=Keep-Alive; Cache-Control=no-cache
<sp33chy> X-YouTube-MID=WkFSZzctYUFHdmlHTm5zMkNvVnZmQU9qanZYZG54VndUWFZLcXc2alRIWnZYQmIzZnByM3RR; X-Content-Type-Options=nosniff
<sp33chy> HTML 4.01 Transitional//EN; Metas: description=Share your videos with friends, family, and the world; keywords=video, sharing, camera phone, video phone, free, upload
<speechles> !webby hulu.com
<sp33chy> Hulu - Watch your favorites. Anytime. For free. ( http://u.nu/9fxk )( 200; text/html; utf-8; 69884 bytes; 1 redirects )
<sp33chy> Hulu.com is a free online video service that offers hit TV shows including Family Guy, 30 Rock, NBA and the Daily Show with Jon Stewart, etc. Our extensive library also includes blockbuster movies like Lost in Translation as well as classics like Fiddler on the Roof. Start watching now. No registration or downloads required.
<speechles> !webby www.google.com --regexp <title>([0-)</title>--
<sp33chy> regexp couldn't compile regular expression pattern: invalid character range
<speechles> !webby eggdrop.org.ru
<sp33chy> Новости / eggdrop.org.ru / The Russian Eggdrop Resource / всё об eggdrop/windrop, tcl скриптах, модулях, генераторах irc статистики ( http://tinyurl.com/4vfyun )( 200; text/html; cp1251; 24147 bytes )
<sp33chy> eggdrop.org.ru - The Russian Eggdrop Resource [Новости] (всё об eggdrop/windrop, tcl скриптах, модулях, генераторах irc статистики)
<speechles> !webby http://www.duzheer.com/youqingwenzhang/
<sp33chy> 友情文章 - 散文随笔 - 美文欣赏 - 读者文章阅读网 ( http://is.gd/M57rLr )( 200; text/html; gb2312; 19306 bytes )
<sp33chy> 本站提供精美的友情文章免费在线阅读欣赏
<speechles> !webby http://ar.wikipedia.org/wiki/%D8%A5%D8% ... 8%A7%D8%A8
<sp33chy> إرهاب - ويكيبيديا، الموسوعة الحرة ( http://is.gd/p7zhji )( 200; text/html; utf-8; 89211 bytes )
<speechles> !webby http://feature.jp.msn.com/skill/special ... 15/003.htm
<sp33chy> すし業界の常識を「CHANGE!」した職人 - MSNスキルアップ ( http://is.gd/aMlcPL )( 200; text/html; utf-8; 19657 bytes )
<sp33chy> 江戸前ずしの伝統技術「細工巻きずし」と房総地方の郷土料理「太巻き祭りずし」をもとに研究を重ねて生まれた「飾り巻きずし」。東京すしアカデミーの川澄健校長に芸術ともいえる「飾り巻きずし」の奥の深さを伺いました。
<speechles> !webby http://zh.wikipedia.org/wiki/China
<sp33chy> 中國 - 维基百科,自由的百科全书 ( http://is.gd/EkZvfR )( 200; text/html; utf-8; 599493 bytes )
You have access to a host of (--switches) you can embed anywhere in your !webby requests. All of them are listed below:
At the moment there are no flags controlling any of these behaviors; literally anyone can use any of these --switches within their !webby requests. In the future (v1.x) there will be settings to define flags for each of these extra abilities webby has through these --switches.
--header ( displays the website's header attributes (minus private cookie traffic) )
--xheader ( displays the website's x-header attributes )
--html ( displays the website's html type and meta tagging )
--post ( attempts a post query rather than a get query )
--validate ( displays only header/x-header attributes; does not download the page body )
--gz ( attempts gzipped queries rather than standard text )
--file ( saves the contents of the file given to the chosen directory )
--override ( combined with --regexp below; overrides the regexp_limit variable setting )
--nostrip ( combined with --regexp below; stops the html scrubbing function from cleaning your regexp captures )
--swap ( during conflicts this swaps the encoding used to resolve it. )
--regexp ( explanation below )
Along with all of those comes one that needs a fuller explanation. The --regexp switch and parser require the syntax below:
--regexp REGEXP_GOES_HERE--
so using:
--regexp <title>(.*?)</title>--
Would give you a regexp capture of the title of the website.
using:
--regexp <title>(.*?)</title>.*?"description".*?content="(.*?)"--
Would give you both the title, and meta description of the site.
Your regexp can contain up to 9 capture groups (more are allowed, actually, but they won't be used). The -nocase option is always applied. If a capture is too long it will be split into several lines accordingly. Each capture is scraped of irrelevant html elements, and this can lag your bot tremendously, so the regexp_limit variable is in the config to stop people abusing and lagging your bot. The limit is preset at 3000 chars, but you can set it however you like. There is error detection for your constructed regexp, and any errors found will be reported diligently. There is also full error detection for any connection problems the bot may experience. The entire html the bot receives (minus any newlines, tabs, vertical tabs, and carriage returns) is saved into your eggdrop root as webby.txt. Using this to craft and test regular expressions and learn how they work is the main point of this script.
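To show what the --regexp parser does internally, here is a sketch in plain Tcl: the same title/description pattern from the examples above, run with -nocase against a made-up html snippet (the html here is purely illustrative):

```tcl
# The --regexp examples above translate to an ordinary Tcl regexp call.
# -nocase is always applied, and up to 9 capture groups are reported.
set html {<TITLE>Example Site</TITLE><meta name="description" content="A demo page">}
set pattern {<title>(.*?)</title>.*?"description".*?content="(.*?)"}

if {[regexp -nocase $pattern $html -> capture1 capture2]} {
    puts "regexp capture1 ( $capture1 )"   ;# Example Site
    puts "regexp capture2 ( $capture2 )"   ;# A demo page
}
```

Note the non-greedy (.*?) quantifiers: they stop at the first matching delimiter, which is why the title capture ends at the first </title> rather than swallowing the rest of the page.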
If you have any questions, comments, flames, or problems, feel free to post them here...
PS: the script also supports the --post modifier; when given, the script will attempt to deconstruct the post query to see if it's malformed, and then issue the query using post rather than a simple get. This is for people wishing to scrape the types of sites which won't work properly with standard queries.