UNOFFICIAL incith-google 2.1x (Nov30,2o12)

testebr · Post by **testebr** » Mon Aug 04, 2008 4:16 pm

Can anyone check whether the whois Google search is still working?

!g whois bbc.co.uk

In this example the bot no longer replies at all.

Thank you!

speechles · Post by **speechles** » Sat Aug 09, 2008 10:30 am

testebr wrote:Can anyone check whether the whois Google search is still working?

!g whois bbc.co.uk

In this example the bot no longer replies at all.

Thank you!

<speechles> !g whois bbc.co.uk
<sp33chy> 49,100 Results | Whois record for bbc.co.uk (Created on Aug. 01, 1996 and Expires on Dec. 13, 2008) @ http://whois.domaintools.com/bbc.co.uk | CoolWhois.com - WHOIS search of bbc.c @ http://www.coolwhois.com/d/bbc.co.uk | CoolWhois.com - WHOIS search of ns1.rb @ http://www.coolwhois.com/d/ns1.rbsov.bbc.co.uk | bbc.co.uk - Who.is @ http://www.who.is/whois-uk/ip-address/bbc.co.uk/

Solved. The problem was that for top results Google appears to be changing the normal div class=g into an h2 class=r. This might affect other onebox results, and I may need to fix those as well, but this solves the issue for whois. I've also corrected the search result totals, so accurate totals should once again appear before the results are displayed. Local has also been corrected to parse "special locations" properly.
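
A minimal sketch of the kind of change described above, not the script's actual parser: the proc name first_result_title and both sample HTML strings are invented for illustration, and Google's real markup is more involved. It only shows one way to accept either wrapper. Every quantifier is kept non-greedy because mixing greedy and non-greedy quantifiers in a single Tcl pattern can produce surprising matches.

```tcl
# Hypothetical helper: return the text of the first result title, whether the
# result is wrapped in <div class=g> or in <h2 class=r>.
proc first_result_title {html} {
    foreach {tag cls} {div g h2 r} {
        # Build one pattern per wrapper; all quantifiers are non-greedy.
        set re [format {<%s class=%s[^>]*?>.*?<a [^>]*?>(.*?)</a>} $tag $cls]
        if {[regexp -nocase -- $re $html -> title]} {
            # Strip inline tags (such as <b>) left inside the title.
            regsub -all -- {<[^>]+>} $title {} title
            return [string trim $title]
        }
    }
    return ""
}

# Invented sample markup for the old and the new layout.
set old {<div class=g><a href="http://example.com/">Example <b>whois</b></a></div>}
set new {<h2 class=r><a href="http://example.com/">Example <b>whois</b></a></h2>}
puts [first_result_title $old]
puts [first_result_title $new]
```

Trying the wrappers one after another keeps each pattern simple, and adding a third layout later would only mean extending the tag/class list.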

Get the new script right here and please report any problems in this thread, thanks. Most important to remember, HAVE A FUN!

Also, for those wishing the wikimedia support were more than it was, it now can be:

Code:

    # Customized Wikimedia
    # allow customized triggers for special wikimedia pages
    # Anything other than 0 will enable and will use the list below.
    variable wiki_custom 1

    # Custom wiki triggers
    # This is used to customize triggers for different wikimedia sites.
    # The format is "trigger:wikisite.here"
    variable wiki_customs {
      "swiki:wiki.sabayonlinux.org"
      "gwiki:www.gentoo-wiki.com"
      "ed:encyclopediadramatica.com"
      "un:uncyclopedia.org"
    }
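
To illustrate the "trigger:wikisite.here" format, here is a rough sketch of how such a list could be searched. The proc name wiki_site_for is hypothetical and this is not how the script itself resolves triggers; it also uses a plain global set instead of the script's variable declaration so that it runs on its own in tclsh.

```tcl
# Same "trigger:wikisite.here" entries as in the config above.
set wiki_customs {
  "swiki:wiki.sabayonlinux.org"
  "gwiki:www.gentoo-wiki.com"
  "ed:encyclopediadramatica.com"
  "un:uncyclopedia.org"
}

# Hypothetical lookup: return the site for a trigger, or "" if it is unknown.
proc wiki_site_for {trigger customs} {
    foreach entry $customs {
        # Split on the first colon only, in case the site part contains one.
        set idx [string first ":" $entry]
        if {$idx < 0} { continue }
        set name [string range $entry 0 [expr {$idx - 1}]]
        set site [string range $entry [expr {$idx + 1}] end]
        if {[string equal -nocase $name $trigger]} {
            return $site
        }
    }
    return ""
}

puts [wiki_site_for gwiki $wiki_customs]
puts [wiki_site_for nosuch $wiki_customs]
```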

<speechles> !gwiki unix
<sp33chy> SECURITY Anonymizing Unix Systems | This text is for any human being out there who wishes to keep their data and doings private from any snooping eye - monitoring network traffic and stealing/accessing the computer including electronic forensics. Hackers, phreakers, criminals, members of democracy parties in totalitarian states, human rights workers, and people with high profiles might be interested in
<sp33chy> this information. It was especially written for novice hackers so they are not so easily convicted when busted for their early curiosity. @ http://www.gentoo-wiki.com/SECURITY_Ano ... ix_Systems
<speechles> !swiki unix
<sp33chy> Linux | Linux (also known as GNU/Linux) is a Unix-like computer operating system. It is one of the most prominent examples of open source development and free software; unlike proprietary operating systems such as Windows or Mac OS, all of its underlying source code is available to the general public for anyone to freely use, modify, and redistribute. According to wikipedia. Linux is the combination of
<sp33chy> the Linux kernel, the GNU set of operating system relates applications, and other FOSS (Free Open Source software) software. Linux is the basis of Gentoo GNU Linux which is the basis of Sabayon GNU Linux or Sabayon Linux. For more information see the Wikipedia link above. @ http://wiki.sabayonlinux.org/index.php?title=Linux
<speechles> !gwiki unix#toc
<sp33chy> SECURITY Anonymizing Unix Systems | ToC: THE AUDIENCE; GOAL; PREREQUISITES; USER DATA; Sensitive user data; Protecting home directories; Traceable user activity; Protecting /var/spool/* files; SYSTEM DATA; Sensitive system data; Traceable system activity; Logging - important and dangerous; Protecting system configs; Computer Memory and sensitive /proc interfaces; DELETE(D) DATA AND SWAP; How to delete
<sp33chy> files in a secure way; How to wipe free disk space; How to handle swap data; How to handle RAM; Temporary data - it is evil; NETWORK CONNECTIONS; HIDING PRIVACY SETTINGS; Mount is your friend; Removable Medias; ???; Final Comments; Example Configuration And Scripts; Crypto Filesystems; Tools; Additional thoughts; Credits; Greetings; Greets to individuals (in alphabetic order):; Greets to groups:; Greets to
<sp33chy> channel members: @ http://www.gentoo-wiki.com/SECURITY_Ano ... ystems#toc
<speechles> !gwiki unix#?
<sp33chy> SECURITY Anonymizing Unix Systems | ??? Any other ideas? Think about it! (and maybe send me your ideas ;-) @ http://www.gentoo-wiki.com/SECURITY_Ano ... #.3F.3F.3F [1 Redirect(s)]

This is included in the 1.98s update; it was a requested feature.

eMxyzptlk · Post by **eMxyzptlk** » Sat Aug 09, 2008 12:44 pm

Thank you for the custom wiki, you rock dude

P.S: The link in the post above is pointing to http://ereader.kiczek.com/incith-google-v1.98r.tcl instead of http://ereader.kiczek.com/incith-google-v1.98s.tcl ( Previous Version )

speechles · Post by **speechles** » Sat Aug 09, 2008 1:29 pm

eMxyzptlk wrote:Thank you for the custom wiki, you rock dude

P.S: The link in the post above is pointing to http://ereader.kiczek.com/incith-google-v1.98r.tcl instead of http://ereader.kiczek.com/incith-google-v1.98s.tcl ( Previous Version )

DOH!~ I've corrected that, thanks for spotting it. And about the props, it was an easy addition and makes the script more versatile with less typing so why not make it a reality. Enjoy.

Phyxion · Post by **Phyxion** » Fri Aug 15, 2008 7:39 am

GameSpot ain't working anymore speechles. They updated their code once again.

speechles · Post by **speechles** » Fri Aug 15, 2008 9:55 pm

Phyxion wrote:GameSpot ain't working anymore speechles. They updated their code once again.

They did quite a bit more than update their HTML templates. They changed the entire query. It now uses a PHP backend to retrieve the search results using cookie and referrer fields, and I'd need to investigate how those even work (although I do remember reading a post by user concerning this exact issue) before I could add something to fix it. If you leave any of these details out, the returned HTML merely contains a "searching..." where the results normally appeared. You can test this yourself: do a !game anything, then check your eggdrop root for a file named ig-debug.txt; it contains the HTML with 'searching...' instead of usable results. I have to question why GameSpot would do something that prevents potential free advertising from any and all index/scrape bots. GameSpot must not be getting enough click-through impressions from people scraping their pages. I've always had direct links to GameSpot and every other scraped site appearing within the given results, so it isn't blatant theft; it's helping advertise for them, imo...

If you can tell me what you think, it would help. Is it immoral and wrong to scrape a website when it is obvious that the website is trying to eradicate scraping? If so, then it wouldn't be just of me to turn this script into something illicit (like heroin), where it's traded more for what it does wrong than for what it does right... If we are all damned and going to hell anyway, then we can soullessly and callously scrape them to death and update to a cookie/referrer approach rather than a simple query. It depends on what the object of this script is, which I leave solely up to each and every one of you: the people using the script.

pwner · Post by **pwner** » Sat Aug 16, 2008 2:24 pm

hmm the script is great, but I have a little problem; out of all the features, only a few work for me (google search is gone, wiki, ebay and basically all the good ones).

Could this be the fault of my shell provider, or the Tcl version I'm currently using?

I'm using incith-google-v1.98s, someone please help...

speechles · Post by **speechles** » Sun Aug 17, 2008 12:10 am

pwner wrote:hmm the script is great, but I have a little problem; out of all the features, only a few work for me (google search is gone, wiki, ebay and basically all the good ones).

Could this be the fault of my shell provider, or the Tcl version I'm currently using?

I'm using incith-google-v1.98s, someone please help...

Let me explain why with a slight ethics lesson. Websites wish their content to be viewed on their medium. They sometimes take countermeasures to discourage scraping, which is the method of data retrieval this script uses.

For ebay this happens:

<bot> redirected: http://search.ebay.com/dog_W0QQpqryZdog -> http://shop.ebay.com/items/_W0QQ_nkwZdo ... omZQQ_mdoZ
<bot> url: http://shop.ebay.com/items/_W0QQ_nkwZdo ... omZQQ_mdoZ charset: iso8859-1 encode_string: iso8859-1

I haven't written parsers for the template this new server returns. Notice that search.ebay.com becomes shop.ebay.com; this server uses a new template design that is not supported at the moment. Only the search.ebay.com template is currently supported.
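
As a rough illustration of spotting this situation, the sketch below compares the host of the requested URL with the host of the redirect target. The proc names url_host and host_changed are hypothetical, and the shop.ebay.com path in the example is a made-up stand-in because the real target is truncated in the debug output above.

```tcl
# Hypothetical helper: pull the host out of an http:// or https:// URL.
proc url_host {url} {
    if {[regexp -nocase -- {^https?://([^/:?#]+)} $url -> host]} {
        return [string tolower $host]
    }
    return ""
}

# Return 1 when a redirect lands on a different host than the one requested.
proc host_changed {from to} {
    return [expr {[url_host $from] ne [url_host $to]}]
}

# The second URL's path is invented for the example.
puts [host_changed "http://search.ebay.com/dog_W0QQpqryZdog" "http://shop.ebay.com/items/example"]
puts [host_changed "http://search.ebay.com/a" "http://search.ebay.com/b"]
```

A check like this would let a bot report "unsupported template" instead of silently returning nothing.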

For google, you may see this:

<bot> redirected: http://www.google.com/search?hl=&q=anyt ... _all&num=1 -> http://sorry.google.com/sorry/?continue ... %26num%3D1
<bot> url: http://sorry.google.com/sorry/?continue ... %26num%3D1 charset: utf-8 encode_string:

This means Google will only allow you to use its services if you complete the captcha given on the sorry.google.com page. This is a problem between you and Google. The other Google-based searches may not work for you either, possibly because something identifies you as malicious. This is beyond my control; contact Google.
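
A small sketch of how a redirect to the captcha page could be recognized, assuming only what the debug lines above show, namely that the target host is sorry.google.com. The proc name is_google_captcha and the query strings in the example are made up.

```tcl
# Hypothetical check: does a redirect target point at Google's "sorry" page?
proc is_google_captcha {url} {
    return [regexp -nocase -- {^https?://sorry\.google\.com/} $url]
}

# Both query strings are invented for the example.
puts [is_google_captcha "http://sorry.google.com/sorry/?continue=example"]
puts [is_google_captcha "http://www.google.com/search?q=example"]
```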

For you to even begin to see this debug output you MUST change the debugnick in the config section from 'speechles' to the nickname of your debug admin, your nickname perhaps?

As for the other functions of the script, they should all work except for GameSpot. GameSpot uses a new server as well, one that requires cookie/referrer fields to dissuade scraping. Expect a new version soon with better redirect support for wiki(pedia/media) and a few other things too...

Phyxion · Post by **Phyxion** » Sun Aug 17, 2008 9:23 am

speechles wrote:
Phyxion wrote:GameSpot ain't working anymore speechles. They updated their code once again.
They did quite a bit more than update their HTML templates. They changed the entire query. It now uses a PHP backend to retrieve the search results using cookie and referrer fields, and I'd need to investigate how those even work (although I do remember reading a post by user concerning this exact issue) before I could add something to fix it. If you leave any of these details out, the returned HTML merely contains a "searching..." where the results normally appeared. You can test this yourself: do a !game anything, then check your eggdrop root for a file named ig-debug.txt; it contains the HTML with 'searching...' instead of usable results. I have to question why GameSpot would do something that prevents potential free advertising from any and all index/scrape bots. GameSpot must not be getting enough click-through impressions from people scraping their pages. I've always had direct links to GameSpot and every other scraped site appearing within the given results, so it isn't blatant theft; it's helping advertise for them, imo...

If you can tell me what you think, it would help. Is it immoral and wrong to scrape a website when it is obvious that the website is trying to eradicate scraping? If so, then it wouldn't be just of me to turn this script into something illicit (like heroin), where it's traded more for what it does wrong than for what it does right... If we are all damned and going to hell anyway, then we can soullessly and callously scrape them to death and update to a cookie/referrer approach rather than a simple query. It depends on what the object of this script is, which I leave solely up to each and every one of you: the people using the script.

The search URL still works, and the info is also in the page, just built up differently (you can check using Firefox -> View page source). But since I don't understand much Tcl (regexp etc.; unfortunately I don't understand any of it), I can't help.

speechles · Post by **speechles** » Sun Aug 17, 2008 12:52 pm

Phyxion wrote:The search URL still works, and the info is also in the page, just built up differently (you can check using Firefox -> View page source). But since I don't understand much Tcl (regexp etc.; unfortunately I don't understand any of it), I can't help.

I can't believe you just said that... You fail to understand how eggdrop works. Sure, it works in Firefox, because Firefox can supply the cookie and referrer fields. It DOES NOT work on eggdrop until I supply those requirements. There IS NO search data to search for. There IS ONLY a static "searching..." message. Don't believe me? Check this out! Now where are the results to parse? There aren't any. Do you see what I've been saying all along now?

testebr · Post by **testebr** » Sun Aug 17, 2008 1:56 pm

Test -> Max Payne

The problem is not with the referrer, but with the JavaScript AJAX result :]

Try disabling JavaScript in your browser and test it.

speechles · Post by **speechles** » Sun Aug 17, 2008 2:01 pm

testebr wrote:Test -> Max Payne

{"search_results":"<div class="sort_results">\n <select class="{'term':'max payne','type':'game','offset':false,'track':true}">\n <option selected="selected" value="rank">Sort By Rank<\/option>\n <option value="date">Sort By Date<\/option>\n \n <option value="score">Sort By Score<\/option>\n <\/se.....

The above comes from:
http://www.gamespot.com/pages/search/se ... &sort=rank

When you communicate with GameSpot, it sends you HTML data along with a cookie session ID. That HTML is incomplete because the page is actually waiting on a PHP backend server to fill in the results using the AJAX GET above. Notice the 'search_results' key appearing at the front?
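
Because the results arrive as HTML embedded in a JSON string, the escapes visible in the sample (\/ and \n) would have to be undone before any HTML parsing. The sketch below is deliberately naive: the proc name unescape_json_html is hypothetical, it is not a JSON parser, and it only handles the escapes seen above plus an escaped double quote, which is an assumption about what the raw response contains.

```tcl
# Naive unescape for HTML carried inside a JSON string value.
# Handles only \/ , \n and \" ; other escapes (for example \\ or \uXXXX) are
# left untouched.
proc unescape_json_html {s} {
    return [string map [list "\\/" "/" "\\n" "\n" "\\\"" "\""] $s]
}

# Fragment copied from the sample response above.
set fragment {<option value="date">Sort By Date<\/option>\n}
puts [unescape_json_html $fragment]
```

For anything beyond a quick experiment, a real JSON parser would be the safer choice.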

This means it is silly to assume that, just because you can visit the website with a normal web browser and see all the HTML, your bot will be able to do the same. Websites go out of their way to discourage bots, so cookie sessions and other such hurdles are put in our way, and we must jump over them in order to continue scraping. Hopefully you understand what I mean.

testebr · Post by **testebr** » Sun Aug 17, 2008 2:11 pm

Read my reply above (I edited).

speechles · Post by **speechles** » Sun Aug 17, 2008 2:16 pm

testebr wrote:Read my reply above (I edited).

Read my reply. I already know this...

gamespot wrote:Response Headers
Date Sun, 17 Aug 2008 18:08:12 GMT
Server Apache
Accept-Ranges bytes
X-Powered-By PHP/5.2.5
Set-Cookie gspot_side_081708=4; expires=Wed, 20-Aug-2008 18:08:12 GMT; path=/; domain=.gamespot.com
Keep-Alive timeout=300, max=990
Connection Keep-Alive
Transfer-Encoding chunked
Content-Type text/html; charset=ISO-8859-1
Request Headers
Host www.gamespot.com
User-Agent Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.16) Gecko/20080702 Firefox/2.0.0.16
Accept text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language en-us,en;q=0.5
Accept-Encoding gzip,deflate
Accept-Charset ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive 300
Connection keep-alive
Referer http://forum.egghelp.org/viewtopic.php?p=84640
Cookie gspot_side_081408=100; geolocn=NzAuMTMyLjAuOTE6ODQw; XCLGFbrowser=Cg8ILkh0Qr9HAAAAXg8; mbox=PC#1216060154750-11875#1280814507|session#1217742433671-451299#1217744367|check#true#1217742567; __qca=4869b91b-5b1c2-cf30b-ab8d7; MADCAPP=083B3d:1; __utmz=14953632.1217742436.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utma=14953632.3523376941471426000.1217742436.1217742436.1217742436.1; gspot_promo_081408=1; gspot_promo_081608=1; gspot_side_081608=2; u_srv_0_0=-1; __qcb=1709914989; gspot_side_081708=3
Cache-Control max-age=0

See the problem? The script merely does a single page load, which can get the HTTP headers. The script will need to do a second request to the search_ajax.php URL, filling in the request headers correctly, to retrieve any search results. The cookie session is all that matters: notice that the referring site is egghelp and I still got successful search data in the browser.
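
The first half of that two-step approach can be sketched as follows: take the Set-Cookie headers from the first response and turn them into a Cookie value for the second request. This is an illustration under assumptions, not the script's code. The proc name cookie_header is hypothetical, and the input is assumed to be a flat key/value list, which is the shape ::http::meta returns in Tcl's http package.

```tcl
# Build a Cookie request header value from a response's header list.
proc cookie_header {meta} {
    set pairs {}
    foreach {name value} $meta {
        if {![string equal -nocase $name "Set-Cookie"]} { continue }
        # Keep only "name=value"; drop attributes such as expires, path, domain.
        set pair [string trim [lindex [split $value ";"] 0]]
        if {$pair ne ""} { lappend pairs $pair }
    }
    return [join $pairs "; "]
}

# Response headers copied from the capture above.
set meta [list \
    Server Apache \
    Set-Cookie "gspot_side_081708=4; expires=Wed, 20-Aug-2008 18:08:12 GMT; path=/; domain=.gamespot.com" \
    Content-Type "text/html; charset=ISO-8859-1"]

set cookie [cookie_header $meta]
puts $cookie
# With the http package, the second request could then be made roughly as:
#   ::http::geturl $ajax_url -headers [list Cookie $cookie]
# where $ajax_url stands for the search_ajax.php URL (not reproduced here).
```

Only the name=value part of each cookie is sent back; the attributes after the first semicolon are instructions to the client and do not belong in the request.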

Firefox + Firebug will allow you to see HTTP headers as shown above (Firebug is buggy, though, so disable it afterwards or it may eventually crash Firefox).

Phyxion · Post by **Phyxion** » Sun Aug 17, 2008 2:54 pm

testebr wrote:Test -> Max Payne

The problem is not with the referrer, but with the JavaScript AJAX result :]

Try disabling JavaScript in your browser and test it.

That's what I meant too speechles.

But after I checked again I see you are right.

My bad