WWW::Spyder POD
WWW::Spyder Perldoc

Below is the POD for WWW::Spyder (view the code). For a discussion and examples of using it please see this page. You can also find it for download on the CPAN.

NAME
WWW::Spyder

VERSION
0.18

SYNOPSIS
A web spider that returns plain text, HTML, and other information per page crawled and can determine what pages to get and parse based on supplied terms compared to the text in links as well as page content.

METHODS
    $spyder = WWW::Spyder->new( shift || die "Gimme a URL!\n" );
    # ...or...
    $spyder = WWW::Spyder->new( %options );

    $spyder->terms( qr/\bkings?\b/i, qr/\bqueens?\b/i );

    $spyder->UA->timeout(30);
    $spyder->UA->max_size(250_000);
    $spyder->UA->agent('Mozilla/5.0');
    $spyder->UA->from('bluefintuna@fish.net');

Weird courteous behavior
Courtesy didn’t use to be weird, but that’s another story. You will probably notice that the courtesy routines force a sleep when a recently seen domain is the only choice for a new link. The sleep is partially randomized. This is to prevent the spyder from being recognized in weblogs as a robot.

The web and courtesy
Please, I beg of thee, exercise the most courtesy you can. Don’t let impatience get in the way. Bandwidth and server traffic are

Update: Google seems to be excluding generic LWP agents now. See, I told you so. A single parallel robot can really hammer a major server, even someone with as big a farm and as much bandwidth as Google.

VERBOSITY
SAMPLE USAGE

See “spyder-mini-bio” in this distribution
It’s an extremely simple, but fairly cool pseudo bio-researcher.

Simple continually crawling spyder:
In the following code snippet:

    use WWW::Spyder;

    my $spyder = WWW::Spyder->new( shift || die "Give me a URL!\n" );

    while ( my $page = $spyder->crawl ) {
        print '-'x70, "\n";
        print "Spydering: ", $page->title, "\n";
        print " URL: ", $page->url, "\n";
        print " Desc: ", $page->description || 'n/a', "\n";
        print '-'x70, "\n";
        while ( my $link = $page->next_link ) {
            printf "%22s ->> %s\n",
                length($link->name) > 22
                    ? substr($link->name,0,19).'...' : $link->name,
                length($link) > 43
                    ? substr($link,0,40).'...' : $link;
        }
    }

as long as unique URLs are being found in the pages crawl’d, the spyder will never stop.

Each “crawl” returns a page object which gives the following methods to get information about the page.
Spyder that will give up the ghost...
The following spyder is initialized to stop crawling when either of its conditions is met: 10 minutes pass or 300 pages are crawled.

    use WWW::Spyder;

    my $url = shift || die "Please give me a URL to start!\n";

    my $spyder = WWW::Spyder->new(
        seed       => $url,
        sleep_base => 10,
        exit_on    => {
            pages => 300,
            time  => '10min',
        },
    );

    while ( my $page = $spyder->crawl ) {
        print '-'x70, "\n";
        print "Spydering: ", $page->title, "\n";
        print " URL: ", $page->url, "\n";
        print " Desc: ", $page->description || '', "\n";
        print '-'x70, "\n";
        while ( my $link = $page->next_link ) {
            printf "%22s ->> %s\n",
                length($link->name) > 22
                    ? substr($link->name,0,19).'...' : $link->name,
                length($link) > 43
                    ? substr($link,0,40).'...' : $link;
        }
    }

Primitive page reader

    use WWW::Spyder;
    use Text::Wrap;

    my $url = shift || die "Please give me a URL to start!\n";
    @ARGV or die "Please also give me a search term.\n";

    my $spyder = WWW::Spyder->new;
    $spyder->seed_url($url);
    $spyder->terms(@ARGV);

    while ( my $page = $spyder->crawl ) {
        print '-'x70, "\n * ";
        print $page->title, "\n";
        print '-'x70, "\n";
        print wrap('', '', $page->text);
        sleep 60;
    }

TIPS
If you are going to do anything important with it, implement some signal blocking to prevent accidental problems and tie your gathered information to a DB_File or some such.

Right now the module loads

You might want to set $| = 1.

PRIVATE METHODS
are private but hack away if you’re inclined

TO DO
Spyder is conceived to live in a future namespace as a servant class for a complex web research agent with simple interfaces to pre-designed grammars for research reports; or self-designed grammars/reports (might be implemented via Parse::FastDescent if that lazy-bones Conway would just find another 5 hours in the paltry 32 hour day he’s presently working).

I’d like the thing to be able to parse RTF, PDF, and perhaps even resource sections of image files but that isn’t on the radar right now.
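One way to act on the TIPS advice above (defer interrupts and tie gathered information to an on-disk hash) is sketched here. It is only an illustration under stated assumptions: SDBM_File stands in for DB_File as the "or some such" because it ships with core Perl, and record_pages plus the canned page list are hypothetical names standing in for a real $spyder->crawl loop.

```perl
use strict;
use warnings;
use Fcntl;                       # O_RDWR, O_CREAT, O_RDONLY
use SDBM_File;                   # stand-in for DB_File; part of core Perl
use File::Temp qw(tempdir);

$| = 1;                          # unbuffered output, as the TIPS suggest

# Hypothetical helper: store page titles by URL in a tied on-disk hash.
# $next_page is a code ref returning [ $url, $title ] or undef when done;
# in real use it would wrap $spyder->crawl.
sub record_pages {
    my ( $db_path, $next_page ) = @_;

    tie my %seen, 'SDBM_File', $db_path, O_RDWR | O_CREAT, 0666
        or die "Cannot tie $db_path: $!";

    # Defer Ctrl-C: note the request, finish the current page, then stop.
    my $interrupted = 0;
    local $SIG{INT} = sub { $interrupted = 1 };

    my $count = 0;
    while ( !$interrupted and my $page = $next_page->() ) {
        my ( $url, $title ) = @$page;
        $seen{$url} = $title;    # written through to disk by the tie
        $count++;
    }

    untie %seen;                 # close the database files cleanly
    return $count;
}

# Canned data in place of a live crawl, so the sketch runs offline.
my @fake_pages = (
    [ 'http://example.com/',  'Home' ],
    [ 'http://example.com/a', 'Page A' ],
);
my $dir    = tempdir( CLEANUP => 1 );
my $stored = record_pages( "$dir/pages", sub { shift @fake_pages } );
print "Stored $stored pages\n";
```

SDBM limits each key/value pair to roughly a kilobyte, so storing whole page text would need DB_File or another backend; the deferred-interrupt pattern stays the same either way.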
TO DOABLE BY 1.0

Add 2-4 sample scripts that are a bit more useful.
There are many functions that should be under the programmer’s control and not buried in the spyder. They will emerge soon. I’d like to put in hooks to allow the user to keep(), toss(), or exclude() URLs, link names, and domains while crawl’ing.
Clean up some redundant, sloppy, and weird code. Probably change or remove the AUTOLOAD.
Put in a go_to_seed() method and a subclass, ::Seed, with rules to construct query URLs by search engine. It would be the autostart or the fallback for perpetual spyders that run out of links. It would hit a given or default search engine with the Spyder’s terms as the query. Obviously this would only work with terms() defined.
Implement auto-exclusion for failure vs. success rates on names as well as domains (maybe URI suffixes too).
Turn length of courtesy queue into the breadth/depth setting? Make it automatically adjusting...?
Consistently found link names are excluded from term strength sorting? E.g.: “privacy policy,” “read more,” “copyright...”
Fix some image tag parsing problems and add area tag parsing.
Configuration for user:password by domain.
::Page objects become reusable so that a spyder only needs one.
::Enqueue objects become indexed so they are nixable from anywhere.
Expand exit_on routines to size, slept time, dwindling success ratio, and maybe more.
Make methods to set “skepticism” and “effort” which will influence the way the terms are used to keep, order, and toss URLs.

BE WARNED
This module already does some extremely useful things but it’s in its infancy and it is conceived to live in a different namespace and perhaps become more private as a subservient part of a parent class. This may never happen but it’s the idea. So don’t put this into production code yet. I am endeavoring to keep its interface constant either way. That said, it could change completely.

Also!
This module saves cookies to the user’s home.
There will be more control over cookies in the future, but that’s how it is right now. They live in $ENV{HOME}/spyderCookie.

Anche!
Robot Rules aren’t respected. Spyder endeavors to be polite as far as server hits are concerned, but doesn’t take “no” for an answer right now. I want to add this, and not just by domain, but by page settings.

UNDOCUMENTED FEATURES
A.k.a. Bugs. Don’t be ridiculous! Bugs in my code?!

I think there is a bug that causes image src tags to be retrieved as links, but I haven’t tracked it down yet. I also think the plain text parsing has some problems which will be remedied shortly.

If you are building more than one spyder in the same script they are going to share the same exit_on parameters because it’s a self-installing method. This will not always be so.

See Bugs file for more open and past issues.

Let me know if you find any others. If you find one that is platform specific, please send patch code or a suggestion because I might not have any idea how to fix it.

WHY Spyder?
I didn’t want to use the more appropriate Spider because I think there is a better one out there somewhere in the zeitgeist and the namespace future of Spyder is uncertain. It may end up a semi-private part of a bigger family. And I may be King of Kenya someday. One’s got to dream.

If you like Spyder, have feedback, wishlist usage, better algorithms/implementations for any part of it, please let me know!

AUTHOR, AUTHOR
Ashley5, ashley@cpan.org. Bob’s your monkey’s uncle.

COPYRIGHT
(c)2001-2002 Ashley Pond V. All rights reserved. This program is free software; you may redistribute or modify it under the same terms as Perl.

THANKS TO
Most all y’all. Especially Lincoln Stein, Gisle Aas, The Conway, Raphael Manfredi, Gurusamy Sarathy, and plenty of others.

COMPARE WITH
WWW::Robot, LWP::UserAgent, WWW::SimpleRobot, WWW::RobotRules, LWP::RobotUA, and other kith and kin.
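
Since robot rules aren’t respected by the module itself, a caller could filter URLs with WWW::RobotRules, one of the modules listed under COMPARE WITH, before handing them to a spyder. The following is a minimal sketch rather than anything Spyder does: the agent name 'MySpyder/0.1', the example.com URLs, and the canned robots.txt are all made up for illustration, and nothing here touches the network.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Assumption: 'MySpyder/0.1' is a hypothetical agent name.
my $rules = WWW::RobotRules->new('MySpyder/0.1');

# A canned robots.txt so the sketch runs offline. In real use the text
# would be fetched from the site's /robots.txt first.
my $robots_url = 'http://example.com/robots.txt';
my $robots_txt = <<'END_ROBOTS';
User-agent: *
Disallow: /private/
END_ROBOTS

$rules->parse( $robots_url, $robots_txt );

# allowed() returns 1 (permitted), 0 (disallowed), or -1 when no
# robots.txt has been parsed for that host, so compare against 1
# rather than testing for truth.
for my $url ( 'http://example.com/index.html',
              'http://example.com/private/notes.html' ) {
    my $verdict = $rules->allowed($url) == 1 ? 'crawl' : 'skip';
    printf "%-40s %s\n", $url, $verdict;
}
```

LWP::RobotUA, also listed above, wraps the same rules around an LWP user agent and adds a per-host delay, which may be the simpler route when the agent object can be swapped out.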
Text, original code, fonts, and graphics ©1990-2008 Ashley Pond V.