Following redirects and links
Redirects
Websites often use redirects. Fortunately, LWP::UserAgent follows them out of the box. The max_redirect option controls how many consecutive redirects LWP will follow before it gives up.
There is a special page /redirect
that will redirect to the index page.
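use v5.10;    # enables say
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(
    agent => 'MyWebScraper/1.0 <http://example.com>',
);

# /redirect answers with a redirect, which LWP follows to the index page
my $response = $ua->get('http://example:3000/redirect');
say $response->decoded_content;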
If we set max_redirect to 0, LWP does not follow the redirect, so we never reach the index page. The response we get back is the redirect itself.
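use v5.10;    # enables say
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(
    agent        => 'MyWebScraper/1.0 <http://example.com>',
    max_redirect => 0,    # do not follow any redirects
);

# The redirect response is returned as-is instead of the index page
my $response = $ua->get('http://example:3000/redirect');
say $response->decoded_content;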
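When redirects are not followed, it can help to look at where the server wanted to send us. The following is a minimal sketch, not part of the original example: describe_response is a hypothetical helper, and the responses are built by hand with HTTP::Response (assumed status 302 and Location of /) so the snippet runs without a server. With a real request you would pass in the object returned by $ua->get instead.

```perl
use v5.10;
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: report where a redirect points,
# or the status line for any other response.
sub describe_response {
    my ($response) = @_;
    return 'redirect to ' . $response->header('Location')
        if $response->is_redirect;
    return 'final response: ' . $response->status_line;
}

# Hand-built stand-in for what $ua->get(...) returns with max_redirect => 0.
# The status code and Location value are assumptions for illustration;
# the real values depend on the server.
my $redirect = HTTP::Response->new(302, 'Found', [Location => '/']);
say describe_response($redirect);

# A normal response for comparison
my $ok = HTTP::Response->new(200, 'OK');
say describe_response($ok);
```

Checking is_redirect and the Location header lets a scraper decide for itself whether a redirect is worth following, rather than leaving that decision to LWP.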
Links
A scraper often needs to follow the links it finds on a web page. We can use CSS selectors to get all the a tags. Let's try it on a simple HTML example:
use v5.10;    # enables say
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath;

my $html = <<'EOF';
<html>
<head>
<title>A sample webpage!</title>
</head>
<body>
<h1>Perltuts.com rocks!</h1>
<a href="http://perltuts.com">perltuts.com</a>
</body>
</html>
EOF

# Parse the HTML into a tree
my $tree = HTML::TreeBuilder::XPath->new;
$tree->ignore_unknown(0);
$tree->parse($html);
$tree->eof;

# Translate the CSS selector 'a' into XPath and find the matching nodes
my $xpath = HTML::Selector::XPath::selector_to_xpath('a');
my @nodes = $tree->findnodes($xpath);

# Print the value of the first attribute (href) of the first link
my @attrs = $nodes[0]->getAttributes();
say $attrs[0]->getValue();
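The example above prints only the first link. To follow links we usually want all of them, as absolute URLs. Here is a rough sketch of one way to do that. The extract_links helper, the extra links in the sample HTML, and the base URL are assumptions made for illustration, and the URI module is used to resolve relative links.

```perl
use v5.10;
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath;
use URI;

# Hypothetical helper: return the absolute URL of every <a> tag
# in $html that has an href attribute, resolved against $base.
sub extract_links {
    my ($html, $base) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    # 'a[href]' skips anchors that have no href attribute
    my $xpath = HTML::Selector::XPath::selector_to_xpath('a[href]');
    my @links = map { URI->new_abs($_->attr('href'), $base)->as_string }
                $tree->findnodes($xpath);

    $tree->delete;    # free the parse tree
    return @links;
}

my $html = <<'EOF';
<html><body>
<a href="http://perltuts.com">perltuts.com</a>
<a href="/tutorials">tutorials</a>
<a name="top">no href here</a>
</body></html>
EOF

# Print one absolute URL per line
say for extract_links($html, 'http://perltuts.com/');
```

Each returned URL can then be passed to $ua->get to fetch the linked page.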