Following redirects and links

Redirects

Websites often redirect requests. Fortunately, LWP::UserAgent follows redirects out of the box, and the max_redirect option controls how many consecutive redirects it will follow (the default is 7).

There is a special page /redirect that will redirect to the index page.

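use strict;
use warnings;
use feature 'say';
use LWP::UserAgent;

my $ua =
  LWP::UserAgent->new(agent => 'MyWebScraper/1.0 <http://example.com>');

# The redirect is followed automatically, so this is the final response
my $response = $ua->get('http://example:3000/redirect');

# Prints the body of the index page
say $response->decoded_content;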
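When a redirect has been followed, it can help to see which hops led to the final page. The following is a minimal sketch using the redirects method of HTTP::Response. The helper name redirect_chain is hypothetical, and the responses are built by hand (with an assumed 302 status) so that the example runs without the test server; in a real scraper you would pass in the response returned by $ua->get.

```perl
use strict;
use warnings;
use feature 'say';
use HTTP::Request;
use HTTP::Response;

# Hypothetical helper: describe every hop that led to the final response.
sub redirect_chain {
    my ($response) = @_;

    # redirects() returns the earlier redirect responses, oldest first
    my @hops = map { $_->request->uri . ' (' . $_->code . ')' }
      $response->redirects;

    return (@hops,
        $response->request->uri . ' (' . $response->code . ')');
}

# Hand-built stand-ins for what $ua->get would return.
# The 302 status is an assumption about how /redirect behaves.
my $redirect =
  HTTP::Response->new(302, 'Found', [Location => 'http://example:3000/']);
$redirect->request(HTTP::Request->new(GET => 'http://example:3000/redirect'));

my $final = HTTP::Response->new(200, 'OK');
$final->request(HTTP::Request->new(GET => 'http://example:3000/'));
$final->previous($redirect);

my @chain = redirect_chain($final);
say for @chain;
```

Each entry pairs the requested URL with the status code it produced, which makes unexpected redirect loops or detours easier to spot.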

If we set max_redirect to 0, LWP does not follow the redirect, so we get the redirect response itself instead of the index page.

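use strict;
use warnings;
use feature 'say';
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(
    agent        => 'MyWebScraper/1.0 <http://example.com>',
    max_redirect => 0,    # do not follow any redirects
);

my $response = $ua->get('http://example:3000/redirect');

# The status line shows that this is the redirect itself, not the index page
say $response->status_line;
say $response->decoded_content;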
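With redirects disabled, the scraper can decide for itself whether to follow one. A minimal sketch of reading the redirect target is shown here; the helper name redirect_target is hypothetical, and the response is constructed by hand (assuming a 302 status and a relative Location header) so that the example runs without the test server.

```perl
use strict;
use warnings;
use feature 'say';
use HTTP::Request;
use HTTP::Response;
use URI;

# Hypothetical helper: return the absolute redirect target, or nothing
# if the response is not a redirect.
sub redirect_target {
    my ($response) = @_;

    return unless $response->is_redirect;

    my $location = $response->header('Location');
    return unless defined $location;

    # Location may be relative, so resolve it against the requested URL
    return URI->new_abs($location, $response->request->uri)->as_string;
}

# Hand-built stand-in for the response $ua->get returns when
# max_redirect is 0. The status and header values are assumptions.
my $response = HTTP::Response->new(302, 'Found', [Location => '/']);
$response->request(HTTP::Request->new(GET => 'http://example:3000/redirect'));

my $target = redirect_target($response);

say $response->status_line;
say $target;
```

Resolving the Location value with URI->new_abs means the result can be passed straight back to $ua->get, whether the server sent an absolute or a relative URL.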

Links

A scraper often needs to follow the links found on a web page.

We can use CSS selectors to get all the a tags. Let's try it again on a simple HTML example:

use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath;

my $html = <<'EOF';
<html>
    <head>
        <title>A sample webpage!</title>
    </head>
    <body>
        <h1>Perltuts.com rocks!</h1>
        <a href="http://perltuts.com">perltuts.com</a>
    </body>
</html>
EOF

my $tree = HTML::TreeBuilder::XPath->new;
$tree->ignore_unknown(0);
$tree->parse($html);
$tree->eof;

# Convert the CSS selector to XPath and collect every a element
my $xpath = HTML::Selector::XPath::selector_to_xpath('a');
my @nodes = $tree->findnodes($xpath);

# Print the href attribute of each link
say $_->attr('href') for @nodes;
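Links on real pages are often relative, so they need to be resolved against the page URL before they can be fetched. The following sketch combines the selector approach above with URI->new_abs. The helper name extract_links, the sample HTML, and the base URL are all hypothetical.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath;
use URI;

# Hypothetical helper: return the absolute URL of every link in $html,
# resolved against the URL the page was fetched from.
sub extract_links {
    my ($html, $base) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);
    $tree->eof;

    # Only match a elements that actually carry an href attribute
    my $xpath = HTML::Selector::XPath::selector_to_xpath('a[href]');

    my @links = map { URI->new_abs($_->attr('href'), $base)->as_string }
      $tree->findnodes($xpath);

    $tree->delete;    # free the parse tree
    return @links;
}

my $html = <<'EOF';
<html>
    <body>
        <a href="/about">About</a>
        <a href="page2.html">Next page</a>
        <a name="top">An anchor without href</a>
    </body>
</html>
EOF

# The base URL is an assumption made for this example
my @links = extract_links($html, 'http://example:3000/docs/index.html');
say for @links;
```

Each returned URL can then be passed to $ua->get to continue crawling.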