Scraping
Now we know how to fetch pages. Let's extract some data from them! For simplicity and brevity, the following code examples omit error handling, but you should always check for errors in real applications.
The default page, as you already know, looks like this:
<html>
  <head>
    <title>A sample webpage!</title>
  </head>
  <body>
    <h1></h1>
  </body>
</html>
This page is in HTML format, so we need an HTML parser and an XPath and/or CSS selector mechanism to extract data from it.
XPath scraping
First, we will scrape HTML on its own, without fetching it. We use HTML::TreeBuilder::XPath for XPath support. XPath is a query language for XML documents. If you are not familiar with XPath, here is a quick cheatsheet:
descendant-or-self::*
    all elements
//h1
    <h1> element
descendant-or-self::h1/descendant::span
    <span> within <h1>
descendant-or-self::h1 | descendant-or-self::span
    <h1> and <span>
descendant-or-self::h1/span
    <span> with parent <h1>
descendant-or-self::div/following-sibling::*[name() = 'span' and (position() = 1)]
    <span> immediately preceded by <div>
descendant-or-self::*[contains(concat(' ', normalize-space(@class), ' '), ' class ')]
    Elements of class "class"
descendant-or-self::div[contains(concat(' ', normalize-space(@class), ' '), ' class ')]
    <div> of class "class"
descendant-or-self::*[@id = 'id']
    Element with id "id"
descendant-or-self::div[@id = 'id']
    <div> with id "id"
descendant-or-self::a[@attr]
    <a> with attribute "attr"
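A few of these expressions can be tried against a small document. The following is only a sketch: the sample markup, its id and class names, and the labels in %xpath_for are invented for this illustration. It uses the shorter // form in place of descendant-or-self:: and relies on the findvalue method of HTML::TreeBuilder::XPath, which returns the text of whatever the expression matches.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;

# Sample markup invented for this illustration.
my $html = <<'EOF';
<html>
<body>
<h1>Heading <span>inside h1</span></h1>
<div id="main" class="box big"><a href="/about">About</a></div>
<span>after div</span>
</body>
</html>
EOF

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Hypothetical labels mapped to XPath expressions from the cheatsheet.
my %xpath_for = (
    child_span   => q{//h1/span},
    sibling_span => q{//div/following-sibling::*[name() = 'span' and position() = 1]},
    by_class     => q{//*[contains(concat(' ', normalize-space(@class), ' '), ' big ')]},
    by_id_href   => q{//div[@id = 'main']/a/@href},
);

# Evaluate each expression and keep its text value.
my %value_for;
for my $name (sort keys %xpath_for) {
    $value_for{$name} = $tree->findvalue($xpath_for{$name});
    say "$name: $value_for{$name}";
}

# Free the memory held by the parsed tree.
$tree->delete;
```

The last expression selects an attribute rather than an element, which is a common way to pull link targets out of a page.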
In the following example we extract the title of the page.
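use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;

my $html = <<'EOF';
<html>
  <head>
    <title>A sample webpage!</title>
  </head>
  <body>
    <h1>Perltuts.com rocks!</h1>
  </body>
</html>
EOF

# Parse the HTML string into a tree that can be queried with XPath
my $tree = HTML::TreeBuilder::XPath->new;
$tree->ignore_unknown(0);
$tree->parse($html);
$tree->eof;

# Find all <title> elements and print the text of the first one
my @nodes = $tree->findnodes('//title');
say $nodes[0]->as_text;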
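The example above indexes $nodes[0] without checking that anything matched. A more defensive variant might look like the sketch below. The first_text helper and the %text_for hash are hypothetical names introduced here for illustration, not part of HTML::TreeBuilder::XPath, and the shortened sample page deliberately has no h1 so that the failure branch runs.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;

# Shortened sample page without an <h1>, invented for this illustration.
my $html = '<html><head><title>A sample webpage!</title></head><body></body></html>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->ignore_unknown(0);
$tree->parse($html);
$tree->eof;

# Hypothetical helper: text of the first matching node, or undef if none.
sub first_text {
    my ($tree, $xpath) = @_;
    my @nodes = $tree->findnodes($xpath);
    return @nodes ? $nodes[0]->as_text : undef;
}

my %text_for;
for my $xpath ('//title', '//h1') {
    $text_for{$xpath} = first_text($tree, $xpath);
    if (defined $text_for{$xpath}) {
        say "$xpath: $text_for{$xpath}";
    }
    else {
        warn "$xpath: no matching element\n";
    }
}

# Free the memory held by the parsed tree.
$tree->delete;
```

Returning undef for a missing element keeps the decision about how to react (warn, die, or skip) with the caller.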
Exercise
Extract and print the h1 tag content.
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;

my $html = <<'EOF';
<html>
  <head>
    <title>A sample webpage!</title>
  </head>
  <body>
    <h1>Perltuts.com rocks!</h1>
  </body>
</html>
EOF

my $tree = HTML::TreeBuilder::XPath->new;
$tree->ignore_unknown(0);
$tree->parse($html);
$tree->eof;

# Replace ... with your XPath expression
my @nodes = $tree->findnodes(...);
say $nodes[0]->as_text;
CSS selectors scraping
Some developers find CSS selectors easier to understand than XPath. If you are not familiar with CSS selectors, here is a quick cheatsheet:
*           all elements
h1          <h1> element
h1 span     <span> within <h1>
h1, span    <h1> and <span>
h1 > span   <span> with parent <h1>
div + span  <span> immediately preceded by <div>
.class      Elements of class "class"
div.class   <div> of class "class"
#id         Element with id "id"
div#id      <div> with id "id"
a[attr]     <a> with attribute "attr"
Fortunately, HTML::Selector::XPath translates CSS selectors into XPath expressions, so HTML::TreeBuilder::XPath can work with CSS selectors too.
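To see how CSS selectors map to XPath, you can print the translation for a handful of them. This is a minimal sketch: the list of selectors and the %xpath_for hash are chosen here for illustration, and the exact XPath text produced may differ between versions of HTML::Selector::XPath, so no output is shown.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::Selector::XPath 'selector_to_xpath';

# A few selectors taken from the CSS cheatsheet.
my @selectors = ('h1', 'h1 span', 'h1 > span', 'div + span', '.class', 'div#id', 'a[attr]');

# Translate each selector and print it next to the resulting XPath.
my %xpath_for;
for my $selector (@selectors) {
    $xpath_for{$selector} = selector_to_xpath($selector);
    say "$selector => $xpath_for{$selector}";
}
```

Here selector_to_xpath is imported by name, so it can be called without the HTML::Selector::XPath:: prefix.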
In the following example we extract the h1 heading of the page by using a CSS selector.
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath;

my $html = <<'EOF';
<html>
  <head>
    <title>A sample webpage!</title>
  </head>
  <body>
    <h1>Perltuts.com rocks!</h1>
  </body>
</html>
EOF

my $tree = HTML::TreeBuilder::XPath->new;
$tree->ignore_unknown(0);
$tree->parse($html);
$tree->eof;

# Translate the CSS selector into an XPath expression, then query the tree
my $xpath = HTML::Selector::XPath::selector_to_xpath('h1');
my @nodes = $tree->findnodes($xpath);
say $nodes[0]->as_text;
Exercise
Put everything together (including fetching a page), then extract and print the h1 tag content by using a CSS selector.
use LWP::UserAgent;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath;

my $ua = LWP::UserAgent->new;

my $response = ...
my $html = ...
my $tree = ...

my $xpath = HTML::Selector::XPath::selector_to_xpath(...);
my @nodes = $tree->findnodes($xpath);
say $nodes[0]->as_text;