Parsing HTML Documents Using PHP Native Extensions

While there are many third-party libraries for parsing and processing HTML documents, they are overkill when you just need to write a simple single-file script. So the question is: is it possible to parse HTML using only PHP's native built-in extensions? Yes, through DOM and DOMXPath. However, it will take a while to become familiar with both APIs.

Below is sample code that downloads the reddit main page, parses it, and prints the text and URL of every link with the class "title". Note that when parsing an HTML5 document you will encounter warnings such as "Tag time invalid in Entity", because the DOM extension's underlying parser (libxml2) follows HTML 4 and does not recognize the newer HTML5 tags. Just use the @ operator to suppress the warnings.

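<?php
// Download the page; file_get_contents() returns false on failure.
$html = file_get_contents("http://reddit.com");
if ($html === false) {
    exit("Could not download the page\n");
}

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ hides the warnings about unknown HTML5 tags

// Select every <a> element whose class attribute is exactly "title".
$finder = new DOMXPath($dom);
$links  = $finder->query('//a[@class="title"]');
foreach ( $links as $link )
{
    echo $link->nodeValue . "\n";
    echo $link->getAttribute("href") . "\n\n";
}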
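
If you would rather not silence everything with the @ operator, one possible alternative is to let libxml collect its warnings internally through libxml_use_internal_errors(). The sketch below is only an illustration under a few assumptions: the inline $html markup, its link targets, and the $found and $warnings variable names are all made up for this example and stand in for a real downloaded page. Whether any warnings are actually collected depends on the libxml2 version PHP was built against.

```php
<?php
// Hypothetical inline markup standing in for a downloaded page.
$html = <<<HTML
<!DOCTYPE html>
<html>
<body>
  <article>
    <a class="title" href="/first">First post</a>
    <time datetime="2013-01-01">Jan 1</time>
    <a class="other" href="/skip">Not a title</a>
    <a class="title" href="/second">Second post</a>
  </article>
</body>
</html>
HTML;

// Ask libxml to buffer its errors instead of raising PHP warnings.
// The call returns the previous setting so it can be restored later.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Keep the buffered messages for inspection, then clear the buffer
// and restore the previous error-handling mode.
$warnings = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Same XPath query as in the script above.
$finder = new DOMXPath($dom);
$found  = [];
foreach ($finder->query('//a[@class="title"]') as $link) {
    $found[$link->getAttribute('href')] = trim($link->nodeValue);
}

foreach ($found as $href => $text) {
    echo $text . "\n" . $href . "\n\n";
}

echo count($warnings) . " parser warning(s) collected\n";
```

Compared with @, this keeps unrelated errors visible and still lets you look at what the parser complained about: each entry in $warnings is a LibXMLError object with message and line properties.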
