Below is sample code that downloads, parses, and prints all title links from the reddit main page. Note that when parsing an HTML5 document, you will encounter "Tag time invalid in Entity" warnings, because DOM defaults to the HTML4 Transitional DTD, which does not contain the newer HTML tags. Just use the @ operator to suppress the warnings.
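<?php
// Download the page; file_get_contents() returns false on failure.
$html = file_get_contents("http://reddit.com");
if ( $html === false )
{
    exit("Could not download the page\n");
}

$dom = new DOMDocument();
// Suppress the warnings raised for HTML5 tags the parser does not know.
@$dom->loadHTML($html);

// Select every <a> element whose class attribute is exactly "title".
$finder = new DOMXPath($dom);
$links = $finder->query('//a[@class="title"]');
foreach ( $links as $link )
{
    echo $link->nodeValue . "\n";
    echo $link->getAttribute("href") . "\n\n";
}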
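
The @ operator hides every warning from that call, including ones you might want to see. As a possible alternative, the sketch below uses libxml_use_internal_errors() to collect parser errors instead of printing them. The inline $html string and the $titles array are assumptions made for illustration, standing in for the downloaded page so the example runs without network access. Whether the parser reports anything for the time tag depends on the libxml2 version installed.

```php
<?php
// Hypothetical HTML5 snippet standing in for the downloaded page.
$html = '<html><body><time>now</time>'
      . '<a class="title" href="/r/php">PHP news</a></body></html>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Read whatever the parser reported, then clear the buffer and
// restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ( $errors as $error )
{
    // Each entry is a LibXMLError object with line and message fields.
    echo "parser notice (line {$error->line}): " . trim($error->message) . "\n";
}

// Same XPath query as in the sample above.
$finder = new DOMXPath($dom);
$titles = array();
foreach ( $finder->query('//a[@class="title"]') as $link )
{
    $titles[$link->getAttribute("href")] = $link->nodeValue;
}
print_r($titles);
```

This keeps the parser messages available for logging or debugging while still preventing them from cluttering the output as warnings.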