Extract Hyperlinks Using Python and PHP

It is always great fun to rewrite a piece of code from one programming language to another. I was looking at this short snippet of Python code by unconscionable, which requests a page and dumps the links to certain comment threads.

The original Python 2 code is reproduced here.
import urllib2, re
headers = {'User-agent': 'I promise I\'m not doing this a lot',}
req = urllib2.Request("http://www.reddit.com/r/BuyItForLife/search?q=headphones&restrict_sr=on", None, headers)
website = urllib2.urlopen(req)

html = website.read()

links = re.findall('"((http|ftp)s?://.*?)"', html)
for i in links:
    if 'http://www.reddit.com/r/BuyItForLife/comments/' in i[0]:
        print i[0]
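
The snippet above relies on urllib2 and the print statement, so it only runs on Python 2. Below is a rough Python 3 sketch of the same idea, not the original author's code. The function names fetch_html and comment_links and the two constants are my own, and decoding the page as UTF-8 is an assumption.

```python
import re
import urllib.request

SEARCH_URL = "http://www.reddit.com/r/BuyItForLife/search?q=headphones&restrict_sr=on"
COMMENTS_PREFIX = "http://www.reddit.com/r/BuyItForLife/comments/"

# Same pattern as the original: group 1 is the full URL, group 2 the scheme name.
LINK_RE = re.compile(r'"((http|ftp)s?://.*?)"')


def fetch_html(url=SEARCH_URL):
    """Download a page, sending the same custom User-agent header."""
    headers = {"User-agent": "I promise I'm not doing this a lot"}
    req = urllib.request.Request(url, None, headers)
    with urllib.request.urlopen(req) as website:
        # Assumption: the page is UTF-8; undecodable bytes are replaced.
        return website.read().decode("utf-8", errors="replace")


def comment_links(html, prefix=COMMENTS_PREFIX):
    """Return every double-quoted URL in html that contains prefix."""
    return [match[0] for match in LINK_RE.findall(html) if prefix in match[0]]


# Offline demonstration on a made-up fragment of HTML.
sample = (
    '<a href="http://www.reddit.com/r/BuyItForLife/comments/abc/x/">thread</a> '
    '<a href="https://example.com/">elsewhere</a>'
)
for link in comment_links(sample):
    print(link)
```

To run it against the live page, pass the download straight in: comment_links(fetch_html()).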

My rewrite in PHP uses file_get_contents and stream_context_create, both of which were new to me.
<?php

// Send a custom User-agent header with the request.
$options['http']['header'] = "User-agent: I promise I'm not doing this a lot\r\n";
$context = stream_context_create($options);

$url  = "http://www.reddit.com/r/BuyItForLife/search?q=headphones&restrict_sr=on";
// The second argument is use_include_path, which is irrelevant for a URL.
$html = file_get_contents($url, false, $context);
// Capture every quoted http(s) or ftp(s) URL; $links[1] holds the full URLs.
preg_match_all('/"((http|ftp)s?:\/\/.*?)"/i', $html, $links);
foreach ( $links[1] as $link )
{
    if ( strpos($link, 'http://www.reddit.com/r/BuyItForLife/comments') !== false )
        echo $link, "\n";
}

Comparison of the two snippets:
  1. The regex is simpler and more readable in Python. You don't need to escape characters such as the forward slash (/), which PHP requires here because / is also the pattern delimiter. The API is simpler and makes more sense too: re.findall returns the result, whereas preg_match_all writes it into a by-reference argument that ends up holding one array per capture group.
  2. file_get_contents() is awesome, and dangerous as well, because the same call reads both local and remote files. I did not find a direct equivalent in Python.
  3. Testing whether a string contains another is far more readable in Python, where the in operator replaces a string-search function call.
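
To make points 1 to 3 concrete, here is a small Python 3 sketch. The sample HTML is made up, and get_contents is a hypothetical helper of my own: it only approximates file_get_contents() by turning a local path into a file:// URL for urllib.request.urlopen, and it is not a standard library function.

```python
import re
import urllib.request
from pathlib import Path

html = '<a href="https://example.com/a">A</a> <a href="ftp://example.com/b">B</a>'

# Point 1: with two capturing groups, findall returns one tuple per match.
pairs = re.findall(r'"((http|ftp)s?://.*?)"', html)
print(pairs)  # [('https://example.com/a', 'http'), ('ftp://example.com/b', 'ftp')]

# A non-capturing scheme group (?:...) leaves a single group, so the result
# is a flat list of strings and no [0] indexing is needed.
urls = re.findall(r'"((?:http|ftp)s?://.*?)"', html)
print(urls)  # ['https://example.com/a', 'ftp://example.com/b']

# Point 3: substring tests use the in operator.
wanted = [u for u in urls if "example.com/a" in u]


def get_contents(source):
    """Hypothetical helper: return the bytes behind a URL or a local path."""
    if "://" not in source:
        # Treat anything without a scheme as a local file.
        source = Path(source).resolve().as_uri()
    with urllib.request.urlopen(source) as handle:
        return handle.read()
```

Like the PHP function, this helper will happily open whatever it is handed, so the same caution applies if the argument ever comes from user input.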
