Easy web scraping with PHP
Web scraping is a technique of web development where you load a web page and "scrape" the data off the page to be used elsewhere. It's not pretty, but sometimes scraping is the only way to access data or content from a web site that doesn't provide RSS or an open API.
I'm not going to discuss the legal aspects of scraping here. In some situations it may be considered copyright infringement, but there are also perfectly legal reasons to scrape, such as when you have permission.
To make things really easy, we're going to let the power of regular expressions do all the work for us. If you're not familiar with regular expressions, you may want to google for a tutorial. Here is the documentation for PHP regular expression syntax.
First, we start off by loading the HTML using file_get_contents. Next, we use preg_match_all with a regular expression to turn the data on the page into a PHP array.
This example will demonstrate scraping this web site's blog page to extract the most recent blog posts. This is just for demo purposes - of course, the RSS feed is much better suited for this.
// get the HTML
$html = file_get_contents("http://www.thefutureoftheweb.com/blog/");
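One thing worth knowing about this step is that file_get_contents returns false when the page can't be loaded. Here is a minimal sketch of a wrapper that makes the failure explicit. The helper name fetchHtml is made up for this example and is not part of PHP or of the original code.

```php
<?php
// Hypothetical helper (the name is an assumption for illustration):
// load a page and return null instead of false when loading fails.
function fetchHtml(string $url): ?string
{
    // file_get_contents returns false on failure and raises a warning;
    // the @ operator suppresses the warning so the caller can decide what to do.
    $html = @file_get_contents($url);

    return $html === false ? null : $html;
}
```

Calling fetchHtml("http://www.thefutureoftheweb.com/blog/") would then give you either the HTML string or null, which you can check before running any regular expressions.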
Here is what the HTML looks like for the blog posts:
<ul id="main">
    <li>
        <h1><a href="[link]">[title]</a></h1>
        <span class="date">[date]</span>
        <div class="section">
            [content]
        </div>
    </li>
</ul>
So we will use a regular expression that looks for all the li elements and captures the content using parentheses at the appropriate places (link, title, date and content).
preg_match_all(
    '/<li>.*?<h1><a href="(.*?)">(.*?)<\/a><\/h1>.*?<span class="date">(.*?)<\/span>.*?<div class="section">(.*?)<\/div>.*?<\/li>/s',
    $html,
    $posts, // will contain the blog posts
    PREG_SET_ORDER // formats data into an array of posts
);

foreach ($posts as $post) {
    $link = $post[1];
    $title = $post[2];
    $date = $post[3];
    $content = $post[4];

    // do something with data
}
There's a lot going on inside that regular expression, but there are really only a few "tricks" that are used. Any time I want to say "skip over whatever is between", I use .*? (a lazy match, which stops at the first place where the rest of the pattern can continue). Any time I want to say "capture whatever is in here", I use (.*?) with parentheses around it. And lastly, the s at the end tells PHP to allow the dot (.) to match newlines. That's about all there is to it.
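To see what those pieces do on their own, here is a small sketch comparing the lazy .*? with the greedy .* and showing the effect of leaving off the s modifier. The sample string and variable names are made up for the demonstration.

```php
<?php
// Made-up sample: two <b> elements, the second one spanning two lines.
$html = "<b>one</b>\n<b>two\nlines</b>";

// Lazy (.*?) with the s modifier: each <b>...</b> pair is matched separately.
preg_match_all('/<b>(.*?)<\/b>/s', $html, $lazy);
// $lazy[1] is ["one", "two\nlines"]

// Greedy (.*) runs on to the LAST </b>, so both elements collapse into one match.
preg_match_all('/<b>(.*)<\/b>/s', $html, $greedy);
// $greedy[1] is ["one</b>\n<b>two\nlines"]

// Without the s modifier the dot stops at newlines, so the second element is missed.
preg_match_all('/<b>(.*?)<\/b>/', $html, $noDotAll);
// $noDotAll[1] is ["one"]
```

This is why the scraping pattern uses the lazy form everywhere: a greedy .* would happily swallow several blog posts in one go.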
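The regular expression will only pick up blog posts, because they are the only <li> elements that contain an <h1>, <span class="date"> and <div class="section">. Since .*? skips over anything, a match may begin at an earlier, unrelated <li>, but the captured values still come from the first <h1> link, date and section that follow it.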
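As a quick check, here is a sketch that runs the same pattern over a small page containing a navigation list and two posts. The sample markup, links, titles and dates are invented for illustration and are not taken from the real blog.

```php
<?php
// Invented sample page: a navigation list followed by two blog posts.
$html = <<<HTML
<ul id="nav">
  <li><a href="/about/">About</a></li>
</ul>
<ul id="main">
  <li>
    <h1><a href="/blog/1">First post</a></h1>
    <span class="date">Jan 1</span>
    <div class="section">
      Hello
    </div>
  </li>
  <li>
    <h1><a href="/blog/2">Second post</a></h1>
    <span class="date">Jan 2</span>
    <div class="section">
      World
    </div>
  </li>
</ul>
HTML;

// Same pattern as in the article.
$pattern = '/<li>.*?<h1><a href="(.*?)">(.*?)<\/a><\/h1>.*?<span class="date">(.*?)<\/span>.*?<div class="section">(.*?)<\/div>.*?<\/li>/s';

preg_match_all($pattern, $html, $posts, PREG_SET_ORDER);

// Collect the captured fields; trim() removes the whitespace around the content.
$found = [];
foreach ($posts as $post) {
    $found[] = [
        'link'    => $post[1],
        'title'   => $post[2],
        'date'    => $post[3],
        'content' => trim($post[4]),
    ];
}
// $found holds two entries, one per post; the navigation link is not among them.
```

Note that the full match for the first post ($posts[0][0]) actually starts at the navigation <li>, since .*? is free to skip ahead to the first <h1>. Only the captured groups are reliable here.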
Web scraping is highly unreliable: if the HTML structure were to change, this code would break instantly. However, it's often quite easy to write this kind of code, and it usually produces a perfectly usable hack solution.
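When the markup does change, the usual symptom is that preg_match_all quietly finds nothing. One way to notice is to check its return value, which is the number of matches, or false on error. The sketch below wraps the pattern in a helper that throws in both cases. The function name extractPosts and the error messages are assumptions made up for this example.

```php
<?php
// Hypothetical helper (name and messages are assumptions for illustration):
// return the matched posts, or throw if the page no longer fits the pattern.
function extractPosts(string $html): array
{
    $pattern = '/<li>.*?<h1><a href="(.*?)">(.*?)<\/a><\/h1>.*?<span class="date">(.*?)<\/span>.*?<div class="section">(.*?)<\/div>.*?<\/li>/s';

    // preg_match_all returns the number of full matches, or false on error.
    $count = preg_match_all($pattern, $html, $posts, PREG_SET_ORDER);

    if ($count === false) {
        throw new RuntimeException('Regex error, code ' . preg_last_error());
    }
    if ($count === 0) {
        throw new RuntimeException('No posts found; the HTML structure may have changed.');
    }

    return $posts;
}
```

It's still a hack, but at least it fails loudly instead of handing an empty array to the rest of your code.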