Makebeta

Web Developer Tools and Tutorials

Scraping Links With PHP

August 11th, 2007 · 93 Comments

In this tutorial you will learn how to build a PHP script that scrapes links from any web page.

What You’ll Learn

  1. How to use cURL to get the content from a website (URL).
  2. Call PHP DOM functions to parse the HTML so you can extract links.
  3. Use XPath to grab links from specific parts of a page.
  4. Store the scraped links in a MySQL database.
  5. Put it all together into a link scraper.
  6. What else you could use a scraper for.
  7. Legal issues associated with scraping content.

What You Will Need

  • Basic knowledge of PHP and MySQL.
  • A web server running PHP 5.
  • The cURL extension for PHP.
  • MySQL - if you want to store the links.

Get The Page Content

cURL is a great tool for making requests to remote servers in PHP. It can imitate a browser in pretty much every way. Here’s the code to grab our target site content:

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if ($html === false) {
	echo "<br />cURL error number:" .curl_errno($ch);
	echo "<br />cURL error:" . curl_error($ch);
	exit;
}

If the request is successful $html will be filled with the content of $target_url. If the call fails then we’ll see an error message about the failure.

curl_setopt($ch, CURLOPT_URL,$target_url);

This line determines which URL will be requested. For example, if you wanted to scrape this site you’d set $target_url = "http://www.merchantos.com/makebeta/". I won’t go into the rest of the options that are set (except for CURLOPT_USERAGENT - see below). You can read an in-depth tutorial on PHP and cURL here.
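
The same request can also be set up with curl_setopt_array(), which applies all the options in one call. Here is a minimal sketch of that approach. The function names buildCurlOptions() and fetchPage() are made up for this example and are not part of cURL; the options themselves are the ones used above.

```php
<?php
// Hypothetical helper (not part of cURL): gathers the options used in this
// tutorial into one array so they can be applied with curl_setopt_array().
function buildCurlOptions($targetUrl, $userAgent, $timeout = 10) {
    return array(
        CURLOPT_URL            => $targetUrl,
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_FAILONERROR    => true,  // treat HTTP status codes >= 400 as errors
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_AUTOREFERER    => true,  // set the Referer header when redirected
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_TIMEOUT        => $timeout,
    );
}

// Hypothetical wrapper: returns the page body, or throws if the request fails.
function fetchPage($targetUrl, $userAgent) {
    $ch = curl_init();
    curl_setopt_array($ch, buildCurlOptions($targetUrl, $userAgent));
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException(
            'cURL error ' . curl_errno($ch) . ': ' . curl_error($ch)
        );
    }
    return $html;
}
```

Calling fetchPage($target_url, $userAgent) inside a try/catch block should give you the same $html string as the step-by-step code.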

Tip: Fake Your User Agent

Many websites won’t play nice with you if you come knocking with the wrong User Agent string. What’s a User Agent string? It’s part of every request to a web server that tells it what type of agent (browser, spider, etc) is requesting the content. Some websites will give you different content depending on the user agent, so you might want to experiment. You do this in cURL with a call to curl_setopt() with CURLOPT_USERAGENT as the option:

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);

This would set cURL’s user agent to mimic Google’s. You can find a comprehensive list of user agents here: User Agents.

Common User Agents

I’ve done a bit of the leg work for you and gathered the most common user agents:

Search Engine User Agents

  • Google - Googlebot/2.1 (+http://www.googlebot.com/bot.html)
  • Google Image - Googlebot-Image/1.0 (+http://www.googlebot.com/bot.html)
  • MSN Live - msnbot-Products/1.0 (+http://search.msn.com/msnbot.htm)
  • Yahoo - Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
  • ask

Browser User Agents

  • Firefox (WindowsXP) - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
  • IE 7 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)
  • IE 6 - Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)
  • Safari - Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/522.11 (KHTML, like Gecko) Safari/3.0.2
  • Opera - Opera/9.00 (Windows NT 5.1; U; en)

Using PHP’s DOM Functions To Parse The HTML

PHP provides a really cool tool for working with HTML content: DOM Functions. The DOM Functions allow you to parse HTML (or XML) into an object structure (or DOM - Document Object Model). Let’s see how we do it:

$dom = new DOMDocument();
@$dom->loadHTML($html);

Wow, is it really that easy? Yes! Now we have a nice DOMDocument object that we can use to access everything within the HTML in a nice clean way. I discovered this over at Russell Beattie’s post on: Using PHP To Scrape Sites As Feeds, thanks Russell!

Tip: You may have noticed I put @ in front of loadHTML(). This suppresses some annoying warnings that the HTML parser throws on the many pages whose markup is not standards compliant.
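
If you would rather not use @, PHP also offers libxml_use_internal_errors(), which collects the parser warnings instead of printing them. Here is a minimal sketch of that alternative. The function name parseHtml() is made up for this example.

```php
<?php
// Hypothetical helper: parses HTML without the @ operator and hands back
// both the document and whatever warnings the parser collected.
function parseHtml($html) {
    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $warnings = libxml_get_errors();   // array of LibXMLError objects
    libxml_clear_errors();
    libxml_use_internal_errors($previous);   // restore the earlier setting

    return array($dom, $warnings);
}

list($dom, $warnings) = parseHtml(
    '<html><body><p>Hi <a href="/x">link</a></body></html>'
);
echo $dom->getElementsByTagName('a')->length . " link(s) found\n";
```

The advantage over @ is that the warnings are still available in $warnings if you ever want to log them.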

XPath Makes Getting The Links You Want Easy

Now for the real magic of the DOM: XPath! XPath allows you to gather collections of DOM nodes (otherwise known as tags in HTML). Say you want to only get links that are within unordered lists. All you have to do is write a query like “/html/body//ul//li//a” and pass it to the evaluate() function of a DOMXPath object. I’m not going to go into all the ways you can use XPath because I’m just learning myself and someone else has already made a great list of examples: XPath Examples. Here’s a code snippet that will just get every link on the page using XPath:

$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
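
To give a feel for narrower queries, here is a small sketch that runs three XPath expressions against a made-up HTML string (the markup and variable names are invented for this example): links inside unordered lists, <a> tags that actually carry an href, and links whose href starts with "http".

```php
<?php
// Sample markup invented for this example.
$html = '<html><body>'
      . '<ul><li><a href="/one">One</a></li></ul>'
      . '<p><a href="http://example.com/two">Two</a> <a name="top">anchor</a></p>'
      . '</body></html>';

libxml_use_internal_errors(true);   // keep parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Only links that sit inside an unordered list.
$listLinks = $xpath->query('/html/body//ul//li//a');

// Only <a> tags that have an href attribute (skips named anchors).
$withHref = $xpath->query('//a[@href]');

// Only links whose href starts with "http" (absolute URLs).
$absolute = $xpath->query('//a[starts-with(@href, "http")]');

echo $listLinks->length . ' in lists, '
   . $withHref->length . ' with href, '
   . $absolute->length . " absolute\n";
```

Each call returns a DOMNodeList, so the results can be looped over in the same way as any other query result.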

Next we’ll iterate through all the links we’ve gathered using XPath and store them in a database. First the code to iterate through the links:

for ($i = 0; $i < $hrefs->length; $i++) {
	$href = $hrefs->item($i);
	$url = $href->getAttribute('href');
	storeLink($url,$target_url);
}

$hrefs is an object of type DOMNodeList and item() is a function that returns a DOMNode object for the specified index. The index can be between 0 and $hrefs->length - 1. So we’ve got a loop that retrieves each link as a DOMNode object.

$url = $href->getAttribute('href');

The nodes matched by our query are <a> elements, so each one is a DOMElement (a subclass of DOMNode), and DOMElement provides the getAttribute() function. getAttribute() returns the value of any attribute of the element (in this case the href attribute of an <a> tag). Now we’ve got our URL and we can store it in the database.
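
A DOMNodeList can also be walked with foreach, and each element exposes its link text through textContent. Here is a minimal sketch that collects both the URL and the text of every link. The function name extractLinks() is made up for this example.

```php
<?php
// Hypothetical helper: returns every link on the page as a url/text pair.
function extractLinks($html) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html);   // @ hides parser warnings on sloppy markup
    $xpath = new DOMXPath($dom);

    $links = array();
    foreach ($xpath->evaluate('/html/body//a') as $a) {
        $url = trim($a->getAttribute('href'));
        if ($url === '') {
            continue;   // skip named anchors and <a> tags without an href
        }
        $links[] = array('url' => $url, 'text' => trim($a->textContent));
    }
    return $links;
}

$sample = '<html><body><a href="/about">About us</a><a name="top"></a>'
        . '<a href="http://example.com/">Example</a></body></html>';
foreach (extractLinks($sample) as $link) {
    echo $link['text'] . ' -> ' . $link['url'] . "\n";
}
```

Skipping empty href values keeps named anchors out of the results.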

We’ll want a database table that looks something like this:

CREATE TABLE `links` (
`url` TEXT NOT NULL ,
`gathered_from` TEXT NOT NULL ,
`time_stamp` TIMESTAMP NOT NULL
);

We’ll need a storeLink() function to put the links in the database. I’ll assume you know the basics of how to connect to a database (if not, grab a MySQL & PHP tutorial here).

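function storeLink($url,$gathered_from) {
	// escape both values so a quote in a scraped URL can't break the query
	$url = mysql_real_escape_string($url);
	$gathered_from = mysql_real_escape_string($gathered_from);
	$query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
	mysql_query($query) or die('Error, insert query failed');
}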
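
The mysql_* functions were removed in PHP 7, so on a newer server you would need a different database API. Here is a minimal sketch of the same insert using PDO with a prepared statement. The function name storeLinkPdo() and the connection details in the comments are placeholders for this example, not something this tutorial's script defines.

```php
<?php
// Hypothetical variant of storeLink() that takes a PDO connection and uses a
// prepared statement, so the values are never pasted into the SQL string.
function storeLinkPdo(PDO $pdo, $url, $gatheredFrom) {
    $stmt = $pdo->prepare(
        'INSERT INTO links (url, gathered_from) VALUES (:url, :gathered_from)'
    );
    $stmt->execute(array(
        ':url'           => $url,
        ':gathered_from' => $gatheredFrom,
    ));
}

// Example connection (host, database name, user and password are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=scraper', 'db_user', 'db_password');
// $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// storeLinkPdo($pdo, 'http://example.com/page', 'http://example.com/');
```

Because the values are bound as parameters, a scraped URL containing a quote character is stored as-is instead of breaking the query.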

Your Completed Link Scraper

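function storeLink($url,$gathered_from) {
	// escape both values so a quote in a scraped URL can't break the query
	$url = mysql_real_escape_string($url);
	$gathered_from = mysql_real_escape_string($gathered_from);
	$query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
	mysql_query($query) or die('Error, insert query failed');
}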

$target_url = "http://www.merchantos.com/";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if ($html === false) {
	echo "<br />cURL error number:" .curl_errno($ch);
	echo "<br />cURL error:" . curl_error($ch);
	exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
	$href = $hrefs->item($i);
	$url = $href->getAttribute('href');
	storeLink($url,$target_url);
	echo "<br />Link stored: $url";
}

What Else Could I Do With This Thing?

The possibilities are limitless. For starters you might want to store a list of sites that you want scraped in a database and then set up the script so it runs on a regular basis to scrape those sites. You could then compare the link structure over time or maybe republish the links in some sort of directory. Leave a comment below and say what you’re using this script for.

Is Scraping Legal?

There is no easy answer to this question. Many organizations scrape content from all over the web - Google, Yahoo, Microsoft, and many others. These companies get away with it under fair use and because site owners want to be included in the search results. However, there have been copyright infringement rulings against these companies.

The real answer is that it depends who you scrape and what you do with the content. Basic copyright law gives authors an automatic copyright on everything they create. But the same law permits fair use of copyrighted material. Fair use includes: criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research. But even these uses could be considered copyright infringement in some circumstances. So be careful before you claim “fair use” as your defense!

Here are a couple of sites that have granted you the right to use their content. They do require you to attribute the content to the author or the URL you scraped it from:

  • Wikipedia - GNU Free Documentation License
  • Open Directory Project - Open Directory License
  • Creative Commons - Creative Commons Attribution 3.0

    Many sites publish their content under some form of the Creative Commons license. You can search for creative commons licensed works here: Creative Commons Search. Remember that it’s your responsibility to verify the copyright rules for anything you use, even stuff found using the Creative Commons Search.

93 responses so far ↓

  • 1 developercast.com » MakeBeta Blog: Scraping Links With PHP // Aug 15, 2007 at 10:30 am

    [...] Justin Laing over at Merchant OS there’s a new tutorial on creating a simple link scraper with the help of PHP and the cURL extension. In this tutorial [...]

  • 2 Ian Van Ness // Aug 16, 2007 at 3:32 am

    Instead of cURL (or by using cURL to fetch), the tidy plugin for PHP is great for scraping data from sites (links, images, etc). With the PHP5 version, it has a great OO interface too!

  • 3 justin // Aug 16, 2007 at 9:46 am

    Ian,
    Thanks for the tip, I’ll look into tidy!
    -justin

  • 4 Matthew Hancock // Aug 17, 2007 at 11:06 am

    I can’t seem to get the dom/xpath object to work… I am grabbing the full html correctly, but I never get any values once I start parsing. I am running php 4, is this a php 5 function?

  • 5 justin // Aug 17, 2007 at 11:13 am

    Matthew,
    Yup, you’re right it’s only in PHP5. You can do pretty much all the same stuff using DOM XML in PHP 4: http://us2.php.net/manual/en/ref.domxml.php

  • 6 Fredrik Komstadius // Aug 20, 2007 at 5:54 am

    Been wondering how to do this in a quick and nice way for a long time. This solves my problems :) Excellent work!

  • 7 dave // Aug 20, 2007 at 10:41 pm

    I’m having a little trouble with this; I’ve got php5 installed, but keep getting this error:

    Call to undefined method DOMXPath::evaluate()

    Any thoughts?

  • 8 dave // Aug 20, 2007 at 10:55 pm

    OK, seems to be a bug with PHP versions 5.0.3/5.0.4 as I’m not having any trouble on another server. Oh well; problem solved. Confirmed to work on v5.2.2

  • 9 dave // Aug 21, 2007 at 12:38 pm

    (hoping this box handles php code well…)
    OK, so my need was related to parsing a page for images, so I thought I’d pass along my code just for others; it’s pretty basic:

    $imgs = $xpath->evaluate("/html/body//img");
    for ($i = 0; $i < $imgs->length; $i++) {
        // get image attributes
        $img = $imgs->item($i);
        $imgsrc = $img->getAttribute("src");
        $imgwidth = $img->getAttribute("width");
        $imgheight = $img->getAttribute("height");
        $imgalt = $img->getAttribute("alt");
    }

  • 10 jim // Aug 22, 2007 at 2:04 pm

    can anyone pls tell me how can i do this with php4.. my lots of problems will be solved..

  • 11 justin // Aug 24, 2007 at 4:29 pm

    Jim,
    Other than recommending DOM XML for PHP 4, you could try just using regular expressions to parse the content you grab with cURL.
    -Justin

  • 12 Stefan // Aug 29, 2007 at 12:23 am

    Great. Was just looking to a PHP equivalent of HtmlAgilityPack.HtmlDocument which I use for my .NET code. I try to, well rather need to (because of it’s complexity…), stay away from regular expressions…

  • 13 Creating a php scraper - WickedFire - Affiliate Marketing Forum - Internet Marketing Webmaster SEO Forum // Aug 31, 2007 at 6:48 pm

    [...] of PHP scrapers. I have a tutorial on building a link scraper, maybe it could help some of you. Scraping Links With PHP Anyone know of any other good [...]

  • 14 Scraping Links With PHP | Best Web Design Resources. // Sep 2, 2007 at 6:09 am

    [...] How to use cURL to get the content from a website (URL). [...]

  • 17 jim // Sep 6, 2007 at 3:27 am

    Hi guys,
    check this function to grab all the links after getting it from curl functions..thanks justin for you hint
    use the below function like get_links($html) where $html is from the above code

    function get_links($s,$url=''){

        if($url) {
            $p = parse_url($url);
            if($p["port"]) {
                $port = ":$p[port]";
            } else {
                $port = '';
            }
        }

        $copy = $s; // so we can return links and titles in their proper case
        $s = strtolower($s); // or else the strstr and strpos searches are case sensitive…
        $pos_start=strpos($s,”]*)”?[^>]*>(.*)?’,$array[$i],$r);

            if($url) {
                if(!eregi("^mailto",$r[1])) {

                    if(eregi("^(f|ht)tp",$r[1])) {
                        /* full url */
                        $this_url = $r[1];
                    } elseif(eregi("^/",$r[1])) {
                        /* absolute path, but not full url */
                        $this_url = $p["scheme"] . "://" . $p["host"] . $port . $r[1];
                    } else {
                        if($p["path"] == "/" || $p["path"] == '') {
                            /* relative link, but no url path */
                            $this_url = $p["scheme"] . "://" . $p["host"] . $port . "/" . $r[1];
                        } else {
                            /* relative link, with url path */
                            if(ereg("/$",$p["path"])) {
                                /* and the path ends in '/', so not a file */
                                $this_url = $p["scheme"] . "://" . $p["host"] . $port . $p["path"] . $r[1];
                            } else {
                                /* and the path doesn't end in '/', so
                                probably a file (but it *could* be
                                a directory, we can't really know) */
                                $remove = strrchr($p["path"],"/");
                                $path = ereg_replace("$remove","/",$p["path"]);
                                $this_url = $p["scheme"] . "://" . $p["host"] . $port . "$path" . $r[1];

                            }
                        }

                    }

                    $links[] = array($array[$i],$this_url,$r[2]);

                }

            } else {

                $links[] = array($array[$i],$r[1],$r[2]);

            }

        }
        for($z=0;$z

  • 18 jim // Sep 6, 2007 at 3:28 am

    this is the remaining part as it wasnt coming in one

    for($z=0;$z

  • 19 Tim Henderson // Sep 14, 2007 at 9:48 am

    php 5.2.4, everything works EXCEPT $url is empty.

  • 20 Tim Henderson // Sep 14, 2007 at 9:58 am

    NEVER MIND! $url was empty because of nonstandard or I should say non-Windows single quotes around ‘href’ in the code. all working now

  • 21 W-Shadow // Sep 30, 2007 at 1:40 am

    For those asking for a PHP 4 version - I wrote a function like that a while ago - extracting all links from a page. It uses cUrl and regular expressions and might be a bit more readable than the one in comments above.

  • 22 justin // Sep 30, 2007 at 10:01 am

    Thanks W-Shadow, the relative to absolute translator function looks very helpful also!

  • 23 jehove // Oct 2, 2007 at 4:27 am

    Thanks, this is great… anyways for some advance scraping. have you any idea on how to scrape data from pages with .aspx or .cfm.. im having a hardtime with these datas…

  • 24 K-City // Oct 3, 2007 at 11:58 am

    Does this only do it only for page or will it work its way through a whole site? If not does anyone have a solution to this problem?

  • 25 justin // Oct 3, 2007 at 12:09 pm

    K-City,
    To do this for a whole site, what you would do is take all the links you gather from a page and make a list of the ones that are internal (pointing to other pages on the same site). Then you would iterate over those links, using them with cURL to grab their content and scrape it for more links. Rinse and repeat. If you’re handy with PHP I’m sure you could figure it out.
    -Justin
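
The whole-site approach described in this reply could be sketched roughly as follows. This is only an illustration under a few assumptions: the function names isInternalLink() and crawlSite() are made up, the links handed back are assumed to be absolute URLs already, and the page fetching is passed in as a callback so the loop itself can be tried out without touching the network.

```php
<?php
// Hypothetical helper: true when $url points at the same host as the site.
function isInternalLink($url, $siteHost) {
    $host = parse_url($url, PHP_URL_HOST);
    return is_string($host) && strcasecmp($host, $siteHost) === 0;
}

// Hypothetical crawler loop. $fetchLinks is any callable that takes a URL and
// returns an array of absolute link URLs found on that page (for example, a
// function built from the cURL + XPath code in the article).
function crawlSite($startUrl, $fetchLinks, $maxPages = 50) {
    $siteHost = parse_url($startUrl, PHP_URL_HOST);
    $queue    = array($startUrl);   // pages still to visit
    $visited  = array();            // pages already scraped (url => true)

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;   // already scraped this page
        }
        $visited[$url] = true;

        // A real crawler should also pause between requests here.
        foreach (call_user_func($fetchLinks, $url) as $link) {
            if (isInternalLink($link, $siteHost) && !isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited);
}
```

The $maxPages limit is there so that a large site, or a loop of pages linking to each other, cannot keep the script running forever.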

  • 26 Adam // Oct 8, 2007 at 6:00 am

    Anyone know how to scrape text from the sites rather than simply links?

  • 27 justin // Oct 8, 2007 at 9:24 am

    Adam,
    That’s even easier to do. You’ve got the code to do it up in the article. $html = curl_exec($ch);. That line gets the html from the page (containing all the text). Also if you want to remove all the html code and just leave plain text you could run it through $text = strip_tags($html);. Good luck.
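
One thing to watch with strip_tags() is that it removes the tags but keeps the text between them, which includes the contents of script and style elements. Here is a minimal sketch of a DOM-based alternative that drops those elements first. The function name pageText() is made up for this example.

```php
<?php
// Hypothetical helper: returns the visible text of a page, without the
// contents of <script> and <style> elements.
function pageText($html) {
    $previous = libxml_use_internal_errors(true);   // keep parser warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // XPath results are a snapshot, so removing nodes while looping is safe.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//script | //style') as $node) {
        $node->parentNode->removeChild($node);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    $text = $body ? $body->textContent : '';

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$sample = '<html><body><p>Hello there</p><script>var x = 1;</script></body></html>';
echo pageText($sample) . "\n";
```

Running strip_tags() on the same $sample would leave the JavaScript source in the output, which is usually not what you want when scraping text.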

  • 28 Vishal // Oct 10, 2007 at 12:20 am

    I am having php4 and i am getting this kind of error : Fatal error: Cannot instantiate non-existent class: domxpath

    can anyone help me out of this problem? i have code something like this:

    function XPath($xml, $namespaces = array())
    {
        $non = "";
        $doc = new DOMDocument('');
        //$doc->loadXML($xml);
        $doc = domxml_open_mem($xml);
        //echo $doc; exit;
        $this->xpath = new DOMXPath($doc);

        if (count($namespaces))
        {
            foreach ($namespaces as $p => $n)
            {
                $this->xpath->registerNamespace($p, $n);
            }
        }
    }

  • 29 Owen // Oct 19, 2007 at 3:29 pm

    Could someone please post the code to add a query box where a user can input the url instead of it being hard coded in the script? Thanks in advance!

  • 30 » Scrapping Links - Learn how? Black Hat Techniques: Black Hat Webmaster Tips // Oct 22, 2007 at 10:29 am

    [...] MakeBeta has released a pretty decent tutorial on scrapping links. It’s a great start if your going to be scraping anything.  It was also released as a guest post over on Elli’s Blog which is where I came upon it.  Hope it helps. [...]

  • 31 justin // Oct 29, 2007 at 7:09 pm

    Vishal,
    XPath isn’t supported in PHP4. See the comments above.

  • 32 Adam // Nov 13, 2007 at 9:03 am

    I wrote a link scraper a year or 2 ago, and I must say it wasn’t as advanced as this one. Very sneaky it is!

  • 33 Eat My Business » Blog Archive » Screen Scraping Tutorials and Info // Nov 15, 2007 at 3:46 pm

    [...] This one covers using php5 and covers PHP’s DOM functions http://www.merchantos.com/makebeta/php/scraping-links-with-php/  [...]

  • 34 Micheal // Nov 17, 2007 at 4:25 pm

    Jim, thanks its very usefull

  • 35 Micheal // Nov 17, 2007 at 4:27 pm

    Jim, thanks its very usefull, but somehow it is not working can you describe in more details.

  • 36 upscaleSEO // Dec 10, 2007 at 6:16 am

    Why was this post removed as guest posting from bluehatseo.com? Yesterday I prepared to use it as a tutorial, today it is gone. It was just good look for me to remember the name of your blog… just curious.

  • 37 justin // Dec 10, 2007 at 6:44 am

    I have no idea. Why don’t you ask bluehat that. I didn’t ask him to take it down or anything.

  • 38 Zeb // Dec 14, 2007 at 6:59 pm

    Let’s say I want to scrape a page that is written in Ruby, and takes time to load (like 10sec), but before it loads, it initially displays some fancy html saying it’s “loading, please wait”…
    then curl grabs the html from that “waiting html” before the entire page is loaded.
    Anyway around this?

  • 39 justin // Dec 14, 2007 at 8:38 pm

    Sounds like that page is using ajax to load the content after the initial page load. So you’ll need to poke around and find the ajax calls and use those in curl. It’s gonna get complicated pretty fast so it’s beyond what I can tell you how to do here. But maybe that’ll give you a start. Good luck.

  • 40 Dilip // Dec 18, 2007 at 12:38 am

    Hello,
    I use script from your site for grap links its nice.
    I want to scrap data through regular expression,
    so how can I do ??
    thanks alot ..

    Dilip

  • 41 r // Dec 25, 2007 at 9:49 am

    Let’s say you want to grab links and image alt tags… how would one go about this?

    create two XPath evaluations?

    $hrefs = $xpath->evaluate("/html/body//a");
    $alts = $xpath->evaluate("/html/body//img");

    and use two “for” loops?

  • 42 Jonathan Harriot // Dec 31, 2007 at 12:09 am

    Thanks Justin, this was very helpful. I am in the process of studying the how-to’s of scraping.

  • 43 Sudhanshu // Dec 31, 2007 at 4:12 am

    Thanks a ton.. :D
    You just saved my day today..

  • 44 syfur // Jan 2, 2008 at 1:06 pm

    This tutorial helped me learning THANKS…….

  • 45 Ryan // Jan 14, 2008 at 5:51 pm

    having problem with cURL getting google news on this website I am new to cURL and not an advanced web designer. Please take a look and see if its just something little. My server doesn’t allow fopen function either. Thank you in advance

  • 46 Manjula // Jan 18, 2008 at 11:30 pm

    These tips & tutorials are very helpful for the php beginners to learn coding.
    Thanks! a lot.

  • 47 Manjula // Jan 18, 2008 at 11:56 pm

    I m trying to write a script in PHP to grab the keywords from tools like, http://www.google.com/trends
    http://hotsearches.aol.com/,……….
    can anyone helpme!………

  • 48 returnUser // Feb 5, 2008 at 11:31 pm

    why need Dom here.
    try : —————
    $href = strip_tags($html, '<a>');
    preg_split("/<a\s*href=\"/", $href);

  • 49 returnUser // Feb 6, 2008 at 3:48 pm

    why need Dom here.
    try
    strip_tags
    and
    preg_split

  • 50 Qamar Abbas // Feb 21, 2008 at 3:30 am

    Nice work regarding data grebbing, nicely appreciated.

  • 51 links for 2008-02-24 « sySolution // Feb 24, 2008 at 8:18 am

    [...] Scraping Links With PHP (tags: scraping php curl) [...]

  • 52 Jambo // Apr 2, 2008 at 10:36 am

    Just a quick one.

    Using the DOMobject and PHP I can get all the links in a page - except - things like Google adwords etc.

    I know these are JavaScript based ones -but- firefox can get them (pageinfo) and ‘debugbar’ in IE can.

    I think these use the DOMObject…

    Any ideas how to show **all** links?

    Been doing my small (but beautifully formed brain in =)

    Jambo.

  • 53 justin // Apr 2, 2008 at 10:42 am

    Those are going to be very hard to get. Here’s what you’d have to do:
    -Request the page
    -Parse it for the javascript link to google adwords
    -Request that javascript file from google adwords
    -(possibly request another file from google adwords that the script tells you how to get)
    -Scrape the links out of the results returned from the google adwords javascript.
    DOM doesn’t use a javascript interpreter. The firefox/IE stuff you’re talking about is totally different; it’s running inside the browser where the javascript is all interpreted etc.

  • 54 justin // Apr 2, 2008 at 10:50 am

    I did a little poking around on a page with adwords. You’ll see stuff like this in the page:

    <iframe width="160" scrolling="no" height="90" frameborder="0" allowtransparency="true" hspace="0" vspace="0" marginheight="0" marginwidth="0" src="http://pagead2.googlesyndication.com/pagead/ads?[bighugequerystring-omitted]" name="google_ads_frame">

    What you need to do is parse all those out and get the urls like:
    http://pagead2.googlesyndication.com/pagead/ads?bighugequerystring-omitted
    request that page from google (you’ll need to get your user agent correct and probably fill in the referrer in the header to be the original page you’re trying to scrape).
    The google adwords links will be in the result returned from that crazy url. You’ll have to parse them out of that as well.

  • 55 turin // May 12, 2008 at 10:29 pm

    hi, if anyone can help. i’ve tested the code received this error:
    Fatal error: Call to undefined function curl_init() in C:\wamp\www\php_sandbox\scrapinglinks.php on line 11

    I’m new to PHP and can’t figure out why this is failing. Pls help!

  • 56 Swingerman // May 13, 2008 at 3:02 am

    I have a weird error:

    If i’m testing my curl code(from this post) on my testing server (localhost) everything works fine, but when i upload to my server and run the script from the browser it just cant connect to the give address(connection timeout).

    So the only thing have changed is the location of the script runs from. My webserver has curl enabled.

    I’ve figured out that the problem exists with this specific url. But it works if i run the script from localhost(testing server).

    What can cause this?

  • 57 justin // May 13, 2008 at 10:09 am

    Can you connect to any address? If it’s just one address you can’t connect to it must be something is wrong with the route from you to that address or that address is specifically blocking you or something in your configuration is specifically blocking that address. Too many possibilities for me to say what is going on for you.

  • 58 Dave // Jun 4, 2008 at 12:55 pm

    How would I also grab the linked text?

  • 59 Justin // Jun 4, 2008 at 3:36 pm

    $href->textContent should contain the text inside the link node.

  • 60 randz // Jun 17, 2008 at 10:29 am

    @turin : you need to enable your curl in php.ini

  • 61 bhavin // Jul 16, 2008 at 1:58 am

    hi
    this is really nice for new people in php like me.
    i have one queston.
    there is one webpage with two textbox.
    i wan to insert some values in this text box using php.
    can you tell me how to do that?

  • 62 avinash // Jul 17, 2008 at 10:49 pm

    can anyone tell me how to get the link name with the url

    ex: link name

  • 63 naden // Jul 20, 2008 at 10:22 am

    Very nice approach. I wrapped it all up into a handy PHP Class witch fetches all links from a given website including _all_ attributes like “href, title, rel” …

    You can download it here: http://www.naden.de/blog/linkfetcher-mit-php-links-extrahieren

    Post is in german, but the source is commented in english. Grab it at the end of the post.

  • 64 Tim // Jul 23, 2008 at 1:44 am

    Very nice, Justin. I’ve put this to work grabbing links off a government page. However, I’m wondering how I can grab the TEXT contained within the anchor tag? This script only seems able to grab the attributes of a tag, not the data that the tag surrounds.

  • 65 Justin // Jul 23, 2008 at 9:02 pm

    I think the text inside the element might be a child node. You’ll have to poke around in the php documentation for DOM:
    http://php.net/dom

  • 66 naden // Jul 24, 2008 at 2:32 am

    @tim, it’s very easy to access the text with the following pice of code:

    foreach( $xpath->evaluate( '/html/body//a' ) as $item ) {
        print( $item->textContent );
    }

  • 67 lp // Aug 12, 2008 at 4:17 pm

    It’s working nicely, except I have seen websites that don’t put links between <a> tags.

    They just show http://www.bla.com on the website and to visit it, you have to copy/paste it yourself.

    Is there a easy fix here?
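
For URLs that appear as plain text rather than inside <a> tags, one option is a regular expression over the page text. Here is a rough sketch. The function name findPlainUrls() and the pattern are assumptions for this example; the pattern only catches addresses that start with http:// or https://, so it is a starting point rather than a complete URL matcher.

```php
<?php
// Hypothetical helper: finds http(s) URLs written out as plain text.
function findPlainUrls($text) {
    // Match http:// or https:// followed by anything that is not whitespace,
    // an angle bracket or a quote.
    preg_match_all('~https?://[^\s<>"\']+~i', $text, $matches);

    $urls = array();
    foreach ($matches[0] as $url) {
        // Drop punctuation that usually belongs to the sentence, not the URL.
        $urls[] = rtrim($url, '.,;:!?)');
    }
    return array_values(array_unique($urls));
}

$text = 'Visit http://www.bla.com today or https://example.org/a?b=1.';
print_r(findPlainUrls($text));
```

You could feed this function the output of strip_tags($html) so that it only sees the text of the page.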

  • 68 Suraj // Aug 20, 2008 at 11:47 am

    I have one question, is it possible to Site scrap from my page to others sites that are Password protected, is there a way to bypass login to get site content.

    Help is much appreciated

  • 69 Justin // Aug 20, 2008 at 11:56 am

    Suraj,
    You can get through logins. You’ll need a login / password of course. Then you need to configure cURL to store cookies. Then you simulate a login using cURL to replicate the normal login form getting submitted. Store the cookie that is returned from the login authorization, and then use that cookie when you hit pages that require login.
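
The cookie handling described in this reply might look something like the sketch below. Everything site-specific here is an assumption: the function name buildLoginOptions(), the URLs, the form field names and the cookie file path are placeholders, and the real form fields have to be read from the login page you are targeting.

```php
<?php
// Hypothetical helper: cURL options for submitting a login form while
// keeping the cookies the site sends back.
// $fields holds the form's input names and values; these differ per site.
function buildLoginOptions($loginUrl, array $fields, $cookieFile) {
    return array(
        CURLOPT_URL            => $loginUrl,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_COOKIEJAR      => $cookieFile,  // cookies are saved here when the handle is closed
        CURLOPT_COOKIEFILE     => $cookieFile,  // cookies are read from here and the cookie engine is switched on
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,
    );
}

// Example flow (URLs, field names and file path are placeholders):
// $ch = curl_init();
// curl_setopt_array($ch, buildLoginOptions(
//     'http://example.com/login',
//     array('username' => 'me', 'password' => 'secret'),
//     '/tmp/cookies.txt'
// ));
// curl_exec($ch);                              // log in; the handle now holds the session cookie
// curl_setopt($ch, CURLOPT_HTTPGET, true);     // switch back to a GET request
// curl_setopt($ch, CURLOPT_URL, 'http://example.com/members');
// $html = curl_exec($ch);                      // fetched with the login cookie attached
```

Reusing the same cURL handle for the login and the later requests means the cookies carry over without any extra work.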

  • 70 jamil sadi // Aug 25, 2008 at 10:39 pm

    I am little confuse on how this scraper will run. I mean will we send request through browser?
    What if we want our scraper to run 24 hours a day. And not overloading the target site, things like that.

    1. How/Where to run the script. and its going to save the data into the db (mysql).
    2. scraper running 24×7. not overloading target site.

  • 71 Stanley // Sep 6, 2008 at 2:23 pm

    Suraj, you can access protected pages with cURL’s authentication options. See the cURL documentation for more info. http://www.curltutorials.com

  • 72 zniko07 // Sep 11, 2008 at 5:33 pm

    hello!
    i’ve just done an article (in french) on how fave a blog via curl and php and i’am refering to your article!
    http://k-wi.com/blog/?p=21

  • 73 yoda // Sep 26, 2008 at 9:39 am

    Justin,

    This is very cool, thanks for it. I don’t understand how you can use a cookie with a php script. Say I wanted to get the top 100 search results from google. I can set a preference cookie in my browser for this. How do I push that through in a PHP script running on a server?

  • 74 Matt // Nov 25, 2008 at 4:34 pm

    I’m doing something similar, only grabbing text from paragraphs using $paragraph->nodeValue (or textContent, same results).

    The problem I’m running into is that it strips out any HTML inside the paragraphs, such as line breaks. I was going to use the tags to further parse the paragraphs, but they don’t show up in nodeValue output. Any suggestions?

  • 75 justin // Nov 26, 2008 at 9:24 am

    I think you could get the full xml of that node like this:

    // $doc is your root document object, $node is the node you want to get the contents of
    $xml = $doc->saveXML($node);

    see:
    http://us2.php.net/manual/en/domdocument.savexml.php

  • 76 John Rockefeller // Feb 1, 2009 at 8:22 am

    This is a really great article. I’d recommend using the DOM object and upgrading to PHP5 if you’re still using PHP4. Getting this to work using regular expressions in PHP4 is extremely memory intensive. Obviously some benchmarks would need to be run but my experience tells me the DOM path would win out over regular expressions in this case.

  • 77 exhibition // Feb 17, 2009 at 2:03 am

    this code working well in php 4.

  • 78 Rambabu Katta // Feb 17, 2009 at 9:56 am

    good tutorial

  • 79 Leena // Feb 19, 2009 at 3:10 am

    hi.. when i run above code snippet i get an warning like this..

    Warning: domdocument::domdocument() expects at least 1 parameter, 0 given in C:\xampp\htdocs\interface\try.php on line 27

    ie the domdocument() class needs a parameter..

    Please somebody help me out of this…..

  • 80 MC // Mar 16, 2009 at 2:33 pm

    Great article…exactly what I needed. For one of the sites I’m attempting to scrape, it seems as though a few of the critical links I’m hoping to capture are hidden from the script. I’m assuming it’s because the site sees the user-agent I’m using (all of the listed “most common” agents) and denies access to the script. Is there a workaround for this?

  • 81 David // Mar 17, 2009 at 10:44 pm

    Thanks I’m going to use it to scrape this site!!!

  • 82 Ajay1kumar1 // Mar 21, 2009 at 4:03 am

    Curl works great.
    Thanks for tut.
    Regards,
    Ajay singh rathore

  • 83 viji // Apr 2, 2009 at 12:13 pm

    i want to show some other site content in my website as like as my own content when i am searching in site…pls anyone give code for this..Thanks in Advance

  • 84 sandeep verma // Apr 21, 2009 at 6:51 am

    Thanks guys that was extremely helpful!

    sandeep verma
    (http://sandeepverma.wordpress.com)

  • 85 AmitK // Jun 4, 2009 at 6:06 am

    Hi , can any buddy provide me the xpath queris for retrieval of the TextArea, Radio Buttons and check boxes

    This will be realy help for me

    Thanks in advance.

  • 86 AmitK // Jun 4, 2009 at 6:06 am

    i need values of TextArea, Radio Buttons and check boxes

  • 87 justin // Jun 4, 2009 at 9:52 am

    I think these two xpath queries would get textareas and input type=’radio’ from a document:
    query_for_textarea = "//textarea"
    query_for_radio = "//input[@type='radio']"
    Good luck!

  • 89 AmitK // Jun 4, 2009 at 11:47 pm

    -Justin , Thanks for the help , it worked

    $hrefs = $xpath->evaluate("/html/body//textarea");
    for ($cnt = 0; $cnt < $hrefs->length; $cnt++) {
        $href = $hrefs->item($cnt);
        $url = $href->getAttribute('name').''.$href->nodeValue;
        #echo "Link stored: $url";
    }
    if you notice in the above code we have to use $href->nodeValue; as getAttribute('value') won’t work

  • 90 AmitK // Jun 4, 2009 at 11:52 pm

    but am still struggling with the radio buttons and checked boxes $href->nodeValue as well as getAttribute(’value’) won’t work here… and am not being able to determine which radio button is selected.

    Any thoughts on this will be of great help…

    -Amit

  • 91 justin // Jun 5, 2009 at 10:16 am

    Try getAttribute('checked') - that’s the attribute for if a checkbox or radio button is selected. I think it’s true=checked, false=unchecked.
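
For the radio button and checkbox question, it helps to know that checked is a boolean attribute in HTML: the field is selected whenever the attribute is present, whatever its value. Here is a minimal sketch that tests for it with hasAttribute(). The function name checkedInputs() and the sample form are made up for this example.

```php
<?php
// Hypothetical helper: returns name => value for every checked radio button
// and checkbox in the given HTML.
function checkedInputs($html) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html);   // @ hides parser warnings on sloppy markup
    $xpath = new DOMXPath($dom);

    $checked = array();
    $query = '//input[@type="radio" or @type="checkbox"]';
    foreach ($xpath->query($query) as $input) {
        // Presence of the attribute is what matters, not its value.
        if ($input->hasAttribute('checked')) {
            $checked[$input->getAttribute('name')] = $input->getAttribute('value');
        }
    }
    return $checked;
}

$form = '<html><body><form>'
      . '<input type="radio" name="size" value="s">'
      . '<input type="radio" name="size" value="m" checked>'
      . '<input type="checkbox" name="gift" value="1">'
      . '</form></body></html>';
print_r(checkedInputs($form));
```

Note that this only reflects the state written into the HTML the server sent; anything a visitor or a script changes in the browser afterwards is not visible to the scraper.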

  • 92 Sandeep Verma // Jun 16, 2009 at 6:12 am

    We can use PEAR here….

    execute();

    echo $echo;
    ?>

  • 93 CMOSversion // Jul 2, 2009 at 5:45 am

    Scrapper in here is not capable of getting URL’s stored in a variable/reference written in any client -side scripting language
