How to Create a Web Crawler Using PHP






In this article, we show how to create a very basic web crawler (also called web spider or spider bot) using PHP.

A web crawler is a script that crawls a site, finding and indexing its hyperlinks. It goes from page to page, following each hyperlink it finds and indexing the pages those hyperlinks point to.

Why are web crawlers important? Because they are able to find new hyperlinks and, thus, new pages of a website. They index the pages found and can store them in a database.

This is how search engines work. Google has a web crawler, a spider bot, that is able to search the whole world wide web, find new pages, and index them, so that the pages can be found on the search engine. Google is constantly crawling the web, so that new pages, which are published to the world wide web constantly, can be found and listed.

And it's not only search engines. There are plenty of other services that crawl the web for hyperlinks. One example is a script that checks for broken links found on a website. One of the most prominent examples, and one I use myself, is http://brokenlinkcheck.com. This program crawls a website, looking for broken links, so that a website owner can fix them and improve their site.

So web crawlers are in wide use on the internet and are a very valuable tool.

So how do we go about creating one with PHP? This will be shown below.


PHP Code

All we need to create this spider web crawler is a single block of PHP code.

This code is shown below.





So above is the PHP code that we need to create our web crawler. This block of code is the only code needed to do so.

So the first thing we put in our code is the website we want to crawl. This is placed in the $website_to_crawl variable. You would change the value of this variable to any website that you want to crawl.

We then create an array called $all_links. This array will store all the links that our crawler finds later in the code.

We then create a function called get_links() that has the $url parameter. This function will get the links from each page that the crawler visits. Each link that the crawler goes through is passed in as the $url parameter.

Inside this get_links() function, we declare the $all_links array as global. By default, a PHP function cannot see variables defined outside of it, so without this declaration the function would fill a separate local array. Declaring it global means the links the function stores are still available in $all_links after the function returns, which we rely on later in the code.
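As a minimal sketch of this setup, the global keyword lets the function write to the array defined outside it. The URL is a placeholder, and the function body is reduced to a single stand-in line for illustration.

```php
<?php
// Placeholder start URL; replace with the site you want to crawl.
$website_to_crawl = "http://www.example.com";

// Filled in by get_links() as the crawl proceeds.
$all_links = array();

function get_links($url) {
    // Bind the local name to the variable in the global scope.
    // Without this line, $all_links here would be a new local variable.
    global $all_links;

    // Stand-in for the real link extraction: just record the URL.
    $all_links[] = $url;
}

get_links($website_to_crawl);
echo count($all_links), "\n";
```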

We then create a variable named $contents. This variable gets the contents of each page (each link) that the crawler visits. This is very important because this is how we open up each link found in order to get the links on that page. That way, we can keep getting links from all the pages that are found. This line of code is critical. Without the page contents, there is nothing to search for hyperlinks, and the crawler could not move beyond the URL you specify.
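A minimal sketch of this step, assuming the page is read with PHP's file_get_contents() function. The fetch_contents() wrapper is a hypothetical name, and the example reads a local temporary file so that it runs without a network connection.

```php
<?php
// Hypothetical wrapper: returns the page body, or an empty string
// if the page could not be read.
function fetch_contents($url) {
    // "@" suppresses the warning PHP raises for unreadable URLs.
    $contents = @file_get_contents($url);
    return $contents === false ? "" : $contents;
}

// Demonstrated on a temporary file so the sketch runs offline.
// With allow_url_fopen enabled, an http:// URL is read the same way.
$tmp = tempnam(sys_get_temp_dir(), "page");
file_put_contents($tmp, '<a href="/about">About</a>');

$page    = fetch_contents($tmp);
$missing = fetch_contents($tmp . ".missing");
unlink($tmp);

echo $page, "\n";
echo strlen($missing), "\n";
```

Returning an empty string for a failed read means the rest of the function can simply find no links on that page instead of stopping the crawl.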

The next expression looks for hyperlinks on a page. In this code we look for hyperlinks based on the regular expression for a hyperlink. Without going into all that makes up this expression, this line is able to find all hyperlinks on a page based on how a hyperlink is marked up in HTML.

The next line, using the preg_match_all() function, checks whether the regular expression matches anything on the page. Every match found is stored in the $matches array.
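To make this concrete, here is a sketch of the matching step. The regular expression is an illustrative one, not necessarily the exact expression from the code above, and the sample markup is made up. It is written so that the second capture group holds the URL.

```php
<?php
// Sample markup standing in for a fetched page.
$contents = '<p><a href="/about">About</a> and '
          . "<a class=\"x\" href='b.html'>B</a> and "
          . '<a href=c.html>C</a></p>';

// Group 1 captures the optional quote, group 2 the URL itself.
$regexp = '/<a\s[^>]*href=(["\']?)([^"\' >]+)\1[^>]*>/si';

$count = preg_match_all($regexp, $contents, $matches);

// $matches[0] holds the full opening tags, $matches[1] the quotes,
// and $matches[2] only the href values.
$links_in_array = $matches[2];
print_r($links_in_array);
```

A regular expression is enough for a basic crawler, but it can be fooled by unusual markup. PHP's DOMDocument class is the sturdier choice if you need to handle arbitrary HTML.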

We then create a variable called $path_of_url. We parse the URL to give us the path to the page on the website. We will use this for only specific instances in the code shown below.
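As an illustration of what parse_url() returns, the sketch below splits a made-up URL into its parts. Which component the crawler stores in $path_of_url depends on the second argument, so both the host and the path are shown.

```php
<?php
$url = "http://www.example.com/Articles/Transistors.php?page=2#BJT";

// With one argument, parse_url() returns an associative array with
// keys such as "scheme", "host", "path", "query", and "fragment".
$parts = parse_url($url);

// With a second argument, it returns just that one component.
$host = parse_url($url, PHP_URL_HOST);
$path = parse_url($url, PHP_URL_PATH);

echo $parts["scheme"], "\n";
echo $host, "\n";
echo $path, "\n";
```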

Next, we have a few if statements to find out whether the URL uses http or https. If it uses http, we set the $type variable to "http". If it uses https, we set the $type variable to "https".

We then create a variable, $links_in_array, which stores all the hyperlinks held in the $matches[2] portion of the $matches array. The $matches array holds much more than just the URLs we want to find. Only the $matches[2] portion holds just the hyperlinks, so we copy that portion into its own variable.
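A small sketch of this check, assuming it is done with strpos(). The url_type() helper is a hypothetical name used only for this example.

```php
<?php
// Hypothetical helper: returns "https" or "http" for a URL.
function url_type($url) {
    // strpos() returns 0 for a match at the very start, so compare
    // with ===. A loose check such as == true would treat position 0
    // as "not found".
    if (strpos($url, "https://") === 0) {
        return "https";
    }
    return "http";
}

echo url_type("https://www.example.com"), "\n";
echo url_type("http://www.example.com"), "\n";
```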

We then use a foreach loop to go through all the links found on the URL that we specified.

We go through each link of the specified URL. We use a series of if statements to correct links so that we either don't duplicate links in the final array of all the hyperlinks we find or to make sure we have the full name of the hyperlink.

So the first if statement checks for hash signs ("#") found in a link. If a hyperlink contains a hash sign, we don't want to keep the part of the link from the hash sign onward. If you know HTML pretty well, you would know that if a link contains a hash sign, it is pointing to a certain part of a webpage. So, for example, say we have the following links: http://earningaboutelectronics.com/Transistor#BJT and http://earningaboutelectronics.com/Transistor#FET. These 2 links both refer to the same page. They just refer to different parts of the page. Therefore, we don't want to index both of these links, because they're the same page. We don't want duplicate entries in our final array of links. Therefore, if a link contains a hash sign, we simply retain the portion that precedes it and strip away everything after it.

Next we check if the hyperlink begins with a period or dot ("."). If it does, we want to take away this dot. In HTML, a leading "./" means that the link is relative to the directory of the current page. For a page in the home directory of the website, that is the home directory itself. For example, a website owner may specify the path of a link to be "./Articles/Transistors". We don't want relative links, so we simply remove the period. Later on we'll add the http:// and the URL to the front of this link, so that we obtain the full pathway to the hyperlink.
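The idea can be sketched with strpos() and substr(). The strip_fragment() name and the example.com URLs are placeholders for illustration.

```php
<?php
// Hypothetical helper: keep only the part of a link before "#".
function strip_fragment($link) {
    $pos = strpos($link, "#");
    if ($pos !== false) {
        $link = substr($link, 0, $pos);
    }
    return $link;
}

$a = strip_fragment("http://www.example.com/Transistor#BJT");
$b = strip_fragment("http://www.example.com/Transistor#FET");

echo $a, "\n";
echo $b, "\n";
```

Both calls give back the same page URL, which is what lets the later duplicate check treat them as one entry.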

Next we check if the link begins with "http://". If it does, the link remains the same.

We then check if the link begins with "https://". If it does, the link remains the same.

If the link begins with "//" (a protocol-relative link, which leaves out the "http:" or "https:" part), we remove the "//". Later in the code, we'll add the "http://" and the URL to the front to make a full absolute pathway to the link.

If the link begins with "#", this means that it is referring to a certain part of the current page, which is the URL you specified. Therefore, we just make the link equal to the $url variable.

If the link is an email link (contains "mailto:"), we append a "[" and a "]" to the link. This is our signal so that later on, we can create an absolute pathway to this email function.

If the link begins with "/", we specify the pathway by adding in the $path_of_url variable, which we created from the PHP parse_url() function.

In all other cases, the $link variable is equal to the $path_of_url with the $link variable appended to it.
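The prefix checks described above, from the leading dot onward, can be sketched as a single function. This is an illustration under stated assumptions rather than the exact code: absolutize_link() and its $site_root parameter (the scheme plus host, such as "http://www.example.com") are hypothetical names.

```php
<?php
// Hypothetical helper: turn one href into an absolute URL.
// $url is the page the link was found on; $site_root (an assumed
// name) is the scheme plus host, e.g. "http://www.example.com".
function absolutize_link($link, $url, $site_root) {
    // "./Articles/x" becomes "/Articles/x".
    if (strpos($link, ".") === 0) {
        $link = (string) substr($link, 1);
    }
    if (strpos($link, "http://") === 0 || strpos($link, "https://") === 0) {
        return $link;                          // already absolute
    }
    if (strpos($link, "//") === 0) {
        return "http://" . substr($link, 2);   // protocol-relative
    }
    if ($link === "" || strpos($link, "#") === 0) {
        return $url;                           // points at the same page
    }
    if (strpos($link, "mailto:") !== false) {
        return "[" . $link . "]";              // flag email links
    }
    if (strpos($link, "/") === 0) {
        return $site_root . $link;             // starts at the site root
    }
    return $site_root . "/" . $link;           // everything else
}

$root = "http://www.example.com";
$page = $root . "/page.html";
echo absolutize_link("./Articles/Transistors", $page, $root), "\n";
echo absolutize_link("//cdn.example.com/y", $page, $root), "\n";
echo absolutize_link("contact.html", $page, $root), "\n";
```

Note that this simple approach does not resolve "../" links, and it treats every relative link as if the current page sat in the home directory.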

We then have an if statement that checks whether the hyperlink is already in the $all_links array. If it is not, we add it. The $all_links array is our array to store all the links that the web crawler finds. This get_links() function is going to run over and over throughout the website. Websites typically have hyperlinks to the same page all over the website. We only want to index a given page once in the $all_links array. Thus, we check if the link is currently in the array. If it is, we don't index it again. If it isn't, we put it in the array via the array_push() function.
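A minimal sketch of this check using in_array() and array_push(). The remember_link() helper is a hypothetical name, and the URLs are placeholders.

```php
<?php
$all_links = array();

// Hypothetical helper: add $link only if it has not been seen yet.
function remember_link($link) {
    global $all_links;
    // The third argument makes in_array() compare strictly.
    if (!in_array($link, $all_links, true)) {
        array_push($all_links, $link);
    }
}

remember_link("http://www.example.com/");
remember_link("http://www.example.com/about");
remember_link("http://www.example.com/");   // duplicate, ignored

echo count($all_links), "\n";
```

in_array() scans the whole array on every call, which is fine for a small site. For a large crawl, storing the links as array keys and checking with isset() would be faster.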

This ends the get_links() function.

Right below this, we call (or invoke) the get_links() function.

Remember, we created a function above. But to have the function run, you must call the function. This is what we do in this line.

After this, we have a foreach loop. This loops through all links on the current page that the crawler is on.

For each of these links, we run the get_links() function. So this means we get the links from each of the hyperlinks. This way, we can scan the whole site.

We then have another foreach loop that goes through each of the links. We then echo out these links.
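One point to watch in this loop: in PHP 7 and later, a by-value foreach works on a snapshot of the array, so links that get added to the array while the loop is running are not visited by that same loop. The sketch below swaps in a different technique, a queue of pages still to visit, so that newly found links are crawled too. It is not the code from this article. The crawl() function, its parameters, and the in-memory $site array (used in place of real page fetches so the example runs offline) are all assumptions for illustration.

```php
<?php
// Breadth-first crawl. $links_of is any callable that returns the
// links found on a page; $max_pages is a rough safety cap.
function crawl($start, callable $links_of, $max_pages = 100) {
    $seen  = array($start => true);   // array keys give fast duplicate checks
    $queue = array($start);           // pages still to visit

    while (count($queue) > 0 && count($seen) < $max_pages) {
        $page = array_shift($queue);
        foreach ($links_of($page) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;     // visit this page later
            }
        }
    }
    return array_keys($seen);
}

// Made-up site: each page maps to the links it contains.
$site = array(
    "/"  => array("/a", "/b"),
    "/a" => array("/", "/c"),
    "/b" => array(),
    "/c" => array("/a"),
);

$all_links = crawl("/", function ($page) use ($site) {
    return isset($site[$page]) ? $site[$page] : array();
});

foreach ($all_links as $currentlink) {
    echo $currentlink . "\n";
}
echo "Total: " . count($all_links) . "\n";
```

In a real crawler, the callable would fetch the page and extract its links instead of reading from an array.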

I added an optional 2 lines of code so that I could see how many hyperlinks the web crawler indexed.

And this concludes all the code needed to build our spider web crawler using PHP.

I placed the above PHP code on my website. See the following link to see this PHP spider web crawler script crawl my website: Web Crawler on Learning about Electronics.

You can see all the hyperlinks indexed and at the bottom it tells you the total count of hyperlinks indexed.

If you want to make all the links hyperlinks, then instead of the line, echo $currentlink . "<br>";, you would instead put the line, echo "<a href=" . "\"" . "$currentlink" . "\"" . ">$currentlink</a>" . "<br>";

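When echoing crawled links back into a page as HTML, it is safer to escape them first, since a URL can contain characters such as "&" or quotes. A minimal sketch, assuming a hypothetical link_html() helper:

```php
<?php
// Hypothetical helper: render one URL as an escaped HTML link.
function link_html($currentlink) {
    $safe = htmlspecialchars($currentlink, ENT_QUOTES, "UTF-8");
    return "<a href=\"$safe\">$safe</a><br>";
}

$html = link_html("http://www.example.com/?a=1&b=2");
echo $html, "\n";
```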

