Dynamic URLs In The Eyes Of A Search Engine
Posted Sunday, August 3, 2003
Most Web sites with a large number of pages are developed using dynamic server-side Web technologies such as Hypertext PreProcessor (.php), Java Server Pages (.jsp), Active Server Pages (.asp), ColdFusion (.cfm) and Perl. These technologies give programmers the tools to build sites so that adding products or pages does not require extensive HTML work. In practice, nearly every high-volume site relies on one of these technologies in order to maximize efficiency and stay profitable.
The power and flexibility provided by these technologies are outstanding. There are, however, issues that need to be addressed with regard to the way search engines crawl a dynamically driven Web site. These issues have nothing to do with the pages that are generated, but with the URLs these technologies generate.
To explain this search engine optimization technique, let me give you some perspective using a client of RustyBrick, Inc. RustyBrick designed and developed a custom e-commerce site for an undergarment and intimate apparel shop named Freshpair.com. Freshpair.com is built with PHP, a dynamic scripting language that helps facilitate the day-to-day operations of the site. After the site was built and running, Freshpair.com participated in many online advertising campaigns, including Overture and Google AdWords.
At some point, after sales leveled off and the company was looking for ways to increase revenues, Freshpair came to me at RustyBrick and asked what they could do. Freshpair.com's COO and I discussed many options, but one thing that stood out was that Freshpair.com was not to be found in the normal search results. We quickly devised a plan for the steps to take in order to make Freshpair.com "search engine friendly".
We will get back to Freshpair.com shortly, but first let's continue to discuss the importance of dynamically generated URLs. In order to better understand what a search engine sees, let us take a sample URL and dissect it.
A simple standard URL would look something like: http://www.freshpair.com/underwear.html.
A complex URL would look something like: http://www.freshpair.com/catalog.php?formid=4&id=8&brand=&brasize=&section=women.
The first thing you notice is the .php extension, and you might think that the extension is causing the issue. That is not the case. Next you will notice the question marks, equal signs and ampersands within the URL. Are they causing the issue? Kind of. These question marks, equal signs and ampersands are commonly referred to as "stop characters" in search engine optimization terms. They are named stop characters because they signal to search engines to stop crawling past a certain point, limiting the number of pages crawled on your site.
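To make the idea concrete, here is a minimal Python sketch that splits the complex URL above (with its scheme and "&section" parameter written out in full) into its path and parameters and counts the stop characters. The function name describe_dynamic_url is made up for illustration, and this is only a way of looking at the URL, not a description of how any search engine actually parses it.

```python
from urllib.parse import urlsplit, parse_qsl

# The three "stop characters" named in the article.
STOP_CHARACTERS = "?=&"


def describe_dynamic_url(url):
    """Return (path, number of stop characters, list of (name, value) pairs)."""
    parts = urlsplit(url)
    stop_count = sum(url.count(ch) for ch in STOP_CHARACTERS)
    # keep_blank_values=True keeps empty parameters such as "brand=".
    params = parse_qsl(parts.query, keep_blank_values=True)
    return parts.path, stop_count, params


simple = "http://www.freshpair.com/underwear.html"
complex_url = (
    "http://www.freshpair.com/catalog.php"
    "?formid=4&id=8&brand=&brasize=&section=women"
)

for url in (simple, complex_url):
    path, stop_count, params = describe_dynamic_url(url)
    print(path, stop_count, params)
```

The simple URL has no stop characters and no parameters, while the complex one carries five parameters, two of them empty.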
Let me present another example using one of the pages from our corporate Web site. The following URL is a page that contains the full RustyBrick client list: http://www.rustybrick.com/portfolio_client_list_all.php. We have also enabled the Web visitor to sort the client list by industry. A Web visitor who would like to see all clients that fall within the IT & Communication services industry would be shown a URL that reads http://www.rustybrick.com/portfolio_client_list.php?industry=4. We again see the question mark and equal sign. Now the Web visitor wants to view the client list for the Retail and Wholesale service industry and clicks on that link. The URL now reads http://www.rustybrick.com/portfolio_client_list.php?industry=6. As you can see, the URLs are exactly the same up until the number at the end. So if RustyBrick served 200 industries, there would be 200 URLs that are identical except for that final number.
Now put yourself in the shoes of a search engine. The search engine wants to put only unique pages into its index. The full client list contains the same information that the industry-specific URLs contain, and search engines do not want repetitive information in their index. Search engines combat this issue by "pruning off" the URLs after a specific number of variable strings (i.e. ?, =, &).
For example, the URL http://www.rustybrick.com/portfolio_client_list.php?industry=6 might be pruned down to http://www.rustybrick.com/portfolio_client_list.php by the search engine in order to limit the amount of repeated content.
In a case like Freshpair.com, where there are numerous methods of finding the same product and an unlimited number of pages, how do we get the search engine to find each product and each method of finding that product? Search engines want to keep the number of pages that a site contains to a minimum in order to (1) eliminate duplicate search results with the same content and (2) make the crawling of the pages efficient.
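The pruning described above can be imitated in a few lines of Python. This is a sketch only: the function prune_url and its max_params cutoff are assumptions made for illustration, since the article does not say how many variables any particular search engine keeps.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def prune_url(url, max_params=0):
    """Keep at most max_params query parameters and drop the rest.

    max_params is a hypothetical cutoff; real crawlers do not publish theirs.
    """
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    kept = urlencode(params[:max_params])
    # An empty query string makes urlunsplit omit the "?" entirely.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, kept, ""))


industry_urls = [
    "http://www.rustybrick.com/portfolio_client_list.php?industry=4",
    "http://www.rustybrick.com/portfolio_client_list.php?industry=6",
]

# After pruning, both industry pages collapse into one URL.
pruned = {prune_url(u) for u in industry_urls}
print(pruned)
```

With the cutoff at zero, every industry page ends up at the same address, which is exactly why the separate pages never make it into the index.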
The solution we came up with was to use Mod_Rewrite to rewrite the URLs and remove the stop characters from them. We modified a URL that once looked like http://www.freshpair.com/catalog.php?formid=5&query=bra&section=women to something more like http://www.freshpair.com/catalog_section_women_id_8.html. We replaced all stop characters with underscores and with friendlier URL characters and names. Today, Google has indexed over 21,000 pages on Freshpair.com, and sales have increased tremendously due to the Mod_Rewrite and other search engine optimization techniques applied to the site.
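The mapping between the two URL styles can be sketched in Python. On an Apache server the translation itself would be done by a mod_rewrite rule rather than by application code, and the actual Freshpair rules are not shown here, so the pattern and the function names below are assumptions modeled only on the example URL catalog_section_women_id_8.html.

```python
import re
from urllib.parse import urlencode

# Hypothetical pattern modeled on /catalog_section_women_id_8.html.
STATIC_PATTERN = re.compile(
    r"^/catalog_section_(?P<section>[a-z]+)_id_(?P<id>\d+)\.html$"
)


def to_static_path(section, product_id):
    """Build the friendly path that the site would link to."""
    return f"/catalog_section_{section}_id_{product_id}.html"


def to_dynamic_path(static_path):
    """Translate a friendly path back to the script and query string.

    Returns None when the path does not match the pattern.
    """
    match = STATIC_PATTERN.match(static_path)
    if match is None:
        return None
    query = urlencode({"section": match.group("section"), "id": match.group("id")})
    return f"/catalog.php?{query}"


friendly = to_static_path("women", 8)
print(friendly)
print(to_dynamic_path(friendly))
```

The search engine only ever sees the friendly path, while the PHP script still receives the same parameters it was written to expect.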
For more information on Mod_Rewrite please visit the Apache module mod_rewrite page at: http://httpd.apache.org/docs/mod/mod_rewrite.html