Analysis and Implications of Hilltop Algorithm
By Atul Gupta
Posted Saturday, September 18, 2004
In The Google Florida Update, we discussed why we believe Google has deployed the Hilltop algo in its ‘Florida’ algo update. As usual, Google has been silent about the change, so our analysis is based on research and experiments.
Why the need for a new algo?
While the PR algo did its job well all these years, there is a basic flaw in the PR system, and Google knew about it. The PageRank (PR) system allocates an absolute ‘value of importance’ to a web page based on the number and quality of sites that link to it. However, the PR value is not specific to search terms, so a high-PR web page that contained even a passing reference to an off-topic keyword phrase often got a high ranking for that phrase. Krishna Bharat, then working in California, realized the flaw in this PR-based ranking system and came up with an algorithm he called ‘Hilltop’ around 1999-2000. He filed for the Hilltop patent in January 2001, with Google as an assignee. Needless to say, Google realized the advantage this new algo would offer to its ranking system if combined with its own PR system; Hilltop could bridge the gap perfectly. The Hilltop algo may have gone through several refinements/iterations from its original form before this deployment.
What is the Hilltop algo?
For the geeks who wish to go into great depths, there is detailed info available here –
Hilltop Paper & Hilltop Patent : (http://www.cs.toronto.edu/~georgem/hilltop/)
For the rest of us, here is a simple explanation –
In a nutshell, PR determines ‘authority’ of a web page in general.
Hilltop (LocalScore) determines the ‘authority’ of a web page related to the query or search term.
Bharat reasoned that instead of using just the PR value to find ‘authoritative’ web pages, it would be more useful if the value had topical relevance; counting links from topic-relevant documents to a web page would be more meaningful. He called these topic-relevant documents ‘expert documents’, and the links from these expert documents to the target documents determine the target’s ‘authority score’.
The Hilltop algo calculates a ‘score of authority’ for web pages (over-simplified) as follows; a short code sketch follows the steps:
Run a normal search on the keyphrase to locate a ‘corpus’ of expert documents. The qualifying rules of ‘expert documents’ are stringent so the ‘corpus’ is a manageable number of web pages.
Filter affiliate* sites and duplicate sites from the experts list.
Pages are assigned a LocalScore of ‘authority’ based on number and quality of votes they get from these expert documents. Pages are then ranked based on their LocalScore.
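A minimal, self-contained sketch may make these steps clearer. Everything here (the data model, the vote-quality numbers and the helper names) is our own illustrative assumption, not Google’s implementation; the expert corpus is assumed to have already been located and filtered (steps 1 and 2), so the code only shows the voting and the minimum-of-two-experts rule.

    from collections import defaultdict

    MIN_EXPERT_VOTES = 2  # Hilltop needs at least two expert documents voting for a page

    def local_scores(expert_docs):
        """expert_docs: the already-filtered, query-specific corpus of experts,
        each a dict like {"host": "dmoz.org", "links": [(target_url, vote_quality), ...]}.
        Returns {target_url: LocalScore} for pages with at least two expert votes."""
        votes = defaultdict(list)
        for doc in expert_docs:
            for target, quality in doc["links"]:
                votes[target].append(quality)

        scores = {}
        for target, qualities in votes.items():
            if len(qualities) >= MIN_EXPERT_VOTES:  # fewer than two experts -> no score at all
                scores[target] = sum(qualities)

        # An empty dict means Hilltop contributes nothing and the rest of the
        # ranking algo runs as if Hilltop were not there.
        return scores

    # Toy corpus: two experts vote for the same page; a second page gets only one vote.
    experts = [
        {"host": "dmoz.org",      "links": [("example.com/loans", 1.0), ("other.com", 0.5)]},
        {"host": "dir.yahoo.com", "links": [("example.com/loans", 0.8)]},
    ]
    print(local_scores(experts))  # {'example.com/loans': 1.8}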
How does Hilltop define affiliate sites?
*Affiliate sites are defined as follows (a short sketch follows the list) –
Pages that originate from the same domain (www.ibm.com, www.ibm.com/us/, products.ibm.com, solutions.ibm.com, etc.)
Pages that originate from the same domain but with different top-level and second-level suffixes (like www.ibm.com, www.ibm.co.uk, www.ibm.co.jp, etc.)
Pages that originate from neighborhood IPs (the first 3 octets of the IP number are shared, e.g. 66.165.238.xxx)
Pages that originate from affiliates of affiliates (if www.abc.com is hosted in the same IP neighborhood as www.ibm.com, then www.abc.com is also treated as an affiliate of www.ibm.co.uk, even if the two are on different IP series)
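To make the affiliate tests concrete, here is a rough sketch under our own simplifying assumptions: the hostname parsing is deliberately naive, the IP addresses are made up for illustration, and the transitive ‘affiliate of affiliates’ rule is not shown.

    def second_level_domain(host):
        # www.ibm.com, www.ibm.co.uk and products.ibm.com all reduce to "ibm"
        generic = {"www", "com", "net", "org", "co", "ac", "gov", "uk", "jp", "us"}
        labels = [p for p in host.lower().split(".") if p not in generic]
        return labels[-1] if labels else host

    def ip_neighborhood(ip):
        # "neighborhood" IPs share the first three octets, e.g. 66.165.238.xxx
        return ".".join(ip.split(".")[:3])

    def are_affiliates(site_a, site_b):
        """Each site is a (hostname, ip) pair. Two sites are treated as affiliates
        when they share a second-level domain or sit in the same IP neighborhood."""
        (host_a, ip_a), (host_b, ip_b) = site_a, site_b
        return (second_level_domain(host_a) == second_level_domain(host_b)
                or ip_neighborhood(ip_a) == ip_neighborhood(ip_b))

    print(are_affiliates(("www.ibm.com", "129.42.38.1"),
                         ("www.ibm.co.uk", "94.236.1.5")))        # True: same second-level domain
    print(are_affiliates(("www.abc.com", "66.165.238.9"),
                         ("products.ibm.com", "66.165.238.44")))  # True: same IP neighborhood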
It is worth noting that the Hilltop algo bases its calculations only on ‘expert documents’. It requires finding at least two expert documents voting for a page; if it does not find a minimum of two, it returns zero results. This essentially means that Hilltop fails to pass on any values to the rest of the ranking algo and therefore becomes ineffective for the search term in question.
This is a very important aspect of the Hilltop algo – It is ineffective if sufficient expert documents are not located.
This unique feature of the Hilltop algo, with its high chance of returning a zero score for highly specific query terms, has led the majority of the SEO community to believe that Google is using a ‘money words’ filter list. In reality, the ‘old Google’ results were displayed for specific search terms where Hilltop failed to produce any effect. These terms are what the SEO community collected and called the ‘Money Words List’.
This effect also comes across as strong evidence of the deployment of Hilltop by Google. When Google introduced the new algo on November 15, 2003, an analyst figured out that if you searched for a query term with some ‘exclusion’ trash characters appended, Google displayed the original (pre-algo-change) results, bypassing the so-called ‘money words’ filter list.
For example, if you searched for “real estate -hgfhjfgjhgjg -kjhkhkjhkjhk”, Google would attempt to show you pages on “real estate” while excluding pages containing the terms “hgfhjfgjhgjg” and “kjhkhkjhkjhk”. Since hardly any page contains those nonsense words, Google should have returned the same results as for the term ‘real estate’ alone. However, that did not happen: Google showed results that seemed identical to the pre-algo-change ranking. In fact, an anti-Google group set up a site (www.scroogle.org) to capture the differences in rankings and extract a so-called ‘money words’ filter list.
What’s the real story behind the so-called ‘money keywords list’ filter?
We believe that the ‘money words’ filter list effect was just a spin-off symptom of the Hilltop algo. Each time someone ran a search term like “real estate -hgfhjfgjhgjg -kjhkhkjhkjhk”, Google passed the entire search term to Hilltop. Since Hilltop was unable to locate sufficient ‘expert’ documents containing this funny-looking search term, it produced a zero result (read: zero effect). This essentially means that Hilltop was simply bypassed by the exclusion search term, and the rest of the Google algo was left to extract and display results, which naturally looked identical to the pre-algo-update results.
The growing popularity of www.scroogle.org led Google to detect this bug. Google fixed it by turning the Hilltop step into a 2-step process: the exclusion terms are withheld while the query is passed to Hilltop; Hilltop does its work, extracts results and passes them to the rest of the Google algo; Google then applies the exclusion terms just before displaying results. Simple. Exclusion terms are no longer passed to Hilltop, so Hilltop now works fine. As you can see on the Google site, the above exclusion method no longer shows ‘old Google’ results.
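Here is our guess at what that 2-step handling might look like, sketched purely for illustration; the function names and data shapes are our own assumptions, not Google’s code.

    def split_query(raw_query):
        # separate the core query from "-term" exclusions
        terms = raw_query.split()
        core = " ".join(t for t in terms if not t.startswith("-"))
        excluded = [t.lstrip("-") for t in terms if t.startswith("-")]
        return core, excluded

    def search(raw_query, hilltop_rank, fallback_rank):
        core_query, excluded = split_query(raw_query)

        # Step 1: Hilltop only ever sees the core query ("real estate"),
        # so nonsense exclusion terms can no longer starve it of expert documents.
        results = hilltop_rank(core_query) or fallback_rank(core_query)

        # Step 2: the exclusion terms are applied just before the results are displayed.
        return [page for page in results
                if not any(term in page["text"] for term in excluded)]

    pages = [{"url": "a.com", "text": "real estate listings"},
             {"url": "b.com", "text": "real estate hgfhjfgjhgjg"}]
    print(search("real estate -hgfhjfgjhgjg",
                 hilltop_rank=lambda q: pages,
                 fallback_rank=lambda q: []))
    # only a.com survives the exclusion step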
What does the new Google algo look like? What are the implications?
The combination of the Hilltop algo, Google PR and on-page relevance factors seems to be a highly potent one, very difficult to beat. Not impossible, but very difficult. This new combination has far-reaching implications for how link popularity/PageRank and links from expert documents (LocalScore) affect your site’s ranking.
The exact Google algo will only be known to Google; it is a closely guarded secret. I’m not good at maths (I wish I were), but here is an attempt to simplify the new Google algorithm for the purpose of understanding how the variables take effect (a worked sketch follows the variable definitions below) –
Old Google Ranking Formula = {(1-d)+a (RS)} * {(1-e)+b (PR * fb)}
New Google Ranking Formula = {(1-d)+a (RS)} * {(1-e)+b (PR * fb)} * {(1-f)+c (LS)}
Where:
RS = RelevanceScore: (Score based on keywords appearing in Title, Meta tags, Headlines, Body text, URL, Alt text, Title attribute, anchor text etc. of your site)
PR = PageRank: (Score based on the number and PR value of pages linking to your site. The original formula is PR(A) = (1-d) + d (PR(t1)/C(t1) + ... + PR(tn)/C(tn)), where the PR of page ‘A’ is built from the PR of each page linking to it divided by the number of outgoing links on each of those pages. ‘d’ is a damping factor, commonly believed to be set to 0.85)
LS = LocalScore: (Score computed from expert documents. It has variables and different values for the search term appearing in the title (16), headlines (6), anchor text (1), search-term density, etc. The figures in parentheses are the original values, which may have been changed by Google)
a, b, c = Tweak Weight Controls: (available to Google for fine-tuning the results)
d, e, f = Dampener Controls: (available to Google for fine-tuning the results. We believe that the value of ‘f’ is currently set at zero.)
fb = FactorBase: (The PageRank scale of 1 to 10 on the Google toolbar is not linear but exponential/logarithmic. As per our internal analysis, we believe the base is ‘close to’ 8. This means that a PR5 page is worth 8 times a PR4 page, and a PR8 website is worth roughly 4,000 times (8^4 = 4,096) a PR4 website. This factor somehow needs to be built into the algo formula, so we have taken an fb value to accommodate it)
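To show how these variables interact, here is a small worked example of the formula above. The weights, dampeners and FactorBase values are our own assumptions (with f set to zero as noted), and we read ‘PR * fb’ as raising fb to the toolbar PR to capture the exponential scale; none of these numbers are confirmed by Google.

    def new_google_rank(RS, toolbar_pr, LS, a=1.0, b=1.0, c=1.0,
                        d=0.15, e=0.15, f=0.0, fb=8):
        relevance  = (1 - d) + a * RS
        pr_value   = fb ** toolbar_pr          # our reading of 'PR * fb': toolbar PR on an exponential scale
        pagerank   = (1 - e) + b * pr_value
        localscore = (1 - f) + c * LS          # with f = 0 and LS = 0 this term is simply 1 (Hilltop has no effect)
        return relevance * pagerank * localscore

    # Two hypothetical pages with identical PR: one heavily optimized on-page but
    # with no expert votes, one less optimized but with expert votes behind it.
    well_optimized  = new_google_rank(RS=0.9, toolbar_pr=5, LS=0.0)
    expert_endorsed = new_google_rank(RS=0.5, toolbar_pr=5, LS=2.0)
    print(well_optimized < expert_endorsed)    # True: LocalScore outweighs the extra on-page work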
Merits of the new Google algo
Search engines have always been a little wary of the extent to which they rely on ‘on-page’ factors in their ranking algos. Most search engines long ago stopped valuing factors prone to extreme abuse, such as the keyword meta tag. On-page factors give too much control (and room for abuse) to webmasters. Visible parts of a web page have been less prone to spam because they need to make sense to human visitors. However, for quite some time even these on-page factors have been abused by presenting sub-standard, over-optimized or even ‘cloaked’ content to the search engines.
What is the new ‘ranking’ weight distribution?
As you can see in the new formula above, Google has taken significant weight off the on-page factors. The only on-page variable in the formula is now the RelevanceScore.
Our analysis of the above formula and Google behavior indicates that the total weight distributed to the 3 components (RS group, PR group and LS group) is as follows –
RelevanceScore = 20%, PageRank = 40%, LocalScore = 40%
Where:
RS is the translation of all SEO efforts
PR is the translation of Link-building efforts
LS is the translation of links from the expert documents
With this implementation, Google has shifted significant weight to off-page factors, taking ranking control away from webmasters. As you can see, there is fairly little score to be gained from on-page SEO efforts alone. If an average SEO is able to leverage 10% of the total weight and a super-expert SEO can leverage 18%, the total difference in ranking between an average SEO and a great SEO is just about 8% (see the small illustration below). News: the SEO and ranking rules have just changed!
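A tiny illustration of that 8% gap, using the assumed 20/40/40 split; the percentages and the normalized inputs are our estimates, not Google figures.

    WEIGHTS = {"RS": 0.20, "PR": 0.40, "LS": 0.40}  # assumed split, not confirmed by Google

    def blended_score(rs, pr, ls):
        # each input is normalized to 0..1 within its own component
        return WEIGHTS["RS"] * rs + WEIGHTS["PR"] * pr + WEIGHTS["LS"] * ls

    average_seo = blended_score(rs=0.50, pr=0.6, ls=0.6)  # captures 10% of the total via on-page work
    expert_seo  = blended_score(rs=0.90, pr=0.6, ls=0.6)  # captures 18% of the total via on-page work
    print(round(expert_seo - average_seo, 2))             # 0.08 -> the roughly 8% gap mentioned above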
Is Hilltop running in real-time?
Google primarily runs its service on some 10,000 Pentium servers distributed across the web; that is how its server architecture is built. If we study the Hilltop algo, it is difficult to believe that such servers would have the processing power to locate ‘expert documents’ among thousands of topical documents, evaluate the LocalScore of target pages from all those documents and pass the value on to the other components of the Google algo, which then further process the results, all on the fly in about 0.07 seconds, the speed Google is famous for.
So how and when does Hilltop kick in?
We believe that Google runs a batch process over popular search terms (the so-called ‘money keywords list’) and stores the results ready to serve. Google has a vast database of popular search terms, collected from actual searches as well as from keyword phrases used in the AdWords program. Google has perhaps set a threshold on the number of searches a term needs before it qualifies for the Hilltop pool for batch processing. Hilltop runs on the total pool of popular search terms, maybe once a month. Incremental, smaller batch runs may be done more frequently on search terms that gain popularity and qualify for the Hilltop pool. Results for the major pool may be synchronized with the 10,000 servers once a month, and the smaller batches updated more frequently (a speculative sketch follows).
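If that guess is right, the qualification step might look something like this sketch; the threshold, data shapes and numbers are entirely our own assumptions about behaviour Google has never confirmed.

    MONTHLY_SEARCH_THRESHOLD = 10_000  # arbitrary illustrative cut-off, not a known Google value

    def hilltop_batch_pool(search_volumes):
        """search_volumes: {query: searches per month}, gathered from query logs and AdWords.
        Returns the set of terms popular enough to be pre-processed by Hilltop."""
        return {query for query, volume in search_volumes.items()
                if volume >= MONTHLY_SEARCH_THRESHOLD}

    volumes = {"real estate": 2_500_000,
               "loans": 900_000,
               "blue left-handed widgets": 40}
    print(hilltop_batch_pool(volumes))  # niche terms stay on the old, non-Hilltop ranking path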
Search terms that do not qualify to trigger the Hilltop algo continue to show you the old Google ranking. Many SEOs are happy and claim that their listings have not gone down for several client sites. They are perhaps checking highly specific search terms that have not yet qualified for the Hilltop radar.
Google acquired the patent in February 2003. Why did it take so long to deploy?
Tests, tests and more tests; compatibility issues, result evaluations, fine-tuning and further tests. This was perhaps going to be one hell of a change for Google to deploy. The algo needed to work perfectly with the existing PR and RelevanceScore components of the Google algo. I guess all this takes time.
Are there any downsides / flaws with the new Google algo?
As we do our further analysis, we expect several bugs and shortcomings to manifest over time. Here are a few that we feel could hurt Google and its users –
Hilltop is based on the assumption that each ‘expert document’ it locates is unbiased, spam-free and manipulation-free. We feel that this may not be the case. If even a small percentage of expert documents are contaminated, the scores would magnify the error, leading to a significant number of ‘false positives’ in the top ranks.
Hilltop attempts to arrive at a selection of pages voted to be ‘authoritative’. There is no guarantee that these pages will also mean ‘quality’.
We believe that since a lot of processing power is required to run Hilltop, it (probably) runs as a monthly batch process for popular search terms. Coupled with the significant weight assigned to the Hilltop part of the Google algo, we may expect to see sites continue to rank without much fluctuation until the next processing run. Since the voting patterns of the ‘expert’ pages are unlikely to fluctuate much, we can expect ‘stale’ rankings over sustained periods. This works against the fabric of search engines, which are expected to also include new, good content in their results. ‘Authoritative’ pages apart, people also want to see fresh content, which will now be visible only for less competitive or unique search terms where Hilltop fails to kick in.
New sites will find it increasingly difficult to rank for popular search terms. Google seems to have created a bigger barrier for new sites or new content to rank for extremely popular search terms.
Since most commercial sites find it easy to link to directories, trade associations, government sites of trade authorities, educational institutions and non-profit organizations (read: non-competitive sites), such sites will increasingly populate the top-10 rankings on the result pages.
Who will suffer in the near to medium term?
Affiliate sites / domain clusters / MLM programs running on same servers.
Sites relying heavily on ‘on-page’ site optimization factors.
Sites that rely on highly competitive search terms to get traffic.
Recommendations for site owners
Site owners need to think out of the box and seriously treat improving PageRank and gaining links from ‘expert documents’ as distinct promotional campaigns. The rules of ranking have changed significantly.
Get listed in as many major directories (DMOZ, Yahoo, About, LookSmart, etc.), trade directories, yellow pages, associations, resource pages and classified-section pages as possible.
Avoid domain clusters/affiliate programs, or change the nature of your affiliate programs.
Avoid reciprocal links with suspect FFA sites and link farms.
Popular Myths
Over-optimization is now being penalized: Over-optimization (spam) has always been either discounted or penalized. The current rank losses are due to the shift of weight from on-page to off-page factors. Good site optimization will continue to support rankings to the extent of its weight in the algo.
Link building is no longer important: Link-popularity building is as important as before, perhaps even more important now. The PR algo continues to gain importance.
Google is using a ‘money words’ filter list: As you can see from the arguments above, Google is not using any filter list to penalize commercial sites; the results just ‘seem’ to indicate such symptoms. Nor has Google implemented this algo for the sake of pushing AdWords or building its bottom line ahead of the forthcoming IPO.
Listing in DMOZ, the Google directory or commercial directories signals to Google that your site is commercial, and it therefore penalizes it: On the contrary, since most of these directories ‘qualify’ as ‘expert documents’, links from them are of great value.
About the Author:
Atul Gupta is founder and CEO of SEORank. With over 8 years of experience in the Internet industry, he has helped several companies formulate and roll out online marketing strategies targeted at search engine positioning. His knowledge and experience lend credibility to the company and fuel his team of professionals.