Book Reviews
![[Day Permalink]](http://radio.weblogs.com/0112083/images/permaLink.gif)
Thursday, January 2, 2003
Google PageRank algorithm --
As noted by many, the PageRank algorithm is based on solving a large eigenvalue problem. The dimension of the square matrix was 26 million in the description of the method (see below) written in 1997/1998. Today the dimension is of the order of 109, so the problem is computationally demanding. It must be one of the biggest eigenvalue computations done routinely. Of course, the matrix is sparse, and you don't need to be absolutely exact. It would be interesting to know what kind of solution methods are used, and how the solution is parallelized.
Curiouser and curiouser! (via Blogging Alone) writes: "... it did lead me to an interesting paper by Sergey Brin about Google. This was written in, I guess, 97/98 i.e. well before Google became the monster it is today.
[...] it does have the best description of the page rank algorithm and how it is calculated that I have seen so far. I'm guessing it's a good deal more sophisticated these days but this might be of interest for others like myself who wonder about the inner workings. [...] To quote from that paper:
2.1.1 Description of PageRank Calculation
Academic citation literature has been applied to the web, largely by counting citations or backlinks to a given page. This gives some approximation of a page's importance or quality. PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is defined as follows:
We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. Also, a PageRank for 26 million web pages can be computed in a few hours on a medium size workstation. There are many other details which are beyond the scope of this paper."
|
|