There are plenty of explanations of the PageRank algorithm online, and the implementations floating around are numerous and messy. To find a higher-quality implementation, I read the source code of the pagerank methods that ship with the Python NetworkX library. I have trimmed some redundant content; if you are interested, download the library and read the source directly.
The source is available at http://networkx.github.io/download.html
I have also uploaded the pagerank code itself to http://pan.baidu.com/s/1ntOafH3
The heart of the PageRank algorithm is how it handles two problems: dangling nodes and spider traps.
The former are pages that link to no other page; in graph terms, nodes with out-degree 0.
The latter is a single page, or a small group of pages, that only link among themselves: once rank flows in, it keeps circulating inside and never flows out, causing the rank sink problem.
For dangling nodes, we take their PageRank contribution and redistribute it across all nodes.
For spider traps, the fix is teleporting (random jumps).
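To make both failure modes concrete, here is a minimal sketch (the graph and its node numbers are a toy example of my own, not taken from the source):

import networkx as nx

# Node 4 is a dangling node (out-degree 0); nodes 2 and 3 link only to
# each other, forming a spider trap that would soak up all the rank if
# there were no teleporting.
G = nx.DiGraph()
G.add_edges_from([(1, 2), (1, 4), (2, 3), (3, 2)])

# With alpha < 1 the teleporting term keeps the iteration convergent.
print(nx.pagerank(G, alpha=0.85))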
The NetworkX library offers three main ways to compute PageRank (PS: why three? My hunch is that three different people wrote them, because even initialization statements with identical effects are written differently in each). A usage sketch comparing the three follows this list.
1 The pagerank function
works directly on the graph structure
and computes the PR values (PageRank values) by iterating to convergence.
2 The pagerank_numpy function
converts the graph to a dense NumPy adjacency matrix,
works through google_matrix and NumPy matrix operations,
and computes the principal eigenvector for the largest eigenvalue, which is the PR vector.
3 The pagerank_scipy function
converts the graph to a SciPy sparse matrix
and computes the PR values by iterating to convergence.
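As promised, a sketch that calls all three on the same small graph (assuming a NetworkX version, such as the 1.x series this source comes from, that still ships all three functions):

import networkx as nx

G = nx.DiGraph([(1, 2), (1, 3), (2, 3), (3, 1)])

pr_iter = nx.pagerank(G, alpha=0.85)         # power iteration on the graph
pr_numpy = nx.pagerank_numpy(G, alpha=0.85)  # dense eigenvector solve
pr_scipy = nx.pagerank_scipy(G, alpha=0.85)  # power iteration, sparse matrix

# Up to the iteration tolerance, the three results agree.
for n in G:
    assert abs(pr_iter[n] - pr_numpy[n]) < 1e-4
    assert abs(pr_scipy[n] - pr_numpy[n]) < 1e-4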
- """PageRank analysis of graph structure. """#BSD license.#NetworkX: http: //networkx.lanl.gov/
- import networkx as nx@not_implemented_for('multigraph') def pagerank(G, alpha = 0.85, personalization = None, max_iter = 100, tol = 1.0e-6, nstart = None, weight = 'weight', dangling = None) : """Return the PageRank of the nodes in the graph.
- Parameters
- -----------
- G : graph
- A NetworkX graph. 在PageRank算法里面是有向图
- alpha : float, optional
- 稳定系数, 默认0.85, 心灵漂移teleporting系数,用于解决spider trap问题
- personalization: dict, optional
- 个性化向量,确定在分配中各个节点的权重
- 格式举例,比如四个点的情况: {1:0.25,2:0.25,3:0.25,4:0.25}
- 默认个点权重相等,也可以给某个节点多分配些权重,需保证权重和为1.
- max_iter : integer, optional
- 最大迭代次数
- tol : float, optional
- 迭代阈值
- nstart : dictionary, optional
- 整个网络各节点PageRank初始值
- weight : key, optional
- 各边权重
- dangling: dict, optional
- 字典存储的是dangling边的信息
- key --dangling边的尾节点,也就是dangling node节点
- value --dangling边的权重
- PR值按多大程度将资源分配给dangling node是根据personalization向量分配的
- This must be selected to result in an irreducible transition
- matrix (see notes under google_matrix). It may be common to have the
- dangling dict to be the same as the personalization dict.
- Notes
- -----
- 特征值计算是通过迭代方法进行的,不能保证收敛,当超过最大迭代次数时,还不能减小到阈值内,就会报错
- """#步骤一:图结构的准备--------------------------------------------------------------------------------
- if len(G) == 0 : return {}
- if not G.is_directed() : D = G.to_directed()
- else: D = G#Create a copy in (right) stochastic form W = nx.stochastic_graph(D, weight = weight) N = W.number_of_nodes()#确定PR向量的初值
- if nstart is None: x = dict.fromkeys(W, 1.0 / N)#和为1
- else: #Normalized nstart vector s = float(sum(nstart.values())) x = dict((k, v / s) for k, v in nstart.items()) if personalization is None: #Assign uniform personalization vector
- if not given p = dict.fromkeys(W, 1.0 / N)
- else: missing = set(G) - set(personalization) if missing: raise NetworkXError('Personalization dictionary ''must have a value for every node. ''Missing nodes %s' % missing) s = float(sum(personalization.values())) p = dict((k, v / s) for k, v in personalization.items())#归一化处理
- if dangling is None: #Use personalization vector
- if dangling vector not specified dangling_weights = p
- else: missing = set(G) - set(dangling) if missing: raise NetworkXError('Dangling node dictionary ''must have a value for every node. ''Missing nodes %s' % missing) s = float(sum(dangling.values())) dangling_weights = dict((k, v / s) for k, v in dangling.items()) dangling_nodes = [n
- for n in W
- if W.out_degree(n, weight = weight) == 0.0]#dangling_nodes dangling节点#danglesum dangling节点PR总值
- #dangling初始化默认为personalization#dangling_weights根据dangling而生成,决定dangling node资源如何分配给全局的矩阵#迭代计算--------------------------------------------------------------------
- #PR = alpha * (A * PR + dangling分配) + (1 - alpha) * 平均分配#也就是三部分,A * PR其实是我们用图矩阵分配的,dangling分配则是对dangling node的PR值进行分配, (1 - alpha)分配则是天下为公大家一人一份分配的#其实通俗的来说,我们可以将PageRank看成抢夺大赛,有三种抢夺机制。#1,A * PR这种是自由分配,大家都愿意参与竞争交流的分配#2,dangling是强制分配,有点类似打倒土豪分田地的感觉,你不参与自由市场,那好,我们就特地帮你强制分。#3,平均分配,其实就是有个机会大家实现共产主义了,不让spider trap这种产生rank sink的节点捞太多油水,其实客观上也是在帮dangling分配。#从图和矩阵的角度来说,可以这样理解,我们这个矩阵可以看出是个有向图#矩阵要收敛-->矩阵有唯一解-->n阶方阵对应有向图是强连通的-->两个节点相互可达,1能到2,
- 2能到1#如果是个强连通图,就是我们上面说的第1种情况,自由竞争,那么我们可以确定是收敛的#不然就会有spider trap造成rank sink问题
- for _ in range(max_iter) : xlast = x x = dict.fromkeys(xlast.keys(), 0)#x初值danglesum = alpha * sum(xlast[n]
- for n in dangling_nodes)#第2部分:计算dangling_nodes的PR总值
- for n in x: for nbr in W[n] : x[nbr] += alpha * xlast[n] * W[n][nbr][weight]#第1部分: 将节点n的PR资源分配给各个节点,循环之
- for n in x: x[n] += danglesum * dangling_weights[n] + (1.0 - alpha) * p[n]#第3部分:节点n加上dangling nodes和均分的值#迭代检查err = sum([abs(x[n] - xlast[n]) for n in x]) if err < N * tol: return x raise NetworkXError('pagerank: power iteration failed to converge ''in %d iterations.' % max_iter) def google_matrix(G, alpha = 0.85, personalization = None, nodelist = None, weight = 'weight', dangling = None) : """Return the Google matrix of the graph.
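A small usage sketch for the parameters discussed in the docstring; the bias values here are made up purely for illustration:

import networkx as nx

G = nx.DiGraph([(1, 2), (2, 3)])   # node 3 is a dangling node
bias = {1: 0.5, 2: 0.25, 3: 0.25}  # hypothetical per-node weights, sum to 1

# Use the same vector for teleporting and for redistributing the
# dangling mass, as the docstring notes is common.
pr = nx.pagerank(G, alpha=0.85, personalization=bias, dangling=bias)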
def google_matrix(G, alpha=0.85, personalization=None, nodelist=None,
                  weight='weight', dangling=None):
    """Return the Google matrix of the graph.

    Parameters
    ----------
    Same as the parameter table of pagerank above.

    Returns
    -------
    A : NumPy matrix
      Google matrix of the graph

    Notes
    -----
    The matrix returned represents the transition matrix that describes
    the Markov chain used in PageRank. For PageRank to converge to a
    unique solution (i.e., a unique stationary distribution in a Markov
    chain), the transition matrix must be irreducible. In graph terms,
    there must exist a path between every pair of nodes (node 1 can
    reach node 2 and node 2 can reach node 1), or else spider traps
    create "rank sinks" that soak up the rank and keep the matrix from
    converging.
    """
    import numpy as np
    if nodelist is None:
        nodelist = G.nodes()
    M = nx.to_numpy_matrix(G, nodelist=nodelist, weight=weight)
    N = len(G)
    if N == 0:
        return M

    # Personalization vector
    if personalization is None:
        p = np.repeat(1.0 / N, N)
    else:
        missing = set(nodelist) - set(personalization)
        if missing:
            raise NetworkXError('Personalization vector dictionary '
                                'must have a value for every node. '
                                'Missing nodes %s' % missing)
        p = np.array([personalization[n] for n in nodelist], dtype=float)
        p /= p.sum()

    # Dangling nodes
    if dangling is None:
        dangling_weights = p
    else:
        missing = set(nodelist) - set(dangling)
        if missing:
            raise NetworkXError('Dangling node dictionary '
                                'must have a value for every node. '
                                'Missing nodes %s' % missing)
        # Distribution vector for the dangling nodes' PR mass
        dangling_weights = np.array([dangling[n] for n in nodelist],
                                    dtype=float)
        dangling_weights /= dangling_weights.sum()
    # Rows summing to 0 are nodes with out-degree 0: the dangling nodes
    dangling_nodes = np.where(M.sum(axis=1) == 0)[0]

    # Assign dangling_weights to any dangling nodes (nodes with no out links)
    for node in dangling_nodes:
        M[node] = dangling_weights
    M /= M.sum(axis=1)  # Normalize rows to sum to 1
    # M now handles the dangling nodes; (1 - alpha) handles the rest
    return alpha * M + (1 - alpha) * np.outer(np.ones(N), p)
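A quick check of the irreducibility claim above (assuming a NetworkX version where google_matrix returns a dense NumPy matrix): with alpha < 1 and a uniform personalization vector, teleporting makes every entry of the Google matrix positive, so the chain is irreducible even when the underlying graph is not strongly connected:

import networkx as nx
import numpy as np

G = nx.DiGraph([(1, 2), (2, 3), (3, 2)])  # contains a spider trap
M = np.asarray(nx.google_matrix(G, alpha=0.85))

assert (M > 0).all()                    # every entry positive
assert np.allclose(M.sum(axis=1), 1.0)  # each row is a distribution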
def pagerank_numpy(G, alpha=0.85, personalization=None, weight='weight',
                   dangling=None):
    """Return the PageRank of the nodes in the graph.

    Notes
    -----
    The eigenvector calculation uses NumPy's interface to the LAPACK
    eigenvalue solvers, which is fast and accurate for small graphs.
    """
    import numpy as np
    if len(G) == 0:
        return {}
    M = google_matrix(G, alpha, personalization=personalization,
                      weight=weight, dangling=dangling)
    # Use the NumPy LAPACK solver
    eigenvalues, eigenvectors = np.linalg.eig(M.T)
    ind = eigenvalues.argsort()
    # Eigenvector of the largest eigenvalue is at ind[-1]; normalize it
    largest = np.array(eigenvectors[:, ind[-1]]).flatten().real
    norm = float(largest.sum())
    return dict(zip(G, map(float, largest / norm)))
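To see why the principal eigenvector is the answer: the PR vector is a stationary distribution of the Google matrix, i.e. a left eigenvector with eigenvalue 1. A small verification sketch (same NetworkX version assumption as above):

import networkx as nx
import numpy as np

G = nx.DiGraph([(1, 2), (2, 1), (2, 3)])
M = np.asarray(nx.google_matrix(G, alpha=0.85))
pr = nx.pagerank_numpy(G, alpha=0.85)

x = np.array([pr[n] for n in G])  # PR values in node order
assert np.allclose(x @ M, x)      # x is stationary: x * M == x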
def pagerank_scipy(G, alpha=0.85, personalization=None, max_iter=100,
                   tol=1.0e-6, weight='weight', dangling=None):
    """Return the PageRank of the nodes in the graph.

    This version uses the SciPy library; the approach is essentially
    the same as that of the first function, pagerank.

    Notes
    -----
    The eigenvector calculation uses SciPy sparse matrices.
    """
    import scipy.sparse
    N = len(G)
    if N == 0:
        return {}
    nodelist = G.nodes()
    M = nx.to_scipy_sparse_matrix(G, nodelist=nodelist, weight=weight,
                                  dtype=float)
    S = scipy.array(M.sum(axis=1)).flatten()
    S[S != 0] = 1.0 / S[S != 0]
    Q = scipy.sparse.spdiags(S.T, 0, *M.shape, format='csr')
    M = Q * M  # row-stochastic form

    # Initialize the PageRank vector
    x = scipy.repeat(1.0 / N, N)

    # Personalization dict: per-node weights, equal by default
    if personalization is None:
        p = scipy.repeat(1.0 / N, N)
    else:
        missing = set(nodelist) - set(personalization)
        if missing:
            raise NetworkXError('Personalization vector dictionary '
                                'must have a value for every node. '
                                'Missing nodes %s' % missing)
        p = scipy.array([personalization[n] for n in nodelist], dtype=float)
        p = p / p.sum()

    # Find the dangling nodes and how their mass is redistributed
    if dangling is None:
        dangling_weights = p
    else:
        missing = set(nodelist) - set(dangling)
        if missing:
            raise NetworkXError('Dangling node dictionary '
                                'must have a value for every node. '
                                'Missing nodes %s' % missing)
        # Convert the dangling dictionary into an array in nodelist order
        dangling_weights = scipy.array([dangling[n] for n in nodelist],
                                       dtype=float)
        dangling_weights /= dangling_weights.sum()
    is_dangling = scipy.where(S == 0)[0]

    # Iterate: part 1 is x * M, part 2 redistributes the dangling mass,
    # part 3 is the teleporting share that resolves rank sinks
    for _ in range(max_iter):
        xlast = x
        x = alpha * (x * M + sum(x[is_dangling]) * dangling_weights) + \
            (1 - alpha) * p
        # Check convergence, L1 norm
        err = scipy.absolute(x - xlast).sum()
        if err < N * tol:
            return dict(zip(nodelist, map(float, x)))
    raise NetworkXError('pagerank_scipy: power iteration failed to converge '
                        'in %d iterations.' % max_iter)
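Finally, a usage sketch for the sparse variant, which is the practical choice once the graph gets large (the graph size and parameters here are arbitrary, and the function assumes a NetworkX version that still ships pagerank_scipy):

import networkx as nx

G = nx.fast_gnp_random_graph(10000, 5e-4, directed=True, seed=42)
pr = nx.pagerank_scipy(G, alpha=0.85, tol=1e-8)
print(max(pr, key=pr.get))  # the node with the highest PageRank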
Source: http://lib.csdn.net/article/python/43404