聚类算法（1）

技术2024-08-06 62

聚类是的无监督学习一个例子。无监督学习与监督学习最显著的不同在于，无监督学习寻找的并不是一些什么数据，而是一种模式，或规律，或者，聚类。

比如，我们要对一系列博客进行聚类以从中发现一些规律（《Programming Collective Intelligence》），我们主要根据单词频度来对博客进行聚类。

构造数据

我们使用feedparser对博客的RSS订阅源进行解析，以得到相关的单词频度。

以下函数从一个订阅源得到一单词频度列表。

#---------------------------------------------------------------------- # 返回一个RSS订阅源的标题及一个包含单词计数情况的词典 def getwordcounts(url): """counting the words in a Feed""" d = feedparser.parse(url) wc = {} # 若为空则直接返回 if not d.entries: return None,None # 遍历所有条目 for e in d.entries: if 'summary' in e: summary = e.summary else: summary = e.description # 提取一个单词列表 words = getwords(e.title+' '+summary) for word in words: wc.setdefault(word, 0) wc[word] += 1 return d.feed.title,wc

关于d.entries：

entries实际是有好些项的，所以这里可以写作d.entries[i]

注意这里的一句话：“This element always exists, although it may be an empty list.”。实际上，entries可能为空，所以在使用之前我们应该对其进行判断。判断方法有两种，一种是用len，一中是用not。另，e.summary和e.description是相等的。e.title和d.feed.title是不同的，前者是某条条目的title，即是某篇文章的标题，后者是这个博客的title。

所调用的getwords如下：

#---------------------------------------------------------------------- # 返回剥去html标记的单词列表 def getwords(html): """return words list without html signal""" # 去掉HTML标记并将其替换成'' txt = re.compile(r'<[^>]+>').sub('', html) # 利用所有非字母字符拆分单词 words = re.compile(r'[^A-Z^a-z]+').split(txt) # 转换为小写形式 return [word.lower() for word in words if word != '']

说明：1.关于html标记。（去掉的是那些html标记的符号，标记的具体内容并没有去掉）

2.关于re

main code：

#---------------------------------------------------------------------- # main code apcount = {} # 出现这些单词的博客数目 wordcounts = {} # 单词计数 with open('feedlist.txt') as file: feedlist = [line for line in file] for feedurl in feedlist: title, wc = getwordcounts(feedurl) if title == None: continue wordcounts[title] = wc for word, count in wc.items(): apcount.setdefault(word, 0) if count >= 1: apcount[word] += 1 wordlist = [] for w,bc in apcount.items(): frac = float(bc)/len(feedlist) if 0.1<frac<0.5: wordlist.append(w) with open('blogdata.txt', 'w') as outfile: outfile.write('Blog') for word in wordlist: outfile.write('/t%s' % word) outfile.write('/n') for blog,wc in wordcounts.items(): outfile.write(blog) for word in wordlist: if word in wc: outfile.write('/t%d' % wc[word]) else: outfile.write('/t0') outfile.write('/n')

这里的流程如下：

1.将所有url从文件feedlist.txt（文件名etc可换）读入到feedlist

2.解析每个url，将其名字和该博客中的单词计数列表存入tilte和wc中。将博客题目和其中所有不同单词出现的次数保存至wordcounts

3.统计出现某单词的博客数目，存入apcount（单词，出现于的博客数）中。前提是，如果该单词在该博客中出现过1次或一次以上，则将apcount[word]加1

4.因为很多单词，比如the,a之类的，几乎出现在每篇博文中，因此，我们对出现的单词进行频率限制。低于10%或高于50%的都舍弃。其余存入wordlist中。

5.新建文件blogdata.txt，写入相关数据。（包括，经频率限制处理过的单词，写入每个博客中所出现的经过频率限制处理的单词的次数）

所写blogdata.txt片段如下（数据经过裁剪，个数不一定相对应）：

Blog unemployment suburban month illustrated browsers skin go family very Google Blogoscoped 0 0 0 1 2 1 1 0 0 0 1 1 1 0 1 0 0 0 1 0 1 2 1 0 1 Topix Blog 1 1 1 0 0 0 0 1 2 1 0 0 0 1 0 1 2 0 1 1 1 1 1 1 0

那么如何实现按主题分类博客呢？可以采用分级聚类的方法。

分级聚类

分级聚类通过连续不断的将最为相似（两元素之间的距离最近）的群组两两合并，构造一个群组的层级结构。分级结构最终形成树状图。树状图可以有效的确定图中个元素的相似程度。

因此在此例中，要实现按主题队博客分类，只需要构造博客的分级聚类，若构造成功，则可对博客进行聚类。

读入数据

首先从我们写成的blogdata.txt中读入数据。

#---------------------------------------------------------------------- # 加载数据文件 def readfile(filename): """load data""" with open(filename) as file: lines = [line for line in file] # 第一列是列标题 colnames = lines[0].strip().split('/t')[1:] rownames = [] data = [] for line in lines[1:]: p = line.strip().split('/t') # 每行第一列是行名 rownames.append(p[0]) # 其余部分是数据 data.append([float(x) for x in p[1:]]) return rownames,colnames,data

要强烈的注意下blogdata的格式，如下：

第一行：Blog xxx xxx xxx xxx （xxx为出现在各博客中的合理单词）

第二行：blogname 000 000 000 000 （000为第一行中各个单词在博客blogname中出现的次数，可能为0次）

........ .......（同第二行）

函数readfile中的列标题即是那些合理单词；行名即是博客名字；data即是该单词出现在各博客中的次数。

聚类算法

算法hcluster完成对以上data的聚类功能。

#---------------------------------------------------------------------- # 聚类算法 def hcluster(rows, distancefunc = pearson): """cluster""" distances = {} currentclusterid = 1 # 最开始的聚类就是数据集中的行 clust = [bicluster(rows[i], id=1) for i in range(len(rows))] while len(clust) > 1: lowestpair = (0,1) closest = distancefunc(clust[0].vec, clust[1].vec) # 遍历每个配对，寻找最小值 for i in range(len(clust)): for j in range(i+1, len(clust)): # 用distances缓存距离的计算值 if (clust[i].id, clust[j].id) not in distances: distances[(clust[i].id, clust[j].id)] = distancefunc(clust[i].vec, clust[j].vec) d = distances[(clust[i].id, clust[j].id)] if d < closest: closest = d lowestpair = (i, j) # 计算两个聚类的平均值 mergevec = [(clust[lowestpair[0]].vec[i] + clust[lowestpair[1]].vec[i])/2.0 for i in range(len(clust[0].vec))] # 建立新的聚类 newcluster = bicluster(mergevec, left = clust[lowestpair[0]], right = clust[lowestpair[1]], distance = closest, id = currentclusterid) # 不在原始集合中的聚类，其id为负数 currentclusterid -= 1 del clust[lowestpair[0]] del clust[lowestpair[1]] clust.append(newcluster) return clust[0]

初始时，每个单独的行均构成一个聚类，随之程序对每两行之间的距离进行计算并进行聚类操作，直至只有1行存在（while len(clust) > 1:），即只有一个聚类。

整个算法是个无敌的三重循环。。。

while中的双重for是遍历整个聚类寻找距离最小的二聚类。为加快运行，对两个聚类直接的距离值进行了缓存，因为，在这两类不合并或不和其他类合并之前，其离都是相等的，计算结果保存在distances中，其结果大致为：distances == {（聚类1，聚类2）：距离}，所以在存入distances或从中取值时，我们使用的key是元组，该元组是聚类的id（每个聚类均有唯一的id，这是clust = [bicluster(rows[i], id=i) for i in range(len(rows))]的功劳。。。）。

记住我们所计算的是什么的聚类？

我们的目标是，找出相似的博客。我们通过聚类来达到这一点的。我们计算的是，关于某博客中出现的所有单词的频度的聚类。以此我们将所有的博客根据其中出现的单词的情况进行聚类（做成树状图）。

在找到了两个聚类后，我们对其中的所有单词频度求平均，并以此作为新的聚类，参与下一次的寻找。

求距离我们使用经典的皮尔逊算法。

#---------------------------------------------------------------------- # 皮尔逊算法--求相似度 def pearson(v1, v2): """pearson""" # 求和 sum1 = sum(v1) sum2 = sum(v2) # 求平方和 sum1sq = sum([pow(v,2) for v in v1]) sum2sq = sum([pow(v,2) for v in v2]) # 求乘积之和 psum = sum([v1[i]*v2[i] for i in range(len(v1))]) # 计算pearson score num = psum - (sum1*sum2/len(v1)) den = sqrt((sum1sq-pow(sum1,2)/len(v1))*(sum2sq-pow(sum2,2)/len(v1))) if den == 0: return 0 return 1.0-num/den

该算法返回的是0~1之间的数字。完全匹配返回1，否则返回0.即是说，相距越近，则返回的数字越大；但我们的算中是返回的：1.0-num/den，即相距越近则返回数字越小。这与我们的常识是相同的，注意对此算法hcluster中的判断条件是if d < closest，而不是>。

嗯，还可以。。。

把这棵树画出来看看？

树状图

使用PIL来画树状图。

#---------------------------------------------------------------------- # 树状图 #---------------------------------------------------------------------- # 树状图 -- 树的总高度 def getheight(clust): """get the height of the tree""" if clust.left == None and clust.right == None: return 1 else: return (getheight(clust.left) + getheight(clust.right)) #---------------------------------------------------------------------- # 树状图 -- 树的距离 def getdepth(clust): """the depth of the tree""" if clust.left == None and clust.right == None: return 0 else: return (max(getdepth(clust.left), getdepth(clust.right)) + clust.distance) #---------------------------------------------------------------------- # 生成图片 def drawdendrogram(clust, labels, jpeg = 'clusters.jpg'): """draw the picture""" # 获得高度和宽度 h = getheight(clust)*30 w = 1400 depth = getdepth(clust) # 由于宽度固定，因此需要对距离进行调整 scaling = float(w-150)/depth # 创建白色背景的图片 img = Image.new('RGB', (w,h), (255,255,255)) draw = ImageDraw.Draw(img) draw.line((0,h/2,10,h/2), fill=(255,0,0)) # 画第一个节点 drawnode(draw, clust, 10, (h/2), scaling, labels) img.save(jpeg, 'JPEG') #---------------------------------------------------------------------- # 画节点和连线 def drawnode(draw, clust, x, y, scaling, labels): """draw node and lined node""" if clust.id < 0: h1 = getheight(clust.left)*20 h2 = getheight(clust.right)*20 top = y - (h1+h2)/2 bottom = y + (h1+h2)/2 # 线的长度 l1 = clust.distance*scaling # 聚类到其子节点的垂直线 draw.line((x, top+h1/2, x, bottom-h2/2), fill=(255,0,0)) # 连接左侧节点的水平线 draw.line((x, top+h1/2, x+l1, top+h1/2), fill=(255,0,0)) # 连接右侧节点的水平线 draw.line((x, bottom-h2/2, x+l1, bottom-h2/2), fill=(255,0,0)) # 绘制左右节点 drawnode(draw, clust.left, x+l1, top+h1/2, scaling, labels) drawnode(draw, clust.right, x+l1, bottom-h2/2, scaling, labels) else: # 是叶节点则绘制其标签 draw.text((x+5,y-7), labels[clust.id], (0,0,0))

参数选择很重要，尤其是当数据很多时，否则画出来的将会惨不忍睹。。。

上面这段画树的代码来自《Programming Collective Intelligence》。每个语言都有其相应的库/框架之类，但仿佛Python特别多，而且特别杂特别神，比如传说的用Python操作MS Office的xlrd,xlwt之类。但我们只是用而已，而且做树状图对聚类算法来说并非是必要的一步，因此对以上代码没有精研。。。（我该补一句：详情请参考用户手册。。。）

列聚类

我们上面提到过，说，我们所要进行的是，对博客，依据其中所使用的单词，进行聚类，这样就把相似的博客聚合在了一起。在上面的例子中，我们聚类所操作的，实际上是行，行即是博客的名字。

那列呢？列是什么？

列是单词。如果我们对单词进行聚类呢？那么，我们可以看出哪些单词可能被同时使用，哪些不太可能被同时使用，这就是列聚类（如果商店使用列聚类算法得出哪些商品用户经常同时购买，那么就能很方便的进行捆绑销售了。。。）。。。

我们上面所进行的是行聚类，在行聚类中，行是博客列是单词；如果要进行列聚类又要直接使用上面的算法，那么最好的办法（也是最简单的办法）就是，将上面的数据进行反置，使，行代表单词，列代表次数。

#---------------------------------------------------------------------- # 列聚类 -- 倒置数据 #---------------------------------------------------------------------- def rotatematrix(data): newdata = [] for i in range(len(data[0])): newrow = [data[j][i] for j in range(len(data[0]))] newdata.append(newrow) return newdata

我们上面的格式是：

第一行：Blog xxx xxx xxx xxx （xxx为出现在各博客中的合理单词）

第二行：blogname 000 000 000 000 （000为第一行中各个单词在博客blogname中出现的次数，可能为0次）

........ .......（同第二行）

那么我们更改后的格式如：

第一行：Blog xxx xxx xxx xxx （xxx为出现在各博客）

第二行：单词 000 000 000 000 （000为该单词出现于列所代表的博客中的次数）

........ .......（同第二行）

总结行聚类和列聚类：

当数据项的数量比变量多的时候，出现无意义聚类的可能性就会增加。由于单词的数量比博客多得多，因此，在博客聚类中出现的模式要比在单词聚类中出现的更加合理。

总结分级聚类算法：

分级聚类将聚类表示成了直观的树状图。

两个缺点：

1.树状图并不会真正拆分数据，树状图只是表示出了该关系。（可以说，该关系并不能保存持久。。。）

2.结算量太过可观。。。

最新回复(0)