Use the urllib package
>>> from urllib import urlopen
Specify the URL
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
Read in the raw document
>>> raw = urlopen(url).read()
Check: the type of raw is str
>>> type(raw)
<type 'str'>
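The session above is Python 2. As a rough Python 3 counterpart, here is a minimal sketch that assumes `urllib.request.urlopen` in place of `urllib.urlopen`. The helper name `fetch_raw` and the UTF-8 default are assumptions for illustration, not part of the original notes. In Python 3, `read()` returns bytes, so the sketch decodes them to get a `str`.

```python
from urllib.request import urlopen


def fetch_raw(url, encoding="utf-8"):
    """Download a document and return it as text (hypothetical helper)."""
    # urlopen returns a file-like response; read() yields bytes in Python 3.
    with urlopen(url) as response:
        data = response.read()
    # Decode explicitly; the encoding is an assumption about the document.
    return data.decode(encoding)


# Usage (needs network access):
# raw = fetch_raw("http://www.gutenberg.org/files/2554/2554.txt")
# isinstance(raw, str)
```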
Reading a Wikipedia page with the method in 1.1 returns Access Denied. Use urllib2 instead and add a header manually so that Wikipedia treats the request as coming from a browser.
Use the urllib2 package
>>> import urllib2
Build an opener
>>> opener = urllib2.build_opener()
Add the header
>>> opener.addheaders = [('User-agent', 'Mozilla/5.0')]
Open the URL
>>> infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
>>> type(infile)
<type 'instance'>
Read in the raw document
>>> raw = infile.read()
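The same header trick can be sketched in Python 3, where `urllib2` no longer exists and `urllib.request` takes its place. The helper names `build_browser_request` and `fetch_html` are hypothetical, and the UTF-8 default is an assumption about the page encoding.

```python
from urllib.request import Request, urlopen


def build_browser_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, encoding="utf-8"):
    """Fetch a page with the custom header and decode it (hypothetical helper)."""
    with urlopen(build_browser_request(url)) as response:
        return response.read().decode(encoding)


# Usage (needs network access):
# raw = fetch_html("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")
```

Passing the header on the `Request` keeps it local to one call, whereas `opener.addheaders` applies it to every request made through that opener.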
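Import nltk, which all later steps need
>>> import nltk
Strip the HTML markup (skip this step for txt or other plain-text files)
>>> raw = nltk.clean_html(raw)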
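Note that `nltk.clean_html` is no longer implemented in NLTK 3.x, which points users to an HTML parser such as BeautifulSoup instead. As a dependency-free stand-in, here is a minimal sketch built on the standard library's `html.parser`. The names `_TextExtractor` and `strip_html` are hypothetical, and the result will not match the old `clean_html` output exactly.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```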
There is no built-in function for this, so locate the start and end of the content you need by hand and slice out the middle
>>> raw.find("PART I")
5303
>>> raw.rfind("End of Project Gutenberg's Crime")
1157681
>>> raw = raw[5303:1157681]
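The find/rfind/slice steps can be wrapped so the offsets are not hard-coded. This is a minimal sketch: `slice_between` is a hypothetical helper, and the explicit check is there because `find` and `rfind` return -1 when a marker is missing, which would otherwise produce a wrong slice without any error.

```python
def slice_between(raw, start_marker, end_marker):
    """Return raw from the first start_marker up to the last end_marker."""
    start = raw.find(start_marker)   # first occurrence, or -1
    end = raw.rfind(end_marker)      # last occurrence, or -1
    if start == -1 or end == -1 or end < start:
        raise ValueError("markers not found in the expected order")
    return raw[start:end]


# Usage with the markers from the session above:
# raw = slice_between(raw, "PART I", "End of Project Gutenberg's Crime")
```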
>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
<type 'list'>
>>> text = nltk.Text(tokens)
>>> type(text)
<class 'nltk.text.Text'>
>>> words = [w.lower() for w in text]
>>> vocab = sorted(set(words))
HTML----> ASCII (raw) ----> Text (tokens, text) ----> Vocab (words, vocab)
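The last stages of this pipeline (raw text to tokens to lowercase words to sorted vocabulary) can be sketched without NLTK installed. The regex tokenizer below is only a stand-in for `nltk.word_tokenize` and splits text differently in many cases. Both `simple_tokenize` and `build_vocab` are hypothetical names.

```python
import re


def simple_tokenize(raw):
    """Stand-in for nltk.word_tokenize: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", raw)


def build_vocab(raw):
    """Tokenize, lowercase, de-duplicate, and sort, mirroring the steps above."""
    tokens = simple_tokenize(raw)
    words = [w.lower() for w in tokens]
    return sorted(set(words))
```

Lowercasing before building the set is what merges "The" and "the" into one vocabulary entry.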