    A spider built with Python's HTMLParser (python spider)


    Recently our company website went through an inspection and a number of problems turned up. The tool I had been using is Xenu, which is fast and small (after writing MFC for so many years, it really feels familiar, heh).

     

    I only started learning Python in the last couple of days, so I wrote a spider of my own and gradually picked up some understanding of the language. The source code is shared below; feel free to play with it.

    The files are in UTF-8, but once Chinese comments are added they can no longer be debugged... no matter whether you put #encoding=utf-8 (or one of the other forms) in the first two lines of the file. If anyone has solved this, please let me know.
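    For reference only, the standard PEP 263 form of the declaration is sketched below; as noted above it did not fix the debugging problem in my setup, so treat it as a starting point rather than a solution.

        # -*- coding: utf-8 -*-
        # PEP 263 coding declaration: it must appear in the first or second line
        # of the file and must match the encoding the file is actually saved in.
        # 中文注释 (a Chinese comment) to test with
        print('encoding test')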

    A website spider implemented in Python
    1. Features: checks for dead links and empty titles. Only links belonging to the site are checked (including the main site and second-level sites), and duplicate URLs are filtered out so that no URL is scanned more than once.
    2. When a page contains errors, the error reason and the exact position are reported, so the tool can also be used to check whether the site's markup is well formed.
    3. Only pages are checked, not resources; .jpg|.jpeg|.avi|.png|.gif links are filtered out.
    4. Dependency: python-3.2rc1 must be installed (in the future it may be the release version; you can change the Python path in test.bat yourself). Do not use anything older than this: the rotating logging handlers in 3.1 have a bug.
    5. Some dead links cannot be detected: a dead link is sometimes redirected to another page, for example China Telecom jumps straight to its own page. This case cannot be handled for now; it did not occur when testing at the office, but it does happen over an ADSL connection (a rough sketch of one way this might be detected is given right after this list).
    6. Checking pages for markup errors: in spider.py,
        except HTMLParseError as e:
            # log.error("HTMLParseError : msg= %s, url= %s, pos= %s" % ( e.msg, url, lParser.getpos()))
            pass
    remove the "#" at the start of the second line to enable it (assuming you already understand Python's indentation rules). Because quite a few pages have minor problems, this is disabled by default.
    7. This is a small tool for checking the company website. It has been tested and works on several of our project sites, but no performance or compatibility testing has been done on other large sites; if you are interested, all the source is here, so have a go yourself.
    8. If you find a bug, please report it to me, or debug it yourself (my development environment is Eclipse + PyDev).

    Error messages:
    1. URL errors, usually dead links:
        [2011-02-09 11:12:11,392] ERROR : URLError : http://Hardware.global-supplier.com
        [2011-02-09 15:37:57,148] ERROR : HTTPError : code=400, msg= Bad Request, url= http://www.fashionalgifts.com/../company_info.html
    2. Page parsing errors; in the example below the content of a tag is malformed, and pos is the position of the error in the page:
        [2011-02-09 11:12:17,119] ERROR : msg= malformed start tag, HTMLParseError : http://fashion.global-supplier.com/cosmetic-bags/pu-cosmetic-bag-gb-c09.html pos= (238, 27)
    3. UrlJoin errors, usually caused by misuse of relative paths. Below, the first item is the page URL and the second is the href value; a first-level subdirectory is trying to go up two levels:
        UrlJoin Error:http://www.fashionalgifts.com/company-culture.html/, ../../company_info.html
    Such a URL still works fine in a browser... impressive, the browser quietly hides the mistake for us.
    4. Empty title:
        title is null, url=***
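    The redirect problem in item 5 is not handled by the tool. Below is a rough sketch (not part of spider.py) of one way it might be spotted: compare the host that was requested with the host of the final URL after redirects. The function name and the URL are made up for illustration.

        # Rough sketch: detect "dead links" that an ISP silently redirects to its
        # own page by comparing the requested host with the final host.
        import urllib.request
        import urllib.parse

        def looks_redirected_offsite(url):
            response = urllib.request.urlopen(url)
            requested_host = urllib.parse.urlparse(url).hostname
            final_host = urllib.parse.urlparse(response.geturl()).hostname
            return requested_host != final_host

        # print(looks_redirected_offsite('http://www.example.com/missing-page.html'))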

    spider.py (the main program)

    #Author:Paul Wang
    #Date:2011-01-26
    #Description: crawl in-site links, reporting dead links and empty titles
    from html.parser import HTMLParser
    from html.parser import HTMLParseError
    import urllib.request
    import urllib.error
    import urllib.parse
    import myLogger
    import myHtmlParse
    import sys

    debugMode = 1
    site = "http://www.geekzones.com/"
    charset = 'utf-8'
    logLever = 'debuG'

    try:
        def useage():
            print(" python spider.py siteUrl charSet logLever")
            print(" example: python spider.py http://www.baidu.com utf-8 debug")

        def getInnerURL(url, parentUrl, siteName, log):
            """Fetch one page, log its problems, then recurse into its in-site links."""
            global times
            global urlList
            lParser = myHtmlParse.MyParser(log)
            try:
                try:
                    if (url[-1] != '/'):
                        url += '/'
                    # skip URLs that have already been scanned
                    if url in urlList or len(url) == 0:
                        return
                    times += 1
                    urlList.append(url)
                    log.info('times = %d open url : %s' % (times, url))
                    lParser.currentLink = url
                    lParser.siteName = siteName
                    opener = urllib.request.build_opener()
                    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
                    htmlSource2 = opener.open(url).read()
                    # req = urllib.request.Request(url)
                    # htmlSource2 = urllib.request.urlopen(req).read()
                    lParser.feed(htmlSource2.decode(charset))
                    lParser.filterRepeatLink()
                    if (len(lParser.title) == 0):
                        log.error('title is null, url= %s' % url)
                except urllib.error.HTTPError as e:
                    log.error("HTTPError : code=%d, msg= %s, url= %s, parentUrl=%s"
                              % (e.code, e.msg, url, parentUrl))
                except urllib.error.URLError as e:
                    log.error("URLError : url= %s" % (url))
                except HTMLParseError as e:
                    log.error("HTMLParseError : msg= %s, url= %s, pos= %s, parentUrl=%s"
                              % (e.msg, url, lParser.getpos(), parentUrl))
            finally:
                lParser.close()
            for url1 in lParser.link:
                if url1 in urlList or len(url1) == 0:
                    continue
                getInnerURL(url1, url, lParser.siteName, log)

        # set up logging and read the command line arguments
        log = myLogger.myLogger('logging.config')
        if len(sys.argv) >= 4:    # sys.argv[3] is read below, so 4 arguments are needed
            site = sys.argv[1]
            charset = sys.argv[2]
            log.setLever(sys.argv[3])
        else:
            useage()
            if debugMode != 1:
                sys.exit()
        if debugMode == 1:
            log.setLever("debUG")

        # derive the site name used to decide whether a link belongs to the site
        urlParts = urllib.parse.urlparse(site)
        host = urlParts.scheme + '://' + urlParts.hostname
        siteName = host[host.find('.') + 1:].rstrip('/')
        print(siteName)
        # sys.exit()
        times = 0
        urlList = []
        getInnerURL(site, '/', siteName, log)
    except KeyboardInterrupt:
        sys.exit()
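    To make the in-site check easier to follow, here is the siteName derivation from the bottom of spider.py, run on the default start URL; the values in the comments are what the expressions evaluate to.

        # How spider.py turns the start URL into the siteName that myHtmlParse
        # uses to decide whether a link belongs to the site.
        import urllib.parse

        parts = urllib.parse.urlparse("http://www.geekzones.com/")
        host = parts.scheme + '://' + parts.hostname       # 'http://www.geekzones.com'
        siteName = host[host.find('.') + 1:].rstrip('/')   # 'geekzones.com'
        print(siteName)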

    myLogger.py

    #Author:Paul Wang
    #Date:2011-01-26
    #Description:logging wrapper
    import logging.config

    class myLogger:
        LEVELS = {'debug': logging.DEBUG,
                  'info': logging.INFO,
                  'warning': logging.WARNING,
                  'error': logging.ERROR,
                  'critical': logging.CRITICAL}

        def __init__(self, configName):
            print('init logger...')
            logging.config.fileConfig(configName)
            self.logger = logging.getLogger("simpleExample")

        def setLever(self, leverName):
            if (len(leverName) == 0):
                self.logger.setLevel(logging.NOTSET)
            else:
                self.logger.setLevel(self.LEVELS.get(leverName.lower(), logging.NOTSET))
            print('set lever %s...' % logging.getLevelName(self.logger.level))

        def writeLog(self, msg):
            self.logger.log(self.logger.level, msg)

        def debug(self, msg):
            self.logger.debug(msg)

        def info(self, msg):
            self.logger.info(msg)

        def warn(self, msg):
            self.logger.warn(msg)

        def error(self, msg):
            self.logger.error(msg)

        def critical(self, msg):
            self.logger.critical(msg)

        def __del__(self):
            print('destructor logger...')
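    A minimal usage sketch of the wrapper, assuming the logging.config file shown further below sits in the working directory; the logged URL is only illustrative.

        import myLogger

        log = myLogger.myLogger('logging.config')
        log.setLever('debug')    # level names are mapped through myLogger.LEVELS
        log.info('spider started')
        log.error('title is null, url= http://www.example.com/')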

    myHtmlParse.py

    #Author:Paul Wang
    #Date:2011-01-26
    from html.parser import HTMLParser
    import urllib.parse
    import re
    import myLogger

    class MyParser(HTMLParser):
        def __init__(self, log):
            HTMLParser.__init__(self)
            self.currentLink = ''
            self.link = []      # in-site links found on the current page
            self.exlink = []    # links that point outside the site
            self.host = ''
            self.siteName = ''  # like 'baidu.com'
            self.title = ''
            self.titleFlag = 0
            self.log = log

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                if len(attrs) == 0: pass
                for name, value in attrs:
                    if name == 'href':
                        # skip links to resources, only pages are checked
                        pattern = re.compile('(.jpg|.jpeg|.avi|.png|.gif)', re.I)
                        match = pattern.search(value)
                        if match:
                            continue
                        if len(value) and value[0] != '#' and value.find('@') == -1:
                            if value.find(self.siteName) != -1 \
                                    or value[0] == '/' \
                                    or value[0:2] == './' \
                                    or value[0:3] == '../':
                                if (self.currentLink[-1] != '/'):
                                    self.currentLink += '/'
                                # urljoin lives in urllib.parse in Python 3
                                url = urllib.parse.urljoin(self.currentLink, value)
                                if (url.find("..") != -1):
                                    self.log.error("UrlJoin Error:%s, %s" % (self.currentLink, value))
                                self.link.append(url)
                            else:
                                self.exlink.append(value.rstrip('/'))
            elif tag == 'title':
                if len(attrs) == 0: pass
                self.titleFlag = 1

        def handle_data(self, data):
            # the first data chunk after <title> is taken as the page title
            if self.titleFlag == 1 and len(self.title) == 0:
                self.title = data
                self.titleFlag = 0

        def filterRepeatLink(self):
            # drop duplicate links, keeping one copy of each URL
            # print('count1: %d' % len(self.link))
            seen = {}
            for keys in self.link:
                seen.setdefault(keys, '111')
            # print('count2: %d' % len(seen))
            self.link = list(seen.keys())
            # print('count2: %d' % len(self.link))
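    To see what the parser collects without crawling anything, the sketch below feeds it a hand-written HTML snippet; the page content and URLs are made up, and the logging.config file must be present for the logger to be created.

        import myLogger
        import myHtmlParse

        log = myLogger.myLogger('logging.config')
        parser = myHtmlParse.MyParser(log)
        parser.currentLink = 'http://www.example.com/'
        parser.siteName = 'example.com'
        parser.feed('<html><head><title>Demo</title></head>'
                    '<body><a href="/about.html">About</a>'
                    '<a href="http://other-site.example.org/">External</a></body></html>')
        parser.filterRepeatLink()
        print(parser.title)   # 'Demo'
        print(parser.link)    # ['http://www.example.com/about.html']
        parser.close()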

    logging.config (the configuration file; anyone who has used log4c or similar will understand it at a glance)

    [loggers]
    keys=root,example

    [handlers]
    keys=consoleHandler,timedRotatingFileHandler

    [formatters]
    keys=simpleFormatter

    [formatter_simpleFormatter]
    #format=[%(asctime)s]%(name)s : %(message)s
    format=[%(asctime)s] %(levelname)s : %(message)s

    [logger_root]
    level=DEBUG
    handlers=consoleHandler,timedRotatingFileHandler

    [logger_example]
    level=DEBUG
    handlers=consoleHandler,timedRotatingFileHandler
    qualname=example
    propagate=0

    [handler_consoleHandler]
    class=StreamHandler
    level=DEBUG
    formatter=simpleFormatter
    args=(sys.stdout,)

    [handler_rotateFileHandler]
    class=handlers.RotatingFileHandler
    level=DEBUG
    formatter=simpleFormatter
    args=('test.log', 'a', 1024*1024, 9)

    [handler_timedRotatingFileHandler]
    class=handlers.TimedRotatingFileHandler
    level=ERROR
    formatter=simpleFormatter
    args=('app.log','d')
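    A quick way to check that the file parses, assuming it is saved as logging.config next to the scripts: an ERROR record should reach both the console handler and the timed rotating file handler (app.log), since the file handler's level is ERROR.

        import logging
        import logging.config

        logging.config.fileConfig('logging.config')
        log = logging.getLogger('example')   # matches qualname=example above
        log.error('logging.config loaded')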
