Setting up a Nutch search engine

    Tech  2022-05-20

    Install the software first, and set NUTCH_JAVA_HOME to your Java installation path.
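
As a minimal sketch of that setting in a Cygwin or bash shell: the JDK path below is a hypothetical placeholder, not taken from this post, so substitute the directory where your own Java is installed.

```shell
# Hypothetical JDK location (an assumption); replace with your real install path.
export NUTCH_JAVA_HOME="/cygdrive/c/Java/jdk"

# Confirm the variable is visible to commands started from this shell.
echo "NUTCH_JAVA_HOME is $NUTCH_JAVA_HOME"
```

Putting the export line in your shell profile saves retyping it in every new session.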

     

    Then get started.

     

    Put a urls.txt file in the nutch directory, listing the pages you want to crawl.
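
A seed file like this can be created from the shell. This is only a sketch: it assumes one URL per line, and http://www.example.com/ is a placeholder rather than a site from this post.

```shell
# Write a seed list with one start URL per line (placeholder URL).
printf '%s\n' 'http://www.example.com/' > urls.txt

# Show the file so you can check what the crawler will start from.
cat urls.txt
```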

     

    I modified crawl-urlfilter.txt as follows:

    # accept hosts in MY.DOMAIN.NAME
    +^http://([a-z0-9]*\.)*
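
To get a feel for what that rule lets through, here is a rough check with grep -E. It assumes the rule is meant as `^http://([a-z0-9]*\.)*` (the leading `+` marks an accept rule and is not part of the regex). Nutch evaluates the pattern with Java regular expressions, so grep is only an approximation, and the `accepts` helper and the sample URLs are made up for illustration.

```shell
# Regex part of the filter rule, without the leading "+" (accept marker).
pattern='^http://([a-z0-9]*\.)*'

# Hypothetical helper: succeeds if the URL matches the pattern.
accepts() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Try a few sample URLs (placeholders) against the rule.
for url in 'http://www.example.com/' 'https://www.example.com/' 'ftp://example.com/'; do
  if accepts "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

Because the host part is optional in the pattern, the rule effectively accepts any URL that starts with http://, on any host.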

     

    Modify nutch-site.xml as follows:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>Jennifer</value>
        <description>HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization. NOTE: You should also check other related properties: http.robots.agents, http.agent.description, http.agent.url, http.agent.email, http.agent.version and set their values appropriately.</description>
      </property>
      <property>
        <name>http.agent.description</name>
        <value>Jennifer</value>
        <description>Further description of our bot - this text is used in the User-Agent header. It appears in parenthesis after the agent name.</description>
      </property>
      <property>
        <name>http.agent.url</name>
        <value>Jennifer</value>
        <description>A URL to advertise in the User-Agent header. This will appear in parenthesis after the agent name. Custom dictates that this should be a URL of a page explaining the purpose and behavior of this crawler.</description>
      </property>
      <property>
        <name>http.agent.email</name>
        <value>Jennifer</value>
        <description>An email address to advertise in the HTTP 'From' request header and User-Agent header. A good practice is to mangle this address (e.g. 'info at example dot com') to avoid spamming.</description>
      </property>
    </configuration>

     

    In Cygwin, run: bin/nutch crawl urls.txt -dir /myDir -depth 3 >& crawl.log

     

    This creates a myDir directory alongside the nutch directory, containing the crawl results. crawl.log is the log file, written to the nutch root directory.
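
The `>& crawl.log` part of the crawl command is a bash shorthand that sends both standard output and standard error to the file; the portable spelling is `> crawl.log 2>&1`. A small sketch with a stand-in command instead of the crawler (demo.log is just an illustrative file name):

```shell
# Stand-in for a command that writes to both stdout and stderr.
{
  echo "progress message on stdout"
  echo "warning message on stderr" 1>&2
} > demo.log 2>&1

# Both lines end up in the same log file.
cat demo.log
```

While a crawl is running you can watch its progress from a second terminal with tail -f crawl.log.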

     

    Next, deploy the nutch .war file from the nutch root directory to Tomcat.

     

    In the deployed webapp, edit nutch-site.xml under WEB-INF/classes as follows:

     

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
      <property>
        <name>searcher.dir</name>
        <value>E:/myDir</value>
      </property>
    </configuration>

     

    OK, that's it. Enjoy!

