nutch draft

    Technology 2022-05-20

    The Crawl Database is a data store where Nutch stores every URL, together with the metadata that it knows about.

     

    In Hadoop terms it is a Sequence file (meaning all records are stored sequentially) consisting of tuples of URL and CrawlDatum.

     

    Operations (like inserts, deletes and updates) on the Crawl Database and other data are processed in batch mode. Here is an example of the contents of crawldb:

     

    http://www.example.com/page1.html -> status=..., fetchTime=..., retries=..., fetchInterval=..., ...
    http://www.example.com/page2.html -> status=..., fetchTime=..., retries=..., fetchInterval=..., ...
    http://www.example.com/page3.html -> status=..., fetchTime=..., retries=..., fetchInterval=..., ...
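
    As a rough illustration of the record layout shown above, the crawldb can be modeled as a map from URL to a small record. This is a minimal sketch, not Nutch's Java implementation: the CrawlDatum class below is a hypothetical stand-in, and its field types and default values are assumptions.

```python
from dataclasses import dataclass


# Hypothetical stand-in for Nutch's CrawlDatum. The field names follow the
# listing above; the types and default values are assumptions.
@dataclass
class CrawlDatum:
    status: str = "db_unfetched"
    fetch_time: int = 0                   # epoch seconds of the next fetch
    retries: int = 0
    fetch_interval: int = 30 * 24 * 3600  # seconds between re-fetches


# The crawl database modeled as an in-memory map keyed by URL.
crawldb = {
    "http://www.example.com/page2.html": CrawlDatum(
        status="db_fetched", fetch_time=1_700_000_000
    ),
    "http://www.example.com/page1.html": CrawlDatum(),
}


def dump(db):
    """Render records in sorted URL order, one line per record."""
    return [
        f"{url} -> status={d.status}, fetchTime={d.fetch_time}, "
        f"retries={d.retries}, fetchInterval={d.fetch_interval}"
        for url, d in sorted(db.items())
    ]


for line in dump(crawldb):
    print(line)
```

    Sorting by URL before printing mimics the sequential, key-ordered layout of the on-disk file.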

    The Link Database is a data structure (Sequence file, URL -> Inlinks) that contains all inverted links.

     

    In the parsing phase Nutch can extract outlinks from a document and store them in the format source_url -> target_url, anchor_text.

     

    Inject

    The Inject command in Nutch has one responsibility: inject more URLs into the Crawl Database. Normally you should collect a set of URLs to add and then process them in one batch, so that the cost per inserted URL stays small.

     

    Job 1: Convert plain text into (URL, CrawlDatum) tuples and dedupe (MapReduce task)

    Job 2: Merge with the existing CrawlDB and dedupe (MapReduce task)
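
    The two jobs above can be sketched in a few lines of single-process Python. This is only an approximation of the idea, not the MapReduce code Nutch runs: the function names, the dict-based records and the rule that an existing record wins over a newly injected one are all assumptions made for illustration.

```python
def parse_seeds(text):
    """Job 1 (sketch): plain-text seed lines -> {url: datum}, deduplicated."""
    seeds = {}
    for line in text.splitlines():
        url = line.strip()
        if not url or url.startswith("#"):
            continue  # skip blank lines and comment lines
        # setdefault keeps the first occurrence, so duplicates collapse
        seeds.setdefault(url, {"status": "injected", "retries": 0})
    return seeds


def merge_into_crawldb(crawldb, seeds):
    """Job 2 (sketch): merge seeds into the db; known URLs keep their record."""
    merged = dict(crawldb)
    for url, datum in seeds.items():
        if url not in merged:  # assumption: existing records are not overwritten
            merged[url] = {**datum, "status": "db_unfetched"}
    return merged


existing = {
    "http://www.example.com/page1.html": {"status": "db_fetched", "retries": 0},
}
seed_text = """
http://www.example.com/page1.html
http://www.example.com/page2.html
http://www.example.com/page2.html
"""
crawldb = merge_into_crawldb(existing, parse_seeds(seed_text))
print(sorted(crawldb))
```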

    Generate

    The Generate command in Nutch is used to generate a list of URLs to fetch from the Crawl Database. URLs with the highest scores are preferred.
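
    The selection step can be illustrated as a top-N query over the crawldb. The sketch below is hedged: the "score" and "fetch_time" keys, the top_n parameter and the rule that only URLs whose fetch time has passed are eligible are assumptions for the example, not a description of Nutch's Generator class.

```python
import heapq


def generate_fetch_list(crawldb, now, top_n):
    """Return up to top_n URLs that are due for fetching, best score first."""
    # Assumption: a URL is eligible once its scheduled fetch time has passed.
    due = [(url, d) for url, d in crawldb.items() if d["fetch_time"] <= now]
    best = heapq.nlargest(top_n, due, key=lambda item: item[1]["score"])
    return [url for url, _ in best]


crawldb = {
    "http://www.example.com/a.html": {"score": 0.2, "fetch_time": 0},
    "http://www.example.com/b.html": {"score": 0.9, "fetch_time": 0},
    "http://www.example.com/c.html": {"score": 0.5, "fetch_time": 10**12},
    "http://www.example.com/d.html": {"score": 0.4, "fetch_time": 0},
}
fetch_list = generate_fetch_list(crawldb, now=1000, top_n=2)
print(fetch_list)
```

    Here c.html is skipped although its score is high, because in this sketch it is not yet due.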

     

     

    Fetch

    Fetcher is responsible for fetching content from URLs and writing it to disk. It can also optionally parse the content. URLs are read from a Fetch List generated by Generator.
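
    A minimal sketch of the fetch loop follows. To keep it runnable without network access, the download function is passed in as a parameter and replaced by a fake in the demo. The names fetch_all and fake_fetch, the one-file-per-URL layout and the SHA-1 file names are all hypothetical; Nutch itself writes fetched content into segment files.

```python
import hashlib
import tempfile
from pathlib import Path


def fetch_all(fetch_list, fetch, out_dir):
    """Fetch every URL in the list and write the raw bytes to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = {}
    for url in fetch_list:
        try:
            content = fetch(url)
        except OSError as err:
            # Record the failure so a later update step can count a retry.
            results[url] = {"ok": False, "error": str(err)}
            continue
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".bin"
        path = out / name
        path.write_bytes(content)
        results[url] = {"ok": True, "path": str(path)}
    return results


def fake_fetch(url):
    """Stand-in for an HTTP client; fails for one URL to show error handling."""
    if url.endswith("missing.html"):
        raise OSError("404")
    return b"<html>" + url.encode("utf-8") + b"</html>"


page1 = "http://www.example.com/page1.html"
missing = "http://www.example.com/missing.html"
with tempfile.TemporaryDirectory() as tmp:
    results = fetch_all([page1, missing], fake_fetch, tmp)
    stored = Path(results[page1]["path"]).read_bytes()
print(results[page1]["ok"], results[missing]["ok"])
```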

     

    Parse

    Parser reads raw fetched content, parses it and stores the results.
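
    For HTML content, one result of parsing is the list of outlinks with their anchor texts. The sketch below uses Python's standard html.parser module to show the idea. It is not Nutch's parser plugin; the class and function names are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OutlinkParser(HTMLParser):
    """Collects (target_url, anchor_text) pairs from <a href=...> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []
        self._href = None   # target of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = " ".join("".join(self._text).split())
            self.outlinks.append((self._href, anchor))
            self._href = None


def parse_outlinks(base_url, html):
    parser = OutlinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.outlinks


html = (
    '<p>See <a href="/page2.html">page  two</a> and '
    '<a href="http://other.example.org/">other</a>.</p>'
)
outlinks = parse_outlinks("http://www.example.com/page1.html", html)
print(outlinks)
```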

     

    UpdateDB

     

    The UpdateDB command reads the CrawlDatums from a Segment (extracted URLs) and merges them into the existing CrawlDB.
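
    The merge can be pictured as follows. This is a simplified sketch under stated assumptions: the segment is represented as a dict with hypothetical "fetched" and "outlinks" keys, a successful fetch resets the retry counter, and newly discovered URLs enter the database as unfetched. Nutch's real update logic handles many more states.

```python
def update_db(crawldb, segment):
    """Merge fetch results and newly discovered URLs into a copy of crawldb."""
    merged = {url: dict(d) for url, d in crawldb.items()}
    for url, result in segment["fetched"].items():
        datum = merged.setdefault(url, {"status": "db_unfetched", "retries": 0})
        if result["ok"]:
            datum["status"] = "db_fetched"
            datum["retries"] = 0
        else:
            datum["retries"] += 1  # try again in a later crawl cycle
    for url in segment["outlinks"]:
        # Known URLs keep their record; new ones start as unfetched.
        merged.setdefault(url, {"status": "db_unfetched", "retries": 0})
    return merged


p1 = "http://www.example.com/page1.html"
p2 = "http://www.example.com/page2.html"
p3 = "http://www.example.com/page3.html"
crawldb = {
    p1: {"status": "db_unfetched", "retries": 0},
    p2: {"status": "db_unfetched", "retries": 0},
}
segment = {
    "fetched": {p1: {"ok": True}, p2: {"ok": False}},
    "outlinks": [p3, p1],
}
updated = update_db(crawldb, segment)
print(updated)
```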

     

    Invert links

    Inverts link information so that anchor texts from other documents that point to a document can be used together with the rest of that document's data.
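
    Link inversion itself is a small transformation: every source -> (target, anchor) pair becomes a target -> (source, anchor) pair. Here is a minimal in-memory sketch, assuming the outlinks are already available as a dict. The function name and data layout are illustrative only, and Nutch performs this step as a batch job over segments.

```python
from collections import defaultdict


def invert_links(outlinks_by_source):
    """Turn source -> [(target, anchor)] into target -> [(source, anchor)]."""
    inlinks = defaultdict(list)
    for source, outlinks in outlinks_by_source.items():
        for target, anchor in outlinks:
            inlinks[target].append((source, anchor))
    return dict(inlinks)


p1 = "http://www.example.com/page1.html"
p2 = "http://www.example.com/page2.html"
p3 = "http://www.example.com/page3.html"
outlinks_by_source = {
    p1: [(p3, "third page"), (p2, "second page")],
    p2: [(p3, "see also")],
}
inverted = invert_links(outlinks_by_source)
print(inverted[p3])
```

    After inversion, all anchor texts pointing at page3.html are available under that one key, which is what makes them usable alongside the page's own content.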

     

     

     

