Focused Crawling Implementation
- We implemented our focused crawling approach on top of the Nutch open-source crawler, which is built on the MapReduce programming model. All of our changes also follow the MapReduce model, so that the code remains simple and scalable.
- Code written in the MapReduce model is easy to parallelize automatically on a large cluster of machines. The run-time system is responsible for partitioning the input data, scheduling the program on a set of machines, handling the failure of any machine and managing inter-machine communication; the programmer is not required to take care of these details. Thus even a programmer without much knowledge of parallel and distributed systems can easily write a program that utilizes the resources of a distributed system. Moreover, the whole dataset need not be loaded into main memory at once, since the model processes each record independently. This property lets the model work efficiently on large real-world datasets.
- In the MapReduce programming model the programmer is required to provide only two functions: a map function and a reduce function. As described in figure 1, the run-time system first partitions the input data into splits, one for each instance of the map function. Several instances of the map function are allowed to run on different machines. The run-time system passes the records of each split, one by one, to the respective map function as key-value pairs. Thus the programmer has to provide a map function that operates on an individual record and outputs intermediate key-value pairs. The run-time system combines the values that share the same key, and each key with its vector of values is passed to a particular instance of the reduce function. Thus the programmer has to provide a reduce function that operates on a single key with multiple values and outputs a single key-value pair. In the MapReduce model the map and reduce functions operate on one key-value pair at a time. Many real-world large-scale data processing tasks are easily expressible in this model.
- Let us understand the MapReduce model with an example. Suppose we have a large input file and we want to count how many times each word occurs in it. The output file should contain <word, count> pairs. The run-time system splits the file and passes lines one by one to a particular instance of the map function. The map function takes a line as input and, for each word in that line, generates an intermediate key-value pair, where the key is the word and the value is the integer 1. The run-time system combines the values that share the same word (key). Thus, if some word occurs three times in the input file, the corresponding vector will be (1,1,1). For each word the run-time system passes the word and the corresponding vector to the reduce function. The reduce function simply adds the values in the vector and outputs a key-value pair, where the value is the sum of the values in the vector. The run-time system writes these key-value pairs one by one to an output file. Thus the programmer has to write just two simple functions, map and reduce, as described above.
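For concreteness, the word-count example looks roughly as follows in Hadoop, the MapReduce framework Nutch runs on; this is a minimal sketch, not code from our system:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each word in the input line, emit an intermediate <word, 1> pair.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate <word, 1> pair
        }
    }
}

// Reduce: sum the 1s collected for each word and emit <word, count>.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();             // add up the (1,1,...,1) vector
        }
        context.write(word, new IntWritable(sum)); // final <word, count> pair
    }
}
```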
- The important point here is that several map and reduce tasks can run in parallel on several machines. If we want to use 3 machines for the word-count task, the run-time system creates an instance of the map and reduce functions on each machine. The input file is partitioned into 3 splits, and each map task is responsible for one of them. When all map tasks complete, the run-time system combines the values that share the same word. Rules for partitioning the intermediate <word, value> pairs can be defined, for example: words starting with 'a' to 'i' go to the first reduce task, words starting with 'j' to 'o' go to the second reduce task, and the remaining words go to the third. When all reduce tasks complete their work, the run-time system combines all output files.
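Such a partitioning rule corresponds to a custom partitioner. A minimal sketch in Hadoop terms, assuming <word, count> intermediate pairs as in the word-count example; the three-way split by first letter is just our example rule, not a framework default:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route intermediate <word, value> pairs to one of three reduce tasks
// by the word's first letter, as in the example rule above.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text word, IntWritable value, int numReduceTasks) {
        String w = word.toString();
        if (numReduceTasks < 3 || w.isEmpty()) {
            return 0;                               // degenerate cases
        }
        char first = Character.toLowerCase(w.charAt(0));
        if (first >= 'a' && first <= 'i') return 0; // first reduce task
        if (first >= 'j' && first <= 'o') return 1; // second reduce task
        return 2;                                   // remaining words
    }
}
```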
Thus, if we have several machines, the MapReduce model can easily speed up the process. Moreover, the programmer has to write just two simple functions, map and reduce; the run-time system takes care of the distributed environment. Programs written using the MapReduce model scale well because only one key-value pair needs to stay in memory at a time.
How Nutch works
- Nutch is open-source software that contains modules for crawling, indexing and searching. The user is required to provide a file containing seed urls. Nutch crawls a predefined number of web pages starting from the given seed urls.
As shown in figure 2, the injector first injects the seed urls into the crawl database as unfetched urls. The crawl database contains <url, crawldatum> key-value pairs, where the crawldatum holds the necessary information about the url: whether it is fetched, unfetched or linked, the last fetch time, a signature, etc. Then the generate-fetch-parse-update cycle runs N times, where N is a user-defined parameter. In each cycle a new segment is generated, which contains all the required data for the topN best-score urls at segment-generation time.
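Schematically, the cycle looks like the following loop; the interface below is a stand-in for Nutch's modules, and the method names are illustrative, not Nutch's actual API:

```java
// Schematic of the generate-fetch-parse-update cycle from figure 2.
// The interface is a stand-in for Nutch's modules; names are illustrative.
interface NutchModules {
    void inject(String crawlDb, String seedFile);     // injector
    String generate(String crawlDb, int topN);        // generator, returns new segment
    void fetch(String segment);                       // fetcher
    void parse(String segment);                       // parser
    void update(String crawlDb, String segment);      // update module
}

final class CrawlDriver {
    static void crawl(NutchModules nutch, String crawlDb,
                      String seeds, int n, int topN) {
        nutch.inject(crawlDb, seeds);                 // seed urls enter as unfetched
        for (int i = 0; i < n; i++) {                 // N user-defined cycles
            String segment = nutch.generate(crawlDb, topN); // topN best-score urls
            nutch.fetch(segment);                     // pages into content database
            nutch.parse(segment);                     // parse text, data, crawldatum
            nutch.update(crawlDb, segment);           // merge into crawl database
        }
    }
}
```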
The generator module selects the topN unfetched urls from the crawl database and puts them in the fetch list of the newly generated segment, where topN is a user-defined parameter. As described in algorithm 1, the generator is divided into two MapReduce tasks. The first MapReduce task selects the topN best-score urls. It processes the <url, crawldatum> records from the crawl database; only if the status of a url is unfetched does the map function pass the record, as a <score, (url, crawldatum)> pair, to the run-time system. The run-time system sorts all the records based on inlink score and passes them to the reduce task in decreasing order. The reduce task outputs only the first topN records. The second MapReduce task is required to prepare the records as <url, crawldatum> pairs instead of <score, (url, crawldatum)> pairs.
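A simplified sketch of the first MapReduce task of the generator follows the description above; the CrawlDatum field names and the topN configuration key are how Nutch-style records are commonly accessed, but treat the details as illustrative rather than Nutch's actual generator code:

```java
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.nutch.crawl.CrawlDatum;

// Map: emit unfetched urls keyed by score so the framework sorts them.
class SelectorMapper extends Mapper<Text, CrawlDatum, FloatWritable, Text> {
    @Override
    protected void map(Text url, CrawlDatum datum, Context context)
            throws IOException, InterruptedException {
        if (datum.getStatus() == CrawlDatum.STATUS_DB_UNFETCHED) {
            // negate the score so the default ascending sort delivers
            // records to the reducer in decreasing-score order
            context.write(new FloatWritable(-datum.getScore()), url);
        }
    }
}

// Reduce: keep only the first topN records seen in sorted order.
class SelectorReducer extends Reducer<FloatWritable, Text, Text, FloatWritable> {
    private long emitted = 0;
    private long topN;

    @Override
    protected void setup(Context context) {
        // "generate.topN" is an illustrative configuration key
        topN = context.getConfiguration().getLong("generate.topN", 1000);
    }

    @Override
    protected void reduce(FloatWritable negScore, Iterable<Text> urls, Context context)
            throws IOException, InterruptedException {
        for (Text url : urls) {
            if (emitted++ >= topN) return;          // topN best-score urls only
            context.write(url, new FloatWritable(-negScore.get()));
        }
    }
}
```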
The fetcher module fetches all the urls from the fetch list and stores the web pages in the content database. The parser module parses the content of the web pages (algorithm 2) and generates three separate databases: the first contains <url, parse text> pairs, which are required for indexing; the second contains <url, parse data> pairs, which store outlink details and are required by the update module; and the third contains <url, crawldatum> pairs for the urls fetched in the previous fetch task. The update module combines the information from the previous crawl database, the parsed data (second database from the parse module) and the crawldatum data (third database from the parse module) into one new crawl database.
Once the crawling is completed, the invertlink module inverts all links using the parsed data to obtain anchor text, which is required for indexing. The indexer module then generates the index using the anchor text produced by the invertlink module and the parse text produced by the parser module.
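Link inversion is itself a natural MapReduce task; a sketch of its map step using Nutch-style outlink records (treat the exact class and method names as assumptions):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;

// For each outlink (target url, anchor text) recorded in a page's parse
// data, emit <target url, anchor text>. After the shuffle, every url is
// grouped with the anchor text of all links pointing *to* it.
class InvertLinksMapper extends Mapper<Text, ParseData, Text, Text> {
    @Override
    protected void map(Text fromUrl, ParseData parseData, Context context)
            throws IOException, InterruptedException {
        for (Outlink outlink : parseData.getOutlinks()) {
            context.write(new Text(outlink.getToUrl()),
                          new Text(outlink.getAnchor()));
        }
    }
}
```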
Focused Crawling Implementation
- Nutch is a general crawler. To make it a focused crawler, we plugged our ordinal regression module into the Nutch code and changed the generator module so that it puts the topN most relevant urls in the fetch list.
- We assume that the user provides level files along with the seed file. We train the ordinal regression model before starting the focused crawl. As described in algorithm 3, the training module processes the level files one by one, as sketched below. The injector injects the urls from each level file into the crawl database of the respective level. The generator creates a new segment and generates a fetch list that contains all urls from the respective crawl database. The parser module parses the web pages and stores the parse text in one file.
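A schematic of this per-level training pass; the interface below is a stand-in for the Nutch modules we reuse, and the method names are ours:

```java
// Schematic of the training pipeline of algorithm 3. Each level file holds
// urls of one relevance level; every level gets its own crawl database.
interface LevelCrawl {
    void inject(String crawlDb, String levelFile);   // injector
    String generateAll(String crawlDb);              // fetch list with ALL level urls
    void fetch(String segment);                      // fetcher
    void parseToFile(String segment);                // parse text into one file
}

final class TrainingDriver {
    static void prepare(LevelCrawl nutch, String[] levelFiles) {
        for (int level = 0; level < levelFiles.length; level++) {
            String crawlDb = "crawldb_level_" + level; // per-level crawl database
            nutch.inject(crawlDb, levelFiles[level]);
            String segment = nutch.generateAll(crawlDb); // no topN cap in training
            nutch.fetch(segment);
            nutch.parseToFile(segment);                // parse text for vectorization
        }
    }
}
```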
- We used the MapReduce model to convert the text into a raw vector file, as described in algorithm 4. The map function is the identity function. The reduce function receives the parse text of one web page at a time; it converts the text into raw vector format and stores new words in the word list, which also records each word's document count (the number of documents containing that word). The raw vector format is required by the CBOR algorithm.
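A minimal sketch of such a reduce function, assuming a simple punctuation-splitting tokenizer and an in-memory word list (persisting the word list, e.g. in a cleanup step, is omitted here), with the output vector written as a formatted text line for illustration:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce step of algorithm 4, simplified: one web page's parse text comes in,
// a raw term-frequency vector goes out, and the word list tracks for each
// word the number of documents containing it.
class RawVectorReducer extends Reducer<Text, Text, Text, Text> {
    private final Map<String, Integer> wordDocCount = new HashMap<>(); // word list

    @Override
    protected void reduce(Text url, Iterable<Text> parseTexts, Context context)
            throws IOException, InterruptedException {
        Map<String, Integer> rawVector = new HashMap<>();
        for (Text parseText : parseTexts) {                  // one page's text
            for (String word : parseText.toString().toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    rawVector.merge(word, 1, Integer::sum);  // raw term frequency
                }
            }
        }
        for (String word : rawVector.keySet()) {
            wordDocCount.merge(word, 1, Integer::sum);       // document count
        }
        context.write(url, new Text(rawVector.toString())); // raw vector record
    }
}
```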
- Once all levels are processed, all raw vector files are combined into one file. The raw vector file is then converted into a tf-idf vector file using the word list. The chi-squared feature selection method is used to reduce the number of features, and the ordinal regression model is trained using only the selected features.
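For illustration, converting one raw term-frequency vector into tf-idf weights using the word list's document counts; the weighting scheme, tf times log(N/df), is the standard one and is our assumption here:

```java
import java.util.HashMap;
import java.util.Map;

// Turn a raw term-frequency vector into tf-idf weights using the word
// list's per-word document counts.
public final class TfIdf {
    public static Map<String, Double> toTfIdf(Map<String, Integer> rawVector,
                                              Map<String, Integer> docCounts,
                                              int totalDocs) {
        Map<String, Double> tfidf = new HashMap<>();
        for (Map.Entry<String, Integer> e : rawVector.entrySet()) {
            int df = docCounts.getOrDefault(e.getKey(), 1); // docs containing word
            double idf = Math.log((double) totalDocs / df);
            tfidf.put(e.getKey(), e.getValue() * idf);      // tf * idf weight
        }
        return tfidf;
    }
}
```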
- The parse module is changed as shown in algorithm 5 to assign the orscore of a parent page to all of its outlinks. After parsing a web page, the test method of the ordinal regression module is called to get the orscore for that page. The orscore is assigned to all the outlinks of that page, which are then stored in the parse data database. The generator module is changed to ensure that it selects the topN urls based on orscore instead of inlink score.
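A sketch of the orscore assignment step, with minimal stand-in types for the trained model and the outlink records:

```java
import java.util.List;

// After parsing (algorithm 5): score the page with the trained ordinal
// regression model and copy that orscore onto every outlink. The types
// below are minimal stand-ins for the real parse records.
final class OrScoreAssigner {

    interface OrdinalRegressionModel {     // hypothetical wrapper for our model
        float test(String pageText);       // returns the page's orscore
    }

    static final class Outlink {           // stand-in for an outlink record
        final String toUrl;
        float score;
        Outlink(String toUrl) { this.toUrl = toUrl; }
    }

    static void assignOrScore(OrdinalRegressionModel model,
                              String parseText, List<Outlink> outlinks) {
        float orscore = model.test(parseText);   // relevance of the parent page
        for (Outlink link : outlinks) {
            link.score = orscore;                // outlinks inherit parent's orscore
        }
        // the outlinks, now carrying orscore, go to the parse data database;
        // the generator then ranks candidate urls by this score
    }
}
```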
- Our new module and all of our changes follow the MapReduce model. This is necessary because if even one module does not follow the MapReduce model, that module slows down the whole system and the system is no longer scalable.