Research on an Efficient Method for a Deep Web Crawler Based on Accuracy
Keywords: deep web, web mining, feature selection, ranking

Abstract
Due to the large volume of web resources and the dynamic nature of the deep web, achieving wide coverage and high efficiency is a challenging problem. We propose a three-stage framework for efficiently harvesting deep web interfaces. In the first stage, the crawler performs site-based searching for center pages with the help of search engines, which avoids visiting a large number of pages. To achieve more accurate results for a focused crawl, the crawler ranks websites so that those most relevant to a given topic are prioritized. In the second stage, the proposed system opens the web pages internally within the application with the help of the Jsoup API and pre-processes them. It then counts the occurrences of the query terms in each page.
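As a minimal sketch of this second stage, the fragment below opens a page internally with the Jsoup API, extracts its visible text, applies a very simple pre-processing step (lower-casing and tokenisation), and counts occurrences of the query terms. Apart from the Jsoup calls, the class and method names are hypothetical and simplified relative to the actual implementation.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import java.io.IOException;
    import java.util.Locale;

    public class QueryWordCount {

        // Fetch a page, normalize its text, and count how often the query terms occur.
        // Pre-processing (stop-word removal, stemming, etc.) is deliberately simplified here.
        public static int countQueryTerms(String url, String[] queryTerms) throws IOException {
            Document doc = Jsoup.connect(url).get();                               // open the page internally via Jsoup
            String[] tokens = doc.text().toLowerCase(Locale.ROOT).split("\\W+");   // basic pre-processing
            int count = 0;
            for (String token : tokens) {
                for (String term : queryTerms) {
                    if (token.equals(term.toLowerCase(Locale.ROOT))) {
                        count++;
                    }
                }
            }
            return count;
        }

        public static void main(String[] args) throws IOException {
            int hits = countQueryTerms("http://example.com", new String[]{"deep", "web"});
            System.out.println("Query term occurrences: " + hits);
        }
    }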
proposed system performs frequency analysis based on TF and IDF. It also uses a combination of TF*IDF for ranking web
pages. To eliminate bias on visiting some highly relevant links in hidden web directories, In proposed work, we design a link
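For reference, one standard form of the TF*IDF weight referred to above is

\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \times \log \frac{N}{\mathrm{DF}(t)}

where TF(t, d) is the number of occurrences of term t in page d, DF(t) is the number of crawled pages containing t, and N is the total number of crawled pages; a page's rank score can then be taken as the sum of these weights over the query terms. The exact variant and normalisation used in the proposed system are not specified in this abstract and may differ.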
To eliminate bias toward visiting only certain highly relevant links in hidden web directories, the proposed work designs a link tree data structure to achieve wider coverage of a website. Experimental results on a set of representative domains show the agility and accuracy of the proposed crawler framework, which efficiently retrieves deep-web interfaces from large-scale sites and achieves higher harvest rates than other crawlers using the Naïve Bayes algorithm.
This paper covers the work up to the second stage: the proposed system uses the KNN algorithm, opens the web pages internally within the application with the help of the Jsoup API, pre-processes them, and then counts the occurrences of the query terms in each page.
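A rough sketch of how a K-nearest-neighbour (KNN) step might label a fetched page as relevant or not is given below; the feature representation (e.g., TF*IDF vectors), the Euclidean distance measure, and the majority vote are assumptions made for illustration, not details taken from the paper.

    import java.util.Arrays;
    import java.util.Comparator;

    public class KnnRelevanceSketch {

        // Labeled training example: a feature vector (e.g., TF*IDF weights) plus a relevance label.
        static class Example {
            double[] features;
            boolean relevant;
            Example(double[] features, boolean relevant) {
                this.features = features;
                this.relevant = relevant;
            }
        }

        // Euclidean distance between two feature vectors.
        static double distance(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return Math.sqrt(sum);
        }

        // Classify a page by majority vote among its k nearest labeled neighbours.
        static boolean classify(Example[] training, double[] page, int k) {
            Example[] sorted = training.clone();
            Arrays.sort(sorted, Comparator.comparingDouble(e -> distance(e.features, page)));
            int relevantVotes = 0;
            for (int i = 0; i < k && i < sorted.length; i++) {
                if (sorted[i].relevant) relevantVotes++;
            }
            return relevantVotes > Math.min(k, sorted.length) / 2;
        }

        public static void main(String[] args) {
            Example[] training = {
                new Example(new double[]{0.9, 0.1}, true),
                new Example(new double[]{0.8, 0.2}, true),
                new Example(new double[]{0.1, 0.9}, false)
            };
            boolean relevant = classify(training, new double[]{0.85, 0.15}, 3);
            System.out.println("Page classified as relevant: " + relevant);
        }
    }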