Scrapy rules: target specified URLs only
I am using Scrapy to browse and collect data, but I'm finding that the spider is crawling lots of unwanted pages. I'd prefer that the spider start from a set of defined pages, parse the content on those pages, and then finish. I've tried to implement a rule like the one below, but it's still crawling a whole series of other pages as well. Any suggestions on how to approach this?
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_adlinks', follow=False),
    )
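For context, the rule sits in a CrawlSpider roughly like the sketch below (the spider name, start URL, and extracted fields are simplified placeholders, and I've swapped in the current LinkExtractor for the deprecated SgmlLinkExtractor):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class AdSpider(CrawlSpider):
        # placeholder name and start page, not the real ones
        name = 'ads'
        start_urls = ['http://example.com/listings']

        # with no arguments the extractor matches every link on the start pages,
        # so each of those links still gets requested and passed to the callback;
        # follow=False only stops the crawl from going one level deeper
        rules = (
            Rule(LinkExtractor(), callback='parse_adlinks', follow=False),
        )

        def parse_adlinks(self, response):
            # placeholder parse logic
            yield {'url': response.url, 'title': response.css('title::text').get()}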
Thanks!
Your extractor is extracting every link because it doesn't have any rule arguments set.
If you take a look at the official documentation, you'll notice that Scrapy's link extractors have lots of parameters you can set to customize what they extract.
For example:
    rules = (
        # only links on specific domains
        Rule(LxmlLinkExtractor(allow_domains=['scrapy.org', 'blog.scrapy.org']), <..>),
        # only links that match a specific regex
        Rule(LxmlLinkExtractor(allow=r'.+?/page\d+\.html'), <..>),
        # don't crawl specific file extensions
        Rule(LxmlLinkExtractor(deny_extensions=['pdf', 'html']), <..>),
    )
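As a rough, self-contained sketch of how those parameters fit into a full spider (the spider name, start URL, and callback are made-up placeholders; LinkExtractor is the default lxml-based extractor in current Scrapy):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class FilteredSpider(CrawlSpider):
        # placeholder name and start page
        name = 'filtered'
        start_urls = ['https://blog.scrapy.org/']

        rules = (
            # follow only links that stay on these domains
            Rule(LinkExtractor(allow_domains=['scrapy.org', 'blog.scrapy.org']),
                 callback='parse_page', follow=True),
            # also pick up paginated pages matching a regex
            Rule(LinkExtractor(allow=r'.+?/page\d+\.html'),
                 callback='parse_page', follow=True),
        )

        def parse_page(self, response):
            # placeholder callback: record the URL of every page crawled
            yield {'url': response.url}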
You can also set allowed_domains on the spider if you don't want it to wander off somewhere:

    class MySpider(scrapy.Spider):
        allowed_domains = ['scrapy.org']  # only crawl pages on this domain
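With allowed_domains set, Scrapy's built-in offsite filtering drops any request whose domain isn't listed, so even a broad extractor can't pull the crawl onto other sites. A minimal sketch (spider name and start URL are placeholders):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class DomainBoundSpider(CrawlSpider):
        # placeholder name and start page
        name = 'domain_bound'
        allowed_domains = ['scrapy.org']
        start_urls = ['https://scrapy.org/']

        # requests to domains outside allowed_domains are filtered out
        # before they are ever downloaded
        rules = (
            Rule(LinkExtractor(), callback='parse_page', follow=False),
        )

        def parse_page(self, response):
            yield {'url': response.url}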