Scrapy rules: target specified URLs only
I am using Scrapy to browse and collect data, but I'm finding that the spider is crawling lots of unwanted pages. I'd prefer that the spider start from a set of defined pages, parse the content on those pages, and then finish. I've tried to implement a rule like the one below, but it's still crawling a whole series of other pages as well. Any suggestions on how to approach this?
    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_adlinks', follow=False),
    )
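For context, the rule sits in a CrawlSpider roughly like the sketch below (the spider name, start URL, and extracted fields are simplified placeholders, and I've swapped in the current LinkExtractor for the deprecated SgmlLinkExtractor):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class AdSpider(CrawlSpider):
        # placeholder name and start page, not the real ones
        name = 'ads'
        start_urls = ['http://example.com/listings']

        # with no arguments the extractor matches every link on the start pages,
        # so each of those links still gets requested and passed to the callback;
        # follow=False only stops the crawl from going one level deeper
        rules = (
            Rule(LinkExtractor(), callback='parse_adlinks', follow=False),
        )

        def parse_adlinks(self, response):
            # placeholder parse logic
            yield {'url': response.url, 'title': response.css('title::text').get()}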
Thanks!
Your extractor is extracting every link because it doesn't have any rule arguments set.
If you take a look at the official documentation, you'll notice that Scrapy's link extractors have lots of parameters you can set to customize what they extract.
For example:
    rules = (
        # only links on specific domains
        Rule(LxmlLinkExtractor(allow_domains=['scrapy.org', 'blog.scrapy.org']), <..>),
        # only links that match a specific regex
        Rule(LxmlLinkExtractor(allow=r'.+?/page\d+\.html'), <..>),
        # don't crawl specific file extensions
        Rule(LxmlLinkExtractor(deny_extensions=['pdf', 'html']), <..>),
    )
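As a rough, self-contained sketch of how those parameters fit into a full spider (the spider name, start URL, and callback are made-up placeholders; LinkExtractor is the default lxml-based extractor in current Scrapy):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class FilteredSpider(CrawlSpider):
        # placeholder name and start page
        name = 'filtered'
        start_urls = ['https://blog.scrapy.org/']

        rules = (
            # follow only links that stay on these domains
            Rule(LinkExtractor(allow_domains=['scrapy.org', 'blog.scrapy.org']),
                 callback='parse_page', follow=True),
            # also pick up paginated pages matching a regex
            Rule(LinkExtractor(allow=r'.+?/page\d+\.html'),
                 callback='parse_page', follow=True),
        )

        def parse_page(self, response):
            # placeholder callback: record the URL of every page crawled
            yield {'url': response.url}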
You can also set allowed_domains on the spider if you don't want it to wander off somewhere:

    class MySpider(scrapy.Spider):
        allowed_domains = ['scrapy.org']  # only crawl pages on this domain
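With allowed_domains set, Scrapy's built-in offsite filtering drops any request whose domain isn't listed, so even a broad extractor can't pull the crawl onto other sites. A minimal sketch (spider name and start URL are placeholders):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class DomainBoundSpider(CrawlSpider):
        # placeholder name and start page
        name = 'domain_bound'
        allowed_domains = ['scrapy.org']
        start_urls = ['https://scrapy.org/']

        # requests to domains outside allowed_domains are filtered out
        # before they are ever downloaded
        rules = (
            Rule(LinkExtractor(), callback='parse_page', follow=False),
        )

        def parse_page(self, response):
            yield {'url': response.url}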