Scrapy InitSpider: set Rules in __init__?
I am building a recursive web spider with an optional login. I want to make most settings dynamic via a JSON config file.

In my `__init__` function, I read that file and try to populate all variables; however, this does not work for the `rules`.
```python
class CrawlpySpider(InitSpider):

    ...

    #----------------------------------------------------------------------
    def __init__(self, *args, **kwargs):
        """Constructor: overwrite parent __init__ function"""

        # Call parent init
        super(CrawlpySpider, self).__init__(*args, **kwargs)

        # Get command line arg provided configuration param
        config_file = kwargs.get('config')

        # Validate configuration file parameter
        if not config_file:
            logging.error('Missing argument "-a config"')
            logging.error('Usage: scrapy crawl crawlpy -a config=/path/to/config.json')
            self.abort = True

        # Check if it is actually a file
        elif not os.path.isfile(config_file):
            logging.error('Specified config file does not exist')
            logging.error('Not found in: "' + config_file + '"')
            self.abort = True

        # Good to go, read the config
        else:
            # Load JSON config
            fpointer = open(config_file)
            data = fpointer.read()
            fpointer.close()

            # Convert JSON to dict
            config = json.loads(data)

            # config['rules'] is simply a string array which looks like this:
            # config['rules'] = [
            #     'password',
            #     'reset',
            #     'delete',
            #     'disable',
            #     'drop',
            #     'logout',
            # ]

            CrawlpySpider.rules = (
                Rule(
                    LinkExtractor(
                        allow_domains=(self.allowed_domains),
                        unique=True,
                        deny=tuple(config['rules'])
                    ),
                    callback='parse',
                    follow=False
                ),
            )
```
Scrapy still crawls the pages present in `config['rules']` and therefore also hits the logout page. So the specified pages are not being denied. What am I missing here?
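One thing worth checking first: `LinkExtractor`'s `deny` parameter takes regular expressions that are matched against the absolute URL, not exact page names. A small Scrapy-free sketch of that matching behaviour (the URLs are made-up examples):

```python
import re

# Deny patterns as they would appear in config['rules']
deny = ('password', 'logout')

urls = [
    'http://example.com/logout',
    'http://example.com/home',
    'http://example.com/password/reset',
]

# A URL is blocked if any deny pattern matches anywhere in it
blocked = [u for u in urls if any(re.search(p, u) for p in deny)]
allowed = [u for u in urls if u not in blocked]
```

If the filtering above gives the expected result for your patterns, the problem is not the patterns themselves but when the rules are assigned.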
Update:

I have already tried setting `CrawlpySpider.rules = ...` as well as `self.rules = ...` inside `__init__`. Both variants do not work.
- Spider: `InitSpider`
- Rules: `LinkExtractor`
- Before crawl: doing a login prior to crawling
I even tried to deny them in my `parse` function:

```python
# Dive deeper?
# The nesting depth is handled via a custom middleware (middlewares.py)
#if curr_depth < self.max_depth or self.max_depth == 0:
links = LinkExtractor().extract_links(response)
for link in links:
    for ignore in self.ignores:
        if (ignore not in link.url) and (ignore.lower() not in link.url.lower()) and link.url.find(ignore) == -1:
            yield Request(link.url, meta={'depth': curr_depth+1, 'referer': response.url})
```
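As an aside, the three conditions in that `if` are redundant: a substring check, a lowercased substring check, and `find() == -1` all test the same thing. A simpler equivalent filter, sketched as a standalone helper (`ignores` being the deny list from the config):

```python
def should_follow(url, ignores):
    """Return True if no ignore pattern occurs in the URL (case-insensitive)."""
    u = url.lower()
    return not any(ignore.lower() in u for ignore in ignores)
```

This collapses the duplicated checks into one case-insensitive test per pattern.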
You are setting a class attribute where you want to set an instance attribute:

```python
# this:
CrawlpySpider.rules = (

# should be this:
self.rules = (
<...>
```
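A minimal, Scrapy-free sketch of the distinction (class and attribute names here are illustrative):

```python
class Base:
    # Class attribute, shared by every instance that does not shadow it
    rules = ('default',)

class Child(Base):
    def __init__(self, extra):
        # Assigning to self.rules creates an instance attribute that
        # shadows the class attribute for this object only
        self.rules = Base.rules + (extra,)

c = Child('logout')
# c.rules is ('default', 'logout'); Base.rules is still ('default',)
```

Assigning via `self` keeps the per-instance configuration local to the spider object being constructed, instead of mutating shared class state.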