Scrapy InitSpider: set Rules in __init__?
I am building a recursive web spider with an optional login. I want to make most settings dynamic via a JSON config file.

In my `__init__` function, I read that file and try to populate all variables; however, this does not work for the `rules`.
```python
class CrawlpySpider(InitSpider):

    ...

    #----------------------------------------------------------------------
    def __init__(self, *args, **kwargs):
        """Constructor: overwrite parent __init__ function"""

        # Call parent init
        super(CrawlpySpider, self).__init__(*args, **kwargs)

        # Get command line arg provided configuration param
        config_file = kwargs.get('config')

        # Validate configuration file parameter
        if not config_file:
            logging.error('Missing argument "-a config"')
            logging.error('Usage: scrapy crawl crawlpy -a config=/path/to/config.json')
            self.abort = True

        # Check if it is actually a file
        elif not os.path.isfile(config_file):
            logging.error('Specified config file does not exist')
            logging.error('Not found in: "' + config_file + '"')
            self.abort = True

        # Good to go, read the config
        else:
            # Load JSON config
            fpointer = open(config_file)
            data = fpointer.read()
            fpointer.close()

            # Convert JSON to dict
            config = json.loads(data)

            # config['rules'] is simply a string array which looks like this:
            # config['rules'] = [
            #     'password',
            #     'reset',
            #     'delete',
            #     'disable',
            #     'drop',
            #     'logout',
            # ]

            CrawlpySpider.rules = (
                Rule(
                    LinkExtractor(
                        allow_domains=(self.allowed_domains),
                        unique=True,
                        deny=tuple(config['rules'])
                    ),
                    callback='parse',
                    follow=False
                ),
            )
```
Scrapy still crawls the pages present in `config['rules']` and therefore also hits the logout page. So the specified pages are not being denied. What am I missing here?
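One thing worth checking first: `LinkExtractor`'s `deny` parameter takes regular expressions that are matched against the absolute URL, not exact page names. A small Scrapy-free sketch of that matching behaviour (the URLs are made-up examples):

```python
import re

# Deny patterns as they would appear in config['rules']
deny = ('password', 'logout')

urls = [
    'http://example.com/logout',
    'http://example.com/home',
    'http://example.com/password/reset',
]

# A URL is blocked if any deny pattern matches anywhere in it
blocked = [u for u in urls if any(re.search(p, u) for p in deny)]
allowed = [u for u in urls if u not in blocked]
```

If the filtering above gives the expected result for your patterns, the problem is not the patterns themselves but when the rules are assigned.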
Update:

I have already tried setting `CrawlpySpider.rules = ...` as well as `self.rules = ...` inside `__init__`. Both variants do not work.
- Spider: `InitSpider`
- Rules: `LinkExtractor`
- Before crawl: doing a login prior to crawling
I even tried to deny them in my `parse` function:

```python
# Dive deeper?
# The nesting depth is handled via a custom middleware (middlewares.py)
#if curr_depth < self.max_depth or self.max_depth == 0:
links = LinkExtractor().extract_links(response)
for link in links:
    for ignore in self.ignores:
        if (ignore not in link.url) and (ignore.lower() not in link.url.lower()) and link.url.find(ignore) == -1:
            yield Request(link.url, meta={'depth': curr_depth+1, 'referer': response.url})
```
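As an aside, the three conditions in that `if` are redundant: a substring check, a lowercased substring check, and `find() == -1` all test the same thing. A simpler equivalent filter, sketched as a standalone helper (`ignores` being the deny list from the config):

```python
def should_follow(url, ignores):
    """Return True if no ignore pattern occurs in the URL (case-insensitive)."""
    u = url.lower()
    return not any(ignore.lower() in u for ignore in ignores)
```

This collapses the duplicated checks into one case-insensitive test per pattern.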
You are setting a class attribute where you want to set an instance attribute:

```python
# this:
CrawlpySpider.rules = (

# should be this:
self.rules = (
<...>
```
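A minimal, Scrapy-free sketch of the distinction (class and attribute names here are illustrative):

```python
class Base:
    # Class attribute, shared by every instance that does not shadow it
    rules = ('default',)

class Child(Base):
    def __init__(self, extra):
        # Assigning to self.rules creates an instance attribute that
        # shadows the class attribute for this object only
        self.rules = Base.rules + (extra,)

c = Child('logout')
# c.rules is ('default', 'logout'); Base.rules is still ('default',)
```

Assigning via `self` keeps the per-instance configuration local to the spider object being constructed, instead of mutating shared class state.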