solr - How to crawl a website that has SAML authentication using ManifoldCF or Nutch?
I am trying to crawl a website, more specifically a Google Site that has SAML authentication, using ManifoldCF, and to index the crawled data in Apache Solr. When I crawl the URL, it gives me a 302 redirection to the login page and reports responsecodenotindexable.
I am not sure whether I have authenticated correctly. In ManifoldCF, the available options are HTTP basic authentication, NTLM authentication, and session-based access credentials. I used the session-based method, but it looks more like form-based authentication than SAML authentication.
Has anyone crawled a website that has SAML authentication using ManifoldCF? If not with ManifoldCF, has anyone been able to accomplish this with Apache Nutch? I am afraid it provides only HTTP basic, digest, and NTLM authentication.
Any insight would be helpful. I can provide more information about the issue if anyone here thinks this can be accomplished. When I crawl https://sites.google.com/a/my-sub-domain.com, it redirects to the SSO login page, and the crawler refuses to crawl any further, giving a 302 error. It is an intranet-based website.
I am not sure whether this helps, but try it out. In Nutch, you can provide credentials for the login page: there is an httpclient-auth.xml file in the conf directory, and there you can provide the host name along with the credentials.

    <auth-configuration>
      <credentials username="admin" password="admin123">
        <authscope host="hostname" realm="login"/>
        <default/>
      </credentials>
    </auth-configuration>

Similarly, you can add any number of credentials elements to the configuration.
To crawl an HTTPS site, change protocol-http to protocol-httpclient in the plugin.includes property in nutch-site.xml.
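As a rough sketch of that change, the override in conf/nutch-site.xml might look like the following. Only the protocol-httpclient entry comes from the answer above; the rest of the value is an assumption modelled on a typical Nutch 1.x default, so copy the actual value from your own nutch-default.xml and swap just the protocol plugin.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: properties here override nutch-default.xml -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- protocol-http replaced by protocol-httpclient.
         Everything after the first "|" is an assumed example list;
         keep whatever your nutch-default.xml already contains. -->
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

With this plugin enabled, Nutch reads the credentials from httpclient-auth.xml. As the question notes, these are HTTP-level schemes (basic, digest, NTLM), so this alone may not get past a SAML SSO redirect.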