solr - How to crawl a website that has SAML authentication using ManifoldCF or Nutch?
I am trying to crawl a website (more specifically, a Google Site) that has SAML authentication using ManifoldCF, and to index the crawled data into Apache Solr. When I crawl the URL, it gives me a 302 redirection to the login page and reports RESPONSECODENOTINDEXABLE.
I am not sure whether I have authenticated correctly. ManifoldCF offers HTTP basic authentication, NTLM authentication, and session-based access credentials as authentication methods. I used the session-based method, but it looks more like form-based authentication than SAML authentication.
Has anyone crawled a website that has SAML authentication using ManifoldCF? If not with ManifoldCF, has anyone been able to accomplish this with Apache Nutch? I am afraid it only provides HTTP basic, digest, and NTLM authentication.
Any insight would be helpful. I can provide more information about the issue if anyone here thinks this can be accomplished. When I crawl https://sites.google.com/a/my-sub-domain.com, it redirects to the SSO login page, and the crawler refuses to crawl any further, giving a 302 error. It is an intranet-based website.
I am not sure whether this helps, but try it out. In Nutch, you can provide credentials for the login page: there is an httpclient-auth.xml file in the conf directory, where you can specify the host name along with the credentials.
<auth-configuration>
  <credentials username="admin" password="admin123">
    <authscope host="hostname" realm="login"/>
    <default/>
  </credentials>
</auth-configuration>
Similarly, you can add any number of credentials to the configuration.
To crawl an HTTPS site, change protocol-http to protocol-httpclient in the plugin.includes property in nutch-site.xml.
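As a rough sketch of that change, the override might look like the fragment below. This assumes overrides are placed in conf/nutch-site.xml. Only the swap of protocol-http for protocol-httpclient comes from the answer above; the rest of the value is an illustrative placeholder, so keep whatever plugin list your own installation already uses.

```xml
<!-- Sketch only: assumes this property is overridden in conf/nutch-site.xml -->
<property>
  <name>plugin.includes</name>
  <!-- protocol-httpclient replaces protocol-http; the remaining plugins
       are placeholders and should match your existing configuration -->
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)</value>
</property>
```

Note that httpclient-auth.xml covers HTTP-level schemes such as basic, digest, and NTLM, so this may not be enough on its own for a SAML SSO redirect.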