solr - How to crawl a website that has SAML authentication using ManifoldCF or Nutch?


I am trying to crawl a website, more specifically a Google Site, that has SAML authentication using ManifoldCF, and to index the crawled data into Apache Solr. When I crawl the URL, it gives me a 302 redirect to the login page and reports RESPONSECODENOTINDEXABLE.

I am not sure whether I have authenticated correctly. In ManifoldCF, the available authentication methods are HTTP basic authentication, NTLM authentication, and session-based access credentials. I used the session-based method, but it looks more like form-based authentication than SAML authentication.

Has anyone crawled a website that has SAML authentication using ManifoldCF? If not with ManifoldCF, has anyone been able to accomplish this with Apache Nutch? I am afraid it provides only HTTP basic, digest, and NTLM authentication.

Any insight would be helpful. I can provide more information about the issue if anyone here thinks this can be accomplished. When I crawl https://sites.google.com/a/my-sub-domain.com, it redirects to the SSO login page and the crawler refuses to crawl any further, giving a 302 error. It is an intranet-based website.

I am not sure whether this helps, but try it out. In Nutch, you can provide credentials for the login page. There is an httpclient-auth.xml file in the conf directory where you can provide the host name along with the credentials.

<auth-configuration>
   <credentials username="admin" password="admin123">
      <authscope host="hostname" realm="login"/>
      <default/>
   </credentials>
</auth-configuration>

Similarly, you can add any number of credentials to the configuration.
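As a rough sketch of what several credentials in one file might look like: the host names, user names, and passwords below are hypothetical placeholders, not values from the question. Note that this file only covers the HTTP authentication schemes mentioned in the question (basic, digest, NTLM), so it may not get past a SAML login by itself.

```xml
<auth-configuration>
   <!-- Hypothetical account used for one intranet host -->
   <credentials username="admin" password="admin123">
      <authscope host="intranet.example.com" realm="login"/>
   </credentials>
   <!-- Hypothetical second account used for a different host -->
   <credentials username="crawler" password="crawler123">
      <authscope host="wiki.example.com" realm="login"/>
   </credentials>
</auth-configuration>
```

Each credentials element carries its own authscope, so a request should only receive the user name and password whose host and realm match.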

To crawl an HTTPS site, change the plugin.includes property so that it uses protocol-httpclient instead of protocol-http in nutch-site.xml.
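A minimal sketch of that override, assuming it goes into conf/nutch-site.xml: only the switch to protocol-httpclient comes from the answer above. The rest of the plugin list is an assumed example and should be copied from the plugin.includes value in your own nutch-default.xml, since it differs between Nutch versions.

```xml
<property>
   <name>plugin.includes</name>
   <!-- protocol-httpclient replaces protocol-http; the other plugins
        listed here are an assumption, keep whatever your install uses -->
   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```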

