This is the mail archive of the cygwin@cygwin.com mailing list for the Cygwin project.
Re: Wget ignores robot.txt entry
- From: Randall R Schulz <rrschulz at cris dot com>
- To: cygwin at cygwin dot com
- Date: Thu, 13 Feb 2003 18:33:35 -0800
- Subject: Re: Wget ignores robot.txt entry
Lowell,
What's in your "~/.wgetrc" file? If it contains this:
robots = off
Then wget will not respect a "robots.txt" file on the host from which
it is retrieving files.
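
A quick way to answer that question is to search the startup files for a robots directive. This is only a sketch: "find_robots_setting" is a hypothetical helper name, and "/etc/wgetrc" is an assumed location for the system-wide file, which may live elsewhere on a given installation.

```shell
# Print every "robots" directive found in a wgetrc-style file,
# prefixed with its line number. Returns non-zero if none is found.
find_robots_setting() {
    grep -inE '^[[:space:]]*robots[[:space:]]*=' "$1"
}

# Check the per-user file and an assumed system-wide file.
for rc in "$HOME/.wgetrc" /etc/wgetrc; do
    if [ -r "$rc" ]; then
        echo "== $rc"
        find_robots_setting "$rc" || echo "(no robots directive)"
    fi
done
```

If either file reports "robots = off", that would explain wget ignoring robots.txt.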
Before I learned of this option (it has no dedicated command-line
switch; it is a wgetrc directive, which can also be passed on the
command line as "-e robots=off"), I did something too clever by half
to get robots.txt ignored, so I know that wget does respect it.
Randall Schulz
At 18:14 2003-02-13, L Anderson wrote:
Using the latest of things Cygwin, I downloaded some stuff with wget
from <http://cygwin.com> to peruse off-line and noticed a problem I
can't explain:
The <http://cygwin.com/robots.txt> file has the entries:
User-agent: *
Disallow: /snapshots/
Disallow: /cgi-bin/
Disallow: /cgi2-bin/
so wget should not download /cgi-bin/.
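
The reasoning above relies on robots.txt prefix matching: a path is excluded when it starts with one of the Disallow values. The following is a minimal sketch of that rule, not wget's own implementation. "is_disallowed" is a hypothetical helper, and it deliberately ignores User-agent grouping, which is harmless here only because the file has a single "User-agent: *" group.

```shell
# usage: is_disallowed ROBOTS_FILE PATH
# Succeeds if PATH starts with any non-empty Disallow value in the file.
is_disallowed() {
    awk -v path="$2" '
        tolower($1) == "disallow:" && $2 != "" && index(path, $2) == 1 { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Reproduce the rules quoted above in a temporary file.
rules=$(mktemp)
cat > "$rules" <<'EOF'
User-agent: *
Disallow: /snapshots/
Disallow: /cgi-bin/
Disallow: /cgi2-bin/
EOF

for p in /cgi-bin/ /ml/; do
    if is_disallowed "$rules" "$p"; then
        echo "$p: disallowed"
    else
        echo "$p: allowed"
    fi
done
rm -f "$rules"
```

With these rules, /cgi-bin/ is reported as disallowed and /ml/ as allowed, which matches the expectation that a robots-respecting client would skip /cgi-bin/.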
However, "wget -o cygwincom.log -m -p --no-parent -X /cygwin,/ml
http://cygwin.com/" downloads /cgi-bin anyway.
NB. "wget -o cygwincom.log -m -p --no-parent -X /cgi-bin,/cygwin,/ml
http://cygwin.com/" doesn't download /cgi-bin.
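
One way to narrow this down is to check whether wget requested robots.txt at all during the run. The sketch below searches the log written by "-o cygwincom.log" for any mention of that file. "robots_mentions" is a hypothetical helper, and the assumption is that a run with robots handling enabled would log a request for /robots.txt.

```shell
# Print every line of a wget log that mentions robots.txt, with line numbers.
# Returns non-zero if the log never mentions it.
robots_mentions() {
    grep -n 'robots\.txt' "$1"
}

log=cygwincom.log   # the log file named by -o in the commands above
if [ -r "$log" ]; then
    robots_mentions "$log" || echo "no mention of robots.txt in $log"
else
    echo "$log not found; run the wget command first"
fi
```

If the log never mentions robots.txt, wget probably was not consulting it, which points toward a configuration setting rather than a parsing problem.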
I ran a validity check on <http://cygwin.com/robots.txt> and found no errors.
Is this a bug in wget or am I doing something wrong?
Thanks,
Lowell Anderson