This is the mail archive of the cygwin@cygwin.com mailing list for the Cygwin project.



Re: Wget ignores robot.txt entry


Lowell,

What's in your "~/.wgetrc" file? If it contains this:

robots = off

Then wget will not respect a "robots.txt" file on the host from which it is retrieving files.
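One quick way to check is to scan the file for that directive. The following is only a minimal sketch: the helper name `robots_setting` and the sample file contents are made up for illustration, and the parsing is far simpler than what wget itself does (it ignores inline comments and the system-wide wgetrc).

```python
import re

def robots_setting(wgetrc_text):
    """Return the last value assigned to 'robots' in wgetrc-style text, or None.

    Hypothetical helper: a simplified scan, not wget's real parser.
    """
    value = None
    for line in wgetrc_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        match = re.match(r"robots\s*=\s*(\S+)", line, re.IGNORECASE)
        if match:
            # A later assignment replaces an earlier one in this sketch.
            value = match.group(1).lower()
    return value

# Hypothetical ~/.wgetrc contents, for illustration only.
sample = """
# personal wget settings
tries = 3
robots = off
"""

print(robots_setting(sample))
```

If this reports "off" for your real file, that would explain why /cgi-bin was fetched.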

Before I learned of this option (which can be set with this directive in the .wgetrc file or, on the command line, with "-e robots=off"), I did something too clever by half to get robots.txt ignored, so I know that wget does respect it.

Randall Schulz


At 18:14 2003-02-13, L Anderson wrote:
Using the latest of things Cygwin, I downloaded some stuff with wget from <http://cygwin.com> to peruse off-line and noticed a problem I can't explain:

The <http://cygwin.com/robots.txt> file has the entries:

User-agent: *
Disallow: /snapshots/
Disallow: /cgi-bin/
Disallow: /cgi2-bin/

so wget should not download /cgi-bin/.
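That reading of the rules can be checked independently of wget. The sketch below feeds the same entries to Python's standard `urllib.robotparser`; the helper name `is_allowed`, the "Wget" user-agent string, and the example path `/cgi-bin/foo` are assumptions for illustration, and this parser is not the one wget uses.

```python
from urllib.robotparser import RobotFileParser

# The entries quoted above from http://cygwin.com/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /snapshots/
Disallow: /cgi-bin/
Disallow: /cgi2-bin/
"""

def is_allowed(url, user_agent="Wget"):
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

# /cgi-bin/foo is a hypothetical path used only to exercise the rule.
print(is_allowed("http://cygwin.com/cgi-bin/foo"))
print(is_allowed("http://cygwin.com/faq/"))
```

A compliant robot should be refused anything under /cgi-bin/ while still being allowed to fetch pages such as /faq/.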

However, "wget -o cygwincom.log -m -p --no-parent -X /cygwin,/ml http://cygwin.com/" downloads /cgi-bin anyway.

NB. "wget -o cygwincom.log -m -p --no-parent -X /cgi-bin,/cygwin,/ml http://cygwin.com/" doesn't download /cgi-bin.

I ran a validity check on <http://cygwin.com/robots.txt> and found no errors.

Is this a bug in wget or am I doing something wrong?

Thanks,

Lowell Anderson


