This is the mail archive of the cygwin@cygwin.com mailing list for the Cygwin project.



Re: Wget ignores robot.txt entry


Lowell,

Max Bowsher reported:

Or, on the command line -erobots=off :-)

Whilst this does control whether wget downloads robots.txt, a quick test confirms that even when it does get robots.txt, it still wanders into cgi-bin.

I'd suggest taking this to the wget list, except that wget is currently maintainer-less and, it appears, bitrotted.

Max.
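
(For reference, a complete invocation using that switch might look something like the following sketch; the URL is only a placeholder:

$ wget -e robots=off -r -np http://www.example.com/some/dir/

Here -e executes a .wgetrc-style directive from the command line, -r turns on recursive retrieval, and -np keeps wget from wandering up to the parent directory.)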

As for this:

Perhaps there is a counterpart to the above, i.e., a <meta name="robots" content="follow"> tag that's being invoked, and someone from Red Hat could check into this and rule it out.
You should realize that for open source programs like wget, the recommended practice is to examine the source yourself.
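
For instance, once the source is unpacked (from the Cygwin source package or a wget tarball), something along these lines will point you at the robot-handling code; the file names are from memory for wget 1.8.x, so treat them as a starting point:

$ grep -ni robot src/*.c
(look in particular at src/res.c and src/recur.c)

That shows exactly which options and tags the version you are running actually honors.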

Randall Schulz


At 17:43 2003-02-14, L Anderson wrote:

Randall R Schulz wrote:
Lowell,
What's in your "~/.wgetrc" file? If it contains this:
robots = off
Then wget will not respect a "robots.txt" file on the host from which it is retrieving files.
Before I learned of this option (accessible _only_ via this directive in the .wgetrc file), I did something too clever by half to get robots.txt ignored, so I know that wget does respect it.
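For example, a minimal ~/.wgetrc doing what Randall describes would contain just one directive (this is a sketch, not the stock file, which ships with the setting commented out):

# ~/.wgetrc
robots = off

With that in place, wget skips the robot exclusion checks on every run.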
I have only two wgetrc related files as follows:

/etc/wgetrc
/usr/doc/wget-1.8.2/sample.wgetrc

NB: I use Windows 98, and these are under my Cygwin directory i:\cygwin (i.e., /cygdrive/i).

I have never changed either file--I just accepted the defaults installed by setup. The two files do differ by a few lines, but those are only comments anyway; i.e., running:

$ diff /etc/wgetrc /usr/doc/wget-1.8.2/sample.wgetrc
73,74c73,74
< # You can set the default proxy for Wget to use. It will override the
< # value in the environment.
---
> # You can set the default proxies for Wget to use for http and ftp.
> # They will override the value in the environment.
75a76
> #ftp_proxy = http://proxy.yoyodyne.com:18023/

shows this. Moreover,

$ grep robot /etc/wgetrc
# Setting this to off makes Wget not download /robots.txt. Be sure to
# know *exactly* what /robots.txt is and how it is used before changing
#robots = on

shows that the only references to "robot" are also comments.

The stated default for wget is "robots=on", which I have seen honored for quite a number of other downloads, and since I didn't use "-e robots=off", that can't explain it. The only other thing I have found that might be related is not under my control, and I haven't yet figured out how to check it. The wget documentation states:

"
The second, less known mechanism, enables the author of an individual document to specify whether they want the links from the file to be followed by a robot. This is achieved using the META tag, like this:

<meta name="robots" content="nofollow">

This is explained in some detail at <http://www.robotstxt.org/wc/meta-user.html>. Wget supports this method of robot exclusion in addition to the usual /robots.txt exclusion.
"

Perhaps there is a counterpart to the above, i.e., a <meta name="robots" content="follow"> tag that's being invoked, and someone from Red Hat could check into this and rule it out.
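
One way to check both mechanisms from the client side, without any access to the server, is to pull the relevant files down by hand (placeholder URL, substitute the real site):

$ wget -q -O - http://www.example.com/robots.txt
$ wget -q -O - http://www.example.com/index.html | grep -i robots

The first command shows whether /cgi-bin/ actually appears in a Disallow line; the second shows whether the page carries a robots META tag at all. Running the recursive retrieval again with -d (debug output) and filtering for "robot" should also show what wget itself decided. As I understand the exclusion mechanisms, though, a content="follow" tag only restates the default for links in that one document; it cannot override a robots.txt Disallow.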

Thanks (and still puzzled)!

Lowell Anderson

--
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
Bug reporting:         http://cygwin.com/bugs.html
Documentation:         http://cygwin.com/docs.html
FAQ:                   http://cygwin.com/faq/

