This is the mail archive of the cygwin mailing list for the Cygwin project.



Bogus assumption prevents d2u/u2d/conv et al. from working on mixed files.



  And here's why I was investigating cygutils.  I found that d2u wasn't
working on a file of mine.  Let me demonstrate:

-------snip-------
dk@mace /davek/d2utest> ls -la
total 3
drwxr-xr-x+   2 dk       Domain U        0 Apr  2 16:41 .
drwx------+  29 dk       Domain U        0 Apr  1 12:38 ..
-rw-r--r--    1 dk       Domain U     2902 Mar 31 17:01 stdprint.c
dk@mace /davek/d2utest> cp stdprint.c  stdprint1.c
dk@mace /davek/d2utest> d2u stdprint1.c
stdprint1.c: done.
dk@mace /davek/d2utest> ls -la
total 6
drwxr-xr-x+   2 dk       Domain U        0 Apr  2 16:41 .
drwx------+  29 dk       Domain U        0 Apr  1 12:38 ..
-rw-r--r--    1 dk       Domain U     2902 Mar 31 17:01 stdprint.c
-rw-r--r--    1 dk       Domain U     2902 Apr  2 16:41 stdprint1.c
dk@mace /davek/d2utest> cat stdprint.c | tr -d '\015'  |cat >stdprint2.c
dk@mace /davek/d2utest> ls -la
total 9
drwxr-xr-x+   2 dk       Domain U        0 Apr  2 16:41 .
drwx------+  29 dk       Domain U        0 Apr  1 12:38 ..
-rw-r--r--    1 dk       Domain U     2902 Mar 31 17:01 stdprint.c
-rw-r--r--    1 dk       Domain U     2902 Apr  2 16:41 stdprint1.c
-rw-r--r--    1 dk       Domain U     2897 Apr  2 16:41 stdprint2.c
-------snip-------

  I was pretty stunned to find d2u didn't have the same effect as tr -d.  A
few seconds' work in the debugger, however, made it clear.

  Right inside conv.c, in the main convert (...) function, there's an
attempted optimisation.  After opening the file for conversion, it reads a
char at a time until it finds the first '\n' or '\r' in the whole file.  If
a '\n' comes first, it assumes the file is in Unix format; if a '\r' comes
first, it assumes the file must be in DOS format.
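
  As a rough illustration of that first-EOL heuristic, here is a minimal
sketch working on an in-memory buffer.  The names (eol_guess, guess_format)
are made up for this example and are not the identifiers used in conv.c,
which reads the file a character at a time rather than scanning a buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; not taken from conv.c. */
enum eol_guess { EOL_NONE, EOL_UNIX, EOL_DOS };

/* Guess the file format from the first '\n' or '\r' in the buffer.
 * Only the first EOL character is examined; nothing after it is read. */
enum eol_guess guess_format(const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\n')
            return EOL_UNIX;   /* bare LF came first */
        if (buf[i] == '\r')
            return EOL_DOS;    /* CR came first */
    }
    return EOL_NONE;           /* no line ending anywhere */
}
```

  Note that the result depends only on the first line ending: a buffer whose
first line ends in '\n' is reported as Unix even if every later line ends in
"\r\n".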

  Now, these assumptions are reasonable enough ways of guessing the file
format if it hasn't been specified by the command name or command line
switch, and therefore of deducing which kind of translation is required.

  But then it checks to see if the guessed format matches the format you've
asked it to convert into.  If so, it attempts to 'optimise' the conversion
by simply not performing it: it closes the file and leaves it untouched.

  Unfortunately, there is an extra unstated assumption in between deducing
the file type from the first EOL in the file and deducing that you don't
need to perform a conversion, and that assumption is that every other line
in the file has the same EOL as the first line.  And that assumption is
bogus, and it means that d2u/u2d and friends are no use on files which have
mixed EOL types, unless by good chance the very first line has the EOL type
that you wish to convert away from.
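
  To make the failure concrete, here is a small sketch of a d2u-style
conversion with and without the shortcut.  It is an assumption-laden model,
not the conv.c code: the function names are hypothetical, it works on a
buffer instead of a file, and it strips every '\r' the way the tr -d '\015'
command above does, which may not be exactly what d2u does.

```c
#include <assert.h>
#include <stddef.h>

/* Remove every '\r' in place, mirroring `tr -d '\015'`.
 * Returns the new length of the buffer. */
size_t strip_cr(char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != '\r')
            buf[out++] = buf[i];
    }
    return out;
}

/* Hypothetical model of the shortcut: if the first EOL character is
 * '\n', assume the whole buffer is already Unix and leave it alone. */
size_t d2u_with_shortcut(char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\n')
            return len;        /* looks like Unix: skip conversion */
        if (buf[i] == '\r')
            break;             /* looks like DOS: go and convert */
    }
    return strip_cr(buf, len);
}
```

  Given the mixed input "one\ntwo\r\nthree\r\n", d2u_with_shortcut returns the
original length and leaves both '\r' characters in place, whereas strip_cr
shortens the buffer by two bytes.  That matches the shape of the listing
above, where the d2u output stayed at 2902 bytes and the tr output dropped
to 2897.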

  My attached patch simply removes the attempted optimisation.  Like I say,
I think it's an invalid shortcut to assume that every line in a file has the
same EOL type.  A case could be made for keeping the 'optimisation' and
providing a command-line switch "-f" or "--force" to force full processing
of files even if they seem to already be in the right mode.  OTOH, even if
you wanted to keep the optimisation in some cases, it's a dangerous one that
can lead to incorrect output, so I'd say it should only be switched on when
the user deliberately adds a command-line option, rather than being on by
default with a switch to disable it.
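
  If the shortcut were kept as an opt-in, the decision could look something
like the sketch below.  This is not what the attached patch does (the patch
just removes the shortcut), and the names, including the trust_first_eol
flag, are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; not taken from conv.c. */
enum eol_format { FMT_UNIX, FMT_DOS };

/* Skip the conversion only when the user has explicitly asked to trust
 * the first-EOL guess; by default every file is processed in full. */
bool may_skip_conversion(enum eol_format guessed,
                         enum eol_format target,
                         bool trust_first_eol)
{
    return trust_first_eol && guessed == target;
}
```

  With trust_first_eol left false, this never skips, so mixed files always
get converted; the risky behaviour only appears when it's asked for.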


    cheers,
      DaveK
-- 
Can't think of a witty .sigline today....

Attachment: conv-patch.diff
Description: Binary data
