CVS log for mandoc/roff_escape.c

[BACK] Up to [cvsweb.bsd.lv] / mandoc

Request diff between arbitrary revisions


Default branch: MAIN
Current tag: MAIN


Revision 1.14 / (download) - annotate - [select for diffs], Wed Jun 8 13:23:57 2022 UTC (21 months, 2 weeks ago) by schwarze
Branch: MAIN
CVS Tags: HEAD
Changes since 1.13: +49 -26 lines
Diff to previous 1.13 (unified)

Surprisingly, every escape sequence can also be used as an argument
delimiter for an outer escape sequence, in which case the delimiting
escape sequence retains its syntax but usually ignores its argument
and loses its inherent effect.  Add rudimentary support for this
syntax quirk in order to improve parsing compatibility with groff.

Revision 1.13 / (download) - annotate - [select for diffs], Tue Jun 7 09:54:40 2022 UTC (21 months, 3 weeks ago) by schwarze
Branch: MAIN
Changes since 1.12: +18 -8 lines
Diff to previous 1.12 (unified)

Split the excessively generic diagnostic message "invalid escape sequence"
into the more specific messages "invalid escape argument delimiter"
and "invalid escape sequence argument".

Revision 1.12 / (download) - annotate - [select for diffs], Mon Jun 6 19:23:13 2022 UTC (21 months, 3 weeks ago) by schwarze
Branch: MAIN
Changes since 1.11: +9 -7 lines
Diff to previous 1.11 (unified)

To better match groff parsing, reject digits and some mathematical
operators as argument delimiters for some escape sequences that take
numerical arguments, in the same way as it had already been done for \h.

Argument delimiter parsing for escape sequences taking numerical arguments
is not perfect yet.  In particular, when a character representing a
scaling unit is abused as the argument delimiter, parsing for that
character becomes context-dependent, and it is no longer possible to
find the end of the escape sequence without calling the full numerical
expression parser, which i refrain from attempting in this commit.

For now, continuing to misparse insane constructions like \Bc1c+1cc
(which is valid in groff and resolves to "1" because 1c+1c = two
centimeters is a valid numerical expression and 'c' is also a valid
delimiter) is a small price to pay for keeping complexity at bay
and for not losing focus in the ongoing series of refinements.

Revision 1.11 / (download) - annotate - [select for diffs], Mon Jun 6 12:09:48 2022 UTC (21 months, 3 weeks ago) by schwarze
Branch: MAIN
Changes since 1.10: +5 -4 lines
Diff to previous 1.10 (unified)

Allow arbitrary argument delimiters for \C, like groff does.
The restriction of only allowing ' as the delimiter was introduced
by kristaps@ on 2011/04/09 when he first supported \C.
For most other escape sequences, similar restrictions were relaxed
later on, but for the rarely used \C, it was apparently forgotten.

While here, reject empty character names: they are never valid.

Revision 1.10 / (download) - annotate - [select for diffs], Sun Jun 5 13:54:09 2022 UTC (21 months, 3 weeks ago) by schwarze
Branch: MAIN
Changes since 1.9: +22 -11 lines
Diff to previous 1.9 (unified)

With the improved escape sequence parser, it becomes easy to also improve
diagnostics.  Distinguish "incomplete escape sequence", "invalid special
character", and "unknown special character" from the generic "invalid
escape sequence", also promoting them from WARNING to ERROR because
incomplete escape sequences are severe syntax violations and because
encountering an invalid or unknown special character makes it likely
that part of the document content intended by the authors gets lost.

Revision 1.9 / (download) - annotate - [select for diffs], Sun Jun 5 10:19:54 2022 UTC (21 months, 3 weeks ago) by schwarze
Branch: MAIN
Changes since 1.8: +12 -11 lines
Diff to previous 1.8 (unified)

Small cleanup of error reporting:
call mandoc_msg() only once at the end, not sometimes in the middle,
classify incomplete, non-expanding escape sequences as ESCAPE_ERROR,
and also reduce the number of return statemants;
no formatting change intended.

Revision 1.8 / (download) - annotate - [select for diffs], Thu Jun 2 14:51:41 2022 UTC (21 months, 3 weeks ago) by schwarze
Branch: MAIN
Changes since 1.7: +2 -2 lines
Diff to previous 1.7 (unified)

Since \. is not a character escape sequence, re-classify it from the
wrong parsing class ESCAPE_SPECIAL to the better-suited parsing class
ESCAPE_UNDEF, exactly like it is already done for the similar \\,
which isn't a character escape sequence either.

No formatting change is intended just yet, but this will matter for
upcoming improvements in the parser for roff(7) macro, string, and
register names.

See the node "5.23.2 Copy Mode" in "info groff" regarding
what \\ and \. really mean.

Revision 1.7 / (download) - annotate - [select for diffs], Thu Jun 2 11:29:07 2022 UTC (21 months, 3 weeks ago) by schwarze
Branch: MAIN
Changes since 1.6: +22 -19 lines
Diff to previous 1.6 (unified)

Avoid the layering violation of re-parsing for \E in roff_expand().
To that end, add another argument to roff_escape()
returning the index of the escape name.
This also makes the code in roff_escape() a bit more uniform
in so far as it no longer needs the "char esc_name" local variable
but now does everything with indices into buf[].
No functional change.

Revision 1.6 / (download) - annotate - [select for diffs], Wed Jun 1 23:20:26 2022 UTC (21 months, 3 weeks ago) by schwarze
Branch: MAIN
Changes since 1.5: +3 -0 lines
Diff to previous 1.5 (unified)

Fix a buffer overrun in the roff(7) escape sequence parser that could
be triggered by macro arguments ending in double backslashes, for
example if people wrote .Sq "\\" instead of the correct .Sq "\e".

The bug was hard to find because it caused a segfault only very rarely,
according to my measurements with a probability of less than one permille.
I'm sorry that the first one to hit the bug was an arm64 release build
run by deraadt@.  Thanks to bluhm@ for providing access to an arm64
machine for debugging purposes.  In the end, the bug turned out to be
architecture-independent.

The reason for the bug was that i assumed an invariant that does not exist.
The function roff_parse_comment() is very careful to make sure that the
input buffer does not end in an escape character before passing it on,
so i assumed this is still true when reaching roff_expand() immediately
afterwards.  But roff_expand() can also be reached from roff_getarg(),
in which case there *can* be a lone escape character at the end of the
buffer in case copy mode processing found and converted a double
backslash.

Fix this by handling a trailing escape character correctly in the
function roff_escape().

The lesson here probably is to refrain from assuming an invariant
unless verifying that the invariant actually holds is reasonably
simple.  In some cases, in particular for invariants that are important
but not simple, it might also make sense to assert(3) rather than just
assume the invariant.  An assertion failure is so much better than a
buffer overrun...

Revision 1.5 / (download) - annotate - [select for diffs], Tue May 31 20:23:05 2022 UTC (21 months, 4 weeks ago) by schwarze
Branch: MAIN
Changes since 1.4: +18 -3 lines
Diff to previous 1.4 (unified)

Rudimentary implementation of the \A escape sequence, following groff
semantics (test identifier for syntactical validity), not at all
following the completely unrelated Heirloom semantics (define
hyperlink target position).

The main motivation for providing this implementation is to get \A
into the parsing class ESCAPE_EXPAND that corresponds to groff parsing
behaviour, which is quite similar to the \B escape sequence (test
numerical expression for syntactical validity).  This is likely
to improve parsing of nested escape sequences in the future.

Validation isn't perfect yet.  In particular, this implementation
rejects \A arguments containing some escape sequences that groff
allows to slip through.  But that is unlikely to cause trouble even
in documents using \A for non-trivial purposes.  Rejecting the nested
escapes in question might even improve robustnest because the rejected
names are unlikely to really be usable for practical purposes - no
matter that groff dubiously considers them syntactically valid.

Revision 1.4 / (download) - annotate - [select for diffs], Tue May 31 18:09:57 2022 UTC (21 months, 4 weeks ago) by schwarze
Branch: MAIN
Changes since 1.3: +1 -1 lines
Diff to previous 1.3 (unified)

Trivial patch to put the roff(7) \g (interpolate format of register)
escape sequence into the correct parsing class, ESCAPE_EXPAND.
Expansion of \g is supposed to work exactly like the expansion
of the related escape sequence \n (interpolate register value),
but since we ignore the .af (assign output format) request,
we just interpolate an empty string to replace the \g sequence.

Surprising as it may seem, this actually makes a formatting difference
for deviate input like ".O\gNx" which used to raise bogus "escaped
character not allowed in a name" and "skipping unknown macro" errors
and printed nothing, whereas now it correctly prints "OpenBSD".

Revision 1.3 / (download) - annotate - [select for diffs], Mon May 30 23:03:47 2022 UTC (21 months, 4 weeks ago) by schwarze
Branch: MAIN
Changes since 1.2: +1 -1 lines
Diff to previous 1.2 (unified)

Dummy implementation of the roff(7) \V (interpolate environment variable)
escape sequence.  This is needed to get \V into the correct parsing
class, ESCAPE_EXPAND.

It is intentional that mandoc(1) output is *not* influenced by environment
variables, so interpolate the name of the variable with some decorating
punctuation rather than interpolating its value.

Revision 1.2 / (download) - annotate - [select for diffs], Fri May 20 13:09:13 2022 UTC (22 months, 1 week ago) by schwarze
Branch: MAIN
Changes since 1.1: +1 -1 lines
Diff to previous 1.1 (unified)

Re-classify the roff(7) \r (reverse line feed) escape sequence
from "ignore" to "unsupported" because when an input file uses it,
mandoc(1) is likely to significantly misformat the output,
usually showing parts of the output in a different order
than the author intended.

Revision 1.1 / (download) - annotate - [select for diffs], Thu May 19 15:37:47 2022 UTC (22 months, 1 week ago) by schwarze
Branch: MAIN

Make roff_expand() parse left-to-right rather than right-to-left.
Some escape sequences have side effects on global state, implying
that the order of evaluation matters.  For example, this fixes the
long-standing bug that "\n+x\n+x\n+x" after ".nr x 0 1" used to
print "321"; now it correctly prints "123".

Right-to-left parsing was convenient because it implicitly handled
nested escape sequences.  With correct left-to-right parsing, nesting
now requires an explicit implementation, here solved as follows:
1. Handle nested expanding escape sequences iteratively.
When finding one, expand it, then retry parsing the enclosing escape
sequence from the beginning, which will ultimately succeed as soon
as it no longer contains any nested expanding escape sequences.
2. Handle nested non-expanding escape sequences recursively.
When finding one, the escape sequence parser calls itself to find
the end of the inner sequence, then continues parsing the outer
sequence after that point.

This requires the mandoc_escape() function to operate in two different
modes.  The roff(7) parser uses it in a mode where it generates
diagnostics and may return an expansion request instead of a parse
result.  All other callers, in particular the formatters, use it
in a simpler mode that never generates diagnostics and always returns
a definite parsing result, but that requires all expanding escape
sequences to already have been expanded earlier.  The bulk of the
code is the same for both modes.
Since this required a major rewrite of the function anyway, move
it into its own new file roff_escape.c and out of the file mandoc.c,
which was misnamed in the first place and lacks a clear focus.

As a side benefit, this also fixes a number of assertion failures
that tb@ found with afl(1), for example "\n\\\\*0", "\v\-\\*0",
and "\w\-\\\\\$0*0".

As another side benefit, it also resolves some code duplication
between mandoc_escape() and roff_expand() and centralizes all
handling of escape sequences (except for expansion) in roff_escape.c,
hopefully easing maintenance and feature improvements in the future.

While here, also move end-of-input handling out of the complicated
function roff_expand() and into the simpler function roff_parse_comment(),
making the logic easier to understand.

Since this is a major reorganization of a central component of
mandoc(1), stability of the program might slightly suffer for a few
weeks, but i believe that's not a problem at this point of the
release cycle.  The new code already satisfies the regression suite,
but more tweaking and regression testing to further improve the
handling of various escape sequences will likely follow in the near
future.

This form allows you to request diff's between any two revisions of a file. You may select a symbolic revision name using the selection box or you may type in a numeric name using the type-in text box.




CVSweb