force=utf8 is required for most characters provided by add-on packages
and (almost) all mathematical characters, because these are not
set up for inputenc's utf8.
unicodesymbols.py failed here (Python 2.7 under Linux) before the simple fix
included in this commit.
Fix problems revealed by hand-compiling and examining the test samples in autotests/export/Unicode-characters/:
* new definitions
* fixed definitions
* "force=utf8" when required
* some IPA symbols fail without the "extraipa" package
* fix direction of "textcommaaboveright"
There are still many math symbols in lib/symbols that lack a corresponding
entry in lib/unicodesymbols, although a clear mapping exists. This commit
adds some of them (not all yet). In the future we should probably move the
information from both files into one database.
These are all in lib/symbols, but we did not yet know the corresponding Unicode
numbers. unicodesymbols still does not contain all symbols from lib/symbols.
The parser that reads unicodesymbols uses backslashes to escape quotes, so
every backslash that is part of a LaTeX command needs to be escaped as well.
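
A minimal sketch of the escaping rule described above. The helper names are
hypothetical and this is not the actual LyX parser; it only illustrates that
backslashes must be escaped before quotes, and that unescaping restores the
original LaTeX command.

```python
def escape_field(latex: str) -> str:
    """Escape a LaTeX command for a quoted field (hypothetical helper).

    Backslashes are doubled first, so the backslashes introduced for
    the quotes in the second step are not doubled again.
    """
    return latex.replace("\\", "\\\\").replace('"', '\\"')


def unescape_field(field: str) -> str:
    """Inverse of escape_field: a backslash makes the next char literal."""
    out = []
    chars = iter(field)
    for ch in chars:
        if ch == "\\":
            # Take the escaped character as-is (empty if input ends here).
            out.append(next(chars, ""))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    for command in (r"\textcommabelow{k}", r'\"{a}'):
        print(command, "->", escape_field(command))
```

Escaping in the other order would double the backslash that protects a quote,
so the round trip would no longer return the original command.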
There are more candidates in the greek and cyrillic sections, but I don't
know those commands, so I did not touch them.
The lib/unicodesymbols part is based on work by Günter Milde:
Both \r{A} and \AA (resp. \r{a} and \aa) are equivalent standard LICR macros
for Aring/aring as well as the deprecated "angstrom sign" character (212B).
However, with \AA for 212B and \r{A} for 00C5, tex2lyx converts \AA to the
deprecated "angstrom sign" which is missing in many fonts including the
Unicode version of Latin Modern.
I added the deprecated flag and the normalize_c() calls, so that tex2lyx
prefers the precomposed forms (these are easier to edit in LyX).
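
To illustrate why normalization helps here: assuming normalize_c() performs
Unicode canonical composition (NFC), the deprecated angstrom sign and the
decomposed sequence both end up as the precomposed letter. The sketch below
uses Python's unicodedata module as a stand-in; it is not the tex2lyx code,
and prefer_precomposed is a hypothetical name.

```python
import unicodedata

ANGSTROM_SIGN = "\u212b"      # deprecated "angstrom sign"
A_WITH_RING = "\u00c5"        # precomposed Aring
A_PLUS_RING = "A\u030a"       # "A" followed by combining ring above


def prefer_precomposed(text: str) -> str:
    """Return the NFC form (assumed to match what normalize_c() does)."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    for sample in (ANGSTROM_SIGN, A_PLUS_RING):
        result = prefer_precomposed(sample)
        print([hex(ord(c)) for c in sample], "->",
              [hex(ord(c)) for c in result])
```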
LaTeXFeatures defines \textcommabelow and \textcommaabove based on a
generic \LyXTextAccent and declares TextCompositeCommands for the Baltic
letters in the T1 font encoding, using \textcommaabove for the small letter g
and \textcommabelow otherwise.
This allows overwriting of the composite definition for other font encodings.
Especially, it does not interfere with the polish/baltic font encoding L7x
(supported by LatinModern and TeXGyre fonts) that provides pre-composed
glyphs.
Greek characters with perispomeni (tilde) accent were not properly shown
in the output document, because the "textgreek" feature re-defined \~ in
a way incompatible with lgrenc.def since version 0.8 (2013-05-13)
(package greek-fontenc).
The compatibility-definition is required for older versions of the font setup
(before the move of "lgrenc.def" from "babel" to "greek-fontenc").
It is now done with "ProvideTextCommand" to not overwrite the more complete
implementation in lgrenc.def.
With the compatibility definition, combined diacritics with tilde
must be input with the tilde first (e.g. \~>, not \>~).
"unicodesymbols" is changed accordingly.
Also, some LICRs for combining Greek diacritical characters were added to
Unicodesymbols.
This fixes bug #9615.
The "notermination" flag tells LyX that terminating an LICR macro with {} is
not necessary. This is normally the case for all macros with a non-alphabetical
name (e.g. \{).
However, combining diacritical characters are converted to *accent macros*,
which expect an argument (the base character).
In Unicode, the base character precedes the combining character,
in LaTeX the combining character precedes the base character.
LyX changes the order of the two characters to get this right,
e.g. "x" + "combining tilde" becomes "\~{x}".
In the special case where there is no preceding character (e.g. at the start
of the document or a paragraph), Unicode shows the combining diacritical
character without a base character.
The replacement is currently not "terminated" (e.g. "\~"), because of the
"notermination=text" flags in "unicodesymbols".
The accent macros take the *following* character as base character, which is
clearly not intended.
In case of a paragraph consisting of just one combining diacritical character,
LaTeX compilation fails with an error.
With the patch, LyX writes the accent macros with an empty argument,
e.g. "\~{}". The output is similar to the view in the GUI, with the diacritical
character on its own, not on the following character.
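
The reordering and the empty-argument case can be sketched as follows. This is
a simplified illustration, not the LyX implementation: ACCENT_MACROS is a
hypothetical one-entry table (only the tilde case is taken from the text), and
stacked diacritics on one base character are not handled.

```python
# Hypothetical excerpt: combining character -> LaTeX accent macro.
ACCENT_MACROS = {"\u0303": "\\~"}  # COMBINING TILDE


def to_licr(paragraph: str) -> str:
    """Convert combining characters to accent macros with an argument.

    Unicode puts the base character first; LaTeX puts the accent macro
    first. Without a preceding base character the macro gets an empty
    argument, so it cannot capture the following character.
    """
    pieces = []
    last_is_base = False
    for ch in paragraph:
        macro = ACCENT_MACROS.get(ch)
        if macro is None:
            pieces.append(ch)
            last_is_base = True
        elif last_is_base:
            base = pieces.pop()
            pieces.append(f"{macro}{{{base}}}")
            last_is_base = False
        else:
            # Start of paragraph, or directly after another accent.
            pieces.append(f"{macro}{{}}")
            last_is_base = False
    return "".join(pieces)


if __name__ == "__main__":
    for sample in ("x\u0303", "\u0303", "\u0303x"):
        print(repr(sample), "->", to_licr(sample))
```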
This is bug #9612. The patch is from Günter Milde. He wrote:
The patch uses the "long" macro names (\llless and \gggtr) to minimize
name-clash probability. (There is, e.g., a name clash of \lll with Babel's
polish.ldf (cf. bug #6197))
and fix wrong ones. This fixes the safe part of bug #8888. The symbols
provided by mdsymbol.sty have to wait, since mdsymbol.sty provides a huge
number of symbols, I don't have the time right now to process them all, and
a partial file format update does not make sense.
The macro is identical to \ldots in texted, but this way, tex2lyx can import both \ldots (as InsetSpecialChar) and \dots (as unicode glyph), while retaining the original distinction (which might get relevant with some special packages or via user redefinition of one of these macros).
The main part of the fix (unicodesymbols) is from Jürgen. This commit fixes
three problems:
- \; etc. were also used in text mode, but are math only
- all of those glyphs need to be forced with utf8
- actually, \; etc. are not the correct macros, since the encoded spaces are
breakable, but the math spaces are all protected. The space symbols are not
defined in the utf8 encodings.
\textsubtilde is a combining character (0x0330), but 0x02f7 is not.
Apart from the wrong LaTeX output, having the same command for two symbols
confuses tex2lyx.
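
A check for this kind of ambiguity could look like the sketch below. It is
only an illustration: the function name is hypothetical, and the table is a
made-up excerpt that mimics the state before this fix, not the real contents
of unicodesymbols.

```python
from collections import defaultdict


def find_ambiguous_commands(table):
    """Return commands that are mapped from more than one code point.

    `table` maps a code point to its LaTeX command. A reverse lookup
    (LaTeX -> Unicode) cannot pick one code point for such commands.
    """
    by_command = defaultdict(list)
    for codepoint, command in table.items():
        by_command[command].append(codepoint)
    return {cmd: sorted(cps)
            for cmd, cps in by_command.items() if len(cps) > 1}


if __name__ == "__main__":
    # Illustrative pre-fix state: two symbols share one command.
    table = {
        0x0330: r"\textsubtilde",  # combining character
        0x02F7: r"\textsubtilde",  # not a combining character
        0x0303: r"\~",
    }
    for cmd, cps in find_ambiguous_commands(table).items():
        print(cmd, [hex(cp) for cp in cps])
```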