This regex handling is part of Qt5. For LyX built with Qt4
findadv will still work, but case-insensitive matching
of non-ASCII characters is not good.
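For illustration, a minimal Qt5 sketch (standalone, not LyX code) of the
case-insensitive Unicode matching referred to above; pattern and text are
arbitrary examples.

    #include <QDebug>
    #include <QRegularExpression>
    #include <QString>

    int main()
    {
        // Case-insensitive match of a Cyrillic word. Qt5's QRegularExpression
        // handles this; it is the weak spot of the Qt4 engine mentioned above.
        QRegularExpression re(QString::fromUtf8("найти"),
                              QRegularExpression::CaseInsensitiveOption
                              | QRegularExpression::UseUnicodePropertiesOption);
        qDebug() << re.match(QString::fromUtf8("Можно НАЙТИ это слово")).hasMatch(); // true
    }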
I misinterpreted the unicode creation 'u8"\uF00xx"'.
The C++ compiler saw 'u8"\uF00x" "x"', but this was not intended.
The routine which mimicked this now does the right job.
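To illustrate the parsing rule involved (standalone example with made-up
code points, using char32_t literals so the element count makes the split
visible):

    // '\u' always consumes exactly four hex digits; anything following it
    // becomes an ordinary character of the literal. Code points needing more
    // digits require the long form '\U' with eight hex digits.
    int main()
    {
        constexpr char32_t two[] = U"\uF0012";     // parsed as U+F001 followed by '2'
        constexpr char32_t one[] = U"\U000F0012";  // a single code point U+F0012
        static_assert(sizeof(two) / sizeof(char32_t) == 3, "escape plus a stray '2'");
        static_assert(sizeof(one) / sizeof(char32_t) == 2, "one code point only");
        return 0;
    }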
For search we used to lowercase everything, but since the regex itself
should be left unchanged, this change was needed.
This works nicely for ASCII, but fails miserably on other UTF-8 code points
(like Cyrillic chars).
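A small standalone illustration (not LyX code) of why ASCII survives but
Cyrillic does not when lowercasing byte by byte:

    #include <cctype>
    #include <iostream>
    #include <string>

    // Lowercase a UTF-8 string byte by byte, as an ASCII-only routine would.
    std::string lowerBytes(std::string s)
    {
        for (char & c : s)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        return s;
    }

    int main()
    {
        std::cout << lowerBytes("FOUND") << '\n';   // "found": ASCII works
        std::cout << lowerBytes("НАЙТИ") << '\n';   // unchanged: the UTF-8 bytes are > 127
    }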
Due to a hint from Scott:
> I don't think it should be printed to LYXERR0. That goes to the terminal
> for the user. Even our lyx2lyx tests fail if they detect any information
> printed to the terminal (although we escape some warnings). If "find" is
> not a good debug level for it, maybe we can put it in "info" (the first
> debug level)?
1.) Use a vector for the borders, because any fixed size may be too small
if there are plenty of accented characters in a paragraph
2.) Use '[\S]' instead of '.' in the regex for 'accre'. The regex would
otherwise also find patterns like '\ {some text}'
Different behaviour in regexp{..} for 'İ' and 'ß':
1.) The lowercase routine for 'İ' gives 'İ', so that if we are searching
while ignoring case, the string '\dot{I}' is converted to '\dot{i}'.
In this case we have to change it to 'İ' (instead of 'i', as one would expect).
2.) If 'ß' is entered via the keyboard in a freshly created regexp box it
appears as \lyxmathsym{ß}; if pasted from the LyX screen it appears as \text{ß}.
1.) Added handling for the 'breve' and 'grave' accents
2.) Corrected handling of 'i'-accents (allow \hat{i} _and_ \hat{\imath})
because of problems with ignoring case
3.) Spaces: changed some indentation in the source
The problem is that the regex is handled in math mode. That is,
any accented character is converted to a math macro.
For instance "ä" --> "\\ddot{a}".
Outside of math or regex it is not converted (if the xetex flavour is used),
but there are other chars which are converted both in math and in text
(but differently).
For instance "ů"
in math --> "\\mathring{u}"
in text --> "\\r{u}"
TODO: determine the conversions that are still not handled.
It would be nice if we could persuade the math factory not to convert
these characters, but I was unable to find the place where the
conversion actually takes place.
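A hypothetical reverse mapping in that spirit (the table holds only the
conversions mentioned above, names are made up; this is not the LyX
implementation):

    #include <iostream>
    #include <map>
    #include <string>

    // Undo the math-macro conversion so the regex sees the plain character
    // again. The TODO above is precisely about completing such a table.
    std::string demacro(std::string s)
    {
        static const std::map<std::string, std::string> macro2char = {
            {"\\ddot{a}",     "ä"},
            {"\\mathring{u}", "ů"},   // math flavour
            {"\\r{u}",        "ů"},   // text flavour
        };
        for (auto const & p : macro2char) {
            std::string::size_type pos;
            while ((pos = s.find(p.first)) != std::string::npos)
                s.replace(pos, p.first.size(), p.second);
        }
        return s;
    }

    int main()
    {
        std::cout << demacro("\\ddot{a} and \\mathring{u}") << '\n';   // "ä and ů"
    }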
1.) Fill the 'head' member to make the macro easier to recognize. May be
discarded later, although it does not cost too much run-time
2.) Add some comments
3.) Ignore any macro inside the regex.
1.) Make sure the environment is mentioned in the search string
(added the keyword \latexenvironment{...})
2.) Handle it similarly to \textcolor{}
That way we can also search for 'conclusion*' or 'summary' etc.
in Additional.lyx.
Some macros needed:
1.) care about case sensitivity
2.) better handling of the used argument values
3.) a cleaner list-environment search
4.) removal of a superfluous '~' when searching for a description or labeling env
Sometimes it happens that the selection contains an area which was skipped
in splitOnKnownMacros().
So we check if a shorter selection would give the same match size.
Also
a.) try to speed up the regex search using non-greedy mode (.*?)
b.) remove '\n' completely from searched strings if it is not surrounded by
alphanumerical chars (see the sketch below)
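A sketch of b) with hypothetical naming and an ASCII-only check for brevity;
the note above only specifies the removal case, so a surrounded newline is
kept here:

    #include <cctype>
    #include <string>

    // Remove a '\n' completely unless it sits between two alphanumeric chars.
    std::string stripNewlines(std::string const & in)
    {
        std::string out;
        for (std::string::size_type i = 0; i < in.size(); ++i) {
            if (in[i] != '\n') {
                out += in[i];
                continue;
            }
            bool const alnumBefore = i > 0 &&
                std::isalnum(static_cast<unsigned char>(in[i - 1]));
            bool const alnumAfter = i + 1 < in.size() &&
                std::isalnum(static_cast<unsigned char>(in[i + 1]));
            if (alnumBefore && alnumAfter)
                out += '\n';   // surrounded by alphanumerics: leave it alone
            // otherwise the newline is dropped completely
        }
        return out;
    }

    int main()
    {
        // "foo\nbar" keeps its newline, "foo\n bar" loses it.
        return stripNewlines("foo\n bar") == "foo bar" ? 0 : 1;
    }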
The problem here is that selecting any subset of a \lettrine{}
line always creates an initials header. That makes it impossible
for our search engine to find strings, because the regex does not
contain that info. So we have to discard the leading \lettrine part
completely.
We now place a marker (\endarguments) to determine that removable
part.
This is not possible for '$', because of its LaTeX meaning of
starting/ending a math inset.
Therefore, if not ignoring format, we still have to use
[\\][\$] in the regex in order to find '$' in text.
Given the regex 'r.*r\b' and the string
"abc regular something cursor currently"
we expect to find "regular something cursor".
But while searching we may be confronted with the partial input
"regular something cursor curr"
and there the matched string would come out longer (demonstrated below).
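Checked with a small standalone std::regex program (the engine LyX uses may
behave slightly differently, but the principle is the same):

    #include <iostream>
    #include <regex>
    #include <string>

    int main()
    {
        std::regex const re("r.*r\\b");
        std::smatch m;

        std::string const full = "abc regular something cursor currently";
        if (std::regex_search(full, m, re))
            std::cout << m.str() << '\n';    // "regular something cursor"

        // Partial input as it may be seen while the search still advances:
        std::string const partial = "regular something cursor curr";
        if (std::regex_search(partial, m, re))
            std::cout << m.str() << '\n';    // "regular something cursor curr"
    }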
Testcase without this patch:
1.) open de/Additional.lyx
2.) go to 6.1 Astronomy & Astrophysics
3.) open the index
4.) find advanced with
a.) not ignoring format
b.) regex = .+
c.) language of regex: English
5.) search next
The search finds the next break (which is outside of the index).
The subsequent attempt to display the selection leads to a crash.
This change is valid for findadv too.
Patterns like '.*' are now greedy, as is normal in regex.
Searching for whole words is corrected, but can be slow.
One can speed up the search with an adapted pattern.
For instance, when searching for words starting and ending with 'r',
the normal pattern is 'r.*r'. The speed-up pattern could be
'\br[^\s]*r\b'. This halves the search time.
Search results are now different from those of lyx2.3, because the greedy
'.*' is now really greedy.
To achieve the same results, we have to use '.*?' instead (see the example below).
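A minimal demonstration of that difference (standalone std::regex, example
string made up):

    #include <iostream>
    #include <regex>
    #include <string>

    int main()
    {
        std::string const text = "red car, green car";
        std::smatch m;

        std::regex_search(text, m, std::regex("r.*r"));    // greedy, the new behaviour
        std::cout << m.str() << '\n';                      // "red car, green car"

        std::regex_search(text, m, std::regex("r.*?r"));   // non-greedy, lyx2.3-like result
        std::cout << m.str() << '\n';                      // "red car"
    }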
An attempt to decrease the number of tests needed for a match.
Also an attempt to handle Hebrew documents. Unfortunately
the latex output is missing the language specification
(only the change of encoding is available there).
I failed to find a proper place to add the language.
That means that searching for e.g. English text in Hebrew documents
is not satisfying.
The time needed to find a simple string depended on the
paragraph length and was O(n^2).
Now it is down to O(n).
Before:
To determine whether the pattern matches, we compared the
paragraph from the current position to its end.
If there was no match, the current position was incremented.
Now:
Check whether the character at the current position has at least
the needed features (text, color, language etc).
If not, increment the current position,
else proceed as before (see the sketch after these notes).
Also added missing math env alignat
Modified handling of longtable/tabular
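A toy sketch of that idea (plain strings instead of the LyX data structures,
names hypothetical):

    #include <string>

    // The cheap per-position check avoids running the full comparison at every
    // position, which in the typical case brings the work from O(n^2) to ~O(n).
    int findMatch(std::string const & par, std::string const & needle)
    {
        if (needle.empty())
            return 0;
        for (std::string::size_type pos = 0; pos < par.size(); ++pos) {
            // Cheap feature check; in LyX this asks for text, color, language etc.
            if (par[pos] != needle[0])
                continue;
            // Only now run the expensive full comparison, as before.
            if (par.compare(pos, needle.size(), needle) == 0)
                return static_cast<int>(pos);
        }
        return -1;
    }

    int main()
    {
        return findMatch("searching in a paragraph", "para") == 15 ? 0 : 1;
    }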
Added a routine to count valid chars. This is needed
for the detection of word boundaries (a counting sketch follows below).
Due to detection conflicts between
the regex '.*' and the word-boundary match in MatchStringAdv::operator(),
we need to use '\b' in the regex explicitly, e.g. '\b.*\b'.
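A hypothetical counting helper in that spirit (not the actual routine,
ASCII-only for brevity):

    #include <cctype>
    #include <string>

    // Count the word-constituent characters starting at 'from'; such a count
    // is what a word-boundary check has to agree on with the regex's '\b'.
    int countWordChars(std::string const & s, std::string::size_type from)
    {
        int n = 0;
        while (from + n < s.size()
               && (std::isalnum(static_cast<unsigned char>(s[from + n]))
                   || s[from + n] == '_'))
            ++n;
        return n;
    }

    int main()
    {
        return countWordChars("word boundary", 0) == 4 ? 0 : 1;   // "word" -> 4
    }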
The backward search works, but
1.) only in the current paragraph (this is the same as before)
2.) only in the same language environment.
1.) Added \textmd to be ignored (sometimes it is used and sometimes not)
2.) Typo: multiline --> multline. Searching in 'multline' caused a crash
because processing all of the '{' and '}' in the content of this math env
exceeded the size of the interval field.
1.) Handle some unclosed parentheses
(sometimes \shortcut is not correctly closed)
2.) Added \ldots as a known char
3.) Discard some shapes (circlepar, droppar, ...)
4.) Omit the resulting empty string and use some value
which cannot be matched instead.
That way we do not match the whole table but only the cell contents.
The problem I had was:
1.) Document language Spanish
2.) Table (copied from an English doc) => language English
3.) All cell contents Spanish
A search for English text then led to a selection of the whole
table, although there was no English content in any cell.