Temporary LaTeX removal

Contents
Summary
Usage
Notes on using WordPerfect
Notes on using MS Word
Best way to format your LaTeX
Removal limitations

Summary

The second l2h menu has a LaTeX removal option "r". This removes math and most other nontext material from LaTeX file index.tex and puts the rest in a web page called "index@hide.html". Then you can run your favorite spell-checker, or even grammar checker, on web page index@hide.html.

Afterwards, you can run menu option "R" to put the LaTeX back in and replace the original index.tex by the new and improved one.

Warning: This is not intended to create an actual web page conversion of your document. All your formatting and equations will be missing in the html version. Use the w key to create actual web pages. To convert as best as possible to an MS Word type document, try latex2rtf.

Warning: A web page called "index@lost.html" is also created. In this one, all LaTeX has been irrecoverably removed. So, if you feed index@lost.html into your spell/grammar checker, you will need to transfer all corrections manually back into index.tex.

Warning: Do not use a grammar checker if your knowledge of English grammar is less than perfect. (Or worse, if you are a nonnative English writer who is weak in English itself.) The results will be far more nonsensical than if you just write according to your best intuition. If you want better, you need to find some human being that has a better knowledge of English grammar than you. If a human has difficulty understanding your writing, a piece of software with far less brains than a mosquito is not going to help you.

Clarification: Almost always, the alternatives suggested by the grammar checker are all wrong. But the fact that the checker complains does indicate quite strongly that the sentence is not ideal. And it suggests where the problem is. So if you have some understanding of grammar, you can usually find a better way to say things.

Usage

To use LaTeX removal, enter the l2h menu. Before doing anything else, run latex using the "l" key in the first menu. This is to make sure that the current LaTeX file index.tex is syntactically correct. If it is not, LaTeX removal will be poor. And it is in fact likely to be refused completely.

Next press "2" to go to the second menu. Then press "r" to create the web page version of the current index.tex.

If all goes well, a web page file "index@hide.html" will be produced. Load it in your favorite spell or grammar checker. WordPerfect and MS Word are typical choices. In my experience, WordPerfect does a better job. But you have to set its options to good values first, and you have to select a code page. See the separate sections on WordPerfect and MS Word.

Do not be surprised if you still have to tell your spell or grammar checker to ignore things. The big idea is to make that as rare as reasonably possible. See the Removal Limitations section for more on this.

To allow the LaTeX to be restored, the web page file index@hide.html contains markers like "Display 321456789 is here." That probably corresponds to a figure or so. As long as you do not significantly alter these markers, the LaTeX should be properly restored. There are also paragraph markers like "Paragraph 4874 follows." If you mess those up, you will probably find the original paragraph back at the end of the file. So do not do that either.

If you did not write it, it is probably a marker and should be left alone.
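
As a safety net, you could compare the markers before and after checking. The following is only a rough Python sketch, not part of l2h. The marker pattern is an assumption based on the two examples quoted above, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Assumed marker shapes, taken from the two examples quoted in the text;
# the real l2h markers may differ in detail.
MARKER_RE = re.compile(r"(?:Display|Paragraph) \d+ (?:is here|follows)\.")


def find_markers(html_text):
    """Return a Counter of every marker-like string in the text."""
    return Counter(MARKER_RE.findall(html_text))


def damaged_markers(before, after):
    """List markers present in `before` that are missing (or fewer) in `after`."""
    lost = find_markers(before) - find_markers(after)
    return sorted(lost)
```

Feed it the text of index@hide.html as saved before and after the spell/grammar check. Any marker it lists was probably altered by you or by the checker.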

(The web page that l2h gets back from the spell/grammar checker goes through a process called "reconciliation." This is intended to prevent serious amounts of LaTeX from being lost. But not being lost is not the same as being in the right place. The reconciliation will put things it cannot locate at the end of the paragraph, assuming that the paragraph marker is not corrupt. Otherwise it goes to the end of the file. You might even end up with two copies of the same thing: one with your spell and grammar corrections but missing latex; the other the original paragraph without corrections at the end of the file. The bottom line is: leave the markers alone.)

The first time you do this, or when you change checker, make only a few changes before checking that you can restore the latex OK.

Do not try to make other corrections besides spelling and grammar ones in index@hide.html. They will turn to garbage in the restored LaTeX file. Make such corrections in the new index.tex.

After correcting the spelling and grammar, use the "R" key in the l2h menu 2 to replace your current index.tex file with a new one including the corrections. You should take the opportunity to compare the two files before you confirm the change. Often the new file has a few things out of place that you need to fix after replacement. (If you replaced index.tex and then find out that you should not have, use the "x" menu key to restore the old index.tex.) (Personally, I prefer to load index.tex and index@new.tex in Emacs to compare them before I replace one by the other.)
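
If you do not use Emacs, any diff tool will do for the comparison. As one possibility, here is a small Python sketch using the standard difflib module. The function name tex_diff is hypothetical; the default file names are the ones mentioned above.

```python
import difflib


def tex_diff(old_text, new_text, old_name="index.tex", new_name="index@new.tex"):
    """Return a unified diff between the old and new LaTeX sources."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(old_lines, new_lines, fromfile=old_name, tofile=new_name)
    )
```

Read both files into strings, pass them in, and print the result. Lines starting with "-" come from the old file, lines starting with "+" from the new one. An empty result means the files are identical.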

Notes on using WordPerfect as checker

I think WordPerfect does the best grammar check. And independent test results have it the best in finding errors too. But it is somewhat of a pain to use.

First of all, WordPerfect does not do UTF-8. Therefore you will be asked to select a "codepage" before latex removal. For English and most other Western European languages, the default WINDOWS-1252 will work OK.
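
If you are not sure whether your text fits in a given code page, a quick test is possible. The Python sketch below is an illustration only and has nothing to do with how l2h itself does it. It assumes Python's "cp1252" codec as the equivalent of WINDOWS-1252.

```python
def unencodable_chars(text, codepage="cp1252"):
    """Return the sorted list of characters that `codepage` cannot represent."""
    bad = set()
    for ch in text:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            bad.add(ch)
    return sorted(bad)
```

An empty list means every character in the text can be stored in that code page.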

Before loading index@hide.html into WordPerfect, you need to do a few things. Go into the WordPerfect "Tools" "Settings" menu and select "Environment" "General" "Code Page". Select the appropriate code page. (The used code page must also be present in the "convert" folder of l2h.) You may also want to go into "Tools" "QuickCorrect" and prevent WordPerfect messing up things all by itself. For example, if you do not, it will turn "Fig. 1(c)" into "Fig 1.©".

You will need to change Grammatik options if you want to get some real feedback on your writing. Go into the Grammatik window, click "Options", then "Checking Styles". The quick way is now to select "Very Strict" and exit the menu. But that is probably too much. Instead select "Quick Check" and press "Edit". At the time of writing, I allow myself 3 consecutive nouns, 3 consecutive prepositional phrases, 30 words in a sentence, 0 spell numbers (to avoid annoyance), and 0 words in split infinitives. As far as the rules are concerned, I have them all enabled except "Archaic", "Colloquial", "Foreign", "Jargon", "Second Person Address", and "Trademark".

WordPerfect will occasionally mistake hidden LaTeX markers at the end of a sentence for additional space. If it asks you to delete that space, say no.

Just using save does not work. When done spell/grammar checking, you have to save the result using the "File" "Publish" "HTML" menu item. While saving, put a check mark in the "Plain HTML" checkbox. Make sure you save in the same code page encoding that the file had when you loaded it.

Notes on using MS Word as checker

Before loading index@hide.html in MS Word, go into the options and turn off "Autocorrect" and "Autoformat while you type". You do not want MS Word to mess up markers and such, all by itself.

MS Word creates a subfolder called "index@hide_files" when saving the corrected html file. This folder can be deleted.

Best way to format your LaTeX

For optimum LaTeX removal, format special characters as, say, \euro{}, \yen{}, \copyright{}, ... Similarly, format named multinational characters as in \ae{}. Alphabetic accents must enclose the accented letter in braces, as in \c{C}. Nonalphabetic accents must be used without braces, as in \"a. Exception: for accents on \i or \j, braces must always be used: \"{\i}, \c{\i}. Use a tonos as \'{}I. Do not put whitespace in the middle of a word, or it becomes two words. For example, "\AA ngstrom" will become two words; use "\AA{}ngstrom" instead. LaTeX removal will also recognize Cyrillic letters of the form \cyrchar\CYRYO{} and various \ding{NNN} characters.

You can also have multinational characters in UTF-8 format. However, other encodings will show up as mojibake. If you use, say, ISO-8859-15 characters instead of UTF-8, index.tex can be converted to UTF-8 format by putting it in the ISO-8859-15_UTF-8 subfolder in the convert subfolder of the l2h folder. Then click convert_tex in the folder to convert. Later on you can convert index.tex back using the UTF-8_ISO-8859-15 subfolder. (Actually, you should probably not convert back. UTF-8 is now overwhelmingly the recommended encoding. But you might have to switch to xelatex to use it.)
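
For comparison, the same kind of conversion can be done in a few lines of Python. This is a stand-in sketch, not the convert_tex tool. The function name reencode is hypothetical; the codec names are the standard Python ones.

```python
def reencode(data, source="iso-8859-15", target="utf-8"):
    """Decode bytes from `source` and re-encode them as `target`."""
    return data.decode(source).encode(target)
```

Read index.tex in binary mode, pass the bytes through reencode, and write the result to a new file. Keep the original until you have checked that the accented characters came through correctly.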

Any text inside \latexhtml{...}{...} commands will be hidden. If you want the latex part to be spell/grammar checked, use separate \latex{...} and \html{...} commands.

(A knowledgeable user might be able to add more entries to the recognized latex characters. Using a text editor, open the file uc_latex.sub; it is in the data subfolder of the system-files folder of l2h. Follow the existing format. Unicode character numbers must be in nondecreasing order. For duplicated numbers, the first version is given priority. The equivalent latex code must be whitespace delimited.)

(Note that in principle, a careful and computer-savvy user could also change the latex removal process itself. The files to modify are tex_enc.sub and enc_tex.sub in the data subfolder in the system-files subfolder of l2h. They can be edited with a text editor. However, the language in which they are written is very user unfriendly. Watch for stray trailing spaces! Actually, you cannot watch for them. They are invisible but hurt. You can feel for them with the cursor. Or try highlighting the text. In emacs, set option "show trailing whitespace" and/or use "M-x delete-trailing-whitespace".)
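
If your editor cannot show trailing whitespace, a short script can point out the offending lines. The Python sketch below is a generic illustration, not an l2h tool. It only reports the lines and changes nothing.

```python
def trailing_whitespace_lines(text):
    """Return the 1-based numbers of lines that end in a space or tab."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line != line.rstrip(" \t"):
            hits.append(number)
    return hits
```

Pass it the contents of the edited .sub file and look at the reported line numbers in your text editor.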

Removal Limitations

LaTeX removal is not perfect. In particular, under some conditions math parts may not be properly removed, leaving your spell checker fuming over all the bad "words". Or large amounts of normal text may be removed along with the math, so that your spell checker cannot correct it.

First of all, many less common LaTeX constructs are not implemented at this time. Implementing every possible construct would make LaTeX removal very slow. At the time of writing, the only removed nontext environments are:

   comment, verbatim, rawhtml, figure, table (leaving the captions),
   picture, tabular, thebibliography, theindex, flushright,
   displaymath, equation, equation*, align, align*, flalign, flalign*,
   cases, multline, gather, eqnarray, eqnarray*, \[...\] and $$...$$.

The contents of quotation environments and such should presumably be spell-checked. (Although you can always blame the original author.) All \begin{...} and \end{...} commands are removed, even if the environment itself is not.

User-defined environments are not recognized. (If they are nontextual, you can remove them by temporarily enclosing them in a comment environment.) Also the discouraged old AMSTeX style of

   \ENVIRONMENT ... \endENVIRONMENT

instead of the recommended

   \begin{ENVIRONMENT} ... \end{ENVIRONMENT}

is currently not recognized. (An exception has been made for \equation[*].) Therefore your LaTeX writing style makes a lot of difference for how effective LaTeX removal is.

The same applies for smaller text mark-up. Single quotes written as \lq{} and \rq{} and double quotes written as \lq\lq{} and \rq\rq{} will be converted to the proper HTML quotes. Similarly \copyright{} is converted, but just \copyright is not. Etcetera.

Another problem with LaTeX removal is that it uses a relatively simple algorithm. It does not actually interpret the LaTeX. Instead it uses heuristics to figure out what to shove out of the way and what to keep for the spell checker. Heuristics can fail. Certain constructions (in particular, verbatim and comment environments and \verb commands) have the potential to cause problems.

LaTeX removal runs checks for likely problems and aborts if it recognizes them, in which case everything stops, of course. If it does not recognize a problem, exposed mathematics or similar is likely to remain in the document, or text may be removed as if it were mathematics. As long as you and your spell/grammar checker do not change the exposed markers, the index.tex document will still be properly restored. But of course, erroneously removed text will not have been checked. And you will need to tell your checker to ignore all the math left in.

These problems are more likely if you use verbatim environments or \verb commands. Comment and rawhtml environments are also potential trouble spots.

Here are some rules that can help prevent these problems:

- Do not put anything on the same line as an \end{verbatim} command. There is no point in doing this in the first place, because LaTeX puts in a new line after a verbatim command anyway. (The only exception to the rule is a % used to comment out the \end{verbatim} command.) Comment and rawhtml environments act like verbatim ones and the same applies to them.
- Do not use \ as an encloser in \verb commands. There are plenty of other characters you can use. The @ character is usually a good choice. But there are at least 93 other possibilities. (You can use a letter as the encloser in a \verb command if you put a tab character between \verb and the letter. And a control character like Ctrl-M also seems to work as an encloser. Not that I recommend any of those.)
- A \verb command inside a verbatim or similar environment should include proper enclosers.
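
The first two rules are easy to check mechanically before you start LaTeX removal. The Python sketch below is one way to do it. It is not part of l2h, and the function name lint_verbatim is made up. It assumes that leading and trailing whitespace around an \end command is harmless, which the text above does not actually say.

```python
import re

# Environments that, per the rules above, act like verbatim.
END_ENV_RE = re.compile(r"\\end\{(?:verbatim|comment|rawhtml)\}")
# Allowed form: the \end command alone on its line, optionally commented out.
CLEAN_END_RE = re.compile(r"^\s*%?\s*\\end\{(?:verbatim|comment|rawhtml)\}\s*$")
# \verb (or \verb*) immediately followed by a backslash as the encloser.
BACKSLASH_VERB_RE = re.compile(r"\\verb\*?\\")


def lint_verbatim(tex_text):
    """Return (line number, message) pairs for lines that break the rules."""
    problems = []
    for number, line in enumerate(tex_text.splitlines(), start=1):
        if END_ENV_RE.search(line) and not CLEAN_END_RE.match(line):
            problems.append((number, "text shares a line with an \\end command"))
        if BACKSLASH_VERB_RE.search(line):
            problems.append((number, "\\verb uses \\ as its encloser"))
    return problems
```

Run it on the contents of index.tex. An empty list means neither rule is visibly broken. It does not check the third rule.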

To better understand when problems may arise, consider the algorithm that is used in the math removal process:

Any \\ is assumed to be a newline and shoved out of the way. This can cause problems. For example, suppose index.tex contains
"The \verb\$\\~signs danced before his eyes."
LaTeX removal will assume that the \\ in it is a newline. That will cause either an abort (if there is no later \ on the same line) or else improper LaTeX removal. Fortunately, these cases seem to be rare and easily fixed. (Making a \verb-like command yourself is not easy, though obviously possible.)
Next any \verb command is moved out of the way by identifying its matching enclosers. This too can cause problems: \verb strings inside comments or verbatim environments may not have matching enclosers.
The confusing \}, \{, \%, and \$ are shoved out of the way.
Backslashes inside comments are shoved out of the way. (A comment is taken to be any % and the rest of the line.)
The header is shoved out of the way. The previous step is necessary to avoid mistaking a commented-out \begin{document} for the real thing.
Comment, verbatim, and rawhtml environments are shoved out of the way, each separately. That provides a stronger check that their starts and ends are properly matched. Since comment blocks are done first, improperly closed blocks within these are no longer a concern.
This step too can fail. Consider the following examples:
```
   %\begin{verbatim}           /\begin{verbatim}           \\begin{verbatim}
   ...                versus   ...                versus   ...  
   %\end{verbatim}             %\end{verbatim}             %\end{verbatim}
```
In the first example, the %\end{verbatim} line is a comment to be ignored. The ... is regular text, to be spell checked. In the second example, the ... is literal text, along with its final %, that must be removed. In the third example, %\end{verbatim} is again a comment and ... regular text to be kept. LaTeX removal does not have the smarts to figure out which possibility it is. In fact, it will simply assume the first/third possibility. Consider also
```
   ...text\\end{verbatim}
```
LaTeX removal is not smart enough to figure out whether end{verbatim} is a text string following a new line or whether \end{verbatim} is an end to a verbatim environment. In this case, it will assume, probably incorrectly, that \\ is new line.
The trailer is moved out of the way. Note that verbatim/comment/rawhtml environments could well contain an \end{document} string. So they must definitely be done first.
Other nontext environments are removed.
Comments are now removed. (Removing all comments inside header, trailer, and nontext environments would be too much.)
Inline and $$...$$ math is removed. Note that $$...$$ is improper and best rewritten as \[...\]. That also makes LaTeX removal more reliable.
Various other stuff is removed or converted to a more appropriate form for spell/grammar checking:
- \index index entries;
- \cite... latex and natbib style citations;
- \ref... latex references;
- \title, \part, \chapter, ... sectional commands, leaving the text;
- various multinational characters in LaTeX representation are converted to HTML form;
- \begin{minipage}[...]{...} and \begin{minipage}{...} strings;
- \begin{...} and \end{...} strings;
- standard latex commands with nontextual arguments;
- boxes;
- other latex commands, except for some common ones that seem better left alone or converted to an equivalent html form.
Paragraph markers are added.
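
To make the idea of "shoving things out of the way" more concrete, here is a toy Python sketch of a few of the steps above. It is emphatically not the l2h code, which is written in its own language. The placeholder format, the function names, and the much simplified patterns are all assumptions made for illustration. It skips the header, trailer, and environment steps completely and has none of the sanity checks.

```python
import re


def stash(text, pattern, store):
    """Replace each match of `pattern` by a numbered placeholder."""
    def keep(match):
        store.append(match.group(0))
        return f"\x00{len(store) - 1}\x00"
    return re.sub(pattern, keep, text)


def restore(text, store):
    """Put the stashed strings back; repeat because stashes can nest."""
    placeholder = re.compile("\x00(\\d+)\x00")
    while placeholder.search(text):
        text = placeholder.sub(lambda m: store[int(m.group(1))], text)
    return text


def hide_some_latex(text):
    """Apply a few of the removal steps in the order described above."""
    store = []
    text = stash(text, r"\\\\", store)                         # \\ taken as newline
    text = stash(text, r"\\verb\*?([^A-Za-z\s]).*?\1", store)  # \verb with matching enclosers
    text = stash(text, r"\\[{}%$]", store)                     # the confusing \{ \} \% \$
    text = stash(text, r"%.*", store)                          # comments up to end of line
    text = stash(text, r"\$[^$]*\$", store)                    # inline math
    return text, store
```

The order matters in the same way as in the real algorithm: because \% and \$ are stashed first, they can no longer be mistaken for the start of a comment or of inline math. Calling restore on the hidden text with the same store gives back the original string.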

After the above encodings, in a second stage everything is converted to final html (or other) form. This step has a "Microsoft" version and a "WordPerfect" one.
