mirror of
https://git.lyx.org/repos/lyx.git
synced 2025-01-25 09:35:39 +00:00
# About Hunspell Hunspell is a free spell checker and morphological analyzer library and command-line tool, licensed under LGPL/GPL/MPL tri-license. Hunspell is used by LibreOffice office suite, free browsers, like Mozilla Firefox and Google Chrome, and other tools and OSes, like Linux distributions and macOS. It is also a command-line tool for Linux, Unix-like and other OSes. It is designed for quick and high quality spell checking and correcting for languages with word-level writing system, including languages with rich morphology, complex word compounding and character encoding. Hunspell interfaces: Ispell-like terminal interface using Curses library, Ispell pipe interface, C++/C APIs and shared library, also with existing language bindings for other programming languages. Hunspell's code base comes from OpenOffice.org's MySpell library, developed by Kevin Hendricks (originally a C++ reimplementation of spell checking and affixation of Geoff Kuenning's International Ispell from scratch, later extended with eg. n-gram suggestions), see http://lingucomponent.openoffice.org/MySpell-3.zip, and its README, CONTRIBUTORS and license.readme (here: license.myspell) files. Main features of Hunspell library, developed by László Németh: - Unicode support - Highly customizable suggestions: word-part replacement tables and stem-level phonetic and other alternative transcriptions to recognize and fix all typical misspellings, don't suggest offensive words etc. - Complex morphology: dictionary and affix homonyms; twofold affix stripping to handle inflectional and derivational morpheme groups for agglutinative languages, like Azeri, Basque, Estonian, Finnish, Hungarian, Turkish; 64 thousand affix classes with arbitrary number of affixes; conditional affixes, circumfixes, fogemorphemes, zero morphemes, virtual dictionary stems, forbidden words to avoid overgeneration etc. - Handling complex compounds (for example, for Finno-Ugric, German and Indo-Aryan languages): recognizing compounds made of arbitrary number of words, handle affixation within compounds etc. - Custom dictionaries with affixation - Stemming - Morphological analysis (in custom item and arrangement style) - Morphological generation - SPELLML XML API over plain spell() API function for easier integration of stemming, morpological generation and custom dictionaries with affixation - Language specific algorithms, like special casing of Azeri or Turkish dotted i and German sharp s, and special compound rules of Hungarian. Main features of Hunspell command line tool, developed by László Németh: - Reimplementation of quick interactive interface of Geoff Kuenning's Ispell - Parsing formats: text, OpenDocument, TeX/LaTeX, HTML/SGML/XML, nroff/troff - Custom dictionaries with optional affixation, specified by a model word - Multiple dictionary usage (for example hunspell -d en_US,de_DE,de_medical) - Various filtering options (bad or good words/lines) - Morphological analysis (option -m) - Stemming (option -s) See man hunspell, man 3 hunspell, man 5 hunspell for complete manual. # Dependencies Build only dependencies: g++ make autoconf automake autopoint libtool Runtime dependencies: | | Mandatory | Optional | |---------------|------------------|------------------| |libhunspell | | | |hunspell tool | libiconv gettext | ncurses readline | # Compiling on GNU/Linux and Unixes We first need to download the dependencies. On Linux, `gettext` and `libiconv` are part of the standard library. On other Unixes we need to manually install them. For Ubuntu: sudo apt install autoconf automake autopoint libtool Then run the following commands: autoreconf -vfi ./configure make sudo make install sudo ldconfig For dictionary development, use the `--with-warnings` option of configure. For interactive user interface of Hunspell executable, use the `--with-ui option`. Optional developer packages: - ncurses (need for --with-ui), eg. libncursesw5 for UTF-8 - readline (for fancy input line editing, configure parameter: --with-readline) In Ubuntu, the packages are: libncurses5-dev libreadline-dev # Compiling on OSX and macOS On macOS for compiler always use `clang` and not `g++` because Homebrew dependencies are build with that. brew install autoconf automake libtool gettext brew link gettext --force Then run autoreconf, configure, make. See above. # Compiling on Windows ## Compiling with Mingw64 and MSYS2 Download Msys2, update everything and install the following packages: pacman -S base-devel mingw-w64-x86_64-toolchain mingw-w64-x86_64-libtool Open Mingw-w64 Win64 prompt and compile the same way as on Linux, see above. ## Compiling in Cygwin environment Download and install Cygwin environment for Windows with the following extra packages: - make - automake - autoconf - libtool - gcc-g++ development package - ncurses, readline (for user interface) - iconv (character conversion) Then compile the same way as on Linux. Cygwin builds depend on Cygwin1.dll. # Debugging It is recommended to install a debug build of the standard library: libstdc++6-6-dbg For debugging we need to create a debug build and then we need to start `gdb`. ./configure CXXFLAGS='-g -O0 -Wall -Wextra' make ./libtool --mode=execute gdb src/tools/hunspell You can also pass the `CXXFLAGS` directly to `make` without calling `./configure`, but we don't recommend this way during long development sessions. If you like to develop and debug with an IDE, see documentation at https://github.com/hunspell/hunspell/wiki/IDE-Setup # Testing Testing Hunspell (see tests in tests/ subdirectory): make check or with Valgrind debugger: make check VALGRIND=[Valgrind_tool] make check For example: make check VALGRIND=memcheck make check # Documentation features and dictionary format: man 5 hunspell man hunspell hunspell -h http://hunspell.github.io/ # Usage After compiling and installing (see INSTALL) you can run the Hunspell spell checker (compiled with user interface) with a Hunspell or Myspell dictionary: hunspell -d en_US text.txt or without interface: hunspell hunspell -d en_GB -l <text.txt Dictionaries consist of an affix (.aff) and dictionary (.dic) file, for example, download American English dictionary files of LibreOffice (older version, but with stemming and morphological generation) with wget -O en_US.aff https://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.aff?id=a4473e06b56bfe35187e302754f6baaa8d75e54f wget -O en_US.dic https://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.dic?id=a4473e06b56bfe35187e302754f6baaa8d75e54f and with command line input and output, it's possible to check its work quickly, for example with the input words "example", "examples", "teached" and "verybaaaaaaaaaaaaaaaaaaaaaad": $ hunspell -d en_US Hunspell 1.7.0 example * examples + example teached & teached 9 0: taught, teased, reached, teaches, teacher, leached, beached verybaaaaaaaaaaaaaaaaaaaaaad # verybaaaaaaaaaaaaaaaaaaaaaad 0 Where in the output, `*` and `+` mean correct (accepted) words (`*` = dictionary stem, `+` = affixed forms of the following dictionary stem), and `&` and `#` mean bad (rejected) words (`&` = with suggestions, `#` = without suggestions) (see man hunspell). Example for stemming: $ hunspell -d en_US -s mice mice mouse Example for morphological analysis (very limited with this English dictionary): $ hunspell -d en_US -m mice mice st:mouse ts:Ns cats cats st:cat ts:0 is:Ns cats st:cat ts:0 is:Vs # Other executables The src/tools directory contains the following executables after compiling. - The main executable: - hunspell: main program for spell checking and others (see manual) - Example tools: - analyze: example of spell checking, stemming and morphological analysis - chmorph: example of automatic morphological generation and conversion - example: example of spell checking and suggestion - Tools for dictionary development: - affixcompress: dictionary generation from large (millions of words) vocabularies - makealias: alias compression (Hunspell only, not back compatible with MySpell) - wordforms: word generation (Hunspell version of unmunch) - hunzip: decompressor of hzip format - hzip: compressor of hzip format - munch (DEPRECATED, use affixcompress): dictionary generation from vocabularies (it needs an affix file, too). - unmunch (DEPRECATED, use wordforms): list all recognized words of a MySpell dictionary Example for morphological generation: $ ~/hunspell/src/tools/analyze en_US.aff en_US.dic /dev/stdin cat mice generate(cat, mice) = cats mouse cats generate(mouse, cats) = mice generate(mouse, cats) = mouses # Using Hunspell library with GCC Including in your program: #include <hunspell.hxx> Linking with Hunspell static library: g++ -lhunspell-1.7 example.cxx # or better, use pkg-config g++ $(pkg-config --cflags --libs hunspell) example.cxx ## Dictionaries Hunspell (MySpell) dictionaries: - https://wiki.documentfoundation.org/Language_support_of_LibreOffice - http://cgit.freedesktop.org/libreoffice/dictionaries - http://extensions.libreoffice.org - http://extensions.openoffice.org - http://wiki.services.openoffice.org/wiki/Dictionaries Aspell dictionaries (conversion: man 5 hunspell): - ftp://ftp.gnu.org/gnu/aspell/dict László Németh, nemeth at numbertext org