Popular dictionaries

6/21/2023

Hunspell is the most used spellchecker in the world. It is built into Mozilla Firefox, Google Chrome, Libre/OpenOffice, macOS, Adobe products, and whatnot. Thus, Hunspell dictionaries are the most common open dictionary format; they are available for almost a hundred of the world's languages. The temptation to reuse those dictionaries for text processing is high: they are somewhat suitable not only for spellchecking but also for determining the canonical (base) form of words, canonical capitalization, stemming, etc.

But the dictionary format of Hunspell has a lot of peculiarities. Depending on your mindset, you might find the facts below curious, fascinating, ridiculous, or just plain boring. From my perspective, they are, first of all, didactic: they demonstrate the blossoming complexity of organically evolved software that solves a complicated task.

This text is one of the results of the “Rebuilding the spellchecker” effort, which strives to explain how the world’s most popular spellchecker, Hunspell, works via its Python port called Spylls. A six-part series already covered all the algorithms in Hunspell, and now we switch to broader topics.

The dictionary is stored in two text files

A dictionary for each language consists of two human-readable text files (more on that below). Some dictionaries are developed in the open, and you can find those files on GitHub or in some other source repository. It’s relatively easy to find and fix the issues in such dictionaries. However, dictionaries are usually packaged for distribution as software-specific packages. Those are ZIP files with a different file extension and some additional meta-information inside. If your dictionary is only available as a package and you want to make some changes, you have to unpack it or find the original source.

The first file contains the list of words, but it can’t be interpreted without the other

The .dic file, or dictionary file, has a simple format: one word per line. The line might contain just a word or, more commonly, a word with flags (separated by /). What those flags mean is defined in the other file, the affix file. A line from the beginning of the English one:

TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'

The first lines of that file: (… .dic files); describe the language’s alphabet (sorted by letter frequency); instruct to convert ’ (typographic apostrophe) to ' before checking; set flags to mark some words as not-to-be-suggested (!) and words that can only be parts of other words (c); and set a minimal size of the word for word compounding (1 character instead of the default 3: the English dictionary uses the “compounding” feature to check the spelling of ordinals like “111th”).

And it is just the first 7 lines of a 200-something-line file, for the language with one of the simplest settings.

There is no specification for the comments format

A lot of existing dictionaries have comments in their files. Comments seem to begin with the # character (as is common for many other scripting and config file formats). However, Hunspell’s code doesn’t have any special logic for handling #. While reading the affix file, it just ignores lines that do not start with a known directive, as well as the rest of every line after the expected directive’s content. This allows a comments convention to exist between dictionary authors without being explicitly implemented. Hunspell-compatible software should not treat # specially: for example, the en_GB dictionary uses it as a flag.
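The two reading rules described above (a .dic line is a word optionally followed by / and flags; an affix-file line is only considered if it starts with a known directive, and everything after the expected content is dropped) can be illustrated with a minimal Python sketch. This is neither Hunspell’s nor Spylls’ actual code: the function names and the KNOWN_DIRECTIVES table are hypothetical, only a handful of single-argument directives are included, flags are assumed to be single characters, and the sample lines are made up for illustration.

```python
# Hypothetical, heavily simplified table: directive name -> number of
# arguments expected after it. Real affix files have many more directives,
# and some of them span several lines.
KNOWN_DIRECTIVES = {
    "TRY": 1,
    "NOSUGGEST": 1,
    "ONLYINCOMPOUND": 1,
    "COMPOUNDMIN": 1,
}


def read_aff_lines(lines):
    """Collect known directives and silently skip everything else."""
    settings = {}
    for line in lines:
        parts = line.split()
        if not parts or parts[0] not in KNOWN_DIRECTIVES:
            # Unknown first token (which includes "#"): the line is ignored.
            continue
        name = parts[0]
        expected = KNOWN_DIRECTIVES[name]
        args = parts[1:1 + expected]
        if len(args) < expected:
            continue  # malformed directive, skipped in this sketch
        # Anything after the expected arguments is never looked at.
        settings[name] = args[0]
    return settings


def parse_dic_line(line):
    """Split 'word/FLAGS' into (word, set of single-character flags)."""
    word, _, flags = line.strip().partition("/")
    return word, set(flags)


# Made-up sample input, not taken from a real dictionary.
aff = [
    "# a comment, by convention only",
    "TRY esianrtolcdugmphbyfvkwzESIANRTOLCDUGMPHBYFVKWZ'",
    "NOSUGGEST !  # trailing text is dropped, not parsed as a comment",
    "ONLYINCOMPOUND c",
    "COMPOUNDMIN 1",
]
settings = read_aff_lines(aff)
print(settings)

# "#" after the slash is an ordinary flag here, not the start of a comment.
print(parse_dic_line("example/#c"))
```

Note that nothing in the sketch checks for #: comment lines disappear only because # is not a known directive, and a # inside a .dic entry stays an ordinary flag.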