Draft: More direct language handling
Earlier TeX engines could load hyphenation patterns only at format creation time, that is, when running iniTeX. LuaTeX has no such limitation: with Lua, hyphenation patterns can be loaded at any time.
Today virtually all hyphenation patterns and exceptions that have been used by
TeX users are distributed in the hyph-utf8 package. hyph-utf8 also provides
the patterns/exceptions as UTF-8 encoded plain text files, which are
preferred for LuaTeX.
TeX Live's approach is to provide hyphenation patterns/exceptions for each
language in a separate package. Each package then hooks itself in using the
TeX Live execute AddHyphen directive. An example for French:
execute AddHyphen \
name=french synonyms=patois,francais \
lefthyphenmin=2 righthyphenmin=2 \
file=loadhyph-fr.tex \
file_patterns=hyph-fr.pat.txt \
file_exceptions=
This information is also written to the files used by the eTeX language
mechanism, which is used by plain LuaTeX. This gets added to language.def:
\addlanguage{french}{loadhyph-fr.tex}{}{2}{2}
and this is written to language.dat.lua:
['french'] = {
    loader = 'loadhyph-fr.tex',
    lefthyphenmin = 2,
    righthyphenmin = 2,
    synonyms = { 'patois', 'francais' },
    patterns = 'hyph-fr.pat.txt',
    hyphenation = '',
},
etex.src reads language.def at format creation time. Listed languages are
registered and their hyphenation patterns loaded into the format. This enables
their use later with \uselanguage.
Because loading patterns into the format is discouraged for LuaTeX,
hyph-utf8's own etex.src changes this mechanism. Instead of loading each
pattern or exception file on \addlanguage, the language is only registered
and the files are loaded on the first \uselanguage. Both commands actually use
Lua code in luatex-hyphen.lua, which gets its information from the
language.dat.lua database.
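To make the lookup step concrete, here is a minimal sketch of resolving a language name or synonym against a table shaped like language.dat.lua. The table below is a stand-in containing only the French entry shown earlier, and the function name lookup is my own choice for illustration; it is not copied from luatex-hyphen.lua.

```lua
-- Stand-in for the table returned by language.dat.lua; only the French
-- entry from the example above is included (assumption for illustration).
local language_dat = {
    ['french'] = {
        loader = 'loadhyph-fr.tex',
        lefthyphenmin = 2,
        righthyphenmin = 2,
        synonyms = { 'patois', 'francais' },
        patterns = 'hyph-fr.pat.txt',
        hyphenation = '',
    },
}

-- Return the canonical name and record for a language name or synonym,
-- or nil when the name is unknown. (`lookup` is a hypothetical name.)
local function lookup(db, name)
    if db[name] then
        return name, db[name]
    end
    for canonical, record in pairs(db) do
        for _, synonym in ipairs(record.synonyms or {}) do
            if synonym == name then
                return canonical, record
            end
        end
    end
    return nil
end

local canonical, record = lookup(language_dat, 'patois')
print(canonical, record.patterns, record.lefthyphenmin, record.righthyphenmin)
```

Since the record already carries the synonyms and both hyphenmin values, a lookup like this needs nothing from language.def.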
But why use the language.def file at all? Its handling of synonyms isn't all that great, and the left/right hyphenmin values are already present in language.dat.lua.
The approach in this pull request separates language handling from minim-etex and introduces minim-languages (which depends on callbacks and alloc). Most of the work happens in Lua, where \newlanguage and \minim:uselanguage are defined. Both use LuaTeX's lang.new() function to allocate language numbers instead of the classical TeX count register 19. There is no longer an \addlanguage.
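The register-now, load-on-first-use idea could look roughly like the sketch below. It is not the code in this pull request: new_language, use_language, the registry table and the read_file helper are hypothetical names, and the fallback lang table exists only so the sketch runs in a plain Lua interpreter. Inside LuaTeX the real lang.new(), lang.id() and lang.patterns() are used.

```lua
-- Inside LuaTeX the global `lang` table exists; outside it, fall back to a
-- tiny stand-in (an assumption for illustration) so the sketch still runs.
local lang = lang or (function()
    local next_id = 0
    return {
        new = function()
            next_id = next_id + 1
            return { id = next_id, loaded_patterns = '' }
        end,
        id = function(l) return l.id end,
        patterns = function(l, s) l.loaded_patterns = l.loaded_patterns .. s end,
    }
end)()

local registry = {}   -- name -> { object = <language>, record = ..., loaded = bool }

-- Register a language: allocate a language object now, load nothing yet.
local function new_language(name, record)
    local entry = registry[name]
    if not entry then
        entry = { object = lang.new(), record = record, loaded = false }
        registry[name] = entry
    end
    return lang.id(entry.object)
end

-- Select a language: load its patterns on first use only.
-- `read_file` is a hypothetical helper returning a file's contents.
local function use_language(name, read_file)
    local entry = assert(registry[name], 'unknown language: ' .. name)
    if not entry.loaded then
        lang.patterns(entry.object, read_file(entry.record.patterns))
        entry.loaded = true
    end
    return lang.id(entry.object)
end

-- Usage with a fake reader that counts how often a file is read.
local reads = 0
local function fake_read(filename)
    reads = reads + 1
    return '.ab4i'   -- placeholder pattern data, not real French patterns
end

local id = new_language('french', { patterns = 'hyph-fr.pat.txt' })
use_language('french', fake_read)
use_language('french', fake_read)
print(id, reads)
```

The point of the design is that registration stays cheap, and the pattern file is read at most once per language no matter how often the language is selected.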
The real \uselanguage is defined from TeX and keeps \uselanguage@hook, which minim itself used before. This pull request changes that and defines a custom callback instead, which allows anybody to hook into \uselanguage. For compatibility, the TeX hook is also kept. \minim:uselanguage could in fact be omitted, with its Lua implementation simply registered into the callback instead.
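As an illustration of why a callback is more flexible than a single TeX hook macro, here is a minimal plain-Lua sketch of a hook point that several parties can register into. It does not use minim's callbacks module; register, call and the callback name 'uselanguage' are hypothetical stand-ins.

```lua
local handlers = {}   -- callback name -> list of registered functions

-- Add a function to the named callback (hypothetical helper).
local function register(name, fn)
    handlers[name] = handlers[name] or {}
    table.insert(handlers[name], fn)
end

-- Run every function registered for the named callback, in order.
local function call(name, ...)
    for _, fn in ipairs(handlers[name] or {}) do
        fn(...)
    end
end

-- Two independent users of the same hook point.
local seen = {}
register('uselanguage', function(language)
    table.insert(seen, 'hyphenation:' .. language)
end)
register('uselanguage', function(language)
    table.insert(seen, 'pdf-lang:' .. language)
end)

call('uselanguage', 'french')
print(table.concat(seen, ', '))
```

With this shape, the Lua side of \minim:uselanguage would be just one more registered function rather than a separate macro.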
Part of the code is taken almost literally from luatex-hyphen.lua (CC0 license). The code I would have written would essentially be the same. Polyglossia also has its own (very similar) code to parse language.dat.lua. Babel doesn't use language.def and language.dat.lua at all.
For discussion:
- Usability outside of minim? Why pull in callbacks? Use a simple callback function instead, like with \uselanguage@hook?
- Register M.use_language to the callback instead of having \minim:uselanguage?
- Theoretically (https://texdoc.org/serve/luatex-hyphen/0) the language.dat.lua format doesn't include all the needed information. In practice (TeX Live, MiKTeX) it does. Perhaps the specification should be updated to make the fields we use mandatory, just to be sure.
- What about these minim-pdf definitions that expect the usual \addlanguage/\uselanguage two step process:
% \newnamedlanguage {name} {lhm} {rhm}
\def\newnamedlanguage#1#2#3{%
\expandafter\newlanguage\csname lang@#1\endcsname
\expandafter\chardef\csname lhm@#1\endcsname=#2\relax
\expandafter\chardef\csname rhm@#1\endcsname=#3\relax
\csname lu@texhyphen@loaded@\the\csname lang@#1\endcsname\endcsname}
% \newnameddialect {language} {dialect}
\def\newnameddialect#1#2{%
\expandafter\chardef\csname lang@#2\endcsname\csname lang@#1\endcsname
\expandafter\chardef\csname lhm@#2\endcsname\csname lhm@#1\endcsname
\expandafter\chardef\csname rhm@#2\endcsname\csname rhm@#1\endcsname}
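On the discussion point about making the fields mandatory: until the specification guarantees them, a defensive check is cheap. The sketch below assumes that the fields relied on are patterns, hyphenation, lefthyphenmin and righthyphenmin; that list and the name missing_fields are my assumptions for illustration, not part of the luatex-hyphen specification.

```lua
-- Fields assumed to be needed from each language.dat.lua record
-- (an assumption for illustration, not taken from the specification).
local required = { 'patterns', 'hyphenation', 'lefthyphenmin', 'righthyphenmin' }

-- Return the names of required fields that a record does not provide.
local function missing_fields(record)
    local missing = {}
    for _, field in ipairs(required) do
        if record[field] == nil then
            missing[#missing + 1] = field
        end
    end
    return missing
end

local complete = {
    patterns = 'hyph-fr.pat.txt',
    hyphenation = '',
    lefthyphenmin = 2,
    righthyphenmin = 2,
}
local incomplete = { patterns = 'hyph-xx.pat.txt' }

print(#missing_fields(complete), table.concat(missing_fields(incomplete), ', '))
```

An empty string, as in the French hyphenation field, counts as present here; only a field that is absent altogether is reported.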