User Guide
==========

The sections below explain how the high-level modules fit together and provide
short recipes that you can adapt for your own training pipelines.

Datasets
--------

TorchFont exposes three dataset wrappers under :mod:`torchfont.datasets`.

``FontFolder``
    Scans a directory of ``.otf``/``.ttf`` files. Every available Unicode code
    point and variation instance becomes an item. Use the ``codepoint_filter``
    argument to limit which code points are included, and plug in a custom
    ``loader`` when you need extra preprocessing.

``GoogleFonts``
    Maintains a sparse checkout of the ``google/fonts`` repository. Pass
    ``patterns`` to restrict which directories are materialized, and set
    ``download=True`` to ensure the checkout exists. The dataset inherits the
    same indexing and label structure as :class:`FontFolder`.

``FontRepo``
    Generalizes the sparse checkout logic to arbitrary Git repositories.
    Provide a ``url``, ``ref``, and optional ``patterns`` describing what to
    materialize.

Example – `FontRepo`
~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from torchfont.datasets import FontRepo

   ibm_plex = FontRepo(
       root="data/font_repos",
       url="https://github.com/IBM/plex.git",
       ref="main",
       patterns=("fonts/Complete/OTF/*/*.otf",),
       download=True,
   )

   sample, (style_label, content_label) = ibm_plex[42]

Transforms
----------

Sequential transformations live under :mod:`torchfont.transforms`. Combine
them with :class:`torchfont.transforms.Compose` to keep preprocessing
pipelines declarative.

.. code-block:: python

   from torchfont.transforms import Compose, LimitSequenceLength, Patchify

   transform = Compose(
       (
           LimitSequenceLength(max_len=512),
           Patchify(patch_size=32),
       )
   )

   sample, labels = dataset[0]
   sample = transform(sample)

``LimitSequenceLength``
    Clips both the command-type tensor and the coordinate tensor to
    ``max_len``.

``Patchify``
    Zero-pads sequences to the next ``patch_size`` boundary, then reshapes
    them into contiguous patches, which is useful for transformer-style
    models.

I/O Utilities
-------------

The :mod:`torchfont.io` namespace contains helpers for converting raw outlines
to tensors.

``torchfont.io.pens.TensorPen``
    Implements a FontTools-compatible pen that records every command as
    PyTorch tensors. The resulting tensors are what the datasets return, so
    you can reuse the pen when building custom loaders.

.. code-block:: python

   from fontTools.ttLib import TTFont

   from torchfont.io.pens import TensorPen

   font = TTFont("MyFont-Regular.otf")
   glyph_set = font.getGlyphSet()
   glyph = glyph_set["A"]

   pen = TensorPen(glyph_set)
   glyph.draw(pen)
   command_types, coords = pen.get_tensor()

Data Loading Tips
-----------------

* Glyph sequences vary in length. Always supply a ``collate_fn`` that pads or
  truncates samples before they are stacked into a batch (see the sketch after
  this list).
* When working with ``GoogleFonts``, consider splitting the dataset into
  several :class:`torch.utils.data.Subset` objects and feeding them to
  Lightning's
  :class:`lightning.pytorch.utilities.combined_loader.CombinedLoader`
  (see ``examples/dataloader.py``) to parallelize I/O.
* Cache-heavy datasets benefit from setting ``num_workers`` to at least the
  number of CPU cores available during preprocessing and inference.
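The first tip can be illustrated with a minimal ``collate_fn`` sketch. It
assumes each sample is a ``(command_types, coords)`` pair of tensors, as
produced by ``TensorPen``; the hypothetical ``pad_collate`` name, the zero
padding value, and the label stacking shown here are illustrative assumptions
rather than built-in behavior.

.. code-block:: python

   import torch
   from torch.nn.utils.rnn import pad_sequence


   def pad_collate(batch):
       """Pad variable-length glyph sequences to the longest item in the batch."""
       samples, labels = zip(*batch)
       # Zero-pad along the sequence dimension; shapes become
       # (batch, max_len) and (batch, max_len, coord_dim).
       command_types = pad_sequence(
           [s[0] for s in samples], batch_first=True, padding_value=0
       )
       coords = pad_sequence(
           [s[1] for s in samples], batch_first=True, padding_value=0.0
       )
       # Stack the per-sample (style, content) label pairs into two tensors.
       style_labels = torch.tensor([style for style, _ in labels])
       content_labels = torch.tensor([content for _, content in labels])
       return (command_types, coords), (style_labels, content_labels)

Pass the function to your loader, for example
``torch.utils.data.DataLoader(dataset, batch_size=32, collate_fn=pad_collate, num_workers=8)``.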
Best Practices
--------------

* **Keep raw fonts immutable.** The caching performed by
  :func:`torchfont.datasets.folder.load_font` assumes files on disk are not
  modified while the process is running. Call ``load_font.cache_clear()`` if
  you need to invalidate the cache.
* **Separate style and content labels.** Every dataset returns both. Treat
  style (font instance) as one task and content (code point) as another so
  that your losses stay interpretable (see the sketch after this list).
* **Document your transform pipeline.** Store the pipeline configuration next
  to model checkpoints to keep glyph preprocessing reproducible.
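To make the label-separation advice concrete, here is a minimal sketch using a
hypothetical two-head classifier; the ``TwoHeadClassifier`` module, feature
dimension, class counts, and dummy batch are assumptions for illustration and
are not part of TorchFont.

.. code-block:: python

   import torch
   import torch.nn.functional as F
   from torch import nn


   # Hypothetical two-head model: shared features feed separate style
   # (font instance) and content (code point) classifiers.
   class TwoHeadClassifier(nn.Module):
       def __init__(self, feat_dim: int, num_styles: int, num_codepoints: int):
           super().__init__()
           self.style_head = nn.Linear(feat_dim, num_styles)
           self.content_head = nn.Linear(feat_dim, num_codepoints)

       def forward(self, features: torch.Tensor):
           return self.style_head(features), self.content_head(features)


   model = TwoHeadClassifier(feat_dim=256, num_styles=100, num_codepoints=1000)

   # Dummy pooled glyph features and the two label sets for a batch of 8.
   features = torch.randn(8, 256)
   style_labels = torch.randint(0, 100, (8,))
   content_labels = torch.randint(0, 1000, (8,))

   style_logits, content_logits = model(features)

   # Compute the two losses separately so each task stays interpretable,
   # then combine them for the optimizer step.
   style_loss = F.cross_entropy(style_logits, style_labels)
   content_loss = F.cross_entropy(content_logits, content_labels)
   loss = style_loss + content_loss

Logging ``style_loss`` and ``content_loss`` individually makes it easy to see
which of the two tasks is driving the combined objective.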