User Guide

The sections below explain how the high-level modules fit together and provide short recipes that you can adapt for your own training pipelines.

Datasets

TorchFont exposes three dataset wrappers under torchfont.datasets.

FontFolder

Scans a directory of .otf/.ttf files. Every available Unicode code point and variation instance becomes an item. Use the codepoint_filter argument to limit which code points are included, and plug in a custom loader when you need extra preprocessing.
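
A minimal sketch of the above, assuming codepoint_filter accepts a predicate over integer code points and that the dataset takes a root directory argument (both are inferred here, not stated explicitly):

from torchfont.datasets import FontFolder

# Assumption: codepoint_filter takes a predicate over integer code points.
dataset = FontFolder(
    root="data/my_fonts",
    codepoint_filter=lambda cp: 0x0041 <= cp <= 0x005A,  # uppercase A-Z only
)
sample, (style_label, content_label) = dataset[0]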

GoogleFonts

Maintains a sparse checkout of the google/fonts repository. Pass patterns to restrict which directories are materialized, and set download=True to ensure the checkout exists. The dataset inherits the same indexing and label structure as FontFolder.
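
A minimal sketch; patterns and download come from the description above, while root and the pattern value are illustrative assumptions:

from torchfont.datasets import GoogleFonts

dataset = GoogleFonts(
    root="data/google_fonts",      # assumed, by analogy with FontRepo below
    patterns=("ofl/roboto/**",),   # illustrative directory pattern
    download=True,                 # create the sparse checkout if missing
)
sample, (style_label, content_label) = dataset[0]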

FontRepo

Generalizes the sparse checkout logic to arbitrary Git repositories. Provide a url, ref, and optional patterns describing what to materialize.

Example – FontRepo

from torchfont.datasets import FontRepo

ibm_plex = FontRepo(
    root="data/font_repos",
    url="https://github.com/IBM/plex.git",
    ref="main",
    patterns=("fonts/Complete/OTF/*/*.otf",),
    download=True,
)

sample, (style_label, content_label) = ibm_plex[42]

Transforms

Sequential transformations live under torchfont.transforms. Combine them with torchfont.transforms.Compose to keep preprocessing pipelines declarative.

from torchfont.transforms import Compose, LimitSequenceLength, Patchify

transform = Compose(
    (
        LimitSequenceLength(max_len=512),
        Patchify(patch_size=32),
    )
)

sample, labels = dataset[0]
sample = transform(sample)

LimitSequenceLength

Clips both the command-type tensor and the coordinate tensor to max_len.

Patchify

Zero-pads sequences to the next patch_size boundary, then reshapes them into contiguous patches, which is useful for transformer-style models.
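
To make the shape arithmetic concrete, here is a plain-PyTorch sketch of the same pad-then-reshape step; the coordinate width of 6 is an illustrative assumption:

import torch
import torch.nn.functional as F

coords = torch.randn(70, 6)                  # 70 commands, 6 coordinates each (assumed width)
patch_size = 32
pad = (-coords.shape[0]) % patch_size        # 70 -> pad by 26 to reach the 96 boundary
padded = F.pad(coords, (0, 0, 0, pad))       # zero-pad along the sequence dimension
patches = padded.reshape(-1, patch_size, 6)  # (3, 32, 6): three contiguous patches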

I/O Utilities

The torchfont.io namespace contains helpers for converting raw glyph outlines to PyTorch tensors.

torchfont.io.pens.TensorPen

Implements a fontTools-compatible pen that records every drawing command as PyTorch tensors. The resulting tensors are what the datasets return, so you can reuse the pen when building custom loaders.

from fontTools.ttLib import TTFont
from torchfont.io.pens import TensorPen

font = TTFont("MyFont-Regular.otf")
glyph_set = font.getGlyphSet()
glyph = glyph_set["A"]

# Drawing the glyph through the pen records each outline command.
pen = TensorPen(glyph_set)
glyph.draw(pen)
command_types, coords = pen.get_tensor()
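
Because the datasets accept a custom loader (see FontFolder above), the pen can be reused there. The loader signature below is a hypothetical sketch, not the documented interface:

def tensor_loader(path, glyph_name):
    # Hypothetical signature; the arguments FontFolder actually passes
    # to its loader may differ.
    font = TTFont(path)
    glyph_set = font.getGlyphSet()
    pen = TensorPen(glyph_set)
    glyph_set[glyph_name].draw(pen)
    return pen.get_tensor()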

Data Loading Tips

  • Glyph sequences vary in length. Always supply a collate_fn that pads or truncates samples before they are stacked into a batch (see the padding sketch after this list).

  • When working with GoogleFonts, consider splitting the dataset into several torch.utils.data.Subset objects and feeding them to Lightning’s lightning.pytorch.utilities.combined_loader.CombinedLoader (see examples/dataloader.py) to parallelize I/O.

  • Cache-heavy datasets benefit from setting num_workers to at least the number of CPU cores available during preprocessing and inference.
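
A minimal padding collate_fn, assuming each sample is the (command_types, coords) pair described under I/O Utilities and each label is a plain (style, content) integer pair:

import torch
import torch.nn.functional as F

def pad_collate(batch):
    samples, labels = zip(*batch)
    types, coords = zip(*samples)
    max_len = max(t.shape[0] for t in types)
    # Zero-pad every sequence up to the longest one in the batch.
    types = torch.stack([F.pad(t, (0, max_len - t.shape[0])) for t in types])
    coords = torch.stack([F.pad(c, (0, 0, 0, max_len - c.shape[0])) for c in coords])
    style = torch.tensor([s for s, _ in labels])
    content = torch.tensor([c for _, c in labels])
    return (types, coords), (style, content)

loader = torch.utils.data.DataLoader(dataset, batch_size=32, collate_fn=pad_collate)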

Best Practices

  • Keep raw fonts immutable. The caching performed by torchfont.datasets.folder.load_font() assumes files on disk are not modified while the process is running. Call load_font.cache_clear() if you need to invalidate the cache.

  • Separate style and content labels. Every dataset returns both. Treat style (font instance) as one task and content (code point) as another so that your losses stay interpretable.

  • Document your Transform pipeline. Store the pipeline configuration next to model checkpoints to keep glyph preprocessing reproducible.
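
One lightweight way to do this, reusing the parameter values from the Transforms example above (the directory layout is illustrative):

import json
from pathlib import Path

config = {
    "transforms": [
        {"name": "LimitSequenceLength", "max_len": 512},
        {"name": "Patchify", "patch_size": 32},
    ]
}
run_dir = Path("checkpoints/run-001")  # illustrative checkpoint directory
run_dir.mkdir(parents=True, exist_ok=True)
(run_dir / "transforms.json").write_text(json.dumps(config, indent=2))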