User Guide¶
The sections below explain how the high-level modules fit together and provide short recipes that you can adapt for your own training pipelines.
Datasets¶
TorchFont exposes three dataset wrappers under torchfont.datasets.
FontFolder
    Scans a directory of .otf/.ttf files. Every available Unicode code point and variation instance becomes an item. Use the codepoint_filter argument to limit the content and plug in a custom loader when you need extra preprocessing.
GoogleFonts
    Maintains a sparse checkout of the google/fonts repository. Pass patterns to restrict which directories are materialized, and set download=True to ensure the checkout exists. The dataset inherits the same indexing and label structure as FontFolder.
FontRepo
    Generalizes the sparse checkout logic to arbitrary Git repositories. Provide a url, ref, and optional patterns describing what to materialize.
Example – FontRepo¶
from torchfont.datasets import FontRepo
ibm_plex = FontRepo(
    root="data/font_repos",
    url="https://github.com/IBM/plex.git",
    ref="main",
    patterns=("fonts/Complete/OTF/*/*.otf",),
    download=True,
)
sample, (style_label, content_label) = ibm_plex[42]
Transforms¶
Sequential transformations live under torchfont.transforms. Combine them
with torchfont.transforms.Compose to keep preprocessing modules
declarative.
from torchfont.transforms import Compose, LimitSequenceLength, Patchify
transform = Compose(
    (
        LimitSequenceLength(max_len=512),
        Patchify(patch_size=32),
    )
)
sample, labels = dataset[0]
sample = transform(sample)
LimitSequenceLength
    Clips both the command-type tensor and the coordinate tensor to max_len.
Patchify
    Zero-pads sequences to the next patch_size boundary, then reshapes them into contiguous patches, which is useful for transformer-style models.
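As a quick illustration of the padding arithmetic described above (a sketch of the rounding rule, not Patchify's actual implementation), rounding a sequence length up to the next patch_size boundary looks like:

```python
def padded_length(seq_len: int, patch_size: int) -> int:
    """Round seq_len up to the next multiple of patch_size."""
    return ((seq_len + patch_size - 1) // patch_size) * patch_size

# A 100-command glyph with patch_size=32 is zero-padded to 128 commands,
# which then reshapes into 4 contiguous patches of 32 commands each.
print(padded_length(100, 32))        # → 128
print(padded_length(100, 32) // 32)  # → 4
```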
I/O Utilities¶
The torchfont.io namespace contains helpers for converting raw outlines
to Tensors.
torchfont.io.pens.TensorPen
    Implements a FontTools-compatible pen that records every command as PyTorch tensors. The resulting tensors are what the datasets return, so you can reuse the pen when building custom loaders.
from fontTools.ttLib import TTFont
from torchfont.io.pens import TensorPen
font = TTFont("MyFont-Regular.otf")
glyph_set = font.getGlyphSet()
glyph = glyph_set["A"]
pen = TensorPen(glyph_set)
glyph.draw(pen)
command_types, coords = pen.get_tensor()
Data Loading Tips¶
Glyph sequences vary in length. Always supply a collate_fn that pads or truncates samples before they are stacked into a batch.
When working with GoogleFonts, consider splitting the dataset into several torch.utils.data.Subset objects and feeding them to Lightning's lightning.pytorch.utilities.combined_loader.CombinedLoader (see examples/dataloader.py) to parallelize I/O.
Cache-heavy datasets benefit from setting num_workers to at least the number of CPU cores available during preprocessing and inference.
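A minimal collate_fn along these lines, assuming each sample is a (command_types, coords) pair of variable-length tensors paired with (style, content) labels as the datasets return them (the function name is illustrative):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_glyphs(batch):
    """Pad variable-length glyph sequences into stacked batch tensors."""
    samples, labels = zip(*batch)
    command_types, coords = zip(*samples)
    # pad_sequence stacks along a new batch dimension, zero-padding
    # every sequence to the length of the longest one in the batch.
    types_batch = pad_sequence(command_types, batch_first=True)
    coords_batch = pad_sequence(coords, batch_first=True)
    style, content = zip(*labels)
    return (types_batch, coords_batch), (torch.tensor(style), torch.tensor(content))
```

Pass it to the loader as torch.utils.data.DataLoader(dataset, batch_size=..., collate_fn=collate_glyphs).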
Best Practices¶
Keep raw fonts immutable.
    The caching performed by torchfont.datasets.folder.load_font() assumes files on disk are not modified while the process is running. Call load_font.cache_clear() if you need to invalidate the cache.
Separate style and content labels.
    Every dataset returns both. Treat style (font instance) as one task and content (code point) as another so that your losses stay interpretable.
Document your transform pipeline.
    Store the pipeline configuration next to model checkpoints to keep glyph preprocessing reproducible.
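Treating style and content as separate tasks can be as simple as summing two cross-entropy terms over independent heads. A minimal sketch (the function and head names here are illustrative, not part of TorchFont):

```python
import torch
import torch.nn.functional as F

def two_task_loss(features, style_labels, content_labels, style_head, content_head):
    """Compute separate, interpretable losses for style and content."""
    style_loss = F.cross_entropy(style_head(features), style_labels)
    content_loss = F.cross_entropy(content_head(features), content_labels)
    # Return the individual terms alongside the total so that regressions
    # in either task remain visible in your training logs.
    total = style_loss + content_loss
    return total, {"style": style_loss.item(), "content": content_loss.item()}
```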