Token Economy in LLM Training Data Preparation

Michal Kubíček
16/02/2026



In the context of large language models (LLMs), the term “token economy” refers to the management of the textual units that a model processes ^[1]. A token is not a word in the conventional sense. It is a smaller segment of information, typically text, that may correspond to an entire word, part of a word, a punctuation mark, or even a group of bytes ^[2]. Models count, store, and train on tokens, not on sentences or paragraphs.

Every token carries a threefold cost. The computational cost arises because more tokens mean more operations during both training and inference. The financial cost follows because infrastructure is sized and billed by the volume of tokens processed ^[3]. The capacity cost arises because the model’s context window is finite: every superfluous piece of text occupies space that could otherwise carry useful information.

This article examines a specific consequence of this economy: how the format of source data from which a training corpus is constructed affects the token count, and thereby the cost, quality, and breadth of what the model learns.

How Tokenizers See Text

Contemporary LLMs employ a variety of tokenization methods, the most prevalent of which is Byte-Pair Encoding (BPE) ^[4]. The principle is straightforward: the algorithm begins with a vocabulary of individual bytes and iteratively merges the most frequent pairs into new tokens until the target vocabulary size is reached. GPT, Llama, and Mistral all use variants of BPE ^[5].
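The merge loop described above can be sketched in a few lines. This is a minimal illustration, not the code of any production tokenizer: the names train_bpe, apply_merge, and encode are invented for this example, and the sketch assumes a fixed number of merges instead of a target vocabulary size. Real tokenizers typically also pre-split text with a regular expression so that merges do not cross word boundaries.

```python
from collections import Counter


def apply_merge(tokens: list[bytes], left: bytes, right: bytes) -> list[bytes]:
    """Replace every adjacent (left, right) pair with the merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to num_merges merge rules, starting from single bytes."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges would not help
        merges.append((left, right))
        tokens = apply_merge(tokens, left, right)
    return merges


def encode(text: str, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Tokenize text by replaying the learned merges in order."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in merges:
        tokens = apply_merge(tokens, left, right)
    return tokens


corpus = "darkness kindness darkness kindness brightness"
rules = train_bpe(corpus, num_merges=10)
print(encode("darkness", rules))
```

Which pairs get merged depends entirely on the training text, which is why the same word can split differently under different tokenizers.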

A critical detail is that the tokenizer’s vocabulary is model-specific. The same text sent to different models will almost certainly be converted into a different number and sequence of tokens ^[1]. For example, the word “darkness” typically decomposes into two tokens: “dark” and “ness” ^[2]. The shared token “ness” helps the model understand that words ending in this suffix have something in common.

For our purposes, it is essential to understand that different types of text have different token densities. Standard English averages approximately four characters per token. Specialized jargon, legal contracts, or HTML code tokenize less efficiently because they contain expressions that are rare in general text and must be decomposed into more fragments ^[1].
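The four-characters-per-token figure can serve as a quick planning heuristic when no tokenizer is at hand. The sketch below assumes that ratio as a default and lets the caller override it for denser text; the function name estimate_tokens and the ratio of 2.5 used for markup are illustrative assumptions, not measured values.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (a heuristic, not a tokenizer)."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


prose = "Models count, store, and train on tokens, not on sentences."
markup = '<div class="article-body prose prose-lg max-w-none">'

print(estimate_tokens(prose))                        # default ratio for English prose
print(estimate_tokens(markup, chars_per_token=2.5))  # assumed lower ratio for markup
```

For real budgeting, counts should come from the tokenizer of the target model, since the ratio varies by language and content type.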

Tokenization of languages such as Czech, with its diacritical marks (háčky and čárky), is even more demanding. Readers can experiment at https://platform.openai.com/tokenizer by entering a word in Czech alongside its English equivalent. Tokenization behavior varies slightly between model generations—newer models tend to be more efficient. For instance, the Czech word “kočka” (cat) consumed three tokens in older models (ko/č/ka), whereas GPT-5 era models process the same word in only two tokens (ko/čka).

The reason lies in how the tokenizer’s vocabulary is constructed. The tokenizer is trained separately from the model itself. During BPE training, the algorithm traverses a large text corpus, counts the frequencies of adjacent byte pairs, and progressively merges the most common ones into new tokens. The larger and more linguistically diverse the corpus, the more opportunities the algorithm has to encounter Czech text and create efficient merges for it.

In older models (GPT-2 through GPT-4), the tokenizer’s training corpus was overwhelmingly English. Czech characters with diacritics (č, ž, ř, ě) appeared relatively infrequently. Pairs involving these characters were therefore rarely frequent enough to win a merge, so a combination such as “č” followed by “k” did not become a single token. The word “kočka” thus fragmented into three or more parts: BPE awards merges to the most frequent pairs, and the vocabulary filled up before any combination containing “č” qualified.

In newer models, two things changed. First, the tokenizer’s training corpus grew substantially larger and more linguistically diverse, giving Czech bigrams such as “čk”, “ží”, and “ně” enough frequency to enter the vocabulary as standalone tokens. Second, the vocabulary size itself expanded: GPT-2 operated with approximately 50,000 tokens and GPT-4’s tokenizer (cl100k_base) with roughly 100,000, whereas o200k_base, the tokenizer introduced with GPT-4o, uses around 200,000. A larger vocabulary provides more room for less common but still useful subwords from smaller languages.
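Part of the handicap for Czech is visible before any merges are applied: a byte-level tokenizer starts from UTF-8 bytes, and each letter with a háček or čárka occupies two bytes instead of one. The following sketch shows this; the helper name utf8_profile is made up for the example, and the input is normalized to NFC so that accented letters are single code points.

```python
import unicodedata


def utf8_profile(word: str) -> list[tuple[str, int]]:
    """Return each character with the number of UTF-8 bytes it occupies."""
    normalized = unicodedata.normalize("NFC", word)
    return [(ch, len(ch.encode("utf-8"))) for ch in normalized]


for word in ("cat", "kočka"):
    profile = utf8_profile(word)
    total_bytes = sum(size for _, size in profile)
    print(f"{word!r}: {len(profile)} characters, {total_bytes} bytes -> {profile}")
```

The word “kočka” therefore enters the merge process as six base units, and it only collapses into fewer tokens if the vocabulary contains merges that cover the two bytes of “č”.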

The Data Crisis

According to research by Epoch AI, the total effective stock of high-quality publicly available text data amounts to approximately 300 trillion tokens ^[6]. If current trends continue, language models will exhaust this stock sometime between 2026 and 2032 ^[6]. The Stanford AI Index Report 2025 characterizes the situation as serious, noting that publishers and platforms are increasingly restricting AI companies’ access to their content ^[7].

The situation is even more acute for non-English data. English content accounts for over 56% of the web, while all other languages combined make up the remaining share of under 44% ^[8]. For a language like Czech, the share is substantially smaller still, meaning that every high-quality Czech text carries disproportionately high value for training.

In this context, a principle that has become a mantra in the LLM community takes on particular significance: better data beats better algorithms ^[9]. Specialized models such as BioGPT, Med-PaLM, and SaulLM-7B demonstrate that with carefully curated data, tens of billions of tokens suffice for a domain-specific model to outperform a general-purpose model trained on trillions ^[9].

Token Eaters

When a web crawler collects data for a training corpus, it typically retrieves the raw HTML source code of pages. The problem is that a typical web page contains vast amounts of tags, attributes, classes, and metadata that carry no semantic information useful to a model ^[10]. Navigation bars, footers, cookie banners, and advertising blocks consume tokens without adding value ^[11].

Consider a typical HTML fragment from a blog built on a modern framework:

Paragraph content…

</p>

The attributes class, id, data-* and others serve the browser for CSS rendering, web analytics, or JavaScript interactivity. A language model derives nothing useful from them. Yet all of these strings consume tokens ^[10].
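To make the overhead concrete, the sketch below removes presentation and tracking attributes from a hypothetical fragment and compares the lengths. The fragment is invented for illustration (its attribute values are borrowed from Table 1), the allowlist in KEPT_ATTRIBUTES is an assumption about which attributes still carry meaning, and character count is used only as a rough proxy for tokens.

```python
import html
from html.parser import HTMLParser

# Assumption: only these attributes carry meaning worth keeping.
KEPT_ATTRIBUTES = {"href", "src", "alt"}


class AttributeStripper(HTMLParser):
    """Re-emit HTML while dropping every attribute not on the allowlist."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        kept = "".join(
            f' {name}="{html.escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in KEPT_ATTRIBUTES
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(html.escape(data, quote=False))


def strip_attributes(markup: str) -> str:
    parser = AttributeStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


fragment = (
    '<p class="text-2xl font-semibold mt-8 mb-4" id="post-12847" '
    'data-tracking-id="abc123">Paragraph content</p>'
)
cleaned = strip_attributes(fragment)
print(cleaned)
print(f"{len(fragment)} characters before, {len(cleaned)} after")
```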

The problem compounds at scale. As The New Stack observes, a single inefficiently serialized record can waste hundreds of tokens, and across millions of queries these losses accumulate rapidly ^[3].

Table 1: Token cost of common HTML artifacts (estimated for GPT-3 through GPT-5.x tokenizers)

HTML Artifact	Tokens
class="article-container main-content"	7–9
data-analytics-section="blog"	6–9
class="text-2xl font-semibold mt-8 mb-4"	15–20
<div class="article-body prose prose-lg max-w-none">	12–18
id="post-12847"	7
data-tracking-id="abc123"	7–9

Markdown

Converting HTML to Markdown is one of the most effective strategies for increasing the token efficiency of training data. Markdown preserves the semantic structure of a document (headings remain headings, links remain links, emphasis remains emphasis) without dozens of redundant attributes ^[10]. Industry benchmarks report that HTML-to-Markdown conversion reduces token consumption by 20–30% for typical pages ^[10] and by up to 95% for complex e-commerce pages ^[11].

Table 2: Comparison of HTML and Markdown for identical informational content

Metric	HTML	Markdown	Savings
Characters (article excerpt)	1,144	400	65%
Tokens (e-commerce page) [11]	~40,000	~2,000	95%
Tokens (blog post) [10]	3,000–4,000	800–1,200	20–50%

Importantly, Markdown is not merely more token-efficient. LLMs encounter a great deal of Markdown during training, because a substantial portion of high-quality training data originates from GitHub, Stack Overflow, and technical documentation ^[11]. Moreover, Markdown provides semantic anchors that models can leverage: headings marked with # explicitly define the hierarchy of ideas, and tables using the pipe character | enable columnar reasoning ^[11].
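The idea of keeping structure while discarding presentation can be illustrated with a deliberately small converter built on the Python standard library. It is a sketch under stated assumptions, not a replacement for a dedicated library: the class MiniMarkdown, the function html_to_markdown, and the sample page are invented for this example, and only headings, paragraphs, emphasis, links, and list items are handled. Every list item is rendered as a bullet.

```python
import re
from html.parser import HTMLParser

SKIPPED = {"script", "style", "nav", "footer", "aside"}  # dropped with their content
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*"}


class MiniMarkdown(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out: list[str] = []
        self.skip_depth = 0
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        if self.skip_depth:
            return
        if tag in HEADINGS:
            self.out.append("\n\n" + HEADINGS[tag])
        elif tag == "p":
            self.out.append("\n\n")
        elif tag in ("ul", "ol"):
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag in INLINE:
            self.out.append(INLINE[tag])
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if tag in INLINE:
            self.out.append(INLINE[tag])
        elif tag == "a":
            self.out.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(markup: str) -> str:
    parser = MiniMarkdown()
    parser.feed(markup)
    parser.close()
    lines = [line.strip() for line in "".join(parser.out).split("\n")]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


page = (
    '<nav class="menu"><a href="/">Home</a></nav>'
    '<div class="article-body prose prose-lg max-w-none">'
    '<h1 id="post-12847">Title</h1>'
    '<p class="lead">Hello <strong>world</strong>, see '
    '<a href="https://example.com" data-tracking-id="abc123">this</a>.</p>'
    "<ul><li>one</li><li>two</li></ul>"
    "</div><footer>Cookie settings</footer>"
)
markdown = html_to_markdown(page)
print(markdown)
print(f"{len(page)} characters of HTML, {len(markdown)} of Markdown")
```

The heading, the link target, and the list survive, while the navigation bar, the footer, and all class, id, and data-* attributes disappear.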

Additional Approaches to Token Savings

JSON and CSV

For tabular data, JSON may be preferable to HTML tables. Flattening nested JSON structures and extracting only relevant fields can reduce token consumption by up to 69% ^[3]. CSV outperforms JSON by 40 to 50 percent in terms of token efficiency for tabular data ^[3].
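The difference between the formats comes largely from repetition: JSON restates every key in every record, whereas CSV states the column names once. The sketch below flattens nested records and serializes them three ways. The records and the helper names flatten and to_csv are invented for illustration, and the comparison uses character counts as a rough proxy, so actual token savings will depend on the tokenizer.

```python
import csv
import io
import json

# Hypothetical records, invented for this example.
records = [
    {"id": 1, "name": "Alice", "address": {"city": "Prague", "zip": "11000"}},
    {"id": 2, "name": "Bob", "address": {"city": "Brno", "zip": "60200"}},
    {"id": 3, "name": "Carol", "address": {"city": "Ostrava", "zip": "70200"}},
]


def flatten(record: dict, parent: str = "") -> dict:
    """Turn nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def to_csv(rows: list[dict]) -> str:
    """Serialize flat rows as CSV with a single header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


pretty_json = json.dumps(records, indent=2)
compact_json = json.dumps(records, separators=(",", ":"))
as_csv = to_csv([flatten(record) for record in records])

for label, text in (("pretty JSON", pretty_json), ("compact JSON", compact_json), ("CSV", as_csv)):
    print(f"{label}: {len(text)} characters")
```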

Numerical Precision

A small but effective technique involves optimizing the precision of numerical values. Rounding numbers to the required precision can reduce the token consumption of numerical data by 30 to 40 percent ^[3].
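A recursive helper can apply such rounding before the data is serialized. The sketch below is illustrative: the function name round_floats, the sample sensor reading, and the choice of two decimal places are assumptions, and the right precision depends on what the downstream task actually needs.

```python
import json


def round_floats(value, ndigits: int = 2):
    """Round every float in a nested structure of dicts and lists."""
    if isinstance(value, float):
        return round(value, ndigits)
    if isinstance(value, dict):
        return {key: round_floats(item, ndigits) for key, item in value.items()}
    if isinstance(value, list):
        return [round_floats(item, ndigits) for item in value]
    return value  # ints, strings, booleans, and None pass through unchanged


# Hypothetical reading, invented for this example.
reading = {
    "sensor": "t-01",
    "temperature": 21.483920174,
    "humidity": 0.5731984422,
    "samples": [1.000000001, 2.499999999],
}
rounded = round_floats(reading)

print(json.dumps(reading))
print(json.dumps(rounded))
```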

Practical Impact: What This Means for Training

Higher Information Density

When meaning is preserved, a shorter representation yields higher information density. This means that within the same token budget, more diverse examples can be included in the dataset. Research by Lagasse et al. confirms that data composition—the number of examples and their average length in tokens—significantly affects token efficiency ^[12].
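The effect on dataset breadth is simple arithmetic. The sketch below assumes a budget of one million tokens and average document lengths of 3,500 tokens for HTML and 1,000 for Markdown; these averages are illustrative midpoints of the blog-post row in Table 2, not measurements, and the function name is invented.

```python
def examples_within_budget(token_budget: int, avg_tokens_per_example: int) -> int:
    """Number of whole examples that fit into a fixed token budget."""
    if avg_tokens_per_example <= 0:
        raise ValueError("avg_tokens_per_example must be positive")
    return token_budget // avg_tokens_per_example


budget = 1_000_000
html_examples = examples_within_budget(budget, 3_500)      # 285
markdown_examples = examples_within_budget(budget, 1_000)  # 1000

print(f"HTML: {html_examples} examples, Markdown: {markdown_examples} examples")
```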

Improved Signal-to-Noise Ratio

Longer text frequently contains redundancy, stylistic filler, or digressions. These portions carry no new meaning, yet the model must process them and learn their patterns. Concise formulation reduces noise and improves the signal-to-noise ratio ^[9].

Cost Reduction

The economic impact is immediate. One illustrative case involves a company that achieved an 85% cost reduction by fine-tuning Mistral-7B as a replacement for GPT-3.5 ^[1]. The fine-tuned model required shorter prompts and produced more concise outputs, thereby reducing token consumption on both ends ^[1].

Tools for HTML-to-Markdown Conversion

The specialized model ReaderLM-v2 by Jina AI is a compact 1.5-billion-parameter model capable of processing documents up to 512,000 tokens in length and transforming them from HTML into Markdown or JSON ^[13]. According to benchmarks, it outperforms the older GPT-4o model by 15 to 20 percent on carefully curated test sets ^[13].

In the Python ecosystem, libraries such as html2text, markdownify, and trafilatura are available ^[14]. The last of these is particularly suitable for training data preparation, as it can intelligently extract a page’s main content while removing navigation, advertisements, and peripheral elements. Platforms such as the Apify Website Content Crawler offer Markdown conversion at the infrastructure level and report savings of 30–50% in tokens compared to raw HTML ^[15].

From Mobile-First to AI-First

Consider the period when websites began offering dedicated versions for mobile devices. Initially, these were separate instances, often deployed on subdomains. This approach was later replaced by responsive design, and over the past several years the dominant paradigm has shifted to mobile-first development. The underlying driver was straightforward: patterns of content consumption had changed.

A comparable transition is now underway. Instead of mobile browsers, the primary new consumers are AI agents and large language models. The Web, as originally proposed by Tim Berners-Lee more than three decades ago, was designed for human readers interacting through graphical browsers. HTML reflects this orientation: it contains numerous visual elements, structural wrappers, navigation components, and embedded scripts. For language models, these features constitute largely extraneous material that must be filtered out before meaningful processing can occur.

Most contemporary AI pipelines therefore introduce an intermediate transformation step in which HTML is converted into Markdown. This conversion removes presentation-layer artifacts and preserves a more compact, semantically structured textual representation suitable for machine processing ^[10][11]. The question, however, is: why generate Markdown downstream when the server can return it directly?

Cloudflare Markdown for Agents

In February 2026, Cloudflare introduced a mechanism called Markdown for Agents that enables servers to return Markdown directly through standard HTTP content negotiation ^[16]. When a client includes an Accept: text/markdown header in its request, the Cloudflare network automatically converts the HTML page to Markdown and delivers it in place of HTML. Activation requires only a configuration change in the dashboard—no template refactoring, no additional endpoints, no modifications to application code ^[16].

The results are compelling: for Cloudflare’s own blog, token consumption dropped from 16,180 tokens in HTML to 3,150 in Markdown—a savings of 80% ^[17]. The response additionally includes an x-markdown-tokens header with the estimated token count, enabling agents to plan their chunking strategy and context window utilization in advance ^[16]. Popular coding agents such as Claude Code and OpenCode already actively send the Accept: text/markdown header ^[16].
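From the client side, the negotiation amounts to sending one request header and reading one response header. The sketch below uses only the Python standard library. The function names and the fallback preference for text/html are assumptions, the URL in the usage comment is a placeholder, and a server that does not support the mechanism will simply return HTML, so the content type should be checked.

```python
import urllib.request


def build_markdown_request(url: str) -> urllib.request.Request:
    """Build a request that prefers Markdown but still accepts HTML."""
    return urllib.request.Request(
        url, headers={"Accept": "text/markdown, text/html;q=0.8"}
    )


def parse_token_hint(header_value):
    """Interpret the x-markdown-tokens header; return None if absent or malformed."""
    if header_value is None:
        return None
    header_value = header_value.strip()
    return int(header_value) if header_value.isdigit() else None


def fetch_markdown(url: str, timeout: float = 10.0):
    """Return (body, content_type, estimated_tokens) for the given URL."""
    request = build_markdown_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type", "")
        token_hint = parse_token_hint(response.headers.get("x-markdown-tokens"))
        body = response.read().decode("utf-8", errors="replace")
    return body, content_type, token_hint


# Usage (performs a network call, so it is left commented out):
# body, content_type, tokens = fetch_markdown("https://example.com/some-page")
# if content_type.startswith("text/markdown") and tokens is not None:
#     print(f"Markdown received, roughly {tokens} tokens")
```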

Solutions for Organizations Not Using Cloudflare

For organizations that do not use Cloudflare, this development offers a broader design principle: applications will increasingly need to expose an alternative, LLM-oriented representation of their content, optimized for structured machine consumption rather than visual rendering.

One such solution is the open-source project php-markdown-mirror ^[18]. It addresses a simple problem: one source of truth, two representations, zero duplication. The application continues to generate HTML as usual. Middleware intercepts the output, and if the client sends Accept: text/markdown (or appends a ?v=md parameter), it performs a one-time DOM parse, extracts the main content, and returns its Markdown representation. A regular visitor receives full HTML; an AI agent receives clean Markdown with correct Content-Type and Vary headers. The project additionally extracts Schema.org JSON-LD automatically and converts it to YAML frontmatter, so that the model receives structured metadata without needing to parse HTML ^[18].
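The decision logic of such middleware is small. The following Python sketch illustrates the general pattern only; it is not the code of php-markdown-mirror, which is written in PHP. The function names are invented, the converter is passed in as a parameter, and quality values in the Accept header are ignored for brevity.

```python
from urllib.parse import parse_qs


def wants_markdown(accept_header: str, query_string: str = "") -> bool:
    """True if the client asked for Markdown via the Accept header or ?v=md."""
    if parse_qs(query_string).get("v") == ["md"]:
        return True
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type == "text/markdown":
            return True
    return False


def negotiate(accept_header: str, query_string: str, html_body: str, to_markdown):
    """Return (body, headers) for one rendered page, in whichever format was requested."""
    # Vary tells caches that the response depends on the Accept header.
    if wants_markdown(accept_header, query_string):
        headers = {"Content-Type": "text/markdown; charset=utf-8", "Vary": "Accept"}
        return to_markdown(html_body), headers
    headers = {"Content-Type": "text/html; charset=utf-8", "Vary": "Accept"}
    return html_body, headers


# Usage with a stand-in converter (a real one would parse the DOM):
body, headers = negotiate("text/markdown", "", "<h1>Title</h1>", lambda html: "# Title")
print(headers["Content-Type"], body)
```

Because the application renders HTML once and the conversion runs on that output, there is still a single source of truth for the content.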

Complementarily, the approach taken by Joost de Valk, creator of Yoast SEO, deserves mention. His WordPress plugin adds a <link rel="alternate" type="text/markdown"> tag to pages and creates dedicated .md URLs for each post ^[19]. While Cloudflare handles conversion at the infrastructure level, this approach addresses discoverability: an agent visiting the HTML version can programmatically determine that a Markdown version exists. The two approaches are complementary.
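An agent could detect such a tag with a few lines of parsing. The sketch below uses the Python standard library. The class and function names are invented, the sample page and its .md URL are hypothetical, and the exact markup emitted by the plugin may differ from this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AlternateFinder(HTMLParser):
    """Collect href values of <link rel="alternate" type="text/markdown"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.markdown_hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        rel_values = attributes.get("rel", "").lower().split()
        is_markdown = attributes.get("type", "").lower() == "text/markdown"
        if "alternate" in rel_values and is_markdown and attributes.get("href"):
            self.markdown_hrefs.append(attributes["href"])


def find_markdown_alternate(markup: str, base_url: str):
    """Return the absolute URL of the first Markdown alternate, or None."""
    finder = AlternateFinder()
    finder.feed(markup)
    if not finder.markdown_hrefs:
        return None
    return urljoin(base_url, finder.markdown_hrefs[0])


# Hypothetical page head, invented for this example.
page_head = (
    "<head><title>Post</title>"
    '<link rel="alternate" type="text/markdown" href="/blog/post.md">'
    "</head>"
)
print(find_markdown_alternate(page_head, "https://example.com/blog/post/"))
```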

Recommendations for Practice

First, when collecting data from the web, always convert HTML to Markdown before storing it in the corpus. Markdown preserves semantic structure and minimizes token noise ^[10][11].

Second, for tabular data, prefer CSV over JSON and JSON over HTML. The choice of format can represent a 40–50% difference in token consumption ^[3].

Third, treat token efficiency as a first-class metric alongside accuracy and latency ^[3].

Fourth, test the precision of numerical data. Unnecessarily precise numbers increase the token footprint without benefit ^[3].

Fifth, maintain different compression profiles for different use cases. Agentic workflows require different optimization than RAG pipelines ^[3].

Sixth, consider serving Markdown natively. Whether through Cloudflare’s Markdown for Agents ^[16], a middleware solution like php-markdown-mirror ^[18], or a WordPress plugin with dedicated .md endpoints ^[19], exposing an LLM-friendly representation of your content is becoming a practical necessity rather than a theoretical exercise.

Conclusion

Token economy is not an abstract concept. It is a concrete economic parameter that determines how much knowledge fits within a given budget, how clean the training data will be, and how much model operation will cost. The choice of source data format—a matter that may appear purely technical—has a direct impact on all three cost dimensions: computation, finance, and capacity.

The simple rule is this: when working with a fixed token budget, it pays to encode facts as efficiently as possible without losing meaning. Fewer tokens for the same content means more room for other knowledge, lower costs, and cleaner training data.

References

[1] The Data Lead. The New Tokenomics: A Comprehensive Guide to the Economics of Large Language Models [online]. 2025 [cited 2026-02-16]. Available from: https://thedatalead.com/the-new-tokenomics-a-comprehensive-guide-to-the-economics-of-large-language-models/

[2] NVIDIA. Explaining Tokens — the Language and Currency of AI [online]. NVIDIA Blog, 2025-05-01 [cited 2026-02-16]. Available from: https://blogs.nvidia.com/blog/ai-tokens-explained/

[3] The New Stack. A Guide to Token-Efficient Data Prep for LLM Workloads [online]. 2025-12-06 [cited 2026-02-16]. Available from: https://thenewstack.io/a-guide-to-token-efficient-data-prep-for-llm-workloads/

[4] SENNRICH, Rico, Barry HADDOW and Alexandra BIRCH. Neural Machine Translation of Rare Words with Subword Units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016, pp. 1715–1725. DOI: 10.18653/v1/P16-1162.

[5] KARPATHY, Andrej. minbpe: Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization [online]. GitHub, 2024 [cited 2026-02-16]. Available from: https://github.com/karpathy/minbpe

[6] VILLALOBOS, Pablo, Anson HO, Jaime SEVILLA, Tamay BESIROGLU, Lennart HEIM and Marius HOBBHAHN. Will we run out of data? Limits of LLM scaling based on human-generated data. In: Proceedings of the 41st International Conference on Machine Learning (ICML 2024). 2024, pp. 49523–49544. ArXiv: 2211.04325.

[7] Stanford University. AI Index Report 2025 [online]. Stanford HAI, 2025 [cited 2026-02-16]. Available from: https://aiindex.stanford.edu/report/

[8] XUE, Fuzhao, Yao FU, Wangchunshu ZHOU, Zangwei ZHENG and Yang YOU. To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis. In: Advances in Neural Information Processing Systems. 2023. ArXiv: 2305.13230.

[9] PAUL, Rohan. Selecting and Preparing Training Data for LLMs (2024–2025) [online]. 2025-06-14 [cited 2026-02-16]. Available from: https://www.rohan-paul.com/p/selecting-and-preparing-training

[10] SearchCans. Markdown vs. HTML for LLM Context: Optimizing Performance & Cost [online]. 2026-01-16 [cited 2026-02-16]. Available from: https://www.searchcans.com/blog/markdown-vs-html-llm-context-optimization-2026/

[11] Maxun. Why Markdown is the Secret to Better AI [online]. 2026 [cited 2026-02-16]. Available from: https://www.maxun.dev/blog/markdown

[12] LAGASSE, Ryan et al. A Scaling Law for Token Efficiency in LLM Fine-Tuning Under Fixed Compute Budgets. ArXiv [online]. 2025. ArXiv: 2505.06150. Available from: https://arxiv.org/abs/2505.06150

[13] WANG, Feng, Zesheng SHI, Bo WANG, Nan WANG and Han XIAO. ReaderLM-v2: Small Language Model for HTML to Markdown and JSON. In: Proceedings of ACL 2025. Jina AI, 2025. ArXiv: 2503.01151. Available from: https://arxiv.org/abs/2503.01151

[14] GLUKHOV, Rost. Converting HTML to Markdown with Python: A Comprehensive Guide [online]. 2025 [cited 2026-02-16]. Available from: https://www.glukhov.org/post/2025/10/convert-html-to-markdown-in-python/

[15] Apify. Web Scraping for AI Training Data: The 2026 RAG Guide [online]. 2026 [cited 2026-02-16]. Available from: https://use-apify.com/blog/ai-training-data-web-scraping

[16] MARTINHO, Celso and Will ALLEN. Introducing Markdown for Agents [online]. Cloudflare Blog, 2026-02-12 [cited 2026-02-16]. Available from: https://blog.cloudflare.com/markdown-for-agents/

[17] The Register. Cloudflare turns websites into faster food for AI agents [online]. 2026-02-13 [cited 2026-02-16]. Available from: https://www.theregister.com/2026/02/13/cloudflaremarkdownforaicrawlers

[18] KUBÍČEK, Michal. php-markdown-mirror: Middleware for serving Markdown representations of PHP applications [online]. GitHub, 2026 [cited 2026-02-16]. Available from: https://github.com/kubicek-ai/php-markdown-mirror

[19] DE VALK, Joost. Great minds think alike? My WordPress take on Markdown for Agents [online]. 2026-02-12 [cited 2026-02-16]. Available from: https://joost.blog/markdown-alternate


Kubíček, Michal. (2026, February 16). Token Economy in LLM Training Data Preparation. Kubicek.AI. https://www.kubicek.ai/en/token-economy-in-llm-training-data-preparation/