// Python Dev

Web scraping in 2026: don't run every page through an LLM!

Published on 2026-05-13

A strange approach is spreading among scraper developers: send every downloaded page to an LLM and ask it to find the needed data. It sounds convenient: you don't have to understand the markup, the model will figure it out. There are even dedicated tools for this (ScrapeGraphAI, Crawl4AI, FireCrawl), and they all, in one way or another, run page contents through a language model on every iteration.

In practice this creates three problems at once.

Why this is a bad approach

Slow. The cycle looks like this: download a page, send it to the model, wait for the response, move on to the next URL. On thousands of pages that adds up to hours. Even if you parallelize, each iteration is still dominated by model inference, which takes seconds where selector extraction takes milliseconds.

Expensive. You burn tokens on HTML tags, styles, navigation menus and other junk that has nothing to do with your data. The useful payload is often 10% of what actually goes to the model.

Nondeterministic. The LLM can quietly skip some of the data, especially when the page contains a lot of it. You won't always notice right away, and by then the dataset is already incomplete. This is fundamentally different from a classic crawler, which either finds an element by its selector or fails with an error. Here the failure happens silently. That, in my opinion, is the most dangerous part: you can never be 100% sure that the collected data matches what was on the site.

How to do it right

Use the LLM once — not on every page.

Take a few representative pages that contain the needed information, load their markup into the model, and ask it to generate extraction code: CSS selectors or XPath expressions. Then run a normal crawler with those selectors.
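For illustration, here is a minimal sketch of that one-time step. It assumes the OpenAI Python SDK, but any client works the same way; the URL, model name, and prompt are placeholders:

```python
# One-time step: show the model a sample page, get selectors back.
# Minimal sketch assuming the OpenAI Python SDK; URL and model are placeholders.
import requests
from openai import OpenAI

client = OpenAI()

sample_html = requests.get("https://example.com/catalog/item-1", timeout=30).text

prompt = (
    "Here is the HTML of a product page. Return CSS selectors, as JSON, "
    "for the product title, price, and rating:\n\n"
    + sample_html[:20000]  # truncate: we spend tokens once, not per page
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
# e.g. {"title": "h1.product-name", "price": "span.price", "rating": "div.stars"}
```

Repeat this for a couple of page variants, check that the selectors agree, and hard-code the result into the crawler.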

The choice of tool depends on the site’s complexity:

  • BeautifulSoup — for static pages where the data is present directly in the HTML. Lightweight, fast, reliable (first sketch below).
  • Playwright (or Puppeteer) — for dynamic sites with JavaScript rendering, infinite scrolling, modal windows, and the other joys of modern frontends. It drives a real browser and waits until everything has loaded (second sketch below).
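For the static case, a minimal sketch of the crawler side; the selectors are hypothetical stand-ins for whatever the one-time LLM step produced:

```python
# Static pages: fetch the HTML and extract with fixed CSS selectors.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/catalog/item-1", timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")

# Placeholder selectors; in practice they come from the one-time LLM step.
title = soup.select_one("h1.product-name")
price = soup.select_one("span.price")

# Fail loudly instead of silently collecting incomplete data.
if title is None or price is None:
    raise ValueError("selector missed: the page layout may have changed")

print(title.get_text(strip=True), price.get_text(strip=True))
```

Note the explicit check: unlike the LLM-per-page approach, a missed selector here is a visible error, not a silent gap in the dataset.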
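And the same extraction for a dynamic site, sketched with Playwright's sync API (same placeholder URL and selectors):

```python
# Dynamic pages: render in a real browser, then extract with the same selectors.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/catalog/item-1")

    # Wait until the JS-rendered element actually appears in the DOM.
    page.wait_for_selector("h1.product-name")

    title = page.locator("h1.product-name").inner_text()
    price = page.locator("span.price").inner_text()
    print(title, price)

    browser.close()
```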

As a result you get a system that works fast, gives predictable results and doesn’t waste tokens. That’s exactly how search crawlers are built — nobody runs GPT on every indexed page.

Where an LLM is still useful in scraping

LLMs are great at tasks that require understanding structure and writing code, which is exactly what we're recommending here. The model also works well at the post-processing stage: classifying extracted data, normalizing formats, extracting meaning from unstructured text. That is its domain. Serving as the pipeline for every single page is not.
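For example, a sketch of post-processing classification, again assuming the OpenAI SDK; the categories and model name are placeholders:

```python
# Post-processing: classify already-extracted records in one cheap batched call.
import json
from openai import OpenAI

client = OpenAI()

records = [
    {"title": "iPhone 15 Pro 256GB"},
    {"title": "Handmade wool scarf"},
]

prompt = (
    "Classify each product title into one of: electronics, clothing, other. "
    "Return only a JSON array of category strings, in the same order:\n"
    + json.dumps([r["title"] for r in records])
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)

# A real pipeline should validate this output before trusting it.
categories = json.loads(response.choices[0].message.content)
for record, category in zip(records, categories):
    record["category"] = category
```

The difference from the anti-pattern is the input: a few short, clean strings per call instead of a full page of raw HTML.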

Conclusion

Separate responsibilities: the LLM builds the tool, the crawler collects the data, the LLM processes the result if needed. Each does what it does best.

If you need to collect data from the web — products from marketplaces, listings, prices, reviews, any other sources — reach out. We’ll find, parse, and classify.
