// Python Dev
Web scraping in 2026: don't run every page through an LLM!
Published on 2026-05-13
Among scraper developers a strange approach is spreading: send every downloaded page to an LLM and ask it to find the needed data. It sounds convenient: you don't have to understand the markup, the model figures it out. There are even dedicated tools for this (ScrapeGraphAI, Crawl4AI, FireCrawl), all of which, in one way or another, run the page contents through a language model on every iteration.
In practice this creates three problems at once.
Why this is a bad approach
Slow. The cycle looks like this: download a page, send it to the model, wait for the response, move to the next URL. Over thousands of pages this adds up to a very long wait, and even with parallel requests, each iteration is still dominated by model latency.
Expensive. You burn tokens on HTML tags, styles, navigation menus and other junk that has nothing to do with your data. The useful payload is often only around 10% of what actually goes to the model.
Nondeterministic. The LLM can quietly skip some of the data, especially when a page contains a lot of it. You won't always notice right away, and by then the dataset is already incomplete. This is fundamentally different from a classic crawler, which either finds an element by its selector or fails with an error. Here the failure happens silently, and that, in my opinion, is the most dangerous part: you can never be 100% sure that the collected data matches what was on the site.
How to do it right
Use the LLM once — not on every page.
Take a few representative pages containing the needed information, feed their markup to the model, and ask it to generate extraction code: CSS selectors or XPath expressions. Then run a normal crawler with those settings.
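That one-time step can be as simple as pasting a few sample pages into a single prompt and parsing the model's JSON answer. A minimal sketch, assuming your own LLM client — the prompt wording and the `ask_llm` callable below are placeholders, not any particular API:

```python
import json

def build_selector_prompt(sample_pages: list[str], fields: list[str]) -> str:
    """Assemble one prompt from a handful of sample pages."""
    samples = "\n\n---PAGE---\n\n".join(sample_pages)
    return (
        "Given the HTML pages below, return CSS selectors that extract "
        f"these fields: {', '.join(fields)}. "
        "Answer with a JSON object mapping field name to selector.\n\n"
        + samples
    )

def get_selectors(ask_llm, sample_pages: list[str], fields: list[str]) -> dict:
    """Call the model once and parse its JSON answer into a selector map."""
    return json.loads(ask_llm(build_selector_prompt(sample_pages, fields)))
```

The model runs once here, over a few pages; everything after this point is deterministic code.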
The choice of tool depends on the site’s complexity:
- BeautifulSoup — for static pages where the data is present directly in the HTML. Lightweight, fast, reliable.
- Playwright (or Puppeteer) — for dynamic sites with JavaScript rendering, infinite scrolling, modal windows and other joys of modern frontends. It brings up a real browser and waits until everything is loaded.
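For the static case, the crawler that runs the LLM-produced selectors can be a few lines of BeautifulSoup. A sketch under assumptions: the selectors and field names below are hypothetical placeholders, and the key design choice is to fail loudly when a selector matches nothing, instead of letting data vanish silently:

```python
from bs4 import BeautifulSoup

# Selectors generated once by the model from sample pages
# (hypothetical placeholders; yours will match the real site).
SELECTORS = {
    "title": "h1.product-title",
    "price": "span.price",
}

def extract(html: str, url: str = "") -> dict:
    """Extract one record from a page, raising if any field is missing."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for field, css in SELECTORS.items():
        node = soup.select_one(css)
        if node is None:
            # Fail loudly: a missing field is an error, not a silent gap.
            raise ValueError(f"selector for {field!r} matched nothing on {url}")
        record[field] = node.get_text(strip=True)
    return record
```

The same extraction logic plugs into Playwright for dynamic sites: render the page in the browser, take `page.content()`, and pass it through the same function.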
As a result you get a system that works fast, gives predictable results and doesn’t waste tokens. That’s exactly how search crawlers are built — nobody runs GPT on every indexed page.
Where an LLM is still useful in scraping
LLMs are great at tasks that require understanding structure and writing code, which is exactly what we're recommending here. The model also works well at the post-processing stage: classifying extracted data, normalizing formats, extracting meaning from unstructured text. That's its domain. Acting as the conveyor for every single request is not.
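As a sketch of that post-processing stage: classify already-extracted records in batches, with one model call per batch rather than one per page. `call_llm` is a placeholder for whatever client you use, and the prompt is illustrative:

```python
import json

def classify_batch(records: list[dict], call_llm) -> list[dict]:
    """Attach a category to each record using a single LLM call per batch."""
    prompt = (
        "For each product below, return a JSON array of category names, "
        "one per product, in order.\n"
        + json.dumps(records, ensure_ascii=False)
    )
    categories = json.loads(call_llm(prompt))
    if len(categories) != len(records):
        # Same principle as in the crawler: fail loudly on a mismatch.
        raise ValueError("model returned the wrong number of categories")
    return [dict(rec, category=cat) for rec, cat in zip(records, categories)]
```

Because the records are already clean structured data, the batch is small and cheap compared to feeding raw HTML through the model.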
Conclusion
Separate responsibilities: the LLM builds the tool, the crawler collects the data, the LLM processes the result if needed. Each does what it does best.
If you need to collect data from the web — products from marketplaces, listings, prices, reviews, any other sources — reach out. We’ll find, parse, and classify.