// Python Dev
How to Pretend to Be Human: Scraping Without Getting Blocked
Published on 2026-05-12
There is a classic Turing test: a machine tries to convince a human that it, too, is human. Scraping works exactly the other way around: the site tries to figure out that you are not human, and you try to convince it otherwise. The better you understand exactly how the site sees you, the easier that is. Let’s break it down layer by layer.
Layer one: look like a human
I write my scrapers with Playwright, a tool that drives a real browser: Chromium, WebKit, or Firefox. It doesn’t emulate HTTP requests; it runs a real engine, exactly like a normal user.
But that’s not enough. The thing is, a headless browser exposes itself by default: there are characteristic signs in the headers, navigator.webdriver sticks out like a flag, and environment parameters differ from regular Chrome. Anti-bot systems know these signs by heart, and leaving them in place isn’t a disguise; it’s simply saying “I’m a bot” in different words. Playwright lets you reconfigure all of this: the right User-Agent, the right headers, the right launch parameters.
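For reference, here is roughly what that setup looks like in Playwright’s Python API. This is a minimal sketch: the Chromium flag is real, but the specific User-Agent, locale, and viewport values are illustrative, and the point is that they all have to be consistent with the browser you claim to be.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        # removes the navigator.webdriver flag in Chromium-based browsers
        args=["--disable-blink-features=AutomationControlled"],
    )
    context = browser.new_context(
        # example values; keep them consistent with each other
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
        ),
        locale="en-US",
        timezone_id="Europe/Berlin",
        viewport={"width": 1366, "height": 768},
        extra_http_headers={"Accept-Language": "en-US,en;q=0.9"},
    )
    page = context.new_page()
```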
The next step is setting up the fingerprint. You need to understand that large sites are big systems receiving data from many sources: ad trackers, visit counters, tracking system markers, behavioral analytics. All of this forms a profile even before you visit the target site, and you need to match that profile. A browser with empty cookies immediately raises questions — either a paranoid user or a bot, usually the latter. So before starting a session I go through a short list of common sites that have both counters and ad trackers — all so that these systems create our fingerprint, and when we go to the target site they already signal: this is a regular person, here is their profile.
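A sketch of that warm-up is below; the site list and dwell times are placeholders, in practice you pick them to match the target’s audience.

```python
import random
import time

# hypothetical warm-up list: ordinary sites with counters and ad trackers
WARMUP_SITES = [
    "https://www.wikipedia.org",
    "https://www.reddit.com",
    "https://news.ycombinator.com",
]

def warm_up(page, visits=2):
    """Let trackers populate cookies before we touch the target site."""
    for url in random.sample(WARMUP_SITES, k=visits):
        page.goto(url, wait_until="domcontentloaded")
        page.mouse.wheel(0, random.randint(400, 1200))  # scroll a bit, like a reader
        time.sleep(random.uniform(2, 6))                # linger for a believable moment
```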
And the last thing in this layer: the IP address. Data-center ranges are well known to anti-bot systems, so server addresses get caught quickly; to any large site they are an immediate red flag. The choice here is unambiguous: residential proxies. Yes, they are much more expensive, but they work where data-center ones won’t get you in.
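In Playwright the proxy plugs in at context creation, continuing the sketch above; the endpoint and credentials here are placeholders for whatever your provider issues.

```python
# each context gets its own residential exit point from the pool
context = browser.new_context(
    proxy={
        "server": "http://proxy.example.com:8000",  # placeholder endpoint
        "username": "user",                         # placeholder credentials
        "password": "pass",
    },
)
```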
Layer two: behave like a human
Okay, technically we already look like a human. But that’s not enough — you also need to behave accordingly.
Think about how a real user behaves on a site. They don’t open a page and immediately start methodically clicking the elements they need. They scroll, move the mouse, read something irrelevant, follow links within the site, and all of it without any system or predictability. That’s exactly what needs to be reproduced: random mouse movements, scrolling, navigation by clicking page elements rather than a direct goto.
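A rough sketch of that kind of noise; every range and selector here is illustrative and gets tuned per site.

```python
import random

def behave_like_a_human(page):
    """Unhurried, unpredictable actions between the useful ones."""
    # wander the cursor through a few random points
    for _ in range(random.randint(2, 5)):
        page.mouse.move(
            random.randint(0, 1200), random.randint(0, 700),
            steps=random.randint(10, 30),  # smooth movement, not a teleport
        )
    # scroll down in uneven steps, pausing as if reading
    for _ in range(random.randint(2, 4)):
        page.mouse.wheel(0, random.randint(300, 900))
        page.wait_for_timeout(random.randint(500, 2000))
    # navigate by clicking a visible in-site link instead of a direct goto
    links = page.locator("a[href^='/']")
    if links.count() > 0:
        links.nth(random.randrange(links.count())).click()
```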
You might think we’re wasting time on these extra actions. Actually no — and here’s why. A session that behaves naturally lives longer, and starting a new session costs resources: a new address from the proxy pool, a new browser instance, a new cookie warm-up. Spending a few seconds on scrolling so that the session works for hours is not a loss, it’s an investment. The difference between “100 pages in one run” and “thousands of wasted attempts” is quite tangible.
Another point that’s not obvious at first: I always go to the site’s main page first, and only then proceed to the desired section. A user who appears directly on the search results page or a product page and immediately starts clicking aggressively looks suspicious, because people don’t behave like that. You go to the homepage, look around, then continue — that’s a completely different story.
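In code it’s a couple of lines; the site URL and the link name are hypothetical, what matters is the order.

```python
page.goto("https://example-shop.com")             # land on the homepage first
behave_like_a_human(page)                         # look around (sketch above)
page.get_by_role("link", name="Catalog").click()  # then click through to the section
```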
And about rotation: don’t operate with the same pattern for too long. Addresses change, fingerprint profiles rotate, behavior patterns vary — because predictability itself is also a signal.
A separate word about captchas. Many treat them as a task to be solved: buy a recognition service, plug it into the pipeline, and move on. But that’s the wrong framing, because a captcha is not an obstacle, it’s an indicator. It means the system already considers you suspicious, and the degree of suspicion directly affects what you’ll see. Yandex, for example, sometimes shows a captcha that can be solved with one click in the right field, but only if you’re merely a suspect. If it has decided you’re a bot, you’ll get a graphical one: complex, slow, and expensive to solve. The right goal is not “solve captchas quickly” but “never get to them in the first place”.
Layer three: think like an engineer
Modern sites often deliver data not in HTML but via XHR requests: the page loads, and then pulls data with separate requests to an API. When you see this in DevTools, there’s a temptation to find those endpoints and hit them directly with a regular fetch, bypassing the browser completely — fast, cheap, no overhead.
Site owners aren’t stupid: endpoints are protected by tokens, they strictly check who is asking and from where, and they check the fingerprint too. As a result, you’ll get the data a couple of times at best, and then 429 errors will shut your scraper out.
But here’s what works well. When Playwright opens a page, all XHR requests go through the browser with all the correct headers, cookies, and context, exactly as they would for a real user. With page.on('response', ...) you can listen to network events and intercept the responses you need right as the page loads. The browser received JSON, which means we received it too, asynchronously and with no extra work. No parsing the DOM, no hunting for data with selectors that will break on the next redesign.
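A minimal sketch of that interception; the /api/v2/products fragment is made up, you substitute the endpoint you actually see.

```python
captured = []

def handle_response(response):
    # keep only the API responses we care about; the URL fragment is an assumption
    if "/api/v2/products" in response.url and response.status == 200:
        try:
            captured.append(response.json())
        except Exception:
            pass  # body turned out not to be JSON

page.on("response", handle_response)
page.goto("https://example-shop.com/catalog")
page.wait_for_load_state("networkidle")
# `captured` now holds the same JSON payloads the page itself received
```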
Which exact request to intercept is something you look up once in DevTools, working out the structure as you go. That’s the engineer’s job: understand how the system is built, choose the optimal strategy, make a decision. Don’t throw everything at an LLM hoping for a miracle; here you need to think, not generate.
Scaling: need more — add more
This whole setup scales horizontally, and that’s its main advantage. The browser runs in a container, proxies plug into it, and when you need to scrape more, you spin up more containers and take more addresses from the pool. No magic, just multiplying what already works.
An important principle here — don’t try to squeeze the maximum out of a single session by pushing it to the limit. That’s a direct path to blocks, captchas, and unstable operation. It’s much better to have ten calm instances at a normal pace than one aggressive one that constantly fails and needs restarting.
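Schematically, the calm variant can look like this; scrape_one here is a placeholder for whatever per-page routine you run inside each instance.

```python
import asyncio
import random

async def worker(urls: asyncio.Queue):
    # each instance owns its browser, its proxy, and an unhurried pace
    while not urls.empty():
        url = await urls.get()
        await scrape_one(url)                       # hypothetical per-page routine
        await asyncio.sleep(random.uniform(3, 10))  # a human-ish pause, no rush

async def run(all_urls, instances=10):
    queue: asyncio.Queue = asyncio.Queue()
    for u in all_urls:
        queue.put_nowait(u)
    await asyncio.gather(*(worker(queue) for _ in range(instances)))
```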
Architecture: set up once — use everywhere
I have a core: a browser with configured masking, cookie warm-up logic, a proxy pool, behavioral patterns. All of this was solved once and now works for all projects without changes.
For each new site, only a module with business logic is written: what data is needed, where it sits, how to get to it. Everything specific to a particular resource lives in that module; the infrastructure is shared. A new project is not “write a scraper from scratch” but “describe what to take and from where”. Startup time is minimal, reliability is high.
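Schematically, a per-site module can be this small; all names here are illustrative, not the real interface.

```python
class ExampleShop:
    """Business logic for one specific site; the shared core calls these hooks."""
    start_url = "https://example-shop.com"   # illustrative entry point
    api_fragment = "/api/v2/products"        # which XHR responses the core captures

    @staticmethod
    def parse(payload: dict) -> list[dict]:
        # pull out only the fields this project needs
        return [
            {"sku": item["id"], "price": item["price"]}
            for item in payload.get("items", [])
        ]
```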
Competitive analysis, price monitoring, lead generation, market research: half of modern business solutions rely on data that someone first had to collect. Scraping is the tool for such tasks, and like any good tool it should be boring, reliable, and unobtrusive. No heroics, no “squeeze out the maximum”. Just a system that works.