// Python Dev
How to Pretend to Be Human: Scraping Without Getting Blocked
Published on 2026-05-12
There is a classic Turing test: a machine tries to convince a human that it, too, is human. Scraping works exactly the other way around: the site tries to figure out that you are not human, and you try to convince it otherwise. The better you understand exactly how the site sees you, the easier that is. Let’s break it down layer by layer.
Layer one: look like a human
I write my scrapers with Playwright, a tool that drives a real browser: Chromium, WebKit, or Firefox. It doesn’t emulate HTTP requests; it runs a real engine, exactly like a normal user.
But that’s not enough. The thing is, a headless browser exposes itself by default: there are characteristic signs in the headers, navigator.webdriver sticks out like a flag, and environment parameters differ from regular Chrome. Anti-bot systems know these signs by heart, and leaving them in place isn’t a disguise; it’s simply saying “I’m a bot” in different words. Playwright lets you reconfigure all of this: the right User-Agent, the right headers, the right launch parameters.
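For reference, here is roughly what that setup looks like in Playwright’s Python API. This is a minimal sketch: the Chromium flag is real, but the specific User-Agent, locale, and viewport values are illustrative, and the point is that they all have to be consistent with the browser you claim to be.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        # removes the navigator.webdriver flag in Chromium-based browsers
        args=["--disable-blink-features=AutomationControlled"],
    )
    context = browser.new_context(
        # example values; keep them consistent with each other
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
        ),
        locale="en-US",
        timezone_id="Europe/Berlin",
        viewport={"width": 1366, "height": 768},
        extra_http_headers={"Accept-Language": "en-US,en;q=0.9"},
    )
    page = context.new_page()
```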
The next step is setting up the fingerprint. You need to understand that large sites are big systems receiving data from many sources: ad trackers, visit counters, tracking system markers, behavioral analytics. All of this forms a profile even before you visit the target site, and you need to match that profile. A browser with empty cookies immediately raises questions — either a paranoid user or a bot, usually the latter. So before starting a session I go through a short list of common sites that have both counters and ad trackers — all so that these systems create our fingerprint, and when we go to the target site they already signal: this is a regular person, here is their profile.
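A sketch of that warm-up is below; the site list and dwell times are placeholders, in practice you pick them to match the target’s audience.

```python
import random
import time

# hypothetical warm-up list: ordinary sites with counters and ad trackers
WARMUP_SITES = [
    "https://www.wikipedia.org",
    "https://www.reddit.com",
    "https://news.ycombinator.com",
]

def warm_up(page, visits=2):
    """Let trackers populate cookies before we touch the target site."""
    for url in random.sample(WARMUP_SITES, k=visits):
        page.goto(url, wait_until="domcontentloaded")
        page.mouse.wheel(0, random.randint(400, 1200))  # scroll a bit, like a reader
        time.sleep(random.uniform(2, 6))                # linger for a believable moment
```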
And the last thing in this layer: the IP address. Data-center ranges are well known to anti-bot systems, so server addresses get caught quickly; to any large site they are an immediate red flag. The choice here is unambiguous: residential proxies. Yes, they are much more expensive, but they work where data-center ones won’t get you in.
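In Playwright the proxy plugs in at context creation, continuing the sketch above; the endpoint and credentials here are placeholders for whatever your provider issues.

```python
# each context gets its own residential exit point from the pool
context = browser.new_context(
    proxy={
        "server": "http://proxy.example.com:8000",  # placeholder endpoint
        "username": "user",                         # placeholder credentials
        "password": "pass",
    },
)
```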
Layer two: behave like a human
Okay, technically we already look like a human. But that’s not enough — you also need to behave accordingly.
Think about how a real user behaves on a site. They don’t open a page and immediately start methodically clicking the elements they need. They scroll, move the mouse, read something irrelevant, follow links within the site, and all of it without any system or predictability. That’s exactly what needs to be reproduced: random mouse movements, scrolling, navigation by clicking page elements rather than a direct goto.
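A rough sketch of that kind of noise; every range and selector here is illustrative and gets tuned per site.

```python
import random

def behave_like_a_human(page):
    """Unhurried, unpredictable actions between the useful ones."""
    # wander the cursor through a few random points
    for _ in range(random.randint(2, 5)):
        page.mouse.move(
            random.randint(0, 1200), random.randint(0, 700),
            steps=random.randint(10, 30),  # smooth movement, not a teleport
        )
    # scroll down in uneven steps, pausing as if reading
    for _ in range(random.randint(2, 4)):
        page.mouse.wheel(0, random.randint(300, 900))
        page.wait_for_timeout(random.randint(500, 2000))
    # navigate by clicking a visible in-site link instead of a direct goto
    links = page.locator("a[href^='/']")
    if links.count() > 0:
        links.nth(random.randrange(links.count())).click()
```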
You might think we’re wasting time on these extra actions. Actually no — and here’s why. A session that behaves naturally lives longer, and starting a new session costs resources: a new address from the proxy pool, a new browser instance, a new cookie warm-up. Spending a few seconds on scrolling so that the session works for hours is not a loss, it’s an investment. The difference between “100 pages in one run” and “thousands of wasted attempts” is quite tangible.
Another point that’s not obvious at first: I always go to the site’s main page first, and only then proceed to the desired section. A user who appears directly on the search results page or a product page and immediately starts clicking aggressively looks suspicious, because people don’t behave like that. You go to the homepage, look around, then continue — that’s a completely different story.
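In code it’s a couple of lines; the site URL and the link name are hypothetical, what matters is the order.

```python
page.goto("https://example-shop.com")             # land on the homepage first
behave_like_a_human(page)                         # look around (sketch above)
page.get_by_role("link", name="Catalog").click()  # then click through to the section
```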
And about rotation: don’t operate with the same pattern for too long. Addresses change, fingerprint profiles rotate, behavior patterns vary — because predictability itself is also a signal.
A separate word about captchas. Many treat them as a task to be solved: buy a recognition service, plug it into the pipeline, and move on. But that’s the wrong framing, because a captcha is not an obstacle, it’s an indicator. It means the system already considers you suspicious, and the degree of suspicion directly affects what you’ll see. Yandex, for example, sometimes shows a captcha that can be solved with one click in the right field, but only if you’re merely a suspect. If it has decided you’re a bot, you’ll get a graphical one: complex, slow, and expensive to solve. The right goal is not “solve captchas quickly” but “never get to them in the first place”.
Layer three: think like an engineer
Modern sites often deliver data not in HTML but via XHR requests: the page loads, and then pulls data with separate requests to an API. When you see this in DevTools, there’s a temptation to find those endpoints and hit them directly with a regular fetch, bypassing the browser completely — fast, cheap, no overhead.
Site owners aren’t stupid: endpoints are protected by tokens, they strictly check who is asking and from where, and they check the fingerprint too. As a result, you’ll get the data a couple of times at best, and then 429 errors will shut your scraper out.
But here’s what works well. When Playwright opens a page, all XHR requests go through the browser with all the correct headers, cookies, and context, exactly as they would for a real user. With page.on('response', ...) you can listen to network events and intercept the responses you need right as the page loads. The browser received JSON, which means we received it too, asynchronously and with no extra work. No parsing the DOM, no hunting for data with selectors that will break on the next redesign.
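A minimal sketch of that interception; the /api/v2/products fragment is made up, you substitute the endpoint you actually see.

```python
captured = []

def handle_response(response):
    # keep only the API responses we care about; the URL fragment is an assumption
    if "/api/v2/products" in response.url and response.status == 200:
        try:
            captured.append(response.json())
        except Exception:
            pass  # body turned out not to be JSON

page.on("response", handle_response)
page.goto("https://example-shop.com/catalog")
page.wait_for_load_state("networkidle")
# `captured` now holds the same JSON payloads the page itself received
```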
Which exact request to intercept is something you look up once in DevTools, working out the structure as you go. That’s the engineer’s job: understand how the system is built, choose the optimal strategy, make a decision. Don’t throw everything at an LLM hoping for a miracle; here you need to think, not generate.
Scaling: need more — add more
This whole setup scales horizontally, and that’s its main advantage. The browser runs in a container, proxies plug into it, and when you need to scrape more, you spin up more containers and take more addresses from the pool. No magic, just multiplying what already works.
An important principle here — don’t try to squeeze the maximum out of a single session by pushing it to the limit. That’s a direct path to blocks, captchas, and unstable operation. It’s much better to have ten calm instances at a normal pace than one aggressive one that constantly fails and needs restarting.
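Schematically, the calm variant can look like this; scrape_one here is a placeholder for whatever per-page routine you run inside each instance.

```python
import asyncio
import random

async def worker(urls: asyncio.Queue):
    # each instance owns its browser, its proxy, and an unhurried pace
    while not urls.empty():
        url = await urls.get()
        await scrape_one(url)                       # hypothetical per-page routine
        await asyncio.sleep(random.uniform(3, 10))  # a human-ish pause, no rush

async def run(all_urls, instances=10):
    queue: asyncio.Queue = asyncio.Queue()
    for u in all_urls:
        queue.put_nowait(u)
    await asyncio.gather(*(worker(queue) for _ in range(instances)))
```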
Architecture: set up once — use everywhere
I have a core: a browser with configured masking, cookie warm-up logic, a proxy pool, behavioral patterns. All of this was solved once and now works for all projects without changes.
For each new site, only a module with business logic is written: what data is needed, where it sits, how to get to it. Everything specific to a particular resource lives in that module; the infrastructure is shared. A new project is not “write a scraper from scratch” but “describe what to take and from where”. Startup time is minimal, reliability is high.
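Schematically, a per-site module can be this small; all names here are illustrative, not the real interface.

```python
class ExampleShop:
    """Business logic for one specific site; the shared core calls these hooks."""
    start_url = "https://example-shop.com"   # illustrative entry point
    api_fragment = "/api/v2/products"        # which XHR responses the core captures

    @staticmethod
    def parse(payload: dict) -> list[dict]:
        # pull out only the fields this project needs
        return [
            {"sku": item["id"], "price": item["price"]}
            for item in payload.get("items", [])
        ]
```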
Competitive analysis, price monitoring, lead generation, market research: half of modern business solutions rely on data that someone first had to collect. Scraping is the tool for such tasks, and like any good tool it should be boring, reliable, and unobtrusive. No heroics, no “squeeze out the maximum”. Just a system that works.