// DevOps

How I automated managing 366 servers and stopped spending nights on routine tasks

Published on 2026-05-16

When you have one server, you SSH in and do what's needed. With ten, you write bash scripts. With more than three hundred, either you automate everything, or the infrastructure ends up managing you instead of the other way around.

I've been down that path. A dozen projects, over 400 servers in seven countries: the Netherlands, Poland, the USA, Germany, India, Kazakhstan, and Russia. At some point it became clear that keeping it all in my head and fixing things by hand would no longer work. It was either a system or chaos. I chose the system, and now a new node anywhere in the world is up in a few minutes without a single manual SSH session.

I’ll explain how it works.


How it looked before

Launching a new server used to be a quest. Open SSH, remember (or find in notes) the right commands, run them in order, forget to add something to DNS, go back and add it, then add the server to monitoring separately and register it in the control panel separately. That took an hour to an hour and a half per node, and only if everything went smoothly. And each time there was a non-zero chance of doing something slightly differently than last time.

At a scale of over 400 servers this is more than an inconvenience. Even if you touch each one only once a month, that adds up to hundreds of hours a year of mechanical work. On top of that, the infrastructure gradually turns into a black box: nobody really remembers what is configured on which server, why it's set up that way, or who did it.


What’s inside

The foundation is Ansible paired with Semaphore. Ansible lets you describe server configuration as code: write it once, run it on any number of machines. Semaphore is a simple web UI on top of it so you don’t have to run things from the terminal and can see what’s happening in real time.

The whole repository is 49 playbooks, 39 roles, and 11 inventory files. A playbook is a script: a sequence of steps for a server. A role is a reusable block of logic, e.g. "configure firewall" or "issue a certificate." An inventory is a list of all servers with their parameters: role, country, ports, service versions. All infrastructure lives in git, so any change is visible in history and any state is reproducible. When something goes wrong, the answer is always in the code, not in the head of someone who "remembers how they did it last time."

Each server has a role: regular node, edge, shared, base machine, or monitoring. Playbooks read that role and skip the steps that don't apply on their own: a proxy won't get the VPN stack, and a shared node won't go through full hardening. No manual filtering.
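To make the idea concrete, here is a minimal Python sketch of role-based step selection. It is not the actual playbook code: the role names follow the list above, while the step names and the mapping between them are assumptions made up for illustration. In Ansible itself this kind of filtering is usually expressed with `when:` conditions on tasks or roles.

```python
# Hypothetical mapping: which provisioning steps apply to which server role.
# Role names follow the article; step names and the mapping are illustrative.
STEP_ROLES = {
    "bootstrap": {"node", "edge", "shared", "base", "monitoring"},
    "hardening": {"node", "edge", "base", "monitoring"},  # shared nodes skip full hardening
    "vpn_stack": {"node"},                                # only regular nodes get the VPN stack
    "monitoring_agent": {"node", "edge", "shared", "base", "monitoring"},
}


def steps_for(role: str) -> list[str]:
    """Return the steps that apply to a role, in declaration order."""
    return [step for step, roles in STEP_ROLES.items() if role in roles]


if __name__ == "__main__":
    for role in ("node", "edge", "shared"):
        print(role, "->", steps_for(role))
```

The point of keeping the mapping in one place is that adding a new role means editing data, not touching every step.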

The inventory is organized hierarchically: project → region → host. To change a parameter for all nodes in the Netherlands, you change one line. To update only one server or only a specific region, you pass the corresponding filter when running the playbook. No extra code, no copy-paste.
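The hierarchy and the filtering can be sketched in a few lines of Python. Everything here is hypothetical: the project, region, and host names, the variables, and the `resolve` helper are invented for illustration, and real Ansible inventories express the same thing with groups, group variables, and the `--limit` option of `ansible-playbook`.

```python
# Hypothetical inventory: project -> region -> host. All names and values are made up.
INVENTORY = {
    "vpn": {
        "vars": {"ssh_port": 22},
        "regions": {
            "nl": {
                "vars": {"dns_zone": "nl.example.com"},
                "hosts": {"nl-01": {}, "nl-02": {"ssh_port": 2222}},
            },
            "pl": {
                "vars": {"dns_zone": "pl.example.com"},
                "hosts": {"pl-01": {}},
            },
        },
    },
}


def resolve(inventory, project=None, region=None, host=None):
    """Yield (host, vars) pairs matching the filter.

    Variables are merged so that narrower levels override broader ones:
    project < region < host.
    """
    for p_name, p in inventory.items():
        if project and p_name != project:
            continue
        for r_name, r in p["regions"].items():
            if region and r_name != region:
                continue
            for h_name, h_vars in r["hosts"].items():
                if host and h_name != host:
                    continue
                yield h_name, {**p["vars"], **r["vars"], **h_vars}


if __name__ == "__main__":
    # Everything in the Netherlands; nl-02 keeps its host-level ssh_port override.
    for name, merged in resolve(INVENTORY, region="nl"):
        print(name, merged)
```

Changing `dns_zone` under `nl` affects every Dutch host at once, which is the "change one line" property described above.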


What deployment looks like from the inside

Launching a new VPN node is the most illustrative example. One click in Semaphore, nine steps in the chain, 8–12 minutes of real time. While the node is provisioning, I do something else.

What happens inside: first, SSH availability is checked, since there is no point in running the chain on a dead server. Then comes bootstrap: basic packages, Docker, swap. Next is hardening: unnecessary ports are closed, brute-force protection is configured, and kernel parameters are applied to tune networking. After that, the project configs and files are copied. A DNS record is created and an SSL certificate is issued via the Cloudflare API. The node is registered in the management panel, and the Docker stack is brought up. The last step adds the DNS records for traffic balancing.

Each step is described in code and has been run dozens of times. The probability of human error approaches zero, not because people became more careful, but because the person is simply removed from the process.
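The shape of that chain, ordered steps that stop at the first failure, can be sketched in Python as follows. The step names mirror the description above, but the step functions are placeholders and `run_chain` is a hypothetical helper, not the real Semaphore or Ansible machinery.

```python
from typing import Callable

# A step is a name plus a function that takes a host and reports success.
Step = tuple[str, Callable[[str], bool]]


def run_chain(host: str, steps: list[Step]) -> tuple[bool, list[str]]:
    """Run steps in order and stop at the first failure.

    Returns (ok, names of the steps that completed).
    """
    done: list[str] = []
    for name, action in steps:
        if not action(host):
            return False, done
        done.append(name)
    return True, done


def ok(_host: str) -> bool:
    """Placeholder for a step that succeeds."""
    return True


def fail(_host: str) -> bool:
    """Placeholder for a step that fails, e.g. SSH is unreachable."""
    return False


# Nine steps, named after the article's description; the bodies are stand-ins.
CHAIN: list[Step] = [
    ("check_ssh", ok),
    ("bootstrap", ok),
    ("hardening", ok),
    ("copy_project_files", ok),
    ("dns_record", ok),
    ("ssl_certificate", ok),
    ("register_in_panel", ok),
    ("docker_stack", ok),
    ("balancing_dns", ok),
]

if __name__ == "__main__":
    print(run_chain("nl-01", CHAIN))
    # With a dead server the chain stops before any real work is done.
    print(run_chain("nl-01", [("check_ssh", fail)] + CHAIN[1:]))
```

Putting the SSH check first means a dead server costs a few seconds instead of a half-finished provisioning run.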


Details that save your nerves

Monitoring adds itself. During provisioning, an agent is deployed on every server and registers the machine in a central registry. Prometheus finds new servers through that registry: no static lists, no "don't forget to add it." Add a server to the inventory, run the playbook, and it is already in the metrics. When a server is decommissioned, a separate playbook cleans up all of its entries.
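The article does not say which discovery mechanism sits between the registry and Prometheus, so the following is only one possible sketch. It assumes Prometheus file-based service discovery, whose target files are JSON lists of objects with `targets` and `labels`. The registry entry fields, the `to_file_sd` helper, and port 9100 (the usual node_exporter port) are assumptions.

```python
import json


def to_file_sd(registry: list[dict], port: int = 9100) -> str:
    """Render registry entries as a Prometheus file_sd JSON document.

    Entries marked inactive (e.g. decommissioned servers) are left out,
    so they disappear from scraping on the next reload of the file.
    """
    groups = [
        {
            "targets": [f"{entry['address']}:{port}"],
            "labels": {"project": entry["project"], "region": entry["region"]},
        }
        for entry in registry
        if entry.get("active", True)
    ]
    return json.dumps(groups, indent=2)


if __name__ == "__main__":
    # Hypothetical registry contents.
    registry = [
        {"address": "10.0.0.1", "project": "vpn", "region": "nl"},
        {"address": "10.0.0.2", "project": "vpn", "region": "pl", "active": False},
    ]
    print(to_file_sd(registry))
```

Whatever the real mechanism is, the design choice is the same: the list of scrape targets is generated from the registry, never edited by hand.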

SSL certificates are no longer a chore. The system issues them via Cloudflare, supports wildcards and multiple domains, and renews them automatically. One server can issue a certificate and share the state with other nodes — handy when a hundred machines serve the same domain.

State reconciliation playbook. Any live infrastructure sooner or later diverges from what's declared in code: something was changed by hand, something drifted. This playbook queries the management panel, compares it with the inventory, checks ports, and verifies firewall rules. The result is a concrete report: this node exists in the panel but not in the inventory; here the profile doesn't match; here a required rule is missing. Without this tool, such discrepancies accumulate unnoticed.

Results in Telegram. After each playbook you get a message: how many hosts were processed, how many succeeded, how many failed, what exactly failed, and a link to the details. When you update 50 nodes, you don't have to stare at the screen. Start it, go do something else, and get the result on your phone.
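At its core, this kind of reconciliation is a set comparison between the declared state and the observed one. The sketch below is a simplified Python illustration, not the playbook itself. The `reconcile` function and the field names `profile`, `required_rules`, and `firewall_rules` are hypothetical, and the observed state is assumed to have been collected already from the panel and the hosts.

```python
def reconcile(inventory: dict[str, dict], observed: dict[str, dict]) -> list[str]:
    """Compare declared hosts (inventory) with observed hosts and report drift."""
    report: list[str] = []

    # Hosts that exist on only one side.
    for host in sorted(observed.keys() - inventory.keys()):
        report.append(f"{host}: in panel but not in inventory")
    for host in sorted(inventory.keys() - observed.keys()):
        report.append(f"{host}: in inventory but not in panel")

    # Hosts on both sides: compare profile and required firewall rules.
    for host in sorted(inventory.keys() & observed.keys()):
        declared, actual = inventory[host], observed[host]
        if declared.get("profile") != actual.get("profile"):
            report.append(
                f"{host}: profile is {actual.get('profile')!r}, "
                f"inventory says {declared.get('profile')!r}"
            )
        missing = set(declared.get("required_rules", [])) - set(
            actual.get("firewall_rules", [])
        )
        for rule in sorted(missing):
            report.append(f"{host}: missing firewall rule {rule}")
    return report


if __name__ == "__main__":
    # Hypothetical data for illustration.
    inventory = {
        "nl-01": {"profile": "vpn", "required_rules": ["22/tcp", "443/tcp"]},
        "pl-01": {"profile": "vpn"},
    }
    observed = {
        "nl-01": {"profile": "proxy", "firewall_rules": ["22/tcp"]},
        "de-07": {"profile": "vpn"},
    }
    print("\n".join(reconcile(inventory, observed)))
```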
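How the messages are actually produced is not described here, so treat the following as one possible sketch. `format_summary` and `send_telegram` are hypothetical helpers, and the token, chat ID, and details URL are placeholders. The only external fact relied on is the Telegram Bot API `sendMessage` method, which accepts `chat_id` and `text`.

```python
import json
import urllib.request


def format_summary(playbook: str, results: dict[str, bool], details_url: str) -> str:
    """Build the summary text: totals, failed hosts, and a link to details."""
    failed = sorted(host for host, ok in results.items() if not ok)
    lines = [
        f"{playbook}: {len(results)} hosts, "
        f"{len(results) - len(failed)} ok, {len(failed)} failed"
    ]
    if failed:
        lines.append("Failed: " + ", ".join(failed))
    lines.append(f"Details: {details_url}")
    return "\n".join(lines)


def send_telegram(token: str, chat_id: str, text: str) -> None:
    """Send a message through the Telegram Bot API sendMessage method."""
    request = urllib.request.Request(
        f"https://api.telegram.org/bot{token}/sendMessage",
        data=json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()


if __name__ == "__main__":
    # Hypothetical run results; sending is left out so the sketch runs offline.
    results = {"nl-01": True, "nl-02": False, "pl-01": True}
    print(format_summary("update-nodes", results, "https://semaphore.example.com/task/1"))
```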


About boring tools

It's trendy now to put AI agents on everything. I work actively with AI and understand its value. But there are tasks where a language model is not what you need. Sometimes you don't want a "smart assistant that will figure it out," you want a deterministic system: run it and get exactly the result you described. No interpretation, no hallucinations, no creative approach at 3 a.m. on production.

Ansible is boring. Proven. Predictable. Build the structure once, and the two-hundredth server is configured exactly the same way as the first. That's engineering: it's not about being pretty, it's about working.


The outcome

The numbers are simple. Setting up one server manually takes 60–90 minutes. With automation it takes 8–12 minutes, and for most of that time I'm not involved. At a scale of 366 machines that's hundreds of hours a year that used to be spent on mechanical work.

But honestly, the main value isn't speed. The main value is that the infrastructure stopped being a black box. Any server in any project is configured the same way: the same kernel parameters, the same sequence of steps, the same checks. When something goes wrong, you look for the cause in the code instead of trying to remember who did what by hand three weeks ago.
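As a rough check of that estimate, the arithmetic for a single provisioning pass over all 366 machines can be written out in Python. The per-server times come from the figures above; treating it as exactly one pass over the whole fleet is an assumption made only to get an order of magnitude.

```python
SERVERS = 366


def hours(servers: int, minutes_per_server: int) -> float:
    """Total time in hours for one pass over the fleet."""
    return servers * minutes_per_server / 60


manual = (hours(SERVERS, 60), hours(SERVERS, 90))    # manual setup, best and worst case
automated = (hours(SERVERS, 8), hours(SERVERS, 12))  # automated run, mostly unattended

# Smallest and largest possible difference between the two approaches.
saved_min = manual[0] - automated[1]
saved_max = manual[1] - automated[0]

if __name__ == "__main__":
    print(f"manual:    {manual[0]:.0f} to {manual[1]:.0f} hours")
    print(f"automated: {automated[0]:.1f} to {automated[1]:.1f} hours")
    print(f"saved:     {saved_min:.0f} to {saved_max:.0f} hours")
```

Even a single full pass lands in the hundreds of hours, and the automated figure is wall-clock time rather than hands-on time, so the real difference in attention spent is larger.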

And one more detail you only start to appreciate when you need to reboot the entire infrastructure: the rolling reboot. The playbook reboots servers one by one, waiting for each to come back before moving on to the next. Restarting the 89 nodes of a project is one run, with no risk of taking everything down at once. That used to be an operation lasting several evenings under constant supervision. Now you start it and forget it.
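The control flow of a rolling reboot is simple enough to sketch in Python. This is an illustration of the pattern, not the playbook: `rolling_reboot` and its `reboot` and `is_up` callbacks are hypothetical, and in Ansible the same effect is commonly achieved with `serial: 1` on the play together with the `ansible.builtin.reboot` module.

```python
import time
from typing import Callable


def rolling_reboot(
    hosts: list[str],
    reboot: Callable[[str], None],
    is_up: Callable[[str], bool],
    max_checks: int = 30,
    delay: float = 10.0,
) -> list[str]:
    """Reboot hosts one at a time, waiting for each before the next.

    Stops with an error at the first host that does not come back,
    so a bad reboot never spreads to the rest of the fleet.
    """
    done: list[str] = []
    for host in hosts:
        reboot(host)
        for _ in range(max_checks):
            if is_up(host):
                break
            time.sleep(delay)
        else:
            raise RuntimeError(f"{host} did not come back; stopping before the next host")
        done.append(host)
    return done


if __name__ == "__main__":
    # Stand-in callbacks: "rebooting" just prints, and every host is up immediately.
    print(rolling_reboot(["nl-01", "nl-02"], lambda h: print("rebooting", h), lambda h: True))
```

Stopping at the first failure is the important design choice: at worst one node is down, and the remaining ones keep serving traffic.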

400 servers, seven countries — one repository. Not because I used the trendiest tool. But because I chose the right one.
