Datacenters are unpredictable

Here is why

  • Step 1. We have machines (computers, infrastructure, etc.) with reliability P0, and people with reliability P1.
  • Step 2. We increase P0: we have to react to every emergency by fixing things, so P0 improves.
  • Step 3. In time, P0 grows to such a high level that it becomes obvious P1 can now be decreased safely.
  • Step 4. So we decrease P1 (in simple terms, we can hire idiots, because the machines will work no matter who is around).
  • Step 5. We can now push P1 down indefinitely (just as in Step 2 we were pushing P0 up indefinitely).
  • Step 6. Equilibrium. It is some combination of P0 and P1 that was impossible to predict. You might try.
  • Step 7. Now we can fire all of the management, 100% of them, and put in new management: back to Step 1. It's a loop.
  • Not only datacenters of course. Any combination of humans and machines is like that.
    Or even any combination of reliable and unreliable stuff.
    Taleb describes those two matters in "Fooled By Randomness", but he does not elaborate on blends or equilibrium.
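
    The loop above can be sketched as a toy simulation. Nothing in it comes from real data: the update rules, the rates (0.1 and 0.05), the starting values and the name simulate_loop are all assumptions made up for illustration. The only ideas taken from the steps are that emergencies push P0 up, and quiet periods make it look safe to push P1 down.

```python
import random


def simulate_loop(p0=0.5, p1=0.9, steps=200, seed=0):
    """Toy model of Steps 1-6. Every rule and rate here is an assumption."""
    rng = random.Random(seed)
    history = []
    for _ in range(steps):
        # Assumption: things run fine only when both machines and people hold up,
        # and the two are treated as independent.
        emergency = rng.random() > p0 * p1
        if emergency:
            # Step 2: every emergency gets fixed, so machine reliability creeps up.
            p0 += 0.1 * (1.0 - p0)
        else:
            # Steps 3-5: a quiet period makes cutting people reliability look safe.
            p1 -= 0.05 * p1
        history.append((p0, p1))
    return history


if __name__ == "__main__":
    # Same rules, different luck: compare where P0 and P1 end up.
    for seed in range(3):
        final_p0, final_p1 = simulate_loop(seed=seed)[-1]
        print(f"seed={seed} P0={final_p0:.3f} P1={final_p1:.3f}")
```

    Different seeds will generally end at different combinations of P0 and P1, which is the point of Step 6: even in a model this crude, where the pair settles depends on when the emergencies happened to arrive.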

    Feb 25 2021

    Important Note 1

    The equilibrium of Step 6 does not mean the datacenter is reliable. Funding (the amount of money you pour into the whole thing) is not the issue either. Read about Chernobyl. Every nuclear reactor is a datacenter on steroids.

    Apr 07 2022

    Important Note 2

    Let's split computers into two groups: 'Enterprise Computers' and 'Consumer Electronics'. Ironically, the price/utility ratio between those two groups fluctuates wildly over time. Because with monopolies comes greed, there are moments (the beginning of the Internet, for example) when building anything at all from enterprise offerings was ridiculously expensive, so it was economically viable to compose an enterprise offering from consumer electronics parts.

    The crypto bubble is another example (mining crypto on GPUs is a classic case of reusing consumer electronics to provide an enterprise offering). In an ideal world we would have an adequate balance of price and reliability for computers. We live in the real world. In Europe people kill each other with 3D-printed parts blended with consumer electronics they order online (drones are the simplest example). War is an extreme example, of course, but the trend is kind of universal.

    Because of various regulations and conventions, things that are basically consumer electronics (they can be grossly unreliable, but come cheap) are used in place of enterprise components, because "what you gonna do".

    A Windows PC is consumer electronics. It has been for 20+ years.

    Enterprise Computers are driven by Moore's law, more or less. Consumer Electronics is driven by Donskoy's law. Those are *very* different laws. Collisions are inevitable. And they do happen. Every day.

    July 28 2022

    There is also Huang's Law. More about it here

    Aug 10 2022

    So you're thinking about running a cluster of remote machines

    I used to run 10,000 servers. It's not that hard, really: you just need to install sysdig, or a poor man's version of it. The biggest problem is (of course) Windows. There are still tools to deal with that, like the one FireEye uses, but you need to take it from (open) source; it's European, actually. sysdig is made in the USA. There are also tools at the hardware-support level, such as CIMC, that can make life easier when you deal with servers you can't even touch. Not being able to hit the reset button on a server is a real problem.

    I think you're not really interested in building your own cloud. I think you're using an existing cloud provider. In that case, usually the only thing you need is sysdig or its poor man's version. Things like ThousandEyes might help. For 10+ servers you can just install poor man's versions of those. Only sysdig is impossible to beat, so if you are planning to do anything serious, like trying not to get hacked, you have to install sysdig. And find somebody who can set it up. You will not find those people. Not many of us are operational. Especially after the last 4 years.

    There are a lot of challenges when you deal with a distributed setup. Even when you just piggyback on somebody else's cloud, every time somebody deploys something on the servers, things will break. When things break, somebody needs to understand what broke and why. I can handle this (because I know which tools to install and how to use them). Most of the people hired in the last 12 years try to fix it all manually. So nothing works and people just yell at each other. Very sad picture. And of course there are hardware failures. Even switches (networking) can fail. Everything can fail. Most of the time people in the DC can't detect a hardware failure on their own; you need to hold their hand. (See the 'Datacenters are unpredictable' part above.)
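
    As a very rough idea of what a "poor man's" check can look like after a deploy, here is a minimal sketch that only tests whether a TCP port answers on each server. It is not sysdig and does not replace it; the function name check_hosts and the idea of keeping the server list as (host, port) pairs are assumptions for illustration. It tells you which machines stopped answering, not why.

```python
import socket


def check_hosts(targets, timeout=2.0):
    """Try a TCP connect to each (host, port); map each target to None or an error."""
    results = {}
    for host, port in targets:
        try:
            # create_connection resolves the name and tries every returned address.
            with socket.create_connection((host, port), timeout=timeout):
                results[(host, port)] = None  # the port answered
        except OSError as exc:
            # Covers refused connections, timeouts and DNS failures alike.
            results[(host, port)] = str(exc)
    return results


def report(results):
    """Print one line per target that did not answer."""
    for (host, port), error in sorted(results.items()):
        if error is not None:
            print(f"DOWN {host}:{port} ({error})")
```

    Run it before and after every deploy and compare the two reports; whatever appears only in the second one is where to start looking.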

    Recently there is a new set of challenges, related to electricity shortages. The village where I live was affected. Pragmatically, this means that if you have a cluster of remote servers, you have to keep at least 2 locations (so when one goes down, you are still operational). There are not many people on this planet who know how to run that kind of setup. Recently Facebook went down; I don't know if you noticed, but it was a big deal.

    Bottom line: you think your project is little, but if you want your 10+ servers to actually work 24/7, the project is not little at all, and there are no people who know how to make that happen. I know. If you want to hire me, act fast: the idiots are now destroying things everywhere again.
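
    The arithmetic behind "at least 2 locations" can be sketched in a few lines. The big assumption, stated loudly, is that the locations fail independently of each other; the function name combined_availability and the 99% figure in the example are made up for illustration.

```python
def combined_availability(per_site, sites):
    """Chance that at least one site is up, ASSUMING sites fail independently."""
    if not 0.0 <= per_site <= 1.0:
        raise ValueError("per_site must be between 0 and 1")
    if sites < 1:
        raise ValueError("need at least one site")
    # All sites are down at once with probability (1 - per_site) ** sites.
    return 1.0 - (1.0 - per_site) ** sites


if __name__ == "__main__":
    # Hypothetical figure: each location is up 99% of the time.
    for n in (1, 2, 3):
        print(n, "site(s):", round(combined_availability(0.99, n), 6))
```

    With the hypothetical 99% per location, two locations come out at about 99.99%. The independence assumption is exactly what an electricity shortage breaks: two locations on the same power grid go down together, so the second one only counts if it really is somewhere else.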

    Oct 15 2021

    Automation Paradox

    The 'Unpredictable Loop' is a special case of the Automation (Quality) Paradox. Some day I should find time to write up the Automation Paradox in detail. It's complex, but the implications are huge. It's about distribution, quality and cost (im)balances. Interwoven.

    Part of it is the fact that the Guerilla Group Coding described in The Mythical Man-Month (1975) remains the only pragmatic engineering guideline produced on US soil to this day. Not only that, it directly contradicts the Asian IT Pyramids. It was possible to balance things out before 2008; after that it simply became impossible. Because government.

    Apr 05 2022

    Another part of the paradox is the reversal of supply/demand pipelines.

    The pandemic reversed supply/demand pipelines. Since 2008 the safest thing for a recruiter or government worker was to do nothing. That resulted in candidates' resumes sitting in limbo for years. In the past, when a government worker sat on a candidate's resume, it was a problem for the candidate, not for whoever sat on the resume. Only I think it no longer works like that. There is a shortage of skilled candidates who can do anything at all (and a surplus of government workers who try to sit and do nothing). Supply and demand have been reversed because of the pandemic, so the pressure is now on the government workers.

    Apr 08 2022

    Every company on US soil has the same problem. For a very long time now, anybody who is literate and can do anything at all with a computer has been sucked into some enterprise corporation as a "body". Inside that corporation they usually produce no value (because the production of IT value was moved outside the US a long time ago). As a result, the country's automation levels degrade (rapidly). This can be fixed by doing things the way they were done 15+ years ago.

    Automated nonsense destroys the country much faster than manual nonsense. I once worked with a person who had written a script to inject 1000 bugs into a production system. It was a problem for me that the person knew how to code. I had to do a rather complex rollback. True story. That person was not evil. They did not understand what they were doing. They made a mistake. Automation could be a solution, but it could also be a major problem. AI/ML is not a solution either.

    Apr 12 2022

    Most modern infrastructure is not really designed for agility. Change a static IP address, and everything is messed up. New architectures should assume nothing is static. That alone kills the entire ITIL vertical. McAfee was aware of that.

    Current internet infrastructure (inherited from the 1970s) is based on all-or-nothing (centralized) permissions. Blockchain *is* a different architecture. There will be battles.

    IPv6 was a useless distraction. It accomplished virtually nothing and only made things worse. I think that was the end of the old world. "Ratified as an Internet Standard on 14 July 2017".
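
    One small piece of "assume nothing is static" can be sketched in code: keep names in your configuration and ask the resolver every time, instead of writing an IP address down once and trusting it forever. This is only an illustration of that single habit, not a full architecture; the function names resolve_now and has_moved are made up for the example.

```python
import socket


def resolve_now(host, port):
    """Return the addresses the resolver gives for host right now, sorted."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def has_moved(host, port, last_seen):
    """True if the current addresses differ from the ones recorded earlier."""
    return resolve_now(host, port) != sorted(last_seen)
```

    Code that calls resolve_now on every connection attempt survives an address change without anybody editing a config file; code that stored the result at install time does not.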

    Aug 27 2022