Datacenters

"Some notes on Automation Paradox"

Datacenters are unpredictable

Here is why

Step 1. We have machines (computers, infrastructure etc) with reliability P0, and people with reliability P1

Step 2. We increase P0 (we have to react to every emergency by fixing things, so P0 improves)

Step 3. In time P0 grows to such a high level that it seems obvious P1 can now be decreased safely

Step 4. So we decrease P1 (in simple terms we can hire idiots, because machines will work no matter who is around)

Step 5. We can now push P1 down indefinitely (just like in Step 2 we were pushing P0 up indefinitely)

Step 6. Equilibrium. It's some combination of P0 and P1 that was impossible to predict. You might try.

Step 7. Now you can fire all the management - 100% of them. And put in new management: back to Step 1. It's a loop.

Not only datacenters of course. Any combination of humans and machines is like that.
Or even any combination of reliable and unreliable stuff.
Taleb describes those two matters in "Fooled By Randomness", but he does not elaborate on blends or equilibrium.

Feb 25 2021

Important Note 1

The equilibrium of Step 6 does not mean the Datacenter is reliable. Also the funding (amount of money you pour into the whole thing) is not the issue. Read about Chernobyl. Every Nuclear Reactor is a Datacenter on steroids.

Apr 07 2022

Important Note 2

Let's split computers into 2 groups. Let's call one group 'Enterprise Computers' and the other group 'Consumer Electronics'. Ironically, the price/utility ratio between those two groups fluctuates wildly over time. Because with monopolies comes greed, there are moments (the beginning of the Internet for example) when building anything at all using enterprise offerings was ridiculously expensive, so it was economically viable to compose an enterprise offering from consumer electronics parts.

The crypto bubble is another example (mining crypto on GPUs is a classic case of re-using consumer electronics to provide an enterprise offering). In an ideal world we would have an adequate balance of price/reliability for computers. We live in a real world. In Europe people kill each other with blends of 3D printed parts and consumer electronics they order online (drones are the simplest example). War is an extreme example of course, but the trend is kind of universal.

Because of various regulations / conventions, things that are basically consumer electronics (can be grossly unreliable, but come cheap) are used in place of enterprise components, because "what you gonna do".

Windows PC is consumer electronics. Has been for 20+ years.

Enterprise Computers are driven by Moore's law, more or less. Consumer Electronics is driven by Donskoy's law. Those are *very* different laws. Collisions are inevitable. And they do happen. Every day.

July 28 2022

There is also Huang's Law. More about it here

Aug 10 2022

So you're thinking about running a cluster of remote machines

I used to run 10000 servers. It's not that hard, really, you just need to install sysdig, or a poor man's version of it. The biggest problem is (of course) Windows. There are still tools to deal with that, like the one FireEye is using, but you need to take it from (open) source - it's European, actually. sysdig is made in the USA. Also, there are various tools at the level of hardware support, such as CIMC, that can make one's life easier when dealing with servers you can't even touch. Not being able to hit the reset button on a server is a real problem.

I think you're not really interested in building your own cloud. I think you're using an existing cloud provider. In that case usually the only thing you need would be sysdig or its poor man's version. Things like ThousandEyes might help. For 10+ servers you can just install poor man's versions of those. Only sysdig is impossible to beat, so if you are planning to do anything serious, like trying not to get hacked and stuff like that, you have to install sysdig. And find somebody who can set it up. You will not find those people. Not many of us are operational. Especially after the last 4 years.

There are a lot of challenges when you deal with a distributed setup. Even when you just piggyback on somebody else's cloud, every time somebody deploys something on the servers things will break. When things break, somebody needs to understand what broke and why. I can handle this (because I know which tools to install and how to use them). Most of the people hired in the last 12 years try to fix it all manually. So nothing works and people just yell at each other. Very sad picture. And of course there are hardware failures. Even switches (networking) can fail. Everything can fail. Most of the time people in the DC can't detect a hardware failure on their own, you need to hold their hand. (See 'Datacenters are unpredictable' part above)

Recently there is a new set of challenges, related to electricity shortages. The village where I live was affected. Pragmatically, this means that if you have your cluster of remote servers, you have to keep at least 2 locations (so when one goes down, you are still operational). There are not many people on this planet who know how to run that kind of setup. Recently Facebook went down, I don't know if you noticed, but it was a big deal.
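A back-of-the-envelope sketch of why the second location matters. It assumes the two locations fail independently (optimistic: a regional electricity shortage can easily hit both) and uses a made-up 99% availability per location. The function names and numbers are illustrative only, not measurements.

```python
HOURS_PER_YEAR = 365 * 24


def combined_availability(site_availability, sites):
    """Probability that at least one of `sites` independent locations is up."""
    return 1.0 - (1.0 - site_availability) ** sites


def downtime_hours_per_year(availability):
    """Expected hours per year with no location up."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # 0.99 is a hypothetical per-location availability, not a measured one.
    for n in (1, 2):
        a = combined_availability(0.99, n)
        print(f"{n} location(s): availability {a:.4f}, "
              f"about {downtime_hours_per_year(a):.2f} hours down per year")
```

Under those assumptions one location is down roughly 87.6 hours a year, and two locations are both down for under an hour. Any shared failure cause (same grid, same provider, same deploy script) makes the real number worse.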

Bottom line is - you think your project is little, but if you want your 10+ servers to actually work 24/7, the project is not little at all, and there are no people who know how to make that happen. I know. If you want to hire me - act fast, the idiots are now destroying things everywhere again.

Oct 15 2021

Automation Paradox

The 'Unpredictable Loop' is a side case of the Automation (Quality) Paradox. Some day I should find time to write up the Automation Paradox in detail. It's complex but the implications are huge. It's about distribution and quality and cost (im)balances. Interwoven.

Part of it is the fact that the Guerilla Group Coding described in "The Mythical Man-Month" (1975) remains the only pragmatic set of engineering guidelines produced on US soil to this day. Not only that, it actually directly contradicts the Asian IT Pyramids. It was possible to balance things out before 2008, after that it simply became impossible. Because government.

Apr 05 2022

Another part of the paradox is the reversal of supply/demand pipelines.

The pandemic reversed supply/demand pipelines. Since 2008 the safest thing for a recruiter/government worker was to do nothing. That resulted in candidates' resumes sitting in limbo for years. So in the past, when a government worker sat on candidates' resumes, it was a problem for the candidate, not for those who sat on the resume. Only I think it no longer works like that. There is a shortage of skilled candidates who can do anything at all (and there is a surplus of government workers who try to sit and do nothing). Supply / demand has been reversed because of the pandemic, so the pressure is now on government workers.

Apr 08 2022

Every company on US soil has the same problem. For a very long time now, anybody who is literate and can do anything at all with a computer is sucked into some enterprise corporation as a "body". Inside that corporation they're usually producing no value (because production of IT value was moved outside the US a long time ago). As a result, automation levels of the country degrade (rapidly). This can be fixed by doing things the way things were done 15+ years ago.

Automated nonsense destroys the country much faster than manual nonsense. I once worked with a person who had written a script to inject 1000 bugs into a production system. It was a problem for me that this person knew how to code. I had to do a rather complex rollback. True story. That person was not evil. They did not understand what they were doing. They made a mistake. Automation could be a solution, but it could also be a major problem. AI/ML is not a solution either.

Apr 12 2022

Most modern infrastructure is not really designed for agility. Change a static IP address and everything is messed up. New architectures should assume nothing is static. That thing alone kills the entire ITIL vertical. McAfee was aware of that.

Current internet infrastructure (inherited from the 1970s) is based on all-or-nothing (centralized) permissions. Blockchain *is* a different architecture. There will be battles.
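One small, concrete reading of "assume nothing is static": never bake an IP address into code or config, look the name up every time you connect. A minimal sketch using only the Python standard library; the helper names `resolve` and `connect` are mine, and this covers only the client side of the problem.

```python
import socket


def resolve(host, port):
    """Look up the current addresses for `host` at call time."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string. dict.fromkeys drops duplicates
    # while keeping the resolver's order.
    return list(dict.fromkeys(info[4][0] for info in infos))


def connect(host, port, timeout=3.0):
    """Open a TCP connection by name, so a re-addressed server still works."""
    # create_connection resolves the name itself and tries each address in turn.
    return socket.create_connection((host, port), timeout=timeout)


if __name__ == "__main__":
    print(resolve("localhost", 80))
```

The same idea applies to config files, firewall rules and monitoring: anything that stores an address instead of a name is a thing that breaks on the next move.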

IPv6 was a useless distraction. Accomplished virtually nothing, only made things worse. I think that was the end of the old world. "Ratified as an Internet Standard on 14 July 2017".

Aug 27 2022

Turns out firmware is not immune to Donskoy's law. Back in the day firmware could not possibly "not work" or "have bugs". There were some unusual moments in things like POS, but those were in a way not bugs in firmware, more like "bugs in closed source OS". The first instinct is to declare "that's because von Neumann is obsolete!" (every generation is trying to trash von Neumann, I've seen (too) many PhDs trying to do just that). Only after one understands the fractal nature of von Neumann does it become clear that von Neumann still holds.

Dec 02 2022

So it's Huang vs von Neumann, in a way. It is clear Huang simply does not know about some HW stuff they used to make here. He is my age, but I only know because I got lucky working with one guy from the past generation. The wall between enterprise and consumer is (still) going strong.

Dec 09 2022

Windows was (and is) a trade deal on Asian HW. If not for Jobs's brilliance, we would still be paying infinite money renting CDs from MS, instead of watching LOL cat memes dancing on TikTok. And that might have been a better world. Ironically.

Dec 25 2022

People (still) don't understand the fragility of IT. In the beginning of the Internet, for example, MS firmly decided not to include a single RMI.jar file with MSIE. If they had allowed it, Java would have been running everywhere over RMI. As a side effect, the HTTP protocol (and all the companies around it ... such as ... Cisco for example) would *not* exist the way they do now. Can you imagine that? The entire Internet thing was based on a single jar file (not) included with MSIE. That's the level of fragility of all this. It is still like that, of course. This page has more.

May 06 2023

Nothing is designed for spot instances, yet that's what is needed.
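What "designed for spot instances" could mean in the smallest possible form: the work is cut into steps, progress is saved after every step, and a killed node just resumes from the last checkpoint on the next node. A minimal sketch, assuming the provider (or your own agent) delivers SIGTERM before reclaiming the machine. The file layout, the function names and the `max_items` knob (used here only to simulate a preemption) are all hypothetical.

```python
import json
import os
import signal
import tempfile

_stop_requested = False


def _on_sigterm(signum, frame):
    """Only raise a flag; the work loop decides where it is safe to stop."""
    global _stop_requested
    _stop_requested = True


def install_handler():
    """Call once at startup, from the main thread."""
    signal.signal(signal.SIGTERM, _on_sigterm)


def load_checkpoint(path):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_item": 0, "total": 0}


def save_checkpoint(path, state):
    # Write to a temp file and rename, so a kill in the middle of the write
    # never leaves a half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(items, path, max_items=None):
    """Sum `items`, resuming from the checkpoint at `path` if one exists."""
    state = load_checkpoint(path)
    done = 0
    while state["next_item"] < len(items):
        if _stop_requested or (max_items is not None and done >= max_items):
            break  # preempted: the checkpoint already holds all finished work
        state["total"] += items[state["next_item"]]
        state["next_item"] += 1
        save_checkpoint(path, state)
        done += 1
    return state


if __name__ == "__main__":
    install_handler()
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    print(run([1, 2, 3, 4], ckpt, max_items=2))  # simulated preemption
    print(run([1, 2, 3, 4], ckpt))               # "new node" resumes
```

The checkpoint here lives on local disk to keep the sketch self-contained. On a real spot node it has to live somewhere that survives the node.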

I did manage to slash the cloud costs by 2x and I will decrease them further. Gas is no longer a commodity, I think the same will happen with Traffic / CPUs. Eventually. They kind of have to oversubscribe. This will be some interesting Internet, because everything AI is *not* free by design.

It's interesting how some fake things try to sound like smart things yet they are not. Kubernetes for example sounds like Cybernetics, yet it violates the very key principles of Cybernetics. It also violates the key principles of business machines. Some things in the US became the opposite of what they used to be.

100 y/o monopolies are beginning to crack in the cloud. (3 feeds providers and only one is of use)

Jul 18 2023

Storage capacity / CPU capacity are very easy to monitor. Networking is a bit more complex (not much, but still). Because of that, what happens now in the clouds is that there are idling nodes (with plenty of storage capacity and CPU), but because the network layer is messed up, the nodes are suffocated and become useless. So *clearly* the datacenters' *network* layer is now oversubscribed. I think it only happens because, with all that AI stuff, piles of (useless) data are being dumped onto the network to be transported to the cloud. In the cloud you have AI chips crunching all that data with tensorflow garbage, and because the tensorflow garbage has been boosted by NVIDIA, the CPU layer says "everything is fine". But it is *only* fine for the CPU layer! The network layer simply can not keep up! There is no other explanation for what I am observing in the clouds now.

I think that's because of the impedance mismatch between the CPU layer (boosted by AI chips) and the non-boosted network layer. Physical limitations do exist. Von Neumann all the way. That's why all the hoopla around quantum networks. I actually experienced the exact same problem when I upgraded my DC to Flash Drives : https://pault.com/p/Hadoop . Once upon a time I had the DC running.
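The "idle but suffocated" node described above is easy to miss when only CPU and storage are on the dashboard. A minimal sketch of a check that looks at all three at once. The `NodeSample` fields, the thresholds and the sample numbers are all made up for illustration; where the utilization figures come from (sysdig, SNMP, whatever you run) is left out.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    name: str
    cpu_util: float   # 0..1, fraction of CPU in use
    disk_util: float  # 0..1, fraction of storage in use
    net_util: float   # 0..1, fraction of link capacity in use


def suffocated_nodes(samples, idle_below=0.30, net_above=0.90):
    """Nodes that look idle on CPU and storage but have a saturated link."""
    return [
        s.name
        for s in samples
        if s.cpu_util < idle_below
        and s.disk_util < idle_below
        and s.net_util > net_above
    ]


if __name__ == "__main__":
    samples = [
        NodeSample("node-a", cpu_util=0.85, disk_util=0.60, net_util=0.40),
        NodeSample("node-b", cpu_util=0.10, disk_util=0.20, net_util=0.97),
        NodeSample("node-c", cpu_util=0.15, disk_util=0.25, net_util=0.30),
    ]
    print(suffocated_nodes(samples))
```

A CPU-only view would call node-b and node-c equally healthy. Only the network column tells them apart.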

Aug 30 2023

Put it all together. Figured out everything end to end. Well, Joel, this will be fun.

Aug 31 2023

Nash's equilibrium is at the bottom of (too) many things. Not sustainable.

Sep 02 2023

Figured out a very simple (yet bulletproof) actionable angle on AI. Yes, that billion dollar industry. (Sep 21 2024. Took 9 months! to go from this idea to a blueprint on a napkin. This stuff is *hard*. And it is not a billion dollar industry. It's a (seven) trillion dollar industry now).

Dec 26 2023

Figured out a simple pragmatic use of blockchain in the DC, yet nobody is doing it. Supply chain of course. Currently manual (means bugs etc). There was one company that made a baby step 10+ years ago, they got acquired in a snap and disappeared. Creative class likes (and protects) their caviar.

Dec 28 2023

Somebody is messing up with (6) on nuclear reactors. That means they *have* to be outside the zones. This becomes a trivial geometric puzzle. A high school kid can solve it. That explains the stress on geometry in one of the EU war zones. 30 years too late, but better late than never.

Jan 14 2024

Figured out exactly what happened to systemd.

Jan 22 2024

Take a nuclear reactor. Inside it there is a ridiculously complex process. Via a set of automated processes it's reduced to a set of binary 'on/off' indicators on the operator's dashboard. Way to go. Disasters are guaranteed, looks like. Compression of complex matter (every dashboard is that) is *guaranteed* to lose (AKA 'simplify') the information. Automation is simplification and automation is everywhere.
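A toy illustration of that loss. Two very different situations collapse into the same green light once a reading is squeezed into an on/off indicator. The temperatures, the limit and the function names are invented for the example and have nothing to do with real reactor values.

```python
def dashboard_light(reading_c, limit_c=350.0):
    """Collapse a reading into the on/off indicator: True means 'OK'."""
    return reading_c < limit_c


def trend_per_step(readings):
    """Average change between consecutive readings (what the light throws away)."""
    return (readings[-1] - readings[0]) / (len(readings) - 1)


if __name__ == "__main__":
    stable = [300.0, 300.5, 300.0, 299.5]
    climbing = [310.0, 323.0, 336.0, 349.0]
    for name, series in (("stable", stable), ("climbing", climbing)):
        print(name,
              "light OK:", dashboard_light(series[-1]),
              "trend per step:", trend_per_step(series))
```

Both series show "OK" on the dashboard. Only the second one is one step away from the limit, and the operator looking at the light can not know that.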

Jan 27 2024

The situation in the clouds is a bit ... nonlinear. I am testing various providers and it's really really something. The explanation is simple. Once you find something reasonable, you just stay with them because migration of stuff between clouds is kind of hard. Usually. So there is no incentive to look around, basically. Once you start looking around ... To start looking around you need to have my stack implemented. And you don't have it.

Jan 29 2024

Clearly "cloud server for $5" means nothing, because very different things are all called the same. Clearly, the networking layer is suffocated now, but it's suffocated differently (the method depends on the DC). I could put together some valuable technology for this vertical, like I already did decades ago, but it looks like the stuff I did decades ago is now "illegal" or something. Being literate is basically illegal.

Jan 30 2024

How can one even begin to do things in the cloud *not* having my benchmarks? I think they're all gone by this time. In some way it might make some things easy.

Jan 31 2024

Good interview question now. "There were several (trivial, technical) ways to prevent the CrowdStrike disaster, which is being called the biggest IT disaster ever. Name at least one." Yes, I know the answer. Not some rocket science. Trivial stuff.

And if the technical way is no good, there was also a 100% legislative way to prevent this. NY style.

See also the postmortem video by MS dude Dave.

Jul 19 2024