Presented by Evadne Wu at Code BEAM Lite in Stockholm, Sweden on 12 May 2023
We have celebrated 10 years of Elixir and also nearly 25 years of Erlang since the open source release in December 1998.
Most of the libraries needed to make the ecosystem viable have been built, talks given, books written, conferences held and training sessions provided. A new generation of companies has been built on top of the Elixir / Erlang ecosystem. By all measures, we have achieved greater reach and maturity than 5 years ago.
In the session, Evadne will revisit the topic of Elixir adoption, and discuss the following topics, based on her personal ideas, beliefs, and experience:
- Migrating and managing Erlang & Elixir workloads
- Designing B2B / B2C systems that leverage the ecosystem’s capabilities
- Ensuring continued evolution of the team, the organisation and the ecosystem
When I first started in the ecosystem, I did not know how to program at all, but I had to deal with problems that seemed very complicated at the time, and which were beyond the capability of any tools I knew. After consulting many colleagues, I was recommended a book entitled Erlang in Anger, written by Fred Hebert. It was an enlightening experience, and I went on to read more books, learn Elixir, and eventually managed to create useful systems.
I have since built many systems and applications which, under the same constraints, would not have been possible without the Elixir ecosystem and of course the infrastructure of Erlang/OTP.
However, ever since reading the book, I have questioned whether it would be possible for people to experience and utilise this marvellous ecosystem without first having experienced and dealt with anger. I have observed that many contemporary participants in the ecosystem have experienced some sort of mid-life crisis as programmers, and have had to climb the ladder of abstraction to get to where they are now.
Within the time available today, I would like to discuss the following topics that are relevant to ecosystem adoption, based on my personal observations and opinions.
The most important aspect is Risk Management.
Systems building, like all other endeavours, can be viewed through the lens of risk management. At the highest level of abstraction, time and resources are consumed, and either the correct software is created on time, or it is not. Most software projects start in the shadow of uncertainty, not unlike a fog of war. Players start with no knowledge, and must learn the constraints and rules of the world.
Uncertainties that threaten the viability of a project (Risks, in other words) can come from incorrect foundational assumptions, a faulty business case, frequent changes, an unstable specification, incorrect architecture, misalignment between key co-dependent systems, insufficient performance, or other latent problems that are usually discovered over the course of implementation. As Tolstoy put it: all happy families are alike; each unhappy family is unhappy in its own way. The same applies to software projects.
The key to the successful delivery of any project is therefore the identification of the riskiest part of the architecture, and its subsequent management. It is better to deal with the most difficult problem first. However, when the problem space is ill-defined, and resolution cannot be ascertained until attempted, it is not possible to know in advance which problem will be the most difficult.
In some cases, the solution is quite clear and can be implemented straight away based on the description of the problem.
In other cases, you would not know what the solution looks like at all before attempting to solve the problem. This is essentially a wicked problem, which has appeared time and time again in the history of systems building. Many teams have tried many strategies to pretend that the chaotic nature of systems building could in fact be managed and controlled. At one extreme there is the waterfall process, where the specification is agreed in advance, the implementation period is long, and changes are both infrequent and costly. At another extreme there is seemingly no process at all, where the specification is formulated ad hoc or even retrospectively, the implementation period is as short as needed, and debugging is performed after deployment. At yet another extreme, there are bastardisations that make no technological sense at all.
Yet the common attribute of all teams that have shipped successfully is some kind of agreed process, on top of successful execution, which could not have existed without a shared and consistent view of key risks and opportunities. It is therefore my opinion that taking risks deliberately serves a team better than adopting the process of the day.
To deal with a wicked problem properly, it is advantageous to have as many attempts as possible, with the cycle between attempt and response as short as possible. The tighter the feedback cycle, the cheaper it is to attack the problem from a new angle when a solution does not work. The chance of success is therefore higher: with a lower cost per iteration, you can try more times and rely on the law of large numbers, while also applying good project management throughout to mitigate execution problems as they inevitably occur.
Logically, any toolchain that is amenable to rapid prototyping and short iteration cycles would be a good fit for wicked problems. However, this alone is not enough for a specific toolchain to prevail: the existing team or the entire organisation may have standardised on a particular ecosystem, or perhaps nobody has heard of the toolchain you have in mind, hence a great lack of appetite.
This brings us to two associated ideas that feed into the overall risk profile of Elixir promulgation. The first is switching costs, and the second is technological leverage. These two ideas, alongside the risk appetite of the adopting organisation, based on its perception of known and unknown risks, essentially decide whether adoption is possible ahead of time.
I would like to first discuss technological leverage, which in the context of this presentation is defined as the ability to manage a higher level of complexity with a lower level of energy expenditure. Systems building, like all other activities, can be seen as an industrial process, which can once more be abstracted into leverage on energy. High technology manufacturing, for example, is a highly leveraged process on energy input. This system works marvellously when the cost of energy and labour is low and the price of products is high.
Software systems are usually sold to address certain business practices, and the math works in the same way. Toolchains that provide a higher technological leverage, would exhibit general characteristics such as:
- Abundance of existing COTS or OSS solutions that address common issues
- Mature ecosystem, with long history of successful deployment
- An existing community of engineers familiar with the discipline and capable of applying it in practice
- Infrastructure to sustain growth and avoid stagnation
Specific to the organisation and the project team, the following characteristics should also exist:
- Ease of acquisition of engineers within the budget available
- Ability of existing team members to work effectively with the proposed technology
- Capability of team members to problem-solve
It is clear that the proficiency and comfort level of team members plays a critical role. After all, systems programming is more of a sociological problem than a technological or philosophical one, and must be solved at the correct level. If no team members are comfortable enough to write your systems in Elixir, and you are unable to hire people to do so, then it is unlikely that your systems can be built in Elixir.
Assuming that feasibility studies have been done, and the choice of Elixir is appropriate, its promulgation would therefore be a sociological problem, instead of a technical one. It is therefore critical to understand the interaction between stakeholders, their existing perceptions of risk, and the actual extent to which their minds may be changed based on correct information.
If you possess technical capability or capacity to hire the correct people, then you need only convince all relevant stakeholders of your technical choice. If you do not, then the amount of risk you are persuading them to undertake will be much higher.
In any case, it is likely that you will need to assemble an accurate business case. The first objective however, is to correctly gauge whether you’ve got a chance. The world is vast and filled with hills you may die on, but you can only die once.
Similar to trading in markets, where traders only open long positions if they believe the value will go up, organisations adopt new technology if it helps them improve their market position, which makes their investors’ long positions go up. From this basis, it is easy to surmise that the perceived level of leverage and the expected returns drive the decision to switch or not to switch. In certain cases, being first is in itself a great advantage, and toolchains that allow software to be built very quickly can be an easy sell. In other cases, it is the latent post-deployment characteristics, such as robustness and resilience, that matter more.
Similar to trading, once more, traders are usually required to cover their short positions once they can no longer carry the trade. This too has analogies in technological adoption. When the entire industry has suddenly changed direction, those who have made the wrong bets must move quickly, and suddenly the leverage offered by the correct technology would look more attractive. In such scenarios, the software is usually being built to make up lost progress, and speed matters more.
In general, players who have existing positions at the margin must react, while those who are clearly in the money can act unpredictably. Understanding the market position of the organisation and the aims of the executive suite, can be extremely helpful in building the business case.
I will also note that this is related to the Hype Cycle, where speculative premiums may be paid for implementation of new technology, which offers a good risk-to-reward ratio. In other words, when hype exists, the risk premium that is willingly paid may be much higher.
The other aspect of Risk Management is the classification of all risks and the acceptance of some of them. This usually starts at the inception of the project and may culminate in the creation of a Risk Register of some sort. Unacceptable risks beget additional requirements. For example, the database must be backed up regularly because a catastrophic loss of data is unacceptable; the servers must be redundantly provisioned because interruption of service would breach the SLA requirements; and so on. While good programming cannot save a system from bad infrastructure, a toolchain with good fault tolerance primitives and rapid error recovery capabilities helps resolve operational issues quicker, or even mitigates minor issues automatically, before they become incidents.
This is why technologies that have Resilience at their heart can be more competitive when you have to build systems quickly, or when there are severe penalties for failure. Resilience, within the context of this presentation and the philosophy of Erlang, is the capability to recover from errors while unattended. Take the following operational example:
The hardware running the database is faulty. The database automatically fails over to its replica. All connections must be re-established to the new primary (the promoted replica). Does the application continue processing transactions without interruption?
Take a second example:
The application runs out of memory because too many transactions are being processed at once. The problem happens once every day. However, the system gracefully recovers, and the team is able to resolve the issue calmly, without it escalating into an active incident.
The learned scholar will know very quickly that the fault tolerant nature of Erlang/OTP, which is summarily inherited by Elixir, can be a key driving factor for certain kinds of systems that require the ability to recover quickly from failures.
While communication between development and operations is key, one cannot talk their way out of a production incident. A resilient system is therefore a safe system. As everybody knows, the philosophy of “let it crash” does not mean “don’t fix the crash”; it buys you time. With the correct primitives, the number of operational incidents can be greatly reduced in scenarios where disparate systems must be integrated into a cohesive whole.
While the risk of the system failing in the field only materialises after the system has been built and deployed, a system built on resilient primitives has other benefits. For example, much less code is needed to handle deviations from the happy path, and custom logic used to handle intermittent errors can often be replaced with standard Supervision Trees.
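To make this concrete, here is a minimal sketch of the idea, assuming a hypothetical worker that polls an unreliable upstream service (`Upstream.fetch/0` and all module names are illustrative, not from the talk). Instead of wrapping the call in bespoke retry logic, the worker simply crashes on error and lets its supervisor restart it with a clean state:

```elixir
defmodule Importer.Worker do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts), do: {:ok, opts}

  @impl true
  def handle_info(:poll, state) do
    # No defensive rescue/retry code: if the upstream call fails, the
    # match fails, the process crashes, and the supervisor restarts it.
    {:ok, _result} = Upstream.fetch()
    {:noreply, state}
  end
end

defmodule Importer.Supervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    children = [
      {Importer.Worker, []}
    ]

    # Restart crashed workers independently; if crashes exceed 3 within
    # 5 seconds, the supervisor itself gives up and escalates, so a
    # persistent fault surfaces instead of looping silently.
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end
end
```

The design choice worth noting is that the error-handling policy (restart strategy, intensity, escalation) lives in the supervisor, declaratively, rather than being scattered through the worker’s business logic.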
The most important aspect, in my opinion, is Sustainability. Elixir is itself a community effort that builds on top of several core systems: the language, the Hex.pm infrastructure, and key libraries such as Ecto, Jason, and Phoenix with LiveView. Similar to other popular frameworks, development and maintenance of core Elixir building blocks is partly funded by benevolent commercial interests.
Due to the immense amount of functional leverage inside the ecosystem, a large team is not required to maintain a specific library. In fact, I have heard old-timers say that an Erlang library that has not changed for 10 years is probably good, because it did not need to change.
This is, however, related to one of my sustainability concerns, which also ties into the current legislative and security environment. While Open Source contributors are not “suppliers”, and we should probably resist being treated as unpaid labour, there are real risks of maintenance lapses in critical dependencies due to the unavailability of key individuals. I too remember the good old days when there was no malware on Apple computers, but those days are no longer with us. The same scenario may happen with any of our key dependencies. From an outside perspective, so much goodwill has been created during the past 5 years that it is now worth our while to maintain it.
I would also suggest ensuring that new and interesting use cases are well disseminated. I think we are currently doing a good job at this, and it is working well, so we should simply do more of the same. I suppose our work will be complete once folks in the community start writing undergraduate textbooks. As any useful software system eventually becomes part of the furniture, our cutting-edge programs of today can become the legacy systems of tomorrow. It is upon us to choose how to deal with the environment that we have built.
In most software systems, there can be islands of specialisation, and complex problems usually arise in interactions between subsystems. The orchestration philosophy imbued in Elixir & Erlang/OTP is uniquely suitable to act as the glue between all components, providing high cohesion and high robustness at low cost. The community has proven to be sustainable, with ample social proof of further adoption and conditions amenable to generational succession; there is also an abundance of case studies and other learning materials which would help engineers get up to speed quickly, independently.
Various Elixir-curious teams from 5 years ago have managed to build their own systems and share their stories. The risk profile at the time leaned heavily towards project management and resourcing, but those start-up risks were evidently overcome. With the increased amount of information now available, it is possible to price the risk of Elixir adoption more accurately, and the exercise has turned from an emotional and philosophical case into an economic one. The risk profile has shifted since then, and I would suggest that the focus could now turn to preservation of our built environment.
As my learned friend Lyle Mantooth (Island Usurper) said:
I overcame the resistance to using Elixir by being an indispensable cowboy.
I put to you that old cowboys never die, they simply fade away.
Thank you for sharing this, I think it's crucial to keep convincing the overall tech community about the good sides of Elixir.
And yeah, resilient design de-risks itself and that alone is a huge gain because it also reduces other forms of complexity.
Still, I think we have to keep pushing to debunk other myths, from “it's impossible to find talent” to the fear that “no one uses it”, which are common for any technology that is not the #1 most used.