Open Development of Scientific Software
Revolutionizing Research with Collaborative Software Infrastructure
Written by Max Schulz
The Revolution of Open Source
Image generated by Stable Diffusion: "A programmer resting on the shoulders of giants" (My prompt engineering clearly needs improvement)
Have you ever thought about how the technology we use every day is largely built on the work of volunteers? The open-source movement has revolutionized the technology industry by enabling individuals and companies to freely access, modify, and distribute source code. But it’s not just about saving a few dollars on software licenses. The open-source movement has fostered a sense of community and collaboration, which has led to some of the most widely used projects in the industry.
It all started in the early days of computing, when software was freely shared among academics and researchers. The free software movement took shape in the 1980s, and the term "open source" was coined in 1998, after which the movement gained momentum. Prominent figures from the early period include Richard Stallman, the founder of the Free Software Foundation and the GNU project (which also includes the popular GCC compiler). He advocated for a strict definition that allowed anyone to use and sell the code, but required developers who distribute modified versions to release their software under the same terms as well (he wrote the highly "restrictive" GPL).
It wasn’t until the late 1990s and early 2000s that open source really started to shake things up. The release of the Linux operating system and the Apache web server proved that open source could not only be viable but also highly successful. Major companies like IBM and Red Hat recognized its potential and began investing in and supporting open-source projects.
Open source is everywhere today. It’s hard to find a piece of technology that hasn’t been influenced by open source in some way. From mobile operating systems and cloud infrastructures to machine learning and data analytics — open source has become an integral part of the technology industry.
Adoption in Life Science Industries
In this article, I would like to shed light on open-source development from the perspective of a software and service provider in the life sciences industry. While there are many articles about the general adoption of open-source software (and books for evaluating projects for deployment) and relatively healthy market forecasts for the open-source services market (e.g., this report), there is not much content about actual contributions in vertical industries outside of software (e.g., biotechnology or manufacturing).
From my experience, the adoption of open-source principles in other industries still lags behind the "tech industry" (a term that has somehow come to mean only electronics and software technology). While securing funding is also difficult in the tech industry (good article by James Turner), there are large foundations such as the Apache Software Foundation or the Linux Foundation. The sheer number of people using, for example, a web server program makes sustainable funding through donations (e.g., via GitHub Sponsors or Open Collective) at least conceivable.
In niche areas like the use of open-source software in scientific work, the situation is rather poor. Although it is clear that recent breakthroughs would not have been possible without it (see this analysis from the Chan Zuckerberg Initiative), most public funding organizations do not recognize its importance.
"They fund 50 different groups developing 50 different algorithms, but they don’t pay for a single software engineer." - Anne Carpenter
Only recently have voices been raised to address the problem of funding software infrastructures for science. As described by Adam Siepel or Anna Nowogrodzki, researchers are generally required to program new tools as part of their research, without recognition or training. When the maintenance burden of an open-source project increases, especially once it becomes successful, scientists may have no choice but to abandon their efforts, as only "pure scientific work" is recognized and funded.
Fortunately, this is slowly changing, at least in the scientific field. Large private US foundations are setting up specific grants, such as the “Essential Open Source Software for Science” grant from the Chan Zuckerberg Initiative. In addition, the National Science Foundation (NSF) has created a fitting program called “Pathways to Enable Open-Source Ecosystems (POSE)” (first proposals in 2022).
In contrast, there are very few activities in the commercial biotechnology sector for funding and contributing to joint open-source efforts. This is due to two main factors: the lack of focus on software development and the lack of a culture of sharing.
Focus on Software Development: Traditionally, software and IT infrastructure have been viewed as pure cost centers rather than expertise that creates a competitive advantage. New companies in the field are slowly changing this dynamic with a new generation of "Techbio" startups that define software development as an explicit core activity.
Lack of a Culture of Sharing: Nowhere is intellectual property protection stronger than in the biotech sector, especially in drug development. Patents in the software industry are relatively worthless by comparison, whereas for pharmaceutical companies their value is enormous, and creating them could be seen as these companies' "raison d'être." There are pre-competitive efforts like the Pistoia Alliance. Still, I believe much more needs to be done to foster the sense that sharing tools and infrastructure benefits everyone.
The "mood" in the industry is slowly changing as younger companies like Colossal build software spin-offs like Form Bio and members of large pharmaceutical companies like Roche endorse their use of open-source tools like Arvados or Camunda.
Strategies for Success
A project is usually started by one or a small number of individual contributors. Examples from the biotech industry include MultiQC by Phil Ewels, PyLabRobot by Stefan Golas, or Poly by Timothy Stiles. The spark for a new project usually comes from a need that arises in different contexts:
- Context of scientific work (indirectly funded by a research grant)
- Context of a project for a client (possibly directly funded)
- Context of product development (usually funded by employer)
The initial work is usually inspiring, and not much thought has to be put into the long-term sustainability of the effort. Unfortunately, after a while, the “cute puppy” phase, as Jacob Thornton beautifully described in his talk, comes to an end, and the project needs to be properly maintained - even more so if it’s successful and an increasing number of users adopt it. You will want to be paid for your efforts at some point, especially if you need to hire more hands to meet demand.
After some years of starting and maintaining SiLA 2 (an open connectivity standard & tooling for scientific instruments & software) as well as observing the “open source in life sciences” space, I can only agree with Aaron Stannard that you can’t rely on donations for a sustainable continuation of your project, but need to have a commercial offering around it, either as an independent consultant or as a company. In his article, Aaron lays out the different funding models that I summarize here:
- Services: Training, consulting, and sales of support for users of your software. The Hyve is a good example of a consulting company dealing with open-source tools for biology.
- Open Core Offering: A free and open core offering with proprietary "enterprise" features for paying customers. Popular features include access control and audit readiness. GitLab, a software management platform (Git), is a good example.
- Licensing: Offering the software under a free license for non-commercial or open-source use and under a proprietary license for commercial use. A well-known example is the Qt framework for graphical user interfaces.
- Managed Services: Creating a managed service using open-source software that can be sold as a platform or Software as a Service (PaaS or SaaS). This model is often combined with certain proprietary features to simplify management. A good example from bioinformatics is Nextflow Tower.
- Reputation: Instead of selling something directly, a company or individual can leverage the prestige of contributing to a community project to attract customers or talent. This is currently the operating model for the company I work for, Wega, where the SiLA project led to new customers and hires (more on this below).
Many developers (myself included) would think that if a piece of code is critical to a company's success, the company would analyze its dependencies and ensure they are sustainably maintained. Unfortunately, it has been shown countless times that this idea could not be further from the truth [Footnote: The dynamics remind me of the Tragedy of the Commons]. The security toolkit OpenSSL is a typical example of this: Only after the disclosure of the famous "Heartbleed" vulnerability (whose costs were estimated at 500 million USD) did donations increase from $2000 per year to $9000, which is obviously far from enough to feed even one developer, let alone provide the resources that such a project would need (Source). As always, XKCD perfectly illustrates the situation:
The brittle modern infrastructure funding (XKCD)
Recently, a security vulnerability in the popular Java logging library Log4j caused a global uproar, as it affected nearly every system while the library had a maintainer with only three GitHub sponsors. As a result, Filippo Valsorda wrote about the need to pay maintainers directly and contractually for their work (Article), and I agree with him. Yet, I fear that many procurement departments won’t understand this case for some time.
There are counterexamples where communities around projects have emerged, which became so deeply embedded in the value chain of the companies using them that they received sustainable funding, but that is usually a long and lonely road. Aside from the famous Linux and Apache projects, Jupyter seems to be doing fairly well, with significant donations from Microsoft for its predecessor IPython and later from public organizations. Another interesting example is the Robot Operating System (ROS), which started as a research project, then became part of a private company, and is now part of the Open Source Robotics Foundation, which receives significant funding from companies like Amazon, Bosch, and Nvidia.
Open Source and Standardization
Many industries have benefited from the standardization of formats so that the same solution can be applied in different contexts. Often-cited examples are shipping containers, USB, or in the lab, the SBS specification for microplates. These stories motivate the creation of new standards in unexplored areas, often leading to several competing definitions at the start. This situation is often mocked by another XKCD comic:
Another XKCD
When I thought about what a standard is, I realized that what we commonly refer to as a standard is a versioned document approved by an "independent" committee such as ISO. In contrast, I would argue that a standard is really just the set of definitions used most frequently in a particular context or use case. Examples include the Amazon S3 API used by competing services, Docker image descriptions that are compatible with other container services (e.g., Singularity), or the already mentioned ROS and its messaging interfaces.
The "winning standards" are simply those that are used the most - and although it would be nicer for a user to always have a definition that applies to everyone, in reality at least a handful of definitions are needed to ensure healthy competition, which provides incentives for continuous improvement.
Given this insight, any software could be considered a standard once sufficient adoption is achieved. To ensure the ongoing accessibility and maintenance of these standard infrastructures, an independent, non-profit organization is typically created, which is funded by membership fees and donations. Examples from the life sciences industry include Open Microscopy Environment, SiLA, or LADS.
Many standardization efforts fail for various reasons. Some of these can be mitigated with an open culture and open-source software as a backbone. Some rules in this regard:
- Membership is for decision-making, not access: Some foundations focus too much on creating incentives for membership by granting access to definitions and source files only to (paying) members. This drastically hinders adoption.
- Implementation code from the start: Even before a first definition is published, there must be (at least) freely accessible (better open-source) applications that use the definitions in industry-relevant scenarios to test the concepts.
- Building a community: It must be easy for interested parties to see the latest discussions and join them – easy in the sense of a few clicks, without interviews or sign-up forms! Platforms such as in-person events and online forums must be created for regular exchange among members.
Such rules should be established as early as possible because profit-oriented companies (without open-source principles) will inevitably try to gain an advantage from an upcoming standard by excluding newcomers.
Benefits of Open Development
I’ve spoken about open source as if it’s a given that it’s beneficial, and I’ve assumed that the impacts of previous projects speak for themselves. However, closed development certainly has its advantages, such as stricter control [Footnote: The somewhat looser development model of open source is well described in the groundbreaking article “The Cathedral and the Bazaar”] and easier monetization. So what are the concrete benefits of opening up part of your intellectual property?
As for statistics, many articles, including some from leading consulting firms such as this one from McKinsey, show that companies using open source are more innovative. Specifically, I think these are the key benefits:
- Community Contributions: Users of your software will bring a variety of features and extensions that a single company cannot compete with. A great example is the difference between contributions to the open-source Stable Diffusion and the closed alternative DALL-E – as shown by Yannic Kilcher.
- Better Feedback: When users can analyze the code themselves and tinker with its inner workings, you benefit from deeper and faster feedback on quality. This can lead to dramatic improvements, particularly in security aspects. An alternative to open-sourcing key components is to offer at least open access (as with the recent phenomenon ChatGPT; see also the article on user-centered innovation by Eric von Hippel).
- Reputation: As already mentioned as a funding mechanism, contributing to open-source projects can help maintain or create a reputation as a thought leader in a specific field. For Wega, for example, its SiLA contributions strengthened its image as an expert in lab digitization. Additionally, such contributions make a company much more attractive to aspiring engineers as an employer.
Many business leaders fear making anything open-source because they think it just means giving away their work for free. This mindset ignores the most valuable asset you retain – the expertise. Of course, this capital can also be stolen (i.e., "talent poaching"), but when it comes to talent, you must live with this risk anyway.
However, the most important aspect is well summarized in this article: "When we share our resources, work, and know-how in open-source, everyone benefits. But the companies that make the most of it are those that actively engage in open-source projects."
It is best to embark on an open-source journey without a direct profit in mind, but it’s good to be aware that this will also benefit your company’s bottom line in the long run. I encourage you to think about where open-source initiatives could make sense in your company.