A Primer on Source-Driven Development for AI/ML Data Scientists and Developers

Kyle Bowerman

August 23, 2019

In this blog post, we look at the evolution from the traditional Salesforce deployment model to the new Salesforce Developer eXperience (SFDX). The former relies on a golden copy of the production org, whereas the latter relies on a code repository as the source of truth. If you or your organization have not begun the transition to source-driven development, you should start planning soon.

This new paradigm has many advantages over the golden org copy: it enables continuous integration/continuous delivery, gives better change control, and aligns with industry best practices. Even though not all Einstein features are built into SFDX or the Metadata API yet, it is critical to start using source-driven development methodologies when building analytics or machine learning solutions in Salesforce. Analytics and ML are more iterative than most other types of development, which makes it imperative that the input and output of each iteration can be isolated and compared. It does not matter whether we are talking about source code, manual steps, or even sample data; the data science developer must be able to recreate any iteration exactly, with the expected results. This is science! Iterations are experiments!

A Timeline of Recent Discoveries in Software

In the interest of science, let’s take a step back and look at the discoveries in software over the last two decades that got us to where we are today.

1999

Salesforce is started by Marc Benioff

2005

Linus Torvalds created Git

What happened?
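Linus Torvalds, the creator of Linux, develops a new distributed version control system to manage the Linux kernel after the project loses free use of the proprietary BitKeeper tool. It is designed to avoid the limitations of centralized open source predecessors such as CVS and SVN.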

Why is it important?
Git will eventually win the version control wars over the other open source and proprietary version control systems. It will also become the foundation for branching-strategy best practices such as GitFlow.

2007

Heroku is created by James Lindenbaum, Adam Wiggins, and Orion Henry

What happened?
A pioneering PaaS company builds a platform for running web apps on AWS infrastructure, one that grows to support multiple open source languages (polyglot) and data stores.

Why is it important?
Heroku creates a command-line tool and framework that uses Git to deploy to the platform. This toolchain will accelerate the era of continuous deployment and continuous integration.

2008

GitHub is created

What happened?
Git as a service is now available for free to developers of open source frameworks. It will eventually become the marketplace for open source projects, and this will have the effect of evolving software into reusable building blocks maintained by humanity for the greater good.

Why is it important?
GitHub will eventually create an extensible fat-client framework called Electron, built on Chromium and Node.js so that desktop apps can be written in JavaScript, the predominant language of the web. The first implementation of this framework is an integrated development environment (IDE) called Atom that allows extensions to be built, following the Eclipse model. Since it is built on Node.js instead of Java, it can leverage a modern package and dependency manager, thus avoiding the problems that plagued the Eclipse development community.

2010

Salesforce Buys Heroku for $212M

What happened?
Salesforce.com realizes that Heroku represents untapped market share and that its platform and services are highly complementary to Salesforce’s own offerings. It also realizes that Heroku’s deployment framework and CLI could become the cornerstone for the future of cloud developer operations (DevOps).

Why is it important?
Salesforce realizes that Heroku is state of the art when it comes to custom software release management. Heroku can put any version of an app’s code and ecosystem in place within minutes, whereas a deployment in Salesforce can take hours. Salesforce makes the critical decision to let Heroku operate in relative autonomy and forge ahead in CI/CD, while learning from its successes.

2011

Adam Wiggins from Heroku presented the 12-factor app

What happened?
The founders of Heroku publish a manifesto on software development based on 12 basic principles, which remain best practices for modern cloud app development to this day.

Why is it important?
The first factor is that there is a single codebase, tracked in revision control, that can have multiple deployments. This was antithetical to the Salesforce.com deployment strategy. The third factor calls for separating config from code, which addresses the config drift that was a common problem on cloud platforms like Salesforce.

2011

Docker initial release

What happened?
Docker, whose first public release came in March 2013, became a lightweight alternative to virtual machines. Instead of storing every bit that represented the state of a computer’s hard drive and memory, it used a recipe system to build identical images every time.

Why is it important?
Using modern package managers, Docker can create and run “containers” from known images that can be started and stopped as needed with minimal overhead. Each container can have a single focus. This style of small, single-focus service, known as microservices, was becoming popular around the same time.

2011

Adrian Cockcroft from Netflix introduces Chaos Monkey to the world

What happened?
Netflix, which had built its entire streaming platform on Amazon virtual machines (EC2), realized that it was better to assume that something would go wrong and plan for it than to assume that everything would go right. It built a tool called Chaos Monkey that would randomly shut off its VMs to force its developers to build fault tolerance into their applications.

Why is it important?
This fail-safe approach shifted the emphasis from restoring a device to generating a new one that had the exact same state and data. The new machine would “Rise from the Ashes” of the old one.

2012

Martin Fowler Describes the PhoenixServer

What happened?
Martin Fowler, a co-author of the Agile Manifesto, is inspired by Chaos Monkey and articulates the importance of being able to build a server, including state and data, as fast as possible from scratch.

Why is it important?
The PhoenixServer philosophy applied not just to outages but also to testing and validation. It offered a viable mitigation for the configuration drift first discussed in the 12-factor app. The idea was that you would build your server from a recipe and load its state; if the result was not right, you would burn it down and fix the recipe until you got it right. All the preceding technologies led to this pivotal philosophy.

2015

Heroku introduces Pipelines and Review Apps

What happened?
Heroku introduces concurrent environments ranging from development to testing to production. This “pipeline” represents the stages of an app through various versions of a single codebase. In addition, Heroku creates the “Review App,” which is automatically created and deployed any time a developer opens a GitHub pull request (a request for a branch to be merged into master).

Why is it important?
The pipeline allowed developers and testers to see various versions of the app running at the same time. They can jump back and forth between the environments, compare the behavior, and formally test the release candidates. Unlike the formal stages of the pipeline, the Review App allows a developer to see a running version of their commit with zero deployment effort. They can use this to collaborate with other developers or users/testers. In the event that a critical piece of code or configuration is missing, the auto-deployment will fail, and the developer will receive the error logs. This is a foundational principle of continuous delivery.

2015

First Commit of Microsoft VS Code built on GitHub’s Electron Platform

What happened?
Microsoft builds a new code editor, Visual Studio Code, on GitHub’s open source Electron framework (Chromium and Node.js) instead of on the proprietary .NET stack behind its flagship IDE, Visual Studio.

Why is it important?
This may not seem like a huge step, but it’s like Exxon switching over their drilling platforms to run on solar. This move by Microsoft affirmed that open source software and techniques were advancing more rapidly than any individual company could compete with. Ironically, the guts of Atom (Node.js, which runs on the JavaScript engine from Google’s Chrome browser) supported nearly 6,000 user-contributed packages. Microsoft coveted this level of developer community participation, but they knew it only worked because it was open source and free.

2017

Salesforce SFDX becomes generally available

What happened?
Salesforce announces its new Developer eXperience (SFDX). SFDX revolves around a command-line interface (CLI), built on the Heroku tooling framework, that supports metadata, data migration, and a PhoenixServer feature called “Scratch Orgs.” Most importantly, Salesforce has created a Dev Hub that allows for package development and the orchestration of deployments to various environments.

Why is it important?
Salesforce has allowed Heroku to forge ahead with full autonomy, learning from its customers and the software industry in order to perfect state-of-the-art tools and techniques for modern software development and operations. It has deliberately created a succession path from Org-driven development to source-driven development.

2017

First release of SFDX extensions for VS Code

What happened?
Salesforce released a set of tools as an extension to VS Code that binds to the Salesforce CLI.

Why is it important?
These extensions allow for both Org-driven development (just like Eclipse) and the new source-driven development model. The sunset of support for the Eclipse-based Force.com IDE plugin is announced.

2018

Microsoft buys GitHub

What happened?
Realizing that GitHub has become the social app for developers and the marketplace for open source software, Microsoft buys GitHub.

Why is it important?
This acquisition solidifies Git as the rightful heir to the version control throne and signals that open source tooling frameworks like Electron and Heroku’s CLI framework (oclif) are here to stay. We don’t need to understand how they work, but we do need to understand how to get the most out of them.


If you are a Salesforce platform, custom app, or Einstein data science developer, you should start to consider your work product as “packages”

The manifest of these packages can contain code, config, declarative artifacts, documentation, and data. The concept of a “package” should not be new to Salesforce developers. However, the notion of what a package can contain should be expanded to include documentation of the repeatable steps that cannot be captured in the Metadata API. Documentation of measures and observations is most important to Einstein developers, who are still waiting for SFDX to catch up to their work area.

Before we look at how source-driven development can be leveraged for AI, ML, and analytics, let’s look at how it is being done in traditional Salesforce development

There are two types of Salesforce developers: those who come from a classical custom app dev background such as Java, NodeJS, C++, PHP, or .NET, and those who came from the business and cut their teeth coding and configuring in Salesforce. We will call the former “classic coders” and the latter “business coders.” Both types of developers bring something to the table. Business coders understand how to align the needs of the business with the most straightforward implementation, whereas classic coders understand the nuances of a repeatable process and accounting for change.

When classic coders are introduced to Salesforce’s “golden copy” methodology, it does not sit right with them. If it were as simple as making some changes in a sandbox, collecting those changes in a manifest, and then deploying them to production, that would be one thing, but that is not the case. In practice, you try something in the sandbox, then wonder why it didn’t work or think of a better way to do it. You neglect to revert the experimental changes and add only the final changes to the manifest, from memory no less! The resulting configuration drift causes major havoc between environments.
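To make the drift problem concrete, here is a minimal Python sketch. It assumes each environment can be summarized as a mapping from component name to some comparable fingerprint (a version label or hash). The function name `find_drift`, the component names, and the fingerprints are all hypothetical illustrations, not part of any Salesforce API.

```python
from typing import Any, Dict, List


def find_drift(baseline: Dict[str, Any],
               sandbox: Dict[str, Any],
               manifest: List[str]) -> List[str]:
    """Return components that differ between baseline and sandbox
    but were never listed in the deployment manifest."""
    changed = {
        name
        for name in baseline.keys() | sandbox.keys()
        if baseline.get(name) != sandbox.get(name)
    }
    return sorted(changed - set(manifest))


# Hypothetical snapshots: component name -> fingerprint.
baseline = {"Account.layout": "v1", "Lead.trigger": "v1", "Case.flow": "v1"}
sandbox = {"Account.layout": "v2", "Lead.trigger": "v1", "Case.flow": "v3",
           "Opportunity.field": "v1"}

# What the developer remembered to put in the manifest.
manifest = ["Account.layout", "Opportunity.field"]

print(find_drift(baseline, sandbox, manifest))  # ['Case.flow']
```

The forgotten experimental change to `Case.flow` is exactly the kind of edit that never reaches production through the manifest, which is how the environments drift apart.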

The idea that we build features in a sandbox, collate the code and artifacts, and then deploy to production is error-prone, and the result is neither repeatable nor reversible. The new source-driven development model, by contrast, not only matches the industry approach to source control but also sets us up for a good continuous integration/continuous deployment (CI/CD) approach and provides accountability for all changes.

With the introduction of the “scratch org,” Salesforce’s implementation of a PhoenixServer or Docker container, developers can practice over and over again to get the manifest right. Scratch orgs are transient (they last only seven days by default) and are designed to spin up quickly and be disposable. It is the recipe we care about, not the actual instantiation. As the scratch org infrastructure matures, we should be able to add licensed features like FSC, AppExchange packages, sample data, or even snapshots from production orgs or sandboxes. Until then, we must manually account for these steps.
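The “recipe, not the instance” idea can be sketched as a short Python script that lists the steps for building and burning down a scratch org. This is an illustration under assumptions: the `sfdx force:...` command names and flags are those of the Salesforce CLI around the time of writing and may differ in newer CLI versions, the alias `ml-experiment` is hypothetical, and `config/project-scratch-def.json` is only the conventional location of the scratch org definition file. By default the script prints the steps and runs nothing.

```python
import subprocess
from typing import List


def scratch_org_recipe(alias: str,
                       definition_file: str = "config/project-scratch-def.json",
                       days: int = 7) -> List[List[str]]:
    """Build the CLI steps that create, fill, test, and delete a scratch org."""
    return [
        # Create a disposable org from the definition file in the repository.
        ["sfdx", "force:org:create", "-f", definition_file,
         "-a", alias, "-d", str(days)],
        # Push the source tracked in Git into the new org.
        ["sfdx", "force:source:push", "-u", alias],
        # Run the Apex tests against the freshly built org.
        ["sfdx", "force:apex:test:run", "-u", alias, "--resultformat", "human"],
        # Burn the org down; the recipe in source control is what we keep.
        ["sfdx", "force:org:delete", "-u", alias, "--noprompt"],
    ]


def run_recipe(steps: List[List[str]], dry_run: bool = True) -> None:
    """Print each step, and execute it only when dry_run is False."""
    for step in steps:
        print(" ".join(step))
        if not dry_run:
            subprocess.run(step, check=True)


# Dry run: shows the recipe without touching any org.
run_recipe(scratch_org_recipe("ml-experiment"))
```

Any manual steps that the CLI cannot yet perform would sit alongside this script as documentation in the same repository, so the whole recipe stays in one place.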

This is no different for the AI developer; it is just that the list of manual steps is longer than the list of automated ones. In many ways, a good source control regime is even more relevant for the data scientist in Salesforce. You can draw a direct parallel between a model or lens iteration and a scientific experiment. The cause and effect might not be immediately apparent, and that is OK as long as you can replicate the experiment. Repeatable steps, observations, and even sample data are now fair game to be added to your source control commits.
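One way to treat an iteration as an experiment is to commit a small, diff-friendly record next to the code. The Python sketch below is one possible shape for such a record, not an established format: the field names, the iteration names, the sample data, and the observation values are all hypothetical.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def iteration_record(name: str,
                     sample_data: bytes,
                     manual_steps: List[str],
                     observations: Dict[str, float]) -> str:
    """Serialize one experiment iteration as stable JSON suitable for a commit."""
    record = {
        "iteration": name,
        # A hash proves two iterations ran against identical sample data.
        "sample_data_sha256": hashlib.sha256(sample_data).hexdigest(),
        "manual_steps": manual_steps,
        "observations": observations,
    }
    # Sorted keys keep the file byte-for-byte stable, so Git diffs stay small.
    return json.dumps(record, indent=2, sort_keys=True)


def changed_observations(before: str, after: str) -> Dict[str, Tuple]:
    """Compare the observations of two serialized iteration records."""
    old = json.loads(before)["observations"]
    new = json.loads(after)["observations"]
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(old.keys() | new.keys())
        if old.get(key) != new.get(key)
    }


# Hypothetical sample data and measurements for two iterations.
data = b"id,amount\n1,100\n2,250\n"
steps = ["Manual setup step that the Metadata API cannot capture"]

first = iteration_record("001-baseline", data, steps,
                         {"accuracy": 0.81, "rows": 2})
second = iteration_record("002-add-field", data, steps,
                          {"accuracy": 0.84, "rows": 2})

print(changed_observations(first, second))  # {'accuracy': (0.81, 0.84)}
```

Because the record carries no timestamp, rerunning the same iteration on the same data produces the same file, and the commit history supplies the when and the who.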

The source-driven development model is replacing the Org development model

The recipe for building the environment is more important than the environment itself. Open-source frameworks and Git are here to stay. GitHub, GitLab, and Bitbucket are all hosted “Git as a Service” providers that offer similar feature sets. You can’t go wrong with any of them; however, when you choose a CI/CD solution, it will be easier to integrate if it comes from the same provider as your Git hosting. Since most of these CI/CD solutions are Docker-based, spending a few days becoming proficient at Git commands and shell scripting will pay off many times over.

If you or your company have not begun to transition or improve upon your source-driven development model, you should start now. Contact us to find out how we can help.