Comparing managed machine learning platforms
Dataiku vs. Alteryx vs. Sagemaker vs. Datarobot vs. Databricks
From: https://towardsdatascience.com/dataiku-vs-alteryx-vs-sagemaker-vs-datarobot-vs-databricks-b3870bd34813
Markus Schmitt, Founder at Data Revenue (www.datarevenue.com)
Oct 16, 2020
Alteryx and Databricks are in the lead and are still gaining popularity.
What is a managed machine learning platform?
Code is only a small component of any machine learning solution. Usually companies have to use different tools and services to manage a machine learning solution end-to-end, including:
- Compute services to wrangle data and train machine learning models;
- Data management tools to clean, modify, track, and secure data;
- Software engineering tools to write and maintain code;
- Dashboarding tools to interact with the solution and view results.
The four core components of a managed machine learning service.
The goal of managed machine learning services is to centralize these components into a single packaged solution.
But not all managed machine learning services are fully comparable. Tools like AWS Sagemaker help you manage the complexity inherent in any machine learning solution, but still expect you to have engineers on your team who can build and understand the code. These tools focus more on the compute layer. Tools like Alteryx focus more on the presentation layer, and they try to hide the complexity, providing no-code user interfaces to integrate basic machine learning.
More generally, these platforms often incorporate the dashboarding tools and/or workflow orchestration tools that we’ve compared in previous articles. So tools like Alteryx can be thought of as a higher level of abstraction, enabling more unification at the cost of flexibility compared to using the lower-level tools directly.
We’ve compared the most popular managed platforms to help you make an informed choice about which one is best for you.
Just tell me which one to use
As always, “it depends” — but if you’re looking for a quick answer, you should probably use:
- Dataiku if you don’t already have your own set of tools for development, orchestration, and machine learning, and you want a predefined all-in-one solution. Your team needs to have some technical knowledge, but it doesn’t have to be primarily made up of software engineers.
- Alteryx if you’re focused on marketing and analytics and you want some access to machine learning and data management without writing code.
- Knime if you want a less expensive, less polished, but more flexible version of Alteryx.
- Sagemaker if your team has engineering knowledge but wants a higher level of abstraction over your machine learning infrastructure.
- Datarobot if you have data stored in spreadsheets and want the simplest (but least flexible) way to run predictive analytics.
- Databricks if you’re already invested in Apache Spark as a platform and are looking for a simpler way to run it.
Quick overview
Before we get into a detailed comparison, here’s a quick overview of each platform.
- Dataiku is a cross-platform desktop application that includes a broad range of tools, such as notebooks (similar to Jupyter Notebook), workflow management (similar to Apache Airflow), and automated machine learning. In general, Dataiku aims to replace many of your existing tools rather than to integrate with them.
- Alteryx is an analytics-focused platform that’s more comparable with dashboarding solutions like Tableau, but includes integrated machine learning components. It focuses on providing no-code alternatives for machine learning, advanced analytics, and other tasks that usually require code.
- Knime is similar to Alteryx, but it has an open-source self-hosted option and its paid version is cheaper. It includes machine learning components and analytics integrations with a modular design.
- Datarobot focuses on automated machine learning. You upload data in a spreadsheet-like format, and it automatically finds a good model and parameters to predict a specific column.
- Databricks is primarily a managed Apache Spark environment that also includes integrations with tools like MLFlow for experiment tracking and model lifecycle management.
- Sagemaker focuses on abstracting away the infrastructure needed to train and serve models, but now also includes Autopilot (similar to Datarobot) and Sagemaker Studio (similar to Dataiku).
We’ve given approximate grades to each platform based on several criteria:
- Maturity: how long it’s been around and how stable it is.
- Popularity: how many people search for the tool on Google.
- Breadth: whether the tool has a specific focus or tries to do it all.
These are not rigorous or scientific benchmarks, but they’re intended to give you a quick overview of how the tools overlap and how they differ. For more details, see the head-to-head comparisons below.
Dataiku vs. Alteryx
Dataiku and Alteryx are both managed machine learning platforms, but Dataiku focuses on the engineering aspects, while Alteryx focuses on analytics and presentation.
Dataiku provides Data Science Studio (DSS), a cross-platform desktop application that includes a notebook (similar to Jupyter Notebook) for engineers to write code and a workflow orchestration tool (similar to Apache Airflow) to manage data and tasks. While it provides some user interfaces, there’s still an emphasis on writing code. By contrast, Alteryx provides a better dashboarding experience but less flexibility: in Alteryx, you use the UI to create no-code machine learning components.
- Use Dataiku if your team is technical and you want your data scientists, engineers, and analysts to all use the same tool.
- Use Alteryx if your team is less technical and you want to do advanced analytics using prebuilt components.
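For readers unfamiliar with the term, the "workflow orchestration" mentioned above boils down to running tasks in an order that respects their dependencies. The sketch below is only an illustration of that idea using Python's standard library; it is not how Dataiku or Apache Airflow are implemented, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "load_data": set(),
    "clean_data": {"load_data"},
    "train_model": {"clean_data"},
    "build_report": {"clean_data"},
    "publish": {"train_model", "build_report"},
}


def run_pipeline(dependencies, actions):
    """Run every task after all of its dependencies; return the execution order."""
    order = []
    for task in TopologicalSorter(dependencies).static_order():
        actions[task]()  # a real orchestrator would also retry, schedule, and log here
        order.append(task)
    return order


# Stand-in actions that only record that they ran.
log = []
ACTIONS = {name: (lambda n=name: log.append(f"ran {n}")) for name in PIPELINE}

order = run_pipeline(PIPELINE, ACTIONS)
print(order)
```

Real orchestrators add scheduling, retries, monitoring, and distributed execution on top of this dependency ordering, which is a large part of what a managed platform packages for you.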
Dataiku vs. Databricks
Both Dataiku and Databricks aim to allow data scientists, engineers, and analysts to use a unified platform, but Dataiku relies on its own custom software, while Databricks integrates existing tools. Databricks acts as the glue between Apache Spark, AWS or Azure, and MLFlow, and provides a centralized interface to connect these.
Dataiku is a higher-level tool, with integrations for machine learning libraries like Tensorflow and an AutoML interface that can do machine learning on data in a spreadsheet format.
- Use Dataiku if you’re comfortable managing your own infrastructure but want a platform to manage your machine learning pipelines and analytics.
- Use Databricks if you want a platform that manages your infrastructure for you and you’re comfortable with Apache Spark.
Dataiku vs. Datarobot
Datarobot and Dataiku both provide AutoML: a no-code machine learning platform where you can upload your data as spreadsheets, choose a target variable, and have the platform choose and optimize a machine learning model for you.
It’s important to note that this is Datarobot’s core focus, but it’s only one component of Dataiku, which also offers a full suite of data science tooling, including an IDE, a task orchestrator, and visualization tools.
- Use Datarobot if you have existing clean datasets and want to use predefined machine learning models to analyze your data, with no engineering skills required.
- Use Dataiku if you need something more flexible to help you design and build your own custom machine learning models.
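The AutoML workflow described above (tabular data in, a target column chosen, a model selected automatically) can be sketched in a few lines. This is a deliberately tiny illustration and not how Datarobot or Dataiku work internally: the dataset is synthetic, the three candidate models are toy stand-ins, only a single feature column is used, and all function names are hypothetical.

```python
import random
import statistics


def make_rows(n=60, seed=0):
    """Build a synthetic 'spreadsheet': a list of rows with two columns."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        size = rng.uniform(20, 200)
        price = 3.0 * size + 50 + rng.gauss(0, 5)
        rows.append({"size": size, "price": price})
    return rows


# Candidate models: each fit function returns a predict function.
def fit_mean(xs, ys):
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_nearest_neighbor(xs, ys):
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


CANDIDATES = {
    "mean": fit_mean,
    "linear": fit_linear,
    "nearest_neighbor": fit_nearest_neighbor,
}


def cross_val_mae(fit, xs, ys, folds=5):
    """Mean absolute error over k folds, each fold held out once."""
    errors = []
    for k in range(folds):
        train_x = [x for i, x in enumerate(xs) if i % folds != k]
        train_y = [y for i, y in enumerate(ys) if i % folds != k]
        predict = fit(train_x, train_y)
        errors += [
            abs(predict(x) - y)
            for i, (x, y) in enumerate(zip(xs, ys))
            if i % folds == k
        ]
    return statistics.fmean(errors)


def auto_select(rows, feature, target):
    """Score every candidate on held-out data and return the best one."""
    xs = [row[feature] for row in rows]
    ys = [row[target] for row in rows]
    scores = {name: cross_val_mae(fit, xs, ys) for name, fit in CANDIDATES.items()}
    best = min(scores, key=scores.get)
    return best, scores


best_name, scores = auto_select(make_rows(), feature="size", target="price")
print(best_name, scores)
```

Because the synthetic prices are generated from a noisy straight line, the linear candidate should score best here. Commercial AutoML products search far larger model libraries and tune hyperparameters as well, but the core loop of "fit each candidate, score it on held-out data, keep the winner" is the same idea.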
Dataiku vs. Sagemaker
Dataiku focuses on providing coding and analytics tools for data scientists and engineers, while Sagemaker focuses on the underlying infrastructure: the servers that run and serve these models. Dataiku provides an integration to Sagemaker, but Sagemaker is also releasing tools that directly compete with Dataiku: Sagemaker Studio and Sagemaker Autopilot.
You can either use these platforms in combination, using Dataiku to build and manage your models and Sagemaker to train and serve them, or you can use Sagemaker for everything.
- Use Dataiku if you need a more mature platform with a focus on user interfaces and user experience, one that both your engineers and your analysts can use.
- Use Sagemaker if you have more engineers than analysts, you need more flexibility, and you don’t mind interfaces that are still being iterated on and lack polish.
Alteryx vs. Datarobot
Alteryx is a broader solution that provides analytics, data management, and dashboarding components as well as no-code machine learning. Datarobot has a narrower focus on no-code machine learning.
- Use Alteryx if your focus is on data and analytics, and you need a platform for your whole organization.
- Use Datarobot if you have an existing dataset and you want to analyze it using predefined and curated machine learning models.
Alteryx vs. Knime
Alteryx and Knime are similar tools, and their capabilities largely overlap. Alteryx is more commercial, offering only a paid platform, while Knime also has a free, open-source option. Knime lacks some of Alteryx’s polish, but it offers more flexibility.
- Use Alteryx if you have more business analysts than engineers on your team and you need polished reports and dashboards.
- Use Knime if you’re on a budget and flexibility is more important to you than presentation.
Sagemaker vs. Databricks
Sagemaker gives you a way to deploy and serve your machine learning models, using a variety of machine learning frameworks, on AWS infrastructure. Databricks lets you run Jupyter-style notebooks on Apache Spark clusters (which may in turn run on AWS).
Databricks focuses on big data analytics, letting you run your data processing code on compute clusters. Sagemaker focuses on experiment tracking and model deployment. Both tools let data scientists write code in a familiar notebook environment and run it on scalable infrastructure.
- Use Sagemaker if you need a general-purpose platform to develop, train, deploy, and serve your machine learning models.
- Use Databricks if you specifically want to use Apache Spark and MLFlow to manage your machine learning pipeline.
Sagemaker vs. Datarobot
Sagemaker includes Sagemaker Autopilot, which is similar to Datarobot. Both tools let you upload a simple dataset in a spreadsheet format, select a target variable, and have the platform automatically run experiments and select the best machine learning model for your data.
Because this so-called “AutoML” is Datarobot’s core focus, Datarobot has curated and tuned a wider library of models than Sagemaker. So Sagemaker is still catching up to Datarobot in this specific use case, but overall Sagemaker is a more full-featured, flexible platform for model building, deployment, serving, and experiment tracking.
- Use Sagemaker if you need a more flexible platform that includes AutoML.
- Use Datarobot if you want a simpler platform with more curated, ready-to-use models.
Final remarks
If you visit any of these platforms’ websites, you’ll see they make sweeping claims about how powerful they are and how easy they are to use. Keep in mind that they all aim to solve very difficult problems, and onboarding onto any of them will likely be a long and expensive process with some hurdles to overcome.
All of these tools and services aim to offer a shortcut to data processing, machine learning, and analytics. But this means they’re also more restrictive than you might expect. If machine learning is core to your business, then building your own pipeline is often still the best option. There are many excellent, mature, open-source platforms that you can use to build a fully custom solution.
These managed machine learning platforms sell the concept that non-technical people can build machine learning solutions without engineers. But in practice, it’s often experienced machine learning engineers who use these tools and services most successfully. People with a deep understanding of the underlying systems and tradeoffs can use managed platforms as a shortcut to building proofs of concept; because they understand the process the tool is designed to simplify, they know how to use it effectively. But those without this experience often find that managed platforms are too limited to meet their exact requirements and still too complicated for non-technical team members to use easily.