Original article: Scaling Machine Learning at Uber with Michelangelo, by Jeremy Hermann and Mike Del Balso, November 2, 2018

Key takeaways:

  1. Division of responsibilities (a dedicated platform team) and cross-team collaboration
  2. Let developers use the tools they already know
  3. Feature management
  4. Versioned model management

~~~~~ Original article below ~~~~~

In September 2017, we published an article introducing Michelangelo, Uber’s Machine Learning Platform, to the broader technical community. At that point, we had over a year of production experience under our belts with the first version of the platform, and were working with a number of our teams to build, deploy, and operate their machine learning (ML) systems.

As our platform matures and Uber’s services grow, we’ve seen an explosion of ML deployments across the company. At any given time, hundreds of use cases representing thousands of models are deployed in production on the platform. Millions of predictions are made every second, and hundreds of data scientists, engineers, product managers, and researchers work on ML solutions across the company.

In this article, we reflect on the evolution of ML at Uber from the platform perspective over the last three years. We review this journey by looking at the path taken to develop Michelangelo and scale ML at Uber, offer an in-depth look at Uber’s current approach and future goals towards developing ML platforms, and provide some lessons learned along the way. In addition to the technical aspects of the platform, we also look at the important organizational and process design considerations that have been critical to our success with ML at Uber.

1. Zero to 100 in three years

In 2015, ML was not widely used at Uber, but as our company scaled and services became more complex, it was obvious that there was opportunity for ML to have a transformational impact, and the idea of pervasive deployment of ML throughout the company quickly became a strategic focus.

While the goal of Michelangelo from the outset was to democratize ML across Uber, we started small and then incrementally built the system. Michelangelo’s initial focus was to enable large-scale batch training and productionizing batch prediction jobs. Over time, we added a centralized feature store, model performance reports, a low-latency real-time prediction service, deep learning workflows, notebook integrations, partitioned models, and many other components and integrations.

In three short years, Uber went from having no centralized ML efforts and a few bespoke ML systems to having advanced ML tools and infrastructure, and hundreds of production ML use-cases.

1.1 ML use cases at Uber

Uber uses ML for a very diverse set of applications. Rather than applying ML to a few key areas (such as ad optimization or content relevance), Uber has a much more even spread of ML solutions. In this section, we discuss a select few Michelangelo use cases that came up over the last three years, highlighting the diversity and impact of ML at Uber:

1.1.1 Uber Eats

Uber Eats uses a number of machine learning models built on Michelangelo to make hundreds of predictions that optimize the eater experience each time the app is opened.

ML-powered ranking models suggest restaurants and menu items based on both historical data and information from the user’s current session in the app (e.g. their search query).

Using Michelangelo, Uber Eats also estimates meal arrival times based on predicted ETAs, historical data, and various real-time signals for the meal and restaurant.

1.1.2 Marketplace Forecasting

Uber’s Marketplace team leverages a variety of spatiotemporal forecasting models that are able to predict where rider demand and driver-partner availability will be at various places and times in the future. Based on forecasted imbalances between supply and demand, Uber systems can encourage driver-partners ahead of time to go where there will be the greatest opportunity for rides.

1.1.3 Customer Support

Around 15 million trips happen on Uber every day. People frequently leave wallets or phones in the car or have other problems that lead to thousands of support tickets each day through our help system. These tickets are routed to customer service representatives. Machine learning models built in Michelangelo are heavily used to automate or speed-up large parts of the process of responding to and resolving these issues. The first version of these models, based on boosted trees, sped up ticket handling time by 10 percent with similar or better customer satisfaction. The second version, based on a deep learning model, drove an additional 6 percent speedup.

1.1.4 Ride Check

Since the very first Uber ride in 2010, GPS data has been used to put every trip on the map so we know where and when you’re riding and who’s behind the wheel. But we can do more: by harnessing the power of GPS and other sensors in the driver’s smartphone, our technology can detect possible crashes. This technology can also flag trip irregularities beyond crashes that might, in some rare cases, indicate an increased safety risk. For example, if there is a long, unexpected stop during a trip, both the rider and the driver will receive a notification through our Ride Check feature that offers assistance in the event of a crash.

1.1.5 Estimated Times of Arrival (ETAs)

One of the most important and visible metrics for the company is the ETA for rider pickups. Accurate ETAs are critical to a positive user experience, and these metrics are fed into myriad other internal systems to help determine pricing and routing. However, ETAs are notoriously difficult to get right.

Uber’s Map Services team developed a sophisticated segment-by-segment routing system that is used to calculate base ETA values. These base ETAs have consistent patterns of errors. The Map Services team discovered that they could use a machine learning model to predict these errors and then use the predicted error to make a correction. As this model was rolled out city-by-city (and then globally for the last couple of years), we have seen a dramatic increase in the accuracy of the ETAs, in some cases reducing average ETA error by more than 50 percent.
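The correction described above is a form of residual modeling: fit a second model to the base system's error, then add the predicted error back to the base estimate. A minimal sketch on synthetic data (a plain least-squares fit in NumPy; the production model is naturally far more sophisticated):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
features = rng.normal(size=(n, 4))     # e.g. distance, hour of day, traffic, ...
base_eta = 600 + 50 * features[:, 0]   # base ETA (seconds) from the routing system
# Actual travel times deviate from the base ETA in a learnable, systematic way.
actual = base_eta + 30 * features[:, 1] + rng.normal(0, 5, n)

# Fit a model to the base ETA's *error*, not to the ETA itself.
X = np.column_stack([features, np.ones(n)])
residual = actual - base_eta
coef, *_ = np.linalg.lstsq(X, residual, rcond=None)

# Corrected ETA = base ETA + predicted error.
corrected_eta = base_eta + X @ coef
```

On this toy data the correction removes most of the systematic error, mirroring the accuracy gains described above.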

1.1.6 One-Click Chat

The one click chat feature streamlines communication between riders and driver-partners by using natural language processing (NLP) models that predict and display the most likely replies to in-app chat messages. Letting driver-partners respond to rider chat messages with a single button press reduces distraction.

1.1.7 Self-Driving Cars

Uber’s self-driving car systems use deep learning models for a variety of functions, including object detection and motion planning. The modelers use Michelangelo’s Horovod for efficient distributed training of large models across a large number of GPU machines.

2. How we scaled ML at Uber

As a platform team, our mission is to unlock the value of ML and accelerate its adoption in all corners of the company. We do this by democratizing the tools and support our technical teams need, namely, optimizing for developer velocity, end-to-end ownership, software engineering rigor, and system flexibility.

For data scientists, our tooling simplifies the production and operations side of building and deploying ML systems, enabling them to own their work end-to-end. For engineers, Uber’s ML tooling simplifies the data science (feature engineering, modeling, evaluation, etc.) behind these systems, making it easy for them to train sufficiently high-quality models without needing a data scientist. Finally, for highly experienced engineering teams building specialized ML systems, we offer Michelangelo’s ML infrastructure components for customizable configurations and workflows.

Successfully scaling ML at a company like Uber requires getting much more than just the technology right—there are important considerations for organization and process design as well. In this section, we look at critical success factors across three pillars: organization, process, and technology.

Figure 1: The core strategy pillars of the Michelangelo Machine Learning Platform.

2.1 Organization

Widely varying requirements for ML problems and limited expert resources make organizational design particularly important—and challenging—for machine learning. While some ML projects at Uber are owned by teams with multiple ML engineers and data scientists, others are owned by teams with little to no technical expertise. Similarly, some problems can be solved by novices with commonly available out-of-the-box algorithms, while other problems require expert investigation with advanced techniques (and often don’t have known solutions).

Getting the right people working on the right problems has been critical to building high quality solutions and deploying them consistently and successfully in production. The challenge is in allocating scarce expert resources and amplifying their impact across many different ML problems. For example, if a new project requires computer vision know-how, what organizational structure will allow Uber to effectively allocate expert resources in a way that is aligned with company priorities?

After several iterations, Uber currently operates with the following main roles and responsibilities:

Figure 2: Organizational interactions of different teams in Uber’s ML ecosystem.

Let’s take a look at some of the key teams and how they work together to design, build, and deploy new ML systems in production.

2.1.1 Product teams

We found that it works best if the product engineering teams own the models they build and deploy in production. For example, our Map Services team owns the models that predict Uber’s ETAs. Product teams are typically staffed with the full set of skills they need to build and deploy models using Uber’s ML platforms. When they need additional expertise, they get assistance from the research and/or specialist teams.

Product organizations sometimes also have special teams who help address any gaps between what the platform provides and what specific product engineering teams need. These teams adapt the centralized platform tools for their use case and fill in feature gaps with tailored tools and workflows. For instance, many teams in Uber’s Marketplace organization have similar workflows around training, evaluating, and deploying models per city and product. A Marketplace team creates specialized tools that sit on top of Michelangelo, making it easier to manage these Marketplace ML projects.

2.1.2 Specialist teams

When product engineering teams encounter ML problems that stretch their abilities or resources, they can turn to an internal team of specialists for help. Uber’s specialists have deep expertise across different domains—like NLP, computer vision, recommender systems, forecasting—and partner with product engineering teams to build tailored solutions. For instance, our COTA project is an effort that pairs a specialist team with a product team to create massive impact for our business and customers.

Typically, these projects last a few weeks to many quarters. As a project is de-risked and moves closer to launching in production, product teams often add relevant full-time experts to fill the expertise gap, ensure they’re able to maintain the system on their own, and free up specialist resources.

2.1.3 Research teams

Specialists and product engineering teams often engage with Uber’s AI research group, AI Labs, to collaborate on problems and help guide the direction for future research. Research teams typically do not own production code, but they frequently work closely with different teams on applied problems. When relevant new techniques and tools are developed by researchers, the platform engineering team integrates them into company-wide platforms, allowing new techniques to be easily leveraged across the company.

2.1.4 ML Platform teams

The Michelangelo Platform team builds and operates a general purpose ML workflow and toolset that is used directly by the product engineering teams to build, deploy, and operate machine learning solutions.

As our systems become more sophisticated and the problems we solve more complex, demand grows for additional flexibility, extensibility, and domain-specific ML development experiences. We’re spinning up a number of other, more domain-specific platforms to address specialized use cases that are not as well served by Michelangelo workflow tools. These new platform teams reuse a lot of the existing Michelangelo platform and deliver specialized ML development workflows to product teams. For instance, there are NLP and computer vision-specific platforms being built that contain special visualization tools, pre-trained models, metadata tracking, and other components that don’t fit well in a general-purpose platform.

2.2 Process

As Uber’s ML operations mature, a number of processes have proven useful to the productivity and effectiveness of our teams. Sharing ML best practices (e.g., data organization methods, experimentation, and deployment management) and instituting more structured processes (e.g., launch reviews) are valuable ways to guide teams and avoid repeating others’ mistakes. Internally focused community building efforts and transparent planning processes engage and align ML teams under common goals.

2.2.1 Launching models

Designing reliable processes to avoid common development pitfalls and to verify intended model behavior is critical to safely scaling ML in an organization. ML systems are particularly vulnerable to unintended behaviors, tricky edge cases, and complicated legal/ethical/privacy problems. In practice, however, risk profiles differ significantly across use cases and require tailored approval and launch processes. For example, launching an automated update to an ETA prediction model that uses anonymized data requires less privacy scrutiny than launching a new pricing model.

For these reasons, product organizations (e.g., the Uber Eats or Marketplace teams) own the launch processes around their ML models. These teams adapt processes to their product area from a centralized launch playbook that walks through general product, privacy, legal, and ethical topics around experimenting with and launching ML models. The product teams themselves best understand the product implications of different model behavior and are best suited to consult with relevant experts to evaluate and eliminate risks.

2.2.2 Coordinated planning across ML teams

When requirements outpace the roadmaps of the platform teams, product engineering teams can feel the desire to branch off and build their own systems tailored to their needs. Care needs to be taken to ensure teams are empowered to solve their own problems but also that the company is making good engineering tradeoffs to avoid fragmentation and technical debt. At Uber, we put together an internal group of senior leaders that oversees the evolution of ML tooling across the company to ensure that we’re making smart trade-offs and are maintaining long-term architecture alignment. This has been invaluable in resolving these tricky and sometimes sensitive situations.

2.2.3 Community

Scaling high-quality ML across the company requires a connected and collaborative organization.

To build an internal community, we host an annual internal ML conference called UberML. We recently hosted around 500 employees and more than 50 groups presenting talks or posters on their work. Events like this enable practitioners to swap ideas, celebrate achievements, and make important connections for future collaborations. Teams at Uber also organize community building events including ML reading groups, talk series, and regular brown bag lunches for Uber’s ML-enthusiasts to learn about some of our internal ML projects from the individuals that build them.

Our focus on community extends beyond our own walls. Our team also engages heavily with the external ML community through conferences, publishing papers, contributing to open source projects, and collaborating on ML projects and research with other companies and academia. Over the years, this community has grown into a global effort to share best practices, collaborate on cutting-edge projects, and generally improve the state of the field.

2.2.4 Education

It’s important for ML teams to always be learning. They need to stay on top of developments in ML theory, track and learn from internal ML projects, and master the usage of our ML tools. Proper channels to efficiently share information and educate on ML-related topics are critical.

Uber ML education starts during an employee’s first week, during which we host special sessions for ML and Michelangelo boot camps for all technical hires. When major new functionality is released in Michelangelo, we host special training sessions with the employees that frequently use them. Documentation of key tools and user workflows has also helped encourage knowledge sharing and scaled adoption of our platform tools.

Office hours are also held by different ML-focused groups in the company to offer support when questions arise. It also helps that the individuals who work on ML projects at Uber tend to be naturally inquisitive and hungry learners. Many of the community-led initiatives mentioned above are great ways for team members to keep up with internal and external developments.

3. Technology

There are myriad details to get right on the technical side of any ML system. At Uber, we’ve found the following high-level areas to be particularly important:

  • End-to-end workflow: ML is more than just training models; you need support for the whole ML workflow: manage data, train models, evaluate models, deploy models and make predictions, and monitor predictions.
  • ML as software engineering: We have found it valuable to draw analogies between ML development and software development, and then apply patterns from software development tools and methodologies back to our approach to ML.
  • Model developer velocity: Machine learning model development is a very iterative process—innovation and high-quality models come from lots and lots of experiments. Because of this, model developer velocity is critically important.
  • Modularity and tiered architecture: Providing end-to-end workflows is important for handling the most common ML use cases, but to address the less common and more specialized cases, it’s important to have primitive components that can be assembled in targeted ways.

3.1 End-to-end workflow

Early on, we recognized that successful ML at a large company like Uber requires much more than just training good models—you need robust, scalable support for the entire workflow. We found that the same workflow applies across a wide array of scenarios, including traditional ML and deep learning; supervised, unsupervised, and semi-supervised learning; online learning; batch, online, and mobile deployments; and time-series forecasting. It’s not critical that one tool provides everything (though this is how we did it) but it is important to have an integrated set of tools that can tackle all steps of the workflow.

3.1.1 Manage data

This is typically the most complex part of the ML process and covers data access, feature discovery, selection, and transformations that happen during model training and the productionization of pipelines for those features when the model is deployed. At Uber, we built a feature store which allows teams to share high-quality features and easily manage the offline and online pipelines for those features as models are trained and then deployed, ensuring consistency between online and offline versions.

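The core idea, one shared feature definition serving both the offline training path and the online serving path, can be sketched as a tiny registry (hypothetical and heavily simplified; not Michelangelo's actual API):

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Tuple

@dataclass
class FeatureStore:
    """Toy feature registry: a single definition feeds both the offline
    (training) and online (serving) paths, so the two cannot drift apart."""
    definitions: Dict[str, Callable[[dict], Any]] = field(default_factory=dict)
    online_cache: Dict[Tuple[str, str], Any] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[[dict], Any]) -> None:
        self.definitions[name] = fn

    def compute_offline(self, name: str, row: dict) -> Any:
        # Batch path: evaluated over historical rows at training time.
        return self.definitions[name](row)

    def materialize_online(self, name: str, entity_id: str, row: dict) -> None:
        # Online path: precomputed with the *same* definition, then cached
        # for low-latency lookup at prediction time.
        self.online_cache[(name, entity_id)] = self.definitions[name](row)

    def get_online(self, name: str, entity_id: str) -> Any:
        return self.online_cache[(name, entity_id)]

store = FeatureStore()
store.register("avg_prep_minutes",
               lambda row: sum(row["prep_times"]) / len(row["prep_times"]))

row = {"prep_times": [10, 14, 12]}
store.materialize_online("avg_prep_minutes", "restaurant_42", row)
offline_value = store.compute_offline("avg_prep_minutes", row)
online_value = store.get_online("avg_prep_minutes", "restaurant_42")
```

Because both paths run the same registered function, the training-time and serving-time values of a feature agree by construction.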
3.1.2 Train models

In Michelangelo, users can train models from our web UI or from Python using our Data Science Workbench (DSW). In DSW, we support large-scale distributed training of deep learning models on GPU clusters, tree and linear models on CPU clusters, and lower scale training of a large variety of models using the myriad available Python toolkits. In addition to training simple models, users can compose more complex transformation pipelines, ensembles, and stacked models. Michelangelo also offers scalable grid and random hyperparameter search, as well as more efficient Bayesian black-box hyperparameter search.

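Grid search, the simplest of the strategies mentioned, can be sketched in a few lines: score every combination in a small hyperparameter space and keep the best. The objective below is a stand-in for a real train-and-validate run:

```python
import itertools

def grid_search(train_eval, space):
    """Evaluate every hyperparameter combination in `space` and return the
    best-scoring one (higher score is better)."""
    keys = list(space)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(space[k] for k in keys)):
        params = dict(zip(keys, values))
        score = train_eval(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def fake_train_eval(p):
    # Stand-in for a real train+validate run; peaks at max_depth=6, lr=0.1.
    return 1.0 - 0.05 * abs(p["max_depth"] - 6) - abs(p["lr"] - 0.1)

space = {"max_depth": [2, 4, 6, 8], "lr": [0.01, 0.1, 0.3]}
best, score = grid_search(fake_train_eval, space)
print(best)  # {'max_depth': 6, 'lr': 0.1}
```

Random and Bayesian search follow the same contract (propose parameters, observe a score), but choose the next trial more cleverly as the space grows.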
3.1.3 Manage and evaluate models

Finding the right combination of data, algorithm, and hyperparameters is an experimental and iterative process. Moving through this process quickly and efficiently requires automation of all the experiments and the results. It also benefits from good visualization tools for understanding each individual model’s performance as well as being able to compare many models with each other to see the patterns of configuration and feature data that improve the model performance. Models managed in Michelangelo are rigorously managed, version controlled, fully reproducible, and have rich visualizations for model accuracy and explainability.

Figure 3: Michelangelo’s model comparison page showing a comparison of two models’ behavior across different segments and features.

3.1.4 Deploy models and make predictions

Once an effective model is trained, it’s important for the model developer to be able to deploy the model into a staging or production environment. In Michelangelo, users can deploy models via our web UI for convenience or through our API for integration with external automation tools. At deploy time, the model and related resources are packed up and then pushed out to an offline job for scheduled batch predictions or to online containers for real-time request-response predictions via Thrift. For both online and offline models, the system automatically sets up the pipelines for data from the feature store.

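The deploy step can be pictured as bundling the model and its resources into one artifact and tagging it for a batch or an online runtime; every name below is illustrative rather than Michelangelo's real deploy API:

```python
import json
import tempfile
import zipfile
from pathlib import Path

def package_and_deploy(model_bytes: bytes, config: dict, target: str) -> dict:
    """Bundle a trained model and its configuration into a single artifact,
    then route it to the chosen runtime ("batch" job or "online" container)."""
    if target not in ("batch", "online"):
        raise ValueError(f"unknown deploy target: {target}")
    artifact = Path(tempfile.mkdtemp()) / "model_artifact.zip"
    with zipfile.ZipFile(artifact, "w") as zf:
        zf.writestr("model.bin", model_bytes)
        zf.writestr("config.json", json.dumps(config))
    # A real system would now push the artifact to the serving
    # infrastructure; here we only report where it would go.
    return {"artifact": str(artifact), "target": target}

deployment = package_and_deploy(b"\x00\x01", {"features": ["base_eta"]}, "online")
```

Packaging everything the model needs into one immutable artifact is what makes the same model safely deployable to either runtime.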
3.1.5 Monitor data and predictions

Models are trained and initially evaluated against historical data. This means that users can know that a model would have worked well in the past. But once you deploy the model and use it to make predictions on new data, it’s often hard to ensure that it’s still working correctly. Models can degrade over time because the world is always changing. Moreover, there can be breakages or bugs in a production model’s data sources or data pipelines. In both cases, monitoring of (and alerting on) predictions made by models in production is critical. We have two approaches to monitoring models in production. The most accurate approach is to log predictions made in production and then join these to the outcomes as they are collected by our data pipelines; by comparing predictions against actuals, we can compute precise accuracy metrics. In cases where the outcomes are not easily collected or where we cannot easily join the predictions to outcomes, a second option is to monitor the distributions of the features and predictions and compare them over time. This is a less precise approach, but can still often detect problematic shifts in features and corresponding predictions.

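The distribution-comparison approach can be illustrated with the population stability index (PSI), a common drift statistic (the article does not say which statistic is used in production):

```python
import math
import random

def psi(expected, observed, bins=10):
    """Population Stability Index between a baseline sample (e.g. feature
    values seen at training time) and a production sample. Rule of thumb:
    PSI above ~0.2 signals a shift worth investigating."""
    lo, hi = min(expected), max(expected)
    step = (hi - lo) / bins

    def frac(sample, i):
        left, right = lo + i * step, lo + (i + 1) * step
        hits = sum(1 for x in sample
                   if left <= x < right or (i == bins - 1 and x == hi))
        return max(hits / len(sample), 1e-6)  # avoid log(0) for empty bins

    return sum((frac(observed, i) - frac(expected, i))
               * math.log(frac(observed, i) / frac(expected, i))
               for i in range(bins))

rng = random.Random(0)
baseline = [rng.gauss(0, 1) for _ in range(2000)]   # training-time feature
stable = [rng.gauss(0, 1) for _ in range(2000)]     # production, no drift
shifted = [rng.gauss(1.5, 1) for _ in range(2000)]  # production, drifted
```

Here `psi(baseline, stable)` stays small while `psi(baseline, shifted)` is large, which is exactly the kind of signal to alert on when outcomes cannot be joined back to predictions.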
3.2 ML as software engineering

An important principle of the Michelangelo team’s approach is to think of machine learning as software engineering. Developing and running ML in production should be as iterative, rigorous, tested, and methodological as software engineering. We have found it very valuable to draw analogies between ML and software development, and to apply insights from corresponding and mature software development tools and methodologies back to ML.

For instance, once we recognized that a model is like a compiled software library, it becomes clear that we want to keep track of the model’s training configuration in a rigorous, version controlled system in the same way that you version control the library’s source code. It has been important to keep track of the assets and configuration that were used to create the model so that it can be reproduced (and/or improved) later. In the case of transfer learning in deep learning models, we track the entire lineage so that every model can be retrained, if needed. Without good controls and tools for this, we have seen cases in which models are built and deployed but are impossible to reproduce because the data and/or training configuration has been lost.

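A minimal version of this kind of reproducibility tracking records the training configuration, a digest of the training data, and, for transfer learning, a pointer to the parent model (all names are hypothetical):

```python
import hashlib
import json

def model_fingerprint(config, training_data, parent=None):
    """Version-control a model the way source code is versioned: capture
    everything needed to reproduce it, plus its lineage for transfer learning."""
    record = {
        "config": config,                                          # exact training config
        "data_digest": hashlib.sha256(training_data).hexdigest(),  # what it was trained on
        "parent": parent,                                          # parent model, if fine-tuned
    }
    # Deterministic ID: same config + data + parent always hash the same.
    record["model_id"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()[:12]
    return record

base = model_fingerprint({"algo": "gbdt", "max_depth": 6}, b"training-data-v1")
# A fine-tuned child keeps a pointer to its parent, so the whole lineage
# can be replayed if either model ever needs to be retrained.
child = model_fingerprint({"algo": "gbdt", "finetune": True},
                          b"training-data-v2", parent=base["model_id"])
```

Because the ID is derived from the record itself, a model whose config or data has been lost simply cannot be given a valid fingerprint, which is the failure mode described above.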
In addition, to make sure software works correctly, it is important to run comprehensive tests before the software is deployed; in the same way, we always evaluate models against holdout sets before deploying. Similarly, it is important to have good monitoring of software systems to make sure they work correctly in production; the same applies to machine learning where you want to monitor the models in production as they may behave differently than they did in offline evaluation.

3.3 Model developer velocity

Building impactful ML systems is a science and requires many iterations to get right. Iteration speed affects both how ML scales out across the organization and how productive a team can be on any given problem. A high priority for the Michelangelo team is enabling data science teams to go faster. The faster we go, the more experiments we can run, the more hypotheses we can test, the better results we can get.

The diagram below shows how we think about the standard ML development process and the different feedback loops within them. We are constantly thinking about this process and tightening these loops so it’s easier and faster to do iterative and agile data science.

Figure 4: The workflow of a machine learning project. Defining a problem, prototyping a solution, productionizing the solution and measuring the impact of the solution is the core workflow. The loops throughout the workflow represent the many iterations of feedback gathering needed to perfect the solution and complete the project.

Michelangelo’s “zero-to-one speed” or “time-to-value speed” is critical for how ML spreads across Uber. For new use cases, we focus on lowering the barrier to entry by fine-tuning the getting started workflow for people of different abilities and having a streamlined flow to get a basic model up and running with good defaults.

For existing projects, we look at iteration speed, which gates how fast data scientists can iterate and get feedback on their new models or features either in an offline test or from an online experiment.

对于现有项目,我们着眼于迭代速度,这决定了数据科学家在离线测试或在线实验中迭代和获得新模型或功能反馈的速度。

A few principles have proven very useful in enabling teams to develop quickly:

  1. Solve the data problem so data scientists don’t have to.
    • Dealing with data access, integration, feature management, and pipelines can often waste a huge amount of a data scientist’s time. Michelangelo’s feature store and feature pipelines are critical to solving a lot of data scientist headaches.
  2. Automate or provide powerful tools to speed up common flows.
  3. Make the deployment process fast and magical.
    • Michelangelo hides the details of deploying and monitoring models and data pipelines in production behind a single click in the UI.
  4. Let the user use the tools they love with minimal cruft—“Go to the customer”.
    • Michelangelo allows interactive development in Python, notebooks, and CLIs, and includes UIs for managing production systems and records.
  5. Enable collaboration and reuse.
    • Again, Michelangelo’s feature store is critical to enabling teams to reuse important predictive features already identified and built by other teams.
  6. Guide the user through a structured workflow.

事实证明,一些原则对于使团队快速开发非常有用:

  1. 解决数据问题,让数据科学家不必这样做。
    • 处理数据访问、集成、特征管理和数据管道往往会浪费数据科学家的大量时间。Michelangelo 的特征存储和特征管道对于解决许多数据科学家的难题至关重要。
  2. 自动化或提供强大的工具来加速常见的流程。
  3. 使部署过程快速而神奇。
    • Michelangelo 将在生产环境中部署和监控模型及数据管道的细节隐藏在 UI 的一次点击背后。
  4. 让用户以最少的繁琐使用他们喜欢的工具——“去找客户”。
    • Michelangelo 允许使用 Python、笔记本、CLI 进行交互式开发,并包括用于管理生产系统和记录的 UI。
  5. 实现协作和重用。
    • 同样,米开朗基罗的特性存储对于使团队能够重用其他团队已经确定和构建的重要预测特性至关重要。
  6. 引导用户完成结构化的工作流程。
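To illustrate principles 1 and 5 above, here is a minimal in-memory sketch of the feature-store idea: one team registers a feature once, and any other team's model can pull it by entity key. The class and method names are hypothetical, not Michelangelo's actual API:

```python
class FeatureStore:
    """Toy illustration of shared, reusable features keyed by entity."""
    def __init__(self):
        self._features = {}  # feature name -> {entity_id: value}

    def register(self, name, values):
        """A producing team publishes a computed feature once."""
        self._features[name] = dict(values)

    def get_features(self, entity_id, names):
        """A consuming model assembles its feature vector for training or serving."""
        return {n: self._features[n].get(entity_id) for n in names}

store = FeatureStore()
# One team publishes a restaurant feature...
store.register("avg_prep_time_7d", {"r1": 12.5, "r2": 9.0})
# ...another team reuses it alongside a feature of its own.
store.register("order_volume_1h", {"r1": 42, "r2": 17})

vector = store.get_features("r1", ["avg_prep_time_7d", "order_volume_1h"])
```

The real system adds what this toy omits: production pipelines that keep each feature fresh, and monitoring of feature quality over time.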
3.3.1 Going to the customer: Notebooks and Python

When Michelangelo started, the most urgent and highest impact use cases were some very high scale problems, which led us to build around Apache Spark (for large-scale data processing and model training) and Java (for low latency, high throughput online serving). This structure worked well for production training and deployment of many models but left a lot to be desired in terms of overhead, flexibility, and ease of use, especially during early prototyping and experimentation.

Michelangelo 刚起步时，最紧迫、影响最大的用例是一些规模非常大的问题，这促使我们围绕 Apache Spark（用于大规模数据处理和模型训练）和 Java（用于低延迟、高吞吐量的在线服务）进行构建。这种结构对许多模型的生产训练和部署效果良好，但在开销、灵活性和易用性方面仍有很多不足，尤其是在早期原型设计和实验阶段。

To provide greater flexibility, ease of use, and iteration speed, we are moving the main model building workflows to Uber's Data Science Workbench (DSW). DSW provides flexible and easy access to Uber's data infrastructure and compute resources in a natural notebook interface. Its integration with our cloud and on-prem GPU clusters allows for fast prototyping of Michelangelo-ready ML models in a notebook environment and easy saving of those models in Michelangelo for deployment and scaled serving. We're transitioning to using DSW as the primary model exploration and prototyping interface for Michelangelo.

为了提供更大的灵活性、易用性和迭代速度，我们正在将主要的模型构建工作流程迁移到 Uber 的数据科学工作台（DSW）。DSW 以自然的笔记本界面提供了对 Uber 数据基础设施和计算资源的灵活、便捷的访问。它与我们的云端和本地 GPU 集群集成，允许在笔记本环境中快速构建可直接接入 Michelangelo 的 ML 模型原型，并将这些模型轻松保存到 Michelangelo 中进行部署和规模化服务。我们正在过渡到使用 DSW 作为 Michelangelo 的主要模型探索和原型设计界面。

To support the same scalable modeling in a notebook environment that we have always provided via our UI, we have released (internally for now, but we hope to open source shortly) a set of libraries that extend Spark to provide a set of custom Estimator, Transformer, and Pipeline components that expose interfaces for batch, streaming, and request/response-based scoring (the latter is not available in the standard version of Spark). These components can be assembled using PySpark and then uploaded to Michelangelo for deployment and serving using our pure-Java serving system. This brings together much of the ease of use of Python with the scale of Spark and Java.

为了在笔记本环境中支持我们一直通过 UI 提供的同样可扩展的建模能力，我们发布了（目前仅限内部使用，但希望很快开源）一组扩展 Spark 的库，提供自定义的 Estimator、Transformer 和 Pipeline 组件，这些组件暴露了批处理、流式处理和基于请求/响应的评分接口（后者在标准版 Spark 中不可用）。这些组件可以使用 PySpark 组装，然后上传到 Michelangelo，由我们的纯 Java 服务系统进行部署和服务。这将 Python 的易用性与 Spark 和 Java 的规模优势结合在了一起。
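The component model described above can be sketched in plain Python (the real library extends Spark's `Estimator`/`Transformer`/`Pipeline` classes; the names and structure below are illustrative only). The key property is that one pipeline object backs both batch scoring and single-record request/response scoring:

```python
class Transformer:
    """Base component: one definition, two scoring paths."""
    def transform_batch(self, rows):
        # Batch / streaming path: score many rows at once.
        return [self.transform_one(r) for r in rows]

    def transform_one(self, row):
        # Request/response path: score a single record, e.g. behind an RPC.
        raise NotImplementedError

class ScaleFeature(Transformer):
    """Hypothetical feature transform: multiply one column by a constant."""
    def __init__(self, column, factor):
        self.column, self.factor = column, factor

    def transform_one(self, row):
        row = dict(row)  # avoid mutating the caller's record
        row[self.column] = row[self.column] * self.factor
        return row

class Pipeline(Transformer):
    """Chain of transformers applied in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform_one(self, row):
        for stage in self.stages:
            row = stage.transform_one(row)
        return row

pipeline = Pipeline([ScaleFeature("distance_km", 0.621)])
# The same assembled pipeline serves batch jobs and single online requests.
batch = pipeline.transform_batch([{"distance_km": 10.0}, {"distance_km": 2.0}])
single = pipeline.transform_one({"distance_km": 1.0})
```

In the production setup, the assembled PySpark pipeline is what gets uploaded to Michelangelo, and the Java serving system executes the equivalent of `transform_one` at low latency.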

Plain Python modeling offers simplicity and access to a richer ecosystem of ML and data toolkits. To support it, we recently expanded Michelangelo to serve any kind of Python model from any source, enabling more flexible modeling. Users build their models in DSW notebooks (or another preferred Python environment) and then use the Michelangelo PyML SDK to package and upload the model and its dependencies to Michelangelo for storage, deployment, and serving (both batch and online).

纯 Python 建模的优势在于简单，并且可以使用更丰富的 ML 和数据工具包生态系统。为此，我们最近扩展了 Michelangelo，使其可以服务来自任何来源、任何类型的 Python 模型，以提供更灵活的建模支持。用户在 DSW 笔记本（或其他偏好的 Python 环境）中构建模型，然后使用 Michelangelo PyML SDK 将模型及其依赖项打包并上传到 Michelangelo 进行存储、部署和服务（批量和在线均可）。
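A rough sketch of the packaging contract such an SDK implies: wrap any Python model in an object with a `predict` method, pin its dependencies, and ship both as one bundle. Everything here (class names, the commented-out upload call) is hypothetical, not the actual PyML API:

```python
import pickle

# Sketch of a PyML-style contract: any Python model becomes servable by
# exposing predict(), and is bundled together with its pinned dependencies.
class SimpleTipModel:
    """An 'any-source' Python model: here, a trivial rule-based predictor."""
    def predict(self, trips):
        # Suggest a 15% tip, rounded to cents.
        return [round(t["fare"] * 0.15, 2) for t in trips]

model = SimpleTipModel()
requirements = ["numpy==1.24.0"]  # dependencies shipped alongside the model

# Serialize model + metadata into a single bundle, as an upload step might.
bundle = pickle.dumps({"model": model, "requirements": requirements})
# A real SDK would now push `bundle` to the platform for storage, deployment,
# and serving; something like: pyml.upload(bundle, name="tip-model")

restored = pickle.loads(bundle)
preds = restored["model"].predict([{"fare": 20.0}])
```

Because the serving side only depends on the `predict` contract and the declared dependencies, the platform can run the model in batch or online without caring which toolkit produced it.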

3.3.2 Speed with deep learning

The development workflow for deep learning models often has different requirements than other ML development workflows. Developers typically write a lot more detailed training code and require specialized compute resources (GPUs). We’ve focused a lot on making this process smooth and fast over the past year.

深度学习模型的开发工作流程通常与其他 ML 开发工作流程有不同的要求。 开发人员通常会编写更详细的训练代码,并且需要专门的计算资源 (GPU)。 在过去的一年里,我们非常注重使这个过程顺利和快速。

Michelangelo now has great tools to provision and run training jobs on different GPU machines both in Uber’s own data centers and various public clouds. Production TensorFlow models are served out of our existing high-scale Michelangelo model serving infrastructure (which is now integrated with TensorFlow Serving) or our PyML system. We have specialized tools to help modelers track their experiments and development, but once a model is saved in Michelangelo, it’s treated just like any other model in the system.

Michelangelo 现在拥有强大的工具，可以在 Uber 自有数据中心和各类公共云中的不同 GPU 机器上配置和运行训练作业。生产环境中的 TensorFlow 模型由我们现有的大规模 Michelangelo 模型服务基础设施（现已与 TensorFlow Serving 集成）或我们的 PyML 系统提供服务。我们有专门的工具帮助建模者跟踪他们的实验和开发，但一旦模型保存到 Michelangelo 中，它就会像系统中的任何其他模型一样被对待。

3.3.3 Speeding up model development with AutoTune

AutoTune is a new general-purpose optimization-as-a-service tool at Uber. It has been integrated into Michelangelo to allow modelers to easily use state-of-the-art black-box Bayesian optimization algorithms to more efficiently search for an optimal set of hyperparameters. It serves as a new recommended alternative to the less sophisticated search algorithms that we have offered in Michelangelo so far. This means more accurate models in the same amount of training time, or less training time to get to a high-quality model.

AutoTune 是 Uber 新推出的通用“优化即服务”工具。它已被集成到 Michelangelo 中，使建模者能够轻松使用最先进的黑盒贝叶斯优化算法，更高效地搜索最优的超参数组合。相比我们此前在 Michelangelo 中提供的较为简单的搜索算法，它是新的推荐选择。这意味着在相同的训练时间内得到更准确的模型，或者用更少的训练时间得到高质量的模型。
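The hyperparameter-tuning loop that AutoTune automates can be caricatured with a simple explore/exploit search. Real AutoTune uses black-box Bayesian optimization; the toy objective and all names below are made up purely for illustration:

```python
import random

def objective(lr):
    """Stand-in for 'train a model with this learning rate and return
    its validation error'. Here the best value is lr = 0.1."""
    return (lr - 0.1) ** 2

def tune(n_trials=50, seed=7):
    """Propose hyperparameters, evaluate, and bias later proposals toward
    the best configuration seen so far (a crude stand-in for a Bayesian
    optimizer's acquisition function)."""
    rng = random.Random(seed)
    best_lr, best_err = None, float("inf")
    for _ in range(n_trials):
        if best_lr is None or rng.random() < 0.3:
            lr = rng.uniform(0.001, 1.0)              # explore the range
        else:
            lr = max(1e-4, rng.gauss(best_lr, 0.05))  # exploit best-so-far
        err = objective(lr)
        if err < best_err:
            best_lr, best_err = lr, err
    return best_lr, best_err

best_lr, best_err = tune()
```

The point of the service is that each `objective` call is an expensive training run, so a search that learns from earlier trials reaches a good configuration in far fewer runs than grid or pure random search.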

3.4 Modularity and tiered offerings

One of the tensions we found while developing Michelangelo was between providing end-to-end support for the most common ML workflows while also providing the flexibility to support the less common ones.

我们在开发 Michelangelo 的过程中发现的矛盾之一，是既要为最常见的 ML 工作流程提供端到端支持，又要保持足够的灵活性来支持不太常见的工作流程。

Originally, our platform and infrastructure components were combined into a single system. As our systems became more sophisticated and the problems we were solving became more varied and complex, demand grew for additional flexibility, extensibility, and domain-specific development experiences beyond what a monolithic platform could offer.

最初，我们的平台和基础设施组件被合并在一个单一系统中。随着我们的系统变得更加复杂、我们要解决的问题变得更加多样，对额外灵活性、可扩展性和特定领域开发体验的需求不断增长，超出了单体平台所能提供的范围。

We were able to address some of these issues through the bridge teams, as described above. But some teams wanted to mix and match parts of Michelangelo with their own components in new workflows. Other teams needed specialized development tools for their use cases, but it didn't make sense to build those tools from scratch. We made some major changes to the Michelangelo architecture to leverage our existing systems as much as possible while evolving to meet the growing requirements of maturing ML across the company.

如上所述，我们能够通过桥接团队解决其中的一些问题。但一些团队希望将 Michelangelo 的部分组件与他们自己的组件混合搭配，组成新的工作流程。还有一些团队的用例需要专门的开发工具，但从零开始构建这些工具并不合算。随着 ML 在整个公司走向成熟，我们对 Michelangelo 架构做了一些重大调整，以便尽可能多地复用现有系统，同时满足不断增长的需求。

We also found that for some problem domains, specialized development experiences are useful. This can be as simple as prebuilt workflows for applying and evaluating forecasting models or it can be something more customized, like an interactive learning and labeling tool built for a particular computer vision application. We want to support all of these use cases by allowing platform developers to leverage Michelangelo’s underlying infrastructure.

我们还发现，对于某些问题领域，专门的开发体验很有用。它可以简单到一套用于应用和评估预测模型的预构建工作流，也可以是更加定制化的东西，例如为特定计算机视觉应用构建的交互式学习与标注工具。我们希望通过允许平台开发人员利用 Michelangelo 的底层基础设施来支持所有这些用例。

To address these issues, we are in the process of factoring out Michelangelo’s infrastructure into an explicit infrastructure layer and are making that infrastructure available for teams to leverage to build more specialized platforms, for example, for NLP or Vision. Once this is done, we will have two customer groups: the model builders who use the Michelangelo platform to build and deploy models, and ML systems builders who use the Michelangelo infrastructure components to build bespoke solutions or more specialized platforms.

为了解决这些问题,我们正在将 Michelangelo 的基础设施分解为一个显式的基础设施层,并使团队可以利用该基础设施来构建更专业的平台,例如 NLP 或 Vision。 一旦完成,我们将拥有两个客户群:使用 Michelangelo 平台构建和部署模型的模型构建者,以及使用 Michelangelo 基础设施组件构建定制解决方案或更专业平台的 ML 系统构建者。

4. Key lessons learned

Building Michelangelo and helping to scale machine learning across Uber over the last three years, we have learned a lot from our successes and failures. In some cases, we got things right the first time, but more frequently, it took a few iterations to discover what works best for us.

  • Let developers use the tools that they want. This is an area where it took us several iterations over as many years to figure out the right approach. When we started Michelangelo, we focused first on the highest scale use cases because this was where we could have the most impact. While the early focus on Scala, config, and UI-based workflows allowed us to support the high scale use cases, it led us away from the mature, well-documented programmatic interfaces that model developers are already proficient with. Since modeling work is very iterative, it ended up being important to focus on developer velocity (and therefore fast iteration cycles) along with high scalability. We have landed on a hybrid solution where we offer a high scale option that is a bit harder to use and a lower scale system that is very easy to use.
  • Data is the hardest part of ML and the most important piece to get right. Modelers spend most of their time selecting and transforming features at training time and then building the pipelines to deliver those features to production models. Broken data is the most common cause of problems in production ML systems. At Uber, our feature store addresses many of these issues by allowing modelers to easily share high-quality features, automatically deploy the production pipelines for those features, and monitor them over time. Once features are defined, we can also leverage Uber’s data quality monitoring tools to ensure that the feature data is correct over time.
  • It can take significant effort to make open source and commercial components work well at scale. Apache Spark and Cassandra are both popular and mature open source projects; however, it took more than a year of sustained effort in each case to make them work reliably at Uber’s scale. We had similar challenges with commercial toolkits that we tried early on and ended up abandoning.
  • Develop iteratively based on user feedback, with the long-term vision in mind. As we built Michelangelo, we almost always worked closely with customer teams on new capabilities. We would solve the problem well for one team first, and once it was successfully running in production, we’d generalize for the rest of the company. This process ensured that the solutions we built were actually used. It also helped keep the team engaged and leadership supportive because they saw steady impact. At the same time, it’s important to reserve some time for long-term bets that users may not see on their horizon. For instance, we started working on deep learning tooling well before there was real demand; if we had waited, we would have been too late.
  • Real-time ML is challenging to get right. Most existing data tools are built for offline ETL or online streaming. There are no great tools yet that address the hybrid online/offline capabilities (batch, streaming, and RPC) required by real-time ML systems. This continues to be a big area of focus for us at Uber as part of our feature store work.

在过去三年构建 Michelangelo 并帮助机器学习在 Uber 规模化的过程中，我们从成功和失败中学到了很多。在某些情况下，我们第一次就做对了，但更常见的是，需要反复几次迭代才能找到最适合我们的方法。

  • 让开发人员使用他们想要的工具。在这个领域，我们花了多年时间、经过多次迭代才找到正确的方法。当我们启动 Michelangelo 时，我们首先关注最大规模的用例，因为这是我们能产生最大影响的地方。虽然早期对 Scala、配置和基于 UI 的工作流的关注使我们能够支持大规模用例，但它让我们偏离了模型开发人员已经熟练掌握的成熟、文档完善的编程接口。由于建模工作高度依赖迭代，因此在高可扩展性之外，关注开发人员速度（以及由此带来的快速迭代周期）最终被证明非常重要。我们最终采用了一种混合方案：提供一个使用门槛稍高的大规模选项，以及一个非常易用的小规模系统。
  • 数据是机器学习中最难的部分,也是最重要的部分。建模人员大部分时间都在训练时选择和转换特征,然后构建管道以将这些特征交付给生产模型。损坏的数据是生产 ML 系统中出现问题的最常见原因。在 Uber,我们的特征存储通过允许建模者轻松共享高质量特征、自动部署这些特征的生产管道并随着时间的推移对其进行监控,从而解决了其中的许多问题。一旦定义了特征,我们还可以利用 Uber 的数据质量监控工具来确保特征数据随着时间的推移是正确的。
  • 让开源和商业组件在大规模下良好运行可能需要付出巨大努力。Apache Spark 和 Cassandra 都是流行且成熟的开源项目；然而，每一个都需要一年多的持续努力才能在 Uber 的规模上可靠地工作。我们在早期尝试并最终放弃的商业工具包上也遇到了类似的挑战。
  • 根据用户反馈迭代开发,并牢记长期愿景。在我们构建 Michelangelo 时,我们几乎总是与客户团队密切合作开发新功能。我们会先为一个团队很好地解决问题,一旦它在生产中成功运行,我们就会推广到公司的其他人。这个过程确保了我们构建的解决方案得到实际使用。它还有助于保持团队参与和领导支持,因为他们看到了稳定的影响。同时,为用户可能看不到的长期赌注预留一些时间也很重要。例如,我们在真正有需求之前就开始研究深度学习工具;如果我们等了,那就太晚了。
  • 实时机器学习很难做到正确。大多数现有的数据工具都是为离线 ETL 或在线流式处理而构建的。目前还没有出色的工具能够解决实时 ML 系统所需的在线/离线混合能力（批处理、流式处理和 RPC）。作为特征存储工作的一部分，这仍然是我们在 Uber 重点关注的一个领域。