Introducing Palantir’s Next-Gen Pipeline Builder

Since our inception, we’ve had the privilege of working on some of the most complex, challenging, data-driven missions in the world. This has required us to build software that enables a diverse range of problem-solvers across ever-changing geopolitical, technological, and economic landscapes. Over the past two decades, these experiences have taught us the foundational importance of robust data integration, and what is required to power it — at scale.

We’ve observed that conventional approaches to data integration assume a myriad of trade-offs, such as:

(1) Speed vs. Robustness: It is assumed that moving quickly against complex data challenges requires a zero-sum trade against robustness and stability. As the criticality of data outputs rises, it becomes more difficult to deliver timely pipelines — especially in settings where security and compliance are non-negotiable.

(2) Democratization vs. Sophistication: There is a growing desire to enable “citizen” data engineers and analysts, with applications and interfaces that are tuned for their skills and expertise. Yet a lack of governance and oversimplified tooling means that low/no-code solutions fail to meet the bar for production-grade data pipelining work.

(3) Efficiency vs. Flexibility: As data engineering work rises in complexity, there is a need to build declarative, streamlined experiences that provide small teams with technical leverage as they build, tune, and scale pipelines to the nuanced requirements of their domains. However, it is assumed that this streamlining comes at the expense of both optionality and the ability to switch among storage and compute paradigms.

Over the past few years, amid an increasingly volatile world demanding agile responses, it was no longer feasible to entertain these (false) trade-offs. Entire supply chains had to be stood up in days, such as those used to distribute every vaccine throughout the US and UK; simultaneous demand and supply shocks had to be assessed in near-real-time, tapping into data landscapes that were constantly in flux; sensor and IoT data had to be fused with a wide range of structured data to support ever-fluid operations across infrastructure providers — among countless other examples, across the public and private sectors.

These experiences served as the crucible for our engineering teams, requiring us to reimagine Palantir Foundry’s underlying data integration architecture from first principles. The result is what we’re excited to share a first glimpse of here: Foundry’s Next-Generation Pipeline Builder.

Enabling Every Stakeholder

Pipeline Builder is designed to enable fast, flexible, and scalable delivery of data pipelines — providing fluidity and time-to-value while simultaneously enforcing robustness and security. Technical users can build and maintain pipelines more rapidly than ever before, focusing on declarative descriptions of their end-to-end pipelines and desired outputs. Pipeline Builder provides managed, serverless support for batch, micro-batch, and streaming workflows across the full range of target outputs. Builds, refreshes, and other orchestrations are all handled behind the scenes.
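
To make the declarative style concrete, here is a minimal sketch of what describing a pipeline by its source, transform steps, and target output might look like. The names used (`Pipeline`, `Step`, `transform`, the dataset paths) are illustrative assumptions, not Pipeline Builder’s actual API.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical, illustrative sketch only -- not Pipeline Builder's real interface.

@dataclass
class Step:
    name: str
    fn: Callable  # a pure transform over a list of row dicts

@dataclass
class Pipeline:
    source: str                 # where input rows come from
    target: str                 # named output the platform would materialize
    steps: list = field(default_factory=list)

    def transform(self, name: str):
        """Register a transform step declaratively."""
        def register(fn):
            self.steps.append(Step(name, fn))
            return fn
        return register

    def run(self, rows):
        # In a managed service, orchestration (batch vs. micro-batch vs. streaming,
        # retries, scheduling) would happen behind the scenes; here we simply
        # fold the declared steps over the rows.
        for step in self.steps:
            rows = step.fn(rows)
        return rows

pipeline = Pipeline(source="raw/shipments", target="clean/shipments")

@pipeline.transform("drop_cancelled")
def drop_cancelled(rows):
    return [r for r in rows if r["status"] != "cancelled"]

@pipeline.transform("normalise_weight")
def normalise_weight(rows):
    return [{**r, "weight_kg": round(r["weight_lb"] * 0.4536, 2)} for r in rows]

print(pipeline.run([
    {"status": "shipped",   "weight_lb": 10.0},
    {"status": "cancelled", "weight_lb": 3.0},
]))
```

The point of the sketch is that the user only declares what each step does; when, where, and how the steps execute is left to the managed service.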

Moreover, Pipeline Builder’s intelligent point-and-click, form-based interface enables citizen data engineers and less technical users to create pipelines without getting “stuck” in a simplified mode. Every pipeline, whether no-code or low-code, leverages the same git-style change management, data health checks, multi-modal security, and fine-grained auditing that permeate every other service across Foundry. Diverse teams can focus on describing business logic, rather than worrying about obtuse implementation details.

Beyond democratizing data integration, the vision for Pipeline Builder was to rethink what better data pipelines can look like. Distilling our experiences across industries, we engineered a next-generation data transformation back-end, which acts as an intermediary between logic creation and the execution substrate. As users describe the pipeline(s) they wish to build, Pipeline Builder’s back-end writes the transform code and automatically performs checks on pipeline integrity — proactively identifying and refactoring errors, and offering solutions to ensure healthy ongoing builds.
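
As a rough illustration of that intermediary role, the sketch below validates a toy declarative step description against the known input schema before emitting any executable transform. The intermediate representation, schema, and helper names are assumptions for illustration, not Pipeline Builder’s real internals.

```python
# A minimal sketch, assuming a toy intermediate representation for steps.

INPUT_SCHEMA = {"order_id": "string", "qty": "int", "unit_price": "double"}

STEPS = [
    {"op": "derive", "name": "total", "inputs": ["qty", "unit_price"]},
    {"op": "drop",   "name": "unit_price"},
]

def check_and_compile(schema, steps):
    """Propagate the schema through each declared step, failing fast on
    references to columns that do not exist, then emit row-level callables."""
    compiled, current = [], dict(schema)
    for step in steps:
        if step["op"] == "derive":
            missing = [c for c in step["inputs"] if c not in current]
            if missing:
                raise ValueError(f"step '{step['name']}' references missing columns: {missing}")
            current[step["name"]] = "double"
            compiled.append(lambda rows, s=step: [
                {**r, s["name"]: r[s["inputs"][0]] * r[s["inputs"][1]]} for r in rows
            ])
        elif step["op"] == "drop":
            if step["name"] not in current:
                raise ValueError(f"cannot drop unknown column '{step['name']}'")
            current.pop(step["name"])
            compiled.append(lambda rows, s=step: [
                {k: v for k, v in r.items() if k != s["name"]} for r in rows
            ])
    return compiled

rows = [{"order_id": "A1", "qty": 3, "unit_price": 2.5}]
for fn in check_and_compile(INPUT_SCHEMA, STEPS):
    rows = fn(rows)
print(rows)  # [{'order_id': 'A1', 'qty': 3, 'total': 7.5}]
```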

Collapsing False Trade-offs

Pipeline Builder represents a paradigm shift in data integration, designed to enable both a better building experience and better resulting data pipelines.

Supercharged Development Speed
Writing data pipelines in Pipeline Builder is significantly faster than authoring them from scratch. Users can immediately begin applying transforms to data (structured, unstructured, IoT, etc.) without needing to instantiate environments and layer on boilerplate code. The separation of data computation and schema computation means that users receive instant feedback on any builds that don’t meet target schemas or output “contracts”. Moreover, debugging is much faster, leveraging column provenance across the entire pipeline, which allows users to see precisely which pieces of logic affect target output columns.
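
A minimal sketch of what separating schema computation from data computation can look like, under assumed toy step and contract formats: the output schema is derived from the declared steps alone and compared against a target “contract”, so a mismatch is flagged before any rows are read. None of these names are Pipeline Builder’s actual constructs.

```python
# Hypothetical sketch: schema-only propagation, no data is touched.

INPUT_SCHEMA    = {"order_id": "string", "qty": "int", "unit_price": "double"}
TARGET_CONTRACT = {"order_id": "string", "total": "double", "currency": "string"}

steps = [
    {"op": "derive", "name": "total", "type": "double"},
    {"op": "drop",   "name": "qty"},
    {"op": "drop",   "name": "unit_price"},
]

def output_schema(schema: dict, steps: list) -> dict:
    """Propagate only the schema through each step; no rows are processed."""
    current = dict(schema)
    for step in steps:
        if step["op"] == "derive":
            current[step["name"]] = step["type"]
        elif step["op"] == "drop":
            current.pop(step["name"], None)
    return current

produced = output_schema(INPUT_SCHEMA, steps)
missing = {c: t for c, t in TARGET_CONTRACT.items() if produced.get(c) != t}
print(missing)  # {'currency': 'string'} -- flagged without running the pipeline
```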

Future-Proofed Versatility
Pipeline Builder is designed to fit any data integration need, and comes with hundreds of commonly used functions as well as the capability to build new functions for a variety of deployment paradigms. Critically, Pipeline Builder is not tied to any specific back-end engine; different execution engines can be employed for different requirements (e.g., in-memory engines for small-scale data; Spark for classic batch-based pipelines; Flink for low-latency, streaming pipelines). The decoupling of schema computation and data computation provides a consistent, fluid experience — regardless of the underlying compute modality.
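
To illustrate the decoupling in spirit, here is a sketch in which the same logical plan can be dispatched to different execution back-ends. Only the in-memory path is implemented; the Spark and Flink branches are placeholders marking where real engine bindings would plug in. All names here are illustrative assumptions.

```python
from typing import Callable, Iterable

# Hypothetical sketch of engine decoupling; not Pipeline Builder's internals.
LogicalPlan = list[Callable[[dict], dict]]

def run_in_memory(plan: LogicalPlan, rows: Iterable[dict]) -> list[dict]:
    out = []
    for row in rows:
        for step in plan:
            row = step(row)
        out.append(row)
    return out

def execute(plan: LogicalPlan, rows, mode: str):
    if mode == "small":          # tiny data: plain in-memory execution
        return run_in_memory(plan, rows)
    if mode == "batch":          # large batch: translate the plan to a Spark job here
        raise NotImplementedError("bind plan to a Spark job")
    if mode == "streaming":      # low latency: translate the plan to a Flink job here
        raise NotImplementedError("bind plan to a Flink job")
    raise ValueError(f"unknown mode: {mode}")

plan: LogicalPlan = [
    lambda r: {**r, "qty": int(r["qty"])},
    lambda r: {**r, "total": r["qty"] * r["price"]},
]
print(execute(plan, [{"qty": "3", "price": 2.5}], mode="small"))
```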

Seamless Security
Pipeline Builder harnesses Foundry’s built-in, world-class security primitives — which are relied upon in some of the most sensitive, complex data environments throughout the world. Data that mixes role-, marking-, and purpose-based access control paradigms can be leveraged by all types of authorized user personas (not just technical users). Out-of-the-box functions, such as those for Palantir’s “Cipher” capability for obfuscation and selective revelation, can be enabled with just a few clicks.

Connecting No-, Low-, and Pro-Code
Unlike common no/low-code paradigms, Pipeline Builder provides a best-in-class experience that is intended to coexist within a rich ecosystem that includes code-based development. It leverages the cornerstones of enterprise data ops — such as robust version control capabilities and real-time alerting on merge conflicts — which allow for concurrent, production-grade pipelining work. Moreover, pipelines that have been stood up can be easily templated and reused for repeat integrations, and connected with traditional code-based pipelines.
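
As a rough sketch of what templating a pipeline for repeat integrations could look like, the snippet below stamps out the same logical shape for two different sources. The template function, dataset paths, and step format are hypothetical, used only to illustrate the idea of reuse.

```python
# Hypothetical sketch of pipeline templating; names are illustrative only.

def standardise_feed_template(source: str, target: str, rename: dict) -> dict:
    """Return a pipeline description that reads `source`, renames columns
    to a shared convention, and writes `target`."""
    return {
        "source": source,
        "steps": [{"op": "rename", "mapping": rename}],
        "target": target,
    }

# Two repeat integrations instantiated from one template.
eu_feed = standardise_feed_template("raw/eu_orders", "clean/orders_eu",
                                    {"bestellnummer": "order_id"})
us_feed = standardise_feed_template("raw/us_orders", "clean/orders_us",
                                    {"order_no": "order_id"})
print(eu_feed)
print(us_feed)
```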

Efficiency at Scale
As data integration tasks grow in complexity, it is critical to keep costs under control and facilitate streamlined governance processes. By default, Pipeline Builder will only materialize what’s necessary to fulfill the specified targets of the pipelines. In many cases, this means far lower compute and storage costs. As Palantir continues to iterate on optimizations for each underlying compute engine, often in tandem with the open-source community, these efficiencies are baked into Pipeline Builder’s library of functions.
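
A minimal sketch of target-driven materialization, assuming a toy lineage format: given the columns a target output actually needs, steps whose results are never used are pruned before any compute is spent. The step names and structure are hypothetical.

```python
# Hypothetical sketch: prune steps that do not contribute to the target columns.

steps = [
    ("parse_time",   "event_time",  {"raw_time"}),
    ("geocode_site", "site_coords", {"site_id"}),   # expensive, possibly unused
    ("compute_sla",  "sla_breach",  {"event_time"}),
]

def prune(steps, target_columns: set) -> list:
    """Walk the lineage backwards from the target columns and keep only the
    steps needed to produce them."""
    needed, kept = set(target_columns), []
    for name, out_col, in_cols in reversed(steps):
        if out_col in needed:
            kept.append(name)
            needed |= in_cols
    return list(reversed(kept))

# The target only needs SLA breaches, so the geocoding step is never materialized.
print(prune(steps, {"sla_breach"}))  # ['parse_time', 'compute_sla']
```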

Looking Forward

We’re excited to share more about Pipeline Builder in the near future, including deeper looks at the underlying architecture, the different patterns of interoperability and extensibility, and the real-world use cases that are already demonstrating impact at scale.
