<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Building Data Products]]></title><description><![CDATA[Ready to build data products that make an impact? This blog guides you through validating ideas, creating MVPs, and scaling real solutions. Let’s turn your vision into action! 🚀]]></description><link>https://building-data-products.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1734587629867/ad0e8e7b-ce3a-4f72-a953-bed4922b386d.png</url><title>Building Data Products</title><link>https://building-data-products.com</link></image><generator>RSS for Node</generator><lastBuildDate>Mon, 13 Apr 2026 21:53:58 GMT</lastBuildDate><atom:link href="https://building-data-products.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[7 Proven Best Practice to Master DataOps Architecture for Seamless Automation and Scalability]]></title><description><![CDATA[DataOps is revolutionizing the way businesses manage and deploy data workflows, ensuring error-free production and faster deployment cycles. BERGH et al. (2019) outlined seven key best practices to implement a robust DataOps architecture. 
These steps...]]></description><link>https://building-data-products.com/7-proven-best-practice-to-master-dataops-architecture-for-seamless-automation-and-scalability</link><guid isPermaLink="true">https://building-data-products.com/7-proven-best-practice-to-master-dataops-architecture-for-seamless-automation-and-scalability</guid><category><![CDATA[dataops]]></category><category><![CDATA[automation]]></category><category><![CDATA[ci-cd]]></category><category><![CDATA[version control]]></category><category><![CDATA[containerization]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data analytics]]></category><category><![CDATA[GitHub]]></category><category><![CDATA[Environment variables]]></category><dc:creator><![CDATA[Niels Humbeck]]></dc:creator><pubDate>Wed, 18 Dec 2024 12:10:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734587009768/c28b6fae-c34a-444c-9e86-13e3005acbd0.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>DataOps is revolutionizing the way businesses manage and deploy data workflows, ensuring error-free production and faster deployment cycles. BERGH et al. (2019) outlined seven key best practices to implement a robust DataOps architecture. These steps, when executed effectively, can boost team productivity, enhance automation, and mitigate risks in data projects. Let’s dive into these practices and uncover how they can help businesses thrive in the era of DataOps.</p>
<p><strong>1. Add Data and Logic Tests</strong><br />Preventing errors in production and ensuring high-quality data are critical in DataOps. Automated tests build confidence that changes won’t negatively impact the system. Tests should be added incrementally with each new code iteration.</p>
<p>Key types of code testing include:</p>
<ul>
<li><p><strong>Unit Tests</strong>: Target “individual methods and functions of classes, components, or modules”; they are cheap and easy to automate (Pittet, 2022).</p>
</li>
<li><p><strong>Integration Tests</strong>: Ensure proper integration of services or modules.</p>
</li>
<li><p><strong>Functional Tests</strong>: Focus on the fulfillment of user or business requirements by “verifying the output of an action”; they “do not check the intermediate states of the system when performing that action” (Pittet, 2022).</p>
</li>
</ul>
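<p>As a minimal sketch of how such tests look in practice (the function names, fields, and checks below are illustrative assumptions, not from the cookbook), a unit test and a data test can both be plain Python assertions that run on every code iteration:</p>

```python
# Minimal sketch of automated logic and data tests (illustrative names).
def transform(records):
    """Example logic under test: keep valid rows and normalize amounts."""
    return [{"id": r["id"], "amount": round(r["amount"], 2)}
            for r in records if r["amount"] >= 0]

def test_logic_drops_negative_amounts():
    # Unit test: targets one function, cheap and easy to automate.
    assert transform([{"id": 1, "amount": -5.0}]) == []

def test_data_has_no_duplicate_ids():
    # Data test: a quality check that can run with each pipeline iteration.
    rows = transform([{"id": 1, "amount": 9.991}, {"id": 2, "amount": 3.5}])
    ids = [r["id"] for r in rows]
    assert len(ids) == len(set(ids))
```

<p>Tests like these are typically collected by a runner such as pytest and executed automatically before every deployment.</p>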
<p><strong>2. Use Version Control Systems</strong><br />Version control systems, such as Git, are central to any software project and especially crucial in DataOps. Key benefits include:</p>
<ul>
<li><p>Saving code in a known repository for easy rollback during emergencies.</p>
</li>
<li><p>Enabling teams to work in parallel by committing and pushing changes independently.</p>
</li>
<li><p>Supporting <strong>branch and merge</strong> workflows, which allow developers to create separate branches for testing new features without affecting production code.</p>
</li>
</ul>
<p>By isolating development efforts, version control simplifies collaboration and increases productivity.</p>
<p><strong>3. Multiple Environments for Development and Production</strong><br />Using separate environments for development and production is essential. These environments act as isolated spaces, ensuring changes in one do not affect the other.</p>
<ul>
<li><p>The production environment can serve as a template for development, enabling seamless transfer of configurations.</p>
</li>
<li><p>This approach supports continuous integration and continuous deployment (CI/CD) by ensuring configurations match across environments without manual intervention.</p>
</li>
</ul>
<p><strong>4. Reusability and Containerization</strong><br />Containers streamline microservices by isolating tasks and defining clear input/output relationships (e.g., REST APIs). Benefits include:</p>
<ul>
<li><p>Increased <strong>maintainability</strong>: Changes in one container do not affect others.</p>
</li>
<li><p><strong>Scalability</strong>: Containers can balance load by replicating when data volumes surge.</p>
</li>
</ul>
<p><strong>5. Parameterization</strong></p>
<p>Parameterizing workflows improves efficiency by tailoring deployments based on specific requirements. For instance, parameterized configurations can adapt seamlessly between development and production environments.</p>
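<p>A minimal sketch of such parameterization (the parameter names and values are assumptions for illustration, not a prescribed configuration): the same code selects the right parameter set for the target environment, so nothing needs to be edited when moving from development to production.</p>

```python
import os

# Illustrative parameter sets per environment (values are assumptions).
CONFIG = {
    "dev":  {"db_host": "localhost",           "batch_size": 100,   "alerts": False},
    "prod": {"db_host": "db.internal.example", "batch_size": 10000, "alerts": True},
}

def load_config(env=None):
    """Pick the parameter set for the target environment (default: dev)."""
    env = env or os.environ.get("APP_ENV", "dev")
    return CONFIG[env]
```

<p>For example, <code>load_config("prod")["batch_size"]</code> returns the production batch size, while a CI/CD pipeline would typically set <code>APP_ENV</code> and call <code>load_config()</code> without arguments.</p>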
<p><strong>6. Automation and Orchestration of Data Pipelines</strong></p>
<p>Automation and orchestration ensure a smooth, automated data flow from ingestion to delivery. Are you interested in more information about data pipelines? Take a look here: <a target="_blank" href="https://building-data-products.com/mastering-data-pipelines-the-secret-to-fast-and-reliable-data-operations">LINK</a></p>
<p><strong>7. Avoid Fear and Heroism in DataOps Projects</strong></p>
<p>With increased automation (e.g., infrastructure provisioning and workflow orchestration) and automated testing of code, firefighting activities can be reduced significantly. Heroism, like working on weekends, as well as fear, or simply hoping the production model will not crash, can be avoided. The result is a stable, reliable production environment and long-term retention of technical talent.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Implementing these seven best practices is essential for building a successful DataOps architecture. From automated testing to containerization, these strategies empower teams to work more efficiently, reduce risks, and achieve scalable, error-free deployments. By adopting these steps, businesses can unlock the full potential of DataOps and stay competitive in a data-driven world.</p>
<h2 id="heading-sources">Sources:</h2>
<ul>
<li><p>Bergh, C., Benghiat, G., &amp; Strod, E. (2019). The DataOps Cookbook: Methodologies and Tools That Reduce Analytics Cycle Time While Improving Quality (2nd ed.). <a target="_blank" href="https://datakitchen.io/the-dataops-cookbook/">The DataOps Cookbook | DataKitchen</a></p>
</li>
<li><p>Pittet, S. (2022). The different types of software testing. Atlassian. Retrieved Dec 20, 2024, from <a target="_blank" href="https://www.atlassian.com/continuous-delivery/software-testing/types-of-software-testing">https://www.atlassian.com/continuous-delivery/software-testing/types-of-software-testing</a></p>
</li>
<li><p>Densmore, J. (2021). Data Pipelines Pocket Reference: Moving and Processing Data for Analytics. O’Reilly: <a target="_blank" href="https://www.oreilly.com/library/view/data-pipelines-pocket/9781492087823/">Data Pipelines Pocket Reference[Book]</a></p>
</li>
<li><p>Raj, A., Bosch, J., Olsson, H. H. &amp; Wang, T. J. (2020). Modelling Data Pipelines. Proceedings - 46th Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2020, 13 20. <a target="_blank" href="https://doi.org/10.1109/SEAA51224.2020.00014">https://doi.org/10.1109/SEAA51224.2020.00014</a></p>
</li>
<li><p>Gupta, S. (2020). Architecture for High-Throughput Low-Latency Big Data Pipeline on Cloud. <a target="_blank" href="https://towardsdatascience.com/scalable-efficient-big-data-analytics-machine-learning-pipeline-architecture-on-cloud-4d59efc092b5">Scalable Efficient Big Data Pipeline Architecture | Towards Data Science</a></p>
</li>
<li><p>Oleghe, O., &amp; Salonitis, K. (2020). A framework for designing data pipelines for manufacturing systems. Procedia CIRP, 93, 724–729. <a target="_blank" href="https://doi.org/10.1016/j.procir.2020.04.016">https://doi.org/10.1016/j.procir.2020.04.016</a></p>
</li>
<li><p>Matskin, M., Tahmasebi, S., Layegh, A., Payberah, A. H., Thomas, A., Nikolov, N., &amp; Roman, D. (2021). A Survey of Big Data Pipeline Orchestration Tools from the Perspective of the DataCloud Project *. <a target="_blank" href="https://www.researchgate.net/publication/357340660_A_Survey_of_Big_Data_Pipeline_Orchestration_Tools_from_the_Perspective_of_the_DataCloud_Project">(PDF) A Survey of Big Data Pipeline Orchestration Tools from the Perspective of the DataCloud Project *</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Mastering Data Pipelines: The Secret to Fast and Reliable Data Operations]]></title><description><![CDATA[In today’s data-driven world, data pipelines are the backbone of efficient and scalable DataOps. These pipelines are vital for managing both data and code, automating complex workflows, and minimizing manual data handling. Data pipelines can be descr...]]></description><link>https://building-data-products.com/mastering-data-pipelines-the-secret-to-fast-and-reliable-data-operations</link><guid isPermaLink="true">https://building-data-products.com/mastering-data-pipelines-the-secret-to-fast-and-reliable-data-operations</guid><category><![CDATA[dataops]]></category><category><![CDATA[#DataPipelines]]></category><category><![CDATA[ETL]]></category><category><![CDATA[big data]]></category><category><![CDATA[automation tools]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[workflow-orchestration]]></category><category><![CDATA[data flow ]]></category><category><![CDATA[Workflow Automation]]></category><category><![CDATA[workflow]]></category><dc:creator><![CDATA[Niels Humbeck]]></dc:creator><pubDate>Wed, 18 Dec 2024 11:57:18 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734586989969/eaa14796-da04-4e24-82a8-9a9c47a8ac81.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today’s data-driven world, data pipelines are the backbone of efficient and scalable DataOps. These pipelines are vital for managing both data and code, automating complex workflows, and minimizing manual data handling. Data pipelines can be described as small data refineries. But why are they so critical, and how can businesses leverage them for a competitive edge? Let’s explore the role of data pipelines in enabling DataOps and transforming raw data into actionable insights.</p>
<h2 id="heading-the-power-of-data-pipelines">The Power of Data Pipelines</h2>
<p>Data pipelines automate the flow of data, whether in batch or stream, from source to destination. They eliminate manual data handling, reduce repetitive tasks, and foster collaboration, automation, and continuous improvement (Atwal, 2020). Robust, reusable, and scalable data pipelines enhance efficiency by minimizing non-value-adding activities for data scientists and analysts. This translates to higher productivity, faster project deployment, and greater employee satisfaction. Data pipelines handle the entire data science lifecycle: from data ingestion to processing. In essence, a well-designed pipeline acts as a fully automated data refinery, transforming raw data into actionable insights that drive meaningful business impact.</p>
<p>Workflow orchestration tools support the description and execution of end-to-end data pipelines (Matskin et al., 2021). They are applied in different areas, such as automating business workflows, automating scientific workflows, or more generalized big data orchestration (Matskin et al., 2021). Recently, many workflow orchestration tools for business processes have been developed, such as Airflow, Prefect, Kedro, or Dagster. With these new tools, new opportunities arise for companies to automate the refinement of raw data into meaningful insights. Selecting the best-fitting workflow orchestration tool for describing and executing data pipelines is therefore crucial for the development and productionizing of data products and can lead to a competitive advantage.</p>
<h2 id="heading-conceptual-model-for-data-pipelines">Conceptual Model for Data Pipelines</h2>
<p>Raj et al. describe a conceptual model for data pipelines (Raj et al., 2020). The model contains nodes and connectors which together build an end-to-end data pipeline. Nodes perform a specified activity to manipulate data, like aggregating or joining datasets. Connectors connect nodes with each other: the output of one node is the input of the next node downstream. The first node is the source node and the last node downstream is the sink node. The figure below shows a meta model for nodes and connectors. This generic toolkit serves as a template to design specific data pipelines; it is not exhaustive and can be extended as needed.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734522454671/920de92e-0bd6-4067-bc2d-595040789c28.png" alt class="image--center mx-auto" /></p>
<p><em>Own representation (changed) based on (Raj et al., 2020,p. 17)</em></p>
<p>For each specific use case, the nodes and connectors are chosen from the template. If necessary, additional nodes and connectors can be added. The nodes can perform different tasks during the data flow. The given node types are explained in the following:</p>
<ul>
<li><p>The <strong>data generation</strong> node describes activities where data is generated. In a manufacturing environment, IoT sensors, logistics data, machine/maintenance/operator logs, PLC data (e.g., sensors), ERP data (e.g., manual entries in SAP), and quality data are frequent sources of data generation. Other sources can be cameras (Oleghe &amp; Salonitis, 2020).</p>
</li>
<li><p><strong>Data collection</strong> nodes gather this data from the points of generation. The challenge is that the data is generated with huge volume, high variety, and high velocity from multiple sources. The data can be transmitted in batches, mini-batches, or continuously streamed; the type of transmission affects the data collection nodes.</p>
</li>
<li><p><strong>Data ingestion</strong> describes nodes which ingest data into a database or other applications depending on the data type (format, size, batch versus stream). Raw data from a data-generating node can be ingested, for example, into a data lake. Data can be ingested into a data pipeline or an application as well.</p>
</li>
<li><p><strong>Data processing</strong> nodes perform processing tasks, like aggregating or joining datasets, to solve a specific business task. A huge variety of tools and programs exists for processing data. For a yield optimization project, for example, parsing PLC text data into readable and understandable formats, creating batch family trees, or processing quality data is important.</p>
</li>
<li><p><strong>Data storage</strong> describes all tasks to store data. There are different types of storage: a data lake for raw data, a data warehouse for processed data, and NoSQL or relational databases for intermediate steps in the pipeline.</p>
</li>
<li><p><strong>Data labeling</strong> nodes describe tasks where data is labeled for supervised or reinforcement learning models.</p>
</li>
<li><p><strong>Data preprocessing</strong> tasks assure high data quality by handling missing values and outliers, splitting data into train, test, and validation sets, or performing other preprocessing steps like dimensionality reduction.</p>
</li>
</ul>
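<p>The node-and-connector model above can be sketched in a few lines of Python. This is a toy rendering of the concept, not Raj et al.'s implementation; all names and activities are illustrative assumptions:</p>

```python
class Node:
    """A pipeline step performing one specified activity on the data."""
    def __init__(self, name, activity):
        self.name = name
        self.activity = activity  # callable: input data -> output data

    def run(self, data):
        return self.activity(data)

def run_pipeline(nodes, data=None):
    # The hand-off between consecutive nodes plays the connector's role:
    # the output of one node is the input of the next node downstream.
    for node in nodes:
        data = node.run(data)
    return data

# Toy end-to-end pipeline: source node -> processing node -> sink node.
generate = Node("data generation", lambda _: [3, 1, 2])
process  = Node("data processing", lambda xs: sorted(xs))
store    = Node("data storage",    lambda xs: {"rows": xs})
```

<p>Calling <code>run_pipeline([generate, process, store])</code> moves the data from the source node downstream to the sink node. Real connectors would add authentication, monitoring, and retry behavior around each hand-off.</p>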
<p>Raj et al. specified an eighth node, data reception, which ingests data into the machine learning model or the given application (Raj et al., 2020). Its functionality is similar to the data ingestion node, so it is not considered further here. Nodes are connected via connectors, which transmit data from one node to the next. They can do that either by carrying the data themselves or indirectly, by pointing to a database or by connecting applications such as Kafka with other applications. Raj et al. proposed a guideline for data transports (see the figure above). Data transmissions are specified as continuous (also known as streaming data), mini-batch, or batch processing. Furthermore, Raj et al. differentiated between structured, labeled, and preprocessed data transmissions. Besides the data transport capability, connectors have various other functions. Data pipelines frequently use cloud services to store data, and connectors can handle authentication to external systems. Connectors can also monitor data pipelines and send alarms if an error occurs, validate data transfers, and mitigate pipeline failures, e.g., by specifying a ‘try again’ parameter or backup parameters.</p>
<p>A normal ETL job describes a data pipeline as well. More than one data pipeline can be managed, for example, by a workflow orchestration tool like Airflow. Like other modern workflow orchestration tools, Airflow uses a directed acyclic graph (DAG) to “represent the flow and dependencies of tasks in a pipeline” (Densmore, 2021, p.18). ‘Directed’ means that a task does not run before all the tasks it depends on have executed successfully. ‘Acyclic’ means no loops are allowed in the graph. An example DAG is displayed in the figure below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734522765424/24653a90-bedc-40ee-ae21-909e751957f5.png" alt class="image--center mx-auto" /></p>
<p><em>Own representation</em></p>
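<p>The dependency behavior described above, where a task only runs after all its upstream tasks have completed and cycles are rejected, can be sketched without any orchestration tool. This is a hypothetical mini-scheduler for illustration, not how Airflow is actually implemented; the task names are assumptions:</p>

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def run_dag(tasks, deps):
    """Run callables in an order where every task waits for its dependencies.

    tasks: {name: callable}; deps: {name: set of upstream task names}.
    TopologicalSorter raises CycleError if the graph is not acyclic.
    """
    order = list(TopologicalSorter(deps).static_order())
    results = {}
    for name in order:
        results[name] = tasks[name]()
    return order, results

# Toy DAG: extract -> transform -> load (illustrative task names).
tasks = {
    "extract":   lambda: [2, 1],
    "transform": lambda: "transformed",
    "load":      lambda: "loaded",
}
deps = {"transform": {"extract"}, "load": {"transform"}}
```

<p>Running <code>run_dag(tasks, deps)</code> executes the tasks in topological order; a real orchestrator adds scheduling, retries, and parallel execution of independent branches on top of this idea.</p>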
<p>ETL is an acronym: <strong>E</strong>xtracting data from various sources, <strong>T</strong>ransforming the raw data (e.g., cleaning, joining, or formatting), and <strong>L</strong>oading the data into the final destination. Depending on the sequence of the last two steps, the process is called ELT (when data is loaded first into the final destination and transformed there) or ETL (when data is first transformed and then loaded) (Densmore, 2021, p.21-29). A huge variety of tools allows performing ETL jobs; AWS Glue, for example, is a popular toolbox.</p>
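<p>The three ETL steps can be sketched in a few lines, using an in-memory SQLite database as the final destination. The source records, table, and columns are purely illustrative assumptions:</p>

```python
import sqlite3

def extract():
    # E: pull raw records from a source (hard-coded here for illustration).
    return [("2024-12-01", " widget ", 3), ("2024-12-01", "gadget", 5)]

def transform(rows):
    # T: clean the raw data, e.g. trimming product names.
    return [(day, name.strip(), qty) for day, name, qty in rows]

def load(rows, conn):
    # L: write the cleaned rows into the final destination.
    conn.execute("CREATE TABLE IF NOT EXISTS sales (day TEXT, product TEXT, qty INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
total = conn.execute("SELECT SUM(qty) FROM sales").fetchone()[0]  # 8
```

<p>Swapping the order of the last two calls, loading the raw rows first and transforming them inside the destination with SQL, would turn this ETL sketch into an ELT one.</p>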
<h2 id="heading-sources">Sources:</h2>
<ul>
<li><p>Atwal, H. (2020). Practical DataOps. In Practical DataOps. Apress. <a target="_blank" href="https://www.oreilly.com/library/view/practical-dataops-delivering/9781484251041/">Practical DataOps: Delivering Agile Data Science at Scale[Book]</a></p>
</li>
<li><p>Matskin, M., Tahmasebi, S., Layegh, A., Payberah, A. H., Thomas, A., Nikolov, N., &amp; Roman, D. (2021). A Survey of Big Data Pipeline Orchestration Tools from the Perspective of the DataCloud Project. <a target="_blank" href="https://www.researchgate.net/publication/357340660_A_Survey_of_Big_Data_Pipeline_Orchestration_Tools_from_the_Perspective_of_the_DataCloud_Project">(PDF) A Survey of Big Data Pipeline Orchestration Tools from the Perspective of the DataCloud Project</a> <a target="_blank" href="https://datacloudproject.eu">https://datacloudproject.eu</a></p>
</li>
<li><p>Raj, A., Bosch, J., Olsson, H. H. &amp; Wang, T. J. (2020). Modelling Data Pipelines. Proceedings - 46th Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2020, 13 20. <a target="_blank" href="https://doi.org/10.1109/SEAA51224.2020.00014"><strong>https://doi.org/10.1109/SEAA51224.2020.00014</strong></a></p>
</li>
<li><p>Densmore, J. (2021). Data Pipelines Pocket Reference: Moving and Processing Data for Analytics. O’Reilly: <a target="_blank" href="https://www.oreilly.com/library/view/data-pipelines-pocket/9781492087823/"><strong>Data Pipelines Pocket Reference[Book]</strong></a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[From Data Lifecycle to DataOps Architecture]]></title><description><![CDATA[Data is a valuable asset for the most companies in the 21st century. Like other assets data needs to be managed over the whole lifecycle. Mismanagement of data can result in many risks like: data losses or breaches resulting in disclosure of private ...]]></description><link>https://building-data-products.com/from-data-lifecycle-to-dataops-architecture</link><guid isPermaLink="true">https://building-data-products.com/from-data-lifecycle-to-dataops-architecture</guid><category><![CDATA[TeamDataScienceProcess]]></category><category><![CDATA[dataops]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[agile methodology]]></category><category><![CDATA[data management]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[crud]]></category><category><![CDATA[technical-debt]]></category><dc:creator><![CDATA[Niels Humbeck]]></dc:creator><pubDate>Wed, 18 Dec 2024 11:29:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734586969009/0bd2669f-e87c-4838-888d-1721bffcf944.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Data is a valuable asset for the most companies in the 21st century. Like other assets data needs to be managed over the whole lifecycle. Mismanagement of data can result in many risks like: data losses or breaches resulting in disclosure of private information, incompliancy’s with regulations or data inconsistencies. Those risks can materialize in imminent risks for companies through e.g., reputation losses or costly legal disputes. In literature the management of data over the whole lifecycle is called data lifecycle management giving guidance how to manage date from the generation or entry in the data system though their usage to the disposal of those data. 
At the beginning of the 21st century, many different data lifecycle management approaches were developed, differing in detail level and purpose. El Arass, Tikito, and Souissi analyzed the 12 most important data lifecycle management models: information pyramid, <strong>CRUD</strong> lifecycle, lifecycle for big data, IBM lifecycle, DataOne lifecycle, <strong>information lifecycle</strong>, CIGREF, DDI lifecycle, USGS lifecycle, PII lifecycle, enterprise data lifecycle, and Hindawi lifecycle (el Arass et al., 2017). Let’s take a deep dive into three lifecycle management models:</p>
<h2 id="heading-common-lifecycle-management-models">Common Lifecycle Management Models</h2>
<h3 id="heading-crud">CRUD:</h3>
<p>CRUD is an acronym (Create, Read, Update, and Delete) describing the basic functions of a persistent database application, allowing a user to manage data over the whole data lifecycle (IONOS, 2020). The basic functions are displayed below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734520382436/4ba3e08e-cf86-4033-bf4c-6c5f2caf1d60.png" alt class="image--center mx-auto" /></p>
<p><em>Own representation based on (IONOS, 2020);</em></p>
<p>CRUD explains the basic functions needed for managing data over the lifecycle in an abstract way, without giving detailed step-by-step guidelines.</p>
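<p>The four CRUD operations map directly onto SQL statements. As a minimal sketch with Python's built-in sqlite3 module (the table and columns are illustrative assumptions):</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Create: insert a new record.
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Ada"))
# Read: retrieve the record.
name = conn.execute("SELECT name FROM users WHERE id = ?", (1,)).fetchone()[0]
# Update: modify the record.
conn.execute("UPDATE users SET name = ? WHERE id = ?", ("Ada L.", 1))
# Delete: remove the record, completing the data lifecycle.
conn.execute("DELETE FROM users WHERE id = ?", (1,))
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]  # 0
```

<p>Every persistent data store exposes these four functions in some form, whether as SQL statements, REST verbs, or API calls.</p>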
<h3 id="heading-the-information-lifecycle">The Information Lifecycle:</h3>
<p>The information lifecycle is a data lifecycle model developed by Lin et al. especially for cloud computing, mitigating the risks of data leakage and privacy loss (Lin et al., 2014). The lifecycle contains seven steps: generation of data in the cloud by authorized users; transmission of encrypted data, e.g., via VPNs or digital certificate mechanisms; storage of data; access to data by validating users’ identities; reuse of data; archiving of data; and disposal of data. This approach focuses strongly on data security in the cloud environment. Compared to the CRUD model, it is more fine-grained. The information lifecycle is displayed below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734520505722/90d1c7ea-1654-4ac4-b1fd-9c3ce0078040.png" alt class="image--center mx-auto" /></p>
<p><em>Own representation based on (Lin et al., 2014)</em></p>
<p>The lifecycle management of data is a basic part of the DataOps methodology, assuring data quality and data governance.</p>
<h3 id="heading-team-data-science-process-tdsp">Team Data Science Process (TDSP)</h3>
<p>The DataOps methodology facilitates the rapid development of minimum viable products with a short lead time. Data analytics as well as data science projects frequently have an iterative character. Best practices were developed to standardize and optimize data science projects. The management of data is one part of them, but the focus is clearly on how to conduct data science projects. A good example of such best practices is the Team Data Science Process (TDSP). This framework is a structured approach to conducting data science projects. The main components of the TDSP are displayed below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734520632364/5d222151-a6c3-4615-ab50-a3194303f9e1.png" alt class="image--center mx-auto" /></p>
<p><em>Own representation (changed) based on (Microsoft 2024; Thakurta &amp; McGehee, 2017)</em></p>
<p>The main part of the TDSP describes the data science lifecycle. Furthermore, guidelines for roles and responsibilities, a file system structure, and a toolbox (resources and infrastructure components) are given as well (Microsoft, 2024). The TDSP is a method to structure data science projects in a flexible and iterative way. Understanding the business needs is crucial for designing the right data science tools. To analyze and understand business needs, data scientists work together with subject matter experts. Based on the domain knowledge of the subject matter experts, the data scientists can outline the goals and deliverables of the data science project and define quantifiable key success measures, against which the success or failure of the project can be evaluated. According to the goal of the project, data is acquired and analyzed: detecting available data sources, creating data pipelines, assessing data quality, cleaning, data wrangling, and explorative data analysis. With the given data, the data science model is trained with meaningful input variables, which can be extracted from the raw data by feature engineering. The developed model is evaluated based on the key success measures; if the defined thresholds are reached, the model can be deployed to production. All of these steps are iterative and interconnected. For example, after feature extraction, the subject matter expert can be consulted, and given that feedback, the feature engineering can be adjusted (Microsoft, 2024).</p>
<p>An advantage in comparison to traditional project management methods like waterfall is that the project is conducted in iterative steps, so changes can be implemented even in late phases of the project. Thus, this approach is more agile. Nevertheless, several important points are not included in the TDSP project lifecycle:</p>
<h2 id="heading-challenges-in-common-data-lifecycle-management-tools">Challenges in common data lifecycle management tools</h2>
<p>The TDSP focuses mainly on the development of new data science projects. The models of legacy data science projects need to be updated to avoid model decay over time. Furthermore, customers demand new features, and deploying those features to production can be a high-risk event. The same applies to data analytics projects. Thus, the TDSP is a good practice for developing a data science model but is not suitable as a holistic approach to develop and deploy data science models or data analytics features to production. Historically, data architectures and infrastructures focused on production requirements like latency, load balancing, and robustness. Data science as well as data analytics projects require that new features or model updates are deployed into production to stay competitive. This requires a change of paradigm for the data architecture: “A DataOps Architecture makes the steps to change what is in the production a central idea” (Bergh et al., 2019, p.110). Without a suitable framework to deploy changes into production, deployment remains a high-risk operation. Over time, deployments get exponentially more complex with an increase of technical debt (Bergh et al., 2019). Technical debt is defined as “long term costs incurred by moving quickly in software engineering” (Sculley et al., 2015). As an example, the components of an infrastructure to deploy an ML model are depicted below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734520841418/8315fb69-bb73-4c6f-a8be-27f600d88343.png" alt class="image--center mx-auto" /></p>
<p><em>Own representation based on (Sculley et al., 2015)</em></p>
<p>The ML model is just a small part of the broader landscape. Changes in the ML model can affect other parts of the infrastructure as well. Deploying a new model manually results in a huge effort, since the other components may need to be updated too. Historically, the development environment often differed from the production environment, causing additional problems, since the configuration differs and cannot be tested beforehand. Deploying code into production for data analytics or data science projects without a proper deployment framework results in long cycle times, production downtimes, errors in production, and poor data quality. Thus, it poses a high risk for operations.</p>
<h2 id="heading-dataops-architecture">DataOps Architecture</h2>
<p>In the book “The DataOps Cookbook – Methodologies and Tools That Reduce Analytics Cycle Time While Improving Quality”, Bergh proposed a DataOps data architecture. The core idea of this architecture is to deploy code in data analytics and data science projects to production with low cycle time, high data quality, and low downtime risk. From a high-level perspective, the architecture “decouples operations from new analytics creation and rejoins them under the automated framework of continuous delivery” (Bergh et al., 2019). This decoupling happens through the separation of work into two pipelines, the value pipeline and the innovation pipeline:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734520923003/93668a2c-bb66-4d3b-bdd5-bf49ebf70270.png" alt class="image--center mx-auto" /></p>
<p><em>Own representation based on (Bergh et al., 2019, p.38)</em></p>
<p>The value pipeline describes the current production pipeline: raw data is ingested into the system, cleaned, and features are extracted and fed into the production system, where either an ML/AI model is deployed or advanced analytics operations are performed. Finally, the results can be visualized, and value is delivered to the end customer. The innovation pipeline focuses on the deployment of new production models. The process steps within each of the two pipelines can be executed either serially or in parallel, and both pipelines are completely controlled by code (infrastructure as code).</p>
<p>Behind the two pipelines is a sophisticated DataOps data architecture that uses separate environments. Development takes place in the development environment, which can be a copy of the production environment. This separation is beneficial because development does not impact production, and it simplifies deployment into production. The value pipeline runs in the production environment. The innovation pipeline automates the process from the idea, through the development of new features in the “dev” environment, to the successful testing and deployment of the new feature in the “prod” environment.</p>
<p>All processes are highly automated. The workflows in the value pipeline are automated, e.g., with a workflow orchestration tool or ETL pipelines. The deployment of a successfully developed new feature into the “prod” environment (CI/CD) is automated, including automated tests before deployment. All infrastructure is described in code (infrastructure as code), which enables automated provisioning of the infrastructure for both the “prod” and the “dev” environment. Code is stored in a version control system like Git. Metadata is stored to gain additional insights and to detect errors as early as possible. Secrets are used to handle authentication, and dedicated secret stores exist to ensure safe operation. Parameters simplify, for example, the automated provisioning of infrastructure. Data pipelines are key components of a DataOps architecture, and the next chapter focuses on them.</p>
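<p>To make the environment separation, parameters, and secrets concrete, here is a minimal Python sketch. Everything in it (the step list, <code>get_secret</code>, <code>build_pipeline_config</code>, the warehouse URLs) is an invented illustration of the pattern, not a real DataOps tool or API:</p>

```python
import os

# Illustrative sketch: one pipeline definition, parameterized per environment,
# with secrets coming from a dedicated store. All names here are assumptions.

PIPELINE_STEPS = ["ingest", "clean", "extract_features", "deploy_model", "visualize"]

def get_secret(name: str) -> str:
    """Stand-in for a dedicated secret store; environment variables
    are used here purely for illustration."""
    value = os.environ.get(name)
    if value is None:
        raise KeyError(f"secret {name!r} not configured")
    return value

def build_pipeline_config(environment: str) -> dict:
    """Build the config for one environment, so "dev" stays an isolated
    copy of "prod" and never touches production data."""
    assert environment in ("dev", "prod")
    return {
        "environment": environment,
        "steps": PIPELINE_STEPS,
        # dev runs on a small sample; prod on the full data
        "sample_fraction": 1.0 if environment == "prod" else 0.1,
        "warehouse_url": f"warehouse.{environment}.internal",
    }

os.environ["DB_PASSWORD"] = "dummy-value"  # simulated secret for the demo
print(get_secret("DB_PASSWORD"))                      # dummy-value
print(build_pipeline_config("dev")["warehouse_url"])  # warehouse.dev.internal
```

<p>The point of the pattern is that “dev” and “prod” differ only in parameters, so the same code provisions and runs both environments.</p>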
<p>Are you interested in how data pipelines work? Have a look here: <a target="_blank" href="https://building-data-products.com/mastering-data-pipelines-the-secret-to-fast-and-reliable-data-operations">Mastering Data Pipelines: The Secret to Fast and Reliable Data Operations</a></p>
<p>What are your thoughts on it? Does the DataOps architecture make sense for your Data Product?</p>
<h2 id="heading-sourceshttpsbuilding-data-productscommastering-data-pipelines-the-secret-to-fast-and-reliable-data-operations">Sources:</h2>
<ul>
<li><p>el Arass, M., Tikito, I., &amp; Souissi, N. (2017). Data lifecycles analysis: towards intelligent cycle: <a target="_blank" href="https://www.researchgate.net/publication/316191501_Data_lifecycles_analysis_Towards_intelligent_cycle">(PDF) Data lifecycles analysis: Towards intelligent cycle</a>; DOI:<a target="_blank" href="http://dx.doi.org/10.1109/ISACV.2017.8054938">10.1109/ISACV.2017.8054938</a></p>
</li>
<li><p>Lin, L., Liu, T., Hu, J., &amp; Zhang, J. (2014). A privacy-aware cloud service selection method toward data life-cycle. Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS, 2015-April, 752–759. <a target="_blank" href="https://doi.org/10.1109/PADSW.2014.7097878">https://doi.org/10.1109/PADSW.2014.7097878</a></p>
</li>
<li><p>Microsoft. (2024).</p>
<ul>
<li><p>Lebenszyklus des Team Data Science-Prozesses. <a target="_blank" href="https://docs.microsoft.com/de-de/azure/architecture/data-science-process/lifecycle">https://docs.microsoft.com/de-de/azure/architecture/data-science-process/lifecycle</a></p>
</li>
<li><p>Was ist der Team Data Science-Prozess (TDSP)? <a target="_blank" href="https://learn.microsoft.com/de-de/azure/architecture/data-science-process/overview">Was ist der Team Data Science-Prozess (TDSP)? - Azure Architecture Center | Microsoft Learn</a></p>
</li>
</ul>
</li>
<li><p>Thakurta, D., &amp; McGehee, H. (2017). Team Data Science Process: Roles and tasks. <a target="_blank" href="https://github.com/Azure/Microsoft-TDSP/blob/master/Docs/roles-tasks.md">https://github.com/Azure/Microsoft-TDSP/blob/master/Docs/roles-tasks.md</a></p>
</li>
<li><p>Bergh, C., Benghiat, G., &amp; Strod, E. (2019). The DataOps Cookbook: Methodologies and Tools That Reduce Analytics Cycle Time While Improving Quality (2nd ed.). <a target="_blank" href="https://datakitchen.io/the-dataops-cookbook/"><strong>The DataOps Cookbook | DataKitchen</strong></a></p>
</li>
<li><p>Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., &amp; Dennison, D. (2015). Hidden Technical Debt in Machine Learning Systems. <a target="_blank" href="https://papers.nips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf">Hidden Technical Debt in Machine Learning Systems</a></p>
</li>
<li><p>IONOS. (2024). CRUD: die Basis der Datenverwaltung. <a target="_blank" href="https://www.ionos.de/digitalguide/websites/web-entwicklung/crud-die-wichtigsten-datenbankoperationen/">CRUD: die Basis der Datenverwaltung - IONOS</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[DevOps - Bridging Development and Operations]]></title><description><![CDATA[DataOps, a fusion of "Data" and "Operations," addresses the challenges of developing data products by combining principles from Agile, DevOps, and Lean Manufacturing. It emphasizes collaboration, automation, and efficiency in handling data pipelines,...]]></description><link>https://building-data-products.com/devops-bridging-development-and-operations</link><guid isPermaLink="true">https://building-data-products.com/devops-bridging-development-and-operations</guid><category><![CDATA[Devops]]></category><category><![CDATA[Data Products]]></category><dc:creator><![CDATA[Niels Humbeck]]></dc:creator><pubDate>Wed, 18 Dec 2024 06:22:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734586948537/77f71c45-bf42-4da5-843d-8c743d8c5ec6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>DataOps, a fusion of "Data" and "Operations," addresses the challenges of developing data products by combining principles from Agile, DevOps, and Lean Manufacturing. It emphasizes collaboration, automation, and efficiency in handling data pipelines, enabling teams to deliver data products faster and with higher quality (Atwal, 2020, p.xxiii). If you are interested in the definition of DataOps, take a look at the following blog post: <a target="_blank" href="https://building-data-products.com/understanding-dataops">Click Link</a></p>
<p>The exponential growth of data in recent years has created both incredible opportunities and formidable challenges for organizations. The ability to harness data efficiently and translate it into actionable insights has become a cornerstone of competitive advantage. However, traditional methodologies, for example those used to develop software products, often fail to keep up with the speed and complexity of data product development.</p>
<p>DataOps inherits components from (DataKitchen, 2018):</p>
<ul>
<li><p><a target="_blank" href="https://building-data-products.com/introduction-lean-methodology"><strong>Lean Manufacturing</strong></a><strong>:</strong> Focusing on the value adding processes by eliminating waste leads to a more efficient utilization of resources, with higher quality and lower costs.</p>
</li>
<li><p><a target="_blank" href="https://building-data-products.com/introduction-into-agile-methodologies"><strong>Agile</strong></a>: Building the right product for the right people by increasing “the ability to react to unforeseen or volatile requirements regarding the functionality or the content“ of Data Products (Zimmer et al., 2015)</p>
</li>
<li><p><a target="_blank" href="https://building-data-products.com/devops-bridging-development-and-operations"><strong>DevOps</strong></a><strong>:</strong> Shared commitment towards the Data Product reduces information barriers (Culture of collaboration), automation (e.g., CI/CD pipelines), Infrastructure as Code and automated tests enable fast and reliable deployment of code into production in a high quality (Macarthy &amp; Bass, 2020b)</p>
</li>
</ul>
<p>Let's take a deep dive into the DevOps component:</p>
<h1 id="heading-devops">DevOps</h1>
<p>The performance of software development has increased significantly over the last 25 years. A key factor in this improvement, besides the adoption of agile software development, was the widespread adoption of the DevOps philosophy. The software development process is similarly complex to the manufacturing of physical goods. Instead of physically transforming raw material into a finished good, software developers write code (e.g., a new feature for an existing product). This code needs to be tested (quality and security) and deployed to production by IT operations. The process involves several dimensions of complexity, for example: code creation, organizational framework, tools, and methods.</p>
<h2 id="heading-challenges-in-the-traditional-software-development-process">Challenges in the traditional software development process</h2>
<p>Traditionally, releases of new versions of a software product were “high-stress affairs involving outages, firefighting, rollbacks, and occasionally much worse” and were conducted infrequently, every few years (Atwal, 2020). There was no mature, holistic approach to deal with the complexity involved. As a result, friction occurred in the development process: software developers and IT operators had different objectives. Software developers were eager to develop and test new products to fit customer demands, while IT operators focused on the stability of the production system. New releases frequently jeopardized that stability, leading operators to take a negative attitude toward new releases (König &amp; Kugel, 2019). The technical debt of dependent legacy systems (e.g., a monolithic architecture) increases the workload for IT operations, reducing the ability to test new features and increasing the risk of instability (Atwal, 2020, p.162). The figure below depicts, for illustration, the flow of code in traditional software development:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734499228983/7f756cd2-d12b-4ec3-8b62-5188f6e29e5a.png" alt class="image--center mx-auto" /></p>
<p><em>Traditional flow of code in software development (explanatory)</em></p>
<p>Software development often followed waterfall principles, meaning the work is executed sequentially. The software developers start working on the new release. After completing the entire code, it is passed, with some instructions, to the quality and security assurance colleagues in another department. After testing and approving the code, the package is passed to IT operations for deployment, and IT operations deploy the whole release at once. Infrastructure is provisioned manually, without significant automation, and an extensive amount of time is needed to resolve configurations and dependencies by hand.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734499357101/719dc96c-cb86-4069-8c5e-1493102006c7.png" alt class="image--center mx-auto" /></p>
<p>The user requirements are specified in the very first phase; changing requirements or testing MVPs (minimum viable products) is rarely possible. Furthermore, the current project progress is difficult to assess, since development, testing, and deployment are sequential tasks. A skill-centric silo organization results in extensive alignment and central planning effort without end-to-end visibility, leading to long cycle times and delays. Teams were often committed to the functional organization rather than to the software product, resulting in reluctant information sharing and poor collaboration (Katal et al., 2019).</p>
<h2 id="heading-how-devops-works">How DevOps works</h2>
<p>From a <a target="_blank" href="https://building-data-products.com/introduction-lean-methodology">lean</a> perspective, the traditional approach resembles the mass production of Ford and Taylor, and several lean principles are not met. From a flow perspective, code piles up after each process step, resulting in long waiting times. The optimum is a one-piece flow, where small chunks of code move through the process chain frequently. Small chunks of code can be tested faster, and errors are detected when they are made (right first time). In the traditional approach, by contrast, errors are detected late in the process, resulting in an extensive amount of rework, since error dependencies need to be resolved as well. Rework is one type of waste (muda); the large amount of work in progress (code between processes) and unbalanced teams with idle times add further types of waste. It is also difficult to manage feedback from production efficiently, since information is shared reluctantly between IT operations and the teams, as well as with the customer of the software product, creating the risk of building a product that does not meet customer demands. The traditional approach is not compatible with lean ideas, resulting in poor performance.</p>
<p>Based on these challenges, the DevOps philosophy was developed. „DevOps is described as a software engineering culture and philosophy that utilizes cross-functional teams [Developer, IT-Operations, QA, Security] to build, test, and release software faster and more reliably through automation“ (Macarthy &amp; Bass, 2020b). Companies applying DevOps frequently outperform the traditional approach on key success indicators, for example: deployment frequency, lead time for changes, change failure rate, and time to restore services (Portman, 2020). The main idea is to facilitate rapid software development and deployment by constantly bringing small releases into a production-ready status (Lwakatare et al., 2016). There is no conclusive definition of DevOps; in general, there are four pillars (CAMS): Culture, Automation, Measurement, and Sharing (Katal et al., 2019).</p>
<p>DevOps embraces a culture that bridges the conflicting interests of developers and IT operators. Small, self-organized, autonomous, cross-functional teams (developers, IT operators, and quality/security assurance) are built with a shared commitment towards the product rather than the function. Thus, the time to share information is reduced significantly, end-to-end visibility is given, and collaboration is strengthened. Learning and innovation cycles are improved through continuous feedback. An explanatory example of releasing and deploying code is depicted in the figure below. Through continuous integration and continuous delivery/deployment, also known as a CI/CD pipeline, a commit can be brought into a production-ready status through automation, reliably, quickly, and in high quality. Thus, small pieces of code can be shipped into production fast, reducing work in progress and cycle times significantly (“one-piece flow”). Failures are detected right after they emerge and can be solved immediately, reducing rework and thus eliminating waste (“right first time”). This procedure makes it possible to test a feature quickly and adapt to customer needs, and MVPs (minimum viable products) can be presented in an early phase of a project. Agile methods can be applied, like working in sprints with Scrum; working in sprints is a key enabler for refining customer requirements during the development phase in order to build the right product for the right people. For explanatory reasons, the CI/CD pipeline is broken down into process steps (in reality it is a continuous process). One sprint follows several phases: plan, code, build (building a deployable package like a Docker image), test, release to a repository, deploy, operate, monitor. It is very important that the production system is monitored and that errors or warnings are continuously fed back. Continuous feedback ensures that the necessary information reaches the developers so they can react fast and preventively to possible failures (continuous improvement, right first time).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734499460611/68346036-b38b-4c1f-9aad-45853b402a7b.png" alt class="image--center mx-auto" /></p>
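<p>The stage-gated flow of a CI/CD pipeline can be sketched in a few lines of Python. This is a toy placeholder, not any real CI system's API: the <code>build</code>, <code>run_tests</code>, and <code>deploy</code> functions are invented for illustration:</p>

```python
# Hedged sketch of the CI/CD flow described above: every small commit passes
# automatically through build, test, and deploy, and a failing stage stops
# the release immediately ("right first time").

def build(commit: str) -> str:
    # e.g. package the code into a deployable artifact such as a Docker image
    return f"artifact-for-{commit}"

def run_tests(artifact: str) -> bool:
    # automated quality and security tests gate every artifact
    return artifact.startswith("artifact-for-")

def deploy(artifact: str) -> str:
    # push the tested artifact into the production environment
    return f"deployed {artifact}"

def ci_cd(commit: str) -> str:
    artifact = build(commit)
    if not run_tests(artifact):
        # errors surface right after they emerge, not at a big-bang release
        raise RuntimeError("tests failed; release stopped")
    return deploy(artifact)

print(ci_cd("abc123"))  # deployed artifact-for-abc123
```

<p>Because each commit is small and the gates are automated, a failure points directly at the change that caused it, which is exactly the one-piece-flow argument made above.</p>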
<p>The software development processes are highly automated, including quality testing and the provisioning of infrastructure. An explanatory example of automated infrastructure provisioning (infrastructure as code) is shown in the figure below:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1734499520964/41f05f09-8afd-4ee6-b457-b402eee2cdcf.png" alt class="image--center mx-auto" /></p>
<p>Containerization (e.g., with Docker), a microservice architecture, and tools to configure and provide infrastructure enable infrastructure as code. Automated and replicable infrastructure significantly reduces technical debt.</p>
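<p>The core idea of infrastructure as code, a desired state described as data and a routine that converges reality to it, can be sketched in Python. Real IaC tools (e.g., Terraform) work declaratively in a similar spirit; the state dictionary and <code>provision</code> function here are made-up stand-ins:</p>

```python
# Toy sketch of infrastructure as code: the desired infrastructure is
# described as data kept in version control, and a provisioning routine
# converges the actual state to it, idempotently.

DESIRED_STATE = {
    "web": {"image": "app:1.4", "replicas": 3},
    "db": {"image": "postgres:16", "replicas": 1},
}

def provision(actual: dict, desired: dict) -> dict:
    """Idempotently converge the actual infrastructure to the desired state."""
    for name, spec in desired.items():
        if actual.get(name) != spec:
            actual[name] = dict(spec)  # (re)create the resource
    for name in list(actual):
        if name not in desired:
            del actual[name]  # anything not described in code is removed
    return actual

state = provision({}, DESIRED_STATE)     # provisions everything from scratch
state = provision(state, DESIRED_STATE)  # a second run changes nothing
```

<p>Idempotency is what makes the infrastructure replicable: running the same code against "dev" or "prod" always yields the same environment, which is how IaC keeps technical debt down.</p>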
<h1 id="heading-summary">Summary</h1>
<p>The blog discusses the evolution of software development from traditional, siloed, and waterfall-based approaches to modern, collaborative, and automated workflows enabled by DevOps. Key challenges with the traditional approach include long cycle times, poor collaboration between developers and IT operations, and technical debt from legacy systems.</p>
<p><strong>DevOps addresses these challenges</strong> by fostering a culture of collaboration through cross-functional teams, automating processes like testing and infrastructure provisioning, and implementing Continuous Integration and Continuous Delivery (CI/CD) pipelines. These practices enable faster deployment, quicker detection and resolution of errors, and the delivery of small, incremental releases. This aligns with Lean principles, such as minimizing waste and enabling one-piece flow, and Agile principles, such as flexibility and responsiveness to customer needs.</p>
<p>What are your thoughts about DevOps? Does it help you in your daily work?</p>
<h3 id="heading-sources">Sources:</h3>
<ul>
<li><p>Atwal, H. (2020). Practical DataOps. In Practical DataOps. Apress. <a target="_blank" href="https://www.oreilly.com/library/view/practical-dataops-delivering/9781484251041/"><strong>Practical DataOps: Delivering Agile Data Science at Scale[Book]</strong></a></p>
</li>
<li><p>DataKitchen. (2018). DataOps is NOT Just DevOps for Data. <a target="_blank" href="https://medium.com/data-ops/dataops-is-not-just-devops-for-data-6e03083157b7">DataOps is NOT Just DevOps for Data | by DataKitchen | data-ops | Medium</a></p>
</li>
<li><p>Katal, A., Dahiya, S. ., &amp; Bajoria, V. (2019a). DevOps: Bridging the gap between Development and Operations. <a target="_blank" href="https://ieeexplore.ieee.org/document/8819631">DevOps: Bridging the gap between Development and Operations | IEEE Conference Publication | IEEE Xplore</a></p>
</li>
<li><p>König, G., &amp; Kugel, R. (2019). DevOps – Welcome to the Jungle. <a target="_blank" href="https://link.springer.com/content/pdf/10.1365/s40702-019-00507-8.pdf">s40702-019-00507-8.pdf</a></p>
</li>
<li><p>Lwakatare, L., Oivo, M., &amp; Kuvaja, P. (2016). An Exploratory Study of DevOps Extending the Dimensions of DevOps with Practices. ICSEA 2016 : The Eleventh International Conference on Software Engineering Advances.</p>
</li>
<li><p>Macarthy, R. W., &amp; Bass, J. M. (2020a). An Empirical Taxonomy of DevOps in Practice. 2020 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 221–228. <a target="_blank" href="https://doi.org/10.1109/SEAA51224.2020.00046">https://doi.org/10.1109/SEAA51224.2020.00046</a></p>
</li>
<li><p>Portman, D. (2020). Are you an Elite DevOps performer? Find out with the Four Keys Project.</p>
</li>
<li><p>Zimmer, M., Kemper, H., &amp; Baars, H. (2015). The impact of Agility Requirements on Business intelligence Architectures.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Introduction into Agile Methodologies]]></title><description><![CDATA[DataOps, a fusion of "Data" and "Operations," addresses the challenges of developing data products by combining principles from Agile, DevOps, and Lean Manufacturing. It emphasizes collaboration, automation, and efficiency in handling data pipelines,...]]></description><link>https://building-data-products.com/introduction-into-agile-methodologies</link><guid isPermaLink="true">https://building-data-products.com/introduction-into-agile-methodologies</guid><category><![CDATA[agile]]></category><category><![CDATA[Data Products]]></category><category><![CDATA[agile development]]></category><category><![CDATA[Agile Software Development]]></category><dc:creator><![CDATA[Niels Humbeck]]></dc:creator><pubDate>Wed, 18 Dec 2024 06:10:48 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734586918743/4953c81b-9050-4a3c-9c68-7577a26c9c5f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>DataOps, a fusion of "Data" and "Operations," addresses the challenges of developing data products by combining principles from Agile, DevOps, and Lean Manufacturing. It emphasizes collaboration, automation, and efficiency in handling data pipelines, enabling teams to deliver data products faster and with higher quality (Atwal, 2020, p.xxiii). If you are interested in the definition of DataOps, take a look at the following blog post: <a target="_blank" href="https://building-data-products.com/understanding-dataops">Click Link</a></p>
<p>The exponential growth of data in recent years has created both incredible opportunities and formidable challenges for organizations. The ability to harness data efficiently and translate it into actionable insights has become a cornerstone of competitive advantage. However, traditional methodologies, for example those used to develop software products, often fail to keep up with the speed and complexity of data product development.</p>
<hr />
<p>DataOps inherits components from (DataKitchen, 2018):</p>
<ul>
<li><p><a target="_blank" href="https://building-data-products.com/introduction-lean-methodology"><strong>Lean Manufacturing</strong></a><strong>:</strong> Focusing on the value adding processes by eliminating waste leads to a more efficient utilization of resources, with higher quality and lower costs.</p>
</li>
<li><p><a target="_blank" href="https://building-data-products.com/introduction-into-agile-methodologies"><strong>Agile</strong></a><strong>:</strong> Building the right product for the right people by increasing “the ability to react to unforeseen or volatile requirements regarding the functionality or the content“ of Data Products (Zimmer et al., 2015)</p>
</li>
<li><p><a target="_blank" href="https://building-data-products.com/devops-bridging-development-and-operations"><strong>DevOps</strong></a><strong>:</strong> Shared commitment towards the Data Product reduces information barriers (Culture of collaboration), automation (e.g., CI/CD pipelines), Infrastructure as Code and automated tests enable fast and reliable deployment of code into production in a high quality (Macarthy &amp; Bass, 2020b)</p>
</li>
</ul>
<p>Let's focus on the Agile development of Data Products:</p>
<h2 id="heading-agile-methodology">Agile Methodology</h2>
<p>Around the turn of the century, the pace at which new technologies were industrialized increased significantly, and new possibilities for developing products, such as software products, arose rapidly. Adapting to these new possibilities in a fast-changing world to deliver a product that best suits the needs of the customer is a key success factor in the 21st century. Agile methodology “is a way of dealing with, and ultimately succeeding in, an uncertain and turbulent environment” (Agilealliance, 2022). The core values of agile were stated by software developers in 2001 in the “Manifesto for Agile Software Development”:</p>
<p>“We are uncovering better ways of developing software by doing it and helping others do it. Through this work we have come to value:</p>
<ul>
<li><p>Individuals and interactions over processes and tools</p>
</li>
<li><p>Working software over comprehensive documentation</p>
</li>
<li><p>Customer collaboration over contract negotiation</p>
</li>
<li><p>Responding to change over following a plan</p>
</li>
</ul>
<p>That is, while there is value in the items on the right, we value the items on the left more.” (Agilemanifesto, 2022). Agile philosophy is now adapted in many industry sectors but was first implemented in software development. “Agile software development is the umbrella term for a collection of independent frameworks and practices that follow the values of the Agile Manifesto” (Atwal, 2020, p.88). Agile frameworks are for example: Scrum, Kanban Methods and Scrumban.</p>
<ol>
<li><p><strong>Scrum</strong>: A widely used framework that organizes work into iterative cycles called sprints, which typically last two to four weeks. Scrum emphasizes roles like the Product Owner, Scrum Master, and Development Team, along with ceremonies such as daily stand-ups, sprint planning, and retrospectives. It promotes transparency, team accountability, and incremental progress toward a shared goal.</p>
</li>
<li><p><strong>Kanban</strong>: Originating from Lean manufacturing, Kanban focuses on visualizing workflow, limiting work in progress, and maximizing efficiency. Tasks are represented on a Kanban board, allowing teams to track progress and identify bottlenecks in real time. It encourages continuous delivery without imposing rigid time-boxed iterations.</p>
</li>
<li><p><strong>Scrumban</strong>: A hybrid approach combining elements of Scrum and Kanban, Scrumban leverages the iterative structure of Scrum while incorporating the flexibility and flow optimization of Kanban. This framework is often used by teams transitioning from Scrum to Kanban or by those seeking to adapt Agile principles in highly dynamic environments.</p>
</li>
</ol>
<p>Each of these frameworks aligns with Agile's core principles but offers unique approaches to tackling complex projects. They enable teams to embrace uncertainty, respond to change, and deliver value incrementally and consistently. Agile’s adaptability, paired with its emphasis on collaboration and continuous improvement, has made it a cornerstone methodology in industries striving for innovation and customer satisfaction.</p>
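<p>The Kanban mechanics mentioned above, visualizing the workflow and limiting work in progress, can be made concrete with a small sketch. The board columns, task names, and WIP limits below are invented for the demonstration:</p>

```python
# Minimal sketch of a Kanban board: tasks move across columns, and a
# work-in-progress (WIP) limit refuses pulls into a full column, making
# bottlenecks visible instead of hiding them in a growing queue.

WIP_LIMITS = {"todo": None, "in_progress": 2, "done": None}
board = {"todo": ["t1", "t2", "t3"], "in_progress": [], "done": []}

def pull(task: str, src: str, dst: str) -> bool:
    """Move a task between columns, respecting the WIP limit of dst."""
    limit = WIP_LIMITS[dst]
    if limit is not None and len(board[dst]) >= limit:
        return False  # column is full: relieve the bottleneck first
    board[src].remove(task)
    board[dst].append(task)
    return True

pull("t1", "todo", "in_progress")
pull("t2", "todo", "in_progress")
print(pull("t3", "todo", "in_progress"))  # False: WIP limit of 2 reached
```

<p>Refusing the third pull is the whole point: the team must finish or unblock work in progress before starting more, which keeps flow continuous without time-boxed iterations.</p>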
<h3 id="heading-product-management-and-okrs">Product Management and OKRs</h3>
<p>Product management plays a crucial role in driving innovation and delivering value in Agile environments. It focuses on understanding customer needs, defining product vision, and prioritizing development efforts to ensure products align with business goals and market demands. A successful product manager acts as a bridge between stakeholders, customers, and development teams, ensuring that the right product is built and delivered at the right time.</p>
<p>One key tool for aligning product management efforts with organizational objectives is <strong>Objectives and Key Results (OKRs)</strong>. OKRs are a goal-setting framework designed to provide clarity, focus, and measurable outcomes for teams and organizations. The structure of OKRs consists of:</p>
<ol>
<li><p><strong>Objectives</strong>: Qualitative, inspiring, and clear goals that define what the team or organization aims to achieve. Objectives should be ambitious and aligned with the overall mission of the organization.</p>
</li>
<li><p><strong>Key Results</strong>: Quantitative, measurable outcomes that indicate whether progress is being made toward achieving the objective. Key results must be specific, time-bound, and verifiable, ensuring accountability and transparency.</p>
</li>
</ol>
<p>For example, a product team might set the following OKRs:</p>
<ul>
<li><p><strong>Objective</strong>: Increase user engagement with the product.</p>
<ul>
<li><p>Key Result 1: Achieve a 20% increase in daily active users (DAUs) over the next quarter.</p>
</li>
<li><p>Key Result 2: Reduce the average time to first meaningful action within the app by 30%.</p>
</li>
<li><p>Key Result 3: Achieve a 15% improvement in user retention after 30 days.</p>
</li>
</ul>
</li>
</ul>
<p>In Agile product management, OKRs serve as a guiding framework that connects high-level strategic goals with day-to-day execution. By establishing clear priorities and measurable outcomes, product managers can ensure that development efforts focus on delivering the greatest value to customers and the business.</p>
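<p>One common way to make key results measurable is to grade each one by its progress from a baseline towards its target and average the grades into an objective score. The sketch below applies this to the example OKR above; the <code>KeyResult</code> class and the baseline numbers are invented for the demo, not a standard OKR API:</p>

```python
# Sketch of grading the example OKR: each key result scores the progress
# from its baseline towards its target, and the objective score is the average.

from dataclasses import dataclass

@dataclass
class KeyResult:
    name: str
    start: float    # baseline at the beginning of the quarter
    target: float   # value the key result aims for
    current: float  # latest measured value

    def score(self) -> float:
        """Fraction of the way from start to target, clamped to [0, 1]."""
        progress = (self.current - self.start) / (self.target - self.start)
        return max(0.0, min(1.0, progress))

key_results = [
    KeyResult("daily active users", start=10_000, target=12_000, current=11_000),
    KeyResult("time to first action (s)", start=60, target=42, current=51),
    KeyResult("30-day retention (%)", start=40, target=46, current=46),
]

objective_score = sum(kr.score() for kr in key_results) / len(key_results)
print(round(objective_score, 2))  # 0.67
```

<p>Note that the formula also handles "decrease" targets (like time to first action), since both the numerator and denominator flip sign together.</p>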
<h3 id="heading-the-role-of-okrs-in-agile-environments">The Role of OKRs in Agile Environments</h3>
<p>In an Agile environment, where rapid iterations and adaptability are critical, OKRs provide the structure needed to balance flexibility with accountability. They encourage cross-functional collaboration by aligning teams on shared goals and breaking down silos between departments.</p>
<p>OKRs also foster a results-oriented culture. Instead of focusing solely on outputs (e.g., the number of features delivered), teams are encouraged to focus on outcomes (e.g., how those features impact customer satisfaction or business success). This shift ensures that development efforts are aligned with customer and organizational needs, reducing wasted resources and maximizing impact.</p>
<p>By integrating OKRs into Agile practices such as sprint planning, retrospectives, and roadmapping, product teams can create a feedback loop that ensures continuous improvement. This alignment of strategic goals, customer value, and measurable outcomes makes OKRs a powerful tool for modern product management within Agile frameworks.</p>
<h2 id="heading-sources">Sources:</h2>
<ul>
<li><p>Atwal, H. (2020). Practical DataOps. In Practical DataOps. Apress. <a target="_blank" href="https://doi.org/10.1007/978-1-4842-5104-1"><strong>https://doi.org/10.1007/978-1-4842-5104-1</strong></a></p>
</li>
<li><p>Agilemanifesto. (2022). Manifesto for Agile Software Development.</p>
</li>
<li><p>DataKitchen. (2018). DataOps is NOT Just DevOps for Data. <a target="_blank" href="https://medium.com/data-ops/dataops-is-not-just-devops-for-data-6e03083157b7">https://medium.com/data-ops/dataops-is-not-just-devops-for-data-6e03083157b7</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Introduction into Lean Methodology]]></title><description><![CDATA[DataOps, a fusion of "Data" and "Operations," addresses the challenges of developing data products by combining principles from Agile, DevOps, and Lean Manufacturing. It emphasizes collaboration, automation, and efficiency in handling data pipelines,...]]></description><link>https://building-data-products.com/introduction-lean-methodology</link><guid isPermaLink="true">https://building-data-products.com/introduction-lean-methodology</guid><category><![CDATA[Lean Methology]]></category><category><![CDATA[Devops]]></category><category><![CDATA[dataops]]></category><category><![CDATA[lean]]></category><dc:creator><![CDATA[Niels Humbeck]]></dc:creator><pubDate>Wed, 18 Dec 2024 05:54:02 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734586883146/6fcecec4-d0bb-4bc8-b101-dc9e57292d00.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>DataOps, a fusion of "Data" and "Operations," addresses the challenges of developing data products by combining principles from Agile, DevOps, and Lean Manufacturing. It emphasizes collaboration, automation, and efficiency in handling data pipelines, enabling teams to deliver data products faster and with higher quality (Atwal, 2020, p.xxiii). If you are interested in the definition of DataOps, take a look at the following blog post: <a target="_blank" href="https://building-data-products.com/understanding-dataops">Click Link</a></p>
<p>The exponential growth of data in recent years has created both incredible opportunities and formidable challenges for organizations. The ability to harness data efficiently and translate it into actionable insights has become a cornerstone of competitive advantage. However, traditional methodologies, for example those used to develop software products, often fail to keep up with the speed and complexity of data product development.</p>
<hr />
<p>DataOps inherits components from (DataKitchen, 2018):</p>
<ul>
<li><p><a target="_blank" href="https://building-data-products.com/introduction-lean-methodology"><strong>Lean Manufacturing</strong></a><strong>:</strong> Focusing on the value adding processes by eliminating waste leads to a more efficient utilization of resources, with higher quality and lower costs.</p>
</li>
<li><p><a target="_blank" href="https://building-data-products.com/introduction-into-agile-methodologies"><strong>Agile</strong></a><strong>:</strong> Building the right product for the right people by increasing “the ability to react to unforeseen or volatile requirements regarding the functionality or the content“ of Data Products (Zimmer et al., 2015)</p>
</li>
<li><p><a target="_blank" href="https://building-data-products.com/devops-bridging-development-and-operations"><strong>DevOps</strong></a><strong>:</strong> A shared commitment to the data product reduces information barriers (culture of collaboration); automation (e.g., CI/CD pipelines), Infrastructure as Code, and automated tests enable fast, reliable, high-quality deployment of code into production (Macarthy &amp; Bass, 2020b).</p>
</li>
</ul>
<p>Let's focus on Lean Manufacturing:</p>
<h2 id="heading-lean-manufacturing"><strong>Lean Manufacturing</strong></h2>
<p>Lean manufacturing is a production philosophy focused on maximizing customer value by eliminating waste and continuously improving processes. The lean methodology facilitates customer-centric product development. Originating from the Toyota Production System, it emphasizes principles such as standardization, continuous flow, and employee involvement to optimize efficiency and quality. By prioritizing value-adding activities and minimizing non-value-adding processes, lean manufacturing aims to deliver high-quality products at lower costs with shorter lead times. Let's take a deep dive into the evolution of production systems:</p>
<h3 id="heading-definition-of-production-systems">Definition of Production Systems</h3>
<p>Production systems can be characterized by their system elements (technical or social production systems), by the viewpoint taken (real or conceptual system), and by the value-adding processes considered (Baumgärtner, 2006, p.34). The system-element characteristic describes the role of the worker: a technical production system focuses on technical and infrastructure elements and treats the worker as just another productive factor, whereas a social production system includes all entities, including the workers with their know-how, skills, needs, and values, and focuses on the organization and cooperation between workers and other productive factors as a socio-technical system (Baumgärtner, 2006). This socio-technical perspective adds a new dimension of complexity compared to the purely technical view. Production systems can also be characterized from different viewpoints: a real production system describes production as the transformation of raw materials into a final product (Uygun, 2013, p.13), while the conceptual viewpoint focuses on design principles, methods, and tools. Finally, production systems can be characterized by their specific value-adding activities, such as development, supply chain, quality control, or production. The interaction of these dimensions is highly complex and has evolved over the years, with three major paradigm shifts.</p>
<h3 id="heading-evolution-of-production-systems-from-craftmans-production-to-lean-production">Evolution of Production Systems - From Craftsman Production to Lean Production</h3>
<h3 id="heading-craftmans-production-systems">Craftsman Production Systems</h3>
<p>A paradigm shift describes a fundamental change in the properties and framework of production systems. Until 1920, manual or craftsman manufacturing was the dominant form of production. Non-standardized, unique pieces were produced without mechanical automation, resulting in low productivity and individually made products (Baumgärtner, 2006, p.18; Dombrowski &amp; Mielke, 2015, p.8). No economies of scale can be exploited when producing unique, individual products, so manufactured volumes remained small. Due to the lack of division of labor, workers had to master every process step, so only very well-trained employees could manufacture complex products, and the number of such educated employees was limited at the beginning of the 20th century (Dombrowski &amp; Mielke, 2015, p.8).</p>
<h3 id="heading-mass-production-systems">Mass Production Systems</h3>
<p>At the beginning of the 20th century, electricity, the steam engine, and new machine tools were developed (Crespo, 2012, p.30). These developments formed the basis of the classical mass production systems of Henry Ford and Taylor. Prices for manufactured goods decreased as economies of scale were exploited. This led to higher sales, since many products, such as cars, became affordable for the middle class, triggering an economic boom (Dombrowski &amp; Mielke, 2015, p.8). A key component of classical mass production is the reduction of complexity, achieved by separating planning from executive activities, as well as through the division of labor and standardization in production. This division of labor and standardization enables the employment of unskilled workers for repetitive tasks, resolving the resource restriction of an educated workforce (Dombrowski &amp; Mielke, 2015, p.8; Uygun, 2013, p.11). Classical mass production is ideal for producing goods with low variability in large numbers. In its early days, the market was a seller's market, and the main goal was to meet demand. Classical mass production reached its limits when the market changed into a buyers' market demanding a greater variety of goods at lower prices; it was replaced by the lean production system, which was significantly influenced by the Toyota Production System (TPS).</p>
<h3 id="heading-the-lean-production-system-introduction">The Lean Production System - Introduction</h3>
<p>The Toyota Production System (TPS) was developed by the Toyota car company to cope with the difficult economic conditions in Japan after World War II and the 1973 oil crisis. The Japanese market required the production of a wide portfolio of vehicles at low prices (Ohno, 2013, p.34). After the Second World War, Toyota had little capital. Classical mass production required specialized production equipment for each product variant, and such specialized machines are inflexible and expensive (Womack et al., 1991, p.35). Furthermore, classical mass production ties up high working capital in buffers of semi-finished goods, and high buffer volumes correspond to long cycle times (Dombrowski &amp; Mielke, 2015, p.15). Toyota could not afford to copy the classical mass production system. The goal of the Toyota Production System is to achieve at least the productivity level of classical mass production with lower economies of scale and lower costs, higher quality, and a shorter time to market (Baumgärtner, 2006, p.26; Merl, 2015, p.18). The Toyota Production System was a key success factor in developing and producing new product variants with fewer resources, at lower cost, in a shorter time, and at high quality, increasing market share and customer satisfaction (Womack et al., 1991).</p>
<h3 id="heading-lean-production-system-deep-dive-into-the-core-components">Lean Production System - Deep Dive into the core components</h3>
<p>The success of the Toyota Production System was the foundation for lean production systems. A lean production system can be defined as a “production system that focusing continuous flow within supply chain by eliminating all wastes and performing continuous improvement towards product perfection” (Rose et al., 2011). Non-value-adding processes, i.e. processes producing waste (<strong>Muda</strong>), need to be eliminated, and value-enabling processes need to be optimized, so that the focus lies on the value-adding processes that deliver value for the customer (Dombrowski &amp; Mielke, 2015, p.32). <strong>Mura</strong> refers to unevenness or inconsistency in a process, which can cause inefficiencies, delays, and instability in production or workflow. <strong>Muri</strong> represents overburden, i.e. pushing systems, processes, or people beyond their capacity, leading to stress, errors, and potential breakdowns. Seven types of waste exist: Transport, Inventory, Movement, Waiting, Overproduction, Overprocessing, and Defects (<strong>TIMWOOD</strong>) (Ohno, 2013). The eight lean principles summarize the guiding ideas of lean: standardization, right first time, the flow principle, the pull principle, continuous improvement, employee orientation and goal-oriented leadership, avoidance of waste, and visual management (VDI, n.d.). A multitude of methods and tools can be applied for each of these principles. The six most important principles are explained in the following:</p>
<ul>
<li><p><strong>Standardization</strong> ensures a defined level of quality and enables stable and plannable processes (VDI, n.d.). Standardization forms the basis for a continuous improvement process (Dombrowski &amp; Mielke, 2015, p.66).</p>
</li>
<li><p>The goal of <strong>right first time</strong> is to avoid passing errors on to the next process step. Failures should be detected as early as possible to prevent wasteful rework in subsequent steps.</p>
</li>
<li><p>The <strong>flow principle</strong> is characterized by a fast, continuous and low-turbulence flow of materials and information across the entire value chain (VDI, n.d.). The optimum of the flow principle is a lot size of one – also known as one-piece flow. The goal is to minimize the traveling distances, the waiting times and the buffer between process steps. Thereby the throughput time and the working capital is reduced significantly (Dombrowski &amp; Mielke, 2015, p.96).</p>
</li>
<li><p><strong>Continuous improvement</strong> describes a corporate culture that strives for improvements throughout the company and is anchored in all employees. The aim is to question historically grown workflows. Opportunities for improvement are identified by employees, and identified improvement potentials can be collected in a corporate proposal system. The PDCA (Plan, Do, Check, Act) cycle helps to implement an optimization, and the SDCA (Standardize, Do, Check, Act) cycle helps to standardize it (Dombrowski &amp; Mielke, 2015, p.50).</p>
</li>
<li><p>Reducing and <strong>eliminating waste</strong> is a core principle of Lean to increase quality and value creation and reduce the costs (see above).</p>
</li>
<li><p><strong>Visual management</strong> refers to the graphical representation of information (e.g., workflows and performance KPIs). It makes complex production systems transparent and understandable, which increases employees' motivation to enhance performance, to focus on continuous improvement, and to identify and eliminate waste (Dombrowski &amp; Mielke, 2015, p.149).</p>
</li>
</ul>
<h2 id="heading-summary">Summary</h2>
<p>The text explores the evolution of production systems, focusing on their transformation from manual manufacturing to classical mass production and finally to lean manufacturing. Lean manufacturing, derived from the Toyota Production System, emphasizes eliminating waste (Muda), optimizing value-adding processes, and achieving continuous improvement. Key principles of lean include standardization, right-first-time quality, flow, pull systems, waste reduction, and visual management. These principles enable companies to produce high-quality products efficiently, reduce costs, and respond to customer demands for variety and customization. Lean manufacturing represents a shift toward a customer-centric, resource-efficient production model designed to enhance value creation and competitiveness.</p>
<h2 id="heading-outlook-how-lean-production-methodology-influences-devops-and-dataops">Outlook: How Lean Production Methodology influences DevOps and DataOps</h2>
<p>Are you curious how the lean methodology can be applied to build software products faster and more reliably? If so, have a look here: <a target="_blank" href="https://building-data-products.com/devops-bridging-development-and-operations">Link</a></p>
<h2 id="heading-sources">Sources:</h2>
<ul>
<li><p><strong>Atwal, H. (2020).</strong> <em>Practical DataOps: Delivering Agile Data Science at Scale</em>. Apress.</p>
</li>
<li><p><strong>Baumgärtner, G. (2006).</strong> Reifegradorientierte Gestaltung von Produktionssystemen. Theoretische und empirische Analyse eines Gestaltungsmodels.</p>
</li>
<li><p><strong>Crespo, I. (2012).</strong> Ganzheitliche Produktionssysteme für kleine und mittlere Unternehmen.</p>
</li>
<li><p><strong>DataKitchen. (2018).</strong> DataOps is NOT Just DevOps for Data. <em>Medium</em>.</p>
</li>
<li><p><strong>Dombrowski, U., &amp; Mielke, T. (2015).</strong> Ganzheitliche Produktionssysteme. Aktueller Stand und zukünftige Entwicklungen. Springer Vieweg (VDI-Buch).</p>
</li>
<li><p><strong>Macarthy, R. W., &amp; Bass, J. M. (2020a).</strong> An Empirical Taxonomy of DevOps in Practice. 2020 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 221–228. <a target="_blank" href="https://doi.org/10.1109/SEAA51224.2020.00046">https://doi.org/10.1109/SEAA51224.2020.00046</a></p>
</li>
<li><p><strong>Ohno, T. (2013).</strong> Das Toyota-Produktionssystem.</p>
</li>
<li><p><strong>Uygun, Y. (2013).</strong> GPS-Diagnode. Diagnose und Optimierung der Produktion auf Basis Ganzheitlicher Produktionssysteme.</p>
</li>
<li><p><strong>VDI. (n.d.).</strong> VDI 2870: Blatt 1: Ganzheitliche Produktionssysteme Grundlagen, Einführung und Bewertung.</p>
</li>
<li><p><strong>Womack, J. P., Jones, D. T., &amp; Roos, D. (1991).</strong> The Machine That Changed the World.</p>
</li>
<li><p><strong>Zimmer, M., Kemper, H., &amp; Baars, H. (2015).</strong> The impact of Agility Requirements on Business intelligence Architectures.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Streamlining Data Product Development with DataOps]]></title><description><![CDATA[The development of data products is an intricate process, blending the complexities of data and code. Unlike traditional software development, the data dimension adds unique challenges. Data must be available, understood, and accurate. T...]]></description><link>https://building-data-products.com/understanding-dataops</link><guid isPermaLink="true">https://building-data-products.com/understanding-dataops</guid><category><![CDATA[dataops]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data analysis]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Niels Humbeck]]></dc:creator><pubDate>Fri, 13 Dec 2024 15:23:49 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734586850677/cdd9e604-6c9b-4be8-80a2-7703f90aad08.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>The development of data products is an intricate process, blending the complexities of data and code. Unlike traditional software development, the data dimension adds unique challenges. Data must be available, understood, and accurate. The exploratory work, often led by data scientists and analysts, adds another layer of complexity. Furthermore, developing the infrastructure to release even small chunks of data products requires robust data pipeline environments. There are three main challenges in the development of Data Products:</p>
<ul>
<li><p><strong>Waste during the whole lifecycle</strong></p>
</li>
<li><p><strong>Misalignment with business goals</strong></p>
</li>
<li><p><strong>And challenges in scaling data science/ analytics outputs (Productionizing)</strong></p>
</li>
</ul>
<p>Are you interested in learning more about the challenges of developing data products? Please have a look here: <a target="_blank" href="https://building-data-products.com/why-the-majority-of-data-products-fail">LINK</a></p>
<p>These complexities and challenges call for specialized methodologies called <strong>DataOps</strong> to overcome the hurdles of creating and maintaining high-quality data products (Atwal, 2020, p.12; p.174).</p>
<h3 id="heading-what-is-dataops">What is DataOps?</h3>
<p>DataOps, a fusion of "Data" and "Operations," addresses the challenges of developing data products by combining principles from Agile, DevOps, and Lean Manufacturing. It emphasizes collaboration, automation, and efficiency in handling data pipelines, enabling teams to deliver data products faster and with higher quality (Atwal, 2020, p.xxiii).</p>
<h3 id="heading-the-key-components-of-dataops">The Key Components of DataOps</h3>
<p>DataOps incorporates methodologies from several established frameworks:</p>
<ol>
<li><p><a target="_blank" href="https://building-data-products.com/introduction-into-agile-methodologies"><strong>Agile</strong></a>: Focuses on creating the right product for the right people by adapting quickly to changing requirements. Agile enables data product teams to respond flexibly to unforeseen needs regarding functionality or content (Zimmer et al., 2015).</p>
</li>
<li><p><a target="_blank" href="https://building-data-products.com/devops-bridging-development-and-operations"><strong>DevOps</strong></a>: Promotes a culture of collaboration and shared responsibility among teams. Automation, CI/CD pipelines, Infrastructure as Code, and automated testing ensure fast, reliable deployment with high quality (Macarthy &amp; Bass, 2020b).</p>
</li>
<li><p><a target="_blank" href="https://building-data-products.com/introduction-lean-methodology"><strong>Lean</strong></a>: Aims to eliminate waste and focus on value-adding processes, resulting in higher efficiency, better resource utilization, improved quality, and reduced costs.</p>
</li>
</ol>
<h3 id="heading-why-dataops-matters">Why DataOps Matters</h3>
<p>The complexity of building data products—from data ingestion to deployment—necessitates methodologies that streamline the process. DataOps supports:</p>
<ul>
<li><p><strong>Rapid Development of MVPs</strong>: DataOps enables teams to quickly create Minimum Viable Products (MVPs) to test ideas with customers and iteratively improve them. This approach reduces cycle times and accelerates delivery (Atwal, 2020, p.7).</p>
</li>
<li><p><strong>Customer-Centric Development</strong>: DataOps focuses on customer needs, breaking down technological and organizational silos to streamline activities toward delivering value (Atwal, 2020, p.81).</p>
</li>
<li><p><strong>Scalability and Robustness</strong>: By fostering automation and continuous improvement, DataOps enhances scalability and robustness, providing a competitive advantage to organizations (Atwal, 2020, p.136).</p>
</li>
<li><p><strong>Data Governance</strong>: DataOps integrates automated governance and compliance checks into the pipeline, ensuring data integrity and regulatory compliance. This helps mitigate risks without slowing development.</p>
</li>
<li><p><strong>Continuous Improvement:</strong> DataOps fosters a culture of continuous improvement by automating feedback loops, allowing teams to refine data pipelines and products in real-time. This ensures ongoing optimization and relevance of data products. New ideas can be tested without breaking the production model.</p>
</li>
<li><p><strong>Innovation and Competitive Advantage:</strong><br />  DataOps accelerates innovation by enabling faster iteration and adaptation to market changes. It provides a competitive edge by ensuring data products remain relevant and responsive to customer needs.</p>
</li>
<li><p><strong>Cross-Functional Collaboration:</strong><br />  Collaboration between data engineers, analysts, and business stakeholders is at the core of DataOps. This shared responsibility improves communication and accelerates delivery of data products.</p>
</li>
</ul>
<h3 id="heading-definitions-of-dataops">Definitions of DataOps</h3>
<p>My definition of DataOps is:</p>
<p><em>“DataOps is a methodology that integrates principles from Agile, DevOps, and Lean to streamline the end-to-end development of Data Products. It focuses on collaboration, automation, and continuous improvement to ensure efficient, high-quality data pipelines powering the rapid development of customer centric data products. DataOps is not just a process but a mindset, aligning teams across disciplines to ensure data products meet evolving business needs and deliver actionable insights.”</em></p>
<p>In the literature, there are mainly three perspectives on the definition of DataOps (Mainali, 2020, p.17):</p>
<ol>
<li><p><strong>Goal-Oriented</strong>: The goal of DataOps is the “elimination [of] errors and inefficiency in data management, reducing the risk of data quality degradation and exposure of sensitive data using interconnected and secure data analytics models” (Mainali, 2020, p.17).</p>
</li>
<li><p><strong>Activity-Oriented</strong>: This perspective describes DataOps as a key enabler of a continuous flow of data through data pipelines, “converting raw data into useful data products [which] can be treated as an end-to-end assembly line process that requires high level of collaboration, automation and continuous improvement” (Atwal, 2020, p.xxiv).</p>
</li>
<li><p><strong>Team and Process-Oriented</strong>: The process- or team-oriented definition of DataOps focuses on the organizational framework, underlining the relevance of cross-functional teams and of data governance management throughout the whole data lifecycle (Mainali, 2020, p.17).</p>
</li>
</ol>
<h3 id="heading-core-practices-of-dataops">Core Practices of DataOps</h3>
<p>“DataOps is an integrated approach for delivering data analytic [products] solutions that uses automation, testing, orchestration, collaborative development, containerization, and continuous monitoring to continuously accelerate output and improve quality” (Ereth, 2018, p.6). The focus is on “data products rather than data projects, and data flows rather than layers of technology or organizational functions” (Atwal, 2020, p.81). At the core of data product development are the customer needs; all activities are streamlined toward this goal by breaking down technological and organizational silos. The following practices are important for the success of the DataOps methodology (Bergh et al., 2019, p.27):</p>
<ul>
<li><p><strong>Orchestration of Data Pipelines</strong>: Ensuring smooth, automated data flow from ingestion to delivery. (Comprehensive framework for data pipelines: <a target="_blank" href="https://building-data-products.com/mastering-data-pipelines-the-secret-to-fast-and-reliable-data-operations">LINK</a>)</p>
</li>
<li><p><strong>Automated Testing and Monitoring</strong>: Validating data quality and detecting issues in real-time.</p>
</li>
<li><p><strong>Version Control</strong>: Managing changes to data and code efficiently.</p>
</li>
<li><p><strong>Branch and Merge Strategies</strong>: Facilitating collaboration among multiple teams.</p>
</li>
<li><p><strong>Multiple Environments</strong>: Supporting development, testing, and production workflows.</p>
</li>
<li><p><strong>Reusability, Containerization and Automation</strong>: Leveraging reusable components to reduce redundancy and improve efficiency.</p>
</li>
<li><p>Avoiding <strong>fear and heroism</strong> in DataOps projects</p>
</li>
</ul>
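<p>To make the automated testing and monitoring practice above more concrete, here is a minimal, hypothetical sketch of a rule-based data quality gate that a pipeline could run after each ingestion step. The function name <code>validate_batch</code> and the rule set are illustrative assumptions, not part of any specific DataOps framework.</p>

```python
# Hypothetical example: a lightweight data quality gate for a DataOps
# pipeline. All names (validate_batch, the rule set) are illustrative.

def validate_batch(rows, rules):
    """Apply simple column-level rules to a batch of records and
    return a list of human-readable violations."""
    violations = []
    for i, row in enumerate(rows):
        for column, check in rules.items():
            value = row.get(column)
            if not check(value):
                violations.append(f"row {i}: bad value {value!r} in column {column}")
    return violations

# Example rules: orders must have a positive amount and a known currency.
rules = {
    "amount": lambda v: isinstance(v, (int, float)) and v > 0,
    "currency": lambda v: v in {"EUR", "USD"},
}

batch = [
    {"amount": 19.99, "currency": "EUR"},
    {"amount": -5, "currency": "EUR"},   # violates the amount rule
    {"amount": 10, "currency": "GBP"},   # violates the currency rule
]

problems = validate_batch(batch, rules)
for p in problems:
    print(p)
# In a real pipeline, a non-empty result would fail the run early
# ("right first time") instead of letting bad data reach production.
```

<p>In practice, such checks would be wired into the pipeline orchestrator so that every run is validated automatically, mirroring how automated tests gate deployments in CI/CD.</p>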
<p>Are you interested in learning more about DataOps best practices? Take a look at this blog post: <a target="_blank" href="https://building-data-products.com/7-proven-best-practice-to-master-dataops-architecture-for-seamless-automation-and-scalability">LINK</a></p>
<h3 id="heading-the-future-of-data-product-development">The Future of Data Product Development</h3>
<p>As organizations strive to harness the power of data, DataOps emerges as a vital methodology for delivering high-quality, scalable, and customer-focused data products. By combining automation, collaboration, and continuous improvement, DataOps bridges the gap between data and operations, enabling teams to meet the growing demands of data-driven innovation.</p>
<p>In summary, DataOps is not just a methodology; it is a mindset: a commitment to building better, faster, and more reliable data products by integrating principles from Agile, DevOps, and Lean Manufacturing. The journey toward effective DataOps may be complex, but its rewards are transformative for teams and organizations alike.</p>
<h3 id="heading-sources">Sources</h3>
<ul>
<li><p>Atwal, H. (2020). <em>Practical DataOps: Delivering Agile Data Science at Scale</em>. Apress.</p>
</li>
<li><p>Bergh, C., Benghiat, G., &amp; Strod, E. (2019). <em>The DataOps Cookbook</em>. DataKitchen.</p>
</li>
<li><p>Macarthy, R. W., &amp; Bass, J. M. (2020a). An Empirical Taxonomy of DevOps in Practice. 2020 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 221–228. <a target="_blank" href="https://doi.org/10.1109/SEAA51224.2020.00046">https://doi.org/10.1109/SEAA51224.2020.00046</a></p>
</li>
<li><p>Mainali, S. (2020). "Exploring DataOps Definitions and Applications."</p>
</li>
<li><p>Zimmer, M., Kemper, H., &amp; Baars, H. (2015). The impact of Agility Requirements on Business intelligence Architectures.</p>
</li>
<li><p>Ereth, J. (2018). DataOps – Towards a Definition.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[What Are Data Products?]]></title><description><![CDATA[In today’s data-driven world, the term "data products" is gaining traction as businesses increasingly leverage data to solve real-world problems. Just as crude oil must be refined to deliver its true value, data is an information asset that needs to ...]]></description><link>https://building-data-products.com/what-are-data-products</link><guid isPermaLink="true">https://building-data-products.com/what-are-data-products</guid><category><![CDATA[Data Products]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[AI]]></category><category><![CDATA[DataScienceChallenges]]></category><category><![CDATA[DataProductManagement]]></category><category><![CDATA[DataCulture]]></category><category><![CDATA[DataForBusiness]]></category><category><![CDATA[data analytics]]></category><dc:creator><![CDATA[Niels Humbeck]]></dc:creator><pubDate>Sat, 07 Dec 2024 06:45:44 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734586797522/cd98f6dd-8e38-49dd-87bb-5208dfd16974.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In today’s data-driven world, the term "data products" is gaining traction as businesses increasingly leverage data to solve real-world problems. Just as crude oil must be refined to deliver its true value, <strong>data is an information asset that needs to be processed and analyzed</strong> to generate meaningful insights (Palmer, 2006). Whether for internal business use or external monetization, <strong>data products are the tools that transform raw data into value-adding solutions</strong>. Let’s explore what makes data products essential, how they are categorized, and how they create value.</p>
<h3 id="heading-defining-data-products"><strong>Defining Data Products</strong></h3>
<p>At its core, a <strong>data product</strong> is the output of work by data scientists, analysts, and engineers, designed to deliver value to users through the power of information. Unlike raw data, data products deliver actionable insights or automate solutions to real-world problems. According to Loukides (2011), they can be classified as either:</p>
<ul>
<li><p><strong>Overt Data Products</strong>: These make data visible to the user, often in the form of reports, dashboards, or APIs.</p>
</li>
<li><p><strong>Covert Data Products</strong>: These solve problems without exposing the raw data to the user. For example, an automated drone delivering goods doesn’t show the user the underlying data used to calculate the route—it simply performs the task seamlessly.</p>
</li>
</ul>
<p>As businesses and customers increasingly prioritize simplicity and effectiveness, data products are evolving from <strong>overt</strong> to <strong>covert</strong>. The focus is shifting to providing fast, intuitive solutions rather than overwhelming users with raw data.</p>
<h3 id="heading-the-role-of-data-products-in-business"><strong>The Role of Data Products in Business</strong></h3>
<p>Data products serve a variety of purposes, from enabling better decision-making to automating complex processes and enhancing customer experiences. Their applications extend across industries and customer types, whether the end-user is internal (within the company) or external (another company or individual customers). A <strong>well-designed data product</strong> can be a competitive differentiator by offering:</p>
<ul>
<li><p><strong>Unique Data</strong>: Providing proprietary or hard-to-find datasets.</p>
</li>
<li><p><strong>Exceptional User Experiences</strong>: Delivering seamless, intuitive solutions that solve specific problems effectively (Weber, 2021).</p>
</li>
</ul>
<p>For example, a business intelligence dashboard might provide executives with key metrics for decision-making, while a machine-learning algorithm might power a recommendation engine for an e-commerce platform.</p>
<h3 id="heading-types-of-data-products"><strong>Types of Data Products</strong></h3>
<p>O’Regan (2018) categorized data products into <strong>functional types</strong> based on their purpose. Here’s a closer look at the five main categories:</p>
<ol>
<li><p><strong>Raw Data (Overt Data Products)</strong></p>
<ul>
<li><p>Provides access to cleaned or preprocessed data.</p>
</li>
<li><p>Example: A company offering access to weather datasets for research or development.</p>
</li>
</ul>
</li>
<li><p><strong>Derived Data</strong></p>
<ul>
<li><p>Supplies enriched or processed data, such as customer segmentation or predictive scores.</p>
</li>
<li><p>Example: A marketing analytics platform that adds demographic insights to customer data.</p>
</li>
</ul>
</li>
<li><p><strong>Algorithms as a Service</strong></p>
<ul>
<li><p>Delivers algorithms for external use, such as APIs for image recognition or text analysis.</p>
</li>
<li><p>Example: A picture similarity search engine used in online retail.</p>
</li>
</ul>
</li>
<li><p><strong>Decision Support</strong></p>
<ul>
<li><p>Offers insights and visualizations to guide user decisions.</p>
</li>
<li><p>Example: Analytical dashboards showing real-time performance metrics.</p>
</li>
</ul>
</li>
<li><p><strong>Automated Decision-Making</strong></p>
<ul>
<li><p>Executes decisions autonomously without human interaction.</p>
</li>
<li><p>Example: High-frequency trading systems in financial markets.</p>
</li>
</ul>
</li>
</ol>
<h3 id="heading-data-interactions-how-data-products-share-results"><strong>Data Interactions: How Data Products Share Results</strong></h3>
<p>Data products interact with users and systems in various ways to deliver value. These interactions can occur via:</p>
<ul>
<li><p><strong>APIs (Application Programming Interfaces)</strong>:<br />  APIs enable seamless communication between software components, often within modern microservice architectures. For example, one microservice might supply raw data, which is then processed by a data product and passed on to another system via an API.</p>
</li>
<li><p><strong>Dashboards and Visualizations</strong>:<br />  Insights are often presented in visual formats to aid understanding and decision-making, as seen in business intelligence dashboards.</p>
</li>
<li><p><strong>Web Elements</strong>:<br />  Data products power interactive web elements, such as recommendation engines on platforms like Netflix.</p>
</li>
</ul>
<p>These interaction methods ensure that data products can deliver insights or actions efficiently to end-users or other systems.</p>
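<p>The overt/covert distinction from earlier can be sketched in a few lines of Python: the same underlying data powers both an overt interface that returns the data itself and a covert one that returns only a recommendation. The function names, the toy "also bought" logic, and the <code>PURCHASES</code> data are purely illustrative assumptions.</p>

```python
# Hypothetical sketch: one data asset, two kinds of data product.
# Names, data, and the naive recommendation logic are illustrative.

PURCHASES = {
    "alice": ["laptop", "mouse", "keyboard"],
    "bob": ["laptop"],
}

def overt_api(user):
    """Overt data product: hand the (cleaned) data itself to the consumer,
    e.g. as an API response."""
    return {"user": user, "purchases": PURCHASES.get(user, [])}

def covert_recommendation(user):
    """Covert data product: use the same data internally, but expose only
    the outcome, here a naive 'customers who bought X also bought Y' pick."""
    owned = set(PURCHASES.get(user, []))
    counts = {}
    for other, items in PURCHASES.items():
        if other != user and owned.intersection(items):
            for item in items:
                if item not in owned:
                    counts[item] = counts.get(item, 0) + 1
    return max(counts, key=counts.get) if counts else None

print(overt_api("bob"))              # the data is visible to the caller
print(covert_recommendation("bob"))  # only the insight is visible
```

<p>The design point is that the caller of <code>covert_recommendation</code> never sees the purchase histories, only the resulting suggestion, which is exactly the shift from overt to covert products described above.</p>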
<p>Sources:</p>
<p>Loukides, M. (2011). “The Evolution of Data Products.” O’Reilly Media: <a target="_blank" href="https://www.oreilly.com/radar/evolution-of-data-products/">https://www.oreilly.com/radar/evolution-of-data-products/</a></p>
<p>Palmer, M. (2006). Data is the New Oil: <a target="_blank" href="https://ana.blogs.com/maestros/2006/11/data_is_the_new.html">https://ana.blogs.com/maestros/2006/11/data_is_the_new.html</a></p>
<p>Weber, E. (2021). The Basics of Data Product Management. Talk by Eric Weber, Head of Data Products at Yelp: <a target="_blank" href="https://www.youtube.com/watch?v=TNIje1OrnFA">https://www.youtube.com/watch?v=TNIje1OrnFA</a></p>
<p>O’Regan, S. (2018). Designing Data Products: <a target="_blank" href="https://towardsdatascience.com/designing-data-products-b6b93edf3d23">https://towardsdatascience.com/designing-data-products-b6b93edf3d23</a></p>
]]></content:encoded></item><item><title><![CDATA[Why the majority of data products fail!]]></title><description><![CDATA[Data products holds immense potential to transform businesses, but the harsh reality is that up to 80-85% of data science projects never make it into production, and of those that do, only a tiny fraction—around 8%—create meaningful value for their o...]]></description><link>https://building-data-products.com/why-the-majority-of-data-products-fail</link><guid isPermaLink="true">https://building-data-products.com/why-the-majority-of-data-products-fail</guid><category><![CDATA[DataScienceChallenges]]></category><category><![CDATA[DataProductManagement]]></category><category><![CDATA[DataCulture]]></category><category><![CDATA[DataForBusiness]]></category><category><![CDATA[Data Products]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[AI]]></category><category><![CDATA[data analytics]]></category><dc:creator><![CDATA[Niels Humbeck]]></dc:creator><pubDate>Sat, 07 Dec 2024 06:08:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734586762996/eeafa1ba-6dec-4564-9725-20dabc0b2929.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Data products holds immense potential to transform businesses, but the harsh reality is that <strong>up to 80-85% of data science projects never make it into production</strong>, and of those that do, only a tiny fraction—around <strong>8%—create meaningful value</strong> for their organizations (Thomas, 2020). This staggering failure rate stems from three major issues: <strong>waste, misalignment with business goals, and challenges in scaling data science/ analytics outputs</strong> <strong>(productionizing)</strong> (Atwal, 2020). Let’s dive into these challenges, explore their root causes, and discuss how organizations can overcome them.</p>
<h3 id="heading-1-the-problem-of-waste-during-the-development-of-data-products"><strong>1. The Problem of Waste during the development of Data Products</strong></h3>
<p>In the context of data science, <strong>waste</strong> refers to non-value-adding activities that consume significant resources. Data scientists, analysts, and engineers face numerous barriers in their daily workflows, such as:</p>
<ul>
<li><p><strong>Poorly described data</strong>: Incomplete or unclear documentation about available datasets.</p>
</li>
<li><p><strong>Access restrictions</strong>: Limited or slow access to necessary databases due to authentication hurdles and organizational silos.</p>
</li>
<li><p><strong>Lack of standardization</strong>: Inconsistent processes and frameworks leading to inefficiencies.</p>
</li>
<li><p><strong>Repetitive manual work</strong>: Tasks that could be automated, resulting in lower scalability.</p>
</li>
<li><p><strong>Workarounds prone to failure</strong>: Temporary fixes often collapse under pressure.</p>
</li>
</ul>
<p>These inefficiencies often lead to frustrated teams and subpar results. Worse still, <strong>hidden debt</strong>—legacy systems and outdated practices—further delays progress by making testing and deployment cumbersome.</p>
<h3 id="heading-2-misalignment-with-business-objectives"><strong>2. Misalignment with Business Objectives</strong></h3>
<p>Even the most advanced machine learning model is useless if it doesn’t address a <strong>critical business need</strong>. Miscommunication between data scientists and business stakeholders is a common problem. Without a shared understanding of goals, models may be developed that fail to answer the right questions or provide actionable insights. In some cases, these models are never deployed; in others, they’re implemented but never used.</p>
<p>An agile approach, such as building <strong>Minimum Viable Products (MVPs)</strong>, can help bridge this gap. MVPs allow businesses to test ideas early and adapt quickly to changing requirements, ensuring the data product delivers real value.</p>
<h3 id="heading-3-challenges-in-productionizing-data-products"><strong>3. Challenges in Productionizing Data Products</strong></h3>
<p>A key determinant of success is the <strong>speed and ease of deployment</strong> for data products. Companies that can rapidly test and refine models—using methods like A/B testing—gain a competitive edge. However, lengthy lead times for productionizing data products create bottlenecks.</p>
<p>Effective productionizing requires:</p>
<ul>
<li><p><strong>Robust infrastructure</strong>: Agile systems capable of handling small, iterative changes.</p>
</li>
<li><p><strong>Real-time data</strong>: Fresh data is far more valuable than outdated or historical data, as it allows for proactive decision-making.</p>
</li>
<li><p><strong>Reliable analytics</strong>: High-quality, actionable insights are critical to building trust with stakeholders.</p>
</li>
</ul>
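<p>The A/B testing mentioned above often comes down to comparing conversion rates between a control model and a new variant. A two-proportion z-test, sketched below with invented numbers, is one common way to decide whether an observed lift is statistically meaningful rather than noise.</p>

```python
import math

def two_proportion_z(conversions_a, n_a, conversions_b, n_b):
    """z-statistic for the difference between two conversion rates."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Control model: 120 conversions in 2,000 sessions; new model: 165 in 2,000.
z = two_proportion_z(120, 2000, 165, 2000)
print(f"z = {z:.2f}, significant at the 5% level: {abs(z) > 1.96}")
# prints z = 2.77, significant at the 5% level: True
```

<p>A deployment pipeline that can run this comparison quickly, on fresh production data, is what turns the "speed and ease of deployment" above into a competitive edge.</p>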
<h3 id="heading-root-causes-why-these-challenges-persist"><strong>Root Causes: Why These Challenges Persist</strong></h3>
<p>At the heart of these issues are two fundamental problems (Atwal, 2020):</p>
<ol>
<li><p><strong>Outdated Information Architectures</strong><br /> Many organizations still operate with 20th-century systems, relying on siloed databases, rigid security measures, and manual processes. These setups are ill-suited for modern data analytics, where interoperability, scalability, and speed are paramount.</p>
</li>
<li><p><strong>Knowledge Gaps between Business and Data Professionals, and Weak Organizational Support</strong><br /> Data scientists often lack the domain knowledge needed to align their work with business needs. Similarly, IT departments frequently fail to understand the tools and data access requirements of data teams. This disconnect creates delays, friction, and frustration. Furthermore, weak development skills among data scientists can lead to poor-quality, non-reproducible models that are difficult to productionize.</p>
</li>
</ol>
<p>What is your experience in developing data products? Where do you see the main challenges?</p>
<p>Sources:</p>
<p>Atwal, H. (2020). Practical DataOps. Apress. <a target="_blank" href="https://doi.org/10.1007/978-1-4842-5104-1">https://doi.org/10.1007/978-1-4842-5104-1</a></p>
<p>Thomas. (2020). 10 reasons why data science projects fail. <a target="_blank" href="https://fastdatascience.com/why-do-data-science-projects-fail/">https://fastdatascience.com/why-do-data-science-projects-fail/</a></p>
]]></content:encoded></item></channel></rss>