DevOps — Real Challenges in Implementation

HungWei Chiu
5 min read · Oct 15, 2023

What Is DevOps

In the field of DevOps, the most common question that arises is: What exactly is “DevOps”? How does it differ from a “DevOps Engineer”? Why do job postings often mention roles like “DevOps/SRE Engineer”? Is it a job title, a skill set, or a cultural shift?

As I understand it, DevOps is more of a culture and philosophy than a role. Its goal is to improve the Software Development Life Cycle (SDLC) by having development and operations collaborate rather than work in silos.

Implementation

As Google puts it in its writing on Site Reliability Engineering (SRE), DevOps is like an interface and SRE is one implementation of it. Because implementations differ, discrepancies are common: even with the same DevOps concept, ten different teams might interpret and implement it in ten different ways.

Interpretations of DevOps vary widely. Some see it as just another buzzword, some as a magical solution to all problems, and some believe that hiring for specialized DevOps positions will resolve every issue. Different teams may adopt different strategies. For example:

  1. Dedicated DevOps teams handle tasks that product teams prefer not to deal with.
  2. DevOps teams set the overall direction, and each product team has its own DevOps engineer managing their specific tasks.
  3. No dedicated DevOps engineers; developers are responsible for acquiring the necessary skills.

Additionally, the role of a DevOps engineer is diverse, encompassing tasks such as:

  1. Cloud management
  2. Infrastructure as Code (IaC)
  3. Continuous Integration/Continuous Deployment (CI/CD)
  4. Container management with Kubernetes
  5. Account and permission management
  6. Employee asset maintenance
  7. Internal management systems
  8. Security management
  9. …etc

As the variety of tasks increases, DevOps engineers might find their roles resembling support engineers, addressing internal issues and complexities day in and day out. This situation leaves little time for contemplation and development on a specific issue or project. Constant context-switching becomes a significant issue, incurring additional time costs and hindering the ability to concentrate fully.

I believe this concept is akin to Google SRE’s notion of “Toil.” Nobody enjoys dealing with repetitive mundane tasks all day long. Yet, when these tasks consume your entire day, the time available for planning and developing new architectures diminishes. Over time, this can lead to fatigue and frustration.

With Great Power Comes Great Responsibility?

“With great power comes great responsibility” is a common adage, but with time and experience one gradually realizes that it’s not as straightforward as it seems. In fact, strictly adhering to it can be toxic for both the team and the individual’s development.

As the CNCF’s definition of cloud native technologies puts it:

Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach.

Modern deployment environments involve increasingly complex architectures, a plethora of projects, and a wide array of domains to explore.

The CNCF Landscape illustrates this complexity, with dozens of major domains, each containing numerous related open-source projects. With hundreds of projects available, selecting an appropriate one for a team becomes a daunting task.

The introduction of DevOps culture often gives rise to “a guy” responsible for bridging the gap between Dev and Ops. This person integrates projects from various domains to create an environment and tools suitable for both Dev and Ops, aiming to enhance the efficiency of the SDLC.

System Complexity

Integration is not merely about downloading and installing. Besides understanding project concepts and usage, one must consider how to integrate them with existing architectures.

For instance, when deploying and managing Kubernetes applications using GitOps, should all related applications ideally follow GitOps practices? Consequently, should other projects transition their deployment methods to GitOps? Similarly, if Prometheus and Grafana monitoring stacks are deployed in an environment, should all third-party projects integrate with Prometheus for metrics collection and utilize Grafana for monitoring?

Over time, the integration of all these projects becomes exceedingly intricate and layered. If given the chance, try drawing an architectural diagram of the current environment. Attempt to visualize the following flows and see if you can explain each one comprehensively. You will discover it’s not an easy task:

  1. Network packet flows (Layer 4 + Layer 7)
  2. IAM + AAA account authentication flows
  3. Secret/Vault processes
  4. Observability processes (Tracing/Logging/Monitoring)
  5. CI/CD processes
  6. Cloud/on-premise infrastructure architecture
  7. …and so on

Silver Bullet or Poison?

As teams and products grow, the aforementioned architecture might undergo restructuring, becoming even more complex. Try asking within the team: Who can comprehensively explain every process? Is it a case of “only one person can explain,” or does the team have individuals with expertise in various domains, each an expert in their own right?

In my past work experience, teams sometimes fall into the trap of having only “one person who can explain.” All deployments and architectures rely on this individual, and the others are left feeling out of place, unable to design architectures or debug problems on their own. Although this situation sounds grim, it’s unfortunately very real.

As modern architectures become increasingly complex, installation might be simple, but integration involves a multitude of configurations. If a team prioritizes only the result, knowledge sharing gets neglected because it does not visibly contribute to product releases, and the team ends up in a scenario where “those who know, keep knowing, and those who don’t, keep not knowing.”

This situation could deteriorate, possibly resulting in:

  1. The silver bullet handles everything, including new architecture design and testing, as well as maintenance of the production architecture.
  2. All problems are directed to the silver bullet, creating a bottleneck.
  3. Others feel a growing gap that seems almost impossible to close, lose the motivation to learn, and end up solving only minor issues.
  4. If the team doesn’t recognize this problem, (1), (2), and (3) reinforce one another in a vicious cycle.

This approach has serious side effects for both individuals and the team:

For the team:

  1. Knowledge is not shared; all critical information resides with that silver bullet. If this expert leaves, it can significantly impact the team in the short term.
  2. Because vital knowledge is concentrated with that silver bullet, issues might pile up, and that silver bullet becomes a bottleneck.

For individuals:

  1. The silver bullet’s time becomes fragmented. Relying on this individual for everything results in excessive context switching, leading to frustration and helplessness.
  2. Others don’t experience a sense of achievement and lack opportunities to perform, although some people might prefer this situation, as it offers an easy, stress-free job.

In deployment architecture terms, this phenomenon is essentially a single point of failure (SPOF). If the silver bullet resigns or is temporarily unable to work, the entire system could be disrupted.
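One rough way to make this visible is to write down who can fully explain each of the flows listed earlier and check which ones depend on a single person. The sketch below is only an illustration: the names and the `knowledge_map` assignments are hypothetical, and a real team would fill them in from its own honest assessment.

```python
from collections import defaultdict

# Hypothetical data: which team members can fully explain each flow.
# The people and assignments are made up for illustration.
knowledge_map = {
    "network packet flows": ["alice"],
    "IAM/AAA authentication": ["alice"],
    "secret management": ["alice", "bob"],
    "observability": ["alice"],
    "CI/CD": ["alice", "carol"],
    "infrastructure": ["alice"],
}


def single_points_of_failure(knowledge):
    """Return {person: [areas that only this person can explain]}."""
    sole_owner = defaultdict(list)
    for area, people in knowledge.items():
        if len(set(people)) == 1:
            sole_owner[people[0]].append(area)
    return dict(sole_owner)


spof = single_points_of_failure(knowledge_map)
for person, areas in spof.items():
    print(f"{person} is the only one who can explain: {', '.join(areas)}")
```

If one name owns most of the list alone, the team has a human SPOF, regardless of how redundant the infrastructure itself is.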

Solution?

The occurrence of this phenomenon raises the question: can writing documentation improve the situation? I believe it can be effective but only to a certain extent.

For example, consider exploring various access configurations for applications within Kubernetes. This involves Kubernetes RBAC settings and might also entail cloud integrations, such as GCP Workload Identity or AWS IRSA.

However, if the reader lacks sufficient understanding of Kubernetes or AWS/GCP, these documents might as well be written in a foreign language. In the end, it comes back to whether the team members share a consistent foundational knowledge base.
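To give a sense of how much such a document silently assumes, here is a minimal sketch of the Kubernetes side of that setup, built as plain Python dictionaries and printed as JSON. Every name in it (namespace, service account, role ARN, GCP service account email) is a made-up placeholder, and the cloud-side configuration (the IAM role and its trust relationship, or the GCP IAM binding) is deliberately left out.

```python
import json

# All names below are hypothetical placeholders for illustration.
NAMESPACE = "demo"
SERVICE_ACCOUNT = "report-reader"

# Annotation used by AWS IRSA to link a ServiceAccount to an IAM role.
IRSA = {"eks.amazonaws.com/role-arn": "arn:aws:iam::111122223333:role/report-reader"}
# Annotation used by GKE Workload Identity to link to a GCP service account.
GKE = {"iam.gke.io/gcp-service-account": "report-reader@example-project.iam.gserviceaccount.com"}


def service_account(cloud_annotation):
    """ServiceAccount the pod runs as; the annotation ties it to a cloud identity."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": SERVICE_ACCOUNT,
            "namespace": NAMESPACE,
            "annotations": cloud_annotation,
        },
    }


def read_only_role():
    """Namespaced Role that only allows reading ConfigMaps."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "configmap-reader", "namespace": NAMESPACE},
        "rules": [
            {"apiGroups": [""], "resources": ["configmaps"], "verbs": ["get", "list"]}
        ],
    }


def role_binding():
    """Bind the Role above to the ServiceAccount."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "configmap-reader-binding", "namespace": NAMESPACE},
        "subjects": [
            {"kind": "ServiceAccount", "name": SERVICE_ACCOUNT, "namespace": NAMESPACE}
        ],
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": "configmap-reader",
        },
    }


manifests = [service_account(IRSA), read_only_role(), role_binding()]
print(json.dumps(manifests, indent=2))
```

Even this small example presumes the reader knows what a ServiceAccount is, how RBAC rules and bindings relate, and how the cloud provider maps the annotation to its own IAM model, which is exactly the shared foundation that documentation alone cannot supply.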

I think teams should start from a mindset change rather than relying on specific “Action Items” to avoid issues:

  1. Don’t assume that implementing concepts like “DevOps” merely involves installing software everywhere; the complexity behind it increases with scale.
  2. Invest time in observing the division of labor within the team. There will undoubtedly be members with relative seniority and expertise. The crucial question is whether these members are gradually becoming bottlenecks, that is, the silver bullet.
  3. Encourage knowledge sharing to foster a team environment that is dynamic and conducive to mutual growth.
  4. Handle deadlines appropriately. Evaluation shouldn’t solely rely on the abilities of a silver bullet.
