🚀 DevOps Complete Learning Guide

Advanced Topics with Practical Examples

1. Git & GitHub Advanced

Git Internals & Advanced Commands

# Understanding Git Objects
git cat-file -p HEAD    # Print object content
git ls-tree HEAD        # List tree object
git rev-parse HEAD      # Get SHA of HEAD
git reflog              # Reference log of all changes
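To make the object model concrete, the sketch below reproduces how Git derives a blob's ID: it hashes a short header (`blob <size>` plus a NUL byte) followed by the file content. It assumes a repository that uses the default SHA-1 object format and a system with `sha1sum`; `git_blob_sha` is a hypothetical helper name, not a Git command.

```shell
#!/bin/sh
# Hypothetical helper: compute the ID Git would assign to a file stored as a blob.
# Assumes the default SHA-1 object format and the coreutils sha1sum tool.
git_blob_sha() {
  file=$1
  # Size of the content in bytes (tr strips padding that some wc versions add).
  size=$(wc -c < "$file" | tr -d ' ')
  # Header "blob <size>\0" followed by the raw content, hashed together.
  { printf 'blob %s\0' "$size"; cat "$file"; } | sha1sum | cut -d' ' -f1
}

tmp=$(mktemp)
printf 'hello\n' > "$tmp"
git_blob_sha "$tmp"   # should match: git hash-object "$tmp"
rm -f "$tmp"
```

Inside a repository, comparing the result with `git hash-object <file>` is a quick way to confirm that object IDs depend only on content, not on file names.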

Interactive Rebase

Interactive rebase is powerful for cleaning up commit history before pushing to remote.
# Squash last 3 commits
git rebase -i HEAD~3
# In the editor, you'll see:
# pick abc1234 First commit
# pick def5678 Second commit
# pick ghi9012 Third commit
# Change to:
# pick abc1234 First commit
# squash def5678 Second commit
# squash ghi9012 Third commit

Git Stash Advanced

Command | Description
git stash push -m "message" | Stash with a descriptive message (replaces the deprecated git stash save)
git stash list | List all stashes
git stash apply stash@{2} | Apply a specific stash
git stash branch new-branch | Create a branch from the latest stash
git stash show -p stash@{0} | Show stash content as a patch

Git Hooks Example

Use Git hooks to enforce code quality and standards before commits.
#!/bin/sh
# .git/hooks/pre-commit
# Make it executable: chmod +x .git/hooks/pre-commit
# Run tests before commit
npm test || exit 1
# Check for console.log statements (dependencies in node_modules are skipped)
if grep -rn "console\.log" --include="*.js" --exclude-dir=node_modules .; then
  echo "Remove console.log statements before committing"
  exit 1
fi
# Run linter
npm run lint || exit 1
echo "Pre-commit checks passed!"

2. GitHub Actions

Complete CI/CD Workflow

name: CI/CD Pipeline
on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]
env:
  NODE_VERSION: '16'
  DOCKER_REGISTRY: ghcr.io
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Run tests
        run: npm test
      - name: Upload coverage
        uses: actions/upload-artifact@v4
        with:
          name: coverage-report
          path: coverage/

Matrix Strategy

Use matrix strategies to test across multiple versions and platforms simultaneously.
jobs:
  test:
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
        node: [14, 16, 18]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm test
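The matrix above expands to one job per combination of `os` and `node`, so 3 × 3 = 9 jobs. The loop below only illustrates that expansion in shell; it is not how GitHub Actions schedules jobs, and the printed job labels are an approximation of what the UI shows.

```shell
#!/bin/sh
# Enumerate the os x node combinations that the matrix above would produce.
expand_matrix() {
  for os in ubuntu-latest windows-latest macos-latest; do
    for node in 14 16 18; do
      echo "test ($os, $node)"
    done
  done
}

expand_matrix
echo "total jobs: $(expand_matrix | wc -l | tr -d ' ')"
```

Adding a third matrix key multiplies the count again, which is worth keeping in mind when runner minutes are limited.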

3. Linux for DevOps

Essential System Commands

Category | Command | Description
System Info | uname -a | All system information
System Info | df -h | Filesystem disk usage in human-readable units
System Info | free -h | Memory usage
System Info | top / htop | Process monitoring
Network | ss -tulpn | Show listening ports
Network | ip addr show | Show IP addresses
Network | curl -I URL | Get HTTP headers
Network | dig domain.com | DNS lookup

Service Management (systemd)

# Service control
systemctl start nginx
systemctl stop nginx
systemctl restart nginx
systemctl reload nginx
systemctl enable nginx     # Enable at boot
systemctl disable nginx    # Disable at boot
systemctl status nginx
# View logs
journalctl -u nginx -f                     # Follow logs
journalctl -u nginx --since "1 hour ago"
journalctl -p err                          # Error logs only

Creating a Custom Service

Always create custom systemd services for production applications to ensure proper lifecycle management.
# /etc/systemd/system/myapp.service
[Unit]
Description=My Application
After=network.target
[Service]
Type=simple
User=appuser
WorkingDirectory=/opt/myapp
ExecStart=/usr/bin/node app.js
Restart=always
RestartSec=10
# "syslog" is deprecated as an output target in current systemd; the journal is the default sink
StandardOutput=journal
StandardError=journal
SyslogIdentifier=myapp
[Install]
WantedBy=multi-user.target

# Reload and start
systemctl daemon-reload
systemctl start myapp
systemctl enable myapp

Changing File Permissions

The chmod command enables you to change the permissions on a file. You must be superuser or the owner of a file or directory to change its permissions.

You can use the chmod command to set permissions in either of two modes:

  • Absolute Mode - Use numbers to represent file permissions (the method most commonly used to set permissions). When you change permissions by using the absolute mode, represent permissions for each triplet by an octal mode number.
  • Symbolic Mode - Use combinations of letters and symbols to add or remove permissions.

Setting File Permissions in Absolute Mode

Octal Value | File Permissions Set | Permissions Description
0 | --- | No permissions
1 | --x | Execute permission only
2 | -w- | Write permission only
3 | -wx | Write and execute permissions
4 | r-- | Read permission only
5 | r-x | Read and execute permissions
6 | rw- | Read and write permissions
7 | rwx | Read, write, and execute permissions

Setting File Permissions in Symbolic Mode

Symbol | Function | Description
u | Who | User (owner)
g | Who | Group
o | Who | Others
a | Who | All
= | Operation | Assign
+ | Operation | Add
- | Operation | Remove
r | Permission | Read
w | Permission | Write
x | Permission | Execute
l | Permission | Mandatory locking, setgid bit is on, group execution bit is off
s | Permission | setuid or setgid bit is on
S | Permission | setuid bit is on, user execution bit is off
t | Permission | Sticky bit is on, execution bit for others is on
T | Permission | Sticky bit is on, execution bit for others is off

How to Change Permissions in Absolute Mode

If you are not the owner of the file or directory, become superuser.

Only the current owner or superuser can use the chmod command to change file permissions on a file or directory.

Change permissions in absolute mode by using the chmod command:

# Change file permissions using absolute mode
chmod nnn filename

nnn specifies the octal values that change permissions on the file or directory. See the table above for the list of valid octal values.

filename is the file or directory.

Verify the permissions of the file have changed:

# Verify file permissions
ls -l filename

Example--Changing Permissions in Absolute Mode

# Set rwxr-xr-x permissions on myfile
chmod 755 myfile
ls -l myfile
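Each octal digit is the sum of read (4), write (2), and execute (1) for one triplet, so 755 means rwx for the owner and r-x for group and others. The sketch below builds each digit from its letters and checks the result on a scratch file; `perm_digit` is a hypothetical helper, and `stat -c` assumes GNU coreutils.

```shell
#!/bin/sh
# Hypothetical helper: turn a triplet such as "r-x" into its octal digit.
perm_digit() {
  digit=0
  case $1 in r??) digit=$((digit + 4)) ;; esac   # read    = 4
  case $1 in ?w?) digit=$((digit + 2)) ;; esac   # write   = 2
  case $1 in ??x) digit=$((digit + 1)) ;; esac   # execute = 1
  echo "$digit"
}

mode="$(perm_digit rwx)$(perm_digit r-x)$(perm_digit r-x)"
echo "rwxr-xr-x -> $mode"

# Apply the mode to a scratch file and read it back (GNU stat assumed).
scratch=$(mktemp)
chmod "$mode" "$scratch"
stat -c '%a %A' "$scratch"
rm -f "$scratch"
```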

How to Change Permissions in Symbolic Mode

If you are not the owner of the file or directory, become superuser.

Only the current owner or superuser can use the chmod command to change file permissions on a file or directory.

Change permissions in symbolic mode by using the chmod command:

# Change file permissions using symbolic mode
chmod who operator permission filename

who specifies whose permissions are changed, operator specifies the operation to perform, and permission specifies what permissions are changed. See the table above for the list of valid symbols.

filename is the file or directory.

Verify the permissions of the file have changed:

# Verify file permissions
ls -l filename

Examples--Changing Permissions in Symbolic Mode

# Take away read permission from others
chmod o-r filea
# Add read and execute permissions for user, group, and others
chmod a+rx fileb
# Assign read, write, and execute permissions to group
chmod g=rwx filec
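To see how the three operators differ, the sketch below walks a single scratch file through the same operations and prints the mode after each step. The starting mode of 644 is an arbitrary choice for the demonstration, and `stat -c` assumes GNU coreutils.

```shell
#!/bin/sh
# Walk one scratch file through the symbolic-mode operations shown above.
f=$(mktemp)
show() { stat -c '%A' "$f"; }

chmod 644 "$f";   show   # -rw-r--r--  starting point
chmod o-r "$f";   show   # -rw-r-----  "-" removes read from others only
chmod a+rx "$f";  show   # -rwxr-xr-x  "+" adds r and x for everyone, keeps the rest
chmod g=rwx "$f"; show   # -rwxrwxr-x  "=" sets the group triplet to exactly rwx
final=$(stat -c '%a' "$f")
rm -f "$f"
```

Note that `+` and `-` leave unrelated bits untouched, while `=` overwrites the whole triplet for the named class.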

4. Docker Deep Dive

Multi-Stage Dockerfile

Multi-stage builds reduce image size by separating build dependencies from runtime.
# Production dependencies only
FROM node:16-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
# All dependencies, including development ones
FROM node:16-alpine AS dev-deps
WORKDIR /app
COPY package*.json ./
RUN npm ci
# Build application
FROM dev-deps AS build
COPY . .
RUN npm run build
# Production stage
FROM node:16-alpine AS runtime
RUN apk add --no-cache tini
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001
WORKDIR /app
COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=build --chown=nodejs:nodejs /app/dist ./dist
# HEALTHCHECK needs healthcheck.js in the image (assumed here to live at the project root)
COPY --from=build --chown=nodejs:nodejs /app/healthcheck.js ./healthcheck.js
USER nodejs
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD node healthcheck.js || exit 1
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["node", "dist/index.js"]

Docker Commands Reference

# Container management
docker run -d \
  --name myapp \
  --restart unless-stopped \
  --memory="512m" \
  --cpus="0.5" \
  -p 8080:80 \
  -v "$(pwd)/data:/data:ro" \
  --env-file .env \
  myimage:latest
# Inspection and debugging
docker inspect container_id
docker stats
docker logs -f --tail 50 container_name
docker exec -it container_name sh
# Cleanup
docker system prune -a --volumes
docker image prune -a
docker container prune
docker volume prune

5. Kubernetes Complete Guide

Deployment with All Features

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: myapp:v1
          ports:
            - containerPort: 8080
          env:
            - name: NODE_ENV
              value: "production"
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-secret
                  key: password
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "200m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          volumeMounts:
            - name: config
              mountPath: /etc/config
      volumes:
        - name: config
          configMap:
            name: app-config

Essential Kubectl Commands

Category | Command
Cluster Info | kubectl cluster-info
Get Resources | kubectl get pods -o wide --all-namespaces
Describe | kubectl describe pod pod-name
Logs | kubectl logs -f pod-name --tail=50
Execute | kubectl exec -it pod-name -- /bin/bash
Port Forward | kubectl port-forward pod-name 8080:80
Scale | kubectl scale deployment app --replicas=5
Rollout | kubectl rollout status deployment/app

What is Ingress in Kubernetes?

Ingress is a Kubernetes API object used to manage external access to services inside the cluster.

Ingress is used to:

  • expose HTTP/HTTPS applications
  • route incoming traffic to different services
  • define host-based or path-based routing
  • handle SSL/TLS termination

Ingress acts like a set of rules that tells Kubernetes:

👉 “When traffic comes from outside, where should it go?”

What is an Ingress Controller?

An Ingress Controller is the actual implementation that executes the rules defined in the Ingress object.

Ingress is only a configuration. It does nothing by itself. The Ingress Controller is responsible for processing those rules and routing the traffic.

Examples of Ingress Controllers:

  • NGINX Ingress Controller
  • AWS Load Balancer Controller (formerly AWS ALB Ingress Controller)
  • GKE Ingress
  • Traefik Ingress Controller
  • HAProxy Ingress

Simple Analogy

Ingress: Think of Ingress as a “traffic rulebook” written on paper.

Ingress Controller: Think of the Ingress Controller as the “traffic police” who reads the rulebook and actually controls the traffic.

Visual Diagram Explanation

Internet
   ↓
LoadBalancer
   ↓
Ingress Controller
--------------------------------
/app1 → Service1
/app2 → Service2
/admin → Service3

The Ingress Controller reads these rules and handles routing.
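The path rules in the diagram can be read as a lookup from request path to backend. The function below mimics that lookup in shell, matching whole path segments the way an Ingress `Prefix` path does. It is a teaching sketch, not controller code; `route` is a hypothetical name and the service names simply mirror the diagram.

```shell
#!/bin/sh
# Map a request path to the backend named in the diagram.
# A prefix matches the exact path or anything below it ("/app1", "/app1/x"),
# but not a longer first segment such as "/app10".
route() {
  case $1 in
    /app1|/app1/*)   echo "Service1" ;;
    /app2|/app2/*)   echo "Service2" ;;
    /admin|/admin/*) echo "Service3" ;;
    *)               echo "no matching rule" ;;
  esac
}

for path in /app1 /app2/orders /admin/users /app10 /other; do
  echo "$path -> $(route "$path")"
done
```

A real controller applies the same idea per host and forwards the request instead of printing a name; unmatched requests typically go to a default backend.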

Summary

Component | Meaning | Responsibility
Ingress | A set of routing rules | Defines how traffic should flow
Ingress Controller | The engine that applies the rules | Routes traffic to the correct services

Basic Cluster Commands

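Command | Description
kubectl version | Show client and server Kubernetes versions.
kubectl cluster-info | Show control plane and cluster DNS endpoints.
kubectl get all | List common workload resources (pods, services, deployments, and so on) in the current namespace; ConfigMaps, Secrets, and Ingresses are not included.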

Working With Contexts

Command | Description
kubectl config get-contexts | List all contexts (cluster, user, and namespace combinations).
kubectl config use-context dev | Switch to the context named dev.
kubectl config current-context | Show which context is active.

Namespaces

Command | Description
kubectl get namespaces | List namespaces.
kubectl create namespace dev | Create namespace.
kubectl delete namespace dev | Delete namespace.
kubectl config set-context --current --namespace=dev | Set the default namespace for the current context.

Pods

Command | Description
kubectl get pods | List pods.
kubectl get pods -o wide | Show pod IP, node, etc.
kubectl describe pod pod-name | Detailed pod info and events.
kubectl logs pod-name | View logs.
kubectl logs -f pod-name | Follow live logs.
kubectl exec -it pod-name -- bash | Open a shell inside the pod.
kubectl delete pod pod-name | Delete a pod.

Deployments

Command | Description
kubectl get deployments | List deployments.
kubectl create deployment web --image=nginx | Create deployment.
kubectl scale deployment web --replicas=5 | Scale number of pods.
kubectl rollout status deployment/web | Check deployment rollout status.
kubectl rollout undo deployment/web | Roll back to the previous revision.
kubectl delete deployment web | Delete deployment.

ReplicaSets

Command | Description
kubectl get rs | List ReplicaSets.
kubectl describe rs rs-name | ReplicaSet details.

Services

Command | Description
kubectl get svc | List services.
kubectl expose deployment web --type=NodePort --port=80 | Expose deployment as a service.
kubectl describe svc web | Service details.
kubectl get svc -o wide | See service cluster IP, ports, and selector.

ConfigMaps

Command | Description
kubectl create configmap app-config --from-literal=env=prod | Create ConfigMap.
kubectl create configmap myconfig --from-file=config.properties | Create from file.
kubectl get configmaps | List ConfigMaps.
kubectl describe configmap app-config | View config details.

Secrets

Command | Description
kubectl create secret generic db-secret --from-literal=password=1234 | Create secret.
kubectl create secret tls tls-secret --cert=server.crt --key=server.key | Create TLS secret (type kubernetes.io/tls).
kubectl get secrets | List secrets.
kubectl describe secret db-secret | Describe secret (shows keys and sizes, not values).

YAML Apply, Update & Delete

Command | Description
kubectl apply -f deployment.yaml | Apply or update resource.
kubectl delete -f deployment.yaml | Delete resource.
kubectl edit deployment web | Edit resource live in editor.

Nodes & Cluster Info

Command | Description
kubectl get nodes | List nodes.
kubectl get nodes -o wide | Node details (OS, internal IP).
kubectl describe node node1 | Node details (taints, capacity).
kubectl drain node1 --ignore-daemonsets | Drain node safely.
kubectl cordon node1 | Mark node unschedulable.
kubectl uncordon node1 | Mark node schedulable.

Resource Usage

Command | Description
kubectl top pods | CPU and memory usage of pods (Metrics Server required).
kubectl top nodes | CPU and memory usage of nodes (Metrics Server required).

Taints & Tolerations

Add taint:

kubectl taint nodes node1 key=value:NoSchedule

Remove taint:

kubectl taint nodes node1 key=value:NoSchedule-

Used for:

  • Dedicate nodes
  • Restrict workloads
  • Isolation

Labels & Selectors

Command | Description
kubectl label pod web app=frontend | Add label.
kubectl get pods -l app=frontend | Filter with label.
kubectl label pod web app- | Remove label.

Port Forwarding

kubectl port-forward pod/mypod 8080:80

Access pod locally. Common for debugging APIs.

Ingress

Command | Description
kubectl get ingress | List ingress rules.
kubectl describe ingress my-ingress | Ingress details.

StatefulSets

Command | Description
kubectl get statefulsets | List StatefulSets.
kubectl describe statefulset mysql | Stateful app details.

Used for:

  • Databases
  • Kafka
  • Elasticsearch

DaemonSets

Command | Description
kubectl get daemonsets | List DaemonSets.

Used for:

  • Logging agents
  • Monitoring agents

Jobs & CronJobs

Command | Description
kubectl create job myjob --image=busybox | Create Job.
kubectl create cronjob backup --image=busybox --schedule="*/5 * * * *" | Create CronJob (runs every 5 minutes).
kubectl get jobs | List jobs.
kubectl get cronjobs | List CronJobs.

Pod Debugging

Command | Description
kubectl describe pod web | Check events.
kubectl logs web --previous | Check logs of the previous (crashed) container instance.
kubectl exec -it web -- sh | Debug inside pod.
kubectl get pod web -o yaml | View complete pod spec.

Troubleshooting Cluster

Command | Description
kubectl get events | Events in the current namespace (add -A for all namespaces).
kubectl get endpoints | Check service endpoints.
kubectl get componentstatus | Control plane component status (deprecated since v1.19).
kubectl get networkpolicies | Check network restrictions.

Network Policies

Command | Description
kubectl get netpol | List policies.
kubectl describe netpol | Network policy details.

Used for:

  • Restrict pod-to-pod communication
  • Zero-trust networking

Storage

Command | Description
kubectl get pv | List persistent volumes.
kubectl get pvc | List persistent volume claims.
kubectl describe pv pv-name | PV details.
kubectl describe pvc pvc-name | PVC details.

Service Accounts & RBAC

Command | Description
kubectl get serviceaccounts | List service accounts.
kubectl create serviceaccount dev-sa | Create SA.
kubectl get clusterrole | List cluster roles.
kubectl get clusterrolebinding | List bindings.

Used for:

  • Access control
  • Least privilege
  • Pod-to-AWS auth (IRSA)

Deleting Everything

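Command | Description
kubectl delete all --all | Delete pods, services, deployments, and other workload resources in the current namespace (ConfigMaps, Secrets, and PVCs remain).
kubectl delete namespace dev | Delete the entire namespace and everything in it.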

Useful Shortcuts

Command | Description
kubectl get po | Short for pods.
kubectl get deploy | Short for deployments.
kubectl get svc | Short for services.
kubectl get ing | Short for ingress.
kubectl get cm | Short for configmaps.
kubectl get no | Short for nodes.

Apply with Dry Run (very important)

kubectl apply -f app.yaml --dry-run=client

Check if YAML is valid without applying.

28. Debugging Node Issues

# Check taints, disk pressure, memory pressure.
kubectl describe node node1
# Detailed node conditions.
kubectl get nodes -o json | jq '.items[].status.conditions'

29. Extract Pod YAML

# Export pod definition.
kubectl get pod web -o yaml > pod.yaml

Useful for reproducing or modifying pods.

30. Kubernetes API Access

# Start API server proxy for debugging.
kubectl proxy

31. Advanced Pod Debugging

# Exec into a specific container inside a multi-container pod.
kubectl exec -it pod --container app -- sh
# View logs of a specific container.
kubectl logs pod -c app
# Check logs of a container that crashed.
kubectl logs pod --previous
# Search for errors in pod events.
kubectl describe pod pod | grep -i error

32. Ephemeral Debug Container (K8s 1.23+)

# Attach a temporary debug container.
kubectl debug podname -it --image=busybox
# Debug an entire node.
kubectl debug node/node1 -it --image=busybox

Used for:

  • CrashLoopBackOff
  • ImagePullBackOff
  • Node-level debugging

33. ImagePull & Crash Issues

# Get reason for failed pod.
kubectl get pod pod -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}'

Common issues:

  • ImagePullBackOff
  • ErrImagePull
  • CrashLoopBackOff

34. Rollouts & History

# View deployment history.
kubectl rollout history deployment web
# Rollback to a specific version.
kubectl rollout undo deployment web --to-revision=2
# Pause rollout (for debugging).
kubectl rollout pause deployment web
# Resume rollout.
kubectl rollout resume deployment web

35. Advanced Scaling

# Scale down to zero.
kubectl scale deployment web --replicas=0
# Scale StatefulSet (ordered).
kubectl scale statefulset mysql --replicas=3

36. YAML Dry Run & Diff

# Show changes before applying.
kubectl diff -f app.yaml
# Server-side validation without apply.
kubectl apply -f app.yaml --dry-run=server

Useful for CI/CD pipelines.

37. Patching (Very Important for DevOps)

# Patch resources inline:
kubectl patch deployment web -p '{"spec": {"replicas": 4}}'
# Patch with strategic merge:
kubectl patch svc web --patch-file patch.yaml

Used for:

  • Hotfixes
  • On-the-fly changes
  • CI/CD automation

38. Node Health & Troubleshooting

# Find unhealthy nodes (Ready condition is not "True").
kubectl get nodes -o json | jq -r '.items[] | select(any(.status.conditions[]; .type=="Ready" and .status!="True")) | .metadata.name'
# Check disk pressure, memory pressure.
kubectl describe node node1
# List all pods on a node.
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=node1

39. Events & Cluster-level Issues

# Show all events.
kubectl get events
# Sort events by time.
kubectl get events --sort-by='.lastTimestamp'
# Access kube-apiserver metrics.
kubectl get --raw /metrics

40. Service Debugging

# Check if service has endpoints.
kubectl get endpoints web
# Check wrong targetPort / selector.
kubectl describe svc web
# Test service connectivity (print the response instead of saving a file).
kubectl run test --image=busybox -it --rm -- wget -qO- http://web

41. Ingress Debugging

# List ingress.
kubectl get ingress
# Check rules & events.
kubectl describe ingress web-ingress
# Check ingress controller health.
kubectl get pods -n ingress-nginx

42. Network Policy Debugging

# List network policies.
kubectl get netpol
# View network rules.
kubectl describe netpol allow-app
# Test connectivity:
kubectl run test --rm -it --image=busybox -- wget http://pod-ip

43. Persistent Storage Commands

# List storage classes.
kubectl get storageclass
# Details of storage class.
kubectl describe storageclass gp2
# List persistent volumes.
kubectl get pv
# List persistent volume claims.
kubectl get pvc
# PVC status.
kubectl describe pvc my-pvc

44. Logs & Audit

# Logs from the last 1 hour.
kubectl logs pod --since=1h
# Show last 50 lines.
kubectl logs pod --tail=50
# Show logs with timestamps.
kubectl logs pod --timestamps

45. Node-to-Pod Debugging

# Find pod IP.
kubectl get pod -o wide
# Test pod connectivity (replace pod-ip with the address found above).
kubectl run debug --rm -it --image=busybox --command -- ping pod-ip

46. Copy Files To & From Pod

# Copy from pod → local.
kubectl cp pod:/app/logs ./logs
# Copy local → pod.
kubectl cp file.txt pod:/app/

47. ConfigMaps & Secrets Debugging

# View ConfigMap values.
kubectl get configmaps -o yaml
# Base64 encoded output.
kubectl get secret db-secret -o yaml
# Decode:
echo 'cGFzc3dvcmQ=' | base64 --decode

48. K8s Useful JSONPath Queries

# Get pod image:
kubectl get pod web -o jsonpath='{.spec.containers[*].image}'
# Get pod node:
kubectl get pod web -o jsonpath='{.spec.nodeName}'

49. Resource Quotas & Limits

# List quotas.
kubectl get resourcequota
# Quota details.
kubectl describe resourcequota rq1

50. LimitRanges

# List pod limit ranges.
kubectl get limitrange

Used for:

  • default CPU/memory
  • maximum/minimum allowed resources

51. Service Account with Pod

# Show token & secrets.
kubectl describe sa my-sa
# Check which service account pod uses (the describe output prints "Service Account:").
kubectl describe pod pod | grep -i "service account"

52. RBAC Debugging

# Test user permission:
kubectl auth can-i get pods --as user1
# Test namespace-specific permission:
kubectl auth can-i delete pods -n dev

53. Port & Connectivity Debugging (Must Know)

# Test port:
kubectl run test --rm -it --image=busybox -- nc -zv web 80
# DNS check:
kubectl run test --rm -it --image=busybox -- nslookup web

54. Horizontal Pod Autoscaling

# Create autoscaler.
kubectl autoscale deployment web --cpu-percent=50 --min=2 --max=10
# List HPAs.
kubectl get hpa
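The autoscaler created above targets 50% CPU with between 2 and 10 replicas. Kubernetes documents the core calculation as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The sketch below applies that formula with integer arithmetic and clamps the result to the min/max; it deliberately ignores the controller's tolerance band and stabilization windows, and `desired_replicas` is a hypothetical helper.

```shell
#!/bin/sh
# desired_replicas CURRENT_REPLICAS CURRENT_CPU_PERCENT TARGET_PERCENT MIN MAX
desired_replicas() {
  current=$1 usage=$2 target=$3 min=$4 max=$5
  # Integer ceiling of current * usage / target.
  want=$(( (current * usage + target - 1) / target ))
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi
  echo "$want"
}

desired_replicas 4 90 50 2 10    # ceil(4 * 90 / 50) = ceil(7.2) = 8
desired_replicas 4 10 50 2 10    # ceil(0.8) = 1, raised to the minimum of 2
desired_replicas 8 100 50 2 10   # 16, capped at the maximum of 10
```

Working through a few values like this helps when choosing `--min` and `--max`, since the cap, not the formula, often decides the outcome under heavy load.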

Cluster Autoscaler Debugging

# Check autoscaler pod.
kubectl -n kube-system get pods | grep autoscaler

Advanced Resource Filtering (field selectors)

Command | Description
kubectl get pods --field-selector spec.nodeName=node1 | Get pods scheduled on a specific node
kubectl get pods --field-selector=status.phase=Failed | Get pods that failed
kubectl get pods --field-selector=status.phase!=Running | Get pods that are not running
kubectl get pods --sort-by=.metadata.creationTimestamp | List pods sorted by creation time (oldest first)

Advanced JSON & YAML Output Formatting

Command | Description
kubectl get pods -o jsonpath='{.items[*].metadata.name}' | Get only pod names
kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}' | Get node internal IPs
kubectl get pods -o jsonpath='{.items[*].spec.containers[*].image}' | Get pod image names

Temporary BusyBox Pod (for debugging)

# Used for DNS testing, connecting to services, checking network restrictions
kubectl run tmp --rm -it --image=busybox -- sh

Check Pod Events Script

# Helpful command to sort events
kubectl get events --sort-by='.lastTimestamp'
Best for: CrashLoopBackOff, Pod scheduling failure, Image pull issues

Get Pod Environment Variables

kubectl exec -it pod-name -- printenv
kubectl get pod pod-name -o jsonpath='{.spec.containers[*].env}'

Debugging Service DNS

Command | Description
kubectl run dns-test --rm -it --image=busybox -- nslookup svc-name | Test DNS resolution
kubectl run dns-test --rm -it --image=busybox -- nslookup kubernetes.default | Test cluster DNS

Debug Service Connectivity

Command | Description
kubectl run test --rm -it --image=busybox -- nc -zv svc-name 80 | Test port connectivity
kubectl run test --rm -it --image=curlimages/curl -- curl http://svc-name | Test via curl

Debug Node Ports

curl http://node-ip:node-port

Pod Security / User Permissions

kubectl get pod pod -o jsonpath='{.spec.containers[*].securityContext.runAsUser}'

Copy Kubernetes Manifest From Live Resource

Command | Description
kubectl get deploy web -o yaml > web.yaml | Extract current running deployment YAML
kubectl get cm app-config -o yaml > configmap.yaml | Extract current configmap YAML

Validating YAML

Command | Description
kubectl apply -f app.yaml --dry-run=client | Client-side validation
kubectl apply -f app.yaml --dry-run=server | Server-side validation
kubeval app.yaml | Lint YAML (if installed)

Kubernetes API Access (Raw)

Command | Description
kubectl get --raw /metrics | View API server metrics
kubectl get --raw /api | Access API paths

Node Disk / Memory Pressure Debugging

kubectl describe node node1 | grep -i pressure

Look for:

  • DiskPressure
  • MemoryPressure
  • PIDPressure

Node Logs (Master & Worker Debugging)

  • journalctl -u kubelet -f
  • journalctl -u containerd -f

Restarting Pods Properly

Command | Description
kubectl delete pod pod-name | Delete pod gracefully (the Deployment recreates it)
kubectl delete pod pod-name --force --grace-period=0 | Force delete stuck pod

Restart Deployment (Without Editing)

kubectl rollout restart deployment web

Checking Cluster Authentication

Command | Description
kubectl auth can-i create pods | Test if the current user can perform an action
kubectl auth can-i delete pods --as bob | Test as a specific user

Get Logs From ALL Pods of a Deployment

kubectl logs -l app=web --all-containers=true
Labels must match deployment selector.

Debugging Network Policies

Command | Description
kubectl get netpol | List policies
kubectl run test --rm -it --image=busybox -- sh | Connectivity test across namespaces

Inside:

wget http://pod-ip

ConfigMap Reload Troubleshooting

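Pods do not always pick up ConfigMap changes on their own:
  • Values injected as environment variables, or mounted with subPath, change only when the Pod restarts
  • ConfigMaps mounted as volumes (including projected volumes) are refreshed by the kubelet after a delay, but the application must re-read the files
  • Sidecar reloaders (e.g., Reloader, ConfigMap reloader) can trigger the restart or reload automatically
Check which ConfigMaps a Pod mounts:
kubectl get pod pod -o jsonpath='{.spec.volumes[*].configMap.name}'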

Secret Decoding & Validation

Command | Description
echo 'cGFzc3dvcmQ=' | base64 --decode | Decode a secret
echo -n 'mypassword' | base64 | Encode new value
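Secret values are base64-encoded, which is an encoding and not encryption. The round trip below also shows why the encode command uses `echo -n`: without it, a trailing newline becomes part of the encoded value. The sketch assumes the coreutils `base64` tool, and the value `password` is just the sample string that the decode example above produces.

```shell
#!/bin/sh
# Encode without a trailing newline (printf is more portable than echo -n).
encoded=$(printf '%s' 'password' | base64)
echo "$encoded"                      # cGFzc3dvcmQ=

# Decode it again.
decoded=$(printf '%s' "$encoded" | base64 --decode)
echo "$decoded"                      # password

# With a newline the encoding differs, so the stored secret would not match.
with_newline=$(echo 'password' | base64)
echo "$with_newline"                 # cGFzc3dvcmQK
```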

Checking Pod Storage Paths

Command | Description
kubectl get pod pod -o jsonpath='{.spec.volumes}' | Check mounted volumes
kubectl get pod pod -o jsonpath='{.spec.containers[*].volumeMounts}' | Check volume mount path

Live Pod Debug Session

kubectl debug -it pod-name --image=busybox

This creates a temporary (ephemeral) container inside the pod for debugging.

Upgrade Kubernetes Without Downtime (Cluster Admin)

Check nodes before upgrade:

kubectl get nodes

Upgrade node one-by-one:

kubectl drain node1 --ignore-daemonsets

Then update node OS/Kubelet.

kubectl uncordon node1

Advanced kube-proxy Debugging

  • iptables -L -t nat
  • journalctl -u kube-proxy -f

6. Ingress & Cert Manager

Complete Ingress Configuration

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    # ingress-nginx rate limiting: requests per second per client IP
    nginx.ingress.kubernetes.io/limit-rps: "10"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: app-tls-secret
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend-service
                port:
                  number: 80

Cert Manager Setup

# Install Cert Manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.12.0/cert-manager.yaml
# Verify installation
kubectl get pods --namespace cert-manager

# ClusterIssuer for Let's Encrypt
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx

7. Observability Stack

Prometheus Configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

Useful PromQL Queries

# CPU Usage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory Usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# HTTP Error Rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Pod Restart Rate
rate(kube_pod_container_status_restarts_total[15m])
# 95th Percentile Response Time
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
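The memory query is plain arithmetic: used percentage = (total − available) / total × 100. The helper below applies the same formula to two byte counts so the result can be checked by hand; `mem_used_pct` is a hypothetical name and the sample numbers are made up.

```shell
#!/bin/sh
# mem_used_pct TOTAL_BYTES AVAILABLE_BYTES -> used percentage with two decimals
mem_used_pct() {
  awk -v total="$1" -v avail="$2" \
    'BEGIN { printf "%.2f\n", (total - avail) / total * 100 }'
}

mem_used_pct 8000000000 2000000000   # 6 of 8 GB in use -> 75.00
```

On a Linux host the two inputs correspond to the MemTotal and MemAvailable fields in /proc/meminfo (reported there in kB), which is where node_exporter reads them from.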

Grafana Dashboard Setup

Import dashboard ID 11074 for Kubernetes cluster monitoring or 3662 for Prometheus stats.

Quick Reference

Always test commands in a development environment before running them in production!

Daily Use Commands

# Git
git log --oneline --graph --all -20
# Docker
docker-compose logs -f --tail=100
docker system df
# Kubernetes
kubectl get all -A
kubectl top pods --all-namespaces
kubectl rollout restart deployment/app
# Linux
find /var/log -name "*.log" -mtime +30 -delete
du -sh * | sort -hr | head -10
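Because `find ... -delete` removes files without confirmation, it can help to run the same expression with `-print` first. The sketch below rehearses that on a throwaway directory; `touch -d` with a relative date assumes GNU coreutils, and the file names are invented for the demonstration.

```shell
#!/bin/sh
# Build a scratch directory with one old and one fresh log file.
dir=$(mktemp -d)
touch -d '40 days ago' "$dir/old.log"
touch "$dir/new.log"

# 1. Preview: same expression, but print instead of delete.
find "$dir" -name "*.log" -mtime +30 -print

# 2. Delete once the preview looks right.
find "$dir" -name "*.log" -mtime +30 -delete

remaining=$(ls "$dir")
echo "remaining: $remaining"
rm -rf "$dir"
```

Only the file older than 30 days is removed; the fresh one stays, which is the behavior to confirm before pointing the command at /var/log.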

9. Terraform

Terraform Interview Questions

By DevOps Shack

Table of Contents

  1. Introduction to Terraform
    • What is Terraform and what are its main features?
    • Can you explain the difference between Terraform and other configuration management tools like Ansible, Puppet, or Chef?
  2. State Management
    • What is state in Terraform, and why is it important?
    • How do you manage multiple environments (e.g., development, staging, production) in Terraform?
  3. Providers and Modules
    • What is a Terraform provider, and how do you use it?
    • Explain the difference between Terraform modules and resources.
  4. Importing Resources
    • How can you import existing infrastructure into Terraform?
  5. Variables and Outputs
    • What are Terraform variables, and how do you use them?
    • How do you handle secrets or sensitive data in Terraform?
  6. Initialization and Planning
    • What is the purpose of the terraform init command?
    • How does Terraform handle concurrent operations in a team environment?
  7. Advanced Features
    • How does Terraform handle resource dependencies?
    • What is drift detection in Terraform, and how do you handle drift?
  8. Lifecycle Management
    • How does Terraform manage resource lifecycles?
    • What is the purpose of the terraform taint command?
  9. Dynamic Blocks and Conditional Logic
    • What are Terraform dynamic blocks, and how are they used?
    • How does Terraform support conditional resource creation?
  10. Remote State Management
    • How do you manage remote state in Terraform?
    • How does Terraform state file locking work in remote backends?
  11. Formatting and Debugging
    • What is terraform fmt, and why is it important?
    • How do you debug errors in Terraform?
  12. Zero-Downtime Deployments
    • How does Terraform handle zero-downtime deployments?
  13. Provisioners
    • Explain the difference between local-exec and remote-exec provisioners.
  14. Shared Modules
    • How do you manage shared modules in Terraform?
  15. Terraform Cloud
    • What is Terraform Cloud, and how does it differ from Terraform CLI?
  16. Resource Management
    • What are Terraform backends, and why are they important?
    • What is the purpose of the terraform output command?
  17. Version Constraints
    • How does Terraform manage provider and configuration version constraints?
  18. Secrets Management
    • How can you securely manage secrets in Terraform?
  19. Interactive Console
    • What is the purpose of the terraform console command?
  20. Limitations and Best Practices
    • What are the limitations of Terraform?
    • How can you ensure best practices while working with Terraform?

Introduction to Terraform

Terraform, developed by HashiCorp, is one of the most popular Infrastructure as Code (IaC) tools, enabling developers and operations teams to define, provision, and manage infrastructure efficiently. With its declarative configuration language (HCL) and multi-cloud compatibility, Terraform has become a go-to tool for automating infrastructure management. This document compiles 50 Terraform interview questions and answers, covering fundamental concepts, advanced features, and practical use cases. It serves as a comprehensive guide for professionals preparing for Terraform interviews or looking to strengthen their understanding of the tool.

What is Terraform and what are its main features?

Terraform is an Infrastructure as Code (IaC) tool developed by HashiCorp. It was released as open source under the MPL 2.0 license; since version 1.6 it has been distributed under the Business Source License (BSL). It allows you to define, provision, and manage infrastructure across various cloud providers and services using a declarative configuration language known as HashiCorp Configuration Language (HCL).

Main Features:

  • Infrastructure as Code (IaC): Manage infrastructure using code, enabling version control, reuse, and sharing.
  • Provider Agnostic: Supports multiple providers like AWS, Azure, GCP, and others, allowing for a consistent workflow.
  • Execution Plans: Generates and shows execution plans before applying changes, helping you understand what Terraform will do.
  • Resource Graph: Builds a graph of all resources and their dependencies, optimizing resource creation and modification.
  • Change Automation: Automates complex changesets to your infrastructure with minimal human interaction.

Can you explain the difference between Terraform and other configuration management tools like Ansible, Puppet, or Chef?

Purpose:

  • Terraform: Primarily an infrastructure provisioning tool. It focuses on creating, updating, and versioning infrastructure safely and efficiently.
  • Ansible/Puppet/Chef: Primarily configuration management tools. They are used to install and manage software on existing servers.

Approach:

  • Terraform: Declarative. You describe the desired state, and Terraform figures out how to achieve it.
  • Ansible/Puppet/Chef: Can be both declarative and procedural, depending on how you write your configurations or playbooks.

Infrastructure Lifecycle:

  • Terraform: Manages the entire lifecycle of infrastructure, including creation, scaling, and destruction.
  • Ansible/Puppet/Chef: Manages the software and settings on already provisioned infrastructure.

What is state in Terraform, and why is it important?

Terraform State: A persistent data store that maps Terraform configurations to real-world resources. It's typically stored in a file named terraform.tfstate.

Importance:

  • Mapping: Keeps track of resource IDs and metadata, enabling Terraform to manage resources effectively.
  • Planning and Execution: Allows Terraform to generate accurate execution plans by knowing the current state of resources.
  • Collaboration: When stored remotely (e.g., in AWS S3 or Terraform Cloud), it enables team collaboration by sharing the state.

Managing Multiple Environments in Terraform

To manage multiple environments (e.g., development, staging, production) in Terraform, you can use the following methods:

Workspaces

  • Use Terraform workspaces to maintain separate state files within the same configuration for different environments.
  • Example:
# Create and select a workspace
terraform workspace new development
terraform workspace select development

Directory Structure

  • Organize configurations into separate directories for each environment, each with its own state.
  • Example:
# Directory structure
├── environments
│   ├── dev
│   ├── staging
│   └── prod

Variable Files

  • Use different .tfvars files for each environment to parameterize configurations.
  • Example:
# Apply configuration with variable file
terraform apply -var-file="dev.tfvars"

Backend Configuration

Give each environment its own backend configuration (for example, a separate state key or bucket) so that each environment's state is stored and locked independently.
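
One way to do this is partial backend configuration: the backend block stays empty and the per-environment settings are supplied at init time. The sketch below assumes an S3 backend; the file name dev.s3.tfbackend and the state key are hypothetical, and the two files are shown together only for brevity.

```hcl
# main.tf: the backend block is intentionally empty (partial configuration).
terraform {
  backend "s3" {}
}

# dev.s3.tfbackend (hypothetical file name): settings for the dev environment.
# Loaded with: terraform init -backend-config=dev.s3.tfbackend
bucket = "my-terraform-state"
key    = "dev/terraform.tfstate"
region = "us-west-2"
```

A staging or prod file would differ only in the key (or bucket), so the same configuration can be initialized against any environment.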

Terraform Provider

A Terraform provider is a plugin that interacts with APIs of cloud platforms and services (e.g., AWS, Azure, Google Cloud). Providers define resources and data sources for a service.

Usage

  • Declaration:
# Declare a provider
provider "aws" {
  region = "us-west-2"
}
  • Version Pinning:
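# Pin provider version (the version argument inside a provider block is deprecated since Terraform 0.13)
terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 3.0" }
  }
}
provider "aws" {
  region = "us-west-2"
}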
  • You can configure multiple providers to manage resources across different platforms.
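
The same mechanism also covers several configurations of one provider. As a minimal sketch, assuming a second AWS region is needed, an aliased provider can be selected per resource; the alias name and bucket name here are hypothetical.

```hcl
# Default AWS provider configuration.
provider "aws" {
  region = "us-west-2"
}

# Additional configuration of the same provider, selected via its alias.
provider "aws" {
  alias  = "east"
  region = "us-east-1"
}

resource "aws_s3_bucket" "replica" {
  provider = aws.east                 # use the aliased configuration
  bucket   = "example-replica-bucket" # hypothetical bucket name
}
```

Resources without a provider argument keep using the default (un-aliased) configuration.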

Difference Between Terraform Modules and Resources

Resources

Resources are the basic building blocks in Terraform, representing infrastructure objects like virtual networks, compute instances, or databases.

# Example of a resource
resource "aws_instance" "web_server" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"
}

Modules

Modules are containers for multiple resources that are used together, promoting code reuse and organization. They can be shared and versioned.

# Example of using a module
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "2.77.0"

  name = "my-vpc"
  cidr = "10.0.0.0/16"
}

Importing Existing Infrastructure into Terraform

Step 1: Write Resource Configuration

Define a matching resource block in your .tf files. It can start out empty; you fill in the arguments after the import.

# Resource configuration
resource "aws_instance" "existing" {
  # Configuration will be populated after import
}

Step 2: Run Import Command

Use terraform import to map the existing resource to the Terraform resource.

# Import command
terraform import aws_instance.existing i-0abcdef1234567890

Step 3: Refresh and Update Configuration

Run terraform plan to see differences and update the configuration to match the actual settings.
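
On Terraform 1.5 or later, the import can also be declared in configuration instead of being run as a one-off CLI command. The sketch below reuses the resource address and instance ID from the steps above.

```hcl
# Requires Terraform 1.5 or later.
# The import is carried out during the next terraform plan / terraform apply.
import {
  to = aws_instance.existing    # resource address to import into
  id = "i-0abcdef1234567890"    # ID of the existing EC2 instance
}

# The target resource block must exist in the configuration, or it can be
# generated with: terraform plan -generate-config-out=generated.tf
```

Because the import block lives in version control, the import is reviewed in a plan like any other change.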

Terraform Variables

Input Variables

Input variables are parameters for Terraform modules, making configurations flexible and reusable.

# Example of an input variable
variable "instance_type" {
  type        = string
  default     = "t2.micro"
  description = "EC2 instance type"
}

Usage

# Using a variable in a resource
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = var.instance_type
}

Setting Variables

  • Environment Variables: export TF_VAR_instance_type="t2.small"
  • Command-Line Flags: terraform apply -var="instance_type=t2.small"
  • Variable Files: Create .tfvars files and pass them with -var-file flag.
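
A variable file is a plain list of assignments for variables that the configuration already declares. A minimal sketch, assuming a file named dev.tfvars:

```hcl
# dev.tfvars (hypothetical file): values for variables declared in the configuration.
# Applied with: terraform apply -var-file="dev.tfvars"
instance_type = "t2.small"
```

Values passed with -var or -var-file on the command line take precedence over terraform.tfvars and TF_VAR_ environment variables.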

Output Variables

Output variables are used to expose values to the user or other configurations.

# Example of an output variable
output "instance_ip" {
  value = aws_instance.web.public_ip
}

Handling Secrets or Sensitive Data in Terraform

Sensitive Variables

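Mark variables as sensitive so that Terraform redacts their values in plan and apply output. The values are still written to the state file in plain text, so the state itself must also be protected.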

# Example of a sensitive variable
variable "db_password" {
  type      = string
  sensitive = true
}
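
Sensitivity propagates: an output derived from a sensitive variable must itself be marked sensitive, otherwise Terraform reports an error. The sketch below illustrates this; the output name and the connection string format are hypothetical.

```hcl
# Same declaration as above, repeated so this sketch is self-contained.
variable "db_password" {
  type      = string
  sensitive = true
}

output "db_connection_string" {
  # Hypothetical connection string built from the sensitive value.
  value     = "postgres://app:${var.db_password}@db.example.internal:5432/app"
  sensitive = true # required because the value derives from a sensitive variable
}
```

The CLI then prints the output as redacted, while terraform output -raw db_connection_string still reveals it to anyone with access to the state.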

Avoid Hardcoding

Do not store secrets in code or version control. Use environment variables or prompt for input.

Use Vault or Secret Management Services

Integrate with tools like HashiCorp Vault to fetch secrets at runtime.

Secure State Storage

Use encrypted remote backends to store state files securely.

Example of Fetching a Secret

# Example of fetching a secret from Vault
data "vault_generic_secret" "db_password" {
  path = "secret/data/db_password"
}

resource "aws_db_instance" "example" {
  password = data.vault_generic_secret.db_password.data["password"]
  # Other configurations
}

Purpose of the terraform init Command

The terraform init command initializes a Terraform working directory by downloading and installing the necessary providers and modules.

Functions

  • Plugin Installation: Downloads provider plugins required for the configuration.
  • Backend Initialization: Sets up the backend for state storage.
  • Module Installation: Downloads modules from sources like GitHub or the Terraform Registry.

When to Run

  • First time setting up a configuration.
  • After adding or changing providers or modules.
  • After cloning a repository containing Terraform configurations.

Handling Resource Dependencies in Terraform

Implicit Dependencies

Terraform automatically determines resource dependencies by analyzing references in configurations.

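# Example of implicit dependencies
# The instance references the subnet and the subnet references the VPC,
# so Terraform creates them in the order VPC, subnet, instance.
resource "aws_instance" "example" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"
  subnet_id     = aws_subnet.example.id
}

resource "aws_subnet" "example" {
  vpc_id     = aws_vpc.example.id
  cidr_block = "10.0.1.0/24"
}

resource "aws_vpc" "example" {
  cidr_block = "10.0.0.0/16"
}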

Explicit Dependencies

Use depends_on when a dependency isn't detected automatically.

# Example of explicit dependencies
resource "null_resource" "example" {
  depends_on = [aws_instance.example]
}

Managing Remote State in Terraform

Remote state is used to share the state file among team members and secure it.

Example Using AWS S3

Terraform Backend Configuration

# Terraform backend configuration for S3
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "global/s3/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-lock-table"
  }
}

Features of Terraform Backend

  • Storage: Stores the state in a remote backend like S3, Azure Blob, or Terraform Cloud.
  • Locking: Prevents concurrent changes using mechanisms like DynamoDB tables.
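
For DynamoDB-based locking with the S3 backend, the lock table needs a string partition key named LockID. A minimal sketch of such a table, reusing the table name from the backend configuration above; the resource label and billing mode are assumptions.

```hcl
# Lock table for the S3 backend. It is usually created in a separate bootstrap
# configuration, since the backend needs it before the first terraform init.
resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-lock-table"
  billing_mode = "PAY_PER_REQUEST" # assumption: on-demand capacity
  hash_key     = "LockID"          # the S3 backend expects exactly this key name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

While one operation holds the lock, other runs against the same state fail with a lock error rather than writing concurrently.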

Terraform Data Sources

Data Sources allow you to fetch existing information or resources from a provider.

Example of a Data Source

# Fetching the most recent AWS AMI
data "aws_ami" "example" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "name"
    values = ["my-ami-*"]
  }
}

Using Data Source in a Resource

# Example of using a data source in a resource
resource "aws_instance" "example" {
  ami           = data.aws_ami.example.id
  instance_type = "t2.micro"
}

Terraform Commands: Plan vs Apply

  • terraform plan: Shows the changes Terraform will make to your infrastructure without actually applying them. Use for review and approval.
  • terraform apply: Executes the changes proposed in the plan, creating, modifying, or destroying resources as necessary.

Count vs For_each in Terraform

  • count: Creates multiple resources by a specified number.
# Example using count
resource "aws_instance" "example" {
  count         = 3
  instance_type = "t2.micro"
  ami           = "ami-0c55b159cbfafe1f0"
}

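Inside the block, count.index gives the current index; individual instances are addressed by index, e.g. aws_instance.example[0].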

  • for_each: Creates resources based on a map or a set.
# Example using for_each
resource "aws_instance" "example" {
  for_each = {
    server1 = "t2.micro"
    server2 = "t2.small"
  }

  instance_type = each.value
  ami           = "ami-0c55b159cbfafe1f0"
}

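Inside the block, each.key and each.value give the current map entry; individual instances are addressed by key, e.g. aws_instance.example["server1"].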
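
for_each also accepts a set of strings, and the resulting instances form a map that can be iterated with a for expression. In this sketch the variable name, resource label, and server names are hypothetical.

```hcl
variable "server_names" {
  type    = set(string)
  default = ["web", "api"]
}

resource "aws_instance" "named" {
  for_each      = var.server_names
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"

  tags = {
    Name = each.key # for a set, each.key and each.value are identical
  }
}

# Map of server name to instance ID, built from the for_each instances.
output "instance_ids" {
  value = { for name, instance in aws_instance.named : name => instance.id }
}
```

Because instances are keyed by name rather than position, removing one name from the set destroys only that instance, whereas removing an element from a count-based list can shift the indexes of the others.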

Debugging Errors in Terraform

  • Enable Debug Logs: Set the TF_LOG environment variable.
# Enable debug logs
export TF_LOG=DEBUG
terraform apply
  • Log Output File: Redirect logs to a file for detailed review.
# Redirect logs to a file (takes effect only while TF_LOG is set)
export TF_LOG_PATH="terraform.log"

  • Validate Configurations: Use terraform validate to check for syntax errors.
  • Plan Execution: Run terraform plan to identify issues in execution plans.

Local-exec vs Remote-exec Provisioners

  • local-exec: Executes commands on the machine running Terraform.
# Example of local-exec
resource "null_resource" "example" {
  provisioner "local-exec" {
    command = "echo 'Hello, World!'"
  }
}
  • remote-exec: Executes commands on a remote resource (e.g., an EC2 instance).
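# Example of remote-exec
resource "aws_instance" "example" {
  # ami, instance_type, and other arguments omitted

  # remote-exec needs a connection block; the user and key path are placeholders
  connection {
    type        = "ssh"
    host        = self.public_ip
    user        = "ubuntu"
    private_key = file(pathexpand("~/.ssh/id_rsa"))
  }

  provisioner "remote-exec" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y nginx"
    ]
  }
}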

Null Resource in Terraform

A null_resource is a resource that doesn't directly manage infrastructure but allows running provisioners and triggers.

# Example of null_resource
resource "null_resource" "example" {
  provisioner "local-exec" {
    command = "echo 'Triggered by change in variables!'"
  }

  triggers = {
    variable = var.example_variable
  }
}

Use Cases for Null Resource

  • Execute local commands or scripts based on conditions.
  • Handle non-infrastructure workflows.
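
On Terraform 1.4 or later, the built-in terraform_data resource can fill the same role without the null provider. A minimal sketch, assuming the same example_variable as the null_resource example; the default value is hypothetical.

```hcl
variable "example_variable" {
  type    = string
  default = "v1" # hypothetical default
}

# Requires Terraform 1.4 or later; no extra provider is needed.
resource "terraform_data" "example" {
  # A change in this value replaces the resource, which re-runs the provisioner.
  triggers_replace = [var.example_variable]

  provisioner "local-exec" {
    command = "echo 'Triggered by change in variables!'"
  }
}
```

Here triggers_replace plays the role that triggers plays on null_resource.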

Terraform fmt

terraform fmt: Formats Terraform configuration files to ensure consistent style.

Run it in the directory containing .tf files: terraform fmt

Importance of terraform fmt

Improves readability and standardizes configuration files.

Terraform Taint Command

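terraform taint: Marks a resource as needing to be destroyed and recreated during the next terraform apply. Since Terraform 0.15.2 the command is deprecated in favor of terraform apply -replace, which plans the replacement without modifying state first.

# Example of terraform taint
terraform taint aws_instance.example

# Recommended equivalent on Terraform 0.15.2 and later
terraform apply -replace="aws_instance.example"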

Use Case: When a resource is in an inconsistent state or needs to be updated due to external changes.

Difference Between terraform destroy and terraform apply -destroy

  • terraform destroy: Destroys all resources tracked in the current state. It shows the destruction plan and asks for confirmation before proceeding.
# Command to destroy resources
terraform destroy
  • terraform apply -destroy: The same operation expressed through apply; terraform destroy is a convenience alias for it. Use terraform plan -destroy to preview the destruction without applying it.
# Command to apply destroy
terraform apply -destroy

Rollback Changes in Terraform

  • State Restoration: Restore a previous state backup if state file corruption occurs.
# Restore state from backup
cp terraform.tfstate.backup terraform.tfstate
  • Revert Code Changes: Revert to a previous commit in version control and reapply.
# Revert to a previous commit (replace <commit> with the commit to restore)
git checkout <commit>
terraform apply
  • Manual Correction: Edit configurations, use terraform plan to review the corrective changes, and run terraform apply to make them.

Terraform Modules

Terraform Modules are a way to encapsulate resources for reuse.

Steps to Create a Module

Structure:

├── main.tf
├── variables.tf
└── outputs.tf

Module Definition

# main.tf
resource "aws_instance" "example" {
  ami           = var.ami
  instance_type = var.instance_type
}

# variables.tf
variable "ami" {}

variable "instance_type" {
  default = "t2.micro"
}