DevOps Complete Learning Guide

1. Git & GitHub Advanced

Git Internals & Advanced Commands

# Understanding Git Objects
git cat-file -p HEAD    # Print object content
git ls-tree HEAD        # List tree object
git rev-parse HEAD      # Get SHA of HEAD
git reflog              # Reference log of all changes
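To make the object model behind these commands more concrete, the sketch below reproduces by hand the ID Git assigns to a blob: the SHA-1 of a "blob <size>" header, a NUL byte, and the content. It assumes a repository using the default SHA-1 object format and the sha1sum utility from GNU coreutils; the helper name blob_id is illustrative and not part of Git.

```shell
#!/bin/sh
# Compute the object ID Git would assign to a blob holding the given string
# plus a trailing newline (hypothetical helper; SHA-1 repositories assumed).
blob_id() {
    content=$1
    # Size in bytes of the content plus the newline appended below.
    size=$(( $(printf '%s' "$content" | wc -c) + 1 ))
    # Header "blob <size>", a NUL byte, then the content itself.
    printf 'blob %d\0%s\n' "$size" "$content" | sha1sum | cut -d' ' -f1
}

blob_id "hello"
```

Inside any repository, echo hello | git hash-object --stdin should print the same ID, which is why identical file contents are stored only once.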

Interactive Rebase

Interactive rebase is useful for cleaning up commit history before you push to a remote. Avoid rewriting commits that other people have already pulled.

# Squash last 3 commits
git rebase -i HEAD~3
# In the editor, you'll see:
# pick abc1234 First commit
# pick def5678 Second commit
# pick ghi9012 Third commit
# Change to:
# pick abc1234 First commit
# squash def5678 Second commit
# squash ghi9012 Third commit

Git Stash Advanced

Command	Description
`git stash push -m "message"`	Stash with descriptive message (`git stash save` is deprecated)
`git stash list`	List all stashes
`git stash apply stash@{2}`	Apply specific stash
`git stash branch new-branch`	Create branch from stash
`git stash show -p stash@{0}`	Show stash content

Git Hooks Example

Use Git hooks to enforce code quality and standards before commits.

#!/bin/sh
# .git/hooks/pre-commit
# Make it executable: chmod +x .git/hooks/pre-commit
# Run tests before commit
npm test || exit 1
# Check for console.log statements
if grep -r "console.log" --include="*.js" .; then
    echo "Remove console.log statements before committing"
    exit 1
fi
# Run linter
npm run lint || exit 1
echo "Pre-commit checks passed!"
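The recursive grep in the hook scans every .js file under the working tree, which can include installed dependencies. One possible refinement, sketched here, checks only the files it is handed, so a hook could pass it the staged file list. The function name find_console_log and the throwaway demo files are assumptions for illustration, not part of the original hook.

```shell
#!/bin/sh
# Return 0 (and print the offending lines) if any given file contains
# console.log; return 1 otherwise. Hypothetical helper for a pre-commit hook.
find_console_log() {
    [ "$#" -gt 0 ] || return 1          # nothing to check
    grep -n -F "console.log" -- "$@"    # -F: match the literal string
}

# Demo on two throwaway files.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'console.log("debug");\n' > "$tmp/bad.js"
printf 'export const x = 1;\n'   > "$tmp/good.js"

if find_console_log "$tmp/bad.js" "$tmp/good.js"; then
    echo "Remove console.log statements before committing"
fi
```

In a real hook the file list could come from git diff --cached --name-only --diff-filter=ACM, so that only staged files are checked.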

2. GitHub Actions

Complete CI/CD Workflow

name: CI/CD Pipeline
on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]
env:
  NODE_VERSION: '16'
  DOCKER_REGISTRY: ghcr.io
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Run tests
        run: npm test
      - name: Upload coverage
        uses: actions/upload-artifact@v3
        with:
          name: coverage-report
          path: coverage/

Matrix Strategy

Use matrix strategies to test across multiple versions and platforms simultaneously.

jobs:
  test:
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
        node: [14, 16, 18]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node }}
      - run: npm test
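A matrix expands to one job per combination of its values. As a rough illustration only (this is not how the Actions runner is implemented), the loop below enumerates the combinations of the os and node lists used above.

```shell
#!/bin/sh
# Enumerate the os x node combinations from the matrix above.
count=0
for os in ubuntu-latest windows-latest macos-latest; do
    for node in 14 16 18; do
        echo "job: os=$os node=$node"
        count=$((count + 1))
    done
done
echo "total jobs: $count"
```

Three operating systems times three Node.js versions gives nine jobs, which is worth keeping in mind when estimating CI minutes.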

3. Linux for DevOps

Essential System Commands

Category	Command	Description
System Info	`uname -a`	All system information
	`df -h`	Disk usage human readable
	`free -h`	Memory usage
	`top / htop`	Process monitoring
Network	`ss -tulpn`	Show listening ports
	`ip addr show`	Show IP addresses
	`curl -I URL`	Get HTTP headers
	`dig domain.com`	DNS lookup

Service Management (systemd)

# Service control
systemctl start nginx
systemctl stop nginx
systemctl restart nginx
systemctl reload nginx
systemctl enable nginx     # Enable at boot
systemctl disable nginx    # Disable at boot
systemctl status nginx
# View logs
journalctl -u nginx -f     # Follow logs
journalctl -u nginx --since "1 hour ago"
journalctl -p err          # Error logs only

Creating a Custom Service

For production applications that run directly on a host, create a custom systemd service so that the application starts at boot, restarts on failure, and logs to the journal.

# /etc/systemd/system/myapp.service
[Unit]
Description=My Application
After=network.target

[Service]
Type=simple
User=appuser
WorkingDirectory=/opt/myapp
ExecStart=/usr/bin/node app.js
Restart=always
RestartSec=10
# "syslog" is obsolete on current systemd releases; "journal" replaces it
StandardOutput=journal
StandardError=journal
SyslogIdentifier=myapp

[Install]
WantedBy=multi-user.target

# Reload and start
systemctl daemon-reload
systemctl start myapp
systemctl enable myapp

Changing File Permissions

The chmod command enables you to change the permissions on a file. You must be superuser or the owner of a file or directory to change its permissions.

You can use the chmod command to set permissions in either of two modes:

Absolute Mode - Use numbers to represent file permissions (the method most commonly used to set permissions). When you change permissions by using the absolute mode, represent permissions for each triplet by an octal mode number.
Symbolic Mode - Use combinations of letters and symbols to add or remove permissions.

Setting File Permissions in Absolute Mode

Octal Value	File Permissions Set	Permissions Description
0	---	No permissions
1	--x	Execute permission only
2	-w-	Write permission only
3	-wx	Write and execute permissions
4	r--	Read permission only
5	r-x	Read and execute permissions
6	rw-	Read and write permissions
7	rwx	Read, write, and execute permissions
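Each octal digit in the table is the sum of read (4), write (2), and execute (1). The sketch below turns a nine-character permission string into its absolute mode using that rule. The function names are illustrative, and the special bits (setuid, setgid, sticky) are deliberately not handled.

```shell
#!/bin/sh
# Convert one rwx triplet (e.g. "r-x") to its octal digit.
triplet_to_digit() {
    digit=0
    case $1 in r??) digit=$((digit + 4)) ;; esac
    case $1 in ?w?) digit=$((digit + 2)) ;; esac
    case $1 in ??x) digit=$((digit + 1)) ;; esac
    printf '%s' "$digit"
}

# Convert a nine-character permission string (e.g. "rwxr-xr-x") to a mode.
perms_to_mode() {
    perms=$1
    user=$(printf '%s' "$perms" | cut -c1-3)
    group=$(printf '%s' "$perms" | cut -c4-6)
    other=$(printf '%s' "$perms" | cut -c7-9)
    printf '%s%s%s\n' "$(triplet_to_digit "$user")" \
        "$(triplet_to_digit "$group")" "$(triplet_to_digit "$other")"
}

perms_to_mode "rwxr-xr-x"   # 755
perms_to_mode "rw-r-----"   # 640
```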

Setting File Permissions in Symbolic Mode

Symbol	Function	Description
u	Who	User (owner)
g	Who	Group
o	Who	Others
a	Who	All
=	Operation	Assign
+	Operation	Add
-	Operation	Remove
r	Permission	Read
w	Permission	Write
x	Permission	Execute
l	Permission	Mandatory locking, setgid bit is on, group execution bit is off
s	Permission	setuid or setgid bit is on
S	Permission	suid bit is on, user execution bit is off
t	Permission	Sticky bit is on, execution bit for others is on
T	Permission	Sticky bit is on, execution bit for others is off

How to Change Permissions in Absolute Mode

If you are not the owner of the file or directory, become superuser.

Only the current owner or superuser can use the chmod command to change file permissions on a file or directory.

Change permissions in absolute mode by using the chmod command:

# Change file permissions using absolute mode
chmod nnn filename

nnn specifies the octal values that change permissions on the file or directory. See the table above for the list of valid octal values.

filename is the file or directory.

Verify the permissions of the file have changed:

# Verify file permissions
ls -l filename

Example--Changing Permissions in Absolute Mode

# Set rwxr-xr-x permissions on myfile
chmod 755 myfile
ls -l myfile

How to Change Permissions in Symbolic Mode

If you are not the owner of the file or directory, become superuser.

Only the current owner or superuser can use the chmod command to change file permissions on a file or directory.

Change permissions in symbolic mode by using the chmod command:

# Change file permissions using symbolic mode
chmod who operator permission filename

who specifies whose permissions are changed, operator specifies the operation to perform, and permission specifies what permissions are changed. See the table above for the list of valid symbols.

filename is the file or directory.

Verify the permissions of the file have changed:

# Verify file permissions
ls -l filename

Examples--Changing Permissions in Symbolic Mode

# Take away read permission from others
chmod o-r filea
# Add read and execute permissions for user, group, and others
chmod a+rx fileb
# Assign read, write, and execute permissions to group
chmod g=rwx filec
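To see the effect of each operator, the sketch below applies the same three operations in sequence to a single scratch file (rather than three separate files) and prints the octal mode after each step. It assumes GNU coreutils, since stat -c is GNU syntax; the starting mode of 644 is chosen for the demo.

```shell
#!/bin/sh
# Apply the three symbolic-mode operations to a scratch file and print the
# resulting octal mode after each step (stat -c is GNU coreutils syntax).
f=$(mktemp)
trap 'rm -f "$f"' EXIT

show_mode() { stat -c '%a' "$f"; }

chmod 644 "$f";   show_mode   # starting point: rw-r--r--
chmod o-r "$f";   show_mode   # others lose read: rw-r-----
chmod a+rx "$f";  show_mode   # everyone gains read and execute: rwxr-xr-x
chmod g=rwx "$f"; show_mode   # group set to exactly rwx: rwxrwxr-x
```

Note that + and - only touch the named bits, while = replaces the whole triplet for the selected class.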

4. Docker Deep Dive

Multi-Stage Dockerfile

Multi-stage builds reduce image size by separating build dependencies from runtime.

# Build stage
FROM node:16-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production

# Development dependencies
FROM node:16-alpine AS dev-deps
WORKDIR /app
COPY package*.json ./
RUN npm ci

# Build application
FROM dev-deps AS build
COPY . .
RUN npm run build

# Production stage
FROM node:16-alpine AS runtime
RUN apk add --no-cache tini
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001
WORKDIR /app
COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=build --chown=nodejs:nodejs /app/dist ./dist
USER nodejs
EXPOSE 3000
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["node", "dist/index.js"]
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD node healthcheck.js || exit 1

Docker Commands Reference

# Container management
docker run -d \
  --name myapp \
  --restart unless-stopped \
  --memory="512m" \
  --cpus="0.5" \
  -p 8080:80 \
  -v $(pwd)/data:/data:ro \
  --env-file .env \
  myimage:latest
# Inspection and debugging
docker inspect container_id
docker stats
docker logs -f --tail 50 container_name
docker exec -it container_name sh
# Cleanup
docker system prune -a --volumes
docker image prune -a
docker container prune
docker volume prune

5. Kubernetes Complete Guide

Deployment with All Features

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: myapp:v1
          ports:
            - containerPort: 8080
          env:
            - name: NODE_ENV
              value: "production"
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-secret
                  key: password
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "200m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          volumeMounts:
            - name: config
              mountPath: /etc/config
      volumes:
        - name: config
          configMap:
            name: app-config
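The rolling update settings and resource requests in this manifest interact: during a rollout the Deployment may run up to replicas + maxSurge pods and must keep at least replicas - maxUnavailable available. The arithmetic below is a back-of-the-envelope sketch using the values from the manifest; the variable names are illustrative.

```shell
#!/bin/sh
# Values copied from the Deployment manifest above.
replicas=3
max_surge=1
max_unavailable=1
request_mib=128   # memory request per pod
limit_mib=256     # memory limit per pod

max_pods=$((replicas + max_surge))
min_available=$((replicas - max_unavailable))

echo "pods during rollout:      at most $max_pods"
echo "available during rollout: at least $min_available"
echo "memory requested at peak: $((max_pods * request_mib))Mi"
echo "memory limit at peak:     $((max_pods * limit_mib))Mi"
```

With these numbers the cluster needs room to schedule four pods (512Mi of memory requests) while a rollout is in progress, not just the steady-state three.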

Essential Kubectl Commands

Category	Command
Cluster Info	`kubectl cluster-info`
Get Resources	`kubectl get pods -o wide --all-namespaces`
Describe	`kubectl describe pod pod-name`
Logs	`kubectl logs -f pod-name --tail=50`
Execute	`kubectl exec -it pod-name -- /bin/bash`
Port Forward	`kubectl port-forward pod-name 8080:80`
Scale	`kubectl scale deployment app --replicas=5`
Rollout	`kubectl rollout status deployment/app`

What is Ingress in Kubernetes?

Ingress is a Kubernetes API object used to manage external access to services inside the cluster.

Ingress is used to:

expose HTTP/HTTPS applications
route incoming traffic to different services
define host-based or path-based routing
handle SSL/TLS termination

Ingress acts like a set of rules that tells Kubernetes:

👉 “When traffic comes from outside, where should it go?”

What is an Ingress Controller?

An Ingress Controller is the actual implementation that executes the rules defined in the Ingress object.

Ingress is only a configuration. It does nothing by itself. The Ingress Controller is responsible for processing those rules and routing the traffic.

Examples of Ingress Controllers:

NGINX Ingress Controller
AWS ALB Ingress Controller
GKE Ingress
Traefik Ingress Controller
HAProxy Ingress

Simple Analogy

Ingress: Think of Ingress as a “traffic rulebook” written on paper.

Ingress Controller: Think of the Ingress Controller as the “traffic police” who reads the rulebook and actually controls the traffic.

Visual Diagram Explanation

Internet
   ↓
LoadBalancer
   ↓
Ingress Controller
   ├── /app1 → Service1
   ├── /app2 → Service2
   └── /admin → Service3

The Ingress Controller reads these rules and handles routing.

Summary

Component	Meaning	Responsibility
Ingress	A set of routing rules	Defines how traffic should flow
Ingress Controller	The engine that applies the rules	Routes traffic to correct services
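As a toy model of path-based routing, the function below maps a request path to a backend using the three rules from the diagram. It is only an illustration of the rule lookup, not how any Ingress Controller is implemented; the function name route is hypothetical, and the matching is done per path segment, in the spirit of pathType: Prefix.

```shell
#!/bin/sh
# Toy model of the path-based rules in the diagram above.
route() {
    case $1 in
        /app1|/app1/*)   echo "Service1" ;;
        /app2|/app2/*)   echo "Service2" ;;
        /admin|/admin/*) echo "Service3" ;;
        *)               echo "no matching rule" ;;
    esac
}

route /app1/login
route /admin
route /metrics
```

Requests that match no rule fall through to the last branch; a real controller would typically send them to a default backend or return 404.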

📎 Source: Kubernetes-Cheatsheet.pdf.pdf

Basic Cluster Commands

Command	Description
kubectl version	Show client + server K8s version.
kubectl cluster-info	See cluster master & DNS info.
kubectl get all	List all resources (pods, svc, deployments, etc.) in default namespace.

Working With Contexts

Command	Description
kubectl config get-contexts	List all contexts (clusters).
kubectl config use-context dev	Switch to another cluster.
kubectl config current-context	Show which cluster you are using.

Namespaces

Command	Description
kubectl get namespaces	List namespaces.
kubectl create namespace dev	Create namespace.
kubectl delete namespace dev	Delete namespace.
kubectl config set-context --current --namespace=dev	Set default namespace.

Pods

Command	Description
kubectl get pods	List pods.
kubectl get pods -o wide	Show pod IP, node, etc.
kubectl describe pod pod-name	Detailed pod info & events.
kubectl logs pod-name	View logs.
kubectl logs -f pod-name	Follow live logs.
kubectl exec -it pod-name -- bash	Enter pod terminal.
kubectl delete pod pod-name	Delete a pod.

Deployments

Command	Description
kubectl get deployments	List deployments.
kubectl create deployment web --image=nginx	Create deployment.
kubectl scale deployment web --replicas=5	Scale number of pods.
kubectl rollout status deployment/web	Check deployment rollout status.
kubectl rollout undo deployment/web	Rollback to previous version.
kubectl delete deployment web	Delete deployment.

ReplicaSets

Command	Description
kubectl get rs	List ReplicaSets.
kubectl describe rs rs-name	ReplicaSet details.

Services

Command	Description
kubectl get svc	List services.
kubectl expose deployment web --type=NodePort --port=80	Expose deployment as a service.
kubectl describe svc web	Service details.
kubectl get svc -o wide	See service cluster IP & ports.

ConfigMaps

Command	Description
kubectl create configmap app-config --from-literal=env=prod	Create ConfigMap.
kubectl create configmap myconfig --from-file=config.properties	Create from file.
kubectl get configmaps	List ConfigMaps.
kubectl describe configmap app-config	View config details.

Secrets

Command	Description
kubectl create secret generic db-secret --from-literal=password=1234	Create secret.
kubectl create secret tls tls-secret --cert=server.crt --key=server.key	Create TLS secret (type kubernetes.io/tls).
kubectl get secrets	List secrets.
kubectl describe secret db-secret	Describe secret.

YAML Apply, Update & Delete

Command	Description
kubectl apply -f deployment.yaml	Apply or update resource.
kubectl delete -f deployment.yaml	Delete resource.
kubectl edit deployment web	Edit resource live in editor.

Nodes & Cluster Info

Command	Description
kubectl get nodes	List nodes.
kubectl get nodes -o wide	Node details (OS, internal IP).
kubectl describe node node1	Node details (taints, capacity).
kubectl drain node1 --ignore-daemonsets	Drain node safely.
kubectl cordon node1	Mark node unschedulable.
kubectl uncordon node1	Mark node schedulable.

Resource Usage

Command	Description
kubectl top pods	CPU & memory usage of pods.
kubectl top nodes	CPU & memory usage of nodes. (Metrics server required)

Taints & Tolerations

Add taint:

kubectl taint nodes node1 key=value:NoSchedule

Remove taint:

kubectl taint nodes node1 key=value:NoSchedule-

Used for:

Dedicate nodes
Restrict workloads
Isolation

Labels & Selectors

Command	Description
kubectl label pod web app=frontend	Add label.
kubectl get pods -l app=frontend	Filter with label.
kubectl label pod web app-	Remove label.

Port Forwarding

kubectl port-forward pod/mypod 8080:80

Access pod locally. Common for debugging APIs.

Ingress

Command	Description
kubectl get ingress	List ingress rules.
kubectl describe ingress my-ingress	Ingress details.

StatefulSets

Command	Description
kubectl get statefulsets	List StatefulSets.
kubectl describe statefulset mysql	Stateful app details.

Used for:

Databases
Kafka
ElasticSearch

DaemonSets

Command	Description
kubectl get daemonsets	List DaemonSets.

Used for:

Logging agents
Monitoring agents

Jobs & CronJobs

Command	Description
kubectl create job myjob --image=busybox	Create Job.
kubectl create cronjob backup --image=busybox --schedule="*/5 * * * *"	Create CronJob (runs every 5 minutes).
kubectl get jobs	List jobs.
kubectl get cronjobs	List CronJobs.
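A CronJob schedule uses the standard five-field cron format (minute, hour, day of month, month, day of week). A schedule that has lost its asterisks, such as "/5 * * *", is rejected. The sketch below is a minimal pre-check that only counts fields; the function names are illustrative, and it does not validate field values or macros such as @hourly.

```shell
#!/bin/sh
# Count the whitespace-separated fields in a cron expression.
# set -f stops the shell from expanding "*" into file names.
cron_field_count() {
    set -f
    set -- $1      # intentional word splitting on whitespace
    set +f
    echo "$#"
}

is_five_field_cron() {
    [ "$(cron_field_count "$1")" -eq 5 ]
}

for schedule in "*/5 * * * *" "/5 * * *"; do
    if is_five_field_cron "$schedule"; then
        echo "ok:      '$schedule'"
    else
        echo "invalid: '$schedule' (expected 5 fields)"
    fi
done
```

Always quote the schedule on the command line so the shell does not expand the asterisks before kubectl sees them.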

Pod Debugging

Command	Description
kubectl describe pod web	Check events.
kubectl logs pod --previous	Check logs of crashed container.
kubectl exec -it web -- sh	Debug inside pod.
kubectl get pod web -o yaml	View complete pod spec.

Troubleshooting Cluster

Command	Description
kubectl get events	Events in the current namespace (add -A for all namespaces).
kubectl get endpoints	Check service endpoints.
kubectl get componentstatus	Master component status (older K8s).
kubectl get networkpolicies	Check network restrictions.

Network Policies

Command	Description
kubectl get netpol	List policies.
kubectl describe netpol	Network policy details.

Used for:

Restrict pod-to-pod communication
Zero-trust networking

Storage

Command	Description
kubectl get pv	List persistent volumes.
kubectl get pvc	List persistent volume claims.
kubectl describe pv pv-name	PV details.
kubectl describe pvc pvc-name	PVC details.

Service Accounts & RBAC

Command	Description
kubectl get serviceaccounts	List service accounts.
kubectl create serviceaccount dev-sa	Create SA.
kubectl get clusterrole	List cluster roles.
kubectl get clusterrolebinding	List bindings.

Used for:

Access control
Least privilege
Pod-to-AWS auth (IRSA)

Deleting Everything

Command	Description
kubectl delete all --all	Delete pods, svc, deployments in namespace.
kubectl delete namespace dev	Delete entire namespace.

Useful Shortcuts

Command	Description
kubectl get po	Short for pods.
kubectl get deploy	Short for deployments.
kubectl get svc	Short for service.
kubectl get ing	Short for ingress.
kubectl get cm	Short for configmap.
kubectl get no	Short for nodes.

Apply with Dry Run (very important)

kubectl apply -f app.yaml --dry-run=client

Check if YAML is valid without applying.

28. Debugging Node Issues

# Check taints, disk pressure, memory pressure.
kubectl describe node node1

kubectl get nodes -o json | jq '.items[].status.conditions'

Detailed node conditions.

29. Extract Pod YAML

# Export pod definition.
kubectl get pod web -o yaml > pod.yaml

Useful for reproducing or modifying pods.

30. Kubernetes API Access

# Start API server proxy for debugging.
kubectl proxy

31. Advanced Pod Debugging

# Exec into a specific container inside a multi-container pod.
kubectl exec -it pod --container app -- sh

# View logs of a specific container.
kubectl logs pod -c app

# Check logs of a container that crashed.
kubectl logs pod --previous

# Search for errors in pod events.
kubectl describe pod pod | grep -i error

32. Ephemeral Debug Container (K8s 1.23+)

# Attach a temporary debug container.
kubectl debug podname -it --image=busybox

# Debug an entire node.
kubectl debug node/node1 -it --image=busybox

Used for:

CrashLoopBackOff
ImagePullBackOff
Node-level debugging

33. ImagePull & Crash Issues

# Get reason for failed pod.
kubectl get pod pod -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}'

Common issues:

ImagePullBackOff
ErrImagePull
CrashLoopBackOff

34. Rollouts & History

# View deployment history.
kubectl rollout history deployment web

# Rollback to a specific version.
kubectl rollout undo deployment web --to-revision=2

# Pause rollout (for debugging).
kubectl rollout pause deployment web

# Resume rollout.
kubectl rollout resume deployment web

35. Advanced Scaling

# Scale down to zero.
kubectl scale deployment web --replicas=0

# Scale StatefulSet (ordered).
kubectl scale statefulset mysql --replicas=3

36. YAML Dry Run & Diff

# Show changes before applying.
kubectl diff -f app.yaml

# Server-side validation without apply.
kubectl apply -f app.yaml --dry-run=server

Useful for CI/CD pipelines.

37. Patching (Very Important for DevOps)

# Patch resources inline:
kubectl patch deployment web -p '{"spec": {"replicas": 4}}'

# Patch with strategic merge:
kubectl patch svc web --patch-file patch.yaml

Used for:

Hotfixes
On-the-fly changes
CI/CD automation

38. Node Health & Troubleshooting

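# Find unhealthy nodes (Ready condition is not "True").
# Filtering on status=="False" alone is misleading: pressure conditions are "False" on healthy nodes.
kubectl get nodes -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready") | .status != "True") | .metadata.name'

# Check disk pressure, memory pressure.
kubectl describe node node1

# List all pods on a node.
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=node1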

39. Events & Cluster-level Issues

# Show all events.
kubectl get events

# Sort events by time.
kubectl get events --sort-by='.lastTimestamp'

# Access kube-apiserver metrics.
kubectl get --raw /metrics

40. Service Debugging

# Check if service has endpoints.
kubectl get endpoints web

# Check wrong targetPort / selector.
kubectl describe svc web

# Test service connectivity.
kubectl run test --image=busybox -it --rm -- wget web

41. Ingress Debugging

# List ingress.
kubectl get ingress

# Check rules & events.
kubectl describe ingress web-ingress

# Check ingress controller health.
kubectl get pods -n ingress-nginx

42. Network Policy Debugging

# List network policies.
kubectl get netpol

# View network rules.
kubectl describe netpol allow-app

# Test connectivity:
kubectl run test --rm -it --image=busybox -- wget http://pod-ip

43. Persistent Storage Commands

# List storage classes.
kubectl get storageclass

# Details of storage class.
kubectl describe storageclass gp2

# List persistent volumes.
kubectl get pv

# List persistent volume claims.
kubectl get pvc

# PVC status.
kubectl describe pvc my-pvc

44. Logs & Audit

# Logs from the last 1 hour.
kubectl logs pod --since=1h

# Show last 50 lines.
kubectl logs pod --tail=50

# Show logs with timestamps.
kubectl logs pod --timestamps

45. Node-to-Pod Debugging

# Test pod connectivity (replace <pod-ip> with the target address).
kubectl run debug --rm -it --image=busybox --command -- ping <pod-ip>

# Find pod IP.
kubectl get pod -o wide

46. Copy Files To & From Pod

# Copy from pod → local.
kubectl cp pod:/app/logs ./logs

# Copy local → pod.
kubectl cp file.txt pod:/app/

47. ConfigMaps & Secrets Debugging

# View ConfigMap values.
kubectl get configmaps -o yaml

# Base64 encoded output.
kubectl get secret db-secret -o yaml

# Decode:
echo 'cGFzc3dvcmQ=' | base64 --decode

48. K8s Useful JSONPath Queries

# Get pod image:
kubectl get pod web -o jsonpath='{.spec.containers[*].image}'

# Get pod node:
kubectl get pod web -o jsonpath='{.spec.nodeName}'

49. Resource Quotas & Limits

# List quotas.
kubectl get resourcequota

# Quota details.
kubectl describe resourcequota rq1

50. LimitRanges

# List pod limit ranges.
kubectl get limitrange

Used for:

default CPU/memory
maximum/minimum allowed resources

51. Service Account with Pod

# Show token & secrets.
kubectl describe sa my-sa

# Check which service account pod uses (describe prints the field as "Service Account:").
kubectl describe pod pod | grep -i "service account"

52. RBAC Debugging

# Test user permission:
kubectl auth can-i get pods --as user1

# Test namespace-specific permission:
kubectl auth can-i delete pods -n dev

53. Port & Connectivity Debugging (Must Know)

# Test port:
kubectl run test --rm -it --image=busybox -- nc -zv web 80

# DNS check:
kubectl run test --rm -it --image=busybox -- nslookup web

54. Horizontal Pod Autoscaling

# Create autoscaler.
kubectl autoscale deployment web --cpu-percent=50 --min=2 --max=10

# List HPAs.
kubectl get hpa
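The autoscaler created above scales on the ratio between observed and target CPU utilization. The sketch below works through the core formula, desired = ceil(current replicas × current metric / target metric), clamped to the configured minimum and maximum. It is a simplification that assumes integer percentages and ignores the tolerance band and stabilization behavior of the real controller; the function name is illustrative.

```shell
#!/bin/sh
# desired = ceil(current_replicas * current_metric / target_metric),
# then clamped to [min, max]. Integer percentages assumed.
hpa_desired_replicas() {
    current=$1; metric=$2; target=$3; min=$4; max=$5
    # Integer ceiling of (current * metric) / target.
    desired=$(( (current * metric + target - 1) / target ))
    if [ "$desired" -lt "$min" ]; then desired=$min; fi
    if [ "$desired" -gt "$max" ]; then desired=$max; fi
    echo "$desired"
}

hpa_desired_replicas 4 80 50 2 10   # 4 pods at 80% CPU, target 50% -> 7
hpa_desired_replicas 4 10 50 2 10   # low load -> clamped to the minimum, 2
hpa_desired_replicas 8 90 50 2 10   # high load -> clamped to the maximum, 10
```

The arguments mirror the flags used with kubectl autoscale: target corresponds to --cpu-percent, and min and max to --min and --max.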

Cluster Autoscaler Debugging

# Check autoscaler pod.
kubectl -n kube-system get pods | grep autoscaler

Advanced Resource Filtering (field selectors)

Command	Description
`kubectl get pods --field-selector spec.nodeName=node1`	Get pods scheduled on a specific node
`kubectl get pods --field-selector=status.phase=Failed`	Get pods that failed
`kubectl get pods --field-selector=status.phase!=Running`	Get pods that are not running
`kubectl get pods --sort-by=.metadata.creationTimestamp`	List pods sorted by creation time (oldest first)

Advanced JSON & YAML Output Formatting

Command	Description
`kubectl get pods -o jsonpath='{.items[*].metadata.name}'`	Get only pod names
`kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'`	Get node internal IPs
`kubectl get pods -o jsonpath='{.items[*].spec.containers[*].image}'`	Get pod image names

Temporary BusyBox Pod (for debugging)

# Used for DNS testing, connecting to services, checking network restrictions
kubectl run tmp --rm -it --image=busybox -- sh

Check Pod Events Script

# Helpful command to sort events
kubectl get events --sort-by='.lastTimestamp'

Best for: CrashLoopBackOff, Pod scheduling failure, Image pull issues

Get Pod Environment Variables

kubectl exec -it <pod-name> -- printenv

kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].env}'

Debugging Service DNS

Command	Description
`kubectl run dns-test --rm -it --image=busybox -- nslookup svc-name`	Test DNS resolution
`kubectl run dns-test --rm -it --image=busybox -- nslookup kubernetes.default`	Test cluster DNS

Debug Service Connectivity

Command	Description
`kubectl run test --rm -it --image=busybox -- nc -zv svc-name 80`	Test port connectivity
`kubectl run test --rm -it --image=curlimages/curl -- curl http://svc-name`	Test via curl

Debug Node Ports

curl <node-ip>:<node-port>

Pod Security / User Permissions

kubectl get pod pod -o jsonpath='{.spec.containers[*].securityContext.runAsUser}'

Copy Kubernetes Manifest From Live Resource

Command	Description
`kubectl get deploy web -o yaml > web.yaml`	Extract current running deployment YAML
`kubectl get cm app-config -o yaml > configmap.yaml`	Extract current configmap YAML

Validating YAML

Command	Description
`kubectl apply -f app.yaml --dry-run=client`	Client-side validation
`kubectl apply -f app.yaml --dry-run=server`	Server-side validation
`kubeval app.yaml`	Lint YAML (if installed)

Kubernetes API Access (Raw)

Command	Description
`kubectl get --raw /metrics`	View cluster components metrics
`kubectl get --raw /api`	Access API paths

Node Disk / Memory Pressure Debugging

kubectl describe node node1 | grep -i pressure

Look for:

DiskPressure
MemoryPressure
PIDPressure

Node Logs (Master & Worker Debugging)

journalctl -u kubelet -f
journalctl -u containerd -f

Restarting Pods Properly

Command	Description
`kubectl delete pod pod-name`	Delete pod safely (deployment recreates it)
`kubectl delete pod pod-name --force --grace-period=0`	Force delete stuck pod

Restart Deployment (Without Editing)

kubectl rollout restart deployment web

Checking Cluster Authentication

Command	Description
`kubectl auth can-i create pods`	Test if user can perform action
`kubectl auth can-i delete pods --as bob`	Test as specific user

Get Logs From ALL Pods of a Deployment

kubectl logs -l app=web --all-containers=true

Labels must match deployment selector.

Debugging Network Policies

Command	Description
`kubectl get netpol`	List policies
`kubectl run test --rm -it --image=busybox -- sh`	Connectivity test across namespaces

Inside:

wget http://pod-ip

ConfigMap Reload Troubleshooting

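Pods do NOT pick up every ConfigMap change automatically:

Values injected as environment variables are read once at container start and change only when the pod restarts.
ConfigMaps mounted as volumes (including projected volumes) are refreshed by the kubelet after a delay, except for subPath mounts, and the application still has to re-read the files.
Sidecar reloaders (e.g., Reloader, ConfigMap reloader) can restart or signal the workload when a ConfigMap changes.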

kubectl get pod pod -o jsonpath='{.spec.volumes[*].configMap.name}'

Secret Decoding & Validation

Command	Description
`echo 'cGFzc3dvcmQ=' \| base64 --decode`	Decode a secret
`echo -n 'mypassword' \| base64`	Encode new value
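The two commands in the table can be tried without a cluster. The sketch below decodes the sample value, encodes a new one, and shows a common pitfall: plain echo appends a newline that silently becomes part of the secret. Using printf '%s' instead of echo -n is a portability choice made here, not something the table requires.

```shell
#!/bin/sh
# Decode a value as it appears in `kubectl get secret ... -o yaml`.
decoded=$(printf '%s' 'cGFzc3dvcmQ=' | base64 --decode)
echo "decoded: $decoded"

# Encode a new value. printf '%s' adds no trailing newline.
encoded=$(printf '%s' 'mypassword' | base64)
echo "encoded: $encoded"

# Pitfall: plain echo appends a newline, which becomes part of the secret.
with_newline=$(echo 'mypassword' | base64)
echo "with newline: $with_newline"
```

Keep in mind that base64 is an encoding, not encryption: anyone who can read the Secret object can recover the value.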

Checking Pod Storage Paths

Command	Description
`kubectl get pod pod -o jsonpath='{.spec.volumes}'`	Check mounted volumes
`kubectl get pod pod -o jsonpath='{.spec.containers[*].volumeMounts}'`	Check volume mount path

Live Pod Debug Session

kubectl debug --image=busybox pod-name

This creates a temporary container INSIDE pod for debugging.

Upgrade Kubernetes Without Downtime (Cluster Admin)

Check nodes before upgrade:

kubectl get nodes

Upgrade node one-by-one:

kubectl drain node1 --ignore-daemonsets

Then update node OS/Kubelet.

kubectl uncordon node1

Advanced kube-proxy Debugging

iptables -L -t nat
journalctl -u kube-proxy -f

6. Ingress & Cert Manager

Complete Ingress Configuration

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    # ingress-nginx rate limiting: requests per second accepted from one client IP
    nginx.ingress.kubernetes.io/limit-rps: "10"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: app-tls-secret
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend-service
                port:
                  number: 80

Cert Manager Setup

# Install Cert Manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.12.0/cert-manager.yaml
# Verify installation
kubectl get pods --namespace cert-manager

# ClusterIssuer for Let's Encrypt
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx

7. Observability Stack

Prometheus Configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

Useful PromQL Queries

# CPU Usage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory Usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# HTTP Error Rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Pod Restart Rate
rate(kube_pod_container_status_restarts_total[15m])
# 95th Percentile Response Time
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
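The memory usage query is plain arithmetic over two gauges, and the same calculation can be checked on a single Linux host from /proc/meminfo. The sketch below mirrors (MemTotal - MemAvailable) / MemTotal * 100; the function name and the synthetic sample file are assumptions made so that the result is predictable.

```shell
#!/bin/sh
# Memory usage percent from a meminfo-style file, mirroring the PromQL
# expression (MemTotal - MemAvailable) / MemTotal * 100.
mem_used_percent() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total > 0) printf "%.1f\n", (total - avail) / total * 100
            else exit 1
        }
    ' "${1:-/proc/meminfo}"
}

# Demo with a small synthetic sample so the result is predictable.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf 'MemTotal:       8000000 kB\nMemAvailable:   2000000 kB\n' > "$sample"
mem_used_percent "$sample"    # 75.0
```

Called without an argument, the function reads /proc/meminfo, which is useful for sanity-checking what a dashboard shows for that node.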

Grafana Dashboard Setup

Import dashboard ID 11074 for Kubernetes cluster monitoring or 3662 for Prometheus stats.

Quick Reference

Always test commands in a development environment before running them in production!

Daily Use Commands

# Git
git log --oneline --graph --all -20
# Docker
docker-compose logs -f --tail=100
docker system df
# Kubernetes
kubectl get all -A
kubectl top pods --all-namespaces
kubectl rollout restart deployment/app
# Linux
find /var/log -name "*.log" -mtime +30 -delete
du -sh * | sort -hr | head -10
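The find command with -delete is destructive, so it is worth previewing the match list first by running the same expression with -print. The sketch below demonstrates this in a scratch directory rather than /var/log; it assumes GNU touch for the relative date syntax, and the file names are made up for the demo.

```shell
#!/bin/sh
# Preview what `find ... -mtime +30 -delete` would remove by running the same
# expression with -print first. Uses a scratch directory, not /var/log.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT

touch "$dir/recent.log"
touch -d '40 days ago' "$dir/old.log"   # GNU touch date syntax
touch -d '40 days ago' "$dir/old.txt"   # old, but not a *.log file

echo "would delete:"
find "$dir" -name "*.log" -mtime +30 -print

# Only after checking the list, run the destructive form.
find "$dir" -name "*.log" -mtime +30 -delete
ls "$dir"
```

Only old.log matches both conditions; recent.log is too new and old.txt does not match the name pattern, so both survive.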

9. Terraform

Terraform Interview Questions

By DevOps Shack

Introduction to Terraform
- What is Terraform and what are its main features?
- Can you explain the difference between Terraform and other configuration management tools like Ansible, Puppet, or Chef?
State Management
- What is state in Terraform, and why is it important?
- How do you manage multiple environments (e.g., development, staging, production) in Terraform?
Providers and Modules
- What is a Terraform provider, and how do you use it?
- Explain the difference between Terraform modules and resources.
Importing Resources
- How can you import existing infrastructure into Terraform?
Variables and Outputs
- What are Terraform variables, and how do you use them?
- How do you handle secrets or sensitive data in Terraform?
Initialization and Planning
- What is the purpose of the terraform init command?
- How does Terraform handle concurrent operations in a team environment?
Advanced Features
- How does Terraform handle resource dependencies?
- What is drift detection in Terraform, and how do you handle drift?
Lifecycle Management
- How does Terraform manage resource lifecycles?
- What is the purpose of the terraform taint command?
Dynamic Blocks and Conditional Logic
- What are Terraform dynamic blocks, and how are they used?
- How does Terraform support conditional resource creation?
Remote State Management
- How do you manage remote state in Terraform?
- How does Terraform state file locking work in remote backends?
Formatting and Debugging
- What is terraform fmt, and why is it important?
- How do you debug errors in Terraform?
Zero-Downtime Deployments
- How does Terraform handle zero-downtime deployments?
Provisioners
- Explain the difference between local-exec and remote-exec provisioners.
Shared Modules
- How do you manage shared modules in Terraform?
Terraform Cloud
- What is Terraform Cloud, and how does it differ from Terraform CLI?
Resource Management
- What are Terraform backends, and why are they important?
- What is the purpose of the terraform output command?
Version Constraints
- How does Terraform manage provider and configuration version constraints?
Secrets Management
- How can you securely manage secrets in Terraform?
Interactive Console
- What is the purpose of the terraform console command?
Limitations and Best Practices
- What are the limitations of Terraform?
- How can you ensure best practices while working with Terraform?

Introduction to Terraform

Terraform, developed by HashiCorp, is one of the most popular Infrastructure as Code (IaC) tools, enabling developers and operations teams to define, provision, and manage infrastructure efficiently. With its declarative configuration language (HCL) and multi-cloud compatibility, Terraform has become a go-to tool for automating infrastructure management. This document compiles 50 Terraform interview questions and answers, covering fundamental concepts, advanced features, and practical use cases. It serves as a comprehensive guide for professionals preparing for Terraform interviews or looking to strengthen their understanding of the tool.

What is Terraform and what are its main features?

Terraform is an open-source Infrastructure as Code (IaC) tool developed by HashiCorp. It allows you to define, provision, and manage infrastructure across various cloud providers and services using a declarative configuration language known as HashiCorp Configuration Language (HCL).

Main Features:

Infrastructure as Code (IaC): Manage infrastructure using code, enabling version control, reuse, and sharing.
Provider Agnostic: Supports multiple providers like AWS, Azure, GCP, and others, allowing for a consistent workflow.
Execution Plans: Generates and shows execution plans before applying changes, helping you understand what Terraform will do.
Resource Graph: Builds a graph of all resources and their dependencies, optimizing resource creation and modification.
Change Automation: Automates complex changesets to your infrastructure with minimal human interaction.

Can you explain the difference between Terraform and other configuration management tools like Ansible, Puppet, or Chef?

Purpose:

Terraform: Primarily an infrastructure provisioning tool. It focuses on creating, updating, and versioning infrastructure safely and efficiently.
Ansible/Puppet/Chef: Primarily configuration management tools. They are used to install and manage software on existing servers.

Approach:

Terraform: Declarative. You describe the desired state, and Terraform figures out how to achieve it.
Ansible/Puppet/Chef: Can be both declarative and procedural, depending on how you write your configurations or playbooks.

Infrastructure Lifecycle:

Terraform: Manages the entire lifecycle of infrastructure, including creation, scaling, and destruction.
Ansible/Puppet/Chef: Manages the software and settings on already provisioned infrastructure.

What is state in Terraform, and why is it important?

Terraform State: A persistent data store that maps Terraform configurations to real-world resources. It's typically stored in a file named terraform.tfstate.

Importance:

Mapping: Keeps track of resource IDs and metadata, enabling Terraform to manage resources effectively.
Planning and Execution: Allows Terraform to generate accurate execution plans by knowing the current state of resources.
Collaboration: When stored remotely (e.g., in AWS S3 or Terraform Cloud), it enables team collaboration by sharing the state.

Managing Multiple Environments in Terraform

To manage multiple environments (e.g., development, staging, production) in Terraform, you can use the following methods:

Workspaces

Use Terraform workspaces to maintain separate state files within the same configuration for different environments.
Example:

# Create and select a workspace
terraform workspace new development
terraform workspace select development

Directory Structure

Organize configurations into separate directories for each environment, each with its own state.
Example:

# Directory structure
├── environments
│   ├── dev
│   ├── staging
│   └── prod

Variable Files

Use different .tfvars files for each environment to parameterize configurations.
Example:

# Apply configuration with variable file
terraform apply -var-file="dev.tfvars"

Backend Configuration

Configure backends to manage state storage for different environments.
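
One way to do this is partial backend configuration: the backend block stays identical across environments, and the environment-specific settings are supplied when running terraform init. The sketch below assumes an S3 backend; the file name dev.s3.tfbackend and the bucket and key values are hypothetical.

```hcl
# backend.tf: shared by every environment. Environment-specific
# settings are intentionally left out (partial configuration).
terraform {
  backend "s3" {}
}

# ---------------------------------------------------------------
# dev.s3.tfbackend (separate file, hypothetical name and values),
# supplied at init time with:
#   terraform init -backend-config=dev.s3.tfbackend
# ---------------------------------------------------------------
bucket = "my-terraform-state"
key    = "dev/terraform.tfstate"
region = "us-west-2"
```

Keeping one such file per environment means each environment gets its own state object without duplicating the rest of the configuration.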

Terraform Provider

A Terraform provider is a plugin that interacts with APIs of cloud platforms and services (e.g., AWS, Azure, Google Cloud). Providers define resources and data sources for a service.

Usage

Declaration:

# Declare a provider
provider "aws" {
  region = "us-west-2"
}

Version Pinning:

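# Pin provider version (required_providers replaces the deprecated provider-level "version" argument)
terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 3.0" }
  }
}
provider "aws" {
  region = "us-west-2"
}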

You can configure multiple providers to manage resources across different platforms.
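
The same mechanism also covers several configurations of a single provider, for example two AWS regions. The following is a minimal sketch; the alias name east and the resource name replica are assumptions chosen for illustration.

```hcl
# Default AWS provider configuration
provider "aws" {
  region = "us-west-2"
}

# Second configuration of the same provider; the alias "east" is an assumed name
provider "aws" {
  alias  = "east"
  region = "us-east-1"
}

# Resources opt in to the aliased configuration explicitly
resource "aws_instance" "replica" {
  provider      = aws.east
  ami           = "ami-0c55b159cbfafe1f0" # placeholder; AMI IDs are region-specific
  instance_type = "t2.micro"
}
```

Resources that do not set the provider argument use the default (un-aliased) configuration.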

Difference Between Terraform Modules and Resources

Resources

Resources are the basic building blocks in Terraform, representing infrastructure objects like virtual networks, compute instances, or databases.

# Example of a resource
resource "aws_instance" "web_server" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"
}

Modules

Modules are containers for multiple resources that are used together, promoting code reuse and organization. They can be shared and versioned.

# Example of using a module
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "2.77.0"

  name = "my-vpc"
  cidr = "10.0.0.0/16"
}

Importing Existing Infrastructure into Terraform

Step 1: Write Resource Configuration

Define a matching resource block in your .tf files. The block can start out empty; its arguments are filled in after the import.

# Resource configuration
resource "aws_instance" "existing" {
  # Configuration will be populated after import
}

Step 2: Run Import Command

Use terraform import to map the existing resource to the Terraform resource.

# Import command
terraform import aws_instance.existing i-0abcdef1234567890

Step 3: Refresh and Update Configuration

Run terraform plan to see differences and update the configuration to match the actual settings.
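
On Terraform 1.5 or later, the same import can be expressed declaratively with an import block, so it is reviewed in the plan like any other change. This is a sketch that reuses the resource address and instance ID from the steps above.

```hcl
# Declarative alternative to the CLI command (Terraform 1.5 or later)
import {
  to = aws_instance.existing
  id = "i-0abcdef1234567890"
}

resource "aws_instance" "existing" {
  # Fill in arguments until terraform plan reports no changes
}
```

After a successful apply the import block can be removed; the resource stays in state.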

Terraform Variables

Input Variables

Input variables are parameters for Terraform modules, making configurations flexible and reusable.

# Example of an input variable
variable "instance_type" {
  type        = string
  default     = "t2.micro"
  description = "EC2 instance type"
}

Usage

# Using a variable in a resource
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = var.instance_type
}

Setting Variables

Environment Variables: export TF_VAR_instance_type="t2.small"
Command-Line Flags: terraform apply -var="instance_type=t2.small"
Variable Files: Create .tfvars files and pass them with -var-file flag.
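
A variable file is just a list of assignments for declared variables. The sketch below shows what a hypothetical dev.tfvars could contain for the instance_type variable defined earlier. Files named terraform.tfvars or ending in .auto.tfvars are loaded automatically, while any other name has to be passed with -var-file.

```hcl
# dev.tfvars (hypothetical file name): one "name = value" pair per variable
instance_type = "t2.small"
```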

Output Variables

Output variables are used to expose values to the user or other configurations.

# Example of an output variable
output "instance_ip" {
  value = aws_instance.web.public_ip
}

Handling Secrets or Sensitive Data in Terraform

Sensitive Variables

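Mark variables as sensitive so that Terraform redacts their values in plan and apply output. The values are still written to the state file in plain text, so the state itself must also be protected.

# Example of a sensitive variable
variable "db_password" {
  type      = string
  sensitive = true
}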

Avoid Hardcoding

Do not store secrets in code or version control. Use environment variables or prompt for input.

Use Vault or Secret Management Services

Integrate with tools like HashiCorp Vault to fetch secrets at runtime.

Secure State Storage

Use encrypted remote backends to store state files securely.

Example of Fetching a Secret

# Example of fetching a secret from Vault
data "vault_generic_secret" "db_password" {
  path = "secret/data/db_password"
}

resource "aws_db_instance" "example" {
  password = data.vault_generic_secret.db_password.data["password"]
  # Other configurations
}

Purpose of the terraform init Command

The terraform init command initializes a Terraform working directory by downloading and installing the necessary providers and modules.

Functions

Plugin Installation: Downloads provider plugins required for the configuration.
Backend Initialization: Sets up the backend for state storage.
Module Installation: Downloads modules from sources like GitHub or the Terraform Registry.

When to Run

First time setting up a configuration.
After adding or changing providers or modules.
After cloning a repository containing Terraform configurations.

Handling Resource Dependencies in Terraform

Implicit Dependencies

Terraform automatically determines resource dependencies by analyzing references in configurations.

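# Example of implicit dependencies
# The instance references the subnet, so the subnet is created first.
resource "aws_instance" "example" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"
  subnet_id     = aws_subnet.example.id
}

# aws_vpc.example is assumed to be defined elsewhere in the configuration.
resource "aws_subnet" "example" {
  vpc_id     = aws_vpc.example.id
  cidr_block = "10.0.1.0/24"
}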

Explicit Dependencies

Use depends_on when a dependency isn’t detected automatically.

# Example of explicit dependencies
resource "null_resource" "example" {
  depends_on = [aws_instance.example]
}

Managing Remote State in Terraform

Remote state is used to share the state file among team members and secure it.

Example Using AWS S3

Terraform Backend Configuration

# Terraform backend configuration for S3
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "global/s3/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-lock-table"
  }
}

Features of Terraform Backend

Storage: Stores the state in a remote backend like S3, Azure Blob, or Terraform Cloud.
Locking: Prevents concurrent changes using mechanisms like DynamoDB tables.

Terraform Data Sources

Data Sources allow you to fetch existing information or resources from a provider.

Example of a Data Source

# Fetching the most recent AWS AMI
data "aws_ami" "example" {
  most_recent = true
  owners      = ["self"]

  # Match AMIs whose name starts with "my-ami-"
  filter {
    name   = "name"
    values = ["my-ami-*"]
  }
}

Using Data Source in a Resource

# Example of using a data source in a resource
resource "aws_instance" "example" {
  ami           = data.aws_ami.example.id
  instance_type = "t2.micro"
}

Terraform Commands: Plan vs Apply

terraform plan: Shows the changes Terraform will make to your infrastructure without actually applying them. Use for review and approval.
terraform apply: Executes the changes proposed in the plan, creating, modifying, or destroying resources as necessary.

Count vs For_each in Terraform

count: Creates multiple resources by a specified number.

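# Example using count
resource "aws_instance" "example" {
  count         = 3
  instance_type = "t2.micro"
  ami           = "ami-0c55b159cbfafe1f0"
}

Inside the block, count.index gives the zero-based index of the current instance. Elsewhere, instances are referenced as aws_instance.example[0], aws_instance.example[1], and so on.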

for_each: Creates resources based on a map or a set.

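# Example using for_each
resource "aws_instance" "example" {
  for_each = {
    server1 = "t2.micro"
    server2 = "t2.small"
  }

  instance_type = each.value
  ami           = "ami-0c55b159cbfafe1f0"
}

Inside the block, each.key and each.value give the current map key and value. Elsewhere, instances are referenced by key, e.g. aws_instance.example["server1"].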
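
for_each also accepts a set of strings, which is convenient when items differ only by name. The sketch below is illustrative; the bucket names and the output name are hypothetical.

```hcl
# for_each over a set of strings; the bucket names are hypothetical
resource "aws_s3_bucket" "logs" {
  for_each = toset(["app-logs-example", "audit-logs-example"])
  bucket   = each.key # for a set, each.key and each.value are identical
}

# Instances are addressed by key rather than by numeric index
output "app_logs_arn" {
  value = aws_s3_bucket.logs["app-logs-example"].arn
}
```

Because instances are keyed by name, removing one element from the set leaves the others untouched. With count, removing an element from the middle of a list shifts the indexes of everything after it.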

Debugging Errors in Terraform

Enable Debug Logs: Set the TF_LOG environment variable.

# Enable debug logs
export TF_LOG=DEBUG
terraform apply

Log Output File: Redirect logs to a file for detailed review.

# Redirect logs to a file
export TF_LOG_PATH="terraform.log"

Validate Configurations: Use terraform validate to check for syntax errors.

Plan Execution: Run terraform plan to identify issues in execution plans.

Local-exec vs Remote-exec Provisioners

local-exec: Executes commands on the machine running Terraform.

# Example of local-exec
resource "null_resource" "example" {
  provisioner "local-exec" {
    command = "echo 'Hello, World!'"
  }
}

remote-exec: Executes commands on a remote resource (e.g., an EC2 instance).

# Example of remote-exec
# (ami, instance_type, and the connection settings are omitted for brevity)
resource "aws_instance" "example" {
  provisioner "remote-exec" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y nginx"
    ]
  }
}
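
A remote-exec provisioner needs a connection block that tells Terraform how to reach the machine. The following is a sketch of a fuller version of the example, assuming SSH access to an Ubuntu-based AMI. The private_key_path variable, the key pair name, and the ubuntu login user are illustrative assumptions, not values from this guide.

```hcl
variable "private_key_path" {
  type        = string
  description = "Path to the SSH private key (hypothetical variable)"
}

resource "aws_instance" "example" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"
  key_name      = "my-key-pair" # assumed existing EC2 key pair

  # Tells the provisioner how to reach the instance
  connection {
    type        = "ssh"
    host        = self.public_ip
    user        = "ubuntu" # assumed default user for an Ubuntu AMI
    private_key = file(var.private_key_path)
  }

  provisioner "remote-exec" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y nginx",
    ]
  }
}
```

The instance also needs a public IP and a security group that allows inbound SSH from the machine running Terraform.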

Null Resource in Terraform

A null_resource is a resource that doesn’t directly manage infrastructure but allows running provisioners and triggers.

# Example of null_resource
resource "null_resource" "example" {
  provisioner "local-exec" {
    command = "echo 'Triggered by change in variables!'"
  }

  triggers = {
    variable = var.example_variable
  }
}

Use Cases for Null Resource

Execute local commands or scripts based on conditions.
Handle non-infrastructure workflows.
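
On Terraform 1.4 or later, the built-in terraform_data resource covers the same use cases without requiring the null provider. A minimal sketch of the example above rewritten that way, with example_variable declared only so the fragment is self-contained:

```hcl
variable "example_variable" {
  type = string
}

# Built-in resource (Terraform 1.4+); no null provider required
resource "terraform_data" "example" {
  # A change to this value replaces the resource, which re-runs its provisioners
  triggers_replace = [var.example_variable]

  provisioner "local-exec" {
    command = "echo 'Triggered by change in variables!'"
  }
}
```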

Terraform fmt

terraform fmt: Formats Terraform configuration files to ensure consistent style.

Run it in the directory containing .tf files: terraform fmt

Importance of terraform fmt

Improves readability and standardizes configuration files.

Terraform Taint Command

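terraform taint: Marks a resource as tainted so that it is destroyed and recreated during the next terraform apply. The command has been deprecated since Terraform v0.15.2; terraform apply -replace="aws_instance.example" is the recommended equivalent, because the replacement appears in the plan before anything changes.

# Example of terraform taint
terraform taint aws_instance.example

Use Case: When a resource is in an inconsistent state or needs to be recreated because of changes made outside Terraform.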

Difference Between terraform destroy and terraform apply -destroy

terraform destroy: Destroys all the resources tracked in the current state. It shows a destroy plan and asks for confirmation before deleting anything.

# Command to destroy resources
terraform destroy

terraform apply -destroy: The long form of the same operation; terraform destroy is a convenience alias for it. To preview the destruction without applying it, run terraform plan -destroy.

# Command to apply destroy
terraform apply -destroy

Rollback Changes in Terraform

State Restoration: Restore a previous state backup if state file corruption occurs.

# Restore state from backup
cp terraform.tfstate.backup terraform.tfstate

Revert Code Changes: Revert to a previous commit in version control and reapply.

# Revert to a previous commit
git checkout <commit>
terraform apply

Manual Correction: Edit the configuration, run terraform plan to review the corrective changes, then run terraform apply to make them.

Terraform Modules

Terraform Modules are a way to encapsulate resources for reuse.

Steps to Create a Module

Structure:

├── main.tf
├── variables.tf
└── outputs.tf

Module Definition

# main.tf
resource "aws_instance" "example" {
  ami           = var.ami
  instance_type = var.instance_type
}

# variables.tf
variable "ami" {}

variable "instance_type" {
  default = "t2.micro"
}
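
The structure above also lists outputs.tf, and a module only becomes useful once a root configuration calls it. The sketch below shows both pieces. The module path ./modules/ec2 and the names instance_id, web, and web_instance_id are assumptions chosen for illustration.

```hcl
# outputs.tf (inside the module): expose values to the caller
output "instance_id" {
  value = aws_instance.example.id
}

# Root configuration: call the module; "./modules/ec2" is an assumed path
module "web" {
  source        = "./modules/ec2"
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.small"
}

# Module outputs are read as module.<name>.<output>
output "web_instance_id" {
  value = module.web.instance_id
}
```

Run terraform init after adding or changing a module block so that the module is installed.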
