Overview

Distributed systems represent a computing paradigm where multiple independent computers collaborate to achieve a common goal, appearing as a single coherent system to the end-user. Unlike centralised systems where all processing occurs on one machine, distributed systems span networks and leverage interconnected nodes to handle data storage, computation, and communication.

This architecture underpins modern cloud platforms (AWS, Google Cloud, Azure), large-scale applications (social networks, e-commerce), and scientific simulations. The foundational challenge is that nodes operate independently, communicate only by passing messages, and have no shared clock — making coordination a non-trivial engineering problem.

Classification

Distributed systems can be classified along several axes:

Dimension	Option A	Option B
Hardware/software uniformity	Homogeneous — all nodes run the same stack	Heterogeneous — diverse hardware and OS
Coupling	Tightly coupled — high interdependence, shared memory	Loosely coupled — communicate via messages, independent lifecycle
Topology	Client-server — hierarchical, servers hold resources	Peer-to-peer — egalitarian, all nodes share resources

Web applications follow the client-server model (browsers request from servers). File-sharing systems like BitTorrent follow peer-to-peer. Most microservice architectures occupy the loosely-coupled middle ground.

Advantages

Benefit	How
Horizontal scalability	Add nodes instead of upgrading a single machine. Cassandra scales to petabytes by adding nodes seamlessly.
Fault tolerance	Replication across nodes keeps the system running despite individual failures. Google Spanner uses synchronous cross-datacenter replication.
Resource sharing	Idle capacity from many machines is pooled. Folding@home aggregates CPU cycles for scientific workloads.
Parallelism	Tasks are divided and processed concurrently. MapReduce (Hadoop) exemplifies this for big-data analytics.
Latency reduction	Nodes placed near users reduce round-trip time. CDNs like Akamai serve content from the nearest edge location.

Disadvantages

Challenge	Root cause
Complexity	Coordination, synchronisation, and distributed state are hard. The fallacies of distributed computing (Deutsch) list common wrong assumptions: the network is reliable, latency is zero, bandwidth is infinite.
Network overhead	Messages can be delayed, lost, or duplicated. All must be handled explicitly.
Consistency vs availability	Keeping data consistent across nodes while staying available under failure is provably impossible to do perfectly — see the CAP Theorem.
Expanded attack surface	More nodes mean more entry points. Byzantine faults (malicious nodes) require dedicated protocols.
Operational cost	Debugging, monitoring, and deploying distributed systems requires specialised tooling (Prometheus, Kubernetes, distributed tracing).

Topics covered

CAP Theorem

The central trade-off in distributed data stores: Consistency, Availability, and Partition Tolerance cannot all be guaranteed simultaneously. Understand CP, AP, and CA system classes — and the PACELC extension.

CAP Theorem
Consensus & Algorithms

How distributed nodes reach agreement despite failures. Covers Paxos, Raft, quorum systems, failure models (crash-stop, Byzantine), the FLP impossibility result, and eventual consistency.

Consensus & Algorithms

LAMPORT, L. Time, Clocks, and the Ordering of Events in a Distributed System. CACM, 1978. ↩
DEUTSCH, P. Fallacies of Distributed Computing, 1994. ↩