Skip to content

Overview

Distributed systems represent a computing paradigm where multiple independent computers collaborate to achieve a common goal, appearing as a single coherent system to the end-user. Unlike centralised systems where all processing occurs on one machine, distributed systems span networks and leverage interconnected nodes to handle data storage, computation, and communication.

This architecture underpins modern cloud platforms (AWS, Google Cloud, Azure), large-scale applications (social networks, e-commerce), and scientific simulations. The foundational challenge is that nodes operate independently, communicate only by passing messages, and have no shared clock — making coordination a non-trivial engineering problem.


Classification

Distributed systems can be classified along several axes:

Dimension Option A Option B
Hardware/software uniformity Homogeneous — all nodes run the same stack Heterogeneous — diverse hardware and OS
Coupling Tightly coupled — high interdependence, shared memory Loosely coupled — communicate via messages, independent lifecycle
Topology Client-server — hierarchical, servers hold resources Peer-to-peer — egalitarian, all nodes share resources

Web applications follow the client-server model (browsers request from servers). File-sharing systems like BitTorrent follow peer-to-peer. Most microservice architectures occupy the loosely-coupled middle ground.


Advantages

Benefit How
Horizontal scalability Add nodes instead of upgrading a single machine. Cassandra scales to petabytes by adding nodes seamlessly.
Fault tolerance Replication across nodes keeps the system running despite individual failures. Google Spanner uses synchronous cross-datacenter replication.
Resource sharing Idle capacity from many machines is pooled. Folding@home aggregates CPU cycles for scientific workloads.
Parallelism Tasks are divided and processed concurrently. MapReduce (Hadoop) exemplifies this for big-data analytics.
Latency reduction Nodes placed near users reduce round-trip time. CDNs like Akamai serve content from the nearest edge location.

Disadvantages

Challenge Root cause
Complexity Coordination, synchronisation, and distributed state are hard. The fallacies of distributed computing (Deutsch) list common wrong assumptions: the network is reliable, latency is zero, bandwidth is infinite.
Network overhead Messages can be delayed, lost, or duplicated. All must be handled explicitly.
Consistency vs availability Keeping data consistent across nodes while staying available under failure is provably impossible to do perfectly — see the CAP Theorem.
Expanded attack surface More nodes mean more entry points. Byzantine faults (malicious nodes) require dedicated protocols.
Operational cost Debugging, monitoring, and deploying distributed systems requires specialised tooling (Prometheus, Kubernetes, distributed tracing).

Topics covered

  • CAP Theorem


    The central trade-off in distributed data stores: Consistency, Availability, and Partition Tolerance cannot all be guaranteed simultaneously. Understand CP, AP, and CA system classes — and the PACELC extension.

    CAP Theorem

  • Consensus & Algorithms


    How distributed nodes reach agreement despite failures. Covers Paxos, Raft, quorum systems, failure models (crash-stop, Byzantine), the FLP impossibility result, and eventual consistency.

    Consensus & Algorithms