Designing Resilient Distributed Systems with TypeScript: Handling System-Level Failures

Learn how to design distributed systems that recover gracefully from system-level failures using TypeScript. This beginner-friendly guide covers error handling best practices and patterns to build resilient distributed applications.

Distributed systems consist of multiple independent computers (nodes) working together. Because nodes communicate over networks and run independently, system-level failures such as network glitches, hardware crashes, or service unavailability can happen at any time. To build resilient distributed systems in TypeScript, you need to anticipate these failures and handle them gracefully.

In this article, we'll cover beginner-friendly error handling techniques in TypeScript that help distributed systems keep working even when some parts fail.

### 1. Use Try-Catch for Safe Error Handling

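When calling remote services or making network requests, wrap the call in `try`/`catch` so that a single failure does not crash the whole program. Note that `fetch` rejects only on network-level failures; an HTTP error status such as 500 still resolves normally, so check `response.ok` explicitly.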

```typescript
async function fetchData(url: string): Promise<string> {
  try {
    const response = await fetch(url);
    // fetch resolves even for 4xx/5xx responses, so check the status explicitly
    if (!response.ok) {
      throw new Error(`HTTP error: ${response.status}`);
    }
    return await response.text();
  } catch (error) {
    console.error('Failed to fetch data:', error);
    throw error; // rethrow so the caller can retry or fall back
  }
}
```
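One TypeScript-specific detail is worth a short sketch: with `strict` enabled (TypeScript 4.4 and later, via `useUnknownInCatchVariables`), the variable in a `catch` clause is typed `unknown`, because JavaScript allows any value to be thrown. The helper below, `describeError`, is a hypothetical name introduced here for illustration; it narrows the caught value before reading its properties.

```typescript
// Hypothetical helper: turns whatever was thrown into a readable message.
function describeError(error: unknown): string {
  if (error instanceof Error) {
    // Real Error objects carry a name and a message
    return `${error.name}: ${error.message}`;
  }
  if (typeof error === 'string') {
    // Some code throws plain strings
    return error;
  }
  // Anything else (numbers, objects, undefined) is stringified
  return `Non-Error value thrown: ${String(error)}`;
}

try {
  throw new Error('connection reset');
} catch (error) {
  console.error(describeError(error)); // Error: connection reset
}
```

Narrowing like this keeps logging code from assuming that every failure is an `Error` instance.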

### 2. Implement Retries with Exponential Backoff

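Transient failures are common in distributed systems. For temporary issues, retry the request a limited number of times, doubling the delay between attempts (exponential backoff) so that a struggling service is not flooded with requests, and give up once the retries are exhausted.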

```typescript
async function retryFetch(url: string, retries = 3, delay = 500): Promise<string> {
  try {
    return await fetchData(url);
  } catch (error) {
    // Out of retries: surface the last error to the caller
    if (retries <= 0) throw error;
    console.log(`Retrying in ${delay}ms... (${retries} left)`);
    await new Promise<void>(resolve => setTimeout(resolve, delay));
    // Double the delay for the next attempt
    return retryFetch(url, retries - 1, delay * 2);
  }
}
```
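To make the delay schedule concrete, here is a minimal sketch that lists the waits produced by doubling. The helper name `backoffDelays` and the optional `maxDelayMs` cap are assumptions added for illustration; the retry function above has no cap.

```typescript
// Hypothetical helper: lists the waits produced by doubling a base delay.
function backoffDelays(retries: number, baseDelayMs: number, maxDelayMs = Infinity): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    // Delay doubles on each attempt, but never exceeds the (assumed) cap
    delays.push(Math.min(baseDelayMs * 2 ** attempt, maxDelayMs));
  }
  return delays;
}

console.log(backoffDelays(3, 500)); // [ 500, 1000, 2000 ]
```

With the defaults used above (3 retries, 500 ms base delay), a request that keeps failing waits 3.5 seconds in total before the error reaches the caller. In a long retry chain, a cap like `maxDelayMs` keeps the last waits from growing without bound.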

### 3. Use Circuit Breaker Pattern

If a service is consistently failing, retrying wastes resources and can slow its recovery. A circuit breaker stops sending requests after a set number of consecutive failures (the circuit "opens") and allows them again only after a cooldown period has passed.

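```typescript
class CircuitBreaker {
  private failures = 0;          // consecutive failures seen so far
  private threshold: number;     // failures needed to open the circuit
  private cooldown: number;      // how long the circuit stays open, in ms
  private nextAttempt = 0;       // timestamp after which requests are allowed

  constructor(threshold: number, cooldown: number) {
    this.threshold = threshold;
    this.cooldown = cooldown;
  }

  // Requests are allowed unless the circuit is open and still cooling down
  canRequest(): boolean {
    return Date.now() >= this.nextAttempt;
  }

  // Any success closes the circuit and resets the failure count
  success(): void {
    this.failures = 0;
  }

  // The count is not reset when the cooldown ends, so a single failure
  // after the cooldown reopens the circuit immediately
  failure(): void {
    this.failures++;
    if (this.failures >= this.threshold) {
      this.nextAttempt = Date.now() + this.cooldown;
    }
  }
}

async function protectedFetch(url: string, cb: CircuitBreaker): Promise<string> {
  if (!cb.canRequest()) {
    // Fail fast instead of calling a service that is known to be down
    throw new Error('Circuit breaker is open. Skipping request');
  }
  try {
    const result = await fetchData(url);
    cb.success();
    return result;
  } catch (error) {
    cb.failure();
    throw error;
  }
}
```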
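To see how the threshold and cooldown interact, the sketch below re-creates the breaker with an injectable clock, so the state changes can be traced without waiting on real time. The class name `TraceableBreaker`, the `now` parameter, and the `fakeTime` variable are assumptions introduced for this illustration; they are not part of the class above.

```typescript
// Illustrative variant of the breaker; `now` is an assumed injection point
// so the trace below does not depend on the real clock.
class TraceableBreaker {
  private failures = 0;
  private nextAttempt = 0;
  private threshold: number;
  private cooldownMs: number;
  private now: () => number;

  constructor(threshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  canRequest(): boolean {
    return this.now() >= this.nextAttempt;
  }

  success(): void {
    this.failures = 0;
  }

  failure(): void {
    this.failures++;
    if (this.failures >= this.threshold) {
      this.nextAttempt = this.now() + this.cooldownMs;
    }
  }
}

let fakeTime = 0; // simulated clock, in milliseconds
const breaker = new TraceableBreaker(2, 1000, () => fakeTime);

breaker.failure();
const afterOneFailure = breaker.canRequest();  // true: still below the threshold
breaker.failure();
const afterTwoFailures = breaker.canRequest(); // false: threshold reached, circuit open
fakeTime = 1000;
const afterCooldown = breaker.canRequest();    // true: cooldown has elapsed

console.log({ afterOneFailure, afterTwoFailures, afterCooldown });
```

Injecting the clock is a design choice that also makes a breaker easy to unit test; production code can keep the default and rely on `Date.now()`.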

### 4. Graceful Degradation and Fallbacks

Sometimes the best way to handle failure is to provide cached or default data instead of failing completely.

```typescript
async function fetchWithFallback(url: string, fallbackData: string): Promise<string> {
  try {
    return await retryFetch(url);
  } catch {
    // All retries failed: degrade gracefully instead of propagating the error
    console.warn('Returning fallback data');
    return fallbackData;
  }
}
```
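The cached-data variant can be sketched in a similar way. The `LastKnownGood` class below is a hypothetical helper, not an established API: it remembers the last successful value per key and serves it when the live call fails, falling back to a static default only when nothing has been cached yet. The `demo` function and its keys and values are likewise made up for illustration.

```typescript
// Hypothetical helper: serves the last successful value when the live call fails.
class LastKnownGood<T> {
  private cache = new Map<string, T>();

  async get(key: string, operation: () => Promise<T>, defaultValue: T): Promise<T> {
    try {
      const fresh = await operation();
      this.cache.set(key, fresh); // refresh the cache on every success
      return fresh;
    } catch {
      // Prefer stale-but-real data over the static default
      return this.cache.has(key) ? (this.cache.get(key) as T) : defaultValue;
    }
  }
}

async function demo(): Promise<string[]> {
  const profiles = new LastKnownGood<string>();
  const failing = () => Promise.reject(new Error('service down'));

  const first = await profiles.get('user:1', failing, 'guest');              // 'guest': nothing cached yet
  const second = await profiles.get('user:1', async () => 'alice', 'guest'); // 'alice': live value
  const third = await profiles.get('user:1', failing, 'guest');              // 'alice': served from cache
  return [first, second, third];
}

demo().then(values => console.log(values));
```

Serving stale data is a trade-off: it suits content that changes slowly, but it is usually the wrong choice for values that must be current, such as account balances.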

### Conclusion

Handling system-level failures in distributed systems is crucial for reliability. In TypeScript, use try-catch blocks, retries with exponential backoff, circuit breakers, and fallback strategies to keep your system resilient. These patterns help your system continue working smoothly, even in the face of unpredictable failures.