50 System Design Concepts

When you start learning system design, the hardest part is not the concepts themselves.

It is finding clear explanations in one place.

That is why having a single guide that covers all the essentials is such a game-changer.

So I’ve designed this guide to cover 50 of the most important system design concepts.

Think of it as your one-stop reference for understanding how real systems scale, stay reliable, communicate, and handle data.

My goal is to walk you through fifty important ideas using short explanations and simple examples so everything clicks quickly.

If you are preparing for a system design interview, this guide is your go-to resource.

I. Core Architecture Principles

Vertical vs Horizontal Scaling

  • Vertical scaling means upgrading a single machine, like adding more CPU, RAM, or faster storage.
  • Horizontal scaling means adding more machines and spreading work across them.

Vertical is easier but hits hardware limits and becomes expensive.

Horizontal is harder because you need load balancing, stateless services, and shared storage.

Think of it this way: vertical is one superhero getting stronger, horizontal is building a team.

CAP Theorem

The CAP Theorem says that in the presence of a network partition, a distributed system must choose between Consistency and Availability. Consistency means every user sees the same data at the same time.

Availability means the system always responds, even if the data might be slightly stale.

You cannot have perfect consistency and perfect availability when your network is broken, so you decide which one to sacrifice for your use case.

PACELC Theorem

PACELC extends CAP and says: if there is a Partition, choose Availability or Consistency; Else choose Latency or Consistency.

Even when the network is fine, you still trade off slow but consistent reads vs fast but eventually consistent reads. Systems that sync across regions often pay in latency to keep strong consistency.

It explains why some databases are fast but slightly stale, while others are slower but always accurate.

 

ACID vs BASE

ACID is about strict, reliable transactions: Atomicity, Consistency, Isolation, Durability. It suits financial systems, inventory, and anything where mistakes are very costly.

BASE stands for Basically Available, Soft state, Eventual consistency and is used in large distributed systems that need to stay up and respond quickly.

BASE systems might show temporary inconsistencies but fix themselves over time.

In practice, many architectures combine both, using ACID for core money flows and BASE for things like feeds and analytics.

Throughput vs Latency

  • Throughput is how many requests your system can handle per second.
  • Latency is how long a single request takes from start to finish.

You can often increase throughput by doing more work in parallel, but that may increase latency if queues build up.

Think of a restaurant that takes many orders at once but makes customers wait longer. Good system design tries to balance both: enough throughput for peak load but low latency for a smooth user experience.

Amdahl’s Law

Amdahl’s Law says that the speedup from parallelization is limited by the part that cannot be parallelized.

If 20 percent of your system is always sequential, no amount of extra machines will fix that bottleneck.

Let me break it down.

If your request always has to hit a single master database, that master will cap your performance. This law reminds you to hunt for bottlenecks instead of just adding more servers.
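The law itself is one line of arithmetic. Here is a minimal sketch (the function name is my own, for illustration):

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Overall speedup when `parallel_fraction` of the work runs on
    `workers` machines and the rest stays sequential."""
    return 1 / ((1 - parallel_fraction) + parallel_fraction / workers)

# With 20 percent sequential work, even 1000 workers cap out near 5x.
print(round(amdahl_speedup(0.8, 4), 2))     # 2.5
print(round(amdahl_speedup(0.8, 1000), 2))  # 4.98
```

Notice how the curve flattens: going from 4 workers to 1000 only doubles the speedup, because the sequential 20 percent dominates.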

 

Strong vs Eventual Consistency

  • Strong consistency means all users see the same data immediately after a write.
  • Eventual consistency means updates spread over time and nodes may briefly disagree.

Strong consistency is easier to reason about but usually slower and less available under failures.

Eventual consistency is great for large-scale systems like timelines or counters where perfect freshness is not critical.

The key is to choose the model that matches the user experience you need.

Stateful vs Stateless Architecture

  • A stateful service remembers user context between requests, often storing session data locally.
  • A stateless service treats every request as new, relying on external stores like caches or databases for any state.

Stateless services are easier to scale horizontally because any instance can handle any request.

Stateful systems can be simpler to code but harder to load balance and fail over.

In modern cloud systems, we try to push state into databases and keep services as stateless as possible.

Microservices vs Monoliths

A monolith is a single application that contains many features in one deployable unit.

Microservices split features into separate services that communicate over the network.

Microservices help teams work independently and scale different parts separately, but introduce complexity around communication, debugging, and data consistency.

Monoliths are simpler to start with and often fine up to a certain scale. Here is the tricky part.

Many great systems start as monoliths and gradually evolve into microservices when the pain is real.

Serverless Architecture

Serverless lets you run small functions in the cloud without managing servers directly. You pay only when your code runs, and the platform handles scaling and infrastructure for you.

It is ideal for event-driven workloads such as webhooks, background jobs, or light APIs with spiky traffic.

The tradeoff is less control over long-running tasks, cold starts, and sometimes a higher cost at very high volumes.

Think of serverless as “functions as a service,” perfect for glue code and lightweight services.

II. Networking and Communication

Load Balancing

Load balancing spreads incoming traffic across multiple servers so no single server gets overloaded. It improves both reliability and performance, since a single server’s failure does not bring down the entire system.

Load balancers can be hardware devices or software services. They often support health checks so they stop sending traffic to unhealthy instances.

From an interview point of view, they are your first building block when scaling horizontally.

Load Balancing Algorithms

Common load balancing algorithms include Round Robin, Least Connections, and IP Hash.

  • Round Robin cycles through servers in order and is simple to implement.
  • Least Connections sends traffic to the server with the fewest active connections, which helps when requests vary in length.
  • IP Hash uses a hash of the client IP so the same user usually goes to the same server, which helps with simple session stickiness.

Picking the right algorithm affects fairness, resource usage, and user experience.
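The three algorithms above can be sketched in a few lines of Python (the server names and the connection-count dict are illustrative):

```python
import hashlib
import itertools

class RoundRobin:
    """Cycle through servers in order."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)
    def pick(self) -> str:
        return next(self._cycle)

def least_connections(active: dict) -> str:
    """Pick the server with the fewest active connections."""
    return min(active, key=active.get)

def ip_hash(client_ip: str, servers: list) -> str:
    """Hash the client IP so the same user lands on the same server."""
    h = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

rr = RoundRobin(["s1", "s2", "s3"])
print([rr.pick() for _ in range(4)])          # ['s1', 's2', 's3', 's1']
print(least_connections({"s1": 7, "s2": 2}))  # s2
```

Real load balancers add health checks and weighting on top, but the core selection logic is this simple.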

Reverse Proxy vs Forward Proxy

  • A reverse proxy sits in front of servers and represents them to clients. It hides the internal topology and can handle TLS termination, caching, compression, and routing.
  • A forward proxy sits in front of clients and represents them to the outside world, often for security, caching, or content filtering.

 

Think of a reverse proxy as the reception desk of a company that hides all the internal rooms, and a forward proxy as a gateway your laptop must pass through to reach the internet.

Knowing the difference helps when you talk about API gateways and corporate proxies.

API Gateway

An API gateway is a special reverse proxy that acts as a single entry point for all API calls in a microservices system. It handles routing to the right service, rate limiting, authentication, logging, and sometimes response shaping.

This reduces complexity on the client side, since clients only talk to a single endpoint.

If you put too much logic in the gateway, it can become a bottleneck or a mini monolith of its own. Good designs keep it focused and thin.

CDN (Content Delivery Network)

A CDN is a network of geographically distributed servers that cache static content like images, videos, and scripts closer to users.

When a user requests content, they are routed to the nearest CDN node, which greatly reduces latency. This also offloads traffic from your origin servers, improving scalability and resilience.

CDNs are essential for global applications and front-end performance.

Think of them as “local copies” of your website’s heavy files sprinkled around the world.

DNS (Domain Name System)

DNS maps human readable domain names to IP addresses.

When you type a website name, your device queries DNS to find the numeric address of the server.

DNS has multiple layers of caching, so responses are fast after the first lookup. It can also be used for simple load balancing by returning different IPs for the same name.

Understanding DNS helps you reason about why name changes take time to propagate and why some outages are caused by misconfigured DNS.

TCP vs UDP

  • TCP is a reliable, connection-oriented protocol. It guarantees ordered, error checked delivery by using acknowledgments and retries.
  • UDP is connectionless and does not guarantee delivery or order, which makes it much faster and lighter.

TCP suits APIs, web pages, and file transfers where accuracy matters.

UDP works well for real time applications like video calls or games where occasional packet loss is acceptable.

Think of TCP as registered mail and UDP as quick postcards.

HTTP/2 and HTTP/3 (QUIC)

  • HTTP/2 introduced multiplexing, which lets multiple requests share a single TCP connection, reducing overhead. It also brought features like header compression and server push.
  • HTTP/3 runs over QUIC, which is built on UDP and improves connection setup time and performance on unreliable networks. These versions mainly aim to reduce latency and better use modern network conditions.

For you as an engineer, the key idea is: fewer connection setups and better use of a single connection.

gRPC vs REST

  • REST typically uses HTTP with JSON and focuses on resources like /users or /orders. It is simple, human-readable, and widely used for public APIs.
  • gRPC uses HTTP/2 and binary encoded messages (protobuf), which are smaller and faster over the wire. It also supports bidirectional streaming and strong typing.

In microservices, gRPC is often preferred for service-to-service calls, while REST is common for external clients.

Use REST when readability and compatibility matter, gRPC when performance and contracts matter.

WebSocket and Server-Sent Events (SSE)

WebSockets create a full-duplex connection where client and server can send messages to each other at any time.

SSE allows the server to push events to the client over a one way channel using HTTP.

WebSockets are great for chats, multiplayer games, and live collaboration.

SSE is simpler and fits cases like live score updates or notifications, where only the server needs to push updates.

Both solve real-time communication problems that plain HTTP cannot handle well.

Long Polling

Long polling is a technique where the client sends a request and the server holds it open until there is new data or a timeout.

When the response comes back, the client immediately opens another request. This simulates real time updates over plain HTTP without special protocols.

It is less efficient than WebSockets but easier to implement and works through most proxies and firewalls.

Think of it as asking “anything new?” and waiting quietly until there is an answer.
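A toy version of that loop, where `fake_server_poll` stands in for a real HTTP endpoint that holds the request open until data arrives or a timeout expires:

```python
import time

def fake_server_poll(timeout_s: float, events: list):
    """Server side: hold the 'request' open until there is new data
    or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if events:
            return events.pop(0)
        time.sleep(0.01)
    return None  # timeout: the client simply reconnects

events = ["score-update"]
received = []
# Client side: as soon as one request returns, open the next one.
for _ in range(2):
    data = fake_server_poll(timeout_s=0.1, events=events)
    if data is not None:
        received.append(data)
print(received)  # ['score-update']
```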

Gossip Protocol

A gossip protocol lets nodes in a distributed system share information by periodically talking to random peers.

Over time, information spreads like gossip in a social group until everyone has roughly the same view. It is used to share membership, health status, or configuration in a fault tolerant way.

The protocol is eventually consistent and does not rely on a central authority. This makes it ideal for large clusters where nodes frequently join and leave.

III. Database and Storage Internals

Sharding (Data Partitioning)

Sharding splits data across multiple machines, each holding a subset of the data. Common strategies include range-based sharding, hash-based sharding, and directory-based sharding.

The main goal is to scale storage and throughput by avoiding a single giant database node.

The tricky part is choosing a shard key that avoids hot spots where one shard gets most of the traffic. Once you shard, moving data between shards (resharding) becomes an important operational challenge.
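A minimal hash-based sharding sketch (the shard count and key format are illustrative). A stable hash spreads keys evenly, and the comment notes why resharding is painful:

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    """Map a key to a shard with a stable hash. Note that changing
    NUM_SHARDS remaps almost every key, which is why resharding
    with plain modulo hashing is so disruptive."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

counts = [0] * NUM_SHARDS
for i in range(1000):
    counts[shard_for(f"user{i}")] += 1
print(counts)  # roughly even, around 250 keys per shard
```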

Replication Patterns (Master Slave, Master Master)

Replication means keeping multiple copies of data on different nodes.

  • In master slave (or primary replica), one node handles writes and replicates changes to others that serve reads.
  • In master master (multi-primary), multiple nodes accept writes and reconcile conflicts.

Replication improves read performance and availability, but makes consistency harder, especially when writes go to multiple nodes.

In interviews, expect to talk about how replication lag affects reads and how failover works when a master dies.

Consistent Hashing

Consistent hashing is a technique to distribute keys across nodes in a way that minimizes data movement when nodes are added or removed.

Keys and nodes are placed on a logical ring, and each key belongs to the next node on the ring.

When a node joins or leaves, only a small portion of keys need to move. This property is very helpful in distributed caches and databases.

Think of it as a smooth mapping that does not get scrambled when the cluster size changes.
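A minimal ring sketch with virtual nodes (node names and counts are illustrative). Adding a fourth node moves only a fraction of the keys, instead of reshuffling nearly all of them as plain modulo hashing would:

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Place `vnodes` virtual points per node on the ring to
        # smooth out the key distribution.
        points = []
        for node in nodes:
            for i in range(vnodes):
                points.append((self._hash(f"{node}#{i}"), node))
        points.sort()
        self._hashes = [h for h, _ in points]
        self._nodes = [n for _, n in points]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # A key belongs to the next point clockwise on the ring.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._nodes[idx]

ring_a = ConsistentHashRing(["cache1", "cache2", "cache3"])
ring_b = ConsistentHashRing(["cache1", "cache2", "cache3", "cache4"])
keys = [f"key{i}" for i in range(1000)]
moved = sum(ring_a.node_for(k) != ring_b.node_for(k) for k in keys)
print(f"{moved} of 1000 keys moved")  # roughly a quarter, not all
```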

Database Indexing (B Trees, LSM Trees)

Indexes speed up queries by organizing data in a way that allows fast lookup.

B Trees are balanced trees that keep data sorted and let you find ranges efficiently, common in relational databases.

LSM Trees batch writes in memory and periodically flush them to disk, which makes writes very fast but reads more complex.

The tradeoff is write heavy vs read heavy workloads.

The key idea is that indexes are a separate structure that must be updated on every write, which is why too many indexes hurt insert performance.

Write Ahead Logging (WAL)

Write Ahead Logging records changes to a log before applying them to the main database.

If a crash happens in the middle of a transaction, the system can replay the log to restore a consistent state. WAL ensures durability and atomicity of transactions. It also allows techniques like replication from the log stream. Let me tell you why it is important.

Without WAL, a crash could leave your data in a half updated, corrupt state.

Normalization vs Denormalization

  • Normalization organizes data into tables that reduce redundancy and dependencies, following rules like first normal form, second normal form, and so on. This avoids anomalies on updates and inserts.
  • Denormalization intentionally duplicates data to speed up reads and reduce joins. In high scale systems, denormalization is common for read heavy paths, such as storing user names along with posts instead of joining every time.

The real skill is knowing where you can safely denormalize without breaking consistency.

Polyglot Persistence

Polyglot persistence means using multiple types of databases within the same system, each chosen for what it does best. You might use a relational database for transactions, a document store for logs, a key value store for caching, and a graph database for relationships.

Instead of forcing everything into one database, you pick the right tool for each job.

The tradeoff is more operational complexity and more knowledge required from the team.

Bloom Filters

A Bloom filter is a space efficient data structure that quickly answers “might this item be in the set?” with possible false positives but no false negatives. It uses multiple hash functions to set bits in a bit array when items are inserted.

To check membership, you test the same bits; if any bit is zero, the item is definitely not present.

Databases and caches use Bloom filters to avoid unnecessary disk lookups or cache misses.

Think of them as fast gatekeepers that say “definitely not” or “maybe.”
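A toy Bloom filter (the bit-array size and the salted-hash scheme are simplified for illustration):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str):
        # Derive k bit positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # Any zero bit means "definitely not"; all ones means "maybe".
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))  # True
print(bf.might_contain("bob"))    # almost certainly False
```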

Vector Databases

Vector databases store and query vectors, which are numeric representations of data such as text, images, or audio. These vectors come from models like embeddings and allow similarity search, such as “find documents most similar to this one.”

Instead of exact equality comparisons, they use distance metrics like cosine similarity or Euclidean distance. This is essential for modern search, recommendation, and AI assistant systems.

In interviews, it is enough to know that vector databases support nearest neighbor search over high-dimensional data.
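The core operation is just a distance metric over vectors. A sketch of cosine similarity with made-up three-dimensional “embeddings” (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b) -> float:
    """1.0 means same direction; values near 0 mean unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.0]
docs = {"doc_a": [0.8, 0.2, 0.0], "doc_b": [0.0, 0.1, 0.9]}
best = max(docs, key=lambda d: cosine_similarity(query, docs[d]))
print(best)  # doc_a — its vector points the same way as the query
```

A real vector database avoids this brute-force scan by using approximate nearest neighbor indexes, but the similarity math is the same.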

IV. Reliability and Fault Tolerance

Rate Limiting

Rate limiting controls how many requests a user, IP, or API key can make in a given time window. It protects your system from abuse, accidental traffic spikes, and runaway loops.

Common strategies include fixed window, sliding window, and token bucket.

Rate limits are often enforced at the API gateway or load balancer.

Think of them as safety brakes that keep shared resources from being overwhelmed.
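A minimal token bucket (capacity and refill rate are illustrative): each request spends a token, and tokens refill at a steady rate, which allows short bursts while capping the sustained rate:

```python
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill based on elapsed time, capped at the bucket capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=3, refill_per_sec=1.0)
results = [bucket.allow() for _ in range(5)]
print(results)  # first 3 allowed; the burst beyond capacity is rejected
```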

Circuit Breaker Pattern

A circuit breaker monitors calls to a remote service and “opens” if there are too many failures.

When open, it immediately fails new requests instead of trying the broken service again.

After a cooldown period, it allows a few trial calls to see if the service has recovered and closes if they succeed. This pattern prevents cascading failures where one slow service drags down the entire system.

Here is the tricky part. Circuit breakers must be tuned carefully so they do not open too aggressively or too late.
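A simplified sketch of the pattern (the thresholds and the single half-open trial call are illustrative choices; production libraries track more state):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_s=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result

cb = CircuitBreaker(failure_threshold=2, cooldown_s=60)
def flaky():
    raise IOError("downstream is down")

for _ in range(2):
    try:
        cb.call(flaky)
    except IOError:
        pass  # real failures still surface to the caller

try:
    cb.call(flaky)
except RuntimeError as err:
    print(err)  # circuit open: failing fast
```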

Bulkhead Pattern

The bulkhead pattern isolates parts of a system so a failure in one area does not sink everything. This can mean separate connection pools, thread pools, or even entire service clusters for different features.

If one bulkhead is flooded with traffic, others keep working.

The name comes from ship bulkheads that contain flooding in one compartment.

In design discussions, using bulkheads shows you are thinking about fault isolation and blast radius.

Retry Patterns and Exponential Backoff

Retries help recover from transient errors like network timeouts or temporary overload.

Exponential backoff means each retry waits longer than the previous one, such as 1 second, 2 seconds, 4 seconds, and so on. This prevents your client from hammering a service that is already struggling.

Good retry policies also use jitter (small randomness) to avoid thundering herds.

Let me break it down.

Retries without backoff can make outages worse instead of helping.
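A sketch of exponential backoff with “full jitter”, where each delay is drawn randomly up to the exponential cap (the base and cap values are illustrative):

```python
import random

def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0):
    """Each attempt waits a random amount up to min(cap, base * 2^attempt).
    The randomness (jitter) spreads out clients that failed together."""
    return [random.uniform(0, min(cap, base * 2 ** attempt))
            for attempt in range(retries)]

delays = backoff_delays(5)
print([round(d, 2) for d in delays])  # e.g. [0.4, 1.7, 2.9, 6.1, 12.8]
```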

Idempotency

An operation is idempotent if performing it multiple times has the same effect as performing it once.

For example, “set user status to active” is idempotent, while “increment account balance by 10” is not.

Idempotency is critical when systems use retries, because the same request may be sent more than once.

APIs often require idempotency keys on operations like payments to avoid double charging.

In interviews, always mention idempotency when you talk about at least once delivery or retries.
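A toy sketch of an idempotency key in action: the dict stands in for a durable key store, and `do_charge` is a made-up side effect:

```python
balance = {"total": 0}
processed = {}  # idempotency key -> result; a durable store in practice

def do_charge(amount: int) -> int:
    """The real side effect (made up for illustration)."""
    balance["total"] += amount
    return balance["total"]

def charge(idempotency_key: str, amount: int) -> int:
    """Replays with the same key return the stored result
    instead of charging twice."""
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = do_charge(amount)
    processed[idempotency_key] = result
    return result

charge("req-123", 10)
charge("req-123", 10)  # a retry of the same request
print(balance["total"])  # 10, not 20
```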

Heartbeat

A heartbeat is a periodic signal sent by a service or node to indicate that it is alive and healthy.

Monitoring systems or coordinators listen for heartbeats.

If they stop receiving them, they mark the node as down and trigger failover or scaling actions.

Heartbeats are simple but powerful tools for liveness detection. Think of them as the system’s “pulse checks.”

 

Leader Election (Paxos, Raft)

Leader election is the process of choosing a single node to act as a coordinator among many.

Algorithms like Paxos and Raft ensure that only one leader is chosen and that all nodes eventually agree on who that leader is.

The leader handles tasks like assigning work, managing metadata, or ordering writes. If the leader fails, a new one is elected automatically.

You do not need to memorize the math for interviews, but you should know that consensus algorithms power many critical systems like metadata stores and distributed logs.

Distributed Transactions (SAGA Pattern)

A distributed transaction spans multiple services or databases.

The SAGA pattern models such a transaction as a sequence of local steps with compensating actions for rollbacks.

Instead of locking everything like a single ACID transaction, each service performs its part and publishes an event. If something fails, compensating steps attempt to undo previous changes. This fits naturally with microservices and eventual consistency.

The tradeoff is more complex logic and the possibility of partial failures that must be handled gracefully.

Two Phase Commit (2PC)

Two Phase Commit is a protocol that tries to provide atomic transactions across multiple nodes.

  • In the first phase, the coordinator asks all participants if they can commit.
  • In the second phase, if everyone agrees, it tells them to commit; otherwise, it tells them to roll back.

2PC provides strong guarantees but can block if the coordinator fails, and it is expensive at scale due to locking.

In modern cloud systems, 2PC is often avoided for high throughput paths and replaced by patterns like SAGA.

V. Caching and Messaging

Caching

Caching stores frequently accessed data in a fast storage layer, usually memory, to reduce latency and backend load.

Common cache layers include in process caches, external key value stores, and CDNs. Caching is especially effective for read heavy workloads and expensive computations.

Here is the tricky part. Stale data and invalidation make caching harder than it first appears.

As the saying goes, cache invalidation is one of the hard problems in computer science.

Caching Strategies (Cache Aside, Write Through, etc.)

  • Cache aside means the application reads from the cache, and on a miss, loads from the database and writes to the cache.
  • Write through writes to the cache and database at the same time, ensuring cache and source are always in sync.
  • Write back writes to the cache first and flushes to the database later, which is fast but risky if the cache fails.

Each strategy balances freshness, complexity, and performance differently.

Interviewers love when you mention which strategy you would pick for a given scenario.
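A minimal cache-aside sketch, with in-memory dicts standing in for the cache and the database:

```python
cache = {}
database = {"user:1": {"name": "Ada"}}
db_reads = {"count": 0}

def get_user(key: str):
    """Cache aside: try the cache, fall back to the database on a
    miss, then populate the cache for the next reader."""
    if key in cache:
        return cache[key]
    db_reads["count"] += 1       # cache miss: hit the database
    value = database.get(key)
    cache[key] = value
    return value

get_user("user:1")
get_user("user:1")
print(db_reads["count"])  # 1 — the second read came from the cache
```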

Cache Eviction Policies (LRU, LFU)

Cache eviction policies decide which items to remove when the cache is full.

  • LRU (Least Recently Used) evicts items that have not been accessed recently, assuming recent items are more likely to be used again.
  • LFU (Least Frequently Used) evicts items that are rarely accessed, focusing on long term popularity.

Some systems use random, FIFO, or advanced algorithms.

The key idea is that cache space is limited, so you want to keep the most valuable items in memory.
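An LRU cache is a classic interview exercise. A minimal version built on Python’s `OrderedDict`:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # touch "a" so "b" becomes least recent
cache.put("c", 3)      # evicts "b"
print(cache.get("b"))  # None
print(cache.get("a"))  # 1
```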

Message Queues (Point to Point)

A message queue allows one component to send messages to another without needing both to be online at the same time.

In a point to point model, messages in a queue are consumed by one receiver and then removed. This decouples sender and receiver so they can scale and fail independently.

Queues are great for background jobs, email sending, and processing heavy tasks asynchronously.

Think of them as a todo list shared between services.

Pub Sub (Publish Subscribe)

In pub sub, publishers send messages to topics, not directly to consumers.

Subscribers listen to topics they care about and receive copies of relevant messages. This enables broadcast style communication and loose coupling between producers and consumers.

Multiple services can react to the same event in different ways, such as logging, analytics, and notifications.

In interviews, pub sub often appears in event driven designs like activity feeds or event sourcing.

Dead Letter Queues

A dead letter queue stores messages that could not be processed successfully after several attempts.

Instead of retrying forever and blocking the main queue, these messages are moved aside.

Engineers can inspect the dead letter queue to debug issues, fix data, or replay messages later. This pattern improves resiliency and keeps your system from getting stuck on “poison messages.”

Think of it as a holding area for problematic jobs.

VI. Observability and Security

Distributed Tracing

Distributed tracing tracks a single request as it flows through multiple services. Each service adds a trace ID and span information so you can reconstruct the full path of a request. This is extremely helpful when debugging slow responses or failures in microservice architectures.

Without tracing, you just see errors in isolation. With it, you see the whole story across services, queues, and databases.

SLA vs SLO vs SLI

An SLA (Service Level Agreement) is an external promise to customers, such as “99.9 percent uptime per month.”

An SLO (Service Level Objective) is an internal target that engineers aim to meet, usually stricter than the SLA. An SLI (Service Level Indicator) is the actual measured metric, like real uptime or request success rate.

Think of SLA as the contract, SLO as the goal, and SLI as the scoreboard.

In interviews, using these terms correctly shows maturity in thinking about reliability.

OAuth 2.0 and OIDC

OAuth 2.0 is a framework for delegated authorization. It lets users grant an application limited access to their resources without sharing passwords.

OIDC (OpenID Connect) builds on OAuth 2.0 to add authentication, letting clients verify who the user is and get user identity information. This is the basis of many “Login with X” flows.

The key idea is that an authorization server issues tokens that clients and APIs can trust.

TLS/SSL Handshake

TLS/SSL secures communication between client and server by encrypting data in transit.

During the handshake, the client and server agree on encryption algorithms, exchange keys securely, and verify certificates.

Once the handshake completes, all subsequent data is encrypted and safe from eavesdropping. This is what puts the little lock icon in your browser.

Without TLS, anyone on the network could read or modify sensitive traffic.

Zero Trust Security

Zero Trust is a security model that says: “Never trust, always verify.” It assumes that threats can exist both outside and inside the network.

Every request must be authenticated, authorized, and encrypted, even if it comes from within your data center or VPC. Access is granted based on identity, device posture, and context, not just on being “inside the firewall.”

In modern architectures, Zero Trust is becoming the default approach to secure system design.

Key Takeaways

  • System design is mostly about understanding trade-offs: consistency vs. availability, latency vs. throughput, simplicity vs. flexibility.
  • Scaling is not just “add more servers.” You must think about load balancing, sharding, replication, and bottlenecks.
  • Reliability patterns like rate limiting, circuit breakers, retries, and bulkheads exist because failures are normal in distributed systems.
  • Caching, queues, and pub-sub are your best friends for performance and decoupling, but they introduce new challenges around consistency and ordering.
  • Observability and security concepts such as tracing, SLIs, OAuth, TLS, and Zero Trust are essential for systems that are not just fast but also safe and debuggable.

Source: https://designgurus.substack.com/p/50-system-design-concepts-for-beginners

 

50 Core System Design Concepts

Executive Summary

This document synthesizes 50 fundamental concepts in system design, drawing from a comprehensive guide on the subject. The core insight is that effective system design is an exercise in managing trade-offs, particularly between consistency and availability, latency and throughput, and simplicity versus flexibility. Successful scaling extends beyond merely adding servers; it necessitates a deep understanding of load balancing, data sharding, replication, and bottleneck identification.

Reliability in distributed systems is not an accident but a deliberate architectural choice, achieved through patterns like rate limiting, circuit breakers, retries, and bulkheads, which are designed to handle expected failures gracefully. Performance and decoupling are significantly enhanced by tools such as caching, message queues, and publish-subscribe models, though these introduce their own complexities regarding data consistency and message ordering. Finally, modern systems must be built with observability and security as primary concerns, incorporating distributed tracing, service level indicators (SLIs), robust authentication (OAuth/OIDC), data-in-transit encryption (TLS), and a Zero Trust security posture to ensure they are not only performant but also safe, secure, and debuggable.

I. Core Architecture Principles

This section outlines the foundational principles and architectural choices that govern how systems are structured, scaled, and managed.

Vertical vs. Horizontal Scaling

  • Vertical Scaling: Involves upgrading a single machine by adding more CPU, RAM, or faster storage. It is simpler to implement but is constrained by hardware limits and becomes progressively more expensive. The analogy provided is a single superhero getting stronger.
  • Horizontal Scaling: Involves adding more machines and distributing the workload across them. While more complex, requiring load balancing, stateless services, and shared storage, it offers greater scalability. The analogy is building a team of superheroes.

CAP Theorem

  • The CAP Theorem states that in a distributed system experiencing a network partition, it is impossible to simultaneously guarantee both Consistency and Availability.
  • Consistency: Every user sees the same data at the same time.
  • Availability: The system always provides a response, even if the data may be temporarily out of date.
  • A system must choose which of these two guarantees to sacrifice during a network failure.

PACELC Theorem

  • PACELC is an extension of the CAP theorem. It posits that: if there is a Partition, a system must choose between Availability and Consistency; Else (in normal operation), it must choose between Latency and Consistency.
  • This theorem clarifies that even without network failures, systems face a trade-off between fast, eventually consistent reads (lower latency) and slower, strongly consistent reads (higher consistency).

ACID vs. BASE

  • ACID (Atomicity, Consistency, Isolation, Durability): A set of properties for strict, reliable database transactions. It is essential for systems where data integrity is paramount, such as financial or inventory management systems.
  • BASE (Basically Available, Soft state, Eventual consistency): An alternative model for large-scale distributed systems that prioritize high availability and rapid response times. BASE systems may exhibit temporary inconsistencies that resolve over time.
  • Many modern architectures employ a hybrid approach, using ACID for critical transactional flows and BASE for less critical functions like activity feeds or analytics.

Throughput vs. Latency

  • Throughput: The number of requests a system can process per unit of time (e.g., requests per second).
  • Latency: The time taken to process a single request from start to finish.
  • These two metrics are often in opposition; increasing throughput by processing more work in parallel can lead to queue buildup and increased latency for individual requests. Effective system design seeks to balance both for an optimal user experience.

Amdahl’s Law

  • This law states that the potential performance improvement from parallelization is limited by the portion of the system that must remain sequential.
  • If a part of a process is inherently non-parallelizable (e.g., a final step that must hit a single master database), that part will become the ultimate bottleneck, capping overall performance regardless of how many more resources are added.
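To make the math concrete, here is a minimal sketch (with hypothetical workload numbers) that computes the speedup bound directly:

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Maximum speedup when `parallel_fraction` of the work parallelizes
    perfectly across `workers` and the rest stays sequential."""
    sequential = 1.0 - parallel_fraction
    return 1.0 / (sequential + parallel_fraction / workers)

# Even with thousands of workers, a 10% sequential portion caps speedup near 10x.
print(round(amdahl_speedup(0.90, 1000), 2))    # 9.91
print(round(amdahl_speedup(0.90, 10_000), 2))  # 9.99
```

The takeaway: past a point, adding machines buys almost nothing; shrinking the sequential portion is what moves the ceiling.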

Strong vs. Eventual Consistency

  • Strong Consistency: Guarantees that all users see the same data immediately following a write operation. It is simpler to reason about but can be slower and less available during failures.
  • Eventual Consistency: Allows for a brief period where different nodes in a distributed system may have different versions of the data. Updates propagate through the system over time. This model is suited for large-scale applications where immediate consistency is not critical, such as social media timelines.

Stateful vs. Stateless Architecture

  • Stateful Service: Remembers user-specific context or session data between requests, often storing it locally. This can simplify application logic but complicates scaling, load balancing, and failover.
  • Stateless Service: Treats every request as new and self-contained, relying on external storage (e.g., databases, caches) for any required state. Stateless services are easier to scale horizontally, as any server instance can handle any request.

Microservices vs. Monoliths

  • Monolith: A single, unified application where all features are contained within one deployable unit. Monoliths are simpler to develop and deploy initially.
  • Microservices: An architectural style that splits application features into small, independent services that communicate over a network. This approach allows teams to work independently and scale different components separately but introduces complexity in communication, debugging, and data management.
  • A common evolutionary path is to start with a monolith and gradually break it apart into microservices as the system grows and its pain points become clear.

Serverless Architecture

  • Also known as “Functions as a Service” (FaaS), serverless architecture allows developers to run small, event-driven functions in the cloud without managing the underlying server infrastructure.
  • Advantages: Pay-per-use pricing and automatic scaling handled by the cloud provider. Ideal for workloads with spiky traffic like webhooks, background jobs, or simple APIs.
  • Trade-offs: Can involve “cold starts” (initial latency), less control over long-running tasks, and potentially higher costs at sustained high volumes.

II. Networking and Communication

This section covers the protocols, patterns, and components used to manage traffic and facilitate communication between different parts of a system.


Load Balancing

  • Function: Distributes incoming network traffic across multiple servers to prevent any single server from becoming a bottleneck.
  • Benefits: Improves both system performance and reliability, as the failure of one server does not bring down the entire application.
  • Implementation: Can be a hardware appliance or a software service. Load balancers typically use health checks to avoid sending traffic to unresponsive servers.

Load Balancing Algorithms

  • Round Robin: Distributes requests to servers sequentially in a circular order. Simple but does not account for server load or request complexity.
  • Least Connections: Sends new requests to the server with the fewest active connections. This is effective when requests have varying completion times.
  • IP Hash: Uses a hash of the client’s IP address to determine which server receives the request. This provides a basic form of “session stickiness,” ensuring a user is consistently routed to the same server.
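A toy sketch of the three algorithms above (server names and the IP-hash function are hypothetical, for illustration only):

```python
from itertools import cycle

servers = ["app-1", "app-2", "app-3"]  # hypothetical backend names

# Round Robin: hand out servers in a fixed circular order.
rr = cycle(servers)
round_robin_picks = [next(rr) for _ in range(5)]
print(round_robin_picks)  # ['app-1', 'app-2', 'app-3', 'app-1', 'app-2']

# Least Connections: pick the server with the fewest active connections.
active = {"app-1": 12, "app-2": 3, "app-3": 7}
print(min(active, key=active.get))  # 'app-2'

# IP Hash: a stable hash of the client IP keeps a user on the same server.
def pick_by_ip(ip: str) -> str:
    return servers[sum(ip.encode()) % len(servers)]  # toy hash, not production-grade

assert pick_by_ip("203.0.113.9") == pick_by_ip("203.0.113.9")  # session stickiness
```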

Reverse Proxy vs. Forward Proxy

  • Reverse Proxy: Sits in front of a group of servers, intercepting client requests and forwarding them to the appropriate backend server. It can handle tasks like TLS termination, caching, compression, and routing, while hiding the internal network topology.
  • Forward Proxy: Sits in front of clients, forwarding their requests to the internet. It is often used for security, content filtering, or caching within a corporate or private network.

API Gateway

  • An API Gateway is a specialized reverse proxy that serves as the single entry point for all API calls in a microservices architecture.
  • Responsibilities: Handles routing, rate limiting, authentication, logging, and response transformation.
  • Benefit: Simplifies the client-side by providing a single, unified endpoint.
  • Risk: Can become a bottleneck or a “mini monolith” if too much business logic is embedded within it.

CDN (Content Delivery Network)

  • A CDN is a geographically distributed network of proxy servers that cache static assets (images, videos, CSS, JavaScript) close to end-users.
  • Function: When a user requests content, the request is routed to the nearest CDN node, dramatically reducing latency.
  • Benefits: Offloads traffic from origin servers, improves front-end performance, and increases application scalability and resilience.

DNS (Domain Name System)

  • DNS is the system that translates human-readable domain names (e.g., www.example.com) into machine-readable IP addresses (e.g., 192.0.2.1).
  • It operates with multiple layers of caching for fast lookups and can be used for basic load balancing by returning different IP addresses for the same domain name.

TCP vs. UDP

  • TCP (Transmission Control Protocol): A connection-oriented protocol that guarantees reliable, ordered, and error-checked delivery of data. It is suitable for applications where data integrity is critical, such as web browsing, file transfers, and APIs.
  • UDP (User Datagram Protocol): A connectionless protocol that is faster and has less overhead than TCP but does not guarantee delivery or order. It is well-suited for real-time applications like video streaming and online gaming, where speed is more important than perfect reliability.

HTTP/2 and HTTP/3 (QUIC)

  • HTTP/2: Improved upon HTTP/1.1 by introducing request multiplexing over a single TCP connection, header compression, and server push, all aimed at reducing latency.
  • HTTP/3: Further enhances performance by running over QUIC (a transport protocol built on UDP), which reduces connection setup time and performs better on unreliable networks with packet loss.

gRPC vs. REST

  • REST: An architectural style that typically uses HTTP and JSON. It is resource-oriented, human-readable, and widely adopted for public-facing APIs.
  • gRPC: A high-performance RPC framework that uses HTTP/2 for transport and Protocol Buffers (protobuf) for binary serialization. It is smaller and faster than REST/JSON and supports features like bidirectional streaming, making it a popular choice for internal service-to-service communication in microservices architectures.

WebSocket and Server-Sent Events (SSE)

  • WebSockets: Provide a persistent, full-duplex (two-way) communication channel between a client and a server over a single TCP connection. Ideal for real-time interactive applications like chat, collaborative editing, and multiplayer games.
  • SSE: A simpler protocol that allows a server to push updates to a client over a one-way channel using standard HTTP. It is suitable for use cases where only the server needs to send data, such as live news feeds or stock tickers.

Long Polling

  • A technique that simulates server-push functionality over standard HTTP. The client sends a request to the server, which holds the connection open until it has new data to send or a timeout occurs. Upon receiving a response, the client immediately initiates a new request.
  • It is less efficient than WebSockets but is easier to implement and compatible with older proxies and firewalls.

Gossip Protocol

  • A decentralized communication protocol where nodes in a distributed system share information by periodically exchanging data with random peers.
  • Information propagates through the network “like gossip,” ensuring that all nodes eventually converge on a consistent view without a central coordinator. It is highly fault-tolerant and used for service discovery, health monitoring, and state dissemination in large clusters.

III. Database and Storage Internals

This section details the techniques and technologies used to manage data at scale, focusing on partitioning, replication, indexing, and transactional integrity.


Sharding (Data Partitioning)

  • Definition: The process of splitting a large database into smaller, more manageable pieces called shards, with each shard residing on a separate machine.
  • Goal: To scale database storage capacity and throughput horizontally.
  • Strategies: Include range-based, hash-based, and directory-based sharding.
  • Challenge: Choosing an effective shard key is crucial to avoid “hot spots,” where one shard receives a disproportionate amount of traffic.

Replication Patterns

  • Definition: The practice of keeping multiple copies of data on different nodes to improve availability and read performance.
  • Master-Slave (Primary-Replica): One node (the master) handles all write operations, which are then replicated to one or more slave nodes that can serve read requests.
  • Master-Master (Multi-Primary): Multiple nodes can accept write operations, and they synchronize data with each other. This increases write availability but introduces complexity in resolving write conflicts.

Consistent Hashing

  • A hashing technique designed to minimize data re-shuffling when nodes are added to or removed from a distributed system (like a cache or database).
  • Both keys and nodes are mapped to a logical ring. A key is assigned to the first node encountered moving clockwise on the ring. This ensures that when a node is added or removed, only a small, adjacent set of keys needs to be remapped.
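A minimal sketch of the clockwise ring lookup described above, assuming MD5 as the hash function and omitting the virtual nodes that production systems typically add for smoother balance:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal hash ring (no virtual nodes) showing the clockwise lookup."""
    def __init__(self, nodes):
        self.ring = sorted((_hash(n), n) for n in nodes)
        self.keys = [h for h, _ in self.ring]

    def node_for(self, key: str) -> str:
        # First node at or after the key's position, wrapping around the ring.
        i = bisect.bisect(self.keys, _hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
before = {k: ring.node_for(k) for k in ("user:1", "user:2", "user:3", "user:4")}

# Adding a node remaps only the keys that now fall before it on the ring.
ring2 = ConsistentHashRing(["cache-a", "cache-b", "cache-c", "cache-d"])
after = {k: ring2.node_for(k) for k in before}
moved = [k for k in before if before[k] != after[k]]
print(f"{len(moved)} of {len(before)} keys moved")
```

With naive modulo hashing (`hash(key) % num_nodes`), adding a node would remap nearly every key; here only the keys adjacent to the new node move.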

Database Indexing

  • Purpose: Indexes are data structures that improve the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space.
  • B-Trees: Balanced tree structures common in relational databases. They keep data sorted and are efficient for both point lookups and range queries.
  • LSM (Log-Structured Merge) Trees: Optimize for high write throughput by batching writes in memory and periodically flushing them to sorted files on disk. Reads can be more complex as they may need to check multiple files.

Write-Ahead Logging (WAL)

  • A standard method for ensuring data durability and atomicity. Before any changes are applied to the database itself, they are first recorded in a sequential log file on durable storage.
  • In the event of a system crash, the database can replay the log to recover to a consistent state, preventing data corruption from partially completed transactions.

Normalization vs. Denormalization

  • Normalization: The process of organizing data in a relational database to minimize redundancy and improve data integrity by dividing larger tables into smaller, well-structured ones.
  • Denormalization: The intentional introduction of redundancy by duplicating data across multiple tables. This is often done in high-scale systems to optimize read performance by avoiding expensive join operations.

Polyglot Persistence

  • The practice of using multiple different database technologies within a single application, choosing the best tool for each specific job.
  • An application might use a relational database for transactional data, a document store for unstructured content, a key-value store for caching, and a graph database for relationship-heavy data. This adds operational complexity but allows for optimized performance and functionality.

Bloom Filters

  • A probabilistic, space-efficient data structure used to test whether an element is a member of a set.
  • It can produce false positives (it might incorrectly say an element is in the set) but never false negatives (if it says an element is not in the set, it is definitively not).
  • They are used to avoid expensive lookups for items that are likely not present, such as checking a cache before querying a database.
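A toy Bloom filter illustrating the no-false-negatives guarantee (the bit-array size, hash count, and key names are arbitrary choices for illustration):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions over an m-bit array."""
    def __init__(self, m: int = 1024, k: int = 3):
        self.m, self.k = m, k
        self.bits = 0  # a Python int doubles as an arbitrary-size bit array

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True — never a false negative
print(bf.might_contain("user:999"))  # almost certainly False (false positives possible)
```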

Vector Databases

  • Specialized databases designed to store, manage, and query high-dimensional vector embeddings, which are numerical representations of data like text or images.
  • They excel at similarity searches using distance metrics (e.g., cosine similarity), enabling applications like semantic search, recommendation engines, and other AI-powered features.

IV. Reliability and Fault Tolerance

This section explores patterns and strategies for building resilient systems that can withstand and recover from failures.


Rate Limiting

  • Function: Controls the frequency of requests a user or client can make to an API or service within a specific time window.
  • Purpose: Protects backend services from abuse, accidental overload, and denial-of-service attacks.
  • Strategies: Common algorithms include fixed window, sliding window, and token bucket.
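As one illustration, here is a minimal token bucket (one of the algorithms listed above), with an injectable clock so the demo is deterministic; real code would use `time.monotonic` directly:

```python
import time

class TokenBucket:
    """Token bucket: refills at `rate` tokens/sec up to `capacity`."""
    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = capacity
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

t = [0.0]  # fake clock for a repeatable demonstration
bucket = TokenBucket(rate=5, capacity=3, clock=lambda: t[0])
print([bucket.allow() for _ in range(5)])  # [True, True, True, False, False]
t[0] += 1.0  # one simulated second passes: tokens refill
print(bucket.allow())  # True
```

The bucket permits a burst of 3 requests, then sustains 5 requests per second, which is exactly the burst-tolerant behaviour that makes token buckets popular for API throttling.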

Circuit Breaker Pattern

  • A pattern that prevents an application from repeatedly trying to execute an operation that is likely to fail.
  • Mechanism: A circuit breaker monitors calls to a downstream service. If the number of failures exceeds a threshold, the breaker “opens,” and subsequent calls fail immediately without attempting to contact the service. After a timeout, the breaker enters a “half-open” state to test if the service has recovered.
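A simplified sketch of that state machine (real breakers usually require a successful trial call before fully closing again; the fake clock here is just for a repeatable demo):

```python
import time

class CircuitBreaker:
    """Sketch of the closed / open / half-open state machine."""
    def __init__(self, threshold=3, reset_after=30.0, clock=time.monotonic):
        self.threshold, self.reset_after, self.clock = threshold, reset_after, clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0
        return result

t = [0.0]
cb = CircuitBreaker(threshold=2, reset_after=30, clock=lambda: t[0])

def flaky():
    raise IOError("downstream down")

for _ in range(2):            # two failures trip the breaker
    try:
        cb.call(flaky)
    except IOError:
        pass
try:
    cb.call(flaky)            # breaker is open: no call is attempted
except RuntimeError as e:
    print(e)                  # circuit open: failing fast
t[0] = 31.0                   # after the timeout, a half-open trial is allowed
print(cb.call(lambda: "ok"))  # ok
```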

Bulkhead Pattern

  • An application design pattern that isolates system elements into pools so that if one fails, the others can continue to function.
  • Named after the partitioned sections of a ship’s hull, this pattern can be implemented by using separate thread pools or connection pools for different services, preventing a failure in one area from cascading and taking down the entire system.

Retry Patterns and Exponential Backoff

  • Retries: A mechanism for handling transient failures by automatically re-attempting a failed operation.
  • Exponential Backoff: A crucial enhancement to retries where the delay between attempts increases exponentially (e.g., 1s, 2s, 4s). This prevents a client from overwhelming a struggling service with rapid-fire retries. Adding “jitter” (a small random delay) is also recommended to avoid synchronized retry storms.
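A minimal sketch of the "full jitter" variant, where each retry sleeps a random amount between zero and the exponentially growing cap:

```python
import random

def backoff_delays(base=1.0, cap=30.0, attempts=5, rng=random.random):
    """Full-jitter exponential backoff: each retry sleeps a random
    duration in [0, min(cap, base * 2**attempt)] seconds."""
    return [min(cap, base * 2 ** n) * rng() for n in range(attempts)]

random.seed(7)  # seeded only so the illustration is repeatable
for attempt, delay in enumerate(backoff_delays()):
    print(f"retry {attempt}: sleep {delay:.2f}s")
```

Because every client draws a different random delay, a crashed dependency coming back online sees a trickle of retries instead of a synchronized stampede.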

Idempotency

  • An operation is idempotent if it can be performed multiple times with the same result as performing it once. For example, setting a value is idempotent, while incrementing a counter is not.
  • Idempotency is critical in distributed systems where network failures can lead to retries, ensuring that a re-sent request does not cause unintended side effects like duplicate transactions.
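A sketch of the common idempotency-key technique (the handler, key, and in-memory store are hypothetical; real systems persist the key in a database):

```python
# Hypothetical payment handler made idempotent with a client-supplied key.
processed = {}  # idempotency key -> result (a durable store in practice)

def charge(idempotency_key: str, amount_cents: int) -> str:
    if idempotency_key in processed:       # retry of a request we already handled
        return processed[idempotency_key]
    receipt = f"charged {amount_cents} cents"  # side effect happens exactly once
    processed[idempotency_key] = receipt
    return receipt

first = charge("key-123", 500)
retry = charge("key-123", 500)  # network retry: no duplicate charge
print(first == retry, len(processed))  # True 1
```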

Heartbeat

  • A periodic signal sent from a node or service to a monitoring system to indicate it is alive and functioning correctly.
  • If the monitoring system stops receiving heartbeats from a node, it can assume the node has failed and trigger a failover process.

Leader Election

  • The process in a distributed system by which a single node is chosen to assume a special role, such as a coordinator or primary for writes.
  • Consensus algorithms like Paxos and Raft provide fault-tolerant mechanisms to ensure that all nodes agree on a single leader and can elect a new one if the current leader fails.

Distributed Transactions (SAGA Pattern)

  • The SAGA pattern is a way to manage data consistency across multiple microservices without using traditional two-phase commit locks.
  • A transaction is structured as a sequence of local transactions, each with a corresponding compensating action. If any step fails, the compensating actions are executed in reverse order to undo the preceding steps, thus maintaining overall consistency.

Two-Phase Commit (2PC)

  • A protocol used to achieve atomic transactions across multiple distributed nodes.
  • Phase 1 (Prepare): A coordinator asks all participating nodes if they are ready to commit.
  • Phase 2 (Commit/Abort): If all participants vote “yes,” the coordinator instructs them to commit. If any vote “no” or fail to respond, the coordinator instructs all to roll back.
  • 2PC provides strong consistency but is prone to blocking if the coordinator fails and can be a performance bottleneck.
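The two phases can be sketched as follows (an in-memory toy that ignores coordinator failure and timeouts, which are exactly where real 2PC gets hard):

```python
class Participant:
    """Toy participant that votes in the prepare phase and applies on commit."""
    def __init__(self, name: str, healthy: bool = True):
        self.name, self.healthy = name, healthy
        self.committed = False

    def prepare(self) -> bool:
        return self.healthy  # vote "yes" only if we are able to commit

    def commit(self):
        self.committed = True

    def rollback(self):
        self.committed = False

def two_phase_commit(participants) -> bool:
    # Phase 1 (Prepare): every participant must vote yes.
    if all(p.prepare() for p in participants):
        for p in participants:   # Phase 2: commit everywhere
            p.commit()
        return True
    for p in participants:       # Phase 2: abort everywhere
        p.rollback()
    return False

nodes = [Participant("orders-db"), Participant("inventory-db")]
print(two_phase_commit(nodes))  # True: both voted yes

nodes.append(Participant("ledger-db", healthy=False))
print(two_phase_commit(nodes))  # False: one "no" vote aborts all
```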

V. Caching and Messaging

This section describes key technologies for improving performance and decoupling system components through in-memory data storage and asynchronous communication.

Caching

  • Definition: Storing copies of frequently accessed data in a fast, temporary storage layer (typically memory) to serve future requests more quickly.
  • Benefits: Reduces latency for end-users and decreases the load on backend systems like databases.
  • Challenge: The primary difficulty with caching is “cache invalidation” — ensuring that stale data is removed or updated when the source data changes.

Caching Strategies

  • Cache-Aside: The application is responsible for managing the cache. It first checks the cache; on a miss, it reads data from the database, then writes that data into the cache for future requests.
  • Write-Through: The application writes data to the cache and the database simultaneously. This ensures the cache is always consistent with the database but adds latency to write operations.
  • Write-Back: The application writes data only to the cache, which acknowledges the write immediately. The data is then flushed to the database asynchronously at a later time. This offers very low write latency but risks data loss if the cache fails before the data is persisted.

Cache Eviction Policies

  • LRU (Least Recently Used): When the cache is full, the item that has been accessed least recently is removed.
  • LFU (Least Frequently Used): When the cache is full, the item that has been accessed the fewest times is removed.
  • Other policies include FIFO (First-In, First-Out) and random replacement. The choice of policy depends on the application’s access patterns.
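LRU in particular is straightforward to sketch with an ordered dict:

```python
from collections import OrderedDict

class LRUCache:
    """LRU eviction: most recently used entries move to the end of the dict."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used

lru = LRUCache(2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")         # touch "a", so "b" is now least recently used
lru.put("c", 3)      # cache is full: evicts "b"
print(lru.get("b"))  # None
print(lru.get("a"))  # 1
```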

Message Queues (Point-to-Point)

  • A message queue enables asynchronous communication between services. A “producer” sends a message to a queue, and a “consumer” retrieves it for processing at a later time.
  • Each message is typically processed by only one consumer. This pattern decouples the sender and receiver, allowing them to operate and scale independently. It is commonly used for background jobs.

Pub/Sub (Publish-Subscribe)

  • A messaging pattern where “publishers” send messages to a “topic” without knowledge of the “subscribers.” Any number of subscribers can listen to a topic and receive a copy of every message sent to it.
  • This enables one-to-many, broadcast-style communication and is central to event-driven architectures.

Dead Letter Queues (DLQ)

  • A secondary queue used to store messages that could not be processed successfully after a certain number of retries.
  • Moving “poison messages” to a DLQ prevents them from blocking the main processing queue. Engineers can later inspect the DLQ to diagnose and resolve the underlying issues.

VI. Observability and Security

This section covers essential concepts for monitoring system health, understanding behavior, and implementing robust security measures.


Distributed Tracing

  • A method for monitoring and profiling applications, especially those built using a microservices architecture.
  • It tracks a single request as it travels through multiple services, assigning a unique trace ID that allows developers to visualize the entire request path, identify bottlenecks, and debug cross-service issues.

SLA vs. SLO vs. SLI

  • SLA (Service Level Agreement): A formal contract with a customer that defines the level of service they can expect, often with financial penalties for non-compliance (e.g., “99.9% uptime”).
  • SLO (Service Level Objective): An internal target for system reliability that is stricter than the SLA. This is the goal that engineering teams strive to meet.
  • SLI (Service Level Indicator): The actual, quantitative metric used to measure compliance with an SLO (e.g., the success rate of HTTP requests). The SLI is the “scoreboard” that measures performance.

OAuth 2.0 and OIDC

  • OAuth 2.0: An authorization framework that allows a user to grant a third-party application limited access to their resources on another service without sharing their credentials.
  • OIDC (OpenID Connect): A thin layer built on top of OAuth 2.0 that adds an authentication component. It allows an application to verify a user’s identity and obtain basic profile information. Together, they form the foundation of modern “Login with…” features.

TLS/SSL Handshake

  • TLS (Transport Layer Security)/SSL (Secure Sockets Layer): Cryptographic protocols that provide secure communication over a computer network.
  • The handshake is the initial process where the client and server establish a secure connection. During the handshake, they agree on an encryption cipher, exchange cryptographic keys, and authenticate the server via its digital certificate.

Zero Trust Security

  • A security model based on the principle of “never trust, always verify.” It assumes that threats can originate from anywhere, both inside and outside the network perimeter.
  • In a Zero Trust architecture, every request must be authenticated, authorized, and encrypted, regardless of its origin. Access is granted based on user identity and device posture, not on network location.

You can think of system design like running a professional restaurant. Vertical scaling is buying a bigger stove, while horizontal scaling is hiring a whole team of chefs. Load balancing is the host at the front door assigning customers to different tables so no waiter is overwhelmed. A CDN is like having pre-made snacks available at local convenience stores so people don’t have to travel to your main kitchen for everything. Finally, Circuit Breakers are like a safety fuse in the kitchen: if one appliance starts smoking, it cuts the power immediately to that section so the whole restaurant doesn’t burn down.

Source: https://medium.com/@MaheshwariRishabh/50-core-system-design-concepts-6828ed73c2e8

Top 50 System Design

Performance in system design interviews is a critical signal of whether a candidate can design scalable, efficient systems, and knowing the major terminology goes a long way. Below are the top 50 must-know system design interview terms, each with a definition, a working example, and a resource for further learning.

1. Scalability

  • Definition: It is the ability of a system to support increased load by adding resources.
  • Example: Addition of more servers to handle the increase in web traffic.
  • Learn More: What is Scalability and How to Achieve it?

2. Load Balancer

  • Definition: Distributing incoming network traffic across multiple servers so that no single server is overloaded.
  • Example: Load balancing web traffic across multiple EC2 instances using the AWS Elastic Load Balancer (ELB) service.
  • Learn More: Understanding Load Balancer

3. Microservices

  • Definition: An architectural pattern that structures an application as a collection of loosely coupled, independently deployable services.
  • Example: Breaking down a monolithic application into independent services responsible for user management, payment processing, and notifications.
  • Learn More: What are Microservices?

4. CAP Theorem

  • Definition: It states that when a network partition occurs, a distributed system can guarantee at most one of Consistency and Availability; partition tolerance itself cannot be given up in a distributed system.
  • Example: Deciding when to trade off consistency for availability, and vice versa, in distributed database design.
  • Learn More: Understanding CAP Theorem

5. Sharding

  • Definition: It involves breaking down a large database into smaller pieces called shards for better management.
  • Example: Sharding a user database based on geographic region.
  • Learn More: Database Sharding Explained

6. Latency

  • Definition: The time it takes for data to travel from point A to point B.
  • Example: Measuring the delay in message delivery through a chat application.
  • Learn More: Latency explained!

7. Throughput

  • Definition: A measure of how much data a system processes in a given timeframe.
  • Example: The number of requests a web server processes in one second.
  • Learn More: Throughput in Computer Networks

8. Cache

  • Definition: A hardware or software component that stores data so future requests for the same data can be served quickly.
  • Example: Implementing Redis caching for repeated database queries.
  • Learn More: Caching Explained

9. Content Delivery Network (CDN)

  • Definition: A geographically dispersed system of servers that delivers web content to users based on their location.
  • Example: Using Cloudflare CDN for faster web page loading.
  • Learn More: What is a CDN?

10. REST API

  • Definition: An architectural style for building web services in which data is accessed and manipulated using HTTP requests.
  • Example: Designing a social media API following REST (Representational State Transfer) principles.
  • Learn More: REST API Tutorial

11. GraphQL

  • Definition: A query language for APIs that lets clients request exactly the data they need, which can make it more efficient and flexible than REST for complex data requirements.
  • Example: Using GraphQL to query user information in a single request.
  • Learn More: GraphQL Introduction

12. ACID

  • Definition: A set of properties ensuring reliable processing of database transactions. The properties are Atomicity, Consistency, Isolation, and Durability.
  • Example: Ensuring that a banking transaction has ACID properties prevents corrupted data.
  • Learn More: ACID Properties in Databases

13. BASE

  • Definition: An alternative to ACID (Basically Available, Soft state, Eventually consistent) that emphasizes availability and partition tolerance over strict consistency.
  • Example: Designing a highly available, eventually consistent NoSQL database.
  • Learn More: BASE vs ACID

14. NoSQL

  • Definition: A class of databases that store and retrieve data using models other than the tabular relations of relational databases.
  • Example: Using MongoDB as a document-based data store.
  • Learn More: What is a NoSQL Database?

15. SQL

  • Definition: It is the standard language used for storing, manipulating, and retrieving data in relational databases.
  • Example: Writing SQL queries to get data back from a relational database.
  • Learn More: SQL Tutorial

16. Database Indexing

  • Definition: A data structure technique that allows quick searching and retrieval of data from a database.
  • Example: Creating an index on the User ID column to speed up searches.
  • Learn More: Database Indexing

17. Replication

  • Definition: The process of copying and maintaining database objects across the multiple databases that make up a distributed database system.
  • Example: Using replication to keep a database highly available across different geographic locations.
  • Learn More: Database Replication

18. Failover

  • Definition: A backup operational mode in which the functions of a failed primary component are taken over by a standby component.
  • Example: Automatically failing over to standby servers when a primary server of a web application fails.
  • Learn More: Failover vs Disaster Recovery

19. API Gateway

  • Definition: A server that sits at the front of an API, receiving API requests, applying throttling and security policies, and then forwarding them to back-end services.
  • Example: Using AWS API Gateway to manage APIs.
  • Learn More: What is an API Gateway?

20. Service Mesh

  • Definition: A dedicated infrastructure layer for facilitating service-to-service communications between microservices.
  • Example: Integrating Istio as a service mesh for the management of microservice interactions.
  • Learn More: Introduction to Service Mesh

21. Serverless Computing

  • Definition: A cloud computing model in which the cloud provider dynamically allocates machine resources on demand.
  • Example: Running backend code without provisioning any servers using AWS Lambda.
  • Learn More: What is Serverless Computing?

22. Event-Driven Architecture

  • Definition: A software architecture paradigm built around the production, detection, and consumption of, and reaction to, events.
  • Example: Designing a system in which microservices communicate through events using Apache Kafka.
  • Learn More: Event-Driven Architecture

23. Monolithic Architecture

  • Definition: A software architecture in which all components are combined into a single application and run as a single service.
  • Example: Traditional enterprise applications built as one large unit.
  • Learn More: Monolithic vs Microservices Architecture

24. Distributed Systems

  • Definition: A model wherein components located on networked computers communicate with each other and coordinate their actions by passing messages.
  • Example: Designing a distributed file system like Hadoop.
  • Learn More: Introduction to Distributed Systems

25. Message Queue

  • Definition: This method allows asynchronous, service-to-service communication in both serverless and microservices architectures.
  • Example: Using RabbitMQ to queue messages between services.
  • Learn More: Message Queues Explained

26. Pub/Sub Model

  • Definition: A messaging pattern in which publishers send messages to a topic without knowing the identity of the subscribers that will receive them.
  • Example: A notification system that uses Google Cloud Pub/Sub.
  • Learn More: Pub/Sub Messaging

27. Data Partitioning

  • Definition: Division of a database into smaller, manageable parts.
  • Example: Partitioning a table in a database by date to allow super-fast query execution.
  • Learn More: Database Partitioning

28. Horizontal Scaling

  • Definition: Increasing the capacity by adding more machines or nodes within a system.
  • Example: Adding more web servers to handle an increasing volume of user traffic.
  • Learn More: Horizontal vs Vertical Scaling

29. Vertical Scaling

  • Definition: Increasing the capacity of an existing machine by adding more power, such as CPU or RAM.
  • Example: Upgrading a server’s RAM so that it can handle more requests at once.
  • Learn More: Horizontal vs Vertical Scaling

30. Rate Limiting

  • Definition: Controlling the number of requests a client can send to a service within a given time window.
  • Example: Throttling an API to prevent abusive behaviour.
  • Learn More: Understanding Rate Limiting

31. Circuit Breaker Pattern

  • Definition: A design pattern that detects failures and prevents a failing operation from being retried endlessly.
  • Example: Using a circuit breaker to handle failed remote service calls in a microservices architecture.
  • Learn More: Circuit Breaker Pattern

32. Data Consistency

  • Definition: Ensuring that data is the same across multiple instances and is not corrupted.
  • Example: Maintaining the consistency of user data through multiple replicas of a database.
  • Learn More: Data Consistency Models

33. Eventual Consistency

  • Definition: A consistency model used in distributed computing to achieve high availability: updates propagate asynchronously, and all nodes eventually reflect the latest value.
  • Example: Amazon DynamoDB uses eventually consistent reads by default.
  • Learn More: Eventual Consistency
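The replica-lag behaviour can be sketched with a toy primary/replica pair and an explicit replication step (class and method names are invented; a real system replicates continuously in the background):

```python
class ReplicatedStore:
    """Writes hit the primary and reach the replica asynchronously,
    so a replica read can briefly return stale data."""
    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []              # replication log not yet applied

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def read_replica(self, key):
        return self.replica.get(key)   # may lag behind the primary

    def replicate(self):
        # In a real system this runs continuously in the background.
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

db = ReplicatedStore()
db.write("balance", 100)
stale = db.read_replica("balance")
print(stale)           # None: the replica has not caught up yet
db.replicate()
fresh = db.read_replica("balance")
print(fresh)           # 100: the replicas have converged
```

The window between the stale and fresh reads is exactly the "eventual" in eventual consistency.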

34. Strong Consistency

  • Definition: A consistency model ensuring every read gets the most recent write on a given unit of data.
  • Example: Using strong consistency in a financial transaction system.
  • Learn More: Strong Consistency

35. Containerization

  • Definition: Packaging an application together with its dependencies into a container so it can run in any computational environment.
  • Example: Using Docker to containerize applications for deployment across environments such as dev, test, and prod.
  • Learn More: What is Containerization?

36. Kubernetes

  • Definition: An open-source platform that automates the process of application container deployment, scaling, and operation.
  • Example: Deploying and running containerized applications with Kubernetes.
  • Learn More: Kubernetes Documentation

37. Autoscaling

  • Definition: Automatically adjusting the number of computational resources based on the user load.
  • Example: Utilizing AWS EC2 Auto Scaling feature to dynamically adjust the number of instances.
  • Learn More: Auto Scaling Explained

38. Multi-Tenancy

  • Definition: Architecture where a single instance of a software application serves multiple consumers/customers.
  • Example: SaaS applications such as Salesforce use multi-tenancy to serve many different customers from a single instance.
  • Learn More: Single Tenancy vs Multi-Tenancy

39. Load Shedding

  • Definition: Dropping some requests or degrading service in order to keep the overall system healthy under high load.
  • Example: Turning off non-essential features during peak traffic so core functionality stays responsive.
  • Learn More: Load Shedding

40. Idempotence

  • Definition: A property of certain mathematical and computer-science operations whereby performing the operation multiple times has the same effect as performing it once.
  • Example: An HTTP DELETE request is idempotent.
  • Learn More: Idempotence in APIs
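In APIs, idempotence is often achieved with idempotency keys, which make retries safe; the sketch below uses invented names (`charge`, `req-42`) to illustrate the pattern rather than any specific payment API.

```python
processed = {}   # idempotency key -> cached result of the first attempt

def charge(idempotency_key, amount, ledger):
    """Replaying the same request (same key) has no additional effect."""
    if idempotency_key in processed:
        return processed[idempotency_key]   # replay: return cached result
    ledger.append(amount)                   # the real side effect
    processed[idempotency_key] = len(ledger) - 1
    return processed[idempotency_key]

ledger = []
charge("req-42", 100, ledger)
charge("req-42", 100, ledger)   # client retried after a timeout
print(ledger)                   # [100]: charged exactly once
```

This is why clients can safely retry a failed request: duplicates are detected by key, not by guessing whether the first attempt went through.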

41. Quorum

  • Definition: The minimum number of nodes that must agree for a distributed operation, such as committing a transaction, to succeed.
  • Example: Quorum-based replication ensures consistency in a distributed database.
  • Learn More: Quorum Systems
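The key arithmetic behind quorum-based replication: with N replicas, a write quorum W and a read quorum R are guaranteed to overlap whenever W + R > N, so every read contacts at least one replica holding the latest acknowledged write. A one-function sketch:

```python
def quorum_overlap(n, w, r):
    """True if a W-node write quorum and an R-node read quorum must
    intersect among N replicas, i.e. reads always see the latest
    acknowledged write."""
    return w + r > n

# Classic Dynamo-style configuration: N=3, W=2, R=2.
print(quorum_overlap(3, 2, 2))   # True:  2 + 2 > 3, quorums overlap
print(quorum_overlap(3, 1, 1))   # False: 1 + 1 <= 3, stale reads possible
```

Tuning W and R trades latency for consistency: lower quorums respond faster but give up the overlap guarantee.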

42. Orchestration

  • Definition: A pattern of service interaction where a central coordinator controls the interaction between services.
  • Example: Using a workflow engine to manage a multi-step business process.
  • Learn More: Orchestration

43. Choreography

  • Definition: A service interaction pattern in which each service acts independently and communicates with others through events, with no central coordinator or orchestrator.
  • Example: Microservices communicating through an event bus.
  • Learn More: Choreography vs. Orchestration

44. Service Registry

  • Definition: A database that keeps track of instances of microservices.
  • Example: Using the Eureka service registry in a microservice architecture.
  • Learn More: Service Registry and Discovery

45. API Rate Limiting

  • Definition: Controlling how many requests a client can make to an API within a given timeframe.
  • Example: Limiting an API to 100 requests per minute per client to prevent abuse.
  • Learn More: API Rate Limiting

46. Data Warehouse

  • Definition: A central repository of integrated data used for reporting and business analytics; the hub of business intelligence.
  • Example: Using Amazon Redshift as a data warehouse.
  • Learn More: Understanding Data Warehouses

47. Data Lake

  • Definition: A system or repository where data is kept in native/raw format, generally as object blobs or files.
  • Example: Storing and managing petabytes of structured and unstructured data in a data lake.
  • Learn More: Data Lake

48. OLAP

  • Definition: Online Analytical Processing: a category of software that enables analysis of data stored in a database.
  • Example: Using OLAP cubes for ad-hoc, multidimensional analytical queries.
  • Learn More: OLAP Explained

49. OLTP

  • Definition: Online Transaction Processing: a class of systems that manage transaction-oriented applications.
  • Example: Using OLTP systems to manage transactional data, as in banking systems.
  • Learn More: OLTP Explained

50. Big Data

  • Definition: Data sets so large and complex that conventional data-processing software cannot manage them efficiently.
  • Example: Analyzing social media interactions to predict fashion trends.
  • Learn More: Introduction to Big Data

Keep in mind that progressing in system design is all about continuous learning and practice. Work through the linked resources, get involved in discussions, and apply these concepts in your own projects; that is how the vocabulary and real-world usage of each concept will sink in.

Source: https://interviewnoodle.com/top-50-system-design-terminologies-you-must-know-3c78f5fb99c1