Building a Real-time Crypto Data Platform
Technical breakdown of building a real-time cryptocurrency data platform - from data ingestion challenges to reducing manual effort through automation.
This was the first time I was building a real-time system, and it taught me way more about data pipelines than I expected.
The main idea was to build a beginner-friendly crypto ticker platform. CoinMarketCap felt too complex - a beginner landing on it for the first time finds it daunting. So the goal was something less overwhelming for new users, especially since 2020-21 was when a lot of new people were getting into crypto.
Market cap is the most important stat in crypto, yet most people don't know what it means. That needed to change.
The Vision: More Than Just Prices
We planned to start with a simple ticker, then add features like short explainers for the top market cap coins. We also wanted to surface recent tweets, so we built a scraper that periodically checked a whitelist of accounts and pulled only posts about specific coins, not random content.
We brought tweets, Reddit posts, Google Trends, and prices together in one place so users could quickly understand what a coin does and what the latest news about it was.
This is something that tools like Gigabrain.gg do now with AI - they provide institutional-grade signals, real-time market intelligence, and sentiment analysis all in one terminal.
Data Source Strategy and API Hell
We used CoinGecko's API to pull prices from exchanges and to show different price sources (DEXes and CEXes), but it came with its own challenges around cost and real-time requirements.
The Rate Limit Reality:
- CoinGecko's free tier: 30 calls/minute, 10,000 monthly cap
- Paid plans: $129/month for 500 calls/minute, 500,000 monthly cap
- CoinMarketCap: 10,000 API calls monthly on free tier
These limits hit fast when you're trying to track hundreds of tokens in real-time. We had to get creative with caching and batch requests.
API Cost Management:
Every API call costs money or burns through your rate limit. We learned to:
- Batch requests for multiple tokens
- Cache aggressively but invalidate smartly
- Use webhooks where possible instead of polling
- Implement exponential backoff for failed requests
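To make the batching point concrete, here's a minimal sketch of how price lookups can be grouped, assuming CoinGecko's /simple/price endpoint (which accepts comma-separated token ids) and the fetchWithRetry helper shown further down; the chunk size is an illustrative choice, not a documented limit.

// Fetch prices for many tokens in a handful of calls instead of one call per token.
// Assumes CoinGecko's /simple/price endpoint, which takes comma-separated ids.
async function fetchPricesBatched(tokenIds, chunkSize = 100) {
  const results = {};
  for (let i = 0; i < tokenIds.length; i += chunkSize) {
    const ids = tokenIds.slice(i, i + chunkSize).join(",");
    const url =
      "https://api.coingecko.com/api/v3/simple/price" +
      `?ids=${ids}&vs_currencies=usd&include_market_cap=true`;
    const response = await fetchWithRetry(url);
    Object.assign(results, await response.json());
  }
  return results;
}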
Real-time Data Ingestion Challenges
Building the data pipeline was where things got complex. We needed to handle:
Multiple Data Sources
- CoinGecko for pricing and market data
- Twitter API for social sentiment
- Reddit API for community discussions
- Google Trends for search volume
- Direct exchange APIs for more granular data
Different Update Frequencies
Not all data updates at the same rate:
- Prices: Every few seconds for major tokens
- Social media: Every few minutes
- Market cap calculations: Recomputed whenever price or circulating supply changes
- Trends data: Hourly or daily
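In practice this meant running each source on its own schedule. A minimal sketch of that, where the intervals and the fetchPrices / fetchTweets / fetchGoogleTrends functions are illustrative placeholders for the real ingestion jobs:

// Run each data source on its own polling cadence.
const schedules = [
  { name: "prices", intervalMs: 10_000, run: fetchPrices },
  { name: "social", intervalMs: 5 * 60_000, run: fetchTweets },
  { name: "trends", intervalMs: 60 * 60_000, run: fetchGoogleTrends },
];

for (const { name, intervalMs, run } of schedules) {
  setInterval(async () => {
    try {
      await run();
    } catch (err) {
      // One failing source shouldn't take down the others
      console.error(`[${name}] ingestion failed`, err);
    }
  }, intervalMs);
}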
Error Handling
APIs fail. A lot. We had to build robust retry mechanisms:
// Small sleep helper used between retries
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithRetry(url, options, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      const response = await fetch(url, options);
      if (response.status === 429) {
        // Rate limited - back off exponentially and try again
        await delay(Math.pow(2, i) * 1000);
        continue;
      }
      return response;
    } catch (error) {
      // Network error - give up on the last attempt, otherwise back off
      if (i === maxRetries - 1) throw error;
      await delay(Math.pow(2, i) * 500);
    }
  }
  // Every attempt was rate limited
  throw new Error(`Rate limited after ${maxRetries} attempts: ${url}`);
}
Database Design for Time-Series Data
Storing crypto data efficiently was crucial. Two decisions mattered most: the schema and the retention policy.
Schema Design
-- Price history table
CREATE TABLE price_history (
id SERIAL PRIMARY KEY,
token_id VARCHAR(50),
price DECIMAL(20,8),
volume_24h DECIMAL(20,2),
market_cap DECIMAL(20,2),
timestamp TIMESTAMPTZ,
source VARCHAR(20)
);
-- Indexes for time-ordered queries
CREATE INDEX idx_price_timestamp ON price_history (timestamp DESC);
CREATE INDEX idx_price_token_time ON price_history (token_id, timestamp DESC);
Data Retention Strategy
- Real-time data: Keep for 24 hours
- Hourly aggregates: Keep for 30 days
- Daily aggregates: Keep for 1 year
- Weekly aggregates: Keep forever
This reduced storage costs while maintaining useful historical data.
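As a rough sketch of what one of those rollup jobs can look like, assuming PostgreSQL via node-postgres and the price_history table above; the price_history_hourly table and the exact aggregation are illustrative, not the production schema:

const { Pool } = require("pg");
const pool = new Pool(); // connection settings come from the usual PG* env vars

// Roll raw ticks older than 24 hours into hourly rows, then drop the raw data.
// price_history_hourly is a hypothetical aggregate table alongside price_history.
async function rollUpHourly() {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    await client.query(`
      INSERT INTO price_history_hourly (token_id, hour, avg_price, avg_volume)
      SELECT token_id,
             date_trunc('hour', timestamp) AS hour,
             avg(price),
             avg(volume_24h)
      FROM price_history
      WHERE timestamp < now() - interval '24 hours'
      GROUP BY token_id, date_trunc('hour', timestamp)
    `);
    await client.query(
      "DELETE FROM price_history WHERE timestamp < now() - interval '24 hours'"
    );
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}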
Automation to Reduce Manual Effort
Since our team was small, we had to automate everything possible.
Token Selection Automation
We seeded the list with coins we believed in - projects we considered fundamentally strong but that didn't yet have significant market caps at the time, like Solana, Helium, and Thorchain - and then automated the rest of the curation instead of manually reviewing every token.
We built scripts to:
- Auto-add tokens that hit certain volume/market cap thresholds
- Monitor social sentiment to identify trending tokens
- Remove tokens that became inactive or suspicious
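A simplified sketch of the auto-add check; the thresholds and the listCandidateTokens / addTokenToPlatform helpers are illustrative placeholders, not the real data layer:

// Add any token that crosses both volume and market cap thresholds.
async function autoAddTokens({ minVolume = 5_000_000, minMarketCap = 50_000_000 } = {}) {
  const candidates = await listCandidateTokens(); // placeholder for the candidate feed
  for (const token of candidates) {
    if (token.volume_24h >= minVolume && token.market_cap >= minMarketCap) {
      await addTokenToPlatform(token.id); // placeholder for the platform's add flow
    }
  }
}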
Social Media Automation
Our Twitter scraper automated:
- Whitelisted account monitoring
- Keyword filtering for relevant crypto content
- Sentiment scoring using basic NLP
- Duplicate detection and removal
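A stripped-down sketch of the filtering step, assuming tweets arrive as { id, author, text } objects and the whitelist is a Set of account handles; the keyword matching here is deliberately simpler than the real scoring:

const seenTweetIds = new Set();

// Keep only original tweets from whitelisted accounts that mention a tracked coin.
function filterTweets(tweets, whitelist, coinKeywords) {
  return tweets.filter((tweet) => {
    if (seenTweetIds.has(tweet.id)) return false;    // duplicate
    if (!whitelist.has(tweet.author)) return false;  // not a whitelisted account
    if (tweet.text.startsWith("RT @")) return false; // skip retweets
    const text = tweet.text.toLowerCase();
    const relevant = coinKeywords.some((kw) => text.includes(kw));
    if (relevant) seenTweetIds.add(tweet.id);
    return relevant;
  });
}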
Price Validation Automation
// Detect and flag suspicious price movements
function validatePriceUpdate(token, newPrice, previousPrice) {
const changePercent = Math.abs(
((newPrice - previousPrice) / previousPrice) * 100
);
if (changePercent > 50) {
// Flag for manual review
flagSuspiciousPrice(token, newPrice, previousPrice);
return false;
}
return true;
}
Production Problems and Solutions
Problem: API Rate Limits Killing Us
What happened: We kept hitting CoinGecko's rate limits during volatile market periods when everyone wanted real-time data.
Solution: Implemented intelligent caching and request batching:
- Cache popular tokens more aggressively
- Batch requests for less popular tokens
- Use price change thresholds to determine update frequency
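Roughly, the threshold logic boiled down to something like this (the field names and numbers are illustrative):

// More popular tokens and fast-moving prices get refreshed sooner.
function cacheTtlSeconds(token) {
  if (token.isTopByMarketCap) return 10;                   // hot tokens: near real-time
  if (Math.abs(token.recentChangePercent) > 5) return 30;  // volatile: refresh quickly
  return 300;                                              // long tail: refresh lazily
}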
Problem: Social Media Data Quality
What happened: Our Twitter scraper was pulling in too much noise - retweets, off-topic content, spam.
Solution: Built better filtering:
- Machine learning-based relevance scoring
- Blacklist of spam phrases and accounts
- Manual feedback loop to improve filters
Problem: Database Performance During Market Crashes
What happened: During major market events, write volume would spike 10x and queries would slow to a crawl.
Solution:
- Implemented write-behind caching
- Created separate read replicas for public API
- Added database connection pooling
- Used Redis for hot data (current prices)
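The write-behind idea, roughly: accept writes into an in-memory buffer and flush them to the database in batches, so the database only sees bulk inserts. A minimal sketch, where insertPriceRows() stands in for a bulk INSERT against price_history:

const buffer = [];
const FLUSH_SIZE = 500;
const FLUSH_INTERVAL_MS = 2000;

// Callers just push rows; the database only sees batched writes.
function recordPrice(row) {
  buffer.push(row);
  if (buffer.length >= FLUSH_SIZE) flush();
}

async function flush() {
  if (buffer.length === 0) return;
  const rows = buffer.splice(0, buffer.length);
  try {
    await insertPriceRows(rows); // single bulk INSERT instead of hundreds of tiny ones
  } catch (err) {
    buffer.unshift(...rows); // put the rows back and retry on the next flush
  }
}

setInterval(flush, FLUSH_INTERVAL_MS);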
Problem: Stale Data Detection
What happened: Sometimes APIs would return stale data without indicating it, leading to incorrect price displays.
Solution: Cross-validation system:
// Compare prices across multiple sources and flag outliers
function validatePriceData(token, sources) {
  const prices = sources.map((s) => s.price);
  const avgPrice = prices.reduce((a, b) => a + b, 0) / prices.length;

  // Flag anything more than 10% away from the cross-source average
  const outliers = prices.filter(
    (p) => Math.abs(p - avgPrice) / avgPrice > 0.1
  );

  if (outliers.length > 0) {
    logSuspiciousData(token, sources);
  }
  return outliers.length === 0;
}
Scaling and Performance Optimizations
Caching Strategy
We implemented multi-layer caching:
- Level 1: In-memory cache for current prices (Redis)
- Level 2: Database query cache for historical data
- Level 3: CDN cache for static content and API responses
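The read path, roughly: check Redis first, fall back to the database, and let the CDN absorb anything that can tolerate being a few seconds old. A sketch of the first two layers, assuming ioredis and a getPriceFromDb() placeholder for the real query:

const Redis = require("ioredis");
const redis = new Redis();

// Level 1: Redis. Level 2: the database. (Level 3, the CDN, sits in front of the API.)
async function getCurrentPrice(tokenId) {
  const cached = await redis.get(`price:${tokenId}`);
  if (cached !== null) return JSON.parse(cached);

  const fromDb = await getPriceFromDb(tokenId); // placeholder for the real query
  await redis.set(`price:${tokenId}`, JSON.stringify(fromDb), "EX", 10); // 10s TTL
  return fromDb;
}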
WebSocket for Real-time Updates
Instead of polling, we used WebSockets to push price updates:
// Server-side price broadcasting
priceUpdater.on("price-change", (data) => {
const { token, price, change } = data;
// Only broadcast significant changes
if (Math.abs(change) > 0.01) {
wss.clients.forEach((client) => {
if (client.subscribedTokens.includes(token)) {
client.send(
JSON.stringify({
type: "price-update",
token,
price,
change,
timestamp: Date.now(),
})
);
}
});
}
});
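On the client, consuming those updates is just a browser WebSocket subscription. A minimal sketch; the endpoint URL, the subscribe message shape, and updateTickerUI() are illustrative:

// Browser side: subscribe to a few tokens and react to pushed updates.
const ws = new WebSocket("wss://api.example.com/prices");

ws.onopen = () => {
  ws.send(JSON.stringify({ type: "subscribe", tokens: ["bitcoin", "solana"] }));
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === "price-update") {
    updateTickerUI(msg.token, msg.price, msg.change); // placeholder UI hook
  }
};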
Database Optimizations
- Partitioned tables by time ranges
- Implemented read replicas for analytics queries
- Used materialized views for complex aggregations
- Added proper indexing for common query patterns
What I'd Do Differently
Looking back, there are several things I'd approach differently:
Start with webhooks: Instead of building complex polling systems, I'd negotiate webhook access with data providers from day one.
Microservices from the start: We ended up with a monolith that became hard to scale individual components. I'd split data ingestion, processing, and serving into separate services.
Better monitoring: We added monitoring reactively when things broke. I'd implement comprehensive observability from the beginning - metrics, logs, traces, and alerts.
Cost planning: API costs scaled faster than we expected. I'd model usage patterns and costs more thoroughly upfront.
The Real Learning
The biggest takeaway was understanding that real-time data systems are fundamentally different from CRUD applications. Everything that can fail, will fail:
- APIs go down
- Networks have latency spikes
- Databases get overwhelmed
- Cache invalidation creates race conditions
Building resilience into every layer isn't optional - it's the core requirement. Every optimization technique we learned came from production failures teaching us where the weak points were.
The automation we built saved countless hours of manual work, but more importantly, it made the system more reliable than human processes ever could be.
Building marketcap.guide taught me that real-time data platforms are 90% infrastructure and 10% features. The infrastructure work isn't glamorous, but it's what makes everything else possible.