Building a Real-time Crypto Data Platform
Technical breakdown of building a real-time cryptocurrency data platform - from data ingestion challenges to reducing manual effort through automation.
This was the first time I was building a real-time system, and it taught me way more about data pipelines than I expected.
The main idea was to build a beginner-friendly crypto ticker platform. CoinMarketCap felt too complex - a beginner landing on it for the first time finds it daunting. So the goal was something less overwhelming for new users, especially since 2020-21 was when a lot of new people were getting into crypto.
Market cap is the most important stat in crypto, yet most people don't know what it means. That needed to change.
The Vision: More Than Just Prices
We planned to start with a simple ticker, then add features like short explainers for the top market cap coins. We also wanted to surface recent tweets, so we built a scraper that periodically checked a whitelist of accounts and pulled only posts about specific coins, not random content.
We brought tweets, Reddit posts, Google Trends, and prices together in one place so users could quickly understand what a coin does and what the latest news about it was.
This is something that tools like Gigabrain.gg do now with AI - they provide institutional-grade signals, real-time market intelligence, and sentiment analysis all in one terminal.
Data Source Strategy and API Hell
We used CoinGecko's API to pull prices from exchanges and to show different price sources (DEXes and CEXes), but it came with its own challenges around cost and real-time requirements.
The Rate Limit Reality:
- CoinGecko's free tier: 30 calls/minute, 10,000 monthly cap
- Paid plans: $129/month for 500 calls/minute, 500,000 monthly cap
- CoinMarketCap: 10,000 API calls monthly on free tier
These limits hit fast when you're trying to track hundreds of tokens in real-time. We had to get creative with caching and batch requests.
API Cost Management:
Every API call costs money or burns through your rate limit. We learned to:
- Batch requests for multiple tokens
- Cache aggressively but invalidate smartly
- Use webhooks where possible instead of polling
- Implement exponential backoff for failed requests
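To make the batching point concrete, here's a minimal sketch of how price lookups can be grouped, assuming CoinGecko's /simple/price endpoint (which accepts comma-separated token ids) and the fetchWithRetry helper shown further down; the chunk size is an illustrative choice, not a documented limit.

// Fetch prices for many tokens in a handful of calls instead of one call per token.
// Assumes CoinGecko's /simple/price endpoint, which takes comma-separated ids.
async function fetchPricesBatched(tokenIds, chunkSize = 100) {
  const results = {};
  for (let i = 0; i < tokenIds.length; i += chunkSize) {
    const ids = tokenIds.slice(i, i + chunkSize).join(",");
    const url =
      "https://api.coingecko.com/api/v3/simple/price" +
      `?ids=${ids}&vs_currencies=usd&include_market_cap=true`;
    const response = await fetchWithRetry(url);
    Object.assign(results, await response.json());
  }
  return results;
}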
Real-time Data Ingestion Challenges
Building the data pipeline was where things got complex. We needed to handle:
Multiple Data Sources
- CoinGecko for pricing and market data
- Twitter API for social sentiment
- Reddit API for community discussions
- Google Trends for search volume
- Direct exchange APIs for more granular data
Different Update Frequencies
Not all data updates at the same rate:
- Prices: Every few seconds for major tokens
- Social media: Every few minutes
- Market cap calculations: Recomputed whenever price or circulating supply changes
- Trends data: Hourly or daily
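In practice this meant running each source on its own schedule. A minimal sketch of that, where the intervals and the fetchPrices / fetchTweets / fetchGoogleTrends functions are illustrative placeholders for the real ingestion jobs:

// Run each data source on its own polling cadence.
const schedules = [
  { name: "prices", intervalMs: 10_000, run: fetchPrices },
  { name: "social", intervalMs: 5 * 60_000, run: fetchTweets },
  { name: "trends", intervalMs: 60 * 60_000, run: fetchGoogleTrends },
];

for (const { name, intervalMs, run } of schedules) {
  setInterval(async () => {
    try {
      await run();
    } catch (err) {
      // One failing source shouldn't take down the others
      console.error(`[${name}] ingestion failed`, err);
    }
  }, intervalMs);
}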
Error Handling
APIs fail. A lot. We had to build robust retry mechanisms:
// Small sleep helper used between retries
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithRetry(url, options, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      const response = await fetch(url, options);
      if (response.status === 429) {
        // Rate limited - back off exponentially and try again
        await delay(Math.pow(2, i) * 1000);
        continue;
      }
      return response;
    } catch (error) {
      // Network error - give up on the last attempt, otherwise back off
      if (i === maxRetries - 1) throw error;
      await delay(Math.pow(2, i) * 500);
    }
  }
  // Every attempt was rate limited
  throw new Error(`Rate limited after ${maxRetries} attempts: ${url}`);
}
Database Design for Time-Series Data
Storing crypto data efficiently was crucial. Two decisions mattered most: the schema and the retention policy.
Schema Design
-- Price history table
CREATE TABLE price_history (
id SERIAL PRIMARY KEY,
token_id VARCHAR(50),
price DECIMAL(20,8),
volume_24h DECIMAL(20,2),
market_cap DECIMAL(20,2),
timestamp TIMESTAMPTZ,
source VARCHAR(20)
);
-- Indexes for time-ordered queries
CREATE INDEX idx_price_timestamp ON price_history (timestamp DESC);
CREATE INDEX idx_price_token_time ON price_history (token_id, timestamp DESC);
Data Retention Strategy
- Real-time data: Keep for 24 hours
- Hourly aggregates: Keep for 30 days
- Daily aggregates: Keep for 1 year
- Weekly aggregates: Keep forever
This reduced storage costs while maintaining useful historical data.
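As a rough sketch of what one of those rollup jobs can look like, assuming PostgreSQL via node-postgres and the price_history table above; the price_history_hourly table and the exact aggregation are illustrative, not the production schema:

const { Pool } = require("pg");
const pool = new Pool(); // connection settings come from the usual PG* env vars

// Roll raw ticks older than 24 hours into hourly rows, then drop the raw data.
// price_history_hourly is a hypothetical aggregate table alongside price_history.
async function rollUpHourly() {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    await client.query(`
      INSERT INTO price_history_hourly (token_id, hour, avg_price, avg_volume)
      SELECT token_id,
             date_trunc('hour', timestamp) AS hour,
             avg(price),
             avg(volume_24h)
      FROM price_history
      WHERE timestamp < now() - interval '24 hours'
      GROUP BY token_id, date_trunc('hour', timestamp)
    `);
    await client.query(
      "DELETE FROM price_history WHERE timestamp < now() - interval '24 hours'"
    );
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}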
Automation to Reduce Manual Effort
Since our team was small, we had to automate everything possible.
Token Selection Automation
We seeded the list with coins we believed in - projects we considered fundamentally strong but that didn't yet have significant market caps at the time, like Solana, Helium, and Thorchain - and then automated the rest of the curation instead of manually reviewing every token.
We built scripts to:
- Auto-add tokens that hit certain volume/market cap thresholds
- Monitor social sentiment to identify trending tokens
- Remove tokens that became inactive or suspicious
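A simplified sketch of the auto-add check; the thresholds and the listCandidateTokens / addTokenToPlatform helpers are illustrative placeholders, not the real data layer:

// Add any token that crosses both volume and market cap thresholds.
async function autoAddTokens({ minVolume = 5_000_000, minMarketCap = 50_000_000 } = {}) {
  const candidates = await listCandidateTokens(); // placeholder for the candidate feed
  for (const token of candidates) {
    if (token.volume_24h >= minVolume && token.market_cap >= minMarketCap) {
      await addTokenToPlatform(token.id); // placeholder for the platform's add flow
    }
  }
}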
Social Media Automation
Our Twitter scraper automated:
- Whitelisted account monitoring
- Keyword filtering for relevant crypto content
- Sentiment scoring using basic NLP
- Duplicate detection and removal
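A stripped-down sketch of the filtering step, assuming tweets arrive as { id, author, text } objects and the whitelist is a Set of account handles; the keyword matching here is deliberately simpler than the real scoring:

const seenTweetIds = new Set();

// Keep only original tweets from whitelisted accounts that mention a tracked coin.
function filterTweets(tweets, whitelist, coinKeywords) {
  return tweets.filter((tweet) => {
    if (seenTweetIds.has(tweet.id)) return false;    // duplicate
    if (!whitelist.has(tweet.author)) return false;  // not a whitelisted account
    if (tweet.text.startsWith("RT @")) return false; // skip retweets
    const text = tweet.text.toLowerCase();
    const relevant = coinKeywords.some((kw) => text.includes(kw));
    if (relevant) seenTweetIds.add(tweet.id);
    return relevant;
  });
}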
Price Validation Automation
// Detect and flag suspicious price movements
function validatePriceUpdate(token, newPrice, previousPrice) {
const changePercent = Math.abs(
((newPrice - previousPrice) / previousPrice) * 100
);
if (changePercent > 50) {
// Flag for manual review
flagSuspiciousPrice(token, newPrice, previousPrice);
return false;
}
return true;
}
Production Problems and Solutions
Problem: API Rate Limits Killing Us
What happened: We kept hitting CoinGecko's rate limits during volatile market periods when everyone wanted real-time data.
Solution: Implemented intelligent caching and request batching:
- Cache popular tokens more aggressively
- Batch requests for less popular tokens
- Use price change thresholds to determine update frequency
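Roughly, the threshold logic boiled down to something like this (the field names and numbers are illustrative):

// More popular tokens and fast-moving prices get refreshed sooner.
function cacheTtlSeconds(token) {
  if (token.isTopByMarketCap) return 10;                   // hot tokens: near real-time
  if (Math.abs(token.recentChangePercent) > 5) return 30;  // volatile: refresh quickly
  return 300;                                              // long tail: refresh lazily
}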
Problem: Social Media Data Quality
What happened: Our Twitter scraper was pulling in too much noise - retweets, off-topic content, spam.
Solution: Built better filtering:
- Machine learning-based relevance scoring
- Blacklist of spam phrases and accounts
- Manual feedback loop to improve filters
Problem: Database Performance During Market Crashes
What happened: During major market events, write volume would spike 10x and queries would slow to a crawl.
Solution:
- Implemented write-behind caching
- Created separate read replicas for public API
- Added database connection pooling
- Used Redis for hot data (current prices)
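The write-behind idea, roughly: accept writes into an in-memory buffer and flush them to the database in batches, so the database only sees bulk inserts. A minimal sketch, where insertPriceRows() stands in for a bulk INSERT against price_history:

const buffer = [];
const FLUSH_SIZE = 500;
const FLUSH_INTERVAL_MS = 2000;

// Callers just push rows; the database only sees batched writes.
function recordPrice(row) {
  buffer.push(row);
  if (buffer.length >= FLUSH_SIZE) flush();
}

async function flush() {
  if (buffer.length === 0) return;
  const rows = buffer.splice(0, buffer.length);
  try {
    await insertPriceRows(rows); // single bulk INSERT instead of hundreds of tiny ones
  } catch (err) {
    buffer.unshift(...rows); // put the rows back and retry on the next flush
  }
}

setInterval(flush, FLUSH_INTERVAL_MS);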
Problem: Stale Data Detection
What happened: Sometimes APIs would return stale data without indicating it, leading to incorrect price displays.
Solution: Cross-validation system:
// Compare prices across multiple sources and flag outliers
function validatePriceData(token, sources) {
  const prices = sources.map((s) => s.price);
  const avgPrice = prices.reduce((a, b) => a + b, 0) / prices.length;

  // Flag anything more than 10% away from the cross-source average
  const outliers = prices.filter(
    (p) => Math.abs(p - avgPrice) / avgPrice > 0.1
  );

  if (outliers.length > 0) {
    logSuspiciousData(token, sources);
  }
  return outliers.length === 0;
}
Scaling and Performance Optimizations
Caching Strategy
We implemented multi-layer caching:
- Level 1: In-memory cache for current prices (Redis)
- Level 2: Database query cache for historical data
- Level 3: CDN cache for static content and API responses
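The read path, roughly: check Redis first, fall back to the database, and let the CDN absorb anything that can tolerate being a few seconds old. A sketch of the first two layers, assuming ioredis and a getPriceFromDb() placeholder for the real query:

const Redis = require("ioredis");
const redis = new Redis();

// Level 1: Redis. Level 2: the database. (Level 3, the CDN, sits in front of the API.)
async function getCurrentPrice(tokenId) {
  const cached = await redis.get(`price:${tokenId}`);
  if (cached !== null) return JSON.parse(cached);

  const fromDb = await getPriceFromDb(tokenId); // placeholder for the real query
  await redis.set(`price:${tokenId}`, JSON.stringify(fromDb), "EX", 10); // 10s TTL
  return fromDb;
}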
WebSocket for Real-time Updates
Instead of polling, we used WebSockets to push price updates:
// Server-side price broadcasting
priceUpdater.on("price-change", (data) => {
const { token, price, change } = data;
// Only broadcast significant changes
if (Math.abs(change) > 0.01) {
wss.clients.forEach((client) => {
if (client.subscribedTokens.includes(token)) {
client.send(
JSON.stringify({
type: "price-update",
token,
price,
change,
timestamp: Date.now(),
})
);
}
});
}
});
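On the client, consuming those updates is just a browser WebSocket subscription. A minimal sketch; the endpoint URL, the subscribe message shape, and updateTickerUI() are illustrative:

// Browser side: subscribe to a few tokens and react to pushed updates.
const ws = new WebSocket("wss://api.example.com/prices");

ws.onopen = () => {
  ws.send(JSON.stringify({ type: "subscribe", tokens: ["bitcoin", "solana"] }));
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === "price-update") {
    updateTickerUI(msg.token, msg.price, msg.change); // placeholder UI hook
  }
};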
Database Optimizations
- Partitioned tables by time ranges
- Implemented read replicas for analytics queries
- Used materialized views for complex aggregations
- Added proper indexing for common query patterns
What I'd Do Differently
Looking back, there are several things I'd approach differently:
Start with webhooks: Instead of building complex polling systems, I'd negotiate webhook access with data providers from day one.
Microservices from the start: We ended up with a monolith that became hard to scale individual components. I'd split data ingestion, processing, and serving into separate services.
Better monitoring: We added monitoring reactively when things broke. I'd implement comprehensive observability from the beginning - metrics, logs, traces, and alerts.
Cost planning: API costs scaled faster than we expected. I'd model usage patterns and costs more thoroughly upfront.
The Real Learning
The biggest takeaway was understanding that real-time data systems are fundamentally different from CRUD applications. Everything that can fail, will fail:
- APIs go down
- Networks have latency spikes
- Databases get overwhelmed
- Cache invalidation creates race conditions
Building resilience into every layer isn't optional - it's the core requirement. Every optimization technique we learned came from production failures teaching us where the weak points were.
The automation we built saved countless hours of manual work, but more importantly, it made the system more reliable than human processes ever could be.
Building marketcap.guide taught me that real-time data platforms are 90% infrastructure and 10% features. The infrastructure work isn't glamorous, but it's what makes everything else possible.