500M+
Domains in index
5M+
URLs crawled daily
40M+
IP proxy pool
~15 min
Detection latency
How we collect data
We don’t rely on a single data source. Multiple independent collection layers feed into a unified detection engine. If one signal source goes dark for a domain, others fill the gap.

Live web crawling
A distributed crawl fleet continuously scans the web using real browser rendering. Every crawl captures HTTP headers, HTML source, executed JavaScript, network requests, and the rendered DOM. This gives us visibility into technologies that static fetchers miss entirely.

Crawl prioritization:
The crawler continuously re-prioritizes domains based on technology change velocity, traffic rank signals, and inbound link graph changes. Domains that change stacks frequently get crawled more often.
| Tier | Domains | Crawl frequency |
|---|---|---|
| Tier 1 (Real-time) | High-value and high-traffic domains | Every 24–72 hours |
| Tier 2 (Standard) | Mid-traffic domains | Weekly |
| Tier 3 (Batch) | Long-tail domains | Monthly or on-demand |
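The tiering above can be sketched as a simple assignment function. The thresholds and the two input signals below are hypothetical, chosen only for illustration; the real prioritization model weighs more signals (including link-graph changes) than this sketch shows.

```python
from dataclasses import dataclass

@dataclass
class DomainStats:
    change_velocity: float  # detected stack changes per 90 days (illustrative)
    traffic_rank: int       # global traffic rank; lower = more traffic

def assign_tier(stats: DomainStats) -> int:
    # Hypothetical cutoffs: high-traffic or fast-changing domains go real-time.
    if stats.traffic_rank <= 100_000 or stats.change_velocity >= 3:
        return 1  # Tier 1 (Real-time): recrawled every 24-72 hours
    if stats.traffic_rank <= 5_000_000 or stats.change_velocity >= 1:
        return 2  # Tier 2 (Standard): weekly
    return 3      # Tier 3 (Batch): monthly or on-demand
```

A domain with a quiet stack but high traffic still lands in Tier 1, while a fast-changing long-tail domain is promoted regardless of rank.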
Passive DNS monitoring
We ingest passive DNS feeds to track infrastructure changes across the entire domain namespace in near real-time:
- CNAME chain changes: a domain moving from *.wpengine.com to *.kinsta.cloud signals a hosting migration
- NS record changes: switching from Route 53 to Cloudflare DNS reveals infrastructure decisions
- MX record changes: email platform migrations (e.g., Google Workspace to Microsoft 365)
- A/AAAA record changes: IP lookups reveal CDN and hosting changes
- TXT record additions: SPF/DKIM records reveal new SaaS tool adoptions (e.g., include:sendgrid.net, include:mailchimp.com)
Additional real-time signals
Beyond DNS, we monitor multiple supplementary data streams to detect new domains and technology changes as they happen:
- TLS certificate issuances: new certificates from known platform CAs trigger domain discovery and crawl scheduling
- HTTP infrastructure signals: reverse proxy headers and response profile changes reveal CDN, hosting, and stack migrations
- New domain detection: newly observed domains are automatically queued for a full technology crawl
Common Crawl archives (historical data)
For historical technology timelines going back to 2008, we process Common Crawl datasets (CC-MAIN), petabyte-scale crawl archives of the public web.

How it works:
- We query the Common Crawl Index to retrieve record pointers for target domains across historical crawl dates — without downloading full archive files
- Only the relevant byte ranges are fetched using HTTP Range requests, minimizing data transfer
- HTML and headers are extracted and fed into our standard detection engine
- Detected technologies are stitched into a per-domain timeline showing when a site adopted or dropped each technology
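The index-then-range-fetch flow above can be sketched as follows. The endpoint shape and the record fields (filename, offset, length) follow Common Crawl's public CDX index API; the crawl label CC-MAIN-2024-10 is an example, and the actual network requests are left out.

```python
import json
import urllib.parse

INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"  # example crawl

def index_query_url(domain: str) -> str:
    """Build a CDX index query returning one JSON record per line."""
    qs = urllib.parse.urlencode({"url": domain, "output": "json"})
    return f"{INDEX}?{qs}"

def byte_range_header(record: dict) -> str:
    """Turn an index record into an HTTP Range header value.

    HTTP Range is inclusive on both ends, hence the -1.
    """
    offset, length = int(record["offset"]), int(record["length"])
    return f"bytes={offset}-{offset + length - 1}"

# Each line of the index response parses as one record, e.g.:
record = json.loads('{"filename": "crawl-data/CC-MAIN-2024-10/x.warc.gz", "offset": "1000", "length": "250"}')
```

Fetching `https://data.commoncrawl.org/<filename>` with that Range header retrieves just the one WARC record instead of a multi-gigabyte archive file.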
These data layers complement each other. Live crawling captures current state with full JavaScript rendering. DNS and real-time signals detect changes within minutes. Common Crawl provides historical depth back to 2008. A single-source detection tool has blind spots. We don’t.
Technology detection engine
Raw crawl data is just HTML and headers. Our fingerprint-matching engine turns it into structured technology intelligence across every layer of a website’s stack.

- What we match against
- Confidence scoring
- Full-stack coverage
Detection is performed against a fingerprint database of 5,000+ technology signatures, matching across multiple signal types simultaneously:
| Signal type | Examples |
|---|---|
| HTTP response headers | X-Powered-By, Server, Set-Cookie prefixes, CSP headers, X-Generator |
| HTML source patterns | Meta tags, script src paths, link href values, class/ID conventions |
| JavaScript globals | window.Shopify, window.wp, window.ga, window.__NEXT_DATA__ |
| Cookie names | Platform-specific session cookie naming conventions |
| URL path patterns | /wp-content/, /wp-json/, /_next/, /_nuxt/ |
| robots.txt and sitemap.xml | Generator tags and path structures |
| DNS records | CNAME targets (e.g., *.myshopify.com, *.hubspot.com, *.wpengine.com) |
| TLS certificate SANs | Platform-specific wildcard certificates and issuers |
| Favicon hashes | MD5/perceptual hashes matched to known technology icons |
| Network request destinations | Third-party API calls captured during JavaScript rendering |
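Multi-signal matching with a combined confidence score can be sketched as below. The two signatures and the per-signal weights are hypothetical, reduced from the 5,000+ entries the page describes; they reuse signal examples from the table above.

```python
# Hypothetical signatures and weights for illustration only.
SIGNATURES = {
    "Shopify": {"js_global": "window.Shopify", "cname_suffix": ".myshopify.com"},
    "WordPress": {"js_global": "window.wp", "html_pattern": "/wp-content/"},
}
WEIGHTS = {"js_global": 40, "cname_suffix": 35, "html_pattern": 25}

def detect(page: dict) -> dict[str, int]:
    """Return {technology: confidence 0-100}; unmatched signatures are omitted."""
    results: dict[str, int] = {}
    for tech, sig in SIGNATURES.items():
        score = 0
        if sig.get("js_global") in page.get("js_globals", []):
            score += WEIGHTS["js_global"]
        # "\0" defaults make absent signals never match.
        if any(c.endswith(sig.get("cname_suffix", "\0")) for c in page.get("cnames", [])):
            score += WEIGHTS["cname_suffix"]
        if sig.get("html_pattern", "\0") in page.get("html", ""):
            score += WEIGHTS["html_pattern"]
        if score:
            results[tech] = min(score, 100)
    return results
```

Agreement across independent signal types (here, a JavaScript global plus a DNS CNAME) yields a higher confidence than any single match alone.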
Reliable access at scale
Accurate technology detection requires seeing what real users see, not a bot-blocked error page. Here’s how our crawl infrastructure handles that at scale.

Real browser rendering
Every crawl uses full Chromium instances with JavaScript execution, DOM rendering, and network request capture. This is how we detect technologies that only load after JavaScript execution, including single-page apps, lazy-loaded scripts, and client-side frameworks.
Distributed proxy network
A heterogeneous pool of 40M+ IPs spanning residential, mobile, datacenter, and ISP proxies ensures reliable access across all website types. Requests are geo-routed through the target site’s primary market geography for natural traffic patterns.
Intelligent request management
Per-domain rate limiting, session consistency, and IP cool-down periods ensure our crawlers behave as responsible actors on the web. Crawl rates are throttled to avoid impacting site performance.
Challenge handling
JavaScript challenges and proof-of-work verifications are handled natively by our real browser fleet, ensuring we capture accurate technology data even from sites with advanced security configurations.
Data storage and serving
Collected data flows through a purpose-built pipeline optimized for both real-time serving and historical analysis.

| Layer | Purpose |
|---|---|
| Raw crawl storage | Immutable crawl snapshots stored in columnar format for efficient historical queries |
| Technology index | Fast multi-field search and filtering across all detected technologies |
| Company graph | Firmographic data and technology timelines with time-series optimizations |
| Real-time event stream | DNS changes, CT log events, and crawl triggers processed with sub-minute latency |
| Cache layer | Hot domain lookups and API rate limiting for sub-200ms response times |
| Search API | Customer-facing REST API with bearer token authentication |
API response time
Under 200ms at p99. Powered by a distributed cache layer and optimized query serving infrastructure.
Data freshness
High-value domains refreshed every 24–72 hours. DNS and CT log signals processed within 15 minutes of change.
Scale at a glance
| Metric | Value |
|---|---|
| Domains in index | 500M+ |
| Daily crawl volume | 5M+ URLs |
| DNS change events processed/day | 50M+ |
| Proxy pool size | 40M+ IPs |
| Technology fingerprints | 5,000+ |
| Historical data coverage | 2008 – present |
| Detection latency (live trigger) | Under 15 minutes |
| API response time (p99) | Under 200ms |
Compliance and ethics
Our crawling infrastructure is a responsible actor on the public web.

Respectful crawling
- Crawl rates throttled per domain (max 1 request/second by default)
- robots.txt directives respected for non-public data signals
- No authentication-gated or private content is crawled
Data handling
- All data collected from the public-facing web layer only
- Domain owner opt-out requests honored within 48 hours
- SOC 2 Type II certified, GDPR and CCPA compliant
- Data encrypted at rest and in transit
Related pages
Why TechnologyChecker?
Full-stack detection, historical data back to 2008, and verified contacts. See how we compare to alternatives.
Technologies
Browse the full catalog of 40,000+ tracked technologies with market share and growth data.
Domain lookup
Scan any website to see its full technology stack with confidence scores and detection timeline.
API reference
Full REST API documentation with interactive examples and endpoint-level credit costs.
Frequently asked questions
How does TechnologyChecker detect backend technologies?
Backend technologies are invisible to browser extensions. TechnologyChecker uses HTTP header analysis, DNS fingerprinting, TLS certificate inspection, and infrastructure probing to identify server frameworks, databases, CDNs, and standalone SaaS platforms. See full-stack detection for the full list of detection methods.
How fresh is the data?
Data freshness depends on the domain tier. High-value domains are recrawled every 24–72 hours. DNS and certificate transparency changes are detected within 15 minutes. Historical data from Common Crawl extends back to 2008 with monthly granularity. Every API response includes a last_crawled timestamp so you know exactly when the data was collected.

What is a confidence score?
Every technology detection includes a score from 0–100 indicating detection reliability. Multi-signal detections (e.g., a DNS CNAME + HTTP header + JavaScript global all pointing to the same technology) score higher than single-signal matches. Detections below a minimum threshold are excluded to minimize false positives. Use confidence scores to filter results in your lead scoring models.
Can I opt my domain out of TechnologyChecker's index?
Yes. Domain owners can request removal by contacting support@TechnologyChecker.io. Opt-out requests are processed within 48 hours. TechnologyChecker only collects data from the public-facing web — no authentication-gated or private content is ever crawled.