TechnologyChecker runs a multi-layered, distributed crawling infrastructure that detects technology stacks across hundreds of millions of websites, both in real time and historically. We combine live web crawling, passive monitoring, HTTP log analysis, certificate transparency tracking, and historical data mining via Common Crawl.

  • 500M+ domains in index
  • 5M+ URLs crawled daily
  • 40M+ IP proxy pool
  • ~15 min detection latency

How we collect data

We don’t rely on a single data source. Multiple independent collection layers feed into a unified detection engine. If one signal source goes dark for a domain, others fill the gap.
1. Live web crawling

A distributed crawl fleet continuously scans the web using real browser rendering. Every crawl captures HTTP headers, HTML source, executed JavaScript, network requests, and the rendered DOM. This gives us visibility into technologies that static fetchers miss entirely.
Crawl prioritization:
Tier | Domains | Crawl frequency
Tier 1 (Real-time) | High-value and high-traffic domains | Every 24–72 hours
Tier 2 (Standard) | Mid-traffic domains | Weekly
Tier 3 (Batch) | Long-tail domains | Monthly or on-demand
The crawler continuously re-prioritizes domains based on technology change velocity, traffic rank signals, and inbound link graph changes. Domains that change stacks frequently get crawled more often.
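The re-prioritization logic can be sketched roughly as follows. This is an illustrative model, not TechnologyChecker's actual scheduler: the tier intervals mirror the table above, while `change_velocity` and the 75% interval-shrink factor are assumptions made up for the example.

```python
from datetime import datetime, timedelta

# Base crawl intervals per tier, mirroring the table above.
TIER_INTERVALS = {
    1: timedelta(hours=24),   # Tier 1: every 24-72 hours
    2: timedelta(days=7),     # Tier 2: weekly
    3: timedelta(days=30),    # Tier 3: monthly
}

def next_crawl_time(tier: int, last_crawl: datetime,
                    change_velocity: float) -> datetime:
    """Schedule a domain's next crawl.

    change_velocity is a 0.0-1.0 score of how often the domain's
    detected stack has changed recently; high-velocity domains are
    re-crawled sooner than their tier's base interval.
    """
    base = TIER_INTERVALS[tier]
    # Shrink the interval by up to 75% for fast-changing domains
    # (the 75% cap is an illustrative choice).
    factor = 1.0 - 0.75 * max(0.0, min(1.0, change_velocity))
    return last_crawl + base * factor
```

A stable Tier 2 domain keeps its weekly cadence, while one that just swapped its CMS is pulled forward to roughly every other day.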
2. Passive DNS monitoring

We ingest passive DNS feeds to track infrastructure changes across the entire domain namespace in near real-time:
  • CNAME chain changes: a domain moving from *.wpengine.com to *.kinsta.cloud signals a hosting migration
  • NS record changes: switching from Route53 to Cloudflare DNS reveals infrastructure decisions
  • MX record changes: email platform migrations (e.g., Google Workspace to Microsoft 365)
  • A/AAAA record changes: IP lookups reveal CDN and hosting changes
  • TXT record additions: SPF/DKIM records reveal new SaaS tool adoptions (e.g., include:sendgrid.net, include:mailchimp.com)
DNS events trigger re-crawls on significant record mutations, typically within 15 minutes of a DNS change.
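The record-mutation check that drives these re-crawls amounts to diffing successive DNS snapshots of a domain. A minimal stdlib sketch (the snapshot shape and function names are assumptions for illustration):

```python
# Record types we diff on each passive-DNS observation,
# matching the signal list above.
WATCHED_TYPES = {"CNAME", "NS", "MX", "A", "AAAA", "TXT"}

def dns_mutations(previous: dict, current: dict) -> dict:
    """Compare two snapshots of a domain's DNS records.

    Each snapshot maps a record type (e.g. "CNAME") to a set of
    values. Returns the record types whose value sets changed,
    as (added, removed) pairs; any non-empty result would
    trigger a re-crawl of the domain.
    """
    changed = {}
    for rtype in WATCHED_TYPES:
        old = previous.get(rtype, set())
        new = current.get(rtype, set())
        if old != new:
            changed[rtype] = (new - old, old - new)
    return changed
```

A CNAME move from `site.wpengine.com` to `site.kinsta.cloud`, for example, surfaces as a single `CNAME` mutation with one value added and one removed.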
3. Additional real-time signals

Beyond DNS, we monitor multiple supplementary data streams to detect new domains and technology changes as they happen:
  • TLS certificate issuances: new certificates from known platform CAs trigger domain discovery and crawl scheduling
  • HTTP infrastructure signals: reverse proxy headers and response profile changes reveal CDN, hosting, and stack migrations
  • New domain detection: newly observed domains are automatically queued for a full technology crawl
These signals fill the gaps between scheduled crawls, keeping our index current within minutes of real-world changes.
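The certificate-issuance path can be sketched as matching a new certificate's SANs against known platform wildcard patterns. The pattern list below is illustrative, not TechnologyChecker's actual signature set:

```python
import fnmatch

# Illustrative wildcard SAN patterns for known hosting platforms.
PLATFORM_SAN_PATTERNS = {
    "*.myshopify.com": "Shopify",
    "*.netlify.app": "Netlify",
    "*.vercel.app": "Vercel",
}

def domains_to_queue(cert_sans: list) -> list:
    """Given the SANs from one newly issued certificate, return
    (domain, suspected_platform) pairs worth scheduling for a crawl."""
    hits = []
    for san in cert_sans:
        for pattern, platform in PLATFORM_SAN_PATTERNS.items():
            if fnmatch.fnmatch(san, pattern):
                hits.append((san, platform))
    return hits
```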
4. Common Crawl archives (historical data)

For historical technology timelines going back to 2008, we process Common Crawl datasets (CC-MAIN): petabyte-scale crawl archives of the public web.
How it works:
  1. We query the Common Crawl Index to retrieve record pointers for target domains across historical crawl dates — without downloading full archive files
  2. Only the relevant byte ranges are fetched using HTTP Range requests, minimizing data transfer
  3. HTML and headers are extracted and fed into our standard detection engine
  4. Detected technologies are stitched into a per-domain timeline showing when a site adopted or dropped each technology
Common Crawl data enriches approximately 35% of our historical technology timeline records, providing adoption context that no live-only crawler can match.
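Steps 1 and 2 above can be sketched with the standard library. The Common Crawl CDX index returns, per captured URL, the WARC file path plus the byte offset and length of that single record; an HTTP Range request then pulls only those bytes. The specific crawl id below is an example, and no network call is made in this sketch:

```python
import urllib.parse
import urllib.request

# One CC-MAIN crawl's CDX index endpoint (example crawl id).
CDX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

def build_index_query(domain: str) -> str:
    """Index URL listing all captures of a domain in this crawl."""
    params = urllib.parse.urlencode({
        "url": f"{domain}/*",
        "output": "json",
    })
    return f"{CDX_ENDPOINT}?{params}"

def build_range_request(warc_path: str, offset: int, length: int):
    """HTTP request fetching just one record's bytes (step 2)."""
    url = "https://data.commoncrawl.org/" + warc_path
    req = urllib.request.Request(url)
    # HTTP byte ranges are inclusive on both ends, hence length - 1.
    req.add_header("Range", f"bytes={offset}-{offset + length - 1}")
    return req
```

The fetched record is a gzipped WARC segment whose HTML and headers then go through the same detection engine as live crawls.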
These data layers complement each other. Live crawling captures current state with full JavaScript rendering. DNS and real-time signals detect changes within minutes. Common Crawl provides historical depth back to 2008. A single-source detection tool has blind spots. We don’t.

Technology detection engine

Raw crawl data is just HTML and headers. Our fingerprint matching engine turns it into structured technology intelligence across every layer of a website’s stack.
Detection is performed against a fingerprint database of 5,000+ technology signatures, matching across multiple signal types simultaneously:
Signal type | Examples
HTTP response headers | X-Powered-By, Server, Set-Cookie prefixes, CSP headers, X-Generator
HTML source patterns | Meta tags, script src paths, link href values, class/ID conventions
JavaScript globals | window.Shopify, window.wp, window.ga, window.__NEXT_DATA__
Cookie names | Platform-specific session cookie naming conventions
URL path patterns | /wp-content/, /wp-json/, /_next/, /_nuxt/
robots.txt and sitemap.xml | Generator tags and path structures
DNS records | CNAME targets (e.g., *.myshopify.com, *.hubspot.com, *.wpengine.com)
TLS certificate SANs | Platform-specific wildcard certificates and issuers
Favicon hashes | MD5/perceptual hashes matched to known technology icons
Network request destinations | Third-party API calls captured during JavaScript rendering
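In miniature, multi-signal matching looks like the sketch below. The two signatures are illustrative stand-ins for the 5,000+ entry database, and the signal shapes are assumptions for the example:

```python
import re

# Two illustrative signatures spanning several signal types.
FINGERPRINTS = {
    "WordPress": {
        "headers": {"X-Powered-By": re.compile(r"WordPress", re.I)},
        "html": [re.compile(r"/wp-content/")],
        "js_globals": ["wp"],
    },
    "Shopify": {
        "headers": {},
        "html": [re.compile(r"cdn\.shopify\.com")],
        "js_globals": ["Shopify"],
    },
}

def detect(headers: dict, html: str, js_globals: set) -> dict:
    """Return {technology: matched_signal_count} for one crawl,
    checking every signal type simultaneously."""
    results = {}
    for tech, sig in FINGERPRINTS.items():
        hits = 0
        for header, pattern in sig["headers"].items():
            if pattern.search(headers.get(header, "")):
                hits += 1
        hits += sum(1 for p in sig["html"] if p.search(html))
        hits += sum(1 for g in sig["js_globals"] if g in js_globals)
        if hits:
            results[tech] = hits
    return results
```

The per-technology hit count is what later feeds the confidence score: more independent signals agreeing means a stronger detection.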

Reliable access at scale

Accurate technology detection requires seeing what real users see, not a bot-blocked error page. Here’s how our crawl infrastructure handles that at scale.

Real browser rendering

Every crawl uses full Chromium instances with JavaScript execution, DOM rendering, and network request capture. This is how we detect technologies that only load after JavaScript execution, including single-page apps, lazy-loaded scripts, and client-side frameworks.

Distributed proxy network

A heterogeneous pool of 40M+ IPs spanning residential, mobile, datacenter, and ISP proxies ensures reliable access across all website types. Requests are geo-routed through the target site’s primary market geography for natural traffic patterns.

Intelligent request management

Per-domain rate limiting, session consistency, and IP cool-down periods ensure our crawlers behave as responsible actors on the web. Crawl rates are throttled to avoid impacting site performance.
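Per-domain throttling of this kind is commonly implemented as a token bucket; a minimal in-process sketch (the class and defaults are illustrative, not TechnologyChecker's implementation):

```python
import time

class DomainRateLimiter:
    """Token-bucket limiter enforcing a per-domain request rate
    (the default elsewhere in this document is 1 request/second)."""

    def __init__(self, rate_per_sec=1.0, burst=1):
        self.rate = rate_per_sec
        self.burst = burst
        self._buckets = {}  # domain -> (tokens, last_seen_ts)

    def allow(self, domain, now=None):
        """True if a request to this domain may go out now."""
        now = time.monotonic() if now is None else now
        tokens, last = self._buckets.get(domain, (self.burst, now))
        # Refill tokens for the time elapsed, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[domain] = (tokens - 1.0, now)
            return True
        self._buckets[domain] = (tokens, now)
        return False
```

Each domain gets its own bucket, so heavy crawling of one site never starves or accelerates requests to another.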

Challenge handling

JavaScript challenges and proof-of-work verifications are handled natively by our real browser fleet, ensuring we capture accurate technology data even from sites with advanced security configurations.

Data storage and serving

Collected data flows through a purpose-built pipeline optimized for both real-time serving and historical analysis.
Layer | Purpose
Raw crawl storage | Immutable crawl snapshots stored in columnar format for efficient historical queries
Technology index | Fast multi-field search and filtering across all detected technologies
Company graph | Firmographic data and technology timelines with time-series optimizations
Real-time event stream | DNS changes, CT log events, and crawl triggers processed with sub-minute latency
Cache layer | Hot domain lookups and API rate limiting for sub-200ms response times
Search API | Customer-facing REST API with bearer token authentication

API response time

Under 200ms at p99. Powered by a distributed cache layer and optimized query serving infrastructure.
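The hot-domain cache behaves roughly like an LRU cache with a freshness TTL, so stale technology data falls through to the index instead of being served. A single-process sketch (sizes and TTL are illustrative; the production layer is distributed):

```python
import time
from collections import OrderedDict

class TTLCache:
    """LRU cache with per-entry expiry for hot domain lookups."""

    def __init__(self, max_items=100_000, ttl_sec=300.0):
        self.max_items = max_items
        self.ttl = ttl_sec
        self._data = OrderedDict()  # domain -> (value, stored_at)

    def get(self, domain, now=None):
        now = time.monotonic() if now is None else now
        item = self._data.get(domain)
        if item is None or now - item[1] > self.ttl:
            return None  # miss or expired: caller queries the index
        self._data.move_to_end(domain)  # mark as recently used
        return item[0]

    def put(self, domain, value, now=None):
        now = time.monotonic() if now is None else now
        self._data[domain] = (value, now)
        self._data.move_to_end(domain)
        while len(self._data) > self.max_items:
            self._data.popitem(last=False)  # evict least recently used
```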

Data freshness

High-value domains refreshed every 24–72 hours. DNS and CT log signals processed within 15 minutes of change.

Scale at a glance

Metric | Value
Domains in index | 500M+
Daily crawl volume | 5M+ URLs
DNS change events processed/day | 50M+
Proxy pool size | 40M+ IPs
Technology fingerprints | 5,000+
Historical data coverage | 2008 – present
Detection latency (live trigger) | Under 15 minutes
API response time (p99) | Under 200ms

Compliance and ethics

Our crawling infrastructure is a responsible actor on the public web.

Respectful crawling

  • Crawl rates throttled per domain (max 1 request/second by default)
  • robots.txt directives respected for non-public data signals
  • No authentication-gated or private content is crawled
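Checking a URL against a site's robots.txt before crawling can be done with Python's standard library; a small sketch (the user-agent string and robots.txt body are made up for the example):

```python
from urllib import robotparser

# A robots.txt body fetched earlier; no network in this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_crawl(path, agent="ExampleCrawlerBot"):
    """Check a URL path against the site's robots.txt directives."""
    return parser.can_fetch(agent, path)
```

The same parser also exposes `crawl_delay()`, which a polite crawler feeds into its per-domain rate limiter.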

Data handling

  • All data collected from the public-facing web layer only
  • Domain owner opt-out requests honored within 48 hours
  • SOC 2 Type II certified, GDPR and CCPA compliant
  • Data encrypted at rest and in transit
If you’re a domain owner and want to opt out of TechnologyChecker’s index, contact support@TechnologyChecker.io and we’ll process your request within 48 hours.

Frequently asked questions

How can TechnologyChecker detect backend technologies?
Backend technologies are invisible to browser extensions. TechnologyChecker uses HTTP header analysis, DNS fingerprinting, TLS certificate inspection, and infrastructure probing to identify server frameworks, databases, CDNs, and standalone SaaS platforms. See full-stack detection for the full list of detection methods.
How fresh is the data?
Data freshness depends on the domain tier. High-value domains are recrawled every 24–72 hours. DNS and certificate transparency changes are detected within 15 minutes. Historical data from Common Crawl extends back to 2008 with monthly granularity. Every API response includes a last_crawled timestamp so you know exactly when the data was collected.
How do confidence scores work?
Every technology detection includes a score from 0–100 indicating detection reliability. Multi-signal detections (e.g., a DNS CNAME + HTTP header + JavaScript global all pointing to the same technology) score higher than single-signal matches. Detections below a minimum threshold are excluded to minimize false positives. Use confidence scores to filter results in your lead scoring models.
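Filtering on multi-signal confidence can be sketched as below. The weights, signal names, and threshold are illustrative assumptions, not TechnologyChecker's actual values:

```python
# Illustrative per-signal weights (not the real internal weighting).
SIGNAL_WEIGHTS = {
    "dns_cname": 40,
    "http_header": 30,
    "js_global": 25,
    "html_pattern": 15,
    "cookie_name": 10,
}
MIN_CONFIDENCE = 30  # detections below this are dropped

def confidence(signals):
    """Combine independent signal hits into a 0-100 score."""
    return min(100, sum(SIGNAL_WEIGHTS.get(s, 0) for s in signals))

def filter_detections(detections):
    """Keep only technologies whose combined score clears the bar."""
    return {tech: confidence(sigs)
            for tech, sigs in detections.items()
            if confidence(sigs) >= MIN_CONFIDENCE}
```

A DNS CNAME plus a JavaScript global comfortably clears the bar, while a lone cookie-name match is dropped as too weak.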
Can domain owners opt out of the index?
Yes. Domain owners can request removal by contacting support@TechnologyChecker.io. Opt-out requests are processed within 48 hours. TechnologyChecker only collects data from the public-facing web — no authentication-gated or private content is ever crawled.