TechnologyChecker runs a multi-layered, distributed crawling infrastructure that detects technology stacks across hundreds of millions of websites, both in real time and historically. We combine live web crawling, passive monitoring, HTTP log analysis, certificate transparency tracking, and historical data mining via Common Crawl.

  • 500M+ domains in index
  • 5M+ URLs crawled daily
  • 40M+ IP proxy pool
  • ~15 min detection latency

How we collect data

We don’t rely on a single data source. Multiple independent collection layers feed into a unified detection engine. If one signal source goes dark for a domain, others fill the gap.
1. Live web crawling

A distributed crawl fleet continuously scans the web using real browser rendering. Every crawl captures HTTP headers, HTML source, executed JavaScript, network requests, and the rendered DOM. This gives us visibility into technologies that static fetchers miss entirely.
Crawl prioritization:
Tier | Domains | Crawl frequency
Tier 1 (Real-time) | High-value and high-traffic domains | Every 24–72 hours
Tier 2 (Standard) | Mid-traffic domains | Weekly
Tier 3 (Batch) | Long-tail domains | Monthly or on-demand
The crawler continuously re-prioritizes domains based on technology change velocity, traffic rank signals, and inbound link graph changes. Domains that change stacks frequently get crawled more often.
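The re-prioritization logic can be sketched roughly as follows. This is an illustrative model, not TechnologyChecker's actual scheduler: the tier intervals mirror the table above, while `change_velocity` and the 75% interval-shrink factor are assumptions made up for the example.

```python
from datetime import datetime, timedelta

# Base crawl intervals per tier, mirroring the table above.
TIER_INTERVALS = {
    1: timedelta(hours=24),   # Tier 1: every 24-72 hours
    2: timedelta(days=7),     # Tier 2: weekly
    3: timedelta(days=30),    # Tier 3: monthly
}

def next_crawl_time(tier: int, last_crawl: datetime,
                    change_velocity: float) -> datetime:
    """Schedule a domain's next crawl.

    change_velocity is a 0.0-1.0 score of how often the domain's
    detected stack has changed recently; high-velocity domains are
    re-crawled sooner than their tier's base interval.
    """
    base = TIER_INTERVALS[tier]
    # Shrink the interval by up to 75% for fast-changing domains
    # (the 75% cap is an illustrative choice).
    factor = 1.0 - 0.75 * max(0.0, min(1.0, change_velocity))
    return last_crawl + base * factor
```

A stable Tier 2 domain keeps its weekly cadence, while one that just swapped its CMS is pulled forward to roughly every other day.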
2. Passive DNS monitoring

We ingest passive DNS feeds to track infrastructure changes across the entire domain namespace in near real-time:
  • CNAME chain changes: a domain moving from *.wpengine.com to *.kinsta.cloud signals a hosting migration
  • NS record changes: switching from Route53 to Cloudflare DNS reveals infrastructure decisions
  • MX record changes: email platform migrations (e.g., Google Workspace to Microsoft 365)
  • A/AAAA record changes: IP lookups reveal CDN and hosting changes
  • TXT record additions: SPF/DKIM records reveal new SaaS tool adoptions (e.g., include:sendgrid.net, include:mailchimp.com)
DNS events trigger re-crawls on significant record mutations, typically within 15 minutes of a DNS change.
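The record-mutation check that drives these re-crawls amounts to diffing successive DNS snapshots of a domain. A minimal stdlib sketch (the snapshot shape and function names are assumptions for illustration):

```python
# Record types we diff on each passive-DNS observation,
# matching the signal list above.
WATCHED_TYPES = {"CNAME", "NS", "MX", "A", "AAAA", "TXT"}

def dns_mutations(previous: dict, current: dict) -> dict:
    """Compare two snapshots of a domain's DNS records.

    Each snapshot maps a record type (e.g. "CNAME") to a set of
    values. Returns the record types whose value sets changed,
    as (added, removed) pairs; any non-empty result would
    trigger a re-crawl of the domain.
    """
    changed = {}
    for rtype in WATCHED_TYPES:
        old = previous.get(rtype, set())
        new = current.get(rtype, set())
        if old != new:
            changed[rtype] = (new - old, old - new)
    return changed
```

A CNAME move from `site.wpengine.com` to `site.kinsta.cloud`, for example, surfaces as a single `CNAME` mutation with one value added and one removed.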
3. Additional real-time signals

Beyond DNS, we monitor multiple supplementary data streams to detect new domains and technology changes as they happen:
  • TLS certificate issuances: new certificates from known platform CAs trigger domain discovery and crawl scheduling
  • HTTP infrastructure signals: reverse proxy headers and response profile changes reveal CDN, hosting, and stack migrations
  • New domain detection: newly observed domains are automatically queued for a full technology crawl
These signals fill the gaps between scheduled crawls, keeping our index current within minutes of real-world changes.
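The certificate-issuance path can be sketched as matching a new certificate's SANs against known platform wildcard patterns. The pattern list below is illustrative, not TechnologyChecker's actual signature set:

```python
import fnmatch

# Illustrative wildcard SAN patterns for known hosting platforms.
PLATFORM_SAN_PATTERNS = {
    "*.myshopify.com": "Shopify",
    "*.netlify.app": "Netlify",
    "*.vercel.app": "Vercel",
}

def domains_to_queue(cert_sans: list) -> list:
    """Given the SANs from one newly issued certificate, return
    (domain, suspected_platform) pairs worth scheduling for a crawl."""
    hits = []
    for san in cert_sans:
        for pattern, platform in PLATFORM_SAN_PATTERNS.items():
            if fnmatch.fnmatch(san, pattern):
                hits.append((san, platform))
    return hits
```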
4. Common Crawl archives (historical data)

For historical technology timelines going back to 2008, we process Common Crawl datasets (CC-MAIN): petabyte-scale crawl archives of the public web.
How it works:
  1. We query the Common Crawl Index to retrieve record pointers for target domains across historical crawl dates — without downloading full archive files
  2. Only the relevant byte ranges are fetched using HTTP Range requests, minimizing data transfer
  3. HTML and headers are extracted and fed into our standard detection engine
  4. Detected technologies are stitched into a per-domain timeline showing when a site adopted or dropped each technology
Common Crawl data enriches approximately 35% of our historical technology timeline records, providing adoption context that no live-only crawler can match.
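Steps 1 and 2 above can be sketched with the standard library. The Common Crawl CDX index returns, per captured URL, the WARC file path plus the byte offset and length of that single record; an HTTP Range request then pulls only those bytes. The specific crawl id below is an example, and no network call is made in this sketch:

```python
import urllib.parse
import urllib.request

# One CC-MAIN crawl's CDX index endpoint (example crawl id).
CDX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

def build_index_query(domain: str) -> str:
    """Index URL listing all captures of a domain in this crawl."""
    params = urllib.parse.urlencode({
        "url": f"{domain}/*",
        "output": "json",
    })
    return f"{CDX_ENDPOINT}?{params}"

def build_range_request(warc_path: str, offset: int, length: int):
    """HTTP request fetching just one record's bytes (step 2)."""
    url = "https://data.commoncrawl.org/" + warc_path
    req = urllib.request.Request(url)
    # HTTP byte ranges are inclusive on both ends, hence length - 1.
    req.add_header("Range", f"bytes={offset}-{offset + length - 1}")
    return req
```

The fetched record is a gzipped WARC segment whose HTML and headers then go through the same detection engine as live crawls.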
These data layers complement each other. Live crawling captures current state with full JavaScript rendering. DNS and real-time signals detect changes within minutes. Common Crawl provides historical depth back to 2008. A single-source detection tool has blind spots. We don’t.

Technology detection engine

Raw crawl data is just HTML and headers. Our fingerprint matching engine turns it into structured technology intelligence across every layer of a website’s stack.
Detection is performed against a fingerprint database of 5,000+ technology signatures, matching across multiple signal types simultaneously:
Signal type | Examples
HTTP response headers | X-Powered-By, Server, Set-Cookie prefixes, CSP headers, X-Generator
HTML source patterns | Meta tags, script src paths, link href values, class/ID conventions
JavaScript globals | window.Shopify, window.wp, window.ga, window.__NEXT_DATA__
Cookie names | Platform-specific session cookie naming conventions
URL path patterns | /wp-content/, /wp-json/, /_next/, /_nuxt/
robots.txt and sitemap.xml | Generator tags and path structures
DNS records | CNAME targets (e.g., *.myshopify.com, *.hubspot.com, *.wpengine.com)
TLS certificate SANs | Platform-specific wildcard certificates and issuers
Favicon hashes | MD5/perceptual hashes matched to known technology icons
Network request destinations | Third-party API calls captured during JavaScript rendering
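In miniature, multi-signal matching looks like the sketch below. The two signatures are illustrative stand-ins for the 5,000+ entry database, and the signal shapes are assumptions for the example:

```python
import re

# Two illustrative signatures spanning several signal types.
FINGERPRINTS = {
    "WordPress": {
        "headers": {"X-Powered-By": re.compile(r"WordPress", re.I)},
        "html": [re.compile(r"/wp-content/")],
        "js_globals": ["wp"],
    },
    "Shopify": {
        "headers": {},
        "html": [re.compile(r"cdn\.shopify\.com")],
        "js_globals": ["Shopify"],
    },
}

def detect(headers: dict, html: str, js_globals: set) -> dict:
    """Return {technology: matched_signal_count} for one crawl,
    checking every signal type simultaneously."""
    results = {}
    for tech, sig in FINGERPRINTS.items():
        hits = 0
        for header, pattern in sig["headers"].items():
            if pattern.search(headers.get(header, "")):
                hits += 1
        hits += sum(1 for p in sig["html"] if p.search(html))
        hits += sum(1 for g in sig["js_globals"] if g in js_globals)
        if hits:
            results[tech] = hits
    return results
```

The per-technology hit count is what later feeds the confidence score: more independent signals agreeing means a stronger detection.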

Reliable access at scale

Accurate technology detection requires seeing what real users see, not a bot-blocked error page. Here’s how our crawl infrastructure handles that at scale.

Real browser rendering

Every crawl uses full Chromium instances with JavaScript execution, DOM rendering, and network request capture. This is how we detect technologies that only load after JavaScript execution, including single-page apps, lazy-loaded scripts, and client-side frameworks.

Distributed proxy network

A heterogeneous pool of 40M+ IPs spanning residential, mobile, datacenter, and ISP proxies ensures reliable access across all website types. Requests are geo-routed through the target site’s primary market geography for natural traffic patterns.

Intelligent request management

Per-domain rate limiting, session consistency, and IP cool-down periods ensure our crawlers behave as responsible actors on the web. Crawl rates are throttled to avoid impacting site performance.
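Per-domain throttling of this kind is commonly implemented as a token bucket; a minimal in-process sketch (the class and defaults are illustrative, not TechnologyChecker's implementation):

```python
import time

class DomainRateLimiter:
    """Token-bucket limiter enforcing a per-domain request rate
    (the default elsewhere in this document is 1 request/second)."""

    def __init__(self, rate_per_sec=1.0, burst=1):
        self.rate = rate_per_sec
        self.burst = burst
        self._buckets = {}  # domain -> (tokens, last_seen_ts)

    def allow(self, domain, now=None):
        """True if a request to this domain may go out now."""
        now = time.monotonic() if now is None else now
        tokens, last = self._buckets.get(domain, (self.burst, now))
        # Refill tokens for the time elapsed, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[domain] = (tokens - 1.0, now)
            return True
        self._buckets[domain] = (tokens, now)
        return False
```

Each domain gets its own bucket, so heavy crawling of one site never starves or accelerates requests to another.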

Challenge handling

JavaScript challenges and proof-of-work verifications are handled natively by our real browser fleet, ensuring we capture accurate technology data even from sites with advanced security configurations.

Data storage and serving

Collected data flows through a purpose-built pipeline optimized for both real-time serving and historical analysis.
Layer | Purpose
Raw crawl storage | Immutable crawl snapshots stored in columnar format for efficient historical queries
Technology index | Fast multi-field search and filtering across all detected technologies
Company graph | Firmographic data and technology timelines with time-series optimizations
Real-time event stream | DNS changes, CT log events, and crawl triggers processed with sub-minute latency
Cache layer | Hot domain lookups and API rate limiting for sub-200ms response times
Search API | Customer-facing REST API with bearer token authentication

API response time

Under 200ms at p99. Powered by a distributed cache layer and optimized query serving infrastructure.
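The hot-domain cache behaves roughly like an LRU cache with a freshness TTL, so stale technology data falls through to the index instead of being served. A single-process sketch (sizes and TTL are illustrative; the production layer is distributed):

```python
import time
from collections import OrderedDict

class TTLCache:
    """LRU cache with per-entry expiry for hot domain lookups."""

    def __init__(self, max_items=100_000, ttl_sec=300.0):
        self.max_items = max_items
        self.ttl = ttl_sec
        self._data = OrderedDict()  # domain -> (value, stored_at)

    def get(self, domain, now=None):
        now = time.monotonic() if now is None else now
        item = self._data.get(domain)
        if item is None or now - item[1] > self.ttl:
            return None  # miss or expired: caller queries the index
        self._data.move_to_end(domain)  # mark as recently used
        return item[0]

    def put(self, domain, value, now=None):
        now = time.monotonic() if now is None else now
        self._data[domain] = (value, now)
        self._data.move_to_end(domain)
        while len(self._data) > self.max_items:
            self._data.popitem(last=False)  # evict least recently used
```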

Data freshness

High-value domains refreshed every 24–72 hours. DNS and CT log signals processed within 15 minutes of change.

Scale at a glance

Metric | Value
Domains in index | 500M+
Daily crawl volume | 5M+ URLs
DNS change events processed/day | 50M+
Proxy pool size | 40M+ IPs
Technology fingerprints | 5,000+
Historical data coverage | 2008 – present
Detection latency (live trigger) | Under 15 minutes
API response time (p99) | Under 200ms

Compliance and ethics

Our crawling infrastructure is a responsible actor on the public web.

Respectful crawling

  • Crawl rates throttled per domain (max 1 request/second by default)
  • robots.txt directives respected for non-public data signals
  • No authentication-gated or private content is crawled
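Checking a URL against a site's robots.txt before crawling can be done with Python's standard library; a small sketch (the user-agent string and robots.txt body are made up for the example):

```python
from urllib import robotparser

# A robots.txt body fetched earlier; no network in this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_crawl(path, agent="ExampleCrawlerBot"):
    """Check a URL path against the site's robots.txt directives."""
    return parser.can_fetch(agent, path)
```

The same parser also exposes `crawl_delay()`, which a polite crawler feeds into its per-domain rate limiter.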

Data handling

  • All data collected from the public-facing web layer only
  • Domain owner opt-out requests honored within 48 hours
  • SOC 2 Type II certified, GDPR and CCPA compliant
  • Data encrypted at rest and in transit
If you’re a domain owner and want to opt out of TechnologyChecker’s index, contact support@TechnologyChecker.io and we’ll process your request within 48 hours.

Frequently asked questions

How can TechnologyChecker detect backend technologies?
Backend technologies are invisible to browser extensions. TechnologyChecker uses HTTP header analysis, DNS fingerprinting, TLS certificate inspection, and infrastructure probing to identify server frameworks, databases, CDNs, and standalone SaaS platforms. See full-stack detection for the full list of detection methods.
How fresh is the data?
Data freshness depends on the domain tier. High-value domains are recrawled every 24–72 hours. DNS and certificate transparency changes are detected within 15 minutes. Historical data from Common Crawl extends back to 2008 with monthly granularity. Every API response includes a last_crawled timestamp so you know exactly when the data was collected.
How do confidence scores work?
Every technology detection includes a score from 0–100 indicating detection reliability. Multi-signal detections (e.g., a DNS CNAME + HTTP header + JavaScript global all pointing to the same technology) score higher than single-signal matches. Detections below a minimum threshold are excluded to minimize false positives. Use confidence scores to filter results in your lead scoring models.
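Filtering on multi-signal confidence can be sketched as below. The weights, signal names, and threshold are illustrative assumptions, not TechnologyChecker's actual values:

```python
# Illustrative per-signal weights (not the real internal weighting).
SIGNAL_WEIGHTS = {
    "dns_cname": 40,
    "http_header": 30,
    "js_global": 25,
    "html_pattern": 15,
    "cookie_name": 10,
}
MIN_CONFIDENCE = 30  # detections below this are dropped

def confidence(signals):
    """Combine independent signal hits into a 0-100 score."""
    return min(100, sum(SIGNAL_WEIGHTS.get(s, 0) for s in signals))

def filter_detections(detections):
    """Keep only technologies whose combined score clears the bar."""
    return {tech: confidence(sigs)
            for tech, sigs in detections.items()
            if confidence(sigs) >= MIN_CONFIDENCE}
```

A DNS CNAME plus a JavaScript global comfortably clears the bar, while a lone cookie-name match is dropped as too weak.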
Can domain owners opt out of the index?
Yes. Domain owners can request removal by contacting support@TechnologyChecker.io. Opt-out requests are processed within 48 hours. TechnologyChecker only collects data from the public-facing web — no authentication-gated or private content is ever crawled.