Our Methodology

We crawl 50M+ domains every month. Full-stack detection: HTTP headers, rendered DOM, JS globals, DNS, TLS, certificate transparency. Below is the signal pipeline, the confidence rubric, and how we make money. No vendor-funded "leader" lists. No frontend-only blind spots.

Who are we?

We're TechnologyChecker.io, a technology intelligence platform for sales, GTM, and analyst teams. We scan 29.9M active domains a month and track 40,000+ technologies across the full stack: frontend frameworks, backend runtimes, CDNs, DNS, server fingerprints, TLS signatures.

We started building in 2023 because the alternatives weren't enough. Browser extensions that only see frontend code. "Market leader" reports that double-count agency-built sites. Detection databases six months out of date. Two years later we launched with full-stack detection, real-time subscriber and churn signals, and a 20-year historical view of the web.

About the 20-year view: the most recent two years come from our own snapshotted crawls with the full signal stack (rendered DOM, JS globals, DNS, TLS handshakes, certificate transparency, and headers). The 18 years before that are reconstructed from public web archives (Common Crawl back to 2008, Internet Archive for earlier coverage, plus adjacent open corpora).

Archives preserve a narrower signal set than a live crawl. We replay detection against what they actually captured: HTTP response headers (Server, X-Powered-By, Set-Cookie names), <meta name="generator"> tags, script and asset src paths (framework bundles, CMS directories), inline HTML markers, and link/stylesheet fingerprints. Anything that depends on JavaScript execution, live DNS resolution, or TLS-handshake inspection is unavailable for historical periods — so server-rendered stacks reconstruct well; client-only SPAs and edge/runtime behavior do not.
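As a rough illustration of replaying detection against archived captures, here is a minimal Python sketch. It uses only the signals an archive preserves: response headers, generator meta tags, and asset paths. The rule tables, the function name detect_from_archive, and the layer labels are assumptions made for this example, not the production rule set.

```python
import re
from html.parser import HTMLParser

# Hypothetical fingerprint rules; a real rule set would be far larger.
HEADER_RULES = {
    "x-powered-by": [(re.compile(r"php", re.I), "PHP"),
                     (re.compile(r"express", re.I), "Express")],
    "server": [(re.compile(r"nginx", re.I), "nginx"),
               (re.compile(r"apache", re.I), "Apache")],
}
GENERATOR_RULES = [(re.compile(r"wordpress", re.I), "WordPress"),
                   (re.compile(r"drupal", re.I), "Drupal")]
SRC_RULES = [(re.compile(r"/wp-content/|/wp-includes/"), "WordPress"),
             (re.compile(r"jquery[.-]", re.I), "jQuery")]


class _MarkupSignals(HTMLParser):
    """Collects generator meta tags and script/link asset paths from static HTML."""

    def __init__(self):
        super().__init__()
        self.generators = []
        self.asset_paths = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "generator":
            self.generators.append(a.get("content") or "")
        elif tag == "script" and a.get("src"):
            self.asset_paths.append(a["src"])
        elif tag == "link" and a.get("href"):
            self.asset_paths.append(a["href"])


def detect_from_archive(headers, html):
    """Return {technology: set of signal layers} using archived signals only."""
    found = {}

    def add(tech, layer):
        found.setdefault(tech, set()).add(layer)

    # Header names are case-insensitive, so normalise before matching.
    lowered = {k.lower(): v for k, v in headers.items()}
    for name, rules in HEADER_RULES.items():
        value = lowered.get(name, "")
        for pattern, tech in rules:
            if pattern.search(value):
                add(tech, "header")

    parser = _MarkupSignals()
    parser.feed(html)
    for generator in parser.generators:
        for pattern, tech in GENERATOR_RULES:
            if pattern.search(generator):
                add(tech, "meta_generator")
    for path in parser.asset_paths:
        for pattern, tech in SRC_RULES:
            if pattern.search(path):
                add(tech, "asset_path")
    return found
```

Nothing here executes JavaScript or touches DNS or TLS, which is why a server-rendered CMS shows up in several layers while a client-only SPA may leave no trace at all.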

Each year is analyzed separately, so emergence dates and adoption inflection points are reconstructable rather than smoothed. Use the long view for trends and timing, not absolute counts — and treat pre-2013 coverage as directional, since archive density, language bias, and PageRank-weighted sampling affect what's recoverable. Where coverage is sparse, we flag it on the chart.

The team behind it

Six people. Engineering, data, growth. Full bios and published research live on the authors page.

Mehmet Suleyman, CEO & Co-founder

Elif Arslan, CMO & Co-founder

David Thomson, CTO

Sophie Clarke, Product Marketing Manager

Emma Davies, Data Analyst

Emre Elbeyoglu, Growth & AI Advisor

What do we look for?

Technology that is widely deployed, verifiable across multiple signal layers (HTTP headers, DOM markers, JS globals, DNS records, server banners, TLS fingerprints), and impactful enough that knowing about it changes how you sell, partner, or build.

Our aim is to be as opinionated as possible — to help you make better technology decisions, faster. Wirecutter, but for technology stacks. 🤔

Detection confidence is graded internally A–F. A technology must score B or higher across at least two independent signal layers before it ships to a live technology page. If a detection is fuzzy, we say so on the page — we don't paper over uncertainty with bold numbers.
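The shipping rule above can be written as a small gate. This is a sketch under stated assumptions: the grade scale is taken to be the letters A through F in order, "B or higher across at least two independent signal layers" is read as "at least two layers each graded B or better", and the function and layer names are hypothetical.

```python
# Assumed ordering of the internal scale; index 0 is the strongest grade.
GRADE_ORDER = "ABCDEF"


def ships_to_live_page(layer_grades, min_grade="B", min_layers=2):
    """Decide whether a detection clears the bar for a live technology page.

    layer_grades maps a signal layer name (for example "http_header")
    to the letter grade that layer earned for this detection.
    """
    threshold = GRADE_ORDER.index(min_grade)
    strong_layers = [
        layer
        for layer, grade in layer_grades.items()
        if GRADE_ORDER.index(grade.upper()) <= threshold
    ]
    return len(strong_layers) >= min_layers
```

For example, a detection graded A on headers and B on DNS passes, while one graded A on headers and C on the DOM does not, however strong the single A is.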

How we create the content.

Every technology profile and blog post here starts as data we collect ourselves. Not a vendor briefing. Not a press release rewrite. Four input streams feed the pipeline.

1. The monthly crawl

Every month we crawl 50M+ domains with a headless browser stack. It executes JavaScript, follows redirects, and captures the full response chain: HTTP headers, rendered DOM, network waterfall, JS globals, DNS answers, TLS handshakes, server banners. After filtering parked, redirected, and dead hosts, ~29.9M resolve to active sites that feed detection counts. Each crawl gives us technology signals (what's installed) and content signals (what the site says it does).

2. Certificate transparency logs

We ingest millions of entries from public certificate transparency logs. CT is the open, append-only ledger every public CA writes to. Subject alternative names, issuer chains, and certificate patterns reveal backend infrastructure that frontend crawls can't see: subdomains behind CDNs, internal API gateways, vendor-issued certs for services like Stripe, Cloudflare Access, or Auth0. CT catches backend adoption days or weeks before any frontend signal does. And because the logs are public, every detection is independently verifiable.
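To make the idea concrete, here is a minimal sketch of how subject alternative names could surface hosts a frontend crawl never reached. It assumes CT entries have already been parsed into dictionaries with a "san" list. That record shape and the function name unseen_hosts_from_ct are hypothetical and do not reflect the format of any particular log or client library.

```python
def unseen_hosts_from_ct(ct_entries, domain, crawled_hosts):
    """Return hostnames under `domain` found in CT SANs but absent from the crawl.

    ct_entries: iterable of dicts, each with a "san" list of DNS names
                (an assumed, already-parsed record shape).
    crawled_hosts: hostnames the frontend crawl has already observed.
    """
    base = domain.lower()
    suffix = "." + base  # leading dot prevents matching "notexample.com"
    seen = {host.lower() for host in crawled_hosts}
    hosts = set()
    for entry in ct_entries:
        for name in entry.get("san", []):
            name = name.lower().rstrip(".")
            if name.startswith("*."):
                continue  # a wildcard names no concrete host
            if (name == base or name.endswith(suffix)) and name not in seen:
                hosts.add(name)
    return sorted(hosts)
```

A host such as api.example.com appearing here but not in the crawl is a lead to investigate, not a detection on its own; it still has to pass the confidence rubric.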

3. External data sources

Our crawl is the spine. External datasets enrich it. We pull from the Cloudflare Radar API for global traffic patterns, AI bot activity, DNS query trends, and outage data. We use Common Crawl and the Internet Archive for historical reconstruction (see "Who are we?" above). Public registries (WHOIS, RDAP, ASN allocations, RIPE) provide ownership and hosting context. Every external source that appears in a chart or blog post is named in the methodology footnote at the bottom of that page.

4. AI content analysis

While crawling, we run each site's rendered content through a structured LLM extraction step. It pulls company-level signals from public copy: stated industry, product category, pricing model, target-customer language ("for SMBs", "enterprise-grade"), team-page headcount cues, careers-page hiring signals, and stack mentions in engineering blogs or footers. These signals feed two things. First, technology profile context ("companies using X skew toward Y industry at Z size"). Second, blog cohort analysis ("of the 8,400 companies using Airtable, 62% describe themselves as operations or finance teams").

The AI step is extraction-only. It summarizes what the page says. It doesn't invent attributes that aren't there. We tag every AI-derived field in our database with its source URL and extraction timestamp, so any number we publish is traceable back to the page it came from.
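A minimal sketch of that provenance tagging, assuming extracted values arrive as a simple name-to-value mapping. The ExtractedField type and tag_fields helper are hypothetical names used for illustration, not the actual database schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ExtractedField:
    """One AI-derived value plus the provenance needed to trace it back."""
    name: str
    value: str
    source_url: str
    extracted_at: datetime


def tag_fields(raw_fields, source_url, now=None):
    """Attach source URL and timestamp to each extracted value.

    Empty values are dropped rather than filled in, mirroring the
    extraction-only rule: if the page does not say it, nothing is stored.
    """
    stamp = now or datetime.now(timezone.utc)
    return [
        ExtractedField(name, value, source_url, stamp)
        for name, value in raw_fields.items()
        if value
    ]
```

Making the record immutable (frozen=True) is a design choice for the sketch: once a value is tied to a URL and a timestamp, later edits should create a new record instead of rewriting history.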

From dataset to published page

Technology profiles come straight from the database (detection counts, adoption timeline, top industries, related stack, contact role distribution). A human editor reviews each one for accuracy and tone before it ships. Blog posts start with a query against the same database, and the dataset behind every chart is exportable so you can verify our work. When a vendor disputes a number, we publish the query that produced it.

From data to insight.

Having data is the easy part. Turning it into something a sales rep or analyst can actually use takes longer, usually a lot longer.

Technology profile pages

Each technology page takes 30 to 40 hours of work before it ships. The raw detection counts come from the database in minutes. Everything else is the slow part.

For every page we run ClickHouse queries against the live 50M+ domain dataset to pull usage history, industry breakdowns, company-size distribution, country splits, and growth rate over time. We map the overlap stack: what else companies run alongside the technology, and what it tends to replace. We pull market-share movement against the closest 5 to 10 competitors. We scrape G2 and Capterra for review sentiment, then cross-check it against our own churn signals. We trace historical adoption back through the Common Crawl-rebuilt 20-year layer to see whether the technology is growing, plateauing, or in decline.
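The overlap-stack step is the easiest of these to show in miniature. The sketch below is plain Python over an in-memory mapping, a stand-in for the ClickHouse queries rather than a copy of them. The function name overlap_stack and the input shape (domain mapped to a set of detected technologies) are assumptions for illustration.

```python
from collections import Counter


def overlap_stack(domain_techs, target, top_n=5):
    """Rank technologies that co-occur with `target`.

    domain_techs: dict mapping a domain to the set of technologies detected on it.
    Returns (technology, share) pairs, where share is the fraction of the
    target's domains that also run that technology.
    """
    with_target = [techs for techs in domain_techs.values() if target in techs]
    if not with_target:
        return []
    counts = Counter(
        tech for techs in with_target for tech in set(techs) if tech != target
    )
    total = len(with_target)
    # Sort by count descending, then name, so ties resolve deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(tech, n / total) for tech, n in ranked[:top_n]]
```

Shares are computed against the target's own install base, not the whole dataset, so a technology that is common everywhere does not automatically dominate the list.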

Then a human writes the analysis and FAQs. Then someone else reviews them. Then the page goes live.

Blog posts

Each blog post takes 15 to 25 hours. They start the same way: a database query against something we can answer that nobody else can. "How many companies switched off Intercom in the last 18 months." "Which industries adopt PostHog before Mixpanel." "What stack correlates with companies that grew headcount 3x in 24 months."

From there we pull the data and cross-reference it in ClickHouse for accuracy. We build the charts, all of which are exportable so you can verify our work. We write the post, have a second person review it editorially, and run the draft through SEO and accessibility checks. Nothing ships the same day it's written.

Author impact & content effort

Every article is signed, and each author writes in their own field — engineering, data, or growth — so the analysis carries first-hand know-how and unique insights, not commodity information rephrased from the rest of the web.

Each author page now makes that work visible: their most-read articles ranked by real reader pageviews, the estimated hours of work behind each post, and earned badges like Most Read and Deepest Research. That hours figure — we call it content effort — comes from each post's length, original visuals, and data: a measure of the real work behind the words, and exactly the kind of effort modern search engines now reward. The 15-to-25 hours above isn't a slogan; it's attributed per author, per post.

Why this matters

Most technology directories are auto-generated from a single API call. Most "industry insight" blog posts are paraphrased web consensus. The difference between thirty seconds of work and thirty hours of work shows up in two places: the depth of the answer and the freshness of the data behind it. We'd rather publish fewer pages and have each one be accurate than ship volume nobody trusts.

What you won't find here.

You won't find inflated detection counts pulled from cached frontend-only scans. You won't find vendor-paid "leader" badges. You won't find rankings that flip to match who paid us last quarter — because no one pays us to rank higher.

Let's be real: how trustworthy is a "top CMS" list maintained by a CMS vendor's content team? Or a "top CDN" ranking written by a CDN's marketing department? Their SEO is excellent because they're billion-dollar companies, and they conveniently rank themselves #1. So yeah, we see what you're doing, and no, those rankings aren't real.

(P.S. did you know most people skim a "top 10" list without ever checking who actually published it?) 🤯

We've also found that most detection databases aren't opinionated because:

They scrape frontend HTML and miss anything that runs server-side — which is now most of the modern web (Edge functions, RSC streaming, server actions, headless API patterns).
They don't publish their methodology, so when their numbers conflict with reality, there's no way to verify whether a missing detection is a bug, a blind spot, or a vendor preference.
They depend on vendor relationships for "verification badges" and can't risk publishing data that contradicts a paying partner.

We're aggressive about all three. Our crawler runs full-stack (TLS fingerprinting, DNS introspection, header inspection, body parsing). Our methodology is on this page. And our rankings come from our database, period — no vendor has the ability to influence them.

Let's talk about the money.

We felt strongly that TechnologyChecker.io should be:

Free to browse. Technology pages, blog posts, and charts have no paywall, no signup wall, and no rate limits for human readers.
Honest. Rankings come from detection counts in our database — not from sponsorship deals or "leader" program payments.
Actionable. After reading a technology page, you should know whether a technology is worth investigating for your stack or your prospects.

How do we make money?

Paid subscriptions. Primary revenue. Free, Pro, and Scale plans for teams who want bulk lookups, real-time subscriber & churn alerts, verified contacts, CRM exports, and access to our 20-year historical dataset (re-analyzed from public web archives — see "Who are we?" above). The free tier covers the use case most individual users have. Pricing is on a public page — no hidden tiers, no "schedule a call" wall to see the number.

API access. Teams integrating our data into their CRM, sales enrichment pipeline, or product analytics pay per-call or via subscription. Same dataset, programmatically. Documentation is public.

Chrome extension. Free, ad-free, no data resale. Install it from the Chrome Web Store. It's our front door, the way most people first meet us, and we have no plans to put it behind a paywall.

Bulk data licensing. Larger teams, agencies, and VC firms occasionally license bulk historical exports or custom segments. We disclose any commercial arrangement that touches editorial content.

What we don't take money for

No paid placements in rankings. A vendor cannot pay us to rank higher in detection counts, market share charts, or "top X" listings. Our rankings come from our crawler, full stop.

No vendor-funded methodology adjustments. If a vendor disputes their detection count, we publish our source query. If they're right about a bug, we fix it and re-run for everyone. If they want a different methodology, they pay nothing — and get the same number.

No sponsored content. Blog posts, technology pages, and charts are never sponsored. If a piece is ever produced under commercial arrangement — for example, a co-marketing data report — it'll be clearly labelled at the top of the page. None of the content live on the site today carries that label.

With these methods we get to stay small, independent, and keep TechnologyChecker.io free for the readers who use us. That's a fair deal — what do you think?

Hope you enjoy.

And even if you don't — we won't stop 👀. Wishy-washy detection databases are out, and unfiltered, fully-cited technology intelligence is here to stay. 🚀

Want the granular version of our editorial process — fact-checking pipeline, correction policy, who reviews each article? See our about page, author profiles, or get in touch. Last updated: June 2026.
