There’s a narrative pushed by some that measuring AI visibility is pointless. That prompt trackers are snake oil. That because LLM responses are probabilistic, any attempt to track them is fundamentally broken.
We disagree. Not because we sell a tracking tool (we do), but because the arguments don’t hold up when you look at them closely. Most of them boil down to the same logical fallacy: confusing “not perfect” with “not useful.”
Let’s go through the most common myths.
Myth 1: LLM responses are inconsistent, so tracking is pointless
Yes, individual LLM responses vary. If you ask ChatGPT the same question twice, you might get different brands mentioned in a different order. Nobody disputes this.
But this argument confuses single-response randomness with aggregate patterns. The same critics who point out this inconsistency also acknowledge that an aggregated visibility percentage (how often a brand appears across many executions) can be stable and meaningful. Rand Fishkin himself found that “City of Hope hospital” appeared in 97% of responses about oncology hospitals on the US west coast, even though its position varied constantly.
An LLM can phrase the same answer in a million ways; what stays comparatively stable is which brands it surfaces.
That’s exactly what prompt trackers measure: patterns over time, not individual responses. Weather is chaotic day-to-day. Climate trends are measurable. Same logic applies here.
What we see consistently at LLM Pulse across our client base is that the volume of brand mentions doesn’t vary dramatically week over week. Great brands get mentioned more, consistently. It’s not as if asking for the “best car” surfaces BMW one day and a completely random brand the next. There is a weighted consistency in how LLMs surface brands, and that consistency is what makes directional tracking valuable.
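To make the aggregation concrete, here’s a minimal sketch of how a visibility percentage can be computed from repeated runs of the same prompt. The responses and brand names below are made up, and the naive substring matching is purely illustrative; production tools need real entity resolution to handle aliases and misspellings.

```python
from collections import Counter

def visibility_rates(responses: list[str], brands: list[str]) -> dict[str, float]:
    """Fraction of responses in which each brand is mentioned at least once."""
    counts = Counter()
    for text in responses:
        lowered = text.lower()
        for brand in brands:
            if brand.lower() in lowered:
                counts[brand] += 1
    return {brand: counts[brand] / len(responses) for brand in brands}

# Toy data: repeated executions of the same prompt collected over a week.
responses = [
    "For west coast oncology, City of Hope and UCSF Medical Center stand out...",
    "Top options include UCSF Medical Center, City of Hope, and Stanford Health Care...",
    "Many patients consider City of Hope first, followed by Stanford Health Care...",
]
print(visibility_rates(responses, ["City of Hope", "UCSF", "Stanford"]))
# City of Hope: 1.0, UCSF: ~0.67, Stanford: ~0.67
```

Each individual response words things differently and orders brands differently, but the per-brand rate computed this way stabilizes as the number of runs grows.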
Myth 2: you can’t replicate the user experience, so don’t bother measuring
You also can’t replicate the exact Google SERP any individual user sees. Personalization, location, device, search history, whether they’re logged in or not… all of these change the results.
Did that stop us from tracking rankings for the past 20 years? No. The industry accepted the limitation and used “clean” baseline measurements as directional signals. That’s exactly what AI visibility tracking does.
Personalization in LLMs exists, but critics overstate its impact on brand-level visibility. When someone asks “best project management tools,” the top recommendations are remarkably consistent across sessions and users. The dominant brands show up regardless of personalization because the model’s training data overwhelmingly associates them with those topics.
The point of tracking has never been to replicate one specific user’s experience. It’s to measure your brand’s baseline presence in the model’s outputs and monitor how that changes over time.
Myth 3: API-based tracking gives wrong results, so all tracking is flawed
This one is actually partly valid, but critics use it to discredit all tools equally, which is misleading.
The Surfer SEO study from December 2025 showed that ChatGPT via API cited an average of 7 sources, while scraping the actual interface returned 16. On Perplexity, the two methods overlapped on only 8% of sources. And 8% of API calls failed to detect mentions that appeared in the real interface.
This is a serious problem, and it’s precisely why at LLM Pulse we check the actual UI, not the API – for all models. We’ve been doing this from day one because the API simply doesn’t reflect what real users see.
Not all tools take this approach. Some prioritize API access for convenience or cost. But lumping all prompt trackers together because some use APIs is like saying all restaurants are bad because some serve frozen food. Check the methodology before dismissing the category.
Myth 4: tracking 50 custom prompts is like polling your friends
This is the argument that you need millions of prompts to measure anything meaningful, and that small prompt sets are statistically worthless.
Let’s unpack what “millions of prompts” actually means in practice.
Ahrefs Brand Radar claims ~190 million prompts per month. Sounds impressive until you look closer: approximately 180 million of those are AI Overviews (essentially a Google feature), and only around 10 million are actual prompts across ChatGPT, Gemini, Perplexity, and Copilot combined. Ahrefs tracks 4.3 billion keywords on Google, so the LLM prompt sample is a tiny fraction of their keyword universe.
Now distribute those 10 million prompts across 200+ countries. For smaller markets (Spain, Netherlands, Sweden, Portugal…), the number of prompts relevant to your specific industry approaches zero. If you’re a SaaS company in the Nordics or a marketplace in Southern Europe, those “millions of prompts” might contain close to nothing about your sector.
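To make “approaches zero” concrete, here’s a rough back-of-the-envelope split using the figures above. The uniform spread across countries and industries is our simplifying assumption, not a reported breakdown; real prompt volume skews heavily toward large English-speaking markets, which makes small markets even thinner.

```python
# Back-of-the-envelope math with an (unrealistically generous) uniform split.
monthly_prompts = 10_000_000   # the non-AI-Overview prompts in the sample
platforms = 4                  # ChatGPT, Gemini, Perplexity, Copilot
countries = 200
industries = 500               # rough order of magnitude

per_country_per_platform = monthly_prompts / platforms / countries
per_industry = per_country_per_platform / industries

print(per_country_per_platform)  # 12,500 prompts per country per platform per month
print(per_industry)              # ~25 prompts per industry, before any topic filtering
```

Even under this generous uniform assumption, a niche B2B category in a mid-sized European market is left with a few dozen prompts a month at best.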
This is why we believe 300 well-chosen prompts about YOUR industry, tracked consistently over time, are significantly more relevant than 10 million generic prompts spread thin across every country and topic imaginable.
The analogy isn’t “national poll vs. asking your friends.” It’s more like: a carefully designed customer survey of your target segment vs. a massive census that barely covers your zip code.
At LLM Pulse we offer 8 different ways to generate prompts: from Reddit discussions, from Google Search Console queries, from keyword research, from competitor analysis, and more. The idea that custom prompts are “invented” or disconnected from real demand is simply not true when you’re deriving them from actual user behavior signals.
Myth 5: visibility scores are meaningless because every tool calculates them differently
By this logic, domain authority metrics are meaningless because Ahrefs (Domain Rating), Moz (Domain Authority), and Semrush (Authority Score) all calculate them differently. Brand awareness metrics are meaningless because Nielsen, Kantar, and YouGov all use different methodologies.
Every analytics tool in existence uses proprietary formulas. The value isn’t in cross-comparing tools. It’s in tracking YOUR trend in ONE tool consistently over time.
Critics sometimes acknowledge this (“what matters is the trend”) and then immediately dismiss it. You can’t have it both ways. Either trends matter or they don’t. If your visibility score goes from 15% to 35% over three months while you’re actively optimizing content, that’s a signal worth paying attention to, regardless of whether the absolute number maps perfectly to reality.
Myth 6: share of voice is arbitrary because it depends on competitor selection
Yes, SoV depends on your competitive set. That’s how Share of Voice works in every industry: advertising, PR, traditional media, SEO.
The denominator isn’t “arbitrary.” It’s a strategic choice. You pick your competitive set based on your market positioning. A boutique hotel chain competes against other boutique chains, not against Marriott and Hilton. Changing competitors changes the number, and that’s a feature, not a bug. It lets you analyze different competitive frames depending on the strategic question you’re asking.
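A minimal illustration, with made-up numbers, of why the denominator is a strategic frame rather than an arbitrary knob:

```python
def share_of_voice(mentions: dict[str, int], competitive_set: list[str], brand: str) -> float:
    """Brand mentions divided by total mentions within the chosen competitive set."""
    total = sum(mentions.get(b, 0) for b in competitive_set)
    return mentions.get(brand, 0) / total if total else 0.0

# Hypothetical mention counts across a batch of tracked responses.
mentions = {"YourBoutique": 40, "BoutiqueCo": 60, "IndieStays": 20, "Marriott": 400, "Hilton": 380}

print(share_of_voice(mentions, ["YourBoutique", "BoutiqueCo", "IndieStays"], "YourBoutique"))  # ~0.33
print(share_of_voice(mentions, list(mentions), "YourBoutique"))                                # ~0.04
```

Same data, two valid answers: roughly 33% against other boutique chains, roughly 4% against the whole hotel category. Which frame matters depends on the strategic question you’re asking.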
Myth 7: there’s no “position 1” in AI, so ranking position is meaningless
Nobody serious claims that position in an LLM response is identical to position on a Google SERP. But mention order, prominence (whether you’re in the first paragraph or the last), and whether you’re recommended vs. merely listed all carry signal about how the model weights your brand for that topic.
We’ve actually started working on a metric we call position distribution, and the early results are interesting: category leaders get recommended in first position far more often than everyone else.
Is it noisier than traditional rank tracking? Absolutely. Is it directional? Also yes.
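For readers who want the mechanics, here’s a rough sketch of what a position-distribution calculation can look like. The brand names are arbitrary examples and this is not the actual LLM Pulse implementation, just the general idea: for each response, record where in the ordered list of mentions a brand shows up.

```python
from collections import Counter

def position_distribution(ordered_mentions: list[list[str]], brand: str) -> Counter:
    """For each response, record the position (1st, 2nd, ...) at which the brand appears."""
    positions = Counter()
    for mentions in ordered_mentions:
        if brand in mentions:
            positions[mentions.index(brand) + 1] += 1
        else:
            positions["absent"] += 1
    return positions

# Ordered brand mentions extracted from three responses (toy data).
runs = [
    ["Asana", "Notion", "ClickUp"],
    ["Notion", "Asana", "Trello"],
    ["Asana", "ClickUp", "Monday"],
]
print(position_distribution(runs, "Asana"))  # Counter({1: 2, 2: 1})
```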
Myth 8: “tracking is a feature, not a company”
Kevin Indig’s quote gets recycled constantly, and it’s worth examining.
Sistrix, Ahrefs, and Semrush all started primarily as rank trackers of some sort. Rank tracking was “just a feature” too, until companies built data moats, domain expertise, and product depth around it.
Early movers in measurement categories tend to build compounding advantages: proprietary datasets, methodology refinements, and integrations that make them hard to replicate. The question isn’t whether tracking is “a feature or a company.” It’s whether a tool delivers actionable value to its users.
It’s also worth considering incentives when people make these claims. We won’t get into personal territory, but, as they say: follow the money.
Myth 9: you can’t tell if the model searched the web or used training data
This is a fair technical point. When an LLM mentions your brand, it might be pulling from its training data (parametric knowledge) or from a live web search (retrieval). The optimization strategy differs: you can influence retrieval through content and SEO, but parametric knowledge only changes with the next training cycle.
However, this doesn’t invalidate the measurement itself. If your brand appears in the response, the user sees it regardless of where it came from. For visibility monitoring, what the user sees is what matters. For optimization strategy, yes, the source matters, and tools should work on surfacing this distinction.
What we do at LLM Pulse is show query fan-out whenever it’s available. For ChatGPT (and increasingly other models), we surface the actual sub-queries the model generated during retrieval. This gives you visibility into not just what the model said, but how it got there. Any serious tracking tool should be doing this.
Myth 10: the market sells certainty where only uncertainty exists
This is a strawman. The serious tools in this space don’t claim certainty. Sistrix openly calls their AI module a beta. At LLM Pulse, we’ve always been transparent that this is directional measurement.
The alternative that critics propose is essentially: don’t measure. Wait until the data is perfect. That’s not a strategy; that’s abdication.
Every measurement tool in marketing operates under uncertainty. Meta literally models conversions it can’t directly measure. GA4 misses 15-40% of users due to ad blockers and privacy browsers. Attribution models are famously approximate. Nobody argues we should stop measuring web analytics because it’s imperfect.
Tracking is just the starting point
The whole “tracking is pointless” debate misses something fundamental: tracking was never the end goal. It’s the foundation for action.
At LLM Pulse, prompt tracking is one piece of a much larger system. What we actually help teams do with that data:
- Content intelligence and creation. Identify what content drives mentions and citations, then create content that fills the gaps. Not guessing, but informed by what models actually surface and cite.
- Actionable recommendations. Based on visibility patterns, we tell you what to fix, what to create, and where to focus. The data feeds a strategy, not a dashboard you stare at.
- Sentiment and reputation tracking. It’s not just whether you’re mentioned, but how. Are models recommending you enthusiastically or mentioning you with caveats? Net sentiment across thousands of responses reveals how LLMs perceive your brand, and that perception shapes what millions of users read about you (see the sketch after this list).
- Helping you get mentioned. We connect visibility gaps to concrete actions: linkbuilding opportunities, digital PR angles, affiliate marketing strategies, and social media signals that increase your brand’s presence in the training data and retrieval sources that LLMs depend on.
- Competitive intelligence at scale. Share of voice, competitor tracking, source analysis… when you’re monitoring 5 models across hundreds of prompts over months, you start seeing patterns that no one-off audit can reveal. Which competitors are gaining ground? Which sources are driving citations? Where is sentiment shifting?
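The net sentiment score mentioned above is simple to sketch once each brand-mentioning response has been classified as positive, neutral, or negative (the classification step itself, whether done with an LLM or a classical model, is out of scope here):

```python
def net_sentiment(labels: list[str]) -> float:
    """(positive - negative) / total, over per-response sentiment labels."""
    pos = labels.count("positive")
    neg = labels.count("negative")
    return (pos - neg) / len(labels) if labels else 0.0

# Hypothetical labels for 1,000 responses that mention the brand.
print(net_sentiment(["positive"] * 600 + ["neutral"] * 300 + ["negative"] * 100))  # 0.5
```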
This is the part critics conveniently ignore. They evaluate tracking tools as if all they do is show you a number. The question isn’t “is this number perfectly accurate?” The question is “does this data, at scale, help me make better decisions about my brand’s presence in AI?” And the answer, for hundreds of companies we work with, is yes.
What actually matters when choosing a tracking tool
Not all prompt trackers are created equal. Here’s what we think matters:
Scraping vs. API. If a tool uses the API instead of the actual interface, the data doesn’t reflect what users see. Full stop. This isn’t a minor methodological difference; it’s a fundamental accuracy issue.
Prompt sourcing. “Synthetic” prompts derived from real demand signals (Search Console, Reddit, keyword data, People Also Ask) are categorically different from prompts someone invented in a brainstorm. Ask your tool how it generates prompts. If the answer is “you type them in,” that’s only one input. You need multiple demand signals.
Model coverage. Tracking one or two LLMs gives you a partial picture. At LLM Pulse we track 5 models (ChatGPT, Perplexity, Gemini, Google AI Overviews, and AI Mode), which gives a much more representative view of how brands surface across the AI ecosystem.
Transparency. Does the tool document its methodology? Does it show you the actual responses, the fan-out queries, the sources cited? Or does it just give you a score?
Directional honesty. Any tool that claims to give you “exact” AI visibility is lying. The honest framing is: this is directional data that helps you identify trends, gaps, and competitive movements over time.
The real risk isn’t imperfect measurement
The companies that will regret their decisions aren’t the ones using imperfect tracking tools. They’re the ones who decided not to track at all because the data wasn’t perfect yet.
We’ve seen this movie before. In 2005, SEO measurement was primitive. Rankings fluctuated. Tools disagreed. The methodology was questionable. The companies that started measuring and optimizing anyway are the ones that dominated organic search for the next decade.
AI search is following the same trajectory. The measurement will get better. The methodology will mature. But the window to build visibility is open now, and directional data beats no data every single time.
LLM Pulse tracks AI visibility across ChatGPT, Perplexity, Gemini, Google AI Mode and Overviews. We use the real interfaces (not APIs), offer 8 ways to generate prompts from real demand signals, and surface query fan-out data whenever it’s available.
