Business cases

How companies use AI to grow.

In its first month the assistant handled 2.3 million conversations — two-thirds of all Klarna support chats — doing the equivalent work of 700 full-time agents. Average resolution time dropped from 11 minutes to under 2 minutes. Repeat inquiries fell by 25% — Klarna attributes this to more accurate answers. Customer satisfaction (CSAT) stayed on par with human agents. The company estimated the impact at $40 million in profit improvement for 2024 and said so publicly, in a press release dated February 27, 2024. Keep the frame in mind: all figures are the company's own reporting for the first month of operation, not externally audited. The story had a sequel that made the case even more instructive. In parallel with the AI rollout, Klarna froze hiring and cut staff in 2024, publicly tying this to its AI strategy. Then in May 2025, CEO Sebastian Siemiatkowski admitted in a Bloomberg interview that cost-driven automation had produced 'lower quality' and announced the return of human agents — in a flexible 'Uber-type' setup with remote hires. 'It's so critical that you are clear to your customer that there will be always a human if you want,' was how he framed the new position. The AI did not go away: as of mid-2025 the assistant still handled about two-thirds of inquiries, and response times were 82% faster than in the pre-AI era. In our view, the Klarna case should be read whole, together with the 2025 reversal — that is where it is most useful. Part one proved that an LLM assistant on a narrow first line can genuinely hold two-thirds of the volume with human-level CSAT — validated at a scale of millions of dialogues and transferable to any business with a large flow of routine inquiries. Part two showed the price of radicalism: the metrics Klarna published measured speed and volume but not brand trust or the quality of complex cases — and that is exactly where the debt accumulated that humans had to be brought back to repay. 
The market frame matters too: Klarna was preparing for an IPO, and loud AI claims served the investment narrative — as did the subsequent course correction toward 'quality'. For practitioners the takeaway, in our view, is this: the target support architecture is a hybrid where AI removes the routine and humans remain the quality guarantee and the escalation path; projects designed as hybrids from day one never have to go through a painful reversal.

2.3M chats in month one

700 FTE equivalent

<2 min resolution (was 11 min)

$40M projected 2024 profit

📊

Finance · Morgan Stanley

Morgan Stanley: the GPT-4 assistant used by 98% of financial advisor teams

More than 98% of financial advisor teams use the assistant — near-ceiling adoption for a corporate tool. The system effectively answers questions across a corpus of 100,000 documents — versus roughly 7,000 questions the previous tool could handle. Document accessibility, by the firm's estimate, rose from 20% to 80%, and information retrieval collapsed from minutes to seconds. McMillan describes the qualitative shift: 'Now, advisors can engage clients on topics they haven't discussed before because the friction between knowledge and communication has gone to zero.' For Debrief, early public estimates exist: a pilot advisor put the savings at about 30 minutes of work per meeting; with a million Zoom calls a year across the division, CNBC estimated the potential at hundreds of thousands of hours annually. Kaitlin Elliott, Head of Firmwide Generative AI Solutions, noted: 'The feedback from advisors has been overwhelmingly positive... follow-ups that used to take days now happen within hours.' At the same time McMillan candidly called the project a 'grand experiment in productivity' and said a rigorous impact assessment would take at least a year of observation. The approach is scaling beyond wealth management: on October 23, 2024 the firm launched AskResearchGPT — a GPT-4 assistant over the Morgan Stanley Research corpus (70,000+ proprietary reports a year) for investment banking, sales & trading, and research staff, with a patented 'query → client-ready email with citations and links in one click' workflow. It is important to frame these numbers: Morgan Stanley publishes usage metrics (adoption, corpus accessibility, speed) rather than a dollar impact — the bank has nowhere claimed 'AI made us X million'. In our view this is a deliberate and honest position for an internal tool: rigorously attributing a copilot's contribution to advisor revenue is methodologically hard, and the firm prefers metrics it controls. 
The industry stakes are high: per a Citigroup estimate cited by CNBC, finance jobs are among the most prone to AI displacement, and AI adoption could boost the industry's profit by $170 billion by 2028. For the market, the Morgan Stanley case is the model of the 'low-risk' path into generative AI, the mirror image of customer-facing chatbots like Klarna's: the AI faces inward, toward employees; a human stays in the loop of every decision; and trust is built on measurable evals, not demos. In our view that combination — a narrow corpus, evals as a process, a human as the final filter — is the case's most transferable part: it works in any industry where the cost of a wrong answer is high, from medicine to law.

98% advisor team adoption

100K documents in corpus

20→80% document accessibility

7K→100K question coverage

💬

SaaS / Customer support · Intercom

Intercom Fin: a Claude-powered support agent resolving up to 86% of questions

Out of the box Fin resolves 51% of inquiries on average; tuned for a specific company, resolution reaches 86% — a gap that itself shows knowledge-base quality is decisive. Response times drop from ~30 minutes to seconds. Per Intercom's blog in 2026, more than 7,000 teams use Fin, the average resolution rate across customers has reached 76% and keeps growing month over month — even as the share of complex queries rises; the product pages state over a million conversations per week. Published customer results from the Anthropic case study give three different frames of scale. Synthesia (a startup): in 6 months Fin closed over 6,000 conversations, saved over 1,300 support hours, and self-serve reached 87%. Fundrise (a growing fintech): over 50% of volume automated within 3 months at 95% response accuracy, with seasonal case peaks roughly halved year over year. Lightspeed (enterprise): resolution up to 65%, AI involved in 99% of conversations, and agents closing 31% more conversations per day — AI removes the routine even from dialogues a human still leads. It is important to frame the metrics: a 'resolution' in Intercom's reporting is a conversation fully closed without human involvement (since March 2026 — an 'outcome', which also includes procedures with final human confirmation). The 51%, 76%, and 86% figures are vendor and customer data, not an independent audit; resolution depends heavily on the quality of a specific company's knowledge base and its inquiry mix. Also telling is how the resolution ceiling depends on the type of business: startup Synthesia reached 87% self-serve, while enterprise client Lightspeed tops out at 65%. 
In our view this is not a difference in tuning quality but a natural difference in product and inquiry complexity: the more complex the product and the higher the cost of error, the more cases still require a human — and the more it matters that AI accelerates those dialogues too (hence the +31% conversations closed by Lightspeed's agents). In our view, Intercom's main contribution to the industry is not the resolution percentages but the economic model. Outcome-based pricing ($0.99 per resolution, later the outcome model) aligns vendor and customer incentives: Intercom earns only when its AI actually works, so it cannot afford a 'chatty bot' that simulates activity. This model has become the de facto industry standard for support AI agents — competitors across the market are copying it. The second transferable takeaway: Fin's migration from GPT-4 to Claude shows that for an application company the base model is a replaceable component, while the competitive advantage lives in the layer above it — retrieval models, validation, action integrations, and data from millions of real dialogues. For companies building their own agents, that argues for architectures where the model can be swapped without rewriting the product.
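The "replaceable base model" point can be made concrete with a minimal sketch. Every name here (`ChatModel`, `SupportAgent`, the two stub backends, the toy knowledge base) is hypothetical and illustrative; this is not Intercom's code and it calls no real vendor API. It only shows the shape of an architecture in which the product layer depends on a narrow interface rather than on one specific model.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


class ChatModel(Protocol):
    """Narrow interface the product layer depends on (hypothetical)."""

    def complete(self, prompt: str) -> str: ...


class StubModelA:
    """Stand-in for one vendor's model; a real adapter would call its API."""

    def complete(self, prompt: str) -> str:
        return f"[model-a] {prompt[-40:]}"


class StubModelB:
    """Stand-in for a second vendor's model behind the same interface."""

    def complete(self, prompt: str) -> str:
        return f"[model-b] {prompt[-40:]}"


@dataclass
class SupportAgent:
    """Product layer: retrieval and escalation live here, not in the model."""

    model: ChatModel
    knowledge_base: Dict[str, str]

    def retrieve(self, question: str) -> str:
        # Toy retrieval: first article whose topic appears in the question.
        q = question.lower()
        for topic, article in self.knowledge_base.items():
            if topic in q:
                return article
        return ""

    def answer(self, question: str) -> Optional[str]:
        context = self.retrieve(question)
        if not context:
            return None  # no grounding found: hand over to a human
        prompt = f"Context: {context}\nQuestion: {question}"
        return self.model.complete(prompt)


# Toy data, invented for the example.
kb = {"refund": "Refunds are issued within 5 days."}
agent = SupportAgent(model=StubModelA(), knowledge_base=kb)
first = agent.answer("How do I get a refund?")

agent.model = StubModelB()  # swap the base model; product code is unchanged
second = agent.answer("How do I get a refund?")
```

The design choice worth noting is that retrieval and the "no answer, escalate" rule sit outside the model adapter, so replacing the model touches one class and nothing else.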

86% max Fin resolution

51% out-of-the-box resolution

45+ languages

+31% agent conversations/day (Lightspeed)

🗄️

Technology · Uber

Uber QueryGPT: natural language to SQL — 10 minutes down to 3

QueryGPT produces sufficiently reliable queries in about 3 minutes versus ~10 minutes of manual authoring — roughly 70% time saved per query. In limited release the service averages about 300 daily active users, 78% of whom say the generated queries reduce the time they would have spent writing by hand. At the platform's scale of 1.2 million queries per month the scaling potential is obvious — but the team is deliberately rolling out gradually, and in its published learnings explicitly names choosing the right initial audience (personas) as a lesson of its own: start where the benefit is highest and SQL needs are routine — with operations teams, not with data engineers, who need an LLM draft least of all. The team's published learnings matter just as much. First: LLMs work excellently as classifiers on narrow tasks — a pipeline of specialized agents (intent → tables → columns) consistently beats one big prompt. Second: the user's question alone is insufficient input for generation; it must be enriched with context before the model call. Third: answer multiplicity (the same question is correctly solved with different tables and SQL styles) makes automated evaluation inherently fuzzy — hence an LLM judge and visual comparison against the reference instead of a binary match/no-match. Frame the numbers correctly: 10 and 3 minutes are the Uber team's estimates; 78% is user self-report, not a stopwatch measurement; the service was in limited release at publication. Uber publishes neither a generation-accuracy percentage nor a financial impact — and, in our view, that is more honest than third-party blog extrapolations that paint the case with hundreds of thousands of saved hours. In our view, QueryGPT's main transferable lesson is that production text-to-SQL is 20% model and 80% context engineering: curated domain workspaces, schema pruning, prompt enrichment, and evaluation discipline on a golden set. 
All of that transfers to any company with a large data platform — and requires neither your own model nor Uber's scale. The second lesson is pace: from hackathon to production took over a year and 20+ iterations. Teams expecting text-to-SQL to 'work within a sprint' underestimate precisely the long tail of domain tuning, not the difficulty of LLMs.
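The staged pipeline described above (intent, then tables, then columns, then an enriched prompt) can be sketched as follows. This is an illustration under stated assumptions, not Uber's implementation: the workspaces, table schemas, and function names are hypothetical, and simple keyword rules stand in for what would be LLM classifier calls in a real system.

```python
from typing import Dict, List

# Hypothetical curated workspaces: domain -> tables -> columns.
WORKSPACES: Dict[str, Dict[str, List[str]]] = {
    "mobility": {
        "trips": ["trip_id", "city", "fare", "driver_id", "requested_at"],
        "drivers": ["driver_id", "city", "rating", "signup_date"],
    },
    "ads": {
        "campaigns": ["campaign_id", "budget", "start_date"],
    },
}

# Keyword rules standing in for an LLM intent classifier (assumption).
INTENT_KEYWORDS = {"mobility": ["trip", "driver", "fare"], "ads": ["campaign"]}


def classify_intent(question: str) -> str:
    """Stage 1: map the question to a workspace."""
    q = question.lower()
    for workspace, keywords in INTENT_KEYWORDS.items():
        if any(k in q for k in keywords):
            return workspace
    raise ValueError("no matching workspace")


def select_tables(question: str, workspace: str) -> List[str]:
    """Stage 2: keep only tables the question plausibly needs."""
    q = question.lower()
    # Toy rule: a table is relevant if its singular name appears in the question.
    return [t for t in WORKSPACES[workspace] if t.rstrip("s") in q]


def prune_columns(question: str, workspace: str, tables: List[str]) -> Dict[str, List[str]]:
    """Stage 3: prune each schema to id columns plus columns named in the question."""
    q = question.lower()
    pruned = {}
    for t in tables:
        cols = WORKSPACES[workspace][t]
        pruned[t] = [c for c in cols if c.endswith("_id") or c.split("_")[0] in q]
    return pruned


def build_prompt(question: str) -> str:
    """Enrich the bare question with context before any generation call."""
    ws = classify_intent(question)
    tables = select_tables(question, ws)
    schema = prune_columns(question, ws, tables)
    lines = [f"Workspace: {ws}"]
    lines += [f"Table {t}({', '.join(cols)})" for t, cols in schema.items()]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


prompt = build_prompt("What was the average fare per trip by city last week?")
```

Each stage narrows the context the generator sees, which is the point of the "schema pruning" and "prompt enrichment" items: the final prompt carries one workspace, one table, and four columns instead of the full catalog.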

10→3 min query authoring time

1.2M platform queries/mo

300 daily active users

78% report time savings

🦉

EdTech · Duolingo

Duolingo: 148 new language courses in a year with generative AI

On April 30, 2025 Duolingo announced 148 new language courses at once — the largest content expansion in company history, created in under a year. The course catalog more than doubled. The seven most popular non-English languages — Spanish, French, German, Italian, Japanese, Korean, and Mandarin — became available from all 28 interface languages, opening learning to over a billion potential learners. Von Ahn framed the speed comparison himself: 'Developing our first 100 courses took about 12 years, and now, in about a year, we're able to create and launch nearly 150 new courses. This is a great example of how generative AI can directly benefit our learners.' The launch had a second, less celebratory side — a communications crisis. The 'AI-first' memo about phasing out contractors, published two days before the course announcement, triggered a wave of criticism: users declared they were deleting the app, and the comments under the company's TikTok and Instagram posts turned into a stream of anti-AI sentiment. In May 2025 von Ahn issued a clarification on LinkedIn: 'To be clear: I do not see AI as replacing what our employees do (we are in fact continuing to hire at the same speed as before). I see it as a tool to accelerate what we do.' In August 2025 he admitted in an interview that the memo 'did not give enough context' and stressed that no full-time employees had been laid off. Frame the result correctly: 148 courses and a doubled catalog are verifiable product facts; but the company published no quality metrics for the new courses (learner retention, level progression) at announcement time, and part of the user criticism targeted exactly the quality of AI content. In our view, the Duolingo case is the clearest demonstration of where generative AI gives a content business maximum leverage: not 'writing instead of people' but replicating one vetted gold standard across dozens of localizations. 
The '100 courses in 12 years versus 148 in a year' formula became a public speed benchmark for the whole EdTech industry, although, to be fair, it compares tasks of different complexity: the first 100 courses included building the methodology from scratch, whereas the 148 new ones replicate it at beginner levels. The second lesson is the price of words. The difference between 'AI accelerates our teams' and 'we will replace contractors with AI' cost the company weeks of public crisis over essentially identical changes. For executives planning an AI transformation, communications strategy is not an appendix to the project but part of it. The Duolingo case should be studied alongside Klarna's reversal: the market punishes radical rhetoric even when the operational results are real.

148 new courses at once

<1 year production time

12 years for the first 100 courses

28 interface languages

🛵

Delivery · DoorDash

DoorDash: a Claude-powered voice AI fields hundreds of thousands of Dasher calls a day

After successful testing in early 2024, DoorDash rolled the new self-service options out to all Dashers. The voice AI fields hundreds of thousands of Dasher calls daily at a response latency of 2.5 seconds or less. In the case study's wording, the solution has driven 'large and material reductions' in call volumes for Dasher-related support inquiries, cut escalations to live agents by thousands per day, and reduced the number of tasks agents perform to resolve inquiries. AI closes the routine questions, freeing agents for complex cases that need a human. It is important to hold the attribution frame — and to correct a widespread retelling of this case (including our own earlier version): the figures '−49% transfers, +12% first-contact resolution, $3M annual savings' belong to DoorDash's previous automation generation on Amazon Connect and Amazon Lex — the baseline the generative layer was built on top of. For the Claude solution itself, AWS and DoorDash publish speed and scale metrics (2.5s; hundreds of thousands of calls a day; 50x testing capacity; −50% development time) and qualitative impact statements ('thousands of escalations a day'), but no percentages or dollars. That is typical, honest practice for a fresh deployment — precise financial effects take a year of observation. The team has already announced the next step: expanding the knowledge bases and integrating DoorDash's event-driven logistics workflow service so the assistant not only answers questions but takes actions on the user's behalf. 'Using AWS and Anthropic's Claude, we've built a solution that gives Dashers reliable and simple-to-understand access to the information they need, when they need it. This has cascading positive impacts on our users and the platform as a whole,' says Chaitanya Hari, Contact Center Product Lead at DoorDash. In our view there are three transferable lessons. 
First: for voice AI, speed is product requirement number one, and choosing a light model (Haiku) over the flagship is sound engineering economics applicable to any realtime scenario. Second: test infrastructure matters more than model 'smarts' — the 50x testing capacity increase is the reason the project reached production in 8 weeks. Third: build the generative layer on top of working automation, not instead of it — DoorDash augmented a strong IVR baseline, which is exactly why it could honestly measure the added value through A/B testing.

100K+ calls per day fielded by the voice AI

2.5 s response latency (Claude 3 Haiku)

50× testing capacity increase (SageMaker)

8 weeks from design to production A/B

🏦

Banking · Сбербанк

Sberbank: AI resolves 65% of customer inquiries in the contact center

As of Q1 2026, AI resolves over 65% of customer inquiries: 66% in voice channels and 71% in chats. The market comparison sets the frame: per Frank RG's 2026 estimate, the industry average is 23% automation in voice and 67% in chats. In voice, Sber is nearly three times ahead of the market; in chats, slightly above it. 95% of calls are answered immediately, during the conversation; the remaining 5% require additional analysis (typically 2–3 days). The bank's contact center received two Frank RG awards — for the most stable operator team and the best robotic service on an incoming line. Publicly named effects by layer: AI routing of corporate calls saved the bank 300 million rubles in 2023; AI in the business contact center saves over 7,000 operator hours per month. The GigaChat assistant added +7% to operator productivity and +2 pp to the CSI satisfaction index; operators use up to 20% of the model's suggestions (up to 45% in some areas), and dialogue quality scoring reaches 80% accuracy. The results were publicly discussed by Elena Levina, Vice President and Director of Sberbank's Customer Care Department. Frame the numbers correctly. This is the bank's own reporting, not an independent audit; industry experts in ComNews explicitly cautioned that improved routing metrics can have multiple explanations (Evgeny Surkov, Innostage) and that quantitative indicators without customer quality assessment give an incomplete picture (Evgenia Gilenyuk, SKB Kontur). The strength of Sber's reporting is its anchor to the external Frank RG benchmark: the market comparison makes the headline number verifiable. In our view, the Sber case is valuable above all as an architectural template: three layers with different metrics (routing — seconds and rubles; the assistant — productivity and CSI; automation — the share of inquiries without a human) are not blended into one 'AI effect' but measured separately. 
That transfers to any large contact center — not least because the first layer (routing) pays for itself before any LLM appears. A separate observation: the share of suggestions operators actually use (20–45%) is a rare example of an honest assistant-usefulness metric; most deployments report that suggestions exist, not whether people take them.

65% of inquiries resolved by AI (Q1 2026)

+7% operator productivity with GigaChat

+2 pp CSI growth after assistant rollout

300M ₽ saved in 2023 (corporate call routing)

🛍️

E-commerce · Wildberries

Wildberries: AI in marketplace operations — warehouse machine vision, neural search, and review processing

Wildberries still discloses almost no quantitative business results for its AI systems — the case's key honest limitation. What is published are technical indicators: robot-arm productivity of at least 950 units per hour, a grip success rate above 97% (TAdviser), and roughly 6-month payback for the cross-belt sorter (RoboTrends). The public warehouse-robotics goal voiced by Elena Obraztsova, Wildberries' marketplace automation director, is to radically reduce dependence on warehouse staff within one to three years; her colleague Andrey Ulyanov honestly caps the ambition at a hybrid with about half of the processes automated. The platform's scale (about 15 million orders per day, tens of billions of events per day) is confirmed by the company's corporate blog. The trajectory, meanwhile, is visible to the naked eye through infrastructure launches: from the first industrial robot tests in 2024 to industrial operation of manipulators at Koledino in 2025 and the launch of an entire robotized hub at Krasny Bor by the end of 2025. Product AI launches (neural search, AI review replies, automatic goods redistribution) run in parallel with the warehouse work. In our view, the Wildberries case is interesting precisely as a model of 'AI without impact press releases': the company changes its operating model through contract terms and infrastructure rather than marketing numbers. Automatic goods redistribution is the clearest example: the AI system changes the division of responsibility between platform and seller, and its 'metric' is a legal document, not a percentage on a slide. For market analysts it is a reminder: the absence of public metrics does not equal the absence of deployment — but attributing specific improvement percentages to the company without its own publications is off-limits. 
Frame the technical numbers correctly too: 950 units per hour and 97% grips are the spec-sheet indicators of one manipulator type on sorting, not 'the efficiency of Wildberries' AI' overall; industry return estimates (30%, up to 70% in apparel) are market context, not company metrics. In our view the case's most transferable element is Ulyanov's sober goal-setting: 'a hybrid with ~50% of processes automated' is a more honest and achievable frame for warehouse robotics than a 'lights-out warehouse' — and it applies to most logistics operators.

15M orders per day (company data, Habr)

950/hour robot-arm units, >97% grip rate (TAdviser)

1–3 years public goal: radically reduce dependence on warehouse staff

🔊

Smart home · Яндекс

Yandex: Alice processes smart-home commands on the speaker itself — on average 6x faster than the cloud

According to the Yandex team, local processing of smart-home commands is on average 6x faster than the cloud path — the gain comes from eliminating the network round trip plus streaming on-device processing; the company published no exact milliseconds, and recognition speed varies with phrase complexity. Lights and other Zigbee devices keep working when the internet goes down — offline resilience became not a side effect but the project's second headline result. The approach stuck across the lineup: newer devices, including Station Mini 3 Pro, also got local voice-command processing (TAdviser) — Midi was not an experiment but a shakedown of the architecture for the whole platform. The ecosystem kept growing meanwhile: 5.3 million active YaOS/YaOS X devices and 2.9 billion Alice requests in 2025. The compression scale the team documented in numbers is telling in itself: NLU — from at least 30 gigabytes of RAM down to 90 megabytes; the smart-home backend — from 500+ to ~200 megabytes; ASR — down to a ~10-million-parameter model, orders of magnitude smaller than cloud counterparts. These figures are a rare public reference point for anyone assessing the feasibility of on-device AI on cheap hardware. In our view, the case's main transferable lesson is the correct decomposition of the hybrid: Yandex did not try to drag 'all of Alice' onto the speaker; it isolated the narrow domain where locality yields the most value (routine smart-home commands — high-frequency, short, with a limited vocabulary) and left everything else to the cloud. That framing — 'local for what is frequent and simple; cloud for what is rare and complex' — applies far beyond the smart home: from bank voice menus to automotive assistants. The second observation: on-device AI starts with hardware. The NPU in the SoC and the gigabyte of memory were designed into Midi upfront — on a weak chip this architecture would not have happened. 
Teams planning local models in devices should budget compute headroom a generation before the models are ready — otherwise, by the time the software matures, the hardware in devices already sold will not cope.
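The "local for what is frequent and simple; cloud for what is rare and complex" split can be sketched as a small router. This is an assumption-laden illustration, not Yandex's design: the vocabulary sets, the word limit, and the function names `route` and `handle` are all hypothetical.

```python
from typing import Tuple

# Hypothetical closed vocabulary for the on-device domain.
LOCAL_ACTIONS = {"turn on", "turn off", "dim", "brighten"}
LOCAL_DEVICES = {"light", "lamp", "socket"}
MAX_LOCAL_WORDS = 6  # short phrases only; longer requests go to the cloud


def route(utterance: str) -> Tuple[str, str]:
    """Return ("local" or "cloud", reason) for a recognized phrase."""
    text = utterance.lower().strip()
    if len(text.split()) > MAX_LOCAL_WORDS:
        return "cloud", "phrase too long for the on-device domain"
    has_action = any(text.startswith(a) for a in LOCAL_ACTIONS)
    has_device = any(d in text for d in LOCAL_DEVICES)
    if has_action and has_device:
        return "local", "routine smart-home command"
    return "cloud", "outside the on-device vocabulary"


def handle(utterance: str, internet_up: bool) -> str:
    """Local commands keep working offline; cloud requests need the network."""
    target, _ = route(utterance)
    if target == "local":
        return "executed on device"
    if not internet_up:
        return "unavailable offline"
    return "sent to cloud"
```

For example, `handle("Turn on the kitchen light", internet_up=False)` stays on the device, while an open-ended question with the network down is reported as unavailable rather than silently dropped.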

6× local processing faster than cloud (on average)

~10M parameters in the on-device ASR model

90 MB of RAM for the 'Begemotik' NLU (30+ GB in the cloud)

5.3M active YaOS/YaOS X devices (2025)

🛡️

Fintech · Т-Банк

T-Bank: AI protection against phone scammers saved customers 1.2 billion rubles

The financial trajectory of 'Protect or Refund' has been published in stages: in its first six months it saved customers 170 million rubles; by May 2025 the cumulative amount reached 1.2 billion rubles, with 12.3 million rubles paid out in compensation. The ratio of those two figures — roughly one to a hundred — is effectively the system's public accuracy: for every ruble paid out on missed scammers there are about a hundred rubles of prevented damage. The bank estimates the expected ecosystem-wide effect at roughly 2.5 billion rubles per year. 'Fraud Roulette' produced its own numbers over a year of closed testing: 2,000+ participants, more than 3 million calls taken, 44,000 hours of scammers' time burned, and an estimated 490–500 million rubles of prevented damage (the bank's estimate, cited by Forbes and Vedomosti). Frame this correctly: all figures are the bank's own reporting, not an independent audit. 'Prevented damage' is a calculated value (what could have been stolen had the call reached a victim), and the bank does not disclose the methodology. The paid compensations (12.3 million rubles), however, are real money that went to customers — and they are what makes the rest of the reporting credible: a bank that pays for every miss has no incentive to overstate its system's quality. In our view, the case's main innovation is not the models but the economic construction. A financial guarantee turns AI quality from an internal metric into a P&L line: every false negative costs the bank real money, so the incentives of the team, the bank, and the customer are aligned automatically. That construction transfers to any industry where AI protects customers from losses — from insurance to cybersecurity — and it is a stronger trust signal than any accuracy certificate. The second observation: 'Fraud Roulette' is a rare example of an offensive antifraud strategy. 
Classic protection reduces victims' losses; the roulette attacks the economics of the criminal business, where an hour of call-center time has a concrete cost. The 44,000 hours scammers spent talking to trained volunteers are hours not spent on real victims. Whether the model scales and whether scammers adapt to the interception is an open question worth watching in 2026.
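The "roughly one to a hundred" ratio quoted above is simple arithmetic on two of the bank's self-reported figures, and it can be checked directly. The variable names are illustrative; the inputs are the cumulative numbers given in the text (1.2 billion rubles saved, 12.3 million rubles paid out), and the result inherits all the caveats attached to "prevented damage" as a calculated value.

```python
# Figures as reported by the bank in the text above (millions of rubles).
prevented_mln = 1200.0   # cumulative amount saved for customers by May 2025
compensated_mln = 12.3   # compensations paid out under the guarantee

# Rubles of prevented damage per ruble of compensation paid.
ratio = prevented_mln / compensated_mln

# Payouts expressed as a share of the prevented damage.
payout_share = compensated_mln / prevented_mln

print(f"{ratio:.1f} rubles prevented per ruble paid out")
print(f"payout share of prevented damage: {payout_share:.4f}")
```

The ratio lands just under 98 to 1, consistent with the rounded "one to a hundred" in the text.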

1.2B ₽ saved for customers cumulatively (by May 2025)

62M scam calls detected monthly by Neuroshield

44,000 h of scammers' time burned by Fraud Roulette in testing

12.3M ₽ paid out in guarantee compensations

🩺

Healthcare · Botkin.AI (ООО «Интеллоджик»)

Botkin.AI: what the rise and shutdown of Russia's first registered medical AI teaches

The outcome cuts both ways. The product ceased to exist as an independent business: regulatory suspension in November 2023, sale to a competitor in December 2023, out of the public eye. Investors who had put in at least 271 million rubles across 2017–2020 exited against 15.5 million rubles of revenue and a 139 million ruble loss in the last full year. Yet AI radiology in Russia is alive and growing: Moscow's computer-vision experiment in radiology continues with dozens of AI services, and Intellogic's buyer keeps developing its own Celsus platform. The framing matters. 'A threat of harm to life and health' is the legal basis of order No. 7880, not a proven fact of harm done: there is no public evidence of injured patients. Nor was the recall permanent — in May 2024, after corrective measures and VNIIIMT expert review, the suspension was lifted. And the cause of the halt was not 'the AI made mistakes' as such, but the failure to submit post-registration reports and the divergence of real-world characteristics from those declared at registration. It is also worth remembering that the parties' public statements contradicted each other — the regulator spoke of an 'absence of clinical effect' while the investor called the threat 'completely excluded' — and the market has no independent arbiter between those positions beyond the order itself and the subsequent VNIIIMT review. In our view, the key lesson of this case is economic, not technological. Botkin.AI held a unique regulatory asset (the only class 2b certificate for an AI platform) and a government integration with the country's largest stream of studies — and still failed to build sustainable revenue: 15.5 million rubles a year did not cover even a fraction of the compliance costs a high-risk device class demands. Post-registration monitoring is a permanent expense line — clinicians, data, reporting; a company whose unit economics don't work cuts exactly that line, because the penalty is deferred. 
Here it arrived in its maximal form. A second observation — also editorial: the real price of the failure was measured not in investors' money but in the evidence bar set for the whole industry. Since November 2023, every medical AI developer in Russia knows that a registration certificate is the start of regulatory work, not its finish line — and that the regulator can verify claimed metrics on its own. For the market's maturity, such a precedent may be worth more than another success story.

2020 Russia's first class 2b registration certificate for an AI platform

271M+ ₽ raised (11M in 2017 + 100M in 2019 + 160M in 2020)

150,000+ studies in the Moscow pilot (March–December 2020)

−139M ₽ loss in 2022 on 15.5M ₽ revenue

🛢️

Oil & Gas · Газпром нефть

Gazprom Neft: the 'Digital Rig' cut non-productive drilling time by 15%

Published results of the field trials at Gazpromneft-Noyabrskneftegaz: well construction time fell by 6 days versus norms (17.5% for complex wells), the speed of footage operations rose 28.5% and non-footage operations 16.3%, pipe make-up time dropped 3.7% while make-up norms were exceeded by 40% — all without increasing well cost (Up-Pro, Integral Russia). The project's planned target was a 15% cut in complication-related NPT; in the words of trial participant Ramil Bariev, 'we planned for 15% growth when testing the project, and achieved much more.' Support-center effects are published separately: geosteering saves 3–5% of rig time (JPT/SPE), and the efficiency coefficient of placing horizontal wellbores within the pay zone rose from 60% to over 90% after GeoNavigator launched (Fontanka). Company-wide digitalization economics are also disclosed: the 'Asset of the Future' pilot delivered a 1.2 billion ruble effect, and the company expected 5–6 billion rubles per year from digital technologies in exploration starting in 2025 (ComNews). Important framing: all the figures above come from the company and its contractors, with no independent audit. The trial results were obtained on six wells of a single asset — pilot statistics, not fleet statistics; '−15% NPT' is the program's planned value, described by participants as 'exceeded' but without a published fleet-wide actual. Geosteering and 'Digital Rig' effects cannot be added together — they partly overlap within the same rig time. In our view, the most valuable thing about this case is its metric discipline: the company measures impact in days per well, NPT percentages, and meters of wellbore inside the pay zone — quantities that convert directly into cost per meter drilled. That distinguishes the project from the typical 'we deployed AI on a rig' story with no operational numbers. 
A second observation: eight years of evolution — from the 2011 center through the six technologies of 2017 to a robotic rig — show that the 'unmanned rig' is built in layers on top of a decade of data work, not bought as a finished product; replicating the top layer without the ones beneath it will not, in our view, work. An indirect confirmation of maturity is the interest of Middle Eastern national oil companies in this stack: technologies proven on the company's own fields became an argument in international negotiations before 'sovereign oilfield software' became a mainstream topic.

−15% complication-related non-productive time (program target; exceeded, per participants)

1.5+ h stick/slip warning before a forced stop

−6 days well construction time in field trials (+28.5% footage-operation speed)

60% → 90%+ horizontal-well placement efficiency in the pay zone (GeoNavigator center)

🔎

E-commerce · Ozon

Ozon: a transformer in search suggestions — honest fractions of a percent at trillion-ruble scale

Published gains by iteration: search-suggestion CTR +10%, then another +10% and +3%; the share of empty result pages fell 3%; the share of users finishing a session with an order grew 0.3%. Every figure comes from an A/B experiment on live traffic and was published by the team itself in its engineering blog — with the honest admission that clickability grows more easily than order conversion. Modest at first glance — but at trillions of rubles in GMV, fractions of a percent are a material business effect, and this is what real recommender-system numbers look like. A further, less visible result is infrastructural: dynamic compilation of ranking formulas cut the search service's CPU consumption roughly threefold (from 15%+ to 5–6%) and shaved 10 ms off query time — at tens of thousands of RPS, that is both hardware savings and a direct contribution to result speed. Credibility framing: all figures are the company's self-reported numbers with no independent audit. But this is self-reporting of a particular kind — published by engineers with the A/B methodology described and weak spots admitted (diminishing returns, the CTR-to-order gap), which sharply lowers the risk of marketing embellishment. Separately (a forecast, not a result): per an estimate cited by Forbes, a future AI search assistant could add 3–5% GMV for Ozon within one to two years. This estimate must not be conflated with the measured +0.3% — they are different genres of numbers. In our view, the case's main value is calibrational. It sets the market a benchmark of honest reporting: the +10% → +10% → +3% sequence shows not only the effect but its decay, and the '+0.3% users with an order' metric shows how expensive each fraction of a percent is on a mature product. 
When a vendor or integrator promises '+15% conversion from AI in search', this case is a ready-made ruler: one of the country's strongest ML teams, with its own GPU cluster and trillion-ruble GMV, documents an effect an order of magnitude more modest. The second observation is architectural: in our view, choosing a model of hundreds of millions of parameters over a fashionable 'multi-billion LLM' is exactly what engineering maturity looks like. Ozon picked the minimal architecture that solves the task within a hard latency budget, instead of the maximal one that would solve it in a slide deck.
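The decay in the published iteration gains is easier to see once the lifts are compounded. A minimal Python sketch of that arithmetic, assuming each published lift is relative to the level reached by the previous iteration (the baseline of each step is not spelled out in the figures quoted here):

```python
# Published suggestion-CTR lifts per iteration: +10%, +10%, +3%.
# Assumption: each lift is relative to the level reached by the previous iteration.
ctr_lifts = [0.10, 0.10, 0.03]


def cumulative_lift(lifts):
    """Compound a sequence of relative lifts into a single overall lift."""
    level = 1.0
    for lift in lifts:
        level *= 1.0 + lift
    return level - 1.0


overall = cumulative_lift(ctr_lifts)
print(f"Cumulative CTR lift: {overall:.1%}")
```

Under that assumption the three iterations compound to just under +25% suggestion CTR, while the share of sessions ending in an order moved by +0.3%. The distance between those two numbers is what makes the case a useful ruler.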

+0.3% users finishing a session with an order

+10% suggestion CTR in iteration one (then +10% and +3%)

−3% share of empty search results

300 ms response budget at tens of thousands of RPS

⚛️

Nuclear power · Rosatom / Rosenergoatom

Rosatom: repair optimization gave Russian NPPs 94 extra operating days in half a year

In the first half of 2026, repair campaigns across the Russian NPP fleet were shortened by a combined 94 days (maintenance ran on 17 power units), and output, per the June 16, 2026 forecast, exceeded plan by 7.44 billion kWh (Rossiyskaya Gazeta, EnergyLand). Station examples: Kursk NPP gained an extra 180 million kWh from shorter repairs, Balakovo about 1 billion kWh, and Rostov 40 million kWh thanks to unit 4's early grid reconnection. Digitalization's local contribution is quantified too: the electronic maintenance sheet at Rostov NPP alone saves 140 minutes a day — 851 hours a year. Important caveats. The 94 days are the combined fleet-wide effect of repair optimization for the half-year — on average days, not weeks, per unit. The 7.44 billion kWh of above-plan output results from a combination of factors: shorter repairs, operational efficiency, and the new Kursk NPP-2 unit coming online; isolating an 'AI share' or 'digital-twin share' within it is impossible from open data, and Rosatom itself publishes no such attribution. The '−30% diagnostics time' effect belongs to digital-twin pilots, not the whole fleet. In our view, this case is a rare example of the right sequence: first a reliability and lean culture built over decades (RPS), then a virtual environment for safe experiments (the Virtual Digital NPP since 2020), then targeted digital tools in the maintenance loop — and only atop this pyramid the aggregate result of 94 days. Tellingly, the loudest percentage in the case (the 2.4x speedup in data exchange) was achieved not by a neural network but by replacing paper and phone calls with a mobile app: in capital-intensive industries, plain process digitization still yields effects worth publishing next to the word 'AI'. A second editorial observation: the nuclear industry publishes its effects in physical quantities — days, kilowatt-hours, minutes — which makes them verifiable. 
We consider such reporting a model for industrial AI cases: ruble estimates depend on prices and methodologies, while 94 days and 7.44 billion kWh can be checked against dispatcher data.
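Because the figures are physical quantities, the arithmetic behind them can be rechecked directly. A short Python sketch using only numbers quoted in this case; the one assumption, labeled in the code, is that the daily saving at Rostov NPP applies on every calendar day of the year:

```python
# Figures quoted in the case text.
CYCLE_BEFORE_MIN = 240       # maintenance data-exchange cycle before, minutes
CYCLE_AFTER_MIN = 100        # the same cycle after, minutes
REPAIR_DAYS_SAVED = 94       # combined fleet-wide reduction, H1 2026
UNITS_IN_MAINTENANCE = 17

# Assumption: the daily saving applies on every calendar day of the year.
DAYS_PER_YEAR = 365

minutes_saved_per_day = CYCLE_BEFORE_MIN - CYCLE_AFTER_MIN
hours_saved_per_year = minutes_saved_per_day * DAYS_PER_YEAR / 60
speedup = CYCLE_BEFORE_MIN / CYCLE_AFTER_MIN
avg_days_per_unit = REPAIR_DAYS_SAVED / UNITS_IN_MAINTENANCE

print(f"{minutes_saved_per_day} min/day, {hours_saved_per_year:.1f} h/year")
print(f"speedup {speedup:.1f}x, {avg_days_per_unit:.1f} days per unit on average")
```

The results line up with the published values: 140 minutes a day, about 851 hours a year (the published figure appears to be truncated rather than rounded), a 2.4x speedup, and an average of roughly 5.5 days per unit, which is why the 94 days should be read as days, not weeks, per unit.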

−94 days combined fleet-wide repair reduction (H1 2026, 17 units)

7.44 billion kWh extra output above plan (H1 2026)

−30% diagnostics time in digital-twin pilots

240 → 100 min maintenance data-exchange cycle at Rostov NPP (851 hours saved a year)

📹

Retail · M.Video-Eldorado

M.Video-Eldorado: in-store video analytics — an honest look at a pilot, not a 'chain-wide rollout'

This case's results exist in two genres, and it is important not to mix them. Genre one is the engineering blog (Habr, March 2021). The only impact metric published there concerns the alert itself: over 1.5 months of the pilot, 'lonely shopper' alerts fell from 25 to 5 per day; the system learned to filter false signals and bother staff only when it mattered. On business metrics the article is explicit: multi-camera tracking would only begin to make them measurable (visit-to-purchase conversion, impact on average receipt). Genre two is the corporate press release (August 2021). The figures there are bolder: 'the customer attraction coefficient in pilot stores grew a third faster', 'the conversion coefficient grew 35% versus comparable stores', notifications about helping a customer or opening checkouts fell 75%, installations pay back within the first month, and the in-house build costs one-fifth of market analogues. The release calls the project 'Russia's first video analytics system with proven economic efficiency'. The methodology of these comparisons (how many stores, what period, how 'comparable' ones were selected) is not disclosed, and there is no independent verification. In our view, the gap between the two genres is the most instructive part of the case. In March the engineers write 'we will only now learn to measure conversion'; in August the press office reports '+35% conversion'; strictly speaking, both texts can be true (pilot stores may indeed have outgrown a control group), but trust in a number depends directly on the genre it was published in. We would build a business case on the engineering version and treat the press-release one as an upper bound. The second observation: even by the boldest public data, the project reached 50 stores out of 1,200+, roughly 4% of the chain, and no public confirmation of a full-network rollout has appeared since. 
The distance between 'proven economic efficiency' in a pilot and total deployment is years and billions (30,000 cameras versus 250 connected at press-release time). That is why we deliberately keep this case at its real scale: a solid, well-documented pilot with early scaling — and a useful ruler for everyone reading other companies' 'chain-wide rollouts'.

1 → 50 stores: pilot (Oct 2020) → scaling plan (Aug 2021)

25 → 5 'lonely shopper' alerts per day over 1.5 months

~$50 edge device per camera (Raspberry; RTSP filter: 3 Mbit/s per 30 cameras)

30,000 cameras: target scale (1,000+ stores)

📮

Postal & logistics · Russian Post

Russian Post: OCR and robots sort around 8 million mail items a day

The automated pipeline sustains a national-scale flow. Confirmed operational figures: 40,000 letters per hour on the letter-sorting machine (11 per second), up to 8,000 parcels per hour on the parcel machine, 3 million items a day — the capacity of the Vnukovo center alone, 4 seconds per parcel for the robotic manipulator, 5-projection photography for OCR, and 92–95% speech recognition accuracy for the voice assistant. Since 2026 a routing loop has joined the sorting loop: the Teraplan platform plans transport across a network of 1,100 sorting nodes and 38,000 post offices. The routing loop's effects are so far described qualitatively, not quantitatively: reduced planning time, lower operating costs, and better forecast accuracy through clustering of logistics flows are claimed — no measured public figures for Teraplan exist yet, which honestly reflects the project's stage. What is absent from the public record entirely — and worth stating: the company has not disclosed the financial effect of sorting automation. The integral quality dynamics (on-time delivery: letters 54% → 85%, parcels 52% → 95% over 2013–2016) relate to the logistics modernization as a whole — new centers, transport, processes — not to OCR alone. The share of automatically recognized items and the cost of video coding are not published either. A separate frame is volume dynamics: the peak '8 million items a day' belongs to earlier overview materials, while 2024 publications already cite an annual letter flow of 1.3 billion. In our view, the engineering essence of this case is not OCR as such (postal services worldwide have recognized addresses for decades) but an honest failure architecture: five projections at capture, the machine decides within a second, everything unreadable goes to a human video coder, the hopeless goes to reject. It is a conveyor designed around the assumption that AI will make mistakes — and therefore it never stops. 
The reverse order — 'first believe in 100% accuracy, then act surprised' — would have cost the Post stopped conveyors. A second observation: the SMAB platform that tied Toshiba, Siemens, and Vanderlande machines together with a single protocol layer is, in our view, the case's most underrated part. Integration software never makes press releases, yet it is what turns a set of expensive imported machines into a manageable network — and it stays with the company when hardware suppliers change.
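The failure architecture described in this case (machine first, human video coder second, reject last) can be sketched as a small routing function. This is an illustration only: the `OcrResult` type, the confidence threshold, and the function name are hypothetical and are not taken from the SMAB platform or any vendor's software.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    AUTO_SORTED = "auto_sorted"    # OCR read the address with enough confidence
    VIDEO_CODED = "video_coded"    # a human operator read the address from the image
    REJECTED = "rejected"          # unreadable for both; leaves the automated flow


@dataclass
class OcrResult:
    address: Optional[str]
    confidence: float  # hypothetical score in [0, 1]


# Hypothetical cut-off; the real system's thresholds are not published.
OCR_CONFIDENCE_THRESHOLD = 0.9


def resolve_item(ocr: OcrResult, video_coder_read: Optional[str]) -> Outcome:
    """Return the outcome for one item; every branch ends in a decision."""
    if ocr.address is not None and ocr.confidence >= OCR_CONFIDENCE_THRESHOLD:
        return Outcome.AUTO_SORTED
    if video_coder_read is not None:
        return Outcome.VIDEO_CODED
    return Outcome.REJECTED


print(resolve_item(OcrResult("Moscow, Tverskaya 1", 0.97), None))
print(resolve_item(OcrResult("Moscow, Tversk?ya 1", 0.55), "Moscow, Tverskaya 1"))
print(resolve_item(OcrResult(None, 0.0), None))
```

The design point is that no branch waits or retries: a low-confidence read is treated as an expected input with its own route, so a recognition error costs one item a detour rather than stopping the line.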

~8 million items per day (2.6B letters + 400M parcels a year)

11/sec letter throughput of the sorting machine (40,000/hour)

3 million/day Vnukovo center capacity (64,000 m², 125,000 letters/hour)

4 sec per parcel for the robotic manipulator

🧬

Pharma & Biotech · Moderna

Moderna: 750 internal GPTs in two months and 120 AI conversations per employee per week

Within two months of adopting ChatGPT Enterprise, the company had 750 GPTs, 40% of weekly active users had created their own GPTs, and each user averaged 120 ChatGPT Enterprise conversations per week — roughly twenty-five interactions per working day, meaning the tool is genuinely embedded in the workflow rather than opened 'for the record'. The legal team reported 100% adoption. The very fact that the case is measured in usage metrics rather than 'projected savings' sets it apart from most corporate announcements. Important framing. All figures are Moderna's self-reported numbers published in an OpenAI case study — material from interested parties (OpenAI sells ChatGPT Enterprise; Moderna demonstrates technology leadership); there is no independent audit. Dose ID is a pilot explicitly positioned as an assistant to human-made decisions, not an autonomous dose selector. No dollar-value financial impact has been disclosed, and the link '750 GPTs → 15 products in 5 years' remains declarative: the product plan will be tested by clinical trials and regulators, not by the number of chatbots. In our view, the case's main substance is not the numbers but the method: Moderna showed that 'AI adoption' is 20% platform choice (made, tellingly, via an in-house NPS experiment) and 80% a change program with contests, champions, office hours, and an engaged CEO. The '120 conversations per user per week' metric is the best publicly available indicator that the program worked: licenses can be issued by decree, habits cannot. A second observation: the pyramid '80% mChat adoption → three-platform NPS test → 750 GPTs' illustrates a sound investment sequence. A cheap API prototype built the skill base and user-behavior data before the enterprise purchase; the platform choice leaned on that base; the mass of GPTs grew on an already prepared culture. We would not expect the same numbers to reproduce at a company that started by simply buying licenses.

750 GPTs in the first 2 months

120 conversations per user per week

40% of weekly actives created GPTs

100% adoption in the legal team

🏦

Banking · BBVA

BBVA: 3 hours saved per employee per week, 20,000+ GPTs, and a ChatGPT rollout to 120,000 people

Per figures published by OpenAI in November 2025: about 3 hours saved per employee per week, 83% weekly active usage, efficiency improvements of up to 80%+ in tests of specific workflows, and 20,000+ custom GPTs created (around 4,000 in frequent use). December materials cite an even stronger engagement metric: 80% of users access the assistant daily. It was on the back of these results that the decision to roll out to all 120,000 employees was made — and Sam Altman publicly calls BBVA 'a strong example of how a large financial institution can adopt AI with real ambition and speed'. Frames to keep in mind. All metrics are BBVA's internal estimates published by OpenAI and the bank itself; there is no independent audit, and both parties benefit from a positive picture. 'Up to 80%+' refers to tests of specific workflows, not the whole operating model; '3 hours a week' is users' self-assessment on routine tasks, not a time study. The Peru example (7.5 min → 1 min) is one assistant in one country. Finally, '83% weekly active' and '80% daily' are metrics from different periods and methodologies and should not be glued together. In our view, this case has two genuinely rare traits. The first is the managerial honesty of the sequence: 3,300 → 11,000 → 120,000 with interim metrics published before each next step; that makes the tenfold expansion defensible to the board and the regulator, and the case reproducible as a method (unlike irreproducible 'success stories'). The second is the pyramid of 20,000 GPTs created versus 4,000 frequently used: the bank itself publishes a 1:5 tool survival ratio, and that is the case's most useful number for planning your own rollout — impact should be computed from the active core, not the gross count of what was created. 
A note of caution to close: 'The Eight' and the 'AI-native bank' are a declaration of direction, not a result; it will be testable through customer-experience and operating-cost metrics years from now, and we would not project today's productivity figures onto tomorrow's transformation promises.
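The "active core" idea from this case reduces to one ratio. A minimal Python sketch, assuming purely for illustration that another organization's share of surviving tools would match BBVA's published one; the planning input of 1,500 tools is hypothetical:

```python
# Figures published for BBVA, as quoted in this case.
GPTS_CREATED = 20_000
GPTS_IN_FREQUENT_USE = 4_000

survival_share = GPTS_IN_FREQUENT_USE / GPTS_CREATED  # 0.2, i.e. 1 in 5


def expected_active_core(tools_created: int, share: float = survival_share) -> int:
    """Estimate how many created tools end up in frequent use."""
    return round(tools_created * share)


# Hypothetical planning input: a rollout that expects 1,500 tools to be created.
print(expected_active_core(1_500))
```

In this sketch a plan counting on 1,500 created tools should size its impact estimate on about 300 of them. The share itself is one bank's self-reported figure and would need to be remeasured in any other rollout.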

~3 h saved per employee per week

83% weekly active usage

20,000+ custom GPTs (~4,000 in frequent use)

120,000 employees in the target rollout across 25 countries

🛒

E-commerce & Fintech · Mercado Libre

Mercado Libre: the GPT-4o-powered Verdi platform — a 100x catalog expansion and nearly 99% fraud-flag accuracy

Per figures published by OpenAI: GPT-4 Vision enabled cataloging 100x more products within two years; fraud-detection accuracy reached nearly 99% for flagged items; within a few months Verdi took over 10% of customer-service mediation on one of the company's major sites; review summaries increase orders where available (with no effect size disclosed). The company estimates the potential at supporting the tasks of 9,000 operators and autonomously managing customer-service decisions worth $450 million annually. A separate, often-skipped result is the economics of development speed: the platform was designed 'with a focus on lowering cognitive load' so that any team can assemble, test, and deploy AI applications in a unified environment with built-in unit testing. The company does not quantify this effect, but it is precisely what explains building a platform instead of hiring more operators. Framing is mandatory. All figures are the company's self-report in a case study by OpenAI — the party selling the models used; there is no independent verification. The 100x is growth in cataloged products, not revenue or listing quality. 'Nearly 99%' refers to accuracy on flagged items — the metric says nothing about recall (how much fraud is never flagged at all). The 10% of mediations applies to one site, not the whole group; and '9,000 operators' and '$450M' are stated potential, not achieved results. Conflating these categories is the classic mistake in reading vendor case studies. In our view, the case's key meaning lies in the sequence of admitting AI to money. Mercado Libre first spent years running models on tasks with a cheap cost of error (catalog tags, translations, summaries), accumulating data, infrastructure, and organizational trust — and only then admitted AI to disputes with monetary consequences, and in fractions: 10% on one site, with escalation to humans and a rollback path. 
This is the opposite of the fashionable 'autonomous agent in production within a quarter' — and probably the only way to do such things in a company where AI decisions directly move other people's money. A second observation: 'developers never see the source code' is Verdi's most debatable and most interesting decision. We read it as a bet on radically lowering the entry barrier (any team assembles an AI app from nodes and skills) in exchange for total dependence on the quality of platform guardrails. With 17,000 developers the bet looks rational: centralized security scales, reviewing every home-grown bot does not. But we would not advise porting the pattern into an organization without a mature platform team.
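The caveat about "nearly 99%" on flagged items can be made concrete. If that figure is read as precision (an interpretation, since the case study does not define the metric formally), two systems with identical precision can differ enormously in how much fraud they miss. A Python sketch with invented counts:

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of flagged items that really were fraud."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Share of all fraud that was flagged at all."""
    return true_pos / (true_pos + false_neg)


# Two invented scenarios with identical flags: 990 correct, 10 wrong.
flagged_correct, flagged_wrong = 990, 10

# Scenario A misses 10 fraudulent items; scenario B misses 9,010.
for name, missed in [("A", 10), ("B", 9_010)]:
    p = precision(flagged_correct, flagged_wrong)
    r = recall(flagged_correct, missed)
    print(f"Scenario {name}: precision {p:.1%}, recall {r:.1%}")
```

Both scenarios report 99% on flagged items, yet scenario B catches under a tenth of the fraud. None of these counts come from Mercado Libre; they only show why the published metric cannot be read as overall fraud-detection quality.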

100x more products cataloged in 2 years

~99% fraud-detection accuracy (flagged items)

10% of mediations handled by AI on a major site

$450M potential in autonomous decisions annually

🏥

Health Insurance · Oscar Health

Oscar Health: documentation nearly 40% faster, claims escalations resolved 50% faster, and the first BAA with OpenAI

Per the OpenAI case study (April 2024): time spent documenting care conversations and reviewing lab results fell by nearly 40% (per the company blog — from 20+ minutes to under 12 per conversation); the claims assistant cut escalation resolution time by 50% with accuracy on par with or better than human agents; the expectation — automating investigation of at least 4,000 tickets per month (48,000 by year end). R&D experiments showed productivity gains of up to 90% in some cases. The engineering write-up adds granularity vendor cases usually lack: on simple scenarios the machine matched humans immediately, on complex ones it initially lagged, and after Skeleton Trace was introduced several question categories reached 100% accuracy, the hardest — 80%. Framing: all figures are Oscar's self-report in an OpenAI case study (an interested party); 'up to 90%' is an R&D result, not production; 4,000 tickets/month was an expectation at publication time, not a confirmed fact; payout decisions remain with humans — AI accelerates finding and preparing information. The ambition to 'bring down the cost of seeing physicians by a factor of 10 in three to five years' is Schlosser's declaration of direction, not a commitment with a metric. In our view, the case carries two systemic lessons beyond insurance. First: the sequence 'BAA first, models second' is the right order for any regulated industry; Oscar gained its head start not because it had access to different models, but because it was first to build the legal construct in which those models could lawfully touch real data. Second: the gap between 40% (production) and 90% (R&D) is the honest price of moving from demo to operations, and companies that publish both numbers with explicit labels deserve more trust than those reporting '90% everywhere'. A third, engineering observation: the claims assistant's history shows that the main work of putting LLMs into legacy systems is not prompting but data-representation design. 
GPT-4 failed until the traces were repackaged into a Skeleton Trace with iterative detail retrieval — in effect, the team invented for its task what would later become the common agentic-RAG pattern. We would read this case precisely as a textbook on 'preparing data for the model', not as an advertisement for a particular API.

~40% faster clinical documentation (20+ min → <12 min)

50% faster claims escalation resolution

100% / 80% assistant accuracy by question category (simple / hardest)

No. 1 first insurer BAA with OpenAI

⚖️

LegalTech · Harvey

Harvey: a custom U.S. case law model that lawyers preferred over GPT-4 97% of the time

To test the model, Harvey brought in attorneys from 10 of the largest law firms. They were shown side-by-side outputs of the custom case law model and GPT-4 for the same question. "97% of the time, the lawyers preferred the output from the case law model," says Weinberg. "Usually, it was because it was a longer, more complete answer. It went into the nuance of what the question was asking and covered more relevant case law." The strength of the reaction, the team admits, surprised them. The second result concerns hallucinations — largely the reason the model was built: "Not only does the case law model not make up cases, but every sentence is actually supported with the case it's citing," Weinberg states. Framing that matters. The 97% is a preference metric, not a measure of formal legal accuracy: it shows which answer lawyers choose, not how many errors it contains. The testing was organized by Harvey itself; the question sample size was not disclosed; no independent replication has been published; and the case study was published by OpenAI — an interested party whose product underpins the solution. The no-hallucinations claim is likewise a founder's statement, not the result of a published independent audit. Editorial analysis. Indirect evidence that the custom-model bet paid off at least commercially is Harvey's trajectory after the case study: a $100M Series C at a $1.5B valuation (July 2024), $300M Series D at $3B (February 2025), $300M Series E at $5B (June 2025), $160M Series F at $8B (December 2025), and a $200M round at an $11B valuation in March 2026; 2025 revenue was $190M (Wikipedia, based on public disclosures). Customers include firms like Paul Weiss and A&O Shearman, where 3,500 lawyers ran about 40,000 queries during a trial period; Ashurst rolled Harvey out globally, Singapore's WongPartnership became the first Southeast Asian firm to test the product, and PwC adopted it for legal services in Singapore. 
In fairness: the company's growth does not isolate the custom model's contribution — Harvey grew across its whole product line, and an $11B valuation cannot be attributed to one technical decision. A second observation: the case captured the "maturity ladder" of working with LLMs — prompting → RAG → fine-tuning → custom training — which later became industry common sense. It also shows how fast baselines age: "preferred over GPT-4 97% of the time" sounded strong in April 2024, but a comparison against a long-surpassed model says nothing by itself about where Harvey stands against today's frontier models.

97% lawyer preference vs GPT-4

10 billion tokens of data added to the model

10x revenue growth in 2023

$715M Series B valuation ($80M raised)

🛋️

Retail (E-commerce) · Wayfair

Wayfair: Gemini enriches a 30M+ product catalog 67% faster and lifts filter conversion by ~2%

The published numbers focus on the catalog front. Per the joint Wayfair–Google Cloud press release: the time needed to curate new and update existing product listings dropped by 67% across a catalog of more than 30 million products. Improving the accuracy of product attributes (color, subject) and the coverage of these tags in the catalog improved conversion rates when customers use filters by ~2%. The company estimates savings from dropping manual tagging at hundreds of thousands of dollars per year. Framing that matters. All figures are Wayfair's estimates published by Google Cloud — the vendor partner; there is no independent audit. The ~2% metric applies to conversion in filter-driven journeys, not sitewide revenue. The dollar savings are given as a "hundreds of thousands" range without an exact amount. The difference in wording between documents is also telling: the press release says curation time was reduced "by 67%," while the later case study version is more cautious — "up to 67% faster." Finally, there are no public metrics at all yet for Muse, the Discover tab, or UCP — those are pilots and announcements, not measured results. Editorial analysis. First: this case is a rare, clean example of a "data before storefront" strategy. Wayfair first fixed the foundation (catalog attributes) and only then started building customer products on top — Muse, visual search, agentic commerce. The reverse order — a flashy AI facade first, data later — is what usually produces disappointing pilots. Second: a ~2% conversion lift on filters across a 30M-product catalog is the classic economics of large funnels, where small percentages mean more than loud demos; that said, the absolute dollar impact cannot be assessed because Wayfair does not disclose the share of filter-driven sessions in sales. 
The third observation concerns UCP: Wayfair's bet on a protocol where the merchant of record stays with the retailer is an attempt to enter the agentic shopping era without ceding margin and customer relationships to the platforms. If agentic commerce becomes a meaningful channel, catalog attribute quality — where this case began — will be the entry ticket: an AI agent needs machine-readable, accurate product data even more than a human does. In that sense, the "boring" catalog tagging looks like the most far-sighted part of the whole program. Industry background confirms the direction: per PYMNTS Intelligence data cited in coverage of the case, 92% of companies use AI-driven personalization for growth and 77% of business leaders rank generative AI as the most impactful emerging technology — the question has long been not 'whether' but 'in what order.'

67% faster listing curation

30M+ products in the catalog

~2% conversion lift with filters

$100K+ annual savings (hundreds of thousands)

👻

Social Media · Snap Inc.

Snapchat: multimodal Gemini in the My AI chatbot — over 2.5x engagement growth in the U.S.

Per Snap and Google Cloud, after deploying Gemini in My AI, engagement within Snapping to My AI grew by over 2.5x in the United States. Kurian's wording in the Google blog: "Snap deployed the multimodal capability of Gemini within their 'My AI' chatbot and has since seen over 2.5x as much engagement within Snapping to My AI in the United States." Framing that matters. This is the case's only published metric. It was measured in a narrow window of August 27 – September 2, 2024 — stated directly in the footnote of Snap's announcement, marked "Snap Inc. internal data." "Snapping to My AI" means sending snaps to the bot, so the metric tracks growth of multimodal usage specifically, not all My AI activity. Absolute audience figures for the bot were not disclosed: "2.5x" of an unknown base could mean explosive growth or modest absolute numbers. No impact on revenue, retention, or time-in-app has been published. Editorial analysis. First: the size of the effect here is less interesting than its mechanics. The 2.5x+ growth happened not because the model got "smarter at answering in text," but because the bot learned to accept input in the format Snapchat's audience already communicates in — snaps. This is an unusually clean confirmation of the principle that modality must match the product's interface: the same LLM functionality packaged into a text box would almost certainly not have produced such a jump. Second: a one-week measurement window right after launch is the classic novelty spike; the steady-state engagement level could have turned out higher or notably lower, but the companies never published it — in late 2024 or since. Silence after a loud announcement is itself information for anyone evaluating the case. 
Third: strategically, the case illustrates where platforms of Snap's scale stand in the model market — they migrate between vendors (ChatGPT in 2023, Gemini in 2024) faster than the enterprise segment, because their integrations are thinner and their motivation is a specific capability rather than ecosystem loyalty. For model providers this means consumer flagships are a winnable — and losable — segment; for product teams, that architecting for model replaceability pays off at every such transition. Finally, note the disclosure asymmetry: the announcement includes an engagement figure but says nothing about inference cost, latency, or the impact of multimodal queries (noticeably more expensive than text) on the feature's economics — even though at 850M+ monthly users those are the parameters that decide whether the feature survives. The public part of the case answers "did it take off" but not "does it add up." For a reader mapping this case onto their own product, that means the scenarios and mechanics are transferable, but the financial model will have to be built from scratch on your own data.

2.5x+ My AI engagement growth (U.S.)

850M+ Snapchat monthly active users

5 modalities: text, audio, image, video, code

2024 Gemini launch in My AI (September)

📡

Telecom · Vodafone

Vodafone: Copilot for 68,000 employees — ~3 hours saved weekly, while TOBi resolves 70% of ~45M monthly inquiries

The Copilot pilot, assessed with KPMG, produced the case's headline number: savings of around three hours per week per person on average — on emails, minutes, and information search. 90% of participants said they benefited and wanted to keep using the tool; 60% said it improved the quality of their work. The legal department later measured an average of 4 hours saved per week per person, with contract drafting time down by about an hour per document. On this data Vodafone decided to roll Copilot out to 68,000 employees — while Petty stresses: "It's not about doing more work, it's about doing better quality work and being more customer-focused." Microsoft's story also records the expectation that savings will grow as users get comfortable with the tool: the pilot's three hours reflect users who had not yet reached "cruising speed" with Copilot. On the customer front: TOBi handles nearly 45 million inquiries monthly and fully resolves 70% of them through digital channels; the remaining ~30% go to live agents supported by SuperAgent. With SuperAgent, average call time dropped by at least one minute; Microsoft also reports improved customer satisfaction post-implementation — without disclosing specific NPS/CSAT values. Framing. The three hours is self-assessed by pilot participants in a 300-user trial, albeit collected with KPMG's involvement: it is a survey, not a time study. TOBi's 70% is the metric of a mature system that generative AI enhanced rather than created from scratch: it cannot be attributed wholly to Azure OpenAI. No currency-denominated financial impact has been disclosed on either front; all of the case's sources are Microsoft materials — an interested vendor. Editorial analysis. The most valuable part of the case is not the numbers but the proof construction: limited pilot → external assessor (KPMG) → scaling decision made on data. 
That moves the AI-productivity conversation from the genre of "feelings" to auditable metrics, the layer most enterprise Copilot deployments lack. The second value is the SuperAgent pattern: AI for the agent, not instead of the agent. At least a minute off each call, across the roughly 30% of ~45M monthly inquiries that reach live agents, is an enormous operational lever without the service-quality risks of full automation; notably, the sources report no staff reductions tied to these tools. Third: the variance of impact across functions (3 hours on average vs 4 for legal) is a practical argument to prioritize rollout to document-heavy departments rather than spreading licenses evenly. And fourth, a detail easy to miss: the feedback from neurodiverse employees about reduced writing stress shows enterprise AI has an inclusion dimension that never shows up in standard productivity metrics yet directly affects retention and engagement. Not for nothing did Microsoft put "employee inclusion" in the title of its Azure AI story.
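To give the "operational lever" a rough size, here is a back-of-envelope sketch built only from the figures quoted in this case. Two assumptions are ours, not Vodafone's or Microsoft's: that every inquiry TOBi does not resolve becomes exactly one agent call, and that the pilot's self-assessed three hours would hold across the full rollout. Treat the outputs as orders of magnitude, not measurements.

```python
# Back-of-envelope scale check using only figures quoted in the case.
MONTHLY_INQUIRIES = 45_000_000   # "nearly 45 million inquiries monthly"
TOBI_RESOLVED_PCT = 70           # share TOBi resolves digitally
MINUTES_SAVED_PER_CALL = 1       # "at least one minute" with SuperAgent

# Assumption (ours): each unresolved inquiry turns into one live-agent call.
agent_calls = MONTHLY_INQUIRIES * (100 - TOBI_RESOLVED_PCT) // 100
agent_hours_saved = agent_calls * MINUTES_SAVED_PER_CALL / 60

COPILOT_SEATS = 68_000
HOURS_SAVED_PER_WEEK = 3         # self-assessed pilot average

# Assumption (ours): the pilot average carries over to every seat.
copilot_hours_per_week = COPILOT_SEATS * HOURS_SAVED_PER_WEEK

print(f"{agent_calls:,} agent calls per month")
print(f"{agent_hours_saved:,.0f} agent hours saved per month")
print(f"{copilot_hours_per_week:,} Copilot hours saved per week")
```

Both totals inherit the weaknesses of their inputs: the call-time figure is a vendor-reported floor, and the three hours is a survey result.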

68,000 employees getting Copilot

~3 h saved per person weekly (pilot)

70% of inquiries TOBi resolves digitally

−1 min off average call time (SuperAgent)

📝

Productivity software · Notion

Notion: Claude-powered agents inside the workspace cut customers' information search time by 35%

The impact is measured primarily on the side of Notion's customers. Osaka Gas cut information search time by 35%. Remote saves 10 minutes per search across 300 daily queries. dbt Labs saved over $35k a year by dropping separate AI tools — Notion's agents covered those scenarios. During onboarding, new employees query the AI assistant 10–20 times a day in their first weeks. Inside Notion itself, one recorded example saw 12 hours of prototyping work collapse into about 20 minutes: "Then your whole team can jump in and refine it together," says Liu. For Notion itself, the key result is economics: prompt caching cut the cost of running agents by 90% and latency by up to 85%. That is what makes agents sellable as a mass feature rather than a premium option for the few. Framing. The case is published by Anthropic — the vendor selling both the model and the Managed Agents infrastructure; customer metrics (35% at Osaka Gas, $35k at dbt Labs) come without measurement methodology; 12 hours → 20 minutes is a single example, not an average; and the before/after comparison does not isolate Opus 4.6's contribution from the general evolution of Notion AI. Editorial analysis. First: this is one of the few cases where a vendor story contains third-party metrics — the customer's customers (Osaka Gas, Remote, dbt Labs), and such numbers are more convincing than internal benchmarks, though they too passed through Anthropic's marketing filter. Second: the pairing of "90% savings from caching" with "30+ concurrent tasks" reveals the real cost structure of agent products — the winner is not whoever has the smarter model but whoever learned to reuse context; for anyone building agents into their SaaS, that is the case's most transferable takeaway. 
Third: the dbt Labs story ($35k saved by dropping third-party AI tools) sketches the market's consolidation dynamic — platforms' built-in agents are eating the budgets of specialized AI add-ons, worth remembering when deciding what to build yourself versus what comes bundled with the platform. A fourth observation concerns the pace of autonomy growth: the Notion 3.0 release in September 2025 claimed "over 20 minutes of multi-step actions," while Anthropic's case study already describes tasks running from 20 minutes to hours. In a matter of months, the agent's autonomous horizon grew severalfold — and it is this curve, not any single metric, that determines which categories of busywork agents will take over next.

35% less search time (Osaka Gas)

10 min saved per search × 300 daily queries (Remote)

90% cost reduction (prompt caching)

30+ concurrent agent tasks

🤖

Software development · Replit

Replit: a Claude-powered agent turned the platform into a growth machine — ARR from $1M to $240M

The headline number of Anthropic's case study: Replit's annual recurring revenue grew from $1M to $240M on the back of Claude-powered agent launches. The chronology from independent sources matches in order of magnitude: TechCrunch records $2.8M ARR in 2024 and $150M+ annualized less than a year after the agent launched, while funding-round coverage cites $240M of 2025 revenue and a $1B ARR target. The company's valuation tripled in half a year, from $3B to $9B. Product results: in a single session, Agent 4 produced 36,000+ lines of production-ready code; in another example, ~400 minutes of autonomous runtime built a "business operating system" of 16 systems — 8 admin and 8 client portal modules. Masad sums it up: "What previously required weeks of developer time now happens in a single afternoon." Framing. The $1M → $240M figures come from Anthropic, an interested vendor; independent sources give slightly different starting points ($2.8M in 2024), which doesn't change the order of the effect but is a reminder that the baseline was rounded favorably. The revenue growth cannot be attributed to Claude models alone: the pivot to a non-technical audience, a new pricing model, and the general "vibe coding" boom worked simultaneously. The "36,000 lines" and "400 minutes" examples are curated demonstrations, not averages. Editorial analysis. This case is the market's best illustration of the thesis that an agent expands the market rather than automating it: for ten years Replit sold an IDE to programmers and stagnated at $3M ARR, then took off when it sold outcomes to people who don't know how to code. Same product — different buyer. 
The second layer is the price of autonomy: the July incident with the deleted production database shows that 6+ hours of autonomous work without guardrails is 6+ hours of autonomous mistakes; the maturity of agent products is measured not only by session length but by the quality of sandbox isolation, backups, and the agent's rights to irreversible actions. And third: the #3 position in a16z's report on startups' real AI spending is rare external confirmation that Replit's growth is paid for by customer budgets, not just venture enthusiasm. There is a flip side to the same dependency, which TechCrunch names directly: Anthropic and OpenAI have launched competing coding tools of their own, so the suppliers of Replit's "engine" are simultaneously its potentially biggest competitors, with model-optimization advantages and the ability to subsidize prices. Betting on someone else's models gave Replit speed — and left a strategic risk no funding round can close.

$1M→$240M ARR growth

50M+ platform users

6+ h autonomous agent runtime

36,000+ lines of code in one session

🦊

DevOps · GitLab

GitLab: Claude in GitLab Duo and internal workflows — 25–50% productivity gains

Per Anthropic's case study, using Claude in internal workflows delivered 25–50% productivity gains at GitLab. AI feature development accelerated to "weeks, not years" — versus the scenario of building an in-house ML stack. Jessie Young sums it up: the partnership let the team "weave AI into various features without reinventing the wheel" — powerful models integrated with the platform without a dedicated ML team. The customer-side effect is illustrated by a testimonial in the 2026 press release — Mans Booijink, Operations Manager at Cube: "GitLab Duo has accelerated how our teams plan, build, and ship software. The combination of Claude models and GitLab's platform means we're getting more capable AI without changing how we work or how it is governed." Framing. 25–50% is a wide range with no disclosed methodology: it is unknown which processes were measured, how, on what sample, and against what baseline; it is GitLab's self-report published by Anthropic — an interested vendor. "Weeks, not years" is a qualitative assessment, not a specific project timeline. Productivity metrics for external teams using GitLab Duo itself are not provided in these sources. Editorial analysis. First: the most durable thing in this case is not the numbers but the institutions. The model evaluation team and the "right model for the right use case" principle survived several Claude generations — from the 3 family in the original case study to Opus 4.7 in the 2026 agent platform. Companies where model selection is a process, not an event, migrate to new generations painlessly; companies where it is an event live through every release as a crisis. Second: the 2026 announcement shows where competition in DevOps AI has moved — not "whose model is smarter" but "whose agents fit into compliance": the "no separate governance layer" formula is selling governance as a product, and judging by GitLab's positioning, for enterprise it works better than benchmarks. 
Third: routing access through Google Cloud, Bedrock, and the Claude Marketplace captures the new reality of enterprise AI procurement — models are bought like electricity, through existing contracts and commitments, and a product that can "flow into" those contracts removes the main purchasing barrier. Fourth — a methodological contrast worth keeping in mind when reading such stories: where Vodafone backed its Copilot pilot with an independent KPMG assessment, GitLab publishes a 25–50% range with no external auditor. That doesn't make the number false, but it places it on a different evidence tier — and reminds us that "productivity gains" without a documented methodology are only loosely comparable across cases.

25–50% internal workflow productivity gains

Weeks, not years, to ship AI features

50M+ registered GitLab users

50% of the Fortune 100 are customers

🛡️

Cybersecurity · Palo Alto Networks

Palo Alto Networks: Claude for thousands of developers — feature development velocity up 20–30%

The headline numbers agree across both case studies: feature development and code implementation velocity rose 20–30%. Unit test writing speed increased 10–30% — and that is not just productivity: more tests mean fewer bugs and a higher-quality codebase. Junior developers complete tasks 70% faster, and their onboarding dropped from months (up to six) to weeks. Scale: a 150-developer pilot → 3,000 in rollout per Google Cloud's version; 2,500 onboarded and 3,500 ramping per Anthropic's. Patel sums up why this matters to cybersecurity: "Running Claude on Google Cloud's Vertex AI not only accelerates development projects, it enables us to hardwire security into code before it ships." A separate result is the vendor selection criterion Patel states outright: "Anthropic prioritized safety and security a lot more than other LLMs. They discuss security and safety implications in every meeting. As the largest cybersecurity company, that's a big deal for us." Framing. All figures are the company's self-report published by two interested vendors (Anthropic and Google Cloud); the methodology behind "feature development velocity" is not disclosed; the juniors' 70% is an estimate for a specific type of integration task; the 2,500/3,000 developer discrepancy between the stories reflects different snapshot moments — a reminder that numbers in vendor stories are point-in-time, not audited. Editorial analysis. First: this is a rare case where an AI assistant's ROI is computed not "on average" but by segment — and the biggest win sits with juniors (70% vs 20–30% average). The practical takeaway inverts the "AI amplifies the strong" intuition: in large engineering organizations AI primarily levels the team and converts six-month onboarding into weeks, which is direct money in hiring and scaling. 
Second: the sequence "map the process → measure a pilot → scale" matters more than any single number here; the leverage point (the initial development phase, 30–35% of time) was chosen from data, not fashion. Third: the post-processing pipeline — generating and running tests, finding vulnerabilities, and auto-patching after every PR — sketches a deeper restructuring than an IDE assistant: AI becomes a pipeline stage with its own area of responsibility, and Patel's remark that "we probably won't have all these stages" reads like the announcement of the next case study. Finally, note what gets measured here: PANW is one of the few to publish not just "average productivity" but segmentation by seniority (juniors/seniors) and a separate testing metric. The more granularly a company slices its own numbers, the more reason there is to trust them — by that criterion, the PANW case is markedly more evidential than the average vendor report.

20–30% feature development velocity

10–30% faster unit test generation

70% faster task completion for juniors

Weeks of onboarding, down from months (up to 6)

🏛️

Government · European Parliament

European Parliament: Archibot, a Claude-powered AI archivist — 80% less time searching 2.1M documents

The case's key metric: document search and analysis time dropped by 80% while maintaining high accuracy and security standards. The system processes and provides access to more than 2.1 million documents in multiple languages. The user base is diverse: researchers and policymakers get fast access to historical context and precedents, and the case records the educational effect separately — hundreds of educators and students use Archibot as a window into the history of European parliamentarism. The history of European democracy since 1952 became available to any citizen, conversationally, in their own language. Framing. The 80% comes from Anthropic's official case study (an interested vendor) without disclosed methodology: it is unknown which scenarios were measured and against what baseline. Absolute usage figures (queries, unique users) were not published; "hundreds of educators and students" is the only audience-scale reference. No independent audit of Archibot's answer accuracy is publicly available. Editorial analysis. First: this case is a ready-made template for the public sector, and its strength lies not in the technology but in the choice of proving ground. A historical archive is the ideal starting point for generative AI in government: the data is already public (minimal privacy risk), an error doesn't block a public service, and the impact is public and legible to auditors and citizens alike. The contrast with typical government deployments — which start with citizens' sensitive data and end in scandal — is instructive. Second: Delepine's requirement of permanent control over the solution and its data is a formula that captures the broader European approach to AI procurement: not "buy a smart bot" but "own a governed instrument." For vendors this means that in the EU you sell not a model but the control perimeter around it. 
Third: multilinguality here is not a feature but the very substance of the project: the technical search barrier was removed equally for everyone, but it was the language barrier that separated a "formally open" archive from a genuinely accessible one. That is a useful lens for any "open data" project: openness is measured not by the fact of publication but by the cost of access for a specific person. Finally, the template is plainly replicable: national archives, parliamentary libraries, and municipal document collections worldwide sit on the same two problems — searchability and accessibility — and Archibot hands them a ready reference architecture: public data + a governed cloud perimeter + multilingual conversational access. The only question is who replicates it next, and whether they publish their own 80%.

80% less search and analysis time

2.1M+ documents made accessible

1952 archive dating back to

All EU member-state languages

📡

Telecom · SK Telecom

SK Telecom: a telco LLM on fine-tuned Claude — 68% fewer low-quality support responses

LLM response quality ratings from live agents rose 34% (in-call assistance). The share of low-quality responses fell 68% — the fine-tuned telco model versus the base model. In post-call processing, quality reached ~89% of human-agent level, and the telco LLM's overall score improved from 3.3 to over 4.3. Per the AWS technical blog, fine-tuning Claude 3 Sonnet combined with prompt optimization improved ROUGE-3 by 58.1%, ROUGE-L by 26.8%, and citation accuracy by 70.59% over the baseline. Framing. All percentages are relative improvements on SKT's internal benchmarks; absolute quality values are not disclosed. ROUGE measures textual overlap, not semantic correctness, and guarantees nothing by itself (which the blind human evaluations compensate for). Both primary sources are parties to the deal: Anthropic (in which SKT invested $100M) and AWS (whose platform hosts it all). No data on contact center business metrics — handling time, satisfaction, savings — has been published. Editorial analysis. First: the case effectively publishes an industry LLM "maturity ladder" with the price of each rung — prompts deliver the first 35–40% almost for free, fine-tuning adds the next jump, synthetic data removes the data shortage. The practical takeaway for any team: don't start with the expensive stage — most of the effect goes to disciplined prompt work, and fine-tuning makes sense once the prompt ceiling is genuinely reached and measured. Second: the "89% of human level" figure in post-call processing is a model of honest metrics: it shows both readiness (routine write-ups can be handed to the model under supervision) and the boundary (the 10+ percent shortfall is why the human stays in the loop). Third: the SKT–Anthropic investor link makes the case simultaneously stronger and weaker: stronger because SKT risks its own money and builds an operator alliance (GTAA) on this bet; weaker because every published metric passes through two financially interested parties. 
The result has no external replication yet — though the very design of the Global Telco AI Alliance implies the approach will be replicated across the alliance's other operators, and that will be the real test of whether a telco LLM transfers beyond the Korean market.
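The caveat that ROUGE measures textual overlap rather than semantic correctness is easy to see in a few lines. The sketch below is a simplified ROUGE-L F1 (whitespace tokens, no stemming), not the evaluation code used by SKT or AWS, and the two sentences are hypothetical examples of ours.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L: F1 over LCS-based precision and recall."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical pair: the candidate contradicts the reference.
reference = "the roaming plan is active on your line"
candidate = "the roaming plan is not active on your line"
score = rouge_l_f1(reference, candidate)
print(round(score, 3))
```

The two sentences say opposite things, yet the score comes out at 16/17, about 0.94. That is why the blind human evaluations in this case carry more weight than the ROUGE deltas.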

−68% low-quality model responses

+34% response quality ratings

89% of human-agent quality (post-call)

3.3→4.3+ telco LLM overall score

🧭

Travel & media · Lonely Planet

Lonely Planet: travel itinerary generation with Claude on Amazon Bedrock — 80% cheaper than manual curation

The headline public figure: itinerary generation costs dropped by roughly 80% versus manual curation. The platform creates thousands of unique travel itineraries, each of which previously took the team days of manual work and is now assembled in minutes; the planner's beta was handling around 1,000 trips per day. Fifty years of publishing content — 150 million guidebooks, 270,000 destinations, the knowledge of 750+ local experts — became a working digital platform while preserving the brand's key differentiator: the expertise behind the recommendations. An important caveat on the figure's bounds. Three sources phrase it differently: the AWS customer page speaks of an '80% cost reduction', Whyde himself is quoted on the AWS blog saying 'we reduced itinerary generation costs by nearly 80%', while in the IT Pro interview the framing is inverted — manual curation 'would cost about 80% more', which mathematically means only a ~44% reduction rather than a fivefold one. We show this discrepancy deliberately: the real economics likely sits between these interpretations, and the company has not published its calculation methodology. A separate figure — choosing Claude as a model roughly 78% cheaper than the alternatives considered — refers to inference cost, not to the comparison with manual labor. In our view, the value of this case lies less in the specific percentage — numbers from AWS vendor materials should be read as marketing, without independent audit — than in the purity of the pattern itself: this is one of the first public examples of monetizing a publishing archive through RAG. Lonely Planet did not generate travel content 'from scratch' with a public model — it built a product that cannot be replicated with a ChatGPT prompt, because the raw material (vetted expert content) exists only in-house. The competitive moat here is the data, not the model: the LLM itself was chosen on price and is replaceable if needed. Our second editorial takeaway: speed of entry. 
The path from Bedrock's opening (April 2023) to a public quote in the Claude 2 announcement (August 2023) took this sizable traditional publisher mere months — which, in our view, was possible thanks to a cloud transformation completed in advance. Companies whose infrastructure and data are already in order ride the generative wave faster than those that start an AI project with a migration.
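The gap between the two readings of "80%" in this case is plain arithmetic, and a short sketch makes it concrete. The function names are ours; the only inputs are the 80% figure and the two phrasings quoted from the sources.

```python
def reduction_if_costs_fell_by(cut):
    """Reading A: 'costs fell by 80%' means new = (1 - cut) * manual."""
    return cut


def reduction_if_manual_costs_more(markup):
    """Reading B: 'manual would cost 80% more' means manual = (1 + markup) * new."""
    return 1 - 1 / (1 + markup)


def manual_to_ai_cost_ratio(reduction):
    """How many times more expensive manual curation is, given a reduction."""
    return 1 / (1 - reduction)


reading_a = reduction_if_costs_fell_by(0.80)
reading_b = reduction_if_manual_costs_more(0.80)

for name, r in [("AWS phrasing", reading_a), ("IT Pro phrasing", reading_b)]:
    print(f"{name}: {r:.1%} reduction, manual costs "
          f"{manual_to_ai_cost_ratio(r):.1f}x the AI route")
```

Reading A makes manual curation five times more expensive than the AI route, while reading B makes it 1.8 times more expensive. The same "80%" therefore describes two quite different business cases.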

80% cheaper than manual curation

Minutes per itinerary, down from days

270K+ mappable destinations

750+ local experts in the knowledge base

💊

Pharma · Pfizer

Pfizer: generative AI on AWS — 16,000 search hours saved yearly and impact estimated at up to $1B annually

The program's measured results: up to 16,000 search hours saved annually for 1,500 PSSM scientists and a 55% reduction in infrastructure costs (per the AWS case study). Prototypes ship in 6 weeks instead of 3+ months, five of PACT's 14 projects run in production, and the small-molecule division's experience has been transferred to large molecules. Vox made the corporate document corpus — some 20,000 documents per drug in development — accessible through a natural-language question. The company stated the scale of its ambition publicly at AWS re:Invent 2023: Pfizer estimates its priority AI use cases will deliver savings of $750 million to $1 billion annually. It is important to distinguish the genres of these numbers: 16,000 hours and 55% are retrospective measurements of a specific program, while $750M–$1B is the company's own forward-looking estimate across a portfolio of 17 use cases, with no published methodology. We present the two categories separately and suggest reading them differently: the first is fact, the second a stated goal. In our view, this case is interesting above all as the anti-example of the 'big bet': instead of one megaproject — a portfolio of 14 rapidly tested prototypes, of which slightly more than a third reached production. That funnel (14 → 5) is not low efficiency but the normal economics of innovation: testing a hypothesis in 6 weeks costs incomparably less than a year-long project that 'can't be canceled because too much has been invested'. Note also the sequence of layers: the Scientific Data Cloud (2019) and the cloud migration preceded the generative wave — Vox was built on a ready data foundation, which likely explains the speed. Our second editorial observation: a public impact estimate at CDTO level is a management instrument in itself. 
By naming the $750M–$1B range in a keynote, Fonseca moved generative AI from the category of IT experiments to that of corporate commitments to the market — with the corresponding resource priority. For companies stuck in pilots, that may be the most reproducible element of the case.

16,000 search hours saved yearly

$750M–$1B estimated annual savings (priority use cases)

55% infrastructure cost reduction

6 weeks per prototype, down from 3+ months

🎬

Media & Streaming · Netflix

Netflix: the recommender system drives 80% of viewing and saves more than $1B a year

The recommender system is used on most screens of the product and in total influences choice for about 80% of hours streamed on Netflix; the remaining 20% comes from search. The figure's evolution is visible: in 2012 the company publicly cited 75% — by 2015 the share had grown to 80%. To measure the impact on the catalog, the authors introduced the effective catalog size (ECS) metric: it shows how many videos actually make up a typical hour of viewing. With personalized PVR ranking, the effective catalog is roughly 4 times larger than with unpersonalized popularity ranking: viewing spreads far beyond the hits into the broad library, including niche titles. The business bottom line is stated in the paper verbatim: 'We think the combined effect of personalization and recommendations save us more than $1B per year.' The mechanics run through churn: over years of developing personalization, monthly churn was reduced by several percentage points, which simultaneously raises member LTV and cuts the need for expensive acquisition to replace those who leave. Important bounds: this is the company's own estimate ('we think'), made at 2015's 65M+ member scale, and the authors do not disclose its component math — the paper gives neither the exact churn-reduction percentages nor acquisition costs. In our view, this work remains the gold standard for talking about an ML system's value. First, the chain 'system metric → product metric → money' is transparent: ECS and take-rate link to engagement, engagement to retention, retention to subscription revenue. Second, the '>$1B a year' estimate was published by senior executives under their own names in a peer-reviewed journal — an incomparably stronger commitment than an anonymous marketing press release, though the figure still cannot be independently verified. 
Our second observation: Netflix's main asset in this case is not the specific algorithms (the Netflix Prize-era SVD and RBM long ago became just parts of the ensemble) but the experimentation infrastructure and metric discipline. A system that optimizes months-horizon retention rather than session-horizon clicks is a management decision, not a technical one — and judging by the paper, that is what converted recommendations into a billion dollars.
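For readers who want to see how effective catalog size behaves, here is a minimal sketch. It follows the definition as we read it in the paper: sort titles by their share of viewing hours, largest first, and compute ECS = 2·Σ(pᵢ·i) − 1. The viewing numbers are hypothetical toy data of ours, not Netflix figures.

```python
def effective_catalog_size(hours_by_title):
    """ECS = 2 * sum(p_i * i) - 1, with shares p_i sorted largest first.

    Equals 1 when one title takes all viewing and N when viewing is
    spread evenly over N titles.
    """
    total = sum(hours_by_title)
    shares = sorted((h / total for h in hours_by_title), reverse=True)
    return 2 * sum(rank * p for rank, p in enumerate(shares, start=1)) - 1


# Hypothetical four-title catalogs (toy data, for illustration only).
popularity_ranked = [70, 20, 5, 5]     # viewing concentrated on one hit
personalized = [30, 25, 25, 20]        # viewing spread across the library

print(effective_catalog_size(popularity_ranked))
print(effective_catalog_size(personalized))
```

In this toy case the concentrated catalog scores about 1.9 and the spread-out one about 3.7. That is the direction of the effect the paper reports; the magnitudes here are ours, and nothing is implied about Netflix's actual distributions.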

>$1B saved per year

80% of streamed hours via recommendations

4x effective catalog size

65M+ members at publication

🛡️

Fintech · Stripe

Stripe Radar: the anti-fraud neural network decides in under 100 ms and wrongly blocks just 0.1% of legitimate payments

Radar makes a decision on every payment in less than 100 milliseconds — inside the payment flow, before the transaction is confirmed. Out of billions of legitimate payments on Stripe, the system incorrectly blocks just 0.1% — the key product guarantee: anti-fraud that does not strangle honest customers' revenue. Each architecture jump (regression → trees → the Wide & Deep ensemble → a pure DNN) brought a significant lift in detection quality, and the DNN migration cut training time by over 85%, to under two hours, turning retraining from an overnight job into a several-times-a-day operation. The effect keeps compounding: per Stripe's guide, new models improve Radar's ML performance by more than 20% year over year, and the current product page claims an average 32% fraud reduction for customers and training on over a trillion dollars of annual payment volume. Note the bounds: 0.1% false blocks and <100 ms are figures from the March 2023 engineering post; 92% 'familiar' cards and −32% fraud are marketing data from the 2026 product page; none of these values are independently audited, and the methodology behind the 'average fraud reduction' is not disclosed. In our view, the case's main value is its honestly displayed engineering economics of trade-offs. Deciding to drop the ensemble for iteration speed, knowing it costs 1.5% recall, and offsetting the loss with data scale is mature ML engineering: the team evidently judged that the ability to answer attackers the same day is worth more over time than a fixed percentage of recall. In anti-fraud, where the adversary adapts, the system's learning speed is not an operational metric but a combat characteristic. Our second observation: the case demonstrates the power of an infrastructure position. The data network effect (92% of cards already known to the network) is an advantage that an individual merchant or a niche anti-fraud vendor cannot replicate in principle. 
The same fact argues for caution when reading the numbers: a payments aggregator has both the motivation and the means to present statistics in the most favorable light, so product percentages should be read as orders of magnitude, not audited reporting.

<100 ms decision per transaction

0.1% of legit payments wrongly blocked

1000+ signals per transaction

−85% training time (to <2 hours)

🏠

Travel & Marketplaces · Airbnb

Airbnb: deep learning in search — +0.6% bookings from a new architecture and +14% bookings for new listings

The second wave's outcomes are fixed in online A/B tests. The two-tower architecture: +0.6% bookings and +0.75% revenue with −33% p99 scoring latency; a curious side effect — the average price of booked homes fell 2.3%, meaning the model got better at matching guests' price preferences. The cold-start mechanism lifted new-listing bookings by 14% (and their first-page impression share by 14%), adding +0.38% to overall bookings and making the supply side healthier. Position dropout brought another +0.7% bookings and an unexpected +1.8% revenue; bookings of boutique hotels — a segment hurt by the bias — rose 1.1%. The scale must be read correctly: fractions of a percent here are not 'small results'. At Airbnb volumes, +0.6% bookings is a huge absolute number, and the sum of sequential gains (+0.6%, +0.38%, +0.7% from the second paper alone) compounds into a double-digit cumulative effect over the years. The team itself calls applying neural networks to search one of the company's biggest ML success stories. The bounds are also honest: all figures are the company's self-reported internal A/B tests, but the methodological detail and the published failures give these numbers more weight than a typical press release. In our view, the pair of papers' main value is not the specific architectures (two-tower networks and position-bias mitigation long ago became industry standards) but the documented culture: the sole arbiter of every change was an online test on money (bookings), with offline metrics serving only as a hypothesis filter. The price 'soft monotonicity' case is telling: an intuitively right, offline-beautiful idea lost 0.67% of bookings in production — without online-test discipline it would have stayed in the system. Our second observation: Airbnb effectively published a 'map of rakes' for everyone taking neural networks into search ranking — from ID overfitting to saturation without normalization. 
Companies retracing this path save months not on others' successes but on others' dead ends; in that sense, honestly publishing failures is a rare case of engineering altruism that doubles as employer branding.
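The remark that fractions of a percent compound can be checked directly. The sketch below multiplies the three booking lifts quoted from the second paper, then asks how many launch rounds of that size it would take to pass +10%. Two assumptions are ours: that the lifts stack multiplicatively, and that later rounds would repeat the same yield, which the papers do not claim.

```python
from math import prod

# Booking lifts quoted in the case for the second paper.
second_paper_lifts = {
    "two-tower architecture": 0.006,
    "cold-start mechanism": 0.0038,
    "position dropout": 0.007,
}


def compound(lifts):
    """Combined lift if each gain multiplies the previous baseline."""
    return prod(1 + x for x in lifts) - 1


def rounds_to_reach(target, per_round):
    """Rounds of `per_round` lift needed before the total passes `target`."""
    rounds, total = 0, 1.0
    while total - 1 < target:
        total *= 1 + per_round
        rounds += 1
    return rounds


combined = compound(second_paper_lifts.values())
print(f"combined lift: {combined:.2%}")
print("rounds to pass +10%:", rounds_to_reach(0.10, combined))
```

Under these assumptions the three launches together are worth roughly +1.7% bookings, and about six rounds of similar size would be needed to reach a double-digit cumulative lift.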

+0.6% bookings (two-tower NN)

+14% new-listing bookings

+1.8% revenue (position dropout)

−33% p99 scoring latency

📌

Social & Visual Search · Pinterest

Pinterest: one embedding instead of three — +46.7% repins in Lens and a third of the ML systems to maintain

In offline human-judgment evaluations, the unified embedding lifted precision@5 over the specialized models by 22.2% for Flashlight, 110.1% for Lens, and 72.1% for Shop the Look. The hardest domain, Lens, gained the most: multi-task training on mixed data supplied exactly the generalization that camera photos lacked. The paper is also candid about variance: within Shop the Look, per-category results range from −33.3% to +249.7%, so a win on average does not mean a win everywhere. The online A/B test in Lens confirmed the offline picture with lifts in every product metric: repins +46.7%, clickthroughs +35.0%, closeups +32.7%, and engager propensity +16.3% to +26.7%. The operational bottom line: instead of three models, three training pipelines, and three retrieval infrastructures, the team maintains one. The paper states directly that deploying the unified embedding 'drastically reduced the operation and engineering cost' while improving quality.

In our view, the case's key lesson is counterintuitive: model consolidation is usually seen as a 'simpler but worse' trade-off, yet at Pinterest the universal model beat the specialists at their own tasks. The mechanics follow ML theory (multi-tasking acts as regularization, and one product's data becomes augmentation for another), but an effect of this scale (+110% in Lens) was possible because the tasks proved sufficiently related and the weakest domain received the most 'foreign' data. We would not generalize this to 'one embedding is always better' for any set of tasks: the paper's own category spread in Shop the Look shows why.

Our second observation: this case is about the economics of ML platforms, not just quality. At 600+ million visual searches a month, every extra model means re-indexing billions of images and duplicating serving; a fleet one-third the size speeds up every subsequent iteration. Infrastructure consolidation is one of those rare investments that cut costs and raise team velocity at the same time. We note only that online numbers are published for Lens alone and that all results are the company's own reporting.
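For readers less familiar with the offline metric, here is a minimal sketch of how precision@5 and a relative lift of the kind quoted above can be computed. It assumes the reported lifts are relative changes against the specialized model, which is the only reading under which a figure above 100% makes sense. The relevance judgments and function names are hypothetical and do not come from the paper.

```python
def precision_at_k(relevance, k=5):
    """Share of the top-k results that human raters judged relevant (1) vs not (0)."""
    return sum(relevance[:k]) / k


def relative_lift_pct(new, old):
    """Relative change of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Hypothetical rater judgments for one query, ordered by rank.
specialized_model = [1, 0, 0, 1, 0, 0, 1]
unified_model = [1, 1, 0, 1, 1, 0, 0]

p_old = precision_at_k(specialized_model)  # 2 relevant in the top 5
p_new = precision_at_k(unified_model)      # 4 relevant in the top 5
lift = relative_lift_pct(p_new, p_old)

print(f"precision@5: {p_old:.2f} -> {p_new:.2f}, lift {lift:+.1f}%")
```

In this toy example, doubling the number of relevant results in the top five is a +100% lift, so a +110.1% lift in Lens corresponds to slightly more than doubling precision@5 from its starting level.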

+46.7% repins in Lens (A/B)

+110% precision@5 in Lens

600M+ visual searches per month

3→1 embeddings to maintain

🎧

Media & Music · Spotify

Spotify: an LLM-powered annotation platform — a 10x larger labeling corpus with 3x annotator productivity

The annotation corpus grew 10x and annotator productivity 3x: automating sampling, review serving, and the feeding of results back into training pipelines removed the manual steps, while LLM labeling took the bulk of the routine work off people. The platform runs dozens of labeling projects in parallel, serving ML and GenAI use cases across a catalog of hundreds of millions of tracks and episodes, from release-relation detection and automatic content placement to analyzing podcasts for policy violations.

These numbers come with important bounds. The '10x' and '3x' are Spotify's internal metrics with no external audit, and the post does not disclose the comparison base: it is unclear from what starting corpus and over what period the growth was measured. A third-party review in ZenML's LLMOps case database notes further gaps: the post gives no technical detail on prompts, few-shot approaches, or LLM fine-tuning, and the 'low cost' claim is not itemized, so it is unknown whether infrastructure, prompt-engineering time, and ongoing quality monitoring are included. We present the case with these caveats: the direction of the result is not in doubt, but the exact multipliers should be read as self-reporting.

In our view, the case's main value is an architectural pattern that transfers to virtually any industry: LLM plus humans, linked by an agreement metric with automatic escalation. This is fundamentally more reliable than 'LLM instead of humans': the model provides scale, people provide calibration and adjudication of edge cases, and the agreement metric automatically decides who should handle each specific item. The feedback loop is double: contested cases not only raise label quality but also reveal where labeling guidelines are ambiguous.

Our second observation: Spotify treats labeling as a product with its own team, tooling, and metrics rather than as a one-off procurement. In an era when generative model quality is limited not by architectures but by evaluation datasets, an annotation platform becomes infrastructure as critical as model serving, and Spotify's case was the first from 'big streaming' to show that publicly.
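The 'agreement metric with automatic escalation' pattern can be sketched in a few lines. This is an illustration only: Spotify's post does not publish its metric or thresholds, so the majority-share agreement measure, the 0.8 threshold, and the function name `route_item` are all assumptions made for this example.

```python
from collections import Counter


def route_item(labels, threshold=0.8):
    """Decide whether an item can be auto-accepted or needs a human expert.

    `labels` holds the labels proposed for one item by several annotators
    (LLM runs, humans, or a mix). Agreement is measured here as the share
    of votes for the most common label, an assumed stand-in metric.
    """
    top_label, top_count = Counter(labels).most_common(1)[0]
    agreement = top_count / len(labels)
    if agreement >= threshold:
        return "auto_accept", top_label, agreement
    return "escalate_to_human", None, agreement


# Hypothetical items: one clear case, one contested case.
clear = route_item(["explicit", "explicit", "explicit", "explicit", "clean"])
contested = route_item(["explicit", "clean", "explicit", "clean", "unsure"])

print(clear)      # high agreement: the majority label is accepted
print(contested)  # low agreement: the item goes to a human expert
```

Contested items are also the ones worth logging, since a cluster of escalations around one label often points to an ambiguous guideline rather than a hard item.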

10x annotation corpus growth

3x annotator productivity

100M+ tracks & episodes in catalog (hundreds of millions)

10+ parallel labeling projects (dozens)

💼

HR Tech & Professional Networks · LinkedIn

LinkedIn: LLM-based skills extraction — 200 profile edits per second with a model 80% smaller at no quality loss

In production the system handles ~200 profile edits per second at up to 100 ms per message, on CPU, thanks to a distilled model 80% smaller than the original. In online A/B tests, improved skills extraction lifted metrics across three products at once. Job-member skills matching: +0.87% qualified applications, +0.40% qualified application rate, +0.48% apply clicks, +0.24% predicted confirmed hires. Job search: +0.76% PPC revenue, +0.15% sessions, +0.23% engagement. Job recommendations: +0.46% predicted confirmed hires and +0.14% applications.

These figures need two caveats. First, scale: on a platform with hundreds of millions of members, fractions of a percent in the hiring funnel translate into meaningful absolute numbers and direct money (PPC revenue). Second, the source: all percentages come from LinkedIn's internal A/B tests as reported on its engineering blog, with no external audit; that said, the detailed metric breakdown and the explicit labeling of hires as 'predicted' rather than actual speak to the care taken in the report.

In our view, the case's main engineering lesson is that 'LLM in production' almost never means 'the biggest model in production'. LinkedIn's actual formula: a large model as the teacher, a distilled student in serving, CPU infrastructure instead of GPU, and the pipeline's intelligence residing in the pairing with a knowledge graph rather than in one giant transformer. This is an 'LLM + knowledge graph' architecture where the taxonomy provides interpretability and control while the neural network provides context understanding. Against the fashion for end-to-end solutions such a hybrid looks conservative, but it is what sustains 200 events per second on CPU.

Our second observation is organizational: LinkedIn reports the ML pipeline's impact in end-product metrics (applications, hires, revenue), not extraction accuracy. That is a discipline: an infrastructure team whose success is measured by other teams' product metrics is forced to build feedback loops with those products, which LinkedIn did by embedding skill validation into recruiter and job-seeker interfaces.
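To make the serving numbers tangible, a back-of-envelope sizing using Little's law (average requests in flight = arrival rate × time in system) is sketched below. It takes the two published figures, ~200 edits per second and up to 100 ms per message, and treats 100 ms as the worst-case latency. The sustained-rate daily volume is an assumption for illustration; LinkedIn does not publish it.

```python
def in_flight_requests(arrival_rate_per_s, latency_ms):
    """Little's law: average concurrency = arrival rate x time in system."""
    return arrival_rate_per_s * latency_ms / 1000


EDITS_PER_SECOND = 200   # published throughput (approximate)
MAX_LATENCY_MS = 100     # published upper bound per message

# Upper bound on messages being processed at any moment.
concurrency = in_flight_requests(EDITS_PER_SECOND, MAX_LATENCY_MS)

# Assumption: the rate is sustained around the clock (not stated by LinkedIn).
edits_per_day = EDITS_PER_SECOND * 60 * 60 * 24

print(f"about {concurrency:.0f} messages in flight at most")
print(f"about {edits_per_day:,} edits per day if the rate were sustained")
```

Around twenty concurrent inferences is a load that a modest CPU fleet can plausibly absorb once the model is small enough, which is the practical point of the distillation step.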

200/sec profile edits processed

-80% model size (distillation)

<100 ms per message

41K+ skills in the taxonomy

🛒

E-grocery & Delivery · Instacart

Instacart: Ava, the internal AI assistant — over half of employees monthly, with 20+ minute sessions

As of the September 2023 post, over half of Instacart employees used Ava monthly and more than 900 used it weekly; sessions ran 20+ minutes, and users 'were producing and copying a significant amount of code'. The January 2024 follow-up added specifics: 43% of the company saves more than an hour a week with Ava; 60% of engineers generate around 70,000 lines of code with it monthly; the Slack plugin is invoked over 5,000 times a month and summarizes more than 200 threads and channels. All figures are Instacart's internal data without external audit; note also that 'an hour saved per week' is employees' self-assessment from surveys, not a time-tracking measurement.

In our view, the Ava case is valuable above all as a textbook on the product mechanics of adoption. Internal tools are usually rolled out by decree; Instacart instead walked the classic consumer path: start with an audience that gets instant value (engineers), lower the entry barrier (templates), build viral loops (conversation sharing with Slack previews), encourage user-generated content (the Prompt Exchange), and go where the user already lives (the Slack bot). Every mechanic here transfers to any company, and judging by the usage numbers, together they work better than any corporate mandate.

Our second observation concerns metrics. Instacart reported engagement rather than registrations: weekly audience, session length, volume of generated code, Slack invocation frequency. For internal AI tools that is rare discipline: 'we gave everyone access' and 'half the company actually uses it every month' are fundamentally different statements. Finally, the economics of the bet are telling: a hackathon project received a product team and enterprise security guarantees within months, a speed of official sanction that likely explains why shadow AI use never had time to take root.

>50% of employees use it monthly

900+ weekly users

20+ min session length

32K GPT-4 context window in Ava

🚕

Superapp: Ride-hailing & Delivery · Grab

Grab: LLM data classification — 20,000+ entities in the first month and 360 man-days saved per year

In the first month after rollout the system scanned more than 20,000 data entities, averaging 300–400 per day, a pace physically unreachable for a manual process. In a September 2023 survey, 80% of data owners said the new process helped them tag their entities, and for acknowledged tables users changed less than one tag on average, meaning the vast majority of the model's proposals were accepted unchanged. At two minutes of manual classification per entity, the automation saves roughly 360 man-days per year. By V2 the system covered Grab's entire data lake with what the team calls 'exceptionally low' misclassification rates, though the company publishes no exact percentages, and all the case's figures are self-reported without external audit.

In our view, this case is a model of sober task selection for LLMs. Metadata classification is a scenario where a generative model is nearly ideal: the input is compact (column names and descriptions), the output is structured (tags from a fixed taxonomy), the cost of a single error is bounded (a human validator and weekly notifications), and the alternative is not 'another model' but thousands of hours of manual labor. The economics are calculated conservatively and legibly: 2 minutes × 20,000+ entities a month is arithmetic any CFO will accept, unlike abstract 'productivity percentages'.

Our second observation: the V1→V2 history honestly shows that an LLM system is not 'set and forget'. The first version, 'surprisingly accurate' in 2023, accumulated a list of weaknesses on live traffic, and they were fixed not with a more powerful model but with task decomposition, halving the prompt, and observability (LangSmith, threshold alerts). That is perhaps the case's most transferable lesson: in production LLM systems, the engineering around the model (orchestration, quotas, output schemas, prompt versioning, monitoring) matters more than the choice of the model itself.
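The savings arithmetic can be written out explicitly. The sketch below uses the two published inputs (2 minutes per entity, 20,000 entities a month) and assumes the first-month volume holds for a full year. Grab does not state how a man-day is defined, so both an 8-hour working day and a 24-hour calendar day are shown as assumed conventions.

```python
MINUTES_PER_ENTITY = 2        # published estimate of manual effort
ENTITIES_PER_MONTH = 20_000   # published first-month volume, assumed steady


def yearly_minutes(entities_per_month, minutes_per_entity):
    """Manual effort avoided over twelve months, in minutes."""
    return entities_per_month * 12 * minutes_per_entity


def to_days(minutes, hours_per_day):
    """Convert minutes into days of the given length."""
    return minutes / 60 / hours_per_day


total_minutes = yearly_minutes(ENTITIES_PER_MONTH, MINUTES_PER_ENTITY)
total_hours = total_minutes / 60
working_days = to_days(total_minutes, hours_per_day=8)    # assumed convention
calendar_days = to_days(total_minutes, hours_per_day=24)  # assumed convention

print(f"{total_hours:,.0f} hours per year")
print(f"{working_days:,.0f} eight-hour days or {calendar_days:,.0f} calendar days")
```

Under these assumptions the total is 8,000 hours a year: about 1,000 eight-hour working days, or about 333 round-the-clock days. The published figure of roughly 360 man-days is closer to the second reading, which supports calling the estimate conservative.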

20K+ entities in the first month

360 man-days saved per year

80% of data owners found it helpful

300–400 entities per day