
Habr AI presented a prototype of a system that verifies the authenticity of references in research papers


AI-processed from Habr AI; edited by Hamidun News

Habr AI has published a breakdown of a thesis project on automatically verifying scholarly sources. The author is building a system that not only finds the reference list in a document, but also checks whether each reference actually exists and whether it can be trusted.

Why the problem grew

The idea seems narrow at first glance. But with the spread of generative models, bibliography errors are no longer just typos: distorted DOIs, mixed-up authors, broken URLs, and references to works that do not exist are increasingly common in scientific and quasi-scientific texts. For editors and reviewers, this means extra hours of manual checking; for the author, it is a direct blow to the credibility of the text.

The problem has two parts. The first is formatting: the same source can be written according to GOST, APA, or IEEE, or in a mixed format where half the fields are missing. The second is authenticity: even a perfectly formatted reference can lead nowhere. The task is therefore not a cosmetic cleanup of the bibliography, but a check on the reliability of the text itself. If a source cannot be confirmed, the quality of the work, the reproducibility of its results, and the very logic of scientific citation all suffer.

How the system works

The current prototype takes PDF and DOCX files, extracts the text, locates the bibliography block using a set of heuristics, splits it into individual records, and parses the fields: authors, title, year, journal, volume, issue, pages, DOI, and URL. The system then tries to confirm each record against external sources, from Crossref and OpenAlex to Wikidata, ORCID, Google Scholar, and ordinary web search. The output is not a binary answer but a confidence scale.

  • accepts a document through a web interface
  • highlights and structures the reference list
  • checks DOI, URL, and metadata matching
  • assigns a credibility status to each record
  • saves a report and final JSON for further processing
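
The field-parsing and JSON-output steps above can be sketched roughly as follows. This is a minimal illustration, not the author's code: the regular expressions, the `Reference` dataclass, and `parse_reference` are hypothetical names introduced here, and the sample record is made up. Real bibliographies need far more cases than three patterns.

```python
import json
import re
from dataclasses import asdict, dataclass
from typing import Optional

# Hypothetical patterns for illustration only.
DOI_RE = re.compile(r'\b(10\.\d{4,9}/[^\s"<>]+)')
URL_RE = re.compile(r'https?://[^\s"<>]+')
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


@dataclass
class Reference:
    raw: str
    doi: Optional[str] = None
    url: Optional[str] = None
    year: Optional[int] = None


def parse_reference(raw: str) -> Reference:
    """Pull DOI, URL, and year out of one raw bibliography record."""
    doi = DOI_RE.search(raw)
    url = URL_RE.search(raw)
    year = YEAR_RE.search(raw)
    return Reference(
        raw=raw,
        # Trailing punctuation usually belongs to the sentence, not the identifier.
        doi=doi.group(1).rstrip(".,;") if doi else None,
        url=url.group(0).rstrip(".,;") if url else None,
        year=int(year.group(0)) if year else None,
    )


# Made-up record, used only to show the shape of the output.
sample = ("Smith J. Example title. Journal of Examples, 2019, 12(3), 45-67. "
          "doi:10.1234/example.2019.001.")
record = parse_reference(sample)
print(json.dumps(asdict(record), ensure_ascii=False, indent=2))
```

Keeping the raw string next to the parsed fields makes it possible to show a reviewer exactly what was extracted from what.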

The key architectural decision is a hybrid approach. Rules and heuristics handle feature extraction, DOI validation, and basic field checks, while the ML layer helps where a record is noisy, only partially recognized, or does not fit a rigid template. The combination is necessary because pure rules break quickly on real documents, while a pure model turns into a black box that is hard to trust.
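
One way to picture the hybrid routing is "rules first, fallback second, and always record which path produced the value". The sketch below assumes that reading; the function names are hypothetical, and `demo_fallback` is a trivial stand-in for the ML layer, which the article does not describe in detail.

```python
import re
from typing import Callable, Optional, Tuple

DOI_IN_TEXT = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
DOI_SYNTAX = re.compile(r"^10\.\d{4,9}/\S+$")


def extract_doi_by_rule(raw: str) -> Optional[str]:
    """Strict rule: accept only a DOI that appears intact in the text."""
    match = DOI_IN_TEXT.search(raw)
    return match.group(0).rstrip(".,;") if match else None


def extract_doi_hybrid(
    raw: str, fallback: Callable[[str], Optional[str]]
) -> Tuple[Optional[str], str]:
    """Return (doi, origin), where origin is 'rule', 'fallback', or 'none'."""
    doi = extract_doi_by_rule(raw)
    if doi is not None:
        return doi, "rule"
    guess = fallback(raw)
    # Even a fallback guess must pass the basic syntax rule before it is accepted.
    if guess is not None and DOI_SYNTAX.match(guess):
        return guess, "fallback"
    return None, "none"


def demo_fallback(raw: str) -> Optional[str]:
    """Stand-in for a model: glue together fragments that OCR split after 'doi:'."""
    _, sep, tail = raw.partition("doi:")
    return re.sub(r"\s+", "", tail).rstrip(".,;") if sep else None


clean = extract_doi_hybrid("Ivanov I. Title. doi:10.1234/abc.5", demo_fallback)
noisy = extract_doi_hybrid("Ivanov I. Title. doi: 10. 1234/ab c5", demo_fallback)
missing = extract_doi_hybrid("Ivanov I. Title, 2020.", demo_fallback)
```

Returning the origin alongside the value keeps the result inspectable: a reviewer can see whether a field came from a strict rule or from a softer guess.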

The statuses verified, likely_verified, unverified, and unknown let the system report its actual degree of confidence instead of pretending that every disputed case can be resolved automatically.
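
A possible mapping from a confidence score to these four statuses is sketched below. The status names come from the article; the thresholds, the `assign_status` function, and the convention that `None` means "no external source could be consulted" are assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class Status(str, Enum):
    VERIFIED = "verified"
    LIKELY_VERIFIED = "likely_verified"
    UNVERIFIED = "unverified"
    UNKNOWN = "unknown"


def assign_status(
    score: Optional[float], high: float = 0.9, low: float = 0.6
) -> Status:
    """Map a confidence score in [0, 1] to a status.

    Assumption: score is None when no lookup could be completed, which is
    different from a lookup that ran and found little support.
    """
    if score is None:
        return Status.UNKNOWN
    if score >= high:
        return Status.VERIFIED
    if score >= low:
        return Status.LIKELY_VERIFIED
    return Status.UNVERIFIED
```

Separating "unknown" from "unverified" keeps a service outage from being reported as evidence against a reference.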

To assess quality, the author does not rely on a single overall number. Metrics are split by stage: how well fields are extracted, how many references can be confirmed, how accurately the classification works, and whether auto-correction does any harm. This per-stage breakdown shows exactly where the pipeline fails: at extraction, matching, status assignment, or the attempt to correct a record.
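
For the extraction stage, a per-field metric might look like the sketch below. The article does not say which metrics the author uses, so exact-match accuracy per field is an assumption, and `field_accuracy` plus the tiny gold and predicted lists are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def field_accuracy(
    gold: List[dict], predicted: List[dict], fields: Sequence[str]
) -> Dict[str, float]:
    """Share of records where each field was extracted exactly as labeled."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned record by record")
    if not gold:
        return {}
    correct = defaultdict(int)
    for g, p in zip(gold, predicted):
        for name in fields:
            if g.get(name) == p.get(name):
                correct[name] += 1
    return {name: correct[name] / len(gold) for name in fields}


# Made-up labeled data: the second record has a wrong year.
gold = [{"year": 2019, "doi": "10.1234/a"}, {"year": 2021, "doi": None}]
pred = [{"year": 2019, "doi": "10.1234/a"}, {"year": 2012, "doi": None}]
scores = field_accuracy(gold, pred, ["year", "doi"])
```

Reporting one number per field rather than one per pipeline makes it visible that, in this toy case, year parsing is the weak spot and DOI parsing is not.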

Where failures begin

The hardest part of the task shows up before any reference is checked. A PDF can contain running headers, hard line breaks, and text blocks in chaotic order, or it can be a scan with no proper text layer at all. In those cases OCR has to come first, and bibliography parsing only after it.

Even after that, there remain articles without a DOI, dead URLs, Russian-language sources that are poorly represented in international registries, and records whose title or authors are so distorted that direct matching fails. External services are a separate problem: some enforce rate limits, others respond unreliably, and still others return a CAPTCHA or incomplete metadata. For this reason, the project author puts particular emphasis on explainability and a human-in-the-loop mode.
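
When a title is too distorted for exact comparison, approximate string matching is one common workaround. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; it is not taken from the project, the normalization rules are an assumption, and the titles are invented.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace.

    \\w is Unicode-aware in Python 3, so Cyrillic titles survive this step.
    """
    no_punct = re.sub(r"[^\w\s]", "", title.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def title_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two titles after normalization."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


# Invented titles: one near-duplicate with a typo, one unrelated.
near = title_similarity(
    "Deep Learning for Citation Parsing", "deep learning for citaton parsing."
)
far = title_similarity(
    "Deep Learning for Citation Parsing", "Quantum chromodynamics on the lattice"
)
```

A similarity score like this is only one signal; where to set the cut-off between "same work" and "different work" has to be tuned on labeled data.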

The system should not only render a verdict, but also show which fields matched, where there is little confirmation, and what is better to check manually.
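
A field-by-field report of this kind could be produced along the following lines. The structure of the report and the `explain_match` function are hypothetical, chosen to illustrate the idea rather than to mirror the prototype.

```python
from typing import Dict, List, Sequence


def explain_match(parsed: dict, candidate: dict, fields: Sequence[str]) -> dict:
    """Compare a parsed record with metadata from an external source."""
    report: Dict[str, List[str]] = {
        "matched": [],
        "mismatched": [],
        "unconfirmed": [],
    }
    for name in fields:
        ours, theirs = parsed.get(name), candidate.get(name)
        if ours is None or theirs is None:
            # One side has no value, so nothing can be confirmed either way.
            report["unconfirmed"].append(name)
        elif str(ours).strip().lower() == str(theirs).strip().lower():
            report["matched"].append(name)
        else:
            report["mismatched"].append(name)
    result = dict(report)
    result["needs_manual_review"] = bool(
        report["mismatched"] or report["unconfirmed"]
    )
    return result


# Made-up values for illustration.
parsed = {"title": "Example Title", "year": 2019, "doi": None}
candidate = {"title": "example title", "year": 2018, "doi": "10.1234/x"}
report = explain_match(parsed, candidate, ["title", "year", "doi"])
```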

If a record cannot be reliably confirmed, the system should not pretend to be an all-powerful oracle.

This matters most for auto-correction: fixing a bibliographic record can easily introduce a new error if the algorithm is overconfident.
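
One conservative policy is to apply a correction only above a high confidence threshold, never mutate the original record, and log every changed field. The sketch below assumes that policy; `propose_correction` and the 0.95 threshold are illustrative, not the project's actual values.

```python
from typing import Dict, Tuple


def propose_correction(
    record: dict, candidate: dict, confidence: float, min_confidence: float = 0.95
) -> dict:
    """Apply candidate values only when confidence is high; otherwise suggest."""
    if confidence < min_confidence:
        # Leave the record untouched and hand the candidate to a human.
        return {"applied": False, "record": dict(record), "suggestion": dict(candidate)}
    changes: Dict[str, Tuple[object, object]] = {
        name: (record.get(name), value)
        for name, value in candidate.items()
        if value is not None and record.get(name) != value
    }
    corrected = dict(record)  # copy, so the input is never modified
    corrected.update({name: new for name, (_, new) in changes.items()})
    return {"applied": True, "record": corrected, "changes": changes}


# Made-up record with a typo in the title.
original = {"title": "Exmaple", "year": 2019}
found = {"title": "Example", "year": 2019}
cautious = propose_correction(original, found, confidence=0.70)
confident = propose_correction(original, found, confidence=0.99)
```

Keeping the old and new value for each changed field means a wrong correction can be traced and undone.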

Near-term plans are to improve reference extraction, expand the labeled dataset, and run the pipeline on a corpus of examples with separate metrics for parsing, matching, classification, and auto-correction.

What this means

Reference verification is gradually turning from tedious editorial routine into an AI task in its own right, at the intersection of NLP, data validation, and academic infrastructure. As models learn to fabricate bibliographies convincingly, demand for systems that can tell a real source from neatly formatted fiction will only grow.
