How LLMs Work: Top 10 Executive-Level Questions
Business leaders making decisions involving AI need to know the essentials of how large language models and the GenAI tools based on them operate. Get up to speed on these commonly misunderstood topics.
In my work at MIT Sloan School of Management, I have taught the basics of how large language models (LLMs) work to many executives during the past two years.
Some people posit that business leaders neither want to nor need to know how LLMs and the generative AI tools that they power work — and are interested only in the results the tools can deliver. That is not my experience. Forward-thinking leaders care about results, of course, but they are also keenly aware that a clear and accurate mental model of how LLMs work is a necessary foundation for making sound business decisions regarding the use of AI technologies in the enterprise.
In this column, I share 10 questions on often-misunderstood topics that I am frequently asked, along with their answers. You don’t need to read a book on each of these topics, nor do you have to get into the technical weeds, but you do need to understand the essentials. Consider this list a useful reference for yourself and for your teams, colleagues, or customers the next time one of these questions comes up in discussion. I have heard from my executive-level students at MIT that this knowledge is especially helpful as a reality check in conversations with technology partners.
1. I understand that LLMs generate output one piece of text at a time. How does the LLM “decide” when to stop?
Put another way, when does the LLM decide to give the user the final answer to a question? The decision to stop generating is determined by a combination of what the LLM predicts and the rules set by the software system running it. It is not a choice made by the LLM alone. Let’s examine in detail how this works.
When an LLM answers a question, it produces text one small piece at a time. The technical name for a piece is token.1 Tokens can be words or parts of words. At each step, the LLM predicts which token should come next based on the prompt and what it has already written so far.2
An external system runs the LLM in a “generate the next token; append it to the input; generate the next token” loop until a stopping condition is triggered. When this happens, the system stops asking the LLM for more tokens and shows the result to the user.
Many stopping conditions are used in practice. An important one involves a special “end of sequence” token that (informally) means “end of answer.” This token is used in the training process to mark the end of individual training examples, so the LLM learns to predict it at the point where its answer is complete. Other stopping conditions include (but are not limited to) a cap on the total number of tokens generated so far and the appearance of a user-defined pattern called a stop sequence.
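For readers who want to see the shape of this loop, here is a minimal sketch in Python. The function name, the end-of-sequence token, and the thresholds are hypothetical placeholders of my own, not any vendor’s actual interface.

```python
# A minimal sketch of the "generate, append, generate" loop described above.
# model.predict_next_token and the constant values are hypothetical stand-ins.

MAX_NEW_TOKENS = 512             # stopping condition: token budget
STOP_SEQUENCES = ["\n\nUser:"]   # stopping condition: user-defined stop sequence
EOS_TOKEN = "<|endoftext|>"      # stopping condition: special end-of-sequence token

def generate(model, prompt: str) -> str:
    context = prompt
    answer = ""
    for _ in range(MAX_NEW_TOKENS):                # cap on tokens generated
        token = model.predict_next_token(context)  # one token per step
        if token == EOS_TOKEN:                     # model signals "end of answer"
            break
        answer += token
        context += token                           # append and feed back in
        if any(answer.endswith(s) for s in STOP_SEQUENCES):
            break                                  # user-defined pattern reached
    return answer
```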
When we use the web version of a tool like ChatGPT as consumers, we don’t see this process — only the finished text. But when your organization starts building its own LLM apps, developers can adjust these stopping rules and other parameters themselves, and these choices can affect answer completeness, cost, and formatting.
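When developers do build their own apps, these stopping controls are exposed as ordinary request parameters. Here is one illustrative call using the OpenAI Python SDK; parameter names and limits differ across providers and models, so treat this as a sketch rather than a universal recipe.

```python
# Example: adjusting stopping rules via developer-facing parameters.
# Shown with the OpenAI Python SDK; other providers use different names.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[{"role": "user", "content": "Summarize our expense policy."}],
    max_tokens=200,       # cap on generated tokens (affects completeness and cost)
    stop=["\n\n###"],     # user-defined stop sequence ends generation early
)
print(response.choices[0].message.content)
```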
The important point here is that the “decision” to stop is an interaction between the LLM’s token predictions and external control logic, not a decision made by the LLM.
2. If the LLM makes a mistake and I correct it, will it update itself immediately?
No, the LLM will not update itself immediately if you correct it. If you are using tools like ChatGPT or Claude, your correction might help improve future versions of the model if your chat history is included in a future training run, but those updates happen over weeks or months, not instantly.
Some apps, such as ChatGPT, have a memory feature that can update in real time to remember personal information like your name, preferences, or location. However, this memory is used for personalization and does not appear to be used for correcting the model’s factual knowledge or reasoning errors.
3. If the LLM repeatedly generates one token at a time based on the current conversation, why have I seen it use information from a prior conversation (say, from a week ago) in the response?
LLMs generate responses one token at a time, based on the input they are given in that conversation. By default, they don’t use past conversations. However, as noted in the response above, some LLM applications have a memory feature that lets them store information from earlier chats — such as your name, interests, preferences, ongoing projects, or frequently queried topics.
When you start a new chat, relevant pieces of this stored memory may be automatically added to the prompt behind the scenes. This means that the model is not actually recalling past chats in real time; instead, it is being fed reminders of that information as part of the input. That’s how it can appear to “remember” things from a week ago.
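To make this concrete, here is a simplified sketch of how stored memory items might be prepended to a new conversation. The memory items and the prompt-assembly step are my own illustrations; vendors have not disclosed their exact methods.

```python
# Sketch of feeding stored "memory" back into a new conversation as input.
# The memory items and helper function are illustrative, not a vendor's design.

stored_memory = [
    "User's name is Priya.",
    "User prefers answers with bullet points.",
    "User is working on a Q3 pricing project.",
]

def build_prompt(user_message: str) -> list[dict]:
    memory_note = "Known facts about the user:\n" + "\n".join(
        f"- {item}" for item in stored_memory
    )
    return [
        {"role": "system", "content": memory_note},  # memory arrives as input
        {"role": "user", "content": user_message},
    ]

messages = build_prompt("Pick up where we left off last week.")
```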
The details of what is stored and when it is used vary by vendor, and the exact methods haven’t been disclosed. It is possible that a technique like retrieval-augmented generation (RAG) is being used to decide which memory items to include in a new prompt. Many platforms allow users to view, edit, or turn off memory entirely. In the ChatGPT app, for example, this can be accessed via Settings > Personalization.
If you are not familiar with RAG, it is a technique for giving the LLM access to a specific set of data, often proprietary documents: the most relevant pieces are retrieved and added to the prompt so that the model can ground its response in that material.
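A bare-bones sketch of the idea, with a placeholder embedding function standing in for a real embedding model, looks like this:

```python
# Minimal RAG sketch: score document chunks against the question, keep the
# top few, and build a prompt from only those. The embedding function and
# similarity measure here are simplified placeholders.
from typing import Callable

def retrieve(question: str, chunks: list[str],
             embed: Callable[[str], list[float]], k: int = 3) -> list[str]:
    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: dot(embed(c), q_vec), reverse=True)
    return ranked[:k]  # only the most relevant chunks go into the prompt

def build_rag_prompt(question: str, top_chunks: list[str]) -> str:
    context = "\n\n".join(top_chunks)
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```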
4. I understand that LLMs have a training cutoff date, and they don’t “know” about things that happened after that date. However, they can answer questions about events that happened after the cutoff date. How does this work?
When you ask a question about something that happened after an LLM’s training cutoff date, the model itself doesn’t “know” about the event unless it has access to up-to-date information. Some systems — like ChatGPT with browsing enabled — can perform live web searches to help answer such questions.
In those cases, the LLM may generate a search query based on your question, and a separate part of the system (outside the model itself) carries out the search. The results are then sent back to the LLM so that it can generate an answer based on that fresh information. Not all LLMs or applications have this capability, though. Without access to live data, a model might still generate an answer based on its training data, which doesn’t reflect real-world updates.
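Conceptually, the flow looks something like the following sketch; the function names are hypothetical placeholders, not a specific product’s interface.

```python
# Sketch of the search-augmented flow described above: the model proposes a
# query, a component outside the model runs the live search, and the results
# are fed back into a second model call. All names are hypothetical.

def answer_with_search(llm, web_search, question: str) -> str:
    # Step 1: the model turns the user's question into a search query.
    query = llm(f"Write a web search query for: {question}")

    # Step 2: an external component (not the model) performs the live search.
    results = web_search(query)  # e.g., returns a list of result snippets

    # Step 3: the fresh results are added to the prompt for the final answer.
    prompt = (
        f"Question: {question}\n\n"
        f"Search results:\n{results}\n\n"
        "Answer using the search results above."
    )
    return llm(prompt)
```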
5. If I include documents as part of a prompt, can I ensure that the LLM uses only the provided documents when it generates the response? For example, if I upload a corporate expense policy document and ask a question, can I ensure that it uses only this document and not policy documents found on the web that it happened to be trained on?
No. While careful prompting and techniques like RAG can encourage an AI model to prioritize a set of provided documents, standard LLMs cannot be forced to use only that content. The model still has access to patterns and facts it learned during training and may blend that knowledge into its response — especially if the training data included similar content.
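What you can do is instruct the model explicitly and give it an opt-out, as in the illustrative prompt below. This encourages document-only answers but, again, does not guarantee them; the file name and wording are my own examples.

```python
# Illustrative system prompt that encourages (but cannot enforce)
# answering only from a provided document.

policy_text = open("expense_policy.txt").read()  # assumed local file

system_prompt = (
    "Answer strictly from the policy document provided below. "
    "If the document does not contain the answer, reply: "
    "'Not covered in the provided policy.' Do not use outside knowledge.\n\n"
    f"POLICY DOCUMENT:\n{policy_text}"
)
```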
6. LLMs sometimes cite the sources that were used to generate the answer to a question. If an answer comes with supporting citations, can I trust it?
No. LLMs can fabricate (hallucinate) citations or use real sources in inaccurate or misleading ways. Some LLM systems include post-processing steps to verify citations, but these checks are not always reliable or comprehensive. Always verify that a cited source actually exists and that its content genuinely supports the information in the response.
7. When we have many documents, we use RAG, where we first gather relevant information from documents and include only those in the prompt. But modern LLMs have long context windows, and we can easily include all the documents. Is RAG even necessary?
Modern LLMs like GPT-4.1 and Gemini 2.5 offer million-token context windows — enough to hold entire books. This naturally raises the question: If we can fit everything in, why bother using a subset?
While these extended context windows are powerful, including all documents in the prompt isn’t always a good idea. There are several reasons why RAG still matters.
First, RAG isn’t just about keeping the prompt short. It’s about selecting the most relevant parts of the documents. Overloading the context with too much or irrelevant information can hurt performance, and keeping the context and prompt relevant, concise, and accurate often leads to better answers.
Second, even though LLMs can accept long contexts, they don’t process all parts equally well. Research has shown that AI models tend to focus more on the beginning and end of a prompt and may miss important information in the middle.
Finally, longer prompts mean more tokens, which increases API costs and slows down responses. This matters in real-world applications where cost and speed are important.
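A quick back-of-the-envelope calculation makes the cost point. The per-token price below is a placeholder; substitute your provider’s current rates.

```python
# Illustrative cost comparison: stuffing the full corpus into the prompt vs.
# sending only a retrieved subset. Prices and token counts are placeholders.

PRICE_PER_MILLION_INPUT_TOKENS = 2.00  # hypothetical price, in dollars

def input_cost(tokens: int) -> float:
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

full_corpus_tokens = 800_000  # e.g., every policy document
rag_subset_tokens = 4_000     # e.g., a handful of retrieved passages

print(f"Full-corpus prompt: ${input_cost(full_corpus_tokens):.2f} per request")
print(f"RAG prompt:         ${input_cost(rag_subset_tokens):.4f} per request")
```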
In short, long context windows are useful, but they don’t make retrieval obsolete. RAG remains an important tool, especially when you care about accuracy, efficiency, or cost. The option to use RAG should still be evaluated based on the needs of your specific application.
8. Can LLM hallucinations be eliminated?
No, hallucinations cannot be fully eliminated with current LLM technology. They arise from the probabilistic nature of language models, which generate text by predicting likely token sequences based on training data — not by verifying facts against a reliable source.
However, careful prompt engineering and strategies such as RAG, fine-tuning on domain-specific data, and post-processing with rule-based checks or external validation can reduce hallucinations in specific use cases.3 While these strategies don’t guarantee the elimination of hallucinations, they can improve an LLM’s reliability enough for many practical applications.
9. Since LLM hallucinations and mistakes cannot be eliminated, we need to check the answers. How can we do this efficiently?
Efficiently checking LLM outputs depends on the type of task and the acceptable level of risk. Broadly, the main strategies include human review and automated methods.
For open-ended tasks such as summaries, essays, reports, or analyses, human review provides the most reliable oversight. However, this is costly and difficult to scale, especially in scenarios that require fast or real-time responses. One way to improve efficiency here is to review only a subset of outputs (in other words, employ sampling) or triage based on risk, focusing human attention on the critical cases.
An increasingly popular alternative is to use an “AI judge,” which is typically another LLM that can evaluate or verify the outputs of the first tool. This approach allows for scalable and fast accuracy-checking, but it comes with limitations: The judge itself may hallucinate or fail to match human judgment, particularly in complex cases. Some improvements include using multiple judges for comparison, combining judge feedback with retrieval-based fact-checking, or designing workflows where low-confidence outputs are escalated to humans.
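In code, a judge step can be as simple as the sketch below; the rubric, the escalation rule, and the function names are illustrative choices of mine, not a standard recipe.

```python
# Sketch of an "AI judge" check: a second model call grades the first model's
# answer against source material and flags weak cases for human review.

def judge_answer(llm, question: str, answer: str, sources: str) -> dict:
    verdict = llm(
        "You are a strict reviewer. Given the question, the answer, and the "
        "source material, reply with SUPPORTED or UNSUPPORTED and one sentence "
        f"of justification.\n\nQuestion: {question}\nAnswer: {answer}\n"
        f"Sources: {sources}"
    )
    # Risk-aware triage: unsupported answers are escalated to a person.
    return {"verdict": verdict, "escalate_to_human": "UNSUPPORTED" in verdict}
```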
Structured tasks, such as generating code, classifying information, or producing structured data in formats like SQL or JSON, lend themselves more readily to automation. Generated code can be tested automatically with unit tests or run in a sandbox environment. Classification outputs can be checked to ensure that they fall within predefined categories. Structured formats like JSON, SQL, or XML can be automatically checked for syntactic validity, though this only ensures correct formatting — not the accuracy of the content itself.
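Here is a minimal sketch of two such automated checks, one for JSON validity and one for category membership; the category labels are made up for illustration, and note that both checks catch formatting errors, not factual ones.

```python
# Automated checks for structured outputs: JSON validity and membership in a
# predefined set of categories. Labels below are illustrative examples.
import json

ALLOWED_CATEGORIES = {"travel", "meals", "lodging", "other"}

def check_json(output: str) -> bool:
    try:
        json.loads(output)       # syntactic validity only
        return True
    except json.JSONDecodeError:
        return False

def check_classification(label: str) -> bool:
    return label.strip().lower() in ALLOWED_CATEGORIES

assert check_json('{"category": "meals", "amount": 42.50}')
assert check_classification("Meals")
```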
In summary, the most efficient checking strategies combine automation and human oversight. Automated tools provide speed and scale, and humans provide reliability. By blending these methods and using risk-aware triaging, organizations can achieve a reasonable balance between quality assurance and efficiency.
10. We are building an LLM-based chatbot and would like to guarantee that its answer to a question stays unchanged when different users ask that same question (or one user asks the same question at different times). Is this possible?
If by “guarantee” you mean exactly the same wording every time, the short answer is no.
If the same question is posed on different occasions using different words, the LLM’s answers will very likely change. But even if the exact same wording is used, it is almost impossible to guarantee that exactly the same answer will be generated every single time.
You can reduce variability by configuring certain LLM settings (for example, setting “temperature” to zero), locking the exact model version, and even self-hosting so you control the entire hardware and software stack. But even then, technical factors make it exceedingly difficult to eliminate all variation in real-world production environments.4 Thus, you’ll still occasionally see small wording or emphasis shifts that don’t change the meaning of the underlying answer. Note that this may be adequate if you mainly care about the meaning of the answers rather than their exact wording.
The only way to truly guarantee identical wording is to store (cache) the answer the first time it is generated and serve that stored text whenever the same question is detected. This approach works well if your repeat detection is perfect, but in practice, reworded or slightly altered questions may bypass the cache and trigger LLM regeneration — which can produce a different answer.
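A minimal version of such a cache, keyed on the exact wording of the question, might look like this sketch:

```python
# Sketch of answer caching: serve a stored answer when the exact same question
# is seen again; otherwise generate and store. A reworded question will miss
# this exact-match cache and trigger regeneration.
import hashlib

answer_cache: dict[str, str] = {}

def cached_answer(llm, question: str) -> str:
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    if key in answer_cache:
        return answer_cache[key]  # identical wording guaranteed
    answer = llm(question)        # first time: generate, then store
    answer_cache[key] = answer
    return answer
```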
In short: You can make answers extremely consistent, but a 100% wording guarantee is not achievable with current technology.
References
1. On average, a token is about three-fourths of a word, and modern LLMs have a vocabulary of tens of thousands to over 100,000 tokens. You can enter different questions into OpenAI’s Tokenizer tool and see how a word is tokenized to gain a deeper understanding.
2. Strictly speaking, given an input, the LLM generates a probability (that is, a number between 0.0 and 1.0) for each token in its vocabulary. You can think of the probability for a token as a measure of its suitability to be the next token. Across all the tokens in the vocabulary, the probabilities add up to 1.0. The next token is selected based on these probabilities using a variety of developer-controllable strategies (such as picking the token with the highest probability or selecting a token randomly in proportion to its probability).
3. For a recent survey of academic research on this topic, see Y. Wang, M. Wang, M.A. Manzoor, et al., “Factuality of Large Language Models: A Survey,” in “Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing” (Miami: Association for Computational Linguistics, Nov. 12-16, 2024), 19519-19529.
4. To name a few: nondeterministic GPU operations, floating-point rounding differences, and silent back-end updates.