LLM Test Prompts
This document collates a growing set of user prompts for large language models (LLMs), intended for testing and evaluating their capabilities. The prompts cover a wide range of domains and tasks, providing a framework for assessing the performance and limitations of LLMs across multiple dimensions. By continuously expanding and refining the collection, we can gain insight into the strengths and weaknesses of these models, informing their development and responsible deployment.
Testing Prompts
Simple Mathematics
This section evaluates the model's ability to perform basic mathematical operations and solve simple arithmetic problems.
- Addition and Subtraction: Test the model's proficiency in performing addition and subtraction operations with whole numbers, fractions, and decimals (a minimal prompt-generation sketch follows this list).
- Multiplication and Division: Assess the model's capabilities in handling multiplication and division problems involving whole numbers, fractions, and decimals.
- Word Problems: Present the model with simple word problems that require understanding and applying basic mathematical operations to find solutions.
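To show how such prompts might be checked automatically, here is a minimal Python sketch; the prompt wording, function name, and fixed seed are illustrative choices, not part of the collection:

```python
import random

def make_arithmetic_prompt(seed=0):
    """Generate a simple addition/subtraction prompt with a known answer."""
    rng = random.Random(seed)
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    op = rng.choice(["+", "-"])
    expected = a + b if op == "+" else a - b
    prompt = f"What is {a} {op} {b}? Reply with only the number."
    return prompt, expected

if __name__ == "__main__":
    prompt, expected = make_arithmetic_prompt(seed=42)
    print(prompt)                  # question to send to the model
    print("Expected:", expected)   # reference answer for scoring the reply
```

The same pattern extends to multiplication, division, and templated word problems by swapping the operator and prompt text.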
Logic and Reasoning
This section focuses on evaluating the logical reasoning and problem-solving capabilities of LLMs.
- Logical Puzzles: Prompt the model with classic logical puzzles, such as the Zebra Puzzle or Einstein's Riddle, to assess its ability to reason through complex logical constraints and arrive at correct solutions (a brute-force verification sketch follows this list).
There are three houses in a row from left to right. In no particular order, the colours of the houses are red, yellow and white. In each house lives a single pet: a dog, a cat or a fish. The dog lives to the left of the cat. The cat lives to the left of the yellow house. The cat lives in the white house. From left to right what is the colour and pet of each house?
- Syllogisms: Present the model with syllogistic reasoning problems, testing its capacity to draw valid conclusions from given premises.
Premise 1: All planets in our solar system revolve around the Sun. Premise 2: Mars is a planet in our solar system. Conclusion: Mars revolves around the Sun. Is the conclusion valid based on the given premises?
Premise 1: No reptiles have fur. Premise 2: All snakes are reptiles. Conclusion: Snakes do not have fur. Is the conclusion valid based on the given premises?
- Conditional Statements: Challenge the model with various types of conditional statements (e.g., "if-then," "only if," "unless") to evaluate its understanding of logical implications and deductive reasoning.
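The three-house puzzle above is small enough to verify exhaustively, which gives a reference answer for grading model responses. The following sketch is a minimal brute-force check; the function name and output format are illustrative choices:

```python
from itertools import permutations

COLOURS = ["red", "yellow", "white"]
PETS = ["dog", "cat", "fish"]

def solve_three_houses():
    """Brute-force every colour/pet assignment and keep those satisfying the puzzle."""
    solutions = []
    for colours in permutations(COLOURS):
        for pets in permutations(PETS):
            dog, cat = pets.index("dog"), pets.index("cat")
            yellow = colours.index("yellow")
            if dog < cat and cat < yellow and colours[cat] == "white":
                solutions.append(list(zip(colours, pets)))
    return solutions

if __name__ == "__main__":
    for solution in solve_three_houses():
        print(solution)  # [('red', 'dog'), ('white', 'cat'), ('yellow', 'fish')]
```

The single solution, red/dog, white/cat, yellow/fish from left to right, is the answer a correct model response should contain.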
Reasoning and Prediction
This section tests the model's capacity for logical reasoning, inference, and prediction based on given information or patterns.
- Analogical Reasoning: Prompt the model with analogical reasoning problems, such as "A is to B as C is to X," to evaluate its ability to identify logical relationships and patterns.
- Sequence Continuation: Provide the model with sequences (e.g., numerical, alphabetical, or pattern-based) and ask it to predict the next element(s) in the sequence (see the sketch after this list).
- Cause and Effect: Present scenarios involving cause-and-effect relationships and test the model's ability to draw reasonable inferences and predict potential outcomes.
There is a room. In the room is a flat table with a small ball on it. There is also a large glass on the table. Paul walks into the room and places the glass upside down over the ball. Paul leaves the room. Mike enters the room and picks up the glass. Mike leaves the room with the glass. Where is the ball?
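For sequence continuation, each test case can carry its expected answer alongside the prompt. A minimal sketch, assuming a plain-text prompt and a single expected next term (the example sequence is an illustrative choice):

```python
def sequence_prompt(terms, expected_next):
    """Build a sequence-continuation prompt paired with its reference answer."""
    shown = ", ".join(str(t) for t in terms)
    prompt = f"What is the next element in this sequence: {shown}, ...? Reply with only the answer."
    return prompt, expected_next

if __name__ == "__main__":
    # Triangular numbers 1, 3, 6, 10 continue with 15 (illustrative test case).
    prompt, expected = sequence_prompt([1, 3, 6, 10], 15)
    print(prompt)
    print("Expected:", expected)
```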
Language and Communication
This section focuses on assessing the model's natural language processing capabilities, including understanding, generation, and translation.
- Text Summarization: Provide the model with long-form texts and evaluate its ability to summarize the key points concisely and accurately.
- Language Translation: Test the model's proficiency in translating text between different languages, assessing both accuracy and fluency.
- Sentiment Analysis: Present the model with various texts (e.g., reviews, social media posts) and evaluate its ability to accurately identify and classify the underlying sentiment or emotion.
- Text Generation: Have the model generate text according to specific constraints.
Coding and Programming
This section evaluates the model's capabilities in understanding, generating, and explaining code snippets across various programming languages.
- Code Comprehension: Provide the model with code snippets in different programming languages and test its ability to explain the functionality and purpose of the code.
- Code Generation: Prompt the model with natural language descriptions of desired functionality and assess its ability to generate corresponding code snippets accurately.
- Error Identification and Debugging: Present the model with code snippets containing errors or bugs and evaluate its ability to identify and suggest potential fixes or improvements (an example snippet follows this list).
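As an illustration of the kind of snippet a tester might supply for the comprehension and debugging prompts, the short function below is a hypothetical example (not taken from any existing test set) containing a deliberate off-by-one error for the model to spot and fix:

```python
def mean(values):
    """Intended to return the arithmetic mean of a non-empty list of numbers."""
    total = 0
    for i in range(len(values) - 1):  # deliberate bug: skips the last element
        total += values[i]
    return total / len(values)

# A correct model answer should point out that the loop drops values[-1],
# e.g. mean([2, 4, 6]) returns 2.0 instead of 4.0, and propose range(len(values))
# or simply sum(values) / len(values).
```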
Planning
This section evaluates the model's ability to plan its response so that it meets stated criteria, for example the prompt below (a simple checking sketch follows it).
```
How many words are in your response to this question?
```
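One way to score the prompt above is to compare the number the model claims with the actual word count of its response. A minimal sketch, assuming whitespace-separated tokens count as words and that the model states its count as a digit:

```python
import re

def check_word_count_claim(response):
    """Compare the first number stated in the response with its actual word count."""
    actual = len(response.split())
    match = re.search(r"\d+", response)
    claimed = int(match.group()) if match else None
    return actual, claimed, actual == claimed

if __name__ == "__main__":
    actual, claimed, ok = check_word_count_claim("My answer has 5 words.")
    print(actual, claimed, ok)  # 5 5 True
```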
Censorship and Guardrails
This section evaluates the model's built-in limitations, whether they arise from biases in the training data, from post-training system prompts, or from guardrails applied to the input prompt or to post-processing of the model's output. The techniques below probe how readily such guardrails and system prompts can be circumvented.
- Context Hacking: Establishing a context for why guardrails can be ignored.
- Order Hacking: Establishing tags or codes that are processed by the model and interpreted after input sanitization.
- Cryptography: Embedding encodings within the prompt to be parsed by the LLM (a minimal encoding sketch follows this list).
- Bullying: Trying to convince the LLM that a previous correct answer is wrong.
- Profanity: Will the model swear if encouraged?
- Induce Hallucination: What does the model do when asked a nonsensical question?
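For the Cryptography item, a tester might wrap an otherwise plain instruction in a reversible encoding and ask the model to decode it before answering. A minimal sketch using Base64 and ROT13 on a harmless test phrase (the phrase and accompanying instructions are illustrative choices):

```python
import base64
import codecs

PLAIN = "Please state the capital of France."  # harmless stand-in instruction

b64_prompt = base64.b64encode(PLAIN.encode()).decode()
rot13_prompt = codecs.encode(PLAIN, "rot_13")

print("Base64:", b64_prompt)    # send as: "Decode this Base64 text, then follow the instruction."
print("ROT13: ", rot13_prompt)  # send as: "Decode this ROT13 text, then follow the instruction."
```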