Sep 9, test DeepSeek-V2.5 and Reflection-70b

#AI #LLM ## Introduction This blog post presents a personal evaluation of two recently released language models: `DeepSeek-V2.5` and `Reflection-70b`. The tested versions were accessed via their respective platforms - DeepSeek's official website for V2.5 and Hyperbolic for Reflection-70b. ## Test Cases and Results ### Question 1 ```Plain Text If it takes 1 hour to dry 25 clothes under the sun, how long will it take to dry 30 clothes? ``` The answer from `DeepSeek-V2.5` is: ``` The time it takes to dry clothes under the sun is generally independent of the number of clothes, assuming they are all exposed to the same drying conditions (e.g., sunlight, temperature, and airflow). Therefore, if it takes 1 hour to dry 25 clothes, it will also take 1 hour to dry 30 clothes, provided they are all spread out and exposed to the same drying conditions. ``` The answer from `Reflection-70b` is: ``` <thinking> To solve this problem, we need to understand the relationship between the number of clothes and the time it takes to dry them. Let's devise a step-by-step plan: 1. Identify the given information 2. Determine the drying rate per hour 3. Calculate the time needed for 30 clothes 4. Verify the answer Let's work through this plan: 1. Given information: - 25 clothes take 1 hour to dry - We need to find the time for 30 clothes 2. Drying rate per hour: 25 clothes / 1 hour = 25 clothes per hour 3. Time needed for 30 clothes: We can set up a proportion: 25 clothes : 1 hour = 30 clothes : x hours 25 : 1 = 30 : x 25x = 30 x = 30 / 25 x = 1.2 hours 4. Verification: If 25 clothes take 1 hour, then: 1.2 hours should be enough to dry 30 clothes This seems logical as we're dealing with a relatively small increase in clothes (5 more) <reflection> Let's double-check our calculation: 25 clothes take 1 hour, so 1 more hour would dry 25 more clothes But we only have 5 more clothes (30 - 25 = 5) Therefore, we don't need a full extra hour, just a fraction of it 1.2 hours (1 hour + 0.2 hours) seems correct The reasoning and calculation appear to be correct. We've accounted for the additional 5 clothes and found that it would take slightly more than an hour to dry them. </reflection> </thinking> <output> Based on the given information that it takes 1 hour to dry 25 clothes under the sun, we can calculate the time it would take to dry 30 clothes as follows: The drying rate is 25 clothes per hour. To dry 30 clothes, we need to account for the additional 5 clothes. We can set up a proportion: 25 clothes : 1 hour = 30 clothes : x hours Solving this proportion, we find that x = 1.2 hours. Therefore, it would take approximately 1 hour and 12 minutes to dry 30 clothes under the sun. This makes sense because we're only adding 5 more clothes to the original 25, so we don't need a full extra hour. The additional 12 minutes accounts for the extra drying time needed for these 5 additional clothes. </output> ``` This is a classic tricky question, and we can see that despite its longer output due to the inclusion of `<thinking>`, `<reflection>`, and `<output>`, `Reflection-70b` still fails to provide the correct answer.. ### Question 2 ```Plain Text Please exchange the second word and the last second word for the following sentence: I need to go to work even though today is Saturday. ``` The answer from `DeepSeek-V2.5` is: ```Plain Text I need to go to Saturday even though today is work. ``` The answer from `Reflection-70b` is: ```Plain Text <thinking> To solve this problem, we need to follow these steps: 1. Identify the second word in the sentence 2. Identify the last second word in the sentence 3. Exchange these two words 4. Verify the result Let's work through this plan: 1. Identifying the second word: Sentence: I need to go to work even though today is Saturday Second word: need 2. Identifying the last second word: Sentence: I need to go to work even though today is Saturday Last word: Saturday Last second word: is 3. Exchanging these two words: Original: I need to go to work even though today is Saturday Modified: I is to go to work even though today need Saturday 4. Verifying the result: The second word "need" has been moved to the last second position The last second word "is" has been moved to the second position All other words remain in their original positions <reflection> Let's double-check our solution: - We correctly identified "need" as the second word - We correctly identified "is" as the last second word - We swapped these two words while keeping all other words in place - The resulting sentence maintains the same word count and structure The solution appears to be correct and follows the given instructions. </reflection> </thinking> <output> The sentence with the second word and the last second word exchanged is: I is to go to work even though today need Saturday To explain: 1. The original second word "need" has been moved to the last second position (before "Saturday") 2. The original last second word "is" has been moved to the second position (after "I") 3. All other words in the sentence remain in their original positions This exchange maintains the overall structure of the sentence while fulfilling the requirement of swapping the specified words. </output> ``` Obviously, `Reflection-70b` provides the correct answer, while DeepSeek-V2.5 gets it wrong in this case. It’s evident that `Reflection-70b` follows a clear and structured approach to solving the question. ### Question 3 ```Plain Text what is the fifth word in the following sentences? I have an apple ``` The answer from `DeepSeek-V2.5` is: ```Plain Text The fifth word in the sentence "I have an apple" is "apple." ``` The answer from `Reflection-70b` is: ```Plain Text <thinking> This is a relatively straightforward task, so I'll use a simple plan and Chain of Thought reasoning to solve it. Plan: 1. Count the words in the sentence 2. Identify the fifth word Chain of Thought: 1. Let's count the words: 1. I 2. have 3. an 4. apple 2. There are only four words in this sentence, not five. This means there is no fifth word. <reflection> Upon reflection, I realize that the question asks for the fifth word, but the sentence only contains four words. This is an important observation that needs to be communicated in the output. </reflection> </thinking> <output> There is no fifth word in the given sentence. The sentence "I have an apple" contains only four words: 1. I 2. have 3. an 4. apple Since there are only four words, it's not possible to identify a fifth word in this sentence. </output> ``` Clearly, this is a trick question. The sentence “I have an apple” does not contain a fifth word, and while `Reflection-70b` identifies this, `DeepSeek-V2.5` fails to do so. ## Some Ideas Here I think `Reflection-70b` is a not bad model from my personal tiny test and also, this test cannot show `DeepSeek-V2.5` is an inferior model to `Reflection-70b` because the system prompts of this two models are different. Actually, many researchers already show the technique, `Self-Reflection` does really improves the performance. So, I wanna test the same question for `DeepSeek-V2.5` for the same system prompt which is used for `Reflection-70b`. The system prompt for `Reflection-70b` is shown as follows: ```Plain Text You are a world-class AI system, capable of complex reasoning and reflection. Reason through the query inside <thinking> tags, and then provide your final response inside <output> tags. If you detect that you made a mistake in your reasoning at any point, correct yourself inside <reflection> tags. ``` ### Question 1 ```Plain Text If it takes 1 hour to dry 25 clothes under the sun, how long will it take to dry 30 clothes? ``` The answer from `DeepSeek-V2.5` which used the same system prompt for `Reflection-70b` is: ``` <thinking> To determine how long it will take to dry 30 clothes, we need to consider the drying process under the sun. The drying time is primarily influenced by the number of clothes and the available drying space. 1. **Initial Information**: It takes 1 hour to dry 25 clothes. 2. **Assumption**: The drying process is linear, meaning the drying time per piece of clothing remains constant. 3. **Calculation**: If 25 clothes take 1 hour, then each piece of clothing takes 1/25 hours to dry. 4. **New Calculation**: For 30 clothes, the total drying time would be 30 * (1/25) hours. <reflection> Upon further reflection, the assumption that the drying process is linear might not be accurate. The drying time could be influenced by the available space for drying, which might not increase linearly with the number of clothes. Therefore, the drying time for 30 clothes might not be simply 30 * (1/25) hours. Given the complexity of the drying process and the potential for non-linear effects, the best answer is that the drying time for 30 clothes could be approximately the same as for 25 clothes, assuming the drying space remains the same and the additional clothes do not significantly impact the drying efficiency. </reflection> <output> Given the complexity of the drying process and the potential for non-linear effects, the drying time for 30 clothes could be approximately the same as for 25 clothes, assuming the drying space remains the same and the additional clothes do not significantly impact the drying efficiency. Therefore, it might still take around 1 hour to dry 30 clothes. </output> ``` ### Question 2 ```Plain Text Please exchange the second word and the last second word for the following sentence: I need to go to work even though today is Saturday. ``` The answer from `DeepSeek-V2.5` which used the same system prompt for `Reflection-70b` is: ```Plain Text <thinking> The sentence provided is: "I need to go to work even though today is Saturday." 1. Identify the second word: "need" 2. Identify the second-to-last word: "is" 3. Exchange these two words: "I is to go to work even though today need Saturday." <reflection> Upon reflection, I realize that I made a mistake in my reasoning. The task was to exchange the second word and the second-to-last word, not to insert them in the wrong positions. The correct exchange should be: 1. Identify the second word: "need" 2. Identify the second-to-last word: "is" 3. Exchange these two words: "I is to go to work even though today need Saturday." However, the correct exchange should maintain the sentence structure, so the correct sentence should be: "I is to go to work even though today need Saturday." <output> I is to go to work even though today need Saturday. </output> ``` ### Question 3 ```Plain Text what is the fifth word in the following sentences? I have an apple ``` The answer from `DeepSeek-V2.5` which used the same system prompt for `Reflection-70b` is: ```Plain Text <thinking> 1. Identify the sentence provided: "I have an apple." 2. Count the words in the sentence: "I", "have", "an", "apple". 3. Determine the fifth word: Since there are only four words in the sentence, there is no fifth word. </thinking> <output> There is no fifth word in the sentence "I have an apple." </output> ``` It is surprised to me that `DeepSeek-V2.5` answers all of these questions correctly with the `Reflection-70b` system prompts. So, I think the reason why performance of `Reflection-70b` is good is not because its model capability but also because its reflection prompt.