#AI #LLM #XAI

<details class="toc-container">
<summary><strong>Table of Contents</strong></summary>
<ul>
<li>Information</li>
<li>Test Cases</li>
</ul>
</details>

Welcome to my latest notes on Grok 3. In this blog post, I'll share my observations and highlight some fascinating test cases comparing Grok 3 with `DeepSeek-R1` and `o3-mini`.

## Information

xAI has introduced Grok 3 with two beta reasoning models: `Grok 3 (Think)` and `Grok 3 mini (Think)`. These models were trained with reinforcement learning (RL) at what xAI describes as unprecedented scale, refining their chain-of-thought processes to enable advanced, data-efficient reasoning.

Below is a benchmark graph showing the performance of `Grok 3`'s thinking model:

![[Grok-3-Thinking-Performance.png]]

The general Grok 3 model, with a context window of 1 million tokens, also posts impressive benchmark results:

![[Grok-3-Performance.png]]

## Test Cases

[Dave W Plummer](https://x.com/davepl1968) ran a fascinating Breakout test with `Grok 3`. Here are the results:

![[_mn3CWD_Y-kvqIlK.mp4]]

The initial prompt was simple: "How about a colored version of Breakout?" The first revision requested, "Make the player move automatically under computer control, and make the ball go 10% faster each time it bounces off the paddle." The final revision addressed a gameplay issue: "Good, but the ball can get stuck in a vertical bounce. How did the original game handle that? Do the same! And make the player aim for remaining bricks."

For the full thread, see: [Breakout by Grok3](https://x.com/davepl1968/status/1892365077799485502)

[Theo-t3.gg](https://x.com/theo) argues that `Grok 3` is not great at coding. Here is his demonstration:

![[3Wzzp1y-hYDI_wE5.mp4]]

[Alex Prompter](https://x.com/alex_prompter) tested `Grok 3` and `DeepSeek V3` with the same set of critical prompts and compared the outputs side by side. For more details, see: [Grok 3 VS. DeepSeek V3](https://x.com/alex_prompter/status/1891932347500474793)

[Andrej Karpathy](https://x.com/karpathy) conducted a thorough comparison between `Grok 3`, OpenAI's `o1-pro`, and `DeepSeek-R1`. His tests showed strong performance from Grok 3 on reasoning tasks, such as generating a Settlers of Catan board and estimating the training FLOPs of GPT-2. However, the model struggled with complex spatial tasks, particularly generating an accurate SVG image of a pelican riding a bicycle. For the complete analysis, see: [Grok 3 test by Andrej Karpathy](https://x.com/karpathy/status/1891720635363254772)
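To give a sense of what the GPT-2 FLOPs question involves, here is a minimal back-of-envelope sketch using the common "6 × parameters × tokens" approximation for training compute. The parameter count (~1.5B) is GPT-2's published size; the token count (~10B for WebText) is my own rough assumption, not a figure taken from Karpathy's post.

```python
# Back-of-envelope estimate of GPT-2 training compute.
# Rule of thumb: training FLOPs ≈ 6 * N * D
#   N = number of model parameters
#   D = number of training tokens
# Assumptions (mine, not from Karpathy's post):
#   N ≈ 1.5e9  (GPT-2 XL parameter count)
#   D ≈ 1e10   (rough token count for the WebText dataset)

n_params = 1.5e9
n_tokens = 1e10

train_flops = 6 * n_params * n_tokens
print(f"Estimated GPT-2 training compute: ~{train_flops:.1e} FLOPs")
# -> ~9.0e+19, i.e. on the order of 1e20 FLOPs
```

The arithmetic itself is trivial; the interesting part of the test is whether the model recalls sensible values for the parameter and token counts before multiplying them.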