# Reflection
## Job
In July, I helped my team build a private LLM benchmark, one that is both confidential and tailored to our specific needs. We built it because many public benchmarks have become contaminated as their test data leaks into LLM training corpora, and despite claims that various models surpass GPT-4 in areas like reasoning, I haven't seen such outstanding performance in practice. Constructing the test set was challenging, and I drew inspiration from how many classical test sets are constructed. For evaluation, I read numerous papers and learned more about LLM-as-a-Judge.
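For context, this is roughly the shape of the LLM-as-a-Judge setup I explored: a strong model grades each candidate answer against a reference. The sketch below is a minimal illustration using the OpenAI Python client; the prompt wording, the `gpt-4o` model name, and the 1–5 scale are assumptions for the example, not our actual benchmark configuration.

```python
# Minimal LLM-as-a-Judge sketch, assuming the OpenAI Python client.
# The prompt, model name, and 1-5 scale are illustrative, not our real setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are a strict grader.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Score the candidate from 1 to 5 and reply with only the number."""


def judge(question: str, reference: str, candidate: str,
          model: str = "gpt-4o") -> int:
    """Ask a strong model to grade a candidate answer against a reference."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # keep scoring as deterministic as possible
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, reference=reference, candidate=candidate),
        }],
    )
    return int(resp.choices[0].message.content.strip())
```

Keeping the temperature at 0 and forcing a numeric-only reply makes the scores easier to parse and a bit more reproducible, though judge bias and prompt sensitivity remain open issues.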
I also published three blog posts this month about LLM benchmarks:
- [[July 6, 2024 LLMs Evaluation Benchmarks]]
- [[July 16, 2024 LLMs Evals Thoughts]]
- [[July 31, 2024 LLM & VLM-as-a-Judge]]
## Personal
This month, I ran some interesting experiments with Midjourney, TextGrad, Dify, and DSPy, and wrote several blog posts documenting them:
- [[July 23, DSPy with GPT-4o-mini on MMLU-Pro]]
- [[July 9, How to use DeepSeek with TextGrad]]
- [[July 7, 2024 Weekend with Midjourney]]
- [[July 11, 2024 借助 Dify 做一个三步翻译工作流]] (building a three-step translation workflow with Dify)
- [[July 14, 2024 How to use Yi-Vision with TextGrad]]
On a personal note, I started preparing for the PTE exam. Achieving a high score is not easy; I plan to take the exam on August 8 and hope to do well.