2023 Week 8
I really enjoyed meeting Professor Li and grabbing coffee in the office, and discussing the next project steps with Byron virtually!
I also decided to go with GPT-3.5's four methods of grading factuality and test them on model/prompt combinations. I was surprised by how good the quality of GPT-3.5's ratings is, especially Chain of Thought with justification and explanation. Here's a written summary of the averaged scores (since the GPT-3.5 outputs vary in format and sometimes get chatty for DA 1-100):
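As context for those scores, here's roughly how each grading method gets queried and averaged. This is a minimal sketch assuming the `openai` Python client; the prompt templates, the score-extraction regex, and the helper names are my own placeholders, not the exact ones used:

```python
import re
from statistics import mean
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical templates; only the two methods named above are spelled out.
GRADING_PROMPTS = {
    "da_1_100": (
        "Rate the factual consistency of the summary against the source "
        "on a 1-100 scale. Respond with a number only.\n"
        "Source: {source}\nSummary: {summary}"
    ),
    "cot_justified": (
        "Think step by step about whether the summary is factually "
        "consistent with the source, justify your reasoning, then end "
        "with a 1-100 score.\nSource: {source}\nSummary: {summary}"
    ),
    # ...the other two grading methods would go here
}

def grade(method: str, source: str, summary: str) -> float | None:
    """Ask GPT-3.5 for one rating; pull the last number out of the reply,
    since the outputs vary in format and can get chatty."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{
            "role": "user",
            "content": GRADING_PROMPTS[method].format(source=source, summary=summary),
        }],
    )
    numbers = re.findall(r"\d+(?:\.\d+)?", resp.choices[0].message.content)
    return float(numbers[-1]) if numbers else None

def averaged_score(method: str, pairs: list[tuple[str, str]]) -> float:
    """Average one grading method's scores over (source, summary) pairs."""
    scores = [s for src, summ in pairs if (s := grade(method, src, summ)) is not None]
    return mean(scores)
```

Taking the *last* number in the reply is a small workaround for the chattiness: Chain of Thought answers tend to reason first and put the score at the end.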
One of Byron's recent papers found that QAFactEval and the other metrics that had incompatibility errors correlate terribly with human evaluation… and I'm happy that we don't need to worry about them anymore.
We also decided to test LLAMA-2 (since it's relatively new and popular). We're settling on LLAMA-2, GPT-4, ALPACA, and Flan-T5-XL as the final models, each with one prompt (whichever performs best).
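In other words, the final evaluation config looks something like this, with each prompt slot to be filled in once the per-model prompt comparison finishes (the placeholder values below are mine, not final choices):

```python
# Final model lineup; each maps to its single best-performing prompt.
FINAL_SETUP: dict[str, str | None] = {
    "LLAMA-2": None,     # best prompt TBD
    "GPT-4": None,       # best prompt TBD
    "ALPACA": None,      # best prompt TBD
    "Flan-T5-XL": None,  # best prompt TBD
}
```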