
I'd be happy to talk a bit about how we evaluated the model. The task we're performing is fundamentally long-form question answering (LFQA), and recent papers (https://arxiv.org/pdf/2103.06332.pdf) have shown that metrics such as ROUGE (used for the KILT benchmark) aren't great at evaluating the quality and truthfulness of generated answers. On our dataset, we instead use a combination of human evaluation (still arguably the most reliable way the NLP research community has to judge generated answer quality) and an entailment score (checking whether a generated answer is consistent with a "ground truth" context document).
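
To give a concrete sense of the entailment score, here's a rough sketch (not our actual pipeline) of how such a score is typically computed with an off-the-shelf NLI checkpoint: treat the ground-truth context as the premise and the generated answer as the hypothesis, and take the model's entailment probability. The model name and label ordering below are assumptions for illustration.

    # Rough sketch of an NLI-based entailment score (illustrative, not our code).
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL_NAME = "roberta-large-mnli"  # assumed checkpoint; any NLI model works
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    model.eval()

    def entailment_score(context: str, answer: str) -> float:
        """Probability that `answer` is entailed by `context`."""
        inputs = tokenizer(context, answer, return_tensors="pt",
                           truncation=True, max_length=512)
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)[0]
        # For roberta-large-mnli: 0=contradiction, 1=neutral, 2=entailment.
        # Check model.config.id2label if you swap in a different checkpoint.
        return probs[2].item()

    print(entailment_score(
        "The Eiffel Tower is 330 metres tall and stands in Paris.",
        "The Eiffel Tower is located in Paris."))

In practice you'd also have to handle context documents longer than the model's input window (e.g. by scoring against retrieved passages and aggregating), which is where most of the engineering effort goes.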

