0814 G-EVAL


G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment

G-EVAL

The evaluation currently used in Nebula consists of:

Task Description (Human Written) + Criteria (Human Written)

Let's add CoT (Chain of Thought: Automatically Written by LLM) on top of this and try G-EVAL (behind the scenes).
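The composition above could be sketched roughly as follows. This is only an illustration, not Nebula's actual code: the function names, the prompt wording, and the `llm` callable (any function that takes a prompt string and returns text) are all assumptions, and `fake_llm` is a stand-in so the example runs without an API.

```python
from typing import Callable


def generate_evaluation_steps(llm: Callable[[str], str],
                              task_description: str,
                              criteria: str) -> str:
    """Ask the LLM to write the CoT (evaluation steps) from the human-written parts."""
    prompt = (
        f"{task_description}\n\n"
        f"Evaluation Criteria:\n{criteria}\n\n"
        "Evaluation Steps:\n"
    )
    return llm(prompt).strip()


def build_eval_prompt(task_description: str, criteria: str,
                      evaluation_steps: str, source: str, candidate: str) -> str:
    """Combine human-written parts, the generated CoT, and the text to score."""
    return (
        f"{task_description}\n\n"
        f"Evaluation Criteria:\n{criteria}\n\n"
        f"Evaluation Steps:\n{evaluation_steps}\n\n"
        f"Source Text:\n{source}\n\n"
        f"Candidate Output:\n{candidate}\n\n"
        "Score (1-5):"
    )


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return ("1. Read the source text.\n"
            "2. Compare the candidate output against it.\n"
            "3. Assign a score from 1 to 5.")


task = "You will be given one summary written for a news article."
criteria = "Coherence (1-5): the collective quality of all sentences."
steps = generate_evaluation_steps(fake_llm, task, criteria)
prompt = build_eval_prompt(task, criteria, steps, "source text", "candidate summary")
```

The generated steps only ever live inside `prompt`, which is one way to read "behind the scenes": the CoT is produced once per task and reused for every sample without being displayed.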

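Problems.

On a 1~5 scale, a score of 3 tends to come up very often.
Scores tend to come out only as integers, which leads to many ties.

To address this, the probabilities of the output score tokens can be used to weight the scores, yielding a normalized, continuous score.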
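A minimal sketch of this probability weighting, assuming the LLM API exposes log-probabilities for the candidate score tokens. The `token_logprobs` mapping and the function name are hypothetical, and the numbers in the example are made up for illustration.

```python
import math


def expected_score(token_logprobs, scale=(1, 2, 3, 4, 5)):
    """Probability-weighted score: sum of p(s) * s over the rating scale.

    token_logprobs: hypothetical mapping from a score token (e.g. "3")
    to its log-probability at the position where the score is emitted.
    """
    # Keep only tokens that are valid scores on the scale.
    probs = {s: math.exp(token_logprobs[str(s)])
             for s in scale if str(s) in token_logprobs}
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no score tokens found in token_logprobs")
    # Renormalize over the scale, then take the expectation.
    return sum(s * p / total for s, p in probs.items())


# Made-up distribution: the model mostly says "3" but leans toward "4".
example = {"2": math.log(0.1), "3": math.log(0.6), "4": math.log(0.3)}
score = expected_score(example)  # 0.1*2 + 0.6*3 + 0.3*4 = 3.2
```

Two outputs that would both be reported as "3" now get different scores depending on how the remaining probability mass is distributed, which breaks ties.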

On Summarization and Dialogue Generation, G-Eval achieved higher correlation with human judgments (Spearman, Kendall-Tau) than the compared metrics.
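For reference, here is a small pure-Python sketch of how these two correlations between human ratings and metric scores can be computed. The data is a toy example invented for illustration, not a result from the paper, and the Kendall variant shown is tau-a (no tie correction), which may differ from the variant used in a given evaluation.

```python
from itertools import combinations


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    # Spearman's rho is the Pearson correlation of the ranks.
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    # (concordant pairs - discordant pairs) / total pairs.
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) // 2)


human = [1, 2, 3, 4, 5]              # toy human ratings
metric = [1.2, 2.9, 2.5, 4.1, 4.8]   # toy metric scores
rho = spearman(human, metric)        # 0.9
tau = kendall_tau_a(human, metric)   # 0.8
```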

Could a retrieval component be added?

In Audio/Video Caption Description, expand the domain with self-ask (Google Search API), centered on attention.