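LLM evaluators need different criteria for each application domain (or user), so overly general criteria alone are not enough.
To let each user define criteria flexibly, teach the model those criteria through in-context learning (ICL).
TALEC: combine zero-shot and few-shot prompting so the model can focus on more information, and propose a new prompt paradigm that helps the model better understand complex criteria.
TALEC's evaluation results showed a correlation of over 80% with human evaluation results.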

TALEC: Customized Business Evaluation Criteria

TALEC reflects human preferences and achieves a correlation of over 80% with human judgments.
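As a rough illustration of what a "correlation with human judgments" measures, the sketch below compares judge-model scores with human scores on the same items. The notes do not say which coefficient TALEC reports, so both Pearson and Spearman are shown here as assumptions, and all function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

With human scores and judge scores collected for the same answers, `spearman(human_scores, judge_scores)` gives a value in [-1, 1]; a reported figure of "over 80%" would correspond to a coefficient above 0.8 under this reading.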

TALEC combines zero-shot and few-shot prompting to make the judge model focus on more information.

TALEC allows users to flexibly set their own evaluation criteria, and uses in-context learning (ICL) to teach the judge model these in-house criteria.
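A minimal sketch of how user-defined criteria might be taught to a judge model through ICL: the criteria are written into the instruction part of the prompt, and scored examples are appended as demonstrations. This is not TALEC's actual prompt format, which these notes do not describe. The class names, the 1-to-5 scale, and the prompt wording are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One user-defined evaluation criterion (hypothetical structure)."""
    name: str
    description: str


@dataclass(frozen=True)
class ScoredExample:
    """A human-scored demonstration used as a few-shot example."""
    question: str
    answer: str
    score: int
    rationale: str


def build_judge_prompt(criteria, examples, question, answer, max_score=5):
    """Build a judge prompt from criteria plus optional scored examples.

    With an empty `examples` list the prompt is purely instruction-based
    (zero-shot); with examples it also carries few-shot demonstrations.
    """
    if not criteria:
        raise ValueError("at least one criterion is required")
    parts = [
        f"You are a judge model. Rate the answer from 1 to {max_score} "
        "according to the criteria below.",
        "",
        "Criteria:",
    ]
    for i, criterion in enumerate(criteria, start=1):
        parts.append(f"{i}. {criterion.name}: {criterion.description}")
    for i, example in enumerate(examples, start=1):
        parts += [
            "",
            f"Example {i}",
            f"Question: {example.question}",
            f"Answer: {example.answer}",
            f"Rationale: {example.rationale}",
            f"Score: {example.score}",
        ]
    # The prompt ends at "Rationale:" so the model explains before scoring.
    parts += [
        "",
        "Now evaluate",
        f"Question: {question}",
        f"Answer: {answer}",
        "Rationale:",
    ]
    return "\n".join(parts)
```

A user would pass their own in-house criteria, for example a general one such as correctness next to a business-specific one such as data security, and change them without retraining the judge model, since the criteria live only in the prompt.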

Especially in specific application domains (e.g., to-business or to-customer service), in-house evaluation criteria have to meet not only general standards (correctness, helpfulness, creativity, etc.) but also the specific needs of customers and business security requirements, which makes the evaluation more difficult.