Model Evaluation Metrics
Performance analysis of the medical language model
Perplexity
Lower perplexity indicates better language modeling performance
QA F1 Score
Higher F1 score indicates better question-answering performance
Training History
A record of previous training runs and their evaluation results
Understanding the Metrics
Perplexity
Perplexity measures how well a probability model predicts a sample. Formally, it is the exponential of the average negative log-likelihood the model assigns to each token, so a lower value means the model assigns higher probability to the observed text.
For language models, it can be interpreted as the model's average uncertainty when predicting the next token: a perplexity of k means the model is, on average, about as uncertain as if it were choosing uniformly among k candidates.
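As a rough illustration, the sketch below computes perplexity by exponentiating the mean cross-entropy loss of a causal language model, using the standard Hugging Face `transformers` API. The checkpoint name "gpt2" and the sample sentence are placeholders, not the actual medical model evaluated here.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def perplexity(model, tokenizer, text: str) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels set to the input ids, the model returns the mean
        # cross-entropy (negative log-likelihood) over the sequence.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())


# Placeholder checkpoint and text; substitute the medical model being evaluated.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
print(perplexity(model, tokenizer, "The patient presented with acute chest pain."))
```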
F1 Score
F1 score is the harmonic mean of precision and recall. Precision is the fraction of the model's positive predictions that are correct, and recall is the fraction of actual positive cases the model identifies.
For question-answering tasks, F1 is typically computed at the token level: precision is the share of predicted-answer tokens that appear in the reference answer, recall is the share of reference-answer tokens that appear in the prediction, and a higher F1 indicates answers that overlap more closely with the reference.
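A minimal sketch of this token-level (SQuAD-style) F1 is shown below. Full evaluation scripts usually also normalize punctuation and articles before comparing tokens; that step is omitted here, and the example answer strings are hypothetical.

```python
from collections import Counter


def qa_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of token precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Count tokens shared between prediction and reference (with multiplicity).
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: precision = 2/3, recall = 1, F1 = 0.8.
print(qa_f1("acute myocardial infarction", "myocardial infarction"))
```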