CEFR Classifiers - One Model to Grade them All
Collection
Automatic CEFR level classification of English learner writing. Trained on EFCamDAT, evaluated on multiple out-of-domain corpora.
A Multinomial Naive Bayes model for classifying English text by CEFR (Common European Framework of Reference for Languages) proficiency levels.
This model is part of an ensemble CEFR text classification system that combines multiple approaches to estimate language proficiency levels. The Naive Bayes classifier provides fast, interpretable predictions based on word frequency patterns characteristic of different proficiency levels.
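To make "word frequency patterns" concrete, here is a minimal pure-Python sketch of how a multinomial Naive Bayes classifier scores a text. It is a generic illustration, not the released model's pipeline: the function names, the whitespace tokenizer, the Laplace smoothing value, and the toy sentences are all assumptions made for this example, and the real vectorizer settings are not documented here.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(samples, alpha=1.0):
    """Collect per-label word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    vocab = set()
    for text, label in samples:
        tokens = text.lower().split()   # assumed tokenizer: whitespace split
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts,
            "vocab": vocab, "alpha": alpha}


def predict_log_scores(model, text):
    """Return the unnormalized log-posterior score of each label."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    scores = {}
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior of the label
        for token in text.lower().split():
            if token not in model["vocab"]:
                continue  # skip words never seen during training
            # Laplace-smoothed log likelihood of the word given the label
            score += math.log((counts[token] + alpha)
                              / (total_words + alpha * vocab_size))
        scores[label] = score
    return scores


# Invented toy samples for illustration only (not EFCamDAT data).
toy_samples = [
    ("i like my cat", "A1"),
    ("my cat is big", "A1"),
    ("the committee nevertheless postponed its decision", "C1/C2"),
    ("the evidence nevertheless remains inconclusive", "C1/C2"),
]
toy_model = train_multinomial_nb(toy_samples)
toy_scores = predict_log_scores(toy_model, "my cat is nevertheless big")
toy_prediction = max(toy_scores, key=toy_scores.get)
print(toy_prediction)
```

Because each word contributes an additive log term, the words that push a text toward a given level can be read directly from the per-label counts, which is what makes this kind of model easy to interpret.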
The model classifies text into 5 CEFR proficiency levels: A1, A2, B1, B2, and C1/C2 (C1 and C2 share a single class).
model.pkl: Trained Naive Bayes classifier
vectorizer.pkl: TF-IDF/Count vectorizer for text preprocessing
from huggingface_hub import hf_hub_download
import joblib
# Download model files
model_path = hf_hub_download(
repo_id="theluantran/cefr-naive-bayes",
filename="model.pkl"
)
vectorizer_path = hf_hub_download(
repo_id="theluantran/cefr-naive-bayes",
filename="vectorizer.pkl"
)
# Load model and vectorizer
model = joblib.load(model_path)
vectorizer = joblib.load(vectorizer_path)
# Predict
text = "This is a sample text to classify"
features = vectorizer.transform([text])
prediction = model.predict(features)[0]
probabilities = model.predict_proba(features)[0]
# Map numeric prediction to CEFR level
level_map = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}
predicted_level = level_map[prediction]
print(f"Predicted level: {predicted_level}")
print(f"Confidence: {max(probabilities):.2%}")
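The snippet above reports only the top probability. As a small optional sketch, the full distribution can be paired with CEFR labels and ranked. The helper name `rank_levels` and the example probabilities are hypothetical. It also assumes the model's class labels are the integers 0 to 4 used in `level_map`. In scikit-learn, the columns of `predict_proba` follow the order of `model.classes_`, so passing that attribute keeps labels and probabilities aligned.

```python
def rank_levels(probabilities, classes, level_map):
    """Pair each probability with its CEFR label, most likely first."""
    pairs = [(level_map[int(c)], float(p))
             for c, p in zip(classes, probabilities)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


level_map = {0: "A1", 1: "A2", 2: "B1", 3: "B2", 4: "C1/C2"}

# Hypothetical values standing in for predict_proba output and model.classes_.
example_probs = [0.05, 0.10, 0.55, 0.25, 0.05]
example_classes = [0, 1, 2, 3, 4]

ranking = rank_levels(example_probs, example_classes, level_map)
for level, prob in ranking:
    print(f"{level}: {prob:.2%}")
```

With the loaded model, the call would be `rank_levels(probabilities, model.classes_, level_map)`. A small gap between the top two entries suggests the text sits near a level boundary.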
This model is released for research and educational purposes. The training data is proprietary and not included.