CEFR Naive Bayes Classifier

A Multinomial Naive Bayes model that classifies English text by CEFR (Common European Framework of Reference for Languages) proficiency level.

Model Description

This model is part of an ensemble CEFR text classification system that combines multiple approaches to estimate language proficiency levels. The Naive Bayes classifier provides fast, interpretable predictions based on word frequency patterns characteristic of different proficiency levels.
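To make "word frequency patterns" concrete, here is a minimal sketch of how a Multinomial Naive Bayes classifier learns from word counts. The sentences, the labels, and the choice of `CountVectorizer` are made up for illustration; this is not the training pipeline or data behind the released model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy, made-up sentences (NOT the model's real training data).
train_texts = [
    "I like my cat",
    "I have a dog",
    "The committee postponed the controversial legislation",
    "The hypothesis was substantiated by empirical evidence",
]
# 0 = A1 and 4 = C1/C2, following the label scheme of this card.
train_labels = [0, 0, 4, 4]

# Turn each sentence into a vector of word counts.
toy_vectorizer = CountVectorizer()
X_train = toy_vectorizer.fit_transform(train_texts)

# Fit per-class word likelihoods (Laplace smoothing is the default).
toy_model = MultinomialNB()
toy_model.fit(X_train, train_labels)

# A new sentence that reuses words seen only in the beginner examples.
X_new = toy_vectorizer.transform(["I have a cat"])
toy_prediction = int(toy_model.predict(X_new)[0])
toy_probabilities = toy_model.predict_proba(X_new)[0]

print(toy_prediction, toy_probabilities)
```

Because "have" and "cat" occur only in the beginner sentences, the toy model assigns the new sentence to class 0. The released model applies the same principle over a much larger vocabulary.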

Labels

The model classifies text into five classes. The two advanced CEFR levels, C1 and C2, are merged into a single class:

A1: Beginner
A2: Elementary
B1: Intermediate
B2: Upper Intermediate
C1/C2: Advanced

Model Details

Type: Multinomial Naive Bayes
Framework: scikit-learn
Task: Multi-class text classification
Input: Raw text strings
Output: An integer class label (0-4) and a probability distribution over the five classes
Files:
- model.pkl: Trained Naive Bayes classifier
- vectorizer.pkl: Fitted vectorizer (TF-IDF or count-based) that converts raw text into the feature vectors the classifier expects

Usage

Basic Prediction

from huggingface_hub import hf_hub_download
import joblib

# Download model files
model_path = hf_hub_download(
    repo_id="theluantran/cefr-naive-bayes",
    filename="model.pkl"
)
vectorizer_path = hf_hub_download(
    repo_id="theluantran/cefr-naive-bayes",
    filename="vectorizer.pkl"
)

# Load model and vectorizer
model = joblib.load(model_path)
vectorizer = joblib.load(vectorizer_path)

# Predict
text = "This is a sample text to classify"
features = vectorizer.transform([text])
prediction = model.predict(features)[0]
probabilities = model.predict_proba(features)[0]

# Map numeric prediction to CEFR level (cast the NumPy integer to a plain int)
level_map = {0: 'A1', 1: 'A2', 2: 'B1', 3: 'B2', 4: 'C1/C2'}
predicted_level = level_map[int(prediction)]

print(f"Predicted level: {predicted_level}")
print(f"Confidence: {max(probabilities):.2%}")
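The snippet above reports only the top class. To inspect the full distribution, each column of `predict_proba` can be paired with its CEFR label through the classifier's `classes_` attribute. The following is a sketch, not part of the released code: `describe_prediction` is a hypothetical helper, and the probabilities in the demo are invented.

```python
# Label scheme taken from this card.
LEVEL_MAP = {0: "A1", 1: "A2", 2: "B1", 3: "B2", 4: "C1/C2"}


def describe_prediction(probabilities, classes):
    """Pair each class probability with its CEFR label and pick the likeliest.

    `classes` should be the classifier's `classes_` attribute, which gives
    the column order of `predict_proba`.
    """
    by_level = {
        LEVEL_MAP[int(c)]: float(p) for c, p in zip(classes, probabilities)
    }
    best_level = max(by_level, key=by_level.get)
    return best_level, by_level


# Invented probabilities, for illustration only.
example_level, example_distribution = describe_prediction(
    [0.05, 0.10, 0.60, 0.20, 0.05], [0, 1, 2, 3, 4]
)
print(example_level)
for level, p in example_distribution.items():
    print(f"{level}: {p:.2%}")
```

With the loaded model, the call would be `describe_prediction(model.predict_proba(features)[0], model.classes_)`. Passing several texts to `vectorizer.transform` yields one probability row per text, so the helper can be applied row by row for batch scoring.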

License

This model is released for research and educational purposes. The training data is proprietary and not included.

Collection

This model is included in the collection "CEFR Classifiers - One Model to Grade them All": automatic CEFR level classification of English learner writing, trained on EFCamDAT and evaluated on multiple out-of-domain corpora.