Monday, February 1, 2016

About the Essay Learning Tool

The HR Avatar Automated Essay Scoring System

Introduction


Written communication is a key skill in many positions. Communicating via email, writing reports, and creating presentations all require the ability to communicate effectively. Traditionally, essays written for assessment purposes are scored by human raters using a pre-defined scoring rubric. However, the cost of human scorers is relatively high, and humans can become fatigued and erratic in high-volume situations.

Luckily, machine learning has advanced to the point where computers can reliably substitute for human raters. The HR Avatar Essay Test is an implementation of this technique, which results in lower cost and faster scoring turnaround.

The HR Avatar Essay Test consists of several writing prompts. The prompts were designed to be general enough that anyone can write a short essay in response. It is easy to add prompts for specific situations or for general use.

Applicants are asked to write a short essay of at least 100 words and are given unlimited time to do so. The essays are scored using Discern, an open-source machine-learning program. Discern was designed by edX, a nonprofit organization founded by Harvard and the Massachusetts Institute of Technology (MIT) (edX, 2015; Markoff, 2013). The system produces a score that ranges from 0 to 100. A confidence estimate for the score, ranging from 0 to 1, is also computed. Scores with confidence estimates below 0.10 are not considered valid.
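The length and validity rules above can be sketched as two small checks. This is only an illustration: the function and constant names are hypothetical and are not part of Discern or the HR Avatar code, and words are assumed to be separated by whitespace.

```python
MIN_WORDS = 100          # minimum essay length stated above
MIN_CONFIDENCE = 0.10    # scores with lower confidence are not considered valid


def meets_minimum_length(essay: str) -> bool:
    """Return True if the essay has at least MIN_WORDS whitespace-separated words."""
    return len(essay.split()) >= MIN_WORDS


def is_valid_score(score: float, confidence: float) -> bool:
    """Return True if both values are in range and the confidence is high enough."""
    in_range = 0 <= score <= 100 and 0 <= confidence <= 1
    return in_range and confidence >= MIN_CONFIDENCE
```

For example, a score of 72 with a confidence of 0.45 would be accepted, while the same score with a confidence of 0.05 would be discarded.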

How it works

HR Avatar uses open-source essay scoring software originally published by edX, the nonprofit organization founded by Harvard and the Massachusetts Institute of Technology (MIT).

The software performs regression to produce a score for each submitted essay along a continuous scale. This is different from classification, in which the software would simply attempt to categorize each essay into one or more 'groups' or to rank the essays relative to one another.

Each essay is written according to a predetermined set of instructions typically referred to as the "Prompt." A typical prompt might be: "In a short essay of 100-400 words, explain whether it's better to be a planner or to be a dreamer."

All essays are scored by the machine learning algorithm based on a "Training Set" upon which a regression model has been built. The algorithm essentially analyzes all of the training essays and produces a best guess at how the human scorers who created the training set would have judged the new essay.

The application is written in Python and uses several open-source machine-learning tools. It is centered on a machine-learning library called scikit-learn (http://scikit-learn.org), which in turn uses a number of other open-source mathematical and data manipulation packages.

In order to perform its task, the application converts each essay into a number of different "features." Features are measurable aspects of the essay, such as spelling errors per character or grammar errors per character. In concept, the essay is reduced to an N-dimensional vector containing all of the essay's feature scores. However, some features are complex vectors in and of themselves.
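A minimal sketch of this idea, reducing an essay to a short list of numbers, might look like the following. The tiny word list and the choice of three features are assumptions made for illustration; the actual application computes its own, richer feature set.

```python
import re

# Hypothetical mini dictionary; a real system would use a full word list.
KNOWN_WORDS = {"the", "plan", "is", "good", "a", "dream", "and"}


def spelling_errors_per_character(essay, known_words=KNOWN_WORDS):
    """Count words missing from the dictionary, normalized by essay length."""
    if not essay:
        return 0.0
    words = re.findall(r"[a-z']+", essay.lower())
    errors = sum(1 for w in words if w not in known_words)
    return errors / len(essay)


def feature_vector(essay):
    """Reduce an essay to a small vector of numeric features."""
    return [
        float(len(essay)),                      # length in characters
        float(len(essay.split())),              # length in words
        spelling_errors_per_character(essay),   # spelling errors per character
    ]
```

Dividing the error count by the number of characters keeps long essays from being penalized simply for containing more words.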

Each feature is measured using specialized software for text analysis. For example, grammar errors are identified by looking for good and bad n-grams (short sequences of consecutive words), which essentially serve as models of either good or bad grammar.
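To make the n-gram idea concrete, the sketch below splits a text into word pairs (bigrams) and reports the fraction that do not appear in a reference set of "good" bigrams. The reference set here is a hypothetical three-entry example, and the scoring rule is an assumption for illustration, not the grammar model the application actually uses.

```python
def word_ngrams(text, n=2):
    """Return the list of consecutive n-word tuples in the text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Hypothetical set of bigrams observed in well-formed text.
GOOD_BIGRAMS = {("she", "writes"), ("writes", "well"), ("he", "writes")}


def unseen_ngram_rate(text, good_ngrams=GOOD_BIGRAMS, n=2):
    """Fraction of the text's n-grams that are absent from the reference set."""
    grams = word_ngrams(text, n)
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if g not in good_ngrams)
    return unseen / len(grams)
```

A higher rate suggests more unusual word sequences, which can be a rough signal of grammatical problems.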

Another important feature known as a "bag of words" is also generated. The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. This results in a vector whose length equals the number of unique words in the vocabulary built from the essays evaluated.
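The bag-of-words representation can be sketched in a few lines of plain Python. The whitespace tokenizer and the two sample essays are assumptions chosen to keep the example short; libraries such as scikit-learn provide more complete vectorizers.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_vocabulary(documents):
    """Sorted list of every distinct word across all documents."""
    return sorted({word for doc in documents for word in tokenize(doc)})


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs; word order is discarded."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


essays = ["plan the work", "work the plan and dream"]
vocab = build_vocabulary(essays)
vectors = [bag_of_words(essay, vocab) for essay in essays]
```

Because only counts are kept, "plan the work" and "work the plan" produce exactly the same vector.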

It's helpful to understand the Bag of Words approach in terms of how it's used to filter out junk email.

In Bayesian spam filtering, used by many spam filters, an e-mail message is modeled as an unordered collection of words selected from one of two probability distributions: one representing spam and one representing legitimate e-mail ("ham"). Imagine that there are two literal bags full of words. One bag is filled with words found in spam messages, and the other bag is filled with words found in legitimate e-mail. While any given word is likely to be found somewhere in both bags, the "spam" bag will contain spam-related words such as "stock", "Viagra", and "buy" much more frequently, while the "ham" bag will contain more words related to the user's friends or workplace.

To classify an e-mail message, the Bayesian spam filter assumes that the message is a pile of words that has been poured out randomly from one of the two bags, and uses Bayesian probability to determine which bag it is more likely to be.
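The two-bag picture can be turned into a toy classifier. The word lists below are invented for illustration, and the add-one smoothing and equal prior are assumptions chosen so that unseen words do not produce a probability of zero; a production spam filter would be trained on far more data.

```python
import math
from collections import Counter

# Tiny hypothetical "bags" of words for illustration only.
SPAM_WORDS = "buy stock now buy viagra stock".split()
HAM_WORDS = "meeting report lunch friend report now".split()


def log_likelihood(message_words, bag, vocab_size):
    """Log-probability of drawing the message from a bag, with add-one smoothing."""
    counts = Counter(bag)
    total = len(bag)
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in message_words
    )


def classify(message, prior_spam=0.5):
    """Return 'spam' or 'ham', whichever bag more probably produced the message."""
    words = message.lower().split()
    vocab_size = len(set(SPAM_WORDS) | set(HAM_WORDS))
    spam_score = math.log(prior_spam) + log_likelihood(words, SPAM_WORDS, vocab_size)
    ham_score = math.log(1 - prior_spam) + log_likelihood(words, HAM_WORDS, vocab_size)
    return "spam" if spam_score > ham_score else "ham"
```

Working in log-probabilities avoids numerical underflow when many small word probabilities are multiplied together.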

Along with the bag of words feature, another feature is generated that represents how 'topical' the essay is, by using the bag of words vector that was generated.

Once the features are generated, the application formulates a model, using all training essays, and their accompanying human-generated scores. The model is essentially a catalog of all feature measurements for all of the training essays, along with their scores. Once created, this model can then be used to determine where in the score space a new, unscored essay lies, based on its feature measurements. In addition to score values, error values, which indicate how consistent the training essay set was, can be calculated. This can provide a confidence value for the final score. 

The software uses a technique called Gradient Boosting Regression to pinpoint the score within the model for a given essay by comparing the features for the new essay against the feature sets of the pre-scored or 'training' essays. This is a well-established machine learning technique that generally performs well on regression problems like essay scoring.
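The core idea of gradient boosting for regression can be shown with a from-scratch sketch: start from the average score, then repeatedly fit a very simple model (here a one-split "stump") to the remaining errors and add a scaled-down copy of it to the prediction. This is not the application's code, which relies on scikit-learn; the training data, function names, and parameter values below are hypothetical.

```python
def fit_stump(X, residuals):
    """Find the single feature/threshold split that best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = sum((r - left_mean) ** 2 for r in left) + sum(
                (r - right_mean) ** 2 for r in right
            )
            if best is None or error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    return best[1:] if best else None


def fit_gradient_boosting(X, y, n_rounds=50, learning_rate=0.1):
    """Fit an additive model: a constant plus a sequence of shrunken stumps."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [target - pred for target, pred in zip(y, predictions)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        stumps.append(stump)
        j, threshold, left_mean, right_mean = stump
        predictions = [
            pred + learning_rate * (left_mean if row[j] <= threshold else right_mean)
            for row, pred in zip(X, predictions)
        ]
    return base, stumps, learning_rate


def predict(model, row):
    """Score one feature vector by summing the base value and all stump outputs."""
    base, stumps, learning_rate = model
    total = base
    for j, threshold, left_mean, right_mean in stumps:
        total += learning_rate * (left_mean if row[j] <= threshold else right_mean)
    return total


# Hypothetical training data: [word count, spelling errors per character] -> human score.
X_train = [[120, 0.020], [150, 0.015], [200, 0.010], [260, 0.005], [320, 0.002]]
y_train = [35.0, 45.0, 60.0, 75.0, 90.0]
model = fit_gradient_boosting(X_train, y_train, n_rounds=200, learning_rate=0.1)
```

Calling `predict(model, [240, 0.006])` would then return an estimated score for a new essay with those feature values. The small learning rate means each stump makes only a modest correction, which is what lets many weak models combine into a stable predictor.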

How does it perform?

The best indication of how well the machine learning algorithm works is how well it predicts the score a human rater would have given any particular essay. To measure this, we can evaluate the machine-human rater reliability.


Reliability is a critical aspect of any assessment. It describes whether the score is consistent, and puts an upper limit on the validity of the assessment. The data were analyzed to ascertain the reliability of the machine scores of the essays to represent the scoring of human essay raters.

A total of 1,214 essays (N = 1,214) were scored using both human scoring and machine scoring. The correlation between the two sets of ratings was 0.73, which serves as the machine-human inter-rater reliability and indicates that the machine scoring reliably rates the essays similarly to human raters. In the world of testing, a reliability value of 0.73 is generally considered more than acceptable.
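For readers who want to see how such a correlation is computed, here is a minimal sketch of the Pearson correlation coefficient, assuming that is the statistic reported. The five pairs of scores are hypothetical and are not data from the study.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical scores for five essays (not data from the study).
human_scores = [40, 55, 60, 75, 90]
machine_scores = [45, 50, 70, 70, 85]
r = pearson_correlation(human_scores, machine_scores)
```

A value of 1 means the two sets of scores rise and fall together perfectly, 0 means no linear relationship, and -1 means they move in opposite directions.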

Therefore, the machine scoring of the HR Avatar essay test was demonstrated to be a reliable method for scoring essays that is similar to human ratings, but significantly more efficient, requiring little or no human time or effort to arrive at an assessment of a large number of applicants’ writing skills.