July 8th, 2017
(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: email@example.com
I’ve written about my experience at a startup that used NLP to translate a salesperson’s text message into an entry in Salesforce. At the time, our NLP developer tried to build a model with the Stanford NLP library, but mostly they relied on a lot of regexes and string matching.
I’ve been thinking about how we might have done that project faster and better. I’ve recently been thinking the right approach would have been Random Forests. We were lucky that our situation had relatively low dimensionality for an NLP problem. We only had to worry about the words in Salesforce, some of their synonyms, and common misspellings. So for “Account” we had to worry about “acc” and “Acccount” and the names of companies that were already in the Salesforce database: Hilton Hotels, Sheraton, Marriott, etc.
We could have built a neural net. We could have used Support Vector Machines. But the easiest and fastest approach would have been Random Forests. I could have programmed that myself, without help from an NLP expert. At least for our initial prototype.
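To make this concrete, here is a minimal sketch of what I have in mind, assuming scikit-learn. The training fragments, labels, and the `predict_field` helper are all hypothetical, invented for illustration; a real system would train on actual logged messages. Character n-grams are used so that misspellings like “Acccount” still land near the right field.

```python
# A minimal sketch (assuming scikit-learn) of mapping a salesperson's
# message fragment to a Salesforce field with a Random Forest.
# Training data, labels, and helper names are hypothetical.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical labeled fragments: raw token -> the Salesforce field it refers to.
training_texts = [
    "account", "acc", "Acccount", "Hilton Hotels", "Sheraton", "Marriott",
    "contact", "cntct", "point of contact",
    "opportunity", "opp", "oppty",
]
labels = [
    "Account", "Account", "Account", "Account", "Account", "Account",
    "Contact", "Contact", "Contact",
    "Opportunity", "Opportunity", "Opportunity",
]

# Character n-grams (within word boundaries) tolerate misspellings
# like "Acccount" far better than whole-word features do.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(training_texts)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)

def predict_field(text):
    """Map a message fragment to its most likely Salesforce field."""
    return clf.predict(vectorizer.transform([text]))[0]
```

With a handful of fields and a modest vocabulary, something this simple could have served as the first prototype, and it requires no specialist NLP knowledge to stand up.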
I think any such company eventually moves to more complicated methods, but we were under tremendous time pressure, and so I think Random Forests might have offered the best trade-off between results and time invested.
Of all compared classification results, ANN achieved the highest median overall classification accuracy (77%) followed by SVM with 68% and RF with 62% (Figure 2). Similarly for the kappa coefficient, ANN had the highest median kappa, at 0.72, while SVM and RF had 0.61 and 0.52 median kappa, respectively. By changing the training dataset, we were able to provide minimum and maximum accuracies for each classification algorithm, and thus can assess each classifier’s sensitivity to the data used to train it. The lowest variance of overall accuracy and kappa coefficient was achieved by RF and SVM (12 percentage points) while ANN had 15 percentage points. Hence, RF and SVM classifiers are less sensitive to changing datasets than ANN, with a difference of 3 percentage points between them.