Machine Learning vs. Rule-Based Systems in NLP

One of the most exciting applications of NLP technology is enabling non-technical users to interact with large databases using natural language and to extract the information they need from the ocean of digital data almost instantly.

So how do we build a system that translates natural language queries into a query language that a database understands? There are two basic approaches to query analysis: rule-based systems and machine learning algorithms. Let’s look at each in a little more detail.

Rule-based Grammar

Grammar engineering means building a hand-crafted system of rules, based on linguistic structures, that imitates the way humans build grammatical structures.

The grammar-based approach traditionally implies that a human is involved in developing and improving the system step by step. The biggest advantage of a formal grammar is that there is always a way to check whether the system can process a user's query and how it does so. And since all the rules are written by people, any reported bug is easy to locate and fix by adjusting the rules in the related module.

Grammar rules can be developed in a very flexible manner, for example by extending the translation rules and the synonym base, and can easily be updated with new functions and data types without significant changes to the core system. Because this approach to query analysis relies on developing and extending existing rules, the system doesn't require a massive training corpus, unlike the machine learning-based approach.
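To make the idea of extending a synonym base and translation rules more concrete, here is a minimal sketch in Python. It assumes a toy schema with customers and orders tables; the names SYNONYMS, RULES, normalize and translate are hypothetical and invented for illustration, and the regular-expression patterns are only a stand-in for a real formal grammar.

```python
import re

# Hypothetical synonym base: maps user vocabulary onto schema terms.
SYNONYMS = {
    "clients": "customers",
    "buyers": "customers",
    "purchases": "orders",
}

# Hypothetical translation rules: (pattern, SQL template) pairs, tried in order.
RULES = [
    (re.compile(r"^how many (\w+)$"), "SELECT COUNT(*) FROM {0}"),
    (re.compile(r"^show all (\w+)$"), "SELECT * FROM {0}"),
]


def normalize(query):
    """Lowercase the query, drop edge punctuation, and apply the synonym base."""
    words = query.lower().strip(" ?.!").split()
    return " ".join(SYNONYMS.get(word, word) for word in words)


def translate(query):
    """Return SQL for the first matching rule, or None if no rule applies."""
    text = normalize(query)
    for pattern, template in RULES:
        match = pattern.match(text)
        if match:
            return template.format(*match.groups())
    return None


print(translate("How many clients?"))    # SELECT COUNT(*) FROM customers
print(translate("Show all purchases"))   # SELECT * FROM orders
print(translate("List every invoice"))   # None: no rule covers this query yet
```

In this sketch, supporting a new word means adding one entry to SYNONYMS, and supporting a new query shape means appending one pair to RULES; the translate function itself stays unchanged. A query that no rule covers returns None, which is the kind of explicit, checkable failure the rule-based approach makes possible.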

The most obvious disadvantage of the rule-based approach is that it requires skilled experts: it takes a linguist or a knowledge engineer to manually encode each rule. Rules need to be manually crafted and enhanced all the time. Moreover, the system can become so complex that some rules start contradicting each other.

Overall, a rule-based system is good at capturing a specific language phenomenon: it decodes the linguistic relationships between words to interpret the sentence. It can therefore handle sentence-level tasks, such as parsing and extraction, very well. That is why rule-based approaches are in general a better fit for query analysis.

Machine-learning Algorithms

Machine Learning (ML) is also widely used in NLP. This approach is based on algorithms that learn to “understand” language without being explicitly programmed. This is possible through the use of statistical methods: the system analyzes a training set (an annotated corpus) to build its own knowledge and produce its own rules and classifiers.

Because the machine-learning approach is based on probabilistic results, it offers significantly fewer formal guarantees. Like any other complicated process that a human cannot fully observe, it suffers from the butterfly effect: even a small amount of new training data can significantly modify the model, and the new ‘improved’ version of the model may behave unpredictably, even to its author.

The obvious advantage of machine learning lies in its “learnability”: no manual rule or grammar coding, which requires highly skilled experts, is needed, and the corpus can be annotated by a less skilled workforce. Machine learning is good at tasks such as document classification or word clustering from a corpus, because in both cases there are a lot of data points (e.g. keywords), which makes it easy for the machine to learn the statistical clues of the words for a given task.
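As an illustration of learning such statistical clues from an annotated corpus, here is a minimal sketch of a word-count (naive Bayes) text classifier written in plain Python. The six-sentence corpus, the two labels and the function names are invented for this example; a real system would need far more data and would typically use an established library.

```python
import math
from collections import Counter, defaultdict

# Hypothetical annotated corpus: (text, label) pairs invented for illustration.
CORPUS = [
    ("how many orders were shipped last week", "query"),
    ("show total revenue by region", "query"),
    ("list customers with unpaid invoices", "query"),
    ("thanks for the quick reply", "chat"),
    ("good morning everyone", "chat"),
    ("see you at the meeting tomorrow", "chat"),
]


def train(corpus):
    """Count how often each word occurs under each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in corpus:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, doc_count in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(doc_count / total_docs)
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero out a label.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(CORPUS)
print(classify("show orders by region", model))  # query
print(classify("good morning", model))           # chat
```

Nobody writes a rule here: the classifier's behavior comes entirely from the labelled examples, which is both the strength described above and the reason its quality depends so heavily on the size and coverage of the corpus.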

In general, machine learning can significantly speed up the development of certain NLP capabilities when good training data sets are available. In practice, however, such data is often hard to come by.

When it comes to building an NLP system for query analysis, the main problem with the ML approach is the lack of training data. To learn how to translate plain English into SQL, the system needs plenty of parsed queries, preferably all from one domain (such as a ‘transportation enquiry system’). But what if you have never had an interface where users could write their queries in plain English? Where will you get them from? One option is to brainstorm and create them manually, which can be very time-consuming, because the dataset has to be big enough for the trained ML-based system to recognize queries and return the needed information with high accuracy.

Moreover, once created and labelled, the corpus often can't be reused on new data schemas, so the data has to be prepared anew each time.

The table below sums up all the strengths and weaknesses of both approaches:

Hybrid Approach

As explained above, both approaches have their limitations. Combining rule-based and machine learning approaches into a hybrid system, where one complements the other, seems to be a good solution. How does FriendlyData apply the hybrid approach?

The company’s main goal is to help the database ‘understand’ the query placed by a human in natural language and translate it into a language that is familiar to the database. To address this challenge, we use a formal grammar.

At the same time, the ever-growing brevity and vagueness of queries make query analysis very difficult to implement. People use poor grammar, colloquialisms, misspellings and abbreviations, which makes it extremely difficult for computers to analyze natural language. We have therefore realized the necessity of adding some machine-learning algorithms to the existing grammar-based approach.

We use a grammar-based parser for text-to-SQL translation, and ML to complement our rule-based grammar by fixing the syntax, correcting typos, and so on. For example, most of the input queries that the grammar-based parser does not catch can be easily fixed with the help of a smart spellchecker and suggestions.
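The general shape of such a pipeline can be sketched as follows. This is not FriendlyData's implementation: the toy parser, the vocabulary and all function names are hypothetical, and fuzzy string matching from Python's standard difflib module stands in for the ML-based spellchecker, whose details the article does not describe.

```python
import difflib
import re

# Hypothetical vocabulary of words the toy grammar knows about.
VOCABULARY = ["show", "all", "how", "many", "customers", "orders"]


def toy_parser(query):
    """Stand-in for a grammar-based parser: returns SQL, or None on failure."""
    match = re.fullmatch(r"how many (customers|orders)", query)
    if match:
        return f"SELECT COUNT(*) FROM {match.group(1)}"
    match = re.fullmatch(r"show all (customers|orders)", query)
    if match:
        return f"SELECT * FROM {match.group(1)}"
    return None


def correct(query):
    """Replace each word with its closest vocabulary entry, if one is close enough."""
    fixed = []
    for word in query.lower().split():
        candidates = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.75)
        fixed.append(candidates[0] if candidates else word)
    return " ".join(fixed)


def hybrid_translate(query):
    """Try the parser first; on failure, correct the spelling and retry once."""
    text = query.lower().strip()
    sql = toy_parser(text)
    if sql is None:
        sql = toy_parser(correct(text))
    return sql


print(hybrid_translate("how many customers"))  # SELECT COUNT(*) FROM customers
print(hybrid_translate("how mny custmers"))    # SELECT COUNT(*) FROM customers
print(hybrid_translate("shw all ordrs"))       # SELECT * FROM orders
print(hybrid_translate("delete everything"))   # None
```

The design choice worth noting is the order of operations: the grammar runs first, and the correction step is applied only to queries the grammar rejects, so well-formed input is never altered by the statistical component.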

By applying such a hybrid approach, we have improved the share of successfully processed queries by 11 percentage points. When only the rule-based parser was used, 80% of the queries were successfully ‘understood’ by the system. With the spellchecker and suggestions, which are based on machine learning algorithms, that number increased to 91%. The results show that this hybrid approach achieves accuracy comparable to that of top-ranked methods, with the added value that it does not require special expertise.
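The two figures quoted above (80% and 91%) can be read in several ways, and the short calculation below spells them out. It uses only those two numbers; the function and variable names are chosen for this example.

```python
# Figures quoted in the article, in percent of queries understood.
RULE_ONLY = 80
HYBRID = 91


def accuracy_gains(before, after):
    """Return absolute gain, relative gain, and share of failures recovered."""
    absolute = after - before                    # percentage points
    relative = 100 * absolute / before           # percent of the old accuracy
    recovered = 100 * absolute / (100 - before)  # percent of old failures fixed
    return absolute, relative, recovered


absolute, relative, recovered = accuracy_gains(RULE_ONLY, HYBRID)
print(f"absolute gain: {absolute} percentage points")               # 11
print(f"relative gain: {relative:.2f}%")                            # 13.75%
print(f"previously failed queries now understood: {recovered:.0f}%")  # 55%
```

On these numbers, the gain is 11 percentage points in absolute terms and 13.75% relative to the rule-only baseline, and more than half (55%) of the queries that the rule-based parser failed on are now understood.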

For comparison, consider NLP systems that translate natural language questions into the corresponding SQL using ML alone. Seq2SQL, from recent Salesforce Research work, reports a test execution accuracy of 59.4%. Moreover, such a system offers no direct way to improve processing quality, whereas FriendlyData can do so by extending the rule base. This allows us to assume that the hybrid approach to query analysis described above can deliver more accurate results than either approach on its own, whether ML or hand-engineered grammars.