NEPALI PARSER – CHUNKER

1.1 Project Description

Parser is the part of Natural Language Processing which deals with finding relationship between the phrases present in a sentence and to represent them in a parse tree.

Chunking is the part of the Natural Language Processing in which the relationship between the words or phrases present in a sentence is determined and is grouped together.  The Chunker in this context takes some tagged input sentence and then performs chunking or finding the relation between words or between phrases and represents them in parenthesized notation.  The system operates on the basis of certain rules.  The input sentence should be given in Unicode value of the word following a slash and then the tag of the word in capital English letters.  The need of Unicode was felt because it brings uniformity in representing the words  rather than writing them in different Nepali fonts.  Though using Unicode is not a compulsion as the tags are important rather than the words themselves   Two consecutive words can have only a single spacing between them.  It then finds the phrases from the input words according to those rules.  In simple, the generated phrases are called as chunks.

The Chunker being developed is domain specific and chunks simple sentences like those we can find in simple school books for primary level.  Complex sentences are not considered as the context of this project.  To have a broad domain coverage well analyzed and complete set of rules need to be provided.

Such a system is useful for many other systems like Machine Translator, Grammar Checking, information retrieval etc.

1.2 Chunker

In linguistics, parsing is  the process of dividing sentence into phrases in order to find the relationships between words and phrases. Basically, a parser is a combination of a Chunker and a Parser Tree Generator. If both the job of chunking and parser tree generation is done, then that type of parsing is called as full parsing. If only the work of chunking is done, it is called as shallow parsing.

Chunker is a natural language processing technique that attempts to provide some machine understanding of the structure of a sentence, but without parsing it fully into a parsed tree form. Chunking is also called partial parsing or shallow parsing. The output of a chunker is a division of the text’s sentence into series of words that together consist a grammatical unit. It is mostly either noun, verb, or preposition phrase, with less-frequent occurrences of adverb, adjective, clause, and other phrases. The output is different from that of a fully parsed tree because it consists of series of words that do not overlap and that do not contain each other. This makes chunking an easier Natural Language Processing task than full parsing.

Fig: 1.1 Chunker As Blackbox

1.3 History and Background

The idea of using Chunkers in a parser is not new. Ejerhed and Church in 1983 [1] described a grammar for Swedish which includes noun phrase chunk  rules. Abney in 1991 built a chunk parser which first finds base chunks and  then attaches them with a separate attachment process. Brants in 1999 [2] used  a cascade of Markov model Chunkers for obtaining parsing results for  the German NEGRA corpus. Dipanjan Das and Monojit Chaudhary [3] has worked on chunking in Indian Language. Similary, much more have been done in several Indian languages. However no formal work has been done and  publicized in Nepali language so far. Various machine learning methods can also be used for the basic task of  determining the most appropriate chunk tag sequences for a text. For  instance, the memory based learning algorithm IB1-IG which was part of TiMBL package (Daelemans, Zavrel, van der Sloot and van den Bosch in 1999). In  memory based learning the training data is stored and  a new item is  classified by the most frequent classification among training items which are closest to this new item.

Nepali Language is rich in vocabulary and has many exceptions to be dealt and covered in rules than other languages. For instance it has many words that refer the second person like:त, तिमि, तपाइ,हजुर etc which are not found in languages like English. These sorts of thing greatly affect the core work of Morphological Analyzer and in turn keep some constraints in the rule prepared to do other Natural Language Processing tasks.  It is also difficult to adapt the works of other languages for the student’s work as there is always scarcity of required resources like language expert and translators.

It is also difficult to find usable resources and modules that have been tested after good analysis because the field of Natural Language Processing is still young in Nepali Language and is yet to be done a lot to achieve comparable results with other languages.

1.5 Scope And Domain

The Natural Language Processing Chunker can be used in many fields. Some of them are listed below.

l  Chunking can be useful for information retrieval, information extraction and question answering since a complete chunk (Noun Phrase, Verb Phrase, Postpositions Phrase) is likely to be semantically relevant for the requested information.

l  Chunker is useful in trimming down the search space, since each chunk must be all under a single parsed tree node and exclusively so.

l  Chunking can be used to check the language grammar effectively.

l  Chunking can be used for Machine Translation for better result since chunking can be used in language grammar checking. If the language grammar is better, the machine translation is always better.

The applications of NLP Chunker are not upto this, but these are some of the common applications.

1.5 Input And Output

The input to the chunker is one complete sentence in standard tagged form which will be read from a file. There should be one space between each word/tag combination.

Eg:  त्यो/PFS वदमास/ADP केटो/NN हो/VC

The output from the Chunker after processing will be as follows in the parenthesized notation into the new file.

Eg: ( ( त्यो/PFS ( वदमास/ADP ) AP केटो/NN ) NP हो/VC ) VP

1.6 Constraints

There is one constraint for the input of the system. The input provided to the system should be syntactically correct. The semantic analysis is not done and only syntax is what it matters. The input sentence should have a word value in Unicode format followed by a slash (“/”) and then the corresponding tag in English notation.  Two consecutive words can have a single spacing between them.

The system can function properly only if the sentence is syntactically correct and the available rules apply to the sentence.  If the set of POS rules concerned are insufficient to cover the sentence then the output cannot be expected to be properly analyzed.  Obtaining well organized and tested rules is a task that requires expert knowledge in the related field and  is not the work of average mind that can be done in an eye shot.

Natural language processing (NLP) is a subfield of artificial intelligence and computational linguistics. It studies the problems of automated generation and understanding of natural human languages. Natural language generation systems convert information from computer databases into normal-sounding human language, and natural language understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.

2.2 Steps in Natural Language Processing

2.2.1 Morphological Analysis

Morphological analysis is the first step of NLP. Here the individual words are analyzed into their components and non-word tokens, suck as punctuation, are separated from the words.  The main task of concern is on finding the root word in a given word and to find the present affixes. The objects in question can be a physical system (e.g. anatomy), a social system (e.g. an organisation) or a logical system (e.g. word forms or a system of ideas). The morphological analyzer should find out each word’s part of speech along with tense, singularity or plurality, gender, aspects, modularity and so on.

2.2.2 Syntactic Analysis

Linear sequences of words are transformed into structures that show how the words relate to each other. Some word sequences may be rejected if they violate the language’s rules for how words may be combined. For example, an English syntactic analyzer would reject the sentence “Boy the go the to store.” Syntactic analysis must exploit the results of morphological analysis to build a structural description of the sentence. The goal of this process, called parsing, is to convert the flat list of words that forms the sentence into a structure that defines the units that are represented by that flat list.

2.2.3 Semantic Analysis

The structures created by the syntactic analyzer are assigned meaning. In other words, a mapping is made between the syntactic structures and objects in the task domain. Structures for which no such mapping is possible may be rejected. For example, in most universe, the sentence “Colorless green ideas sleep furiously” would be rejected as semantically anomalous.

2.2.4 Discourse Integration

The meaning of an individual sentence may depend on the sentences that precede it and may influence the meanings of the sentences that follow it. For example, the word “it” in the sentence, “John wanted it,” depends on the prior discourse context, while the word “John” may influence the meaning of later sentence (such as, “He always had.”)

2.2.5 Pragmatic Analysis

The process for using knowledge of context and other knowledge for resolving ambiguities is usually called as pragmatic analysis. But if a hearer’s commonsense knowledge and knowledge about context is uncertain, even pragmatic analysis can fail to resolve all ambiguities. For example, the sentence “Do you know what time it is?” should be interpreted as a request to be told the time.

2.4 Parser

In linguistics, parsing (more formally: syntactic analysis) is the process of analyzing a sequence of tokens to determine its grammatical structure with respect to a given formal grammar. The computer program that does this work is called as Parser. Usually, a Chunker and a Parse Tree Generator collectively form a parser. If so, the parser becomes a full parser.

Fig: 2.1 Parser Structure

A parse tree or concrete syntax tree is a tree that represents the syntactic structure of a string according to some formal grammar. In a parse tree, the interior nodes are labeled by non-terminals of the grammar, while the leaf nodes are labeled by terminals of the grammar. Parse trees may be generated for sentences in natural languages as well as during processing of computer languages, such as programming languages. Parse trees are distinct from abstract syntax trees which is also known simply as syntax trees which are a related concept in compilers.

A parse tree is made up of nodes and branches. Below is a linguistic parse tree, here representing the English sentence “John hit the ball“. This is only one possible parse tree for this sentence; different kinds of linguistic parse trees exist. The parse tree is the entire structure, starting from S and ending in each of the leaf nodes (John, hit, the, ball).

Fig: 2.2 Parser Tree Example

In a parse tree, each node is either a root node, a branch node, or a leaf node. In the example, S is a root node, NP and VP are branch nodes, while John, hit, the, and ball are all leaf nodes.

A node can also be referred to as parent node or a child node. A parent node is one that has at least one other node linked by a branch under it. In the example, S is a parent of both NP and VP. A child node is one that has at least one node directly above it to which it is linked by a branch of the tree. Again from our example, hit is a child node of V.

2.4 Rule Based Parsing

One way of providing artificial intelligence to the system is by providing each and every rule to the system. The system is not allowed to look anywhere beside the provided rules. This method is limited to the rules provided and manual rule addition should be done. However, this method of parsing is easier and result can be seen quickly about the system’s performance.

2.5 Machine Learning Based Parsing

As a broad subfield of artificial intelligence, machine learning is concerned with the design and development of algorithms and techniques that allow computers to “learn”. At a general level, there are two types of learning: inductive, and deductive. Inductive machine learning methods extract rules and patterns out of massive data sets. The major focus of machine learning research is to extract information from data automatically, by computational and statistical methods.

In linguistic parsing, a huge set of data are provided to the system, from which the system extracts the rules and learns about it. This is called as training session of the system. These rules are unit in the rule set and later extracted to check for the parsing. Parsing by machine learning is a difficult and complex job rather than rule based. However, the result can be better.

2.6 Some Conceptual Definition

2.6.1 Grammar

Grammar is the study of the rules governing the use of a given natural language, and, as such, is a field of linguistics. Traditionally, grammar included morphology and syntax; but in modern linguistics these subfields are complemented by phonetics, phonology, orthography, semantics, and pragmatics.

The same term is also applied to any set of such rules; thus, each language can be said to have its own distinct grammar. Thus “English grammar” (uncountable) refers to the rules of the English language itself, while “an English grammar” (countable) refers to a specific study or analysis of these rules. A fully explicit grammar exhaustively describing the grammatical constructions of a language is called a prescriptive grammar, or, in theoretical linguistics, a generative grammar. Specific types of grammars, or approaches to constructing them, are known as grammatical frameworks. The standard framework of generative grammar is the transformational grammar model developed by Noam Chomsky in the 1950s to 1980s.

2.6.2 Context-free grammar

In linguistics and computer science, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form V → w  where V is a nonterminal symbol and w is a string consisting of terminals and/or non-terminals. The term “context-free” expresses the fact that the non-terminal V can always be replaced by w, regardless of the context in which it occurs. A formal language is context-free if there is a context-free grammar that generates it.

2.6.3 Context Sensitive Grammar

A context-sensitive grammar (CSG) is a formal grammar in which the left-hand sides and right-hand sides of any production rules may be surrounded by a context of terminal and nonterminal symbols. Context-sensitive grammars are more general than context-free grammars but still orderly enough to be parsed by a linear bounded automaton.

2.6.4 Production Rule System

A rule which define the composition of a nonterminal symbol from symbols of the grammar is the production rule.

A production rule system consists of :

l  a set of rules

l  working memory that stores temporary data

l  a forward chaining inference engine

These are also called as condition-action rules.
These components of a rule-based system have the form:

if then

or

if then

Example:
if (feather is present)
and (have lungs)
then (the animal is vertebrate)

Leave a Reply