Coling 2000

Tutorial : Trends in Robust Parsing

Guidelines of the Practical

Jacques Vergne

GREYC
University of Caen

https://lucasn01.users.greyc.fr/JacquesVergne/

Outline of the practical

Method of this practical

Chunking English :

Executing a very simple chunker for English
Let's have a look at the sequence of tasks
Let's have a look at the rules
Modifying rules

Changing natural language :

Making a chunker for French
Making a chunker for another language

Linking 2 chunks : the subject noun chunk to the verb chunk

Changing scale of the computed unit : clause bracketing

The genericity of the GREYC engine

Reminder of the aim of the practical

The aim of the practical is to give you a hands-on introduction to the field, and to give the course a concrete basis.
The practical gives participants the opportunity to practice on the "GREYC parser", which is a general platform to design and build parsers.
It is described (in French) on : https://lucasn01.users.greyc.fr/JacquesVergne/analyseur_GREYC/analyseur_du_GREYC.html

The practical consists in executing and modifying a very simple chunker :

executing the chunker on English corpora brought by participants,
writing rules for chunking French or another language, and testing these rules on corpora,
improving the results by modifying these rules.

Warning : you are using the "black box" version of the engine of the "GREYC parser".
The "GREYC parser" is protected by a confidentiality agreement. You are not allowed to copy it.
If you want to use it in your laboratory, you have to sign an agreement with the GREYC.

Method of this practical

You will work in pairs : one pair per computer.

With the present guidelines, you will follow a path at your own speed.

You should be able to work independently, but ask if you need help.

Take your time to understand every step and to master what you are doing.

You are encouraged to run your own trials.

You do not have to reach the end of these guidelines, but it is better to study the steps in the order given.

Executing a very simple chunker for English

Go into the "TutorialColingRobustParsing" directory at the root of C:

File names :

input file of the chunker : to_be_chunked.txt
        contains the text to be chunked
data file : input_file_names.txt
        contains the names of the files to be chunked (.txt or .html files), one name per line
sequence of tasks file : chunking_en.ini
        defines the tasks to do for every detected language
        defines the constituent hierarchy : W for Word, CHK for CHunK
rule file : chunking_rules.en
batch file to execute the chunker and colour its output : chunking_en.bat
output file of the chunker : to_be_chunked.txt.xmlt
output file after colouring : to_be_chunked.txt.xmlt.html

To execute the chunker :

1) type or paste the text to be chunked in the input file of the chunker : to_be_chunked.txt
2) double click on chunking_en.bat to execute the chunker and to colour its output
3) look at the output file of the chunker : to_be_chunked.txt.xmlt in the notepad
        (it is a tabulated XML format, to make it more readable)
4) open the output file after colouring : to_be_chunked.txt.xmlt.html in Netscape
        (it's done by the perl script colouring.pl)
        (have a look at the colour scheme in coloursCHK.html)

Try any text you want by typing it or pasting it in file to_be_chunked.txt.
You may also use an HTML file as input, just by typing its name in input_file_names.txt.
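
The naming convention of the files above can be pictured with a few lines of Python. This is only a sketch and not part of the tutorial files : the name my_page.html is a hypothetical example, and it is an assumption that every listed input file gets the same two extensions as to_be_chunked.txt.

```python
# Illustrative only: the real engine reads input_file_names.txt itself.
# "my_page.html" is a hypothetical input name.
SAMPLE_LIST = """\
to_be_chunked.txt
my_page.html
"""

def expected_outputs(list_text):
    """Map each listed input file to its (assumed) chunker and colouring outputs."""
    outputs = {}
    for line in list_text.splitlines():
        name = line.strip()
        if not name:
            continue                      # skip empty lines
        chunked = name + ".xmlt"          # output file of the chunker
        coloured = chunked + ".html"      # output file after colouring
        outputs[name] = (chunked, coloured)
    return outputs

for name, (chunked, coloured) in expected_outputs(SAMPLE_LIST).items():
    print(name, "->", chunked, "->", coloured)
```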

Let's have a look at the sequence of tasks

Open the tasks sequence file : chunking_en.ini in the notepad.

Structure of the tasks sequence file :

It defines the tasks to do for every detected language : here English and others (default)
It also defines the constituent hierarchy : W for Word, CHK for CHunK

// task, rule files of this task, previous task, output unit

English:

splitting data/ASutil/splitting.txt null PAR SW;
// splitting splits the text into paragraphs (PAR)

tokenizing data/ASutil/tokenizing.txt splitting W;
// tokenizing tokenizes paragraphs into words (W)

chunking chunking_rules.en tokenizing CHK ;
// chunking groups words together into chunks (CHK)

>chunking XMLTab .xmlt lowerLevel W showRelations "subject_verb" ;
// >output task, format, extension, lowerlevel unit, relations output

default:
>null XMLTab .xmlt ;
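
To make the column layout of the task lines concrete, here is a minimal Python sketch that reads the listing above and checks that each task names the task before it as its previous task. The reader is hypothetical : it is not the way the GREYC engine parses the file, it only interprets the first four columns, and it ignores the output lines beginning with >.

```python
# Hypothetical reader for the task lines shown above; not part of the GREYC engine.
SAMPLE = """\
English:
splitting data/ASutil/splitting.txt null PAR SW;
tokenizing data/ASutil/tokenizing.txt splitting W;
chunking chunking_rules.en tokenizing CHK ;
>chunking XMLTab .xmlt lowerLevel W showRelations "subject_verb" ;
default:
>null XMLTab .xmlt ;
"""

def read_tasks(text):
    """Return {language: [(task, rule_file, previous_task, output_unit), ...]}."""
    tasks = {}
    language = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("//"):
            continue                      # blank line or comment
        if line.endswith(":"):
            language = line[:-1]          # section header such as "English:"
            tasks[language] = []
            continue
        if line.startswith(">"):
            continue                      # output line, ignored in this sketch
        fields = line.rstrip(";").split()
        # Only the first four columns are interpreted; any extra field
        # (such as "SW" on the splitting line) is left uninterpreted.
        tasks[language].append(tuple(fields[:4]))
    return tasks

def check_chain(task_list):
    """True if each task names the task just before it as its previous task."""
    previous = "null"
    for task, _rules, prev, _unit in task_list:
        if prev != previous:
            return False
        previous = task
    return True

tasks = read_tasks(SAMPLE)
for task, rules, prev, unit in tasks["English"]:
    print(f"{task}: reads from {prev}, delivers {unit}")
```

Read this way, the English section is a chain : splitting delivers PAR, tokenizing turns PAR into W, and chunking turns W into CHK.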

Let's have a look at the rules

Open the rule file : chunking_rules.en in the notepad.
It is intended to be as simple as possible.

Structure of the rule file :

    tagging grammatical words
        tagging noun chunk beginnings
            tagging prepositions
            tagging determiners
        tagging verb chunk beginnings
        tagging chunk separators

    generating chunks
        generating & delivering a noun chunk
        generating & delivering a verb chunk
        delivering chunk separators

Understanding rules

tagging grammatical words :

$0=[ G=:{ a an } ] => $0.add( [ CS=d n=s ] );

condition on the current word => action on the current word

Translation of the rule :
if the Written Form (G for "Graphie") of the current word ($0) is a or an ({ a an }),
then its Syntactic Category (CS) is a determiner (d) and its number is singular (s)
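
As a rough analogy, the same condition and action can be written in Python. This is a sketch only : representing a word as a dictionary of features, and matching the written form exactly (case included), are assumptions made for the illustration, not a description of the engine.

```python
# Illustrative only: a word is modelled as a dict of features.
def tag_determiner(word):
    """Mimic: $0=[ G=:{ a an } ] => $0.add( [ CS=d n=s ] );"""
    if word["G"] in {"a", "an"}:              # condition on the written form
        word.update({"CS": "d", "n": "s"})    # action: add features to the word
    return word

words = [{"G": g} for g in "a dog chased an owl".split()]
tagged = [tag_determiner(w) for w in words]
for w in tagged:
    print(w)
```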

generating chunks :

// pN = prepositional noun chunk
$0=[ CS==p ] => $chk=generate( [ CS=pN ] ) $chk.deliver() ;

condition on the current word => action on the generated chunk

Translation of the rule :
if the Syntactic Category (CS) of the current word ($0) is a preposition (p)
then generate a chunk, give it prepositional Noun chunk (pN) as Syntactic Category (CS)
and deliver it to next task (the output task)

// N = noun chunk
$0=[ CS==d ] <W.next< [ CS!=p ] => $chk=generate( [ CS=N ] ) $chk.deliver() ;

Translation of the rule :
if the Syntactic Category (CS) of the current word ($0) is a determiner (d),
    and if the previous word (<W.next<) is not (CS!=) a preposition (p)
        then generate a chunk, give it Noun chunk (N) as Syntactic Category (CS)
            and deliver it to next task (the output task)

The way rules are tested by the engine

All rules are tested once on the current word, one after the other, in the order they are written.
Words of the input file are processed one after the other.
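
The loop described above, together with the two chunk-generating rules, can be sketched in Python as follows. Several points are assumptions made for the illustration : the function-word lists are small samples, a word that starts no chunk is attached to the most recently generated chunk, and a determiner with no previous word is allowed to start a noun chunk.

```python
# Illustrative sketch of "every rule is tested once on each word, in order".
# Word lists and the way chunks collect words are assumptions.

def tag_det(i, words, chunks):
    if words[i]["G"] in {"a", "an", "the"}:
        words[i]["CS"] = "d"                  # determiner

def tag_prep(i, words, chunks):
    if words[i]["G"] in {"of", "in", "on"}:
        words[i]["CS"] = "p"                  # preposition

def gen_pN(i, words, chunks):
    if words[i].get("CS") == "p":
        chunks.append({"CS": "pN", "words": []})

def gen_N(i, words, chunks):
    prev_is_prep = i > 0 and words[i - 1].get("CS") == "p"
    if words[i].get("CS") == "d" and not prev_is_prep:
        chunks.append({"CS": "N", "words": []})

# Tagging rules come first, so that generating rules can test the tags.
RULES = [tag_det, tag_prep, gen_pN, gen_N]

def chunk(text):
    words = [{"G": g} for g in text.split()]
    chunks = []
    for i in range(len(words)):               # words, one after the other
        for rule in RULES:                    # rules, in the order written
            rule(i, words, chunks)
        if not chunks:                        # word seen before any chunk start
            chunks.append({"CS": None, "words": []})
        chunks[-1]["words"].append(words[i]["G"])
    return [(c["CS"], " ".join(c["words"])) for c in chunks]

print(chunk("the roof of the house in a village"))
```

In this sketch the determiner in "of the house" does not open a new noun chunk, because the previous word is tagged as a preposition.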

Modifying rules

You may modify a rule simply by typing your modification (another written form, for instance) and saving the file.
At the beginning of the execution, rules are parsed, and you may observe syntax error messages.
If this happens, correct the rule at the line number given in the message, and execute again.

Try any text you want by typing it or pasting it in file to_be_chunked.txt.
You may also use an HTML file as input, just by typing its name in input_file_names.txt.
And try to modify rules to get a better result.

Making a chunker for French

Duplicate the rule file and call it : chunking_rules.fr
Edit it, replacing the English written forms with French ones (do not try to be exhaustive : just put in the function words you need).
Note that the rules generating chunks remain unchanged.

Type or paste something in French in the file : to_be_chunked.txt
To execute, double click on chunking_fr.bat
Try with other French texts, and improve the results.
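
The idea that only the written forms change from one language to another can be pictured with the Python sketch below. It is an illustration under assumptions, not the content of the rule files : the function-word lists are small samples, the input is expected in lower case, and chunks are assumed to collect the words that follow their first word.

```python
# Function-word lists are small illustrative samples, not the tutorial's rule files.
LEXICON = {
    "en": {"d": {"a", "an", "the"}, "p": {"of", "in", "on"}},
    "fr": {"d": {"un", "une", "le", "la", "les"}, "p": {"de", "dans", "sur"}},
}

def chunk(text, lang):
    """Same generation logic for every language; only the lexicon differs."""
    lexicon = LEXICON[lang]
    chunks = []
    previous_cs = None
    for form in text.split():
        # Tagging: look the written form up in the language's lexicon.
        cs = next((tag for tag, forms in lexicon.items() if form in forms), None)
        if cs == "p":
            chunks.append(("pN", [form]))     # preposition opens a pN chunk
        elif cs == "d" and previous_cs != "p":
            chunks.append(("N", [form]))      # determiner opens an N chunk
        elif chunks:
            chunks[-1][1].append(form)        # otherwise join the current chunk
        else:
            chunks.append((None, [form]))     # word seen before any chunk start
        previous_cs = cs
    return [(cs, " ".join(forms)) for cs, forms in chunks]

print(chunk("the roof of the house", "en"))
print(chunk("le toit de la maison", "fr"))
```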

Making a chunker for another language

Choose another language you know from the following list : German, Spanish, Italian, Portuguese, Norwegian, Dutch (these are the detected languages, together with English and French).
Duplicate the rule file and call it : chunking_rules.xx
Edit it, replacing the written forms with written forms of this language.
Duplicate the sequence of tasks file chunking_en.ini, call the copy chunking_xx.ini, and edit it, using the differences between chunking_en.ini and chunking_fr.ini as a guide.
Duplicate the execution file chunking_en.bat, call the copy chunking_xx.bat, and edit it, using the differences between chunking_en.bat and chunking_fr.bat as a guide.

Type or paste something in this language in the file : to_be_chunked.txt

To execute, double click on chunking_xx.bat

Linking 2 chunks : the subject noun chunk to the verb chunk

Understanding the linking process

The link between 2 chunks is set in 2 successive steps :

* First step : a chunk N is put aside, waiting for a possible V chunk
* Second step : when a verb chunk arrives, it is linked to its subject chunk N

Understanding the rules

The link between 2 chunks is set in 2 successive steps, with 2 successive rules :
* First step : a chunk N is put aside, waiting for a possible V chunk

$$=[ CS==N ], $@ => relate( $$,$@,subject_verb ) ;
// $$ = the built chunk
// $@ = the virtual chunk

Translation of the rule :
if the built chunk ($$) is a noun chunk (CS==N),
then relate it to the virtual chunk ($@) with a link named subject_verb

* Second step : when a verb chunk arrives, it is linked to its subject chunk N

$$=[ CS==V ] , $@ <subject_verb< $waitingN
=> discard( $waitingN,$@,subject_verb )
     relate( $waitingN,$$,subject_verb )
     $$.add( [ fct=V ] )
     $waitingN.add( [ fct=S ] ) ;
// fct=S : the function of this chunk is subject
// fct=V : the function of this chunk is verb

Translation of the rule :
if the built chunk ($$) is a verb chunk (CS==V),
    and if the virtual chunk ($@) is linked to a chunk (named $waitingN)
            with a link named subject_verb
        then discard the link between the waiting noun chunk ($waitingN) and the virtual chunk ($@)
                relate the waiting noun chunk ($waitingN) to the built chunk ($$)
                    with a link named subject_verb
                give the built chunk ($$) the verb function (fct=V)
                give the waiting noun chunk ($waitingN) the subject function (fct=S)
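
The two steps can be mimicked in Python as follows. This is a sketch under assumptions, not the engine's mechanism : relations are kept as a plain list of triples, the string "$@" stands for the virtual chunk, and when several noun chunks are waiting the sketch arbitrarily takes the most recent one (the guidelines do not say which one the engine chooses).

```python
# Illustrative sketch of the two-step subject_verb linking.
VIRTUAL = "$@"   # stands for the virtual chunk

def link_subjects(chunks):
    """chunks: list of dicts with a 'CS' key. Returns (subject, verb) index pairs."""
    relations = []                                    # (source, target, name) triples
    for i, chunk in enumerate(chunks):
        if chunk["CS"] == "N":
            # Step 1: put the noun chunk aside by relating it to the virtual chunk.
            relations.append((i, VIRTUAL, "subject_verb"))
        elif chunk["CS"] == "V":
            waiting = [r for r in relations if r[1] == VIRTUAL]
            if waiting:
                link = waiting[-1]                    # assumption: most recent waiting N
                relations.remove(link)                # discard the link to the virtual chunk
                relations.append((link[0], i, "subject_verb"))  # relate N to the verb chunk
                chunk["fct"] = "V"                    # function of the verb chunk
                chunks[link[0]]["fct"] = "S"          # function of the subject chunk
    return [(s, t) for (s, t, _name) in relations if t != VIRTUAL]

chunks = [{"CS": "N"}, {"CS": "pN"}, {"CS": "V"}, {"CS": "N"}]
links = link_subjects(chunks)
print(links)
```

In this example the verb chunk is linked to the first noun chunk across the intervening pN chunk, and the last noun chunk stays waiting because no verb chunk follows it.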

Executing the linking process

Go back to the English chunker.

Copy the following 2 rules and paste them at the end of the rule file : chunking_rules.en

// ========== linking chunks =================
// ===========================================

// a chunk N is put aside, waiting for a possible V chunk
$$=[ CS==N ], $@ => relate( $$,$@,subject_verb ) ;
// $$ = the built chunk
// $@ = the virtual chunk

// a verb chunk is arriving and is linked to its subject chunk N
$$=[ CS==V ] , $@ <subject_verb< $waitingN
=> discard( $waitingN,$@,subject_verb ) relate( $waitingN,$$,subject_verb )
$$.add( [ fct=V ] ) $waitingN.add( [ fct=S ] ) ;
// fct=S : the function of this chunk is subject
// fct=V : the function of this chunk is verb

Then execute the "chunker-linker" and look at the outputs.
Try with a verb distant from its subject.
Try in other languages.
Try other types of links.

Changing scale of the computed unit : clause bracketing

A new task in the sequence of tasks

Look at the file clausing_en.ini.
A new task clausing is added after the task chunking :

clausing clausing_rules.txt chunking CLS ;
// clausing groups chunks together into clauses (CLS)

>clausing XMLTab .xmlt lowerLevel W showRelations "subject_verb" ;
// >output task, format, extension, lowerlevel unit, relations output

The rules file of this new task

Look at the file clausing_rules.txt.

// ------- generating & delivering a subordinated clause ------------------
$0=[ CS==P ] => $cls=generate( [ CS=SUB1 ] ) $cls.deliver() ;

Translation of the rule :
if the current unit is a subordinating conjunction (CS==P)
    - it has been delivered to clausing in chunking_rules.en, therefore it is at chunk level -
    then generate a subordinated (CS=SUB1) clause (the current unit is the beginning of a subordinated clause)
            and deliver it to the following task (here the output task)
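
The effect of this rule on a sequence of chunks can be sketched in Python. The sketch rests on assumptions : chunks seen before the first subordinating conjunction are grouped into a clause labelled MAIN (a placeholder label, not a name taken from the rule file), and a subordinated clause is taken to run until the next clause beginning or the end of the input.

```python
# Illustrative sketch of clause bracketing over chunks.
def bracket_clauses(chunks):
    """chunks: list of dicts with 'CS' and 'text'. Returns (clause CS, chunk texts) pairs."""
    clauses = [{"CS": "MAIN", "chunks": []}]          # "MAIN" is a placeholder label
    for chunk in chunks:
        if chunk["CS"] == "P":                        # subordinating conjunction
            clauses.append({"CS": "SUB1", "chunks": []})
        clauses[-1]["chunks"].append(chunk["text"])
    # Drop the initial clause if the input begins with a conjunction.
    return [(c["CS"], c["chunks"]) for c in clauses if c["chunks"]]

sample = [
    {"CS": "N", "text": "the dog"},
    {"CS": "V", "text": "barked"},
    {"CS": "P", "text": "because"},
    {"CS": "N", "text": "a cat"},
    {"CS": "V", "text": "appeared"},
]
print(bracket_clauses(sample))
```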

Executing this chunker-clauser

Type or paste something in to_be_chunked.txt (we keep this filename) with a subordinated clause.
Make sure that your subordinating conjunction is correctly tagged in chunking_rules.en.
Then double click on clausing_en.bat.
The colouring is different : main clauses in yellow, and subordinated clauses in green.

Try other sentences.

Modifying rules

You can try relative clauses : they begin with a relative pronoun, which has to be tagged in chunking_rules.en.
So you have to add a rule in chunking_rules.en to tag the relative pronoun, and a rule in clausing_rules.txt to state that a relative clause begins with a relative pronoun.

A chunker-clauser for French

The same rule file clausing_rules.txt can be used for French.
You have to make another sequence of tasks file : clausing_fr.ini, and another .bat file : clausing_fr.bat.
Put some subordinating conjunctions in chunking_rules.fr.
Type or paste something in French in to_be_chunked.txt with a subordinated clause.
Then double click on clausing_fr.bat.

You could add relative clauses in French.

The genericity of the GREYC engine

During this practical, you have explored the genericity of the GREYC engine in 2 dimensions :

the language dimension : English, French, ...

the computed unit scale dimension : chunk, clause, ...

Now, if you have reached this point and there is some time left, you may explore whichever dimension you wish.

Your assessment of the tutorial

We would like to read your assessment of the tutorial.
Could you fill in the form ? Thanks.