GREYC
University of Caen
This practical consists in running and modifying a very simple chunker.
These guidelines let you progress at your own pace.
You should be able to work independently, but ask if you need help.
Take your time to understand every step and to master what you are doing.
You are encouraged to run your own trials.
You do not have to reach the end of these guidelines, but it is better to study the steps in order.
File names :
input file of the chunker : to_be_chunked.txt
contains the text to be chunked
data file : input_file_names.txt
contains the names of the files to be chunked (.txt or .html files), one name per line
sequence of tasks file : chunking_en.ini
defines the tasks to do for every detected language
defines the constituent hierarchy : W for Word, CHK for CHunK
rule file : chunking_rules.en
batch file to execute the chunker and colour its output : chunking_en.bat
output file of the chunker : to_be_chunked.txt.xmlt
output file after colouring : to_be_chunked.txt.xmlt.html
To execute the chunker :
1) type or paste the text to be chunked in the input file of the chunker : to_be_chunked.txt
2) double click on chunking_en.bat to execute the chunker and to colour its output
3) look at the output file of the chunker : to_be_chunked.txt.xmlt in Notepad
(it is a tabulated XML format, to make it more readable)
4) open the output file after colouring : to_be_chunked.txt.xmlt.html in Netscape
(the colouring is done by the Perl script colouring.pl)
(have a look at the colour scheme in coloursCHK.html)
// task       rule files of this task        previous task   output unit
English:
splitting     data/ASutil/splitting.txt      null            PAR SW;
// splitting splits the text into paragraphs (PAR)
tokenizing    data/ASutil/tokenizing.txt     splitting       W;
// tokenizing tokenizes paragraphs into words (W)
chunking      chunking_rules.en              tokenizing      CHK ;
// chunking groups words together into chunks (CHK)
>chunking XMLTab .xmlt lowerLevel W showRelations "subject_verb" ;
// >output task, format, extension, lowerlevel unit, relations output
default:
>null XMLTab .xmlt ;
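The task sequence above can be pictured as a small pipeline in which each task consumes the output of its previous task. The Python sketch below only illustrates that idea and is not the chunker's implementation: it assumes that paragraphs are separated by blank lines and that words are separated by whitespace, and the function names are hypothetical.

```python
import re

def splitting(text):
    """Split raw text into paragraphs (PAR).
    Assumption: paragraphs are separated by blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def tokenizing(paragraphs):
    """Tokenize each paragraph into words (W).
    Assumption: words are separated by whitespace."""
    return [p.split() for p in paragraphs]

def run_pipeline(text):
    # Each task receives the output of the previous task,
    # in the same order as in the task sequence file.
    return tokenizing(splitting(text))

result = run_pipeline("A cat sleeps.\n\nThe dog barks in the garden.")
print(result)
```

The chunking task would come third in this chain and take the word lists as its input.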
generating chunks
generating & delivering a noun chunk
generating & delivering a verb chunk
delivering chunk separators
$0=[ G=:{ a an } ] => $0.add( [ CS=d n=s ] );
condition on the current word => action on the current word
Translation of the rule :
if the Written Form (G for "Graphie") of the current word ($0) is a or an ({ a an }),
then its Syntactic Category (CS) is a determiner (d) and its number is singular (s)
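To make the condition => action reading concrete, here is a minimal Python sketch of the same labelling rule. It is an illustration, not the chunker's code: words are represented as dictionaries of features, the function name is hypothetical, and matching only the exact lowercase forms is an assumption.

```python
def label_determiners(words):
    """Add CS=d and n=s to every word whose written form (G) is 'a' or 'an'."""
    for word in words:
        if word["G"] in {"a", "an"}:             # condition: G=:{ a an }
            word.update({"CS": "d", "n": "s"})   # action: $0.add( [ CS=d n=s ] )
    return words

words = [{"G": g} for g in "she saw an owl".split()]
labelled = label_determiners(words)
print(labelled)
```

Only the word "an" receives the new features; the other words are left unchanged.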
generating chunks :
// pN = prepositional noun chunk
$0=[ CS==p ] => $chk=generate( [ CS=pN ] )
$chk.deliver() ;
condition on the current word => action on the generated chunk
Translation of the rule :
if the Syntactic Category (CS) of the current word ($0) is a preposition (p)
then generate a chunk, give it prepositional Noun chunk (pN) as Syntactic Category (CS)
and deliver it to the next task (the output task)
// N = noun chunk
$0=[ CS==d ] <W.next< [ CS!=p ] => $chk=generate( [ CS=N ] )
$chk.deliver() ;
Translation of the rule :
if the Syntactic Category (CS) of the current word ($0) is a determiner (d),
and if the previous word (<W.next<) is not (CS!=) a preposition (p)
then generate a chunk, give it Noun chunk (N) as Syntactic Category (CS)
and deliver it to the next task (the output task)
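The two chunk-generating rules can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the chunker's algorithm: a chunk is assumed to start at the word that triggers a rule and to absorb the following words until another rule fires, a determiner at the very start of the text is assumed to satisfy the "previous word is not a preposition" condition, and the verb chunk rule (not shown in these guidelines) is left out.

```python
def chunk(words):
    """Group tagged words into chunks (dicts holding a CS and a list of words)."""
    chunks = []
    previous = None
    for word in words:
        cs = word.get("CS")
        if cs == "p":
            # $0=[ CS==p ] => generate a prepositional noun chunk (pN)
            chunks.append({"CS": "pN", "words": []})
        elif cs == "d" and (previous is None or previous.get("CS") != "p"):
            # $0=[ CS==d ] and previous word not a preposition => noun chunk (N)
            chunks.append({"CS": "N", "words": []})
        elif not chunks:
            # Assumption: words seen before any rule fires go into an untyped chunk
            chunks.append({"CS": None, "words": []})
        # Assumption: the current word joins the most recently generated chunk
        chunks[-1]["words"].append(word["G"])
        previous = word
    return chunks

tagged = [
    {"G": "the", "CS": "d"}, {"G": "cat"},
    {"G": "in", "CS": "p"}, {"G": "the", "CS": "d"}, {"G": "garden"},
]
chunks = chunk(tagged)
print(chunks)
```

The second "the" does not open a new N chunk because the previous word is a preposition, so it stays inside the pN chunk.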
Try any text you want by typing it or pasting it in the file to_be_chunked.txt.
You may also use an HTML file as input, just by typing its name in input_file_names.txt.
And try to modify the rules to get a better result.
Type or paste something in French in the file : to_be_chunked.txt
To execute, double-click on chunking_fr.bat
Try with other French texts, and improve the results.
Type or paste something in this language in the file : to_be_chunked.txt
To execute, double-click on chunking_xx.bat
* First step : a chunk N is put aside, waiting for a possible V chunk
* Second step : when a verb chunk arrives, it is linked to its subject chunk N
$$=[ CS==N ], $@ => relate( $$,$@,subject_verb ) ;
// $$ = the built chunk
// $@ = the virtual chunk
Translation of the rule :
if the built chunk ($$) is a noun chunk (CS==N),
then relate it to the virtual chunk ($@) with a link named subject_verb
* Second step : when a verb chunk arrives, it is linked to its subject chunk N
$$=[ CS==V ] , $@ <subject_verb< $waitingN
=> discard( $waitingN,$@,subject_verb )
relate( $waitingN,$$,subject_verb )
$$.add( [ fct=V ] )
$waitingN.add( [ fct=S ] ) ;
// fct=S : the function of this chunk is subject
// fct=V : the function of this chunk is verb
Translation of the rule :
if the built chunk ($$) is a verb chunk (CS==V),
and if the virtual chunk ($@) is linked to a chunk (named $waitingN) with a link named subject_verb
then
discard the link between the waiting noun chunk ($waitingN) and the virtual chunk ($@)
relate the waiting noun chunk ($waitingN) to the built chunk ($$) with a link named subject_verb
give the built chunk ($$) the verb function (fct=V)
give the waiting noun chunk ($waitingN) the subject function (fct=S)
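The two linking steps can be simulated with a short Python sketch. It is only an illustration under assumptions, not the chunker's implementation: relations are stored as (source, target, name) triples, the virtual chunk is represented by a placeholder string, chunks are identified by their position, and when several noun chunks are waiting the sketch links all of them to the verb.

```python
VIRTUAL = "$@"  # placeholder standing for the virtual chunk

def link_chunks(chunks):
    """Return the set of subject_verb relations built over chunks taken in order."""
    relations = set()  # (source, target, link name)
    for i, chunk in enumerate(chunks):
        if chunk["CS"] == "N":
            # First step: the noun chunk is put aside, linked to the virtual chunk
            relations.add((i, VIRTUAL, "subject_verb"))
        elif chunk["CS"] == "V":
            # Second step: find the noun chunks waiting on the virtual chunk
            waiting = sorted(src for (src, tgt, name) in relations
                             if tgt == VIRTUAL and name == "subject_verb")
            for src in waiting:
                relations.discard((src, VIRTUAL, "subject_verb"))
                relations.add((src, i, "subject_verb"))
                chunks[src]["fct"] = "S"   # $waitingN.add( [ fct=S ] )
            if waiting:
                chunk["fct"] = "V"         # $$.add( [ fct=V ] )
    return relations

chunks = [
    {"CS": "N", "text": "the cat"},
    {"CS": "pN", "text": "in the garden"},
    {"CS": "V", "text": "sleeps"},
]
relations = link_chunks(chunks)
print(relations)
```

Because the pN chunk is ignored by both steps, the verb is linked to its subject even though another chunk stands between them.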
Copy the following 2 rules and paste them at the end of the rule file : chunking_rules.en
// ========== linking chunks =================
// ===========================================
// a chunk N is put aside, waiting for a possible V chunk
$$=[ CS==N ], $@ => relate( $$,$@,subject_verb ) ;
// $$ = the built chunk
// $@ = the virtual chunk
// a verb chunk is arriving and is linked to its subject chunk N
$$=[ CS==V ] , $@ <subject_verb< $waitingN
=> discard( $waitingN,$@,subject_verb )
relate( $waitingN,$$,subject_verb )
$$.add( [ fct=V ] )
$waitingN.add( [ fct=S ] ) ;
// fct=S : the function of this chunk is subject
// fct=V : the function of this chunk is verb
Then execute the "chunker-linker" and look at the outputs.
Try with a verb distant from its subject.
Try in other languages.
Try other types of links.
clausing clausing_rules.txt chunking CLS ;
// clausing groups chunks together into clauses (CLS)
>clausing XMLTab .xmlt lowerLevel W showRelations "subject_verb" ;
// >output task, format, extension, lowerlevel unit, relations output
// ------- generating & delivering a subordinated clause ------------------
$0=[ CS==P ] => $cls=generate( [ CS=SUB1 ] ) $cls.deliver() ;
Translation of the rule :
if the current unit is a subordinating conjunction (CS==P)
(it has been delivered to clausing in chunking_rules.en, therefore it is at chunk level)
then generate a subordinate (CS=SUB1) clause (the current unit is the beginning of a subordinate clause)
and deliver it to the following task (here the output task)
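The clausing rule can be sketched in Python in the same spirit. This is an illustration under assumptions, not the chunker's code: the units that come before the first subordinating conjunction are assumed to form an untyped clause, and a subordinate clause is assumed to run until the next conjunction or the end of the input, since these guidelines do not say where it closes.

```python
def clause(units):
    """Group chunk-level units into clauses (dicts holding a CS and a list of units)."""
    # Assumption: units before the first conjunction form an untyped clause
    clauses = [{"CS": None, "units": []}]
    for unit in units:
        if unit["CS"] == "P":
            # $0=[ CS==P ] => generate a subordinate clause starting at this unit
            clauses.append({"CS": "SUB1", "units": []})
        clauses[-1]["units"].append(unit["text"])
    # Drop the leading clause if the input started with a conjunction
    return [c for c in clauses if c["units"]]

units = [
    {"CS": "N", "text": "the cat"},
    {"CS": "V", "text": "sleeps"},
    {"CS": "P", "text": "because"},
    {"CS": "N", "text": "the dog"},
    {"CS": "V", "text": "is away"},
]
clauses = clause(units)
print(clauses)
```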
Try other sentences.
You could add relative clauses in French.