Named Entity Recognition
Named Entity Recognition (NER) is an essential part of human language technology, useful for a variety of applications such as data mining, summarization, question-answering systems and the anonymization of medical journals. NER can be divided into two sub-tasks: (a) chunking, i.e. recognizing which words or multi-word strings constitute names, and (b) semantic classification, i.e. assigning a name type to each recognized name.
The VISL approach to NER, developed by Eckhard Bick for Danish and Portuguese, is a distributed hybrid method. On the one hand it uses traditional techniques like pattern matching, gazetteering and lexicography; on the other hand it uses a grammatical approach, where context-sensitive CG rules classify names based on, for instance, syntactic function, verbal selection restrictions, noun-phrase feature inheritance, coordination and apposition structure. The system recognizes about 20 name types, which fall into 6 major categories: (1) people, (2) organisations, (3) places, (4) events, (5) art work titles and (6) others, like brands or vehicles. These classes can be defined as feature bundles (cf. the table below), and can thus also be disambiguated simply by discarding or selecting atomic semantic features, like +LOC or +HUM. Currently, both NER parsers achieve around 93% correct readings, with 2% chunking errors and 5% subtype classification errors.
For Danish, VISL's NER-system has participated in the Nordic Nomen Nescio research network, funded by the Nordic Council of Ministers. The following is a short list of relevant publications:
- Bick, Eckhard (2003-1), "Named Entity Recognition for Danish", in: Nordisk Sprogteknologi (Årbog 2002), pp. 331-349, Museum Tusculanum, Copenhagen University
- Bick, Eckhard (2003-4), "Multi-Level NER in a CG framework", in: Proceedings of NoDaLiDa2003, 30-31 May 2003, Reykjavik, forthcoming
- Bick, Eckhard (2003-5), "Multi-Level NER for Portuguese in a CG Framework", in: Proceedings of PROPOR2003, Faro, Springer
| Name type | <vq> COGN: siger, tilbyder ("says, offers") | +LOC (place): være dér ved/i X ("be there at/in X") | <cc> (concrete movable object): bring X | made, built, invented (HUM-cause) | +TIME: X vare, begynde, slutte; siden X ("X lasts, begins, ends; since X") | +LIFE | +MOVE |
|---|---|---|---|---|---|---|---|
| <hum> | + (1) | - | - | - | - | + | + |
| <top> | - | + | - | - | - | - | - |
| <inst><civ> | + | + | - | built | - | - | - |
| <org><media> <party> | + (group) | - | - | constituted | - | metaph. | metaph. |
| <tit><media> | + | - | metaph. | authored | - | - | - |
| <genre> | + | - | - | taught | - | - | - |
| <brand><mat> | - | - | + | produced | - | - | - |
| <V> (<v>) | - | - | + | produced | - | - | + |
| <A> (<a>) | metaph. | - | + | - | - | + | + |
| <B> (<b>) | - | (-) | + | - | - | + | - |
| <astro> | - | + | - | -? | - | - | + |
| <occ> | - | metaph. | - | (held) | + | - | - |
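
The idea of disambiguating name types by selecting or discarding atomic features can be illustrated with a minimal Python sketch. This is not the VISL implementation, which uses CG rules; the dictionary below is a simplified, hypothetical encoding of the feature table in which only plain "+" cells are kept, the first column is given the made-up label `VQ`, the `<cc>` column is labelled `CC`, and the "made, built, invented" column is left out. The function names `select` and `discard` are likewise assumptions chosen for illustration. Following the usual Constraint Grammar convention, a rule never removes the last remaining reading.

```python
# Hypothetical, simplified encoding of the feature table: each name type maps
# to the set of columns marked with a plain "+". Cells such as "metaph.",
# "(-)" or "-?" are treated as absent in this sketch.
FEATURE_BUNDLES = {
    "hum":   {"VQ", "LIFE", "MOVE"},
    "top":   {"LOC"},
    "inst":  {"VQ", "LOC"},
    "org":   {"VQ"},
    "tit":   {"VQ"},
    "genre": {"VQ"},
    "brand": {"CC"},
    "V":     {"CC", "MOVE"},
    "A":     {"CC", "LIFE", "MOVE"},
    "B":     {"CC", "LIFE"},
    "astro": {"LOC", "MOVE"},
    "occ":   {"TIME"},
}


def select(candidates, feature):
    """Keep only the candidate types whose bundle contains `feature`.

    If no candidate has the feature, the list is returned unchanged,
    so that at least one reading always survives.
    """
    kept = [c for c in candidates if feature in FEATURE_BUNDLES[c]]
    return kept or list(candidates)


def discard(candidates, feature):
    """Drop the candidate types whose bundle contains `feature`.

    If that would remove every candidate, the list is returned unchanged.
    """
    kept = [c for c in candidates if feature not in FEATURE_BUNDLES[c]]
    return kept or list(candidates)


# A name that is ambiguous between person, place and organisation:
readings = ["hum", "top", "org"]

# A locative context (e.g. "i X", "in X") supports +LOC:
after_loc = select(readings, "LOC")

# Alternatively, a context ruling out speech/cognition subjects discards VQ:
after_vq = discard(readings, "VQ")

print(after_loc, after_vq)
```

In this toy example both routes leave `top` as the only reading, which mirrors the point that a class can be chosen either by selecting one of its features or by discarding a feature of its competitors.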