wordpos

wordpos is a set of part-of-speech utilities for Node.js using natural's WordNet module.

Usage

var WordPOS = require('wordpos'),
    wordpos = new WordPOS();

wordpos.getAdjectives('The angry bear chased the frightened little squirrel.', function(result){
    console.log(result);
});
// [ 'little', 'angry', 'frightened' ]

wordpos.isAdjective('awesome', function(result){
    console.log(result);
});
// true

See wordpos_spec.js for full usage.

Installation

 npm install wordpos

Note: wordpos-bench.js requires a forked uubench module.

To run the spec:

npm install jasmine-node -g
jasmine-node wordpos_spec.js --verbose

API

Please note: all API methods are asynchronous, since the underlying WordNet library is asynchronous.

WordPOS is a subclass of natural's WordNet class and inherits all its methods.
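Because callbacks here receive the result as their first argument rather than an error, Node's util.promisify (which expects error-first callbacks) does not fit directly. The following is a minimal sketch of a Promise adapter under that assumption. callAsPromise and fakeWordpos are hypothetical names used only for illustration; they are not part of wordpos, and a real WordPOS instance would take the place of the stand-in object.

```javascript
// Hypothetical helper (not part of wordpos): adapts a result-first
// callback method to a Promise.
function callAsPromise(target, method, ...args) {
  return new Promise(function (resolve) {
    // The callback is appended as the last argument and receives
    // the result as its first argument.
    target[method](...args, function (result) {
      resolve(result);
    });
  });
}

// Stand-in object so the sketch runs without WordNet data.
var fakeWordpos = {
  isAdjective: function (word, callback) {
    setTimeout(function () { callback(word === 'awesome'); }, 0);
  }
};

callAsPromise(fakeWordpos, 'isAdjective', 'awesome').then(function (result) {
  console.log(result);
});
// true
```

With the profile option enabled, the second callback argument (the execution time) would be dropped by this adapter, since a Promise resolves with a single value.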

getX()...

Get POS from text.

wordpos.getPOS(str, callback) -- callback receives a result object:
    {
      nouns:[],       Array of str words that are nouns
      verbs:[],       Array of str words that are verbs
      adjectives:[],  Array of str words that are adjectives
      adverbs:[],     Array of str words that are adverbs
      rest:[]         Array of str words that are not in dict or could not be categorized as a POS
    }

    Note: a word may appear in multiple POS (eg, 'great' is both a noun and an adjective)

wordpos.getNouns(str, callback) -- callback receives an array of nouns in str

wordpos.getVerbs(str, callback) -- callback receives an array of verbs in str

wordpos.getAdjectives(str, callback) -- callback receives an array of adjectives in str

wordpos.getAdverbs(str, callback) -- callback receives an array of adverbs in str

NB: If you're only interested in a certain POS (say, adjectives), using the particular getX() is faster than getPOS(), which looks up each word in all index files.

NB: [stopwords](https://github.com/NaturalNode/natural/blob/master/lib/natural/util/stopwords.js) are stripped out from str before lookup.

All getX() functions return the number of parsed words that will be looked up (less duplicates and stopwords).

Example:

wordpos.getNouns('The angry bear chased the frightened little squirrel.', console.log)
// [ 'bear', 'squirrel', 'little', 'chased' ]

wordpos.getPOS('The angry bear chased the frightened little squirrel.', console.log)
// output:
  {
    nouns: [ 'bear', 'squirrel', 'little', 'chased' ],
    verbs: [ 'bear' ],
    adjectives: [ 'little', 'angry', 'frightened' ],
    adverbs: [ 'little' ],
    rest: [ 'the' ]
  }

These results list every POS a word can have in WordNet; they do not reflect the grammar of the given sentence, in which only 'bear' and 'squirrel' would be considered nouns. (See http://nltk.googlecode.com/svn/trunk/doc/book/ch08.html#ex-recnominals)

By comparison, pos-js tags only 'squirrel' as a noun:

The / DT
angry / JJ
bear / VB
chased / VBN
the / DT
frightened / VBN
little / JJ
squirrel / NN
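Since a word may appear under several POS in a getPOS() result, it can be handy to view the result per word. The sketch below inverts a result object of the shape documented above into a word-to-categories map. posByWord is a hypothetical helper, not part of wordpos, and the input is simply the example output copied from this section.

```javascript
// Hypothetical helper: turns a getPOS()-style result into a
// word -> [categories] map.
function posByWord(result) {
  // Null-prototype object so words such as 'constructor' are safe keys.
  var map = Object.create(null);
  ['nouns', 'verbs', 'adjectives', 'adverbs', 'rest'].forEach(function (pos) {
    (result[pos] || []).forEach(function (word) {
      (map[word] = map[word] || []).push(pos);
    });
  });
  return map;
}

// The getPOS() output from the example in this section.
var result = {
  nouns: [ 'bear', 'squirrel', 'little', 'chased' ],
  verbs: [ 'bear' ],
  adjectives: [ 'little', 'angry', 'frightened' ],
  adverbs: [ 'little' ],
  rest: [ 'the' ]
};

var byWord = posByWord(result);
console.log(byWord.little);
// [ 'nouns', 'adjectives', 'adverbs' ]
console.log(byWord.bear);
// [ 'nouns', 'verbs' ]
```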

isX()...

Determine if a word is a particular POS.

wordpos.isNoun(word, callback) -- callback receives true if word is a noun, false otherwise.

wordpos.isVerb(word, callback) -- callback receives true if word is a verb, false otherwise.

wordpos.isAdjective(word, callback) -- callback receives true if word is an adjective, false otherwise.

wordpos.isAdverb(word, callback) -- callback receives true if word is an adverb, false otherwise.

Examples:

wordpos.isVerb('fish', console.log);
// true

wordpos.isNoun('fish', console.log);
// true

wordpos.isAdjective('fishy', console.log);
// true

wordpos.isAdverb('fishly', console.log);
// false

lookupX()...

These calls are similar to natural's lookup() call, except they can be faster if you already know the POS of the word.

wordpos.lookupNoun(word, callback) -- callback receives array of lookup objects for a noun

wordpos.lookupVerb(word, callback) -- callback receives array of lookup objects for a verb

wordpos.lookupAdjective(word, callback) -- callback receives array of lookup objects for an adjective

wordpos.lookupAdverb(word, callback) -- callback receives array of lookup objects for an adverb

Example:

wordpos.lookupAdjective('awesome', console.log);
// output:
[ { synsetOffset: 1282510,
    lexFilenum: 0,
    pos: 's',
    wCnt: 5,
    lemma: 'amazing',
    synonyms: [ 'amazing', 'awe-inspiring', 'awesome', 'awful', 'awing' ],
    lexId: '0',
    ptrs: [],
    gloss: 'inspiring awe or admiration or wonder; "New York is an amazing city"; "the Grand Canyon is an awe-inspiring sight"; "the awesome complexity of the universe"; "this sea, whose gently awful stirrings seem to speak of some hidden soul beneath"- Melville; "Westminster Hall\'s awing majesty, so vast, so high, so silent"  ' } ]

In this case only one lookup object was returned, but there can be several.

Or use WordNet's inherited method:

wordpos.lookup('great', console.log);
// ...
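Because a lookup can return several objects, a common follow-up step is to merge their synonyms fields. The sketch below does this for the object shape shown in the lookupAdjective() example. collectSynonyms is a hypothetical helper, not part of wordpos or natural, and the input is a trimmed copy of that example output.

```javascript
// Hypothetical helper: merges the synonyms of all lookup objects,
// dropping duplicates and the looked-up word itself.
function collectSynonyms(word, lookups) {
  var out = [];
  lookups.forEach(function (entry) {
    entry.synonyms.forEach(function (syn) {
      if (syn !== word && out.indexOf(syn) === -1) out.push(syn);
    });
  });
  return out;
}

// Trimmed copy of the lookupAdjective('awesome') result.
var lookups = [
  { pos: 's', lemma: 'amazing',
    synonyms: [ 'amazing', 'awe-inspiring', 'awesome', 'awful', 'awing' ] }
];

console.log(collectSynonyms('awesome', lookups));
// [ 'amazing', 'awe-inspiring', 'awful', 'awing' ]
```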

Other methods

WordPOS.WNdb -- access to the WNdb object

wordpos.parse(str) -- returns a tokenized array of words, less duplicates and stopwords. This method is called internally by all getX() calls.
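To illustrate what "tokenized, less duplicates and stopwords" means, here is a rough approximation of that behaviour. It is not the library's implementation: parseSketch is a hypothetical name, the stopword list is a tiny made-up subset of natural's list, and both the lowercasing and the split pattern are assumptions.

```javascript
// Tiny hypothetical stopword subset; natural ships a much longer list.
var STOPWORDS = ['the', 'a', 'an', 'and', 'of', 'to', 'in'];

// Approximates parse(): tokenize, then drop stopwords and duplicates.
function parseSketch(str) {
  var seen = Object.create(null);
  return str
    .toLowerCase()          // assumption: case-insensitive matching
    .split(/\W+/)           // assumption: split on non-word characters
    .filter(function (word) {
      if (!word || seen[word] || STOPWORDS.indexOf(word) !== -1) return false;
      seen[word] = true;
      return true;
    });
}

console.log(parseSketch('The angry bear chased the frightened little squirrel.'));
// [ 'angry', 'bear', 'chased', 'frightened', 'little', 'squirrel' ]
```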

Options

WordPOS.defaults = {
  /**
   * enable profiling, time in msec returned as second argument in callback
   */
  profile: false
};

To override, pass an options hash to the constructor. With the profile option, all callbacks receive a second argument that is the execution time in msec of the call.

    wordpos = new WordPOS({profile: true});
    wordpos.isAdjective('fast', console.log);
    // true 29
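The effect of the profile option can be pictured as a timing wrapper around each call. The sketch below shows that idea only; it is not how wordpos implements profiling. withProfile and fakeIsAdjective are hypothetical names, and the stand-in function replaces a real isX() method so the example runs without WordNet data.

```javascript
// Hypothetical wrapper: times an async call and passes the elapsed
// milliseconds to the callback as a second argument.
function withProfile(fn) {
  return function (input, callback) {
    var start = Date.now();
    fn(input, function (result) {
      callback(result, Date.now() - start);
    });
  };
}

// Stand-in for an isX() method.
function fakeIsAdjective(word, callback) {
  setTimeout(function () { callback(word === 'fast'); }, 5);
}

var timedIsAdjective = withProfile(fakeIsAdjective);
timedIsAdjective('fast', console.log);
// prints true followed by the elapsed time in msec
```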

Benchmark

node wordpos-bench.js

Lookups are generally slow because they require loading and searching large WordNet index files.

Single word lookup:

  getPOS : 30 ops/s { iterations: 10, elapsed: 329 }
  getNouns : 106 ops/s { iterations: 10, elapsed: 94 }
  getVerbs : 111 ops/s { iterations: 10, elapsed: 90 }
  getAdjectives : 132 ops/s { iterations: 10, elapsed: 76 }
  getAdverbs : 137 ops/s { iterations: 10, elapsed: 73 }

128-word lookup:

  getPOS : 0 ops/s { iterations: 1, elapsed: 2210 }
  getNouns : 2 ops/s { iterations: 1, elapsed: 666 }
  getVerbs : 2 ops/s { iterations: 1, elapsed: 638 }
  getAdjectives : 2 ops/s { iterations: 1, elapsed: 489 }
  getAdverbs : 2 ops/s { iterations: 1, elapsed: 407 }

Measured on a Windows 7, 64-bit, dual-core 3 GHz machine. getPOS() is slowest as it searches through all four index files.

There is probably room for optimization in the underlying library.
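The ops/s figures in the benchmark output appear to follow from the iterations and elapsed (msec) values next to them. The sketch below reproduces that arithmetic, assuming rounding to the nearest integer; this is inferred from the numbers above, not taken from the benchmark module's source. opsPerSecond is a hypothetical name.

```javascript
// Operations per second from an iteration count and elapsed milliseconds.
// Rounding to the nearest integer is an assumption based on the figures above.
function opsPerSecond(iterations, elapsedMs) {
  return Math.round(iterations * 1000 / elapsedMs);
}

console.log(opsPerSecond(10, 329)); // 30  (getPOS, single word)
console.log(opsPerSecond(1, 2210)); // 0   (getPOS, 128 words)
console.log(opsPerSecond(1, 666));  // 2   (getNouns, 128 words)
```

This also explains the "0 ops/s" entry: a single 128-word getPOS() call took longer than two seconds, so the rate rounds down to zero.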

License

(The MIT License)

Copyright (c) 2012, mooster@42at.com