wordpos
=======
wordpos is a set of part-of-speech (POS) utilities for Node.js using [natural's](http://github.com/NaturalNode/natural) WordNet module.

There is no lexicographical intelligence here (for that see, e.g., [pos-js](https://github.com/fortnightlabs/pos-js)); only dictionary lookup.

Usage
-------
```js
var WordPOS = require('./wordpos'),
    wordpos = new WordPOS();

wordpos.getAdjectives('The angry bear chased the frightened little squirrel.', function(result){
    console.log(result);
});
// [ 'little', 'angry', 'frightened' ]

wordpos.isAdjective('awesome', function(result){
    console.log(result);
});
// true
```
See `wordpos_spec.js` for full usage.

Installation
------------
Get the script `wordpos.js` and use it (an npm package may be coming), or use a git path in your `package.json` dependencies:
```
...
"dependencies": {
"wordpos": "git://github.com/moos/wordpos.git"
},
...
```
As of version 0.1.1, the WordNet DB files are obtained offline, as a dependency provided by the [WNdb](https://github.com/moos/WNdb) module.

Note: `wordpos-bench.js` requires a [forked uubench](https://github.com/moos/uubench) module.

API
-------
Please note: all API calls are async, since the underlying WordNet library is async.

WordPOS is a subclass of natural's [WordNet class](https://github.com/NaturalNode/natural#wordnet) and inherits all its methods.
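
Because every call reports its result through a callback, it can be convenient to wrap a call in a Promise. The following is a minimal sketch, not part of wordpos: `toPromise` is a hypothetical helper, and `fakeWordpos` is a stand-in stub so the snippet runs without the WordNet files. It assumes the callback receives the result as its only argument, as in the examples in this README.

```js
// Hypothetical helper (not part of wordpos): wraps a callback-style method
// whose callback receives the result as its only argument.
function toPromise(obj, method) {
  return function (arg) {
    return new Promise(function (resolve) {
      obj[method](arg, resolve);
    });
  };
}

// Stand-in stub so this sketch runs without WordNet; its answer is hard-coded.
var fakeWordpos = {
  isAdjective: function (word, callback) {
    setTimeout(function () { callback(word === 'awesome'); }, 0);
  }
};

var isAdjective = toPromise(fakeWordpos, 'isAdjective');
isAdjective('awesome').then(console.log);
// true (from the stub)
```

With the real module, the same wrapper would presumably be applied to a `WordPOS` instance, e.g. `toPromise(wordpos, 'isAdjective')`.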

### getX()

Get POS from text.

```
wordpos.getPOS(str, callback) -- callback receives a result object:
    {
      nouns:[],       Array of str words that are nouns
      verbs:[],       Array of str words that are verbs
      adjectives:[],  Array of str words that are adjectives
      adverbs:[],     Array of str words that are adverbs
      rest:[]         Array of str words that are not in dict or could not be categorized as a POS
    }
    Note: a word may appear in multiple POS (eg, 'great' is both a noun and an adjective)

wordpos.getNouns(str, callback) -- callback receives an array of nouns in str

wordpos.getVerbs(str, callback) -- callback receives an array of verbs in str

wordpos.getAdjectives(str, callback) -- callback receives an array of adjectives in str

wordpos.getAdverbs(str, callback) -- callback receives an array of adverbs in str
```

NB: If you're only interested in a certain POS (say, adjectives), using the particular getX() is faster
than getPOS(), which looks up the word in all index files.

NB: [stopwords](https://github.com/NaturalNode/natural/blob/master/lib/natural/util/stopwords.js)
are stripped out from str before lookup.

Example:

```js
wordpos.getNouns('The angry bear chased the frightened little squirrel.', console.log)
// [ 'bear', 'squirrel', 'little', 'chased' ]

wordpos.getPOS('The angry bear chased the frightened little squirrel.', console.log)
// output:
  {
    nouns: [ 'bear', 'squirrel', 'little', 'chased' ],
    verbs: [ 'bear' ],
    adjectives: [ 'little', 'angry', 'frightened' ],
    adverbs: [ 'little' ],
    rest: [ 'the' ]
  }
```

This has no relation to the correct grammar of the given sentence, in which only 'bear' and 'squirrel'
would be considered nouns (see http://nltk.googlecode.com/svn/trunk/doc/book/ch08.html#ex-recnominals).
[pos-js](https://github.com/fortnightlabs/pos-js), for example, shows only 'squirrel' as a noun:

    The / DT
    angry / JJ
    bear / VB
    chased / VBN
    the / DT
    frightened / VBN
    little / JJ
    squirrel / NN
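
As the getPOS() note says, a word may appear under more than one POS. Below is a small sketch of how such words could be picked out of a result object. `multiPOS` is a hypothetical helper, and the `result` literal is copied from the getPOS() example output above rather than produced by a live lookup.

```js
// Hypothetical helper: returns the words listed under more than one POS.
// Assumes each word appears at most once within a single category array.
function multiPOS(result) {
  var counts = {};
  ['nouns', 'verbs', 'adjectives', 'adverbs'].forEach(function (pos) {
    (result[pos] || []).forEach(function (word) {
      counts[word] = (counts[word] || 0) + 1;
    });
  });
  return Object.keys(counts).filter(function (word) {
    return counts[word] > 1;
  });
}

// Sample data copied from the getPOS() example in this README.
var result = {
  nouns: [ 'bear', 'squirrel', 'little', 'chased' ],
  verbs: [ 'bear' ],
  adjectives: [ 'little', 'angry', 'frightened' ],
  adverbs: [ 'little' ],
  rest: [ 'the' ]
};

console.log(multiPOS(result));
// [ 'bear', 'little' ]
```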

### isX()

Determine if a word is a particular POS.

```
wordpos.isNoun(word, callback) -- callback receives result (true/false) if word is a noun.

wordpos.isVerb(word, callback) -- callback receives result (true/false) if word is a verb.

wordpos.isAdjective(word, callback) -- callback receives result (true/false) if word is an adjective.

wordpos.isAdverb(word, callback) -- callback receives result (true/false) if word is an adverb.
```
Examples:
```js
wordpos.isVerb('fish', console.log);
// true

wordpos.isNoun('fish', console.log);
// true

wordpos.isAdjective('fishy', console.log);
// true

wordpos.isAdverb('fishly', console.log);
// false
```
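
To ask which categories a single word belongs to, the four isX() calls can be issued together and their answers collected. This is a minimal sketch: `classify` is a hypothetical helper, and `stub` stands in for a WordPOS instance (its answers are hard-coded from the examples above) so the snippet runs without WordNet.

```js
// Hypothetical helper: runs all four isX() checks and reports the matching
// categories once every callback has fired.
function classify(wordpos, word, callback) {
  var checks = { noun: 'isNoun', verb: 'isVerb', adjective: 'isAdjective', adverb: 'isAdverb' };
  var names = Object.keys(checks);
  var pending = names.length;
  var found = [];
  names.forEach(function (name) {
    wordpos[checks[name]](word, function (isMatch) {
      if (isMatch) found.push(name);
      if (--pending === 0) callback(found);
    });
  });
}

// Stand-in for a WordPOS instance; answers mirror the README examples.
var stub = {
  isNoun: function (w, cb) { cb(w === 'fish'); },
  isVerb: function (w, cb) { cb(w === 'fish'); },
  isAdjective: function (w, cb) { cb(w === 'fishy'); },
  isAdverb: function (w, cb) { cb(false); }
};

classify(stub, 'fish', console.log);
// [ 'noun', 'verb' ]
```

With truly asynchronous lookups the categories arrive in completion order, so the array order should not be relied upon.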
### lookupX()
These calls are similar to natural's [lookup()](https://github.com/NaturalNode/natural#wordnet) call, except they can be faster if you
already know the POS of the word.
```
wordpos.lookupNoun(word, callback) -- callback receives array of lookup objects for a noun

wordpos.lookupVerb(word, callback) -- callback receives array of lookup objects for a verb

wordpos.lookupAdjective(word, callback) -- callback receives array of lookup objects for an adjective

wordpos.lookupAdverb(word, callback) -- callback receives array of lookup objects for an adverb
```
Example:
```js
wordpos.lookupAdjective('awesome', console.log);
// output:
[ { synsetOffset: 1282510,
    lexFilenum: 0,
    pos: 's',
    wCnt: 5,
    lemma: 'amazing',
    synonyms: [ 'amazing', 'awe-inspiring', 'awesome', 'awful', 'awing' ],
    lexId: '0',
    ptrs: [],
    gloss: 'inspiring awe or admiration or wonder; "New York is an amazing city"; "the Grand Canyon is an awe-inspiring sight"; "the awesome complexity of the universe"; "this sea, whose gently awful stirrings seem to speak of some hidden soul beneath"- Melville; "Westminster Hall\'s awing majesty, so vast, so high, so silent" ' } ]
```
In this case only one lookup was found, but there could be several.

Or use WordNet's inherited method:

```js
wordpos.lookup('great', console.log);
// ...
```
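
Since a lookup may return several objects, a caller will often want to merge one field across all of them. The sketch below gathers the unique `synonyms`. `allSynonyms` is a hypothetical helper, and `sample` is abbreviated from the lookupAdjective() output shown earlier (only the fields used here are kept).

```js
// Hypothetical helper: collects unique synonyms across all lookup objects,
// preserving first-seen order.
function allSynonyms(results) {
  var out = [];
  results.forEach(function (entry) {
    (entry.synonyms || []).forEach(function (syn) {
      if (out.indexOf(syn) === -1) out.push(syn);
    });
  });
  return out;
}

// Abbreviated from the lookupAdjective('awesome') output in this README.
var sample = [
  { lemma: 'amazing',
    synonyms: [ 'amazing', 'awe-inspiring', 'awesome', 'awful', 'awing' ] }
];

// The sample is passed twice only to show that duplicates are dropped.
console.log(allSynonyms(sample.concat(sample)));
// [ 'amazing', 'awe-inspiring', 'awesome', 'awful', 'awing' ]

// With the real module this would presumably look like:
// wordpos.lookupAdjective('awesome', function (results) {
//   console.log(allSynonyms(results));
// });
```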

Benchmark
----------
Generally slow, as it requires loading and searching large WordNet index files.

Single word lookup:

```
getPOS : 30 ops/s { iterations: 10, elapsed: 329 }
getNouns : 106 ops/s { iterations: 10, elapsed: 94 }
getVerbs : 111 ops/s { iterations: 10, elapsed: 90 }
getAdjectives : 132 ops/s { iterations: 10, elapsed: 76 }
getAdverbs : 137 ops/s { iterations: 10, elapsed: 73 }
```
128-word lookup:
```
getPOS : 0 ops/s { iterations: 1, elapsed: 2210 }
getNouns : 2 ops/s { iterations: 1, elapsed: 666 }
getVerbs : 2 ops/s { iterations: 1, elapsed: 638 }
getAdjectives : 2 ops/s { iterations: 1, elapsed: 489 }
getAdverbs : 2 ops/s { iterations: 1, elapsed: 407 }
```
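
The ops/s column in both tables above is consistent with dividing the iteration count by the elapsed time and rounding to the nearest integer. The sketch below reproduces a few rows under that reading. It assumes `elapsed` is in milliseconds, and `opsPerSecond` is a hypothetical helper, not a function taken from `wordpos-bench.js` or uubench.

```js
// Hypothetical helper: operations per second, rounded to the nearest integer.
// Assumes elapsedMs is a duration in milliseconds.
function opsPerSecond(iterations, elapsedMs) {
  return Math.round(iterations / (elapsedMs / 1000));
}

console.log(opsPerSecond(10, 329));  // 30  (getPOS, single word)
console.log(opsPerSecond(10, 94));   // 106 (getNouns, single word)
console.log(opsPerSecond(1, 2210));  // 0   (getPOS, 128 words)
console.log(opsPerSecond(1, 666));   // 2   (getNouns, 128 words)
```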

Benchmarks were run on a Windows 7, 64-bit, dual-core, 3 GHz machine. getPOS() is slowest as it searches through all four index files.
There is probably room for optimization in the underlying library.

License
-------
(The MIT License)

Copyright (c) 2012, mooster@42at.com