wordpos
=======
wordpos is a set of part-of-speech (POS) utilities for Node.js using [natural's](http://github.com/NaturalNode/natural) WordNet module.

## Usage

```js
var WordPOS = require('wordpos'),
    wordpos = new WordPOS();

wordpos.getAdjectives('The angry bear chased the frightened little squirrel.', function(result){
    console.log(result);
});
// [ 'little', 'angry', 'frightened' ]

// isX() callbacks receive the result and the looked-up word
wordpos.isAdjective('awesome', function(result, word){
    console.log(result, word);
});
// true 'awesome'
```

See `wordpos_spec.js` for full usage.

## Installation

    npm install wordpos

Note: `wordpos-bench.js` requires a [forked uubench](https://github.com/moos/uubench) module. To use the CLI (see below), installing globally with the `-g` option is recommended.

To run the specs:

    npm install jasmine-node -g
    jasmine-node wordpos_spec.js --verbose
    jasmine-node validate_spec.js --verbose

## API

Please note: all API calls are async since the underlying WordNet library is async. WordPOS is a subclass of natural's [WordNet class](https://github.com/NaturalNode/natural#wordnet) and inherits all its methods.

### getX()...

Get POS from text.

```
wordpos.getPOS(text, callback) -- callback receives a result object:
    {
      nouns:[],       Array of text words that are nouns
      verbs:[],       Array of text words that are verbs
      adjectives:[],  Array of text words that are adjectives
      adverbs:[],     Array of text words that are adverbs
      rest:[]         Array of text words that are not in dict or could not be categorized as a POS
    }
    Note: a word may appear in multiple POS (eg, 'great' is both a noun and an adjective)

wordpos.getNouns(text, callback) -- callback receives an array of nouns in text
wordpos.getVerbs(text, callback) -- callback receives an array of verbs in text
wordpos.getAdjectives(text, callback) -- callback receives an array of adjectives in text
wordpos.getAdverbs(text, callback) -- callback receives an array of adverbs in text
```

If you're only interested in a certain POS (say, adjectives), using the particular getX() is faster
than getPOS(), which looks up the word in all index files. [Stopwords](https://github.com/NaturalNode/natural/blob/master/lib/natural/util/stopwords.js)
are stripped out from text before lookup.

If text is an array, all words are looked up -- no deduplication, stopword filtering, or tokenization is applied.

getX() functions return the number of parsed words that will be looked up (less duplicates and stopwords).

Example:

```js
wordpos.getNouns('The angry bear chased the frightened little squirrel.', console.log)
// [ 'bear', 'squirrel', 'little', 'chased' ]

wordpos.getPOS('The angry bear chased the frightened little squirrel.', console.log)
// output:
// {
//   nouns: [ 'bear', 'squirrel', 'little', 'chased' ],
//   verbs: [ 'bear' ],
//   adjectives: [ 'little', 'angry', 'frightened' ],
//   adverbs: [ 'little' ],
//   rest: [ 'the' ]
// }
```

This has no relation to the correct grammar of the given sentence, in which only 'bear' and 'squirrel'
would be considered nouns (see http://nltk.googlecode.com/svn/trunk/doc/book/ch08.html#ex-recnominals).
[pos-js](https://github.com/fortnightlabs/pos-js), e.g., shows only 'squirrel' as a noun:

    The / DT
    angry / JJ
    bear / VB
    chased / VBN
    the / DT
    frightened / VBN
    little / JJ
    squirrel / NN
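
Because a word may be listed under several POS, it can help to see which words in a `getPOS()` result are ambiguous. The following is a minimal sketch that assumes only the result shape documented above; the helper name `multiPosWords` is hypothetical and not part of wordpos.

```js
// Hypothetical helper (not part of wordpos): given a getPOS()-style result,
// list every word that was classified under more than one POS.
function multiPosWords(result) {
  var categories = ['nouns', 'verbs', 'adjectives', 'adverbs'];
  var seen = Object.create(null);      // word -> categories it appears in
  categories.forEach(function (cat) {
    (result[cat] || []).forEach(function (word) {
      (seen[word] = seen[word] || []).push(cat);
    });
  });
  var ambiguous = {};
  Object.keys(seen).forEach(function (word) {
    if (seen[word].length > 1) ambiguous[word] = seen[word];
  });
  return ambiguous;
}

// Sample result copied from the getPOS() example in this README.
var samplePos = {
  nouns: ['bear', 'squirrel', 'little', 'chased'],
  verbs: ['bear'],
  adjectives: ['little', 'angry', 'frightened'],
  adverbs: ['little'],
  rest: ['the']
};

console.log(multiPosWords(samplePos));
// bear -> nouns, verbs; little -> nouns, adjectives, adverbs
```

With a real instance, the same helper could be passed the object that the `getPOS()` callback receives.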

### isX()...

Determine if a word is a particular POS.

```
wordpos.isNoun(word, callback) -- callback receives result (true/false) if word is a noun.
wordpos.isVerb(word, callback) -- callback receives result (true/false) if word is a verb.
wordpos.isAdjective(word, callback) -- callback receives result (true/false) if word is an adjective.
wordpos.isAdverb(word, callback) -- callback receives result (true/false) if word is an adverb.
```

isX() methods pass the looked-up word as the second argument to the callback.

Examples:

```js
wordpos.isVerb('fish', console.log);
// true 'fish'

wordpos.isNoun('fish', console.log);
// true 'fish'

wordpos.isAdjective('fishy', console.log);
// true 'fishy'

wordpos.isAdverb('fishly', console.log);
// false 'fishly'
```
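
Since each isX() call reports through its own callback, checking one word against all four POS means waiting for four replies. The sketch below shows one way to collect them; `classifyWord` and the `stub` object are hypothetical names introduced for illustration, and the stub only imitates the documented `(result, word)` callback shape so the example runs without WordNet.

```js
// Hypothetical helper (not part of wordpos): ask all four isX() methods about
// one word and collect the answers. `wp` is any object exposing the isX()
// methods documented above, e.g. a WordPOS instance.
function classifyWord(wp, word, done) {
  var methods = { noun: 'isNoun', verb: 'isVerb', adjective: 'isAdjective', adverb: 'isAdverb' };
  var names = Object.keys(methods);
  var answers = {};
  var pending = names.length;
  names.forEach(function (name) {
    wp[methods[name]](word, function (result) {
      answers[name] = result;
      if (--pending === 0) done(answers, word);   // fire once all four have replied
    });
  });
}

// Stand-in object that mimics the callback shape only; it does no real lookup.
// Swap in `new WordPOS()` for actual results.
var stub = {
  isNoun:      function (w, cb) { cb(w === 'fish', w); },
  isVerb:      function (w, cb) { cb(w === 'fish', w); },
  isAdjective: function (w, cb) { cb(w === 'fishy', w); },
  isAdverb:    function (w, cb) { cb(false, w); }
};

classifyWord(stub, 'fish', console.log);
```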

### lookupX()...

These calls are similar to natural's [lookup()](https://github.com/NaturalNode/natural#wordnet) call, except they can be faster if you
already know the POS of the word.

```
wordpos.lookupNoun(word, callback) -- callback receives array of lookup objects for a noun
wordpos.lookupVerb(word, callback) -- callback receives array of lookup objects for a verb
wordpos.lookupAdjective(word, callback) -- callback receives array of lookup objects for an adjective
wordpos.lookupAdverb(word, callback) -- callback receives array of lookup objects for an adverb
```

lookupX() methods pass the looked-up word as the second argument to the callback.

Example:

```js
wordpos.lookupAdjective('awesome', console.log);
// output:
// [ { synsetOffset: 1282510,
//     lexFilenum: 0,
//     pos: 's',
//     wCnt: 5,
//     lemma: 'amazing',
//     synonyms: [ 'amazing', 'awe-inspiring', 'awesome', 'awful', 'awing' ],
//     lexId: '0',
//     ptrs: [],
//     gloss: 'inspiring awe or admiration or wonder; "New York is an amazing city"; "the Grand Canyon is an awe-inspiring sight"; "the awesome complexity of the universe"; "this sea, whose gently awful stirrings seem to speak of some hidden soul beneath"- Melville; "Westminster Hall\'s awing majesty, so vast, so high, so silent" ' } ], 'awesome'
```

In this case only one lookup was found, but there could be several.

Or use WordNet's inherited method:

```js
wordpos.lookup('great', console.log);
// ...
```
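
Lookup objects carry more fields than most callers need. As a rough sketch, the hypothetical helper below (`summarizeLookup` is not part of wordpos or natural) keeps the POS, the synonyms, and the definition part of the gloss. It assumes the gloss layout seen in the sample output, where the definition comes first and quoted examples follow after `; "`.

```js
// Hypothetical helper: reduce lookup objects to the fields read most often.
function summarizeLookup(results) {
  return results.map(function (r) {
    return {
      pos: r.pos,
      synonyms: r.synonyms,
      // Assumption: quoted usage examples start at the first `; "`.
      definition: r.gloss.split('; "')[0].trim()
    };
  });
}

// Abbreviated copy of the lookupAdjective('awesome') result from this README.
var sampleLookup = [{
  pos: 's',
  lemma: 'amazing',
  synonyms: ['amazing', 'awe-inspiring', 'awesome', 'awful', 'awing'],
  gloss: 'inspiring awe or admiration or wonder; "New York is an amazing city" '
}];

console.log(summarizeLookup(sampleLookup));
```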
### Other methods/properties
```
WordPOS.WNdb -- access to the WNdb object
WordPOS.natural -- access to underlying 'natural' module
wordpos.parse(str) -- returns tokenized array of words, less duplicates and stopwords. This method is called on all getX() calls internally.
```

E.g., `WordPOS.natural.stopwords` is the list of stopwords.
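
To make the description of `parse()` concrete, here is an illustrative sketch of the three steps it names: tokenizing, removing stopwords, and removing duplicates. This is not wordpos's implementation; `parseSketch`, the lowercasing step, the letters-only tokenizer, and the tiny stopword list are all assumptions made for the example.

```js
// Illustrative only: NOT wordpos's actual parse() implementation.
// Lowercase, split on anything that is not a letter, then drop
// stopwords and duplicates while preserving first-seen order.
function parseSketch(text, stopwords) {
  var stop = Object.create(null);
  (stopwords || []).forEach(function (w) { stop[w] = true; });
  var seen = Object.create(null);
  return text.toLowerCase().split(/[^a-z]+/).filter(function (word) {
    if (!word || stop[word] || seen[word]) return false;
    seen[word] = true;
    return true;
  });
}

// A tiny stand-in list; the real one is WordPOS.natural.stopwords.
var tinyStopwords = ['the', 'a', 'of'];

console.log(parseSketch('The angry bear chased the frightened little squirrel.', tinyStopwords));
// angry, bear, chased, frightened, little, squirrel
```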

### Options

```js
WordPOS.defaults = {
  /**
   * enable profiling, time in msec returned as last argument in callback
   */
  profile: false,

  /**
   * use fast index if available
   */
  fastIndex: true,

  /**
   * if true, exclude standard stopwords.
   * if array, stopwords to exclude, eg, ['all','of','this',...]
   * if false, do not filter any stopwords.
   */
  stopwords: true
};
```

To override, pass an options hash to the constructor. With the `profile` option, all callbacks receive an additional last argument that is the execution time of the call in msec.

```js
wordpos = new WordPOS({profile: true});
wordpos.isAdjective('fast', console.log);
// true 'fast' 29
```
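
The `stopwords` option accepts three kinds of value, which is easy to misread. The sketch below shows how an options hash might be merged over the defaults and how the three modes could be resolved into a concrete list. It is not taken from wordpos's source: `resolveOptions`, `sketchDefaults`, and the stand-in standard list are hypothetical.

```js
// Hypothetical sketch (not wordpos's source): merge user options over the
// documented defaults and resolve the three `stopwords` modes.
var sketchDefaults = { profile: false, fastIndex: true, stopwords: true };

function resolveOptions(options, standardStopwords) {
  var merged = Object.assign({}, sketchDefaults, options);
  var list;
  if (merged.stopwords === true) {
    list = standardStopwords;            // true: use the standard list
  } else if (Array.isArray(merged.stopwords)) {
    list = merged.stopwords;             // array: use the caller's list
  } else {
    list = [];                           // false: filter nothing
  }
  return { options: merged, stopwordList: list };
}

var standardList = ['the', 'a', 'of'];   // stand-in for the standard stopwords

console.log(resolveOptions({ profile: true }, standardList).options);
console.log(resolveOptions({ stopwords: ['all', 'of', 'this'] }, standardList).stopwordList);
console.log(resolveOptions({ stopwords: false }, standardList).stopwordList);
```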

### Fast Index

Version 0.1.4 introduces the `fastIndex` option. This uses a secondary index on the index files and is much faster. It is on by default. Secondary index files are generated at install time and placed in the same directory as `WNdb.path`. Details can be found in `tools/stat.js`.

See the blog article [Optimizing WordPos](http://blog.42at.com/optimizing-wordpos).
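
For readers unfamiliar with the idea, the following is a generic illustration of what a secondary index buys: a small table keyed on a prefix that narrows each lookup to a short range of a sorted list. The actual file format and key scheme used by wordpos are not described in this README and may well differ; `buildPrefixIndex`, `hasWord`, and the three-character prefix are assumptions chosen for the example.

```js
// Generic illustration of a secondary index; the real wordpos format may differ.
// `sortedWords` stands in for the sorted word column of an index file.
function buildPrefixIndex(sortedWords, prefixLength) {
  var index = Object.create(null);     // prefix -> [firstPosition, lastPosition]
  sortedWords.forEach(function (word, i) {
    var key = word.slice(0, prefixLength);
    if (!index[key]) index[key] = [i, i];
    else index[key][1] = i;
  });
  return index;
}

// Search only the range that the word's prefix points at,
// instead of searching the whole list.
function hasWord(sortedWords, index, prefixLength, word) {
  var range = index[word.slice(0, prefixLength)];
  if (!range) return false;            // no entry shares this prefix
  for (var i = range[0]; i <= range[1]; i++) {
    if (sortedWords[i] === word) return true;
  }
  return false;
}

var indexWords = ['angry', 'bear', 'beard', 'chase', 'little', 'squirrel'];
var prefixIndex = buildPrefixIndex(indexWords, 3);

console.log(hasWord(indexWords, prefixIndex, 3, 'beard'));   // true
console.log(hasWord(indexWords, prefixIndex, 3, 'beast'));   // false
```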

## CLI

Version 0.1.6 introduces the command-line interface (`./bin/wordpos-cli.js`), available as `wordpos` if installed globally
with `npm install wordpos -g`, otherwise as `node_modules/.bin/wordpos` if installed without `-g`.

```bash
$ wordpos get The angry bear chased the frightened little squirrel
# Noun 4:
bear
chased
little
squirrel
# Adjective 3:
angry
frightened
little
# Verb 1:
bear
# Adverb 1:
little
```
Just the nouns, brief output:
```bash
$ wordpos get --noun -b The angry bear chased the frightened little squirrel
bear chased little squirrel
```
Just the counts: (nouns, adjectives, verbs, adverbs, total parsed words)
```bash
$ wordpos get -c The angry bear chased the frightened little squirrel
4 3 1 1 7
```

Just the adjective count: (0, adjectives, 0, 0, total parsed words)

```bash
$ wordpos get --adj -c The angry bear chased the frightened little squirrel
0 3 0 0 7
```
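
When the counts are consumed by a script, the positional output is easier to handle once the columns are named. A small sketch, using the column order stated above (nouns, adjectives, verbs, adverbs, total parsed words); `parseCounts` is a hypothetical helper, not something the CLI provides.

```js
// Hypothetical helper: turn one line of `wordpos get -c` output into named fields.
function parseCounts(line) {
  var n = line.trim().split(/\s+/).map(Number);
  return { nouns: n[0], adjectives: n[1], verbs: n[2], adverbs: n[3], total: n[4] };
}

console.log(parseCounts('4 3 1 1 7'));
console.log(parseCounts('0 3 0 0 7'));
```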
Get definitions:
```bash
$ wordpos def git
git
n: a person who is deemed to be despicable or contemptible; "only a rotter would do that"; "kill the rat"; "throw the bum out"; "you cowardly little pukes!"; "the British call a contemptible person a `git'"
```
Get full result object:
```bash
$ wordpos def git -f
{ git:
[ { synsetOffset: 10539715,
lexFilenum: 18,
pos: 'n',
wCnt: 0,
lemma: 'rotter',
synonyms: [],
lexId: '0',
ptrs: [],
gloss: 'a person who is deemed to be despicable or contemptible; "only a rotter would do that"; "kill the rat"; "throw the bum out"; "you cowardly little pukes!"; "the British call a contemptible person a `git\'" ' } ] }
```
As JSON:
```bash
$ wordpos def git -j
{"git":[{"synsetOffset":10539715,"lexFilenum":18,"pos":"n","wCnt":0,"lemma":"rotter","synonyms":[],"lexId":"0","ptrs":[],"gloss":"a person who is deemed to be despicable or contemptible; \"only a rotter would do that\"; \"kill the rat\"; \"throw the bum out\"; \"you cowardly little pukes!\"; \"the British call a contemptible person a `git'\" "}]}
```
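
The JSON form is the one intended for other programs to read. As a sketch of consuming it, the hypothetical helper below maps each word to its `pos: gloss` strings; the field names come from the output above, while `glossesFromJson` and the shortened sample string are introduced only for this example.

```js
// Hypothetical helper: map each word in `wordpos def -j` output to its glosses.
function glossesFromJson(jsonText) {
  var data = JSON.parse(jsonText);
  var out = {};
  Object.keys(data).forEach(function (word) {
    out[word] = data[word].map(function (entry) {
      return entry.pos + ': ' + entry.gloss.trim();
    });
  });
  return out;
}

// Shortened version of the JSON output shown above.
var sampleJson = '{"git":[{"pos":"n","lemma":"rotter","gloss":"a person who is deemed to be despicable or contemptible "}]}';

console.log(glossesFromJson(sampleJson));
```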
Usage:
```bash
$ wordpos
  Usage: wordpos-cli.js [options] <command> [word ... | -i <file> | <stdin>]

  Commands:

    get
    get list of words for particular POS

    def
    lookup definitions

    parse
    show parsed words, deduped and less stopwords

  Options:

    -h, --help         output usage information
    -V, --version      output the version number
    -n, --noun         Get nouns
    -a, --adj          Get adjectives
    -v, --verb         Get verbs
    -r, --adv          Get adverbs
    -c, --count        count only (noun, adj, verb, adv, total parsed words)
    -b, --brief        brief output (all on one line, no headers)
    -f, --full         full results object
    -j, --json         full results object as JSON
    -i, --file <file>  input file
    -s, --stopwords    include stopwords
```
## Benchmark

    node wordpos-bench.js

512-word corpus (< v0.1.4):

```
getPOS : 0 ops/s { iterations: 1, elapsed: 9039 }
getNouns : 0 ops/s { iterations: 1, elapsed: 2347 }
getVerbs : 0 ops/s { iterations: 1, elapsed: 2434 }
getAdjectives : 1 ops/s { iterations: 1, elapsed: 1698 }
getAdverbs : 0 ops/s { iterations: 1, elapsed: 2698 }
done in 20359 msecs
```

512-word corpus (as of v0.1.4, with fastIndex):

```
getPOS : 18 ops/s { iterations: 1, elapsed: 57 }
getNouns : 48 ops/s { iterations: 1, elapsed: 21 }
getVerbs : 125 ops/s { iterations: 1, elapsed: 8 }
getAdjectives : 111 ops/s { iterations: 1, elapsed: 9 }
getAdverbs : 143 ops/s { iterations: 1, elapsed: 7 }
done in 1375 msecs
```

220 words are looked up (less stopwords and duplicates) on a Windows 7, 64-bit, dual-core 3 GHz machine. getPOS() is slowest as it searches through all four index files.
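
The ops/s column follows from the elapsed time, and the two runs can be compared directly. The sketch below assumes that ops/s is iterations per second rounded to the nearest integer, which is consistent with every figure listed above; `opsPerSecond` is a hypothetical helper, not part of the benchmark script.

```js
// Assumption: ops/s = iterations per second, rounded to the nearest integer.
function opsPerSecond(iterations, elapsedMs) {
  return Math.round(iterations * 1000 / elapsedMs);
}

// Elapsed times (msec) copied from the two benchmark runs above.
var elapsedBefore = { getPOS: 9039, getNouns: 2347, getVerbs: 2434, getAdjectives: 1698, getAdverbs: 2698 };
var elapsedAfter  = { getPOS: 57,   getNouns: 21,   getVerbs: 8,    getAdjectives: 9,    getAdverbs: 7 };

Object.keys(elapsedBefore).forEach(function (name) {
  var speedup = Math.round(elapsedBefore[name] / elapsedAfter[name]);
  console.log(name + ': ' + opsPerSecond(1, elapsedAfter[name]) + ' ops/s, about ' + speedup + 'x faster with fastIndex');
});
```
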

License
-------

(The MIT License)

Copyright (c) 2012, 2014 mooster@42at.com