wordpos
=======
[![NPM version](https://img.shields.io/npm/v/wordpos.svg)](https://www.npmjs.com/package/wordpos)
[![Build Status](https://img.shields.io/travis/moos/wordpos/master.svg)](https://travis-ci.org/moos/wordpos)

wordpos is a set of *fast* part-of-speech (POS) utilities for Node.js, built on fast lookups in the WordNet database.

Version 1.x is a major update with no direct dependence on [natural](http://github.com/NaturalNode/natural), with support for Promises and a roughly 5x speed improvement over the previous version.

**CAUTION** The WordNet database [wordnet-db](https://github.com/moos/wordnet-db) comprises [155,287 words](http://wordnet.princeton.edu/wordnet/man/wnstats.7WN.html) (3.0 numbers), which uncompress to over **30 MB** of data in several *un*[browserify](https://github.com/substack/node-browserify)-able files. It is *not* meant for the browser environment.
## Quick usage
Node.js:
```js
var WordPOS = require('wordpos'),
    wordpos = new WordPOS();

wordpos.getAdjectives('The angry bear chased the frightened little squirrel.', function(result){
    console.log(result);
});
// [ 'little', 'angry', 'frightened' ]

// the callback also receives the looked-up word as a second argument
wordpos.isAdjective('awesome', function(result){
    console.log(result);
});
// true
```
Command-line: (see [CLI](bin))
```bash
$ wordpos def git
git
n: a person who is deemed to be despicable or contemptible; "only a rotter would do that"; "kill the rat"; "throw the bum out"; "you cowardly little pukes!"; "the British call a contemptible person a `git'"
$ wordpos def git | wordpos get --adj
# Adjective 6:
despicable
contemptible
bum
cowardly
little
British
```
## Installation

    npm install -g wordpos

To run the tests (or just `npm test`):

    npm install -g mocha
    mocha test
### Options
```js
WordPOS.defaults = {
  /**
   * enable profiling, time in msec returned as last argument in callback
   */
  profile: false,

  /**
   * if true, exclude standard stopwords.
   * if array, stopwords to exclude, eg, ['all','of','this',...]
   * if false, do not filter any stopwords.
   */
  stopwords: true
};
```
To override, pass an options hash to the constructor. With the `profile` option, all callbacks receive a last argument that is the execution time in msec of the call.
```js
wordpos = new WordPOS({profile: true});
wordpos.isAdjective('fast', console.log);
// true 'fast' 29
```
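The timing argument can also be captured programmatically. The sketch below relies only on what is stated above: with `profile: true`, the last callback argument is the execution time in msec. `timedCheck` is a hypothetical helper name, not part of wordpos.

```js
// Hypothetical helper (not part of wordpos): calls an isX()-style method on a
// WordPOS instance created with {profile: true} and resolves with the result,
// the looked-up word and the reported time.
function timedCheck(wordpos, method, word) {
  return new Promise(function (resolve) {
    wordpos[method](word, function () {
      var args = Array.prototype.slice.call(arguments);
      // With profiling on, the last callback argument is the time in msec.
      resolve({ result: args[0], word: args[1], msec: args[args.length - 1] });
    });
  });
}

// Usage (requires wordpos to be installed):
//   var WordPOS = require('wordpos');
//   timedCheck(new WordPOS({profile: true}), 'isAdjective', 'fast').then(console.log);
```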
## API
Please note: all API methods are *async* since the underlying WordNet library is async.
#### getPOS(text, callback)
#### getNouns(text, callback)
#### getVerbs(text, callback)
#### getAdjectives(text, callback)
#### getAdverbs(text, callback)
Get part-of-speech from `text`. `callback(results)` receives an array of words for the specified POS, or a hash for `getPOS()`:
```
wordpos.getPOS(text, callback) -- callback receives a result object:
    {
      nouns:[],       Array of words that are nouns
      verbs:[],       Array of words that are verbs
      adjectives:[],  Array of words that are adjectives
      adverbs:[],     Array of words that are adverbs
      rest:[]         Array of words that are not in dict or could not be categorized as a POS
    }
    Note: a word may appear in multiple POS (eg, 'great' is both a noun and an adjective)
```
If you're only interested in a certain POS (say, adjectives), using the particular getX() is faster than getPOS(), which looks up the word in all index files. [Stopwords](https://github.com/moos/wordpos/lib/natural/util/stopwords.js) are stripped out from text before lookup.

If `text` is an *array*, all words are looked up; no deduplication, stopword filtering or tokenization is applied.

getX() functions return a Promise.

Example:
```js
wordpos.getNouns('The angry bear chased the frightened little squirrel.', console.log)
// [ 'bear', 'squirrel', 'little', 'chased' ]

wordpos.getPOS('The angry bear chased the frightened little squirrel.', console.log)
// output:
  {
    nouns: [ 'bear', 'squirrel', 'little', 'chased' ],
    verbs: [ 'bear' ],
    adjectives: [ 'little', 'angry', 'frightened' ],
    adverbs: [ 'little' ],
    rest: [ 'the' ]
  }
```
This has no relation to the correct grammar of the given sentence, in which only 'bear' and 'squirrel' would be considered nouns.
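Because a word can be listed under several POS, it can help to see which words are ambiguous. The following is a minimal sketch that works purely on the documented shape of the `getPOS()` result; `multiPosWords` is a hypothetical helper, not part of wordpos.

```js
// Hypothetical helper: given a getPOS() result object, return a map of
// word -> POS keys for every word listed under more than one POS.
function multiPosWords(result) {
  var seen = Object.create(null);   // word -> array of POS keys
  ['nouns', 'verbs', 'adjectives', 'adverbs'].forEach(function (pos) {
    (result[pos] || []).forEach(function (word) {
      (seen[word] = seen[word] || []).push(pos);
    });
  });
  var out = {};
  Object.keys(seen).forEach(function (word) {
    if (seen[word].length > 1) out[word] = seen[word];
  });
  return out;
}

// The getPOS() output from the example above:
var sample = {
  nouns: [ 'bear', 'squirrel', 'little', 'chased' ],
  verbs: [ 'bear' ],
  adjectives: [ 'little', 'angry', 'frightened' ],
  adverbs: [ 'little' ],
  rest: [ 'the' ]
};
console.log(JSON.stringify(multiPosWords(sample)));
// {"bear":["nouns","verbs"],"little":["nouns","adjectives","adverbs"]}
```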
#### isNoun(word, callback)
#### isVerb(word, callback)
#### isAdjective(word, callback)
#### isAdverb(word, callback)
Determine if `word` is a particular POS. `callback(result, word)` receives true/false as first argument and the looked-up word as the second argument. The resolved Promise receives true/false.

Examples:
```js
wordpos.isVerb('fish', console.log);
// true 'fish'

wordpos.isNoun('fish', console.log);
// true 'fish'

wordpos.isAdjective('fishy', console.log);
// true 'fishy'

wordpos.isAdverb('fishly', console.log);
// false 'fishly'
```
#### lookup(word, callback)
#### lookupNoun(word, callback)
#### lookupVerb(word, callback)
#### lookupAdjective(word, callback)
#### lookupAdverb(word, callback)
Get complete definition object for `word`. The lookupX() variants can be faster if you already know the POS of the word. Signature of the callback is `callback(result, word)` where `result` is an *array* of lookup object(s).

Example:
```js
wordpos.lookupAdjective('awesome', console.log);
// output:
[ { synsetOffset: 1282510,
    lexFilenum: 0,
    pos: 's',
    wCnt: 5,
    lemma: 'amazing',
    synonyms: [ 'amazing', 'awe-inspiring', 'awesome', 'awful', 'awing' ],
    lexId: '0',
    ptrs: [],
    gloss: 'inspiring awe or admiration or wonder; <snip> awing majesty, so vast, so high, so silent" '
} ], 'awesome'
```
In this case only one lookup was found, but there could be several.
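Since the result is always an array, callers typically iterate over it. Here is a minimal sketch that reduces each lookup object to a few fields, using only the field names shown in the output above. `summarize` is a hypothetical helper, and treating the gloss text before the first `;` as the definition is an assumption that will not hold for every entry.

```js
// Hypothetical helper: condense an array of lookup objects.
function summarize(results) {
  return results.map(function (entry) {
    return {
      pos: entry.pos,
      lemma: entry.lemma,
      synonyms: entry.synonyms.slice(),
      // Assumption: the gloss starts with the definition, followed by ';' and examples.
      definition: entry.gloss.split(';')[0].trim()
    };
  });
}

// Usage (requires wordpos to be installed):
//   wordpos.lookupAdjective('awesome', function (results, word) {
//     console.log(word, summarize(results));
//   });
```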
#### rand(options, callback)
#### randNoun(options, callback)
#### randVerb(options, callback)
#### randAdjective(options, callback)
#### randAdverb(options, callback)
Get random word(s). (Introduced in version 0.1.10) `callback(results, startsWith)` receives an array of random words and the `startsWith` option, if one was given. `options`, if given, is:
```
{
  startsWith : <string> -- get random words starting with this
  count : <number> -- number of words to return (default = 1)
}
```
Examples:
```js
wordpos.rand(console.log)
// ['wulfila'] ''
wordpos.randNoun(console.log)
// ['bamboo_palm'] ''
wordpos.rand({startsWith: 'foo'}, console.log)
// ['foot'] 'foo'
wordpos.randVerb({startsWith: 'bar', count: 3}, console.log)
// ['barge', 'barf', 'barter_away'] 'bar'
wordpos.rand({startsWith: 'zzz'}, console.log)
// [] 'zzz'
```
**Note on performance**: random lookups could involve heavy disk reads. It is better to use the `count` option to get words in batches. This may benefit from the cached reads of similarly keyed entries as well as shared open/close of the index files.

Getting random POS (`randNoun()`, etc.) is generally faster than `rand()`, which may look at multiple POS files until `count` requirement is met.
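As an illustration of the batching advice, the sketch below requests all words in a single call through the `count` option rather than calling `rand()` repeatedly. `randomBatch` is a hypothetical helper, not part of wordpos; it relies only on the documented `options` and `callback(results, startsWith)` signature.

```js
// Hypothetical helper: fetch `count` random words with one rand() call.
function randomBatch(wordpos, count, startsWith) {
  var options = { count: count };
  if (startsWith) options.startsWith = startsWith;
  return new Promise(function (resolve) {
    // The callback's first argument is the array of random words.
    wordpos.rand(options, function (words) {
      resolve(words);
    });
  });
}

// Usage (requires wordpos to be installed):
//   randomBatch(new (require('wordpos'))(), 10, 'bar').then(console.log);
```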
#### parse(text)
Returns tokenized array of words in `text`, less duplicates and stopwords. This method is called on all getX() calls internally.
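To make the three steps concrete (tokenize, drop duplicates, drop stopwords), here is a rough approximation. It is not the library's tokenizer: the regular expression, the lowercasing, and the name `parseSketch` are all assumptions made for illustration, and the caller supplies the stopword list.

```js
// Approximation of the behaviour described for parse(); NOT wordpos's own code.
// `stopwords`: an array of words to exclude, or false/undefined for no filtering.
function parseSketch(text, stopwords) {
  var exclude = Array.isArray(stopwords) ? stopwords : [];
  var seen = Object.create(null);
  // Assumption: words are runs of letters, digits or apostrophes, lowercased.
  var tokens = text.toLowerCase().match(/[a-z0-9']+/g) || [];
  return tokens.filter(function (word) {
    if (seen[word] || exclude.indexOf(word) !== -1) return false;
    seen[word] = true;
    return true;
  });
}

console.log(JSON.stringify(
  parseSketch('The angry bear chased the frightened little squirrel.', ['the'])
));
// ["angry","bear","chased","frightened","little","squirrel"]
```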
#### WordPOS.WNdb
Access to the [wordnet-db](https://github.com/moos/wordnet-db) object containing the dictionary & index files.
#### WordPOS.stopwords
Access the array of stopwords.
## Promises
TODO
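A minimal sketch of Promise-style usage, based on the API notes that getX() functions return a Promise and that the Promise from isX() resolves to true/false. It assumes the callback can simply be omitted when the Promise is used; `posOf` is a hypothetical helper, not part of wordpos.

```js
// Hypothetical helper: resolves to the POS names that `word` belongs to,
// e.g. ['noun', 'verb'], by combining the Promises returned by isX().
function posOf(wordpos, word) {
  var checks = {
    noun: wordpos.isNoun(word),
    verb: wordpos.isVerb(word),
    adjective: wordpos.isAdjective(word),
    adverb: wordpos.isAdverb(word)
  };
  var names = Object.keys(checks);
  return Promise.all(names.map(function (name) { return checks[name]; }))
    .then(function (flags) {
      return names.filter(function (name, i) { return flags[i]; });
    });
}

// Usage (requires wordpos to be installed):
//   posOf(new (require('wordpos'))(), 'fish').then(console.log);
```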
## Fast Index
Version 0.1.4 introduces the `fastIndex` option. This uses a secondary index on the index files and is much faster. It is on by default. Secondary index files are generated at install time and placed in the same directory as WNdb.path. Details can be found in tools/stat.js.

Fast index improves performance **30x** over Natural's native methods. See blog article [Optimizing WordPos](http://blog.42at.com/optimizing-wordpos).

As of version 1.0, the fast index option is always on and cannot be turned off.
## Command-line: CLI
For CLI usage and examples, see [bin/README](bin).
## Benchmark
Note: `wordpos-bench.js` requires a [forked uubench](https://github.com/moos/uubench) module.

    cd bench
    node wordpos-bench.js

512-word corpus (< v0.1.4, comparable to Natural):
```
getPOS : 0 ops/s { iterations: 1, elapsed: 9039 }
getNouns : 0 ops/s { iterations: 1, elapsed: 2347 }
getVerbs : 0 ops/s { iterations: 1, elapsed: 2434 }
getAdjectives : 1 ops/s { iterations: 1, elapsed: 1698 }
getAdverbs : 0 ops/s { iterations: 1, elapsed: 2698 }
done in 20359 msecs
```
512-word corpus (as of v0.1.4, with fastIndex):
```
getPOS : 18 ops/s { iterations: 1, elapsed: 57 }
getNouns : 48 ops/s { iterations: 1, elapsed: 21 }
getVerbs : 125 ops/s { iterations: 1, elapsed: 8 }
getAdjectives : 111 ops/s { iterations: 1, elapsed: 9 }
getAdverbs : 143 ops/s { iterations: 1, elapsed: 7 }
done in 1375 msecs
```
220 words are looked up (less stopwords and duplicates) on a Windows 7, 64-bit, dual-core 3 GHz machine. getPOS() is slowest as it searches through all four index files.
### Version 1.0 Benchmark
Re-run v0.1.16:
```
getPOS : 11 ops/s { iterations: 1, elapsed: 90 }
getNouns : 21 ops/s { iterations: 1, elapsed: 47 }
getVerbs : 53 ops/s { iterations: 1, elapsed: 19 }
getAdjectives : 29 ops/s { iterations: 1, elapsed: 34 }
getAdverbs : 83 ops/s { iterations: 1, elapsed: 12 }
lookup : 1 ops/s { iterations: 1, elapsed: 720 }
lookupNoun : 1 ops/s { iterations: 1, elapsed: 676 }
looked up 220 words
done in 2459 msecs
```
V1.0:
```
getPOS : 14 ops/s { iterations: 1, elapsed: 73 }
getNouns : 26 ops/s { iterations: 1, elapsed: 38 }
getVerbs : 42 ops/s { iterations: 1, elapsed: 24 }
getAdjectives : 24 ops/s { iterations: 1, elapsed: 42 }
getAdverbs : 26 ops/s { iterations: 1, elapsed: 38 }
lookup : 6 ops/s { iterations: 1, elapsed: 159 }
lookupNoun : 13 ops/s { iterations: 1, elapsed: 77 }
looked up 221 words
done in 1274 msecs
```
That's roughly **2x** better in total run time (2459 vs. 1274 msecs), although the individual getX() timings are mixed. Functions that read the data files see much improved performance: `lookup` about **4.5x** and `lookupNoun` over **8x**.
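The ratios can be checked directly from the `elapsed` values in the two tables. In the sketch below, `opsPerSec` and `speedup` are illustrative helper names; the observation that the ops/s column matches 1000 / elapsed (rounded) is inferred from the printed numbers rather than taken from the benchmark code.

```js
// With one iteration per run, ops/s in the tables equals 1000 / elapsed msec, rounded.
function opsPerSec(elapsedMs) {
  return Math.round(1000 / elapsedMs);
}

// Speedup of a newer timing relative to an older one.
function speedup(beforeMs, afterMs) {
  return beforeMs / afterMs;
}

console.log(opsPerSec(73));                    // 14   (getPOS, v1.0)
console.log(speedup(720, 159).toFixed(1));     // 4.5  (lookup)
console.log(speedup(676, 77).toFixed(1));      // 8.8  (lookupNoun)
console.log(speedup(2459, 1274).toFixed(1));   // 1.9  (total run time)
```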
## Changes
1.0.1
- Removed direct dependency on Natural. Certain modules are included in /lib.
- Add support for Promises.
- Improved data file reads for up to **5x** performance increase.
- Tests are now mocha-based with assert interface.

0.1.16
- Changed dependency to wordnet-db (renamed from WNdb)

0.1.15
- Added `syn` (synonym) and `exp` (example) CLI commands.
- Fixed `rand` CLI command when no start word given.
- Removed -N, --num CLI option. Use `wordpos rand [N]` to get N random words.
- Changed CLI option -s to -w (include stopwords).

0.1.13
- Fix crlf issue for command-line script

0.1.12
- fix stopwords not getting excluded when running with CLI
- added 'stopwords' CLI *command* to show list of stopwords
- CLI *option* --stopword now renamed to --withStopwords

0.1.10
- rand functionality added

0.1.6
- added command line tool

0.1.4
- added fast index

License
-------

(The MIT License)

Copyright (c) 2012, 2014, 2016 mooster@42at.com