Major update - first v1.0 checkin with support for Promise, remove natural dependency, and more.

Moos 2016-01-18 20:20:14 -08:00
parent 2001182b7a
commit b27c49fd01
15 changed files with 393 additions and 272 deletions


@ -3,5 +3,3 @@ node_js:
- '5' - '5'
- '4' - '4'
- '0.12' - '0.12'
before_script:
- npm install -g jasmine-node

107
README.md

@ -6,7 +6,7 @@ wordpos
wordpos is a set of *fast* part-of-speech (POS) utilities for Node.js using fast lookup in the WordNet database. wordpos is a set of *fast* part-of-speech (POS) utilities for Node.js using fast lookup in the WordNet database.
Version 1.x is a mojor update with no direct depedence on [natural's](http://github.com/NaturalNode/natural), with support for Promises, and roughly 5x speed improvement over previous version. Version 1.x is a major update with no direct dependence on [natural's](http://github.com/NaturalNode/natural), with support for [Promises](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise), and roughly 5x speed improvement over previous version.
**CAUTION** The WordNet database [wordnet-db](https://github.com/moos/wordnet-db) comprises [155,287 words](http://wordnet.princeton.edu/wordnet/man/wnstats.7WN.html) (3.0 numbers) which uncompress to over **30 MB** of data in several *un*[browserify](https://github.com/substack/node-browserify)-able files. It is *not* meant for the browser environment. **CAUTION** The WordNet database [wordnet-db](https://github.com/moos/wordnet-db) comprises [155,287 words](http://wordnet.princeton.edu/wordnet/man/wnstats.7WN.html) (3.0 numbers) which uncompress to over **30 MB** of data in several *un*[browserify](https://github.com/substack/node-browserify)-able files. It is *not* meant for the browser environment.
@ -104,7 +104,7 @@ wordpos.getPOS(text, callback) -- callback receives a result object:
``` ```
If you're only interested in a certain POS (say, adjectives), using the particular getX() is faster If you're only interested in a certain POS (say, adjectives), using the particular getX() is faster
than getPOS() which looks up the word in all index files. [stopwords](https://github.com/moos/wordpos/lib/natural/util/stopwords.js)are stripped out from text before lookup. than getPOS() which looks up the word in all index files. [stopwords](lib/natural/util/stopwords.js) are stripped out from text before lookup.
If `text` is an *array*, all words are looked-up -- no deduplication, stopword filtering or tokenization is applied. If `text` is an *array*, all words are looked-up -- no deduplication, stopword filtering or tokenization is applied.
@ -127,8 +127,7 @@ wordpos.getPOS('The angry bear chased the frightened little squirrel.', console.
} }
``` ```
This has no relation to correct grammar of given sentence, where here only 'bear' and 'squirrel' This has no relation to correct grammar of given sentence, where here only 'bear' and 'squirrel' would be considered nouns.
would be considered nouns.
#### isNoun(word, callback) #### isNoun(word, callback)
#### isVerb(word, callback) #### isVerb(word, callback)
@ -228,7 +227,33 @@ Access the array of stopwords.
## Promises ## Promises
TODO As of v1.0, all `get`, `is`, `rand`, and `lookup` methods return a standard ES6 [Promise](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise).
```js
wordpos.isVerb('fish').then(console.log);
// true
```
Compound, with error handler:
```js
wordpos.isVerb('fish')
.then(console.log)
.then(doSomethingElse)
.catch(console.error);
```
Callbacks, if given, are executed _before_ the Promise is resolved.
```js
wordpos.isVerb('fish', console.log)
.then(console.log)
.catch(console.error);
// true 'fish' 13
// true
```
Note that the callback receives the full set of arguments (including the profile time, if enabled), while the Promise receives only the result of the call. Also, beware that an exception thrown in the _callback_ causes the Promise to be _rejected_, and it is caught by `catch()`, if provided.
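A minimal sketch of the ordering described above, using a hypothetical stand-in `isVerbStub` rather than wordpos itself (the stub and its toy word list are assumptions for illustration only). The callback runs inside the Promise executor before `resolve()`, so a throw in the callback rejects the Promise:

```js
// Hypothetical stand-in for a wordpos-style method; not the library's code.
var VERBS = ['fish', 'run'];   // assumption: toy word list

function isVerbStub(word, callback) {
  return new Promise(function (resolve) {
    var result = VERBS.indexOf(word) !== -1;
    // The callback (if given) runs first and gets the full arguments...
    callback && callback(result, word);
    // ...then the Promise resolves with the result only.
    resolve(result);
  });
}

var order = [];
var demo = isVerbStub('fish', function (result, word) {
  order.push('callback: ' + result + ' ' + word);
}).then(function (result) {
  order.push('promise: ' + result);
  return order;
});

// An exception thrown in the callback surfaces as a rejection.
var failing = isVerbStub('fish', function () {
  throw new Error('callback failed');
}).catch(function (err) {
  return err.message;
});
```

Here `demo` settles with the callback entry recorded ahead of the Promise entry, and `failing` settles with the message of the error thrown in the callback.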
## Fast Index ## Fast Index
@ -236,7 +261,7 @@ Version 0.1.4 introduces `fastIndex` option. This uses a secondary index on the
Fast index improves performance **30x** over Natural's native methods. See blog article [Optimizing WordPos](http://blog.42at.com/optimizing-wordpos). Fast index improves performance **30x** over Natural's native methods. See blog article [Optimizing WordPos](http://blog.42at.com/optimizing-wordpos).
As of version 1.0, the fast index option is always on and cannot be turned off. As of version 1.0, fast index is always on and cannot be turned off.
## Command-line: CLI ## Command-line: CLI
@ -245,73 +270,15 @@ For CLI usage and examples, see [bin/README](bin).
## Benchmark ## Benchmark
Note: `wordpos-bench.js` requires a [forked uubench](https://github.com/moos/uubench) module. See [benchmark](bench/README.md).
cd bench
node wordpos-bench.js
512-word corpus (< v0.1.4, comparable to Natural) :
```
getPOS : 0 ops/s { iterations: 1, elapsed: 9039 }
getNouns : 0 ops/s { iterations: 1, elapsed: 2347 }
getVerbs : 0 ops/s { iterations: 1, elapsed: 2434 }
getAdjectives : 1 ops/s { iterations: 1, elapsed: 1698 }
getAdverbs : 0 ops/s { iterations: 1, elapsed: 2698 }
done in 20359 msecs
```
512-word corpus (as of v0.1.4, with fastIndex) :
```
getPOS : 18 ops/s { iterations: 1, elapsed: 57 }
getNouns : 48 ops/s { iterations: 1, elapsed: 21 }
getVerbs : 125 ops/s { iterations: 1, elapsed: 8 }
getAdjectives : 111 ops/s { iterations: 1, elapsed: 9 }
getAdverbs : 143 ops/s { iterations: 1, elapsed: 7 }
done in 1375 msecs
```
220 words are looked-up (less stopwords and duplicates) on a win7/64-bit/dual-core/3GHz. getPOS() is slowest as it searches through all four index files.
### Version 1.0 Benchmark
Re-run v0.1.16:
```
getPOS : 11 ops/s { iterations: 1, elapsed: 90 }
getNouns : 21 ops/s { iterations: 1, elapsed: 47 }
getVerbs : 53 ops/s { iterations: 1, elapsed: 19 }
getAdjectives : 29 ops/s { iterations: 1, elapsed: 34 }
getAdverbs : 83 ops/s { iterations: 1, elapsed: 12 }
lookup : 1 ops/s { iterations: 1, elapsed: 720 }
lookupNoun : 1 ops/s { iterations: 1, elapsed: 676 }
looked up 220 words
done in 2459 msecs
```
V1.0:
```
getPOS : 14 ops/s { iterations: 1, elapsed: 73 }
getNouns : 26 ops/s { iterations: 1, elapsed: 38 }
getVerbs : 42 ops/s { iterations: 1, elapsed: 24 }
getAdjectives : 24 ops/s { iterations: 1, elapsed: 42 }
getAdverbs : 26 ops/s { iterations: 1, elapsed: 38 }
lookup : 6 ops/s { iterations: 1, elapsed: 159 }
lookupNoun : 13 ops/s { iterations: 1, elapsed: 77 }
looked up 221 words
done in 1274 msecs
```
That's roughly **2x** better across the board. Functions that read the data files see much improved performance: `lookup` about **5x** and `lookupNoun` over **8x**.
## Changes ## Changes
1.0.1 1.0.0
- Removed direct dependency on Natural. Certain modules are included in /lib. - Removed npm dependency on Natural. Certain modules are included in /lib.
- Add support for Promises. - Add support for ES6 Promises.
- Improved data file reads for up to **5x** performance increase. - Improved data file reads for up to **5x** performance increase compared to previous version.
- Tests are now mocha-based with assert interface. - Tests are now [mocha](https://mochajs.org/)-based with [chai](http://chaijs.com/) assert interface.
0.1.16 0.1.16
- Changed dependency to wordnet-db (renamed from WNdb) - Changed dependency to wordnet-db (renamed from WNdb)

80
bench/README.md Normal file

@ -0,0 +1,80 @@
## Benchmark
```bash
cd bench
node wordpos-bench.js
```
### Version 1.0 Benchmark
The following benchmarks were run on a Win8.1/Core i7/3.5GHz machine on a Seagate 500GB SATA II, 7200 RPM disk. The corpus was a 512-word text, with stopwords and duplicates removed, resulting in 220 words looked-up.
#### Pre v0.1.4 (comparable to Natural)
```
getPOS : 1 ops/s { iterations: 1, elapsed: 1514 }
getNouns : 2 ops/s { iterations: 1, elapsed: 409 }
getVerbs : 2 ops/s { iterations: 1, elapsed: 418 }
getAdjectives : 3 ops/s { iterations: 1, elapsed: 332 }
getAdverbs : 4 ops/s { iterations: 1, elapsed: 272 }
lookup : 1 ops/s { iterations: 1, elapsed: 1981 }
lookupNoun : 0 ops/s { iterations: 1, elapsed: 2016 }
looked up 220 words
done in 7770 msecs
```
#### v0.1.16 (with fastIndex):
```
getPOS : 11 ops/s { iterations: 1, elapsed: 90 }
getNouns : 21 ops/s { iterations: 1, elapsed: 47 }
getVerbs : 53 ops/s { iterations: 1, elapsed: 19 }
getAdjectives : 29 ops/s { iterations: 1, elapsed: 34 }
getAdverbs : 83 ops/s { iterations: 1, elapsed: 12 }
lookup : 1 ops/s { iterations: 1, elapsed: 720 }
lookupNoun : 1 ops/s { iterations: 1, elapsed: 676 }
looked up 220 words
done in 2459 msecs
```
#### v1.0:
```
getPOS : 14 ops/s { iterations: 1, elapsed: 73 }
getNouns : 26 ops/s { iterations: 1, elapsed: 38 }
getVerbs : 42 ops/s { iterations: 1, elapsed: 24 }
getAdjectives : 24 ops/s { iterations: 1, elapsed: 42 }
getAdverbs : 26 ops/s { iterations: 1, elapsed: 38 }
lookup : 6 ops/s { iterations: 1, elapsed: 159 }
lookupNoun : 13 ops/s { iterations: 1, elapsed: 77 }
looked up 221 words
done in 1274 msecs
```
These are **3.5x** better compared to v0.1.16 and **15x** better compared to pre v0.1.4, overall. Functions that read the data files see much improved performance: `lookup` about **13x** and `lookupNoun` **26x** compared to pre v0.1.4.
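The per-function figures for `lookup` and `lookupNoun` can be reproduced by dividing elapsed times from the oldest table by those from the v1.0 table. A small sketch (the helper name `speedup` and the variable names are illustrative, not part of the benchmark script):

```js
// Elapsed times (ms) copied from the benchmark tables above.
var preFastIndex = { lookup: 1981, lookupNoun: 2016 };
var v1 = { lookup: 159, lookupNoun: 77 };

// speedup = old elapsed time / new elapsed time
function speedup(before, after) {
  return before / after;
}

Object.keys(v1).forEach(function (name) {
  console.log(name + ': ' + speedup(preFastIndex[name], v1[name]).toFixed(1) + 'x');
});
// lookup: 12.5x
// lookupNoun: 26.2x
```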
### Old benchmark
512-word corpus (< v0.1.4, comparable to Natural) :
```
getPOS : 0 ops/s { iterations: 1, elapsed: 9039 }
getNouns : 0 ops/s { iterations: 1, elapsed: 2347 }
getVerbs : 0 ops/s { iterations: 1, elapsed: 2434 }
getAdjectives : 1 ops/s { iterations: 1, elapsed: 1698 }
getAdverbs : 0 ops/s { iterations: 1, elapsed: 2698 }
done in 20359 msecs
```
512-word corpus (as of v0.1.4, with fastIndex) :
```
getPOS : 18 ops/s { iterations: 1, elapsed: 57 }
getNouns : 48 ops/s { iterations: 1, elapsed: 21 }
getVerbs : 125 ops/s { iterations: 1, elapsed: 8 }
getAdjectives : 111 ops/s { iterations: 1, elapsed: 9 }
getAdverbs : 143 ops/s { iterations: 1, elapsed: 7 }
done in 1375 msecs
```
220 words are looked-up (less stopwords and duplicates) on a win7/64-bit/dual-core/3GHz. getPOS() is slowest as it searches through all four index files.


@ -1,15 +1,23 @@
/**
* wordpos-bench.js
*
* Copyright (c) 2012-2016 mooster@42at.com
* https://github.com/moos/wordpos
*
* Released under MIT license
*/
var uubench = require('uubench'), // from: https://github.com/moos/uubench var Bench = require('mini-bench'),
fs = require('fs'), fs = require('fs'),
_ = require('underscore')._, _ = require('underscore')._,
WordPOS = require('../src/wordpos'), WordPOS = require('../src/wordpos'),
wordpos = new WordPOS(); wordpos = new WordPOS();
suite = new uubench.Suite({ suite = new Bench.Suite({
type: 'fixed', type: 'fixed',
iterations: 1, iterations: 1,
sync: true, // important! async: false, // important!
start: function(tests){ start: function(tests){
console.log('starting %d tests', tests.length); console.log('starting %d tests', tests.length);
@ -110,6 +118,7 @@ suite.section('--512 words--', function(next){
suite.options.iterations = 1; suite.options.iterations = 1;
next(); next();
}); });
suite.bench('getPOS', getPOS); suite.bench('getPOS', getPOS);
suite.bench('getNouns', getNouns); suite.bench('getNouns', getNouns);
suite.bench('getVerbs', getVerbs); suite.bench('getVerbs', getVerbs);
@ -118,6 +127,4 @@ suite.bench('getAdverbs', getAdverbs);
suite.bench('lookup', lookup); suite.bench('lookup', lookup);
suite.bench('lookupNoun', lookupNoun); suite.bench('lookupNoun', lookupNoun);
suite.run(); suite.run();


@ -1,7 +1,15 @@
{ {
"name": "wordpos", "name": "wordpos",
"author": "Moos <mooster@42at.com>", "author": "Moos <mooster@42at.com>",
"keywords": ["natural", "language", "wordnet", "adjectives", "nouns", "adverbs", "verbs"], "keywords": [
"natural",
"language",
"wordnet",
"adjectives",
"nouns",
"adverbs",
"verbs"
],
"description": "wordpos is a set of part-of-speech utilities for Node.js using the WordNet database.", "description": "wordpos is a set of part-of-speech utilities for Node.js using the WordNet database.",
"version": "1.0.0-RC1", "version": "1.0.0-RC1",
"homepage": "https://github.com/moos/wordpos", "homepage": "https://github.com/moos/wordpos",
@ -10,12 +18,12 @@
}, },
"bin": "./bin/wordpos-cli.js", "bin": "./bin/wordpos-cli.js",
"dependencies": { "dependencies": {
"commander": "^2.0.0",
"underscore": ">=1.3.1", "underscore": ">=1.3.1",
"wordnet-db": "latest", "wordnet-db": "latest"
"commander": "^2.0.0"
}, },
"devDependencies": { "devDependencies": {
"uubench": "git://github.com/moos/uubench.git", "mini-bench": "^1.0.0",
"chai": "*", "chai": "*",
"mocha": "*" "mocha": "*"
}, },


@ -1,24 +1,41 @@
/*!
* dataFile.js
*
* Copyright (c) 2012-2016 mooster@42at.com
* https://github.com/moos/wordpos
*
* Portions: Copyright (c) 2011, Chris Umbel
*
* Released under MIT license
*/
var fs = require('fs'), var fs = require('fs'),
path = require('path'), path = require('path'),
_ = require('underscore'); _ = require('underscore');
// courtesy of natural.WordNet /**
// TODO link * parse a single data file line, returning data object
*
* @param line {string} - a single line from WordNet data file
* @returns {object}
*
* Credit for this routine to https://github.com/NaturalNode/natural
*/
function lineDataToJSON(line) { function lineDataToJSON(line) {
var data = line.split('| '), var data = line.split('| '),
tokens = data[0].split(/\s+/), tokens = data[0].split(/\s+/),
ptrs = [], ptrs = [],
wCnt = parseInt(tokens[3], 16), wCnt = parseInt(tokens[3], 16),
synonyms = []; synonyms = [],
i;
for(var i = 0; i < wCnt; i++) { for(i = 0; i < wCnt; i++) {
synonyms.push(tokens[4 + i * 2]); synonyms.push(tokens[4 + i * 2]);
} }
var ptrOffset = (wCnt - 1) * 2 + 6; var ptrOffset = (wCnt - 1) * 2 + 6;
for(var i = 0; i < parseInt(tokens[ptrOffset], 10); i++) { for(i = 0; i < parseInt(tokens[ptrOffset], 10); i++) {
ptrs.push({ ptrs.push({
pointerSymbol: tokens[ptrOffset + 1 + i * 4], pointerSymbol: tokens[ptrOffset + 1 + i * 4],
synsetOffset: parseInt(tokens[ptrOffset + 2 + i * 4], 10), synsetOffset: parseInt(tokens[ptrOffset + 2 + i * 4], 10),
@ -51,10 +68,15 @@ function lineDataToJSON(line) {
}; };
} }
/**
* read data file at location (bound to a data file).
* Reads nominal length and checks for EOL. Continue reading until EOL.
*
* @param location {Number} - seek location
* @param callback {function} - callback function
*/
function readLocation(location, callback) { function readLocation(location, callback) {
//console.log('## read location ', this.fileName, location); //console.log('## read location ', this.fileName, location);
var var
file = this, file = this,
str = '', str = '',
@ -68,8 +90,6 @@ function readLocation(location, callback) {
return; return;
} }
//console.log(' read %d bytes at <%d>', count, location); //console.log(' read %d bytes at <%d>', count, location);
//console.log(str);
callback(null, lineDataToJSON(str)); callback(null, lineDataToJSON(str));
}); });
@ -77,10 +97,9 @@ function readLocation(location, callback) {
fs.read(file.fd, buffer, 0, len, pos, function (err, count) { fs.read(file.fd, buffer, 0, len, pos, function (err, count) {
str += buffer.toString('ascii'); str += buffer.toString('ascii');
var eol = str.indexOf('\n'); var eol = str.indexOf('\n');
//console.log(' -- read %d bytes at <%d>', count, pos, eol); //console.log(' -- read %d bytes at <%d>', count, pos, eol);
if (eol === -1 && len < file.maxLineLength) { if (eol === -1 && len < file.maxLineLength) {
// continue reading
return readChunk(pos + count, cb); return readChunk(pos + count, cb);
} }
@ -90,14 +109,19 @@ function readLocation(location, callback) {
} }
} }
/**
* main lookup function
*
* @param record {object} - record to lookup, obtained from index.find()
* @param callback{function} (optional) - callback function
* @returns {Promise}
*/
function lookup(record, callback) { function lookup(record, callback) {
var results = [], var results = [],
self = this, self = this,
offsets = record.synsetOffset; offsets = record.synsetOffset;
return new Promise(function(resolve, reject) { return new Promise(function(resolve, reject) {
//console.log('data lookup', record);
offsets offsets
.map(function (offset) { .map(function (offset) {
return _.partial(readLocation.bind(self), offset); return _.partial(readLocation.bind(self), offset);
@ -109,7 +133,6 @@ function lookup(record, callback) {
function done(lastResult) { function done(lastResult) {
closeFile(); closeFile();
//console.log('done promise -- ');
if (lastResult instanceof Error) { if (lastResult instanceof Error) {
callback && callback(lastResult, []); callback && callback(lastResult, []);
reject(lastResult); reject(lastResult);
@ -129,7 +152,6 @@ function lookup(record, callback) {
//console.log(' ... opening', self.filePath); //console.log(' ... opening', self.filePath);
self.fd = fs.openSync(self.filePath, 'r'); self.fd = fs.openSync(self.filePath, 'r');
} }
// ref count so we know when to close the main index file // ref count so we know when to close the main index file
++self.refcount; ++self.refcount;
return Promise.resolve(); return Promise.resolve();
@ -145,13 +167,17 @@ function lookup(record, callback) {
} }
} }
/**
* turn ordinary function into a promising one!
*
* @param collect {Array} - used to collect results
* @returns {Function}
*/
function promisifyInto(collect) { function promisifyInto(collect) {
return function(fn) { return function(fn) {
return function() { return function() {
return new Promise(function (resolve, reject) { return new Promise(function (resolve, reject) {
fn(function (error, result) { // Note callback signature! fn(function (error, result) { // Note: callback signature!
//console.log('cb from get', arguments)
if (error) { if (error) {
reject(error); reject(error);
} }
@ -166,7 +192,13 @@ function promisifyInto(collect) {
} }
/**
* DataFile class
*
* @param dictPath {string} - path to dict folder
* @param name {string} - POS name
* @constructor
*/
var DataFile = function(dictPath, name) { var DataFile = function(dictPath, name) {
this.dictPath = dictPath; this.dictPath = dictPath;
this.fileName = 'data.' + name; this.fileName = 'data.' + name;
@ -177,13 +209,23 @@ var DataFile = function(dictPath, name) {
this.refcount = 0; this.refcount = 0;
}; };
// maximum read length at a time /**
* maximum read length at a time
* @type {Number}
*/
var MAX_SINGLE_READ_LENGTH = 512; var MAX_SINGLE_READ_LENGTH = 512;
//DataFile.prototype.get = get; /**
* lookup
*/
DataFile.prototype.lookup = lookup; DataFile.prototype.lookup = lookup;
// e.g.: wc -L data.adv as of v3.1
/**
* maximum line length in each data file - used to optimize reads
*
* wc -L data.adv as of v3.1
*/
DataFile.MAX_LINE_LENGTH = { DataFile.MAX_LINE_LENGTH = {
noun: 12972, noun: 12972,
verb: 7713, verb: 7713,
@ -191,4 +233,5 @@ DataFile.MAX_LINE_LENGTH = {
adv: 638 adv: 638
}; };
module.exports = DataFile; module.exports = DataFile;
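The `promisifyInto` helper in the diff above is only partly visible, so the following is a sketch under assumptions rather than the file's actual code: the success branch (push into `collect`, then resolve) and the sequential `reduce` chain are guesses at the elided parts, and `fakeRead` is a hypothetical stand-in for `readLocation`.

```js
// Sketch only: the success branch (push + resolve) is an assumption.
function promisifyInto(collect) {
  return function (fn) {
    return function () {
      return new Promise(function (resolve, reject) {
        fn(function (error, result) {   // node-style (error, result) callback
          if (error) {
            reject(error);
          } else {
            collect.push(result);
            resolve(result);
          }
        });
      });
    };
  };
}

// Hypothetical callback-style reader standing in for readLocation().
function fakeRead(offset, callback) {
  setTimeout(function () {
    callback(null, { synsetOffset: offset });
  }, 0);
}

var results = [];
var tasks = [100, 200, 300]
  .map(function (offset) { return fakeRead.bind(null, offset); })
  .map(promisifyInto(results));

// Run the tasks one after another, then hand back what was collected.
var allDone = tasks
  .reduce(function (chain, task) { return chain.then(task); }, Promise.resolve())
  .then(function () { return results; });
```

Chaining with `reduce` keeps the reads in offset order, so `results` ends up in the same order as the input offsets.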


@ -6,6 +6,8 @@
* Copyright (c) 2012-2016 mooster@42at.com * Copyright (c) 2012-2016 mooster@42at.com
* https://github.com/moos/wordpos * https://github.com/moos/wordpos
* *
* Portions: Copyright (c) 2011, Chris Umbel
*
* Released under MIT license * Released under MIT license
*/ */
@ -16,6 +18,7 @@ var _ = require('underscore')._,
piper = require('./piper'), piper = require('./piper'),
KEY_LENGTH = 3; KEY_LENGTH = 3;
/** /**
* load fast index bucket data * load fast index bucket data
* *
@ -112,7 +115,7 @@ function find(search, callback) {
// pay the piper // pay the piper
this.piper(task, readIndexForKey, args, context, collector); this.piper(task, readIndexForKey, args, context, collector);
function collector(key, index, search, callback, buffer){ function collector(_key, index, search, callback, buffer){
var lines = buffer.toString().split('\n'), var lines = buffer.toString().split('\n'),
keys = lines.map(function(line){ keys = lines.map(function(line){
return line.substring(0,line.indexOf(' ')); return line.substring(0,line.indexOf(' '));
@ -136,21 +139,24 @@ function find(search, callback) {
* @param word {string} - search word * @param word {string} - search word
* @param callback {function} - callback function receives result * @param callback {function} - callback function receives result
* @returns none * @returns none
*
* Credit for this routine to https://github.com/NaturalNode/natural
*/ */
function lookup(word, callback) { function lookup(word, callback) {
var self = this; var self = this;
return new Promise(function(resolve, reject){ return new Promise(function(resolve, reject){
self.find(word, function (record) { self.find(word, function (record) {
var indexRecord = null; var indexRecord = null,
i;
if (record.status == 'hit') { if (record.status == 'hit') {
var ptrs = [], offsets = []; var ptrs = [], offsets = [];
for (var i = 0; i < parseInt(record.tokens[3]); i++) for (i = 0; i < parseInt(record.tokens[3]); i++)
ptrs.push(record.tokens[i]); ptrs.push(record.tokens[i]);
for (var i = 0; i < parseInt(record.tokens[2]); i++) for (i = 0; i < parseInt(record.tokens[2]); i++)
offsets.push(parseInt(record.tokens[ptrs.length + 6 + i], 10)); offsets.push(parseInt(record.tokens[ptrs.length + 6 + i], 10));
indexRecord = { indexRecord = {


@ -12,7 +12,6 @@
var _ = require('underscore')._, var _ = require('underscore')._,
util = require('util'), util = require('util'),
path = require('path'),
fs = require('fs'); fs = require('fs');
/** /**
@ -21,7 +20,7 @@ var _ = require('underscore')._,
* *
* @param task {string} - task name unique to method! * @param task {string} - task name unique to method!
* @param method {function} - method to execute, gets (args, ... , callback) * @param method {function} - method to execute, gets (args, ... , callback)
* @param args {array} - args to pass to method * @param args {Array} - args to pass to method
* @param context {object} - other params to remember and sent to callback * @param context {object} - other params to remember and sent to callback
* @param callback {function} - result callback * @param callback {function} - result callback
*/ */


@ -36,10 +36,10 @@ function makeRandX(pos){
callback = opts; callback = opts;
} }
index.rand(startsWith, count, function(record) { return index.rand(startsWith, count, function (record) {
args.push(record, startsWith); args.push(record, startsWith);
profile && args.push(new Date() - start); profile && args.push(new Date() - start);
callback.apply(null, args); callback && callback.apply(null, args);
}); });
}; };
} }
@ -50,6 +50,7 @@ function makeRandX(pos){
* @param startsWith {string} - get random word(s) that start with this, or '' * @param startsWith {string} - get random word(s) that start with this, or ''
* @param num {number} - number of words to return * @param num {number} - number of words to return
* @param callback {function} - callback function, receives words array and startsWith * @param callback {function} - callback function, receives words array and startsWith
* @returns Promise
*/ */
function rand(startsWith, num, callback){ function rand(startsWith, num, callback){
var self = this, var self = this,
@ -57,6 +58,8 @@ function rand(startsWith, num, callback){
trie = this.fastIndex.trie, trie = this.fastIndex.trie,
key, keys; key, keys;
return new Promise(function(resolve, reject) {
//console.log('-- ', startsWith, num, self.fastIndex.indexKeys.length); //console.log('-- ', startsWith, num, self.fastIndex.indexKeys.length);
if (startsWith) { if (startsWith) {
key = startsWith.slice(0, KEY_LENGTH); key = startsWith.slice(0, KEY_LENGTH);
@ -70,7 +73,7 @@ function rand(startsWith, num, callback){
if (!trie) { if (!trie) {
trie = new Trie(); trie = new Trie();
trie.addStrings(self.fastIndex.indexKeys); trie.addStrings(self.fastIndex.indexKeys);
this.fastIndex.trie = trie; self.fastIndex.trie = trie;
//console.log(' +++ Trie calc '); //console.log(' +++ Trie calc ');
} }
@ -87,11 +90,14 @@ function rand(startsWith, num, callback){
nextKey = _.last(keys); nextKey = _.last(keys);
} }
if (!key || !(key in self.fastIndex.offsets)) return process.nextTick(function(){ callback([], startsWith) }); if (!key || !(key in self.fastIndex.offsets)) {
callback && callback([], startsWith);
resolve([]);
return;
}
} else { } else {
// no startWith given - random select among keys // no startWith given - random select among keys
keys = _.sample( this.fastIndex.indexKeys, num ); keys = _.sample(self.fastIndex.indexKeys, num);
// if num > 1, run each key independently and collect results // if num > 1, run each key independently and collect results
if (num > 1) { if (num > 1) {
@ -100,59 +106,67 @@ function rand(startsWith, num, callback){
self.rand(startsWith, 1, function (result) { self.rand(startsWith, 1, function (result) {
results.push(result[0]); results.push(result[0]);
if (++ii == num) { if (++ii == num) {
callback(results, ''); callback && callback(results, '');
resolve(results);
} }
}) });
}); });
return; return;
} }
key = keys; key = keys;
} }
// console.log(' using key', key, nextKey);
// prepare the piper // prepare the piper
var args = [key, nextKey, this], var args = [key, nextKey, self],
task = 'rand:' + key + nextKey, task = 'rand:' + key + nextKey,
context = [startsWith, num, callback]; // last arg MUST be callback context = [startsWith, num, callback]; // last arg MUST be callback
// pay the piper // pay the piper
this.piper(task, IndexFile.readIndexBetweenKeys, args, context, collector); self.piper(task, IndexFile.readIndexBetweenKeys, args, context, collector);
function collector(key, nextKey, index, startsWith, num, callback, buffer) { function collector(key, nextKey, index, startsWith, num, callback, buffer) {
var lines = buffer.toString().split('\n'), var lines = buffer.toString().split('\n'),
matches = lines.map(function (line) { matches = lines.map(function (line) {
return line.substring(0, line.indexOf(' ')); return line.substring(0, line.indexOf(' '));
}); });
//console.log(' got lines for key ', key, lines.length); //console.log(' got lines for key ', key, lines.length);
// we got bunch of matches for key - now search within for startsWith // we got bunch of matches for key - now search within for startsWith
if (startsWith !== key) { if (startsWith !== key) {
// binary search for startsWith within set of matches // binary search for startsWith within set of matches
var ind = _.sortedIndex(matches, startsWith); var ind = _.sortedIndex(matches, startsWith);
if (ind >= lines.length || matches[ind].indexOf(startsWith) === -1) { if (ind >= lines.length || matches[ind].indexOf(startsWith) === -1) {
return callback([], startsWith); callback && callback([], startsWith);
resolve([]);
return;
} }
// FIXME --- using Trie's new keysWithPrefix not yet pushed to npm.
// see https://github.com/NaturalNode/natural/commit/5fc86c42e41c1314bfc6a37384dd14acf5f4bb7b
var trie = new Trie(); var trie = new Trie();
trie.addStrings(matches); trie.addStrings(matches);
//console.log('Trie > ', trie.matchesWithPrefix( startsWith )); //console.log('Trie > ', trie.matchesWithPrefix( startsWith ));
matches = trie.keysWithPrefix(startsWith); matches = trie.keysWithPrefix(startsWith);
} }
var words = _.sample(matches, num); var words = _.sample(matches, num);
callback(words, startsWith); callback && callback(words, startsWith);
resolve(words);
} }
}); // Promise
} }
// relative weight of each POS word count (DB 3.1 numbers)
var POS_factor = {
Noun: 26,
Verb: 3,
Adjective: 5,
Adverb: 1,
Total: 37
};
/** /**
* rand() - for all Index files * rand() - for all Index files
* @returns Promise
*/ */
function randAll(opts, callback) { function randAll(opts, callback) {
var var
@ -163,12 +177,7 @@ function randAll(opts, callback) {
count = opts && opts.count || 1, count = opts && opts.count || 1,
args = [null, startsWith], args = [null, startsWith],
parts = 'Noun Verb Adjective Adverb'.split(' '), parts = 'Noun Verb Adjective Adverb'.split(' '),
self = this, self = this;
done = function(){
profile && (args.push(new Date() - start));
args[0] = results;
callback.apply(null, args)
};
if (typeof opts === 'function') { if (typeof opts === 'function') {
callback = opts; callback = opts;
@ -176,15 +185,20 @@ function randAll(opts, callback) {
opts = _.clone(opts); opts = _.clone(opts);
} }
// TODO -- or loop count times each time getting 1 from random part!!
// slower but more random.
// select at random a part to look at return new Promise(function(resolve, reject) {
// select at random a POS to look at
var doParts = _.sample(parts, parts.length); var doParts = _.sample(parts, parts.length);
tryPart(); tryPart();
function tryPart() { function tryPart() {
var rand = 'rand' + doParts.pop(); var part = doParts.pop(),
rand = 'rand' + part,
factor = POS_factor[part],
weight = factor / POS_factor.Total;
// pick count according to relative weight
opts.count = Math.ceil(count * weight * 1.1); // guard against dupes
self[rand](opts, partCallback); self[rand](opts, partCallback);
} }
@ -193,19 +207,23 @@ function randAll(opts, callback) {
results = _.uniq(results.concat(result)); // make sure it's unique! results = _.uniq(results.concat(result)); // make sure it's unique!
} }
//console.log(result);
if (results.length < count && doParts.length) { if (results.length < count && doParts.length) {
// reduce count for next part -- NO! may get duplicates
// opts.count = count - results.length;
return tryPart(); return tryPart();
} }
// trim excess // final random and trim excess
if (results.length > count) { results = _.sample(results, count);
results.length = count;
}
done(); done();
} }
function done() {
profile && (args.push(new Date() - start));
args[0] = results;
callback && callback.apply(null, args);
resolve(results);
}
}); // Promise
} }
/** /**
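To make the weighting in `tryPart()` concrete, here is a small sketch that applies the same formula outside the library. The helper name `countFor` is illustrative; the weights and the 1.1 padding factor are copied from the diff above.

```js
// Relative weights copied from POS_factor in the diff above.
var POS_factor = { Noun: 26, Verb: 3, Adjective: 5, Adverb: 1, Total: 37 };

// Same formula as in tryPart(): scale by weight, pad 10% against duplicates.
function countFor(part, count) {
  var weight = POS_factor[part] / POS_factor.Total;
  return Math.ceil(count * weight * 1.1);
}

['Noun', 'Verb', 'Adjective', 'Adverb'].forEach(function (part) {
  console.log(part, countFor(part, 10));
});
// Noun 8
// Verb 1
// Adjective 2
// Adverb 1
```

For a request of 10 words the per-POS counts add up to 12, which is why the final step samples the combined results back down to `count`.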


@ -1,4 +1,4 @@
/** /*!
* wordpos.js * wordpos.js
* *
* Node.js part-of-speech utilities using WordNet database. * Node.js part-of-speech utilities using WordNet database.
@ -149,11 +149,11 @@ function get(isFn) {
}; };
} }
// setImmediate executes callback AFTER promise handlers.
// Without it, exceptions in callback may be caught by Promise.
function nextTick(fn, args) { function nextTick(fn, args) {
if (fn) { if (fn) {
setImmediate(function(){
fn.apply(null, args); fn.apply(null, args);
});
} }
} }
@ -216,7 +216,7 @@ var wordposProto = WordPOS.prototype;
* lookup a word in all indexes * lookup a word in all indexes
* *
* @param word {string} - search word * @param word {string} - search word
* @param callback {Functino} (optional) - callback with (results, word) signature * @param callback {Function} (optional) - callback with (results, word) signature
* @returns {Promise} * @returns {Promise}
*/ */
wordposProto.lookup = function(word, callback) { wordposProto.lookup = function(word, callback) {
@ -362,7 +362,17 @@ wordposProto.getVerbs = get('isVerb');
wordposProto.parse = prepText; wordposProto.parse = prepText;
/**
* access to WordNet DB
* @type {object}
*/
WordPOS.WNdb = WNdb; WordPOS.WNdb = WNdb;
/**
* access to stopwords
* @type {Array}
*/
WordPOS.stopwords = stopwords; WordPOS.stopwords = stopwords;
module.exports = WordPOS; module.exports = WordPOS;

40
test.js

@ -1,40 +0,0 @@
var
WordPOS = require('./src/wordpos'),
wordpos = new WordPOS({profile: true}),
getAllPOS = wordpos.getPOS
;
console.log(1111,
wordpos.lookup('foot')
//wordpos.getPOS('was doing the work the ashtray closer Also known as inject and foldl, reduce boils down a list of values into a single value', console.log
.then(function(result){
console.log(' xxx - ', result)
})
.catch(function(result){
console.log(' error xxx - ', result)
}));
//wordpos.rand({count: 3},console.log)
return;
//getAllPOS('se', console.log)
wordpos.getPOS('se', console.log)
a=wordpos.getPOS('se', function(res) {
console.log(1, res)
wordpos.getPOS('sea hey who work', function(res) {
console.log(2, res)
wordpos.getPOS('sear done work ', function(res) {
console.log(3, res)
console.log('all done');
});
});
});
console.log(a)


@ -1,11 +1,11 @@
/** /**
* wordpos_spec.js * wordpos_test.js
* *
* test file for main wordpos functionality * test file for main wordpos functionality
* *
* Usage: * Usage:
* npm install mocha -g * npm install mocha -g
* mocha wordpos_spec.js --verbose * mocha wordpos_test.js
* *
* or * or
* *
@ -388,4 +388,29 @@ describe('Promise pattern', function() {
assert.equal(result, true); assert.equal(result, true);
}); });
}); });
it('rand()', function () {
return wordpos.rand({count: 5}).then(function (result) {
assert.equal(result.length, 5);
});
});
it('randNoun()', function () {
return wordpos.randNoun().then(function (result) {
assert.equal(result.length, 1);
});
});
it('randNoun({count: 3})', function () {
return wordpos.randNoun({count: 3}).then(function (result) {
assert.equal(result.length, 3);
});
});
it('randNoun({startsWith: "foo"})', function () {
return wordpos.randNoun({startsWith: 'foo'}).then(function (result) {
assert.equal(result.length, 1);
assert.equal(result[0].indexOf('foo'), 0);
});
});
}); });


@ -40,7 +40,7 @@
* read index file between the two offsets * read index file between the two offsets
* binary search read data O(log avg) * binary search read data O(log avg)
* *
* Copyright (c) 2012 mooster@42at.com * Copyright (c) 2012-2016 mooster@42at.com
* https://github.com/moos/wordpos * https://github.com/moos/wordpos
* *
* Released under MIT license * Released under MIT license
@ -48,7 +48,7 @@
var var
WNdb = require('../src/wordpos').WNdb, WNdb = require('../src/wordpos').WNdb,
util = require('util'), util = require('util'),
BufferedReader = require ("./buffered-reader"), BufferedReader = require ('./buffered-reader'),
_ = require('underscore')._, _ = require('underscore')._,
fs = require('fs'), fs = require('fs'),
path = require('path'), path = require('path'),


@ -6,7 +6,7 @@
* Usage: * Usage:
* node validate index.adv * node validate index.adv
* *
* Copyright (c) 2012 mooster@42at.com * Copyright (c) 2012-2016 mooster@42at.com
* https://github.com/moos/wordpos * https://github.com/moos/wordpos
* *
* Released under MIT license * Released under MIT license