diff --git a/.travis.yml b/.travis.yml index b7484b6..4cbb75e 100644 --- a/.travis.yml +++ b/.travis.yml @@ -3,5 +3,3 @@ node_js: - '5' - '4' - '0.12' -before_script: - - npm install -g jasmine-node \ No newline at end of file diff --git a/README.md b/README.md index 097c301..f5785ff 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,7 @@ wordpos wordpos is a set of *fast* part-of-speech (POS) utilities for Node.js using fast lookup in the WordNet database. -Version 1.x is a mojor update with no direct depedence on [natural's](http://github.com/NaturalNode/natural), with support for Promises, and roughly 5x speed improvement over previous version. +Version 1.x is a major update with no direct dependence on [natural's](http://github.com/NaturalNode/natural), with support for [Promises](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise), and roughly 5x speed improvement over previous version. **CAUTION** The WordNet database [wordnet-db](https://github.com/moos/wordnet-db) comprises [155,287 words](http://wordnet.princeton.edu/wordnet/man/wnstats.7WN.html) (3.0 numbers) which uncompress to over **30 MB** of data in several *un*[browserify](https://github.com/substack/node-browserify)-able files. It is *not* meant for the browser environment. @@ -104,7 +104,7 @@ wordpos.getPOS(text, callback) -- callback receives a result object: ``` If you're only interested in a certain POS (say, adjectives), using the particular getX() is faster -than getPOS() which looks up the word in all index files. [stopwords](https://github.com/moos/wordpos/lib/natural/util/stopwords.js)are stripped out from text before lookup. +than getPOS() which looks up the word in all index files. [stopwords](lib/natural/util/stopwords.js) are stripped out from text before lookup. If `text` is an *array*, all words are looked-up -- no deduplication, stopword filtering or tokenization is applied. @@ -127,8 +127,7 @@ wordpos.getPOS('The angry bear chased the frightened little squirrel.', console. } ``` -This has no relation to correct grammar of given sentence, where here only 'bear' and 'squirrel' -would be considered nouns. +This has no relation to correct grammar of given sentence, where here only 'bear' and 'squirrel' would be considered nouns. #### isNoun(word, callback) #### isVerb(word, callback) @@ -228,7 +227,33 @@ Access the array of stopwords. ## Promises -TODO +As of v1.0, all `get`, `is`, `rand`, and `lookup` methods return a standard ES6 [Promise](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise). + +```js +wordpos.isVerb('fish').then(console.log); +// true +``` + +Compound, with error handler: + +```js +wordpos.isVerb('fish') + .then(console.log) + .then(doSomethingElse) + .catch(console.error); +``` + +Callbacks, if given, are executed _before_ the Promise is resolved. + +```js +wordpos.isVerb('fish', console.log) + .then(console.log) + .catch(console.error); +// true 'fish' 13 +// true +``` +Note that callback receives full arguments (including profile, if enabled), while the Promise receives only the result of the call. Also, beware that exceptions in the _callback_ will result in the Promise being _rejected_ and caught by `catch()`, if provided. + ## Fast Index @@ -236,7 +261,7 @@ Version 0.1.4 introduces `fastIndex` option. This uses a secondary index on the Fast index improves performance **30x** over Natural's native methods. See blog article [Optimizing WordPos](http://blog.42at.com/optimizing-wordpos). -As of version 1.0, the fast index option is always on and cannot be turned off. +As of version 1.0, fast index is always on and cannot be turned off. ## Command-line: CLI @@ -245,73 +270,15 @@ For CLI usage and examples, see [bin/README](bin). ## Benchmark -Note: `wordpos-bench.js` requires a [forked uubench](https://github.com/moos/uubench) module. - - cd bench - node wordpos-bench.js - - -512-word corpus (< v0.1.4, comparable to Natural) : -``` - getPOS : 0 ops/s { iterations: 1, elapsed: 9039 } - getNouns : 0 ops/s { iterations: 1, elapsed: 2347 } - getVerbs : 0 ops/s { iterations: 1, elapsed: 2434 } - getAdjectives : 1 ops/s { iterations: 1, elapsed: 1698 } - getAdverbs : 0 ops/s { iterations: 1, elapsed: 2698 } -done in 20359 msecs -``` - -512-word corpus (as of v0.1.4, with fastIndex) : -``` - getPOS : 18 ops/s { iterations: 1, elapsed: 57 } - getNouns : 48 ops/s { iterations: 1, elapsed: 21 } - getVerbs : 125 ops/s { iterations: 1, elapsed: 8 } - getAdjectives : 111 ops/s { iterations: 1, elapsed: 9 } - getAdverbs : 143 ops/s { iterations: 1, elapsed: 7 } -done in 1375 msecs -``` - -220 words are looked-up (less stopwords and duplicates) on a win7/64-bit/dual-core/3GHz. getPOS() is slowest as it searches through all four index files. - -### Version 1.0 Benchmark - -Re-run v0.1.16: -``` - getPOS : 11 ops/s { iterations: 1, elapsed: 90 } - getNouns : 21 ops/s { iterations: 1, elapsed: 47 } - getVerbs : 53 ops/s { iterations: 1, elapsed: 19 } - getAdjectives : 29 ops/s { iterations: 1, elapsed: 34 } - getAdverbs : 83 ops/s { iterations: 1, elapsed: 12 } - lookup : 1 ops/s { iterations: 1, elapsed: 720 } - lookupNoun : 1 ops/s { iterations: 1, elapsed: 676 } - -looked up 220 words -done in 2459 msecs -``` - -V1.0: -``` - getPOS : 14 ops/s { iterations: 1, elapsed: 73 } - getNouns : 26 ops/s { iterations: 1, elapsed: 38 } - getVerbs : 42 ops/s { iterations: 1, elapsed: 24 } - getAdjectives : 24 ops/s { iterations: 1, elapsed: 42 } - getAdverbs : 26 ops/s { iterations: 1, elapsed: 38 } - lookup : 6 ops/s { iterations: 1, elapsed: 159 } - lookupNoun : 13 ops/s { iterations: 1, elapsed: 77 } - -looked up 221 words -done in 1274 msecs -``` -That's roughly **2x** better across the board. Functions that read the data files see much improved performance: `lookup` about **5x** and `lookupNoun` over **8x**. - +See [benchmark](benchmark/README). ## Changes -1.0.1 - - Removed direct dependency on Natural. Certain modules are included in /lib. - - Add support for Promises. - - Improved data file reads for up to **5x** performance increase. - - Tests are now mocha-based with assert interface. +1.0.0 + - Removed npm dependency on Natural. Certain modules are included in /lib. + - Add support for ES6 Promises. + - Improved data file reads for up to **5x** performance increase compared to previous version. + - Tests are now [mocha](https://mochajs.org/)-based with [chai](http://chaijs.com/) assert interface. 0.1.16 - Changed dependency to wordnet-db (renamed from WNdb) diff --git a/bench/README.md b/bench/README.md new file mode 100644 index 0000000..e658ce5 --- /dev/null +++ b/bench/README.md @@ -0,0 +1,80 @@ +## Benchmark + +```bash + cd bench + node wordpos-bench.js +``` + +### Version 1.0 Benchmark + +The following benchmarks were run on a Win8.1/Core i7/3.5GHz machine on a Seagate 500GB SATA II, 7200 RPM disk. The corpus was a 512-word text, with stopwords and duplicates removed, resulting in 220 words looked-up. + +#### Pre v0.14 (comparable to Natural) +``` + getPOS : 1 ops/s { iterations: 1, elapsed: 1514 } + getNouns : 2 ops/s { iterations: 1, elapsed: 409 } + getVerbs : 2 ops/s { iterations: 1, elapsed: 418 } + getAdjectives : 3 ops/s { iterations: 1, elapsed: 332 } + getAdverbs : 4 ops/s { iterations: 1, elapsed: 272 } + lookup : 1 ops/s { iterations: 1, elapsed: 1981 } + lookupNoun : 0 ops/s { iterations: 1, elapsed: 2016 } + +looked up 220 words +done in 7770 msecs +``` + +#### v0.1.16 (with fastIndex): +``` + getPOS : 11 ops/s { iterations: 1, elapsed: 90 } + getNouns : 21 ops/s { iterations: 1, elapsed: 47 } + getVerbs : 53 ops/s { iterations: 1, elapsed: 19 } + getAdjectives : 29 ops/s { iterations: 1, elapsed: 34 } + getAdverbs : 83 ops/s { iterations: 1, elapsed: 12 } + lookup : 1 ops/s { iterations: 1, elapsed: 720 } + lookupNoun : 1 ops/s { iterations: 1, elapsed: 676 } + +looked up 220 words +done in 2459 msecs +``` + +#### v1.0: +``` + getPOS : 14 ops/s { iterations: 1, elapsed: 73 } + getNouns : 26 ops/s { iterations: 1, elapsed: 38 } + getVerbs : 42 ops/s { iterations: 1, elapsed: 24 } + getAdjectives : 24 ops/s { iterations: 1, elapsed: 42 } + getAdverbs : 26 ops/s { iterations: 1, elapsed: 38 } + lookup : 6 ops/s { iterations: 1, elapsed: 159 } + lookupNoun : 13 ops/s { iterations: 1, elapsed: 77 } + +looked up 221 words +done in 1274 msecs +``` + +These are **3.5x** better compared to v0.1.16 and **15x** better compared to pre v0.14, overall. Functions that read the data files see much improved performance: `lookup` about **13x** and `lookupNoun` **26x** compared to pre v0.14. + + +### Old benchmark + +512-word corpus (< v0.1.4, comparable to Natural) : +``` + getPOS : 0 ops/s { iterations: 1, elapsed: 9039 } + getNouns : 0 ops/s { iterations: 1, elapsed: 2347 } + getVerbs : 0 ops/s { iterations: 1, elapsed: 2434 } + getAdjectives : 1 ops/s { iterations: 1, elapsed: 1698 } + getAdverbs : 0 ops/s { iterations: 1, elapsed: 2698 } +done in 20359 msecs +``` + +512-word corpus (as of v0.1.4, with fastIndex) : +``` + getPOS : 18 ops/s { iterations: 1, elapsed: 57 } + getNouns : 48 ops/s { iterations: 1, elapsed: 21 } + getVerbs : 125 ops/s { iterations: 1, elapsed: 8 } + getAdjectives : 111 ops/s { iterations: 1, elapsed: 9 } + getAdverbs : 143 ops/s { iterations: 1, elapsed: 7 } +done in 1375 msecs +``` + +220 words are looked-up (less stopwords and duplicates) on a win7/64-bit/dual-core/3GHz. getPOS() is slowest as it searches through all four index files. + diff --git a/bench/wordpos-bench.js b/bench/wordpos-bench.js index 715cb6c..87361eb 100644 --- a/bench/wordpos-bench.js +++ b/bench/wordpos-bench.js @@ -1,15 +1,23 @@ +/** + * wordpos-bench.js + * + * Copyright (c) 2012-2016 mooster@42at.com + * https://github.com/moos/wordpos + * + * Released under MIT license + */ -var uubench = require('uubench'), // from: https://github.com/moos/uubench +var Bench = require('mini-bench'), fs = require('fs'), _ = require('underscore')._, WordPOS = require('../src/wordpos'), wordpos = new WordPOS(); -suite = new uubench.Suite({ +suite = new Bench.Suite({ type: 'fixed', iterations: 1, - sync: true, // important! + async: false, // important! start: function(tests){ console.log('starting %d tests', tests.length); @@ -110,6 +118,7 @@ suite.section('--512 words--', function(next){ suite.options.iterations = 1; next(); }); + suite.bench('getPOS', getPOS); suite.bench('getNouns', getNouns); suite.bench('getVerbs', getVerbs); @@ -118,6 +127,4 @@ suite.bench('getAdverbs', getAdverbs); suite.bench('lookup', lookup); suite.bench('lookupNoun', lookupNoun); - - suite.run(); diff --git a/package.json b/package.json index 9d54917..58b4d53 100644 --- a/package.json +++ b/package.json @@ -1,7 +1,15 @@ { "name": "wordpos", "author": "Moos ", - "keywords": ["natural", "language", "wordnet", "adjectives", "nouns", "adverbs", "verbs"], + "keywords": [ + "natural", + "language", + "wordnet", + "adjectives", + "nouns", + "adverbs", + "verbs" + ], "description": "wordpos is a set of part-of-speech utilities for Node.js using the WordNet database.", "version": "1.0.0-RC1", "homepage": "https://github.com/moos/wordpos", @@ -10,18 +18,18 @@ }, "bin": "./bin/wordpos-cli.js", "dependencies": { + "commander": "^2.0.0", "underscore": ">=1.3.1", - "wordnet-db": "latest", - "commander": "^2.0.0" + "wordnet-db": "latest" }, "devDependencies": { - "uubench": "git://github.com/moos/uubench.git", + "mini-bench": "^1.0.0", "chai": "*", "mocha": "*" }, - "repository" : { - "type" : "git", - "url" : "git://github.com/moos/wordpos.git" + "repository": { + "type": "git", + "url": "git://github.com/moos/wordpos.git" }, "main": "./src/wordpos.js", "scripts": { diff --git a/src/dataFile.js b/src/dataFile.js index 63a870c..fa01208 100644 --- a/src/dataFile.js +++ b/src/dataFile.js @@ -1,24 +1,41 @@ +/*! + * dataFile.js + * + * Copyright (c) 2012-2016 mooster@42at.com + * https://github.com/moos/wordpos + * + * Portions: Copyright (c) 2011, Chris Umbel + * + * Released under MIT license + */ var fs = require('fs'), path = require('path'), _ = require('underscore'); -// courtesy of natural.WordNet -// TODO link +/** + * parse a single data file line, returning data object + * + * @param line {string} - a single line from WordNet data file + * @returns {object} + * + * Credit for this routine to https://github.com/NaturalNode/natural + */ function lineDataToJSON(line) { var data = line.split('| '), tokens = data[0].split(/\s+/), ptrs = [], wCnt = parseInt(tokens[3], 16), - synonyms = []; + synonyms = [], + i; - for(var i = 0; i < wCnt; i++) { + for(i = 0; i < wCnt; i++) { synonyms.push(tokens[4 + i * 2]); } var ptrOffset = (wCnt - 1) * 2 + 6; - for(var i = 0; i < parseInt(tokens[ptrOffset], 10); i++) { + for(i = 0; i < parseInt(tokens[ptrOffset], 10); i++) { ptrs.push({ pointerSymbol: tokens[ptrOffset + 1 + i * 4], synsetOffset: parseInt(tokens[ptrOffset + 2 + i * 4], 10), @@ -51,10 +68,15 @@ function lineDataToJSON(line) { }; } - +/** + * read data file at location (bound to a data file). + * Reads nominal length and checks for EOL. Continue reading until EOL. + * + * @param location {Number} - seek location + * @param callback {function} - callback function + */ function readLocation(location, callback) { //console.log('## read location ', this.fileName, location); - var file = this, str = '', @@ -68,8 +90,6 @@ function readLocation(location, callback) { return; } //console.log(' read %d bytes at <%d>', count, location); - //console.log(str); - callback(null, lineDataToJSON(str)); }); @@ -77,10 +97,9 @@ function readLocation(location, callback) { fs.read(file.fd, buffer, 0, len, pos, function (err, count) { str += buffer.toString('ascii'); var eol = str.indexOf('\n'); - //console.log(' -- read %d bytes at <%d>', count, pos, eol); - if (eol === -1 && len < file.maxLineLength) { + // continue reading return readChunk(pos + count, cb); } @@ -90,14 +109,19 @@ function readLocation(location, callback) { } } +/** + * main lookup function + * + * @param record {object} - record to lookup, obtained from index.find() + * @param callback{function} (optional) - callback function + * @returns {Promise} + */ function lookup(record, callback) { var results = [], self = this, offsets = record.synsetOffset; return new Promise(function(resolve, reject) { - //console.log('data lookup', record); - offsets .map(function (offset) { return _.partial(readLocation.bind(self), offset); @@ -109,7 +133,6 @@ function lookup(record, callback) { function done(lastResult) { closeFile(); - //console.log('done promise -- '); if (lastResult instanceof Error) { callback && callback(lastResult, []); reject(lastResult); @@ -129,7 +152,6 @@ function lookup(record, callback) { //console.log(' ... opening', self.filePath); self.fd = fs.openSync(self.filePath, 'r'); } - // ref count so we know when to close the main index file ++self.refcount; return Promise.resolve(); @@ -145,13 +167,17 @@ function lookup(record, callback) { } } - +/** + * turn ordinary function into a promising one! + * + * @param collect {Array} - used to collect results + * @returns {Function} + */ function promisifyInto(collect) { return function(fn) { return function() { return new Promise(function (resolve, reject) { - fn(function (error, result) { // Note callback signature! - //console.log('cb from get', arguments) + fn(function (error, result) { // Note: callback signature! if (error) { reject(error); } @@ -166,7 +192,13 @@ function promisifyInto(collect) { } - +/** + * DataFile class + * + * @param dictPath {string} - path to dict folder + * @param name {string} - POS name + * @constructor + */ var DataFile = function(dictPath, name) { this.dictPath = dictPath; this.fileName = 'data.' + name; @@ -177,13 +209,23 @@ var DataFile = function(dictPath, name) { this.refcount = 0; }; -// maximum read length at a time +/** + * maximum read length at a time + * @type {Number} + */ var MAX_SINGLE_READ_LENGTH = 512; -//DataFile.prototype.get = get; +/** + * lookup + */ DataFile.prototype.lookup = lookup; -// e.g.: wc -L data.adv as of v3.1 + +/** + * maximum line length in each data file - used to optimize reads + * + * wc -L data.adv as of v3.1 + */ DataFile.MAX_LINE_LENGTH = { noun: 12972, verb: 7713, @@ -191,4 +233,5 @@ DataFile.MAX_LINE_LENGTH = { adv: 638 }; + module.exports = DataFile; diff --git a/src/indexFile.js b/src/indexFile.js index 1e7ed06..413e6d6 100644 --- a/src/indexFile.js +++ b/src/indexFile.js @@ -6,6 +6,8 @@ * Copyright (c) 2012-2016 mooster@42at.com * https://github.com/moos/wordpos * + * Portions: Copyright (c) 2011, Chris Umbel + * * Released under MIT license */ @@ -16,6 +18,7 @@ var _ = require('underscore')._, piper = require('./piper'), KEY_LENGTH = 3; + /** * load fast index bucket data * @@ -112,7 +115,7 @@ function find(search, callback) { // pay the piper this.piper(task, readIndexForKey, args, context, collector); - function collector(key, index, search, callback, buffer){ + function collector(_key, index, search, callback, buffer){ var lines = buffer.toString().split('\n'), keys = lines.map(function(line){ return line.substring(0,line.indexOf(' ')); @@ -136,21 +139,24 @@ function find(search, callback) { * @param word {string} - search word * @param callback {function} - callback function receives result * @returns none + * + * Credit for this routine to https://github.com/NaturalNode/natural */ function lookup(word, callback) { var self = this; return new Promise(function(resolve, reject){ self.find(word, function (record) { - var indexRecord = null; + var indexRecord = null, + i; if (record.status == 'hit') { var ptrs = [], offsets = []; - for (var i = 0; i < parseInt(record.tokens[3]); i++) + for (i = 0; i < parseInt(record.tokens[3]); i++) ptrs.push(record.tokens[i]); - for (var i = 0; i < parseInt(record.tokens[2]); i++) + for (i = 0; i < parseInt(record.tokens[2]); i++) offsets.push(parseInt(record.tokens[ptrs.length + 6 + i], 10)); indexRecord = { diff --git a/src/piper.js b/src/piper.js index a5c18b5..4645d9e 100644 --- a/src/piper.js +++ b/src/piper.js @@ -12,7 +12,6 @@ var _ = require('underscore')._, util = require('util'), - path = require('path'), fs = require('fs'); /** @@ -21,7 +20,7 @@ var _ = require('underscore')._, * * @param task {string} - task name unique to method! * @param method {function} - method to execute, gets (args, ... , callback) - * @param args {array} - args to pass to method + * @param args {Array} - args to pass to method * @param context {object} - other params to remember and sent to callback * @param callback {function} - result callback */ diff --git a/src/rand.js b/src/rand.js index ec29047..fd94cd3 100644 --- a/src/rand.js +++ b/src/rand.js @@ -36,10 +36,10 @@ function makeRandX(pos){ callback = opts; } - index.rand(startsWith, count, function(record) { + return index.rand(startsWith, count, function (record) { args.push(record, startsWith); profile && args.push(new Date() - start); - callback.apply(null, args); + callback && callback.apply(null, args); }); }; } @@ -50,6 +50,7 @@ function makeRandX(pos){ * @param startsWith {string} - get random word(s) that start with this, or '' * @param num {number} - number of words to return * @param callback {function} - callback function, receives words array and startsWith + * @returns Promise */ function rand(startsWith, num, callback){ var self = this, @@ -57,102 +58,115 @@ function rand(startsWith, num, callback){ trie = this.fastIndex.trie, key, keys; - //console.log('-- ', startsWith, num, self.fastIndex.indexKeys.length); - if (startsWith){ - key = startsWith.slice(0, KEY_LENGTH); + return new Promise(function(resolve, reject) { - /** - * if key is 'a' or 'ab' (<3 chars), search for ALL keys starting with that. - */ - if (key.length < KEY_LENGTH) { + //console.log('-- ', startsWith, num, self.fastIndex.indexKeys.length); + if (startsWith) { + key = startsWith.slice(0, KEY_LENGTH); - // calc trie if haven't done so yet - if (!trie){ - trie = new Trie(); - trie.addStrings(self.fastIndex.indexKeys); - this.fastIndex.trie = trie; - //console.log(' +++ Trie calc '); + /** + * if key is 'a' or 'ab' (<3 chars), search for ALL keys starting with that. + */ + if (key.length < KEY_LENGTH) { + + // calc trie if haven't done so yet + if (!trie) { + trie = new Trie(); + trie.addStrings(self.fastIndex.indexKeys); + self.fastIndex.trie = trie; + //console.log(' +++ Trie calc '); + } + + try { + // trie throws if not found!!!!! + keys = trie.keysWithPrefix(startsWith); + } catch (e) { + keys = []; + } + + // read all keys then select random word. + // May be large disk read! + key = keys[0]; + nextKey = _.last(keys); } - try{ - // trie throws if not found!!!!! - keys = trie.keysWithPrefix( startsWith ); - } catch(e){ - keys = []; + if (!key || !(key in self.fastIndex.offsets)) { + callback && callback([], startsWith); + resolve([]); } - // read all keys then select random word. - // May be large disk read! - key = keys[0]; - nextKey = _.last(keys); + } else { + // no startWith given - random select among keys + keys = _.sample(self.fastIndex.indexKeys, num); + + // if num > 1, run each key independently and collect results + if (num > 1) { + var results = [], ii = 0; + _(keys).each(function (startsWith) { + self.rand(startsWith, 1, function (result) { + results.push(result[0]); + if (++ii == num) { + callback && callback(results, ''); + resolve(results); + } + }); + }); + return; + } + key = keys; } - if (!key || !(key in self.fastIndex.offsets)) return process.nextTick(function(){ callback([], startsWith) }); + // prepare the piper + var args = [key, nextKey, self], + task = 'rand:' + key + nextKey, + context = [startsWith, num, callback]; // last arg MUST be callback - } else { - // no startWith given - random select among keys - keys = _.sample( this.fastIndex.indexKeys, num ); + // pay the piper + self.piper(task, IndexFile.readIndexBetweenKeys, args, context, collector); - // if num > 1, run each key independently and collect results - if (num > 1){ - var results = [], ii = 0; - _(keys).each(function(startsWith){ - self.rand(startsWith, 1, function(result){ - results.push(result[0]); - if (++ii == num) { - callback(results, ''); - } - }) - }); - return; - } - key = keys; - } -// console.log(' using key', key, nextKey); + function collector(key, nextKey, index, startsWith, num, callback, buffer) { + var lines = buffer.toString().split('\n'), + matches = lines.map(function (line) { + return line.substring(0, line.indexOf(' ')); + }); + //console.log(' got lines for key ', key, lines.length); - // prepare the piper - var args = [key, nextKey, this], - task = 'rand:' + key + nextKey, - context = [startsWith, num, callback]; // last arg MUST be callback + // we got bunch of matches for key - now search within for startsWith + if (startsWith !== key) { + // binary search for startsWith within set of matches + var ind = _.sortedIndex(matches, startsWith); + if (ind >= lines.length || matches[ind].indexOf(startsWith) === -1) { + callback && callback([], startsWith); + resolve([]); + return; + } - // pay the piper - this.piper(task, IndexFile.readIndexBetweenKeys, args, context, collector); - - function collector(key, nextKey, index, startsWith, num, callback, buffer){ - var lines = buffer.toString().split('\n'), - matches = lines.map(function(line){ - return line.substring(0,line.indexOf(' ')); - }); - - //console.log(' got lines for key ', key, lines.length); - - // we got bunch of matches for key - now search within for startsWith - if (startsWith !== key){ - - // binary search for startsWith within set of matches - var ind = _.sortedIndex(matches, startsWith); - if (ind >= lines.length || matches[ind].indexOf(startsWith) === -1){ - return callback([], startsWith); + var trie = new Trie(); + trie.addStrings(matches); + //console.log('Trie > ', trie.matchesWithPrefix( startsWith )); + matches = trie.keysWithPrefix(startsWith); } - // FIXME --- using Trie's new keysWithPrefix not yet pushed to npm. - // see https://github.com/NaturalNode/natural/commit/5fc86c42e41c1314bfc6a37384dd14acf5f4bb7b - - var trie = new Trie(); - - trie.addStrings(matches); - //console.log('Trie > ', trie.matchesWithPrefix( startsWith )); - - matches = trie.keysWithPrefix( startsWith ); + var words = _.sample(matches, num); + callback && callback(words, startsWith); + resolve(words); } - var words = _.sample(matches, num); - callback(words, startsWith); - } + }); // Promise } +// relative weight of each POS word count (DB 3.1 numbers) +var POS_factor = { + Noun: 26, + Verb: 3, + Adjective: 5, + Adverb: 1, + Total: 37 +}; + /** * rand() - for all Index files + * @returns Promise */ function randAll(opts, callback) { var @@ -163,12 +177,7 @@ function randAll(opts, callback) { count = opts && opts.count || 1, args = [null, startsWith], parts = 'Noun Verb Adjective Adverb'.split(' '), - self = this, - done = function(){ - profile && (args.push(new Date() - start)); - args[0] = results; - callback.apply(null, args) - }; + self = this; if (typeof opts === 'function') { callback = opts; @@ -176,36 +185,45 @@ function randAll(opts, callback) { opts = _.clone(opts); } - // TODO -- or loop count times each time getting 1 from random part!! - // slower but more random. - // select at random a part to look at - var doParts = _.sample(parts, parts.length); - tryPart(); + return new Promise(function(resolve, reject) { + // select at random a POS to look at + var doParts = _.sample(parts, parts.length); + tryPart(); - function tryPart(){ - var rand = 'rand' + doParts.pop(); - self[ rand ](opts, partCallback); - } + function tryPart() { + var part = doParts.pop(), + rand = 'rand' + part, + factor = POS_factor[part], + weight = factor / POS_factor.Total; - function partCallback(result){ - if (result) { - results = _.uniq(results.concat(result)); // make sure it's unique! + // pick count according to relative weight + opts.count = Math.ceil(count * weight * 1.1); // guard against dupes + self[rand](opts, partCallback); } - //console.log(result); - if (results.length < count && doParts.length) { - // reduce count for next part -- NO! may get duplicates - // opts.count = count - results.length; - return tryPart(); + function partCallback(result) { + if (result) { + results = _.uniq(results.concat(result)); // make sure it's unique! + } + + if (results.length < count && doParts.length) { + return tryPart(); + } + + // final random and trim excess + results = _.sample(results, count); + done(); } - // trim excess - if (results.length > count) { - results.length = count; + function done() { + profile && (args.push(new Date() - start)); + args[0] = results; + callback && callback.apply(null, args); + resolve(results); } - done(); - } + + }); // Promise } /** diff --git a/src/wordpos.js b/src/wordpos.js index 666c21a..29a286f 100644 --- a/src/wordpos.js +++ b/src/wordpos.js @@ -1,4 +1,4 @@ -/** +/*! * wordpos.js * * Node.js part-of-speech utilities using WordNet database. @@ -149,11 +149,11 @@ function get(isFn) { }; } +// setImmediate executes callback AFTER promise handlers. +// Without it, exceptions in callback may be caught by Promise. function nextTick(fn, args) { if (fn) { - setImmediate(function(){ - fn.apply(null, args); - }); + fn.apply(null, args); } } @@ -216,7 +216,7 @@ var wordposProto = WordPOS.prototype; * lookup a word in all indexes * * @param word {string} - search word - * @param callback {Functino} (optional) - callback with (results, word) signature + * @param callback {Function} (optional) - callback with (results, word) signature * @returns {Promise} */ wordposProto.lookup = function(word, callback) { @@ -362,7 +362,17 @@ wordposProto.getVerbs = get('isVerb'); wordposProto.parse = prepText; +/** + * access to WordNet DB + * @type {object} + */ WordPOS.WNdb = WNdb; + +/** + * access to stopwords + * @type {Array} + */ WordPOS.stopwords = stopwords; + module.exports = WordPOS; diff --git a/test.js b/test.js deleted file mode 100644 index 45a6324..0000000 --- a/test.js +++ /dev/null @@ -1,40 +0,0 @@ -var - WordPOS = require('./src/wordpos'), - wordpos = new WordPOS({profile: true}), - getAllPOS = wordpos.getPOS - ; - - -console.log(1111, - wordpos.lookup('foot') - //wordpos.getPOS('was doing the work the ashtray closer Also known as inject and foldl, reduce boils down a list of values into a single value', console.log - .then(function(result){ - console.log(' xxx - ', result) - }) - .catch(function(result){ - console.log(' error xxx - ', result) - })); - -//wordpos.rand({count: 3},console.log) - -return; - - -//getAllPOS('se', console.log) -wordpos.getPOS('se', console.log) - - - - - a=wordpos.getPOS('se', function(res) { - console.log(1, res) - wordpos.getPOS('sea hey who work', function(res) { - console.log(2, res) - wordpos.getPOS('sear done work ', function(res) { - console.log(3, res) - console.log('all done'); - }); - }); - }); - - console.log(a) \ No newline at end of file diff --git a/test/validate_test.js b/test/validate_test.js index df42d61..f463d91 100644 --- a/test/validate_test.js +++ b/test/validate_test.js @@ -1,7 +1,7 @@ /** * validate_test.js * - * Run validate on all four main index files + * Run validate on all four main index files * * Copyright (c) 2012-2016 mooster@42at.com * https://github.com/moos/wordpos diff --git a/test/wordpos_test.js b/test/wordpos_test.js index b349f83..9eb1fce 100644 --- a/test/wordpos_test.js +++ b/test/wordpos_test.js @@ -1,11 +1,11 @@ /** - * wordpos_spec.js + * wordpos_test.js * - * test file for main wordpos functionality + * test file for main wordpos functionality * * Usage: * npm install mocha -g - * mocha wordpos_spec.js --verbose + * mocha wordpos_test.js * * or * @@ -388,4 +388,29 @@ describe('Promise pattern', function() { assert.equal(result, true); }); }); + + it('rand()', function () { + return wordpos.rand({count: 5}).then(function (result) { + assert.equal(result.length, 5); + }); + }); + + it('randNoun()', function () { + return wordpos.randNoun().then(function (result) { + assert.equal(result.length, 1); + }); + }); + + it('randNoun({count: 3})', function () { + return wordpos.randNoun({count: 3}).then(function (result) { + assert.equal(result.length, 3); + }); + }); + + it('randNoun({startsWith: "foo"})', function () { + return wordpos.randNoun({startsWith: 'foo'}).then(function (result) { + assert.equal(result.length, 1); + assert.equal(result[0].indexOf('foo'), 0); + }); + }); }); \ No newline at end of file diff --git a/tools/stat.js b/tools/stat.js index 77c0922..e045ae7 100644 --- a/tools/stat.js +++ b/tools/stat.js @@ -1,7 +1,7 @@ /** * stat.js * - * generate fast index for WordNet index files + * generate fast index for WordNet index files * * Usage: * node stat [--no-stats] index.adv ... @@ -40,7 +40,7 @@ * read index file between the two offsets * binary search read data O(log avg) * - * Copyright (c) 2012 mooster@42at.com + * Copyright (c) 2012-2016 mooster@42at.com * https://github.com/moos/wordpos * * Released under MIT license @@ -48,7 +48,7 @@ var WNdb = require('../src/wordpos').WNdb, util = require('util'), - BufferedReader = require ("./buffered-reader"), + BufferedReader = require ('./buffered-reader'), _ = require('underscore')._, fs = require('fs'), path = require('path'), diff --git a/tools/validate.js b/tools/validate.js index e9fa3f0..f29cc8c 100644 --- a/tools/validate.js +++ b/tools/validate.js @@ -1,12 +1,12 @@ /** * validate.js * - * read each index. file, and look up using wordpos and confirm find all words + * read each index. file, and look up using wordpos and confirm find all words * * Usage: * node validate index.adv * - * Copyright (c) 2012 mooster@42at.com + * Copyright (c) 2012-2016 mooster@42at.com * https://github.com/moos/wordpos * * Released under MIT license