@tbrianjones
Created August 7, 2014 23:06
A Whitespace Tokenizer for PHP-NLP-TOOLS that creates specified ngrams ( unigrams, bigrams, trigrams, etc. )
<?php
namespace NlpTools\Tokenizers;

/**
 * Whitespace phrase tokenizer.
 * Splits on every run of whitespace or control characters, then
 * builds n-grams of 1 up to $n words.
 */
class WhitespacePhraseTokenizer implements TokenizerInterface
{
    const PATTERN = '/[\pZ\pC]+/u';

    private $n = 1; // maximum phrase length in words (1 = unigrams only)

    public function set_n( $n )
    {
        $this->n = $n;
    }

    public function tokenize( $str )
    {
        // generate unigrams; a limit of -1 means "no limit"
        // (passing null here is deprecated as of PHP 8.1)
        $unigrams = preg_split( self::PATTERN, $str, -1, PREG_SPLIT_NO_EMPTY );
        $num_unigrams = count( $unigrams );

        // generate bigrams, trigrams, ... up to $this->n words
        $ngrams = array();
        for( $n = 2; $n <= $this->n; $n++ ) {
            // slide a window of $n words across the text
            for( $i = 0; $i <= $num_unigrams - $n; $i++ ) {
                $ngram = array();
                for( $key = $i; $key < $i + $n; $key++ ) {
                    $ngram[] = $unigrams[$key];
                }
                $ngrams[] = implode( ' ', $ngram );
            }
        }

        // unigrams come first, followed by the longer n-grams
        return array_merge( $unigrams, $ngrams );
    }
}
@arogozin

function __construct($n = 1) { $this->n = $n; }

With a constructor you won't need to call set_n() manually.
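
To make the suggestion concrete, here is a minimal standalone sketch of the tokenizer with that constructor in place. The class name WhitespacePhraseTokenizerSketch is hypothetical, and the NlpTools namespace and TokenizerInterface are left out only so the snippet runs on its own; array_slice() stands in for the innermost loop of the original and is a substitution, not the gist author's code.

```php
<?php
// Standalone sketch (hypothetical class name). The real class lives in
// NlpTools\Tokenizers and implements TokenizerInterface; both are omitted
// here so the example has no dependencies.
class WhitespacePhraseTokenizerSketch
{
    const PATTERN = '/[\pZ\pC]+/u';

    private $n; // maximum phrase length in words

    // The constructor replaces set_n(); the default of 1 yields unigrams only.
    public function __construct($n = 1)
    {
        $this->n = $n;
    }

    public function tokenize($str)
    {
        $unigrams = preg_split(self::PATTERN, $str, -1, PREG_SPLIT_NO_EMPTY);
        $count = count($unigrams);

        // Start with the unigrams, then append bigrams, trigrams, and so on.
        $ngrams = $unigrams;
        for ($n = 2; $n <= $this->n; $n++) {
            for ($i = 0; $i <= $count - $n; $i++) {
                // array_slice() takes the $n consecutive words starting at $i
                $ngrams[] = implode(' ', array_slice($unigrams, $i, $n));
            }
        }
        return $ngrams;
    }
}

$tokenizer = new WhitespacePhraseTokenizerSketch(2);
print_r($tokenizer->tokenize('the quick brown fox'));
// Tokens, in order: the, quick, brown, fox, the quick, quick brown, brown fox
```

The returned array mixes all phrase lengths, with unigrams first, so a caller that wants only bigrams would need to filter the result by word count.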
